PVS target device becoming unresponsive (full cache)

I recently ran into unresponsive terminal servers, and I want to show you how I went about troubleshooting the issue.

A few facts about the environment:

Let me describe the problem in more detail. The VDAs reboot daily, and after a reboot everything is fine. After some time (sometimes after one hour, sometimes after 10 hours, and sometimes never) a VDA becomes unresponsive.

Unresponsive means that the sessions on the servers were corrupted or stuck and everything was very slow. Sometimes the machine hung completely; sometimes I could still work on it, but only very slowly. Most of the time it wasn’t possible to log on, regardless of whether I used a local or a domain account. Citrix Director showed no data for the machine.

After a few days of trying, I was finally able to log on to an affected machine. I opened Explorer to check the drives because I know this behavior from situations where C:\ or D:\ is (almost) full. After three hours the Explorer window appeared and showed that D:\ was full. The large file was vdiskdif.vhdx.

I verified this finding with our monitoring solution.
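If you have no monitoring in place, a check of this kind can be sketched in a few lines of Python. This is only a minimal sketch and not what our monitoring solution does internally: the function name cache_report, the 90 % threshold, and the assumption that vdiskdif.vhdx sits in the root of the cache drive are mine.

```python
import os
import shutil


def cache_report(cache_dir, cache_file="vdiskdif.vhdx", warn_ratio=0.9):
    """Return usage figures for the drive that holds the PVS write cache.

    cache_dir  -- directory holding the cache file, e.g. "D:\\" (assumption)
    warn_ratio -- used/total ratio above which the drive counts as nearly full
    """
    usage = shutil.disk_usage(cache_dir)
    path = os.path.join(cache_dir, cache_file)
    # The cache file may be absent; report 0 bytes instead of failing.
    cache_bytes = os.path.getsize(path) if os.path.isfile(path) else 0
    used_ratio = usage.used / usage.total if usage.total else 0.0
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "cache_bytes": cache_bytes,
        "used_ratio": used_ratio,
        "nearly_full": used_ratio >= warn_ratio,
    }
```

On a target device you would call cache_report("D:\\") from a scheduled task and raise an alert when nearly_full is true, ideally well before the machine stops responding.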

At this point, I knew that something was filling up my vDisk cache, but I didn’t know what. In theory, I needed a way to review the reads and writes to C:. I didn’t know of a suitable tool for that. Process Monitor can do it, but it is very resource-intensive and might crash the machine. And even if it worked, what should I look for? I can’t run Process Monitor on 800 machines in the hope of (maybe) getting a good hint.

After a machine became unresponsive, I captured a full memory dump. I’m not experienced with memory dumps, but I found a way to get the information without that experience.

  1. First of all, you need WinDbg (Windows Debugger). The easiest way is to install WinDbg Preview from the Microsoft Store.
  2. Set up the symbol server. An easy way is to set the environment variable _NT_SYMBOL_PATH=SRV*C:\symbols*http://msdl.microsoft.com/download/symbols. The good thing is that other tools also use that environment variable.
  3. Download a WinDbg extension: DbgKit from Andrey Bazhan (the website is offline, so here is my mirror link).
  4. Start WinDbg and open the memory dump (File > Open dump file).
  5. Load the debugger extension with: .load DbgKit\x64\dbgkit64.dll (you have to adjust the path).
  6. As soon as the command input is available, type: !dbgkit.mm
  7. A separate window appears. It takes a few minutes to load, so grab a coffee.
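If you have several dumps to look at, steps 2 and 4 to 6 can be prepared by a small script. The following Python sketch only builds the symbol path and a debugger command line. It assumes the -z (open dump file) and -c (run commands at startup) options of the classic debuggers such as windbg.exe; the executable name, the dump path, and the DbgKit path are placeholders you have to adjust, and I have not verified how your WinDbg installation handles these options.

```python
import os


def symbol_path(cache_dir=r"C:\symbols",
                server="http://msdl.microsoft.com/download/symbols"):
    """Build the SRV*<local cache>*<symbol server> string from step 2."""
    return f"SRV*{cache_dir}*{server}"


def debugger_command(dump_file, dbgkit_dll, debugger="windbg.exe"):
    """Return an argument list that opens a dump and runs steps 5 and 6.

    -z opens a dump file and -c runs the given commands at startup.
    dump_file, dbgkit_dll and debugger are placeholders to adjust.
    """
    startup = f".load {dbgkit_dll}; !dbgkit.mm"
    return [debugger, "-z", dump_file, "-c", startup]


def debugger_environment(base_env=None):
    """Copy an environment and add _NT_SYMBOL_PATH for the debugger process."""
    env = dict(os.environ if base_env is None else base_env)
    env["_NT_SYMBOL_PATH"] = symbol_path()
    return env
```

On Windows, the two results can be passed to subprocess.run(debugger_command(...), env=debugger_environment()). Setting the variable only for the debugger process leaves the machine-wide environment untouched, which is handy on a shared analysis workstation.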

The File Summary tab is the most useful one for spotting the problem. I had a few memory dumps (about five), and I saw two different patterns.

DbgKit - File Summary - .ost File

That’s very interesting. An .ost file is only created when Outlook runs in offline (cached) mode. Usually, you enable online mode in a non-persistent environment (I know there are exceptions to that). We had migrated to Microsoft Office 2016 two weeks earlier and had not enforced online mode via GPO. After this finding, we enforced it.

DbgKit - File Summary - OneNote cache files

OneNote creates a lot of binary cache files (the screenshots show just a few). There is no way to disable that. There are a few articles out there that describe how to use symlinks to improve the situation. I do see some drawbacks to this approach, but that’s another topic.
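To cross-check such a pattern on a machine that still responds, you can sum the file sizes per extension below a directory, for example a user profile folder. This Python sketch is a complement to the memory dump analysis, not what DbgKit does; the function names are my own, and walking a large directory tree causes I/O itself, so use it with care on a struggling machine.

```python
import os
from collections import Counter


def size_by_extension(root):
    """Sum file sizes (bytes) per lowercase extension below root."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                totals[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Files can vanish or be locked while we walk; skip them.
                continue
    return totals


def largest_extensions(root, top=5):
    """Return the `top` extensions by total size, largest first."""
    return size_by_extension(root).most_common(top)
```

If .ost or OneNote cache files dominate the result of largest_extensions(r"C:\Users"), you are most likely looking at the same two patterns described here.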

After we addressed these two issues, we didn’t have any more unresponsive machines. I have tried to show a simple way of troubleshooting this problem. I hope this short article helps you if you run into the same problem. Please let me know if you have any comments, tips, or similar for me.

Happy troubleshooting.