A curious case of memory dump diagnostics: How Stackify can cause trouble for your site

It’s been a while since my last blog post, and this time we’ll have some fun with WinDbg. One of our DXC customers has been having a problem with instances of their site hanging and restarting. We at Episerver take the availability and performance of DXC sites very seriously, so I was asked to help diagnose the problem. It’s something I’d like to do and learn from anyway, so the game was on.

The gold standard for looking into these kinds of problems is still the tried and trusted WinDbg. With help from the Managed Services Engineer, I was able to access several memory dumps from the site. Most of them were taken when the site stalled, which is usually the most useful moment because the dump should contain the most critical information about why the stall happened.

As usual, it’s just a matter of firing up WinDbg, pressing Ctrl+D to open the memory dump and start debugging, then loading some essential extensions, and everything is ready.
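For the record, the warm-up on a .NET Framework dump usually looks something like this (your symbol setup may differ); .loadby sos clr loads the SOS extension matching the CLR that was loaded in the dump:

.symfix
.reload
.loadby sos clr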

A quick check confirms that there was a high CPU usage situation:

!threadpool
CPU utilization: 81%
Worker Thread: Total: 51 Running: 33 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 454

But now comes the important part: we don’t know which thread is causing the high CPU usage. Yes, we can always go with !threads or !runaway, walk through all the stack traces, and guess (roughly the sequence sketched below).
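For reference, that usually means something like the following: !runaway lists the accumulated user-mode CPU time per thread, ~Ns switches the debugger to thread N (115 here is just an arbitrary example), and !clrstack dumps that thread’s managed stack.

!runaway
~115s
!clrstack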

It’s just hard to make a good guess, especially when there is a garbage collection in progress in most of the memory dumps, but not all. When there is a GC, the GC suspends all other threads while it marks and sweeps the “dead” objects, and that can slow down the instance greatly; Azure AutoHeal then sees that the instance is not responding and restarts it. But what we really need to explain is why a Gen 2 GC happened, and how to prevent it from happening again.

And what about the memory dumps where there was no GC?

It might feel easy after reading some guides on finding a memory leak with WinDbg. It can be easy, well, on a fairly small application. But we are talking about memory dumps 6–10 GB in size, with no clear indication of what could possibly be wrong. True, I am not a master of WinDbg, at least not yet, but if there is a list of things that make me want to pull my hair out, this is at the very top of it. There have been times when I thought I was this close to revealing the truth, only to be proven wrong.

But as they say, “every cloud has a silver lining”. It came to my attention that this stack trace appears consistently in some threads:

System.Diagnostics.Tracing.EventSource.DispatchToAllListeners(Int32, System.Guid*, System.Diagnostics.Tracing.EventWrittenEventArgs)+4a
System.Diagnostics.Tracing.EventSource.WriteToAllListeners(Int32, System.Guid*, Int32, EventData*)+16f
System.Diagnostics.Tracing.EventSource.WriteEventWithRelatedActivityIdCore(Int32, System.Guid*, Int32, EventData*)+766
System.Diagnostics.Tracing.EventSource.WriteEvent(Int32, System.String)+97
Microsoft.ApplicationInsights.Web.SessionTelemetryInitializer.GetSessionContextFromUserCookie(System.Web.HttpCookie, Microsoft.ApplicationInsights.DataContracts.RequestTelemetry)+33
Microsoft.ApplicationInsights.Web.SessionTelemetryInitializer.OnInitializeTelemetry(System.Web.HttpContext, Microsoft.ApplicationInsights.DataContracts.RequestTelemetry, Microsoft.ApplicationInsights.Channel.ITelemetry)+ad
Microsoft.ApplicationInsights.Web.Implementation.WebTelemetryInitializerBase.Initialize(Microsoft.ApplicationInsights.Channel.ITelemetry)+85
Microsoft.ApplicationInsights.TelemetryClient.Initialize(Microsoft.ApplicationInsights.Channel.ITelemetry)+256
Microsoft.ApplicationInsights.DependencyCollector.Implementation.ClientServerDependencyTracker.BeginTracking(Microsoft.ApplicationInsights.TelemetryClient)+1ee
Microsoft.ApplicationInsights.DependencyCollector.Implementation.ProfilerSqlProcessingBase.OnBegin(System.Object)+18f
Microsoft.Diagnostics.Instrumentation.Extensions.Intercept.Functions+<>c__DisplayClass1_0.b__0(System.Object)+1e

But it’s the framework (Application Insights), so it can’t be the problem, right? Or can it? After reviewing several memory dumps, my suspicion grew: this appeared in almost every memory dump I had. There might really be something to it.

If we look at the topmost frame, DispatchToAllListeners, it seems possible that we have a very long list of “listeners”, and that walking them takes time. A peek into the framework source code seems to confirm my suspicion:

private unsafe void DispatchToAllListeners(int eventId, Guid* childActivityID, EventWrittenEventArgs eventCallbackArgs)
        {
            Exception lastThrownException = null;
            for (EventDispatcher dispatcher = m_Dispatchers; dispatcher != null; dispatcher = dispatcher.m_Next)
            {
                // Dispatch the event to this listener; any exception it throws
                // is captured into lastThrownException (body elided here).
            }

            if (lastThrownException != null)
            {
                throw new EventSourceException(lastThrownException);
            }
        }

So each EventSource has a field of type EventDispatcher named m_Dispatchers. As the code above suggests, the dispatchers actually form a linked list, with each instance pointing to the next one. That means if the linked list is long enough, this method, DispatchToAllListeners, can take a long time for every single event, even becoming a bottleneck.
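To make the point concrete, here is a minimal sketch (not the framework code; Dispatcher and Dispatch are just stand-ins) of why a long chain hurts: the cost of writing a single event grows linearly with the number of dispatchers that have ever been attached.

using System;
using System.Diagnostics;

class Dispatcher
{
    public Dispatcher Next;
    public void Dispatch(string message) { /* deliver the event to one listener */ }
}

class Program
{
    static void Main()
    {
        // Build a chain a few thousand dispatchers long, similar to what we will find in the dump.
        Dispatcher head = null;
        for (int i = 0; i < 7000; i++)
            head = new Dispatcher { Next = head };

        var sw = Stopwatch.StartNew();
        // Every single event has to visit every node, just like DispatchToAllListeners does.
        for (var d = head; d != null; d = d.Next)
            d.Dispatch("hello");
        sw.Stop();

        Console.WriteLine($"One event dispatched to the whole chain in {sw.Elapsed.TotalMilliseconds} ms");
    }
}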

The next question is to find out whether our theory holds up. Let’s check how many EventDispatcher instances we have:

0:115> !dumpheap -type EventDispatcher -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff9d3c28308        1           48 Microsoft.ServiceBus.Tracing.EventDispatcher
00007ff9d4ae8cc0   177825      8535600 System.Diagnostics.Tracing.EventDispatcher
Total 177826 objects

Oops!

That does not look good, and it kind of reaffirms our suspicion. But we still need to find out which code caused this high number of EventDispatcher instances. Let’s just pick one of them and explore:

0:115> !DumpHeap /d -mt 00007ff9d4ae8cc0
         Address               MT     Size
  
00000244309ba368 00007ff9d4ae8cc0       48     
User interrupt.

Using !do (an alias of !DumpObj) to see what’s in there:

0:115> !DumpObj /d 00000244305aa758
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 00000244305aa670 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 00000244305aa6a0 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 000002473197b630 m_Next

We can see the address of the next dispatcher in the list and, most importantly, of the EventListener. But first, let’s see where m_Next leads us:

0:115> !DumpObj /d 000002473197b630
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 000002473197b548 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 000002473197b578 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000024631edea68 m_Next
0:115> !DumpObj /d 0000024631edea68
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 0000024631ede980 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 0000024631ede9b0 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000024631edf240 m_Next
0:115> !DumpObj /d 0000024631edf240
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 0000024631edf140 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 0000024631edf188 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000024531331bc0 m_Next
0:115> !DumpObj /d 0000024531331bc0
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 0000024531331ac8 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 0000024531331b08 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000024531305fc8 m_Next
0:115> !DumpObj /d 0000024531305fc8
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 0000024531305ee0 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 0000024531305f10 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000000000000000 m_Next

So there are only 5 more EventDispatcher instances down the chain. But the first dispatcher we picked might not be the start of the list. It’s not easy to find out which dispatcher points to it, but we can find out which objects hold references to it with !gcroot (man, that takes a loooooooooong time):

0:115> !gcroot 00000244305aa758
Thread 1428:
    00000072157bebb0 00007ff9d38358d6 System.Net.TimerThread.ThreadProc()
        rdi:  (interior)
            ->  000002482f6697f0 System.Object[]
            ->  0000024732924120 System.Net.WebRequest+DelEtwFireBeginWRGet
            ->  000002442f664158 System.Diagnostics.Tracing.FrameworkEventSource
            ->  0000024461f2bd58 System.Diagnostics.Tracing.EventDispatcher
            ->  00000245764e59b0 System.Diagnostics.Tracing.EventDispatcher
...very, very long list

This one has about 7,112 dispatchers in the chain. Eww! Any event ever written to FrameworkEventSource has to walk through that very, very long list. For the purpose of logging, that is way too much!

Let’s get back to the EventListener to see if there is something interesting there. Dumping that object gives us this:

0:115> !DumpObj /d 00000244305aa670
Name:        StackifyHttpTracer.EventListener
MethodTable: 00007ff9d28460f8
EEClass:     00007ff9d282b850
Size:        48(0x30) bytes
File:        D:\local\Temporary ASP.NET Files\root\125ea09c\39af438d\assembly\dl3\93224559\0052a77a_8212d201\StackifyHttpTracer.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9d580a5e8  40016a3        8 ...tArgs, mscorlib]]  0 instance 0000000000000000 _EventSourceCreated
0000000000000000  40016a4       10                       0 instance 0000000000000000 EventWritten
00007ff9cfc6a118  40016a5       18 ...ing.EventListener  0 instance 000002473197b548 m_Next
00007ff9d4ae9960  40016a6       20 ...ng.ActivityFilter  0 instance 0000000000000000 m_activityFilter
00007ff9cf9e67a0  40016a2      cb8        System.Object  0   shared           static s_EventSourceCreatedLock
                                 >> Domain:Value  000002442b6da6a0:000002442f651c98 000002442d7fd6c0:000002442f664378 000002488b1ebbb0:00000247311c68d0 <<
00007ff9cfc6a118  40016a7      cc0 ...ing.EventListener  0   shared           static s_Listeners
                                 >> Domain:Value  000002442b6da6a0:0000000000000000 000002442d7fd6c0:0000024461f2bc58 000002488b1ebbb0:0000000000000000 <<
00007ff9cfc77fd8  40016a8      cc8 ...rence, mscorlib]]  0   shared           static s_EventSources
                                 >> Domain:Value  000002442b6da6a0:000002442f651cb0 000002442d7fd6c0:000002442f664390 000002488b1ebbb0:00000247311c68e8 <<
00007ff9cf9ecd30  40016a9      ee8       System.Boolean  1   shared           static s_CreatingListener
                                 >> Domain:Value  000002442b6da6a0:0 000002442d7fd6c0:0 000002488b1ebbb0:0 <<
00007ff9cf9ecd30  40016aa      ee9       System.Boolean  1   shared           static s_EventSourceShutdownRegistered
                                 >> Domain:Value  000002442b6da6a0:1 000002442d7fd6c0:1 000002488b1ebbb0:1 <<
00007ff9cfb48830  4000006       10          System.Guid  1   static 000002482f6a6f90 tplGuid
00007ff9cfb48830  4000007       18          System.Guid  1   static 000002482f6a6f98 aspnetGuid

Wait, what, StackifyHttpTracer.EventListener? It is not part of the Episerver Framework, nor of DXC. It is not even part of Azure. A quick Google search shows no results for it, but searching for StackifyHttpTracer reveals that it is part of Stackify, a profiler. Looking at the customer’s web.config confirms that StackifyHttpModule is present:

<remove name="StackifyHttpModule_Net40" />
<add name="StackifyHttpModule_Net40" type="StackifyHttpTracer.StackifyHttpModule,StackifyHttpTracer, Version=1.0.0.0, Culture=neutral, PublicKeyToken=93c44ce23f2048dd" preCondition="managedHandler,runtimeVersionv4.0" />
<remove name="TelemetryCorrelationHttpModule" />

But how does that relate to the high number of dispatchers we have seen? Using dotPeek to look into StackifyHttpModule, I got the answer:

  public class StackifyHttpModule : IHttpModule
  {
    private static readonly DateTime _Epoch = new DateTime(1970, 1, 1, 0, 0, 0, 0, DateTimeKind.Utc);
    private ModuleSettings _moduleSettings = new ModuleSettings();
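    // The smoking gun: a new EventListener for every module instance.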
    private EventListener el = new EventListener();

It’s easy to think that you can have only one instance of your IHttpModule in your website (I did), but the memory dump says otherwise. ASP.NET actually creates a set of module instances for each HttpApplication instance it pools, so new module instances keep appearing over time, even though exactly when is hard to predict:

0:115> !dumpheap -type StackifyHttpTracer.StackifyHttpModule -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff9d28465f0      305         9760 StackifyHttpTracer.StackifyHttpModule
Total 305 objects

So that’s it. We have 305 instances of StackifyHttpTracer.StackifyHttpModule, and each of them has its own instance of StackifyHttpTracer.EventListener. Worse, many of the modules might be long gone, but their EventListeners stay, because each EventListener registers itself in EventSource’s static listener list when it is constructed and is never collected until it is disposed.

0:115> !dumpheap -type StackifyHttpTracer.EventListener -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff9d28460f8     7112       341376 StackifyHttpTracer.EventListener
Total 7112 objects

This number of EventListeners matches the roughly 7,112 EventDispatchers we counted earlier in the FrameworkEventSource chain.

Oops!

That explains why we have that many instances of EventDispatcher in place.
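If you want to reproduce the behavior outside of a memory dump, here is a minimal sketch (NoisyListener is a made-up name, not Stackify’s code): constructing EventListener instances and immediately dropping the references still leaves them attached, because the base class keeps track of every listener in its static state until Dispose is called.

using System;
using System.Diagnostics.Tracing;

class NoisyListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Subscribe to everything, as a tracer typically would.
        EnableEvents(eventSource, EventLevel.LogAlways);
    }
}

class Program
{
    static void Main()
    {
        for (int i = 0; i < 300; i++)
        {
            // The reference is dropped immediately, but the base EventListener
            // constructor has already registered the listener with EventSource's
            // static bookkeeping, so it is never collected and every event
            // dispatch now has one more stop to make.
            new NoisyListener();
        }
    }
}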

The solution in this case was simply to remove StackifyHttpModule from web.config. I would not go as far as calling it a bug in Stackify, but it seems to me that the EventListener field should have been static instead of a per-instance field. Then again, I’m not sure what they are trying to achieve there.
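Purely to illustrate that suggestion (MyTracerModule and MyEventListener are made-up names, not Stackify’s actual code), a module with a shared, static listener would look roughly like this:

using System.Diagnostics.Tracing;
using System.Web;

public class MyEventListener : EventListener
{
    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Forward the event to whatever tracing backend you use.
    }
}

public class MyTracerModule : IHttpModule
{
    // Static: shared by every pooled HttpApplication and its module instances,
    // so only one listener is ever registered with EventSource.
    private static readonly MyEventListener Listener = new MyEventListener();

    public void Init(HttpApplication context) { }

    public void Dispose() { }
}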

The result? The customer deployed a new version that removed StackifyHttpModule, and their average server response time dropped from around 130 ms (and, most importantly, steadily increasing, which makes sense given that the list of EventDispatchers kept growing) to around 80 ms (and stable). Some of their most frequently called methods became 4 to 5 times faster. True, there might be other improvements in their new version, but I really think removing the module had the biggest impact.

My takeaways from the story:

  • Sometimes it’s hard to find the answer. Just keep looking and leave no stone unturned.
  • This is why you should use Episerver DXC: you get early alerts on any anomalies, and you get expert help to solve the problem.
