It's been a while since my last blog post, and this time it will be fun with WinDbg. One of our DXC customers had a problem with instances of their site hanging and restarting. We at Episerver take the availability and performance of DXC sites very seriously, so I was asked to help diagnose the problem. It's something I'd like to do and learn anyway, so the game was on.
The gold standard for looking at this kind of problem is still the tried and trusted WinDbg. With help from the Managed Services Engineer, I was able to access several memory dumps from the site. Most of them were taken when the site stalled, which is usually the most useful moment, because that is when the dump should contain the most critical information about why the stall happened.
As usual, it's just a matter of firing up WinDbg, pressing Ctrl+D to open the memory dump, loading some essential extensions, and everything is ready.
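The exact steps vary with the setup, but for a full .NET Framework dump "loading some essential extensions" usually boils down to something like this:

```
$$ load SOS from the same folder as the CLR captured in the dump
.loadby sos clr
$$ point the symbol path at the Microsoft symbol server and reload symbols
.symfix; .reload
```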
A quick check confirmed that there was indeed a high CPU situation:
```
!threadpool
CPU utilization: 81%
Worker Thread: Total: 51 Running: 33 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 454
```
But now comes the important part: we don't know which thread is causing the high CPU usage. Yes, we can always go with !threads or !runaway and work through all the stack traces and guess.
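For reference, this is the usual round trip (commands only; the output obviously depends on the dump):

```
$$ list the managed threads
!threads
$$ user-mode CPU time per thread
!runaway
$$ switch to a suspicious thread by its OS thread id, then dump its managed stack
~~[<os thread id>]s
!clrstack
```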
It's just hard to make a good guess, especially when there is a garbage collection in progress in most of the memory dumps, but not all of them. When a GC runs, the GC suspends all other threads while it marks and sweeps the "dead" objects, and that can slow the instance down considerably; Azure AutoHeal then sees that the instance is not responding and restarts it. But what we really need to explain is why a Gen 2 GC happened in the first place, and how to prevent it from happening again.
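As a side note, a quick way to get a feel for the GC pressure in a dump (a general tip, not output from this particular case) is to look at the heap itself:

```
$$ size of each generation and of the large object heap, per GC heap
!eeheap -gc
```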
And what about the memory dumps where there was no GC?
It might feel easy to read some guides on finding a memory leak with WinDbg. And it can be easy, on a fairly small application. But we are talking about memory dumps 6-10 GB in size, with no clear indication of what could possibly be wrong. True, I am not a master of WinDbg, at least not yet, but if there is a list of things that really make me pull my hair out, this is at the very top of it. There were times when I thought I was this close to revealing the truth, only to be proven wrong.
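The usual starting point is still the heap statistics; !dumpheap sorts its summary by total size, so the heaviest types end up at the bottom (again, general commands rather than output from this dump):

```
$$ object count and total size per type
!dumpheap -stat
$$ restrict to objects large enough to live on the large object heap
!dumpheap -stat -min 85000
```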
But as they say, "every cloud has a silver lining". It came to my attention that this stack trace appears consistently in a number of threads:
```
System.Diagnostics.Tracing.EventSource.DispatchToAllListeners(Int32, System.Guid*, System.Diagnostics.Tracing.EventWrittenEventArgs)+4a
System.Diagnostics.Tracing.EventSource.WriteToAllListeners(Int32, System.Guid*, Int32, EventData*)+16f
System.Diagnostics.Tracing.EventSource.WriteEventWithRelatedActivityIdCore(Int32, System.Guid*, Int32, EventData*)+766
System.Diagnostics.Tracing.EventSource.WriteEvent(Int32, System.String)+97
Microsoft.ApplicationInsights.Web.SessionTelemetryInitializer.GetSessionContextFromUserCookie(System.Web.HttpCookie, Microsoft.ApplicationInsights.DataContracts.RequestTelemetry)+33
Microsoft.ApplicationInsights.Web.SessionTelemetryInitializer.OnInitializeTelemetry(System.Web.HttpContext, Microsoft.ApplicationInsights.DataContracts.RequestTelemetry, Microsoft.ApplicationInsights.Channel.ITelemetry)+ad
Microsoft.ApplicationInsights.Web.Implementation.WebTelemetryInitializerBase.Initialize(Microsoft.ApplicationInsights.Channel.ITelemetry)+85
Microsoft.ApplicationInsights.TelemetryClient.Initialize(Microsoft.ApplicationInsights.Channel.ITelemetry)+256
Microsoft.ApplicationInsights.DependencyCollector.Implementation.ClientServerDependencyTracker.BeginTracking(Microsoft.ApplicationInsights.TelemetryClient)+1ee
Microsoft.ApplicationInsights.DependencyCollector.Implementation.ProfilerSqlProcessingBase.OnBegin(System.Object)+18f
Microsoft.Diagnostics.Instrumentation.Extensions.Intercept.Functions+<>c__DisplayClass1_0.b__0(System.Object)+1e
```
But this is framework code (Application Insights), so it can't be the problem, right? Or can it? After reviewing several memory dumps my suspicion grew: this stack appeared in almost every dump I had. There might really be something to it.
If we look at the innermost frame, DispatchToAllListeners, it sounds plausible that we have a very long list of "listeners" and that walking it takes time. A peek into the framework source code seems to confirm that suspicion:
```csharp
private unsafe void DispatchToAllListeners(int eventId, Guid* childActivityID, EventWrittenEventArgs eventCallbackArgs)
{
    Exception lastThrownException = null;
    for (EventDispatcher dispatcher = m_Dispatchers; dispatcher != null; dispatcher = dispatcher.m_Next)
    {
        // Do stuff: hand the event over to this dispatcher's listener.
    }

    if (lastThrownException != null)
    {
        throw new EventSourceException(lastThrownException);
    }
}
```
So each EventSource has a field of type EventDispatcher named m_Dispatchers. And as the code above suggests, EventDispatcher is effectively a linked list: each instance points to the next dispatcher. That means that if the list grows long enough, DispatchToAllListeners can take a long time and even become a bottleneck.
The next question is whether our theory holds up. Let's check how many EventDispatcher instances we have:
```
0:115> !dumpheap -type EventDispatcher -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff9d3c28308        1           48 Microsoft.ServiceBus.Tracing.EventDispatcher
00007ff9d4ae8cc0   177825      8535600 System.Diagnostics.Tracing.EventDispatcher
Total 177826 objects
```
Oops!
That does not look good, and it rather reinforces our suspicion. But we still need to find out which code created this many EventDispatcher instances. Let's just pick one of them and explore:
```
0:115> !DumpHeap /d -mt 00007ff9d4ae8cc0
         Address               MT     Size
00000244309ba368 00007ff9d4ae8cc0       48
User interrupt.
```
Using !do (!DumpObj) to see what is in there:
```
0:115> !DumpObj /d 00000244305aa758
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 00000244305aa670 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 00000244305aa6a0 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 000002473197b630 m_Next
```
We can see the address of the next dispatcher in the list and, most importantly, the EventListener. But first, let's see where m_Next leads us:
```
0:115> !DumpObj /d 000002473197b630
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 000002473197b548 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 000002473197b578 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000024631edea68 m_Next

0:115> !DumpObj /d 0000024631edea68
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 0000024631ede980 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 0000024631ede9b0 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000024631edf240 m_Next

0:115> !DumpObj /d 0000024631edf240
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 0000024631edf140 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 0000024631edf188 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000024531331bc0 m_Next

0:115> !DumpObj /d 0000024531331bc0
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 0000024531331ac8 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 0000024531331b08 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000024531305fc8 m_Next

0:115> !DumpObj /d 0000024531305fc8
Name:        System.Diagnostics.Tracing.EventDispatcher
MethodTable: 00007ff9d4ae8cc0
EEClass:     00007ff9d4b08930
Size:        48(0x30) bytes
File:        D:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9cfc6a118  40016ed        8 ...ing.EventListener  0 instance 0000024531305ee0 m_Listener
00007ff9cfc79560  40016ee       10     System.Boolean[]  0 instance 0000024531305f10 m_EventEnabled
00007ff9cf9ecd30  40016ef       20       System.Boolean  1 instance                0 m_activityFilteringEnabled
00007ff9d4ae8cc0  40016f0       18 ...g.EventDispatcher  0 instance 0000000000000000 m_Next
```
So there are only five more EventDispatcher instances down that part of the chain. But the first dispatcher we picked is probably not the start of the list. It is not easy to find out which dispatcher points to it, but we can find out which objects hold references to it with !gcroot (man, that takes a loooooooooong time):
```
0:115> !gcroot 00000244305aa758
Thread 1428:
    00000072157bebb0 00007ff9d38358d6 System.Net.TimerThread.ThreadProc()
        rdi: (interior)
            ->  000002482f6697f0 System.Object[]
            ->  0000024732924120 System.Net.WebRequest+DelEtwFireBeginWRGet
            ->  000002442f664158 System.Diagnostics.Tracing.FrameworkEventSource
            ->  0000024461f2bd58 System.Diagnostics.Tracing.EventDispatcher
            ->  00000245764e59b0 System.Diagnostics.Tracing.EventDispatcher
            ...(very, very long list)
```
This particular chain has about 7112 dispatchers in it. Eww! Any event ever written to `FrameworkEventSource` has to walk this very, very long list. For the purpose of logging, that is far too much!
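For the curious: rather than dumping objects one by one, a chain like this can be counted with a small debugger loop that follows m_Next (a sketch only, not taken from the session; 0x18 is the m_Next offset reported by !DumpObj above and may differ between framework builds):

```
$$ start at the first dispatcher of FrameworkEventSource and follow m_Next
r $t0 = 0000024461f2bd58
r $t1 = 0
.while (@$t0 != 0) { r $t1 = @$t1 + 1; r $t0 = poi(@$t0 + 0x18) }
$$ print the node count
? @$t1
```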
Let's get back to the EventListener to see if there is something interesting there. Dumping that object gives us this:
```
0:115> !DumpObj /d 00000244305aa670
Name:        StackifyHttpTracer.EventListener
MethodTable: 00007ff9d28460f8
EEClass:     00007ff9d282b850
Size:        48(0x30) bytes
File:        D:\local\Temporary ASP.NET Files\root\125ea09c\39af438d\assembly\dl3\93224559\0052a77a_8212d201\StackifyHttpTracer.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9d580a5e8  40016a3        8 ...tArgs, mscorlib]]  0 instance 0000000000000000 _EventSourceCreated
0000000000000000  40016a4       10                       0 instance 0000000000000000 EventWritten
00007ff9cfc6a118  40016a5       18 ...ing.EventListener  0 instance 000002473197b548 m_Next
00007ff9d4ae9960  40016a6       20 ...ng.ActivityFilter  0 instance 0000000000000000 m_activityFilter
00007ff9cf9e67a0  40016a2      cb8        System.Object  0   shared   static s_EventSourceCreatedLock
        >> Domain:Value  000002442b6da6a0:000002442f651c98 000002442d7fd6c0:000002442f664378 000002488b1ebbb0:00000247311c68d0 <<
00007ff9cfc6a118  40016a7      cc0 ...ing.EventListener  0   shared   static s_Listeners
        >> Domain:Value  000002442b6da6a0:0000000000000000 000002442d7fd6c0:0000024461f2bc58 000002488b1ebbb0:0000000000000000 <<
00007ff9cfc77fd8  40016a8      cc8 ...rence, mscorlib]]  0   shared   static s_EventSources
        >> Domain:Value  000002442b6da6a0:000002442f651cb0 000002442d7fd6c0:000002442f664390 000002488b1ebbb0:00000247311c68e8 <<
00007ff9cf9ecd30  40016a9      ee8       System.Boolean  1   shared   static s_CreatingListener
        >> Domain:Value  000002442b6da6a0:0 000002442d7fd6c0:0 000002488b1ebbb0:0 <<
00007ff9cf9ecd30  40016aa      ee9       System.Boolean  1   shared   static s_EventSourceShutdownRegistered
        >> Domain:Value  000002442b6da6a0:1 000002442d7fd6c0:1 000002488b1ebbb0:1 <<
00007ff9cfb48830  4000006       10          System.Guid  1   static 000002482f6a6f90 tplGuid
00007ff9cfb48830  4000007       18          System.Guid  1   static 000002482f6a6f98 aspnetGuid
```
Wait, what, StackifyHttpTracer.EventListener? That is not part of the Episerver framework, nor of DXC. It is not even part of Azure. A quick Google search for the type shows no results, but searching for StackifyHttpTracer reveals that it is part of Stackify, a profiler. Looking at the site's web.config confirms that StackifyHttpModule is registered:
```xml
<remove name="StackifyHttpModule_Net40" />
<add name="StackifyHttpModule_Net40"
     type="StackifyHttpTracer.StackifyHttpModule,StackifyHttpTracer, Version=1.0.0.0, Culture=neutral, PublicKeyToken=93c44ce23f2048dd"
     preCondition="managedHandler,runtimeVersionv4.0" />
<remove name="TelemetryCorrelationHttpModule" />
```
But how does that relate to the high number of dispatchers we have seen? Using dotPeek to look inside StackifyHttpModule, I got the answer:
```csharp
public class StackifyHttpModule : IHttpModule
{
    private static readonly DateTime _Epoch = new DateTime(1970, 1, 1, 0, 0, 0, 0, DateTimeKind.Utc);
    private ModuleSettings _moduleSettings = new ModuleSettings();
    private EventListener el = new EventListener();
    // (rest of the decompiled class omitted)
}
```
It is easy to think that there is only ever one instance of your IHttpModule in your website (I did), but the memory dump says otherwise. It seems that new instances of IHttpModule are created over time, even though I'm not sure why, or when:
```
0:115> !dumpheap -type StackifyHttpTracer.StackifyHttpModule -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff9d28465f0      305         9760 StackifyHttpTracer.StackifyHttpModule
Total 305 objects
```
So that's it. We have 305 instances of StackifyHttpTracer.StackifyHttpModule, and each of them has an instance of StackifyHttpTracer.EventListener. Worse, many of those modules might have been disposed already, but their EventListener stays behind:
```
0:115> !dumpheap -type StackifyHttpTracer.EventListener -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff9d28460f8     7112       341376 StackifyHttpTracer.EventListener
Total 7112 objects
```
This number of EventListener instances matches the roughly 7112 dispatchers we found in the chain earlier: one dispatcher per listener. (The 177,825 dispatchers in total are also consistent with each of those listeners getting its own dispatcher on every event source in the process, roughly 25 of them.)
Oops!
That explains why we have that many EventDispatcher instances in place.
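To make the mechanism concrete, here is a minimal, self-contained sketch (the event source, the listener and the numbers are invented for illustration; this is not Stackify's code): every EventListener that is constructed and enables events gets attached to every EventSource in the process, so each one adds another dispatcher node that DispatchToAllListeners has to walk for every single event.

```csharp
using System.Diagnostics.Tracing;

[EventSource(Name = "Demo-ChattySource")]
internal sealed class DemoEventSource : EventSource
{
    public static readonly DemoEventSource Log = new DemoEventSource();

    [Event(1)]
    public void Ping(string message)
    {
        WriteEvent(1, message);
    }
}

internal sealed class NoOpListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Subscribing adds a dispatcher node for this listener to the source.
        EnableEvents(eventSource, EventLevel.Verbose);
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Swallow the event; the cost we care about is the dispatch itself.
    }
}

internal static class Program
{
    private static void Main()
    {
        // Simulate the module being re-created over and over: each listener is
        // rooted by the runtime's global listener list, so it is never collected,
        // and each one adds another node to every event source's dispatcher chain.
        for (int i = 0; i < 7000; i++)
        {
            new NoOpListener();
        }

        // This single call now has to walk ~7000 dispatchers.
        DemoEventSource.Log.Ping("hello");
    }
}
```

With one listener per process that walk is trivial; with thousands of leaked listeners, every WriteEvent call in the process, including the Application Insights calls we saw in the stack trace, pays the price.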
The solution for this case was simply to remove that StackifyHttpModule from web.config. I don't dare say it is a bug in Stackify, but it seems to me that the EventListener field should have been static instead of an instance field. Then again, I'm not sure what they are trying to achieve there.
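For what it's worth, this is roughly the shape I would have expected, purely as a hypothetical sketch since I don't know Stackify's internals or requirements: one listener per process, shared by all module instances.

```csharp
using System.Diagnostics.Tracing;
using System.Web;

// Hypothetical sketch only, not Stackify's actual code.
public sealed class SketchHttpModule : IHttpModule
{
    // One listener for the whole process, no matter how many module instances
    // ASP.NET creates (and later disposes).
    private static readonly SketchEventListener SharedListener = new SketchEventListener();

    public void Init(HttpApplication context)
    {
        // Per-instance request wiring would go here.
    }

    public void Dispose()
    {
        // Nothing per-instance to clean up; the shared listener intentionally lives on.
    }
}

internal sealed class SketchEventListener : EventListener
{
    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Forward the event to wherever the profiler wants it.
    }
}
```

Whether that would suit Stackify's design I can't say, but it would at least keep the dispatcher chains from growing with every module instance.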
The result? The customer deployed a new version with StackifyHttpModule removed, and their average server response time dropped from around 130 ms (and, most importantly, climbing, which makes sense given that the list of EventDispatcher instances was ever growing) to around 80 ms (and stable). Some of their most frequently called methods became 4-5 times faster. True, there might be other improvements in their new version, but I really think the removal of the module had the biggest impact.
My takeaways from the story:
- Sometimes it's hard to find the answer. Just keep looking and leave no stone unturned.
- This is why you should use Episerver DXC: you get early alerts on any anomalies, and you get expert help to solve the problem.
this is very nais! oh… good old friend windbg… haven’t seen you in a while.. 🙂
I’ve contacted library authors and they are investigating the issue. Thx!
Thanks, let’s see what they say
Amazing bit of investigation there mate. Sherlock Holmes stuff
Thanks 🙂