
A curious case of memory dump diagnostics: How Stackify can cause trouble for your site

It’s been a while since my last blog post, and this time it will be some fun with Windbg. One of our DXC customers has been having a problem with instances of their site hanging and restarting. We at Episerver take the availability and performance of DXC sites very seriously, so I was asked to help diagnose the problem. It’s something I like to do and learn from anyway, so the game was on.

The gold standard for looking into this kind of problem is still the tried and trusted Windbg. With help from the Managed Services engineers, I was able to access several memory dumps from the site. Most of them were taken when the site stalled, which is usually the most useful moment because it should contain the most critical piece of information about why the stall happened.

As usual, it’s just a matter of firing up Windbg, pressing Ctrl+D to open the memory dump, loading some essential extensions, and everything is ready.
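For a memory dump from a .NET site, that typically means loading SOS, something like the following: .loadby sos clr loads the SOS extension from the same location as the CLR captured in the dump, and !eeversion is a quick sanity check that it matches the runtime.

```
.loadby sos clr
!eeversion
```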

A quick check confirms that there was a high CPU usage situation.
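A typical way to make that check is the SOS !threadpool command, whose output includes the CPU utilization recorded at the time the dump was taken:

```
!threadpool
```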

But now comes the important part: we don’t know which thread is causing the high CPU situation. Yes, we can always go with !threads or !runaway and walk through all the stacktraces and guess.
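The brute-force version of that looks something like this: !runaway shows how much CPU time each thread has burnt, and ~*e !clrstack dumps the managed stack of every thread so you can eyeball them one by one.

```
!runaway
~*e !clrstack
```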

It’s just hard to make a good guess, especially when there is a garbage collection in progress in most of the memory dumps, but not all. When a GC runs, the GC threads suspend all other threads while they mark and sweep the “dead” objects, which can slow down the instance greatly, to the point that Azure AutoHeal decides the instance is not responding and restarts it. But what we really need to explain is why a Gen 2 GC happened in the first place, and how to prevent it from happening again.

And what about the memory dumps where there was no GC?

Reading a few guides on finding a memory leak with Windbg might make it sound easy, and it can be, on a fairly small application. But we are talking about memory dumps 6-10 GB in size, with no clear indication of what could possibly be wrong. True, I am not a Windbg master, at least not yet, but if there is a list of things that really make me pull my hair out, this is at the very top of it. There were times I thought I was this close to revealing the truth, only to be proven wrong.

But as they say, “every cloud has a silver lining”. It came to my attention that this stacktrace appeared consistently in some threads:

But it’s the framework (Application Insights), so it can’t be the problem, right? Or can it? After reviewing several memory dumps, my suspicion grew: this appeared in almost every memory dump I had. There might really be something to it.

If we look at the last frame, DispatchToAllListeners, it sounds possible that we have a very long list of “listeners”, and walking it takes time. A peek into the framework source code seems to confirm my suspicion:
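The real code has more going on (per-event filtering, error handling), but stripped down it looks roughly like this. This is a simplified reconstruction of the System.Diagnostics.Tracing internals, not verbatim framework source:

```csharp
// Simplified sketch of the relevant EventSource internals (not verbatim framework code).
internal class EventDispatcher
{
    internal readonly EventListener m_Listener;  // the listener this dispatcher delivers events to
    internal EventDispatcher m_Next;             // next node: the dispatchers form a singly linked list

    internal EventDispatcher(EventDispatcher next, EventListener listener)
    {
        m_Next = next;
        m_Listener = listener;
    }
}

public partial class EventSource
{
    // Head of the linked list; a node is added for every listener
    // that enables events on this source.
    private EventDispatcher m_Dispatchers;

    private void DispatchToAllListeners(EventWrittenEventArgs eventCallbackArgs)
    {
        // Walks the entire linked list for every single event that is written.
        for (EventDispatcher dispatcher = m_Dispatchers; dispatcher != null; dispatcher = dispatcher.m_Next)
        {
            dispatcher.m_Listener.OnEventWritten(eventCallbackArgs);
        }
    }
}
```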

So each EventSource has a field of type EventDispatcher named m_Dispatchers. And as the code above suggests, the dispatchers actually form a linked list: each instance points to the next dispatcher. That means if the linked list is long enough, this method, DispatchToAllListeners, can take a long time, even becoming a bottleneck.

The next question is to find out whether our theory holds up. Let’s check how many EventDispatcher instances we have.
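The count comes from an ordinary type-filtered heap dump, something along these lines (-stat aggregates by type, and -type does a substring match on the type name):

```
!dumpheap -stat -type EventDispatcher
```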

Oops!

That does not look good, and it kind of reaffirms our suspicion. But we still need to find out which code caused this high number of EventDispatcher instances. Just pick any of them and explore.

Using !do (!DumpObject) to see what is in there:
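That is, with an address taken from the !dumpheap output (the actual addresses from the dump are not shown here):

```
!do <address of an EventDispatcher instance>
```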

We can see the address of the next dispatcher in the list and, most importantly, of the EventListener. But let’s see where m_Next leads us.

So only 5 more EventDispatcher instances down the chain. But the first dispatcher might not be the head of the list. It’s not easy to find out which dispatcher points to it, but we can find out which objects hold references to it with !gcroot (man, that takes a loooooooooong time).
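In command form, again with a placeholder for the address:

```
!gcroot <address of the EventDispatcher instance>
```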

This one has about 7112 dispatchers in the chain. Eww! If any event was ever handled by the FrameworkEventSource, this would be a very, very long list to walk through. And for logging purposes, that is way too much!

Let’s get back to the EventListener to see if there is something interesting there.

Dumping that object gives us this:

Wait, what, StackifyHttpTracer.EventListener? It is not part of the Episerver Framework, nor of DXC. It is not even part of Azure. A quick Google search for it shows no results, but searching for StackifyHttpTracer reveals that it is part of Stackify, a profiler. Looking at the customer’s web.config confirms that StackifyHttpModule is present:
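The entry looked something like the following; the name and assembly attributes here are my guess, only the StackifyHttpTracer.StackifyHttpModule type name is confirmed by the dump:

```xml
<system.webServer>
  <modules>
    <!-- hypothetical registration, attribute values may differ -->
    <add name="StackifyHttpModule" type="StackifyHttpTracer.StackifyHttpModule, StackifyHttpTracer" />
  </modules>
</system.webServer>
```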

But how does that relate to the high number of dispatchers we have seen? Using dotPeek to look into StackifyHttpModule, I get the answer:
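I won’t paste the decompiled source here, but the pattern was essentially the following. This is an illustrative sketch with made-up names (except StackifyHttpModule itself), not Stackify’s actual code:

```csharp
using System.Diagnostics.Tracing;
using System.Web;

public class StackifyHttpModule : IHttpModule
{
    // Every module instance creates its own listener.
    private readonly EventListener _listener = new ProfilerEventListener();

    public void Init(HttpApplication context) { /* hook request events, etc. */ }

    // Disposing the module does not dispose the listener, so its dispatcher
    // node stays in the EventSource's list even after the module is gone.
    public void Dispose() { }
}

// Hypothetical stand-in for StackifyHttpTracer.EventListener.
internal class ProfilerEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Enabling events on a source adds an EventDispatcher node for this
        // listener to that source's m_Dispatchers linked list.
        EnableEvents(eventSource, EventLevel.Informational);
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData) { /* ... */ }
}
```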

It is easy to think that you can have only one instance of your IHttpModule in your website (I did), but the memory dump says otherwise. ASP.NET actually creates a set of IHttpModule instances for each pooled HttpApplication instance, so more and more of them can appear over time, even though I’m not sure exactly why, or when, that happened here.

So that’s it. We have 305 instances of StackifyHttpTracer.StackifyHttpModule, and each of them has an instance of StackifyHttpTracer.EventListener. Worse, many of the modules might have been disposed already, but their EventListener instances stay behind.
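Those counts come from the same kind of type-filtered heap dump as before; since -type matches substrings, one command covers both types:

```
!dumpheap -stat -type StackifyHttpTracer
```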

The number of EventListener instances matches the number of EventDispatcher instances from our earlier check.

Oops!

That explains why we have that many instances of EventDispatcher in place.

The solution for this case was simply to remove StackifyHttpModule from the web.config. I don’t dare say it is a bug in Stackify, but it seems to me that the EventListener field should have been static instead of per instance. Then again, I’m not sure what they are trying to achieve there.
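For illustration only, reusing the hypothetical ProfilerEventListener from the sketch above (so, again, not Stackify’s actual code), a shared static listener would look like this:

```csharp
using System.Diagnostics.Tracing;
using System.Web;

public class StackifyHttpModule : IHttpModule
{
    // One listener shared by all module instances, so only a single
    // EventDispatcher node is added per EventSource, no matter how many
    // HttpApplication (and therefore module) instances ASP.NET creates.
    private static readonly EventListener Listener = new ProfilerEventListener();

    public void Init(HttpApplication context) { /* ... */ }

    public void Dispose() { }
}
```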

The result? The customer deployed a new version with StackifyHttpModule removed, and their average server response time dropped from around 130ms (and, most importantly, increasing, which is understandable since we saw that the list of EventDispatcher instances kept growing) to around 80ms (and stable). Some of their most frequently called methods became 4 or 5 times faster. True, there might be other improvements in their new version, but I really think the removal of the module had the biggest impact.

My takeaways from the story:

  • Sometimes it’s hard to find the answer. Just keep looking and leave no stone unturned.
  • This is why you should use Episerver DXC: you get early alerts on any anomalies, and you get expert help to solve the problem.
