Threading is hard. It’s hard to get right. It’s hard to avoid race conditions. Even for experienced developers, getting it right is not a given (trust me, I’ve been there).
This time, the problem came from a report that a customer constantly had high CPU on all instances. Memory dumps were taken and I was able to take a look. As always, high CPU can have several causes; a common one is threads stuck spinning instead of making progress. For educational purposes, let’s walk through this memory dump step by step.
The first steps are routine WinDbg: open it, use Ctrl + D to start debugging the memory dump, then run .loadby sos clr to load the SOS extension that matches the CLR in the dump.
If the memory dump was captured on Azure, .loadby sos clr may fail with this error (assuming your local Windows is installed on the C:\ drive), because the dump references the path of the machine it was captured on:
0:000> .loadby sos clr
The call to LoadLibrary(D:\Windows\Microsoft.NET\Framework64\v4.0.30319\sos.dll) failed, Win32 error 0n126
"The specified module could not be found."
Please check your debugger configuration and/or network access.
Simply fix it by copying the path to sos.dll, replacing D with C, and running
.load C:\Windows\Microsoft.NET\Framework64\v4.0.30319\sos.dll
That should fix it.
The next step is to check the CPU situation with the !threadpool command:
0:000> !threadpool
CPU utilization: 98%
Worker Thread: Total: 160 Running: 6 Idle: 134 MaxLimit: 32767 MinLimit: 140
Work Request in Queue: 0
--------------------------------------
Number of Timers: 2
--------------------------------------
Completion Port Thread:Total: 6 Free: 4 MaxFree: 8 CurrentLimit: 6 MaxLimit: 1000 MinLimit: 4
This confirms a high CPU situation. It’s worth noting, however, that this figure is for the entire instance, not necessarily for the w3wp process alone; still, on an Azure App Service instance it is highly likely that w3wp is the culprit.
The next step is to check whether there are any long-running threads:
0:000> !runaway
User Mode Time
Thread Time
39:1e34 2 days 11:44:59.546
40:219c 2 days 8:08:56.765
41:2198 1 days 4:53:46.687
17:1480 0 days 0:19:38.171
20:1334 0 days 0:14:57.718
18:1798 0 days 0:12:53.625
19:12f8 0 days 0:12:05.015
31:a9c 0 days 0:03:39.093
27:1494 0 days 0:01:35.406
36:1e14 0 days 0:01:28.890
5:1124 0 days 0:00:38.578
6:1560 0 days 0:00:37.218
3:ff8 0 days 0:00:34.984
Normally, the first few threads in the list are not that interesting; they are usually just timers that have been running since application startup. But let’s leave no stone unturned, shall we? Let’s switch to the longest-running thread:
0:000> ~39s
00007ff9`297d9f59 458bf7 mov r14d,r15d
Hmm, it does not look like a listener. Let’s see what it actually is with !clrstack
0:039> !clrstack
OS Thread Id: 0x1e34 (39)
Child SP IP Call Site
000000740273cff0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
000000740273d080 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
000000740273d0c0 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
000000740273d100 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
000000740273d140 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()
000000740273d1f0 00007ff93153b54a System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib]].Start[[Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29, Abck.Web]](d__29 ByRef)
000000740273d2a0 00007ff93153b492 Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder.GetProductPriceViewModel(System.String, Abck.Web.Features.User.UserContext)
That looks interesting. If we move to the second-longest-running thread, the stack looks oddly familiar:
0:040> !clrstack
OS Thread Id: 0x219c (40)
Child SP IP Call Site
00000074be23d060 00007ff926c82f2e System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Contains(System.__Canon)
00000074be23d0d0 00007ff9307aebcd System.Web.HttpCookieCollection.EnsureKeyValidated(System.String, System.String)
00000074be23d110 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
00000074be23d150 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
00000074be23d190 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
00000074be23d1d0 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()
And the third-longest-running thread:
0:041> !clrstack
OS Thread Id: 0x2198 (41)
Child SP IP Call Site
00000078597bd8d0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
00000078597bd960 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
00000078597bd9a0 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
00000078597bd9e0 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
00000078597bda20 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()
00000078597bdad0 00007ff93153b54a System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib]].Start[[Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29, Abck.Web]](d__29 ByRef)
Any guesses as to what happened?
It looks like we have a clear case of an infinite loop. Multiple threads were trying to get the same cookie, and underneath, HttpCookieCollection.Get is not thread-safe: it accesses an underlying HashSet<T> without proper locking. While one thread was removing items from the hash set, another was reading it, and that corrupted the internal hash table. All three threads then ended up doomed to infinite loops, spinning forever and burning all the CPU.
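To make the failure mode concrete, here is a minimal stand-alone sketch (my own illustration, not the customer’s code) that races readers and writers on an unsynchronized HashSet<string>, the same pattern the stack traces show inside HttpCookieCollection. On any given run it may finish cleanly, throw, or hang, because the corruption is timing-dependent:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class HashSetRace
{
    static void Main()
    {
        // One unsynchronized set shared by several threads, mirroring the
        // single cookie collection touched by multiple request threads.
        var set = new HashSet<string>();

        Parallel.For(0, 4, worker =>
        {
            var rng = new Random(worker);
            for (int i = 0; i < 10_000_000; i++)
            {
                string key = "cookie-" + rng.Next(100);
                set.Add(key);        // writer
                set.Contains(key);   // reader (like thread 40 in the dump)
                set.Remove(key);     // writer (like threads 39 and 41)
            }
        });

        // If the internal slot chains get corrupted into a cycle, one of the
        // calls above can spin forever at 100% CPU and this line is never reached.
        Console.WriteLine("Survived this run without visible corruption.");
    }
}

When it does hang, !runaway and !clrstack on the stuck process would show much the same picture as above: a thread buried inside HashSet`1.Contains or Remove with an enormous amount of user-mode time.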
The fix in this case could be as simple as a lock around the code that reads CountryCurrency, but it’s best to be careful. When fixing a problem caused by too little locking, it’s easy to create a new one caused by too much locking.
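For illustration only, here is a sketch of what such a fix could look like, assuming the cookie is read inside UserContext.CountryCurrency. The real property’s source isn’t shown here, so the cookie name, the HttpContext.Current access and the fallback value are all guesses:

using System.Web;

public class UserContext
{
    // The cookie collection is per-request, so a per-context lock is enough
    // to stop concurrent threads of the same request from racing on it.
    private readonly object _cookieLock = new object();

    public string CountryCurrency
    {
        get
        {
            lock (_cookieLock)
            {
                // Only one thread at a time reaches HttpCookieCollection.Get,
                // the non-thread-safe call seen in the stack traces above.
                var cookie = HttpContext.Current?.Request.Cookies.Get("CountryCurrency");
                return cookie?.Value ?? string.Empty;
            }
        }
    }
}

The lock is deliberately narrow: it guards only the call into the cookie collection, which is the spirit of the warning above about trading too little locking for too much.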
I find these sorts of posts so interesting as you make finding a root cause look so straightforward!
I’m curious for a bit more context. Cookies are user-specific, so tied to an HTTP request, which is thread-safe?
So is this caused by a single request using async or contention from multiple simultaneous requests from the client / browser?
Yes, in this specific case there were multiple accesses within the same request. As always, if we can get something once and share that data with each “unit of work”, we should. I asked support to relay that information to the specific customer :). But this is also a textbook case of a high CPU investigation, which is why I wrote about it 🙂
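As a rough sketch of that “get it once and share it” idea (the names here are illustrative, not the customer’s code): HttpContext.Items lives exactly as long as one request, so the value can be resolved once and reused by every unit of work in that request instead of going back to the cookie collection each time.

using System.Web;

public static class RequestCache
{
    private const string ItemKey = "CountryCurrencyValue";

    // Resolve the cookie once per request, stash it in HttpContext.Items,
    // and let every later caller in the same request reuse it.
    public static string GetCountryCurrency(HttpContextBase context)
    {
        if (context.Items[ItemKey] is string cached)
            return cached;

        var value = context.Request.Cookies.Get("CountryCurrency")?.Value ?? string.Empty;
        context.Items[ItemKey] = value;
        return value;
    }
}

If truly concurrent access within a single request is still possible, as with the async state machine in the stacks above, the first read should also be guarded by a lock like the one in the earlier sketch.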