A curious case of cookie threading issue – part 2

A while back, I wrote this post A curious case of cookie threading issue – Quan Mai’s blog (vimvq1987.com) as an example of how to use WinDbg to diagnose a race condition (with results in an infinite loop). Naturally that issue needs to be fixed. A starting point to fix a race condition is to put locks whenever you change the data. But it’s not that simple. Too much locking would definitely hurt your application performance, and increase the chance of you running into another nasty problem (namely, deadlock). So the art of locking is “lagom” – Swedish word for “just right”, not too much, not too little.

ReaderWriteLockSlim comes to rescue. You only need to lock when writing, not locking. So this for reading

            readerWriterLockSlim.EnterReadLock();
            try
            {
                return httpRequest.Cookies[CookieKey];
            }
            finally
            {
                readerWriterLockSlim.ExitReadLock();
            }

And this for writing

            readerWriterLockSlim.EnterWriteLock();
            try
            {
                httpRequest.Cookies.Remove(CookieKey);               
            }
            finally
            {
                readerWriterLockSlim.ExitWriteLock();
            }

Would be enough, right?

No. The issue runs deeper than that.

If we go back to the stacktrace of the issue

0:039> !clrstack
OS Thread Id: 0x1e34 (39)
        Child SP               IP Call Site
000000740273cff0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
000000740273d080 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
000000740273d0c0 00007ff9314de98d 

and look closer. Something is not right. If you do not spot the issue – it’s fine, I missed it too. The problem is that we have a HashSet.Remove inside a HttpCookieCollection.Get. Uh oh.

If we look at source code of HttpCookieCollection, inside Get, it calls to EnsureKeyValidated

if (cookie != null) {
                EnsureKeyValidated(name, cookie.Value);
            }

referencesource/HttpCookieCollection.cs at master · microsoft/referencesource · GitHub

which itself calls to

            _keysAwaitingValidation.Remove(key);

referencesource/HttpCookieCollection.cs at master · microsoft/referencesource · GitHub

_keysAwaitingValidation is a HashSet<string> . That explains the Remove we saw. And that explains why ReaderWriterLockSlim is not enough – changes are made within the supposedly read only action as well.

The only valid solution here is to lock both read and write actions on HttpCookieCollection. However, as HttpCookieCollection is per request, so our lock object should only be per request as well (we certainly do not want every thread to be locked when we get cookie on a request).

Moral of the story:

  • Look closer. There might always be something underneath. Close to the truth is still not the truth.
  • Never assume that a Get method is thread-safe. The implementation can do plenty of unexpected things under the hood.
  • You might ask why this happen, as HttpCookieCollection is per request. Turned out that there is a code that use ThreadPool.QueueUserWorkItem to queue tasks which share the same HttpContext object. As we learned, that’s the recipe for disaster. Think twice (or thrice) before sharing an object that is not thread-safe between threads.

A curious case of cookie threading issue

Threading is hard. It’s hard to get right. It’s hard to avoid race condition. Even with experienced developers, it’s not always a given (trust me, I’ve been there).

This time, the problem comes from a report that a customer constantly has high CPU situation, on all instances. Memory dumps were taken and I was able to take a look. As always, high CPU can be result of several causes, most likely thread deadlocks. For educational purposes, let’s take this memory dump step by step.

First steps are as with routine Windbg – open it. use Ctrl + D to start debugging a memory dump, then you would need to run .loadby sos clr to load the clr runtime

If you are debugging a memory dump that is captured on Azure, .loadby sos clr will not work with this error (if your Windows is installed to C:\ drive)

0:000> .loadby sos clr
The call to LoadLibrary(D:\Windows\Microsoft.NET\Framework64\v4.0.30319\sos.dll) failed, Win32 error 0n126
    "The specified module could not be found."
Please check your debugger configuration and/or network access.

Simply fix it by copy the path to sos.dll, replacing D with C, and rerun

.load C:\Windows\Microsoft.NET\Framework64\v4.0.30319\sos.dll

That should fix it.

Next step would be checking the CPU situation by .threadpool command

0:000> !threadpool
CPU utilization: 98%
Worker Thread: Total: 160 Running: 6 Idle: 134 MaxLimit: 32767 MinLimit: 140
Work Request in Queue: 0
--------------------------------------
Number of Timers: 2
--------------------------------------
Completion Port Thread:Total: 6 Free: 4 MaxFree: 8 CurrentLimit: 6 MaxLimit: 1000 MinLimit: 4

This confirms a high CPU situation. It’s however worth noting this CPU is for the entire instance, not necessarily for the process (w3wp) only, but it’s highly likely for an Azure App Service instance that the w3wp is the problem here.

The next step would be checking if there is any long running thread

0:000> !runaway
 User Mode Time
  Thread       Time
   39:1e34     2 days 11:44:59.546
   40:219c     2 days 8:08:56.765
   41:2198     1 days 4:53:46.687
   17:1480     0 days 0:19:38.171
   20:1334     0 days 0:14:57.718
   18:1798     0 days 0:12:53.625
   19:12f8     0 days 0:12:05.015
   31:a9c      0 days 0:03:39.093
   27:1494     0 days 0:01:35.406
   36:1e14     0 days 0:01:28.890
    5:1124     0 days 0:00:38.578
    6:1560     0 days 0:00:37.218
    3:ff8      0 days 0:00:34.984

Normally, the first few threads in the list are not that interesting, they are usually just timers that run from the application start up. But let’s leave no stone unturned, shall we? Let’s switch to the longest running thread

0:000> ~39s
00007ff9`297d9f59 458bf7          mov     r14d,r15d

Hmm, it does not look like a listener. Let’s see what it actually is with !clrstack

0:039> !clrstack
OS Thread Id: 0x1e34 (39)
        Child SP               IP Call Site
000000740273cff0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
000000740273d080 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
000000740273d0c0 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
000000740273d100 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
000000740273d140 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()
000000740273d1f0 00007ff93153b54a System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib]].Start[[Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29, Abck.Web]](d__29 ByRef)
000000740273d2a0 00007ff93153b492 Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder.GetProductPriceViewModel(System.String, Abck.Web.Features.User.UserContext)

That looks interesting. If we move to the second longest running thread, it looks oddly familiar

0:040> !clrstack
OS Thread Id: 0x219c (40)
        Child SP               IP Call Site
00000074be23d060 00007ff926c82f2e System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Contains(System.__Canon)
00000074be23d0d0 00007ff9307aebcd System.Web.HttpCookieCollection.EnsureKeyValidated(System.String, System.String)
00000074be23d110 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
00000074be23d150 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
00000074be23d190 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
00000074be23d1d0 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()

And the third longest running thread

0:041> !clrstack
OS Thread Id: 0x2198 (41)
        Child SP               IP Call Site
00000078597bd8d0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
00000078597bd960 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
00000078597bd9a0 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
00000078597bd9e0 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
00000078597bda20 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()
00000078597bdad0 00007ff93153b54a System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib]].Start[[Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29, Abck.Web]](d__29 ByRef)

Any guess on what happened?

It looks like what we have a clear case of infinite loop. There were multiple threads trying to get the same cookie, and under the neath, the HttpCookieCollection.Get is not threadsafe, it tries to access an underlying HashSet<T> without properly locks. While one thread tried to remove items from the hashset, another tried to read it, and messed up the internal hash table. All three threads are then doomed in infinite loops, keep running and use up all CPU resources.

The fix in this case would be as simple as a lock around the code to get the CountryCurrency, but it’s best to be careful. Again, when fixing the problem of lack of lock, it’s easy to run into the problem of too much of lock.

Delete the undelete-able folder – Windows 7, 8, 10

If you are using Windows 7, 8, or 10, sometime there are some folders appear to be super stubborn, and refuse to go away. No matter how much you try, it just can’t be deleted. It’s undelete-able. Any attempts would result in this dreadful dialog box:

“Are you f*cking kidding me? This is my computer and I have the the absolute control over everything” – You said. Naturally, you would go to open the Security tab in Properties dialog to grant the permissions to everyone.

Strangely, the tab is empty! In other words, there is no one granted permission to nothing.

What? Why? How?

The truth is quite shocking: the folder might be already deleted – but there are processes still referring to it – and by that they keep the handle to the folder open. When that happens, you can’t really do anything to the folder without terminating the process first.

How to find the process then?

Just use Process Explorer, a free SysInternals tool from Microsoft.  Download, extract and run the .exe file (procexp.exe if you are running x86 version, or procexp64.exe if you are running 64 bit version of Windows) Click on Find/Find Handle or Dll, or use Ctrl+F shortcut. In this case we’re looking for typings, which is a fairly unique string. You can of course search by the path if the folder name is quite generic (i.e. there might be a lot of them). Wait for a little bit for this to show up. We got you the sneaky process(es):

Now you know which process is holding the handle to the folder, it’s as easy to fix it. Double click on the process and Process Explorer even let you know the parent process. In most of the cases you can have a peaceful exit and everything will be fixed. In worst cases, a forceful close might be needed.

 

Episerver Commerce performance optimization – part 2

Or lock or no lock – that’s the question.

This is the second part of the series on how can you improve the performance of Episerver Commerce site – or more precisely, to avoid the deadlocks and 100% CPU usage. This is not Commerce specific actually, and you can apply the knowledge and techniques here for a normal CMS site as well.

It’s a common and well-known best practice to store the slow-to-retrieve data in cache. These days memory is cheap – not free – but cheap. Yet it is still much faster than the fastest PCIe SSD in the market (if your site is running on traditional HDD, it’s not even close). And having objects in cache means you won’t have to open the connection to SQL Server, wait for it to read the data and send back to you – which all cost time. And if the object you need is a complex one, for example a Catalog content, you will also save the time needed to construct the object. Even if it’s fast, it is still not instantaneous, and it will cost you both memory and CPU cycles. All in all – caching is the right way to go. But how to get it right?

One common mistake for to have no lock when you load the data for the first time and insert it into cache.

Continue reading “Episerver Commerce performance optimization – part 2”