Solving the mystery of high memory usage

Sometimes, my work is easy, the problem could be resolved with one look (when I’m lucky enough to look at where it needs to be looked, just like this one Varchar can be harmful to your performance – Quan Mai’s blog (vimvq1987.com) ). Sometimes, it is hard. Can’t count number of times that I stared blankly at the screen, and decided I’d better take a nap, roast a batch of coffee, or take a walk (that is lying, however, I don’t walk), because I’m out of idea and this is going nowhere. The life of a software diagnostic engineer is like that, sometimes you are solving the mystery of “what do I need to solve this mystery”. There are usually more dots scattered around in all places, your job is to figure out which dots make senses, which dots do not, and how to connect those that are relevant to solve the problem, and to tell a story.

The story today is about a customer complaining about their scheduled instance on DXP keeps having high memory after running Find indexing job. They have a custom job that was built to optimize performance for their language settings, but the idea is the same – load content, serialize it and send it to the server endpoint for indexing. It is, indeed a memory heavy job, especially when you have a lot of content that needs to be indexed (basically, number of content x number of languages x the complexity of the content). It is normal to have an increase in memory usage during such job – the application (or rather, the runtime, depending on which way you look at it) is doing it job – content needs to be loaded in memory, and if there is available memory it will be a huge waste if it is not used for something useful. And the application will not immediately release that memory, as the content is cached. The memory will only be reclaimed only if the cache expired, or the application has memory pressure (i.e. it asks the operating system for more memory and the OS refuses “there is nothing left”). Even if the cache is expired, the application will not always compact and release the memory back to the OS (LOH etc.)

Now what is problematic is that the customer application retains 25GB of memory for indefinitely. They waited for 24h but the memory usage is still high. The application appears to be fine, it does not crash because of memory issues (like Out of Memory), but it causes confusion s and worries to our customer. Game’s on.

One thing that does not make senses in this case is that even thought they have a custom index job, it is still a scheduled job. And for scheduled jobs, the contents are supposed to have a very short sliding expiration time (default to 1 minute). However, the cache entries in the memory dumps tell a different story. A majority of the cache entries have 12h sliding expiration time. Which does explain – in part at least – why the memory remains high. When you have a longer sliding time, chance is higher that the cache is hit at least once before it expires, which reset the expiration. If you have sufficient hit, the cache will effectively remain in memory forever, until you actively evict it (by editing the content for example)

0000753878028910                        0.77kb          0                           12:00:00                    2/16/2024 5:58:43 AM +00:00    EPPageData:601596:en__CatalogContent
0000753878029DC0                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1345603:es-pr__CatalogContent
00007538781C7F48                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1351986:es-pr__CatalogContent
00007538781C8058                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1346230:es-pr__CatalogContent
00007538781C8168                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1351988:es-pr__CatalogContent
00007538786FA8E8                        0.77kb          0                           12:00:00                    2/16/2024 8:14:53 AM +00:00    EPPageData:1049433:no__CatalogContent
00007538786FC598                        0.78kb          0                           12:00:00                    2/16/2024 9:32:28 AM +00:00    EPPageData:1088026:es-pr__CatalogContent
00007538786FD9E0                        0.77kb          0                           12:00:00                    2/16/2024 8:14:53 AM +00:00    EPPageData:1049435:no__CatalogContent
0000753878700770                        0.77kb          0                           12:00:00                    2/16/2024 7:52:53 AM +00:00    EPPageData:1029725:da__CatalogContent
0000753878706528                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1351990:es-pr__CatalogContent
0000753878706638                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1350104:es-pr__CatalogContent
00007538787A2F80                        0.77kb          0                           12:00:00                    2/16/2024 8:14:53 AM +00:00    EPPageData:1049439:no__CatalogContent
00007538787A3FD0                        0.77kb          0                           12:00:00                    2/16/2024 7:52:53 AM +00:00    EPPageData:1029729:da__CatalogContent
00007538787A6B48                        0.77kb          0                           12:00:00                    2/16/2024 7:52:53 AM +00:00    EPPageData:1029731:da__CatalogContent
00007538787A74C0                        0.77kb          0                           12:00:00                    2/16/2024 6:21:34 AM +00:00    EPPageData:690644:en__CatalogContent
00007538787A9CC8                        0.78kb          0                           12:00:00                    2/16/2024 5:43:57 AM +00:00    EPPageData:181410:cs-cz__CatalogContent
00007538787ACDD8                        0.82kb          0                           12:00:00                    2/16/2024 2:17:38 PM +00:00    EPPageData:1343746__CatalogContent
00007538787ACFF8                        0.83kb          0                           12:00:00                    2/16/2024 2:17:25 PM +00:00    EPPageData:1343746:en__CatalogContent
00007538787AE658                        0.77kb          0                           12:00:00                    2/16/2024 2:59:37 PM +00:00    EPPageData:1350160:da__CatalogContent
00007538787AE768                        0.77kb          0                           12:00:00                    2/16/2024 2:59:37 PM +00:00    EPPageData:1350162:da__CatalogContent
00007538787AEA98                        0.39kb          0                           00:00:00                    2/16/2024 2:17:38 PM +00:00    EPiAnc:ContentAssetAware1343745__CatalogContent
00007538787AF058                        0.77kb          0                           12:00:00                    2/16/2024 2:59:37 PM +00:00    EPPageData:1347560:da__CatalogContent
00007538787B29A0                        0.77kb          0                           12:00:00                    2/16/2024 2:17:07 PM +00:00    EPPageData:1329806:da__CatalogContent
00007538787B2E68                        0.77kb          0                           12:00:00                    2/16/2024 2:17:07 PM +00:00    EPPageData:1329808:da__CatalogContent
00007538787B31E8                        0.77kb          0                           12:00:00                    2/16/2024 2:17:07 PM +00:00    EPPageData:1329810:da__CatalogContent

It is not what it should be, however, as the default value for sliding expiration timeout of a content loaded by a scheduled job is 1 minute – i.e. it is considered to be load once and be done item. Was it set to 12h by mistake. Nope

Timeout is set to 600.000.000 ticks which is 60 second, which is the default value.

I have been pulling my hairs over this for quite a while. What if the cache entries were not added by the scheduled job, but by some other way not affected by the limitation of scheduled job? In short, we were deceived by customer’s statement regarding Find indexing job. It was merely a victim of same issue. It was resetting the last access to the cache entry but that’s about it.

Time to dig a bit more. While Windbg is extremely powerful, it does not let you know where is the code that load a specific content into cache (not unless you catch it red handed). So the only way to know is to look around and check if there are any suspicious call the IContentLoader.GetItems or IContentLoader.GetChildren . A colleague of mine worked with the customer to obtain their source code, and another deep dive.

Fortunately for us, the customer has a custom built Find indexer we helped to built in a previous problem, and that was shown in the search for GetItems. It struck me that it could be the culprit. The job itself is … fine, however it was given wrong data so it keeps loading content to index.

If my hypothesis is correct, then these things must be true:

  • The app’s memory usage will raise to 25GB regardless of the indexing job running or not. And it remains there without much fluctuation
  • There are a lot of row in tblFindIndexQueue

It turned out both of those were correct: there were more than 4 millions of rows in tblFindIndexQueue, and this is the memory consumption of the app over 24 hours

One we figured out the source of content loading, the fix was pretty straightforward. One thing we could do from our side is to shorten caching time of content loaded by the event-driven indexer. You should upgrade to Find 16.2.0 which contains the fix for FIND-12436 which is a nice improvement for memory usage.

Moral of story:

  • I’m a workaholic. I definitely should not work on weekends, but sometimes I need to because that’s when my mind is clearest
  • Keep looking. But as always, know when to give up and admit defeat
  • Take breaks. Long, shorts. Refresh your mind and look at different angles.
  • The sliding cache expiration time can be quite unexpected. if a content is already in cache with long sliding expiration, then a cache hit (via ISynchronizedObjectInstanceCache.ReadThrough to get that content with short sliding expiration will not change that value, only refresh the last access time, and vice versa)

Performance optimization – the hardcore series – part 2

Earlier we started a new series about performance optimization, here Performance optimization – the hardcore series – part 1 – Quan Mai’s blog (vimvq1987.com) . There are ton of places where things can go wrong. A seasoned developer can, from experience, avoid some obvious performance errors. But as we will soon learn, a small thing can make a huge impact if it is called repeatedly, and a big thing might be OK to use as long as it is called once.

Let’s take this example – how would you think about this snippet – CategoryIds is a list of string converted from ContentReference

            if (CategoryIds.Any(x => new ContentReference(x).ToReferenceWithoutVersion() == contentLink))
            {
                //do stuff
            }

If this is in any “cool” path that run a few hundred times a day, you will be fine. It’s not “elegant”, but it works, and maybe you can get away with it. However, if it is in a hot path that is executed every time a visitor visits a product page in your website, it can create a huge problem.

And can you guess what it is?

new ContentReference(string) is fairly lightweight, but if it is called a lot, this is what happen. This is allocations from the constructor alone, and only within 220 seconds of the trace

A lot of allocations which should have been avoided if CategoryIds was just an IEnumerable<ContentReference> instead of IEnumerable<string>

For comparison, this is how 10.000 and 1000.000 new ContentReference would allocate

Thing is similar if you use .ToReferenceWithoutVersion() to compare to another ContentReference (although to a lesser extend as ToReferenceWithoutVersion would return the same ContentReference if the WorkId is 0, and it use cloning instead of new). The correct way to compare two instances of ContentReference without caring about versions, is to use .Compare with ContentReferenceComparer.IgnoreVersion

Moral of the story

  • It is not only what you do, but also how you do it
  • Small things can make big impacts, don’t guess, measure!

Performance optimization – the hardcore series – part 1

Hi again every body. New day – new thing to write about. today we will talk about memory allocation, and effect it has on your website performance. With .NET, memory allocations are usually overlooked because CLR handles that for you. Except in rare cases that you need to handle unmanaged resources, that you have to be conscious about releasing that said resources yourself, it’s usually fire and forget approach.

Truth is, it is more complicated than that. The more objects you created, the more memory you need, and the more time CLR needs to clean it up after you. When you might have written code that is executed blazing fast in your benchmarks, in reality, your website might still struggle to perform well in long run – and that’s because of Garbage collection. Occasional GC is not of a concern – because it’s nature of .NET CLR, but frequent GC, especially Gen 2 GC, is definitely something you should look into and should fix, because it natively affects your website performance.

The follow up question – how do you fix that.

Of course, the first step is always measuring the memory allocations of your app. Locally you can use something like Jetbrains dotMemory to profile your website, but that has a big caveat – you can’t really mimic the actual traffic to your website. Sure, it is very helpful to profile something like a scheduled job, but it is less than optimal to see how your website performs in reality. To do that, we need another tool, and I’ve found nothing better than Application Insights Profiler trace on Azure. It will sample your website periodically, taking ETL ( event trace log) files in 220 seconds (Note, depends on your .NET version, you might download a .diagsession or a .netperf.zip file from Application Insights, but they are essentially the same inside (zipped .ETL)). Those files are extremely informative, they contains whole load of information which might be overwhelming if you’re new, but take small steps, you’ll be there.

To open a ETL file, common tool is Perfview (microsoft/perfview: PerfView is a CPU and memory performance-analysis tool (github.com)). Yes it has certain 2000 look like other analysis tool (remember Windbg), but it is fast, efficient, and gets the job done

Note that once extracted ETL can be very big – in 1GB or more range often. Perfview has to go through all that event log so it’s extremely memory hungry as well, especially if you open multiple ETL files at once. My perfview kept crashing when I had a 16GB RAM machine (I had several Visual Studio instances open), and that was solved when I switched to 32GB RAM

The first step is to confirm the allocation problems with GCStats (this is one of the extreme ones, but it does happen)

Two main things to look into – Total Allocs, i.e. the total size of objects allocated, and then the time spent in Garbage collection. They are naturally closely related, but not always. Total allocation might not be high but time for GC might be – in case of large objects allocation (we will talk about it in a later post). Then for the purpose of memory allocation analysis, this is where you should look at

What you find in there, might surprise you. And that’s the purpose of this series, point out possible unexpected allocations that are easy – or fairly easy – to fix.

In this first post, we will talk about a somewhat popular feature – Injected<T>.

We all know that in Optimizely Content/Commerce, the preferred way of dependency injection is constructor injection. I.e. if your class has a dependency on a certain type, that dependency should be declared as a parameter of the constructor. That’s nice and all, but not always possible. For example you might have a static class (used for extension methods) so no constructor is available. Or in some rare cases, that you can’t added a new parameter to the constructor because it is a breaking change.

Adding Injected<T> as a hidden dependency in your class is at least working, so can you forget about it?

Not quite!

This is how the uses of Injected<T> result in allocation of Structuremap objects – yes every time you call Injected<T>.Service the whole dependency tree must be built again.

And that’s not everything, during that process, other objects need to be created as well. You can right click on a path and select “Include item”. The allocations below are for anything that were created by `module episerver.framework episerver.framework!EPiServer.ServiceLocation.Injected1[System.__Canon].get_Service() i.e. all object allocations, related to Injected<T>

You can expand further to see what Injected<T>(s) have the most allocations, and therefore, are the ones should be fixed.

How can one fix a Injected<T> then? The best fix is to make it constructor dependency, but that might not always be possible. Alternative fix is to use ServiceLocator.GetInstance, but to make that variable static if possible. That way you won’t have to call Injected<T>.Service every time you need the instance.

There are cases that you indeed need a new instance every time, then the fix might be more complicated, and you might want to check if you need the whole dependency tree, or just a data object.

Moral of the story

  • Performance can’t be guessed, it must be measured
  • Injected<T> is not your good friend. You can use it if you have no other choice, but definitely avoid it in hot paths.

Debug .NET memory dump with WinDBG – crash course. Part 1

If you ask me what had I been doing the last two weeks – then the answer is I was pulling my hairs.  A customer had a problem with their site as the memory hiked up after catalog imports and stayed there “forever” – and in the end it slowed the site down. I jumped in and almost regretted that decision – had to spent days  messing around with WinDBG and memory dumps. In the end – I found the problem and it was fixed. A lot of hairs were loss in progress, but I learned something about WinDBG – and that’s what I’m sharing today.

WinDBG is probably the most famous tool for debugging stuffs on Windows. Out of the box, it only works with native applications, aka assembly and such – but lucky for us, there are plenty of extensions to allow it to work with .NET application. The “standard” SOS and more advanced extension SOSEX. SOS is included in WinDBG, while you can download SOSEX from here (for 64 bit) or here (for 32 bit) . Download the zip file and extract the dll somewhere.

WinDBG comes with the Windows SDK, not the standard .NET framework, so you’ll probably need to install it separately from here

Continue reading “Debug .NET memory dump with WinDBG – crash course. Part 1”