Solving the mystery of high memory usage

Sometimes, my work is easy, the problem could be resolved with one look (when I’m lucky enough to look at where it needs to be looked, just like this one Varchar can be harmful to your performance – Quan Mai’s blog (vimvq1987.com) ). Sometimes, it is hard. Can’t count number of times that I stared blankly at the screen, and decided I’d better take a nap, roast a batch of coffee, or take a walk (that is lying, however, I don’t walk), because I’m out of idea and this is going nowhere. The life of a software diagnostic engineer is like that, sometimes you are solving the mystery of “what do I need to solve this mystery”. There are usually more dots scattered around in all places, your job is to figure out which dots make senses, which dots do not, and how to connect those that are relevant to solve the problem, and to tell a story.

The story today is about a customer complaining about their scheduled instance on DXP keeps having high memory after running Find indexing job. They have a custom job that was built to optimize performance for their language settings, but the idea is the same – load content, serialize it and send it to the server endpoint for indexing. It is, indeed a memory heavy job, especially when you have a lot of content that needs to be indexed (basically, number of content x number of languages x the complexity of the content). It is normal to have an increase in memory usage during such job – the application (or rather, the runtime, depending on which way you look at it) is doing it job – content needs to be loaded in memory, and if there is available memory it will be a huge waste if it is not used for something useful. And the application will not immediately release that memory, as the content is cached. The memory will only be reclaimed only if the cache expired, or the application has memory pressure (i.e. it asks the operating system for more memory and the OS refuses “there is nothing left”). Even if the cache is expired, the application will not always compact and release the memory back to the OS (LOH etc.)

Now what is problematic is that the customer application retains 25GB of memory for indefinitely. They waited for 24h but the memory usage is still high. The application appears to be fine, it does not crash because of memory issues (like Out of Memory), but it causes confusion s and worries to our customer. Game’s on.

One thing that does not make senses in this case is that even thought they have a custom index job, it is still a scheduled job. And for scheduled jobs, the contents are supposed to have a very short sliding expiration time (default to 1 minute). However, the cache entries in the memory dumps tell a different story. A majority of the cache entries have 12h sliding expiration time. Which does explain – in part at least – why the memory remains high. When you have a longer sliding time, chance is higher that the cache is hit at least once before it expires, which reset the expiration. If you have sufficient hit, the cache will effectively remain in memory forever, until you actively evict it (by editing the content for example)

0000753878028910                        0.77kb          0                           12:00:00                    2/16/2024 5:58:43 AM +00:00    EPPageData:601596:en__CatalogContent
0000753878029DC0                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1345603:es-pr__CatalogContent
00007538781C7F48                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1351986:es-pr__CatalogContent
00007538781C8058                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1346230:es-pr__CatalogContent
00007538781C8168                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1351988:es-pr__CatalogContent
00007538786FA8E8                        0.77kb          0                           12:00:00                    2/16/2024 8:14:53 AM +00:00    EPPageData:1049433:no__CatalogContent
00007538786FC598                        0.78kb          0                           12:00:00                    2/16/2024 9:32:28 AM +00:00    EPPageData:1088026:es-pr__CatalogContent
00007538786FD9E0                        0.77kb          0                           12:00:00                    2/16/2024 8:14:53 AM +00:00    EPPageData:1049435:no__CatalogContent
0000753878700770                        0.77kb          0                           12:00:00                    2/16/2024 7:52:53 AM +00:00    EPPageData:1029725:da__CatalogContent
0000753878706528                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1351990:es-pr__CatalogContent
0000753878706638                        0.78kb          0                           12:00:00                    2/16/2024 2:59:39 PM +00:00    EPPageData:1350104:es-pr__CatalogContent
00007538787A2F80                        0.77kb          0                           12:00:00                    2/16/2024 8:14:53 AM +00:00    EPPageData:1049439:no__CatalogContent
00007538787A3FD0                        0.77kb          0                           12:00:00                    2/16/2024 7:52:53 AM +00:00    EPPageData:1029729:da__CatalogContent
00007538787A6B48                        0.77kb          0                           12:00:00                    2/16/2024 7:52:53 AM +00:00    EPPageData:1029731:da__CatalogContent
00007538787A74C0                        0.77kb          0                           12:00:00                    2/16/2024 6:21:34 AM +00:00    EPPageData:690644:en__CatalogContent
00007538787A9CC8                        0.78kb          0                           12:00:00                    2/16/2024 5:43:57 AM +00:00    EPPageData:181410:cs-cz__CatalogContent
00007538787ACDD8                        0.82kb          0                           12:00:00                    2/16/2024 2:17:38 PM +00:00    EPPageData:1343746__CatalogContent
00007538787ACFF8                        0.83kb          0                           12:00:00                    2/16/2024 2:17:25 PM +00:00    EPPageData:1343746:en__CatalogContent
00007538787AE658                        0.77kb          0                           12:00:00                    2/16/2024 2:59:37 PM +00:00    EPPageData:1350160:da__CatalogContent
00007538787AE768                        0.77kb          0                           12:00:00                    2/16/2024 2:59:37 PM +00:00    EPPageData:1350162:da__CatalogContent
00007538787AEA98                        0.39kb          0                           00:00:00                    2/16/2024 2:17:38 PM +00:00    EPiAnc:ContentAssetAware1343745__CatalogContent
00007538787AF058                        0.77kb          0                           12:00:00                    2/16/2024 2:59:37 PM +00:00    EPPageData:1347560:da__CatalogContent
00007538787B29A0                        0.77kb          0                           12:00:00                    2/16/2024 2:17:07 PM +00:00    EPPageData:1329806:da__CatalogContent
00007538787B2E68                        0.77kb          0                           12:00:00                    2/16/2024 2:17:07 PM +00:00    EPPageData:1329808:da__CatalogContent
00007538787B31E8                        0.77kb          0                           12:00:00                    2/16/2024 2:17:07 PM +00:00    EPPageData:1329810:da__CatalogContent

It is not what it should be, however, as the default value for sliding expiration timeout of a content loaded by a scheduled job is 1 minute – i.e. it is considered to be load once and be done item. Was it set to 12h by mistake. Nope

Timeout is set to 600.000.000 ticks which is 60 second, which is the default value.

I have been pulling my hairs over this for quite a while. What if the cache entries were not added by the scheduled job, but by some other way not affected by the limitation of scheduled job? In short, we were deceived by customer’s statement regarding Find indexing job. It was merely a victim of same issue. It was resetting the last access to the cache entry but that’s about it.

Time to dig a bit more. While Windbg is extremely powerful, it does not let you know where is the code that load a specific content into cache (not unless you catch it red handed). So the only way to know is to look around and check if there are any suspicious call the IContentLoader.GetItems or IContentLoader.GetChildren . A colleague of mine worked with the customer to obtain their source code, and another deep dive.

Fortunately for us, the customer has a custom built Find indexer we helped to built in a previous problem, and that was shown in the search for GetItems. It struck me that it could be the culprit. The job itself is … fine, however it was given wrong data so it keeps loading content to index.

If my hypothesis is correct, then these things must be true:

  • The app’s memory usage will raise to 25GB regardless of the indexing job running or not. And it remains there without much fluctuation
  • There are a lot of row in tblFindIndexQueue

It turned out both of those were correct: there were more than 4 millions of rows in tblFindIndexQueue, and this is the memory consumption of the app over 24 hours

One we figured out the source of content loading, the fix was pretty straightforward. One thing we could do from our side is to shorten caching time of content loaded by the event-driven indexer. You should upgrade to Find 16.2.0 which contains the fix for FIND-12436 which is a nice improvement for memory usage.

Moral of story:

  • I’m a workaholic. I definitely should not work on weekends, but sometimes I need to because that’s when my mind is clearest
  • Keep looking. But as always, know when to give up and admit defeat
  • Take breaks. Long, shorts. Refresh your mind and look at different angles.
  • The sliding cache expiration time can be quite unexpected. if a content is already in cache with long sliding expiration, then a cache hit (via ISynchronizedObjectInstanceCache.ReadThrough to get that content with short sliding expiration will not change that value, only refresh the last access time, and vice versa)

Fix your Search & Navigation (Find) indexing job, please

Once upon a time, a colleague asked me to look into a customer database with weird spikes in database log usage. (You might start to wonder why I am always the one who looks into weird things, is there a pattern here)

Upon reviewing the query store, I noticed a very high logical reads related to tblScheduledItem. From past experience, it was likely because of fragmentation of indexes in this table (which has only one clustered). I did a quick look at the table, and confirmed the index indeed has high fragmentation. I suggested to do a rebuild of that index and see what happen. Well, it could have been one of the daily simple quick questions, and I almost forgot about it.

A few days passed, the colleague pinged me again. Apparently they rebuilt it but it does not really help. That raised my eye brows a little bit, so I dug deeper.

To my surprised, the problem was not really fragmentation (it definitely contributed). The problem is that the column has a column of type nvarchar(max) , and it’s for recording the last text from the Execute method of the scheduled job. It was meant for something short like “The job failed successfully”, or “Deleted 12345 versions from 1337 contents”. But because it’s nvarchar(max) it could be very, very long. You can, in theory, store the entire book content of a a few very big libraries in there.

Of course because of you can, does not mean you should. When your column is long, each read from the table will be a burden to SQL Server. And the offending job was nothing less than our S&N indexing job.

In theory, any job could have caused this issue, however it’s much more likely happen with S&N indexing job for a reason – it keeps track of every exception thrown during indexing process, and because it indexes each and every content of your website (except the ones you specifically, explicitly tell it not too), the chance of its running into a recurring issue that affects multiple (reads, a lot) of content is much higher than any built-in job.

I asked, this time, to trim the column, and most importantly, fix any exceptions that might be thrown during the index. I was on my day off when my colleague noticed me that the job is running for 10h without errors, as they fixed it. Curious, so I did check some statistic. Well, let those screenshots speak for themselves:

The query itself went from 16,000ms to a mere 2.27ms. Even better, each call to get the list of scheduled job before results in 3.5GB logical reads. Now? 100KB. A lot of resource saved!

So, make sure your job is not throwing a lot of errors. And fix your S&N indexing job.

P/S: I do think S&N indexing job should have simpler return result. Maybe “Indexed 100.000 content with 1234 errors”, and the exceptions could have been logged. But that’s debatable. For now, you can do your parts!

Use Find for CSR UI

If you have been using Find, you might be surprised to find that CSR UI uses the SearchProvider internally. This is a bit unfortunate because you likely are using Find, and that creates unnecessary complexity. For starter, you need to configure a SearchProvider, then you need to index the entries, separately from the Find index. If you install EPiServer.CloudPlatform.Commerce, it will setup the DXPLucenceSearchProvider for you, which is basically a wrapper of LuceneSearchProvider to let it work on DXP (i.e. Azure storage). But even with that, you have to index your entries anyway. You can use FindSearchProvider, but that actually just creates another problem – it uses a different index compared to Find, so you double your index count, yet you have still make sure to index your content. Is there a better way – to use the existing Find indexed content?

Yes, there is

Searches for entries in CSR is done by IEntrySearchService which the default implementation uses the configured SearchProvider internally . Fortunately for us, as with most thing in Commerce, we can create our own implementation and inject it. Now that’s with a caveat – IEntrySearchService is marked as BETA remark, so prepare for some breaking changes without prior notice. However it has not changed much since its inception (funny thing, when I checked for its history, I was the one who created it 6 years ago, in 2017. Feeling old now), and if it is changed, it would be quite easy to adapt for such changes.

IEntrySearchService is a simple with just one method:


IEnumerable<int> Search(string keyword, MarketId marketId, Currency currency, string siteId);

It is a bit weird to return an IEnumerable<int> (what was I thinking ? ), but it was likely created as a scaffolding of SearchManager.Search which returns an IEnumerable<int>, and was not updated later. Anyway, an implementation using Find should look like this:

    public class FindEntrySearchService : IEntrySearchService
    {
        private EPiServer.Find.IClient _searchClient;

        public FindEntrySearchService(EPiServer.Find.IClient searchClient) => _searchClient = searchClient;

        public IEnumerable<int> Search(string keyword, MarketId marketId, Currency currency, string siteId)
        {
            return _searchClient.Search<EntryContentBase>()
                 .For(keyword)
                 .Filter(x => x.MatchMarketId(marketId))
                 .Filter(x => x.SiteId().Match(siteId))
                 .Filter(x => FilterPriceAvailableForCurrency<IPricing>(y => y.Prices(), currency))
                 .GetResult()
                 .Select(x => x.ContentLink.ID);
        }

        public FilterExpression<Price> FilterPriceAvailableForCurrency<T>(Expression<Func<T, IEnumerable<Price>>> prices, Currency currency)
        {
            var currencyCode = currency != null ? currency.CurrencyCode : string.Empty;

            return new NestedFilterExpression<T, Price>(prices, price => price.UnitPrice.Currency.CurrencyCode.Match(currencyCode), _searchClient.Conventions);
        }
    }

Note that I am not an expert on Find, especially on NestedFilterExpression, so my FilterPriceAvailableForCurrency might be wrong. Feel free to correct it, the code is not copyrighted and is provided as-is.

As always, you need to register this implementation for IEntrySearchService. You can add it anywhere you like as long as it’s after .AddCommerce.

_services.AddSingleton<IEntrySearchService, FindEntrySearchService>();

A curious case of cookie threading issue – part 2

A while back, I wrote this post A curious case of cookie threading issue – Quan Mai’s blog (vimvq1987.com) as an example of how to use WinDbg to diagnose a race condition (with results in an infinite loop). Naturally that issue needs to be fixed. A starting point to fix a race condition is to put locks whenever you change the data. But it’s not that simple. Too much locking would definitely hurt your application performance, and increase the chance of you running into another nasty problem (namely, deadlock). So the art of locking is “lagom” – Swedish word for “just right”, not too much, not too little.

ReaderWriteLockSlim comes to rescue. You only need to lock when writing, not locking. So this for reading

            readerWriterLockSlim.EnterReadLock();
            try
            {
                return httpRequest.Cookies[CookieKey];
            }
            finally
            {
                readerWriterLockSlim.ExitReadLock();
            }

And this for writing

            readerWriterLockSlim.EnterWriteLock();
            try
            {
                httpRequest.Cookies.Remove(CookieKey);               
            }
            finally
            {
                readerWriterLockSlim.ExitWriteLock();
            }

Would be enough, right?

No. The issue runs deeper than that.

If we go back to the stacktrace of the issue

0:039> !clrstack
OS Thread Id: 0x1e34 (39)
        Child SP               IP Call Site
000000740273cff0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
000000740273d080 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
000000740273d0c0 00007ff9314de98d 

and look closer. Something is not right. If you do not spot the issue – it’s fine, I missed it too. The problem is that we have a HashSet.Remove inside a HttpCookieCollection.Get. Uh oh.

If we look at source code of HttpCookieCollection, inside Get, it calls to EnsureKeyValidated

if (cookie != null) {
                EnsureKeyValidated(name, cookie.Value);
            }

referencesource/HttpCookieCollection.cs at master · microsoft/referencesource · GitHub

which itself calls to

            _keysAwaitingValidation.Remove(key);

referencesource/HttpCookieCollection.cs at master · microsoft/referencesource · GitHub

_keysAwaitingValidation is a HashSet<string> . That explains the Remove we saw. And that explains why ReaderWriterLockSlim is not enough – changes are made within the supposedly read only action as well.

The only valid solution here is to lock both read and write actions on HttpCookieCollection. However, as HttpCookieCollection is per request, so our lock object should only be per request as well (we certainly do not want every thread to be locked when we get cookie on a request).

Moral of the story:

  • Look closer. There might always be something underneath. Close to the truth is still not the truth.
  • Never assume that a Get method is thread-safe. The implementation can do plenty of unexpected things under the hood.
  • You might ask why this happen, as HttpCookieCollection is per request. Turned out that there is a code that use ThreadPool.QueueUserWorkItem to queue tasks which share the same HttpContext object. As we learned, that’s the recipe for disaster. Think twice (or thrice) before sharing an object that is not thread-safe between threads.

Beware of GetContentResult()

If you are using Episerver Search & Navigation (Formerly Find), you are likely using GetContentResult() . While I’m fairly skeptical about it (if you do not need the content explicitly, it’s probably a good idea to use GetResult instead), there are legitimate uses of the APIs. However, it can come with a surprise, and a not very pleasant one.

Your code probably looks like this

SearchClient.Instance.Search<IContent>()
  .For("banana")
  .StaticallyCacheFor(15)
  .GetContentResult();

Looks good, right? You search for content with keyword “banana” and cache the result for 15 minutes.

But is it? Does your search result really get cached for 15 minutes?

As it might surprise you, it doesn’t. GetContentResult() caches the result by default (unlike GetResult() which does not), but it only caches for 1 minutes. Even thought you asked it to cache for 15 minutes.

The right way to do it is to use other overload which takes an int as parameter. That is the cache time in seconds, like this

SearchClient.Instance.Search<IContent>()
  .For("banana")
  .GetContentResult(900);

In upcoming version of Find, the surprise will be fixed, but it’s probably a good idea to specify the cache time out explicitly.

Find.Commerce is not for Commerce Manager

I’ve seen this more than once, and this can be quite tiresome to fix the problem(s) after that. So here the TL;DR: If you are installing Find.Commerce to Commerce Manager, you are doing it wrong.

You’ll probably end up in the error like this

While loading .NET types from "EPiServer.Find.UI" the following error(s) was reported:

     - System.IO.FileNotFoundException: Could not load file or assembly 'EPiServer.Cms.Shell.UI, Version=9.3.8.0, Culture=neutral, PublicKeyToken=8fe83dea738b45b7' or one of its dependencies. The system cannot find the file specified.
File name: 'EPiServer.Cms.Shell.UI, Version=9.3.8.0, Culture=neutral, PublicKeyToken=8fe83dea738b45b7'

=== Pre-bind state information ===
LOG: DisplayName = EPiServer.Cms.Shell.UI, Version=9.3.8.0, Culture=neutral, PublicKeyToken=8fe83dea738b45b7
 (Fully-specified)
LOG: Appbase = file:///C:/EPiServer/FindSearchProvider/backend/
LOG: Initial PrivatePath = C:\EPiServer\FindSearchProvider\backend\bin
Calling assembly : EPiServer.Find.UI, Version=12.0.0.4448, Culture=neutral, PublicKeyToken=8fe83dea738b45b7.

Continue reading “Find.Commerce is not for Commerce Manager”

Find indexing job + HierarchicalCatalogPartialRouter: A note

I ran into this problem recently and while in the end it’s quite simple issue (Everything is simple if we understand it, right?), it costed me quite many hairs in the process – as it involved debugging with 3 solutions – Find.Commerce (where the problem appears), Commerce (where the router does the work), CMS Core (where the routers are handled). It was both fun, and confusing.

The problem as a customer has this code in an initialization module:

            var contentLoader = ServiceLocator.Current.GetInstance<IContentLoader>();
            var referenceConverter = ServiceLocator.Current.GetInstance<ReferenceConverter>();

            var firstCatalog = contentLoader.GetChildren<CatalogContent>(referenceConverter.GetRootLink()).FirstOrDefault();

            var partialRouter = new HierarchicalCatalogPartialRouter(() => SiteDefinition.Current.StartPage, firstCatalog, false);

            routes.RegisterPartialRouter(partialRouter);

Continue reading “Find indexing job + HierarchicalCatalogPartialRouter: A note”