Use Find for CSR UI

If you have been using Find, you might be surprised to find that CSR UI uses the SearchProvider internally. This is a bit unfortunate because you likely are using Find, and that creates unnecessary complexity. For starter, you need to configure a SearchProvider, then you need to index the entries, separately from the Find index. If you install EPiServer.CloudPlatform.Commerce, it will setup the DXPLucenceSearchProvider for you, which is basically a wrapper of LuceneSearchProvider to let it work on DXP (i.e. Azure storage). But even with that, you have to index your entries anyway. You can use FindSearchProvider, but that actually just creates another problem – it uses a different index compared to Find, so you double your index count, yet you have still make sure to index your content. Is there a better way – to use the existing Find indexed content?

Yes, there is

Searches for entries in CSR is done by IEntrySearchService which the default implementation uses the configured SearchProvider internally . Fortunately for us, as with most thing in Commerce, we can create our own implementation and inject it. Now that’s with a caveat – IEntrySearchService is marked as BETA remark, so prepare for some breaking changes without prior notice. However it has not changed much since its inception (funny thing, when I checked for its history, I was the one who created it 6 years ago, in 2017. Feeling old now), and if it is changed, it would be quite easy to adapt for such changes.

IEntrySearchService is a simple with just one method:


IEnumerable<int> Search(string keyword, MarketId marketId, Currency currency, string siteId);

It is a bit weird to return an IEnumerable<int> (what was I thinking ? ), but it was likely created as a scaffolding of SearchManager.Search which returns an IEnumerable<int>, and was not updated later. Anyway, an implementation using Find should look like this:

    public class FindEntrySearchService : IEntrySearchService
    {
        private EPiServer.Find.IClient _searchClient;

        public FindEntrySearchService(EPiServer.Find.IClient searchClient) => _searchClient = searchClient;

        public IEnumerable<int> Search(string keyword, MarketId marketId, Currency currency, string siteId)
        {
            return _searchClient.Search<EntryContentBase>()
                 .For(keyword)
                 .Filter(x => x.MatchMarketId(marketId))
                 .Filter(x => x.SiteId().Match(siteId))
                 .Filter(x => FilterPriceAvailableForCurrency<IPricing>(y => y.Prices(), currency))
                 .GetResult()
                 .Select(x => x.ContentLink.ID);
        }

        public FilterExpression<Price> FilterPriceAvailableForCurrency<T>(Expression<Func<T, IEnumerable<Price>>> prices, Currency currency)
        {
            var currencyCode = currency != null ? currency.CurrencyCode : string.Empty;

            return new NestedFilterExpression<T, Price>(prices, price => price.UnitPrice.Currency.CurrencyCode.Match(currencyCode), _searchClient.Conventions);
        }
    }

Note that I am not an expert on Find, especially on NestedFilterExpression, so my FilterPriceAvailableForCurrency might be wrong. Feel free to correct it, the code is not copyrighted and is provided as-is.

As always, you need to register this implementation for IEntrySearchService. You can add it anywhere you like as long as it’s after .AddCommerce.

_services.AddSingleton<IEntrySearchService, FindEntrySearchService>();

Optimizing an interesting query

It’s not a secret, I love optimizing things. In a sense, I am both an Optimizer (literally) and an optimizer. And today we will be back to basic – optimizing a tricky SQL query.

The query in question is this particular stored procedure ecf_CatalogNode_GetAllChildNodes, this is used to get all children nodes of specific nodes. It is used in between to find all entries that are direct, or indirect children of specific nodes. Why, you might ask, because when you change the url segment of the node, you want to make sure that all entries that are under that node, will have their indexed object refreshed.

Let’s take a look at this stored procedure, this is how it looks like

CREATE PROCEDURE [dbo].[ecf_CatalogNode_GetAllChildNodes]
    @catalogNodeIds udttCatalogNodeList readonly
AS
BEGIN
    WITH all_node_relations AS 
    (
        SELECT ParentNodeId, CatalogNodeId AS ChildNodeId FROM CatalogNode
        WHERE ParentNodeId > 0
        UNION
        SELECT ParentNodeId, ChildNodeId FROM CatalogNodeRelation
    ),
    hierarchy AS
    (
        SELECT 
            n.CatalogNodeId,
            '|' + CAST(n.CatalogNodeId AS nvarchar(4000)) + '|' AS CyclePrevention
        FROM @catalogNodeIds n
        UNION ALL
        SELECT
            children.ChildNodeId AS CatalogNodeId,
            parent.CyclePrevention + CAST(children.ChildNodeId AS nvarchar(4000)) + '|' AS CyclePrevention
        FROM hierarchy parent
        JOIN all_node_relations children ON parent.CatalogNodeId = children.ParentNodeId
        WHERE CHARINDEX('|' + CAST(children.ChildNodeId AS nvarchar(4000)) + '|', parent.CyclePrevention) = 0
    )
    SELECT CatalogNodeId FROM hierarchy
END

I previously wrote about the relations between entities in Commerce catalog, here Commerce relation(ship), a story – Quan Mai’s blog (vimvq1987.com) , so relations between nodes can be a bit complicated – a node can have one true parent defined in CatalogNode table, and then other “linked” nodes in CatalogNodeRelation . So to find all children – and grand children of a node, you need to get from both.

Getting children of a node from CatalogNode or CatalogNodeRelation is simple, but things become more complicated when you have to get grandchildren, then great-grandchildren, and so on, and so forth. with that, CTE needs to be used in a recursive way. But then there is a problem arises – there is a chance, small, but still, that the data was added in a correct way, so circular reference is possible. i.e. A is a parent of B, which is a parent of C, and itself is a parent of A. To stop the SP from running forever, a check needs to be added to make sure any circular reference is cut short.

This brings back memory as the first ever support case I worked on at Optimizely (then Episerver) was with a circular reference. The site would crash whenever someone visited the catalog management in Commerce Manager. That was around June, 2012 (feeling old now?). My “boss” at that time involuntarily volunteered me for the case. See what you made me do, boss.

Now you can grasp the basic of what the SP does – let’s get back to the original problem. it’s slow to run especially with big catalog and complex node structure. As always, to optimize everything you need to find the bottleneck – time to fire up SQL Server Management Studio and turn on the Actual Execution Plan

I decided to go with 66, the “root” catalog node. this query yield around 18k rows

declare @Nodes udttCatalogNodeList 

insert into @Nodes (CatalogNodeId) select 66

exec ecf_CatalogNode_GetAllChildNodes @Nodes

and also 18s of execution.

Mind you, this is on my machine with pretty powerful CPU (AMD Ryzen 7 5800x, 8 cores 16 threads), and a very fast nvme PCIe SSD (Western Digital Black SN850 2TB). If this was executed on Azure Sql database for example, a timeout is almost certainly guaranteed. So time of execution should only be compared relatively with each other.

If we look at the execution plan, it is quite obvious where the bottleneck is. A scan on CatalogNode table is heavy (it read 79M rows on that operation). As suggest by Anders from Timeout when deleting CatalogNodes from a large catalog (optimizely.com), adding a non clustered index on ParentNodeId column would improve it quite a lot. And indeed it does. The execution time is reduced to 5 second.

And the number of rows read on CatalogNode reduced to just 17k

This is of course a very nice improvement. But the customer reported that it is not enough and the SP is still giving timeout, i.e. further optimization is needed.

Naturally, the next step would be to see if we can skip the circular check. It was added as a safe measure to avoid bad data. It should not be there, as the check should be performed at data modification. But it is there for historical reasons and we can’t just change it, not trivially. So let’s try it for our curiousity.

The modified query looks like this (basically just commented out any code related to the CyclePrevention

ALTER PROCEDURE [dbo].[ecf_CatalogNode_GetAllChildNodes]
    @catalogNodeIds udttCatalogNodeList readonly
AS
BEGIN
    WITH all_node_relations AS 
    (
        SELECT ParentNodeId, CatalogNodeId AS ChildNodeId FROM CatalogNode
        WHERE ParentNodeId > 0
        UNION
        SELECT ParentNodeId, ChildNodeId FROM CatalogNodeRelation
    ),
    hierarchy AS
    (
        SELECT 
            n.CatalogNodeId
			--, '|' + CAST(n.CatalogNodeId AS nvarchar(4000)) + '|' AS CyclePrevention
        FROM @catalogNodeIds n
        UNION ALL
        SELECT
            children.ChildNodeId AS CatalogNodeId
			--, parent.CyclePrevention + CAST(children.ChildNodeId AS nvarchar(4000)) + '|' AS CyclePrevention
        FROM hierarchy parent
        JOIN all_node_relations children ON parent.CatalogNodeId = children.ParentNodeId
        --WHERE CHARINDEX('|' + CAST(children.ChildNodeId AS nvarchar(4000)) + '|', parent.CyclePrevention) = 0
    )
    SELECT CatalogNodeId FROM hierarchy
END

And the improvement is quite impressive (more than I expected), the query completes almost instantly (less than 1s). The read on CatalogNodeRelation significantly reduced

A word of warning here, execution plan can’t be simply compared as-is. If I run two versions side by side, it gives quite misleading comparison

Even though the top one (without the circular reference check) is much faster than the original (the bottom one), SQL Server estimates that the first is slower (almost 2x slower than the second). So execution plan should be used to see what has been done and what is likely the bottleneck inside a query, it should not be used as comparison between queries. In most cases, comparing statistics using set statistics io on is the best way to compare.

If not for the fact that we are changing the behavior of the stored procedure, I would be happy with this approach. The chance of running into circular reference is small, but it is not zero. As we said, we can in theory gating the relation during insert/updating, but that would be too big a change to start with. This is one of constraint as we work at framework level – we have to step carefully to not break anything. A breaking change is bad, but a data corruption is simply unacceptable. I spent a few hours (probably more than I should) trying to optimize the circular reference check, but no better solution is found.

The next approach would be – as we can guess, to make sure that we get rid of the Clustered Index Scan happened on the CatalogNodeRelation table. The solution would be quite simple, a non clustered index on the `ParentNodeId should be enough.

Great success. The performance is comparable with the “non circular reference check” approach.

As adding an index is a non breaking change (and albeit in some cases it can cause performance regression, like in A curious case of SQL execution plan – Quan Mai’s blog (vimvq1987.com) , but it is rare, also, in this case the cardinality of the ParentNodeId is most likely quite well distributed).

That is all for today. Hopefully you learn one thing or two about optimizing queries in your daily works.

Delete orphaned assets

I was asked this question: we have about 3TB of assets, any way to clean it up.

These days, storage is cheap, but still not free. and big storage means you need space for back up. and with that, bandwidth and time.

Is there away to clean up things you no longer need?

Yes!

Optimizely Content already has a scheduled job named Remove Abandoned BLOBs, but this job only removes the blobs that have no content associated. I.e. the content is deleted by IContentRepository.Delete but the blob was left behind. The job uses the log to find out which content were deleted, then find those blobs.

How’s about the assets that still have contents associated with them, but not used anywhere? Time to get your hands dirty!

Due to the nature of this task, it is best to make it a scheduled job.

All of the assets are children under the global asset root. By iterating over them, we can check if each of them is being used by another content. If not, we will add them to a list for later delete. Before deleting the content, we will find the blob and then delete it as well. Easy, right?

To get the content recursively we use this little piece of code

        public virtual IEnumerable<T> GetAssetRecursive<T>(ContentReference parentLink, CultureInfo defaultCulture) where T : MediaData
        {
            foreach (var folder in LoadChildrenBatched<ContentFolder>(parentLink, defaultCulture))
            {
                foreach (var entry in GetAssetRecursive<T>(folder.ContentLink, defaultCulture))
                {
                    yield return entry;
                }
            }

            foreach (var entry in LoadChildrenBatched<T>(parentLink, defaultCulture))
            {
                yield return entry;
            }
        }

        private IEnumerable<T> LoadChildrenBatched<T>(ContentReference parentLink, CultureInfo defaultCulture) where T : IContent
        {
            var start = 0;

            while (!_isStopped)
            {
                var batch = _contentRepository.GetChildren<T>(parentLink, defaultCulture, start, 50);
                if (!batch.Any())
                {
                    yield break;
                }
                foreach (var content in batch)
                {
                    // Don't include linked products to avoid including them multiple times when traversing the catalog
                    if (!parentLink.CompareToIgnoreWorkID(content.ParentLink))
                    {
                        continue;
                    }

                    yield return content;
                }
                start += 50;
            }
        }

And we will start from SiteDefinition.Current.GlobalAssetsRoot, and use IContentRepository.GetReferencesToContent to see if it is used in any content (both CMS and Catalog). If not, we add it to a list. Later, we use IPermanentLinkMapper to see if it has any blob associated, and delete that as well

            foreach (var asset in GetAssetRecursive<MediaData>(SiteDefinition.Current.GlobalAssetsRoot, CultureInfo.InvariantCulture))
            {
                totalAsset++;
                if (!_contentRepository.GetReferencesToContent(asset.ContentLink, false).Any())
                {
                    toDelete.Add(asset.ContentLink.ToReferenceWithoutVersion());
                }

                if (toDelete.Count % 50 == 0)
                {
                    var maps = _permanentLinkMapper.Find(toDelete);
                    foreach (var map in maps)
                    {
                        deletedAsset++;
                        _contentRepository.Delete(map.ContentReference, true, EPiServer.Security.AccessLevel.NoAccess);
                        var container = Blob.GetContainerIdentifier(map.Guid);
                        //Probably redundency, can just delete directly
                        var blob = _blobFactory.GetBlob(container);
                        if (blob != null)
                        {
                            _blobFactory.Delete(container);
                        }
                        OnStatusChanged($"Deleting asset with id {map.ContentReference}");
                    }
                    toDelete.Clear();
                }
            }

We need another round of delete after the while loop to clean up the left over (or if we have less than 50 abandoned assets)

And we’re done!

Testing this job is simple – uploading a few assets to your cms and do not use it anywhere, then run the job. it should delete those assets.

Things to improve: we might want to make sure only assets that created more than a certain number of days ago are deleted. This allows editors to upload assets for later uses without having to use them immediately.

The code has been open sourced at vimvq1987/DeleteUnusedAssets: Delete unused assets from an Optimizely/Episerver site (github.com) , and I have uploaded a nuget package to Packages (optimizely.com) to be reviewed.

Skip validating carts

Carts are meant to be validated. Prices changed, customers add more quantity than allowed, promotions expired, stock ran out, etc.. All kinds of stuffs that make the items in carts need validation to make sure they up to date, and be ready to be converted to an order.

However, there are cases when you don’t want your carts to be validated. The most common case is of course, wish list – a special cart that allows customer to add items to, just to keep track of. You certainly don’t want to touch it. Another example is quote – when you give a specific item at specific price for a customer, and you don’t want it to be automatically changed to the public prices, which is different from that said price.

By default, when it is called to validate a cart, these things will be done:

  • Remove items that no longer available (either deleted or end of line)
  • Update prices of the items to the latest applicable prices, or remove items that have no prices.
  • Update quantity of the items (to comply with the settings or in stock quantity), or remove items that are out of stock
  • Apply promotions
  • Update quantity again (As promotions could do things like adding free items to the cart)

There are two ways you can avoid carts being validated, let’s see what we can do.

The “Wish list names” route

With OrderOptions you can set certain wishlist names to be exempted from the validation, using WishListCartNames. By default, it’s only “Wishlist”, but you can set several using the comma separator, like this

orderOptions.WishListCartNames = "Wishlist,Quote";

However there is a caveat, with this approach, carts in those names will not be shown in the Order Management (If you want, you can change that, however it is not an easy or quick one)

The OrderValidationService route

The validation of carts (or rather, order types in general) is done by OrderValidationService. And that class is meant to be extended if necessary. Here is how you would avoid validation carts with name “Quote”, using OrderValidationService

    public class CustomOrderValidationService : OrderValidationService
    {
        public CustomOrderValidationService(ILineItemValidator lineItemValidator, IPlacedPriceProcessor placedPriceProcessor, IPromotionEngine promotionEngine, IInventoryProcessor inventoryProcessor, OrderOptions orderOptions) : base(lineItemValidator, placedPriceProcessor, promotionEngine, inventoryProcessor, orderOptions)
        {
        }

        public override IDictionary<ILineItem, IList<ValidationIssue>> ValidateOrder(IOrderGroup orderGroup)
        {
            if (orderGroup.Name.Equals("Quote", System.StringComparison.OrdinalIgnoreCase))
            {
                return new Dictionary<ILineItem, IList<ValidationIssue>>();
            }
            return base.ValidateOrder(orderGroup);
        }
    }

And as OrderValidationService is registered by ServiceConfiguration attribute, you can register yours by

services.AddTransient<OrderValidationService, CustomOrderValidationService>();

One caveat though, making changes to OrderValidationService means those changes will apply to the entire website, so make sure the changes are actually the ones you want site-wide, not just in specific places.

Index only Catalog content

If you are using Find to index your content, you likely have used the Find Indexing job – which would index everything in one go. Today I stumped upon this question – A way to run indexing job for Commerce only | Optimizely Develope – and it is a good one – if you have many of content in CMS side, and they don’t change that often, if at all – you certain don’t want to waste time and resource in trying to reindex them again. Is there away to just index catalog content?

Yes, there is. It is a bit hacky solution, but it can certain work. But first, let’s dive in on how Find indexing job does it work. It relies on IIndexingJobService , which itself relies on ContentIndexer to do the job. In its turn, ContentIndexer uses a list of IReindexInformation to know which content to index, and in which languages. Here’s what it looks like

    public interface IReindexInformation
    {
        /// <summary>
        /// Content links to be reindexed.
        /// </summary>
        IEnumerable<ReindexTarget> ReindexTargets { get; }

        /// <summary>
        /// Gets the root to index.
        /// </summary>
        ContentReference Root { get; }
    }

It has one Root, and multiple ReindexTarget, which contains

    public class ReindexTarget
    {
        /// <summary>
        /// The content references.
        /// </summary>
        public IEnumerable<ContentReference> ContentLinks { get; set; }

        /// <summary>
        /// The languages the collection of <see cref="ContentReference"/> are enabled on.
        /// </summary>
        public IEnumerable<CultureInfo> Languages { get; set; }

        /// <summary>
        /// The site that the collection of <see cref="ContentReference"/> appears on
        /// or <c>null</c> if unknown.
        /// </summary>
        public SiteDefinition SiteDefinition { get; set; }
    }

As you might have guessed, Commerce has its own IReindexInformation to index catalog content. If we can only use that to run our job. This is how our “hack” begins

The interface IContentIndexer has no method to control the IReindexInformation`, but the default implementation ContentIndexer does. We set it to the only one we need, so here it is

        List<IReindexInformation> targets;
        var contentIndexer = _contentIndexer as ContentIndexer;
        if (contentIndexer != null)
        {
            targets = contentIndexer.ReindexInformation.ToList();
            var commerceReIndexInformation = targets.FirstOrDefault(x => x.GetType() == typeof(CommerceReIndexInformation));
            contentIndexer.ReindexInformation = new List<IReindexInformation>() { commerceReIndexInformation };
            _indexingJobService.Start(OnStatusChanged);

            contentIndexer.ReindexInformation = targets;
        }

A note is that you will still see the “Indexing Global assets and other data” message, because IIndexingJobService implementation will go through all SiteDefinition regardless and show that message, but the internal ContentIndexer will skip if the SiteDefinition passed to it does not match the SiteDefinition in the IReindexInformation (and for CommerceReIndexInformation it’s SiteDefinition.Empty

As I mentioned in the beginning, this is a bit hacky solution, as you have to cast IContentIndexer to its concrete implementation. The proper solution would be implement IContentIndexer yourself. Given that’s not a trivial job, I’ll leave at that.

Loading the contacts/organizations, the right way

If you have been using Business Foundation, you most likely know about a limitation – you can only load the first 1000 objects using the GetXXX methods. For example, by using CustomerContext.Current.GetOrganizations(), you can load the first 1000 organizations. In theory, you can get more objects by changing the value of MaxObjectsList. However, changing that has consequences. Changing that will affect all types of objects, including contacts, organizations, and your custom objects. Also, loading too much in one go is almost never a good idea.

Is there a better way?

Yes, of course – which is why we have this blog post

There is a “hidden” method from base class of Business Foundation – BusinessManager that takes paging parameters

public static EntityObject[] List(string metaClassName, FilterElement[] filters, SortingElement[] sorting, int? start, int? count)

You will need to convert the results to the type you want. Note that all Business Foundation objects are inherited from EntityObject. So if you want to get the contacts by paging, it would look like this:

                var contacts = BusinessManager.List(ContactEntity.ClassName, new FilterElement[0], new SortingElement[] { new SortingElement(sortField, sortType) }, startIndex, recordsToRetrieve)
                .OfType<CustomerContact>();

Let’s go through the parameters one by one.

  • The first you need is the class name of your objects. For contacts, you can use ContactEntity.ClassName as shown above. For organizations, OrganizationEntity.ClassName
  • Next one is the filter. As you are trying to load all objects, you can just pass in an empty (but not null) instance – new FilterElement[0]
  • Third one is how you want to sort it. If you pass an empty array, it will be sort by default. If you want to sort by Name for example, set your sortField to Name and sortType to one of SortingElementType (Asc or Desc)
  • Forth and fifth ones are what we are looking for, they’re simply paging parameters – which position to start getting, and how many objects to get. Combine this with a simple while loop, you can get all of your Business Foundation objects.

And that’s about it, my friends.

What’s about caching?

Caching with list is always tricky – as you have to keep track of each item in the list to make sure you invalidate the list cache if one of the item is changed (updated/removed). For the purpose of just loading all contacts/organizations, it is probably better to just skip caching, for simplicity.

Delete property no longer available in code

Recently I stumped upon this question Removing a property that no longer exists in the code (optimizely.com) . it’s a valid (and even good) question. It is easy to add a new property to your catalog content type – you can simply add a new property to the model, build and start the site. However the opposite is not easy. In Commerce 14 at least.

A property for the strongly typed content type, is actually mapped and backed by a MetaField in MetaDataPlus system (of course unless you specifically tell it not to, by using IgnoreMetaDataPlusSynchronization attribute). When you add a new property to your content type, build and start your site, your content type is scanned and metafields will be created if necessary. However, if you delete a property from your content type, the scanner will just leave the metafield there. There are a few reasons for that. Firstly, it allows loosely typed content type, i.e. content types with none, or only a few property defined. If you have used some kind of external PIM, you’ll understand why it is important. Lastly, because the property can be mapped with a metafield of different name, the scanner might have trouble figuring out which metafield to delete. All in all, keeping the metafields is the sensible (if not the right) choice.

Then what to do if you want to delete the property and also clean up the metafield? With Commerce 13 and earlier, you can detach a MetaField from its MetaClass(s), then delete it using Commerce Manager. With the dead of CM in Commerce 13, what is your option?

By using code, of course. There are a few APIs – namely MetaField and MetaClass that can be used for that purpose. Note that there are two MetaField and MetaClass, and only the ones in Mediachase.MetaDataPlus.Configurator namespace are what we want (the others are for Business Foundation)

Enough for chit chat, this is the code that you would need to run

        private void DeleteMetaField(string metafieldName)
        {
            var metaField = MetaField.Load(CatalogContext.MetaDataContext, metafieldName);
            if (metaField == null)
            {
                return;
            }
            foreach (int metaClassId in metaField.OwnerMetaClassIdList)
            {
                var metaClass = MetaClass.Load(CatalogContext.MetaDataContext, metaClassId);
                if (metaClass == null)
                {
                    metaClass.DeleteField(metafieldName);
                }
            }
            MetaField.Delete(CatalogContext.MetaDataContext, metaField.Id);
        }

It is pretty straightforward. We load the MetaField by its name, if it is not null, then we remove it from all MetaClass that are using it, then eventually delete it.

In beginning of this post we mentioned strongly typed content type, but note that order system also uses the same metaclass/metafield system, so this code can be used for them as well.

This piece of code can be used in an admin-privilege controller to delete metafields on demand. Until Commerce 14 allows you to do it with a proper UI.

A curious case of cookie threading issue – part 2

A while back, I wrote this post A curious case of cookie threading issue – Quan Mai’s blog (vimvq1987.com) as an example of how to use WinDbg to diagnose a race condition (with results in an infinite loop). Naturally that issue needs to be fixed. A starting point to fix a race condition is to put locks whenever you change the data. But it’s not that simple. Too much locking would definitely hurt your application performance, and increase the chance of you running into another nasty problem (namely, deadlock). So the art of locking is “lagom” – Swedish word for “just right”, not too much, not too little.

ReaderWriteLockSlim comes to rescue. You only need to lock when writing, not locking. So this for reading

            readerWriterLockSlim.EnterReadLock();
            try
            {
                return httpRequest.Cookies[CookieKey];
            }
            finally
            {
                readerWriterLockSlim.ExitReadLock();
            }

And this for writing

            readerWriterLockSlim.EnterWriteLock();
            try
            {
                httpRequest.Cookies.Remove(CookieKey);               
            }
            finally
            {
                readerWriterLockSlim.ExitWriteLock();
            }

Would be enough, right?

No. The issue runs deeper than that.

If we go back to the stacktrace of the issue

0:039> !clrstack
OS Thread Id: 0x1e34 (39)
        Child SP               IP Call Site
000000740273cff0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
000000740273d080 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
000000740273d0c0 00007ff9314de98d 

and look closer. Something is not right. If you do not spot the issue – it’s fine, I missed it too. The problem is that we have a HashSet.Remove inside a HttpCookieCollection.Get. Uh oh.

If we look at source code of HttpCookieCollection, inside Get, it calls to EnsureKeyValidated

if (cookie != null) {
                EnsureKeyValidated(name, cookie.Value);
            }

referencesource/HttpCookieCollection.cs at master · microsoft/referencesource · GitHub

which itself calls to

            _keysAwaitingValidation.Remove(key);

referencesource/HttpCookieCollection.cs at master · microsoft/referencesource · GitHub

_keysAwaitingValidation is a HashSet<string> . That explains the Remove we saw. And that explains why ReaderWriterLockSlim is not enough – changes are made within the supposedly read only action as well.

The only valid solution here is to lock both read and write actions on HttpCookieCollection. However, as HttpCookieCollection is per request, so our lock object should only be per request as well (we certainly do not want every thread to be locked when we get cookie on a request).

Moral of the story:

  • Look closer. There might always be something underneath. Close to the truth is still not the truth.
  • Never assume that a Get method is thread-safe. The implementation can do plenty of unexpected things under the hood.
  • You might ask why this happen, as HttpCookieCollection is per request. Turned out that there is a code that use ThreadPool.QueueUserWorkItem to queue tasks which share the same HttpContext object. As we learned, that’s the recipe for disaster. Think twice (or thrice) before sharing an object that is not thread-safe between threads.

Delete a content – directly from database

Before we even start, I would reiterate that manipulating data directly should be avoided unless absolutely necessary, it should be used as the last resort, and should be proceeded with cautions – always back up first and test your queries on development database first before running it in production. And if the situation dictates that you have to run the query, better do it with the 4 eyes principle – having a colleague double check it for you. When it comes to production database, nothing is too careful.

Now back to the question, if you absolutely have to delete a content, you should do like this

exec editDeletePage @pageId = 123, @ForceDelete = 1

It is basically what Content Clouds (i.e. CMS) does under the hood, without the cache validation on the application layer of course.

So the moral of the story – do everything with API if you can. If you absolutely have to, use the built-in stored procedures – they are tested vigorously and should have minimal issues/bugs, and should take care of everything, data-wise for you. Only write your own query if there is no SP that can be used.

Update: Initially I mentioned Tomas’ post in this, and that gave impression his way is incorrect. I should have written better. My apologies to Tomas

A curious case of cookie threading issue

Threading is hard. It’s hard to get right. It’s hard to avoid race condition. Even with experienced developers, it’s not always a given (trust me, I’ve been there).

This time, the problem comes from a report that a customer constantly has high CPU situation, on all instances. Memory dumps were taken and I was able to take a look. As always, high CPU can be result of several causes, most likely thread deadlocks. For educational purposes, let’s take this memory dump step by step.

First steps are as with routine Windbg – open it. use Ctrl + D to start debugging a memory dump, then you would need to run .loadby sos clr to load the clr runtime

If you are debugging a memory dump that is captured on Azure, .loadby sos clr will not work with this error (if your Windows is installed to C:\ drive)

0:000> .loadby sos clr
The call to LoadLibrary(D:\Windows\Microsoft.NET\Framework64\v4.0.30319\sos.dll) failed, Win32 error 0n126
    "The specified module could not be found."
Please check your debugger configuration and/or network access.

Simply fix it by copy the path to sos.dll, replacing D with C, and rerun

.load C:\Windows\Microsoft.NET\Framework64\v4.0.30319\sos.dll

That should fix it.

Next step would be checking the CPU situation by .threadpool command

0:000> !threadpool
CPU utilization: 98%
Worker Thread: Total: 160 Running: 6 Idle: 134 MaxLimit: 32767 MinLimit: 140
Work Request in Queue: 0
--------------------------------------
Number of Timers: 2
--------------------------------------
Completion Port Thread:Total: 6 Free: 4 MaxFree: 8 CurrentLimit: 6 MaxLimit: 1000 MinLimit: 4

This confirms a high CPU situation. It’s however worth noting this CPU is for the entire instance, not necessarily for the process (w3wp) only, but it’s highly likely for an Azure App Service instance that the w3wp is the problem here.

The next step would be checking if there is any long running thread

0:000> !runaway
 User Mode Time
  Thread       Time
   39:1e34     2 days 11:44:59.546
   40:219c     2 days 8:08:56.765
   41:2198     1 days 4:53:46.687
   17:1480     0 days 0:19:38.171
   20:1334     0 days 0:14:57.718
   18:1798     0 days 0:12:53.625
   19:12f8     0 days 0:12:05.015
   31:a9c      0 days 0:03:39.093
   27:1494     0 days 0:01:35.406
   36:1e14     0 days 0:01:28.890
    5:1124     0 days 0:00:38.578
    6:1560     0 days 0:00:37.218
    3:ff8      0 days 0:00:34.984

Normally, the first few threads in the list are not that interesting, they are usually just timers that run from the application start up. But let’s leave no stone unturned, shall we? Let’s switch to the longest running thread

0:000> ~39s
00007ff9`297d9f59 458bf7          mov     r14d,r15d

Hmm, it does not look like a listener. Let’s see what it actually is with !clrstack

0:039> !clrstack
OS Thread Id: 0x1e34 (39)
        Child SP               IP Call Site
000000740273cff0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
000000740273d080 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
000000740273d0c0 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
000000740273d100 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
000000740273d140 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()
000000740273d1f0 00007ff93153b54a System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib]].Start[[Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29, Abck.Web]](d__29 ByRef)
000000740273d2a0 00007ff93153b492 Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder.GetProductPriceViewModel(System.String, Abck.Web.Features.User.UserContext)

That looks interesting. If we move to the second longest running thread, it looks oddly familiar

0:040> !clrstack
OS Thread Id: 0x219c (40)
        Child SP               IP Call Site
00000074be23d060 00007ff926c82f2e System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Contains(System.__Canon)
00000074be23d0d0 00007ff9307aebcd System.Web.HttpCookieCollection.EnsureKeyValidated(System.String, System.String)
00000074be23d110 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
00000074be23d150 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
00000074be23d190 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
00000074be23d1d0 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()

And the third longest running thread

0:041> !clrstack
OS Thread Id: 0x2198 (41)
        Child SP               IP Call Site
00000078597bd8d0 00007ff9297d9f59 System.Collections.Generic.HashSet`1[[System.__Canon, mscorlib]].Remove(System.__Canon)
00000078597bd960 00007ff92a8eb4e3 System.Web.HttpCookieCollection.Get(System.String)
00000078597bd9a0 00007ff9314de98d Abck.Web.Features.User.UserContext.get_CountryCurrency()
00000078597bd9e0 00007ff93154367e Abck.Web.Features.Cart.Services.CountryCurrencyService.GetCountrySettingThreeAlpha(Abck.Web.Features.User.UserContext)
00000078597bda20 00007ff93153b6dd Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29.MoveNext()
00000078597bdad0 00007ff93153b54a System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib]].Start[[Abck.Web.Features.Product.ViewModels.Builders.CoreProductViewModelBuilder+d__29, Abck.Web]](d__29 ByRef)

Any guess on what happened?

It looks like what we have a clear case of infinite loop. There were multiple threads trying to get the same cookie, and under the neath, the HttpCookieCollection.Get is not threadsafe, it tries to access an underlying HashSet<T> without properly locks. While one thread tried to remove items from the hashset, another tried to read it, and messed up the internal hash table. All three threads are then doomed in infinite loops, keep running and use up all CPU resources.

The fix in this case would be as simple as a lock around the code to get the CountryCurrency, but it’s best to be careful. Again, when fixing the problem of lack of lock, it’s easy to run into the problem of too much of lock.