Performance optimization – the hardcore series – part 4

Let’s take a break from the memory allocation, and do some optimization on another aspect, yet as important (if not even more important) – database.

We all know that database queries play an essential part in any serious app. It’s almost a given that if you want your app to perform well, your database queries must also perform well. And for them to perform well, you need things like proper design (normalization, references etc.), properly written queries, and proper indexes. In this post, we will explore how an index can improve query performance, and how can we do it better.

Let’s start with a fairly simple table design

CREATE TABLE [dbo].[UniqueCoupon](
	[Id] [int] identity primary key clustered, 
	[PromotionId] [int] NOT NULL,
	[Code] [nvarchar](10) NOT NULL,
	[ExpiredOn] [datetime] NULL,
	[Redeemed] [bit] NULL
) ON [PRIMARY]

Nothing extraordinary here, pretty common if you ask me. Now for testing purpose, let’s insert 1.000.000 rows into it

INSERT INTO  dbo.[UniqueCoupon] (PromotionId, Code)
SELECT

FLOOR(RAND()*(100)+1),
SUBSTRING(CONVERT(varchar(255), NEWID()), 0, 7)

GO 1000000

We need to query data by the code, so let’s create an user defined type

CREATE TYPE CouponTable AS TABLE (
    Code NVARCHAR(10));

Time to run some query against data, let’s go with this

SELECT Id, PromotionId, Code, ExpiredOn, Redeemed FROM dbo.UniqueCoupons
                                                                    WHERE PromotionId = @PromotionId AND Code in (SELECT Code FROM @Data)

This is the complete query as we need some data

	declare @data CouponTable
	insert into @data 
	select top 10 code from dbo.UniqueCoupon 
	where promotionid = 36

	SELECT Id, PromotionId, Code, ExpiredOn, Redeemed FROM dbo.UniqueCoupon
                                                                    WHERE PromotionId = 36 AND Code in (SELECT Code FROM @Data)

As we learned that execution plan is not a good way to compare performance, let’s use the statistics, our trusted friends

																	set statistics io on
																	set statistics time on

And this is how it takes with our default setting (i.e. no index)

(10 rows affected)
Table '#AEDEED61'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'UniqueCoupon'. Scan count 9, logical reads 7070, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

If you are somewhat experienced with SQL Server, you might guess it would not be exactly happy because of, obviously an index is needed. As we query on PromotionId, it does makes sense to add an index for it, SQL Server does give you that

If we just blindly add the index suggested by SQL Server

(10 rows affected)
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'UniqueCoupon'. Scan count 1, logical reads 53, physical reads 0, page server reads 0, read-ahead reads 5, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#A7AA9B2B'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

But can we do better?

If we look at the index, there’s something not very optimized about it – we are query by both PromotionId and Code, so not really makes senses to have Code as included. How’s about we have the index on both PromotionId and Code?

(10 rows affected)
Table 'UniqueCoupon'. Scan count 10, logical reads 30, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#A1F9F38F'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

Yet we can make it better! From 53 to 30 logical reads might not sound a lot, but if you have thousands of queries every hour, it will be fairly significant.

Prepare yourself for some pleasant surprises – when we eventually applied the change on an actual database, the change was staggering, much more than what we hoped for. The query that were run for 24h in total, every day, now takes less than 10 minutes (yes you read it right, 10 minutes).

At this point you can certainly be happy and move on. But can we do better? For the sake of curiosity ? Yes we do.

SQL Server is rather smart that it knows we are getting the other columns as well, so those will be included in the index, to avoid a key lookup. Let’s see if we can remove that and see how it performs

(10 rows affected)
Table 'UniqueCoupon'. Scan count 10, logical reads 60, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#B1996E94'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

So it was indeed worse, a key lookup is performed for every row (SQL Server uses the index to track down the rows and read the other columns from there)

There are two way to get rid of those key lookup – includes the columns in the index itself, or, more dramatic, make the index the clustered. As we can see the data should be accessed by PromotionId and Code, it makes perfect senses.

It is a commonly belief that Identity column should be clustered index – it is unique, it is not null. However, it only makes senses if it is the most heavily accessed column. In this case, Id only serves as an Identity column, it does not need to be the clustered index (although being an unique means it will has a non clustered index for it)

ALTER TABLE [dbo].[UniqueCoupon] DROP CONSTRAINT [PK__UniqueCo__3214EC0744C2FF38] WITH ( ONLINE = OFF )
GO

ALTER TABLE [dbo].[UniqueCoupon] ADD PRIMARY KEY NONCLUSTERED 
(
	[Id] ASC
)

Does this bring dramatically performance change? Unlikely. My test show no improvement in statistic. However, there is one critical impact here: we significantly reduced the size of indexes in the table. (data incoming)

Moral of the story

  • Indexes are crucial.
  • You can almost always do better than the auto suggested indexes.
  • Real test is the only true validation.

Performance optimization – the hardcore series – part 3

“In 99% of the cases, premature optimization is the root of all devil”

This quote is usually said to be from Donald Knuth, usually regarded as “father of the analysis of algorithms”. His actual quote is a bit difference

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%.

If you have read my posts, you know that I always ask for measuring your application before diving in optimization. But that’s not all of the story. Without profiling, your optimization effort might be futile. But there are things you can “optimize” right away without any profiling – because – they are easy to do, they make your code simpler, easier to follow, and you can be certain they are faster.

Let’s see if you can spot the potential problematic piece of code from this snippet

public Something GetData()
{
var market = list.FirstOrDefault(x => x.MarketId == GetCurrentMarket().MarketId)
{
//do some stuffs
}

}

If you are writing similar code, don’t be discouraged. It’s easy to overlook the problem – when you call FirstOrDefault, you actually iterate over the list until you find the first matching element. And for each and every of that, GetCurrentMarket() will be called.

Because we can’t be sure when we will find the matching element, it might be the first element, or the last, or it does not exist, or anywhere in between. The median is that GetCurrentMarket will be half called half the size of list

We don’t know if GetCurrentMarket is a very lightweight implementation, or list is a very small set, but we know that if this is in one very hot path, the cost can be (very) significant. These are the allocations made by said GetCurrentMarket

This is a custom implementation of IMarketService – the default implementation is much more lightweight and should not be of concern. Of course, fewer calls are always better – no matter how quick something is.

In this specific example, a simple call to get the current market and store it in a local variable to be used in the scope of the entire method should be enough. You don’t need profiling to make such “optimization” (and as we proved, profiling only confirm our suspect )

Moral of the story

  • For optimization, less is almost always, more
  • You definitely should profile before spending any considerable amount optimizing your code. But there are things that can be optimized automatically. Make them your habit.

Performance optimization – the hardcore series – part 2

Earlier we started a new series about performance optimization, here Performance optimization – the hardcore series – part 1 – Quan Mai’s blog (vimvq1987.com) . There are ton of places where things can go wrong. A seasoned developer can, from experience, avoid some obvious performance errors. But as we will soon learn, a small thing can make a huge impact if it is called repeatedly, and a big thing might be OK to use as long as it is called once.

Let’s take this example – how would you think about this snippet – CategoryIds is a list of string converted from ContentReference

            if (CategoryIds.Any(x => new ContentReference(x).ToReferenceWithoutVersion() == contentLink))
            {
                //do stuff
            }

If this is in any “cool” path that run a few hundred times a day, you will be fine. It’s not “elegant”, but it works, and maybe you can get away with it. However, if it is in a hot path that is executed every time a visitor visits a product page in your website, it can create a huge problem.

And can you guess what it is?

new ContentReference(string) is fairly lightweight, but if it is called a lot, this is what happen. This is allocations from the constructor alone, and only within 220 seconds of the trace

A lot of allocations which should have been avoided if CategoryIds was just an IEnumerable<ContentReference> instead of IEnumerable<string>

For comparison, this is how 10.000 and 1000.000 new ContentReference would allocate

Thing is similar if you use .ToReferenceWithoutVersion() to compare to another ContentReference (although to a lesser extend as ToReferenceWithoutVersion would return the same ContentReference if the WorkId is 0, and it use cloning instead of new). The correct way to compare two instances of ContentReference without caring about versions, is to use .Compare with ContentReferenceComparer.IgnoreVersion

Moral of the story

  • It is not only what you do, but also how you do it
  • Small things can make big impacts, don’t guess, measure!

Performance optimization – the hardcore series – part 1

Hi again every body. New day – new thing to write about. today we will talk about memory allocation, and effect it has on your website performance. With .NET, memory allocations are usually overlooked because CLR handles that for you. Except in rare cases that you need to handle unmanaged resources, that you have to be conscious about releasing that said resources yourself, it’s usually fire and forget approach.

Truth is, it is more complicated than that. The more objects you created, the more memory you need, and the more time CLR needs to clean it up after you. When you might have written code that is executed blazing fast in your benchmarks, in reality, your website might still struggle to perform well in long run – and that’s because of Garbage collection. Occasional GC is not of a concern – because it’s nature of .NET CLR, but frequent GC, especially Gen 2 GC, is definitely something you should look into and should fix, because it natively affects your website performance.

The follow up question – how do you fix that.

Of course, the first step is always measuring the memory allocations of your app. Locally you can use something like Jetbrains dotMemory to profile your website, but that has a big caveat – you can’t really mimic the actual traffic to your website. Sure, it is very helpful to profile something like a scheduled job, but it is less than optimal to see how your website performs in reality. To do that, we need another tool, and I’ve found nothing better than Application Insights Profiler trace on Azure. It will sample your website periodically, taking ETL ( event trace log) files in 220 seconds (Note, depends on your .NET version, you might download a .diagsession or a .netperf.zip file from Application Insights, but they are essentially the same inside (zipped .ETL)). Those files are extremely informative, they contains whole load of information which might be overwhelming if you’re new, but take small steps, you’ll be there.

To open a ETL file, common tool is Perfview (microsoft/perfview: PerfView is a CPU and memory performance-analysis tool (github.com)). Yes it has certain 2000 look like other analysis tool (remember Windbg), but it is fast, efficient, and gets the job done

Note that once extracted ETL can be very big – in 1GB or more range often. Perfview has to go through all that event log so it’s extremely memory hungry as well, especially if you open multiple ETL files at once. My perfview kept crashing when I had a 16GB RAM machine (I had several Visual Studio instances open), and that was solved when I switched to 32GB RAM

The first step is to confirm the allocation problems with GCStats (this is one of the extreme ones, but it does happen)

Two main things to look into – Total Allocs, i.e. the total size of objects allocated, and then the time spent in Garbage collection. They are naturally closely related, but not always. Total allocation might not be high but time for GC might be – in case of large objects allocation (we will talk about it in a later post). Then for the purpose of memory allocation analysis, this is where you should look at

What you find in there, might surprise you. And that’s the purpose of this series, point out possible unexpected allocations that are easy – or fairly easy – to fix.

In this first post, we will talk about a somewhat popular feature – Injected<T>.

We all know that in Optimizely Content/Commerce, the preferred way of dependency injection is constructor injection. I.e. if your class has a dependency on a certain type, that dependency should be declared as a parameter of the constructor. That’s nice and all, but not always possible. For example you might have a static class (used for extension methods) so no constructor is available. Or in some rare cases, that you can’t added a new parameter to the constructor because it is a breaking change.

Adding Injected<T> as a hidden dependency in your class is at least working, so can you forget about it?

Not quite!

This is how the uses of Injected<T> result in allocation of Structuremap objects – yes every time you call Injected<T>.Service the whole dependency tree must be built again.

And that’s not everything, during that process, other objects need to be created as well. You can right click on a path and select “Include item”. The allocations below are for anything that were created by `module episerver.framework episerver.framework!EPiServer.ServiceLocation.Injected1[System.__Canon].get_Service() i.e. all object allocations, related to Injected<T>

You can expand further to see what Injected<T>(s) have the most allocations, and therefore, are the ones should be fixed.

How can one fix a Injected<T> then? The best fix is to make it constructor dependency, but that might not always be possible. Alternative fix is to use ServiceLocator.GetInstance, but to make that variable static if possible. That way you won’t have to call Injected<T>.Service every time you need the instance.

There are cases that you indeed need a new instance every time, then the fix might be more complicated, and you might want to check if you need the whole dependency tree, or just a data object.

Moral of the story

  • Performance can’t be guessed, it must be measured
  • Injected<T> is not your good friend. You can use it if you have no other choice, but definitely avoid it in hot paths.

Optimizing an interesting query

It’s not a secret, I love optimizing things. In a sense, I am both an Optimizer (literally) and an optimizer. And today we will be back to basic – optimizing a tricky SQL query.

The query in question is this particular stored procedure ecf_CatalogNode_GetAllChildNodes, this is used to get all children nodes of specific nodes. It is used in between to find all entries that are direct, or indirect children of specific nodes. Why, you might ask, because when you change the url segment of the node, you want to make sure that all entries that are under that node, will have their indexed object refreshed.

Let’s take a look at this stored procedure, this is how it looks like

CREATE PROCEDURE [dbo].[ecf_CatalogNode_GetAllChildNodes]
    @catalogNodeIds udttCatalogNodeList readonly
AS
BEGIN
    WITH all_node_relations AS 
    (
        SELECT ParentNodeId, CatalogNodeId AS ChildNodeId FROM CatalogNode
        WHERE ParentNodeId > 0
        UNION
        SELECT ParentNodeId, ChildNodeId FROM CatalogNodeRelation
    ),
    hierarchy AS
    (
        SELECT 
            n.CatalogNodeId,
            '|' + CAST(n.CatalogNodeId AS nvarchar(4000)) + '|' AS CyclePrevention
        FROM @catalogNodeIds n
        UNION ALL
        SELECT
            children.ChildNodeId AS CatalogNodeId,
            parent.CyclePrevention + CAST(children.ChildNodeId AS nvarchar(4000)) + '|' AS CyclePrevention
        FROM hierarchy parent
        JOIN all_node_relations children ON parent.CatalogNodeId = children.ParentNodeId
        WHERE CHARINDEX('|' + CAST(children.ChildNodeId AS nvarchar(4000)) + '|', parent.CyclePrevention) = 0
    )
    SELECT CatalogNodeId FROM hierarchy
END

I previously wrote about the relations between entities in Commerce catalog, here Commerce relation(ship), a story – Quan Mai’s blog (vimvq1987.com) , so relations between nodes can be a bit complicated – a node can have one true parent defined in CatalogNode table, and then other “linked” nodes in CatalogNodeRelation . So to find all children – and grand children of a node, you need to get from both.

Getting children of a node from CatalogNode or CatalogNodeRelation is simple, but things become more complicated when you have to get grandchildren, then great-grandchildren, and so on, and so forth. with that, CTE needs to be used in a recursive way. But then there is a problem arises – there is a chance, small, but still, that the data was added in a correct way, so circular reference is possible. i.e. A is a parent of B, which is a parent of C, and itself is a parent of A. To stop the SP from running forever, a check needs to be added to make sure any circular reference is cut short.

This brings back memory as the first ever support case I worked on at Optimizely (then Episerver) was with a circular reference. The site would crash whenever someone visited the catalog management in Commerce Manager. That was around June, 2012 (feeling old now?). My “boss” at that time involuntarily volunteered me for the case. See what you made me do, boss.

Now you can grasp the basic of what the SP does – let’s get back to the original problem. it’s slow to run especially with big catalog and complex node structure. As always, to optimize everything you need to find the bottleneck – time to fire up SQL Server Management Studio and turn on the Actual Execution Plan

I decided to go with 66, the “root” catalog node. this query yield around 18k rows

declare @Nodes udttCatalogNodeList 

insert into @Nodes (CatalogNodeId) select 66

exec ecf_CatalogNode_GetAllChildNodes @Nodes

and also 18s of execution.

Mind you, this is on my machine with pretty powerful CPU (AMD Ryzen 7 5800x, 8 cores 16 threads), and a very fast nvme PCIe SSD (Western Digital Black SN850 2TB). If this was executed on Azure Sql database for example, a timeout is almost certainly guaranteed. So time of execution should only be compared relatively with each other.

If we look at the execution plan, it is quite obvious where the bottleneck is. A scan on CatalogNode table is heavy (it read 79M rows on that operation). As suggest by Anders from Timeout when deleting CatalogNodes from a large catalog (optimizely.com), adding a non clustered index on ParentNodeId column would improve it quite a lot. And indeed it does. The execution time is reduced to 5 second.

And the number of rows read on CatalogNode reduced to just 17k

This is of course a very nice improvement. But the customer reported that it is not enough and the SP is still giving timeout, i.e. further optimization is needed.

Naturally, the next step would be to see if we can skip the circular check. It was added as a safe measure to avoid bad data. It should not be there, as the check should be performed at data modification. But it is there for historical reasons and we can’t just change it, not trivially. So let’s try it for our curiousity.

The modified query looks like this (basically just commented out any code related to the CyclePrevention

ALTER PROCEDURE [dbo].[ecf_CatalogNode_GetAllChildNodes]
    @catalogNodeIds udttCatalogNodeList readonly
AS
BEGIN
    WITH all_node_relations AS 
    (
        SELECT ParentNodeId, CatalogNodeId AS ChildNodeId FROM CatalogNode
        WHERE ParentNodeId > 0
        UNION
        SELECT ParentNodeId, ChildNodeId FROM CatalogNodeRelation
    ),
    hierarchy AS
    (
        SELECT 
            n.CatalogNodeId
			--, '|' + CAST(n.CatalogNodeId AS nvarchar(4000)) + '|' AS CyclePrevention
        FROM @catalogNodeIds n
        UNION ALL
        SELECT
            children.ChildNodeId AS CatalogNodeId
			--, parent.CyclePrevention + CAST(children.ChildNodeId AS nvarchar(4000)) + '|' AS CyclePrevention
        FROM hierarchy parent
        JOIN all_node_relations children ON parent.CatalogNodeId = children.ParentNodeId
        --WHERE CHARINDEX('|' + CAST(children.ChildNodeId AS nvarchar(4000)) + '|', parent.CyclePrevention) = 0
    )
    SELECT CatalogNodeId FROM hierarchy
END

And the improvement is quite impressive (more than I expected), the query completes almost instantly (less than 1s). The read on CatalogNodeRelation significantly reduced

A word of warning here, execution plan can’t be simply compared as-is. If I run two versions side by side, it gives quite misleading comparison

Even though the top one (without the circular reference check) is much faster than the original (the bottom one), SQL Server estimates that the first is slower (almost 2x slower than the second). So execution plan should be used to see what has been done and what is likely the bottleneck inside a query, it should not be used as comparison between queries. In most cases, comparing statistics using set statistics io on is the best way to compare.

If not for the fact that we are changing the behavior of the stored procedure, I would be happy with this approach. The chance of running into circular reference is small, but it is not zero. As we said, we can in theory gating the relation during insert/updating, but that would be too big a change to start with. This is one of constraint as we work at framework level – we have to step carefully to not break anything. A breaking change is bad, but a data corruption is simply unacceptable. I spent a few hours (probably more than I should) trying to optimize the circular reference check, but no better solution is found.

The next approach would be – as we can guess, to make sure that we get rid of the Clustered Index Scan happened on the CatalogNodeRelation table. The solution would be quite simple, a non clustered index on the `ParentNodeId should be enough.

Great success. The performance is comparable with the “non circular reference check” approach.

As adding an index is a non breaking change (and albeit in some cases it can cause performance regression, like in A curious case of SQL execution plan – Quan Mai’s blog (vimvq1987.com) , but it is rare, also, in this case the cardinality of the ParentNodeId is most likely quite well distributed).

That is all for today. Hopefully you learn one thing or two about optimizing queries in your daily works.

Potential performance issue with Maxmind.db

From time to time, I have to dig into some customers’ profiler traces to figure out why their site is slow (yeah, if you follow me, you’d know that’s kind of my main job). There are multiple issues that can eat your website performance for breakfast, from loading too much content, to unmaintained database indexes. While my blog does not cover everything, I think you can get a good grasp of what mistakes to avoid.

But sometimes the problem might come from a 3rd party library/framework. It’s not new, as we have seen it with A curious case of memory dump diagnostic: How Stackify can cause troubles to your site – Quan Mai’s blog (vimvq1987.com). The problem with those types of issues is that they are usually overlooked.

The library we’ll be investigating today would be Maxmind.db. To be honest, I’ve never used it my own, but it seems to be a very popular choice to geography-map the visitors. It’s usually used by Optimizely sites for that purpose, using VisitorGroup (which is why it came under my radar).

For several sites that use it, it seems more often than not stuck in this stack

It’s essentially to think that CreateActivator is doing something heavy here (evidently with the LambdaCompiler.Compile part. A peek from decompiling actually shows that yes, it’s heavy. I’m not quite sure I can post the decompiled code here without violating any agreement (I did, in fact, accepted no agreement at this point), but it’s quite straightforward code: TypeActivatorCreator uses reflection to get the constructors of the Type passed to it, to sees if there is any constructor decorated with MaxMind.Db.Constructor attribute, then prepares some parameters, and creates an LambdaExpression that would create an instance of that Type, using found constructor (which is a good thing because a compiled expression would perform much better than just a reflection call).

(I’m using Mindmax.db 2.0.0, for the record)

The code is straightforward, but it is also slow – as any code which involves reflection and lambda compilation would be. The essential step would be to cache any result of this. This is actually a very good place to cache. The number of types are fixed during runtime (except for very edge cases where you dynamically create new types), so you won’t have to worry about cache invalidation. The cache would significantly improve the performance of above code.

And in TypeActivatorCreator there is a cache for it. It is a simple ConcurrentDictionary<Type, TypeActivator> , which would return an TypeActivator if the Type was requested before, or create one and cache it, it it hasn’t been. As I said, this is a very good place to add cache to this.

There is a cache for that, which is good. However, the very important tidbit here is that the dictionary is not static. That means, the cache only works, if the class is registered as Singleton (by itself, or by another class down the dependency chain), meaning, only one of the instance is created and shared between thread (which is why the ConcurrentDictionary part is important).

But except it’s not.

When I look at a memory dump that collected for a customer that is using Maxmind.db, this is what I got:

0:000> !dumpheap -stat -type TypeAcivatorCreator
Statistics:
MT Count TotalSize Class Name
00007ffa920f67e0 1 24 MaxMind.Db.TypeAcivatorCreator+<>c
00007ffa920f6500 147 3528 MaxMind.Db.TypeAcivatorCreator
Total 148 objects

So there were 147 instances of TypeAcivatorCreator. Note that this is only the number of existing instances. There might be other instances that were disposed and garbaged by CLR.

Now it’s clear why it has been performing bad. For supposedly every request, a new instance of TypeActivatorCreator is created, and therefore its internal cache is simply empty (it is just newly created, too). Therefore each of request will go through the expensive path of CreateActivator, and performance suffers.

The obvious fix here is to make the dictionary static, or making the TypeActivatorCreator class Singleton. I don’t have the full source code of Mindmax.Db to determine which is better, but I’d be leaning toward the former.

Moral of the story:

  • Caching is very, very important, especially when you are dealing with reflection and lambda compilation
  • You can get it right 99%, but the 1% left could still destroy performance.

Update:

I reached out to Maxmind.db regarding this issue on November 9th, 2021

About 6h later they replied with this

I was at first confused, then somewhat disappointed. It is a small thing to fix to improve overall performance, rather than relying on/expecting customers to do what you say in documentation. But well, let’s just say we have different opinions.

Don’t let order search kill your site

Episerver Commerce order search is a powerful feature. My colleague Shannon Gray wrote about is long ago https://world.episerver.com/blogs/Shannon-Gray/Dates/2012/12/EPiServer-Commerce-Order-Search-Made-Easy/ , and I myself as well https://world.episerver.com/blogs/Quan-Mai/Dates/2014/10/Order-searchmade-easy/

But because of its power and flexibility, it can be complicated to get right. People usually stop at making the query works. Performance is usually an after thought, as it is only visible on production environment when there are enough requests to bring your database to its knees.

Let me be very clear about it: during my years helping customers with performance issues (and you can guess, that is a lot of customers), order search is one of the most, if not the most common cause of database spikes.

Trust me, you never want to your database looks like this

As your commerce database is brought to its knees, your entire website performance suffers. Your response time suffers. Your visitors are unhappy and that makes your business suffer.

But what is so bad about order search?

Order search allows you to find orders by almost any criteria. And to do that, you often join with different tables in the database. Search for orders with specific line items? Join with LineItem table on a match of CatalogEntryId column. Search for orders with a specific shipping method? Join with Shipment table on a match of ShippingMethodId etc. etc. SqlWhereClause and SqlMetaWhereClause of OrderSearchParameters are extremely flexible, and that is both a cure, and a curse.

Let’s examine the first example in closer details. The query is easy to write. But don’t you know that there is no index on the CatalogEntryId column? That means every request to search order, end up in a full table scan of LineItem.

There are two bad news into that: your LineItem table usually have many rows already, which makes that scan slow, and resource intensive. And as it’s an ever growing table, the situation only gets worse over time.

That is only a start, and a simple one, because that can be resolved by adding an index on CatalogEntryId , but there are more complicated cases when adding an index simply can’t solve the problem – because there is no good one. For example if you search for orders with custom fields, but only of type bit . Bit is essentially the worst type when it comes to index-ability, so your indexes will be much less effective than you want it to be. A full table scan will likely be used.

In short:

Order search is flexible, and powerful. But, “With great power come great responsibility”. Think about what you join on your SqlWhereClause and SqlMetaWhereClause statements, and if your query is covered by an index, or if adding an index will make senses in this case (I have a few guidelines here for a good index https://vimvq1987.com/index-or-no-index-thats-the-question/). Or if you can limit the number of the orders you search for.

Your database will thank you, later.

Dynamic data store is slow, (but) you can do better.

If you have been developing with Episerver CMS for a while, you probably know about its embedded “ORM”, called Dynamic Data Store, or DDS for short. It allows you to define strongly typed types which are mapped to database directly to you. You don’t have to create the table(s), don’t have to write stored procedures to insert/query/delete data. Sounds very convenient, right? The fact is, DDS is quite frequently used, and more often than you might think, mis-used.

As Joel Spolsky once said Every abstraction is leaky, an ORM will likely make you forget about the nature of the RDBMS under neath, and that can cause performance problems, sometime severe problems.

Let me make it clear to you

DDS is slow, and it is not suitable for big sets of data.

If you want to store a few settings for your website, DDS should be fine. However, if you are thinking about hundreds of items, it is probably worth looking else. Thousands and more items, then it would be a NO.

I did spend some time trying to bench mark the DDS to see how bad it is. A simple test is to add 10.000 items to a store, then query by each item, then deleted by each item, to see how long does it take

The item is defined like this, this is just another boring POCO:

internal class ShippingArea : IDynamicData
{
    public Identity Id { get; set; }

    public string PostCode { get; set; }

    public string Area { get; set; }

    public DateTime Expires { get; set; }
}

The store is defined like this

    public class ShippingAreaStore
    {
        private const string TokenStoreName = "ShippingArea";

        internal virtual ShippingArea CreateNew(string postCode, string area)
        {
            var token = new ShippingArea
            {
                Id = Identity.NewIdentity(),
                PostCode = postCode,
                Area = area,
                Expires = DateTime.UtcNow.AddDays(1)
            };
            GetStore().Save(token);
            return token;
        }

        internal virtual IEnumerable<ShippingArea> LoadAll()
        {
            return GetStore().LoadAll<ShippingArea>();
        }

        internal virtual IEnumerable<ShippingArea> Find(IDictionary<string, object> parameters)
        {
            return GetStore().Find<ShippingArea>(parameters);
        }

        internal virtual void Delete(ShippingArea shippingArea)
        {
            GetStore().Delete(shippingArea);
        }

        internal virtual ShippingArea Get(Identity tokenId)
        {
            return GetStore().Load<ShippingArea>(tokenId);
        }

        private static DynamicDataStore GetStore()
        {
            return DynamicDataStoreFactory.Instance.CreateStore(TokenStoreName, typeof(ShippingArea));
        }
    }

Then I have some quick and dirty code in QuickSilver ProductController.Index to measure the time (You will have to forgive some bad coding practices here ;). As usual StopWatch should be used on demonstration only, it should not be used in production. If you want a good break down of your code execution, use tools like dotTrace. If you want to measure production performance, use some monitoring system like NewRelic or Azure Application Insights ):

        var shippingAreaStore = ServiceLocator.Current.GetInstance<ShippingAreaStore>();
        var dictionary = new Dictionary<string, string>();
        for (int i = 0; i < 10000; i++)
        {
            dictionary[RandomString(6)] = RandomString(10);
        }
        var identities = new List<ShippingArea>();
        var sw = new Stopwatch();
        sw.Start();
        foreach (var pair in dictionary)
        {
            shippingAreaStore.CreateNew(pair.Key, pair.Value);
        }
        sw.Stop();
        _logger.Error($"Creating 10000 items took {sw.ElapsedMilliseconds}");
        sw.Restart();
        foreach (var pair in dictionary)
        {
            Dictionary<string, object> parameters = new Dictionary<string, object>();
            parameters.Add("PostCode", pair.Key);
            parameters.Add("Area", pair.Value);
            identities.AddRange(shippingAreaStore.Find(parameters));
        }

        sw.Stop();
        _logger.Error($"Querying 10000 items took {sw.ElapsedMilliseconds}");
        sw.Restart();

        foreach (var id in identities)
        {
            shippingAreaStore.Delete(id);
        }
        sw.Stop();
        _logger.Error($"Deleting 10000 items took {sw.ElapsedMilliseconds}");

Everything is ready. So a few tries gave us a fairly stable result:

2019-12-02 13:33:01,574 Creating 10000 items took 11938

2019-12-02 13:34:59,594 Querying 10000 items took 118009

2019-12-02 13:35:24,728 Deleting 10000 items took 25131

And this is strictly single-threaded, the site will certainly perform worse when it comes to real site with a lot of traffic, and thus multiple insert-query-delete at the same time.

Can we do better?

There is a little better attribute that many people don’t know about DDS: you can mark a field as indexed, by adding [EPiServerDataIndex] attribute to the properties. The new class would look like this.

    [EPiServerDataStore]
    internal class ShippingArea : IDynamicData
    {
        public Identity Id { get; set; }

        [EPiServerDataIndex]
        public string PostCode { get; set; }

        [EPiServerDataIndex]
        public string Area { get; set; }

        public DateTime Expires { get; set; }
    }

If you peek into the database during the test, you can see that the data is now being written to Indexed_String01 and Indexed_String02 columns, instead of String01 and String02 as without the attributes. Such changes give us quite drastic improvement:

2019-12-02 15:38:16,376 Creating 10000 items took 7741

2019-12-02 15:38:19,245 Querying 10000 items took 2867

2019-12-02 15:38:44,266 Deleting 10000 items took 25019

The querying benefits greatly from the new index, as it no longer has to do a Clustered Index Scan, it can now do a non clustered index seek + Key look up. Deleting is still equally slow, because the delete is done by a Clustered Index delete on the Id column, which we already have, and the index on an Uniqueidentifier column is not the most effective one.

Before you are happy which such improvement, keep in mind that there are two indexes added for Indexed_String01 and Indexed_String02 separately. Naturally, we would want a combination, clustered even, on those columns, but we just can’t.

What if we want to go bare metal and create a table ourselves, write the query ourselves? Our repository would look like this

public class ShippingAreaStore2
    {
        private readonly IDatabaseExecutor _databaseExecutor;

        public ShippingAreaStore2(IDatabaseExecutor databaseExecutor)
        {
            _databaseExecutor = databaseExecutor;
        }

        /// <summary>
        /// Creates and stores a new token.
        /// </summary>
        /// <param name="blobId">The id of the blob for which the token is valid.</param>
        /// <returns>The id of the new token.</returns>
        internal virtual ShippingArea CreateNew(string postCode, string area)
        {
            var token = new ShippingArea
            {
                Id = Identity.NewIdentity(),
                PostCode = postCode,
                Area = area,
                Expires = DateTime.UtcNow.AddDays(1)
            };
            _databaseExecutor.Execute(() =>
            {
                var cmd = _databaseExecutor.CreateCommand();
                cmd.CommandText = "ShippingArea_Add";
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.Add(_databaseExecutor.CreateParameter("Id", token.Id.ExternalId));
                cmd.Parameters.Add(_databaseExecutor.CreateParameter("PostCode", token.PostCode));
                cmd.Parameters.Add(_databaseExecutor.CreateParameter("Area", token.Area));
                cmd.Parameters.Add(_databaseExecutor.CreateParameter("Expires", token.Expires));
                cmd.ExecuteNonQuery();
            });

            return token;
        }

        internal virtual IEnumerable<ShippingArea> Find(IDictionary<string, object> parameters)
        {
            return _databaseExecutor.Execute<IEnumerable<ShippingArea>>(() =>
            {
                var areas = new List<ShippingArea>();
                var cmd = _databaseExecutor.CreateCommand();
                cmd.CommandText = "ShippingArea_Find";
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.Add(_databaseExecutor.CreateParameter("PostCode", parameters.Values.First()));
                cmd.Parameters.Add(_databaseExecutor.CreateParameter("Area", parameters.Values.Last()));
                var reader = cmd.ExecuteReader();
                while (reader.Read())
                {
                    areas.Add(new ShippingArea
                    {
                        Id = (Guid)reader["Id"],
                        PostCode = (string)reader["PostCode"],
                        Area = (string)reader["Area"],
                        Expires = (DateTime)reader["Expires"]
                    });
                }
                return areas;
            });
        }

        /// <summary>
        /// Deletes a token from the store.
        /// </summary>
        /// <param name="token">The token to be deleted.</param>
        internal virtual void Delete(ShippingArea area)
        {
            _databaseExecutor.Execute(() =>
            {
                var cmd = _databaseExecutor.CreateCommand();
                cmd.CommandText = "ShippingArea_Delete";
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.Add(_databaseExecutor.CreateParameter("PostCode", area.PostCode));
                cmd.Parameters.Add(_databaseExecutor.CreateParameter("Area", area.Area));
                cmd.ExecuteNonQuery();
            });
        }
    }

And those would give us the results:

2019-12-02 16:44:14,785 Creating 10000 items took 2977

2019-12-02 16:44:17,114 Querying 10000 items took 2315

2019-12-02 16:44:20,307 Deleting 10000 items took 3190

Moral of the story?

DDS is slow and you should be avoid using it if you are working with fairly big set of data. If you have to use DDS for whatever reason, make sure to at least try to index the columns that you query the most.

And in the end of the days, hand-crafted custom table + query beats everything. Remember that you can use some tools like Dapper to do most of the works for you.

Fixing ASP.NET Membership performance – part 1

Even though it is not the best identity management system in the .NET world, ASP.NET Membership provider is still fairly widely used, especially for systems that have been running for quite long time with a significant amount of users: migrating to a better system like AspNetIdentity does not comes cheap. However, built from early days of ASP.NET mean Membership provider has numerous significant limitations: beside the “architecture” problems, it also has limited performance. Depends on who you ask, the ultimate “maximum” number of customers that ASP.NET membership provider can handle ranges from 30.000 to 750.000. That does not sound great. Today if you start a new project, you should be probably better off with AspNetIdentity or some other solutions, but if your website is using ASP.NET membership provider and there is currently no plan to migrate, then read on.

The one I will be used for this blog post has around 950.000 registered users, and the site is doing great – but that was achieved by some very fine grained performance tuning, and a very high end Azure subscription.

A performance overview 

I have been using ASP.NET membership provider for years, but I have never looked into it from performance aspects. (Even though I have done some very nasty digging to their table structure). And now I have the chance, I realize how bad it is.

It’s a fairly common seen in the aspnet_* tables that the indexes have ApplicationId as the first column. It does not take a database master to know it is a very ineffective way to create an index – in most of the cases, you only have on ApplicationId in your website, making those indexes useless when you want to, for example, query by UserId. This is a rookie mistake – a newbie tends to make order of columns in the index as same as they appear in the table, thinking, that that SQL Server will just do magic to exchange the order for the best performance. It’s not how SQL Server – or in general – RDBMS systems work.

It is OK to be a newbie or to misunderstand some concepts. I had the very same misconception once, and learned my lessons. However, it should not be OK for a framework to make that mistake, and never correct it.

That is the beginning of much bigger problems. Because of the ineffective order of columns, the builtin indexes are as almost useless. That makes the queries, which should be very fast, become unnecessarily slow, wasting resources and increasing your site average response time. This is of course bad news. But good news is it’s in database level, so we can change it for the better. It if were in the application level then our chance of doing that is close to none.

Missing indexes

If you use Membership.GetUserNameByEmail on your website a lot, you might notice that it is … slow. It leads to this query:

        SELECT  u.UserName
        FROM    dbo.aspnet_Applications a, dbo.aspnet_Users u, dbo.aspnet_Membership m
        WHERE   LOWER(@ApplicationName) = a.LoweredApplicationName AND
                u.ApplicationId = a.ApplicationId    AND
                u.UserId = m.UserId AND
                LOWER(@Email) = m.LoweredEmail

Let’s just ignore the style for now (INNER JOIN would be a much more popular choice), and look into the what is actually done here. So it joins 3 tables by their keys. The join with aspnet_Applications would be fairly simple, because you usually have just one application. The join between aspnet_Users and aspnet_Membership is also simple, because both of them have index on UserId – clustered on aspnet_Users and non-clustered on aspnet_Membership

The last one is actually problematic. The clustered index on aspnet_Membership actually looks like this

CREATE CLUSTERED INDEX [aspnet_Membership_index]
    ON [dbo].[aspnet_Membership]([ApplicationId] ASC, [LoweredEmail] ASC);

Uh oh. Even if this contains LoweredEmail, it’s the worst possible kind of index. By using the least distinctive column in the first, it defeats the purpose of the index completely. Every request to get user name by email address will need to perform a full table scan (oops!)

This is a the last thing you want to see in a execution plan, especially with a fairly big table. 

It should have been just

CREATE CLUSTERED INDEX [aspnet_Membership_index]
    ON [dbo].[aspnet_Membership]([LoweredEmail] ASC);

which helps SQL Server to use the optimal execution plan

If you look into Azure SQL Database recommendation, it suggest you to create a non clustered index on LoweredEmail. That is not technically incorrect, and it still helps. However, keep in mind that each non clustered index will have to “duplicate” the clustered index, for the purpose of identify the rows, so keeping the useless clustered index actually increases wastes and slows down performance (even just a little, because you have to perform more reads to get the same data). However, if your database is currently performing badly, adding a non clustered index is a much quicker and safer option. The change to clustered index should be done with caution at low traffic time.

Tested the stored procedure on database above, without any additional index

Table 'aspnet_Membership'. Scan count 9, logical reads 20101, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'aspnet_Applications'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'aspnet_Users'. Scan count 0, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

 SQL Server Execution Times:
   CPU time = 237 ms,  elapsed time = 182 ms.

With new non clustered index


(1 row affected)
Table 'aspnet_Applications'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'aspnet_Users'. Scan count 0, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'aspnet_Membership'. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 89 ms.

With new clustered index:

(1 row affected)
Table 'aspnet_Applications'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'aspnet_Users'. Scan count 0, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'aspnet_Membership'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 89 ms.

Don’t we have a clear winner?

Speed up catalog routing if you have multiple children under catalog

A normal catalog structure is like this: you have a few high level categories under the catalog, then each high level category has a few lower level categories under it, then each lower level category has their children, so on and so forth until you reach the leaves – catalog entries.

However it is not uncommon that you have multiple children (categories and entries) directly under catalog. Even though that is not something you should do, it happens. 

But that is not without drawbacks. You might notice it is slow to route to a product. It might not be visible to naked eyes, but if you use some decent profilers (which I personally recommend dotTrace), it can be fairly obvious that your site is not routing optimally.

Why?

To route to a specific catalog content, for example http://commerceref/en/fashion/mens/mens-shirts/p-39101253/, the default router have to figure out which content is mapped to an url segment. So with default registration where the catalog root is the default routing root, we will start with the catalog which maps to the first part of route (fashion ). How do it figure out which content to route for the next part (mens ) ? 

Until recently, what it does it to call GetChildren on the catalog ContentReference . Now you can see the problem. Even with a cached result, that is still too much – GetChildren with a big number of children is definitely expensive.

We noticed this behavior, thanks to Erik Norberg. An improvement have been made in Commerce 12.10 to make sure even with a number of children directly under Catalog, the router should perform adequately efficient.

If you can’t upgrade to 12.10 or later (you should!), then you might have a workaround that improve the performance. By adding your own implementation of HierarchicalCatalogPartialRouter, you can override how you would get the children content – by using a more lightweight method (GetBySegment)

    public class CustomHierarchicalCatalogPartialRouter : HierarchicalCatalogPartialRouter
    {
        private readonly IContentLoader _contentLoader;

        public CustomHierarchicalCatalogPartialRouter(Func<ContentReference> routeStartingPoint, CatalogContentBase commerceRoot, bool enableOutgoingSeoUri) : base(routeStartingPoint, commerceRoot, enableOutgoingSeoUri)
        {
        }

        public CustomHierarchicalCatalogPartialRouter(Func<ContentReference> routeStartingPoint, CatalogContentBase commerceRoot, bool supportSeoUri, IContentLoader contentLoader, IRoutingSegmentLoader routingSegmentLoader, IContentVersionRepository contentVersionRepository, IUrlSegmentRouter urlSegmentRouter, IContentLanguageSettingsHandler contentLanguageSettingsHandler, ServiceAccessor<HttpContextBase> httpContextAccessor) : base(routeStartingPoint, commerceRoot, supportSeoUri, contentLoader, routingSegmentLoader, contentVersionRepository, urlSegmentRouter, contentLanguageSettingsHandler, httpContextAccessor)
        {
            _contentLoader = contentLoader;
        }

        protected override CatalogContentBase FindNextContentInSegmentPair(CatalogContentBase catalogContent, SegmentPair segmentPair, SegmentContext segmentContext, CultureInfo cultureInfo)
        {
            return _contentLoader.GetBySegment(catalogContent.ContentLink, segmentPair.Next, cultureInfo) as CatalogContentBase;
        }
    }

And then instead of using CatalogRouteHelper.MapDefaultHierarchialRouter , you register your router directly

 var referenceConverter = ServiceLocator.Current.GetInstance<ReferenceConverter>();
            var contentLoader = ServiceLocator.Current.GetInstance<IContentLoader>();
            var commerceRootContent = contentLoader.Get<CatalogContentBase>(referenceConverter.GetRootLink());
            routes.RegisterPartialRouter(new HierarchicalCatalogPartialRouter(startingPoint, commerceRootContent, enableOutgoingSeoUri));

(ServiceLocator is just to make it easier to understand the code. You should do this in an IInitializationModule, so use context.Locate.Advanced instead.

This is applicable from 9.2.0 and newer versions. 

Moral of the story:

  • Catalog structure can play a big role when it comes to performance.
  • You should do profiling whenever you can
  • We do that too, and we make sure to include improvements in later versions, so keeping your website up to date is a good way to tune performance.