Catalog,  Commerce,  Episerver,  Performance,  SQL Server

A curious case of SQL execution plan

I have said this already, and I will say it again: the SQL Server optimizer is smart. I would even go further and say it's smarter than you and me (I have no doubt that you are smart, even very, very smart 🙂 ). So in most cases, you should leave it to do whatever it thinks is best.

But there are cases where the optimizer gets confused and chooses a sub-optimal plan, because it was given wrong or outdated information. That's when you need to step in.

Today I faced one such case, as reported here:

So the slow query looks like this:
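The original code isn't reproduced in this extract, but the shape of the call was roughly as follows. This is a sketch only: the table-valued parameter type and its column name are assumptions, not the actual Episerver Commerce schema.

```sql
-- Sketch of the repro call (TVP type and column name are hypothetical).
-- @WorkIds stands in for the 65 content version ids from the original report.
DECLARE @WorkIds AS dbo.udttIdTable;
INSERT INTO @WorkIds (WorkId)
VALUES (1001), (1002), (1003); -- ... 65 ids in total in the original repro
EXEC dbo.ecfVersion_ListByWorkIds @WorkIds;
```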

It took several minutes on my machine (a fairly powerful desktop), so that is a very bad sign. By default, the command timeout for calling a stored procedure is 30 seconds – anything slower than that results in a timeout exception. Of course you can increase the command timeout value, but that should be reserved for extraordinary cases, not used as a general approach. Slow queries are not only prone to timeout exceptions, they can cause severe issues: a (very) bad user experience, deadlocks, and so on. For a frequently called SP like ecfVersion_ListByWorkIds, the acceptable limit should be a few seconds – and of course faster is better.

Let’s try with a single row instead of 65 rows for the parameter:
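Again as a sketch, with the same hypothetical TVP type as before, the single-row variant is simply:

```sql
-- Same hypothetical call shape, reduced to one WorkId.
DECLARE @WorkIds AS dbo.udttIdTable;
INSERT INTO @WorkIds (WorkId) VALUES (1001);
EXEC dbo.ecfVersion_ListByWorkIds @WorkIds;
```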

This took around 1 minute and 18 seconds. That is of course still slow, but since we are trying to identify the bottleneck, it's good that we know where to look. Looking at the execution plan gives us a fairly good idea of what is wrong:

So one statement is trying to “seek” a non-clustered index. The problem is that it seeks “only” 10,510,143 rows – that’s the entire ecfVersionProperty table. No wonder it’s slow. A seek like that is definitely a bad sign, so let’s look at the code corresponding to that plan:

Let’s skip the select statement; it’s not of interest to us. The two lines we need to look into are these:
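The screenshot of those two lines is missing from this extract, but the join pattern being described would look roughly like this. The second table's name and the @WorkIds parameter are assumptions for illustration; only the column names come from the post.

```sql
-- Sketch of the problematic join pattern, in its original written order
-- (not the verbatim procedure body; dbo.MetaField is a hypothetical name).
SELECT vp.*
FROM dbo.MetaField mf
INNER JOIN dbo.ecfVersionProperty vp ON vp.MetaFieldId = mf.MetaFieldId -- poorly selective
INNER JOIN @WorkIds w              ON vp.WorkId = w.WorkId              -- very selective
```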

So here we have two joins, and one is very selective: WorkId is a unique clustered index, so it should be very fast. The second one is MetaFieldId on the ecfContentProperty table, which is not nearly as selective. There are – as we know – 10.5M rows in ecfContentProperty and about 240 distinct values of MetaFieldId, so on average almost 50,000 rows share the same value. That selectivity is poor, so we naturally want to join by WorkId first. That would cut down the number of rows a lot and make the join on MetaFieldId easy.

Shouldn’t SQL Server figure that out? I searched a bit and there is no definitive answer on whether the order of joins affects performance. The majority says it should not matter, and that SQL Server will be able to swap the joins to find the best plan. However, there are people saying that the order of joins matters. Hmm – shall we try?
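The rewritten query is also missing from this extract; the change was simply listing the selective WorkId join first. Sketched with the same hypothetical names as above:

```sql
-- Same sketch with the join order swapped: the selective WorkId join first.
SELECT vp.*
FROM @WorkIds w
INNER JOIN dbo.ecfVersionProperty vp ON vp.WorkId = w.WorkId
INNER JOIN dbo.MetaField mf          ON mf.MetaFieldId = vp.MetaFieldId
```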

And the magic happens. The query is now blazing fast: it completes in less than 1 second, and the “bottleneck” is gone.

Now it uses the clustered index on WorkId, as we wanted. Great. Did we solve the problem? Not yet. We did make it faster, yes, but we need to confirm that it actually fixed the problem. We can do that by reverting the change and seeing how it behaves.

Heh. And it’s still fast, now with the original order of joins. What the heck. Plan caching? Cached buffers? No – even after exec sp_refreshsqlmodule 'ecfVersionProperty_ListByWorkIds' and DBCC DROPCLEANBUFFERS, it’s still fast (slightly slower because it has to do some physical reads, but still less than 1 second). In fact, I could not make it as slow as it was!
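For reference, these are the cache-clearing steps mentioned above: the first drops the module's cached plan by forcing a recompile, the second flushes clean pages from the buffer pool.

```sql
-- Refresh the module's metadata and invalidate its cached plan.
EXEC sp_refreshsqlmodule 'ecfVersionProperty_ListByWorkIds';

-- Flush the buffer pool so subsequent reads are physical.
-- This affects the whole instance: dev/test use only.
DBCC DROPCLEANBUFFERS;
```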

So here we can assume that it had a bad execution plan in cache. By changing the join order, we allowed SQL Server to discard that bad plan and choose a better one. And as we confirmed, the order of the joins does not matter here.

So did we solve the problem we originally had? Does the “big query” run fast now? Let’s try it.

Uh oh… it’s still slow, taking more than 3 minutes to complete. And the interesting thing is, after running the big query, I was able to slow down the small query. It’s slow again, yay!

So we can say with fairly high certainty that the big query is able to fool SQL Server into using a bad execution plan. But how?

I was pulling my hair out until I looked into the statistics of the “bad index” – the index we don’t want to use first (as it’s not very selective):
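You can inspect those statistics yourself with DBCC SHOW_STATISTICS; the histogram shows how many rows each range of MetaFieldId values covers, which is where the skew becomes visible. The index name below is a placeholder – the actual name wasn't given in the post.

```sql
-- Show the statistics histogram for the poorly selective index.
-- The index name here is hypothetical.
DBCC SHOW_STATISTICS ('dbo.ecfVersionProperty', 'IX_ecfVersionProperty_MetaFieldId')
    WITH HISTOGRAM;
```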

Gotcha! This index is very oddly distributed: some of the values have very few matching rows. In the big query, one of those values matched, and that fooled the SQL Server optimizer into thinking it would be better to use the index on MetaFieldId – with so few matching rows, that would mean fewer reads and could be faster overall. But it was wrong.

This is a kind of “parameter sniffing” problem. Oddly distributed data can make SQL Server use a wrong, slow execution plan: one that is faster for a specific value, but much slower for the rest of the data.

Now that we know what caused the problem, is there a fix? Making the data evenly distributed is not an option here – that is entirely up to the site implementation, and out of our control.

Luckily for us, SQL Server provides a hint to enforce join order. Because we know for sure that the WorkId join will always work better here, we can ensure it’s joined first by adding OPTION (FORCE ORDER).

The fixed query looks like this:
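The fixed query itself is missing from this extract; sketched with the same hypothetical names as before, the fix amounts to appending the hint so the joins are performed exactly in the written order:

```sql
-- FORCE ORDER makes the optimizer join in the order written:
-- the selective WorkId join runs first. Names are hypothetical sketches.
SELECT vp.*
FROM @WorkIds w
INNER JOIN dbo.ecfVersionProperty vp ON vp.WorkId = w.WorkId
INNER JOIN dbo.MetaField mf          ON mf.MetaFieldId = vp.MetaFieldId
OPTION (FORCE ORDER);
```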

Does it work? Yes, it works wonders. The small queries are now consistently fast, and the big query takes only around 3.5 seconds. Given the size of the catalog, I’m happy with the result.

So to sum up: SQL Server is smart, but it can be fooled. And if you know something for sure, a query hint can provide a huge boost in performance – and even better, help stabilize your plans.

Thanks for reading!

Part 2 is here:

