I said this already, and I will say it again: SQL Server optimizer is smart. I can even go further and say, it’s smarter than you and me (I have no doubt that you are smart, even very, very smart 🙂 ). So most of the cases, you leave it to do whatever it thinks is the best.
But there are cases SQL Server optimizer is fooled by the engine – it gets confused and chooses an sub-optimal plan, because it was given wrong, outdated, or incorrect information. That’s when you need to step in.
Today I face one case like that, as reported here:Â http://world.episerver.com/forum/developer-forum/Episerver-Commerce/Thread-Container/2017/10/database-timeout-on-productvariant-update/
So the slow query looks like this
declare @p1 dbo.udttObjectWorkId insert into @p1 values(110830,0,1378017,NULL) insert into @p1 values(110830,0,1356813,NULL) insert into @p1 values(110830,0,1356812,NULL) insert into @p1 values(110830,0,1356810,NULL) insert into @p1 values(110830,0,1356811,NULL) insert into @p1 values(96861,0,1380113,NULL) insert into @p1 values(96861,0,1329903,NULL) insert into @p1 values(96861,0,1281127,NULL) insert into @p1 values(96861,0,1265426,NULL) insert into @p1 values(96861,0,1260440,NULL) insert into @p1 values(96861,0,1247704,NULL) insert into @p1 values(96861,0,469360,NULL) insert into @p1 values(96861,0,469359,NULL) insert into @p1 values(96861,0,444058,NULL) insert into @p1 values(96861,0,444057,NULL) insert into @p1 values(96861,0,415689,NULL) insert into @p1 values(96861,0,415688,NULL) insert into @p1 values(96861,0,379004,NULL) insert into @p1 values(96861,0,379003,NULL) insert into @p1 values(96861,0,236924,NULL) insert into @p1 values(96861,0,236923,NULL) insert into @p1 values(96861,0,233462,NULL) insert into @p1 values(96861,0,233461,NULL) insert into @p1 values(96861,0,190682,NULL) insert into @p1 values(96861,0,190681,NULL) insert into @p1 values(96441,0,1380282,NULL) insert into @p1 values(96441,0,1330257,NULL) insert into @p1 values(96441,0,1281483,NULL) insert into @p1 values(96441,0,1265765,NULL) insert into @p1 values(96441,0,1260796,NULL) insert into @p1 values(96441,0,1248075,NULL) insert into @p1 values(96441,0,468196,NULL) insert into @p1 values(96441,0,468195,NULL) insert into @p1 values(96441,0,442894,NULL) insert into @p1 values(96441,0,442893,NULL) insert into @p1 values(96441,0,414525,NULL) insert into @p1 values(96441,0,414524,NULL) insert into @p1 values(96441,0,377840,NULL) insert into @p1 values(96441,0,377839,NULL) insert into @p1 values(96441,0,235664,NULL) insert into @p1 values(96441,0,235663,NULL) insert into @p1 values(96441,0,232202,NULL) insert into @p1 values(96441,0,232201,NULL) insert into @p1 values(96441,0,189002,NULL) insert into @p1 values(96441,0,189001,NULL) insert into @p1 values(96950,0,1381232,NULL) insert into @p1 values(96950,0,1337237,NULL) insert into @p1 values(96950,0,1278209,NULL) insert into @p1 values(96950,0,1277635,NULL) insert into @p1 values(96950,0,1269243,NULL) insert into @p1 values(96950,0,1257298,NULL) insert into @p1 values(96950,0,469585,NULL) insert into @p1 values(96950,0,469584,NULL) insert into @p1 values(96950,0,444283,NULL) insert into @p1 values(96950,0,444282,NULL) insert into @p1 values(96950,0,415914,NULL) insert into @p1 values(96950,0,415913,NULL) insert into @p1 values(96950,0,379229,NULL) insert into @p1 values(96950,0,379228,NULL) insert into @p1 values(96950,0,237191,NULL) insert into @p1 values(96950,0,237190,NULL) insert into @p1 values(96950,0,233729,NULL) insert into @p1 values(96950,0,233728,NULL) insert into @p1 values(96950,0,191038,NULL) insert into @p1 values(96950,0,191037,NULL) set statistics io on set statistics time on exec ecfVersion_ListByWorkIds @ContentLinks=@p1
It took several minutes on my machine (A fairly powerful desktop), so that is a very bad sign. By default, the command timeout for calling a stored procedure is 30 seconds – and any thing slower than that will result in a timeout exception. Of course you can change the command timeout value – but that would only be used in extraordinary case, not as a general approach. Slow queries are not only prone to timeout exception, it can cause severe issues – (very) bad user experience, deadlocks, etc. With a frequently called SP like `ecfVersion_ListByWorkIds`, the acceptable limit should be a few seconds – of course faster is better.
Let’s try with a single row instead of 65 rows for the parameter:
declare @p1 dbo.udttObjectWorkId insert into @p1 values(110830,0,1378017,NULL) exec ecfVersion_ListByWorkIds @ContentLinks=@p1
This took around 1 minutes and 18 seconds. Of course that is slow, but because we are trying to identify the bottleneck, it’s good that we know where to look at. Looking at the execution plan gives us a fairly good idea of what is wrong
So one statement is trying to “seek” a non clustered index. The problem is that it seeks “only” 10510143 rows. That’s the entire ecfVersionProperty table. No wonder why it’s slow. A seek like that is a definitely a bad sign, let’s look at the code corresponding to that plan :
SELECT draftProperty.pkId, draftProperty.WorkId, draftProperty.ObjectId, draftProperty.ObjectTypeId, draftProperty.MetaFieldId, draftProperty.MetaClassId, draftProperty.MetaFieldName, draftProperty.LanguageName, draftProperty.[Boolean], draftProperty.[Number], draftProperty.[FloatNumber], [Money], [Decimal], [Date], [Binary], [String], CASE WHEN (@IsAzureCompatible = 0 AND F.IsEncrypted = 1 AND dbo.mdpfn_sys_IsLongStringMetaField(F.DataTypeId) = 1) THEN dbo.mdpfn_sys_EncryptDecryptString(draftProperty.LongString, 0) ELSE draftProperty.LongString END as LongString, draftProperty.[Guid] FROM ecfVersionProperty draftProperty INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId INNER JOIN @ContentLinks links ON links.WorkId = draftProperty.WorkId
Let’s skip the select statement, it’s not of our interest. The two lines we need to look into is this:
INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId INNER JOIN @ContentLinks links ON links.WorkId = draftProperty.WorkId
So here we have two joins – one is very selective. WorkId
is an unique clustered index, so it should be very fast. The second one is MetaFieldId
on ecfContentProperty
table – which is not as selective. There are – as we know – 10.5M rows in ecfContentProperty
, and about 240 values in MetaFieldId
, so there are almost 50000 rows with same value on average. That kind of distinction is poor, so we naturally want to join by WorkId
first. That would limit the number of rows a lot, and allow an easy join for MetaFieldId
.
Shouldn’t SQL Server figure that out? I searched a bit and there is no definitive answer on if the order of join affect performance or not. The majority says that it should not matter, and SQL Server will be able to swap the joins for best plan. However there are people saying that the order of joins matters. Hmm – shall we?
INNER JOIN @ContentLinks links ON links.WorkId = draftProperty.WorkId INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId
And the magic happens. The query is now blazing fast. It completes in less than 1 seconds, and the “bottleneck” has gone.
Now it uses the clustered index on WorkId
as we want to. Great. Did we solve the problem? Not yet. We did make it faster, yes, but we need to confirm that it actually fixed the problem. We can try that by reverting the change and see how it behaves.
Heh. And it’s still fast, now with the original order of joins. What the heck. Plan caching? Cached buffer? No, exec sp_refreshsqlmodule 'ecfVersionProperty_ListByWorkIds'
 , and
 DBCC DROPCLEANBUFFERS
, and it’s still fast (slight slower because it has to do some physical reads, but still less than 1s). In fact, I could not make it as slow as it was!
So here we can assume that it had a bad execution plan. By changing the order, we allowed SQL Server to remove the bad execution plan from cache, and SQL Server now can choose a better plan. And as we confirmed, the order of the joins does not matter here.
So did we solve the problem we originally have? Does the “big query” run fast now? Let’s try it.
Uh oh…, it’s still slow. Taking more than 3 minutes to complete. And the interesting thing is, after running the big query, I was able to slow down the small query. It’s slow again, yay!
So we can say with fairly high certainty that the big query is able to fool SQL Server into using bad execution plan. But how?
I was pulling my hairs until I look into the statistics of the “bad index” – the index we don’t want to use first (as it’s not very selective)
RANGE_HI_KEY RANGE_ROWS EQ_ROWS 15 0 474128 17 0 479083 18 0 479083 19 0 479083 192 0 439755 193 0 169223 194 0 410397 195 0 444977 196 0 169223 197 0 444346 198 0 444006 199 0 17114 200 0 413320 201 0 169223 202 0 169223 203 0 444972 204 0 375462 205 0 169223 206 0 169223 207 0 7205 208 0 120735 209 0 443485 210 0 41117 211 0 439649 212 0 406152 213 0 169223 214 0 444985 215 0 444974 216 0 169223 217 0 155334 218 0 154951 219 0 169223 220 0 155273 221 0 154890 222 0 169223 223 0 155225 224 0 3800 225 0 75458 226 0 61 227 0 193 228 0 7326 230 0 83 231 0 55 232 0 50205 233 0 59 234 0 86 235 0 7368 237 0 114 239 0 38384 240 0 57 242 0 40194 247 0 106 248 0 1831 250 0 122113 254 0 23 255 0 240 257 0 456
Gotcha! This index is very oddly distributed. Some of the values have very few matching rows. In the big query, one of the those values matched, and that fooled SQL Server optimizer to think that it’s better to use the index on MetaFieldId
. Because there were so few matching rows, that would end up in fewer read, and can be faster overall. But it was wrong.
This is a kind of “parameter sniffing” problem. An oddly distributed data can make SQL Server uses the wrong, slow execution plan. It might be faster for a specific value, but will be much slower on the remaining data.
Now we know what caused the problem, is there a fix? Making data evenly distributed is not an option here – that is entirely up to site implementation, and out of our control.
Luckily for us, SQL Server provides a hint to enforce join order. Because here we know for sure the WorkId
will always work better, we can ensure that it’s joined first, by adding OPTION (FORCE ORDER)
The fixed query looks like this
ALTER PROCEDURE [dbo].[ecfVersionProperty_ListByWorkIds] @ContentLinks dbo.udttObjectWorkId readonly AS BEGIN CREATE TABLE #nonMasterLinks ( ObjectId INT, ObjectTypeId INT, WorkId INT, DefaultLanguage NVARCHAR(50), [Status] INT, MasterWorkId INT ) INSERT INTO #nonMasterLinks SELECT l.ObjectId, l.ObjectTypeId, l.WorkId, d.MasterLanguageName, d.[Status], NULL FROM @ContentLinks l INNER JOIN ecfVersion d ON l.WorkId = d.WorkId WHERE d.LanguageName <> d.MasterLanguageName COLLATE DATABASE_DEFAULT UPDATE l SET MasterWorkId = d.WorkId FROM #nonMasterLinks l INNER JOIN ecfVersion d ON d.ObjectId = l.ObjectId AND d.ObjectTypeId = l.ObjectTypeId WHERE d.[Status] = 4 AND l.DefaultLanguage = d.LanguageName COLLATE DATABASE_DEFAULT DECLARE @IsAzureCompatible BIT SET @IsAzureCompatible = dbo.mdpfn_sys_IsAzureCompatible() -- Open and Close SymmetricKey do nothing if the system does not support encryption EXEC mdpsp_sys_OpenSymmetricKey -- select property for draft that is master language one or multi language property SELECT draftProperty.pkId, draftProperty.WorkId, draftProperty.ObjectId, draftProperty.ObjectTypeId, draftProperty.MetaFieldId, draftProperty.MetaClassId, draftProperty.MetaFieldName, draftProperty.LanguageName, draftProperty.[Boolean], draftProperty.[Number], draftProperty.[FloatNumber], [Money], [Decimal], [Date], [Binary], [String], CASE WHEN (@IsAzureCompatible = 0 AND F.IsEncrypted = 1 AND dbo.mdpfn_sys_IsLongStringMetaField(F.DataTypeId) = 1) THEN dbo.mdpfn_sys_EncryptDecryptString(draftProperty.LongString, 0) ELSE draftProperty.LongString END as LongString, draftProperty.[Guid] FROM ecfVersionProperty draftProperty INNER JOIN @ContentLinks links ON links.WorkId = draftProperty.WorkId INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId -- and fall back property UNION ALL SELECT draftProperty.pkId, draftProperty.WorkId, draftProperty.ObjectId, draftProperty.ObjectTypeId, draftProperty.MetaFieldId, draftProperty.MetaClassId, draftProperty.MetaFieldName, draftProperty.LanguageName, draftProperty.[Boolean], draftProperty.[Number], draftProperty.[FloatNumber], [Money], [Decimal], [Date], [Binary], [String], CASE WHEN (@IsAzureCompatible = 0 AND F.IsEncrypted = 1 AND dbo.mdpfn_sys_IsLongStringMetaField(F.DataTypeId) = 1) THEN dbo.mdpfn_sys_EncryptDecryptString(draftProperty.LongString, 0) ELSE draftProperty.LongString END as LongString, draftProperty.[Guid] FROM ecfVersionProperty draftProperty INNER JOIN #nonMasterLinks links ON links.MasterWorkId = draftProperty.WorkId INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId WHERE F.MultiLanguageValue = 0 OPTION (FORCE ORDER) EXEC mdpsp_sys_CloseSymmetricKey DROP TABLE #nonMasterLinks END
Does it work? Yes, it works wonder. The small queries are now consistently fast. And the big query only takes around 3.5 seconds. Yeah, given the site of the catalog, I’m happy with the result.
So to sum up, SQL Server is smart, but it can be fooled. And if you know for sure something, the query hint can provide huge boost on performance, and even better, help to stabilize your plan.
Thanks for reading!
Part 2 is here:Â http://vimvq1987.com/curious-case-sql-execution-plan-part-2/
Thanks for sharing this. It is very interesting to know about the deep part of query optimization.
Interesting read, thanks Quan!
This really adds to my SQL\EPiServer experience.
Thank you so much taking the time to share the knowledge.