A curious case of SQL execution plan

I said this already, and I will say it again: SQL Server optimizer is smart. I can even go further and say, it’s smarter than you and me (I have no doubt that you are smart, even very, very smart 🙂 ). So most of the cases, you leave it to do whatever it thinks is the best.

But there are cases SQL Server optimizer is fooled by the engine – it gets confused and chooses an sub-optimal plan, because it was given wrong, outdated, or incorrect information. That’s when you need to step in.

Today I face one case like that, as reported here: http://world.episerver.com/forum/developer-forum/Episerver-Commerce/Thread-Container/2017/10/database-timeout-on-productvariant-update/

So the slow query looks like this

declare @p1 dbo.udttObjectWorkId
insert into @p1 values(110830,0,1378017,NULL)
insert into @p1 values(110830,0,1356813,NULL)
insert into @p1 values(110830,0,1356812,NULL)
insert into @p1 values(110830,0,1356810,NULL)
insert into @p1 values(110830,0,1356811,NULL)
insert into @p1 values(96861,0,1380113,NULL)
insert into @p1 values(96861,0,1329903,NULL)
insert into @p1 values(96861,0,1281127,NULL)
insert into @p1 values(96861,0,1265426,NULL)
insert into @p1 values(96861,0,1260440,NULL)
insert into @p1 values(96861,0,1247704,NULL)
insert into @p1 values(96861,0,469360,NULL)
insert into @p1 values(96861,0,469359,NULL)
insert into @p1 values(96861,0,444058,NULL)
insert into @p1 values(96861,0,444057,NULL)
insert into @p1 values(96861,0,415689,NULL)
insert into @p1 values(96861,0,415688,NULL)
insert into @p1 values(96861,0,379004,NULL)
insert into @p1 values(96861,0,379003,NULL)
insert into @p1 values(96861,0,236924,NULL)
insert into @p1 values(96861,0,236923,NULL)
insert into @p1 values(96861,0,233462,NULL)
insert into @p1 values(96861,0,233461,NULL)
insert into @p1 values(96861,0,190682,NULL)
insert into @p1 values(96861,0,190681,NULL)
insert into @p1 values(96441,0,1380282,NULL)
insert into @p1 values(96441,0,1330257,NULL)
insert into @p1 values(96441,0,1281483,NULL)
insert into @p1 values(96441,0,1265765,NULL)
insert into @p1 values(96441,0,1260796,NULL)
insert into @p1 values(96441,0,1248075,NULL)
insert into @p1 values(96441,0,468196,NULL)
insert into @p1 values(96441,0,468195,NULL)
insert into @p1 values(96441,0,442894,NULL)
insert into @p1 values(96441,0,442893,NULL)
insert into @p1 values(96441,0,414525,NULL)
insert into @p1 values(96441,0,414524,NULL)
insert into @p1 values(96441,0,377840,NULL)
insert into @p1 values(96441,0,377839,NULL)
insert into @p1 values(96441,0,235664,NULL)
insert into @p1 values(96441,0,235663,NULL)
insert into @p1 values(96441,0,232202,NULL)
insert into @p1 values(96441,0,232201,NULL)
insert into @p1 values(96441,0,189002,NULL)
insert into @p1 values(96441,0,189001,NULL)
insert into @p1 values(96950,0,1381232,NULL)
insert into @p1 values(96950,0,1337237,NULL)
insert into @p1 values(96950,0,1278209,NULL)
insert into @p1 values(96950,0,1277635,NULL)
insert into @p1 values(96950,0,1269243,NULL)
insert into @p1 values(96950,0,1257298,NULL)
insert into @p1 values(96950,0,469585,NULL)
insert into @p1 values(96950,0,469584,NULL)
insert into @p1 values(96950,0,444283,NULL)
insert into @p1 values(96950,0,444282,NULL)
insert into @p1 values(96950,0,415914,NULL)
insert into @p1 values(96950,0,415913,NULL)
insert into @p1 values(96950,0,379229,NULL)
insert into @p1 values(96950,0,379228,NULL)
insert into @p1 values(96950,0,237191,NULL)
insert into @p1 values(96950,0,237190,NULL)
insert into @p1 values(96950,0,233729,NULL)
insert into @p1 values(96950,0,233728,NULL)
insert into @p1 values(96950,0,191038,NULL)
insert into @p1 values(96950,0,191037,NULL)
set statistics io on
set statistics time on
exec ecfVersion_ListByWorkIds @ContentLinks=@p1

It took several minutes on my machine (A fairly powerful desktop), so that is a very bad sign. By default, the command timeout for calling a stored procedure is 30 seconds – and any thing slower than that will result in a timeout exception. Of course you can change the command timeout value – but that would only be used in extraordinary case, not as a general approach. Slow queries are not only prone to timeout exception, it can cause severe issues – (very) bad user experience, deadlocks, etc. With a frequently called SP like `ecfVersion_ListByWorkIds`, the acceptable limit should be a few seconds – of course faster is better.

Let’s try with a single row instead of 65 rows for the parameter:

declare @p1 dbo.udttObjectWorkId
insert into @p1 values(110830,0,1378017,NULL)
exec ecfVersion_ListByWorkIds @ContentLinks=@p1

This took around 1 minutes and 18 seconds. Of course that is slow, but because we are trying to identify the bottleneck, it’s good that we know where to look at. Looking at the execution plan gives us a fairly good idea of what is wrong

So one statement is trying to “seek” a non clustered index. The problem is that it seeks “only” 10510143 rows. That’s the entire ecfVersionProperty table. No wonder why it’s slow. A seek like that is a definitely a bad sign, let’s look at the code corresponding to that plan :

SELECT draftProperty.pkId, draftProperty.WorkId, draftProperty.ObjectId, draftProperty.ObjectTypeId, draftProperty.MetaFieldId, draftProperty.MetaClassId, draftProperty.MetaFieldName, draftProperty.LanguageName,
			draftProperty.[Boolean], draftProperty.[Number], draftProperty.[FloatNumber], [Money], [Decimal], [Date], [Binary], [String], 
			CASE WHEN (@IsAzureCompatible = 0 AND F.IsEncrypted = 1 AND dbo.mdpfn_sys_IsLongStringMetaField(F.DataTypeId) = 1) 
				THEN dbo.mdpfn_sys_EncryptDecryptString(draftProperty.LongString, 0) 
				ELSE draftProperty.LongString END as LongString, 
			draftProperty.[Guid]
	FROM ecfVersionProperty draftProperty
	INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId
	INNER JOIN @ContentLinks links ON links.WorkId = draftProperty.WorkId

Let’s skip the select statement, it’s not of our interest. The two lines we need to look into is this:

INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId
INNER JOIN @ContentLinks links ON links.WorkId = draftProperty.WorkId

So here we have two joins – one is very selective. WorkId is an unique clustered index, so it should be very fast. The second one is MetaFieldId on ecfContentProperty table – which is not as selective. There are – as we know – 10.5M rows in ecfContentProperty, and about 240 values in MetaFieldId, so there are almost 50000 rows with same value on average. That kind of distinction is poor, so we naturally want to join by WorkId first. That would limit the number of rows a lot, and allow an easy join for MetaFieldId.

Shouldn’t SQL Server figure that out? I searched a bit and there is no definitive answer on if the order of join affect performance or not. The majority says that it should not matter, and SQL Server will be able to swap the joins for best plan. However there are people saying that the order of joins matters. Hmm – shall we?

INNER JOIN @ContentLinks links ON links.WorkId = draftProperty.WorkId
INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId

And the magic happens. The query is now blazing fast. It completes in less than 1 seconds, and the “bottleneck” has gone.

Now it uses the clustered index on WorkId as we want to. Great. Did we solve the problem? Not yet. We did make it faster, yes, but we need to confirm that it actually fixed the problem. We can try that by reverting the change and see how it behaves.

Heh. And it’s still fast, now with the original order of joins. What the heck. Plan caching? Cached buffer? No, exec sp_refreshsqlmodule 'ecfVersionProperty_ListByWorkIds' , and DBCC DROPCLEANBUFFERS, and it’s still fast (slight slower because it has to do some physical reads, but still less than 1s). In fact, I could not make it as slow as it was!

So here we can assume that it had a bad execution plan. By changing the  order, we allowed SQL Server to remove the bad execution plan from cache, and SQL Server now can choose a better plan. And as we confirmed, the order of the joins does not matter here.

So did we solve the problem we originally have? Does the “big query” run fast now? Let’s try it.

Uh oh…, it’s still slow. Taking more than 3 minutes to complete. And the interesting thing is, after running the big query, I was able to slow down the small query. It’s slow again, yay!

So we can say with fairly high certainty that the big query is able to fool SQL Server into using bad execution plan. But how?

I was pulling my hairs until I look into the statistics of the “bad index” – the index we don’t want to use first (as it’s not very selective)

RANGE_HI_KEY	RANGE_ROWS	EQ_ROWS
15	0	474128
17	0	479083
18	0	479083
19	0	479083
192	0	439755
193	0	169223
194	0	410397
195	0	444977
196	0	169223
197	0	444346
198	0	444006
199	0	17114
200	0	413320
201	0	169223
202	0	169223
203	0	444972
204	0	375462
205	0	169223
206	0	169223
207	0	7205
208	0	120735
209	0	443485
210	0	41117
211	0	439649
212	0	406152
213	0	169223
214	0	444985
215	0	444974
216	0	169223
217	0	155334
218	0	154951
219	0	169223
220	0	155273
221	0	154890
222	0	169223
223	0	155225
224	0	3800
225	0	75458
226	0	61
227	0	193
228	0	7326
230	0	83
231	0	55
232	0	50205
233	0	59
234	0	86
235	0	7368
237	0	114
239	0	38384
240	0	57
242	0	40194
247	0	106
248	0	1831
250	0	122113
254	0	23
255	0	240
257	0	456

Gotcha! This index is very oddly distributed. Some of the values have very few matching rows. In the big query, one of the those values matched, and that fooled SQL Server optimizer to think that it’s better to use the index on MetaFieldId. Because there were so few matching rows, that would end up in fewer read, and can be faster overall. But it was wrong.

This is a kind of “parameter sniffing” problem. An oddly distributed data can make SQL Server uses the wrong, slow execution plan. It might be faster for a specific value, but will be much slower on the remaining data.

Now we know what caused the problem, is there a fix? Making data evenly distributed is not an option here – that is entirely up to site implementation, and out of our control.

Luckily for us, SQL Server provides a hint to enforce join order. Because here we know for sure the WorkId will always work better, we can ensure that it’s joined first, by adding OPTION (FORCE ORDER)

The fixed query looks like this

ALTER PROCEDURE [dbo].[ecfVersionProperty_ListByWorkIds]
	@ContentLinks dbo.udttObjectWorkId readonly
AS
BEGIN
	CREATE TABLE #nonMasterLinks (
		ObjectId INT, 
		ObjectTypeId INT, 
		WorkId INT,
		DefaultLanguage NVARCHAR(50),
		[Status] INT,
		MasterWorkId INT
	)

	INSERT INTO #nonMasterLinks
	SELECT l.ObjectId, l.ObjectTypeId, l.WorkId, d.MasterLanguageName, d.[Status], NULL
	FROM @ContentLinks l
	INNER JOIN ecfVersion d ON l.WorkId = d.WorkId
	WHERE d.LanguageName <> d.MasterLanguageName COLLATE DATABASE_DEFAULT

	UPDATE l SET MasterWorkId = d.WorkId
	FROM #nonMasterLinks l
	INNER JOIN ecfVersion d ON d.ObjectId = l.ObjectId AND d.ObjectTypeId = l.ObjectTypeId
	WHERE d.[Status] = 4 AND l.DefaultLanguage = d.LanguageName COLLATE DATABASE_DEFAULT
	DECLARE @IsAzureCompatible BIT

	SET @IsAzureCompatible = dbo.mdpfn_sys_IsAzureCompatible()
	-- Open and Close SymmetricKey do nothing if the system does not support encryption
	EXEC mdpsp_sys_OpenSymmetricKey
	-- select property for draft that is master language one or multi language property
	SELECT draftProperty.pkId, draftProperty.WorkId, draftProperty.ObjectId, draftProperty.ObjectTypeId, draftProperty.MetaFieldId, draftProperty.MetaClassId, draftProperty.MetaFieldName, draftProperty.LanguageName,
			draftProperty.[Boolean], draftProperty.[Number], draftProperty.[FloatNumber], [Money], [Decimal], [Date], [Binary], [String], 
			CASE WHEN (@IsAzureCompatible = 0 AND F.IsEncrypted = 1 AND dbo.mdpfn_sys_IsLongStringMetaField(F.DataTypeId) = 1) 
				THEN dbo.mdpfn_sys_EncryptDecryptString(draftProperty.LongString, 0) 
				ELSE draftProperty.LongString END as LongString, 
			draftProperty.[Guid]
	FROM ecfVersionProperty draftProperty

	INNER JOIN @ContentLinks links ON links.WorkId = draftProperty.WorkId
		INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId


				
	-- and fall back property
	UNION ALL
	SELECT draftProperty.pkId, draftProperty.WorkId, draftProperty.ObjectId, draftProperty.ObjectTypeId, draftProperty.MetaFieldId, draftProperty.MetaClassId, draftProperty.MetaFieldName, draftProperty.LanguageName,
			draftProperty.[Boolean], draftProperty.[Number], draftProperty.[FloatNumber], [Money], [Decimal], [Date], [Binary], [String], 
			CASE WHEN (@IsAzureCompatible = 0 AND F.IsEncrypted = 1 AND dbo.mdpfn_sys_IsLongStringMetaField(F.DataTypeId) = 1) 
				THEN dbo.mdpfn_sys_EncryptDecryptString(draftProperty.LongString, 0) 
				ELSE draftProperty.LongString END as LongString, 
			draftProperty.[Guid]
	FROM ecfVersionProperty draftProperty
			INNER JOIN #nonMasterLinks links ON links.MasterWorkId = draftProperty.WorkId
		INNER JOIN MetaField F ON F.MetaFieldId = draftProperty.MetaFieldId



	WHERE F.MultiLanguageValue = 0
	OPTION	(FORCE ORDER)
	
	EXEC mdpsp_sys_CloseSymmetricKey
	DROP TABLE #nonMasterLinks
END

Does it work? Yes, it works wonder. The small queries are now consistently fast. And the big query only takes around 3.5 seconds. Yeah, given the site of the catalog, I’m happy with the result.

So to sum up, SQL Server is smart, but it can be fooled. And if you know for sure something, the query hint can provide huge boost on performance, and even better, help to stabilize your plan.

Thanks for reading!

Part 2 is here: http://vimvq1987.com/curious-case-sql-execution-plan-part-2/

5 thoughts on “A curious case of SQL execution plan

Leave a Reply

Your email address will not be published. Required fields are marked *