A curious case of SQL Server function

This time, we will talk about ecfVersion_ListFiltered, again.

This stored procedure was previously the subject of several blog posts regarding SQL Server performance optimizations. When I thought it is perfect (in term of performance), I learned something more.

Recently we received a performance report from a customer asking about an issue after upgrading from Commerce 10.4.2 to Commerce 10.8 (the last version before Commerce 11). The job “Publish Delayed Content Versions” starts to throw timeout exceptions.

This scheduled job calls to a ecfVersion_ListFiltered to load the content versions which are in status DelayedPublish, it looks like this when it reaches SQL Server:

declare @s [udttIdTable]
insert into @s values(6)
exec dbo.ecfVersion_ListFiltered @Statuses = @s, @StartIndex = 0, @MaxRows = 2147483646

This query is known to be slow. The reason is quite obvious – Status contains only 5 or 6 distinct values, so it’s not indexed. SQL Server will have to do a Clustered Index Scan, and if ecfVersion is big enough, it’s inevitably slow.

But being slow does not mean it’s allowed to not work. Under any reasonable circumstances, the query should complete in under 30 seconds (the default timeout value), and return to the caller.

So it needs to be fixed.

[quads id=2]

I was in the direct contact with the partner developerE, and of course, as always, I suggested him to do some tests. Rebuilding index is the first thing to think about – as the query relies heavily on a Clustered Index Scan, a highly fragmented Clustered Index can severely affect performance, and rebuilding it would help.

It didn’t.

Knowing we improved this query in more reason versions, I suggested him to try to upgrade to later one, and eventually sent the newer version of the stored procedure. It was 2x times faster on another customer database.

It didn’t help, neither.

I was really puzzled. Even after getting their database to do the tests myself, I couldn’t find anything suspicious from the execution plan, the time and IO statistics. Everything looks normal.

Time to blame the hardware?

Not really, the developer said the old version was quite fast on same hardware. So a change in the database made the SP slower between these two versions. I finally scoped it down to the change in the SP from

	SET NOCOUNT ON

	DECLARE @StatusCount INT
	SELECT @StatusCount = COUNT(*) FROM @Statuses

	DECLARE @LanguageCount INT
	SELECT @LanguageCount = COUNT(*) FROM @Languages
	
	;WITH TempResult as
	(
		SELECT ROW_NUMBER() OVER(ORDER BY vn.Modified DESC) as RowNumber, vn.*
		FROM
			dbo.ecfVersion vn
		WHERE
			vn.CurrentLanguageRemoved = 0 AND
			((@ObjectId IS NULL) OR vn.ObjectId = @ObjectId) AND
			((@ObjectTypeId IS NULL) OR vn.ObjectTypeId = @ObjectTypeId) AND
			((@ModifiedBy IS NULL) OR vn.ModifiedBy = @ModifiedBy) AND
			((@StatusCount = 0) OR (vn.[Status] IN (SELECT ID FROM @Statuses))) AND
            ((@LanguageCount = 0) OR (vn.LanguageName IN (SELECT LanguageCode FROM @Languages)))
	)
	SELECT  WorkId, ObjectId, ObjectTypeId, Name, LanguageName, MasterLanguageName, IsCommonDraft, StartPublish, ModifiedBy, Modified, [Status], (SELECT COUNT(*) FROM TempResult) AS TotalRows
	FROM    TempResult
	WHERE	RowNumber BETWEEN (@StartIndex + 1) AND (@MaxRows + @StartIndex)

to:

	DECLARE @StatusCount INT
	SELECT @StatusCount = COUNT(*) FROM @Statuses
	DECLARE @LanguageCount INT
	SELECT @LanguageCount = COUNT(*) FROM @Languages
	
	;WITH TempResult as
	(
		SELECT ROW_NUMBER() OVER(ORDER BY vn.Modified DESC) as RowNumber, vn.*
		FROM
			dbo.ecfVersion vn
		WHERE
			[dbo].ecf_IsCurrentLanguageRemoved(vn.CatalogId, vn.LanguageName) = 0 AND
			((@ObjectId IS NULL) OR vn.ObjectId = @ObjectId) AND
			((@ObjectTypeId IS NULL) OR vn.ObjectTypeId = @ObjectTypeId) AND
			((@ModifiedBy IS NULL) OR vn.ModifiedBy = @ModifiedBy) AND
			((@StatusCount = 0) OR (vn.[Status] IN (SELECT ID FROM @Statuses))) AND
            ((@LanguageCount = 0) OR (vn.LanguageName IN (SELECT LanguageCode FROM @Languages)))
			
	)
	SELECT  WorkId, ObjectId, ObjectTypeId, Name, LanguageName, MasterLanguageName, IsCommonDraft, StartPublish, ModifiedBy, Modified, [Status], (SELECT COUNT(*) FROM TempResult) AS TotalRows
	FROM    TempResult
	WHERE	RowNumber BETWEEN (@StartIndex + 1) AND (@MaxRows + @StartIndex)

So if you can’t spot the difference, the check for ecfVersion.CurrentLanguageRemoved was changed to an user defined function – UDF – ecf_IsCurrentLanguageRemoved. Naturally, that will be our suspicion.

But execution plan tells otherwise. There were virtually no difference between execution plans:

Slow one:

Fast one:

IO Statistics are also very similar:

Slow one:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 3235, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#BEA5F9D9'. Scan count 2, logical reads 435176, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ecfVersion'. Scan count 2, logical reads 17076, physical reads 1924, read-ahead reads 10419, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Fast one:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 3236, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#BEA5F9D9'. Scan count 2, logical reads 435176, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ecfVersion'. Scan count 2, logical reads 17076, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

But if I removed the call to the UDF, the SP was fast again. With almost no difference in execution plan and number of IOs read. We can safely assume that the UDF is called for every row, and that makes it slow.

Why?

I pulled my hairs for a while (there is a joke running inside Commerce development team, if you have too many hairs, you are not experienced enough) Things only come clear when we get the execs stats

SELECT 
 text
 , execution_count
 , total_logical_reads / execution_count  AvgReads
 , total_worker_time / execution_count  AvgWorkersTime
 , total_elapsed_time / execution_count  AvgElapsedTime
FROM  sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
WHERE text LIKE '%ecf_IsCurrentLanguageRemoved%'

It’s now clear that the UDF was called on every row of ecfVersion, and because we need to count the number of rows satisfied the filter conditions, it’s called twice for each row. Each call is cheap, but the total time and IOs are significant.

One thing to remember when work with SQL Server is that it’s designed to work with sets, not individual rows. That’s explain why SQL Server performs very bad in this case: it is not allowed to do what it does best.

A simple fix would be to throw away the UDF, and to join with CatalogLanguage directly, something like this:

	SET NOCOUNT ON
	DECLARE @StatusCount INT
	SELECT @StatusCount = COUNT(*) FROM @Statuses
	DECLARE @LanguageCount INT
	SELECT @LanguageCount = COUNT(*) FROM @Languages
	
	;WITH TempResult as
	(
		SELECT ROW_NUMBER() OVER(ORDER BY vn.Modified DESC) as RowNumber, vn.*
		FROM
			dbo.ecfVersion vn
		INNER JOIN CatalogLanguage cl ON vn.CatalogId = cl.CatalogId AND vn.LanguageName = cl.LanguageCode
		WHERE
			((@ObjectId IS NULL) OR vn.ObjectId = @ObjectId) AND
			((@ObjectTypeId IS NULL) OR vn.ObjectTypeId = @ObjectTypeId) AND
			((@ModifiedBy IS NULL) OR vn.ModifiedBy = @ModifiedBy) AND
			((@StatusCount = 0) OR (vn.[Status] IN (SELECT ID FROM @Statuses))) AND
            ((@LanguageCount = 0) OR (vn.LanguageName IN (SELECT LanguageCode FROM @Languages)))
			
	)
	SELECT  WorkId, ObjectId, ObjectTypeId, Name, LanguageName, MasterLanguageName, IsCommonDraft, StartPublish, ModifiedBy, Modified, [Status], (SELECT COUNT(*) FROM TempResult) AS TotalRows
	FROM    TempResult
	WHERE	RowNumber BETWEEN (@StartIndex + 1) AND (@MaxRows + @StartIndex)

Lesson learned:

  • Always check the query you are going to change on some sizable data set. The result can be quite shocking.

One of the most important rules for performance optimizations is to not do worse. I.e. you have to keep checking the performance to ensure that performance is not degrading over time. A 5% degradation seems to be small, but ten of them can has significant impact.

The SP became 15-20 times slower, and nobody raised suspicions about it!

  • UDF can be a really sneaky b*stard. Execution plan does not show time consumed by an UDF. IO statistics does not include IO made by an UDF. It’s only be caught by the query statistics and that can be easily missed in 95% of the case.

Can it be even better? Yes, it can. The “updated” version (as we made some other improvements to ecfVersion_ListFiltered recently), which will be in an upcoming release, will perform like this:

SQL Server Execution Times:
CPU time = 201 ms, elapsed time = 540 ms.

So basically more than 50x times faster the affected one, and 3x times faster the fixed one.

That’s why you should upgrade to latest version, my friends!

2 thoughts on “A curious case of SQL Server function

  1. Currently working on upgrading from 9.24.1 to latest CMS/Commerce 10.

    We’re experiencing this issue, where the “Publish Delayed Content Versions” times out after 30 seconds. This is currently blocking us from deploying into production.

    Is it a viable option to modify the existing SP according to the solution you posted here? Or is there another option besides furter upgrading the next major version? (if solved there)

Leave a Reply

Your email address will not be published. Required fields are marked *