This time, we will talk about ecfVersion_ListFiltered, again.

This stored procedure was previously the subject of several blog posts regarding SQL Server performance optimizations. When I thought it is perfect (in term of performance), I learned something more.

Recently we received a performance report from a customer asking about an issue after upgrading from Commerce 10.4.2 to Commerce 10.8 (the last version before Commerce 11). The job “Publish Delayed Content Versions” starts to throw timeout exceptions.

This scheduled job calls to a ecfVersion_ListFiltered to load the content versions which are in status DelayedPublish, it looks like this when it reaches SQL Server:

This query is known to be slow. The reason is quite obvious – Status contains only 5 or 6 distinct values, so it’s not indexed. SQL Server will have to do a Clustered Index Scan, and if ecfVersion is big enough, it’s inevitably slow.

But being slow does not mean it’s allowed to not work. Under any reasonable circumstances, the query should complete in under 30 seconds (the default timeout value), and return to the caller.

So it needs to be fixed.

I was in the direct contact with the partner developerE, and of course, as always, I suggested him to do some tests. Rebuilding index is the first thing to think about – as the query relies heavily on a Clustered Index Scan, a highly fragmented Clustered Index can severely affect performance, and rebuilding it would help.

It didn’t.

Knowing we improved this query in more reason versions, I suggested him to try to upgrade to later one, and eventually sent the newer version of the stored procedure. It was 2x times faster on another customer database.

It didn’t help, neither.

I was really puzzled. Even after getting their database to do the tests myself, I couldn’t find anything suspicious from the execution plan, the time and IO statistics. Everything looks normal.

Time to blame the hardware?

Not really, the developer said the old version was quite fast on same hardware. So a change in the database made the SP slower between these two versions. I finally scoped it down to the change in the SP from

to:

So if you can’t spot the difference, the check for ecfVersion.CurrentLanguageRemoved was changed to an user defined function – UDF – ecf_IsCurrentLanguageRemoved. Naturally, that will be our suspicion.

But execution plan tells otherwise. There were virtually no difference between execution plans:

Slow one:

Fast one:

IO Statistics are also very similar:

Slow one:

Fast one:

But if I removed the call to the UDF, the SP was fast again. With almost no difference in execution plan and number of IOs read. We can safely assume that the UDF is called for every row, and that makes it slow.

Why?

I pulled my hairs for a while (there is a joke running inside Commerce development team, if you have too many hairs, you are not experienced enough) Things only come clear when we get the execs stats

It’s now clear that the UDF was called on every row of ecfVersion, and because we need to count the number of rows satisfied the filter conditions, it’s called twice for each row. Each call is cheap, but the total time and IOs are significant.

One thing to remember when work with SQL Server is that it’s designed to work with sets, not individual rows. That’s explain why SQL Server performs very bad in this case: it is not allowed to do what it does best.

A simple fix would be to throw away the UDF, and to join with CatalogLanguage directly, something like this:

Lesson learned:

  • Always check the query you are going to change on some sizable data set. The result can be quite shocking.

One of the most important rules for performance optimizations is to not do worse. I.e. you have to keep checking the performance to ensure that performance is not degrading over time. A 5% degradation seems to be small, but ten of them can has significant impact.

The SP became 15-20 times slower, and nobody raised suspicions about it!

  • UDF can be a really sneaky b*stard. Execution plan does not show time consumed by an UDF. IO statistics does not include IO made by an UDF. It’s only be caught by the query statistics and that can be easily missed in 95% of the case.

Can it be even better? Yes, it can. The “updated” version (as we made some other improvements to ecfVersion_ListFiltered recently), which will be in an upcoming release, will perform like this:

SQL Server Execution Times:
CPU time = 201 ms, elapsed time = 540 ms.

So basically more than 50x times faster the affected one, and 3x times faster the fixed one.

That’s why you should upgrade to latest version, my friends!


Leave a Reply

%d bloggers like this: