Varchar can be harmful to your performance

As string is the most common data type in an application, nvarchar and its variant varchar are probably the most common column types in your database. (We almost always use nvarchar because nchar is meant for fixed length columns which we don’t have). The difference is that nvarchar has encoding of UTF-16/USC-2 while varchar has UTF-8

Starting with SQL Server 2012 (11.x), when a Supplementary Character (SC) enabled collation is used, these data types store the full range of Unicode character data and use the UTF-16 character encoding. If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.

But varchar can be harmful in a way that you don’t expect it to. Let’s assume we have this simple table with two columns (forgive naming, I can’t come up with better names)

CREATE TABLE [dbo].[Demo](
	[varcharColumn] [varchar](50) NULL,
	[nvarcharColumn] [nvarchar](50) NULL
)

Each will be inserted with same random values, almost unique. We will add a non clustered index on each of these columns, and as we know, the index should be very efficient on querying based on those values.

Let’s try with out varchar column first. It should work pretty well right. Nope!

SELECT *
  FROM dbo.[Demo]
  where varcharColumn = N'0002T9'

Instead of a highly efficient Index seek, it does an Index scan on the entire table. This is of course not what you want to.

But, why? Good question. You might have noticed that I used N’0002T9′ which is a nvarchar type – which is what .NET would pass to your query if your parameter is of type string. If you look closer to the execution plan, you’ll see that SQL Server has to do a CONVERT_IMPLICIT on each row of this column, effectively invalidates the index.

If we pass ‘0002T9’ without the notion though, it works as it should, this can cause the confusion as it works during development, but once deployed it is much slower

To see the difference we can run the queries side by side. Note that this is for a very simple table with 130k rows. If you have a few millions or more rows, the difference will be even bigger.

(1 row affected)
Table 'Demo'. Scan count 1, logical reads 4, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.


(1 row affected)
Table 'Demo'. Scan count 1, logical reads 422, physical reads 0, page server reads 0, read-ahead reads 14, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

What’s about the vice versa? If we have data as nvarchar(100) but the parameter is passed as varchar ? SQL Server can handle it with ease. It simply converts the parameters to nvarchar and does an index seek, as it should

So moral of the story? Unless you have strong reasons to use varchar (or char ), stick with nvarchar (or nchar ) to avoid complications with data type conversion which can, and will hurt your database performance.

Performance optimization – the hardcore series – part 4

Let’s take a break from the memory allocation, and do some optimization on another aspect, yet as important (if not even more important) – database.

We all know that database queries play an essential part in any serious app. It’s almost a given that if you want your app to perform well, your database queries must also perform well. And for them to perform well, you need things like proper design (normalization, references etc.), properly written queries, and proper indexes. In this post, we will explore how an index can improve query performance, and how can we do it better.

Let’s start with a fairly simple table design

CREATE TABLE [dbo].[UniqueCoupon](
	[Id] [int] identity primary key clustered, 
	[PromotionId] [int] NOT NULL,
	[Code] [nvarchar](10) NOT NULL,
	[ExpiredOn] [datetime] NULL,
	[Redeemed] [bit] NULL
) ON [PRIMARY]

Nothing extraordinary here, pretty common if you ask me. Now for testing purpose, let’s insert 1.000.000 rows into it

INSERT INTO  dbo.[UniqueCoupon] (PromotionId, Code)
SELECT

FLOOR(RAND()*(100)+1),
SUBSTRING(CONVERT(varchar(255), NEWID()), 0, 7)

GO 1000000

We need to query data by the code, so let’s create an user defined type

CREATE TYPE CouponTable AS TABLE (
    Code NVARCHAR(10));

Time to run some query against data, let’s go with this

SELECT Id, PromotionId, Code, ExpiredOn, Redeemed FROM dbo.UniqueCoupons
                                                                    WHERE PromotionId = @PromotionId AND Code in (SELECT Code FROM @Data)

This is the complete query as we need some data

	declare @data CouponTable
	insert into @data 
	select top 10 code from dbo.UniqueCoupon 
	where promotionid = 36

	SELECT Id, PromotionId, Code, ExpiredOn, Redeemed FROM dbo.UniqueCoupon
                                                                    WHERE PromotionId = 36 AND Code in (SELECT Code FROM @Data)

As we learned that execution plan is not a good way to compare performance, let’s use the statistics, our trusted friends

																	set statistics io on
																	set statistics time on

And this is how it takes with our default setting (i.e. no index)

(10 rows affected)
Table '#AEDEED61'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'UniqueCoupon'. Scan count 9, logical reads 7070, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

If you are somewhat experienced with SQL Server, you might guess it would not be exactly happy because of, obviously an index is needed. As we query on PromotionId, it does makes sense to add an index for it, SQL Server does give you that

If we just blindly add the index suggested by SQL Server

(10 rows affected)
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'UniqueCoupon'. Scan count 1, logical reads 53, physical reads 0, page server reads 0, read-ahead reads 5, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#A7AA9B2B'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

But can we do better?

If we look at the index, there’s something not very optimized about it – we are query by both PromotionId and Code, so not really makes senses to have Code as included. How’s about we have the index on both PromotionId and Code?

(10 rows affected)
Table 'UniqueCoupon'. Scan count 10, logical reads 30, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#A1F9F38F'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

Yet we can make it better! From 53 to 30 logical reads might not sound a lot, but if you have thousands of queries every hour, it will be fairly significant.

Prepare yourself for some pleasant surprises – when we eventually applied the change on an actual database, the change was staggering, much more than what we hoped for. The query that were run for 24h in total, every day, now takes less than 10 minutes (yes you read it right, 10 minutes).

At this point you can certainly be happy and move on. But can we do better? For the sake of curiosity ? Yes we do.

SQL Server is rather smart that it knows we are getting the other columns as well, so those will be included in the index, to avoid a key lookup. Let’s see if we can remove that and see how it performs

(10 rows affected)
Table 'UniqueCoupon'. Scan count 10, logical reads 60, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#B1996E94'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

So it was indeed worse, a key lookup is performed for every row (SQL Server uses the index to track down the rows and read the other columns from there)

There are two way to get rid of those key lookup – includes the columns in the index itself, or, more dramatic, make the index the clustered. As we can see the data should be accessed by PromotionId and Code, it makes perfect senses.

It is a commonly belief that Identity column should be clustered index – it is unique, it is not null. However, it only makes senses if it is the most heavily accessed column. In this case, Id only serves as an Identity column, it does not need to be the clustered index (although being an unique means it will has a non clustered index for it)

ALTER TABLE [dbo].[UniqueCoupon] DROP CONSTRAINT [PK__UniqueCo__3214EC0744C2FF38] WITH ( ONLINE = OFF )
GO

ALTER TABLE [dbo].[UniqueCoupon] ADD PRIMARY KEY NONCLUSTERED 
(
	[Id] ASC
)

Does this bring dramatically performance change? Unlikely. My test show no improvement in statistic. However, there is one critical impact here: we significantly reduced the size of indexes in the table. (data incoming)

Moral of the story

  • Indexes are crucial.
  • You can almost always do better than the auto suggested indexes.
  • Real test is the only true validation.