syndicate

Getting an INDEX SEEK to Speed up LIKE “%string%” Searches

Posted on Updated on

In today’s blog I will attempt to challenge the popularly held notion that LIKE “%string%” wildcard searches must be slow (Sargability: Why %string% Is Slow).

A Sample Table Populated with 10 Million Rows of Test Data

In order to do this, we’ll need a large table of test data with a composite PRIMARY KEY to demonstrate various aspects of the issue.

CREATE TABLE dbo.TestLIKESearches
(
    ID1         INT
    ,ID2        INT
    ,AString    VARCHAR(100)
    ,Value      INT
    ,PRIMARY KEY (ID1, ID2)
);

WITH Tally (n) AS
(
SELECT TOP 10000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO dbo.TestLIKESearches
    (ID1, ID2, AString, Value)
SELECT 1+n/500, n%500
    ,CASE WHEN n%500 > 299 THEN
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            RIGHT(1000+n%1000, 3) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1)
          END
    ,1+ABS(CHECKSUM(NEWID()))%100
FROM Tally;

While we have set the AString column to contain some NULL values (60% in fact), what we are about to show works even when 0% of the rows are NULL. Let’s start by showing the non-SARGable version of the query and its corresponding actual execution plan.

SELECT ID1, ID2, AString
FROM dbo.TestLIKESearches
WHERE AString LIKE '%21%';


As expected, the non-SARGable query results in a Clustered Index SCAN on the table. Let’s see if we can affect this by adding the following INDEX.

CREATE INDEX tls_ix1 ON dbo.TestLIKESearches(AString);

When we check the execution plan again for the same query, it now looks like this:


The query is now using the NONCLUSTERED INDEX, but it is still doing a SCAN. So let’s try a slightly different form of the same query, one which we know must generate the exact same results.

SELECT ID1, ID2, AString
FROM dbo.TestLikeSearches
WHERE AString IS NOT NULL AND AString LIKE '%21%';

All we have done is to modify the filter in the WHERE clause to ignore NULL values, none of which could contain our search string anyway. Now our execution plan looks like this:

Since a SEEK should in theory be better than a SCAN, we’re hopeful that the query’s speed (elapsed time) is improved also, but we’ll get to that in a moment.

The same small change to an UPDATE converts the INDEX SCAN to a SEEK, as demonstrated by the two queries below.

UPDATE dbo.TestLIKESearches
SET Value = 300
WHERE AString LIKE '%21%';

UPDATE dbo.TestLIKESearches
SET Value = 400
WHERE AString IS NOT NULL AND AString LIKE '%21%';


On a SELECT however, we can show just how easily this SEEK can be broken. Let’s modify our query to also return the Value column.

SELECT ID1, ID2, AString, Value
FROM dbo.TestLikeSearches
WHERE AString IS NOT NULL AND AString LIKE '%21%';


By adding Value into the returned results, we have broken the SEEK; it has reverted to a SCAN.

But there is a way around this. We can use a query hint (FORCESEEK) to restore our INDEX SEEK.

SELECT ID1, ID2, AString, Value
FROM dbo.TestLikeSearches WITH(FORCESEEK)
WHERE AString IS NOT NULL AND AString LIKE '%21%';

Performance Comparison of SCAN vs. SEEK for a LIKE “%string%” Search

The following table summarizes the performance results we got from this method of getting SQL Server to SEEK the INDEX on our string column during the LIKE “%string%” search.

Logical Reads CPU Elapsed Time
Base SELECT Query Without INDEX 33193 2246 628
Base SELECT Query With INDEX 24169 2310 582
Improved SELECT Query (INDEX SEEK) 13659 1513 405
    Improvement 43% 35% 30%
Base UPDATE Query With INDEX 146678 2434 812
Improved UPDATE Query (INDEX SEEK) 136168 1763 546
    Improvement 7% 28% 33%
SELECT with Value (INDEX SCAN) 33193 2620 665
SELECT with Value (FORCESEEK) 136193 1794 455
    Improvement -310% 32% 32%

In all cases, by forcing the SEEK (even when it results in an added Key Lookup) we were able to improve elapsed and CPU times to a measurable degree. Only the FORCESEEK query hint on the SELECT when non-indexed columns are included actually increased the logical IO count (albeit by quite a bit).

Conclusion

Despite the commonly accepted belief that a LIKE “%string%” search is limited to an INDEX SCAN, we have proven that it is possible to make it happen with a properly constructed non-clustered INDEX on the string being searched.

The SEEK is easily obtained for either SELECT or UPDATE, and probably DELETE and/or MERGE as well (although we didn’t test these cases) with just a small additional filtering criteria (excluding NULLs).

The SEEK can also just as easily be broken by including columns in the SELECT that aren’t in our non-clustered INDEX, however even then using a FORCESEEK query hint can restore it.

You’ll be able to read more about the results of this technique and how well it improves performance against a wider range of NULL values in an upcoming article that I have submitted to http://www.sqlservercentral.com/.  You can generally expect that as the percentage of NULL values decreases, the performance gain will not be as much.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Common Table Expressions in SQL

Posted on

In SQL Server 2005, Microsoft introduced the Common Table Expression (CTE).  CTEs share similarities with VIEWS and derived tables, but are really not the same as either.  Oracle SQL also supports CTEs and while the syntax is basically the same, some of the properties that we’ll discuss may be slightly different.

Let’s take a look at a very simple CTE to get started.

CREATE TABLE #Vehicles
(
    VehicleID     VARCHAR(5)
    ,VehicleType  VARCHAR(5)
    ,Location     VARCHAR(3)
    ,PRIMARY KEY (VehicleID)
);

INSERT INTO #Vehicles
VALUES ('12211', 'TRUCK', 'BKK'),('12212', 'CAR', 'BKK'),('12213', 'TRUCK', 'CNX')
    ,('12214', 'CAR', 'CNX'),('12215', 'TRUCK', 'HDY'),('12216', 'CAR', 'HDY');

WITH BKKVehicles AS
(
    SELECT VehicleID, VehicleType, Location
        ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID)
    FROM #Vehicles
    WHERE Location = 'BKK'
)
SELECT VehicleID, VehicleType, Location
FROM BKKVehicles;

Our CTE begins with the keyword WITH and ends at the closing parenthesis.  Below the CTE is what I’ll call the “main query.”  This CTE retrieves only vehicles whose location is BKK and adds a ROW_NUMBER to that result:

VehicleID VehicleType  Location  rn
12211     TRUCK        BKK       1
12212     CAR          BKK       2

There is a widespread belief that CTEs can improve performance, but the truth is they neither improve nor detract from performance.  They are simply a way to make your code more readable, although they do offer a couple of things that may also make your life a bit easier.  Let’s look at some of the rules/properties of a CTE, comparing and contrasting with VIEWs and derived tables where appropriate.

  • You must remember to terminate the statement preceding the CTE with a semicolon, otherwise SQL will throw this error message at you:

Incorrect syntax near the keyword ‘with’. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.

  • You can rename the columns returned by the CTE by providing the column names between parentheses immediately after the name of the CTE (our CTE is named BKKVehicles) and before “AS.”
  • A CTE must contain a SELECT and it may not contain INSERT, UPDATE, DELETE or MERGE statements.
  • CTEs will inherit the indexing of the tables upon which they are based.
  • CTEs are more like a derived table than a VIEW because they exist only for the life of the main query which follows them.  In order to reuse a CTE in a subsequent query, you must resupply the same code to the second query.
  • You can use a CTE as source or target in UPDATE, INSERT, DELETE and MERGE queries, but there are some restrictions.  This is similar to a VIEW.
  • You may have more than one CTE associated with a query.  When more than one CTE is defined, they are referred to as “cascaded” or “stacked” CTEs.  You may not however nest CTEs within CTEs.
  • You may code CTEs within VIEWs, FUNCTIONs or Stored Procedures.
  • You may refer to a CTE more than once in the main query.  Contrast this with a derived table, which if you’d like to use it more than once, must be coded as many times as you need it.
  • You may refer to a CTE in another CTE as long as the CTE being referred to occurs above the CTE that is doing the referring in the CTE stack.
  • CTEs support recursive queries.

CTEs are most often used with SELECT, but you can UPDATE through a CTE as in this example also.

WITH BKKVehicles AS
(
    SELECT VehicleID, VehicleType, Location
        ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID)
    FROM #Vehicles
    WHERE Location = 'BKK'
)
UPDATE BKKVehicles
SET VehicleType = 'VAN'
WHERE Location = 'HDY';

In this example, no rows are updated because the table returned by the CTE does not contain any rows whose location is HDY.

Likewise you can delete through a CTE.

WITH BKKVehicles AS
(
    SELECT VehicleID, VehicleType, Location
        ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID)
    FROM #Vehicles
    WHERE Location = 'BKK'
)
DELETE FROM BKKVehicles
WHERE rn > 1;

SELECT *
FROM #Vehicles;

After the DELETE runs, the rows remaining in our table are:

VehicleID  VehicleType  Location
12211      TRUCK        BKK
12213      TRUCK        CNX
12214      CAR          CNX
12215      TRUCK        HDY
12216      CAR          HDY

This is actually quite a useful method of deleting duplicate rows from a table.

We mentioned that when updating or deleting through a CTE, certain restrictions may apply.   Basically all that means is that the target rows in the target table must be unambiguous.  For example, if you happen to JOIN the target table with another table, the JOIN must be exact (no duplicate rows generated) otherwise the effort will likely fail.  Highly complex queries involving many JOINs, etc. may also confuse the compiler and make it unable to recognize the target table.

Here’s an example of using a CTE as the source for an INSERT, to generate some additional sample rows in our table.  It also demonstrates how you can name the columns generated by the CTE.

WITH MoreRows (VehicleID, VehicleType, Location) AS
(
    SELECT '12218','VAN','BKK'
    UNION ALL SELECT '12219','VAN','CNX'
    UNION ALL SELECT '12220','VAN','HDY'
)
INSERT INTO #Vehicles (VehicleID, VehicleType, Location)
SELECT VehicleID, VehicleType, Location
FROM MoreRows;

A CTE can also act as either source or target tables for a MERGE, but since MERGE is a topic that is deserving of consideration on its own, we’ll do a separate blog entry for that.

Final Remarks

We have demonstrated how a Common Table Expression can be used in SELECT, UPDATE, DELETE and INSERT statements.  CTEs are basically a way to improve the readability of the SQL code you produce, having no impact on their performance.

While we have mentioned that CTEs can be used for recursive queries, we haven’t provided any examples of this because it is quite an advanced topic.  However if you are interested in how to do this, you might want to take a look at Exploring Recursive CTEs by Example.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

The T-SQL ROW_NUMBER() Function

Posted on Updated on

If you learn one new T-SQL (i.e., Microsoft SQL Server) concept today it should be ROW_NUMBER().  Introduced in SQL 2005, this function is one of 4 window ranking functions (the others are RANK(), DENSE_RANK() and NTILE()).  Oracle SQL has a similar capability.

Let’s first create some sample data we can use for a demonstration.

CREATE TABLE #ROWNUMBER_Demo
(
    ID      INT
    ,MyDate DATETIME
    ,Price  MONEY
    ,PRIMARY KEY (ID, MyDate)
);

INSERT INTO #ROWNUMBER_Demo
SELECT 1, '2012-03-04', 23.22
UNION ALL SELECT 1, '2012-03-15', 25.15
UNION ALL SELECT 1, '2012-05-10', 28.47
UNION ALL SELECT 2, '2012-02-28', 15.10
UNION ALL SELECT 2, '2012-03-22', 18.22
UNION ALL SELECT 2, '2012-05-01', 21.43
UNION ALL SELECT 3, '2012-04-01', 45.06
UNION ALL SELECT 3, '2012-05-12', 48.23
UNION ALL SELECT 3, '2012-06-01', 51.66;

SELECT *
FROM #ROWNUMBER_Demo;

The results in our sample table are:

ID   MyDate                    Price
1    2012-03-04 00:00:00.000   23.22
1    2012-03-15 00:00:00.000   25.15
1    2012-05-10 00:00:00.000   28.47
2    2012-02-28 00:00:00.000   15.10
2    2012-03-22 00:00:00.000   18.22
2    2012-05-01 00:00:00.000   21.43
3    2012-04-01 00:00:00.000   45.06
3    2012-05-12 00:00:00.000   48.23
3    2012-06-01 00:00:00.000   51.66

To use the ROW_NUMBER() function you must supply an OVER predicate consisting of PARTITION (optional) and ORDER BY (required).  Let’s take a look at an example and see the results.

SELECT ID, MyDate
    ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate)
    ,rn2=ROW_NUMBER() OVER (ORDER BY MyDate)
FROM #ROWNUMBER_Demo
ORDER BY ID, MyDate;

The results returned are:

ID  MyDate                    rn1   rn2
1   2012-03-04 00:00:00.000   1     2
1   2012-03-15 00:00:00.000   2     3
1   2012-05-10 00:00:00.000   3     7
2   2012-02-28 00:00:00.000   1     1
2   2012-03-22 00:00:00.000   2     4
2   2012-05-01 00:00:00.000   3     6
3   2012-04-01 00:00:00.000   1     5
3   2012-05-12 00:00:00.000   2     8
3   2012-06-01 00:00:00.000   3     9

For rn1 (where PARTITION is applied) you see that it assigns row numbers 1, 2, 3 to the rows within each ID (the column specified to PARTITION on) based on the ordering of MyDate.  For the case without PARTITION (rn2), the entire set is the partition so the row numbers are 1, 2, …, 9, again based on the ordering of the MyDate column.

Eliminating Duplicates

ROW_NUMBER() is a particularly fast way to eliminate duplicate records.  Suppose you want to return only one record within each ID; specifically the one whose date is the latest.  You must note that ROW_NUMBER() cannot be used on the WHERE clause, so it is necessary to wrap this query in an outer query as follows:

SELECT ID, MyDate
FROM
(
    SELECT ID, MyDate
        ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate DESC)
    FROM #ROWNUMBER_Demo
) a
WHERE rn1 = 1
ORDER BY ID;

Note the DESC sort applied to MyDate.  These results are:

ID   MyDate
1    2012-05-10 00:00:00.000
2    2012-05-01 00:00:00.000
3    2012-06-01 00:00:00.000

Of course, you’re probably saying you can achieve the same results using a GROUP BY (and you’d be correct), like this.

SELECT ID, MyDate=MAX(MyDate)
FROM #ROWNUMBER_Demo
GROUP BY ID
ORDER BY ID;

But try using that query to also return the Price column that corresponds to the MAX date.  You cannot!  But you can when you use ROW_NUMBER().

SELECT ID, MyDate, Price
FROM
(
    SELECT ID, MyDate, Price
        ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate DESC)
    FROM #ROWNUMBER_Demo
) a
WHERE rn1 = 1
ORDER BY ID;

Results:

ID     MyDate Price
1      2012-05-10 00:00:00.000    28.47
2      2012-05-01 00:00:00.000    21.43
3      2012-06-01 00:00:00.000    51.66

Conclusion

ROW_NUMBER() is a very versatile T-SQL window ranking function.  Besides using it to eliminate duplicates, it has a great many other very practical purposes that we’ll explore in future entries on this blog.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved