Stupid T-SQL Tricks – Part 1: Logarithms

Not just for math geeks, logarithms and their inverse functions (the exponentials) can have useful and often unexpected capabilities in T-SQL.

    “Heeeeeere’s Johnny!” – Ed McMahon introducing Johnny Carson for the Tonight Show

If you’re old enough to remember the Tonight Show hosted by Johnny Carson, how can you not remember the iconic Stupid Pet Tricks skits, and the later tributes to those done by David Letterman?

While Carson’s animal antics were meant to entertain, my Stupid T-SQL Tricks are meant to be thought provoking and also add a tool or two to your T-SQL tools chest in the event that the need arises. So maybe they aren’t so stupid after all (and besides it’s my blog and I like the title). This will be the first of a series of blogs offering two to four specialty tips that are all somehow related. Today’s tricks will all utilize logarithms to do their magic.

Calculate a Factorial

Statisticians will recognize the factorial, a number that is represented as N!, where n=0, 1, 2, 3, … and for example 4! = 1*2*3*4 = 24. In the somewhat distant past I had the chutzpah to suggest that it could be done with a recursive Common Table Expression (CTE). Although in my article Exploring Recursive CTEs by Example, I wasn’t really advocating that approach. It was more like I was trying to help folks better understand how recursive CTEs work. Nowadays, I would never do it that way because there is something better.

Let’s start with a Tally table that generates numbers from 1 to 10. In our query below, we’ll SUM those numbers.

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT SUM(n)
FROM Tally;

-- Results:
55

Too bad there isn’t a PROD() built-in function in T-SQL. Or is there? Consider logarithms and how they can convert a SUM() into a PROD() using a few simple calls to the right built-in functions in the proper sequence. Let’s try these two queries:

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT EXP(SUM(LOG(n)))
FROM Tally;

SELECT 1*2*3*4*5*6*7*8*9*10;

-- Results:
3628800.00000001
3628800

Those two results are remarkably close, the first one of course being slightly imprecise due to the fact that LOG and EXP are both returning a FLOAT data type. Since we know a factorial number must be an integer, we can convert it as follows, so now it returns the exact integer value of interest (3628800).

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT CAST(ROUND(EXP(SUM(LOG(n))), 0) AS BIGINT)
FROM Tally;

While it unlikely that you’ll often have the need to calculate factorials in T-SQL, you may occasionally have the need to calculate a product across multiple rows for a column. So this technique will work for that also.

The Running Product

Using SQL 2012, we can use this same technique to calculate a running product (similar to a running total) across multiple rows in a table. To do this, we’ll use a window frame.

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n
    ,CAST(ROUND(EXP(
        SUM(LOG(n)) OVER
            (
                ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            )
        ), 0) AS BIGINT)
FROM Tally;

-- Results:
1      1
2      2
3      6
4      24
5      120
6      720
7      5040
8      40320
9      362880
10     3628800

Be careful though! These numbers are going to grow really quickly, so hopefully you won’t be doing this over many, many rows.

Coded, Numeric Fields with and without Leading Zeroes

In many applications you’ll find columns that are coded as numbers, but are stored as characters. This is, of course, a best practice when you don’t expect to be doing arithmetic with the contents. Let’s use a bit of a contrived example in the guise of a Sales Order Details table:

CREATE TABLE #SO_Details
(
    SO_Number           VARCHAR(11)
    ,SO_Detail          VARCHAR(4)
    ,Quantity           INT
    ,Price              MONEY
    ,Extended_Price     AS (Quantity*Price)
);

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '1', 3, 15.20),('2013SO00001', '2', 3, 9.40)
    ,('2013SO00001', '3', 1, 11.50),('2013SO00001', '4', 2, 11.55)
    ,('2013SO00001', '5', 2, 14.30),('2013SO00001', '6', 10, 13.32)
    ,('2013SO00001', '7', 5, 19.42),('2013SO00001', '8', 6, 10.15)
    ,('2013SO00001', '9', 4, 12.15);

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

-- Results:
SO_Number     SO_Detail Quantity   Price   Extended_Price
2013SO00001   1         3          15.20   45.60
2013SO00001   2         3          9.40    28.20
2013SO00001   3         1          11.50   11.50
2013SO00001   4         2          11.55   23.10
2013SO00001   5         2          14.30   28.60
2013SO00001   6         10         13.32   133.20
2013SO00001   7         5          19.42   97.10
2013SO00001   8         6          10.15   60.90
2013SO00001   9         4          12.15   48.60

We’ve used a computed column to calculate Extended_Price as Quantity * Price. Let’s suppose that we now need to insert an additional SO_Detail row.

BEGIN TRANSACTION T1;

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '10', 2, 18.88);

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

ROLLBACK TRANSACTION T1;

-- Results:
SO_Number     SO_Detail Quantity   Price   Extended_Price
2013SO00001   1         3          15.20   45.60
2013SO00001   10        2          18.88   37.76
2013SO00001   2         3          9.40    28.20
2013SO00001   3         1          11.50   11.50
2013SO00001   4         2          11.55   23.10
2013SO00001   5         2          14.30   28.60
2013SO00001   6         10         13.32   133.20
2013SO00001   7         5          19.42   97.10
2013SO00001   8         6          10.15   60.90
2013SO00001   9         4          12.15   48.60

Oh my! Those weren’t the results we wanted because our new record (SO_Detail=10) is sorted into the wrong spot! Of course, this can easily be addressed by changing the ORDER BY so that it CASTs the SO_Detail column to INT, but then that’s why I said this example is a bit contrived. Note that we rolled back the transaction so that our new record 10 was not really saved in the table.

We could use our newfound knowledge of logarithms to address this. In our prior example, we used a natural log, but in this example we’ll use a base 10 logarithm. Let’s take a look at a property of base 10 logarithms that we can use to add a leading zero to each of the entries from 1 to 9. To show this property, we’ll fall back to our Tally table.

WITH Tally(n) AS
(
    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n, LOG10(n), FLOOR(LOG10(n))
    ,POWER(10, 1+FLOOR(LOG10(n)))
    ,MAX(n) OVER (PARTITION BY NULL)
FROM Tally
WHERE n IN(1, 9, 10, 11, 99, 100);

-- Results:
1     0                   0    10     100
9     0.954242509439325   0    10     100
10    1                   1    100    100
11    1.04139268515823    1    100    100
99    1.99563519459755    1    100    100
100   2                   2    1000   100

Let’s say that we know that our range of SO_Detail is 1 through 9. In that case, we can use a formula like the following to prepend exactly the right number of leading zeroes (each is shown for the number 9).

SELECT RIGHT(10+9, 1), RIGHT(100+9, 2), RIGHT(1000+9, 3);

-- Results:
9     09   009

The numbers 10, 100 and 1000, as well as the 1, 2 and 3 can be derived directly from the results returned by the former query. Let’s put this all together.

WITH Tally(n) AS
(
    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n, RIGHT(POWER(10, 1+FLOOR(LOG10(m))) + n, 1+FLOOR(LOG10(m)))
FROM
(
    SELECT n, m=MAX(n) OVER (PARTITION BY NULL)
    FROM Tally
) a
WHERE n IN (1, 9, 10, 11, 99, 100);

-- Results:
1      001
9      009
10     010
11     011
99     099
100    100

You see, our new formula using RIGHT not only gets us character strings. It also prepends exactly the correct number of zeroes so that the new result is sortable directly on the resulting character string.

Let’s return to our SO Details table and see how we can apply this new formula to normalize our detail line numbers.

BEGIN TRANSACTION T1;

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '10', 2, 18.88);

WITH SO_Details AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
        ,m=MAX(CAST(SO_Detail AS INT)) OVER (PARTITION BY NULL)
    FROM #SO_Details
    WHERE SO_Number = '2013SO00001'
)
UPDATE SO_Details
SET SO_Detail = RIGHT(POWER(10, 1+FLOOR(LOG10(m))) + SO_Detail, 1+FLOOR(LOG10(m)));

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

ROLLBACK TRANSACTION T1;

-- Results:
SO_Number     SO_Detail  Quantity   Price   Extended_Price
2013SO00001   01         3          15.20   45.60
2013SO00001   02         3          9.40    28.20
2013SO00001   03         1          11.50   11.50
2013SO00001   04         2          11.55   23.10
2013SO00001   05         2          14.30   28.60
2013SO00001   06         10         13.32   133.20
2013SO00001   07         5          19.42   97.10
2013SO00001   08         6          10.15   60.90
2013SO00001   09         4          12.15   48.60
2013SO00001   10         2          18.88   37.76

Now our results are sorted as expected. Once again we have rolled back the transaction so we are left with only 9 rows in our #SO_Details table and single digit SO_Detail values.

Let’s complicate our example just a bit and add a table of transactions we’d like to process, consisting of inserts, updates and deletes and see if we can do the whole thing with a MERGE. First, the transactions table and some sample transactions.

CREATE TABLE #SO_Trans
(
    [action]            CHAR(1)
    ,SO_Number          VARCHAR(11)
    ,SO_Detail          VARCHAR(4)
    ,Quantity           INT
    ,Price              MONEY
);

INSERT INTO #SO_Trans
    (
        [action], SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('U', '2013SO00001', '5', 15, 12.23)
    ,('I', '2013SO00001', NULL, 4, 16.15)
    ,('I', '2013SO00001', NULL, 3, 12.22)
    ,('D', '2013SO00001', '3', NULL, NULL);

SELECT *
FROM #SO_Trans;

-- Results:
action SO_Number     SO_Detail   Quantity   Price
U      2013SO00001   5           15         12.23
I      2013SO00001   NULL        4          16.15
I      2013SO00001   NULL        3          12.22
D      2013SO00001   3           NULL       NULL

For INSERTs represented in the transactions table, we do not need an SO_Detail number. For DELETEs we do not need Quantity or Price. For UPDATEs we need all three. In order to do a MERGE, we need to consider the following:

  • What is the target table?
  • What is the source table?
  • What are the matching criteria and what operations are performed on which kinds of matches?

The target table is easy, although not so easy as just the #SO_Details table because we need to only include SOs that are in the #SO_Trans table, so we’ll start with that:

-- Target table
SELECT SO_Number, SO_Detail, Quantity, Price
FROM #SO_Details
WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans);

We’ll put the target table into a CTE for use in the MERGE. Coincidentally the results are the 9 original rows we put into the table, because there is only one SO_Number in the #SO_Trans table. We do need to include all rows for that SO because any of them may be subject to an UPDATE to change the SO_Detail number.

MERGE throws an exception if you try to update any row in the target table more than once, so we need to carefully construct the source table so that there is one and only one matching row for each row in the target.

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
)
SELECT [action]=ISNULL(b.[action], 'U')
    ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
    ,a.SO_Detail
    ,Quantity=ISNULL(b.Quantity, a.Quantity)
    ,Price=ISNULL(b.Price, a.Price)
FROM Target4MERGE a
FULL JOIN #SO_Trans b
ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail;

-- Results:
action  SO_Number    SO_Detail  Quantity  Price
I       2013SO00001  NULL       4         16.15
I       2013SO00001  NULL       3         12.22
U       2013SO00001  1          3         15.20
U       2013SO00001  2          3         9.40
D       2013SO00001  3          1         11.50
U       2013SO00001  4          2         11.55
U       2013SO00001  5          15        12.23
U       2013SO00001  6          10        13.32
U       2013SO00001  7          5         19.42
U       2013SO00001  8          6         10.15
U       2013SO00001  9          4         12.15

Notes about these results:

  • The two detail lines to be INSERTed have a NULL SO_Detail number because we have not yet determined what it is to be. We’ll be trying to add them at the end.
  • Note the Quantity and Price (15 and 12.23) for SO_Detail=5, which are the values we want to UPDATE.

The key to the matching will be to construct a ROW_NUMBER() that we can use as the new SO_Detail number. So let’s add that:

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
),
    PreSource AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price
        ,r=ROW_NUMBER() OVER
            (
                PARTITION BY SO_Number
                    ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                ORDER BY CASE
                         WHEN SO_Detail IS NULL
                         THEN 10000
                         ELSE SO_Detail
                         END
            )
    FROM
    (
        SELECT [action]=ISNULL(b.[action], 'U')
            ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
            ,a.SO_Detail
            ,Quantity=ISNULL(b.Quantity, a.Quantity)
            ,Price=ISNULL(b.Price, a.Price)
        FROM Target4MERGE a
        FULL JOIN #SO_Trans b
        ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail
    ) a
)
SELECT [action], SO_Number, SO_Detail, Quantity, Price, r
    ,m=MAX(r) OVER
        (
            PARTITION BY SO_Number, CASE [action] WHEN 'D' THEN 0 ELSE 1 END
        )
FROM PreSource;

-- Results:

action  SO_Number    SO_Detail  Quantity  Price  r   m
D       2013SO00001  3          1         11.50  1   1
U       2013SO00001  1          3         15.20  1   10
U       2013SO00001  2          3         9.40   2   10
U       2013SO00001  4          2         11.55  3   10
U       2013SO00001  5          15        12.23  4   10
U       2013SO00001  6          10        13.32  5   10
U       2013SO00001  7          5         19.42  6   10
U       2013SO00001  8          6         10.15  7   10
U       2013SO00001  9          4         12.15  8   10
I       2013SO00001  NULL       4         16.15  9   10
I       2013SO00001  NULL       3         12.22  10  10

Additional comments on the latest query and results:

  • The PARTITION we created for our ROW_NUMBER() separates the record to be deleted from the remainder.
  • In the ORDER BY for our ROW_NUMBER() we substituted 10000 (one more than the maximum number of rows we can have per SO detail line) to make the INSERTed records move to the bottom.
  • We enclosed this query in a CTE, because we also want to use the MAX window aggregate function.
  • For that, the PARTITION is identical to what we used for ROW_NUMBER().
  • The result in the r column (except for the [action]=’D’ row) is the new SO_Detail number.
  • The result in m will be used when we convert our integers to characters with leading numbers.

All that’s left is to construct the final MERGE statement, so here it is.

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
),
    PreSource AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price
        ,r=ROW_NUMBER() OVER
            (
                PARTITION BY SO_Number
                    ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                ORDER BY CASE
                         WHEN SO_Detail IS NULL
                         THEN 10000
                         ELSE SO_Detail
                         END
            )
    FROM
    (
        SELECT [action]=ISNULL(b.[action], 'U')
            ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
            ,a.SO_Detail
            ,Quantity=ISNULL(b.Quantity, a.Quantity)
            ,Price=ISNULL(b.Price, a.Price)
        FROM Target4MERGE a
        FULL JOIN #SO_Trans b
        ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail
    ) a
),
    Source4MERGE AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price, r
        ,m=1+FLOOR(LOG10(
            MAX(r) OVER
                (
                    PARTITION BY SO_Number
                        ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                )))
    FROM PreSource
)
MERGE Target4MERGE t
USING Source4MERGE s
ON s.SO_Number = t.SO_Number AND s.SO_Detail = t.SO_Detail
WHEN MATCHED AND [action] = 'D'
THEN DELETE
WHEN MATCHED AND [action] = 'U'
THEN UPDATE
SET SO_Detail = RIGHT(POWER(10, m) + s.r, m)
    ,Quantity = s.Quantity
    ,Price    = s.Price
WHEN NOT MATCHED AND [action] = 'I'
THEN INSERT 
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES 
    (
        s.SO_Number
        ,RIGHT(POWER(10, m) + s.r, m)
        ,s.Quantity
        ,s.Price
    );

SELECT *
FROM #SO_Details;

-- Results:
SO_Number    SO_Detail  Quantity  Price   Extended_Price
2013SO00001  01         3         15.20   45.60
2013SO00001  02         3         9.40    28.20
2013SO00001  03         2         11.55   23.10
2013SO00001  04         15        12.23   183.45
2013SO00001  05         10        13.32   133.20
2013SO00001  06         5         19.42   97.10
2013SO00001  07         6         10.15   60.90
2013SO00001  08         4         12.15   48.60
2013SO00001  09         4         16.15   64.60
2013SO00001  10         3         12.22   36.66

Now that’s one heck of a complicated looking query and it could probably do with some comments, but you can get those as needed from the descriptions I’ve provided. A few more additional notations may be warranted.

  • We applied our 1+FLOOR+LOG10 formula around the MAX() window aggregate function to save a bit of complexity when assigning SO_Detail in the MERGE.
  • The MERGE uses MATCHED twice, segregating DELETEs from UPDATEs based on [action].
  • The NOT MATCHED clause INSERTs the new rows specified by the #SO_Trans table.

Conclusions

You may be asking why not save all of this computational convolution and just always use a 4 digit SO_Detail value with as many leading zeroes as necessary because that’s pretty easy with something like:

SELECT RIGHT('0000'+CAST(SO_Detail AS VARCHAR(4)), 4);

The answer is, yes we could have done that if our business requirements allowed for always having 4 digits in the SO_Detail column. But frankly, this was much more fun and we got to learn a little about logarithms along the way. And I did warn you by calling this article Stupid T-SQL Tricks, now didn’t I?

Still we may have learned a few things, like how to calculate factorials, creating a pseudo-PROD() aggregate, how base 10 logarithms can be used to find the number of digits in a number and how to methodically construct a complex MERGE query one step at a time. All in all, that alone probably made it worthwhile for me to post this blog.

Perhaps it wasn’t quite as entertaining as Johnny Carson’s Stupid Pet Tricks skits, but we hope you enjoyed it anyway.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Using the T-SQL MERGE Statement

In SQL Server 2008, Microsoft added a new SQL query type: the MERGE statement. This flexible query provides the ability to perform INSERTs, UPDATEs and even DELETEs all within a single statement. Used in combination with Common Table Expressions (CTEs), this can be a powerful tool to replace multiple SQL queries under the right circumstances.

One important rule to keep in mind when using MERGE, is that the statement must be terminated by a semicolon (;).

Case Study 1: A Simple Upsert

The most common usage of the MERGE statement is to perform what is colloquially called an “upsert,” which is really a diminutive form of UPDATE/INSERT. Without further preamble, let’s set up some test data and get to seeing how the MERGE statement can simplify your life.

CREATE TABLE #Master
(
    [key]       INT IDENTITY PRIMARY KEY
    ,name       VARCHAR(10)
);

INSERT INTO #Master
VALUES ('Dwain'),('Jeff'),('Paul')
    ,('Anon'),('Ralph'),('Tom'),('Sally');

CREATE TABLE #Staging
(
    [key]       INT PRIMARY KEY
    ,[NewName]  VARCHAR(10)
);

INSERT INTO #Staging
VALUES (2, 'Bob'),(4, 'Jim'),(6, 'Marvin'), (10, 'Buddy');

SELECT * FROM #Master;
SELECT * FROM #Staging;

The results in the two tables as displayed by the SELECT are:

key name
1   Dwain
2   Jeff
3   Paul
4   Anon
5   Ralph
6   Tom
7   Sally

key NewName
2   Bob
4   Jim
6   Marvin
10  Buddy

Our intention is to update (based on [key] in #Staging) the corresponding row by [key] in #Master. If the [key] in #Staging matches none of our [key] values in #Master, then insert a new row. The new row’s [key] does not need to match the value in the staging table. We can easily do this with a MERGE statement as follows:

MERGE #Master t
USING #Staging s
ON s.[key] = t.[key]
WHEN MATCHED THEN 
    UPDATE SET name = s.[NewName]
WHEN NOT MATCHED THEN 
    INSERT (name) VALUES (s.[NewName]); 

SELECT * FROM #Master;

The final SELECT result is as follows:

key name
1   Dwain
2   Bob
3   Paul
4   Jim
5   Ralph
6   Marvin
7   Sally
8   Buddy

You can see that Bob replaced Jeff, Jim replaced Anon and Marvin replaced Tom, and also that Buddy was added at the end.

The way the statement works is as follow:

  • The table name immediately after the MERGE keyword is the target table, in this case #Master, which we have aliased as t for easy understanding.
  • The USING table is the source, so #Staging will be merged into #Master.
  • The ON keyword represents the matching criteria between the records in the two tables. You should not think of this in the same way that ON appears after a JOIN as it operates quite differently.
  • Following those parts of the statement, are any number of WHEN clauses. The MATCHED criterion indicates a match based on the ON criteria. It can be combined with additional matching criteria if required.
  • NOT MATCHED (implied as BY TARGET), means that when a source row does not exist in the target table, we’re going to do something.
  • Following MATCHED or NOT MATCHED is the keyword THEN followed by either an INSERT or an UPDATE.

You can also use DELETE (instead of UPDATE or INSERT) and if you’d like to learn about how to DELETE rows from the target table, I suggest you read this article to understand exactly how it works: A Hazard of Using the SQL Merge Statement and the potential dangers when using it.

Case Study 2: A More Complicated MERGE

Suppose we have the following sample table and data:

CREATE TABLE #ItemTest
(
    ID INT NOT NULL
    ,LineID INT NOT NULL
    ,ProductID INT NULL
    ,PRIMARY KEY(ID, LineID)
); 

INSERT INTO #ItemTest (ID, LineID, ProductID)
SELECT 100, 1, 5 
UNION ALL SELECT 100, 2, 15 
UNION ALL SELECT 100, 3, 8 
UNION ALL SELECT 100, 4, 25 
UNION ALL SELECT 200, 1, 11 
UNION ALL SELECT 200, 2, 100 
UNION ALL SELECT 200, 3, 41 
UNION ALL SELECT 200, 4, 10 
UNION ALL SELECT 200, 5, 5 
UNION ALL SELECT 200, 6, 30
UNION ALL SELECT 300, 1, 20;

SELECT *
FROM #ItemTest;

From the final SELECT, we see that our data appears as follows:

ID   LineID ProductID
100  1      5
100  2      15
100  3      8
100  4      25
200  1      11
200  2      100
200  3      41
200  4      10
200  5      5
200  6      30
300  1      20

Notice how the entries for each ID contain a sequentially numbered LineID (1 to 4 for ID=100 and 1 to 6 for ID=200). Our business requirement is that we need to delete some rows and at the same time preserve the row numbering for LineID without introducing any gaps. So for example, if we need to delete LineID=3 from ID=100, we need to renumber LineID=4 for that ID to be LineID=3.

Ignoring for the moment that it’s probably poor application design to have this row renumbering requirement, this can be accomplished with a MERGE. Since it is a bit more complicated we’ll develop it in a couple of steps to help you understand. First, let’s say we want to delete three rows. We’ll put those into a table variable (a feature introduced in SQL Server 2005).

DECLARE @RowsToDelete TABLE
(
    ID      INT
    ,LineID INT
    ,PRIMARY KEY (ID, LineID)
);

INSERT INTO @RowsToDelete (ID, LineID)
SELECT 100, 3 UNION ALL SELECT 200, 2 UNION ALL SELECT 200, 4;

Note how we can create a PRIMARY KEY on a table variable. While not needed in this case, if you had lots of rows it will improve the performance of what we’re about to do.

Now we’ll construct the following query which will require some explanation:

SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID
    ,rn=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))
    ,XX=NULLIF(a.LineID, b.LineID)
FROM #ItemTest a
LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID;

XX is included only to illustrate what NULLIF is doing for us. This produces the following results:

ID   LineID ProductID LineID2  rn   XX
100  3      8         3        1    NULL
100  1      5         NULL     2    1
100  2      15        NULL     3    2
100  4      25        NULL     4    4
200  2      100       2        1    NULL
200  4      10        4        2    NULL
200  1      11        NULL     3    1
200  3      41        NULL     4    3
200  5      5         NULL     5    5
200  6      30        NULL     6    6
300  1      20        NULL     1    1

Each row from #ItemTest is returned because it is the left table of the LEFT JOIN. Matching rows from our @RowsToDelete temporary table have a value in LineID2, while rows not matched have a value of NULL (exactly how you’d expect the LEFT JOIN to work). The result in XX shows us that when the LineID of #ItemTest matches the LineID of @RowsToDelete, we get a NULL and NULL values usually sort first (there is a SQL Server setting that controls this). So in each case, the rows we want to delete are sorted to the top of the grouping (on ID).

For the 3 rows in our @RowsToDelete table, we have 1 for ID=100 and 2 for ID=200 (these counts are easy enough to obtain in SQL). So what happens if we subtract that count from rn?

WITH CountItemsToDelete (ID, c) AS
(
    SELECT ID, COUNT(*)
    FROM @RowsToDelete
    GROUP BY ID
)
SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID
    ,[rn-c]=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))-c
FROM #ItemTest a
LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID
JOIN CountItemsToDelete c ON a.ID = c.ID;

The results now appear as:

ID   LineID   ProductID LineID2  rn-c
100  3        8         3        0
100  1        5         NULL     1
100  2        15        NULL     2
100  4        25        NULL     3
200  2        100       2        -1
200  4        10        4        0
200  1        11        NULL     1
200  3        41        NULL     2
200  5        5         NULL     3
200  6        30        NULL     4

Note how the row for ID=300 has been eliminated by the INNER JOIN to our Common Table Expression (CTE) CountItemsToDelete. Looking at the [rn-c] column, we see that for rows where LineID2 is not NULL, the value is meaningless. But for rows where LineID2 is NULL, [rn-c] is precisely the final row number we’ll need to assign to LineID after deleting the rows we want to delete! Now we have enough information to write this into a MERGE statement:

WITH CountItemsToDelete (ID, c) AS
(
    SELECT ID, COUNT(*)
    FROM @RowsToDelete
    GROUP BY ID
), 
    SourceItems AS
(
    SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID, c
        ,rn=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))
    FROM #ItemTest a
    LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID
    JOIN CountItemsToDelete c ON a.ID = c.ID
)
-- The target table
MERGE #ItemTest t
-- The source table
USING SourceItems s
-- Matching criteria: lines up rows from SourceItems exactly with rows
-- from our target table (except for ID=300 which is not in the source) 
ON t.ID = s.ID AND s.LineID = t.LineID
-- LineID2 is not NULL for rows we need to delete
WHEN MATCHED AND s.LineID2 IS NOT NULL THEN
    DELETE
-- LineID2 is NULL for rows where we've calculated the new line number
WHEN MATCHED AND s.LineID2 IS NULL THEN
    UPDATE SET LineID = rn-c;

SELECT *
FROM #ItemTest;

The results shown in the final SELECT clearly indicate that this MERGE query has satisfied our business requirement.

ID   LineID   ProductID
100  1        5
100  2        15
100  3        25
200  1        11
200  2        41
200  3        5
200  4        30
300  1        20

To further improve the performance of the query, you can change the second MATCHED criteria to this, to avoid updating rows where the LineID isn’t changing.

WHEN MATCHED AND s.LineID2 IS NULL AND t.LineID <> rn-c THEN

To do this otherwise in SQL you would first need to DELETE the rows you want to delete, and then run a separate UPDATE to correct the row numbers that need correcting.

Summary of our Learning

Today we have learned about the MERGE query and how it can be used to replace multiple queries in the case of:

  • UPDATE/INSERT (the “Upsert”)
  • DELETE/UPDATE

I encourage you to also read the linked article about hazards present if you utilize the full capability of the MERGE (WHEN NOT MATCHED SOURCE THEN DELETE). The article shows a simple way of limiting the scope of the DELETE to avoid the hazard.

We also learned about SQL table variables, which are very handy under certain circumstances. They do have their drawbacks though, and some day we may blog on the differences between them and temporary tables.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

The One Million Row T-SQL Test Harness

So far in our blogs, we have talked a bit about performance, but today we’re going to show you a way you can confirm without doubt that you’re writing high-performance T-SQL code. This extremely valuable technique is something Developers and Testers alike should be familiar with.

Why one million (1M) rows? Well, the answer to that question is really three-fold.

  • First of all, 1M rows represent sort of a threshold, where you can start to consider your tables as being “big.”
  • Second, if your query handles 1M rows in a reasonably short period of time, it means that it will probably do OK even if the table grows much larger (which is what we hope because that judges how popular our application is).
  • Finally, depending on what you are testing, the 1M row test may make the difference between deciding on the right query (the one that is the fastest) and the wrong query, because at 1M rows it usually separates the winners from the losers quite clearly.

Along the way we’re going to learn about a variety of ways that test data can be generated. Unfortunately we won’t be able to cover every possible case. We’ll need to leave some of them to your imagination. Once you understand the basics, you should be able to extend the concept to other types of test data relatively quickly.

Case Study 1: Verify Which Formula is Fastest

Suppose you have a table that contains an INTEGER ID number column. Your application requires that when that number is shown on a form, it displays with leading zeroes. There are many ways to do this but we’ll look at three.

DECLARE @Num INT = 2342;

SELECT @Num
    ,RIGHT('000000'+CAST(@Num AS VARCHAR(7)),7)         -- Method 1
    ,RIGHT(10000000+@Num, 7)                            -- Method 2
    ,STUFF(@Num, 1, 0, REPLICATE('0', 7-LEN(@Num)));    -- Method 3

Each of these returns the same result: 0002342 and will work for any positive integer <= 9999999. But the question is which one is faster?

To answer that question we’ll construct a 1M row test harness based on the Numbers table we created in our blog on Tally Tables. We’ll also show an example of one of those methods using an in-line Tally table (from the same blog), so you can see the difference between using a permanent vs. an in-line tally table.

DECLARE @NumResult VARCHAR(7);

PRINT 'Method 1:';
SET STATISTICS TIME ON;
SELECT @NumResult = RIGHT('000000'+CAST(N AS VARCHAR(7)),7)
FROM dbo.Numbers
WHERE N <= 1000000;
SET STATISTICS TIME OFF;

PRINT 'Method 2:';SET STATISTICS TIME ON;
SELECT @NumResult = RIGHT(10000000+N, 7)
FROM dbo.Numbers
WHERE N <= 1000000;
SET STATISTICS TIME OFF;

PRINT 'Method 3:';
SET STATISTICS TIME ON;
SELECT @NumResult = STUFF(N, 1, 0, REPLICATE('0', 7-LEN(N)))
FROM dbo.Numbers
WHERE N <= 1000000;
SET STATISTICS TIME OFF;

PRINT 'Method 2 w-in line Tally Table:';
SET STATISTICS TIME ON;
WITH Tally(N) AS
(
    SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT @NumResult = RIGHT(10000000+N, 7)
FROM Tally
WHERE N <= 1000000;
SET STATISTICS TIME OFF;

-- Results:
Method 1:
SQL Server Execution Times:
CPU time = 265 ms, elapsed time = 274 ms.

Method 2:
SQL Server Execution Times:
CPU time = 250 ms, elapsed time = 250 ms.

Method 3:
SQL Server Execution Times:
CPU time = 452 ms, elapsed time = 460 ms.

Method 2 w-in line Tally Table:
SQL Server Execution Times:
CPU time = 234 ms, elapsed time = 227 ms.

You should always run the test harness a few times and record the results of each run. By looking at the results, we see that methods 1 and 2 are pretty close, but over the 4-5 runs that I did, method 2 was consistently just a little faster in elapsed time. Using an in-line Tally table in this case was just a little faster than using the permanent Tally table (that may not always be the case).

You may be saying to yourself that this little 9% improvement doesn’t mean much, but picture it in the context of a much more complex query, where every slight improvement you can make counts for something.

Let’s now look at some key aspects of the test harness we used:

  • It contains multiple queries that return the identical results for the same number of rows. This is important because you’re trying to compare solutions.
  • We SET STATISTICS TIME ON before each query and OFF when it was complete. You could have turned them ON once at the start and OFF at the end, but that’s not always going to be the case.
  • We printed out a description of the method we are using just before setting STATISTICS ON. That’s the reason we turned them OFF after each query; so you wouldn’t see the statistics for the PRINT statement, which is immaterial.
  • Finally, we created a local variable @NumResult and assigned our calculated result (the returned columns) to it. This is important to eliminate the time that SQL Server Management Studio (SSMS) would otherwise take to render the results to the Results pane. If you don’t do that, it can bias the results. We’re interested in clocking the raw query speed here.

CPU time can be important sometimes, so you may also want to look at that. It turns out that method 2 also appears better than method 1 in terms of CPU time, but that may not always be the case and over the 4-5 runs we did it often was a tie.

After this test, we can say pretty clearly that method 2 was the highest performance method among the ones we looked at. We are also now well on our way to being able to say definitively that we’ve written the best possible query to solve our business problem.

Case Study 2: Removing Numbers from a Character String

In this case study, we will illustrate a few additional concepts you should know when constructing a test harness:

  • How to construct random character strings of any length.
  • Using an alternate means to capture the elapsed time of the queries we’re comparing.
  • Using an alternative to a local variable to avoid rendering results to SSMS.

The business problem is that we wish to remove numbers from character strings that contain only letters and numbers. For this task, we have written two T-SQL FUNCTIONs that basically do the job but need to be called differently.

-- Scalar Valued Function that removes characters based on a pattern match
CREATE FUNCTION dbo.CleanString
    (@pString VARCHAR(8000), @pPattern VARCHAR(100))
RETURNS VARCHAR(8000) AS
BEGIN

DECLARE @Pos SMALLINT;
SELECT @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

WHILE @Pos > 0
SELECT @pString = STUFF(@pString,@Pos,1,''),

@Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

RETURN @pString;

END
GO

-- In-line, schema-bound Table Valued Function
CREATE FUNCTION dbo.RemoveMatchedPatterns
(
    @Str        VARCHAR(8000)
    ,@Pattern   VARCHAR(100)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN

WITH Tally(n) AS
(
    SELECT TOP (ISNULL(LEN(@Str), 0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
),
SplitString AS
(
    SELECT n, s=SUBSTRING(@Str, n, 1)
    FROM Tally
    WHERE PATINDEX(@Pattern, SUBSTRING(@Str COLLATE Latin1_General_BIN, n, 1)) = 0
)
SELECT ReturnedString=
    (
        SELECT s + ''
        FROM SplitString b
        ORDER BY n
        FOR XML PATH(''), TYPE
    ).value('.', 'VARCHAR(8000)');
GO

For now, it is not necessary to fully understand how these FUNCTIONs work, but it is necessary to understand how to use them. Note that the CleanString FUNCTION uses a WHILE loop to do its work.

DECLARE @TestString VARCHAR(8000) = '123122adflajsdf34a23aa333w';

SELECT TestString=@TestString
    ,SVFReturns=dbo.CleanString(@TestString, '%[0-9]%')
    ,iTVFReturns=ReturnedString
FROM dbo.RemoveMatchedPatterns(@TestString, '%[0-9]%');

-- Results:
TestString                   SVFReturns     iTVFReturns
123122adflajsdf34a23aa333w   adflajsdfaaaw   adflajsdfaaaw

Both FUNCTIONs have the same call signature, which includes the pattern of characters you want removed, and both return the same results (only the alphabetic characters).

Before we proceed to generate a test harness containing lots of random strings we can test, it is necessary to familiarize you with a way to generate random numbers in T-SQL. So let’s consider the following statement:

SELECT IntRN=1+ABS(CHECKSUM(NEWID()))%100
    ,RNwDec=.01*(1+ABS(CHECKSUM(NEWID()))%10000);

-- Results:
IntRN  RNwDec
31     40.74

If you run this statement multiple times, you’ll get different results each time. In each case, you get a number between 1 and 100. The RNwDec column will have 2 decimal digits. This is the standard method in SQL to generate a Uniform Random Number (URN). If you need a wider range of numbers, change 100 (or 10000 for RNwDec) to something larger.

To generate a random string of characters, you can use the URN formula as follows:

SELECT REPLICATE(
        SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1)
            ,1+ABS(CHECKSUM(NEWID()))%20) +
    REPLICATE(
        SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1)
            ,1+ABS(CHECKSUM(NEWID()))%20);

You can try this one yourself. Each time you get a different string consisting of 1 to 20 letters (always the same letter repeated) and 1 to 20 numbers (always the same number repeated). The string will be of varying length between 2 and 40 characters. To get a string of maximum length 8000, all you need to do is replicate the pattern a random number of times. The final (outer) replication should be performed up to 200 times, so can be done like this:

SELECT REPLICATE(
    REPLICATE(
        SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1)
            ,1+ABS(CHECKSUM(NEWID()))%20) +
    REPLICATE(
        SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1)
            ,1+ABS(CHECKSUM(NEWID()))%20)
            ,1+ABS(CHECKSUM(NEWID()))%200);

Now that we understand the tools we’ll need, here is the test harness.

-- Create 1000 rows of random strings
SELECT s=REPLICATE(
    REPLICATE(
        SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1)
            ,1+ABS(CHECKSUM(NEWID()))%20) +
    REPLICATE(
        SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1)
            ,1+ABS(CHECKSUM(NEWID()))%20)
            ,1+ABS(CHECKSUM(NEWID()))%200)
INTO #TestStrings
FROM dbo.Numbers
WHERE N <= 1000;

DECLARE @StartDT DATETIME = GETDATE();

SELECT SVFReturns=dbo.CleanString(s, '%[0-9]%')
INTO #Test1
FROM #TestStrings;

-- Display elapsed time for the scalar-valued User-defined Function (UDF)
SELECT SVFElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

SELECT @StartDT = GETDATE();
SELECT iTVFReturns=ReturnedString
INTO #Test2
FROM #TestStrings
CROSS APPLY dbo.RemoveMatchedPatterns(s, '%[0-9]%');

-- Display elapsed time for the in-line Table Valued Function (iTVF)
SELECT iTVFElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());
GO

DROP TABLE #TestStrings;
DROP TABLE #Test1;
DROP TABLE #Test2;

For this test, we’ll illustrate the fact that sometimes you don’t need to go up to 1M rows to distinguish the difference between two queries. In this case, it becomes quite apparent at 1000 rows. Here are the timings that are displayed by the two SELECT statements:

SVFElapsedMS
26290

iTVFElapsedMS
7933

That’s already a pretty large difference so you can imagine how long it would take to run at 1M rows. Here’s an explanation of the differences between this test harness and the one from before:

  • We ran at 1000 rows of test data instead of 1M (by filtering N from our Numbers table with WHERE N <= 1000).
  • Instead of assigning the results to a local variable, we instead SELECT … INTO a temporary table. Since both queries absorb the same overhead for this operation, the results are still comparable.
  • Instead of using SET STATISTICS TIME ON/OFF, we’ve simply used the @StartDT local variable to capture the elapsed time (calculated using DATEDIFF).

The latter method of capturing elapsed time is used because of a quirky behavior of STATISTICS in some cases when you are timing a scalar-valued, user-defined function. SQL MVP Jeff Moden explains this in How to Make Scalar UDFs Run Faster.

This example also serves to demonstrate the well-known fact that a good set-based query will almost always be faster than a loop. If you remember our introduction to DelimitedSplit8K in our Tally Tables blog, Jeff Moden uses the same basic methodology (an in-line Tally table) to make DelimitedSplit8K extremely fast.

Case Study 3: Timing Solutions for Running Totals

In this example, we’re going to perform a test on two solutions to the Running Totals (RT) problem in T-SQL. The second solution is only valid in SQL 2012. First, we’ll set up a test table and populate it with some random data.

CREATE TABLE dbo.RunningTotalsTest
(
    [Date]          DATETIME PRIMARY KEY
    ,Value          INT
    ,RunningTotal1  INT
    ,RunningTotal2  INT
);

WITH SomeDates AS
(
    SELECT d=DATEADD(hour, N, '2010-01-01')
    FROM dbo.Numbers
    WHERE N <= 10000
)
INSERT INTO dbo.RunningTotalsTest([Date], Value)
SELECT d, 1+ABS(CHECKSUM(NEWID()))%100
FROM SomeDates;
GO

We have populated our table with a series of date values that increases in an hourly fashion; including a “Value” that is simply a random integer between 1 and 100. We’ll be calculating the two RunningTotal columns from our Value. Note that at the end of this “batch” we have included the batch separator (GO). The test harness includes only 10,000 rows because we know in advance that this will be a sufficient number to distinguish between our solutions (1M rows is still recommended for most normal cases).

The first running totals solution we’ll look at is what is known as a triangular JOIN, because for each row it adds up all of the prior rows using a correlated sub-query. Once again, notice the batch separator (GO) at the end of the batch.

-- RT by Triangular JOIN
UPDATE a
SET RunningTotal1 =
    (
        SELECT SUM(value)
        FROM dbo.RunningTotalsTest b
        WHERE b.[Date] <= a.[Date]
    )
FROM dbo.RunningTotalsTest a;
GO

The next solution, which only works in SQL Server 2012, is a new facility Microsoft has kindly provided to us for calculating running totals (and a host of other things).

-- RT with SQL 2012 window frame
WITH RunningTotal AS
(
    SELECT [Date], Value, RunningTotal2
        ,rt=SUM(value) OVER (ORDER BY [Date] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM dbo.RunningTotalsTest
)
UPDATE a
SET RunningTotal2 = rt
FROM RunningTotal a;
GO

-- Final SELECT
SELECT *
FROM dbo.RunningTotalsTest;

We’ve also included a final SELECT (in the last batch) to show that both running totals were calculated correctly.

To run this code and obtain our timings, we’re going to learn to use Tools/SQL Profiler. This brings up a window allowing you to name the profiler (trace) if desired.

After you click Run, the Profile session will begin.

You can now execute the four batches of T-SQL we created above, two of which contain the solutions of interest. Once the run is complete, the Profile window now looks like this, where we have circled in red the two results of interest.

Notice how the comment we placed at the beginning of each batch, clearly shows up on the BatchCompleted lines with our desired results:

-- RT by Triangular JOIN
-- RT with SQL 2012 window frame

The results show that the new SQL 2012 method for running totals completed in only 117 milliseconds, while the triangular JOIN took 11267 milliseconds. Imagine what the triangular JOIN approach would have taken had we run against 1M rows, or better yet imagine a customer waiting on an application’s web form for that result to be displayed!

The first ten rows of results displayed show that both of our running totals solutions worked correctly, yet the timing results tell us that they are definitely not equivalent!

Date                    Value   RunningTotal1  RunningTotal2
2010-01-01 01:00:00.000   63    63             63
2010-01-01 02:00:00.000   75    138            138
2010-01-01 03:00:00.000   2     140            140
2010-01-01 04:00:00.000   27    167            167
2010-01-01 05:00:00.000   73    240            240
2010-01-01 06:00:00.000   71    311            311
2010-01-01 07:00:00.000   17    328            328
2010-01-01 08:00:00.000   64    392            392
2010-01-01 09:00:00.000   40    432            432
2010-01-01 10:00:00.000   56    488            488

SQL Profiler is a very useful way to time a batch that contains multiple SQL statements (like if you want to test performance of a CURSOR vs. a set-based solution). Take care when setting up each batch to avoid unnecessary overhead in one batch vs. the other.

In a future blog, we’ll describe a method for calculating running totals that works in any version of SQL that is faster than both of these solutions.

Summary and What We Learned

Firstly and most importantly, we’ve learned how to create a one million row test harness so we can compare the performance of two queries that return identical results, and why this is an essential step to verifying query performance.

We’ve learned how to generate random test data:

  • Using Tally tables
  • Using a formula to generate random numbers within a specified range
  • Generating random character strings

We’ve learned three methods to time the queries we’re comparing as they process the test harness:

  • Using SET STATISTICS TIME ON/OFF
  • Using an elapsed time calculator and displaying the results in a SELECT
  • Using SQL Profiler

Some hints to follow when using the 1M row test harness (not all of which are mentioned):

  • After creating your test harness, start out with a lower row count than a million. Once you’re done debugging the test harness, ramp up the row count until one of the solutions clearly distinguishes itself from the others.
  • Try to run each test multiple times at the highest row count you need to prove the result and then average the timings.
  • Avoid using SET STATISTICS ON/OFF when comparing code that includes a call to a Scalar-valued, User-defined FUNCTION. Use one of the other techniques instead (Profiler works fine on this case).
  • You can as easily compare 3-4 solutions as two using any of these methods. The more solutions you have to the same problem, the more chance you’ll have of identifying the best performing.
  • There are cases where using DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS will improve the accuracy of your measurement; however this is a topic that is more advanced than we planned to explore in this blog.

Additional reading:

  • Just about any article that SQL MVP Jeff Moden writes for the SQL Server Central web site provides an example of a 1M row test harness to test his proposed solution against alternatives, but these two articles are specifically directed to this subject.

o   Generating Test Data: Part 1 – Generating Random Integers and Floats
o   Generating Test Data: Part 2 – Generating Sequential and Random Dates

Now you’re ready to start learning how to write high-performance T-SQL code, because now you have a tool that is essential to proving that you’re writing queries that are the best that they can be!

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Manipulating Dates and Times in T-SQL

In SQL 2008, Microsoft introduced some new date and time data types to augment the options available in prior versions. The full list of these data types with detailed explanations can be found in Microsoft Books on Line (BOL), but we’ll list them here with a very brief description.

  • DATETIME – This is the standard and probably most commonly used type that’s been available in T-SQL since its early days, with a range of 1753-01-01 through 9999-12-31 and accuracy of about 3 milliseconds. Note that if you cast (implicitly or explicitly) an INT value of 0 to this data type, the result will be 1900-01-01.
  • DATE – This is a new type that is accurate to the day only (no time component) and has a range of 0001-01-01 through 9999-12-31.
  • DATETIME2 – This is a higher-accuracy DATE + TIME type that is accurate to 100 nanoseconds (or .0000001 seconds) and has a range of 0001-01-01 through 9999-12-31.
  • DATETIMEOFFSET – This is a DATE + TIME type that includes the UTC time zone offset with varying degrees of accuracy (you can specify) and has a range of 0001-01-01 through 9999-12-31.
  • SMALLDATETIME – This is another DATE + TIME type that has an accuracy of one minute (no seconds) and a date range of 1900-01-01 through 2079-06-06.
  • TIME – This is a TIME-only type that is accurate to 100 nanoseconds and has a range of 00:00:00.0000000 through 23:59:59.9999999.

This blog is less about the date and time data types and really about the different ways they can be manipulated. We’re not talking about formatting here (for that you’d use CONVERT), instead we’re talking about how to do date arithmetic and the functions that SQL provides you to do so. Normally date formatting should be done in an application’s front end, but it is often quite useful and necessary to do date calculations in the T-SQL back end.

Simple Date Arithmetic

If you have a DATETIME data column, or perhaps you’re using T-SQL’s GETDATE() built-in function, if you want to add a fixed number of days, that is very simple:

SELECT GETDATE(), GETDATE()-1, GETDATE()+1;

-- Results:
2014-03-05 11:29:37.087   2014-03-04 11:29:37.087   2014-03-06 11:29:37.087

Often this approach can be faster than using the T-SQL built in function for adding dates (to be discussed in a minute).

Unfortunately, this doesn’t work well with any of the other date and time data types except for SMALLDATETIME:

SELECT CAST(GETDATE() AS DATE)+1;
GO
SELECT CAST(GETDATE() AS DATETIME2)+1;
GO
SELECT CAST(GETDATE() AS DATETIMEOFFSET)+1;
GO
SELECT CAST(GETDATE() AS TIME)+1;
GO

-- Errors returned:
Operand type clash: date is incompatible with int
Operand type clash: datetime2 is incompatible with int
Operand type clash: datetimeoffset is incompatible with int
Operand type clash: time is incompatible with int

The same link provided above for CONVERT describes CAST. There are those that like to ignore the myth of SQL code compatibility and recommend that instead of using GETDATE() you use CURRENT_TIMESTAMP (the ANSI standard function that returns DATETIME), but I am not one of them.

If you need to do any other date arithmetic, SQL provides a built in function called DATEADD. It can be used to add a fixed number of days, hours, seconds, months, etc. to any date/time data type (although you will get an error adding days, weeks, months, etc. to a TIME data type). The first argument to DATEADD tells the function what unit you want to add, while the second specifies the number of units. The last argument is the date/time value you want to add those units to. So we can fix our code above to add one day to each of the supported date/time types.

SELECT GETDATE();
SELECT DATEADD(day, 1, CAST(GETDATE() AS DATE));
SELECT DATEADD(day, 1, CAST(GETDATE() AS DATETIME2));
SELECT DATEADD(day, 1, CAST(GETDATE() AS DATETIMEOFFSET));
SELECT CAST(GETDATE() AS SMALLDATETIME)+1;
SELECT DATEADD(hour, 1, CAST(GETDATE() AS TIME));

-- Results:
2014-03-05 11:43:53.117
2014-03-06
2014-03-06 11:43:53.1170000
2014-03-06 11:43:53.1170000 +00:00
2014-03-06 11:46:00
12:43:53.1170000

Those results also clearly demonstrate the accuracy of each of the data types.

Another extremely useful function for doing date arithmetic is DATEDIFF, which is used to calculate the difference between two dates (or times) in whole units as specified by its first argument. Let’s take a look at an example.

SELECT DATEDIFF(day, '2014-04-15', '2014-04-17');

-- Results:
2

The result is negative if the left date is greater than the right date. The first argument to DATEDIFF is the same set of units you can specify to DATEADD.

Date Truncation

In an earlier blog on Tally Tables and another one on Calendar Tables, we’ve seen that DATEADD and DATEDIFF can be combined to perform date truncation on a date part boundary. Now we’ll explain exactly how that works.   Let’s take a look at the T-SQL for the most common case (truncation to the day):

SELECT GETDATE();
-- Take the days difference between today's date and 1900-01-01
SELECT DATEDIFF(day, 0, GETDATE());
-- Add back the days difference to 1900-01-01
SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);

-- Results:
2014-03-05 12:02:51.870
41701
2014-03-05 00:00:00.000

If today’s date is 2014-03-05, the number of days since 1900-01-01 (=0 remember that this is the base date when 0 is cast to DATETIME) is 41701. We can add back that number of days to 1900-01-01 and get exactly today’s date without the time part.

Likewise, we can truncate to the minute, hour, month or year simply by specifying a different first argument to both functions:

SELECT GETDATE();
SELECT DATEADD(minute, DATEDIFF(minute, 0, GETDATE()), 0);
SELECT DATEADD(hour, DATEDIFF(hour, 0, GETDATE()), 0);
SELECT DATEADD(month, DATEDIFF(month, 0, GETDATE()), 0);
SELECT DATEADD(year, DATEDIFF(year, 0, GETDATE()), 0);

-- Results:
2014-03-05 12:08:51.573
2014-03-05 12:08:00.00
2014-03-05 12:00:00.000
2014-03-01 00:00:00.000
2014-01-01 00:00:00.000

You will run into an error however if you try it to the second:

The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

But you can work around this by specifying a later offset date (2010-01-01 should work for a few more years):

SELECT DATEADD(second, DATEDIFF(second, '2010-01-01', GETDATE()), '2010-01-01');

-- Results:
2014-03-05 12:08:51.000

So let’s try a quick exercise to demonstrate our new found skill with date arithmetic. Try to solve it before you look at the solution. How would you truncate a DATETIME to yesterday at 18:00:00.000?

SELECT GETDATE();
SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE())-1, '18:00');

-- Results:
2014-03-05 12:17:36.210
2014-03-04 18:00:00.000

In this case, the time value (18:00) we specified at the end is up-cast to DATETIME 1900-01-01 18:00:00.000 and that is what the days offset (from 1900-01-01) is added back to.

More of these really neat and useful but simple date arithmetic examples can be found in this blog by Lynn Pettis, who I like to think of as the guru of date manipulations: Some Common Date Routines.

Combining Date Components

In a database, dates should always be stored as dates (DATETIME) and not character strings or their individual date parts. Unfortunately, not everybody realizes this and sometimes make the mistake of storing dates as the individual parts. Let’s return now to an example of some T-SQL from our Calendar Tables blog. We’ll assume you still have access to the auxiliary Calendar TABLE we created in that blog.

SELECT [Date], [Year], [Month], [Day]
FROM dbo.Calendar
WHERE [Date] >= '2013-03-01' AND [Date] < '2013-03-05';

-- Results:
Date                     Year   Month   Day
2013-03-01 00:00:00.000 2013   3       1
2013-03-02 00:00:00.000 2013   3       2
2013-03-03 00:00:00.000 2013   3       3
2013-03-04 00:00:00.000 2013   3       4

Using our newly found knowledge of date arithmetic and the T-SQL built in functions to handle them, we can easily re-assemble the Year, Month and Day columns in this result to be a DATE or DATETIME.

SELECT [Date], [Year], [Month], [Day]
    ,[AsDATETIME]= [Day]+DATEADD(month,[Month]-1,DATEADD(year,[Year]-1900,0)-1
    ,[AsDATE]    = CAST([Day]+DATEADD(month,[Month]-1,DATEADD(year,[Year]-1900,0))-1 AS DATE)
FROM dbo.Calendar
WHERE [Date] >= '2013-03-01' AND [Date] < '2013-03-05';

-- Results:

Date                    Year   Month   Day   AsDATETIME              AsDATE
2013-03-01 00:00:00.000 2013   3       1     2013-03-01 00:00:00.000 2013-03-01
2013-03-02 00:00:00.000 2013   3       2     2013-03-02 00:00:00.000 2013-03-02
2013-03-03 00:00:00.000 2013   3       3     2013-03-03 00:00:00.000 2013-03-03
2013-03-04 00:00:00.000 2013   3       4     2013-03-04 00:00:00.000 2013-03-04

As you can see, we’ve subtracted the base year (1900) from Year, adding that number of years back to the base year (0=1900-01-01), then added Month-1 months to that and finally one less than Day to that. Our results are just what we need and we achieved them by using just a little simple date arithmetic. This will usually be faster than converting to a character string and then manipulating that, ultimately CASTing it to the desired type.

Casting DATETIMEs to Character Strings

If you’ve never tried to CAST a DATETIME to a character string, you might be a little surprised by the result you get.

SELECT GETDATE();
SELECT CAST(GETDATE() AS VARCHAR(100));

-- Results:
2014-03-05 14:13:42.760
Mar 5 2014 2:13PM

Why Microsoft chose this particular result is probably a mystery to all except them. But knowing this behavior does offer the occasional opportunity for CASTing in the other direction. Consider these cases:

SELECT CAST('Mar 01 2013' AS DATETIME);
SELECT CAST('Mar 01 2013 15:00:03' AS DATETIME);
SELECT CAST('Mar 01 2013 2:05PM' AS DATETIME);

-- Results:
2013-03-01 00:00:00.000
2013-03-01 15:00:03.000
2013-03-01 14:05:00.000

Another somewhat interesting case is when you try casting a character string that T-SQL recognizes as a valid year.

SELECT CAST('2013' AS DATETIME);
SELECT CAST('201303' AS DATETIME);
SELECT CAST('2013-01' AS DATETIME);

-- Results:
2013-01-01 00:00:00.000
The conversion of a varchar data type to a datetime data type resulted in an out-of-range value.
Conversion failed when converting date and/or time from character string.

While it works with a year, when the character string looks like a combination of year and month, it results in one of the two error messages shown.

On the other hand, these two cases produce exactly what you’d expect, which is the DATETIME value 2013-01-02 00:00:00.000, regardless of whether the CAST is explicit or implicit.

SELECT CAST('2013-01-02' AS DATETIME);
SELECT CAST('20130102' AS DATETIME);

To CAST these to any of the other date/time data types, you must use an explicit CAST. An implicit CAST will always result in a DATETIME.

Let’s also not forget the other CAST we learned from our exercise:

SELECT CAST('15:00' AS DATETIME)

-- Results:
1900-01-01 15:00:00.000

Summary

In this blog we’ve learned about the T-SQL data types that support dates and times, including the range and accuracy of each.

We have learned how to perform some simple date arithmetic and apply that knowledge to truncating a DATETIME and reassembling the date parts into a DATETIME.

Finally, we’ve learned a little about the results of casting a DATETIME to a character string and vice versa.

All of these date manipulation constructs can be considered T-SQL best practices if you need to use them. In general date arithmetic will be faster than extracting the parts as character strings, manipulating those character strings and then reassembling them and CASTing back to the date/time data type you need.

There are many other date functions offered by T-SQL and you can read about them in BOL. If you go back to the Calendar Tables blog (linked in above), you can now go back and try to understand how the GenerateCalendar function we provided there actually works.

I hope you enjoyed this introduction to manipulating dates and times in T-SQL.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Make it Work, Make it Fast, Make it Pretty

When I first heard this, it struck me as being a remarkably concise wisdom applicable to virtually any programming task. The entire quotation is actually:

“Make it work, make it fast, then make it pretty… and it isn’t finished until it is pretty!”

SQL MVP Jeff Moden (RedGate’s 2011 Exceptional DBA of the Year)

In case you don’t know what an MVP is, it stands for Most Valued Professional, and it is an award that Microsoft confers only to the best of the best in their Microsoft-focused technical skills.

Throughout the course of this article I will ask the reader a series of questions. Each question is designed as a thought question. When you encounter a question, you should mentally form a picture in your head of what your answer is. After the question I’ll provide you with my take on the answer. If your answers are not the same as my answers, you should take the time to reflect on why they are different.

Taking pride in your work is a very important thing. We all like to think we’re good at our jobs. Could you be better at your job? The answer is most certainly, because all of us could always be better. This article offers a philosophy by which any programmer can improve on their ability to perform their job. I sincerely hope that you can learn something from it.

Since the quotation on which this article is based was said by a really talented SQL DBA, towards the end of this article we’ll provide a code example in SQL. We’ll provide a good and bad example. Once again, you should reflect on this example and understand why it is bad or why it is good.

In case you were wondering, I have seen some of Jeff Moden’s work. I can assure you that he lives this philosophy every day. I have seen him produce some of the most remarkably ingenious and high-performing solutions to SQL problems, and in the end he always makes the code pretty. He will probably not be pleased with me for writing this article because he is also very modest about his accomplishments. That my dear reader is how you become a Microsoft MVP!

Make it Work

When you are writing SQL or coding in any programming language for that matter, the most important thing to ensure is that the code you have written meets the functional requirements. This means that you, as a Developer, must not only write the code but you must also thoroughly unit test it. That’s right, testing is not just for Testers to do!

To thoroughly unit test a piece of code, you must consider not just what happens when you execute your code against what you expect in terms of the way a business user may end up running your code. You must take into consideration unexpected actions of those same users, or in the case of SQL queries, unexpected but bad data in the tables you’re going against.

A good Developer will also take the time to document the testing that he or she performed. This documentation, if available, may help the testing team to identify other potential cases where the code that was written may fail to perform to expectations.

So as a Developer, do you believe that you perform sufficient unit testing so that your testing team finds no issues when they perform their testing? The answer is probably not, but that is the goal every Developer should focus on if they want to be considered “better” at their jobs.

Make it Fast

All SQL is not alike. I can probably write at least three to four queries that will satisfy any specific business requirement. The key is to select the query that will run the fastest, and there is some effort involved in making that happen.

Here’s another thought question for you. Has anyone ever told you that a query you have written runs too fast? Chances are, the answer to that question is no. Given the choice between two queries that return equivalent results, business users would almost always choose to use the query that runs the fastest. Performance counts! When I asked the first question to a good friend of mine by the name of Chris Morris (who is an expert at T-SQL), he had this story to relate:

“Some guy who’s been working at the same shop for three or four years without opening a book or visiting a forum but thinks he’s a T-SQL hero has a query that he swears can’t be made to run any faster. It takes twenty minutes to run. You rewrite it so that it works in a completely different way – which he hasn’t been exposed to – and it runs in two seconds. The first thing he’ll say is ‘That’s far too fast – it can’t possibly be doing all that work in so little time.’ Of course, the catch is that it isn’t doing all that work.”

I found that last bit quite humorous but also absolutely true. Two queries that return the same results do not have to perform the same amount of work. The one that does the least work is most likely to perform better.

There are some people who insist that, in order to write fast-performing queries, you must be an expert in understanding the query‘s execution plan. While that can certainly be helpful, not many are true experts in interpreting a query’s execution plan. In fact, the execution plan can oftentimes be misleading, specifically when comparing the “cost” of two queries. I’ve seen cases where two queries are rated as having a cost of 0% and 100%, yet the second query is much faster than the first.

To write faster queries, you don’t need to be an expert at reading a query’s execution plan but you do need to keep in mind a few very basic fundamentals.

  • Avoid CURSORs as they can be really slow in T-SQL. 99.99% of the time you can construct an alternate, set-based approach that will perform much faster than a CURSOR. The best favor you can do for yourself to improve the performance of your SQL is to forget the syntax for a CURSOR, or better yet completely forget that they exist.
  • Your query should only touch the rows it needs to touch and it should try to touch those rows only once. If it can touch only the entry in an index instead of the row, that is even better.
  • “Stop thinking about what you want to do to a row. Think instead about what you want to do to a column.” – This is another quote from Jeff Moden.
  • The only way to judge whether a query is fast or not is to have something to compare it against, like another query that returns exactly the same results.
  • I like to use what I call “code patterns” as a guideline to writing high performance SQL. In essence what this means is to know in advance the fastest performing methods for solving a particular problem and use that code pattern as the basis for the query that I am writing.
  • Add to your testing regimen the one million row test harness (look for this in a future blog). Write every query as if you expect it to be run one million times per day against one million rows of data.

I’ll provide more guidance in these areas in future blogs, as most of my interest is in writing high performance SQL.

Keep in mind that I only obtain very basic information from a SQL query’s execution plan. I mostly rely heavily on memorizing the fastest performing code patterns and using them when they are appropriate. That and keeping to the other bullet points I listed above.

Why should I take the time to try writing a faster performing query, when I can rely on SQL’s Database Tuning Advisor (DTA) to suggest an INDEX that will make my query perform faster? The answer to that is that INDEXes create overhead that slows down INSERTs, UPDATEs, DELETEs and MERGEs. Too much INDEXing in a database can drag down its overall performance way more than the little boost it gives to poorly written queries that perform inadequately.

The other thing that people usually have to say about this is that they don’t have the time it takes to generate more than one query form to solve a particular requirement because of time pressures and deadlines. Once again, memorizing the best practice code patterns can help you here. Once you learn them, and you can code them almost in your sleep, you’ll be able to generate multiple solutions to the same problem very quickly. Then, you just need to create the one million row test harness (also very easy to do) to prove which is fastest.

Make it Pretty

Let’s look at a query that is included in my earlier blog on Calendar Tables.

select [Date] from (select [Date], [WkDName2], FromDate, rn=row_number() over (partition by [YYYYMM] order by [Date] desc) from dbo.Calendar a left loin dbo.Holidays b on a.[DATE] between FromDate AND isnull(ToDate, FromDate) where [Date] between '2014-01-01' and '2014-12-31' and [Last] = 1 and [WkDName2] not in('SA', 'SU') and FromDate IS null) a where rn=1

Now compare this query against the actual query as published in that blog.

SELECT [Date]
FROM
(
    SELECT [Date], [WkDName2], FromDate
        ,rn=ROW_NUMBER() OVER (PARTITION BY [YYYYMM] ORDER BY [Date] DESC)
    FROM dbo.Calendar a
    LEFT JOIN dbo.Holidays b ON a.[DATE] BETWEEN FromDate AND ISNULL(ToDate, FromDate)
    WHERE [Date] BETWEEN '2014-01-01' AND '2014-12-31' AND
        [Last] = 1 AND
        -- Remove Saturdays and Sundays
        [WkDName2] NOT IN('SA', 'SU') AND
        -- Remove holidays
        FromDate IS NULL
) a
WHERE rn=1;

Both of these queries are exactly the same and would produce the same results. One of the things Developers may often forget is that whatever language you are writing in, it is unlikely that no one will ever look at your code again. Maintenance of applications is a fact of life, so the “make it pretty” step is invaluable to those poor folks that come along afterwards and need to maintain your code. As a Developer have you ever said “gee I really don’t want to try to figure out what that person before me did, so I’ll just rewrite it rather than modify it?” The answer to this question is probably and the reason for that is most likely that the original Developer didn’t take the time to “make it pretty.”

So what is wrong with the first query, when compared to the second? What makes it harder to maintain?

  • The first of the two can be called “stream of consciousness” coding, or to put it another way “write-only SQL.” I call it write-only SQL because you can’t easily read it. In other words, whoever comes along and looks at it later is going to have a difficult time trying to understand what it does.
  • The first query has no line breaks or indentation to assist the reader in identifying the individual steps that are being performed. For example, can you easily see that it has a derived table embedded in it? The answer is probably not.
  • Personally, I like to see language keywords in upper case. In the first query, none of the keywords are in upper case.
  • The first query has no comments in it. The second query could probably be improved by placing a comment before it that says “Calculate Aware pay days based on the last work day of the month.” To make it pretty, you should always include some comments to assist the reader that will come along after you to maintain it.
  • The first query does not end in a semi-colon. Why is that important you might ask? SQL Server does not require that a semi-colon terminate every SQL statement. Well, there are cases that it does require one (e.g., the MERGE statement) and there are also cases like Common Table Expressions (CTEs) where T-SQL requires that you terminate the statement just prior to the CTE with a semi-colon. There may come a day when T-SQL requires semi-colon terminators on every statement, so why not plan ahead for that day now and end all of your queries with a semi-colon? Did you know that ending your query with a semi-colon is an ANSI standard?

Every programming language (T-SQL included) has some standard methodology suggested for indentation, designed to make the code more readable. I’m not going to sit here and tell you that mine is the “best.” I will tell you that you should take a moment to think about how to indent your queries so that they end up being easy on the eyes of the readers that come along later. And then, once you have a methodology you are comfortable with, stick to it. Apply it to every query that you write without exception.

Once you get the hang of this, you’ll probably find yourself indenting complex queries according to your indentation methodology as you’re coding the query. That’s great because it saves you some cleanup work later on, and it will probably help you as you unit test the query because if it is complex (consisting of many steps) it is easier to run one step at a time, looking at intermediate results as necessary.

Conclusion

I can promise you that, when you read my blogs I’ve taken the time to make sure that each query I publish works, that it is the fastest it can possibly be and that it is (reasonably) pretty. There will be cases where I compare the performance of various code patterns that can be used to arrive at the same solution, particularly if one or the other won’t run in earlier versions of SQL.

Above I had some disparaging words to say about both CURSORs and INDEXes. My advice on CURSORs stands – avoid them at all costs and most of the time you can. Possibly one of the reasons that some people use CURSORs in T-SQL is because they’ve become accustomed to using them in Oracle SQL, where they perform much better.

INDEXes in general should be used sparingly, and when you need to INDEX a table (other than the PRIMARY KEY of course) it should be as a result of the pre-planning that you’ve done based on knowing the kinds of data retrieval operations you’ll be performing, rather than as an afterthought because some tool like DTA suggests that you do so.

As a quick example of this, there is a SQL concept known as relational division. All of the code patterns for retrieval in the case of relational division are highly dependent for their performance on the INDEXes that are created for the table. This is something we’ll explore in a future blog.

I will also be blogging on various common business problems that you can solve in SQL, and provide you with the best practice code pattern to ensure that the solutions I provide perform the best that they can. Oftentimes, if you already know the alternatives, you’ll immediately jump to the fastest solution available and that may allow you to skip the one million row test. Really the only time that you should skip that test is when you are 100% sure you’re using the best practice code pattern for performance.

In the end, you should never be satisfied that your code is “good enough” when with just a little bit of study and that little bit of extra effort it can be the “best that it can be.”

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Calendar Tables in T-SQL

In an earlier blog, we covered a type of auxiliary table (the Tally Table) that can provide a lot of querying flexibility if you have one in your database or construct one in-line to your query. Today we’re going to talk about another: the Calendar table.

The basic concept behind a Calendar table is that each row is a date and the columns in the table represent complex date calculations that otherwise you’d need to perform manually in your query. If the calculations are done beforehand, they can tremendously speed up your query because you can avoid them when you need them.

We’re also going to show you a really neat and fast FUNCTION for generating a Calendar table that you could use instead of having a pre-defined, auxiliary table in your database.

The GenerateCalendar FUNCTION

We’ll start by proposing a utility FUNCTION for performing various complex date calculations all in one shot. Originally I had the idea that this might be nearly as fast as having a separate Calendar table. A friend of mine, SQL MVP Jeff Moden, got wind of what I was doing and he rewrote my version into one that was significantly faster. So without further ado, here is that FUNCTION.

CREATE FUNCTION [dbo].[GenerateCalendar] 
        (
        @FromDate   DATETIME 
        ,@NoDays    INT			
        )
-- Generates a calendar table with sequential day numbering (@FromDate = SeqNo 1).
-- See RETURNS table (comments) for meaning of each column.
-- Notes:       1) Max for NoDays is 65536, which runs in just over 2 seconds.
--
-- Example calls to generate the calendar:
-- 1) Forward for 365 days starting today:
--             DECLARE @Date DATETIME
--             SELECT @Date = GETDATE()
--             SELECT * 
--             FROM dbo.GenerateCalendar(@Date, 365) 
--             ORDER BY SeqNo;
-- 2) Backwards for 365 days back starting today:
--             DECLARE @Date DATETIME
--             SELECT @Date = GETDATE()
--             SELECT * 
--             FROM dbo.GenerateCalendar(@Date, -365) 
--             ORDER BY SeqNo;
-- 3) For only the FromDate:
--             DECLARE @Date DATETIME
--             SELECT @Date = GETDATE()
--             SELECT * 
--             FROM dbo.GenerateCalendar(@Date, 1);
-- 4) Including only the last week days of each month:
--             Note: Seq no in this case are as if all dates were generated
--             DECLARE @Date DATETIME
--             SELECT @Date = GETDATE()
--             SELECT * 
--             FROM dbo.GenerateCalendar(@Date, 365) 
--             WHERE Last = 1 ORDER BY SeqNo;
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== High speed code provided courtesy of SQL MVP Jeff Moden (idea by Dwain Camps)
--===== Generate sequence numbers from 1 to 65536 (credit to SQL MVP Itzik Ben-Gen)
   WITH  E1(N) AS (SELECT 1 UNION ALL SELECT 1), --2 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b),    --4 rows
         E4(N) AS (SELECT 1 FROM E2 a, E2 b),    --16 rows
         E8(N) AS (SELECT 1 FROM E4 a, E4 b),    --256 rows
        E16(N) AS (SELECT 1 FROM E8 a, E8 b),    --65536 rows
   cteTally(N) AS (
SELECT TOP (ABS(@NoDays)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E16)
        -- [SeqNo]=Sequential day number (@FromDate always=1) forward or backwards
 SELECT [SeqNo]     = t.N,
        -- [Date]=Date (with 00:00:00.000 for the time component)                              
        [Date]      = dt.DT,  
        -- [Year]=Four digit year                                  
        [Year]      = dp.YY,
        -- [YrNN]=Two digit year                                    
        [YrNN]      = dp.YY % 100,
        -- [YYYYMM]=Integer YYYYMM (year * 100 + month)                              
        [YYYYMM]    = dp.YY * 100 + dp.MM,
        -- [BuddhaYr]=Year in Buddhist calendar                      
        [BuddhaYr]  = dp.YY + 543, 
        -- [Month]=Month (as an INT)                             
        [Month]     = dp.MM, 
        -- [Day]=Day (as an INT)                                   
        [Day]       = dp.DD,
        -- [WkDNo]=Week day number (based on @@DATEFIRST)                                    
        [WkDNo]     = DATEPART(dw,dt.DT),
        -- Next 3 columns dependent on language setting so may not work for non-English  
        -- [WkDName]=Full name of the week day, e.g., Monday, Tuesday, etc.                     
        [WkDName]   = CONVERT(NCHAR(9),dp.DW), 
        -- [WkDName2]=Two characters for the week day, e.g., Mo, Tu, etc.                 
        [WkDName2]  = CONVERT(NCHAR(2),dp.DW),  
        -- [WkDName3]=Three characters for the week day, e.g., Mon, Tue, etc.                
        [WkDName3]  = CONVERT(NCHAR(3),dp.DW),  
        -- [JulDay]=Julian day (day number of the year)                
        [JulDay]    = dp.DY,
        -- [JulWk]=Week number of the year                                    
        [JulWk]     = dp.DY/7+1,
        -- [WkNo]=Week number                                
        [WkNo]      = dp.DD/7+1,
        -- [Qtr]=Quarter number (of the year)                                
        [Qtr]       = DATEPART(qq,dt.Dt),                       
        -- [Last]=Number the weeks for the month in reverse      
        [Last]      = (DATEPART(dd,dp.LDtOfMo)-dp.DD)/7+1,
        -- [LdOfMo]=Last day of the month                  
        [LdOfMo]    = DATEPART(dd,dp.LDtOfMo),
        -- [LDtOfMo]=Last day of the month as a DATETIME
        [LDtOfMo]   = dp.LDtOfMo                                
   FROM cteTally t
  CROSS APPLY 
  ( --=== Create the date
        SELECT DT = DATEADD(dd,(t.N-1)*SIGN(@NoDays),@FromDate)
  ) dt
  CROSS APPLY 
  ( --=== Create the other parts from the date above using a "cCA"
    -- (Cascading CROSS APPLY (cCA), courtesy of Chris Morris)
        SELECT YY        = DATEPART(yy,dt.DT), 
                MM        = DATEPART(mm,dt.DT), 
                DD        = DATEPART(dd,dt.DT), 
                DW        = DATENAME(dw,dt.DT),
                Dy        = DATEPART(dy,dt.DT),
                LDtOfMo   = DATEADD(mm,DATEDIFF(mm,-1,dt.DT),-1)

  ) dp;

If you’ve read my blog on Tally tables, you’ll recognize the Ben-Gan style in-line Tally table that this FUNCTION uses (which will work in SQL 2005), and that it is limited to generating 65,536 rows (days) for your calendar table. You may also note that it is a schema-bound, in-line Table Valued FUNCTION (iTVF), done that way to get the best possible performance out of it. Let’s see how we would use it to populate a Calendar table starting on 01 Jan 2010.

SELECT [Date], [Year], [YrNN], [YYYYMM], [BuddhaYr], [Month], [Day]
    ,[WkDNo], [WkDName], [WkDName2], [WkDName3], [JulDay], [JulWk]
    ,[WkNo], [Qtr], [Last], [LdOfMo], [LDtOfMo]
INTO dbo.Calendar 
FROM dbo.GenerateCalendar('2010-01-01', 65536);

-- Change column types to be NOT NULL so we can index them
ALTER TABLE dbo.Calendar ALTER COLUMN [Date] DATETIME NOT NULL;
ALTER TABLE dbo.Calendar ALTER COLUMN [Year] INT NOT NULL;
ALTER TABLE dbo.Calendar ALTER COLUMN [Month] INT NOT NULL;
ALTER TABLE dbo.Calendar ALTER COLUMN [YYYYMM] INT NOT NULL;
ALTER TABLE dbo.Calendar ALTER COLUMN [Day] INT NOT NULL;
GO
-- Build some representative INDEXes
ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_pk PRIMARY KEY([Date]);
ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_ix1 UNIQUE NONCLUSTERED([Year], [Month], [Day]);
ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_ix2 UNIQUE NONCLUSTERED([YYYYMM], [Day]);

SELECT TOP 10 [Date], [YYYYMM], [WkDName2], [WkNo], [Last]
FROM dbo.Calendar;

The final SELECT returns selected columns that we’ll use in some examples that follow for just the first few rows of the table. Note that the CLUSTERED INDEX on [Date] makes it pretty likely that these rows will be returned in date sequence. We’ve also included a couple of UNIQUE NONCLUSTERED INDEXes to support the ways we may want to retrieve data from this table.

Date                    YYYYMM   WkDName2 WkNo  Last
2010-01-01 00:00:00.000 201001   Fr       1     5
2010-01-02 00:00:00.000 201001   Sa       1     5
2010-01-03 00:00:00.000 201001   Su       1     5
2010-01-04 00:00:00.000 201001   Mo       1     4
2010-01-05 00:00:00.000 201001   Tu       1     4
2010-01-06 00:00:00.000 201001   We       1     4
2010-01-07 00:00:00.000 201001   Th       2     4
2010-01-08 00:00:00.000 201001   Fr       2     4
2010-01-09 00:00:00.000 201001   Sa       2     4
2010-01-10 00:00:00.000 201001   Su       2     4

The last entry in this calendar table is 2189-06-06, which is probably far enough into the future for any reasonable use. Note also that the columns [Date] and [LDtOfMo] could as easily have been CAST to a DATE data type, assuming we’re working in SQL 2008 or later because we’ll never be interested in the time component.

Simple Cases of Using our Calendar Table

While these are “simple” cases for using our Calendar table, try doing them without one and you might come to the conclusion that they are not so simple.

Suppose we are interested in finding the first Monday of each month in the year 2014. The WkNo makes this really simple.

SELECT [Date]
FROM dbo.Calendar
WHERE [Date] BETWEEN ‘2014-01-01’ AND ‘2014-12-31’ AND
    [WkDName2] = ‘MO’ AND [WkNo] = 1;

Using the [Date] field as we have done ensures this query uses a Clustered Index Seek that returns these results.

Date
2014-01-06 00:00:00.000
2014-02-03 00:00:00.000
2014-03-03 00:00:00.000
2014-05-05 00:00:00.000
2014-06-02 00:00:00.000
2014-08-04 00:00:00.000
2014-09-01 00:00:00.000
2014-10-06 00:00:00.000
2014-11-03 00:00:00.000
2014-12-01 00:00:00.000

The interested reader may confirm that these are all Mondays.

Likewise the [Last] column, which numbers the days backwards as weeks within the month, is designed specifically to address a requirement like “list all of the dates that are the last Friday in a month.”

SELECT [Date]
FROM dbo.Calendar
WHERE [Date] BETWEEN ‘2014-01-01’ AND ‘2014-12-31’ AND
    [WkDName2] = ‘FR’ AND [Last] = 1;

Once again, because of the way we are using the [Date] in the WHERE filter, the result is a Clustered Index Seek, returning these results.

Date
2014-01-31 00:00:00.000
2014-02-28 00:00:00.000
2014-03-28 00:00:00.000
2014-04-25 00:00:00.000
2014-05-30 00:00:00.000
2014-06-27 00:00:00.000
2014-07-25 00:00:00.000
2014-08-29 00:00:00.000
2014-09-26 00:00:00.000
2014-10-31 00:00:00.000
2014-11-28 00:00:00.000
2014-12-26 00:00:00.000

Another requirement that can be a bit challenging is to calculate the last date of a particular month. Consider the query and results below listing the first and last day of each month, and specifically how it returns the leap day for February.

SELECT [Date], [LDtOfMo]
FROM dbo.Calendar
WHERE [Date] BETWEEN '2012-01-01' AND '2012-12-31' AND
    [Day] = 1;
Date                    LDtOfMo
2012-01-01 00:00:00.000 2012-01-31 00:00:00.000
2012-02-01 00:00:00.000 2012-02-29 00:00:00.000
2012-03-01 00:00:00.000 2012-03-31 00:00:00.000
2012-04-01 00:00:00.000 2012-04-30 00:00:00.000
2012-05-01 00:00:00.000 2012-05-31 00:00:00.000
2012-06-01 00:00:00.000 2012-06-30 00:00:00.000
2012-07-01 00:00:00.000 2012-07-31 00:00:00.000
2012-08-01 00:00:00.000 2012-08-31 00:00:00.000
2012-09-01 00:00:00.000 2012-09-30 00:00:00.000
2012-10-01 00:00:00.000 2012-10-31 00:00:00.000
2012-11-01 00:00:00.000 2012-11-30 00:00:00.000
2012-12-01 00:00:00.000 2012-12-31 00:00:00.000

The real utility of having a Calendar table will be clearer from the example described in the next section.

Using the Calendar Table to Fill in a Missing Month

Let’s suppose that you have a table that contains sales orders. We’ll simplify the situation slightly by ignoring the header/detail complexity of a true sales order, and just assume that the total for the order appears on the Sales order’s header.

CREATE TABLE #SalesOrders
(
    SONumber        INT IDENTITY PRIMARY KEY
    ,SODate         DATE
    ,Customer       VARCHAR(20)
    ,SalesAmount    MONEY
);

INSERT INTO #SalesOrders
VALUES('2013-01-15', 'AIS', 25000)
    ,('2013-01-31', 'AIS', 15000)
    ,('2013-03-01', 'AIS', 20000)
    ,('2013-03-02', 'AIS', 12000)
    ,('2013-03-05', 'AIS', 10000)
    ,('2013-01-15', 'DTAC', 25000)
    ,('2013-03-01', 'DTAC', 5000)
    ,('2013-03-15', 'DTAC', 11000);

SELECT *
FROM #SalesOrders;

We can see by inspecting the results that follow, that no sales were made in February.

SONumber  SODate       Customer  SalesAmount
1         2013-01-15   AIS       25000.00
2         2013-01-31   AIS       15000.00
3         2013-03-01   AIS       20000.00
4         2013-03-02   AIS       12000.00
5         2013-03-05   AIS       10000.00
6         2013-01-15   DTAC      25000.00
7         2013-03-01   DTAC      5000.00
8         2013-03-15   DTAC      11000.00

Suppose that our requirement was to list total sales by month for the first quarter of 2013. We can use our Calendar table to fill in the missing month.

SELECT [YYYYMM], Sales=ISNULL(SUM(SalesAmount), 0)
FROM dbo.Calendar a
LEFT JOIN #SalesOrders b ON a.[Date] = DATEADD(month, DATEDIFF(month, 0, SODate), 0)
WHERE [Date] BETWEEN '2013-01-01' AND '2013-03-31' AND
    [Day] = 1
GROUP BY [YYYYMM];

In our Tally tables blog entry, we showed you how you could truncate a DATETIME value to either the day or hour. DATEADD(month, DATEDIFF(month, 0, SODate), 0) simply truncates the DATE to the month (setting the day to the first of the month). It should come as no surprise to our readers that the results from this query exactly matches the requirement that we specified.

YYYYMM Sales
201301 65000.00
201302 0.00
201303 58000.00

We’ll leave it as an exercise for our interested readers to figure out how to list sales by company by month (so that there should be two rows for February each with a value of zero).

Note that just like a Tally table; a Calendar table can also be used to expand a row set if that happens to be what you need to do.

Using the Calendar Table in Conjunction with a Holidays Table

Suppose we have a table of company holidays for 2014 (this company is located in Thailand):

CREATE TABLE dbo.Holidays
(
    FromDate    DATETIME PRIMARY KEY
    ,ToDate     DATETIME
    ,Holiday    VARCHAR(100)
);

INSERT INTO dbo.Holidays
SELECT '2014-01-01', NULL, 'New Year''s Day'
UNION ALL SELECT '2014-02-14', NULL, 'Makha Bucha Day'
UNION ALL SELECT '2014-04-07', NULL, 'Substitution for Chakri Day'
UNION ALL SELECT '2014-04-14', '2014-04-15', 'Songkran Festival'
UNION ALL SELECT '2014-05-01', NULL, 'National Labor Day'
UNION ALL SELECT '2014-05-05', NULL, 'Coronation Day'
UNION ALL SELECT '2014-05-13', NULL, 'Wisakha Bucha Day'
UNION ALL SELECT '2014-07-11', NULL, 'Asarnha Bucha Day'
UNION ALL SELECT '2014-08-12', NULL, 'HM the Queen''s Birthday'
UNION ALL SELECT '2014-10-23', NULL, 'Chulalongkorn Day'
UNION ALL SELECT '2014-12-05', NULL, 'HM the King''s Birthday'
UNION ALL SELECT '2014-12-10', NULL, 'Constitution Day'
UNION ALL SELECT '2014-12-31', NULL, 'New Year''s Eve';

SELECT *
FROM dbo.Holidays;
FromDate     ToDate       Holiday
2014-01-01   NULL         New Year's Day
2014-02-14   NULL         Makha Bucha Day
2014-04-07   NULL         Substitution for Chakri Day
2014-04-14   2014-04-15   Songkran Festival
2014-05-01   NULL         National Labor Day
2014-05-05   NULL         Coronation Day
2014-05-13   NULL         Wisakha Bucha Day
2014-07-11   NULL         Asarnha Bucha Day
2014-08-12   NULL         HM the Queen's Birthday
2014-10-23   NULL         Chulalongkorn Day
2014-12-05   NULL         HM the King's Birthday
2014-12-10   NULL         Constitution Day
2014-12-31   NULL         New Year's Eve

Let’s suppose that our assignment is to calculate pay day for each month in 2014. This company pays its employees on the last day of the month, the Friday before that if the last day of the month occurs on a weekend or the previous work day if the normal pay day happens to be a holiday. Let’s do this in two parts.

Firstly, we know that none of the holidays plus weekend days can span more than seven days, so we just need to find (using the [Last] column of our Calendar table) all of the work days in the last week of the month.

SELECT [Date], [WkDName2], FromDate
FROM dbo.Calendar a
LEFT JOIN dbo.Holidays b ON a.[DATE] BETWEEN FromDate AND ISNULL(ToDate, FromDate) 
WHERE [Date] BETWEEN '2014-01-01' AND '2014-12-31' AND
    [Last] = 1 AND
    -- Remove Saturdays and Sundays
    [WkDName2] NOT IN ('SA', 'SU') AND
    -- Remove holidays
    FromDate IS NULL;

The above query actually generates 72 rows of results so let’s just look at a few of them (for May and December).

Date                     WkDName2 FromDate
<snip>
2014-05-26 00:00:00.000 Mo       NULL
2014-05-27 00:00:00.000 Tu       NULL
2014-05-28 00:00:00.000 We       NULL
2014-05-29 00:00:00.000 Th       NULL
2014-05-30 00:00:00.000 Fr       NULL
<snip>
2014-12-25 00:00:00.000 Th       NULL
2014-12-26 00:00:00.000 Fr       NULL
2014-12-29 00:00:00.000 Mo       NULL
2014-12-30 00:00:00.000 Tu       NULL

It should be no surprise that FromDate is always NULL, because of the WHERE filter. What’s important to note is that the last date shown for May is the Friday (31 May is a Saturday so it was eliminated) and that the last date for December is the 30th (because the New Year’s Eve holiday was eliminated).

It now becomes a relatively simple matter to use ROW_NUMBER() to identify the last working day of the month (note the DESC sort), like this:

SELECT [Date]
FROM
(
    SELECT [Date], [WkDName2], FromDate
        ,rn=ROW_NUMBER() OVER (PARTITION BY [YYYYMM] ORDER BY [Date] DESC) 
    FROM dbo.Calendar a
    LEFT JOIN dbo.Holidays b ON a.[DATE] BETWEEN FromDate AND ISNULL(ToDate, FromDate) 
    WHERE [Date] BETWEEN '2014-01-01' AND '2014-12-31' AND
        [Last] = 1 AND
        -- Remove Saturdays and Sundays
        [WkDName2] NOT IN ('SA', 'SU') AND
        -- Remove holidays
        FromDate IS NULL
) a
WHERE rn=1;

So here is a list of this Thai company’s pay days in 2014:

Date
2014-01-31 00:00:00.000
2014-02-28 00:00:00.000
2014-03-31 00:00:00.000
2014-04-30 00:00:00.000
2014-05-30 00:00:00.000
2014-06-30 00:00:00.000
2014-07-31 00:00:00.000
2014-08-29 00:00:00.000
2014-09-30 00:00:00.000
2014-10-31 00:00:00.000
2014-11-28 00:00:00.000
2014-12-30 00:00:00.000

Now that’s a query we can all be interested in!

But what is really important here is that by combining our two tables, we’ve got a really nice way of identifying working days and numbering them within a month if need be.

Conclusion

We have described how to create an auxiliary Calendar table in your SQL database and provided a high-performance, alternative FUNCTION (GenerateCalendar) that you can use instead if you want to. Since the FUNCTION returns a table, it can be substituted pretty much wherever the Calendar table is used (after FROM), but you do need to take care to specify the correct number of days you’ll need to drive the results.

Note that the GenerateCalendar FUNCTION is also somewhat unique in that it will generate days backwards from a starting date (by supplying a negative number of days). Some sample queries are embedded in its comments to guide you.

We’ve described three cases where a Calendar table can be useful:

  • To simplify what would otherwise be complex date calculations.
  • To fill in dates (e.g., a month) where those dates are missing from your data.
  • To combine with a Holidays table to calculate working days.

Note that there are some people that advocate marking days in your main Calendar table as holidays. I personally do not like that approach, mainly because Thailand has the tendency to create new holidays on a whim. I’d rather insert a new row into my Holidays table than update an existing row in my Calendar table, but there are other reasons why you may want to keep them separate.

Just like Tally tables, it is possible to create an in-line Calendar table that contains only the dates you need for your specific query. While you may still need to perform any specific date calculations you need within that query (like obtaining the month or day as an integer), it may still offer some usability in cases where you need to fill in dates like in our Sales order example. We’ll also leave this as an exercise for our interested readers.

We hope you found this blog entry on Calendar tables to be interesting, informative and at least a little bit of fun.

For additional reading on Calendar tables, you can look at these three articles by Todd Fifield:

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Tally Tables in T-SQL

The Tally (or Numbers) table is one of the most useful constructs in intermediate-level SQL. Have you ever written a T-SQL WHILE loop? How about a CURSOR? If your answer to either of those questions is yes, you’ve probably been coding some really poor-performing SQL and you may not even realize it. Tally tables are one way that you may be able to avoid using a loop or a CURSOR, instead creating a true set-based solution that is blazingly fast and reliable.

Tally Tables in SQL 2000

The ability to create a Tally table has been around for a long time, going way back to SQL Server’s roots in Sybase. One of the most common methods of creating a permanent table is:

SELECT TOP 1000000 N=IDENTITY(INT, 1, 1)
INTO dbo.Numbers
FROM master.dbo.syscolumns a CROSS JOIN master.dbo.syscolumns  b;

ALTER TABLE dbo.Numbers ADD CONSTRAINT NBR_pk PRIMARY KEY(N);

SELECT TOP 5 N FROM dbo.Numbers;

Obviously, in order to run this you need access to the master.dbo.syscolumns table that is referenced. Most of the time this particular table has at least 4,000 rows in it so constructing a Tally table with 1,000,000 (or even 10,000,000) rows is no problem at all. Our Tally table will start with 1 (results below are from the final SELECT).

N
1
2
3
4
5

If you create a permanent Tally table, it’s important to have a CLUSTERED INDEX (the PRIMARY KEY) on the single column that’s in the table for performance reasons. We’ll show some examples of unique ways that you can use a Tally table shortly, but first it’s good to know how to create them. The example above works in SQL Server 2000.

Here’s another Tally table that can generate N’s between 0 and 255 (up to 2047 in later versions like 2008) and can be used in SQL Server 2000.

SELECT N=number
FROM master..spt_values
WHERE type = 'P';

There are those that might be asking why in the world would you use an undocumented system table like this? Well, rest assured that this table is used internally by many of SQL’s internal stored procedures, so it is not going away any time soon.

Tally Tables in SQL 2005

When SQL Server 2005 came along and introduced the Common Table Expression (CTE) and the ROW_NUMBER() window function (both the subject of earlier blogs as linked), it opened up some new ways to create what are known as in-line Tally tables. Let’s look at a couple of examples.

WITH Tally (N) AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT TOP 5 N
FROM Tally;

-- Or if you don't have access to the sys tables use an in-line
-- Tally table known as a "Ben-Gan" style Tally
WITH lv0 AS (SELECT 0 g UNION ALL SELECT 0)
    ,lv1 AS (SELECT 0 g FROM lv0 a CROSS JOIN lv0 b) -- 4
    ,lv2 AS (SELECT 0 g FROM lv1 a CROSS JOIN lv1 b) -- 16
    ,lv3 AS (SELECT 0 g FROM lv2 a CROSS JOIN lv2 b) -- 256
    ,lv4 AS (SELECT 0 g FROM lv3 a CROSS JOIN lv3 b) -- 65,536
    ,lv5 AS (SELECT 0 g FROM lv4 a CROSS JOIN lv4 b) -- 4,294,967,296
    ,Tally (n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM lv5)
SELECT TOP (5) n
FROM Tally
ORDER BY n;

The second Tally table structure was first suggested by SQL MVP Itzik Ben-Gan.

Never mind how they work, just know that the first will generate somewhere around 16M+ rows and the other will create a little over 4 billion rows (!) if we did not apply the TOP 10. The second one is a case of “stacked” or “cascaded” CTEs. Both are extremely fast; the second one especially because it is done entirely in memory.

Tally Tables in SQL 2008

SQL Server 2008 introduced something that’s known as a Table Value Constructor (TVC), which offers a very compact way to create a Tally table tailored specifically to the precise number of rows you need without using a TOP clause. Here’s an example of that:

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
SELECT *
FROM Tally;

That Tally table generates exactly 1,000 rows because each VALUES TVC has exactly 10 zeroes in it. We’ll now look at just a few of the remarkable ways that a Tally table can improve your SQL by helping you to write a set-based solution where otherwise you might need to resort to a loop or CURSOR.

Using a Tally Table to Expand a Row Set

Let’s start with some sample data:

CREATE TABLE #Temp
(
    ID          INT IDENTITY PRIMARY KEY
    ,StartDT    DATETIME
);

INSERT INTO #Temp
SELECT '2014-02-18 09:20' UNION ALL SELECT '2014-02-19 05:35';

SELECT ID, StartDT
    ,TT=DATEPART(hour, DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0))
    ,TD=DATEADD(day, DATEDIFF(day, 0, StartDT), 0)
FROM #Temp;

The results returned are this, where DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0) is a nifty way to truncate the DATETIME value to the nearest hour (or use day instead to truncate to the current day). Here are the results from the above query:

ID   StartDT                  TT   TD
1    2014-02-18 09:20:00.000  9    2014-02-18 00:00:00.000
2    2014-02-19 05:35:00.000  5    2014-02-19 00:00:00.000

Suppose what we want is 10 rows that correspond to ID=1, where each is incremented to the next hour of the day (starting at midnight). Likewise, for ID=2 we want 6 rows. We can use a small Tally table for this (remember there’s only 24 hours in a day):

WITH Tally (N) AS
(
    -- Tally table starting at 0
    SELECT 0 UNION ALL
    -- Now 24 more rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0)) c(n)
)
SELECT ID, StartDT
    ,TD=DATEADD(hour, N, DATEADD(day, DATEDIFF(day, 0, StartDT), 0))
FROM #Temp
CROSS JOIN Tally
WHERE N BETWEEN 0 AND DATEPART(hour, DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0))
ORDER BY ID, TD;

Note how this Tally table has been modified to start at 0 by adding SELECT 0 UNION ALL before the second SELECT. The results of this query are:

ID  StartDT TD
1   2014-02-18 19:20:00.000   2014-02-18 00:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 01:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 02:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 03:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 04:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 05:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 06:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 07:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 08:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 09:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 00:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 01:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 02:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 03:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 04:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 05:00:00.000

Wasn’t that easy and really cool?

Using a Tally Table to Remove Repeated Characters in a String

There are a lot of situations in T-SQL where you need to parse a string. Many of the built-in functions that T-SQL provides stop at the first occurrence of something. But what if you need to perform the same operation on many characters within the string and a built-in function simply won’t do it for you?

This is where a Tally table can really save you. We’ll start with one that has 8000 rows so we can parse each character out of VARCHAR(8000) length string. Follow the comments in the code to see how it works.

DECLARE @S VARCHAR(8000) = 'Aarrrgggh!';

WITH Tally (N) AS
(
    -- 8000 rows (max length of the VARCHAR string)
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n)
)
SELECT n, s
FROM
(
    -- Always choose the first element
    SELECT n=1, s=LEFT(@S, 1) UNION ALL
    -- Include each successive next element as long as it’s different than the prior
    SELECT N, CASE
        WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1)
        THEN SUBSTRING(@S, N, 1)
        -- Repeated characters are assigned NULL by the CASE
        END
    FROM Tally
    WHERE N BETWEEN 2 AND LEN(@S)
) a
-- Now we filter out the repeated elements
WHERE s IS NOT NULL;

The results displayed for this are below. Note that it works just fine with strings of length 1 or 0 also.

n   s
1   A
3   r
6   g
9   h
10  !

It is also case-insensitive, so it removes the lowercase a (second character) because it is a repeat of the first character (capital A). It could be made case sensitive if necessary. Now all we need to do is combine the rows that remain back into our final string using a technique that is similare to Creating a Comma-separated List by author Microsoft Certified Master (MCM) Wayne Sheffield. It just needs a slight modification to not include the comma separator and it can be applied to this case as follows.

WITH Tally (N) AS
(
    -- 8000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n)
)
SELECT OriginalString=@S, NewString=
    (
        SELECT '' + s
        FROM
        (
            -- Same base query as the prior example starts here
            SELECT n, s
            FROM
            (
                -- Always choose the first element
                SELECT n=1, s=LEFT(@S, 1) UNION ALL
                -- Include each successive next element as long as it's
                -- different than the prior
                SELECT N, CASE
                            WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1)
                            THEN SUBSTRING(@S, N, 1)
                            -- Repeated characters are assigned NULL by the CASE
                            END
                FROM Tally
                WHERE N BETWEEN 2 AND LEN(@S)
            ) a
            -- Now we filter out the repeated elements
            WHERE s IS NOT NULL
        ) a
        ORDER BY n
        FOR XML PATH(''), TYPE
        ).value('.', 'VARCHAR(8000)'
    );

Now our result is:

OriginalString  NewString
Aarrrgggh!      Argh!

That is very fast and way cool, so maybe we want to save this as a utility FUNCTION. SQL has many types of User-defined FUNCTIONs, e.g., scalar, table-valued, etc. Scalar-valued functions are known to be total performance nightmares. By far the fastest type of FUNCTION if you can write it this way is to use what’s known as an inline Table Valued Function (iTVF). We can do this for our new utility FUNCTION.

CREATE FUNCTION dbo.RemoveRepeatedChars
(
    @S VARCHAR(8000)
)
RETURNS TABLE WITH SCHEMABINDING
RETURN

WITH Tally (N) AS
(
    -- 8000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n)
)
SELECT OriginalString=@S, NewString=
    (
        SELECT '' + s
        FROM
        (
            SELECT n, s
            FROM
            (
                -- Always choose the first element
                SELECT n=1, s=LEFT(@S, 1) UNION ALL
                -- Include each successive next element as long as it's
                -- different than the prior
                SELECT N, CASE
                            WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1)
                            THEN SUBSTRING(@S, N, 1)
                            -- Repeated characters are assigned NULL by the CASE
                            END
                FROM Tally
                WHERE N BETWEEN 2 AND LEN(@S)
            ) a
            -- Now we filter out the repeated elements
            WHERE s IS NOT NULL
        ) a
        ORDER BY n
        FOR XML PATH(''), TYPE
        ).value('.', 'VARCHAR(8000)'
    );

An iTVF with SCHEMABINDING is the fastest possible FUNCTION we can write and it works for this case. We can now call our function to parse as many strings as we need (like from a column in a table).

WITH SampleData (s) AS
(
    SELECT 'Aarrrgggh!'
    UNION ALL SELECT 'Noo repeaatting chharracterss wanntted.'
)
SELECT *
FROM SampleData
CROSS APPLY dbo.RemoveRepeatedChars(s);

CROSS APPLY is something we’ll cover in a future blog but we’ve provided some references at the end if you’d like to learn more now. The results we get from the above script are:

OriginalString                           NewString
Aarrrgggh!                               Argh!
Noo repeaatting chharracterss wanntted.  No repeating characters wanted.

Parsing a Comma Separated List of Values

A more common problem in SQL is the parsing of a comma-separated list of values. We’ve saved the best for last, because in this case we can truly point you to a best-of-breed solution.

How many of you have created a SQL FUNCTION to split a delimited list of values? Chances are you pulled down some tired old looping code from somewhere on the Internet that works but is dog slow. You probably don’t even realize how bad it is. Now is the time to locate it and replace it as what we’re about to tell you about will simply blow the doors off of any other SQL-based delimited string splitter!

SQL MVP Jeff Moden wrote the outstanding Tally OH! An Improved SQL 8K “CSV Splitter” Function (viewed more than 40,000 times as of this writing), which contains lots of outstanding information about Tally tables (another of his articles is referenced at the end), how they can be used to split a delimited string and the performance comparison with alternate solutions. At the end of the article, you can download DelimitedSplit8K, optimized specifically for use with VARCHAR(8000) strings, so don’t try to modify it to VARCHAR(MAX). I have provided a link to that article as over the years the FUNCTION has been revised and improved a few times, so if you go there you’ll be sure to get the most current version.

It also contains information on CLR-based splitters that can be even faster and handle the case of VARCHAR(MAX).

Conclusions

Let’s do a recap of all of the things we learned today:

  • The various ways to create a permanent or generate an in-line Tally table in SQL 2000, 2005 and 2008.
  • The formula for using DATEDIFF/DATEADD to truncate a date to just the day or to the hour.
  • How to expand a row set using a Tally table.
  • How to parse a string using a Tally table.
  • How to use the FOR XML PATH method to concatenate rows into a single character string.
  • How to create a high-performance, schema-bound, in-line Table Valued FUNCTION in SQL.
  • Where to find the fastest SQL-based delimited string splitter on the planet (DelimitedSplit8K).

We’ve also come away with two utility FUNCTIONs we can add to our SQL tool chest (see below for a third one):

  • RemoveRepeatedChars
  • DelimitedSplit8K

For further reading on CROSS APPLY (and its companion OUTER APPLY), here are two great articles by SQL MVP Paul White to get you started.

Further suggested reading:

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved