Month: March 2014

Calendar Tables in T-SQL

Posted on

In an earlier blog, we covered a type of auxiliary table (the Tally Table) that can provide a lot of querying flexibility if you have one in your database or construct one in-line to your query. Today we’re going to talk about another: the Calendar table.

The basic concept behind a Calendar table is that each row is a date and the columns in the table represent complex date calculations that otherwise you’d need to perform manually in your query. If the calculations are done beforehand, they can tremendously speed up your query because you can avoid them when you need them.

We’re also going to show you a really neat and fast FUNCTION for generating a Calendar table that you could use instead of having a pre-defined, auxiliary table in your database.

The GenerateCalendar FUNCTION

We’ll start by proposing a utility FUNCTION for performing various complex date calculations all in one shot. Originally I had the idea that this might be nearly as fast as having a separate Calendar table. A friend of mine, SQL MVP Jeff Moden, got wind of what I was doing and he rewrote my version into one that was significantly faster. So without further ado, here is that FUNCTION.

CREATE FUNCTION [dbo].[GenerateCalendar] 
        (
        @FromDate   DATETIME 
        ,@NoDays    INT			
        )
-- Generates a calendar table with sequential day numbering (@FromDate = SeqNo 1).
-- See RETURNS table (comments) for meaning of each column.
-- Notes:       1) Max for NoDays is 65536, which runs in just over 2 seconds.
--
-- Example calls to generate the calendar:
-- 1) Forward for 365 days starting today:
--             DECLARE @Date DATETIME
--             SELECT @Date = GETDATE()
--             SELECT * 
--             FROM dbo.GenerateCalendar(@Date, 365) 
--             ORDER BY SeqNo;
-- 2) Backwards for 365 days back starting today:
--             DECLARE @Date DATETIME
--             SELECT @Date = GETDATE()
--             SELECT * 
--             FROM dbo.GenerateCalendar(@Date, -365) 
--             ORDER BY SeqNo;
-- 3) For only the FromDate:
--             DECLARE @Date DATETIME
--             SELECT @Date = GETDATE()
--             SELECT * 
--             FROM dbo.GenerateCalendar(@Date, 1);
-- 4) Including only the last week days of each month:
--             Note: Seq no in this case are as if all dates were generated
--             DECLARE @Date DATETIME
--             SELECT @Date = GETDATE()
--             SELECT * 
--             FROM dbo.GenerateCalendar(@Date, 365) 
--             WHERE Last = 1 ORDER BY SeqNo;
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== High speed code provided courtesy of SQL MVP Jeff Moden (idea by Dwain Camps)
--===== Generate sequence numbers from 1 to 65536 (credit to SQL MVP Itzik Ben-Gen)
   WITH  E1(N) AS (SELECT 1 UNION ALL SELECT 1), --2 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b),    --4 rows
         E4(N) AS (SELECT 1 FROM E2 a, E2 b),    --16 rows
         E8(N) AS (SELECT 1 FROM E4 a, E4 b),    --256 rows
        E16(N) AS (SELECT 1 FROM E8 a, E8 b),    --65536 rows
   cteTally(N) AS (
SELECT TOP (ABS(@NoDays)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E16)
        -- [SeqNo]=Sequential day number (@FromDate always=1) forward or backwards
 SELECT [SeqNo]     = t.N,
        -- [Date]=Date (with 00:00:00.000 for the time component)                              
        [Date]      = dt.DT,  
        -- [Year]=Four digit year                                  
        [Year]      = dp.YY,
        -- [YrNN]=Two digit year                                    
        [YrNN]      = dp.YY % 100,
        -- [YYYYMM]=Integer YYYYMM (year * 100 + month)                              
        [YYYYMM]    = dp.YY * 100 + dp.MM,
        -- [BuddhaYr]=Year in Buddhist calendar                      
        [BuddhaYr]  = dp.YY + 543, 
        -- [Month]=Month (as an INT)                             
        [Month]     = dp.MM, 
        -- [Day]=Day (as an INT)                                   
        [Day]       = dp.DD,
        -- [WkDNo]=Week day number (based on @@DATEFIRST)                                    
        [WkDNo]     = DATEPART(dw,dt.DT),
        -- Next 3 columns dependent on language setting so may not work for non-English  
        -- [WkDName]=Full name of the week day, e.g., Monday, Tuesday, etc.                     
        [WkDName]   = CONVERT(NCHAR(9),dp.DW), 
        -- [WkDName2]=Two characters for the week day, e.g., Mo, Tu, etc.                 
        [WkDName2]  = CONVERT(NCHAR(2),dp.DW),  
        -- [WkDName3]=Three characters for the week day, e.g., Mon, Tue, etc.                
        [WkDName3]  = CONVERT(NCHAR(3),dp.DW),  
        -- [JulDay]=Julian day (day number of the year)                
        [JulDay]    = dp.DY,
        -- [JulWk]=Week number of the year                                    
        [JulWk]     = dp.DY/7+1,
        -- [WkNo]=Week number                                
        [WkNo]      = dp.DD/7+1,
        -- [Qtr]=Quarter number (of the year)                                
        [Qtr]       = DATEPART(qq,dt.Dt),                       
        -- [Last]=Number the weeks for the month in reverse      
        [Last]      = (DATEPART(dd,dp.LDtOfMo)-dp.DD)/7+1,
        -- [LdOfMo]=Last day of the month                  
        [LdOfMo]    = DATEPART(dd,dp.LDtOfMo),
        -- [LDtOfMo]=Last day of the month as a DATETIME
        [LDtOfMo]   = dp.LDtOfMo                                
   FROM cteTally t
  CROSS APPLY 
  ( --=== Create the date
        SELECT DT = DATEADD(dd,(t.N-1)*SIGN(@NoDays),@FromDate)
  ) dt
  CROSS APPLY 
  ( --=== Create the other parts from the date above using a "cCA"
    -- (Cascading CROSS APPLY (cCA), courtesy of Chris Morris)
        SELECT YY        = DATEPART(yy,dt.DT), 
                MM        = DATEPART(mm,dt.DT), 
                DD        = DATEPART(dd,dt.DT), 
                DW        = DATENAME(dw,dt.DT),
                Dy        = DATEPART(dy,dt.DT),
                LDtOfMo   = DATEADD(mm,DATEDIFF(mm,-1,dt.DT),-1)

  ) dp;

If you’ve read my blog on Tally tables, you’ll recognize the Ben-Gan style in-line Tally table that this FUNCTION uses (which will work in SQL 2005), and that it is limited to generating 65,536 rows (days) for your calendar table. You may also note that it is a schema-bound, in-line Table Valued FUNCTION (iTVF), done that way to get the best possible performance out of it. Let’s see how we would use it to populate a Calendar table starting on 01 Jan 2010.

SELECT [Date], [Year], [YrNN], [YYYYMM], [BuddhaYr], [Month], [Day]
    ,[WkDNo], [WkDName], [WkDName2], [WkDName3], [JulDay], [JulWk]
    ,[WkNo], [Qtr], [Last], [LdOfMo], [LDtOfMo]
INTO dbo.Calendar 
FROM dbo.GenerateCalendar('2010-01-01', 65536);

-- Change column types to be NOT NULL so we can index them
ALTER TABLE dbo.Calendar ALTER COLUMN [Date] DATETIME NOT NULL;
ALTER TABLE dbo.Calendar ALTER COLUMN [Year] INT NOT NULL;
ALTER TABLE dbo.Calendar ALTER COLUMN [Month] INT NOT NULL;
ALTER TABLE dbo.Calendar ALTER COLUMN [YYYYMM] INT NOT NULL;
ALTER TABLE dbo.Calendar ALTER COLUMN [Day] INT NOT NULL;
GO
-- Build some representative INDEXes
ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_pk PRIMARY KEY([Date]);
ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_ix1 UNIQUE NONCLUSTERED([Year], [Month], [Day]);
ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_ix2 UNIQUE NONCLUSTERED([YYYYMM], [Day]);

SELECT TOP 10 [Date], [YYYYMM], [WkDName2], [WkNo], [Last]
FROM dbo.Calendar;

The final SELECT returns selected columns that we’ll use in some examples that follow for just the first few rows of the table. Note that the CLUSTERED INDEX on [Date] makes it pretty likely that these rows will be returned in date sequence. We’ve also included a couple of UNIQUE NONCLUSTERED INDEXes to support the ways we may want to retrieve data from this table.

Date                    YYYYMM   WkDName2 WkNo  Last
2010-01-01 00:00:00.000 201001   Fr       1     5
2010-01-02 00:00:00.000 201001   Sa       1     5
2010-01-03 00:00:00.000 201001   Su       1     5
2010-01-04 00:00:00.000 201001   Mo       1     4
2010-01-05 00:00:00.000 201001   Tu       1     4
2010-01-06 00:00:00.000 201001   We       1     4
2010-01-07 00:00:00.000 201001   Th       2     4
2010-01-08 00:00:00.000 201001   Fr       2     4
2010-01-09 00:00:00.000 201001   Sa       2     4
2010-01-10 00:00:00.000 201001   Su       2     4

The last entry in this calendar table is 2189-06-06, which is probably far enough into the future for any reasonable use. Note also that the columns [Date] and [LDtOfMo] could as easily have been CAST to a DATE data type, assuming we’re working in SQL 2008 or later because we’ll never be interested in the time component.

Simple Cases of Using our Calendar Table

While these are “simple” cases for using our Calendar table, try doing them without one and you might come to the conclusion that they are not so simple.

Suppose we are interested in finding the first Monday of each month in the year 2014. The WkNo makes this really simple.

SELECT [Date]
FROM dbo.Calendar
WHERE [Date] BETWEEN ‘2014-01-01’ AND ‘2014-12-31’ AND
    [WkDName2] = ‘MO’ AND [WkNo] = 1;

Using the [Date] field as we have done ensures this query uses a Clustered Index Seek that returns these results.

Date
2014-01-06 00:00:00.000
2014-02-03 00:00:00.000
2014-03-03 00:00:00.000
2014-05-05 00:00:00.000
2014-06-02 00:00:00.000
2014-08-04 00:00:00.000
2014-09-01 00:00:00.000
2014-10-06 00:00:00.000
2014-11-03 00:00:00.000
2014-12-01 00:00:00.000

The interested reader may confirm that these are all Mondays.

Likewise the [Last] column, which numbers the days backwards as weeks within the month, is designed specifically to address a requirement like “list all of the dates that are the last Friday in a month.”

SELECT [Date]
FROM dbo.Calendar
WHERE [Date] BETWEEN ‘2014-01-01’ AND ‘2014-12-31’ AND
    [WkDName2] = ‘FR’ AND [Last] = 1;

Once again, because of the way we are using the [Date] in the WHERE filter, the result is a Clustered Index Seek, returning these results.

Date
2014-01-31 00:00:00.000
2014-02-28 00:00:00.000
2014-03-28 00:00:00.000
2014-04-25 00:00:00.000
2014-05-30 00:00:00.000
2014-06-27 00:00:00.000
2014-07-25 00:00:00.000
2014-08-29 00:00:00.000
2014-09-26 00:00:00.000
2014-10-31 00:00:00.000
2014-11-28 00:00:00.000
2014-12-26 00:00:00.000

Another requirement that can be a bit challenging is to calculate the last date of a particular month. Consider the query and results below listing the first and last day of each month, and specifically how it returns the leap day for February.

SELECT [Date], [LDtOfMo]
FROM dbo.Calendar
WHERE [Date] BETWEEN '2012-01-01' AND '2012-12-31' AND
    [Day] = 1;
Date                    LDtOfMo
2012-01-01 00:00:00.000 2012-01-31 00:00:00.000
2012-02-01 00:00:00.000 2012-02-29 00:00:00.000
2012-03-01 00:00:00.000 2012-03-31 00:00:00.000
2012-04-01 00:00:00.000 2012-04-30 00:00:00.000
2012-05-01 00:00:00.000 2012-05-31 00:00:00.000
2012-06-01 00:00:00.000 2012-06-30 00:00:00.000
2012-07-01 00:00:00.000 2012-07-31 00:00:00.000
2012-08-01 00:00:00.000 2012-08-31 00:00:00.000
2012-09-01 00:00:00.000 2012-09-30 00:00:00.000
2012-10-01 00:00:00.000 2012-10-31 00:00:00.000
2012-11-01 00:00:00.000 2012-11-30 00:00:00.000
2012-12-01 00:00:00.000 2012-12-31 00:00:00.000

The real utility of having a Calendar table will be clearer from the example described in the next section.

Using the Calendar Table to Fill in a Missing Month

Let’s suppose that you have a table that contains sales orders. We’ll simplify the situation slightly by ignoring the header/detail complexity of a true sales order, and just assume that the total for the order appears on the Sales order’s header.

CREATE TABLE #SalesOrders
(
    SONumber        INT IDENTITY PRIMARY KEY
    ,SODate         DATE
    ,Customer       VARCHAR(20)
    ,SalesAmount    MONEY
);

INSERT INTO #SalesOrders
VALUES('2013-01-15', 'AIS', 25000)
    ,('2013-01-31', 'AIS', 15000)
    ,('2013-03-01', 'AIS', 20000)
    ,('2013-03-02', 'AIS', 12000)
    ,('2013-03-05', 'AIS', 10000)
    ,('2013-01-15', 'DTAC', 25000)
    ,('2013-03-01', 'DTAC', 5000)
    ,('2013-03-15', 'DTAC', 11000);

SELECT *
FROM #SalesOrders;

We can see by inspecting the results that follow, that no sales were made in February.

SONumber  SODate       Customer  SalesAmount
1         2013-01-15   AIS       25000.00
2         2013-01-31   AIS       15000.00
3         2013-03-01   AIS       20000.00
4         2013-03-02   AIS       12000.00
5         2013-03-05   AIS       10000.00
6         2013-01-15   DTAC      25000.00
7         2013-03-01   DTAC      5000.00
8         2013-03-15   DTAC      11000.00

Suppose that our requirement was to list total sales by month for the first quarter of 2013. We can use our Calendar table to fill in the missing month.

SELECT [YYYYMM], Sales=ISNULL(SUM(SalesAmount), 0)
FROM dbo.Calendar a
LEFT JOIN #SalesOrders b ON a.[Date] = DATEADD(month, DATEDIFF(month, 0, SODate), 0)
WHERE [Date] BETWEEN '2013-01-01' AND '2013-03-31' AND
    [Day] = 1
GROUP BY [YYYYMM];

In our Tally tables blog entry, we showed you how you could truncate a DATETIME value to either the day or hour. DATEADD(month, DATEDIFF(month, 0, SODate), 0) simply truncates the DATE to the month (setting the day to the first of the month). It should come as no surprise to our readers that the results from this query exactly matches the requirement that we specified.

YYYYMM Sales
201301 65000.00
201302 0.00
201303 58000.00

We’ll leave it as an exercise for our interested readers to figure out how to list sales by company by month (so that there should be two rows for February each with a value of zero).

Note that just like a Tally table; a Calendar table can also be used to expand a row set if that happens to be what you need to do.

Using the Calendar Table in Conjunction with a Holidays Table

Suppose we have a table of company holidays for 2014 (this company is located in Thailand):

CREATE TABLE dbo.Holidays
(
    FromDate    DATETIME PRIMARY KEY
    ,ToDate     DATETIME
    ,Holiday    VARCHAR(100)
);

INSERT INTO dbo.Holidays
SELECT '2014-01-01', NULL, 'New Year''s Day'
UNION ALL SELECT '2014-02-14', NULL, 'Makha Bucha Day'
UNION ALL SELECT '2014-04-07', NULL, 'Substitution for Chakri Day'
UNION ALL SELECT '2014-04-14', '2014-04-15', 'Songkran Festival'
UNION ALL SELECT '2014-05-01', NULL, 'National Labor Day'
UNION ALL SELECT '2014-05-05', NULL, 'Coronation Day'
UNION ALL SELECT '2014-05-13', NULL, 'Wisakha Bucha Day'
UNION ALL SELECT '2014-07-11', NULL, 'Asarnha Bucha Day'
UNION ALL SELECT '2014-08-12', NULL, 'HM the Queen''s Birthday'
UNION ALL SELECT '2014-10-23', NULL, 'Chulalongkorn Day'
UNION ALL SELECT '2014-12-05', NULL, 'HM the King''s Birthday'
UNION ALL SELECT '2014-12-10', NULL, 'Constitution Day'
UNION ALL SELECT '2014-12-31', NULL, 'New Year''s Eve';

SELECT *
FROM dbo.Holidays;
FromDate     ToDate       Holiday
2014-01-01   NULL         New Year's Day
2014-02-14   NULL         Makha Bucha Day
2014-04-07   NULL         Substitution for Chakri Day
2014-04-14   2014-04-15   Songkran Festival
2014-05-01   NULL         National Labor Day
2014-05-05   NULL         Coronation Day
2014-05-13   NULL         Wisakha Bucha Day
2014-07-11   NULL         Asarnha Bucha Day
2014-08-12   NULL         HM the Queen's Birthday
2014-10-23   NULL         Chulalongkorn Day
2014-12-05   NULL         HM the King's Birthday
2014-12-10   NULL         Constitution Day
2014-12-31   NULL         New Year's Eve

Let’s suppose that our assignment is to calculate pay day for each month in 2014. This company pays its employees on the last day of the month, the Friday before that if the last day of the month occurs on a weekend or the previous work day if the normal pay day happens to be a holiday. Let’s do this in two parts.

Firstly, we know that none of the holidays plus weekend days can span more than seven days, so we just need to find (using the [Last] column of our Calendar table) all of the work days in the last week of the month.

SELECT [Date], [WkDName2], FromDate
FROM dbo.Calendar a
LEFT JOIN dbo.Holidays b ON a.[DATE] BETWEEN FromDate AND ISNULL(ToDate, FromDate) 
WHERE [Date] BETWEEN '2014-01-01' AND '2014-12-31' AND
    [Last] = 1 AND
    -- Remove Saturdays and Sundays
    [WkDName2] NOT IN ('SA', 'SU') AND
    -- Remove holidays
    FromDate IS NULL;

The above query actually generates 72 rows of results so let’s just look at a few of them (for May and December).

Date                     WkDName2 FromDate
<snip>
2014-05-26 00:00:00.000 Mo       NULL
2014-05-27 00:00:00.000 Tu       NULL
2014-05-28 00:00:00.000 We       NULL
2014-05-29 00:00:00.000 Th       NULL
2014-05-30 00:00:00.000 Fr       NULL
<snip>
2014-12-25 00:00:00.000 Th       NULL
2014-12-26 00:00:00.000 Fr       NULL
2014-12-29 00:00:00.000 Mo       NULL
2014-12-30 00:00:00.000 Tu       NULL

It should be no surprise that FromDate is always NULL, because of the WHERE filter. What’s important to note is that the last date shown for May is the Friday (31 May is a Saturday so it was eliminated) and that the last date for December is the 30th (because the New Year’s Eve holiday was eliminated).

It now becomes a relatively simple matter to use ROW_NUMBER() to identify the last working day of the month (note the DESC sort), like this:

SELECT [Date]
FROM
(
    SELECT [Date], [WkDName2], FromDate
        ,rn=ROW_NUMBER() OVER (PARTITION BY [YYYYMM] ORDER BY [Date] DESC) 
    FROM dbo.Calendar a
    LEFT JOIN dbo.Holidays b ON a.[DATE] BETWEEN FromDate AND ISNULL(ToDate, FromDate) 
    WHERE [Date] BETWEEN '2014-01-01' AND '2014-12-31' AND
        [Last] = 1 AND
        -- Remove Saturdays and Sundays
        [WkDName2] NOT IN ('SA', 'SU') AND
        -- Remove holidays
        FromDate IS NULL
) a
WHERE rn=1;

So here is a list of this Thai company’s pay days in 2014:

Date
2014-01-31 00:00:00.000
2014-02-28 00:00:00.000
2014-03-31 00:00:00.000
2014-04-30 00:00:00.000
2014-05-30 00:00:00.000
2014-06-30 00:00:00.000
2014-07-31 00:00:00.000
2014-08-29 00:00:00.000
2014-09-30 00:00:00.000
2014-10-31 00:00:00.000
2014-11-28 00:00:00.000
2014-12-30 00:00:00.000

Now that’s a query we can all be interested in!

But what is really important here is that by combining our two tables, we’ve got a really nice way of identifying working days and numbering them within a month if need be.

Conclusion

We have described how to create an auxiliary Calendar table in your SQL database and provided a high-performance, alternative FUNCTION (GenerateCalendar) that you can use instead if you want to. Since the FUNCTION returns a table, it can be substituted pretty much wherever the Calendar table is used (after FROM), but you do need to take care to specify the correct number of days you’ll need to drive the results.

Note that the GenerateCalendar FUNCTION is also somewhat unique in that it will generate days backwards from a starting date (by supplying a negative number of days). Some sample queries are embedded in its comments to guide you.

We’ve described three cases where a Calendar table can be useful:

  • To simplify what would otherwise be complex date calculations.
  • To fill in dates (e.g., a month) where those dates are missing from your data.
  • To combine with a Holidays table to calculate working days.

Note that there are some people that advocate marking days in your main Calendar table as holidays. I personally do not like that approach, mainly because Thailand has the tendency to create new holidays on a whim. I’d rather insert a new row into my Holidays table than update an existing row in my Calendar table, but there are other reasons why you may want to keep them separate.

Just like Tally tables, it is possible to create an in-line Calendar table that contains only the dates you need for your specific query. While you may still need to perform any specific date calculations you need within that query (like obtaining the month or day as an integer), it may still offer some usability in cases where you need to fill in dates like in our Sales order example. We’ll also leave this as an exercise for our interested readers.

We hope you found this blog entry on Calendar tables to be interesting, informative and at least a little bit of fun.

For additional reading on Calendar tables, you can look at these three articles by Todd Fifield:

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Advertisement

Tally Tables in T-SQL

Posted on Updated on

The Tally (or Numbers) table is one of the most useful constructs in intermediate-level SQL. Have you ever written a T-SQL WHILE loop? How about a CURSOR? If your answer to either of those questions is yes, you’ve probably been coding some really poor-performing SQL and you may not even realize it. Tally tables are one way that you may be able to avoid using a loop or a CURSOR, instead creating a true set-based solution that is blazingly fast and reliable.

Tally Tables in SQL 2000

The ability to create a Tally table has been around for a long time, going way back to SQL Server’s roots in Sybase. One of the most common methods of creating a permanent table is:

SELECT TOP 1000000 N=IDENTITY(INT, 1, 1)
INTO dbo.Numbers
FROM master.dbo.syscolumns a CROSS JOIN master.dbo.syscolumns  b;

ALTER TABLE dbo.Numbers ADD CONSTRAINT NBR_pk PRIMARY KEY(N);

SELECT TOP 5 N FROM dbo.Numbers;

Obviously, in order to run this you need access to the master.dbo.syscolumns table that is referenced. Most of the time this particular table has at least 4,000 rows in it so constructing a Tally table with 1,000,000 (or even 10,000,000) rows is no problem at all. Our Tally table will start with 1 (results below are from the final SELECT).

N
1
2
3
4
5

If you create a permanent Tally table, it’s important to have a CLUSTERED INDEX (the PRIMARY KEY) on the single column that’s in the table for performance reasons. We’ll show some examples of unique ways that you can use a Tally table shortly, but first it’s good to know how to create them. The example above works in SQL Server 2000.

Here’s another Tally table that can generate N’s between 0 and 255 (up to 2047 in later versions like 2008) and can be used in SQL Server 2000.

SELECT N=number
FROM master..spt_values
WHERE type = 'P';

There are those that might be asking why in the world would you use an undocumented system table like this? Well, rest assured that this table is used internally by many of SQL’s internal stored procedures, so it is not going away any time soon.

Tally Tables in SQL 2005

When SQL Server 2005 came along and introduced the Common Table Expression (CTE) and the ROW_NUMBER() window function (both the subject of earlier blogs as linked), it opened up some new ways to create what are known as in-line Tally tables. Let’s look at a couple of examples.

WITH Tally (N) AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT TOP 5 N
FROM Tally;

-- Or if you don't have access to the sys tables use an in-line
-- Tally table known as a "Ben-Gan" style Tally
WITH lv0 AS (SELECT 0 g UNION ALL SELECT 0)
    ,lv1 AS (SELECT 0 g FROM lv0 a CROSS JOIN lv0 b) -- 4
    ,lv2 AS (SELECT 0 g FROM lv1 a CROSS JOIN lv1 b) -- 16
    ,lv3 AS (SELECT 0 g FROM lv2 a CROSS JOIN lv2 b) -- 256
    ,lv4 AS (SELECT 0 g FROM lv3 a CROSS JOIN lv3 b) -- 65,536
    ,lv5 AS (SELECT 0 g FROM lv4 a CROSS JOIN lv4 b) -- 4,294,967,296
    ,Tally (n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM lv5)
SELECT TOP (5) n
FROM Tally
ORDER BY n;

The second Tally table structure was first suggested by SQL MVP Itzik Ben-Gan.

Never mind how they work, just know that the first will generate somewhere around 16M+ rows and the other will create a little over 4 billion rows (!) if we did not apply the TOP 10. The second one is a case of “stacked” or “cascaded” CTEs. Both are extremely fast; the second one especially because it is done entirely in memory.

Tally Tables in SQL 2008

SQL Server 2008 introduced something that’s known as a Table Value Constructor (TVC), which offers a very compact way to create a Tally table tailored specifically to the precise number of rows you need without using a TOP clause. Here’s an example of that:

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
SELECT *
FROM Tally;

That Tally table generates exactly 1,000 rows because each VALUES TVC has exactly 10 zeroes in it. We’ll now look at just a few of the remarkable ways that a Tally table can improve your SQL by helping you to write a set-based solution where otherwise you might need to resort to a loop or CURSOR.

Using a Tally Table to Expand a Row Set

Let’s start with some sample data:

CREATE TABLE #Temp
(
    ID          INT IDENTITY PRIMARY KEY
    ,StartDT    DATETIME
);

INSERT INTO #Temp
SELECT '2014-02-18 09:20' UNION ALL SELECT '2014-02-19 05:35';

SELECT ID, StartDT
    ,TT=DATEPART(hour, DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0))
    ,TD=DATEADD(day, DATEDIFF(day, 0, StartDT), 0)
FROM #Temp;

The results returned are this, where DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0) is a nifty way to truncate the DATETIME value to the nearest hour (or use day instead to truncate to the current day). Here are the results from the above query:

ID   StartDT                  TT   TD
1    2014-02-18 09:20:00.000  9    2014-02-18 00:00:00.000
2    2014-02-19 05:35:00.000  5    2014-02-19 00:00:00.000

Suppose what we want is 10 rows that correspond to ID=1, where each is incremented to the next hour of the day (starting at midnight). Likewise, for ID=2 we want 6 rows. We can use a small Tally table for this (remember there’s only 24 hours in a day):

WITH Tally (N) AS
(
    -- Tally table starting at 0
    SELECT 0 UNION ALL
    -- Now 24 more rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0)) c(n)
)
SELECT ID, StartDT
    ,TD=DATEADD(hour, N, DATEADD(day, DATEDIFF(day, 0, StartDT), 0))
FROM #Temp
CROSS JOIN Tally
WHERE N BETWEEN 0 AND DATEPART(hour, DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0))
ORDER BY ID, TD;

Note how this Tally table has been modified to start at 0 by adding SELECT 0 UNION ALL before the second SELECT. The results of this query are:

ID  StartDT TD
1   2014-02-18 19:20:00.000   2014-02-18 00:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 01:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 02:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 03:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 04:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 05:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 06:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 07:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 08:00:00.000
1   2014-02-18 19:20:00.000   2014-02-18 09:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 00:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 01:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 02:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 03:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 04:00:00.000
2   2014-02-19 15:35:00.000   2014-02-19 05:00:00.000

Wasn’t that easy and really cool?

Using a Tally Table to Remove Repeated Characters in a String

There are a lot of situations in T-SQL where you need to parse a string. Many of the built-in functions that T-SQL provides stop at the first occurrence of something. But what if you need to perform the same operation on many characters within the string and a built-in function simply won’t do it for you?

This is where a Tally table can really save you. We’ll start with one that has 8000 rows so we can parse each character out of VARCHAR(8000) length string. Follow the comments in the code to see how it works.

DECLARE @S VARCHAR(8000) = 'Aarrrgggh!';

WITH Tally (N) AS
(
    -- 8000 rows (max length of the VARCHAR string)
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n)
)
SELECT n, s
FROM
(
    -- Always choose the first element
    SELECT n=1, s=LEFT(@S, 1) UNION ALL
    -- Include each successive next element as long as it’s different than the prior
    SELECT N, CASE
        WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1)
        THEN SUBSTRING(@S, N, 1)
        -- Repeated characters are assigned NULL by the CASE
        END
    FROM Tally
    WHERE N BETWEEN 2 AND LEN(@S)
) a
-- Now we filter out the repeated elements
WHERE s IS NOT NULL;

The results displayed for this are below. Note that it works just fine with strings of length 1 or 0 also.

n   s
1   A
3   r
6   g
9   h
10  !

It is also case-insensitive, so it removes the lowercase a (second character) because it is a repeat of the first character (capital A). It could be made case sensitive if necessary. Now all we need to do is combine the rows that remain back into our final string using a technique that is similare to Creating a Comma-separated List by author Microsoft Certified Master (MCM) Wayne Sheffield. It just needs a slight modification to not include the comma separator and it can be applied to this case as follows.

WITH Tally (N) AS
(
    -- 8000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n)
)
SELECT OriginalString=@S, NewString=
    (
        SELECT '' + s
        FROM
        (
            -- Same base query as the prior example starts here
            SELECT n, s
            FROM
            (
                -- Always choose the first element
                SELECT n=1, s=LEFT(@S, 1) UNION ALL
                -- Include each successive next element as long as it's
                -- different than the prior
                SELECT N, CASE
                            WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1)
                            THEN SUBSTRING(@S, N, 1)
                            -- Repeated characters are assigned NULL by the CASE
                            END
                FROM Tally
                WHERE N BETWEEN 2 AND LEN(@S)
            ) a
            -- Now we filter out the repeated elements
            WHERE s IS NOT NULL
        ) a
        ORDER BY n
        FOR XML PATH(''), TYPE
        ).value('.', 'VARCHAR(8000)'
    );

Now our result is:

OriginalString  NewString
Aarrrgggh!      Argh!

That is very fast and way cool, so maybe we want to save this as a utility FUNCTION. SQL has many types of User-defined FUNCTIONs, e.g., scalar, table-valued, etc. Scalar-valued functions are known to be total performance nightmares. By far the fastest type of FUNCTION if you can write it this way is to use what’s known as an inline Table Valued Function (iTVF). We can do this for our new utility FUNCTION.

CREATE FUNCTION dbo.RemoveRepeatedChars
(
    @S VARCHAR(8000)
)
RETURNS TABLE WITH SCHEMABINDING
RETURN

WITH Tally (N) AS
(
    -- 8000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n)
)
SELECT OriginalString=@S, NewString=
    (
        SELECT '' + s
        FROM
        (
            SELECT n, s
            FROM
            (
                -- Always choose the first element
                SELECT n=1, s=LEFT(@S, 1) UNION ALL
                -- Include each successive next element as long as it's
                -- different than the prior
                SELECT N, CASE
                            WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1)
                            THEN SUBSTRING(@S, N, 1)
                            -- Repeated characters are assigned NULL by the CASE
                            END
                FROM Tally
                WHERE N BETWEEN 2 AND LEN(@S)
            ) a
            -- Now we filter out the repeated elements
            WHERE s IS NOT NULL
        ) a
        ORDER BY n
        FOR XML PATH(''), TYPE
        ).value('.', 'VARCHAR(8000)'
    );

An iTVF with SCHEMABINDING is the fastest possible FUNCTION we can write and it works for this case. We can now call our function to parse as many strings as we need (like from a column in a table).

WITH SampleData (s) AS
(
    SELECT 'Aarrrgggh!'
    UNION ALL SELECT 'Noo repeaatting chharracterss wanntted.'
)
SELECT *
FROM SampleData
CROSS APPLY dbo.RemoveRepeatedChars(s);

CROSS APPLY is something we’ll cover in a future blog but we’ve provided some references at the end if you’d like to learn more now. The results we get from the above script are:

OriginalString                           NewString
Aarrrgggh!                               Argh!
Noo repeaatting chharracterss wanntted.  No repeating characters wanted.

Parsing a Comma Separated List of Values

A more common problem in SQL is the parsing of a comma-separated list of values. We’ve saved the best for last, because in this case we can truly point you to a best-of-breed solution.

How many of you have created a SQL FUNCTION to split a delimited list of values? Chances are you pulled down some tired old looping code from somewhere on the Internet that works but is dog slow. You probably don’t even realize how bad it is. Now is the time to locate it and replace it as what we’re about to tell you about will simply blow the doors off of any other SQL-based delimited string splitter!

SQL MVP Jeff Moden wrote the outstanding Tally OH! An Improved SQL 8K “CSV Splitter” Function (viewed more than 40,000 times as of this writing), which contains lots of outstanding information about Tally tables (another of his articles is referenced at the end), how they can be used to split a delimited string and the performance comparison with alternate solutions. At the end of the article, you can download DelimitedSplit8K, optimized specifically for use with VARCHAR(8000) strings, so don’t try to modify it to VARCHAR(MAX). I have provided a link to that article as over the years the FUNCTION has been revised and improved a few times, so if you go there you’ll be sure to get the most current version.

It also contains information on CLR-based splitters that can be even faster and handle the case of VARCHAR(MAX).

Conclusions

Let’s do a recap of all of the things we learned today:

  • The various ways to create a permanent or generate an in-line Tally table in SQL 2000, 2005 and 2008.
  • The formula for using DATEDIFF/DATEADD to truncate a date to just the day or to the hour.
  • How to expand a row set using a Tally table.
  • How to parse a string using a Tally table.
  • How to use the FOR XML PATH method to concatenate rows into a single character string.
  • How to create a high-performance, schema-bound, in-line Table Valued FUNCTION in SQL.
  • Where to find the fastest SQL-based delimited string splitter on the planet (DelimitedSplit8K).

We’ve also come away with two utility FUNCTIONs we can add to our SQL tool chest (see below for a third one):

  • RemoveRepeatedChars
  • DelimitedSplit8K

For further reading on CROSS APPLY (and its companion OUTER APPLY), here are two great articles by SQL MVP Paul White to get you started.

Further suggested reading:

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Getting an INDEX SEEK to Speed up LIKE “%string%” Searches

Posted on Updated on

In today’s blog I will attempt to challenge the popularly held notion that LIKE “%string%” wildcard searches must be slow (Sargability: Why %string% Is Slow).

A Sample Table Populated with 10 Million Rows of Test Data

In order to do this, we’ll need a large table of test data with a composite PRIMARY KEY to demonstrate various aspects of the issue.

CREATE TABLE dbo.TestLIKESearches
(
    ID1         INT
    ,ID2        INT
    ,AString    VARCHAR(100)
    ,Value      INT
    ,PRIMARY KEY (ID1, ID2)
);

WITH Tally (n) AS
(
SELECT TOP 10000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO dbo.TestLIKESearches
    (ID1, ID2, AString, Value)
SELECT 1+n/500, n%500
    ,CASE WHEN n%500 > 299 THEN
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            RIGHT(1000+n%1000, 3) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) +
            SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1)
          END
    ,1+ABS(CHECKSUM(NEWID()))%100
FROM Tally;

While we have set the AString column to contain some NULL values (60% in fact), what we are about to show works even when 0% of the rows are NULL. Let’s start by showing the non-SARGable version of the query and its corresponding actual execution plan.

SELECT ID1, ID2, AString
FROM dbo.TestLIKESearches
WHERE AString LIKE '%21%';


As expected, the non-SARGable query results in a Clustered Index SCAN on the table. Let’s see if we can affect this by adding the following INDEX.

CREATE INDEX tls_ix1 ON dbo.TestLIKESearches(AString);

When we check the execution plan again for the same query, it now looks like this:


The query is now using the NONCLUSTERED INDEX, but it is still doing a SCAN. So let’s try a slightly different form of the same query, one which we know must generate the exact same results.

SELECT ID1, ID2, AString
FROM dbo.TestLikeSearches
WHERE AString IS NOT NULL AND AString LIKE '%21%';

All we have done is to modify the filter in the WHERE clause to ignore NULL values, none of which could contain our search string anyway. Now our execution plan looks like this:

Since a SEEK should in theory be better than a SCAN, we’re hopeful that the query’s speed (elapsed time) is improved also, but we’ll get to that in a moment.

The same small change to an UPDATE converts the INDEX SCAN to a SEEK, as demonstrated by the two queries below.

UPDATE dbo.TestLIKESearches
SET Value = 300
WHERE AString LIKE '%21%';

UPDATE dbo.TestLIKESearches
SET Value = 400
WHERE AString IS NOT NULL AND AString LIKE '%21%';


On a SELECT however, we can show just how easily this SEEK can be broken. Let’s modify our query to also return the Value column.

SELECT ID1, ID2, AString, Value
FROM dbo.TestLikeSearches
WHERE AString IS NOT NULL AND AString LIKE '%21%';


By adding Value into the returned results, we have broken the SEEK; it has reverted to a SCAN.

But there is a way around this. We can use a query hint (FORCESEEK) to restore our INDEX SEEK.

SELECT ID1, ID2, AString, Value
FROM dbo.TestLikeSearches WITH(FORCESEEK)
WHERE AString IS NOT NULL AND AString LIKE '%21%';

Performance Comparison of SCAN vs. SEEK for a LIKE “%string%” Search

The following table summarizes the performance results we got from this method of getting SQL Server to SEEK the INDEX on our string column during the LIKE “%string%” search.

Logical Reads CPU Elapsed Time
Base SELECT Query Without INDEX 33193 2246 628
Base SELECT Query With INDEX 24169 2310 582
Improved SELECT Query (INDEX SEEK) 13659 1513 405
    Improvement 43% 35% 30%
Base UPDATE Query With INDEX 146678 2434 812
Improved UPDATE Query (INDEX SEEK) 136168 1763 546
    Improvement 7% 28% 33%
SELECT with Value (INDEX SCAN) 33193 2620 665
SELECT with Value (FORCESEEK) 136193 1794 455
    Improvement -310% 32% 32%

In all cases, by forcing the SEEK (even when it results in an added Key Lookup) we were able to improve elapsed and CPU times to a measurable degree. Only the FORCESEEK query hint on the SELECT when non-indexed columns are included actually increased the logical IO count (albeit by quite a bit).

Conclusion

Despite the commonly accepted belief that a LIKE “%string%” search is limited to an INDEX SCAN, we have proven that it is possible to make it happen with a properly constructed non-clustered INDEX on the string being searched.

The SEEK is easily obtained for either SELECT or UPDATE, and probably DELETE and/or MERGE as well (although we didn’t test these cases) with just a small additional filtering criteria (excluding NULLs).

The SEEK can also just as easily be broken by including columns in the SELECT that aren’t in our non-clustered INDEX, however even then using a FORCESEEK query hint can restore it.

You’ll be able to read more about the results of this technique and how well it improves performance against a wider range of NULL values in an upcoming article that I have submitted to http://www.sqlservercentral.com/.  You can generally expect that as the percentage of NULL values decreases, the performance gain will not be as much.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Common Table Expressions in SQL

Posted on

In SQL Server 2005, Microsoft introduced the Common Table Expression (CTE).  CTEs share similarities with VIEWS and derived tables, but are really not the same as either.  Oracle SQL also supports CTEs and while the syntax is basically the same, some of the properties that we’ll discuss may be slightly different.

Let’s take a look at a very simple CTE to get started.

CREATE TABLE #Vehicles
(
    VehicleID     VARCHAR(5)
    ,VehicleType  VARCHAR(5)
    ,Location     VARCHAR(3)
    ,PRIMARY KEY (VehicleID)
);

INSERT INTO #Vehicles
VALUES ('12211', 'TRUCK', 'BKK'),('12212', 'CAR', 'BKK'),('12213', 'TRUCK', 'CNX')
    ,('12214', 'CAR', 'CNX'),('12215', 'TRUCK', 'HDY'),('12216', 'CAR', 'HDY');

WITH BKKVehicles AS
(
    SELECT VehicleID, VehicleType, Location
        ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID)
    FROM #Vehicles
    WHERE Location = 'BKK'
)
SELECT VehicleID, VehicleType, Location
FROM BKKVehicles;

Our CTE begins with the keyword WITH and ends at the closing parenthesis.  Below the CTE is what I’ll call the “main query.”  This CTE retrieves only vehicles whose location is BKK and adds a ROW_NUMBER to that result:

VehicleID VehicleType  Location  rn
12211     TRUCK        BKK       1
12212     CAR          BKK       2

There is a widespread belief that CTEs can improve performance, but the truth is they neither improve nor detract from performance.  They are simply a way to make your code more readable, although they do offer a couple of things that may also make your life a bit easier.  Let’s look at some of the rules/properties of a CTE, comparing and contrasting with VIEWs and derived tables where appropriate.

  • You must remember to terminate the statement preceding the CTE with a semicolon, otherwise SQL will throw this error message at you:

Incorrect syntax near the keyword ‘with’. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.

  • You can rename the columns returned by the CTE by providing the column names between parentheses immediately after the name of the CTE (our CTE is named BKKVehicles) and before “AS.”
  • A CTE must contain a SELECT and it may not contain INSERT, UPDATE, DELETE or MERGE statements.
  • CTEs will inherit the indexing of the tables upon which they are based.
  • CTEs are more like a derived table than a VIEW because they exist only for the life of the main query which follows them.  In order to reuse a CTE in a subsequent query, you must resupply the same code to the second query.
  • You can use a CTE as source or target in UPDATE, INSERT, DELETE and MERGE queries, but there are some restrictions.  This is similar to a VIEW.
  • You may have more than one CTE associated with a query.  When more than one CTE is defined, they are referred to as “cascaded” or “stacked” CTEs.  You may not however nest CTEs within CTEs.
  • You may code CTEs within VIEWs, FUNCTIONs or Stored Procedures.
  • You may refer to a CTE more than once in the main query.  Contrast this with a derived table, which if you’d like to use it more than once, must be coded as many times as you need it.
  • You may refer to a CTE in another CTE as long as the CTE being referred to occurs above the CTE that is doing the referring in the CTE stack.
  • CTEs support recursive queries.

CTEs are most often used with SELECT, but you can UPDATE through a CTE as in this example also.

WITH BKKVehicles AS
(
    SELECT VehicleID, VehicleType, Location
        ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID)
    FROM #Vehicles
    WHERE Location = 'BKK'
)
UPDATE BKKVehicles
SET VehicleType = 'VAN'
WHERE Location = 'HDY';

In this example, no rows are updated because the table returned by the CTE does not contain any rows whose location is HDY.

Likewise you can delete through a CTE.

WITH BKKVehicles AS
(
    SELECT VehicleID, VehicleType, Location
        ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID)
    FROM #Vehicles
    WHERE Location = 'BKK'
)
DELETE FROM BKKVehicles
WHERE rn > 1;

SELECT *
FROM #Vehicles;

After the DELETE runs, the rows remaining in our table are:

VehicleID  VehicleType  Location
12211      TRUCK        BKK
12213      TRUCK        CNX
12214      CAR          CNX
12215      TRUCK        HDY
12216      CAR          HDY

This is actually quite a useful method of deleting duplicate rows from a table.

We mentioned that when updating or deleting through a CTE, certain restrictions may apply.   Basically all that means is that the target rows in the target table must be unambiguous.  For example, if you happen to JOIN the target table with another table, the JOIN must be exact (no duplicate rows generated) otherwise the effort will likely fail.  Highly complex queries involving many JOINs, etc. may also confuse the compiler and make it unable to recognize the target table.

Here’s an example of using a CTE as the source for an INSERT, to generate some additional sample rows in our table.  It also demonstrates how you can name the columns generated by the CTE.

WITH MoreRows (VehicleID, VehicleType, Location) AS
(
    SELECT '12218','VAN','BKK'
    UNION ALL SELECT '12219','VAN','CNX'
    UNION ALL SELECT '12220','VAN','HDY'
)
INSERT INTO #Vehicles (VehicleID, VehicleType, Location)
SELECT VehicleID, VehicleType, Location
FROM MoreRows;

A CTE can also act as either source or target tables for a MERGE, but since MERGE is a topic that is deserving of consideration on its own, we’ll do a separate blog entry for that.

Final Remarks

We have demonstrated how a Common Table Expression can be used in SELECT, UPDATE, DELETE and INSERT statements.  CTEs are basically a way to improve the readability of the SQL code you produce, having no impact on their performance.

While we have mentioned that CTEs can be used for recursive queries, we haven’t provided any examples of this because it is quite an advanced topic.  However if you are interested in how to do this, you might want to take a look at Exploring Recursive CTEs by Example.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

The T-SQL ROW_NUMBER() Function

Posted on Updated on

If you learn one new T-SQL (i.e., Microsoft SQL Server) concept today it should be ROW_NUMBER().  Introduced in SQL 2005, this function is one of 4 window ranking functions (the others are RANK(), DENSE_RANK() and NTILE()).  Oracle SQL has a similar capability.

Let’s first create some sample data we can use for a demonstration.

CREATE TABLE #ROWNUMBER_Demo
(
    ID      INT
    ,MyDate DATETIME
    ,Price  MONEY
    ,PRIMARY KEY (ID, MyDate)
);

INSERT INTO #ROWNUMBER_Demo
SELECT 1, '2012-03-04', 23.22
UNION ALL SELECT 1, '2012-03-15', 25.15
UNION ALL SELECT 1, '2012-05-10', 28.47
UNION ALL SELECT 2, '2012-02-28', 15.10
UNION ALL SELECT 2, '2012-03-22', 18.22
UNION ALL SELECT 2, '2012-05-01', 21.43
UNION ALL SELECT 3, '2012-04-01', 45.06
UNION ALL SELECT 3, '2012-05-12', 48.23
UNION ALL SELECT 3, '2012-06-01', 51.66;

SELECT *
FROM #ROWNUMBER_Demo;

The results in our sample table are:

ID   MyDate                    Price
1    2012-03-04 00:00:00.000   23.22
1    2012-03-15 00:00:00.000   25.15
1    2012-05-10 00:00:00.000   28.47
2    2012-02-28 00:00:00.000   15.10
2    2012-03-22 00:00:00.000   18.22
2    2012-05-01 00:00:00.000   21.43
3    2012-04-01 00:00:00.000   45.06
3    2012-05-12 00:00:00.000   48.23
3    2012-06-01 00:00:00.000   51.66

To use the ROW_NUMBER() function you must supply an OVER predicate consisting of PARTITION (optional) and ORDER BY (required).  Let’s take a look at an example and see the results.

SELECT ID, MyDate
    ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate)
    ,rn2=ROW_NUMBER() OVER (ORDER BY MyDate)
FROM #ROWNUMBER_Demo
ORDER BY ID, MyDate;

The results returned are:

ID  MyDate                    rn1   rn2
1   2012-03-04 00:00:00.000   1     2
1   2012-03-15 00:00:00.000   2     3
1   2012-05-10 00:00:00.000   3     7
2   2012-02-28 00:00:00.000   1     1
2   2012-03-22 00:00:00.000   2     4
2   2012-05-01 00:00:00.000   3     6
3   2012-04-01 00:00:00.000   1     5
3   2012-05-12 00:00:00.000   2     8
3   2012-06-01 00:00:00.000   3     9

For rn1 (where PARTITION is applied) you see that it assigns row numbers 1, 2, 3 to the rows within each ID (the column specified to PARTITION on) based on the ordering of MyDate.  For the case without PARTITION (rn2), the entire set is the partition so the row numbers are 1, 2, …, 9, again based on the ordering of the MyDate column.

Eliminating Duplicates

ROW_NUMBER() is a particularly fast way to eliminate duplicate records.  Suppose you want to return only one record within each ID; specifically the one whose date is the latest.  You must note that ROW_NUMBER() cannot be used on the WHERE clause, so it is necessary to wrap this query in an outer query as follows:

SELECT ID, MyDate
FROM
(
    SELECT ID, MyDate
        ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate DESC)
    FROM #ROWNUMBER_Demo
) a
WHERE rn1 = 1
ORDER BY ID;

Note the DESC sort applied to MyDate.  These results are:

ID   MyDate
1    2012-05-10 00:00:00.000
2    2012-05-01 00:00:00.000
3    2012-06-01 00:00:00.000

Of course, you’re probably saying you can achieve the same results using a GROUP BY (and you’d be correct), like this.

SELECT ID, MyDate=MAX(MyDate)
FROM #ROWNUMBER_Demo
GROUP BY ID
ORDER BY ID;

But try using that query to also return the Price column that corresponds to the MAX date.  You cannot!  But you can when you use ROW_NUMBER().

SELECT ID, MyDate, Price
FROM
(
    SELECT ID, MyDate, Price
        ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate DESC)
    FROM #ROWNUMBER_Demo
) a
WHERE rn1 = 1
ORDER BY ID;

Results:

ID     MyDate Price
1      2012-05-10 00:00:00.000    28.47
2      2012-05-01 00:00:00.000    21.43
3      2012-06-01 00:00:00.000    51.66

Conclusion

ROW_NUMBER() is a very versatile T-SQL window ranking function.  Besides using it to eliminate duplicates, it has a great many other very practical purposes that we’ll explore in future entries on this blog.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved