An Even Faster Method of Calculating the Median on a Partitioned Heap


Back in 2013 I wrote an article kindly accepted for publication on the Simple Talk web site called Calculating the Median Value within a Partitioned Set Using T-SQL.  In that article, I was delighted to propose a previously unpublished solution for calculating the median over a partitioned set that did not rely on INDEXing.  In fact, in my tests it was the fastest solution available for that case.

Later on, I was quite pleased to hear that SQL MVP Aaron Bertrand (nominated 18 times as of this writing) had published a comparative study of methods for calculating the median called Best approaches for grouped median, where he kindly included my submission and confirmed that it was the highest-performing solution available for the case of a heap.

Today we’re going to take a second look at the solution I proposed, and offer a couple of interesting possibilities that may be even better for the case of a partitioned heap.

A Heap Example and Some Sample Data to Populate the Table

The following DDL can be used to create a heap, and load it with some partitioned sample data (the ID column is the partition).

CREATE TABLE dbo.MedianValues
(
   ID INT
   ,N INT
);

INSERT INTO dbo.MedianValues
VALUES (1,1),(1,2),(1,3),(1,4),(1,5)
   ,(2,1),(2,2),(2,3),(2,4)
   ,(3,10),(3,2),(3,10),(3,4)
   ,(4,1),(4,5),(4,1),(4,3),(4,3);

From my original article, below are the results you would expect for the median for these four sets of N:

ID    Ordered Sets         Median
1     {1, 2, 3, 4, 5}      3
2     {1, 2, 3, 4}         2.5
3     {2, 4, 10, 10}       7
4     {1, 1, 3, 3, 5}      3

As a reminder, the median is calculated as follows:

  • Order the values for which you want to calculate the median over the partitioning column (ID).
  • If the count of the rows is odd, you take the middle value. So in the example above, you would take the third row for IDs 1 and 4.
  • If the count of the rows is even, you take the average of the middle two rows. In the example above, this means you use the second and third elements in IDs 2 and 3 (the row-position arithmetic is sketched just below this list).
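
To make the middle-row arithmetic concrete, here is a minimal sketch using the same integer-division formulas that appear in the solutions below, showing which row number(s) get picked for an even and an odd count:

-- For a partition with c rows, integer division gives the middle row number(s):
--   odd c:  row (c+1)/2
--   even c: rows (c+1)/2 and 1+c/2
SELECT c
    ,c1 = (c+1)/2
    ,c2 = CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END
FROM (VALUES (4),(5)) a(c);

-- Results: c=4 -> rows 2 and 3; c=5 -> row 3 (c1 = c2)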

My Original Solution to Median for this Case

Aaron commented that my somewhat “elaborate” solution un-pivots the rows of interest using the CROSS APPLY VALUES approach.

WITH Counts AS
(
   SELECT ID, c
   FROM
   (
      SELECT ID, c1=(c+1)/2, c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE 0 END
      FROM
      (
          SELECT ID, c=COUNT(*)
          FROM dbo.MedianValues
          GROUP BY ID
      ) a
   ) a
   CROSS APPLY (VALUES(c1),(c2)) b(c)
)
SELECT ID=a.ID, Median=AVG(0.+b.N)
FROM
(
   SELECT ID, N
      ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
   FROM dbo.MedianValues a
) a
CROSS APPLY
(
   SELECT N
   FROM Counts b
   WHERE a.ID = b.ID AND a.rn = b.c
) b
GROUP BY a.ID;

It is important to note how we convert the integer Ns to a decimal type within the AVG aggregate (by adding 0.), to ensure no integer truncation occurs.  The result set for this on our sample data appears as follows, confirming the correct calculation of the median for each case.

ID   Median
1    3.000000
2    2.500000
3    7.000000
4    3.000000
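
As a quick aside, here is a minimal illustration of why that conversion matters: AVG over an INT column performs integer arithmetic, so the fractional part is lost unless the values are converted first.

SELECT IntegerAvg = AVG(N), ConvertedAvg = AVG(0.+N)
FROM (VALUES (2),(3)) a(N);

-- Results: 2 and 2.500000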

Calculating a Median Using Grouping Sets

While I was playing around with GROUPING SETS the other day, I hit upon the strangest query result, which gave me an idea that relates to calculating medians.  Let’s take a look at a sample of what I found.

SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
    ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
        ELSE (COUNT(*)+1)/2 
        END
FROM 
(
    SELECT ID, ID1=ID
    FROM dbo.MedianValues
) a
GROUP BY GROUPING SETS (ID, ID1)
HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
ORDER BY ID, c;

This query produces a result set consisting of all the partitioning IDs, with one row for IDs 1 and 4 (where c=3) and two rows for IDs 2 and 3 (where c=2 and 3 respectively).  Take a look at that description and compare it to my description of how to calculate a median, and see if you too are not intrigued.

ID   c
1    3
2    2
2    3
3    2
3    3
4    3

Coincidentally, this is exactly what I was trying to achieve by my “elaborate” (according to Aaron) method that used CROSS APPLY VALUES!  One could argue that this is even more elaborate, and I would tend to agree.  So naturally the thought struck me that perhaps I could use this approach to calculate a median.
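
If you are curious how that query keeps exactly those rows, this minimal sketch (same derived table, no HAVING) exposes the raw GROUPING_ID values that the CASE expressions and the HAVING clause key off of:

SELECT ID, ID1
    ,g_ID  = GROUPING_ID(ID)    -- 1 when the row comes from the (ID1) grouping set
    ,g_ID1 = GROUPING_ID(ID1)   -- 1 when the row comes from the (ID) grouping set
    ,c     = COUNT(*)
FROM
(
    SELECT ID, ID1=ID
    FROM dbo.MedianValues
) a
GROUP BY GROUPING SETS (ID, ID1);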

My initial attempts at JOINing these six rows back to the full set with a ROW_NUMBER() added yielded no joy, however, as the result was slower than my original.

Once again inspiration struck and I had the idea to aggregate the GROUPING SETS so that I only had one row for each ID, after which I could do a range (BETWEEN) JOIN.  So my aggregation now looks like this.

SELECT ID, c1=MIN(c), c2=MAX(c) 
FROM
(
    SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
        ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
            ELSE (COUNT(*)+1)/2 
            END
    FROM 
    (
        SELECT ID, ID1=ID
        FROM dbo.MedianValues
    ) a
    GROUP BY GROUPING SETS (ID, ID1)
    HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
) a
GROUP BY ID;

Which produces a set that looks like this:

ID  c1  c2
1   3   3
2   2   3
3   2   3
4   3   3

Playing around with JOINing this back to the original data set, led me to using a CROSS APPLY instead of a JOIN.  So what I ended up with, which calculates the median quite nicely, is the following query.

WITH RowsOfInterest AS
(
    SELECT a.ID, N
    FROM 
    (
        SELECT ID, c1=MIN(c), c2=MAX(c)
        FROM
        (
            SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
                ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
                    ELSE (COUNT(*)+1)/2 
                    END
            FROM 
            (
                SELECT ID, ID1=ID
                FROM dbo.MedianValues
            ) a
            GROUP BY GROUPING SETS (ID, ID1)
            HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
        ) a
        GROUP BY ID
        HAVING COUNT(*) <= 2
    ) a
    CROSS APPLY
    (
        SELECT ID, N
        FROM 
        (
            SELECT ID, N
                ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
            FROM dbo.MedianValues b
        ) b
        WHERE a.ID = b.ID AND b.rn BETWEEN a.c1 AND a.c2
    ) b
)
SELECT ID=a.ID, Median=AVG(0.+N)
FROM RowsOfInterest a
GROUP BY a.ID;

How’s that for elaborate, Aaron?  Initial timings appeared to be pretty decent, but more importantly what I was seeing was a drastically reduced IO count and CPU usage.  Clearly this was a promising vector of attack, although the truth will only get sorted out using SQL Profiler to average the results of multiple runs (the elapsed timings were very close).

One rather annoying note on this solution.  When I first hit upon it, I could have sworn I had something that was consistently faster than my original solution.  Unfortunately due to a PC reboot (pesky Windows updates!) I lost that one and I was never able to quite achieve the speed I was seeing.  This could be an interesting assignment for those so inclined.

Simplifying the Elaborate GROUPING SETS Approach

After thinking a little more about this, I realized it might be possible to greatly simplify this aggregation to produce just the four rows I needed.  Silly of me really not to think of it in the first place, but then my brain may not be wired that way.

So here’s another method to get the median using a simpler initial aggregation to get ID, c1 and c2 just like the query in the prior section.

WITH RowsOfInterest AS
(
    SELECT ID, c=COUNT(*)
    FROM dbo.MedianValues
    GROUP BY ID
)
SELECT ID=ID, Median=AVG(0.+N)
FROM
(
    SELECT a.ID, N
    FROM 
    (
        SELECT ID, c1=(c+1)/2, c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END
        FROM RowsOfInterest
    ) a
    JOIN 
    (
        SELECT ID, N
            ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
        FROM dbo.MedianValues b
    ) b
    ON a.ID = b.ID AND b.rn BETWEEN a.c1 AND a.c2
) a
GROUP BY a.ID;

This time I used a JOIN instead of a CROSS APPLY because initial timings seemed to indicate a slight advantage.  This method also exhibited a serious reduction in IO counts and CPU usage over my original solution.

You’ll note that this method is quite similar to my original approach, except that it doesn’t un-pivot with CROSS APPLY VALUES, nor does it need the GROUPING SETS pre-aggregation of the second solution.  The main difference from the original solution is the method of JOINing, which uses the BETWEEN range selector.

The One Million Row Test Harness

In my original Simple Talk article I developed a one million row test harness that can be used to test a median query, and I have reproduced that below exactly as before.

TRUNCATE TABLE dbo.MedianValues;

DECLARE @N INT = 10000; -- 10,000 = ~1,000,000 rows

WITH Tally (n) AS
(
   SELECT TOP (@N) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
   FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO dbo.MedianValues
SELECT a.n, ABS(CHECKSUM(NEWID()))%@N
-- Always create exactly 100 IDs from the Tally table
FROM (SELECT TOP 100 n FROM Tally) a
-- Generate either an odd or even number of rows within IDs
CROSS APPLY
(
   SELECT TOP (@N-a.n%2) n
   FROM Tally
)b;

This actually generates 999,950 rows into our sample data table.
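
If you’d like to sanity-check the load, a quick count run right after the harness should show 100 distinct IDs and 999,950 total rows:

SELECT IDs = COUNT(DISTINCT ID), TotalRows = COUNT(*)
FROM dbo.MedianValues;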

The SQL Profiler Test

My initial timings using SET STATISTICS IO, TIME ON were producing pretty fine numbers for IO and CPU, but pretty close scores for elapsed times.  So the only way to really be sure which of these three queries is the fastest is to run each batch multiple times (I chose ten) and then average the results.  I won’t include the Profiler test harness here because this is just a blog, and I’m sure that if you’ve read my previous blog on the One Million Row Test Harness, you’re capable of doing this yourself given everything provided above.
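
If you do want a quick way to produce the repeated runs for Profiler to capture, here is one minimal sketch (not my actual harness): assign the results into variables so that returning the result set doesn’t pollute the timings, and let SSMS repeat the batch.

DECLARE @ID INT, @Median NUMERIC(19,6);

-- Paste solution #1, #2 or #3 here, changing its outer SELECT to assign into
-- the variables, e.g. SELECT @ID = a.ID, @Median = AVG(0.+N) ...
GO 10 -- SSMS/sqlcmd batch repetition: runs the batch above ten times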

Here are the raw and averaged SQL Profiler results for the three queries, with the last row being the average of the ten prior.

Solution #1: Original method      Solution #2: Using Grouping Sets     Solution #3: Simplified aggregation
CPU     Reads      Duration       CPU     Reads    Duration            CPU     Reads    Duration
4,943   2,006,286  1,585          1,622   5,843    1,663               2,840   5,674    1,147
4,759   2,006,286  1,638          1,826   5,843    1,663               2,979   5,674    1,106
4,820   2,006,286  1,575          1,715   5,843    1,662               2,682   5,674    1,326
4,836   2,006,286  1,654          1,732   5,843    1,672               2,932   5,674    1,353
4,822   2,006,286  1,608          1,747   5,843    1,620               2,949   5,674    1,190
4,787   2,006,286  1,598          1,701   5,843    1,659               2,729   5,674    1,161
4,788   2,006,286  1,669          1,716   5,843    1,638               2,872   5,674    1,175
5,104   2,006,286  1,574          1,716   5,843    1,538               3,026   5,674    1,244
4,698   2,006,286  1,569          1,700   5,843    1,665               2,840   5,674    1,296
4,632   2,006,286  1,571          1,747   5,843    1,663               2,792   5,674    1,195
4,819   2,006,286  1,604          1,722   5,843    1,644               2,864   5,674    1,219   (average)

It is also instructive to look at a closer comparison of the averages.

Averages
Solution                      CPU      Reads        Duration
#1: Original method           4,819    2,006,286    1,604
#2: Using Grouping Sets       1,722    5,843        1,644
#3: Simplified aggregation    2,864    5,674        1,219
Best Solution                 #2       #3           #3
New #2 % Over/(Under) #1      -64%     -100%        3%
New #3 % Over/(Under) #1      -41%     -100%        -24%

You can see how solution #3 was the best performer in terms of IO (Reads) and Duration (Elapsed MS).  Solution #2 had the edge in CPU and was nearly tied in duration with solution #1.  Overall my take on these results is that they are both better than my original solution, but of the two #3 is probably the best.

The percentage rows indicate that solution #3 used 41% less CPU than solution #1, while solution #2 used 64% less.  Both new solutions pretty much ripped solution #1 apart in terms of IO.  And probably the most important statistic, duration, shows that while solutions #1 and #2 were too close to call, solution #3 was about 24% faster than either.

The Final Word

The two new solutions for calculating the median within partitioned sets on a heap (no indexing on the table) seem to have delivered a slight edge over the previous benchmark case.  It is always gratifying to do something just a little bit better, even when you’re kicking yourself for not thinking of doing it that way in the first place.

Performance testing was performed on:

  • SQL Server 2012 64 bit
  • Windows 7 64 bit with 12GB of memory installed
  • Intel Core i5-3230M CPU @ 2.60 GHz

I did not attempt to test these solutions using either of the INDEXing schemes proposed by Aaron in his comparative article, because my gut tells me that they’re probably no better than the ones that won out in those particular scenarios.  Since this is just a blog, we’ll have to settle today for seeing just this one case.

And to Aaron, if you happen to be reading this blog at some time, my little digs at you calling my solution “elaborate” were all in good fun.  Actually I considered it quite a badge of honor that you picked my solution in your comparative study.

Thanks folks as always for listening.  Hope the solutions you find here and in my other blogs are helpful to you some day.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 08 Apr 2015.  All rights reserved.

An Easter SQL


Since Easter is nearly upon us, I got to thinking once again about how challenging it is to calculate the day that Easter Sunday falls on.  Since Calendar tables and their step-children, Holiday tables, are so interesting, I thought I might write a sequel to my earlier blog on Calendar tables.

There are at least three holidays (not necessarily business holidays) that are related to the calculation of Easter Sunday.  These are:

  • Good Friday is the Christian holiday that is the Friday just prior to Easter Sunday.
  • Ash Wednesday is another Christian holiday (the first day of Lent) that falls precisely 46 days prior to Easter Sunday.
  • Mardi Gras (a.k.a. Fat Tuesday) is a lesser known holiday celebrated by the more robust partyers among us.  I have a special affinity for it as I lived in New Orleans for a few years.

We will show you how to calculate them all and use that method to populate a Holidays lookup table so hopefully you’ll never have to do it again.  Happy Easter!

A Calendar and a Holiday Table

Discussions of Calendar tables in T-SQL are ubiquitous (here’s one by Todd Fifield), differing only in the columns actually stored in them.  We’ll draw upon the one we created in that blog to show you how easy it is to put to use here.

-- Create a Calendar table
SELECT [Date]=ISNULL([Date], 0), [Year]=ISNULL([Year], 0), [YrNN]=ISNULL([YrNN], 0)
    ,[YYYYMM]=ISNULL([YYYYMM], 0), [BuddhaYr], [Month], [Day]=ISNULL([Day], 0)
    ,[WkDNo], [WkDName], [WkDName2], [WkDName3], [JulDay], [JulWk]
    ,[WkNo], [Qtr], [Last], [LdOfMo], [LDtOfMo]
INTO dbo.Calendar 
-- This function can be obtained here: GenerateCalendar FUNCTION
FROM dbo.GenerateCalendar('1990-01-01', 65536);

-- Change column types to be NOT NULL so we can index them
--
-- Note: not needed thanks to the neat trick with ISNULL above
--ALTER TABLE dbo.Calendar ALTER COLUMN [Date] DATETIME NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Year] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Month] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [YYYYMM] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Day] INT NOT NULL;
GO

-- Build some representative INDEXes
ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_pk PRIMARY KEY([Date]);

ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_ix1 UNIQUE NONCLUSTERED([Year], [Month], [Day]);

ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_ix2 UNIQUE NONCLUSTERED([YYYYMM], [Day]);

SELECT TOP 10 [Date], [YYYYMM], [WkDName2], [WkNo], [Last]
FROM dbo.Calendar;

-- Now create a Holidays table, from the same blog
CREATE TABLE dbo.Holidays
(
    FromDate    DATETIME PRIMARY KEY
    ,ToDate     DATETIME
    ,Holiday    VARCHAR(100)
);

We’ve created a Holidays table that we’ll be using to further our sequel to Easter.  There are also a few other things in the above Data Definition Language (DDL) that are worthy of mention.

  • The GenerateCalendar FUNCTION is an in-line, Table Valued Function (iTVF) that appears in my earlier blog (and also in the resources file attached at the end of this article), which can be used in lieu of a Calendar table should you prefer to do so.  Once again, my thanks go out to SQL MVP Jeff Moden for his help optimizing the performance of the calculations done in that FUNCTION.
  • We’ve used a neat trick with ISNULL to set the NULL attribute of the columns in the table we SELECT INTO (when Calendar gets CREATEd) to NOT NULL, so that we don’t need to run the subsequent ALTER statements on those columns (a short demonstration of the trick follows this list).  I included those ALTER statements for your information.  Thanks again to Jeff Moden for showing me that trick (with honorable mention to SQL MCM Wayne Sheffield (MCM=Microsoft Certified Master), who came along just a few days later with the same information).
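
Here is a minimal, self-contained demonstration of that ISNULL trick (illustrative temp tables, not part of the Calendar build):

-- A throwaway source table with a NULLable column
CREATE TABLE #Src (n INT NULL);
INSERT INTO #Src VALUES (1);

SELECT WithIsNull = ISNULL(n, 0)    -- this column is created NOT NULL in the new table
    ,PlainColumn  = n               -- this column remains NULLable
INTO #NullabilityDemo
FROM #Src;

-- Check the nullability that SELECT/INTO assigned to each column
SELECT name, is_nullable
FROM tempdb.sys.columns
WHERE [object_id] = OBJECT_ID('tempdb..#NullabilityDemo');

DROP TABLE #NullabilityDemo;
DROP TABLE #Src;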

Since Easter Sunday is so diabolically difficult to calculate, many reputable sources have created look-up tables such as this one.  Drawing upon that source, we can create the dates for Easter Sunday from 1990 through 2020 in our Holidays table using the following script.

WITH EasterDates (d) AS
(
    SELECT '1990-04-15' UNION ALL SELECT '1991-03-31' UNION ALL SELECT '1992-04-19'
    UNION ALL SELECT '1993-04-11' UNION ALL SELECT '1994-04-03' 
    UNION ALL SELECT '1995-04-16' UNION ALL SELECT '1996-04-07' 
    UNION ALL SELECT '1997-03-30' UNION ALL SELECT '1998-04-12'
    UNION ALL SELECT '1999-04-04' UNION ALL SELECT '2000-04-23' 
    UNION ALL SELECT '2001-04-15' UNION ALL SELECT '2002-03-31' 
    UNION ALL SELECT '2003-04-20' UNION ALL SELECT '2004-04-11'
    UNION ALL SELECT '2005-03-27' UNION ALL SELECT '2006-04-16' 
    UNION ALL SELECT '2007-04-08' UNION ALL SELECT '2008-03-23' 
    UNION ALL SELECT '2009-04-12' UNION ALL SELECT '2010-04-04'
    UNION ALL SELECT '2011-04-24' UNION ALL SELECT '2012-04-08' 
    UNION ALL SELECT '2013-03-31' UNION ALL SELECT '2014-04-20' 
    UNION ALL SELECT '2015-04-05' UNION ALL SELECT '2016-03-27'
    UNION ALL SELECT '2017-04-16' UNION ALL SELECT '2018-04-01' 
    UNION ALL SELECT '2019-04-21' UNION ALL SELECT '2020-04-12'
)
INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate=d, ToDate=d, Holiday='Easter Sunday ' + CAST(YEAR(d) AS CHAR(4))
FROM EasterDates;

Before we add Good Friday, Ash Wednesday and Fat Tuesday to our Holiday table, we’d like to check that the introductory description at least results in the correct day of the week on which the holiday falls.  We can use our Calendar table for this.

SELECT FromDate = a.FromDate+b.[days]
    ,ToDate     = a.FromDate+b.[days]
    ,b.holiday
    ,c.WkDName
FROM dbo.Holidays a
CROSS APPLY
(
    VALUES(-2, 'Good Friday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-46, 'Ash Wednesday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(YEAR(FromDate) AS CHAR(4)))
) b ([days], holiday)
JOIN dbo.Calendar c ON c.[Date] = a.FromDate+b.[days] 
WHERE a.Holiday LIKE 'Easter Sunday%';

This produces results like this (only the first two years are shown):

FromDate                ToDate                  holiday                        WkDName
1990-04-13 00:00:00.000 1990-04-13 00:00:00.000 Good Friday 1990               Friday   
1990-02-28 00:00:00.000 1990-02-28 00:00:00.000 Ash Wednesday 1990             Wednesday
1990-02-27 00:00:00.000 1990-02-27 00:00:00.000 Mardi Gras (Fat Tuesday) 1990  Tuesday  
1991-03-29 00:00:00.000 1991-03-29 00:00:00.000 Good Friday 1991               Friday   
1991-02-13 00:00:00.000 1991-02-13 00:00:00.000 Ash Wednesday 1991             Wednesday
1991-02-12 00:00:00.000 1991-02-12 00:00:00.000 Mardi Gras (Fat Tuesday) 1991  Tuesday

So now we can be confident to insert those holidays into our Holidays table.

INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate = a.FromDate+b.[days]
    ,ToDate     = a.FromDate+b.[days]
    ,b.holiday
FROM dbo.Holidays a
CROSS APPLY
(
    VALUES(-2, 'Good Friday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-46, 'Ash Wednesday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(YEAR(FromDate) AS CHAR(4)))
) b ([days], holiday)
WHERE a.Holiday LIKE 'Easter Sunday%';

A couple of notes on the data types used in our Calendar and Holidays tables:

  • The Date column of the Calendar table was chosen to be DATETIME to retain compatibility with SQL 2005 (corresponding to the GenerateCalendar FUNCTION).  You could just as easily use a DATE datatype if you’re running SQL 2008 or later.
  • The FromDate/ToDate data types are also both DATETIME.  Here it could be considered important to keep the DATETIME data type to record cases where your business offers half-day holidays.  For example, I’ve seen this for Christmas Eve and New Year’s Eve.
  • You can also define a different FromDate than ToDate in the Holidays table to indicate a holiday that spans several days, like the Thailand Songkran holiday example from my blog.

“Since SQL is a database language, we prefer to do look ups and not calculations.”
    -- Joe Celko (prolific author of books on things SQL)

There you go Joe, we’ve honored your preference!  Now we can do lookups of not just Easter Sunday, but also Good Friday, Ash Wednesday and Mardi Gras, the latter potentially being essential to planning your next trip to New Orleans.

The Calculation of Easter Sunday

A somewhat accurate description of the calculation of Easter Sunday is:

“The Sunday following the full Moon which falls on or after the equinox (the Paschal full moon) will give the lawful Easter.”
    -- From the Wikipedia link on Easter Sunday above

I say “somewhat accurate” because there’s all sorts of ifs, buts and caveats in that description, which you can read about in the Wikipedia link, however it will suffice for the time being.

The formal description of the calculation of Easter is known as Computus (no relation to Locutus of Borg), and is provided by Wikipedia at the link shown.  We shall observe this calculation in what follows.  While the formulas given are indicated to be valid for the years 1900 to 2199, since our Calendar table has been populated with the years 1990-2169 we’ll limit our calculations of Easter Sunday to just those years.  Years 1990 through 2020 will be checked against the lookup table we’ve already provided.

Of course, this calculation appears often enough in the literature and implemented in T-SQL (for example: here and here), often using methods I’d prefer to avoid due to performance.  Since our personal mantra is: “No Loops.  No Cursors.  No RBAR. Hoo-uh!” we shall endeavor to live up to that mantra today.  Think about that term “RBAR” and if you’ve never heard of it before, you’ll find it mentioned in the Conclusion to this article.

We’ll start by attempting to replicate the tabular example for calculating the Paschal full moon (PFMD) within the Gregorian calendar shown by Computus on that Wikipedia page.
[Image: the Gregorian calendar (Computus) table for calculating the Paschal full moon date (PFMD), from Wikipedia.]
Note that the algorithm described above is a step-by-step or procedural algorithm. Here is where many SQL implementations fail because they don’t try to convert it to one that is set-based. To do this, you must think of each step in the procedural algorithm as a query, and as you proceed from one step to the next, each new step is another outer layer to your set-based query.

This is easier shown than described, so here is that query for 2013-2032.

SELECT [Year], [Y MOD 19], Addendum
    ,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
    ,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
    ,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
FROM
(
    SELECT [Year], [Y MOD 19], Addendum
        ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
    FROM
    (
        SELECT [Year]
            ,[Y MOD 19] = [Year]%19
            ,Addendum   = CASE [Year]%19 
                            WHEN 5 THEN 29 
                            WHEN 16 THEN 29 
                            WHEN 8 THEN 30 
                            ELSE 0 
                            END 
        FROM dbo.Calendar
        WHERE [Month] = 1 AND [Day] = 1
    ) a
) a
WHERE [Year] BETWEEN 2013 AND 2032;

Note how the Addendum (CASE) implements the first two IFs (second and third rows) of the procedural algorithm. The results from this query for just the first 10 years are:

Year  Y MOD 19  Addendum  PFMD  Month  MonthNo
2013  18        0         27    Mar    3
2014  0         0         14    Apr    4
2015  1         0         3     Apr    4
2016  2         0         23    Mar    3
2017  3         0         11    Apr    4
2018  4         0         31    Mar    3
2019  5         29        18    Apr    4
2020  6         0         8     Apr    4
2021  7         0         28    Mar    3
2022  8         30        16    Apr    4

If we take PFMD and add the first letter of the month column that follows (for example for 2013: 27M) we see exact correspondence with the row from the Computus table (graphic) labeled the “Paschal full moon date,” including for the years not shown in the results set.  So it appears that we have rather simply converted the formula for calculating PFMD from Computus to a SQL script without too much effort.
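
As a worked check of that arithmetic for a single year, take 2015 from the results above: 2015 % 19 = 1, which is not 5, 16 or 8, so the Addendum is 0, and PFMD = 45 - (1*11) % 30 + 0 = 34; since 34 > 31 the outer CASE turns that into 3 Apr.  In T-SQL:

SELECT [Year]   = 2015
    ,[Y MOD 19] = 2015%19                   -- = 1
    ,PFMD       = 45 - (2015%19 * 11)%30;   -- = 34, i.e. 34-31 = 3 Apr after the CASE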

Now we’ll use our Calendar table yet again, along with a little additional magic:

  • Put the pieces of the Paschal full moon date back together into a date using a formula that includes a couple of calls to the T-SQL built-in function DATEADD.
  • Do a CROSS APPLY (explained in Part 1 and Part 2 by SQL MVP Paul White) to pick up the Sunday after the Paschal full moon, which is Easter Sunday.
WITH CalculatePaschalFullMoon AS
(
    -- Original query to calculate the components of the 
    -- Paschal Full Moon date
    SELECT a.[Year], [Y MOD 19], Addendum
        ,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
        ,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
        ,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
        -- Put the date/time pieces together to get a real DATETIME for PFMD
        ,[PFMD2]    =
            DATEADD(day, CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END-1 -- PFMD
                ,DATEADD(month, CASE WHEN PFMD > 31 THEN 4 ELSE 3 END-1   -- [MonthNo]
                    , a.[Date]))
    FROM
    (
        SELECT [Year], [Y MOD 19], Addendum, [Date]
            ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
        FROM
        (
            SELECT [Year]
                ,[Y MOD 19] = [Year]%19
                ,Addendum   = CASE [Year]%19 
                                WHEN 5 THEN 29 
                                WHEN 16 THEN 29 
                                WHEN 8 THEN 30 
                                ELSE 0 
                                END
                ,[Date] 
            FROM dbo.Calendar
            WHERE [Month] = 1 AND [Day] = 1
        ) a
    ) a
)
SELECT a.[Year]
    ,PaschalFullMoon=CAST(a.PFMD2 AS DATE)
    ,EasterSunday
FROM CalculatePaschalFullMoon a
-- Easter Sunday follows the Paschal Full Moon date
-- so pick that up from the Calendar table
CROSS APPLY
(
    SELECT TOP 1 EasterSunday=CAST([Date] AS DATE)
    FROM dbo.Calendar b
    WHERE b.[Date] > a.PFMD2 AND b.WkDName3 = 'Sun'
    ORDER BY [Date]
) b
WHERE [Year] BETWEEN 2013 AND 2032;

 

Now for a quiz on the above code, specifically the nested DATEADD expressions that build [PFMD2]:

Why do we subtract 1 from each of the CASE statements that appear as the second argument to DATEADD?

We’ll give you the answer to this in the Conclusion section of this article, so stay tuned…

Here are the first ten years of the results set from the last T-SQL query shown above:

Year  PaschalFullMoon  EasterSunday
2013  2013-03-27       2013-03-31
2014  2014-04-14       2014-04-20
2015  2015-04-03       2015-04-05
2016  2016-03-23       2016-03-27
2017  2017-04-11       2017-04-16
2018  2018-03-31       2018-04-01
2019  2019-04-18       2019-04-21
2020  2020-04-08       2020-04-12
2021  2021-03-28       2021-04-04
2022  2022-04-16       2022-04-17

Our Computus Wikipedia page also has an inset table presenting the Western calculation of Easter Sunday, which corresponds quite nicely to our calculation of Easter Sunday.

That really wasn’t so hard was it? Now comes the packaging into a tool.

Since we like to keep things clean, we’re going to make this wonderful bit of SQL code into an in-line, table-valued function that we can call anytime we need to calculate Easter Sunday for a specific year.

We’ll then continue on to use our tool to calculate the other holidays by using the known number of days offset, just like we did before with our Easter Sunday lookup table, and insert that result into our Holidays table for years that are missing.

[Image: table of Easter Sunday dates.]
CREATE FUNCTION dbo.CalculateEasterSunday
(
    @Year       INT
)
-- Calculate Easter Sunday for any given year
RETURNS TABLE WITH SCHEMABINDING
AS RETURN

WITH CalculatePaschalFullMoon AS
(
    -- Original query to calculate the components of the 
    -- Paschal Full Moon date
    SELECT a.[Year] --, [Y MOD 19], Addendum
        --,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
        --,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
        --,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
        -- Put the date/time pieces together to get a real 
        -- DATETIME for PFMD
        ,[PFMD2]    =
            DATEADD(day, CASE WHEN PFMD > 31 
                                THEN PFMD-31 
                                ELSE PFMD 
                                END-1           -- PFMD
                ,DATEADD(month, CASE 
                                WHEN PFMD > 31 
                                THEN 4 
                                ELSE 3 
                                END-1           -- [MonthNo]
                    , a.[Date]))
    FROM
    (
        SELECT [Year], [Y MOD 19], Addendum, [Date]
            ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
        FROM
        (
            SELECT [Year]
                ,[Y MOD 19] = [Year]%19
                ,Addendum   = CASE [Year]%19 
                                WHEN 5 THEN 29 
                                WHEN 16 THEN 29 
                                WHEN 8 THEN 30 
                                ELSE 0 
                                END
                ,[Date] 
            FROM dbo.Calendar
            WHERE [Month] = 1 AND [Day] = 1 AND [Year] = @Year
        ) a
    ) a
)
SELECT a.[Year]
    ,PaschalFullMoon=CAST(a.PFMD2 AS DATE)
    ,EasterSunday
FROM CalculatePaschalFullMoon a
-- Easter Sunday follows the Paschal Full Moon date
-- so pick that up from the Calendar table
CROSS APPLY
(
    SELECT TOP 1 EasterSunday=CAST([Date] AS DATE)
    FROM dbo.Calendar b
    WHERE b.[Date] > a.PFMD2 AND b.WkDName3 = 'Sun'
    ORDER BY [Date]
) b;

You can see where I commented out a few lines of code in the FUNCTION that represent intermediate results that our FUNCTION doesn’t really need to retain. Notice also how the filter on the @Year parameter was moved up into the innermost SELECT from the Calendar table, because when using the FUNCTION you’ll be selecting only one row at a time at that point. Calling the iTVF using CROSS APPLY in-lines the code to make this a non-RBAR query.
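
For a quick smoke test, the FUNCTION can also be called stand-alone for a single year:

SELECT [Year], PaschalFullMoon, EasterSunday
FROM dbo.CalculateEasterSunday(2015);

-- Results:
-- Year  PaschalFullMoon  EasterSunday
-- 2015  2015-04-03       2015-04-05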

Now we can use our FUNCTION to check that the dates we previously stored for Easter Sunday in our Holidays table are correct.

SELECT HolidayFromDate  = CAST(a.FromDate AS DATE)
    ,b.EasterSunday
    ,[Check]            = CASE 
                            WHEN a.FromDate = b.EasterSunday 
                            THEN 'MATCHED!' 
                            ELSE 'WRONG!' 
                            END
FROM dbo.Holidays a
CROSS APPLY dbo.CalculateEasterSunday(YEAR(a.FromDate)) b
WHERE Holiday LIKE 'Easter Sunday%';

The first ten years of our results show us all MATCHED! (and so do the others that are not shown).

HolidayFromDate  EasterSunday   Check
1990-04-15       1990-04-15     MATCHED!
1991-03-31       1991-03-31     MATCHED!
1992-04-19       1992-04-19     MATCHED!
1993-04-11       1993-04-11     MATCHED!
1994-04-03       1994-04-03     MATCHED!
1995-04-16       1995-04-16     MATCHED!
1996-04-07       1996-04-07     MATCHED!
1997-03-30       1997-03-30     MATCHED!
1998-04-12       1998-04-12     MATCHED!
1999-04-04       1999-04-04     MATCHED!

So we can be reasonably confident of our Easter Sunday calculation for the years indicated by Computus (1900-2199)!

Using our FUNCTION to Create Lookup Values in our Holidays Table

Now to keep Joe Celko happy (i.e., so we can do lookups instead of calculations), let’s insert the rest of our holidays, including Easter Sunday, Good Friday, Ash Wednesday and Mardi Gras, into our Holidays table, using all the years available to us in our Calendar table.

INSERT dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate, ToDate, holiday
FROM
(
    SELECT FromDate = DATEADD(day, c.days, b.EasterSunday)
        ,ToDate     = DATEADD(day, c.days, b.EasterSunday)
        ,c.holiday
    FROM dbo.Calendar a
    CROSS APPLY dbo.CalculateEasterSunday(a.[Year]) b
    CROSS APPLY
    (
        VALUES (0, 'Easter Sunday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-2, 'Good Friday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-46, 'Ash Wednesday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(a.[Year] AS CHAR(4)))
    ) c ([days], holiday)
    WHERE a.[Month] = 1 AND a.[Day] = 1
) a
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.Holidays x
    WHERE x.FromDate = a.FromDate AND x.ToDate = a.ToDate
);

This added 596 holidays to our table, and I’d guess we’re pretty well set for our lifetimes!

Notice how, in the above query we had to use DATEADD instead of simple arithmetic as we did before when using the DATETIME data type.  This is because simple arithmetic doesn’t work with the DATE data type returned by our CalculateEasterSunday function.
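
A minimal illustration of the difference (dates chosen arbitrarily):

SELECT CAST('2015-04-05' AS DATETIME) - 2;            -- fine: 2015-04-03 00:00:00.000
SELECT DATEADD(day, -2, CAST('2015-04-05' AS DATE));  -- fine: 2015-04-03
-- SELECT CAST('2015-04-05' AS DATE) - 2;             -- fails with an operand type clash error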

Conclusion and What we Learned

While Joe likes lookups, I find calculations in T-SQL to be quite a bit of fun.  In this article (as in my blog on Calendar tables), we got to do a little of both.

If you went ahead and read my blog on Calendar tables in conjunction with this article, I am pleased because you probably learned a lot about how useful they can be, particularly in conjunction with a Holidays table. In this article (in the FUNCTION) we used our Calendar table to find the next Sunday occurring after our calculated date of the Paschal full moon.

Besides learning about the trivialities of the relationship between Easter, Good Friday, Ash Wednesday, and planning for your next visit to New Orleans for Mardi Gras, we’d like to think that you may have learned a few other things as well:

  • You can use simple arithmetic on a DATETIME data type to add or subtract days, however you must use DATEADD to do the same thing with a DATE data type.
  • You now have efficient tools in your toolbox to generate a calendar table (GenerateCalendar) and CalculateEasterSunday, whenever you have a need to use them.  The latter function is also included in the resources file attached at the end of this article.
  • You learned how to CROSS APPLY VALUES to do an UNPIVOT operation, even though I didn’t specifically mention that was what we were doing when we used it (where we created rows with dates for Good Friday, Ash Wednesday and Mardi Gras).  If you haven’t seen that method before you may want to look into that link as well. That is something that will only work in SQL 2008 (or later), but there is an equivalent construct that will work in SQL 2005.
  • We learned how to use ISNULL on a SELECT/INTO operation, to set the attribute of the column created in the target table to NOT NULL.
  • You learned how to think your way through a rather complex, procedural calculation and convert it into set-based code, by wrapping each successive step in another layer of query.
  • You learned my personal SQL mantra, in case you hadn’t seen that before.

In case you’ve been wondering what RBAR stands for, it means “Row By Agonizing Row” and it is a term coined by Jeff Moden which I fortuitously picked up on before I got to know him.

Oh yes, then there’s the quiz I mentioned. There are two cases:

  • If we’re starting with 01 Jan of a year as our date, and the Paschal full moon month is three (meaning it should fall in Mar), if we add three months to 01 Jan it puts us in Apr. So we subtract one month to get us back to where we need to be.
  • If we’re starting with 01 Mar of a year as our date, and the Paschal full moon day within that month is (say) 14 (meaning it should fall on 14 Mar), if we add 14 days to 01 Mar it puts us in 15 Mar. So we subtract one day to get us back to where we need to be.

After all this, I still have no clue what a Paschal full moon is but what the heck, at least I know I can calculate when it is!  That’s all for today folks.  I hope you found this article useful and just a bit of fun.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 01 Apr 2015.  All rights reserved.

Resources for An Easter SQL.doc

The Resources above are stored in a Microsoft Word (TM) document, which contains two sections (separated by a page break):

  • Create GenerateCalendar, Calendar and Holiday Tables and Populate.sql
  • Create CalculateEasterSunday and Add Holidays.sql

Stupid T-SQL Tricks – Part 2: Retrieving Columns Set to the Current DEFAULT CONSTRAINT Value


“Heeeeeere’s Johnny!” – Ed McMahon introducing Johnny Carson for the Tonight Show

If you’re old enough to remember the Tonight Show hosted by Johnny Carson, how can you not remember the iconic Stupid Pet Tricks skits, and the later tributes to those done by David Letterman?

My Stupid T-SQL Tricks are hopefully a bit more useful, but just as amusing as Carson’s Stupid Pet Tricks were!  Presumably everyone knows what a DEFAULT CONSTRAINT on a column of a T-SQL table is.  Let’s first create a table with such a column and populate it with some sample rows.

CREATE TABLE #SampleData
(
    MyPrimaryKey    VARCHAR(10) NOT NULL
    ,MyDataColumn   INT
    ,MyDataColumn2  INT 
    ,CONSTRAINT sd_pk PRIMARY KEY (MyPrimaryKey)
);

ALTER TABLE #SampleData
ADD CONSTRAINT sd_df1
DEFAULT (50) FOR MyDataColumn2;

INSERT INTO #SampleData (MyPrimaryKey, MyDataColumn)
SELECT 'Dwain.C', 41 UNION ALL SELECT 'Joe.C', 54
UNION ALL SELECT 'Chris.M', 44 UNION ALL SELECT 'Jeff.M', 33;

SELECT *
FROM #SampleData;

We can see from the following results that when the column is not specified on INSERT, the default value applies to each row created.

MyPrimaryKey  MyDataColumn  MyDataColumn2
Chris.M       44            50
Dwain.C       41            50
Jeff.M        33            50
Joe.C         54            50

Let’s change a couple of the rows to a different value.

UPDATE #SampleData
SET MyDataColumn2 = 52
WHERE MyPrimaryKey IN ('Dwain.C','Jeff.M');

SELECT *
FROM #SampleData;

So now we have this:

MyPrimaryKey  MyDataColumn  MyDataColumn2
Chris.M       44            50
Dwain.C       41            52
Jeff.M        33            52
Joe.C         54            50

Suppose we want to write a generalized query that returns all rows that are set to the current default value for the column that has a default constraint.  We know that column is MyDataColumn2.

DECLARE @MyColDefault INT;

BEGIN TRANSACTION T1;

UPDATE TOP (1) #SampleData
SET @MyColDefault = MyDataColumn2 = DEFAULT;

ROLLBACK TRANSACTION T1;

SELECT *
FROM #SampleData
WHERE MyDataColumn2 = @MyColDefault;

Returns us these results:

MyPrimaryKey  MyDataColumn  MyDataColumn2
Chris.M       44            50
Joe.C         54            50

Using the TRANSACTION/ROLLBACK, we’ve pretended to set one row back to its DEFAULT value, while saving that DEFAULT value into a local variable which we can then use in the subsequent, filtered SELECT. Using the TOP (1) option on the UPDATE ensures that we’ve only touched a single row to get the DEFAULT value that applies to all rows.

You’ll find that your code is now impervious to some pesky DBA changing the default constraint value for that column.

ALTER TABLE #SampleData
DROP CONSTRAINT sd_df1;

ALTER TABLE #SampleData
ADD CONSTRAINT sd_df1
DEFAULT (52) FOR MyDataColumn2;

DECLARE @MyColDefault INT;

BEGIN TRANSACTION T1;

UPDATE TOP (1) #SampleData
SET @MyColDefault = MyDataColumn2 = DEFAULT;

ROLLBACK TRANSACTION T1;

SELECT *
FROM #SampleData
WHERE MyDataColumn2 = @MyColDefault;

So now we get different results:

MyPrimaryKey  MyDataColumn  MyDataColumn2
Dwain.C       41            52
Jeff.M        33            52

What we learned from these examples:

  • How to create a named DEFAULT constraint for a column.
  • How to change that constraint, by first dropping it and then creating it again.
  • How to write a query to return all rows where the value of a column is set to its current default.

Now how is that for a stupid, T-SQL trick?

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 19 Mar 2015.  All rights reserved.

Lessons Learned from a Poor-performing VIEW


It is “common knowledge” among SQL practitioners that VIEWs don’t perform well.  Recently, during an application’s development, I put this to the test.  I needed to construct a VIEW for a report the system was producing, and it was highly complex.  Let’s take a look at the general structure of the VIEW that I ended up with, along with how it was called by the report.

CREATE VIEW dbo.v_VeryComplex
WITH SCHEMABINDING
AS
WITH aCTE AS
(
    SELECT something
    FROM dbo.a_table a
    JOIN dbo.b_table b ON a.key = b.key
)
SELECT col1, col2, col3
-- many columns including a few calculated ones
FROM
(
    SELECT col, col2, col3  -- plus many more
    FROM dbo.table1 a
    JOIN dbo.table2 b ON a.key = b.key
    -- plus about 14 more JOINs and CROSS APPLYs
    -- among them the aCTE defined above and at 
    -- least one call to a schema-bound, in-line
    -- table valued function
    WHERE a.col = 'abc'
) a;
GO
-- The report calls it thusly 
SELECT col1, col2, col3 -- plus many more
FROM v_VeryComplex
WHERE col2 = 'cde' AND col3 = 'def'
ORDER BY sortcol1, sortcol2;

When I checked the Execution Plan for calling this VIEW (as the report does) during development, I was quite happy to see that every table it hit did an INDEX SEEK. So all of my JOINs, etc. were in order and it looked like it would happily perform with swiftness. And while I was on-site for the Go Live of the application, it most certainly performed with alacrity.

Then about four months later comes the dreaded call from the user. My beloved VIEW with all of those SEEKs was taking 30-40 seconds to return its results.

My first reaction was to be incredulous of course. Perhaps the report rendering to the browser was taking a long time. So I set about to testing it. I captured a query from the report using SQL Profiler and then ran it on the production server, where by this time some 12,000+ transactions had been entered into Table2 (from the VIEW above). Table1 is a configuration table and is quite small.

Sure enough it was dog-slow, returning in about 40 seconds. Not good.

A quick check of the execution plan on the production server showed that I was now suddenly getting an INDEX SCAN on Table2, which meant that the VIEW was doing all of its work on all of the transactions in that table, including a whack of complex calculations, summaries of detail records in child tables, etc.  The final filtering on the VIEW was no longer being pushed down into an INDEX SEEK like it used to be!

A little more about the application so you can better appreciate my position. First of all, the application itself is mission critical for this customer. It has to work, and all of the work it processes is quite time sensitive. The report itself isn’t used for every transaction in the table, in fact probably less than 20-25% use this particular report to be processed. Nonetheless, those transactions are just as time-sensitive as all the rest. And because it involves some governmental reporting, it must be 100% accurate. Inaccuracies in the calculated amounts could lead to fines to the customer.

With more than a little trepidation, I set about to rewriting the VIEW in the hopes that I could remove the outer derived table (a) to perhaps return my call to the VIEW to doing an INDEX SEEK on Table2. Fortunately, I had at my disposal a local copy of an older version of the production database that exhibited the same problem – an INDEX SCAN on Table2.

After not a large amount of effort, I was able to do that rewrite and get my INDEX SEEK!  So I happily created the new VIEW on the production server to continue my testing, to ensure the accuracy of my results.  The way I did that was to run the original VIEW, followed by EXCEPT, followed by the results from the new VIEW.  Since I was running both VIEWs against all rows in Table2, this should highlight any discrepancies (after checking first that both returned the same number of rows).  The result of that test was no rows returned, establishing with a very high degree of certainty that my rewrite was producing the correct results.  I was nearly beside myself with glee!
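
The comparison test described above looked something like this sketch (v_VeryComplex_Rewrite is a hypothetical name for the rewritten VIEW; columns follow the structure shown earlier):

-- Row counts should match first
SELECT COUNT(*) FROM dbo.v_VeryComplex;
SELECT COUNT(*) FROM dbo.v_VeryComplex_Rewrite;

-- Then any difference in content shows up as returned rows (expecting zero)
SELECT col1, col2, col3 -- plus many more
FROM dbo.v_VeryComplex
EXCEPT
SELECT col1, col2, col3 -- plus many more
FROM dbo.v_VeryComplex_Rewrite;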

Alas, that emotional state was short-lived because when I checked the execution plan of the call to the new VIEW in production, I found that I was back to the INDEX SCAN on Table2.

In fact, when I modified the filtering criteria to return the single transaction it was intended to return, I got virtually the same execution time. Not good!

What’s a performance evangelist like me to do? I certainly couldn’t go back to the client and tell him there’s nothing that can be done.

So I did a quick re-evaluation of the situation. My first thought was “perhaps I can INDEX the VIEW.”

That was not an option because:

  • You can’t INDEX a VIEW that uses common table expressions (CTE).
  • While I could modify the VIEW and move that CTE into a derived table, I was still left with several CROSS APPLYs, and you can’t INDEX a VIEW that uses APPLY either!

Rewriting all of those CROSS APPLYs into JOINs was simply not an appealing prospect to me.

Another rather annoying aspect of INDEXing the VIEW, were I able to accomplish it, is that the VIEW is schema-bound and it calls that pesky schema-bound, in-line table valued function (iTVF) mentioned in the comments of the code shown above.  There is already a specific deployment hierarchy that must be followed whenever one of the lowest-level schema-bound objects (like that iTVF) needs changing, and it is a relative pain.  It involves dropping the VIEW and re-creating it after altering the iTVF.  If I were to INDEX the VIEW, I’d also need to remember to recreate that INDEX any time I had to deploy anything in that hierarchy.  Not good either.

Then it hit me like a 2×4 to the head! When the report calls my VIEW, it does it for a single transaction only. So that gave me the solution.

I could copy all of the original code from the VIEW and instead create a schema-bound iTVF.  That would look something like this.

CREATE FUNCTION dbo.VeryComplexFcn
(
    @parm1      VARCHAR(10)
    ,@parm2     VARCHAR(10)
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH aCTE AS
(
    SELECT something
    FROM dbo.a_table a
    JOIN dbo.b_table b ON a.key = b.key
)
SELECT col1, col2, col3
-- many columns including a few calculated ones
FROM
(
    SELECT col, col2, col3  -- plus many more
    FROM dbo.table1 a
    JOIN dbo.table2 b ON a.key = b.key
    -- plus about 14 more JOINs and CROSS APPLYs
    -- among them the aCTE defined above and at 
    -- least one call to a schema-bound, in-line
    -- table valued function
    WHERE a.col = 'abc' AND a.col2 = @parm1 AND b.col3 = @parm2
) a;
GO
-- The report calls the iTVF like this 
SELECT col1, col2, col3 -- plus many more
FROM dbo.VeryComplexFcn ('cde', 'def')
ORDER BY sortcol1, sortcol2;

It was in short order that I deployed this new iTVF to the production server and checked the execution plan, to find that I was back to my beloved INDEX SEEK!

The query elapsed time calling the iTVF instead of the VIEW for a single transaction was impressively improved.

New (iTVF):
 SQL Server Execution Times:
 CPU time = 62 ms, elapsed time = 64 ms.
Old (VIEW):
 SQL Server Execution Times:
 CPU time = 35194 ms, elapsed time = 44423 ms.

I’m keeping the VIEW around for posterity’s sake, and of course in the event that we ever need to run it for multiple transactions.

Now I’ve just got to wait for the development team to modify the report to call the iTVF instead of the VIEW, so we can re-deploy the application. I am left with the same hierarchical deployment constraints, but those I can live with because I won’t need to remember to recreate an INDEX on a VIEW every time (SQL Server will remind me about everything else).

The lesson to be learned here is not to be beguiled by execution plans you see during development. Those can change as transactions accumulate. Even though we generated an impressive number of transactions during the development (I had some automated SQL scripts that did this), they were insufficient to reach that critical threshold where the execution plan changed.

Stupid T-SQL Tricks – Part 1: Logarithms


Not just for math geeks, logarithms and their inverse functions (the exponentials) can have useful and often unexpected capabilities in T-SQL.

    “Heeeeeere’s Johnny!” – Ed McMahon introducing Johnny Carson for the Tonight Show

If you’re old enough to remember the Tonight Show hosted by Johnny Carson, how can you not remember the iconic Stupid Pet Tricks skits, and the later tributes to those done by David Letterman?

While Carson’s animal antics were meant to entertain, my Stupid T-SQL Tricks are meant to be thought provoking and also add a tool or two to your T-SQL tools chest in the event that the need arises. So maybe they aren’t so stupid after all (and besides it’s my blog and I like the title). This will be the first of a series of blogs offering two to four specialty tips that are all somehow related. Today’s tricks will all utilize logarithms to do their magic.

Calculate a Factorial

Statisticians will recognize the factorial, a number that is represented as N!, where n=0, 1, 2, 3, … and for example 4! = 1*2*3*4 = 24.  In the somewhat distant past I had the chutzpah to suggest that it could be done with a recursive Common Table Expression (CTE), although in my article Exploring Recursive CTEs by Example I wasn’t really advocating that approach.  It was more like I was trying to help folks better understand how recursive CTEs work.  Nowadays, I would never do it that way because there is something better.

Let’s start with a Tally table that generates numbers from 1 to 10. In our query below, we’ll SUM those numbers.

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT SUM(n)
FROM Tally;

-- Results:
55

Too bad there isn’t a PROD() built-in function in T-SQL. Or is there? Consider logarithms and how they can convert a SUM() into a PROD() using a few simple calls to the right built-in functions in the proper sequence. Let’s try these two queries:

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT EXP(SUM(LOG(n)))
FROM Tally;

SELECT 1*2*3*4*5*6*7*8*9*10;

-- Results:
3628800.00000001
3628800

Those two results are remarkably close, the first one of course being slightly imprecise due to the fact that LOG and EXP are both returning a FLOAT data type. Since we know a factorial number must be an integer, we can convert it as follows, so now it returns the exact integer value of interest (3628800).

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT CAST(ROUND(EXP(SUM(LOG(n))), 0) AS BIGINT)
FROM Tally;

While it is unlikely that you’ll often need to calculate factorials in T-SQL, you may occasionally need to calculate a product across multiple rows of a column, and this technique will work for that as well (see the sketch below).
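
For example, here is a minimal sketch of a grouped product over a column using the same EXP(SUM(LOG())) identity (illustrative data; note that LOG requires values greater than zero, so zeroes or negatives would need special handling):

SELECT ID, Product = CAST(ROUND(EXP(SUM(LOG(N))), 0) AS BIGINT)
FROM (VALUES (1,2),(1,3),(2,4),(2,5)) a(ID, N)
GROUP BY ID;

-- Results:
-- 1     6
-- 2     20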

The Running Product

Using SQL 2012, we can use this same technique to calculate a running product (similar to a running total) across multiple rows in a table. To do this, we’ll use a window frame.

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n
    ,CAST(ROUND(EXP(
        SUM(LOG(n)) OVER
            (
                ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            )
        ), 0) AS BIGINT)
FROM Tally;

-- Results:
1      1
2      2
3      6
4      24
5      120
6      720
7      5040
8      40320
9      362880
10     3628800

Be careful though! These numbers are going to grow really quickly, so hopefully you won’t be doing this over many, many rows.

Coded, Numeric Fields with and without Leading Zeroes

In many applications you’ll find columns that are coded as numbers, but are stored as characters. This is, of course, a best practice when you don’t expect to be doing arithmetic with the contents. Let’s use a bit of a contrived example in the guise of a Sales Order Details table:

CREATE TABLE #SO_Details
(
    SO_Number           VARCHAR(11)
    ,SO_Detail          VARCHAR(4)
    ,Quantity           INT
    ,Price              MONEY
    ,Extended_Price     AS (Quantity*Price)
);

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '1', 3, 15.20),('2013SO00001', '2', 3, 9.40)
    ,('2013SO00001', '3', 1, 11.50),('2013SO00001', '4', 2, 11.55)
    ,('2013SO00001', '5', 2, 14.30),('2013SO00001', '6', 10, 13.32)
    ,('2013SO00001', '7', 5, 19.42),('2013SO00001', '8', 6, 10.15)
    ,('2013SO00001', '9', 4, 12.15);

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

-- Results:
SO_Number     SO_Detail Quantity   Price   Extended_Price
2013SO00001   1         3          15.20   45.60
2013SO00001   2         3          9.40    28.20
2013SO00001   3         1          11.50   11.50
2013SO00001   4         2          11.55   23.10
2013SO00001   5         2          14.30   28.60
2013SO00001   6         10         13.32   133.20
2013SO00001   7         5          19.42   97.10
2013SO00001   8         6          10.15   60.90
2013SO00001   9         4          12.15   48.60

We’ve used a computed column to calculate Extended_Price as Quantity * Price. Let’s suppose that we now need to insert an additional SO_Detail row.

BEGIN TRANSACTION T1;

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '10', 2, 18.88);

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

ROLLBACK TRANSACTION T1;

-- Results:
SO_Number     SO_Detail Quantity   Price   Extended_Price
2013SO00001   1         3          15.20   45.60
2013SO00001   10        2          18.88   37.76
2013SO00001   2         3          9.40    28.20
2013SO00001   3         1          11.50   11.50
2013SO00001   4         2          11.55   23.10
2013SO00001   5         2          14.30   28.60
2013SO00001   6         10         13.32   133.20
2013SO00001   7         5          19.42   97.10
2013SO00001   8         6          10.15   60.90
2013SO00001   9         4          12.15   48.60

Oh my! Those weren’t the results we wanted because our new record (SO_Detail=10) is sorted into the wrong spot! Of course, this can easily be addressed by changing the ORDER BY so that it CASTs the SO_Detail column to INT, but then that’s why I said this example is a bit contrived. Note that we rolled back the transaction so that our new record 10 was not really saved in the table.

We could use our newfound knowledge of logarithms to address this. In our prior example, we used a natural log, but in this example we’ll use a base 10 logarithm. Let’s take a look at a property of base 10 logarithms that we can use to add a leading zero to each of the entries from 1 to 9. To show this property, we’ll fall back to our Tally table.

WITH Tally(n) AS
(
    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n, LOG10(n), FLOOR(LOG10(n))
    ,POWER(10, 1+FLOOR(LOG10(n)))
    ,MAX(n) OVER (PARTITION BY NULL)
FROM Tally
WHERE n IN(1, 9, 10, 11, 99, 100);

-- Results:
1     0                   0    10     100
9     0.954242509439325   0    10     100
10    1                   1    100    100
11    1.04139268515823    1    100    100
99    1.99563519459755    1    100    100
100   2                   2    1000   100

Let’s say that we know that our range of SO_Detail is 1 through 9. In that case, we can use a formula like the following to prepend exactly the right number of leading zeroes (each is shown for the number 9).

SELECT RIGHT(10+9, 1), RIGHT(100+9, 2), RIGHT(1000+9, 3);

-- Results:
9     09   009

The numbers 10, 100 and 1000, as well as the 1, 2 and 3, can be derived directly from the results of the prior query. Let’s put this all together.

WITH Tally(n) AS
(
    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n, RIGHT(POWER(10, 1+FLOOR(LOG10(m))) + n, 1+FLOOR(LOG10(m)))
FROM
(
    SELECT n, m=MAX(n) OVER (PARTITION BY NULL)
    FROM Tally
) a
WHERE n IN (1, 9, 10, 11, 99, 100);

-- Results:
1      001
9      009
10     010
11     011
99     099
100    100

You see, our new formula using RIGHT not only converts the numbers to character strings; it also prepends exactly the correct number of zeroes, so the result sorts correctly as a character string.
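
A trivial check (illustrative only) shows why the padding matters: as unpadded strings, '9' sorts after '10', but once both values are padded to the same width the character sort agrees with the numeric sort.

SELECT Unpadded=CASE WHEN '9' < '10' THEN 'sorts correctly' ELSE 'sorts incorrectly' END
    ,Padded=CASE WHEN '09' < '10' THEN 'sorts correctly' ELSE 'sorts incorrectly' END;

-- Results:
Unpadded            Padded
sorts incorrectly   sorts correctly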

Let’s return to our SO Details table and see how we can apply this new formula to normalize our detail line numbers.

BEGIN TRANSACTION T1;

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '10', 2, 18.88);

WITH SO_Details AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
        ,m=MAX(CAST(SO_Detail AS INT)) OVER (PARTITION BY NULL)
    FROM #SO_Details
    WHERE SO_Number = '2013SO00001'
)
UPDATE SO_Details
SET SO_Detail = RIGHT(POWER(10, 1+FLOOR(LOG10(m))) + SO_Detail, 1+FLOOR(LOG10(m)));

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

ROLLBACK TRANSACTION T1;

-- Results:
SO_Number     SO_Detail  Quantity   Price   Extended_Price
2013SO00001   01         3          15.20   45.60
2013SO00001   02         3          9.40    28.20
2013SO00001   03         1          11.50   11.50
2013SO00001   04         2          11.55   23.10
2013SO00001   05         2          14.30   28.60
2013SO00001   06         10         13.32   133.20
2013SO00001   07         5          19.42   97.10
2013SO00001   08         6          10.15   60.90
2013SO00001   09         4          12.15   48.60
2013SO00001   10         2          18.88   37.76

Now our results are sorted as expected. Once again we have rolled back the transaction so we are left with only 9 rows in our #SO_Details table and single digit SO_Detail values.

Let’s complicate our example just a bit by adding a table of transactions we’d like to process, consisting of inserts, updates and deletes, and see if we can do the whole thing with a MERGE. First, the transactions table and some sample transactions.

CREATE TABLE #SO_Trans
(
    [action]            CHAR(1)
    ,SO_Number          VARCHAR(11)
    ,SO_Detail          VARCHAR(4)
    ,Quantity           INT
    ,Price              MONEY
);

INSERT INTO #SO_Trans
    (
        [action], SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('U', '2013SO00001', '5', 15, 12.23)
    ,('I', '2013SO00001', NULL, 4, 16.15)
    ,('I', '2013SO00001', NULL, 3, 12.22)
    ,('D', '2013SO00001', '3', NULL, NULL);

SELECT *
FROM #SO_Trans;

-- Results:
action SO_Number     SO_Detail   Quantity   Price
U      2013SO00001   5           15         12.23
I      2013SO00001   NULL        4          16.15
I      2013SO00001   NULL        3          12.22
D      2013SO00001   3           NULL       NULL

For INSERTs represented in the transactions table, we do not need an SO_Detail number. For DELETEs we do not need Quantity or Price. For UPDATEs we need all three. In order to do a MERGE, we need to consider the following:

  • What is the target table?
  • What is the source table?
  • What are the matching criteria and what operations are performed on which kinds of matches?

The target table is easy, although it is not simply the #SO_Details table, because we only want to include SOs that appear in the #SO_Trans table, so we’ll start with that:

-- Target table
SELECT SO_Number, SO_Detail, Quantity, Price
FROM #SO_Details
WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans);

We’ll put the target table into a CTE for use in the MERGE. Coincidentally the results are the 9 original rows we put into the table, because there is only one SO_Number in the #SO_Trans table. We do need to include all rows for that SO because any of them may be subject to an UPDATE to change the SO_Detail number.

MERGE throws an exception if you try to update any row in the target table more than once, so we need to carefully construct the source table so that there is one and only one matching row for each row in the target.
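
To see why this matters, here is a minimal, self-contained sketch (throw-away table variables, purely for illustration) of a MERGE whose source matches the same target row twice; it fails at run time with an error along the lines of “The MERGE statement attempted to UPDATE or DELETE the same row more than once.”

-- Hypothetical throw-away tables, only to demonstrate the duplicate-match error
DECLARE @Target TABLE (ID INT, Val INT);
DECLARE @Source TABLE (ID INT, Val INT);

INSERT INTO @Target VALUES (1, 10);
INSERT INTO @Source VALUES (1, 20), (1, 30);    -- two source rows match target ID=1

MERGE @Target t
USING @Source s
ON s.ID = t.ID
WHEN MATCHED THEN
    UPDATE SET Val = s.Val;     -- raises the "same row more than once" error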

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
)
SELECT [action]=ISNULL(b.[action], 'U')
    ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
    ,a.SO_Detail
    ,Quantity=ISNULL(b.Quantity, a.Quantity)
    ,Price=ISNULL(b.Price, a.Price)
FROM Target4MERGE a
FULL JOIN #SO_Trans b
ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail;

-- Results:
action  SO_Number    SO_Detail  Quantity  Price
I       2013SO00001  NULL       4         16.15
I       2013SO00001  NULL       3         12.22
U       2013SO00001  1          3         15.20
U       2013SO00001  2          3         9.40
D       2013SO00001  3          1         11.50
U       2013SO00001  4          2         11.55
U       2013SO00001  5          15        12.23
U       2013SO00001  6          10        13.32
U       2013SO00001  7          5         19.42
U       2013SO00001  8          6         10.15
U       2013SO00001  9          4         12.15

Notes about these results:

  • The two detail lines to be INSERTed have a NULL SO_Detail number because we have not yet determined what it is to be. We’ll be trying to add them at the end.
  • Note the Quantity and Price (15 and 12.23) for SO_Detail=5, which are the values we want to UPDATE.

The key to the matching will be to construct a ROW_NUMBER() that we can use as the new SO_Detail number. So let’s add that:

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
),
    PreSource AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price
        ,r=ROW_NUMBER() OVER
            (
                PARTITION BY SO_Number
                    ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                ORDER BY CASE
                         WHEN SO_Detail IS NULL
                         THEN 10000
                         ELSE SO_Detail
                         END
            )
    FROM
    (
        SELECT [action]=ISNULL(b.[action], 'U')
            ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
            ,a.SO_Detail
            ,Quantity=ISNULL(b.Quantity, a.Quantity)
            ,Price=ISNULL(b.Price, a.Price)
        FROM Target4MERGE a
        FULL JOIN #SO_Trans b
        ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail
    ) a
)
SELECT [action], SO_Number, SO_Detail, Quantity, Price, r
    ,m=MAX(r) OVER
        (
            PARTITION BY SO_Number, CASE [action] WHEN 'D' THEN 0 ELSE 1 END
        )
FROM PreSource;

-- Results:

action  SO_Number    SO_Detail  Quantity  Price  r   m
D       2013SO00001  3          1         11.50  1   1
U       2013SO00001  1          3         15.20  1   10
U       2013SO00001  2          3         9.40   2   10
U       2013SO00001  4          2         11.55  3   10
U       2013SO00001  5          15        12.23  4   10
U       2013SO00001  6          10        13.32  5   10
U       2013SO00001  7          5         19.42  6   10
U       2013SO00001  8          6         10.15  7   10
U       2013SO00001  9          4         12.15  8   10
I       2013SO00001  NULL       4         16.15  9   10
I       2013SO00001  NULL       3         12.22  10  10

Additional comments on the latest query and results:

  • The PARTITION we created for our ROW_NUMBER() separates the record to be deleted from the remainder.
  • In the ORDER BY for our ROW_NUMBER() we substituted 10000 (one more than the largest detail line number that will fit in our VARCHAR(4) SO_Detail column) to make the INSERTed records sort to the bottom.
  • We enclosed this query in a CTE, because we also want to use the MAX window aggregate function.
  • For that, the PARTITION is identical to what we used for ROW_NUMBER().
  • The result in the r column (except for the [action]=’D’ row) is the new SO_Detail number.
  • The result in m will be used when we convert our integers to characters with leading zeroes (a quick check of that conversion follows this list).
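
As a quick check of that conversion (illustrative only): with 10 surviving detail lines, 1+FLOOR(LOG10(10)) works out to a width of 2, and the RIGHT formula pads every new line number to exactly that width.

SELECT r, NewDetail=RIGHT(POWER(10, 1+FLOOR(LOG10(10))) + r, 1+FLOOR(LOG10(10)))
FROM (VALUES(1),(2),(9),(10)) a(r);

-- Results:
r    NewDetail
1    01
2    02
9    09
10   10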

All that’s left is to construct the final MERGE statement, so here it is.

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
),
    PreSource AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price
        ,r=ROW_NUMBER() OVER
            (
                PARTITION BY SO_Number
                    ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                ORDER BY CASE
                         WHEN SO_Detail IS NULL
                         THEN 10000
                         ELSE SO_Detail
                         END
            )
    FROM
    (
        SELECT [action]=ISNULL(b.[action], 'U')
            ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
            ,a.SO_Detail
            ,Quantity=ISNULL(b.Quantity, a.Quantity)
            ,Price=ISNULL(b.Price, a.Price)
        FROM Target4MERGE a
        FULL JOIN #SO_Trans b
        ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail
    ) a
),
    Source4MERGE AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price, r
        ,m=1+FLOOR(LOG10(
            MAX(r) OVER
                (
                    PARTITION BY SO_Number
                        ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                )))
    FROM PreSource
)
MERGE Target4MERGE t
USING Source4MERGE s
ON s.SO_Number = t.SO_Number AND s.SO_Detail = t.SO_Detail
WHEN MATCHED AND [action] = 'D'
THEN DELETE
WHEN MATCHED AND [action] = 'U'
THEN UPDATE
SET SO_Detail = RIGHT(POWER(10, m) + s.r, m)
    ,Quantity = s.Quantity
    ,Price    = s.Price
WHEN NOT MATCHED AND [action] = 'I'
THEN INSERT 
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES 
    (
        s.SO_Number
        ,RIGHT(POWER(10, m) + s.r, m)
        ,s.Quantity
        ,s.Price
    );

SELECT *
FROM #SO_Details;

-- Results:
SO_Number    SO_Detail  Quantity  Price   Extended_Price
2013SO00001  01         3         15.20   45.60
2013SO00001  02         3         9.40    28.20
2013SO00001  03         2         11.55   23.10
2013SO00001  04         15        12.23   183.45
2013SO00001  05         10        13.32   133.20
2013SO00001  06         5         19.42   97.10
2013SO00001  07         6         10.15   60.90
2013SO00001  08         4         12.15   48.60
2013SO00001  09         4         16.15   64.60
2013SO00001  10         3         12.22   36.66

Now that’s one heck of a complicated looking query and it could probably do with some comments, but you can get those as needed from the descriptions I’ve provided. A few additional notes are warranted.

  • We applied our 1+FLOOR(LOG10()) formula around the MAX() window aggregate function to save a bit of complexity when assigning SO_Detail in the MERGE.
  • The MERGE uses MATCHED twice, segregating DELETEs from UPDATEs based on [action].
  • The NOT MATCHED clause INSERTs the new rows specified by the #SO_Trans table.

Conclusions

You may be asking why not skip all of this computational convolution and just always use a 4-digit SO_Detail value with as many leading zeroes as necessary, since that’s pretty easy with something like:

SELECT RIGHT('0000'+CAST(SO_Detail AS VARCHAR(4)), 4);

The answer is, yes we could have done that if our business requirements allowed for always having 4 digits in the SO_Detail column. But frankly, this was much more fun and we got to learn a little about logarithms along the way. And I did warn you by calling this article Stupid T-SQL Tricks, now didn’t I?

Still, we may have learned a few things: how to calculate factorials, how to create a pseudo-PROD() aggregate, how base 10 logarithms can be used to find the number of digits in a number and how to methodically construct a complex MERGE query one step at a time. All in all, that probably made it worthwhile for me to post this blog.

Perhaps it wasn’t quite as entertaining as Johnny Carson’s Stupid Pet Tricks skits, but we hope you enjoyed it anyway.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

Using the T-SQL MERGE Statement

Posted on Updated on

In SQL Server 2008, Microsoft added a new SQL query type: the MERGE statement. This flexible query provides the ability to perform INSERTs, UPDATEs and even DELETEs all within a single statement. Used in combination with Common Table Expressions (CTEs), this can be a powerful tool to replace multiple SQL queries under the right circumstances.

One important rule to keep in mind when using MERGE is that the statement must be terminated by a semicolon (;).

Case Study 1: A Simple Upsert

The most common usage of the MERGE statement is to perform what is colloquially called an “upsert,” which is really just a contraction of UPDATE and INSERT. Without further preamble, let’s set up some test data and get to seeing how the MERGE statement can simplify your life.

CREATE TABLE #Master
(
    [key]       INT IDENTITY PRIMARY KEY
    ,name       VARCHAR(10)
);

INSERT INTO #Master
VALUES ('Dwain'),('Jeff'),('Paul')
    ,('Anon'),('Ralph'),('Tom'),('Sally');

CREATE TABLE #Staging
(
    [key]       INT PRIMARY KEY
    ,[NewName]  VARCHAR(10)
);

INSERT INTO #Staging
VALUES (2, 'Bob'),(4, 'Jim'),(6, 'Marvin'), (10, 'Buddy');

SELECT * FROM #Master;
SELECT * FROM #Staging;

The results in the two tables as displayed by the SELECT are:

key name
1   Dwain
2   Jeff
3   Paul
4   Anon
5   Ralph
6   Tom
7   Sally

key NewName
2   Bob
4   Jim
6   Marvin
10  Buddy

Our intention is to update the row in #Master whose [key] matches a [key] in #Staging. If a [key] in #Staging matches none of the [key] values in #Master, then we insert a new row. The new row’s [key] does not need to match the value in the staging table. We can easily do this with a MERGE statement as follows:

MERGE #Master t
USING #Staging s
ON s.[key] = t.[key]
WHEN MATCHED THEN 
    UPDATE SET name = s.[NewName]
WHEN NOT MATCHED THEN 
    INSERT (name) VALUES (s.[NewName]); 

SELECT * FROM #Master;

The final SELECT result is as follows:

key name
1   Dwain
2   Bob
3   Paul
4   Jim
5   Ralph
6   Marvin
7   Sally
8   Buddy

You can see that Bob replaced Jeff, Jim replaced Anon and Marvin replaced Tom, and also that Buddy was added at the end.

The way the statement works is as follows:

  • The table name immediately after the MERGE keyword is the target table, in this case #Master, which we have aliased as t for easy understanding.
  • The USING table is the source, so #Staging will be merged into #Master.
  • The ON keyword represents the matching criteria between the records in the two tables. You should not think of this in the same way that ON appears after a JOIN as it operates quite differently.
  • Following those parts of the statement, are any number of WHEN clauses. The MATCHED criterion indicates a match based on the ON criteria. It can be combined with additional matching criteria if required.
  • NOT MATCHED (implied as BY TARGET), means that when a source row does not exist in the target table, we’re going to do something.
  • Following MATCHED or NOT MATCHED is the keyword THEN followed by either an INSERT or an UPDATE.

You can also use DELETE (instead of UPDATE or INSERT). If you’d like to learn how to DELETE rows from the target table, I suggest you read this article to understand exactly how it works and the potential dangers when using it: A Hazard of Using the SQL Merge Statement.
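
For reference, here is a bare-bones, self-contained sketch (throw-away table variables, not the tables from our case studies) showing all three WHEN clauses together, including the WHEN NOT MATCHED BY SOURCE THEN DELETE form discussed in that article:

-- Illustration only: MERGE with all three WHEN clauses
DECLARE @Target TABLE ([key] INT PRIMARY KEY, name VARCHAR(10));
DECLARE @Source TABLE ([key] INT PRIMARY KEY, name VARCHAR(10));

INSERT INTO @Target VALUES (1, 'Dwain'), (2, 'Jeff'), (3, 'Paul');
INSERT INTO @Source VALUES (2, 'Bob'), (4, 'Jim');

MERGE @Target t
USING @Source s
ON s.[key] = t.[key]
WHEN MATCHED THEN                   -- key exists in both: update it
    UPDATE SET name = s.name
WHEN NOT MATCHED THEN               -- key only in the source: insert it
    INSERT ([key], name) VALUES (s.[key], s.name)
WHEN NOT MATCHED BY SOURCE THEN     -- key only in the target: delete it (use with care!)
    DELETE;

SELECT * FROM @Target;              -- expect keys 2 (Bob) and 4 (Jim); keys 1 and 3 are deleted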

Case Study 2: A More Complicated MERGE

Suppose we have the following sample table and data:

CREATE TABLE #ItemTest
(
    ID INT NOT NULL
    ,LineID INT NOT NULL
    ,ProductID INT NULL
    ,PRIMARY KEY(ID, LineID)
); 

INSERT INTO #ItemTest (ID, LineID, ProductID)
SELECT 100, 1, 5 
UNION ALL SELECT 100, 2, 15 
UNION ALL SELECT 100, 3, 8 
UNION ALL SELECT 100, 4, 25 
UNION ALL SELECT 200, 1, 11 
UNION ALL SELECT 200, 2, 100 
UNION ALL SELECT 200, 3, 41 
UNION ALL SELECT 200, 4, 10 
UNION ALL SELECT 200, 5, 5 
UNION ALL SELECT 200, 6, 30
UNION ALL SELECT 300, 1, 20;

SELECT *
FROM #ItemTest;

From the final SELECT, we see that our data appears as follows:

ID   LineID ProductID
100  1      5
100  2      15
100  3      8
100  4      25
200  1      11
200  2      100
200  3      41
200  4      10
200  5      5
200  6      30
300  1      20

Notice how the entries for each ID contain a sequentially numbered LineID (1 to 4 for ID=100 and 1 to 6 for ID=200). Our business requirement is that we need to delete some rows and at the same time preserve the row numbering for LineID without introducing any gaps. So for example, if we need to delete LineID=3 from ID=100, we need to renumber LineID=4 for that ID to be LineID=3.

Ignoring for the moment that it’s probably poor application design to have this row renumbering requirement, this can be accomplished with a MERGE. Since it is a bit more complicated we’ll develop it in a couple of steps to help you understand. First, let’s say we want to delete three rows. We’ll put those into a table variable (a feature introduced in SQL Server 2000).

DECLARE @RowsToDelete TABLE
(
    ID      INT
    ,LineID INT
    ,PRIMARY KEY (ID, LineID)
);

INSERT INTO @RowsToDelete (ID, LineID)
SELECT 100, 3 UNION ALL SELECT 200, 2 UNION ALL SELECT 200, 4;

Note how we can create a PRIMARY KEY on a table variable. While not needed in this case, if you had lots of rows it would improve the performance of what we’re about to do.

Now we’ll construct the following query which will require some explanation:

SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID
    ,rn=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))
    ,XX=NULLIF(a.LineID, b.LineID)
FROM #ItemTest a
LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID;

XX is included only to illustrate what NULLIF is doing for us. This produces the following results:

ID   LineID ProductID LineID2  rn   XX
100  3      8         3        1    NULL
100  1      5         NULL     2    1
100  2      15        NULL     3    2
100  4      25        NULL     4    4
200  2      100       2        1    NULL
200  4      10        4        2    NULL
200  1      11        NULL     3    1
200  3      41        NULL     4    3
200  5      5         NULL     5    5
200  6      30        NULL     6    6
300  1      20        NULL     1    1

Each row from #ItemTest is returned because it is the left table of the LEFT JOIN. Matching rows from our @RowsToDelete table variable have a value in LineID2, while rows not matched have a value of NULL (exactly how you’d expect the LEFT JOIN to work). The result in XX shows us that when the LineID of #ItemTest matches the LineID of @RowsToDelete, we get a NULL, and NULL values sort first in an ascending sort in SQL Server. So in each case, the rows we want to delete are sorted to the top of the grouping (on ID).
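
If NULLIF is unfamiliar, it returns NULL when its two arguments are equal and its first argument when they are not, as this one-line check (illustrative only) shows:

SELECT Equal=NULLIF(3, 3), NotEqual=NULLIF(3, 4);

-- Results:
Equal   NotEqual
NULL    3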

For the 3 rows in our @RowsToDelete table, we have 1 for ID=100 and 2 for ID=200 (these counts are easy enough to obtain in SQL). So what happens if we subtract that count from rn?

WITH CountItemsToDelete (ID, c) AS
(
    SELECT ID, COUNT(*)
    FROM @RowsToDelete
    GROUP BY ID
)
SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID
    ,[rn-c]=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))-c
FROM #ItemTest a
LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID
JOIN CountItemsToDelete c ON a.ID = c.ID;

The results now appear as:

ID   LineID   ProductID LineID2  rn-c
100  3        8         3        0
100  1        5         NULL     1
100  2        15        NULL     2
100  4        25        NULL     3
200  2        100       2        -1
200  4        10        4        0
200  1        11        NULL     1
200  3        41        NULL     2
200  5        5         NULL     3
200  6        30        NULL     4

Note how the row for ID=300 has been eliminated by the INNER JOIN to our Common Table Expression (CTE) CountItemsToDelete. Looking at the [rn-c] column, we see that for rows where LineID2 is not NULL, the value is meaningless. But for rows where LineID2 is NULL, [rn-c] is precisely the final row number we’ll need to assign to LineID after deleting the rows we want to delete! Now we have enough information to write this into a MERGE statement:

WITH CountItemsToDelete (ID, c) AS
(
    SELECT ID, COUNT(*)
    FROM @RowsToDelete
    GROUP BY ID
), 
    SourceItems AS
(
    SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID, c
        ,rn=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))
    FROM #ItemTest a
    LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID
    JOIN CountItemsToDelete c ON a.ID = c.ID
)
-- The target table
MERGE #ItemTest t
-- The source table
USING SourceItems s
-- Matching criteria: lines up rows from SourceItems exactly with rows
-- from our target table (except for ID=300 which is not in the source) 
ON t.ID = s.ID AND s.LineID = t.LineID
-- LineID2 is not NULL for rows we need to delete
WHEN MATCHED AND s.LineID2 IS NOT NULL THEN
    DELETE
-- LineID2 is NULL for rows where we've calculated the new line number
WHEN MATCHED AND s.LineID2 IS NULL THEN
    UPDATE SET LineID = rn-c;

SELECT *
FROM #ItemTest;

The results shown in the final SELECT clearly indicate that this MERGE query has satisfied our business requirement.

ID   LineID   ProductID
100  1        5
100  2        15
100  3        25
200  1        11
200  2        41
200  3        5
200  4        30
300  1        20

To further improve the performance of the query, you can change the second MATCHED criteria to this, to avoid updating rows where the LineID isn’t changing.

WHEN MATCHED AND s.LineID2 IS NULL AND t.LineID <> rn-c THEN

To do this otherwise in SQL, you would first need to DELETE the unwanted rows and then run a separate UPDATE to renumber the rows that remain.
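
For comparison, that two-statement alternative might look something like the following sketch (same tables as above; I have not timed it against the MERGE):

-- Step 1: delete the unwanted rows
DELETE a
FROM #ItemTest a
JOIN @RowsToDelete b
ON a.ID = b.ID AND a.LineID = b.LineID;

-- Step 2: renumber the survivors so each ID's LineIDs are sequential again
WITH Renumber AS
(
    SELECT LineID
        ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY LineID)
    FROM #ItemTest
)
UPDATE Renumber
SET LineID = rn
WHERE LineID <> rn;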

Summary of our Learning

Today we have learned about the MERGE query and how it can be used to replace multiple queries in the case of:

  • UPDATE/INSERT (the “Upsert”)
  • DELETE/UPDATE

I encourage you to also read the linked article about the hazards present if you utilize the full capability of the MERGE (WHEN NOT MATCHED BY SOURCE THEN DELETE). The article shows a simple way of limiting the scope of the DELETE to avoid the hazard.

We also learned about SQL table variables, which are very handy under certain circumstances. They do have their drawbacks though, and some day we may blog on the differences between them and temporary tables.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved

The One Million Row T-SQL Test Harness

Posted on

So far in our blogs, we have talked a bit about performance, but today we’re going to show you a way you can confirm without doubt that you’re writing high-performance T-SQL code. This extremely valuable technique is something Developers and Testers alike should be familiar with.

Why one million (1M) rows? Well, the answer to that question is really three-fold.

  • First of all, 1M rows represent sort of a threshold, where you can start to consider your tables as being “big.”
  • Second, if your query handles 1M rows in a reasonably short period of time, it means that it will probably do OK even if the table grows much larger (which is what we hope, since growth is usually a sign of how popular our application is).
  • Finally, depending on what you are testing, the 1M row test may make the difference between deciding on the right query (the one that is the fastest) and the wrong query, because at 1M rows it usually separates the winners from the losers quite clearly.

Along the way we’re going to learn about a variety of ways that test data can be generated. Unfortunately we won’t be able to cover every possible case. We’ll need to leave some of them to your imagination. Once you understand the basics, you should be able to extend the concept to other types of test data relatively quickly.

Case Study 1: Verify Which Formula is Fastest

Suppose you have a table that contains an INTEGER ID number column. Your application requires that when that number is shown on a form, it displays with leading zeroes. There are many ways to do this but we’ll look at three.

DECLARE @Num INT = 2342;

SELECT @Num
    ,RIGHT('000000'+CAST(@Num AS VARCHAR(7)),7)         -- Method 1
    ,RIGHT(10000000+@Num, 7)                            -- Method 2
    ,STUFF(@Num, 1, 0, REPLICATE('0', 7-LEN(@Num)));    -- Method 3

Each of these returns the same result (0002342) and will work for any positive integer <= 9999999. But the question is: which one is faster?

To answer that question we’ll construct a 1M row test harness based on the Numbers table we created in our blog on Tally Tables. We’ll also show an example of one of those methods using an in-line Tally table (from the same blog), so you can see the difference between using a permanent vs. an in-line tally table.

DECLARE @NumResult VARCHAR(7);

PRINT 'Method 1:';
SET STATISTICS TIME ON;
SELECT @NumResult = RIGHT('000000'+CAST(N AS VARCHAR(7)),7)
FROM dbo.Numbers
WHERE N <= 1000000;
SET STATISTICS TIME OFF;

PRINT 'Method 2:';
SET STATISTICS TIME ON;
SELECT @NumResult = RIGHT(10000000+N, 7)
FROM dbo.Numbers
WHERE N <= 1000000;
SET STATISTICS TIME OFF;

PRINT 'Method 3:';
SET STATISTICS TIME ON;
SELECT @NumResult = STUFF(N, 1, 0, REPLICATE('0', 7-LEN(N)))
FROM dbo.Numbers
WHERE N <= 1000000;
SET STATISTICS TIME OFF;

PRINT 'Method 2 w-in line Tally Table:';
SET STATISTICS TIME ON;
WITH Tally(N) AS
(
    SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT @NumResult = RIGHT(10000000+N, 7)
FROM Tally
WHERE N <= 1000000;
SET STATISTICS TIME OFF;

-- Results:
Method 1:
SQL Server Execution Times:
CPU time = 265 ms, elapsed time = 274 ms.

Method 2:
SQL Server Execution Times:
CPU time = 250 ms, elapsed time = 250 ms.

Method 3:
SQL Server Execution Times:
CPU time = 452 ms, elapsed time = 460 ms.

Method 2 w-in line Tally Table:
SQL Server Execution Times:
CPU time = 234 ms, elapsed time = 227 ms.

You should always run the test harness a few times and record the results of each run. By looking at the results, we see that methods 1 and 2 are pretty close, but over the 4-5 runs that I did, method 2 was consistently just a little faster in elapsed time. Using an in-line Tally table in this case was just a little faster than using the permanent Tally table (that may not always be the case).

You may be saying to yourself that this little 9% improvement doesn’t mean much, but picture it in the context of a much more complex query, where every slight improvement you can make counts for something.

Let’s now look at some key aspects of the test harness we used:

  • It contains multiple queries that return the identical results for the same number of rows. This is important because you’re trying to compare solutions.
  • We SET STATISTICS TIME ON before each query and OFF when it was complete. You could have turned them ON once at the start and OFF at the end, but that’s not always going to be the case.
  • We printed out a description of the method we are using just before setting STATISTICS ON. That’s the reason we turned them OFF after each query; so you wouldn’t see the statistics for the PRINT statement, which is immaterial.
  • Finally, we created a local variable @NumResult and assigned our calculated result (the returned columns) to it. This is important to eliminate the time that SQL Server Management Studio (SSMS) would otherwise take to render the results to the Results pane. If you don’t do that, it can bias the results. We’re interested in clocking the raw query speed here.

CPU time can be important sometimes, so you may also want to look at that. It turns out that method 2 also appears better than method 1 in terms of CPU time, but that may not always be the case and over the 4-5 runs we did it often was a tie.

After this test, we can say pretty clearly that method 2 was the highest performance method among the ones we looked at. We are also now well on our way to being able to say definitively that we’ve written the best possible query to solve our business problem.

Case Study 2: Removing Numbers from a Character String

In this case study, we will illustrate a few additional concepts you should know when constructing a test harness:

  • How to construct random character strings of any length.
  • Using an alternate means to capture the elapsed time of the queries we’re comparing.
  • Using an alternative to a local variable to avoid rendering results to SSMS.

The business problem is that we wish to remove numbers from character strings that contain only letters and numbers. For this task, we have written two T-SQL FUNCTIONs that basically do the job but need to be called differently.

-- Scalar Valued Function that removes characters based on a pattern match
CREATE FUNCTION dbo.CleanString
    (@pString VARCHAR(8000), @pPattern VARCHAR(100))
RETURNS VARCHAR(8000) AS
BEGIN

DECLARE @Pos SMALLINT;
SELECT @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

WHILE @Pos > 0
    SELECT @pString = STUFF(@pString, @Pos, 1, '')
        ,@Pos = PATINDEX(@pPattern, @pString COLLATE Latin1_General_BIN);

RETURN @pString;

END
GO

-- In-line, schema-bound Table Valued Function
CREATE FUNCTION dbo.RemoveMatchedPatterns
(
    @Str        VARCHAR(8000)
    ,@Pattern   VARCHAR(100)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN

WITH Tally(n) AS
(
    SELECT TOP (ISNULL(LEN(@Str), 0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
),
SplitString AS
(
    SELECT n, s=SUBSTRING(@Str, n, 1)
    FROM Tally
    WHERE PATINDEX(@Pattern, SUBSTRING(@Str COLLATE Latin1_General_BIN, n, 1)) = 0
)
SELECT ReturnedString=
    (
        SELECT s + ''
        FROM SplitString b
        ORDER BY n
        FOR XML PATH(''), TYPE
    ).value('.', 'VARCHAR(8000)');
GO

For now, it is not necessary to fully understand how these FUNCTIONs work, but it is necessary to understand how to use them. Note that the CleanString FUNCTION uses a WHILE loop to do its work.

DECLARE @TestString VARCHAR(8000) = '123122adflajsdf34a23aa333w';

SELECT TestString=@TestString
    ,SVFReturns=dbo.CleanString(@TestString, '%[0-9]%')
    ,iTVFReturns=ReturnedString
FROM dbo.RemoveMatchedPatterns(@TestString, '%[0-9]%');

-- Results:
TestString                   SVFReturns     iTVFReturns
123122adflajsdf34a23aa333w   adflajsdfaaaw   adflajsdfaaaw

Both FUNCTIONs have the same call signature, which includes the pattern of characters you want removed, and both return the same results (only the alphabetic characters).

Before we proceed to generate a test harness containing lots of random strings we can test, it is necessary to familiarize you with a way to generate random numbers in T-SQL. So let’s consider the following statement:

SELECT IntRN=1+ABS(CHECKSUM(NEWID()))%100
    ,RNwDec=.01*(1+ABS(CHECKSUM(NEWID()))%10000);

-- Results:
IntRN  RNwDec
31     40.74

If you run this statement multiple times, you’ll get different results each time. IntRN is an integer between 1 and 100, while RNwDec falls between 0.01 and 100.00 and carries 2 decimal digits. This is the standard method in SQL to generate a Uniform Random Number (URN). If you need a wider range of numbers, change 100 (or 10000 for RNwDec) to something larger.
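
If you need the range to be parameter-driven, the same formula generalizes easily; here is a small sketch (@Lo and @Hi are just illustrative local variables):

DECLARE @Lo INT = 50, @Hi INT = 75;

-- Uniform random integer between @Lo and @Hi inclusive
SELECT URN=@Lo + ABS(CHECKSUM(NEWID())) % (@Hi - @Lo + 1);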

To generate a random string of characters, you can use the URN formula as follows:

SELECT REPLICATE(
        SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1)
            ,1+ABS(CHECKSUM(NEWID()))%20) +
    REPLICATE(
        SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1)
            ,1+ABS(CHECKSUM(NEWID()))%20);

You can try this one yourself. Each time you get a different string consisting of 1 to 20 letters (always the same letter repeated) and 1 to 20 numbers (always the same number repeated). The string will be of varying length between 2 and 40 characters. To get a string of maximum length 8000, all you need to do is replicate the pattern a random number of times. Since the inner string can be up to 40 characters, the final (outer) replication should be performed up to 200 times (40 * 200 = 8000), so it can be done like this:

SELECT REPLICATE(
    REPLICATE(
        SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1)
            ,1+ABS(CHECKSUM(NEWID()))%20) +
    REPLICATE(
        SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1)
            ,1+ABS(CHECKSUM(NEWID()))%20)
            ,1+ABS(CHECKSUM(NEWID()))%200);

Now that we understand the tools we’ll need, here is the test harness.

-- Create 1000 rows of random strings
SELECT s=REPLICATE(
    REPLICATE(
        SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1)
            ,1+ABS(CHECKSUM(NEWID()))%20) +
    REPLICATE(
        SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1)
            ,1+ABS(CHECKSUM(NEWID()))%20)
            ,1+ABS(CHECKSUM(NEWID()))%200)
INTO #TestStrings
FROM dbo.Numbers
WHERE N <= 1000;

DECLARE @StartDT DATETIME = GETDATE();

SELECT SVFReturns=dbo.CleanString(s, '%[0-9]%')
INTO #Test1
FROM #TestStrings;

-- Display elapsed time for the scalar-valued User-defined Function (UDF)
SELECT SVFElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

SELECT @StartDT = GETDATE();
SELECT iTVFReturns=ReturnedString
INTO #Test2
FROM #TestStrings
CROSS APPLY dbo.RemoveMatchedPatterns(s, '%[0-9]%');

-- Display elapsed time for the in-line Table Valued Function (iTVF)
SELECT iTVFElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());
GO

DROP TABLE #TestStrings;
DROP TABLE #Test1;
DROP TABLE #Test2;

For this test, we’ll illustrate the fact that sometimes you don’t need to go up to 1M rows to distinguish the difference between two queries. In this case, it becomes quite apparent at 1000 rows. Here are the timings that are displayed by the two SELECT statements:

SVFElapsedMS
26290

iTVFElapsedMS
7933

That’s already a pretty large difference so you can imagine how long it would take to run at 1M rows. Here’s an explanation of the differences between this test harness and the one from before:

  • We ran at 1000 rows of test data instead of 1M (by filtering N from our Numbers table with WHERE N <= 1000).
  • Instead of assigning the results to a local variable, we instead SELECT … INTO a temporary table. Since both queries absorb the same overhead for this operation, the results are still comparable.
  • Instead of using SET STATISTICS TIME ON/OFF, we’ve simply used the @StartDT local variable to capture the elapsed time (calculated using DATEDIFF).

The latter method of capturing elapsed time is used because of a quirky behavior of STATISTICS in some cases when you are timing a scalar-valued, user-defined function. SQL MVP Jeff Moden explains this in How to Make Scalar UDFs Run Faster.

This example also serves to demonstrate the well-known fact that a good set-based query will almost always be faster than a loop. If you remember our introduction to DelimitedSplit8K in our Tally Tables blog, Jeff Moden uses the same basic methodology (an in-line Tally table) to make DelimitedSplit8K extremely fast.

Case Study 3: Timing Solutions for Running Totals

In this example, we’re going to perform a test on two solutions to the Running Totals (RT) problem in T-SQL. The second solution is only valid in SQL 2012. First, we’ll set up a test table and populate it with some random data.

CREATE TABLE dbo.RunningTotalsTest
(
    [Date]          DATETIME PRIMARY KEY
    ,Value          INT
    ,RunningTotal1  INT
    ,RunningTotal2  INT
);

WITH SomeDates AS
(
    SELECT d=DATEADD(hour, N, '2010-01-01')
    FROM dbo.Numbers
    WHERE N <= 10000
)
INSERT INTO dbo.RunningTotalsTest([Date], Value)
SELECT d, 1+ABS(CHECKSUM(NEWID()))%100
FROM SomeDates;
GO

We have populated our table with a series of date values that increases in an hourly fashion; including a “Value” that is simply a random integer between 1 and 100. We’ll be calculating the two RunningTotal columns from our Value. Note that at the end of this “batch” we have included the batch separator (GO). The test harness includes only 10,000 rows because we know in advance that this will be a sufficient number to distinguish between our solutions (1M rows is still recommended for most normal cases).

The first running totals solution we’ll look at is what is known as a triangular JOIN, because for each row it adds up all of the prior rows using a correlated sub-query. Once again, notice the batch separator (GO) at the end of the batch.

-- RT by Triangular JOIN
UPDATE a
SET RunningTotal1 =
    (
        SELECT SUM(value)
        FROM dbo.RunningTotalsTest b
        WHERE b.[Date] <= a.[Date]
    )
FROM dbo.RunningTotalsTest a;
GO

The next solution, which only works in SQL Server 2012, is a new facility Microsoft has kindly provided to us for calculating running totals (and a host of other things).

-- RT with SQL 2012 window frame
WITH RunningTotal AS
(
    SELECT [Date], Value, RunningTotal2
        ,rt=SUM(value) OVER (ORDER BY [Date] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM dbo.RunningTotalsTest
)
UPDATE a
SET RunningTotal2 = rt
FROM RunningTotal a;
GO

-- Final SELECT
SELECT *
FROM dbo.RunningTotalsTest;

We’ve also included a final SELECT (in the last batch) to show that both running totals were calculated correctly.

To run this code and obtain our timings, we’re going to use SQL Server Profiler (found in SSMS under the Tools menu). Launching it brings up a window that allows you to name the trace if desired.

After you click Run, the Profiler trace will begin.

You can now execute the four batches of T-SQL we created above, two of which contain the solutions of interest. Once the run is complete, the trace will show a BatchCompleted row for each batch; the two rows of interest are the ones for our two running totals solutions.

Notice how the comment we placed at the beginning of each batch clearly shows up on the BatchCompleted lines with our desired results:

-- RT by Triangular JOIN
-- RT with SQL 2012 window frame

The results show that the new SQL 2012 method for running totals completed in only 117 milliseconds, while the triangular JOIN took 11267 milliseconds. Imagine what the triangular JOIN approach would have taken had we run against 1M rows, or better yet imagine a customer waiting on an application’s web form for that result to be displayed!

The first ten rows of results displayed show that both of our running totals solutions worked correctly, yet the timing results tell us that they are definitely not equivalent!

Date                    Value   RunningTotal1  RunningTotal2
2010-01-01 01:00:00.000   63    63             63
2010-01-01 02:00:00.000   75    138            138
2010-01-01 03:00:00.000   2     140            140
2010-01-01 04:00:00.000   27    167            167
2010-01-01 05:00:00.000   73    240            240
2010-01-01 06:00:00.000   71    311            311
2010-01-01 07:00:00.000   17    328            328
2010-01-01 08:00:00.000   64    392            392
2010-01-01 09:00:00.000   40    432            432
2010-01-01 10:00:00.000   56    488            488

SQL Profiler is a very useful way to time a batch that contains multiple SQL statements (like if you want to test performance of a CURSOR vs. a set-based solution). Take care when setting up each batch to avoid unnecessary overhead in one batch vs. the other.

In a future blog, we’ll describe a method for calculating running totals that works in any version of SQL that is faster than both of these solutions.

Summary and What We Learned

Firstly and most importantly, we’ve learned how to create a one million row test harness so we can compare the performance of two queries that return identical results, and why this is an essential step to verifying query performance.

We’ve learned how to generate random test data:

  • Using Tally tables
  • Using a formula to generate random numbers within a specified range
  • Generating random character strings

We’ve learned three methods to time the queries we’re comparing as they process the test harness:

  • Using SET STATISTICS TIME ON/OFF
  • Using an elapsed time calculator and displaying the results in a SELECT
  • Using SQL Profiler

Some hints to follow when using the 1M row test harness (not all of which are mentioned):

  • After creating your test harness, start out with a lower row count than a million. Once you’re done debugging the test harness, ramp up the row count until one of the solutions clearly distinguishes itself from the others.
  • Try to run each test multiple times at the highest row count you need to prove the result and then average the timings.
  • Avoid using SET STATISTICS ON/OFF when comparing code that includes a call to a Scalar-valued, User-defined FUNCTION. Use one of the other techniques instead (Profiler works fine in this case).
  • You can compare 3-4 solutions just as easily as two using any of these methods. The more solutions you have to the same problem, the better chance you’ll have of identifying the best-performing one.
  • There are cases where using DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS will improve the accuracy of your measurement; however this is a topic that is more advanced than we planned to explore in this blog (a short sketch of those commands follows this list).
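
For completeness, clearing the caches before a timing run looks like the sketch below. Only ever do this on a development or test server, never in production.

-- Development/test servers only: flush dirty pages, then clear the buffer pool
-- and the plan cache so each timing run starts cold.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;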

Additional reading:

  • Just about any article that SQL MVP Jeff Moden writes for the SQL Server Central web site provides an example of a 1M row test harness to test his proposed solution against alternatives, but these two articles are specifically directed to this subject.

    • Generating Test Data: Part 1 – Generating Random Integers and Floats
    • Generating Test Data: Part 2 – Generating Sequential and Random Dates

Now you’re ready to start learning how to write high-performance T-SQL code, because now you have a tool that is essential to proving that you’re writing queries that are the best that they can be!

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved