The Best Tool for Comparing Whether Two T-SQL Queries Produce Identical Results

There is one software tool that should be in every software developer’s tool chest.  That is knowledge of the platform on which you are developing.  With all the hype that surrounds every new software tool that comes out, it is important to recognize the tools you already have at hand, and the ones that may be hidden gems within your development platform of choice.

Let’s backtrack for a moment into how a T-SQL query is developed.  Remember my prior blog: Make it Work, Make it Fast, Make it Pretty?  That blog (based on the sagacious and immortal words of SQL MVP Jeff Moden) suggests there’s three steps.  In reality there are more, but for the time being let’s focus on two of the three mentioned in that statement.

  • Make it Work – you understand the requirements, you understand the physical and logical models in which the data is stored, you write a query that returns hundreds of rows of results and you painstakingly check that the results indeed match the requirements, and are as you intended, based on the underlying data stored in the tables.
  • Make it Fast – once your query is returning the desired results, at some point in time you may be confronted by the dreaded statement by somebody that “it isn’t fast enough.” Time to break out those performance tuning tools.

Let’s hope that the dreaded “it isn’t fast enough” pronouncement doesn’t come after the query is already in production, but that is probably the most likely scenario.

While I’m into doing the Make it Work part, I have a tendency to rely on known, high-performance code patterns to develop a query which, under most circumstances should probably be of sufficient performance.  Maybe I’m having an off day, or maybe my underlying data changes in such a way that the SQL Optimizer simply starts making bad choices for choosing an execution plan (see: Lessons Learned from a Poor-performing View).  Maybe the darned thing is very, very complex and you’re pushing up against tight delivery timelines, so you just need to get it done.  Whatever the case, there will come a time where you’ll need to rewrite a query to improve its performance.

This is where it might get a little tricky.  Suppose that the working query is highly complex, perhaps complex to the point that you’re afraid to touch it for fear of breaking it.  I’ve seen that happen a lot in cases where you weren’t the original author of the query.  Perhaps you don’t even know all of the outlying test cases that were run through it to prove that it worked correctly in the first place.  Let’s face it, documentation isn’t always all it’s cracked up to be, and easily could be lost with time anyway.

You’ll need a tool that allows you to compare the results of the working query, against any alternative(s) that you may need to construct to improve on performance.

Enter the T-SQL EXCEPT Set Operator

Well, first I suppose we need to have a couple of SQL queries to compare.  Let’s draw upon an old article of mine kindly published by the fine but eccentric editor of Simple Talk entitled: The SQL of Gaps and Islands in Sequences.

-- A solution of my own design for Gaps
WITH Islands AS 
(
    SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1
    FROM (
        SELECT ID, SeqNo
            ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)
        FROM dbo.GapsIslands) a
     GROUP BY ID, rn
)
SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo)
FROM (
    SELECT ID, SeqNo
        ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2
    FROM Islands
    CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a
GROUP BY ID, m
HAVING COUNT(*) = 2
ORDER BY ID, StartSeqNo;

-- A Gaps Solution from SQL MVP Deep Dives
SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1
FROM (
    SELECT ID, cur=SeqNo, nxt=(
        SELECT MIN(B.SeqNo)
        FROM dbo.GapsIslands AS B
        WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo)
    FROM dbo.GapsIslands AS A) AS D
WHERE nxt - cur > 1;

Both of these solutions find gaps in a table containing a column of sequence numbers.  You won’t really need to run these two queries to follow along with the discussion, but if you want to try some of the things we’ll discuss that article has all the scripts you’ll need in its resources section (at the top).

Suppose that each of these queries returns about 10,000 rows give or take (they don’t but let’s pretend that they do).  To verify the veracity of the statement “they produce identical results” we just need to do a few steps:

  • Run both of the queries exactly as shown above and check the row counts of each. If they both return 10,650 rows we’re good. If one returns 10,650 rows and the other returns 10,633 then you know they’re not producing identical results (duh!).
  • Comment out the ORDER BY in the first query and add EXCEPT between the two. That should now run because each query produces exactly three columns of results.
  • Note that the first query contains a Common Table Expression (CTE). If the second query does too, you’ll need to move that CTE up (and possibly rename it) to make it cascaded with the first. If both queries use the same CTE(s), simply discard the one for the second query.

Our case should now look like this:

-- A solution of my own design for Gaps
WITH Islands AS 
(
    SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1
    FROM (
        SELECT ID, SeqNo
            ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)
        FROM dbo.GapsIslands) a
     GROUP BY ID, rn
)
SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo)
FROM (
    SELECT ID, SeqNo
        ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2
    FROM Islands
    CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a
GROUP BY ID, m
HAVING COUNT(*) = 2

EXCEPT

-- A Gaps Solution from SQL MVP Deep Dives
SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1
FROM (
    SELECT ID, cur=SeqNo, nxt=(
        SELECT MIN(B.SeqNo)
        FROM dbo.GapsIslands AS B
        WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo)
    FROM dbo.GapsIslands AS A) AS D
WHERE nxt - cur > 1;

If the two queries produce identical results, when you run the above it should produce zero rows in the results set!  Then all you’re left with is to verify that the newer query runs faster, and that’s something I covered in The One Million Row Test Harness.

If some rows appear then there is something wrong and the queries are not producing identical results sets.  You could analyze the results by switching the order of the queries around the EXCEPT.

Or you could just forget this tip about using EXCEPT and painstakingly analyze each of the 10,650 rows produced by each result set.  But that is something that I’m just a bit too lazy to do!

Then again, maybe it is precisely the tool you need to get over your queryophobia and dive head first into that complex query written by some other SQL geek, without worrying too much about verifying that the results are correct.

If you liked this neat little T-SQL trick then maybe you should follow me on Twitter: @DwainCSQL because who knows what other cool stuff I might come up with!

© Copyright Dwain Camps 21 May 2015.  All rights reserved.

Excel in T-SQL Part 2 – The Normal Distribution (NORM.DIST) Density Functions

Today’s blog will be the second in a multi-part series on replicating Excel functions in T-SQL, continuing with Excel’s NORM.DIST built-in function, thus enshrining my geekdom in the SQLverse forever.

Today’s solutions will once again focus on creating T-SQL in-line, Table Valued Functions (iTVFs), introduced in SQL Server 2005, that apply my solution techniques to the general case of calculating the value of the Normal Distribution’s Probability Density Function (PDF) and Cumulative Distribution Function (CDF), for any mean and standard deviation you specify.  I will not be doing any performance testing of today’s iTVFs; however I have done my best to utilize T-SQL performance best practices in their construction.  I will also provide you some links to alternate solutions I have seen.  Since all are approximation algorithms (including mine), the difference to our readers is probably going to be in which produces the lowest error results.  Only you can be the best judge of which that is.

The Normal Distribution

As I’ve said before I’m no statistician, but I have seen the Normal Distribution and most likely you’ve seen it too.  The normal distribution has many interesting properties, starting with the:

  • Mean – The average of all sample points representing the population, which appears as the center line of a normal distribution (probability density) function’s curve.
  • Standard Deviation – A measure of the elongation of the tails of the distribution.

Graphically I’m sure you’ve seen the normal distribution before, with the blue line (below) being what you normally (no pun intended) see.  This is the normal distribution’s probability density function (PDF).  The red line is also of interest, and this is called the cumulative distribution function (CDF).

Normal PDF-CDF 1

Notice how the blue line (PDF) is bilaterally symmetrical about the mean value, which in the above case is ten.  This simply means that there’s an equal portion of the curve on both the left and the right of the mean.  Another property of the PDF for any distribution is that the area under the curve is exactly equal to one, and this applies regardless of what the mean and standard deviation are.  In fact, it also applies to other statistical distributions, so long as it is the PDF you’re talking about.

The CDF represents the area under the PDF at any particular point along the PDF line.  For example, if we draw two intersecting lines as in the example below, with the first (vertical) line through the mean and the second line intersecting the first and also the CDF, we see:

  • The clear bilateral symmetry of the PDF line, where half of the area appears to the left of the bisecting line and the other half appears to the right.
  • The horizontal line intersecting the right vertical access at 0.5, indicating that the value of the CDF (the area under the PDF) at the mean is precisely 0.5, meaning that the total area of the PDF curve is 1.0, which is also the point at which the red curve intersects with the right axis.

Normal PDF-CDF 2

This same relationship holds for the PDF/CDF of any other distribution.

Computing the Normal Distribution’s Probability Density Function

If you go to the Wiki page for the Normal Distribution linked above (or many other sources), you will see that the PDF for the normal distribution can be represented mathematically as:

Formula 1

Where µ = the arithmetic mean and σ2= the standard deviation.  The variance (σ) is the square root of the standard deviation.

Also, SQL has the PI and EXP built-in functions to support this equation.

SELECT PI(), EXP(1.);
-- Results: 
3.14159265358979    2.71828182845905

In SQL terms, this is relatively straightforward to compute, and we’ll show that in a minute.

Let’s look now at the Excel NORM.DIST function, which takes four arguments:

  • x is the point where we wish to calculate the value of the normal distribution, which could be any point along the x-axis of the graphics above.
  • µ is the mean for the normal distribution of interest.
  • σ2 is the standard deviation for the normal distribution of interest.
  • The last (fourth) argument to NORM.DIST is a logical (TRUE/FALSE) value which specifies whether you want to calculate the value for the PDF (if FALSE) or the CDF (if TRUE).

There is also a special case of the normal distribution, known as the Standard Normal Distribution, where the mean is zero and the standard deviation is one.  For this special case, the PDF function is somewhat simpler and can be written as:

Formula 2

As I am also no mathematician, I am thankful that all of the formulas so far have simply been copied out of the relevant sources.

Since I am a T-SQL guy, I must now focus on my specialty and present a function that will calculate the probability density function (PDF) for both the standard and general normal distributions.

CREATE FUNCTION dbo.NORMAL_PDF
-- Normal Distribution - Returns two bits of information:
--  NPDF(X) - The Normal Probability Density Function's value at given mean and std deviation
--  SNPDF(X) - The Standard Normal Probability Density Function's value (mean=0, std dev=1)
(
    @X          FLOAT       -- Point at which function is to be evaluated
    ,@Mean      FLOAT       -- Mean of the Normal Distribution
    ,@StdDev    FLOAT       -- Standard Deviation of the Normal Distribution
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
SELECT X        = @X
    ,Mean       = @Mean
    ,StdDev     = @StdDev
    -- Normal Probability Density Function for Mean and Standard Deviation
    ,[NPDF(X)]  = EXP(-0.5*(@X-@Mean)*(@X-@Mean) /
                    (@StdDev*@StdDev))/(SQRT(2.*PI())*@StdDev)
    -- Standard Normal Probability Density function for Mean=0 and StdDev=1
    ,[SNPDF(X)] = EXP(-0.5*@X*@X)/SQRT(2.*PI());

Let’s now construct an Excel spreadsheet to calculate the PDF for the normal distribution using a couple of different mean/standard deviation combinations, so we can check the results from our function.

Excel Check Sheet 1a

The highlighted cell’s formula is in the formula text box above, showing you how to calculate the PDF for the normal distribution where mean is two and standard deviation is 0.5 (from columns B and C).

We’ve left a couple of columns open so we can run the following T-SQL script that uses our iTVF to calculate the PDF values at the various Xs.  The differences columns are currently showing the same value as appears in the Excel column with reversal of sign.

WITH SampleData (X, Mean, StdDev) AS
(
    SELECT -1,2,0.5
    UNION ALL SELECT -0.5,2,0.5
    UNION ALL SELECT 0,2,0.5
    UNION ALL SELECT 0.5,2,0.5
    UNION ALL SELECT 1,2,0.5
    UNION ALL SELECT 1.5,2,0.5
    UNION ALL SELECT 2,2,0.5
    UNION ALL SELECT 2.5,2,0.5
    UNION ALL SELECT 3,2,0.5
    UNION ALL SELECT 3.5,2,0.5
    UNION ALL SELECT 4,2,0.5
    UNION ALL SELECT 2.33,10,5
    UNION ALL SELECT 4.22,10,5
    UNION ALL SELECT 5.1,10,5
    UNION ALL SELECT 8.2,10,5
    UNION ALL SELECT 10,10,5
    UNION ALL SELECT 11.5,10,5
    UNION ALL SELECT 12.6,10,5
    UNION ALL SELECT 16.8,10,5
    UNION ALL SELECT 22.5,10,5
)
SELECT a.X, a.Mean, a.StdDev, b.[NPDF(X)], b.[SNPDF(X)]
FROM SampleData a
CROSS APPLY dbo.NORMAL_PDF(X, Mean, StdDev) b;

When we run the above script, then copy/paste the output from the SSMS Results pane into our spreadsheet, it now looks like this.

Excel Check Sheet 2

The cell formula at the top now shows you how to calculate the PDF for the standard normal distribution in Excel (second and third arguments are different from the prior Excel graphic).

The red bordered columns now show very miniscule differences between the values computed by T-SQL for the normal distributions’ PDF and SNPDF.  These calculations are probably accurate enough to say that we’ve replicated the Excel NORM.DIST function’s results when its fourth argument is FALSE.

Computing the Normal Distribution’s Cumulative Distribution Function (CDF)

We can easily add a few columns to our spreadsheet to show you how easy it is in Excel to calculate the CDF for our two distributions, including some placeholders for when we get to the point of having some results in T-SQL.

Excel Check Sheet 3

Note that the highlighted cell’s formula is shown in Excel’s formula entry text box.

While the PDF for the normal distribution can be represented in what is known as “closed form” (see the formulas above), the CDF cannot be represented in closed form.  Instead we need to represent the value of the CDF at a point (X) as a definite integral, which is essentially just calculating the area under the PDF.

Formula 3

At the sight of the integral symbol, I’m sure that some of my reader’s eyes have glazed over.  Once again though, this is a formula not of my own making, but one which can be found in many sources.  My hope is that I don’t lose you here, because if you read on you just might find some really interesting stuff!

Let us consider a simple case, for example where X = µ (the mean).  In that case, from our graphical look at the two PDF/CDF curves, it implies that the value of the CDF should be 0.5.  If we look at cells J9 and J18 in the last Excel spreadsheet shown, we see that indeed Excel’s NORM.DIST function computes exactly that value for the CDF.

To replicate this in T-SQL, we’ve got just a bit of a problem though.  Unfortunately, T-SQL does not have an “INTEGRATE” built-in function!

The Fundamental Theorem of Calculus

I know that’s a mouthful, but the fundamental theorem of calculus is going to give us a way to construct a set-based algorithm that will allow us to calculate the CDF of the normal distribution.  If you haven’t taken calculus, or can’t be bothered to try to remember what the fundamental theorem says, you might want to skip ahead to the next section and just get to the T-SQL.  But as I’ve said many times, since I am no mathematician, I’ll need to keep my explanation as simple as possible in the hopes that I don’t trip myself up and that my readers can follow along.

Let’s look at a graphical example.

Normal PDF-CDF 3

Suppose we were interested in the CDF where X=5.  We could construct a series of rectangles that extend from the lowest value of the left tail of the distribution, up to where X=5.  As these rectangles get smaller and smaller, the sum of the areas within the rectangles approaches quite closely the value of the CDF.  Technically, this is expressed as a “limit” where the width of the rectangle à 0.

That is essentially the fundamental theorem of calculus, or the part that interests us anyway.

Unfortunately, we’re not quite done with our exotica yet.  Constructing a set-based formula that sets up a specific (known) number of rectangles from -∞ (negative infinity) up to our value of X, is problematic at best.  So we’re going to fall back on something that I pointed out earlier.  This is that the area under the curve where x=10 (the mean in this example) is 0.5.  So instead we can solve the following integral when X is less than our mean.

Formula 4

And alternatively, when x is greater than the mean:

Formula 5

Gosh!  That still doesn’t sound too simple, but in reality we are now quite close to integrating the normal distribution’s PDF at a point of interest, and doing it with a set-based algorithm.

By now my regular readers are probably used to my little T-SQL sleight of hand tricks.  So here’s another.  Consider the following script.

DECLARE
    @X          FLOAT = 1.5     -- Point at which function is to be evaluated
    ,@Mean      FLOAT = 2.      -- Mean of the Normal Distribution
    ,@StdDev    FLOAT = 0.5     -- Standard Deviation of the Normal Distribution
;

WITH CalculateIntervals AS
(
    -- Total intervals (about 5 per standard deviation)
    SELECT Intervals = 5 * ABS(@Mean - @X) / @StdDev
        -- Number of intervals per standard deviation
        ,Interval    = 5
), 
    Tally (n) AS
(
    -- Up to 10,000 row tally table
    SELECT TOP (SELECT CAST(Intervals AS INT) FROM CalculateIntervals)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
)
SELECT X = @X
    ,Mean = @Mean
    ,StdDev = @StdDev
    ,Intervals 
    ,n
    ,Pos1
    ,Pos2
    ,Width
    ,Height = AvgX
    --    Width x Height = NPDF(AvgX)
    ,Area=Width * e.[NPDF(X)]
FROM CalculateIntervals a
CROSS JOIN Tally b
CROSS APPLY
(
    SELECT Pos1     = @Mean + (@StdDev/Interval) * (n - 1.) * SIGN(@X - @Mean)
        ,Pos2       = @Mean + (@StdDev/Interval) * (n + 0.) * SIGN(@X - @Mean)
        ,Width      = ABS(@Mean - @X)/Intervals
) c
CROSS APPLY
(
    -- Cheat the rectangle's height - make it the average of Pos1 and Pos2
    SELECT AvgX = 0.5 * (Pos1 + Pos2)
) d
CROSS APPLY dbo.NORMAL_PDF(AvgX, @Mean, @StdDev) e;

Let’s take this bit of magic and decompose it one step at a time.

  • We start by declaring three local variables, which will ultimately become arguments to the iTVF we hope to create.
  • The CalculateIntervals CTE creates two intermediate column values that we’ll use later: 1) the first is the total number of intervals, and 2) is the intervals per standard deviation that our X of interest is from our mean.
  • The Tally CTE is just an in-line Tally table that is limited to the number of rows set up in the Intervals (total intervals) column of the CalculateIntervals CTE. The intervals themselves are simply used to represent the rectangles we described above.
  • Next (in our FROM clause in the main query) we take the one row that is returned from the CalculateIntervalsCTE and CROSS JOIN it with the Tally table’s results, so that we’ve consolidated everything we need onto each row.
  • The first of three cascading CROSS APPLYs calculates two positions (Pos1 and Pos2) and the width of our rectangle. These calculations will be clearer when we look at the results set (below).
  • The second cascading CROSS APPLY “cheats” the height of the rectangle, to be the average of Pos1 and Pos2. This introduces some level of error into our calculations, and there are ways around that (too advanced for me and this blog). One way to reduce this error is to introduce more intervals (or rectangles).
  • Lastly, in our final CROSS APPLY we call our NORMAL_PDF function to calculate the value of the PDF at AvgX where the mean and standard deviation are specified.

Note that this code fails miserably (no results rows produced) when @X = @Mean, but we’ll handle that in our iTVF.  Here are the results:

X    Mean  StdDev  Intervals  n  Pos1   Pos2  Width   Height  Area
1.5  2     0.5     5          1  2      1.9   0.1     1.95    0.0793905094954024
1.5  2     0.5     5          2  1.9    1.8   0.1     1.85    0.0762775630921048
1.5  2     0.5     5          3  1.8    1.7   0.1     1.75    0.0704130653528599
1.5  2     0.5     5          4  1.7    1.6   0.1     1.65    0.0624507866733522
1.5  2     0.5     5          5  1.6    1.5   0.1     1.55    0.053217049979751

The X, Mean, StdDev and n columns should require no explanation.  The Intervals column will be five times the number of standard deviations that X is from our mean (note that X could be less than or greater than the mean).  Pos1 and Pos2 compute the right and left points of our interval (which may be reversed if @X > @Mean), while width is computed based on the total difference between X and our mean divided by the total number of intervals.  See the comment in the code about “cheat the height?”  Height in the result set is simply the average of Pos1 and Pos2.  Finally, the Area is Width * Height, which when summed across all five of our rectangles and adjusted according to the value of the CDF at the midpoint (0.5) should be a reasonably close approximation of the CDF of the normal distribution at X!  This approximation should improve if we use more than five rectangles per standard deviation.

To put all of this another way, we’ve implemented the fundamental theorem of calculus by using a tally table to calculate our little rectangles, in order to integrate under a curve.  Too bad I didn’t think of that example when I wrote my Tally tables blog!

A T-SQL iTVF to Calculate the Normal Distribution’s Cumulative Distribution Function (CDF)

At this time, it is just a little more work to get to where we want to be.  This is an iTVF to simulate the NORM.DIST function in Excel.  We’ll need to eliminate some of the unnecessary intermediate results from the previous script, and also combine a couple of extra bits to calculate the PDF and the CDF when X is at the mean, but you can read the comments in the function to see those additions.

We’re also going to make our function allow for a little additional selectivity by adding one argument that can adjust the number of intervals per standard deviation.

CREATE FUNCTION dbo.Excel_NORM_DIST
-- Excel Normal Distribution - Returns either:
--  Probability Density Function (PDF) or 
--  Cumulative Distribution Function (CDF)
(
    @X          FLOAT   -- Point at which function is to be evaluated
    ,@Mean      FLOAT   -- Mean of the Normal Distribution
    ,@StdDev    FLOAT   -- Standard Deviation of the Normal Distribution
    ,@CumDist   TINYINT -- =0 for Probability Density, =1 for Cumulative Density
    ,@Intervals INT = NULL
) 
RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH CalculateIntervals AS
(
    -- Total intervals (default is about 100 per standard deviation)
    SELECT Intervals = ISNULL(@Intervals, 100) * ABS(@Mean - @X) / @StdDev
        -- Number of intervals per standard deviation
        ,Interval    = ISNULL(@Intervals, 100)
), 
    Tally (n) AS
(
    -- Up to 10,000 row tally table
    SELECT TOP (SELECT CAST(Intervals AS INT) FROM CalculateIntervals)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n)
)
-- PDF 
SELECT X, Mean, StdDev, [F(X)]=[NPDF(X)]
FROM
(
    SELECT X        = @X
        ,Mean       = @Mean
        ,StdDev     = @StdDev
        ,[NPDF(X)]
    FROM dbo.NORMAL_PDF(@X, @Mean, @StdDev)
) a
WHERE @CumDist = 0
UNION ALL
-- CDF where X = mean
SELECT X        = @X
    ,Mean       = @Mean
    ,StdDev     = @StdDev
    ,[CDF(X)]   = 0.5
WHERE @CumDist = 1 AND @X = @Mean
UNION ALL
-- CDF where X  mean
SELECT X, Mean, StdDev, [CDF(X)]
FROM
(
    SELECT X        = @X
        ,Mean       = @Mean
        ,StdDev     = @StdDev
        --                  SUM the rectangles
        ,[CDF(X)]   = 0.5 + SUM(
                                -- Add to or remove from CDF at mean (0.5)
                                SIGN(@X - @Mean) *
                                -- Width x Height = NPDF(AvgX) 
                                Width * d.[NPDF(X)]
                                )
    FROM CalculateIntervals a
    CROSS JOIN Tally b
    CROSS APPLY
    (
        SELECT Pos1     = @Mean + (@StdDev/Interval) * (n - 1.) * SIGN(@X - @Mean)
            ,Pos2       = @Mean + (@StdDev/Interval) * (n + 0.) * SIGN(@X - @Mean)
            ,Width      = ABS(@Mean - @X)/Intervals
    ) c
    --                         -- Average height --
    CROSS APPLY dbo.NORMAL_PDF(0.5 * (Pos1 + Pos2), @Mean, @StdDev) d
) a
WHERE @CumDist = 1 AND @X  @Mean;
GO

WITH SampleData (X, Mean, StdDev) AS
(
    SELECT -1,2,0.5
    UNION ALL SELECT -0.5,2,0.5
    UNION ALL SELECT 0,2,0.5
    UNION ALL SELECT 0.5,2,0.5
    UNION ALL SELECT 1,2,0.5
    UNION ALL SELECT 1.5,2,0.5
    UNION ALL SELECT 2,2,0.5
    UNION ALL SELECT 2.5,2,0.5
    UNION ALL SELECT 3,2,0.5
    UNION ALL SELECT 3.5,2,0.5
    UNION ALL SELECT 4,2,0.5
    UNION ALL SELECT 2.33,10,5
    UNION ALL SELECT 4.22,10,5
    UNION ALL SELECT 5.1,10,5
    UNION ALL SELECT 8.2,10,5
    UNION ALL SELECT 10,10,5
    UNION ALL SELECT 11.5,10,5
    UNION ALL SELECT 12.6,10,5
    UNION ALL SELECT 16.8,10,5
    UNION ALL SELECT 22.5,10,5
)
SELECT a.X, a.Mean, a.StdDev
    ,[NPDF(X)]  = c.[F(X)]
    ,[SNPDF(X)] = b.[SNPDF(X)]
    ,[CDF(X)]   = d.[F(X)]
FROM SampleData a
-- Use NORMAL_PDF to return PDF for standard normal distribution
CROSS APPLY dbo.NORMAL_PDF(a.X, a.Mean, a.StdDev) b
-- Use Excel_NORM_DIST to return PDF for general normal distribution
CROSS APPLY dbo.Excel_NORM_DIST(a.X, a.Mean, a.StdDev, 0, DEFAULT) c
-- Use Excel_NORM_DIST to return CDF for a general normal distribution
CROSS APPLY dbo.Excel_NORM_DIST(a.X, a.Mean, a.StdDev, 1
    -- Increase the accuracy by specifying a larger number for fifth parameter
    ,CASE a.Mean WHEN 10 THEN 500 ELSE 100 END) d;

Note how we can override (using the fifth parameter to the function) the number of intervals, or just let it use the DEFAULT (which is 100 per standard deviation).  You need to take some care with this because if you end up trying to calculate the CDF for an X that is a significant number of standard deviations away using a high interval count, you could exceed the number of rows generated by the in-line tally table.  But you can always make that tally table generate more rows by adding additional CROSS JOINs.  Just remember that as you evaluate more and more intervals, the performance of the iTVF will be degraded, so my advice is to use just what you need to achieve the accuracy you seek.

The final results from this query are:

X      Mean  StdDev  NPDF(X)               SNPDF(X)              CDF(X)
-1.00  2     0.5     1.21517656996466E-08  0.241970724519143     9.86435766403559E-10
-0.50  2     0.5     2.9734390294686E-06   0.3520653267643       2.86620600586929E-07
0.00   2     0.5     0.000267660451529771  0.398942280401433     3.16690114139928E-05
0.50   2     0.5     0.00886369682387602   0.3520653267643       0.00134984263449417
1.00   2     0.5     0.107981933026376     0.241970724519143     0.0227496820247703
1.50   2     0.5     0.483941449038287     0.129517595665892     0.158654245714223
2.00   2     0.5     0.797884560802865     0.0539909665131881    0.5
2.50   2     0.5     0.483941449038287     0.0175283004935685    0.841345754285777
3.00   2     0.5     0.107981933026376     0.00443184841193801   0.97725031797523
3.50   2     0.5     0.00886369682387602   0.00087268269504576   0.998650157365506
4.00   2     0.5     0.000267660451529771  0.000133830225764885  0.999968330988586
2.33   10    5.0     0.0246013380403521    0.0264264854972617    0.0625147949422269
4.22   10    5.0     0.0409033036384656    5.41832610895401E-05  0.123840539089739
5.10   10    5.0     0.0493618981134085    8.97243516238334E-07  0.163543019015466
8.20   10    5.0     0.0747821210746257    9.99837874849718E-16  0.359423544347365
10.00  10    5.0     0.0797884560802865    7.69459862670642E-23  0.5
11.50  10    5.0     0.0762775630921048    7.6416554115872E-30   0.61791144125835
12.60  10    5.0     0.0696985025517949    1.33848679925429E-35  0.697770895452422
16.80  10    5.0     0.0316449580740766    2.05718230302861E-62  0.913085073917205
22.50  10    5.0     0.00350566009871371   4.67863681725005E-111 0.99379034197768

The first thing that we notice about these results is that our CDF where X equals the mean is 0.5.  That’s a good start, but for a more thorough check we’ll copy/paste these results into our Excel spreadsheet.

Excel Check Sheet 4

Notice how all of the values in column M are quite small (nearly zero), indicating a pretty small difference between Excel’s calculated CDF and the one calculated in T-SQL.  Had we left the second distribution group to default to 100 intervals, the difference would have been slightly larger.

Conclusions

I’ll number my conclusions:

  1. I am a T-SQL geek, because whom but a T-SQL geek would have thought of doing something so ridiculously geeky and so enjoyed writing about it as much as me!
  2. It is possible to do integration (specifically to solve for a finite definite integral) in T-SQL by using a tally table to implement the fundamental theorem of calculus.
  3. We have provided two functions:
    1. The first, NORMAL_PDF, which calculates the value for the probability density function for both general and standard normal distributions.
    2. The second, EXCEL_NORM_DIST, which calculates the probability density and cumulative distribution functions for a general normal distribution, closely replicating the results of Excel’s NORM.DIST function.

Prior to writing this blog, I did a little searching to see if anyone has tried this before and I ran across this article by Eli Algranti: Part 1: T-SQL Implementation of NORMDIST / NORM.S.DIST

He is obviously a bit more of a mathematician than I am, because he came up with three different implementations that have different error characteristics for the CDF.

  • Using polynomial approximation – certainly an interesting approach that uses an approximate “closed form” version of the antiderivative of the normal distribution. My only issue with it is that is uses a scalar-valued, user-defined function (UDF) to perform the calculation.
  • Another polynomial approximation – similar to the first but with a different error result. Again, my issue is his implementation using the scalar-valued, UDF.
  • Using a rational polynomial approximation – a bit more elaborate estimation approach that once again is implemented using a scalar-valued, UDF.

Certainly all of those could be converted to iTVFs that would run faster.  It would be interesting to see how well my function’s error results compare to those.

Below is an Excel workbook file provided to you as a resource.  In that you’ll find three worksheets:

  • Final Results w-Check – showing the spreadsheet screen captures above (all columns), plus a column that contains a neat way to translate sample data from Excel to a series of T-SQL SELECT/UNION ALL/SELECT statements, like you see in my SampleDate CTEs above.
  • SQL- Create-run NORMAL_PDF – a T-SQL script to create the named function (in italics) and run it with the sample data provided.
  • SQL- Create-run EXCEL_NORM_DIST – a T-SQL script to create the named function (in italics) and run it with the sample data provided.

The normal distribution is one that is quite important in statistics and having tools to apply its distributions in T-SQL can be just as important.  We hope you’ve found today’s blog useful and instructive!

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 14 May 2015.  All rights reserved.

Excel in T-SQL Part 1 – HARMEAN, GEOMEAN and FREQUENCY

Today’s blog will be the first in a multi-part series on replicating Excel functions in T-SQL, starting with HARMEAN, GEOMEAN and FREQUENCY.

We’ll focus our solutions on creating T-SQL in-line, Table Valued Functions (iTVFs), introduced in SQL Server 2005, that apply the solution technique to a specifically designed table of sample data.  If you plan on using any of these solutions, this means you’ll need to adapt them to your table of interest.  Due to the need to remain somewhat brief in my explanations, I probably won’t be doing a lot of performance testing of these iTVFs (I certainly will not in today’s blog).  But you can rest assured I’ve put my head to the grindstone in an attempt to do my best to incorporate all of the performance best practices I know of when writing this library of functions.

Some Sample Data

We’ll need to construct some sample data to validate our T-SQL code against, so let’s start with a table and some sample data.

-- A table to hold our sample data
CREATE TABLE dbo.ExcelExamples
(
    Grp         INT
    ,Value      INT
);
GO

INSERT INTO dbo.ExcelExamples (Grp, Value)
-- Grp identifiers 1, 2, 3 and 4 
SELECT 1, n
FROM (VALUES(15),(20),(35),(40),(50)) a (n)
UNION ALL 
SELECT 2, n
FROM (VALUES(3),(6),(7),(8),(8),(10),(13),(15),(16), (20)) a (n)
UNION ALL 
SELECT 3, n
FROM (VALUES(3),(6),(7),(8),(8),(9),(10),(13),(15),(16), (20)) a (n)
UNION ALL
SELECT 4, n
FROM (VALUES(1),(2),(3),(4)) a(n);

Let’s construct an Excel spreadsheet containing this sample data and show Excel’s results for the three functions we’ll attempt to replicate.  The Breaks column represents one of the arguments to the FREQUENCY function.

BLOG - Excel in T-SQL Part 1

In order to use the Excel FREQUENCY function, I needed to learn how to enter array formulas, which you’ll also need to do if you want to try some examples of your own that are like the above.

Harmonic Mean

The harmonic mean of the set of observations represented by our sample can be described as the total number of observations divided by the sum of the reciprocals of the data elements.  If you are into the mathematical representation of this, you can visit the linked Wiki page.

This is actually quite straightforward in T-SQL:

SELECT Grp, HarmonicMean=COUNT(*)/SUM(1./Value)
FROM dbo.ExcelExamples
GROUP BY Grp;
-- Results:
Grp	HarmonicMean
1	26.28285356807711767
2	8.00674561002681970
3	8.08789038406971559
4	1.92000000000307200

Note the correspondence between returned values and those calculated by Excel above.

To keep things neat and tidy, we’ll encapsulate this in an iTVF like I said I would in the introduction.

CREATE FUNCTION dbo.Excel_HarmonicMean
(
    @Grp    INT
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
SELECT Grp=@Grp, HarmonicMean=COUNT(*)/SUM(1./a.Value)
FROM dbo.ExcelExamples a
WHERE @Grp IS NULL OR @Grp = a.Grp;

Which we can call either for a single group or the entire set as follows.

-- Call the iTVF for each group (Grp) in the set
SELECT a.Grp, b.HarmonicMean
FROM
( 
    SELECT a.Grp
    FROM dbo.ExcelExamples a
    GROUP BY a.Grp
) a
CROSS APPLY dbo.Excel_HarmonicMean(a.Grp) b;

-- Call the iTVF with NULL for the Grp to apply to all values in the table
SELECT Grp, HarmonicMean
FROM dbo.Excel_HarmonicMean(NULL);
-- Results:
Grp  HarmonicMean
1    26.28285356807711767
2    8.00674561002681970
3    8.08789038406971559
4    1.92000000000307200

Grp  HarmonicMean
NULL 6.14429720701647651

You can also check that the harmonic mean for all of the sample data matches our Excel spreadsheet example above.  These results show slightly more precision than Excel does, but presumably you can adjust that if you are interested.

Geometric Mean

The geometric mean of the set of observations represented by our sample data can be described as the nth root of the product of the values in the sample.  An alternative description is to find the product of the values in the sample, and raise that to the power that is the reciprocal of n.  Once again, the Wiki page will show you the mathematical representation of this calculation, if that would be more clear to you.

T-SQL has a SQRT function but it does not have a function for calculating the nth root.  It does have a POWER function, which we can use as in the secondary description.

Of course, the real challenge is calculating the product of all items in the sample, which could turn out to be quite large and end up causing an arithmetic overflow quite quickly.  To deal with this issue, we’re going to use a Stupid T-SQL Trick for Logarithms that we learned from an earlier blog.

Let’s look at the results from the following query:

SELECT Grp, SUM(LOG10(Value)), COUNT(*)
FROM dbo.ExcelExamples
GROUP BY Grp
ORDER BY Grp;
-- Results:
Grp  (No column name)   (No column name)
1    7.32221929473392   5               
2    9.70173510878387   10              
3    10.6559776182232   11              
4    1.38021124171161   4 

By summing the LOG10 values, we’ve got a logarithmic number that can be converted back to the product of the numbers that were summed by using POWER(10.0, x), where x is SUM(LOG10(Value)).  That could be quite problematic due to the potential for arithmetic overflow, but is possible for the set of sample data we’ve provided.  If we cast the SUM to a FLOAT, we’ll have a much wider range of values available, albeit at a loss of precision.

Also note that LOG10(0.0) is undefined and will return an error, so if you’ve got zero data points anywhere, you’ll need to handle that case as well.

SELECT Grp, SUM(LOG10(Value)), COUNT(*)
    ,CAST(POWER(10., CAST(SUM(LOG10(Value)) AS FLOAT)) AS FLOAT) 
FROM dbo.ExcelExamples
GROUP BY Grp
ORDER BY Grp;
-- Results:
Grp    (No column name)   (No column name)  (No column name)
1      7.32221929473392   5                 21000000
2      9.70173510878387   10                5031936000
3      10.6559776182232   11                45287424000
4      1.38021124171161   4                 24

For a quick check, we can see that 1*2*3*4 = 24 (the last row or where Grp=4).

We’re not quite there yet, because we still need to take the nth root, but we can now calculate the geometric mean thusly:

SELECT Grp, GeometricMean=
    POWER(CAST(POWER(10., CAST(SUM(LOG10(Value)) AS FLOAT)) AS FLOAT)
        , 1./COUNT(*))              -- Reciprocal of count gives nth root
FROM dbo.ExcelExamples
GROUP BY Grp
ORDER BY Grp;
-- Results:
Grp   GeometricMean
1     29.1369345857619
2     9.33627232372056
3     9.30518984213118
4     2.21336383940064

Our results indicate that we seem to have nailed the geometric mean as calculated by the Excel GEOMEAN function.

Let’s now do what we always like to do and said we were going to do in the introduction, and that is to encapsulate this in an iTVF.

CREATE FUNCTION dbo.Excel_GeometricMean
(
    @Grp    INT
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
SELECT Grp=@Grp, GeometricMean=
    POWER(CAST(POWER(10., CAST(SUM(LOG10(a.Value)) AS FLOAT)) AS FLOAT)
        , 1./COUNT(*))              -- Reciprocal of count gives nth root
FROM dbo.ExcelExamples a
WHERE @Grp IS NULL OR @Grp = a.Grp;

And we can now either call that function for a group or we can calculate the geometric mean across all of the sample data.

-- Call the iTVF for each group (Grp) in the set
SELECT a.Grp, b.GeometricMean
FROM
( 
    SELECT a.Grp
    FROM dbo.ExcelExamples a
    GROUP BY a.Grp
) a
CROSS APPLY dbo.Excel_GeometricMean(a.Grp) b;

-- Call the iTVF with NULL for the Grp to apply to all values in the table
SELECT Grp, GeometricMean
FROM dbo.Excel_GeometricMean(NULL);
-- Results:
Grp   GeometricMean
1     29.1369345857619
2     9.33627232372056
3     9.30518984213118
4     2.21336383940064

Grp   GeometricMean
NULL  9.3040372829131

Once again, note the correspondence between returned values and those calculated by Excel above.

Frequency

Building an iTVF for FREQUENCY is going to be just a little more challenging than the two cases we’ve studied above.  Let’s take a look at a specialized query targeting our first Grp to get started.

DECLARE @Grp  INT = 1;

WITH Ranges (r) AS
(
    -- The ranges of interest for Grp=1 (refer to the Excel example)
    SELECT 10 UNION ALL SELECT 25 UNION ALL SELECT 40
),
    RangesExt (r, rn) AS
(
    -- Add a row number (i.e., extend the ranges)
    SELECT r, rn=ROW_NUMBER() OVER (ORDER BY r)
    FROM Ranges
),
    JoinedRanges AS
(
    SELECT Grp, r1, r2 
    FROM
    (
        -- Full join to increase range rows from 3 to 4
        SELECT r1=a.r, r2=b.r
        FROM RangesExt a
        FULL JOIN RangesExt b ON a.rn = b.rn - 1
    ) a
    CROSS APPLY (SELECT Grp=@Grp) b
)
SELECT *
FROM JoinedRanges;
-- Results:
Grp	r1	r2
1	NULL	10
1	10	25
1	25	40
1	40	NULL

If you’ve looked at how FREQUENCY works, it constructs four rows from a three value range (each value being the upper end of a range), in order to find anything less than the second value (first array value returned), greater than the first value (last array value returned), or otherwise between intermediate values.  More than three rows in the range of course always results in one additional row of results.

Our r1 and r2 columns can be used to perform the range checks described above.  In fact, they’re really the heart of the function we’ll show you next.

Note that Excel’s FREQUENCY function may only calculate counts between zero and the value in the first row, whereas what we’re about to do extends the lower end of the range into negative numbers.  The documentation was unclear and I didn’t personally check it out.  But we encourage our valued readers to do so and modify my code accordingly if it better suits your needs that way.

Before we create a function to do the heavy lifting here, we want to mention Table Valued Parameters (TVPs).  This is a way of creating and using tables based on a pattern on-the-fly (sort of) in T-SQL.  I’ve written before about how TVPs can be used.  Let’s create one now (this is done by the CREATE TYPE AS TABLE statement):

CREATE TYPE dbo.Excel_FrequencyRanges AS TABLE
(
    [Range] INT
);
GO
DECLARE @Sample_Range dbo.Excel_FrequencyRanges;

INSERT INTO @Sample_Range
SELECT 10 UNION ALL SELECT 25 UNION ALL SELECT 40;

SELECT *
FROM @Sample_Range;
-- Results:
Range
10
25
40

TVPs can be created as above and passed into a T-SQL function, assuming you are using at least SQL 2008.

So now here’s my take on a T-SQL iTVF to calculate frequency. Note how I pass in the TVP as the second parameter to the function.

CREATE FUNCTION dbo.Excel_Frequency
(
    @Grp        INT
    ,@Ranges    dbo.Excel_FrequencyRanges READONLY
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH RangesExt (r, rn) AS
(
    -- Add a row number
    SELECT r=[Range], rn=ROW_NUMBER() OVER (ORDER BY [Range])
    FROM @Ranges
),
    JoinedRanges AS
(
    SELECT Grp, r1, r2 
        -- Create a string to represent the range
        ,[Range]=CASE 
                WHEN a.r1 IS NULL
                THEN 'x less than or equal to ' + CAST(a.r1 AS VARCHAR(20))
                WHEN a.r2 IS NULL
                THEN 'x greater than ' + CAST(a.r1 AS VARCHAR(20))
                ELSE CAST(a.r1 AS VARCHAR(20)) + ' less than x less than or equal to +
                    CAST(a.r2 AS VARCHAR(20))
                END
    FROM
    (
        -- Full join to increase range rows from 3 to 4
        SELECT r1=a.r, r2=b.r
        FROM RangesExt a
        FULL JOIN RangesExt b ON a.rn = b.rn - 1
    ) a
    CROSS APPLY (SELECT Grp=@Grp) b
)
SELECT a.Grp, a.[Range], Frequency=COUNT(Value)
    ,[Order]=ROW_NUMBER() OVER (ORDER BY r1)
FROM JoinedRanges a
OUTER APPLY
(
    SELECT Grp, Value
    FROM dbo.ExcelExamples b
    WHERE (@Grp IS NULL OR a.Grp = b.Grp) AND
        (r1 IS NULL OR b.Value > r1) AND
        (r2 IS NULL OR b.Value <= r2)
) b
GROUP BY a.Grp, a.[Range], a.r1;

I’ve done a couple of things in our function here that perhaps require explanation:

  • I’ve formatted the ranges we’re interested in into character strings that list out the values compared to an “x” – the column of interest, which in the case of our table is Value.
  • I’ve returned an [Order] column to help in ordering the results after their return should we need it for display purposes.

Let’s run this bad boy through its paces.

DECLARE @Sample_Range dbo.Excel_FrequencyRanges;

-- Create the range for Grp=1
INSERT INTO @Sample_Range
SELECT 10 UNION ALL SELECT 25 UNION ALL SELECT 40;

-- Display the results for Grp=1
SELECT *
FROM dbo.Excel_Frequency(1, @Sample_Range);

-- Clear the table so we can create the range for Grp IN (2, 3)
DELETE FROM @Sample_Range;

-- Create the range for Grp IN (2, 3)
INSERT INTO @Sample_Range
SELECT 5 UNION ALL SELECT 10 
UNION ALL SELECT 15 UNION ALL SELECT 20;

-- Display the results for Grp IN (2, 3)
SELECT a.Grp, b.[Range], b.Frequency
FROM
(
    SELECT Grp
    FROM dbo.ExcelExamples
    WHERE Grp IN (2, 3)
    GROUP BY Grp
) a
CROSS APPLY dbo.Excel_Frequency(a.Grp, @Sample_Range) b
ORDER BY b.Grp, b.[Order];

-- Clear the table so we can create the range for Grp=4
DELETE FROM @Sample_Range;

-- Create the range for Grp=4
INSERT INTO @Sample_Range
SELECT 1 UNION ALL SELECT 3;

-- Display the results for Grp=1
SELECT *
FROM dbo.Excel_Frequency(4, @Sample_Range);

-- Clear the table so we can create the range across all data points
DELETE FROM @Sample_Range;

-- Create the range to use across all data points
INSERT INTO @Sample_Range
SELECT 5 UNION ALL SELECT 10 
UNION ALL SELECT 15 UNION ALL SELECT 20
UNION ALL SELECT 30;

-- Call the iTVF with NULL for the Grp to apply to all values in the table
SELECT *
FROM dbo.Excel_Frequency(NULL, @Sample_Range);

The script above produces the four results sets shown below.

Grp  Range                                    Frequency  Order
1    x less than or equal to 10               0          1
1    10 less than x less than or equal to 25  2          2
1    25 less than x less than or equal to 40  2          3
1    x greater than 40                        1          4

Grp  Range                                    Frequency
2    x less than or equal to 5                1
2    5 less than x less than or equal to 10   5
2    10 less than x less than or equal to 15  2
2    15 less than x less than or equal to 20  2
2    x greater than 20                        0
3    x less than or equal to 5                1
3    5 less than x less than or equal to 10   6
3    10 less than x less than or equal to 15  2
3    15 less than x less than or equal to 20  2
3    x greater than 20                        0

Grp  Range                                    Frequency  Order
4    x less than or equal to 1                1          1
4    1 less than x less than or equal to 3    2          2
4    x greater than 3                         1          3

Grp  Range                                    Frequency  Order
NULL x less than or equal to 5                6          1
NULL 5 less than x less than or equal to 10   11         2
NULL 10 less than x less than or equal to 15  5          3
NULL 15 less than x less than or equal to 20  5          4
NULL 20 less than x less than or equal to 30  0          5
NULL x greater than 30                        3          6

You might want to go back and check those frequencies against the Excel example just to be sure I’m not trying to pull a fast one on you.

So there you have it!  Our first attempt at replicating a few Excel functions in T-SQL.

And here you may have been thinking that I was going to teach you how to excel at writing SQL when you first read that title!  Maybe that will come later, once I’ve figured it out myself.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 29 Apr 2015.  All rights reserved.

Ruminations on Writing Great T-SQL

Today we’re going to try to extend some advice that I once heard from SQL MVP Jeff Moden, which I wrote about in my previous blog entitled “Make it Work, Make it Fast, Make it Pretty.”  That advice was just as pertinent then as it is now, but perhaps I can add to it what I think makes for really great T-SQL code.

Great T-SQL code is going to start with all of the advice that I gave in that blog.  That same advice really applies to any coding language, not just T-SQL.

But great T-SQL code is actually much more than that.  Let’s look at some of the aspects that transcend what I’ve previously written.

Elegance of a Set-based Algorithm

Obviously in our T-SQL we’d like to use predominately set-based algorithms.  Depending on the problem being addressed, oftentimes there are just a few very elegant solutions, and probably a bunch of solutions that are not nearly as elegant.

In simplicity there is often elegance.  Simplicity, elegance and speed may unfortunately not all play well together.  But if you can come up with a solution that is elegant, simple and fast, that’s usually going to get you a lot closer to writing great T-SQL.

Reusability

From Facts and Fallacies of Software Engineering by Robert L. Glass, here is Fact #18: “There are two “rules of three” in reuse: (a) It is three times as difficult to build reusable components as single use components, and (b) a reusable component should be tried out in three different applications before it will be sufficiently general to accept into a reuse library.

I will suggest that in order to be great, T-SQL code must be reusable.  I’ve written before that well-focused code that solves only a specific problem is probably going to be a whole lot faster than more generalized code that solves a wider variety of problems.  So that means it should be focused on solving a common problem that occurs frequently, so that it can be reused (as a tool) whenever the need for it arises.

I tend to agree with Mr. Glass in this case; finding a common problem and then building a fast-performing, simple and elegant T-SQL solution is probably a lot more challenging than building any other one-off query.  Part of this challenge of course is simply finding a suitably simple problem to focus on, which is also encountered commonly enough to want to build something that’s reusable.

Frequency and Quality of its Reuse

Since beginning my study of T-SQL, and of course I remain a student of the subject to this day, I’ve seen a fair number of really elegant solutions to specific problems.  Of the ones that can be classified as reusable, I’ve seen many that solve well a somewhat common problem, but unfortunately the problem itself is not so common that the solution will be reused often.  Let’s face it, maybe you’ve got a solution for Gaps and Islands that is outrageously wonderful, simple, fast and elegant.  But Gaps and Islands isn’t a particularly common problem to encounter.  Sure you see it once in a while, and certainly it is good to have an elegant solution just waiting in the wings for when you need it.  But commonality of the problem is a great influencer, at least to me, of what makes for great T-SQL.

Quality of reuse is a different story entirely.  So let’s say you’ve got this (arguably) great solution and it sees quite a fair bit of reuse.  Who are the folks that are reusing it?  Are they peers in the SQL community, are they SQL Server MVPs who have embraced your method thus utilizing it themselves?  Or is it the great unwashed masses?

Facebook is a great application because millions of people use it.  But who are those users?  I’d say it is a much higher quality pattern of reuse, to have a group of peers or those that are more skilled than you are, using your reusable code, rather than just anybody else in the world.  So while Facebook may be a great app, it may not represent great code!  The quality of the people reusing your code counts for something.

Productivity

Most great T-SQL code should enhance your productivity.  By that I mean, if you’re a developer it can be considered a tool that allows you to avoid spending awkward moments of your development time solving silly sub-problems, when really you need to be focusing on solving the bigger problems that relate to the business solutions you’re trying to produce.  As a DBA, your productivity can be enhanced by any T-SQL script that automates some activity that allows you to be more proactive in your database monitoring activities.

Of course, just because you can solve a problem more quickly because the sub-problems have already been addressed doesn’t mean you should sacrifice performance.  But since great T-SQL should already be pretty fast, you may find that using the tool makes the solution to your bigger problem fast enough!  Ultimately you may have to spend some time to make it even faster, but at least you’ve gotten through the first hurdle of a development project, which is to just get it working.

Encapsulation

Any great T-SQL solution is one that can be encapsulated such that it can be used as a “black box,” without the need for users (or more importantly maintainers) of the using bit of T-SQL, to need to worry about how it works.

Yes, they should be able to look under the covers, so that when need be they can figure out how it works.  But mostly the encapsulations simply makes a great piece of T-SQL, which may otherwise be elegant or complex, really, really easy to use.

My Vote for the Greatest Piece of T-SQL

Arguably the greatest piece of T-SQL code that I’ve ever seen written is the DelimitedSplit8K string splitter.  While Jeff Moden probably wasn’t the originator of the concept of splitting a string, his first attempt at writing a fast string splitter (Tally OH! An Improved SQL 8K “CSV Splitter” Function) was pretty awesome.  I’d say that it clearly meets all of the criteria I’ve mentioned so far in this blog, and with more than 50,000 views of that article as of this writing I’d imagine there’s more than just a few people that agree with me.

Jeff really did a great job of setting up a proper test harness and proving the speed potential of the algorithm, so whether or not he claims to be the author of this wonderful little function, he’ll be the one everyone remembers once all of the dust has settled.

It has also achieved another level of greatness, which is unmatched in my experience.  That is that an entire community has sprung up around it.  There have over the years since its original publication, been numerous contributors to the evolution of DelimitedSplit8K.  These contributions have spanned more than 75 pages of discussion thread on the topic (again, as of this writing).

Alas, there are too many contributors to mention all of them by name, but suffice it to say that list includes some really great SQL talent.  Some have focused on small tweaks, others have focused on producing better test harnesses.  Others have written custom CLRs to improve the speed even further, when CLRs are allowed.  All of the contributors probably use this tool frequently, so this speaks volumes to the function’s adoption and reuse.  One of the recent contributors, Eirikur Eiriksson, upped the ante significantly by writing a version that was even faster by utilizing a new SQL 2012 feature, and his diligence in constructing test harnesses is also worthy of honorable mention.

So here’s this huge group of peers (by that I mean SQL-folk of course), that have not only embraced Jeff’s work, but have also formed an on-going support community because they felt the underlying problem was so worthwhile that they should expend their precious time and effort on it.

There’s an interesting sidebar to my vote.  The Federal Relational Protection Agency (FRPA) might scoff at how preposterous the idea of a string-splitter being great T-SQL is.  After all, this is a relational database and we simply don’t store delimited strings in our properly normalized relational databases, now do we?  I guess what I’m saying here is that academic considerations probably take a back seat to adoption.  Clearly this problem occurs relatively frequently in the real world, regardless of what the FRPA would have us think.  And solving real world problems, to me at least, is really what it’s all about.

Aspirations

One day when I grow up I want to write a truly great snippet of T-SQL.  In my development career, knowing that solutions I’ve written are actually being used has always brought me great comfort and satisfaction.  No developer in their right mind, at least in my humble opinion, wants to knowingly build shelfware.  I’ve been around for a long time so maybe not everyone feels the same any more, but I hope this attitude still carries forward to all levels of developers currently putting their code out there for others to learn from and grow with.

So how about you?  Do you have a vote for the greatest snippet of T-SQL ever?  Post a link as a comment telling us why you think it qualifies, so all who see this might benefit from your particular experience with some T-SQL you think is great.  Affirmations of my vote are also welcome.

In the end, if this blog becomes a repository of some really great T-SQL code patterns, I will derive a lot of satisfaction from that too.

And to those of you out there that share my aspirations, whether or not the favorite tool/snippet you develop ever makes the list of all-time greats, wouldn’t you like to hear from all of those people that found your code wonderful enough to add to their T-SQL toolkit?  So take note you adopters out there!  If you’re using someone else’s code commonly to solve a problem, post them a thank you.  I’m sure they’ll be happy to hear that from you, and who knows?  Perhaps it will even spawn further greatness and sharing among one of the greatest communities the coding world has ever known.  That right, I mean the SQLverse, and we’re all a part of it!

Thanks for listening to my ruminations today on what makes T-SQL code great.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 22 Apr 2015.  All rights reserved.

Stupid T-SQL Tricks – Part 3: A Zodiacal SQL

Today we’ll learn how you can use T-SQL to calculate your zodiac sign from your birthdate.

“Heeeeeere’s Johnny!” – Ed McMahon introducing Johnny Carson for the Tonight Show

If you’re old enough to remember the Tonight Show hosted by Johnny Carson, how can you not remember the iconic Stupid Pet Tricks skits, and the later tributes to those done by David Letterman?  My Stupid T-SQL Tricks are hopefully a bit more useful, but just as amusing as Carson’s Stupid Pet Tricks were!

Everyone knows the twelve signs of the zodiac, and there’s a pretty good chance you also know what your sign is, and that which sign of the zodiac you are depends on your date of birth.

Putting the Zodiac Signs into a Lookup Table

Four our lookup table, we’ll use a Common Table Expression (CTE) with the twelve zodiac signs, but with one repeated.

WITH Zodiac
(
    SignNo, SignName, Symbol, StartDT, EndDT, SignIs
) 
AS
(
    SELECT           1,  'Aries',       NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
    UNION ALL SELECT 2,  'Taurus',      NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
    UNION ALL SELECT 3,  'Gemini',      NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
    UNION ALL SELECT 4,  'Cancer',      NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
    UNION ALL SELECT 5,  'Leo',         NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
    UNION ALL SELECT 6,  'Virgo',       NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
    UNION ALL SELECT 7,  'Libra',       NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
    UNION ALL SELECT 8,  'Scorpio',     NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
    UNION ALL SELECT 9,  'Sagitarius',  NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
    UNION ALL SELECT 11, 'Aquarius',    NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
    UNION ALL SELECT 12, 'Pisces',      NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT *
FROM Zodiac;

The signs are numbered in the order that they’re typically described in the literature, e.g., on the Wiki page for The Zodiac.  Note how Capricorn is duplicated, because the dates for that sign cross the end-of-year boundary.  Here are the displayed results from our table:

SignNo  SignName     Symbol  StartDT     EndDT       SignIs
1       Aries        ♈      1900-03-21  1900-04-20   Ram
2       Taurus       ♉      1900-04-21  1900-05-21   Bull
3       Gemini       ♊      1900-05-22  1900-06-21   Twins
4       Cancer       ♋      1900-06-22  1900-07-22   Crab
5       Leo          ♌      1900-07-23  1900-08-22   Lion
6       Virgo        ♍      1900-08-23  1900-09-23   Maiden
7       Libra        ♎      1900-09-24  1900-10-23   Scales
8       Scorpio      ♏      1900-10-24  1900-11-22   Scorpion
9       Sagitarius   ♐      1900-11-23  1900-12-21   Archer (Centaur)
10      Capricorn    ♑      1900-12-22  1900-12-31   Sea-Goat (Goat)
10      Capricorn    ♑      1900-01-01  1900-01-20   Sea-Goat (Goat)
11      Aquarius     ♒      1900-01-21  1900-02-19   Water-bearer
12      Pisces       ♓       1900-02-20  1900-03-20   Fish

I think it is kind of cool that we could use the NCHAR built-in T-SQL function (introduced in SQL 2005) to return the UNICODE symbol for each zodiac sign!

Suppose we’d now like to calculate the sign for a birthdate.  We’ll use two cases, a Capricorn and a Gemini to illustrate how easy it is using the lookup table above.

DECLARE @BirthDate  DATE = '1999-01-12';

WITH Zodiac
(
    SignNo, SignName, Symbol, StartDT, EndDT, SignIs
) 
AS
(
    SELECT           1,  'Aries',       NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
    UNION ALL SELECT 2,  'Taurus',      NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
    UNION ALL SELECT 3,  'Gemini',      NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
    UNION ALL SELECT 4,  'Cancer',      NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
    UNION ALL SELECT 5,  'Leo',         NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
    UNION ALL SELECT 6,  'Virgo',       NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
    UNION ALL SELECT 7,  'Libra',       NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
    UNION ALL SELECT 8,  'Scorpio',     NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
    UNION ALL SELECT 9,  'Sagitarius',  NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
    UNION ALL SELECT 11, 'Aquarius',    NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
    UNION ALL SELECT 12, 'Pisces',      NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT BirthDate=@BirthDate, a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
FROM Zodiac a
WHERE @BirthDate BETWEEN 
    DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND 
    DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);

Which returns one row for our Capricorn:

BirthDate   SignNo   SignName    SignIs              Symbol
1999-01-12  10       Capricorn   The Sea-Goat (Goat) ♑

You can see how we’ve used the DATEADD and DATEDIFF functions to place our birthdate between the StartDT and EndDT columns within our lookup table.

A Function to Calculate the Sign of the Zodiac for a Birth Date

I like in-line, Table Valued Functions (iTVFs) because I consider them tools.  This little T-SQL query is easy enough to put into an iTVF, so let’s do that now.

CREATE FUNCTION dbo.ZodiacSign
(
    @BirthDate  DATE
) RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH Zodiac
(
    SignNo, SignName, Symbol, StartDT, EndDT, SignIs
) 
AS
(
    SELECT           1,  'Aries',       NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
    UNION ALL SELECT 2,  'Taurus',      NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
    UNION ALL SELECT 3,  'Gemini',      NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
    UNION ALL SELECT 4,  'Cancer',      NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
    UNION ALL SELECT 5,  'Leo',         NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
    UNION ALL SELECT 6,  'Virgo',       NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
    UNION ALL SELECT 7,  'Libra',       NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
    UNION ALL SELECT 8,  'Scorpio',     NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
    UNION ALL SELECT 9,  'Sagitarius',  NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
    UNION ALL SELECT 11, 'Aquarius',    NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
    UNION ALL SELECT 12, 'Pisces',      NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT BirthDate=@BirthDate, a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
FROM Zodiac a
WHERE @BirthDate BETWEEN 
    DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND 
    DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);

With this function, we can now calculate the Zodiac signs for a whole range of birthdates as follows:

WITH BirthDates (BirthDate) AS
(
    SELECT '1998-06-17'             -- A Gemini
    UNION ALL SELECT '1999-01-10'   -- A Capricorn
)
SELECT a.BirthDate, SignNo, SignName, SignIs, Symbol
FROM BirthDates a
CROSS APPLY dbo.ZodiacSign(a.BirthDate) b;

Which gives us these results:

BirthDate   SignNo  SignName    SignIs               Symbol
1998-06-17  3       Gemini      The Twins            ♊
1999-01-10  10      Capricorn   The Sea-Goat (Goat)  ♑

The Zodiac Date Range in which a Birth Date Falls

Because of Capricorn, which splits two different years, were we to want to calculate the date range for the sign on which a particular birth date falls, things get just a little more complicated.  For example, for our two birthdates shown above, we’d like to see a results set that looks like this.

BirthDate   SignNo  SignName    SignIs               SignPeriodStart  SignPeriodEnd  Symbol
1998-06-17  3       Gemini      The Twins            1998-05-22       1998-06-21     ♊
1999-01-10  10      Capricorn   The Sea-Goat (Goat)  1998-12-22       1999-01-20     ♑

Notice how Capricorn’s SignPeriodStart is the year before that particular birthdate.

Our ZodiacSign function can be modified as follows to handle this case.

ALTER FUNCTION dbo.ZodiacSign
(
    @BirthDate  DATE
) RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH Zodiac
(
    SignNo, SignName, Symbol, StartDT, EndDT, SignIs
) 
AS
(
    SELECT           1,  'Aries',       NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
    UNION ALL SELECT 2,  'Taurus',      NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
    UNION ALL SELECT 3,  'Gemini',      NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
    UNION ALL SELECT 4,  'Cancer',      NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
    UNION ALL SELECT 5,  'Leo',         NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
    UNION ALL SELECT 6,  'Virgo',       NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
    UNION ALL SELECT 7,  'Libra',       NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
    UNION ALL SELECT 8,  'Scorpio',     NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
    UNION ALL SELECT 9,  'Sagitarius',  NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
    UNION ALL SELECT 11, 'Aquarius',    NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
    UNION ALL SELECT 12, 'Pisces',      NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
),
    GroupedSigns AS
(
    SELECT SignNo
        ,StartDT    = CASE WHEN SignNo = 10 AND MONTH(@BirthDate) = 1
                        THEN DATEADD(year, -1, MAX(StartDT)) 
                        WHEN SignNo = 10 AND MONTH(@BirthDate) = 12
                        THEN MAX(StartDT)
                        ELSE MIN(StartDT) 
                        END
        ,EndDT      = CASE WHEN SignNo = 10 AND MONTH(@BirthDate) = 1
                        THEN MIN(EndDT) 
                        WHEN SignNo = 10 AND MONTH(@BirthDate) = 12
                        THEN DATEADD(year, 1, MIN(EndDT))
                        ELSE MAX(EndDT) 
                        END    FROM Zodiac
    GROUP BY SignNo
)
SELECT a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
    ,SignPeriodStart    = CAST(DATEADD(year, YEAR(@BirthDate)-1900, b.StartDT) AS DATE)
    ,SignPeriodEnd      = CAST(DATEADD(year, YEAR(@BirthDate)-1900, b.EndDT) AS DATE)
FROM Zodiac a
JOIN GroupedSigns b ON a.SignNo = b.SignNo
WHERE @BirthDate BETWEEN 
    DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND 
    DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);

Because of our split row for Capricorn, we needed to create a grouping that set the dates accordingly and then JOIN this back to our original Zodiac lookup table.  A little more date arithmetic magic, this time using the DATEADD and YEAR built-in functions, will get us to our period dates.

This can be demonstrated by running the following query to call our newly ALTERed function:

WITH BirthDates (BirthDate) AS
(
    SELECT '1998-06-17'             -- A Gemini
    UNION ALL SELECT '1999-01-10'   -- A Capricorn
)
SELECT a.BirthDate, SignNo, SignName, SignIs
    ,SignPeriodStart
    ,SignPeriodEnd
    ,Symbol
FROM BirthDates a
CROSS APPLY dbo.ZodiacSign(a.BirthDate) b;

Which gives us the results we seek!

BirthDate   SignNo  SignName    SignIs               SignPeriodStart  SignPeriodEnd  Symbol
1998-06-17  3       Gemini      The Twins            1998-05-22       1998-06-21     ♊
1999-01-10  10      Capricorn   The Sea-Goat (Goat)  1998-12-22       1999-01-20     ♑

Conclusions

I love to create iTVFs as tools I can always use when some calculation gets hairy!  I have many tools that are specific to their purpose.  I also like to adopt iTVFs of this sort from other sources (a great example of that being DelimitedSplit8K from SQL MVP Jeff Moden).  You should probably consider doing this yourself to improve your skill at both writing iTVFs and using the tools you create to accelerate your own T-SQL coding jobs.

Now if I could just figure out the Yin and the Yang of the Chinese calendar to confirm I’m a Dog that would really be something to chew on.

Yet another stupid, T-SQL trick, right?  Who knows, perhaps one day you’ll even have a use for it!

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 16 Apr 2015.  All rights reserved.

An Even Faster Method of Calculating the Median on a Partitioned Heap

Back in 2013 I wrote an article kindly accepted for publication on the Simple Talk web site called Calculating the Median Value within a Partitioned Set Using T-SQL.  In that article, I was delighted to propose a previously unpublished solution for calculating the median over a partitioned set that did not rely on INDEXing.  In fact, in my tests it was the fastest solution available for that case.

Later on, I was quite pleased to hear that 18 times nominated (as of this writing) SQL MVP Aaron Bertrand published a comparative study of methods for calculating the median called the Best approaches for grouped median, where he kindly included my submission, and confirmed that it was the highest performance solution available for the case of a heap.

Today we’re going to take a second look at the solution I proposed, and offer a couple of interesting possibilities that may be even better for the case of a partitioned heap.

A Heap Example and Some Sample Data to Populate the Table

The following DDL can be used to create a heap, and load it with some partitioned sample data (the ID column is the partition).

CREATE TABLE dbo.MedianValues
(
   ID INT
   ,N INT
);

INSERT INTO dbo.MedianValues
VALUES (1,1),(1,2),(1,3),(1,4),(1,5)
   ,(2,1),(2,2),(2,3),(2,4)
   ,(3,10),(3,2),(3,10),(3,4)
   ,(4,1),(4,5),(4,1),(4,3),(4,3);

From my original article, below are the results you would expect for the median for these four sets of N:

ID    Ordered Sets         Median
1     {1, 2, 3, 4, 5}      3
2     {1, 2, 3, 4}         2.5
3     {2, 4, 10, 10}       7
4     {1, 1, 3, 3, 5}      3

As a reminder, the median is calculated as follows:

  • Order the values for which you want to calculate the median over the partitioning column (ID).
  • If the count of the rows is odd, you take the middle value. So in the example above, you would take the third row for IDs 1 and 4.
  • If the count of the rows is even, you take the average of the middle two rows. In the example above, this means you use the second and third elements in IDs 2 and 3.

My Original Solution to Median for this Case

Aaron commented that my somewhat “elaborate” solution un-pivots the rows of interest using the CROSS APPLY VALUES approach.

WITH Counts AS
(
   SELECT ID, c
   FROM
   (
      SELECT ID, c1=(c+1)/2, c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE 0 END
      FROM
      (
          SELECT ID, c=COUNT(*)
          FROM dbo.MedianValues
          GROUP BY ID
      ) a
   ) a
   CROSS APPLY (VALUES(c1),(c2)) b(c)
)
SELECT ID=a.ID, Median=AVG(0.+b.N)
FROM
(
   SELECT ID, N
      ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
   FROM dbo.MedianValues a
) a
CROSS APPLY
(
   SELECT N
   FROM Counts b
   WHERE a.ID = b.ID AND a.rn = b.c
) b
GROUP BY a.ID;

It is important to note how we convert the integer Ns to a FLOAT within the AVG grouping function, to ensure no truncation occurs.  The results set for this on our sample data appears as follows, confirming the correct calculation of the median for each case.

ID   Median
1    3.000000
2    2.500000
3    7.000000
4    3.000000

Calculating a Median Using Grouping Sets

While I was playing around with GROUPING SETS the other day, I hit upon the strangest query result, which gave me an idea that relates to calculating medians.  Let’s take a look at a sample of what I found.

SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
    ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
        ELSE (COUNT(*)+1)/2 
        END
FROM 
(
    SELECT ID, ID1=ID
    FROM dbo.MedianValues
) a
GROUP BY GROUPING SETS (ID, ID1)
HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
ORDER BY ID, c;

This query produces a results set consisting of all the partitioning IDs with one row for IDs 1 and 4 (where c=3) and two rows for IDs 2 and 3 (where c=2 and 3 respectively).  Take a look at that description and compare it to my description of how to calculate median, and see if you too are not intrigued.

ID   c
1    3
2    2
2    3
3    2
3    3
4    3

Coincidentally, this is exactly what I was trying to achieve by my “elaborate” (according to Aaron) method that used CROSS APPLY VALUES!  One could argue that this is even more elaborate, and I would tend to agree.  So naturally the thought struck me that perhaps I could use this approach to calculate a median.

My initial attempts at JOINing these six rows back to the full set with a ROW_NUMBER() added yielded me no joy however, as the result was slower than my original.

Once again inspiration struck and I had the idea to aggregate the GROUPING SETS so that I only had one row for each ID, after which I could do a range (BETWEEN) JOIN.  So my aggregation now looks like this.

SELECT ID, c1=MIN(c), c2=MAX(c) 
FROM
(
    SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
        ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
            ELSE (COUNT(*)+1)/2 
            END
    FROM 
    (
        SELECT ID, ID1=ID
        FROM dbo.MedianValues
    ) a
    GROUP BY GROUPING SETS (ID, ID1)
    HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
) a
GROUP BY ID;

Which produces a set that looks like this:

ID  c1  c2
1   3   3
2   2   3
3   2   3
4   3   3

Playing around with JOINing this back to the original data set, led me to using a CROSS APPLY instead of a JOIN.  So what I ended up with, which calculates the median quite nicely, is the following query.

WITH RowsOfInterest AS
(
    SELECT a.ID, N
    FROM 
    (
        SELECT ID, c1=MIN(c), c2=MAX(c) -- * -- ID, c
        FROM
        (
            SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
                ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
                    ELSE (COUNT(*)+1)/2 
                    END
            FROM 
            (
                SELECT ID, ID1=ID
                FROM dbo.MedianValues
            ) a
            GROUP BY GROUPING SETS (ID, ID1)
            HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
        ) a
        GROUP BY ID
        HAVING COUNT(*) <= 2
    ) a
    CROSS APPLY
    (
        SELECT ID, N
        FROM 
        (
            SELECT ID, N
                ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
            FROM dbo.MedianValues b
        ) b
        WHERE a.ID = b.ID AND b.rn BETWEEN a.c1 AND a.c2
    ) b
)
SELECT @ID=ID, @Median=AVG(0.+N)
FROM RowsOfInterest a
GROUP BY a.ID;

How’s that for elaborate, Aaron?  Initial timings appeared to be pretty decent, but more importantly what I was seeing was a drastically reduced IO count and CPU usage.  Clearly this was a promising vector of attack, although the truth will only get sorted out using SQL Profiler to average the results of multiple runs (the elapsed timings were very close).

One rather annoying note on this solution.  When I first hit upon it, I could have sworn I had something that was consistently faster than my original solution.  Unfortunately due to a PC reboot (pesky Windows updates!) I lost that one and I was never able to quite achieve the speed I was seeing.  This could be an interesting assignment for those so inclined.

Simplifying the Elaborate GROUPING SETS Approach

After thinking a little more about this, I realized it might be possible to greatly simplify this aggregation to produce just the four rows I needed.  Silly of me really not to think of it in the first place, but then my brain may not be wired that way.

So here’s another method to get the median using a simpler initial aggregation to get ID, c1 and c2 just like the query in the prior section.

WITH RowsOfInterest AS
(
    SELECT ID, c=COUNT(*)
    FROM dbo.MedianValues
    GROUP BY ID
)
SELECT ID=ID, Median=AVG(0.+N)
FROM
(
    SELECT a.ID, N
    FROM 
    (
        SELECT ID, c1=(c+1)/2, c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END
        FROM RowsOfInterest
    ) a
    JOIN 
    (
        SELECT ID, N
            ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
        FROM dbo.MedianValues b
    ) b
    ON a.ID = b.ID AND b.rn BETWEEN a.c1 AND a.c2
) a
GROUP BY a.ID;

This time I used a JOIN instead of a CROSS APPLY because initial timings seemed to indicate a slight advantage.  This method also exhibited a serious reduction in IO counts and CPU usage over my original solution.

You’ll note that this method is quite similar to my original approach except that it doesn’t CROSS APPLY VALUES and doesn’t do the pre-aggregation in the second solution either.  The main difference from the original solution is the method of JOINing, which is to use the BETWEEN range selector.

The One Million Row Test Harness

In my original Simple Talk article I developed a one million row test harness that can be used to test a median query, and I have reproduced that below exactly as before.

TRUNCATE TABLE dbo.MedianValues;

DECLARE @N INT = 10000; -- 10,000 = ~1,000,000 rows

WITH Tally (n) AS
(
   SELECT TOP (@N) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
   FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO dbo.MedianValues
SELECT a.n, ABS(CHECKSUM(NEWID()))%@N
-- Always create exactly 100 IDs from the Tally table
FROM (SELECT TOP 100 n FROM Tally) a
-- Generate either an odd or even number of rows within IDs
CROSS APPLY
(
   SELECT TOP (@N-a.n%2) n
   FROM Tally
)b;

This actually generates 999,950 rows into our sample data table.

The SQL Profiler Test

My initial timings using SET STATISTICS IO, TIME ON were producing pretty fine numbers for IO and CPU, but pretty close scores for elapsed times.  So the only way to really be sure which of these three queries is the fastest, it is necessary to run each batch multiple times (I chose that to be ten) and then average the results.  I won’t include the Profiler test harness here because this is just a blog, and I’m sure if you’ve read my previous blog on the One Million Row Test Harness, you’re capable of doing this yourself given everything we’ve given you above.

Here are the raw and averaged SQL Profiler results for the three queries, with the last row being the average of the ten prior.

Solution #1: Original method Solution #2: Using Grouping Sets Solution #3: Simplified aggregation
CPU Reads Duration CPU Reads Duration CPU Reads Duration
4,943 2,006,286 1,585 1,622 5,843 1,663 2,840 5,674 1,147
4,759 2,006,286 1,638 1,826 5,843 1,663 2,979 5,674 1,106
4,820 2,006,286 1,575 1,715 5,843 1,662 2,682 5,674 1,326
4,836 2,006,286 1,654 1,732 5,843 1,672 2,932 5,674 1,353
4,822 2,006,286 1,608 1,747 5,843 1,620 2,949 5,674 1,190
4,787 2,006,286 1,598 1,701 5,843 1,659 2,729 5,674 1,161
4,788 2,006,286 1,669 1,716 5,843 1,638 2,872 5,674 1,175
5,104 2,006,286 1,574 1,716 5,843 1,538 3,026 5,674 1,244
4,698 2,006,286 1,569 1,700 5,843 1,665 2,840 5,674 1,296
4,632 2,006,286 1,571 1,747 5,843 1,663 2,792 5,674 1,195
4,819 2,006,286 1,604 1,722 5,843 1,644 2,864 5,674 1,219

It is also instructive to look at a closer comparison of the averages.

  Averages
Solution CPU Reads Duration
#1: Original method 4,819 2,006,286 1,604
#2: Using Grouping Sets 1,722 5,843 1,644
#3: Simplified aggregation 2,864 5,674 1,219
Best Solution #2 #3 #3
New #2 %O/(U) #1 -64% -100% 3%
New #3 %O/(U) #1 -41% -100% -24%

You can see how solution #3 was the best performer in terms of IO (Reads) and Duration (Elapsed MS).  Solution #2 had the edge in CPU and was nearly tied in duration with solution #1.  Overall my take on these results is that they are both better than my original solution, but of the two #3 is probably the best.

The percentage rows indicate that solution #3 was 41% better than (improved over) solution #1 in CPU, while solution #2 was 64% better than solution #1.  Both new solutions pretty much ripped solution #1 apart in terms of IO.  And probably the most important statistics shows that while solutions #1 and #2 were probably indeterminately different, solution #3 was about 24% faster than both.

The Final Word

The two new solutions for calculating the median within partitioned sets on a heap (no indexing on the table) seem to have delivered a slight edge over the previous benchmark case.  It is always gratifying to do something just a little bit better, even when you’re kicking yourself for not thinking of doing it that way in the first place.

Performance testing was performed on:

  • SQL Server 2012 64 bit
  • Windows 7 64 bit with 12GB of memory installed
  • Intel Core i5-3230M CPU @ 2.60 GHz

I did not attempt to test these solutions using either of the INDEXing schemes proposed by Aaron in his comparative article, because my gut tells me that they’re probably no better than the ones that won out in those particular scenarios.  Since this is just a blog, we’ll have to settle today for seeing just this one case.

And to Aaron, if you happen to be reading this blog at some time, my little digs at you calling my solution “elaborate” were all in good fun.  Actually I considered it quite a badge of honor that you picked my solution in your comparative study.

Thanks folks as always for listening.  Hope the solutions you find here and in my other blogs are helpful to you some day.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 08 Apr 2015.  All rights reserved.

An Easter SQL

Since Easter is nearly upon us, I got to thinking once again about how challenging it is to calculate the day that Easter Sunday falls on.  Since Calendar tables and their step-children Holiday tables are so interesting I thought I might write a sequel to my earlier blog link in this sentence.

There are at least three holidays (not necessarily business holidays) that are related to the calculation of Easter Sunday.  These are:

  • Good Friday is the Christian holiday that is the Friday just prior to Easter Sunday.
  • Ash Wednesday is another Christian holiday (the first day of Lent) that falls precisely 46 days prior to Easter Sunday.
  • Mardi Gras (a.k.a. Fat Tuesday) is a lesser known holiday celebrated by the more robust partyers among us.  I have a special affinity for it as I lived in New Orleans for a few years.

We will show you how to calculate them all and use that method to populate a Holidays lookup table so hopefully you’ll never have to do it again.  Happy Easter!

A Calendar and a Holiday Table

Discussions of Calendar tables in T-SQL are ubiquitous (here’s one by Todd Fifield), differing only in the columns actually stored in them.   We’ll draw upon the one we created in our blog to show you how easy it is here.

-- Create a Calendar table
SELECT [Date]=ISNULL([Date], 0), [Year]=ISNULL([Year], 0), [YrNN]=ISNULL([YrNN], 0)
    ,[YYYYMM]=ISNULL([YYYYMM], 0), [BuddhaYr], [Month], [Day]=ISNULL([Day], 0)
    ,[WkDNo], [WkDName], [WkDName2], [WkDName3], [JulDay], [JulWk]
    ,[WkNo], [Qtr], [Last], [LdOfMo], [LDtOfMo]
INTO dbo.Calendar 
-- This function can be obtained here: GenerateCalendar FUNCTION
FROM dbo.GenerateCalendar('1990-01-01', 65536);

-- Change column types to be NOT NULL so we can index them
--
-- Note: not needed thanks to the neat trick with ISNULL above
--ALTER TABLE dbo.Calendar ALTER COLUMN [Date] DATETIME NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Year] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Month] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [YYYYMM] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Day] INT NOT NULL;
GO

-- Build some representative INDEXes
ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_pk PRIMARY KEY([Date]);

ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_ix1 UNIQUE NONCLUSTERED([Year], [Month], [Day]);

ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_ix2 UNIQUE NONCLUSTERED([YYYYMM], [Day]);

SELECT TOP 10 [Date], [YYYYMM], [WkDName2], [WkNo], [Last]
FROM dbo.Calendar;

-- Now create a Holidays table, from the same blog
CREATE TABLE dbo.Holidays
(
    FromDate    DATETIME PRIMARY KEY
    ,ToDate     DATETIME
    ,Holiday    VARCHAR(100)
);

We’ve created a Holidays table that we’ll be using to further our sequel to Easter.  There are also a few other things in the above Data Definition Language (DDL) that are worthy of mention.

  • The GenerateCalendar FUNCTION is an in-line, Table Valued Function (iTVF) that appears in my earlier blog (and also in the resources file attached at the end of this article), which can be used in lieu of a Calendar table should you prefer to do so.  Once again, my thanks go out to SQL MVP Jeff Moden for his help optimizing the performance of the calculations done in that FUNCTION.
  • We’ve used a neat trick with ISNULL to set the NULL attribute of the column in the table we selected INTO (when Calendar gets CREATEd) to NOT NULL, so that we don’t need to run the subsequent ALTER statements on those columns.  I included those ALTER statements for your information.  Thanks again to Jeff Moden for showing me that trick (with honorable mention to SQL MCM Wayne Sheffield (MCM=Microsoft Certified Master) who came along just a few days later with the same information).

Since Easter Sunday is so diabolically difficult to calculate, many reputable sources have created look-up tables such as this one.  Drawing upon that source, we can create the dates for Easter Sunday from 1990 through 2020 in our Holidays table using the following script.

WITH EasterDates (d) AS
(
    SELECT '1990-04-15' UNION ALL SELECT '1991-03-31' UNION ALL SELECT '1992-04-19'
    UNION ALL SELECT '1993-04-11' UNION ALL SELECT '1994-04-03' 
    UNION ALL SELECT '1995-04-16' UNION ALL SELECT '1996-04-07' 
    UNION ALL SELECT '1997-03-30' UNION ALL SELECT '1998-04-12'
    UNION ALL SELECT '1999-04-04' UNION ALL SELECT '2000-04-23' 
    UNION ALL SELECT '2001-04-15' UNION ALL SELECT '2002-03-31' 
    UNION ALL SELECT '2003-04-20' UNION ALL SELECT '2004-04-11'
    UNION ALL SELECT '2005-03-27' UNION ALL SELECT '2006-04-16' 
    UNION ALL SELECT '2007-04-08' UNION ALL SELECT '2008-03-23' 
    UNION ALL SELECT '2009-04-12' UNION ALL SELECT '2010-04-04'
    UNION ALL SELECT '2011-04-24' UNION ALL SELECT '2012-04-08' 
    UNION ALL SELECT '2013-03-31' UNION ALL SELECT '2014-04-20' 
    UNION ALL SELECT '2015-04-05' UNION ALL SELECT '2016-03-27'
    UNION ALL SELECT '2017-04-16' UNION ALL SELECT '2018-04-01' 
    UNION ALL SELECT '2019-04-21' UNION ALL SELECT '2020-04-12'
)
INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate=d, ToDate=d, Holiday='Easter Sunday ' + CAST(YEAR(d) AS CHAR(4))
FROM EasterDates;

Before we add Good Friday, Ash Wednesday and Fat Tuesday to our Holiday table, we’d like to check that the introductory description at least results in the correct day of the week on which the holiday falls.  We can use our Calendar table for this.

SELECT FromDate = a.FromDate+b.[days]
    ,ToDate     = a.FromDate+b.[days]
    ,b.holiday
    ,c.WkDName
FROM dbo.Holidays a
CROSS APPLY
(
    VALUES(-2, 'Good Friday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-46, 'Ash Wednesday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(YEAR(FromDate) AS CHAR(4)))
) b ([days], holiday)
JOIN dbo.Calendar c ON c.[Date] = a.FromDate+b.[days] 
WHERE a.Holiday LIKE 'Easter Sunday%';

This produces results like this (only the first two years are shown):

FromDate                ToDate                  holiday                        WkDName
1990-04-13 00:00:00.000 1990-04-13 00:00:00.000 Good Friday 1990               Friday   
1990-02-28 00:00:00.000 1990-02-28 00:00:00.000 Ash Wednesday 1990             Wednesday
1990-02-27 00:00:00.000 1990-02-27 00:00:00.000 Mardi Gras (Fat Tuesday) 1990  Tuesday  
1991-03-29 00:00:00.000 1991-03-29 00:00:00.000 Good Friday 1991               Friday   
1991-02-13 00:00:00.000 1991-02-13 00:00:00.000 Ash Wednesday 1991             Wednesday
1991-02-12 00:00:00.000 1991-02-12 00:00:00.000 Mardi Gras (Fat Tuesday) 1991  Tuesday

So now we can be confident to insert those holidays into our Holidays table.

INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate = a.FromDate+b.[days]
    ,ToDate     = a.FromDate+b.[days]
    ,b.holiday
FROM dbo.Holidays a
CROSS APPLY
(
    VALUES(-2, 'Good Friday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-46, 'Ash Wednesday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(YEAR(FromDate) AS CHAR(4)))
) b ([days], holiday)
WHERE a.Holiday LIKE 'Easter Sunday%';

A couple of notes on the data types used in our Calendar and Holidays tables:

  • The Date column of the Calendar table was chosen to be DATETIME to retain compatibility with SQL 2005 (corresponding to the GenerateCalendar FUNCTION).  You could just as easily use a DATE datatype if you’re running SQL 2008 or later.
  • The FromDate/ToDate data types are also both DATETIME.  Here it could be considered important to keep the DATETIME data type to record cases where your business offers half-day holidays.  For example, I’ve seen this for Christmas Eve and New Year’s Eve.
  • You can also define a different FromDate than ToDate in the Holidays table to indicate a holiday that spans several days, like the Thailand Songkran holiday example from my blog.
“Since SQL is a database language, we prefer to do look ups and not calculations.” 
    –- Joe Celko (prolific author of books on things SQL)

There you go Joe, we’ve honored your preference!  Now we can do lookups of not just Easter Sunday, but also Good Friday, Ash Wednesday and Mardi Gras, the latter potentially being essential to planning your next trip to New Orleans.

The Calculation of Easter Sunday

A somewhat accurate description of the calculation of Easter Sunday is:

“The Sunday following the full Moon which falls on or after the equinox (the Paschal full moon) will give the lawful Easter.”
    –- From the Wikipedia link on Easter Sunday above

I say “somewhat accurate” because there’s all sorts of ifs, buts and caveats in that description, which you can read about in the Wikipedia link, however it will suffice for the time being.

The formal description of the calculation of Easter is known as Computus (no relation to Locutus of Borg), and is provided by Wikipedia at the link shown.  We shall observe this calculation in what follows.  While the formulas given are indicated to be valid for the years 1900 to 2199, since our Calendar table has been populated with years 1990-2169 we’ll limit our calculations of Easter Sunday to just those the years.  Years 1990 through 2020 will be checked against the lookup table we’ve already provided.

Of course, this calculation appears often enough in the literature and implemented in T-SQL (for example: here and here), often using methods I’d prefer to avoid due to performance.  Since our personal mantra is: “No Loops.  No Cursors.  No RBAR. Hoo-uh!” we shall endeavor to live up to that mantra today.  Think about that term “RBAR” and if you’ve never heard of it before, you’ll find it mentioned in the Conclusion to this article.

We’ll start by attempting to replicate the tabular example for calculating the Paschal full moon (PFMD) within the Gregorian calendar shown by Computus on that Wikipedia page.
 photo Gregorian Calendar for PFMD_zpsycqyrika.png
Note that the algorithm described above is a step-by-step or procedural algorithm. Here is where many SQL implementations fail because they don’t try to convert it to one that is set-based. To do this, you must think of each step in the procedural algorithm as a query, and as you proceed from one step to the next, each new step is another outer layer to your set-based query.

This is easier shown that described, so here is that query for 2013-2032.

SELECT [Year], [Y MOD 19], Addendum
    ,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
    ,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
    ,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
FROM
(
    SELECT [Year], [Y MOD 19], Addendum
        ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
    FROM
    (
        SELECT [Year]
            ,[Y MOD 19] = [Year]%19
            ,Addendum   = CASE [Year]%19 
                            WHEN 5 THEN 29 
                            WHEN 16 THEN 29 
                            WHEN 8 THEN 30 
                            ELSE 0 
                            END 
        FROM dbo.Calendar
        WHERE [Month] = 1 AND [Day] = 1
    ) a
) a
WHERE [Year] BETWEEN 2013 AND 2032;

Note how the Addendum (CASE) implements the first two IFs (second and third rows) of the procedural algorithm. The results from this query for just the first 10 years are:

Year  Y MOD 19  Addendum  PFMD  Month  MonthNo
2013  18        0         27    Mar    3
2014  0         0         14    Apr    4
2015  1         0         3     Apr    4
2016  2         0         23    Mar    3
2017  3         0         11    Apr    4
2018  4         0         31    Mar    3
2019  5         29        18    Apr    4
2020  6         0         8     Apr    4
2021  7         0         28    Mar    3
2022  8         30        16    Apr    4

If we take PFMD and add the first letter of the month column that follows (for example for 2013: 27M) we see exact correspondence with the row from the Computus table (graphic) labeled the “Paschal full moon date,” including for the years not shown in the results set.  So it appears that we have rather simply converted the formula for calculating PFMD from Computus to a SQL script without too much effort.

Now we’ll use our Calendar table yet again, along with a little additional magic:

  • Put the pieces of the Paschal full moon date back together into a date using a formula that includes a couple of calls to the T-SQL built-in function DATEADD.
  • Do a CROSS APPLY (explained in Part 1 and Part 2 by SQL MVP Paul White) to pick up the Sunday after the Paschal full moon, which is Easter Sunday.
WITH CalculatePaschalFullMoon AS
(
    -- Original query to calculate the components of the 
    -- Paschal Full Moon date
    SELECT a.[Year], [Y MOD 19], Addendum
        ,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
        ,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
        ,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
        -- Put the date/time pieces together to get a real DATETIME for PFMD
        ,[PFMD2]    =
            DATEADD(day, CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END-1 -- PFMD
                ,DATEADD(month, CASE WHEN PFMD > 31 THEN 4 ELSE 3 END-1   -- [MonthNo]
                    , a.[Date]))
    FROM
    (
        SELECT [Year], [Y MOD 19], Addendum, [Date]
            ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
        FROM
        (
            SELECT [Year]
                ,[Y MOD 19] = [Year]%19
                ,Addendum   = CASE [Year]%19 
                                WHEN 5 THEN 29 
                                WHEN 16 THEN 29 
                                WHEN 8 THEN 30 
                                ELSE 0 
                                END
                ,[Date] 
            FROM dbo.Calendar
            WHERE [Month] = 1 AND [Day] = 1
        ) a
    ) a
)
SELECT a.[Year]
    ,PaschalFullMoon=CAST(a.PFMD2 AS DATE)
    ,EasterSunday
FROM CalculatePaschalFullMoon a
-- Easter Sunday follows the Paschal Full Moon date
-- so pick that up from the Calendar table
CROSS APPLY
(
    SELECT TOP 1 EasterSunday=CAST([Date] AS DATE)
    FROM dbo.Calendar b
    WHERE b.[Date] > a.PFMD2 AND b.WkDName3 = 'Sun'
    ORDER BY [Date]
) b
WHERE [Year] BETWEEN 2013 AND 2032;

 

Now for a quiz on the above code where the lines are bold:

Why do we subtract 1 from each of the CASE statements that appear as the second argument to DATEADD?

We’ll give you the answer to this in the Conclusion section of this article, so stay tuned…

Here are the first ten years of the results set from the last T-SQL query shown above:

Year  PaschalFullMoon  EasterSunday
2013  2013-03-27       2013-03-31
2014  2014-04-14       2014-04-20
2015  2015-04-03       2015-04-05
2016  2016-03-23       2016-03-27
2017  2017-04-11       2017-04-16
2018  2018-03-31       2018-04-01
2019  2019-04-18       2019-04-21
2020  2020-04-08       2020-04-12
2021  2021-03-28       2021-04-04
2022  2022-04-16       2022-04-17

Our Computus Wikipedia page also has the inset table (at right) presenting the Western calculation of Easter Sunday, which corresponds quite nicely to our calculation of Easter Sunday.

That really wasn’t so hard was it? Now comes the packaging into a tool.

Since we like to keep things clean, we’re going to make this wonderful bit of SQL code into an in-line, table-valued function that we can call anytime we need to calculate Easter Sunday for a specific year.

We’ll then continue on to use our tool to calculate the other holidays by using the known number of days offset, just like we did before with our Easter Sunday lookup table, and insert that result into our Holidays table for years that are missing.

 photo Easter Sundays_zpsrruxck4b.png
CREATE FUNCTION CalculateEasterSunday
(
    @Year       INT
)
-- Calculate Easter Sunday for any given year
RETURNS TABLE WITH SCHEMABINDING
AS RETURN

WITH CalculatePaschalFullMoon AS
(
    -- Original query to calculate the components of the 
    -- Paschal Full Moon date
    SELECT a.[Year] --, [Y MOD 19], Addendum
        --,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
        --,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
        --,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
        -- Put the date/time pieces together to get a real 
        -- DATETIME for PFMD
        ,[PFMD2]    =
            DATEADD(day, CASE WHEN PFMD > 31 
                                THEN PFMD-31 
                                ELSE PFMD 
                                END-1           -- PFMD
                ,DATEADD(month, CASE 
                                WHEN PFMD > 31 
                                THEN 4 
                                ELSE 3 
                                END-1           -- [MonthNo]
                    , a.[Date]))
    FROM
    (
        SELECT [Year], [Y MOD 19], Addendum, [Date]
            ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
        FROM
        (
            SELECT [Year]
                ,[Y MOD 19] = [Year]%19
                ,Addendum   = CASE [Year]%19 
                                WHEN 5 THEN 29 
                                WHEN 16 THEN 29 
                                WHEN 8 THEN 30 
                                ELSE 0 
                                END
                ,[Date] 
            FROM dbo.Calendar
            WHERE [Month] = 1 AND [Day] = 1 AND [Year] = @Year
        ) a
    ) a
)
SELECT a.[Year]
    ,PaschalFullMoon=CAST(a.PFMD2 AS DATE)
    ,EasterSunday
FROM CalculatePaschalFullMoon a
-- Easter Sunday follows the Paschal Full Moon date
-- so pick that up from the Calendar table
CROSS APPLY
(
    SELECT TOP 1 EasterSunday=CAST([Date] AS DATE)
    FROM dbo.Calendar b
    WHERE b.[Date] > a.PFMD2 AND b.WkDName3 = 'Sun'
    ORDER BY [Date]
) b;

You can see where I commented out a few lines of code in the FUNCTION that represent intermediate results that our FUNCTION doesn’t really need to retain. Notice also how the filter on the @Year parameter was moved up into the innermost SELECT from the Calendar table, because when using the FUNCTION you’ll be selecting only one row at a time at that point. Calling the iTVF using CROSS APPLY in-lines the code to make this a non-RBAR query.

Now we can use our FUNCTION to check that the dates we’ve prior stored for Easter Sunday in our Holidays table are correct.

SELECT HolidayFromDate  = CAST(a.FromDate AS DATE)
    ,b.EasterSunday
    ,[Check]            = CASE 
                            WHEN a.FromDate = b.EasterSunday 
                            THEN 'MATCHED!' 
                            ELSE 'WRONG!' 
                            END
FROM dbo.Holidays a
CROSS APPLY dbo.CalculateEasterSunday(YEAR(a.FromDate)) b
WHERE Holiday LIKE 'Easter Sunday%';

The first ten years of our results show us all MATCHED! (and so do the others that are not shown).

HolidayFromDate  EasterSunday   Check
1990-04-15       1990-04-15     MATCHED!
1991-03-31       1991-03-31     MATCHED!
1992-04-19       1992-04-19     MATCHED!
1993-04-11       1993-04-11     MATCHED!
1994-04-03       1994-04-03     MATCHED!
1995-04-16       1995-04-16     MATCHED!
1996-04-07       1996-04-07     MATCHED!
1997-03-30       1997-03-30     MATCHED!
1998-04-12       1998-04-12     MATCHED!
1999-04-04       1999-04-04     MATCHED!

So we can be reasonably confident of our Easter Sunday calculation for the years indicated by Computus (1900-2199)!

Using our FUNCTION to Create Lookup Values in our Holidays Table

Now to keep Joe Celko happy (i.e., so we can do lookups instead of calculations), let’s insert the rest of our holidays, including Easter Sunday, Good Friday, Ash Wednesday and Mardi Gras, into our Holidays table, using all the years available to us in our Calendar table.

INSERT dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate, ToDate, holiday
FROM
(
    SELECT FromDate = DATEADD(day, c.days, b.EasterSunday)
        ,ToDate     = DATEADD(day, c.days, b.EasterSunday)
        ,c.holiday
    FROM dbo.Calendar a
    CROSS APPLY dbo.CalculateEasterSunday(a.[Year]) b
    CROSS APPLY
    (
        VALUES (0, 'Easter Sunday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-2, 'Good Friday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-46, 'Ash Wednesday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(a.[Year] AS CHAR(4)))
    ) c ([days], holiday)
    WHERE a.[Month] = 1 AND a.[Day] = 1
) a
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.Holidays x
    WHERE x.FromDate = a.FromDate AND x.ToDate = a.ToDate
);

This added 596 holidays to our table, and I’d guess we’re pretty well set for our lifetimes!

Notice how, in the above query we had to use DATEADD instead of simple arithmetic as we did before when using the DATETIME data type.  This is because simple arithmetic doesn’t work with the DATE data type returned by our CalculateEasterSunday function.

Conclusion and What we Learned

While Joe likes lookups, I find calculations in T-SQL to be quite a bit of fun.  In this article (as in my blog on Calendar tables), we got to do a little of both.

If you went ahead and read my blog on Calendar tables in conjunction with this article, I am pleased because you probably learned a lot about how useful they can be, particularly in conjunction with a Holidays table. In this article (in the FUNCTION) we used our Calendar table to find the next Sunday occurring after our calculated date of the Paschal full moon.

Besides learning about the trivialities of the relationship between Easter, Good Friday, Ash Wednesday, and planning for your next visit to New Orleans for Mardi Gras, we’d like to think that you may have learned a few other things as well:

  • You can use simple arithmetic on a DATETIME data type to add or subtract days, however you must use DATEADD to do the same thing with a DATE data type.
  • You now have efficient tools in your toolbox to generate a calendar table (GenerateCalendar) and CalculateEasterSunday, whenever you have a need to use them.  The latter function is also included in the resources file attached at the end of this article.
  • You learned how to CROSS APPLY VALUES to do an UNPIVOT operation, even though I didn’t specifically mention that was what we were doing when we used it (where we created rows with dates for Good Friday, Ash Wednesday and Mardi Gras).  If you haven’t seen that method before you may want to look into that link as well. That is something that will only work in SQL 2008 (or later), but there is an equivalent construct that will work in SQL 2005.
  • We learned how to use ISNULL on a SELECT/INTO operation, to set the attribute of the column created in the target table to NOT NULL.
  • You learned how to think your way through a rather complex, procedural calculation and convert it into set-based code, by wrapping each successive step in another layer of query.
  • You learned my personal SQL mantra, in case you hadn’t seen that before.

In case you’ve been wondering what RBAR stands for, it means “Row By Agonizing Row” and it is a term coined by Jeff Moden which I fortuitously picked up on before I got to know him.

Oh yes, then there’s the quiz I mentioned. There are two cases:

  • If we’re starting with 01 Jan of a year as our date, and the Paschal full moon month is three (meaning it should fall in Mar), if we add three months to 01 Jan it puts us in Apr. So we subtract one month to get us back to where we need to be.
  • If we’re starting with 01 Mar of a year as our date, and the Paschal full moon day within that month is (say) 14 (meaning it should fall on 14 Mar), if we add 14 days to 01 Mar it puts us in 15 Mar. So we subtract one day to get us back to where we need to be.

After all this, I still have no clue what a Paschal full moon is but what the heck, at least I know I can calculate when it is!  That’s all for today folks.  I hope you found this article useful and just a bit of fun.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 01 Apr 2015.  All rights reserved.

Resources for An Easter SQL.doc

The Resources above are stored in a Microsoft Word (TM) document. There are two sections:

-- Create GenerateCalendar, Calendar and Holiday Tables and Populate.sql
... page break ...
-- Create CalculateEasterSunday and Add Holidays.sql