T-SQL
Excel in T-SQL Part 2 – The Normal Distribution (NORM.DIST) Density Functions
Today’s blog will be the second in a multi-part series on replicating Excel functions in T-SQL, continuing with Excel’s NORM.DIST built-in function, thus enshrining my geekdom in the SQLverse forever.
Today’s solutions will once again focus on creating T-SQL in-line, Table Valued Functions (iTVFs), introduced in SQL Server 2005, that apply my solution techniques to the general case of calculating the value of the Normal Distribution’s Probability Density Function (PDF) and Cumulative Distribution Function (CDF), for any mean and standard deviation you specify. I will not be doing any performance testing of today’s iTVFs; however I have done my best to utilize T-SQL performance best practices in their construction. I will also provide you some links to alternate solutions I have seen. Since all are approximation algorithms (including mine), the difference to our readers is probably going to be in which produces the lowest error results. Only you can be the best judge of which that is.
The Normal Distribution
As I’ve said before I’m no statistician, but I have seen the Normal Distribution and most likely you’ve seen it too. The normal distribution has many interesting properties, starting with the:
- Mean – The average of all sample points representing the population, which appears as the center line of a normal distribution (probability density) function’s curve.
- Standard Deviation – A measure of the elongation of the tails of the distribution.
Graphically I’m sure you’ve seen the normal distribution before, with the blue line (below) being what you normally (no pun intended) see. This is the normal distribution’s probability density function (PDF). The red line is also of interest, and this is called the cumulative distribution function (CDF).
Notice how the blue line (PDF) is bilaterally symmetrical about the mean value, which in the above case is ten. This simply means that there’s an equal portion of the curve on both the left and the right of the mean. Another property of the PDF for any distribution is that the area under the curve is exactly equal to one, and this applies regardless of what the mean and standard deviation are. In fact, it also applies to other statistical distributions, so long as it is the PDF you’re talking about.
The CDF represents the area under the PDF at any particular point along the PDF line. For example, if we draw two intersecting lines as in the example below, with the first (vertical) line through the mean and the second line intersecting the first and also the CDF, we see:
- The clear bilateral symmetry of the PDF line, where half of the area appears to the left of the bisecting line and the other half appears to the right.
- The horizontal line intersecting the right vertical access at 0.5, indicating that the value of the CDF (the area under the PDF) at the mean is precisely 0.5, meaning that the total area of the PDF curve is 1.0, which is also the point at which the red curve intersects with the right axis.
This same relationship holds for the PDF/CDF of any other distribution.
Computing the Normal Distribution’s Probability Density Function
If you go to the Wiki page for the Normal Distribution linked above (or many other sources), you will see that the PDF for the normal distribution can be represented mathematically as:
Where µ = the arithmetic mean and σ^{2}= the standard deviation. The variance (σ) is the square root of the standard deviation.
Also, SQL has the PI and EXP built-in functions to support this equation.
SELECT PI(), EXP(1.); -- Results: 3.14159265358979 2.71828182845905
In SQL terms, this is relatively straightforward to compute, and we’ll show that in a minute.
Let’s look now at the Excel NORM.DIST function, which takes four arguments:
- x is the point where we wish to calculate the value of the normal distribution, which could be any point along the x-axis of the graphics above.
- µ is the mean for the normal distribution of interest.
- σ^{2} is the standard deviation for the normal distribution of interest.
- The last (fourth) argument to NORM.DIST is a logical (TRUE/FALSE) value which specifies whether you want to calculate the value for the PDF (if FALSE) or the CDF (if TRUE).
There is also a special case of the normal distribution, known as the Standard Normal Distribution, where the mean is zero and the standard deviation is one. For this special case, the PDF function is somewhat simpler and can be written as:
As I am also no mathematician, I am thankful that all of the formulas so far have simply been copied out of the relevant sources.
Since I am a T-SQL guy, I must now focus on my specialty and present a function that will calculate the probability density function (PDF) for both the standard and general normal distributions.
CREATE FUNCTION dbo.NORMAL_PDF -- Normal Distribution - Returns two bits of information: -- NPDF(X) - The Normal Probability Density Function's value at given mean and std deviation -- SNPDF(X) - The Standard Normal Probability Density Function's value (mean=0, std dev=1) ( @X FLOAT -- Point at which function is to be evaluated ,@Mean FLOAT -- Mean of the Normal Distribution ,@StdDev FLOAT -- Standard Deviation of the Normal Distribution ) RETURNS TABLE WITH SCHEMABINDING RETURN SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev -- Normal Probability Density Function for Mean and Standard Deviation ,[NPDF(X)] = EXP(-0.5*(@X-@Mean)*(@X-@Mean) / (@StdDev*@StdDev))/(SQRT(2.*PI())*@StdDev) -- Standard Normal Probability Density function for Mean=0 and StdDev=1 ,[SNPDF(X)] = EXP(-0.5*@X*@X)/SQRT(2.*PI());
Let’s now construct an Excel spreadsheet to calculate the PDF for the normal distribution using a couple of different mean/standard deviation combinations, so we can check the results from our function.
The highlighted cell’s formula is in the formula text box above, showing you how to calculate the PDF for the normal distribution where mean is two and standard deviation is 0.5 (from columns B and C).
We’ve left a couple of columns open so we can run the following T-SQL script that uses our iTVF to calculate the PDF values at the various Xs. The differences columns are currently showing the same value as appears in the Excel column with reversal of sign.
WITH SampleData (X, Mean, StdDev) AS ( SELECT -1,2,0.5 UNION ALL SELECT -0.5,2,0.5 UNION ALL SELECT 0,2,0.5 UNION ALL SELECT 0.5,2,0.5 UNION ALL SELECT 1,2,0.5 UNION ALL SELECT 1.5,2,0.5 UNION ALL SELECT 2,2,0.5 UNION ALL SELECT 2.5,2,0.5 UNION ALL SELECT 3,2,0.5 UNION ALL SELECT 3.5,2,0.5 UNION ALL SELECT 4,2,0.5 UNION ALL SELECT 2.33,10,5 UNION ALL SELECT 4.22,10,5 UNION ALL SELECT 5.1,10,5 UNION ALL SELECT 8.2,10,5 UNION ALL SELECT 10,10,5 UNION ALL SELECT 11.5,10,5 UNION ALL SELECT 12.6,10,5 UNION ALL SELECT 16.8,10,5 UNION ALL SELECT 22.5,10,5 ) SELECT a.X, a.Mean, a.StdDev, b.[NPDF(X)], b.[SNPDF(X)] FROM SampleData a CROSS APPLY dbo.NORMAL_PDF(X, Mean, StdDev) b;
When we run the above script, then copy/paste the output from the SSMS Results pane into our spreadsheet, it now looks like this.
The cell formula at the top now shows you how to calculate the PDF for the standard normal distribution in Excel (second and third arguments are different from the prior Excel graphic).
The red bordered columns now show very miniscule differences between the values computed by T-SQL for the normal distributions’ PDF and SNPDF. These calculations are probably accurate enough to say that we’ve replicated the Excel NORM.DIST function’s results when its fourth argument is FALSE.
Computing the Normal Distribution’s Cumulative Distribution Function (CDF)
We can easily add a few columns to our spreadsheet to show you how easy it is in Excel to calculate the CDF for our two distributions, including some placeholders for when we get to the point of having some results in T-SQL.
Note that the highlighted cell’s formula is shown in Excel’s formula entry text box.
While the PDF for the normal distribution can be represented in what is known as “closed form” (see the formulas above), the CDF cannot be represented in closed form. Instead we need to represent the value of the CDF at a point (X) as a definite integral, which is essentially just calculating the area under the PDF.
At the sight of the integral symbol, I’m sure that some of my reader’s eyes have glazed over. Once again though, this is a formula not of my own making, but one which can be found in many sources. My hope is that I don’t lose you here, because if you read on you just might find some really interesting stuff!
Let us consider a simple case, for example where X = µ (the mean). In that case, from our graphical look at the two PDF/CDF curves, it implies that the value of the CDF should be 0.5. If we look at cells J9 and J18 in the last Excel spreadsheet shown, we see that indeed Excel’s NORM.DIST function computes exactly that value for the CDF.
To replicate this in T-SQL, we’ve got just a bit of a problem though. Unfortunately, T-SQL does not have an “INTEGRATE” built-in function!
The Fundamental Theorem of Calculus
I know that’s a mouthful, but the fundamental theorem of calculus is going to give us a way to construct a set-based algorithm that will allow us to calculate the CDF of the normal distribution. If you haven’t taken calculus, or can’t be bothered to try to remember what the fundamental theorem says, you might want to skip ahead to the next section and just get to the T-SQL. But as I’ve said many times, since I am no mathematician, I’ll need to keep my explanation as simple as possible in the hopes that I don’t trip myself up and that my readers can follow along.
Let’s look at a graphical example.
Suppose we were interested in the CDF where X=5. We could construct a series of rectangles that extend from the lowest value of the left tail of the distribution, up to where X=5. As these rectangles get smaller and smaller, the sum of the areas within the rectangles approaches quite closely the value of the CDF. Technically, this is expressed as a “limit” where the width of the rectangle à 0.
That is essentially the fundamental theorem of calculus, or the part that interests us anyway.
Unfortunately, we’re not quite done with our exotica yet. Constructing a set-based formula that sets up a specific (known) number of rectangles from -∞ (negative infinity) up to our value of X, is problematic at best. So we’re going to fall back on something that I pointed out earlier. This is that the area under the curve where x=10 (the mean in this example) is 0.5. So instead we can solve the following integral when X is less than our mean.
And alternatively, when x is greater than the mean:
Gosh! That still doesn’t sound too simple, but in reality we are now quite close to integrating the normal distribution’s PDF at a point of interest, and doing it with a set-based algorithm.
By now my regular readers are probably used to my little T-SQL sleight of hand tricks. So here’s another. Consider the following script.
DECLARE @X FLOAT = 1.5 -- Point at which function is to be evaluated ,@Mean FLOAT = 2. -- Mean of the Normal Distribution ,@StdDev FLOAT = 0.5 -- Standard Deviation of the Normal Distribution ; WITH CalculateIntervals AS ( -- Total intervals (about 5 per standard deviation) SELECT Intervals = 5 * ABS(@Mean - @X) / @StdDev -- Number of intervals per standard deviation ,Interval = 5 ), Tally (n) AS ( -- Up to 10,000 row tally table SELECT TOP (SELECT CAST(Intervals AS INT) FROM CalculateIntervals) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n) ) SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev ,Intervals ,n ,Pos1 ,Pos2 ,Width ,Height = AvgX -- Width x Height = NPDF(AvgX) ,Area=Width * e.[NPDF(X)] FROM CalculateIntervals a CROSS JOIN Tally b CROSS APPLY ( SELECT Pos1 = @Mean + (@StdDev/Interval) * (n - 1.) * SIGN(@X - @Mean) ,Pos2 = @Mean + (@StdDev/Interval) * (n + 0.) * SIGN(@X - @Mean) ,Width = ABS(@Mean - @X)/Intervals ) c CROSS APPLY ( -- Cheat the rectangle's height - make it the average of Pos1 and Pos2 SELECT AvgX = 0.5 * (Pos1 + Pos2) ) d CROSS APPLY dbo.NORMAL_PDF(AvgX, @Mean, @StdDev) e;
Let’s take this bit of magic and decompose it one step at a time.
- We start by declaring three local variables, which will ultimately become arguments to the iTVF we hope to create.
- The CalculateIntervals CTE creates two intermediate column values that we’ll use later: 1) the first is the total number of intervals, and 2) is the intervals per standard deviation that our X of interest is from our mean.
- The Tally CTE is just an in-line Tally table that is limited to the number of rows set up in the Intervals (total intervals) column of the CalculateIntervals CTE. The intervals themselves are simply used to represent the rectangles we described above.
- Next (in our FROM clause in the main query) we take the one row that is returned from the CalculateIntervalsCTE and CROSS JOIN it with the Tally table’s results, so that we’ve consolidated everything we need onto each row.
- The first of three cascading CROSS APPLYs calculates two positions (Pos1 and Pos2) and the width of our rectangle. These calculations will be clearer when we look at the results set (below).
- The second cascading CROSS APPLY “cheats” the height of the rectangle, to be the average of Pos1 and Pos2. This introduces some level of error into our calculations, and there are ways around that (too advanced for me and this blog). One way to reduce this error is to introduce more intervals (or rectangles).
- Lastly, in our final CROSS APPLY we call our NORMAL_PDF function to calculate the value of the PDF at AvgX where the mean and standard deviation are specified.
Note that this code fails miserably (no results rows produced) when @X = @Mean, but we’ll handle that in our iTVF. Here are the results:
X Mean StdDev Intervals n Pos1 Pos2 Width Height Area 1.5 2 0.5 5 1 2 1.9 0.1 1.95 0.0793905094954024 1.5 2 0.5 5 2 1.9 1.8 0.1 1.85 0.0762775630921048 1.5 2 0.5 5 3 1.8 1.7 0.1 1.75 0.0704130653528599 1.5 2 0.5 5 4 1.7 1.6 0.1 1.65 0.0624507866733522 1.5 2 0.5 5 5 1.6 1.5 0.1 1.55 0.053217049979751
The X, Mean, StdDev and n columns should require no explanation. The Intervals column will be five times the number of standard deviations that X is from our mean (note that X could be less than or greater than the mean). Pos1 and Pos2 compute the right and left points of our interval (which may be reversed if @X > @Mean), while width is computed based on the total difference between X and our mean divided by the total number of intervals. See the comment in the code about “cheat the height?” Height in the result set is simply the average of Pos1 and Pos2. Finally, the Area is Width * Height, which when summed across all five of our rectangles and adjusted according to the value of the CDF at the midpoint (0.5) should be a reasonably close approximation of the CDF of the normal distribution at X! This approximation should improve if we use more than five rectangles per standard deviation.
To put all of this another way, we’ve implemented the fundamental theorem of calculus by using a tally table to calculate our little rectangles, in order to integrate under a curve. Too bad I didn’t think of that example when I wrote my Tally tables blog!
A T-SQL iTVF to Calculate the Normal Distribution’s Cumulative Distribution Function (CDF)
At this time, it is just a little more work to get to where we want to be. This is an iTVF to simulate the NORM.DIST function in Excel. We’ll need to eliminate some of the unnecessary intermediate results from the previous script, and also combine a couple of extra bits to calculate the PDF and the CDF when X is at the mean, but you can read the comments in the function to see those additions.
We’re also going to make our function allow for a little additional selectivity by adding one argument that can adjust the number of intervals per standard deviation.
CREATE FUNCTION dbo.Excel_NORM_DIST -- Excel Normal Distribution - Returns either: -- Probability Density Function (PDF) or -- Cumulative Distribution Function (CDF) ( @X FLOAT -- Point at which function is to be evaluated ,@Mean FLOAT -- Mean of the Normal Distribution ,@StdDev FLOAT -- Standard Deviation of the Normal Distribution ,@CumDist TINYINT -- =0 for Probability Density, =1 for Cumulative Density ,@Intervals INT = NULL ) RETURNS TABLE WITH SCHEMABINDING RETURN WITH CalculateIntervals AS ( -- Total intervals (default is about 100 per standard deviation) SELECT Intervals = ISNULL(@Intervals, 100) * ABS(@Mean - @X) / @StdDev -- Number of intervals per standard deviation ,Interval = ISNULL(@Intervals, 100) ), Tally (n) AS ( -- Up to 10,000 row tally table SELECT TOP (SELECT CAST(Intervals AS INT) FROM CalculateIntervals) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n) ) -- PDF SELECT X, Mean, StdDev, [F(X)]=[NPDF(X)] FROM ( SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev ,[NPDF(X)] FROM dbo.NORMAL_PDF(@X, @Mean, @StdDev) ) a WHERE @CumDist = 0 UNION ALL -- CDF where X = mean SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev ,[CDF(X)] = 0.5 WHERE @CumDist = 1 AND @X = @Mean UNION ALL -- CDF where X mean SELECT X, Mean, StdDev, [CDF(X)] FROM ( SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev -- SUM the rectangles ,[CDF(X)] = 0.5 + SUM( -- Add to or remove from CDF at mean (0.5) SIGN(@X - @Mean) * -- Width x Height = NPDF(AvgX) Width * d.[NPDF(X)] ) FROM CalculateIntervals a CROSS JOIN Tally b CROSS APPLY ( SELECT Pos1 = @Mean + (@StdDev/Interval) * (n - 1.) * SIGN(@X - @Mean) ,Pos2 = @Mean + (@StdDev/Interval) * (n + 0.) * SIGN(@X - @Mean) ,Width = ABS(@Mean - @X)/Intervals ) c -- -- Average height -- CROSS APPLY dbo.NORMAL_PDF(0.5 * (Pos1 + Pos2), @Mean, @StdDev) d ) a WHERE @CumDist = 1 AND @X @Mean; GO WITH SampleData (X, Mean, StdDev) AS ( SELECT -1,2,0.5 UNION ALL SELECT -0.5,2,0.5 UNION ALL SELECT 0,2,0.5 UNION ALL SELECT 0.5,2,0.5 UNION ALL SELECT 1,2,0.5 UNION ALL SELECT 1.5,2,0.5 UNION ALL SELECT 2,2,0.5 UNION ALL SELECT 2.5,2,0.5 UNION ALL SELECT 3,2,0.5 UNION ALL SELECT 3.5,2,0.5 UNION ALL SELECT 4,2,0.5 UNION ALL SELECT 2.33,10,5 UNION ALL SELECT 4.22,10,5 UNION ALL SELECT 5.1,10,5 UNION ALL SELECT 8.2,10,5 UNION ALL SELECT 10,10,5 UNION ALL SELECT 11.5,10,5 UNION ALL SELECT 12.6,10,5 UNION ALL SELECT 16.8,10,5 UNION ALL SELECT 22.5,10,5 ) SELECT a.X, a.Mean, a.StdDev ,[NPDF(X)] = c.[F(X)] ,[SNPDF(X)] = b.[SNPDF(X)] ,[CDF(X)] = d.[F(X)] FROM SampleData a -- Use NORMAL_PDF to return PDF for standard normal distribution CROSS APPLY dbo.NORMAL_PDF(a.X, a.Mean, a.StdDev) b -- Use Excel_NORM_DIST to return PDF for general normal distribution CROSS APPLY dbo.Excel_NORM_DIST(a.X, a.Mean, a.StdDev, 0, DEFAULT) c -- Use Excel_NORM_DIST to return CDF for a general normal distribution CROSS APPLY dbo.Excel_NORM_DIST(a.X, a.Mean, a.StdDev, 1 -- Increase the accuracy by specifying a larger number for fifth parameter ,CASE a.Mean WHEN 10 THEN 500 ELSE 100 END) d;
Note how we can override (using the fifth parameter to the function) the number of intervals, or just let it use the DEFAULT (which is 100 per standard deviation). You need to take some care with this because if you end up trying to calculate the CDF for an X that is a significant number of standard deviations away using a high interval count, you could exceed the number of rows generated by the in-line tally table. But you can always make that tally table generate more rows by adding additional CROSS JOINs. Just remember that as you evaluate more and more intervals, the performance of the iTVF will be degraded, so my advice is to use just what you need to achieve the accuracy you seek.
The final results from this query are:
X Mean StdDev NPDF(X) SNPDF(X) CDF(X) -1.00 2 0.5 1.21517656996466E-08 0.241970724519143 9.86435766403559E-10 -0.50 2 0.5 2.9734390294686E-06 0.3520653267643 2.86620600586929E-07 0.00 2 0.5 0.000267660451529771 0.398942280401433 3.16690114139928E-05 0.50 2 0.5 0.00886369682387602 0.3520653267643 0.00134984263449417 1.00 2 0.5 0.107981933026376 0.241970724519143 0.0227496820247703 1.50 2 0.5 0.483941449038287 0.129517595665892 0.158654245714223 2.00 2 0.5 0.797884560802865 0.0539909665131881 0.5 2.50 2 0.5 0.483941449038287 0.0175283004935685 0.841345754285777 3.00 2 0.5 0.107981933026376 0.00443184841193801 0.97725031797523 3.50 2 0.5 0.00886369682387602 0.00087268269504576 0.998650157365506 4.00 2 0.5 0.000267660451529771 0.000133830225764885 0.999968330988586 2.33 10 5.0 0.0246013380403521 0.0264264854972617 0.0625147949422269 4.22 10 5.0 0.0409033036384656 5.41832610895401E-05 0.123840539089739 5.10 10 5.0 0.0493618981134085 8.97243516238334E-07 0.163543019015466 8.20 10 5.0 0.0747821210746257 9.99837874849718E-16 0.359423544347365 10.00 10 5.0 0.0797884560802865 7.69459862670642E-23 0.5 11.50 10 5.0 0.0762775630921048 7.6416554115872E-30 0.61791144125835 12.60 10 5.0 0.0696985025517949 1.33848679925429E-35 0.697770895452422 16.80 10 5.0 0.0316449580740766 2.05718230302861E-62 0.913085073917205 22.50 10 5.0 0.00350566009871371 4.67863681725005E-111 0.99379034197768
The first thing that we notice about these results is that our CDF where X equals the mean is 0.5. That’s a good start, but for a more thorough check we’ll copy/paste these results into our Excel spreadsheet.
Notice how all of the values in column M are quite small (nearly zero), indicating a pretty small difference between Excel’s calculated CDF and the one calculated in T-SQL. Had we left the second distribution group to default to 100 intervals, the difference would have been slightly larger.
Conclusions
I’ll number my conclusions:
- I am a T-SQL geek, because whom but a T-SQL geek would have thought of doing something so ridiculously geeky and so enjoyed writing about it as much as me!
- It is possible to do integration (specifically to solve for a finite definite integral) in T-SQL by using a tally table to implement the fundamental theorem of calculus.
- We have provided two functions:
- The first, NORMAL_PDF, which calculates the value for the probability density function for both general and standard normal distributions.
- The second, EXCEL_NORM_DIST, which calculates the probability density and cumulative distribution functions for a general normal distribution, closely replicating the results of Excel’s NORM.DIST function.
Prior to writing this blog, I did a little searching to see if anyone has tried this before and I ran across this article by Eli Algranti: Part 1: T-SQL Implementation of NORMDIST / NORM.S.DIST
He is obviously a bit more of a mathematician than I am, because he came up with three different implementations that have different error characteristics for the CDF.
- Using polynomial approximation – certainly an interesting approach that uses an approximate “closed form” version of the antiderivative of the normal distribution. My only issue with it is that is uses a scalar-valued, user-defined function (UDF) to perform the calculation.
- Another polynomial approximation – similar to the first but with a different error result. Again, my issue is his implementation using the scalar-valued, UDF.
- Using a rational polynomial approximation – a bit more elaborate estimation approach that once again is implemented using a scalar-valued, UDF.
Certainly all of those could be converted to iTVFs that would run faster. It would be interesting to see how well my function’s error results compare to those.
Below is an Excel workbook file provided to you as a resource. In that you’ll find three worksheets:
- Final Results w-Check – showing the spreadsheet screen captures above (all columns), plus a column that contains a neat way to translate sample data from Excel to a series of T-SQL SELECT/UNION ALL/SELECT statements, like you see in my SampleDate CTEs above.
- SQL- Create-run NORMAL_PDF – a T-SQL script to create the named function (in italics) and run it with the sample data provided.
- SQL- Create-run EXCEL_NORM_DIST – a T-SQL script to create the named function (in italics) and run it with the sample data provided.
The normal distribution is one that is quite important in statistics and having tools to apply its distributions in T-SQL can be just as important. We hope you’ve found today’s blog useful and instructive!
Follow me on Twitter: @DwainCSQL
© Copyright Dwain Camps 14 May 2015. All rights reserved.
Stupid T-SQL Tricks – Part 3: A Zodiacal SQL
Today we’ll learn how you can use T-SQL to calculate your zodiac sign from your birthdate.
“Heeeeeere’s Johnny!” – Ed McMahon introducing Johnny Carson for the Tonight Show
If you’re old enough to remember the Tonight Show hosted by Johnny Carson, how can you not remember the iconic Stupid Pet Tricks skits, and the later tributes to those done by David Letterman? My Stupid T-SQL Tricks are hopefully a bit more useful, but just as amusing as Carson’s Stupid Pet Tricks were!
Everyone knows the twelve signs of the zodiac, and there’s a pretty good chance you also know what your sign is, and that which sign of the zodiac you are depends on your date of birth.
Putting the Zodiac Signs into a Lookup Table
Four our lookup table, we’ll use a Common Table Expression (CTE) with the twelve zodiac signs, but with one repeated.
WITH Zodiac
(
SignNo, SignName, Symbol, StartDT, EndDT, SignIs
)
AS
(
SELECT 1, 'Aries', NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
UNION ALL SELECT 2, 'Taurus', NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
UNION ALL SELECT 3, 'Gemini', NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
UNION ALL SELECT 4, 'Cancer', NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
UNION ALL SELECT 5, 'Leo', NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
UNION ALL SELECT 6, 'Virgo', NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
UNION ALL SELECT 7, 'Libra', NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
UNION ALL SELECT 8, 'Scorpio', NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
UNION ALL SELECT 9, 'Sagitarius', NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
UNION ALL SELECT 10, 'Capricorn', NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
UNION ALL SELECT 10, 'Capricorn', NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
UNION ALL SELECT 11, 'Aquarius', NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
UNION ALL SELECT 12, 'Pisces', NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT *
FROM Zodiac;
The signs are numbered in the order that they’re typically described in the literature, e.g., on the Wiki page for The Zodiac. Note how Capricorn is duplicated, because the dates for that sign cross the end-of-year boundary. Here are the displayed results from our table:
SignNo SignName Symbol StartDT EndDT SignIs 1 Aries ♈ 1900-03-21 1900-04-20 Ram 2 Taurus ♉ 1900-04-21 1900-05-21 Bull 3 Gemini ♊ 1900-05-22 1900-06-21 Twins 4 Cancer ♋ 1900-06-22 1900-07-22 Crab 5 Leo ♌ 1900-07-23 1900-08-22 Lion 6 Virgo ♍ 1900-08-23 1900-09-23 Maiden 7 Libra ♎ 1900-09-24 1900-10-23 Scales 8 Scorpio ♏ 1900-10-24 1900-11-22 Scorpion 9 Sagitarius ♐ 1900-11-23 1900-12-21 Archer (Centaur) 10 Capricorn ♑ 1900-12-22 1900-12-31 Sea-Goat (Goat) 10 Capricorn ♑ 1900-01-01 1900-01-20 Sea-Goat (Goat) 11 Aquarius ♒ 1900-01-21 1900-02-19 Water-bearer 12 Pisces ♓ 1900-02-20 1900-03-20 Fish
I think it is kind of cool that we could use the NCHAR built-in T-SQL function (introduced in SQL 2005) to return the UNICODE symbol for each zodiac sign!
Suppose we’d now like to calculate the sign for a birthdate. We’ll use two cases, a Capricorn and a Gemini to illustrate how easy it is using the lookup table above.
DECLARE @BirthDate DATE = '1999-01-12';
WITH Zodiac
(
SignNo, SignName, Symbol, StartDT, EndDT, SignIs
)
AS
(
SELECT 1, 'Aries', NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
UNION ALL SELECT 2, 'Taurus', NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
UNION ALL SELECT 3, 'Gemini', NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
UNION ALL SELECT 4, 'Cancer', NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
UNION ALL SELECT 5, 'Leo', NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
UNION ALL SELECT 6, 'Virgo', NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
UNION ALL SELECT 7, 'Libra', NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
UNION ALL SELECT 8, 'Scorpio', NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
UNION ALL SELECT 9, 'Sagitarius', NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
UNION ALL SELECT 10, 'Capricorn', NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
UNION ALL SELECT 10, 'Capricorn', NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
UNION ALL SELECT 11, 'Aquarius', NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
UNION ALL SELECT 12, 'Pisces', NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT BirthDate=@BirthDate, a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
FROM Zodiac a
WHERE @BirthDate BETWEEN
DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND
DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);
Which returns one row for our Capricorn:
BirthDate SignNo SignName SignIs Symbol 1999-01-12 10 Capricorn The Sea-Goat (Goat) ♑
You can see how we’ve used the DATEADD and DATEDIFF functions to place our birthdate between the StartDT and EndDT columns within our lookup table.
A Function to Calculate the Sign of the Zodiac for a Birth Date
I like in-line, Table Valued Functions (iTVFs) because I consider them tools. This little T-SQL query is easy enough to put into an iTVF, so let’s do that now.
CREATE FUNCTION dbo.ZodiacSign
(
@BirthDate DATE
) RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH Zodiac
(
SignNo, SignName, Symbol, StartDT, EndDT, SignIs
)
AS
(
SELECT 1, 'Aries', NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
UNION ALL SELECT 2, 'Taurus', NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
UNION ALL SELECT 3, 'Gemini', NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
UNION ALL SELECT 4, 'Cancer', NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
UNION ALL SELECT 5, 'Leo', NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
UNION ALL SELECT 6, 'Virgo', NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
UNION ALL SELECT 7, 'Libra', NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
UNION ALL SELECT 8, 'Scorpio', NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
UNION ALL SELECT 9, 'Sagitarius', NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
UNION ALL SELECT 10, 'Capricorn', NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
UNION ALL SELECT 10, 'Capricorn', NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
UNION ALL SELECT 11, 'Aquarius', NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
UNION ALL SELECT 12, 'Pisces', NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT BirthDate=@BirthDate, a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
FROM Zodiac a
WHERE @BirthDate BETWEEN
DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND
DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);
With this function, we can now calculate the Zodiac signs for a whole range of birthdates as follows:
WITH BirthDates (BirthDate) AS ( SELECT '1998-06-17' -- A Gemini UNION ALL SELECT '1999-01-10' -- A Capricorn ) SELECT a.BirthDate, SignNo, SignName, SignIs, Symbol FROM BirthDates a CROSS APPLY dbo.ZodiacSign(a.BirthDate) b;
Which gives us these results:
BirthDate SignNo SignName SignIs Symbol 1998-06-17 3 Gemini The Twins ♊ 1999-01-10 10 Capricorn The Sea-Goat (Goat) ♑
The Zodiac Date Range in which a Birth Date Falls
Because of Capricorn, which splits two different years, were we to want to calculate the date range for the sign on which a particular birth date falls, things get just a little more complicated. For example, for our two birthdates shown above, we’d like to see a results set that looks like this.
BirthDate SignNo SignName SignIs SignPeriodStart SignPeriodEnd Symbol 1998-06-17 3 Gemini The Twins 1998-05-22 1998-06-21 ♊ 1999-01-10 10 Capricorn The Sea-Goat (Goat) 1998-12-22 1999-01-20 ♑
Notice how Capricorn’s SignPeriodStart is the year before that particular birthdate.
Our ZodiacSign function can be modified as follows to handle this case.
ALTER FUNCTION dbo.ZodiacSign
(
@BirthDate DATE
) RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH Zodiac
(
SignNo, SignName, Symbol, StartDT, EndDT, SignIs
)
AS
(
SELECT 1, 'Aries', NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
UNION ALL SELECT 2, 'Taurus', NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
UNION ALL SELECT 3, 'Gemini', NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
UNION ALL SELECT 4, 'Cancer', NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
UNION ALL SELECT 5, 'Leo', NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
UNION ALL SELECT 6, 'Virgo', NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
UNION ALL SELECT 7, 'Libra', NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
UNION ALL SELECT 8, 'Scorpio', NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
UNION ALL SELECT 9, 'Sagitarius', NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
UNION ALL SELECT 10, 'Capricorn', NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
UNION ALL SELECT 10, 'Capricorn', NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
UNION ALL SELECT 11, 'Aquarius', NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
UNION ALL SELECT 12, 'Pisces', NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
),
GroupedSigns AS
(
SELECT SignNo
,StartDT = CASE WHEN SignNo = 10 AND MONTH(@BirthDate) = 1
THEN DATEADD(year, -1, MAX(StartDT))
WHEN SignNo = 10 AND MONTH(@BirthDate) = 12
THEN MAX(StartDT)
ELSE MIN(StartDT)
END
,EndDT = CASE WHEN SignNo = 10 AND MONTH(@BirthDate) = 1
THEN MIN(EndDT)
WHEN SignNo = 10 AND MONTH(@BirthDate) = 12
THEN DATEADD(year, 1, MIN(EndDT))
ELSE MAX(EndDT)
END FROM Zodiac
GROUP BY SignNo
)
SELECT a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
,SignPeriodStart = CAST(DATEADD(year, YEAR(@BirthDate)-1900, b.StartDT) AS DATE)
,SignPeriodEnd = CAST(DATEADD(year, YEAR(@BirthDate)-1900, b.EndDT) AS DATE)
FROM Zodiac a
JOIN GroupedSigns b ON a.SignNo = b.SignNo
WHERE @BirthDate BETWEEN
DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND
DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);
Because of our split row for Capricorn, we needed to create a grouping that set the dates accordingly and then JOIN this back to our original Zodiac lookup table. A little more date arithmetic magic, this time using the DATEADD and YEAR built-in functions, will get us to our period dates.
This can be demonstrated by running the following query to call our newly ALTERed function:
WITH BirthDates (BirthDate) AS ( SELECT '1998-06-17' -- A Gemini UNION ALL SELECT '1999-01-10' -- A Capricorn ) SELECT a.BirthDate, SignNo, SignName, SignIs ,SignPeriodStart ,SignPeriodEnd ,Symbol FROM BirthDates a CROSS APPLY dbo.ZodiacSign(a.BirthDate) b;
Which gives us the results we seek!
BirthDate SignNo SignName SignIs SignPeriodStart SignPeriodEnd Symbol 1998-06-17 3 Gemini The Twins 1998-05-22 1998-06-21 ♊ 1999-01-10 10 Capricorn The Sea-Goat (Goat) 1998-12-22 1999-01-20 ♑
Conclusions
I love to create iTVFs as tools I can always use when some calculation gets hairy! I have many tools that are specific to their purpose. I also like to adopt iTVFs of this sort from other sources (a great example of that being DelimitedSplit8K from SQL MVP Jeff Moden). You should probably consider doing this yourself to improve your skill at both writing iTVFs and using the tools you create to accelerate your own T-SQL coding jobs.
Now if I could just figure out the Yin and the Yang of the Chinese calendar to confirm I’m a Dog that would really be something to chew on.
Yet another stupid, T-SQL trick, right? Who knows, perhaps one day you’ll even have a use for it!
Follow me on Twitter: @DwainCSQL
© Copyright Dwain Camps 16 Apr 2015. All rights reserved.
An Easter SQL
Since Easter is nearly upon us, I got to thinking once again about how challenging it is to calculate the day that Easter Sunday falls on. Since Calendar tables and their step-children Holiday tables are so interesting I thought I might write a sequel to my earlier blog link in this sentence.
There are at least three holidays (not necessarily business holidays) that are related to the calculation of Easter Sunday. These are:
- Good Friday is the Christian holiday that is the Friday just prior to Easter Sunday.
- Ash Wednesday is another Christian holiday (the first day of Lent) that falls precisely 46 days prior to Easter Sunday.
- Mardi Gras (a.k.a. Fat Tuesday) is a lesser known holiday celebrated by the more robust partyers among us. I have a special affinity for it as I lived in New Orleans for a few years.
We will show you how to calculate them all and use that method to populate a Holidays lookup table so hopefully you’ll never have to do it again. Happy Easter!
A Calendar and a Holiday Table
Discussions of Calendar tables in T-SQL are ubiquitous (here’s one by Todd Fifield), differing only in the columns actually stored in them. We’ll draw upon the one we created in our blog to show you how easy it is here.
-- Create a Calendar table SELECT [Date]=ISNULL([Date], 0), [Year]=ISNULL([Year], 0), [YrNN]=ISNULL([YrNN], 0) ,[YYYYMM]=ISNULL([YYYYMM], 0), [BuddhaYr], [Month], [Day]=ISNULL([Day], 0) ,[WkDNo], [WkDName], [WkDName2], [WkDName3], [JulDay], [JulWk] ,[WkNo], [Qtr], [Last], [LdOfMo], [LDtOfMo] INTO dbo.Calendar -- This function can be obtained here: GenerateCalendar FUNCTION FROM dbo.GenerateCalendar('1990-01-01', 65536); -- Change column types to be NOT NULL so we can index them -- -- Note: not needed thanks to the neat trick with ISNULL above --ALTER TABLE dbo.Calendar ALTER COLUMN [Date] DATETIME NOT NULL; --ALTER TABLE dbo.Calendar ALTER COLUMN [Year] INT NOT NULL; --ALTER TABLE dbo.Calendar ALTER COLUMN [Month] INT NOT NULL; --ALTER TABLE dbo.Calendar ALTER COLUMN [YYYYMM] INT NOT NULL; --ALTER TABLE dbo.Calendar ALTER COLUMN [Day] INT NOT NULL; GO -- Build some representative INDEXes ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_pk PRIMARY KEY([Date]); ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_ix1 UNIQUE NONCLUSTERED([Year], [Month], [Day]); ALTER TABLE dbo.Calendar ADD CONSTRAINT Cal_ix2 UNIQUE NONCLUSTERED([YYYYMM], [Day]); SELECT TOP 10 [Date], [YYYYMM], [WkDName2], [WkNo], [Last] FROM dbo.Calendar; -- Now create a Holidays table, from the same blog CREATE TABLE dbo.Holidays ( FromDate DATETIME PRIMARY KEY ,ToDate DATETIME ,Holiday VARCHAR(100) );
We’ve created a Holidays table that we’ll be using to further our sequel to Easter. There are also a few other things in the above Data Definition Language (DDL) that are worthy of mention.
- The GenerateCalendar FUNCTION is an in-line, Table Valued Function (iTVF) that appears in my earlier blog (and also in the resources file attached at the end of this article), which can be used in lieu of a Calendar table should you prefer to do so. Once again, my thanks go out to SQL MVP Jeff Moden for his help optimizing the performance of the calculations done in that FUNCTION.
- We’ve used a neat trick with ISNULL to set the NULL attribute of the column in the table we selected INTO (when Calendar gets CREATEd) to NOT NULL, so that we don’t need to run the subsequent ALTER statements on those columns. I included those ALTER statements for your information. Thanks again to Jeff Moden for showing me that trick (with honorable mention to SQL MCM Wayne Sheffield (MCM=Microsoft Certified Master) who came along just a few days later with the same information).
Since Easter Sunday is so diabolically difficult to calculate, many reputable sources have created look-up tables such as this one. Drawing upon that source, we can create the dates for Easter Sunday from 1990 through 2020 in our Holidays table using the following script.
WITH EasterDates (d) AS ( SELECT '1990-04-15' UNION ALL SELECT '1991-03-31' UNION ALL SELECT '1992-04-19' UNION ALL SELECT '1993-04-11' UNION ALL SELECT '1994-04-03' UNION ALL SELECT '1995-04-16' UNION ALL SELECT '1996-04-07' UNION ALL SELECT '1997-03-30' UNION ALL SELECT '1998-04-12' UNION ALL SELECT '1999-04-04' UNION ALL SELECT '2000-04-23' UNION ALL SELECT '2001-04-15' UNION ALL SELECT '2002-03-31' UNION ALL SELECT '2003-04-20' UNION ALL SELECT '2004-04-11' UNION ALL SELECT '2005-03-27' UNION ALL SELECT '2006-04-16' UNION ALL SELECT '2007-04-08' UNION ALL SELECT '2008-03-23' UNION ALL SELECT '2009-04-12' UNION ALL SELECT '2010-04-04' UNION ALL SELECT '2011-04-24' UNION ALL SELECT '2012-04-08' UNION ALL SELECT '2013-03-31' UNION ALL SELECT '2014-04-20' UNION ALL SELECT '2015-04-05' UNION ALL SELECT '2016-03-27' UNION ALL SELECT '2017-04-16' UNION ALL SELECT '2018-04-01' UNION ALL SELECT '2019-04-21' UNION ALL SELECT '2020-04-12' ) INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday) SELECT FromDate=d, ToDate=d, Holiday='Easter Sunday ' + CAST(YEAR(d) AS CHAR(4)) FROM EasterDates;
Before we add Good Friday, Ash Wednesday and Fat Tuesday to our Holiday table, we’d like to check that the introductory description at least results in the correct day of the week on which the holiday falls. We can use our Calendar table for this.
SELECT FromDate = a.FromDate+b.[days] ,ToDate = a.FromDate+b.[days] ,b.holiday ,c.WkDName FROM dbo.Holidays a CROSS APPLY ( VALUES(-2, 'Good Friday ' + CAST(YEAR(FromDate) AS CHAR(4))) ,(-46, 'Ash Wednesday ' + CAST(YEAR(FromDate) AS CHAR(4))) ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(YEAR(FromDate) AS CHAR(4))) ) b ([days], holiday) JOIN dbo.Calendar c ON c.[Date] = a.FromDate+b.[days] WHERE a.Holiday LIKE 'Easter Sunday%';
This produces results like this (only the first two years are shown):
FromDate ToDate holiday WkDName 1990-04-13 00:00:00.000 1990-04-13 00:00:00.000 Good Friday 1990 Friday 1990-02-28 00:00:00.000 1990-02-28 00:00:00.000 Ash Wednesday 1990 Wednesday 1990-02-27 00:00:00.000 1990-02-27 00:00:00.000 Mardi Gras (Fat Tuesday) 1990 Tuesday 1991-03-29 00:00:00.000 1991-03-29 00:00:00.000 Good Friday 1991 Friday 1991-02-13 00:00:00.000 1991-02-13 00:00:00.000 Ash Wednesday 1991 Wednesday 1991-02-12 00:00:00.000 1991-02-12 00:00:00.000 Mardi Gras (Fat Tuesday) 1991 Tuesday
So now we can be confident to insert those holidays into our Holidays table.
INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday) SELECT FromDate = a.FromDate+b.[days] ,ToDate = a.FromDate+b.[days] ,b.holiday FROM dbo.Holidays a CROSS APPLY ( VALUES(-2, 'Good Friday ' + CAST(YEAR(FromDate) AS CHAR(4))) ,(-46, 'Ash Wednesday ' + CAST(YEAR(FromDate) AS CHAR(4))) ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(YEAR(FromDate) AS CHAR(4))) ) b ([days], holiday) WHERE a.Holiday LIKE 'Easter Sunday%';
A couple of notes on the data types used in our Calendar and Holidays tables:
- The Date column of the Calendar table was chosen to be DATETIME to retain compatibility with SQL 2005 (corresponding to the GenerateCalendar FUNCTION). You could just as easily use a DATE datatype if you’re running SQL 2008 or later.
- The FromDate/ToDate data types are also both DATETIME. Here it could be considered important to keep the DATETIME data type to record cases where your business offers half-day holidays. For example, I’ve seen this for Christmas Eve and New Year’s Eve.
- You can also define a different FromDate than ToDate in the Holidays table to indicate a holiday that spans several days, like the Thailand Songkran holiday example from my blog.
“Since SQL is a database language, we prefer to do look ups and not calculations.” –- Joe Celko (prolific author of books on things SQL)
There you go Joe, we’ve honored your preference! Now we can do lookups of not just Easter Sunday, but also Good Friday, Ash Wednesday and Mardi Gras, the latter potentially being essential to planning your next trip to New Orleans.
The Calculation of Easter Sunday
A somewhat accurate description of the calculation of Easter Sunday is:
“The Sunday following the full Moon which falls on or after the equinox (the Paschal full moon) will give the lawful Easter.” –- From the Wikipedia link on Easter Sunday above
I say “somewhat accurate” because there’s all sorts of ifs, buts and caveats in that description, which you can read about in the Wikipedia link, however it will suffice for the time being.
The formal description of the calculation of Easter is known as Computus (no relation to Locutus of Borg), and is provided by Wikipedia at the link shown. We shall observe this calculation in what follows. While the formulas given are indicated to be valid for the years 1900 to 2199, since our Calendar table has been populated with years 1990-2169 we’ll limit our calculations of Easter Sunday to just those the years. Years 1990 through 2020 will be checked against the lookup table we’ve already provided.
Of course, this calculation appears often enough in the literature and implemented in T-SQL (for example: here and here), often using methods I’d prefer to avoid due to performance. Since our personal mantra is: “No Loops. No Cursors. No RBAR. Hoo-uh!” we shall endeavor to live up to that mantra today. Think about that term “RBAR” and if you’ve never heard of it before, you’ll find it mentioned in the Conclusion to this article.
We’ll start by attempting to replicate the tabular example for calculating the Paschal full moon (PFMD) within the Gregorian calendar shown by Computus on that Wikipedia page.
Note that the algorithm described above is a step-by-step or procedural algorithm. Here is where many SQL implementations fail because they don’t try to convert it to one that is set-based. To do this, you must think of each step in the procedural algorithm as a query, and as you proceed from one step to the next, each new step is another outer layer to your set-based query.
This is easier shown that described, so here is that query for 2013-2032.
SELECT [Year], [Y MOD 19], Addendum ,PFMD = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END ,[Month] = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END ,[MonthNo] = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END FROM ( SELECT [Year], [Y MOD 19], Addendum ,PFMD = (45-([Y MOD 19]*11)%30+Addendum) FROM ( SELECT [Year] ,[Y MOD 19] = [Year]%19 ,Addendum = CASE [Year]%19 WHEN 5 THEN 29 WHEN 16 THEN 29 WHEN 8 THEN 30 ELSE 0 END FROM dbo.Calendar WHERE [Month] = 1 AND [Day] = 1 ) a ) a WHERE [Year] BETWEEN 2013 AND 2032;
Note how the Addendum (CASE) implements the first two IFs (second and third rows) of the procedural algorithm. The results from this query for just the first 10 years are:
Year Y MOD 19 Addendum PFMD Month MonthNo 2013 18 0 27 Mar 3 2014 0 0 14 Apr 4 2015 1 0 3 Apr 4 2016 2 0 23 Mar 3 2017 3 0 11 Apr 4 2018 4 0 31 Mar 3 2019 5 29 18 Apr 4 2020 6 0 8 Apr 4 2021 7 0 28 Mar 3 2022 8 30 16 Apr 4
If we take PFMD and add the first letter of the month column that follows (for example for 2013: 27M) we see exact correspondence with the row from the Computus table (graphic) labeled the “Paschal full moon date,” including for the years not shown in the results set. So it appears that we have rather simply converted the formula for calculating PFMD from Computus to a SQL script without too much effort.
Now we’ll use our Calendar table yet again, along with a little additional magic:
- Put the pieces of the Paschal full moon date back together into a date using a formula that includes a couple of calls to the T-SQL built-in function DATEADD.
- Do a CROSS APPLY (explained in Part 1 and Part 2 by SQL MVP Paul White) to pick up the Sunday after the Paschal full moon, which is Easter Sunday.
WITH CalculatePaschalFullMoon AS ( -- Original query to calculate the components of the -- Paschal Full Moon date SELECT a.[Year], [Y MOD 19], Addendum ,PFMD = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END ,[Month] = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END ,[MonthNo] = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END -- Put the date/time pieces together to get a real DATETIME for PFMD ,[PFMD2] = DATEADD(day, CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END-1 -- PFMD ,DATEADD(month, CASE WHEN PFMD > 31 THEN 4 ELSE 3 END-1 -- [MonthNo] , a.[Date])) FROM ( SELECT [Year], [Y MOD 19], Addendum, [Date] ,PFMD = (45-([Y MOD 19]*11)%30+Addendum) FROM ( SELECT [Year] ,[Y MOD 19] = [Year]%19 ,Addendum = CASE [Year]%19 WHEN 5 THEN 29 WHEN 16 THEN 29 WHEN 8 THEN 30 ELSE 0 END ,[Date] FROM dbo.Calendar WHERE [Month] = 1 AND [Day] = 1 ) a ) a ) SELECT a.[Year] ,PaschalFullMoon=CAST(a.PFMD2 AS DATE) ,EasterSunday FROM CalculatePaschalFullMoon a -- Easter Sunday follows the Paschal Full Moon date -- so pick that up from the Calendar table CROSS APPLY ( SELECT TOP 1 EasterSunday=CAST([Date] AS DATE) FROM dbo.Calendar b WHERE b.[Date] > a.PFMD2 AND b.WkDName3 = 'Sun' ORDER BY [Date] ) b WHERE [Year] BETWEEN 2013 AND 2032;
CREATE FUNCTION CalculateEasterSunday ( @Year INT ) -- Calculate Easter Sunday for any given year RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH CalculatePaschalFullMoon AS ( -- Original query to calculate the components of the -- Paschal Full Moon date SELECT a.[Year] --, [Y MOD 19], Addendum --,PFMD = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END --,[Month] = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END --,[MonthNo] = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END -- Put the date/time pieces together to get a real -- DATETIME for PFMD ,[PFMD2] = DATEADD(day, CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END-1 -- PFMD ,DATEADD(month, CASE WHEN PFMD > 31 THEN 4 ELSE 3 END-1 -- [MonthNo] , a.[Date])) FROM ( SELECT [Year], [Y MOD 19], Addendum, [Date] ,PFMD = (45-([Y MOD 19]*11)%30+Addendum) FROM ( SELECT [Year] ,[Y MOD 19] = [Year]%19 ,Addendum = CASE [Year]%19 WHEN 5 THEN 29 WHEN 16 THEN 29 WHEN 8 THEN 30 ELSE 0 END ,[Date] FROM dbo.Calendar WHERE [Month] = 1 AND [Day] = 1 AND [Year] = @Year ) a ) a ) SELECT a.[Year] ,PaschalFullMoon=CAST(a.PFMD2 AS DATE) ,EasterSunday FROM CalculatePaschalFullMoon a -- Easter Sunday follows the Paschal Full Moon date -- so pick that up from the Calendar table CROSS APPLY ( SELECT TOP 1 EasterSunday=CAST([Date] AS DATE) FROM dbo.Calendar b WHERE b.[Date] > a.PFMD2 AND b.WkDName3 = 'Sun' ORDER BY [Date] ) b;
You can see where I commented out a few lines of code in the FUNCTION that represent intermediate results that our FUNCTION doesn’t really need to retain. Notice also how the filter on the @Year parameter was moved up into the innermost SELECT from the Calendar table, because when using the FUNCTION you’ll be selecting only one row at a time at that point. Calling the iTVF using CROSS APPLY in-lines the code to make this a non-RBAR query.
Now we can use our FUNCTION to check that the dates we’ve prior stored for Easter Sunday in our Holidays table are correct.
SELECT HolidayFromDate = CAST(a.FromDate AS DATE) ,b.EasterSunday ,[Check] = CASE WHEN a.FromDate = b.EasterSunday THEN 'MATCHED!' ELSE 'WRONG!' END FROM dbo.Holidays a CROSS APPLY dbo.CalculateEasterSunday(YEAR(a.FromDate)) b WHERE Holiday LIKE 'Easter Sunday%';
The first ten years of our results show us all MATCHED! (and so do the others that are not shown).
HolidayFromDate EasterSunday Check 1990-04-15 1990-04-15 MATCHED! 1991-03-31 1991-03-31 MATCHED! 1992-04-19 1992-04-19 MATCHED! 1993-04-11 1993-04-11 MATCHED! 1994-04-03 1994-04-03 MATCHED! 1995-04-16 1995-04-16 MATCHED! 1996-04-07 1996-04-07 MATCHED! 1997-03-30 1997-03-30 MATCHED! 1998-04-12 1998-04-12 MATCHED! 1999-04-04 1999-04-04 MATCHED!
So we can be reasonably confident of our Easter Sunday calculation for the years indicated by Computus (1900-2199)!
Using our FUNCTION to Create Lookup Values in our Holidays Table
Now to keep Joe Celko happy (i.e., so we can do lookups instead of calculations), let’s insert the rest of our holidays, including Easter Sunday, Good Friday, Ash Wednesday and Mardi Gras, into our Holidays table, using all the years available to us in our Calendar table.
INSERT dbo.Holidays (FromDate, ToDate, Holiday) SELECT FromDate, ToDate, holiday FROM ( SELECT FromDate = DATEADD(day, c.days, b.EasterSunday) ,ToDate = DATEADD(day, c.days, b.EasterSunday) ,c.holiday FROM dbo.Calendar a CROSS APPLY dbo.CalculateEasterSunday(a.[Year]) b CROSS APPLY ( VALUES (0, 'Easter Sunday ' + CAST(a.[Year] AS CHAR(4))) ,(-2, 'Good Friday ' + CAST(a.[Year] AS CHAR(4))) ,(-46, 'Ash Wednesday ' + CAST(a.[Year] AS CHAR(4))) ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(a.[Year] AS CHAR(4))) ) c ([days], holiday) WHERE a.[Month] = 1 AND a.[Day] = 1 ) a WHERE NOT EXISTS ( SELECT 1 FROM dbo.Holidays x WHERE x.FromDate = a.FromDate AND x.ToDate = a.ToDate );
This added 596 holidays to our table, and I’d guess we’re pretty well set for our lifetimes!
Notice how, in the above query we had to use DATEADD instead of simple arithmetic as we did before when using the DATETIME data type. This is because simple arithmetic doesn’t work with the DATE data type returned by our CalculateEasterSunday function.
Conclusion and What we Learned
While Joe likes lookups, I find calculations in T-SQL to be quite a bit of fun. In this article (as in my blog on Calendar tables), we got to do a little of both.
If you went ahead and read my blog on Calendar tables in conjunction with this article, I am pleased because you probably learned a lot about how useful they can be, particularly in conjunction with a Holidays table. In this article (in the FUNCTION) we used our Calendar table to find the next Sunday occurring after our calculated date of the Paschal full moon.
Besides learning about the trivialities of the relationship between Easter, Good Friday, Ash Wednesday, and planning for your next visit to New Orleans for Mardi Gras, we’d like to think that you may have learned a few other things as well:
- You can use simple arithmetic on a DATETIME data type to add or subtract days, however you must use DATEADD to do the same thing with a DATE data type.
- You now have efficient tools in your toolbox to generate a calendar table (GenerateCalendar) and CalculateEasterSunday, whenever you have a need to use them. The latter function is also included in the resources file attached at the end of this article.
- You learned how to CROSS APPLY VALUES to do an UNPIVOT operation, even though I didn’t specifically mention that was what we were doing when we used it (where we created rows with dates for Good Friday, Ash Wednesday and Mardi Gras). If you haven’t seen that method before you may want to look into that link as well. That is something that will only work in SQL 2008 (or later), but there is an equivalent construct that will work in SQL 2005.
- We learned how to use ISNULL on a SELECT/INTO operation, to set the attribute of the column created in the target table to NOT NULL.
- You learned how to think your way through a rather complex, procedural calculation and convert it into set-based code, by wrapping each successive step in another layer of query.
- You learned my personal SQL mantra, in case you hadn’t seen that before.
In case you’ve been wondering what RBAR stands for, it means “Row By Agonizing Row” and it is a term coined by Jeff Moden which I fortuitously picked up on before I got to know him.
Oh yes, then there’s the quiz I mentioned. There are two cases:
- If we’re starting with 01 Jan of a year as our date, and the Paschal full moon month is three (meaning it should fall in Mar), if we add three months to 01 Jan it puts us in Apr. So we subtract one month to get us back to where we need to be.
- If we’re starting with 01 Mar of a year as our date, and the Paschal full moon day within that month is (say) 14 (meaning it should fall on 14 Mar), if we add 14 days to 01 Mar it puts us in 15 Mar. So we subtract one day to get us back to where we need to be.
After all this, I still have no clue what a Paschal full moon is but what the heck, at least I know I can calculate when it is! That’s all for today folks. I hope you found this article useful and just a bit of fun.
Follow me on Twitter: @DwainCSQL
© Copyright Dwain Camps 01 Apr 2015. All rights reserved.
Resources for An Easter SQL.doc
The Resources above are stored in a Microsoft Word (TM) document. There are two sections:
-- Create GenerateCalendar, Calendar and Holiday Tables and Populate.sql ... page break ... -- Create CalculateEasterSunday and Add Holidays.sql
Using the T-SQL MERGE Statement
In SQL Server 2008, Microsoft added a new SQL query type: the MERGE statement. This flexible query provides the ability to perform INSERTs, UPDATEs and even DELETEs all within a single statement. Used in combination with Common Table Expressions (CTEs), this can be a powerful tool to replace multiple SQL queries under the right circumstances.
One important rule to keep in mind when using MERGE, is that the statement must be terminated by a semicolon (;).
Case Study 1: A Simple Upsert
The most common usage of the MERGE statement is to perform what is colloquially called an “upsert,” which is really a diminutive form of UPDATE/INSERT. Without further preamble, let’s set up some test data and get to seeing how the MERGE statement can simplify your life.
CREATE TABLE #Master ( [key] INT IDENTITY PRIMARY KEY ,name VARCHAR(10) ); INSERT INTO #Master VALUES ('Dwain'),('Jeff'),('Paul') ,('Anon'),('Ralph'),('Tom'),('Sally'); CREATE TABLE #Staging ( [key] INT PRIMARY KEY ,[NewName] VARCHAR(10) ); INSERT INTO #Staging VALUES (2, 'Bob'),(4, 'Jim'),(6, 'Marvin'), (10, 'Buddy'); SELECT * FROM #Master; SELECT * FROM #Staging;
The results in the two tables as displayed by the SELECT are:
key name 1 Dwain 2 Jeff 3 Paul 4 Anon 5 Ralph 6 Tom 7 Sally key NewName 2 Bob 4 Jim 6 Marvin 10 Buddy
Our intention is to update (based on [key] in #Staging) the corresponding row by [key] in #Master. If the [key] in #Staging matches none of our [key] values in #Master, then insert a new row. The new row’s [key] does not need to match the value in the staging table. We can easily do this with a MERGE statement as follows:
MERGE #Master t USING #Staging s ON s.[key] = t.[key] WHEN MATCHED THEN UPDATE SET name = s.[NewName] WHEN NOT MATCHED THEN INSERT (name) VALUES (s.[NewName]); SELECT * FROM #Master;
The final SELECT result is as follows:
key name 1 Dwain 2 Bob 3 Paul 4 Jim 5 Ralph 6 Marvin 7 Sally 8 Buddy
You can see that Bob replaced Jeff, Jim replaced Anon and Marvin replaced Tom, and also that Buddy was added at the end.
The way the statement works is as follow:
- The table name immediately after the MERGE keyword is the target table, in this case #Master, which we have aliased as t for easy understanding.
- The USING table is the source, so #Staging will be merged into #Master.
- The ON keyword represents the matching criteria between the records in the two tables. You should not think of this in the same way that ON appears after a JOIN as it operates quite differently.
- Following those parts of the statement, are any number of WHEN clauses. The MATCHED criterion indicates a match based on the ON criteria. It can be combined with additional matching criteria if required.
- NOT MATCHED (implied as BY TARGET), means that when a source row does not exist in the target table, we’re going to do something.
- Following MATCHED or NOT MATCHED is the keyword THEN followed by either an INSERT or an UPDATE.
You can also use DELETE (instead of UPDATE or INSERT) and if you’d like to learn about how to DELETE rows from the target table, I suggest you read this article to understand exactly how it works: A Hazard of Using the SQL Merge Statement and the potential dangers when using it.
Case Study 2: A More Complicated MERGE
Suppose we have the following sample table and data:
CREATE TABLE #ItemTest ( ID INT NOT NULL ,LineID INT NOT NULL ,ProductID INT NULL ,PRIMARY KEY(ID, LineID) ); INSERT INTO #ItemTest (ID, LineID, ProductID) SELECT 100, 1, 5 UNION ALL SELECT 100, 2, 15 UNION ALL SELECT 100, 3, 8 UNION ALL SELECT 100, 4, 25 UNION ALL SELECT 200, 1, 11 UNION ALL SELECT 200, 2, 100 UNION ALL SELECT 200, 3, 41 UNION ALL SELECT 200, 4, 10 UNION ALL SELECT 200, 5, 5 UNION ALL SELECT 200, 6, 30 UNION ALL SELECT 300, 1, 20; SELECT * FROM #ItemTest;
From the final SELECT, we see that our data appears as follows:
ID LineID ProductID 100 1 5 100 2 15 100 3 8 100 4 25 200 1 11 200 2 100 200 3 41 200 4 10 200 5 5 200 6 30 300 1 20
Notice how the entries for each ID contain a sequentially numbered LineID (1 to 4 for ID=100 and 1 to 6 for ID=200). Our business requirement is that we need to delete some rows and at the same time preserve the row numbering for LineID without introducing any gaps. So for example, if we need to delete LineID=3 from ID=100, we need to renumber LineID=4 for that ID to be LineID=3.
Ignoring for the moment that it’s probably poor application design to have this row renumbering requirement, this can be accomplished with a MERGE. Since it is a bit more complicated we’ll develop it in a couple of steps to help you understand. First, let’s say we want to delete three rows. We’ll put those into a table variable (a feature introduced in SQL Server 2005).
DECLARE @RowsToDelete TABLE ( ID INT ,LineID INT ,PRIMARY KEY (ID, LineID) ); INSERT INTO @RowsToDelete (ID, LineID) SELECT 100, 3 UNION ALL SELECT 200, 2 UNION ALL SELECT 200, 4;
Note how we can create a PRIMARY KEY on a table variable. While not needed in this case, if you had lots of rows it will improve the performance of what we’re about to do.
Now we’ll construct the following query which will require some explanation:
SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID ,rn=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID)) ,XX=NULLIF(a.LineID, b.LineID) FROM #ItemTest a LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID;
XX is included only to illustrate what NULLIF is doing for us. This produces the following results:
ID LineID ProductID LineID2 rn XX 100 3 8 3 1 NULL 100 1 5 NULL 2 1 100 2 15 NULL 3 2 100 4 25 NULL 4 4 200 2 100 2 1 NULL 200 4 10 4 2 NULL 200 1 11 NULL 3 1 200 3 41 NULL 4 3 200 5 5 NULL 5 5 200 6 30 NULL 6 6 300 1 20 NULL 1 1
Each row from #ItemTest is returned because it is the left table of the LEFT JOIN. Matching rows from our @RowsToDelete temporary table have a value in LineID2, while rows not matched have a value of NULL (exactly how you’d expect the LEFT JOIN to work). The result in XX shows us that when the LineID of #ItemTest matches the LineID of @RowsToDelete, we get a NULL and NULL values usually sort first (there is a SQL Server setting that controls this). So in each case, the rows we want to delete are sorted to the top of the grouping (on ID).
For the 3 rows in our @RowsToDelete table, we have 1 for ID=100 and 2 for ID=200 (these counts are easy enough to obtain in SQL). So what happens if we subtract that count from rn?
WITH CountItemsToDelete (ID, c) AS ( SELECT ID, COUNT(*) FROM @RowsToDelete GROUP BY ID ) SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID ,[rn-c]=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))-c FROM #ItemTest a LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID JOIN CountItemsToDelete c ON a.ID = c.ID;
The results now appear as:
ID LineID ProductID LineID2 rn-c 100 3 8 3 0 100 1 5 NULL 1 100 2 15 NULL 2 100 4 25 NULL 3 200 2 100 2 -1 200 4 10 4 0 200 1 11 NULL 1 200 3 41 NULL 2 200 5 5 NULL 3 200 6 30 NULL 4
Note how the row for ID=300 has been eliminated by the INNER JOIN to our Common Table Expression (CTE) CountItemsToDelete. Looking at the [rn-c] column, we see that for rows where LineID2 is not NULL, the value is meaningless. But for rows where LineID2 is NULL, [rn-c] is precisely the final row number we’ll need to assign to LineID after deleting the rows we want to delete! Now we have enough information to write this into a MERGE statement:
WITH CountItemsToDelete (ID, c) AS ( SELECT ID, COUNT(*) FROM @RowsToDelete GROUP BY ID ), SourceItems AS ( SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID, c ,rn=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID)) FROM #ItemTest a LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID JOIN CountItemsToDelete c ON a.ID = c.ID ) -- The target table MERGE #ItemTest t -- The source table USING SourceItems s -- Matching criteria: lines up rows from SourceItems exactly with rows -- from our target table (except for ID=300 which is not in the source) ON t.ID = s.ID AND s.LineID = t.LineID -- LineID2 is not NULL for rows we need to delete WHEN MATCHED AND s.LineID2 IS NOT NULL THEN DELETE -- LineID2 is NULL for rows where we've calculated the new line number WHEN MATCHED AND s.LineID2 IS NULL THEN UPDATE SET LineID = rn-c; SELECT * FROM #ItemTest;
The results shown in the final SELECT clearly indicate that this MERGE query has satisfied our business requirement.
ID LineID ProductID 100 1 5 100 2 15 100 3 25 200 1 11 200 2 41 200 3 5 200 4 30 300 1 20
To further improve the performance of the query, you can change the second MATCHED criteria to this, to avoid updating rows where the LineID isn’t changing.
WHEN MATCHED AND s.LineID2 IS NULL AND t.LineID <> rn-c THEN
To do this otherwise in SQL you would first need to DELETE the rows you want to delete, and then run a separate UPDATE to correct the row numbers that need correcting.
Summary of our Learning
Today we have learned about the MERGE query and how it can be used to replace multiple queries in the case of:
- UPDATE/INSERT (the “Upsert”)
- DELETE/UPDATE
I encourage you to also read the linked article about hazards present if you utilize the full capability of the MERGE (WHEN NOT MATCHED SOURCE THEN DELETE). The article shows a simple way of limiting the scope of the DELETE to avoid the hazard.
We also learned about SQL table variables, which are very handy under certain circumstances. They do have their drawbacks though, and some day we may blog on the differences between them and temporary tables.
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved
Make it Work, Make it Fast, Make it Pretty
When I first heard this, it struck me as being a remarkably concise wisdom applicable to virtually any programming task. The entire quotation is actually:
“Make it work, make it fast, then make it pretty… and it isn’t finished until it is pretty!”
— SQL MVP Jeff Moden (RedGate’s 2011 Exceptional DBA of the Year)
In case you don’t know what an MVP is, it stands for Most Valued Professional, and it is an award that Microsoft confers only to the best of the best in their Microsoft-focused technical skills.
Throughout the course of this article I will ask the reader a series of questions. Each question is designed as a thought question. When you encounter a question, you should mentally form a picture in your head of what your answer is. After the question I’ll provide you with my take on the answer. If your answers are not the same as my answers, you should take the time to reflect on why they are different.
Taking pride in your work is a very important thing. We all like to think we’re good at our jobs. Could you be better at your job? The answer is most certainly, because all of us could always be better. This article offers a philosophy by which any programmer can improve on their ability to perform their job. I sincerely hope that you can learn something from it.
Since the quotation on which this article is based was said by a really talented SQL DBA, towards the end of this article we’ll provide a code example in SQL. We’ll provide a good and bad example. Once again, you should reflect on this example and understand why it is bad or why it is good.
In case you were wondering, I have seen some of Jeff Moden’s work. I can assure you that he lives this philosophy every day. I have seen him produce some of the most remarkably ingenious and high-performing solutions to SQL problems, and in the end he always makes the code pretty. He will probably not be pleased with me for writing this article because he is also very modest about his accomplishments. That my dear reader is how you become a Microsoft MVP!
Make it Work
When you are writing SQL or coding in any programming language for that matter, the most important thing to ensure is that the code you have written meets the functional requirements. This means that you, as a Developer, must not only write the code but you must also thoroughly unit test it. That’s right, testing is not just for Testers to do!
To thoroughly unit test a piece of code, you must consider not just what happens when you execute your code against what you expect in terms of the way a business user may end up running your code. You must take into consideration unexpected actions of those same users, or in the case of SQL queries, unexpected but bad data in the tables you’re going against.
A good Developer will also take the time to document the testing that he or she performed. This documentation, if available, may help the testing team to identify other potential cases where the code that was written may fail to perform to expectations.
So as a Developer, do you believe that you perform sufficient unit testing so that your testing team finds no issues when they perform their testing? The answer is probably not, but that is the goal every Developer should focus on if they want to be considered “better” at their jobs.
Make it Fast
All SQL is not alike. I can probably write at least three to four queries that will satisfy any specific business requirement. The key is to select the query that will run the fastest, and there is some effort involved in making that happen.
Here’s another thought question for you. Has anyone ever told you that a query you have written runs too fast? Chances are, the answer to that question is no. Given the choice between two queries that return equivalent results, business users would almost always choose to use the query that runs the fastest. Performance counts! When I asked the first question to a good friend of mine by the name of Chris Morris (who is an expert at T-SQL), he had this story to relate:
“Some guy who’s been working at the same shop for three or four years without opening a book or visiting a forum but thinks he’s a T-SQL hero has a query that he swears can’t be made to run any faster. It takes twenty minutes to run. You rewrite it so that it works in a completely different way – which he hasn’t been exposed to – and it runs in two seconds. The first thing he’ll say is ‘That’s far too fast – it can’t possibly be doing all that work in so little time.’ Of course, the catch is that it isn’t doing all that work.”
I found that last bit quite humorous but also absolutely true. Two queries that return the same results do not have to perform the same amount of work. The one that does the least work is most likely to perform better.
There are some people who insist that, in order to write fast-performing queries, you must be an expert in understanding the query‘s execution plan. While that can certainly be helpful, not many are true experts in interpreting a query’s execution plan. In fact, the execution plan can oftentimes be misleading, specifically when comparing the “cost” of two queries. I’ve seen cases where two queries are rated as having a cost of 0% and 100%, yet the second query is much faster than the first.
To write faster queries, you don’t need to be an expert at reading a query’s execution plan but you do need to keep in mind a few very basic fundamentals.
- Avoid CURSORs as they can be really slow in T-SQL. 99.99% of the time you can construct an alternate, set-based approach that will perform much faster than a CURSOR. The best favor you can do for yourself to improve the performance of your SQL is to forget the syntax for a CURSOR, or better yet completely forget that they exist.
- Your query should only touch the rows it needs to touch and it should try to touch those rows only once. If it can touch only the entry in an index instead of the row, that is even better.
- “Stop thinking about what you want to do to a row. Think instead about what you want to do to a column.” – This is another quote from Jeff Moden.
- The only way to judge whether a query is fast or not is to have something to compare it against, like another query that returns exactly the same results.
- I like to use what I call “code patterns” as a guideline to writing high performance SQL. In essence what this means is to know in advance the fastest performing methods for solving a particular problem and use that code pattern as the basis for the query that I am writing.
- Add to your testing regimen the one million row test harness (look for this in a future blog). Write every query as if you expect it to be run one million times per day against one million rows of data.
I’ll provide more guidance in these areas in future blogs, as most of my interest is in writing high performance SQL.
Keep in mind that I only obtain very basic information from a SQL query’s execution plan. I mostly rely heavily on memorizing the fastest performing code patterns and using them when they are appropriate. That and keeping to the other bullet points I listed above.
Why should I take the time to try writing a faster performing query, when I can rely on SQL’s Database Tuning Advisor (DTA) to suggest an INDEX that will make my query perform faster? The answer to that is that INDEXes create overhead that slows down INSERTs, UPDATEs, DELETEs and MERGEs. Too much INDEXing in a database can drag down its overall performance way more than the little boost it gives to poorly written queries that perform inadequately.
The other thing that people usually have to say about this is that they don’t have the time it takes to generate more than one query form to solve a particular requirement because of time pressures and deadlines. Once again, memorizing the best practice code patterns can help you here. Once you learn them, and you can code them almost in your sleep, you’ll be able to generate multiple solutions to the same problem very quickly. Then, you just need to create the one million row test harness (also very easy to do) to prove which is fastest.
Make it Pretty
Let’s look at a query that is included in my earlier blog on Calendar Tables.
select [Date] from (select [Date], [WkDName2], FromDate, rn=row_number() over (partition by [YYYYMM] order by [Date] desc) from dbo.Calendar a left loin dbo.Holidays b on a.[DATE] between FromDate AND isnull(ToDate, FromDate) where [Date] between '2014-01-01' and '2014-12-31' and [Last] = 1 and [WkDName2] not in('SA', 'SU') and FromDate IS null) a where rn=1
Now compare this query against the actual query as published in that blog.
SELECT [Date] FROM ( SELECT [Date], [WkDName2], FromDate ,rn=ROW_NUMBER() OVER (PARTITION BY [YYYYMM] ORDER BY [Date] DESC) FROM dbo.Calendar a LEFT JOIN dbo.Holidays b ON a.[DATE] BETWEEN FromDate AND ISNULL(ToDate, FromDate) WHERE [Date] BETWEEN '2014-01-01' AND '2014-12-31' AND [Last] = 1 AND -- Remove Saturdays and Sundays [WkDName2] NOT IN('SA', 'SU') AND -- Remove holidays FromDate IS NULL ) a WHERE rn=1;
Both of these queries are exactly the same and would produce the same results. One of the things Developers may often forget is that whatever language you are writing in, it is unlikely that no one will ever look at your code again. Maintenance of applications is a fact of life, so the “make it pretty” step is invaluable to those poor folks that come along afterwards and need to maintain your code. As a Developer have you ever said “gee I really don’t want to try to figure out what that person before me did, so I’ll just rewrite it rather than modify it?” The answer to this question is probably and the reason for that is most likely that the original Developer didn’t take the time to “make it pretty.”
So what is wrong with the first query, when compared to the second? What makes it harder to maintain?
- The first of the two can be called “stream of consciousness” coding, or to put it another way “write-only SQL.” I call it write-only SQL because you can’t easily read it. In other words, whoever comes along and looks at it later is going to have a difficult time trying to understand what it does.
- The first query has no line breaks or indentation to assist the reader in identifying the individual steps that are being performed. For example, can you easily see that it has a derived table embedded in it? The answer is probably not.
- Personally, I like to see language keywords in upper case. In the first query, none of the keywords are in upper case.
- The first query has no comments in it. The second query could probably be improved by placing a comment before it that says “Calculate Aware pay days based on the last work day of the month.” To make it pretty, you should always include some comments to assist the reader that will come along after you to maintain it.
- The first query does not end in a semi-colon. Why is that important you might ask? SQL Server does not require that a semi-colon terminate every SQL statement. Well, there are cases that it does require one (e.g., the MERGE statement) and there are also cases like Common Table Expressions (CTEs) where T-SQL requires that you terminate the statement just prior to the CTE with a semi-colon. There may come a day when T-SQL requires semi-colon terminators on every statement, so why not plan ahead for that day now and end all of your queries with a semi-colon? Did you know that ending your query with a semi-colon is an ANSI standard?
Every programming language (T-SQL included) has some standard methodology suggested for indentation, designed to make the code more readable. I’m not going to sit here and tell you that mine is the “best.” I will tell you that you should take a moment to think about how to indent your queries so that they end up being easy on the eyes of the readers that come along later. And then, once you have a methodology you are comfortable with, stick to it. Apply it to every query that you write without exception.
Once you get the hang of this, you’ll probably find yourself indenting complex queries according to your indentation methodology as you’re coding the query. That’s great because it saves you some cleanup work later on, and it will probably help you as you unit test the query because if it is complex (consisting of many steps) it is easier to run one step at a time, looking at intermediate results as necessary.
Conclusion
I can promise you that, when you read my blogs I’ve taken the time to make sure that each query I publish works, that it is the fastest it can possibly be and that it is (reasonably) pretty. There will be cases where I compare the performance of various code patterns that can be used to arrive at the same solution, particularly if one or the other won’t run in earlier versions of SQL.
Above I had some disparaging words to say about both CURSORs and INDEXes. My advice on CURSORs stands – avoid them at all costs and most of the time you can. Possibly one of the reasons that some people use CURSORs in T-SQL is because they’ve become accustomed to using them in Oracle SQL, where they perform much better.
INDEXes in general should be used sparingly, and when you need to INDEX a table (other than the PRIMARY KEY of course) it should be as a result of the pre-planning that you’ve done based on knowing the kinds of data retrieval operations you’ll be performing, rather than as an afterthought because some tool like DTA suggests that you do so.
As a quick example of this, there is a SQL concept known as relational division. All of the code patterns for retrieval in the case of relational division are highly dependent for their performance on the INDEXes that are created for the table. This is something we’ll explore in a future blog.
I will also be blogging on various common business problems that you can solve in SQL, and provide you with the best practice code pattern to ensure that the solutions I provide perform the best that they can. Oftentimes, if you already know the alternatives, you’ll immediately jump to the fastest solution available and that may allow you to skip the one million row test. Really the only time that you should skip that test is when you are 100% sure you’re using the best practice code pattern for performance.
In the end, you should never be satisfied that your code is “good enough” when with just a little bit of study and that little bit of extra effort it can be the “best that it can be.”
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved
Tally Tables in T-SQL
The Tally (or Numbers) table is one of the most useful constructs in intermediate-level SQL. Have you ever written a T-SQL WHILE loop? How about a CURSOR? If your answer to either of those questions is yes, you’ve probably been coding some really poor-performing SQL and you may not even realize it. Tally tables are one way that you may be able to avoid using a loop or a CURSOR, instead creating a true set-based solution that is blazingly fast and reliable.
Tally Tables in SQL 2000
The ability to create a Tally table has been around for a long time, going way back to SQL Server’s roots in Sybase. One of the most common methods of creating a permanent table is:
SELECT TOP 1000000 N=IDENTITY(INT, 1, 1) INTO dbo.Numbers FROM master.dbo.syscolumns a CROSS JOIN master.dbo.syscolumns b; ALTER TABLE dbo.Numbers ADD CONSTRAINT NBR_pk PRIMARY KEY(N); SELECT TOP 5 N FROM dbo.Numbers;
Obviously, in order to run this you need access to the master.dbo.syscolumns table that is referenced. Most of the time this particular table has at least 4,000 rows in it so constructing a Tally table with 1,000,000 (or even 10,000,000) rows is no problem at all. Our Tally table will start with 1 (results below are from the final SELECT).
N 1 2 3 4 5
If you create a permanent Tally table, it’s important to have a CLUSTERED INDEX (the PRIMARY KEY) on the single column that’s in the table for performance reasons. We’ll show some examples of unique ways that you can use a Tally table shortly, but first it’s good to know how to create them. The example above works in SQL Server 2000.
Here’s another Tally table that can generate N’s between 0 and 255 (up to 2047 in later versions like 2008) and can be used in SQL Server 2000.
SELECT N=number FROM master..spt_values WHERE type = 'P';
There are those that might be asking why in the world would you use an undocumented system table like this? Well, rest assured that this table is used internally by many of SQL’s internal stored procedures, so it is not going away any time soon.
Tally Tables in SQL 2005
When SQL Server 2005 came along and introduced the Common Table Expression (CTE) and the ROW_NUMBER() window function (both the subject of earlier blogs as linked), it opened up some new ways to create what are known as in-line Tally tables. Let’s look at a couple of examples.
WITH Tally (N) AS ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.all_columns a CROSS JOIN sys.all_columns b ) SELECT TOP 5 N FROM Tally; -- Or if you don't have access to the sys tables use an in-line -- Tally table known as a "Ben-Gan" style Tally WITH lv0 AS (SELECT 0 g UNION ALL SELECT 0) ,lv1 AS (SELECT 0 g FROM lv0 a CROSS JOIN lv0 b) -- 4 ,lv2 AS (SELECT 0 g FROM lv1 a CROSS JOIN lv1 b) -- 16 ,lv3 AS (SELECT 0 g FROM lv2 a CROSS JOIN lv2 b) -- 256 ,lv4 AS (SELECT 0 g FROM lv3 a CROSS JOIN lv3 b) -- 65,536 ,lv5 AS (SELECT 0 g FROM lv4 a CROSS JOIN lv4 b) -- 4,294,967,296 ,Tally (n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM lv5) SELECT TOP (5) n FROM Tally ORDER BY n;
The second Tally table structure was first suggested by SQL MVP Itzik Ben-Gan.
Never mind how they work, just know that the first will generate somewhere around 16M+ rows and the other will create a little over 4 billion rows (!) if we did not apply the TOP 10. The second one is a case of “stacked” or “cascaded” CTEs. Both are extremely fast; the second one especially because it is done entirely in memory.
Tally Tables in SQL 2008
SQL Server 2008 introduced something that’s known as a Table Value Constructor (TVC), which offers a very compact way to create a Tally table tailored specifically to the precise number of rows you need without using a TOP clause. Here’s an example of that:
WITH Tally (n) AS ( -- 1000 rows SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) ) SELECT * FROM Tally;
That Tally table generates exactly 1,000 rows because each VALUES TVC has exactly 10 zeroes in it. We’ll now look at just a few of the remarkable ways that a Tally table can improve your SQL by helping you to write a set-based solution where otherwise you might need to resort to a loop or CURSOR.
Using a Tally Table to Expand a Row Set
Let’s start with some sample data:
CREATE TABLE #Temp ( ID INT IDENTITY PRIMARY KEY ,StartDT DATETIME ); INSERT INTO #Temp SELECT '2014-02-18 09:20' UNION ALL SELECT '2014-02-19 05:35'; SELECT ID, StartDT ,TT=DATEPART(hour, DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0)) ,TD=DATEADD(day, DATEDIFF(day, 0, StartDT), 0) FROM #Temp;
The results returned are this, where DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0) is a nifty way to truncate the DATETIME value to the nearest hour (or use day instead to truncate to the current day). Here are the results from the above query:
ID StartDT TT TD 1 2014-02-18 09:20:00.000 9 2014-02-18 00:00:00.000 2 2014-02-19 05:35:00.000 5 2014-02-19 00:00:00.000
Suppose what we want is 10 rows that correspond to ID=1, where each is incremented to the next hour of the day (starting at midnight). Likewise, for ID=2 we want 6 rows. We can use a small Tally table for this (remember there’s only 24 hours in a day):
WITH Tally (N) AS ( -- Tally table starting at 0 SELECT 0 UNION ALL -- Now 24 more rows SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0)) c(n) ) SELECT ID, StartDT ,TD=DATEADD(hour, N, DATEADD(day, DATEDIFF(day, 0, StartDT), 0)) FROM #Temp CROSS JOIN Tally WHERE N BETWEEN 0 AND DATEPART(hour, DATEADD(hour, DATEDIFF(hour, 0, StartDT), 0)) ORDER BY ID, TD;
Note how this Tally table has been modified to start at 0 by adding SELECT 0 UNION ALL before the second SELECT. The results of this query are:
ID StartDT TD 1 2014-02-18 19:20:00.000 2014-02-18 00:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 01:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 02:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 03:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 04:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 05:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 06:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 07:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 08:00:00.000 1 2014-02-18 19:20:00.000 2014-02-18 09:00:00.000 2 2014-02-19 15:35:00.000 2014-02-19 00:00:00.000 2 2014-02-19 15:35:00.000 2014-02-19 01:00:00.000 2 2014-02-19 15:35:00.000 2014-02-19 02:00:00.000 2 2014-02-19 15:35:00.000 2014-02-19 03:00:00.000 2 2014-02-19 15:35:00.000 2014-02-19 04:00:00.000 2 2014-02-19 15:35:00.000 2014-02-19 05:00:00.000
Wasn’t that easy and really cool?
Using a Tally Table to Remove Repeated Characters in a String
There are a lot of situations in T-SQL where you need to parse a string. Many of the built-in functions that T-SQL provides stop at the first occurrence of something. But what if you need to perform the same operation on many characters within the string and a built-in function simply won’t do it for you?
This is where a Tally table can really save you. We’ll start with one that has 8000 rows so we can parse each character out of VARCHAR(8000) length string. Follow the comments in the code to see how it works.
DECLARE @S VARCHAR(8000) = 'Aarrrgggh!'; WITH Tally (N) AS ( -- 8000 rows (max length of the VARCHAR string) SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n) ) SELECT n, s FROM ( -- Always choose the first element SELECT n=1, s=LEFT(@S, 1) UNION ALL -- Include each successive next element as long as it’s different than the prior SELECT N, CASE WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1) THEN SUBSTRING(@S, N, 1) -- Repeated characters are assigned NULL by the CASE END FROM Tally WHERE N BETWEEN 2 AND LEN(@S) ) a -- Now we filter out the repeated elements WHERE s IS NOT NULL;
The results displayed for this are below. Note that it works just fine with strings of length 1 or 0 also.
n s 1 A 3 r 6 g 9 h 10 !
It is also case-insensitive, so it removes the lowercase a (second character) because it is a repeat of the first character (capital A). It could be made case sensitive if necessary. Now all we need to do is combine the rows that remain back into our final string using a technique that is similare to Creating a Comma-separated List by author Microsoft Certified Master (MCM) Wayne Sheffield. It just needs a slight modification to not include the comma separator and it can be applied to this case as follows.
WITH Tally (N) AS ( -- 8000 rows SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n) ) SELECT OriginalString=@S, NewString= ( SELECT '' + s FROM ( -- Same base query as the prior example starts here SELECT n, s FROM ( -- Always choose the first element SELECT n=1, s=LEFT(@S, 1) UNION ALL -- Include each successive next element as long as it's -- different than the prior SELECT N, CASE WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1) THEN SUBSTRING(@S, N, 1) -- Repeated characters are assigned NULL by the CASE END FROM Tally WHERE N BETWEEN 2 AND LEN(@S) ) a -- Now we filter out the repeated elements WHERE s IS NOT NULL ) a ORDER BY n FOR XML PATH(''), TYPE ).value('.', 'VARCHAR(8000)' );
Now our result is:
OriginalString NewString Aarrrgggh! Argh!
That is very fast and way cool, so maybe we want to save this as a utility FUNCTION. SQL has many types of User-defined FUNCTIONs, e.g., scalar, table-valued, etc. Scalar-valued functions are known to be total performance nightmares. By far the fastest type of FUNCTION if you can write it this way is to use what’s known as an inline Table Valued Function (iTVF). We can do this for our new utility FUNCTION.
CREATE FUNCTION dbo.RemoveRepeatedChars ( @S VARCHAR(8000) ) RETURNS TABLE WITH SCHEMABINDING RETURN WITH Tally (N) AS ( -- 8000 rows SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) e(n) ) SELECT OriginalString=@S, NewString= ( SELECT '' + s FROM ( SELECT n, s FROM ( -- Always choose the first element SELECT n=1, s=LEFT(@S, 1) UNION ALL -- Include each successive next element as long as it's -- different than the prior SELECT N, CASE WHEN SUBSTRING(@S, N-1, 1) <> SUBSTRING(@S, N, 1) THEN SUBSTRING(@S, N, 1) -- Repeated characters are assigned NULL by the CASE END FROM Tally WHERE N BETWEEN 2 AND LEN(@S) ) a -- Now we filter out the repeated elements WHERE s IS NOT NULL ) a ORDER BY n FOR XML PATH(''), TYPE ).value('.', 'VARCHAR(8000)' );
An iTVF with SCHEMABINDING is the fastest possible FUNCTION we can write and it works for this case. We can now call our function to parse as many strings as we need (like from a column in a table).
WITH SampleData (s) AS ( SELECT 'Aarrrgggh!' UNION ALL SELECT 'Noo repeaatting chharracterss wanntted.' ) SELECT * FROM SampleData CROSS APPLY dbo.RemoveRepeatedChars(s);
CROSS APPLY is something we’ll cover in a future blog but we’ve provided some references at the end if you’d like to learn more now. The results we get from the above script are:
OriginalString NewString Aarrrgggh! Argh! Noo repeaatting chharracterss wanntted. No repeating characters wanted.
Parsing a Comma Separated List of Values
A more common problem in SQL is the parsing of a comma-separated list of values. We’ve saved the best for last, because in this case we can truly point you to a best-of-breed solution.
How many of you have created a SQL FUNCTION to split a delimited list of values? Chances are you pulled down some tired old looping code from somewhere on the Internet that works but is dog slow. You probably don’t even realize how bad it is. Now is the time to locate it and replace it as what we’re about to tell you about will simply blow the doors off of any other SQL-based delimited string splitter!
SQL MVP Jeff Moden wrote the outstanding Tally OH! An Improved SQL 8K “CSV Splitter” Function (viewed more than 40,000 times as of this writing), which contains lots of outstanding information about Tally tables (another of his articles is referenced at the end), how they can be used to split a delimited string and the performance comparison with alternate solutions. At the end of the article, you can download DelimitedSplit8K, optimized specifically for use with VARCHAR(8000) strings, so don’t try to modify it to VARCHAR(MAX). I have provided a link to that article as over the years the FUNCTION has been revised and improved a few times, so if you go there you’ll be sure to get the most current version.
It also contains information on CLR-based splitters that can be even faster and handle the case of VARCHAR(MAX).
Conclusions
Let’s do a recap of all of the things we learned today:
- The various ways to create a permanent or generate an in-line Tally table in SQL 2000, 2005 and 2008.
- The formula for using DATEDIFF/DATEADD to truncate a date to just the day or to the hour.
- How to expand a row set using a Tally table.
- How to parse a string using a Tally table.
- How to use the FOR XML PATH method to concatenate rows into a single character string.
- How to create a high-performance, schema-bound, in-line Table Valued FUNCTION in SQL.
- Where to find the fastest SQL-based delimited string splitter on the planet (DelimitedSplit8K).
We’ve also come away with two utility FUNCTIONs we can add to our SQL tool chest (see below for a third one):
- RemoveRepeatedChars
- DelimitedSplit8K
For further reading on CROSS APPLY (and its companion OUTER APPLY), here are two great articles by SQL MVP Paul White to get you started.
Further suggested reading:
- The “Numbers” or “Tally” Table: What it is and how it replaces a loop, which is also by SQL MVP Jeff Moden
- Splitting Strings Based on Patterns, which contains PatternSplitCM – another must have high-performing, utility FUNCTION for your SQL tool box that uses a Tally table to work its magic
- An even faster version of DelimitedSplit8K that works only in SQL 2012 or later in this article by Eirikur Eiriksson: Reaping the benefits of the Window functions in T-SQL
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved
Getting an INDEX SEEK to Speed up LIKE “%string%” Searches
In today’s blog I will attempt to challenge the popularly held notion that LIKE “%string%” wildcard searches must be slow (Sargability: Why %string% Is Slow).
A Sample Table Populated with 10 Million Rows of Test Data
In order to do this, we’ll need a large table of test data with a composite PRIMARY KEY to demonstrate various aspects of the issue.
CREATE TABLE dbo.TestLIKESearches ( ID1 INT ,ID2 INT ,AString VARCHAR(100) ,Value INT ,PRIMARY KEY (ID1, ID2) ); WITH Tally (n) AS ( SELECT TOP 10000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.all_columns a CROSS JOIN sys.all_columns b ) INSERT INTO dbo.TestLIKESearches (ID1, ID2, AString, Value) SELECT 1+n/500, n%500 ,CASE WHEN n%500 > 299 THEN SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) + SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) + SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) + RIGHT(1000+n%1000, 3) + SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) + SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) + SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1) END ,1+ABS(CHECKSUM(NEWID()))%100 FROM Tally;
While we have set the AString column to contain some NULL values (60% in fact), what we are about to show works even when 0% of the rows are NULL. Let’s start by showing the non-SARGable version of the query and its corresponding actual execution plan.
SELECT ID1, ID2, AString FROM dbo.TestLIKESearches WHERE AString LIKE '%21%';
As expected, the non-SARGable query results in a Clustered Index SCAN on the table. Let’s see if we can affect this by adding the following INDEX.
CREATE INDEX tls_ix1 ON dbo.TestLIKESearches(AString);
When we check the execution plan again for the same query, it now looks like this:
The query is now using the NONCLUSTERED INDEX, but it is still doing a SCAN. So let’s try a slightly different form of the same query, one which we know must generate the exact same results.
SELECT ID1, ID2, AString FROM dbo.TestLikeSearches WHERE AString IS NOT NULL AND AString LIKE '%21%';
All we have done is to modify the filter in the WHERE clause to ignore NULL values, none of which could contain our search string anyway. Now our execution plan looks like this:
Since a SEEK should in theory be better than a SCAN, we’re hopeful that the query’s speed (elapsed time) is improved also, but we’ll get to that in a moment.
The same small change to an UPDATE converts the INDEX SCAN to a SEEK, as demonstrated by the two queries below.
UPDATE dbo.TestLIKESearches SET Value = 300 WHERE AString LIKE '%21%'; UPDATE dbo.TestLIKESearches SET Value = 400 WHERE AString IS NOT NULL AND AString LIKE '%21%';
On a SELECT however, we can show just how easily this SEEK can be broken. Let’s modify our query to also return the Value column.
SELECT ID1, ID2, AString, Value FROM dbo.TestLikeSearches WHERE AString IS NOT NULL AND AString LIKE '%21%';
By adding Value into the returned results, we have broken the SEEK; it has reverted to a SCAN.
But there is a way around this. We can use a query hint (FORCESEEK) to restore our INDEX SEEK.
SELECT ID1, ID2, AString, Value FROM dbo.TestLikeSearches WITH(FORCESEEK) WHERE AString IS NOT NULL AND AString LIKE '%21%';
Performance Comparison of SCAN vs. SEEK for a LIKE “%string%” Search
The following table summarizes the performance results we got from this method of getting SQL Server to SEEK the INDEX on our string column during the LIKE “%string%” search.
Logical Reads | CPU | Elapsed Time | |
Base SELECT Query Without INDEX | 33193 | 2246 | 628 |
Base SELECT Query With INDEX | 24169 | 2310 | 582 |
Improved SELECT Query (INDEX SEEK) | 13659 | 1513 | 405 |
Improvement | 43% | 35% | 30% |
Base UPDATE Query With INDEX | 146678 | 2434 | 812 |
Improved UPDATE Query (INDEX SEEK) | 136168 | 1763 | 546 |
Improvement | 7% | 28% | 33% |
SELECT with Value (INDEX SCAN) | 33193 | 2620 | 665 |
SELECT with Value (FORCESEEK) | 136193 | 1794 | 455 |
Improvement | -310% | 32% | 32% |
In all cases, by forcing the SEEK (even when it results in an added Key Lookup) we were able to improve elapsed and CPU times to a measurable degree. Only the FORCESEEK query hint on the SELECT when non-indexed columns are included actually increased the logical IO count (albeit by quite a bit).
Conclusion
Despite the commonly accepted belief that a LIKE “%string%” search is limited to an INDEX SCAN, we have proven that it is possible to make it happen with a properly constructed non-clustered INDEX on the string being searched.
The SEEK is easily obtained for either SELECT or UPDATE, and probably DELETE and/or MERGE as well (although we didn’t test these cases) with just a small additional filtering criteria (excluding NULLs).
The SEEK can also just as easily be broken by including columns in the SELECT that aren’t in our non-clustered INDEX, however even then using a FORCESEEK query hint can restore it.
You’ll be able to read more about the results of this technique and how well it improves performance against a wider range of NULL values in an upcoming article that I have submitted to http://www.sqlservercentral.com/. You can generally expect that as the percentage of NULL values decreases, the performance gain will not be as much.
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved
Common Table Expressions in SQL
In SQL Server 2005, Microsoft introduced the Common Table Expression (CTE). CTEs share similarities with VIEWS and derived tables, but are really not the same as either. Oracle SQL also supports CTEs and while the syntax is basically the same, some of the properties that we’ll discuss may be slightly different.
Let’s take a look at a very simple CTE to get started.
CREATE TABLE #Vehicles ( VehicleID VARCHAR(5) ,VehicleType VARCHAR(5) ,Location VARCHAR(3) ,PRIMARY KEY (VehicleID) ); INSERT INTO #Vehicles VALUES ('12211', 'TRUCK', 'BKK'),('12212', 'CAR', 'BKK'),('12213', 'TRUCK', 'CNX') ,('12214', 'CAR', 'CNX'),('12215', 'TRUCK', 'HDY'),('12216', 'CAR', 'HDY'); WITH BKKVehicles AS ( SELECT VehicleID, VehicleType, Location ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID) FROM #Vehicles WHERE Location = 'BKK' ) SELECT VehicleID, VehicleType, Location FROM BKKVehicles;
Our CTE begins with the keyword WITH and ends at the closing parenthesis. Below the CTE is what I’ll call the “main query.” This CTE retrieves only vehicles whose location is BKK and adds a ROW_NUMBER to that result:
VehicleID VehicleType Location rn 12211 TRUCK BKK 1 12212 CAR BKK 2
There is a widespread belief that CTEs can improve performance, but the truth is they neither improve nor detract from performance. They are simply a way to make your code more readable, although they do offer a couple of things that may also make your life a bit easier. Let’s look at some of the rules/properties of a CTE, comparing and contrasting with VIEWs and derived tables where appropriate.
- You must remember to terminate the statement preceding the CTE with a semicolon, otherwise SQL will throw this error message at you:
Incorrect syntax near the keyword ‘with’. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.
- You can rename the columns returned by the CTE by providing the column names between parentheses immediately after the name of the CTE (our CTE is named BKKVehicles) and before “AS.”
- A CTE must contain a SELECT and it may not contain INSERT, UPDATE, DELETE or MERGE statements.
- CTEs will inherit the indexing of the tables upon which they are based.
- CTEs are more like a derived table than a VIEW because they exist only for the life of the main query which follows them. In order to reuse a CTE in a subsequent query, you must resupply the same code to the second query.
- You can use a CTE as source or target in UPDATE, INSERT, DELETE and MERGE queries, but there are some restrictions. This is similar to a VIEW.
- You may have more than one CTE associated with a query. When more than one CTE is defined, they are referred to as “cascaded” or “stacked” CTEs. You may not however nest CTEs within CTEs.
- You may code CTEs within VIEWs, FUNCTIONs or Stored Procedures.
- You may refer to a CTE more than once in the main query. Contrast this with a derived table, which if you’d like to use it more than once, must be coded as many times as you need it.
- You may refer to a CTE in another CTE as long as the CTE being referred to occurs above the CTE that is doing the referring in the CTE stack.
- CTEs support recursive queries.
CTEs are most often used with SELECT, but you can UPDATE through a CTE as in this example also.
WITH BKKVehicles AS ( SELECT VehicleID, VehicleType, Location ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID) FROM #Vehicles WHERE Location = 'BKK' ) UPDATE BKKVehicles SET VehicleType = 'VAN' WHERE Location = 'HDY';
In this example, no rows are updated because the table returned by the CTE does not contain any rows whose location is HDY.
Likewise you can delete through a CTE.
WITH BKKVehicles AS ( SELECT VehicleID, VehicleType, Location ,rn=ROW_NUMBER() OVER (ORDER BY VehicleID) FROM #Vehicles WHERE Location = 'BKK' ) DELETE FROM BKKVehicles WHERE rn > 1; SELECT * FROM #Vehicles;
After the DELETE runs, the rows remaining in our table are:
VehicleID VehicleType Location 12211 TRUCK BKK 12213 TRUCK CNX 12214 CAR CNX 12215 TRUCK HDY 12216 CAR HDY
This is actually quite a useful method of deleting duplicate rows from a table.
We mentioned that when updating or deleting through a CTE, certain restrictions may apply. Basically all that means is that the target rows in the target table must be unambiguous. For example, if you happen to JOIN the target table with another table, the JOIN must be exact (no duplicate rows generated) otherwise the effort will likely fail. Highly complex queries involving many JOINs, etc. may also confuse the compiler and make it unable to recognize the target table.
Here’s an example of using a CTE as the source for an INSERT, to generate some additional sample rows in our table. It also demonstrates how you can name the columns generated by the CTE.
WITH MoreRows (VehicleID, VehicleType, Location) AS ( SELECT '12218','VAN','BKK' UNION ALL SELECT '12219','VAN','CNX' UNION ALL SELECT '12220','VAN','HDY' ) INSERT INTO #Vehicles (VehicleID, VehicleType, Location) SELECT VehicleID, VehicleType, Location FROM MoreRows;
A CTE can also act as either source or target tables for a MERGE, but since MERGE is a topic that is deserving of consideration on its own, we’ll do a separate blog entry for that.
Final Remarks
We have demonstrated how a Common Table Expression can be used in SELECT, UPDATE, DELETE and INSERT statements. CTEs are basically a way to improve the readability of the SQL code you produce, having no impact on their performance.
While we have mentioned that CTEs can be used for recursive queries, we haven’t provided any examples of this because it is quite an advanced topic. However if you are interested in how to do this, you might want to take a look at Exploring Recursive CTEs by Example.
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved
The T-SQL ROW_NUMBER() Function
If you learn one new T-SQL (i.e., Microsoft SQL Server) concept today it should be ROW_NUMBER(). Introduced in SQL 2005, this function is one of 4 window ranking functions (the others are RANK(), DENSE_RANK() and NTILE()). Oracle SQL has a similar capability.
Let’s first create some sample data we can use for a demonstration.
CREATE TABLE #ROWNUMBER_Demo ( ID INT ,MyDate DATETIME ,Price MONEY ,PRIMARY KEY (ID, MyDate) ); INSERT INTO #ROWNUMBER_Demo SELECT 1, '2012-03-04', 23.22 UNION ALL SELECT 1, '2012-03-15', 25.15 UNION ALL SELECT 1, '2012-05-10', 28.47 UNION ALL SELECT 2, '2012-02-28', 15.10 UNION ALL SELECT 2, '2012-03-22', 18.22 UNION ALL SELECT 2, '2012-05-01', 21.43 UNION ALL SELECT 3, '2012-04-01', 45.06 UNION ALL SELECT 3, '2012-05-12', 48.23 UNION ALL SELECT 3, '2012-06-01', 51.66; SELECT * FROM #ROWNUMBER_Demo;
The results in our sample table are:
ID MyDate Price 1 2012-03-04 00:00:00.000 23.22 1 2012-03-15 00:00:00.000 25.15 1 2012-05-10 00:00:00.000 28.47 2 2012-02-28 00:00:00.000 15.10 2 2012-03-22 00:00:00.000 18.22 2 2012-05-01 00:00:00.000 21.43 3 2012-04-01 00:00:00.000 45.06 3 2012-05-12 00:00:00.000 48.23 3 2012-06-01 00:00:00.000 51.66
To use the ROW_NUMBER() function you must supply an OVER predicate consisting of PARTITION (optional) and ORDER BY (required). Let’s take a look at an example and see the results.
SELECT ID, MyDate ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate) ,rn2=ROW_NUMBER() OVER (ORDER BY MyDate) FROM #ROWNUMBER_Demo ORDER BY ID, MyDate;
The results returned are:
ID MyDate rn1 rn2 1 2012-03-04 00:00:00.000 1 2 1 2012-03-15 00:00:00.000 2 3 1 2012-05-10 00:00:00.000 3 7 2 2012-02-28 00:00:00.000 1 1 2 2012-03-22 00:00:00.000 2 4 2 2012-05-01 00:00:00.000 3 6 3 2012-04-01 00:00:00.000 1 5 3 2012-05-12 00:00:00.000 2 8 3 2012-06-01 00:00:00.000 3 9
For rn1 (where PARTITION is applied) you see that it assigns row numbers 1, 2, 3 to the rows within each ID (the column specified to PARTITION on) based on the ordering of MyDate. For the case without PARTITION (rn2), the entire set is the partition so the row numbers are 1, 2, …, 9, again based on the ordering of the MyDate column.
Eliminating Duplicates
ROW_NUMBER() is a particularly fast way to eliminate duplicate records. Suppose you want to return only one record within each ID; specifically the one whose date is the latest. You must note that ROW_NUMBER() cannot be used on the WHERE clause, so it is necessary to wrap this query in an outer query as follows:
SELECT ID, MyDate FROM ( SELECT ID, MyDate ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate DESC) FROM #ROWNUMBER_Demo ) a WHERE rn1 = 1 ORDER BY ID;
Note the DESC sort applied to MyDate. These results are:
ID MyDate 1 2012-05-10 00:00:00.000 2 2012-05-01 00:00:00.000 3 2012-06-01 00:00:00.000
Of course, you’re probably saying you can achieve the same results using a GROUP BY (and you’d be correct), like this.
SELECT ID, MyDate=MAX(MyDate) FROM #ROWNUMBER_Demo GROUP BY ID ORDER BY ID;
But try using that query to also return the Price column that corresponds to the MAX date. You cannot! But you can when you use ROW_NUMBER().
SELECT ID, MyDate, Price FROM ( SELECT ID, MyDate, Price ,rn1=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MyDate DESC) FROM #ROWNUMBER_Demo ) a WHERE rn1 = 1 ORDER BY ID;
Results:
ID MyDate Price 1 2012-05-10 00:00:00.000 28.47 2 2012-05-01 00:00:00.000 21.43 3 2012-06-01 00:00:00.000 51.66
Conclusion
ROW_NUMBER() is a very versatile T-SQL window ranking function. Besides using it to eliminate duplicates, it has a great many other very practical purposes that we’ll explore in future entries on this blog.
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved