Ruminations on Writing Great T-SQL

Today we’re going to try to extend some advice that I once heard from SQL MVP Jeff Moden, which I wrote about in my previous blog entitled “Make it Work, Make it Fast, Make it Pretty.”  That advice was just as pertinent then as it is now, but perhaps I can add to it what I think makes for really great T-SQL code.

Great T-SQL code is going to start with all of the advice that I gave in that blog.  That same advice really applies to any coding language, not just T-SQL.

But great T-SQL code is actually much more than that.  Let’s look at some of the aspects that transcend what I’ve previously written.

Elegance of a Set-based Algorithm

Obviously in our T-SQL we’d like to use predominantly set-based algorithms.  Depending on the problem being addressed, there are often just a few very elegant solutions, and probably a bunch of solutions that are not nearly as elegant.

In simplicity there is often elegance.  Simplicity, elegance and speed may unfortunately not all play well together.  But if you can come up with a solution that is elegant, simple and fast, that’s usually going to get you a lot closer to writing great T-SQL.

Reusability

From Facts and Fallacies of Software Engineering by Robert L. Glass, here is Fact #18: “There are two ‘rules of three’ in reuse: (a) it is three times as difficult to build reusable components as single-use components, and (b) a reusable component should be tried out in three different applications before it will be sufficiently general to accept into a reuse library.”

I will suggest that in order to be great, T-SQL code must be reusable.  I’ve written before that well-focused code that solves only a specific problem is probably going to be a whole lot faster than generalized code that solves a wider variety of problems.  So great T-SQL should focus on a specific, common problem that occurs frequently, so that it can be reused (as a tool) whenever the need arises.

I tend to agree with Mr. Glass in this case; finding a common problem and then building a fast-performing, simple and elegant T-SQL solution for it is probably a lot more challenging than building a one-off query.  Part of this challenge, of course, is simply finding a suitably simple problem to focus on that is also encountered commonly enough to justify building something reusable.

Frequency and Quality of its Reuse

Since beginning my study of T-SQL (and of course I remain a student of the subject to this day), I’ve seen a fair number of really elegant solutions to specific problems.  Of the ones that can be classified as reusable, I’ve seen many that solve a somewhat common problem well, but unfortunately the problem itself is not so common that the solution will be reused often.  Let’s face it: maybe you’ve got a solution for Gaps and Islands that is outrageously wonderful, simple, fast and elegant.  But Gaps and Islands isn’t a particularly common problem to encounter.  Sure, you see it once in a while, and certainly it is good to have an elegant solution waiting in the wings for when you need it.  But commonality of the problem is a great influencer, at least to me, of what makes for great T-SQL.

Quality of reuse is a different story entirely.  Let’s say you’ve got this (arguably) great solution and it sees quite a fair bit of reuse.  Who are the folks reusing it?  Are they peers in the SQL community?  Are they SQL Server MVPs who have embraced your method and use it themselves?  Or is it the great unwashed masses?

Facebook is a great application because millions of people use it.  But who are those users?  I’d say it is a much higher-quality pattern of reuse to have a group of peers, or those more skilled than you are, using your reusable code, rather than just anybody in the world.  So while Facebook may be a great app, it may not represent great code!  The quality of the people reusing your code counts for something.

Productivity

Most great T-SQL code should enhance your productivity.  By that I mean that if you’re a developer, it can be considered a tool that lets you avoid spending development time solving silly sub-problems when really you need to be focusing on the bigger problems that relate to the business solutions you’re trying to produce.  As a DBA, your productivity can be enhanced by any T-SQL script that automates some activity and allows you to be more proactive in your database monitoring.

Of course, just because you can solve a problem more quickly because the sub-problems have already been addressed doesn’t mean you should sacrifice performance.  But since great T-SQL should already be pretty fast, you may find that using the tool makes the solution to your bigger problem fast enough!  Ultimately you may have to spend some time to make it even faster, but at least you’ve gotten through the first hurdle of a development project, which is to just get it working.

Encapsulation

Any great T-SQL solution is one that can be encapsulated so that it can be used as a “black box,” without users (or, more importantly, maintainers) of the calling T-SQL needing to worry about how it works.

Yes, they should be able to look under the covers, so that when need be they can figure out how it works.  But mostly the encapsulation simply makes a great piece of T-SQL, which may otherwise be elegant or complex, really, really easy to use.
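To make this concrete, here’s a minimal sketch of what black-box reuse looks like, assuming a splitter like the DelimitedSplit8K function discussed below has already been installed in your database:

-- We consume the function as a black box: we know it returns ItemNumber and
-- Item, and we don't need to know anything about how it works inside.
SELECT s.ItemNumber, s.Item
FROM dbo.DelimitedSplit8K('Make it work,Make it fast,Make it pretty', ',') s;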

My Vote for the Greatest Piece of T-SQL

Arguably the greatest piece of T-SQL code that I’ve ever seen is the DelimitedSplit8K string splitter.  While Jeff Moden probably wasn’t the originator of the concept of splitting a string, his first attempt at writing a fast string splitter (Tally OH! An Improved SQL 8K “CSV Splitter” Function) was pretty awesome.  I’d say it clearly meets all of the criteria I’ve mentioned so far in this blog, and with more than 50,000 views of that article as of this writing, I’d imagine there are more than just a few people who agree with me.

Jeff really did a great job of setting up a proper test harness and proving the speed potential of the algorithm, so whether or not he claims to be the author of this wonderful little function, he’ll be the one everyone remembers once all of the dust has settled.

It has also achieved another level of greatness, which is unmatched in my experience: an entire community has sprung up around it.  Over the years since its original publication, there have been numerous contributors to the evolution of DelimitedSplit8K.  Their contributions have spanned more than 75 pages of discussion thread on the topic (again, as of this writing).

Alas, there are too many contributors to mention all of them by name, but suffice it to say the list includes some really great SQL talent.  Some have focused on small tweaks; others have produced better test harnesses.  Others have written custom CLR versions to improve the speed even further, when CLRs are allowed.  All of the contributors probably use this tool frequently, so this speaks volumes to the function’s adoption and reuse.  One of the recent contributors, Eirikur Eiriksson, upped the ante significantly by writing an even faster version utilizing a new SQL 2012 feature, and his diligence in constructing test harnesses is also worthy of honorable mention.

So here’s this huge group of peers (by that I mean SQL-folk of course), that have not only embraced Jeff’s work, but have also formed an on-going support community because they felt the underlying problem was so worthwhile that they should expend their precious time and effort on it.

There’s an interesting sidebar to my vote.  The Federal Relational Protection Agency (FRPA) might scoff at the preposterous idea of a string splitter as great T-SQL.  After all, this is a relational database and we simply don’t store delimited strings in our properly normalized relational databases, now do we?  I guess what I’m saying here is that academic considerations probably take a back seat to adoption.  Clearly this problem occurs relatively frequently in the real world, regardless of what the FRPA would have us think.  And solving real-world problems, to me at least, is really what it’s all about.

Aspirations

One day when I grow up I want to write a truly great snippet of T-SQL.  In my development career, knowing that solutions I’ve written are actually being used has always brought me great comfort and satisfaction.  No developer in their right mind, at least in my humble opinion, wants to knowingly build shelfware.  I’ve been around for a long time so maybe not everyone feels the same any more, but I hope this attitude still carries forward to all levels of developers currently putting their code out there for others to learn from and grow with.

So how about you?  Do you have a vote for the greatest snippet of T-SQL ever?  Post a link as a comment telling us why you think it qualifies, so all who see this might benefit from your particular experience with some T-SQL you think is great.  Affirmations of my vote are also welcome.

In the end, if this blog becomes a repository of some really great T-SQL code patterns, I will derive a lot of satisfaction from that too.

And to those of you out there that share my aspirations, whether or not the favorite tool/snippet you develop ever makes the list of all-time greats, wouldn’t you like to hear from all of those people that found your code wonderful enough to add to their T-SQL toolkit?  So take note, you adopters out there!  If you’re using someone else’s code to solve a common problem, post them a thank you.  I’m sure they’ll be happy to hear it, and who knows?  Perhaps it will even spawn further greatness and sharing in one of the greatest communities the coding world has ever known.  That’s right, I mean the SQLverse, and we’re all a part of it!

Thanks for listening to my ruminations today on what makes T-SQL code great.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 22 Apr 2015.  All rights reserved.

Stupid T-SQL Tricks – Part 3: A Zodiacal SQL

Today we’ll learn how you can use T-SQL to calculate your zodiac sign from your birthdate.

“Heeeeeere’s Johnny!” – Ed McMahon introducing Johnny Carson for the Tonight Show

If you’re old enough to remember the Tonight Show hosted by Johnny Carson, how can you not remember the iconic Stupid Pet Tricks skits, and the later tributes to those done by David Letterman?  My Stupid T-SQL Tricks are hopefully a bit more useful, but just as amusing as Carson’s Stupid Pet Tricks were!

Everyone knows the twelve signs of the zodiac, and there’s a pretty good chance you also know what your sign is, and that your sign of the zodiac depends on your date of birth.

Putting the Zodiac Signs into a Lookup Table

For our lookup table, we’ll use a Common Table Expression (CTE) with the twelve zodiac signs, but with one repeated.

WITH Zodiac
(
    SignNo, SignName, Symbol, StartDT, EndDT, SignIs
) 
AS
(
    SELECT           1,  'Aries',       NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
    UNION ALL SELECT 2,  'Taurus',      NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
    UNION ALL SELECT 3,  'Gemini',      NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
    UNION ALL SELECT 4,  'Cancer',      NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
    UNION ALL SELECT 5,  'Leo',         NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
    UNION ALL SELECT 6,  'Virgo',       NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
    UNION ALL SELECT 7,  'Libra',       NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
    UNION ALL SELECT 8,  'Scorpio',     NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
    UNION ALL SELECT 9,  'Sagittarius', NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
    UNION ALL SELECT 11, 'Aquarius',    NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
    UNION ALL SELECT 12, 'Pisces',      NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT *
FROM Zodiac;

The signs are numbered in the order that they’re typically described in the literature, e.g., on the Wiki page for The Zodiac.  Note how Capricorn is duplicated, because the dates for that sign cross the end-of-year boundary.  Here are the displayed results from our table:

SignNo  SignName     Symbol  StartDT     EndDT       SignIs
1       Aries        ♈      1900-03-21  1900-04-20   Ram
2       Taurus       ♉      1900-04-21  1900-05-21   Bull
3       Gemini       ♊      1900-05-22  1900-06-21   Twins
4       Cancer       ♋      1900-06-22  1900-07-22   Crab
5       Leo          ♌      1900-07-23  1900-08-22   Lion
6       Virgo        ♍      1900-08-23  1900-09-23   Maiden
7       Libra        ♎      1900-09-24  1900-10-23   Scales
8       Scorpio      ♏      1900-10-24  1900-11-22   Scorpion
9       Sagittarius  ♐      1900-11-23  1900-12-21   Archer (Centaur)
10      Capricorn    ♑      1900-12-22  1900-12-31   Sea-Goat (Goat)
10      Capricorn    ♑      1900-01-01  1900-01-20   Sea-Goat (Goat)
11      Aquarius     ♒      1900-01-21  1900-02-19   Water-bearer
12      Pisces       ♓       1900-02-20  1900-03-20   Fish

I think it is kind of cool that we could use the NCHAR built-in T-SQL function (introduced in SQL 2005) to return the UNICODE symbol for each zodiac sign!
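If you’d like to see that for yourself, it takes only a one-liner:

SELECT Sign = NCHAR(9800); -- Returns ♈, the Unicode character for Aries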

Suppose we’d now like to calculate the sign for a birthdate.  We’ll use two cases, a Capricorn and a Gemini, to illustrate how easy it is using the lookup table above.

DECLARE @BirthDate  DATE = '1999-01-12';

WITH Zodiac
(
    SignNo, SignName, Symbol, StartDT, EndDT, SignIs
) 
AS
(
    SELECT           1,  'Aries',       NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
    UNION ALL SELECT 2,  'Taurus',      NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
    UNION ALL SELECT 3,  'Gemini',      NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
    UNION ALL SELECT 4,  'Cancer',      NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
    UNION ALL SELECT 5,  'Leo',         NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
    UNION ALL SELECT 6,  'Virgo',       NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
    UNION ALL SELECT 7,  'Libra',       NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
    UNION ALL SELECT 8,  'Scorpio',     NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
    UNION ALL SELECT 9,  'Sagittarius', NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
    UNION ALL SELECT 11, 'Aquarius',    NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
    UNION ALL SELECT 12, 'Pisces',      NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT BirthDate=@BirthDate, a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
FROM Zodiac a
WHERE @BirthDate BETWEEN 
    DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND 
    DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);

Which returns one row for our Capricorn:

BirthDate   SignNo   SignName    SignIs              Symbol
1999-01-12  10       Capricorn   The Sea-Goat (Goat) ♑

You can see how we’ve used the DATEADD and DATEDIFF functions to place our birthdate between the StartDT and EndDT columns within our lookup table.
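To see the mechanics of that, here’s a small stand-alone trace of the second Capricorn row, with the values copied from the lookup table above:

DECLARE @BirthDate DATE = '1999-01-12';
DECLARE @StartDT   DATE = '1900-01-01';  -- StartDT of the second Capricorn row
DECLARE @EndDT     DATE = '1900-01-20';  -- EndDT of the second Capricorn row

-- DATEDIFF counts 99 year boundaries, so DATEADD shifts the 1900-anchored
-- range into the birth year: 1999-01-01 through 1999-01-20
SELECT RangeStart = DATEADD(year, DATEDIFF(year, @StartDT, @BirthDate), @StartDT)
    ,RangeEnd     = DATEADD(year, DATEDIFF(year, @EndDT, @BirthDate), @EndDT);

Our @BirthDate falls BETWEEN those two shifted dates, which is why only the second Capricorn row matches.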

A Function to Calculate the Sign of the Zodiac for a Birth Date

I like in-line, table-valued functions (iTVFs) because I consider them tools.  This little T-SQL query is easy enough to put into an iTVF, so let’s do that now.

CREATE FUNCTION dbo.ZodiacSign
(
    @BirthDate  DATE
) RETURNS TABLE WITH SCHEMABINDING
AS RETURN
WITH Zodiac
(
    SignNo, SignName, Symbol, StartDT, EndDT, SignIs
) 
AS
(
    SELECT           1,  'Aries',       NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
    UNION ALL SELECT 2,  'Taurus',      NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
    UNION ALL SELECT 3,  'Gemini',      NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
    UNION ALL SELECT 4,  'Cancer',      NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
    UNION ALL SELECT 5,  'Leo',         NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
    UNION ALL SELECT 6,  'Virgo',       NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
    UNION ALL SELECT 7,  'Libra',       NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
    UNION ALL SELECT 8,  'Scorpio',     NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
    UNION ALL SELECT 9,  'Sagittarius', NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
    UNION ALL SELECT 11, 'Aquarius',    NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
    UNION ALL SELECT 12, 'Pisces',      NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
)
SELECT BirthDate=@BirthDate, a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
FROM Zodiac a
WHERE @BirthDate BETWEEN 
    DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND 
    DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);

With this function, we can now calculate the Zodiac signs for a whole range of birthdates as follows:

WITH BirthDates (BirthDate) AS
(
    SELECT '1998-06-17'             -- A Gemini
    UNION ALL SELECT '1999-01-10'   -- A Capricorn
)
SELECT a.BirthDate, SignNo, SignName, SignIs, Symbol
FROM BirthDates a
CROSS APPLY dbo.ZodiacSign(a.BirthDate) b;

Which gives us these results:

BirthDate   SignNo  SignName    SignIs               Symbol
1998-06-17  3       Gemini      The Twins            ♊
1999-01-10  10      Capricorn   The Sea-Goat (Goat)  ♑

The Zodiac Date Range in which a Birth Date Falls

Because Capricorn spans two different years, things get just a little more complicated when we want to calculate the date range of the sign in which a particular birth date falls.  For example, for our two birthdates shown above, we’d like to see a results set that looks like this.

BirthDate   SignNo  SignName    SignIs               SignPeriodStart  SignPeriodEnd  Symbol
1998-06-17  3       Gemini      The Twins            1998-05-22       1998-06-21     ♊
1999-01-10  10      Capricorn   The Sea-Goat (Goat)  1998-12-22       1999-01-20     ♑

Notice how Capricorn’s SignPeriodStart is the year before that particular birthdate.

Our ZodiacSign function can be modified as follows to handle this case.

ALTER FUNCTION dbo.ZodiacSign
(
    @BirthDate  DATE
) RETURNS TABLE WITH SCHEMABINDING
AS RETURN
WITH Zodiac
(
    SignNo, SignName, Symbol, StartDT, EndDT, SignIs
) 
AS
(
    SELECT           1,  'Aries',       NCHAR(9800), '1900-03-21', '1900-04-20', 'Ram'
    UNION ALL SELECT 2,  'Taurus',      NCHAR(9801), '1900-04-21', '1900-05-21', 'Bull'
    UNION ALL SELECT 3,  'Gemini',      NCHAR(9802), '1900-05-22', '1900-06-21', 'Twins'
    UNION ALL SELECT 4,  'Cancer',      NCHAR(9803), '1900-06-22', '1900-07-22', 'Crab'
    UNION ALL SELECT 5,  'Leo',         NCHAR(9804), '1900-07-23', '1900-08-22', 'Lion'
    UNION ALL SELECT 6,  'Virgo',       NCHAR(9805), '1900-08-23', '1900-09-23', 'Maiden'
    UNION ALL SELECT 7,  'Libra',       NCHAR(9806), '1900-09-24', '1900-10-23', 'Scales'
    UNION ALL SELECT 8,  'Scorpio',     NCHAR(9807), '1900-10-24', '1900-11-22', 'Scorpion'
    UNION ALL SELECT 9,  'Sagittarius', NCHAR(9808), '1900-11-23', '1900-12-21', 'Archer (Centaur)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-12-22', '1900-12-31', 'Sea-Goat (Goat)'
    UNION ALL SELECT 10, 'Capricorn',   NCHAR(9809), '1900-01-01', '1900-01-20', 'Sea-Goat (Goat)'
    UNION ALL SELECT 11, 'Aquarius',    NCHAR(9810), '1900-01-21', '1900-02-19', 'Water-bearer'
    UNION ALL SELECT 12, 'Pisces',      NCHAR(9811), '1900-02-20', '1900-03-20', 'Fish'
),
    GroupedSigns AS
(
    SELECT SignNo
        ,StartDT    = CASE WHEN SignNo = 10 AND MONTH(@BirthDate) = 1
                        THEN DATEADD(year, -1, MAX(StartDT)) 
                        WHEN SignNo = 10 AND MONTH(@BirthDate) = 12
                        THEN MAX(StartDT)
                        ELSE MIN(StartDT) 
                        END
        ,EndDT      = CASE WHEN SignNo = 10 AND MONTH(@BirthDate) = 1
                        THEN MIN(EndDT) 
                        WHEN SignNo = 10 AND MONTH(@BirthDate) = 12
                        THEN DATEADD(year, 1, MIN(EndDT))
                        ELSE MAX(EndDT) 
                        END
    FROM Zodiac
    GROUP BY SignNo
)
SELECT a.SignNo, a.SignName, SignIs='The ' + a.SignIs, a.Symbol
    ,SignPeriodStart    = CAST(DATEADD(year, YEAR(@BirthDate)-1900, b.StartDT) AS DATE)
    ,SignPeriodEnd      = CAST(DATEADD(year, YEAR(@BirthDate)-1900, b.EndDT) AS DATE)
FROM Zodiac a
JOIN GroupedSigns b ON a.SignNo = b.SignNo
WHERE @BirthDate BETWEEN 
    DATEADD(year, DATEDIFF(year, a.StartDT, @BirthDate), a.StartDT) AND 
    DATEADD(year, DATEDIFF(year, a.EndDT, @BirthDate), a.EndDT);

Because of our split row for Capricorn, we needed to create a grouping that set the dates accordingly and then JOIN this back to our original Zodiac lookup table.  A little more date arithmetic magic, this time using the DATEADD and YEAR built-in functions, will get us to our period dates.
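To trace that arithmetic for our January Capricorn, here’s a hedged, stand-alone snippet using the same values the function would see:

DECLARE @BirthDate DATE = '1999-01-10';     -- a January Capricorn

-- GroupedSigns backs Capricorn's StartDT into the prior year: 1899-12-22
DECLARE @GroupedStartDT DATETIME = DATEADD(year, -1, '1900-12-22');

-- Shifting by YEAR(@BirthDate)-1900 = 99 years lands on 1998-12-22
SELECT SignPeriodStart = CAST(DATEADD(year, YEAR(@BirthDate)-1900, @GroupedStartDT) AS DATE);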

This can be demonstrated by running the following query to call our newly ALTERed function:

WITH BirthDates (BirthDate) AS
(
    SELECT '1998-06-17'             -- A Gemini
    UNION ALL SELECT '1999-01-10'   -- A Capricorn
)
SELECT a.BirthDate, SignNo, SignName, SignIs
    ,SignPeriodStart
    ,SignPeriodEnd
    ,Symbol
FROM BirthDates a
CROSS APPLY dbo.ZodiacSign(a.BirthDate) b;

Which gives us the results we seek!

BirthDate   SignNo  SignName    SignIs               SignPeriodStart  SignPeriodEnd  Symbol
1998-06-17  3       Gemini      The Twins            1998-05-22       1998-06-21     ♊
1999-01-10  10      Capricorn   The Sea-Goat (Goat)  1998-12-22       1999-01-20     ♑

Conclusions

I love to create iTVFs as tools I can always use when some calculation gets hairy!  I have many tools that are specific to their purpose.  I also like to adopt iTVFs of this sort from other sources (a great example of that being DelimitedSplit8K from SQL MVP Jeff Moden).  You should probably consider doing this yourself to improve your skill at both writing iTVFs and using the tools you create to accelerate your own T-SQL coding jobs.

Now if I could just figure out the Yin and the Yang of the Chinese calendar to confirm I’m a Dog, that would really be something to chew on.

Yet another stupid T-SQL trick, right?  Who knows, perhaps one day you’ll even have a use for it!

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 16 Apr 2015.  All rights reserved.

An Even Faster Method of Calculating the Median on a Partitioned Heap

Back in 2013 I wrote an article kindly accepted for publication on the Simple Talk web site called Calculating the Median Value within a Partitioned Set Using T-SQL.  In that article, I was delighted to propose a previously unpublished solution for calculating the median over a partitioned set that did not rely on INDEXing.  In fact, in my tests it was the fastest solution available for that case.

Later on, I was quite pleased to hear that SQL MVP Aaron Bertrand (18 times nominated as of this writing) published a comparative study of methods for calculating the median called Best approaches for grouped median, where he kindly included my submission and confirmed that it was the highest-performing solution available for the case of a heap.

Today we’re going to take a second look at the solution I proposed, and offer a couple of interesting possibilities that may be even better for the case of a partitioned heap.

A Heap Example and Some Sample Data to Populate the Table

The following DDL can be used to create a heap, and load it with some partitioned sample data (the ID column is the partition).

CREATE TABLE dbo.MedianValues
(
   ID INT
   ,N INT
);

INSERT INTO dbo.MedianValues
VALUES (1,1),(1,2),(1,3),(1,4),(1,5)
   ,(2,1),(2,2),(2,3),(2,4)
   ,(3,10),(3,2),(3,10),(3,4)
   ,(4,1),(4,5),(4,1),(4,3),(4,3);

From my original article, below are the results you would expect for the median for these four sets of N:

ID    Ordered Sets         Median
1     {1, 2, 3, 4, 5}      3
2     {1, 2, 3, 4}         2.5
3     {2, 4, 10, 10}       7
4     {1, 1, 3, 3, 5}      3

As a reminder, the median is calculated as follows:

  • Order the values for which you want to calculate the median over the partitioning column (ID).
  • If the count of the rows is odd, you take the middle value. So in the example above, you would take the third row for IDs 1 and 4.
  • If the count of the rows is even, you take the average of the middle two rows. In the example above, this means you use the second and third elements in IDs 2 and 3.

My Original Solution to Median for this Case

Aaron commented that my somewhat “elaborate” solution un-pivots the rows of interest using the CROSS APPLY VALUES approach.

WITH Counts AS
(
   SELECT ID, c
   FROM
   (
      SELECT ID, c1=(c+1)/2, c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE 0 END
      FROM
      (
          SELECT ID, c=COUNT(*)
          FROM dbo.MedianValues
          GROUP BY ID
      ) a
   ) a
   CROSS APPLY (VALUES(c1),(c2)) b(c)
)
SELECT ID=a.ID, Median=AVG(0.+b.N)
FROM
(
   SELECT ID, N
      ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
   FROM dbo.MedianValues a
) a
CROSS APPLY
(
   SELECT N
   FROM Counts b
   WHERE a.ID = b.ID AND a.rn = b.c
) b
GROUP BY a.ID;

It is important to note how we convert the integer Ns to a decimal (via the 0.+N trick) within the AVG aggregate, to ensure no integer truncation occurs.  The results set for this on our sample data appears as follows, confirming the correct calculation of the median for each case.

ID   Median
1    3.000000
2    2.500000
3    7.000000
4    3.000000

Calculating a Median Using Grouping Sets

While I was playing around with GROUPING SETS the other day, I hit upon the strangest query result, which gave me an idea that relates to calculating medians.  Let’s take a look at a sample of what I found.

SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
    ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
        ELSE (COUNT(*)+1)/2 
        END
FROM 
(
    SELECT ID, ID1=ID
    FROM dbo.MedianValues
) a
GROUP BY GROUPING SETS (ID, ID1)
HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
ORDER BY ID, c;

This query produces a results set consisting of all the partitioning IDs with one row for IDs 1 and 4 (where c=3) and two rows for IDs 2 and 3 (where c=2 and 3 respectively).  Take a look at that description and compare it to my description of how to calculate median, and see if you too are not intrigued.

ID   c
1    3
2    2
2    3
3    2
3    3
4    3

Coincidentally, this is exactly what I was trying to achieve by my “elaborate” (according to Aaron) method that used CROSS APPLY VALUES!  One could argue that this is even more elaborate, and I would tend to agree.  So naturally the thought struck me that perhaps I could use this approach to calculate a median.

My initial attempts at JOINing these six rows back to the full set with a ROW_NUMBER() added yielded me no joy however, as the result was slower than my original.

Once again inspiration struck and I had the idea to aggregate the GROUPING SETS so that I only had one row for each ID, after which I could do a range (BETWEEN) JOIN.  So my aggregation now looks like this.

SELECT ID, c1=MIN(c), c2=MAX(c) 
FROM
(
    SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
        ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
            ELSE (COUNT(*)+1)/2 
            END
    FROM 
    (
        SELECT ID, ID1=ID
        FROM dbo.MedianValues
    ) a
    GROUP BY GROUPING SETS (ID, ID1)
    HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
) a
GROUP BY ID;

Which produces a set that looks like this:

ID  c1  c2
1   3   3
2   2   3
3   2   3
4   3   3

Playing around with JOINing this back to the original data set led me to using a CROSS APPLY instead of a JOIN.  So what I ended up with, which calculates the median quite nicely, is the following query.

WITH RowsOfInterest AS
(
    SELECT a.ID, N
    FROM 
    (
        SELECT ID, c1=MIN(c), c2=MAX(c)
        FROM
        (
            SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
                ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
                    ELSE (COUNT(*)+1)/2 
                    END
            FROM 
            (
                SELECT ID, ID1=ID
                FROM dbo.MedianValues
            ) a
            GROUP BY GROUPING SETS (ID, ID1)
            HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
        ) a
        GROUP BY ID
        HAVING COUNT(*) <= 2
    ) a
    CROSS APPLY
    (
        SELECT ID, N
        FROM 
        (
            SELECT ID, N
                ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
            FROM dbo.MedianValues b
        ) b
        WHERE a.ID = b.ID AND b.rn BETWEEN a.c1 AND a.c2
    ) b
)
SELECT ID=a.ID, Median=AVG(0.+N)
FROM RowsOfInterest a
GROUP BY a.ID;

How’s that for elaborate, Aaron?  Initial timings appeared to be pretty decent, but more importantly what I was seeing was a drastically reduced IO count and CPU usage.  Clearly this was a promising vector of attack, although the truth will only get sorted out using SQL Profiler to average the results of multiple runs (the elapsed timings were very close).

One rather annoying note on this solution.  When I first hit upon it, I could have sworn I had something that was consistently faster than my original solution.  Unfortunately due to a PC reboot (pesky Windows updates!) I lost that one and I was never able to quite achieve the speed I was seeing.  This could be an interesting assignment for those so inclined.

Simplifying the Elaborate GROUPING SETS Approach

After thinking a little more about this, I realized it might be possible to greatly simplify this aggregation to produce just the four rows I needed.  Silly of me really not to think of it in the first place, but then my brain may not be wired that way.

So here’s another method to get the median using a simpler initial aggregation to get ID, c1 and c2 just like the query in the prior section.

WITH RowsOfInterest AS
(
    SELECT ID, c=COUNT(*)
    FROM dbo.MedianValues
    GROUP BY ID
)
SELECT ID=ID, Median=AVG(0.+N)
FROM
(
    SELECT a.ID, N
    FROM 
    (
        SELECT ID, c1=(c+1)/2, c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END
        FROM RowsOfInterest
    ) a
    JOIN 
    (
        SELECT ID, N
            ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
        FROM dbo.MedianValues b
    ) b
    ON a.ID = b.ID AND b.rn BETWEEN a.c1 AND a.c2
) a
GROUP BY a.ID;

This time I used a JOIN instead of a CROSS APPLY because initial timings seemed to indicate a slight advantage.  This method also exhibited a serious reduction in IO counts and CPU usage over my original solution.

You’ll note that this method is quite similar to my original approach, except that it doesn’t CROSS APPLY VALUES, nor does it do the pre-aggregation of the second solution.  The main difference from the original solution is the method of JOINing, which uses the BETWEEN range selector.

The One Million Row Test Harness

In my original Simple Talk article I developed a one million row test harness that can be used to test a median query, and I have reproduced that below exactly as before.

TRUNCATE TABLE dbo.MedianValues;

DECLARE @N INT = 10000; -- 10,000 = ~1,000,000 rows

WITH Tally (n) AS
(
   SELECT TOP (@N) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
   FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO dbo.MedianValues
SELECT a.n, ABS(CHECKSUM(NEWID()))%@N
-- Always create exactly 100 IDs from the Tally table
FROM (SELECT TOP 100 n FROM Tally) a
-- Generate either an odd or even number of rows within IDs
CROSS APPLY
(
   SELECT TOP (@N-a.n%2) n
   FROM Tally
)b;

This actually generates 999,950 rows into our sample data table.
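If you want to verify the load for yourself, a quick sanity check should report 100 IDs and 999,950 rows:

SELECT IDs = COUNT(DISTINCT ID), TotalRows = COUNT(*)
FROM dbo.MedianValues;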

The SQL Profiler Test

My initial timings using SET STATISTICS IO, TIME ON were producing pretty fine numbers for IO and CPU, but pretty close scores for elapsed times.  So to be really sure which of these three queries is fastest, it is necessary to run each batch multiple times (I chose ten) and then average the results.  I won’t include the Profiler test harness here because this is just a blog, and I’m sure that if you’ve read my previous blog on the One Million Row Test Harness, you’re capable of doing this yourself given everything we’ve given you above.
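For those who’d like to roll their own, here is a minimal sketch of the kind of batch I captured; the variable assignments are there only to suppress returning the results set, so that rendering time doesn’t pollute the measurement.  The aggregate shown is just a placeholder for whichever of the three solutions you are testing.

DECLARE @ID INT, @Median NUMERIC(38,6);

-- Substitute solution #1, #2 or #3 here, assigning its outputs to the variables
SELECT @ID = a.ID, @Median = AVG(0.+a.N)
FROM dbo.MedianValues a
GROUP BY a.ID;
GO 10 -- SSMS shorthand to run the batch ten times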

Here are the raw and averaged SQL Profiler results for the three queries, with the last row being the average of the ten prior.

Solution #1:                    Solution #2:                  Solution #3:
Original method                 Using Grouping Sets           Simplified aggregation
CPU      Reads       Duration   CPU      Reads    Duration    CPU      Reads    Duration
4,943    2,006,286   1,585      1,622    5,843    1,663       2,840    5,674    1,147
4,759    2,006,286   1,638      1,826    5,843    1,663       2,979    5,674    1,106
4,820    2,006,286   1,575      1,715    5,843    1,662       2,682    5,674    1,326
4,836    2,006,286   1,654      1,732    5,843    1,672       2,932    5,674    1,353
4,822    2,006,286   1,608      1,747    5,843    1,620       2,949    5,674    1,190
4,787    2,006,286   1,598      1,701    5,843    1,659       2,729    5,674    1,161
4,788    2,006,286   1,669      1,716    5,843    1,638       2,872    5,674    1,175
5,104    2,006,286   1,574      1,716    5,843    1,538       3,026    5,674    1,244
4,698    2,006,286   1,569      1,700    5,843    1,665       2,840    5,674    1,296
4,632    2,006,286   1,571      1,747    5,843    1,663       2,792    5,674    1,195
4,819    2,006,286   1,604      1,722    5,843    1,644       2,864    5,674    1,219

It is also instructive to look at a closer comparison of the averages.

Averages

Solution                      CPU      Reads       Duration
#1: Original method           4,819    2,006,286   1,604
#2: Using Grouping Sets       1,722    5,843       1,644
#3: Simplified aggregation    2,864    5,674       1,219
Best                          #2       #3          #3
New #2 %O/(U) #1              -64%     -100%       3%
New #3 %O/(U) #1              -41%     -100%       -24%

You can see how solution #3 was the best performer in terms of IO (Reads) and Duration (Elapsed MS).  Solution #2 had the edge in CPU and was nearly tied in duration with solution #1.  Overall my take on these results is that they are both better than my original solution, but of the two #3 is probably the best.

The percentage rows indicate that solution #3 improved on solution #1 by 41% in CPU, while solution #2 improved on it by 64%.  Both new solutions pretty much ripped solution #1 apart in terms of IO.  And probably the most important statistic, duration, shows that while solutions #1 and #2 were essentially indistinguishable, solution #3 was about 24% faster than both.

The Final Word

The two new solutions for calculating the median within partitioned sets on a heap (no indexing on the table) seem to have delivered a slight edge over the previous benchmark case.  It is always gratifying to do something just a little bit better, even when you’re kicking yourself for not thinking of doing it that way in the first place.

Performance testing was performed on:

  • SQL Server 2012 64 bit
  • Windows 7 64 bit with 12GB of memory installed
  • Intel Core i5-3230M CPU @ 2.60 GHz

I did not attempt to test these solutions using either of the INDEXing schemes proposed by Aaron in his comparative article, because my gut tells me that they’re probably no better than the ones that won out in those particular scenarios.  Since this is just a blog, we’ll have to settle today for seeing just this one case.

And to Aaron, if you happen to be reading this blog at some time, my little digs at you calling my solution “elaborate” were all in good fun.  Actually I considered it quite a badge of honor that you picked my solution in your comparative study.

Thanks folks as always for listening.  Hope the solutions you find here and in my other blogs are helpful to you some day.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 08 Apr 2015.  All rights reserved.

An Easter SQL

Since Easter is nearly upon us, I got to thinking once again about how challenging it is to calculate the day that Easter Sunday falls on.  Since Calendar tables and their step-children, Holiday tables, are so interesting, I thought I might write a sequel to my earlier blog on Calendar tables.

There are at least three holidays (not necessarily business holidays) that are related to the calculation of Easter Sunday.  These are:

  • Good Friday is the Christian holiday that is the Friday just prior to Easter Sunday.
  • Ash Wednesday is another Christian holiday (the first day of Lent) that falls precisely 46 days prior to Easter Sunday.
  • Mardi Gras (a.k.a. Fat Tuesday) is a lesser known holiday celebrated by the more robust partyers among us.  I have a special affinity for it as I lived in New Orleans for a few years.

We will show you how to calculate them all and use that method to populate a Holidays lookup table so hopefully you’ll never have to do it again.  Happy Easter!

A Calendar and a Holiday Table

Discussions of Calendar tables in T-SQL are ubiquitous (here’s one by Todd Fifield), differing only in the columns actually stored in them.  We’ll draw upon the one we created in that earlier blog to show you how easy it is here.

-- Create a Calendar table
SELECT [Date]=ISNULL([Date], 0), [Year]=ISNULL([Year], 0), [YrNN]=ISNULL([YrNN], 0)
    ,[YYYYMM]=ISNULL([YYYYMM], 0), [BuddhaYr], [Month], [Day]=ISNULL([Day], 0)
    ,[WkDNo], [WkDName], [WkDName2], [WkDName3], [JulDay], [JulWk]
    ,[WkNo], [Qtr], [Last], [LdOfMo], [LDtOfMo]
INTO dbo.Calendar 
-- This function can be obtained here: GenerateCalendar FUNCTION
FROM dbo.GenerateCalendar('1990-01-01', 65536);

-- Change column types to be NOT NULL so we can index them
--
-- Note: not needed thanks to the neat trick with ISNULL above
--ALTER TABLE dbo.Calendar ALTER COLUMN [Date] DATETIME NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Year] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Month] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [YYYYMM] INT NOT NULL;
--ALTER TABLE dbo.Calendar ALTER COLUMN [Day] INT NOT NULL;
GO

-- Build some representative INDEXes
ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_pk PRIMARY KEY([Date]);

ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_ix1 UNIQUE NONCLUSTERED([Year], [Month], [Day]);

ALTER TABLE dbo.Calendar 
ADD CONSTRAINT Cal_ix2 UNIQUE NONCLUSTERED([YYYYMM], [Day]);

SELECT TOP 10 [Date], [YYYYMM], [WkDName2], [WkNo], [Last]
FROM dbo.Calendar;

-- Now create a Holidays table, from the same blog
CREATE TABLE dbo.Holidays
(
    FromDate    DATETIME PRIMARY KEY
    ,ToDate     DATETIME
    ,Holiday    VARCHAR(100)
);

We’ve created a Holidays table that we’ll be using to further our sequel to Easter.  There are also a few other things in the above Data Definition Language (DDL) that are worthy of mention.

  • The GenerateCalendar FUNCTION is an in-line, Table Valued Function (iTVF) that appears in my earlier blog (and also in the resources file attached at the end of this article), which can be used in lieu of a Calendar table should you prefer to do so.  Once again, my thanks go out to SQL MVP Jeff Moden for his help optimizing the performance of the calculations done in that FUNCTION.
  • We’ve used a neat trick with ISNULL to set the NULL attribute of the column in the table we selected INTO (when Calendar gets CREATEd) to NOT NULL, so that we don’t need to run the subsequent ALTER statements on those columns.  I included those ALTER statements for your information.  Thanks again to Jeff Moden for showing me that trick (with honorable mention to SQL MCM Wayne Sheffield (MCM=Microsoft Certified Master) who came along just a few days later with the same information).
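If you’ve not seen that ISNULL trick in action before, here is a tiny stand-alone demonstration (the table name is made up for the example):

-- x is created NOT NULL thanks to the ISNULL wrapper; y stays NULLable
-- because a CASE with no ELSE can return NULL
SELECT x = ISNULL(CASE WHEN n > 0 THEN n END, 0)
    ,y = CASE WHEN n > 0 THEN n END
INTO dbo.NullabilityDemo
FROM (VALUES(1),(2)) v (n);

-- Check the resulting column attributes
SELECT COLUMN_NAME, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'NullabilityDemo';

DROP TABLE dbo.NullabilityDemo;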

Since Easter Sunday is so diabolically difficult to calculate, many reputable sources have created look-up tables such as this one.  Drawing upon that source, we can create the dates for Easter Sunday from 1990 through 2020 in our Holidays table using the following script.

WITH EasterDates (d) AS
(
    SELECT '1990-04-15' UNION ALL SELECT '1991-03-31' UNION ALL SELECT '1992-04-19'
    UNION ALL SELECT '1993-04-11' UNION ALL SELECT '1994-04-03' 
    UNION ALL SELECT '1995-04-16' UNION ALL SELECT '1996-04-07' 
    UNION ALL SELECT '1997-03-30' UNION ALL SELECT '1998-04-12'
    UNION ALL SELECT '1999-04-04' UNION ALL SELECT '2000-04-23' 
    UNION ALL SELECT '2001-04-15' UNION ALL SELECT '2002-03-31' 
    UNION ALL SELECT '2003-04-20' UNION ALL SELECT '2004-04-11'
    UNION ALL SELECT '2005-03-27' UNION ALL SELECT '2006-04-16' 
    UNION ALL SELECT '2007-04-08' UNION ALL SELECT '2008-03-23' 
    UNION ALL SELECT '2009-04-12' UNION ALL SELECT '2010-04-04'
    UNION ALL SELECT '2011-04-24' UNION ALL SELECT '2012-04-08' 
    UNION ALL SELECT '2013-03-31' UNION ALL SELECT '2014-04-20' 
    UNION ALL SELECT '2015-04-05' UNION ALL SELECT '2016-03-27'
    UNION ALL SELECT '2017-04-16' UNION ALL SELECT '2018-04-01' 
    UNION ALL SELECT '2019-04-21' UNION ALL SELECT '2020-04-12'
)
INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate=d, ToDate=d, Holiday='Easter Sunday ' + CAST(YEAR(d) AS CHAR(4))
FROM EasterDates;

Before we add Good Friday, Ash Wednesday and Fat Tuesday to our Holiday table, we’d like to check that the introductory description at least results in the correct day of the week on which the holiday falls.  We can use our Calendar table for this.

SELECT FromDate = a.FromDate+b.[days]
    ,ToDate     = a.FromDate+b.[days]
    ,b.holiday
    ,c.WkDName
FROM dbo.Holidays a
CROSS APPLY
(
    VALUES(-2, 'Good Friday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-46, 'Ash Wednesday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(YEAR(FromDate) AS CHAR(4)))
) b ([days], holiday)
JOIN dbo.Calendar c ON c.[Date] = a.FromDate+b.[days] 
WHERE a.Holiday LIKE 'Easter Sunday%';

This produces results like this (only the first two years are shown):

FromDate                ToDate                  holiday                        WkDName
1990-04-13 00:00:00.000 1990-04-13 00:00:00.000 Good Friday 1990               Friday   
1990-02-28 00:00:00.000 1990-02-28 00:00:00.000 Ash Wednesday 1990             Wednesday
1990-02-27 00:00:00.000 1990-02-27 00:00:00.000 Mardi Gras (Fat Tuesday) 1990  Tuesday  
1991-03-29 00:00:00.000 1991-03-29 00:00:00.000 Good Friday 1991               Friday   
1991-02-13 00:00:00.000 1991-02-13 00:00:00.000 Ash Wednesday 1991             Wednesday
1991-02-12 00:00:00.000 1991-02-12 00:00:00.000 Mardi Gras (Fat Tuesday) 1991  Tuesday

So now we can be confident to insert those holidays into our Holidays table.

INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate = a.FromDate+b.[days]
    ,ToDate     = a.FromDate+b.[days]
    ,b.holiday
FROM dbo.Holidays a
CROSS APPLY
(
    VALUES(-2, 'Good Friday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-46, 'Ash Wednesday ' + CAST(YEAR(FromDate) AS CHAR(4)))
        ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(YEAR(FromDate) AS CHAR(4)))
) b ([days], holiday)
WHERE a.Holiday LIKE 'Easter Sunday%';

A couple of notes on the data types used in our Calendar and Holidays tables:

  • The Date column of the Calendar table was chosen to be DATETIME to retain compatibility with SQL 2005 (corresponding to the GenerateCalendar FUNCTION).  You could just as easily use a DATE datatype if you’re running SQL 2008 or later.
  • The FromDate/ToDate data types are also both DATETIME.  Here it could be considered important to keep the DATETIME data type to record cases where your business offers half-day holidays.  For example, I’ve seen this for Christmas Eve and New Year’s Eve.
  • You can also define a different FromDate than ToDate in the Holidays table to indicate a holiday that spans several days, like the Thailand Songkran holiday example from my blog.
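As a hedged illustration of that last point, a multi-day holiday needs nothing more than different endpoints (the dates below are Songkran’s usual April 13-15):

INSERT INTO dbo.Holidays (FromDate, ToDate, Holiday)
VALUES ('2015-04-13', '2015-04-15', 'Songkran 2015');
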
“Since SQL is a database language, we prefer to do look ups and not calculations.” 
    –- Joe Celko (prolific author of books on things SQL)

There you go Joe, we’ve honored your preference!  Now we can do lookups of not just Easter Sunday, but also Good Friday, Ash Wednesday and Mardi Gras, the latter potentially being essential to planning your next trip to New Orleans.

The Calculation of Easter Sunday

A somewhat accurate description of the calculation of Easter Sunday is:

“The Sunday following the full Moon which falls on or after the equinox (the Paschal full moon) will give the lawful Easter.”
    –- From the Wikipedia link on Easter Sunday above

I say “somewhat accurate” because there’s all sorts of ifs, buts and caveats in that description, which you can read about in the Wikipedia link, however it will suffice for the time being.

The formal description of the calculation of Easter is known as Computus (no relation to Locutus of Borg), and is provided by Wikipedia at the link shown.  We shall observe this calculation in what follows.  While the formulas given are indicated to be valid for the years 1900 to 2199, since our Calendar table has been populated with years 1990-2169 we’ll limit our calculations of Easter Sunday to just those years.  Years 1990 through 2020 will be checked against the lookup table we’ve already provided.

Of course, this calculation appears often enough in the literature and implemented in T-SQL (for example: here and here), often using methods I’d prefer to avoid due to performance.  Since our personal mantra is: “No Loops.  No Cursors.  No RBAR. Hoo-uh!” we shall endeavor to live up to that mantra today.  Think about that term “RBAR” and if you’ve never heard of it before, you’ll find it mentioned in the Conclusion to this article.

We’ll start by attempting to replicate the tabular example for calculating the Paschal full moon date (PFMD) within the Gregorian calendar shown by Computus on that Wikipedia page.
[Figure: the Computus table for calculating the Paschal full moon date (PFMD) in the Gregorian calendar]
Note that the algorithm described above is a step-by-step or procedural algorithm. Here is where many SQL implementations fail because they don’t try to convert it to one that is set-based. To do this, you must think of each step in the procedural algorithm as a query, and as you proceed from one step to the next, each new step is another outer layer to your set-based query.

This is easier shown than described, so here is that query for 2013-2032.

SELECT [Year], [Y MOD 19], Addendum
    ,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
    ,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
    ,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
FROM
(
    SELECT [Year], [Y MOD 19], Addendum
        ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
    FROM
    (
        SELECT [Year]
            ,[Y MOD 19] = [Year]%19
            ,Addendum   = CASE [Year]%19 
                            WHEN 5 THEN 29 
                            WHEN 16 THEN 29 
                            WHEN 8 THEN 30 
                            ELSE 0 
                            END 
        FROM dbo.Calendar
        WHERE [Month] = 1 AND [Day] = 1
    ) a
) a
WHERE [Year] BETWEEN 2013 AND 2032;

Note how the Addendum (CASE) implements the first two IFs (second and third rows) of the procedural algorithm. The results from this query for just the first 10 years are:

Year  Y MOD 19  Addendum  PFMD  Month  MonthNo
2013  18        0         27    Mar    3
2014  0         0         14    Apr    4
2015  1         0         3     Apr    4
2016  2         0         23    Mar    3
2017  3         0         11    Apr    4
2018  4         0         31    Mar    3
2019  5         29        18    Apr    4
2020  6         0         8     Apr    4
2021  7         0         28    Mar    3
2022  8         30        16    Apr    4

If we take PFMD and add the first letter of the month column that follows (for example for 2013: 27M) we see exact correspondence with the row from the Computus table (graphic) labeled the “Paschal full moon date,” including for the years not shown in the results set.  So it appears that we have rather simply converted the formula for calculating PFMD from Computus to a SQL script without too much effort.

Now we’ll use our Calendar table yet again, along with a little additional magic:

  • Put the pieces of the Paschal full moon date back together into a date using a formula that includes a couple of calls to the T-SQL built-in function DATEADD.
  • Do a CROSS APPLY (explained in Part 1 and Part 2 by SQL MVP Paul White) to pick up the Sunday after the Paschal full moon, which is Easter Sunday.

WITH CalculatePaschalFullMoon AS
(
    -- Original query to calculate the components of the 
    -- Paschal Full Moon date
    SELECT a.[Year], [Y MOD 19], Addendum
        ,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
        ,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
        ,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
        -- Put the date/time pieces together to get a real DATETIME for PFMD
        ,[PFMD2]    =
            DATEADD(day, CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END-1 -- PFMD
                ,DATEADD(month, CASE WHEN PFMD > 31 THEN 4 ELSE 3 END-1   -- [MonthNo]
                    , a.[Date]))
    FROM
    (
        SELECT [Year], [Y MOD 19], Addendum, [Date]
            ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
        FROM
        (
            SELECT [Year]
                ,[Y MOD 19] = [Year]%19
                ,Addendum   = CASE [Year]%19 
                                WHEN 5 THEN 29 
                                WHEN 16 THEN 29 
                                WHEN 8 THEN 30 
                                ELSE 0 
                                END
                ,[Date] 
            FROM dbo.Calendar
            WHERE [Month] = 1 AND [Day] = 1
        ) a
    ) a
)
SELECT a.[Year]
    ,PaschalFullMoon=CAST(a.PFMD2 AS DATE)
    ,EasterSunday
FROM CalculatePaschalFullMoon a
-- Easter Sunday follows the Paschal Full Moon date
-- so pick that up from the Calendar table
CROSS APPLY
(
    SELECT TOP 1 EasterSunday=CAST([Date] AS DATE)
    FROM dbo.Calendar b
    WHERE b.[Date] > a.PFMD2 AND b.WkDName3 = 'Sun'
    ORDER BY [Date]
) b
WHERE [Year] BETWEEN 2013 AND 2032;

 

Now for a quiz on the above code (specifically, on the two DATEADD calls that assemble PFMD2):

Why do we subtract 1 from each of the CASE statements that appear as the second argument to DATEADD?

We’ll give you the answer to this in the Conclusion section of this article, so stay tuned…

Here are the first ten years of the results set from the last T-SQL query shown above:

Year  PaschalFullMoon  EasterSunday
2013  2013-03-27       2013-03-31
2014  2014-04-14       2014-04-20
2015  2015-04-03       2015-04-05
2016  2016-03-23       2016-03-27
2017  2017-04-11       2017-04-16
2018  2018-03-31       2018-04-01
2019  2019-04-18       2019-04-21
2020  2020-04-08       2020-04-12
2021  2021-03-28       2021-04-04
2022  2022-04-16       2022-04-17

The Computus Wikipedia page also has an inset table presenting the Western calculation of Easter Sunday, which corresponds quite nicely to our calculation of Easter Sunday.

That really wasn’t so hard, was it?  Now comes the packaging into a tool.

Since we like to keep things clean, we’re going to make this wonderful bit of SQL code into an in-line, table-valued function that we can call anytime we need to calculate Easter Sunday for a specific year.

We’ll then continue on to use our tool to calculate the other holidays by using the known number of days offset, just like we did before with our Easter Sunday lookup table, and insert that result into our Holidays table for years that are missing.

[Figure: table of Easter Sundays from the Computus Wikipedia page]
CREATE FUNCTION dbo.CalculateEasterSunday
(
    @Year       INT
)
-- Calculate Easter Sunday for any given year
RETURNS TABLE WITH SCHEMABINDING
AS RETURN

WITH CalculatePaschalFullMoon AS
(
    -- Original query to calculate the components of the 
    -- Paschal Full Moon date
    SELECT a.[Year] --, [Y MOD 19], Addendum
        --,PFMD       = CASE WHEN PFMD > 31 THEN PFMD-31 ELSE PFMD END
        --,[Month]    = CASE WHEN PFMD > 31 THEN 'Apr' ELSE 'Mar' END
        --,[MonthNo]  = CASE WHEN PFMD > 31 THEN 4 ELSE 3 END
        -- Put the date/time pieces together to get a real 
        -- DATETIME for PFMD
        ,[PFMD2]    =
            DATEADD(day, CASE WHEN PFMD > 31 
                                THEN PFMD-31 
                                ELSE PFMD 
                                END-1           -- PFMD
                ,DATEADD(month, CASE 
                                WHEN PFMD > 31 
                                THEN 4 
                                ELSE 3 
                                END-1           -- [MonthNo]
                    , a.[Date]))
    FROM
    (
        SELECT [Year], [Y MOD 19], Addendum, [Date]
            ,PFMD   = (45-([Y MOD 19]*11)%30+Addendum)
        FROM
        (
            SELECT [Year]
                ,[Y MOD 19] = [Year]%19
                ,Addendum   = CASE [Year]%19 
                                WHEN 5 THEN 29 
                                WHEN 16 THEN 29 
                                WHEN 8 THEN 30 
                                ELSE 0 
                                END
                ,[Date] 
            FROM dbo.Calendar
            WHERE [Month] = 1 AND [Day] = 1 AND [Year] = @Year
        ) a
    ) a
)
SELECT a.[Year]
    ,PaschalFullMoon=CAST(a.PFMD2 AS DATE)
    ,EasterSunday
FROM CalculatePaschalFullMoon a
-- Easter Sunday follows the Paschal Full Moon date
-- so pick that up from the Calendar table
CROSS APPLY
(
    SELECT TOP 1 EasterSunday=CAST([Date] AS DATE)
    FROM dbo.Calendar b
    WHERE b.[Date] > a.PFMD2 AND b.WkDName3 = 'Sun'
    ORDER BY [Date]
) b;

You can see where I commented out a few lines of code in the FUNCTION that represent intermediate results that our FUNCTION doesn’t really need to retain. Notice also how the filter on the @Year parameter was moved up into the innermost SELECT from the Calendar table, because when using the FUNCTION you’ll be selecting only one row at a time at that point. Calling the iTVF using CROSS APPLY in-lines the code to make this a non-RBAR query.

Now we can use our FUNCTION to check that the dates we’ve prior stored for Easter Sunday in our Holidays table are correct.

SELECT HolidayFromDate  = CAST(a.FromDate AS DATE)
    ,b.EasterSunday
    ,[Check]            = CASE 
                            WHEN a.FromDate = b.EasterSunday 
                            THEN 'MATCHED!' 
                            ELSE 'WRONG!' 
                            END
FROM dbo.Holidays a
CROSS APPLY dbo.CalculateEasterSunday(YEAR(a.FromDate)) b
WHERE Holiday LIKE 'Easter Sunday%';

The first ten years of our results show us all MATCHED! (and so do the others that are not shown).

HolidayFromDate  EasterSunday   Check
1990-04-15       1990-04-15     MATCHED!
1991-03-31       1991-03-31     MATCHED!
1992-04-19       1992-04-19     MATCHED!
1993-04-11       1993-04-11     MATCHED!
1994-04-03       1994-04-03     MATCHED!
1995-04-16       1995-04-16     MATCHED!
1996-04-07       1996-04-07     MATCHED!
1997-03-30       1997-03-30     MATCHED!
1998-04-12       1998-04-12     MATCHED!
1999-04-04       1999-04-04     MATCHED!

So we can be reasonably confident of our Easter Sunday calculation for the years indicated by Computus (1900-2199)!

Using our FUNCTION to Create Lookup Values in our Holidays Table

Now to keep Joe Celko happy (i.e., so we can do lookups instead of calculations), let’s insert the rest of our holidays, including Easter Sunday, Good Friday, Ash Wednesday and Mardi Gras, into our Holidays table, using all the years available to us in our Calendar table.

INSERT dbo.Holidays (FromDate, ToDate, Holiday)
SELECT FromDate, ToDate, holiday
FROM
(
    SELECT FromDate = DATEADD(day, c.days, b.EasterSunday)
        ,ToDate     = DATEADD(day, c.days, b.EasterSunday)
        ,c.holiday
    FROM dbo.Calendar a
    CROSS APPLY dbo.CalculateEasterSunday(a.[Year]) b
    CROSS APPLY
    (
        VALUES (0, 'Easter Sunday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-2, 'Good Friday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-46, 'Ash Wednesday ' + CAST(a.[Year] AS CHAR(4)))
            ,(-47, 'Mardi Gras (Fat Tuesday) ' + CAST(a.[Year] AS CHAR(4)))
    ) c ([days], holiday)
    WHERE a.[Month] = 1 AND a.[Day] = 1
) a
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.Holidays x
    WHERE x.FromDate = a.FromDate AND x.ToDate = a.ToDate
);

This added 596 holidays to our table, and I’d guess we’re pretty well set for our lifetimes!

Notice how, in the above query, we had to use DATEADD instead of the simple arithmetic we used earlier with the DATETIME data type.  That’s because simple arithmetic doesn’t work with the DATE data type returned by our CalculateEasterSunday function.
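
If you’ve not run into this difference before, here’s a minimal illustration (the variable names are mine and the dates are arbitrary):

DECLARE @dtm    DATETIME    = '20150405'
    ,@dt        DATE        = '20150405';

SELECT @dtm + 1;                -- Works: 2015-04-06 00:00:00.000
SELECT DATEADD(day, 1, @dt);    -- Works: 2015-04-06
--SELECT @dt + 1;               -- Fails: Operand type clash: date is incompatible with int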

Conclusion and What we Learned

While Joe likes lookups, I find calculations in T-SQL to be quite a bit of fun.  In this article (as in my blog on Calendar tables), we got to do a little of both.

If you went ahead and read my blog on Calendar tables in conjunction with this article, I am pleased because you probably learned a lot about how useful they can be, particularly in conjunction with a Holidays table. In this article (in the FUNCTION) we used our Calendar table to find the next Sunday occurring after our calculated date of the Paschal full moon.

Besides learning some trivia about how Easter, Good Friday and Ash Wednesday are related, and planning for your next visit to New Orleans for Mardi Gras, we’d like to think that you may have learned a few other things as well:

  • You can use simple arithmetic on a DATETIME data type to add or subtract days; however, you must use DATEADD to do the same thing with a DATE data type.
  • You now have efficient tools in your toolbox for whenever you need them: GenerateCalendar to generate a calendar table, and CalculateEasterSunday.  The latter function is also included in the resources file attached at the end of this article.
  • You learned how to CROSS APPLY VALUES to do an UNPIVOT operation, even though I didn’t specifically mention that was what we were doing when we used it (where we created the rows with dates for Good Friday, Ash Wednesday and Mardi Gras).  If you haven’t seen that method before, you may want to look into it as well. It only works in SQL 2008 (or later), but there is an equivalent construct that works in SQL 2005.
  • We learned how to use ISNULL in a SELECT/INTO operation to make the column created in the target table NOT NULL (there’s a quick sketch of this just after the list).
  • You learned how to think your way through a rather complex, procedural calculation and convert it into set-based code, by wrapping each successive step in another layer of query.
  • You learned my personal SQL mantra, in case you hadn’t seen that before.
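
Here’s that quick sketch of the ISNULL on SELECT/INTO trick (the table and column names are made up purely for illustration):

CREATE TABLE #Src (a INT NULL);
INSERT INTO #Src VALUES (1), (NULL);

SELECT a1 = a               -- a1 is created as NULL-able
    ,a2 = ISNULL(a, 0)      -- ISNULL makes a2 NOT NULL in the target table
INTO #Tgt
FROM #Src;

SELECT name, is_nullable    -- a1 shows is_nullable = 1, a2 shows is_nullable = 0
FROM tempdb.sys.columns
WHERE [object_id] = OBJECT_ID('tempdb..#Tgt');

DROP TABLE #Src, #Tgt;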

In case you’ve been wondering what RBAR stands for, it means “Row By Agonizing Row” and it is a term coined by Jeff Moden which I fortuitously picked up on before I got to know him.

Oh yes, then there’s the quiz I mentioned. There are two cases:

  • If we’re starting with 01 Jan of a year as our date, and the Paschal full moon month is three (meaning it should fall in Mar), if we add three months to 01 Jan it puts us in Apr. So we subtract one month to get us back to where we need to be.
  • If we’re starting with 01 Mar of a year as our date, and the Paschal full moon day within that month is (say) 14 (meaning it should fall on 14 Mar), if we add 14 days to 01 Mar it puts us at 15 Mar. So we subtract one day to get us back to where we need to be.
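
In code, those two adjustments look something like this (the dates are chosen purely for illustration):

SELECT DATEADD(month, 3, '20150101');       -- 2015-04-01: one month too far
SELECT DATEADD(month, 3-1, '20150101');     -- 2015-03-01: where we want to be
SELECT DATEADD(day, 14, '20150301');        -- 2015-03-15: one day too far
SELECT DATEADD(day, 14-1, '20150301');      -- 2015-03-14: the 14th, as intended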

After all this, I still have no clue what a Paschal full moon is, but what the heck, at least I know I can calculate when it is!  That’s all for today folks.  I hope you found this article useful and just a bit of fun.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 01 Apr 2015.  All rights reserved.

Resources for An Easter SQL.doc

The Resources above are stored in a Microsoft Word (TM) document. There are two sections:

-- Create GenerateCalendar, Calendar and Holiday Tables and Populate.sql
... page break ...
-- Create CalculateEasterSunday and Add Holidays.sql

Stupid T-SQL Tricks – Part 2: Retrieving Columns Set to the Current DEFAULT CONSTRAINT Value

“Heeeeeere’s Johnny!” – Ed McMahon introducing Johnny Carson for the Tonight Show

If you’re old enough to remember the Tonight Show hosted by Johnny Carson, how can you not remember the iconic Stupid Pet Tricks skits, and the later tributes to those done by David Letterman?

My Stupid T-SQL Tricks are hopefully a bit more useful, but just as amusing as Carson’s Stupid Pet Tricks were! Hopefully everyone knows what a DEFAULT CONSTRAINT is for a column in a T-SQL table.  Let’s first create a table with such a column and populate it with some sample rows.

CREATE TABLE #SampleData
(
    MyPrimaryKey    VARCHAR(10) NOT NULL
    ,MyDataColumn   INT
    ,MyDataColumn2  INT 
    ,CONSTRAINT sd_pk PRIMARY KEY (MyPrimaryKey)
);

ALTER TABLE #SampleData
ADD CONSTRAINT sd_df1
DEFAULT (50) FOR MyDataColumn2;

INSERT INTO #SampleData (MyPrimaryKey, MyDataColumn)
SELECT 'Dwain.C', 41 UNION ALL SELECT 'Joe.C', 54
UNION ALL SELECT 'Chris.M', 44 UNION ALL SELECT 'Jeff.M', 33;

SELECT *
FROM #SampleData;

We can see from the following results that when the column is not specified on INSERT, the default value applies to each row created.

MyPrimaryKey  MyDataColumn  MyDataColumn2
Chris.M       44            50
Dwain.C       41            50
Jeff.M        33            50
Joe.C         54            50

Let’s change a couple of the rows to a different value.

UPDATE #SampleData
SET MyDataColumn2 = 52
WHERE MyPrimaryKey IN ('Dwain.C','Jeff.M');

SELECT *
FROM #SampleData;

So now we have this:

MyPrimaryKey  MyDataColumn  MyDataColumn2
Chris.M       44            50
Dwain.C       41            52
Jeff.M        33            52
Joe.C         54            50

Suppose we want to write a generalized query that returns all rows that are set to the current default value for the column that has a default constraint.  We know that column is MyDataColumn2.

DECLARE @MyColDefault INT;

BEGIN TRANSACTION T1;

UPDATE TOP (1) #SampleData
SET @MyColDefault = MyDataColumn2 = DEFAULT;

ROLLBACK TRANSACTION T1;

SELECT *
FROM #SampleData
WHERE MyDataColumn2 = @MyColDefault;

Returns us these results:

MyPrimaryKey  MyDataColumn  MyDataColumn2
Chris.M       44            50
Joe.C         54            50

Using the TRANSACTION/ROLLBACK, we’ve pretended to set one row back to its DEFAULT value, while saving that DEFAULT value into a local variable which we can then use in the subsequent filtered SELECT. Using the TOP (1) option on the UPDATE ensures that we touch only a single row to get the DEFAULT value that applies to all rows.
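
As an aside (this is me musing, and not part of the trick), you could instead read the default’s definition from the system catalog, but it comes back as a string (something like ‘((50))’) that you’d have to parse and convert, which is why I prefer the UPDATE/ROLLBACK approach. Note that we look in tempdb only because #SampleData is a temporary table; for a permanent table you’d query sys.default_constraints in its own database.

SELECT dc.name, dc.[definition]     -- [definition] returns a string like '((50))'
FROM tempdb.sys.default_constraints dc
WHERE dc.parent_object_id = OBJECT_ID('tempdb..#SampleData');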

You’ll find that your code is now impervious to some pesky DBA changing the default constraint value for that column.

ALTER TABLE #SampleData
DROP CONSTRAINT sd_df1;

ALTER TABLE #SampleData
ADD CONSTRAINT sd_df1
DEFAULT (52) FOR MyDataColumn2;

DECLARE @MyColDefault INT;

BEGIN TRANSACTION T1;

UPDATE TOP (1) #SampleData
SET @MyColDefault = MyDataColumn2 = DEFAULT;

ROLLBACK TRANSACTION T1;

SELECT *
FROM #SampleData
WHERE MyDataColumn2 = @MyColDefault;

So now we get different results:

MyPrimaryKey  MyDataColumn  MyDataColumn2
Dwain.C       41            52
Jeff.M        33            52

What we learned from these examples:

  • How to create a named DEFAULT constraint for a column.
  • How to change that constraint, by first dropping it and then creating it again.
  • How to write a query to return all rows where the value of a column is set to its current default.

Now how is that for a stupid T-SQL trick?

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 19 Mar 2015.  All rights reserved.

Lessons Learned from a Poor-performing VIEW

It is “common knowledge” among SQL practitioners that VIEWs don’t perform well. Recently, during an application’s development, I put this to the test. I needed to construct a VIEW for a report the system was producing, and it was highly complex. Let’s take a look at the general structure of the VIEW that I ended up with, along with how it was called by the report.

CREATE VIEW dbo.v_VeryComplex
WITH SCHEMABINDING
AS
WITH aCTE AS
(
    SELECT something
    FROM dbo.a_table a
    JOIN dbo.b_table b ON a.key = b.key
)
SELECT col1, col2, col3
-- many columns including a few calculated ones
FROM
(
    SELECT col, col2, col3  -- plus many more
    FROM dbo.table1 a
    JOIN dbo.table2 b ON a.key = b.key
    -- plus about 14 more JOINs and CROSS APPLYs
    -- among them the aCTE defined above and at 
    -- least one call to a schema-bound, in-line
    -- table valued function
    WHERE a.col = 'abc'
) a;
GO
-- The report calls it thusly 
SELECT col1, col2, col3 -- plus many more
FROM v_VeryComplex
WHERE col2 = 'cde' AND col3 = 'def'
ORDER BY sortcol1, sortcol2;

When I checked the Execution Plan for calling this VIEW (as the report does) during development, I was quite happy to see that every table it hit did an INDEX SEEK. So all of my JOINs, etc. were in order and it looked like it would happily perform with swiftness. And while I was on-site for the Go Live of the application, it most certainly performed with alacrity.

Then about four months later comes the dreaded call from the user. My beloved VIEW with all of those SEEKs was taking 30-40 seconds to return its results.

My first reaction was to be incredulous, of course. Perhaps the report rendering to the browser was taking a long time. So I set about testing it. I captured a query from the report using SQL Profiler and then ran it on the production server, where by this time some 12,000+ transactions had been entered into Table2 (from the VIEW above). Table1 is a configuration table and is quite small.

Sure enough it was dog-slow, returning in about 40 seconds. Not good.

A quick check of the execution plan on the production server showed that I was now suddenly getting an INDEX SCAN on Table2, which means that the VIEW was performing all of its work on all of the transactions in that table, including a whack of complex calculations and summaries of detail records in child tables. The final filtering on the VIEW was no longer being pushed down into an INDEX SEEK as it used to be!

A little more about the application, so you can better appreciate my position. First of all, the application itself is mission critical for this customer. It has to work, and all of the work it processes is quite time sensitive. The report isn’t used for every transaction in the table; in fact, probably fewer than 20-25% of transactions are processed using this particular report. Nonetheless, those transactions are just as time-sensitive as all the rest. And because it involves some governmental reporting, it must be 100% accurate. Inaccuracies in the calculated amounts could lead to fines for the customer.

With more than a little trepidation, I set about rewriting the VIEW in the hopes that I could remove the outer derived table (a) and perhaps return my call to the VIEW to doing an INDEX SEEK on Table2. Fortunately, I had at my disposal a local copy of an older version of the production database that exhibited the same problem – an INDEX SCAN on Table2.

After not a large amount of effort, I was able to do that rewrite and get my INDEX SEEK! So I happily created the new VIEW on the production server to continue my testing and ensure the accuracy of my results. The way I did that was to run the original VIEW, followed by EXCEPT, followed by the new VIEW. Since I was running both VIEWs against all rows in Table2, this should highlight any discrepancies (after checking first that both returned the same number of rows). That test returned no rows, establishing with a very high degree of certainty that my rewrite was producing the correct results. I was nearly beside myself with glee!
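
In case you’ve not used this verification technique before, the test was essentially of this form (the rewritten VIEW’s name here is a placeholder, and running the EXCEPT in both directions is a belt-and-braces addition of mine):

-- Row counts should match first:
SELECT RowsOld = (SELECT COUNT(*) FROM dbo.v_VeryComplex)
    ,RowsNew   = (SELECT COUNT(*) FROM dbo.v_VeryComplexRewrite);

-- Then each EXCEPT should return zero rows:
SELECT col1, col2, col3 FROM dbo.v_VeryComplex
EXCEPT
SELECT col1, col2, col3 FROM dbo.v_VeryComplexRewrite;

SELECT col1, col2, col3 FROM dbo.v_VeryComplexRewrite
EXCEPT
SELECT col1, col2, col3 FROM dbo.v_VeryComplex;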

Alas, that emotional state was short-lived because when I checked the execution plan of the call to the new VIEW in production, I found that I was back to the INDEX SCAN on Table2.

In fact, when I modified the filtering criteria to return the single transaction it was intended to return, I got virtually the same execution time. Not good!

What’s a performance evangelist like me to do? I certainly couldn’t go back to the client and tell him there’s nothing that can be done.

So I did a quick re-evaluation of the situation. My first thought was “perhaps I can INDEX the VIEW.”

That was not an option because:

  • You can’t INDEX a VIEW that uses common table expressions (CTE).
  • While I could modify the VIEW and move that CTE into a derived table, I was still left with several CROSS APPLYs, and you can’t INDEX a VIEW that uses APPLY either!

Rewriting all of those CROSS APPLYs into JOINs was simply not an appealing prospect to me.

Another rather annoying aspect of INDEXing the VIEW, were I able to accomplish it, is that the VIEW is schema-bound and calls that pesky schema-bound, in-line table valued function (iTVF) mentioned in the comments of the code shown above. There is already a specific deployment hierarchy to follow whenever one of the lowest-level schema-bound objects (like that iTVF) needs changing, and it is something of a pain: it involves dropping the VIEW and re-creating it after altering the iTVF. If I were to INDEX the VIEW, I’d also need to remember to recreate that INDEX any time I deployed anything in that hierarchy. Not good either.

Then it hit me like a 2×4 to the head! When the report calls my VIEW, it does it for a single transaction only. So that gave me the solution.

I could copy all of the original code from the VIEW and instead create a schema-bound iTVF. That would look something like this.

CREATE FUNCTION dbo.VeryComplexFcn
(
    @parm1      VARCHAR(10)
    ,@parm2     VARCHAR(10)
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH aCTE AS
(
    SELECT something
    FROM dbo.a_table a
    JOIN dbo.b_table b ON a.key = b.key
)
SELECT col1, col2, col3
-- many columns including a few calculated ones
FROM
(
    SELECT col, col2, col3  -- plus many more
    FROM dbo.table1 a
    JOIN dbo.table2 b ON a.key = b.key
    -- plus about 14 more JOINs and CROSS APPLYs
    -- among them the aCTE defined above and at 
    -- least one call to a schema-bound, in-line
    -- table valued function
    WHERE a.col = 'abc' AND a.col2 = @parm1 AND b.col3 = @parm2
) a;
GO
-- The report calls the iTVF like this 
SELECT col1, col2, col3 -- plus many more
FROM dbo.VeryComplexFcn ('cde', 'def')
ORDER BY sortcol1, sortcol2;

It was in short order that I deployed this new iTVF to the production server and checked the execution plan, to find that I was back to my beloved INDEX SEEK!

The query elapsed time calling the iTVF instead of the VIEW for a single transaction was impressively improved.

New (iTVF):
 SQL Server Execution Times:
 CPU time = 62 ms, elapsed time = 64 ms.
Old (VIEW):
 SQL Server Execution Times:
 CPU time = 35194 ms, elapsed time = 44423 ms.

I’m keeping the VIEW around for posterity’s sake, and of course in the event that we ever need to run it for multiple transactions.

Now I’ve just got to wait for the development team to modify the report to call the iTVF instead of the VIEW, so we can re-deploy the application. I am left with the same hierarchical deployment constraints, but those I can live with because I won’t need to remember to recreate an INDEX on a VIEW every time (SQL Server will remind me about everything else).

The lesson to be learned here is not to be beguiled by execution plans you see during development. Those can change as transactions accumulate. Even though we generated an impressive number of transactions during the development (I had some automated SQL scripts that did this), they were insufficient to reach that critical threshold where the execution plan changed.

Stupid T-SQL Tricks – Part 1: Logarithms

Not just for math geeks, logarithms and their inverse functions (the exponentials) can have useful and often unexpected capabilities in T-SQL.

    “Heeeeeere’s Johnny!” – Ed McMahon introducing Johnny Carson for the Tonight Show

If you’re old enough to remember the Tonight Show hosted by Johnny Carson, how can you not remember the iconic Stupid Pet Tricks skits, and the later tributes to those done by David Letterman?

While Carson’s animal antics were meant to entertain, my Stupid T-SQL Tricks are meant to be thought provoking and to add a tool or two to your T-SQL tool chest in the event that the need arises. So maybe they aren’t so stupid after all (and besides, it’s my blog and I like the title). This will be the first of a series of blogs offering two to four specialty tips that are all somehow related. Today’s tricks will all utilize logarithms to do their magic.

Calculate a Factorial

Statisticians will recognize the factorial, a number represented as n! where n = 0, 1, 2, 3, … For example, 4! = 1*2*3*4 = 24. In the somewhat distant past I had the chutzpah to suggest that it could be calculated with a recursive Common Table Expression (CTE), although in my article Exploring Recursive CTEs by Example I wasn’t really advocating that approach; it was more that I was trying to help folks better understand how recursive CTEs work. Nowadays I would never do it that way, because there is something better.

Let’s start with a Tally table that generates numbers from 1 to 10. In our query below, we’ll SUM those numbers.

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT SUM(n)
FROM Tally;

-- Results:
55

Too bad there isn’t a PROD() built-in function in T-SQL. Or is there? Consider logarithms: since LOG(a) + LOG(b) = LOG(a*b), taking EXP() of the SUM() of the LOGs of a column hands us the product of its values. In other words, a few simple calls to the right built-in functions in the proper sequence convert a SUM() into a PROD(). Let’s try these two queries:

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT EXP(SUM(LOG(n)))
FROM Tally;

SELECT 1*2*3*4*5*6*7*8*9*10;

-- Results:
3628800.00000001
3628800

Those two results are remarkably close, the first one of course being slightly imprecise due to the fact that LOG and EXP are both returning a FLOAT data type. Since we know a factorial number must be an integer, we can convert it as follows, so now it returns the exact integer value of interest (3628800).

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT CAST(ROUND(EXP(SUM(LOG(n))), 0) AS BIGINT)
FROM Tally;

While it’s unlikely that you’ll often need to calculate factorials in T-SQL, you may occasionally need to calculate a product across multiple rows for a column, and this technique will work for that too.
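
One caveat if you do: LOG() raises an error on zero and on negative values, so a general-purpose product needs a little armor. Here’s a sketch of one way to handle both (my own embellishment, not something we needed above):

-- Product of (2, -3, 4) = -24: trap zeroes up front and handle the sign separately
SELECT Product =
    CASE WHEN MIN(ABS(x)) = 0 THEN 0    -- any zero makes the whole product zero
         ELSE CASE WHEN SUM(CASE WHEN x < 0 THEN 1 ELSE 0 END)%2 = 1
                   THEN -1 ELSE 1 END   -- an odd count of negatives flips the sign
              * EXP(SUM(LOG(ABS(NULLIF(x, 0)))))    -- NULLIF keeps LOG() happy
    END
FROM (VALUES(2.0),(-3.0),(4.0)) a(x);

-- Results (a FLOAT, so ROUND/CAST as we did above if you need an exact integer):
-24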

The Running Product

Using SQL 2012, we can use this same technique to calculate a running product (similar to a running total) across multiple rows in a table. To do this, we’ll use a window frame.

WITH Tally(n) AS
(
    SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n
    ,CAST(ROUND(EXP(
        SUM(LOG(n)) OVER
            (
                ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            )
        ), 0) AS BIGINT)
FROM Tally;

-- Results:
1      1
2      2
3      6
4      24
5      120
6      720
7      5040
8      40320
9      362880
10     3628800

Be careful though! These numbers grow really quickly, so hopefully you won’t be doing this over many, many rows. A BIGINT can hold 20! but overflows at 21!.

Coded, Numeric Fields with and without Leading Zeroes

In many applications you’ll find columns that are coded as numbers, but are stored as characters. This is, of course, a best practice when you don’t expect to be doing arithmetic with the contents. Let’s use a bit of a contrived example in the guise of a Sales Order Details table:

CREATE TABLE #SO_Details
(
    SO_Number           VARCHAR(11)
    ,SO_Detail          VARCHAR(4)
    ,Quantity           INT
    ,Price              MONEY
    ,Extended_Price     AS (Quantity*Price)
);

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '1', 3, 15.20),('2013SO00001', '2', 3, 9.40)
    ,('2013SO00001', '3', 1, 11.50),('2013SO00001', '4', 2, 11.55)
    ,('2013SO00001', '5', 2, 14.30),('2013SO00001', '6', 10, 13.32)
    ,('2013SO00001', '7', 5, 19.42),('2013SO00001', '8', 6, 10.15)
    ,('2013SO00001', '9', 4, 12.15);

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

-- Results:
SO_Number     SO_Detail Quantity   Price   Extended_Price
2013SO00001   1         3          15.20   45.60
2013SO00001   2         3          9.40    28.20
2013SO00001   3         1          11.50   11.50
2013SO00001   4         2          11.55   23.10
2013SO00001   5         2          14.30   28.60
2013SO00001   6         10         13.32   133.20
2013SO00001   7         5          19.42   97.10
2013SO00001   8         6          10.15   60.90
2013SO00001   9         4          12.15   48.60

We’ve used a computed column to calculate Extended_Price as Quantity * Price. Let’s suppose that we now need to insert an additional SO_Detail row.

BEGIN TRANSACTION T1;

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '10', 2, 18.88);

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

ROLLBACK TRANSACTION T1;

-- Results:
SO_Number     SO_Detail Quantity   Price   Extended_Price
2013SO00001   1         3          15.20   45.60
2013SO00001   10        2          18.88   37.76
2013SO00001   2         3          9.40    28.20
2013SO00001   3         1          11.50   11.50
2013SO00001   4         2          11.55   23.10
2013SO00001   5         2          14.30   28.60
2013SO00001   6         10         13.32   133.20
2013SO00001   7         5          19.42   97.10
2013SO00001   8         6          10.15   60.90
2013SO00001   9         4          12.15   48.60

Oh my! Those weren’t the results we wanted because our new record (SO_Detail=10) is sorted into the wrong spot! Of course, this can easily be addressed by changing the ORDER BY so that it CASTs the SO_Detail column to INT, but then that’s why I said this example is a bit contrived. Note that we rolled back the transaction so that our new record 10 was not really saved in the table.

We could use our newfound knowledge of logarithms to address this. In our prior example, we used a natural log, but in this example we’ll use a base 10 logarithm. Let’s take a look at a property of base 10 logarithms that we can use to add a leading zero to each of the entries from 1 to 9. To show this property, we’ll fall back to our Tally table.

WITH Tally(n) AS
(
    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n, LOG10(n), FLOOR(LOG10(n))
    ,POWER(10, 1+FLOOR(LOG10(n)))
    ,MAX(n) OVER (PARTITION BY NULL)
FROM Tally
WHERE n IN(1, 9, 10, 11, 99, 100);

-- Results:
1     0                   0    10     100
9     0.954242509439325   0    10     100
10    1                   1    100    100
11    1.04139268515823    1    100    100
99    1.99563519459755    1    100    100
100   2                   2    1000   100

Let’s say that we know that our range of SO_Detail is 1 through 9. In that case, we can use a formula like the following to prepend exactly the right number of leading zeroes (each is shown for the number 9).

SELECT RIGHT(10+9, 1), RIGHT(100+9, 2), RIGHT(1000+9, 3);

-- Results:
9     09   009

The numbers 10, 100 and 1000, as well as the 1, 2 and 3, can be derived directly from the results returned by the previous query. Let’s put this all together.

WITH Tally(n) AS
(
    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(b)
    CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(c)
)
SELECT n, RIGHT(POWER(10, 1+FLOOR(LOG10(m))) + n, 1+FLOOR(LOG10(m)))
FROM
(
    SELECT n, m=MAX(n) OVER (PARTITION BY NULL)
    FROM Tally
) a
WHERE n IN (1, 9, 10, 11, 99, 100);

-- Results:
1      001
9      009
10     010
11     011
99     099
100    100

You see, our new formula using RIGHT not only gets us character strings; it also prepends exactly the right number of zeroes, so the result sorts correctly as a character string.

Let’s return to our SO Details table and see how we can apply this new formula to normalize our detail line numbers.

BEGIN TRANSACTION T1;

INSERT INTO #SO_Details
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('2013SO00001', '10', 2, 18.88);

WITH SO_Details AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
        ,m=MAX(CAST(SO_Detail AS INT)) OVER (PARTITION BY NULL)
    FROM #SO_Details
    WHERE SO_Number = '2013SO00001'
)
UPDATE SO_Details
SET SO_Detail = RIGHT(POWER(10, 1+FLOOR(LOG10(m))) + SO_Detail, 1+FLOOR(LOG10(m)));

SELECT *
FROM #SO_Details
ORDER BY SO_Number, SO_Detail;

ROLLBACK TRANSACTION T1;

-- Results:
SO_Number     SO_Detail  Quantity   Price   Extended_Price
2013SO00001   01         3          15.20   45.60
2013SO00001   02         3          9.40    28.20
2013SO00001   03         1          11.50   11.50
2013SO00001   04         2          11.55   23.10
2013SO00001   05         2          14.30   28.60
2013SO00001   06         10         13.32   133.20
2013SO00001   07         5          19.42   97.10
2013SO00001   08         6          10.15   60.90
2013SO00001   09         4          12.15   48.60
2013SO00001   10         2          18.88   37.76

Now our results are sorted as expected. Once again we have rolled back the transaction so we are left with only 9 rows in our #SO_Details table and single digit SO_Detail values.

Let’s complicate our example just a bit and add a table of transactions we’d like to process, consisting of inserts, updates and deletes and see if we can do the whole thing with a MERGE. First, the transactions table and some sample transactions.

CREATE TABLE #SO_Trans
(
    [action]            CHAR(1)
    ,SO_Number          VARCHAR(11)
    ,SO_Detail          VARCHAR(4)
    ,Quantity           INT
    ,Price              MONEY
);

INSERT INTO #SO_Trans
    (
        [action], SO_Number, SO_Detail, Quantity, Price
    )
VALUES ('U', '2013SO00001', '5', 15, 12.23)
    ,('I', '2013SO00001', NULL, 4, 16.15)
    ,('I', '2013SO00001', NULL, 3, 12.22)
    ,('D', '2013SO00001', '3', NULL, NULL);

SELECT *
FROM #SO_Trans;

-- Results:
action SO_Number     SO_Detail   Quantity   Price
U      2013SO00001   5           15         12.23
I      2013SO00001   NULL        4          16.15
I      2013SO00001   NULL        3          12.22
D      2013SO00001   3           NULL       NULL

For INSERTs represented in the transactions table, we do not need an SO_Detail number. For DELETEs we do not need Quantity or Price. For UPDATEs we need all three. In order to do a MERGE, we need to consider the following:

  • What is the target table?
  • What is the source table?
  • What are the matching criteria and what operations are performed on which kinds of matches?

The target table is easy, although not quite as simple as just #SO_Details, because we need to include only the SOs that appear in the #SO_Trans table. So we’ll start with that:

-- Target table
SELECT SO_Number, SO_Detail, Quantity, Price
FROM #SO_Details
WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans);

We’ll put the target table into a CTE for use in the MERGE. Coincidentally the results are the 9 original rows we put into the table, because there is only one SO_Number in the #SO_Trans table. We do need to include all rows for that SO because any of them may be subject to an UPDATE to change the SO_Detail number.

MERGE throws an exception if you try to update any row in the target table more than once, so we need to carefully construct the source table so that there is one and only one matching row for each row in the target.
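
If you’ve never seen that exception, here’s a tiny repro (throwaway table variables, nothing to do with our SO tables):

DECLARE @target TABLE (k INT PRIMARY KEY, v INT);
DECLARE @source TABLE (k INT, v INT);

INSERT INTO @target VALUES (1, 0);
INSERT INTO @source VALUES (1, 10), (1, 20);    -- two source rows match target row k=1

MERGE @target AS t
USING @source AS s
ON t.k = s.k
WHEN MATCHED THEN UPDATE SET v = s.v;
-- Error: The MERGE statement attempted to UPDATE or DELETE the same row more than
-- once. This happens when a target row matches more than one source row.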

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
)
SELECT [action]=ISNULL(b.[action], 'U')
    ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
    ,a.SO_Detail
    ,Quantity=ISNULL(b.Quantity, a.Quantity)
    ,Price=ISNULL(b.Price, a.Price)
FROM Target4MERGE a
FULL JOIN #SO_Trans b
ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail;

-- Results:
action  SO_Number    SO_Detail  Quantity  Price
I       2013SO00001  NULL       4         16.15
I       2013SO00001  NULL       3         12.22
U       2013SO00001  1          3         15.20
U       2013SO00001  2          3         9.40
D       2013SO00001  3          1         11.50
U       2013SO00001  4          2         11.55
U       2013SO00001  5          15        12.23
U       2013SO00001  6          10        13.32
U       2013SO00001  7          5         19.42
U       2013SO00001  8          6         10.15
U       2013SO00001  9          4         12.15

Notes about these results:

  • The two detail lines to be INSERTed have a NULL SO_Detail number because we have not yet determined what it is to be. We’ll be trying to add them at the end.
  • Note the Quantity and Price (15 and 12.23) for SO_Detail=5, which are the values we want to UPDATE.

The key to the matching will be to construct a ROW_NUMBER() that we can use as the new SO_Detail number. So let’s add that:

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
),
    PreSource AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price
        ,r=ROW_NUMBER() OVER
            (
                PARTITION BY SO_Number
                    ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                ORDER BY CASE
                         WHEN SO_Detail IS NULL
                         THEN 10000
                         ELSE SO_Detail
                         END
            )
    FROM
    (
        SELECT [action]=ISNULL(b.[action], 'U')
            ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
            ,a.SO_Detail
            ,Quantity=ISNULL(b.Quantity, a.Quantity)
            ,Price=ISNULL(b.Price, a.Price)
        FROM Target4MERGE a
        FULL JOIN #SO_Trans b
        ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail
    ) a
)
SELECT [action], SO_Number, SO_Detail, Quantity, Price, r
    ,m=MAX(r) OVER
        (
            PARTITION BY SO_Number, CASE [action] WHEN 'D' THEN 0 ELSE 1 END
        )
FROM PreSource;

-- Results:

action  SO_Number    SO_Detail  Quantity  Price  r   m
D       2013SO00001  3          1         11.50  1   1
U       2013SO00001  1          3         15.20  1   10
U       2013SO00001  2          3         9.40   2   10
U       2013SO00001  4          2         11.55  3   10
U       2013SO00001  5          15        12.23  4   10
U       2013SO00001  6          10        13.32  5   10
U       2013SO00001  7          5         19.42  6   10
U       2013SO00001  8          6         10.15  7   10
U       2013SO00001  9          4         12.15  8   10
I       2013SO00001  NULL       4         16.15  9   10
I       2013SO00001  NULL       3         12.22  10  10

Additional comments on the latest query and results:

  • The PARTITION we created for our ROW_NUMBER() separates the record to be deleted from the remainder.
  • In the ORDER BY for our ROW_NUMBER() we substituted 10000 (one more than 9999, the largest detail number that fits in our VARCHAR(4) SO_Detail column) to force the INSERTed records to the bottom.
  • We enclosed this query in a CTE, because we also want to use the MAX window aggregate function.
  • For that, the PARTITION is identical to what we used for ROW_NUMBER().
  • The result in the r column (except for the [action]=’D’ row) is the new SO_Detail number.
  • The result in m will be used when we convert our integers to characters with leading zeroes.

All that’s left is to construct the final MERGE statement, so here it is.

WITH Target4MERGE AS
(
    SELECT SO_Number, SO_Detail, Quantity, Price
    FROM #SO_Details
    WHERE SO_Number IN(SELECT SO_Number FROM #SO_Trans)
),
    PreSource AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price
        ,r=ROW_NUMBER() OVER
            (
                PARTITION BY SO_Number
                    ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                ORDER BY CASE
                         WHEN SO_Detail IS NULL
                         THEN 10000
                         ELSE SO_Detail
                         END
            )
    FROM
    (
        SELECT [action]=ISNULL(b.[action], 'U')
            ,SO_Number=ISNULL(a.SO_Number, b.SO_Number)
            ,a.SO_Detail
            ,Quantity=ISNULL(b.Quantity, a.Quantity)
            ,Price=ISNULL(b.Price, a.Price)
        FROM Target4MERGE a
        FULL JOIN #SO_Trans b
        ON a.SO_Number = b.SO_Number AND a.SO_Detail = b.SO_Detail
    ) a
),
    Source4MERGE AS
(
    SELECT [action], SO_Number, SO_Detail, Quantity, Price, r
        ,m=1+FLOOR(LOG10(
            MAX(r) OVER
                (
                    PARTITION BY SO_Number
                        ,CASE [action] WHEN 'D' THEN 0 ELSE 1 END
                )))
    FROM PreSource
)
MERGE Target4MERGE t
USING Source4MERGE s
ON s.SO_Number = t.SO_Number AND s.SO_Detail = t.SO_Detail
WHEN MATCHED AND [action] = 'D'
THEN DELETE
WHEN MATCHED AND [action] = 'U'
THEN UPDATE
SET SO_Detail = RIGHT(POWER(10, m) + s.r, m)
    ,Quantity = s.Quantity
    ,Price    = s.Price
WHEN NOT MATCHED AND [action] = 'I'
THEN INSERT 
    (
        SO_Number, SO_Detail, Quantity, Price
    )
VALUES 
    (
        s.SO_Number
        ,RIGHT(POWER(10, m) + s.r, m)
        ,s.Quantity
        ,s.Price
    );

SELECT *
FROM #SO_Details;

-- Results:
SO_Number    SO_Detail  Quantity  Price   Extended_Price
2013SO00001  01         3         15.20   45.60
2013SO00001  02         3         9.40    28.20
2013SO00001  03         2         11.55   23.10
2013SO00001  04         15        12.23   183.45
2013SO00001  05         10        13.32   133.20
2013SO00001  06         5         19.42   97.10
2013SO00001  07         6         10.15   60.90
2013SO00001  08         4         12.15   48.60
2013SO00001  09         4         16.15   64.60
2013SO00001  10         3         12.22   36.66

Now that’s one heck of a complicated-looking query and it could probably do with some comments, but you can get those as needed from the descriptions I’ve provided. A few additional notes are warranted.

  • We applied our 1+FLOOR(LOG10()) formula around the MAX() window aggregate function to save a bit of complexity when assigning SO_Detail in the MERGE.
  • The MERGE uses MATCHED twice, segregating DELETEs from UPDATEs based on [action].
  • The NOT MATCHED clause INSERTs the new rows specified by the #SO_Trans table.

Conclusions

You may be asking why not save all of this computational convolution and just always use a four-digit SO_Detail value with as many leading zeroes as necessary, because that’s pretty easy with something like:

SELECT RIGHT('0000'+CAST(SO_Detail AS VARCHAR(4)), 4);

The answer is, yes we could have done that if our business requirements allowed for always having 4 digits in the SO_Detail column. But frankly, this was much more fun and we got to learn a little about logarithms along the way. And I did warn you by calling this article Stupid T-SQL Tricks, now didn’t I?

Still, we may have learned a few things: how to calculate factorials, how to create a pseudo-PROD() aggregate, how base 10 logarithms can find the number of digits in a number, and how to methodically construct a complex MERGE query one step at a time. All in all, that alone probably made it worthwhile for me to post this blog.

Perhaps it wasn’t quite as entertaining as Johnny Carson’s Stupid Pet Tricks skits, but we hope you enjoyed it anyway.

Follow me on Twitter: @DwainCSQL

Copyright © Dwain Camps 2014 All Rights Reserved