An Even Faster Method of Calculating the Median on a Partitioned Heap

Posted on Updated on

Back in 2013 I wrote an article kindly accepted for publication on the Simple Talk web site called Calculating the Median Value within a Partitioned Set Using T-SQL.  In that article, I was delighted to propose a previously unpublished solution for calculating the median over a partitioned set that did not rely on INDEXing.  In fact, in my tests it was the fastest solution available for that case.

Later on, I was quite pleased to hear that 18 times nominated (as of this writing) SQL MVP Aaron Bertrand published a comparative study of methods for calculating the median called the Best approaches for grouped median, where he kindly included my submission, and confirmed that it was the highest performance solution available for the case of a heap.

Today we’re going to take a second look at the solution I proposed, and offer a couple of interesting possibilities that may be even better for the case of a partitioned heap.

A Heap Example and Some Sample Data to Populate the Table

The following DDL can be used to create a heap, and load it with some partitioned sample data (the ID column is the partition).

CREATE TABLE dbo.MedianValues
(
   ID INT
   ,N INT
);

INSERT INTO dbo.MedianValues
VALUES (1,1),(1,2),(1,3),(1,4),(1,5)
   ,(2,1),(2,2),(2,3),(2,4)
   ,(3,10),(3,2),(3,10),(3,4)
   ,(4,1),(4,5),(4,1),(4,3),(4,3);

From my original article, below are the results you would expect for the median for these four sets of N:

ID    Ordered Sets         Median
1     {1, 2, 3, 4, 5}      3
2     {1, 2, 3, 4}         2.5
3     {2, 4, 10, 10}       7
4     {1, 1, 3, 3, 5}      3

As a reminder, the median is calculated as follows:

  • Order the values for which you want to calculate the median over the partitioning column (ID).
  • If the count of the rows is odd, you take the middle value. So in the example above, you would take the third row for IDs 1 and 4.
  • If the count of the rows is even, you take the average of the middle two rows. In the example above, this means you use the second and third elements in IDs 2 and 3.

My Original Solution to Median for this Case

Aaron commented that my somewhat “elaborate” solution un-pivots the rows of interest using the CROSS APPLY VALUES approach.

WITH Counts AS
(
   SELECT ID, c
   FROM
   (
      SELECT ID, c1=(c+1)/2, c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE 0 END
      FROM
      (
          SELECT ID, c=COUNT(*)
          FROM dbo.MedianValues
          GROUP BY ID
      ) a
   ) a
   CROSS APPLY (VALUES(c1),(c2)) b(c)
)
SELECT ID=a.ID, Median=AVG(0.+b.N)
FROM
(
   SELECT ID, N
      ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
   FROM dbo.MedianValues a
) a
CROSS APPLY
(
   SELECT N
   FROM Counts b
   WHERE a.ID = b.ID AND a.rn = b.c
) b
GROUP BY a.ID;

It is important to note how we convert the integer Ns to a FLOAT within the AVG grouping function, to ensure no truncation occurs.  The results set for this on our sample data appears as follows, confirming the correct calculation of the median for each case.

ID   Median
1    3.000000
2    2.500000
3    7.000000
4    3.000000

Calculating a Median Using Grouping Sets

While I was playing around with GROUPING SETS the other day, I hit upon the strangest query result, which gave me an idea that relates to calculating medians.  Let’s take a look at a sample of what I found.

SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
    ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
        ELSE (COUNT(*)+1)/2 
        END
FROM 
(
    SELECT ID, ID1=ID
    FROM dbo.MedianValues
) a
GROUP BY GROUPING SETS (ID, ID1)
HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
ORDER BY ID, c;

This query produces a results set consisting of all the partitioning IDs with one row for IDs 1 and 4 (where c=3) and two rows for IDs 2 and 3 (where c=2 and 3 respectively).  Take a look at that description and compare it to my description of how to calculate median, and see if you too are not intrigued.

ID   c
1    3
2    2
2    3
3    2
3    3
4    3

Coincidentally, this is exactly what I was trying to achieve by my “elaborate” (according to Aaron) method that used CROSS APPLY VALUES!  One could argue that this is even more elaborate, and I would tend to agree.  So naturally the thought struck me that perhaps I could use this approach to calculate a median.

My initial attempts at JOINing these six rows back to the full set with a ROW_NUMBER() added yielded me no joy however, as the result was slower than my original.

Once again inspiration struck and I had the idea to aggregate the GROUPING SETS so that I only had one row for each ID, after which I could do a range (BETWEEN) JOIN.  So my aggregation now looks like this.

SELECT ID, c1=MIN(c), c2=MAX(c) 
FROM
(
    SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
        ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
            ELSE (COUNT(*)+1)/2 
            END
    FROM 
    (
        SELECT ID, ID1=ID
        FROM dbo.MedianValues
    ) a
    GROUP BY GROUPING SETS (ID, ID1)
    HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
) a
GROUP BY ID;

Which produces a set that looks like this:

ID  c1  c2
1   3   3
2   2   3
3   2   3
4   3   3

Playing around with JOINing this back to the original data set, led me to using a CROSS APPLY instead of a JOIN.  So what I ended up with, which calculates the median quite nicely, is the following query.

WITH RowsOfInterest AS
(
    SELECT a.ID, N
    FROM 
    (
        SELECT ID, c1=MIN(c), c2=MAX(c) -- * -- ID, c
        FROM
        (
            SELECT ID=CASE GROUPING_ID(ID) WHEN 1 THEN ID1 ELSE ID END
                ,c=CASE GROUPING_ID(ID1) WHEN 1 THEN 1+COUNT(*)/2
                    ELSE (COUNT(*)+1)/2 
                    END
            FROM 
            (
                SELECT ID, ID1=ID
                FROM dbo.MedianValues
            ) a
            GROUP BY GROUPING SETS (ID, ID1)
            HAVING GROUPING_ID(ID) = 1 OR (GROUPING_ID(ID1) = 1 AND COUNT(*)%2 = 0)
        ) a
        GROUP BY ID
        HAVING COUNT(*) <= 2
    ) a
    CROSS APPLY
    (
        SELECT ID, N
        FROM 
        (
            SELECT ID, N
                ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
            FROM dbo.MedianValues b
        ) b
        WHERE a.ID = b.ID AND b.rn BETWEEN a.c1 AND a.c2
    ) b
)
SELECT @ID=ID, @Median=AVG(0.+N)
FROM RowsOfInterest a
GROUP BY a.ID;

How’s that for elaborate, Aaron?  Initial timings appeared to be pretty decent, but more importantly what I was seeing was a drastically reduced IO count and CPU usage.  Clearly this was a promising vector of attack, although the truth will only get sorted out using SQL Profiler to average the results of multiple runs (the elapsed timings were very close).

One rather annoying note on this solution.  When I first hit upon it, I could have sworn I had something that was consistently faster than my original solution.  Unfortunately due to a PC reboot (pesky Windows updates!) I lost that one and I was never able to quite achieve the speed I was seeing.  This could be an interesting assignment for those so inclined.

Simplifying the Elaborate GROUPING SETS Approach

After thinking a little more about this, I realized it might be possible to greatly simplify this aggregation to produce just the four rows I needed.  Silly of me really not to think of it in the first place, but then my brain may not be wired that way.

So here’s another method to get the median using a simpler initial aggregation to get ID, c1 and c2 just like the query in the prior section.

WITH RowsOfInterest AS
(
    SELECT ID, c=COUNT(*)
    FROM dbo.MedianValues
    GROUP BY ID
)
SELECT ID=ID, Median=AVG(0.+N)
FROM
(
    SELECT a.ID, N
    FROM 
    (
        SELECT ID, c1=(c+1)/2, c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END
        FROM RowsOfInterest
    ) a
    JOIN 
    (
        SELECT ID, N
            ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)
        FROM dbo.MedianValues b
    ) b
    ON a.ID = b.ID AND b.rn BETWEEN a.c1 AND a.c2
) a
GROUP BY a.ID;

This time I used a JOIN instead of a CROSS APPLY because initial timings seemed to indicate a slight advantage.  This method also exhibited a serious reduction in IO counts and CPU usage over my original solution.

You’ll note that this method is quite similar to my original approach except that it doesn’t CROSS APPLY VALUES and doesn’t do the pre-aggregation in the second solution either.  The main difference from the original solution is the method of JOINing, which is to use the BETWEEN range selector.

The One Million Row Test Harness

In my original Simple Talk article I developed a one million row test harness that can be used to test a median query, and I have reproduced that below exactly as before.

TRUNCATE TABLE dbo.MedianValues;

DECLARE @N INT = 10000; -- 10,000 = ~1,000,000 rows

WITH Tally (n) AS
(
   SELECT TOP (@N) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
   FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO dbo.MedianValues
SELECT a.n, ABS(CHECKSUM(NEWID()))%@N
-- Always create exactly 100 IDs from the Tally table
FROM (SELECT TOP 100 n FROM Tally) a
-- Generate either an odd or even number of rows within IDs
CROSS APPLY
(
   SELECT TOP (@N-a.n%2) n
   FROM Tally
)b;

This actually generates 999,950 rows into our sample data table.

The SQL Profiler Test

My initial timings using SET STATISTICS IO, TIME ON were producing pretty fine numbers for IO and CPU, but pretty close scores for elapsed times.  So the only way to really be sure which of these three queries is the fastest, it is necessary to run each batch multiple times (I chose that to be ten) and then average the results.  I won’t include the Profiler test harness here because this is just a blog, and I’m sure if you’ve read my previous blog on the One Million Row Test Harness, you’re capable of doing this yourself given everything we’ve given you above.

Here are the raw and averaged SQL Profiler results for the three queries, with the last row being the average of the ten prior.

Solution #1: Original method Solution #2: Using Grouping Sets Solution #3: Simplified aggregation
CPU Reads Duration CPU Reads Duration CPU Reads Duration
4,943 2,006,286 1,585 1,622 5,843 1,663 2,840 5,674 1,147
4,759 2,006,286 1,638 1,826 5,843 1,663 2,979 5,674 1,106
4,820 2,006,286 1,575 1,715 5,843 1,662 2,682 5,674 1,326
4,836 2,006,286 1,654 1,732 5,843 1,672 2,932 5,674 1,353
4,822 2,006,286 1,608 1,747 5,843 1,620 2,949 5,674 1,190
4,787 2,006,286 1,598 1,701 5,843 1,659 2,729 5,674 1,161
4,788 2,006,286 1,669 1,716 5,843 1,638 2,872 5,674 1,175
5,104 2,006,286 1,574 1,716 5,843 1,538 3,026 5,674 1,244
4,698 2,006,286 1,569 1,700 5,843 1,665 2,840 5,674 1,296
4,632 2,006,286 1,571 1,747 5,843 1,663 2,792 5,674 1,195
4,819 2,006,286 1,604 1,722 5,843 1,644 2,864 5,674 1,219

It is also instructive to look at a closer comparison of the averages.

  Averages
Solution CPU Reads Duration
#1: Original method 4,819 2,006,286 1,604
#2: Using Grouping Sets 1,722 5,843 1,644
#3: Simplified aggregation 2,864 5,674 1,219
Best Solution #2 #3 #3
New #2 %O/(U) #1 -64% -100% 3%
New #3 %O/(U) #1 -41% -100% -24%

You can see how solution #3 was the best performer in terms of IO (Reads) and Duration (Elapsed MS).  Solution #2 had the edge in CPU and was nearly tied in duration with solution #1.  Overall my take on these results is that they are both better than my original solution, but of the two #3 is probably the best.

The percentage rows indicate that solution #3 was 41% better than (improved over) solution #1 in CPU, while solution #2 was 64% better than solution #1.  Both new solutions pretty much ripped solution #1 apart in terms of IO.  And probably the most important statistics shows that while solutions #1 and #2 were probably indeterminately different, solution #3 was about 24% faster than both.

The Final Word

The two new solutions for calculating the median within partitioned sets on a heap (no indexing on the table) seem to have delivered a slight edge over the previous benchmark case.  It is always gratifying to do something just a little bit better, even when you’re kicking yourself for not thinking of doing it that way in the first place.

Performance testing was performed on:

  • SQL Server 2012 64 bit
  • Windows 7 64 bit with 12GB of memory installed
  • Intel Core i5-3230M CPU @ 2.60 GHz

I did not attempt to test these solutions using either of the INDEXing schemes proposed by Aaron in his comparative article, because my gut tells me that they’re probably no better than the ones that won out in those particular scenarios.  Since this is just a blog, we’ll have to settle today for seeing just this one case.

And to Aaron, if you happen to be reading this blog at some time, my little digs at you calling my solution “elaborate” were all in good fun.  Actually I considered it quite a badge of honor that you picked my solution in your comparative study.

Thanks folks as always for listening.  Hope the solutions you find here and in my other blogs are helpful to you some day.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 08 Apr 2015.  All rights reserved.

Advertisements

2 thoughts on “An Even Faster Method of Calculating the Median on a Partitioned Heap

    […] An Even Faster Method of Calculating the Median on a Partitioned Heap | dwaincsql […]

    […] by SQL MVP Aaron Bertrand (Best Approaches for a Grouped Median) and again by me in another blog (An Even Faster Method of Calculating the Median on a Partitioned Heap). In that later blog, is yet another code pattern for median that remains in the un-tested stage […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s