The Best Tool for Comparing Whether Two T-SQL Queries Produce Identical Results
There is one software tool that should be in every software developer’s tool chest. That is knowledge of the platform on which you are developing. With all the hype that surrounds every new software tool that comes out, it is important to recognize the tools you already have at hand, and the ones that may be hidden gems within your development platform of choice.
Let’s backtrack for a moment into how a T-SQL query is developed. Remember my prior blog: Make it Work, Make it Fast, Make it Pretty? That blog (based on the sagacious and immortal words of SQL MVP Jeff Moden) suggests there’s three steps. In reality there are more, but for the time being let’s focus on two of the three mentioned in that statement.
- Make it Work – you understand the requirements, you understand the physical and logical models in which the data is stored, you write a query that returns hundreds of rows of results and you painstakingly check that the results indeed match the requirements, and are as you intended, based on the underlying data stored in the tables.
- Make it Fast – once your query is returning the desired results, at some point in time you may be confronted by the dreaded statement by somebody that “it isn’t fast enough.” Time to break out those performance tuning tools.
Let’s hope that the dreaded “it isn’t fast enough” pronouncement doesn’t come after the query is already in production, but that is probably the most likely scenario.
While I’m into doing the Make it Work part, I have a tendency to rely on known, high-performance code patterns to develop a query which, under most circumstances should probably be of sufficient performance. Maybe I’m having an off day, or maybe my underlying data changes in such a way that the SQL Optimizer simply starts making bad choices for choosing an execution plan (see: Lessons Learned from a Poor-performing View). Maybe the darned thing is very, very complex and you’re pushing up against tight delivery timelines, so you just need to get it done. Whatever the case, there will come a time where you’ll need to rewrite a query to improve its performance.
This is where it might get a little tricky. Suppose that the working query is highly complex, perhaps complex to the point that you’re afraid to touch it for fear of breaking it. I’ve seen that happen a lot in cases where you weren’t the original author of the query. Perhaps you don’t even know all of the outlying test cases that were run through it to prove that it worked correctly in the first place. Let’s face it, documentation isn’t always all it’s cracked up to be, and easily could be lost with time anyway.
You’ll need a tool that allows you to compare the results of the working query, against any alternative(s) that you may need to construct to improve on performance.
Enter the T-SQL EXCEPT Set Operator
Well, first I suppose we need to have a couple of SQL queries to compare. Let’s draw upon an old article of mine kindly published by the fine but eccentric editor of Simple Talk entitled: The SQL of Gaps and Islands in Sequences.
-- A solution of my own design for Gaps WITH Islands AS ( SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1 FROM ( SELECT ID, SeqNo ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) FROM dbo.GapsIslands) a GROUP BY ID, rn ) SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo) FROM ( SELECT ID, SeqNo ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2 FROM Islands CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a GROUP BY ID, m HAVING COUNT(*) = 2 ORDER BY ID, StartSeqNo; -- A Gaps Solution from SQL MVP Deep Dives SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1 FROM ( SELECT ID, cur=SeqNo, nxt=( SELECT MIN(B.SeqNo) FROM dbo.GapsIslands AS B WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo) FROM dbo.GapsIslands AS A) AS D WHERE nxt - cur > 1;
Both of these solutions find gaps in a table containing a column of sequence numbers. You won’t really need to run these two queries to follow along with the discussion, but if you want to try some of the things we’ll discuss that article has all the scripts you’ll need in its resources section (at the top).
Suppose that each of these queries returns about 10,000 rows give or take (they don’t but let’s pretend that they do). To verify the veracity of the statement “they produce identical results” we just need to do a few steps:
- Run both of the queries exactly as shown above and check the row counts of each. If they both return 10,650 rows we’re good. If one returns 10,650 rows and the other returns 10,633 then you know they’re not producing identical results (duh!).
- Comment out the ORDER BY in the first query and add EXCEPT between the two. That should now run because each query produces exactly three columns of results.
- Note that the first query contains a Common Table Expression (CTE). If the second query does too, you’ll need to move that CTE up (and possibly rename it) to make it cascaded with the first. If both queries use the same CTE(s), simply discard the one for the second query.
Our case should now look like this:
-- A solution of my own design for Gaps WITH Islands AS ( SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1 FROM ( SELECT ID, SeqNo ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) FROM dbo.GapsIslands) a GROUP BY ID, rn ) SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo) FROM ( SELECT ID, SeqNo ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2 FROM Islands CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a GROUP BY ID, m HAVING COUNT(*) = 2 EXCEPT -- A Gaps Solution from SQL MVP Deep Dives SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1 FROM ( SELECT ID, cur=SeqNo, nxt=( SELECT MIN(B.SeqNo) FROM dbo.GapsIslands AS B WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo) FROM dbo.GapsIslands AS A) AS D WHERE nxt - cur > 1;
If the two queries produce identical results, when you run the above it should produce zero rows in the results set! Then all you’re left with is to verify that the newer query runs faster, and that’s something I covered in The One Million Row Test Harness.
If some rows appear then there is something wrong and the queries are not producing identical results sets. You could analyze the results by switching the order of the queries around the EXCEPT.
Or you could just forget this tip about using EXCEPT and painstakingly analyze each of the 10,650 rows produced by each result set. But that is something that I’m just a bit too lazy to do!
Then again, maybe it is precisely the tool you need to get over your queryophobia and dive head first into that complex query written by some other SQL geek, without worrying too much about verifying that the results are correct.
If you liked this neat little T-SQL trick then maybe you should follow me on Twitter: @DwainCSQL because who knows what other cool stuff I might come up with!
© Copyright Dwain Camps 21 May 2015. All rights reserved.
May 21, 2015 at 6:53 pm
this is great. I actually follow the same practice of sampling my code changes this way. Good to see someone else mention it. Extremely helpful as it handles nulls correctly.
I normally try to limit my results to bitesized chunks for my purposes, so I normally take a tablesample(1000 rows) and then order by newid() and select top(200)… This gives me a somewhat random sample to run a few times and verify the results match 100%. I do a A vs B and B vs A except statement from the temp tables I store them in. Super helpful. Keep up the good posts.
May 24, 2015 at 9:07 pm
Just remember that, according to BOL (https://msdn.microsoft.com/en-us/library/ms188055.aspx), “EXCEPT returns distinct rows from the left input query that aren’t output by the right input query.” This means that while running “QUERYA EXCEPT QUERYB” might return 0 rows, it doesn’t mean that running “QUERYB EXCEPT QUERYA” will also return 0 rows.
In this post, it was mentioned checking the number of rows returned between the two queries. This would also work (if you have the same # of rows, and QUERYA EXCEPT QUERYB returns zero rows, then QUERYB EXCEPT QUERYA would also return zero rows).
The point is that in order to ensure that the two queries return the same data, there needs to be two checks. Either compare row counts and a QUERYA EXCEPT QUERYB, or QUERYA EXCEPT QUERYB and QUERYB EXCEPT QUERYA.
May 25, 2015 at 7:27 am
Wayne is absolutely correct. If the row counts are not the same (that’s always the first thing I check), then you need to run the query with EXCEPT twice, with the before and after queries switched on each run. This will usually help to give some insight into why they are different.
May 26, 2015 at 9:42 am
I would be remiss in not mentioning one case where EXCEPT may not identify non-congruent queries. Consider the following:
SELECT n
FROM
(
SELECT 2
UNION ALL SELECT 1
UNION ALL SELECT 2
) a (n)
EXCEPT
SELECT n
FROM
(
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 1
) a (n);
Obviously, the two derived tables contain different sets, yet the EXCEPT causes no rows to be returned.
This is caused by different distributions of duplicates within the set. Normally, duplicates are to be avoided, but sometimes 100% duplication is needed. If you think about all of the possible permutations that can cause duplicates, it should be pretty easy to see that most of the time, these would not return an identical row count. So the row counts are important to this case as well.
May 26, 2015 at 6:18 pm
Thanks for the article, made me think. I’ve come up with a fix to this problem; add group by on all columns, select them all, and also select the count. This way you also don’t have to do a rowcount of the tables.
Example:
–Initialization
declare @x table (a int)
declare @y table (a int)
insert @x (a)
select 2 a
union all
select 1
union all
select 1
union all
select 1
insert @y (a)
select 2 a
union all
select 2 a
union all
select 2 a
union all
select 1
–Check if the tables contain the same
select a, count(1) as MyCount
from @y y
group by a
except
select a, count(1) as MyCount
from @x x
group by a
union all
select a, count(1) as MyCount
from @x x
group by a
except
select a, count(1) as MyCount
from @y y
group by a
May 26, 2015 at 6:48 pm
Clive and Ariel – thanks for the comments.
Ariel, you do need to keep in mind that you’re interested in the row counts returned by the query and not necessarily the rows that are in the table right?
May 26, 2015 at 7:09 pm
Dwain Camps wrote (for some reason I can’t reply in the original thread):
—
Ariel, you do need to keep in mind that you’re interested in the row counts returned by the query and not necessarily the rows that are in the table right?
—
This is just an example, I’ve used two simple tables which I select from to make it easier to understand the concept. It wouldn’t work differently if you selected on something else, just 1) group by all the columns you select and 2) add a counter. Then (as Darko Martinovic wrote) do the except both ways.
May 27, 2015 at 6:40 am
Ariel – if your example was intended to be simple then I understand.
I’ll give you my take for whatever it’s worth. Any tool should be simple to use. What I’ve suggested is about the simplest way EXCEPT could be used to verify the congruence of two queries.
On the other hand, I have in the past gone more elaborate routes, like writing queries to verify queries. This can most certainly be done. The risk of course is that you make a mistake in coding the check query. Take for example other testing tools that require scripting of test cases. Those tools’ ability to verify a test case are highly dependent on the correctness of the code written to verify the test case. So in my mind at least, it introduces a new variable into the equation.
Does that make sense?
May 26, 2015 at 3:30 pm
This is an excellent observation. Most of my work, recently, is correcting poor T-SQL queries. As part of that work, I was in the need to develop a mechanism to check to ensure that the original and the revised T-SQL query return the same results. Frankly I did not use T-SQL except, rather than my utility written in .NET. There are much more flexibility in .NET, then in T-SQL. One of them is possiblity to apply hash function on both dataset. I think except is convenient solution but must be used in combination with a union all. QueryA except queryB union all queryB execept query.A further problem is to expect is support from version 2008 and above. It is hard to expect a general solution to this problem. If you really know that the order of rows will be identical, it is possible to apply and T-SQL binary_checksum.
May 26, 2015 at 3:40 pm
Nice point about the ordering of the rows in the two results sets. You do need to ensure equivalency of the ORDER BY clauses in the end. Thanks for commenting!
June 10, 2015 at 3:52 am
[…] Estimating the Size of your Database Backups 42 Things You Should Never, Ever Say To a DBA The Best Tool for Comparing Whether Two T-SQL Queries Produce Identical Results High Performance T-SQL using Code Patterns SQLXML Bulk Loader Basics The Cult of Automation SQL […]
June 19, 2015 at 3:43 am
This blog post got me started using UNION ALL for my comparisons. It will compare the output of two queries and show you the non-matching rows without fussing with changing the order of the queries or sorting the output. Pretty quick, too.
http://weblogs.sqlteam.com/jeffs/archive/2004/11/10/2737.aspx