generating test data
So far in our blogs, we have talked a bit about performance, but today we’re going to show you a way you can confirm without doubt that you’re writing high-performance T-SQL code. This extremely valuable technique is something Developers and Testers alike should be familiar with.
Why one million (1M) rows? Well, the answer to that question is really three-fold.
- First of all, 1M rows represent sort of a threshold, where you can start to consider your tables as being “big.”
- Second, if your query handles 1M rows in a reasonably short period of time, it means that it will probably do OK even if the table grows much larger (which is what we hope because that judges how popular our application is).
- Finally, depending on what you are testing, the 1M row test may make the difference between deciding on the right query (the one that is the fastest) and the wrong query, because at 1M rows it usually separates the winners from the losers quite clearly.
Along the way we’re going to learn about a variety of ways that test data can be generated. Unfortunately we won’t be able to cover every possible case. We’ll need to leave some of them to your imagination. Once you understand the basics, you should be able to extend the concept to other types of test data relatively quickly.
Case Study 1: Verify Which Formula is Fastest
Suppose you have a table that contains an INTEGER ID number column. Your application requires that when that number is shown on a form, it displays with leading zeroes. There are many ways to do this but we’ll look at three.
DECLARE @Num INT = 2342; SELECT @Num ,RIGHT('000000'+CAST(@Num AS VARCHAR(7)),7) -- Method 1 ,RIGHT(10000000+@Num, 7) -- Method 2 ,STUFF(@Num, 1, 0, REPLICATE('0', 7-LEN(@Num))); -- Method 3
Each of these returns the same result: 0002342 and will work for any positive integer <= 9999999. But the question is which one is faster?
To answer that question we’ll construct a 1M row test harness based on the Numbers table we created in our blog on Tally Tables. We’ll also show an example of one of those methods using an in-line Tally table (from the same blog), so you can see the difference between using a permanent vs. an in-line tally table.
DECLARE @NumResult VARCHAR(7); PRINT 'Method 1:'; SET STATISTICS TIME ON; SELECT @NumResult = RIGHT('000000'+CAST(N AS VARCHAR(7)),7) FROM dbo.Numbers WHERE N <= 1000000; SET STATISTICS TIME OFF; PRINT 'Method 2:';SET STATISTICS TIME ON; SELECT @NumResult = RIGHT(10000000+N, 7) FROM dbo.Numbers WHERE N <= 1000000; SET STATISTICS TIME OFF; PRINT 'Method 3:'; SET STATISTICS TIME ON; SELECT @NumResult = STUFF(N, 1, 0, REPLICATE('0', 7-LEN(N))) FROM dbo.Numbers WHERE N <= 1000000; SET STATISTICS TIME OFF; PRINT 'Method 2 w-in line Tally Table:'; SET STATISTICS TIME ON; WITH Tally(N) AS ( SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.all_columns a CROSS JOIN sys.all_columns b ) SELECT @NumResult = RIGHT(10000000+N, 7) FROM Tally WHERE N <= 1000000; SET STATISTICS TIME OFF; -- Results: Method 1: SQL Server Execution Times: CPU time = 265 ms, elapsed time = 274 ms. Method 2: SQL Server Execution Times: CPU time = 250 ms, elapsed time = 250 ms. Method 3: SQL Server Execution Times: CPU time = 452 ms, elapsed time = 460 ms. Method 2 w-in line Tally Table: SQL Server Execution Times: CPU time = 234 ms, elapsed time = 227 ms.
You should always run the test harness a few times and record the results of each run. By looking at the results, we see that methods 1 and 2 are pretty close, but over the 4-5 runs that I did, method 2 was consistently just a little faster in elapsed time. Using an in-line Tally table in this case was just a little faster than using the permanent Tally table (that may not always be the case).
You may be saying to yourself that this little 9% improvement doesn’t mean much, but picture it in the context of a much more complex query, where every slight improvement you can make counts for something.
Let’s now look at some key aspects of the test harness we used:
- It contains multiple queries that return the identical results for the same number of rows. This is important because you’re trying to compare solutions.
- We SET STATISTICS TIME ON before each query and OFF when it was complete. You could have turned them ON once at the start and OFF at the end, but that’s not always going to be the case.
- We printed out a description of the method we are using just before setting STATISTICS ON. That’s the reason we turned them OFF after each query; so you wouldn’t see the statistics for the PRINT statement, which is immaterial.
- Finally, we created a local variable @NumResult and assigned our calculated result (the returned columns) to it. This is important to eliminate the time that SQL Server Management Studio (SSMS) would otherwise take to render the results to the Results pane. If you don’t do that, it can bias the results. We’re interested in clocking the raw query speed here.
CPU time can be important sometimes, so you may also want to look at that. It turns out that method 2 also appears better than method 1 in terms of CPU time, but that may not always be the case and over the 4-5 runs we did it often was a tie.
After this test, we can say pretty clearly that method 2 was the highest performance method among the ones we looked at. We are also now well on our way to being able to say definitively that we’ve written the best possible query to solve our business problem.
Case Study 2: Removing Numbers from a Character String
In this case study, we will illustrate a few additional concepts you should know when constructing a test harness:
- How to construct random character strings of any length.
- Using an alternate means to capture the elapsed time of the queries we’re comparing.
- Using an alternative to a local variable to avoid rendering results to SSMS.
The business problem is that we wish to remove numbers from character strings that contain only letters and numbers. For this task, we have written two T-SQL FUNCTIONs that basically do the job but need to be called differently.
-- Scalar Valued Function that removes characters based on a pattern match CREATE FUNCTION dbo.CleanString (@pString VARCHAR(8000), @pPattern VARCHAR(100)) RETURNS VARCHAR(8000) AS BEGIN DECLARE @Pos SMALLINT; SELECT @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN); WHILE @Pos > 0 SELECT @pString = STUFF(@pString,@Pos,1,''), @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN); RETURN @pString; END GO -- In-line, schema-bound Table Valued Function CREATE FUNCTION dbo.RemoveMatchedPatterns ( @Str VARCHAR(8000) ,@Pattern VARCHAR(100) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH Tally(n) AS ( SELECT TOP (ISNULL(LEN(@Str), 0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) CROSS JOIN(VALUES(0),(0),(0),(0),(0),(0),(0),(0)) d(n) ), SplitString AS ( SELECT n, s=SUBSTRING(@Str, n, 1) FROM Tally WHERE PATINDEX(@Pattern, SUBSTRING(@Str COLLATE Latin1_General_BIN, n, 1)) = 0 ) SELECT ReturnedString= ( SELECT s + '' FROM SplitString b ORDER BY n FOR XML PATH(''), TYPE ).value('.', 'VARCHAR(8000)'); GO
For now, it is not necessary to fully understand how these FUNCTIONs work, but it is necessary to understand how to use them. Note that the CleanString FUNCTION uses a WHILE loop to do its work.
DECLARE @TestString VARCHAR(8000) = '123122adflajsdf34a23aa333w'; SELECT TestString=@TestString ,SVFReturns=dbo.CleanString(@TestString, '%[0-9]%') ,iTVFReturns=ReturnedString FROM dbo.RemoveMatchedPatterns(@TestString, '%[0-9]%'); -- Results: TestString SVFReturns iTVFReturns 123122adflajsdf34a23aa333w adflajsdfaaaw adflajsdfaaaw
Both FUNCTIONs have the same call signature, which includes the pattern of characters you want removed, and both return the same results (only the alphabetic characters).
Before we proceed to generate a test harness containing lots of random strings we can test, it is necessary to familiarize you with a way to generate random numbers in T-SQL. So let’s consider the following statement:
SELECT IntRN=1+ABS(CHECKSUM(NEWID()))%100 ,RNwDec=.01*(1+ABS(CHECKSUM(NEWID()))%10000); -- Results: IntRN RNwDec 31 40.74
If you run this statement multiple times, you’ll get different results each time. In each case, you get a number between 1 and 100. The RNwDec column will have 2 decimal digits. This is the standard method in SQL to generate a Uniform Random Number (URN). If you need a wider range of numbers, change 100 (or 10000 for RNwDec) to something larger.
To generate a random string of characters, you can use the URN formula as follows:
SELECT REPLICATE( SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1) ,1+ABS(CHECKSUM(NEWID()))%20) + REPLICATE( SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1) ,1+ABS(CHECKSUM(NEWID()))%20);
You can try this one yourself. Each time you get a different string consisting of 1 to 20 letters (always the same letter repeated) and 1 to 20 numbers (always the same number repeated). The string will be of varying length between 2 and 40 characters. To get a string of maximum length 8000, all you need to do is replicate the pattern a random number of times. The final (outer) replication should be performed up to 200 times, so can be done like this:
SELECT REPLICATE( REPLICATE( SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1) ,1+ABS(CHECKSUM(NEWID()))%20) + REPLICATE( SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1) ,1+ABS(CHECKSUM(NEWID()))%20) ,1+ABS(CHECKSUM(NEWID()))%200);
Now that we understand the tools we’ll need, here is the test harness.
-- Create 1000 rows of random strings SELECT s=REPLICATE( REPLICATE( SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1+ABS(CHECKSUM(NEWID()))%26, 1) ,1+ABS(CHECKSUM(NEWID()))%20) + REPLICATE( SUBSTRING('0123456789',1+ABS(CHECKSUM(NEWID()))%10,1) ,1+ABS(CHECKSUM(NEWID()))%20) ,1+ABS(CHECKSUM(NEWID()))%200) INTO #TestStrings FROM dbo.Numbers WHERE N <= 1000; DECLARE @StartDT DATETIME = GETDATE(); SELECT SVFReturns=dbo.CleanString(s, '%[0-9]%') INTO #Test1 FROM #TestStrings; -- Display elapsed time for the scalar-valued User-defined Function (UDF) SELECT SVFElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE()); SELECT @StartDT = GETDATE(); SELECT iTVFReturns=ReturnedString INTO #Test2 FROM #TestStrings CROSS APPLY dbo.RemoveMatchedPatterns(s, '%[0-9]%'); -- Display elapsed time for the in-line Table Valued Function (iTVF) SELECT iTVFElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE()); GO DROP TABLE #TestStrings; DROP TABLE #Test1; DROP TABLE #Test2;
For this test, we’ll illustrate the fact that sometimes you don’t need to go up to 1M rows to distinguish the difference between two queries. In this case, it becomes quite apparent at 1000 rows. Here are the timings that are displayed by the two SELECT statements:
SVFElapsedMS 26290 iTVFElapsedMS 7933
That’s already a pretty large difference so you can imagine how long it would take to run at 1M rows. Here’s an explanation of the differences between this test harness and the one from before:
- We ran at 1000 rows of test data instead of 1M (by filtering N from our Numbers table with WHERE N <= 1000).
- Instead of assigning the results to a local variable, we instead SELECT … INTO a temporary table. Since both queries absorb the same overhead for this operation, the results are still comparable.
- Instead of using SET STATISTICS TIME ON/OFF, we’ve simply used the @StartDT local variable to capture the elapsed time (calculated using DATEDIFF).
The latter method of capturing elapsed time is used because of a quirky behavior of STATISTICS in some cases when you are timing a scalar-valued, user-defined function. SQL MVP Jeff Moden explains this in How to Make Scalar UDFs Run Faster.
This example also serves to demonstrate the well-known fact that a good set-based query will almost always be faster than a loop. If you remember our introduction to DelimitedSplit8K in our Tally Tables blog, Jeff Moden uses the same basic methodology (an in-line Tally table) to make DelimitedSplit8K extremely fast.
Case Study 3: Timing Solutions for Running Totals
In this example, we’re going to perform a test on two solutions to the Running Totals (RT) problem in T-SQL. The second solution is only valid in SQL 2012. First, we’ll set up a test table and populate it with some random data.
CREATE TABLE dbo.RunningTotalsTest ( [Date] DATETIME PRIMARY KEY ,Value INT ,RunningTotal1 INT ,RunningTotal2 INT ); WITH SomeDates AS ( SELECT d=DATEADD(hour, N, '2010-01-01') FROM dbo.Numbers WHERE N <= 10000 ) INSERT INTO dbo.RunningTotalsTest([Date], Value) SELECT d, 1+ABS(CHECKSUM(NEWID()))%100 FROM SomeDates; GO
We have populated our table with a series of date values that increases in an hourly fashion; including a “Value” that is simply a random integer between 1 and 100. We’ll be calculating the two RunningTotal columns from our Value. Note that at the end of this “batch” we have included the batch separator (GO). The test harness includes only 10,000 rows because we know in advance that this will be a sufficient number to distinguish between our solutions (1M rows is still recommended for most normal cases).
The first running totals solution we’ll look at is what is known as a triangular JOIN, because for each row it adds up all of the prior rows using a correlated sub-query. Once again, notice the batch separator (GO) at the end of the batch.
-- RT by Triangular JOIN UPDATE a SET RunningTotal1 = ( SELECT SUM(value) FROM dbo.RunningTotalsTest b WHERE b.[Date] <= a.[Date] ) FROM dbo.RunningTotalsTest a; GO
The next solution, which only works in SQL Server 2012, is a new facility Microsoft has kindly provided to us for calculating running totals (and a host of other things).
-- RT with SQL 2012 window frame WITH RunningTotal AS ( SELECT [Date], Value, RunningTotal2 ,rt=SUM(value) OVER (ORDER BY [Date] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM dbo.RunningTotalsTest ) UPDATE a SET RunningTotal2 = rt FROM RunningTotal a; GO -- Final SELECT SELECT * FROM dbo.RunningTotalsTest;
We’ve also included a final SELECT (in the last batch) to show that both running totals were calculated correctly.
To run this code and obtain our timings, we’re going to learn to use Tools/SQL Profiler. This brings up a window allowing you to name the profiler (trace) if desired.
After you click Run, the Profile session will begin.
You can now execute the four batches of T-SQL we created above, two of which contain the solutions of interest. Once the run is complete, the Profile window now looks like this, where we have circled in red the two results of interest.
Notice how the comment we placed at the beginning of each batch, clearly shows up on the BatchCompleted lines with our desired results:
-- RT by Triangular JOIN -- RT with SQL 2012 window frame
The results show that the new SQL 2012 method for running totals completed in only 117 milliseconds, while the triangular JOIN took 11267 milliseconds. Imagine what the triangular JOIN approach would have taken had we run against 1M rows, or better yet imagine a customer waiting on an application’s web form for that result to be displayed!
The first ten rows of results displayed show that both of our running totals solutions worked correctly, yet the timing results tell us that they are definitely not equivalent!
Date Value RunningTotal1 RunningTotal2 2010-01-01 01:00:00.000 63 63 63 2010-01-01 02:00:00.000 75 138 138 2010-01-01 03:00:00.000 2 140 140 2010-01-01 04:00:00.000 27 167 167 2010-01-01 05:00:00.000 73 240 240 2010-01-01 06:00:00.000 71 311 311 2010-01-01 07:00:00.000 17 328 328 2010-01-01 08:00:00.000 64 392 392 2010-01-01 09:00:00.000 40 432 432 2010-01-01 10:00:00.000 56 488 488
SQL Profiler is a very useful way to time a batch that contains multiple SQL statements (like if you want to test performance of a CURSOR vs. a set-based solution). Take care when setting up each batch to avoid unnecessary overhead in one batch vs. the other.
In a future blog, we’ll describe a method for calculating running totals that works in any version of SQL that is faster than both of these solutions.
Summary and What We Learned
Firstly and most importantly, we’ve learned how to create a one million row test harness so we can compare the performance of two queries that return identical results, and why this is an essential step to verifying query performance.
We’ve learned how to generate random test data:
- Using Tally tables
- Using a formula to generate random numbers within a specified range
- Generating random character strings
We’ve learned three methods to time the queries we’re comparing as they process the test harness:
- Using SET STATISTICS TIME ON/OFF
- Using an elapsed time calculator and displaying the results in a SELECT
- Using SQL Profiler
Some hints to follow when using the 1M row test harness (not all of which are mentioned):
- After creating your test harness, start out with a lower row count than a million. Once you’re done debugging the test harness, ramp up the row count until one of the solutions clearly distinguishes itself from the others.
- Try to run each test multiple times at the highest row count you need to prove the result and then average the timings.
- Avoid using SET STATISTICS ON/OFF when comparing code that includes a call to a Scalar-valued, User-defined FUNCTION. Use one of the other techniques instead (Profiler works fine on this case).
- You can as easily compare 3-4 solutions as two using any of these methods. The more solutions you have to the same problem, the more chance you’ll have of identifying the best performing.
- There are cases where using DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS will improve the accuracy of your measurement; however this is a topic that is more advanced than we planned to explore in this blog.
- Just about any article that SQL MVP Jeff Moden writes for the SQL Server Central web site provides an example of a 1M row test harness to test his proposed solution against alternatives, but these two articles are specifically directed to this subject.
- Also by SQL MVP Hugo Kornelis: Generating Test Data in TSQL
- SQL MVPs like Itzik Ben-Gan and many others are advocates of and nearly always employ such 1M row tests when proving their solutions perform well.
Now you’re ready to start learning how to write high-performance T-SQL code, because now you have a tool that is essential to proving that you’re writing queries that are the best that they can be!
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved