Month: April 2014
Using the T-SQL MERGE Statement
In SQL Server 2008, Microsoft added a new SQL query type: the MERGE statement. This flexible query provides the ability to perform INSERTs, UPDATEs and even DELETEs all within a single statement. Used in combination with Common Table Expressions (CTEs), this can be a powerful tool to replace multiple SQL queries under the right circumstances.
One important rule to keep in mind when using MERGE is that the statement must be terminated by a semicolon (;).
Case Study 1: A Simple Upsert
The most common usage of the MERGE statement is to perform what is colloquially called an “upsert,” which is really a contraction of UPDATE/INSERT. Without further preamble, let’s set up some test data and see how the MERGE statement can simplify your life.
CREATE TABLE #Master
(
    [key] INT IDENTITY PRIMARY KEY
    ,name VARCHAR(10)
);

INSERT INTO #Master
VALUES ('Dwain'),('Jeff'),('Paul')
    ,('Anon'),('Ralph'),('Tom'),('Sally');

CREATE TABLE #Staging
(
    [key] INT PRIMARY KEY
    ,[NewName] VARCHAR(10)
);

INSERT INTO #Staging
VALUES (2, 'Bob'),(4, 'Jim'),(6, 'Marvin'), (10, 'Buddy');

SELECT * FROM #Master;
SELECT * FROM #Staging;
The results in the two tables as displayed by the SELECT are:
key  name
1    Dwain
2    Jeff
3    Paul
4    Anon
5    Ralph
6    Tom
7    Sally

key  NewName
2    Bob
4    Jim
6    Marvin
10   Buddy
Our intention is to update (based on [key] in #Staging) the corresponding row by [key] in #Master. If the [key] in #Staging matches none of our [key] values in #Master, then insert a new row. The new row’s [key] does not need to match the value in the staging table. We can easily do this with a MERGE statement as follows:
MERGE #Master t
USING #Staging s
ON s.[key] = t.[key]
WHEN MATCHED THEN
    UPDATE SET name = s.[NewName]
WHEN NOT MATCHED THEN
    INSERT (name) VALUES (s.[NewName]);

SELECT * FROM #Master;
The final SELECT result is as follows:
key  name
1    Dwain
2    Bob
3    Paul
4    Jim
5    Ralph
6    Marvin
7    Sally
8    Buddy
You can see that Bob replaced Jeff, Jim replaced Anon and Marvin replaced Tom, and also that Buddy was added at the end.
The way the statement works is as follows:
- The table name immediately after the MERGE keyword is the target table, in this case #Master, which we have aliased as t for easy understanding.
- The USING table is the source, so #Staging will be merged into #Master.
- The ON keyword represents the matching criteria between the records in the two tables. You should not think of this in the same way that ON appears after a JOIN as it operates quite differently.
- Following those parts of the statement are any number of WHEN clauses. The MATCHED criterion indicates a match based on the ON criteria. It can be combined with additional matching criteria if required.
- NOT MATCHED (implied as BY TARGET) means that when a source row does not exist in the target table, we’re going to do something.
- Following MATCHED or NOT MATCHED is the keyword THEN followed by the action to take: in our example an UPDATE for MATCHED rows and an INSERT for NOT MATCHED rows.
You can also use DELETE (instead of UPDATE or INSERT). If you’d like to learn how to DELETE rows from the target table, and about the potential dangers of doing so, I suggest you read this article: A Hazard of Using the SQL Merge Statement.
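To give a feel for the DELETE hazard, here is a minimal sketch. It is not taken from the linked article: the #Orders and #Incoming tables and the TargetScope CTE are hypothetical names invented for this illustration. It shows one commonly used way of limiting the scope of the DELETE, which is to MERGE into a CTE that exposes only the target rows the source is meant to cover.

```sql
-- Hypothetical target: order lines for two different orders
CREATE TABLE #Orders
(
    OrderID INT NOT NULL
    ,LineID INT NOT NULL
    ,Item   VARCHAR(10)
    ,PRIMARY KEY (OrderID, LineID)
);
INSERT INTO #Orders
VALUES (1, 1, 'Apple'),(1, 2, 'Pear'),(2, 1, 'Plum');

-- Hypothetical source: the complete new contents of order 1 only
CREATE TABLE #Incoming
(
    OrderID INT NOT NULL
    ,LineID INT NOT NULL
    ,Item   VARCHAR(10)
    ,PRIMARY KEY (OrderID, LineID)
);
INSERT INTO #Incoming
VALUES (1, 1, 'Apricot'),(1, 3, 'Fig');

-- Merging straight into #Orders would let NOT MATCHED BY SOURCE delete
-- order 2 as well, because none of its rows appear in the source.
-- The CTE restricts the target to the orders present in the source.
WITH TargetScope AS
(
    SELECT OrderID, LineID, Item
    FROM #Orders
    WHERE OrderID IN (SELECT OrderID FROM #Incoming)
)
MERGE TargetScope t
USING #Incoming s
ON s.OrderID = t.OrderID AND s.LineID = t.LineID
WHEN MATCHED THEN
    UPDATE SET Item = s.Item
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderID, LineID, Item) VALUES (s.OrderID, s.LineID, s.Item)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

SELECT * FROM #Orders ORDER BY OrderID, LineID;
```

When run, order 1 should end up matching #Incoming (line 2 removed, line 3 added) while the row for order 2 is left alone.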
Case Study 2: A More Complicated MERGE
Suppose we have the following sample table and data:
CREATE TABLE #ItemTest
(
    ID INT NOT NULL
    ,LineID INT NOT NULL
    ,ProductID INT NULL
    ,PRIMARY KEY(ID, LineID)
);

INSERT INTO #ItemTest (ID, LineID, ProductID)
SELECT 100, 1, 5
UNION ALL SELECT 100, 2, 15
UNION ALL SELECT 100, 3, 8
UNION ALL SELECT 100, 4, 25
UNION ALL SELECT 200, 1, 11
UNION ALL SELECT 200, 2, 100
UNION ALL SELECT 200, 3, 41
UNION ALL SELECT 200, 4, 10
UNION ALL SELECT 200, 5, 5
UNION ALL SELECT 200, 6, 30
UNION ALL SELECT 300, 1, 20;

SELECT * FROM #ItemTest;
From the final SELECT, we see that our data appears as follows:
ID   LineID  ProductID
100  1       5
100  2       15
100  3       8
100  4       25
200  1       11
200  2       100
200  3       41
200  4       10
200  5       5
200  6       30
300  1       20
Notice how the entries for each ID contain a sequentially numbered LineID (1 to 4 for ID=100 and 1 to 6 for ID=200). Our business requirement is that we need to delete some rows and at the same time preserve the row numbering for LineID without introducing any gaps. So for example, if we need to delete LineID=3 from ID=100, we need to renumber LineID=4 for that ID to be LineID=3.
Ignoring for the moment that it’s probably poor application design to have this row renumbering requirement, this can be accomplished with a MERGE. Since it is a bit more complicated, we’ll develop it in a couple of steps to help you understand. First, let’s say we want to delete three rows. We’ll put those into a table variable (a feature introduced in SQL Server 2000).
DECLARE @RowsToDelete TABLE
(
    ID INT
    ,LineID INT
    ,PRIMARY KEY (ID, LineID)
);

INSERT INTO @RowsToDelete (ID, LineID)
SELECT 100, 3
UNION ALL SELECT 200, 2
UNION ALL SELECT 200, 4;
Note how we can create a PRIMARY KEY on a table variable. While not needed in this case, if you had lots of rows it could improve the performance of what we’re about to do.
Now we’ll construct the following query which will require some explanation:
SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID
    ,rn=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))
    ,XX=NULLIF(a.LineID, b.LineID)
FROM #ItemTest a
LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID;
XX is included only to illustrate what NULLIF is doing for us. This produces the following results:
ID   LineID  ProductID  LineID2  rn  XX
100  3       8          3        1   NULL
100  1       5          NULL     2   1
100  2       15         NULL     3   2
100  4       25         NULL     4   4
200  2       100        2        1   NULL
200  4       10         4        2   NULL
200  1       11         NULL     3   1
200  3       41         NULL     4   3
200  5       5          NULL     5   5
200  6       30         NULL     6   6
300  1       20         NULL     1   1
Each row from #ItemTest is returned because it is the left table of the LEFT JOIN. Matching rows from our @RowsToDelete table variable have a value in LineID2, while rows not matched have a value of NULL (exactly how you’d expect the LEFT JOIN to work). The result in XX shows us that when the LineID of #ItemTest matches the LineID of @RowsToDelete, we get a NULL, and in SQL Server NULL values sort first in an ascending ORDER BY. So in each case, the rows we want to delete are sorted to the top of the grouping (on ID).
For the 3 rows in our @RowsToDelete table, we have 1 for ID=100 and 2 for ID=200 (these counts are easy enough to obtain in SQL). So what happens if we subtract that count from rn?
WITH CountItemsToDelete (ID, c) AS
(
    SELECT ID, COUNT(*)
    FROM @RowsToDelete
    GROUP BY ID
)
SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID
    ,[rn-c]=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))-c
FROM #ItemTest a
LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID
JOIN CountItemsToDelete c ON a.ID = c.ID;
The results now appear as:
ID   LineID  ProductID  LineID2  rn-c
100  3       8          3        0
100  1       5          NULL     1
100  2       15         NULL     2
100  4       25         NULL     3
200  2       100        2        -1
200  4       10         4        0
200  1       11         NULL     1
200  3       41         NULL     2
200  5       5          NULL     3
200  6       30         NULL     4
Note how the row for ID=300 has been eliminated by the INNER JOIN to our Common Table Expression (CTE) CountItemsToDelete. Looking at the [rn-c] column, we see that for rows where LineID2 is not NULL, the value is meaningless. But for rows where LineID2 is NULL, [rn-c] is precisely the final row number we’ll need to assign to LineID after deleting the rows we want to delete! Now we have enough information to write this into a MERGE statement:
WITH CountItemsToDelete (ID, c) AS
(
    SELECT ID, COUNT(*)
    FROM @RowsToDelete
    GROUP BY ID
),
SourceItems AS
(
    SELECT a.ID, a.LineID, ProductID, LineID2=b.LineID, c
        ,rn=ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY NULLIF(a.LineID, b.LineID))
    FROM #ItemTest a
    LEFT JOIN @RowsToDelete b ON a.ID = b.ID AND a.LineID = b.LineID
    JOIN CountItemsToDelete c ON a.ID = c.ID
)
-- The target table
MERGE #ItemTest t
-- The source table
USING SourceItems s
-- Matching criteria: lines up rows from SourceItems exactly with rows
-- from our target table (except for ID=300 which is not in the source)
ON t.ID = s.ID AND s.LineID = t.LineID
-- LineID2 is not NULL for rows we need to delete
WHEN MATCHED AND s.LineID2 IS NOT NULL THEN
    DELETE
-- LineID2 is NULL for rows where we've calculated the new line number
WHEN MATCHED AND s.LineID2 IS NULL THEN
    UPDATE SET LineID = rn-c;

SELECT * FROM #ItemTest;
The results shown in the final SELECT clearly indicate that this MERGE query has satisfied our business requirement.
ID   LineID  ProductID
100  1       5
100  2       15
100  3       25
200  1       11
200  2       41
200  3       5
200  4       30
300  1       20
To further improve the performance of the query, you can change the second MATCHED criteria to this, to avoid updating rows where the LineID isn’t changing.
WHEN MATCHED AND s.LineID2 IS NULL AND t.LineID <> rn-c THEN
To do this otherwise in SQL you would first need to DELETE the rows you want to delete, and then run a separate UPDATE to correct the row numbers that need correcting.
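For comparison, the two-statement alternative might look something like the sketch below. It is only one way of writing it and is not code from the original case study: the table name #ItemTest2 and the Renumbered CTE are hypothetical, and the sample data is a cut-down copy of the rows used above.

```sql
-- Hypothetical cut-down copy of the sample data
CREATE TABLE #ItemTest2
(
    ID INT NOT NULL
    ,LineID INT NOT NULL
    ,ProductID INT NULL
    ,PRIMARY KEY(ID, LineID)
);
INSERT INTO #ItemTest2 (ID, LineID, ProductID)
VALUES (100, 1, 5),(100, 2, 15),(100, 3, 8),(100, 4, 25),(300, 1, 20);

DECLARE @RowsToDelete TABLE
(
    ID INT
    ,LineID INT
    ,PRIMARY KEY (ID, LineID)
);
INSERT INTO @RowsToDelete (ID, LineID)
VALUES (100, 3);

-- Statement 1: remove the unwanted rows
DELETE t
FROM #ItemTest2 t
JOIN @RowsToDelete d ON d.ID = t.ID AND d.LineID = t.LineID;

-- Statement 2: close the gaps by renumbering within each ID,
-- touching only the rows whose LineID actually changes
WITH Renumbered AS
(
    SELECT LineID
        ,rn=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY LineID)
    FROM #ItemTest2
)
UPDATE Renumbered
SET LineID = rn
WHERE LineID <> rn;

SELECT * FROM #ItemTest2 ORDER BY ID, LineID;
```

The table is visited twice here (once by each statement), which is the extra work the single MERGE is intended to avoid.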
Summary of our Learning
Today we have learned about the MERGE query and how it can be used to replace multiple queries in the case of:
- UPDATE/INSERT (the “Upsert”)
- DELETE/UPDATE
I encourage you to also read the linked article about hazards present if you utilize the full capability of the MERGE (WHEN NOT MATCHED BY SOURCE THEN DELETE). The article shows a simple way of limiting the scope of the DELETE to avoid the hazard.
We also learned about SQL table variables, which are very handy under certain circumstances. They do have their drawbacks though, and some day we may blog on the differences between them and temporary tables.
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved
Manipulating Dates and Times in T-SQL
In SQL Server 2008, Microsoft introduced some new date and time data types to augment the options available in prior versions. The full list of these data types with detailed explanations can be found in Microsoft Books Online (BOL), but we’ll list them here with a very brief description.
- DATETIME – This is the standard and probably most commonly used type that’s been available in T-SQL since its early days, with a range of 1753-01-01 through 9999-12-31 and accuracy of about 3 milliseconds. Note that if you cast (implicitly or explicitly) an INT value of 0 to this data type, the result will be 1900-01-01.
- DATE – This is a new type that is accurate to the day only (no time component) and has a range of 0001-01-01 through 9999-12-31.
- DATETIME2 – This is a higher-accuracy DATE + TIME type that is accurate to 100 nanoseconds (or .0000001 seconds) and has a range of 0001-01-01 through 9999-12-31.
- DATETIMEOFFSET – This is a DATE + TIME type that includes the UTC time zone offset with varying degrees of accuracy (you can specify) and has a range of 0001-01-01 through 9999-12-31.
- SMALLDATETIME – This is another DATE + TIME type that has an accuracy of one minute (no seconds) and a date range of 1900-01-01 through 2079-06-06.
- TIME – This is a TIME-only type that is accurate to 100 nanoseconds and has a range of 00:00:00.0000000 through 23:59:59.9999999.
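As a quick way to see these differences in accuracy side by side, the following sketch casts one high-precision value to each of the types. The variable name @dt and the sample value are assumptions made purely for this illustration.

```sql
-- Hypothetical sample value carrying the full 100ns precision
DECLARE @dt DATETIME2(7) = '2014-03-05 11:43:53.1234567';

SELECT AsDateTime2       = @dt
    ,AsDateTime          = CAST(@dt AS DATETIME)        -- rounded to about 3 milliseconds
    ,AsSmallDateTime     = CAST(@dt AS SMALLDATETIME)   -- rounded to the minute
    ,AsDate              = CAST(@dt AS DATE)            -- time part dropped
    ,AsTime              = CAST(@dt AS TIME)            -- date part dropped
    ,AsDateTimeOffset    = CAST(@dt AS DATETIMEOFFSET); -- offset defaults to +00:00

-- The base date that an INT value of 0 converts to
SELECT BaseDate = CAST(0 AS DATETIME);                  -- 1900-01-01 00:00:00.000
```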
This blog is less about the date and time data types and really about the different ways they can be manipulated. We’re not talking about formatting here (for that you’d use CONVERT); instead we’re talking about how to do date arithmetic and the functions that SQL provides you to do so. Normally date formatting should be done in an application’s front end, but it is often quite useful and necessary to do date calculations in the T-SQL back end.
Simple Date Arithmetic
If you have a DATETIME column, or perhaps you’re using T-SQL’s GETDATE() built-in function, adding a fixed number of days is very simple:
SELECT GETDATE(), GETDATE()-1, GETDATE()+1;

-- Results:
2014-03-05 11:29:37.087   2014-03-04 11:29:37.087   2014-03-06 11:29:37.087
Often this approach can be faster than using the T-SQL built in function for adding dates (to be discussed in a minute).
Unfortunately, this doesn’t work well with any of the other date and time data types except for SMALLDATETIME:
SELECT CAST(GETDATE() AS DATE)+1;
GO
SELECT CAST(GETDATE() AS DATETIME2)+1;
GO
SELECT CAST(GETDATE() AS DATETIMEOFFSET)+1;
GO
SELECT CAST(GETDATE() AS TIME)+1;
GO
-- Errors returned:
Operand type clash: date is incompatible with int
Operand type clash: datetime2 is incompatible with int
Operand type clash: datetimeoffset is incompatible with int
Operand type clash: time is incompatible with int
The same link provided above for CONVERT describes CAST. There are those that like to ignore the myth of SQL code compatibility and recommend that instead of using GETDATE() you use CURRENT_TIMESTAMP (the ANSI standard function that returns DATETIME), but I am not one of them.
If you need to do any other date arithmetic, SQL provides a built in function called DATEADD. It can be used to add a fixed number of days, hours, seconds, months, etc. to any date/time data type (although you will get an error adding days, weeks, months, etc. to a TIME data type). The first argument to DATEADD tells the function what unit you want to add, while the second specifies the number of units. The last argument is the date/time value you want to add those units to. So we can fix our code above to add one day to each of the supported date/time types.
SELECT GETDATE();
SELECT DATEADD(day, 1, CAST(GETDATE() AS DATE));
SELECT DATEADD(day, 1, CAST(GETDATE() AS DATETIME2));
SELECT DATEADD(day, 1, CAST(GETDATE() AS DATETIMEOFFSET));
SELECT CAST(GETDATE() AS SMALLDATETIME)+1;
SELECT DATEADD(hour, 1, CAST(GETDATE() AS TIME));

-- Results:
2014-03-05 11:43:53.117
2014-03-06
2014-03-06 11:43:53.1170000
2014-03-06 11:43:53.1170000 +00:00
2014-03-06 11:46:00
12:43:53.1170000
Those results also clearly demonstrate the accuracy of each of the data types.
Another extremely useful function for doing date arithmetic is DATEDIFF, which is used to calculate the difference between two dates (or times) in whole units as specified by its first argument. Let’s take a look at an example.
SELECT DATEDIFF(day, '2014-04-15', '2014-04-17'); -- Results: 2
The result is negative if the left date is greater than the right date. The first argument to DATEDIFF is the same set of units you can specify to DATEADD.
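One detail worth seeing in action is what “whole units” means: DATEDIFF counts the number of datepart boundaries crossed between the two values rather than the elapsed time. The literal values in this short sketch were chosen only to illustrate that point.

```sql
-- Only one second apart, but a year boundary is crossed
SELECT DATEDIFF(year, '2013-12-31 23:59:59', '2014-01-01 00:00:00');   -- 1

-- Two minutes apart, but an hour boundary is crossed
SELECT DATEDIFF(hour, '2014-04-15 10:59', '2014-04-15 11:01');         -- 1

-- Nearly a full day apart, but no day boundary is crossed
SELECT DATEDIFF(day, '2014-04-15 00:01', '2014-04-15 23:59');          -- 0

-- Left date greater than the right date gives a negative result
SELECT DATEDIFF(day, '2014-04-17', '2014-04-15');                      -- -2
```

This boundary-counting behavior is what allows DATEDIFF to be paired with DATEADD for truncation.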
Date Truncation
In an earlier blog on Tally Tables and another one on Calendar Tables, we’ve seen that DATEADD and DATEDIFF can be combined to perform date truncation on a date part boundary. Now we’ll explain exactly how that works. Let’s take a look at the T-SQL for the most common case (truncation to the day):
SELECT GETDATE();
-- Take the days difference between today's date and 1900-01-01
SELECT DATEDIFF(day, 0, GETDATE());
-- Add back the days difference to 1900-01-01
SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);

-- Results:
2014-03-05 12:02:51.870
41701
2014-03-05 00:00:00.000
If today’s date is 2014-03-05, the number of days since 1900-01-01 (=0 remember that this is the base date when 0 is cast to DATETIME) is 41701. We can add back that number of days to 1900-01-01 and get exactly today’s date without the time part.
Likewise, we can truncate to the minute, hour, month or year simply by specifying a different first argument to both functions:
SELECT GETDATE();
SELECT DATEADD(minute, DATEDIFF(minute, 0, GETDATE()), 0);
SELECT DATEADD(hour, DATEDIFF(hour, 0, GETDATE()), 0);
SELECT DATEADD(month, DATEDIFF(month, 0, GETDATE()), 0);
SELECT DATEADD(year, DATEDIFF(year, 0, GETDATE()), 0);

-- Results:
2014-03-05 12:08:51.573
2014-03-05 12:08:00.000
2014-03-05 12:00:00.000
2014-03-01 00:00:00.000
2014-01-01 00:00:00.000
You will run into an error however if you try it to the second:
The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.
But you can work around this by specifying a later offset date. DATEDIFF returns an INT, which can hold roughly 68 years’ worth of seconds, so 2010-01-01 should work for decades to come:
SELECT DATEADD(second, DATEDIFF(second, '2010-01-01', GETDATE()), '2010-01-01'); -- Results: 2014-03-05 12:08:51.000
So let’s try a quick exercise to demonstrate our new found skill with date arithmetic. Try to solve it before you look at the solution. How would you truncate a DATETIME to yesterday at 18:00:00.000?
SELECT GETDATE();
SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE())-1, '18:00');

-- Results:
2014-03-05 12:17:36.210
2014-03-04 18:00:00.000
In this case, the time value (18:00) we specified at the end is up-cast to DATETIME 1900-01-01 18:00:00.000 and that is what the days offset (from 1900-01-01) is added back to.
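The same pattern extends to other boundaries. The sketch below uses a fixed value in a variable (the name @d is just an assumption for this example) so that the results are repeatable; it is meant only to illustrate the technique and is not taken from the linked article.

```sql
DECLARE @d DATETIME = '2014-03-05 12:17:36.210';

SELECT FirstOfMonth      = DATEADD(month, DATEDIFF(month, 0, @d), 0)          -- 2014-03-01
    ,FirstOfNextMonth    = DATEADD(month, DATEDIFF(month, 0, @d) + 1, 0)      -- 2014-04-01
    ,LastOfMonth         = DATEADD(month, DATEDIFF(month, 0, @d) + 1, 0) - 1  -- 2014-03-31
    ,FirstOfPriorMonth   = DATEADD(month, DATEDIFF(month, 0, @d) - 1, 0)      -- 2014-02-01
    ,TodayAt1800         = DATEADD(day, DATEDIFF(day, 0, @d), '18:00');       -- 2014-03-05 18:00
```

In each case we adjust the number of whole units before adding it back to the base date, or we add it back to a base value that already carries the time we want.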
More of these really neat and useful but simple date arithmetic examples can be found in this blog by Lynn Pettis, who I like to think of as the guru of date manipulations: Some Common Date Routines.
Combining Date Components
In a database, dates should always be stored as dates (DATETIME) and not as character strings or their individual date parts. Unfortunately, not everybody realizes this, and some make the mistake of storing dates as the individual parts. Let’s return now to an example of some T-SQL from our Calendar Tables blog. We’ll assume you still have access to the auxiliary Calendar table we created in that blog.
SELECT [Date], [Year], [Month], [Day]
FROM dbo.Calendar
WHERE [Date] >= '2013-03-01' AND [Date] < '2013-03-05';

-- Results:
Date                     Year  Month  Day
2013-03-01 00:00:00.000  2013  3      1
2013-03-02 00:00:00.000  2013  3      2
2013-03-03 00:00:00.000  2013  3      3
2013-03-04 00:00:00.000  2013  3      4
Using our newly found knowledge of date arithmetic and the T-SQL built in functions to handle them, we can easily re-assemble the Year, Month and Day columns in this result to be a DATE or DATETIME.
SELECT [Date], [Year], [Month], [Day]
    ,[AsDATETIME]= [Day]+DATEADD(month,[Month]-1,DATEADD(year,[Year]-1900,0))-1
    ,[AsDATE] = CAST([Day]+DATEADD(month,[Month]-1,DATEADD(year,[Year]-1900,0))-1 AS DATE)
FROM dbo.Calendar
WHERE [Date] >= '2013-03-01' AND [Date] < '2013-03-05';

-- Results:
Date                     Year  Month  Day  AsDATETIME               AsDATE
2013-03-01 00:00:00.000  2013  3      1    2013-03-01 00:00:00.000  2013-03-01
2013-03-02 00:00:00.000  2013  3      2    2013-03-02 00:00:00.000  2013-03-02
2013-03-03 00:00:00.000  2013  3      3    2013-03-03 00:00:00.000  2013-03-03
2013-03-04 00:00:00.000  2013  3      4    2013-03-04 00:00:00.000  2013-03-04
As you can see, we’ve subtracted the base year (1900) from Year, adding that number of years back to the base year (0=1900-01-01), then added Month-1 months to that and finally one less than Day to that. Our results are just what we need and we achieved them by using just a little simple date arithmetic. This will usually be faster than converting to a character string and then manipulating that, ultimately CASTing it to the desired type.
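For readers on SQL Server 2012 or later, there is also a built-in function, DATEFROMPARTS, that assembles a DATE directly. The sketch below puts it next to the arithmetic approach and a character-string approach so the three can be compared; the variable names are assumptions for this illustration, and no claim is made here about which is fastest.

```sql
DECLARE @Year INT = 2013, @Month INT = 3, @Day INT = 4;

SELECT ViaArithmetic = @Day + DATEADD(month, @Month - 1, DATEADD(year, @Year - 1900, 0)) - 1
    -- Requires SQL Server 2012 or later
    ,ViaFromParts    = DATEFROMPARTS(@Year, @Month, @Day)
    -- Builds the string '20130304' and converts that
    ,ViaString       = CAST(CAST(@Year * 10000 + @Month * 100 + @Day AS CHAR(8)) AS DATE);
```

All three should return 2013-03-04 here. One difference to be aware of is how invalid parts are handled: with a month of 2 and a day of 31, the arithmetic version quietly rolls over into March, whereas DATEFROMPARTS raises an error.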
Casting DATETIMEs to Character Strings
If you’ve never tried to CAST a DATETIME to a character string, you might be a little surprised by the result you get.
SELECT GETDATE();
SELECT CAST(GETDATE() AS VARCHAR(100));

-- Results:
2014-03-05 14:13:42.760
Mar 5 2014 2:13PM
Why Microsoft chose this particular result is probably a mystery to all except them. But knowing this behavior does offer the occasional opportunity for CASTing in the other direction. Consider these cases:
SELECT CAST('Mar 01 2013' AS DATETIME);
SELECT CAST('Mar 01 2013 15:00:03' AS DATETIME);
SELECT CAST('Mar 01 2013 2:05PM' AS DATETIME);

-- Results:
2013-03-01 00:00:00.000
2013-03-01 15:00:03.000
2013-03-01 14:05:00.000
Another somewhat interesting case is when you try casting a character string that T-SQL recognizes as a valid year.
SELECT CAST('2013' AS DATETIME);
SELECT CAST('201303' AS DATETIME);
SELECT CAST('2013-01' AS DATETIME);
-- Results:
2013-01-01 00:00:00.000
The conversion of a varchar data type to a datetime data type resulted in an out-of-range value.
Conversion failed when converting date and/or time from character string.
While it works with a year, when the character string looks like a combination of year and month, it results in one of the two error messages shown.
On the other hand, these two cases produce exactly what you’d expect under the default (us_english) language and date format settings, which is the DATETIME value 2013-01-02 00:00:00.000, regardless of whether the CAST is explicit or implicit.
SELECT CAST('2013-01-02' AS DATETIME); SELECT CAST('20130102' AS DATETIME);
To CAST these to any of the other date/time data types, you must use an explicit CAST. An implicit CAST will always result in a DATETIME.
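One caveat is worth a short demonstration: when casting to DATETIME, the hyphenated form is interpreted according to the session’s date format, while the unseparated YYYYMMDD form is not. The sketch below changes the setting for the current session only and then restores the us_english default.

```sql
SET DATEFORMAT dmy;

SELECT CAST('2013-01-02' AS DATETIME);  -- read as year-day-month: 2013-02-01 00:00:00.000
SELECT CAST('20130102'   AS DATETIME);  -- unaffected:             2013-01-02 00:00:00.000
SELECT CAST('2013-01-02' AS DATE);      -- unaffected:             2013-01-02

SET DATEFORMAT mdy;  -- restore the us_english default
```

If your code may run under different language settings, the YYYYMMDD form is the safer choice for DATETIME literals.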
Let’s also not forget the other CAST we learned from our exercise:
SELECT CAST('15:00' AS DATETIME);

-- Results:
1900-01-01 15:00:00.000
Summary
In this blog we’ve learned about the T-SQL data types that support dates and times, including the range and accuracy of each.
We have learned how to perform some simple date arithmetic and apply that knowledge to truncating a DATETIME and reassembling the date parts into a DATETIME.
Finally, we’ve learned a little about the results of casting a DATETIME to a character string and vice versa.
All of these date manipulation constructs can be considered T-SQL best practices if you need to use them. In general date arithmetic will be faster than extracting the parts as character strings, manipulating those character strings and then reassembling them and CASTing back to the date/time data type you need.
There are many other date functions offered by T-SQL and you can read about them in BOL. If you go back to the Calendar Tables blog (linked above), you can now try to understand how the GenerateCalendar function we provided there actually works.
I hope you enjoyed this introduction to manipulating dates and times in T-SQL.
Make it Work, Make it Fast, Make it Pretty
When I first heard this, it struck me as being a remarkably concise wisdom applicable to virtually any programming task. The entire quotation is actually:
“Make it work, make it fast, then make it pretty… and it isn’t finished until it is pretty!”
— SQL MVP Jeff Moden (RedGate’s 2011 Exceptional DBA of the Year)
In case you don’t know what an MVP is, it stands for Most Valuable Professional, and it is an award that Microsoft confers only on the best of the best in their Microsoft-focused technical skills.
Throughout the course of this article I will ask the reader a series of questions. Each question is designed as a thought question. When you encounter a question, you should mentally form a picture in your head of what your answer is. After the question I’ll provide you with my take on the answer. If your answers are not the same as my answers, you should take the time to reflect on why they are different.
Taking pride in your work is a very important thing. We all like to think we’re good at our jobs. Could you be better at your job? The answer is most certainly, because all of us could always be better. This article offers a philosophy by which any programmer can improve on their ability to perform their job. I sincerely hope that you can learn something from it.
Since the quotation on which this article is based was said by a really talented SQL DBA, towards the end of this article we’ll provide a code example in SQL. We’ll provide a good and bad example. Once again, you should reflect on this example and understand why it is bad or why it is good.
In case you were wondering, I have seen some of Jeff Moden’s work. I can assure you that he lives this philosophy every day. I have seen him produce some of the most remarkably ingenious and high-performing solutions to SQL problems, and in the end he always makes the code pretty. He will probably not be pleased with me for writing this article because he is also very modest about his accomplishments. That my dear reader is how you become a Microsoft MVP!
Make it Work
When you are writing SQL or coding in any programming language for that matter, the most important thing to ensure is that the code you have written meets the functional requirements. This means that you, as a Developer, must not only write the code but you must also thoroughly unit test it. That’s right, testing is not just for Testers to do!
To thoroughly unit test a piece of code, you must consider more than what happens when a business user runs your code the way you expect. You must also take into consideration unexpected actions of those same users, or in the case of SQL queries, unexpected bad data in the tables you’re going against.
A good Developer will also take the time to document the testing that he or she performed. This documentation, if available, may help the testing team to identify other potential cases where the code that was written may fail to perform to expectations.
So as a Developer, do you believe that you perform sufficient unit testing so that your testing team finds no issues when they perform their testing? The answer is probably not, but that is the goal every Developer should focus on if they want to be considered “better” at their jobs.
Make it Fast
All SQL is not alike. I can probably write at least three to four queries that will satisfy any specific business requirement. The key is to select the query that will run the fastest, and there is some effort involved in making that happen.
Here’s another thought question for you. Has anyone ever told you that a query you have written runs too fast? Chances are, the answer to that question is no. Given the choice between two queries that return equivalent results, business users would almost always choose to use the query that runs the fastest. Performance counts! When I asked the first question to a good friend of mine by the name of Chris Morris (who is an expert at T-SQL), he had this story to relate:
“Some guy who’s been working at the same shop for three or four years without opening a book or visiting a forum but thinks he’s a T-SQL hero has a query that he swears can’t be made to run any faster. It takes twenty minutes to run. You rewrite it so that it works in a completely different way – which he hasn’t been exposed to – and it runs in two seconds. The first thing he’ll say is ‘That’s far too fast – it can’t possibly be doing all that work in so little time.’ Of course, the catch is that it isn’t doing all that work.”
I found that last bit quite humorous but also absolutely true. Two queries that return the same results do not have to perform the same amount of work. The one that does the least work is most likely to perform better.
There are some people who insist that, in order to write fast-performing queries, you must be an expert in understanding the query’s execution plan. While that can certainly be helpful, not many are true experts in interpreting a query’s execution plan. In fact, the execution plan can oftentimes be misleading, specifically when comparing the “cost” of two queries. I’ve seen cases where two queries are rated as having a cost of 0% and 100%, yet the second query is much faster than the first.
To write faster queries, you don’t need to be an expert at reading a query’s execution plan but you do need to keep in mind a few very basic fundamentals.
- Avoid CURSORs as they can be really slow in T-SQL. 99.99% of the time you can construct an alternate, set-based approach that will perform much faster than a CURSOR. The best favor you can do for yourself to improve the performance of your SQL is to forget the syntax for a CURSOR, or better yet completely forget that they exist.
- Your query should only touch the rows it needs to touch and it should try to touch those rows only once. If it can touch only the entry in an index instead of the row, that is even better.
- “Stop thinking about what you want to do to a row. Think instead about what you want to do to a column.” – This is another quote from Jeff Moden.
- The only way to judge whether a query is fast or not is to have something to compare it against, like another query that returns exactly the same results.
- I like to use what I call “code patterns” as a guideline to writing high performance SQL. In essence what this means is to know in advance the fastest performing methods for solving a particular problem and use that code pattern as the basis for the query that I am writing.
- Add to your testing regimen the one million row test harness (look for this in a future blog). Write every query as if you expect it to be run one million times per day against one million rows of data.
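To make the last two points a little more concrete, here is a rough sketch of what a one million row test harness could look like. It is only an assumption about the general shape of such a harness and not the one promised for a future blog: the #BigTest table, the Tally CTE and the two candidate queries are all hypothetical.

```sql
-- Hypothetical test table
CREATE TABLE #BigTest
(
    ID INT NOT NULL PRIMARY KEY
    ,Amount INT NOT NULL
);

-- Generate one million rows by cross joining a system view with itself
WITH Tally (n) AS
(
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a
    CROSS JOIN sys.all_columns b
)
INSERT INTO #BigTest (ID, Amount)
SELECT n
    -- Pseudo-random value from 0 to 999
    ,ABS(CHECKSUM(NEWID()) % 1000)
FROM Tally;

-- Report CPU and elapsed time for each statement that follows
SET STATISTICS TIME ON;

-- Candidate 1
SELECT Amount, COUNT(*)
FROM #BigTest
GROUP BY Amount;

-- Candidate 2: returns the same results in a different way
SELECT DISTINCT Amount, COUNT(*) OVER (PARTITION BY Amount)
FROM #BigTest;

SET STATISTICS TIME OFF;
```

The timings appear on the Messages tab in SQL Server Management Studio. Running each candidate several times and comparing the CPU and elapsed times gives you the “something to compare it against” mentioned above.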
I’ll provide more guidance in these areas in future blogs, as most of my interest is in writing high performance SQL.
Keep in mind that I only obtain very basic information from a SQL query’s execution plan. I mostly rely heavily on memorizing the fastest performing code patterns and using them when they are appropriate. That and keeping to the other bullet points I listed above.
Why should I take the time to try writing a faster performing query, when I can rely on SQL Server’s Database Engine Tuning Advisor (DTA) to suggest an INDEX that will make my query perform faster? The answer to that is that INDEXes create overhead that slows down INSERTs, UPDATEs, DELETEs and MERGEs. Too much INDEXing in a database can drag down its overall performance way more than the little boost it gives to poorly written queries that perform inadequately.
The other thing that people usually have to say about this is that they don’t have the time it takes to generate more than one query form to solve a particular requirement because of time pressures and deadlines. Once again, memorizing the best practice code patterns can help you here. Once you learn them, and you can code them almost in your sleep, you’ll be able to generate multiple solutions to the same problem very quickly. Then, you just need to create the one million row test harness (also very easy to do) to prove which is fastest.
Make it Pretty
Let’s look at a query that is included in my earlier blog on Calendar Tables.
select [Date] from (select [Date], [WkDName2], FromDate, rn=row_number() over (partition by [YYYYMM] order by [Date] desc) from dbo.Calendar a left join dbo.Holidays b on a.[DATE] between FromDate AND isnull(ToDate, FromDate) where [Date] between '2014-01-01' and '2014-12-31' and [Last] = 1 and [WkDName2] not in('SA', 'SU') and FromDate IS null) a where rn=1
Now compare this query against the actual query as published in that blog.
SELECT [Date]
FROM
(
    SELECT [Date], [WkDName2], FromDate
        ,rn=ROW_NUMBER() OVER (PARTITION BY [YYYYMM] ORDER BY [Date] DESC)
    FROM dbo.Calendar a
    LEFT JOIN dbo.Holidays b ON a.[DATE] BETWEEN FromDate AND ISNULL(ToDate, FromDate)
    WHERE [Date] BETWEEN '2014-01-01' AND '2014-12-31' AND
        [Last] = 1 AND
        -- Remove Saturdays and Sundays
        [WkDName2] NOT IN('SA', 'SU') AND
        -- Remove holidays
        FromDate IS NULL
) a
WHERE rn=1;
Both of these queries are exactly the same and would produce the same results. One of the things Developers may often forget is that whatever language you are writing in, it is unlikely that no one will ever look at your code again. Maintenance of applications is a fact of life, so the “make it pretty” step is invaluable to those poor folks that come along afterwards and need to maintain your code. As a Developer have you ever said “gee I really don’t want to try to figure out what that person before me did, so I’ll just rewrite it rather than modify it?” The answer to this question is probably yes, and the reason for that is most likely that the original Developer didn’t take the time to “make it pretty.”
So what is wrong with the first query, when compared to the second? What makes it harder to maintain?
- The first of the two can be called “stream of consciousness” coding, or to put it another way “write-only SQL.” I call it write-only SQL because you can’t easily read it. In other words, whoever comes along and looks at it later is going to have a difficult time trying to understand what it does.
- The first query has no line breaks or indentation to assist the reader in identifying the individual steps that are being performed. For example, can you easily see that it has a derived table embedded in it? The answer is probably not.
- Personally, I like to see language keywords in upper case. In the first query, none of the keywords are in upper case.
- The first query has no comments in it. The second query could probably be improved by placing a comment before it that says “Calculate Aware pay days based on the last work day of the month.” To make it pretty, you should always include some comments to assist the reader who will come along after you to maintain it.
- The first query does not end in a semicolon. Why is that important, you might ask? SQL Server does not require a semicolon to terminate every SQL statement, but there are cases where it does require one (e.g., the MERGE statement), and there are also cases like Common Table Expressions (CTEs) where T-SQL requires that the statement just prior to the CTE be terminated with a semicolon. There may come a day when T-SQL requires semicolon terminators on every statement, so why not plan ahead for that day now and end all of your queries with a semicolon? Did you know that terminating statements with a semicolon is part of the ANSI SQL standard?
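To illustrate the semicolon rule from the last point, here is a minimal sketch of a statement followed by a CTE. The variable @Cutoff, the CTE name SampleDates and the inline sample dates are all made up for this illustration; they are not part of the pay day query above.

```sql
-- Hypothetical variable. The terminating semicolon is required here
-- because the next statement begins with WITH (a CTE).
DECLARE @Cutoff DATE = '2014-06-30';

WITH SampleDates AS
(
    -- Inline sample data standing in for a real calendar table
    SELECT CAST(d AS DATE) AS [Date]
    FROM (VALUES ('2014-05-30'), ('2014-06-30'), ('2014-07-31')) v(d)
)
SELECT [Date]
FROM SampleDates
WHERE [Date] <= @Cutoff;
```

If you remove the semicolon at the end of the DECLARE, SQL Server should reject the batch with a syntax error at the WITH keyword. Terminating every statement as a habit means you never have to think about which statement happens to precede a CTE.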
Every programming language (T-SQL included) has some standard methodology suggested for indentation, designed to make the code more readable. I’m not going to sit here and tell you that mine is the “best.” I will tell you that you should take a moment to think about how to indent your queries so that they end up being easy on the eyes of the readers that come along later. And then, once you have a methodology you are comfortable with, stick to it. Apply it to every query that you write without exception.
Once you get the hang of this, you’ll probably find yourself indenting complex queries according to your methodology as you code them. That’s great because it saves you some cleanup work later on. It will probably also help you as you unit test the query: if the query is complex (consisting of many steps), it is easier to run one step at a time, looking at intermediate results as necessary.
Conclusion
I can promise you that when you read my blogs, I’ve taken the time to make sure that each query I publish works, that it is the fastest it can possibly be and that it is (reasonably) pretty. There will be cases where I compare the performance of various code patterns that can be used to arrive at the same solution, particularly if one or the other won’t run in earlier versions of SQL Server.
Above I had some disparaging words to say about both CURSORs and INDEXes. My advice on CURSORs stands – avoid them at all costs and most of the time you can. Possibly one of the reasons that some people use CURSORs in T-SQL is because they’ve become accustomed to using them in Oracle SQL, where they perform much better.
INDEXes in general should be used sparingly, and when you need to INDEX a table (other than the PRIMARY KEY of course) it should be as a result of the pre-planning that you’ve done based on knowing the kinds of data retrieval operations you’ll be performing, rather than as an afterthought because some tool like DTA suggests that you do so.
As a quick example of this, there is a SQL concept known as relational division. All of the code patterns for retrieval in the case of relational division are highly dependent for their performance on the INDEXes that are created for the table. This is something we’ll explore in a future blog.
I will also be blogging on various common business problems that you can solve in SQL, and provide you with the best practice code pattern to ensure that the solutions I provide perform the best that they can. Oftentimes, if you already know the alternatives, you’ll immediately jump to the fastest solution available and that may allow you to skip the one million row test. Really the only time that you should skip that test is when you are 100% sure you’re using the best practice code pattern for performance.
In the end, you should never be satisfied that your code is “good enough” when with just a little bit of study and that little bit of extra effort it can be the “best that it can be.”
Follow me on Twitter: @DwainCSQL
Copyright © Dwain Camps 2014 All Rights Reserved