Writing Great T-SQL
Today we’re going to try to extend some advice that I once heard from SQL MVP Jeff Moden, which I wrote about in my previous blog entitled “Make it Work, Make it Fast, Make it Pretty.” That advice was just as pertinent then as it is now, but perhaps I can add to it what I think makes for really great T-SQL code.
Great T-SQL code is going to start with all of the advice that I gave in that blog. That same advice really applies to any coding language, not just T-SQL.
But great T-SQL code is actually much more than that. Let’s look at some of the aspects that transcend what I’ve previously written.
Elegance of a Set-based Algorithm
Obviously in our T-SQL we’d like to use predominantly set-based algorithms. Depending on the problem being addressed, oftentimes there are just a few very elegant solutions, and probably a bunch of solutions that are not nearly as elegant.
In simplicity there is often elegance. Simplicity, elegance and speed may unfortunately not all play well together. But if you can come up with a solution that is elegant, simple and fast, that’s usually going to get you a lot closer to writing great T-SQL.
From Facts and Fallacies of Software Engineering by Robert L. Glass, here is Fact #18: “There are two “rules of three” in reuse: (a) It is three times as difficult to build reusable components as single use components, and (b) a reusable component should be tried out in three different applications before it will be sufficiently general to accept into a reuse library.”
I will suggest that in order to be great, T-SQL code must be reusable. I’ve written before that well-focused code that solves only a specific problem is probably going to be a whole lot faster than more generalized code that solves a wider variety of problems. So that means it should be focused on solving a common problem that occurs frequently, so that it can be reused (as a tool) whenever the need for it arises.
I tend to agree with Mr. Glass in this case; finding a common problem and then building a fast-performing, simple and elegant T-SQL solution is probably a lot more challenging than building any other one-off query. Part of this challenge of course is simply finding a suitably simple problem to focus on, which is also encountered commonly enough to want to build something that’s reusable.
Frequency and Quality of its Reuse
Since beginning my study of T-SQL, and of course I remain a student of the subject to this day, I’ve seen a fair number of really elegant solutions to specific problems. Of the ones that can be classified as reusable, I’ve seen many that solve well a somewhat common problem, but unfortunately the problem itself is not so common that the solution will be reused often. Let’s face it, maybe you’ve got a solution for Gaps and Islands that is outrageously wonderful, simple, fast and elegant. But Gaps and Islands isn’t a particularly common problem to encounter. Sure you see it once in a while, and certainly it is good to have an elegant solution just waiting in the wings for when you need it. But commonality of the problem is a great influencer, at least to me, of what makes for great T-SQL.
Quality of reuse is a different story entirely. So let’s say you’ve got this (arguably) great solution and it sees quite a fair bit of reuse. Who are the folks that are reusing it? Are they peers in the SQL community, are they SQL Server MVPs who have embraced your method thus utilizing it themselves? Or is it the great unwashed masses?
Facebook is a great application because millions of people use it. But who are those users? I’d say it is a much higher quality pattern of reuse, to have a group of peers or those that are more skilled than you are, using your reusable code, rather than just anybody else in the world. So while Facebook may be a great app, it may not represent great code! The quality of the people reusing your code counts for something.
Most great T-SQL code should enhance your productivity. By that I mean, if you’re a developer it can be considered a tool that allows you to avoid spending awkward moments of your development time solving silly sub-problems, when really you need to be focusing on solving the bigger problems that relate to the business solutions you’re trying to produce. As a DBA, your productivity can be enhanced by any T-SQL script that automates some activity that allows you to be more proactive in your database monitoring activities.
Of course, just because you can solve a problem more quickly because the sub-problems have already been addressed doesn’t mean you should sacrifice performance. But since great T-SQL should already be pretty fast, you may find that using the tool makes the solution to your bigger problem fast enough! Ultimately you may have to spend some time to make it even faster, but at least you’ve gotten through the first hurdle of a development project, which is to just get it working.
Any great T-SQL solution is one that can be encapsulated so that it can be used as a “black box,” without users (or, more importantly, maintainers) of the calling T-SQL needing to worry about how it works.
Yes, they should be able to look under the covers, so that when need be they can figure out how it works. But mostly the encapsulation simply makes a great piece of T-SQL, which may otherwise be elegant or complex, really, really easy to use.
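As a minimal sketch of that kind of encapsulation, the inline table-valued function below hides a small number generator behind a simple call. The name dbo.NumbersSketch and its parameter are assumptions made up for this illustration, and GO is the batch separator understood by tools such as SSMS and sqlcmd rather than a T-SQL statement.

```sql
-- Hypothetical helper (name and parameter are illustrative assumptions):
-- returns the integers 1 through @MaxN (up to 10,000) as a set.
CREATE FUNCTION dbo.NumbersSketch (@MaxN INT)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
    WITH E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)), -- 10 rows
         E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),                                  -- 100 rows
         E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)                                   -- 10,000 rows
    SELECT TOP (ISNULL(@MaxN, 0))
           N = CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT)
    FROM E4;
GO

-- The caller treats the function as a black box:
-- list every date in January 2014 without knowing how the numbers are produced.
SELECT TheDate = DATEADD(DAY, N - 1, CAST('2014-01-01' AS DATE))
FROM dbo.NumbersSketch(31);
GO

-- Clean up the illustrative object.
DROP FUNCTION dbo.NumbersSketch;
```

Anyone maintaining the calling query only needs to know what the function returns; the cascading CTEs stay under the covers until someone needs to look.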
My Vote for the Greatest Piece of T-SQL
Arguably the greatest piece of T-SQL code that I’ve ever seen written is the DelimitedSplit8K string splitter. While Jeff Moden probably wasn’t the originator of the concept of splitting a string, his first attempt at writing a fast string splitter (Tally OH! An Improved SQL 8K “CSV Splitter” Function) was pretty awesome. I’d say that it clearly meets all of the criteria I’ve mentioned so far in this blog, and with more than 50,000 views of that article as of this writing I’d imagine there’s more than just a few people that agree with me.
Jeff really did a great job of setting up a proper test harness and proving the speed potential of the algorithm, so whether or not he claims to be the author of this wonderful little function, he’ll be the one everyone remembers once all of the dust has settled.
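To give a feel for the general technique, here is a deliberately simplified sketch of a tally-based split. It is not the published DelimitedSplit8K function and has had none of that function's tuning or testing; the variable names and the sample string are assumptions chosen only for illustration.

```sql
DECLARE @List  VARCHAR(8000) = 'alpha,beta,gamma',
        @Delim CHAR(1)       = ',';

WITH E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)), -- 10 rows
     E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),                                  -- 100 rows
     E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),                                  -- 10,000 rows
     Tally(N) AS
     (
         -- One number for each character position in the string
         SELECT TOP (DATALENGTH(ISNULL(@List, '')))
                ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
         FROM E4
     ),
     Starts(Pos) AS
     (
         -- An element starts at position 1 and just after every delimiter
         SELECT 1
         UNION ALL
         SELECT N + 1 FROM Tally WHERE SUBSTRING(@List, N, 1) = @Delim
     )
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY Pos),
       -- Length runs to the next delimiter, or to the end of the string if there is none
       Item       = SUBSTRING(@List, Pos,
                              ISNULL(NULLIF(CHARINDEX(@Delim, @List, Pos), 0) - Pos, 8000))
FROM Starts;
-- For the sample string this returns three rows: alpha, beta, gamma
```

The whole split is done as a single set-based query with no loop, which is the idea that makes this family of splitters fast.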
It has also achieved another level of greatness, which is unmatched in my experience: an entire community has sprung up around it. In the years since its original publication, there have been numerous contributors to the evolution of DelimitedSplit8K. These contributions span more than 75 pages of the discussion thread on the topic (again, as of this writing).
Alas, there are too many contributors to mention all of them by name, but suffice it to say that list includes some really great SQL talent. Some have focused on small tweaks, others have focused on producing better test harnesses. Others have written custom CLRs to improve the speed even further, when CLRs are allowed. All of the contributors probably use this tool frequently, so this speaks volumes to the function’s adoption and reuse. One of the recent contributors, Eirikur Eiriksson, upped the ante significantly by writing a version that was even faster by utilizing a new SQL 2012 feature, and his diligence in constructing test harnesses is also worthy of honorable mention.
So here’s this huge group of peers (by that I mean SQL-folk of course), that have not only embraced Jeff’s work, but have also formed an on-going support community because they felt the underlying problem was so worthwhile that they should expend their precious time and effort on it.
There’s an interesting sidebar to my vote. The Federal Relational Protection Agency (FRPA) might scoff at how preposterous the idea of a string-splitter being great T-SQL is. After all, this is a relational database and we simply don’t store delimited strings in our properly normalized relational databases, now do we? I guess what I’m saying here is that academic considerations probably take a back seat to adoption. Clearly this problem occurs relatively frequently in the real world, regardless of what the FRPA would have us think. And solving real world problems, to me at least, is really what it’s all about.
One day when I grow up I want to write a truly great snippet of T-SQL. In my development career, knowing that solutions I’ve written are actually being used has always brought me great comfort and satisfaction. No developer in their right mind, at least in my humble opinion, wants to knowingly build shelfware. I’ve been around for a long time so maybe not everyone feels the same any more, but I hope this attitude still carries forward to all levels of developers currently putting their code out there for others to learn from and grow with.
So how about you? Do you have a vote for the greatest snippet of T-SQL ever? Post a link as a comment telling us why you think it qualifies, so all who see this might benefit from your particular experience with some T-SQL you think is great. Affirmations of my vote are also welcome.
In the end, if this blog becomes a repository of some really great T-SQL code patterns, I will derive a lot of satisfaction from that too.
And to those of you out there that share my aspirations, whether or not the favorite tool/snippet you develop ever makes the list of all-time greats, wouldn’t you like to hear from all of those people that found your code wonderful enough to add to their T-SQL toolkit? So take note you adopters out there! If you’re using someone else’s code commonly to solve a problem, post them a thank you. I’m sure they’ll be happy to hear that from you, and who knows? Perhaps it will even spawn further greatness and sharing among one of the greatest communities the coding world has ever known. That’s right, I mean the SQLverse, and we’re all a part of it!
Thanks for listening to my ruminations today on what makes T-SQL code great.
Follow me on Twitter: @DwainCSQL
© Copyright Dwain Camps 22 Apr 2015. All rights reserved.
Make it Work, Make it Fast, Make it Pretty
When I first heard this, it struck me as remarkably concise wisdom applicable to virtually any programming task. The entire quotation is actually:
“Make it work, make it fast, then make it pretty… and it isn’t finished until it is pretty!”
In case you don’t know what an MVP is, it stands for Most Valuable Professional, an award that Microsoft confers only on the best of the best in Microsoft-focused technical skills.
Throughout the course of this article I will ask the reader a series of questions. Each question is designed as a thought question. When you encounter a question, you should mentally form a picture in your head of what your answer is. After the question I’ll provide you with my take on the answer. If your answers are not the same as my answers, you should take the time to reflect on why they are different.
Taking pride in your work is a very important thing. We all like to think we’re good at our jobs. Could you be better at your job? The answer is most certainly, because all of us could always be better. This article offers a philosophy by which any programmer can improve on their ability to perform their job. I sincerely hope that you can learn something from it.
Since the quotation on which this article is based was said by a really talented SQL DBA, towards the end of this article we’ll provide a code example in SQL. We’ll provide a good and bad example. Once again, you should reflect on this example and understand why it is bad or why it is good.
In case you were wondering, I have seen some of Jeff Moden’s work. I can assure you that he lives this philosophy every day. I have seen him produce some of the most remarkably ingenious and high-performing solutions to SQL problems, and in the end he always makes the code pretty. He will probably not be pleased with me for writing this article because he is also very modest about his accomplishments. That, my dear reader, is how you become a Microsoft MVP!
Make it Work
When you are writing SQL or coding in any programming language for that matter, the most important thing to ensure is that the code you have written meets the functional requirements. This means that you, as a Developer, must not only write the code but you must also thoroughly unit test it. That’s right, testing is not just for Testers to do!
To thoroughly unit test a piece of code, you must consider more than what happens when a business user runs it the way you expect. You must also take into consideration unexpected actions by those same users or, in the case of SQL queries, unexpected, bad data in the tables you’re querying.
A good Developer will also take the time to document the testing that he or she performed. This documentation, if available, may help the testing team to identify other potential cases where the code that was written may fail to perform to expectations.
So as a Developer, do you believe that you perform sufficient unit testing so that your testing team finds no issues when they perform their testing? The answer is probably not, but that is the goal every Developer should focus on if they want to be considered “better” at their jobs.
Make it Fast
Not all SQL is alike. I can probably write at least three or four queries that will satisfy any specific business requirement. The key is to select the query that will run the fastest, and there is some effort involved in making that happen.
Here’s another thought question for you. Has anyone ever told you that a query you have written runs too fast? Chances are, the answer to that question is no. Given the choice between two queries that return equivalent results, business users would almost always choose to use the query that runs the fastest. Performance counts! When I asked the first question to a good friend of mine by the name of Chris Morris (who is an expert at T-SQL), he had this story to relate:
“Some guy who’s been working at the same shop for three or four years without opening a book or visiting a forum but thinks he’s a T-SQL hero has a query that he swears can’t be made to run any faster. It takes twenty minutes to run. You rewrite it so that it works in a completely different way – which he hasn’t been exposed to – and it runs in two seconds. The first thing he’ll say is ‘That’s far too fast – it can’t possibly be doing all that work in so little time.’ Of course, the catch is that it isn’t doing all that work.”
I found that last bit quite humorous but also absolutely true. Two queries that return the same results do not have to perform the same amount of work. The one that does the least work is most likely to perform better.
There are some people who insist that, in order to write fast-performing queries, you must be an expert in understanding the query’s execution plan. While that can certainly be helpful, not many are true experts in interpreting a query’s execution plan. In fact, the execution plan can oftentimes be misleading, specifically when comparing the “cost” of two queries, because that cost is only the optimizer’s estimate. I’ve seen cases where two queries are rated as having a cost of 0% and 100%, yet the second query is much faster than the first.
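Rather than trusting the plan's cost percentages, two candidate queries can simply be measured side by side. The sketch below is one way to do that, using SET STATISTICS TIME and SET STATISTICS IO; the two queries against the sys.objects and sys.columns catalog views are assumptions picked only because those views exist in every database.

```sql
-- Report CPU/elapsed time and logical reads for each statement (see the Messages tab)
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Form 1: correlated subquery counts the columns of each object
SELECT o.name
FROM sys.objects o
WHERE (SELECT COUNT(*) FROM sys.columns c WHERE c.object_id = o.object_id) > 5;

-- Form 2: join and aggregate, returning the same objects
SELECT o.name
FROM sys.objects o
JOIN sys.columns c ON c.object_id = o.object_id
GROUP BY o.object_id, o.name
HAVING COUNT(*) > 5;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

On tables this small the difference will be negligible; the point is the habit of comparing measured time and reads for queries that return the same results.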
To write faster queries, you don’t need to be an expert at reading a query’s execution plan but you do need to keep in mind a few very basic fundamentals.
- Avoid CURSORs as they can be really slow in T-SQL. 99.99% of the time you can construct an alternate, set-based approach that will perform much faster than a CURSOR. The best favor you can do for yourself to improve the performance of your SQL is to forget the syntax for a CURSOR, or better yet completely forget that they exist.
- Your query should only touch the rows it needs to touch and it should try to touch those rows only once. If it can touch only the entry in an index instead of the row, that is even better.
- “Stop thinking about what you want to do to a row. Think instead about what you want to do to a column.” – This is another quote from Jeff Moden.
- The only way to judge whether a query is fast or not is to have something to compare it against, like another query that returns exactly the same results.
- I like to use what I call “code patterns” as a guideline to writing high performance SQL. In essence what this means is to know in advance the fastest performing methods for solving a particular problem and use that code pattern as the basis for the query that I am writing.
- Add to your testing regimen the one million row test harness (look for this in a future blog). Write every query as if you expect it to be run one million times per day against one million rows of data.
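To make the first bullet concrete, here is a small sketch contrasting a CURSOR with its set-based equivalent. The #Prices table and its contents are hypothetical, invented for this illustration, and each approach applies its own 10% increase.

```sql
-- Hypothetical sample data
CREATE TABLE #Prices (ProductID INT NOT NULL PRIMARY KEY, Price DECIMAL(10,2) NOT NULL);
INSERT INTO #Prices (ProductID, Price) VALUES (1, 10.00), (2, 20.00), (3, 30.00);

-- Row-by-row: a CURSOR raises each price by 10%, one row at a time
DECLARE @ProductID INT;
DECLARE PriceCursor CURSOR LOCAL STATIC FOR
    SELECT ProductID FROM #Prices;
OPEN PriceCursor;
FETCH NEXT FROM PriceCursor INTO @ProductID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE #Prices SET Price = Price * 1.10 WHERE ProductID = @ProductID;
    FETCH NEXT FROM PriceCursor INTO @ProductID;
END;
CLOSE PriceCursor;
DEALLOCATE PriceCursor;

-- Set-based: a single statement applies the same 10% raise to every row at once
UPDATE #Prices SET Price = Price * 1.10;

SELECT ProductID, Price FROM #Prices;
DROP TABLE #Prices;
```

The set-based version describes what should happen to the Price column rather than what should happen to each row, which is also the point of the third bullet.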
I’ll provide more guidance in these areas in future blogs, as most of my interest is in writing high performance SQL.
Keep in mind that I obtain only very basic information from a SQL query’s execution plan. I rely heavily on memorizing the fastest performing code patterns and using them when they are appropriate, and on keeping to the other bullet points I listed above.
Why should I take the time to try writing a faster performing query, when I can rely on SQL Server’s Database Engine Tuning Advisor (DTA) to suggest an INDEX that will make my query perform faster? The answer to that is that INDEXes create overhead that slows down INSERTs, UPDATEs, DELETEs and MERGEs. Too much INDEXing in a database can drag down its overall performance way more than the little boost it gives to poorly written queries that perform inadequately.
The other thing that people usually have to say about this is that they don’t have the time it takes to generate more than one query form to solve a particular requirement because of time pressures and deadlines. Once again, memorizing the best practice code patterns can help you here. Once you learn them, and you can code them almost in your sleep, you’ll be able to generate multiple solutions to the same problem very quickly. Then, you just need to create the one million row test harness (also very easy to do) to prove which is fastest.
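As a rough idea of what such a harness can look like (a sketch only, not the harness the author has in mind), the script below builds one million rows of random test data and times a candidate query. The #TestData table, its columns and the query under test are all assumptions made for illustration.

```sql
-- Hypothetical test table
CREATE TABLE #TestData (ID INT NOT NULL PRIMARY KEY, Amount INT NOT NULL);

-- Generate 1,000,000 rows with a random Amount between 1 and 1000
WITH E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)), -- 10 rows
     E3(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),                  -- 1,000 rows
     E6(n) AS (SELECT 1 FROM E3 a CROSS JOIN E3 b)                                   -- 1,000,000 rows
INSERT INTO #TestData (ID, Amount)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       ABS(CHECKSUM(NEWID()) % 1000) + 1
FROM E6;

DECLARE @Start DATETIME2 = SYSDATETIME(),
        @Sink  INT;

-- Candidate query under test; assigning to a variable keeps the
-- time needed to display results out of the measurement
SELECT @Sink = COUNT(*)
FROM #TestData
WHERE Amount > 500;

SELECT ElapsedMs = DATEDIFF(MILLISECOND, @Start, SYSDATETIME());

DROP TABLE #TestData;
```

Each alternative query form is run against the same data in the same way, and the elapsed times are compared.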
Make it Pretty
Let’s look at a query that is included in my earlier blog on Calendar Tables.
select [Date] from (select [Date], [WkDName2], FromDate, rn=row_number() over (partition by [YYYYMM] order by [Date] desc) from dbo.Calendar a left join dbo.Holidays b on a.[DATE] between FromDate AND isnull(ToDate, FromDate) where [Date] between '2014-01-01' and '2014-12-31' and [Last] = 1 and [WkDName2] not in('SA', 'SU') and FromDate IS null) a where rn=1
Now compare this query against the actual query as published in that blog.
SELECT [Date]
FROM
(
    SELECT [Date], [WkDName2], FromDate
        ,rn=ROW_NUMBER() OVER (PARTITION BY [YYYYMM] ORDER BY [Date] DESC)
    FROM dbo.Calendar a
    LEFT JOIN dbo.Holidays b ON a.[DATE] BETWEEN FromDate AND ISNULL(ToDate, FromDate)
    WHERE [Date] BETWEEN '2014-01-01' AND '2014-12-31' AND
        [Last] = 1 AND
        -- Remove Saturdays and Sundays
        [WkDName2] NOT IN('SA', 'SU') AND
        -- Remove holidays
        FromDate IS NULL
) a
WHERE rn=1;
Both of these queries are exactly the same and would produce the same results. One of the things Developers may often forget is that whatever language you are writing in, it is unlikely that no one will ever look at your code again. Maintenance of applications is a fact of life, so the “make it pretty” step is invaluable to those poor folks that come along afterwards and need to maintain your code. As a Developer have you ever said “gee I really don’t want to try to figure out what that person before me did, so I’ll just rewrite it rather than modify it?” The answer to this question is probably yes, and the reason for that is most likely that the original Developer didn’t take the time to “make it pretty.”
So what is wrong with the first query, when compared to the second? What makes it harder to maintain?
- The first of the two can be called “stream of consciousness” coding, or to put it another way “write-only SQL.” I call it write-only SQL because you can’t easily read it. In other words, whoever comes along and looks at it later is going to have a difficult time trying to understand what it does.
- The first query has no line breaks or indentation to assist the reader in identifying the individual steps that are being performed. For example, can you easily see that it has a derived table embedded in it? The answer is probably not.
- Personally, I like to see language keywords in upper case. In the first query, none of the keywords are in upper case.
- The first query has no comments in it. The second query could probably be improved by placing a comment before it that says “Calculate Aware pay days based on the last work day of the month.” To make it pretty, you should always include some comments to assist the reader that will come along after you to maintain it.
- The first query does not end in a semi-colon. Why is that important, you might ask? After all, SQL Server does not require that a semi-colon terminate every SQL statement. Well, there are cases where it does require one (e.g., the MERGE statement), and there are also cases like Common Table Expressions (CTEs) where T-SQL requires that you terminate the statement just prior to the CTE with a semi-colon. There may come a day when T-SQL requires semi-colon terminators on every statement, so why not plan ahead for that day now and end all of your queries with a semi-colon? Did you know that ending your query with a semi-colon is an ANSI standard?
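The CTE case from the last bullet is easy to demonstrate. In this small sketch (the variable and CTE names are illustrative assumptions), removing the semi-colon after the DECLARE causes the statement beginning with WITH to fail with a syntax error.

```sql
DECLARE @Year INT = 2014;   -- this terminator is required because a CTE follows

WITH Months (MonthNo) AS
(
    SELECT v.MonthNo
    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)) v(MonthNo)
)
SELECT MonthNo,
       -- First day of each month of @Year, counted in months from 1900-01-01
       MonthStart = DATEADD(MONTH, (@Year - 1900) * 12 + MonthNo - 1, CAST('19000101' AS DATE))
FROM Months;
```

Terminating every statement as a habit means this situation never has to be thought about.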
Every programming language (T-SQL included) has some standard methodology suggested for indentation, designed to make the code more readable. I’m not going to sit here and tell you that mine is the “best.” I will tell you that you should take a moment to think about how to indent your queries so that they end up being easy on the eyes of the readers that come along later. And then, once you have a methodology you are comfortable with, stick to it. Apply it to every query that you write without exception.
Once you get the hang of this, you’ll probably find yourself indenting complex queries according to your indentation methodology as you’re coding the query. That’s great because it saves you some cleanup work later on, and it will probably help you as you unit test the query because if it is complex (consisting of many steps) it is easier to run one step at a time, looking at intermediate results as necessary.
I can promise you that, when you read my blogs I’ve taken the time to make sure that each query I publish works, that it is the fastest it can possibly be and that it is (reasonably) pretty. There will be cases where I compare the performance of various code patterns that can be used to arrive at the same solution, particularly if one or the other won’t run in earlier versions of SQL.
Above I had some disparaging words to say about both CURSORs and INDEXes. My advice on CURSORs stands – avoid them at all costs and most of the time you can. Possibly one of the reasons that some people use CURSORs in T-SQL is because they’ve become accustomed to using them in Oracle SQL, where they perform much better.
INDEXes in general should be used sparingly, and when you need to INDEX a table (other than the PRIMARY KEY of course) it should be as a result of the pre-planning that you’ve done based on knowing the kinds of data retrieval operations you’ll be performing, rather than as an afterthought because some tool like DTA suggests that you do so.
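As a sketch of what planning an INDEX around a known retrieval pattern might look like, consider the hypothetical #Orders table below. The table, the INDEX name and the query are all assumptions invented for this illustration.

```sql
-- Hypothetical table
CREATE TABLE #Orders
(
    OrderID    INT           NOT NULL PRIMARY KEY,
    CustomerID INT           NOT NULL,
    OrderDate  DATE          NOT NULL,
    Amount     DECIMAL(10,2) NOT NULL
);

-- Planned for the known query below: seek on CustomerID, range on OrderDate,
-- and INCLUDE Amount so the query can be answered from the index alone
CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
    ON #Orders (CustomerID, OrderDate)
    INCLUDE (Amount);

-- The retrieval pattern that the INDEX was designed for
SELECT OrderDate, Amount
FROM #Orders
WHERE CustomerID = 42
  AND OrderDate >= '2014-01-01' AND OrderDate < '2015-01-01';

DROP TABLE #Orders;
```

The INDEX exists because the query was known in advance, and every INSERT, UPDATE and DELETE against the table still pays for maintaining it.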
As a quick example of this, there is a SQL concept known as relational division. All of the code patterns for retrieval in the case of relational division are highly dependent for their performance on the INDEXes that are created for the table. This is something we’ll explore in a future blog.
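For readers who have not met relational division before, here is a minimal sketch of one widely used code pattern for it: find the pilots who can fly every plane in the hangar. The tables and data are hypothetical, and no claim is made that this form is the fastest; as noted above, that depends heavily on the INDEXes.

```sql
-- Hypothetical tables: which planes each pilot can fly, and which planes are in the hangar
CREATE TABLE #PilotSkills
(
    Pilot VARCHAR(20) NOT NULL,
    Plane VARCHAR(20) NOT NULL,
    PRIMARY KEY (Pilot, Plane)
);
CREATE TABLE #Hangar (Plane VARCHAR(20) NOT NULL PRIMARY KEY);

INSERT INTO #PilotSkills (Pilot, Plane)
VALUES ('Smith', 'B-1'), ('Smith', 'B-52'), ('Smith', 'F-14'),
       ('Jones', 'B-1');
INSERT INTO #Hangar (Plane) VALUES ('B-1'), ('B-52');

-- A pilot qualifies when the number of hangar planes they can fly
-- equals the total number of planes in the hangar
SELECT ps.Pilot
FROM #PilotSkills ps
JOIN #Hangar h ON h.Plane = ps.Plane
GROUP BY ps.Pilot
HAVING COUNT(*) = (SELECT COUNT(*) FROM #Hangar);
-- Returns Smith only

DROP TABLE #PilotSkills;
DROP TABLE #Hangar;
```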
I will also be blogging on various common business problems that you can solve in SQL, and provide you with the best practice code pattern to ensure that the solutions I provide perform the best that they can. Oftentimes, if you already know the alternatives, you’ll immediately jump to the fastest solution available and that may allow you to skip the one million row test. Really the only time that you should skip that test is when you are 100% sure you’re using the best practice code pattern for performance.
In the end, you should never be satisfied that your code is “good enough” when with just a little bit of study and that little bit of extra effort it can be the “best that it can be.”
Copyright © Dwain Camps 2014 All Rights Reserved