Month: May 2015
High Performance T-SQL using Code Patterns
What exactly does this mean? We’ve heard of “code patterns” in other programming languages, but what does it mean to use T-SQL code patterns to develop high performance T-SQL? Having now been called out by a couple of folks in the SQLverse that I respect, most recently the editor of the SQL Server Central (SSC) web site SQL MVP Steve Jones, I believe it is time to “come out” so to speak and come clean on what I mean by this and how to achieve it.
In order to do this, I think it is necessary to expand a little on the rather enigmatic profiles you may have seen of me. The reason for this is simple, and I’ll start with a little bit of advice: “Don’t develop your applications based on the advice of just any yahoo that finds a way to post an internet page.” I’ll return more to this in a while, but for now you need to ask yourself: “Should I take the advice of DwainCSQL?” Read my extended bio below and answer that question for yourself.
My Extended Bio
I started coding T-SQL around the middle of 2011, and SSC published my first article on T-SQL in early 2012. Prior to my first experience really coding SQL, I had limited experience with ad-hoc SELECT statements only (and obviously FROM and WHERE clauses). At the time, the simplest of JOINs required me to use Google to recall the syntax. I was project manager for a project that required a rather complex stored procedure to be built. I became exasperated with the development team, because after three weeks of testing it, they simply could not get it right. So I threw my hands in the air (after beating my head against my desk many times) and said “let’s see if I can write this thing myself.”
After five days of Googling and coding, one tiny excruciating bit at a time, I came up with a stored procedure (SP) that I thought would do the job. One issue was found by our testing team during testing, none were found in User Acceptance Testing (UAT) and after it was moved into production no error was ever recorded for that SP despite the fact that it runs every two minutes (24×7) and has been doing so for more than four years now. The Googling was to teach myself the T-SQL code constructs that were needed (and I’m talking real simple things like UPDATEs, INSERTs and JOINs) to make my SP work.
A little later on in the same project I was faced with a second challenge. The development team had created another SP to handle a different scheduled task, one which was also required to run every two minutes. They came up with something that processed 1,000 rows in about twelve minutes. This was obviously unacceptable, because it needed to process those 1,000 rows within the interval between two scheduled runs. Off I went to Google to teach myself about how to write high performance T-SQL. It was at that time that I ran across the term RBAR, which stands for “row-by-agonizing-row,” and I immediately recognized the wisdom of the term and wrote “no-RBAR” on the whiteboard behind me. This of course got me a lot of questions (like what does it mean?).
I set about to rewrite that SP, and through some relatively simple techniques I managed to get the processing time for the 1,000 rows down to less than two seconds.
These two challenges in T-SQL taught me two things: 1) anyone can learn the language (even a non-technical project manager) and 2) performance counts. I later came to realize that performance always counts, even if it is not explicitly stated as a non-functional, system requirement.
I also learned one other thing from this experience. I really liked T-SQL! And that of course got me to reading about it (and later writing about it). One of the first things I did, probably in late 2011 or so, was to get a login account on the SSC web site, because I realized that a lot of the good information I’d gotten, to get me through those coding challenges above, I got from there. That’s also around the time that two things happened: 1) I started fielding questions on the SSC forums (mostly for fun and) to improve my skill level and 2) I learned from whom the term RBAR originated, namely SQL MVP Jeff Moden.
Being of the brazen sort, sometime after my first article was published I believe, I reached out to Jeff and told him part of the above story and how his RBAR term really stuck with me and how I’d embraced it. It was really the combination of the two coding challenges that hit me at nearly the same time that got me interested in high performance T-SQL and how it can be achieved. Somewhere, shortly after that first contact with Jeff (and I owe him the utmost gratitude for encouraging me during my formative years), he gave me a piece of advice that has influenced nearly all of my writings, ramblings and ruminations since then. It was simple and to the point, like most of the advice that he dispenses: “never make performance claims without proving it in code.”
Since that first article I’ve had a total of 23 articles published on SSC (some have received more than 30,000 views), 21 articles published on Simple Talk and another 19 or 20 blogs. You’ll see many if not most of those articles embrace Jeff’s advice, about proving the performance with code and statistics.
So now the question is, how did I get from T-SQL newbie four years ago, to where I am now (wherever that is)? And more importantly, have I come far enough to be someone who’s advice you should listen to?
I can’t answer the latter for you, but this blog is all about the former.
Enter T-SQL Code Patterns
I honestly don’t have much experience with code patterns in other programming languages. I get the feeling that they’re somewhat akin to something we used to do, back in the days when I was all about technical and coding, which we called code reuse. And I did it all the time.
In T-SQL I think of code patterns a little bit differently. Let’s first talk a little bit about algorithms. There are at least two types: procedural and set-based. Procedural algorithms can be written in T-SQL using SPs. Usually they suck. Set-based algorithms on the other hand, usually perform much better and they are the code patterns you need to embrace when it comes to T-SQL. Another piece of advice that Jeff likes to dispense is “it depends” (also short and to the point), and I’d agree with that here. There are exceptions to this statement (“they suck”) but I’d suggest they are rare.
T-SQL code patterns are different in the sense that little code reuse is usually involved. A code pattern in T-SQL is more like a concept that can be applied to a specific query circumstance. Let’s look first at a really simple concept that represents a code pattern.
When doing a JOIN, it is best to try to JOIN a foreign key in the right table against its referenced columns in the left table.
This is a code pattern that is so simple in fact, many people may not recognize it for what it is. It is a high-performance code pattern for the basic reason that the SQL optimizer is going to use the available indexing (usually the PRIMARY KEY’s clustered index in the left table) to come up with the lowest cost query plan. It is hardly something you’ll likely be copying from one query to another, so it is not “reusable” in that sense at all. It does have wide applicability that extends to nearly every query you’ll ever write (assuming you’re not like me in my beginnings when I didn’t even know the proper syntax for JOINs).
What about more complex code patterns? There are many, but they are all IMHO concepts. How did I learn about them? Well, that would come from extensive reading. Once I got interested in T-SQL and performance, I dived into reading as much material as I could about specific queries and their applicability to specific problems. It didn’t take me very long to realize that there are about as many T-SQL solutions to a specific query problem as there are stars in the sky. In fact, to solve a specific problem it is unlikely that any two intrepid T-SQL developers would do it the same way. The literature is swamped by different people’s takes on different ways to solve a particular problem using a T-SQL query.
That’s where Jeff’s advice about always proving performance claims in code came in. If I read about a particular solution technique and the performance was not proven in code, I would consider it basically un-tested until I found somewhere that it was. Or otherwise, I’d need to test it myself to prove its efficacy. In the end, the plethora of potential solutions would get winnowed down to just one (or perhaps a handful) that demonstrated the performance characteristic that would make them viable as a T-SQL code pattern.
Remember what I said above: “performance always counts.” That’s how you know you’ve stumbled onto a code pattern that’s worth remembering.
So now you’re probably asking yourself two things: 1) how can I possibly remember the right code patterns to solve every querying situation I might encounter and 2) how can I sort through this huge amalgam of potential solutions to find the one that performs the best, and do that in the finite amount of time I have available to me to write my special query.
My answers in their simplest form are: 1) you don’t and 2) rely on source legitimacy. Both of those require a bit of explaining so I will.
It is not necessary to remember the specifics of every elegant and performance efficient code pattern you find. All you need to remember is where the pattern applies and who wrote about it. Let’s face it, most of the really tricky code patterns have been tackled by experts. And these guys, who are no mere mortal men mind you, have developed best-of-breed solutions. Thanks to the Internet and Google, if you’ve found the code pattern once, it is quite simple to find it again when you need to, simply by remembering who wrote about it. Google is your friend – embrace it.
I personally do not memorize the specifics of every code pattern I’ve ever encountered that I liked because it was of proven performance characteristic. I am more than happy to look them up when I need them. Yes I do remember a few, but that’s usually because I’ve had to use them extensively or I’ve had the opportunity to write about them. For example, two that I do remember without look up are the Quirky Update and Gaps and Islands. We’ll talk more about Gaps and Islands in a minute, but both of these are retained in my long term memory as more than a concept, rather as a technique for specifically the reasons I mentioned.
Lest we not forget my second answer “rely on source legitimacy,” the explanation of that is also pretty simple. Rely on proven sources for your information. These are authors in the SQLverse that demonstrate time and time again the code patterns that perform the best for specific query cases. A short list of my favorite authors of code patterns (even though they may not themselves call them that) are listed on the Favorites page right here in my blog. I will apologize to those other sources that I may have left out (especially if I’ve quoted or referenced their work in past articles) or because I simply haven’t identified them yet. Rest assured I will maintain that Favorites page as new authors come to my attention.
There are many-score authors on T-SQL out there that never demonstrate performance. My attitude is generally to avoid them. Or at least to take their ideas with a grain of salt, and test the heck out of them.
A Little more Depth on one of my Favorite T-SQL Code Patterns
Early on in my T-SQL readings and mass accumulation of information that I hoped would lead me to be better at writing T-SQL I stumbled across a problem that for some reason piqued my interest, probably because of its complexity. Once I learned of it, and quite honestly I had no real use for it other than to satisfy my curiosity, I couldn’t read enough about it. What really ticked me off the most I suppose, and which led me further down my research path, was that I didn’t really understand how the solutions I was seeing were working, but that’s beside the point.
That problem is known as Gaps and Islands, which I said above I’d discuss in more detail. I have linked in there probably the first article that I found with what I’ve called “un-tested code patterns.” I should point out also, that this is a code pattern that you probably won’t need to use too often. Still, as it is one of my favorites, I feel I should expound upon it just a bit to explain how I distilled down the solutions I found to the one code pattern that is easily remembered and worthwhile to do so.
As I continued my reading I found some good articles about Gaps and Islands, while at the same time flailing around rather blindly and failing miserably in my understanding of the solutions.
- Chapter 5 of SQL Server MVP Deep Dives written by SQL MVP Itzik Ben-Gan
- Group Islands of Contiguous Dates written by Jeff Moden
- And later, SQL 2012 Performance Test: Gap Detection by Microsoft Certified Master (in SQL Server) Wayne Sheffield
If you’ve looked at the Favorites page of my blog that I linked in earlier, it is no coincidence that you’ll find all three of these authors listed there. All of them are highly representative of what I call source legitimacy, because they back up their performance claims with code and statistics, but perhaps more importantly they are lucid writers. Their explanations are easy to follow and so are their examples. In fact, it was only after reading these sources did the dawn of understanding arise for me with respect to the Gaps and Islands problems, and how they are solved.
If we take a look at the Deep Dives book, Itzik Ben-Gan’s chapter demonstrates and explains four solutions to islands and four additional solutions to the separate problem of gaps. Jeff’s article on the other hand focuses on only one of the islands solutions (and that one is also in the Deep Dives chapter).
Wayne Sheffield’s blog, which I read later showed one of the new solutions for gaps that came about because of a new feature in SQL 2012 (the LAG analytic function). His article was quite interesting in that it eliminated that code pattern on the basis of its performance (later I’ll call this an “anti-pattern”).
Finally my curiosity was sated! I now understood the code patterns and had reliable benchmarks I could use to base my use case decisions on. I had embedded the authors in my long term memory, so when the need arose I could easily re-locate those references and pick the applicable code pattern.
Nonetheless, I had the nagging desire to want to contribute something to this space. I came to the conclusion that it seemed to be the playground of mostly Microsoft SQL Server MVPs, but being of the brazen sort as I mentioned before, that deterred my desire not in the least.
Then a few months later it hit me. My Aha! moment. That moment when I came to understand the relationship that exists between Gaps and Islands. To put it as simply as possible, they are not essentially different as islands can be converted to gaps (my first realization), and that gaps can be converted to islands (although not quite as easily). After some quick code-proofing (always a first step in all of my writing), my muse kicked in and I furiously composed The SQL of Gaps and Islands in Sequences, which the editor at Simple Talk kindly published for me.
Remaining true to my desire to back up my performance assertions with code and statistics, again that great advice given to me long before by my T-SQL mentor Jeff, I tested my new contribution to the Gaps and Islands solutions space. While in neither case did my solution come out the overall winner, in both cases my solutions were competitive. And that was a place I was happy to live in.
Ultimately this led me to the code patterns I now apply to either Gaps or Islands:
- Use the code pattern shown by Jeff in his linked article on grouping islands of contiguous dates for the islands problem.
- Use my solution on top of that one for the gaps problem. I called that solution Cross Apply Values – Islands to Gaps.
Because I had studied it so hard I can now easily remember Jeff’s islands solution. Because of my aha moment, I can easily remember my little add on to take me from islands to gaps. The code pattern had solidified and I could now apply solutions that performed admirably in these scenarios. If for some reason, for example due to data irregularities or what-not, neither of these solutions meets the end game of the best performing solutions, I know where to go to look into alternatives.
Interestingly, the code pattern for islands (which I call the staggered rows approach) actually has applicability to a wider variety of problems, making it an extremely versatile code pattern that perhaps I’ll dive into in a future article.
A Couple More High Performance Code Patterns that I’ve had the Privilege to Author
There are way too many high-performance T-SQL code patterns out there than can be enumerated in a single blog. Most of these have been authored or demonstrated by individuals that are highly skilled in T-SQL, much more so than you or me. Look to the favorites that I mentioned and you’ll easily find many of them.
It has been my privilege on occasion to stumble upon a code pattern of my own making that seems to be the most desirable one to use based on performance. These are so few as to be able to mention them all in this short blog, so I will do so and humbly ask your forgiveness if you feel I’m tooting my own horn as that is not my intent.
- My solution to a challenge posed by Itzik Ben-Gan to Identify a Subsequence within a Sequence. In that article incidentally is an alternate, high-performance approach to the problem by SQL MVP Peter Larsson, which is based on the staggered rows code pattern I mentioned earlier.
- I generalized that subsequences solution to what I call the Row Reduction approach to Relational Division, which I wrote about in High Performance Relational Division in SQL Server. Peter Larsson’s work is referenced quite frequently therein, because he’s really the man when it comes to relational division.
- Then there is calculating the statistical median in T-SQL, which has been written about by me (Calculating the Median Value within a Partitioned Set Using T-SQL), by SQL MVP Aaron Bertrand (Best Approaches for a Grouped Median) and again by me in another blog (An Even Faster Method of Calculating the Median on a Partitioned Heap). In that later blog, is yet another code pattern for median that remains in the un-tested stage because it hasn’t been run through all scenarios and against all contenders, but one day I’ll get around to it because it looks promising. Note that in the case of median, my solution was not the front-runner in all scenarios, but as usual I’m willing to settle for just being in the race.
Again, I want to emphasize that this blog is most emphatically not about me, it is about T-SQL code patterns. If my examples that involve me haven’t yet convinced you, then read on for mention of just a few more. Many of these are even more important and generally more versatile than the few examples I’ve bored you with so far.
Other High-Performance T-SQL Code Patterns that you can Explore
In no particular order, here are some additional code patterns you should try to familiarize yourself with time permitting, should you in fact have bought into this concept that I am espousing. Some of the links are to my writings, while others are to mostly authors on my favorites list. Once again, I’m sure there are others worthy of mention that I apologize for omitting.
- The Numbers or Tally table
- Calendar Tables (also here, here and here, which is where I first learned of them)
- Using dynamic SQL in search procedures (here, here and here)
- Splitting a delimited character string (and the massive discussion thread that embellishes on this code pattern)
- The CROSS APPLY VALUES approach to Un-pivot
- The T-SQL Window Functions (here and here)
- Cross tab queries (here and here)
- Hierarchies in T-SQL (here, here and here)
- In-line Table Valued Functions instead of scalar-valued, user-defined functions (sorry no link here, so I’ll leave it as a challenge to you to find a reputable source)
There is overlap in some of those links, but that is only because some of those articles cover multiple code patterns well.
And then of course, there are the anti-patterns (where better code patterns exist):
- Counting with Recursive CTEs
- Triangular Joins
- CURSORs in T-SQL (not really a code pattern per se, but certainly an anti-pattern that can be applied in many querying scenarios that should be scrupulously avoided)
That should boost your reading list a bit! And I’ve barely scratched the surface.
When there is no Code Pattern to Fit
We all like to think that our T-SQL problems are special, that they are unique; and that in order to solve them we should write our own approaches to solving them, otherwise where’s the challenge?
That is utter nonsense. The challenge is to quickly deliver high-performance solutions, which accurately solve business problems. Solutions that stand the test of time by being bug free, regardless of what changing, underlying data patterns impact them. Solutions that scale upwards as your applications become successful beyond your wildest imaginings.
I’ve tried to show you that learning high-performance code patterns is easy. Read a lot, remember a little. Utilize your memory bandwidth to its utmost by simply remembering the name of the code pattern and the name of the author that provided the best proof of performance, so later you can look it up when you need to. You don’t need to learn them all at first to get started with this approach. Start by learning a few, and then expand on that knowledge when the need arises.
Perhaps the biggest challenge when encountering a new T-SQL querying problem is to figure out what your predecessors called it. Gaps and islands for instance, may not be intuitively obvious, but once you remember it and can identify the use cases, looking up the code patterns becomes relatively simple. And when I talk about your predecessors, you can rest assured that for nearly every complex querying situation that you’ll encounter in your career, someone has explored it before you. Sorry to say this, but some of those people (yours truly excepted) are probably a lot more talented in T-SQL than you, especially if you’re just a beginner. Why not try to draw upon what they have to offer?
And on those rare occasions where you truly have a new and complex query to write, keep in mind these immortal words of SQL MVP Gail Shaw:
“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.”
Learning some of the code patterns noted in this article will almost certainly help you to write high-performance T-SQL queries. Learning though, is a never-ending cycle. You should let it lead you to new frontiers throughout your career. Learn about T-SQL query plans (SQL MVP Paul White’s and SQL MVP Grant Fritchey’s articles on my Favorites page should help you a lot with that) and learn the anti-patterns too, so you know what to avoid.
Learn who to trust, and learn when perhaps your trust should be placed in other sources.
And if you should perchance to fancy writing about T-SQL, be sure to give credit where credit is due. Building on the work of others is perfectly acceptable (I do it all the time), as long as you credit the source and the person. Try to add value and not just rehash what you have seen others do. This expands the knowledge space for everyone.
Notice how I always hyperlink SQL MVPs to their Microsoft MVP page? That’s because I figure these guys (and ladies) have earned the recognition and deserve to be honored for it.
To be honest, I’m not yet sure if I’ve convinced you to trust my words or not. After all, I am just another yahoo that’s figured out how to post something to the Internet. You’ll need to be the judge of that.
If you do trust me, I think that’s just peachy! Perhaps you’ll then want to follow me on Twitter (@DwainCSQL), because there I can promise you nearly all-SQL tweets!
I hope to hear your comments on these ideas in the comments section below. Until next time, happy and productive querying!
© Copyright Dwain Camps 27 May 2015. All rights reserved.
The Best Tool for Comparing Whether Two T-SQL Queries Produce Identical Results
There is one software tool that should be in every software developer’s tool chest. That is knowledge of the platform on which you are developing. With all the hype that surrounds every new software tool that comes out, it is important to recognize the tools you already have at hand, and the ones that may be hidden gems within your development platform of choice.
Let’s backtrack for a moment into how a T-SQL query is developed. Remember my prior blog: Make it Work, Make it Fast, Make it Pretty? That blog (based on the sagacious and immortal words of SQL MVP Jeff Moden) suggests there’s three steps. In reality there are more, but for the time being let’s focus on two of the three mentioned in that statement.
- Make it Work – you understand the requirements, you understand the physical and logical models in which the data is stored, you write a query that returns hundreds of rows of results and you painstakingly check that the results indeed match the requirements, and are as you intended, based on the underlying data stored in the tables.
- Make it Fast – once your query is returning the desired results, at some point in time you may be confronted by the dreaded statement by somebody that “it isn’t fast enough.” Time to break out those performance tuning tools.
Let’s hope that the dreaded “it isn’t fast enough” pronouncement doesn’t come after the query is already in production, but that is probably the most likely scenario.
While I’m into doing the Make it Work part, I have a tendency to rely on known, high-performance code patterns to develop a query which, under most circumstances should probably be of sufficient performance. Maybe I’m having an off day, or maybe my underlying data changes in such a way that the SQL Optimizer simply starts making bad choices for choosing an execution plan (see: Lessons Learned from a Poor-performing View). Maybe the darned thing is very, very complex and you’re pushing up against tight delivery timelines, so you just need to get it done. Whatever the case, there will come a time where you’ll need to rewrite a query to improve its performance.
This is where it might get a little tricky. Suppose that the working query is highly complex, perhaps complex to the point that you’re afraid to touch it for fear of breaking it. I’ve seen that happen a lot in cases where you weren’t the original author of the query. Perhaps you don’t even know all of the outlying test cases that were run through it to prove that it worked correctly in the first place. Let’s face it, documentation isn’t always all it’s cracked up to be, and easily could be lost with time anyway.
You’ll need a tool that allows you to compare the results of the working query, against any alternative(s) that you may need to construct to improve on performance.
Enter the T-SQL EXCEPT Set Operator
Well, first I suppose we need to have a couple of SQL queries to compare. Let’s draw upon an old article of mine kindly published by the fine but eccentric editor of Simple Talk entitled: The SQL of Gaps and Islands in Sequences.
-- A solution of my own design for Gaps WITH Islands AS ( SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1 FROM ( SELECT ID, SeqNo ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) FROM dbo.GapsIslands) a GROUP BY ID, rn ) SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo) FROM ( SELECT ID, SeqNo ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2 FROM Islands CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a GROUP BY ID, m HAVING COUNT(*) = 2 ORDER BY ID, StartSeqNo; -- A Gaps Solution from SQL MVP Deep Dives SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1 FROM ( SELECT ID, cur=SeqNo, nxt=( SELECT MIN(B.SeqNo) FROM dbo.GapsIslands AS B WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo) FROM dbo.GapsIslands AS A) AS D WHERE nxt - cur > 1;
Both of these solutions find gaps in a table containing a column of sequence numbers. You won’t really need to run these two queries to follow along with the discussion, but if you want to try some of the things we’ll discuss that article has all the scripts you’ll need in its resources section (at the top).
Suppose that each of these queries returns about 10,000 rows give or take (they don’t but let’s pretend that they do). To verify the veracity of the statement “they produce identical results” we just need to do a few steps:
- Run both of the queries exactly as shown above and check the row counts of each. If they both return 10,650 rows we’re good. If one returns 10,650 rows and the other returns 10,633 then you know they’re not producing identical results (duh!).
- Comment out the ORDER BY in the first query and add EXCEPT between the two. That should now run because each query produces exactly three columns of results.
- Note that the first query contains a Common Table Expression (CTE). If the second query does too, you’ll need to move that CTE up (and possibly rename it) to make it cascaded with the first. If both queries use the same CTE(s), simply discard the one for the second query.
Our case should now look like this:
-- A solution of my own design for Gaps WITH Islands AS ( SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1 FROM ( SELECT ID, SeqNo ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) FROM dbo.GapsIslands) a GROUP BY ID, rn ) SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo) FROM ( SELECT ID, SeqNo ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2 FROM Islands CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a GROUP BY ID, m HAVING COUNT(*) = 2 EXCEPT -- A Gaps Solution from SQL MVP Deep Dives SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1 FROM ( SELECT ID, cur=SeqNo, nxt=( SELECT MIN(B.SeqNo) FROM dbo.GapsIslands AS B WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo) FROM dbo.GapsIslands AS A) AS D WHERE nxt - cur > 1;
If the two queries produce identical results, when you run the above it should produce zero rows in the results set! Then all you’re left with is to verify that the newer query runs faster, and that’s something I covered in The One Million Row Test Harness.
If some rows appear then there is something wrong and the queries are not producing identical results sets. You could analyze the results by switching the order of the queries around the EXCEPT.
Or you could just forget this tip about using EXCEPT and painstakingly analyze each of the 10,650 rows produced by each result set. But that is something that I’m just a bit too lazy to do!
Then again, maybe it is precisely the tool you need to get over your queryophobia and dive head first into that complex query written by some other SQL geek, without worrying too much about verifying that the results are correct.
If you liked this neat little T-SQL trick then maybe you should follow me on Twitter: @DwainCSQL because who knows what other cool stuff I might come up with!
© Copyright Dwain Camps 21 May 2015. All rights reserved.
Excel in T-SQL Part 2 – The Normal Distribution (NORM.DIST) Density Functions
Today’s blog will be the second in a multi-part series on replicating Excel functions in T-SQL, continuing with Excel’s NORM.DIST built-in function, thus enshrining my geekdom in the SQLverse forever.
Today’s solutions will once again focus on creating T-SQL in-line, Table Valued Functions (iTVFs), introduced in SQL Server 2005, that apply my solution techniques to the general case of calculating the value of the Normal Distribution’s Probability Density Function (PDF) and Cumulative Distribution Function (CDF), for any mean and standard deviation you specify. I will not be doing any performance testing of today’s iTVFs; however I have done my best to utilize T-SQL performance best practices in their construction. I will also provide you some links to alternate solutions I have seen. Since all are approximation algorithms (including mine), the difference to our readers is probably going to be in which produces the lowest error results. Only you can be the best judge of which that is.
The Normal Distribution
As I’ve said before I’m no statistician, but I have seen the Normal Distribution and most likely you’ve seen it too. The normal distribution has many interesting properties, starting with the:
- Mean – The average of all sample points representing the population, which appears as the center line of a normal distribution (probability density) function’s curve.
- Standard Deviation – A measure of the elongation of the tails of the distribution.
Graphically I’m sure you’ve seen the normal distribution before, with the blue line (below) being what you normally (no pun intended) see. This is the normal distribution’s probability density function (PDF). The red line is also of interest, and this is called the cumulative distribution function (CDF).
Notice how the blue line (PDF) is bilaterally symmetrical about the mean value, which in the above case is ten. This simply means that there’s an equal portion of the curve on both the left and the right of the mean. Another property of the PDF for any distribution is that the area under the curve is exactly equal to one, and this applies regardless of what the mean and standard deviation are. In fact, it also applies to other statistical distributions, so long as it is the PDF you’re talking about.
The CDF represents the area under the PDF at any particular point along the PDF line. For example, if we draw two intersecting lines as in the example below, with the first (vertical) line through the mean and the second line intersecting the first and also the CDF, we see:
- The clear bilateral symmetry of the PDF line, where half of the area appears to the left of the bisecting line and the other half appears to the right.
- The horizontal line intersecting the right vertical access at 0.5, indicating that the value of the CDF (the area under the PDF) at the mean is precisely 0.5, meaning that the total area of the PDF curve is 1.0, which is also the point at which the red curve intersects with the right axis.
This same relationship holds for the PDF/CDF of any other distribution.
Computing the Normal Distribution’s Probability Density Function
If you go to the Wiki page for the Normal Distribution linked above (or many other sources), you will see that the PDF for the normal distribution can be represented mathematically as:
Where µ = the arithmetic mean and σ2= the standard deviation. The variance (σ) is the square root of the standard deviation.
Also, SQL has the PI and EXP built-in functions to support this equation.
SELECT PI(), EXP(1.); -- Results: 3.14159265358979 2.71828182845905
In SQL terms, this is relatively straightforward to compute, and we’ll show that in a minute.
Let’s look now at the Excel NORM.DIST function, which takes four arguments:
- x is the point where we wish to calculate the value of the normal distribution, which could be any point along the x-axis of the graphics above.
- µ is the mean for the normal distribution of interest.
- σ2 is the standard deviation for the normal distribution of interest.
- The last (fourth) argument to NORM.DIST is a logical (TRUE/FALSE) value which specifies whether you want to calculate the value for the PDF (if FALSE) or the CDF (if TRUE).
There is also a special case of the normal distribution, known as the Standard Normal Distribution, where the mean is zero and the standard deviation is one. For this special case, the PDF function is somewhat simpler and can be written as:
As I am also no mathematician, I am thankful that all of the formulas so far have simply been copied out of the relevant sources.
Since I am a T-SQL guy, I must now focus on my specialty and present a function that will calculate the probability density function (PDF) for both the standard and general normal distributions.
CREATE FUNCTION dbo.NORMAL_PDF -- Normal Distribution - Returns two bits of information: -- NPDF(X) - The Normal Probability Density Function's value at given mean and std deviation -- SNPDF(X) - The Standard Normal Probability Density Function's value (mean=0, std dev=1) ( @X FLOAT -- Point at which function is to be evaluated ,@Mean FLOAT -- Mean of the Normal Distribution ,@StdDev FLOAT -- Standard Deviation of the Normal Distribution ) RETURNS TABLE WITH SCHEMABINDING RETURN SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev -- Normal Probability Density Function for Mean and Standard Deviation ,[NPDF(X)] = EXP(-0.5*(@X-@Mean)*(@X-@Mean) / (@StdDev*@StdDev))/(SQRT(2.*PI())*@StdDev) -- Standard Normal Probability Density function for Mean=0 and StdDev=1 ,[SNPDF(X)] = EXP(-0.5*@X*@X)/SQRT(2.*PI());
Let’s now construct an Excel spreadsheet to calculate the PDF for the normal distribution using a couple of different mean/standard deviation combinations, so we can check the results from our function.
The highlighted cell’s formula is in the formula text box above, showing you how to calculate the PDF for the normal distribution where mean is two and standard deviation is 0.5 (from columns B and C).
We’ve left a couple of columns open so we can run the following T-SQL script that uses our iTVF to calculate the PDF values at the various Xs. The differences columns are currently showing the same value as appears in the Excel column with reversal of sign.
WITH SampleData (X, Mean, StdDev) AS ( SELECT -1,2,0.5 UNION ALL SELECT -0.5,2,0.5 UNION ALL SELECT 0,2,0.5 UNION ALL SELECT 0.5,2,0.5 UNION ALL SELECT 1,2,0.5 UNION ALL SELECT 1.5,2,0.5 UNION ALL SELECT 2,2,0.5 UNION ALL SELECT 2.5,2,0.5 UNION ALL SELECT 3,2,0.5 UNION ALL SELECT 3.5,2,0.5 UNION ALL SELECT 4,2,0.5 UNION ALL SELECT 2.33,10,5 UNION ALL SELECT 4.22,10,5 UNION ALL SELECT 5.1,10,5 UNION ALL SELECT 8.2,10,5 UNION ALL SELECT 10,10,5 UNION ALL SELECT 11.5,10,5 UNION ALL SELECT 12.6,10,5 UNION ALL SELECT 16.8,10,5 UNION ALL SELECT 22.5,10,5 ) SELECT a.X, a.Mean, a.StdDev, b.[NPDF(X)], b.[SNPDF(X)] FROM SampleData a CROSS APPLY dbo.NORMAL_PDF(X, Mean, StdDev) b;
When we run the above script, then copy/paste the output from the SSMS Results pane into our spreadsheet, it now looks like this.
The cell formula at the top now shows you how to calculate the PDF for the standard normal distribution in Excel (second and third arguments are different from the prior Excel graphic).
The red bordered columns now show very miniscule differences between the values computed by T-SQL for the normal distributions’ PDF and SNPDF. These calculations are probably accurate enough to say that we’ve replicated the Excel NORM.DIST function’s results when its fourth argument is FALSE.
Computing the Normal Distribution’s Cumulative Distribution Function (CDF)
We can easily add a few columns to our spreadsheet to show you how easy it is in Excel to calculate the CDF for our two distributions, including some placeholders for when we get to the point of having some results in T-SQL.
Note that the highlighted cell’s formula is shown in Excel’s formula entry text box.
While the PDF for the normal distribution can be represented in what is known as “closed form” (see the formulas above), the CDF cannot be represented in closed form. Instead we need to represent the value of the CDF at a point (X) as a definite integral, which is essentially just calculating the area under the PDF.
At the sight of the integral symbol, I’m sure that some of my reader’s eyes have glazed over. Once again though, this is a formula not of my own making, but one which can be found in many sources. My hope is that I don’t lose you here, because if you read on you just might find some really interesting stuff!
Let us consider a simple case, for example where X = µ (the mean). In that case, from our graphical look at the two PDF/CDF curves, it implies that the value of the CDF should be 0.5. If we look at cells J9 and J18 in the last Excel spreadsheet shown, we see that indeed Excel’s NORM.DIST function computes exactly that value for the CDF.
To replicate this in T-SQL, we’ve got just a bit of a problem though. Unfortunately, T-SQL does not have an “INTEGRATE” built-in function!
The Fundamental Theorem of Calculus
I know that’s a mouthful, but the fundamental theorem of calculus is going to give us a way to construct a set-based algorithm that will allow us to calculate the CDF of the normal distribution. If you haven’t taken calculus, or can’t be bothered to try to remember what the fundamental theorem says, you might want to skip ahead to the next section and just get to the T-SQL. But as I’ve said many times, since I am no mathematician, I’ll need to keep my explanation as simple as possible in the hopes that I don’t trip myself up and that my readers can follow along.
Let’s look at a graphical example.
Suppose we were interested in the CDF where X=5. We could construct a series of rectangles that extend from the lowest value of the left tail of the distribution, up to where X=5. As these rectangles get smaller and smaller, the sum of the areas within the rectangles approaches quite closely the value of the CDF. Technically, this is expressed as a “limit” where the width of the rectangle à 0.
That is essentially the fundamental theorem of calculus, or the part that interests us anyway.
Unfortunately, we’re not quite done with our exotica yet. Constructing a set-based formula that sets up a specific (known) number of rectangles from -∞ (negative infinity) up to our value of X, is problematic at best. So we’re going to fall back on something that I pointed out earlier. This is that the area under the curve where x=10 (the mean in this example) is 0.5. So instead we can solve the following integral when X is less than our mean.
And alternatively, when x is greater than the mean:
Gosh! That still doesn’t sound too simple, but in reality we are now quite close to integrating the normal distribution’s PDF at a point of interest, and doing it with a set-based algorithm.
By now my regular readers are probably used to my little T-SQL sleight of hand tricks. So here’s another. Consider the following script.
DECLARE @X FLOAT = 1.5 -- Point at which function is to be evaluated ,@Mean FLOAT = 2. -- Mean of the Normal Distribution ,@StdDev FLOAT = 0.5 -- Standard Deviation of the Normal Distribution ; WITH CalculateIntervals AS ( -- Total intervals (about 5 per standard deviation) SELECT Intervals = 5 * ABS(@Mean - @X) / @StdDev -- Number of intervals per standard deviation ,Interval = 5 ), Tally (n) AS ( -- Up to 10,000 row tally table SELECT TOP (SELECT CAST(Intervals AS INT) FROM CalculateIntervals) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n) ) SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev ,Intervals ,n ,Pos1 ,Pos2 ,Width ,Height = AvgX -- Width x Height = NPDF(AvgX) ,Area=Width * e.[NPDF(X)] FROM CalculateIntervals a CROSS JOIN Tally b CROSS APPLY ( SELECT Pos1 = @Mean + (@StdDev/Interval) * (n - 1.) * SIGN(@X - @Mean) ,Pos2 = @Mean + (@StdDev/Interval) * (n + 0.) * SIGN(@X - @Mean) ,Width = ABS(@Mean - @X)/Intervals ) c CROSS APPLY ( -- Cheat the rectangle's height - make it the average of Pos1 and Pos2 SELECT AvgX = 0.5 * (Pos1 + Pos2) ) d CROSS APPLY dbo.NORMAL_PDF(AvgX, @Mean, @StdDev) e;
Let’s take this bit of magic and decompose it one step at a time.
- We start by declaring three local variables, which will ultimately become arguments to the iTVF we hope to create.
- The CalculateIntervals CTE creates two intermediate column values that we’ll use later: 1) the first is the total number of intervals, and 2) is the intervals per standard deviation that our X of interest is from our mean.
- The Tally CTE is just an in-line Tally table that is limited to the number of rows set up in the Intervals (total intervals) column of the CalculateIntervals CTE. The intervals themselves are simply used to represent the rectangles we described above.
- Next (in our FROM clause in the main query) we take the one row that is returned from the CalculateIntervalsCTE and CROSS JOIN it with the Tally table’s results, so that we’ve consolidated everything we need onto each row.
- The first of three cascading CROSS APPLYs calculates two positions (Pos1 and Pos2) and the width of our rectangle. These calculations will be clearer when we look at the results set (below).
- The second cascading CROSS APPLY “cheats” the height of the rectangle, to be the average of Pos1 and Pos2. This introduces some level of error into our calculations, and there are ways around that (too advanced for me and this blog). One way to reduce this error is to introduce more intervals (or rectangles).
- Lastly, in our final CROSS APPLY we call our NORMAL_PDF function to calculate the value of the PDF at AvgX where the mean and standard deviation are specified.
Note that this code fails miserably (no results rows produced) when @X = @Mean, but we’ll handle that in our iTVF. Here are the results:
X Mean StdDev Intervals n Pos1 Pos2 Width Height Area 1.5 2 0.5 5 1 2 1.9 0.1 1.95 0.0793905094954024 1.5 2 0.5 5 2 1.9 1.8 0.1 1.85 0.0762775630921048 1.5 2 0.5 5 3 1.8 1.7 0.1 1.75 0.0704130653528599 1.5 2 0.5 5 4 1.7 1.6 0.1 1.65 0.0624507866733522 1.5 2 0.5 5 5 1.6 1.5 0.1 1.55 0.053217049979751
The X, Mean, StdDev and n columns should require no explanation. The Intervals column will be five times the number of standard deviations that X is from our mean (note that X could be less than or greater than the mean). Pos1 and Pos2 compute the right and left points of our interval (which may be reversed if @X > @Mean), while width is computed based on the total difference between X and our mean divided by the total number of intervals. See the comment in the code about “cheat the height?” Height in the result set is simply the average of Pos1 and Pos2. Finally, the Area is Width * Height, which when summed across all five of our rectangles and adjusted according to the value of the CDF at the midpoint (0.5) should be a reasonably close approximation of the CDF of the normal distribution at X! This approximation should improve if we use more than five rectangles per standard deviation.
To put all of this another way, we’ve implemented the fundamental theorem of calculus by using a tally table to calculate our little rectangles, in order to integrate under a curve. Too bad I didn’t think of that example when I wrote my Tally tables blog!
A T-SQL iTVF to Calculate the Normal Distribution’s Cumulative Distribution Function (CDF)
At this time, it is just a little more work to get to where we want to be. This is an iTVF to simulate the NORM.DIST function in Excel. We’ll need to eliminate some of the unnecessary intermediate results from the previous script, and also combine a couple of extra bits to calculate the PDF and the CDF when X is at the mean, but you can read the comments in the function to see those additions.
We’re also going to make our function allow for a little additional selectivity by adding one argument that can adjust the number of intervals per standard deviation.
CREATE FUNCTION dbo.Excel_NORM_DIST -- Excel Normal Distribution - Returns either: -- Probability Density Function (PDF) or -- Cumulative Distribution Function (CDF) ( @X FLOAT -- Point at which function is to be evaluated ,@Mean FLOAT -- Mean of the Normal Distribution ,@StdDev FLOAT -- Standard Deviation of the Normal Distribution ,@CumDist TINYINT -- =0 for Probability Density, =1 for Cumulative Density ,@Intervals INT = NULL ) RETURNS TABLE WITH SCHEMABINDING RETURN WITH CalculateIntervals AS ( -- Total intervals (default is about 100 per standard deviation) SELECT Intervals = ISNULL(@Intervals, 100) * ABS(@Mean - @X) / @StdDev -- Number of intervals per standard deviation ,Interval = ISNULL(@Intervals, 100) ), Tally (n) AS ( -- Up to 10,000 row tally table SELECT TOP (SELECT CAST(Intervals AS INT) FROM CalculateIntervals) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n) ) -- PDF SELECT X, Mean, StdDev, [F(X)]=[NPDF(X)] FROM ( SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev ,[NPDF(X)] FROM dbo.NORMAL_PDF(@X, @Mean, @StdDev) ) a WHERE @CumDist = 0 UNION ALL -- CDF where X = mean SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev ,[CDF(X)] = 0.5 WHERE @CumDist = 1 AND @X = @Mean UNION ALL -- CDF where X mean SELECT X, Mean, StdDev, [CDF(X)] FROM ( SELECT X = @X ,Mean = @Mean ,StdDev = @StdDev -- SUM the rectangles ,[CDF(X)] = 0.5 + SUM( -- Add to or remove from CDF at mean (0.5) SIGN(@X - @Mean) * -- Width x Height = NPDF(AvgX) Width * d.[NPDF(X)] ) FROM CalculateIntervals a CROSS JOIN Tally b CROSS APPLY ( SELECT Pos1 = @Mean + (@StdDev/Interval) * (n - 1.) * SIGN(@X - @Mean) ,Pos2 = @Mean + (@StdDev/Interval) * (n + 0.) * SIGN(@X - @Mean) ,Width = ABS(@Mean - @X)/Intervals ) c -- -- Average height -- CROSS APPLY dbo.NORMAL_PDF(0.5 * (Pos1 + Pos2), @Mean, @StdDev) d ) a WHERE @CumDist = 1 AND @X @Mean; GO WITH SampleData (X, Mean, StdDev) AS ( SELECT -1,2,0.5 UNION ALL SELECT -0.5,2,0.5 UNION ALL SELECT 0,2,0.5 UNION ALL SELECT 0.5,2,0.5 UNION ALL SELECT 1,2,0.5 UNION ALL SELECT 1.5,2,0.5 UNION ALL SELECT 2,2,0.5 UNION ALL SELECT 2.5,2,0.5 UNION ALL SELECT 3,2,0.5 UNION ALL SELECT 3.5,2,0.5 UNION ALL SELECT 4,2,0.5 UNION ALL SELECT 2.33,10,5 UNION ALL SELECT 4.22,10,5 UNION ALL SELECT 5.1,10,5 UNION ALL SELECT 8.2,10,5 UNION ALL SELECT 10,10,5 UNION ALL SELECT 11.5,10,5 UNION ALL SELECT 12.6,10,5 UNION ALL SELECT 16.8,10,5 UNION ALL SELECT 22.5,10,5 ) SELECT a.X, a.Mean, a.StdDev ,[NPDF(X)] = c.[F(X)] ,[SNPDF(X)] = b.[SNPDF(X)] ,[CDF(X)] = d.[F(X)] FROM SampleData a -- Use NORMAL_PDF to return PDF for standard normal distribution CROSS APPLY dbo.NORMAL_PDF(a.X, a.Mean, a.StdDev) b -- Use Excel_NORM_DIST to return PDF for general normal distribution CROSS APPLY dbo.Excel_NORM_DIST(a.X, a.Mean, a.StdDev, 0, DEFAULT) c -- Use Excel_NORM_DIST to return CDF for a general normal distribution CROSS APPLY dbo.Excel_NORM_DIST(a.X, a.Mean, a.StdDev, 1 -- Increase the accuracy by specifying a larger number for fifth parameter ,CASE a.Mean WHEN 10 THEN 500 ELSE 100 END) d;
Note how we can override (using the fifth parameter to the function) the number of intervals, or just let it use the DEFAULT (which is 100 per standard deviation). You need to take some care with this because if you end up trying to calculate the CDF for an X that is a significant number of standard deviations away using a high interval count, you could exceed the number of rows generated by the in-line tally table. But you can always make that tally table generate more rows by adding additional CROSS JOINs. Just remember that as you evaluate more and more intervals, the performance of the iTVF will be degraded, so my advice is to use just what you need to achieve the accuracy you seek.
The final results from this query are:
X Mean StdDev NPDF(X) SNPDF(X) CDF(X) -1.00 2 0.5 1.21517656996466E-08 0.241970724519143 9.86435766403559E-10 -0.50 2 0.5 2.9734390294686E-06 0.3520653267643 2.86620600586929E-07 0.00 2 0.5 0.000267660451529771 0.398942280401433 3.16690114139928E-05 0.50 2 0.5 0.00886369682387602 0.3520653267643 0.00134984263449417 1.00 2 0.5 0.107981933026376 0.241970724519143 0.0227496820247703 1.50 2 0.5 0.483941449038287 0.129517595665892 0.158654245714223 2.00 2 0.5 0.797884560802865 0.0539909665131881 0.5 2.50 2 0.5 0.483941449038287 0.0175283004935685 0.841345754285777 3.00 2 0.5 0.107981933026376 0.00443184841193801 0.97725031797523 3.50 2 0.5 0.00886369682387602 0.00087268269504576 0.998650157365506 4.00 2 0.5 0.000267660451529771 0.000133830225764885 0.999968330988586 2.33 10 5.0 0.0246013380403521 0.0264264854972617 0.0625147949422269 4.22 10 5.0 0.0409033036384656 5.41832610895401E-05 0.123840539089739 5.10 10 5.0 0.0493618981134085 8.97243516238334E-07 0.163543019015466 8.20 10 5.0 0.0747821210746257 9.99837874849718E-16 0.359423544347365 10.00 10 5.0 0.0797884560802865 7.69459862670642E-23 0.5 11.50 10 5.0 0.0762775630921048 7.6416554115872E-30 0.61791144125835 12.60 10 5.0 0.0696985025517949 1.33848679925429E-35 0.697770895452422 16.80 10 5.0 0.0316449580740766 2.05718230302861E-62 0.913085073917205 22.50 10 5.0 0.00350566009871371 4.67863681725005E-111 0.99379034197768
The first thing that we notice about these results is that our CDF where X equals the mean is 0.5. That’s a good start, but for a more thorough check we’ll copy/paste these results into our Excel spreadsheet.
Notice how all of the values in column M are quite small (nearly zero), indicating a pretty small difference between Excel’s calculated CDF and the one calculated in T-SQL. Had we left the second distribution group to default to 100 intervals, the difference would have been slightly larger.
I’ll number my conclusions:
- I am a T-SQL geek, because whom but a T-SQL geek would have thought of doing something so ridiculously geeky and so enjoyed writing about it as much as me!
- It is possible to do integration (specifically to solve for a finite definite integral) in T-SQL by using a tally table to implement the fundamental theorem of calculus.
- We have provided two functions:
- The first, NORMAL_PDF, which calculates the value for the probability density function for both general and standard normal distributions.
- The second, EXCEL_NORM_DIST, which calculates the probability density and cumulative distribution functions for a general normal distribution, closely replicating the results of Excel’s NORM.DIST function.
Prior to writing this blog, I did a little searching to see if anyone has tried this before and I ran across this article by Eli Algranti: Part 1: T-SQL Implementation of NORMDIST / NORM.S.DIST
He is obviously a bit more of a mathematician than I am, because he came up with three different implementations that have different error characteristics for the CDF.
- Using polynomial approximation – certainly an interesting approach that uses an approximate “closed form” version of the antiderivative of the normal distribution. My only issue with it is that is uses a scalar-valued, user-defined function (UDF) to perform the calculation.
- Another polynomial approximation – similar to the first but with a different error result. Again, my issue is his implementation using the scalar-valued, UDF.
- Using a rational polynomial approximation – a bit more elaborate estimation approach that once again is implemented using a scalar-valued, UDF.
Certainly all of those could be converted to iTVFs that would run faster. It would be interesting to see how well my function’s error results compare to those.
Below is an Excel workbook file provided to you as a resource. In that you’ll find three worksheets:
- Final Results w-Check – showing the spreadsheet screen captures above (all columns), plus a column that contains a neat way to translate sample data from Excel to a series of T-SQL SELECT/UNION ALL/SELECT statements, like you see in my SampleDate CTEs above.
- SQL- Create-run NORMAL_PDF – a T-SQL script to create the named function (in italics) and run it with the sample data provided.
- SQL- Create-run EXCEL_NORM_DIST – a T-SQL script to create the named function (in italics) and run it with the sample data provided.
The normal distribution is one that is quite important in statistics and having tools to apply its distributions in T-SQL can be just as important. We hope you’ve found today’s blog useful and instructive!
Follow me on Twitter: @DwainCSQL
© Copyright Dwain Camps 14 May 2015. All rights reserved.