No Seatbelt - Expecting Order without ORDER BY
One of the greatest lessons I've learned in building software is that every technical implementation has a human element to how it is used, perceived, and interpreted. While I've seen my share of poor technical solutions to problems, sometimes the most interesting "failure" point is the human operator and how he or she uses/perceives/interprets a software feature. Some of them are easy to understand (trying to sell a Chevy Nova in Mexico, "All your base are belong to us" from the poor translation in Zero Wing, etc). Others are harder for me to grok.
Occasionally I get asked about ordering guarantees in SQL Server. In some cases, SQL Server had to break previously undocumented but somewhat consistent behavior (TOP 100 Percent... ORDER BY, as I blogged a few years ago here). In other cases, I can find it difficult to reason through the logic ;). Today's question has to do with not specifying an ORDER BY but still getting rows back in sorted order. In their zeal to keep me employed with interesting corner cases in the SQL language, some have concluded:
"I don't have to specify the ORDER BY because I *know* that it will return rows in sorted order. It always does! every time! I promise!"
So, we're talking about the top-level ORDER BY clause in a query that defines the "presentation order" in ANSI SQL. The common reason for this conclusion is that, during development, the database developer found that a query plan always came back with the same plan. So, if I get out the "Jump to Conclusions" game and start playing, I'll decide that everyone will get the same plan from now until the end of the universe. The primary reason for this is that the query probably scanned an index that happens to be in the order that you desired:
Will that query plan change? Well, if keep the same machine configuration (CPUs, Memory, etc) and don't ever insert data into the table, perhaps you can get lucky and keep the same plan... perhaps. Of course, when you upgrade to the next release of SQL Server, perhaps the server has changed, new optimizations have been added, or someone adds a column to the clustered index that makes it slower than before. All of these could change the plan, even if you didn't change a thing.
Amazingly, if I happen to just add an "ORDER BY col1" into this query, I'll note that the query plan stays the same. So, no runtime cost to pay to get that guarantee - perhaps just a few more characters to type.
So the "human" part of these kinds of problems arise when I get questions like "Where does it SAY that the system will not return the rows in order?" It's difficult to undo the conclusion in your head when many experiments have shown, for a certain set of conditions, that the query will return rows in sorted order in "all" cases. So, I'll try to help resolve a few of these discussions today by giving an example when SQL Server will indeed stop returning rows in sorted order.
I'm running this on a Developer Edition build of SQL Server 2008 on my personal machine, a dual-core machine with 8 GB of RAM, FWIW. For this test, you'll need a machine that has more than one CPU available for SQL Server to use.
CREATE TABLE [dbo].[orderingtest1]( [col1] [int] NULL, [col2] [int] NULL, [col3] [int] NULL, [col4] [int] NULL, [col5] [binary](4000) NULL ) ON [PRIMARY]; create clustered index i1 on orderingtest1(col1) DECLARE @i INT=0; set nocount on begin transaction WHILE @i<20000 BEGIN INSERT INTO [orderingtest1](col1, col2, col3, col4) VALUES (rand()*1000, rand()*1000, rand()*1000, rand()*1000); SET @i+=1; END; commit transaction
I've created a table with 20,000 rows in it. Perhaps this is your inventory table for the database backing your web site.
So let's go run an query that yields a plan that happens to give me results in a specific order:
select * from orderingtest1 where col2 = 5
Hey, that's ordered on col1! Let's add some more rows just to be sure.
DECLARE @i INT=0; set nocount on begin transaction WHILE @i<20000 BEGIN INSERT INTO [orderingtest1](col1, col2, col3, col4) VALUES (rand()*1000, rand()*1000, rand()*1000, rand()*1000); SET @i+=1; END; commit transaction
Another 20,000 rows, and sure enough my query is fine.. There's only the one index anyways, so what could happen? Let's ship this guy.
Then we get a new set of inventory and add yet another 20,000 rows into our table and run our query again:
select * from orderingtest1 where col2 = 5
Whoa! we just got a fancy parallel query plan. It uses multiple threads to scan the index and apply the filter condition, only sending up rows to the main thread if they qualify. This can be faster on larger data sets.
(You may need to run DBCC FREEPROCCACHE if you aren't getting this plan - since the data is random, you may also need to try another value for col2. Assuming you have enough CPUs and are running a SKU that can use them in SQL, you should get this plan eventually, just keep adding rows ;)
Oh my - it's not ordered!
The hard part here is that there is no reasonable way for any external user to know when a plan will change . The space of all plans is huge and hurts your head to ponder. SQL Server's optimizer will change plans, even for simple queries, if enough of the parameters change. You may get lucky and not have a plan change, or you can just not think about this problem and add an ORDER BY. Interestingly enough, once you have enough rows *and* add an ORDER BY, SQL Server still generates a plan that looks like this:
However, if you run the query, you'll notice that the rows DO come back in order... The Exchange operator can run in a mode where it preserves order, but it only does so if you ask it through an ORDER BY clause:
So, adding an ORDER BY gives things back in order. (I have visions of the two guys in the pub mentioning this and one guy saying "A way to order the rows that come back from a query? Brilliant!")
The example I've done is fairly simple - one table with a clustered index using a simple query with a predicate on only one column. Even if there were only 2 possible plans considered for this query, one of them doesn't guarantee order unless you ask for it.
There are lots of situations where plans can change in the Optimizer - for more complex queries, there can be thousands of plan choices or more, and each of them has a case when it would likely be picked. For each of those plans, the sort for that plan can be different if you don't specify it. So, I hope this gives you something to go try - I find that examples help me learn, so perhaps it can help you as well.
So, my advice for the day:
If you need order in your query results, put in an ORDER BY. It's that simple. Anything else is like riding in a car without a seatbelt.
Architect, SQL Server Core Engine