It’s 11 am. You’re sitting at your desk at work (where else would you be?) trying to determine which of the 20 help desk tickets you’re actively working on to work on next. That includes at least 5 actual projects (I think it’s 5), not just fixing various issues. Not to mention trying to figure out how to explain to someone why you shouldn’t include a “rollback transaction” as part of error handling in a stored procedure that contains just your run-of-the-mill select query. It’s all part of learning, right? Oh and don’t forget that Oracle conference call at 1:30 pm. Did I mention you’re new minion, err…Oracle DBA has some good valid questions for you about the Oracle system as well? Wait. Did you talk to the Access guy yet about the tasks he’s working on? Oh yeah. You did that earlier this morning. Given all that, you’re actually feeling pretty good because one of the projects you’ve been working on went live that morning with no problems. Go team! However, before you can say “I wonder what I should have for lunch today”, you have 3 developers and 1 manager at your desk (or was that 2 developers and 2 managers?) asking for help with a SQL Server performance issue. It’s actually pretty important considering the end users are on site doing user acceptance testing for a major system release. Dun dun dun…
A stored procedure runs fine on Server A but times out on Server B. Both are non-production servers. Both servers have SQL Server 2005 build 9.0.4035. Note: I included the build instead of the service pack level because I didn’t want to look it up and I don’t have it memorized yet. Did I mention we’re running SQL Server 2000, 2005, 2008, 2008 R2 and soon-to-include 2012? Oh and that’s for somewhere around 73 instances and 800+ databases. Oh and then there’s Oracle Exadata. Continuing on… The databases are identical because they were restored from the same backup file. Still, you verify that the record counts match and the structures match. No problems there. You can run the procedure within SSMS (SQL Server Management Studio) just fine on both servers. No problem. You breakdown and give the developer db_owner permissions on both databases just to prove it’s not a permissions issue. Plus it’s not production. So no worries. They had no problems running the procedure in SSMS on both servers. However, when the procedure is executed from the application or from within Visual Studio (2010, I believe), it times out on Server B. There are no error messages here, there, or anywhere. Not in a log. Not in a… Where was I? Oh yeah…
What about the stored procedure itself? It returns two datasets from two queries. From doing a SQL Profiler trace I found it was getting hung up on the first one. The first query is a select with inner joins on four other tables. Nothing too complicated, at least. I probably shouldn’t mention the two optional parameters are included in the inner join clause instead of using a where clause.
One of These Things Isn’t Like the Other
There are so very many pieces to look at and consider, but this is what I did. I probably should come up with a good checklist for the future so I’m not scrambling. Good intentions and all that, right?
So what could possibly differ between these two systems? The record counts on the tables are the same. The structures are the same. The indexes are identical. Hmm.
Maybe something with memory? The cache? Could it be the execution plan? I attempted to retrieve the actual plan from Server B and guess what happened? It kept running and running and running. Just like what the developer experienced. I had no problems retrieving the actual execution plan from Server A, though. It ran in about 5 seconds. Double hmm.
So I generated the estimated plan from both systems with no problem and compared them. Gee. They were completely different. That wasn’t a huge surprise but still somewhat surprising considering the usage on both systems should be about the same. What was interesting was the plan on Server B said an index was missing on one of the tables. Really? The index is there but it turns out the number of statistics on that table was different than the ones on the table on the other server. So why were the statistics so different? We have maintenance plans in place on both servers to reorganize the indexes and update the statistics every Sunday. It ran on both servers this last weekend just fine. They should be the same but for giggles I thought I’d check them. Guess what? They were different. Dude! Different how? Different in that on Server B it updated the statistics and then reorganized the indexes. This is the server where the procedure hangs when executed. On Server A, the indexes are reorganized before the statistics are updated. Wow. Could this be it? I think it very well could be the problem!
So on Server B, I reorganized the indexes on only the tables used by this procedure and then updated the statistics. Guess what? I could then easily retrieve the actual execution plan without it hanging. I then asked the developer to try executing the procedure. Ya know what? It ran just fine! Sweet!
Hindsight is 20/20
Should I have gathered a bunch of info from DMVs and what-not first? Yep. Probably. There’s a million things I probably should have done but considering the time crunch and
sheer number of other tasks that have fallen to me, I think I did okay. I solved the problem and made everyone happy so they can continue testing and I can continue on my merry way.
Now that is the million dollar question. Why does it matter if you reorganize your indexes before updating statistics? Well… you probably won’t like this but I’m going to save that for another post on another day. :-) That’s my plan anyway.
Hey, Ya’All Ain’t Gonna Believe This!
I do have to give a shout out to our new minion, oops. I mean Oracle DBA. Even though he doesn’t know SQL Server, he asked very intelligent questions which helped me to think through the process and what could be wrong. We made a pretty good team today which is awesome in itself. :-)