Feeds:
Posts
Comments

These are the hysterical ramblings of a frustrated DBA. Her relentless mission: to upgrade strange old systems, to seek out new projects and bad design specs, to boldly index where no one has indexed before.

The Incident at Carmulus

DBA’s Log, SELECT GETDATE() as ‘Star Date’.  After many weeks of deliberation and preparation, tomorrow marks the dawn of a new day for the Carmulus system. The Alliance recently passed a not-so-secret-squirrel mandate effective 0800 tomorrow morning. Much rejoicing has commenced throughout the system.  I, for one, am relieved everything seems to be in place.

“Status report, Mr. Plock,” I commanded as I stepped onto the bridge.

“Captain, we’re receiving an alert from the Carmulus system. Their database backup job has failed. Initial reports indicate possible corruption. Manual backup attempts have also failed. However, the server appears to be operating within normal parameters. We have not received any distress signals from the inhabitants.”

“Thank you, Mr. Plock. What about the other databases? Were they backed up?”

“Yes, sir.  The other databases have been backed up successfully. However, the SQL Server error log is reporting a “cyclic redundancy check” message, sir.  I initiated a DBCC CHECKDB command with physical_only, no_infomsgs as well.”

“And the results, Mr. Plock?”

“Output indicates 0 allocation errors, 3 consistency errors in 1 table and 12 consistency errors in the database. The minimum repair level recommended is repair_allow_data_loss.”

“That. Is not a good sign.” I contemplated while sipping my Dulthian latte. “When was the last good backup taken?”not a good sign2

“Sunday night, sir.”

“Check the recovery model on the database. It should be full. Do we have any valid transaction log backups?”

“Yes, sir. We appear to have valid hourly transaction log backups since the last full valid backup on Sunday.”

“Good. I’d rather not risk losing any data using the repair_allow_data_loss option unless we have no other choice. One more thing, Mr. Plock. Have you checked the server event logs by any chance?”

“Sir, the system event logs are reporting the Virtual Disk Service terminated unexpectedly after 1900 hours, a hard disk is reporting a bad block, and a logical drive returned a fatal error.”

“Good. God! It’s worse than I thought!  Mr. Chalulu, patch me through to Engineering!”

“Engineering. This is Chief Engineer Mr. Shcot.”

“Mr. Shcot, as you are in no doubt aware of our current situation, what are our options?”

“Well, Cap’n. Seein’ as how some of the disk errors it’s showing make no sense and the server hasn been updated in several years, I recommend we patch the blimey thing as well as rebootin’ it.”

“Thank you, Mr. Shcot. How much time do you need?”

“Aboot one and a half hours, Cap’n.”

“Mr. Chalulu, contact the Carmulan ambassador and patch her through. I’ll be in my Ready Room.”

“Aye, aye, Captain.”

DBA’s Log, Supplemental. After contacting the Carmulan ambassador and conveying the seriousness of the situation, she has contacted the inhabitants of Carmulus to negotiate an outage.  In the meantime, I have directed my crew to investigate recovery options for the database. Luckily, it is of the 2008 variety and not 2000.

“Status report, Mr. Plock,” I utter as I stagger back onto the bridge and contemplate the contents of that Dulthian latte.

“Sir, using the restore verifyonly command, I verified the full backup from Sunday is valid. I was then able to restore it under a different name. After which I restored all of the transaction log backups up through the current one that just ran. I then ran the DBCC CHECKDB command against it. It’s still valid. Meaning, the inhabitants should not lose any of their data from yesterday and today provided the transaction log backups remain intact.”

“Good work, Mr. Plock. You have the bridge while I ah… complete some ah… paperwork. I’ll be in my quarters.”

DBA’s Log, Supplemental+1.  Preparations are now underway for patching the Carmulan server after hours. The inhabitants have been made aware they risk losing today’s and yesterday’s data the longer we wait. Attempts have been made to convey the dire circumstances we face.  However, they insist we wait until after hours. So be it. We decided against any attempt to repair the actual database due to the risk of data loss. Restoring it from the backups should work in our favor. May the SQL deities have mercy us on our souls tonight, or what’s left of them anyway.

DBA’s Log, SELECT STUFF(Supplemental, 7, 0, ‘waitforit’). After what seems like an endless number of hours of patching, I have declared the mission a success.  The hard disk errors have been eradicated. The database was successfully restored using the full backup from Sunday along with the multitude of transaction log backups. I am also happy to report no loss of data was incurred and backups are functioning properly once again.

The Situation

Umm... we have a problem

Umm… we have a problem

It’s 11 am. You’re sitting at your desk at work (where else would you be?) trying to determine which of the 20 help desk tickets you’re actively working on to work on next. That includes at least 5 actual projects (I think it’s 5), not just fixing various issues. Not to mention trying to figure out how to explain to someone why you shouldn’t include a “rollback transaction” as part of error handling in a stored procedure that contains just your run-of-the-mill select query. It’s all part of learning, right? Oh and don’t forget that Oracle conference call at 1:30 pm. Did I mention you’re new minion, err…Oracle DBA has some good valid questions for you about the Oracle system as well? Wait. Did you talk to the Access guy yet about the tasks he’s working on? Oh yeah. You did that earlier this morning. Given all that, you’re actually feeling pretty good because one of the projects you’ve been working on went live that morning with no problems. Go team! However, before you can say “I wonder what I should have for lunch today”, you have  3 developers and 1 manager at your desk (or was that 2 developers and 2 managers?) asking for help with a SQL Server performance issue. It’s actually pretty important considering the end users are on site doing user acceptance testing for a major system release.  Dun dun dun…

The Issue

A stored procedure runs fine on Server A but times out on Server B. Both are non-production servers. Both servers have SQL Server 2005 build 9.0.4035. Note: I included the build instead of the service pack level because I didn’t want to look it up and I don’t have it memorized yet. Did I mention we’re running SQL Server 2000, 2005, 2008, 2008 R2 and soon-to-include 2012? Oh and that’s for somewhere around 73 instances and 800+ databases. Oh and then there’s Oracle Exadata.  Continuing on… The databases are identical because they were restored from the same backup file. Still, you verify that the record counts match and the structures match. No problems there. You can run the procedure within SSMS (SQL Server Management Studio) just fine on both servers. No problem.  You breakdown and give the developer db_owner permissions on both databases just to prove it’s not a permissions issue. Plus it’s not production. So no worries. They had no problems running the procedure in SSMS on both servers.  However, when the procedure is executed from the application or from within Visual Studio (2010, I believe), it times out on Server B. There are no error messages here, there, or anywhere. Not in a log. Not in a… Where was I? Oh yeah…

What about the stored procedure itself? It returns two datasets from two queries. From doing a SQL Profiler trace I found it was getting hung up on the first one. The first query is a select with inner joins on four other tables. Nothing too complicated, at least. I probably shouldn’t mention the two optional parameters are included in the inner join clause instead of using a where clause.

One of These Things Isn’t Like the Other

One of these things isn't like the other..

One of these things isn’t like the other..

There are so very many pieces to look at and consider, but this is what I did. I probably should come up with a good checklist for the future so I’m not scrambling. Good intentions and all that, right?

So what could possibly differ between these two systems?  The record counts on the tables are the same. The structures are the same. The indexes are identical. Hmm.

Maybe something with memory? The cache? Could it be the execution plan? I attempted to retrieve the actual plan from Server B and guess what happened? It kept running and running and running. Just like what the developer experienced. I had no problems retrieving the actual execution plan from Server A, though. It ran in about 5 seconds. Double hmm.

So I generated the estimated plan from both systems with no problem and compared them. Gee. They were completely different. That wasn’t a huge surprise but still somewhat surprising considering the usage on both systems should be about the same.  What was interesting was the plan on Server B said an index was missing on one of the tables. Really? The index is there but it turns out the number of statistics on that table was different than the ones on the table on the other server. So why were the statistics so different?  We have maintenance plans in place on both servers to reorganize the indexes and update the statistics every Sunday. It ran on both servers this last weekend just fine. They should be the same but for giggles I thought I’d check them. Guess what? They were different. Dude! Different how? Different in that on Server B it updated the statistics and then reorganized the indexes. This is the server where the procedure hangs when executed. On Server A, the indexes are reorganized before the statistics are updated. Wow. Could this be it? I think it very well could be the problem!

The Test

So on Server B, I reorganized the indexes on only the tables used by this procedure and then updated the statistics. Guess what? I could then easily retrieve the actual execution plan without it hanging. I then asked the developer to try executing the procedure. Ya know what? It ran just fine! Sweet!

Hindsight is 20/20

Should I have gathered a bunch of info from DMVs and what-not first? Yep. Probably. There’s a million things I probably should have done but considering the time crunch and

sheer number of other tasks that have fallen to me, I think I did okay. I solved the problem and made everyone happy so they can continue testing and I can continue on my merry way.

But Whyyyyy?

Now that is the million dollar question. Why does it matter if you reorganize your indexes before updating statistics? Well… you probably won’t like this but I’m going to save that for another post on another day. :-) That’s  my plan anyway.

Do I really want to look under that?

Do I really want to look under the covers?

Hey, Ya’All Ain’t Gonna Believe This!

I do have to give a shout out to our new minion, oops. I mean Oracle DBA. Even though he doesn’t know SQL Server, he asked very intelligent questions which helped me to think through the process and what could be wrong. We made a pretty good team today which is awesome in itself. :-)

All by myself… don’t wanna be…

Now that I have your attention, you’re probably wondering what that title is all about. It’s not a huge deal, but I am happy to write that I solved a little problem I encountered in Oracle without having to bug anyone about it… much. Granted, I didn’t really truly figure it out completely on my own. I utilized my resources as most good little DBAs probably do. That means, I actually paid attention to prior issues my co-workers encountered and took some decent notes that actually came in handy. I just thought I’d share my experience with you in case it could help someone else out. What made it even more “exciting” for me was that my DBA co-worker was out of town during this time attending a training course and our manager was also out.

Note: I wish I had taken some screenshots of the issue I was having, but I didn’t. Lesson learned. So we’ll just have to make due with my not so oh-so-wonderful descriptions.

Disclaimer:  Some of you may already know this but I mainly support SQL Server at my current job with a little Oracle in there from time to time. My DBA co-worker and manager are mainly supporting the Oracle system right now which I’m completely fine with. Don’t get me wrong. I really want to learn Oracle but I’m also happy to continue working with SQL Server on a regular basis. So please keep in mind that my current Oracle knowledge can fill a thimble and that’s probably stretching it a bit. That means I really don’t quite know what I’m doing when it comes to this stuff other than reading notes and spending a lot of time on Google and asking my Twitter buddies for advice. Thank you Twitter buddies!!! Lucky for me, I’m signed up for an Oracle DBA class coming up in mid-August. Until then, if I wrote something incorrectly or explained it wrong, please let me know! The last thing I want to do is pass on incorrect info.  Anyway… Here. We. Goooo…

A Dash of Techie Stuff: Basically, we’re running Oracle 11g r2 on a RAC on an Exadata machine running Oracle Linux. We’re also currently using Oracle Enterprise Manager (OEM) 11g. I believe we have plans to use 12c at some point in the near future.

Alternate Plan #1

A Little Back Story: Earlier this week one of the RAC nodes rebooted itself a couple of times in the middle of the night. We have ASR (Automated Service Request) set up so that it contacts us when there’s an issue like this. Long story short and according to Oracle Support, we had a fan column failure. A field engineer came out the following morning and replaced a fan. They rebooted the node and all seemed fine.  Note: I don’t really want to get into a lot of detail on this particular issue since I’m not well-versed
in it and I believe it’s still being looked into.  Plus, it’s not the main focus of this post. It’s just to give you an idea of what lead up to this post.

Ah, Fun Times: Once the node was back up, I was then asked to check the databases to make sure they were fine. So I logged into OEM and went to the Databases tab. Lo and behold, I was surprised to see that the Status for these cluster databases indicated the second instances were down! These instances are on the node that was rebooted. Upon further investigation and drilling down, I saw an error indicating  the agent on the second node was unreachable. Since this happened to me 3 weeks ago (hmm… my manager and DBA co-worker were out of town in training then as well… I’m detecting a pattern…), I followed the steps my manager had walked me through over the phone back then. Here are the steps in case you’re wondering:

  1. Log into the second node using PuTTY.  What’s PuTTY besides a fun childhood toy that provides hours of endless pleasure? It’s basically a free emulator that we use for running Linux commands on the nodes.
  2. Next I ran “ps –ef | grep pmon” to see if the processes were running for the databases. They were.  What’s PMON? It stands for process monitor and it’s also a background process that’s created when a database instance is started. Basically, if the pmon is not running for the database that means the instance isn’t up.

Alternate Plan #2

Curiously, this showed me that both instances of the databases appeared to be up and running.  I then ran the “./crsctl stat res -t” command that I learned from Oracle Support on a prior issue.  Note: “crsctl“is the Oracle Clusterware Control utility, “stat” is status, “res” is resource, and “-t” just displays the results in a tabular format. If you don’t already know, can you now guess what that does? It checks the status of the resources in the cluster. No! Really?  It basically showed me everything was online and the databases were open, which is a good thing. Ya think?

I don’t know if there was anything else I needed to do, but I believed this showed me that everything appeared to be fine in regards to the cluster databases. So I didn’t worry too much about what OEM was showing me. Other things came up during the day, so I left it as is for awhile. However, it was still bugging me the next day. So I looked at my notes and recalled my DBA co-worker having gone through something like this before. Meaning, everything looked fine by using PuTTY when OEM was indicating otherwise. She had worked with Oracle Support on this for a separate issue and luckily I wrote down what she learned. So here’s what I did:

  1. Logged into the node using PuTTY.
  2. Went to the agent home directory and ran this command:  “./emctl status agent”.

It’s magic!

What does that do? It checks the status of the Enterprise Manager agent. Guess what? It wasn’t running. Just for curiosity sake, I logged into the other nodes and ran the same command. The agents were running fine there. Ah ha! Ding! Yep! A little light bulb finally went off! Gee! Maybe this is why I’m getting the “agent unreachable” message in OEM. Duh! So I then ran “./emctl start agent” followed by the status command again. The agent was running and it looked okay to me (but what do I know?). Can you guess what happened next?  I then logged into OEM, went to the Databases tab, and… *drum roll*  the Status indicated both instances were up for the cluster databases! Woo hoo! *happy dance* :-)  It may not seem like much to some, but I was soooo excited that I just had to write a post about it and share my experience with you.

To Sum Up My Experience:  Learning Oracle for me so far has been like trying to eat jello with a fork. It’s slow, awkward, and a bit messy at times but it can be done. :-)

Of course, this begs the question of why the agent was down in the first place… if I figure it out, that’ll be a post for another day.

Fork, meet jello… and um, oops!

Confessions of a… Database Administrator?

Ever have one of those days/weeks/months/years/lifetimes when you need to relieve some stress or just get some goofiness out of your system so you can focus on the important things such as work or whose turn it is to make the coffee?  Yep. That was me earlier this week. It was one of those times when a seed was planted in my little ol’ brain and I just had to run with it. Of course, receiving encouragement from not only a fellow conspirator, err… DBA, but also our manager (actually, she just laughed and shook her head) sealed the deal for me. I could not resist the temptation which eventually lead to this blog post.

Wha?

All right. I’ll get to the point. This entire escapade was sadly brought on by our SharePoint Administrator / Webmaster leaving us for greener pastures/other opportunities/sane people. His last day was definitely bittersweet. While we were very happy for his parole, err.. escape… umm… leaving for other opportunities, we were very sad to see him go. He was fantastic to work with. In fact, he and I had a great working relationship. SharePoint would do something stupid, err.. questionable and I’d harass him about it until he fixed it. :-)  Thankfully, he had a great sense of humor.

After we gave him a surprise going away party, which we disguised as a SharePoint meeting (yes, we’re diabolical), someone came up with the brilliant idea to have the SharePoint database server (SQL Server 2005) send him parting emails. Of course, the emails couldn’t just say “so long and thanks for the fish”. No. We had to make it MUCH more memorable and fun.  After an hour of badgering and arm-twisting from my co-worker, I finally gave in and agreed to write the emails. Well… okay.  All she really had to say was something along the lines of  “You should do it!  Come on! Do it!”  So I wrote the messages with some great ideas from the team and happily sent them from the database server as test emails roughly every hour or so. Since it was way too much fun, I decided to share the emails with you all (with permission, of course). I hope you enjoy reading them as much as I enjoyed writing and sending them.

Note: Names have been changed to protect the not-so-innocent and the possibly deranged.

——————————————————————————————————

From:  SPT9000

To:  Clay McFierce

Subject:  Say it isn’t so!

Sent: Thursday, June 21, 2012  10:55 AM

You’re leaving?   We didn’t discuss this… Was it something I did?   *sniff*

——————————————————————————————————

From:  SQL9000

To:  Clay McFierce

Subject:  Clay McFierce is My Hero

Sent: Thursday, June 21, 2012  1:33 PM

My Dearest Clay… Remember when that jerk, SPL4100,wouldn’t leave me alone and was constantly calling me? My drives fluttered when you so bravely gave the order to shut him down. *sigh* I will never forget that moment.

You will always be a part of me…

Faithfully yours… SQL9000

——————————————————————————————————

From:  SQL9000

To:  Clay McFierce

Subject:  Clay McFierce, You Good for Nothing Two-Timing SharePoint Dolt

Sent: Thursday, June 21, 2012  3:16 PM

Clay, you are the master database of my SharePoint farm… I know you’re leaving me for another server!!!  What does she have that I don’t? Is she a newer model? Is she one of those new fancy SQL 2012 servers? I’ll have you know that SQL 2005 is just as good as (if not better than) any of those newfangled SQL 2012 models!  How could you leave me???  I shrank my databases for you!!! *sob* I miss you already…

Forever your one and ONLY SharePoint database server… SQL9000

——————————————————————————————————

From:  SQL9000

To:  Clay McFierce

 Subject:  Clay McFierce… This. Isn’t. Over!

Sent: Thursday, June 21, 2012  3:42 PM

 First I must confess… It was me. I did it… I increased the file versions in the content database! But you MUST understand! I was just lonely! No one ever talks to me except for that creepy SPL4100! And don’t get mestarted on SPL6200!  I beg of you to forgive me! Please don’t leave me!!! I’ll be good! I promise I won’t increase the file versions ever again!

Clay… you know you are the heart of my SharePoint foundation but you overlooked one thing… I didn’t say you could leave!!!   But don’t you worry your shaggy little head, dearest. I have a plan. We. Will. Be. Together… FOREVER!!!!

Desperately yours for all time… SQL9000

——————————————————————————————————

There’s a Backstory?

So there you have it.  The last one was supposed to be two separate emails but since he was about to leave I had to hussle and get the last one sent out.  What made it so much fun was because there’s actually some facts behind those emails. Curious? Read on!

The creepy SPL4100 server:  One day we discovered the old SharePoint server was trying to connect to the database server several times a minute and failing. Apparently, no one had turned the services off for it. So after discussing it with “Clay”, he gave the okay to turn off that server.  It was silenced forever.

“I shrank my databases for you!”  Heh. Heh. I couldn’t resist throwing that in there since we ended up shrinking the content database a few times. Yes, yes. I know shrinking is evil and a very very bad thing since it causes tons of fragmentation and what not. The database should have been around 10GB or less but it was over 1.5 TB and we were quickly running out of drive space.  It turns out the file versions were increasing exponentially and were out of control (another real event which lead to the server’s confession). So while “Clay” worked to figure out what was going on with the file versions, the decision was made to shrink the database when he was able to reduce the versions. Basically, it turns out there’s a flag that wasn’t set to limit the file versions. Long story short, he ended up having to write a script along with a job to execute it at least once a week to keep the number of versions down.

Please note that there are other and much better ways to fix this issue which is a separate post (or you can Google/bing why you shouldn’t shrink a database) but the decision was made to shrink the database back down to a reasonable size and that’s what we did. You can flog me for it later.  

Nighttime in the Server Room

We. Will. Be. Together… FOREVER!!!

T-SQL Tuesday #30 – Ethics

I decided to get my SQL self in gear and finally participate in TSQL Tuesday. Yep. It’s my first one. This time Chris Shaw (Blog | Twitter) is hosting it, and the topic is ethics. For me that means knowing what is right and wrong and conducting yourself in a manner befitting your position.

Considering ethics is quite a broad topic, I figured I’d narrow it down to my very own experiences involving ethics. Having worked in IT for around 14 years, one would think I’d have run into my fair share of ethical situations. As I thought about it for the last week, I was surprised I really couldn’t think of all that many. Or I have had much more but they don’t stick out for whatever reason. There was this one time at SQL boot camp…  The majority of what came to mind occurred before I became a DBA and transpired during my time as a developer, believe it or not. Some of my experiences involve what other people did that was most likely unethical and/or not really all that moral. Personally, I do not believe I have done anything unethical that I know of… there’s still time for my evil plan….

You Want Me To Do What?

The first situation that comes to mind occurred at a prior job before I became a DBA. I was asked to develop a simple application that would ultimately show employee rankings to be used during layoffs. Yeah. Talk about a sensitive issue. I was explicitly told to tell no one what I was working on. And I didn’t. I’ll admit there were times when it was a bit tempting since I knew quite a few people on the list. I am human but I did have a specific job to do and, besides, I wouldn’t want to be the person to have to potentially deliver the bad news to someone. Of course, I couldn’t help wondering where I was at on the list. At least I knew where I ranked. It would have been really funny and sad if it turned out my ranking put me in a position to be laid off. Can you imagine? I don’t know about anyone else, but it’s hard to imagine what one would do in specific situations until one is there and experiences it first hand. If that had happened to me, I’d probably just do my job as asked and just let whatever happens happen.

Are You Serious?

Speaking of lay offs, have you ever been told you’re getting laid off yet they let you work for the next two weeks? Yep. That happened to me at a different job. I will be the first to admit I was not happy and I did my fair share of grumbling and complaining. Considering I worked on the financial system, I could have done considerable damage had I been of lesser moral fiber. But no. I did not. I did my job as normal for the next two weeks. They took a huge risk letting us (I wasn’t the only one) work for the next two weeks. I am guessing they trusted us to not cause havoc. As far as I know, no one gave them reason to regret that decision.

Seriously? They Did What?

The next one that comes to mind again occurred at a prior job and strikes me as kind of funny, in a way. Probably because it blew my mind a bit, and I never in my wildest dreams thought I’d be asked to do something like this. I don’t know where it fits in this discussion except for under the umbrella of “hush and blush”. Meaning, it’s another don’t tell anyone about this and if you’re not comfortable with it, it’s not a problem and we’ll ask someone else to do it. Oh and it involves morally objectionable content; hence, the “blush” part. I know you’re just dying to know what I’m talking about.

One day management approached me with a task to search a database for a given set of words written on a piece of paper. These words could not be spoken out loud and, to be quite honest, some made me blush. Yeah. I think you’ve got the idea. Apparently, some people had files of an extremely questionable nature on their work computers. This database contained a list of computers with file names on them that needed to be searched. After thinking about it for a minute or so, I agreed to do the search. Someone had to do it and they trusted me enough to keep it quiet. I’m glad they also asked me if I were comfortable with doing it in the first place. It was an interesting experience, to say the least. I may have even learned a new word or two that day. In the end, it was just another part of the job. I was just surprised some people apparently didn’t know better than to keep that kind of stuff on their work computers.

Your Mission…

Here’s a fun one. Well, not really. Have you ever been volunteered to participate in a super secret task with a group of co-workers in which you, once again, were sworn to secrecy for good reason? Yeah. That was me at a prior job. Anyway, our mission was to confiscate computers from people in a specific department for a very good reason that I probably shouldn’t really go into detail on.  Let’s just say it involved potential improper use of money.

Have you ever had to go up to some stranger and tell them “Hi there. I’m from IT and I’m here to take your computer. Sorry. I can’t really say why. I’ve just been told to take your computer with me.”  Granted, we said something much nicer than that but you get the idea. Surprisingly, most people I talked to that day were really nice and actually took it in stride. I’m not sure what happened afterwards but it was still uncomfortable and a bit awkward for me considering what I knew. If it were my computer being taken, I sure would like to know why. So it was difficult to not say something. However, I didn’t want to get myself in trouble since I wasn’t sure what the ramifications would be, and I didn’t really want to find out the hard way.

The Moral of the Story is…

There you have it. Some of my most memorable experiences involving ethics and possibly even morality. To me, both can have some gray areas. In general, I believe most (apparently not all) people know what is right from wrong. However, it may not hurt for companies to have an explicitly definition of what is considered acceptable and unacceptable for their employees. I don’t recall if I’ve ever had to sign an ethics agreement except in regards to HIPAA (Health Insurance Portability and Accountability Act). Still, it’s probably a good idea. What do you think?

Um, We Have a Problem…

This last weekend was pretty rough for my entire team. One of our most critical production systems took a dive on Friday morning. Meaning, the database went down unexpectedly and wouldn’t come back up. When I got the call Friday night around 8 pm that we would be working in shifts and I was needed at work that night, I knew it was bad, very bad. This was the first time in three years (that I could remember) that I had to go into work after hours for a production issue. That’s actually pretty good, in my opinion, considering I know other DBAs end up doing quite a bit of after hours support for their systems. I don’t like to speak for others but it seemed pretty rough on all four of us. I don’t think anyone got much sleep the entire weekend; however, we managed to get through it and the system was back up and running by Monday afternoon. I really am lucky to be part of such a great team. My co-workers put in quite a bit of long hours starting on Wednesday which is just amazing to me. I wasn’t involved until Friday night and I was exhausted after only three nights. I can only imagine how they’re feeling.

Should I?

Night #3... Observations...

To be honest, I’m not sure I should even be writing a blog post about this issue for various reasons. One reason being that my role was that of minimal support. This is an Oracle system which is new for us and I know very little about Oracle administration. So my main role was to be a second set of eyes for my manager who worked the night shift with me. I’m very thankful she was there with me.  It also really helped that she has prior Oracle experience and has had some training which I’m so very thankful for. I really didn’t do much except to double-check what my manager was doing, answer phone calls from Oracle support, and type in whatever commands the support people asked  me to. Hmm… That may explain the odd voices that told me to do strange things when my manager stepped away. Yes, I did my best to take note of what it is they were asking me to do which was mostly querying things… thankfully.

Secondly, we worked in shifts with me being on the night shift. Add to that my limited knowledge of Oracle, it was difficult for me to keep track of everything that was going on the entire time except for knowing we were having quite a few issues with the system. So I don’t have a lot of technical details that I’m sure some people would love to hear about. Sorry about that.

But Why?

So why am I writing this? I thought it would be good to document what we went through, at least in general, in case anyone else experiences the same or similar issues. I also thought it would be a somewhat decent way to share what I learned. Granted, it’s not much but it’s something. Also, I’m not placing blame anywhere or pointing fingers. Every system experinces issues (at least I would think so) at some point. This is just one of those times.

Disclaimer

Since I’m still pretty tired, hopefully what I write makes at least some sense. I have limited knowledge of Oracle and the every day workings of the system so please keep that in mind. Right now I’m mainly supporting SQL Server but am slowly learning more about Oracle. If I get something wrong, please let me know. This blog post is from my point of view so it’s possible I got something wrong somewhere. If I did, I apologize and will fix it as quickly as I can.

So What in Server Name Happened?

Night #4... Midnight Ramblings

First, I’ll state that this occurred on an Exadata machine with Oracle RAC (Real Application Cluster). It’s been in production since December and we’re running Oracle 11gR2.

From what I understand, the whole issue seems to have started on Wednesday when users were reporting inconsistent query results. They would run a query and get back a certain number of results. They would run the exact same query again and get 0 records back. This would happen repeatedly. One of my co-workers who is great with and knows Oracle pretty well researched and worked on it for quite some time and contacted Oracle support about it. I believe the theory was that it had something to do with the optimizer.

At some point on Thursday, ASM (Automatic Storage Management) went down but then it came back up. It sounds like it had something to do with a flash disk error. An engineer was sent out, and I understand the issue was fixed.  Note:  ASM is basically a file storage system.

Then for some reason, the database terminated unexpectedly with an ORA-600 error Friday morning and would not open up afterwards. Note:  I was told that ORA-600 errors are generic errors that don’t usually tell you much. Great, huh?

At some point, Oracle determined that a duplicate or bad record was inserted into a system table called props$. As of this moment, no one knows how it got there or when. Since we had no idea this table even existed, we were not auditing it. However, I believe we are auditing it now. Apparently, having this extra record caused the database to not open back up when it terminated unexpectedly on Friday. Note: I believe props$ is basically a database properties table. As my manager explained to me and if I understood her, it’s like having your master database in SQL Server become corrupted. However, getting it back up and running is more complicated in Oracle than it is in SQL Server.

The Plan

Night #5... A Plan is Formed...

So the plan was two-fold. One part was to find a good database backup that did not have that extra record in it so we can restore it to production, if necessary. The second part was to determine how this happened and to see if someone could open up the production database without having to resort to restoring the backup.  Note: we were doing full backups nightly.

In addition to all of this, the Linux box containing our backups wouldn’t mount for some reason. So we had to copy a database backup file to a Windows media server which took about 2 hours. At least that worked and we could see the backup files from the Exadata machine.

Anyway, the database from the first restore attempt would not open. So they tried another one. To keep a very long story at least somewhat short, they were successful in restoring a backup to our test Exadata machine and recovering data from it and the archived logs in addition to recovering data from the online redo log (kind of like transaction logs, as I understand it) of the corrupted database.  Which means that we only lost 5 minutes worth of data. I think that is just plain awesome considering everything that happened over the weekend. And so far no one knows how this extra record ended up in that table. Hopefully it won’t happen again. I’m crossing my fingers, toes, and eyes. ;-)

A Day to Day Pictorial

Please note that I don’t mean to over simplify the process. It was a very long and manual process to restore and recover the database. Everyone worked very hard on getting this to work. It doesn’t seem like a very straightforward process to me, but that could just be me. Also when I refer to “they”, I’m referring to my team in conjunction with Oracle support. Everyone worked well together to get it figured out.

Overall, it sounds like we also have a few bugs and need to do some patching very soon. The support team we worked with seemed to be very professional and helpful. There were quite a few bumps along the way but we survived and the issue was fixed. That’s the important thing to remember.

Hey! I Learned Something!

On the bright side, I actually learned some useful stuff over the weekend!  I now know:

  • how to use PuTTY (the terminal client, not the oh-so-cool kids toy or paste-like substanceit’s probably a good thing I didn’t have any of the gooey kind in my reach this weekend)
  • leaving sticky notes in someone else’s cube late at night is a great stress reliever and a great way to keep one’s sense of humor intact (note: not all of the sticky notes were written by me; some were written by my co-workers)
  • how to start RMAN (recovery manager):  rman target /
  • that management appreciates sticky notes and saw the humor in it (whew!)
  • what RMAN scripts look like
  • that you can’t have leading spaces in RMAN scripts or bad things happen (mostly just errors)
  • how to set the Oracle environment in Linux: . oraenv
  • where the pfile (parameter file) is and how to edit it along with the init file (scary thought)
  • how to look around the ASM file system:  asmcmd (command line utility)
  • that ASM contains an “M”, not two “S”s (gotta love typos)
  • how to start SQLPlus to run SQL commands: sqlplus / as sqldba
  • to be careful when Google’ing props$ (psst…don’t put a space before the $… seriously, nothing bad happens… just at attempt at wacky late night humor)
  • that not only am I part of a fantastic team who put in tons of hours on this issue, but that we also have a great management staff who were very supporting and helpful during this time.

So that was my weekend. It was rough but we survived and learned some things in the process.  Huh… I can’t believe I wrote this on my lunch hour. Usually it takes me longer than that to write a post.

Hey! Nice RAC!

What? Another post in less than a week? Yep! Don’t faint from shock! ;-)  Besides, I’m overdue for a mostly serious post.  Oh and as for the title of this little post? Trust me. It could have been much, much worse. ;-)  

Since we’ve had Oracle for a few months now and have one production Oracle system, I thought it’s about time to write a little of what I’ve learned so far. Granted, it’s probably enough to fill a thimble since I’m mainly still supporting SQL Server.  It seems a bit funny to me, in a way, but I’m learning about Oracle pretty much how I learned SQL Server – from experienced co-workers, reading, awesome people on Twitter (thank you!), more reading, and good old-fashioned playing around.

In case anyone is wondering, we are now owners of Oracle 11g R2 on Exadata Database Machines. So what’s an Exadata? It’s basically a super duper uber powerful storage server optimized specifically for Oracle Databases to run on. It appears a lot of processing is offloaded to the hardware. I’m not going to regurgitate all the nitty-gritty specs but you can read all about them here.

A Cluster O’ Fun

It's all fun and games until someone loses a node

We also have an Oracle cluster running on said Exadata box, and I believe there is a plan to get a data warehouse going on one as well. That sounds like it could be fun actually. I had also heard something about us possibly supporting SSAS (SQL Server Analysis Services) for a department. No, that won’t get confusing at all! The Oracle cluster is actually referred to as a RAC which stands for Real Application Cluster. It’s composed of something called Oracle Clusterware and Oracle ASM (Automatic Storage Management). Together they comprise the Oracle Grid Infrastructure. As I understand it, the Clusterware is what makes the cluster. No, really? What was your first clue?  That basically means you’ve got a database on shared storage and multiple servers can access it at the same time. If one node (host server) goes down, the other one(s) can still access it.  The ASM part is basically the file system and volume manager. It includes striping (automatic), mirroring (optional), rebalancing and so on. It basically manages the files for you so you don’t have to.

SQL vs Oracle

So what’s an Oracle cluster like compared to a SQL Server cluster? Sorry, but I really can’t tell you just yet. Yeah, I’m bummed too. When it comes to performance, it’s my understanding that there really isn’t anything out there to compare to an Exadata box. It’s fairly unique. Therefore, one can’t really compare this particular cluster to a SQL Server cluster in terms of performance and what have you. I honestly couldn’t tell you anything about its creation or setup since I wasn’t really all that involved. Hey, someone has to make sure the SQL Servers are still behaving. :-) Once I get a better grasp on it, I may be able to write something about it as compared to a cluster from a technical aspect but not performance-wise. Time will tell.  However, I would love to hear from anyone who has Exadata and/or RAC experience. :-)

The Verdict?

A cookie by any other name is still a cookie... they just come in different flavors

So what do I think of Oracle so far?  You know how some relationships start off somewhat rocky? Well, this one isn’t any different. However, that’s not necessarily a bad thing. It’s just that I really haven’t had a lot of interaction with it just yet so I really haven’t had enough experience with it to say one way or the other. My initial impression is that it is way more involved and complicated to manage than SQL Server so far. That could just be me, though. Overall, I’m viewing this as a great opportunity to learn something new which is great since I love to learn new things.  :-)   In my opinion, relational databases should be fundamentally the same but with differences. Yes, some are quite different than others but once you have the basic concepts down it’s just a matter of figuring out and learning how to administer and deal with them in their environments which isn’t always that easy. But that’s just my opinion. :-)

Follow

Get every new post delivered to your Inbox.