Um, We Have a Problem…
This last weekend was pretty rough for my entire team. One of our most critical production systems took a dive on Friday morning. Meaning, the database went down unexpectedly and wouldn’t come back up. When I got the call Friday night around 8 pm that we would be working in shifts and I was needed at work that night, I knew it was bad, very bad. This was the first time in three years (that I could remember) that I had to go into work after hours for a production issue. That’s actually pretty good, in my opinion, considering I know other DBAs end up doing quite a bit of after hours support for their systems. I don’t like to speak for others but it seemed pretty rough on all four of us. I don’t think anyone got much sleep the entire weekend; however, we managed to get through it and the system was back up and running by Monday afternoon. I really am lucky to be part of such a great team. My co-workers put in quite a bit of long hours starting on Wednesday which is just amazing to me. I wasn’t involved until Friday night and I was exhausted after only three nights. I can only imagine how they’re feeling.
To be honest, I’m not sure I should even be writing a blog post about this issue for various reasons. One reason being that my role was that of minimal support. This is an Oracle system which is new for us and I know very little about Oracle administration. So my main role was to be a second set of eyes for my manager who worked the night shift with me. I’m very thankful she was there with me. It also really helped that she has prior Oracle experience and has had some training which I’m so very thankful for. I really didn’t do much except to double-check what my manager was doing, answer phone calls from Oracle support, and type in whatever commands the support people asked me to. Hmm… That may explain the odd voices that told me to do strange things when my manager stepped away. Yes, I did my best to take note of what it is they were asking me to do which was mostly querying things… thankfully.
Secondly, we worked in shifts with me being on the night shift. Add to that my limited knowledge of Oracle, it was difficult for me to keep track of everything that was going on the entire time except for knowing we were having quite a few issues with the system. So I don’t have a lot of technical details that I’m sure some people would love to hear about. Sorry about that.
So why am I writing this? I thought it would be good to document what we went through, at least in general, in case anyone else experiences the same or similar issues. I also thought it would be a somewhat decent way to share what I learned. Granted, it’s not much but it’s something. Also, I’m not placing blame anywhere or pointing fingers. Every system experinces issues (at least I would think so) at some point. This is just one of those times.
Since I’m still pretty tired, hopefully what I write makes at least some sense. I have limited knowledge of Oracle and the every day workings of the system so please keep that in mind. Right now I’m mainly supporting SQL Server but am slowly learning more about Oracle. If I get something wrong, please let me know. This blog post is from my point of view so it’s possible I got something wrong somewhere. If I did, I apologize and will fix it as quickly as I can.
So What in Server Name Happened?
First, I’ll state that this occurred on an Exadata machine with Oracle RAC (Real Application Cluster). It’s been in production since December and we’re running Oracle 11gR2.
From what I understand, the whole issue seems to have started on Wednesday when users were reporting inconsistent query results. They would run a query and get back a certain number of results. They would run the exact same query again and get 0 records back. This would happen repeatedly. One of my co-workers who is great with and knows Oracle pretty well researched and worked on it for quite some time and contacted Oracle support about it. I believe the theory was that it had something to do with the optimizer.
At some point on Thursday, ASM (Automatic Storage Management) went down but then it came back up. It sounds like it had something to do with a flash disk error. An engineer was sent out, and I understand the issue was fixed. Note: ASM is basically a file storage system.
Then for some reason, the database terminated unexpectedly with an ORA-600 error Friday morning and would not open up afterwards. Note: I was told that ORA-600 errors are generic errors that don’t usually tell you much. Great, huh?
At some point, Oracle determined that a duplicate or bad record was inserted into a system table called props$. As of this moment, no one knows how it got there or when. Since we had no idea this table even existed, we were not auditing it. However, I believe we are auditing it now. Apparently, having this extra record caused the database to not open back up when it terminated unexpectedly on Friday. Note: I believe props$ is basically a database properties table. As my manager explained to me and if I understood her, it’s like having your master database in SQL Server become corrupted. However, getting it back up and running is more complicated in Oracle than it is in SQL Server.
So the plan was two-fold. One part was to find a good database backup that did not have that extra record in it so we can restore it to production, if necessary. The second part was to determine how this happened and to see if someone could open up the production database without having to resort to restoring the backup. Note: we were doing full backups nightly.
In addition to all of this, the Linux box containing our backups wouldn’t mount for some reason. So we had to copy a database backup file to a Windows media server which took about 2 hours. At least that worked and we could see the backup files from the Exadata machine.
Anyway, the database from the first restore attempt would not open. So they tried another one. To keep a very long story at least somewhat short, they were successful in restoring a backup to our test Exadata machine and recovering data from it and the archived logs in addition to recovering data from the online redo log (kind of like transaction logs, as I understand it) of the corrupted database. Which means that we only lost 5 minutes worth of data. I think that is just plain awesome considering everything that happened over the weekend. And so far no one knows how this extra record ended up in that table. Hopefully it won’t happen again. I’m crossing my fingers, toes, and eyes.
Please note that I don’t mean to over simplify the process. It was a very long and manual process to restore and recover the database. Everyone worked very hard on getting this to work. It doesn’t seem like a very straightforward process to me, but that could just be me. Also when I refer to “they”, I’m referring to my team in conjunction with Oracle support. Everyone worked well together to get it figured out.
Overall, it sounds like we also have a few bugs and need to do some patching very soon. The support team we worked with seemed to be very professional and helpful. There were quite a few bumps along the way but we survived and the issue was fixed. That’s the important thing to remember.
Hey! I Learned Something!
On the bright side, I actually learned some useful stuff over the weekend! I now know:
- how to use PuTTY (the terminal client, not the oh-so-cool kids toy or paste-like substance… it’s probably a good thing I didn’t have any of the gooey kind in my reach this weekend)
- leaving sticky notes in someone else’s cube late at night is a great stress reliever and a great way to keep one’s sense of humor intact (note: not all of the sticky notes were written by me; some were written by my co-workers)
- how to start RMAN (recovery manager): rman target /
- that management appreciates sticky notes and saw the humor in it (whew!)
- what RMAN scripts look like
- that you can’t have leading spaces in RMAN scripts or bad things happen (mostly just errors)
- how to set the Oracle environment in Linux: . oraenv
- where the pfile (parameter file) is and how to edit it along with the init file (scary thought)
- how to look around the ASM file system: asmcmd (command line utility)
- that ASM contains an “M”, not two “S”s (gotta love typos)
- how to start SQLPlus to run SQL commands: sqlplus / as sqldba
- to be careful when Google’ing props$ (psst…don’t put a space before the $… seriously, nothing bad happens… just at attempt at wacky late night humor)
- that not only am I part of a fantastic team who put in tons of hours on this issue, but that we also have a great management staff who were very supporting and helpful during this time.
So that was my weekend. It was rough but we survived and learned some things in the process. Huh… I can’t believe I wrote this on my lunch hour. Usually it takes me longer than that to write a post.