Now that I have your attention, you’re probably wondering what that title is all about. It’s not a huge deal, but I am happy to write that I solved a little problem I encountered in Oracle without having to bug anyone about it… much. Granted, I didn’t really truly figure it out completely on my own. I utilized my resources as most good little DBAs probably do. That means, I actually paid attention to prior issues my co-workers encountered and took some decent notes that actually came in handy. I just thought I’d share my experience with you in case it could help someone else out. What made it even more “exciting” for me was that my DBA co-worker was out of town during this time attending a training course and our manager was also out.
Note: I wish I had taken some screenshots of the issue I was having, but I didn’t. Lesson learned. So we’ll just have to make due with my not so oh-so-wonderful descriptions.
Disclaimer: Some of you may already know this but I mainly support SQL Server at my current job with a little Oracle in there from time to time. My DBA co-worker and manager are mainly supporting the Oracle system right now which I’m completely fine with. Don’t get me wrong. I really want to learn Oracle but I’m also happy to continue working with SQL Server on a regular basis. So please keep in mind that my current Oracle knowledge can fill a thimble and that’s probably stretching it a bit. That means I really don’t quite know what I’m doing when it comes to this stuff other than reading notes and spending a lot of time on Google and asking my Twitter buddies for advice. Thank you Twitter buddies!!! Lucky for me, I’m signed up for an Oracle DBA class coming up in mid-August. Until then, if I wrote something incorrectly or explained it wrong, please let me know! The last thing I want to do is pass on incorrect info. Anyway… Here. We. Goooo…
A Dash of Techie Stuff: Basically, we’re running Oracle 11g r2 on a RAC on an Exadata machine running Oracle Linux. We’re also currently using Oracle Enterprise Manager (OEM) 11g. I believe we have plans to use 12c at some point in the near future.
A Little Back Story: Earlier this week one of the RAC nodes rebooted itself a couple of times in the middle of the night. We have ASR (Automated Service Request) set up so that it contacts us when there’s an issue like this. Long story short and according to Oracle Support, we had a fan column failure. A field engineer came out the following morning and replaced a fan. They rebooted the node and all seemed fine. Note: I don’t really want to get into a lot of detail on this particular issue since I’m not well-versed
in it and I believe it’s still being looked into. Plus, it’s not the main focus of this post. It’s just to give you an idea of what lead up to this post.
Ah, Fun Times: Once the node was back up, I was then asked to check the databases to make sure they were fine. So I logged into OEM and went to the Databases tab. Lo and behold, I was surprised to see that the Status for these cluster databases indicated the second instances were down! These instances are on the node that was rebooted. Upon further investigation and drilling down, I saw an error indicating the agent on the second node was unreachable. Since this happened to me 3 weeks ago (hmm… my manager and DBA co-worker were out of town in training then as well… I’m detecting a pattern…), I followed the steps my manager had walked me through over the phone back then. Here are the steps in case you’re wondering:
- Log into the second node using PuTTY. What’s PuTTY besides a fun
childhoodtoy that provides hours of endless pleasure? It’s basically a free emulator that we use for running Linux commands on the nodes.
- Next I ran “ps –ef | grep pmon” to see if the processes were running for the databases. They were. What’s PMON? It stands for process monitor and it’s also a background process that’s created when a database instance is started. Basically, if the pmon is not running for the database that means the instance isn’t up.
Curiously, this showed me that both instances of the databases appeared to be up and running. I then ran the “./crsctl stat res -t” command that I learned from Oracle Support on a prior issue. Note: “crsctl“is the Oracle Clusterware Control utility, “stat” is status, “res” is resource, and “-t” just displays the results in a tabular format. If you don’t already know, can you now guess what that does? It checks the status of the resources in the cluster. No! Really? It basically showed me everything was online and the databases were open, which is a good thing. Ya think?
I don’t know if there was anything else I needed to do, but I believed this showed me that everything appeared to be fine in regards to the cluster databases. So I didn’t worry too much about what OEM was showing me. Other things came up during the day, so I left it as is for awhile. However, it was still bugging me the next day. So I looked at my notes and recalled my DBA co-worker having gone through something like this before. Meaning, everything looked fine by using PuTTY when OEM was indicating otherwise. She had worked with Oracle Support on this for a separate issue and luckily I wrote down what she learned. So here’s what I did:
- Logged into the node using PuTTY.
- Went to the agent home directory and ran this command: “./emctl status agent”.
What does that do? It checks the status of the Enterprise Manager agent. Guess what? It wasn’t running. Just for curiosity sake, I logged into the other nodes and ran the same command. The agents were running fine there. Ah ha! Ding! Yep! A little light bulb finally went off! Gee! Maybe this is why I’m getting the “agent unreachable” message in OEM. Duh! So I then ran “./emctl start agent” followed by the status command again. The agent was running and it looked okay to me (but what do I know?). Can you guess what happened next? I then logged into OEM, went to the Databases tab, and… *drum roll* the Status indicated both instances were up for the cluster databases! Woo hoo! *happy dance* :-) It may not seem like much to some, but I was soooo excited that I just had to write a post about it and share my experience with you.
To Sum Up My Experience: Learning Oracle for me so far has been like trying to eat jello with a fork. It’s slow, awkward, and a bit messy at times but it can be done.🙂
Of course, this begs the question of why the agent was down in the first place… if I figure it out, that’ll be a post for another day.