
Archive for the ‘Oracle’ Category

Sometimes I Scare Myself

All by myself… don’t wanna be…

Now that I have your attention, you’re probably wondering what that title is all about. It’s not a huge deal, but I am happy to write that I solved a little problem I encountered in Oracle without having to bug anyone about it… much. Granted, I didn’t really truly figure it out completely on my own. I utilized my resources as most good little DBAs probably do. That means, I actually paid attention to prior issues my co-workers encountered and took some decent notes that actually came in handy. I just thought I’d share my experience with you in case it could help someone else out. What made it even more “exciting” for me was that my DBA co-worker was out of town during this time attending a training course and our manager was also out.

Note: I wish I had taken some screenshots of the issue I was having, but I didn’t. Lesson learned. So we’ll just have to make do with my not-so-wonderful descriptions.

Disclaimer:  Some of you may already know this but I mainly support SQL Server at my current job with a little Oracle in there from time to time. My DBA co-worker and manager are mainly supporting the Oracle system right now which I’m completely fine with. Don’t get me wrong. I really want to learn Oracle but I’m also happy to continue working with SQL Server on a regular basis. So please keep in mind that my current Oracle knowledge can fill a thimble and that’s probably stretching it a bit. That means I really don’t quite know what I’m doing when it comes to this stuff other than reading notes and spending a lot of time on Google and asking my Twitter buddies for advice. Thank you Twitter buddies!!! Lucky for me, I’m signed up for an Oracle DBA class coming up in mid-August. Until then, if I wrote something incorrectly or explained it wrong, please let me know! The last thing I want to do is pass on incorrect info.  Anyway… Here. We. Goooo…

A Dash of Techie Stuff: Basically, we’re running Oracle 11g R2 in a RAC configuration on an Exadata machine running Oracle Linux. We’re also currently using Oracle Enterprise Manager (OEM) 11g. I believe we have plans to use 12c at some point in the near future.

Alternate Plan #1

A Little Back Story: Earlier this week one of the RAC nodes rebooted itself a couple of times in the middle of the night. We have ASR (Auto Service Request) set up so that it contacts us when there’s an issue like this. Long story short and according to Oracle Support, we had a fan column failure. A field engineer came out the following morning and replaced a fan. They rebooted the node and all seemed fine.  Note: I don’t really want to get into a lot of detail on this particular issue since I’m not well-versed in it and I believe it’s still being looked into.  Plus, it’s not the main focus of this post. It’s just to give you an idea of what led up to this post.

Ah, Fun Times: Once the node was back up, I was then asked to check the databases to make sure they were fine. So I logged into OEM and went to the Databases tab. Lo and behold, I was surprised to see that the Status for these cluster databases indicated the second instances were down! These instances are on the node that was rebooted. Upon further investigation and drilling down, I saw an error indicating  the agent on the second node was unreachable. Since this happened to me 3 weeks ago (hmm… my manager and DBA co-worker were out of town in training then as well… I’m detecting a pattern…), I followed the steps my manager had walked me through over the phone back then. Here are the steps in case you’re wondering:

  1. Log into the second node using PuTTY.  What’s PuTTY besides a fun childhood toy that provides hours of endless pleasure? It’s basically a free terminal emulator and SSH client that we use to connect to the nodes and run Linux commands.
  2. Next I ran “ps -ef | grep pmon” to see if the processes were running for the databases. They were.  What’s PMON? It stands for process monitor, and it’s a background process that’s created when a database instance is started. Basically, if PMON is not running for a database, that means the instance isn’t up.
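For anyone who wants to script that PMON check, here’s a minimal sketch. It assumes the PMON processes show up in the process list as ora_pmon_<SID> (or asm_pmon_<SID> for ASM), and the function name is just something I made up for illustration. It reads “ps -ef” style output on standard input, so it can be tried against saved output as well.

```shell
#!/usr/bin/env bash
# List the instance names that currently have a PMON background process.
# Assumption: PMON shows up as ora_pmon_<SID> (asm_pmon_<SID> for ASM)
# in the last column of `ps -ef` output.
list_pmon_instances() {
  awk '{
    cmd = $NF                             # last column holds the process name
    if (cmd ~ /^(ora|asm)_pmon_/) {
      sub(/^(ora|asm)_pmon_/, "", cmd)    # keep only the instance name
      print cmd
    }
  }'
}

# Typical use on a node:
#   ps -ef | list_pmon_instances
```

Because it matches on the process name itself, the “grep pmon” line that normally sneaks into the output is ignored. If an instance you expect is missing from the list, that instance isn’t up on that node.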

Alternate Plan #2

Curiously, this showed me that both instances of the databases appeared to be up and running.  I then ran the “./crsctl stat res -t” command that I learned from Oracle Support on a prior issue.  Note: “crsctl” is the Oracle Clusterware Control utility, “stat” is status, “res” is resource, and “-t” just displays the results in a tabular format. If you don’t already know, can you now guess what that does? It checks the status of the resources in the cluster. No! Really?  It basically showed me everything was online and the databases were open, which is a good thing. Ya think?
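If eyeballing the table gets old, the same check can be roughed out in a script. This is only a sketch under a few assumptions: it reads the non-tabular output (“crsctl stat res” without “-t”), which as far as I know prints NAME=, TYPE=, TARGET= and STATE= lines for each resource, and the function name is made up. It lists any resource that has fewer ONLINE states than ONLINE targets.

```shell
#!/usr/bin/env bash
# Print the names of cluster resources that should be online but are not.
# Assumption: input is `crsctl stat res` output WITHOUT -t, i.e. blocks of
# NAME=..., TYPE=..., TARGET=..., STATE=... lines, one block per resource.
resources_not_online() {
  awk -F= '
    $1 == "NAME"   { name = $2; want = 0 }
    # gsub returns how many times ONLINE appears; "&" leaves the text unchanged
    $1 == "TARGET" { want = gsub(/ONLINE/, "&", $2) }
    $1 == "STATE"  { have = gsub(/ONLINE/, "&", $2)
                     if (have < want) print name }
  '
}

# Typical use from the Grid Infrastructure bin directory:
#   ./crsctl stat res | resources_not_online
```

No output means every resource that is supposed to be online is reporting online, which is what I saw. It only counts states, so treat anything it prints as a prompt to go look at the full output, not as a diagnosis.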

I don’t know if there was anything else I needed to do, but I believed this showed me that everything appeared to be fine with regard to the cluster databases. So I didn’t worry too much about what OEM was showing me. Other things came up during the day, so I left it as is for a while. However, it was still bugging me the next day. So I looked at my notes and recalled my DBA co-worker having gone through something like this before. Meaning, everything looked fine by using PuTTY when OEM was indicating otherwise. She had worked with Oracle Support on this for a separate issue and luckily I wrote down what she learned. So here’s what I did:

  1. Logged into the node using PuTTY.
  2. Went to the agent home directory and ran this command:  “./emctl status agent”.

It’s magic!

What does that do? It checks the status of the Enterprise Manager agent. Guess what? It wasn’t running. Just for curiosity’s sake, I logged into the other nodes and ran the same command. The agents were running fine there. Ah ha! Ding! Yep! A little light bulb finally went on! Gee! Maybe this is why I was getting the “agent unreachable” message in OEM. Duh! So I then ran “./emctl start agent” followed by the status command again. The agent was running and it looked okay to me (but what do I know?). Can you guess what happened next?  I then logged into OEM, went to the Databases tab, and… *drum roll*  the Status indicated both instances were up for the cluster databases! Woo hoo! *happy dance* :-)  It may not seem like much to some, but I was soooo excited that I just had to write a post about it and share my experience with you.
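The check-then-start routine above can be sketched as a small shell function. Treat it as a rough sketch and not the official way to do it: the EMCTL variable and the function name are my own inventions, and it assumes the status output contains the text “Agent is Running” when the agent is healthy, so compare that against what your own agent prints.

```shell
#!/usr/bin/env bash
# Start the Enterprise Manager agent only if the status check says it is down.
# EMCTL is an assumption: point it at the emctl in your own agent home.
EMCTL="${EMCTL:-./emctl}"

ensure_agent_running() {
  # Assumption: a healthy agent reports a line containing "Agent is Running".
  if "$EMCTL" status agent 2>&1 | grep -q "Agent is Running"; then
    echo "agent already running"
  else
    echo "agent not running; starting it"
    # Start the agent, then show the status again to confirm it came up.
    "$EMCTL" start agent && "$EMCTL" status agent
  fi
}

# Typical use from the agent home's bin directory:
#   ensure_agent_running
```

It does the same two commands I typed by hand, just in the right order and only when needed. It won’t tell you why the agent went down in the first place.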

To Sum Up My Experience:  Learning Oracle for me so far has been like trying to eat jello with a fork. It’s slow, awkward, and a bit messy at times but it can be done. :-)

Of course, this raises the question of why the agent was down in the first place… if I figure it out, that’ll be a post for another day.

Fork, meet jello… and um, oops!


Uh Oh… That Can’t Be Good

Um, We Have a Problem…

This last weekend was pretty rough for my entire team. One of our most critical production systems took a dive on Friday morning. Meaning, the database went down unexpectedly and wouldn’t come back up. When I got the call Friday night around 8 pm that we would be working in shifts and I was needed at work that night, I knew it was bad, very bad. This was the first time in three years (that I could remember) that I had to go into work after hours for a production issue. That’s actually pretty good, in my opinion, considering I know other DBAs end up doing quite a bit of after-hours support for their systems. I don’t like to speak for others but it seemed pretty rough on all four of us. I don’t think anyone got much sleep the entire weekend; however, we managed to get through it and the system was back up and running by Monday afternoon. I really am lucky to be part of such a great team. My co-workers put in quite a few long hours starting on Wednesday, which is just amazing to me. I wasn’t involved until Friday night and I was exhausted after only three nights. I can only imagine how they’re feeling.

Should I?

Night #3... Observations...

To be honest, I’m not sure I should even be writing a blog post about this issue for various reasons. One reason being that my role was that of minimal support. This is an Oracle system which is new for us and I know very little about Oracle administration. So my main role was to be a second set of eyes for my manager who worked the night shift with me. I’m very thankful she was there with me.  It also really helped that she has prior Oracle experience and has had some training which I’m so very thankful for. I really didn’t do much except to double-check what my manager was doing, answer phone calls from Oracle support, and type in whatever commands the support people asked  me to. Hmm… That may explain the odd voices that told me to do strange things when my manager stepped away. Yes, I did my best to take note of what it is they were asking me to do which was mostly querying things… thankfully.

Secondly, we worked in shifts with me being on the night shift. Add to that my limited knowledge of Oracle, and it was difficult for me to keep track of everything that was going on the entire time, except for knowing we were having quite a few issues with the system. So I don’t have a lot of the technical details that I’m sure some people would love to hear about. Sorry about that.

But Why?

So why am I writing this? I thought it would be good to document what we went through, at least in general, in case anyone else experiences the same or similar issues. I also thought it would be a somewhat decent way to share what I learned. Granted, it’s not much but it’s something. Also, I’m not placing blame anywhere or pointing fingers. Every system experiences issues (at least I would think so) at some point. This is just one of those times.

Disclaimer

Since I’m still pretty tired, hopefully what I write makes at least some sense. I have limited knowledge of Oracle and the everyday workings of the system so please keep that in mind. Right now I’m mainly supporting SQL Server but am slowly learning more about Oracle. This blog post is from my point of view so it’s possible I got something wrong somewhere. If I did, please let me know; I apologize and will fix it as quickly as I can.

So What in Server Name Happened?

Night #4... Midnight Ramblings

First, I’ll state that this occurred on an Exadata machine with Oracle RAC (Real Application Clusters). It’s been in production since December and we’re running Oracle 11gR2.

From what I understand, the whole issue seems to have started on Wednesday when users were reporting inconsistent query results. They would run a query and get back a certain number of results. They would run the exact same query again and get 0 records back. This would happen repeatedly. One of my co-workers who is great with and knows Oracle pretty well researched and worked on it for quite some time and contacted Oracle support about it. I believe the theory was that it had something to do with the optimizer.

At some point on Thursday, ASM (Automatic Storage Management) went down but then it came back up. It sounds like it had something to do with a flash disk error. An engineer was sent out, and I understand the issue was fixed.  Note:  ASM is basically Oracle’s volume manager and file system for database files.

Then for some reason, the database terminated unexpectedly with an ORA-600 error Friday morning and would not open up afterwards. Note:  I was told that ORA-600 is Oracle’s generic internal error code, so the message by itself doesn’t usually tell you much; support needs the arguments and trace files that come with it. Great, huh?

At some point, Oracle determined that a duplicate or bad record was inserted into a system table called props$. As of this moment, no one knows how it got there or when. Since we had no idea this table even existed, we were not auditing it. However, I believe we are auditing it now. Apparently, having this extra record caused the database to not open back up when it terminated unexpectedly on Friday. Note: I believe props$ is basically a database properties table. As my manager explained to me and if I understood her, it’s like having your master database in SQL Server become corrupted. However, getting it back up and running is more complicated in Oracle than it is in SQL Server.

The Plan

Night #5... A Plan is Formed...

So the plan was two-fold. One part was to find a good database backup that did not have that extra record in it so we could restore it to production, if necessary. The second part was to determine how this happened and to see if someone could open up the production database without having to resort to restoring the backup.  Note: we were doing full backups nightly.

In addition to all of this, the Linux box containing our backups wouldn’t mount for some reason. So we had to copy a database backup file to a Windows media server which took about 2 hours. At least that worked and we could see the backup files from the Exadata machine.

Anyway, the database from the first restore attempt would not open. So they tried another one. To keep a very long story at least somewhat short, they were successful in restoring a backup to our test Exadata machine and recovering data from it and the archived logs, in addition to recovering data from the online redo logs (kind of like transaction logs, as I understand it) of the corrupted database.  That means we only lost 5 minutes’ worth of data. I think that is just plain awesome considering everything that happened over the weekend. And so far no one knows how this extra record ended up in that table. Hopefully it won’t happen again. I’m crossing my fingers, toes, and eyes. ;-)

A Day to Day Pictorial

Please note that I don’t mean to oversimplify the process. It was a very long and manual process to restore and recover the database. Everyone worked very hard on getting this to work. It doesn’t seem like a very straightforward process to me, but that could just be me. Also when I refer to “they”, I’m referring to my team in conjunction with Oracle support. Everyone worked well together to get it figured out.

Overall, it sounds like we also have a few bugs and need to do some patching very soon. The support team we worked with seemed to be very professional and helpful. There were quite a few bumps along the way but we survived and the issue was fixed. That’s the important thing to remember.

Hey! I Learned Something!

On the bright side, I actually learned some useful stuff over the weekend!  I now know:

  • how to use PuTTY (the terminal client, not the oh-so-cool kids’ toy or paste-like substance; it’s probably a good thing I didn’t have any of the gooey kind in my reach this weekend)
  • leaving sticky notes in someone else’s cube late at night is a great stress reliever and a great way to keep one’s sense of humor intact (note: not all of the sticky notes were written by me; some were written by my co-workers)
  • how to start RMAN (recovery manager):  rman target /
  • that management appreciates sticky notes and saw the humor in it (whew!)
  • what RMAN scripts look like
  • that you can’t have leading spaces in RMAN scripts or bad things happen (mostly just errors)
  • how to set the Oracle environment in Linux: . oraenv
  • where the pfile (parameter file) is and how to edit it along with the init file (scary thought)
  • how to look around the ASM file system:  asmcmd (command line utility)
  • that ASM contains an “M”, not two “S”s (gotta love typos)
  • how to start SQL*Plus to run SQL commands: sqlplus / as sysdba
  • to be careful when Google’ing props$ (psst…don’t put a space before the $… seriously, nothing bad happens… just an attempt at wacky late night humor)
  • that not only am I part of a fantastic team who put in tons of hours on this issue, but that we also have a great management staff who were very supportive and helpful during this time.
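Since the leading-spaces lesson cost us a few errors, here’s a small sketch of a guard against it. It assumes that leading whitespace really was what RMAN objected to in our scripts (I can’t promise that holds everywhere), and the function name and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Remove leading spaces and tabs from every line of a script read on stdin.
# Assumption: leading whitespace is what caused the RMAN errors noted above.
strip_leading_whitespace() {
  sed 's/^[[:space:]]*//'
}

# Typical use (file names are hypothetical):
#   strip_leading_whitespace < my_restore.rman > my_restore_clean.rman
#   . oraenv
#   rman target / cmdfile=my_restore_clean.rman
```

It only touches the start of each line, so the commands themselves are left alone. I’d still read the cleaned-up file before handing it to RMAN.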

So that was my weekend. It was rough but we survived and learned some things in the process.  Huh… I can’t believe I wrote this on my lunch hour. Usually it takes me longer than that to write a post.


Hey! Nice RAC!

What? Another post in less than a week? Yep! Don’t faint from shock! ;-)  Besides, I’m overdue for a mostly serious post.  Oh and as for the title of this little post? Trust me. It could have been much, much worse. ;-)  

Since we’ve had Oracle for a few months now and have one production Oracle system, I thought it was about time to write a little of what I’ve learned so far. Granted, it’s probably enough to fill a thimble since I’m mainly still supporting SQL Server.  It seems a bit funny to me, in a way, but I’m learning about Oracle pretty much how I learned SQL Server – from experienced co-workers, reading, awesome people on Twitter (thank you!), more reading, and good old-fashioned playing around.

In case anyone is wondering, we are now owners of Oracle 11g R2 on Exadata Database Machines. So what’s an Exadata? It’s basically a super duper uber powerful machine that combines database servers and storage servers, optimized specifically for Oracle databases to run on. It appears a lot of processing is offloaded to the storage hardware. I’m not going to regurgitate all the nitty-gritty specs but you can read all about them here.

A Cluster O’ Fun

It's all fun and games until someone loses a node

We also have an Oracle cluster running on said Exadata box, and I believe there is a plan to get a data warehouse going on one as well. That sounds like it could be fun actually. I had also heard something about us possibly supporting SSAS (SQL Server Analysis Services) for a department. No, that won’t get confusing at all! The Oracle cluster is actually referred to as a RAC, which stands for Real Application Clusters. It runs on top of something called Oracle Clusterware and Oracle ASM (Automatic Storage Management). Together they make up the Oracle Grid Infrastructure. As I understand it, the Clusterware is what makes the cluster. No, really? What was your first clue?  That basically means you’ve got a database on shared storage and multiple servers can access it at the same time. If one node (host server) goes down, the other one(s) can still access it.  The ASM part is basically the file system and volume manager. It includes striping (automatic), mirroring (optional), rebalancing and so on. It basically manages the files for you so you don’t have to.

SQL vs Oracle

So what’s an Oracle cluster like compared to a SQL Server cluster? Sorry, but I really can’t tell you just yet. Yeah, I’m bummed too. When it comes to performance, it’s my understanding that there really isn’t anything out there to compare to an Exadata box. It’s fairly unique. Therefore, one can’t really compare this particular cluster to a SQL Server cluster in terms of performance and what have you. I honestly couldn’t tell you anything about its creation or setup since I wasn’t really all that involved. Hey, someone has to make sure the SQL Servers are still behaving. :-) Once I get a better grasp on it, I may be able to write something about it as compared to a cluster from a technical aspect but not performance-wise. Time will tell.  However, I would love to hear from anyone who has Exadata and/or RAC experience. :-)

The Verdict?

A cookie by any other name is still a cookie... they just come in different flavors

So what do I think of Oracle so far?  You know how some relationships start off somewhat rocky? Well, this one isn’t any different. However, that’s not necessarily a bad thing. It’s just that I really haven’t had a lot of interaction with it just yet so I really haven’t had enough experience with it to say one way or the other. My initial impression is that it is way more involved and complicated to manage than SQL Server so far. That could just be me, though. Overall, I’m viewing this as a great opportunity to learn something new which is great since I love to learn new things.  :-)  In my opinion, relational databases should be fundamentally the same but with differences. Yes, some are quite different than others but once you have the basic concepts down it’s just a matter of figuring out and learning how to administer and deal with them in their environments which isn’t always that easy. But that’s just my opinion. :-)

