Recently, I tweeted about having to restore fourteen databases in SQL Server 2000. I wasn’t surprised to receive some good-natured ribbing about still having databases in SQL Server 2000, and I’m okay with that… for the most part. Yes, I know and agree that it is not good (major understatement of the year/decade) to still be on SQL Server 2000. Why? Not just because it is woefully out of date, unsupported by Microsoft, and plain miserable to deal with at times, but also because the later versions have some wonderful features that it’d be awesome to take advantage of. Not to mention the embarrassment of admitting that, yes, we still use SQL Server 2000. At least we’re not running SQL Server 6.5 or 7.0, right? That I know of, at least. That has got to count for something.

It could be worse, right?

While I’m at it, I might as well confess… er, mention that we are also running SQL Server 2005, 2008, 2008 R2, and 2012… and Oracle 11gR2… and MS Access (don’t ask what versions, please. For the love of all things data, please don’t ask!). What? No SQL 2014? Nope. Not yet, anyway. We are painfully aware that having so many versions isn’t necessarily a good thing. Why can’t I do a merge or a CTE? Oh yeah… I’m in a SQL 2000 database. Ick. I am hopeful that we can come up with a good plan in the next year (or so) to somehow consolidate all (at least some!) of these instances into something more recent. At least management is aware of our need to consolidate/migrate to the current century. It’s a start. Plus there are quite a few other important projects in progress at the moment.

But Why?

So you may be wondering (or not)… for the love of all things SQL, why are you still on SQL Server 2000?! I admit that I envy anyone in a position to keep upgrading to the newer versions soon after they come out, or at least before they’re ridiculed… er, aren’t supported anymore. So what can prevent someone from upgrading? Many things. Others may have different reasons, but some of the major ones I’ve encountered include:

The 90s Called. They Want Their VAX Back. (a.k.a. Higher Priority Projects)

Seriously, I don’t go anywhere near it.

What could be worse than running SQL Server 2000? How about a legacy VAX system from the 1990s? Now what if your core business applications, including the database, were still running on said VAX system? And what if the five-year project to migrate off said VAX turned into a ten-plus-year project for various reasons and it’s still not done? That is a very long story and I’m not blaming anyone. Stuff happens. Let’s just move on and get it done. Yep. That trumps the SQL Server 2000 boxes by at least a few years. At least most of one of those major applications is on SQL Server… 2005. I know. I know. Yes, that one needs to be upgraded as well. <insert painful sigh here> The good news is we’ve renewed our efforts to get those apps off the VAX much more quickly. I’m hoping it’ll happen in the next year or two. Positive thoughts. Positive thoughts. And, no. It’s not going to SQL 2000 or 2005… itsgoingtooracle11gr2orpossibly12c. All kidding aside, I’m okay with that. No, really. I am. The “why” on that one is a bit of a long story, but I’m good with it, and that’s probably where we’ll end up putting everything.

Oh, What a Tangled Web We Weave

Here is another reason why upgrading isn’t as easy or as straightforward as one may think… or hope… or wish. One of our main production SQL Server systems is still running on SQL 2005. I avoid in-place upgrades if at all possible because of the risks involved, so I’d much rather migrate this system to another box running SQL Server 2012 (at least) or even *gasp* 2014. So what is delaying us from at least moving to another bigger/better/more beautiful box? All the flippin’ interfaces. When I stop to count up all the different applications and systems that access this box and the databases on it, my eyes start to glaze over, my brain starts to throb, and I feel a bit faint. It would be doable, but I strongly believe it would be a fairly large/nightmarish undertaking; one that we must start soon… or at least at some point in this decade.

This box contains at least five major databases. Of these five major databases, one very critical database is heavily used by multiple applications for different departments (I’ve counted at least four so far). The interfaces to it (I’ve counted twelve off the top of my head) would have to be analyzed and tested.  Their respective databases may need to be upgraded as well at some point.   That database is also replicated to two different servers (SQL 2005 and SQL 2008), internal and external. There are jobs to import/export data to/from other servers. There are linked servers from this one to other servers. Other servers link to it as well. Can’t forget the SSRS servers that also use it. At least those are running on SQL 2008 R2.

Altogether, this major system is actually spread across at least eight different servers running various database platforms and versions, including Ingres, Oracle, SQL 2000, SQL 2005, SQL 2008, SQL 2008 R2, and SQL 2012. To say it’s a bit of a mess is an understatement. And that’s just one system. We currently support over 70 SQL Server instances of various versions, not to mention Oracle and MS Access. Some are developed in-house and some are vendor-supported. Some we control and some are out of our control.

Kill It! Kill It With Fire!

So how did it get that way? I’m not really sure, because I inherited a lot of this and don’t have the background/history as to why certain decisions were made… or I may have just blocked it from memory. It could simply be the result of put this here, put this there, and before you know it you have a Frankenstein of a system/environment. Having said that, I’m trying not to focus so much on the “how or why” but more on “where do we go from here?” and “how can we make this better?” “Kill it with fire” is not a preferred option… yet. So please put the pitchforks away… for now.

Where to start…

I am hopeful that we can start on a plan to tackle these out-of-date servers once we’ve finally moved everything off the VAX or even sooner. Unfortunately, it’s going to take some time for all of it. Not to mention production issues that pop up and being pulled in on other important projects (so-and-so bought a new application with a new database and now we have to put it somewhere and figure out the licensing and by the way you need to learn Oracle and you need to learn APEX and SOA and how that’s going to fit into everything and don’t forget we need to migrate those MS Access databases to Oracle because we’re thankfully not supporting MS Access anymore…)  

Please don’t take this wrong. I am actually looking forward to learning more about Oracle and APEX and SOA and whatever comes next. It can just be a bit overwhelming at times. In addition, I don’t say it enough, but I am extremely thankful to have a wonderful and very talented team whom I can count on to help make important decisions and help us get to where we need to be – in the current century or at least somewhere closer, if not beyond. :-D

Here’s To You!

So if you’re on the latest and greatest versions, I commend you. No, seriously. I do. No jokes. No sarcasm. It’s great. Really. It is. However, if you’re like us and are still running on old versions that aren’t supported anymore, I feel for you. It can be painful and challenging. If you’re like me and tend to feel depressed… er, embarrassed by the old technology, don’t be. Instead, hold your head up high. You’re probably doing the best you can with what you’re given and for your specific environment. Some things are out of our control. We can only do our best to get where we need to be. Hopefully, you won’t need a miracle to make it happen – just some support from your peers and quite possibly a few shots of your favorite beverage from time to time. So good luck and may all your databases stay online… and up-to-date.

The First SQL

A parody of the song “The First Noel”.  Merry Christmas, Happy Holidays, and so on and so forth…!

The first SQL the server did say
Could not parse this statement please write it this way.
In code where it lay a scanning the heap
On a production server I wanted to weep.
SQL SQL SQL SQL
Formed is the Team of S-Q-L.
 
I looked down and saw a star
Glaring in the code at me thus far
And to my eyes it gave great fright
And so it continued to my “delight”.
SQL SQL SQL SQL
Formed is the Team of S-Q-L.
 
And by the fright of that same star
DBAs came for the coding fubar;
To seek why a ping threw an event
And to destroy the RBAR whatever it meant.
SQL, SQL, SQL, SQL,
Formed is the Team of S-Q-L.
 
This star awry it went possessed;
Causing mayhem it did not rest,
And there it did not stop or stay,
Right inside my trace – zero disk space.
SQL, SQL, SQL, SQL,
Formed is the Team of S-Q-L.
 
They appeared with Admin’s decree,
I fell brazenly trying to flee,
And shuddered scared in their presence
Their scold and slur, I then did tense.
SQL, SQL, SQL, SQL,
Formed is the Team of S-Q-L.
 
Then we worked the code we abhorred
The select star was finally no more
That frickin bug we had finally caught
Was on a test server we forgot.
SQL, SQL, SQL, SQL,
Formed is the Team of S-Q-L.
 

Can’t Group This

Yep! Another one! This is another one of those dratted… er, fun and exciting posts inspired by a co-worker planting a little seed in my brain many months ago. It was fun to write, and it took a long time. I had no idea the song was THAT long! *sigh* Sadly (depending on your point of view), I have another one already in the works. Writing these parodies is actually a great stress reliever for me. It’s not perfect, but I hope you enjoy it anyway! One of these days I’ll get back to a more serious post.

Can’t Group This

A SQL parody of MC Hammer’s “U Can’t Touch This” 

You can’t group this
You can’t group this
You can’t group this
You can’t group this

Your, your, your query puts me on guard
Makes my day, now I’m floored
Spank you for stressing me
With a query to fix, no time to Tweet

It’s not good when you know you see
A distinct group by in a CTE
And I’ve seen so much
And this text field, uh, you can’t group

I told you, Code Boy
You can’t group this
Yeah, I’m too forgiving and you know
You can’t group this

Look at this code, man!
You can’t group this
Yo, let me bust the funky queries
You can’t group this

Fresh conflicts, no grants
You can’t do that, now, you know you wanna code
So move, onto your feet
And let this Princess do a CTE

While I’m codin’, hold on
Drop this data bit and I’ll show you what’s goin’ on
Call that, sys stats

Stuck in recursion so roll it back
Let me know if this is too much
And this image, uh, you can’t group

Yo, I showed you
You can’t group this
Why you doin’ this, man?
You can’t group this

Yo, what the hell! Try again, busta
You can’t group this

Give me a sum on a whim
Or better yet, that’s why I’m codin’ em
Now, you know
You talkin’ about aggregates, you talkin’ about some rows

Data types, that’s right
Varchars are maxin’ so set them just right
No escape, just merge
What’s it gonna be in your T-SQL search

Distinct? Admit
It’s not that hard, you need to drop this bit
That alias you know…
You can’t group this

You can’t group this

Break it down!
Stop, summarize!

Get outta this funk, go ahead
Write your funky code like this so the server won’t drop dead

So run your scans on this “where”
Adjust your views, run your try-catch with some flare

Make it fit, join with inner
Code like this and you’re not a beginner
Remove, tried and dumped
Wait just a minute don’t do that! Thump, thump, thump

Yeah… You can’t group this
Hey, man! You can’t group this

Get better with code. Oy! It’s time to grow 

You can’t group this
Be alert, start again

You can’t group this

Break it down!
Stop, summarize!

You can’t group this
You can’t group this
You can’t group this

Break it down!
Stop, summarize! 

Any time with CTEs
Let your fingers take flight
There’s code to explore and heaps of queries to write.

Now you can start coding with some success
With others writing queries that make you guess.
A new world, unfurled, from awkward to child’s play
I concur, you defer, and we’ll infer, uh, no duhr
And they all can go away.

You can’t group this
You can’t group this
You can’t group this
You can’t group this

….

SQL Woes from A to Z

Ever have one of those days when you’re working with a colleague on a database issue and one of you has a fun idea that just takes on a life of its own? Well, that’s exactly what happened today while we were doing some actual work. Imagine that! Below is what we came up with for your reading pleasure.

Many thanks to my friend, Erin, for collaborating on this fun little poem with me!

A is for the Alter that shouldn’t be run.

B is for the Backup that should’ve been done.

C is for the Cluster that flew into bits.

D is for the Data that no longer fits.

E is for the Errors we saw in the logs.

F is for the Faults that were NOT in the logs!

G is for the GO that couldn’t be found.

H is for the Heap that couldn’t be bound.

I is for the Index, non-clustered and disabled.

J is for the Job which needs that index enabled.

K is for the Kill that was run with a cursor.

L is for the Locks it caused you son-of-a… grrr!

M is for the Month I’ll never get back.

N is for the NULLS hiding in the stack.

O is for the Order By that killed my query.

P is for the Performance I needed so dearly.

Q is for the Query that we redesigned.

R is for the Ranks that are now undefined.

S is for the Select star I found in a proc.

T is for the Time that it lingered in a Lock.

U is for the Update that was lacking a Where.

V is for the Values it swiftly plopped in there.

W is for the When that was found without a Case.

X is for the XQuery we slapped in its place.

Y is for the Year as varchar, we couldn’t believe.

Z is for the Zero pad left, for which we all grieve.

These are the hysterical ramblings of a frustrated DBA. Her relentless mission: to upgrade strange old systems, to seek out new projects and bad design specs, to boldly index where no one has indexed before.

The Incident at Carmulus

DBA’s Log, SELECT GETDATE() as ‘Star Date’.  After many weeks of deliberation and preparation, tomorrow marks the dawn of a new day for the Carmulus system. The Alliance recently passed a not-so-secret-squirrel mandate effective 0800 tomorrow morning. Much rejoicing has commenced throughout the system.  I, for one, am relieved everything seems to be in place.

“Status report, Mr. Plock,” I commanded as I stepped onto the bridge.

“Captain, we’re receiving an alert from the Carmulus system. Their database backup job has failed. Initial reports indicate possible corruption. Manual backup attempts have also failed. However, the server appears to be operating within normal parameters. We have not received any distress signals from the inhabitants.”

“Thank you, Mr. Plock. What about the other databases? Were they backed up?”

“Yes, sir.  The other databases have been backed up successfully. However, the SQL Server error log is reporting a “cyclic redundancy check” message, sir.  I initiated a DBCC CHECKDB command with physical_only, no_infomsgs as well.”

“And the results, Mr. Plock?”

“Output indicates 0 allocation errors, 3 consistency errors in 1 table and 12 consistency errors in the database. The minimum repair level recommended is repair_allow_data_loss.”

“That. Is not a good sign.” I contemplated while sipping my Dulthian latte. “When was the last good backup taken?”

“Sunday night, sir.”

“Check the recovery model on the database. It should be full. Do we have any valid transaction log backups?”

“Yes, sir. We appear to have valid hourly transaction log backups since the last full valid backup on Sunday.”
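
(Side note for the terminally curious: checking that sort of thing on a 2008-era server looks something like the sketch below. The database name is a stand-in, of course.)

    -- Confirm the recovery model really is FULL:
    SELECT name, recovery_model_desc
    FROM sys.databases
    WHERE name = N'CarmulusDB';

    -- Recent backups from the msdb history tables ('D' = full, 'L' = transaction log):
    SELECT TOP (50) database_name, [type], backup_start_date, backup_finish_date
    FROM msdb.dbo.backupset
    WHERE database_name = N'CarmulusDB'
    ORDER BY backup_finish_date DESC;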

“Good. I’d rather not risk losing any data using the repair_allow_data_loss option unless we have no other choice. One more thing, Mr. Plock. Have you checked the server event logs by any chance?”

“Sir, the system event logs are reporting the Virtual Disk Service terminated unexpectedly after 1900 hours, a hard disk is reporting a bad block, and a logical drive returned a fatal error.”

“Good. God! It’s worse than I thought!  Mr. Chalulu, patch me through to Engineering!”

“Engineering. This is Chief Engineer Mr. Shcot.”

“Mr. Shcot, as you are in no doubt aware of our current situation, what are our options?”

“Well, Cap’n. Seein’ as how some of the disk errors it’s showing make no sense and the server hasn’t been updated in several years, I recommend we patch the blimey thing as well as rebootin’ it.”

“Thank you, Mr. Shcot. How much time do you need?”

“Aboot one and a half hours, Cap’n.”

“Mr. Chalulu, contact the Carmulan ambassador and patch her through. I’ll be in my Ready Room.”

“Aye, aye, Captain.”

DBA’s Log, Supplemental. After contacting the Carmulan ambassador and conveying the seriousness of the situation, she has contacted the inhabitants of Carmulus to negotiate an outage.  In the meantime, I have directed my crew to investigate recovery options for the database. Luckily, it is of the 2008 variety and not 2000.

“Status report, Mr. Plock,” I utter as I stagger back onto the bridge and contemplate the contents of that Dulthian latte.

“Sir, using the restore verifyonly command, I verified the full backup from Sunday is valid. I was then able to restore it under a different name. After which I restored all of the transaction log backups up through the current one that just ran. I then ran the DBCC CHECKDB command against it. It’s still valid. Meaning, the inhabitants should not lose any of their data from yesterday and today provided the transaction log backups remain intact.”

“Good work, Mr. Plock. You have the bridge while I ah… complete some ah… paperwork. I’ll be in my quarters.”

DBA’s Log, Supplemental+1. Preparations are now underway for patching the Carmulan server after hours. The inhabitants have been made aware that they risk losing today’s and yesterday’s data the longer we wait. Attempts have been made to convey the dire circumstances we face. However, they insist we wait until after hours. So be it. We decided against any attempt to repair the actual database due to the risk of data loss. Restoring it from the backups should work in our favor. May the SQL deities have mercy on our souls tonight, or what’s left of them anyway.

DBA’s Log, SELECT STUFF(Supplemental, 7, 0, ‘waitforit’). After what seems like an endless number of hours of patching, I have declared the mission a success.  The hard disk errors have been eradicated. The database was successfully restored using the full backup from Sunday along with the multitude of transaction log backups. I am also happy to report no loss of data was incurred and backups are functioning properly once again.

Summary below per request. Sorry for the delay. [Updated 09/24/2013]
Mission Summary: The day before a major system change was to be implemented, we discovered that a database backup job had failed, reporting that the database might be corrupted. The manual backup attempts failed as well. The users did not notice any unusual behavior with their system and nothing else seemed wrong. The error reported was a “cyclic redundancy check”. When we ran DBCC CHECKDB with physical_only, no_infomsgs, it showed “0 allocation errors, 3 consistency errors in 1 table and 12 consistency errors in the database. The minimum repair level recommended is repair_allow_data_loss.”
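
(If you want the copy-and-paste version, the check amounts to something like this. The database name is invented, and the suspect_pages peek is just an extra sanity check I like to run, not part of the original report.)

    -- Physical-only consistency check, suppressing informational messages:
    DBCC CHECKDB (N'CarmulusDB') WITH PHYSICAL_ONLY, NO_INFOMSGS;

    -- Any pages SQL Server has already flagged as bad show up in msdb:
    SELECT * FROM msdb.dbo.suspect_pages;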

The Windows system event logs also showed the Virtual Disk Service terminated unexpectedly that night, a hard disk reported a bad block, and a logical drive returned a fatal error. After talking with a server admin about it, they recommended patching the server and rebooting it.

Since we had a valid full database backup from the weekend along with hourly transaction log backups, we decided to restore that backup along with all the corresponding transaction logs under a different database name. We then ran the DBCC CHECKDB command against it to verify it wasn’t corrupted. It was fine.  So after the patching completed and fixed the hard disk errors, we restored the database using the full backup from the weekend along with the transaction log backups and all was fine.
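
(Roughly speaking, the restore-a-copy approach boils down to the sketch below. The file paths and logical file names are invented for illustration; in real life you’d pull the logical names from RESTORE FILELISTONLY first.)

    -- Make sure the full backup file is readable:
    RESTORE VERIFYONLY FROM DISK = N'X:\Backups\CarmulusDB_Full.bak';

    -- Restore it under a different name and leave it ready for log restores:
    RESTORE DATABASE CarmulusDB_Verify
    FROM DISK = N'X:\Backups\CarmulusDB_Full.bak'
    WITH MOVE N'CarmulusDB_Data' TO N'X:\Data\CarmulusDB_Verify.mdf',
         MOVE N'CarmulusDB_Log' TO N'X:\Logs\CarmulusDB_Verify.ldf',
         NORECOVERY;

    -- Apply each hourly log backup in order, then bring the copy online:
    RESTORE LOG CarmulusDB_Verify FROM DISK = N'X:\Backups\CarmulusDB_log_0800.trn' WITH NORECOVERY;
    -- ...repeat for the remaining log backups...
    RESTORE DATABASE CarmulusDB_Verify WITH RECOVERY;

    -- And make sure the restored copy comes back clean:
    DBCC CHECKDB (N'CarmulusDB_Verify') WITH NO_INFOMSGS;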

The Situation

Umm… we have a problem

It’s 11 am. You’re sitting at your desk at work (where else would you be?) trying to determine which of the 20 help desk tickets you’re actively working on to tackle next. That includes at least 5 actual projects (I think it’s 5), not just fixing various issues. Not to mention trying to figure out how to explain to someone why you shouldn’t include a “rollback transaction” as part of error handling in a stored procedure that contains just your run-of-the-mill select query. It’s all part of learning, right? Oh, and don’t forget that Oracle conference call at 1:30 pm. Did I mention your new minion, err… Oracle DBA, has some good, valid questions for you about the Oracle system as well? Wait. Did you talk to the Access guy yet about the tasks he’s working on? Oh yeah. You did that earlier this morning. Given all that, you’re actually feeling pretty good because one of the projects you’ve been working on went live that morning with no problems. Go team! However, before you can say “I wonder what I should have for lunch today”, you have 3 developers and 1 manager at your desk (or was that 2 developers and 2 managers?) asking for help with a SQL Server performance issue. It’s actually pretty important considering the end users are on site doing user acceptance testing for a major system release. Dun dun dun…

The Issue

A stored procedure runs fine on Server A but times out on Server B. Both are non-production servers. Both servers have SQL Server 2005 build 9.0.4035. Note: I included the build instead of the service pack level because I didn’t want to look it up and I don’t have it memorized yet. Did I mention we’re running SQL Server 2000, 2005, 2008, 2008 R2 and soon-to-include 2012? Oh, and that’s for somewhere around 73 instances and 800+ databases. Oh, and then there’s Oracle Exadata. Continuing on… The databases are identical because they were restored from the same backup file. Still, you verify that the record counts match and the structures match. No problems there. You can run the procedure within SSMS (SQL Server Management Studio) just fine on both servers. No problem. You break down and give the developer db_owner permissions on both databases just to prove it’s not a permissions issue. Plus, it’s not production. So no worries. They had no problems running the procedure in SSMS on both servers. However, when the procedure is executed from the application or from within Visual Studio (2010, I believe), it times out on Server B. There are no error messages here, there, or anywhere. Not in a log. Not in a… Where was I? Oh yeah…
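
(By the way, if you also never memorize the build-to-service-pack mappings, the server will at least tell on itself with something along these lines:)

    -- Version, service pack level, and edition straight from the instance:
    SELECT SERVERPROPERTY('ProductVersion') AS build,
           SERVERPROPERTY('ProductLevel') AS service_pack_level,
           SERVERPROPERTY('Edition') AS edition;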

What about the stored procedure itself? It returns two result sets from two queries. From a SQL Profiler trace, I found it was getting hung up on the first one. The first query is a select with inner joins on four other tables. Nothing too complicated, at least. I probably shouldn’t mention that the two optional parameters are included in the inner join clauses instead of a where clause.
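
(For illustration only, with invented table, column, and parameter names, the pattern looks roughly like this: the optional parameters get folded into the join conditions rather than a where clause.)

    -- Hypothetical sketch of the pattern described above:
    SELECT o.OrderID, o.OrderDate, c.CustomerName
    FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c
            ON c.CustomerID = o.CustomerID
           AND (@Region IS NULL OR c.Region = @Region)   -- optional parameter #1
           AND (@Status IS NULL OR o.Status = @Status);  -- optional parameter #2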

One of These Things Isn’t Like the Other

One of these things isn’t like the other..

There are so very many pieces to look at and consider, but this is what I did. I probably should come up with a good checklist for the future so I’m not scrambling. Good intentions and all that, right?

So what could possibly differ between these two systems?  The record counts on the tables are the same. The structures are the same. The indexes are identical. Hmm.

Maybe something with memory? The cache? Could it be the execution plan? I attempted to retrieve the actual plan from Server B and guess what happened? It kept running and running and running. Just like what the developer experienced. I had no problems retrieving the actual execution plan from Server A, though. It ran in about 5 seconds. Double hmm.

So I generated the estimated plan from both systems with no problem and compared them. Gee. They were completely different. That wasn’t a huge surprise, but it was still unexpected considering the usage on both systems should be about the same. What was interesting was that the plan on Server B said an index was missing on one of the tables. Really? The index was there, but it turned out the number of statistics on that table was different from the number on the same table on the other server. So why were the statistics so different? We have maintenance plans in place on both servers to reorganize the indexes and update the statistics every Sunday, and both ran just fine this last weekend. The plans should be the same, but for giggles I thought I’d check them. Guess what? They were different. Dude! Different how? Different in that the plan on Server B updated the statistics first and then reorganized the indexes. This is the server where the procedure hangs when executed. On Server A, the indexes are reorganized before the statistics are updated. Wow. Could this be it? I think it very well could be the problem!
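
(If you want to check this yourself, the statistics on a table and when each one was last updated are easy enough to pull; the table name below is just a stand-in.)

    -- Statistics on a table and when each one was last updated:
    SELECT s.name AS statistic_name,
           STATS_DATE(s.[object_id], s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE s.[object_id] = OBJECT_ID(N'dbo.SomeBigTable')
    ORDER BY last_updated;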

The Test

So on Server B, I reorganized the indexes on only the tables used by this procedure and then updated the statistics. Guess what? I could then easily retrieve the actual execution plan without it hanging. I then asked the developer to try executing the procedure. Ya know what? It ran just fine! Sweet!
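
(In T-SQL terms, the manual fix amounted to something like this for each table the procedure touches. The table name is a stand-in; the point is the order: reorganize first, then update the statistics.)

    -- Reorganize the indexes first...
    ALTER INDEX ALL ON dbo.SomeBigTable REORGANIZE;
    -- ...then update the statistics so the optimizer has fresh information:
    UPDATE STATISTICS dbo.SomeBigTable;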

Hindsight is 20/20

Should I have gathered a bunch of info from DMVs and what-not first? Yep. Probably. There’s a million things I probably should have done, but considering the time crunch and the sheer number of other tasks that have fallen to me, I think I did okay. I solved the problem and made everyone happy so they can continue testing and I can continue on my merry way.

But Whyyyyy?

Now that is the million dollar question. Why does it matter whether you reorganize your indexes before or after updating statistics? Well… you probably won’t like this, but I’m going to save that for another post on another day. :-) That’s my plan, anyway.

Do I really want to look under the covers?

Hey, Ya’All Ain’t Gonna Believe This!

I do have to give a shout out to our new minion, oops. I mean Oracle DBA. Even though he doesn’t know SQL Server, he asked very intelligent questions which helped me to think through the process and what could be wrong. We made a pretty good team today which is awesome in itself. :-)

All by myself… don’t wanna be…

Now that I have your attention, you’re probably wondering what that title is all about. It’s not a huge deal, but I am happy to write that I solved a little problem I encountered in Oracle without having to bug anyone about it… much. Granted, I didn’t truly figure it out completely on my own. I utilized my resources as most good little DBAs probably do. That means I actually paid attention to prior issues my co-workers encountered and took some decent notes that actually came in handy. I just thought I’d share my experience with you in case it could help someone else out. What made it even more “exciting” for me was that my DBA co-worker was out of town during this time attending a training course and our manager was also out.

Note: I wish I had taken some screenshots of the issue I was having, but I didn’t. Lesson learned. So we’ll just have to make do with my not-so-wonderful descriptions.

Disclaimer:  Some of you may already know this but I mainly support SQL Server at my current job with a little Oracle in there from time to time. My DBA co-worker and manager are mainly supporting the Oracle system right now which I’m completely fine with. Don’t get me wrong. I really want to learn Oracle but I’m also happy to continue working with SQL Server on a regular basis. So please keep in mind that my current Oracle knowledge can fill a thimble and that’s probably stretching it a bit. That means I really don’t quite know what I’m doing when it comes to this stuff other than reading notes and spending a lot of time on Google and asking my Twitter buddies for advice. Thank you Twitter buddies!!! Lucky for me, I’m signed up for an Oracle DBA class coming up in mid-August. Until then, if I wrote something incorrectly or explained it wrong, please let me know! The last thing I want to do is pass on incorrect info.  Anyway… Here. We. Goooo…

A Dash of Techie Stuff: Basically, we’re running Oracle 11g r2 on a RAC on an Exadata machine running Oracle Linux. We’re also currently using Oracle Enterprise Manager (OEM) 11g. I believe we have plans to use 12c at some point in the near future.

Alternate Plan #1

A Little Back Story: Earlier this week one of the RAC nodes rebooted itself a couple of times in the middle of the night. We have ASR (Automated Service Request) set up so that it contacts us when there’s an issue like this. Long story short, and according to Oracle Support, we had a fan column failure. A field engineer came out the following morning and replaced a fan. They rebooted the node and all seemed fine. Note: I don’t really want to get into a lot of detail on this particular issue since I’m not well-versed in it and I believe it’s still being looked into. Plus, it’s not the main focus of this post. It’s just to give you an idea of what led up to this post.

Ah, Fun Times: Once the node was back up, I was then asked to check the databases to make sure they were fine. So I logged into OEM and went to the Databases tab. Lo and behold, the Status for these cluster databases indicated the second instances were down! These instances are on the node that was rebooted. Upon further investigation and drilling down, I saw an error indicating the agent on the second node was unreachable. Since this happened to me 3 weeks ago (hmm… my manager and DBA co-worker were out of town in training then as well… I’m detecting a pattern…), I followed the steps my manager had walked me through over the phone back then. Here are the steps in case you’re wondering:

  1. Log into the second node using PuTTY. What’s PuTTY, besides a fun childhood toy that provides hours of endless pleasure? It’s basically a free SSH/terminal client that we use for running Linux commands on the nodes.
  2. Next, I ran “ps -ef | grep pmon” to see if the processes were running for the databases. They were. What’s PMON? It stands for process monitor, and it’s a background process that’s created when a database instance is started. Basically, if pmon isn’t running for a database, that instance isn’t up.

Alternate Plan #2

Curiously, this showed me that both instances of the databases appeared to be up and running. I then ran the “./crsctl stat res -t” command that I learned from Oracle Support on a prior issue. Note: “crsctl” is the Oracle Clusterware Control utility, “stat” is status, “res” is resource, and “-t” just displays the results in a tabular format. If you don’t already know, can you now guess what that does? It checks the status of the resources in the cluster. No! Really? It basically showed me everything was online and the databases were open, which is a good thing. Ya think?

I don’t know if there was anything else I needed to do, but I believed this showed me that everything appeared to be fine in regards to the cluster databases. So I didn’t worry too much about what OEM was showing me. Other things came up during the day, so I left it as is for awhile. However, it was still bugging me the next day. So I looked at my notes and recalled my DBA co-worker having gone through something like this before. Meaning, everything looked fine by using PuTTY when OEM was indicating otherwise. She had worked with Oracle Support on this for a separate issue and luckily I wrote down what she learned. So here’s what I did:

  1. Logged into the node using PuTTY.
  2. Went to the agent home directory and ran this command:  “./emctl status agent”.

It’s magic!

What does that do? It checks the status of the Enterprise Manager agent. Guess what? It wasn’t running. Just for curiosity’s sake, I logged into the other nodes and ran the same command. The agents were running fine there. Ah ha! Ding! Yep! A little light bulb finally went off! Gee! Maybe this is why I’m getting the “agent unreachable” message in OEM. Duh! So I then ran “./emctl start agent” followed by the status command again. The agent was running and it looked okay to me (but what do I know?). Can you guess what happened next? I then logged into OEM, went to the Databases tab, and… *drum roll* the Status indicated both instances were up for the cluster databases! Woo hoo! *happy dance* :-) It may not seem like much to some, but I was soooo excited that I just had to write a post about it and share my experience with you.

To Sum Up My Experience:  Learning Oracle for me so far has been like trying to eat jello with a fork. It’s slow, awkward, and a bit messy at times but it can be done. :-)

Of course, this begs the question of why the agent was down in the first place… if I figure it out, that’ll be a post for another day.

Fork, meet jello… and um, oops!
