Failure? What Failure?

This week my boss asked who wanted to install SQL Server on a new server. Seeing as it had been awhile since I did one, or at least it seemed like it, I happily volunteered. I needed some time away from my desk anyway so I was pretty excited. Much to my dismay, it turned out that 1) the server had already been released to the users and they started putting documents and stuff on it and 2) it didn’t have the separate drives for the data and log files, which is part of our standard install. So now I get to wait until they add or partition those two drives. No biggie. Good things come to those who wait, right? At least that’s what I keep telling myself and my soon-to-be 5  year old daughter… her theme song should be “I Want It All” by Queen. Seriously, it is SO her! 🙂     

In the meantime, I decided it would be a good idea to update the status on the open help desk tickets assigned to me and close the ones where I managed to actually fix the issues.  All excited, I was finally able to close enough help desk tickets to where I had less than 30 tickets to my name. Finally! *happy dance* Well… that was short lived. My DBA buddy, who was manning the unassigned ticket queue, assigned me a few more. So now I’m back up to 30. *sigh* Maybe I shouldn’t have put the giant spider under her desk while she was away. That’s okay. The tickets should keep me busy and out of trouble… hopefully… for now…    

Yesterday, my DBA replacement from my last job sent me an email asking me if I just so happened to remember the password to a certain DR document. Funny thing… even though it’s been around 1.5 years since I worked there I actually remembered it! Probably because it made the senior DBA laugh when I told her what it was. No, I’m not posting the password here. Nice try! 😛 So I told her what it was and then promptly realized I should have asked for “payment” up front. Drat! Too late! Or was it? I thought I’d try anyway and promptly emailed her specifying that she needed to send me dark chocolate as payment for the correct password. So she sends me a picture of dark chocolate truffles… LMAO Yeah, she’s awesome like that! Lesson learned? Be more specific! *sigh… smack head on desk*    

Anyway, on to this week’s feature. I just know you all were anxiously waiting for this week’s script! Don’t deny it! You know who you are! Anywho, for those who missed them, a couple of posts ago I had this crazy idea to post parts of my daily SQL Server checking script every week. Week one was all about SQL and agent error logs. Last week was about server and database level triggers. Since I haven’t been flogged (neither publicly or privately)… yet… *crossing fingers and toes* I shall continue my mini quest to impart at least some of my knowledge… This week is all about failures. No, not my personal failures. At least not this week… and not without several strong mojitos or beer… or both… well, there was that time at code camp… oh, sorry. Where was I? Oh yes! Failures!    

So why did I feel the need to include job failures in my script when I could just use the GUI via Management Studio? When you have approximately 30+ database servers to check in the morning, like I had at my last job, time is of the essence, usually. Besides, I would rather know ahead of time and quickly if a job failed. I didn’t want to hear it from a client first. Plus we didn’t have any kind of third-party monitoring system. Besides, I liked the idea of them thinking I was on top of things and semi-omniscient when it came to the databases. The following script is to check of any of the backup jobs failed. How do I know if it’s a backup job? Our backup jobs usually have the word “backup” in the name. 😉 You can use whatever you want, though. Disclaimer: I wrote these queries years ago. There are probably better ways of doing the same thing. This is what I’ve done. Feel free to experiment.    

--Check backup jobs...
select 'Check if backup jobs failed...'
select  'job_name'=left(j.name,40),
            jh.run_date, jh.run_time, jh.step_id, 'step_name'=left(rtrim(jh.step_name),30),
            jh.sql_message_id, jh.sql_severity, 'message'=left(rtrim(jh.message),100), jh.run_status,
            jh.run_duration, 'server'=left(rtrim(jh.server),10),
            'job_descr'=left(rtrim(j.description),30), j.date_created, j.date_modified
            from msdb..sysjobhistory jh 
            join msdb..sysjobs j
            on jh.job_id = j.job_id
            where jh.run_status<>1
              and jh.step_id > 0
              and substring(cast(jh.run_date as varchar), 1, 4) = year(getdate())
              and substring(cast(jh.run_date as varchar), 5, 2) = month(getdate())
              and substring(cast(jh.run_date as varchar), 7, 2) >= day(getdate())-7
              and j.name like '%backup%'
            order by jh.run_date desc 

The main part of the above query is looking in the msdb.dbo.sysjobhistory table joined with sysjobs table for any jobs with the word “backup” in the name. I also check where the run_status is not 1 (note: 1 is successful) and the step_id is greater than 0.  The records where step_id = 0 are for the entire jobs themselves. Personally, I want to see which step(s) failed. Next I check all other job failures (see below). I separated out backup job failures because it was my preference, but you don’t have to.    

--Check jobs...
select 'jobs that failed...'
select 'job_name'=left(rtrim(j.name),30),jh.instance_id, jh.run_date, jh.run_time, jh.step_id
,'step_name'=left(rtrim(jh.step_name),30), jh.sql_message_id, jh.sql_severity, 'message'=left(rtrim(jh.message),100), jh.run_status, jh.run_duration
,'server'=left(rtrim(jh.server),10), 'job_name'=left(rtrim(j.name),30), 'job_descr'=left(rtrim(j.description),30), j.date_created, j.date_modified
from msdb..sysjobhistory jh 
join msdb..sysjobs j on jh.job_id = j.job_id
where jh.run_status<>1 and jh.run_status<>4 
and substring(cast(jh.run_date as varchar), 1, 4) = year(getdate())
and substring(cast(jh.run_date as varchar), 5, 2) = month(getdate())
order by jh.run_date desc

 
 Since I’m on the topic of jobs, depending on your environment, you may want to check for job owners that aren’t what they should be. Our jobs should be running under the same generic account we have setup. So I like to check that it hasn’t changed. It may not be a big deal but in some cases, it could be.  

--Check for jobs owners...  

select ‘jobs whose owner is not what is should be…’   

 select  left(SUSER_SNAME(j.owner_sid), 30) as ‘owner’, ‘job_name’=left(rtrim(j.name),40), j.date_created, j.date_modified    

from msdb..sysjobs j    

 where SUSER_SNAME(j.owner_sid) <> ‘MySQLServerOwnerName’    

 order by j.name   

  

In addition, it may be a good idea to check for jobs that are still running and could be stuck for whatever reason. Have you ever come in to work after a nice long weekend only to discover that one of the jobs that should only take 15 minutes to run actually ran all weekend and is still running? Since this has happened to me, I like to check for long running jobs just in case. The code is below:   

 
— Return the list of jobs still running… 

exec msdb.dbo.sp_get_composite_job_info @execution_status = 1 –1:Executing   

     

Okie dokie. I’m almost done.  Next is just a quick check to see if database mail has failed. Database mail is available in SQL Server 2005 and on.  The script below just checks the sysmail_allitems table in msdb to see if the sent_status is anything other than “sent.”  In our case, if database mail isn’t being sent it’s usually because somehow the port became blocked. The default port for database mail is 25, in case you were wondering.    

     

–Check db mail for failures…    

      select ‘Check db mail for failures…’    

      select mailitem_id    

      ,profile_id    

      ,sent_status    

      ,send_request_date    

      ,left(recipients,100) as recipients    

      ,left(subject,100) as [subject]    

      ,left(body,50) as body    

      ,body_format    

      ,send_request_user    

      ,sent_date    

      from msdb.dbo.sysmail_allitems     

      where sent_status<>‘sent’    

      order by send_request_date desc    

     

Sorry that was so long. I honestly don’t know if I’m capable of writing a short post. 🙂 Maybe one day I’ll surprise you all and write a short one. 
Next week I’ll write about checking your drive space, transaction log sizes, and how many databases are on the server. Yes, I’ve been surprised with new mysterious databases showing up unannounced in the past or even some disappearing! Yes, that’s right. Disappearing databases. The last time turned out to be the result of a server admin person removing one of their databases from the SQL cluster without telling us which in turn caused a major script to fail causing our disaster recovery test to fail which was embarrassing for my co-worker. It’s a tale chock full of lessons learned the hard way for all parties involved which I’ll save for next time. Have a great wonderful weekend!     
 
Updated: To view the prior post in this series, please refer to “Ready, Aim, Fire That Trigger!” which is all about detecting system triggers. Next in the series is “The Bermuda Cluster”.
Advertisements

8 thoughts on “Failure? What Failure?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s