Backups – The Vital Verification

These are the hysterical ramblings of a frustrated DBA. Her relentless mission: to upgrade strange old systems, to seek out new projects and bad design specs, to boldly index where no one has indexed before.

The Incident at Carmulus

DBA’s Log, SELECT GETDATE() as ‘Star Date’.  After many weeks of deliberation and preparation, tomorrow marks the dawn of a new day for the Carmulus system. The Alliance recently passed a not-so-secret-squirrel mandate effective 0800 tomorrow morning. Much rejoicing has commenced throughout the system.  I, for one, am relieved everything seems to be in place.

“Status report, Mr. Plock,” I commanded as I stepped onto the bridge.

“Captain, we’re receiving an alert from the Carmulus system. Their database backup job has failed. Initial reports indicate possible corruption. Manual backup attempts have also failed. However, the server appears to be operating within normal parameters. We have not received any distress signals from the inhabitants.”

“Thank you, Mr. Plock. What about the other databases? Were they backed up?”

“Yes, sir.  The other databases have been backed up successfully. However, the SQL Server error log is reporting a “cyclic redundancy check” message, sir.  I initiated a DBCC CHECKDB command with physical_only, no_infomsgs as well.”

“And the results, Mr. Plock?”

“Output indicates 0 allocation errors, 3 consistency errors in 1 table and 12 consistency errors in the database. The minimum repair level recommended is repair_allow_data_loss.”

“That. Is not a good sign.” I contemplated while sipping my Dulthian latte. “When was the last good backup taken?”not a good sign2

“Sunday night, sir.”

“Check the recovery model on the database. It should be full. Do we have any valid transaction log backups?”

“Yes, sir. We appear to have valid hourly transaction log backups since the last full valid backup on Sunday.”

“Good. I’d rather not risk losing any data using the repair_allow_data_loss option unless we have no other choice. One more thing, Mr. Plock. Have you checked the server event logs by any chance?”

“Sir, the system event logs are reporting the Virtual Disk Service terminated unexpectedly after 1900 hours, a hard disk is reporting a bad block, and a logical drive returned a fatal error.”

“Good. God! It’s worse than I thought!  Mr. Chalulu, patch me through to Engineering!”

“Engineering. This is Chief Engineer Mr. Shcot.”

“Mr. Shcot, as you are in no doubt aware of our current situation, what are our options?”

“Well, Cap’n. Seein’ as how some of the disk errors it’s showing make no sense and the server hasn been updated in several years, I recommend we patch the blimey thing as well as rebootin’ it.”

“Thank you, Mr. Shcot. How much time do you need?”

“Aboot one and a half hours, Cap’n.”

“Mr. Chalulu, contact the Carmulan ambassador and patch her through. I’ll be in my Ready Room.”

“Aye, aye, Captain.”

DBA’s Log, Supplemental. After contacting the Carmulan ambassador and conveying the seriousness of the situation, she has contacted the inhabitants of Carmulus to negotiate an outage.  In the meantime, I have directed my crew to investigate recovery options for the database. Luckily, it is of the 2008 variety and not 2000.

“Status report, Mr. Plock,” I utter as I stagger back onto the bridge and contemplate the contents of that Dulthian latte.

“Sir, using the restore verifyonly command, I verified the full backup from Sunday is valid. I was then able to restore it under a different name. After which I restored all of the transaction log backups up through the current one that just ran. I then ran the DBCC CHECKDB command against it. It’s still valid. Meaning, the inhabitants should not lose any of their data from yesterday and today provided the transaction log backups remain intact.”

“Good work, Mr. Plock. You have the bridge while I ah… complete some ah… paperwork. I’ll be in my quarters.”

DBA’s Log, Supplemental+1.  Preparations are now underway for patching the Carmulan server after hours. The inhabitants have been made aware they risk losing today’s and yesterday’s data the longer we wait. Attempts have been made to convey the dire circumstances we face.  However, they insist we wait until after hours. So be it. We decided against any attempt to repair the actual database due to the risk of data loss. Restoring it from the backups should work in our favor. May the SQL deities have mercy us on our souls tonight, or what’s left of them anyway.

DBA’s Log, SELECT STUFF(Supplemental, 7, 0, ‘waitforit’). After what seems like an endless number of hours of patching, I have declared the mission a success.  The hard disk errors have been eradicated. The database was successfully restored using the full backup from Sunday along with the multitude of transaction log backups. I am also happy to report no loss of data was incurred and backups are functioning properly once again.

Summary below per request. Sorry for the delay.  [Updated 09/24/2013)
Mission Summary: The day before a major system change was to be implemented we discovered that a database backup job failed reporting the database may be corrupted. The manual backup attempts failed as well. The users did not notice any unusual behavior with their system and nothing else seemed wrong.  The error reported was a “cyclic redundancy check”. When we ran “CHECKDB command with physical_only, no_infomsgs”, it showed “0 allocation errors, 3 consistency errors in 1 table and 12 consistency errors in the database. The minimum repair level recommended is repair_allow_data_loss.”

The Windows system event logs also showed the Virtual Disk Service terminated unexpectedly that night, a hard disk reported a bad block, and a logical drive returned a fatal error. After talking with a server admin about it, they recommended patching the server and rebooting it.

Since we had a valid full database backup from the weekend along with hourly transaction log backups, we decided to restore that backup along with all the corresponding transaction logs under a different database name. We then ran the DBCC CHECKDB command against it to verify it wasn’t corrupted. It was fine.  So after the patching completed and fixed the hard disk errors, we restored the database using the full backup from the weekend along with the transaction log backups and all was fine.

Advertisements

4 thoughts on “Backups – The Vital Verification

    1. Oh, Klara Klara Klara… of course I can translate! 😉 I’ll add on something to the end hopefully today. Although, it’s actually really great that you realized I was referencing Star Trek. There is hope for you yet! 😉

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s