Thursday started like any other day. After a good night’s sleep and a cup of coffee I settled in to fix the latest bugs on SpokenWord.org. I say ‘latest’ but dealing with the aberrant behavior of rogue RSS feeds could easily be a full-time job, as it probably is for many people at Google, Technorati, etc.
I had just added a fix on my development server for feeds that use GUIDs longer than 255 characters (e.g., from Clear Channel) and it was time to test it. As usual, this meant starting with an empty database on the dev box, then scanning the feed in question to create new program records. I’ve done this a thousand times.
DELETE FROM programs;
It wasn’t even a second later that I realized what I’d done. That’s right. I was connected to SpokenWord.org’s live database server, not my development machine. And there wasn’t a thing I could do about it. Thanks (?) to MySQL/InnoDB’s referential integrity and my own orphan-detection scripts that I forgot were still running, deleting all the programs also deleted or damaged the media instances, the titles, the tags, the descriptions, the categories, the ratings and the collections. Hey, stuff happens; what are you gonna do?
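One cheap safeguard against this kind of slip is to route destructive statements through a check that knows which host it is talking to. What follows is only a minimal sketch, not what SpokenWord.org runs: the host names in `DEV_HOSTS`, the function names, and the crude "DELETE without WHERE" heuristic are all assumptions made up for illustration.

```python
# Hypothetical allowlist of hosts where wiping tables is acceptable.
DEV_HOSTS = {"localhost", "127.0.0.1", "dev.example.test"}


def is_destructive(sql):
    """Rough heuristic: DROP, TRUNCATE, or a DELETE with no WHERE clause."""
    words = sql.strip().rstrip(";").upper().split()
    if not words:
        return False
    if words[0] in ("DROP", "TRUNCATE"):
        return True
    return words[0] == "DELETE" and "WHERE" not in words


def check_statement(sql, host):
    """Return sql unchanged, or raise if it is destructive on a non-dev host."""
    if is_destructive(sql) and host not in DEV_HOSTS:
        raise PermissionError(
            "refusing to run %r against non-development host %r" % (sql, host)
        )
    return sql
```

Called as `check_statement("DELETE FROM programs;", host)` just before the statement is sent, this would pass on the dev box and raise on any host not in the allowlist. It does not parse SQL properly, so it is a speed bump rather than a guarantee.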
So I fired off one of those email messages to Sysadmin Tim. He says he knows there’s trouble when I reply to my own messages, adding more details, before he gets to the first one. Tim was busy — he has a day job — but he dropped what he was doing to help.
We have a decent backup strategy. Every night we dump, tar and gzip the entire database. We keep the most-recent seven days’ copies on the database server and also copy them to Amazon S3. And we keep one backup per month forever, or almost. (Not sure why we’d ever use them though.) And hey, as luck would have it, the backup had run just two hours before my fatal mistake!
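The retention rule described above (the last seven days plus one copy per month) can be sketched as a small pure function. This is an illustration under assumptions, not the actual script: the function name is hypothetical, and since the post does not say which backup is kept for each month, the sketch arbitrarily keeps the earliest one.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_days=7):
    """Return the set of backup dates to retain.

    Keeps every backup from the most recent `daily_days` days, plus the
    earliest backup found in each calendar month (an assumption).
    """
    keep = set()
    cutoff = today - timedelta(days=daily_days)
    first_of_month = {}
    for d in sorted(backup_dates):
        if d > cutoff:
            keep.add(d)
        # setdefault only stores the first (earliest) date seen per month.
        first_of_month.setdefault((d.year, d.month), d)
    keep.update(first_of_month.values())
    return keep
```

Anything not in the returned set is a candidate for deletion. Keeping the decision separate from the deleting makes the rule easy to test before it is allowed near real files.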
Only two problems: (1) that recent backup copy appeared to be corrupted, and (2) my script that copied the backups to S3 hadn’t run successfully since January 30, 2009.
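Both problems are the kind a scheduled check could have flagged months earlier. The sketch below is one possible shape for such a check, assuming the backup is a single gzip-compressed file on local disk; the function name, the 26-hour freshness threshold, and the message wording are assumptions for illustration.

```python
import gzip
import os
import time
import zlib


def backup_problems(path, max_age_hours=26, now=None):
    """Return a list of problems with the backup at `path` (empty means OK)."""
    if not os.path.exists(path):
        return ["%s: file is missing" % path]

    problems = []
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600.0
    if age_hours > max_age_hours:
        problems.append("%s: last modified %.0f hours ago" % (path, age_hours))

    try:
        # Decompress the whole stream; corruption or truncation raises here.
        with gzip.open(path, "rb") as fh:
            while fh.read(1024 * 1024):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        problems.append("%s: gzip stream unreadable (%s)" % (path, exc))
    return problems
```

A cron job could run this against the newest local file and mail the list whenever it is non-empty. A readable gzip stream only shows the file was not mangled in transit or on disk; it does not prove the dump restores, which still takes an actual trial restore.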
I won’t bore you with all that happened in between, but 18 hours after the initial disaster, we did succeed in restoring everything on SpokenWord.org to the state it was in two hours before my gaffe. Incredible thanks to Sysadmin Tim for (once again) saving my ass. Just goes to show that you can be sober, well-rested and well-intentioned and still destroy a year’s worth of data with a single click if you’re not careful.
> (1) backup copy appeared to be corrupted
> (2) script that copied the backups to S3 hadn’t run successfully since January 30, 2009.
That seems to happen a lot.
We have lost two months’ worth of Subversion commits that way.
Since no one cares about backups until you need them, no one bothers to monitor the backup scripts (and no one practices the recovery procedures either), and according to Murphy, automated processes fail as soon as you stop watching them.
Refreshingly open and honest, as always, Doug.
I second Thilo’s comment. The rule I’ve always heard, and have actually grokked, is “always test *all* your backups,” but even after having been burned by bad backups in the past I still don’t practice it religiously. I figure it’s a combination of two things: the relief and exhaustion you feel when the restore is finally running again, when the absolute last thing you want to do is test more backups, and the hole-in-the-roof problem, where you don’t think about testing the backups unless you’re having a backup problem.
Glad everything’s back up, though!