So, how frequently do you really test DR?

It’s after 4AM, I can’t sleep since I’m in pain after a car accident and I’ve had altogether too much caffeine. I’ve already watched 3 movies. BTW, “I am Legend” – WTF! Never have I seen a decent book butchered so much! The ideas in the book were so much stronger. Seriously, go get the book and forget the movie. Sorry, Will.

Now I’m writing from The Throne Chamber once more (blessed be the Colon Drano caffeine). I’m all cramped up and can’t get up, so I thought why not post something… can’t promise it will make sense since my brain ain’t the clearest at the moment…

So – when was the last time you tested DR? Really?

If I had a penny for every time I heard the line “we back up our servers to tape but we don’t test DR, but we’re confident we’ll be up and running within 36 hours in the event of a disaster” I’d be paying Trump more money than he ever made just so he can shine my shoes, and he’d be thankful.

Let me make something clear: You need to test DR a minimum of twice a year, preferably once a quarter. Anything less and you’re just setting yourself up for failure.

Start by testing the most important machines. You probably won’t even have to artificially inject extra problems to solve (Pervy Uncle Murphy usually is right there beside you to take care of that). Marvel at how long it really takes.

If things go real peachy, did you hit your RPO and RTO? if yes, test with more machines, until you can test with the full complement of boxes your company truly needs to be up and running and making money. Document it all.

If you didn’t hit your RPO/RTO, how much did you miss them by? If it’s by a ridiculous amount, maybe the way you’re going about DR will simply not work – try replication and/or VMware…

Once you get good at it, start inventing scenarios. for instance:

– Pretend one of your tapes is bad. See how long your offsite vendor takes to bring you a fresh set once you figure out what are the barcodes you need.
– Pretend one of the critical servers can’t be recovered and you need to go back 3 weeks. How does this affect the business?
– Recover to dissimilar hardware.
– Pretend you’re dead. Are your documented procedures clear enough for your underling to follow? Are they clear enough for the janitor? The janitor’s 3-year-old kid? The kid’s parakeet? Ultimately, your DR runbooks need to be so clear that even your CEO can follow them easily, and he needs to be able to do so right out of bed, before he’s had his morning ablutions, quad-vanilla-soy-latte and his Zoloft.

Ultimately (and sorry if I’m repeating myself), you probably need to be making at least 2 tape copies, 2 copies of your backup catalog, replicating (ideally CDP) and using VMware all at the same time to have any real insurance policy against disaster.

And if you ever tell me “well, we don’t have the time to be doing DR tests” – do you really think you’ll have the time once disaster really strikes?

And, if you think that a disaster is an RGE (Resume Generating Event) then you probably are working for the wrong company and won’t get much job satisfaction there anyway.

I think I’d better get up before I lose my legs.



Leave a comment for posterity...