This has been one of the worst trips ever – because of one of the silliest DR exercises ever

Well, aside from visiting Flames and helping fix a severe customer problem. Those were rewarding. I still haven’t pooped that steak, BTW.

I was only supposed to stay for 1 day in Manhattan, fix the issue, ba da bing. I ended up staying an extra day – had no extra clothes and no time to get anything. Washed my undies on my own and used the hair dryer over a period of hours to dry them. I’ve learned my lesson now and will always have extra stuff with me.

So I try to go back home today and guess what – Air Traffic Control computers had a major glitch that messed up the whole country’s air travel. Thousands of flights delayed and canceled. Mine was canceled, after I spent about 10 hours in the airport. Another 2 hours in line simply to rebook the flight, since they had 3 people trying to serve hordes. And all because, at least according to the report, a system failed and the failover system didn’t have the capacity to sustain the whole load.

So, while I wait in the airport to catch a stand-by flight tomorrow morning, unbathed and frankly looking a bit menacing, I decided to vent a bit. No hotels, no cars.

Maybe this is too much conjecture and if I’m wrong please enlighten me, but let’s enumerate some of the things wrong with this picture:

  1. First things first: While it’s cool to fail over to a completely separate location, typically you want a robust local cluster first so you can fail over to another system in the original location.
  2. If the original location is SO screwed up (meaning that a local cluster has failed, which typically means something really ominous for most places) ONLY THEN do you fail over to another facility altogether.
  3. Last but not least: Whatever facility you fail over to has to have enough capacity (demonstrated during tests) to sustain enough load to let operations proceed. Ideally, for critical systems, the loss of any one site should hardly be noticeable.
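
To make those three rules concrete, here is a rough sketch in Python of how a failover decision could be ordered. Everything in it is made up for illustration: the `Target` class, its fields, and `pick_failover_target` are hypothetical names, and I have no idea how the real system decides anything. The point is only the order of checks: local standby first, remote site second, and nothing gets picked unless its tested capacity covers the load.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Target:
    """A hypothetical failover candidate (illustrative only)."""
    name: str
    local: bool             # True if it sits in the original location
    healthy: bool           # True if it is currently up
    tested_capacity: float  # load it actually sustained in the last DR test


def pick_failover_target(targets: List[Target],
                         required_load: float) -> Optional[Target]:
    """Pick where to fail over, or None if nothing can carry the load."""
    # Rule 3: only consider healthy targets whose *tested* capacity
    # covers the load we need to sustain.
    candidates = [
        t for t in targets
        if t.healthy and t.tested_capacity >= required_load
    ]
    # Rules 1 and 2: prefer a local system; go remote only if no local
    # candidate qualifies. Ties are broken by larger tested capacity.
    candidates.sort(key=lambda t: (not t.local, -t.tested_capacity))
    return candidates[0] if candidates else None


if __name__ == "__main__":
    sites = [
        Target("local-standby", local=True, healthy=False, tested_capacity=100),
        Target("remote-a", local=False, healthy=True, tested_capacity=60),
        Target("remote-b", local=False, healthy=True, tested_capacity=120),
    ]
    choice = pick_failover_target(sites, required_load=100)
    print(choice.name if choice else "nowhere safe to fail over")
```

With the local standby down, the undersized `remote-a` is skipped and `remote-b` is chosen. If nothing qualifies, the function returns `None` rather than dumping the whole load on a site that will cave.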

According to the report, none of the aforementioned simple rules were followed. Someone made the decision to fail over to another facility, which promptly caved under the load. A cascade effect ensued.

I mean, seriously: One of the most important computer systems in the country does not have a well-thought-out and -tested DR implementation. Guys, those are rookie mistakes. Like some airports having 1 link to the outside world, or 2 links but with the same provider. Use some common sense!

So, I guess I’ll put that in the list together with using what’s tantamount to unskilled labor securing our airports instead of highly trained and well-paid personnel that’s been screened extremely intensely and actually takes pride in the job. Maybe some of those unskilled people are running the computers, it might be like the Clone Army in Star Wars. A mass of cheap, expendable labor that collectively has the IQ of my left nut (I’m not being overly harsh – my left nut is quite formidable). The armed forces heading the same way isn’t the most reassuring thought, either.

Yes, I’m upset!!!

