Will your Infrastructure Survive if you Disappear?

This one is dedicated to an old friend who’s been asking me to write about certain technologies, like a very specific filesystem, and whether they should be used as the basis for enterprise storage.

He loves using open source stuff for business, mostly to save money, but also for the control it affords and the sheer pleasure of tinkering (and, I suspect, a modicum of masochistic proclivity).

This isn’t a post against open source (if nothing else, that would be utterly hypocritical since most commercial stuff is at least partially based on open source software).

It’s more about risk mitigation and TCO.

The Acid Test

Think of the infrastructure you’ve built so far. If you disappeared today, how long would it take someone less talented than you to not only keep it from crumbling to dust, but to actually make it grow and be even more productive and reliable over time?

Sure, buying quality, supported products from a viable vendor helps in case the stuff breaks, but what about the customization specific to your organization? Maybe you’ve written cryptic scripts to do this or that… or you may have to perform some very specific steps to successfully complete a DR exercise.

And if you’re heavily reliant on open source for things like apps, operating systems, virtualization, storage and orchestration, just how much custom work have you done to make it all work? Who else could pick this up and run with it?

The cynical and possibly unprofessional answer might be “I don’t care, since I’ve disappeared…” – but for the sake of the argument, let’s assume that someone does care about the infrastructure you’ve built.

Or simply that you want to stop working as hard and hire a minion to do your job.


What Have You Documented?

If you say “it’s all well-documented in my head”, then see this as an opportunity.

Pick the hardest three things you have to do. Can you write down the steps required so someone less skilled than you could follow the steps and do those same things?

Including how to fix the top ten usual things that go wrong?

How could you refine the documentation so that a progressively reduced skill level is required? Because Minions.

What Have You Automated?

So, now you’ve identified all the custom steps you perform to support your infrastructure. And now it’s all documented.

How much of that is repeatable?

How much of it could you automate? Perchance, aside from backups, you could automate things such as:

  • Expanding the infrastructure (especially useful in multi-node-based environments)
  • Shrinking the infrastructure (for instance, ejecting components)
  • Updating firmware on all components (even the tricky ones), especially the high-risk ones
  • Upgrading software (including the hypervisor)
  • Switch configuration
  • Failing over to a DR site
  • Testing DR
  • Creating environments
  • Cloning environments
  • Taking action if a performance or capacity threshold is reached
  • Taking action if security is breached (early detection of ransomware and automatic action, for instance)
  • Taking action if an application-level problem is detected
  • Archival/deletion
  • And, of course, reporting…
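To make one of those items concrete, here is a minimal sketch of "taking action if a capacity threshold is reached". The 80% and 90% thresholds, the function names and the action labels are all assumptions for illustration; a real setup would plug in whatever your monitoring and expansion tooling actually does.

```python
import shutil


def capacity_action(used_bytes, total_bytes, warn=0.80, critical=0.90):
    """Map a fill level to an action label.

    The thresholds and the labels ("ok", "alert", "expand") are
    hypothetical; tune them to your own environment.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= critical:
        return "expand"   # e.g. kick off the documented expansion procedure
    if fraction >= warn:
        return "alert"    # e.g. notify whoever is on call
    return "ok"


def check_path(path="/"):
    """Check a local filesystem using only the standard library."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return capacity_action(usage.used, usage.total)


if __name__ == "__main__":
    print(check_path("."))
```

The point isn't the ten lines of code. It's that the decision ("at what fill level do we act, and what do we do?") is now written down somewhere other than your head, and a minion can read it.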

An example of successful automation and documentation would be expanding a scale-out server-based architecture. An utter novice should be able to do this if the right automation and documentation are in place.

The business value of automation is colossal. This cannot be overstated.

What is the TCO?

Have you done the math on how much your time is worth?

Have you done the math on how much it costs the business if the infrastructure doesn’t work?

A proper TCO calculation has to include not just personnel cost but also the cost of business downtime, especially if you’re taking on risk by selecting free or cheap gear and software to cut CAPEX.

Another item to add to the calculations is the speed of problem resolution with different architectures. Everything will eventually break, but if one thing can be fixed in 30 minutes and another takes a week, what does that difference cost?

Ultimate reliability is another factor. What’s the cost to the business if something is very easy to fix but breaks 50 times a week, versus a technology that is both easy to fix and doesn’t break nearly as frequently?
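If you’ve never actually run the numbers, a rough sketch like the one below is enough to start the conversation. Everything in it is an assumption: the simple cost model (amortized acquisition + admin time + repair time + downtime), the function name, and every dollar figure and incident count, which are made up purely to show the shape of the calculation.

```python
def annual_tco(acquisition, years, admin_hours_per_year, hourly_rate,
               incidents_per_year, hours_to_fix, downtime_cost_per_hour):
    """Rough yearly cost of running a piece of infrastructure.

    Hypothetical model: acquisition is spread evenly over its service
    life, and every hour spent fixing an incident is also an hour of
    business downtime.
    """
    capex_per_year = acquisition / years
    admin = admin_hours_per_year * hourly_rate
    outage_hours = incidents_per_year * hours_to_fix
    repair_labor = outage_hours * hourly_rate
    downtime = outage_hours * downtime_cost_per_hour
    return capex_per_year + admin + repair_labor + downtime


# Made-up inputs: a cheap DIY build vs. a pricier supported product.
diy = annual_tco(acquisition=20_000, years=5,
                 admin_hours_per_year=400, hourly_rate=75,
                 incidents_per_year=6, hours_to_fix=8,
                 downtime_cost_per_hour=5_000)

supported = annual_tco(acquisition=150_000, years=5,
                       admin_hours_per_year=100, hourly_rate=75,
                       incidents_per_year=2, hours_to_fix=0.5,
                       downtime_cost_per_hour=5_000)

print(f"DIY:       {diy:,.0f} per year")
print(f"Supported: {supported:,.0f} per year")
```

With these invented inputs the option that is far cheaper to buy ends up costing several times more per year, almost entirely because of downtime. Your numbers will differ, and they may well favor the DIY route, but the calculation needs to be done either way.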

In Summary

When deciding on a technology, always ask yourself how viable the technology and the vendor are. And then figure out the cost not just for acquisition and implementation, but also for administration, troubleshooting and possible business downtime.

And don’t forget to document and automate no matter what you pick in the end.

