I got the idea for this post from a Twitter thread. I thought such discussions were dead but clearly they’re not, and decided to shed some light on this, having dealt with backup at insane scale in a previous life.
It doesn’t matter what a feature is called – can you use it to recover? And, if the answer is yes, how quickly and under which scenarios? And what are the downsides?
Start with failure modes, always!
Before designing a backup/DR/BR solution, you should start by thinking about everything you really want to protect against. For example, here are two straightforward scenarios:
- Accidental deletion of data
- Site disaster
But what about certain insidious scenarios that could cause data loss?
- Corporate sabotage, or a disgruntled admin deleting all your primary storage and all your backups from all sites (it would take some time, but it’s entirely possible and has in fact happened).
- A hacker gaining admin credentials and deleting all your cloud stuff – and you just completed a full move to the cloud! (this has also happened).
- Ransomware encrypting all your data and asking you for a credit card number.
So – the trick is to create a failure mode list for your business, and start thinking about what you are prepared to do in order to protect against all possible eventualities. Then you can start designing a solution.
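A failure mode list doesn't have to be fancy. Here is a minimal sketch of one in Python, using the scenarios above. The `FailureMode` type, the 1 to 5 likelihood scale and every score in it are placeholders I made up for illustration; your own numbers and coverage flags will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    likelihood: int  # 1 (rare) .. 5 (frequent); hypothetical scale
    covered: bool    # does the current design survive this scenario?


def uncovered_by_likelihood(modes):
    """Return the failure modes nothing protects against, most likely first."""
    return sorted(
        (m for m in modes if not m.covered),
        key=lambda m: m.likelihood,
        reverse=True,
    )


# Illustrative entries only; the scores are invented for the example.
modes = [
    FailureMode("accidental deletion", 5, True),
    FailureMode("site disaster", 2, True),
    FailureMode("rogue admin wipes primary and backups", 1, False),
    FailureMode("stolen cloud admin credentials", 2, False),
    FailureMode("ransomware", 3, False),
]

for m in uncovered_by_likelihood(modes):
    print(m.likelihood, m.name)
```

The point of sorting is simply to see which unprotected scenario deserves attention first, before any product gets chosen.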
Why the 3-2-1 rule is not enough
The “golden standard” of backup advice is called the 3-2-1 rule. Simply put:
- 3 total copies of data
- on 2 different media
- with 1 of the copies being offsite
This sounds nice, but it doesn’t do one bit to protect you against an admin with access to all the copies deleting everything.
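To make the gap concrete, here is a rough sketch that checks a set of copies against 3-2-1 and then asks a second question the rule never asks: who can delete every copy? The `Copy` type, the site names and the admin names are all hypothetical, and "can delete" is modeled as a simple set of principals per copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str
    site: str
    deleters: frozenset  # principals able to destroy this copy (assumed model)


def meets_3_2_1(copies, primary_site):
    """3 total copies, on 2 different media, with at least 1 offsite."""
    media = {c.medium for c in copies}
    offsite = [c for c in copies if c.site != primary_site]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1


def can_wipe_everything(copies):
    """Principals holding delete rights on every single copy."""
    if not copies:
        return set()
    return set(frozenset.intersection(*(c.deleters for c in copies)))


# Hypothetical environment: fully 3-2-1 compliant.
copies = [
    Copy("primary", "disk", "hq", frozenset({"storage_admin", "backup_admin"})),
    Copy("local tape", "tape", "hq", frozenset({"backup_admin"})),
    Copy("offsite tape", "tape", "vault", frozenset({"backup_admin"})),
]

print(meets_3_2_1(copies, primary_site="hq"))
print(can_wipe_everything(copies))
```

In this made-up setup the rule is satisfied, yet one principal still appears in the second result. A non-empty result from `can_wipe_everything` is exactly the exposure described above.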
Snapshots are backups – they just don’t cover all eventualities (namely, the Human Factor)
It’s not so much about whether snaps are backups. They are. It’s more about what they protect against.
For instance, it’s pretty easy to comply with the “golden standard” 3-2-1 rule even with snapshots:
- Multiple (virtual) copies locally and remotely, forget just 3 copies… (natural thing for snaps)
- Replication of some of the snaps to another system (and make sure the snapshot tech you use allows completely separating the retention of source vs destination)
- With one of the systems being in a completely different site (ideally at least 100 miles away).
Some people think that just because snap replication is to a “like” system, if one system has a catastrophic bug, the other one will too. That’s almost like thinking that 2 tapes of the same type will both have a problem at the same time. That’s not how stuff fails (well, at least not how quality stuff fails).
The real problem is that the array admin can simply, easily and quickly wipe all snaps from both source and destination.
So it’s not about what’s a backup vs what isn’t. It’s more about who controls the copy data management.
What you can do to protect against malicious intent
It should be clear by now that the human element poses the most danger, and not technology itself.
Something most people are unaware of:
The backup admin is the most dangerous person in IT!
Think about it. Backup software, by definition:
- Controls copy creation (including array snapshot creation and replication, in some solutions)
- Has agents on hosts that typically have 100% access to all data – including complete permission to overwrite all data. How do you think restores happen? 🙂
So, the backup admin could easily do the following:
- Delete all the array snaps (if using the backup software to orchestrate the snaps) – even though they’re not the array admin!
- Ensure all tapes (if using tapes) for that month are recalled and wiped (so much for putting stuff on a totally different medium).
- Use the backup agents to totally wipe all servers.
- Wipe the backup server DB and all its copies.
- <BONUS ROUND> Make a copy of sensitive data to hold ransom – after all, that person has access to all servers. No need for an admin password (domain admins, you really thought your security was a match for the 100% access rights of backup agents?)
Presto – all of a sudden, all current data and all tape backups for the last month, and all snaps, gone forever.
How would you recover from this? What would such data loss mean for your company?
Certain key personnel need to go through an extreme background check
Start considering storage and backup admins as worthy of the most extreme scrutiny. Oh, and the same for whoever has the cloud credentials… 🙂
Have at least one copy of data 100% inaccessible to admins
This is the era of ransomware and worse. Think of some of the following ideas:
- If using tape, use two different offsite tape storage companies, and only allow the admins access to one. Only the CEO should be able to recall media from the other one.
- Use 3 sites, not 2. One site should be much further away. A different continent is ideal, and, perhaps in the future, in geostationary orbit.
- If not using tape, have a completely separate team responsible for an extra copy of data to a tertiary datacenter. The array in that location should not be accessible by any of the admins of the 2 other sites. The copy should be generated by a different method – neither your usual backup software nor your array replication mechanism. The usual admins should not have access to that extra replication mechanism – ideally, they should be oblivious to it. Not easy to pull off.
- Consider tape in addition to everything else, if you are not using it today. Everyone is hating on tape, but it has its place.
I could keep going. You get the point.
Maybe a new “golden standard”? 4-3-3-3-1?
- 4 copies of data
- in 3 locations
- managed by 3 totally different copy creation systems
- managed by 3 different teams that have zero overlapping power with each other
- with 1 copy being completely inaccessible to almost everyone in the company
The sky is the limit!
And for the 100% cloud fans, this means using AWS, Azure and Google Cloud at the same time (there are your 3 “locations”).
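If you want to sanity-check a design against 4-3-3-3-1, something like the sketch below would do. Everything in it is an assumption on my part: the `ManagedCopy` type, the site and team names, and especially the reading of "zero overlapping power" as "no person belongs to two teams". Real privilege overlap is messier than shared membership.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ManagedCopy:
    location: str
    system: str            # tool that created the copy
    team: str              # team that manages the copy
    admin_reachable: bool  # can the regular admins touch it?


def meets_4_3_3_3_1(copies, team_members):
    """Check the proposed rule; team overlap is modeled as shared members."""
    locations = {c.location for c in copies}
    systems = {c.system for c in copies}
    teams = {c.team for c in copies}
    no_overlap = all(
        team_members[a].isdisjoint(team_members[b])
        for a, b in combinations(sorted(teams), 2)
    )
    return (
        len(copies) >= 4
        and len(locations) >= 3
        and len(systems) >= 3
        and len(teams) >= 3
        and no_overlap
        and any(not c.admin_reachable for c in copies)
    )


# Hypothetical layout with invented names.
copies = [
    ManagedCopy("site-a", "array snapshots", "storage", True),
    ManagedCopy("site-b", "array replication", "storage", True),
    ManagedCopy("site-b", "backup software", "backup", True),
    ManagedCopy("site-c", "independent replication", "vault", False),
]
team_members = {"storage": {"alice"}, "backup": {"bob"}, "vault": {"carol"}}

print(meets_4_3_3_3_1(copies, team_members))
```

Put the same person on two teams in `team_members` and the check fails, which is the whole idea: the rule is about people as much as it is about copies.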
Beware of being too paranoid
At some point, you hit the law of diminishing returns. Ask yourself some questions:
- What are you really prepared to pay for?
- What is all this paranoia costing you, in both money and lost agility?
- Are you addressing the failure modes in order of decreasing likelihood, or are you going for the most extreme scenario first?
- Are you trusting the right people?
- Why are your admins after you? Some solutions don’t need technology…
This post has made me a bit nostalgic – I used to have tons of tape libraries that you could literally take a walk in. Nothing wrong with that.