Maybe the longest title for a post ever. And one of my longest, most rambling posts ever, it seems.
Recently we did a demo for a customer that I thought opened an interesting can of worms. Let’s set the stage – and, BTW, let it be known that I lost my train of thought multiple times writing this over multiple days so it may seem a bit incoherent (unusually, it wasn’t written in one shot).
The customer at the moment uses DASD and is looking to go to some kind of SAN for all the usual reasons. They were looking at EMC initially, then Dell told them they should look at Equallogic (imagine that). Not that there’s anything wrong with Dell or Equallogic, everything has its place.
So they get the obligatory “throw some sh1t on the wall and see what sticks” quote from Dell – literally Dell just sent them pricing on a few different models with wildly varying performance and storage capacities, apparently without rhyme or reason. I guess the rep figured they could afford at least one of the boxes.
So we start the meeting with yours truly asking the pointed questions, as is my idiom. It transpires that:
- Nobody looked at what their business actually does
- Nobody checked current and expected performance
- Nobody checked current and expected DR SLAs
- Nobody checked growth potential and patterns
- Nobody asked them what functionality they would like to have
- Nobody asked them what functionality they need to have
- Nobody asked how much storage they truly need
- Nobody asked them just how valuable their data is
- Nobody asked them how much money they can really spend, regardless of how valuable their data is and what they need.
So we do the dog-and-pony – and unfortunately, without really asking them anything about money, show them RecoverPoint first, which even worse than showing a Lamborghini (or insert your favorite grail car) to someone that’s only ever used and seen badly-maintained rickshaws, to use a car analogy.
To the uninitiated, EMC’s RecoverPoint is the be-all, end-all CDP (Continuous Data Protection) product, all nicely packaged in appliance format. It used to be Kashya (which seems to mean either “hard question” or “hard problem” in Hebrew), then EMC wisely bought Kashya, and changed the name to something that makes more marketing sense. Before EMC bought them, Kashya was the favorite replication technology of several vendors that just didn’t have anything decent in-place for replication (like Pillar). Obviously, with EMC now owning Kashya, it would look very, very bad if someone tried to sell you a Pillar array and their replication system came from EMC (it comes from FalconStor now). But I digress.
RecoverPoint lets you roll your disks back and forth in time, very much like a super-fine-grained TiVo for storage. It does this by creating a space equal to the space consumed by the original data that acts as a mirror, plus the use of what is essentially a redo log (so to use it locally you need 2x the storage + redo log space). The bigger the redo log, the more you can go back in time (you could literally go back several days). Oh, and they like to call the redo log The Journal.
It works by effectively mirroring the writes so they go to their target and to RecoverPoint. You can implement the “splitter” at the host level, the array (as long as it’s a Clariion from EMC) or with certain intelligent fiber switches using SSM modules (the last option being by far the most difficult and expensive to implement).
In essence, if you want to see a different version of your data, you ask RecoverPoint to present an “image” of what the disks would look like at a specified point-in-time (which can be entirely arbitrary or you can use an application-aware “bookmark”). You can then mount the set of disks the image represents (called a consistency group) to the same server or another server and do whatever you need to do. Obviously there are numerous uses for something like that. Recovering from data corruption while losing the least amount of data is the most obvious use case but you can use it to run what-if scenarios, migrations, test patches, do backups, etc.
You can also use RecoverPoint to replicate data to a remote site (where you need just 1x the storage + redo log). It does its own deduplication and TCP optimizations during replication, and is amazingly efficient (far more so than any other replication scheme in my opinion). They call it CRR (Continuous Remote Replication). Obviously, you get the TiVo-like functionality at the remote side as well.
What’s the kicker is the granularity of CRR/CDP. Obviously, as with anything, there can be no magic, but, given the optimizations it does, if the pipe is large enough you can do near-synchronous replication over distances previously unheard of, and get per-write granularity both locally and remotely. All without needing a WAN accelerator to help out, expensive FC-IP bridges and whatnot.
There’s one pretender that likes to take fairly frequent snapshots but even those are several minutes apart at best, can hurt performance and are limited in their ultimate number. Moreover, their recovery is nowhere near as slick, reliable and foolproof.
To wit: We did demos going back and forth a single transaction in SQL Server 2005. Trading firms love that one. The granularity was a couple of microseconds at the IOPS we were running. We recovered the DB back to entirely arbitrary points in time, always 100% successfully. Forget tapes or just having the current mirrored data!
We also showed Exchange being recovered at a remote Windows cluster. Due to Windows cluster being what it is, it had some issues with the initial version of disks it was presented. The customer exclaimed “this happened to me before during a DR exercise, it took me 18 hours to fix!!” We then simply used a different version of the data, going back a few writes. Windows was happy and Exchange started OK on the remote cluster. Total effort: the time spent clicking around the GUI asking for a different time + the time to present the data, less than a minute total. The guy was amazed at how streamlined and straightforward it all was.
It’s important to note that Exchange suffers more from those issues than other DBs since it’s not a “proper” relational DB like SQL is, the back-end DB is Jet and don’t let me get startedâ€¦ the gist is that replicating Exchange is not always straightforward. RecoverPoint gave us the chance to easily try different versions of the Exchange data, “just in case”.
How would you do that with traditional replication technologies?
How would you do that with other so-called CDP that is nowhere near as granular? How much data would you lose? Is that competing solution even functional? Anyone remember Mendocino? They kinda tried to do something similar, the stuff wouldn’t work right in a pristine lab environment, I gave up on it. RecoverPoint actually works.
Needless to say, the customer loved the demo (they always do, never seen anyone not like RecoverPoint, it’s like crack for IT guys). It solves all their DR issues, works with their stuff, and is almost magical. Problem is, it’s also pretty expensive â€“ to protect the amount of data that customer has they’d almost need to spend as much on RecoverPoint as on the actual storage itself.
Which brings us to the source of the problem. Of course they like the product. But for someone that is considering low-end boxes from Dell, IBM etc. this will be a huge price shock. They keep asking me to see the price, then I hear they’re looking at stuff from HDS and IBM and (no disrespect) that doesn’t make me any more confident that they can afford RecoverPoint.
Our mistake is that we didn’t at first figure out their budget. And we didn’t figure out the value of their data â€“ maybe they don’t need the absolute best DR technology extant since it won’t cost them that much if their data isn’t there for a few hours.
The best way to justify any DR solution is to figure out how much it costs the business if you can achieve, say, 1 day of RTO and 5 hours of RPO vs 5 minutes of RTO and near-zero RPO. Meaning, what is the financial impact to the business for the longer RPO and RTO? And how does it compare to the cost of the lower RPO and RTO recovery solution?
The real issue with DR is that almost no company truly goes through that exercise. Almost everyone says “my data is critical and I can afford zero data loss” but nobody seems to be in touch with reality, until presented with how much it will cost to give them the zero RPO capability.
The stages one goes through in order to reach DR maturity are like the stages of grief â€“ Denial, Anger, Bargaining, Depression, and Acceptance.
Once people see the cost, they hit the Denial stage and do a 180: “You know what, I really don’t need this data back that quickly and can afford a week of data loss!!! I’ll mail punch cards to the DR site!” – typically, this is removed from reality and is a complete knee-jerk reaction to the price.
Then comes Anger – “I can’t believe you charge this much for something essential like this! It should be free! You suck! It’s like charging a man dying of thirst for water! I’ll sue! I’ll go to the competition!”
Then they realize there’s no competition to speak of so we reach the Bargaining stage: “Guys, I’ll give you my decrepit HDS box as a trade-in. I also have a cool camera collection you can have, baseball cards, and I’ll let you have fun with my sister for a week!”
After figuring out how much money we can shave off by selling his HDS box, cameras and baseball cards on ebay and his sister to some sinister-looking guys with portable freezers (whoopsie, he did say only a week), it’s still not cheap enough. This is where Depression sets in. “I’m screwed, I’ll never get the money to do this, I’ll be out of a job and homeless! Our DR is an absolute joke! I’ll be forced to use simple asynchronous mirroring! What if I can’t bring up Exchange again? It didn’t work last time!”
The final stage is Acceptance – either you come to terms with the fact you can’t afford the gear and truly try to build the best possible alternative, or you scrounge up the money somehow by becoming realistic: “well, I’m only gonna use RecoverPoint for my Exchange and SQL box and maybe the most critical VMs, everything else will be replicated using archaic methods but at least my important apps are protected using the best there is”.
It would save everyone a lot of heartache and time if we just jump straight to the Acceptance phase where RecoverPoint is concerned:
- Yes, it really works that well.
- Yes, it’s that easy.
- Yes, it’s expensive because it’s the best.
- Yes, you might be able to afford it if you become realistic about what you need to protect.
- Yes, you’ll have to do your homework to justify the cost. If nothing else, you’ll know how much an outage truly costs your business! Maybe your data is more important than your bosses realize. Or maybe it’s a lot LESS important than what everyone would like to think. Either way you’re ahead!
- Yes, leasing can help make the price more palatable. Leasing is not always evil.
- No, it won’t be free.
- If you have no money at all why are you even bothering the vendors? Read the brochures instead.
- If you have some money please be upfront with exactly how much you can spend, contrary to popular belief not everyone is out to screw you out of all your IT budget. After all we know you can compare our pricing to others’ so there’s no point in trying to screw anyone. Moreover, the best customers are repeat customers, and we want the best customers! Just like with cars, there’s some wiggle room but at some point if you’re trying to get the expensive BMW you do need to have the dough.
Anyway, I rambled enough…