Category Archives: Backup

Updated blog code, plus a bit about NetApp recovery for cloud providers

Sometime last night/this morning a config file in my blog got corrupted. Maybe it got hacked (I was running an ancient WordPress version 2.1) but at any rate the site was down.

It’s hosted on a large, famous service provider, and they use NetApp gear.

I was able to recover my file through NetApp snapshots – The provider makes this trivial by giving all users a GUI for it that looks like a normal file manager. All self-service.

[Screenshot: the provider’s self-service file restore GUI (godaddy.png)]

No Vblocks, Avamar or Data Domain were harmed in a process that literally took all of one second to complete, most of which was probably spent on JavaScript doing its thing and the browser refreshing. BTW, I hadn’t touched that file since 2006.

This is a good example of storage for service providers doing more than just storing data.

With alternative solutions, a ticket would have to be opened, a helpdesk person would have to use a backup tool to find my file and restore it, then let me know. A whole lot more effort than what happened in this case.

In other news, I’m running the latest WordPress code, the site is now auto-optimized for mobile devices, and things are smooth again. Oh, and the old theme that most seemed to hate is gone. I’ll see if I can find a suitable picture for the header, for now this is OK.

The old version of WordPress I was using had no clean way of exporting content, so if you look at older articles you’ll notice weird characters here and there. I might fix it. Probably not.

D


More FUD busting: Deduplication – is variable-block better than fixed-block, and should you care?

Before all the variable-block aficionados go up in arms, I freely admit variable-block deduplication may overall squeeze more dedupe out of your data.

I won’t go into a laborious explanation of variable vs fixed but, in a nutshell, fixed-block deduplication means that data is split into equal-sized chunks, each chunk is given a signature and compared against a database, and chunks that have already been seen are not stored again.

Variable-block basically means the chunk size is variable, with more intelligent algorithms also having a sliding window, so that even if the content in a file is shifted, the commonality will still be discovered.
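
If that sounds abstract, here’s a minimal, self-contained sketch in Python (my own toy illustration, not NetApp’s, Data Domain’s or anybody else’s actual algorithm) of fixed-size chunking versus a simple content-defined (“variable-block”) chunker, showing why shifting the data defeats one but not the other:

```python
import hashlib
import random
import zlib

def fixed_chunks(data, size=4096):
    """Split data into equal-sized chunks (the last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, mask=0x0FFF, window=48, min_size=1024, max_size=8192):
    """Naive content-defined chunking: cut wherever a checksum of the last
    `window` bytes matches a boundary pattern, so chunk edges follow the
    content itself rather than fixed offsets."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start < min_size:
            continue
        fingerprint = zlib.crc32(data[i - window:i])
        if (fingerprint & mask) == 0 or i - start >= max_size:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def stored_fraction(chunks):
    """Fraction of logical bytes actually stored after chunk-level dedupe."""
    seen, stored = set(), 0
    for c in chunks:
        sig = hashlib.sha256(c).digest()
        if sig not in seen:
            seen.add(sig)
            stored += len(c)
    return stored / sum(len(c) for c in chunks)

# The same data, shifted by a small header: fixed-block loses the commonality
# because every block boundary moves; content-defined chunking re-finds it.
random.seed(0)
base = bytes(random.getrandbits(8) for _ in range(256 * 1024))
shifted = b"27 bytes of inserted header" + base

print("fixed-block stored fraction   :",
      round(stored_fraction(fixed_chunks(base) + fixed_chunks(shifted)), 2))
print("variable-block stored fraction:",
      round(stored_fraction(variable_chunks(base) + variable_chunks(shifted)), 2))
```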

With that out of the way, let’s get to the FUD part of the post.

I recently had a TLA vendor tell my customer: “NetApp deduplication is fixed-block vs our variable-block, therefore far less efficient, therefore you must be sick in the head to consider buying that stuff for primary storage!”

This is a very good example of FUD that is based on accurate facts which, in addition, focuses the customer’s mind on the tech nitty-gritty and away from the big picture (that being “primary storage” in this case).

Using the argument for a pure backup solution is actually valid. But what if the customer is not just shopping for a backup solution? Or, what if, for the same money, they could have it all?

My question is: Why do we use deduplication?

At the most basic level, deduplication will reduce the amount of data stored on a medium, enabling you to buy less of said medium yet still store quite a bit of data.

So, backups were the most obvious place to deploy deduplication. Backup-to-disk is all the rage – what if you could store more backups on target disk with less gear? That’s pretty compelling. In that space you have, of course, Data Domain and the Quantum DXi as two of the more usual backup target suspects.

Another reason to deduplicate is to not only achieve more storage efficiency but also improve backup times, by not re-sending data over the network that has already been transferred. In that space there’s Avamar, PureDisk, Asigra, EVault and others.

NetApp simply came up with a few more reasons to deduplicate, not mutually exclusive with the other 2 use cases above:

  1. What if you could deduplicate your primary storage – typically the most expensive part of any storage investment – and as a result buy less?
  2. What if deduplication could actually dramatically improve your performance in some cases, while not hindering it in most cases? (the cache is deduplicated as well, more info later).
  3. What if deduplication was not limited to infrequently-accessed data but, instead, could be used for high-performance access?

For the uninitiated, NetApp is the only vendor, to date, that can offer block-level deduplication for all primary storage protocols for production data – block and file, FC, iSCSI, CIFS, NFS.

Which is a pretty big deal, as is anything useful AND exclusive.

What the FUD carefully fails to mention is that:

  1. Deduplication is free to all NetApp customers (whoever didn’t have it before can get it via a firmware upgrade for free)
  2. NetApp customers that use this free technology get primary storage savings that I’ve seen range anywhere from 10% to 95%, despite all the limitations the FUD-slingers keep mentioning
  3. It works amazingly well with virtualization and actually greatly speeds things up especially for VDI
  4. Things that would defeat NetApp dedupe will also defeat the other vendors’ dedupe (movies, compressed images, large DBs with a lot of block shuffling). There is no magic.

So, if a customer is considering a new primary storage system, like it or not, NetApp is the only game in town with deduplication across all storage protocols.

Which brings us back to whether fixed-block is less efficient than variable-block:

WHO CARES? If, even with whatever limitations it may have, NetApp dedupe can reduce your primary storage footprint by any decent percentage, you’re already ahead! Heck, even 20% savings can mean a lot of money in a large primary storage system!
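
To put a number on that, a trivial back-of-the-envelope calculation (every figure below is a hypothetical placeholder, so plug in your own):

```python
# What a "modest" dedupe ratio is worth on primary storage.
usable_tb = 100            # primary storage footprint before dedupe (hypothetical)
cost_per_usable_tb = 5000  # fully burdened $/TB: disk, controllers, support (hypothetical)
savings_pct = 0.20         # the "unimpressive" end of the range

tb_saved = usable_tb * savings_pct
print(f"TB you don't have to buy: {tb_saved:.0f}")
print(f"Money you don't have to spend: ${tb_saved * cost_per_usable_tb:,.0f}")
```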

Not bad for a technology given away with every NetApp system…

D

Is EMC under-sizing RecoverPoint and Avamar deals to win business?

It’s been a while since I wrote anything – unlike some, I actually have a day job! Well, at least that’s my excuse.

My admiration for RecoverPoint is well known (see older post, which is referenced internally within EMC as a great pro-RecoverPoint article). It really is a good product and, next to VMware, my favorite EMC acquisition.

So it incenses me when I see a good product being misconfigured, and it reminds me of Hanlon’s Razor: “Never attribute to malice that which can be adequately explained by stupidity.” You see, I’d rather chalk this up to sales not knowing what they’re doing than assume that EMC knows full well the ramifications of their decision and goes ahead and does the dirty deed anyway.

However, I’ve seen multiple cases recently where RecoverPoint and/or Avamar were most decidedly incorrectly sized to support the customer’s workload. The customer likes the price and goes for the solution, only to be in for a nasty surprise later on. Not to worry, everything can be fixed with some more boxes, licenses and hard disks! After all, it’s tough and expensive to rip the stuff out!

To start with RecoverPoint: it can be a wonderful DR tool but, like any tool, it needs to be used correctly in order to be most effective. For instance, there are several aspects to consider when designing a RecoverPoint solution:

  • One needs to take into account the sustained throughput each device can handle (minuscule when compared to the total bandwidth of a CX4 or V-Max), and add extra devices in order to comfortably sustain the throughput the customer needs – even if that means you go beyond the 2-device-per-site RecoverPoint SE maximum and into the realm of “full” RecoverPoint (which can do more than 2 appliances per site, for added performance).
  • To expand on the previous point, assume that one of the RecoverPoint devices is “gravy” and is there to fail over if another box breaks. So, you effectively don’t want to be relying on having the full complement of RecoverPoint boxes working. This is especially important in 2-box RecoverPoint SE configs. If one box breaks (and they’re plain Dell 1950 servers) then that should not be debilitating to your performance while you’re waiting for a new box.
  • Licensing is capacity-based, which also needs to be explained to the customer (including what it means price-wise if you go beyond what RecoverPoint SE will support).
  • There is an absolute ceiling for TB replicated
  • There’s a different price depending on whether you want to do local only, remote only or both kinds of replication (CDP, CRR and CLR licenses)
  • Beware of the increased I/O on the array! When doing any kind of traffic through RecoverPoint, at the very least you get quite a bit more I/O on the “journal” (the redo log part of RecoverPoint) in addition to your main disk. If you want to also do local recovery, you could be doing as much as 3x the I/O! You see, you have to send the normal copy of the data through first, then Clariion splits off the I/Os to RecoverPoint, which then writes data to a full local mirror, then also to the journal. Obviously, the array needs enough fast disks to cope with this.
  • As a corollary to the previous point, to do CDP you need at least 2x the space plus a percentage for the journal (depends on the change rate and how far back in time you want to be able to go – see the sizing sketch after this list)
  • Additionally, you can’t present multiple clones of the data simultaneously, from different points in time – you have to do them one at a time. Could be important in some use cases.
  • Creating a full-speed-access snapshot of your data can take quite a while, again could be important in some cases.
  • Last but not least – RecoverPoint, while efficient, is still subject to the laws of physics, so if you are told you’ll get zero RPO/RTO over a multi-thousand-mile link, stop what you’re doing, email me and I’ll overnight you an industrial-strength cattleprod, gratis… which you can then use on the rep in question.
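
To make the CDP capacity math concrete, here’s a rough sizing sketch based only on the rules of thumb above (production copy, full local mirror, journal). The journal term is my own simplification, change rate times rollback window padded for metadata, so treat it as a sanity check rather than a substitute for the vendor’s sizing tools:

```python
def cdp_capacity_tb(protected_tb, change_gb_per_hour, rollback_hours, journal_pad=1.2):
    """Very rough RecoverPoint CDP footprint: protected data + full local
    mirror + a journal sized from change rate x desired rollback window."""
    journal_tb = change_gb_per_hour * rollback_hours / 1024 * journal_pad
    return protected_tb * 2 + journal_tb, journal_tb

# Hypothetical example: 20 TB protected, 50 GB/hour of writes, 3-day window.
total, journal = cdp_capacity_tb(20, 50, 72)
print(f"Journal: ~{journal:.1f} TB, total capacity consumed: ~{total:.1f} TB")
```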

So – all I’m saying is, ask all the right questions before sending that PO over…

Avamar is a different case altogether. It’s a dedup backup appliance that dedupes the data before it’s sent over the network. It’s very efficient at doing rapid backups over poor WAN connections. You don’t have to pay per-client fees, it supports most major OSes and applications, and is fairly easy to use. However – the original use case for the product was doing centralized backups of multiple small remote sites that are connected via poor links, and it still excels in that. Doing backups of large datasets at the datacenter, on the other hand, is not really what it was designed to do, yet I see it positioned in such a way.

I also see EMC selling really, really small Avamar configs (1-2 boxes), the hope being that dedup will be so effective that it’ll all be a wash in the end. Well – deduplication, in general, is the ultimate “it depends” solution!

Here are some considerations:

  • Not all data deduplicates equally! Make sure you run the EMC dedup estimator not just on fileserver data but also on your DBs! (DBs don’t really dedupe well, and media files and in general anything compressed dedupes even worse). Make sure you really get a good sample of your data analyzed, ideally all of it if possible.
  • If the sizer and dedup tool have only been run for plain fileserver data and that’s not what you have, don’t believe anything you see…
  • Explain your desired retentions and insist you see the Avamar sizer results. A good rule of thumb is that if your data is 5TB, then even with dedupe and compression, you’ll still need about 5TB once you factor in retention, unless you’re one of those rare cases that had tremendous duplication to begin with (a rough sketch of this math follows the list).
  • Make sure you understand the ramifications of not going to the RAIN grid in the first place – if you get a couple of Avamar boxes they can’t be part of the RAIN architecture, and if you lose one then the entire system is down hard. If you have RAIN, you could lose an entire node and it will be OK (kinda like RAID5 for servers) but migrating from non-RAIN to RAIN is non-trivial. Ask for the details. Ideally, even if you don’t need enough capacity to go RAIN, just buy the appliances to go RAIN but don’t buy the capacity licenses (i.e. you could buy 1TB of capacity yet have 5 nodes that theoretically can have a bunch more capacity).
  • Figure out if you want fast backups or fast recovery or both, and choose product accordingly (the fastest recovery is always replication/snapshots of primary data). Remember – usually, the desired end result is to recover, not to back up!
  • Understand exactly how Avamar can go to tape – the solution is not clean and it’s excessively slow. The product is really meant for those that want to go tapeless.
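
And here’s a crude way to sanity-check that rule of thumb. All the ratios are hypothetical placeholders; insist on the actual Avamar sizer output for your own data mix:

```python
def backend_tb(primary_tb, daily_change, retention_days, dedupe_factor, compression=0.7):
    """Crude dedupe-target sizing: first full plus deduped daily changes over
    the retention window, then compressed. Not a replacement for the sizer."""
    initial = primary_tb * dedupe_factor                        # first full, deduped
    changes = primary_tb * daily_change * retention_days * dedupe_factor
    return (initial + changes) * compression

# 5 TB of mixed data, 2% daily change, 90-day retention, 2:1 dedupe:
print(f"~{backend_tb(5, 0.02, 90, 0.5):.1f} TB of back-end disk")  # ~4.9 TB, roughly the 5 TB rule of thumb
```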

That’s all I have for now.

D

 

Should your backups to disk consume more disk than you use for production? Seriously?

So, let’s talk about this not-so-hypothetical customer… They have:

  • A few sites
  • A lot of data per site
  • Much of the data is DBs and Multimedia
  • No replication currently
  • Can’t back up everything currently
  • No proper DR
  • Fairly significant rate of change
  • Not the fastest pipes between sites

They asked me to propose a solution that will back everything up and cross-replicate the backups between the sites. They want to move as far away from tape as possible.

After much deliberation and examination of the data and requirements, we concluded that, in order to back everything up (and to stick to their requirements), they will need about 3x the total amount of production space to achieve backups to disk. That’s even with various kinds of dedupe (I sized the solution with best practices for the usual suspects), because of the rate of change and the large amount of data with poor undedupability (that can’t possibly be a word).
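
The shape of that math looked roughly like the sketch below, with rounded, hypothetical numbers rather than the customer’s actual figures; it shows how a high change rate plus poorly-deduplicating data blows right past production capacity:

```python
# Hypothetical figures for illustration only - not the customer's real numbers.
production_tb  = 100
daily_change   = 0.05    # 5% of production changes per day
retention_days = 60
dedupe_factor  = 0.75    # only ~1.3:1 across DBs and multimedia

logical_tb = production_tb + production_tb * daily_change * retention_days
backend_tb = logical_tb * dedupe_factor
print(f"Logical backup data : {logical_tb:.0f} TB")
print(f"Back-end disk needed: {backend_tb:.0f} TB (~{backend_tb / production_tb:.1f}x production)")
```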

So, we declined to propose a solution. I want to sell something as much as the next guy but primarily I want repeat customers and the only way to get a happy repeat customer is to not screw him the first time… And selling them 3x the space only for backups doesn’t make too much sense to me when they could be spending their money much more wisely.

I explained how it doesn’t make sense to spend that kind of money on disk that’s just for backups! After all, backups are a last resort. My list of preferred methods for recovery (from best to worst):

  1. Local and remote replication + application-aware snapshots
  2. Backups to disk
  3. Backups to tape
  4. Snot, a claw hammer, duct tape and baling wire (sometimes actually works better than tape but anyway…)

Wouldn’t it be a slightly better idea to use maybe 2x the disk, possibly even spend less money compared to the backup-only solution, and instead:

  • Cross-replicate the production data for rapid recovery
  • Achieve full local and remote DR
  • Be able to go back in time with snapshots both locally and remotely
  • Replicate the snapshots themselves automatically
  • Still get dedupe but this time on primary storage (make the current storage last longer)
  • Not need a forklift upgrade (investment protection)
  • Reduce or eliminate tape and reliance on the backup software
  • Get even longer retention than with backups to disk
  • No pipe upgrades
  • Drastically simplify administration
  • Potentially save millions over the next few years!

We’ll see what they decide to do. There was tremendous resistance to what I and a horde of seasoned engineers believe is the proper solution, with all kinds of very reasonable excuses being voiced (“we have no time, no resources, the stakeholders don’t care” etc). However, my position on this is clear. Yes, there’s more short-term pain in order to transform the infrastructure into the utopian vision of the bullets above, but the long-term gains are staggering!

I’ll let everyone know what happened the moment I hear. This one is really interesting…

D


So, what’s the best way to back up VMs?

Backing up VMs is one of those topics nobody seems able to agree on, despite a plethora of reading material on the subject… and maybe because of said plethora.

I will focus on VMware since it is the leading and prevalent virtualization method in the marketplace today (I’m sure the KVM, Xen and Hyper-V fanboys will have their 15 minutes of fame someday).

VMware has several ways for backing up VMs:

  1. Install a backup agent in the VM, just as with a normal client
  2. Back up the entire VM by installing a backup agent in the ESX console
  3. Use VCB (VMware Consolidated Backup).

     

They all have their pros and cons, so the short answer is that there’s no best method; instead you’ll get the “it depends” answer. Sorry. Here’s the skinny on each method:

 

1. Install a backup agent in the VM, just as with a normal client

 

Pros:

  • Everyone understands this, since it works just like a real physical client and can do most of the same things
  • Can do incrementals
  • File-level recovery is straightforward with no confusion as to which VM owns which file
  • Advanced backup features such as DB agents work fine

 

Cons:

  • Impact on the host and network
  • Deployment just as difficult as when using the physical clients
  • Can make backup software licensing more expensive than needed
  • Bare-metal-recovery of VMs only a bit less difficult than with physical boxes

 

2. Back up the entire VM by installing a backup agent in the ESX console

 

Pros:

  • Licensing cost for backup software minimized (1 license needed per ESX server)
  • The entire VM is backed up so recovery is like Bare Metal Recovery – you’ll get the entire box back with a very high probability of success
  • Fast since the virtualization layers are bypassed

 

Cons:

  • Still significant impact on the host and network
  • Cannot restore individual files
  • Advanced backup agents won’t work (no hot backups of SQL or Exchange, for instance)
  • Backups always large since a full backup is required every time
  • Backups take long (see previous point)
  • Requires some scripting knowledge to deploy properly.

 

3. Use VCB (VMware Consolidated Backup).

 

Pros:

  • Works with most backup software
  • Almost no impact on the host or network (backups can be entirely SAN-based)
  • Reduced backup software licensing cost
  • Works with VSS in Windows to provide better backup reliability
  • Allows for incremental backups
  • Uses VM snapshots
  • No disk space used for staging of incrementals
  • Very simple DR
  • File-level backups are possible

 

Cons:

  • Cannot back up RDMs in Physical Compatibility Mode
  • Advanced functionality (file-level backups and application integration) won’t work with non-Windows VMs
  • Cannot back up clustered VMs (i.e. MSCS-clustered VMs can’t be backed up)
  • FullVM backup speed is limited to 1GB/min (a limitation of Windows’ cmd.exe; you can probably get around it by creating multiple threads, but you could still have speed issues if the jobs are large and can’t be broken up)
  • Significant disk space needed for Holding Tank (where FullVM copies are placed)
  • Advanced backup agents will not work
  • File-level backups won’t back up the Windows registry
  • File-level recovery is complex and generally a two-step process

 

The lists could go on but as you can see there are serious wrinkles with all the approaches.

The problem is compounded by the fact that most modern backup software has arcane licensing schemes: charging differently depending on whether an agent is inside a VM or not (CommVault, for instance), or allowing you unlimited agents per ESX server as long as you buy the more expensive client license for the ESX server (NetBackup), and various permutations thereof.

Another wrinkle is Deduplication. Products that do source-based Deduplication such as EMC’s Avamar can comfortably have their agents inside the VMs or in the service console since subsequent backups take only a fraction of the time and there’s almost no space penalty. So, with Avamar one could be doing both kinds of backup (entire VM and individual files) and be covered both ways and only worrying about time and space when reading Hawking’s books… The negative is cost.

NetBackup offers another interesting twist since their implementation of VCB allows individual files to be recovered from a FullVM backup – the rationale being that you use their PureDisk Deduplication to store everything in order to reduce the expense of backup disk.

In the end, the only recommendation I can give that doesn’t depend too much on your individual circumstances is to try and do both file-level and FullVM-type backup so that you’re covered in multiple ways. Then replicate those backups, etc… you know the drill by now.

D

 

 

So, how frequently do you really test DR?

It’s after 4AM, I can’t sleep since I’m in pain after a car accident and I’ve had altogether too much caffeine. I’ve already watched 3 movies. BTW, “I am Legend” – WTF! Never have I seen a decent book butchered so much! The ideas in the book were so much stronger. Seriously, go get the book and forget the movie. Sorry, Will.

Now I’m writing from The Throne Chamber once more (blessed be the Colon Drano caffeine). I’m all cramped up and can’t get up, so I thought why not post something… can’t promise it will make sense since my brain ain’t the clearest at the moment…

So – when was the last time you tested DR? Really?

If I had a penny for every time I heard the line “we back up our servers to tape but we don’t test DR, but we’re confident we’ll be up and running within 36 hours in the event of a disaster” I’d be paying Trump more money than he ever made just so he can shine my shoes, and he’d be thankful.

Let me make something clear: You need to test DR a minimum of twice a year, preferably once a quarter. Anything less and you’re just setting yourself up for failure.

Start by testing the most important machines. You probably won’t even have to artificially inject extra problems to solve (Pervy Uncle Murphy usually is right there beside you to take care of that). Marvel at how long it really takes.

If things go real peachy, did you hit your RPO and RTO? If yes, test with more machines, until you can test with the full complement of boxes your company truly needs to be up and running and making money. Document it all.

If you didn’t hit your RPO/RTO, how much did you miss them by? If it’s by a ridiculous amount, maybe the way you’re going about DR will simply not work – try replication and/or VMware…

Once you get good at it, start inventing scenarios. For instance:

- Pretend one of your tapes is bad. See how long your offsite vendor takes to bring you a fresh set once you figure out which barcodes you need.
- Pretend one of the critical servers can’t be recovered and you need to go back 3 weeks. How does this affect the business?
- Recover to dissimilar hardware.
- Pretend you’re dead. Are your documented procedures clear enough for your underling to follow? Are they clear enough for the janitor? The janitor’s 3-year-old kid? The kid’s parakeet? Ultimately, your DR runbooks need to be so clear that even your CEO can follow them easily, and he needs to be able to do so right out of bed, before he’s had his morning ablutions, quad-vanilla-soy-latte and his Zoloft.

Ultimately (and sorry if I’m repeating myself), you probably need to be making at least 2 tape copies, 2 copies of your backup catalog, replicating (ideally CDP) and using VMware all at the same time to have any real insurance policy against disaster.

And if you ever tell me “well, we don’t have the time to be doing DR tests” – do you really think you’ll have the time once disaster really strikes?

And, if you think that a disaster is an RGE (Resume Generating Event) then you probably are working for the wrong company and won’t get much job satisfaction there anyway.

I think I’d better get up before I lose my legs.

Nighty-night

D

A word of caution when setting up a deduplicating VTL

Based on some recent experiences I wanted to make people aware of some caveats with setting up a VTL with deduplication. This is specifically regarding the EMC DL3D (AKA Quantum DXi) but applies to all of them. This will be a mercifully short and to the point post. Here’s the rub:

  • Create small virtual tapes (100GB max, I’d go even smaller, obviously depends on your environment)
  • Create a bunch of virtual tape drives (you might have to create 20-30!)
  • Do NOT, I repeat, do NOT multiplex in the backup software! It screws up the deduplication algorithm.
  • Do not compress the data before the backup
  • Do not encrypt the data
  • Be mindful of your retention policies, start gently then work your way up.
  • I’d personally not multi-stream a server at all, just so I can keep the tape utilization high. What I mean: Say you do not do multiplexing but you are multistreaming – i.e. you’re sending 10 streams from your client. This means you will need 10 tapes without multiplexing, so you’ll end up writing a tiny bit on each tape. It doesn’t take a genius to realize that you’ll end up with a ton of tapes with not much data on them, which will cause them to be appended to with more tiny amounts of data, which will in turn cause them to expire way later than you’d like (see the toy example after this list).
  • If you can use the box as NAS and know how to get the throughput up there then do so, that way there’s no issue with multiple streams. My Data Domain boys are chuckling now (they always prefer to do NAS, but that also has to do with the fact that their box can’t really do VTL properly yet. Oh, the cattiness! BTW my company does sell quite a lot of their stuff).
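
To see what that does to tape utilization, here’s a toy illustration (hypothetical numbers, and it deliberately follows the simplification above that each concurrent stream lands on its own virtual tape):

```python
# Toy illustration, hypothetical numbers: many streams + small nightly backups
# means lots of barely-filled virtual tapes that keep getting appended to and
# therefore expire much later than you'd like.
virtual_tape_gb    = 100
clients            = 20
streams_per_client = 10
gb_per_client      = 50     # nightly backup size per client

tapes_touched = clients * streams_per_client            # one tape per stream
fill_per_tape = (gb_per_client / streams_per_client) / virtual_tape_gb
print(f"Multistreamed : {tapes_touched} tapes at {fill_per_tape:.0%} average fill")
print(f"Single stream : {clients} tapes at {gb_per_client / virtual_tape_gb:.0%} fill")
```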

The same rules apply otherwise as in my previous post about tuning NetBackup for large environments.

Regarding using the DL3D/DXi as NAS: Plug in as many GigE ports as you can, but make sure your switch can do straight-up EtherChannel (not LACP). So you pretty much need to have a “proper” Cisco switch in order to get the full benefit. Then use multiple media servers. Use a separate NAS share per media server. Team the NICs on the backup servers for performance (do LACP or PAgP there, whatever works with the server’s NIC software). Then call me in the morning.

D

 

What is the value of your data? Do you have the money to implement proper DR that works? How are you deciding what kind of storage and DR strategy you’ll follow? And how does Continuous Data Protection like EMC’s RecoverPoint help?

Maybe the longest title for a post ever. And one of my longest, most rambling posts ever, it seems.

Recently we did a demo for a customer that I thought opened an interesting can of worms. Let’s set the stage – and, BTW, let it be known that I lost my train of thought multiple times writing this over multiple days so it may seem a bit incoherent (unusually, it wasn’t written in one shot).

The customer at the moment uses DASD and is looking to go to some kind of SAN for all the usual reasons. They were looking at EMC initially, then Dell told them they should look at Equallogic (imagine that). Not that there’s anything wrong with Dell or Equallogic… everything has its place.

So they get the obligatory “throw some sh1t on the wall and see what sticks” quote from Dell – literally Dell just sent them pricing on a few different models with wildly varying performance and storage capacities, apparently without rhyme or reason. I guess the rep figured they could afford at least one of the boxes.

So we start the meeting with yours truly asking the pointed questions, as is my idiom. It transpires that:

  1. Nobody looked at what their business actually does
  2. Nobody checked current and expected performance
  3. Nobody checked current and expected DR SLAs
  4. Nobody checked growth potential and patterns
  5. Nobody asked them what functionality they would like to have
  6. Nobody asked them what functionality they need to have
  7. Nobody asked how much storage they truly need
  8. Nobody asked them just how valuable their data is
  9. Nobody asked them how much money they can really spend, regardless of how valuable their data is and what they need.

So we do the dog-and-pony and, unfortunately, without really asking them anything about money, show them RecoverPoint first, which is even worse than showing a Lamborghini (or insert your favorite grail car) to someone who’s only ever used and seen badly-maintained rickshaws, to use a car analogy.

To the uninitiated, EMC’s RecoverPoint is the be-all, end-all CDP (Continuous Data Protection) product, all nicely packaged in appliance format. It used to be Kashya (which seems to mean either “hard question” or “hard problem” in Hebrew), then EMC wisely bought Kashya, and changed the name to something that makes more marketing sense. Before EMC bought them, Kashya was the favorite replication technology of several vendors that just didn’t have anything decent in-place for replication (like Pillar). Obviously, with EMC now owning Kashya, it would look very, very bad if someone tried to sell you a Pillar array and their replication system came from EMC (it comes from FalconStor now). But I digress.

RecoverPoint lets you roll your disks back and forth in time, very much like a super-fine-grained TiVo for storage. It does this by keeping a full local mirror of the original data (a space equal to the space consumed by that data), plus what is essentially a redo log, so to use it locally you need 2x the storage plus the redo log space. The bigger the redo log, the further you can go back in time (you could literally go back several days). Oh, and they like to call the redo log The Journal.
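
For a rough feel of how big that journal needs to be for a given “TiVo window”, here’s my own back-of-envelope with hypothetical numbers (real journals also hold metadata and bookmarks, so treat this as an upper bound):

```python
def rollback_window_hours(journal_tb, sustained_write_mb_per_sec, journal_pad=1.25):
    """How far back a journal of a given size lets you roll, assuming every
    write is logged; the padding roughly accounts for journal metadata."""
    usable_mb = journal_tb * 1024 * 1024 / journal_pad
    return usable_mb / sustained_write_mb_per_sec / 3600

# Hypothetical: a 2 TB journal absorbing a sustained 40 MB/s of writes.
print(f"~{rollback_window_hours(2, 40):.0f} hours of rollback")   # ~12 hours
```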

It works by effectively mirroring the writes so they go to their target and to RecoverPoint. You can implement the “splitter” at the host level, the array (as long as it’s a Clariion from EMC) or with certain intelligent fiber switches using SSM modules (the last option being by far the most difficult and expensive to implement).

In essence, if you want to see a different version of your data, you ask RecoverPoint to present an “image” of what the disks would look like at a specified point-in-time (which can be entirely arbitrary or you can use an application-aware “bookmark”). You can then mount the set of disks the image represents (called a consistency group) to the same server or another server and do whatever you need to do. Obviously there are numerous uses for something like that. Recovering from data corruption while losing the least amount of data is the most obvious use case but you can use it to run what-if scenarios, migrations, test patches, do backups, etc.

You can also use RecoverPoint to replicate data to a remote site (where you need just 1x the storage + redo log). It does its own deduplication and TCP optimizations during replication, and is amazingly efficient (far more so than any other replication scheme in my opinion). They call it CRR (Continuous Remote Replication). Obviously, you get the TiVo-like functionality at the remote side as well.

The kicker is the granularity of CRR/CDP. Obviously, as with anything, there can be no magic but, given the optimizations it does, if the pipe is large enough you can do near-synchronous replication over distances previously unheard of, and get per-write granularity both locally and remotely. All without needing a WAN accelerator to help out, expensive FC-IP bridges and whatnot.

There’s one pretender that likes to take fairly frequent snapshots but even those are several minutes apart at best, can hurt performance and are limited in their ultimate number. Moreover, their recovery is nowhere near as slick, reliable and foolproof.

To wit: We did demos going back and forth a single transaction in SQL Server 2005. Trading firms love that one. The granularity was a couple of microseconds at the IOPS we were running. We recovered the DB back to entirely arbitrary points in time, always 100% successfully. Forget tapes or just having the current mirrored data!

We also showed Exchange being recovered at a remote Windows cluster. Due to Windows cluster being what it is, it had some issues with the initial version of disks it was presented. The customer exclaimed “this happened to me before during a DR exercise, it took me 18 hours to fix!!” We then simply used a different version of the data, going back a few writes. Windows was happy and Exchange started OK on the remote cluster. Total effort: the time spent clicking around the GUI asking for a different time + the time to present the data, less than a minute total. The guy was amazed at how streamlined and straightforward it all was.

It’s important to note that Exchange suffers more from those issues than other DBs, since it’s not a “proper” relational DB like SQL Server; the back-end DB is Jet, and don’t get me started… the gist is that replicating Exchange is not always straightforward. RecoverPoint gave us the chance to easily try different versions of the Exchange data, “just in case”.

How would you do that with traditional replication technologies?

How would you do that with other so-called CDP that is nowhere near as granular? How much data would you lose? Is that competing solution even functional? Anyone remember Mendocino? They kinda tried to do something similar; the stuff wouldn’t work right even in a pristine lab environment, so I gave up on it. RecoverPoint actually works.

Needless to say, the customer loved the demo (they always do, never seen anyone not like RecoverPoint, it’s like crack for IT guys). It solves all their DR issues, works with their stuff, and is almost magical. Problem is, it’s also pretty expensive – to protect the amount of data that customer has they’d almost need to spend as much on RecoverPoint as on the actual storage itself.

Which brings us to the source of the problem. Of course they like the product. But for someone that is considering low-end boxes from Dell, IBM etc. this will be a huge price shock. They keep asking me to see the price, then I hear they’re looking at stuff from HDS and IBM and (no disrespect) that doesn’t make me any more confident that they can afford RecoverPoint.

Our mistake is that we didn’t at first figure out their budget. And we didn’t figure out the value of their data – maybe they don’t need the absolute best DR technology extant since it won’t cost them that much if their data isn’t there for a few hours.

The best way to justify any DR solution is to figure out how much it costs the business if you can achieve, say, 1 day of RTO and 5 hours of RPO vs 5 minutes of RTO and near-zero RPO. Meaning, what is the financial impact to the business for the longer RPO and RTO? And how does it compare to the cost of the lower RPO and RTO recovery solution?
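
That exercise is just arithmetic. Here’s a sketch of it; every figure is a hypothetical placeholder, since the point is the comparison, not the numbers:

```python
def outage_cost(rto_hours, rpo_hours, revenue_per_hour, rework_per_lost_hour):
    """Cost of one incident: downtime while recovering + redoing lost work."""
    return rto_hours * revenue_per_hour + rpo_hours * rework_per_lost_hour

# Hypothetical business figures.
cheap_dr = outage_cost(rto_hours=24,  rpo_hours=5,    revenue_per_hour=20_000, rework_per_lost_hour=5_000)
cdp_dr   = outage_cost(rto_hours=0.1, rpo_hours=0.01, revenue_per_hour=20_000, rework_per_lost_hour=5_000)

print(f"Exposure with tape/async DR : ${cheap_dr:,.0f} per incident")
print(f"Exposure with near-zero RPO : ${cdp_dr:,.0f} per incident")
print(f"Difference                  : ${cheap_dr - cdp_dr:,.0f}")
# If the CDP solution costs less than (difference x expected incidents over its
# life), the business case writes itself; if not, you just learned your data is
# worth less than you thought. Either way you're ahead.
```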

The real issue with DR is that almost no company truly goes through that exercise. Almost everyone says “my data is critical and I can afford zero data loss” but nobody seems to be in touch with reality, until presented with how much it will cost to give them the zero RPO capability.

The stages one goes through in order to reach DR maturity are like the stages of grief – Denial, Anger, Bargaining, Depression, and Acceptance.

Once people see the cost, they hit the Denial stage and do a 180: “You know what, I really don’t need this data back that quickly and can afford a week of data loss!!! I’ll mail punch cards to the DR site!” – typically, this is removed from reality and is a complete knee-jerk reaction to the price.

Then comes Anger – “I can’t believe you charge this much for something essential like this! It should be free! You suck! It’s like charging a man dying of thirst for water! I’ll sue! I’ll go to the competition!”

Then they realize there’s no competition to speak of so we reach the Bargaining stage: “Guys, I’ll give you my decrepit HDS box as a trade-in. I also have a cool camera collection you can have, baseball cards, and I’ll let you have fun with my sister for a week!”

After figuring out how much money we can shave off by selling his HDS box, cameras and baseball cards on ebay and his sister to some sinister-looking guys with portable freezers (whoopsie, he did say only a week), it’s still not cheap enough. This is where Depression sets in. “I’m screwed, I’ll never get the money to do this, I’ll be out of a job and homeless! Our DR is an absolute joke! I’ll be forced to use simple asynchronous mirroring! What if I can’t bring up Exchange again? It didn’t work last time!”

The final stage is Acceptance – either you come to terms with the fact you can’t afford the gear and truly try to build the best possible alternative, or you scrounge up the money somehow by becoming realistic: “well, I’m only gonna use RecoverPoint for my Exchange and SQL box and maybe the most critical VMs, everything else will be replicated using archaic methods but at least my important apps are protected using the best there is”.

It would save everyone a lot of heartache and time if we just jump straight to the Acceptance phase where RecoverPoint is concerned:

  • Yes, it really works that well.
  • Yes, it’s that easy.
  • Yes, it’s expensive because it’s the best.
  • Yes, you might be able to afford it if you become realistic about what you need to protect.
  • Yes, you’ll have to do your homework to justify the cost. If nothing else, you’ll know how much an outage truly costs your business! Maybe your data is more important than your bosses realize. Or maybe it’s a lot LESS important than what everyone would like to think. Either way you’re ahead!
  • Yes, leasing can help make the price more palatable. Leasing is not always evil.
  • No, it won’t be free.
  • If you have no money at all why are you even bothering the vendors? Read the brochures instead.
  • If you have some money please be upfront with exactly how much you can spend, contrary to popular belief not everyone is out to screw you out of all your IT budget. After all we know you can compare our pricing to others’ so there’s no point in trying to screw anyone. Moreover, the best customers are repeat customers, and we want the best customers! Just like with cars, there’s some wiggle room but at some point if you’re trying to get the expensive BMW you do need to have the dough.

     

Anyway, I rambled enough…

 

D

    

This has been one of the worst trips ever – because of one of the silliest DR exercises ever

Well, aside from visiting Flames and helping fix a severe customer problem. Those were rewarding. I still haven’t pooped that steak, BTW.

I was supposed to only stay for 1 day in Manhattan, fix the issue, ba da bing. I ended up staying an extra day – had no extra clothes and no time to get anything. Washed my undies on my own and used the hair dryer over a period of hours to dry them. I learned my lesson now and will always have extra stuff with me.

So I try to go back home today and guess what – Air Traffic Control computers had a major glitch (abcnews.go.com/Business/wireStory?id=3259992) that messed up the whole country’s air travel. Thousands of flights delayed and canceled. Mine was canceled, after I spent about 10 hours in the airport. Another 2 hours in the line to simply rebook the flight since they had 3 people trying to serve hordes. And all because, at least according to the report, a system failed and the failover system didn’t have the capacity to sustain the whole load.

So, while I wait in the airport to catch a stand-by flight tomorrow morning, unbathed and frankly looking a bit menacing, I decided to vent a bit. No hotels, no cars.

Maybe this is too much conjecture and if I’m wrong please enlighten me, but let’s enumerate some of the things wrong with this picture:

  1. First things first: While it’s cool to fail over to a completely separate location, typically you want a robust local cluster first so you can fail over to another system in the original location.
  2. If the original location is SO screwed up (meaning that a local cluster has failed, which typically means something really ominous for most places) ONLY THEN do you fail over to another facility altogether.
  3. Last but not least: Whatever facility you fail over to has to have enough capacity (demonstrated during tests) to sustain enough load to let operations proceed. Ideally, for critical systems, the loss of any one site should hardly be noticeable.

According to the report none of the aforementioned simple rules were followed. Someone made the decision to fail over to another facility, which promptly caved under the load. A cascade effect ensued.

I mean, seriously: One of the most important computer systems in the country does not have a well-thought-out and -tested DR implementation. Guys, those are rookie mistakes. Like some airports having 1 link to the outside world, or 2 links but with the same provider. Use some common sense!

So, I guess I’ll put that in the list together with using what’s tantamount to unskilled labor securing our airports instead of highly trained and well-paid personnel that’s been screened extremely intensely and actually takes pride in the job. Maybe some of those unskilled people are running the computers, it might be like the Clone Army in Star Wars. A mass of cheap, expendable labor that collectively has the IQ of my left nut (I’m not being overly harsh – my left nut is quite formidable). The armed forces heading the same way isn’t the most reassuring thought, either.

Yes, I’m upset!!!

[Image: gorilla]

D

Ate at Trotter’s Tavern in Bowling Green, OH

I had some great customer meetings in OH this week. One meeting took me to Bowling Green, cute town.

The locals like to eat steak at Trotter’s Tavern. They only serve fist-sized and -shaped chunks of sirloin in some weird sauce that has at least some Worcestershire in it but is more tangy. No other cut choices, you get either 10 or 16 ounces and that’s it.

I asked the waitress how it was aged and got a blank stare back. I could almost read her mind: “we just defrost it in the microwave”.

Well, had it been cooked properly it might have been OK, but mine was well-done (which I hadn’t asked for). Ate it anyway, as is my idiom, but I can’t say I recommend the place. Maybe if you get the 10-ouncer and ask for medium rare it might be medium by the time you get it. It’s tough to cook a thick piece of meat properly.

At least the place is relatively inexpensive, their most expensive piece is $25 and comes with all the trimmings.

There was one weird thing though: The restroom was festooned with carvings (yes, carvings) asserting the gayness of various people.

D