Ease Of Use, Backup and Recovery And Efficiency in Modern Disk Arrays – What Questions Should You Really Be Asking the Vendors?

It’s interesting how many storage vendors claim their products are easy to use and, indeed, show nice canned demos full of wizards and elves and whatnot that seem to impress most. There are also grandiose claims of magically reliable hardware and other pixie dust. Ultimately, the reality is that:

  1. Most modern arrays, as long as you’re comparing like to like (i.e. from the same class, same kind of RAID), properly configured, will be reliable enough for most uses
  2. Similarly to #1 above: aside from insane marketing cache IOPS (a certain prominent vendor quotes IOPS numbers not even from cache but from the buffers of the FC ports; how realistic is that?), performance is not crazily different between similar-class boxes with similar numbers of disks. Ultimately, cache runs out and you need to hit the spindles (so boxes that can hold gigantic amounts of cache, such as NetApp with PAM cache boards or EMC’s V-Max with multiple engines, have a leg up there)
  3. There Is No Magic
  4. Almost everyone is using the same bits internally (CPUs, disks, RAM) – with some key enhancements here and there. Don’t let the exact hardware details cloud your judgment. A good example: Let’s say Array X has 2 CPUs at 2GHz and Array Y has 2 CPUs at 3GHz. Unless the arrays come from the same manufacturer and run the same code, it’s VERY difficult to compare. Even if the CPUs are exactly the same, it’s tough to compare. The reason? Running anything (let’s pick Oracle) on the exact SAME hardware may produce wildly different benchmarks depending on whether the OS is Solaris, Linux or Windows, the tunings employed, and whether it’s 64- or 32-bit – the variable here being the OS.
  5. It all comes down to the intelligence, efficiency and reliability of the array software
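
To make point #2 a bit more concrete, here is a rough back-of-the-envelope sketch of why the cache hit ratio dominates the IOPS numbers you get quoted. It is a deliberately simplified model (average service time blended between cache and disk, no queuing effects, no per-spindle limits), and the function name, parameters and latency figures are my own illustrative assumptions, not measurements from any vendor's box.

```python
def effective_iops(hit_ratio, cache_latency_s, disk_latency_s, outstanding_ios=1):
    """Estimate IOPS from a blended average service time.

    Simplified model (an assumption, not a vendor formula):
      avg_latency = hit_ratio * cache_latency + (1 - hit_ratio) * disk_latency
      iops        = outstanding_ios / avg_latency
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    avg_latency = hit_ratio * cache_latency_s + (1.0 - hit_ratio) * disk_latency_s
    return outstanding_ios / avg_latency


if __name__ == "__main__":
    # Hypothetical latencies: 0.1 ms for a cache hit, 5 ms for a spindle read.
    for ratio in (1.0, 0.9, 0.5, 0.0):
        print(f"hit ratio {ratio:.0%}: ~{effective_iops(ratio, 0.0001, 0.005):,.0f} IOPS")
```

The exact figures don't matter. What matters is how quickly the number falls once a meaningful share of the I/O has to come from disk, which is why a benchmark served entirely from cache (or port buffers) tells you little about your workload.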

Some business-related questions to ask the vendor:

  1. How is the support? Is it outsourced or not?
  2. Is the company viable? Is it profitable? Growing? Or is it tiny, struggling and dependent on a single cool feature to woo prospects?
  3. References? Are the customers loving it or is it just OK? A cool one I heard today: “since I stopped using <TLA> my blood pressure dropped”…
  4. What large companies are using the technology? It’s one thing to have a reference from a mom-and-pop shop, and another to have one from Oracle, Microsoft etc. (and have multiple PB deployed inside those large companies)
  5. How many PB is the vendor deploying daily? How many total installations are CURRENTLY UNDER SUPPORT? I don’t care how many there have been since the company’s inception, because a “since inception” number includes people that got RID of the solution.
  6. Is the vendor OK with giving a performance guarantee (e.g. that you will get 100,000 IOPS with your workload) and giving you a 100% refund if they fail to meet the metrics?
  7. To expand on the previous item: Is the vendor OK with doing a “Right of Return”, i.e. letting you return the box if it doesn’t meet some agreed-upon criteria?
  8. Is the vendor OK with doing a Proof-of-Concept?

Prospective customers should probably ask for a bit more detail – and focus on things that will be statistically more important day-to-day than cool features of debatable real-world use. Some technical questions I’d ask:

  • Can I add drives on my own? Easily? Or do I need PS?
  • What requires downtime? Why?
  • What protocols does the array support? Can I use whatever or am I locked in?
  • Do I need extra appliances to support more protocols or are they all truly built-in?
  • Can I expand the ports?
  • Can I switch a LUN so it’s presented via iSCSI instead of FC (or vice versa)?
  • How do I do stuff like add drives to a RAID group? Is it on-the-fly? Do I need to destroy the RAID group?
  • Do I need to add disks in groups or can I add 1-2 if I want?
  • How much realistic protection do the available RAID schemes afford me? And what do I give up?
  • Can I lose any 2 drives in rapid succession without losing data? (Dual-drive loss has happened to various people I know, and to me twice; it’s not as rare as you’d think. I lost data…)
  • Does RAID6 result in a performance decrease?
  • What is the real usable capacity, after RAID, based on real disk capacities (base-2 not base-10) and not marketing? You see, a 1TB drive doesn’t really offer 1TB…
  • Explain all the overheads in the system if best practices are followed – in some systems, even after RAID, 10TB usable is more like 5TB usable…
  • How easy is it to have a LUN span multiple disks in multiple RAID groups for performance? Meaning, in practical terms – do I need to worry about the back-end or will the system just take care of it for me?
  • Do I need to worry about adding disks in certain multiples, especially when dealing with such spanned LUNs?
  • Can I move stuff around the array?
  • How quick is the rebuild of drives?
  • Does the array detect impending failures and fail drives before they actually fail, in order to avoid a parity rebuild?
  • Do I need to care and know a lot about the back-end in order to optimize performance?
  • Is it easy to set up replication? Do I need extra appliances? Can I use FC and/or IP?
  • What’s the replication delta? (some arrays have a pretty huge minimum chunk they need to send over, which can affect RPO)
  • Is compression supported for replication?
  • Regarding replication (both local and remote): Can I set up logical LUN groups that get treated as one in order to maintain consistency?
  • Can I grow a LUN?
  • Can I shrink a LUN?
  • Can I do it all from 1 place (and when I say all I mean all the way to having the LUN visible in the OS as a Filesystem, complete with proper partition offsets) OR do I need to visit like 3 different interfaces? Most vendors focus on the creation of a LUN. Easily creating a LUN is only a small piece of the puzzle!
  • Can I multi-purpose my disks or do I need to dedicate some to NAS, some to FC?
  • Can I prioritize my I/O?
  • Can I prioritize and tune my cache?
  • Do thin provisioning and snapshots adversely impact performance?
  • How many snapshots can I keep?
  • Can I keep a snapshot for, say, a year without messing up my performance and without needing a ton of space?
  • Can I use snapshots to clone LUNs so that they can be used to rapidly provision, say, servers or VMs without occupying too much space?
  • How easily and quickly can I back up and restore?
  • What kind of application integration is available? Some vendors offer basic VSS integration for Windows, but can I, say, recover individual emails and clone DBs without needing to use my backup application? How easily and quickly?
  • What about integration with applications that aren’t on Windows and may even be custom? Is it easy to properly integrate them?
  • Can I increase the cache size if needed? By how much?
  • Can I tier my data?
  • Does it work with the primary backup apps and VMware SRM?
  • Can I get data encryption?
  • Can I get data compression for all kinds of data?
  • Can I get deduplication? And does it work for backup data only or also for my production data so I can save space?
  • What is the deduplication impact?
  • Can I script operations if I want to?
  • What kind of reporting and data gathering is available?
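
On the usable-capacity questions above (base-2 vs. base-10, and the overheads left after RAID), the arithmetic is simple enough to sketch yourself before the vendor does it for you. The snippet below is only an illustration: the function names, the group layout, the spare count and the 10% reserve are all hypothetical assumptions of mine, and real arrays have their own overheads that you need to get from the vendor.

```python
TB = 10**12   # "marketing" terabyte (base-10), in bytes
TIB = 2**40   # binary tebibyte (base-2), in bytes


def tb_to_tib(tb):
    """Convert base-10 terabytes to base-2 tebibytes."""
    return tb * TB / TIB


def usable_tib(drives, drive_tb, group_size, parity_per_group,
               spares=0, reserve_fraction=0.0):
    """Rough usable capacity in TiB after spares, RAID parity and a reserve.

    Assumptions (hypothetical, for illustration only):
      - spares are set aside first
      - the remaining drives form whole RAID groups of `group_size`;
        drives that don't fill a complete group are ignored
      - each group gives up `parity_per_group` drives to parity
      - `reserve_fraction` covers everything else (snapshot reserve,
        filesystem overhead and so on) as one flat percentage
    """
    available = drives - spares
    groups = available // group_size
    data_drives = groups * (group_size - parity_per_group)
    return tb_to_tib(data_drives * drive_tb) * (1.0 - reserve_fraction)


if __name__ == "__main__":
    print(f"1 TB drive = {tb_to_tib(1):.3f} TiB")
    # 16 x 1TB drives, 2 spares, two 7-drive groups with dual parity, 10% reserve
    print(f"Usable: {usable_tib(16, 1, 7, 2, spares=2, reserve_fraction=0.10):.2f} TiB")
```

In this made-up example, 16TB of raw "marketing" capacity ends up a little over 8TiB usable. Ask the vendor to walk you through the same calculation for their box with best practices applied.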

This is not even a comprehensive list, and I’m sure everyone has their own (if you haven’t written your own list down I suggest you do!), but it represents what I feel are features that are realistically valuable.

What do you think? Comments always welcome.

