Updated blog code, plus a bit about NetApp recovery for cloud providers

Sometime last night/this morning a config file in my blog got corrupted. Maybe it got hacked (I was running an ancient WordPress version 2.1) but at any rate the site was down.

It’s hosted on a large, famous service provider, and they use NetApp gear.

I was able to recover my file through NetApp snapshots. The provider makes this trivial by giving all users a GUI for it that looks like a normal file manager. All self-service.

[Image: godaddy.png]

No Vblocks, Avamar or Data Domain were harmed in the process, which literally took all of one second to complete. Most of that time was probably spent on JavaScript doing its thing and the browser refreshing. BTW, I hadn’t touched that file since 2006.

This is a good example of storage for service providers doing more than just storing data.

With alternative solutions, a ticket would have to be opened, a helpdesk person would have to use a backup tool to find my file and restore it, then let me know. A whole lot more effort than what happened in this case.

In other news, I’m running the latest WordPress code, the site is now auto-optimized for mobile devices, and things are smooth again. Oh, and the old theme that most seemed to hate is gone. I’ll see if I can find a suitable picture for the header; for now this is OK.

If only that old version of WordPress I was using had a clean way of exporting stuff. If you look at older articles you’ll notice weird characters here and there. I might fix it. Probably not.

D


Questions to ask EMC regarding their new VNX systems…

It’s that time of the year again. The usual websites are busy with news of the upcoming EMC midrange refresh called VNX. And records being broken.

(NEWSFLASH: Watching the webcast now, the record they kept saying they would break ended up being some guy jumping over a bunch of EMC arrays with a motorcycle – and here I was hoping to see some kind of performance record…)

I’m not usually one to rain on anyone’s parade, but I keep seeing the word “unified” a lot and, based on what I’m seeing, it’s all more of the same, albeit with newer CPUs, a different faceplate, and (join the club) SAS. I’m sure the new systems will be faster courtesy of faster CPUs, more RAM and SAS. But are they offering something materially closer to a unified architecture?

Note that I’m not attacking anything in the EMC announcement, merely the continued “unified” claim. I’m sure the new Data Domain, Isilon and Vmax systems are great.

So here are some questions to ask EMC regarding VNX – I’ll keep this as a list instead of a more verbose entry to keep things easy for the ADD-afflicted and allow easier copy-paste into emails :)

  1. Let’s say I have a 100TB VNX system. Let’s say I allocate all 100TB to NAS. Then let’s say that all the 100TB is really chewed up in the beginning but after a year my real data requirements are more like 70TB. Can I take that 30TB I’m not using any more and instantly use it for FC? Since it’s “unified” and all? Without breaking best practices for LUN allocation to Celerra? Or is it forever tied to the NAS part and I have to buy all new storage if I don’t want to destroy what’s there and start from scratch?
  2. Is the VNX (or even the NS before it) 3rd-party verified as an over 5-nines system? (I believe the CX is but is the CX/NS combo?)
  3. How is the architecture of these boxes any different than before? It looks like you still have 2 CX SPs, then some NAS gateways. Seems like very much the same overall architecture and there’s (still) nothing unified about it. I call for some truth in advertising! Only the little VNXe seems materially different (not in the software but in the number of blades it takes to run it all).
  4. Are the new systems licensed by capacity?
  5. Can the new systems use more than the 2TB of FAST Cache?
  6. On the subject of cache, what is the best practice regarding the minimum number of SSDs to use for cache? Is it 8? How many shelves/buses should they be distributed on?
  7. What is the best practice regarding cache oversubscription and how is this sized?
  8. Since the FAST Cache can also cache writes, what are the ramifications if the cache fails? How many customers have had this happen? After all, we are talking about SSDs, and even mirrored SSDs are much less reliable than mirrored RAM.
  9. What’s the granularity for using RecoverPoint to replicate the NAS piece? Seems like it needs to replicate everything NAS as one chunk as a large consistency group, with Celerra Replicator needed for more granular replication.
  10. What’s the granularity for recovering NAS with RecoverPoint? Seems like you can’t do things by file or by volume even. The entire data mover may need to be recovered in one go, regardless of the volumes within.
  11. When using RecoverPoint, does one need to not use storage pools for certain operations? And what does that mean regarding the complexity of implementation?
  12. Speaking of storage pools, when are they recommended, when not, and why? And what does that mean about the complexity of administration?
  13. What functionality does one lose if one does not use pools?
  14. Can one prioritize FAST Cache in pool LUNs or is cache simply on or off for the entire pool?
  15. Can I do a data-in-place upgrade from CX3 or CX4 or is this a forklift upgrade?
  16. Why is FASTv2 not recommended for Exchange 2010 and various other DBs?
  17. If Autotiering is not really applicable to many workloads, what is it really good for?
  18. What is the percentage of flash needed to properly do autotiering on VNX? (it’s only 3% on VMAX since it uses a 7MB page, but VNX uses a 1GB page, which is far more inefficient). Why is FAST still at the grossly inefficient 1GB chunk?
  19. Can FAST on the VNX exclude certain time periods that can confuse the algorithms, like when backups occur?
  20. Is file-level FAST still a separate system?
  21. Why does the low-end VNXe not offer FC?
  22. Can I upgrade from VNXe to VNX?
  23. Does the VNXe offer FAST?
  24. Can a 1GB chunk span RAID groups or is performance limited to 1 RAID group’s worth of drives?
  25. Why are functions like block, NAS and replication still in separate hardware and software?
  26. Why are there still 2 kinds of snapshotting systems?
  27. Are the block snaps finally without a huge write performance impact? How about the NAS snaps?
  28. Are the snaps finally able to be retained for years if needed?
  29. Why are there 4 kinds of replication? (MirrorView, Celerra Replicator, RecoverPoint, SAN Copy)
  30. Why are there still all these OSes to patch? (Win XP in the SPs, Linux on the Control Station and RecoverPoint, DART on the NAS blades, maybe more if they can run Rainfinity and Atmos on the blades as well)
  31. Why still no dedupe for FC and iSCSI?
  32. Why no dedupe for memory and cache?
  33. Why not sub-file dedupe?
  34. Why is Celerra still limited to 256TB per data mover?
  35. Is Celerra still limited to 16TB per volume? Or is yet another, completely separate system (Isilon) needed to do that?
  36. Is Celerra still limited to not being able to share a volume between data movers? Or is, again, Isilon needed to do that?
  37. Can Celerra non-disruptively move CIFS and NFS volumes between data movers?
  38. Why can there not be a single FCoE link to transfer all the protocols if the boxes are “unified”?
  39. Have the thin provisioning performance overheads been fixed?
  40. Have the pool performance bottlenecks been fixed? Or is it still recommended to use normal RAID LUNs for highest performance?
  41. Can one actually stripe/restripe within a FLARE pool now? When adding storage? With thin provisioning?
  42. What is the best practice for expanding, say, a 50 drive pool? How many drives do I have to expand by? Why?
  43. Does one still need to do a migration to use thin provisioning?
  44. Does one need to do yet another migration to “re-thin” a LUN once it gets temporarily chunky?
  45. Have the RAID5 and RAID6 write inefficiencies been fixed? And how?
  46. Will the benchmarks for the new systems use RAID6 or will they, again, show RAID10? After all, most customers don’t deploy RAID10 for everything, and RAID5 is thousands of times less reliable than RAID6. How about some SPC-1 benchmarks?
  47. Why is EMC still not fessing up to using a filesystem for their new pools? Maybe because they keep saying doing so is not a “real” SAN, even in recent communication?
  48. Since EMC is using a filesystem in order to get functionality in the CX SPs like pools, thin provisioning, compression and auto-tiering (and probably dedupe in the future), how are they keeping fragmentation under control? (how the tables have turned!)

What I notice is a lack of thought leadership when it comes to technology innovation: EMC is still playing catch-up with other vendors in many important architectural areas, and keeps buying companies left and right to plug portfolio holes. All vendors play catch-up to some extent; the trick is finding the one playing catch-up in the fewest areas and leading in the most, with the fewest compromises.

Some areas of NetApp leadership to answer a question in the comments:

  • First Unified architecture (since 2002)
  • First with RAID that has the space efficiency of RAID5, the performance of RAID10 and the reliability of RAID6
  • First with block-level deduplication for all protocols
  • First with zero-impact snapshots
  • First with Megacaches (up to 16TB cache per system possible)
  • First with VMware integration including VM clones
  • First with space- and time-efficient, integrated replication for all protocols
  • First with snapshot-based archive storage (being able to store different versions of your data for years on nearline storage)
  • First with Unified Connect and FCoE – single cable capability for all protocols (FC, iSCSI, NFS, CIFS)

However, EMC is strong when it comes to marketing, messaging and – wait for it – the management part. Since it’s amazingly difficult to integrate all the technologies EMC has acquired over the years (heck, it’s taking NetApp forever to properly integrate Spinnaker and that’s just one other architecture), EMC is focusing instead on the management of the various bits (the current approach being Unisphere, tying together a subset of EMC’s acquisitions).

So, Unified Storage in EMC-speak really means unified management. Which would be fine if they were upfront about it. Somehow, “our new arrays with unified management but not unified architecture” doesn’t quite roll off the tongue as easily as “unified storage”.

Mike Riley eloquently explains whether it’s easier to fix an architecture or fix management here. Ultimately, unified management can’t tackle all the underlying problems and limitations, but it does allow for some very nice demos.

A cool GUI with frankenstorage behind it is like putting lipstick on a pig, or putting a nice shell on top of a car cobbled together from disparate bits. The underlying build is masked superficially, until it’s not… usually, at the worst possible time.

Sure, ultimately, management is what the end user interfaces with. Many people won’t really care about what goes on inside, nor have the time or inclination to learn. I merely invite them to start thinking more about the inner bits, because when things get tricky, a portal GUI meshing 4-5 different products together tends to stop working as expected. That’s also when you start bouncing between 3-4 completely different support teams, all trying to figure out which of the underlying products is causing the problem.

Always think in terms of what happens if something goes wrong with a certain subsystem and always assume things will break – only then can you have proper procedures and be prepared for the worst.

And always remember that the more complex a machine, the more difficult it can be to troubleshoot and fix when it does break (and it will break – everything does). There’s no substitute for clean and simple engineering.

Of course, Rube Goldberg-esque machines can be entertaining… if entertainment is what you’re after :)

D

 


 

Single wire and single OS: yet another way to tell true unified storage from the rest

This is going to be a mercifully short entry. I’m saving the big one for another day :)

One of the features of NetApp storage is that, by using Converged Network Adapters (CNAs), one can use a single wire to transport FC, iSCSI, NFS and CIFS, all at the same time.

You see, since NetApp storage is truly unified, we don’t need cables coming out of 5 different boxes running 3 or more different OSes to do something (which is what, say, a certain competitor’s “unified” box is like – actually it’s even more boxes if one counts the external replication devices).

You might say “OK, that’s cool but how does it affect my bottom line?”

Just a few benefits that immediately come to mind:

  1. Far fewer cables to run in your datacenter for both storage and all the servers (each server needs 2 cables for redundancy vs 4 or more)
  2. No compromises, since nobody is forced to choose between iSCSI, NAS and FC: each server can happily use whatever’s best for the task at hand yet retain the exact same connectivity
  3. Fewer switches (no need for both FC and Ethernet switches)
  4. Less OpEx since it’s a simpler solution to manage
  5. Very high speeds (each link is 10Gbit) and low latency (FCoE is similar to FC, so there’s no need to do iSCSI if the same link can do both)
  6. Overall a far simpler and cleaner datacenter

The other part is also important: Single OS. Inherently, something running a single OS has a third of the moving parts of something running 3 totally different OSes, regardless of packaging.

Here are some cool throughput results. Line speeds :)

One can dance around such concepts with marchitecture and fancy PowerPoint slides, but, in the end, just use your head. It’s pretty simple.

Food for thought…

D

NetApp posts new SPEC SFS NFS results – far faster than V-Max with Celerra VG8

Following the new NetApp block-based SPC-1 results yesterday, here is some NAS benchmark action. This page contains all the SPEC SFS results. SPEC SFS is the NAS equivalent of SPC-1.

SPEC SFS is more cache-friendly than the brutal SPC-1. Click here for some more information regarding this industry-standard NAS benchmark. The idea is that thousands of CIFS and NFS servers have been profiled and the benchmark reflects real-life NAS usage patterns.

In the same vein as the SPC-1 benchmarks, the configurations we submit to the standard benchmarking authorities are based on realistic systems customers could buy, not $7m lab queens. So, NetApp SPEC and SPC submissions:

  • Are always tested with RAID-DP (RAID-6 protection equivalent). Other vendors test with RAID10 most of the time, and never with RAID-6 (ask them why this is). BlueArc gets respect for being the only other one in the list doing our level of protection
  • Have a target of using the most cost-effective configuration possible
  • Provide not just high IOPS but also very low latency
  • Are a realistic, deployable configuration, not just the fastest box we have (we still have the 1 million SPEC ops record for a 24-node system, that’s kind of pricy plus the result is old and can’t be compared with the current benchmark code – still, look at the rankings).

So, with those lofty goals in mind, we have 3 new submissions:

  1. CIFS benchmark, 3210 w/ SATA drives – typical low/mid-range system
  2. NFS benchmark, 3270 w/ SAS drives – typical mid-range system, no Flash Cache used in this one.
  3. NFS benchmark, 6240 w/ SAS drives – typical high-end (but not highest) system.

All NetApp systems included some Flash Cache memory boards to provide further acceleration (EDIT: aside from the 3270). We have an even faster system (6280) that we will be submitting later on as a special treat (there’s a certain degree of red tape and ceremony to even do one submission…)

Here’s an abbreviated chart in easily digestible form – showing the most recent results from perennial rivals NetApp and EMC (BTW – of all the systems in the chart, only one of them is truly unified and can provide block and NAS on the same architecture without the need for contortions).

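System             | Result (ops/sec, higher is better) | Overall Response Time (ms, lower is better) | # Disks                                     | Exported Capacity (TB) | RAID    | Protocol
-------------------|------------------------------------|---------------------------------------------|---------------------------------------------|------------------------|---------|---------
NetApp 3210        | 64292                              | 1.50                                        | 144x 1TB SATA                               | 87                     | RAID-DP | CIFS
NetApp 3270        | 101183                             | 1.66                                        | 360x 15K RPM 450GB SAS                      | 110                    | RAID-DP | NFS
NetApp 6240        | 190675                             | 1.17                                        | 288x 15K RPM 450GB SAS                      | 85                     | RAID-DP | NFS
EMC NS-G8 on V-Max | 118463                             | 1.92                                        | Bunch o’ SSD (96 fancy STEC 400GB ZeusIOPS) | 17                     | RAID-10 | CIFS
EMC NS-G8 on V-Max | 110621                             | 2.32                                        | Bunch o’ SSD (96 fancy STEC 400GB ZeusIOPS) | 17                     | RAID-10 | NFS
EMC VG8 on V-Max   | 135521                             | 1.92                                        | 312x 15K RPM 450GB FC                       | 19                     | RAID-10 | NFS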

Guide to reading the chart, and lessons learned:

  • A “puny” NetApp 3210 with SATA gets better overall response time than an all-SSD V-Max costing well over 10x as much
  • Notice the amount of usable space on NetApp systems, with even better protection than RAID10
  • The 6240 scored far higher even though it had fewer disks
  • The NetApp systems have “just” 2 controllers that do everything, vs. the EMC submissions with 4 V-Max engines, plus extra Celerra Data Movers and Control Stations on top. What do you think is more efficient?
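To put rough numbers on the points above, here is a quick back-of-the-envelope sketch in Python. The figures are copied straight from the chart; the per-disk ratios and the function names are mine, not part of any SPEC report, and dividing by disk count is only a crude efficiency proxy (it ignores controllers, cache and cost).

```python
# Figures copied from the chart in this post:
# (system, protocol, SPEC SFS ops/sec, disk count, exported capacity in TB)
RESULTS = [
    ("NetApp 3210", "CIFS", 64292, 144, 87),
    ("NetApp 3270", "NFS", 101183, 360, 110),
    ("NetApp 6240", "NFS", 190675, 288, 85),
    ("EMC NS-G8 on V-Max", "CIFS", 118463, 96, 17),
    ("EMC NS-G8 on V-Max", "NFS", 110621, 96, 17),
    ("EMC VG8 on V-Max", "NFS", 135521, 312, 19),
]


def ops_per_disk(ops: int, disks: int) -> float:
    """Benchmark throughput divided by the number of drives used."""
    return ops / disks


def exported_gb_per_disk(exported_tb: float, disks: int) -> float:
    """Exported capacity per drive, in decimal units (1 TB = 1000 GB)."""
    return exported_tb * 1000 / disks


for system, protocol, ops, disks, tb in RESULTS:
    print(
        f"{system:<20} {protocol:<4} "
        f"{ops_per_disk(ops, disks):7.0f} ops/disk "
        f"{exported_gb_per_disk(tb, disks):6.0f} GB exported/disk"
    )
```

The SSD rows naturally win on ops per disk, so the more telling comparison is between the spinning-disk configurations and in the exported capacity column.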

In slide format:

I do have some questions to ask certain other vendors as a parting shot:

  1. Sun/Oracle – you keep saying your new boxes are a cheaper way to get NetApp-type functionality, and you’ve had them for a while, so why not submit to SPEC or SPC? (There is not a single SPEC result from Sun.)
  2. EMC – maybe show the world how a system not based on V-Max runs? With RAID-6? (Even V-Max with RAID6, no problem…)
  3. EMC: What’s the deal with the exported capacity, even with 312x drives?
  4. All of you with large striped pools of RAID5 – have you bothered explaining to your customers what will happen to the pool if you have a dual-drive failure in any RAID group? Unacceptable.

D

New NetApp SPC-1 submission – more IOPS per drive than any other vendor, and a bit on write caching

The SPC-1(E) benchmark is the standard high-intensity test for block storage, consisting of very stringent rules and a standard test suite.

SPC-1 is one of the worst things you can do to a disk array. The benchmark itself does a lot of writes, is highly random and is hostile to most caching systems. Which neatly explains why IBM has all kinds of system submissions but doesn’t show XIV, and the complete absence of another prominent vendor (look at the submissions, you’ll figure it out – the big boys of storage are NetApp, IBM, HDS, HP and one more :) ).

That same vendor might complain that SPC-1 is not always representative of real-life workloads but, short of putting all possible systems in your datacenter, nothing really will represent exactly how you massage your data. At least SPC-1 is a well-established standard and a great torture test for systems. All the other vendors are participating after all. And, interestingly, the SPEC SFS NAS benchmark doesn’t seem to bother said vendor’s anti-SPC crew none (spec.org). How come that one is more “real”? :) (NetApp participates in both block and NAS standard benchmarks BTW, since our systems all do both).

Some things to look for when trying to decipher SPC-1 results:

  • Type of RAID used (RAID-DP, RAID10, RAID5, RAID6)
  • How many drives were used to get the final result
  • The cost for the configuration
  • The price/performance
  • How much of the storage was usable, how much was unused…

For instance, a system that can do 50,000 SPC-1 IOPS with 100 disks and RAID6 is far more efficient than one that needs 200 disks and RAID10 to achieve the same result.

 

My favorite way of reading the results is to figure out the effective IOPS per drive, then see how close it is to (or far from) the 220 IOPS a normal modern 15K drive can sustain without RAID and with good response times.

So, without further ado, looky here… it’s the link to the results page showing all the vendors. Or here for the full details. 68,000 sustained IOPS with 120 ordinary 300GB drives and just 2 Flash Cache modules, with 84% of the usable space occupied.

What this means to you:

The effective IOPS per drive for the NetApp 3270 submission are 567. Next best is around 400, most vendors can’t break 300, and the highest scoring systems (relying on thousands of drives and many controllers) don’t break 200.

 

It is important to note that NetApp is the only vendor in the list showing results with dual-parity RAID-DP (RAID6 equivalent protection). All other vendors are doing RAID10! If your vendor is selling you RAID5, that’s not representative of their systems in the chart!

The NetApp result boils down to 13,600 sustained IOPS per shelf of 15K drives, and a system cost that’s very reasonable for the reliability, performance and features provided.
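For anyone who wants to redo the arithmetic on other submissions, here is a minimal Python sketch of how I read the numbers. Every input comes from this post (68,000 IOPS, 120 drives, 13,600 IOPS per shelf, the 220 IOPS bare-drive figure, and the 50,000 IOPS example); the function and variable names are just illustrative.

```python
BARE_15K_DRIVE_IOPS = 220  # per-drive figure quoted in this post


def iops_per_drive(total_iops: float, drives: int) -> float:
    """Effective SPC-1 IOPS delivered per physical drive."""
    return total_iops / drives


# NetApp 3270 submission as described above
netapp_3270 = iops_per_drive(68_000, 120)
shelves = 68_000 / 13_600        # shelf count implied by the per-shelf figure
drives_per_shelf = 120 / shelves

# The hypothetical comparison from earlier in the post
raid6_case = iops_per_drive(50_000, 100)
raid10_case = iops_per_drive(50_000, 200)

print(f"3270: {netapp_3270:.0f} IOPS/drive, "
      f"{netapp_3270 / BARE_15K_DRIVE_IOPS:.2f}x a bare 15K drive")
print(f"Implied layout: {shelves:.0f} shelves of {drives_per_shelf:.0f} drives")
print(f"RAID6 example: {raid6_case:.0f}/drive vs RAID10 example: {raid10_case:.0f}/drive")
```

Anything well above the bare-drive figure means the controller and cache are doing real work; anything below it means the drives are being held back.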

What this means to the anti-NetApp FUD club with their complex auto-tiering schemes that need 15 types of drives…

You really need to figure out how to present a decent result with:

  • RAID6 (otherwise your RAID1 or RAID5 protection is inferior to NetApp RAID-DP, especially when talking about large pools)
  • Your fancy auto-tiering algorithm showing no performance degradation on the unpredictable SPC-1 workload while still storing data on all drive tiers (otherwise, it’s single-tiering, and not auto-tiering)
  • Large caches. If your competitive product can use Megacaches, and you claim you can do efficient write caching with them, how about we all see how effective that is? After all, you claim that’s a huge benefit. We show the world ours, show yours. Otherwise, your product is only fast on PowerPoint slides, and I’ve yet to see a product fail on PowerPoint.

Stand by for more results from the bigger boxes. This wasn’t one of them, but it is a realistic system companies could actually afford and not a $7m all-SSD config like some others have… :)

 

D


Has NetApp sold more flash than any other enterprise disk vendor?

NetApp has been selling our custom cache boards with flash chips for a while now. We have sold over 3PB of usable cache this way.

The question was raised in public forums such as Twitter – someone mentioned that this figure may be more usable Solid State storage than all other enterprise disk vendors have sold combined (whether it’s used for caching or normal storage – I know we have greatly outsold anyone else that does it for caching alone :) ).

I don’t know if it is, maybe the boys from the other vendors can chime in on this and tell us, after RAID, how much usable SSD they’ve sold, but the facts remain:

  • NetApp has demonstrated thought leadership in pioneering the pervasive use of Megacaches
  • The market has widely adopted the NetApp Flash Cache technology (I’d say 3PB of usable cache is pretty wide adoption)
  • The performance benefits in the real world are great, due to the extra-granular nature of the cache (4KB blocks vs 64+ KB for others) and extremely intelligent caching algorithms
  • The cost of entry is extremely reasonable
  • It’s a very easy way to add extra performance without forcing data into faster tiers.

Comments welcome…

D


A look at EMC’s FASTv2, FAST Cache and FLARE30 – EMC giveth, EMC taketh away

[Update: some grammar mistakes fixed and a few questions added]

Before anyone starts frothing at the mouth, notice that this post is filed under the FUD category :) Always do your own analysis… I just wanted to give people some food for thought, like I did when FASTv1 came out. I didn’t make this up, it’s all based on various EMC documents available online. I advise people looking at this technology to ask for extensive documentation regarding best practices before taking the leap.

As a past longtime user and sometimes pusher of EMC gear, some of the enhancements in FASTv2 seemed pretty cool to me, and potentially worrisome from a competitive standpoint. So I decided to do some reading to see how cool the new technology really is.

Summary of the new features:

  • Large heterogeneous pools (a single pool can consist of different drive types and encompass all drives in a box minus the vault and spares)
  • FASTv2 – sub-LUN movement of hot or cold chunks of data for auto-tiering between drive types
  • FAST Cache – add plain SSDs as cache
  • Much-touted feature: ability to use SSD as a write cache
  • LUN compression
  • Thin LUN space reclamation

It all sounds so good and, if functional, could bring Clariions to parity with some of the more advanced storage arrays out there. However, some examination of the features reveals a few things (I’m sure my readers will correct any errors). In no particular order:

EMC now uses a filesystem

It finally had to happen: thin LUN pools at the very least live on a filesystem laid on top of normal RAID groups (and I suspect all new pools on both Symm and CX now live on a filesystem). So is it real FC or some hokey emulation? Not that it matters if it provides useful functionality impossible to achieve otherwise; it’s just an about-face. But how mature is this new filesystem? Does it automatically defragment itself or at least provide tools for manual defragmentation? Filesystem design is not trivial.

LUN compression

  1. Best practices indicate compression should not be used for anything I/O intensive and is best suited for static workloads (i.e. not good for VMs or DBs). However, new data is compressed as a post-process, which theoretically doesn’t penalize new writes – which I find interesting. Also: What happens with overwrites? Do compressed blocks that need to be overwritten get uncompressed and re-laid down uncompressed until the next compression cycle? Do the blocks to be overwritten get overwritten in their original place or someplace new? What happens with fragmentation? It all sounds so familiar :)
  2. The read performance hit is reported to be about 25% – makes sense since the CPU has to work harder to uncompress the data.
  3. Turning on compression for an existing traditional LUN means the LUN will need to be migrated to a thin LUN in a pool (not converted, migrated – indeed, you need to select where the new LUN will go). Not an in-place operation, it seems.
  4. Does data need to be migrated to a lower tier in order to be compressed?
  5. It follows you need enough space for the conversion to take place… (can you do more than one in parallel? If so, quite a bit of extra space will be needed).
  6. How does this work with external replication engines like RecoverPoint? Does data need to be uncompressed? (probably counts as a normal “read” operation which will uncompress the data).
  7. Does this kind of compression mess with alignment of, say, VMs? This could have catastrophic consequences regarding performance of such workloads…

Thin LUN space reclamation

  1. Another case where migration from thick to thin takes place (doesn’t seem like the LUN is converted in-place to thin)
  2. Unclear whether an already thin LUN that has temporarily ballooned in size can have its space reclaimed (NetApp and a few other arrays can actually do this). You see, LUNs don’t only grow in size… several operations (e.g. MS Exchange checking) can cause a LUN to temporarily expand in space consumption, then go back down to its original size. Thin provisioning is only truly useful if it can help the LUN remain thin :)

Dual-drive ownership, especially when it pertains to pool LUNs

Dual-drive ownership is not strictly a new feature, but best practice is for a single CX controller (SP) to own a drive, and not have it shared. Furthermore, with pool LUNs, if you change the controller ownership of a pool LUN, I/O will see much higher latencies – it’s recommended to do a migration to a new LUN controlled by the other SP (yet another scenario that needs migration). I’m mentioning this since EMC likes to make a big deal about how both controllers can use all drives at the same time… obviously this is not nearly as clean as it’s made to appear. The Symmetrix does it properly.

Metadata used per thin LUN

3GB is the minimum space a thin LUN will occupy due to metadata and other structures. Indeed, LUN space is whatever you allocate plus another 3GB. Depending on how many LUNs you want to create, this can add up, especially if you need many small LUNs.
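Here is a tiny Python sketch of how that overhead adds up. The 3GB-per-LUN figure is the one quoted above; the LUN counts and sizes are made-up examples, and the model simply adds a flat 3GB per LUN, which is my reading of the documentation rather than a confirmed formula.

```python
THIN_LUN_METADATA_GB = 3  # per-LUN minimum quoted in this post


def pool_usage(lun_sizes_gb):
    """Return (allocated GB, metadata GB, metadata as a fraction of allocated)."""
    allocated = sum(lun_sizes_gb)
    overhead = THIN_LUN_METADATA_GB * len(lun_sizes_gb)
    return allocated, overhead, overhead / allocated


# Hypothetical layouts, both presenting 2000GB to hosts
many_small = pool_usage([10] * 200)   # 200 LUNs of 10GB each
few_large = pool_usage([500] * 4)     # 4 LUNs of 500GB each

for label, (allocated, overhead, fraction) in (
    ("200 x 10GB", many_small),
    ("4 x 500GB", few_large),
):
    print(f"{label}: {allocated}GB allocated, "
          f"{overhead}GB metadata ({fraction:.1%} extra)")
```

Same presented capacity, wildly different overhead, which is why the many-small-LUNs case is the one to watch.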

Loss of performance with thin LUNs and pools in general

It’s not recommended to use pools and especially thin LUNs for performance-sensitive apps, and in general old-style LUNs are recommended for the highest performance, not pools. Which is interesting, since most of the new features need pools in order to work… I’ve heard of 30% losses if thin LUNs are used in particular, but that’s unconfirmed. I’m sure someone from EMC can chime in.

Expansion, RAID and scalability caveats with pools

  1. To maintain performance, you need to expand the pool by adding as many drives as the pool already has – I suspect this has something to do with the way data is striped. This could cause issues as the system gets larger (who will really expand a CX4-960 by 180 drives a pop? Because best practices state that’s what you have to do if you start with 180 drives in the pool).
  2. Another thing that’s extremely unclear is how data is load-balanced among the pool drives. Most storage vendors are extremely open about such things. All I could tell is that there are maximum increments at which you can add drives to a pool, ranging from 40 on a CX4-120 to 180 on a CX4-960. Since a pool can theoretically encompass all drives aside from vault and spares, does this mean that striping happens in groups of 180 in a CX4-960 and if you add another 180 that’s another stripe and the stripes are concatenated?
  3. What if you don’t add drives by the maximum increment, and you only add them, say, 30 at a time? What do you give up, if anything?
  4. RAID6 is recommended for large pools (which makes total sense since it’s by far the most reliable RAID at the moment when many drives are concerned). However, RAID6 on EMC gear has a serious write performance penalty. Catch-22?

FAST Cache (includes being able to cache writes)

  1. Cache is only on or off for the entire pool and can’t be tuned per LUN (it can only be turned on/off per LUN if old-style RAID groups and LUNs are used).
  2. 64KB block size (which means that a hot 4K block will still take 64K in cache – somewhat inefficient).
  3. A block will only be cached if it’s hit more than twice. Is that really optimal for the best hit rate? Can it respond quickly to a rapidly changing working set?
  4. Unclear set associativity (important for cache efficiency).
  5. No option to automatically optimize for sequential read after random write workloads (many DB workloads are like that).
  6. Flash drives aren’t that fast for writes as confirmed by EMC’s Barry Burke (the Storage Anarchist) in his comment here and by Randy Loeschner here. Is the write benefit really that significant? Maybe for Clariions with SATA, possibly due to the heavy RAID write penalties, especially with RAID6.
  7. It follows that highly localized overwrites could be significantly optimized since the Clariion RAID suffers a great performance degradation with overwrites, especially with RAID6 (something other vendors neatly sidestep).
  8. EMC Clariions don’t do deduplication so the cache isn’t deduplicated itself, but is it at least aware of compression? Or do blocks have to be uncompressed in cache? Either way, it’s a lot less efficient than NetApp Flash Cache for environments where there’s a lot of block duplication.
  9. The use of standard SSDs versus a custom cache board is a mixed blessing – by definition, there will be more latency. At the speeds these devices are going, those latencies add up (since it’s added latency per operation, and you’re doing way more than one operation). All high-end arrays add cache in system boards, not with drives…
  10. Smaller Clariions have severely limited numbers of flash drives that can be used for caching (2-8 depending on the model, with the smaller ones only able to use very small cache drives). Only the CX4-960 can do 20 mirrored cache drives, which I predict will provide good performance even for fairly heavy write workloads. However, that will come at a steep price. The idea behind caches like NetApp’s Flash Cache is to reduce costs.
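
The 64KB granularity point (item 2) is easy to quantify. A rough sketch, assuming the worst case where each hot 4K block is isolated (its neighbours in the same 64KB extent are cold); the 100GB cache size and the function names are hypothetical, picked only to make the arithmetic concrete.

```python
def useful_fraction(hot_kb, page_kb=64):
    """Fraction of one cache page that holds hot data when a single
    isolated hot block of hot_kb lands in it."""
    return min(hot_kb, page_kb) / page_kb


def isolated_hot_blocks_that_fit(cache_gb, page_kb=64):
    """How many isolated hot blocks a cache can hold if each one
    occupies a whole page of page_kb."""
    return cache_gb * 1024 * 1024 // page_kb


if __name__ == "__main__":
    print(useful_fraction(4))                          # hot 4K block in a 64K page
    print(isolated_hot_blocks_that_fit(100))           # 64K pages, 100GB cache
    print(isolated_hot_blocks_that_fit(100, page_kb=4))  # same cache, 4K pages
```

If the hot blocks happen to sit next to each other, the picture improves a lot, so treat this as the pessimistic end of the range.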

For a very detailed discussion regarding megacaches in general read here.

I can see FAST Cache helping significantly on a system with lots of SATA in a well-configured CX4-960. And I can definitely see it helping with heavy read workloads that have good locality of reference, since SSDs are very good for reads.

And finally, the pièce de résistance,

FASTv2

This is EMC’s sub-LUN auto-tiering feature: a LUN is chopped up into 1GB chunks, and those chunks move to slower or faster disks depending on how heavily accessed they are. The idea being that, after a little while, you will achieve steady state and end up with the most appropriate data on the most appropriate drives.

Other vendors (most notably Compellent and now also 3Par, IBM and HDS) have some form of that feature (Compellent pioneered this technology and has the smallest possible chunks I believe at 512KB).

The issues I can see with the CX approach of FASTv2:

  1. Gigantic 1GB slice to be moved. EMC admits this was due to the Clariion not being fast enough to deal with the increased metadata of many smaller slices (the far more capable Symmetrix can do 768KB per slice, offering far more granularity). It follows that the bigger the slice the less optimal the results are from an efficiency standpoint.
  2. All RAID groups within the pool have to be of the same RAID type (e.g., RAID6). So you can’t have, say, SATA as RAID6 and SSD as RAID5 in the same pool. Important since RAID6 on most arrays has a big performance impact.
  3. Unknown performance impact for keeping track of the slices (possibly the same as using thin provisioning – 30% or so?)
  4. The most important problem in my opinion: Too much data can end up in expensive drives. For instance, imagine a 1TB DB LUN. That LUN will be sliced into 1,000x 1GB chunks. Unless the hotspots of the DB are extremely localized, even if a few hundred blocks are busy per slice, that entire slice will get migrated to SSD the next day (it’s a scheduled move). Now imagine if, say, half the slices have blocks that are deemed busy enough – half the LUN (512GB in this example) will be migrated to SSD, even if the hot data in those slices were more like 5GB (say a 10% working set size, quite typical). Clearly, this is not the most effective use of fast disks. EMC has hand-waved this objection away in the past, but if it’s not important, why does the Symmetrix go with the smaller slice?
  5. Extremely slow transactional performance for the data that has been migrated to SATA, especially with RAID6 – EMC says you need to pair this with FAST Cache, which makes sense… Of course, come next day that data will move to SSD or FC drives, but will that be fast enough? Policies will have to be edited and maintained per application (often removing the auto-tiering by locking an app at a tier), which removes much of the automation on offer.
  6. The migration is I/O intensive, and we’re talking about migrations of 1GB slices (on a large array, many thousands of them). What does that mean for the back-end? After all, once a day all the migrations need to be processed… and will need to contend with normal I/O activity.
  7. Doesn’t support compressed blocks, data needs to be uncompressed in order to be moved.
  8. I still think this technology is most applicable to fairly predictable, steady workloads with good locality of reference.
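
The slice-size argument in item 4 can be sketched in a few lines. This is a toy model, not how FAST actually decides what to move: it assumes any slice containing at least one hot block gets promoted whole, and the demo scatters hot blocks uniformly at random across a 1TB LUN (the pessimistic, non-localized case). The helper names and the count of 5000 hot blocks are made up for illustration.

```python
import random

KB = 1024
GB = 1024 ** 3
TB = 1024 ** 4


def promoted_bytes(hot_offsets, slice_bytes):
    """Bytes moved to the fast tier if every slice that contains at
    least one hot offset is promoted in its entirety."""
    touched = {offset // slice_bytes for offset in hot_offsets}
    return len(touched) * slice_bytes


def random_hot_offsets(lun_bytes, count, seed=0):
    """Hypothetical hot-block positions, uniformly scattered over the LUN."""
    rng = random.Random(seed)
    return [rng.randrange(lun_bytes) for _ in range(count)]


if __name__ == "__main__":
    hot = random_hot_offsets(TB, count=5000)
    for label, size in (("1GB slices", GB), ("768KB slices", 768 * KB)):
        moved_gb = promoted_bytes(hot, size) / GB
        print(label, round(moved_gb, 2), "GB promoted")
```

With the same set of hot blocks, the smaller slice can never promote more data than the larger one, and with scattered hotspots it usually promotes far less. Real workloads with tight locality will narrow the gap.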

Messaging inconsistencies

As I’ve mentioned before, I don’t have an issue with EMC’s technology, merely with the manner in which the capabilities and restrictions are messaged (or not, as the case may be). For instance, I’ve seen marketing announcements and blog entries talking about doing VMware on thin LUNs with compression etc. – sure, that could be space-efficient, but will it also be fast?

Now that the limitations of the new features are more understood, EMC’s marketing message loses some of its punch.

  • Will compression really work with busy VMware or DB setups?
  • Will thin LUNs be OK for busy systems?
  • Unless 20 disks are used for FAST Cache (only with a CX4-960), is the performance really enough to accelerate highly random writes on large systems?
  • What is the performance impact of thin LUNs for highly-intensive workloads?
  • What is the performance of a large system all running RAID6?
  • Last but not least – does the filesystem EMC uses allow defragmentation? By definition, features such as thin provisioning, compression and FAST will create fragmentation.

Moreover – what all the messaging lacks is some comparison to other people’s technology. Showing a video booting 1000 VMs in 50 minutes where before it took 100 is cool until you realize others do it in 12.

And why is EMC (I’m picking on them since they’re the most culpable in this aspect) ridiculing technologies such as NetApp’s Flash Cache and Compellent’s Data Motion only to end up implementing similar technologies and presenting things to the world as if they are unique in figuring this out? “You see, none of the other guys did it right, now that we did it it’s safe”.

Too many of the new features are extremely obscure in their design. If storage professionals can’t easily figure them out, how is the average consumer expected to? I think more openness is in order, otherwise it just looks like you have something to hide.

Ultimately – the devil is in the details, so why would you have to choose between space OR performance, and not be able to optimize space utilization AND performance?

I think it has to do with the original design of your storage system. Not all systems lend themselves to advanced features because of the way they started.

But that’s a subject for another day.

D

NetApp benefits for virtualization – benchmarked and proven

My colleague Vaughn Stewart explains it in detail here. I didn’t feel we gave this the publicity it deserves.

In a nutshell: We have numbers (published only after VMware engineering themselves approved the paper as accurate and gave their permission) proving that, compared to traditional arrays, running virtualized workloads on NetApp gear needs less resources while providing excellent performance.

If you don’t want to spend time reading Vaughn’s article, this link has the goods in impressive detail.

It’s worth noting the “traditional” array had a lot more disks and RAM, but the NetApp array had a Flash Cache module. We are not allowed to publish the vendor of the “traditional” array due to licensing restrictions, but, as mentioned, VMware engineering verified the results – the test was legit (no vendor is allowed to publish VMware performance data unless VMware engineering has verified all testing was aboveboard and accurate).

Some pictures for the impatient:

 

 

Key take-aways:

  1. A lot less disk space needed with NetApp
  2. A lot quicker to provision the VMs
  3. Faster performance than RAID10 even without the Flash Cache (and dramatically higher with)
  4. No-compromise RAID-DP offers same protection as RAID6 without the penalty
  5. NFS for VMware can be pretty fast indeed given the appropriate storage behind!

D

FUD tales from the blogosphere: when vendors attack (and a wee bit on expanding and balancing RAID groups)

Haven’t blogged in a while, way too busy. Against my better judgment, I thought I’d respond to some comments I’ve seen on the blogosphere, adding one of my trademark extremely long titles. Part response, part tutorial. People with no time to read it all: Skip to the end and see if you know the answer to the question or if you have ideas on how to do such a thing.

It’s funny how some vendors won’t hesitate to wholeheartedly agree when some “independent” blogger criticizes their competition (before I get flamed, independent in quotes since, as I discussed before, there ain’t no such thing whether said blogger realizes it or not – being biased is a basic human condition).

The equivalent of someone posting in an Audi forum about excessive brake dust, and having guys from Mercedes and BMW chime in and claim how they “tested” Audis and indeed they had issues (but of course!) and how their cars are better now and indeed maybe Audi doesn’t have as much of a lead any more (if, indeed, they ever did). I think the term for that is “shill” but I can understand taking every opportunity to harm an opponent.

So the “Storage Architect” posted entries asking about certain features to be implemented on NetApp storage, one of them being able to reduce the size of an aggregate. Then everyone and their mum jumped on and complained how on earth such an important feature isn’t there :) BTW I’m not saying such a thing wouldn’t be useful to have from time to time. I’ll just try to explain why it’s tricky to implement and maybe ways to avoid problems.

For the uninitiated, a NetApp aggregate is a collection of RAID-DP RAID groups that are pooled and striped, so that I/O hits all the drives from all RAID groups equally for performance. You then carve volumes out of that aggregate (containers for NFS, CIFS, iSCSI, FC).

A pretty simple structure, really, but effective. Similar constructs are used by many other storage vendors that allow pooling.

So, the question was, why not be able to make an aggregate smaller? (you can already make it bigger on-the-fly, as well as grow or shrink the existing volumes within).

An HP guy then proceeded to complain about how he put too few drives in an aggregate and ended up with an imbalanced configuration while trying to test a NetApp box.

So, some basics:  the following picture shows a well-balanced pool – notice the equal number of drives per RAID group:

The idea being that everything is load-balanced:

Makes sense, right?

You then end up with pieces of data across all disks, which is the intent. Growing it is easy – which is, after all, what 99.99% of customers ever want to do.

However, the HP dude didn’t have enough disks to create a balanced config with the default-sized RAID group (16). So he ended up with something like this, not performance-optimal:

So what the HP dude wanted to do was reduce the size of the RAID group and remove drives, even though he had originally expanded the aggregate (and by extension the RAID group).

Normally, before one starts creating pools of storage (with any storage system), one also knows (or should) what one has to play with in order to get the best overall config. It’s like “I want to build a 12-cylinder car engine, but I only have 9 cylinders”. Well – either buy more cylinders, or build an 8-cylinder engine! Don’t start building the 12-cylinder engine and go “oops” :) This is just Storage 101. Mistakes can and do happen, of course.

So, with the current state of tech, if I only had 20 drives to play with (and no option to get more), assuming no spares, I’d rather do one of the following:

  1. Aggregate with 10 + 10 RAID groups inside or
  2. Use all 20 drives in a single RAID group for max space
  3. Ask someone that knows the system better than I do for some advice

This is common sense and both doable and trivial with a NetApp system. The idea is you set the desired RAID group size for that aggregate BEFORE you put in disks. Not really difficult and pretty logical.

For instance, aggr options HPdudeAggr raidsize 10 before adding the drives would have achieved #1 above. Graphically, the Web GUI has that option in there as well, when you modify an aggregate. The option exists and it’s well-known and documented. Not knowing about it is a basic education issue. Arguing that no education should be needed to use a storage device (with an extreme number of features) properly even for deeply involved, low-level operations, is a romantic notion at best. Maybe some day. We are all working hard to make it a reality. Indeed, a lot of things that would take a really long time in the past (or still, with other boxes) have become trivialized – look at SnapDrive and the SnapManager products, for instance.

Back to our example: if, in the future, 10 more disks were purchased, and approach #1 above was taken, one would simply add the ten disks to the aggregate with aggr add HPdudeAggr 10. Resulting in a 10+10+10 config.

But what if I had done #2 above (make a 20-drive RAID group the default for that aggregate)?

Then, simply, you’d end up imbalanced again, with a 20+10. Some thought is needed before embarking on such journeys.

Maybe a better approach would be to add, say, a more reasonable number of drives to achieve good balance? Adding 12 more drives, for example, would allow for an aggregate with 16+16 drives. So, one could simply change the raidsize using aggr options HPdudeAggr raidsize 16, then add the 12 disks to the aggregate with aggr add HPdudeAggr -g all 12.

This would expand both RAID groups contained within the aggregate dynamically to 16 drives per, resulting in a 16+16 configuration. Which, BTW, is not something you can easily do with most other storage systems!

Having said all that, I think that for people that are not storage savvy (or for the storage savvy that are suffering from temporary brain fog), a good enhancement would be for the interfaces to warn you about imbalanced final configs and show you what will be created in a nice graphical fashion, asking you if you agree (and possibly providing hints on how it could be done better).

I’m not aware of any other storage system that does that degree of handholding but hey, I don’t know everything.
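
For what it’s worth, the kind of check I have in mind is not complicated. Below is a toy sketch that reproduces the layouts discussed above and flags the imbalanced ones. It is a simplified model, not how Data ONTAP actually assigns disks: it assumes existing RAID groups are topped up in order to the raidsize and any leftover disks start new groups. The function names are hypothetical, and group sizes include the parity drives.

```python
def add_disks(groups, new_disks, raidsize):
    """Toy model: top up existing RAID groups (in order) to raidsize,
    then put any remaining disks into new groups."""
    layout = list(groups)
    for i, size in enumerate(layout):
        take = min(max(raidsize - size, 0), new_disks)
        layout[i] += take
        new_disks -= take
    while new_disks > 0:
        take = min(raidsize, new_disks)
        layout.append(take)
        new_disks -= take
    return layout


def is_balanced(groups):
    """A layout counts as balanced when every RAID group has the same size."""
    return len(set(groups)) <= 1


def preview(groups, new_disks, raidsize):
    """Show the resulting layout and warn if it would be imbalanced."""
    layout = add_disks(groups, new_disks, raidsize)
    text = "+".join(str(size) for size in layout)
    note = "OK" if is_balanced(layout) else "WARNING: imbalanced"
    return f"{text} ({note})"


if __name__ == "__main__":
    print(preview([], 20, 16))        # 20 drives, default raidsize of 16
    print(preview([], 20, 10))        # option 1: raidsize set to 10 first
    print(preview([10, 10], 10, 10))  # later adding 10 more drives
    print(preview([20], 10, 20))      # option 2, then adding 10 drives
    print(preview([10, 10], 12, 16))  # raidsize changed to 16, 12 drives added
```

A real implementation would also have to think about spares, drive sizes and types, but even this much would have caught the 20-drive mistake before any data landed on the aggregate.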

Indeed, maybe the nature of the other posts was being bait so I’ll obligingly take the bait and ask the question so you can advertise your wares here: :)

Is anyone aware of a well-featured storage system from an established, viable vendor that currently (Aug 7, 2010, not roadmap or “Real Soon Now”) allows the creation of a wide-striped pool of drives with some RAID structures underneath; then allows one to evacuate and then destroy some of those underlying RAID groups selectively, non-disruptively, without losing data, even though they already contain parts of the stripes; then change the RAID layout to something else using those same existing drives and restripe without requiring some sort of data migration to another pool and without needing to buy more drives? Again, NOT for expansion, but for the shrinking of the pool?

To clarify even further, what the HP guy did was exactly this: he had 20 drives to play with, and by mistake he created a pool with 2 RAID groups, a 14+2 and a 2+2. How would your solution take those 2 RAID groups, with data, and change the config to something like 10 + 10 without needing more drives or the destruction of anything?

Can you dynamically reduce a RAID group? (NetApp can dynamically expand, but not reduce a RAID group).

I’m not implying such a thing doesn’t exist, I’m merely curious. I could see ways to make this work by virtualizing RAID further. Still, it’s just one (small) part of the storage puzzle.

The one without sin may cast the first stone! :)

D
