A look at EMC’s FASTv2, FAST Cache and FLARE30 – EMC giveth, EMC taketh away

[Update: some grammar mistakes fixed and a few questions added]

Before anyone starts frothing at the mouth, notice that one of the categories this post is filed under is FUD :) Always do your own analysis… I just wanted to give people some food for thought, like I did when FASTv1 came out. I didn’t make this up; it’s all based on various EMC documents available online. I advise people looking at this technology to ask for extensive documentation regarding best practices before taking the leap.

As a past longtime user and sometimes pusher of EMC gear, some of the enhancements in FASTv2 seemed pretty cool to me, and potentially worrisome from a competitive standpoint. So I decided to do some reading to see how cool the new technology really is.

Summary of the new features:

  • Large heterogeneous pools (a single pool can consist of different drive types and encompass all drives in a box minus the vault and spares)
  • FASTv2 – sub-LUN movement of hot or cold chunks of data for auto-tiering between drive types
  • FAST Cache – add plain SSDs as cache
  • Much-touted feature: ability to use SSD as a write cache
  • LUN compression
  • Thin LUN space reclamation

It all sounds so good and, if functional, could bring Clariions to parity with some of the more advanced storage arrays out there. However, some examination of the features reveals a few things (I’m sure my readers will correct any errors). In no particular order:

EMC now uses a filesystem

It finally had to happen: thin LUN pools at the very least live on a filesystem laid on top of normal RAID groups (and I suspect all new pools on both Symm and CX now live on a filesystem). So is it real FC or some hokey emulation? Not that it matters if it provides useful functionality impossible to achieve otherwise; it’s just an about-face. But how mature is this new filesystem? Does it automatically defragment itself or at least provide tools for manual defragmentation? Filesystem design is not trivial.

LUN compression

  1. Best practices indicate compression should not be used for anything I/O intensive and is best suited for static workloads (i.e. not good for VMs or DBs). However, new data is compressed as a post-process, which theoretically doesn’t penalize new writes – which I find interesting. Also: What happens with overwrites? Do compressed blocks that need to be overwritten get uncompressed and re-laid down uncompressed until the next compression cycle? Do the blocks to be overwritten get overwritten in their original place or someplace new? What happens with fragmentation? It all sounds so familiar :)
  2. The read performance hit is reported to be about 25% – makes sense since the CPU has to work harder to uncompress the data.
  3. Turning on compression for an existing traditional LUN means the LUN will need to be migrated to a thin LUN in a pool (not converted, migrated – indeed, you need to select where the new LUN will go). Not an in-place operation, it seems.
  4. Does data need to be migrated to a lower tier in order to be compressed?
  5. It follows you need enough space for the conversion to take place… (can you do more than one in parallel? If so, quite a bit of extra space will be needed).
  6. How does this work with external replication engines like RecoverPoint? Does data need to be uncompressed? (probably counts as a normal “read” operation which will uncompress the data).
  7. Does this kind of compression mess with alignment of, say, VMs? This could have catastrophic consequences regarding performance of such workloads…

Thin LUN space reclamation

  1. Another case where migration from thick to thin takes place (doesn’t seem like the LUN is converted in-place to thin)
  2. Unclear whether an already thin LUN that has temporarily ballooned in size can have its space reclaimed (NetApp and a few other arrays can actually do this). You see, LUNs don’t only grow in size… several operations (e.g. MS Exchange checking) can cause a LUN to temporarily expand in space consumption, then go back down to its original size. Thin provisioning is only truly useful if it can help the LUN remain thin :)

Dual-drive ownership, especially when it pertains to pool LUNs

Dual-drive ownership is not strictly a new feature, but best practice is for a single CX controller (SP) to own a drive, and not have it shared. Furthermore, with pool LUNs, if you change the controller ownership of a pool LUN, I/O will see much higher latencies – it’s recommended to do a migration to a new LUN controlled by the other SP (yet another scenario that needs migration). I’m mentioning this since EMC likes to make a big deal about how both controllers can use all drives at the same time… obviously this is not nearly as clean as it’s made to appear. The Symmetrix does it properly.

Metadata used per thin LUN

3GB is the minimum space a thin LUN will occupy due to metadata and other structures. Indeed, LUN space is whatever you allocate plus another 3GB. Depending on how many LUNs you want to create, this can add up, especially if you need many small LUNs.
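To get a feel for how that per-LUN overhead adds up, here is a quick back-of-the-envelope sketch in Python. The 3GB figure is the one quoted above; the LUN counts and sizes are made-up examples, and the function name is purely illustrative.

```python
METADATA_GB_PER_THIN_LUN = 3  # minimum per-LUN overhead quoted above


def thin_lun_overhead(lun_count, lun_size_gb):
    """Return (total metadata GB, metadata as a fraction of allocated space)."""
    metadata_gb = lun_count * METADATA_GB_PER_THIN_LUN
    allocated_gb = lun_count * lun_size_gb
    return metadata_gb, metadata_gb / allocated_gb


# Hypothetical configurations: many small LUNs vs. a few large ones.
for count, size_gb in [(200, 10), (200, 100), (20, 1000)]:
    metadata_gb, fraction = thin_lun_overhead(count, size_gb)
    print(f"{count} x {size_gb}GB LUNs: {metadata_gb}GB of metadata "
          f"({fraction:.1%} on top of the allocated space)")
```

With these assumed numbers, 200 small 10GB LUNs carry 600GB of metadata (30% on top of what was allocated), while the same 600GB is only 3% extra for 200 LUNs of 100GB each.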

Loss of performance with thin LUNs and pools in general

It’s not recommended to use pools and especially thin LUNs for performance-sensitive apps, and in general old-style LUNs are recommended for the highest performance, not pools. Which is interesting, since most of the new features need pools in order to work… I heard 30% losses if thin LUNs are used in particular, but that’s unconfirmed. I’m sure someone from EMC can chime in.

Expansion, RAID and scalability caveats with pools

  1. To maintain performance, you need to expand the pool by adding as many drives as the pool already has – I suspect this has something to do with the way data is striped. This could cause issues as the system gets larger (who will really expand a CX4-960 by 180 drives a pop? Because best practices state that’s what you have to do if you start with 180 drives in the pool).
  2. Another thing that’s extremely unclear is how data is load-balanced among the pool drives. Most storage vendors are extremely open about such things. All I could tell is that there are maximum increments at which you can add drives to a pool, ranging from 40 on a CX4-120 to 180 on a CX4-960. Since a pool can theoretically encompass all drives aside from vault and spares, does this mean that striping happens in groups of 180 in a CX4-960, and if you add another 180 that’s another stripe and the stripes are concatenated?
  3. What if you don’t add drives by the maximum increment, and you only add them, say, 30 at a time? What do you give up, if anything?
  4. RAID6 is recommended for large pools (which makes total sense since it’s by far the most reliable RAID at the moment when many drives are concerned). However, RAID6 on EMC gear has a serious write performance penalty. Catch-22?

FAST Cache (includes being able to cache writes)

  1. Cache can only be turned on or off for the entire pool; it can’t be tuned per LUN (it can only be turned on/off per LUN if old-style RAID groups and LUNs are used).
  2. 64KB block size (which means that a hot 4K block will still take 64K in cache – somewhat inefficient).
  3. A block will only be cached if it’s hit more than twice. Is that really optimal for the best hit rate? Can it respond quickly to a rapidly changing working set?
  4. Unclear set associativity (important for cache efficiency).
  5. No option to automatically optimize for sequential read after random write workloads (many DB workloads are like that).
  6. Flash drives aren’t that fast for writes as confirmed by EMC’s Barry Burke (the Storage Anarchist) in his comment here and by Randy Loeschner here. Is the write benefit really that significant? Maybe for Clariions with SATA, possibly due to the heavy RAID write penalties, especially with RAID6.
  7. It follows that highly localized overwrites could be significantly optimized since the Clariion RAID suffers a great performance degradation with overwrites, especially with RAID6 (something other vendors neatly sidestep).
  8. EMC Clariions don’t do deduplication so the cache isn’t deduplicated itself, but is it at least aware of compression? Or do blocks have to be uncompressed in cache? Either way, it’s a lot less efficient than NetApp Flash Cache for environments where there’s a lot of block duplication.
  9. The use of standard SSDs versus a custom cache board is a mixed blessing – by definition, there will be more latency. At the speeds these devices are going, those latencies add up (since it’s added latency per operation, and you’re doing way more than one operation). All high-end arrays add cache in system boards, not with drives…
  10. Smaller Clariions have severely limited numbers of flash drives that can be used for caching (2-8 depending on the model, with the smaller ones only able to use very small cache drives). Only the CX4-960 can do 20 mirrored cache drives, which I predict will provide good performance even for fairly heavy write workloads. However, that will come at a steep price. The idea behind caches like NetApp’s Flash Cache is to reduce costs.
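Points 2 and 3 above (64KB granularity and caching only after more than two hits) can be illustrated with a toy model. This is only a sketch under my own assumptions: the class and method names are invented, there is no eviction or aging, and I am assuming the promotion happens on the third access. It is not a description of how EMC actually implements FAST Cache.

```python
from collections import Counter

PAGE_SIZE = 64 * 1024      # FAST Cache granularity quoted above
PROMOTE_AFTER_HITS = 2     # promote once a page is hit more than this many times


class PromotionTracker:
    """Toy model: counts hits per 64KB page and promotes on the third hit."""

    def __init__(self):
        self.hits = Counter()
        self.cached_pages = set()

    def access(self, byte_offset):
        """Record one I/O; return True if it was served from the cache."""
        page = byte_offset // PAGE_SIZE
        if page in self.cached_pages:
            return True
        self.hits[page] += 1
        if self.hits[page] > PROMOTE_AFTER_HITS:
            self.cached_pages.add(page)  # assumed: promoted after this access
        return False

    def cache_bytes_used(self):
        return len(self.cached_pages) * PAGE_SIZE


tracker = PromotionTracker()
hot_block_offset = 10 * 4096            # a single hot 4K block
results = [tracker.access(hot_block_offset) for _ in range(5)]
print(results)                              # [False, False, False, True, True]
print(tracker.cache_bytes_used() // 4096)   # 16
```

In this model the first three accesses to the hot block miss the cache, and once promoted the single hot 4K block occupies a full 64KB page, i.e. 16 times its own size.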

For a very detailed discussion regarding megacaches in general read here.

I can see FAST Cache helping significantly on a system with lots of SATA in a well-configured CX4-960. And I can definitely see it helping with heavy read workloads that have good locality of reference, since SSDs are very good for reads.

And finally, the pièce de résistance,

FASTv2

This is EMC’s sub-LUN auto-tiering feature: a LUN is chopped up into 1GB chunks, and those chunks move to slower or faster disks depending on how heavily accessed they are. The idea is that, after a little while, you will achieve steady state and end up with the most appropriate data on the most appropriate drives.
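As a rough mental model of sub-LUN tiering in general, here is a minimal Python sketch: map each I/O to its 1GB slice, count accesses per slice, then fill the fastest tier with the busiest slices. The ranking policy, tier names, capacities and function names are all my own assumptions for illustration; EMC’s actual relocation algorithm is not described in this post.

```python
from collections import Counter

SLICE_SIZE = 1024 ** 3  # 1GB slices, as described above


def slice_of(byte_offset):
    """Map a byte offset within the LUN to its 1GB slice index."""
    return byte_offset // SLICE_SIZE


def plan_tiers(io_offsets, slice_count, tier_capacities):
    """Rank slices by access count and fill tiers fastest-first.

    tier_capacities: list of (tier_name, capacity_in_slices), fastest first.
    Returns {slice_index: tier_name}.
    """
    heat = Counter(slice_of(off) for off in io_offsets)
    # Busiest slices first; ties keep their original (ascending) order.
    ranked = sorted(range(slice_count), key=lambda s: heat[s], reverse=True)
    placement = {}
    position = 0
    for tier_name, capacity in tier_capacities:
        for s in ranked[position:position + capacity]:
            placement[s] = tier_name
        position += capacity
    return placement


GB = 1024 ** 3
# Hypothetical trace: three I/Os land in slice 2, one in slice 0.
offsets = [2 * GB + 4096, 2 * GB + 8192, 2 * GB + 500 * 1024 ** 2, 100]
placement = plan_tiers(offsets, slice_count=4,
                       tier_capacities=[("SSD", 1), ("FC", 1), ("SATA", 2)])
print(placement)
```

Note that the whole 1GB slice is the unit of placement here: slice 2 goes to SSD in its entirety even though only a few small regions inside it were touched.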

Other vendors (most notably Compellent and now also 3PAR, IBM and HDS) have some form of that feature (Compellent pioneered this technology and has, I believe, the smallest chunks at 512KB).

The issues I can see with the CX approach of FASTv2:

  1. Gigantic 1GB slice to be moved. EMC admits this was due to the Clariion not being fast enough to deal with the increased metadata of many smaller slices (the far more capable Symmetrix can do 768KB per slice, offering far more granularity). It follows that the bigger the slice the less optimal the results are from an efficiency standpoint.
  2. All RAID groups within the pool have to be of the same RAID type (e.g., RAID6). So you can’t have, say, SATA as RAID6 and SSD as RAID5 in the same pool. Important since RAID6 on most arrays has a big performance impact.
  3. Unknown performance impact for keeping track of the slices (possibly the same as using thin provisioning – 30% or so?)
  4. The most important problem in my opinion: Too much data can end up in expensive drives. For instance, imagine a 1TB DB LUN. That LUN will be sliced into roughly 1,000 1GB chunks. Unless the hotspots of the DB are extremely localized, even if a few hundred blocks are busy per slice, that entire slice will get migrated to SSD the next day (it’s a scheduled move). Now imagine if, say, half the slices have blocks that are deemed busy enough – half the LUN (512GB in this example) will be migrated to SSD, even if the hot data in those slices were more like 5GB (say a 10% working set size, quite typical). Clearly, this is not the most effective use of fast disks. EMC has hand-waved this objection away in the past, but if it’s not important, why does the Symmetrix go with the smaller slice?
  5. Extremely slow transactional performance for the data that has been migrated to SATA, especially with RAID6 – EMC says you need to pair this with FAST Cache, which makes sense… Of course, come next day that data will move to SSD or FC drives, but will that be fast enough? Policies will have to be edited and maintained per application (often removing the auto-tiering by locking an app at a tier), which removes much of the automation on offer.
  6. The migration is I/O intensive, and we’re talking about migrations of 1GB slices (on a large array, many thousands of them). What does that mean for the back-end? After all, once a day all the migrations need to be processed… and will need to contend with normal I/O activity.
  7. Doesn’t support compressed blocks; data needs to be uncompressed in order to be moved.
  8. I still think this technology is most applicable to fairly predictable, steady workloads with good locality of reference.
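The arithmetic behind point 4 can be written down in a few lines. The sketch below plugs in the figures from that example (a 1TB LUN, 1GB slices, half the slices deemed busy, about 5GB of genuinely hot data); the function name and parameters are hypothetical, and it assumes the hot data is scattered across the busy slices rather than packed into a few of them.

```python
def ssd_footprint(lun_gb, slice_gb, hot_slice_fraction, hot_data_gb):
    """Return (GB moved to SSD, fraction of that space holding hot data)."""
    slice_count = lun_gb // slice_gb
    promoted_gb = int(slice_count * hot_slice_fraction) * slice_gb
    return promoted_gb, hot_data_gb / promoted_gb


promoted_gb, useful = ssd_footprint(
    lun_gb=1024, slice_gb=1, hot_slice_fraction=0.5, hot_data_gb=5)
print(f"{promoted_gb}GB promoted, {useful:.1%} of it actually hot")
# prints: 512GB promoted, 1.0% of it actually hot
```

Under these assumptions roughly 99% of the SSD space consumed by this LUN would be holding data that isn’t hot, which is the inefficiency the large slice size invites.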

Messaging inconsistencies

As I’ve mentioned before, I don’t have an issue with EMC’s technology, merely with the manner in which the capabilities and restrictions are messaged (or not, as the case may be). For instance, I’ve seen marketing announcements and blog entries talking about doing VMware on thin LUNs with compression etc. – sure, that could be space-efficient, but will it also be fast?

Now that the limitations of the new features are better understood, EMC’s marketing message loses some of its punch.

  • Will compression really work with busy VMware or DB setups?
  • Will thin LUNs be OK for busy systems?
  • Unless 20 disks are used for FAST Cache (only with a CX4-960), is the performance really enough to accelerate highly random writes on large systems?
  • What is the performance impact of thin LUNs for highly-intensive workloads?
  • What is the performance of a large system all running RAID6?
  • Last but not least – does the filesystem EMC uses allow defragmentation? By definition, features such as thin provisioning, compression and FAST will create fragmentation.

Moreover – what all the messaging lacks is some comparison to other people’s technology. Showing a video booting 1000 VMs in 50 minutes where before it took 100 is cool until you realize others do it in 12.

And why is EMC (I’m picking on them since they’re the most culpable in this aspect) ridiculing technologies such as NetApp’s Flash Cache and Compellent’s Data Motion only to end up implementing similar technologies and presenting things to the world as if they are unique in figuring this out? “You see, none of the other guys did it right, now that we did it it’s safe”.

Too many of the new features are extremely obscure in their design. If storage professionals can’t easily figure them out, how is the average consumer expected to? I think more openness is in order, otherwise it just looks like you have something to hide.

Ultimately – the devil is in the details, so why would you have to choose between space OR performance, and not be able to optimize space utilization AND performance?

I think it has to do with the original design of your storage system. Not all systems lend themselves to advanced features because of the way they started.

But that’s a subject for another day.

D

8 thoughts on “A look at EMC’s FASTv2, FAST Cache and FLARE30 – EMC giveth, EMC taketh away”

  1. pbphoto

    If EMC is now pushing emulated luns on top of a file system (storage pool) in Flare 30, then surely that means the Cubs will win the World Series in 2011!

    Thanks for the feature breakdown.

  2. Nick Triantos

    I’ve been coming back to this post since you posted it, specifically to see the replies but to my surprise there are none!!!!

    Consequently, it leads me to believe your post has accurate information.

  3. Jeremy Barth

    Sub-LUN chunk size varies quite a bit across vendor tiering implementations. From what I’ve been able to gather, Compellent is by far the most granular (as you note), probably because they built their system from the ground up to support tiering. HDS’s Dynamic Tiering is next smallest at a 42 MB chunk size, 3PAR’s Adaptive Optimization weighs in at 128 MB (half of their standard 256 MB chunklet size) and both IBM’s Easy Tier and EMC’s FASTv2 use 1 GB.

    The real question is whether that’s “good enough.” It might be for many use cases, but considering how much, say, the vendors are charging for a 256 GB enterprise SSD, only being able to auto-tier 256 blocks (at 1 GB apiece) makes one question if it’s worth the expense. Even 3PAR’s implementation will only be able to fit at most 2,000 “hot” blocks into a 256 GB SSD — again, better than nothing but is it really worth the SSD premium?

  4. Tony Pearson

    To clarify Jeremy’s comment above: IBM System Storage Easy Tier uses different extent sizes for different products. For DS8000 it uses 1GB as stated for its sub-LUN automatic tiering. For SVC 6.1 and the new Storwize V7000, you can select the extent size from 16MB to 8GB in size. The default is 256MB.

    Tony Pearson (IBM)

  5. Stan Edwards

    For HP SAN/IQ Virtualization the block size is 256KB. Would anyone know what are the managed block sizes for their SVSP (SAN Virtualization Services Platform) that is OEM’ed from LSI? Would it be the underlying block size of the managed storage array (since SVSP does not have FAST-like functionality)?

  6. M. Saif

    I want to expand a current 4X100 GB (raid 1) Fast Cache on CX4-480 by 2X100 GB, but not sure if this’s the recommended configuration, or else I should expand by 4X100 GB?
    I couldn’t figure out how to expand the fast cache, so should I destroy the current configuration (has this high impact on operation?) & reconfigure with the available 6X100 GB?

