EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)

I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shot…

I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!

Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futures…

Perception is reality.

I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle reader…

The tiering concept

Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.

Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside – all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to it…

Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.

What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA has now become the most frequently-accessed piece of data in the array. Or perhaps the importance of a DB flip-flops during the month, so it only needs to be fast for, say, month-end processing. So now you need to move stuff around, so that what needs to be fast ends up on the fast drives.

Pressure points and the need for passing the hot potato

But wait, there’s more…

The entire performance problem is created in the first place because most array architectures are older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, it’s best practice to have a 5-disk RAID5 group. You then ideally split that group into no more than 2 LUNs and assign one to each controller.

With disks getting bigger and bigger, sticking to 1-2 LUNs per group becomes exceedingly difficult – a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs – maybe you just need 50GB here, another 300GB there… in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.
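
Quick back-of-the-envelope math on that (a rough sketch only – the ~10% formatting allowance is an assumption, not an exact Clariion figure):

```python
# Rough usable capacity of an N-disk RAID5 group: N-1 data drives.
# The 0.9 factor is a guessed allowance for formatted/right-sized capacity,
# not an exact CLARiiON figure.
def raid5_usable_tb(disks, drive_gb, format_factor=0.9):
    return (disks - 1) * drive_gb * format_factor / 1000.0

for drive_gb in (450, 600):
    print(f"5-disk R5 of {drive_gb}GB drives: ~{raid5_usable_tb(5, drive_gb):.1f}TB usable")
# 450GB drives -> ~1.6TB; 600GB drives -> ~2.2TB. Either way, far more space
# than one or two typical LUNs need, so more LUNs get crammed onto the same
# 5 spindles to avoid wasting it.
```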

So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.

You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.
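
To put rough numbers on that pressure (purely illustrative – the per-drive IOPS figure and per-host demand below are assumptions, not measurements from any specific array):

```python
# Why 10 LUNs carved from one 5-disk RAID group hurts: all the hosts share
# a fixed back-end IOPS budget. The numbers below are rules of thumb / guesses.
SPINDLES = 5
IOPS_PER_15K_DRIVE = 180          # rough rule of thumb for one 15K FC drive
HOSTS_SHARING_GROUP = 10
DEMAND_PER_HOST = 150             # hypothetical IOPS each server would like to push

group_budget = SPINDLES * IOPS_PER_15K_DRIVE            # ~900 IOPS total
total_demand = HOSTS_SHARING_GROUP * DEMAND_PER_HOST    # 1500 IOPS wanted

print(f"Group can deliver ~{group_budget} IOPS, hosts want {total_demand}")
print(f"Demand is ~{total_demand / group_budget:.0%} of capacity: queues build, latency climbs")
```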

Again – this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.

But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?

That’s what EMC’s FAST for the Symmetrix and CX does: it attempts to move entire LUNs to a different tier – for instance, promoting a hot LUN to SSD. That’s something you can do manually, BTW, but FAST attempts to automate the task (kinda, sorta – it depends, as we’ll see).

The current FAST pitfalls

Let’s first examine how FAST (Fully Automated Storage Tiering) is implemented, since it’s really three utterly different solutions depending on whether you have a Symm, a CX or an NS:

On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is – it looks to me too much like Optimizer plus some new heuristics. Yet the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection – maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops – with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. Still, of the FAST implementations on EMC gear it is the most complete one, exactly because Optimizer has been around for a long time…

On the far more popular CX platform, what happens is a tribute to kludges everywhere. Consider this (a rough sketch of the flow follows the list):

  1. Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
  2. You need a separate PC to crunch the Navisphere Analyzer performance logs – this takes a while
  3. The PC will then provide a list of recommendations
  4. Depending on which LUNs you approve, it will invoke a NaviCLI command to move the specified LUNs within the box
  5. Doesn’t support thin provisioning
  6. Not sure if it supports MetaLUNs
  7. It is NOT automatic, since you have to approve the moves. Ergo, it should not be sold under the name “FAST” – the “A” stands for “Automated”. Aren’t there laws against false advertising?
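
Put differently, the “automation” boils down to something like the pseudo-flow below. To be clear, this is a hypothetical sketch that just mirrors the steps in the list – none of these function names are real EMC tooling, and the Analyzer parsing and NaviCLI invocation are deliberately left as stand-ins:

```python
# Hypothetical sketch of the CX FAST v1 flow described in the list above.
# None of these functions are real EMC tooling; they simply mirror the steps.

def crunch_analyzer_logs(log_path):
    """Step 2: a separate PC chews on Navisphere Analyzer logs for a while.
    Placeholder data stands in for the real offline analysis."""
    return [
        {"lun": 17, "tier_mismatch": True, "target_tier": "SATA"},
        {"lun": 42, "tier_mismatch": False, "target_tier": None},
    ]

def propose_moves(stats):
    """Step 3: turn the crunched stats into a list of recommended relocations."""
    return [s for s in stats if s["tier_mismatch"]]

def admin_approves(move):
    """Step 7: a human still has to say yes - so much for the 'A' in FAST."""
    answer = input(f"Move LUN {move['lun']} to {move['target_tier']}? [y/N] ")
    return answer.strip().lower() == "y"

def start_migration(move):
    """Step 4: kick off the LUN migration (the actual NaviCLI invocation is
    deliberately left out - this is only a stand-in)."""
    print(f"would migrate LUN {move['lun']} to {move['target_tier']} now")

for move in propose_moves(crunch_analyzer_logs("analyzer_archive.nar")):
    if admin_approves(move):
        start_migration(move)
```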

On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA – how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?

Ah, but it gets better with FASTv2! Or does it?

EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.

The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors), yet EMC’s own docs state that the current version of virtual provisioning (at least on the CX) has higher overhead compared to their traditional LUNs and will deliver lower performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:

The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how effective that ends up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move, just like with FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be fairly large, expect a bunch of data that didn’t need to move to tag along with the stuff that did.
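
Conceptually, sub-LUN tiering boils down to keeping a heat counter per chunk and periodically relocating the hottest and coldest chunks. A toy sketch of the idea follows – the chunk size and thresholds are made-up illustrations, not EMC’s actual parameters:

```python
from collections import Counter

CHUNK_SIZE = 768 * 1024      # illustrative chunk size in bytes, not EMC's real relocation unit
PROMOTE_THRESHOLD = 1000     # made-up: I/Os per analysis period to earn a move to SSD
DEMOTE_THRESHOLD = 10        # made-up: at or below this, a chunk drifts down to SATA

io_counts = Counter()        # chunk index -> I/Os observed this period

def record_io(lun_offset_bytes):
    """Called per I/O: bump the heat counter for the chunk the I/O falls in."""
    io_counts[lun_offset_bytes // CHUNK_SIZE] += 1

def relocation_plan():
    """Periodic, after-the-fact decision per chunk: SSD, FC or SATA.
    The 'periodic' part is exactly the real-time weakness discussed below."""
    plan = {}
    for chunk, hits in io_counts.items():
        if hits >= PROMOTE_THRESHOLD:
            plan[chunk] = "SSD"
        elif hits <= DEMOTE_THRESHOLD:
            plan[chunk] = "SATA"
        else:
            plan[chunk] = "FC"
    return plan
```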

The real problem

First, to give credit where it’s due: Compellent has already had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.

However, both the Compellent approach and FASTv2 (and, even worse, v1) suffer from the same fundamental issue:

Lack of real-time acceleration.

Think about it – performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.

Consider this scenario:

You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”

Then crunch time comes, and the DB is on the SATA drives. Oopsie.

People complain, and the storage admin is forced to manually migrate it back to SSD.

Kinda defeats the whole purpose… unless I’m missing something the size of the Titanic.

So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and there…

Soon, you’re actually in a worse state than where you began: you have the added complexity and cost of FAST, plus you have to worry about creating exception rules.
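
To make that added burden concrete, the exception rules end up looking something like this entirely hypothetical pinning policy (the schema, LUN names and tier names are invented for illustration – the point is simply that someone has to write and maintain it):

```python
# Entirely hypothetical tiering-exception policy an admin would have to
# keep current by hand. LUN names, tier names and schema are all invented.
TIERING_EXCEPTIONS = {
    "payroll_db": "SSD",    # pin: must be fast for month-end, dormant otherwise
    "exchange01": "FC",     # pin: never let mail drift down to SATA
    "backups01":  "SATA",   # pin: bulk data should never eat fast spindles
}

def allowed_move(lun, target_tier):
    """Reject any proposed relocation that violates a pin. Every new app means
    another rule to write, test and remember - the added complexity in question."""
    pinned = TIERING_EXCEPTIONS.get(lun)
    return pinned is None or pinned == target_tier
```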

Now here’s a novel idea…

What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?

For instance – normal fileshares, deep archives, large media files, backups to disk – most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.

DBs, VM OS files – those should usually be on faster disk. But there’s no need to go nuts with several tiers of fast disk – a single fast tier should be sufficient!

LUNs and other array objects should automatically span as many drives as possible by default, without you having to tell the array to do so… that way you avoid the hot spots in the first place, by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).
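
For the curious, “span as many drives as possible” just means a LUN’s chunks get dealt across the whole pool instead of being fenced into one small RAID group. A minimal sketch of the idea, with arbitrary chunk counts and pool size:

```python
# Minimal wide-striping sketch: deal a LUN's chunks round-robin across every
# drive in the pool so no small group of spindles becomes a hot spot.
def place_chunks(lun_size_chunks, pool_drives):
    return {chunk: pool_drives[chunk % len(pool_drives)] for chunk in range(lun_size_chunks)}

pool = [f"drive{n:02d}" for n in range(48)]          # a 48-drive pool, for example
layout = place_chunks(lun_size_chunks=1000, pool_drives=pool)
print(layout[0], layout[1], layout[47], layout[48])  # drive00 drive01 drive47 drive00
```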

And finally, a large, intelligent cache (as in really large) to handle real-time workload demands dynamically and as needed, by caching tiny 4K chunks instead of wasting space on gigantic pieces… with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.
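
One bare-bones way to picture such a cache: an LRU map of 4K blocks keyed by a content fingerprint, so identical (deduplicated) blocks are cached once no matter how many LUNs reference them. This is a toy sketch under those assumptions, not any vendor’s actual implementation:

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4096  # cache tiny 4K chunks, not whole LUN segments

class DedupAwareReadCache:
    """Toy LRU read cache keyed by block content hash, so identical
    (deduplicated) blocks occupy a single cache slot no matter how many
    LUNs reference them. A mapping of (LUN, offset) -> fingerprint is
    assumed to exist elsewhere and is omitted here. Purely illustrative."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()                # fingerprint -> 4K block

    def get(self, fingerprint):
        block = self.blocks.get(fingerprint)
        if block is not None:
            self.blocks.move_to_end(fingerprint)   # refresh LRU position
        return block

    def put(self, data):
        fp = hashlib.sha256(data).hexdigest()
        self.blocks[fp] = data
        self.blocks.move_to_end(fp)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)        # evict least recently used
        return fp
```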

Wouldn’t that be a bit simpler to manage, more nimble and more useful in real-world scenarios? The cache will help out even the slower drives for both file and OLTP-type workloads.

Maybe life doesn’t need to be complicated after all.

It’s almost 0300 so I’d better go to bed…

D

10 thoughts on “EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)”

  1. David A. Chapa

    Great blog Dimitris! You’ve highlighted yet another flaw in EMC’s overall acquisition and technology strategy. Most companies have a boxcar approach to acquisitions, either by company or technology group, with a future plan of tighter integration into their core. EMC appears to have been on a train-wreck approach for a few years running, and since that appears to be the case, marketing must come in to the rescue. Not every company is 100% on target with their strategies, but FAST sure seems like the “surprise baby” – a reaction to the market and not a response to customer need.

    It will be interesting to see if v2 is better and by what standard “better” will be measured.

    Chapa

  2. StorageTexan

    Nice Job Dimitris !!

    I 100% agree with your findings. Sticking it in the right tier the first time can save you a TON of time and MONEY.

    As I discussed in my blog, don’t get so wrapped up in the cost savings that they are positioning. Compellent positions DataProgression as a feature to help admins reduce their need to purchase expensive tier 1 storage. What they fail to point out is that they more than make up for those cost savings in software licensing and software maintenance. You want to add more spindles to your array? Congrats – you’re also going to add more than a few lines of software licensing to that quote, and your maintenance costs just went up. I’m not sure how FAST is sold, but my guess is there isn’t a free lunch there either.

    Here is the article on my blog site
    http://storagetexan.com/2010/02/25/truth-lies-and-software-licensing/

    Nice job again!!
    @StorageTexan

  3. the storage anarchist

    Your assessment of FAST v1 is fairly accurate, and there has been no attempt to mislead the market or customers about what FAST v1 can – and cannot – do.

    As to your criticisms against FAST v2, I think I can clarify a few things that might change your assessment.

    First, the unit of granularity in FAST v2 is not related to the size of the LUN nor the size of the storage pool(s) it is built from. The size of a sub-LUN relocation is based more on the predicted locality of future reference – the likelihood that other blocks adjacent to a specific I/O request will be requested in the future. For example, with Symmetrix the smallest chunk size is 768KB, but locality of reference analysis for cache misses indicates that the more optimal unit of relocation might be 5-6 times larger than that. And in fact, different granularity might be appropriate for different applications – we’ll see.

    Second, as to your example of an application that appears “dormant” for most of the month, then pops into action – the opportunity is indeed to demote that database as low as possible so that the more expensive resources can be applied to other applications during most of the month. The ultimate intent is for FASTv2+ to recognize the surge in demand, and then to leverage cache as a buffer while the data is promoted.

    Third, FAST (v1 and v2) is managed by policies, and these can be applied to multiple applications (groups of LUNs), and each application can have its own policy. So in the aforementioned example, the policy might be set so as not to allow that application to be demoted below 15K rpm drives (as an example). It is also possible to script changes to the policy, so indeed, the payroll process script might actually start with “Promote DB=PayRoll Target=50% EFD” (that’s make-believe syntax, for illustration only).

    Net-net – the operational intent of FAST v2 is to augment the array’s SDRAM cache by promoting blocks that are frequently accessed as “cache misses” so as to minimize response time, while moving as much as possible of the “untouched in ages” blocks to SATA to reduce the operational $/GB.

    This requires sub-LUN relocation of an appropriate granularity, as well as the ability to recognize changes in demand and react fairly quickly. With Symmetrix, the large SDRAM cache is already there – and in a sense, the use of the various tiers will also follow (a form of) cache management logistics.

    And thanks for the constructive analysis. If you have other questions, please don’t hesitate to send me email (or post them on your blog or mine) – I’ll answer everything I can about Symmetrix-related topics.

  4. Dimitris (post author)

    Thanks Barry. It’s good that FASTv2+, whenever it’s ready to really go into production, will use a variable chunk size.

    If it can help with the surge in demand then that’s great news for EMC, but I just haven’t seen the “mature” FAST tech at the moment (Compellent) do this at all well… then again, EMC has only about 1,000 times more engineers than Compellent :)

    On the messaging side:

    There’s a little bit of a dichotomy within EMC as you may know:

    There are 3 kinds of FAST (CX, Symm, NS – I understand you’re the Symm guy). Or is that a trichotomy? Will CX work like you said, or only Symm? Anyway…

    Then, people-wise, there’s the supporting organization outside field sales, then field sales, and within field sales there’s commercial and enterprise.

    The stuff I write about is based not on what I see in blogs but on actual sales campaigns. Just today I had a call with a customer who told me EMC told him that FAST on the Celerra would move files as well as iSCSI and FC LUNs back and forth between tiers. We know that, at least at the moment, this is not quite right…

    Obviously that can only be one of two things:

    1. The field sales force is lying
    2. The field sales force isn’t well-educated on their own product.

    I’ll go with #2 and will not assume malice.

    Either way, this doesn’t help EMC, since it took me very little time to explain things to this customer – it only discredited their EMC rep and engineer (and VAR).

    Selling futures is great, as long as you’re upfront with it :)

    Gotta go to my next call…

    D

  5. the storage anarchist

    Indeed, the EMC FAST implementations are different for block vs. file. Kinda have to be.

    FAST is also different on Symm vs. CLARiiON, this due to architectural differences of the platforms. But the two implementations aim to deliver the same customer value, even if the units of operation and mechanics are different.

    As to the positioning anomaly you’ve mentioned, I think there is an option #3 – the field is correct. I’m getting verification and I’ll reply as soon as I do.

    Confusion factor understood, though. Indeed, keeping 45,000 people trained and consistent is a serious challenge, and one that the company is constantly trying to improve upon.

    Thanks for pointing this one out.

  6. Mike Richardson

    Barry,

    In regards to the promotion and demotion of data at a sub-lun level, I’m curious as to how this functions with typical filesystems. As I’m sure you know, applications tend to slide around filesystems as filesystems age. One side effect of this is that it tends to defeat thin provisioning over time.

    As it relates to your sub-lun tiering, wouldn’t the same complication arise? Under the premise that, within a lun, as the database/application/etc slides around the filesystem, the storage array can’t easily predict when or where the next block will be written. The blocks aren’t as static as we’d like them to be. So, a block/chunk that has been moved to slow media all of a sudden gets a burst of I/O it wasn’t prepared for. In other words, block activity of filesystems can be somewhat unpredictable, and therefore moving chunks to slower tiers can lead to unpredictable performance during future reads or writes.

    It seems like it would be a lot less risky and allow more predictable SLAs to just tier at a lun by lun level. That way, no matter where the application moves to on the lun over time, it will always have the same performance characteristics.

    Maybe you can give me more insight on the logic to help me understand better, or perhaps, provide some real world examples of use cases that don’t lead to unpredictable performance within a lun.

    -mike

  7. the storage anarchist

    The thing most people don’t understand about Flash is that writes aren’t really all that much faster to a good SSD than they are to a regular disk drive. And thus, predicting where writes are going isn’t an objective of FAST – that’s why Symmetrix has such large caches in the first place (to land writes).

    FAST is thus mostly optimized around read cache misses, and not the “randomness” of writes (which, truthfully, aren’t nearly as random as you seem to think, but that’s a different discussion). In a Symmetrix, we have 20+ years of experience optimizing SDRAM cache prefetch to get stuff into cache before it is required, and this is proven every day to work extremely well for a broad range of applications, both random and sequential, read and write. With usable cache capacities up to 512GB, Symmetrix delivers read/write efficiencies far and above those of any competitive storage array.

    What FAST adds to this is the simple ability to get into Flash those regularly accessed (read) blocks that otherwise would be a read cache miss and require a trip to slow disk. And more importantly, it will move to SATA the vast majority of a file system that has never been written to, and/or those blocks that contain (parts of) files that are no longer changing. And when you think about it, if it takes a couple of milliseconds longer to read an old PowerPoint file, or to save a new one, nobody is really going to notice. But the SDRAM cache will likely hold the hot blocks of the inode tree (or NTFS directory table), and FAST will have located the most recently opened files into flash, thus making sure that even if the cached blocks are forced out, the operations will still execute faster than if the entire file system were on 15K rpm JBOD.

    So, for Symmetrix at least, FAST and Flash are all about reducing the response times of read misses; the Symmetrix large global cache is ALREADY able to keep the response time of writes well below 1ms in most situations.

  8. Dimitris (post author)

    Thanks Barry – I completely agree about flash write speeds as implemented by the SSD vendors (incl STEC), and this is one of the reasons NTAP didn’t go with SSDs and instead built the bespoke PAMII board – its performance characteristics are insane. Obviously I can’t elaborate in this public forum on the actual design.

    Everyone needs to figure out different approaches that work with their architecture. I’m sure FAST is more optimal for the Symm architecture, just as PAM is more appropriate for NTAP due to the totally different way we write to disk (though it seems from what you’re saying that Symm and FAS are both ultimately trying to optimize for random reads, just in different ways).

    Still, I don’t know how some apps won’t suffer until they’re migrated again from SATA to SSD or FC, cache or no cache – some access patterns will defeat any cache, and not everyone will be able to go to a maxed-out 8-engine Symm due to $$$ issues, so the 512GB cache won’t ever be seen but by a very select few. Most Symms in the wild will be well under that.

    Anyway, if I were EMC, I’d be trying to get the far more capable Symm to replace the CX. Now that the Symm is also using commodity hardware like most boxes, it should be economically feasible, and would simplify EMC’s lineup (though storagebod might disagree with such a simplification move).

    However, the bulk of EMC’s market is the CX, and I totally understand the need to make it seem like it’s receiving the same kind of feature love as the Symm, but obviously it isn’t. Most similar features are far better implemented on Symm than CX.

    The gist of my original post is that it’s too tempting for marketing to say “all platforms support FAST” and too easy for field sales to not disabuse customers of their perception that it’s all the same.

    D

  9. StorageWookie

    So, Dimitris makes a good point regarding Symm cache. Total cache sizes for the Symm aren’t really changing in line with the increasing density of disks. A worthwhile FAST implementation that made use of significant amounts of SATA would further stretch the cache (in terms of cache per TB of storage), and with the total amount of available cache in a V-Max being limited by the maximum number of installed engines, it could be difficult to get good cache sizes in a typical 2-4 engine V-Max.

    In the great ILM vs. Level 2 cache debate, the (maybe perceived) inability of ILM to react to short-term changes in demand is often highlighted. When you consider my point about overall cache above, are we heading into a danger zone where the primary cache just isn’t able to compensate for short-term bursts?

  10. Ayotunde Itayemi

    Maybe it should be called “Partially Automated Storage Tiering” (PAST) but the person that comes up with that acronym in-house will soon be out of a job!

