I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shotâ€¦
I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!
Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futuresâ€¦
Perception is reality.
I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle readerâ€¦
The tiering concept
Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.
Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside â€“ all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to itâ€¦
Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.
What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA now has become the most frequently-accessed piece of data in the array. Or, perhaps, the importance of a DB flip-flops during a month, so it only needs to be fast maybe for month-end-processing. So now, you need to move stuff around so that what needs to be fast is shifted to the fast drives.
Pressure points and the need for passing the hot potato
But wait, there’s moreâ€¦
The entire performance problem is created in the first place due to most array architectures being older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, it’s best practices to have a 5-disk RAID5 group. You then ideally split up that group into no more than 2 LUNs and assign one to each controller.
With disks getting bigger and bigger, creating 1-2 LUNs can become exceedingly difficult â€“ a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs â€“ maybe you just need 50GB here, another 300GB thereâ€¦ in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.
So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.
You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.
Again â€“ this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.
But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?
That’s what EMC’s FAST for the Symmetrix and CX does. It attempts to move entire LUNs to faster tiers like SSD. Which, BTW, is something you can do manually, but FAST attempts to automate the task (kinda, depends, etc).
The current FAST pitfalls
Let’s examine first how FAST (Fully Automated Storage Tiering) is implemented. Since it’s really 3 utterly different solutions, depending on whether you have Symm, CX or NS:
On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but it definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is. It looks to me too much like Optimizer + some new heuristics. But the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection â€“ maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops â€“ with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. But of the FAST implementations on EMC gear it is the more complete version, exactly because Optimizer has been there for a long timeâ€¦
On the far more popular CX platform, what happens is like a tribute to kludges everywhere. Consider this:
- Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
- You need a separate PC that will crunch Navisphere Analyzer performance logs, this takes a while
- The PC will then provide a list of recommendations
- Depending on which LUNs you approve it will invoke a NaviCLI command to move the specified LUNs in the box
- Doesn’t support thin provisioning
- Not sure if it supports MetaLUNs
- It is NOT automatic since you have to approve the move! Ergo, it should not be sold under the name “FAST” since the “A” stands for “Automated”, aren’t there laws for false advertising?
On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA â€“ how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?
Ah, but it gets better with FASTv2! Or does it?
EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.
The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors) yet EMC in their docs state the current version of virtual provisioning (at least on the CX) has higher overhead when compared to their traditional LUNs and will provide less performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:
The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how efficient that will end up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move just like FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be significantly large, expect a bunch of non-applicable data to move with the stuff that should be moved.
The real problem
First, to give credit where it’s due: Compellent already has had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.
However â€“ both the Compellent approach as well as FASTv2 and, even worse, v1, suffer from this fundamental issue:
Lack of real-time acceleration.
Think about it â€“ performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.
Consider this scenario:
You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”
Then crunch time comes, and the DB is on the SATA drives. Oopsie.
People complain, and the storage admin is forced to manually migrate it back to SSD.
Kinda defeats the whole purposeâ€¦ unless I’m missing something the size of Titanic.
So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and thereâ€¦
Soon, you’re actually in a worse state than where you begun: You have the added complexity and cost of FAST, plus you have to worry about creating exception rules.
Now here’s a novel ideaâ€¦
What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?
For instance â€“ normal fileshares, deep archives, large media files, backups to disk â€“ most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.
DBs, VM OS files â€“ should usually be on faster disk. But no need to go nuts with several tiers of fast disk, a single fast tier should be sufficient!
LUNs and other array objects should try to automatically span as many drives as possible by default without you having to tell the array to do thatâ€¦ that way you avoid the hot spots in the first place by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).
And finally, large, intelligent cache (as in really large) to help with real-time workload demands, dynamically and as-needed, by caching tiny 4K chunks and not wasting space on gigantic piecesâ€¦ with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.
Maybe life doesn’t need to be complicated after all.
It’s almost 0300 so I’d better go to bedâ€¦