Architecture has long term scalability implications for All Flash Appliances

Recently, NetApp announced the availability of a 3.84TB SSD. It’s not extremely exciting – it’s just a larger storage medium. Sure, it’s really advanced 3D NAND, it’s fast and ultra-reliable, and will allow some nicely dense configurations at a reduced $/GB. Another day in Enterprise Storage Land.

But, ultimately, that’s how drives roll – they get bigger. And in the case of SSD, the roadmaps seem extremely aggressive regarding capacities.

Then I realized that several of our competitors don’t have this large SSD capacity available. Some don’t even have half that.

But why? Why ignore such a seemingly easy and hugely cost-effective way to increase density?

In this post I will attempt to explain why certain architectural decisions may lead to inflexible design constructs that can have long-term flexibility and scalability ramifications.

Design Center

Each product has its genesis somewhere. It is designed to address certain key requirements in specific markets and behave in a better/different way than competitors in some areas. Plug specific gaps. Possibly fill a niche or even become a new product category.

This is called the “Design Center” of the product.

Design centers can evolve over time. But, ultimately, every product’s Design Center is an exercise in compromise and is one of the least malleable parts of the solution.

There’s no such thing as a free lunch. Every design decision has tradeoffs. Often, those tradeoffs sacrifice long term viability for speed to market. There’s nothing wrong with tradeoffs as long as you know what those are, especially if the tradeoffs have a direct impact on your data management capabilities long term.

It’s all about the Deduplication/RAM relationship

Aside from compression, scale up and/or scale out, deduplication is a common way to achieve better scalability and efficiencies out of storage.

There are several ways to offer deduplication in storage arrays: Inline, post-process, fixed chunk, variable chunk, per volume scope, global scope – all are design decisions that can have significant ramifications down the line and various business impacts.

Some deduplication approaches require very large amounts of memory to store metadata (hashes representing unique chunk signatures). This may limit scalability or make a product more expensive, even with scale-out approaches (since many large, costly controllers would be required).

There is no perfect answer, since each kind of architecture is better at certain things than others. This is what is meant by “tradeoffs” in specific Design Centers. But let’s see how things look for some example approaches (this is not meant to be a comprehensive list of all permutations).

I am keeping it simple – I’m not showing how metadata might get shared and compared between nodes (in itself a potentially hugely impactful operation as some scale-out AFA vendors have found to their chagrin). In addition, I’m not exploring container vs global deduplication or different scale-out methods – this post would become unwieldy… If there’s interest drop me a line or comment and I will do a multi-part series covering the other aspects.

Fixed size chunk approach

In the picture below you can see the basic layout of a fixed size chunk deduplication architecture. Each data chunk is represented by a hash value in RAM. Incoming new chunks are compared to the RAM hash store in order to determine where and whether they may be stored:

Hashes fixed chunk

The benefit of this kind of approach is that it’s relatively straightforward from a coding standpoint, and it probably made a whole lot of sense a couple of years ago when small SSDs were all that was available and speed to market was a major design decision.

The tradeoff is that a truly exorbitant amount of memory is required in order to store all the hash metadata values in RAM. As SSD capacities increase, the linear relationship of SSD size vs RAM size results in controllers with multi-TB RAM implementations – which gets expensive.

It follows that systems using this type of approach will find it increasingly difficult (if not impossible) to use significantly larger SSDs without either a major architectural change or the cost of multiple TB of RAM dropping dramatically. You should really ask the vendor what their roadmap is for things like 10+TB SSDs… and whether you can expand by adding the larger SSDs into a current system without having to throw everything you’ve already purchased away.

Variable size chunk approach

This one is almost identical to the previous example, but instead of a small, fixed block, the architecture allows for variable size blocks to be represented by the same hash size:

Hashes variable chunk

This one is more complex to code, but the massive benefit is that metadata space is hugely optimized since much larger data chunks are represented by the same hash size as smaller data chunks. The system does this chunk division automatically. Less hashes are needed with this approach, leading to better utilization of memory.

Such an architecture needs far less memory than the previous example. However, it is still plagued by the same fundamental scaling problem – only at a far smaller scale. Conversely, it allows a less expensive system to be manufactured than in the previous example since less RAM is needed for the same amount of storage. By combining multiple inexpensive systems via scale-out, significant capacity at scale can be achieved at a lesser cost than with the previous example.

Fixed chunk, metadata both in RAM and on-disk

An approach to lower the dependency on RAM is to have some metadata in RAM and some on SSD:

Hashes fixed chunk metadata on disk

This type of architecture finds it harder to do full speed inline deduplication since not all metadata is in RAM. However, it also offers a more economical way to approach hash storage. SSD size is not a concern with this type of approach. In addition, being able to keep dedupe metadata on cold storage aids in data portability and media independence, including storing data in the cloud.

Variable chunk, multi-tier metadata store

Now that you’ve seen examples of various approaches, it starts making logical sense what kind of architectural compromises are necessary to achieve both high deduplication performance and capacity scale.

For instance, how about variable blocks and the ability to store metadata on multiple tiers of storage? Upcoming, ultra-fast Storage Class Memory technologies are a good intermediate step between RAM and SSD. Lots of metadata can be placed there yet retain high speeds:

Hashes variable chunk metadata on 3 tiers

Coding for this approach is of course complex since SCM and SSD have to be treated as a sort of Level 2/Level 3 cache combination but with cache access time spans in the days or weeks, and parts of the cache never going “cold”. It’s algorithmically more involved, plus relies on technologies not yet widely available… but it does solve multiple problems at once. One could of course use just SCM for the entire metadata store and simplify implementation, but that would somewhat reduce the performance afforded by the approach shown (RAM is still faster). But if the SCM is fast enough… 🙂

However, being able to embed dedupe metadata in cold storage can still help with data mobility and being able to retain deduplication even across different types of storage and even cloud. This type of flexibility is valuable.

Why should you care?

Aside from the academic interest and nerd appeal, different architecture approaches have a business impact:

  • Will the storage system scale large enough for significant future growth?
  • Will I be able to use significantly new media technologies and sizes without major disruption?
  • Can I use extremely large media sizes?
  • Can I mix media sizes?
  • Can I mix controller types in a scale-out cluster?
  • Can I use cost-optimized hardware?
  • Does deduplication at scale impact performance negatively, especially with heavy writes?
  • If inline efficiencies aren’t comprehensive, how does that affect overall capacity sizing?
  • Does the deduplication method enforce a single large failure domain? (single pool – meaning that any corruption would result in the entire system being unusable)
  • What is the interoperability with Cloud and Disk technologies?
  • Can data mobility from All Flash to Disk to Cloud retain deduplication savings?
  • What other tradeoffs is this shiny new technology going to impose now and in the future? Ask to see a 5-year vision roadmap!

Always look beyond the shiny feature and think of the business benefits/risks. Some of the above may be OK for you. Some others – not so much.

There’s no free lunch.

D

Technorati Tags: , , , ,

3 Replies to “Architecture has long term scalability implications for All Flash Appliances”

Leave a comment for posterity...