The Importance of Automated Headroom Management

Before we begin: This is another vendor-neutral post. I realize there may be no architecture that can do everything I’m proposing, but some may come closer to what you need than others. Whether you’re a vendor or a customer, see it as a list of things you should be doing or asking for, respectively…

Headroom!

Headroom is a term that applies to almost all technologies, and it’s crucially important for all of them. Some examples:

  • Photography
  • Cars
  • Bridges
  • Storage arrays…

Why is Sufficient Headroom Important?

Maintaining sufficient headroom in any solution is a way to ensure safety and predictability of operation under most conditions (especially under unfavorable ones).

For instance, if the maximum load an evenly loaded bridge can bear before collapsing is X, the overall recommended load will be a fraction of that. Even the weight, length and axle count of a single truck on the bridge are subject to strict limits, in order to avoid excessive localized stress on the structure.

Headroom in Storage Arrays

Apologies to the seasoned storage pros for all the foundational material, but it’s crucial to take this step by step.

It is important to note that headroom in arrays is not necessarily as simple as how busy the CPU is. Headroom is a multi-dimensional concept.

More factors than just CPU come into play, including how busy the underlying storage media are, how saturated various buses are, and how much of the CPU is spent on true workload vs opportunistic tasks (that could be deferred). Not to mention that in some systems, certain tasks are single-threaded and could pose an overall headroom bottleneck by maxing out a single CPU core, while the rest of the CPU is not busy at all.
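
To illustrate the multi-dimensional point with a toy example (the dimensions and numbers below are hypothetical, not any real array’s instrumentation): effective headroom is gated by the most constrained dimension, not by the average.

```python
def effective_headroom(utilization_pct: dict) -> tuple:
    """Overall headroom is limited by the busiest dimension, not the average."""
    dimension, busiest = max(utilization_pct.items(), key=lambda kv: kv[1])
    return dimension, 100.0 - busiest

# Hypothetical readings: total CPU looks fine, but one single-threaded task
# has nearly maxed out its core, so real headroom is only ~8%.
print(effective_headroom({
    "cpu_total": 45,
    "busiest_cpu_core": 92,
    "media": 60,
    "internal_bus": 30,
}))
```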

Maintaining sufficient headroom in storage arrays is necessary in order to provide acceptable latency, especially in the event of high load during a controller failover. Depending on the underlying architecture of an array, different headroom approaches and calculations are necessary.

Some examples of different architectures:

  • Active-Active controllers, per-controller pool
  • Active-Standby, single pool
  • Active-Active, single pool
  • Grid, single pool
  • Permutations thereof (it’s beyond the scope of this article to explore all possible options)

The single vs multiple pool question complicates things a bit, plus things like disk ownership are also hugely important. This isn’t an argument about which architecture is better (it depends anyway), but rather about headroom management in different architectures.

Dual-Controller Headroom

Dual-controller architectures need to be extremely careful with headroom management. After all, there are only two controllers in play. Here’s what sufficient headroom looks like in a dual-controller system:

[Image: HeadroomHA2 – healthy headroom in a dual-controller system]

There are not many options to keep things healthy in a dual-controller architecture. In an Active-Standby system, the Standby controller is ready to immediately take over. There is no danger in loading up the Active controller, aside from expected load-related latency.

In an Active-Active HA system, headroom has to be managed so that, across the pair, there is always an entire controller’s worth of free headroom available.
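
As a rough sketch of that rule of thumb (loads expressed as a percentage of what a single controller can handle; the numbers are hypothetical):

```python
def ha_pair_can_fail_over(load_a_pct: float, load_b_pct: float) -> bool:
    """True if an Active-Active pair keeps one controller's worth of free
    headroom, i.e. the survivor could absorb the combined load after a failover."""
    return (load_a_pct + load_b_pct) <= 100.0

print(ha_pair_can_fail_over(60, 40))   # True  – a 60/40 split still fits on one controller
print(ha_pair_can_fail_over(90, 75))   # False – 165% total cannot fit on the survivor
```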

Headroom in a Cluster of HA Pairs Architecture

There are several implementations that make use of a multiple HA Pair architecture. Often, the multiple HA pairs present a virtual pool to the outside world, even if, internally, there are multiple private pools. Some implementations just keep it to pools owned by each controller.

Here’s an example of healthy headroom in such a system:

[Image: HeadroomMultiHA2 – healthy headroom in a cluster of HA pairs]

Even though there are multiple controllers (at least 4 in total), in order to keep the overall system healthy, one controller’s worth of headroom (100% of a single controller) needs to be maintained within each HA pair; otherwise the performance of an underlying private pool (in green) might suffer, making the overall virtual pool performance (light blue) unpredictable.

Headroom in a True Grid Architecture

Grid Architectures spread overall load among multiple nodes (often plain servers with some disks inside and connected via a network).

In such a scheme, overall headroom that needs to be maintained per node as a percentage is 100/N, where N is the number of nodes in the storage cluster.

So, in a 4-node cluster, 100/4=25% headroom per node needs to be maintained.

This doesn’t account for the significant work that rebalancing after a node failure takes in such architectures, nor the capacity headroom needed, but it’s roughly accurate enough for our purposes.
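
The arithmetic, as a quick sketch (deliberately ignoring rebalancing effort and capacity headroom, as noted above; the two-failure variant is just an extension of the same idea):

```python
def per_node_headroom_pct(nodes: int, failures_to_tolerate: int = 1) -> float:
    """Headroom each node must keep free so the survivors can absorb
    the load of the failed node(s). Ignores rebalancing overhead."""
    if failures_to_tolerate >= nodes:
        raise ValueError("cannot tolerate losing every node")
    return 100.0 * failures_to_tolerate / nodes

print(per_node_headroom_pct(4))       # 25.0 – the 4-node example above
print(per_node_headroom_pct(8, 2))    # 25.0 – 8 nodes, tolerating 2 simultaneous failures
```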

Schematically:

[Image: HeadroomGrid2 – healthy headroom in a grid architecture]

How Headroom is Managed is Crucial

In order to manage headroom, four things need to be able to happen first:

  1. Be able to calculate headroom
  2. Be able to throttle workloads
  3. Be able to prioritize between types of workload
  4. Be able to move workloads around (architecture-dependent).

The only architecture that inherently makes this a bit easier is Active-Standby since there is always a controller waiting to take over if anything bad happens. But even with a single active controller, headroom needs to be managed in order to avoid bad latency conditions during normal operation (see here for an example approach). Remember, headroom is a multi-dimensional thing.
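
Purely as a thought exercise, those four capabilities amount to an interface that any architecture-specific implementation would have to provide. Everything below is hypothetical naming, not any vendor’s API:

```python
from abc import ABC, abstractmethod

class HeadroomManager(ABC):
    """Hypothetical interface covering the four prerequisites listed above."""

    @abstractmethod
    def calculate_headroom(self) -> dict:
        """Return remaining headroom per dimension (CPU, media, buses, ...)."""

    @abstractmethod
    def throttle(self, workload_id: str, iops_limit: int) -> None:
        """Apply a temporary rate limit to a workload."""

    @abstractmethod
    def set_priority(self, workload_id: str, priority: int) -> None:
        """Raise or lower a workload's scheduling priority."""

    def migrate(self, workload_id: str, target_node: str) -> None:
        """Move a workload to another node (architecture-dependent)."""
        raise NotImplementedError("not supported by every architecture")
```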

Example Problem Case: Imbalanced & Overloaded Controllers

Consider the following scenario: An Active-Active system has both controllers overloaded, and one of them is really busy:

[Image: Headroom Imbalanced2 – an Active-Active pair with both controllers overloaded and imbalanced]

Clearly, there are a few problems with this picture:

  1. It may be impossible to fail over in the event of a controller failure (the total load is 165% of what a single controller can handle)
  2. The first controller may already be experiencing latency issues
  3. Why did the system even get to this point in the first place?

Unfortunately, this is a commonplace occurrence.

Automation is Key in Managing Headroom

The biggest problem in our example is actually the last point: Why was the system allowed to get to that state to begin with?

Even if a system is able to calculate headroom, throttle workloads and move workloads around, if nothing is done automatically to prevent problems, it’s extremely easy for users to get into the problem situation depicted above. I’ve seen it affect critical production systems far too many times for comfort.

Manual QoS is Not The Best Answer

Being able to manually throttle workloads can obviously help in such a situation. The problems with the manual QoS approach are outlined in a past article, but, in summary, most users simply have absolutely no idea what the actual limits should be (nor should they be expected to). Most importantly, placing QoS limits up front doesn’t result in balanced controllers… and may even result in other kinds of performance problems.

Of course, using QoS limits reactively is not going to prevent the problem from occurring in the first place.

Some companies offer Data Classification as a Professional Services engagement, in order to try and figure out an IOPS-per-TB metric for each application. Even if that is done, it doesn’t result in balanced controllers… and it’s not very useful in dynamic environments. It’s used more as a guideline for setting up manual QoS.

Automation Mechanisms to Consider for Managing Headroom

Clearly, pervasive automation is needed in order to keep headroom at safe levels.

I will split up the proposed mechanisms per architecture. There is some common functionality needed regardless of architecture:

Common Automation Needed

Every architecture needs to have the ability to automatically achieve the following, without user intervention at any point:

  1. Conserve headroom per controller
  2. Differentiate between different kinds of user workloads
  3. Differentiate between different kinds of system workloads
  4. Automatically prioritize between different workloads, especially under pressure
  5. Automatically throttle different kinds of workloads, especially under pressure
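
As a toy illustration of items 4 and 5 (the workload classes, weights and threshold below are invented for the example, not anyone’s actual policy):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    kind: str            # e.g. "user-latency-sensitive", "user-sequential",
                         #      "system-critical", "system-deferrable"
    share: float = 1.0   # relative scheduling weight

# Under pressure, deferrable system work gets squeezed first, then bulky
# sequential user work; latency-sensitive and critical work keep their shares.
PRESSURE_WEIGHTS = {
    "user-latency-sensitive": 1.0,
    "system-critical": 1.0,
    "user-sequential": 0.5,
    "system-deferrable": 0.1,
}

def apply_pressure_policy(workloads, headroom_pct: float) -> None:
    """Scale scheduling weights down once headroom drops below a threshold."""
    if headroom_pct >= 20:
        return                       # plenty of headroom: leave everything alone
    for w in workloads:
        w.share *= PRESSURE_WEIGHTS[w.kind]

wl = [Workload("OLTP DB", "user-latency-sensitive"),
      Workload("backup stream", "user-sequential"),
      Workload("garbage collection", "system-critical"),
      Workload("post-process dedupe", "system-deferrable")]
apply_pressure_policy(wl, headroom_pct=12)
print([(w.name, w.share) for w in wl])
```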

And now for the extra automation needed per architecture:

Active-Standby Automation

If in a single HA pair, nothing else is needed. If in a scale-out cluster of Active-Standby pairs:

  1. Automatically balance capacity and headroom utilization between HA pairs even if they’re different types
  2. Be able to auto-migrate workloads to other cluster nodes (if using multiple pools instead of one)

Active-Active Automation

  1. Automatically conserve one node’s worth of headroom across the HA pair (50/50, 60/40, 70/30 – all are OK)
  2. When provisioning new workloads, auto-balance them by performance and capacity across the nodes
  3. Be able to balance by auto-migrating workloads to the other node (if using multiple pools instead of one)

Active-Active with Multiple HA Pairs Automation

  1. Automatically conserve one node’s worth of headroom per HA pair
  2. Be able to auto-migrate workloads to any other node
  3. Automatically balance workloads and capacity utilization in the underlying per-HA pools

Grid Automation

  1. Automatically conserve at least one node’s worth of headroom across the grid
  2. Automatically conserve enough capacity to be able to lose one node, rebalance, and have enough capacity left to lose another one (the more cautious may want the capability to lose 2-3 nodes simultaneously)
  3. Automatically take into account grid size and rebalancing effort in order to conserve the right amount of headroom

In Closing…

If you’re a consumer of storage systems, always remember to run your storage with sufficient headroom to sustain a major failure without seriously affecting your performance.

In addition, when looking to refresh your storage system, always ask the vendors about how they automate their headroom management.

Finally, if any vendor is quoting you performance numbers, always ask them how much headroom is left on the array at that performance level… (in addition to the extra questions about read/write percentages and latencies you should be asking already).

The answer may surprise you.

D

The Well-Behaved Storage System: Automatic Noisy Neighbor Avoidance

This topic is very near and dear to me, and is one of the big reasons I came over to Nimble Storage.

I’ve always believed that storage systems should behave gracefully and predictably under pressure. Automatically. Even under complex and difficult situations.

It sounds like a simple request and it makes a whole lot of sense, but very few storage systems out there actually behave this way. This creates business challenges and increases risk and OpEx.

The Problem

The simplest way to state the problem is that, under several circumstances, most storage systems can enter conditions where workloads suffer unfair and abrupt performance starvation.

OK, maybe that wasn’t the simplest way.

Consider the following scenarios:

  1. A huge sequential I/O job (backup, analytics, data loads etc.) happening in the middle of latency-sensitive transaction processing
  2. Heavy array-generated workloads (garbage collection, post-process dedupe, replication, big snapshot deletions etc.) happening at the same time as user I/O
  3. Failed drives
  4. Controller failover (due to an actual problem or simply a software update)

#3 and #4 are more obvious – a well-behaved system will ensure high performance even during a drive failure (or three), and after a controller fails over. For instance, if total system headroom is automatically kept at 50% for a dual-controller system (or, simplistically, 100/n, where n is the controller count for shared-everything architectures), even after a controller fails, performance should be fine.

#1 and #2 are a bit more complicated to deal with. Let’s look at this in more detail.

The Case of Competing Workloads During Hard Times

Inside every array, at any given moment, a balancing act occurs. Multiple things need to happen simultaneously.

Several user-generated workloads, for instance:

  • DB
  • VDI
  • File Services
  • Analytics

Various internal array processes – they also are workloads, just array-generated, and often critical:

  • Data reduction (dedupe, compression)
  • Cleanup (object deletion, garbage collection)
  • Data protection (integrity-related)
  • Backups (snaps, replication)

If the system has enough headroom, all these things will happen without performance problems.

If the system runs out of headroom, that’s where most arrays have challenges with prioritizing what happens when.

The most common way a system may run out of headroom is the sudden appearance of a hostile “bully” workload. This is also called a “noisy neighbor”. Here’s an example of system behavior in the presence of a bully workload:

[Image: bully_vs_victim_2 – a latency-sensitive workload being starved by a sudden “noisy neighbor” workload]

In this example, the latency-sensitive workload will greatly and unfairly suffer after the “noisy neighbor” suddenly appears. If the latency-sensitive workload is a mission-critical application, this could cause a serious business problem (slow processing of financial transactions, for instance).

This is an extremely common scenario. A lot of the time it’s not even a new workload. Often, an existing workload changes behavior (possibly due to an application change – for instance a patch or a modified SQL query). This stuff happens.

How some vendors have tried to fix the issue with Manual “QoS”

As always, there is more than one way to skin a cat, if one is so inclined. Here are a couple of manual methods to fix workload contention:

  • Some arrays have a simple IOPS or throughput limit that an administrator can manually adjust in order to fix a performance problem. This is an iterative and reactive method and hard to automate properly in real time. In addition, if the issue was caused by an internal array-generated workload, there is often no tooling available to throttle those processes.
  • Other arrays insist on the user setting up minimum, maximum and burst IOPS values for every single volume in the system, upon volume creation. This assumes the user knows in advance what performance envelope is required, in detail, per volume. The reality is that almost nobody knows these things beforehand, and getting the numbers wrong can itself cause a huge problem with latencies. Most people just want to get on with their lives and have their stuff work without babysitting.

Manual mechanisms for fixing the “bully” workload challenge result in systems that are hard to consume and complex to support while under performance pressure. Moreover, when a performance issue occurs, speed of resolution is critical. The issue needs to be resolved immediately, especially for latency-sensitive workloads. Manual methods will simply not be fast enough. Business will be impacted.

How Nimble Storage Fixed the Noisy Neighbor Issue

No cats were harmed in the process. Nimble engineers looked at the extensive telemetry in InfoSight, used data science, and neatly identified areas that could be massively automated in order to optimize system behavior under a wide variety of adverse conditions. Some of what was done:

  • Highly advanced Fair Share disk scheduling (separate mechanisms that deal with different scenarios)
  • Fair Share CPU scheduling
  • Dynamic Weight Adjustment – automatically adjust priorities in various ways under different resource contention conditions, so that the system can always complete critical tasks and not fall dangerously behind
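
Nimble’s actual mechanisms are proprietary, but the general idea behind fair-share scheduling can be sketched with a toy weighted scheduler (my own simplification, not Nimble’s code): each queue advances a virtual clock inversely proportional to its weight, and the dispatcher always serves the backlogged queue that is furthest behind.

```python
from collections import deque

class FairShareScheduler:
    """Toy weighted fair-share I/O dispatcher, for illustration only."""

    def __init__(self):
        self.queues = {}   # name -> {"weight", "ios", "vtime"}

    def add_queue(self, name, weight):
        self.queues[name] = {"weight": weight, "ios": deque(), "vtime": 0.0}

    def submit(self, name, io):
        self.queues[name]["ios"].append(io)

    def dispatch(self):
        """Serve the backlogged queue with the smallest virtual time."""
        backlogged = [(q["vtime"], n) for n, q in self.queues.items() if q["ios"]]
        if not backlogged:
            return None
        _, name = min(backlogged)
        q = self.queues[name]
        q["vtime"] += 1.0 / q["weight"]    # higher weight -> slower clock -> more turns
        return name, q["ios"].popleft()

sched = FairShareScheduler()
sched.add_queue("oltp", weight=4)      # latency-sensitive, 4x the share
sched.add_queue("bully", weight=1)     # noisy neighbor
for i in range(8):
    sched.submit("oltp", f"oltp-{i}")
    sched.submit("bully", f"bulk-{i}")
print([sched.dispatch()[0] for _ in range(10)])   # roughly 4 "oltp" dispatches per "bully"
```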

The end result is a system that:

  • Lets system latency increase gracefully and progressively as load increases
  • Carefully and automatically balances user and system workloads
  • Achieves I/O deadlines and preemption behavior
  • Eliminates the Noisy Neighbor problem without the need for any manual QoS adjustments
  • Allows latency-sensitive small-block I/O to proceed without interference from bully workloads

Such automation achieves a better business result: less risk, less OpEx, easy supportability, simple and safe overall consumption even under difficult conditions.

What Should Nimble Customers do to get this Capability?

As is typical with Nimble systems and their impressive Ease of Consumption, nothing fancy needs to be done apart from simply upgrading to a specific release of the code (in this case 3.1 and up – 2.3 did some of the magic but 3.1 is the fully realized vision).

A bit anticlimactic, apologies… if you like complexity, watching this instead is probably more fun than juggling QoS manually.

D


Architecture has long term scalability implications for All Flash Appliances

Recently, NetApp announced the availability of a 3.84TB SSD. It’s not extremely exciting – it’s just a larger storage medium. Sure, it’s really advanced 3D NAND, it’s fast and ultra-reliable, and will allow some nicely dense configurations at a reduced $/GB. Another day in Enterprise Storage Land.

But, ultimately, that’s how drives roll – they get bigger. And in the case of SSD, the roadmaps seem extremely aggressive regarding capacities.

Then I realized that several of our competitors don’t have this large SSD capacity available. Some don’t even have half that.

But why? Why ignore such a seemingly easy and hugely cost-effective way to increase density?

In this post I will attempt to explain why certain architectural decisions may lead to rigid design constructs that can have long-term flexibility and scalability ramifications.

Design Center

Each product has its genesis somewhere. It is designed to address certain key requirements in specific markets and behave in a better/different way than competitors in some areas. Plug specific gaps. Possibly fill a niche or even become a new product category.

This is called the “Design Center” of the product.

Design centers can evolve over time. But, ultimately, every product’s Design Center is an exercise in compromise and is one of the least malleable parts of the solution.

There’s no such thing as a free lunch. Every design decision has tradeoffs. Often, those tradeoffs sacrifice long term viability for speed to market. There’s nothing wrong with tradeoffs as long as you know what those are, especially if the tradeoffs have a direct impact on your data management capabilities long term.

It’s all about the Deduplication/RAM relationship

Aside from compression, scale up and/or scale out, deduplication is a common way to achieve better scalability and efficiencies out of storage.

There are several ways to offer deduplication in storage arrays: Inline, post-process, fixed chunk, variable chunk, per volume scope, global scope – all are design decisions that can have significant ramifications down the line and various business impacts.

Some deduplication approaches require very large amounts of memory to store metadata (hashes representing unique chunk signatures). This may limit scalability or make a product more expensive, even with scale-out approaches (since many large, costly controllers would be required).

There is no perfect answer, since each kind of architecture is better at certain things than others. This is what is meant by “tradeoffs” in specific Design Centers. But let’s see how things look for some example approaches (this is not meant to be a comprehensive list of all permutations).

I am keeping it simple – I’m not showing how metadata might get shared and compared between nodes (in itself a potentially hugely impactful operation as some scale-out AFA vendors have found to their chagrin). In addition, I’m not exploring container vs global deduplication or different scale-out methods – this post would become unwieldy… If there’s interest drop me a line or comment and I will do a multi-part series covering the other aspects.

Fixed size chunk approach

In the picture below you can see the basic layout of a fixed size chunk deduplication architecture. Each data chunk is represented by a hash value in RAM. Incoming new chunks are compared to the RAM hash store in order to determine where and whether they may be stored:

[Image: Hashes fixed chunk – fixed-size chunk deduplication, one hash entry in RAM per chunk]

The benefit of this kind of approach is that it’s relatively straightforward from a coding standpoint, and it probably made a whole lot of sense a couple of years ago, when small SSDs were all that was available and speed to market was a major design driver.

The tradeoff is that a truly exorbitant amount of memory is required in order to store all the hash metadata values in RAM. As SSD capacities increase, the linear relationship of SSD size vs RAM size results in controllers with multi-TB RAM implementations – which gets expensive.

It follows that systems using this type of approach will find it increasingly difficult (if not impossible) to use significantly larger SSDs without either a major architectural change or the cost of multiple TB of RAM dropping dramatically. You should really ask the vendor what their roadmap is for things like 10+TB SSDs… and whether you can expand by adding the larger SSDs into a current system without having to throw everything you’ve already purchased away.

Variable size chunk approach

This one is almost identical to the previous example, but instead of a small, fixed block, the architecture allows for variable size blocks to be represented by the same hash size:

[Image: Hashes variable chunk – variable-size chunk deduplication]

This one is more complex to code, but the massive benefit is that metadata space is hugely optimized, since much larger data chunks are represented by the same hash size as smaller data chunks. The system does this chunk division automatically. Fewer hashes are needed with this approach, leading to better utilization of memory.

Such an architecture needs far less memory than the previous example. However, it is still plagued by the same fundamental scaling problem – only at a far smaller scale. At the same time, it allows a less expensive system to be manufactured than in the previous example, since less RAM is needed for the same amount of storage. By combining multiple inexpensive systems via scale-out, significant capacity at scale can be achieved at a lower cost than with the previous example.
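
To put rough numbers on the fixed-versus-variable difference, here is a back-of-the-envelope metadata estimate (the chunk sizes and the bytes-per-hash-entry figure are illustrative assumptions, not any particular vendor’s values):

```python
def dedupe_metadata_gib(capacity_tib: float, avg_chunk_kib: float,
                        bytes_per_entry: int = 64) -> float:
    """Rough RAM needed for one hash entry per chunk, worst case
    (i.e. assuming every chunk on the media is unique)."""
    chunks = capacity_tib * 2**40 / (avg_chunk_kib * 2**10)
    return chunks * bytes_per_entry / 2**30

# Hypothetical 100 TiB all-flash system:
print(dedupe_metadata_gib(100, avg_chunk_kib=4))    # 1600.0 GiB with small fixed chunks
print(dedupe_metadata_gib(100, avg_chunk_kib=32))   # 200.0 GiB with larger variable chunks
```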

Fixed chunk, metadata both in RAM and on-disk

An approach to lower the dependency on RAM is to have some metadata in RAM and some on SSD:

[Image: Hashes fixed chunk metadata on disk – fixed-size chunks with hash metadata split between RAM and SSD]

This type of architecture finds it harder to do full speed inline deduplication since not all metadata is in RAM. However, it also offers a more economical way to approach hash storage. SSD size is not a concern with this type of approach. In addition, being able to keep dedupe metadata on cold storage aids in data portability and media independence, including storing data in the cloud.
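
The lookup path such a split implies might look roughly like this (a minimal sketch; real systems add Bloom filters, caching layers and much more):

```python
def locate_chunk(fingerprint: bytes, ram_index: dict, ssd_index: dict):
    """Prefer the in-RAM portion of the hash index, fall back to the on-SSD
    portion. The extra media access on the slow path is why full-speed inline
    dedupe is harder with this layout."""
    hit = ram_index.get(fingerprint)
    if hit is not None:
        return hit                      # fast path: metadata happened to be in RAM
    return ssd_index.get(fingerprint)   # slow path: costs additional SSD read(s)
```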

Variable chunk, multi-tier metadata store

Now that you’ve seen examples of various approaches, it starts to become clear what kinds of architectural compromises are necessary to achieve both high deduplication performance and capacity scale.

For instance, how about variable blocks and the ability to store metadata on multiple tiers of storage? Upcoming, ultra-fast Storage Class Memory technologies are a good intermediate step between RAM and SSD. Lots of metadata can be placed there yet retain high speeds:

[Image: Hashes variable chunk metadata on 3 tiers – variable-size chunks with metadata spread across RAM, SCM and SSD]

Coding for this approach is of course complex, since SCM and SSD have to be treated as a sort of Level 2/Level 3 cache combination, but with cache residency times measured in days or weeks, and parts of the cache never going “cold”. It’s algorithmically more involved, and it relies on technologies not yet widely available… but it does solve multiple problems at once. One could of course use just SCM for the entire metadata store and simplify the implementation, but that would somewhat reduce the performance afforded by the approach shown (RAM is still faster). But if the SCM is fast enough… 🙂

However, being able to embed dedupe metadata in cold storage can still help with data mobility and being able to retain deduplication even across different types of storage and even cloud. This type of flexibility is valuable.

Why should you care?

Aside from the academic interest and nerd appeal, different architecture approaches have a business impact:

  • Will the storage system scale large enough for significant future growth?
  • Will I be able to use significantly new media technologies and sizes without major disruption?
  • Can I use extremely large media sizes?
  • Can I mix media sizes?
  • Can I mix controller types in a scale-out cluster?
  • Can I use cost-optimized hardware?
  • Does deduplication at scale impact performance negatively, especially with heavy writes?
  • If inline efficiencies aren’t comprehensive, how does that affect overall capacity sizing?
  • Does the deduplication method enforce a single large failure domain? (single pool – meaning that any corruption would result in the entire system being unusable)
  • What is the interoperability with Cloud and Disk technologies?
  • Can data mobility from All Flash to Disk to Cloud retain deduplication savings?
  • What other tradeoffs is this shiny new technology going to impose now and in the future? Ask to see a 5-year vision roadmap!

Always look beyond the shiny feature and think of the business benefits/risks. Some of the above may be OK for you. Some others – not so much.

There’s no free lunch.

D


Beware of storage performance guarantees

Ah, nothing to bring joy to the holidays like a bit of good old-fashioned sales craziness.

Recently we started seeing weird performance “guarantees” from some storage vendors who, it seems, will try anything for a sale.

Probably written by people who haven’t read this.

It goes a bit like this:

“Mr. Customer, we guarantee our storage will do 100,000 IOPS no matter the I/O size and workload”.

Next time a vendor pulls this, show them the following chart. It’s a simple plot of I/O size vs throughput for 100,000 IOPS:

[Image: Throughput IO Size – throughput vs I/O size at a constant 100,000 IOPS]

Notice that at a 1MB I/O size the throughput is a cool 100GB/s 🙂
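
The chart is nothing more than multiplication, so anyone can reproduce it (the I/O sizes below are just examples):

```python
IOPS = 100_000
for io_size_kib in (4, 8, 32, 64, 256, 1024):
    gib_per_s = IOPS * io_size_kib / 2**20            # KiB/s -> GiB/s
    print(f"{io_size_kib:>5} KiB I/O x {IOPS:,} IOPS = {gib_per_s:,.1f} GiB/s")
# 1 MiB I/Os at 100,000 IOPS work out to ~97.7 GiB/s (roughly the 100 GB/s above).
```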

Then ask that vendor again if they’re sure they still want to make that guarantee. In writing. With severe penalties if it’s not met. As in free gear UNTIL the requirement is met. At any point during the lifetime of the equipment.

Then sit back and enjoy the backpedalling. 

You can make it even more fun, especially if it’s a hybrid storage vendor (mixed spinning and flash storage for caching, with or without autotiering):

  • So you will guarantee those IOPS even if the data is not in cache?
  • For completely random reads spanning the entire pool?
  • For random overwrites? (that should be a fun one, 100GB/s of overwrite activity).
  • For non-zero or at least not crazily compressible data?
  • And what’s the latency for the guarantee? (let’s not forget the big one).
  • etc. You get the point.

Happy Holidays everyone!

Thx

D

An explanation of IOPS and latency

<I understand this extremely long post is redundant for seasoned storage performance pros – however, these subjects come up so frequently, that I felt compelled to write something. Plus, even the seasoned pros don’t seem to get it sometimes… 🙂 >

IOPS: Possibly the most common measure of storage system performance.

IOPS means Input/Output (operations) Per Second. Seems straightforward. A measure of work vs time (not the same as MB/s, which is actually easier to understand – simply, MegaBytes per Second).

How many of you have seen storage vendors extolling the virtues of their storage by using large IOPS numbers to illustrate a performance advantage?

How many of you decide on storage purchases and base your decisions on those numbers?

However: how many times has a vendor actually specified what they mean when they utter “IOPS”? 🙂

For the impatient, I’ll say this: IOPS numbers by themselves are meaningless and should be treated as such. Without additional metrics such as latency, read vs write % and I/O size (to name a few), an IOPS number is useless.

And now, let’s elaborate… (and, as a refresher regarding the perils of ignoring such things when it comes to sizing, you can always go back here).

 

One hundred billion IOPS…

[Image: drevil – Dr. Evil “one hundred billion IOPS” meme]

I’ve competed with various vendors that promise customers high IOPS numbers. On a small system with under 100 standard 15K RPM spinning disks, a certain three-letter vendor was claiming half a million IOPS. Another, a million. Of course, my customer was impressed, since that was far, far higher than the number I was providing. But what’s reality?

Here, I’ll do one right now: The old NetApp FAS2020 (the older smallest box NetApp had to offer) can do a million IOPS. Maybe even two million.

Go ahead, prove otherwise.

It’s impossible, since there is no standard way to measure IOPS, and the official definition of IOPS (operations per second) does not specify certain extremely important parameters. By doing any sort of I/O test on the box, you are automatically imposing your benchmark’s definition of IOPS for that specific test.

 

What’s an operation? What kind of operations are there?

It can get complicated.

An I/O operation is simply some kind of work the disk subsystem has to do at the request of a host and/or some internal process. Typically a read or a write, with sub-categories (for instance read, re-read, write, re-write, random, sequential) and a size.

Depending on the operation, its size could range anywhere from bytes to kilobytes to several megabytes.

Now consider the following most assuredly non-comprehensive list of operation types:

  1. A random 4KB read
  2. A random 4KB read followed by more 4KB reads of blocks in logical adjacency to the first
  3. A 512-byte metadata lookup and subsequent update
  4. A 256KB read followed by more 256KB reads of blocks in logical sequence to the first
  5. A 64MB read
  6. A series of random 8KB writes followed by 256KB sequential reads of the same data that was just written
  7. Random 8KB overwrites
  8. Random 32KB reads and writes
  9. Combinations of the above in a single thread
  10. Combinations of the above in multiple threads

…this could go on.

As you can see, there’s a large variety of I/O types, and true multi-host I/O is almost never of a single type. Virtualization further mixes up the I/O patterns, too.

Now here comes the biggest point (if you can remember one thing from this post, this should be it):

No storage system can do the same maximum number of IOPS irrespective of I/O type, latency and size.

Let’s re-iterate:

It is impossible for a storage system to sustain the same peak IOPS number when presented with different I/O types and latency requirements.

 

Another way to see the limitation…

Here’s a gross oversimplification that might help prove the point that the type and size of the operations you do matter when it comes to IOPS – meaning that a system that can do a million 512-byte IOPS can’t necessarily do a million 256K IOPS.

Imagine a bucket, or a shotshell, or whatever container you wish.

Imagine in this container you have either:

  1. A few large balls or…
  2. Many tiny balls

The bucket ultimately contains about the same volume of stuff either way, and that volume is the major limiting factor. Clearly, you can’t fit the same number of large balls into that container as you can small balls.

[Image: IOPS containers – the same container holding a few large balls vs many small balls]

They kinda look like shotshells, don’t they?

Now imagine the little spheres being forcibly evacuated rapidly out of one end… which takes us to…

 

Latency matters

So, we’ve established that not all IOPS are the same – but what is of far more significance is latency as it relates to the IOPS.

If you want to read no further – never accept an IOPS number that doesn’t come with latency figures, in addition to the I/O sizes and read/write percentages.

Simply speaking, latency is a measure of how long it takes for a single I/O request to happen from the application’s viewpoint.

In general, when it comes to data storage, high latency is just about the least desirable trait, right up there with poor reliability.

Databases are especially sensitive to latency – DBs make several kinds of requests that need to be acknowledged quickly (ideally in under 10ms, and writes especially in well under 5ms). The redo log writes in particular need to be acknowledged almost instantaneously for a heavy-write DB – under 1ms is preferable.

High sustained latency in a mission-critical app can have a nasty compounding effect – if a DB can’t write to its redo log fast enough for a single write, everything stalls until that write can complete, then moves on. However, if it constantly can’t write to its redo log fast enough, the user experience will be unacceptable as requests get piled up – the DB may be a back-end to a very busy web front-end for doing Internet sales, for example. A delay in the DB will make the web front-end also delay, and the company could well lose thousands of customers and millions of dollars while the delay is happening. Some companies could also face penalties if they cannot meet certain SLAs.

On the other hand, applications doing sequential, throughput-driven I/O (like backup or archival) are nowhere near as sensitive to latency (and typically don’t need high IOPS anyway, but rather need high MB/s).

It follows that not all I/O sizes and I/O operations are subject to the same latency requirements.

Here’s an example from an Oracle DB – a system doing about 15,000 IOPS at 25ms latency. Doing more IOPS would be nice but the DB needs the latency to go a lot lower in order to see significantly improved performance – notice the increased IO waits and latency, and that the top event causing the system to wait is I/O:

[Image: AWR example – Oracle AWR report excerpt]

Now compare to this system (the data is in a different format, but you’ll get the point):

Notice that, in this case, the system is waiting primarily for CPU, not storage.

A significant amount of I/O wait is a good way to determine if storage is an issue (there can be other latencies outside the storage of course – CPU and network are a couple of usual suspects). Even with good latencies, if you see a lot of I/O waits it means that the application would like faster speeds from the storage system.

But this post is not meant to be a DB sizing class. Here’s the important bit that I think is confusing a lot of people and is allowing vendors to get away with unrealistic performance numbers:

It is possible (but not desirable) to have high IOPS and high latency simultaneously.

How? Here’s a, once again, oversimplified example:

Imagine 2 different cars, both with a top speed of 150mph.

  • Car #1 takes 50 seconds to reach 150mph
  • Car #2 takes 200 seconds to reach 150mph

The maximum speed of the two cars is identical.

Does anyone have any doubt as to which car is actually faster? Car #1 indeed feels about 4 times faster than Car #2, even though they both hit the exact same top speed in the end.

Let’s take it an important step further, keeping the car analogy since it’s very relatable to most people (but mostly because I like cars):

  • Car #1 has a maximum speed of 120mph and takes 30 seconds to hit 120mph
  • Car #2 has a maximum speed of 180mph, takes 50 seconds to hit 120mph, and takes 200 seconds to hit 180mph

In this example, Car #2 actually has a much higher top speed than Car #1. Many people, looking at just the top speed, might conclude it’s the faster car.

However, Car #1 reaches 120mph far faster than Car #2 reaches that same speed.

Car #2 continues to accelerate (and, eventually, overtakes Car #1), but takes an inordinately long amount of time to hit its top speed of 180mph.

Again – which car do you think would feel faster to its driver?

You know – the feeling of pushing the gas pedal and the car immediately responding with extra speed that can be felt? Without a large delay in that happening?

Which car would get more real-world chances of reaching high speeds in a timely fashion? For instance, overtaking someone quickly and safely?

Which is why car-specific workload benchmarks like the quarter mile were devised: How many seconds does it take to traverse a quarter mile (the workload), and what is the speed once the quarter mile has been reached?

(I fully expect fellow geeks to break out the slide rules and try to prove the numbers wrong, probably factoring in gearing, wind and rolling resistance – it’s just an example to illustrate the difference between throughput and latency, I had no specific cars in mind… really).
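
Back in storage terms, Little’s Law makes the same point: IOPS ≈ outstanding I/Os ÷ latency. A huge IOPS number can be manufactured simply by piling on concurrency, while each individual I/O still takes a painfully long time (the numbers below are illustrative):

```python
def iops(outstanding_ios: int, latency_s: float) -> float:
    """Little's Law: throughput = concurrency / response time."""
    return outstanding_ios / latency_s

print(iops(1000, 0.020))   # 50,000 IOPS from 1000 threads... at a sluggish 20ms each
print(iops(1, 0.001))      # 1,000 IOPS from a single outstanding I/O at a snappy 1ms
```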

 

And, finally, some more storage-related examples…

Some vendor claims… and the fine print explaining the more plausible scenario beneath each claim:

“Mr. Customer, our box can do a million IOPS!”

512-byte ones, sequentially out of cache.

“Mr. Customer, our box can do a quarter million random 4K IOPS – and not from cache!”

at 50ms latency.

“Mr. Customer, our box can do a quarter million 8K IOPS, not from cache, at 20ms latency!”

but only if you have 1000 threads going in parallel.

“Mr. Customer, our box can do a hundred thousand 4K IOPS, at under 20ms latency!”

but only if you have a single host hitting the storage so the array doesn’t get confused by different I/O from other hosts.

Notice how none of these claims are talking about writes or working set sizes… or the configuration required to support the claim.

 

What to look for when someone is making a grandiose IOPS claim

Audited validation and a specific workload to be measured against (that includes latency as a metric) both help. I’ll pick on HDS since they habitually show crazy numbers in marketing literature.

For example, from their website:

[Image: HDS USP IOPS – an IOPS claim from the HDS website]

It’s pretty much the textbook case of unqualified IOPS claims. No information as to the I/O size, reads vs writes, sequential or random, what type of medium the IOPS are coming from, or, of course, the latency…

However, that very same box almost makes 270,000 SPC-1 IOPS with good latency in the audited SPC-1 benchmark:

[Image: VSP_SPC1 – the audited SPC-1 result for the same platform]

Last I checked, 270,000 was almost 15 times less than 4,000,000. Don’t get me wrong, roughly 270,000 low-latency IOPS is a great SPC-1 result, but it’s not 4 million SPC-1 IOPS.

Check my previous article on SPC-1 and how to read the results here. And if a vendor is not posting results for a platform – ask why.

 

Where are the IOPS coming from?

So, when you hear those big numbers, where are they really coming from? Are they just fictitious? Not necessarily. So far, here are just a few of the ways I’ve seen vendors claim IOPS prowess:

  1. What the controller will theoretically do given unlimited back-end resources.
  2. What the controller will do purely from cache.
  3. What a controller that can compress data will do with all zero data.
  4. What the controller will do assuming the data is at the FC port buffers (“huh?” is the right reaction, only one three-letter vendor ever did this so at least it’s not a widespread practice).
  5. What the controller will do given the configuration actually being proposed, driving a very specific application workload, with a specified latency threshold and real data.

The figures provided by the approaches above are all real, in the context of how the test was done by each vendor and how they define “IOPS”. However, of the (non-exhaustive) options above, which one do you think is the most realistic when it comes to dealing with real application data?

What if someone proves to you a big IOPS number at a PoC or demo?

Proof-of-Concept engagements or demos are great ways to prove performance claims.

But, as with everything, garbage in – garbage out.

If someone shows you IOmeter doing crazy IOPS, use the information in this post to help you at least find out the exact configuration of the benchmark. What’s the block size? Is it random, sequential, or a mix? How many hosts are doing I/O? Is the config being short-stroked? Is it all coming out of cache?

Typically, things like IOmeter can be a good demo but that doesn’t mean the combined I/O of all your applications’ performance follows the same parameters, nor does it mean the few servers hitting the storage at the demo are representative of your server farm with 100x the number of servers. Testing with as close to your application workload as possible is preferred. Don’t assume you can extrapolate – systems don’t always scale linearly.

 

Factors affecting storage system performance

In real life, you typically won’t have a single host pumping I/O into a storage array. More likely, you will have many hosts doing I/O in parallel. Here are just some of the factors that can affect storage system performance in a major way:

 

  1. Controller, CPU, memory, interlink counts, speeds and types.
  2. A lot of random writes. This is the big one since, depending on RAID level, the back-end I/O overhead could be anywhere from 2 I/Os (RAID 10) to 6 I/Os (RAID6) per write, unless some advanced form of write management is employed (see the sketch after this list).
  3. Uniform latency requirements – certain systems will exhibit latency spikes from time to time, even if they’re SSD-based (sometimes especially if they’re SSD-based).
  4. A lot of writes to the same logical disk area. This, even with autotiering systems or giant caches, still results in tremendous load on a rather limited set of disks (whether they be spinning or SSD).
  5. The storage type used and the amount – different types of media have very different performance characteristics, even within the same family (the performance between SSDs can vary wildly, for example).
  6. CDP tools for local protection – sometimes this can result in 3x the I/O to the back-end for the writes.
  7. Copy on First Write snapshot algorithms with heavy write workloads.
  8. Misalignment.
  9. Heavy use of space efficiency techniques such as compression and deduplication.
  10. Heavy reliance on autotiering (resulting in the use of too few disks and/or too many slow disks in an attempt to save costs).
  11. Insufficient cache with respect to the working set coupled with inefficient cache algorithms, too-large cache block size and poor utilization.
  12. Shallow port queue depths.
  13. Inability to properly deal with different kinds of I/O from more than a few hosts.
  14. Inability to recognize per-stream patterns (for example, multiple parallel table scans in a Database).
  15. Inability to intelligently prefetch data.
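
To make item 2 above concrete, here is the classic back-of-the-envelope back-end I/O calculation (textbook RAID write penalties, assuming no write coalescing or other advanced write management):

```python
RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}   # back-end I/Os per random write

def backend_iops(front_end_iops: int, write_fraction: float, raid: str) -> float:
    """Back-end disk IOPS generated by a random front-end workload."""
    reads = front_end_iops * (1 - write_fraction)
    writes = front_end_iops * write_fraction
    return reads + writes * RAID_WRITE_PENALTY[raid]

# 10,000 front-end IOPS at 30% random writes:
for raid in ("RAID10", "RAID5", "RAID6"):
    print(raid, backend_iops(10_000, 0.30, raid))   # 13,000 / 19,000 / 25,000
```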

 

What you can do to get a solution that will work…

You should work with your storage vendor to figure out, at a minimum, the items in the following list, and, after you’ve done so, go through the sizing with them and see the sizing tools being used in front of you. (You can also refer to this guide).

  1. Applications being used and size of each (and, ideally, performance logs from each app)
  2. Number of servers
  3. Desired backup and replication methods
  4. Random read and write I/O size per app
  5. Sequential read and write I/O size per app
  6. The percentages of read vs write for each app and each I/O type
  7. The working set (amount of data “touched”) per app
  8. Whether features such as thin provisioning, pools, CDP, autotiering, compression, dedupe, snapshots and replication will be utilized, and what overhead they add to the performance
  9. The RAID type (R10 has an impact of 2 I/Os per random write, R5 4 I/Os, R6 6 I/Os – is that being factored?)
  10. The impact of all those things to the overall headroom and performance of the array.

If your vendor is unwilling or unable to do this type of work, or, especially, if they tell you it doesn’t matter and that their box will deliver umpteen billion IOPS – well, at least now you know better 🙂

D
