This topic is very near and dear to me, and is one of the big reasons I came over to Nimble Storage.
I’ve always believed that storage systems should behave gracefully and predictably under pressure. Automatically. Even under complex and difficult situations.
It sounds like a simple request and it makes a whole lot of sense, but very few storage systems out there actually behave this way. This creates business challenges and increases risk and OpEx.
The simplest way to state the problem: under a variety of circumstances, most storage systems can enter conditions where workloads suffer unfair and abrupt performance starvation.
OK, maybe that wasn’t the simplest way.
Consider the following scenarios:
1. A huge sequential I/O job (backup, analytics, data loads, etc.) arriving in the middle of latency-sensitive transaction processing
2. Heavy array-generated workloads (garbage collection, post-process dedupe, replication, big snapshot deletions, etc.) running at the same time as user I/O
3. Failed drives
4. Controller failover (due to an actual problem or simply a software update)
#3 and #4 are the more obvious cases – a well-behaved system will ensure high performance even during a drive failure (or three), and after a controller fails over. For instance, if total system headroom is automatically kept at 50% for a dual-controller system (or, simplistically, at 100/n percent, where n is the controller count, for shared-everything architectures), performance should remain fine even after a controller fails.
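As a back-of-the-envelope illustration of that 100/n rule, here is a tiny Python sketch. The function name and the assumption of surviving exactly one controller loss are mine, not any vendor's formula:

```python
# Illustrative only: usable fraction of total capacity if the system must
# survive the loss of one controller (the "keep 100/n percent headroom" rule).
def max_safe_utilization(controllers: int) -> float:
    if controllers < 2:
        raise ValueError("need at least 2 controllers to tolerate a failover")
    # Reserve 1/n of total capacity as headroom; the rest is safely usable.
    return (controllers - 1) / controllers

print(max_safe_utilization(2))  # 0.5  -> dual-controller: keep 50% headroom
print(max_safe_utilization(4))  # 0.75 -> larger clusters can run each node hotter
```

The point of the reservation is that the surviving controllers can absorb a failed peer's load without the system itself becoming the noisy neighbor.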
#1 and #2 are a bit more complicated to deal with. Let’s look at this in more detail.
The Case of Competing Workloads During Hard Times
Inside every array, at any given moment, a balancing act occurs. Multiple things need to happen simultaneously.
Several user-generated workloads, for instance:
- File Services
Various internal array processes – these, too, are workloads, just array-generated and often critical:
- Data reduction (dedupe, compression)
- Cleanup (object deletion, garbage collection)
- Data protection (integrity-related)
- Backups (snaps, replication)
If the system has enough headroom, all these things will happen without performance problems.
If the system runs out of headroom, that’s where most arrays have challenges with prioritizing what happens when.
The most common way a system runs out of headroom is the sudden appearance of a hostile “bully” workload, also known as a “noisy neighbor”. Consider what happens when one shows up: the latency-sensitive workload greatly and unfairly suffers the moment the noisy neighbor appears. If the latency-sensitive workload is a mission-critical application, this could cause a serious business problem (slow processing of financial transactions, for instance).
This is an extremely common scenario. A lot of the time it’s not even a new workload. Often, an existing workload changes behavior (possibly due to an application change – for instance a patch or a modified SQL query). This stuff happens.
In addition, the noisy neighbor doesn’t have to be a different LUN/Volume – it could be a competing workload within the same Volume! In such cases, simple per-volume fair sharing will not work.
How Some Vendors Have Tried to Fix the Issue with Manual “QoS”
As always, there is more than one way to skin a cat, if one is so inclined. Here are a couple of manual methods to fix workload contention:
- Some arrays have a simple IOPS or throughput limit that an administrator can manually adjust in order to fix a performance problem. This is an iterative, reactive method that is hard to automate properly in real time. In addition, if the issue was caused by an internal array-generated workload, there is often no tooling available to throttle those processes.
- Other arrays insist on the user setting up minimum, maximum and burst IOPS values for every single volume in the system, upon volume creation. This assumes the user knows in advance what performance envelope is required, in detail, per volume. The reality is that almost nobody knows these things beforehand, and getting the numbers wrong can itself cause a huge problem with latencies. Most people just want to get on with their lives and have their stuff work without babysitting.
- A few modern arrays do “fair sharing” among LUNs/Volumes. This can help a bit overall but doesn’t address the issue of competing workloads within the same Volume.
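To make the first approach above concrete, here is a minimal token-bucket sketch of a manual per-volume IOPS cap. The class and parameter names are hypothetical, not any vendor's API, and its weakness is exactly what the bullet describes: the limit is a static number an administrator must keep re-tuning as workloads change:

```python
class IopsLimiter:
    """Token-bucket cap on a volume's IOPS (illustrative sketch only)."""

    def __init__(self, iops_limit: float, burst: float, now: float = 0.0):
        self.rate = iops_limit   # sustained IOPS allowed
        self.capacity = burst    # extra I/Os allowed in a short burst
        self.tokens = burst
        self.last = now

    def admit(self, now: float) -> bool:
        """Return True if one I/O may proceed at time `now` (seconds)."""
        # Refill tokens based on elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the cap: the I/O must be queued or rejected

# A 100-IOPS cap with a burst of 5: the 6th back-to-back I/O is throttled,
# whether this volume is the bully or the victim - the cap cannot tell.
limiter = IopsLimiter(iops_limit=100, burst=5)
print([limiter.admit(now=0.0) for _ in range(6)])
```

Note that the limiter has no notion of which I/O is latency-sensitive; it simply throttles the volume as a whole, which is why this style of QoS needs constant babysitting.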
How Nimble Storage Fixed the Noisy Neighbor Issue
No cats were harmed in the process. Nimble engineers looked at the extensive telemetry in InfoSight, used data science, and neatly identified areas that could be massively automated in order to optimize system behavior under a wide variety of adverse conditions. Some of what was done:
- Highly advanced Fair Share disk scheduling (separate mechanisms that deal with different scenarios)
- Fair Share CPU scheduling
- Dynamic Weight Adjustment (the most important innovation listed) – automatically adjusts priorities in various ways under different resource-contention conditions, so that the system can always complete critical tasks and never fall dangerously behind. For instance, even within the same Volume, latency-sensitive I/O (like a DB redo log write or random DB access) is preferentially prioritized over things like a large DB table scan (which isn’t latency-sensitive).
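One way to picture mechanisms like these is weighted fair queueing in which the weights are not static: under contention, the weight of latency-sensitive I/O classes gets boosted. The sketch below is my own illustration of that general idea – the class names, weights, and load threshold are invented, and this is emphatically not NimbleOS internals:

```python
import heapq

# Illustrative weighted-fair-queueing sketch with dynamic weights.
# Each I/O class advances a virtual clock by cost / weight; the scheduler
# always dispatches the request with the earliest virtual finish time.
BASE_WEIGHTS = {"latency_sensitive": 4, "sequential_scan": 1, "gc": 1}

class FairScheduler:
    def __init__(self):
        self.vtime = {}   # per-class virtual time
        self.queue = []   # (finish_tag, seq, io_class, description)
        self.seq = 0      # tie-breaker for equal finish tags

    def weight(self, io_class: str, load: float) -> float:
        # Dynamic adjustment: under heavy load, boost latency-sensitive I/O
        # so it is not starved by large scans or background work.
        boost = 4 if load > 0.8 and io_class == "latency_sensitive" else 1
        return BASE_WEIGHTS[io_class] * boost

    def submit(self, io_class, cost, load, description):
        finish = self.vtime.get(io_class, 0.0) + cost / self.weight(io_class, load)
        self.vtime[io_class] = finish
        heapq.heappush(self.queue, (finish, self.seq, io_class, description))
        self.seq += 1

    def dispatch(self):
        _, _, io_class, description = heapq.heappop(self.queue)
        return io_class, description

# Even within one Volume: a 64-block table scan is queued first, yet a small
# redo-log write submitted later still dispatches ahead of it at 90% load.
s = FairScheduler()
s.submit("sequential_scan", cost=64, load=0.9, description="DB table scan")
s.submit("latency_sensitive", cost=4, load=0.9, description="redo log write")
print(s.dispatch())  # ('latency_sensitive', 'redo log write')
```

Because the classes here are I/O types rather than volumes, this kind of scheduler can differentiate competing workloads inside a single Volume – the case where simple per-volume fair sharing falls down.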
The end result is a system that:
- Lets system latency increase gracefully and progressively as load increases
- Carefully and automatically balances user and system workloads, even within the same Volume
- Meets I/O deadlines and provides preemption for latency-sensitive I/O, instead of blindly doing fair sharing among all workloads
- Eliminates the Noisy Neighbor problem without the need for any manual QoS adjustments
- Allows latency-sensitive small-block I/O to proceed without interference from bully workloads even at extremely high system loads.
What Should Nimble Customers Do to Get This Capability?
As is typical with Nimble systems and their impressive Ease of Consumption, nothing fancy needs to be done apart from simply upgrading to a specific release of the code (in this case 3.1 and up – 2.3 did some of the magic but 3.1 is the fully realized vision).
A bit anticlimactic, apologies… if you like complexity, watching this instead is probably more fun than juggling QoS manually.
EDIT: There’s now a nice writeup about our Auto QoS in Nimble Connect here.
Another update: As of v4.x, NimbleOS adds support for manual tweaking of QoS, in addition to the fully automatic heuristics described in this post. Please see here.