Why it is Incorrect to use Average Block Size for Storage Performance Benchmarking

Just a quick post to address something many people either get wrong or just live with due to convenience.

In summary: Please, let’s stop using average I/O sizes to characterize storage system performance. It’s wrong and doesn’t describe how the real world works. Using an average number is as bad as using small block 100% read numbers shown in vanity benchmarks. Neither is representative of real life.

Using a single I/O size for benchmarking became common practice partly because of vanity benchmarks and partly to provide a level playing field for comparing multiple products.

But, ultimately, even though the goal of comparing different systems is desirable, using a single I/O size is fundamentally flawed.

Why Averages can be a Problem

You see, a simple average calculation is horrendously misleading for many datasets and can lead to very wrong conclusions.

For example, the following statements may be mathematically accurate, but don’t help with understanding what’s real:

  • The average speed of a supercar (not counting when it’s parked) is 60mph. Sure – if you calculate the average speed given the time it’s stopped in traffic and the rare times it’s being driven on a track. This average calculation gives absolutely no indication as to the true performance potential of such a car.
  • The average size vehicle is a minivan. Sure – if you calculate the average size of all vehicles on the road, that may be true. It will also have disastrous consequences if someone tries to use that metric to design bridges and overpasses…

Why Using Average I/O Sizes to Determine Storage Performance Can be Grossly Inaccurate

Using average I/O size in storage systems is a really old practice, born of necessity and insufficient tooling. Yet it can lead to the same wrong conclusions as the car examples above.

In a nutshell, many people (and storage performance tools) characterize performance simply as average IOPS at an average I/O size observed over time. For example, 100,000 IOPS at an average of 52KB I/O size.
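
To make this concrete, here is a minimal Python sketch (the sample values and names are hypothetical, not taken from any real tool) of how many monitoring tools arrive at that single pair of numbers: they simply average the per-interval IOPS and the per-interval average I/O size over the whole observation window.

```python
from statistics import mean

# Hypothetical per-interval samples from a monitoring tool:
# each tuple is (IOPS observed in the interval, average I/O size in KB).
samples = [
    (110_000, 8),    # transactional burst, small I/O
    (95_000, 8),
    (12_000, 256),   # backup window, large sequential I/O
    (105_000, 8),
]

avg_iops = mean(s[0] for s in samples)
avg_io_size_kb = mean(s[1] for s in samples)

print(f"Reported: {avg_iops:,.0f} IOPS at an average of {avg_io_size_kb:.0f}KB")
```

That single “IOPS at average I/O size” pair hides the fact that the backup window and the transactional bursts are completely different workloads.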

The challenges with that approach are manifold:

  • Storage system performance is not a linear relationship between IOPS and I/O size (putting this one first because it seems to be by far the most common misconception). For instance, a system that can do a maximum of 300,000 read IOPS at a 32K I/O size won’t do 2.4 million IOPS at a 4K size, and vice versa (for example, you may find that the system that really can do 2.4 million read IOPS at 4K is able to do much more than 300,000 32K read IOPS). See the toy model sketch after this list.
  • Real applications are bursty and can have completely different I/O profiles for daytime vs nighttime workloads. With an average number, that profile is not respected.
  • In real life, several applications hit storage systems with I/O requests of multiple sizes and types simultaneously (for a primer go here). The same application could be doing 8K random reads and writes, 512-byte sequential appends, and 256K sequential reads, all at the same time. Trying to characterize such complex behavior with a single number is rather optimistic and never works in practice, simply because storage systems don’t internally work using averages.
  • Storage systems may exhibit very different performance characteristics depending on the size of the I/O operation that’s happening. For instance, a system may be heavily optimized for sequential and/or random I/O at a certain I/O size, such as 16K or 32K, and experience abnormally high overheads at other I/O sizes. Testing that system at its “optimal” I/O size, especially if said I/O size isn’t anywhere near what applications do, may be unwise (but will provide stellar numbers for said system, so there’s that).
  • Storage systems may also exhibit vastly different performance characteristics depending on the kind of I/O operation that’s happening. For instance, a system that can read at colossal speeds may find itself exhibiting an order of magnitude reduction in random write performance, especially with more resilient RAID algorithms and inline data reduction.
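
To illustrate the first bullet above, here is a toy Python model. The min-of-two-limits idea and all the numbers are illustrative assumptions, not measurements of any real array: the maximum IOPS at a given I/O size is roughly capped by whichever limit is hit first, a per-I/O processing ceiling or a bandwidth ceiling, so IOPS do not scale linearly as the I/O size changes.

```python
def max_read_iops(io_size_kb, iops_ceiling, bandwidth_mbps):
    """Toy model: IOPS are capped either by a per-I/O processing limit
    or by total bandwidth, whichever is hit first."""
    bandwidth_limited_iops = (bandwidth_mbps * 1024) / io_size_kb
    return min(iops_ceiling, bandwidth_limited_iops)

# Hypothetical "System A": modest processing limit, modest bandwidth.
for size in (4, 32):
    print(f"System A @ {size}K: {max_read_iops(size, 350_000, 9_600):,.0f} IOPS")

# Hypothetical "System B": much higher processing limit and bandwidth.
for size in (4, 32):
    print(f"System B @ {size}K: {max_read_iops(size, 2_400_000, 40_000):,.0f} IOPS")
```

Under these made-up limits, “System A” tops out around 300,000 IOPS at 32K yet nowhere near 2.4 million at 4K, while “System B” does 2.4 million at 4K and far more than 300,000 at 32K. You cannot extrapolate IOPS at one I/O size from IOPS at another by simple multiplication or division.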

Here’s a simple example of how averages can lead to incredibly wrong conclusions when talking about storage performance:

[Chart: Throughput burst example]

In this chart, a single application does 300,000 transactional random IOPS at 4K, but at some point it switches workloads to 20,000 sequential IOPS at the much larger 256K I/O size. The system performs fine doing this, and it’s how the application does different kinds of work during a working day.

Now, the average I/O size in this example is 32K.

The average IOPS are 268,888.

If one uses the simple (average IOPS) * (average I/O size) calculation to derive the required throughput, the result is 8,402 MB/s, which is 68% more throughput than this application ever actually needs (the maximum it ever hits is 5,000 MB/s).

If someone is relying on this calculation to spec a new system, then they will end up with something that’s potentially far more expensive than it needs to be.

Simply put, the calculated 32K average I/O size value is wildly inaccurate compared to what this application really does!
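
Here is a short Python sketch that reproduces the arithmetic above. One assumption is made explicit: to arrive at a 32K average I/O size and 268,888 average IOPS, the workload must spend roughly eight ninths of the measured intervals in the 4K phase and one ninth in the 256K phase, so that time split is an assumption consistent with the quoted averages rather than something read directly off the chart.

```python
# Two workload phases from the example; the 8:1 time split is an
# assumption chosen so the per-interval averages work out to the
# 32K / 268,888 figures quoted above.
phases = [
    # (intervals, IOPS, I/O size in KB)
    (8, 300_000, 4),    # transactional random I/O
    (1, 20_000, 256),   # large sequential burst
]

intervals = sum(n for n, _, _ in phases)
avg_iops = sum(n * iops for n, iops, _ in phases) / intervals
avg_size_kb = sum(n * size for n, _, size in phases) / intervals

# The naive "average x average" throughput estimate:
naive_mbps = avg_iops * avg_size_kb / 1024

# What the application actually needs at its busiest moment:
actual_peak_mbps = max(iops * size / 1024 for _, iops, size in phases)

print(f"Average IOPS:       {avg_iops:,.0f}")             # ~268,889
print(f"Average I/O size:   {avg_size_kb:.0f}K")          # 32K
print(f"Naive estimate:     {naive_mbps:,.0f} MB/s")      # ~8,403 MB/s
print(f"Actual peak needed: {actual_peak_mbps:,.0f} MB/s")  # 5,000 MB/s
```

The naive estimate overshoots the real 5,000 MB/s peak requirement by roughly 68%, which is exactly the over-provisioning trap described above.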

How Nimble Storage Approached the I/O Size Question

Instead of using averages, HPE Nimble Storage used advanced analytics and AI to identify what actually happens in the real world of storage system performance.

Nimble Storage customers can tag each volume created with an application profile. This makes it easier to report on how specific applications do I/O and avoid gross generalizations or searching for matching volume name strings (no need to put the “SQL” string in your SQL volumes! Just tell the array if the volume is SQL data or SQL log, for example).

By looking, in detail, at the workloads of all Nimble arrays reporting into InfoSight, Nimble was able to very accurately characterize the precise I/O patterns of many common applications.

What were the Findings?

What did we learn? In simple terms, most applications fall into one of two camps: either they have a bimodal I/O distribution, with smaller I/O sizes doing transactional, latency-sensitive work and larger I/O sizes doing high-throughput work, or they have a very focused I/O size (either mostly small or mostly large).

No application actually did a majority of 32K I/Os, which further proves the point that using a single I/O size to characterize storage system performance is colossally wrong.

For instance, here’s SQL. BTW – the square bracket means inclusive and the parenthesis exclusive – so [8,16) means an I/O size equal to or greater than 8K but less than 16K. The bulk of the I/O is 8K, then 64K, then 128K, then 256K.

[Chart: SQL I/O size distribution]
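
If you want to build this kind of distribution from your own trace data, the bucketing itself is simple. Here is a minimal Python sketch (the trace values are hypothetical) that bins observed I/O sizes into the same left-inclusive intervals used in these charts.

```python
from collections import Counter

# Left-inclusive power-of-two buckets in KB, as in the charts: [8,16) etc.
BUCKETS = [(0, 4), (4, 8), (8, 16), (16, 32), (32, 64),
           (64, 128), (128, 256), (256, 1024)]

def bucket_label(size_kb):
    for low, high in BUCKETS:
        if low <= size_kb < high:
            return f"[{low},{high})"
    return f">={BUCKETS[-1][1]}"

# Hypothetical observed I/O sizes (KB) from a trace.
trace_sizes_kb = [8, 8, 8, 64, 8, 128, 8, 256, 8, 64]

histogram = Counter(bucket_label(s) for s in trace_sizes_kb)
for label, count in histogram.most_common():
    print(f"{label:>10}: {count / len(trace_sizes_kb):.0%}")
```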

Here’s VDI. The vast majority of I/O is small, with a lot of writes:

[Chart: VDI I/O size distribution]

Exchange is interesting. It’s easy to see the huge changes Microsoft made from 2003 to 2013. Exchange 2003 did mostly small block I/O:

[Chart: Exchange 2003 I/O size distribution]

Whereas Exchange 2013 introduced a strong preference for very large I/O sizes (with a lot of small writes, though):

[Chart: Exchange 2013 I/O size distribution]

I’ll stop with the examples. If you want to see more, get ready for some heady material here (easier) and here (comprehensive). Make sure you learn how to read the density maps if you hit those links.

What Nimble did with the Findings

Knowledge is power.

By understanding how applications really work, it’s possible to more accurately optimize storage systems for better application-aware performance.

For instance, what if your storage system could automatically sense what kind of I/O it should prioritize when faced with too much work? Wouldn’t a DB redo log naturally need higher priority and lower latency than a general DB read? And a general DB read higher priority than a DB table scan? This takes things beyond simplistic auto-QoS schemes like sequential vs random and fair sharing.
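
As a purely conceptual toy (this is not Nimble’s actual QoS implementation, just a sketch of the idea in Python), a scheduler that knows the application category of each volume could dequeue latency-sensitive I/O classes ahead of throughput-oriented ones whenever there is more work than the system can absorb:

```python
import heapq

# Hypothetical priorities: lower number = served first under pressure.
PRIORITY = {"db_redo_log": 0, "db_read": 1, "db_table_scan": 2}

class ToyScheduler:
    """Toy priority scheduler: when there is more work than the system
    can absorb, latency-sensitive I/O classes are dequeued first."""
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker keeps FIFO order within a class

    def submit(self, io_class, request):
        heapq.heappush(self._queue, (PRIORITY[io_class], self._seq, request))
        self._seq += 1

    def next_request(self):
        return heapq.heappop(self._queue)[2]

sched = ToyScheduler()
sched.submit("db_table_scan", "scan chunk 17")
sched.submit("db_redo_log", "commit record 42")
sched.submit("db_read", "point lookup")

print(sched.next_request())  # commit record 42 - the redo log wins
print(sched.next_request())  # point lookup
print(sched.next_request())  # scan chunk 17
```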

Such automation lowers business risk (better performance stability under pressure), OPEX (it’s simple) and CAPEX (since it doesn’t require over-engineered solutions).

In addition, it makes it possible to more accurately (and safely) size storage systems – again, without resorting to over-engineering and the resulting increased costs.

The Nimble sizing tool actually lives inside InfoSight and can use real application statistics to determine sizing.

What Customers can do with this Information

If your existing array performance visualization tool just gives you an average I/O size number (as many do), don’t use it as your block size in a benchmark tool – the testing will give you wrong numbers. Instead, use numbers from this research. A good typical blend would be 8K random plus 256K sequential. If you must use some sort of simplification, that at least captures what happens in real life a bit more accurately – but it is still insufficient for many applications.

If you want to test a storage system even more accurately, use this kind of information to concoct more elaborate benchmarks. That way you can do some proper testing instead of running test scenarios that simply aren’t found anywhere in the real world of applications.

For instance, you could build a compound test that contains workloads that look like VDI, SQL, Exchange, file services and so on, with a percentage of each as you’d use in your environment.
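
As one way to do that, here is a minimal Python sketch that generates an fio job file for such a compound test. The workload names, percentages and parameters are placeholders to replace with your own environment’s profile (they are not Nimble’s published application statistics), and the 8K random / 256K sequential blend mentioned earlier appears as just one component of the mix.

```python
# Minimal sketch: generate an fio job file for a compound, multi-application
# style test. Percentages and parameters are illustrative placeholders only.
workloads = {
    # name: (fio rw mode, block size, % of total jobs, read mix %)
    "oltp_like":     ("randrw", "8k",   50, 70),
    "seq_read_like": ("read",   "256k", 20, 100),
    "vdi_like":      ("randrw", "4k",   20, 30),
    "backup_like":   ("write",  "256k", 10, 0),
}

TOTAL_JOBS = 10  # total concurrent fio jobs to spread across the mix

lines = [
    "[global]",
    "ioengine=libaio",
    "direct=1",
    "time_based=1",
    "runtime=600",
    "group_reporting=1",
    "; hypothetical test mount point - change to suit your environment",
    "directory=/mnt/testfs",
    "size=10g",
    "",
]

for name, (rw, bs, pct, read_pct) in workloads.items():
    lines += [
        f"[{name}]",
        f"rw={rw}",
        f"bs={bs}",
        f"numjobs={max(1, TOTAL_JOBS * pct // 100)}",
        "iodepth=16",
    ]
    if rw == "randrw":
        lines.append(f"rwmixread={read_pct}")
    lines.append("")

with open("compound_test.fio", "w") as f:
    f.write("\n".join(lines))
print("Wrote compound_test.fio – run with: fio compound_test.fio")
```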

For the ultimate accuracy, you should just test using real applications… or use something that records real application I/O and plays it back (there are a few nice options).

And yes, I know it’s easier just testing a single I/O size number, but if that technique is demonstrably wrong, then it’s easy but pointless.

D
