Proper Testing vs Real World Testing

The idea for this article came from seeing various people attempt product testing. Though I thought about storage when writing this, the ideas apply to most industries.

Three different kinds of testing

There are really three different kinds of testing.

The first is the incomplete, improper, useless in almost any situation testing. Typically done by people that have little training on the subject. Almost always ends in misleading results and is arguably dangerous, especially if used to make purchasing decisions.

The second is what’s affectionately and romantically called “Real World Testing”. Typically done by people that will try to simulate some kind of workload they believe they encounter in their environment, or use part of their environment to do the testing. Much more accurate than the first kind, if done right. Usually the workload is decided arbitrarily 🙂

The third and last kind is what I term “Proper Testing”. This is done by professionals (that usually do this type of testing for a living) that understand how complex testing for a broad range of conditions needs to be done. It’s really hard to do, but pays amazing dividends if done thoroughly.

Let’s go over the three kinds in more details, with some examples.

Useless Testing

Hopefully after reading this you will know if you’re a perpetrator of Useless Testing and never do it again.

A lot of what I deal with is performance, so I will draw examples from there.

Have you or someone you know done one or more of the following after asking to evaluate a flashy, super high performance enterprise storage device?

  • Use tiny amounts of test data, ensuring it will all be cached
  • Try to test said device with a single server
  • With only a couple of I/O ports
  • With a single LUN
  • With a single I/O thread
  • Doing only a certain kind of operation (say, all random reads)
  • Using only a single I/O size (say, 4K or 32K) for all operations
  • Not looking at latency
  • Using extremely compressible data on an array that does compression

I could go on but you get the idea. A good move would be to look at the performance primer here before doing anything else…

Pitfalls of Useless Testing

The main reason people do such poor testing is usually that it’s easy to do. Another reason is that it’s easy to use it to satisfy a Confirmation Bias.

The biggest problem with such testing is that it doesn’t tell you how the system behaves if exposed to different conditions.

Yes, you see how it behaves in a very specific situation, and that situation might even be close to a very limited subset of what you need to do in “Real Life”, but you learn almost nothing about how the system behaves in other kinds of scenarios.

In addition, you might eliminate devices that would normally behave far better for your applications, especially under duress, since they might fail Useless Testing scenarios (since they weren’t designed for such unrealistic situations).

Making purchasing decisions after doing Useless Testing will invariably result in making bad purchasing decisions unless you’re very lucky (which usually means that your true requirements weren’t really that hard to begin with).

Part of the problem of course is not even knowing what to test for.

“Real World” Testing

This is what most sane people strive for.

How to test something in conditions most approximating how it will be used in real life.

Some examples of how people might try to perform “Real World” testing:

  • One could use real application I/O traces and advanced benchmarking software that can replay them, or…
  • Spin synthetic benchmarks designed to simulate real applications, or…
  • Put one of their production applications on the system and use it like they normally would.

Clearly, any of this would result in more accurate results than Useless Testing. However…

Pitfalls or “Real World” Testing

The problem with “Real World” testing is that it invariably does not reproduce the “Real World” closely enough to be comprehensive and truly useful.

Such testing addresses only a small subset of the “Real World”. The omissions dictate how dangerous the results are.

In addition, extrapolating larger-scale performance isn’t really possible. How the system ran one application doesn’t mean you know how it will run ten applications in parallel.

Some examples:

  • Are you testing one workload at a time, even if that workload consists of multiple LUNs? Shared systems will usually have multiple totally different workloads hitting them in parallel, often wildly different and conflicting, coming and going at different times, each workload being a set of related LUNs. For instance, an email workload plus a DSS plus an OLTP plus a file serving plus a backup workload in parallel… 🙂
  • Are you testing enough workloads in parallel? True Enterprise systems thrive on crunching many concurrent workloads.
  • Are you injecting other operations that you might be doing in the “Real World”? For instance, replicate and delete large amounts of data while doing I/O on various applications? Replications and deletions have been known to happen… 😉
  • Are you testing how the system behaves in degraded mode while doing all the above? Say, if it’s missing a controller or two, or a drive or two… also known to happen.

That’s right, this stuff isn’t easy to do properly. It also takes a very long time. Which is why serious Enterprise vendors have huge QA organizations going through these kinds of scenarios. And why we get irritated when we see people drawing conclusions based on incomplete testing 😉

Proper Testing

See, the problem with the “Real World” concept is that there are multiple “Real Worlds”.

Take car manufacturers for instance.

Any car manufacturer worth their salt will torture-test their cars in various wildly different conditions, often rapidly switching from one to the other if possible to make things even worse:

  • Arctic conditions
  • Desert, dusty conditions plus harsh UV radiation
  • Extremely humid conditions
  • All the above in different altitudes

You see, maybe your “Real World” is Alaska, but another customer’s “Real World” will be the very wet Cherrapunji, and a third customer’s “Real World” might be the Sahara desert. A fourth might be a combination (cold, very high altitude, harsh UV radiation, or hot and extremely humid).

These are all valid “Real World” conditions, and the car needs to be able to deal with all of them while maintaining full functioning until beyond the warranty period.

Imagine if moving from Alaska to the Caribbean meant your car would cease functioning. People have been known to move, too… 🙂

Storage manufacturers that actually know what they’re doing have an even harder job:

We need to test multiple “Real Worlds” in parallel. Oh, and we do test the different climate conditions as well, don’t think we assume everyone operates their hardware in ideal conditions… especially folks in the military or any other life-or-death situation.

Closing thoughts

If I’ve succeeded in stopping even one person from doing Useless Testing, this article was a success 🙂

It’s important to understand that Proper Testing is extremely hard to do. Even gigantic QA organizations miss bugs, imagine someone doing incomplete testing. There are just too many permutations.

Another sobering thought is that very few vendors are actually able to do Proper Testing of systems. The know-how, sheer time and numbers of personnel needed is enough to make most smaller vendors skimp on the testing out of pure necessity. They test what they can, otherwise they’d never ship anything.

A large number of data points helps. Selling millions of units of something provides a vendor with way more failure data than selling a thousand units of something possibly can. Action can be taken to characterize the failure data and see what approach is best to avoid the problem, and how common it really is.

One minor failure in a million may not be even worth fixing, whereas if your sample size is ten, one failure means 10%. It might be the exact same failure, but you don’t know that. Or you might have zero failures in the sample size of ten, which might make you think your solution is better than it really is…

But I digress into other, admittedly interesting areas.

All I ask is… think really hard before testing anything in the future. Think what you really are trying to prove. And don’t be afraid to ask for help.

D

Technorati Tags: , , ,

Leave a comment for posterity...