Beware of storage performance guarantees

Ah, nothing to bring joy to the holidays like a bit of good old-fashioned sales craziness.

Recently we started seeing weird performance “guarantees” by some storage vendors, who seem will try anything for a sale.

Probably by people that haven’t read this.

It goes a bit like this:

“Mr. Customer, we guarantee our storage will do 100,000 IOPS no matter the I/O size and workload”.

Next time a vendor pulls this, show them the following chart. It’s a simple plot of I/O size vs throughput for 100,000 IOPS:

Throughput IO Size

Notice that at a 1MB I/O size the throughput is a cool 100GB/s :)

Then ask that vendor again if they’re sure they still want to make that guarantee. In writing. With severe penalties if it’s not met. As in free gear UNTIL the requirement is met. At any point during the lifetime of the equipment.

Then sit back and enjoy the backpedalling. 

You can make it even more fun, especially if it’s a hybrid storage vendor (mixed spinning and flash storage for caching, with or without autotiering):

  • So you will guarantee those IOPS even if the data is not in cache?
  • For completely random reads spanning the entire pool?
  • For random overwrites? (that should be a fun one, 100GB/s of overwrite activity).
  • For non-zero or at least not crazily compressible data?
  • And what’s the latency for the guarantee? (let’s not forget the big one).
  • etc. You get the point.
Happy Holidays everyone!


Is convenience devaluing products? Does quality suffer because of it?

Kind of a long hiatus posting (far too busy working on cool stuff) and for you looking for a deep technical post this may not be it… but here goes anyway since the content may also apply to my more usual subjects.

Recently I decided to discard my Luddite membership card and join the hordes of people using network-based services for music.

The experiment is ongoing – I do like the convenience of being able to select almost any song or album for the monthly equivalent cost of less than what an album is worth.

It’s a pretty good deal if you listen to a lot of new music, and/or you don’t like listening to ads on the radio.

There’s a plethora of free offerings but if you are mobile and want to use it on your phone, there’s usually a cost involved to have the convenience of selecting the exact songs you like.

How convenience has affected me

I did notice several interesting aspects in which this newfound convenience has changed my listening habits in a positive way:

  1. I am discovering a lot more new music since it’s so incredibly easy to do so. And some old music I never gave a chance to.
  2. Sharing music with other people is easy and involves no illegal copying of data.
  3. I don’t have to worry about putting the “right” music in my portable device – I can stream what I want, from wherever I want, even on devices I don’t own, and even designate items for “offline use” – meaning they’ll be cached and playable even if I’m not connected to a network.
  4. I have easy access to most of my oldie favorites that I might normally not keep in my device due to space reasons.
  5. The quality is very good. But we won’t go into psychoacoustics here :)

However, there have also been some pretty negative aspects to all this convenience… for instance:

  1. I realize I now suffer from music ADD – I seldom just sit down and listen to a whole album like we all used to do in the olden days.
  2. Albums now have zero monetary value in my mind – they’re just part of the low monthly fee.
  3. If albums have a perceived zero monetary value, they become a commodity and not something to be treasured. I remember when it was a huge deal to get a new album from my favorite artists: the anticipation, the excitement, the trip to the record store, waiting in line, scarcity, the artwork in the packaging, the sheer physicality of it all. This combination of attributes ensured I would at least give that album a chance – indeed, I was likely to listen to it repeatedly, analyze it and appreciate the artistry involved. I was invested.
  4. As a result of this devaluing, amazing works of art that were extremely difficult to accomplish may now be skipped altogether because they may be a bit time-consuming or even difficult to “get into” – some concept albums you just need to be in the right frame of mind for and/or have the requisite amount of time to listen to the story unfold. Since there’s no perceived investment and no excitement, it’s less likely to spend the energy trying to get into the album, no matter how rewarding it may be in the end.
  5. For something more practical: The toll on the mobile devices’ batteries is 2-3x that of just playing music natively (even without streaming – the tracks are encrypted so you can’t just lift them from the storage, which adds CPU cycles to decrypt, plus some products use codecs more computationally intensive than mp3). Best have a device with a fast CPU.
  6. An extended unplanned network or music provider outage will mean no access to music.

How this applies to other aspects of our lives

I wonder now what other conveniences have affected our lives significantly?

And are we all looking for that quick fix, the easy way out?

Are we heading towards the world depicted in the movie Idiocracy? (very interesting flick – it’s worth watching for the premise alone).

Already, most of us in the more “civilized” parts of the globe don’t know how to hunt down and skin an animal, build a weapon, start a fire, build a shelter. That is knowledge that convenience robbed us of many years ago. You can study how to do those things, but chances are, if you’re in need to do so, you won’t have the training to be anywhere near as successful as our ancestors were in those endeavors.

Same goes for taking pictures – aside from a few people that still develop and print their own film, most of us use digital (with the same deluge of information problem described in the music section above – thousands of pictures may now be taken during a vacation, where previously no more than a hundred would, with tremendous love and care – but most of the hundred were keepers).

Many of us are getting heavier, too – convenient access to food and low levels of physical activity (since locomotion is so convenient) being the killer combination.

Does quality suffer because of convenience?

Conveniences aren’t a bad thing overall – I am not hankering for the destruction of all things convenient. However, I posit that certain aspects of quality absolutely suffer because of convenience:

  1. Consumers are more likely to pick an easier to use, throw-away and even short-sighted product over a better-engineered, longer-lasting one – shifting the engineering emphasis instead to ease-of-use and disposability.
  2. The quality of workers in many fields isn’t what it used to be.
  3. We are heading towards more generalists and less specialists.
  4. Troubleshooting is becoming a lost art.

I’m not sure how to even conclude – I’m probably part of the problem since one of the things I do is help make very advanced technology easier to consume and more forgiving.

Just don’t get too comfortable.

Toilet chair



How to decipher EMC’s new VNX pre-announcement and look behind the marketing.

It was with interest that I watched some of EMC’s announcements during EMC World. Partly due to competitor awareness, and partly due to being an irrepressible nerd, hoping for something really cool.

BTW: Thanks to Mark Kulacz for assisting with the proof points. Mark, as much as it pains me to admit so, is quite possibly an even bigger nerd than I am.

So… EMC did deliver something. A demo of the possible successor to VNX (VNX2?), unavailable as of this writing (indeed, a lot of fuss was made about it being lab only etc).

One of the things they showed was increased performance vs their current top-of-the-line VNX7500.

The aim of this article is to prove that the increases are not proportionally as much as EMC claims they are, and/or they’re not so much because of software, and, moreover, that some planned obsolescence might be coming the way of the VNX for no good reason. Aside from making EMC more money, that is.

A lot of hoopla was made about software being the key driver behind all the performance increases, and how they are now able to use all CPU cores, whereas in the past they couldn’t. Software this, software that. It was the theme of the party.

OK – I’ll buy that. Multi-core enhancements are a common thing in IT-land. Parallelization is key.

So, they showed this interesting chart (hopefully they won’t mind me posting this – it was snagged from their public video):

MCX core util arrow

I added the arrows for clarification.

Notice that the chart above left shows the current VNX using, according to EMCmaybe a total of 2.5 out of the 6 cores if you stack everything up (for instance, Core 0 is maxed out, Core 1 is 50% busy, Cores 2-4 do little, Core 5 does almost nothing). This is important and we’ll come back to it. But, currently, if true, this shows extremely poor multi-core utilization. Seems like there is a dedication of processes to cores – Core 0 does RAID only, for example. Maybe a way to lower context switches?

Then they mentioned how the new box has 16 cores per controller (the current VNX7500 has 6 cores per controller).

OK, great so far.

Then they mentioned how, By The Holy Power Of Software,  they can now utilize all cores on the upcoming 16-core box equally (chart above, right).

Then, comes the interesting part. They did an IOmeter test for the new box only.

They mentioned how the current VNX 7500 would max out at 170,000 8K random reads from SSD (this in itself a nice nugget when dealing with EMC reps claiming insane VNX7500 IOPS). And that the current model’s relative lack of performance is due to the fact its software can’t take advantage of all the cores.

Then they showed the experimental box doing over 5x that I/O. Which is impressive, indeed, even though that’s hardly a realistic way to prove performance, but I accept the fact they were trying to show how much more read-only speed they could get out of extra cores, plus it’s a cooler marketing number.

Writes are a whole separate wrinkle for arrays, of course. Then there are all the other ways VNX performance goes down dramatically.

However, all this leaves us with a few big questions:

  1. If this is really all about just optimized software for the VNX, will it also be available for the VNX7500?
  2. Why not show the new software on the VNX7500 as well? After all, it would probably increase performance by over 2x, since it would now be able to use all the cores equally. Of course, that would not make for good marketing. But if with just a software upgrade a VNX7500 could go 2x faster, wouldn’t that decisively prove EMC’s “software is king” story? Why pass up the opportunity to show this?
  3. So, if, with the new software the VNX7500 could do, say, 400,000 read IOPS in that same test, the difference between new and old isn’t as dramatic as EMC claims… right? :)
  4. But, if core utilization on the VNX7500 is not as bad as EMC claims in the chart (why even bother with the extra 2 cores on a VNX7500 vs a VNX5700 if that were the case), then the new speed improvements are mostly due to just a lot of extra hardware. Which, again, goes against the “software” theme!
  5. Why do EMC customers also need XtremeIO if the new VNX is that fast? What about VMAX? :)

Point #4 above is important. For instance, EMC has been touting multi-core enhancements for years now. The current VNX FLARE release has 50% better core efficiency than the one before, supposedly. And, before that, in 2008, multi-core was advertised as getting 2x the performance vs the software before that. However, the chart above shows extremely poor core efficiency. So which is it? 

Or is it maybe that the box demonstrated is getting most of its speed increase not so much by the magic of better software, but mostly by vastly faster hardware – the fastest Intel CPUs (more clockspeed, not just more cores, plus more efficient instruction processing), latest chipset, faster memory, faster SSDs, faster buses, etc etc. A potential 3-5x faster box by hardware alone.

It doesn’t quite add up as being a software “win” here.

However – I (or at least current VNX customers) probably care more about #1. Since it’s all about the software after all:)

If the new software helps so much, will they make it available for the existing VNX? Seems like any of the current boxes would benefit since many of their cores are doing nothing according to EMC. A free performance upgrade!

However… If they don’t make it available, then the only rational explanation is that they want to force people into the new hardware – yet another forklift upgrade (CX->VNX->”new box”).

Or maybe that there’s some very specific hardware that makes the new performance levels possible. Which, as mentioned before, kinda destroys the “software magic” story.

If it’s all about “Software Defined Storage”, why is the software so locked to the hardware?

All I know is that I have an ancient NetApp FAS3070 in the lab. The box was released ages ago (2006 vintage), and yet it’s running the most current GA ONTAP code. That’s going back 3-4 generations of boxes, and it launched with software that was very, very different to what’s available today. Sometimes I think we spoil our customers.

Can a CX3-80 (the beefiest of the CX3 line, similar vintage to the NetApp FAS3070) take the latest code shown at EMC World? Can it even take the code currently GA for VNX? Can it even take the code available for CX4? Can a CX4-960 (again, the beefiest CX4 model) take the latest code for the shipping VNX? I could keep going. But all this paints a rather depressing picture of being able to stretch EMC hardware investments.

But dealing with hardware obsolescence is a very cool story for another day.



Technorati Tags: , , , , , , ,

So now it is OK to sell systems using “Raw IOPS”???

As the self-proclaimed storage vigilante, I will keep bringing these idiocies up as I come across them.

So, the latest “thing” now is selling systems using “Raw IOPS” numbers.

Simply put, some vendors are selling based on the aggregate IOPS the system will do based on per-disk statistics and nothing else

They are not providing realistic performance estimates for the proposed workload, with the appropriate RAID type and I/O sizes and hot vs cold data and what the storage controller overhead will be to do everything. That’s probably too much work. 

For example, if one assumes 200x IOPS per disk, and 200 such disks are in the system, this vendor is showing 40,000 “Raw IOPS”.

This is about as useful as shoes on a snake. Probably less.

The reality is that this is the ultimate “it depends” scenario, since the achievable IOPS depend on far more than how many random 4K IOPS a single disk can sustain (just doing RAID6 could result in having to divide the raw IOPS by 6 where random writes are concerned – and that’s just one thing that affects performance, there are tons more!)

Please refer to prior articles on the subject such as the IOPS/latency primer here and undersizing here. And some RAID goodness here.

If you’re a customer reading this, you have the ultimate power to keep vendors honest. Use it!


Technorati Tags: ,

An explanation of IOPS and latency

<I understand this extremely long post is redundant for seasoned storage performance pros – however, these subjects come up so frequently, that I felt compelled to write something. Plus, even the seasoned pros don’t seem to get it sometimes… :) >

IOPS: Possibly the most common measure of storage system performance.

IOPS means Input/Output (operations) Per Second. Seems straightforward. A measure of work vs time (not the same as MB/s, which is actually easier to understand – simply, MegaBytes per Second).

How many of you have seen storage vendors extolling the virtues of their storage by using large IOPS numbers to illustrate a performance advantage?

How many of you decide on storage purchases and base your decisions on those numbers?

However: how many times has a vendor actually specified what they mean when they utter “IOPS”? :)

For the impatient, I’ll say this: IOPS numbers by themselves are meaningless and should be treated as such. Without additional metrics such as latency, read vs write % and I/O size (to name a few), an IOPS number is useless.

And now, let’s elaborate… (and, as a refresher regarding the perils of ignoring such things wnen it comes to sizing, you can always go back here).


One hundred billion IOPS…


I’ve competed with various vendors that promise customers high IOPS numbers. On a small system with under 100 standard 15K RPM spinning disks, a certain three-letter vendor was claiming half a million IOPS. Another, a million. Of course, my customer was impressed, since that was far, far higher than the number I was providing. But what’s reality?

Here, I’ll do one right now: The old NetApp FAS2020 (the older smallest box NetApp had to offer) can do a million IOPS. Maybe even two million.

Go ahead, prove otherwise.

It’s impossible, since there is no standard way to measure IOPS, and the official definition of IOPS (operations per second) does not specify certain extremely important parameters. By doing any sort of I/O test on the box, you are automatically imposing your benchmark’s definition of IOPS for that specific test.


What’s an operation? What kind of operations are there?

It can get complicated.

An I/O operation is simply some kind of work the disk subsystem has to do at the request of a host and/or some internal process. Typically a read or a write, with sub-categories (for instance read, re-read, write, re-write, random, sequential) and a size.

Depending on the operation, its size could range anywhere from bytes to kilobytes to several megabytes.

Now consider the following most assuredly non-comprehensive list of operation types:

  1. A random 4KB read
  2. A random 4KB read followed by more 4KB reads of blocks in logical adjacency to the first
  3. A 512-byte metadata lookup and subsequent update
  4. A 256KB read followed by more 256KB reads of blocks in logical sequence to the first
  5. A 64MB read
  6. A series of random 8KB writes followed by 256KB sequential reads of the same data that was just written
  7. Random 8KB overwrites
  8. Random 32KB reads and writes
  9. Combinations of the above in a single thread
  10. Combinations of the above in multiple threads
…this could go on.

As you can see, there’s a large variety of I/O types, and true multi-host I/O is almost never of a single type. Virtualization further mixes up the I/O patterns, too.

Now here comes the biggest point (if you can remember one thing from this post, this should be it):

No storage system can do the same maximum number of IOPS irrespective of I/O type, latency and size.

Let’s re-iterate:

It is impossible for a storage system to sustain the same peak IOPS number when presented with different I/O types and latency requirements.


Another way to see the limitation…

A gross oversimplification that might help prove the point that the type and size of operation you do matters when it comes to IOPS. Meaning that a system that can do a million 512-byte IOPS can’t necessarily do a million 256K IOPS.

Imagine a bucket, or a shotshell, or whatever container you wish.

Imagine in this container you have either:

  1. A few large balls or…
  2. Many tiny balls
The bucket ultimately contains about the same volume of stuff either way, and it is the major limiting factor. Clearly, you can’t completely fill that same container with the same number of large balls as you can with small balls.
IOPS containers













They kinda look like shotshells, don’t they?

Now imagine the little spheres being forcibly evacuated rapildy out of one end… which takes us to…


Latency matters

So, we’ve established that not all IOPS are the same – but what is of far more significance is latency as it relates to the IOPS.

If you want to read no further – never accept an IOPS number that doesn’t come with latency figures, in addition to the I/O sizes and read/write percentages.

Simply speaking, latency is a measure of how long it takes for a single I/O request to happen from the application’s viewpoint.

In general, when it comes to data storage, high latency is just about the least desirable trait, right up there with poor reliability.

Databases especially are very sensitive with respect to latency – DBs make several kinds of requests that need to be acknowledged quickly (ideally in under 10ms, and writes especially in well under 5ms). In particular, the redo log writes need to be acknowledged almost instantaneously for a heavy-write DB – under 1ms is preferable.

High sustained latency in a mission-critical app can have a nasty compounding effect – if a DB can’t write to its redo log fast enough for a single write, everything stalls until that write can complete, then moves on. However, if it constantly can’t write to its redo log fast enough, the user experience will be unacceptable as requests get piled up – the DB may be a back-end to a very busy web front-end for doing Internet sales, for example. A delay in the DB will make the web front-end also delay, and the company could well lose thousands of customers and millions of dollars while the delay is happening. Some companies could also face penalties if they cannot meet certain SLAs.

On the other hand, applications doing sequential, throughput-driven I/O (like backup or archival) are nowhere near as sensitive to latency (and typically don’t need high IOPS anyway, but rather need high MB/s).

It follows that not all I/O sizes and I/O operations are subject to the same latency requirements.

Here’s an example from an Oracle DB – a system doing about 15,000 IOPS at 25ms latency. Doing more IOPS would be nice but the DB needs the latency to go a lot lower in order to see significantly improved performance – notice the increased IO waits and latency, and that the top event causing the system to wait is I/O:

AWR example Now compare to this system (different format this data but you’ll get the point):

Notice that, in this case, the system is waiting primarily for CPU, not storage.

A significant amount of I/O wait is a good way to determine if storage is an issue (there can be other latencies outside the storage of course – CPU and network are a couple of usual suspects). Even with good latencies, if you see a lot of I/O waits it means that the application would like faster speeds from the storage system.

But this post is not meant to be a DB sizing class. Here’s the important bit that I think is confusing a lot of people and is allowing vendors to get away with unrealistic performance numbers:

It is possible (but not desirable) to have high IOPS and high latency simultaneously.

How? Here’s a, once again, oversimplified example:

Imagine 2 different cars, both with a top speed of 150mph.

  • Car #1 takes 50 seconds to reach 150mph
  • Car #2 takes 200 seconds to reach 150mph

The maximum speed of the two cars is identical.

Does anyone have any doubt as to which car is actually faster? Car #1 indeed feels about 4 times faster than Car #2, even though they both hit the exact same top speed in the end.

Let’s take it an important step further, keeping the car analogy since it’s very relatable to most people (but mostly because I like cars):

  • Car #1 has a maximum speed of 120mph and takes 30 seconds to hit 120mph
  • Car #2 has a maximum speed of 180mph, takes 50 seconds to hit 120mph, and takes 200 seconds to hit 180mph

In this example, Car #2 actually has a much higher top speed than Car #1. Many people, looking at just the top speed, might conclude it’s the faster car.

However, Car #1 reaches its top speed (120mph) far faster than Car # 2 reaches that same top speed of Car #1 (120mph).

Car #2 continues to accelerate (and, eventually, overtakes Car #1), but takes an inordinately long amount of time to hit its top speed of 180mph.

Again – which car do you think would feel faster to its driver?

You know – the feeling of pushing the gas pedal and the car immediately responding with extra speed that can be felt? Without a large delay in that happening?

Which car would get more real-world chances of reaching high speeds in a timely fashion? For instance, overtaking someone quickly and safely?

Which is why car-specific workload benchmarks like the quarter mile were devised: How many seconds does it take to traverse a quarter mile (the workload), and what is the speed once the quarter mile has been reached?

(I fully expect fellow geeks to break out the slide rules and try to prove the numbers wrong, probably factoring in gearing, wind and rolling resistance – it’s just an example to illustrate the difference between throughput and latency, I had no specific cars in mind… really).


And, finally, some more storage-related examples…

Some vendor claims… and the fine print explaining the more plausible scenario beneath each claim:

“Mr. Customer, our box can do a million IOPS!”

512-byte ones, sequentially out of cache.

“Mr. Customer, our box can do a quarter million random 4K IOPS – and not from cache!”

at 50ms latency.

“Mr. Customer, our box can do a quarter million 8K IOPS, not from cache, at 20ms latency!”

but only if you have 1000 threads going in parallel.

“Mr. Customer, our box can do a hundred thousand 4K IOPS, at under 20ms latency!”

but only if you have a single host hitting the storage so the array doesn’t get confused by different I/O from other hosts.

Notice how none of these claims are talking about writes or working set sizes… or the configuration required to support the claim.


What to look for when someone is making a grandiose IOPS claim

Audited validation and a specific workload to be measured against (that includes latency as a metric) both help. I’ll pick on HDS since they habitually show crazy numbers in marketing literature.

For example, from their website:



It’s pretty much the textbook case of unqualified IOPS claims. No information as to the I/O size, reads vs writes, sequential or random, what type of medium the IOPS are coming from, or, of course, the latency…

However, that very same box almost makes 270,000 SPC-1 IOPS with good latency in the audited SPC-1 benchmark:


Last I checked, 270,000 was almost 15 times less than 4,000,000. Don’t get me wrong, 260,000 low-latency IOPS is a great SPC-1 result, but it’s not 4 million SPC-1 IOPS.

Check my previous article on SPC-1 and how to read the results here. And if a vendor is not posting results for a platform – ask why.


Where are the IOPS coming from?

So, when you hear those big numbers, where are they really coming from? Are they just ficticious? Not necessarily. So far, here are just a few of the ways I’ve seen vendors claim IOPS prowess:

  1. What the controller will theoretically do given unlimited back-end resources.
  2. What the controller will do purely from cache.
  3. What a controller that can compress data will do with all zero data.
  4. What the controller will do assuming the data is at the FC port buffers (“huh?” is the right reaction, only one three-letter vendor ever did this so at least it’s not a widespread practice).
  5. What the controller will do given the configuration actually being proposed driving a very specific application workload with a specified latency threshold and real data.
The figures provided by the approaches above are all real, in the context of how the test was done by each vendor and how they define “IOPS”. However, of the (non-exhaustive) options above, which one do you think is the more realistic when it comes to dealing with real application data?


What if someone proves to you a big IOPS number at a PoC or demo?

Proof-of-Concept engagements or demos are great ways to prove performance claims.

But, as with everything, garbage in – garbage out.

If someone shows you IOmeter doing crazy IOPS, use the information in this post to help you at least find out what the exact configuration of the benchmark is. What’s the block size, is it random, sequential, a mix, how many hosts are doing I/O, etc. Is the config being short-stroked? Is it coming all out of cache?

Typically, things like IOmeter can be a good demo but that doesn’t mean the combined I/O of all your applications’ performance follows the same parameters, nor does it mean the few servers hitting the storage at the demo are representative of your server farm with 100x the number of servers. Testing with as close to your application workload as possible is preferred. Don’t assume you can extrapolate – systems don’t always scale linearly.


Factors affecting storage system performance

In real life, you typically won’t have a single host pumping I/O into a storage array. More likely, you will have many hosts doing I/O in parallel. Here are just some of the factors that can affect storage system performance in a major way:


  1. Controller, CPU, memory, interlink counts, speeds and types.
  2. A lot of random writes. This is the big one, since, depending on RAID level, the back-end I/O overhead could be anywhere from 2 I/Os (RAID 10) to 6 I/Os (RAID6) per write, unless some advanced form of write management is employed.
  3. Uniform latency requirements – certain systems will exhibit latency spikes from time to time, even if they’re SSD-based (sometimes especially if they’re SSD-based).
  4. A lot of writes to the same logical disk area. This, even with autotiering systems or giant caches, still results in tremendous load on a rather limited set of disks (whether they be spinning or SSD).
  5. The storage type used and the amount – different types of media have very different performance characteristics, even within the same family (the performance between SSDs can vary wildly, for example).
  6. CDP tools for local protection – sometimes this can result in 3x the I/O to the back-end for the writes.
  7. Copy on First Write snapshot algorithms with heavy write workloads.
  8. Misalignment.
  9. Heavy use of space efficiency techniques such as compression and deduplication.
  10. Heavy reliance on autotiering (resulting in the use of too few disks and/or too many slow disks in an attempt to save costs).
  11. Insufficient cache with respect to the working set coupled with inefficient cache algorithms, too-large cache block size and poor utilization.
  12. Shallow port queue depths.
  13. Inability to properly deal with different kinds of I/O from more than a few hosts.
  14. Inability to recognize per-stream patterns (for example, multiple parallel table scans in a Database).
  15. Inability to intelligently prefetch data.


What you can do to get a solution that will work…

You should work with your storage vendor to figure out, at a minimum, the items in the following list, and, after you’ve done so, go through the sizing with them and see the sizing tools being used in front of you. (You can also refer to this guide).

  1. Applications being used and size of each (and, ideally, performance logs from each app)
  2. Number of servers
  3. Desired backup and replication methods
  4. Random read and write I/O size per app
  5. Sequential read and write I/O size per app
  6. The percentages of read vs write for each app and each I/O type
  7. The working set (amount of data “touched”) per app
  8. Whether features such as thin provisioning, pools, CDP, autotiering, compression, dedupe, snapshots and replication will be utilized, and what overhead they add to the performance
  9. The RAID type (R10 has an impact of 2 I/Os per random write, R5 4 I/Os, R6 6 I/Os – is that being factored?)
  10. The impact of all those things to the overall headroom and performance of the array.

If your vendor is unwilling or unable to do this type of work, or, especially, if they tell you it doesn’t matter and that their box will deliver umpteen billion IOPS – well, at least now you know better :)


Technorati Tags: , , , , , , , , , , , ,