Note: I edited this a bit to remove some confusing pieces of info.
Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers – remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…
This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than are necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (the first case I witnessed was an Exchange sizing where they were proposing a single shelf for a workload that needed several times the number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.
And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:
There is no magic.
This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.
Ergo, this is about to get spindle-bound…
And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers 🙂 )
This is a Windows-only environment. So, the customer sent perfmon data for their servers over for me to analyze and recommend a box.
They’ll be running Exchange plus some databases.
From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.
For instance – there is peak performance, average, and what we called “steady-state”.
In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.
The trick is to figure out how long the spikes last for, and see if the caches would be able to accommodate them. If a spike is lasting for 30 min it’s not a spike any more, but rather a real workload you need to accommodate.
If the spikes are in the range of seconds, then cache is usually enough. Depends on the magnitude of the spike, the length of the spike and the size of the cache 🙂
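To make the "is it a spike or a real workload?" question concrete, here is a minimal sketch that measures how long the IOPS stay above a threshold. The sample values, the 15-second interval and the 60-second cutoff are all made-up placeholders, not numbers from this deal; the right cutoff depends on the magnitude of the spike and the size of the cache, as noted above.

```python
def longest_run_above(samples, threshold):
    """Return the longest count of consecutive samples above threshold."""
    longest = current = 0
    for value in samples:
        if value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def classify_burst(samples, threshold, interval_s, max_spike_s=60):
    """Label the longest burst as a 'spike' or a 'sustained workload'.

    max_spike_s is an assumed placeholder, not a rule from any vendor.
    """
    duration_s = longest_run_above(samples, threshold) * interval_s
    return "sustained workload" if duration_s > max_spike_s else "spike"


# Hypothetical IOPS samples, one every 15 seconds.
short_burst = [300, 320, 4000, 4200, 310, 300]  # 30 seconds above threshold
long_burst = [300] + [4000] * 120 + [300]       # 30 minutes above threshold

print(classify_burst(short_burst, threshold=1000, interval_s=15))
print(classify_burst(long_burst, threshold=1000, interval_s=15))
```

The first burst lasts two samples (30 seconds) and the second lasts 120 samples (30 minutes), so only the second one is treated as a workload that the disks themselves have to carry.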
Then, you have your average performance. That just takes a straight math average across all performance points – so, for instance, if you have, at night, very long periods of inactivity, they will affect the average dramatically. Short-lived spike data points won’t affect it as much since there are so few of them. So the average typically gets skewed towards the low end.
Then there’s the concept of “steady state”.
This effectively tries to get a more meaningful average of steady-state performance during normal working periods. Easy to eyeball actually if you’re looking at the IOPS graphs instead of letting Excel do its averaging for you.
A picture will make things clearer:
In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!
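The gap between the plain average and the steady state can be reproduced with a few lines of code. This is only a sketch: the samples are invented to resemble the shape described above (quiet period, one brief spike, sustained activity), the 100 IOPS "activity" cutoff is an assumption, and taking the median of the active samples is just one rough way of automating the eyeballing, not a standard method.

```python
from statistics import mean, median

# Invented samples: 12 quiet readings, 1 brief spike, 7 busy readings.
samples = [20] * 12 + [1460] + [500] * 7


def average_iops(values):
    """Straight arithmetic mean across every sample, idle periods included."""
    return mean(values)


def steady_state_iops(values, active_above=100):
    """Median of the samples taken while the system was actually busy.

    active_above is an assumed cutoff separating idle from working periods.
    """
    active = [v for v in values if v > active_above]
    return median(active)


print(average_iops(samples))       # dragged down by the idle periods
print(steady_state_iops(samples))  # closer to what the chart shows
```

With these invented numbers the straight average comes out at 260 while the steady-state estimate is 500, the same kind of two-to-one gap as in the chart.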
So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.
So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.
I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).
Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…
What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.
So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.
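A quick sanity check along these lines can flag when the measured IOPS are probably capped by the existing hardware. This is a rough sketch: the 20ms latency limit comes from the rule of thumb above, while the queue limit of 2 is my own stand-in for "very low single digits", and the function name is made up for illustration.

```python
def looks_hardware_constrained(avg_queue_length, avg_latency_ms,
                               max_queue=2.0, max_latency_ms=20.0):
    """Return True if the current storage is likely limiting the workload.

    max_queue is an assumed stand-in for "very low single digits";
    max_latency_ms follows the 20ms rule of thumb for DB/email work.
    """
    return avg_queue_length > max_queue or avg_latency_ms > max_latency_ms


# Hypothetical averages for two servers.
print(looks_hardware_constrained(avg_queue_length=1.0, avg_latency_ms=8.0))
print(looks_hardware_constrained(avg_queue_length=6.5, avg_latency_ms=35.0))
```

If the check comes back True, treat the observed IOPS as a floor rather than as the real requirement.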
Finally, once you figure out the final, desired steady-state IOPS requirements, you need to translate them into your specific system, since there’s cache helping, but always some overhead to be considered. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID. That increases the IOPS needed by nature. Again, this is the same for all array vendors – the only time there’s no I/O penalty, is if you’re doing RAID0 (= no protection).
You see, RAID5 for instance, in order to perform writes, has to do some reads as well, to calculate and write the parity. All very normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant, and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but has to write 2 of everything, so that needs to be considered as well.
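The RAID adjustment can be sketched with the textbook write penalties: 4 disk I/Os per small random write for RAID5 (read data, read parity, write data, write parity), 2 for RAID10, and 1 for RAID0. This is a simplification that assumes nothing is absorbed by cache and ignores full-stripe writes; real arrays will differ, and the 1,000 IOPS and 25% write mix below are made-up inputs.

```python
# Textbook small-random-write penalties; actual arrays may do better
# thanks to caching or full-stripe writes.
WRITE_PENALTY = {"raid0": 1, "raid10": 2, "raid5": 4}


def backend_iops(frontend_iops, write_fraction, raid_level):
    """Translate host-visible IOPS into disk-level IOPS for a RAID level."""
    reads = frontend_iops * (1 - write_fraction)
    writes = frontend_iops * write_fraction
    return reads + writes * WRITE_PENALTY[raid_level]


# Hypothetical workload: 1,000 host IOPS with 25% writes.
for level in ("raid0", "raid10", "raid5"):
    print(level, backend_iops(1000, 0.25, level))
```

For this invented mix the disks see 1,000 IOPS on RAID0, 1,250 on RAID10 and 1,750 on RAID5, which is why the read/write percentage matters so much.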
You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.
The last thing that needs to be taken into account is the working set. Basically, it means this:
Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.
Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.
Going back to the reason I wrote this post in the first place:
In this specific, small environment, the non-RAID steady-state percentile IOPS required were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of the small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disk. The spikes were fairly high, in excess of 10x the steady-state.
Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.
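The arithmetic above is simple enough to put in a few lines. This sketch uses only the numbers from this post (3,000 IOPS, 220 IOPS per 15K drive, a 1.5x RAID5 overhead for this I/O mix); the function name and parameters are made up for illustration, and it assumes cache contributes nothing, which is roughly the situation described here.

```python
import math


def drives_needed(frontend_iops, raid_overhead, iops_per_drive, spares=0):
    """Minimum drive count to serve a steady-state load, rounding up."""
    disk_iops = frontend_iops * raid_overhead
    return math.ceil(disk_iops / iops_per_drive) + spares


print(drives_needed(3000, 1.0, 220))            # pre-RAID
print(drives_needed(3000, 1.5, 220))            # with the RAID5 overhead
print(drives_needed(3000, 1.5, 220, spares=2))  # plus two spares
```

This gives 14 drives pre-RAID, 21 with the RAID overhead, and 23 with two spares, matching the figures above.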
And, finally, the offense in question:
Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out, and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1, and the reads as RAID5. I’m not the expert, I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…
Based on the above analysis, this somehow doesn’t compute with 11 drives, half what my calculations indicate… so, my final question is:
How do Compellent engineers size for performance?