More tales from the field: Sizing best practices – does Compellent follow them?


Note: I edited this a bit to remove some confusing pieces of info.

Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers – remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…

This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (the first case I witnessed was an Exchange sizing where they proposed a single shelf for a workload that needed several times that number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.

And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:

There is no magic.

This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.

Ergo, this is about to get spindle-bound…

And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers 🙂 )

Some preliminaries:

This is a Windows-only environment. So, the customer sent perfmon data for their servers over for me to analyze and recommend a box.

They’ll be running Exchange plus some databases.

From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.

For instance – there is peak performance, average, and what we called “steady-state”.

In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.

The trick is to figure out how long the spikes last, and see whether the caches can accommodate them. If a spike lasts for 30 minutes it’s not a spike any more, but rather a real workload you need to accommodate.

If the spikes are in the range of seconds, then cache is usually enough. It depends on the magnitude of the spike, its duration, and the size of the cache 🙂

Then, you have your average performance. That is just a straight arithmetic average across all performance data points – so, for instance, very long periods of inactivity at night will drag the average down dramatically, while the short-lived spike data points won’t affect it much since there are so few of them. The average typically gets skewed towards the low end.

Then there’s the concept of “steady state”.

This effectively tries to get a more meaningful average of performance during normal working periods. It’s actually easy to eyeball if you look at the IOPS graphs instead of letting Excel do the averaging for you.

A picture will make things clearer:

[Chart: IOPS per sample over time – long quiet periods, a brief spike, then sustained activity at around 500 IOPS]

In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!

So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.

So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.

I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).

Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…

What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.

So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.

Finally, once you figure out the final, desired steady-state IOPS requirement, you need to translate it to your specific system: cache helps, but there is always some overhead to consider. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID, which inherently increase the back-end IOPS needed. Again, this is the same for all array vendors – the only time there’s no I/O penalty is if you’re doing RAID0 (= no protection).

You see, RAID5, in order to perform small writes, has to do some reads as well, to calculate and write the parity. All perfectly normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but it has to write two copies of everything, so that needs to be considered as well.

You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.

The last thing that needs to be taken into account is the working set. Basically, it means this:

Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.

Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.

Going back to the reason I wrote this post in the first place:

In this specific, small environment, the steady-state IOPS required (before any RAID adjustment) were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of these small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disks. The spikes were fairly high, in excess of 10x the steady state.

Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.

And, finally, the offense in question:

Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1 and the reads from RAID5. I’m not the expert – I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…

Based on the above analysis, this somehow doesn’t compute with 11 drives – roughly half of what my calculations indicate… so, my final question is:

How do Compellent engineers size for performance?

D

36 Replies to “More tales from the field: Sizing best practices – does Compellent follow them?”

  1. I would ask what documents were submitted to Compellent for sizing? 12 SAS disks for exchange and databases doesn’t make sense…if they have any number of users on them…

    When we did our performance sizing for our 18-host VMware environment, there was a document Compellent sent with how they wanted measurements done, down to the exact esxtop command settings.

    In our case, we were very impressed with the performance monitoring done before sizing. We had a good idea on our own but still…good to see the final numbers.

  2. I will say that whoever the engineer/architect is that sized the system either had wrong information from the end user, didn’t process the information correctly or just figured that a 15K drive could do almost double what is generally used for sizing standards within Compellent.

    When I size a system, I go for worst-case performance… a 15K drive gives me 150 IOPS, a 10K drive gives me 100 IOPS and a 7.2K gives me 75 IOPS, for easy numbers. Using those numbers, I would have sized this system with 20 active 15K drives and, just based on how licensing is done, I probably would have filled out two 12-bay enclosures, providing 22 active drives plus 1 spare per enclosure. If that didn’t hit the space requirements, that’s where 7.2K drives come into play, and the system features bring it all together.

    In short, don’t jump on all of the architects, some actually follow the rules.

  3. I’m not jumping on all the architects, I did mention that I don’t believe all Compellent deals are like this. The company would have gone out of business had that been the case.

    I was given the same perfmon data they had… I know this is weird, and I’m sure it’s a localized thing since I’ve talked to other Compellent engineers and they seem extremely knowledgeable (AG you seem to be one of them and your numbers echo mine).

    I normally ignore such foolishness but, since this was a repeat offense, I thought I’d report it. The Exchange offense was even worse – it was literally sized by capacity (TB) and not by number of users at all, and it was off by several times the number of drives needed!

    Every company has reps that are struggling to make it and/or don’t care, and such people do nothing to improve the image of said company – and need to be removed in my opinion. They typically either do things on their own or force their engineers to make bad decisions (I’ve seen it – the engineer recommending something else but the rep not wanting to do it because it would increase the cost).

    D

  4. Enrico,

    I read your reply, and maybe something is getting lost in the translation. The steady-state IOPS (whatever the “percentile”) were clear, just like in the example graph. The desired IOPS were about 3,000, not 2,400.

    Regarding Fast Track:

    The 20% of the outer part of the disk is, indeed, faster. Consider this scenario: What if you have a constant stream of 3,000 write IOPS that are writing all over the disk?

    Will Compellent “cache” the writes in the beginning 20%, then move them to the desired location?

    If this is the case, then the moves need IOPS themselves, and with 11 drives the system will never catch up. The 20% of 11 disks may not be enough, just like memory cache is not enough sometimes.

    This also doesn’t explain the case where Compellent was suggesting many times fewer drives for Exchange than necessary.

    You also have some errors regarding NetApp in your post, but in any case this wasn’t a post about NetApp vs Compellent at all, but rather how should we size for predictable performance.

    NetApp has features that perform write-combining and, especially for writes, may result in far fewer disks being needed, and we don’t even offer RAID10 (rather, if you’re super-paranoid, you can do RAID-DP-1, or mirrored RAID-DP 🙂 )

    But, at some point, you need to size for the worst case scenario, and I wanted to make sure that I can deliver 3,000 IOPS even WITHOUT the optimizations, since the optimizations are not guaranteed in all I/O patterns. And – remember – this is not accounting for the 30K IOPS spikes, I’m talking about the steady-state.

    For instance – what about serving 3,000 random reads from all over the disk, a working set that won’t fit in cache? How many disks will Compellent need then? Fast Track and Data Progression don’t do anything in that case.

    By your model, you use 177 IOPS per disk, so you would need 17 disks to satisfy 3K random 4K reads.

    And what about analyzing for block sizes? The data wasn’t all 4K, maybe the average is 12K. Do you size for that?

    My last question is – how do you RAID-adjust? As I mention in my post, there’s ALWAYS overhead, regardless of implementation. Some vendors need more disks than others, but everyone needs more disks than RAID0. Is the 177 IOPS already RAID-adjusted? RAID10 and RAID5 have different formulas for figuring out the back-end IOPS, and the formulas depend on the writes-vs-reads mix…

    D

  5. Enrico – I’m not sure I buy your assertion that FastTrack gives you an IOPS “boost” in the way you explain it. Yes, access to hot blocks would theoretically get more throughput than if it were on the inside tracks, but the total IOPS that a 15k drive is capable of -already- takes into account the fact that certain parts of the disk are faster than others, no?

    So you aren’t magically capable of more IOPS per disk as you assert. You might get improved response times -for hot blocks- versus a system that doesn’t have FastTrack-type capability, but that’s about it.

    Am I missing something obvious?

  6. Dimitri – Just a quick answer for your questions:

    – Size for 99th percentile, not 95th.
    – 177 IOPS (average) for a 15K drive is already RAID 10 adjusted.
    – sample your perfmon every 3 seconds to get maximum granularity in the 99th percentile calculation.

    With that in mind, Compellent cache can “cushion” the spikes very effectively.

    BTW I agree with you on the design considerations: looking at the perfmon stats gives you an insight into what the performance is right NOW, and if the workload is storage-constrained you simply cannot design a new system based on those numbers. If the customer doesn’t know their requirements (very common where I live) I usually apply a +20% IOPS increase in the new design.

    Ciao,
    Fabio

  7. Thanks Fabio. I might go ahead and remove the percentile stuff since it’s confusing.

    Regardless of what you call it, my point is that you NEED to look at the data; we should not just apply a blanket percentile statement and assume it’s good. I’ve yet to see an automated analysis tool that can replace double-checking with the human eye. Ultimately, in our specific case, the final numbers are:

    3,000 IOPS
    70/30 R/W mix
    Won’t fit in cache
    12K avg blocksize

    177 IOPS is RAID10 adjusted, thanks for clarifying. What about RAID5? And is that for reads or writes? You see, 100% reads in RAID is not the same back-end IOPS as 100% writes…

  8. Dimitris,
    Just to clarify even more: based on the numbers, I would say that the 177 IOPS calculation fits your workload description completely (70/30 r/w with 8K to 16K block size); it’s a pretty common DB workload.

    With that in mind you should consider that the (recommended / best practice) Compellent approach is to always write in RAID 10 (thus almost eliminating the RAID penalty) and then progress the data to lower tier / raid level.

    Concerning the back-end IOPS: when designing a new system, you know that during writes in R10 the traffic on the back end doubles, so you need to keep that in mind when designing back ends. I’ve seen bad designs where a system made of seven 15K enclosures was connected using just 2 back-end loops per controller, and they were having performance issues – wonder why? 🙂

    Ciao,
    Fabio

  9. Sorry for delay, but I saw some comments to your doubts from Fabio.

    My thoughts were only about Compellent’s ability to offer fewer disks than other competitors because of features capable of more optimized access to the disks. As I wrote in my post, I don’t know enough details about the customer, the environment, prerequisites, etc., so I will continue only to show you the Compellent capabilities as best I can.

    Fast Track and Data Progression are tightly integrated and they achieve awesome results when coupled: I assumed an advantage of about 20% in performance as an example, but it can be somewhat more depending on the environment (I personally saw 7,000 sustained IOPS from an Oracle ERP server on 15*15K + 15*10K disks !!! ).

    BTW, I will try to clarify better my points:

    Compellent uses 450GB SAS disks (418GB really).
    In your case we will have 418*11 = 4,598GB of usable space and, of course, 919GB of Fast Track. Each block on Compellent has its own RAID level, so the real net space varies from 919/2 for RAID1 to 817GB for RAID5-9; this space is reserved and freed dynamically. I don’t think this customer has more than 800GB of active data in a system that can deliver only about 4TB!

    To recap in a very coarse way: each write is performed on Fast Track, then blocks are migrated to other RAID levels and/or portions of the disks in the background at low priority (this operation doesn’t impact front-end performance). All managed by system policies and volume profiles.

    Finally, it is hard but not impossible to achieve 3000 sustained IOPS on a well configured Compellent with 11 active disks, 😉

    ciao,
    Enrico

    PS posted this comment on my blog too: http://www.cinetica.it/2010/03/10/why-compellent-proposes-fewer-disks/

  10. Thanks Enrico! Unfortunately, your response further confused me.

    So, to understand: A 450GB disk is right-sized in Base2 to 418GB? As in: After converting to Base2, what’s the real usable space in a 450GB drive, BEFORE any RAID or any data get written to the disk?

    Or does the 418GB include all the parity?

    I mean – how do 11 drives, with a mix of RAID1 and RAID5, come out at 4,598GB usable?

    If you assume 3,000 constant write 12K OPS, how many RAID10 spindles are needed?

    If you assume 3,000 constant read 12K OPS, how many RAID5 spindles are needed?

    I’ll keep the questions to these…

    D

  11. Dimitris,
    Sorry, I tried for a short answer but I failed. 🙁

    Correct, 418GB is the raw usable space when converted to base2. This system has 11 active disks, so you have 11*418 = 4,598GB of usable space before RAID.
    Compellent RAID protection is quite different from other vendors: you don’t need to define RAID groups – you go straight to creating LUNs from the usable space!

    In the LUN creation wizard you associate a profile with the LUN: the profile defines the behavior of the LUN. There are standard (out-of-the-box) and custom (user-created) profiles. The profile defines all the RAID levels and tiers for the LUN. You can also modify profiles and LUN properties on the fly.
    i.e.: you may define a DB LUN to be positioned on RAID10 Fast, RAID10 standard, RAID5 standard, and its snapshots on RAID5 standard.
    So, you will write and access each new and hot block on “fast tracked” RAID10 while, on the other hand, old and less-used blocks are automatically migrated to RAID5 on less valuable tracks, saving space without giving up speed!!!
    To say whether the 11 disks can provide 3,000 IOPS I need to know more about the applications/servers and data involved but, I repeat, it is hard but not impossible.

    Now, back to the usable net space after RAID: we will have a variable net space (after RAID) ranging from half the raw space (4598/2 = 2299) to 4,092GB for RAID5 made with 9-disk stripes. The real space depends on how the profiles are organized and on data activity… but if we hypothesize something like 20% RAID10 and the rest RAID5, you will obtain about 450GB net of RAID1 and 3,270GB net of RAID5. The total net usable in this case is 450+3270 = 3,720GB.
    This can be changed with a couple of clicks; the system will immediately start working with the new profile, freeing or allocating space as needed! 🙂

    With this architecture you need to change drastically how you think about storage metrics, and it is very important to analyze the environment in depth.
    In a well-sized and well-configured Compellent system you will write and read heavily accessed blocks on the faster portions of the disks (or on SSDs) and you will not pay a write penalty due to RAID parity calculations, getting great performance and space savings.

    It’s not useful to answer how many spindles I use to write in RAID5 because I don’t need to write in RAID5 and pay a penalty. I will write in RAID10 and then the system moves the blocks to RAID5.

    ciao,
    Enrico

    PS posted this comment on my blog too: http://www.cinetica.it/2010/03/10/why-compellent-proposes-fewer-disks/

  12. Thanks Enrico, if you look at the last part of my question I ask this:

    If you assume 3,000 constant write 12K OPS, how many RAID10 spindles are needed? Please take into account moving them into RAID5. Assume a totally constant stream of 3K write IOPS.

    If you assume 3,000 constant read 12K OPS, how many RAID5 spindles are needed?

    So, I understand you’ll write to the RAID10 portion and read from the RAID5 portion, I just want to understand how many IOPS one can expect in that case. Notice the I/O size and that it will be totally random I/O.

    Then, to correct for this specific workload:

    2,100 IOPS 12K random reads
    900 IOPS 12K random writes

    Assume no cache helping.

    Thx

    D

  13. Dimitris,
    I’m not sure what you’re trying to prove. Compellent is using the same drives that everyone in the industry is using; if you’re sizing for an endless stream of 3,000 IOPS with that workload you need at least 15 active drives (450GB 15K) on Compellent. There’s no such thing as RAID10 spindles or RAID5 spindles – both RAID levels live on the same drives dynamically.
    And if you’re wondering whether you can have 3,000 reads from 15 drives using RAID5: yes, you can. The RAID penalty doesn’t apply during reads, so after the data has been progressed to RAID5 (the progression doesn’t happen continuously – it runs like a scheduled job) you can read as fast as the spindles allow (and for your exact workload the adjusted value is 196 IOPS per drive, I checked).

    I really cannot understand why you’re making assumptions about constant/fixed workloads (it’s not common to have a fixed, constant stream of 3,000 IOPS). As I stated previously, to make a clear and good analysis you need to have:

    – a granular performance capture.
    – use 99th percentile.
    – apply the desired increment of performance.

    And then consider all the other little aspects you mentioned like backend topology and so on.

    HTH,
    Fabio

  14. (Not a vendor)

    So first off, per Fabio and Enrico, it looks like the Compellent solution is undersized: “endless stream of 3000 IOPS with that workload you need at least 15 active drives (450GB 15K)”. But let’s dig a little.

    Dimitri stole my trademark comment. There is no magic. I’m not going to quote each comment, since the conversation is spread between two sites and I’m working off my notes.

    First off, IO requirements for the workload are 3000 IOPS spiking to up to 30000. Per the Compellent sizing guidelines, assume 177 IOPS per 15k disk. All writes are staged to FastTrack, and then at an interval, demoted to inner track RAID5. Comparison point here: NetApp coalesces the writes into Full Stripes at ingestion point, Compellent writes to mirrors, then demotes at interval. Initial performance hit is probably close to identical (cache -> non-RAID penalty write) with Compellent having to do a migration at a later time (at a lower priority).

    Question: Is the interval for the destage to RAID 5 scheduled or based on the %utilization of Fast Track?

    In both cases, ingestion rate could impact performance since a “full” fast track either needs to rapidly destage (higher priority) or write directly to RAID 5 (if ingestion is slow enough that you just need to destage off hours, probably not an issue). Probably not an issue in this case, but even low priority IOs move the spindles.

    >(15-20% performance boost due to Fast Track)

    At certain utilization percentages, sure, this is possible. To blanketly state that the benefit is always 15-30% regardless of anything else is a little too close to “magic.” The explanation seemed to miss a few things. First off, by leveraging the outer tracks for FastTrack, you’re guaranteeing in a heavy mixed workload that your %seek time is going to be higher – to minimize %seek time, the middle of the spindle would probably be better (required capacity vs total capacity plays here).

    Secondly, the main benefits of FastTrack+Data progression appear to be writes and frequent reads – Exchange is going to reduce the read benefit (random), and the performance benefits of the writes are matched by the full stripe writes of NetApp.

    So, in short:

    For a 3000 (steady) workload spiking to 30000, 12 spindles is likely substantially undersized. Even if you take the near-magic-configuration of 3000 IOPS from 11 spindles in a mixed workload + Exchange environment as the truth, drive rebuilds will kill it as will the spikes.

  15. Techmute,
    I think that you’re mixing up a little, I just responded to the last two questions that Dimitris posted and they were:

    “If you assume 3,000 constant write 12K OPS, how many RAID10 spindles are needed? Please take into account moving them into RAID5. Assume a totally constant stream of 3K write IOPS.”

    and

    “If you assume 3,000 constant read 12K OPS, how many RAID5 spindles are needed?”

    and with the workload that Dimitris posted:

    “2,100 IOPS 12K random reads
    900 IOPS 12K random writes

    Assume no cache helping.”

    Following the Compellent performance guidelines, you have 15 active 15k spindles.

    That’s it.

    I think that we simply CANNOT talk about a real-world configuration without having *at least* the performance data in hand and doing a real analysis on it.
    And just for the record, I’m not from Compellent, actually I work with both NetApp and Compellent as a Partner and I know how they behave and I like them both.

    Ciao,
    Fabio

  16. Hi Fabio,

    I’m now just trying to understand it all. I think we’re all in agreement that 11 active spindles wouldn’t really deal with the workload, which was the original point of the post.

    I notice you are not taking the 12K average I/O size into account (or are you)? We’re NOT talking 4K I/Os here.

    Anyway, what I’m struggling with is this:

    Effectively, Compellent seems to treat the RAID10 chunklets (to borrow a 3Par term) as a write staging area.

    Then, later on, the writes get migrated to the long-term retention area. Please correct me if I’m not right.

    I think what Techmute is saying is that, assuming a certain write workload, the flushing/destaging of the data to the RAID5 chunklets will occupy a certain amount of IOPS (has to be read from the RAID10 chunklets and written to the RAID5 chunklets).

    What is the overhead of that operation? (I know “it depends”, I’m just trying to better understand how the box works).

    So, if the workload is pretty consistent (aside from the crazy spikes which we will ignore), you know the system has to service, in this example:

    2100 12K read OPS, pretty consistent
    900 12K write OPS, pretty consistent

    So, and I understand we’re working without the actual perfmon here, but bear with me:

    The system has to deal with an average of 900 12K writes per second, that will go to the RAID10 chunklets. This, depending on how the system lays out the data, will translate to a certain number of real disk IOPS. How many, approx?

    Then, the system has to deal with an average of 2100 12K reads per second, presumably both from the RAID5 and the RAID10 chunklets (I understand that spans all available drives).

    Then, we have the flush/destage from the RAID10 chunklets to the RAID5 chunklets, when that’s happening, what is the impact to the system?

    Thx

    D

  17. Techmute, Dimitris,

    You don’t know the technology involved, and it is very difficult for me to speak about theory and compare it with a real-world case when we (Fabio and I) don’t know the customer’s environment!
    I invite you to share with us all the information about this particular case, at least:
    – a complete sampling (28,800 samples in 24h, one every 3 seconds, for all the servers involved)
    – a full picture of the SAN (servers, applications, data)
    – customer requests

    or stop confusing your readers!

    BTW, for that customer did you propose something like 23 disks???
    That is very far from 15, and we could keep discussing for years the undersized 12-HDD Compellent configuration or the oversized 23-HDD NetApp one!
    My first question would be: why are you proposing 23 HDDs when 16 (15 + 1 spare) are more than enough?

    Then, please, let me know whether you want to talk about the theoretical 3,000 IOPS or the real-world solution comprising a 12-disk system and its optimizations.

    Anyway, I suggest you spend some time on the Compellent site (http://www.compellent.com) looking at the documentation and videos to learn a little bit more about the architecture ( http://www.compellent.com/Products/Architecture.aspx ) of the product; you will probably find some interesting reading about how the system works, and it will widen your horizons.

    Well, back to the theory.
    Fabio wrote about the 3,000 IOPS with 15 15K disks for the I/O pattern you suggested, without any specific optimization, and I add that Compellent may do more with its data placement optimization features (Fast Track). He never spoke about a 12K block size because it’s not important – we obtain similar results with 4, 8 and 16K blocks.

    The main difference between Compellent and others is the fully virtualized concept of the LUN, thanks to per-block metadata: each LUN is dispersed across every disk of the system (SSD, FC, SAS, SATA) and across different RAID levels.
    There is no staging area (I apologize if my simplifications were pushed to the limit) and all data movements (RAID level and tier) are done when the system’s load allows it.
    All of this makes sense only if the system is properly configured.

    ciao,
    Enrico

    PS this comment was posted on my blog too: http://www.cinetica.it/2010/03/10/why-compellent-proposes-fewer-disks/

  18. Dimitris,

    I’m not sure you’re understanding correctly how Compellent works, let me explain how it works basic-style:

    First thing first, the concepts:

    Tiers are organized like that:

    1st Tier – Fastest disks
    2nd Tier – Medium disks
    3rd Tier – Slowest disks

    and they’re dynamically chosen based on the available drives in the storage; every tier is subdivided into different RAID levels and different tracks.

    To clear things up, let’s imagine that we have a system configured with:

    15 active drives FC 450GB 15K
    15 active drives SATA 1TB 7.2K

    With this kind of system we would have:

    Tier 1 – 15K drives
    – Raid 10 Fast Tracks
    – Raid 5-9 Fast Tracks
    – Raid 10 Standard Tracks
    – Raid 5-9 Standard Tracks

    Tier 3 – 7.2K drives
    – Raid 10 Fast Tracks
    – Raid 5-9 Fast Tracks
    – Raid 10 Standard Tracks
    – Raid 5-9 Standard Tracks

    I’m using an “old” example since right now you can also have Raid 6 in the mix but let’s leave that alone for now, also Raid 5-9 means that it’s a Raid stripe made of 8 Data blocks and 1 Parity block (You can also have Raid 5-5 if you want)

    So in this system my data can live on those 8 “tiers”, right now when I create a new Volume (LUN) I can choose where to put my active data and my snapshot data just selecting a “Storage Profile”, for example let’s use a best practice for that:

    The “Recommended (All Tiers)” default profile is the most used and it’s configured like that:

    Write data on Tier1:Raid 10 and Tier2:Raid 10
    Snapshot data on Tier1:Raid 5-9, Tier2:Raid 5-9, Tier3:Raid 5-9

    I usually create another custom storage Profile called “Archival Data (R5-9)” that’s configured like that:

    Write data on Tier3:Raid 5-9
    Snapshot Data on Tier3:Raid 5-9

    To accommodate the need for low-impact stuff.

    Considering that let’s see how the data flow is for those two profiles:

    —- Profile “Recommended (All Tiers)”

    – Data flows from the server HBAs to the Compellent front-end ports
    – the data is staged to write cache (512MB per controller) and replicated to the other controller.
    – The data is written to disk on Tier 1, Fast Tracks, in Raid 10.

    The data is now on stable storage.

    —- Profile “Archival Data (R5-9)”

    – Data flows from the server HBAs to the Compellent front-end ports
    – the data is staged to write cache (512MB per controller) and replicated to the other controller.
    – The data is cached until a full-stripe write is possible.
    – The data is written to disk on Tier 3, Fast Tracks, in Raid 5-9.

    The data is now on stable storage.

    And that’s it for the write data flow – there’s no such thing as continuous destaging from Raid 10 to Raid 5.

    After the data is on disk there are several ways for it to progress to the lower tiers. If you just leave the Volume (LUN) alone, the system will keep statistics for every “chunk” of data (512K, 2MB or 4MB) and then progress it slowly (the algorithm is based on an access threshold over a 14-day window) to the lower tiers – that’s not a “quick & dirty” destage to Raid 5.

    Instead, if you take snapshots, either scheduled or manual, the data progresses quicker. Just as an example, let’s imagine we have a 100GB volume (LUN):

    it’s 12.00 pm

    – We write 10GB of data (let’s call it Data Blob 1), it’s in Raid 10 on Tier 1, Fast Tracks, and it’s consuming 20GB of raw space (100% raid overhead due to Raid10)
    – take a snapshot of the volume

    It’s 3.00 pm

    – We write another 5GB of data (call it Data Blob 2), that’s also in Raid 10 on Tier 1, Fast Tracks, 10GB of Raw Space (grand total of 30GB)

    Usually at 7.00 pm (that’s the default, but it’s configurable) the Data Progression job kicks in. Because Compellent’s snapshots are pointer-based (just like NetApp’s), what we’ve called “Data Blob 1” becomes a read-only blob of data and it progresses immediately to Raid 5-9 to get some free space back.
    So the next morning we find the system in this situation:

    Data Blob 1, which is now part of the snapshot data, is written in Raid 5-9 on Tier 1, Fast Tracks, consuming almost 12GB of raw disk space
    Data Blob 2, which is part of the active data, is still written in Raid 10 on Tier 1, Fast Tracks, still consuming 10GB of raw disk space

    We just got 8GB of raw disk space back, without sacrificing performance, because “Data Blob 1” is considered read-only, so we don’t have to write to Raid 5 but just read from it, eliminating the RAID penalty.
    If we’re going to write data on that volume we’ll still write to “Data Blob 2”, which is still active data, still in Raid 10.

    That said, I would not imply that the configuration you saw from Compellent was right for the kind of workload the customer had. As I’ve stated many times, I trust only *MY* configurations, made from a known set of information that *I* analyze. But I still hope this helps you wrap your head around how Compellent works in detail, and why you simply cannot take the optimizations off the drawing board.

    HTH,
    Fabio

  19. Enrico and Fabio, thanks for posting such detailed answers. It certainly explains quite a bit about Compellent architecture, much of it intriguing.

    Enrico: the minimum acceptable sizing for NetApp came out to 17 disks including spares. The 23 disks was what a typical R5 implementation would require with 2 spares. All I was saying is, no healthy box could possibly produce that specific workload with 11 drives… and even if it was possible at max utilization, nobody wants to (or should) drive their storage system at 100% just to meet a price point.

    Regarding block sizes: I find it hard to believe that I/O size has absolutely no effect on the system. For instance – 3,000 12K IOPS are not the same to the disk as 3,000 4K IOPS. When you see disk performance estimates, they typically are for a 4K transfer size. I understand that, when writing especially, one can avoid a lot of that with intelligent algorithms that serialize previously random I/O (NetApp has been doing that forever), but for totally random reads, 3,000 12K IOPS transfer exactly 3 times as much data as 3,000 4K IOPS. That’s not a vendor-specific thing, right? Or am I missing something? What about 3,000 reads of 256K each, is that the same? Optimizations help, but at some point you have to hit the spindles!!!

    The original perf data files were large so I got rid of them after doing the analysis. However, we can easily make a hypothetical case – we don’t even need to pick these numbers so we’re not talking about a specific case, how’s that?

    The goal of any statistical analysis that looks at a large data sample is to reduce the complexity of said sample and make it more manageable for humans.

    Most storage companies have tools with (hopefully comprehensive) mathematical models of how the systems perform, that you can feed the numbers from the analysis to and the models will estimate the kind of config necessary.

    Therefore, one could well argue that, if given all the variables that the modeling tool expects, one can expect a reasonable sizing estimate.

    In my case, an example of what I can work with to provide a customer with a pretty accurate estimate: If I have, for example – I/O size, read %, random read %, total IOPS, response time needed and, most importantly, working set, then I can give you a solid config that will work.

    Effectively, what does the Compellent sizing tool expect as inputs?

    You see, I’m beginning to understand that different bits get written to different RAID levels, data is moved around etc etc, which is all very exciting, but is it possible to create a solid mathematical model of the system?

    If it is, then we can have fun.

    From Fabio’s response, it’s becoming clear that you initially select the “tier” and data progression will move stuff around.

    If you only have fast track, and not data progression, how does it work? Does the data just stay in RAID10? (not everyone buys data progression).

    If the profile is the one with R5, does the system not have the same restrictions as any system with R5 regarding reliability and parity calculations?

    And finally – assume a hypothetical totally random steady-write-ops workload that will not fit in R10 spindles. I’m really curious to understand how the system works then. This is not the only scenario I’m curious about, but too many questions gets confusing…

    Have a great weekend!

    D

  20. Dimitris,
    only the last comment.

    3000/15 = 200 iops per disk

    4K*200 iops = 0.8 MB/sec
    16K*200 iops = 3.2 MB/sec
    these numbers are well within the capabilities of a single drive (measurable with any benchmarking tool).

    256K*200 iops = 51.2MB/sec !!! – that is another story.

    4, 8 and 16K are reasonable; 256K is over the limit.

    ciao,
    Enrico

  21. IOPS != throughput. They’re related though. True, sequential I/O at 3.2MB/s is nothing for even slow drives.

    I guarantee you that a drive that maxes out at 200 4K totally random reads at 15ms latency will not be able to do 200 16K totally random reads with the same 15ms latency.

    64K transfer sizes are employed by a few different pieces of software, and may well be random.

    256K will typically be backup software so that won’t be random, but rather, sequential.

    But, ignore the last 2 sentences, and focus on the first 2.

    I need to stop doing this at crazy hours, it’s 0510… haven’t slept all night. Don’t know why.

    D

  22. Compellent’s tech looks very geeky and interesting. Here’s hoping it solves real-world problems in a cost effective way.

    (Admin for an IBM N-series customer).

  23. Dimitris,
    Sorry for the late reply, but this weekend I’ve been busy taking care of my garden 🙂

    – Regarding block sizes:

    As I told you previously, I am taking the block size factor into account – actually the values I gave you in a previous comment (15 active drives for your workload) are for a 16K-to-32K block size workload; if we consider it a 4K workload, the maximum is almost 3,400 IOPS.
    If you break down how much time the drive spends doing its thing, you’ll find that the transfer part is very limited – I’ve got some documentation where it’s stated that the transfer portion of a 4K I/O is near 0.3 ms.

    – Regarding what Compellent use to analyze the customer data:

    There are guidelines made by Compellent which are confidential, BUT, as a storage consultant/architect/whatever :-), I can tell you that we collect TONS of performance data from the customer, VERY granular (most of the time we end up with more than 100GB of raw data), and then we start digging into the data using our custom-built tools.
    After that we model a workload (or workloads) and create a configuration based on the outcome, plus the increment the customer needs in terms of performance and space.
    There are a number of small things overlooked by most that are crucial in a system with pointer-based snapshots (like Compellent and NetApp, of course); we try to consider everything, and usually my customers are very satisfied with my configs :-).

    – Regarding Fast Track w/out Data Progression:

    Yes, you can license Fast Track and/or Data Progression separately. If you buy only Fast Track, you simply create a new tier without the ability to progress data automatically down the other tiers, for example:

    NO Data Progression, only Fast Track.

    Write to Raid 10 Fast Track until it’s full;
    when Raid 10 Fast is full, start writing to Raid 10 Standard.
    Now, if you want to progress data to Raid 5 (or a slower tier) you need to have Data Progression or Instant Replay (snapshots). If you have Data Progression licensed, everything works “automagically” as explained before; if you have only Instant Replay, you can take a snapshot and Instant Replay converts all the snapshotted data into Raid 5 and progresses it to the lower tiers, but it works only for the snapshot data.
    Actually the choice is more between having Instant Replay, Data Progression AND/OR Fast Track, but the former two usually go together, as they are the value-add that Compellent gives to the customer.
    Also, if you just want to demote an entire volume to Raid 5 or to a lower tier, FAST V1 style, you can do a Copy/Migrate, which works transparently host-side and is included for free in the base license.

    I didn’t get your question on the steady write workload that doesn’t fit the R10 spindles – what do you need to know? Whether the latency goes up? Usually, when a system (and this is not Compellent-specific) is overloaded and the spindles can’t keep up with the work, the latency goes up and everything slows down.

    HTH,
    Ciao,
    Fabio

  24. Interesting discussion. I wish Compellent had an easy way to calculate storage. The features are great, but they add a lot of dynamics to deciding on the initial storage space needed and leave planning questions open around RAID overhead expectations.

  25. I realize I’m doing some serious necromancy on this old thread, but felt I had to put in a comment anyway.

    1. Thanks to everyone commenting in the thread, it’s all great information.
    2. It’s obvious that the original complaint – that the proposed 12-disk CML system was undersized for the customer’s workload – was valid.
    3. I had the following experience, personally. I put out a request for bids earlier this year, and received responses from several vendors, among them Compellent, and Netapp. We were looking to replace an aged EVA 4000 that was hosting an Oracle RAC database with an array that could handle both the DBs, as well as our VMware load. We were using a newer EVA4400 for the VMware, some other DBs, and mail servers at the time. The Compellent engineers did in fact request very granular (3 second interval data for approximately 6 weeks of total time) performance numbers. Because this was the most granular data requested by any of the vendors, I offered to provide it to all of them. One vendor in particular (not Netapp) said their tools wouldn’t accept data that granular, and requested 30 second interval instead, so I provided that to them.

    Before I had even completed the capture of the performance data, I received a NetApp quote for a V-series array with 12 15K spindles and PAM2. They said that I could just use this to virtualize the existing EVA4400 for VMware, and use the spindles/PAM in the NetApp to host the Oracle RAC workload. While this was definitely an interesting proposal, and getting NetApp features on the EVA would have been nice, this wasn’t what I asked for to begin with. The data collection was completed and the data provided to all interested vendors (the NetApp reseller stood by their original quote and said they didn’t need the perf data). From Compellent and other vendors, I received bids for 50-80 Tier 1 (15K) spindles, along with 10-30 Tier 3 (7.2K) spindles, and well-laid-out documentation of the information gleaned from the perf data and what specific data points had driven their recommendations. I reiterated to the NetApp vendor that I was looking for a standalone array to host the required systems, and that if they could reconfigure their bid to meet that requirement, the value-add of being able to virtualize the EVA 4400 would be taken into account. They continued to stand by the assertion that 12 15K spindles and PAM would be as good as 50+ spindles from anyone else, and even had the gall to say, “If it’s not, you can just hang another disk shelf off the controller.” When this vendor was told we had selected someone else, they responded by trying to play the “First Hit is Free” game by knocking nearly 45% off their initial bid.

    The point is, bad recommendations sometimes get made regardless of whose storage is being sold. We ended up with a Compellent system with 63 (60 active, 3 spare) 15K spindles and 24 (22 active, 2 spare) 7.2K spindles. Everything works as advertised, and I’m getting outstanding performance. The system is easy to use and manage, and the service is top notch. Could I have been just as happy with a correctly architected NetApp system? I’m sure I could. It’s not a question of “Can you provide the performance I need?” with any of the vendors I worked with, and there were a lot of features of the NetApp that I really liked. Unfortunately, sometimes the vendor/reseller gets in the way of providing a system that meets the customer’s requirements.

    Just because, I want to say that I am not employed by, nor am I receiving any compensation from, any vendor mentioned or not mentioned. I wrote this completely to provide my own experience.

    HTH,
    John

    1. Thanks for the comment, John. Necromancy is fine by me.

      Indeed customer experiences depend on geo, vendor rep/SE, VAR rep/SE, and more.

      I’ll ask this:

      If anyone shows a figure that’s too good to be true, simply ask them to prove it.

      Vendors typically have sizing tools (NetApp has the most comprehensive ones I’ve ever seen, BTW).

      For some workloads, because of Flash Cache and ONTAP write optimizations, one could need far fewer drives than with other vendors, but that guidance needs to come from the sizer – nobody should just say “it’ll be fine”.

      So ask to see the sizers, regardless of vendor.

      D

  26. Hi,
    After this lengthy and interesting discussion I would just like to add one piece of info on CML internals. One of the reasons people get confused about how CML works is that it is, in fact, just software – it’s not a HW solution, simply SW. What I mean by this is that with CML you do not have disk arrays (you know, “I buy 8 HDDs because I need 7+1”, etc.); you do not organize the machine into RAID groups, format them as R1, R5, R6 or whatever, concatenate, and then write. The cute thing is that you attach metadata to EACH block; in the metadata you state that that block is R5 or R1 protected. The writes are always R10. When you do Data Progression, or take a snapshot, you update the metadata and change the RAID parity, or make it read-only; then you move the block (or, rather, the 2MB page) if necessary. What you do, in practice, is software RAID. It took me some time at first to understand how this works, but then things became clear. I now build a CML array with, more or less, 20-30% fewer HDDs than NetApp, for example, and I give the customer the same performance, by correctly sizing the pyramid: first and second tiers that are smaller than traditionally done and only accommodate the needed IOPS for the workloads (over the 12-day Data Progression monitoring period), and a third, capacity tier where data gets progressed as it cools.
    Data Progression runs daily via schedule, BTW, but it can be configured.
    11 HDDs for your customer are way too few, that’s obvious. Sorry you ran into science fiction writers instead of professional engineers.

  27. Hi all,

    Necro from my part too i guess :).

    I’ve read it all, and it’s one of the best storage discussions on the big WWW… KUDOS to all who have taken free time to discuss this.

    We are evaluating Compellent (Dell now), and it looks promising (more expensive than the IBM and EMC quotes; we don’t have any from NetApp or HP), not just because of the tiering / Data Progression technology but also for quite a few other things, like the fact that you own the software forever, etc. We have an EMC CX4 now and Dell is quite keen on making us change.

    One thing that is boggling my head is how the whole array handles a potential dual-drive (or more) failure, and how it rebuilds…

    Yes, I know… hot spares, RAID6/DP, etc. But it can happen. It’s happened to all vendors. Bad drives, a bad on-site technician… internal human error…
    Just like the EMC competitors talk FUD about “what would happen if you had a dual drive failure on the VAULT drives”…

    What happens with Compellent if you lose several drives? Or an entire disk shelf? (Let’s say you pull the wrong back-end SAS cables.)

    In a traditional array (or server), if you lose one RAID set or have to rebuild it, you don’t necessarily lose (or even lose access to) your other data.

    Say I devote a 15K disk shelf to my DB (and divide it into different RAID sets for log, temp, DB, etc.),
    another disk shelf with 7.2K drives to file shares (tier 3), images, etc.,
    and a third shelf to, let’s say, VMware for different servers…

    Losing more than one drive (let’s say there is no RAID6) on any of the enclosures would result in either the DB, or VMware, or the file shares, etc., not being accessible. Which would / could be a disaster, but obviously a smaller one than if ALL the data was lost.

    If a LUN is in BOTH RAID10, RAID5 and RAID6… big parts of the data are still… YES? in RAID5? Well, that means that two drives going bad (which statistically can happen on a bigger system with, let’s say, 100 drives) while the hot spare cannot rebuild in time (I mean, why does Dell quote 1 hot spare per shelf?)… well, do I lose my entire SAN?
    Now with all this Compellent “magic”… with “all drives participating in all RAID levels/stripes” and a block / page being moved around… in a case of the above failure… having practically all drives in one storage pool… wouldn’t the whole system go totally down? If I lose more drives than the RAID protection in the Compellent profile can afford… and my LUNs are no longer dedicated to certain shelves / RAID groups… would I have to restore from backup?
    When a hot spare takes over / a drive is replaced and rebuilt… surely that takes time… could be 5 hours… could be 3 days. And in the meantime all the data has been moved around… what happens?

    I’m sure there is a smart answer… I just can’t figure it out, and the answers Dell is giving me are vague… kind of freaking me out!

    And please don’t say “remote replication” (yes, that is also an option, but we consider it an option in case of a different disaster… not a dual disk error.)

    V.

  28. You got this right.

    With Compellent, auto-tiering happens on chunks that are part of a snapshot.

    So, let’s say you have a 1TB DB LUN that hosts the most absolutely critical DB in your company.

    You take a snapshot of the LUN. It’s important – without the snapshot there is NO auto-tiering with Compellent (most people seem to be unaware of this).

    The Compellent system is set up to move “cold” 2MB chunks of that LUN to different disks.

    So, you may want the DB in RAID10, but your SATA might be configured as RAID5.

    The 2MB chunks that are not being accessed frequently will end up on the RAID5 SATA space.

    If you have data loss in the RAID5 space, YOU WILL LOSE YOUR LUN.

    There is no magic.

    Chunklet vendors like Compellent and 3Par will tell you their scheme is more reliable than straight RAID5 but here’s the simple truth:

    You get a 2MB chunk.

    You spread the chunk onto 5 disks if you’re doing RAID5 4+1.

    If you lose any 2 of those 5 disks, you’ve lost that chunk.

    End of story.

    Friends don’t let friends use RAID5 in this day and age.

    D

  29. Speaking of *serious* necromancy, I was cleaning up my blog and I ended up here, funny to see this comment thread still going after 2 full years :-).

    Anyway, just to clear things up, you can now have (since SC5, we’re now at SC6.2…) dual redundancy in the same disk folder, for instance:

    Tier 1: 200GB SSD Drives – Single Redundancy (RAID10 + RAID5)
    Tier 2: 600GB 15K SAS Drives – Single Redundancy (RAID10 + RAID5)
    Tier 3: 3TB 7.2 NL SAS Drives – Dual Redundancy (RAID10DualMirror + RAID6)

    As usual, every disk in the disk folder is part of the same pool, if you suffer a dual drive failure on your Tier 3, you’ll be just fine, and changing the redundancy level (on every tier) can be done without taking an outage.

    Ciao,
    Fabio

  30. Nice numeric breakdown of typical IOPS/type of drive. My current biggest issue with Compellent is the inability to scale controllers. I do not like having two 90-98% CPU controllers; that is not HA.

    Also, I can’t see the image that will make everything clear.
    http://imgur.com/1qhqGI7
    🙁

  31. Dell Compellent storage is nothing but a piece of junk. You should stay away from this junk. There is a software bug in SCOS which Dell denies categorically, but I have an email from their own tech confirming it is a fact. RAID scrub will kill back-end IOPS on any disk you use. If you have an I/O-intensive environment, stay away from Dell.

    1. Can you provide more info? We are experiencing this issue on one of our Compellent Storage Centers. Not getting anywhere with Dell support. Funny thing is we have 2 identical Storage Centers: one has this issue, the other doesn’t. They both run the same firmware, same volumes, etc., yet one system has been running a RAID scrub 100% of the time (for the 6 months it’s been in use) and using up 100% of the IOPS on the spindle disks. The other Storage Center only runs a scrub once in a while and finishes very quickly.

Leave a comment for posterity...