NetApp benefits for virtualization – benchmarked and proven

My colleague Vaughn Stewart explains it in detail here. I didn’t feel we gave this the publicity it deserves.

In a nutshell: We have numbers (published only after VMware engineering themselves approved the paper as accurate and gave their permission) proving that, compared to traditional arrays, running virtualized workloads on NetApp gear needs fewer resources while providing excellent performance.

If you don’t want to spend time reading Vaughn’s article, this link has the goods in impressive detail.

It’s worth noting the “traditional” array had a lot more disks and RAM, but the NetApp array had a Flash Cache module. We are not allowed to publish the vendor of the “traditional” array due to licensing restrictions, but, as mentioned, VMware engineering verified the results – the test was legit (no vendor is allowed to publish VMware performance data unless VMware engineering has verified all testing was aboveboard and accurate).

Some pictures for the impatient:

 

 

Key take-aways:

  1. A lot less disk space needed with NetApp
  2. A lot quicker to provision the VMs
  3. Faster performance than RAID10 even without the Flash Cache (and dramatically higher with)
  4. No-compromise RAID-DP offers same protection as RAID6 without the penalty
  5. NFS for VMware can be pretty fast indeed, given the appropriate storage behind it!

D

Filesystem benchmark extravaganza – Win, Linux, NTFS, EXT4, XFS, BFS scheduler impact and more…

It’s been a while since I checked the status of Linux-land regarding filesystems and CPU and I/O schedulers, so I thought I’d post some results.

A bunch of new distributions are coming out with Linux kernel 2.6.32 as standard (Ubuntu 10.04LTS being one of them), and one distribution (PC Linux OS) will have Con Kolivas’ BFS as its default process scheduler.

Since I’m a scheduler and filesystem aficionado, I was intrigued by the simplicity of the BFS and wanted to see how it might affect I/O performance. I don’t believe that there’s ever a single scheduling algorithm that fits all possible use cases, and, furthermore, I think the default Linux CFS scheduler is getting a bit unwieldy in its complexity (though the goal is to make it scale to very large numbers of cores, not all machines have that).

So I grabbed a spare machine that doesn’t have a ton of horsepower on purpose, and tested using the same benchmark (NetApp’s venerable postmark) on the same part of the disk, with Windows, Ubuntu 10.04LTS beta, and PCLOS 2010 beta2.

Here’s some of the averaged data (did multiple runs):

 

Configuration             Time  IOPS  Creation  Reads  Appends  Deletes  Read MB/s  Write MB/s
win tuned                  208   244       156    121      123      183       2.68       5.658
win supercache             197   224       200    111      113      175       2.78       5.88
win uptempo                172   281       277    139      141      156       3.19       6.73
Ubuntu 10.04b              154   169       454     84       85      727       3.56       7.52
Ubuntu 10.04b deadline     136   173       500     86       87    10184       4.03       8.51
PCLOS ext4 CFQ             105   239      1011    118      120     4478       5.25      11.09
PCLOS ext4 deadline         97   240      1158    119      121     7828       5.73      12.21
PCLOS XFS tuned deadline   187   136       476     68       68      509       2.93       6.19

Some explanation on the fields:

  • Time: Total time to run
  • IOPS: mixed transactions per second (maybe the most useful number)
  • Creation: files created per second (not mixed with transactions)
  • Reads: reads per second
  • Appends: appends per second
  • Deletes: pure file deletions per second (not mixed with transactions)
  • Read and write MB/s: the effective MB/s

Some notes:

  • Windows was just vanilla XP with all service packs, tuned as a server (box was too weak to take Win7)
  • Supercache was using a portion of memory for the Supercache filter driver
  • Uptempo is a similar filter driver that provides caching (see previous posts here and here)
  • Ubuntu was the latest available beta of 10.04
  • Whenever you see “deadline” the deadline scheduler was used instead of CFQ
  • PCLOS is the latest available beta of PCLOS 2010.
  • I mounted ext4 with barrier=0,noatime
  • I mounted XFS with nobarrier,noatime,nodiratime,logbufs=8,logbsize=256k and made it with mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=128m

Some pretty pictures for the ADD among us:

image

image

image

image

Some observations:

  • Some metadata-heavy operations, like deletion, can get heavily cached in Linux, which skews the MB/s and total time results when 10,000 files can be deleted in an instant. I wanted to show the numbers because, depending on what you do, that may or may not be important.
  • Intelligent caching still helps tremendously: note the results for Uptempo on Windows and the crazy file creation rates on Linux (this also skews the MB/s, but it’s a useful number to know if you’re creating a lot of files constantly)
  • Depending on what you’re doing, Windows can be Just Fine…
  • The BFS seems to have been an excellent scheduler choice for PCLOS and something other Linux vendors should start looking at seriously – unless there are serious other I/O tweaks in the kernel that I don’t know about, the difference in performance is staggering between Ubuntu and PCLOS (Phoronix figured that out as well here – they have tons of additional benchmarks showing things other than I/O).
  • And, last but by no means least – the deadline scheduler is ahead of CFQ again, both for Ubuntu and PCLOS. Not by a huge margin, but it’s a safe choice, especially if you’re running Linux on enterprise storage that has its own decent I/O scheduler built in. There have been cases of CFQ dramatically lowering performance with external arrays. Remember – most of the guys writing code for Linux don’t have access to enterprise storage… using the defaults could be harming your Linux performance!
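For anyone who wants to check which I/O scheduler a disk is actually using, here is a minimal Python sketch. It assumes the usual sysfs layout on 2.6-era kernels, where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. The device name "sda" and both function names are just for illustration.

```python
from pathlib import Path


def parse_scheduler(line):
    """Split a sysfs scheduler line such as 'noop deadline [cfq]'.

    Returns (active, available); the active scheduler is the bracketed name.
    """
    names = line.split()
    active = next(n[1:-1] for n in names if n.startswith("[") and n.endswith("]"))
    available = [n.strip("[]") for n in names]
    return active, available


def current_scheduler(device="sda"):
    """Read the active scheduler for a block device ('sda' is a placeholder)."""
    text = Path(f"/sys/block/{device}/queue/scheduler").read_text()
    return parse_scheduler(text)[0]


if __name__ == "__main__":
    # Parsing a sample line works anywhere; current_scheduler() needs Linux.
    print(parse_scheduler("noop deadline [cfq]"))
```

Switching is normally done by writing the scheduler name (for example, deadline) into that same file as root; test on a non-production box first.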

D


More tales from the field: Sizing best practices – does Compellent follow them?

Note: I edited this a bit to remove some confusing pieces of info.

Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers – remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…

This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than are necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (the first case I witnessed was Exchange sizing, where they were proposing a single shelf for a workload that needed several times the number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.

And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:

There is no magic.

This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.

Ergo, this is about to get spindle-bound…

And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers :) )

Some preliminaries:

This is a Windows-only environment. So, the customer sent perfmon data for their servers over for me to analyze and recommend a box.

They’ll be running Exchange plus some databases.

From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.

For instance – there is peak performance, average, and what we called “steady-state”.

In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.

The trick is to figure out how long the spikes last, and see if the caches would be able to accommodate them. If a spike lasts for 30 minutes it’s not a spike any more, but rather a real workload you need to accommodate.

If the spikes are in the range of seconds, then cache is usually enough. Depends on the magnitude of the spike, the length of the spike and the size of the cache :)

Then, you have your average performance. That just takes a straight math average across all performance points – so, for instance, if you have, at night, very long periods of inactivity, they will affect the average dramatically. Short-lived spike data points won’t affect it as much since there are so few of them. So the average typically gets skewed towards the low end.

Then there’s the concept of “steady state”.

This effectively tries to get a more meaningful average of performance during normal working periods. It’s actually easy to eyeball if you’re looking at the IOPS graphs instead of letting Excel do its averaging for you.

A picture will make things clearer:

image

In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!
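To make the difference concrete, here is a small Python sketch. The samples are made up (they are not the customer's perfmon data, nor the exact numbers behind the chart), and using the median of the non-idle samples as the "steady state" is just one reasonable way to approximate what you'd eyeball on the graph. The idle threshold of 50 IOPS is also an assumption.

```python
from statistics import mean, median

# Hypothetical IOPS samples: a quiet night, one brief spike, then a working day.
samples = [10] * 12 + [1500] + [500] * 11


def steady_state(iops_samples, idle_threshold=50):
    """Median of the samples taken while the system was actually busy.

    The median keeps a short spike from dragging the figure up, and dropping
    idle samples keeps the quiet periods from dragging it down.
    """
    busy = [s for s in iops_samples if s > idle_threshold]
    return median(busy)


average = mean(samples)
steady = steady_state(samples)

if __name__ == "__main__":
    print(f"plain average: {average:.0f} IOPS")
    print(f"steady state : {steady:.0f} IOPS")
```

With these invented samples the plain average lands well below the steady-state figure, which is the same trap the chart illustrates.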

So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.

So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.

I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).

Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…

What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.

So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.

Finally, once you figure out the final, desired steady-state IOPS requirements, you need to translate them into your specific system, since there’s cache helping, but always some overhead to be considered. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID. That increases the IOPS needed by nature. Again, this is the same for all array vendors – the only time there’s no I/O penalty, is if you’re doing RAID0 (= no protection).

You see, RAID5 for instance, in order to perform writes, has to do some reads as well, to calculate and write the parity. All very normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant, and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but has to write 2 of everything, so that needs to be considered as well.
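A rough way to express the RAID adjustment in Python is below. The write penalties are the commonly quoted rules of thumb for small random writes (2 back-end I/Os per write for RAID10, 4 for RAID5: read data, read parity, write data, write parity); they are assumptions rather than any vendor's exact numbers, and cache effects are ignored. The function name and the 70/30 example mix are hypothetical.

```python
# Rule-of-thumb back-end disk I/Os generated by one host write.
WRITE_PENALTY = {"raid0": 1, "raid10": 2, "raid5": 4}


def backend_iops(host_iops, read_pct, raid_level):
    """Estimate disk-level IOPS for a host workload.

    host_iops  -- steady-state IOPS seen by the hosts
    read_pct   -- percentage of those I/Os that are reads (0-100)
    raid_level -- one of the keys in WRITE_PENALTY
    """
    reads = host_iops * read_pct / 100
    writes = host_iops * (100 - read_pct) / 100
    return reads + writes * WRITE_PENALTY[raid_level]


if __name__ == "__main__":
    # Hypothetical 1,000 IOPS workload with a 70/30 read/write mix.
    for level in ("raid0", "raid10", "raid5"):
        print(level, backend_iops(1000, 70, level))
```

The more write-heavy the mix, the bigger the gap between what the hosts ask for and what the spindles have to deliver.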

You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.

The last thing that needs to be taken into account is the working set. Basically, it means this:

Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.

Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.

Going back to the reason I wrote this post in the first place:

In this specific, small environment, the non-RAID steady-state percentile IOPS required were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of the small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disk. The spikes were fairly high, in excess of 10x the steady-state.

Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.
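The arithmetic above, spelled out as a short Python sketch. The figures (3,000 IOPS, 50% RAID5 overhead, 220 IOPS per 15K drive, 1-2 spares, 11 active drives in the competing quote) come straight from this post; the function and variable names are mine.

```python
import math

PER_DISK_IOPS = 220          # 15K RPM drive at reasonable (<20ms) latency
STEADY_STATE_IOPS = 3000     # pre-RAID steady-state requirement
RAID_OVERHEAD = 1.5          # this I/O mix needed 50% more disk IOPS on RAID5


def disks_needed(iops, per_disk=PER_DISK_IOPS):
    """Round up: you can't buy a fraction of a drive."""
    return math.ceil(iops / per_disk)


pre_raid = disks_needed(STEADY_STATE_IOPS)
with_raid = disks_needed(STEADY_STATE_IOPS * RAID_OVERHEAD)
with_spares = (with_raid + 1, with_raid + 2)

if __name__ == "__main__":
    print("pre-RAID drives      :", pre_raid)
    print("RAID-adjusted drives :", with_raid)
    print("with 1-2 spares      :", with_spares)
    # What 11 active drives can deliver at the same per-disk figure:
    print("11 drives deliver    :", 11 * PER_DISK_IOPS, "IOPS")
```

None of this accounts for the 10x spikes, so treat the result as a floor, not a target.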

And, finally, the offense in question:

Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out, and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1, and the reads as RAID5. I’m not the expert, I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…

Based on the above analysis, this somehow doesn’t compute with 11 drives, half what my calculations indicate… so, my final question is:

How do Compellent engineers size for performance?

D

EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)

I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shot…

I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!

Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futures…

Perception is reality.

I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle reader…

The tiering concept

Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.

Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside – all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to it…

Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.

What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA now has become the most frequently-accessed piece of data in the array. Or, perhaps, the importance of a DB flip-flops during a month, so it only needs to be fast maybe for month-end-processing. So now, you need to move stuff around so that what needs to be fast is shifted to the fast drives.

Pressure points and the need for passing the hot potato

But wait, there’s more…

The entire performance problem is created in the first place due to most array architectures being older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, it’s best practice to have a 5-disk RAID5 group. You then ideally split up that group into no more than 2 LUNs and assign one to each controller.

With disks getting bigger and bigger, creating 1-2 LUNs can become exceedingly difficult – a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs – maybe you just need 50GB here, another 300GB there… in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.

So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.

You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.

Again – this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.

But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?

That’s what EMC’s FAST for the Symmetrix and CX does. It attempts to move entire LUNs to faster tiers like SSD. Which, BTW, is something you can do manually, but FAST attempts to automate the task (kinda, depends, etc).

The current FAST pitfalls

Let’s examine first how FAST (Fully Automated Storage Tiering) is implemented. Since it’s really 3 utterly different solutions, depending on whether you have Symm, CX or NS:

On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but it definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is. It looks to me too much like Optimizer + some new heuristics. But the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection – maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops – with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. But of the FAST implementations on EMC gear it is the more complete version, exactly because Optimizer has been there for a long time…

On the far more popular CX platform, what happens is like a tribute to kludges everywhere. Consider this:

  1. Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
  2. You need a separate PC that will crunch Navisphere Analyzer performance logs; this takes a while
  3. The PC will then provide a list of recommendations
  4. Depending on which LUNs you approve it will invoke a NaviCLI command to move the specified LUNs in the box
  5. Doesn’t support thin provisioning
  6. Not sure if it supports MetaLUNs
  7. It is NOT automatic, since you have to approve the move! Ergo, it should not be sold under the name “FAST”, since the “A” stands for “Automated”. Aren’t there laws against false advertising?

On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA – how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?

Ah, but it gets better with FASTv2! Or does it?

EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.

The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors) yet EMC in their docs state the current version of virtual provisioning (at least on the CX) has higher overhead when compared to their traditional LUNs and will provide less performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:

The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how efficient that will end up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move just like FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be significantly large, expect a bunch of non-applicable data to move with the stuff that should be moved.

The real problem

First, to give credit where it’s due: Compellent already has had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.

However – both the Compellent approach as well as FASTv2 and, even worse, v1, suffer from this fundamental issue:

Lack of real-time acceleration.

Think about it – performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.

Consider this scenario:

You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”

Then crunch time comes, and the DB is on the SATA drives. Oopsie.

People complain, and the storage admin is forced to manually migrate it back to SSD.

Kinda defeats the whole purpose… unless I’m missing something the size of the Titanic.
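Here is a toy Python simulation of that payroll scenario. To be clear, the idle-time rule, the 14-day threshold and the nightly re-evaluation are all invented for illustration; this is not how FAST, Compellent or any other product actually decides what to move. It only shows why any policy driven by past access lags behind a workload that wakes up suddenly.

```python
def choose_tier(days_idle, demote_after_days=14):
    """Hypothetical rule: anything untouched for too long goes to SATA."""
    return "SATA" if days_idle >= demote_after_days else "SSD"


def simulate(accessed, demote_after_days=14):
    """Return (day, tier) for every day the LUN was busy.

    accessed -- one boolean per day, True when the LUN saw real I/O.
    The tier is re-evaluated once per night, after the day's I/O is known.
    """
    tier, days_idle, busy_days = "SSD", 0, []
    for day, busy in enumerate(accessed, start=1):
        if busy:
            busy_days.append((day, tier))
        days_idle = 0 if busy else days_idle + 1
        tier = choose_tier(days_idle, demote_after_days)
    return busy_days


# Payroll DB: idle for 27 days, then month-end processing for 3 days.
payroll_month = [False] * 27 + [True] * 3

if __name__ == "__main__":
    for day, tier in simulate(payroll_month):
        print(f"day {day}: payroll is busy while sitting on {tier}")
```

Even with an optimistic overnight promotion, the first day of crunch time runs on the slow tier, and with real migration times the lag would only be longer.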

So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and there…

Soon, you’re actually in a worse state than where you began: You have the added complexity and cost of FAST, plus you have to worry about creating exception rules.

Now here’s a novel idea…

What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?

For instance – normal fileshares, deep archives, large media files, backups to disk – most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.

DBs, VM OS files – should usually be on faster disk. But no need to go nuts with several tiers of fast disk, a single fast tier should be sufficient!

LUNs and other array objects should try to automatically span as many drives as possible by default without you having to tell the array to do that… that way you avoid the hot spots in the first place by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).

And finally, large, intelligent cache (as in really large) to help with real-time workload demands, dynamically and as-needed, by caching tiny 4K chunks and not wasting space on gigantic pieces… with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.

Wouldn’t that be a bit simpler to manage, more nimble and more useful in real-world scenarios? The cache will help out even the slower drives for both file and OLTP-type workloads.

Maybe life doesn’t need to be complicated after all.

It’s almost 0300 so I’d better go to bed…

D

 

 

 

 

 

NetApp disk rebuild impact on performance (or lack thereof)

Due to the craziness in the previous blog, I decided to post an actual graph showing a NetApp system’s I/O latency while under load and during a disk rebuild. It was a bakeoff vs another large storage vendor (which NetApp won).

The test was done at a large media company with over 70,000 Exchange seats. It was with no more than 84 drives, so we’re not talking about some gigantic lab queen system (I love Marc Farley’s term). The box was set up per best practices, with aggregate size being 28 disks in this case.

(Edited at the request of EMC’s CTO to include the performance tidbit): Over 4K IOPS were hitting each aggregate (much more than the customer needed) and the system had quite a lot of steam left in it.

There were several Exchange clusters hitting the box in parallel.

All of the testing for both vendors was conducted by Microsoft personnel for the customer.  The volume names have been removed from the graph to protect the identity of the customer:

 

Under a 53:47 read/write ratio 8K-size IOPS, a single disk was pulled.  Pretty realistic failure scenario, a disk breaks while the system is under production-level load. Plenty of writes, too, almost 50%.

Ok…  The fuzzy line around 6ms is the read latency.  At point 1 a disk was pulled and at point 2 the rebuild completed.  Read latency increased to 8ms during the rebuild, but dropped back down to 5 after the rebuild completed.  The line at less than 1 ms response time straight across the bottom is the write latency. Yes it’s that good.

So – there was a tiny bit of performance degradation for the reads but I wouldn’t say that it “killed” performance as a competitor alleged.

The rebuild time is a tad faster than 30 hours as well (look at the graph :) ) but then again the box used faster, 15K drives (and smaller, 300GB vs 500GB), so before anyone complains, it’s not apples-to-apples compared to the Demartek report.

I just wanted to illustrate a real example from a real test at a real customer using a real application, and show the real effects of drive failures in a properly-implemented RAID-DP system.

The FUD-busting will continue, stay tuned…

D

What if you could dramatically improve your application testing times? What would happen to your productivity and to the company’s bottom line?

So, let’s say the DBA (or insert some other discipline) wants to do some testing for a new product (known to happen occasionally) – and the way he would really like to test is to create 20 test cases, which requires 20 copies of the main database. He would then automate the test and therefore get results very quickly.

He approaches the storage admin with the problem, only to be told this isn’t possible since there isn’t enough space on the array. The DBA goes back to his cube frustrated, and figures out some ghetto way of creating at least 1 copy of the database, which creates the following problems:

  1. He has to figure out a way to do it (takes time)
  2. He can only test 1 case at a time (time)
  3. He cannot easily compare what-if scenarios between test cases (lack of flexibility)
  4. His ghetto way of doing it may involve single 1TB disks in a workstation (lack of reliability, time)

Ultimately, the testing takes longer, is error-prone, and the DBA’s productivity level goes way down.

What if the storage admin could, instead, tell the DBA that he can even take hundreds of copies of the DB, there’s no issue doing that?

What would happen to the DBA’s productivity?    

What new ideas would he be able to come up with?

How would that affect the quality of the product?

How would that affect the company’s bottom line? Being able to go to market with improved quality and quicker than the competition?

You see, intelligent storage – intelligently deployed – can solve many more problems than just “give me some space” or “give me more performance”.

There aren’t many technologies out there that can comfortably do this, which is probably why most storage people aren’t aware of this. But an array that can create space- and performance-efficient application-consistent DB clones is the ticket. Being able to create full copies and/or virtual space-efficient copies that end up being unusably slow doesn’t count… :)

The only vendor I know of that can pull this off (properly) is NetApp with their FlexClone technology. One can even use it to deploy thousands of identical VMs… there are some use cases for that, too :)

Activision (the company that makes the famous Guitar Hero game) is a good example of using this technology to rapidly accelerate development – and ended up making the Christmas deadline, which resulted in several more millions in sales. See here.

Oracle is another small company that uses this technology pervasively.

If anyone else knows of more vendors that can do this (properly) please chime in.

D

New ext4 vs XFS benchmarks using Fedora 11 Leonidas

What a difference a kernel rev and/or distribution makes. If you recall from a previous post, I was unable to complete postmark testing on Ubuntu 9.04 using ext4, and had to recommend against ext4. Now, with the release of Fedora 11 “Leonidas”, a new kernel seems to make a big difference in performance and stability of ext4.

Some other observations before I show any numbers:

  • This is NOT the same computer as was used in the previous test, so don’t use these numbers to compare between Ubuntu and Fedora. It’s a desktop with a 64-bit Athlon and 1GB RAM. I know, I know… I didn’t have access to the other box. Look at Phoronix.com for a comparison of the two.
  • The 2.6.29 kernel seems to have a much better implementation of the CFQ I/O elevator: I only noticed a slight decrease in performance using deadline, instead of the increase I usually get with XFS (ext3 and ext4 have always been tuned for CFQ).
  • In this version, using my usual (and sometimes unsafe and daring) mount switches didn’t seem to make a huge difference on XFS, and none on ext4 or even ext3. Fedora 11 is really a distribution that the developers want you to be able to use without much fussing.
  • On all tests, I created XFS with mkfs.xfs -f -l lazy-count=1 -l size=128m /dev/…  – this enables the 2 main (and safe) tunings that I believe everyone should follow with XFS. That’s kinda hard to do while installing a distribution; the Fedora 11 installer wasn’t happy about it. Ubuntu is more forgiving: it lets you boot into the LiveCD and manually create partitions before you let the installer do its thing. Convenient for single-root-partition installs…
  • “XFS tuned” means mounted with noatime,logbsize=256k,nobarrier (nobarrier is unsafe unless you’re on a UPS).
  • “ext3 tuned” means barrier=0,noatime,data=writeback. Used to make a big difference…
  • The same disk area was used for all tests
  • Scribefire on Firefox sucks compared to Mac- or Windows-based offline blog editors. There are some KDE-based ones but I didn’t want to download 100s of MB of KDE support infrastructure to run a 600K blog program…

Postmark numbers:

Filesystem               Read MB/s   Write MB/s   IOPS
XFS defaults             4.9         10.34        215
XFS tuned                6.23        13.16        263
XFS noatime,logbsize     6.38        13.47        263
ext4 noatime             9.62        20.32        416
ext3 noatime             5.71        12.06        238
ext3 “tuned”             5.32        11.24        219
ext3 writeback,noatime   4.73        9.98         192

Bonnie++ numbers:

Filesystem               IOPS    Block writes KB/s   Rewrite KB/s
XFS defaults             328.4   116600              52066
XFS tuned                328.6   119981              51639
XFS noatime,logbsize     333     119781              50519
ext4 noatime             335.1   117285              48797
ext3 noatime             294.6   100771              43033

Verdict

  • Ext4 shows great promise!
  • For sheer MB/s on large files, XFS is still better by a small margin
  • If you want to be doing operations on many small files, ext4 is great
  • The reworked CFQ scheduler rocks
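To put a rough number on the small-file verdict, the Postmark IOPS figures above can be compared as percentage gains. This is only a quick sketch; the helper name pct_gain is made up for illustration and the inputs are copied straight from the Postmark table.

```shell
# Percentage gain of one result over a baseline, to one decimal place.
# pct_gain is a hypothetical helper name, not part of postmark or bonnie++.
pct_gain() {
    awk -v new="$1" -v base="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

pct_gain 416 263   # ext4 noatime vs XFS tuned (Postmark IOPS)
pct_gain 416 238   # ext4 noatime vs ext3 noatime (Postmark IOPS)
```

On this particular box that works out to roughly 58% and 75% more Postmark IOPS for ext4; different hardware may well give a different picture.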

D

Linux filesystem benchmark extravaganza – including Deadline vs CFQ schedulers and ext4 instability

I have some spare time these days so I figured I’d finally test as many filesystems on Linux as I could…

The new ext4 is an option with modern kernels so I loaded Ubuntu 9.04 and tried postmark and bonnie++ on the same partition using various filesystems and switching between the CFQ and Deadline schedulers.

Switching schedulers permanently can be achieved by changing the boot options and appending, say, elevator=deadline, but you can also switch them on the fly by running the following:

echo deadline > /sys/block/sda/queue/scheduler

You can check what’s currently selected by simply typing

cat /sys/block/sda/queue/scheduler

You’ll get back something like:

noop anticipatory [deadline] cfq

The scheduler in brackets is the currently selected one.
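If you want to pick out the active scheduler in a script, something along these lines should work. It is a minimal sketch: the function name and the optional second argument (a path override, handy for trying it against a sample file) are assumptions for illustration, not part of any standard tool.

```shell
# Print the active I/O scheduler for a block device.
# The sysfs file lists every available scheduler and wraps the active one
# in square brackets, e.g. "noop anticipatory [deadline] cfq".
# $1: device name (defaults to sda)
# $2: optional path override (hypothetical convenience for testing)
active_scheduler() {
    dev="${1:-sda}"
    file="${2:-/sys/block/$dev/queue/scheduler}"
    # Keep only the text between the square brackets.
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$file"
}

# Usage: active_scheduler sda
```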

Reader beware: Running postmark on ext4 locked up the system repeatedly during the transaction phase of the benchmark, using either my own compiled version or the one from the repository. Obviously there is some issue there, and I cannot at this time recommend ext4 – no other filesystem caused lockups. I did run bonnie++ as well since that didn’t crash with ext4.

The objective of this exercise wasn’t to show which filesystem is fastest, but rather to illustrate that, depending on what you want to do, you may want to re-examine the choice of filesystem and scheduler with your application if you’re running Linux. BTW the current recommendation for databases and fast intelligent external arrays – and Ubuntu’s default in the server edition – is the Deadline scheduler, and not CFQ. However, all other distributions at the moment use CFQ!

So, without further ado, some benchmarks… (I’m not including the entire postmark output since it would be far too large; I just kept the most important metrics. Anyone that wants the entire results is more than welcome to send me an email and I’ll hook you up.)

Postmark MB/s:

Filesystem        Read MB/s   Write MB/s   IOPS
Reiser CFQ        4.85        10.25        227
Reiser Deadline   5.38        11.35        246
XFS CFQ           2.33        4.93         109
XFS Deadline      2.35        4.97         105
XFS Tuned         2.73        5.76         120
JFS CFQ           1.75        3.69         78
JFS Deadline      1.73        3.65         76
Ext3 CFQ          2.71        5.73         115
Ext3 Deadline     2.86        6.03         122

 

[Chart: Postmark MB/s]

Postmark IOPS:

[Chart: Postmark IOPS]

Bonnie++ write speed:

Filesystem        IOPS   Block writes KB/s   Rewrite KB/s
Reiser CFQ        428    31657               18199
Reiser Deadline   462    32290               18154
XFS CFQ           471    39901               18557
XFS Deadline      483    39840               19653
XFS Tuned         592    40604               20746
JFS CFQ           433    31651               18528
JFS Deadline      452    39106               18755
Ext3 CFQ          403    31108               17235
Ext3 Deadline     338    31803               17885
Ext4 CFQ          451    39265               18519
Ext4 Deadline     446    39257               18221

[Chart: Bonnie++ write speed]

Bonnie++ IOPS:

[Chart: Bonnie++ IOPS]

Observations:

The Deadline scheduler seems to be consistently better for anything that’s not ext-based! A lot of work has been done on the Linux kernel to optimize it for the ext2-3-4 filesystems, and that shows. However, depending on what you want to do, ext3 may not be the best option (I don’t know yet about ext4 for postmark-type loads but based on the bonnie++ results it’s solid).

Here’s a list of some considerations:

  • Will the filesystem host many many small files or a few large ones? Reiser still rules the “many small files” use case, by far. The rest are fairly close, and JFS seriously lags. For large files, XFS is great.
  • Do you care if the filesystem takes a long time to fsck? Ext3 still takes quite long, whereas something like XFS doesn’t. Ext4 should remedy this.
  • Do you care for something that’s still actively being maintained? In this case only ext3-4 and XFS are the options.
  • Do you want defrag tools? Choose wisely since few filesystems have them (XFS and ext4 do).

My current overall recommendation is XFS since it’s mature and also very tunable. For reference, here’s how I got the better results for XFS (the results in the graphs for tuned XFS were with the deadline scheduler):

mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=64m /dev/sda7

mount -o nobarrier,noatime,nodiratime,logbufs=8 /dev/sda7 /test

Don’t just follow the above blindly. Normally mkfs tries to auto-adjust some of those (i.e. the agcount), but the important ones to look at are the log size and the mount options, especially nobarrier and logbufs. Remember though that nobarrier is only recommended if you have battery backup.

D

So what exactly is IBM trying to do with the XIV?

By now most people dealing with storage know that IBM acquired the XIV technology. What IBM is doing now is trying to push the technology to everyone and their dog, for reasons we’ll get into…

I just hope IBM gets their storage act together since now they’re selling products made by 4-5 different vendors, with zero interoperability between them (maybe SVC is the “one ring to rule them all”?)

In a nutshell, the way XIV works is by using normal servers running Linux and the XIV “sauce” and coupling them together via an Ethernet backbone. A few of the nodes get FC cards and can become FC targets. A few more of the features:

  • Thin provisioning
  • Snaps
  • Synchronous (only) replication
  • Easy to use (there’s not much you can do with it)
  • Uses RAID-X (no global spares; instead, spare space is reserved on each drive, which makes faster rebuilds possible)
  • Only mirrored
  • A good amount of total cache per system since each server has several GB of RAM BUT the cache is NOT global (each node simply caches the data for its local disks).

IBM claims insane performance numbers with the XIV (“it will destroy DMX/USP!” — sure). But let’s take a look at how everything looks:

  • 180 drives maximum (or minimum). You can get a half config, but I think you always get the 180 drives and license half of them. I might be mistaken – I believe you have to make a commitment that you’ll buy the whole thing within 1 year
  • Normal Linux servers do everything
  • Only SATA
  • The backbone is Ethernet, not FC or Infiniband (much, much higher latency is incurred by Ethernet vs the other technologies)

The way IBM claims it can sustain high speed is by keeping the SATA drives from getting bound by their low transactional performance (low vs 15K FC drives and, even worse, vs SSDs). From what I understand (and IBM employees feel free to chime in) XIV:

  1. Ingests data using a few of the front-end nodes
  2. Tries to break up the datastream into 1MB chunks
  3. The algorithm tries to pseudo-randomly spread the 1MB chunks and mirror them among the nodes (the simple rule being that a 1MB chunk cannot have a mirror on the same server/shelf of drives!)
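To make the placement rule concrete, here is a toy sketch of pseudo-random chunk placement where a mirror can never land on the same node as its primary. It is purely an illustration of the rule as described above, not XIV’s actual algorithm; the function name, the output format and the node count in the usage line are all made-up assumptions.

```shell
# Toy model: spread chunks pseudo-randomly across nodes, mirroring each one
# on a different node. Hypothetical illustration only, NOT IBM's algorithm.
# $1: number of 1MB chunks, $2: number of nodes (must be at least 2)
place_chunks() {
    awk -v chunks="$1" -v nodes="$2" 'BEGIN {
        srand()
        for (i = 0; i < chunks; i++) {
            primary = int(rand() * nodes)
            # An offset in 1..nodes-1 can never wrap back onto the primary.
            offset = 1 + int(rand() * (nodes - 1))
            mirror = (primary + offset) % nodes
            printf "chunk=%d primary=%d mirror=%d\n", i, primary, mirror
        }
    }'
}

place_chunks 8 6   # 8 chunks over an arbitrary 6 nodes
```

With enough chunks, every node ends up holding mirrors for chunks whose primaries live on every other node, which is the wide-striping effect the design is after.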

Obviously, by doing large block writes to the SATA drives as much as possible and using the cache to great effect, one should be able to see the 180 SATA drives perform pretty much as fast as possible (ideally, the drives should be seeing streaming instead of random data). However (there’s always that little word…)

  1. There is no magic!
  2. If the incoming random IOPS are coming at too great a rate (OLTP scenarios), any cache can get saturated (the writes HAVE to be flushed to disk, I don’t care what array you have!) and it all boils down to the actual number of disks in the box. The box is said to do 20,000 IOPS if that happens – which I think is optimistic at 111 IOPS/drive! At any rate, 20,000 IOPS is less than what even small boxes from EMC or other vendors can do when they run out of cache. Where’s the performance advantage of XIV?
  3. The “randomization removing algorithm”, if indeed there’s such a thing in the box, will have issues with more than 1-2 servers sending it stuff
  4. See #1!

Like with anything, you can only extract so much efficiency out of a given system before it blows up.

An EMC CX4-960 could be configured with 960 drives. Even assuming that not all are used due to spares etc., you are left with a system with over 5 times the number of physical disks vs an XIV, tons more capacity etc. Even if the “magic” of XIV makes it more efficient, are those XIV SATA drives really 5 times more efficient? (5 times would only make it EQUAL to the 960’s performance; XIV would have to be well over 5 times more efficient than an EMC box of equivalent size to beat the 960.)
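The arithmetic behind the 111 IOPS/drive figure and the “over 5 times” comparison is easy to check. A quick sketch follows; the helper names are made up for illustration, and the inputs are just the numbers quoted above.

```shell
# Per-drive IOPS implied by a system-wide figure (integer division).
iops_per_drive() {
    echo $(( $1 / $2 ))
}

# Ratio of drive counts between two arrays, to two decimal places.
drive_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

iops_per_drive 20000 180   # prints 111
drive_ratio 960 180        # prints 5.33
```

In other words, each XIV drive would have to do the work of more than five drives in a fully populated 960 just to draw level.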

Let’s put it this way:

If my system was as efficient as IBM claims, and I had IBM’s money, I’d buy all the competitive arrays, even at several times the size of my box, and publicize all kinds of benchmarks showing just how cool my box is vs the competition. You just can’t find that info anywhere, though.

Regarding innovation: Other vendors have had similar chunklet wide striping for years now (HP EVA, 3Par, Compellent if I’m not mistaken, maybe more). 3Par for sure does hot sparing similar to an XIV (they reserve space on each drive). 3Par can also grow way bigger than XIV (over 1,000 drives).

So, if I want a box with thin provisioning, wide striping, sparing like XIV but the ability to choose among different drive types, why not just get a 3Par? What is the compelling value of XIV, short of being able to push 180 SATA drives well? Nobody has been able to answer this.

I’m just trying to understand XIV’s value prop since:

  1. It’s not faster unless you compare it to poorly architected configs
  2. It has less than 50% efficiency at best, so it’s not good for bulk storage
  3. It’s not cheap from what I’ve seen
  4. Burns a ton of power
  5. Cannot scale AT ALL
  6. Cannot tier within the box (NO drive choices besides 1TB SATA)
  7. Cannot replicate asynchronously
  8. Has no application integration
  9. No Quality of Service performance guarantees
  10. No ability to granularly configure it
  11. Is highly immature technology with a small handful of reference customers and a tiny number of installs! (I guess everyone has to start somewhere but do YOU want to be the guinea pig?)

Unless your needs are exactly what XIV provides, why would you ever buy one? Even if your space/performance needs are in the XIV neighborhood there are other far more capable storage systems out there for less money!

IBM is not stupid, or at least I hope not. So, what IBM is doing is pretty much handing out XIVs to whoever will take one. If you get one, think of yourself as a beta tester, because I can hardly believe that IBM bought the XIV IP without seeing some kind of roadmap – otherwise the purchase would be kinda stupid! If you are a beta-tester, be aware that:

  • XIV cheats with benchmarks that write zeros to the disk or read from not previously-accessed addresses
  • XIV will be super-fast with 1-2 hosts pushing it, so push it realistically with a real number of hosts
  • Try to load up the box, since if it’s not full enough you’ll get an extremely skewed view of performance – put dummy data inside if you have to, but fill it to 80% and then run benchmarks!
  • Test with your applications, not artificial benchmarks
  • Do not accept the box in your datacenter before you see a quote! In at least 3 cases that I know of IBM drops off the box without giving you even a ballpark figure. I think that’s insane.

And last, but not least: I keep hearing and reading about the following being true, I’d love IBM engineers to disprove it:

If you remove 2-3 drives from different trays simultaneously from a loaded system then you will suffer a catastrophic failure (logically makes sense looking at how the chunks get allocated but I’d love to know how it works in real life). And before someone tells me that this never happens in real life, it’s personally happened to me at least once (lost 2 drives in rapid succession) and to many other people I know that have any serious real-world experience…

D

Cinebench benchmarks – performance comparison between Vista 64 and Mac OS X

Been a while since I posted anything – there’s a TON of material but some of us actually do more than blog, it’s quarter/year end, and I barely have time to go to the bathroom…

But this was an easy one so I thought I’d post it real quick. Using Scribefire, a blogging plug-in for Firefox. I hate it.

Disclaimer: The machines used are not identical.

However, the CPUs supposedly are pretty close in speed (2.6 vs 2.8 GHz). Memory is the same.

Graphics are also similar but the Lenovo box has 128MB VRAM whereas the Mac has 512MB and a faster GPU.

The contenders: Macbook Pro 2.8GHz vs Lenovo T62p (14″ model) running Vista 64, 2.6GHz.

The Mac is running a 32-bit OS (64-bitness is coming with Snow Leopard next year). It also has switchable graphics and one can choose between the on-chipset Nvidia 9400 or the discrete 9600. Typically on-board graphics are pretty crappy.

Despite the dissimilarity of the machines here are some notables:

  • Cinebench really takes off in 64-bit mode in Vista
  • OS X seems to do quite well even though it’s not 64-bit yet
  • The integrated graphics on the new Mac are awesome
  • The discrete graphics are great for a laptop
  • OS X seems to be more efficient than Vista when doing multi-CPU work, at least in this case
  • If someone is looking for a decent modern laptop they can do far worse than the new Macs, even a plain Macbook would be pretty decent

Here’s a chart of the results:

OS/Config                           1-CPU   2-CPU   GFX    Multiprocessor speedup
Macbook Pro 2.8GHz integrated GFX   3208    6051    4813   1.87
Macbook Pro 2.8GHz discrete GFX     3213    5926    6130   1.84
Lenovo 2.6GHz 32-bit                2693    4755    4264   1.77
Lenovo 2.6GHz 64-bit                3040    5367    4256   1.77
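The multiprocessor speedup is simply the 2-CPU score divided by the 1-CPU score, so it can be recomputed from the chart’s own columns. A small sketch (the helper name is made up; Cinebench reports its own speedup figure, so small rounding differences from the last column are possible):

```shell
# Multiprocessor speedup = multi-CPU score / single-CPU score, two decimals.
# speedup is a hypothetical helper name; the scores come from the chart above.
speedup() {
    awk -v one="$1" -v two="$2" 'BEGIN { printf "%.2f\n", two / one }'
}

speedup 3213 5926   # Macbook Pro, discrete GFX: prints 1.84
speedup 2693 4755   # Lenovo, 32-bit: prints 1.77
```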