NetApp delivers 1TB/s performance to giant supercomputer for big data

What do you do when you need so much I/O performance that no one single storage system can deliver it, no matter how large?

To be specific: What if you needed to transfer data at 1TB per second?

That was the problem faced by the U.S. Department of Energy (DoE) and their Sequoia supercomputer at the Lawrence Livermore National Laboratory (LLNL), one of the fastest supercomputing systems on the planet.

You can read the official press release here. I wanted to get more into the technical details.

People have been talking a lot about “big data” recently. No clear definition seems to exist; in my opinion it’s something that has some of the following properties:

  • Too much data to be processed by a “normal” computer or cluster
  • Too much data to work with using a relational DB
  • Too much data to fit in a single storage system for performance and/or capacity reasons – or maybe just simply:
  • Too much data to process using traditional methods within an acceptable time frame

Clearly, this is a bit loose – how much is “too much”? How long is “too long”? For someone only armed with a subnotebook computer, “too much” does not have the same meaning as for someone rocking a 12-core server with 256GB RAM and a few TB of SSD.

So this definition is relative… but in some cases, such as the one we are discussing, absolute – given the limitations of today’s technology.

For instance, the amount of storage LLNL required was several tens of PB in a single storage pool that could provide unprecedented I/O performance to the tune of 1TB/s. Both size and performance needed to be scalable. It also needed to be reliable and fit within a reasonable budget and not require extreme space, power and cooling. A tall order indeed.

This created some serious logistics problems regarding storage:

  • No single disk array can hold that amount of data
  • No single disk array can perform anywhere close to 1TB/s

Let’s put this in perspective: The storage systems that scale the biggest are typically scale-out clusters from the usual suspects of the storage world (we make one, for example). Even so, they max out at fewer PB than the deployment required.

The even bigger problem is that a single large scale-out system can’t really deliver more than a few tens of GB/s under optimal conditions – more than fast enough for most “normal” uses but utterly unacceptable for this case.

The only realistic solution to satisfy the requirements was massive parallelization, specifically using the NetApp E-Series for the back-end storage and the Lustre cluster filesystem.

A bit about the solution…

Almost a year ago NetApp purchased the Engenio storage line from LSI. That storage line is resold by several companies like IBM, Oracle, Quantum, Dell, SGI, Teradata and more. IBM also resells the ONTAP-based FAS systems and calls them “N-Series”.

That purchase has made NetApp the largest provider of OEM arrays on the planet by far. It was a good deal – very rapid ROI.

There was a lot of speculation as to why NetApp would bother with the purchase. After all, the ONTAP-based systems have a ton more functionality than pretty much any other array and are optimized for typical mostly-random workloads – DBs, VMs, email, plus megacaching, snaps, cloning, dedupe, compression, etc – all with RAID6-equivalent protection as standard.

The E-Series boxes on the other hand don’t do thin provisioning, dedupe, compression, megacaching… and their snaps are the less efficient copy-on-first-write instead of redirect-on-write. So, almost the anti-ONTAP :)

The first reason for the acquisition was that, on purely financial terms, it was a no-brainer deal even if one sells shoes for a living, let alone storage. Even if there were no other reasons, this one would be enough.

Another reason (and the one germane to this article) was that the E-Series has a tremendous sustained sequential performance density. For instance, the E5400 system can sustain about 4GB/s in 4U (real GB/s, not out of cache), all-in. That’s 4U total for 60 disks including the controllers. Expandable, of course. It’s no slouch for random I/O either, plus you can load it with SSDs, too… :)

Again, note – 60 drives per 4U shelf and that includes the RAID controllers, batteries etc. In addition, all drives are front-loading and stay active while servicing the shelf – as opposed to most (if not all) dense shelves in the market that need the entire (very heavy) shelf pulled out and/or several drives offlined in order to replace a single failed drive… (there’s some really cool engineering in the shelf to do this without thermal problems, performance loss or vibrations). All this allows standard racks and no fear of the racks tipping over while servicing the shelves :) (you know who you are!)

There are some vendors that specialize purely in sequential I/O and tipping racks, yet they have about 3-4x lower performance density than the E5400, even though they sometimes have higher per-controller throughput. In a typical marketing exercise, some of our more usual competitors have boasted 2GB/s/RU for their controllers, meaning that the controllers (which take up 4U in that example) can do 8GB/s. But that requires all kinds of extra rack space to achieve (extra UPSes, several shelves, etc.), making their actual resulting throughput well under 1GB/s/RU. Not to mention the cost (those systems are typically more expensive than an E5400), which is important with projects of the scale we are talking about.

Most importantly, what we accomplished at the LLNL was no marketing exercise…

The benefits of truly high performance density

Clearly, if your requirements are big enough, you end up spending a lot less money and needing a lot less rack space, power and cooling by going with a highly performance-dense solution.

However, given the requirements of the LLNL, it’s clear that you can’t use just a single E5400 to satisfy the performance and capacity requirements of this use case. What you can do though is use a bunch of them in parallel… and use that massive performance density to achieve about 40GB/s per industry-standard rack with 600 high-capacity disks (1.8PB raw per rack).

For even higher performance per rack, the E5400 can use the faster SAS or SSD drives – 480 drives per rack (up to 432TB raw), providing 80GB/s reads/60GB/s writes.
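As a back-of-the-envelope sketch of what those per-rack figures imply (plain arithmetic on the numbers quoted above, assuming decimal units and perfectly linear scaling; the actual rack count of the LLNL deployment is not stated here):

```python
import math

# Figures quoted in the post (decimal units assumed: 1TB/s = 1000GB/s)
TARGET_GBPS = 1000      # the 1TB/s goal
E5400_GBPS = 4          # sustained GB/s per 4U E5400
RACK_GBPS = 40          # quoted throughput per rack of high-capacity disks
RACK_DISKS = 600        # quoted disks per rack
RACK_RAW_PB = 1.8       # quoted raw capacity per rack

# How many 4U systems the quoted per-rack throughput corresponds to
systems_per_rack = RACK_GBPS // E5400_GBPS

# Racks needed to reach the target if throughput scaled perfectly linearly
racks_needed = math.ceil(TARGET_GBPS / RACK_GBPS)

# Implied raw capacity per disk and for that many racks
tb_per_disk = RACK_RAW_PB * 1000 / RACK_DISKS
total_raw_pb = racks_needed * RACK_RAW_PB

print(f"{systems_per_rack} systems per rack, about {tb_per_disk:.0f}TB per disk")
print(f"{racks_needed} racks for {TARGET_GBPS}GB/s, about {total_raw_pb:.0f}PB raw")
```

Real deployments lose some throughput to the filesystem, the network and RAID overhead, so treat the result as a lower bound on the hardware needed rather than a sizing.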

Enter the cluster filesystem

So, now that we picked the performance-dense, reliable, cost-effective building block, how do we tie those building blocks together?

The answer: By using a cluster filesystem.

Loosely defined, a cluster filesystem is simply a filesystem that can be accessed simultaneously by all the servers mounting it. In addition, it also typically means it can span storage systems and make them look like one big entity.

It’s not a new concept – and there are several examples, old and new: AFS, Coda, GPFS, and the more prevalent StorNext and Lustre are some.

The LLNL picked Lustre for this project. Lustre is a distributed filesystem that spreads I/O across multiple Object Storage Servers, each connected to storage (Object Storage Targets). Metadata is served by dedicated servers that are not part of the I/O stream and thus not a bottleneck. See below for a picture (courtesy of the Lustre manual) of how it is all connected:

[Figure: Lustre Scaled Cluster]

High-speed connections are used liberally for lower latency and higher throughput.

A large file can reside on many storage servers, and as a result I/O can be spread out and parallelized.
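To make the idea concrete, here is a minimal sketch of round-robin (RAID-0 style) striping of one file across several storage targets. This is a simplified illustration, not Lustre code: the function names, the 1MiB stripe size and the stripe count of 4 are all assumptions chosen for the example.

```python
MIB = 1024 * 1024

def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (target index, byte offset inside that target's object)."""
    stripe_no = offset // stripe_size        # which stripe of the file this byte is in
    target = stripe_no % stripe_count        # stripes rotate across the targets
    obj_offset = (stripe_no // stripe_count) * stripe_size + offset % stripe_size
    return target, obj_offset

def targets_touched(offset, length, stripe_size, stripe_count):
    """Return the sorted list of targets that a single read or write spans."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return sorted({s % stripe_count for s in range(first, last + 1)})

# Hypothetical layout: 1MiB stripes over 4 targets
print(locate(5 * MIB + 10, MIB, 4))
print(targets_touched(0, 8 * MIB, MIB, 4))
```

With this layout, any sufficiently large sequential transfer touches every target, which is why adding servers and targets adds throughput.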

Lustre clients see a single large namespace and use Lustre’s own client protocol (rather than NFS or CIFS) to access the cluster.

It sounds good in theory – and it delivered in practice: 1TB/s sustained performance. Not sure what the upper limit would be. But clearly it’s a highly scalable solution.

Putting it all together

NetApp has fully realized solutions for the “big data” applications out there – complete with the product and services needed to complete each engagement. The Lustre solution employed by the LLNL is just one of the options available. There is Hadoop, Full Motion uncompressed HD video, and more.

So – how fast do you need to go?

D


NetApp posts world-record SPEC SFS2008 NFS benchmark result

Just as NetApp dominated the older version of the SPEC SFS97_R1 NFS benchmark back in May of 2006 (and was unsurpassed in that benchmark with 1 million SFS operations per second), the time has come to once again dominate the current version, SPEC SFS2008 NFS.

Recently we have been focusing on benchmarking realistic configurations that people might actually put in their datacenters, instead of lab queens with unusable configs focused on achieving the highest result regardless of cost.

However, it seems the press doesn’t care about realistic configs (or even about understanding the configs) but instead likes headline-grabbing big numbers.

So we decided to go for the best of both worlds – a headline-grabbing “big number” but also a config that would make more financial sense than the utterly crazy setups being submitted by competitors.

Without further ado, NetApp achieved over 1.5 million SPEC SFS2008 NFS operations per second with a 24-node cluster based on FAS6240 boxes running ONTAP 8 in Cluster Mode. Click here for the specific result. There are other results in the page showing different size clusters so you can get some idea of the scaling possible.

See the table below for a high-level analysis (including the list pricing I could find for these specific performance-optimized configs, for whatever that’s worth). The comparison is between NetApp and the nearest scale-out competitor result (one of EMC’s many recent acquisitions – Isilon, the niche, dedicated NAS box – nothing else is close enough to bother including in the comparison).

BTW – the EMC price list is publicly available from here (and other places I’m sure): http://www.emc.com/collateral/emcwsca/master-price-list.pdf

From page 422:

S200-6.9TB & 200GB SSD, 48GB RAM, 2x10GE SFP+ & 2x1G: $84,061. Times 140…

Before we dive into the comparison, an important note since it seems the competition doesn’t understand how to read SPEC SFS results:

Out of 1728 450GB disks (the number includes spares and OS drives, otherwise it was 1632 disks), the usable capacity was 574TB (73% of all raw space – even more if one considers a 450GB disk never actually provides 450 real GB in base2). The exported capacity was 288TB. This doesn’t mean we tried to short-stroke or that there is a performance benefit exporting a smaller filesystem – the way NetApp writes to disk, the size of the volume you export has nothing to do with performance. Since SPEC SFS doesn’t use all the available disk space, the person doing the setup thought like a real storage admin and didn’t give it all the available space. 

Lest we be accused of tuning this config or manually making sure client accesses were load-balanced and going to the optimal nodes, please understand this:  23 out of 24 client accesses were not going to the nodes owning the data and were instead happening over the cluster interconnect (which, for any scale-out architecture, is worst-case-scenario performance). Look under the “Uniform Access Rules Compliance” in the full disclosure details of the result in the SPEC website here. This means that, compared to the 2-node ONTAP 7-mode results, there is a degradation due to the cluster operating (intentionally) through non-optimal paths.

Metric | EMC (Isilon) | NetApp | Difference
Cost (approx. USD list) | 11,800,000 | 6,280,000 | NetApp is almost half the cost while offering much higher performance
SPEC SFS2008 NFS operations per second | 1,112,705 | 1,512,784 | NetApp is over 35% faster, while using potentially better RAID protection
Average latency (ORT, ms) | 2.54 | 1.53 | NetApp offers almost 40% better average latency without using costly SSDs, and is usable for challenging random workloads like DBs, VMs etc.
Space (TB) | 864 (of which 128,889GB was used in the test) | 574 (of which 176,176GB was used in the test) | Isilon offers about 50% more usable space (coming from a lot more drives, 28% more raw space and potentially less RAID protection – N+2 results from Isilon would be different)
$/SPEC SFS2008 NFS operation | 10.6 | 4.15 | NetApp is less than half the cost per SPEC SFS2008 NFS operation
$/TB | 13,657 | 10,940 | NetApp is about 20% less expensive than EMC per usable TB
RAID | Per-file protection. Files < 128K are at least mirrored. Files over 128K are at a 13+1 level protection in this specific test. | RAID-DP | Ask EMC what 13+1 protection means in an Isilon cluster (I believe 1 node can be completely gone, but what about simultaneous drive failures that contain the same protected file?). NetApp RAID-DP is mathematically analogous to RAID6 and has a parity drive penalty of 2 drives every 16-20 drives.
Boxes needed to accomplish result | 140 nodes, 3,360 drives (incl. 25TB of SSDs for cache), 1,120 CPU cores, 6.7TB RAM. | 24 unified controllers, 1,728 drives, 12.2TB Flash Cache, 192 CPU cores, 1.2TB RAM. | NetApp is far more powerful per node, and achieves higher performance with a lot fewer drives, CPUs, RAM and cache. In addition, NetApp can be used for all protocols (FC, iSCSI, NFS, CIFS) and all connectivity methods (FC 4/8Gb, Ethernet 1/10Gb, FCoE).
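For anyone who wants to re-derive the ratio rows from the raw rows, here is a small sketch. The input numbers are the ones in the comparison above; the dictionary layout and field names are just illustrative choices.

```python
# Raw figures from the comparison (list price in USD, SPEC SFS2008 ops/s,
# overall response time in ms, usable capacity in TB)
results = {
    "EMC Isilon": {"list_usd": 11_800_000, "ops": 1_112_705, "ort_ms": 2.54, "usable_tb": 864},
    "NetApp": {"list_usd": 6_280_000, "ops": 1_512_784, "ort_ms": 1.53, "usable_tb": 574},
}

def derived(r):
    """Compute the per-operation and per-TB cost for one result."""
    return {
        "usd_per_op": r["list_usd"] / r["ops"],
        "usd_per_tb": r["list_usd"] / r["usable_tb"],
    }

emc = derived(results["EMC Isilon"])
netapp = derived(results["NetApp"])

# Relative differences quoted in the "Difference" column
faster = results["NetApp"]["ops"] / results["EMC Isilon"]["ops"] - 1
lower_latency = 1 - results["NetApp"]["ort_ms"] / results["EMC Isilon"]["ort_ms"]
cheaper_per_tb = 1 - netapp["usd_per_tb"] / emc["usd_per_tb"]

for name, d in (("EMC Isilon", emc), ("NetApp", netapp)):
    print(f"{name}: ${d['usd_per_op']:.2f}/op, ${d['usd_per_tb']:,.0f}/TB")
print(f"NetApp: {faster:.1%} more ops/s, {lower_latency:.1%} lower ORT, "
      f"{cheaper_per_tb:.1%} cheaper per usable TB")
```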

Notice the response time charts:

[Figure: Isilon vs. FAS6240 response time charts]

NetApp exhibits traditional storage system behavior – latency is very low initially and gradually gets higher the more the box is pushed, as one would expect. Isilon on the other hand starts out slow and gets faster as more metadata gets cached, until the controllers run out of steam (SPEC SFS is very heavy in NAS metadata ops, and should not be compared to heavy-duty block benchmarks like SPC-1).

This is one of the reasons an Isilon cluster is not really applicable for low-latency DB-type apps, or low-latency VMs. It is a great architecture designed to provide high sequential speeds for large files over NAS protocols, and is not a general-purpose storage system. Kudos to the Isilon guys for even getting the great SPEC result in the first place, given that this isn’t what the box is designed to do (the extreme Isilon configuration needed to run the benchmark is testament to that). The better application for Isilon would be capacity-optimized configs (which is what the system is designed for to begin with).

Some important points:

  1. First and foremost, the cluster-mode ONTAP architecture now supports all protocols; it is the only unified scale-out architecture available. Any competitors playing in that space only have NAS or SAN offerings but not both in a single architecture.
  2. We didn’t even test with the even faster 6280 box and extra cache (that one can take 8TB cache per node). The result is not the fastest a NetApp cluster can go :) With 6280s it would be a healthy percentage faster, but we had a bunch of the 6240s in the lab so it was easier to test them, plus they’re a more common and less expensive box, making for a more realistic result.
  3. ONTAP in cluster-mode is a general-purpose storage OS, and can be used to run Exchange, SQL, Oracle, DB2, VMs, etc. etc. Most other scale-out architectures are simply not suitable for low-latency workloads like DBs and VMs and are instead geared towards high NAS throughput for large files (IBRIX, SONAS, Isilon to name a few – all great at what they do best).
  4. ONTAP in cluster mode is, indeed, a single scale-out cluster and administered as such. It should not be compared to block boxes with NAS gateways in front of them like VNX, HDS + Bluearc, etc.
  5. In ONTAP cluster mode, workloads and virtual interfaces can move around the cluster non-disruptively, regardless of protocol (FC, iSCSI, NFS and yes, even CIFS can move around non-disruptively assuming you have clients that can talk SMB 2.1 and above).
  6. In ONTAP cluster mode, any data can be accessed from any node in the cluster – again, impossible with non-unified gateway solutions like VNX that have individual NAS servers in front of block storage, with zero awareness between the NAS heads aside from failover.
  7. ONTAP cluster mode allows certain cool things like upgrading storage controllers from one model to another completely non-disruptively; most other storage systems need some kind of outage to do this. All we do is add the new boxes to the existing cluster :)
  8. ONTAP cluster mode supports all the traditional NetApp storage efficiency and protection features: RAID-DP, replication, deduplication, compression, snaps, clones, thin provisioning. Again, the goal is to provide a scale-out general-purpose storage system, not a niche box for only a specific market segment. It even supports virtualizing your existing storage.
  9. There was a single namespace for the NFS data. Granted, not the same architecture as a single filesystem from some competitors.
  10. Last but not least – no “special” NetApp boxes are needed to run Cluster Mode. In contrast to other vendors selling a completely separate scale-out architecture (different hardware and software and management), normal NetApp systems can enter a scale-out cluster as long as they have enough connectivity for the cluster network and can run ONTAP 8. This ensures investment protection for the customer plus it’s easier for NetApp since we don’t have umpteen hardware and software architectures to develop for and support :)
  11. Since people have been asking: The SFS benchmark generates roughly 120MB of data per operation per second of load. The slower you go, the less space you will use on the disks, regardless of how many disks you have. This creates some imbalance in large configs (for example, only about 128TB of the 864TB available was used on Isilon).
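The "about 120MB" figure in the last point can be sanity-checked against the space-used numbers in the comparison table. A quick sketch, assuming decimal GB/MB and using the achieved (not requested) ops/s, which is why it lands slightly under 120:

```python
# (achieved ops/s, GB used by the test, usable TB) from the comparison table
runs = {
    "EMC Isilon": (1_112_705, 128_889, 864),
    "NetApp": (1_512_784, 176_176, 574),
}

def mb_per_op(ops, used_gb):
    """Data set size per achieved operation per second, in decimal MB."""
    return used_gb * 1000 / ops

def fraction_used(used_gb, usable_tb):
    """Share of the usable capacity that the benchmark actually touched."""
    return used_gb / (usable_tb * 1000)

for name, (ops, used_gb, usable_tb) in runs.items():
    print(f"{name}: {mb_per_op(ops, used_gb):.0f}MB per op/s, "
          f"{fraction_used(used_gb, usable_tb):.0%} of usable space used")
```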

Just remember – in order to do what ONTAP in Cluster Mode does, how many different architectures would other vendors be proposing?

  • Scale-out SAN
  • Scale-out NAS
  • Replication appliances
  • Dedupe appliances
  • All kinds of management software

How many people would it take to keep it all running? And patched? And how many firmware inter-dependencies would there be?

And what if you didn’t need, say, scale-out SAN to begin with, but some time after buying traditional SAN realized you needed scale-out? Would your current storage vendor tell you you needed, in addition to your existing SAN platform, that other one that can do scale-out? That’s completely different than the one you bought? And that you can’t re-use any of your existing stuff as part of the scale-out box, regardless of how high-end your existing SAN is?

How would that make you feel?

Always plan for the future…

Comments welcome.

D

PS: Made some small edits in the RAID parts and also added the official EMC pricelist link.


Interpreting $/IOPS and IOPS/RAID correctly with SPC-1 results

Been a while since an update to this blog, my provider got hacked and people were getting a redirect to some questionable site (I still can’t show images, if someone knows a fix please let me know).

Should be better now, I just wish the people doing the hacking would devote their considerable skills to helping out humanity…

Anyway, there are some impressive new scores at storageperformance.org, with the usual crazy configurations of thousands of drives etc.

Regarding price/performance:

When looking at $/IOP, make sure you are comparing list price (look at the full disclosure report, that has all the details for each config).

Otherwise, you could get the wrong $/IOP since some vendors have list prices, others show heavy discounting.

For example, a box that does $6.5/IOP after 50% discounting, would be $13/IOP using list prices.
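The normalization is simple enough to sketch in a couple of lines (the function name is just an illustrative choice):

```python
def list_price_per_iop(discounted_per_iop, discount):
    """Convert a discounted $/IOP figure back to list-price $/IOP.

    discount is a fraction, e.g. 0.5 for 50% off list.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return discounted_per_iop / (1 - discount)

# The example above: $6.5/IOP after a 50% discount
print(list_price_per_iop(6.5, 0.5))
```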

Regarding RAID:

As I have mentioned in other posts, RAID plays a big role in both protection and performance.

All SPC-1 results are using RAID10, with the notable exception of NetApp (we use RAID-DP, mathematically analogous to RAID6 in protection).

Here’s a (very) rough way to convert a RAID10 result to RAID6, if the vendor you’re looking for doesn’t have a RAID6 result:

  1. SPC-1 is about 60% writes.
  2. Take any RAID10 result, let’s say 200,000 IOPS.
  3. 60% of that is 120,000, that’s the write ops. 40% is the reads, or 80,000 read ops.
  4. If using RAID6, you’d be looking at roughly a 4x slowdown for the writes: 120,000/4 = 30,000
  5. Add that to the 40% of the reads and you get the final result:
  6. 80,000 reads + 30,000 writes = 110,000 RAID6-corrected SPC-1 IOPS. Which is almost half the RAID10 result… :)
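The same rough conversion as a function, so it can be applied to any published result. The 60% write mix and the 4x write slowdown are the rules of thumb from the steps above, exposed as parameters since both are approximations:

```python
def raid10_to_raid6_iops(raid10_iops, write_fraction=0.60, write_slowdown=4.0):
    """Very rough RAID10 -> RAID6 correction of an SPC-1 IOPS result.

    Reads are left untouched; writes are divided by the assumed slowdown.
    """
    writes = raid10_iops * write_fraction
    reads = raid10_iops - writes
    return reads + writes / write_slowdown

corrected = raid10_to_raid6_iops(200_000)
print(f"{corrected:,.0f} RAID6-corrected IOPS "
      f"({corrected / 200_000:.0%} of the RAID10 result)")
```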

Just make sure you’re comparing apples to apples, that’s all. I know we all suffer from ADD in this age of information overload, but do spend some time going through the full disclosure, since there’s always interesting stuff in there…

D

Buyer beware: is your storage vendor sizing properly for performance, or are they under-sizing technologies like Megacaching and Autotiering?

With the advent of performance-altering technologies (notice the word choice), storage sizing is just not what it used to be.

I’m writing this post because more and more I see some vendors not using scientific methods to size their solution, instead aiming to reach a price point, hoping the technology will work to achieve the requisite performance (and if it doesn’t, it’s sold anyway, either they can give some free gear to make the problem go away, or the customer can always buy more, right?)

Back in the “good old days”, with legacy arrays one could (and still can) get fairly deterministic performance by knowing the workload required and, given a RAID type, know roughly how many disks would be needed to maintain the required performance in a sustained fashion, as long as the controller and buses were not overloaded.

With modern systems, there is now a plethora of options that can be used to get more performance out of the array, or, alternatively, get the same average performance as before, using less hardware (hopefully for less money).

If anything, advanced technologies have made array sizing more complex than before.

For instance, Megacaches can be used to dramatically change the I/O reaching the back-end disks of the array. NetApp FAS systems can have up to 16TB of deduplication-aware, ultra-granular (4K) and intelligent read cache. Truly a gigantic size, bigger than the vast majority of storage users will ever need (and bigger than many customers’ entire storage systems). One could argue that with such an enormous amount of cache, one could dispense with most disk drives and instead save money by using SATA (indeed, several customers are doing exactly that). Other vendors are following NetApp’s lead and starting to implement similar technologies — simply because it makes a lot of sense.

However…

It is crucial that, when relying on caching, extra care is taken to size the solution properly, if a reduction in the number and speed of the back-end disks is desired.

You see, caches only work well if they can cache the majority of what’s called the active working set.

Simply put, the working set is not all your data, but the subset of the data you’re “touching” constantly over a period of time. For a customer that has, say, a 20TB Database, the true working set may only be something as small as 5% — enabling most of the active data to fit in 1TB of cache. So, during daily use, a 1TB cache could satisfy most of the I/O requirements of the DB. The back-end disks could comfortably be just enough SATA to fit the DB.

But what about the times when I/O is not what’s normally expected? Say, during a re-indexing, or a big DB export, or maybe month-end batch processing. Such operations could vastly change the working set and temporarily raise it from 5% to something far larger — at which point, a 1TB cache and a handful of back-end SATA may not be enough.
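A minimal sketch of that arithmetic, under the simplifying assumption that the cache helps only in proportion to how much of the working set fits in it. The 25% figure for the batch window is a made-up illustration, not a measurement:

```python
def working_set_coverage(data_tb, working_set_fraction, cache_tb):
    """Fraction of the active working set that fits in the cache (capped at 1.0)."""
    working_set_tb = data_tb * working_set_fraction
    if working_set_tb <= 0:
        return 1.0
    return min(1.0, cache_tb / working_set_tb)

# The example above: 20TB database, 5% working set, 1TB of cache
print(f"normal day: {working_set_coverage(20, 0.05, 1):.0%} of working set cached")

# Hypothetical month-end batch that touches 25% of the data
print(f"batch run:  {working_set_coverage(20, 0.25, 1):.0%} of working set cached")
```

Whatever does not fit has to be served by the back-end disks, which is exactly where a handful of SATA drives runs out of steam.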

Which is why, when sizing, multiple measurements need to be taken, and not just average or even worst-case.

Let’s use a database as an example again (simply because the I/O can change so dramatically with DBs). You could easily have the following I/O types:

  1. Normal use – 20,000 IOPS, all random, 8K I/O size, 80% reads
  2. DB exports – high MB/s, mostly sequential write, large I/O size, relatively few IOPS
  3. Sequential read after random write – maybe data is added to the DB randomly, then a big sequential read (or maybe many parallel ones) is launched.

You see, the I/O profile can change dramatically. If you only size for case #1, you may not have enough back-end disk to sustain the DB exports or the parallel sequential table scans. If you size for case #2, you may think you don’t need much cache since the I/O is mostly sequential (and most caches are bypassed for sequential I/O). But that would be totally wrong during normal operation.
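One way to keep the profiles honest is to write each one down as numbers rather than adjectives. A small sketch for case #1, using the figures given above (the function and field names are illustrative, and 8K is taken to mean 8KiB):

```python
def describe_profile(iops, io_size_kib, read_percent):
    """Break a random-I/O profile into read/write IOPS and throughput."""
    read_iops = iops * read_percent // 100
    write_iops = iops - read_iops
    return {
        "read_iops": read_iops,
        "write_iops": write_iops,
        "mib_per_s": iops * io_size_kib / 1024,
    }

# Case #1: 20,000 IOPS, all random, 8K I/O size, 80% reads
normal_use = describe_profile(iops=20_000, io_size_kib=8, read_percent=80)
print(normal_use)
```

Notice how small the throughput number is for case #1: a system sized only on that MB/s figure would look trivially easy, while one sized only on export throughput would ignore the random IOPS entirely.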

If your storage vendor has told you they sized for what generates the most I/O, then the question is, what kind of I/O was it?

The other new trendy technology (and the most likely to be under-sized) is Autotiering.

Autotiering, simply put, allows moving chunks of data around the array depending on their “heat index”. Chunks that are very active may end up on SSD, whereas chunks that are dormant could safely stay on SATA.

Different arrays do different kinds of Autotiering, mostly based on various underlying architectural characteristics and limitations. For example, on an EMC Symmetrix the chunk size is about 7.5MB. On an HDS VSP, the chunk is about 40MB. On an IBM DS8000, SVC or EMC Clariion/VNX, it’s 1GB.

With Autotiering, just like with caching, the smaller the chunk size, the more efficient the end result will ultimately be. For instance, a 7.5MB chunk could need as little as 3-5% of ultra-fast disk as a tier, whereas a 1GB chunk may need as much as 10-15%, due to the larger size chunk containing not very active data mixed together with the active data.

Since most arrays write data with a geometric locality of reference (in contrast, NetApp uses geometric and temporal), with large-chunk autotiering you end up with pieces of data that are “hot” that always occupy the same chunk as neighboring “cool” pieces of data. This explains why the smaller the chunk, the better off you are.

So, with a large chunk, this can happen:

[Figure not available: hot and cool data mixed within a single large chunk]

The array will try to cache as much as it can, then migrate chunks if they are consistently busy or not. But the whole chunk has to move, not just the active bits within the chunk… which may be just fine, as long as you have enough of everything.
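The effect of chunk size can be illustrated with a toy model. Everything here is synthetic: a hypothetical 1TiB volume tracked in 1MiB units, 200 randomly placed 100MiB hot extents, and chunk sizes of 8MiB and 1024MiB standing in for "small" and "large". It is not a model of any particular array, only of the whole-chunk-must-move rule.

```python
import random

def fast_tier_fraction(hot_units, total_units, chunk_units):
    """Fraction of capacity that must sit on the fast tier if whole chunks move."""
    hot_chunks = {unit // chunk_units for unit in hot_units}
    return len(hot_chunks) * chunk_units / total_units

TOTAL_MIB = 1024 * 1024          # hypothetical 1TiB volume, in 1MiB units
rng = random.Random(0)           # fixed seed so the run is repeatable

# Scatter 200 hot extents of 100MiB each across the volume
hot = set()
for _ in range(200):
    start = rng.randrange(0, TOTAL_MIB - 100)
    hot.update(range(start, start + 100))

hot_fraction = len(hot) / TOTAL_MIB
needed = {chunk: fast_tier_fraction(hot, TOTAL_MIB, chunk) for chunk in (8, 1024)}

print(f"truly hot data: {hot_fraction:.1%} of the volume")
for chunk, fraction in needed.items():
    print(f"{chunk:>5}MiB chunks: {fraction:.1%} of the volume must be promoted")
```

The small chunks track the hot data closely, while the large chunks drag a lot of cool neighbors onto the expensive tier. The exact percentages depend entirely on how the hot data is scattered, which is why you want the vendor to show the layout for your data.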

So what can you do to ensure correct sizing?

There are a few things you can do to make sure you get accurate sizing with modern technologies.

  1. Provide performance statistics to vendors — the more detailed the better. If we don’t know what’s going on, it’s hard to provide an engineered solution.
  2. Provide performance expectations — i.e. “I want Oracle queries to finish in 1/4th the time compared to what I have now” — and tie those expectations to business benefits (makes it easier to justify).
  3. Ask vendors to show you their sizing tools and explain the math behind the sizing — there is no magic!
  4. Ask vendors if they are sizing for all the workloads you have at the moment (not just different apps but different workloads within each app) — and how.
  5. Ask them to show you what your working set is and how much of it will fit in the cache.
  6. Ask them to show you how your data would be laid out in an Autotiered environment and what bits of it would end up on what tier. How is that being calculated? Is the geometry of the layout taken into consideration?
  7. Do you have enough capacity for each tier? On Autotiering architectures with large chunks, do you have 10-15% of total storage being SSD?
  8. Have the controller RAM and CPU overheads due to caching and autotiering been taken into account? Such technologies do need extra CPU and RAM to work. Ask to see the overhead (the smaller the Autotiering chunk size, the more metadata overhead, for example). Nothing is free.
  9. Beware of sizings done verbally or on cocktail napkins, calculators, or even spreadsheets – I’ve yet to see a spreadsheet model storage performance accurately.
  10. Beware of sizings of the type “a 15K disk can do 180 IOPS” — it’s a lot more complicated than that!
  11. Understand the difference between sequential, random, reads, writes and I/O size for each proposed architecture — the differences in how I/O is done depending on the platform are staggering and can result in vastly different disk requirements — making apples-to-apples comparisons challenging.
  12. Understand the extra I/O and capacity impact of certain CDP/Replication devices — it can be as much as 3x, and needs to be factored in.
  13. What RAID type is each vendor using? That can have a gigantic performance impact on write-intensive workloads (in addition to the reliability aspect).
  14. If you are getting unbelievably low pricing — ask for a contract ensuring upgrade pricing will be along the same lines. “The first hit is free” is true in more than one line of business.
  15. And, last but by no means least — ask how busy the proposed solution will be given the expected workload! It surprises me that people will try to sell a box that can do the workload but will be 90% busy doing so. Are you OK with that kind of headroom? Remember – disk arrays are just computers running specialized software and hardware, and as such their CPU can run out of steam just like anything else.
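On that last point, a textbook single-queue approximation (M/M/1) gives a feel for why headroom matters. This is an assumption brought in for illustration, not something any sizing tool is claimed to use, and real arrays only loosely follow it: response time grows as 1/(1 - utilization) relative to an idle system.

```python
def relative_response_time(utilization):
    """Response time relative to an idle system, per the M/M/1 approximation."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)

for busy in (0.5, 0.7, 0.9):
    print(f"{busy:.0%} busy: about {relative_response_time(busy):.1f}x the idle response time")
```

Under this model a box that is 90% busy leaves almost no room for a burst before latency climbs steeply.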

If this all seems hard — it’s because it is. But see it as due diligence — you owe it to your company, plus you probably don’t want to be saddled with an improperly-sized box for the next 3-5 years, just because the offer was too good to refuse…

D


OS X and SSD – tunings plus performance with and without TRIM

I finally decided to spring for a SSD for my laptop since I hammer it heavily with a lot of mostly random I/O. It was money well spent.

I went for an Intel 320 model, since it includes extra capacitors for flushing the cache in the event of power failure, and has RAID-4 onboard for protection beyond sparing (there are other, faster SSDs but I need the reliability and can’t afford large-sized SLC).

I used the trusty postmark (here’s a link to the OS X executable) to generate a highly random workload with varying file sizes, using these settings:

set buffering false
set size 500 100000
set read 4096
set write 4096
set number 10000
set transactions 20000
run

All testing was done on OS X 10.6.7.

Here’s the result with the original 7200 RPM HDD:

Time:
198 seconds total
186 seconds of transactions (107 per second)

Files:
20163 created (101 per second)
Creation alone: 10000 files (1111 per second)
Mixed with transactions: 10163 files (54 per second)
10053 read (54 per second)
9945 appended (53 per second)
20163 deleted (101 per second)
Deletion alone: 10326 files (3442 per second)
Mixed with transactions: 9837 files (52 per second)

Data:
557.87 megabytes read (2.82 megabytes per second)
1165.62 megabytes written (5.89 megabytes per second)

I then replaced the internal drive with SSD, popped the old internal drive into an external caddy, plugged it into the Mac, reinstalled OS X and simply told it to move the user and app stuff from the old drive to the new (Apple makes those things so easy – on a PC you’d probably need something like an imaging program but that wouldn’t take care of very different hardware). I spent a ton of time testing to make sure it was all OK, in disbelief it was that easy. Kudos, Apple.

Here are the results with SSD (2/3rds full FWIW):

Time:
19 seconds total
13 seconds of transactions (1538 per second)

Files:
20163 created (1061 per second)
Creation alone: 10000 files (2500 per second)
Mixed with transactions: 10163 files (781 per second)
10053 read (773 per second)
9945 appended (765 per second)
20163 deleted (1061 per second)
Deletion alone: 10326 files (5163 per second)
Mixed with transactions: 9837 files (756 per second)

Data:
557.87 megabytes read (29.36 megabytes per second)
1165.62 megabytes written (61.35 megabytes per second)

A fair bit of improvement… :) The perceived difference is amazing. For some things I’ve caught it doing over 200MB/s sustained writes.

I also disabled the sudden motion sensor since there’s no point stopping I/O to an SSD if one shakes the laptop. From the command line:

sudo pmset -a sms 0 (this disables it)
sudo pmset -g (to verify it was done)

And since I don’t need hotfile adaptive clustering on an SSD, I decided to disable access time updates (noatime in UNIX parlance).

You need to put the script from here: http://dl.dropbox.com/u/5875413/Tools/com.my.noatime.plist

into /Library/LaunchDaemons

And make sure it has the right permissions:

sudo chown root:wheel com.my.noatime.plist

Then reboot, type mount from the command line, and see if the root filesystem shows noatime as one of the mount arguments.

For example mine shows

/dev/disk0s2 on / (hfs, local, journaled, noatime)
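If you would rather script that check than eyeball it, here is a minimal Python sketch. It only parses text in the shape shown above ("device on mount point (options)"); the function name `mount_options` is made up for illustration, and feeding it the live output of `mount` (for example via `subprocess`) is left out so the snippet stays self-contained.

```python
import re

def mount_options(mount_output, mount_point="/"):
    """Return the options `mount` lists for mount_point, or [] if it is not found."""
    for line in mount_output.splitlines():
        # Expected shape: "<device> on <mount point> (<opt>, <opt>, ...)"
        match = re.match(r"^(\S+) on (.+) \((.*)\)$", line.strip())
        if match and match.group(2) == mount_point:
            return [opt.strip() for opt in match.group(3).split(",")]
    return []

# Sample line copied from the mount output above.
sample = "/dev/disk0s2 on / (hfs, local, journaled, noatime)"
print("noatime" in mount_options(sample))
```

Other platforms print `mount` output in a different shape, so treat the regular expression as an assumption tied to the OS X format shown here.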

I then re-ran postmark, here are the results with noatime:

Time:
16 seconds total
11 seconds of transactions (1818 per second)

Files:
20163 created (1260 per second)
Creation alone: 10000 files (2500 per second)
Mixed with transactions: 10163 files (923 per second)
10053 read (913 per second)
9945 appended (904 per second)
20163 deleted (1260 per second)
Deletion alone: 10326 files (10326 per second)
Mixed with transactions: 9837 files (894 per second)

Data:
557.87 megabytes read (34.87 megabytes per second)
1165.62 megabytes written (72.85 megabytes per second)

Even better.

Now here comes the part that I hoped would work better than it did:

OS X doesn’t support the TRIM command for SSDs yet (unless you have a really new Mac with an Apple SSD). Fortunately, some enterprising users found out that it is possible to turn TRIM on in OS X. There are various ways to do it, but someone has already automated the process. Be sure to do a backup first (both a system backup and one through the TRIM enabler application).

The process does work. However, it seems to run TRIM too aggressively, interfering with the random access optimizations some drives have.

Benchmark after TRIM enabled:

Time:
39 seconds total
31 seconds of transactions (645 per second)

Files:
20163 created (517 per second)
Creation alone: 10000 files (3333 per second)
Mixed with transactions: 10163 files (327 per second)
10053 read (324 per second)
9945 appended (320 per second)
20163 deleted (517 per second)
Deletion alone: 10326 files (2065 per second)
Mixed with transactions: 9837 files (317 per second)

Data:
557.87 megabytes read (14.30 megabytes per second)
1165.62 megabytes written (29.89 megabytes per second)

This kind of performance loss is unacceptable to me, so I restored the kext file through the TRIM app, rebooted and re-ran the benchmark and all was fine again.
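To put the four runs side by side, here is a small Python sketch that recomputes the headline rates from the elapsed times reported above. It assumes postmark divides the 20000 transactions by the transaction-phase seconds (dropping the fraction) and the megabytes by the total seconds, which matches every figure in the outputs; the labels and the `summarize` helper are mine.

```python
TRANSACTIONS = 20000                     # from "set transactions 20000"
MB_READ, MB_WRITTEN = 557.87, 1165.62    # identical in all four runs

# label -> (total seconds, seconds spent in the transaction phase)
runs = {
    "HDD 7200 RPM": (198, 186),
    "SSD": (19, 13),
    "SSD + noatime": (16, 11),
    "SSD + TRIM": (39, 31),
}

def summarize(total_s, tx_s):
    """Recompute transactions/s and MB/s the way the postmark output appears to."""
    return {
        "tx_per_s": TRANSACTIONS // tx_s,
        "read_mb_s": round(MB_READ / total_s, 2),
        "write_mb_s": round(MB_WRITTEN / total_s, 2),
    }

baseline = summarize(*runs["HDD 7200 RPM"])
for label, times in runs.items():
    s = summarize(*times)
    speedup = s["tx_per_s"] / baseline["tx_per_s"]
    print(f"{label:14} {s['tx_per_s']:5d} tx/s  {s['read_mb_s']:6.2f} MB/s read  "
          f"{s['write_mb_s']:6.2f} MB/s written  {speedup:5.1f}x vs HDD")
```

On these numbers the plain SSD run handles roughly 14 times the transaction rate of the HDD, noatime pushes that to about 17 times, and the TRIM-enabled run falls back to about 6 times.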

My recommendations:

  1. Always test before and after the tweaks – my results may only apply to Intel drives. Please post your results with other drives
  2. Always do backups before serious tweaks
  3. If TRIM seems to slow down random I/O on your Mac SSD, don’t keep it running. Instead, enable it perhaps once a month, go to Disk Utility, ask it to erase the free space, then disable TRIM again. This will ensure the drive stays in good shape without adversely affecting normal random I/O.

D

Examining value for money regarding the SPEC benchmarks

Some of the comments in my previous post asked about $/IOPS and $/TB.

Since SPEC doesn’t require prices to be listed, I did my own analysis.

The NetApp numbers are simply 4x the existing 6240 result, which mirrors what EMC did with their submission: they used 4 separate VNX systems and aggregated the result.

I used this clarifying analogy over at Nigel’s blog to explain why this makes sense before anyone yells “but this is not published”:

A storage system typically has some kind of bottleneck – cluster interconnect, number of drives, bandwidth to the controller, etc.

When you’re testing a single system, you’re ultimately hitting one of those bottlenecks.

If you’re testing multiple systems independent of each other, they do not share the bottlenecks (since they’re separate), and your performance will scale linearly as you add systems.

For example, if 1 truck can hold 10 tons of stuff, 4 like trucks will hold 40 tons of stuff, 10 trucks 100 tons, etc. There’s no limit.

Once you inject a limiting factor (“the trucks all have to fit on a bridge and the bridge can take this much load and it’s this big”) then you will have a limitation on how many trucks you can load and put on that bridge.
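The truck analogy boils down to one line of arithmetic: independent systems add up linearly until something shared caps them. A minimal Python sketch, where the function name and the 25-ton bridge are hypothetical values picked purely for illustration:

```python
def aggregate_capacity(units, capacity_per_unit, shared_limit=None):
    """Total capacity of identical units, optionally capped by a shared bottleneck."""
    total = units * capacity_per_unit
    return total if shared_limit is None else min(total, shared_limit)

# Independent trucks, 10 tons each: capacity scales linearly.
print(aggregate_capacity(4, 10))
print(aggregate_capacity(10, 10))

# Same 4 trucks, but they must share a (hypothetical) 25-ton bridge.
print(aggregate_capacity(4, 10, shared_limit=25))
```

Separate arrays tested side by side are the first case; a single system with a shared interconnect or controller is the second.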

EMC tested 4 separate “trucks”. In that same way, I can add up the result of 4 separate NetApp “trucks”. Here are the results:

| Metric | EMC | NetApp | Difference |
|---|---|---|---|
| Cost (approx. USD List) | 6,000,000 | 5,000,000 | NetApp is over 16% cheaper in absolute terms |
| SPEC SFS NFS IOPS | 497,623 | 762,700 | NetApp is 53% faster in absolute terms |
| Average Latency (ORT) | 0.96 | 1.17 | EMC offers a mere 18% less latency (with less NFS OPS) despite using only SSDs! |
| Space (TB) | 60 | 343 | NetApp offers 5.7 times more usable space |
| $/SPEC NFS IOPS | 12.06 | 6.56 | NetApp is 45.6% less expensive per SPEC NFS operation |
| $/TB | 100,000 | 14,577 | NetApp is less than 1/6 the price of EMC per TB |
| RAID | RAID5 | RAID-DP | NetApp is thousands of times more reliable |
| Boxes needed to accomplish result | 15 (4x separate VNX, each with 2 controllers, plus a total of 5x Celerra VG8 heads and 2 Control Stations) | 8x unified controllers | NetApp is far less complex (the benefit of a truly unified architecture) |
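For anyone who wants to check the derived rows, here is a short Python sketch that recomputes $/IOPS and $/TB from the approximate list prices and results in the table. The dictionary layout and names are mine; the inputs are the table’s own figures.

```python
# Inputs are the approximate list prices and aggregated results quoted in the table.
configs = {
    "EMC": {"cost_usd": 6_000_000, "nfs_ops": 497_623, "usable_tb": 60},
    "NetApp": {"cost_usd": 5_000_000, "nfs_ops": 762_700, "usable_tb": 343},
}

def value_metrics(cfg):
    """Return dollars per SPEC NFS op (2 decimals) and dollars per usable TB (whole dollars)."""
    return {
        "usd_per_op": round(cfg["cost_usd"] / cfg["nfs_ops"], 2),
        "usd_per_tb": round(cfg["cost_usd"] / cfg["usable_tb"]),
    }

emc = value_metrics(configs["EMC"])
netapp = value_metrics(configs["NetApp"])

# Relative differences, expressed as percentages of the EMC figure.
more_ops_pct = 100 * (configs["NetApp"]["nfs_ops"] / configs["EMC"]["nfs_ops"] - 1)
cheaper_per_op_pct = 100 * (1 - netapp["usd_per_op"] / emc["usd_per_op"])

print(emc, netapp)
print(f"{more_ops_pct:.0f}% more ops, {cheaper_per_op_pct:.1f}% cheaper per op")
```

Swap in street prices or a different configuration and the same few lines give you your own $/IOPS and $/TB.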

Who can spot the better deal? :)

I added the latency in the chart, thanks to my buddy Mark Twomey for pointing it out.

You see, people needing enterprise NAS with that kind of performance usually need speed, plenty of space and high reliability. Not just one of the three. BTW, here’s a paper on relative RAID reliability.

NetApp provides all three, in spades, plus great value for money, a truly simple, flexible unified system, and efficiency.

Most customers want to see how a real configuration performs. I refer customers to our SPEC and SPC results constantly since quite frequently their desired configuration is very similar.

Which makes benchmarking realistic configurations actually useful – imagine that.

Maybe EMC needs to submit results with VNX the way they sell it to people, for example:

  • A mix of SSD cache, SSD, high-speed SAS and high-capacity SAS
  • Autotiering
  • RAID6
  • A typical amount of space for a configuration that size

Then submit results.

Keep your existing result of course, but also show the people how what you actually sell them really performs.

I still don’t understand why this is such a hard concept.

D


EMC conclusively proves that VNX bottlenecks NAS performance

A bit of a controversial title, no?

Allow me to elaborate.

EMC posted a new SPEC SFS result as part of a marketing stunt (which is working, look at what I’m doing – I’m talking about them, if only to clear the air).

In simple terms, EMC got almost 500,000 SPEC SFS NFS IOPS (not to be confused with, say, block-based SPC-1 IOPS) with the following configuration:

  1. Four (4) totally separate VNX arrays, each loaded with SSD storage, utterly unaware of each other (8 total controllers since each box has 2)
  2. Five (5) Celerra VG8 NAS heads/gateways (1 spare), one on top of each VNX box
  3. 2 Control Stations
  4. 8 exported filesystems (2 per VG8 head/VNX system)
  5. Multiple pools of storage (at least 1 per VG8) – not shared among the various boxes, no data mobility between boxes
  6. Only 60TB NAS space with RAID5 (or 15TB per box)

Now, this post is not about whether this configuration is unrealistic and expensive (almost nobody would pay $6m for merely 60TB of NAS, not today). I get it that EMC is trying to publish the best possible number by loading a bunch of separate arrays with SSD. It’s OK as long as everyone understands the details.

My beef has to do with how it’s marketed.

EMC is very vague about the configuration, unless you look at the actual SPEC website. In the marketing materials they just mention VNX, as in “The EMC VNX performed at 497,623 SPECsfs2008_nfs.v3 operations per second”. Kinda like saying it’s OK to take three 5-year-olds and a 6-year-old to a bar because their ages add up to 21.

No – the far more accurate statement is “four separate VNXs working independently and utterly unaware of each other did about 124,405 SPECsfs2008_nfs.v3 operations per second each”.

All EMC did was add up the result of 4 boxes.

Heck, that’s easy to do!

NetApp already has a result for the 6240 (just 2 controllers doing a respectable 190,675 SPEC NFS ops taking care of NAS and RAID all at once since they’re actually unified, no cornucopia of boxes there) without using Solid State Drives (common SAS drives plus a large cache were used instead – a standard, realistic config we sell every day, and not a “lab queen”).

If all we’re doing is adding up the result of different boxes, simply multiply this by 4 (plus we do have Cluster-Mode for NAS so it would count as a single clustered system with failover etc. among the nodes) and end up with the following result:

  1. 762,700 SPEC SFS NFS operations
  2. 8 exported filesystems
  3. 343TB usable with RAID-DP (thousands of times more resilient than RAID5)
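The “multiplication trick” is easy to reproduce. A tiny Python sketch using only the figures quoted in this post (the variable names are mine):

```python
BOXES = 4

emc_total_ops = 497_623      # published aggregate across 4 separate VNX arrays
netapp_6240_ops = 190_675    # published result for a single 2-controller 6240

# What each VNX contributed on average (fraction dropped).
emc_ops_per_box = emc_total_ops // BOXES

# What four independent 6240 systems would add up to under the same logic.
netapp_total_ops = netapp_6240_ops * BOXES

print(emc_ops_per_box, netapp_total_ops)
```

The second number is simple addition, not a published SPEC result, which is exactly the point being made about the EMC figure.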

So, which one do you think is the better deal? More speed, 343TB and better protection, or less speed, 60TB and far less protection? :)

Customers curious about other systems can do the same multiplication trick for other configs, the sky is the limit!

The other, more serious part, and what prompted me to title the post the way I did, is that EMC’s benchmarking made it pretty clear that the VNX is the bottleneck, only able to really support a single VG8 head at top speed, which is why 4 separate VNX systems were needed to accomplish the final result. So, the fact that a VNX can have up to 8 Celerra heads on top of it means nothing since the back-end is your limiting factor. You might as well stick to a dual-head VG8 config (1 active 1 passive) since that’s all it can comfortably drive (otherwise why benchmark it that way?)

But with only 1 active NAS head you’d be limited to just 256TB max NAS capacity, since that’s how much total space a Celerra head can address as of the time of this writing. Which is probably enough for most people.

I wonder if the NAS heads that can be bought as a package with VNX are slower than VG8 heads, and by how much. You see, most people buying the VNX will be getting the NAS heads that can be packaged with it since it’s cheaper that way. How fast does that go? I’m sure customers would like to know, since that’s what they will typically buy.

I also wonder how fast it would be with RAID6.

Here’s a novel idea: benchmark what customers will actually buy!

So apples-to-apples comparisons can become easier instead of something like this:

Bothapples

For the curious: on the left you see an “Autumn Glory” Malus Floribunda (miniature apple). Photo courtesy of John Fullbright.

D


EMC finally joining the SPC – plus some advice

It just came to my attention that EMC is finally joining the SPC (Storage Performance Council). As I’ve pointed out in the past, their absence from this most standard industry benchmark was puzzling, so kudos for rectifying this omission.

I do have some advice for EMC (and all other vendors that have already posted results):

  1. Do show results for your various supported RAID types, not just RAID10. After all, if most of your customers don’t just deploy RAID10, it makes sense to show RAID5 and RAID6, especially if you want to compare results with NetApp RAID-DP (the protection equivalent of RAID6). This will increase your credibility. The argument that you only show RAID10 since that’s the best performing doesn’t hold water: everyone knows the other RAID types will have different performance levels, and providing results for everything will at least give customers an idea of how much of a performance hit to expect from each RAID type with a write-intensive workload.
  2. Do show long-running benchmarks with auto-tiering enabled. After all, if you are claiming your auto-tiering implementation doesn’t hurt and can even improve results, this is your chance to show it.
  3. Enable features like snapshots.
  4. Use your large cache if you have one, especially if you keep advertising that it accelerates writes. It will just solidify your claims.

Welcome to the club!

D

 


NetApp posts new SPEC SFS NFS results – far faster than V-Max with Celerra VG8

Following the new NetApp block-based SPC-1 results yesterday, here is some NAS benchmark action. This page contains all the SPEC SFS results. SPEC SFS is the NAS equivalent of SPC-1.

SPEC SFS is more cache-friendly than the brutal SPC-1, click here for some more information regarding this industry-standard NAS benchmark. The idea is that thousands of CIFS and NFS servers have been profiled and the benchmark reflects real-life NAS usage patterns.

In the same vein as the SPC-1 benchmarks, the configurations we submit to the standard benchmarking authorities are based on realistic systems customers could buy, not $7m lab queens. So, NetApp SPEC and SPC submissions:

  • Are always tested with RAID-DP (RAID-6 protection equivalent) – other vendors test with RAID10 most of the time, and never with RAID-6 (ask them why this is; BlueArc gets respect for being the only other one in the list doing our level of protection)
  • Have a target of using the most cost-effective configuration possible
  • Provide not just high IOPS but also very low latency
  • Are a realistic, deployable configuration, not just the fastest box we have (we still have the 1 million SPEC ops record for a 24-node system, that’s kind of pricy plus the result is old and can’t be compared with the current benchmark code – still, look at the rankings).

So, with those lofty goals in mind, we have 3 new submissions:

  1. CIFS benchmark, 3210 w/ SATA drives – typical low/mid-range system
  2. NFS benchmark, 3270 w/ SAS drives – typical mid-range system, no Flash Cache used in this one.
  3. NFS benchmark, 6240 w/ SAS drives – typical high-end (but not highest) system.

All NetApp systems included some Flash Cache memory boards to provide further acceleration (EDIT: aside from the 3270). We have an even faster system (6280) that we will be submitting later on as a special treat (there’s a certain degree of red tape and ceremony to even do one submission…)

Here’s an abbreviated chart in easily digestible form – showing the most recent results from perennial rivals NetApp and EMC (BTW – of all the systems in the chart, only one of them is truly unified and can provide block and NAS on the same architecture without the need for contortions).

| System | Result (higher is better) | Overall Response Time (lower is better) | # Disks | Exported Capacity in TB | RAID | Protocol |
|---|---|---|---|---|---|---|
| NetApp 3210 | 64292 | 1.50 | 144x 1TB SATA | 87 | RAID-DP | CIFS |
| NetApp 3270 | 101183 | 1.66 | 360x 15K RPM 450GB SAS | 110 | RAID-DP | NFS |
| NetApp 6240 | 190675 | 1.17 | 288x 15K RPM 450GB SAS | 85 | RAID-DP | NFS |
| EMC NS-G8 on V-Max | 118463 | 1.92 | Bunch o’ SSD (96 fancy STEC 400GB ZeusIOPS) | 17 | RAID-10 | CIFS |
| EMC NS-G8 on V-Max | 110621 | 2.32 | Bunch o’ SSD (96 fancy STEC 400GB ZeusIOPS) | 17 | RAID-10 | NFS |
| EMC VG8 on V-Max | 135521 | 1.92 | 312x 15K RPM 450GB FC | 19 | RAID-10 | NFS |

Guide to reading the chart, and lessons learned:

  • A “puny” NetApp 3210 with SATA gets better overall response time than an all-SSD V-Max costing well over 10x
  • Notice the amount of usable space on NetApp systems, with even better protection than RAID10
  • The 6240 scored far higher even though it had fewer disks
  • The NetApp systems have “just” 2 controllers that do everything, vs. the EMC submissions with 4 V-Max engines, plus extra Celerra Data Movers and Control Stations on top. What do you think is more efficient?
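One way to read the chart is to normalize each result by drive count. The Python sketch below computes operations per disk and the share of raw capacity that ended up exported. It assumes raw capacity is simply disk count times the marketed drive size in decimal units, ignoring spares, right-sizing and TB versus TiB differences, so treat the percentages as rough; the `per_disk` helper is mine.

```python
# (system, ops, disks, disk size in GB, exported TB) taken from the chart
results = [
    ("NetApp 3210 (CIFS)", 64292, 144, 1000, 87),
    ("NetApp 3270 (NFS)", 101183, 360, 450, 110),
    ("NetApp 6240 (NFS)", 190675, 288, 450, 85),
    ("EMC NS-G8 on V-Max (CIFS)", 118463, 96, 400, 17),
    ("EMC NS-G8 on V-Max (NFS)", 110621, 96, 400, 17),
    ("EMC VG8 on V-Max (NFS)", 135521, 312, 450, 19),
]

def per_disk(ops, disks, disk_gb, exported_tb):
    """Return (ops per disk, exported capacity as a percentage of raw capacity)."""
    raw_tb = disks * disk_gb / 1000   # assumption: marketed size, decimal TB
    return round(ops / disks), round(100 * exported_tb / raw_tb, 1)

for name, ops, disks, disk_gb, exported_tb in results:
    ops_per_disk, exported_pct = per_disk(ops, disks, disk_gb, exported_tb)
    print(f"{name:26} {ops_per_disk:5d} ops/disk  {exported_pct:5.1f}% of raw exported")
```

Under those assumptions the 6240 exports roughly two thirds of its raw capacity, while the 312-drive V-Max configuration exports under 15%.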

In slide format:

image

I do have some questions to ask certain other vendors as a parting shot:

  1. Sun/Oracle – you keep saying your new boxes are a cheaper way to get NetApp-type functionality, you’ve had them for a while, why not submit to SPEC or SPC? (there is not a single SPEC result from Sun).
  2. EMC – maybe show the world how a system not based on V-Max runs? With RAID-6? (Even V-Max with RAID6, no problem… :) )
  3. EMC: What’s the deal with the exported capacity, even with 312x drives?
  4. All of you with large striped pools of RAID5 – have you bothered explaining to your customers what will happen to the pool if you have a dual-drive failure in any RAID group? Unacceptable.

D

New NetApp SPC-1 submission – more IOPS per drive than any other vendor, and a bit on write caching

The SPC-1(E) benchmark is the standard high-intensity test for block storage, consisting of very stringent rules and a standard test suite.

SPC-1 is one of the worst things you can do to a disk array. The benchmark itself does a lot of writes, is highly random and is hostile to most caching systems. Which neatly explains why IBM has all kinds of system submissions but doesn’t show XIV, and the complete absence of another prominent vendor (look at the submissions, you’ll figure it out – the big boys of storage are NetApp, IBM, HDS, HP and one more :) ).

That same vendor might complain that SPC-1 is not always representative of real-life workloads but, short of putting all possible systems in your datacenter, nothing really will represent exactly how you massage your data. At least SPC-1 is a well-established standard and a great torture test for systems. All the other vendors are participating after all. And, interestingly, the SPEC SFS NAS benchmark doesn’t seem to bother said vendor’s anti-SPC crew none (spec.org). How come that one is more “real”? :) (NetApp participates in both block and NAS standard benchmarks BTW, since our systems all do both).

Some things to look for when trying to decipher SPC-1 results:

  • Type of RAID used (RAID-DP, RAID10, RAID5, RAID6)
  • How many drives were used to get the final result
  • The cost for the configuration
  • The price/performance
  • How much of the storage was usable, how much was unused…

For instance – a system that can do 50,000 SPC-1 IOPS with 100 disks and RAID6 is far more efficient than one that needs 200 disks and RAID10 to achieve the same result.

 

My favorite way of reading the results is figuring out the effective IOPS per drive and seeing how close (or far) it is from the 220 IOPS a normal modern 15K drive can sustain without RAID, with good response times.

So, without further ado, looky here… it’s the link to the results page showing all the vendors. Or here for the full details. 68,000 sustained IOPS with 120 ordinary 300GB drives and just 2 Flash Cache modules, with 84% of the usable space occupied.

What this means to you:

The effective IOPS per drive for the NetApp 3270 submission are 567. Next best is around 400, most vendors can’t break 300, and the highest scoring systems (relying on thousands of drives and many controllers) don’t break 200.

 

It is important to note that NetApp is the only vendor in the list showing results with dual-parity RAID-DP (RAID6 equivalent protection). All other vendors are doing RAID10! If your vendor is selling you RAID5, that’s not representative of their systems in the chart!

The NetApp result boils down to 13,600 sustained IOPS per shelf of 15K drives, and a system cost that’s very reasonable for the reliability, performance and features provided.
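The per-drive arithmetic is simple enough to apply to any submission. A minimal Python sketch using the figures from this post (the helper name is mine; the shelf count is only what the per-shelf figure implies, not something read from the full disclosure report):

```python
BARE_15K_DRIVE_IOPS = 220   # what a 15K drive sustains without RAID, per the text above

def iops_per_drive(total_iops, drives):
    """Effective IOPS each drive contributes to the benchmark result."""
    return total_iops / drives

# NetApp 3270 submission: 68,000 SPC-1 IOPS on 120 drives.
netapp = iops_per_drive(68_000, 120)
print(round(netapp), f"{netapp / BARE_15K_DRIVE_IOPS:.1f}x a bare 15K drive")

# The earlier hypothetical: 50,000 IOPS from 100 disks versus 200 disks.
print(iops_per_drive(50_000, 100), iops_per_drive(50_000, 200))

# 13,600 IOPS per shelf implies this many shelves for the whole result.
implied_shelves = 68_000 / 13_600
print(implied_shelves)
```

Anything well above 220 per drive means cache and write optimization are doing real work; anything below means the drives are being held back by RAID overhead or controller limits.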

What this means to the anti-NetApp FUD club with their complex auto-tiering schemes that need 15 types of drives…

You really need to figure out how to present a decent result with:

  • RAID6 (otherwise your RAID1 or RAID5 protection is inferior to NetApp RAID-DP, especially when talking about large pools)
  • Your fancy auto-tiering algorithm showing no performance degradation on the unpredictable SPC-1 workload while still storing data on all drive tiers (otherwise, it’s single-tiering, and not auto-tiering)
  • Large caches. If your competitive product can use Megacaches, and you claim you can do efficient write caching with them, how about we all see how effective that is? After all, you claim that’s a huge benefit. We show the world ours, show yours. Otherwise, your product is only fast on Powerpoint slides, and I’ve yet to see a product fail on Powerpoint.

Stand by for more results from the bigger boxes, this wasn’t one of them, but it is a realistic system companies could actually afford and not a $7m all-SSD config like some others have… :)

 

D
