Recovery Monkey – Musings on backups, storage, tuning and more
http://recoverymonkey.org

When competitors try too hard and miss the point
Posted Tue, 09 Sep 2014 by Dimitris
http://recoverymonkey.org/2014/09/09/when-competitors-try-too-hard-and-miss-the-point/
(edit: fixed the images)

After a long hiatus, we return to our regularly scheduled programming with a 2-part series that will address some wild claims Oracle has been making recently.

I’m pleased to introduce Jeffrey Steiner, ex-Oracle employee and all-around DB performance wizard. He helps some of our largest customers with designing high performance solutions for Oracle DBs:

Greetings from a guest-blogger.

I’m one of the original NetApp customers.

I bought my first NetApp in 1995 (I have a 3-digit support case in the system) and it was an F330. I think it came with 512MB SCSI drives, and maxed out at 16GB. It met our performance needs, it was reliable, and it was cost effective.  I continued to buy more over the following years at other employers. We must have been close to the first company to run Oracle databases on NetApp storage. It was late 1999. Again, it met our performance needs, it was reliable, and it was cost effective. My employer immediately prior to joining NetApp was Oracle.

I’m now with NetApp product operations as the principal architect for enterprise solutions, which usually means a big Oracle database is involved, but it can also include DB2, SAS, MongoDB, and others.

I normally ignore competitive blogs, and I had never commented on a blog in my life until I ran into something entitled “Why your NetApp is so slow…” and found this statement:

If an application such MS SQL is writing data in a 64k chunk then before Netapp actually writes it on disk it will have to split it into 16 different 4k writes and 16 different disk IOPS

That’s just openly false. I tried to correct the poster, but was met with nothing but other unsubstantiated claims and insults to the product line. It was clear the blogger wasn’t going to acknowledge their false premise, so I asked Dimitris if I could borrow some time on his blog.

Here’s one of the alleged results of this behavior with ONTAP– the blogger was nice enough to do this calculation for a system reading at 2.6GB/s:

 

[Image: erroneous Oracle disk calculations from the competing blog]


Seriously?

I’m not sure how to interpret this. Are they saying that this alleged horrible, awful design flaw in ONTAP leads to customers buying 50X more drives than required, and our evil sales teams have somehow fooled our customer base into believing this was necessary? Or is this a claim that ZFS arrays have some kind of amazing ability to use 50X fewer drives?

Given the false premise about ONTAP chopping up any and all IO’s into little 4K blocks and spraying them over the drives, I’m guessing readers are supposed to believe the first interpretation.

Ordinarily, I enjoy this type of marketing. Customers bring this to our attention, and it allows us to explain how things actually work, plus it discredits the account team who provided the information. There was a rep in the UK who used to tell his customers that Oracle had replaced all competing storage arrays in their OnDemand centers with Pillar. I liked it when he said stuff like that. The reason I’m responding is not because I care about the existence of the other blog, but rather that I care about openly false information being spread about how ONTAP works.

How does ONTAP really work?

Some of NetApp’s marketing folks might not like this, but here’s my usual response:

Why does it matter?

It’s an interesting subject, and I’m happy to explain write tetrises, NVMEM write coalescence, and core utilization, but what does that have to do with your business? There was a time we dealt with accusations that NetApp was slow because we had 25-nanometer process CPUs while the state of the art was 17nm or something like that. These days ‘cores’ seems to come up a lot, as if this happens:

[Image: logic]

That’s the Brawndo approach to storage sales (https://www.youtube.com/watch?v=Tbxq0IDqD04)

“Our storage arrays contain

5 kinds of technology

which make them AWESOME

unlike other storage arrays which are

NOT AWESOME.”

A Better Way

I prefer to promote our products based on real business needs. I phrase this especially bluntly when talking to our sales force:

When you are working with a new enterprise customer, shut up about NetApp for at least the first 45 minutes of the discussion

I say that all the time. Not everyone understands it. If you charge into a situation saying, “NetApp is AWESOME, unlike EMC who is NOT AWESOME” the whole conversation turns into PowerPoint wars, links to silly blog articles like the one that prompted this discussion, and whoever wins the deal will win it based on a combination of luck and speaking ability. Providing value will become secondary.

I’m usually working in engineeringland, but in major deals I get involved directly. Let’s say we have a customer with a database performance issue and they’re looking for new storage. I avoid PowerPoint and usually request Oracle AWR/statspack data. That allows me to size a solution with extreme accuracy. I know exactly what the customer needs, I know their performance bottlenecks, and I know whatever solution I propose will meet their requirements. That reduces risk on both sides. It also reduces costs because I won’t be proposing unnecessary capabilities.

None of this has anything to do with who’s got the better SPC-2 benchmark, unless you plan on buying that exact hardware described, configuring it exactly the same way, and then you somehow make money based on running SPC-2 all day.

Here’s an actual Oracle AWR report from a real customer using NetApp. I have pruned the non-storage related parameters to make it easier to read, and I have anonymized the identifying data. This is a major international insurance company calculating its balance sheet at end-of-month. I know of at least 9 or 10 customers that have similar workloads and configurations.

[Image: Oracle AWR report excerpt from the insurance customer]

Look at the line that says “Physical reads”. That’s the blocks read per second. Now look at “Std Block Size”. That’s the block size. This is 90K physical block reads per second, which is 90K IOPS in a sense. The IO is predominantly db_file_scattered_read, which counter-intuitively is sequential IO. A parameter called db_file_multiblock_read_count is set to 128. This means Oracle is attempting to read 128 blocks at a time, which equates to 1MB reads. It’s a sequential IO read of a file.
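To make the arithmetic explicit, here’s a quick back-of-the-envelope sketch using only the numbers quoted above (notice how the result lines up with the roughly 800MB/sec the customer needed, mentioned below):

    # Back-of-the-envelope math on the AWR numbers above.
    block_size_bytes = 8 * 1024                  # "Std Block Size" = 8K
    physical_reads_per_sec = 90_000              # "Physical reads" from the AWR report
    multiblock_read_count = 128                  # db_file_multiblock_read_count

    print(multiblock_read_count * block_size_bytes)         # 1048576 bytes = 1MB per read
    print(physical_reads_per_sec * block_size_bytes / 1e6)  # ~737 MB/s of read throughput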

Here’s what we got:

  1. 89K read “IOPS”, sort of.
  2. Those 89K read IOPS are actually packaged as units of 8 blocks in a single 64k unit.
  3. 3K write IOPS
  4. 8MB/sec of redo logging.

The most important point here is that the customer needed about 800MB/sec of throughput, they liked the cost savings of IP, and the storage system is meeting their needs. They refresh with NetApp on occasion, so obviously they’re happy with the TCO.

To put a final nail in the coffin of the Oracle blogger’s argument, if we are really doing 89K block reads/sec, and those blocks are really chopped up into 4k units, that’s a total of about 180,000 4k IOPS that would need to be serviced at the disk layer, per the blogger’s calculation.

  • Our opposing blogger thinks that would require about 1000 disks in theory (the arithmetic is sketched below)
  • This customer is using 132 drives in a real production system.
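Here is how those two numbers relate; a sketch, and note that the ~180 IOPS/drive figure is simply what’s implied by the blogger’s ~1000-drive claim, not a number from the AWR report:

    # The competing blogger's math vs. the real deployment.
    read_iops_8k = 89_000                    # 8K block reads/sec from the AWR report
    alleged_backend_iops = read_iops_8k * 2  # each 8K read allegedly chopped into two 4K IOs

    implied_iops_per_drive = 180             # roughly what makes the "~1000 disks" claim work
    print(alleged_backend_iops)                                  # ~178,000 back-end IOPS claimed
    print(round(alleged_backend_iops / implied_iops_per_drive))  # ~989 drives "required" in theory
    print("actual drives in production:", 132)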

There’s also a ton of other data on those drives for other workloads. That’s why we have QoS – it allows mixed workloads to play nicely on a single unified system.

To make this even more interesting, the data would have been randomly written in 8k units, yet they are still able to read it at 800MB/sec. How is this possible? For one, ONTAP does NOT break up individual IO’s into 4k units. It tries very, very hard to never break up an IO across disks, although that can happen on occasion, notably if you fill your system up to 99% capacity or do something very much against best practices.

The main reason ONTAP can provide good sequential performance with randomly written data is the blocks are organized contiguously on disk. Strictly speaking, there is a sort of ‘fragmentation’ as our competitors like to say, but it’s not like randomly spraying data everywhere. It’s more like large contiguous chunks of data are evenly distributed across the disks. As long as those contiguous segments are sufficiently large, readahead can ensure good throughput.

That’s somewhat of an oversimplification, but it would take a couple hours and a whiteboard to explain the complete details. 20+ years of engineering can’t exactly be summarized in a couple paragraphs. The document misrepresented by the original blog was clearly dated 2006 (and that was itself a slight refresh of the original posting from back in the nineties), and while it’s still correct as far as I can see, it’s also lacking information on the enhancements and how we package data onto disks.

By the way, this database mentioned above? It’s virtualized in VMware too.

Why did I pick an example of only 90K IOPS?  My point was this customer needed 90K IOPS, so they bought 90K IOPS.

If you need this performance:

[Image: Oracle AWR report excerpt from a large SAP environment]

then let us know. Not a problem. This is from a large SAP environment for a manufacturing company. It beats me what they’re doing, because this is about 10X more IO than what we typically see for this type of SAP application. Maybe they just built a really, really good network that permits this level of IO performance even though they don’t need it.

In any case, that’s 201,734 blocks/sec using a block size of 8k. That’s roughly 1.6GB/sec, and it’s from a dual-controller FAS3220 configuration, which is rather old (and was the smallest box in its range when it was new).

Using the bizarro-universe math from the other blog, these 200K IOPS would have been chopped up into 4k blocks and would require a total of 400K back-end disk IOPS to service the workload. Divided by 125 IOPS/drive, we have a requirement for 3200 drives. It was ACTUALLY using more like 200 drives.

We can do a lot more, especially with the newer platforms and ONTAP clustering, which would enable up to 24 controllers in the storage cluster. So the performance limits per cluster are staggeringly high.

Futures

To put a really interesting (and practical) twist on this, sequential IO in the Oracle realm is probably going to become less important. You know why? Oracle’s new in-memory feature. Several others and I were floored when we got the first debrief on Oracle In-Memory. I couldn’t have asked for a better implementation if I were in charge of Oracle engineering myself. Folks at NetApp started asking what this means for us, and here’s my list:

  1. Oracle customers will be spending less on storage.

That’s it. That’s my list. The data format on disk remains unchanged, the backup/restore process is the same, the data commitment process is the same. All the NetApp features that earned us around 12,500 Oracle customers are still applicable.

The difference is customers will need smaller controllers, fewer disks, and less bandwidth because they’ll be able to replace a lot of the brute-force full table scan activity with a little In-Memory magic. No, the In-Memory licenses aren’t free, but the benefits will be substantial.

SPC-2 Benchmarks and Engenio Purchases

The other blog demanded two additional answers:

  1. Why hasn’t NetApp done an SPC-2 benchmark?
  2. Why did NetApp purchase Engenio?

SPC-2

I personally don’t know why we haven’t done an SPC-2 benchmark with ONTAP, but it’s a rather expensive benchmark to run and it’s aimed at large sequential IO processing. That’s not exactly the prime use case for FAS systems, but not because they’re weak at it. I’ve got AWR reports well into the GB/sec, so ONTAP can certainly do all the sequential IO you want in the real world – but what workloads are those?

I see little point in using an ONTAP system for most (but certainly not all) such workloads because the features overall aren’t applicable. I’m aware of some VOD applications on ONTAP where replication and backups were important. Overall, if you want that type of workload, you’d specify a minimum bandwidth requirement, capacity requirement, and then evaluate the proposals from vendors. Cost is usually the deciding factor.

Engenio Acquisition

Again, my personal opinion here on why NetApp acquired Engenio.

Tom Georgens, our CEO, spent 9 years leading Engenio and obviously knew the company and its financials well. I can’t think of a better way to know you’re getting value for money than having someone in Georgens’ position make this decision.

Here’s the press release about it:

Engenio will enable NetApp to address emerging and fast-growing market segments such as video, including full-motion video capture and digital video surveillance, as well as high performance computing applications, such as genomics sequencing and scientific research.

Yup, sounds about right. That’s all about maximum capacity, high throughput, and low cost. In contrast, ONTAP is about manageability and advanced features. Those are aimed at different sets of business drivers.

Hey, check this out. Here’s an SEC filing:

Since the acquisition of the Engenio business in May 2011, NetApp has been offering the formerly-branded Engenio products as NetApp E-Series storage arrays for SAN workloads. Core differentiators of this price-performance leader include enterprise reliability, availability and scalability. Customers choose E-Series for general purpose computing, high-density content repositories, video surveillance, and high performance computing workloads where data is managed by the application and the advanced data management capabilities of Data ONTAP storage operating system are not required.

Key point here is “where the advanced data management capabilities of Data ONTAP are not required.” It also reflected my logic in storage decisions prior to joining NetApp, and it reflects the message I still repeat to account teams:

  1. Is there any particular feature in ONTAP that is useful for your customer’s actual business requirements? Would they like to snapshot something? Do they need asynchronous replication? Archival? SnapLock? Scale-out clusters with many nodes? Non-disruptive everything? Think carefully, and ask lots of questions.
  2. If the answer is “yes”, go with ONTAP.
  3. If the answer is “no”, go with E-Series.

That’s what I did. I probably influenced or approved around $5M in total purchases. It wasn’t huge, but it wasn’t nothing either. I’d guess we went ONTAP about 70% of the time, but I had a lot of IBM DS3K arrays around too, now known as E-Series.

“Dumb Storage”

I’ve annoyed the E-Series team a few times by referring to it as “dumb storage”, but I mean that in the nicest possible way. Its primary job is to just sit there and work. It needs to do that fast, reliably, and cost effectively, but on a day-to-day basis it’s not generally doing anything all that advanced.

In some ways, the reliability was a weakness. It was so reliable that we forgot it was there at all; we’d do something like change the email server addresses and forget to update the RAS feature on the E-Series. Without email notification, it can take a couple of years before someone notices the LED that indicates a drive needs replacement.

 

Is convenience devaluing products? Does quality suffer because of it?
Posted Fri, 24 Jan 2014 by Dimitris
http://recoverymonkey.org/2014/01/24/is-convenience-devaluing-products-does-quality-suffer-because-of-it/

Kind of a long hiatus in posting (far too busy working on cool stuff), and for those of you looking for a deep technical post this may not be it… but here goes anyway, since the content may also apply to my more usual subjects.

Recently I decided to discard my Luddite membership card and join the hordes of people using network-based services for music.

The experiment is ongoing – I do like the convenience of being able to select almost any song or album for a monthly cost equivalent to less than the price of a single album.

It’s a pretty good deal if you listen to a lot of new music, and/or you don’t like listening to ads on the radio.

There’s a plethora of free offerings but if you are mobile and want to use it on your phone, there’s usually a cost involved to have the convenience of selecting the exact songs you like.

How convenience has affected me

I did notice several interesting aspects in which this newfound convenience has changed my listening habits in a positive way:

  1. I am discovering a lot more new music since it’s so incredibly easy to do so. And some old music I never gave a chance to.
  2. Sharing music with other people is easy and involves no illegal copying of data.
  3. I don’t have to worry about putting the “right” music in my portable device – I can stream what I want, from wherever I want, even on devices I don’t own, and even designate items for “offline use” – meaning they’ll be cached and playable even if I’m not connected to a network.
  4. I have easy access to most of my oldie favorites that I might normally not keep in my device due to space reasons.
  5. The quality is very good. But we won’t go into psychoacoustics here :)

However, there have also been some pretty negative aspects to all this convenience… for instance:

  1. I realize I now suffer from music ADD – I seldom just sit down and listen to a whole album like we all used to do in the olden days.
  2. Albums now have zero monetary value in my mind – they’re just part of the low monthly fee.
  3. If albums have a perceived zero monetary value, they become a commodity and not something to be treasured. I remember when it was a huge deal to get a new album from my favorite artists: the anticipation, the excitement, the trip to the record store, waiting in line, scarcity, the artwork in the packaging, the sheer physicality of it all. This combination of attributes ensured I would at least give that album a chance – indeed, I was likely to listen to it repeatedly, analyze it and appreciate the artistry involved. I was invested.
  4. As a result of this devaluing, amazing works of art that were extremely difficult to accomplish may now be skipped altogether because they’re a bit time-consuming or even difficult to “get into” – some concept albums you just need to be in the right frame of mind for and/or have the requisite amount of time to listen to the story unfold. Since there’s no perceived investment and no excitement, I’m less likely to spend the energy trying to get into the album, no matter how rewarding it may be in the end.
  5. For something more practical: The toll on the mobile devices’ batteries is 2-3x that of just playing music natively (even without streaming – the tracks are encrypted so you can’t just lift them from the storage, which adds CPU cycles to decrypt, plus some products use codecs more computationally intensive than mp3). Best have a device with a fast CPU.
  6. An extended unplanned network or music provider outage will mean no access to music.

How this applies to other aspects of our lives

I wonder now what other conveniences have affected our lives significantly?

And are we all looking for that quick fix, the easy way out?

Are we heading towards the world depicted in the movie Idiocracy? (very interesting flick – it’s worth watching for the premise alone).

Already, most of us in the more “civilized” parts of the globe don’t know how to hunt down and skin an animal, build a weapon, start a fire, or build a shelter. That is knowledge that convenience robbed us of many years ago. You can study how to do those things, but chances are, if you ever need to do so, you won’t have the training to be anywhere near as successful as our ancestors were in those endeavors.

Same goes for taking pictures – aside from a few people that still develop and print their own film, most of us use digital (with the same deluge-of-information problem described in the music section above – thousands of pictures may now be taken during a vacation, where previously no more than a hundred would be taken, with tremendous love and care – and most of that hundred were keepers).

Many of us are getting heavier, too – convenient access to food and low levels of physical activity (since locomotion is so convenient) being the killer combination.

Does quality suffer because of convenience?

Conveniences aren’t a bad thing overall – I am not hankering for the destruction of all things convenient. However, I posit that certain aspects of quality absolutely suffer because of convenience:

  1. Consumers are more likely to pick an easier to use, throw-away and even short-sighted product over a better-engineered, longer-lasting one – shifting the engineering emphasis instead to ease-of-use and disposability.
  2. The quality of workers in many fields isn’t what it used to be.
  3. We are heading towards more generalists and fewer specialists.
  4. Troubleshooting is becoming a lost art.

I’m not sure how to even conclude – I’m probably part of the problem since one of the things I do is help make very advanced technology easier to consume and more forgiving.

Just don’t get too comfortable.

[Image: toilet chair]

Thx

D

EMC’s VNX2 forklift: The importance of hardware re-use, slowing down obsolescence, and maximizing your investment
Posted Tue, 24 Sep 2013 by Dimitris
http://recoverymonkey.org/2013/09/24/emcs-vnx2-forklift-the-importance-of-hardware-re-use-slowing-down-obsolescence-and-maximizing-your-investment/

It was with interest that I watched the launch of EMC’s VNX refresh. The updated boxes got some long-awaited features, EMC talked a lot about how some pretty severe single-threaded bottlenecks were removed, more CPU and memory were put in, and there was much rejoicing.

Not really trying to pick on the new boxes (that will be in a future post, relax :) ), but what I thought was interesting was that the code that makes most of the new features a possibility cannot be loaded on current-gen VNX boxes (not even the biggest, the 7500, which has plenty of CPU and RAM juice even compared to the next gen boxes).

Software-defined storage indeed.

The existing VNX disk shelves also seemingly can’t be re-used at the moment (correct me if I’m wrong please).

This forced obsolescence has been a theme with EMC: Clariion -> VNX -> VNX2 are all complete forklift upgrades. When the original VNX was released, it was utterly incompatible with the CX (Clariion) shelves (SAS replaced FC connectivity), despite using the exact same code (FLARE).

Other vendors are guilty of this too – a new controller is released, and all prior investments on disk shelves are rendered useless (HDS did this with several iterations of AMS, maybe even with AMS -> HUS, HP with EVA…)

I understand that as technology progresses we sometimes have to abandon the old to make room for the new but, at the same time, customers make significant investments in N-1 technology – and often want to be able to re-use some of their investment with N (and sometimes N+1).

I just had this conversation with a customer, and he said “well, I throw away my gear every 3 years, why should I care?”

Let’s try a thought experiment.

Imagine you just bought a system that’s running the fastest controllers a company sells today, and you got 1PB of storage behind it.

Now, imagine that a mere month after you purchased your controllers, new ones are released that are significantly faster. OK, that stuff happens.

Your gear is not 3 years old yet. It’s 1 month old. It’s running OK.

Now, imagine that your array runs out of steam 6 months later due to unprecedented performance growth. Your system is now 7 months old. You can’t just throw it away and start fresh.

You could buy a new storage system and migrate some of the data to share the load. However, you don’t need more space – you just ran out of controller headroom. Indeed, you still have tons of free space.

But what if you could replace the controllers with the new, beefier ones? And maintain your investment in the 1PB of storage, cache etc? Wouldn’t that be nice?

Or at least be able to move some of the storage pools you may have to the new family of controllers? Even if you had to reformat the disk?

Well – most vendors will tell you “sorry, no, you need to migrate your data to the new box”.

Let’s try another thought experiment.

You bought a storage system a year ago. It performs fine, but it lacks true deduplication capabilities. You have determined it would save you a lot of storage space (= money) if your array had deduplication.

The vendor you purchased the system from announces the refreshed storage OS that finally includes deduplication. And that same vendor made a truly gigantic fuss about software-defined storage, which made everyone feel software would be the big enabler and that coolness was a mere firmware upgrade away.

However, you are eventually told they will not allow the code that enables deduplication to be loaded to your array, and, instead, they ask you to migrate to the refreshed array that supports deduplication. Since the updated code somehow only runs on the new box. Something to do with unicorn milk.

But your array has plenty of CPU headroom to handle deduplication… and you could reformat the disks given some swing storage shelves if the underlying disk format is the issue. But the option is not provided.

How NetApp does things instead

At NetApp we sort of take it for granted so we don’t make a big fuss about software-defined storage, but hardware was always considered an enabler for the software and not the other way around. 

For instance: deduplication was released as a free software upgrade for Data ONTAP (the OS for our main line of storage). Back in 2007. For all storage protocols.

In general, we try to let systems be able to load at a minimum N+1 software releases, but most of the time we utterly spoil customers and go far above and beyond, unless we’re talking about the smallest boxes, which naturally have less headroom.

For example, the now aging FAS 3070 I have in the local lab (the bigger of the older midrange boxes, released in 2006) supports anything from ONTAP 7.2.1 (what it was released with) to ONTAP 8.1.3 (released in mid-2013).

This spans multiple major ONTAP releases – huge changes in the code have happened during those releases: 7.3, 8.0, 8.1… Multiple newer arrays were also released as replacements for the 3070: 3170, 3270, 3250.

Yet the 3070 soldiers on with a fully supported, modern OS, 7 years later.

What arrays did our competitors have back then? What is the most modern OS those same arrays can run today? What is that OS missing vs the OS that competitor’s more modern arrays have?

Let’s talk disk shelves.

We used FC loop connectivity for the older shelves (DS14). We then switched to fancy multi-channel SAS and totally different shelves and disks, but never stopped supporting the older shelf technology, even with newer controllers.

That’s the big one. I have customers with DS14 shelves that they purchased for a 3070 that they now have on a 3270 running 8.2. It all works, all supported. Other vendors cut support off after the transition from FC to SAS.

Will we support those older shelves forever? No, that’s impossible, but at least we give our customers a lot of leeway and let them stretch their hardware investments significantly longer than any other major storage vendor I can think of.

Think long term

I encourage customers to always think long term. Try to stop thinking in 3-year increments. Start thinking of other ways you can stretch your investment, other ways to deploy older gear while still keeping it interoperable with newer hardware.

And start thinking about what will happen to your investment once newer gear is released.

D

How to decipher EMC’s new VNX pre-announcement and look behind the marketing
Posted Thu, 16 May 2013 by Dimitris
http://recoverymonkey.org/2013/05/16/how-to-decipher-emcs-new-vnx-pre-announcement-and-look-behind-the-marketing/

It was with interest that I watched some of EMC’s announcements during EMC World. Partly due to competitor awareness, and partly due to being an irrepressible nerd, hoping for something really cool.

BTW: Thanks to Mark Kulacz for assisting with the proof points. Mark, as much as it pains me to admit so, is quite possibly an even bigger nerd than I am.

So… EMC did deliver something. A demo of the possible successor to VNX (VNX2?), unavailable as of this writing (indeed, a lot of fuss was made about it being lab only etc).

One of the things they showed was increased performance vs their current top-of-the-line VNX7500.

The aim of this article is to show that the increases are not proportionally as large as EMC claims, and/or that they’re not so much because of software, and, moreover, that some planned obsolescence might be coming the way of the VNX for no good reason. Aside from making EMC more money, that is.

A lot of hoopla was made about software being the key driver behind all the performance increases, and how they are now able to use all CPU cores, whereas in the past they couldn’t. Software this, software that. It was the theme of the party.

OK – I’ll buy that. Multi-core enhancements are a common thing in IT-land. Parallelization is key.

So, they showed this interesting chart (hopefully they won’t mind me posting this – it was snagged from their public video):

[Image: EMC chart comparing current and new VNX core utilization]

I added the arrows for clarification.

Notice that the chart above left shows the current VNX using, according to EMC, maybe a total of 2.5 out of the 6 cores if you stack everything up (for instance, Core 0 is maxed out, Core 1 is 50% busy, Cores 2-4 do little, Core 5 does almost nothing). This is important and we’ll come back to it. But, currently, if true, this shows extremely poor multi-core utilization. It seems like there is a dedication of processes to cores – Core 0 does RAID only, for example. Maybe a way to lower context switches?

Then they mentioned how the new box has 16 cores per controller (the current VNX7500 has 6 cores per controller).

OK, great so far.

Then they mentioned how, By The Holy Power Of Software,  they can now utilize all cores on the upcoming 16-core box equally (chart above, right).

Then, comes the interesting part. They did an IOmeter test for the new box only.

They mentioned how the current VNX 7500 would max out at 170,000 8K random reads from SSD (this in itself is a nice nugget when dealing with EMC reps claiming insane VNX7500 IOPS). And that the current model’s relative lack of performance is due to the fact its software can’t take advantage of all the cores.

Then they showed the experimental box doing over 5x that I/O. That is impressive indeed, even though it’s hardly a realistic way to prove performance; I accept that they were trying to show how much more read-only speed they could get out of the extra cores, and it makes for a cooler marketing number.

Writes are a whole separate wrinkle for arrays, of course. Then there are all the other ways VNX performance goes down dramatically.

However, all this leaves us with a few big questions:

  1. If this is really all about just optimized software for the VNX, will it also be available for the VNX7500?
  2. Why not show the new software on the VNX7500 as well? After all, it would probably increase performance by over 2x, since it would now be able to use all the cores equally. Of course, that would not make for good marketing. But if with just a software upgrade a VNX7500 could go 2x faster, wouldn’t that decisively prove EMC’s “software is king” story? Why pass up the opportunity to show this?
  3. So, if, with the new software the VNX7500 could do, say, 400,000 read IOPS in that same test, the difference between new and old isn’t as dramatic as EMC claims… right? :)
  4. But, if core utilization on the VNX7500 is not as bad as EMC claims in the chart (why even bother with the extra 2 cores on a VNX7500 vs a VNX5700 if that were the case), then the new speed improvements are mostly due to just a lot of extra hardware. Which, again, goes against the “software” theme!
  5. Why do EMC customers also need XtremeIO if the new VNX is that fast? What about VMAX? :)

Point #4 above is important. For instance, EMC has been touting multi-core enhancements for years now. The current VNX FLARE release has 50% better core efficiency than the one before, supposedly. And, before that, in 2008, multi-core was advertised as getting 2x the performance vs the software before that. However, the chart above shows extremely poor core efficiency. So which is it? 

Or is it maybe that the box demonstrated is getting most of its speed increase not so much by the magic of better software, but mostly by vastly faster hardware – the fastest Intel CPUs (more clockspeed, not just more cores, plus more efficient instruction processing), latest chipset, faster memory, faster SSDs, faster buses, etc etc. A potential 3-5x faster box by hardware alone.
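To see why points 3 and 4 matter, here’s a rough sketch using only EMC’s own figures (170K IOPS today, roughly 2.5 of 6 cores effectively busy per their chart, “over 5x” on the 16-core demo box); the split between software and hardware below is extrapolation, not measurement:

    # Rough apportioning of EMC's claimed gains between software and hardware.
    current_iops = 170_000          # EMC's stated VNX7500 limit for 8K random reads from SSD
    effective_cores = 2.5           # what their utilization chart implies is actually used
    total_cores = 6                 # cores per VNX7500 controller

    # If the new software simply used all 6 existing cores evenly:
    software_only = current_iops * (total_cores / effective_cores)
    print(round(software_only))                 # ~408,000 IOPS on the existing hardware

    demo_iops = current_iops * 5                # "over 5x" on the 16-core demo box
    print(round(demo_iops / software_only, 2))  # ~2.1x left over, attributable to hardware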

It doesn’t quite add up as being a software “win” here.

However – I (or at least current VNX customers) probably care more about #1. Since it’s all about the software after all:)

If the new software helps so much, will they make it available for the existing VNX? Seems like any of the current boxes would benefit since many of their cores are doing nothing according to EMC. A free performance upgrade!

However… If they don’t make it available, then the only rational explanation is that they want to force people into the new hardware – yet another forklift upgrade (CX->VNX->”new box”).

Or maybe that there’s some very specific hardware that makes the new performance levels possible. Which, as mentioned before, kinda destroys the “software magic” story.

If it’s all about “Software Defined Storage”, why is the software so locked to the hardware?

All I know is that I have an ancient NetApp FAS3070 in the lab. The box was released ages ago (2006 vintage), and yet it’s running the most current GA ONTAP code. That’s going back 3-4 generations of boxes, and it launched with software that was very, very different to what’s available today. Sometimes I think we spoil our customers.

Can a CX3-80 (the beefiest of the CX3 line, similar vintage to the NetApp FAS3070) take the latest code shown at EMC World? Can it even take the code currently GA for VNX? Can it even take the code available for CX4? Can a CX4-960 (again, the beefiest CX4 model) take the latest code for the shipping VNX? I could keep going. But all this paints a rather depressing picture of being able to stretch EMC hardware investments.

But dealing with hardware obsolescence is a very cool story for another day.

D

 

More EMC VNX caveats
Posted Mon, 06 May 2013 by Dimitris
http://recoverymonkey.org/2013/05/06/more-emc-vnx-caveats/

Lately, when competing with VNX, I see EMC using several points to prove they’re superior (or at least not deficient).

I’d already written this article a while back, and today I want to explore a few aspects in more depth since my BS pain threshold is getting pretty low. The topics discussed:

  1. VNX space efficiency
  2. LUNs can be served by either controller for “load balancing”
  3. Claims that autotiering helps most workloads
  4. Claims that storage pools are easier
  5. Thin provisioning performance (this one’s interesting)
  6. The new VNX snapshots

References to actual EMC documentation will be used. Otherwise I’d also be no better than a marketing droid.

VNX space efficiency

EMC likes claiming they don’t suffer from the 10% space “tax” NetApp has. Yet, linked here are the best practices showing that in an autotiered pool, at least 10% free space should be available per tier in order for autotiering to be able to do its thing (makes sense).

Then there’s also a 3GB minimum overhead per LUN, plus metadata overhead, calculated with a formula in the linked article. Plus possibly more metadata overhead if they manage to put dedupe in the code.
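Just to illustrate how these overheads stack, here’s a sketch; the 10% free-space guidance and the ~3GB-per-LUN figure are from the best-practice documents linked above, while the metadata percentage is a placeholder (the real formula is in EMC’s document, so treat that value as hypothetical):

    # Illustrative only: how VNX pool overheads add up for a hypothetical pool.
    def usable_pool_gb(raw_gb, num_luns, metadata_fraction=0.02):
        free_reserve = raw_gb * 0.10           # ~10% kept free per tier for autotiering
        lun_overhead = num_luns * 3            # ~3GB minimum overhead per pool LUN
        metadata = raw_gb * metadata_fraction  # placeholder for the metadata formula
        return raw_gb - free_reserve - lun_overhead - metadata

    print(usable_pool_gb(10_000, 20))          # a 10TB pool with 20 LUNs -> ~8740 usable GB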

My point is: There’s no free lunch. If you want certain pool-related features, there is a price to pay. Otherwise, keep using the old-fashioned RAID groups that don’t offer any of the new features but at least offer predictable performance and capacity utilization.

LUNs can be served by either controller for “load balancing”

This is a fun one. The claim is that LUN ownership can be instantly switched over from one VNX controller to another in order to load balance their utilization. Well – as always, it depends. It’s also important to note that VNX as of this writing does not do any sort of automatic load balancing of LUN ownership based on load.

  1. If it’s using old-fashioned RAID LUNs: Transferring LUN ownership is indeed doable with no issues. It’s been like that forever.
  2. If the LUN is in a pool – different story. There’s no quick way to shift LUN ownership to another controller without significant performance loss.

There’s copious information here. Long story short: You don’t change LUN ownership with pools, but rather need to do a migration of the LUN contents to the other controller (to another LUN, you can’t just move the LUN as-is – this also creates issues), otherwise there will be a performance tax to pay.

Claims that autotiering helps most workloads

Not so FAST. EMC’s own best practice guides are rife with caveats and cautions regarding autotiering. Yet this feature is used as a gigantic differentiator at every sales campaign.

For example, in the very thorough “EMC Scaling performance for Oracle Virtual Machine“, the following graph is shown on page 35:

[Image: performance graph from “EMC Scaling performance for Oracle Virtual Machine”, page 35]

The arrows were added by me. Notice that most of the performance benefit is provided once cache is appropriately sized. Adding an extra 5 SSDs for VNX tiering provides almost no extra benefit for this database workload.

One wonders how fast it would go if an extra 4 SSDs were added for even more cache instead of going to the tier… :)

Perchance the all-cache line with 8 SSDs would be faster than 4 cache SSDs and 5 tier SSDs, but that would make for some pretty poor autotiering marketing.

Claims that storage pools are easier

The typical VNX pitch to a customer is: Use a single, easy, happy, autotiered pool. Despite what marketing slicks show, unfortunately, complexity is not really reduced with VNX pools – simply because single pools are not recommended for all workloads. Consider this typical VNX deployment scenario, modeled after best practice documents:

  1. RecoverPoint journal LUNs in a separate RAID10 RAID group
  2. SQL log LUNs in a separate RAID10 RAID group
  3. Exchange 2010 log LUNs in a separate RAID10 RAID group
  4. Exchange 2010 data LUNs can be in a pool as long as it has a homogeneous disk type, otherwise use multiple RAID groups
  5. SQL data can be in an autotiered pool
  6. VMs might have to go in a separate pool or maybe share the SQL pool
  7. VDI linked clone repository would probably use SSDs in a separate RAID10 RAID group

OK, great. I understand that all the I/O separation above can be beneficial. However, the selling points of pooling and autotiering are that they should reduce complexity, reduce overall cost, improve performance and improve space efficiency. Clearly, that’s not the case at all in real life. What is the reason all the above can’t be in a single pool, maybe two, and have some sort of array QoS to ensure prioritization?

And what happens to your space efficiency if you over-allocate disks to the old-fashioned RAID groups above? How do you get the space back?

What if you under-allocated? How easy would it be to add a bit more space or performance? (not 2-3x – let’s say you need just 20% more). Can you expand an old-fashioned VNX RAID group by a couple of disks?

And what’s the overall space efficiency now that this kind of elaborate split is necessary? Hmm… ;)

For more detail, check these Exchange and SQL design documents.

Thin provisioning performance

This is just great.

VNX thin provisioning performs very poorly relative to thick and even more poorly relative to standard RAID groups. The performance issue makes complete sense due to how space is allocated when writing thin on a VNX, with 8KB blocks assigned as space is being used. A nice explanation of how pool space is allocated is here. A VNX writes to pools using 1GB slices. Thick LUNs pre-allocate as many 1GB slices as necessary, which keeps performance acceptable. Thin LUNs obviously don’t pre-allocate space and currently have no way to optimize writes or reads – the result is fragmentation, in addition to the higher CPU, disk and memory overhead to maintain thin LUNs :)
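Here’s a toy model (not the actual VNX allocator) of why allocate-on-write leads to fragmentation: when two LUNs write concurrently, thin allocation hands out space in arrival order, so each LUN’s logically sequential blocks end up interleaved on the back end, while thick pre-allocation keeps them contiguous:

    # Toy illustration of thin vs. thick allocation and the resulting layout.
    def thin_layout(writes):
        # space is granted in arrival order, whichever LUN the write came from
        return [lun for lun, _ in writes]

    def thick_layout(writes, blocks_per_lun):
        # each LUN gets a contiguous pre-allocated region up front
        return [lun for lun in sorted({l for l, _ in writes}) for _ in range(blocks_per_lun)]

    # LUN A and LUN B each write 4 sequential blocks, interleaved in time
    writes = [("A", 0), ("B", 0), ("A", 1), ("B", 1), ("A", 2), ("B", 2), ("A", 3), ("B", 3)]
    print("thin :", thin_layout(writes))      # A B A B A B A B  -> fragmented
    print("thick:", thick_layout(writes, 4))  # A A A A B B B B  -> contiguous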

From the Exchange 2010 design document again, page 23:

[Image: thin provisioning guidance from the Exchange 2010 design document, page 23]

Again, I added the arrows to point out a couple of important things:

  1. Thin provisioning is not recommended for high performance workloads on VNX
  2. Indeed, it’s so slow that you should run your thin pools with RAID10!!!

But wait – thin provisioning is supposed to help me save space, and now I have to run it with RAID10, which chews up more space?

Kind of an oxymoron.

And what if the customer wants the superior reliability of RAID6 for the whole pool? How fast is thin provisioning then?

Oh, and the VNX has no way to fix the fragmentation that’s rampant in its thin LUNs. Short of a migration to another LUN (kind of a theme it seems).

The new VNX snapshots

The VNX has a way to somewhat lower the traditionally extreme impact of FLARE snapshots by switching from COFW (Copy On First Write) to ROFW (Redirect On First Write).
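As a quick illustration of why the redirect approach is lighter (a simplified model, not the actual FLARE/VNX implementation): copy-on-first-write has to preserve the old block before accepting the new one, while redirect-on-write just writes the new block elsewhere and keeps pointers:

    # Simplified I/O cost per first overwrite of a block that a snapshot protects.
    def cofw_ios():
        # read the original block, copy it to the snapshot area, then write the new data in place
        return {"reads": 1, "writes": 2}

    def rofw_ios():
        # write the new data to a fresh location; the snapshot keeps pointing at the old block
        return {"reads": 0, "writes": 1}   # plus a (cheap, often batched) metadata update

    print("COFW:", cofw_ios())
    print("ROFW:", rofw_ios())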

The problem?

The new VNX snapshots need a pool, and need thin LUNs. It makes sense from an engineering standpoint, but…

Those are exactly the 2 VNX features that lower performance.

There are many other issues with the new VNX snapshots, but that’s a story for another day. It’s no wonder EMC pushes RecoverPoint far more than their snaps…

The takeaway

There’s marketing, and then there’s engineering reality.

Since the VNX is able to run both pools and old-fashioned RAID groups, marketing wisely chooses to not be very specific about what works with what.

The reality though is that all the advanced features only work with pools. But those come with significant caveats.

If you’re looking at a VNX – at least make sure you figure out whether the marketed features will be usable for your workload. Ask for a full LUN layout.

And we didn’t even talk about having uniform RAID6 protection in pools, which is yet another story for another day.

D

Are SSDs reliable enough? The importance of extensive testing under adverse conditions
Posted Wed, 24 Apr 2013 by Dimitris
http://recoverymonkey.org/2013/04/24/are-ssds-reliable-enough-the-importance-of-extensive-testing-under-adverse-conditions/

Recently, interesting research (see here) from researchers at Ohio State was presented at USENIX.

To summarize, they tested 15 SSDs, several of them “Enterprise” grade, and subjected them to various power fault conditions. 

Almost all the drives suffered data loss that should not have occurred, and some were so corrupt as to be rendered utterly unusable (could not even be seen on the bus). It’s worth noting that spinning drives used in enterprise arrays would not have suffered the same way.

It’s not just an issue of whether or not the SSD has some supercapacitors in order to de-stage the built-in RAM contents to flash – a certain very prominent SSD vendor was hit with this issue even though the SSDs in question had the supercapacitors, generous overprovisioning and internal RAID. A firmware issue is suspected and this is not fixed yet.

You might ask, why am I mentioning this?

Several storage systems try to lower SSD costs by using cheap SSDs (often consumer models found in laptops, not even eMLC) and then try to get more longevity out of said SSDs by using clever write techniques in order to minimize the amount of data written (dedupe, compression), as well as make the most of wear-leveling the flash chips in the box by also writing in flash-friendly ways (more appends, fewer overwrites, moving data around as needed, and more).

However, all those (perfectly valid) techniques have a razor-sharp focus on the fact that cheaper flash has a very limited number of write/erase cycles, but are utterly unrelated to things like massive corruption stemming from weird power failures or firmware bugs (and, after having lived through multiple UPS and generator failures, I don’t accept those as a complete answer, either).

On the other hand, the Tier 1 storage vendors typically do pretty extensive component testing, including various power failure scenarios, from the normal to the very strange. The system has to withstand those, then come up no matter what. Edge cases are tested as a matter of course – a main reason people buy enterprise storage is how edge cases are handled… :)

At NetApp, when we certify SSDs, they go through an extra-rigorous process since we are paranoid and they are still a relatively new technology. We also offer our standard dual-parity RAID, along with multiple ways to safeguard against lost writes, for all media. The last thing one needs is multiple drives failing due to a strange power failure or a firmware bug.

Protection against failures is even more important in storage systems that lack the extra integrity checks NetApp offers. Non-NetApp systems that use SSDs either as their only storage or as part of a pool can suffer catastrophic failures if the integrity of the SSDs is compromised sufficiently, since, by definition, if part of the pool fails then the whole pool fails – which could mean the entire storage system has to be restored from backup.

For those systems where cheap SSDs are merely used as an acceleration mechanism, catastrophic performance failures are a very real potential outcome. 1000 VDI users calling the helpdesk is not my idea of fun.

Such component behavior is clearly unacceptable.

Proper testing comes with intelligence, talent, but also experience and extensive battle scarring. Back when NetApp was young, we didn’t know the things we know today, and couldn’t handle some of the fault conditions we can handle today. Test harnesses in most Tier 1 vendors become more comprehensive as new problems are discovered, and sometimes the only way to discover the really weird problems is through sheer numbers (selling many millions of units of a certain component provides some pretty solid statistics regarding its reliability and failure modes).

“With age comes wisdom”.

 

D

 

Beware of benchmarking storage that does inline compression
Posted Mon, 25 Feb 2013 by Dimitris
http://recoverymonkey.org/2013/02/25/beware-of-benchmarking-storage-that-does-inline-compression/

In this post I will examine the effects of benchmarking highly compressible data and why that’s potentially a bad idea.

Compression is not a new storage feature. Of the large storage vendors, at a minimum NetApp, EMC and IBM can do it (depending on the array). <EDIT (thanks to Matt Davis for reminding me): Some arrays also do zero detection and will not write zeroes to disk – think of it as a specialized form of compression that ONLY works on zeroes>

A lot of the newer storage vendors are now touting real-time compression for all data (often used instead of true deduplication – it’s just easier to implement compression).

Nothing wrong with real-time compression. However, and here’s where I have a problem with some of the sales approaches some vendors follow:

Real-time compression can provide grossly unrealistic benchmark results if the benchmarks used are highly compressible!

Compression can indeed provide a performance benefit for various data types (simply since less data has to be read and written from disk), with the tradeoff being CPU. However, most normal data isn’t composed of all zeroes. Typically, compressing data will provide a decent benefit on average, but usually not a severalfold one.

So, what will typically happen is, a vendor will drop off one of their storage appliances and provide the prospect with some instructions on how to benchmark it with your garden variety benchmark apps. Nothing crazy.

Here’s the benchmark problem

A lot of the popular benchmarks just write zeroes. Which of course are extremely easy for compression and zero-detect algorithms to deal with and get amazing efficiency out of, resulting in extremely high benchmark performance.
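Even before touching a filesystem, a couple of lines show the scale of the effect (zlib here is just a stand-in for whatever inline algorithm an array might use):

    # All-zero data compresses almost to nothing; random data barely compresses at all.
    import os
    import zlib

    zeros = bytes(1024 * 1024)             # 1MB of zeroes, like many benchmark tools write
    random_data = os.urandom(1024 * 1024)  # 1MB of incompressible data

    print(len(zlib.compress(zeros)))        # on the order of a kilobyte
    print(len(zlib.compress(random_data)))  # roughly 1MB -- essentially incompressible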

I wanted to prove this out in an easy way that anyone can replicate with free tools. So I installed Fedora 18 with the btrfs filesystem and ran the bonnie++ benchmark with and without compression. The raw data with mount options etc. is here. An explanation of the various fields is here. Not everything is accelerated by btrfs compression in the bonnie++ benchmark, but a few things really are (sequential writes, rewrites and reads):

[Image: bonnie++ results on btrfs with and without compression]

Notice the gigantic improvement (in write throughput especially) btrfs compression affords with all-zero data.

Now, does anyone think that, in general, the write throughput will be 300MB/s for a decrepit 5400 RPM SATA disk?  That will be impossible unless the user is constantly writing all-zero data, at which point the bottlenecks lie elsewhere.

Some easy ways for dealing with the compressible benchmark issue

So what can you do in order to ensure you get a more realistic test for your data? Here are some ideas:

  • Always best is to use your own applications and not benchmarks. This is of course more time-consuming and a bigger commitment. If you can’t do that, then…
  • Create your own test data using, for example, dd and /dev/random as a source in some sort of Unix/Linux variant. Some instructions here. You can even move that data to use with Windows and IOmeter – just generate the random test data in UNIX-land and move the file(s) to Windows (a Python equivalent is sketched right after this list).
  • Another, far more realistic way: Use your own data. In IOmeter, you just copy one of your large DB files to iobw.tst and IOmeter will use your own data to test… Just make sure it’s large enough and doesn’t all fit in array cache. If not large enough, you could probably make it large enough by concatenating multiple data files and random data together.
  • Use a tool that generates incompressible data automatically, like the AS-SSD benchmark (though it doesn’t look like it can deal with multi-TB stuff – but worth a try).
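For those who’d rather not use dd, here’s a minimal Python equivalent of the second suggestion above; the file name matches the IOmeter convention mentioned, but the path and size are just examples:

    # Generate an incompressible test file that benchmark tools can use.
    import os

    def make_incompressible_file(path, size_gb):
        chunk = 64 * 1024 * 1024                 # write in 64MB chunks to bound memory use
        remaining = size_gb * 1024 ** 3
        with open(path, "wb") as f:
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(os.urandom(n))           # random bytes compress (almost) not at all
                remaining -= n

    make_incompressible_file("iobw.tst", 10)     # e.g. a 10GB file for IOmeter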

 

In all cases though, be aware of how you are testing. There is no magic :)

D

 

Are some flash storage vendors optimizing too heavily for short-lived NAND flash?
Posted Fri, 22 Feb 2013 by Dimitris
http://recoverymonkey.org/2013/02/22/are-some-flash-storage-vendors-optimizing-too-heavily-for-short-lived-nand-flash/

I really resisted using the “flash in the pan” phrase in the title… first, because the term is overused and second, because I don’t believe solid state is of limited value. On the contrary.

However, I am noticing an interesting trend among some newcomers in the array business, desperate to find a flash niche to compete in:

Writing their storage OS around very specific NAND flash technologies. Almost as bad as writing an entire storage OS to support a single hypervisor technology, but that’s a story for another day.

Solid state technology is still too fluid. Unlike spinning disk technology that is overall very reliable and mature and likely won’t see huge advances in the years to come, solid state technology seems to advance almost weekly. New SSD controllers are coming out almost too frequently, and new kinds of solid state storage are either out now (Triple Level Cell, anyone?) or coming in the future (MRAM, ReRAM, FeRAM, PCM, PMC, and probably a lot more that I’m forgetting).

My point is:

How far ahead are certain vendors thinking if they are writing an entire storage OS around the limitations of a class of storage that may look very different in just a year or two?

Some of them go really deep and try to do all kinds of clever optimizations to ensure good wear leveling for the flash chips. Some write their own controller software and use bare NAND flash chips, not even off-the-shelf SSDs. Which is great, but what if you don’t need to do that in two years? Or what if the optimizations need to be drastically different for the new technologies? How long will coding for the new flash technologies take? Or will they be stuck using old technologies? Food for thought.

I guess some of us are in it for the long haul, and some aren’t. “Can’t see the forest for the trees” comes to mind. “Gold rush” also seems relevant.

I strongly believe general-purpose storage OSes need to be flexible enough to be reasonably adaptable to different underlying media. And storage OSes that are specifically designed for solid state storage need to be especially flexible regarding the underlying SSD technology to avoid the problems outlined above, and to avoid the relative lack of reliability of current SSD solutions (another story for another day).

At the moment I don’t see clear winners yet. I see a few great short-term stories, but who has the most flexible architecture to be able to deal with different kinds of technologies for years to come?

D

Filesystem and OS benchmark extravaganza: Software makes a huge difference
Posted Tue, 05 Feb 2013 by Dimitris
http://recoverymonkey.org/2013/02/04/filesystem-and-os-benchmark-extravaganza-software-makes-a-huge-difference/

I’ve published benchmarks of various OSes and filesystems in the past, but this time I thought I’d try a slightly different approach.

For the regular readers of my articles, I think there is something in here for you. The executive summary is this:

Software is what makes hardware sing. 

All too frequently I see people looking at various systems and focus on what CPU each box has, how many GHz, cores etc, how much memory. They will try to compare systems by putting the specs on a spreadsheet.

OK – let’s see how well that approach works.

For this article I used the NetApp postmark benchmark. You can find the source and various executables here. I’ve been using this version for many years. The way I always run it:

  • set directories 5
  • set number 10000
  • set transactions 20000
  • set read 4096
  • set write 4096
  • set buffering false
  • set size 500 100000
  • run

This way it will create 10,000 files and do 20,000 transactions with them, all in 5 directories, with 4K reads and writes, and the files will range from 500 bytes to 100,000 bytes in size. It will also ask the OS to bypass the buffer cache. Then it will delete everything.
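For anyone who wants to script the runs instead of typing the commands interactively, here is a minimal sketch in Python. It assumes a postmark binary compiled from the source linked above sits in the current directory and accepts a command file as its argument (the builds I have seen do); the extra "quit" command is my addition so it exits after printing its report.

    import subprocess

    # The exact workload described above, plus "quit" so postmark exits
    # after printing its report (total time, tps, MB/s read and written).
    commands = "\n".join([
        "set directories 5",
        "set number 10000",
        "set transactions 20000",
        "set read 4096",
        "set write 4096",
        "set buffering false",
        "set size 500 100000",
        "run",
        "quit",
    ]) + "\n"

    with open("pm.cfg", "w") as f:
        f.write(commands)

    result = subprocess.run(["./postmark", "pm.cfg"], capture_output=True, text=True)
    print(result.stdout)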

It’s fairly easy to interpret since the workload is pre-determined: a low total time to complete the workload indicates good performance, rather than having to take an IOPS/latency measurement.

This is very much a transactional/metadata-heavy way to run the benchmark, and not a throughput one (as you will see from the low MB/s figures). Overall, this is an old benchmark, not multithreaded, and probably not the best way to test stuff any more, but, again, for the purposes of this experiment it’s more than enough to illustrate the point of the article (if you want to see more comprehensive tests, visit phoronix.com).

It’s very important to realize one thing before looking at the numbers: This article was NOT written to show which OS or filesystem is overall faster – just to illustrate the difference the choice of OS and filesystem can make to I/O performance given the same exact base hardware and the same application driving the same workload.

 

Configuration

OSes tested (all 64-bit, latest patches as of the time of writing):

  • Windows Server 2012
  • Ubuntu 12.10 (in Linux Mint guise)
  • Fedora 18
  • Windows 8
  • Windows 7

For Linux, I compiled postmark with the -O3 optimization flag, used the deadline I/O scheduler, disabled filesystem barriers and applied the following sysctl tweaks:

  • vm.vfs_cache_pressure=50
  • vm.swappiness=20
  • vm.dirty_background_ratio=10
  • vm.dirty_ratio=30

Filesystems tested:

  • NTFS
  • XFS
  • BTRFS
  • EXT4

The hardware is unimportant; the point is that it was the exact same for all tests and that there was no CPU or memory starvation… There was no SSD involved, BTW, even though some of the results may make it seem like there was.

 

The results

First, transactions per second. This covers only the transaction phase (it doesn’t include the initial file creation or the final deletion, which the total time below does). Higher is better.

Postmark tps

 

Second, total time. This includes all transactions, file creation and deletion:

Postmark time

 

Finally, the full table that includes the mount parameters and other details:

Postmark table

 

Some interesting discoveries…

…At least in the context of this test; the same findings may or may not apply to other workloads and hardware configurations:

  • Notice how incredibly long Windows 8 took with its default configuration. I noticed the ReadyBoost file being modified while the benchmark was running, which must be a serious bug, since ReadyBoost isn’t normally used for ephemeral stuff (Windows 7 didn’t touch that file while the benchmark was running). If you’re running Windows 8 you probably want to consider disabling the Superfetch service… (I believe it’s disabled by default if Windows detects an SSD).
  • BTRFS performance varies too much between Linux kernels (to be expected from a still experimental filesystem). With the Ubuntu 12.10 kernel, it looks like it ignored the directive to not use buffer cache – I didn’t see the drive light blink throughout the benchmark :) Unless its write allocator is as cool as the ONTAP one, it’s kinda suspect behavior, especially when compared to the Fedora BTRFS results, which were closer to what I was expecting.
  • BTRFS does use much more CPU than the other filesystems; with the box I was using it was OK, but it’s probably not a good choice for something like an ancient laptop. Still, I think it shows great promise overall, given the features it can deliver.
  • The Linux NTFS driver did pretty well compared to Windows itself :)
  • Windows Server 2012 performed well, similar to Ubuntu Linux with ext4
  • XFS did better than ext4 overall

 

What did we learn?

We learned that the choice of OS and filesystem is enough to create a huge performance difference given the exact same workload generated by the exact same application compiled from the exact same source code.

When looking at purchasing something, looking at the base hardware specs is not enough.

Don’t focus too much on the underlying hardware unless the hardware you are looking to purchase is running the exact same OS (for example, you are comparing laptops running the same rev of Windows, or phones running the same rev of Android, or enterprise storage running the same OS, etc). You can apply the same logic to some extent to almost anything – for example, car engines. Bigger can mean more powerful but it depends.

And, ultimately, what really matters isn’t the hardware or even the OS it’s running. It’s how fast it will run your stuff in a reliable fashion with the features that will make you more productive.

D


Are you doing a disservice to your company with RFPs? http://recoverymonkey.org/2012/12/21/are-you-doing-a-disservice-to-your-company-with-rfps/ http://recoverymonkey.org/2012/12/21/are-you-doing-a-disservice-to-your-company-with-rfps/#comments Fri, 21 Dec 2012 23:49:01 +0000 Dimitris http://recoverymonkey.org/?p=568 Whether we like it or not, RFPs (Request For Proposal) are a fact of life for vendors.

It usually works like this: A customer has a legitimate need for something. They decide (for whatever reason) to get bids from different vendors. They then craft an RFP document that is either:

  1. Carefully written, with the best intentions, so that they get the most detailed proposal possible given their requirements, or
  2. Carefully tailored by them and the help of their preferred vendor to box out the other vendors.

Both approaches have merit, even if #2 seems unethical and almost illegal. I understand that some people are just happy with what they have so they word their document to block anyone from changing their environment, turning the whole RFP process into an exercise in futility. I doubt that whatever I write here will change that kind of mindset.

However – I want to focus more on #1. The carefully written RFP that truly has the best intentions (and maybe some of it will rub off on the #2 “blocking” RFP type folks).

Here’s the major potential problem with the #1 approach:

You don’t know what you don’t know. For example, maybe you are not an expert on how caching works at a very low level, but you are aware of caching and what it does. So – you know that you don’t know about the low-level aspects of caching (or whatever other technology) and word your RFP so that you learn in detail how the various vendors do it.

The reality is – there are things whose existence you can’t even imagine – indeed, most things:

WhatUknow

By crafting your RFP around things you are familiar with, you are potentially (and unintentionally) eliminating solutions that may do things that are entirely outside your past experiences.

Back to our caching example – suppose you are familiar with arrays that need a lot of write cache in order to work well for random writes, so you put in your storage RFP requirements about very specific minimum amounts of write cache.

That’s great and absolutely applicable to the vendors that write to disk the way you are familiar with.

But what if someone writes to disk entirely differently than what your experience dictates and doesn’t need large amounts of write cache to do random writes even better than what you’re familiar with? What if they use memory completely differently in general?

Another example where almost everyone gets it wrong is specifying performance requirements. Unless you truly understand the various parameters that a storage system needs in order to properly be sized for what you need, it’s almost guaranteed the requirements list will be incomplete. For example, specifying IOPS without an I/O size and read/write blend and latency and sequential vs random – at a minimum – will not be sufficient to size a storage system (there’s a lot more here in case you missed it).
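To make that concrete, here is a hypothetical sketch of the kind of per-application profile that actually makes sizing possible. Every number below is a made-up placeholder purely for illustration; the point is the shape of the information, not the values.

    # Hypothetical per-application I/O profile for an RFP. All values are
    # illustrative placeholders, not recommendations.
    app_profile = {
        "application": "Order-entry OLTP database",
        "random_iops": 25_000,        # peak random IOPS required
        "io_size_kb": 8,              # predominant random I/O size
        "read_pct": 70,               # read vs write blend
        "write_pct": 30,
        "sequential_mbps": 400,       # backups, table scans etc.
        "latency_target_ms": 5,       # required at the stated IOPS
        "working_set_gb": 800,        # amount of data actually "touched"
    }

    for key, value in app_profile.items():
        print(f"{key:>20}: {value}")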

By setting an arbitrary limit to something that doesn’t apply to certain technologies, you are unintentionally creating a Type #2 RFP document – and you are boxing out potentially better solutions, which is ultimately not good for your business. And by not providing enough information, you are unintentionally making it almost impossible for the solution providers to properly craft something for you.

So what to do to avoid these RFP pitfalls?

Craft your RFP by asking questions about solving the business problem, not by trying to specify how the vendor should solve the business problem.

For example: Say something like this about space savings:

“Describe what, if any, technologies exist within the gizmo you’re proposing that will result in the reduction of overall data space consumption. In addition, describe what types of data and what protocols such technologies can work with, when they should be avoided, and what, if any, performance implications exist. Be specific.”

Instead of this:

“We need the gizmo to have deduplication that works this way with this block size plus compression that uses this algorithm but not that other one“.

Or, say something like this about reliability:

“Describe the technologies employed to provide resiliency of data, including protection from various errors, like lost or misplaced writes”.

Instead of:

“The system needs to have RAID10 disk with battery-backed write cache”.

It’s not easy. Most of us try to solve the problem and have at least some idea of how we think it should be solved. Just try to avoid that instinct while writing the RFP…

And, last but not least:

Get some help for crafting your RFP. We have this website that will even generate one for you. It’s NetApp-created, so take it with a grain of salt, but it was designed so the questions were fair and open-ended and not really vendor-specific. At least go through it and try building an RFP with it. See if it puts in questions you hadn’t thought of asking, and see how things are worded.

And get some help in getting your I/O requirements… most vendors have tools that can help with that. It may mean that you are repeating the process several times – but at least you’ll get to see how thorough each vendor is regarding the performance piece. Beware of the ones that aren’t thorough.

D

So now it is OK to sell systems using “Raw IOPS”??? http://recoverymonkey.org/2012/10/22/so-now-it-is-ok-to-sell-systems-using-raw-iops/ http://recoverymonkey.org/2012/10/22/so-now-it-is-ok-to-sell-systems-using-raw-iops/#comments Mon, 22 Oct 2012 15:46:07 +0000 Dimitris http://recoverymonkey.org/?p=563 As the self-proclaimed storage vigilante, I will keep bringing these idiocies up as I come across them.

So, the latest “thing” now is selling systems using “Raw IOPS” numbers.

Simply put, some vendors are selling based on the aggregate IOPS the system will supposedly do according to per-disk statistics and nothing else.

They are not providing realistic performance estimates for the proposed workload, with the appropriate RAID type and I/O sizes and hot vs cold data and what the storage controller overhead will be to do everything. That’s probably too much work. 

For example, if one assumes 200 IOPS per disk, and 200 such disks are in the system, this vendor is showing 40,000 “Raw IOPS”.

This is about as useful as shoes on a snake. Probably less.

The reality is that this is the ultimate “it depends” scenario, since the achievable IOPS depend on far more than how many random 4K IOPS a single disk can sustain (just doing RAID6 could result in having to divide the raw IOPS by 6 where random writes are concerned – and that’s just one thing that affects performance, there are tons more!)
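To see how quickly a "Raw IOPS" number falls apart, here is a simplified sketch using the 200-disk example above. The 70/30 read/write mix is an assumption, the per-write penalties are the textbook RAID values, and the math deliberately ignores cache, controller overhead and everything else that matters in real life:

    # Simplified "raw" vs. host-visible IOPS estimate. Purely illustrative.
    def host_iops(disks, iops_per_disk, read_fraction, write_penalty):
        raw = disks * iops_per_disk                      # the "Raw IOPS" number
        write_fraction = 1.0 - read_fraction
        # Each host write costs write_penalty back-end I/Os, each read costs 1.
        backend_per_host_io = read_fraction + write_fraction * write_penalty
        return raw, raw / backend_per_host_io

    # 200 disks at 200 IOPS each, with an assumed 70% read / 30% random write mix.
    for raid, penalty in [("RAID10", 2), ("RAID5", 4), ("RAID6", 6)]:
        raw, host = host_iops(200, 200, read_fraction=0.7, write_penalty=penalty)
        print(f"{raid}: {raw:,} raw IOPS -> roughly {host:,.0f} host IOPS")

The same 40,000 "Raw IOPS" turn into anywhere from roughly 16,000 to 31,000 host IOPS depending on RAID type alone – and that is before caching, controller overhead, I/O sizes or hot spots enter the picture.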

Please refer to prior articles on the subject such as the IOPS/latency primer here and undersizing here. And some RAID goodness here.

If you’re a customer reading this, you have the ultimate power to keep vendors honest. Use it!

D


An explanation of IOPS and latency http://recoverymonkey.org/2012/07/26/an-explanation-of-iops-and-latency/ http://recoverymonkey.org/2012/07/26/an-explanation-of-iops-and-latency/#comments Thu, 26 Jul 2012 18:36:26 +0000 Dimitris http://recoverymonkey.org/?p=488 <I understand this extremely long post is redundant for seasoned storage performance pros – however, these subjects come up so frequently, that I felt compelled to write something. Plus, even the seasoned pros don’t seem to get it sometimes… :) >

IOPS: Possibly the most common measure of storage system performance.

IOPS means Input/Output (operations) Per Second. Seems straightforward. A measure of work vs time (not the same as MB/s, which is actually easier to understand – simply, MegaBytes per Second).

How many of you have seen storage vendors extolling the virtues of their storage by using large IOPS numbers to illustrate a performance advantage?

How many of you decide on storage purchases and base your decisions on those numbers?

However: how many times has a vendor actually specified what they mean when they utter “IOPS”? :)

For the impatient, I’ll say this: IOPS numbers by themselves are meaningless and should be treated as such. Without additional metrics such as latency, read vs write % and I/O size (to name a few), an IOPS number is useless.

And now, let’s elaborate… (and, as a refresher regarding the perils of ignoring such things when it comes to sizing, you can always go back here).

 

One hundred billion IOPS…

drevil

I’ve competed with various vendors that promise customers high IOPS numbers. On a small system with under 100 standard 15K RPM spinning disks, a certain three-letter vendor was claiming half a million IOPS. Another, a million. Of course, my customer was impressed, since that was far, far higher than the number I was providing. But what’s reality?

Here, I’ll do one right now: The old NetApp FAS2020 (the older smallest box NetApp had to offer) can do a million IOPS. Maybe even two million.

Go ahead, prove otherwise.

It’s impossible, since there is no standard way to measure IOPS, and the official definition of IOPS (operations per second) does not specify certain extremely important parameters. By doing any sort of I/O test on the box, you are automatically imposing your benchmark’s definition of IOPS for that specific test.

 

What’s an operation? What kind of operations are there?

It can get complicated.

An I/O operation is simply some kind of work the disk subsystem has to do at the request of a host and/or some internal process. Typically a read or a write, with sub-categories (for instance read, re-read, write, re-write, random, sequential) and a size.

Depending on the operation, its size could range anywhere from bytes to kilobytes to several megabytes.

Now consider the following most assuredly non-comprehensive list of operation types:

  1. A random 4KB read
  2. A random 4KB read followed by more 4KB reads of blocks in logical adjacency to the first
  3. A 512-byte metadata lookup and subsequent update
  4. A 256KB read followed by more 256KB reads of blocks in logical sequence to the first
  5. A 64MB read
  6. A series of random 8KB writes followed by 256KB sequential reads of the same data that was just written
  7. Random 8KB overwrites
  8. Random 32KB reads and writes
  9. Combinations of the above in a single thread
  10. Combinations of the above in multiple threads
…this could go on.

As you can see, there’s a large variety of I/O types, and true multi-host I/O is almost never of a single type. Virtualization further mixes up the I/O patterns, too.

Now here comes the biggest point (if you can remember one thing from this post, this should be it):

No storage system can do the same maximum number of IOPS irrespective of I/O type, latency and size.

Let’s re-iterate:

It is impossible for a storage system to sustain the same peak IOPS number when presented with different I/O types and latency requirements.

 

Another way to see the limitation…

A gross oversimplification that might help prove the point that the type and size of operation you do matters when it comes to IOPS. Meaning that a system that can do a million 512-byte IOPS can’t necessarily do a million 256K IOPS.
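Here is a trivial sketch of just one of the constraints at play. Assume, purely for illustration, a back end that tops out at 4GB/s of internal bandwidth and that nothing else ever becomes the bottleneck (real systems have plenty of other limits):

    # How many IOPS fit under a fixed bandwidth ceiling, by I/O size.
    # The 4GB/s ceiling is a made-up number purely for illustration.
    BANDWIDTH_BYTES_PER_SEC = 4 * 1000**3

    for io_size_bytes in (512, 4 * 1024, 64 * 1024, 256 * 1024):
        max_iops = BANDWIDTH_BYTES_PER_SEC / io_size_bytes
        print(f"{io_size_bytes:>7}-byte I/Os: at most {max_iops:,.0f} IOPS")

The very same box that looks like a multi-million-IOPS monster with 512-byte I/Os can only do a few thousand 256KB operations per second before hitting that one ceiling.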

Imagine a bucket, or a shotshell, or whatever container you wish.

Imagine in this container you have either:

  1. A few large balls or…
  2. Many tiny balls
The bucket ultimately contains about the same volume of stuff either way, and it is the major limiting factor. Clearly, you can’t completely fill that same container with the same number of large balls as you can with small balls.
IOPS containers

They kinda look like shotshells, don’t they?

Now imagine the little spheres being forcibly evacuated rapidly out of one end… which takes us to…

 

Latency matters

So, we’ve established that not all IOPS are the same – but what is of far more significance is latency as it relates to the IOPS.

If you want to read no further – never accept an IOPS number that doesn’t come with latency figures, in addition to the I/O sizes and read/write percentages.

Simply speaking, latency is a measure of how long it takes for a single I/O request to happen from the application’s viewpoint.

In general, when it comes to data storage, high latency is just about the least desirable trait, right up there with poor reliability.

Databases especially are very sensitive with respect to latency – DBs make several kinds of requests that need to be acknowledged quickly (ideally in under 10ms, and writes especially in well under 5ms). In particular, the redo log writes need to be acknowledged almost instantaneously for a heavy-write DB – under 1ms is preferable.

High sustained latency in a mission-critical app can have a nasty compounding effect – if a DB can’t write to its redo log fast enough for a single write, everything stalls until that write can complete, then moves on. However, if it constantly can’t write to its redo log fast enough, the user experience will be unacceptable as requests get piled up – the DB may be a back-end to a very busy web front-end for doing Internet sales, for example. A delay in the DB will make the web front-end also delay, and the company could well lose thousands of customers and millions of dollars while the delay is happening. Some companies could also face penalties if they cannot meet certain SLAs.

On the other hand, applications doing sequential, throughput-driven I/O (like backup or archival) are nowhere near as sensitive to latency (and typically don’t need high IOPS anyway, but rather need high MB/s).

Here’s an example from an Oracle DB – a system doing about 15,000 IOPS at 25ms latency. Doing more IOPS would be nice but the DB needs the latency to go a lot lower in order to see significantly improved performance – notice the increased IO waits and latency, and that the top event causing the system to wait is I/O:

AWR example

Now compare to this system (this data is in a different format, but you’ll get the point):

Notice that, in this case, the system is waiting primarily for CPU, not storage.

A significant amount of I/O wait is a good way to determine if storage is an issue (there can be other latencies outside the storage of course – CPU and network are a couple of usual suspects). Even with good latencies, if you see a lot of I/O waits it means that the application would like faster speeds from the storage system.

But this post is not meant to be a DB sizing class. Here’s the important bit that I think is confusing a lot of people and is allowing vendors to get away with unrealistic performance numbers:

It is possible (but not desirable) to have high IOPS and high latency simultaneously.

How? Here’s a, once again, oversimplified example:

Imagine 2 different cars, both with a top speed of 150mph.

  • Car #1 takes 50 seconds to reach 150mph
  • Car #2 takes 200 seconds to reach 150mph

The maximum speed of the two cars is identical.

Does anyone have any doubt as to which car is actually faster? Car #1 indeed feels about 4 times faster than Car #2, even though they both hit the exact same top speed in the end.

Let’s take it an important step further, keeping the car analogy since it’s very relatable to most people (but mostly because I like cars):

  • Car #1 has a maximum speed of 120mph and takes 30 seconds to hit 120mph
  • Car #2 has a maximum speed of 180mph, takes 50 seconds to hit 120mph, and takes 200 seconds to hit 180mph

In this example, Car #2 actually has a much higher top speed than Car #1. Many people, looking at just the top speed, might conclude it’s the faster car.

However, Car #1 reaches its top speed (120mph) far faster than Car #2 reaches that same 120mph mark.

Car #2 continues to accelerate (and, eventually, overtakes Car #1), but takes an inordinately long amount of time to hit its top speed of 180mph.

Again – which car do you think would feel faster to its driver?

You know – the feeling of pushing the gas pedal and the car immediately responding with extra speed that can be felt? Without a large delay in that happening?

Which car would get more real-world chances of reaching high speeds in a timely fashion? For instance, overtaking someone quickly and safely?

Which is why car-specific workload benchmarks like the quarter mile were devised: How many seconds does it take to traverse a quarter mile (the workload), and what is the speed once the quarter mile has been reached?

(I fully expect fellow geeks to break out the slide rules and try to prove the numbers wrong, probably factoring in gearing, wind and rolling resistance – it’s just an example to illustrate the difference between throughput and latency, I had no specific cars in mind… really).
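Back in storage terms, the reason high IOPS and high latency can coexist is parallelism: pile up enough outstanding I/Os and the aggregate IOPS figure looks great even though every individual I/O is slow. A rough sketch of the relationship (essentially Little’s Law, and very much an oversimplification):

    # IOPS as a function of concurrency and per-I/O latency (oversimplified).
    def iops(outstanding_ios, latency_ms):
        return outstanding_ios / (latency_ms / 1000.0)

    # Hypothetical numbers purely for illustration:
    print(iops(1000, 50))  # 1000 outstanding I/Os at 50ms each -> 20,000 IOPS, feels awful
    print(iops(32, 1))     # 32 outstanding I/Os at 1ms each    -> 32,000 IOPS, feels instant

Which is exactly why a big IOPS number means nothing without the latency, and without knowing how much parallelism was needed to get there.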

 

And, finally, some more storage-related examples…

Some vendor claims… and the fine print explaining the more plausible scenario beneath each claim:

“Mr. Customer, our box can do a million IOPS!”

512-byte ones, sequentially out of cache.

“Mr. Customer, our box can do a quarter million random 4K IOPS – and not from cache!”

at 50ms latency.

“Mr. Customer, our box can do a quarter million 8K IOPS, not from cache, at 20ms latency!”

but only if you have 1000 threads going in parallel.

“Mr. Customer, our box can do a hundred thousand 4K IOPS, at under 20ms latency!”

but only if you have a single host hitting the storage so the array doesn’t get confused by different I/O from other hosts.

Notice how none of these claims are talking about writes or working set sizes… or the configuration required to support the claim.

 

What to look for when someone is making a grandiose IOPS claim

Audited validation and a specific workload to be measured against (that includes latency as a metric) both help. I’ll pick on HDS since they habitually show crazy numbers in marketing literature.

For example, from their website:

HDS USP IOPS

 

It’s pretty much the textbook case of unqualified IOPS claims. No information as to the I/O size, reads vs writes, sequential or random, what type of medium the IOPS are coming from, or, of course, the latency…

However, that very same box barely breaks 200,000 SPC-1 IOPS with good latency in the audited SPC-1 benchmark:

HDS USP SPC IOPS

 

Last I checked, 200,000 is one twentieth of 4,000,000. Don’t get me wrong, 200,000 low-latency IOPS is a great SPC-1 result, but it’s not 4 million SPC-1 IOPS.

Check my previous article on SPC-1 and how to read the results here. And if a vendor is not posting results for a platform – ask why.

 

Where are the IOPS coming from?

So, when you hear those big numbers, where are they really coming from? Are they just fictitious? Not necessarily. So far, here are just a few of the ways I’ve seen vendors claim IOPS prowess:

  1. What the controller will theoretically do given unlimited back-end resources.
  2. What the controller will do purely from cache.
  3. What a controller that can compress data will do with all zero data.
  4. What the controller will do assuming the data is at the FC port buffers (“huh?” is the right reaction, only one three-letter vendor ever did this so at least it’s not a widespread practice).
  5. What the controller will do given the configuration actually being proposed driving a very specific application workload with a specified latency threshold and real data.
The figures provided by the approaches above are all real, in the context of how the test was done by each vendor and how they define “IOPS”. However, of the (non-exhaustive) options above, which one do you think is the more realistic when it comes to dealing with real application data?

 

What if someone proves to you a big IOPS number at a PoC or demo?

Proof-of-Concept engagements or demos are great ways to prove performance claims.

But, as with everything, garbage in – garbage out.

If someone shows you IOmeter doing crazy IOPS, use the information in this post to help you at least find out what the exact configuration of the benchmark is. What’s the block size, is it random, sequential, a mix, how many hosts are doing I/O, etc. Is the config being short-stroked? Is it coming all out of cache?

Typically, things like IOmeter can be a good demo but that doesn’t mean the combined I/O of all your applications’ performance follows the same parameters, nor does it mean the few servers hitting the storage at the demo are representative of your server farm with 100x the number of servers. Testing with as close to your application workload as possible is preferred. Don’t assume you can extrapolate – systems don’t always scale linearly.

 

Factors affecting storage system performance

In real life, you typically won’t have a single host pumping I/O into a storage array. More likely, you will have many hosts doing I/O in parallel. Here are just some of the factors that can affect storage system performance in a major way:

 

  1. Controller, CPU, memory, interlink counts, speeds and types.
  2. A lot of random writes. This is the big one, since, depending on RAID level, the back-end I/O overhead could be anywhere from 2 I/Os (RAID 10) to 6 I/Os (RAID6) per write, unless some advanced form of write management is employed.
  3. Uniform latency requirements – certain systems will exhibit latency spikes from time to time, even if they’re SSD-based (sometimes especially if they’re SSD-based).
  4. A lot of writes to the same logical disk area. This, even with autotiering systems or giant caches, still results in tremendous load on a rather limited set of disks (whether they be spinning or SSD).
  5. The storage type used and the amount – different types of media have very different performance characteristics, even within the same family (the performance between SSDs can vary wildly, for example).
  6. CDP tools for local protection – sometimes this can result in 3x the I/O to the back-end for the writes.
  7. Copy on First Write snapshot algorithms with heavy write workloads.
  8. Misalignment.
  9. Heavy use of space efficiency techniques such as compression and deduplication.
  10. Heavy reliance on autotiering (resulting in the use of too few disks and/or too many slow disks in an attempt to save costs).
  11. Insufficient cache with respect to the working set coupled with inefficient cache algorithms, too-large cache block size and poor utilization.
  12. Shallow port queue depths.
  13. Inability to properly deal with different kinds of I/O from more than a few hosts.
  14. Inability to recognize per-stream patterns (for example, multiple parallel table scans in a Database).
  15. Inability to intelligently prefetch data.

 

What you can do to get a solution that will work…

You should work with your storage vendor to figure out, at a minimum, the items in the following list, and, after you’ve done so, go through the sizing with them and see the sizing tools being used in front of you. (You can also refer to this guide).

  1. Applications being used and size of each (and, ideally, performance logs from each app)
  2. Number of servers
  3. Desired backup and replication methods
  4. Random read and write I/O size per app
  5. Sequential read and write I/O size per app
  6. The percentages of read vs write for each app and each I/O type
  7. The working set (amount of data “touched”) per app
  8. Whether features such as thin provisioning, pools, CDP, autotiering, compression, dedupe, snapshots and replication will be utilized, and what overhead they add to the performance
  9. The RAID type (R10 has an impact of 2 I/Os per random write, R5 4 I/Os, R6 6 I/Os – is that being factored?)
  10. The impact of all those things to the overall headroom and performance of the array.
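As a tiny illustration of why even a couple of those inputs change the answer dramatically (this is emphatically not a real sizing tool – the per-disk IOPS figure and the workload mix are assumptions, and real sizing also accounts for caching, write optimization, feature overhead and headroom):

    import math

    # Rough spindle-count estimate from a few workload inputs. Illustrative only.
    def disks_needed(host_iops, read_fraction, write_penalty, iops_per_disk):
        write_fraction = 1.0 - read_fraction
        backend_iops = host_iops * (read_fraction + write_fraction * write_penalty)
        return math.ceil(backend_iops / iops_per_disk)

    # Hypothetical workload: 20,000 host IOPS, 60% reads, 180 IOPS per 15K disk.
    for raid, penalty in [("RAID10", 2), ("RAID5", 4), ("RAID6", 6)]:
        print(raid, disks_needed(20_000, 0.6, penalty, 180), "disks, before cache effects")

The same 20,000 host IOPS translate to anywhere from about 156 to about 334 back-end disks depending on RAID type alone – before a single other item from the list above is even considered.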

If your vendor is unwilling or unable to do this type of work, or, especially, if they tell you it doesn’t matter and that their box will deliver umpteen billion IOPS – well, at least now you know better :)

D


 

What is hard disk short stroking? http://recoverymonkey.org/2012/07/24/what-is-hard-disk-short-stroking/ http://recoverymonkey.org/2012/07/24/what-is-hard-disk-short-stroking/#comments Tue, 24 Jul 2012 23:15:43 +0000 Dimitris http://recoverymonkey.org/?p=467 This is going to be a short post, to atone for my past sins of overly long posts but mostly because I want to eat dinner.

On storage systems with spinning disks, a favorite method for getting more performance is short-stroking the disk.

It’s a weird term but firmly based on science. Some storage vendors even made a big deal about being able to place data on certain parts of the disk, geometrically speaking.

Consider the relationship between angular and linear velocity first:

Linearvsangular

Assuming something round that rotates at a constant speed (say, 15 thousand revolutions per minute), the angular speed is constant.

The linear speed, on the other hand, increases the further you get away from the center of rotation.

Which means, the part furthest away from the center has the highest linear speed.
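To put rough numbers on it (the radii below are ballpark guesses for a 3.5-inch platter, purely for illustration, and 15,000 RPM is the speed mentioned above):

    import math

    # Linear speed of the platter surface under the head: v = 2 * pi * r * (rev/s).
    RPM = 15_000
    revs_per_sec = RPM / 60.0

    # Approximate track radii in meters (assumed, not from any datasheet):
    tracks = {"inner track (~2.0 cm)": 0.020, "outer track (~4.6 cm)": 0.046}

    for name, radius in tracks.items():
        v = 2 * math.pi * radius * revs_per_sec
        print(f"{name}: about {v:.1f} m/s")

Same rotational speed, yet the outer tracks fly past the head more than twice as fast – and that is before counting the shorter seeks.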

Now imagine a hard disk, and let’s say you want to measure its performance for some sort of random workload demanding low latency.

What if you could put all the data of your benchmark at the very outer edge of the disk?

You would get several benefits:

  1. The data would enjoy the highest linear speed and
  2. The disk tracks at the outer edge store more data, further increasing speeds, plus
  3. The disk heads would only have to move a small amount to get to all the data (the short-stroking part). This leads to much reduced seek times, which is highly beneficial for most workloads.

I whipped this up to explain it pictorially:

Short stroking

Using a lot of data in a benchmark would not be enough to avoid short-stroking. One would also need to ensure that the access pattern touches the entire disk surface.

This is why NetApp always randomizes data placement for the write-intensive SPC-1 benchmark, to ensure we are not accused of short-stroking, no matter how much data is used in the benchmark.

Hope this clears it up.

If you are participating at any vendor proof-of-concept involving performance – I advise you to consider the implications of short-stroking. Caching, too, but I feel that has been tackled sufficiently.

D

NetApp posts great Cluster-Mode SPC-1 result http://recoverymonkey.org/2012/06/20/netapp-posts-great-cluster-mode-spc-1-result/ http://recoverymonkey.org/2012/06/20/netapp-posts-great-cluster-mode-spc-1-result/#comments Wed, 20 Jun 2012 21:38:22 +0000 Dimitris http://recoverymonkey.org/?p=404 <Edited to add some more information on how SPC-1 works since there was some confusion based on the comments received>

We’ve been busy at NetApp… busy perfecting the industry’s only scale-out unified platform, among other things.

We’ve already released ONTAP 8.1, which, in Cluster-Mode, allows 24 nodes (each with up to 8TB cache) for NAS workloads, and 4 nodes for block workloads (FC and iSCSI).

With ONTAP 8.1.1 (released on June 14th), we increased the node count to 6 for block workloads plus we added some extra optimizations and features. FYI: the node count is just what’s officially supported now, there’s no hard limit.

After our record NFS benchmark results, people have been curious about the block I/O performance of ONTAP Cluster-Mode, so we submitted an SPC-1 benchmark result using part of the same gear left over from the SPEC SFS NFS testing.

To the people that think NetApp is not a fit for block workloads (typically the ones believing competitor FUD): These are among the best SPC-1 results for enterprise disk-based systems given the low latency for the IOPS provided (it’s possible to get higher IOPS with higher latency, as we’ll explain later on in this post).

Here’s the link to the result and another with the page showing all the results.

This blog has covered SPC-1 tests before. A quick recap: The SPC-1 benchmark is an industry-standard, audited, tough, block-based benchmark (on Fiber Channel) that tries to stress-test disk subsystems with a lot of writes, overwrites, hotspots, a mix of random and sequential, write after read, read after write, etc. About 60% of the workload is writes. The I/O sizes are of a large variety – from small to large (so, SPC-1 IOPS are decidedly not the same thing as fully random uniform 4KB IOPS and should not be treated as such).

The benchmark access patterns do have hotspots that are a significant percentage of the total workload. Such hotspots can be either partially cached if the cache is large enough or placed on SSD if the arrays tested have an autotiering system granular and intelligent enough.

If an array can perform well in the SPC-1 workload, it will usually perform extremely well under difficult, latency-sensitive, dynamically changing DB workloads and especially OLTP. The full spec is here for the morbidly curious.

The trick with benchmarks is interpreting the results. A single IOPS number, while useful, doesn’t tell the whole story with respect to the result being useful for real applications. We’ll attempt to assist in the deciphering of the results in this post.

Before we delve into the obligatory competitive analysis, some notes for the ones lacking in faith:

  1. There was no disk short-stroking in the NetApp benchmark (a favorite way for many vendors to get good speeds out of disk systems by using only the outer part of the disk – the combination of higher linear velocity and smaller head movement providing higher performance and reduced seeks). Indeed, we used a tuning parameter that uses the entire disk surface, no matter how full the disks. Look at the full disclosure report here, page 61. For the FUD-mongers out there: This effectively pre-ages WAFL. We also didn’t attempt to optimize the block layout by reallocating blocks.
  2. There was no performance degradation over time.
  3. Average latency (“All ASUs” in the results) was flat and stayed below 5ms during multiple iterations of the test, including the sustainability test (page 28 of the full disclosure report).
  4. No extra cache beyond what comes with the systems was added (512GB comes standard with each 6240 node, 3TB per node is possible on this model, so there’s plenty of headroom for much larger working sets).
  5. It was not a “lab queen” system. We used very few disks to achieve the performance compared to the other vendors, and it’s not even the fastest box we have.


ANALYSIS

When looking at this type of benchmark, one should probably focus on:
  1. High sustained IOPS (inconsistency is frowned upon).
  2. IOPS/drive (a measure of efficiency – 500 IOPS/drive is twice as efficient as 250 IOPS/drive, meaning a lot fewer drives are needed, which results in lower costs, a smaller physical footprint, etc.)
  3. Low, stable latency over time (big spikes are frowned upon).
  4. IOPS as a function of latency (do you get high IOPS but also very high latency at the top end? Is that a useful system?)
  5. The RAID protection used (RAID6? RAID10? RAID6 can provide both better protection and better space efficiency than mirroring, resulting in lower cost yet more reliable systems).
  6. What kind of drives were used? Ones you are likely to purchase?
  7. Was autotiering used? If not, why not? Isn’t it supposed to help in such difficult scenarios? Some SSDs would be able to handle the hotspots.
  8. The amount of hardware needed to get the stated performance (are way too many drives and controllers needed to do it? Does that mean a more complex and costly system? What about management?)
  9. The cost (some vendors show discounts and others show list price, so be careful there).
  10. The cost/op (which is the more useful metric – assuming you compare list price to list price).
SPC-1 is not a throughput-type benchmark, for sheer GB/s look elsewhere. Most of the systems didn’t do more than 4GB/s in this benchmark since a lot of the operations are random (and 4GB/s is quite a lot of random I/O).
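To make the efficiency metrics above concrete, here is a trivial helper. The inputs in the example are hypothetical placeholders, not any vendor’s actual submission:

    # The two efficiency metrics worth deriving from any SPC-1 result.
    def spc1_efficiency(spc1_iops, drive_count, list_price_usd):
        return {
            "IOPS per drive": spc1_iops / drive_count,
            "USD per SPC-1 IOPS": list_price_usd / spc1_iops,
        }

    # Hypothetical example values, purely for illustration:
    print(spc1_efficiency(spc1_iops=250_000, drive_count=500, list_price_usd=1_500_000))

A system at 250 IOPS/drive needs roughly twice the spindles (and the cost, power and floor space that go with them) of a system at 500 IOPS/drive to do the same work.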

SYSTEMS COMPARED

In this analysis we are comparing disk-based systems. Pure-SSD (or plain old RAM) performance-optimized configs can (predictably) get very high performance and may be a good choice if someone has a very small workload that needs to run very fast.

The results we are focusing on, on the other hand, are highly reliable, general-purpose systems that can provide both high performance, low latency and high capacity at a reasonable cost to many hosts and applications, along with rich functionality (snaps, replication, megacaching, thin provisioning, deduplication, compression, multiple protocols incl. NAS etc. Whoops – none of the other boxes aside from NetApp do all this, but such is the way the cookie crumbles).

Here’s a list of the systems with links to their full SPC-1 disclosure where you can find all the info we’ll be displaying. Those are all systems with high results and relatively flat sustained latency results.

There are some other disk-based systems with decent IOPS results but if you look at their sustained latency (“Sustainability – Average Response Time (ms) Distribution Data” in any full disclosure report) there’s too high a latency overall and too much jitter past the initial startup phase, with spikes over 30ms (which is extremely high), so we ignored them.

Here’s a quick chart of the results sorted according to latency. In addition, the prices shown are the true list prices (which can be found in the disclosures) plus the true $/IO cost based on that list price (a lot of vendors show discounted pricing to make that seem lower):

…BUT THAT CHART SHOWS THAT SOME OF THE OTHER BIG BOXES ARE FASTER THAN NETAPP… RIGHT?

That depends on whether you value and need low latency or not (and whether you take RAID type into account). For the vast majority of DB workloads, very low I/O latencies are vastly preferred to high latencies.

Here’s how you figure out the details:
  1. Choose any of the full disclosure links you are interested in. Let’s say the 3Par one, since it shows both high IOPS and high latency.
  2. Find the section titled “Response Time – Throughput Curve”. Page 13 in the 3Par result.
  3. Check whether latency rises sharply as load is added to the system.

Shown below is the 3Par curve:

3parlatency

Notice how latency rises quite sharply after a certain point.

Now compare this to the NetApp result (page 13):

Netappspclatency

Notice how the NetApp result has in general much lower latency but, more importantly, the latency stays low and rises slowly as load is added to the system.

Which is why the column “SPC-1 IOPS around 3ms” was added to the table. Effectively, what would the IOPS be at around the same latency for all the vendors?

Once you do that, you realize that the 3Par system is actually slower than the NetApp system if a similar amount of low latency is desired. Plus it costs several times more.

You can get the exact latency numbers just below the graphs on page 13, the NetApp table looks like this (under the heading “Response Time – Throughput Data”):

Netappspcdata

Indeed, of all the results compared, only the IBM SVC (with a bunch of V7000 boxes behind it) is faster than NetApp at that low latency point. Which neatly takes us to the next section…

WHAT IS THE 100% LOAD POINT?

I had to add this since it is confusing. The 100% load point does not mean the arrays tested were necessarily maxed out. Indeed, most of the arrays mentioned could sustain bigger workloads given higher latencies. 3Par just decided to show the performance at that much higher latency point. The other vendors decided to show the performance at latencies more palatable to Tier 1 DB workloads.

The SPC-1 load generators are simply told to run at a specific target IOPS and that is chosen to be the load level. The goal being to balance cost, IOPS and latency.

JUST HOW MUCH HARDWARE IS NEEDED TO GET A SYSTEM TO PERFORM?

Almost any engineering problem can be solved given the application of enough hardware. The IBM result is a great example of a very fast system built by adding a lot of hardware together:

  • 8 SVC virtualization engines plus…
  • …16 separate V7000 systems under the SVC controllers…
  • …each consisting of 2 more SVC controllers and 2 RAID controllers
  • 1,920 146GB 15,000 RPM disks (not quite the drive type people buy these days)
  • For a grand total of 40 Linux-based SVC controllers (8 larger and 32 smaller), 32 RAID controllers, and a whole lot of disks.

Putting aside for a moment the task of actually putting together and managing such a system, or the amount of power it draws, or the rack space consumed, that’s quite a bit of gear. I didn’t even attempt to add up all the CPUs working in parallel, I’m sure it’s a lot.

Compare it to the NetApp configuration:
  • 6 controllers in one cluster
  • 432 450GB 15,000 RPM disks (a pretty standard and common drive type as of the time of this writing in June 2012).

SOME QUESTIONS (OTHER VENDORS FEEL FREE TO RESPOND):

  1. What would performance be with RAID6 for the other vendors mentioned? NetApp always tests with our version of RAID6 (RAID-DP). RAID6 is more reliable than mirroring, especially when large pools are in question (not to mention more space-efficient). Most customers won’t buy big systems with all-RAID10 configs these days… (customers, ask your vendor. There is no magic – I bet they have internal results with RAID6, make them show you).
  2. Autotiering is the most talked-about feature it seems, with attributes that make it seem more important than the invention of penicillin or even the wheel, maybe even fire… However, none of the arrays mentioned are using any SSDs for autotiering (IBM published a result once – nothing amazing, draw your own conclusions). One would think that a benchmark that creates hot spots would be an ideal candidate… (and, to re-iterate, there are hotspots and of a percentage small enough to easily fit in SSD). At least IBM’s result proves that (after about 19 hours) autotiering works for the SPC-1 workload – which further solidifies the question: Why is nobody doing this if it’s supposed to be so great?
  3. Why are EMC and Dell unwilling to publish SPC-1 results? (they are both SPC members). They are the only 2 major storage vendors that won’t publish SPC-1 results. EMC said in the past they don’t think SPC-1 is a realistic test – well, only running your applications with your data on the array is ever truly realistic. What SPC-1 is, though, is an industry-standard benchmark for a truly difficult random workload with block I/O, and a great litmus test.
  4. For a box regularly marketed for Tier-1 workloads, the IBM XIV is, once more, suspiciously absent, even in its current Gen3 guise. It’s not like IBM is shy about submitting SPC-1 results :)
  5. Finally – some competitors keep saying NetApp is “not true SAN”, “emulated SAN” etc. Whatever that means – maybe the NetApp approach is better after all… the maximum write latency of the NetApp submission was 1.91ms for a predominantly write workload :)

FINAL THOUGHTS

With this recent SPC-1 result, NetApp showed once more that ONTAP running in Cluster-Mode is highly performing and highly scalable for both SAN and NAS workloads. Summarily, ONTAP Cluster-Mode:
  • Allows for highly performant and dynamically-scalable unified clusters for FC, iSCSI, NFS and CIFS.
  • Exhibits proven low latency while maintaining high performance.
  • Provides excellent price/performance.
  • Allows data on any node to be accessed from any other node.
  • Moves data non-disruptively between nodes (including CIFS, which normally is next to impossible).
  • Maintains the traditional NetApp features (write optimization, application awareness, snapshots, deduplication, compression, replication, thin provisioning, megacaching).
  • Can use the exact same FAS gear as ONTAP running in the legacy 7-mode for investment protection.
  • Can virtualize other arrays behind it.
 Courteous comments always welcome.
D
NetApp delivers 1.3TB/s performance to giant supercomputer for big data http://recoverymonkey.org/2012/02/10/netapp-delivers-1tbs-performance-to-giant-supercomputer-for-big-data/ http://recoverymonkey.org/2012/02/10/netapp-delivers-1tbs-performance-to-giant-supercomputer-for-big-data/#comments Sat, 11 Feb 2012 03:40:17 +0000 Dimitris http://recoverymonkey.org/?p=358 (Edited: My bad, it was 1.3TB/s, not 1TB/s).

What do you do when you need so much I/O performance that no one single storage system can deliver it, no matter how large?

To be specific: What if you needed to transfer data at 1TB per second? (or 1.3TB/s, as it eventually turned out to be)?

That was the problem faced by the U.S. Department of Energy (DoE) and their Sequoia supercomputer at the Lawrence Livermore National Laboratory (LLNL), one of the fastest supercomputing systems on the planet.

You can read the official press release here. I wanted to get more into the technical details.

People talk a lot about “big data” recently – no clear definition seems to exist, in my opinion it’s something that has some of the following properties:

  • Too much data to be processed by a “normal” computer or cluster
  • Too much data to work with using a relational DB
  • Too much data to fit in a single storage system for performance and/or capacity reasons – or maybe just simply:
  • Too much data to process using traditional methods within an acceptable time frame

Clearly, this is a bit loose – how much is “too much”? How long is “too long”? For someone only armed with a subnotebook computer, “too much” does not have the same meaning as for someone rocking a 12-core server with 256GB RAM and a few TB of SSD.

So this definition is relative… but in some cases, such as the one we are discussing, absolute – given the limitations of today’s technology.

For instance, the amount of storage LLNL required was several tens of PB in a single storage pool that could provide unprecedented I/O performance to the tune of 1TB/s. Both size and performance needed to be scalable. It also needed to be reliable and fit within a reasonable budget and not require extreme space, power and cooling. A tall order indeed.

This created some serious logistics problems regarding storage:

  • No single disk array can hold that amount of data
  • No single disk array can perform anywhere close to 1TB/s

Let’s put this in perspective: The storage systems that scale the biggest are typically scale-out clusters from the usual suspects of the storage world (we make one, for example). Even so, they max out at fewer PB than the deployment required.

The even bigger problem is that a single large scale-out system can’t really deliver more than a few tens of GB/s under optimal conditions – more than fast enough for most “normal” uses but utterly unacceptable for this case.

The only realistic solution to satisfy the requirements was massive parallelization, specifically using the NetApp E-Series for the back-end storage and the Lustre cluster filesystem.

 

A bit about the solution…

Almost a year ago NetApp purchased the Engenio storage line from LSI. That storage line is resold by several companies like IBM, Oracle, Quantum, Dell, SGI, Teradata and more. IBM also resells the ONTAP-based FAS systems and calls them “N-Series”.

That purchase has made NetApp the largest provider of OEM arrays on the planet by far. It was a good deal – very rapid ROI.

There was a lot of speculation as to why NetApp would bother with the purchase. After all, the ONTAP-based systems have a ton more functionality than pretty much any other array and are optimized for typical mostly-random workloads – DBs, VMs, email, plus megacaching, snaps, cloning, dedupe, compression, etc – all with RAID6-equivalent protection as standard.

The E-Series boxes on the other hand don’t do thin provisioning, dedupe, compression, megacaching… and their snaps are the less efficient copy-on-first-write instead of redirect-on-write. So, almost the anti-ONTAP :)

The first reason for the acquisition was that, on purely financial terms, it was a no-brainer deal even if one sells shoes for a living, let alone storage. Even if there were no other reasons, this one would be enough.

Another reason (and the one germane to this article) was that the E-Series has a tremendous sustained sequential performance density. For instance, the E5400 system can sustain about 4GB/s in 4U (real GB/s, not out of cache), all-in. That’s 4U total for 60 disks including the controllers. Expandable, of course. It’s no slouch for random I/O either, plus you can load it with SSDs, too… :)

Again, note – 60 drives per 4U shelf and that includes the RAID controllers, batteries etc. In addition, all drives are front-loading and stay active while servicing the shelf – as opposed to most (if not all) dense shelves in the market that need the entire (very heavy) shelf pulled out and/or several drives offlined in order to replace a single failed drive… (there’s some really cool engineering in the shelf to do this without thermal problems, performance loss or vibrations). All this allows standard racks and no fear of the racks tipping over while servicing the shelves :) (you know who you are!)

There are some vendors that purely specialize in sequential I/O and tipping racks – yet they have about 3-4x less performance density than the E5400, even though they sometimes have higher per-controller throughput. In a typical marketing exercise, some of our more usual competitors have boasted 2GB/s/RU for their controllers, meaning that in 4U the controllers (that take up 4U in that example) can do 8GB/s, but that requires all kinds of extra rack space to achieve (extra UPSes, several shelves, etc). Making their resulting actual throughput number well under 1GB/s/RU. Not to mention the cost (those systems are typically more expensive than a 5400). Which is important with projects of the scale we are talking about.

Most importantly, what we accomplished at the LLNL was no marketing exercise…

 

The benefits of truly high performance density

Clearly, if your requirements are big enough, you end up spending a lot less money and needing a lot less rack space, power and cooling by going with a highly performance-dense solution.

However, given the requirements of the LLNL, it’s clear that you can’t use just a single E5400 to satisfy the performance and capacity requirements of this use case. What you can do though is use a bunch of them in parallel… and use that massive performance density to achieve about 40GB/s per industry-standard rack with 600x high-capacity disks (1.8PB raw per rack).

For even higher performance per rack, the E5400 can use the faster SAS or SSD drives – 480 drives per rack (up to 432TB raw), providing 80GB/s reads/60GB/s writes.
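Here is the arithmetic that makes performance density the deciding factor at this scale – a trivial sketch using the per-rack figures above against the 1.3TB/s target, ignoring filesystem overhead, hot spares and every other real-world factor:

    import math

    # Racks needed to hit a target aggregate throughput, by per-rack capability.
    TARGET_GBPS = 1300   # roughly the 1.3TB/s demonstrated at LLNL

    configs = {
        "high-capacity E5400 config (~40 GB/s per rack)": 40,
        "SAS/SSD E5400 config (~80 GB/s reads per rack)": 80,
        "hypothetical 10 GB/s-per-rack alternative": 10,
    }

    for name, gbps_per_rack in configs.items():
        print(f"{name}: ~{math.ceil(TARGET_GBPS / gbps_per_rack)} racks")

Halving the per-rack throughput roughly doubles the floor space, power and cooling bill, which is why performance density mattered so much here.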

 

Enter the cluster filesystem

So, now that we picked the performance-dense, reliable, cost-effective building block, how do we tie those building blocks together?

The answer: By using a cluster filesystem.

Loosely defined, a cluster filesystem is simply a filesystem that can be accessed simultaneously by the servers mounting it. In addition, it typically means it can span storage systems and make them look like one big entity.

It’s not a new concept – and there are several examples, old and new: AFS, Coda, GPFS, and the more prevalent Stornext and Lustre are some.

The LLNL picked Lustre for this project. Lustre is a distributed filesystem that breaks apart I/O into multiple Object Storage Servers, each connected to storage (Object Storage Targets). Metadata is served by dedicated servers that are not part of the I/O stream and thus not a bottleneck. See below for a picture (courtesy of the Lustre manual) of how it is all connected:

 

Lustre Scaled Cluster

 

High-speed connections are used liberally for lower latency and higher throughput.

A large file can reside on many storage servers, and as a result I/O can be spread out and parallelized.

Lustre clients see a single large namespace and run a proprietary protocol to access the cluster.
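Conceptually, the parallelism comes from striping each file across Object Storage Targets in a round-robin, RAID-0-like fashion. A rough sketch of that idea – the stripe size and count are tunables, the values here are arbitrary examples, and real Lustre layout handling is considerably more involved:

    # Which OST serves a given byte offset of a striped file (conceptual model only).
    def ost_for_offset(offset_bytes, stripe_size_bytes, stripe_count):
        return (offset_bytes // stripe_size_bytes) % stripe_count

    stripe_size = 1 * 1024**2   # 1MB stripes (arbitrary example value)
    stripe_count = 8            # file spread across 8 OSTs (arbitrary example value)

    # A 32MB sequential read touches all 8 OSTs, so 8 storage servers work in parallel.
    osts_touched = {ost_for_offset(off, stripe_size, stripe_count)
                    for off in range(0, 32 * 1024**2, stripe_size)}
    print(sorted(osts_touched))   # -> [0, 1, 2, 3, 4, 5, 6, 7]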

It sounds good in theory – and it delivered in practice: 1.3TB/s sustained performance was demonstrated to the NetApp block devices. Work is ongoing to finalize the testing with the complete Lustre environment. Not sure what the upper limit would be. But clearly it’s a highly scalable solution.

 

Putting it all together

NetApp has fully realized solutions for the “big data” applications out there – complete with the product and services needed to complete each engagement. The Lustre solution employed by the LLNL is just one of the options available. There is Hadoop, Full Motion uncompressed HD video, and more.

So – how fast do you need to go?

D

 

 


 

NetApp posts world-record SPEC SFS2008 NFS benchmark result http://recoverymonkey.org/2011/11/01/netapp-posts-world-record-spec-sfs2008-nfs-benchmark-result/ http://recoverymonkey.org/2011/11/01/netapp-posts-world-record-spec-sfs2008-nfs-benchmark-result/#comments Tue, 01 Nov 2011 19:11:52 +0000 Dimitris http://recoverymonkey.org/?p=312 Just as NetApp dominated the older version of the SPEC SFS97_R1 NFS benchmark back in May of 2006 (and was unsurpassed in that benchmark with 1 million SFS operations per second), the time has come to once again dominate the current version, SPEC SFS2008 NFS.

Recently we have been focusing on benchmarking realistic configurations that people might actually put in their datacenters, instead of lab queens with unusable configs focused on achieving the highest result regardless of cost.

However, it seems the press doesn’t care about realistic configs (or to even understand the configs) but instead likes headline-grabbing big numbers.

So we decided to go for the best of both worlds – a headline-grabbing “big number” but also a config that would make more financial sense than the utterly crazy setups being submitted by competitors.

Without further ado, NetApp achieved over 1.5 million SPEC SFS2008 NFS operations per second with a 24-node cluster based on FAS6240 boxes running ONTAP 8 in Cluster Mode. Click here for the specific result. There are other results in the page showing different size clusters so you can get some idea of the scaling possible.

See below table for a high-level analysis (including the list pricing I could find for these specific performance-optimized configs for whatever that’s worth). The comparison is between NetApp and the nearest scale-out competitor result (one of many EMC’s recent acquisitions –  Isilon, the niche, dedicated NAS box – nothing else is close enough to bother including in the comparison).

BTW – the EMC price list is publicly available from here (and other places I’m sure): http://www.emc.com/collateral/emcwsca/master-price-list.pdf

From page 422:

S200 – 6.9TB & 200GB SSD, 48GB RAM, 2x10GE SFP+ & 2x1G: $84,061. Times 140…

Before we dive into the comparison, an important note since it seems the competition doesn’t understand how to read SPEC SFS results:

Out of 1728 450GB disks (the number includes spares and OS drives; otherwise it was 1632 disks), the usable capacity was 574TB (73% of all raw space – even more if one considers that a 450GB disk never actually provides 450 real GB in base2). The exported capacity was 288TB. This doesn’t mean we tried to short-stroke, or that there is a performance benefit to exporting a smaller filesystem – the way NetApp writes to disk, the size of the volume you export has nothing to do with performance. Since SPEC SFS doesn’t use all the available disk space, the person doing the setup simply thought like a real storage admin and didn’t give it all the available space.

Lest we be accused of tuning this config or manually making sure client accesses were load-balanced and going to the optimal nodes, please understand this:  23 out of 24 client accesses were not going to the nodes owning the data and were instead happening over the cluster interconnect (which, for any scale-out architecture, is worst-case-scenario performance). Look under the “Uniform Access Rules Compliance” in the full disclosure details of the result in the SPEC website here. This means that, compared to the 2-node ONTAP 7-mode results, there is a degradation due to the cluster operating (intentionally) through non-optimal paths.

High-level comparison, EMC Isilon vs. NetApp:

  • Cost (approx. USD list): EMC 11,800,000 vs. NetApp 6,280,000 – NetApp is almost half the cost while offering much higher performance.
  • SPEC SFS2008 NFS operations per second: EMC 1,112,705 vs. NetApp 1,512,784 – NetApp is over 35% faster, while using potentially better RAID protection.
  • Average latency (ORT, ms): EMC 2.54 vs. NetApp 1.53 – NetApp offers almost 40% better average latency without using costly SSDs, and is usable for challenging random workloads like DBs, VMs etc.
  • Usable space (TB): EMC 864 (of which 128,889GB was used in the test) vs. NetApp 574 (of which 176,176GB was used in the test) – Isilon offers about 50% more usable space (coming from a lot more drives, 28% more raw space and potentially less RAID protection – N+2 results from Isilon would be different).
  • $/SPEC SFS2008 NFS operation: EMC 10.6 vs. NetApp 4.15 – NetApp is less than half the cost per SPEC SFS2008 NFS operation.
  • $/usable TB: EMC 13,657 vs. NetApp 10,940 – NetApp is about 20% less expensive than EMC per usable TB.
  • RAID: EMC uses per-file protection (files under 128K are at least mirrored; files over 128K are at 13+1 protection in this specific test) vs. NetApp RAID-DP. Ask EMC what 13+1 protection means in an Isilon cluster (I believe 1 node can be completely gone, but what about simultaneous drive failures that contain the same protected file?). NetApp RAID-DP is mathematically analogous to RAID6 and has a parity drive penalty of 2 drives every 16-20 drives.
  • Hardware needed to accomplish the result: EMC 140 nodes, 3,360 drives (incl. 25TB of SSDs for cache), 1,120 CPU cores, 6.7TB RAM vs. NetApp 24 unified controllers, 1,728 drives, 12.2TB Flash Cache, 192 CPU cores, 1.2TB RAM. NetApp is far more powerful per node, and achieves higher performance with a lot less drives, CPUs, RAM and cache. In addition, NetApp can be used for all protocols (FC, iSCSI, NFS, CIFS) and all connectivity methods (FC 4/8Gb, Ethernet 1/10Gb, FCoE).
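
If you want to check my math on the cost ratios, here’s a quick Python sketch using only the list prices, op counts and usable capacities quoted above (approximate, list pricing):

# back-of-the-envelope check of the $/op and $/TB figures above
configs = {
    "EMC Isilon": {"price": 11800000, "ops": 1112705, "usable_tb": 864},
    "NetApp":     {"price": 6280000,  "ops": 1512784, "usable_tb": 574},
}
for name, c in configs.items():
    print(f"{name}: ${c['price'] / c['ops']:.2f} per SPEC SFS2008 op, "
          f"${c['price'] / c['usable_tb']:,.0f} per usable TB")
# -> roughly $10.6/op and $13.7K/TB for Isilon, $4.15/op and $10.9K/TB for NetApp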

 

Notice the response time charts:

[Chart: Isilon vs. FAS6240 SPEC SFS2008 response time curves]

NetApp exhibits traditional storage system behavior – latency is very low initially and gradually gets higher the more the box is pushed, as one would expect. Isilon on the other hand starts out slow and gets faster as more metadata gets cached, until the controllers run out of steam (SPEC SFS is very heavy in NAS metadata ops, and should not be compared to heavy-duty block benchmarks like SPC-1).

This is one of the reasons an Isilon cluster is not really applicable for low-latency DB-type apps, or low-latency VMs. It is a great architecture designed to provide high sequential speeds for large files over NAS protocols, and is not a general-purpose storage system. Kudos to the Isilon guys for even getting the great SPEC result in the first place, given that this isn’t what the box is designed to do (the extreme Isilon configuration needed to run the benchmark is testament to that). The better application for Isilon would be capacity-optimized configs (which is what the system is designed for to begin with).

 

Some important points:

  1. First and foremost, the cluster-mode ONTAP architecture now supports all protocols; it is the only unified scale-out architecture available. Any competitors playing in that space have only NAS or only SAN offerings, but not both in a single architecture.
  2. We didn’t even test with the even faster 6280 box and extra cache (that one can take 8TB cache per node). The result is not the fastest a NetApp cluster can go :) With 6280s it would be a healthy percentage faster, but we had a bunch of the 6240s in the lab so it was easier to test them, plus they’re a more common and less expensive box, making for a more realistic result.
  3. ONTAP in cluster-mode is a general-purpose storage OS, and can be used to run Exchange, SQL, Oracle, DB2, VMs, etc. etc. Most other scale-out architectures are simply not suitable for low-latency workloads like DBs and VMs and are instead geared towards high NAS throughput for large files (IBRIX, SONAS, Isilon to name a few – all great at what they do best).
  4. ONTAP in cluster mode is, indeed, a single scale-out cluster and administered as such. It should not be compared to block boxes with NAS gateways in front of them like VNX, HDS + Bluearc, etc.
  5. In ONTAP cluster mode, workloads and virtual interfaces can move around the cluster non-disruptively, regardless of protocol (FC, iSCSI, NFS and yes, even CIFS can move around non-disruptively assuming you have clients that can talk SMB 2.1 and above).
  6. In ONTAP cluster mode, any data can be accessed from any node in the cluster – again, impossible with non-unified gateway solutions like VNX that have individual NAS servers in front of block storage, with zero awareness between the NAS heads aside from failover.
  7. ONTAP cluster mode can allow certain cool things like upgrading storage controllers from one model to another completely non-disruptively, most other storage systems need some kind of outage to do this. All we do is add the new boxes to the existing cluster :)
  8. ONTAP cluster mode supports all the traditional NetApp storage efficiency and protection features: RAID-DP, replication, deduplication, compression, snaps, clones, thin provisioning. Again, the goal is to provide a scale-out general-purpose storage system, not a niche box for only a specific market segment. It even supports virtualizing your existing storage.
  9. There was a single namespace for the NFS data. Granted, not the same architecture as a single filesystem from some competitors.
  10. Last but not least – no “special” NetApp boxes are needed to run Cluster Mode. In contrast to other vendors selling a completely separate scale-out architecture (different hardware and software and management), normal NetApp systems can enter a scale-out cluster as long as they have enough connectivity for the cluster network and can run ONTAP 8. This ensures investment protection for the customer plus it’s easier for NetApp since we don’t have umpteen hardware and software architectures to develop for and support :)
  11. Since people have been asking: the SFS benchmark lays down roughly 120MB of data per op/s of benchmark load. The slower you go, the less space you will use on the disks, regardless of how many disks you have. This creates some imbalance in large configs (for example, only about 128TB of the 864TB available was used on Isilon). There’s a quick sanity check of this right after the list.
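
Here’s that sanity check – a rough sketch comparing the 120MB-per-op/s rule of thumb against the used capacities from the table above:

# rough check: ~120MB of dataset per op/s of SPEC SFS2008 load
for name, ops, used_gb in [("Isilon", 1112705, 128889),
                           ("NetApp", 1512784, 176176)]:
    est_tb = ops * 120 / 1e6          # MB -> TB (decimal)
    print(f"{name}: estimated ~{est_tb:.0f}TB, actually used {used_gb / 1000:.0f}TB")
# Isilon: estimated ~134TB, actually used 129TB
# NetApp: estimated ~182TB, actually used 176TB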

Just remember – in order to do what ONTAP in Cluster Mode does, how many different architectures would other vendors be proposing?

  • Scale-out SAN
  • Scale-out NAS
  • Replication appliances
  • Dedupe appliances
  • All kinds of management software

How many people would it take to keep it all running? And patched? And how many firmware inter-dependencies would there be?

And what if you didn’t need, say, scale-out SAN to begin with, but some time after buying traditional SAN realized you needed scale-out? Would your current storage vendor tell you you needed, in addition to your existing SAN platform, that other one that can do scale-out? That’s completely different than the one you bought? And that you can’t re-use any of your existing stuff as part of the scale-out box, regardless of how high-end your existing SAN is?

How would that make you feel?

Always plan for the future…

Comments welcome.

D

PS: Made some small edits in the RAID parts and also added the official EMC pricelist link.

 

Interpreting $/IOPS and IOPS/RAID correctly for various RAID types http://recoverymonkey.org/2011/10/19/interpreting-iops-and-iopsraid-correctly-with-spc-1-results/ http://recoverymonkey.org/2011/10/19/interpreting-iops-and-iopsraid-correctly-with-spc-1-results/#comments Wed, 19 Oct 2011 20:25:24 +0000 Dimitris http://recoverymonkey.org/?p=302 <Article updated with more accurate calculation>

There are some impressive new scores at storageperformance.org, with the usual crazy configurations of thousands of drives etc.

Regarding price/performance:

When looking at $/IOP, make sure you are comparing list prices (look at the full disclosure report, which has all the details for each config).

Otherwise, you could get the wrong $/IOP, since some vendors quote list prices while others show heavy discounting.

For example, a box that does $6.5/IOP after 50% discounting would be $13/IOP at list prices.

Regarding RAID:

As I have mentioned in other posts, RAID plays a big role in both protection and performance.

Most SPC-1 results are using RAID10, with the notable exception of NetApp (we use RAID-DP, mathematically analogous to RAID6 in protection).

Here’s a (very) rough way to convert a RAID10 result to RAID6, if the vendor you’re looking for doesn’t have a RAID6 result, but you know the approximate percentage of random writes:

  1. SPC-1 is about 60% writes.
  2. Take any RAID10 result, let’s say 200,000 IOPS.
  3. 60% of that is 120,000, that’s the write ops. 40% is the reads, or 80,000 read ops.
  4. If using RAID6, you’d be looking at roughly a 3x slowdown for the writes: 120,000/3 = 40,000
  5. Add that to the 40% of the reads and you get the final result:
  6. 80,000 reads + 40,000 writes = 120,000 RAID6-corrected SPC-1 IOPS. Which is not quite as big as the RAID10 result… :)
  7. RAID5 would be the writes divided by 2.
All this happens because one random write can result in 6 back-end I/Os with RAID6, 4 with RAID5 and 2 with RAID10.
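
And the same arithmetic as a small Python sketch, so you can plug in your own numbers – purely illustrative, since real results also depend on caching, I/O size and the specific array:

# very rough RAID-corrected SPC-1 estimate from a RAID10 result
def raid_corrected_iops(raid10_iops, write_fraction=0.6, raid="raid6"):
    # write slowdown vs. RAID10: RAID10 does 2 back-end I/Os per random
    # write, RAID5 does 4 (2x slower), RAID6 does 6 (3x slower)
    slowdown = {"raid10": 1.0, "raid5": 2.0, "raid6": 3.0}[raid]
    reads = raid10_iops * (1 - write_fraction)
    writes = raid10_iops * write_fraction / slowdown
    return reads + writes

print(raid_corrected_iops(200000, raid="raid6"))   # 120000.0
print(raid_corrected_iops(200000, raid="raid5"))   # 140000.0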

Just make sure you’re comparing apples to apples, that’s all. I know we all suffer from ADD in this age of information overload, but do spend some time going through the full disclosure, since there’s always interesting stuff in there…

D

 

 

Buyer beware: is your storage vendor sizing properly for performance, or are they under-sizing technologies like Megacaching and Autotiering? http://recoverymonkey.org/2011/06/29/buyer-beware-is-your-storage-vendor-sizing-properly-for-performance-or-are-they-under-sizing-technologies-like-megacaching-and-autotiering/ http://recoverymonkey.org/2011/06/29/buyer-beware-is-your-storage-vendor-sizing-properly-for-performance-or-are-they-under-sizing-technologies-like-megacaching-and-autotiering/#comments Wed, 29 Jun 2011 14:12:18 +0000 Dimitris http://recoverymonkey.org/?p=269 With the advent of performance-altering technologies (notice the word choice), storage sizing is just not what it used to be.

I’m writing this post because more and more I see some vendors not using scientific methods to size their solution, instead aiming to reach a price point, hoping the technology will work to achieve the requisite performance (and if it doesn’t, it’s sold anyway, either they can give some free gear to make the problem go away, or the customer can always buy more, right?)

Back in the “good old days” with legacy arrays, one could (and still can) get fairly deterministic performance: knowing the required workload and the RAID type, you could work out roughly how many disks would be needed to sustain that performance, as long as the controller and buses were not overloaded.

With modern systems, there is now a plethora of options that can be used to get more performance out of the array, or, alternatively, get the same average performance as before, using less hardware (hopefully for less money).

If anything, advanced technologies have made array sizing more complex than before.

For instance, Megacaches can be used to dramatically change the I/O reaching the back-end disks of the array. NetApp FAS systems can have up to 16TB of deduplication-aware, ultra-granular (4K) and intelligent read cache. Truly a gigantic size, bigger than the vast majority of storage users will ever need (and bigger than many customers’ entire storage systems). One could argue that with such an enormous amount of cache, one could dispense with most disk drives and instead save money by using SATA (indeed, several customers are doing exactly that). Other vendors are following NetApp’s lead and starting to implement similar technologies — simply because it makes a lot of sense.

However…

It is crucial that, when relying on caching, extra care is taken to size the solution properly, if a reduction in the number and speed of the back-end disks is desired.

You see, caches only work well if they can cache the majority of what’s called the active working set.

Simply put, the working set is not all your data, but the subset of the data you’re “touching” constantly over a period of time. For a customer that has, say, a 20TB Database, the true working set may only be something as small as 5% — enabling most of the active data to fit in 1TB of cache. So, during daily use, a 1TB cache could satisfy most of the I/O requirements of the DB. The back-end disks could comfortably be just enough SATA to fit the DB.

But what about the times when I/O is not what’s normally expected? Say, during a re-indexing, or a big DB export, or maybe month-end batch processing. Such operations could vastly change the working set and temporarily raise it from 5% to something far larger — at which point, a 1TB cache and a handful of back-end SATA may not be enough.
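
Here’s a tiny sketch of that kind of working-set math – the 20TB database and the 5% figure are from the example above, while the 1TB cache and the batch-window percentage are made-up illustrations of how the working set can balloon, not sizing rules:

# illustrative working-set math for the 20TB DB example above
db_tb = 20.0
cache_tb = 1.0                        # example cache size
for scenario, working_set_pct in [("normal daily use", 0.05),
                                  ("re-index / batch window", 0.25)]:  # 25% is a made-up example
    ws_tb = db_tb * working_set_pct
    fits = "fits" if ws_tb <= cache_tb else "does NOT fit"
    print(f"{scenario}: working set ~{ws_tb:.1f}TB, {fits} in {cache_tb}TB of cache")
# normal daily use: working set ~1.0TB, fits in 1.0TB of cache
# re-index / batch window: working set ~5.0TB, does NOT fit in 1.0TB of cache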

Which is why, when sizing, multiple measurements need to be taken, and not just average or even worst-case.

Let’s use a database as an example again (simply because the I/O can change so dramatically with DBs).You could easily have the following I/O types:

  1. Normal use – 20,000 IOPS, all random, 8K I/O size, 80% reads
  2. DB exports – high MB/s, mostly sequential writes, large I/O size, relatively few IOPS
  3. Sequential read after random write – maybe data is added to the DB randomly, and then a big sequential read (or many parallel ones) is launched.

You see, the I/O profile can change dramatically. If you only size for case #1, you may not have enough back-end disk to sustain the DB exports or the parallel sequential table scans. If you size for case #2, you may think you don’t need much cache since the I/O is mostly sequential (and most caches are bypassed for sequential I/O) – but that would be totally wrong during normal operation.

If your storage vendor has told you they sized for what generates the most I/O, then the question is, what kind of I/O was it?

The other new trendy technology (and the most likely to be under-sized) is Autotiering.

Autotiering, simply put, allows moving chunks of data around the array depending on their “heat index”. Chunks that are very active may end up on SSD, whereas chunks that are dormant could safely stay on SATA.

Different arrays do different kinds of Autotiering, mostly based on various underlying architectural characteristics and limitations. For example, on an EMC Symmetrix the chunk size is about 7.5MB. On an HDS VSP, the chunk is about 40MB. On an IBM DS8000, SVC or EMC Clariion/VNX, it’s 1GB.

With Autotiering, just like with caching, the smaller the chunk size, the more efficient the end result will ultimately be. For instance, a 7.5MB chunk could need as little as 3-5% of ultra-fast disk as a tier, whereas a 1GB chunk may need as much as 10-15%, because the larger chunk contains mostly inactive data mixed together with the active data.

Since most arrays write data with a geometric locality of reference (in contrast, NetApp uses geometric and temporal), with large-chunk autotiering you end up with pieces of data that are “hot” that always occupy the same chunk as neighboring “cool” pieces of data. This explains why the smaller the chunk, the better off you are.
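
To put some rough numbers on it, here’s a tiny sketch using the midpoints of the percentages above – the 500TB array size is just an example for illustration:

# illustrative SSD tier capacity at the flash percentages mentioned above
total_tb = 500                                            # example array size
for chunk, flash_fraction in [("7.5MB chunks", 0.04),     # midpoint of 3-5%
                              ("1GB chunks", 0.125)]:     # midpoint of 10-15%
    print(f"{chunk}: roughly {total_tb * flash_fraction:.0f}TB of SSD tier needed")
# 7.5MB chunks: roughly 20TB of SSD tier needed
# 1GB chunks: roughly 62TB of SSD tier needed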

So, with a large chunk, this can happen:

[Diagram: a large autotiering chunk containing both hot and cold data]

The array will try to cache as much as it can, then migrate chunks if they are consistently busy or not. But the whole chunk has to move, not just the active bits within the chunk… which may be just fine, as long as you have enough of everything.

So what can you do to ensure correct sizing?

There are a few things you can do to make sure you get accurate sizing with modern technologies.

  1. Provide performance statistics to vendors — the more detailed the better. If we don’t know what’s going on, it’s hard to provide an engineered solution.
  2. Provide performance expectations — i.e. “I want Oracle queries to finish in 1/4th the time compared to what I have now” — and tie those expectations to business benefits (makes it easier to justify).
  3. Ask vendors to show you their sizing tools and explain the math behind the sizing — there is no magic!
  4. Ask vendors if they are sizing for all the workloads you have at the moment (not just different apps but different workloads within each app) — and how.
  5. Ask them to show you what your working set is and how much of it will fit in the cache.
  6. Ask them to show you how your data would be laid out in an Autotiered environment and what bits of it would end up on what tier. How is that being calculated? Is the geometry of the layout taken into consideration?
  7. Do you have enough capacity for each tier? On Autotiering architectures with large chunks, do you have 10-15% of total storage being SSD?
  8. Have the controller RAM and CPU overheads due to caching and autotiering been taken into account? Such technologies do need extra CPU and RAM to work. Ask to see the overhead (the smaller the Autotiering chunk size, the more metadata overhead, for example). Nothing is free.
  9. Beware of sizings done verbally or on cocktail napkins, calculators, or even spreadsheets – I’ve yet to see a spreadsheet model storage performance accurately.
  10. Beware of sizings of the type “a 15K disk can do 180 IOPS” – it’s a lot more complicated than that (see the sketch right after this list)!
  11. Understand the difference between sequential, random, reads, writes and I/O size for each proposed architecture — the differences in how I/O is done depending on the platform are staggering and can result in vastly different disk requirements — making apples-to-apples comparisons challenging.
  12. Understand the extra I/O and capacity impact of certain CDP/Replication devices — it can be as much as 3x, and needs to be factored in.
  13. What RAID type is each vendor using? That can have a gigantic performance impact on write-intensive workloads (in addition to the reliability aspect).
  14. If you are getting unbelievably low pricing — ask for a contract ensuring upgrade pricing will be along the same lines. “The first hit is free” is true in more than one line of business.
  15. And, last but by no means least — ask how busy the proposed solution will be given the expected workload! It surprises me that people will try to sell a box that can do the workload but will be 90% busy doing so. Are you OK with that kind of headroom? Remember – disk arrays are just computers running specialized software and hardware, and as such their CPU can run out of steam just like anything else.
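
To illustrate points 10 and 13, here’s a deliberately naive Python sketch of the kind of disk-count math to be suspicious of – it ignores caching, controller headroom, I/O size and sequential vs. random behavior. The workload is the 20,000 IOPS / 80% read example from earlier, the per-disk figure is the “180 IOPS” rule of thumb, and the write penalties are the usual back-end I/O counts per random write (2 for RAID10, 6 for RAID6):

# deliberately naive back-end disk count estimate - ignores cache, CPU, etc.
def naive_disk_count(host_iops, read_fraction, write_penalty, iops_per_disk=180):
    # every random write turns into write_penalty back-end I/Os
    backend_iops = (host_iops * read_fraction +
                    host_iops * (1 - read_fraction) * write_penalty)
    return backend_iops / iops_per_disk

# 20,000 host IOPS, 80% reads, 15K drives "rated" at 180 IOPS each
print(round(naive_disk_count(20000, 0.8, write_penalty=2)))   # RAID10: ~133 disks
print(round(naive_disk_count(20000, 0.8, write_penalty=6)))   # RAID6:  ~222 disks

The point isn’t the exact numbers – it’s that the RAID choice alone swings the naive answer by nearly 2x, and the real answer depends on far more than a per-disk IOPS figure.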

If this all seems hard — it’s because it is. But see it as due diligence — you owe it to your company, plus you probably don’t want to be saddled with an improperly-sized box for the next 3-5 years, just because the offer was too good to refuse…

D

 

NetApp vs EMC usability report: malice, stupidity or both? http://recoverymonkey.org/2011/05/12/netapp-vs-emc-usability-report-malice-stupidity-or-both/ http://recoverymonkey.org/2011/05/12/netapp-vs-emc-usability-report-malice-stupidity-or-both/#comments Thu, 12 May 2011 21:58:56 +0000 Dimitris http://recoverymonkey.org/?p=262 Most are familiar with Hanlon’s Razor:

Never attribute to malice that which is adequately explained by stupidity.

A variation of that is:

Never attribute to malice that which is adequately explained by stupidity, but don’t rule out malice.

You see, EMC sponsored a study comparing their systems to ones from the company they look up to and try to emulate. The report deals with ease-of-use (and I’ll be the first to admit the current iteration of EMC boxes is far easier to use than in the past, and the GUI has some cool stuff in it). I was intrigued, but after reading the official-looking report posted by Chuck Hollis, I wondered who in their right mind would lend it credence, and I ignored it since I have a real day job solving actual customer problems and can’t possibly respond to every piece of FUD I see (and I see a lot).

Today I’m sitting in a rather boring meeting so I thought I’d spend a few minutes to show how misguided the document is.

In essence, the document tackles the age-old dilemma of which race car to get by comparing how easy it is to change the oil, and completely ignores the “winning the race with said car” part. My question would be: “which car allows you to win the race more easily and with the least headaches, least cost and least effort?”

And if you think winning a “race” is just about performance, think again.

It is also interesting how the important aspects of efficiency, reliability and performance are not tackled, but I guess this is a “usability” report…

Strange that a company named “Strategic Focus” reduces itself to comparing arrays by measuring the number of mouse clicks. Not sure how this is strategic for customers. They were commissioned by EMC, so maybe EMC considers this strategic.

I’ll show how wrong the document is by pointing at just some of the more glaring issues, but I’ll start by saying a large multinational company has many PB of NetApp boxes around the globe and 3 relaxed guys to manage it all. How’s that for a real example? :)

  1. Page 2, section 4, “Methodology”: EMC’s own engineers set up the VNX properly. No mention of who did the NetApp testing, what their qualifications are, and so on. So, first question: “Do these people even know what they’re doing? Have they really used a NetApp system before?”
  2. Page 10, Table A, showing the configurations. A NetApp FAS3070 was used, running the latest code at this moment (8.01). Thanks EMC for the unintended compliment – you see, that system is 2 generations old (the current one is 3270 and the previous one is 3170) yet it can still run the very latest 64-bit ONTAP code just fine. What about the EMC CX3? Can it run FLARE31? Or is that a forklift upgrade? Something to be said for investment protection :)
  3. Page 3 table 5-1, #1. Storage pools on all modern arrays would typically be created upfront, so the wording is very misleading. In order to create a new LUN one does NOT NEED to create a pool. Same goes for all vendors.
  4. Same table and section (also mentioned in section 7): Figuring out the space available is as simple as going to the aggregate page, where the space is clearly shown for the aggregates. So, unsure what is meant here.
  5. Regarding LUN creation… Let me ask you a question: After you create a LUN on any array, what do you need to do next? You see, the goal is to attach the LUN to a host, do alignment, partition creation, multipathing and create a filesystem and write stuff to it. You know, use it. NetApp largely automates end-to-end creation of host filesystems and, indeed, does not need an administrator to create a LUN on the array at all. Clearly the person doing the testing is either not aware of this or decided to omit this fact.
  6. Page 4, item 4 (thin provisioning). Asinine statement – plus, any NetApp LUN can be made thick or thin with a single click, whereas a VNX needs to do a migration. Indeed, NetApp does not complicate things whether thin or thick is required, does not differentiate between thin and thick when writing, and therefore does not incur a performance penalty, whereas EMC does (according to EMC documentation).
  7. Page 4, item 5 (Creation of virtual CIFS servers). The Multistore feature is free of charge on all new systems and allows one to create fully segregated, secure multitenancy virtual CIFS, NFS and iSCSI NetApp “partitions” – far beyond the capabilities of EMC. Again, misleading.
  8. Page 4, item 6 (growing storage elements). No measurable difference? Kindly show all the steps to grow a LUN until the new space is visible from the host side. End-to-end is important to real users since they want to use the storage. Or maybe not, for the authors of this document.
  9. Page 5, Item 1. We are really talking here about EMC snapshots? Seriously? Versus NetApp? To earn the right to do so assumes your snapshots are a usable and decent feature and that you can take a good number of them without the box crumbling to pieces. Ask any vendor about a production array with the most snaps and ask to talk to the customer using it. Then compare the number of snaps to a typical NetApp customer’s. Don’t be surprised if one number is a few hundred times less than the other.
  10. Page 5, item 3 (storage tiering): part of a longer conversation but this assumes all arrays need to do tiering. If my solution is optimized to the level that it doesn’t need to do this but yours is not optimized so it needs tiering, why on earth am I being penalized for doing storage more efficiently than you? (AKA the “not invented here” syndrome).
  11. Page 6 item 1 (VMware awareness): NetApp puts all the awareness inside vCenter and, indeed, datastore creation (including volume/LUN/NAS creation and resizing), VM cloning etc. all from within vCenter itself. Ask for a demo and prepare to be amazed.
  12. Page 6, item 2 – (dedupe/compress individual VMs): This one had my blood boiling. You see, EMC cannot even dedupe individual VMs (impossible, given the fact that current DART code only does dedupe at the file and not the block level, and no two active VMs will ever be exactly the same), can’t dedupe at all for block storage (maybe in the future, but not today), and in general doesn’t recommend compression for VMs! Ask to see the best practices guide that states all this is supported and recommended for active production VMs, and to talk to a customer doing it at scale (not 10 VMs). A feature you can theoretically turn on but that will never work is not quite useful, you see…
  13. Page 8, entire table: Too much to comment on, suffice it to say that NetApp systems come with tools not mentioned in this report that go so far beyond what Unisphere does that it’s not even funny (at no additional cost). Used by customers that have thousands of NetApp systems. That’s how much those tools scale. EMC would need vast portions of the Ionix suite to do anything remotely similar (at $$$). Of course, mentioning that would kinda derail this document… and the piece about support and upgrades is utterly wrong, but I like to keep the surprise for when I do the demos and not share cool IP ideas here :)
  14. Page 11, Table B1: In the end, the funniest one of all! If you add up the total number of mouse clicks, NetApp needed 92 vs EMC’s 111. Since the whole point of this usability report is to show overall ease of use by measuring the total number of clicks to do stuff, it’s interesting that they didn’t do a simple total to show who won in the end… :)

I could keep going but I need to pay attention to my meeting now since it suddenly became interesting.

Ultimately, when it comes to ease of use, it’s simple to just do a demo and have the customer decide for themselves which approach they like best. Documents such as this one mean less than nothing for actual end users.

I should have another similar list showing clicks and TIME needed to do certain other things. For instance, using RecoverPoint (or any other method), kindly show the number of clicks and time (and disk space) for creating 30 writable clones of a 10TB SQL DB and mounting them on 30 different DB servers simultaneously. Maintaining unique instance names etc. Kinda goes a bit beyond LUN creation, doesn’t it? :)

All this BTW doesn’t mean any vendor should rest on their laurels and stop working on improving usability. It’s a never-ending quest. Just stop it with the FUD, please…

Finishing with something funny: Check this video for a good demonstration of something needing few clicks yet not being that easy to do.

Comments welcome.

D

OS X and SSD – tunings plus performance with and without TRIM http://recoverymonkey.org/2011/05/05/os-x-ssdtunings-plus-performance-with-and-without-trim/ http://recoverymonkey.org/2011/05/05/os-x-ssdtunings-plus-performance-with-and-without-trim/#comments Thu, 05 May 2011 17:46:04 +0000 Dimitris http://recoverymonkey.org/2011/05/05/os-x-ssdtunings-plus-performance-with-and-without-trim/ I finally decided to spring for a SSD for my laptop since I hammer it heavily with a lot of mostly random I/O. It was money well spent.

I went for an Intel 320 model, since it includes extra capacitors for flushing the cache in the event of power failure, and has RAID-4 onboard for protection beyond sparing (there are other, faster SSDs but I need the reliability and can’t afford large-sized SLC).

I used the trusty postmark (here’s a link to the OS X executable) to generate a highly random workload with varying file sizes – 10,000 files between 500 bytes and 100KB, 20,000 transactions, 4K reads and writes, no buffering – using these settings:

set buffering false
set size 500 100000
set read 4096
set write 4096
set number 10000
set transactions 20000
run

All testing was done on OS X 10.6.7.

Here’s the result with the original 7200 RPM HDD:

Time:
198 seconds total
186 seconds of transactions (107 per second)

Files:
20163 created (101 per second)
Creation alone: 10000 files (1111 per second)
Mixed with transactions: 10163 files (54 per second)
10053 read (54 per second)
9945 appended (53 per second)
20163 deleted (101 per second)
Deletion alone: 10326 files (3442 per second)
Mixed with transactions: 9837 files (52 per second)

Data:
557.87 megabytes read (2.82 megabytes per second)
1165.62 megabytes written (5.89 megabytes per second)

I then replaced the internal drive with SSD, popped the old internal drive into an external caddy, plugged it into the Mac, reinstalled OS X and simply told it to move the user and app stuff from the old drive to the new (Apple makes those things so easy – on a PC you’d probably need something like an imaging program but that wouldn’t take care of very different hardware). I spent a ton of time testing to make sure it was all OK, in disbelief it was that easy. Kudos, Apple.

Here are the results with SSD (2/3rds full FWIW):

Time:
19 seconds total
13 seconds of transactions (1538 per second)

Files:
20163 created (1061 per second)
Creation alone: 10000 files (2500 per second)
Mixed with transactions: 10163 files (781 per second)
10053 read (773 per second)
9945 appended (765 per second)
20163 deleted (1061 per second)
Deletion alone: 10326 files (5163 per second)
Mixed with transactions: 9837 files (756 per second)

Data:
557.87 megabytes read (29.36 megabytes per second)
1165.62 megabytes written (61.35 megabytes per second)

A fair bit of improvement… :) The perceived difference is amazing. For some things I’ve caught it doing over 200MB/s sustained writes.
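
Just to quantify it, here’s a quick sketch comparing the two runs, using the numbers straight from the postmark output above:

# speedup of the SSD vs. the 7200 RPM HDD, from the postmark runs above
hdd = {"total seconds": 198, "transactions/s": 107, "read MB/s": 2.82, "write MB/s": 5.89}
ssd = {"total seconds": 19,  "transactions/s": 1538, "read MB/s": 29.36, "write MB/s": 61.35}
for metric in hdd:
    ratio = hdd[metric] / ssd[metric] if metric == "total seconds" else ssd[metric] / hdd[metric]
    print(f"{metric}: {ratio:.1f}x better on the SSD")
# total seconds: 10.4x, transactions/s: 14.4x, read MB/s: 10.4x, write MB/s: 10.4x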

I also disabled the sudden motion sensor, since there’s no point stopping I/O to an SSD if one shakes the laptop. From the command line:

sudo pmset -a sms 0 (this disables it)
sudo pmset -g (to verify it was done)

And since I don’t need hotfile adaptive clustering on a SSD, I decided to disable access time updates (noatime in UNIX parlance).

You need to put the script from here: http://dl.dropbox.com/u/5875413/Tools/com.my.noatime.plist

into /Library/LaunchDaemons

And make sure it has the right permissions:

sudo chown root:wheel com.my.noatime.plist

Then reboot, type mount from the command line, and see if the root filesystem shows noatime as one of the mount arguments.

For example mine shows

/dev/disk0s2 on / (hfs, local, journaled, noatime)

I then re-ran postmark, here are the results with noatime:

Time:
16 seconds total
11 seconds of transactions (1818 per second)

Files:
20163 created (1260 per second)
Creation alone: 10000 files (2500 per second)
Mixed with transactions: 10163 files (923 per second)
10053 read (913 per second)
9945 appended (904 per second)
20163 deleted (1260 per second)
Deletion alone: 10326 files (10326 per second)
Mixed with transactions: 9837 files (894 per second)

Data:
557.87 megabytes read (34.87 megabytes per second)
1165.62 megabytes written (72.85 megabytes per second)

Even better.

Now here comes the part that I hoped would work better than it did:

OS X doesn’t support the TRIM command for SSDs yet (unless you have a really new Mac with an Apple SSD). Fortunately, some enterprising users found out that it is possible to turn TRIM on OS X. There are various ways to do it but someone already automated the process. Be sure to do a backup first (both system backup and through the TRIM enabler application).

The process does work. However, it seems to run TRIM too aggressively, interfering with the random-access optimizations some drives have.

Benchmark after TRIM enabled:

Time:
39 seconds total
31 seconds of transactions (645 per second)

Files:
20163 created (517 per second)
Creation alone: 10000 files (3333 per second)
Mixed with transactions: 10163 files (327 per second)
10053 read (324 per second)
9945 appended (320 per second)
20163 deleted (517 per second)
Deletion alone: 10326 files (2065 per second)
Mixed with transactions: 9837 files (317 per second)

Data:
557.87 megabytes read (14.30 megabytes per second)
1165.62 megabytes written (29.89 megabytes per second)

This kind of performance loss is unacceptable to me, so I restored the kext file through the TRIM app, rebooted and re-ran the benchmark and all was fine again.

My recommendations:

  1. Always test before and after the tweaks – my results may only apply to Intel drives. Please post your results with other drives
  2. Always do backups before serious tweaks
  3. If TRIM seems to slow down random I/O on your Mac SSD, don’t keep it running, maybe enable it once a month, go to disk utility, and ask it to erase the free space. This will ensure the drive stays in good shape without adversely affecting normal random I/O.

D
