Category Archives: Storage

Marketing fun: NetApp industry first of up to 13 million IOPS in a single rack

I’m seeing some really “out there” marketing lately, every vendor (including us) trying to find an angle that sounds exciting while not being an outright lie (most of the time).

A competitor recently claimed an industry first of up to 1.7 million (undefined type) IOPS in a single rack.

The number (which admittedly sounds solid), got me thinking. Was the “industry first” that nobody else did up to 1.7 million IOPS in a single rack?

Would that statement also be true if someone else did up to 5 million IOPS in a rack?

I think that, in the world of marketing, it would – since the faster vendor doesn’t do up to 1.7 million IOPS in a rack, they do up to 5! It’s all about standing out in some way.

Well – let’s have some fun.

I can stuff 21x EF560 systems in a single rack.

Each of those systems can do 650,000 random 4K reads at a stable 800 microseconds (since I like defining my performance stats), 600,000 random 8K reads at under 1ms, and over 300,000 random 32KB reads at under 1ms. Also 12GB/s large blog sequential reads. This is I/O straight from the SSDs and not RAM cache (the I/O from cache can of course be higher but let’s not count that).

See here for the document showing some of the performance numbers.

Well – some simple math shows a standard 42U rack fully populated with EF560 will do the following:

  • 13,650,000 IOPS.
  • 252GB/s throughput.
  • Up to 548TB of usable SSD capacity using DDP protection (up to 639TB with RAID5).

Not half bad.

Doesn’t quite roll off the tongue though – industry first of up to thirteen million six hundred and fifty thousand IOPS in a single rack. :)

I hope rounding down to 13 million is OK with everyone.

 

D

Technorati Tags: , , ,

NetApp Posts Top Ten SPC-1 Price-Performance Results for the new EF560 All-Flash Array

<edit: updated with the changes in the SPC-1 price/performance lineup as of 3/27/2015, fixed some typos>

I’m happy to announce that today we announced the new, third-gen EF560 all-flash array, and also posted SPC-1 results showing the impressive performance it is capable of in this extremely difficult benchmark.

If you have no time to read further – the EF560 achieves the absolute best price/performance at very low latencies in the SPC-1 benchmark.

The EF line has been enjoying great success for some time now with huge installations in some of the biggest companies in the world with the highest profile applications (as in, things most of us use daily).

The EF560 is the latest all-flash variant of the E-Series family, optimized for very low latency and high performance workloads while ensuring high reliability, cost effectiveness and simplicity.

EF560 highlights

The EF560 runs SANtricity – a lean, heavily optimized storage OS with an impressively short path length (the overhead imposed by the storage OS itself to all data going through the system). In the case of the EF the path length is tiny, around 30 microseconds. Most other storage arrays have a much longer path length as a result of more features and/or coding inefficiencies.

Keeping the path length this impressively short is one of the reasons the EF does away with fashionable All-Flash features like compression and deduplication –  make no mistake, no array that performs those functions is able to sustain that impressively short a path length. There’s just too much in the way. If you really want data reduction and an incredible number of features, we offer that in the FAS line – but the path length naturally isn’t as short as the EF560’s.

A result of the short path length is impressively low latency while maintaining high IOPS with a very reasonable configuration, as you will see further in the article.

Some other EF560 features:

  • No write cliff due to SSD aging or fullness
  • No performance impact due to SSD garbage collection
  • Enterprise components – including SSDs
  • Six-nines available
  • Up to 120x 1.6TB SSDs per system (135TB usable with DDP protection, even more with RAID5/6)
  • High throughput – 12GB/s reads, 8GB/s writes per system (many people forget that DB workloads need not just low latency and high IOPS but also high throughput for certain operations).
  • All software is included in the system price, apart from encryption
  • The system can do snaps and replication, including fully synchronous replication
  • Consistency Group support
  • Several application plug-ins
  • There are no NAS capabilities but instead there is a plethora of block connectivity options: FC, iSCSI, SAS, InfiniBand
  • The usual suspects of RAID types – 5, 10, 6 plus…
  • DDP – Dynamic Disk Pools, a type of declustered RAID6 implementation that performs RAID at the sub-disk level – very handy for large pools, rapid disk rebuilds with minimal performance impact and overall increased flexibility (for instance, you could add a single disk to the system instead of entire RAID groups’ worth)
  • T10-PI to help protect against insidious data corruption that might bypass RAID and normal checksums, and provide end-to-end protection, from the application all the way to the storage device
  • Can also be part of a Clustered Data ONTAP system using the FlexArray license on FAS.

The point of All-Flash Arrays

Going back to the short path length and low latency discussion…

Flash has been a disruptive technology because, if used properly, it allows an unprecedented performance density, at increasingly reasonable costs.

The users of All-Flash Arrays typically fall in two camps:

  1. Users that want lots of features, data reduction algorithms, good but not deterministic performance and not crazy low latencies – 1-2ms is considered sufficient for this use case (with the occasional latency spike), as it is better than hybrid arrays and way better than all-disk systems.
  2. Users that need the absolute lowest possible latency (starting in the microseconds – and definitely less than 1ms worst-case) while maintaining uncompromising reliability for their applications, and are willing to give up certain features to get that kind of performance. The performance for this type of user needs to be deterministic, without weird latency spikes, ever.

The low latency camp typically uses certain applications that need very low latency to generate more revenue. Every microsecond counts, while failures would typically mean significant revenue loss (to the point of making the cost of the storage seem like pocket change).

Some of you may be reading this and be thinking “so what, 1ms to 2ms is a tiny difference, it’s all awesome”. Well – at that level of the game, 2ms is twice the latency of 1ms, and it is a very big deal indeed. For the people that need low latency, a 1ms latency array is half the speed of a 500 microsecond array, even if both do the same IOPS.

You may also be thinking “SSDs that fit in a server’s PCI slot have low latency, right?”

The answer is yes, but what’s missing is the reliability a full-fledged array brings. If the server dies, access is lost. If the card dies, all is lost.

So, when looking for an All-Flash Array, think about what type of flash user you are. What your business actually needs. That will help shape your decisions.

All-Flash Array background operations can affect latency

The more complex All-Flash Arrays have additional capabilities compared to the ultra-low-latency gang, but also have a higher likelihood of producing relatively uneven latency under heavy load while full, and even latency spikes (besides their naturally higher latency due to the longer path length).

For instance, things like cleanup operations, various kinds of background processing that kicks off at different times, and different ways of dealing with I/O depending on how full the array is, can all cause undesirable latency spikes and overall uneven latency. It’s normal for such architectures, but may be unacceptable for certain applications.

Notably, the EF560 doesn’t suffer from such issues. We have been beating competitors in difficult performance situations with the slower predecessors of the EF560, and we will keep doing it with the new, faster system :)

Enough already, show me the numbers!

As a refresher, you may want to read past SPC-1 posts here and here, and my performance primer here.

Important note: SPC-1 is a block-based benchmark with its own I/O blend and, as such, the results from any vendor’s SPC-1 Result should not be compared to marketing IOPS numbers of all reads or metadata-heavy NAS benchmarks like SPEC SFS (which are far easier on systems than the 60% write blend and hotspots of the SPC-1 workload). Indeed, the tested configuration could perform way more “marketing” IOPS – but that’s decidedly not the point of this benchmark.

The EF560 SPC-1 Result links if you want the detail are here (summary) and here (full disclosure). In addition, here’s the link to the “Top 10 by Price-Performance” systems page so you can compare to other submissions (unfortunately, SPC-1 results are normally just alphabetically listed, making it time-consuming to compare systems unless you’re looking at the already sorted Top 10 lists).

The things to look for in SPC-1 submissions

Typically you’re looking for the following things to make sense of an SPC-1 submission:

  • Latency vs IOPS – many submissions will show high IOPS at huge latency, which would be rather useless for the low-latency crowd
  • Sustainability – was performance even or are there constant huge spikes?
  • RAID level – most submissions use RAID10 for speed, what would happen with RAID6?
  • Application Utilization. This one is important yet glossed over. It signifies how much capacity the benchmark consumed vs the overall raw capacity of the system, before RAID, spares etc.
  • Price – discounted or list?

Let’s go over these one by one.

Latency vs IOPS

Our average latency was 0.93ms at 245,011.76 SPC-1 IOPS, and extremely flat during the test:

EF560distrib

Sustainability

The SPC-1 rules state the minimum runtime should be 8 hours. There was no significant variation in performance during the test:

EF560distramp

RAID level

RAID-10 was used for all testing, with T10-PI Data Assurance enabled (which has a performance penalty but the applications these systems are used for typically need paranoid data integrity). This system would perform slower with RAID5 or RAID6. But for applications where the absolute lowest latency is important, RAID10 is a safe bet, especially with systems that are not write-optimized for RAID6 writes like Data ONTAP is. Not to fret though – the price/performance remained stellar as you will see.

Application Utilization

Our Application Utilization was a very high 46.90% – among the highest of any submission with RAID10 (and among the highest overall, only Data ONTAP submissions can go higher due to RAID-DP).

EF560CapUtil

We did almost completely fill up the resulting RAID10 space, to show that the system’s performance is unaffected when very full. However, Application Utilization is the only metric that really shows how much of the total possible raw capacity the benchmark actually used and signifies how space-efficient the storage was.

Otherwise, someone could do quadruple mirroring of 100TB, fill up the resulting 25TB to 100%, and call that 100% efficient… when in fact it only consumed 25% :)

It is important to note there was no compression or deduplication enabled by any vendor since it is not allowed by the current version of the benchmark.

Compared to other vendors

I wanted to show a comparison between the SPC-1 Top Ten Price-Performance results both in absolute terms and also normalized around 500 microsecond latency to illustrate the fact that very low latency with great performance is still possible at a compelling price point with this solution.

Here are the Top Ten Price-Performance systems as of March 27, 2015, with SPC-1 Results links if you want to look at things in detail:

  1. X-IO ISE 820 G3 All Flash Array
  2. Dell Storage SC4020 (6 SSDs)
  3. NetApp EF560 Storage System
  4. Huawei OceanStor Dorado2100 G2
  5. HP 3PAR StoreServ 7400 Storage System
  6. FUJITSU ETERNUS DX200 S3
  7. Kaminario K2 (28 nodes)
  8. Huawei OCEANSTOR Dorado 5100
  9. Huawei OCEANSTOR Dorado 2100
  10. FUJITSU ETERNUS DX100 S3

I will show columns that explain the results of each vendor around 500 microseconds, plus how changing the latency target affects SPC-1 IOPS and also how it affects $/SPC1-IOPS.

The way you determine that lower latency point (SPC calls it “Average Response Time“) is by looking at the graph that shows latency vs SPC-1 IOPS and finding the load point closest to 500 microseconds. Let’s pick Kaminario’s K2 so you learn what to look for:

K2curve

Notice how the SPC-1 IOPS around half a millisecond is about 10x slower than the performance around 3ms latency. The system picks up after that very rapidly, but if your requirements are for latency to not exceed 500 microseconds, you will be better off spending your money elsewhere (indeed, a very high profile client asked us for 400 microsecond max response at the host level from the first-gen EF systems for their Oracle DBs – this is actually very realistic for many market segments).

Here’s the table with all this analysis done for you. BTW, the “adjusted latency” $/SPC-1 IOPS is not something in the SPC-1 Reports but simply calculated for our example by dividing system price by the SPC-1 IOPS found at the 500 microsecond point in all the reports.

What do the results show?

As submitted, the EF560 is #3 in the absolute Price-Performance ranking. Interestingly, once adjusted for latency around 500 microseconds at list prices (to keep a level playing field), the price/performance of the EF560 is far better than anything else on the chart.

Regarding pricing: Note that some vendors have discounted pricing and some not, always check the SPC-1 report for the prices and don’t just read the summary at the beginning (for example, Fujitsu has 30% discounts showing in the reports, Dell 48%, HP 45% – the rest aren’t discounted). Our price-performance is even better once you adjust for discounts in some of the other results. Update: In this edited version of the chart I show the list price calculations as well.

Another interesting observation is the effects of longer path length on some platforms – for instance, Dell’s lowest reported latency is 0.70ms at a mere 11,249.97 SPC-1 IOPS. Clearly, that is not a system geared towards high performance at very low latency. In addition, the response time for the submitted max SPC-1 IOPS for the Dell system is 4.83ms, firmly in the “nobody cares” category for all-flash systems :) (sorry guys).

Conversely… The LRT (Least Response Time) we submitted for the EF560 was a tiny 0.18ms (180 microseconds) at 24,501.04 SPC-1 IOPS. This is the lowest LRT anyone has ever posted on any array for the SPC-1 benchmark.

Clearly we are doing something right :)

Final thoughts

If your storage needs require very low latency coupled with very high reliability, the EF560 would be an ideal candidate. In addition, the footprint of the system is extremely compact, the SPC-1 results shown are with just a 2U EF560 with 24x 400GB SSDs.

Coupled with Clustered Data ONTAP systems and OnCommand Insight and WorkFlow Automation, NetApp has an incredible portfolio, able to take on any challenge.

Thx

D

Technorati Tags: , , , ,

Beware of storage performance guarantees

Ah, nothing to bring joy to the holidays like a bit of good old-fashioned sales craziness.

Recently we started seeing weird performance “guarantees” by some storage vendors, who seem will try anything for a sale.

Probably by people that haven’t read this.

It goes a bit like this:

“Mr. Customer, we guarantee our storage will do 100,000 IOPS no matter the I/O size and workload”.

Next time a vendor pulls this, show them the following chart. It’s a simple plot of I/O size vs throughput for 100,000 IOPS:

Throughput IO Size

Notice that at a 1MB I/O size the throughput is a cool 100GB/s :)

Then ask that vendor again if they’re sure they still want to make that guarantee. In writing. With severe penalties if it’s not met. As in free gear UNTIL the requirement is met. At any point during the lifetime of the equipment.

Then sit back and enjoy the backpedalling. 

You can make it even more fun, especially if it’s a hybrid storage vendor (mixed spinning and flash storage for caching, with or without autotiering):

  • So you will guarantee those IOPS even if the data is not in cache?
  • For completely random reads spanning the entire pool?
  • For random overwrites? (that should be a fun one, 100GB/s of overwrite activity).
  • For non-zero or at least not crazily compressible data?
  • And what’s the latency for the guarantee? (let’s not forget the big one).
  • etc. You get the point.
 
Happy Holidays everyone!
 
Thx
 
D
 

 

When competitors try too hard and miss the point – part two

This will be another FUD-busting post in the two-part series (first part here).

It’s interesting how some competitors, in their quest to beat us at any cost, set aside all common sense.

Recently, an Oracle blogger attempted to understand a document NetApp originally wrote in the 90’s (and which we haven’t really updated since, which is admittedly our bad) that explains how WAFL, the block layout engine of Data ONTAP (the storage OS on the FAS platform) works at a high level.

Apparently, he thinks that we turn everything into 4K I/Os, so if someone tried to read 256K, it would have to become 64 separate I/Os, and, by extension, believes this means no NetApp system running ONTAP can ever sustain good read throughput since the back-end would be inundated with IOPS.

The conclusions he comes to are interesting to say the least. I will copy-paste one of the calculations he makes for a 100% read workload:

Erroneous oracle calcs

I like the SAS logo, I guess this is meant to make the numbers look legit, as if they came from actual SAS testing :)

So this person truly believes that to read 2.6GB/s we need 5,120 drives due to the insane back-end IOPS we purportedly generate :)

This would be hilarious if it were true since it would mean NetApp managed to quietly perpetrate the biggest high tech scam in history, fooling customers for 22 years, and somehow managing to become the industry’s #1 storage OS and remain so.

Because customers are that gullible.

Right.

Well – here are some stats from a single 8040 controller, with 24 drives, doing over 2.7GB/s reads, at well under 1ms latency, so it’s not even stressed. Thanks to the Australian team for providing the stats:

8040 singlenode

In this example, 2.74GB/s are being read. From stable storage, not cache.

Now, if we do the math the way the competitor would like, it means the back-end is running at over 700,000 4K IOPS. On a single mid-range controller :)

That would be really impressive and hugely wasteful at the same time. Wait – maybe I should turn this around and claim 700,000 4K IOPS at 0.6ms capability per mid-range controller! Imagine how fast the big ones go!

It would also assume 35,000 IOPS per disk at a consistent speed and sub-millisecond response (0.64ms) – because the numbers above are from a single node with only about 20 data SSDs (plus parity and spares).

SSDs are fast but they’re not really that fast, and the purpose of this blog is to illuminate and not obfuscate.

Remember Occam’s razor. What explanation do you think makes more sense here? Pixie-dust drives and controllers, or that the Oracle blogger is massively wrong? :)

Another example – with spinning disks this time

This is a different output, to also illustrate our ability to provide detailed per-disk statistics.

From a single 8060 node, running at over 3GB/s reads during an actual RMAN job and not a benchmark tool (to use a real Oracle application example). There are 192x 10,000 RPM 600GB disks in the config (180x data, 24x parity – we run dual-parity RAID, there were 12x 16-drive RAID groups in a 14+2 config).

Numbers kindly provided by the legendary neto from Brazil (@netofrombrazil on Twitter). Check the link for his blog and all kinds of DB coolness.

This is part of the statit command’s output. I’m not showing all the disks since there are 192 of them after all and each one is a line in the output:

Read chain

The key in these stats is the “chain” column. This shows, per read command, how many blocks were read as a single entity. In this case, the average is about 49, or 196KB per read operation.

Notice the “xfers” – these drives are only doing about 88 physical IOPS on average per drive, and each operation just happens to be large. They could go faster (see the “ut%” column) but that’s just how much they were loaded during the RMAN job.

Again, if we used the blogger’s calculations, this system would have needed over 5,000 drives and generated over 750,000 back-end disk IOPS.

A public apology and retraction would be nice, guys…

Let’s extrapolate this performance at scale.

My examples are for single mid-range controllers. You can multiply that by 24 to see how fast it could go in a full cluster (yes, it’s linear). And that’s not the max these systems will do – just what was in the examples I found that were close to the competitor’s read performance example.

You see, where most of the competition is still dealing with 2-controller systems, NetApp FAS systems running Clustered ONTAP can run 8 engines for block workloads and 24 engines for NAS (8 if mixed), and each engine can have multiple TB of read/write cache (18TB max cache per node currently with ONTAP 8.2.x).

Even if a competitor’s 2 engines are faster than 2 FAS engines, if they stop at 2 and FAS stops at 24, the fight is over before it begins.

People that live in glass houses shouldn’t throw stones.

Since the competitor questioned why NetApp bought Engenio (the acquisition for our E-Series), I have a similar question: Why did Oracle buy Pillar Data? It was purchased after the Sun acquisition. Does that signify a major lack in the ZFS boxes that Pillar is supposed to address?

The Oracle blogger mentioned how their ZFS system had a great score in the SPC-2 tests (which measure throughput and not IOPS). Great.

Interestingly, Oracle ZFS systems can significantly degrade in performance over time (see here http://blog.delphix.com/uday/2013/02/19/78/) especially after writes, deletes and overwrites. Unlike ONTAP systems, ZFS boxes don’t have mechanisms to perform the necessary block reallocations to optimize the data layout in order to bring performance back to original levels (backing up, wiping the box, rebuilding and restoring is not a solution, sorry). There are ways to delay the inevitable, but nothing to fix the core issue.

It follows that the ZFS performance posted in the benchmarks may not be anywhere near what one will get long-term once the ZFS pools are fragmented and full. Making the ZFS SPC-2 benchmark result pretty useless.

NetApp E-Series inherently doesn’t have this fragmentation problem (and is near the top as a price-performance leader in the SPC-2 benchmark, as tested by SGI that resells it). Since there is no long-term speed deterioration issue with E-Series, the throughput you see in the SPC-2 benchmark will be perpetually maintained. The box is in it for the long haul.

Wouldn’t E-Series then be a better choice for a system that needs to constantly deal with such a workload? Both cost-effective and able to sustain high throughput no matter what?

As an aside, I do need to write an article on block layout optimizations available in ONTAP. Many customers are unaware of the possibilities, and competitors use FUD based on observations from back when mud was a novelty. In the meantime, if you’re a NetApp FAS customer, ask your SE and/or check your documentation for the volume option read_realloc space_optimized – great for volumes containing DB data files. Also, check the documentation for the Aggregate option free_space_realloc.

So you’re fast. What else can you do?

There were other “fighting words” in the blogger’s article and they were all about speed and how much faster the new boxes from the competitor are versus some ancient boxes they had from us. Amazing, new controllers being faster than old ones! :)

I see this trend recently, new vendors focusing solely on speed. Guess what – it’s easy to go fast. It’s also easy to be cheap. I’ll save that for a full post another time. But I fully accept that speed sells.

I can build you a commodity-based million-IOPS box during my lunch break. It’s really not that hard. Building a server with dozens of cores and TB of RAM is pretty easy.

But for Enterprise Storage, Reliability is extremely important, far more than sheer speed.

Plus Availability and Serviceability (where the RAS acronym comes from).

Predictability.

Non-Disruptive Operations, even during events that would leave other systems down for extended periods of time.

Extensive automation, management, monitoring and alerting at scale as well.

And of crucial importance is Application Integration, including the ability to perform application-aware data manipulation (fully consistent backups, restores, clones, replication).

So if a system can go fast but can’t do much else, its utility is more towards being a point solution rather than as part of a large, strategic, long-term deployment. Point solutions are useful, yes – but they are also interchangeable with the next cheap fast thing. Most won’t survive.

You know who you are.

D

Technorati Tags: , , , , , , ,

When competitors try too hard and miss the point

(edit: fixed the images)

After a long hiatus, we return to our regularly scheduled programming with a 2-part series that will address some wild claims Oracle has been making recently.

I’m pleased to introduce Jeffrey Steiner, ex-Oracle employee and all-around DB performance wizard. He helps some of our largest customers with designing high performance solutions for Oracle DBs:

Greetings from a guest-blogger.

I’m one of the original NetApp customers.

I bought my first NetApp in 1995 (I have a 3-digit support case in the system) and it was an F330. I think it came with 512MB SCSI drives, and maxed out at 16GB. It met our performance needs, it was reliable, and it was cost effective.  I continued to buy more over the following years at other employers. We must have been close to the first company to run Oracle databases on NetApp storage. It was late 1999. Again, it met our performance needs, it was reliable, and it was cost effective. My employer immediately prior to joining NetApp was Oracle.

I’m now with NetApp product operations as the principal architect for enterprise solutions, which usually means a big Oracle database is involved, but it can also include DB2, SAS, MongoDB, and others.

I normally ignore competitive blogs, and I had never commented on a blog in my life until I ran into something entitled “Why your NetApp is so slow…” and found this statement:

If an application such MS SQL is writing data in a 64k chunk then before Netapp actually writes it on disk it will have to split it into 16 different 4k writes and 16 different disk IOPS

That’s just openly false. I tried to correct the poster, but was met with nothing but other unsubstantiated claims and insults to the product line. It was clear the blogger wasn’t going to acknowledge their false premise, so I asked Dimitris if I could borrow some time on his blog.

Here’s one of the alleged results of this behavior with ONTAP– the blogger was nice enough to do this calculation for a system reading at 2.6GB/s:

 

erroneous_oracle_calcs


Seriously?

I’m not sure how to interpret this. Are they saying that this alleged horrible, awful design flaw in ONTAP leads to customers buying 50X more drives than required, and our evil sales teams have somehow fooled our customer based into believing this was necessary? Or, is this a claim that ZFS arrays have some kind of amazing ability to use 50X fewer drives?

Given the false premise about ONTAP chopping up any and all IO’s into little 4K blocks and spraying them over the drives, I’m guessing readers are supposed to believe the first interpretation.

Ordinarily, I enjoy this type of marketing. Customers bring this to our attention, and it allows us to explain how things actually work, plus it discredits the account team who provided the information. There was a rep in the UK who used to tell his customers that Oracle had replaced all competing storage arrays in their OnDemand centers with Pillar. I liked it when he said stuff like that. The reason I’m responding is not because I care about the existence of the other blog, but rather that I care about openly false information being spread about how ONTAP works.

How does ONTAP really work?

Some of NetApp’s marketing folks might not like this, but here’s my usual response:

Why does it matter?

It’s an interesting subject, and I’m happy to explain write tetrises and NVMEM write coalescence, and core utilization, but what does that have to do with your business? There was a time we dealt with accusations that NetApp was slow because we has 25 nanometer process CPU’s while the state of the art was 17nm or something like that. These days ‘cores’ seems to come up a lot, as if this happens:

logic

That’s the Brawndo approach to storage sales (https://www.youtube.com/watch?v=Tbxq0IDqD04)

“Our storage arrays contain

5 kinds of technology

which make them AWESOME

unlike other storage arrays which are

NOT AWESOME.”

A Better Way

I prefer to promote our products based on real business needs. I phrase this especially bluntly when talking to our sales force:

When you are working with a new enterprise customer, shut up about NetApp for at least the first 45 minutes of the discussion

I say that all the time. Not everyone understands it. If you charge into a situation saying, “NetApp is AWESOME, unlike EMC who is NOT AWESOME” the whole conversation turns into PowerPoint wars, links to silly blog articles like the one that prompted this discussion, and whoever wins the deal will win it based on a combination of luck and speaking ability. Providing value will become secondary.

I’m usually working in engineeringland, but in major deals I get involved directly. Let’s say we have a customer with a database performance issue and they’re looking for new storage. I avoid PowerPoint and usually request Oracle AWR/statspack data. That allows me to size a solution with extreme accuracy. I know exactly what the customer needs, I know their performance bottlenecks, and I know whatever solution I propose will meet their requirements. That reduces risk on both sides. It also reduces costs because I won’t be proposing unnecessary capabilities.

None of this has anything to do with who’s got the better SPC-2 benchmark, unless you plan on buying that exact hardware described, configuring it exactly the same way, and then you somehow make money based on running SPC-2 all day.

Here’s an actual Oracle AWR report from a real customer using NetApp. I have pruned the non-storage related parameters to make it easier to read, and I have anonymized the identifying data. This is a major international insurance company calculating its balance sheet at end-of-month. I know of at least 9 or 10 customers that have similar workloads and configurations.

awr1

Look at the line that says “Physical reads”. That’s the blocks read per second. Now look at “Std Block Size”. That’s the block size. This is 90K physical block reads per second, which is 90K IOPS in a sense. The IO is predominantly db_file_scattered_read, which counter-intuitively is sequential IO. A parameter called db_file_multiblock_read_count is set to 128. This means Oracle is attempting to read 128 blocks at a time, which equates to 1MB block sizes. It’s a sequential IO read of a file.

Here’s what we got:

1)     89K read “IOPS”, sort of.

2)     Those 89K read IOPS are actually packaged as units of 8 blocks in a single 64k unit.

3)     3K write IOPS

4)     8MB/sec of redo logging.

The most important point here is that the customer needed about 800MB/sec of throughput, they liked the cost savings of IP, and the storage system is meeting their needs. They refresh with NetApp on occasion, so obviouly they’re happy with the TCO.

To put a final nail in the coffin of the Oracle blogger’s argument, if we are really doing 89K block reads/sec, and those blocks are really chopped up into 4k units, that’s a total of about 180,000 4k IOPS that would need to be serviced at the disk layer, per the blogger’s calculation.

  • Our opposing blogger thinks that  would require about 1000 disks in theory
  • This customer is using 132 drives in a real production system.

There’s also a ton of other data on those drives for other workloads. That’s why we have QoS – it allows mixed workloads to play nicely on a single unified system.

To make this even more interesting, the data would have been randomly written in 8k units, yet they are still able to read at 800MB/sec? How is this possible? For one, ONTAP does NOT break up individual IO’s into 4k units. It tries very, very hard to never break up an IO across disks, although that can happen on occasion, notably if you fill you system up to 99% capacity or do something very much against best practices.

The main reason ONTAP can provide good sequential performance with randomly written data is the blocks are organized contiguously on disk. Strictly speaking, there is a sort of ‘fragmentation’ as our competitors like to say, but it’s not like randomly spraying data everywhere. It’s more like large contiguous chunks of data are evenly distributed across the disks. As long as those contiguous segments are sufficiently large, readahead can ensure good throughput.

That’s somewhat of an oversimplification, but it would take a couple hours and a whiteboard to explain the complete details. 20+ years of engineering can’t exactly be summarized in a couple paragraphs. The document misrepresented by the original blog was clearly dated 2006 (and that was to slightly refresh the original posting back in the nineties), and while it’s still correct as far as I can see, it’s also lacking information on the enhancements and how we package data onto disks.

By the way, this database mentioned above? It’s virtualized in VMware too.

Why did I pick an example of only 90K IOPS?  My point was this customer needed 90K IOPS, so they bought 90K IOPS.

If you need this performance:

awr2

then let us know. Not a problem. This is from a large SAP environment for a manufacturing company. It beats me what they’re doing, because this is about 10X more IO than what we typically see for this type of SAP application. Maybe they just built a really, really good network that permits this level of IO performance even though they don’t need it.

In any case, that’s 201,734 blocks/sec using a block size of 8k. That’s about 2GB/sec, and it’s from a dual-controller FAS3220 configuration which is rather old (and was the smallest box in its range when it was new).

Sing the bizarro-universe math from the other blog, these 200K IOPS would have been chopped up into 4k blocks and require a total of 400K back-end disk IOPS to service the workload. Divided by 125 IOPS/drive, we have a requirement for 3200 drives. It was ACTUALLY using more like 200 drives.

We can do a lot more, especially with the newer platforms and ONTAP clustering, which would enable up to 24 controllers in the storage cluster. So the performance limits per cluster are staggeringly high.

Futures

To put a really interesting (and practical) twist on this, sequential IO in the Oracle realm is probably going to become less important.  You know why? Oracle’s new in-memory feature. Me and several others were floored when we got the first debrief on Oracle In-Memory. I couldn’t have asked for a better implementation if I was in charge of Oracle engineering myself. Folks at NetApp started asking what this means for us, and here’s my list:

  1. Oracle customers will be spending less on storage.

That’s it. That’s my list. The data format on disk remains unchanged, the backup/restore process is the same, the data commitment process is the same. All the NetApp features that earned us around 12,500 Oracle customers are still applicable.

The difference is customers will need smaller controllers, fewer disks, and less bandwidth because they’ll be able to replace a lot of the brute-force full table scan activity with a little In-Memory magic. No, the In-Memory licenses aren’t free, but the benefits will be substantial.

SPC-2 Benchmarks and Engenio Purchases

The other blog demanded two additional answers:

1)     Why hasn’t NetApp done an SPC-2 bencharmk?

2)     Why did NetApp purchase Engenio?

SPC-2

I personally don’t know why we haven’t done an SPC-2 benchmark with ONTAP, but they are rather expensive and it’s aimed at large sequential IO processing. That’s not exactly the prime use case for FAS systems, but not because they’re weak on it. I’ve got AWR reports well into the GB/sec, so it certainly can do all the sequential IO you want in the real world, but what workloads are those?

I see little point in using an ONTAP system for most (but certainly not all) such workloads because the features overall aren’t applicable. I’m aware of some VOD applications on ONTAP where replication and backups were important. Overall, if you want that type of workload, you’d specify a minimum bandwidth requirement, capacity requirement, and then evaluate the proposals from vendors. Cost is usually the deciding factor.

Engenio Acquisition

Again, my personal opinion here on why NetApp acquired Engenio.

Tom Georgens, our CEO, spent 9 years leading Engenio and obviously knew the company and its financials well. I can’t think of any possible way to know you’re getting value for money than having someone in Georgens’ position making this decision.

Here’s the press release about it:

Engenio will enable NetApp to address emerging and fast-growing market segments such as video, including full-motion video capture and digital video surveillance, as well as high performance computing applications, such as genomics sequencing and scientific research.

Yup, sounds about right. That’s all about maximum capacity, high throughput, and low cost. In contrast, ONTAP is about manageability and advanced features. Those are aimed at different sets of business drivers.

Hey, check this out. Here’s an SEC filing:

Since the acquisition of the Engenio business in May 2011, NetApp has been offering the formerly-branded Engenio products as NetApp E-Series storage arrays for SAN workloads. Core differentiators of this price-performance leader include enterprise reliability, availability and scalability. Customers choose E-Series for general purpose computing, high-density content repositories, video surveillance, and high performance computing workloads where data is managed by the application and the advanced data management capabilities of Data ONTAP storage operating system are not required.

Key point here is “where the advanced data management capabilities of Data ONTAP are not required.” It also reflected my logic in storage decisions prior to joining NetApp, and it reflects the message I still repeat to account teams:

  1. Is there any particular feature in ONTAP that is useful for your customer’s actual business requirements? Would they like to snapshot something? Do they need asynchronous replication? Archival? SnapLock? Scale-out clusters with many nodes? Non-disruptive everything? Think carefully, and ask lots of questions.
  2. If the answer is “yes”, go with ONTAP.
  3. If the answer is “no”, go with E-Series.

That’s what I did. I probably influenced or approved around $5M in total purchases. It wasn’t huge, but it wasn’t nothing either. I’d guess we went ONTAP about 70% of the time, but I had a lot of IBM DS3K arrays around too, now known as E-Series.

“Dumb Storage”

I’ve annoyed the E-Series team a few times by referring to it as “dumb storage”, but I mean that in the nicest possible way. It’s primary job is to just sit there and work. It needs to do it fast, reliably, and cost effectively, but on a day-to-day basis it’s not generally doing anything all that advanced.

In some ways, the reliability was a weakness. It was so reliable, that we forgot it was there at all, and we’d do something like changing the email server addresses, and forget to update the RAS feature of the E-Series. Without email notification, it can take a couple years before someone notices the LED that indicates a drive needs replacement.

 

EMC’s VNX2 forklift: The importance of hardware re-use, slowing down obsolescence, and maximizing your investment

It was with interest that I watched the launch of EMC’s VNX refresh. The updated boxes got some long-awaited features, EMC talked a lot about how some pretty severe single-threaded bottlenecks were removed, more CPU and memory was put in, and there was much rejoicing.

Not really trying to pick on the new boxes (that will be in a future post, relax :)), but what I thought was interesting was that the code that makes most of the new features a possibility cannot be loaded on current-gen VNX boxes (not even the biggest, the 7500, which has plenty of CPU and RAM juice even compared to the next gen boxes).

Software-defined storage indeed.

The existing VNX disk shelves also seemingly can’t be re-used at the moment (correct me if I’m wrong please).

This forced obsolescence has been a theme with EMC: Clariion -> VNX -> VNX2 are all complete forklift upgrades. When the original VNX was released, it was utterly incompatible with the CX (Clariion) shelves (SAS replaced FC connectivity). Despite using the exact same code (FLARE).

Other vendors are guilty of this too – a new controller is released, and all prior investments on disk shelves are rendered useless (HDS did this with several iterations of AMS, maybe even with AMS -> HUS, HP with EVA…)

I understand that as technology progresses we sometimes have to abandon the old to make room for the new but, at the same time, customers make significant investments in N-1 technology – and often want to be able to re-use some of their investment with N (and sometimes N+1).

I just had this conversation with a customer, and he said “well, I throw away my gear every 3 years, why should I care?

Let’s try a thought experiment.

Imagine you just bought a system that’s running the fastest controllers a company sells today, and you got 1PB of storage behind it.

Now, imagine that a mere month after you purchased your controllers, new ones are released that are significantly faster. OK, that stuff happens.

Your gear is not 3 years old yet. It’s 1 month old. It’s running OK.

Now, imagine that your array runs out of steam 6 months later due to unprecedented performance growth. Your system is now 7 months old. You can’t just throw it away and start fresh.

You could buy a new storage system and migrate some of the data to share the load. However, you don’t need more space – you just ran out of controller headroom. Indeed, you still have tons of free space.

But what if you could replace the controllers with the new, beefier ones? And maintain your investment in the 1PB of storage, cache etc? Wouldn’t that be nice?

Or at least be able to move some of the storage pools you may have to the new family of controllers? Even if you had to reformat the disk?

Well – most vendors will tell you “sorry, no, you need to migrate your data to the new box”.

Let’s try another thought experiment.

You bought a storage system a year ago. It performs fine, but it lacks true deduplication capabilities. You have determined it would save you a lot of storage space (= money) if your array had deduplication.

The vendor you purchased the system from announces the refreshed storage OS that finally includes deduplication. And that same vendor made a truly gigantic fuss about software-defined storage, which made everyone feel software would be the big enabler and that coolness was a mere firmware upgrade away.

However, you are eventually told they will not allow the code that enables deduplication to be loaded to your array, and, instead, they ask you to migrate to the refreshed array that supports deduplication. Since the updated code somehow only runs on the new box. Something to do with unicorn milk.

But your array has plenty of CPU headroom to handle deduplication… and you could reformat the disks given some swing storage shelves if the underlying disk format is the issue. But the option is not provided.

How NetApp does things instead

At NetApp we sort of take it for granted so we don’t make a big fuss about software-defined storage, but hardware was always considered an enabler for the software and not the other way around. 

For instance: deduplication was released as a free software upgrade for Data ONTAP (the OS for our main line of storage). Back in 2007. For all storage protocols.

In general, we try to let systems be able to load at a minimum N+1 software releases, but most of the time we utterly spoil customers and go far above and beyond, unless we’re talking about the smallest boxes, which naturally have less headroom.

For example, the now aging FAS 3070 I have in the local lab (the bigger of the older midrange boxes, released in 2006) supports anything from ONTAP 7.2.1 (what it was released with) to ONTAP 8.1.3 (released in mid-2013).

This spans multiple major ONTAP releases – huge changes in the code have happened during those releases: 7.3, 8.0, 8.1… Multiple newer arrays were also released as replacements for the 3070: 3170, 3270, 3250.

Yet the 3070 soldiers on with a fully supported, modern OS, 7 years later.

What arrays did our competitors have back then? What is the most modern OS those same arrays can run today? What is that OS missing vs the OS that competitor’s more modern arrays have?

Let’s talk disk shelves.

We used FC loop connectivity for the older shelves (DS14). We then switched to fancy multi-channel SAS and totally different shelves and disks, but never stopped supporting the older shelf technology, even with newer controllers.

That’s the big one. I have customers with DS14 shelves that they purchased for a 3070 that they now have on a 3270 running 8.2. It all works, all supported. Other vendors cut support off after the transition from FC to SAS.

Will we support those older shelves forever? No, that’s impossible, but at least we give our customers a lot of leeway and let them stretch their hardware investments significantly longer than any other major storage vendor I can think of.

Think long term

I encourage customers to always think long term. Try to stop thinking in 3-year increments. Start thinking of other ways you can stretch your investment, other ways to deploy older gear while still keeping it interoperable with newer hardware.

And start thinking about what will happen to your investment once newer gear is released.

D

Technorati Tags: , , ,

How to decipher EMC’s new VNX pre-announcement and look behind the marketing.

It was with interest that I watched some of EMC’s announcements during EMC World. Partly due to competitor awareness, and partly due to being an irrepressible nerd, hoping for something really cool.

BTW: Thanks to Mark Kulacz for assisting with the proof points. Mark, as much as it pains me to admit so, is quite possibly an even bigger nerd than I am.

So… EMC did deliver something. A demo of the possible successor to VNX (VNX2?), unavailable as of this writing (indeed, a lot of fuss was made about it being lab only etc).

One of the things they showed was increased performance vs their current top-of-the-line VNX7500.

The aim of this article is to prove that the increases are not proportionally as much as EMC claims they are, and/or they’re not so much because of software, and, moreover, that some planned obsolescence might be coming the way of the VNX for no good reason. Aside from making EMC more money, that is.

A lot of hoopla was made about software being the key driver behind all the performance increases, and how they are now able to use all CPU cores, whereas in the past they couldn’t. Software this, software that. It was the theme of the party.

OK – I’ll buy that. Multi-core enhancements are a common thing in IT-land. Parallelization is key.

So, they showed this interesting chart (hopefully they won’t mind me posting this – it was snagged from their public video):

MCX core util arrow

I added the arrows for clarification.

Notice that the chart above left shows the current VNX using, according to EMCmaybe a total of 2.5 out of the 6 cores if you stack everything up (for instance, Core 0 is maxed out, Core 1 is 50% busy, Cores 2-4 do little, Core 5 does almost nothing). This is important and we’ll come back to it. But, currently, if true, this shows extremely poor multi-core utilization. Seems like there is a dedication of processes to cores – Core 0 does RAID only, for example. Maybe a way to lower context switches?

Then they mentioned how the new box has 16 cores per controller (the current VNX7500 has 6 cores per controller).

OK, great so far.

Then they mentioned how, By The Holy Power Of Software,  they can now utilize all cores on the upcoming 16-core box equally (chart above, right).

Then, comes the interesting part. They did an IOmeter test for the new box only.

They mentioned how the current VNX 7500 would max out at 170,000 8K random reads from SSD (this in itself a nice nugget when dealing with EMC reps claiming insane VNX7500 IOPS). And that the current model’s relative lack of performance is due to the fact its software can’t take advantage of all the cores.

Then they showed the experimental box doing over 5x that I/O. Which is impressive, indeed, even though that’s hardly a realistic way to prove performance, but I accept the fact they were trying to show how much more read-only speed they could get out of extra cores, plus it’s a cooler marketing number.

Writes are a whole separate wrinkle for arrays, of course. Then there are all the other ways VNX performance goes down dramatically.

However, all this leaves us with a few big questions:

  1. If this is really all about just optimized software for the VNX, will it also be available for the VNX7500?
  2. Why not show the new software on the VNX7500 as well? After all, it would probably increase performance by over 2x, since it would now be able to use all the cores equally. Of course, that would not make for good marketing. But if with just a software upgrade a VNX7500 could go 2x faster, wouldn’t that decisively prove EMC’s “software is king” story? Why pass up the opportunity to show this?
  3. So, if, with the new software the VNX7500 could do, say, 400,000 read IOPS in that same test, the difference between new and old isn’t as dramatic as EMC claims… right? :)
  4. But, if core utilization on the VNX7500 is not as bad as EMC claims in the chart (why even bother with the extra 2 cores on a VNX7500 vs a VNX5700 if that were the case), then the new speed improvements are mostly due to just a lot of extra hardware. Which, again, goes against the “software” theme!
  5. Why do EMC customers also need XtremeIO if the new VNX is that fast? What about VMAX? :)

Point #4 above is important. For instance, EMC has been touting multi-core enhancements for years now. The current VNX FLARE release has 50% better core efficiency than the one before, supposedly. And, before that, in 2008, multi-core was advertised as getting 2x the performance vs the software before that. However, the chart above shows extremely poor core efficiency. So which is it? 

Or is it maybe that the box demonstrated is getting most of its speed increase not so much by the magic of better software, but mostly by vastly faster hardware – the fastest Intel CPUs (more clockspeed, not just more cores, plus more efficient instruction processing), latest chipset, faster memory, faster SSDs, faster buses, etc etc. A potential 3-5x faster box by hardware alone.

It doesn’t quite add up as being a software “win” here.

However – I (or at least current VNX customers) probably care more about #1. Since it’s all about the software after all:)

If the new software helps so much, will they make it available for the existing VNX? Seems like any of the current boxes would benefit since many of their cores are doing nothing according to EMC. A free performance upgrade!

However… If they don’t make it available, then the only rational explanation is that they want to force people into the new hardware – yet another forklift upgrade (CX->VNX->”new box”).

Or maybe that there’s some very specific hardware that makes the new performance levels possible. Which, as mentioned before, kinda destroys the “software magic” story.

If it’s all about “Software Defined Storage”, why is the software so locked to the hardware?

All I know is that I have an ancient NetApp FAS3070 in the lab. The box was released ages ago (2006 vintage), and yet it’s running the most current GA ONTAP code. That’s going back 3-4 generations of boxes, and it launched with software that was very, very different to what’s available today. Sometimes I think we spoil our customers.

Can a CX3-80 (the beefiest of the CX3 line, similar vintage to the NetApp FAS3070) take the latest code shown at EMC World? Can it even take the code currently GA for VNX? Can it even take the code available for CX4? Can a CX4-960 (again, the beefiest CX4 model) take the latest code for the shipping VNX? I could keep going. But all this paints a rather depressing picture of being able to stretch EMC hardware investments.

But dealing with hardware obsolescence is a very cool story for another day.

D

 

Technorati Tags: , , , , , , ,

More EMC VNX caveats

Lately, when competing with VNX, I see EMC using several points to prove they’re superior (or at least not deficient).

I’d already written this article a while back, and today I want to explore a few aspects in more depth since my BS pain threshold is getting pretty low. The topics discussed:

  1. VNX space efficiency
  2. LUNs can be served by either controller for “load balancing”
  3. Claims that autotiering helps most workloads
  4. Claims that storage pools are easier
  5. Thin provisioning performance (this one’s interesting)
  6. The new VNX snapshots

References to actual EMC documentation will be used. Otherwise I’d also be no better than a marketing droid.

VNX space efficiency

EMC likes claiming they don’t suffer from the 10% space “tax” NetApp has. Yet, linked here are the best practices showing that in an autotiered pool, at least 10% free space should be available per tier in order for autotiering to be able to do its thing (makes sense).

Then there’s also a 3GB minimum overhead per LUN, plus metadata overhead, calculated with a formula in the linked article. Plus possibly more metadata overhead if they manage to put dedupe in the code.

My point is: There’s no free lunch. If you want certain pool-related features, there is a price to pay. Otherwise, keep using the old-fashioned RAID groups that don’t offer any of the new features but at least offer predictable performance and capacity utilization.

LUNs can be served by either controller for “load balancing”

This is a fun one. The claim is that LUN ownership can be instantly switched over from one VNX controller to another in order to load balance their utilization. Well – as always, it depends. It’s also important to note that VNX as of this writing does not do any sort of automatic load balancing of LUN ownership based on load.

  1. If it’s using old-fashioned RAID LUNs: Transferring LUN ownership is indeed doable with no issues. It’s been like that forever.
  2. If the LUN is in a pool – different story. There’s no quick way to shift LUN ownership to another controller without significant performance loss.

There’s copious information here. Long story short: You don’t change LUN ownership with pools, but rather need to do a migration of the LUN contents to the other controller (to another LUN, you can’t just move the LUN as-is – this also creates issues), otherwise there will be a performance tax to pay.

Claims that autotiering helps most workloads

Not so FAST. EMC’s own best practice guides are rife with caveats and cautions regarding autotiering. Yet this feature is used as a gigantic differentiator at every sales campaign.

For example, in the very thorough “EMC Scaling performance for Oracle Virtual Machine“, the following graph is shown on page 35:

NewImage

The arrows were added by me. Notice that most of the performance benefit is provided once cache is appropriately sized. Adding an extra 5 SSDs for VNX tiering provides almost no extra benefit for this database workload.

One wonders how fast it would go if an extra 4 SSDs were added for even more cache instead of going to the tier… :)

Perchance the all-cache line with 8 SSDs would be faster than 4 cache SSDs and 5 tier SSDs, but that would make for some pretty poor autotiering marketing.

Claims that storage pools are easier

The typical VNX pitch to a customer is: Use a single, easy, happy, autotiered pool. Despite what marketing slicks show, unfortunately, complexity is not really reduced with VNX pools – simply because single pools are not recommended for all workloads. Consider this typical VNX deployment scenario, modeled after best practice documents:

  1. RecoverPoint journal LUNs in a separate RAID10 RAID group
  2. SQL log LUNs in a separate RAID10 RAID group
  3. Exchange 2010 log LUNs in a separate RAID10 RAID group
  4. Exchange 2010 data LUNs can be in a pool as long as it has a homogeneous disk type, otherwise use multiple RAID groups
  5. SQL data can be in an autotiered pool
  6. VMs might have to go in a separate pool or maybe share the SQL pool
  7. VDI linked clone repository would probably use SSDs in a separate RAID10 RAID group

OK, great. I understand that all the I/O separation above can be beneficial. However, the selling points of pooling and autotiering are that they should reduce complexity, reduce overall cost, improve performance and improve space efficiency. Clearly, that’s not the case at all in real life. What is the reason all the above can’t be in a single pool, maybe two, and have some sort of array QoS to ensure prioritization?

And what happens to your space efficiency if you over-allocate disks to the old-fashioned RAID groups above? How do you get the space back?

What if you under-allocated? How easy would it be to add a bit more space or performance? (not 2-3x – let’s say you need just 20% more). Can you expand an old-fashioned VNX RAID group by a couple of disks?

And what’s the overall space efficiency now that this kind of elaborate split is necessary? Hmm… ;)

For more detail, check these Exchange and SQL design documents.

Thin provisioning performance

This is just great.

VNX thin provisioning performs very poorly relative to thick and even more poorly relative to standard RAID groups. The performance issue makes complete sense due to how space is allocated when writing thin on a VNX, with 8KB blocks assigned as space is being used. A nice explanation of how pool space is allocated is here. A VNX writes to pools using 1GB slices. Thick LUNs pre-allocate as many 1GB slices as necessary, which keeps performance acceptable. Thin LUNs obviously don’t pre-allocate space and currently have no way to optimize writes or reads – the result is fragmentation, in addition to the higher CPU, disk and memory overhead to maintain thin LUNs :)

From the Exchange 2010 design document again, page 23:

NewImage

Again, I added the arrows to point out a couple of important things:

  1. Thin provisioning is not recommended for high performance workloads on VNX
  2. Indeed, it’s so slow that you should run your thin pools with RAID10!!!

But wait – thin provisioning is supposed to help me save space, and now I have to run it with RAID10, which chews up more space?

Kind of an oxymoron.

And what if the customer wants the superior reliability of RAID6 for the whole pool? How fast is thin provisioning then?

Oh, and the VNX has no way to fix the fragmentation that’s rampant in its thin LUNs. Short of a migration to another LUN (kind of a theme it seems).

The new VNX snapshots

The VNX has a way to somewhat lower the traditionally extreme impact of FLARE snapshots by switching from COFW (Copy On First Write) to ROFW (Redirect On First Write).

The problem?

The new VNX snapshots need a pool, and need thin LUNs. It makes sense from an engineering standpoint, but…

Those are exactly the 2 VNX features that lower performance.

There are many other issues with the new VNX snapshots, but that’s a story for another day. It’s no wonder EMC pushes RecoverPoint far more than their snaps…

The takeaway

There’s marketing, and then there’s engineering reality.

Since the VNX is able to run both pools and old-fashioned RAID groups, marketing wisely chooses to not be very specific about what works with what.

The reality though is that all the advanced features only work with pools. But those come with significant caveats.

If you’re looking at a VNX – at least make sure you figure out whether the marketed features will be usable for your workload. Ask for a full LUN layout.

And we didn’t even talk about having uniform RAID6 protection in pools, which is yet another story for another day.

D

Technorati Tags: , , , , , , , , ,

Are SSDs reliable enough? The importance of extensive testing under adverse conditions.

Recently, interesting research (see here) from researchers at Ohio State was presented at USENIX. 

To summarize, they tested 15 SSDs, several of them “Enterprise” grade, and subjected them to various power fault conditions. 

Almost all the drives suffered data loss that should not have occurred, and some were so corrupt as to be rendered utterly unusable (could not even be seen on the bus). It’s worth noting that spinning drives used in enterprise arrays would not have suffered the same way.

It’s not just an issue of whether or not the SSD has some supercapacitors in order to de-stage the built-in RAM contents to flash – a certain very prominent SSD vendor was hit with this issue even though the SSDs in question had the supercapacitors, generous overprovisioning and internal RAID. A firmware issue is suspected and this is not fixed yet.

You might ask, why am I mentioning this?

Several storage systems try to lower SSD costs by using cheap SSDs (often consumer models found in laptops, not even eMLC) and then try to get more longevity out said SSDs by using clever write techniques in order to minimize the amount of data written (dedupe, compression) as well as make the most of wear-leveling the flash chips in the box by also writing in flash-friendly ways (more appends, less overwrites, moving data around as needed, and more).

However, all those (perfectly valid) techniques have a razor-sharp focus on the fact that cheaper flash has a very limited number of write/erase cycles, but are utterly unrelated to things like massive corruption stemming from weird power failures or firmware bugs (and, after having lived through multiple UPS and generator failures, I don’t accept those as a complete answer, either).

On the other hand, the Tier 1 storage vendors typically do pretty extensive component testing, including various power failure scenarios, from the normal to the very strange. The system has to withstand those, then come up no matter what. Edge cases are tested as a matter of course – a main reason people buy enterprise storage is how edge cases are handled… :)

At NetApp, when we certify SSDs, they go through an extra-rigorous process since we are paranoid and they are still a relatively new technology. We also offer our standard dual-parity RAID, along with multiple ways to safeguard against lost writes, for all media. The last thing one needs is multiple drives failing due to a strange power failure or a firmware bug.

Protection against failures is even more important in storage systems that lack the extra integrity checks NetApp offers. Those non-NetApp systems that use SSDs either as their only storage or as part of a pool can suffer catastrophic failures if the integrity of the SSDs is compromised sufficiently since, by definition, if part of the pool fails, then the whole pool fails, which could mean the entire storage system may have to be restored from backup.

For those systems where cheap SSDs are merely used as an acceleration mechanism, catastrophic performance failures are a very real potential outcome. 1000 VDI users calling the helpdesk is not my idea of fun.

Such component behavior is clearly unacceptable.

Proper testing comes with intelligence, talent, but also experience and extensive battle scarring. Back when NetApp was young, we didn’t know the things we know today, and couldn’t handle some of the fault conditions we can handle today. Test harnesses in most Tier 1 vendors become more comprehensive as new problems are discovered, and sometimes the only way to discover the really weird problems is through sheer numbers (selling many millions of units of a certain component provides some pretty solid statistics regarding its reliability and failure modes).

“With age comes wisdom”.

 

D

 

Technorati Tags: , , , ,

Beware of benchmarking storage that does inline compression

In this post I will examine the effects of benchmarking highly compressible data and why that’s potentially a bad idea.

Compression is not a new storage feature. Of the large storage vendors, at a minimum NetApp, EMC and IBM can do it (depending on the array). <EDIT (thanks to Matt Davis for reminding me): Some arrays also do zero detection and will not write zeroes to disk – think of it as a specialized form of compression that ONLY works on zeroes>

A lot of the newer storage vendors are now touting real-time compression for all data (often used instead of true deduplication – it’s just easier to implement compression).

Nothing wrong with real-time compression. However, and here’s where I have a problem with some of the sales approaches some vendors follow:

Real-time compression can provide grossly unrealistic benchmark results if the benchmarks used are highly compressible!

Compression can indeed provide a performance benefit for various data types (simply since less data has to be read and written from disk), with the tradeoff being CPU. However, most normal data isn’t composed of all zeroes. Typically, compressing data will provide a decent benefit on average, but usually not several times.

So, what will typically happen is, a vendor will drop off one of their storage appliances and provide the prospect with some instructions on how to benchmark it with your garden variety benchmark apps. Nothing crazy.

Here’s the benchmark problem

A lot of the popular benchmarks just write zeroes. Which of course are extremely easy for compression and zero-detect algorithms to deal with and get amazing efficiency out of, resulting in extremely high benchmark performance.

I wanted to prove this out in an easy way that anyone can replicate with free tools. So I installed Fedora 18 with the btrfs filesystem and ran the bonnie++ benchmark with and without compression. The raw data with mount options etc. is here. An explanation of the various fields here. Not everything is accelerated by btrfs compression in the bonnie++ benchmark, but a few things really are (sequential writes, rewrites and reads):

Bonniebtrfs

Notice the gigantic improvement (in write throughput especially) btrfs compression affords with all-zero data.

Now, does anyone think that, in general, the write throughput will be 300MB/s for a decrepit 5400 RPM SATA disk?  That will be impossible unless the user is constantly writing all-zero data, at which point the bottlenecks lie elsewhere.

Some easy ways for dealing with the compressible benchmark issue

So what can you do in order to ensure you get a more realistic test for your data? Here are some ideas:

  • Always best is to use your own applications and not benchmarks. This is of course more time-consuming and a bigger commitment. If you cant do that, then…
  • Create your own test data using, for example, dd and /dev/random as a source in some sort of Unix/Linux variant. Some instructions here. You can even move that data to use with Windows and IOmeter – just generate the random test data in UNIX-land and move the file(s) to Windows.
  • Another, far more realistic way: Use your own data. In IOmeter, you just copy one of your large DB files to iobw.tst and IOmeter will use your own data to test… Just make sure it’s large enough and doesn’t all fit in array cache. If not large enough, you could probably make it large enough by concatenating multiple data files and random data together.
  • Use a tool that generates incompressible data automatically, like the AS-SSD benchmark (though it doesn’t look like it can deal with multi-TB stuff – but worth a try).
  • vdbench seems to be a very solid benchmark tool – with tunable compression and dedupe settings.

 

In all cases though, be aware of how you are testing. There is no magic :)

D

 

Technorati Tags: , ,