NetApp posts world-record SPEC SFS2008 NFS benchmark result

Just as NetApp dominated the older version of the SPEC SFS97_R1 NFS benchmark back in May of 2006 (and was unsurpassed in that benchmark with 1 million SFS operations per second), the time has come to once again dominate the current version, SPEC SFS2008 NFS.

Recently we have been focusing on benchmarking realistic configurations that people might actually put in their datacenters, instead of lab queens with unusable configs focused on achieving the highest result regardless of cost.

However, it seems the press doesn’t care about realistic configs (or to even understand the configs) but instead likes headline-grabbing big numbers.

So we decided to go for the best of both worlds – a headline-grabbing “big number” but also a config that would make more financial sense than the utterly crazy setups being submitted by competitors.

Without further ado, NetApp achieved over 1.5 million SPEC SFS2008 NFS operations per second with a 24-node cluster based on FAS6240 boxes running ONTAP 8 in Cluster Mode. Click here for the specific result. There are other results in the page showing different size clusters so you can get some idea of the scaling possible.

See below table for a high-level analysis (including the list pricing I could find for these specific performance-optimized configs for whatever that’s worth). The comparison is between NetApp and the nearest scale-out competitor result (one of many EMC’s recent acquisitions –  Isilon, the niche, dedicated NAS box – nothing else is close enough to bother including in the comparison).

BTW – the EMC price list is publicly available from here (and other places I’m sure): http://www.emc.com/collateral/emcwsca/master-price-list.pdf

From page 422:

S200-6.9TB & 200GB SSD, 48GB RAMS200-6.9TB & 200GB SSD, 48GB RAM, 2x10GE SFP+ & 2x1G $84,061. Times 140…

Before we dive into the comparison, an important note:

Lest we be accused of tuning this config or manually making sure client accesses were load-balanced and going to the optimal nodes, please understand this:  23 out of 24 client accesses were not going to the nodes owning the data and were instead happening over the cluster interconnect (which, for any scale-out architecture, is worst-case-scenario performance). Look under the “Uniform Access Rules Compliance” in the full disclosure details of the result in the SPEC website here.

EMC NetApp Difference
Cost (approx. USD List) 11,800,000 6,280,000 NetApp is almost half the cost while offering much higher performance
SPEC SFS2008 NFS operations per second 1,112,705 1,512,784 NetApp is over 35% faster, while using potentially better RAID protection
Average Latency (ORT) 2.54 1.53 NetApp offers almost 40% better average latency without using costly SSDs, and is usable for challenging random workloads like DBs, VMs etc.
Space (TB) 864 (out of which 128889GB was used in the test) 574 (out of which 176176GB was used in the test) Isilon offers about 50% more usable space (coming from a lot more drives, 28% more raw space and potentially less RAID protection – N+2 results from Isilon would be different)
$/SPEC SFS2008 NFS operation 10.6 4.15 Netapp is less than half the cost per SPEC SFS2008 NFS operation
$/TB 13,657 10,940 NetApp is about 20% less expensive than EMC per usable TB
RAID Per-file protection. Files < 128K are at least mirrored. Files over 128K are at a 13+1 level protection in this specific test. RAID-DP Ask EMC what 13+1 protection means in an Isilon cluster (I believe 1 node can be completely gone but what about simultaneous drive failures that contain the same protected file?)

NetApp RAID-DP is mathematically analogous to RAID6 and has a parity drive penalty of 2 drives every 16-20 drives.

Boxes needed to accomplish result 140 nodes, 3,360 drives (incl. 25TB of SSDs for cache), 1,120 CPU cores, 6.7TB RAM. 24 unified controllers, 1,728 drives, 12.2TB Flash Cache, 192 CPU cores, 1.2TB RAM. NetApp is far more powerful per node, and achieves higher performance with a lotless drives, CPUs, RAM and cache. 

In addition, NetApp can be used for all protocols (FC, iSCSI, NFS, CIFS) and all connectivity methods (FC 4/8Gb, Ethernet 1/10Gb, FCoE).

 

Notice the response time charts:

IsilonVs6240response

NetApp exhibits traditional storage system behavior – latency is very low initially and gradually gets higher the more the box is pushed, as one would expect. Isilon on the other hand starts out slow and gets faster as more metadata gets cached, until the controllers run out of steam (SPEC SFS is very heavy in NAS metadata ops, and should not be compared to heavy-duty block benchmarks like SPC-1).

This is one of the reasons an Isilon cluster is not really applicable for low-latency DB-type apps, or low-latency VMs. It is a great architecture designed to provide high sequential speeds for large files over NAS protocols, and is not a general-purpose storage system. Kudos to the Isilon guys for even getting the great SPEC result in the first place, given that this isn’t what the box is designed to do (the extreme Isilon configuration needed to run the benchmark is testament to that). The better application for Isilon would be capacity-optimized configs (which is what the system is designed for to begin with).

 

Some important points:

  1. First and foremost, the cluster-mode ONTAP architecture now supports all protocols, it is the only unified scale-out architecture available. Any competitors playing in that space only have NAS or SAN offerings but not both in a single architecture.
  2. We didn’t even test with the even faster 6280 box and extra cache (that one can take 8TB cache per node). The result is not the fastest a NetApp cluster can go :) With 6280s it would be a healthy percentage faster, but we had a bunch of the 6240s in the lab so it was easier to test them, plus they’re a more common and less expensive box, making for a more realistic result.
  3. ONTAP in cluster-mode is a general-purpose storage OS, and can be used to run Exchange, SQL, Oracle, DB2, VMs, etc. etc. Most other scale-out architectures are simply not suitable for low-latency workloads like DBs and VMs and are instead geared towards high NAS throughput for large files (IBRIX, SONAS, Isilon to name a few – all great at what they do best).
  4. ONTAP in cluster mode is, indeed, a single scale-out cluster and administered as such. It should not be compared to block boxes with NAS gateways in front of them like VNX, HDS + Bluearc, etc.
  5. In ONTAP cluster mode, workloads and virtual interfaces can move around the cluster non-disruptively, regardless of protocol (FC, iSCSI, NFS and yes, even CIFS can move around non-disruptively assuming you have clients that can talk SMB 2.1 and above).
  6. In ONTAP cluster mode, any data can be accessed from any node in the cluster – again, impossible with non-unified gateway solutions like VNX that have individual NAS servers in front of block storage, with zero awareness between the NAS heads aside from failover.
  7. ONTAP cluster mode can allow certain cool things like upgrading storage controllers from one model to another completely non-disruptively, most other storage systems need some kind of outage to do this. All we do is add the new boxes to the existing cluster :)
  8. ONTAP cluster mode supports all the traditional NetApp storage efficiency and protection features: RAID-DP, replication, deduplication, compression, snaps, clones, thin provisioning. Again, the goal is to provide a scale-out general-purpose storage system, not a niche box for only a specific market segment. It even supports virtualizing your existing storage.
  9. There was a single namespace for the NFS data. Granted, not the same architecture as a single filesystem from some competitors.
  10. Last but not least – no “special” NetApp boxes are needed to run Cluster Mode. In contrast to other vendors selling a completely separate scale-out architecture (different hardware and software and management), normal NetApp systems can enter a scale-out cluster as long as they have enough connectivity for the cluster network and can run ONTAP 8. This ensures investment protection for the customer plus it’s easier for NetApp since we don’t have umpteen hardware and software architectures to develop for and support :)
  11. Since people have been asking: The SFS benchmark generates about 120MB per operation. The slower you go, the less space you will use on the disks, regardless of how many disks you have. This creates some imbalance in large configs (for example, only about 128TB of the 864TB available was used on Isilon).

Just remember – in order to do what ONTAP in Cluster Mode does, how many different architectures would other vendors be proposing?

  • Scale-out SAN
  • Scale-out NAS
  • Replication appliances
  • Dedupe appliances
  • All kinds of management software

How many people would it take to keep it all running? And patched? And how many firmware inter-dependencies would there be?

And what if you didn’t need, say, scale-out SAN to begin with, but some time after buying traditional SAN realized you needed scale-out? Would your current storage vendor tell you you needed, in addition to your existing SAN platform, that other one that can do scale-out? That’s completely different than the one you bought? And that you can’t re-use any of your existing stuff as part of the scale-out box, regardless of how high-end your existing SAN is?

How would that make you feel?

Always plan for the future…

Comments welcome.

D

PS: Made some small edits in the RAID parts and also added the official EMC pricelist link.

 

Technorati Tags: , , , , , , , , ,

 

Interpreting $/IOPS and IOPS/RAID correctly with SPC-1 results

Been a while since an update to this blog, my provider got hacked and people were getting a redirect to some questionable site (I still can’t show images, if someone knows a fix please let me know).

Should be better now, I just wish the people doing the hacking would devote their considerable skills to helping out humanity…

Anyway, there are some impressive new scores at storageperformance.org, with the usual crazy configurations of thousands of drives etc.

Regarding price/performance:

When looking at $/IOP, make sure you are comparing list price (look at the full disclosure report, that has all the details for each config).

Otherwise, you could get the wrong $/IOP since some vendors have list prices, others show heavy discounting.

For example, a box that does $6.5/IOP after 50% discounting, would be $13/IOP using list prices.

Regarding RAID:

As I have mentioned in other posts, RAID plays a big role in both protection and performance.

All SPC-1 results are using RAID10, with the notable exception of NetApp (we use RAID-DP, mathematically analogous to RAID6 in protection).

Here’s a (very) rough way to convert a RAID10 result to RAID6, if the vendor you’re looking for doesn’t have a RAID6 result:

  1. SPC-1 is about 60% writes.
  2. Take any RAID10 result, let’s say 200,000 IOPS.
  3. 60% of that is 120,000, that’s the write ops. 40% is the reads, or 80,000 read ops.
  4. If using RAID6, you’d be looking at roughly a 4x slowdown for the writes: 120,000/4 = 30,000
  5. Add that to the 40% of the reads and you get the final result:
  6. 80,000 reads + 30,000 writes = 110,000 RAID6-corrected SPC-1 IOPS. Which is almost half the RAID10 result… :)

Just make sure you’re comparing apples to apples, that’s all. I know we all suffer from ADD in this age of information overload, but do spend some time going through the full disclosure, since there’s always interesting stuff in there…

D

 

 

Buyer beware: is your storage vendor sizing properly for performance, or are they under-sizing technologies like Megacaching and Autotiering?

With the advent of performance-altering technologies (notice the word choice), storage sizing is just not what it used to be.

I’m writing this post because more and more I see some vendors not using scientific methods to size their solution, instead aiming to reach a price point, hoping the technology will work to achieve the requisite performance (and if it doesn’t, it’s sold anyway, either they can give some free gear to make the problem go away, or the customer can always buy more, right?)

Back in the “good old days”, with legacy arrays one could (and still can) get fairly deterministic performance by knowing the workload required and, given a RAID type, know roughly how many disks would be needed to maintain the required performance in a sustained fashion, as long as the controller and buses were not overloaded.

With modern systems, there is now a plethora of options that can be used to get more performance out of the array, or, alternatively, get the same average performance as before, using less hardware (hopefully for less money).

If anything, advanced technologies have made array sizing more complex than before.

For instance, Megacaches can be used to dramatically change the I/O reaching the back-end disks of the array. NetApp FAS systems can have up to 16TB of deduplication-aware, ultra-granular (4K) and intelligent read cache. Truly a gigantic size, bigger than the vast majority of storage users will ever need (and bigger than many customers’ entire storage systems). One could argue that with such an enormous amount of cache, one could dispense with most disk drives and instead save money by using SATA (indeed, several customers are doing exactly that). Other vendors are following NetApp’s lead and starting to implement similar technologies — simply because it makes a lot of sense.

However…

It is crucial that, when relying on caching, extra care is taken to size the solution properly, if a reduction in the number and speed of the back-end disks is desired.

You see, caches only work well if they can cache the majority of what’s called the active working set.

Simply put, the working set is not all your data, but the subset of the data you’re “touching” constantly over a period of time. For a customer that has, say, a 20TB Database, the true working set may only be something as small as 5% — enabling most of the active data to fit in 1TB of cache. So, during daily use, a 1TB cache could satisfy most of the I/O requirements of the DB. The back-end disks could comfortably be just enough SATA to fit the DB.

But what about the times when I/O is not what’s normally expected? Say, during a re-indexing, or a big DB export, or maybe month-end batch processing. Such operations could vastly change the working set and temporarily raise it from 5% to something far larger — at which point, a 1TB cache and a handful of back-end SATA may not be enough.

Which is why, when sizing, multiple measurements need to be taken, and not just average or even worst-case.

Let’s use a database as an example again (simply because the I/O can change so dramatically with DBs).You could easily have the following I/O types:

  1. Normal use – 20,000 IOPS, all random, 8K I/O size, 80% reads
  2. DB exports — high MB/s, mostly sequential write,large I/O size, relatively few IOPS
  3. Sequential read after random write — maybe data is added to the DB randomly, then a big sequential read (or maybe many parallel ones) are launched.

You see, the I/O profile can change dramatically. If you only size for case #1, you may not have enough back-end disk to sustain the DB exports or the parallel sequential table scans. If you size for case 2, you may think you don’t need much cache since the I/O is mostly sequential (and most caches are bypassed for sequential I/O). But that would be totally wrong during normal operation.

If your storage vendor has told you they sized for what generates the most I/O, then the question is, what kind of I/O was it?

The other new trendy technology (and the most likely to be under-sized) is Autotiering.

Autotiering, simply put, allows moving chunks of data around the array depending on their “heat index”. Chunks that are very active may end up on SSD, whereas chunks that are dormant could safely stay on SATA.

Different arrays do different kinds of Autotiering, mostly based on various underlying architectural characteristics and limitations. For example, on an EMC Symmetrix the chunk size is about 7.5MB. On an HDS VSP, the chunk is about 40MB. On an IBM DS8000, SVC or EMC Clariion/VNX, it’s 1GB.

With Autotiering, just like with caching, the smaller the chunk size, the more efficient the end result will ultimately be. For instance, a 7.5MB chunk could need as little as 3-5%% of ultra-fast disk as a tier, whereas a 1GB chunk may need as much as 10-15%, due to the larger size chunk containing not very active data mixed together with the active data.

Since most arrays write data with a geometric locality of reference (in contrast, NetApp uses geometric and temporal), with large-chunk autotiering you end up with pieces of data that are “hot” that always occupy the same chunk as neighboring “cool” pieces of data. This explains why the smaller the chunk, the better off you are.

So, with a large chunk, this can happen:

Slide1

The array will try to cache as much as it can, then migrate chunks if they are consistently busy or not. But the whole chunk has to move, not just the active bits within the chunk… which may be just fine, as long as you have enough of everything.

So what can you do to ensure correct sizing?

There are a few things you can do to make sure you get accurate sizing with modern technologies.

  1. Provide performance statistics to vendors — the more detailed the better. If we don’t know what’s going on, it’s hard to provide an engineered solution.
  2. Provide performance expectations — i.e. “I want Oracle queries to finish in 1/4th the time compared to what I have now” — and tie those expectations to business benefits (makes it easier to justify).
  3. Ask vendors to show you their sizing tools and explain the math behind the sizing — there is no magic!
  4. Ask vendors if they are sizing for all the workloads you have at the moment (not just different apps but different workloads within each app) — and how.
  5. Ask them to show you what your working set is and how much of it will fit in the cache.
  6. Ask them to show you how your data would be laid out in an Autotiered environment and what bits of it would end up on what tier. How is that being calculated? Is the geometry of the layout taken into consideration?
  7. Do you have enough capacity for each tier? On Autotiering architectures with large chunks, do you have 10-15% of total storage being SSD?
  8. Have the controller RAM and CPU overheads due to caching and autotiering been taken into account? Such technologies do need extra CPU and RAM to work. Ask to see the overhead (the smaller the Autotiering chunk size, the more metadata overhead, for example). Nothing is free.
  9. Beware of sizings done verbally or on cocktail napkins, calculators, or even spreadsheets – I’ve yet to see a spreadsheet model storage performance accurately.
  10. Beware of sizings of the type “a 15K disk can do 180 IOPS” — it’s a lot more complicated than that!
  11. Understand the difference between sequential, random, reads, writes and I/O size for each proposed architecture — the differences in how I/O is done depending on the platform are staggering and can result in vastly different disk requirements — making apples-to-apples comparisons challenging.
  12. Understand the extra I/O and capacity impact of certain CDP/Replication devices — it can be as much as 3x, and needs to be factored in.
  13. What RAID type is each vendor using? That can have a gigantic performance impact on write-intensive workloads (in addition to the reliability aspect).
  14. If you are getting unbelievably low pricing — ask for a contract ensuring upgrade pricing will be along the same lines. “The first hit is free” is true in more than one line of business.
  15. And, last but by no means least — ask how busy the proposed solution will be given the expected workload! It surprises me that people will try to sell a box that can do the workload but will be 90% busy doing so. Are you OK with that kind of headroom? Remember – disk arrays are just computers running specialized software and hardware, and as such their CPU can run out of steam just like anything else.

If this all seems hard — it’s because it is. But see it as due diligence — you owe it to your company, plus you probably don’t want to be saddled with an improperly-sized box for the next 3-5 years, just because the offer was too good to refuse…

D

 

Technorati Tags: , , , , , , , , ,

NetApp vs EMC usability report: malice, stupidity or both?

Most are familiar with Hanlon’s Razor:

Never attribute to malice that which is adequately explained by stupidity.

A variation of that is:

Never attribute to malice that which is adequately explained by stupidity, but don’t rule out malice.

You see, EMC sponsored a study comparing their systems to ones from the company they look up to and try to emulate. The report deals with ease-of-use (and I’ll be the first to admit the current iteration of EMC boxes is far easier to use than in the past and the GUI has some cool stuff in it). I was intrigued, but after reading the official-looking report posted by Chuck Hollis, I wondered who in their right mind will lend it credence, and ignored it since I have a real day job solving actual customer problems and can’t possibly respond to every piece of FUD I see (and I see a lot).

Today I’m sitting in a rather boring meeting so I thought I’d spend a few minutes to show how misguided the document is.

In essence, the document tackles the age-old dilemma of which race car to get by comparing how easy it is to change the oil, and completely ignores the “winning the race with said car” part. My question would be: “which car allows you to win the race more easily and with the least headaches, least cost and least effort?”

And if you think winning a “race” is just about performance, think again.

It is also interesting how the important aspects of efficiency, reliability and performance are not tackled, but I guess this is a “usability” report…

Strange that a company named “Strategic Focus” reduces itself to comparing arrays by measuring the number of mouse clicks. Not sure how this is strategic for customers. They were commissioned by EMC, so maybe EMC considers this strategic.

I’ll show how wrong the document is by pointing at just some of the more glaring issues, but I’ll start by saying a large multinational company has many PB of NetApp boxes around the globe and 3 relaxed guys to manage it all. How’s that for a real example? :)

  1. Page 2, section 4, “Methodology”: EMC’s own engineers set up the VNX properly. No mention of who did the NetApp testing, what their qualifications are, and so on. So, first question: “Do these people even know what they’re doing? Have they really used a NetApp system before?”
  2. Page 10, Table A, showing the configurations. A NetApp FAS3070 was used, running the latest code at this moment (8.01). Thanks EMC for the unintended compliment – you see, that system is 2 generations old (the current one is 3270 and the previous one is 3170) yet it can still run the very latest 64-bit ONTAP code just fine. What about the EMC CX3? Can it run FLARE31? Or is that a forklift upgrade? Something to be said for investment protection :)
  3. Page 3 table 5-1, #1. Storage pools on all modern arrays would typically be created upfront, so the wording is very misleading. In order to create a new LUN one does NOT NEED to create a pool. Same goes for all vendors.
  4. Same table and section (also mentioned in section 7): Figuring out the space available is as simple as going to the aggregate page, where the space is clearly shown for the aggregates. So, unsure what is meant here.
  5. Regarding LUN creation… Let me ask you a question: After you create a LUN on any array, what do you need to do next? You see, the goal is to attach the LUN to a host, do alignment, partition creation, multipathing and create a filesystem and write stuff to it. You know, use it. NetApp largely automates end-to-end creation of host filesystems and, indeed, does not need an administrator to create a LUN on the array at all. Clearly the person doing the testing is either not aware of this or decided to omit this fact.
  6. Page 4, item 4 (thin provisioning). Asinine statement – plus, any NetApp LUN can be made thick or thin with a single click, whereas a VNX needs to do a migration. Indeed, NetApp does not complicate things whether thin or thick is required, does not differentiate between thin and thick when writing, and therefore does not incur a performance penalty, whereas EMC does (according to EMC documentation).
  7. Page 4, item 5 (Creation of virtual CIFS servers). The Multistore feature is free of charge on all new systems and allows one to create fully segregated, secure multitenancy virtual CIFS, NFS and iSCSI NetApp “partitions” – far beyond the capabilities of EMC. Again, misleading.
  8. Page 4, item 6 (growing storage elements). No measurable difference? Kindly show all the steps to grow a LUN until the new space is visible from the host side. End-to-end is important to real users since they want to use the storage. Or maybe not, for the authors of this document.
  9. Page 5, Item 1. We are really talking here about EMC snapshots? Seriously? Versus NetApp? To earn the right to do so assumes your snapshots are a usable and decent feature and that you can take a good number of them without the box crumbling to pieces. Ask any vendor about a production array with the most snaps and ask to talk to the customer using it. Then compare the number of snaps to a typical NetApp customer’s. Don’t be surprised if one number is a few hundred times less than the other.
  10. Page 5, item 3 (storage tiering): part of a longer conversation but this assumes all arrays need to do tiering. If my solution is optimized to the level that it doesn’t need to do this but yours is not optimized so it needs tiering, why on earth am I being penalized for doing storage more efficiently than you? (AKA the “not invented here” syndrome).
  11. Page 6 item 1 (VMware awareness): NetApp puts all the awareness inside vCenter and, indeed, datastore creation (including volume/LUN/NAS creation and resizing), VM cloning etc. all from within vCenter itself. Ask for a demo and prepare to be amazed.
  12. Page 6, item 2 – (dedupe/compress individual VMs): This one had my blood boiling. You see, EMC cannot even dedupe individual VMs, (impossible, given the fact that current DART code only does dedupe at the file and not block level and no two active VMs will ever be exactly the same), can’t dedupe at all for block storage (maybe in the future but not today), and in general doesn’t recommend compression for VMs! Ask to see the best practices guide that states all this is supported and recommended for active production VMs, and to talk to a customer doing it at scale (not 10 VMs). A feature you can theoretically turn on but that will never work is not quite useful, you see…
  13. Page 8, entire table: Too much to comment on, suffice it to say that NetApp systems come with tools not mentioned in this report that go so far beyond what Unisphere does that it’s not even funny (at no additional cost). Used by customers that have thousands of NetApp systems. That’s how much those tools scale. EMC would need vast portions of the Ionix suite to do anything remotely similar (at $$$). Of course, mentioning that would kinda derail this document… and the piece about support and upgrades is utterly wrong, but I like to keep the surprise for when I do the demos and not share cool IP ideas here :)
  14. Page 11, Table B1: In the end, the funniest one of all! If you add up the total number of mouse clicks, NetApp needed 92 vs EMC’s 111. Since the whole point of this usability report is to show overall ease of use by measuring the total number of clicks to do stuff, it’s interesting that they didn’t do a simple total to show who won in the end… :)

I could keep going but I need to pay attention to my meeting now since it suddenly became interesting.

Ultimately, when it comes to ease of use, it’s simple to just do a demo and have the customer decide for themselves which approach they like best. Documents such as this one mean less than nothing for actual end users.

I should have another similar list showing clicks and TIME needed to do certain other things. For instance, using RecoverPoint (or any other method), kindly show the number of clicks and time (and disk space) for creating 30 writable clones of a 10TB SQL DB and mounting them on 30 different DB servers simultaneously. Maintaining unique instance names etc. Kinda goes a bit beyond LUN creation, doesn’t it? :)

All this BTW doesn’t mean any vendor should rest on their laurels and stop working on improving usability. It’s a never-ending quest. Just stop it with the FUD, please…

Finishing with something funny: Check this video for a good demonstration of something needing few clicks yet not being that easy to do.

Comments welcome.

D

Technorati Tags: , , , ,

OS X and SSD – tunings plus performance with and without TRIM

I finally decided to spring for a SSD for my laptop since I hammer it heavily with a lot of mostly random I/O. It was money well spent.

I went for an Intel 320 model, since it includes extra capacitors for flushing the cache in the event of power failure, and has RAID-4 onboard for protection beyond sparing (there are other, faster SSDs but I need the reliability and can’t afford large-sized SLC).

I used the trusty postmark (here’s a link to the OS X executable) to generate a highly random workload with varying file sizes, using these settings:

set buffering false
set size 500 100000
set read 4096
set write 4096
set number 10000
set transactions 20000
run

All testing was done on OS X 10.6.7.

Here’s the result with the original 7200 RPM HDD:

Time:
198 seconds total
186 seconds of transactions (107 per second)

Files:
20163 created (101 per second)
Creation alone: 10000 files (1111 per second)
Mixed with transactions: 10163 files (54 per second)
10053 read (54 per second)
9945 appended (53 per second)
20163 deleted (101 per second)
Deletion alone: 10326 files (3442 per second)
Mixed with transactions: 9837 files (52 per second)

Data:
557.87 megabytes read (2.82 megabytes per second)
1165.62 megabytes written (5.89 megabytes per second)

I then replaced the internal drive with SSD, popped the old internal drive into an external caddy, plugged it into the Mac, reinstalled OS X and simply told it to move the user and app stuff from the old drive to the new (Apple makes those things so easy – on a PC you’d probably need something like an imaging program but that wouldn’t take care of very different hardware). I spent a ton of time testing to make sure it was all OK, in disbelief it was that easy. Kudos, Apple.

Here are the results with SSD (2/3rds full FWIW):

Time:
19 seconds total
13 seconds of transactions (1538 per second)

Files:
20163 created (1061 per second)
Creation alone: 10000 files (2500 per second)
Mixed with transactions: 10163 files (781 per second)
10053 read (773 per second)
9945 appended (765 per second)
20163 deleted (1061 per second)
Deletion alone: 10326 files (5163 per second)
Mixed with transactions: 9837 files (756 per second)

Data:
557.87 megabytes read (29.36 megabytes per second)
1165.62 megabytes written (61.35 megabytes per second)

A fair bit of improvement… :) The perceived difference is amazing. For some things I’ve caught it doing over 200MB/s sustained writes.

I also disabled the sudden motion sensor since there’s no point stopping I/O to a SSD if one shakes the laptop. From the command line:

sudo pmset -a sms 0 (this disables it)
sudo pmset –g (to verify it was done)

And since I don’t need hotfile adaptive clustering on a SSD, I decided to disable access time updates (noatime in UNIX parlance).

You need to put the script from here: http://dl.dropbox.com/u/5875413/Tools/com.my.noatime.plist

into /Library/LaunchDaemons

And make sure it has the right permissions:

sudo chown root:wheel com.my.noatime.plist

Then reboot, type mount from the command line, and see if the root filesystem shows noatime as one of the mount arguments.

For example mine shows

/dev/disk0s2 on / (hfs, local, journaled, noatime)

I then re-ran postmark, here are the results with noatime:

Time:
16 seconds total
11 seconds of transactions (1818 per second)

Files:
20163 created (1260 per second)
Creation alone: 10000 files (2500 per second)
Mixed with transactions: 10163 files (923 per second)
10053 read (913 per second)
9945 appended (904 per second)
20163 deleted (1260 per second)
Deletion alone: 10326 files (10326 per second)
Mixed with transactions: 9837 files (894 per second)

Data:
557.87 megabytes read (34.87 megabytes per second)
1165.62 megabytes written (72.85 megabytes per second)

Even better.

Now here comes the part that I hoped would work better than it did:

OS X doesn’t support the TRIM command for SSDs yet (unless you have a really new Mac with an Apple SSD). Fortunately, some enterprising users found out that it is possible to turn TRIM on OS X. There are various ways to do it but someone already automated the process. Be sure to do a backup first (both system backup and through the TRIM enabler application).

The process does work. However, it seems it tries to run TRIM too aggressively, messing up with the random access optimizations some drives have.

Benchmark after TRIM enabled:

Time:
39 seconds total
31 seconds of transactions (645 per second)

Files:
20163 created (517 per second)
Creation alone: 10000 files (3333 per second)
Mixed with transactions: 10163 files (327 per second)
10053 read (324 per second)
9945 appended (320 per second)
20163 deleted (517 per second)
Deletion alone: 10326 files (2065 per second)
Mixed with transactions: 9837 files (317 per second)

Data:
557.87 megabytes read (14.30 megabytes per second)
1165.62 megabytes written (29.89 megabytes per second)

This kind of performance loss is unacceptable to me, so I restored the kext file through the TRIM app, rebooted and re-ran the benchmark and all was fine again.

My recommendations:

  1. Always test before and after the tweaks – my results may only apply to Intel drives. Please post your results with other drives
  2. Always do backups before serious tweaks
  3. If TRIM seems to slow down random I/O on your Mac SSD, don’t keep it running, maybe enable it once a month, go to disk utility, and ask it to erase the free space. This will ensure the drive stays in good shape without adversely affecting normal random I/O.

D

Examining value for money regarding the SPEC benchmarks

Some of the comments in my previous post asked about $/IOPS and $/TB.

Since SPEC doesn’t require prices to be listed, I did my own analysis.

The NetApp numbers are simply 4x the existing 6240 result, which is what EMC did with their submission, they used 4x separate VNX systems and aggregated the result.

I used this clarifying analogy over at Nigel’s blog to explain why this makes sense before anyone yells “but this is not published”:

A storage system typically has some kind of bottleneck – cluster interconnect, number of drives, bandwidth to the controller, etc.

When you’re testing a single system, you’re ultimately hitting one of those bottlenecks.

If you’re testing multiple systems independent of each other, they do not share the bottlenecks (since they’re separate), and your performance will scale linearly as you add systems.

For example, if 1 truck can hold 10 tons of stuff, 4 like trucks will hold 40 tons of stuff, 10 trucks 100 tons, etc. There’s no limit.

Once you inject a limiting factor (“the trucks all have to fit on a bridge and the bridge can take this much load and it’s this big”) then you will have a limitation on how many trucks you can load and put on that bridge.

EMC tested 4 separate “trucks”. In that same way, I can add up the result of 4 separate NetApp “trucks”. Here are the results:

EMC NetApp Difference
Cost (approx. USD List) 6,000,000 5,000,000 NetApp is over 16% cheaper in absolute terms
SPEC SFS NFS IOPS 497,623 762,700 NetApp is 53% faster in absolute terms
Average Latency (ORT) 0.96 1.17 EMC offers a mere 18% less latency (with less NFS OPS) despite using only SSDs!
Space (TB) 60 343 NetApp offers 5.7 times more usable space
$/SPEC NFS IOPS 12.06 6.56 Netapp is 45.6% less expensive per SPEC NFS operation
$/TB 100,000 14,577 NetApp is less than 1/6 the price of EMC per TB
RAID RAID5 RAID-DP NetApp is thousands of times more reliable
Boxes needed to accomplish result 15 (4x separate VNX, each with 2 controllers, plus a total of 5x Celerra VG8 heads and 2 Control Stations) 8x unified controllers NetApp is far less complex (the benefit of a truly unified architecture)

Who can spot the better deal? Smile

I added the latency in the chart, thanks to my buddy Mark Twomey for pointing it out.

You see, people needing enterprise NAS with that kind of performance usually need speed, plenty of space and high reliability. Not just one of the three. BTW, here’s a paper on relative RAID reliability.

NetApp provides all three, in spades, plus great value for money, a truly simple, flexible unified system, and efficiency.

Most customers want to see how a real configuration performs. I refer customers to our SPEC and SPC results constantly since quite frequently their desired configuration is very similar.

Which makes benchmarking realistic configurations actually useful – imagine that.

Maybe EMC needs to submit results with VNX the way they sell it to people, for example:

  • A mix of SSD cache, SSD, high-speed SAS and high-capacity SAS
  • Autotiering
  • RAID6
  • A typical amount of space for a configuration that size

Then submit results.

Keep your existing result of course, but also show the people how what you actually sell them really performs.

I still don’t understand why this is such a hard concept.

D

Technorati Tags: ,,,,

EMC conclusively proves that VNX bottlenecks NAS performance

A bit of a controversial title, no?

Allow me to elaborate.

EMC posted a new SPEC SFS result as part of a marketing stunt (which is working, look at what I’m doing – I’m talking about them, if only to clear the air).

In simple terms, EMC got almost 500,000 SPEC SFS NFS IOPS (not to be confused with, say, block-based SPC-1 IOPS) with the following configuration:

  1. Four (4) totally separate VNX arrays, each loaded with SSD storage, utterly unaware of each other (8 total controllers since each box has 2)
  2. Five (5) Celerra VG8 NAS heads/gateways (1 spare), one on top of each VNX box
  3. 2 Control Stations
  4. 8 exported filesystems (2 per VG8 head/VNX system)
  5. Multiple pools of storage (at least 1 per VG8) – not shared among the various boxes, no data mobility between boxes
  6. Only 60TB NAS space with RAID5 (or 15TB per box)

Now, this post is not about whether this configuration is unrealistic and expensive (almost nobody would pay $6m for merely 60TB of NAS, not today). I get it that EMC is trying to publish the best possible number by loading a bunch of separate arrays with SSD. It’s OK as long as everyone understands the details.

My beef has to do with how it’s marketed.

EMC is very vague about the configuration, unless you look at the actual SPEC website. In the marketing materials they just mention VNX, as in “The EMC VNX performed at 497,623 SPECsfs2008_nfs.v3 operations per second”. Kinda like saying it’s OK to take 3 5-year olds and a 6-year old to a bar because their age adds up to 21.

No – the far more accurate statement is “four separate VNXs working independently and utterly unaware of each other did 124,405 SPEC fs2008_nfs.v3 operations per second each“.

All EMC did was add up the result of 4 boxes.

Heck, that’s easy to do!

NetApp already has a result for the 6240 (just 2 controllers doing a respectable 190,675 SPEC NFS ops taking care of NAS and RAID all at once since they’re actually unified, no cornucopia of boxes there) without using Solid State Drives (common SAS drives plus a large cache were used instead – a standard, realistic config we sell every day, and not a “lab queen”).

If all we’re doing is adding up the result of different boxes, simply multiply this by 4 (plus we do have Cluster-Mode for NAS so it would count as a single clustered system with failover etc. among the nodes) and end up with the following result:

  1. 762,700 SPEC SFS NFS operations
  2. 8 exported filesystems
  3. 343TB usable with RAID-DP (thousands of times more resilient than RAID5)

So, which one do you think is the better deal? More speed, 343TB and better protection, or less speed, 60TB and far less protection? :)

Customers curious about other systems can do the same multiplication trick for other configs, the sky is the limit!

The other, more serious part, and what prompted me to title the post the way I did, is that EMC’s benchmarking made pretty clear the fact that the VNX is the bottleneck, only able to really support a single VG8 head at top speed, necessitating the need for 4 separate VNX systems to accomplish the final result. So, the fact that a VNX can have up to 8 Celerra heads on top of it means nothing since the back-end is your limiting factor. You might as well stick to a dual-head VG8 config (1 active 1 passive) since that’s all it can comfortably drive (otherwise why benchmark it that way?)

But with only 1 active NAS head you’d be limited to just 256TB max NAS capacity, since that’s how much total space a Celerra head can address as of the time of this writing. Which is probably enough for most people.

I wonder if the NAS heads that can be bought as a package with VNX are slower than VG8 heads, and by how much. You see, most people buying the VNX will be getting the NAS heads that can be packaged with it since it’s cheaper that way. How fast does that go? I’m sure customers would like to know, since that’s what they will typically buy.

I also wonder how fast it would be with RAID6.

Here’s a novel idea: benchmark what customers will actually buy!

So apples-to-apples comparisons can become easier instead of something like this:

Bothapples

For the curious: on the left you see an “Autumn Glory” Malus Floribunda (miniature apple). Photo courtesy of John Fullbright.

D

Technorati Tags: , , , , , , , ,

EMC finally joining the SPC – plus some advice

Just came to my attention that EMC is finally joining the SPC (Storage Performance Council). As I’ve pointed in their past, their absence from this most standard industry benchmark was puzzling, kudos for rectifying this omission.

I do have some advice for EMC (and all other vendors that have already posted results):

  1. Do show results for your various supported RAID types, not just RAID10 – after all, if most of your customers don’t just deploy RAID10, it makes sense to show RAID5 and RAID6, especially if you want to compare results with NetApp RAID-DP (that’s the protection equivalent of RAID6). This will increase your credibility. The argument that you only show RAID10 since that’s the best performing doesn’t hold water – everyone knows the other RAID types will have different performance levels, providing results for everything will at least enable customers to get an idea of how much of a performance hit they’ll have with the different RAID types with a write-intensive workload.
  2. Do show long-running benchmarks with auto-tiering enabled. After all, if you are claiming your auto-tiering implementation doesn’t hurt and can even improve results, this is your chance to show it.
  3. Enable features like snapshots.
  4. Use your large cache if you have one, especially if you keep advertising that it accelerates writes. It will just solidify your claims.

Welcome to the club!

D

 

Technorati Tags: , , , , , , , , , , , ,

Stack Wars: The Clone Wars

It seems that everyone and their granny is trying to create some sort of stack offering these days. Look at all the brouhaha – HP buying 3Par, Dell buying Compellent, all kinds of partnerships being formed left and right. Stacks are hot.

To the uninitiated, a stack is what you can get when a vendor is able to offer multiple products under a single umbrella. For instance, being able to get servers, an OS, a DB, an email system, storage and network switches from a single manufacturer (not a VAR) is an example of a single-sourced stack.

The proponents of stacks maintain that with stacks, customers potentially get simpler service and better integration – a single support number to call, “one throat to choke”, no finger-pointing between vendors. And that’s partially true. A stack potentially provides simpler access to support. On the “better integration” part – read on.

My main problem with stacks is that nobody really offers a complete stack, and those that are more complete than others, don’t necessarily offer best-of-breed products (not even “good enough”), nor do they offer particularly great integration between the products within the stack.

Personal anecdote: a few years ago I had the (mis)fortune of being the primary backup and recovery architect for one of the largest airlines in the world. Said airline had a very close relationship with a certain famous vendor. Said vendor only flew with that airline, always bought business class seats, the airline gave the vendor discounts, the vendor gave the airline discounts, and in general there was a lot of mutual back-scratching going on. So much so that the airline would give that vendor business before looking at anyone else, and only considered alternative vendors if the primary vendor didn’t have anything that even smelled like what the airline was looking for.

All of which resulted in the airline getting a backup system that was designed for small businesses since that’s all that vendor had to offer (and still does).

The problem is, that backup product simply could not scale to what I needed it to. I ended up having to stop file-level logging and could only restore entire directories since the backup database couldn’t handle the load, despite me running multiple instances of the tool for multiple environments. Some of those directories were pretty large, so you can imagine the hilarity that ensued when trying to restore stuff…

The vendor’s crack development team came over from overseas and spent days with me trying to figure out what they needed to change about the product to make it work in my environment (I believe it was the single largest installation they had).

Problem is, they couldn’t deliver soon enough, so, after much pain, the airline moved to a proper enterprise backup system from another vendor, which fixed most problems, given the technology I had to work with at the time.

Had the right decision been made up front, none of that pain would have been experienced. The world would have been one IT anecdote short, but that’s a small price to pay for a working environment. And this is but just one way that single vendor stacks can fail.

How does one decide on a stack?

Let’s examine a high-level view of a few stack offerings. By no means an all-inclusive list.

Microsoft: They offer an OS (catering from servers to phones), a virtualization engine, a DB, a mail system, a backup tool and the most popular office apps in the world, among many other things. Few will argue that all the bits are best-of-breed, despite being hugely popular. Microsoft doesn’t like playing in the hardware space, so they’re a pure software stack. Oh, there’s the XBox, too.

EMC: Various kinds of storage platforms (7-10 depending on how you count), all but one (Symmetrix) coming from acquisition. A virtualization engine (80% owner of VMware – even though it’s run as a totally separate entity). DB, many kinds of backup, document management, security and all other kinds of software. Some bits are very strong, others not so much.

Oracle: They offer an OS, a virtualization engine, a DB, middleware, servers and storage. An office suite. No networking. Oracle is a good example of an incomplete software/hardware stack. Aside from the ultra-strong DB and good OS, few will say their products are all best-of-breed.

Dell: They offer servers desktops, laptops, phones, various flavors of storage, switches. Dell is an example of a pure hardware stack. Not many software aspirations here. Few will claim any of their products are best-of-breed.

HP: They offer servers desktops, laptops, phones, even more flavors of storage, a UNIX OS with its own type of virtualization (can’t run x86), switches, backup software, a big services arm, printers, calculators… All in all, great servers, calculators and printers, not so sure about the rest. Fairly complete stack.

IBM: Servers, 2 strong DBs, at least 3 different OSes, CPUs, many kinds of storage,  email system, middleware, backup software, immense services arm. No x86 virtualization (they do offer virtualization for their other platforms). Very complete stack, albeit without networking.

Cisco: All kinds of networking (including telephony), servers. Limited stack if networking is not your bag, but what it offers can be pretty good.

Apple: Desktops, laptops, phones, tablets, networking gear, software. Great example of a consumer-oriented hardware and software stack. They used to offer storage and servers but they exited that business.

Notice anything common about the various single-vendor stacks? Did you find a stack that can truly satisfy all your IT needs without giving anything up?

The fact of the matter is that none of the above companies, as formidable as they are, offers a complete stack – software and hardware. You can get some of the way there, but it’s next to impossible to single-source everything without shooting yourself in the foot. At a minimum, you’re probably looking at some kind of Microsoft stack + server/storage stack + networking stack – mixing a minimum of 3 different vendors to get what you need without sacrificing much (the above example assumes you either don’t want to virtualize or are OK with Hyper-V).

Most companies have something like this: Microsoft stack + virtualization + DB + server + storage + networking – 6 total stacks.

So why do people keep pushing single-vendor stacks?

Only a few valid reasons (aside from it being fashionable to talk about). One of them is control – the more stuff you have from a company, the tighter their hold on you. The other is that it at least limits the support points you have to deal with, and can potentially get you better pricing (theoretically). For instance, Dell has “given away” many an Equallogic box. Guess what – the cost of that box was blended into everything else you purchased, it’s all a shell game. But if someone does buy a lot of gear from a single vendor, there are indeed ways to get better deals. You just won’t necessarily get best-of-breed or even good enough gear.

What about integration?

One would think that buying as much as possible from a single vendor gets you better integration between the bits. Not necessarily. For instance, most large vendors acquire various technologies and/or have OEM deals – if one looks just at storage as an example, Dell has their own storage (Equallogic and Compellent – two different acquisitions) plus an OEM deal with EMC. There’s not much synergy between the various technologies.

HP has their own storage (EVA, Lefthand, Ibrix, Polyserve, 3Par – four different acquisitions) and two OEM deals with Dot Hill for the MSA boxes and HDS for the high-end XP systems. That’s a lot of storage tin to keep track of (all of which comes from 7 different places and 7 totally different codebases), and any HP management software needs to be able to work with all of those boxes (and doesn’t).

IBM has their own storage (XIV, DS6000, DS8000, SONAS, SVC – I believe three homegrown and two acquisitions) and two different OEM deals (NetApp for N Series and LSI Logic for DS5000 and below). The integration between those products and the rest of the IBM landscape should be examined on a case-by-case basis.

EMC’s challenge is that they have acquired too much stuff, making it difficult to provide proper integration for everything. Supporting and developing for that plethora of systems is not easy and teams end up getting fragmented and inconsistent. Far too many codebases to keep track of. This dilutes the R&D dollars available and prolongs development cycles.

Aren’t those newfangled special-purpose multi-vendor stacks better?

There’s another breed of stack, the special-purpose one where a third party combines gear from different vendors, assembles it and sells it as a supported and pre-packaged solution for a specific application. Such stacks are actually not new – they have been sold for military, industrial and healthcare applications for the longest time. Recently, Netapp and EMC have been promoting different versions of what a “virtualization stack” should be (as usual, with very different approaches, check out FlexPod and Vblock).

The idea behind the “virtualization stack” is that you sell the customer a rack that has inside it network gear, servers, storage, management and virtualization software. Then, all the customer has to do is load up the gear with VMs and off they go.

With such a stack, you don’t limit the customer by making them buy their gear all from one vendor, but instead you limit them by pre-selecting vendors that are “best of breed”. Not everyone will be OK with the choice of gear, of course.

Then there’s the issue of flexibility – some of the special-purpose stacks literally are like black boxes – you are not supposed to modify them or you lose support. To the point where you’re not allowed to add RAM to servers or storage to arrays, both limitations that annoy most customers, but are viewed as a positive by some.

Is it a product or a “kit”?

Back to the virtualization-specific stacks: This is the main argument, do you buy a ready-made “product” or a “kit” some third party assembles after following a detailed design guide. As of this writing (there have been multiple changes to how this is marketed), Vblock is built by a company known as VCE – efectively a third party that puts together a custom stack made of different kinds of EMC storage, Cisco switches and servers, VMware, and a management tool called UIM. It is not built by EMC, VMware or Cisco. VCE then resells the assembled system to other VARs or directly to customers.

NetApp’s FlexPod is built by VARs. The difference is that more than one VAR can build FlexPods (as long as they meet some specific criteria) and don’t need to involve a middleman (also translating to more profits for VARs). All VARs building FlexPods need to follow specific guidelines to build the product (jointly designed by VMware, Cisco and NetApp), use components and firmware tested and certified to work together, and add best-of-breed management software to manage the stack.

The FlexPod emphasis is on sizing and performance flexibility (from tiny to gigantic), Secure Multi Tenancy (SMT – a unique differentiator), space efficiency, application integration, extreme resiliency and network/workload isolation – all highly important features in virtualized environments. In addition, it supports non-virtualized workloads.

Ultimately, in both cases the customer ends up with a pre-built and pre-tested product.

What about support?

This has been both the selling point and the drawback of such multi-vendor stacks. In my opinion, it has been the biggest selling point for Vblock, since a customer calls VCE for support. VCE has support staff that is trained on Cisco, VMware and EMC and can handle many support cases via a single support number – obviously a nice feature.

Where this breaks down a bit: VCE has to engage VMware, EMC and Cisco for anything that’s serious. Furthermore, Vblock support doesn’t support the entire stack but stops at the hypervisor.

For instance, if a customer hits an Enginuity (Symmetrix OS) bug, then the EMC Symm team will have to be engaged, and possibly write a patch for the customer and communicate with the customer. VCE support simply cannot fix such issues, and is best viewed as first-level support. Same goes for Cisco or VMware bugs, and in general deeper support issues that the VCE support staff simply isn’t trained enough to resolve. In addition, Vblocks can be based on several different kinds of EMC storage, that itself requires different teams to support it.

Finally – ask VCE if they are legally responsible for maintaining the support SLAs. For instance, who is responsible if there is a serious problem with the array and the vendor takes 2 days to respond instead of 30 minutes?

FlexPod utilizes a cooperative support model between NetApp, Cisco and VMware, and cases are dealt with by experts from all three companies working in concert. The first company to be called owns the case.

When the going gets tough, both approaches work similarly. For easy cases that can be resolved by the actual VCE support personnel, Vblock probably has an edge.

Who needs the virtualization stack?

I see several kinds of customers that have a need for such a stack (combinations are possible):

  1. The technically/time constrained. For them, it might be easier to get a somewhat pre-integrated solution. That way they can get started more quickly.
  2. The customers needing to hide behind contracts for legal/CYA reasons.
  3. The large. They need such huge amounts of servers and storage, and so frequently, that they simply don’t have the time to put it in themselves, let alone do the testing. They don’t even have time to have a PS team come onsite and build it. They just want to buy large, ready-to-go chunks, shipped as ready to use as possible.
  4. The rapidly growing.
  5. Anyone wanting pre-tested, repeatable and predictable configurations.

Interesting factoid for the #3 case (large customer): They typically need extensive customizations and most of the time would prefer custom-built infrastructure pods with the exact configuration and software they need. For instance, some customers might prefer certain management software over others, and/or want systems to come preconfigured with some of their own customizations in-place – but they are still looking for the packaged product experience. FlexPod is flexible enough to allow that without deviating from the design. Of course, if the customer wants to dramatically deviate (i.e. not use one of the main components like Cisco switches or servers, for instance) – then it stops being a FlexPod and you’re back to building systems the traditional way.

What customers should really be looking for when building a stack?

In my opinion, customers should be looking for a more end-to-end experience.

You see – even with the virtualization stack in place, you will need to add:

  • OSes
  • DBs
  • Email
  • File Services
  • Security
  • Document Management
  • Chargeback
  • Backup
  • DR
  • etc etc.

You should partner with someone that can help you not just with the storage/virtualization/server/network stack, but also with:

  • Proper alignment of VMs to maintain performance
  • Application-level integration
  • Application protection
  • Application acceleration

In essense, treat things holistically. The vendor that provides your virtualization stack needs to be able to help you all the way to, say, configuring Exchange properly so it doesn’t break best practices, and ensuring that firmware revs on the hardware don’t clash with, say, software patch levels.

You still won’t avoid support complexity. Sure, maybe you can have Microsoft do a joint support exercise with the hardware stack VAR, but, no matter how you package it, you are going to be touching the support of the various vendors making up the entire stack.

And you know what?

It’s OK.

D

 

Technorati Tags: , , , , ,

 

Updated blog code, plus a bit about NetApp recovery for cloud providers

Sometime last night/this morning a config file in my blog got corrupted. Maybe it got hacked (I was running an ancient WordPress version 2.1) but at any rate the site was down.

It’s hosted on a large, famous service provider, and they use NetApp gear.

I was able to recover my file through NetApp snapshots – The provider makes this trivial by giving all users a GUI for it that looks like a normal file manager. All self-service.

godaddy.png

No Vblocks, Avamar or Data Domain were harmed in the process that literally took all of one second to complete, most of which time was probably spent on Javascript doing its thing and the browser refreshing. BTW, I hadn’t touched that file since 2006.

This is a good example of storage for service providers doing more than just storing data.

With alternative solutions, a ticket would have to be opened, a helpdesk person would have to use a backup tool to find my file and restore it, then let me know. A whole lot more effort than what happened in this case.

In other news, I’m running the latest WordPress code, the site is now auto-optimized for mobile devices, and things are smooth again. Oh, and the old theme that most seemed to hate is gone. I’ll see if I can find a suitable picture for the header, for now this is OK.

If only that old version of WordPress I was using had a clean way of exporting stuff, if you look at older articles you’ll notice weird characters here and there. I might fix it. Probably not.

D

Technorati Tags: ,