It’s all about data classification and searching

I don’t know if this has been discussed elsewhere but I felt like I had an epiphany so there…

The way I see it, in a decade or two the most important data technologies will be data classification and search.

Consider this: At the moment, all the rage is archiving and storage tiers. The reason is that it simply is too expensive to buy the fastest disks, and even if you do buy them they’re smaller than the slower-spinning drives.

Imagine if speed and size were not issues. I know that’s a big assumption but let’s play along for a second… (let’s just say that there are plenty of revolutionary advances in the storage space coming our way within, say, 10-20 years, that will make this concept not seem that far-fetched).

Nobody would really care any longer about storage tiers or archiving. Backups would simply consist of extra copies of everything, to be kept forever if needed, and replicated to multiple locations (this is already happening, it’s just expensive, so it’s not common). Indeed, everyone would just let all kinds of data accumulate, and scrubbing would not be nearly as frequent as it is now. Multiple storage islands would also be clustered seamlessly so they present a single, coherent space, compounding the problem further.

Within such a chaotic architecture, the only real problems are data classification and mining – i.e. figuring out what you have and actually getting at it. Where the data physically lives is not such an issue – nobody cares, as long as they can get to it in a timely fashion.

I can tell that OS designers are catching on. Microsoft, of all companies, wanted a next-gen filesystem (WinFS) for Vista/Longhorn that would effectively be SQL on top of NTFS, with files stored as BLOBs. It got delayed so we didn’t get it, but they’re saying it should be out in a few years (there were issues with scalability and speed).

Let’s forget about the Microsoft-specific implementation and just think about the concept instead (I’d use something like a decent database on raw disk and not NTFS, for instance). No more real file structure as we know it – it’s just a huge database occupying the entire drive.

Think of the advantages:

  1. Far more resilient to failures
  2. Proper rollbacks in case of problems, and easy rebuilding using redo logs if need be
  3. Replication via log shipping
  4. Amazing indexing
  5. Easy expandability
  6. The potential for great performance, if done right
  7. Lots of tuning options (maybe too many for some).

With such a technology, you need a lot more metadata for each file so you can present it in different ways and also search for it efficiently. Let’s consider a simple text document – you’re trying to sell some storage, so you write a proposal for a new client. You could have metadata on:

  • Author
  • Filename
  • Client name
  • Type of document – proposal
  • Project name
  • Excerpt
  • Salesperson’s name
  • Solution keywords, such as EMC DMX with McData (sorry, Brocade) switches
  • Document revision (possibly automatically generated)

… and so on. A lot of these fields can already be found in the properties of any MS Word document.

The database would index the metadata at the very least, when the file is created, and any time the metadata changes. Searches would be possible based on any of the fields. Then, a virtual directory structure could be created:

  • Create a virtual directory with all files pertaining to that specific client (most common way people would organize it)
  • Show all the material for this specific project
  • Show all proposals that have to do with this salesperson

… and so on.
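The virtual directory idea above is easy to sketch with a plain SQL database. Here’s a minimal toy in Python with SQLite – all field names, filenames and people are made up for illustration, and a real implementation would index far more metadata:

```python
import sqlite3

# In-memory SQLite database standing in for the metadata store.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE files (
        filename    TEXT,
        author      TEXT,
        client      TEXT,
        doc_type    TEXT,
        project     TEXT,
        salesperson TEXT
    )
""")
# Index the metadata so "virtual folder" queries don't scan everything.
con.execute("CREATE INDEX idx_client ON files (client)")
con.execute("CREATE INDEX idx_salesperson ON files (salesperson)")

con.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("acme_san_proposal.doc", "DK", "Acme",     "proposal", "SAN refresh", "Bob"),
        ("acme_sow.doc",          "DK", "Acme",     "SOW",      "SAN refresh", "Bob"),
        ("widgetco_proposal.doc", "DK", "WidgetCo", "proposal", "DR site",     "Alice"),
    ],
)

# "Virtual directory" #1: all files pertaining to a specific client.
acme = [r[0] for r in con.execute(
    "SELECT filename FROM files WHERE client = ? ORDER BY filename", ("Acme",))]

# "Virtual directory" #2: all proposals tied to a specific salesperson.
bobs = [r[0] for r in con.execute(
    "SELECT filename FROM files WHERE doc_type = ? AND salesperson = ? ORDER BY filename",
    ("proposal", "Bob"))]
```

The point is that the “folders” are just saved queries over indexed metadata – the same file shows up in as many of them as its metadata matches, with no copies made.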

Virtual folders exist now in Mac OS X (created from a Spotlight search), Vista (saved searches) and even GNOME 2.14, but the underlying engine is simply not as powerful as what I just described. Normal searches are used, and metadata is not that extensive for most files anyway (MP3 files being an exception, since metadata creation is almost forced on you when you rip a CD).

It should be obvious by now that to enable this kind of functionality properly, you need really good ways of classifying and indexing your data, and of actually creating all the metadata that needs to be there, as automatically as possible. Future software will probably force you to create the metadata in some way, of course.

Existing software that does this classification is fairly poor, in my opinion. Please correct me if I’m wrong.

The other piece that needs to be there is extremely robust search and indexing capabilities. Some of that technology is there (google desktop and its ilk) but natural language search has to be – well, natural, but unambiguous at the same time.

I hope you can now see why I believe these technologies are important. If Google continues the way it’s going, it may well become the most important company in the next decade (some might argue it’s the most important one already).

For any sci-fi fans out there, this is a good novel that’s a bit related to the chaotic storage systems of the future:


Some clarification on the caching

Re the previous post:

If you want to use supercache or uptempo the idea is that you take AWAY from windows/SQL/exchange cache and add to the fancy cache.

So, even on Windows server, under “File and Printer Sharing for Microsoft Networks” (in the properties for your network card, bizarrely enough), you can select “maximize throughput for network applications”. In the various apps you’d just minimize the cache (e.g. only 10MB for Exchange) and give the rest to SuperCache/uptempo.

Be aware that supercache is on a PER VOLUME basis, not global (its blessing and its curse at the same time). If you have a lot of volumes maybe just cache a few key data volumes, tempdb and the pagefile partition, or use uptempo, which allows you to allocate a single global cache pool that is then shared among the volumes you choose.

For SQL, using a RAM disk for tempdb seems to work even better.

Having seen the products work wonders only with 128MB dedicated to them, and bearing in mind that most servers have 4GB RAM or more, I’d say go nuts. I’d buy 4GB of RAM and make it cache in a heartbeat.


On deduplication and Data Domain appliances

One subject I keep hearing about is deduplication. The idea being that you save a ton of space since a lot of your computers have identical data.
One way to do it is with an appliance-based solution such as Data Domain. Effectively, they put a little server and a cheap-but-not-cheerful, non-expandable 6TB RAID together, then charge a lot for it, claiming it can hold 90TB or whatever. Use many of them to scale.

The technology chops up incoming files into pieces. Then, the server calculates a unique numeric ID using a hash algorithm.

The ID is then associated with the block and both are stored.

If the ID of another block matches one already stored, the new block is NOT stored, but its ID is, as is its association with the rest of the blocks in the file (so that deleting one file won’t adversely affect blocks shared with other files).

This is what allows dedup technologies to store a lot of data.
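The chop-hash-store loop can be sketched in a few lines of Python. This is a toy: fixed-size chunking and an in-memory dict stand in for the hash database, whereas real products use variable-size chunking and on-disk indexes (which is exactly where the performance problems below come from):

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size chunking for simplicity; real products vary

store = {}    # hash ID -> block data (each unique block stored only once)
catalog = {}  # filename -> ordered list of hash IDs making up the file

def backup(filename, data):
    ids = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        block_id = hashlib.sha256(block).hexdigest()
        if block_id not in store:   # unseen block: store it
            store[block_id] = block
        ids.append(block_id)        # the ID is always recorded
    catalog[filename] = ids

# Two nearly identical "servers": most blocks dedup away.
backup("serverA.img", b"x" * 8192)
backup("serverB.img", b"x" * 8192 + b"y" * 4096)
# Five logical blocks were backed up, but only two unique ones are stored.
```

Note that the `block_id not in store` check is the critical step – in the toy it’s an O(1) dict lookup, but on an appliance it’s a probe into a database of millions or billions of IDs, once per block.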

Now, why how much you can actually store depends on your data:

If you’re backing up many different unique files (like images), there will be almost no similarity, so everything will be backed up.
If you’re backing up 1000 identical Windows servers (including the Windows directory) then there WILL be a lot of similarity, and great efficiencies.

Now the drawbacks (and why I never bought it):

The thing relies on a weak server and a small database. As you back up more and more, there will be millions (maybe billions) of IDs in the database (remember, a single file maps to multiple IDs).

Imagine you have 2 billion entries.

Imagine you’re trying to back up someone’s 1GB PST, or other large file, that stays mostly the same over time (ideal dedup scenario). The file gets chopped up in, say, 100 blocks.

Each block has its ID calculated (CPU-intensive).

Then, EACH ID has to be compared with the ENTIRE database to determine whether there’s a match or not.

This can take a while, depending on what search/sort/store algorithms they use.

I asked data domain about this and all they kept telling me was “try it, we can’t predict your performance”. I asked them whether they had even tested the box to see what the limits were, and they hadn’t. Hmmm.

I did find out that, at best, the thing works at 50MB/s (slower than an LTO3 tape drive), unless you use tons of them.

Now, imagine you’re trying to RECOVER your 1GB PST.

Say you try to recover from a “full” backup on the data domain, but that file has been living in it for a year, with the new blocks being added to it.

When requesting the file, the data domain box has to synthesize the file (remember, even the “full” doesn’t include the whole file). It will read the IDs needed to recreate it and put the blocks together so it can present the final file, as it should have looked.

This is CPU- and disk-intensive. Takes a while.
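The synthesis step is just the backup loop run in reverse. Continuing the same toy sketch (an in-memory dict again stands in for the appliance’s block store, so the random-read cost doesn’t show up here, but the reassembly logic is the same):

```python
import hashlib

BLOCK = 4096
store, catalog = {}, {}  # block store and per-file ID lists, as before

def backup(name, data):
    offsets = range(0, len(data), BLOCK)
    ids = [hashlib.sha256(data[i:i + BLOCK]).hexdigest() for i in offsets]
    for i, bid in zip(offsets, ids):
        store.setdefault(bid, data[i:i + BLOCK])  # store unseen blocks only
    catalog[name] = ids

def restore(name):
    # Synthesize the file: fetch every referenced block and reassemble it
    # in order. On a real appliance each fetch is a random read against
    # the block store -- this is the disk- and CPU-intensive part.
    return b"".join(store[bid] for bid in catalog[name])

original = bytes(range(256)) * 20   # ~5KB, spans two blocks
backup("mail.pst", original)
assert restore("mail.pst") == original
```

Every “full” restored this way is really a scatter-gather over blocks written at wildly different times, which is why the synthesized read is so much slower than a sequential one.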

The whole point of doing backups to disk is to back up and restore faster and more reliably. If you’re slowing things down in order to compress your disk as much as possible, you’re doing yourself a disservice.

Don’t get me wrong, dedup tech has its place, but I just don’t like the appliance model for performance and scalability reasons.
EMC just purchased Avamar, a dedup company that does the exact same thing but lets you install the software on whatever you want.

There are also Asigra and Evault, both great backup/dedup products that can be installed on ANY server and work with ANY disk, not just the el cheapo quasi-JBOD data domain sells.

So, you can leverage your investment in disk and load the software on a beefy box that will actually work properly.

Another tack would be to use virtual tape. It doesn’t do dedup yet – but it will, since EMC bought Avamar, and ADIC (now Quantum) also acquired a dedup company and will put the tech in their VTL, so you can get the best of both worlds – and it does compression just like real tape.

Plus, even the cheapest EMC virtual tape box works at over 300MB/s.

I sort of detest the “drop at the customer site” model Data Domain (and a bunch of the smaller storage vendors) use. They expect you to put the box in and bet that, if it works OK, it’ll be easier for you to keep it than to send it back.

Most people will keep the first thing they try (unless it fails horrifically), since they don’t want to go through the trouble of testing 5 different products (unless we’re talking about huge companies that have dedicated testing staff).

Let me know what you think…


Do you need a VTL or not?

I first posted this as a comment on but this is its rightful place.

Having deployed what was, at the time, the largest VTL in the world, and subsequently numerous other VTL and ATA Solutions, I think I can offer a somewhat different perspective:

It depends on the number of data movers you have and how much manual work you’re prepared to do. Oh, and speed.

Licensing for VTL is now capacity-based for most packages (at least the famous/infamous/important ones like CommVault, Networker and NetBackup, not respectively).

Also, I’d forget about using VTL features such as replication and using the VTL to write directly to tape (unless you’re insane or the backup software is running ON the VTL, as is the case now with EMC’s CDL). Just use the VTL like tape. I’ve been so vehement about this that even the very stubborn and opinionated Curtis Preston is now afraid to say otherwise with me in the room… (I shut him up REALLY effectively during one Veritas Vision session we were co-presenting a couple years ago. I like Curtis but he’s too far removed from the real world. Great presenter, though, and funny).

Even dedup features are suspect in my opinion, since they rely on hashes and searches of databases of hashes, which progressively get slower the more you store in them. Most companies selling dedup (data domain, avamar, to name a couple major names) are sorta cagey when you confront them with questions such as “I have 5 servers with 50 million files each, how well will this thing work?”

Answer is, it won’t, even for far fewer files. Just get some raw-based backup method that also indexes, such as Networker’s SnapImage or NBU’s FlashBackup.

Dedup also fails with very large files such as database files.
I can expand on any of the above comments if anyone cares.

But back on the data movers (Media Agents, Storage Nodes, Media Servers):

Whether you use VTL or ATA, you effectively need to divvy up the available space.

With ATA, you either allocate a fixed amount of space to each data mover, or use a cluster filesystem (such as ADIC’s StorNext) to allow all data movers to see the same disk.

With VTL, the smallest quantum of space you can allocate to a data mover is, simply, a virtual tape. A virtual tape, just like a real tape, gets automatically allocated as needed.

So, imagine you have a large datacenter, with maybe 40 data movers and multiple backup masters.

Imagine you have a 64TB ATA array.

You can either:

1. Split the array into 40 chunks, and have a management nightmare
2. Deploy StorNext so all servers see a SINGLE 64TB filesystem (at an extra $3-4K per server, plus probably $50K more for maintenance, central servers and failover) – easy to deal with, but complex to deploy and more software on your boxes
3. Deploy VTL and be done with it.

For such a large environment, option #3 is the best choice, hands down.

With filesystems, you have to worry about space, fragmentation, mount options, filesystem creation-time tunables, runtime tunables, esoteric kernel tunings, fancy disk layouts, and so on. If you’re weird like me and thoroughly enjoy such things, then go for it. As time goes by though, the novelty factor diminishes greatly. Been there, done that, smashed some speed records on the way.

What’s needed in the larger shops, aside from performance, is scalability, ease of use and deployment, and simplicity.

With VTL, you get all of that.

The other issue with disk is that backup vendors, while they’re getting better, impose restrictions on the # streams in/out, copy to tape and so on. No such restrictions on tape.

One issue with VTL: depending on your backup software, setting up all those new virtual drives etc. can be a pain (especially on NBU).
For a small shop (fewer than 2 data movers), a VTL is probably overkill.


So who am I?

Hello everyone,

My name is Dimitris Krekoukias.

This blog used to be on another server; I moved it here – hopefully this hosting facility will be more stable.

I resemble a silverback gorilla more than a monkey (man-pelt and all), and could probably wrestle one (and have a fair chance of winning).
I have extensive experience in the backup and recovery arena, and indeed know far more about certain products than I (or the vendors) would like to.
This blog will not be just about recovery – I have other interests, such as storage, OS design, tuning, filesystems, HPC, and other exotica. Plus a ton of non-IT-related hobbies – but that’s a story for another day.
Hopefully everyone will find this blog stimulating, controversial and, at times, annoying – in which case, tough.

