While waiting for another EMC World session to start (this one is at “Guru” level, let’s see) I thought I might share some of my experience running NetBackup on very large setups – nothing like learning through pain.
Don’t get me wrong – NBU has its market share for a reason. However, I want to make sure I dispel everyone’s deluded romantic notions about NBU being the be-all, end-all backup tool. It can work well, but only if you truly know its idiosyncrasies.
I can’t say I was tending the busiest NBU systems but, at one point, just one of my environments was doing about 15,000 backup jobs a day. Which is way too much – we fixed that pronto…
I won’t go too deep into each point. If anyone cares then post a comment and I will expand on it.
If you have a small shop running NBU on a single server, much of this is not for you – but there may still be a nugget or two in there… However, if you don’t at least use barcodes, I will go after you. Use tar or Windows backup, or even a rusty abacus, go to your corner and be quiet.
- Have a dedicated master server – if there are many jobs, the last thing you want is your master also being busy doing backups and vaults. It’s the half-witted brains of the operation, don’t stress it.
- Go way beyond the tuning recommendations in the manual – if you know what you’re doing. For instance, I have some voodoo tunings for Solaris (up to 9) that make a huge difference. Prepare for comments from Veritas (Symantec, whatever) support… “no sir it’s not like in the book sir, we can’t guarantee it will work sir…” whatever, I’ve gotten such ridiculously bad advice from their support I still cringe (and sometimes pee a little) every time I get a flashback, not to mention the endless dreams and the screaming that wake me up at night.
- Separate HBA ports for disk and tape. No exceptions. I don’t care what vendors say.
- Separate TAN (Tape Area Network), if you can swing it.
- Separate backup LAN. And/or Ethernet port bonding/trunking/teaming (whatever nomenclature appears in your systems). Four gigabit ports per media server. 10G if you have the dough. Four 10G ports teamed and I will do the Wayne’s World “we’re not worthy” bit in front of you. Offer ends Dec 2007.
- Experiment with TOE cards, such as the Alacritech ones. You will get closer to full gig, though they’re expensive. Bonding is way cheaper and effective if you have many clients.
- Try to use port bonding that works at the switch level, too – 802.3ad is the standard, EtherChannel is Cisco’s own. The software on the server and the setting on the switch have to jibe. Half-assed intermediate approaches are just that.
- Don’t use weak switches at the core. I’m tired of seeing people with Cisco 4506 switches (6509 wannabe) and 8:1 oversubscribed 48-port cards. YOU WILL HAVE PROBLEMS!!!! Do your homework, find out whether or not the switch is oversubscribed, find out the total backplane throughput, figure out the blade throughput, don’t plug everything in the same port octet if you’re going to be oversubscribed – i.e. a 4-port team going to the octet that shares 1Gbit in a 4506 will not give you 4Gbits, it will give you, at best, a thoroughly blocked 150Mbits per port, tops, with problems. Did you know that if one of the 8 ports starts out before the rest and continues pumping, the rest will NOT make the first port reduce its speed but will instead trickle along at 10Mbits sometimes? Even after the initial transfer that was fast is finished and there’s nothing else going on? As Rutger Hauer said in Blade Runner, “I have… seen things you people wouldn’t believe”. Figure THAT one out when you’re having throughput problems.
- Use jumbo frames if you can. Bigger is better in this case. Do your homework, there are caveats.
- Use the right block size for your tape devices. Windows users, beware: patches are necessary. SP1 broke block sizes over 64K on Windows Server 2003.
- Don’t go nuts with SSO! Among the myriad things Veritas doesn’t tell you unless you know the right people is that at around 250 instances of devices you will have weird device problems (25 tape drives shared among 10 media servers would make 250 instances). The safe number is closer to 150. Ignore this at your peril. If you use VTL just make more virtual drives.
- Use snapshots as much as possible.
- If you have more than a couple of media servers, consider a VTL.
- If you have DBAs who insist on flushing the redo logs to tape every few seconds, get a heavy-gauge jumper cable and a power supply that can put out, say, 20kV, a coat hanger, and wearing nothing but a stained leather apron go to work on them until they regain their senses (or not). Good times.
- If the DBAs can’t be persuaded even after their various body parts have been charred by high voltage, try to send the smaller backups to disk. Do NOT send frequent backups to tape. If a job is going to take less than 10 minutes, send it to disk.
- As a corollary to the previous point (#15), only use tape for large jobs that will actually stream your tape drives.
- Know what your boxes can push. Most servers, even very large ones, will be hard-pressed to push 2 LTO3 drives, let alone LTO4. FYI, I’ve gotten LTO3 to go as fast as 130MB/s, sustained. Do the math. Beat the score! I cheated, BTW.
- Know what expansion slots to use – not all are equal, even if they look the same.
- Don’t push too much backup traffic over switch ISLs. Preferably don’t push any.
- Be super-careful with command-line manipulation of the NBU DB. Perfectly legitimate commands will not function as you might think due to silly heuristics (or lack thereof). Stay tuned, there will be a large post outing NBU in the future. The amount of dirt I have is beyond staggering. Maybe I shouldn’t have said that, I might have to look out for contract killers or Veritas people offering payola, not sure which is preferable. I’m 5 feet tall, with a goatee, skinny and blond, by the way. You can’t miss me. I also have a pronounced limp.
- Beware of multiplexing. Too much and restores take forever. Too little and you can’t stream your devices. Disk is your friend. Anything beyond 4-way multiplexing on tape is not.
- Do not send tapes offsite only once a week. You are asking for pervy uncle Murphy to pay you a visit, and he is a known repeat sex offender. He won’t discriminate, either.
- If you use tapes, have 2 copies of everything.
- Replicate to remote sites if at all possible. Tape should be a last resort.
- Use VMware if at all possible. Along with snapshots (#12) and remote replication (#24), this helps quick recovery.
- Do at least 2-3 different backups of the NBU catalog. In really busy systems it’s impossible to do it after each session – there’s just no quiet time. Just have a copy on disk and 2 on tape (you can do the ones on tape inline, which creates both at the same time, it works), then send the ones on tape to 2 different offsite locations. Have NBU email you the barcode(s) of the tape(s) it used for the catalog if you’re doing a non-standard catalog backup. Send an extra email to an externally available address. You’re not paranoid if they’re really out to get you!
- Can you even read from disk as fast as you can write to your backup medium? Benchmark.
- What’s your current network throughput if you max out all the media servers? Benchmark.
- Don’t use your production systems as media servers. You are inviting uncle Murphy again and he’s feeling randy.
- Use storage unit groups. Why on earth would you not?
- Cluster the master.
- Do NOT put media traffic through firewalls, it’s too much. ACLs on switches can work just fine.
- Do NOT set up a dedicated media server for a subset of your boxes that are secured from the main network. If they lose access to that media server, backups fail. At any rate you’ll have to allow a few ports for the master to communicate with the media server, so you might as well let media server traffic through. If it seems that this point and the previous one (#32 and #33) are somewhat self-contradictory, give yourself a cigar.
- Simplify your life. Elaborate and numerous policies are more ways to invite uncle Murphy.
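
A few of the numbers above are easy to sanity-check before you buy anything. Here is a minimal Python sketch of three of them: the SSO device-instance count, the ceiling for a NIC team plugged into an oversubscribed port group, and the disk-or-tape call for short jobs. The function names and defaults are mine, lifted from the figures in this post; none of this is an NBU API or setting, and the switch math is the ideal case (real life, as noted, is worse).

```python
# Back-of-the-envelope checks for a few figures from the list above.
# All names and default thresholds are illustrative assumptions, not NBU settings.


def sso_device_instances(tape_drives: int, media_servers: int) -> int:
    """Each shared drive counts once per media server that shares it."""
    return tape_drives * media_servers


def sso_risk(instances: int, safe: int = 150, trouble: int = 250) -> str:
    """Classify an instance count against the ~150 / ~250 figures quoted above."""
    if instances >= trouble:
        return "trouble"
    if instances > safe:
        return "risky"
    return "ok"


def team_ceiling_mbits(team_ports: int, port_mbits: float,
                       shared_group_mbits: float) -> float:
    """Ideal ceiling for a NIC team whose ports all sit in one port group
    that shares a single uplink to the backplane."""
    return min(team_ports * port_mbits, shared_group_mbits)


def fair_share_mbits(shared_group_mbits: float, busy_ports: int) -> float:
    """Ideal per-port share if every busy port in the group got an equal slice."""
    return shared_group_mbits / busy_ports


def backup_target(expected_minutes: float, threshold_minutes: float = 10) -> str:
    """Short jobs go to disk; only jobs long enough to stream go to tape."""
    return "disk" if expected_minutes < threshold_minutes else "tape"


if __name__ == "__main__":
    instances = sso_device_instances(tape_drives=25, media_servers=10)
    print("SSO instances:", instances, "->", sso_risk(instances))

    # 4 x 1 Gbit team plugged into a group of 8 ports that shares 1 Gbit
    print("team ceiling (Mbit/s):", team_ceiling_mbits(4, 1000, 1000))
    print("per-port share, all 8 busy (Mbit/s):", fair_share_mbits(1000, 8))

    print("3-minute redo log job ->", backup_target(3))
    print("45-minute full backup ->", backup_target(45))
```

Treat the switch figures as an upper bound only: equal sharing is the best case, and the behaviour described above (one early port hogging the group) is exactly what this arithmetic cannot show.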
That’s all I have for now. Is there more? Tons, but I need to pee.