NetBackup best practices for ridiculously busy environments (but not exclusively).

While waiting for another EMC World session to start (this one is at “Guru” level, let’s see) I thought I might share some of my experience regarding running Netbackup on very large setups – nothing like learning through pain.

Don’t get me wrong – NBU has its market share for a reason. However, I want to make sure I dispel everyone’s deluded romantic notions about NBU being the be-all, end-all backup tool. It can work well, but only if you truly know its idiosyncrasies.

I can’t say I was tending the busiest NBU systems but, at one point, just one of my environments was doing about 15,000 backup jobs a day. Which is way too much – we fixed that pronto…

I won’t go too deep into each point. If anyone cares then post a comment and I will expand on it.

If you have a small shop running NBU on a single server, much of this is not for you – but there may still be a nugget or two in there… However, if you don’t at least use barcodes, I will go after you. Use tar or Windows backup, or even a rusty abacus, go to your corner and be quiet.

 

  1. Have a dedicated master server – if there are many jobs, the last thing you want is your master also being busy doing backups and vaults. It’s the half-witted brains of the operation, don’t stress it.
  2. Go way beyond the tuning recommendations in the manual – if you know what you’re doing. For instance, I have some voodoo tunings for Solaris (up to 9) that make a huge difference. Prepare for comments from Veritas (Symantec, whatever) support… “no sir it’s not like in the book sir, we can’t guarantee it will work sir…” whatever, I’ve gotten such ridiculously bad advice from their support I still cringe (and sometimes pee a little) every time I get a flashback, not to mention the endless dreams and the screaming that wake me up at night.
  3. Separate HBA ports for disk and tape. No exceptions. I don’t care what vendors say.
  4. Separate TAN (Tape Area Network), if you can swing it.
  5. Separate backup LAN. And/or Ethernet port bonding/trunking/teaming (whatever nomenclature appears in your systems). 4 gig ports per media server. 10G if you have the dough. 4 10G ports teamed and I will do the Wayne’s World “we’re not worthy” bit in front of you. Offer ends Dec 2007.
  6. Experiment with TOE cards, such as the Alacritech ones. You will get closer to full gig, though they’re expensive. Bonding is way cheaper and effective if you have many clients.
  7. Try to use port bonding that works at the switch level, too – 802.3ad is the standard; EtherChannel is Cisco’s proprietary flavor. The software on the server and the setting on the switch have to jibe (there’s a quick sketch of this after the list). Half-assed intermediate approaches are just that.
  8. Don’t use weak switches at the core. I’m tired of seeing people with Cisco 4506 switches (6509 wannabe) and 8:1 oversubscribed 48-port cards. YOU WILL HAVE PROBLEMS!!!! Do your homework, find out whether or not the switch is oversubscribed, find out the total backplane throughput, figure out the blade throughput, don’t plug everything in the same port octet if you’re going to be oversubscribed – i.e. a 4-port team going to the octet that shares 1Gbit in a 4506 will not give you 4Gbits, it will give you, at best, a thoroughly blocked 150Mbits per port, tops, with problems. Did you know that if one of the 8 ports starts out before the rest and continues pumping, the rest will NOT make the first port reduce its speed but will instead trickle along at 10Mbits sometimes? Even after the initial transfer that was fast is finished and there’s nothing else going on? As Rutger Hauer said in Blade Runner, “I have… seen things you people wouldn’t believe”. Figure THAT one out when you’re having throughput problems.
  9. Use jumbo frames if you can. Bigger is better in this case. Do your homework, there are caveats.
  10. Use the right block size for your tape devices (a sketch of the relevant touch files follows this list). Windows users, beware: patches are necessary – SP1 broke block sizes over 64K on 2003 Server.
  11. Don’t go nuts with SSO! Among the myriad things Veritas doesn’t tell you unless you know the right people is that at around 250 instances of devices you will have weird device problems (25 tape drives shared among 10 media servers would make 250 instances). The safe number is closer to 150. Ignore this at your peril. If you use VTL just make more virtual drives.
  12. Use snapshots as much as possible.
  13. If you have more than a couple of media servers, consider a VTL.
  14. If you have DBAs that insist on flushing the redo logs to tape every few seconds, get a heavy-gauge jumpstart cable and a power supply that can put out, say, 20KV, a coat hanger, and wearing nothing but a stained leather apron go to work on them until they regain their senses (or not). Good times.
  15. If the DBAs can’t be persuaded even after their various body parts have been charred by high voltage, try to send the smaller backups to disk. Do NOT send frequent backups to tape. If a job is going to take less than 10min send it to disk.
  16. As a corollary to #15, only use tape for large jobs that will actually stream your tape drives.
  17. Know what your boxes can push. Most servers, even very large ones, will be hard-pressed to push 2 LTO3 drives, let alone LTO4. FYI, I’ve gotten LTO3 to go as fast as 130MB/s, sustained. Do the math. Beat the score! I cheated, BTW.
  18. Know what expansion slots to use – not all are equal, even if they look the same.
  19. Don’t push too much backup traffic over switch ISLs. Preferably don’t push any.
  20. Be super-careful with command-line manipulation of the NBU DB. Perfectly legitimate commands will not function as you might think due to silly heuristics (or lack thereof). Stay tuned, there will be a large post outing NBU in the future. The amount of dirt I have is beyond staggering. Maybe I shouldn’t have said that, I might have to look out for contract killers or Veritas people offering payola, not sure which is preferable. I’m 5 feet tall, with a goatee, skinny and blond, by the way. You can’t miss me. I also have a pronounced limp.
  21. Beware of multiplexing. Too much and restores take forever. Too little and you can’t stream your devices. Disk is your friend. Anything beyond 4-way multiplexing on tape is not.
  22. Do not send tapes offsite only once a week. You are asking for pervy uncle Murphy to pay you a visit, and he is a known repeat sex offender. He won’t discriminate, either.
  23. If you use tapes, have 2 copies of everything.
  24. Replicate to remote sites if at all possible. Tape should be a last resort.
  25. Use VMWare if at all possible. Along with #12 and #24, this helps quick recovery.
  26. Do at least 2-3 different backups of the NBU catalog. In really busy systems it’s impossible to do it after each session – there’s just no quiet time. Just keep a copy on disk and 2 on tape (you can do the tape copies inline, which creates both at the same time – it works), then send the tapes to 2 different offsite locations. Have NBU email you the barcode(s) of the tape(s) it used for the catalog if you’re doing a non-standard catalog backup. Send an extra copy of that email to an externally available address. You’re not paranoid if they’re really out to get you!
  27. Can you even read from disk as fast as you can write to your backup medium? Benchmark (a quick sketch follows this list).
  28. What’s your current network throughput if you max out all the media servers? Benchmark.
  29. Don’t use your production systems as media servers. You are inviting uncle Murphy again and he’s feeling randy.
  30. Use storage unit groups. Why on earth would you not?
  31. Cluster the master.
  32. Do NOT put media traffic through firewalls, it’s too much. ACLs on switches can work just fine.
  33. Do NOT set up a dedicated media server for a subset of your boxes that are secured away from the main network. If they lose access to that media server, backups fail. At any rate you’ll have to allow a few ports for the master to communicate with the media server, so you might as well let the media server traffic through. If it seems that #32 and #33 are somewhat self-contradictory, give yourself a cigar.
  34. Simplify your life. Elaborate and numerous policies are more ways to invite uncle Murphy.
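
A few of the points above lend themselves to concrete sketches. For #7, here is roughly what “the server and the switch have to jibe” looks like, assuming a Linux media server using the classic bonding driver (the interface names are illustrative, and the switch ports must sit in a matching LACP channel group or you gain nothing):

    # /etc/modprobe.conf additions: 802.3ad (LACP) bonding with link monitoring
    # (bond0, eth0 and eth1 are illustrative names)
    alias bond0 bonding
    options bond0 mode=802.3ad miimon=100
    # eth0 and eth1 then get enslaved to bond0 via their ifcfg files
    # (or by hand with: ifenslave bond0 eth0 eth1). The matching Cisco ports
    # go into a port-channel negotiating LACP, not a static "on" channel.

Solaris and Windows have their own equivalents (Sun Trunking, vendor teaming software); the principle is identical: both ends must speak 802.3ad.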

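For #10, on a UNIX media server the tape block size NBU uses is governed by touch files under the install path. A minimal sketch, assuming the default /usr/openv location (the values are common starting points, not gospel, and Windows media servers have their own equivalent settings, plus the OS patch caveats mentioned above):

    # 256K tape blocks (262144 bytes) - what LTO-class drives want
    echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
    # More shared-memory buffers help keep fast drives streaming; the value
    # below is only a starting point - benchmark before and after changing it
    echo 32 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS

The files are read per job, so new backups pick the values up without a restart; confirm with a test backup that the drives actually stream.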
 
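For #27 and #28, quick-and-dirty benchmarking needs nothing fancier than dd and a network test tool. A rough sketch (the file path and host name below are placeholders):

    # Can the source disks even feed the drives? Read a large existing file
    # with a block size matching the tape settings and check the elapsed time.
    time dd if=/path/to/large/file of=/dev/null bs=256k

    # Raw network throughput between a client and a media server
    # (iperf shown; ttcp or netperf work just as well; mediaserver1 is a placeholder)
    iperf -s                            # on the media server
    iperf -c mediaserver1 -t 60 -P 4    # on the client: 60 seconds, 4 streams

If those raw numbers are already below what your tape drives need, no amount of NBU tuning will save you.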

That’s all I have for now. Is there more? Tons, but I need to pee.

D

23 thoughts on “NetBackup best practices for ridiculously busy environments (but not exclusively).”

  1. Mike S

    D,

    You may want to add in something about long-running jobs. Don’t let them run for more than a day or so. The bpjobsdb daemon gets hung up on the activity log if it is too long (I think it’s 8k of data, but not sure). All of the bpdbjobs commands then stop returning and bog down the master, etc.

    Otherwise this post is right on! I especially like your self-description for hitman purposes…. short with blond hair and a pronounced limp?!?

    -Mike

  2. Rock

    Actually, it’s not long-running jobs that cause the problem, it’s long _winded_ jobs. A job can run for 100 hours with no ill effects (well, that may not be healthy for you); it’s when the try status details get too long that they crush the environment.


    Rock

  3. Rock

    Remote media servers (on the other side of a firewall) are a pain, and depending on the circumstances are a necessary evil. You can make them work, but you better REALLY need them to work.

    Doing the backups through a firewall is not the same as doing backups on a remote media server and sending the metadata through the firewall, but both suck. What’s probably more important to remember is that NBU has HORRIBLE security, and you are likely compromising your firewalled environment by trying to use it.


    Rock

  4. Dimitris Post author

    Thanks for the commentary. Mike, don’t blow my cover. OK, so I don’t have a limp.

    D

  5. Mike S

    Don’t worry D, I won’t out you. :-)

    And as far as the try status size is concerned, I guess the term I was looking for was “a chatty job”: one with lots of updates is a proverbial clot in the system. Rock’s description hits it on the head.

    And security holes, don’t get us started… one binary gives away the house with this product (I’m not naming it either!!!)

  6. Zekeatron

    Hi Dimitris,

    Thank you for writing your article:

    If it’s all right, I’d like to ask you a few questions based on your experience and would greatly appreciate your thoughts.

    I currently assist in managing a system that contains one master server and two media servers. All run Windows 2003 Server. The first media server is attached to an EMC array and the tape library (ADIC Scalar i2000) via separate SAN fabrics (and zones). The second media server is attached to an iSCSI (Promise) array via a separate iSCSI network, and to the tape library via the SAN fabric.

    All three servers are connected directly to the core network (Cisco 6000) via single 1Gb interfaces.

    We are backing up about 16 SAP servers (Linux, Oracle DBs – SAN-attached to the EMC), 30-40 Windows servers, and a NetApp share (via NDMP).

    As you can probably already see we are having throughput issues. I have been looking at the option of (switch) teaming / bonding a few interfaces (8 maybe). I’m going to start with two for test purposes.

    I understand how to do the bonding on the Cisco, but what drivers / software do you use on Windows? Or are you primarily Solaris?

    These Media servers are also running 32 bit Windows 2003 Server. Do you feel there is any benefit to running 64 bit, or is there HUGE benefit to running Solaris (Open Solaris?) instead?

    The other issue we are seeing is during Disk staging. We are using a 1TB store on the EMC (connected to Media1 via SAN), and a 1.5TB store on the iSCSI (connected to Media2 via iSCSI LAN). We are seeing write speeds of about 20-30MBps on BOTH the EMC and the iSCSI disks. Slow. I realize the issue is likely more to do with the network speed than the disk, but this could be part of the issue too? Thoughts?

    And what’s also really frustrating is that the SAP (Linux) servers are all SAN-attached to the same fabric as the EMC but are separately zoned (as they must be), so this means that to stage to disk, the data stream is like this:

    EMC -> (over) SAN -> (to) SAP server (Linux) -> (over) 1Gb NIC -> (through) core -> (over) 1Gb NIC -> (to) Media Server 1 (W2k3) -> (over) SAN -> (to) EMC. One big loop full of bottlenecks!

    And then to back this staged data up to tape, it’s (from) EMC -> (over) SAN -> (to) Media Server 1 (W2k3) -> (into) I/O and memory -> (over) SAN -> (to) tape.

    I’ve been looking at displacing the EMC by connecting the SAP (Linux) servers to the NetApp (instead of the EMC) in order to do more with the SnapVault / SnapMirror technologies, allowing direct-to-tape backups over the SAN vs. having the data transfer (the big loop above) and then back up to tape from the media server (I/O), over the SAN, to the tape library.

    I’m not sure how well this works yet; have you done this? I guess the NetApp can be seen as a NAS by the media server, and this way the data goes NetApp -> SAN -> tape library. No I/O staging on the media server, and no network bottlenecks?

    So we’re not a huge setup, and while I’d love to team 4 x 10Gb interfaces, I don’t see that happening anytime soon. Even getting a single 10Gb interface is likely out of the question. Though I’m thinking of a combination of a 64-bit OS with 4-8 teamed/bonded Ethernet interfaces (switch-side) at 4-8Gbps (really 200-400MBps of throughput). What are your thoughts on this?

    Finally, in this real world we live in, I’m seeing a 1000Mb NIC transfer at about 40-50MBps (to be clear, megabytes, not bits) MAX. What’s your experience with network throughput?

    I’ve also read about issues with teaming/bonding and NetBackup having a hissy fit over it; what are your thoughts on this as well?

    Thanks,

    Chris

  7. Dimitris Post author

    Chris,

    You pretty much answered it all yourself, but here goes:

    Unix tends to be faster for backups than Windows, but with well-patched and -tuned Windows 2003 R2 + SP2 you can do pretty well. 64-bit would work faster, typically. What tape drives are you using? LTO2+ will need a 256K blocksize to perform well.

    If you have stuff in a SAN, the fastest way to back up, by far, is to create snapshots and mount them on the media servers. You will then not have to go through the network at all.

    Pretty much the last method I would recommend for backups is iSCSI!

    If you DO insist on using iSCSI, you HAVE to do the following:

    1. iSCSI multipathing (NOT the same as bonding!)
    2. Use SEPARATE NICs for all iSCSI traffic. 2 minimum.
    3. Consider jumbo frames.

    Load balancing works in general; you have to set 802.3ad on the Cisco and then use the teaming software on the server and tell it to do the same (Intel’s teaming software works and is free; most vendors provide something for free).

    Be aware that bonding won’t make a single client backup faster. Instead, multiple clients will effectively use different NICs on the backup server.

    Don’t do 8, unless your media server is massive or you use TOE cards (expensive, plus some TOEs can’t even do bonding).

    Write speeds: DISABLE the write cache on the B2D LUNs. Also, this is kinda like “how long is a piece of string?” I’ve made disk storage units go (with fairly crappy disks) at over 200MB/s. But intelligent disk layouts are necessary. Without more info I can’t help.

    But the way it’s connected, since it all goes over the network, your network could be the issue, as you said. Try network benchmarking tools.

    Beware of the supervisors and blades on the Cisco, not all are created equal and that could also be contributing to your issues. Just because it can do gig doesn’t mean it really DOES full gig PER port while ALL ports are busy.

    An untuned windows box should be able to do maybe 60MB/s per NIC. Try jumbo frames. Might get a 30% boost. Read up on jumbo frame caveats before proceeding.

    I don’t think doing what you said with the Netapp box would be your best bet in general. Try what we discussed above…

    D

  8. Zekeatron

    Thanks Dimitris,

    A few addons per your reply:

    1. We are using 8x LTO3 drives in the i2000, with both LTO2 and LTO3 tape media in the pool at the moment. The block size is 256K in the NBU config, but I get confused on this: do you have to do any registry tweaking as well?

    2. You said “Don’t do 8, unless your media server is massive or you use TOE cards (expensive, plus some TOEs can’t even do bonding).”. Did you mean “Don’t do it”? Just want to be clear.

    3. The media servers are big machines (HP DL-585) with 4GB of RAM (the max for 32-bit Windows). If we go 64-bit we can put in more RAM; not sure if this will help much. That is to say, 8, 16, or 32 gigs of RAM?

    4. Regarding the iSCSI, the (Promise) disk is behind two NetGear switches (which support jumbo frames, but they’re not configured yet), and the Media2 server does have 2 x 1000Mb NICs for iSCSI traffic. But when staging to disk, to either the EMC or the iSCSI disk, the speeds are almost identical, so, as you said, I likely will not see a difference until the network issues are resolved.

    5. I had forgotten to mention that we did have Norton Anti-Virus installed and it was trying to scan all network traffic; we disabled it, but no difference in speed was noticed.

    6. Regarding your comment “If you have stuff in a SAN, the fastest way to back up, by far, is to create snapshots and mount them on the media servers. You will then not have to go through the network at all.”

    I believe this can only be done with the NetApp, and that is why I was thinking of using it over the EMC. Unless I’m missing something and the EMC does have a Snapshot technology?

    7. Thanks again for all your assistance; it’s funny how a mid-size backup environment has so many variables that can cause so much stress…

    Cheers.

  9. Dimitris Post author

    1. You may have to do some tweaking in windows to make it really take the 256K blocksize. Do some write benchmarks: Create a large RAM disk (say, 2GB) and create a highly-compressible 2GB file. Back up the file. With LTO3 you should get like 130MB/s+. If you’re NOT getting that, there’s some problem.

    2. I meant don’t bond 8 ports. Do less. 2-4 is best.

    3. More RAM won’t really help you. However, 64-bit windows uses 4GB “properly”.

    4. Yes, definitely check the network.

    5. Glad you checked the virus angle.

    6. EMC ABSOLUTELY does snaps! Actually, you can do snapshots (kinda similar to NetApp’s) or full clones (thereby creating a 1:1 copy on a separate set of disks, best for not impacting application performance when doing fast backups). NetApp’s marketing is great and they push their snap technology like crazy, but there are caveats. Let’s just say that, in my experience, EMC provides the highest throughput, long-term and predictable. NetApp and their WAFL fragmentation issues and the fact that all snap data has to reside in the same disks as the production LUNs is a whole separate issue and worthy of a blog entry in itself…

    7. No problem – I truly believe that the only real difference between mid-size and huge is scale. Typically, the components will be similar, just fewer… I could pull a “Curtis” and say I can’t tell you unless I charge you, but then what would the point of the blog be?

    D

  10. Thanks in Advance:-)

    First off let me say thanks and tell you I had a smile on my face while reading your experiences. Q) I need to deploy a large library with 8 fiber channel tape drives. Do you think there is any advantage to deploying multiple media servers? Do you have a rule of thumb regarding the number of HBAs in a server (W3k)? What about larger media servers with more HBAs?

  11. Dimitris Post author

    I have more questions than you do then:

    1. What kind of server exactly? (CPU, mem, chipset)
    2. How many PCI buses?
    3. 64-bit or 32-bit?
    4. What kind of tape drives?

    There’s no rule of thumb. UNIX/Linux boxes can push more data usually, unless one installs 64-bit Windows. Don’t figure you can really do more than 200MB/s comfortably on windows (which is really 2 modern tape drives). Depends on the size of the box. Get intimately familiar with the chipset and how much throughput it can provide to the bus(es).

    There’s absolutely an advantage to multiple media servers since they can fail over. If you cluster your master and have 2-4 media servers with SSO then you’ll get great throughput and great fault tolerance.

    D

  12. Charlie Brown

    Dimitris, you are the Man!

    Your commentary is the Bomb!

    Please consider this notification that this web page and its contents will be presented to numerous NBU shops as a guide to how they should do things.

    One question: any recommendations for backing up Lotus Notes Domino database files? There are over 1,500 Notes files in the same folder location, using 1.2 terabytes. It takes almost 36 hours to back up. Your thoughts?

    Thanks Again for this treasure,

    Charlie Brown

  13. Dimitris Post author

    Thanks Charlie.

    Regarding your poor throughput:

    Are you using the Notes agent or not?

    1,500 files isn’t that many really. Have you done your due diligence to see if there are low-hanging fruit? Bottlenecks?

    Like, is it thrashing the disk, the network, the CPU?

    Thx

    D

  14. biafran

    Hello D,
    Thank you for this great post. I need help with the setup of media servers in the SSO. I am running NetBackup 6.5.2 in a Solaris 9 environment. I have one master server and 25 media servers in the SSO, and I need to add a new one that is on another network segment. The original group was on the 192.168.25 network; the new server is on the 192.168.108 network. The servers are segregated for security reasons. I am concerned about putting the new server in the SSO because I anticipate communication challenges and difficulty using the EMM database. I am considering putting an interface to 192.168.25 on the new server and opening up NBU ports in the firewall. What do you suggest?

  15. Chris

    Hi Dimitris, super interesting stuff you mention here.
    I’m working with some guys on tuning up a largish NB setup at present and I would love to know more about points 2, 10 and 19.
    Hope to hear more from you!

  16. Dimitris Post author

    Sorry took me so long to reply, busy…

    #2: The voodoo tunings aren’t to be used lightly, and one needs a deep knowledge of OS internals in order to implement them properly. So it’s a bit hard to just put out some blanket tuning parameters. Send me more info and we can talk more.

    #10: Simply put, ensure the right block size for the tape drive; LTO needs larger block sizes than DLT (256K).

    #19: Better not to overload ISLs. Keep SAN traffic localized by having backup servers and their tape drives on the same switch if at all possible. Otherwise, beef up your ISLs with 4 ports each and get a trunking license for the switches, too.

    D

  17. Chris

    Thanks for responding Dimitris.

    I think the sheer variety of systems being backed up here would make it too difficult to provide the details you need so you can recommend ‘voodoo’ tuning parameters.

    Point #10 is definitely yielding good results with the systems that have been migrated so far.

    We are watching our ISLs closely to make sure we are not saturating them.

  18. VENKATA RAMIREDDY

    Those are very nice inputs related to NetBackup.

    I would like to know the maximum number of clients that can be configured on one NBU 6.5 master server (so far we have configured around 1,000 clients), and the best practices related to it.

    I would also like to know the maximum capacity of the NetBackup 6.5 catalog database.

