20121029

Sins of the Bond

Here I explore the interesting notion of bonding a set of bonds.

I'm working on configuring some very awesome, or perhaps convoluted, networking as I prepare to re-architect an entire network.  As I'm in the midst of this, I'm trying to decide how best to go about utilizing a four-fold set of links.  Basically:

  • I have a bunch of 4-port gigabit cards.
  • At least 4 of the 12 links are dedicated to one particular subnet - I plan on blasting the hell out of it.

4 gigabits per second?  I'd sure like that.  Dual-switch redundancy?  I'd like that even better.  I can't have both at the same time.  But can I get close?

For the remainder of this entry, I'll be referring to the Linux bonding driver and its various modes.

I usually use mode 6 (balance-alb) because I want the most out of my bonds, and because I have clients that can actually use that sort of bandwidth.  However, it only really provides 1 link per connection.  So, if I add 4 links to a mode-6 bond, any single connection still tops out at 1 g/sec of throughput, but I can do that to 4 systems simultaneously.  That's great, but not good enough if I only have three or four systems to connect up.
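
For context, a plain single-level mode-6 bond is the sort of thing I'd normally declare straight in /etc/network/interfaces with the ifenslave package installed.  A rough sketch, with made-up port names:

auto bond0
iface bond0 inet static
   address 192.168.1.2
   netmask 255.255.255.0
   bond-mode balance-alb
   bond-miimon 100
   bond-slaves e0p0 e0p1 e0p2 e0p3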

My new switches support static and dynamic LACP.  That's really nice, but it also means that I'll either have to put all four links on one switch, or run mode-6 across two mode-4 bonds.  That's 2 g/sec per single connection, and a total of 4 g/sec of throughput across up to two machines.  Naturally, you'd have to work hard to saturate that, so I expect it would spread the load quite nicely.

So, what are all the options?

  1. Simplest:  Operate mode-6 across all links and both switches.  This achieves bandwidth of 1 g/sec per connection, up to 4 g/sec aggregate across all connections.
  2. Semi-redundant:  Operate mode-4, keeping all links on one switch.  This is not preferred, and moreover won't achieve much benefit if the inter-switch links cannot handle that capacity.  Offers 4 g/sec no matter what.
  3. Mode-6+4+4:  Operate two mode-4 bonds, one bond per participating switch, and bond those together with mode-6.  2 g/sec per connection, up to 4 g/sec aggregate.  We can lose any three links or either switch and still operate.

I am leaning toward Option 3.  It's a compromise to be sure, but it will guarantee that I get both the throughput I am looking for and the redundancy I need.  In the future I can always increase the size of the mode-4 bonds by adding more NICs.
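
For example, growing one of the inner bonds later should just be one more sysfs call against the bond names I set up below (e4p0 being a hypothetical port on some future card):

echo +e4p0 > /sys/class/net/idmz1/bonding/slaves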

The first challenge is setting up a bond within a bond.  Ubuntu 12.04 offers some nice things in its network configuration file, but I think it has some bugs and/or takes a few things upon itself that maybe it shouldn't.  Specifically, I've noticed it fails to order the loading of the various stacked network devices correctly.  I dug into the kernel documentation for the bonding driver, and after toying with it for a while I came up with this set of sysfs-based calls:
#!/bin/bash

modprobe bonding

SYS=/sys/class/net

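# tear down any pre-existing bonds from a previous run
# (outer mode-6 bond first, then the two inner mode-4 bonds)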
if [[ -e $SYS/idmz ]]; then
  echo -idmz > $SYS/bonding_masters
fi

if [[ -e $SYS/idmz1 ]]; then
  echo -idmz1 > $SYS/bonding_masters
fi

if [[ -e $SYS/idmz2 ]]; then
  echo -idmz2 > $SYS/bonding_masters
fi


# create master bonds
echo +idmz > $SYS/bonding_masters
echo +idmz1 > $SYS/bonding_masters
echo +idmz2 > $SYS/bonding_masters

# configure bond characteristics
echo 6 > $SYS/idmz/bonding/mode
echo 4 > $SYS/idmz1/bonding/mode
echo 4 > $SYS/idmz2/bonding/mode

echo 100 > $SYS/idmz/bonding/miimon
echo 100 > $SYS/idmz1/bonding/miimon
echo 100 > $SYS/idmz2/bonding/miimon

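# enslave the physical ports: two ports per mode-4 bond, each inner bond homed to its own switch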
echo +e1p2 > $SYS/idmz1/bonding/slaves
echo +e3p2 > $SYS/idmz1/bonding/slaves
echo +e2p3 > $SYS/idmz2/bonding/slaves
echo +e3p3 > $SYS/idmz2/bonding/slaves

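# stack the two mode-4 bonds under the mode-6 bond
# (each inner bond must already have at least one slave, or this step fails; see below)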
echo +idmz1 > $SYS/idmz/bonding/slaves
echo +idmz2 > $SYS/idmz/bonding/slaves


These calls achieve exactly what I want: two 2 gigabit bonds that can live on separate switches, and yet appear under the guise of a single IP address.  The only thing I have omitted from the above example is the ifconfig on idmz for the network address.  This can evidently also be accomplished through sysfs.
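
Doing that last bit by hand is just the usual address assignment on idmz (using the 192.168.1.2/24 address from the stanza further down), for example:

ifconfig idmz 192.168.1.2 netmask 255.255.255.0 up

or, with the ip tool:

ip addr add 192.168.1.2/24 dev idmz
ip link set idmz up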

I've toyed around with /etc/network/interfaces a bit, and just couldn't get it to act the way I wanted.  I need an ordering constraint, or some sort of smarter dependency tracking.  Well, I'm not going to get it, so a custom launch-script is probably the necessary thing for me.  I have, luckily, worked it out such that I can still use upstart to configure the adapter, route, and DNS:
iface idmz inet static
   pre-up /root/config-networking.sh
   address 192.168.1.2
   netmask 255.255.255.0

(I have obviously omitted the route and DNS settings here.)  The config-networking.sh script is essentially what you see above, with all the sysfs calls.  It's not terribly elegant, but it gets the job done.  You are probably wondering where my eth0, eth1, eth2 adapter names are.  I renamed them in the persistent-net udev rules to correspond to the adapter ports.  I have, after all, 12 ports, not counting the two onboard ports that are currently nonfunctional (that's another story).  e1p2 is the third port on the second card, counting from zero.
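
For reference, those renames live in /etc/udev/rules.d/70-persistent-net.rules, one rule per port, each looking roughly like this (the MAC address here is made up):

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:11:22:33:44:56", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="e1p2"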

In case you want to do some poking, make sure to ls and cat the files in /proc/net/bonding.  You can easily interrogate the various bonds there as to their state and membership.  It was here that I discovered my mode-6 bond simply kept refusing to add my mode-4 bonds.  The basic issue appears to be that if you try to enslave an empty bond-master to another bond-master, configuration of the new master fails.  The bond being enslaved needs to have at least one adapter under it already.
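
A tiny loop makes that poking a little quicker (just a convenience, nothing specific to this setup):

for f in /proc/net/bonding/*; do
  echo "== $f =="
  cat "$f"
done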

Configuration failures have been catastrophic in nature; all my bonds and other devices utterly failed to start when the requested /etc/network/interfaces configuration of idmz didn't make sense to /sbin/ifup.  At best, the ifup/upstart/parallel-launch functionality makes bond configuration non-deterministic.  What appears to be extra stupid is that if my two-stage bond is the only thing getting configured, it doesn't get configured at all.  ifup magically ignores it.

I am still considering doing full network configuration in my manual script, just for the sake of knowing it will always work.  In fact, that is looking like the only real option for at least a good portion of the interfaces.

Ugh.
