20121031

Bonded Bonds - A Brief Follow-up

I am reconsidering my original plan to join two mode-4 bonds under a mode-6 super-bond.

To be fair, it works.  BUT, link recovery time after a total, catastrophic all-link failure is not good.  I would guess this is an edge case that no one has ever really thought about before, and perhaps there is insufficient maturity in this sort of functionality.  OR I have horribly misconfigured the super-bond device.  More tuning and testing are in order.

Basically, I have my two bonds connected to two switches with 802.3ad running.  Everything looked smooth, and aside from the fact that I can't get any iperf bandwidth tests to show rates above 1 Gbit/sec, I felt it was fairly stable.  With the mode-6 super-bond joining the two mode-4 bonds, I could easily disconnect any three wires and still have connectivity, albeit with a slight (1 to 3 second) delay if the currently active slave was not the last mode-4 bond to be connected.  I suspect that delay is ARP-related, but that's just a hunch.
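For reference, my bandwidth tests are roughly the following (iperf 2 syntax; the target host names are placeholders).  A single stream will only ever show one link's worth of throughput, so I run parallel streams against several targets and watch whether the aggregate climbs above 1 Gbit/sec:

  # on each target machine
  iperf -s

  # on the bonded host: four parallel streams per target, run concurrently
  iperf -c nodeA -P 4 -t 30 &
  iperf -c nodeB -P 4 -t 30 &
  wait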

The trouble began when I simulated a multi-switch failure.  Ordinarily I would expect that once either switch is restored, the super-bond would come back up.  That was either (a) not the case, or (b) taking so long to occur that it was unacceptable.  Of course, if a multi-switch failure really occurred during live operation, chances are I would not be able to react in time to prevent other, more catastrophic consequences, so maybe this is an edge case that I should worry less about.

For the record, to test the above failure case, I pulled the cables first from switch A, then from switch B, one at a time.  After a few seconds I plugged a cable back into switch A.  No result.  I'll have to test again to be certain, but I had to plug both cables back into switch A, or both back into switch B, before I saw connectivity restored.  I want to say it was the latter case - the last switch to disconnect being the first one that would need to reconnect.

Using mode-6 on all links does allow me to traverse the switches and fail both switches, the last switch to fail being the last to be brought back up, with no appreciable delay in restoration of connectivity.  I suspect there is some failure to communicate between the super-bond and its subordinate bonds, but without digging through the code or asking people a lot more knowledgeable than myself, I will never know for certain.

Of course, on that note, Ubuntu 12.04 hasn't been exactly friendly about bringing my links back up after reboots.  If it can't detect a carrier on one of the lines, it won't bring the interface back.  Worse yet, it won't (or didn't) add it to the mode-6 bond on boot.  If it's a deterministic failure on boot, fine.  If it's non-deterministic, so much the worse, but the outcome would be the same: here again I may need to fall back on my manual script for initial configuration, as I want to guarantee that no matter the circumstances, the system is configured to expect connectivity someday, if not immediately.

I plan on doing additional bandwidth testing, and will probably try the stacked bonding one more time for kicks.  I really want the maximum possible bandwidth along with the maximum possible reliability.  We shall see how things turn out.

20121029

Sins of the Bond

Here I explore the interesting notion of bonding a set of bonds.

I'm working on configuring some very awesome, or perhaps convoluted, networking as I prepare to re-architect an entire network.  As I'm in the midst of this, I'm trying to decide how best to go about utilizing a four-fold set of links.  Basically:

  • I have a bunch of 4-port gigabit cards.
  • At least 4 of the 12 links are dedicated to one particular subnet - I plan on blasting the hell out of it.
4 gigabits per second?  I'd sure like that.  Dual-switch redundancy?  I'd like that even better.  I can't have both at the same time.  But can I get close?

For the remainder of this entry, I'll be referring to the Linux bonding driver and its various modes.

I usually use mode 6 (balance-alb) because I want the most out of my bonds, and because some clients may actually need that sort of bandwidth.  However, it really only provides one link's worth of bandwidth per connection.  So, if I add 4 links to a mode-6 bond, any single connection still tops out at 1 Gbit/sec, but I can sustain that to 4 systems simultaneously.  That's great, but not good enough if I only have three or four systems to connect up.

My new switches support both static link aggregation and dynamic LACP.  That's really nice, but it also means that I'll either have to put all four links on one switch, or run mode-6 across two mode-4 bonds.  That's 2 Gbit/sec per single connection, for a total of 4 Gbit/sec of throughput once at least two machines are involved.  Naturally, you'd have to work hard to saturate that, so I expect it would spread the load quite nicely.

So, what are all the options?

  1. Simplest:  Operate mode-6 across all links and both switches.  This achieves bandwidth of 1 Gbit/sec per connection, up to 4 Gbit/sec aggregate across all connections.
  2. Semi-redundant:  Operate mode-4, keeping all links on one switch.  This is not preferred, and moreover won't achieve much benefit if the inter-switch links cannot handle that capacity.  Offers 4 Gbit/sec no matter what.
  3. Mode-6+4+4:  Operate two mode-4 bonds, one bond per participating switch, and bond those together with mode-6.  2 Gbit/sec per connection, up to 4 Gbit/sec aggregate.  We can lose any three links or either switch and still operate.
I am leaning toward Option 3.  It's a compromise to be sure, but will guarantee that I get both the throughput I am looking for, and the redundancy I need.  In the future I can always increase the size of the mode-4 bonds by adding more NICs.

The first challenge is setting up a bond within a bond.  Ubuntu 12.04 offers some nice things with its network configuration file, but I think it has some bugs and/or takes a few things upon itself that maybe it shouldn't.  Specifically, I've noticed it fails to order the loading of the various stacked network devices correctly.  I started into the kernel documentation for the bonding driver, and after toying with it for a while I came up with this set of sysfs-based calls:
#!/bin/bash

modprobe bonding

SYS=/sys/class/net

# tear down any pre-existing bond devices so we start from a clean slate
if [[ -e $SYS/idmz ]]; then
  echo -idmz > $SYS/bonding_masters
fi

if [[ -e $SYS/idmz1 ]]; then
  echo -idmz1 > $SYS/bonding_masters
fi

if [[ -e $SYS/idmz2 ]]; then
  echo -idmz2 > $SYS/bonding_masters
fi


# create master bonds
echo +idmz > $SYS/bonding_masters
echo +idmz1 > $SYS/bonding_masters
echo +idmz2 > $SYS/bonding_masters

# configure bond characteristics
echo 6 > $SYS/idmz/bonding/mode
echo 4 > $SYS/idmz1/bonding/mode
echo 4 > $SYS/idmz2/bonding/mode

echo 100 > $SYS/idmz/bonding/miimon
echo 100 > $SYS/idmz1/bonding/miimon
echo 100 > $SYS/idmz2/bonding/miimon

# enslave the physical ports: e1p2 and e3p2 under idmz1, e2p3 and e3p3 under idmz2
echo +e1p2 > $SYS/idmz1/bonding/slaves
echo +e3p2 > $SYS/idmz1/bonding/slaves
echo +e2p3 > $SYS/idmz2/bonding/slaves
echo +e3p3 > $SYS/idmz2/bonding/slaves

# finally, enslave the two mode-4 bonds to the mode-6 super-bond
echo +idmz1 > $SYS/idmz/bonding/slaves
echo +idmz2 > $SYS/idmz/bonding/slaves


These calls achieve exactly what I want: two 2-gigabit bonds that can live on separate switches, and yet appear under the guise of a single IP address.  The only thing I have omitted from the above example is assigning the network address to idmz; that could just as easily be tacked onto the end of the script.
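For completeness, the omitted step would look something like the following (a minimal sketch using iproute2; in practice the interfaces stanza below ends up handling the address):

  # assign the address to the super-bond and bring it up
  ip addr add 192.168.1.2/24 dev idmz
  ip link set idmz up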

I've toyed around with /etc/network/interfaces a bit, and just couldn't get it to act the way I wanted.  I need an ordering constraint, or some sort of smarter dependency tracking.  Well, I'm not going to get it, so a custom launch-script is probably the necessary thing for me.  I have, luckily, worked it out such that I can still use upstart to configure the adapter, route, and DNS:
iface idmz inet static
   pre-up /root/config-networking.sh
   address 192.168.1.2
   netmask 255.255.255.0
(I have obviously omitted the route and DNS settings here.)  The config-networking.sh script is essentially what you see above, with all the sysfs calls.  It's not terribly elegant, but it gets the job done.  You will probably be wondering where my eth0, eth1, eth2 adapter names are.  I renamed them in the persistent-net udev rules to correspond to the adapter ports.  I have, after all, 12 ports not counting the two onboard that are currently nonfunctional (that's another story).  e1p2 is the third port on the second card, counting from zero.
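For anyone unfamiliar with the renaming trick, it's just the usual entries in /etc/udev/rules.d/70-persistent-net.rules with the NAME field changed - one line per port, something like this (the MAC address here is a placeholder):

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1b:21:aa:bb:cc", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="e1p2"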

In case you want to do some poking, make sure to ls and cat the files in /proc/net/bonding.  You will be able to easily interrogate the various bonds as to their state and membership.  It was here I discovered that my mode-6 bond simply kept refusing to add my mode-4 bonds.  The basic issue appears to be that if you try to add an empty bond-master to another bond-master, configuration of the new master fails.  The subordinate bond-master needs to have at least one adapter under it before it can itself be enslaved.
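For example (idmz is my naming; the fields of interest are the mode, the currently active slave, and each slave's link status):

  ls /proc/net/bonding
  cat /proc/net/bonding/idmz
  grep -E 'Bonding Mode|Currently Active Slave|Slave Interface|MII Status' /proc/net/bonding/idmz*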

Configuration failures have been catastrophic in nature; all my bonds and other devices utterly failed to start when the requested /etc/network/interfaces configuration of idmz didn't make sense to /sbin/ifup.  At best, the ifup/upstart/parallel-launch functionality makes bond configuration non-deterministic.  What appears extra stupid is that if my two-stage bond is the only thing getting configured, it doesn't get configured at all.  ifup magically ignores it.

I am still considering doing full network configuration in my manual script, just for the sake of knowing it will always work.  In fact, that is looking like the only real option for at least a good portion of the interfaces.

Ugh.

20121028

Absolute Evil

So, in my previous post, I outlined pretty much all of the steps I was taking to make 11.10 talk CMAN+DLM.

I've now uncovered another interesting bit of horror.

Perhaps this is well known, documented, and all that good stuff...

I had my 11.10 server running fine, talking with the 12.04 server.  Then I decided to let the 12.04 server take command of the cluster.  As the DC, it "improved" the recorded DC version to 1.1.6-blahwhatevernonsense.  This has two interesting results: first, the 11.10 crm, which is on 1.1.5, is no longer able to interpret the CIB.  Second, when I tried to shunt back from 12.04 to 11.10 as the DC, the world pretty much ended in a set of hard reboots.

But I did learn one other thing: 12.04 could still talk to 11.10 and order it around.  So, even though 11.10's crm couldn't tell us what the hell was happening, it was, in fact, functioning perfectly normally underneath the hood.  It was able to link up with the iSCSI targets, mount the OCFS2 file systems, and play with the DLM.

I'm now back to my original choice, with one other option:
  • Just bite the bullet and upgrade to 12.04 before upgrading the cluster stacks, incurring hours and hours of cluster downtime.
  • "Upgrade" pacemaker on 11.10 to 1.1.5+CMAN and transition the cluster stack and the cluster resources more gracefully, being careful to NEVER let the 12.04 machine gain the DC role - probably not possible as more machines are transitioned to 12.04.
  • Upgrade pacemaker on 11.10 to 1.1.6+CMAN and do the same as option 2 above, except without worrying about who is the DC (since then it wouldn't matter).
I did learn how to rebuild the Debian package the, um, somewhat right way, except of course for signing the packages.  That aside, it seemed to work pretty well, and I was able to build in CMAN support like I wanted.  So, I am now tempted to try option 3 and see where that lands me.  If anything, it may be the bridge-measure I need to move from 11.10 to 12.04 without obliterating the cluster for the rest of the night.
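For the record, the rebuild flow was roughly the following (a sketch; the source came from the 12.04 package, and -us -uc is what skips the package signing I punted on):

  apt-get build-dep pacemaker     # build dependencies
  apt-get source pacemaker        # or unpack the 12.04 source package by hand
  cd pacemaker-1.1.6*/            # directory name will vary
  debuild -us -uc                 # build unsigned .debs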

* * * * *
A short while later...

I'm not sure this is going to work.  In attempting to transplant the Debian sources from 12.04 to 11.10, I had to also pull in the sources for the later versions of corosync, libcman, cluster-glue, and libfence.  All development versions, too.  I'm trying to build them now on the 11.10 machine, which means I am also having to install other additional libraries.

First one finished was the cluster-glue.  Not much difficulty there.

Corosync built next.  Had to apt-get more libraries.  Each time I'm letting debuild tell me what it wants, then I go get it, and then let it fly.  

redhat-cluster may be a problem.  It wants a newer version of libvirt.  The more I build, the more I wonder just how bad that downtime is going to be...  Worse, of course, would be upgrades, except that I could just as easily nuke each node and do a clean reinstall.  That would probably be required if I go this route.

* * * * *
A shorter while later...

The build is exploding exponentially.  libvirt was really the last straw.  For kicks I'm trying to build Pacemaker 1.1.6 manually against the system-installed Corosync et al.  For the sake of near-completeness I'm using the configure flags from the debian/rules file.

The resulting Pacemaker 1.1.6 works, and seems to perform acceptably well.  The 11.10 machine it's running on may be having some issues, related or unrelated to the differing library versions, Pacemaker builds, or perhaps even the kernel.  There were some rsc_timeout things happening in there.  I performed a hard-and-fast reboot of the machine, though that's not really something I can afford to do on the live cluster.  I've seen this issue before, but have never pinned it down nor had the time to help one of the maintainers trace it through different kernel versions.  I also didn't have the hardware to spare; now, it seems, I do.  It may actually be related, in some strange way, to open-iscsi.

It leaves me a bit uneasy, as I am now not sure I can rely on this path to upgrade my cluster easily.  I can't have machines spontaneously dying on me due to buggy kernels or iSCSI initiators or what have you.

The Final Verdict

My goal is to transition a virtualization cluster to 12.04, partly because it's LTS, partly because it's got better libvirt support, and partly because it has to happen sooner or later.  I have a new 16-core host that might be able to take the whole load of the four other machines I'll be upgrading; I just won't be able to quietly and secretly transition those VMs over.  I'll have to shut them all down, configure the new host to be the only host in the cluster, adjust the cluster stack, and then bring them all back up.

I could do that.  I could even do it tonight, but I'm going to wait till I'm back in the office.  The HA SAN (another bit of Pacemaker awesomeness) is short on network bandwidth, as is the new host I want to use.  I'll want to get that a little more robust before I start pushing tons of traffic.  The downside to this approach is that I'm left completely without redundancy while I upgrade the other machines.  Of course, each completed upgrade means another machine worth of redundancy added back into the cluster.

I may attempt the best of both worlds: take one machine offline, upgrade it, and pair it with the new host.  With those two hosts up, we can bring the rest of the old cluster down and swap out the OCFS2 cluster stack.  Yes, that may work well.  At this point, I think trying to sneak 1.1.6 into 11.10 is going to be too risky. 




Cluster Upgrade: Ubuntu Server 11.10 to 12.04 LTS

I will attempt to detail here the upgrade path taken as I migrate from 11.10 to 12.04 LTS.  I am using Corosync+Pacemaker+OCFS2 on 11.10.  12.04 appears to require transitioning to CMAN in order for the DLM to work again.  DLM is required for OCFS2.

This is what I suspect, or tend toward, or have read about:
  • I will need to migrate my file systems to the new cluster stack - this should be doable with tunefs.ocfs2.
  • I will need to upgrade the OS either before or after installing and configuring CMAN.
  • I need CMAN before I can migrate OCFS2 to the new stack.
  • I am unsure what will happen to my existing resources after the migration.
  • One of my nodes is already on 12.04.  The other is on 11.10 (this is a test cluster, by the way).
Because I'll be fundamentally affecting the DLM, I will need to shut down the 11.10 node completely (as far as the cluster is concerned) before acting on it.  At least, that is what everything I've read and learned and suspect is telling me.

I have shut down the node by first putting it into standby.  Pacemaker and Corosync are then brought offline, in that order.  My cluster configuration contains mostly what is presented in the Clusters From Scratch documentation, but with some personalized modifications:
  • I am using the <dlm protocol="sctp"> option.  This goes inside the <cluster> block.
  • I have set the keyfile to that which Corosync used to use:  <cman keyfile="/etc/corosync/authkey">
  • I have defined a specific multicast address inside the <cman> block: <multicast addr="226.94.94.123"/>
The SCTP option appears to be a Nice Thing To Have.  The cluster.conf man page says it's required when Corosync is involved.  I don't honestly know what it all means.  The keyfile is not required, but I thought it would be handy.  The multicast address is also not required, obviously.  By default, both the key and the address are generated from the cluster name.  I am defining them explicitly here because I'm toying around and like the notion of being able to define an address and key that will NEVER EVER CHANGE.  EVER.
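Pieced together, the relevant parts of cluster.conf look something like this (a sketch only; the cluster and node names are placeholders, and the two_node/expected_votes attributes are the usual two-node quorum settings):

<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <clusternodes>
    <clusternode name="hv05" nodeid="1"/>
    <clusternode name="hv06" nodeid="2"/>
  </clusternodes>
  <cman two_node="1" expected_votes="1" keyfile="/etc/corosync/authkey">
    <multicast addr="226.94.94.123"/>
  </cman>
  <dlm protocol="sctp"/>
</cluster>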

I am running ccs_config_validate on each node to make sure everything is kosher.  I found that it complained loudly when the fence-agents package was not installed.  I will dump a list of the apt-gets I did at the bottom of this post.  As I probably mentioned in another post, Ubuntu Server used to configure the /etc/hosts file with 127.0.1.1 pointing to the host name.  This screws up cman very nicely: it has auto-detect magic, and it binds itself to this pseudo-loopback instead of the real adapter.  If your machines don't connect, run

corosync-objctl -a | grep ^totem

and you might see:  totem.interface.bindnetaddr=127.0.1.1

Look familiar?  What a pisser...  With that fixed, both nodes now appear when I run cman_tool nodes.  Now I shall attempt to upgrade the cluster stack.  Before I can do this, however, I need to make some subtle changes.  For starters, the cluster configuration can no longer fuck around with the DLM.  It's managed by CMAN now, and if we toy with it we'll break everything.  I posted about that before, also.
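For reference, the fix is simply to make the hostname resolve to the node's real address in /etc/hosts rather than to 127.0.1.1 (the address below is a placeholder - use the node's actual cluster-facing address):

127.0.0.1   localhost
192.168.1.6 hv06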

I start pacemaker and then enter configuration-mode.  Wait, no I don't.  11.10's version of Pacemaker doesn't support CMAN.  Now I remember why I dreaded this fucking upgrade.  I have two choices now:
  • Perform the upgrade to 12.04 and get CMAN-support that way.
  • Obtain or build Pacemaker with CMAN support manually.
I opt to try my hand at a build.  I've never done this before on Ubuntu, so the tool setup is unfamiliar.  Getting to a ./configure is my goal, as that will be known turf.

First, pull the necessary packages down - I don't know which of these are actually needed:

  apt-get install pacemaker-dev libcman-dev libccs-dev

The build dependencies:  apt-get build-dep pacemaker
Then the source: apt-get source pacemaker  (Do this in the directory you want it to wind up in.)
I also needed:  apt-get install libfenced*

cd into the pacemaker-x.x.x directory and run ./autogen.sh.
Then:
    ./configure --prefix=/usr --with-ais --with-corosync --with-heartbeat --with-cman --with-cs-quorum --with-snmp --with-esmtp --with-acl --disable-fatal-warnings --localstatedir=/var --sysconfdir=/etc

Yes, I turned everything on.  If it works correctly, you should see a feature line near the end of the output, and CMAN had BETTER be there.  Had to disable fatal warnings because there were, um, stupid warnings that were fatal to the build.  Let's hope they're just warnings, eh?

  make

Now sit and wait.  Hmmmm....  this is a nice fast machine, and now it's done!  Now for the part where I shoot myself in the foot:

  make install

If you were kinda dumb like me, you may have built this as root.  Not advisable, generally speaking, but I'm to the point I don't care.  I've built entire Linux deploys from scratch (read: Linux From Scratch), so it's pretty much the same thing over and over and over for me.  I don't do LFS anymore, by the way.  Package management features of distros like Ubuntu are just too damn shiny for me to ignore any longer.


Now I discover that I did not have my paths entirely correct during the configuration step.  My cluster configuration is nowhere to be seen, because it's probably being sought in /usr/var/blah/blah/blah.  And it is so.  I've modified the above configure command to be more correct.  And now, except for a strange "touch: missing operand" bit of complaining on the part of the init script, the binary has found my cluster configuration.  (edit: that last issue has also been fixed by the addition of the flag that sets /etc as a good place to go.)

With the new pacemakerd binary in place, I can get the thing started under the watchful gaze of CMAN.  Now I have to update the cluster config to reflect the fact that the o2cb stack is now CMAN.  Refer to the RA metadata for this, I won't repeat it here.  With that done, I can bring at least the 11.10 node back online.  The 12.04 node actually doesn't have all the necessary packages to make it work yet.

Predictably, the mount of the OCFS2 iSCSI drives fails - they're the wrong stack.  BUT, the o2cb driver is up, and the iSCSI drives are connected.  With that, I can do the following:

root@hv06:~# tunefs.ocfs2 --update-cluster-stack /dev/sdc
Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? yes

So, we can learn from this that updating the cluster stack must be done while the ENTIRE FRIGGING CLUSTER IS DOWN.  I'll try to remember that for when I do this for real.  I can now clean up the two failed filesystem resources, and lo and behold they mount!  I can now configure the 12.04 machine to follow suit.

Conclusion

Now, having done all this, I am left to wonder whether or not this is in fact the best route.  On the up-side:
  1. I can easily copy the custom-built pacemakerd bits to all my 11.10 cluster machines and forgo the painful build steps.
  2. I can take the cluster down for a very short period of time.
  3. This should not affect any data on the OCFS2 volumes, as we're just updating the cluster stack.
  4. I can take my time while upgrading the individual machines to 12.04, or even leave them on 11.10 a bit longer.
  5. I could build the latest-and-greatest pacemaker and use that instead of 1.1.5.
OK, point #5 is probably not as feasible as I would like it to be, nor is it necessary in my situation.  On the down-side:
  1. I still have to take the ENTIRE cluster down to make this "minor" update.
  2. I could just push my new server back to 11.10, join it to the cluster, migrate my VMs there and then upgrade the rest of the cluster.
  3. But if I do #2 here, I will still at some point have to bring the whole cluster down just to update the OCFS2 cluster stack info to CMAN.
  4. I may be causing unknown issues once I push the 11.10 machines to 12.04.

At least point #4 of the downsides is something I can test.  I can push that machine to 12.04 and see what happens during the upgrade.  Ideally, nothing bad should come of it; the pacemakerd binary should get upgraded to the package maintainer's version, 1.1.6-ubuntu-goodness-and-various-patches-go-here.

This will probably be useful to me: http://www.debian.org/doc/manuals/packaging-tutorial/packaging-tutorial.en.pdf

We shall see.

20121027

Forget the Brain... DLM in Ubuntu 12.04

This is a stream-of-research post - don't look for answers here, though I do link some interesting articles.

I'm in the process of preparing my cluster for expansion, and in the midst of installing a new server I inadvertently installed 12.04.1 instead of 11.10.  The rest of the cluster uses 11.10.

Some important distinctions:
  • 12.04 seems to support CMAN+Corosync+Pacemaker+OCFS2 quite well.
  • The same is not certain on 11.10.
  • 12.04 NO LONGER has dlm_controld.pcmk.
  • Trying to symlink or fake the dlm binary on 12.04 does not appear to work, from what I remember.

You CAN connect 12.04's and 11.10's Corosyncs and Pacemakers, but as far as I can tell, only if you Don't Need DLM.

I Need DLM.

So, I am trying to understand CMAN a bit better.  Here are some interesting articles:

Configuring CMAN and Corosync - this explains why some of my configurations failed brutally - http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf

Understanding CMAN and Corosync - written earlier than the above document - http://people.redhat.com/ccaulfie/docs/Whither%20cman.pdf

In summary - the CMAN-related binaries load Corosync, but CMAN itself is a plugin for Corosync, providing quorum support. 

Uuuhhhgggg...

CMAN generates the necessary Corosync configuration parameters from cluster.conf and defaults. 
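If you're curious what it actually came up with, the corosync-objctl dump used earlier works here too once cman is running (a quick sketch):

  corosync-objctl -a | grep -E '^(totem|cluster)'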

Corosync appears to be the phoenix that rose from the, well, ashes of the OpenAIS project, since that project's page announces the cessation of OpenAIS development.  Corosync deals with all messaging and whatnot, and it thus appears that CMAN is providing definitive quorum information to Corosync even though Corosync has its own quorum mechanisms (which, if I read it right, are distilled versions from earlier CMAN incarnations).



20121012

Ubuntu Server 12.04 - Waiting for Network Configuration

Just ran across an interesting issue, and the forums I've read so far don't provide a real clear answer.  I don't know that this is the answer, either, but it may be worth pursuing.  This is a bit of stream-of-consciousness, by the way - my apologies.

I just set up some new servers and was in the midst of securing them.  The first server I started on has a static IP, a valid gateway, valid DNS server, and all the networking checked out.  On reboot, however, it would take forever to kill bind9, and then I'd see almost two minutes worth of "Waiting for network configuration."  Well, there are only statically-assigned adapters present, and the loopback (which was left in its installer-default state).

I had introduced a slew of rules via iptables, and I suspect they were wreaking havoc with the boot/shutdown procedures.  If someone else is experiencing this problem, try nuking your iptables rules and make sure they don't reload on reboot - hopefully you'll see everything come back up quickly.  UFW users would obviously need to disable ufw.  FWIW, I placed my iptables loader script in the /etc/network/if-pre-up.d/ folder, so it's one of the first things to crank up when networking starts.
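"Nuking" the rules for a test looks something like this (set the policies to ACCEPT before flushing so a default-DROP policy doesn't lock you out):

  iptables -P INPUT ACCEPT
  iptables -P FORWARD ACCEPT
  iptables -P OUTPUT ACCEPT
  iptables -F
  iptables -X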

Now, I have similar iptables configurations present on other machines, and I don't know that those machines specifically have the same problem.  That being said, I really haven't rebooted them frequently enough to notice.

* * * * *

After a bit more experimentation, it appears there is some dependency on allowing OUTPUT to the loopback.  Specifically, I'm looking at logs that note packets being sent from my machine's configured static address to the loopback, and consequently they're being dropped by my rules.  They're TCP packets to port 953.  This is apparently rndc, and related to BIND, which makes sense since my other machines do not run BIND daemons.

This rule, while not the most elegant, and probably not the most correct, fixes the issue for now:

-A OUTPUT -m comment --comment "rndc" -o lo -p tcp --dport 953 -j ACCEPT

It is probably important to note that this machine is not a gateway and so drops any packets that would be forwarded.  I suppose I'm hoping this will be secure, but I just get a strange feeling something more needs to be done.
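The more conventional fix - one I may adopt instead - is to accept all loopback traffic outright, in the same rule format as above:

-A INPUT -i lo -j ACCEPT
-A OUTPUT -o lo -j ACCEPT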

More on this later, hopefully.