20121028

Absolute Evil

So, in my previous post, I outlined pretty much all of the steps I was taking to make 11.10 talk CMAN+DLM.

I've now uncovered another interesting bit of horror.

Perhaps this is well known, documented, and all that good stuff...

I had my 11.10 server running fine, talking with the 12.04 server.  Then I decided to let the 12.04 server take command of the cluster.  As the DC, it "improved" the recorded DC version to 1.1.6-blah-whatever-nonsense.  This had two interesting results: first, the 11.10 crm, which is on 1.1.5, was no longer able to interpret the CIB.  Second, when I tried to shunt the DC role back from 12.04 to 11.10, the world pretty much ended in a set of hard reboots.
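For the record, a couple of read-only queries would have shown me what the new DC had written into the CIB before I tried shunting things back.  These tools ship with Pacemaker 1.1; the exact output format varies by version:

```shell
# Query the dc-version attribute the current DC has recorded in the CIB.
crm_attribute -G -t crm_config -n dc-version

# The opening <cib> tag also carries the schema the CIB now validates against.
cibadmin -Q | head -n 1
```

If the 11.10 node's crm can't parse what those report, handing it the DC role is asking for trouble.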

But I did learn one other thing: 12.04 could still talk to 11.10 and order it around.  So, even though 11.10's crm couldn't tell us what the hell was happening, it was, in fact, functioning perfectly normally underneath the hood.  It was able to link up with the iSCSI targets, mount the OCFS2 file systems, and play with the DLM.

I'm now back to my original choice, with one other option:
  1. Just bite the bullet and upgrade to 12.04 before upgrading the cluster stacks, incurring hours and hours of cluster downtime.
  2. "Upgrade" pacemaker on 11.10 to 1.1.5+CMAN and transition the cluster stack and the cluster resources more gracefully, being careful to NEVER let the 12.04 machine gain the DC role - probably not possible as more machines are transitioned to 12.04.
  3. Upgrade pacemaker on 11.10 to 1.1.6+CMAN and do the same as option 2 above, except without worrying about who is the DC (for then it wouldn't matter).
I did learn how to rebuild the Debian package the, um, somewhat right way, except of course for signing the packages.  That aside, it seemed to work pretty well, and I was able to build in CMAN support like I wanted.  So, I am now tempted to try option 3 and see where that lands me.  If anything, it may be the bridge measure I need to move from 11.10 to 12.04 without obliterating the cluster for the rest of the night.
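For anyone curious, the rebuild went roughly like this.  This is a sketch from memory; treat the exact package names and the --with-cman flag as approximate, that flag being the bit I wanted the packaging to pass through to configure:

```shell
# Grab the packaging tools and the source package
# (here, pulled via a 12.04 deb-src line in sources.list).
sudo apt-get install devscripts build-essential fakeroot
apt-get source pacemaker
sudo apt-get build-dep pacemaker   # installs the declared build dependencies

cd pacemaker-1.1.6*/
# Edit debian/rules so ./configure gets --with-cman, then rebuild.
# -us -uc: build unsigned, since I have no key for these packages anyway.
debuild -us -uc
```

The -us -uc flags are why the signing step gets skipped.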

* * * * *
A short while later...

I'm not sure this is going to work.  In attempting to transplant the Debian sources from 12.04 to 11.10, I also had to pull in the sources for the later versions of corosync, libcman, cluster-glue, and libfence.  All development versions, too.  I'm trying to build them now on the 11.10 machine, which means I am also having to install other additional libraries.

First one finished was the cluster-glue.  Not much difficulty there.

Corosync built next.  Had to apt-get more libraries.  Each time I'm letting debuild tell me what it wants, then I go get it, and then let it fly.  

redhat-cluster may be a problem.  It wants a newer version of libvirt.  The more I build, the more I wonder just how bad that downtime is going to be...  Worse, of course, would be upgrades, except that I could just as easily nuke each node and do a clean reinstall.  That would probably be required if I go this route.

* * * * *
A shorter while later...

The build is exploding exponentially.  libvirt was really the last straw.  For kicks I'm trying to build Pacemaker 1.1.6 manually against the system-installed Corosync et al.  For the sake of near-completeness I'm using the configure flags from the debian/rules file.
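The manual build looked more or less like this; the flags below are representative of what debian/rules passes, not a verbatim copy:

```shell
# Build Pacemaker 1.1.6 from a source tarball against the system-installed
# Corosync/CMAN libraries, mimicking the Debian packaging's configure line.
tar xf pacemaker-1.1.6.tar.gz && cd pacemaker-1.1.6
./autogen.sh
./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var \
            --with-cman --with-corosync
make
sudo make install   # note: this installs outside dpkg's knowledge
```

The obvious downside of make install over a proper package: nothing tracks these files, so a later distro upgrade can stomp on them.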

The resulting Pacemaker 1.1.6 works, and seems to work acceptably well.  The 11.10 machine it's running on may be having some issues related or unrelated to the differing library versions, Pacemaker builds, or perhaps even the kernel.  There were some rsc_timeout things happening in there.  I performed a hard-and-fast reboot of the machine, though that's not really something I can afford to do on the live cluster.  I've seen this issue before, but have never pinned it down nor had the time to help one of the maintainers trace it through different kernel versions.  I also didn't have the hardware to spare; now, it seems, I do.  It may actually be related, in some strange way, to open-iscsi. 

It leaves me a bit uneasy, as I'm now not sure I can rely on this path to upgrade my cluster easily.  I can't have machines spontaneously dying on me due to buggy kernels or iSCSI initiators or what have you.

The Final Verdict

My goal is to transition a virtualization cluster to 12.04, partly because it's LTS, partly because it's got better libvirt support, and partly because it has to happen sooner or later.  I have a new 16-core host that might be able to take the whole load of the four other machines I'll be upgrading; I just won't be able to quietly and secretly transition those VMs over.  I'll have to shut them all down, configure the new host to be the only host in the cluster, adjust the cluster stack, and then bring them all back up.

I could do that.  I could even do it tonight, but I'm going to wait till I'm back in the office.  The HA SAN (another bit of Pacemaker awesomeness) is short on network bandwidth, as is the new host I want to use.  I'll want to get that a little more robust before I start pushing tons of traffic.  The downside to this approach is that I'm left completely without redundancy while I upgrade the other machines.  Of course, each completed upgrade means another machine worth of redundancy added back into the cluster.

I may attempt the best of both worlds: take one machine offline, upgrade it, and pair it with the new host.  With those two hosts up, we can bring the rest of the old cluster down and swap out the OCFS2 cluster stack.  Yes, that may work well.  At this point, I think trying to sneak 1.1.6 into 11.10 is going to be too risky. 



