20121028

Cluster Upgrade: Ubuntu Server 11.10 to 12.04 LTS

I will attempt to detail here the upgrade path taken as I migrate from 11.10 to 12.04 LTS.  I am using Corosync+Pacemaker+OCFS2 on 11.10.  12.04 appears to require transitioning to CMAN in order for the DLM to work again.  DLM is required for OCFS2.

This is what I suspect, or tend toward, or have read about:
  • I will need to migrate my file systems to the new cluster stack - this should be doable with tunefs.ocfs2.
  • I will need to upgrade the OS either before or after installing and configuring CMAN.
  • I need CMAN before I can migrate OCFS2 to the new stack.
  • I am unsure what will happen to my existing resources after the migration.
  • One of my nodes is already on 12.04.  The other is on 11.10 (this is a test cluster, by the way).
Because I'll be fundamentally affecting the DLM, I will need to shut down the 11.10 node completely (as far as the cluster is concerned) before acting on it.  At least, that is what everything I've read and learned and suspect is telling me.

I have shut down the node by first putting it into standby, then stopping Pacemaker and Corosync, in that order.  My cluster.conf contains mostly what is presented in the Clusters From Scratch documentation, but with some personalized modifications (forgive the spaces in the tags; I can't seem to put real XML in here without grief and I don't have the patience right now to learn the proper way to do so):
  • I am using the < dlm protocol="sctp" > option.  This goes inside the < cluster > block.
  • I have set the keyfile to that which Corosync used to use:  < cman keyfile="/etc/corosync/authkey" >
  • I have defined a specific multicast address inside the < cman > block: < multicast addr="226.94.94.123" / >
The SCTP option appears to be a Nice Thing To Have.  The cluster.conf man page says it's required when Corosync is involved.  I don't honestly know what it all means.  The keyfile is not required, but I thought it would be handy.  The multicast address is also not required, obviously.  By default, both the key and the address are generated from the cluster name.  I am defining them explicitly here because I'm toying around and like the notion of being able to define an address and key that will NEVER EVER CHANGE.  EVER.
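
For reference, here is roughly how those three settings look as real XML inside a cluster.conf.  This is only a sketch, not my actual file: the cluster name "testcluster" and the second node name "hv07" are made up for illustration, and a working config will want more than this (fencing, for one).  It writes to a scratch file so nothing under /etc/cluster gets touched.

```shell
#!/bin/sh
# Sketch only: write a minimal cluster.conf to a scratch file.
# "testcluster" and "hv07" are hypothetical names; adjust for a real cluster.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <!-- DLM over SCTP; this goes directly inside <cluster> -->
  <dlm protocol="sctp"/>
  <!-- Reuse the old Corosync key and pin the multicast address -->
  <cman keyfile="/etc/corosync/authkey">
    <multicast addr="226.94.94.123"/>
  </cman>
  <clusternodes>
    <clusternode name="hv06" nodeid="1"/>
    <clusternode name="hv07" nodeid="2"/>
  </clusternodes>
</cluster>
EOF
echo "sample written to $conf"
```

Once the real thing is in /etc/cluster/cluster.conf, ccs_config_validate is what tells you whether it is sane.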

I am running ccs_config_validate on each node to make sure everything is kosher.  I found that it complained loudly when the fence-agents package was not installed.  I will dump a list of what apt-gets I did at the bottom of this post.  As I probably had to mention in another post, Ubuntu Server used to configure the /etc/hosts file with 127.0.1.1 pointing to the host name.  This screws up cman very nicely, as it has auto-detect magic, and it binds itself to this pseudo-loopback instead of the real adapter.  If your machines don't connect, run

corosync-objctl -a | grep ^totem

and you might see:  totem.interface.bindnetaddr=127.0.1.1

Look familiar?  What a pisser...  With that fixed, both nodes now appear when I run cman_tool nodes.  Now I shall attempt to upgrade the cluster stack.  Before I can do this, however, I need to make some subtle changes.  For starters, the cluster configuration can no longer fuck around with the DLM.  It's managed by CMAN now, and if we toy with it we'll break everything.  I posted about that before, also.
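
Back to that 127.0.1.1 business for a moment.  A quick way to spot it before cman trips over it is to look for the host name on a 127.0.1.1 line in /etc/hosts.  A minimal sketch follows; the function name check_hosts is mine, not part of any package.

```shell
#!/bin/sh
# check_hosts FILE HOSTNAME
# Prints any line in FILE that maps HOSTNAME to 127.0.1.1.
# Exit status is 0 if such a line exists, 1 otherwise.
# (check_hosts is a made-up helper name for this sketch.)
check_hosts() {
    awk -v h="$2" '
        $1 == "127.0.1.1" {
            # Fields after the address are the names bound to it.
            for (i = 2; i <= NF; i++)
                if ($i == h) { print NR ": " $0; found = 1; break }
        }
        END { exit !found }
    ' "$1"
}
```

Something like check_hosts /etc/hosts "$(hostname)" && echo "fix /etc/hosts first" would then flag the problem; the cure is to have the host name resolve to the real adapter's address instead.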

I start pacemaker and then enter configuration-mode.  Wait, no I don't.  11.10's version of Pacemaker doesn't support CMAN.  Now I remember why I dreaded this fucking upgrade.  I have two choices now:
  • Perform the upgrade to 12.04 and get CMAN-support that way.
  • Obtain or build Pacemaker with CMAN support manually.
I opt to try my hand at a build.  I've never done this before on Ubuntu, so the tool setup is unfamiliar.  Getting to a ./configure is my goal, as that will be known-turf.

First, pull the necessary packages down - I don't know which of these are actually needed:

  apt-get install pacemaker-dev libcman-dev libccs-dev

The build dependencies:  apt-get build-dep pacemaker
Then the source: apt-get source pacemaker  (Do this in the directory you want it to wind up in.)
I also needed:  apt-get install 'libfenced*'  (quoted so the shell leaves the wildcard alone)

cd into the pacemaker-x.x.x directory and run ./autogen.sh.
Then:
    ./configure --prefix=/usr --with-ais --with-corosync --with-heartbeat --with-cman --with-cs-quorum --with-snmp --with-esmtp --with-acl --disable-fatal-warnings --localstatedir=/var --sysconfdir=/etc

Yes, I turned everything on.  If it works correctly, you should see a feature line near the end of the output, and CMAN had BETTER be there.  Had to disable fatal warnings because there were, um, stupid warnings that were fatal to the build.  Let's hope they're just warnings, eh?
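
If you would rather not eyeball the configure output, save it to a file and grep it.  A rough sketch: has_feature is a name I made up, and I am assuming the line in question contains the word "feature", which may differ between Pacemaker versions.

```shell
#!/bin/sh
# has_feature LOG NAME
# Succeeds if a line mentioning "feature" in LOG also contains NAME as a whole word.
# (has_feature is a hypothetical helper; the log format is an assumption.)
has_feature() {
    grep -i 'feature' "$1" | grep -qiw -- "$2"
}
```

The idea would be to pipe the configure run through tee into, say, configure.out, and then run has_feature configure.out cman before bothering with make.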

  make

Now sit and wait.  Hmmmm....  this is a nice fast machine, and now it's done!  Now for the part where I shoot myself in the foot:

  make install

If you were kinda dumb like me, you may have built this as root.  Not advisable, generally speaking, but I'm at the point where I don't care.  I've built entire Linux deploys from scratch (read: Linux From Scratch), so it's pretty much the same thing over and over and over for me.  I don't do LFS anymore, by the way.  Package management features of distros like Ubuntu are just too damn shiny for me to ignore any longer.


Now I discover that I did not have my paths entirely correct during the configuration step.  My cluster configuration is nowhere to be seen, because it's probably being sought in /usr/var/blah/blah/blah.  And it is so.  I've modified the above configure command to be more correct.  And now, except for a strange "touch missing operand" bit of complaining on the part of the init script, the binary has found my cluster configuration.  (edit: the last issue has also been fixed by the addition of the flag that sets /etc as a good place to go.)

With the new pacemakerd binary in place, I can get the thing started under the watchful gaze of CMAN.  Now I have to update the cluster config to reflect the fact that the o2cb stack is now CMAN.  Refer to the RA metadata for this, I won't repeat it here.  With that done, I can bring at least the 11.10 node back online.  The 12.04 node actually doesn't have all the necessary packages to make it work yet.

Predictably, the mount of the OCFS2 iSCSI drives fails - they're the wrong stack.  BUT, the o2cb driver is up, and the iSCSI drives are connected.  With that, I can do the following:

root@hv06:~# tunefs.ocfs2 --update-cluster-stack /dev/sdc
Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? yes

So, we can learn from this that updating the cluster stack must be done while the ENTIRE FRIGGING CLUSTER IS DOWN.  I'll try to remember that for when I do this for real.  I can now clean up the two failed filesystem resources, and lo and behold they mount!  I can now configure the 12.04 machine to follow suit.

Conclusion

Now, having done all this, I am left to wonder whether or not this is in fact the best route.  On the up-side:
  1. I can easily copy the custom-built pacemakerd stuff to all my 11.10 cluster machines and forgo the painful build steps.
  2. I can take the cluster down for a very short period of time.
  3. This should not affect any data on the OCFS2 volumes, as we're just updating the cluster stack.
  4. I can take my time while upgrading the individual machines to 12.04, or even leave them on 11.10 a bit longer.
  5. I could build the latest-and-greatest pacemaker and use that instead of 1.1.5.
OK, point #5 is probably not as feasible as I would like it to be, nor is it necessary in my situation.  On the down-side:
  1. I still have to take the ENTIRE cluster down to make this "minor" update.
  2. I could just push my new server back to 11.10, join it to the cluster, migrate my VMs there and then upgrade the rest of the cluster.
  3. But if I do #2 here, I will still at some point have to bring the whole cluster down just to update the OCFS2 cluster stack info to CMAN.
  4. I may be causing unknown issues once I push the 11.10 machines to 12.04.

At least point #4 of the downsides is something I can test.  I can push that machine to 12.04 and see what happens during the upgrade.  Ideally, nothing bad should come of it; the pacemakerd binary should get upgraded to the package maintainer's version, 1.1.6-ubuntu-goodness-and-various-patches-go-here.

This will probably be useful to me: http://www.debian.org/doc/manuals/packaging-tutorial/packaging-tutorial.en.pdf

We shall see.
