20130225

"Quick" notes on OCFS2 + cman + pacemaker + Ubuntu 12.04

Ha ha - "quick" is funny because now this document has become huge.  The good stuff is at the end.

Getting this working is my punishment for wanting what I evidently ought not to have.

When configuring CMAN, thou shalt NOT use "sctp" as the DLM communication protocol.  ocfs2_controld.cman does not seem to be compatible with it, and will forever bork itself while trying to initialize.  This presents as something like:
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 1 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 2 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 4 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 8 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 16 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 32 times while opening checkpoint "ocfs2:controld:00000003"

And it goes on forever.
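
If you want to be explicit about it, the DLM protocol can be pinned in /etc/cluster/cluster.conf.  As far as I can tell, dlm_controld honors a protocol attribute on a dlm element; something along these lines (a sketch, not gospel):

  <dlm protocol="tcp"/>

Keep sctp out of the picture and ocfs2_controld.cman can open its checkpoint and get on with life.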

To make the init scripts work, some evil might be required...
In /etc/default/o2cb, add 
O2CB_STACK=cman
The /etc/init.d/o2cb script tries to start ocfs2_controld.cman before it should.  Commenting out the appropriate line leaves this script doing just its middle part - setting up everything it should - so that CMAN can handle the parts that come before and after.  OR you can try running the o2cb script AFTER cman starts, and not worry that CMAN is "controlling" o2cb...which it really doesn't anyway.
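
For reference, here's roughly what my /etc/default/o2cb ends up containing.  The only line I actually added is O2CB_STACK; the rest are the stock settings that were already in the file on my install (yours may differ):

  # the one line I had to add:
  O2CB_STACK=cman
  # stock settings already present:
  O2CB_ENABLED=true
  O2CB_BOOTCLUSTER=ocfs2
  O2CB_HEARTBEAT_THRESHOLD=31
  O2CB_IDLE_TIMEOUT_MS=30000
  O2CB_KEEPALIVE_DELAY_MS=2000
  O2CB_RECONNECT_DELAY_MS=2000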

The fact is, the cman script's main job is to call a lot of other utilities and start a bunch of daemons.  cman_tool will crank up corosync and configure it from the /etc/cluster/cluster.conf file.  It ignores /etc/corosync/corosync.conf entirely, as proved by experimentation and documented by the cman author(s).  As far as o2cb is concerned, cman runs ocfs2_controld.cman only if it finds the appropriate things configured in the configfs mount-point.  It won't find those unless you've configured them yourself or run a modified o2cb init script.
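
A few quick sanity checks I lean on to see who has set up what (paths and output roughly as they appear on my 12.04 boxes):

  cman_tool status                   # corosync is up; shows cluster name, quorum, votes
  ls /sys/kernel/config/cluster/     # the configfs attributes the o2cb script sets up
  ps aux | grep ocfs2_controld       # which controld (if any) is actually running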

Now, it gets better.  The /etc/default/o2cb file doesn't document the cluster stack option - you have to find that out by reading the o2cb init script instead.  If you let the default stack (which is, ironically, "o2cb") stand, then ocfs2_controld.cman won't run and instead complains that you're using the wrong cluster stack.  Of course, running with the default stack then runs the default ocfs2_controld, which doesn't complain about anything at all.  But does it play nice with cman and corosync and pacemaker??

Fact is, it doesn't play with cman/corosync/pacemaker at all when it plays as an "o2cb" stack.

How is this a big deal?

The crux of all of this is fencing.  OK, so suppose you have a cluster and OCFS2 configured and something terrible happens to one node.  That node gets fenced.  Then what?  Well, OCFS2 and everyone else involved should go on with life, assuming quorum is maintained.

When o2cb is configured to use the o2cb stack, it appears to operate sort of "stand-alone," meaning it doesn't seem to talk to the corosync/pacemaker/cman stack.  It doesn't get informed when a node dies; it has to find this out on its own.  Moreover, it does its own thing regardless of quorum.  Here's the test I just ran: configure a two-node cluster, configure o2cb to use the o2cb stack, and then crash one of the two nodes while the other node is doing disk access (I'm using dbench, just because it does a lot of disk access and gives latency times - a great way to watch how long a recovery takes!).
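
For the curious, the test itself is nothing fancy - something along these lines, where the mount point is just my example and sysrq has to be enabled for the crash trick:

  # on the node that will survive: keep the OCFS2 mount busy and print latency numbers
  dbench -D /mnt/ocfs2 -t 600 4
  # on the node to be sacrificed: one way to simulate a hard crash
  echo c > /proc/sysrq-trigger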

Watching the log on the surviving node (s-node), you can see the o2cb stack recover based on (I assume) the timeouts configured in the /etc/default/o2cb file.  About 90 seconds later, access to the OCFS2 file system is restored, regardless of the state of the crashed node (c-node).

Now the good part is that when you start up and shut down the o2cb stack and the cman stack, they don't care about each other.  This is great because on Ubuntu these start-up and shut-down sequences seem to be all fucked up.  More about that later.  The bad news is that because these stacks are not talking, the recovery takes (on my default-configured cluster) 90 seconds, which would probably nuke any VM instances running on it and wreak all sorts of havoc.  Not acceptable, and I'm not crazy about modifying defaults downward when the documentation says (and I paraphrase): "You might want to increase these values..."

Reconfigure o2cb to use the cman stack instead (O2CB_STACK=cman).  Start the o2cb service, ignore the ocfs2_controld.cman failure, and start the cman service.  Cman starts ocfs2_controld.cman.  Update the OCFS2 cluster stack, mount, and start another dbench on s-node.  Crash c-node.  This time o2cb appears to find out from the three amigos that c-node died.  However, quorum is managed by cman, and since it's a two-node cluster it halts cluster operations (such as recovery) until quorum is reestablished.  This can be done simply by restarting cman (regardless of o2cb) on c-node...once c-node is rebooted.  Unfortunately, if you're not watching your cluster crash, it could be many minutes or hours before you notice that s-node isn't able to access its data.  Or maybe never, if c-node died due to, say, releasing its magic smoke.

What else to do?  The cman documentation dictates using the two_node="1" and expected_votes="1" attributes on the cman tag in /etc/cluster/cluster.conf.  Now a single node is quorate.  Let's start dbench on s-node and crash c-node again.  Recovery after c-node bites the dust takes place after about 30 seconds of downtime.  That's better.  After adding some options to configure totem for greater responsiveness (hopefully not at the cost of stability), the only thing that takes a long time now is the ocfs2 journal replay.  And that's only because my SAN is overworked and under-powered.  Donations, anyone?
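
For the record, the relevant bits of /etc/cluster/cluster.conf end up looking something like this.  Cluster and node names are placeholders, the totem token value is only an example of the kind of tuning I mean, and the fencing devices are left out of the sketch:

  <?xml version="1.0"?>
  <cluster name="mycluster" config_version="2">
    <cman two_node="1" expected_votes="1"/>
    <totem token="5000"/>
    <clusternodes>
      <clusternode name="node1" nodeid="1"/>
      <clusternode name="node2" nodeid="2"/>
    </clusternodes>
  </cluster>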

Lessons Learned

To get the benefit of ocfs2 + cman + pacemaker (under Ubuntu), you need to have ocfs2_controld.cman and it has to run when "cman" is running.  That is to say, when some particular daemons - notably dlm_controld - are running.

ocfs2 can run on its own (o2cb stack), but then you lose quorum control, so to speak, and it has to be configured and managed separately from cman and friends.  Ugly.

For two-node clusters, make absolutely sure you have correctly configured cman to know it's a two-node cluster and to expect only one vote cluster-wide, otherwise there will be no recovery for s-node when c-node dies.  Two-node clusters under cman demand:  two_node="1" and expected_votes="1"

ocfs2_controld.cman does NOT like to talk to the DLM via sctp.  You must NOT use sctp as the communication protocol.

When configuring cluster resources, about the only things you need under this setup are a connection to the data source and a mount of the store.  In my case, that's an iSCSI initiator resource and a resource to mount the OCFS2 partition once I'm connected to the target.  There is NO:
  • dlm_controld resource
  • o2cb control resource
  • gfs2 control resource
Basically, Pacemaker will not be managing any of those low-level things, unlike what you had to do back in Ubuntu 11.10.  Literally all I have in my cluster configuration is fencing, the iSCSI initiator, and the mount.  If you do anything else with the above three resources, you will find much pain when trying to put your nodes into standby or do anything with them other than leaving them running forever and ever.
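
To make that concrete, here's a sketch of what my resource configuration boils down to in crm shell syntax.  The names, portal, target IQN, device, and mount point are all placeholders, and the stonith configuration is omitted:

  primitive p_iscsi ocf:heartbeat:iscsi \
      params portal="192.168.0.10:3260" target="iqn.2013-02.example.com:store" \
      op monitor interval="30s"
  primitive p_fs ocf:heartbeat:Filesystem \
      params device="/dev/sdb1" directory="/srv/store" fstype="ocfs2" \
      op monitor interval="20s"
  group g_storage p_iscsi p_fs
  clone cl_storage g_storage meta interleave="true"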

Start-up sequence:
(Update 2013-02-28: The start-up order can now be as listed below.  ocfs2_controld.cman will connect to the dlm.  However, shutdown must take an alternate path.)
  1. service cman start
  2. service o2cb start
  3. service pacemaker start
If you start o2cb first:  You can, but o2cb WILL complain about not being able to start ocfs2_controld.cman.  Let it complain, or modify the init script to not even try, or start cman first and don't worry that cman won't try to start ocfs2_controld.cman.  But you MUST use "start" and not "load", because otherwise the script will not configure the necessary attributes under configfs (/sys/kernel/config) and cman will see an o2cb-leaning ocfs2 cluster instead of a cman-leaning ocfs2 cluster.

Shutdown is almost the reverse.  Whether you start o2cb then cman, or cman then o2cb, you must kill cman before killing o2cb.  Sometimes on shutdown, fenced dies before cman can kill it (I think), and the cman init script throws an error.  Run it again ("service cman stop" - yes, again), and when it completes successfully you can do "service o2cb stop".  If you try to stop o2cb before cman is totally dead, you will wind up with a minor mess.  Given all of this, I'd recommend disabling all of these scripts from being run at system boot.
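
In practice my shutdown looks something like this (the second cman stop is only needed when fenced has already died and the first attempt errors out):

  service pacemaker stop
  service cman stop
  service cman stop    # yes, again, if the first one complained
  service o2cb stop

And to keep boot from doing something creative, something like "update-rc.d cman disable" (and likewise for o2cb and pacemaker) keeps the scripts out of the runlevels so you can start them by hand in the right order.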

CMAN-based o2cb requires O2CB_STACK=cman in /etc/default/o2cb.

If you are upgrading from Ubuntu 11.10 to 12.04, and you want to move your ocfs2 stack from whatever it's named to cman, remember to run tunefs.ocfs2 --update-cluster-stack [target] AFTER you have o2cb properly configured and running under cman.  This will mean your whole cluster will be unable to use that particular ocfs2 store, but then if you're doing this kind of upgrade you probably should not be using it live anyway.  Since I had my resources in groups, I configured the mount to be stopped before bringing the nodes up, and allowed the iSCSI initiator to connect to the target.  Then I was able to update the stack and start the mount resource, which succeeded as expected.
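
The actual commands are mundane.  The device path below is just my example, and the volume must not be mounted anywhere in the cluster while you do this:

  # with o2cb running on the cman stack and the volume unmounted cluster-wide:
  tunefs.ocfs2 --update-cluster-stack /dev/sdb1
  # sanity check; on my version of ocfs2-tools this lists each volume's cluster stack:
  mounted.ocfs2 -d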

I hope you find this information useful.

2 comments:

  1. I struggled for weeks to build a stable cluster (CMAN, Pacemaker, OCFS2, CLVM, Dual-Primary DRBD, KVM) based on Ubuntu 12.04 LTS till I found your blog!
    You saved my life. Also thanks for the "Broken cman init script?!" post.
    I hope it will all get better with Ubuntu 14.04 ...
    Thank You

    Replies
    1. Very glad to hear that! I believe there have been many changes to 12.04's cluster management packages, to the point that I wasn't sure how much of this info was still accurate. 14.04 has presented its own conundrums, and I've started investigating NFS and CEPH as alternatives to OCFS2. Still, for two-node configurations, it's really hard to let go of OCFS2!!
