20130225

"Quick" notes on OCFS2 + cman + pacemaker + Ubuntu 12.04

Ha ha - "quick" is funny because now this document has become huge.  The good stuff is at the end.

Getting this working is my punishment for wanting what I evidently ought not to have.

When configuring CMAN, thou shalt NOT use "sctp" as the DLM communication protocol.  ocfs2_controld.cman does not seem to be compatible with it, and will forever bork itself while trying to initialize.  It presents as something like:
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 1 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 2 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 4 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 8 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 16 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 32 times while opening checkpoint "ocfs2:controld:00000003"

And it goes on forever.
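For reference, the DLM communication protocol is set in /etc/cluster/cluster.conf.  To my understanding of dlm_controld's configuration (check your man pages), forcing TCP looks like this, inside the <cluster> element:

  <!-- force the DLM to use TCP rather than SCTP -->
  <dlm protocol="tcp"/>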

To make the init scripts work, some evil might be required...
In /etc/default/o2cb, add 
O2CB_STACK=cman
The /etc/init.d/o2cb script tries to start ocfs2_controld.cman before it should.  Commenting out the appropriate line makes this script do part B (setting up everything it should) so that CMAN can do parts A and C.  OR you can run the o2cb script AFTER cman starts and not worry that CMAN is "controlling" o2cb...which it really doesn't anyway.
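For the record, here is roughly what my /etc/default/o2cb ends up looking like.  Everything other than O2CB_STACK is the stock default on my machines, so treat this as an example rather than gospel (and O2CB_BOOTCLUSTER should match your cluster name):

  # /etc/default/o2cb
  O2CB_ENABLED=true
  O2CB_STACK=cman
  O2CB_BOOTCLUSTER=ocfs2
  O2CB_HEARTBEAT_THRESHOLD=31
  O2CB_IDLE_TIMEOUT_MS=30000
  O2CB_KEEPALIVE_DELAY_MS=2000
  O2CB_RECONNECT_DELAY_MS=2000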

The fact is, the cman script's main job is to call a lot of other utilities and start a bunch of daemons.  cman_tool will crank up corosync and configure it from the /etc/cluster/cluster.conf file.  It ignores /etc/corosync/corosync.conf entirely, as proven by experimentation and documented by the cman author(s).  As far as o2cb is concerned, it runs ocfs2_controld.cman only if it finds the appropriate things configured in the configfs mount point.  It won't find those unless you've configured them yourself or run a modified o2cb init script.
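For context, a bare-bones /etc/cluster/cluster.conf looks something like the sketch below.  The cluster name and node names are placeholders, and the fencing configuration is elided because it is entirely site-specific:

  <?xml version="1.0"?>
  <cluster name="mycluster" config_version="1">
    <clusternodes>
      <clusternode name="hv02" nodeid="1">
        <fence/>  <!-- real fence devices go here -->
      </clusternode>
      <clusternode name="hv03" nodeid="2">
        <fence/>
      </clusternode>
    </clusternodes>
  </cluster>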

Now, it gets better.  The /etc/default/o2cb file doesn't document the cluster stack option - you have to find that out by reading the o2cb init script instead.  If you let the default stack (which is, ironically, "o2cb") stand, then ocfs2_controld.cman won't run and instead complains that you're using the wrong cluster stack.  Of course, running with the default stack then runs the default ocfs2_controld, which doesn't complain about anything at all.  But does it play nice with cman and corosync and pacemaker??

Fact is, it doesn't play with cman/corosync/pacemaker at all when it plays as an "o2cb" stack.

How is this a big deal?

The crux of all this comes down to fencing.  OK, so suppose you have a cluster and OCFS2 configured and something terrible happens to one node.  That node gets fenced.  Then what?  Well, OCFS2 and everyone else involved should go on with life, assuming quorum is maintained.

When o2cb is configured to use the o2cb stack, it appears to operate sort of "stand-alone," meaning it doesn't seem to talk to the corosync/pacemaker/cman stack.  It doesn't get informed when a node dies; it has to find this out on its own.  Moreover, it does its own thing regardless of quorum.  Here's the test I just ran: configure a two-node cluster, configure o2cb to use the o2cb stack, and then crash one of the two nodes while the other node is doing disk access (I'm using dbench, just because it does a lot of disk access and reports latency times - a great way to watch how long a recovery takes!).

Watching the log on surviving-node (s-node), you can see the o2cb stack recover based on (I assume) the timeouts configured in the /etc/default/o2cb file.  About 90 seconds later access to the OCFS2 file system is restored, regardless of the state of crashed-node (c-node).
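If you want to reproduce the test, the dbench invocation is nothing fancy; something along these lines, with the client count, runtime, and mount point being whatever suits you:

  # on s-node: hammer the OCFS2 mount and watch the per-operation latency
  dbench -D /mnt/ocfs2 -t 600 4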

Now the good part of this is that when you start up and shut down the o2cb stack and the cman stack, they don't care about each other.  This is great because on Ubuntu these start-up and shut-down sequences seem to be all fucked up.  More about that later.  The bad news is that because these stacks are not talking, the recovery takes (on my default-configured cluster) 90 seconds, which would probably nuke any VM instances running on it and wreak all sorts of havoc.  Not acceptable, and I'm not crazy about modifying defaults downward when the documentation says (and I paraphrase): "You might want to increase these values..."

Reconfigure o2cb to use the cman stack instead (O2CB_STACK=cman).  Start the o2cb service, ignore the ocfs2_controld.cman failure, and start the cman service.  Cman starts ocfs2_controld.cman.  Update the OCFS2 cluster stack, mount, and start another dbench on s-node.  Crash c-node.  This time o2cb appears to find out from the three amigos that c-node died.  However, quorum is managed by cman, and since it's a two-node cluster it halts cluster operations (such as recovery) until quorum is reestablished.  That can be done simply by restarting cman (regardless of o2cb) on c-node...once c-node is rebooted.  Unfortunately, if you're not watching your cluster crash, it could be many minutes or hours before you notice that s-node isn't able to access its data.  Or maybe never, if c-node died due to, say, releasing its magic smoke.
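A quick way to check whether cman currently considers the cluster quorate (and who it thinks is a member) is cman_tool:

  cman_tool status
  cman_tool nodes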

What else to do?  cman documentation dictates using the two_node="1" and the expected_votes="1" attributes in the cman configuration tag in /etc/cluster/cluster.conf.  Now a single node is quorate.  Let's start dbench on s-node and crash c-node again.  Recovery after c-node bites the dust takes place after about 30 seconds of downtime.  That's better.  After adding some options to configure totem for greater responsiveness (hopefully not at the cost of stability), the only thing that takes a long time now is the ocfs2 journal replay.  And that's only because my SAN is overworked and under-powered.  Donations, anyone?
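The relevant bits of /etc/cluster/cluster.conf end up looking roughly like this, inside the <cluster> element.  The totem token value is only an illustration of the "greater responsiveness" tuning, not a recommendation:

  <!-- two-node mode: a single node's vote is enough for quorum -->
  <cman two_node="1" expected_votes="1"/>
  <!-- example totem tuning; a shorter token means faster failure detection -->
  <totem token="5000"/>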

Lessons Learned

To get the benefit of ocfs2 + cman + pacemaker (under Ubuntu), you need to have ocfs2_controld.cman and it has to run when "cman" is running.  That is to say, when some particular daemons - notably dlm_controld - are running.

ocfs2 can run on its own (the o2cb stack), but then you lose quorum control, so to speak, and it has to be configured and managed separately from cman-and-friends.  Ugly.

For two-node clusters, make absolutely sure you have configured cman to know it's a two-node cluster and to expect only one vote cluster-wide; otherwise there will be no recovery on s-node when c-node dies.  Two-node clusters under cman demand: two_node="1" and expected_votes="1".

ocfs2_controld.cman does NOT like to talk to the DLM via sctp.  You must NOT use sctp as the communication protocol.

When configuring cluster resources, about the only things you need under this setup are a connection to the data source and a mount of the store.  In my case, that's an iSCSI initiator resource and a mount of the OCFS2 partition once I'm connected to the target.  There is NO:
  • dlm_controld resource
  • o2cb control resource
  • gfs2 control resource
Basically, Pacemaker will not be managing any of those low-level things, unlike what you had to do back in Ubuntu 11.10.  Literally all I have in my cluster configuration is fencing, the iSCSI initiator, and the mount.  If you do anything else with the above three resources, you will find much pain when trying to put your nodes into standby or do anything with them other than leaving them running forever and ever.
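To make that concrete, my configuration boils down to something like the sketch below (crm shell syntax).  The fencing primitive is elided because it is site-specific, and the resource names, portal, target, and device path are placeholders rather than my actual values:

  # iSCSI initiator login to the SAN target
  primitive p_iscsi ocf:heartbeat:iscsi \
          params portal="192.168.1.10:3260" target="iqn.2013-02.example:store0" \
          op monitor interval="30s"
  # mount the OCFS2 file system once the initiator is logged in
  primitive p_fs-ocfs2 ocf:heartbeat:Filesystem \
          params device="/dev/disk/by-path/PLACEHOLDER" directory="/mnt/ocfs2" fstype="ocfs2" \
          op monitor interval="20s" timeout="40s"
  # the group keeps the order: connect to the target, then mount
  group g_storage p_iscsi p_fs-ocfs2
  # clone the group so every node connects and mounts
  clone cl_storage g_storage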

Start-up sequence:
(Update 2013-02-28: The start-up order can now be as listed below.  ocfs2_controld.cman will connect to the dlm.  However, shutdown must take an alternate path.)
  1. service cman start
  2. service o2cb start
  3. service pacemaker start
If you start o2cb first:  You can start o2cb first, but o2cb WILL complain about not being able to start ocfs2_controld.cman.  Let it complain, modify the init script to not even try, or start cman first and don't worry that cman won't try to start ocfs2_controld.cman.  But you MUST use "start" and not "load", because otherwise the script will not configure the necessary attributes under configfs (/sys/kernel/config) and cman will see an o2cb-leaning ocfs2 cluster instead of a cman-leaning ocfs2 cluster.

Shutdown is almost the reverse.  Whether you start o2cb then cman, or cman then o2cb, you must kill cman before killing o2cb.  Sometimes on shutdown, fenced dies before cman can kill it (I think), and the cman init script throws an error.  Run it again ("service cman stop" - yes, again), and when it completes successfully you can do "service o2cb stop".  If you try to stop o2cb before cman is totally dead, you will wind up with a minor mess.  Given all of this, I'd recommend disabling all of these scripts from being run at system boot.
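Spelled out, the shutdown dance looks like this (assuming pacemaker was also running):

  service pacemaker stop
  service cman stop    # may throw an error if fenced already died
  service cman stop    # run it again until it completes cleanly
  service o2cb stop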

CMAN-based o2cb requires O2CB_STACK=cman in /etc/default/o2cb.

If you are upgrading from Ubuntu 11.10 to 12.04, and you want to move your ocfs2 stack from whatever it's named to cman, remember to run tunefs.ocfs2 --update-cluster-stack [target] AFTER you have o2cb properly configured and running under cman.  While you do this, your whole cluster will be unable to use that particular ocfs2 store, but if you're doing this kind of upgrade you probably shouldn't be using it live anyway.  Since I had my resources in groups, I configured the mount to be stopped before bringing the nodes up, and allowed the iSCSI initiator to connect to the target.  Then I was able to update the stack and start the mount resource, which succeeded as expected.
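The stack update itself is a one-liner; the device below is a placeholder for your OCFS2 volume:

  # with o2cb running under cman and the volume unmounted everywhere:
  tunefs.ocfs2 --update-cluster-stack /dev/disk/by-path/PLACEHOLDER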

I hope you find this information useful.

20130205

HA MySQL (MariaDB) on Ubuntu 12.04 LTS

A few notes concerning this.


The tutorial provided on the Linbit site for HA-mysql is totally AWESOME!  Highly recommended.  It will get you 99% of the way there.

The resource definition for the MySQL server instance on Ubuntu 12.04 varies slightly due to AppArmor's need for things to line up neatly.  Specifically, the file names for the pid and socket files must be correct.  Referencing the original Ubuntu configuration, we have this for a resource:
primitive p_db-mysql0 ocf:heartbeat:mysql \
        params binary="/usr/sbin/mysqld" \
               config="/etc/mysql/my.cnf" \
               datadir="/var/lib/mysql" \
               pid="/var/run/mysqld/mysqld.pid" \
               socket="/var/run/mysqld/mysqld.sock" \
               additional_parameters="--bind-address=127.0.0.1" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="20s" timeout="30s"

Of course, the bind-address listed here is only for testing and must be changed to the bind address of the virtual IP that will be assigned to the database resource group.
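For completeness, the virtual IP and the grouping look something like this in my setup.  The address and names are examples only, and the group would also contain whatever storage resources (in my case, the iSCSI-backed mount) the database depends on:

  primitive p_ip-mysql0 ocf:heartbeat:IPaddr2 \
          params ip="192.168.1.50" cidr_netmask="24" \
          op monitor interval="10s"
  # the database starts after, and moves with, its virtual IP
  group g_mysql0 p_ip-mysql0 p_db-mysql0

With that in place, the --bind-address parameter above would point at 192.168.1.50 instead of 127.0.0.1.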

I chose to have the database files stored on iSCSI, since my iSCSI SAN is already HA.  I realize there is still the possibility of a network switch failure causing a runtime coma, but if that happens there will be much larger problems at hand, since both database servers (a two-node cluster) are virtual machines.  To that end I must remember to configure them for virtual STONITH.

I'm still not sure virtualized database servers are the best idea; I can think of a few reasons not to love them, but also a few reasons to totally dig them. 

Minuses:
  • The VM is subject to the same iSCSI risks as the backing store for the databases right now - dedicated DRBD would be better; in my case this isn't really applicable, because the VMs are actually on DRBD and hosted via iSCSI, so I'd be doing double duty there.
  • A VM migration SHOULDN'T cause any sort of db-cluster failure, but we will have to test to know for certain.  Perhaps modifying the corosync timeouts would be beneficial.
Pluses:
  • The standard reason: hardware provisioning!!  No need to stand up more hard drives to watch die, or use more power than what I'm already using.
  • VMs means easy migration to other places, like a redundant VM cluster for instance.
  • Provisioning additional cluster nodes should be relatively painless.
  • The iSCSI backing store will soon be using ZFS, which will be more difficult to do for standalone nodes unless I spend $$ on drives, and ideally hot-swap cages.
  • If one of the VMs dies suddenly, we still won't suffer (hopefully) a major database access outage.  I'd like to move all internal database use over to this cluster, ultimately.  I am tempted to even put an LDAP server instance on there.  Then it can be all things data-access-related.
Hopefully I get more than I paid for.


Concerning MariaDB

This appears to be where the future is going, and more than one distro agrees.  So, to that end, I looked at the specs and the ideas behind MariaDB.  Satisfied that it was designed to be a literal "drop-in replacement" for MySQL, I immediately transitioned both machines over.  Now we will see how well it really works.  I had to follow their instructions on adding their repo to my servers.  The upgrade was painless, and all I have left now is to set up the virtual IPs and start connecting machines to the database instances.
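For what it's worth, adding the repo boiled down to the usual apt dance.  The key ID and mirror below are placeholders - use whatever the MariaDB repository configuration tool on mariadb.org gives you for Ubuntu 12.04 (precise):

  # key and mirror are placeholders; get the real values from mariadb.org
  apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xPLACEHOLDER_KEY
  echo "deb http://PLACEHOLDER_MIRROR/repo/5.5/ubuntu precise main" \
      > /etc/apt/sources.list.d/mariadb.list
  apt-get update && apt-get install mariadb-server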

Concerning PostgreSQL

My DB cluster also hosts PostgreSQL 9.1.  This guide was followed to a tee and, as far as I have tested so far, works quite well:

http://wiki.postgresql.org/images/0/07/Ha_postgres.pdf