20120504

Corosync, Pacemaker, Heartbeat, OCFS2...

Ran into an issue with OCFS2 and Heartbeat: with Heartbeat running, OCFS2 refused to shut down for a reboot of the system.  This caused the system to hang for 12 hours until I could get to the office and force-shutdown the machines.  After a few more reboot tests, I disabled Heartbeat from starting at all, and now reboot works again.  I've installed Corosync and Pacemaker to take over cluster management.  I'm sure there's a way to fix Heartbeat, but since the community seems to be trending toward Pacemaker, we'll go with that.  Plus, the DRBD documentation covers how to set up OCFS2 resources for HA with Pacemaker.

Installation
apt-get -y install corosync pacemaker build-essential

Don't know why build-essential is needed, but a site referenced it.  We'll see.  Right now I just want to get this thing running as quickly and as painlessly as possible.

Configuring Corosync
crm won't give any useful cluster status unless something like Corosync is running underneath it.  So I started by modifying corosync.conf to match most of the settings here.  Now I can ask crm for status and it tells me my cluster has zero nodes.  Good!  That's better than a connection error that tells me nothing useful.

Also, to get corosync to start, set START=yes in /etc/default/corosync so the init script is allowed to run.
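
A minimal sketch of that switch, assuming the Debian/Ubuntu package ships /etc/default/corosync with a plain START=no line.  The helper name enable_corosync_start is made up for this note, and the in-place edit relies on GNU sed:

```shell
#!/bin/sh
# Sketch only: flip START=no to START=yes in the corosync defaults file.
# Assumption: the file uses a plain START=no line (Debian/Ubuntu packaging).
# enable_corosync_start is a hypothetical helper name, not part of corosync.
enable_corosync_start() {
    defaults_file="${1:-/etc/default/corosync}"
    # Keep a backup next to the original before editing in place.
    cp "$defaults_file" "$defaults_file.bak" || return 1
    # GNU sed in-place edit; only an exact START=no line is changed.
    sed -i 's/^START=no$/START=yes/' "$defaults_file"
}
```

Run enable_corosync_start as root on each node, then start the service with /etc/init.d/corosync start.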

After starting corosync on both nodes, crm status displays two nodes, two votes, no resources configured. Not sure what the "pending" on node 1 is all about yet, but after about 30 seconds it disappeared and now both nodes say they're online.  We have GLUE!
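
Rather than watching crm status by hand until the "pending" goes away, the wait could be scripted.  This is only a rough sketch: it assumes the one-shot status output (crm_mon -1) contains a line of the form "Online: [ node1 node2 ]", and count_online_nodes is a made-up helper name:

```shell
#!/bin/sh
# Sketch only: count online nodes from crm_mon -1 style output on stdin.
# Assumption: the status text has a line like "Online: [ node1 node2 ]".
# count_online_nodes is a hypothetical helper, not a Pacemaker command.
count_online_nodes() {
    awk '/Online: \[/ {
        inside = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "]") inside = 0   # closing bracket ends the node list
            if (inside) n++             # every field between [ and ] is a node
            if ($i == "[") inside = 1   # opening bracket starts the node list
        }
    } END { print n + 0 }'
}
```

Something like `until [ "$(crm_mon -1 | count_online_nodes)" -eq 2 ]; do sleep 5; done` would then block until both nodes report in.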


Configuring Resources
After reading a good deal of the Pacemaker Explained documentation, I decided to start out by following DRBD's example of configuring OCFS2 with Pacemaker.  That ended horribly.  OCFS2 kept the servers from rebooting cleanly, and it appears that whenever you want Pacemaker to manage something, you basically have to hand over ALL start/stop functionality to it.  That makes sense, of course... I probably just skimmed over that part of the documentation.  After resetting the cluster configuration to something like a large BLANK, I started over from the examples in Clusters From Scratch.

First I configured a virtual IP, which is needed for the iSCSI stuff to work.  Ideally, I'd like to load-balance the iSCSI backend between the two machines by providing two virtual IPs (one for each).  If one machine goes down, the other will assume both IPs and everything should be good.  That's for a future project, however, so for right now let's get one virtual IP up and an iSCSI initiator connected.  Anyway, the virtual IP works: it moves from server to server instantly and without fuss whenever one or the other disappears.  ZERO loss in ping.  Would love to see how quick the response is for streaming data...
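
For the record, a virtual IP in the style of the Clusters From Scratch examples looks roughly like the sketch below.  The resource name ClusterIP, the default address, and the netmask are placeholders rather than the values on this cluster, and the run and configure_vip helpers are made up so the command can be printed before it is trusted:

```shell
#!/bin/sh
# Sketch only: define a floating IP resource with the crm shell.
# Assumptions: ClusterIP, 192.168.122.101 and /32 are placeholder values;
# run and configure_vip are hypothetical helpers, not crm commands.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

configure_vip() {
    vip="${1:-192.168.122.101}"
    mask="${2:-32}"
    # IPaddr2 brings the address up on whichever node holds the resource;
    # the monitor operation checks it every 30 seconds.
    run crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="$vip" cidr_netmask="$mask" \
        op monitor interval=30s
}
```

Setting DRY_RUN=1 and calling configure_vip with an address and prefix length prints the crm command; without DRY_RUN it is run for real on one node.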

Next up: configuring DRBD in Pacemaker.  We'll leave OCFS2 for yet a later time.  My goal for later tonight is to get DRBD running correctly under Pacemaker - this means I must disable the drbd init scripts.  Note to self: do this.
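
A sketch of that note to self, assuming a Debian/Ubuntu sysvinit layout where the DRBD init script is registered under the name drbd.  The run and disable_drbd_init helpers are made up for illustration:

```shell
#!/bin/sh
# Sketch only: take DRBD out of the boot sequence so Pacemaker can own it.
# Assumption: Debian/Ubuntu sysvinit with the init script registered as "drbd".
# run and disable_drbd_init are hypothetical helpers.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

disable_drbd_init() {
    # Remove the rc?.d symlinks so the init script no longer runs at boot.
    run update-rc.d -f drbd remove
}
```

This only stops the init system from starting DRBD at boot; it does not touch a DRBD that is already running.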





