20121124

Broken cman init script?!

Nothing is going right in Ubuntu 12.04 for cluster-aware file systems.

OCFS2 seems borked beyond belief.
(Update 2013-02-25: I believe I have made progress on the OCFS2 front:  http://burning-midnight.blogspot.com/2013/02/quick-notes-on-ocfs2-cman-pacemaker.html)

GFS2 has its own issues, some of which I will detail here.

You CAN get these two file systems running on 12.04.  Whether or not the cluster will remain stable when you have to put a node on standby is another question entirely, and a very good question.  It shouldn't even BE a question, but it is, and the answer is a resounding FUCKING NO!  Well, at least, as far as OCFS2 is concerned.  The problems there lie in who manages the ocfs2_controld daemon.  CMAN ought to do it, but CMAN doesn't want to.  Starting it in Pacemaker causes horrible heartburn when you put a node into standby, and things just all fall apart from there.

I decided to try out GFS2.  After installing all the necessary packages, and manually running bits here and there to see things work, I could not get Pacemaker to mount the GFS2 volume.  The first problem was CLVM: if you want to be able to shut down a node without shooting the fucker, you'll need to make sure the LVM system can deactivate volume groups.  The standard method of vgchange -an MyVG doesn't work for the cluster-aware LVM.  It complains loudly about "activation/monitoring=0" being an unacceptable condition for vgchange.  This is detailed in this bug: https://bugs.launchpad.net/ubuntu/+source/lvm2/+bug/833368

The solution suggested there, at least where the OCF script is concerned, works: change the lines that use "vgchange" to include "--monitor y" on the command-line, and it will magically work again.
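For reference, the edit is of this flavor - a sketch against the ocf:heartbeat:LVM agent (the one doing the vgchange calls in my setup); the exact file path and lines will vary with your package version:

  # /usr/lib/ocf/resource.d/heartbeat/LVM
  # wherever the agent calls vgchange to (de)activate the volume group, e.g.:
  #   vgchange -a ly $1      becomes      vgchange --monitor y -a ly $1
  #   vgchange -a ln $1      becomes      vgchange --monitor y -a ln $1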

My cluster starts a DRBD resource, promotes it (dual-primary), then starts up clvmd (ocf:lvm2:clvmd), activates the appropriate LVM volumes (ocf:heartbeat:LVM), then mounts the GFS2 file system (ocf:heartbeat:Filesystem).  These are all cloned resources.
primitive p_clvmd ocf:lvm2:clvmd \
    op start interval="0" timeout="100" \
    op stop interval="0" timeout="100" \
    op monitor interval="60" timeout="120"
primitive p_drbd_data ocf:linbit:drbd \
    params drbd_resource="data" \
    op start interval="0" timeout="240" \
    op promote interval="0" timeout="90" \
    op demote interval="0" timeout="90" \
    op notify interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    op monitor interval="15s" role="Master" timeout="20s" \
    op monitor interval="20s" role="Slave" timeout="20s"
primitive p_fs_vm ocf:heartbeat:Filesystem \
    params device="/dev/cdata/vm" directory="/opt/vm" fstype="gfs2"
primitive p_lvm_cdata ocf:heartbeat:LVM \
    params volgrpname="cdata"
ms ms_drbd_data p_drbd_data \
    meta master-max="2" clone-max="2" interleave="true" notify="true"
clone cl_clvmd p_clvmd \
    meta clone-max="2" interleave="true" notify="true" globally-unique="false" target-role="Started"
clone cl_fs_vm p_fs_vm \
    meta clone-max="2" interleave="true" notify="false" globally-unique="false" target-role="Started"
clone cl_lvm_cdata p_lvm_cdata \
    meta clone-max="2" interleave="true" notify="true" globally-unique="false" target-role="Started"
colocation colo_lvm_clvm inf: cl_fs_vm cl_lvm_cdata cl_clvmd ms_drbd_data:Master
order o_lvm inf: ms_drbd_data:promote cl_clvmd:start cl_lvm_cdata:start cl_fs_vm:start

The LVM clone is necessary so that you can deactivate the VG before disconnecting DRBD during a standby.  Failing to do so will get the node STONITHed.  The "--monitor y" change is absolutely necessary, or you won't even bring the VG online.  Starting clvmd inside Pacemaker might not be strictly necessary, but in this instance it seems to work very well.  It's also important to note that most of the init.d scripts related to this conundrum have been disabled: clvmd and drbd, to name two.
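If you're wondering how the init scripts got disabled: something along these lines does it on 12.04, assuming the scripts are actually named clvm and drbd on your install:

  update-rc.d -f clvm remove   # removes the rc.d symlinks so the script no longer starts at boot
  update-rc.d -f drbd remove   # Pacemaker owns these daemons now; the init scripts must not race it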

The GFS2 file system will not mount without gfs_controld running.  gfs_controld won't start on a clean Ubuntu Server 12.04 system because it seems the cman init script is fucked up.  Can't understand it, but inside /etc/init.d/cman you'll find a line that reads:
gfs_controld_enabled && cd /etc/init.d && ./gfs2-cluster start
Comment out this line and add this below it:
if [[ gfs_controld_enabled ]]; then
      # (note: [[ gfs_controld_enabled ]] just tests a non-empty string, so this branch is
      # always taken and gfs2-cluster gets started unconditionally)
      cd /etc/init.d && ./gfs2-cluster start
fi
This will make the cman script actually CALL the gfs2-cluster script and thus start the gfs_controld daemon.  Shutdown seems to work correctly with no additional modifications.  You will find that once all these pieces are in place, GFS2 is viable on Ubuntu 12.04 AND you can bring your cluster up and down without watching your nodes commit creative suicide.

I honestly don't know why this is the way it is.  I wouldn't know where to even assign blame.  In the Ubuntu Server 12.04 Cluster Guide (work-in-progress), they suggest this resource:
primitive resGFSD ocf:pacemaker:controld \
        params daemon="gfs_controld" args="" \
        op monitor interval="120s"
This seems rather like a bastardization of what this resource agent is really for, but perhaps it works for them.  However, I would highly suspect this might suffer from the same issues that I ran into with OCFS2: that if CMAN isn't running the controld, putting a node into standby will wreak havoc on the node and cluster.  With OCFS2, the issue was in the ocfs2_controld daemon, which CMAN was all too happy to try to bring offline but would NOT under any circumstances that I could find start it up.

Once it's started by Pacemaker, you also can't seem to take it down, meaning the resource fails to stop and becomes a disqualifying offense for the node.  The issue seems unrelated to the missing killproc command (which is non-standard among distributions), because even when you fix/fake it, the thing does not seem to accomplish anything.  ocfs2_controld continues to run in the background, and cman will fail to shut down correctly after you try bringing a node down gracefully.  No ideas yet on how to fix this, but I might take a run at it next.  I had detailed making a working Ubuntu 12.04 OCFS2 cluster in a previous post...I will be double-checking those steps...

20121105

Useful

IPMI v2.0 - accessing SOL from Linux command line.

http://wiki.nikhef.nl/grid/Serial_Consoles

(use a bit rate that makes sense for you, only set if necessary)
ipmitool -I lanplus -H host.ipmi.nikhef.nl -U root sol set volatile-bit-rate 9.6
ipmitool -I lanplus -H host.ipmi.nikhef.nl -U root sol set non-volatile-bit-rate 9.6
 
ipmitool -I lanplus -H IPMI-BMC-IPADDR -U BMCPRIVUSER sol activate
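
To get back out of the console, type the SOL escape sequence (~. by default), or deactivate the session from another shell:

ipmitool -I lanplus -H IPMI-BMC-IPADDR -U BMCPRIVUSER sol deactivate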



20121102

Led Astray

It's frustrating, and it's my own damn fault.

I read in the HP v1910 switch documentation that an 802.3ad bond would utilize all connections for the transmission of data.   Even with static aggregation I thought I'd get something different than what, in fact, I received.  To quote their introduction on the concept of link aggregation:

"Link aggregation delivers the following benefits: * Increases bandwidth beyond the limits of any single link.  In an aggregate link, traffic is distributed across the member ports."
I'll spare you the rest.  It's my own damn fault because I took that little piece of marketing with an assumption:  That "traffic" indicated TCP packets regardless of their source or destination.  I know better now, and I do bow and scrape to the Prophets of Linux Bonding, the deities that espouse Whole Technical Truth.  I am not worthy!

Despite my best efforts, I cannot get more than 1G/sec between two LACP-connected machines.  Running iperf -s on one, and iperf -c on the other, the connection saturates as though a single channel were all that was available.  The only benefit then is that different machines are distributed across these multiple connections.  Those reading this and who knew better than I, I am sorry.  I'm an idiot.  May this blog serve to save others from my fate.

Static aggregation, as far as my HP switches are concerned, does nothing for mode-0 connections.  I can get a little better throughput, but watching the up-and-down of the flow rates suggests there is much evil happening, and I don't like it.  Plus, I can't really distribute a static aggregation across my switches as far as I know - maybe the HP switch stacking feature would help with this, but I also sense much evil there and don't want to go at it.

The only benefits I can derive from RR is by placing all connections into separate VLANs.  That, of course, kills any notion of redundancy and shared connectivity.  First, it's like having multiple switches, but if a single connection from a single machine goes down, then that whole machine is unable to communicate with the other machines across those virtual switches.  So, bollocks to that.

Second, it's damn hard to figure out a good, robust and non-impossible way to configure these VLANs to also communicate with the rest of the world.  I guess that it all boils down to my desire to use the maximum possible throughput to and from any given machine, without having to jump through hoops like creating gateway hosts just to aggregate all these connections into something recognizable by other networking hardware.  I am also not willing to sacrifice ports to the roles of active-passive, even though that would allow me at least one switch or link failure before catastrophic consequences took hold.

It's my own damn fault because I didn't take the time to read the bonding driver kernel documentation that the Good Lords of Kernel Development took the time to write.  I didn't, at least, until last night.  I pored through it, reading the telling tales of switches and support and the best way to get at certain kinds of redundancy or throughput.

802.3ad obviously doesn't do much for me either.  After reading the docs, I know this.  It does make aggregation on a single switch rather easy, but no more or less easy than mode-6 bonding.  Well, I take that back.  It IS less easy because the switch needs its ports configured.  It also doesn't support my need for multi-switch redundancy, so 802.3ad is out, too.

In short, if you're thinking of bonding two bonds together, don't.  It's just not worth it.  The trouble, the init scripts, the switch configuration will just not do you any good.  You'll still be stuck with 1 G/sec per machine connection.  Even worse, you might not get your links back quickly enough if someone trips over the power strip running your two highly-available switches.

I considered the VLAN solution, minus its connection to the world, thereby encapsulating my SAN-to-Hypervisor subnet in its own universe of ultra-high-throughput.  3 G/sec seemed a nice thing.  I managed to get close to that throughput; but, sadly, given that single-link failures would be catastrophic, I can't afford to take that risk.  Redundancy is too important.  I will relegate myself to mode-6, as it appears to be the most flexible, the most robust and the most reliable with regard to even link distribution.

I hope the price of 10GigE drops sooner rather than later...

20121031

Bonded Bonds - A Brief Follow-up

I am reconsidering my original plan to bond two mode-4 bonds over a mode-6 super-bond.

To be fair, it works.  BUT, link recovery-time on total catastrophic all-link-failure is not good.  I would guess this is an edge-case that no one has ever really thought about before, and perhaps there is insufficient maturity in this sort of functionality.  OR I have horribly mis-configured the super-bond device.  More tuning and testing would be in order.

Basically, I have my two bonds connected to two switches and 802.3ad running.  Everything looked smooth, and aside from the fact that I can't get any iperf bandwidth tests to show rates above 1 gbit/sec, I felt it was fairly stable.  With the mode-6 super-bond joining the two mode-4 bonds, I could easily disconnect any three wires and still have connectivity, albeit with a slight (1 to 3 seconds) delay if the currently active slave was not the last mode-4 bond to be connected.  I suspect that delay is ARP-related, but that's just a hunch.

The trouble began when I simulated a multi-switch failure.  Ordinarily I would expect that once either switch is restored, the super-bond will come back up.  That was either (a) not the case, or (b) taking so long to occur that it was unacceptable.  Of course, if a multi-switch failure really occurred during live operation, chances are I would not be able to do anything about the resultant consequences in time to prevent other resultant and catastrophic consequences, and maybe this is an edge-test-case that I should worry less about.

For the record, to test the above failure case, I pulled the cables first from switch A, then from switch B, one at a time.  After a few seconds I plugged a cable back into switch A.  No results.  I'll have to test again to be certain, but I either had to plug both A cables back into switch A, and/or both B cables back into switch B before I saw connectivity restored.  I want to say it was the latter case - the last switch to disconnect being the first one that would need to reconnect.

Using mode-6 on all links does allow me to traverse the switches and fail both switches, the last switch to fail being the last to be brought back up, with no appreciable delay in restoration of connectivity.  I suspect that there is some non-talking happening between the super-bond and its subordinate bonds, but without digging through the code or asking people a lot more knowledgeable than myself, I will never know for certain.

Of course, on that note, Ubuntu 12.04 hasn't been exactly friendly about bringing my links back up after reboots.  If it can't detect a carrier on one of the lines, it won't bring the interface back.  Worse yet, it won't (or didn't) add it to the mode-6 bond on boot.  If it's a deterministic failure on boot, fine.  If it's non-deterministic, so much the worse, but the outcome would be the same:  here again I may need to resort back to my manual script for initial configuration, as I want to guarantee that no matter the circumstances, the system is configured to expect connectivity someday, if not immediately.

I plan on doing additional bandwidth testing, and will probably try the stacked bonding one more time for kicks.  I really want the maximum possible bandwidth along with the maximum possible reliability.  We shall see how things turn out.

20121029

Sins of the Bond

Here I explore the interesting notion of bonding a set of bonds.

I'm working on configuring some very awesome, or perhaps convoluted, networking as I prepare to re-architect an entire network.  As I'm in the midst of this, I'm trying to decide how best to go about utilizing a four-fold set of links.  Basically:

  • I have a bunch of 4-port gigabit cards.
  • At least 4 of the 12 links are dedicated to one particular subnet - I plan on blasting the hell out of it.
4 gigabits per second?  I'd sure like that.  Dual-switch redundancy?  I'd like that even better.  I can't have both at the same time.  But can I get close?

For the remainder of this entry, I'll be referring to the Linux bonding driver and its various modes.

I usually use mode 6 (balance-alb) because I want the most out of my bonds, given the clients that may need that sort of bandwidth.  However, it appears that it only really provides 1 link per connection.  So, if I add 4 links to a mode-6 bond, I'll still only get 1 g/sec of throughput per connection, but I can do that to 4 systems simultaneously.  That's great, but not good enough if I only have three or four systems to connect up.

My new switches support static and dynamic LACP.  That's really nice, but it also means that I'll either have to put all four links on one switch, or run mode-6 across two mode-4 bonds.  That's 2 g/sec per single connection, a total of 4 g/sec of throughput to up to two machines.  Naturally, you'd have to work hard to saturate that, so I expect it would spread the load quite nicely.

So, what are all the options?

  1. Simplest:  Operate mode-6 across all links and both switches.  This achieves bandwidth of 1 g/sec per connection, up to 4 g/sec aggregate across all connections.
  2. Semi-redundant:  Operate mode-4, keeping all links on one switch.  This is not preferred, and moreover won't achieve much benefit if the inter-switch links cannot handle that capacity.  Offers  4 g/sec no matter what.
  3. Mode-6+4+4:  Operate two mode-4 bonds, one bond per participating switch, and bond those together with mode-6.  2 g/sec per connection, up to 4 g/sec aggregate.  We can lose any three links or either switch and still operate.
I am leaning toward Option 3.  It's a compromise to be sure, but will guarantee that I get both the throughput I am looking for, and the redundancy I need.  In the future I can always increase the size of the mode-4 bonds by adding more NICs.

The first challenge is setting up a bond within a bond.  Ubuntu 12.04 offers some nice things with their network configuration file, but I think it has some bugs and/or takes a few things upon itself that maybe it shouldn't.  Specifically, I've noticed it fails to order the loading of the various stacked network devices correctly.  I started into the kernel documentation for the bonding driver, and after toying with it for a while I came up with this set of sysfs-based calls:
#!/bin/bash

modprobe bonding

SYS=/sys/class/net

if [[ -e $SYS/idmz ]]; then
  echo -idmz > $SYS/bonding_masters
fi

if [[ -e $SYS/idmz1 ]]; then
  echo -idmz1 > $SYS/bonding_masters
fi

if [[ -e $SYS/idmz2 ]]; then
  echo -idmz2 > $SYS/bonding_masters
fi


# create master bonds
echo +idmz > $SYS/bonding_masters
echo +idmz1 > $SYS/bonding_masters
echo +idmz2 > $SYS/bonding_masters

# configure bond characteristics
echo 6 > $SYS/idmz/bonding/mode
echo 4 > $SYS/idmz1/bonding/mode
echo 4 > $SYS/idmz2/bonding/mode

echo 100 > $SYS/idmz/bonding/miimon
echo 100 > $SYS/idmz1/bonding/miimon
echo 100 > $SYS/idmz2/bonding/miimon

echo +e1p2 > $SYS/idmz1/bonding/slaves
echo +e3p2 > $SYS/idmz1/bonding/slaves
echo +e2p3 > $SYS/idmz2/bonding/slaves
echo +e3p3 > $SYS/idmz2/bonding/slaves

echo +idmz1 > $SYS/idmz/bonding/slaves
echo +idmz2 > $SYS/idmz/bonding/slaves


These calls achieve exactly what I want: two 2 gigabit bonds that can live on separate switches, and yet appear under the guise of a single IP address.  The only thing I have omitted from the above example is the ifconfig on idmz for the network address.  This can evidently also be accomplished through sysfs.

I've toyed around with /etc/network/interfaces a bit, and just couldn't get it to act the way I wanted.  I need an ordering constraint, or some sort of smarter dependency tracking.  Well, I'm not going to get it, so a custom launch-script is probably the necessary thing for me.  I have, luckily, worked it out such that I can still use upstart to configure the adapter, route, and DNS:
iface idmz inet static
   pre-up /root/config-networking.sh
   address 192.168.1.2
   netmask 255.255.255.0
(I have obviously omitted the route and DNS settings here.)  The config-networking.sh script is essentially what you see above, with all the sysfs calls.  It's not terribly elegant, but it gets the job done.  You will probably be wondering where my eth0, eth1, eth2 adapter names are.  I renamed them in the persistent-net udev rules to correspond to the adapter ports.  I have, after all, 12 ports not counting the two onboard that are currently nonfunctional (that's another story).  e1p2 is the third port on the second card, counting from zero.
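
For reference, a renamed entry in /etc/udev/rules.d/70-persistent-net.rules looks roughly like this (the MAC address here is made up):

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1b:21:aa:bb:cc", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="e1p2"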

In case you want to do some poking, make sure to ls and cat the files in /proc/net/bonding.  You will be able to easily interrogate the various bonds as to their state and membership.  It was here I discovered that my mode-6 bond simply kept refusing to add my mode-4 bonds.  The basic issue appears to be that if you try to add an empty bond-master to another bond-master, configuration of the new master fails.  The bond-master-slave needs to have at least one adapter under it. 

Configuration failures have been catastrophic in nature; all my bonds and other devices utterly failed to start when the requested /etc/network/interfaces configuration of idmz didn't make sense to /sbin/ifup.  The ifup/upstart/parallel-launch functionality makes bond configuration non-deterministic at best.  What appears to be extra stupid is that if my two-stage bond is the only thing getting configured, it doesn't get configured at all.  ifup magically ignores it.

I am still considering doing full network configuration in my manual script, just for the sake of knowing it will always work.  In fact, that is looking like the only real option for at least a good portion of the interfaces.

Uhg.

20121028

Absolute Evil

So, in my previous post, I outlined pretty much all of the steps I was taking to make 11.10 talk CMAN+DLM.

I've now uncovered another interesting bit of horror.

Perhaps this is well known, documented, and all that good stuff...

I had my 11.10 server running fine, talking with the 12.04 server.  Then I decided to let the 12.04 server take command of the cluster.  As the DC, it "improved" the version of the DC to 1.1.6-blahwhatevernonesense.  This has two interesting results: first, the 11.10 crm, which is on 1.1.5, is no longer able to interpret the CIB.  Second, when I tried to shunt back from 12.04 to 11.10 as the DC, the world pretty much ended in a set of hard reboots.

But I did learn one other thing: 12.04 could still talk to 11.10 and order it around.  So, even though 11.10's crm couldn't tell us what the hell was happening, it was, in fact, functioning perfectly normally underneath the hood.  It was able to link up with the iSCSI targets, mount the OCFS2 file systems, and play with the DLM.

I'm now back to my original choice, with one other option:
  • Just bite the bullet and upgrade to 12.04 before upgrading the cluster stacks, incurring hours and hours of cluster downtime.
  • "Upgrade" pacemaker on 11.10 to 1.1.5+CMAN and transition the cluster stack and the cluster resources more gracefully, being careful to NEVER let the 12.04 machine gain the DC role - probably not possible as more machines are transitioned to 12.04.
  • Upgrade pacemaker on 11.10 to 1.1.6+CMAN and do the same as option 2 above, except for worrying about who is the DC (for then it wouldn't matter).
I did learn how to rebuild the Debian package the, um, somewhat right way, except of course for signing the packages.  That aside, it seemed to work pretty well, and I was able to build in CMAN support like I wanted.  So, I am now tempted to try option 3 and see where that lands me.  If anything, it may be the bridge-measure I need to move from 11.10 to 12.04 without obliterating the cluster for the rest of the night.

* * * * *
A short while later...

I'm not sure this is going to work.  In attempting to transplant the Debian sources from 12.04 to 11.10, I had to also pull in the sources for the later versions of corosync, libcman, cluster-glue, and libfence.  All development versions, too.  I'm trying to build them now on the 11.10 machine, which means I am also having to install other additional libraries.

First one finished was the cluster-glue.  Not much difficulty there.

Corosync built next.  Had to apt-get more libraries.  Each time I'm letting debuild tell me what it wants, then I go get it, and then let it fly.  

redhat-cluster may be a problem.  It wants a newer version of libvirt.  The more I build, the more I wonder just how bad that downtime is going to be...  Worse, of course, would be upgrades, except that I could just as easily nuke each node and do a clean reinstall.  That would probably be required if I go this route.

* * * * *
A shorter while later...

The build is exploding exponentially.  libvirt was really the last straw.  For kicks I'm trying to build Pacemaker 1.1.6 manually against the system-installed Corosync et al.  For sake of near-completeness I'm using the configure flags from the debian/rules file. 

The resulting Pacemaker 1.1.6 works, and seems to work acceptably well.  The 11.10 machine it's running on may be having some issues related or unrelated to the differing library versions, Pacemaker builds, or perhaps even the kernel.  There were some rsc_timeout things happening in there.  I performed a hard-and-fast reboot of the machine, though that's not really something I can afford to do on the live cluster.  I've seen this issue before, but have never pinned it down nor had the time to help one of the maintainers trace it through different kernel versions.  I also didn't have the hardware to spare; now, it seems, I do.  It may actually be related, in some strange way, to open-iscsi. 

It makes me a bit uneasy, as I'm now not sure I can rely on this path to upgrade my cluster easily.  I can't have machines spontaneously dying on me due to buggy kernels or iSCSI initiators or what have you.

The Final Verdict

My goal is to transition a virtualization cluster to 12.04, partly because it's LTS, partly because it's got better libvirt support, and partly because it has to happen sooner or later.  I have a new 16-core host that might be able to take the whole load of the four other machines I'll be upgrading; I just won't be able to quietly and secretly transition those VMs over.  I'll have to shut them all down, configure the new host to be the only host in the cluster, adjust the cluster stack, and then bring them all back up.

I could do that.  I could even do it tonight, but I'm going to wait till I'm back in the office.  The HA SAN (another bit of Pacemaker awesomeness) is short on network bandwidth, as is the new host I want to use.  I'll want to get that a little more robust before I start pushing tons of traffic.  The downside to this approach is that I'm left completely without redundancy while I upgrade the other machines.  Of course, each completed upgrade means another machine worth of redundancy added back into the cluster.

I may attempt the best of both worlds: take one machine offline, upgrade it, and pair it with the new host.  With those two hosts up, we can bring the rest of the old cluster down and swap out the OCFS2 cluster stack.  Yes, that may work well.  At this point, I think trying to sneak 1.1.6 into 11.10 is going to be too risky. 




Cluster Upgrade: Ubuntu Server 11.10 to 12.04 LTS

I will attempt to detail here the upgrade path taken as I migrate from 11.10 to 12.04 LTS.  I am using Corosync+Pacemaker+OCFS2 on 11.10.  12.04 appears to require transitioning to CMAN in order for the DLM to work again.  DLM is required for OCFS2.

This is what I suspect, or tend toward, or have read about:
  • I will need to migrate my file systems to the new cluster stack - this should be doable with tunefs.ocfs2.
  • I will either need to upgrade the OS before or after installing and configuring CMAN.
  • I need CMAN before I can migrate OCFS2 to the new stack.
  • I am unsure what will happen to my existing resources after the migration.
  • One of my nodes is already on 12.04.  The other is on 11.10 (this is a test cluster, by the way).
Because I'll be fundamentally affecting the DLM, I will need to shut down the 11.10 node completely (as far as the cluster is concerned) before acting on it.  At least, that is what everything I've read and learned and suspect is telling me.

I have shut down the node by first putting it into standby.  Pacemaker and Corosync are then respectively brought offline.  My cluster configuration contains mostly that which is presented in the Clusters From Scratch documentation, but with some personalized modifications (forgive the spaces in the tags; I can't seem to put real XML in here without grief and I don't have the patience right now to learn the proper way to do so):
  • I am using the < dlm protocol="sctp" > option.  This goes inside the < cluster > block.
  • I have set the keyfile to that which Corosync used to use:  < cman keyfile="/etc/corosync/authkey" >
  • I have defined a specific multicast address inside the < cman > block: < multicast addr="226.94.94.123" / >
The SCTP option appears to be a Nice Thing To Have.  The cluster.conf man page says it's required when Corosync is involved.  I don't honestly know what it all means.  The keyfile is not required, but I thought it would be handy.  The multicast address is also not required, obviously.  Both the key and address are generated from the cluster name.  I am defining them explicitly here because I'm toying around and like the notion of being able to define an address and key that will NEVER EVER CHANGE.  EVER.
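
Pulled together, the relevant fragment of /etc/cluster/cluster.conf comes out roughly like this (cluster name and node entries are placeholders):

<?xml version="1.0"?>
<cluster name="mycluster" config_version="1">
  <dlm protocol="sctp"/>
  <cman keyfile="/etc/corosync/authkey">
    <multicast addr="226.94.94.123"/>
  </cman>
  <clusternodes>
    <!-- one clusternode entry per host goes here -->
  </clusternodes>
</cluster>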

I am running ccs_config_validate on each node to make sure everything is kosher.  I found that it complained loudly when the fence-agents package was not installed.  I will dump a list of what apt-gets I did at the bottom of this post.  As I probably had to mention in another post, Ubuntu Server used to configure the /etc/hosts file with 127.0.1.1 pointing to the host name.  This screws up cman very nicely, as it has auto-detect magic, and it binds itself to this pseudo-loopback instead of the real adapter.  If your machines don't connect, run

corosync-objctl -a | grep ^totem

and you might see:  totem.interface.bindnetaddr=127.0.1.1

Look familiar?  What a pisser...  With that fixed, both nodes now appear when I run cman_tool nodes.  Now I shall attempt to upgrade the cluster stack.  Before I can do this, however, I need to make some subtle changes.  For starters, the cluster configuration can no longer fuck around with the DLM.  It's managed by CMAN now, and if we toy with it we'll break everything.  I posted about that before, also.
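
For completeness, the hosts-file fix mentioned above is nothing more exotic than pointing the hostname at its real address - names and addresses invented:

  # /etc/hosts - before
  127.0.1.1      node1.example.com node1
  # /etc/hosts - after
  192.168.1.11   node1.example.com node1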

I start pacemaker and then enter configuration-mode.  Wait, no I don't.  11.10's version of Pacemaker doesn't support CMAN.  Now I remember why I dreaded this fucking upgrade.  I have two choices now:
  • Perform the upgrade to 12.04 and get CMAN-support that way.
  • Obtain or build Pacemaker with CMAN support manually.
I opt to try my hand at a build.  I've never done this before on Ubuntu, so the tool setup is unfamiliar.  Getting to a ./configure is my goal, as that will be known-turf.

First, pull the necessary packages down - I don't know which of these are actually needed:

  apt-get install pacemaker-dev libcman-dev libccs-dev

The build dependencies:  apt-get build-dep pacemaker
Then the source: apt-get source pacemaker  (Do this in the directory you want it to wind up in.)
I also needed:  apt-get install libfenced*

CD into the pacemaker-x.x.x directory and do a ./autogen.sh.
Then:
    ./configure --prefix=/usr --with-ais --with-corosync --with-heartbeat --with-cman --with-cs-quorum --with-snmp --with-esmtp --with-acl --disable-fatal-warnings --localstatedir=/var --sysconfdir=/etc

Yes, I turned everything on.  If it works correctly, you should see a feature line near the end of the output, and CMAN had BETTER be there.  Had to disable fatal warnings because there were, um, stupid warnings that were fatal to the build.  Let's hope they're just warnings, eh?

  make

Now sit and wait.  Hmmmm....  this is a nice fast machine, and now it's done!  Now for the part where I shoot myself in the foot:

  make install

If you were kinda dumb like me, you may have built this as root.  Not advisable, generally speaking, but I'm to the point I don't care.  I've built entire Linux deploys from scratch (read: Linux From Scratch), so it's pretty much the same thing over and over and over for me.  I don't do LFS anymore, by the way.  Package management features of distros like Ubuntu are just too damn shiny for me to ignore any longer.


Now I discover that I did not have my paths entirely correct during the configuration step.  My cluster configuration is nowhere to be seen, because it's probably being sought in /usr/var/blah/blah/blah.  And it is so.  I've modified the above configure command to be more correct.  And now, except for a strange "touch missing operand" bit of complaining on the part of the init script, the binary has found my cluster configuration.  (edit: the last issue has also been fixed by the addition of the flag that sets /etc as a good place to go.)

With the new pacemakerd binary in place, I can get the thing started under the watchful gaze of CMAN.  Now I have to update the cluster config to reflect the fact that the o2cb stack is now CMAN.  Refer to the RA metadata for this, I won't repeat it here.  With that done, I can bring at least the 11.10 node back online.  The 12.04 node actually doesn't have all the necessary packages to make it work yet.

Predictably, the mount of the OCFS2 iSCSI drives fails - they're the wrong stack.  BUT, the o2cb driver is up, and the iSCSI drives are connected.  With that, I can do the following:

root@hv06:~# tunefs.ocfs2 --update-cluster-stack /dev/sdc
Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? yes

So, we can learn from this that updating the cluster stack must be done while the ENTIRE FRIGGING CLUSTER IS DOWN.  I'll try to remember that for when I do this for real.  I can now clean up the two failed filesystem resources, and lo and behold they mount!  I can now configure the 12.04 machine to follow suit.

Conclusion

Now, having done all this, I am left to wonder whether or not this is in fact the best route.  On the up-side:
  1. I can easily copy the pre-custom-built pacemakerd stuff to all my 11.10 cluster machines and forego the painful build steps.
  2. I can take the cluster down for a very short period of time.
  3. This should not affect any data on the OCFS2 volumes, as we're just updating the cluster stack.
  4. I can take my time while upgrading the individual machines to 12.04, or even leave them on 11.10 a bit longer.
  5. I could build the latest-and-greatest pacemaker and use that instead of 1.1.5.
OK, point #5 is probably not as feasible as I would like it to be, nor is it necessary in my situation.  On the down-side:
  1. I still have to take the ENTIRE cluster down to make this "minor" update.
  2. I could just push my new server back to 11.10, join it to the cluster, migrate my VMs there and then upgrade the rest of the cluster.
  3. But if I do #2 here, I will still at some point have to bring the whole cluster down just to update the OCFS2 cluster stack info to CMAN.
  4. I may be causing unknown issues once I push the 11.10 machines to 12.04.

At least point #4 of the downsides is something I can test.  I can push that machine to 12.04 and see what happens during the upgrade.  Ideally, nothing bad should come of it; the pacemakerd binary should get upgraded to the package maintainer's version, 1.1.6-ubuntu-goodness-and-various-patches-go-here.

This will probably be useful to me: http://www.debian.org/doc/manuals/packaging-tutorial/packaging-tutorial.en.pdf

We shall see.

20121027

Forget the Brain... DLM in Ubuntu 12.04

This is a stream-of-research post - don't look for answers here, though I do link some interesting articles.

I'm in the process of preparing my cluster for expansion, and in the midst of installing a new server I inadvertently installed 12.04.1 instead of 11.10.  The rest of the cluster uses 11.10.

Some important distinctions:
  • 12.04 seems to support CMAN+Corosync+Pacemaker+OCFS2 quite well.
  • The same is not for certain on 11.10.
  • 12.04 NO LONGER has dlm_controld.pcmk.
  • Trying to symlink or fake the dlm binary on 12.04 does not appear to work, from what my memory tells me.

You CAN connect 12.04's and 11.10's Corosyncs and Pacemakers, but as far as I can tell, only if you Don't Need DLM.

I Need DLM.

So, I am trying to understand CMAN a bit better.  Here are some interesting articles:

Configuring CMAN and Corosync - this explains why some of my configurations failed brutally - http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf

Understanding CMAN and Corosync - written earlier than the above document - http://people.redhat.com/ccaulfie/docs/Whither%20cman.pdf

In summary - the CMAN-related binaries load Corosync, but CMAN itself is a plugin for Corosync, providing quorum support. 

Uuuhhhgggg...

CMAN generates the necessary Corosync configuration parameters from cluster.conf and defaults. 

Corosync appears to be the phoenix that rose from the, well, ashes of the OpenAIS project, since that project's page announces the cessation of OpenAIS development.  Corosync deals with all messaging and whatnot, and it thus appears that CMAN provides definitive quorum information to Corosync even though Corosync has its own quorum mechanisms (which, if I read it right, are distilled versions of earlier CMAN incarnations).



20121012

Ubuntu Server 12.04 - Waiting for Network Configuration

Just ran across an interesting issue, and the forums I've read so far don't provide a real clear answer.  I don't know that this is the answer, either, but it may be worth pursuing.  This is a bit of stream-of-consciousness, by the way - my apologies.

I just set up some new servers and was in the midst of securing them.  The first server I started on has a static IP, a valid gateway, valid DNS server, and all the networking checked out.  On reboot, however, it would take forever to kill bind9, and then I'd see almost two minutes worth of "Waiting for network configuration."  Well, there are only statically-assigned adapters present, and the loopback (which was left in its installer-default state).

I had introduced a slew of rules via iptables and I suspect they were wreaking havoc with the boot/shutdown procedures.  If someone else is experiencing this problem, try nuking your iptables and make sure it doesn't reload on reboot - hopefully you'll see everything come back up quickly.  UFW users would obviously need to disable ufw from operating.  FWIW, I placed my iptables loader script in the /etc/network/if-pre-up.d/ folder, so it's one of the first things to crank up when networking starts.
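
For what it's worth, the loader script itself is nothing fancy; a minimal sketch (the file name and rules path are my own choices):

#!/bin/sh
# /etc/network/if-pre-up.d/iptables-load
# Restore saved firewall rules before any interface comes up.
# (Rules were previously captured with: iptables-save > /etc/iptables.rules)
[ -r /etc/iptables.rules ] || exit 0
/sbin/iptables-restore < /etc/iptables.rules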

Now, I have similar iptables configurations present on other machines, and I don't know that those machines specifically have the same problem.  That being said, I really haven't rebooted them frequently enough to notice.

* * * * *

After a bit more experimentation, it appears there is some dependency on allowing OUTPUT to the loopback.  Specifically, I'm looking at logs that note packets being sent from my machine's configured static address to the loopback, and consequently they're being dropped by my rules.  They're TCP packets to port 953.  This is apparently rndc, which is related to BIND - and that makes sense, since my other machines do not run BIND daemons.

This rule, while not the most elegant, and probably not the most correct, fixes the issue for now:

-A OUTPUT -m comment --comment "rndc" -o lo -p tcp --dport 953 -j ACCEPT

It is probably important to note that this machine is not a gateway and so drops any packets that would be forwarded.  I suppose I'm hoping this will be secure, but I just get a strange feeling something more needs to be done.

More on this later, hopefully.

20120914

Resizing a Live iSCSI Target

Despite upgrading the hard drives in my SAN servers, I finally hit the drive-limits of my iSCSI targets and now had to make use of all that extra hard drive space.  Unfortunately, it seems there isn't a lot of information or NICE tool-age available to make this happen seamlessly.  That's OK, it wasn't as painful as I thought it was going to be.

My setup is as follows: mdadm RAID -> LVM -> DRBD (replicated between the two hosts) -> ietd iSCSI targets, all managed by Pacemaker.

The stores can be managed on either of the two cluster hosts, and usually this results in a splitting of the load.  The first requirement was, of course, to enlarge the RAID store.  That was easy with mdadm.  Second was to resize the two logical volumes that are used as backing stores for the DRBD devices, via LVM.  Next, DRBD had to be told to resize each volume, which predictably caused a resync event to occur.  Once that was finished, it was time to notify the initiators.
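
In command form, that chain looked more or less like this (device, VG, LV, and DRBD resource names here are placeholders):

  mdadm --grow /dev/md0 --size=max        # let the array use the space on the enlarged members
  pvresize /dev/md0                       # tell LVM the PV underneath it grew
  lvextend -L +250G /dev/vg_san/store0    # grow the LV backing the DRBD device
  drbdadm resize store0                   # DRBD picks up the larger backing device; expect a resync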

This is where a little trickery had to take place.  So far I've not really found anything that made it easy to tell ietd to "rescan" itself or to otherwise realize that its underlying target devices might have changed their sizes.  About the only thing I could really find was to basically remove and re-add the target, or restart it if you will.

Not really a fun idea, but at least Pacemaker gave me an out.  Instead of shutting down each target, I migrated each target and then unmigrated it back:
  crm resource migrate g_iscsistore0
  crm resource unmigrate g_iscsistore0

It's important to realize that you must wait for the migration to actually complete before un-migrating.  The un-migrate is used to remove the constraint that was automatically generated to force the migration.  This effectively causes the target restart I needed, and because the cluster is properly configured no initiators realized the connection was ever terminated.  This was important because the targets are very live and it's not easy to shut them down without shutting down several other machines.  This will probably be a problem for me in the future when I go to upgrade both the VM cluster that relies on these stores, and the storage cluster that serves them, to a newer release of Ubuntu Server.

In the meantime, I have now effectively resized the targets, and the next step is obviously the initiators.  I have this script to check for a resized device by occasionally asking open-iscsi to rescan:

rescan-iscsi.sh

#!/bin/bash
/sbin/iscsiadm -m node -R > /dev/null


This is actually set up as a cron job on the initiators, to run every 15 minutes.  By now all the machines in the cluster should have recognized the new device sizes.  I can now perform the resize online from one of the initiators:
  tunefs.ocfs2 -v -S /dev/sdc

The resize should be transparent and non-interrupting.  It only took a few minutes for each store to complete.  I now have two 500G iSCSI targets, ready for more data!

One thing I'd really like to do in the future is have my initiators NOT use /dev/sd? names.  I'm not quite sure yet how to do that.  I have run into problems where smartd would try to access the iSCSI targets via the initiator connection and cause the SAN nodes to die horrific deaths.  Not sure what that's about, either.

20120913

ZFS, Additional Thoughts

I am about to expand my ZFS array, and I'm a little bit stuck...not because I don't know what to do, but because I am reflecting on my experiences thus far.

I guess I just find ZFS a little, well, uncomfortable.  That's really the best word I can come up with.  It's not necessarily all ZFS' fault, although some of the fault does lie with it.  I'll try to enumerate what's troubling me.

First, the drive references - they recommend adding devices via their /dev/disk/by-id (or similarly unique-but-consistent) identifiers.  This makes sense in terms of making sure that the drives are always properly recognized and dealt with in the correct order, and having been through some RAID hell with drive ordering I can attest that there have been instances where I've cursed the seeming-randomness of how the /dev/sd? identifiers are assigned.  That being said, my Linux device identifiers look like this:

    scsi-3632120e0a37653430e79784212fdb020

That's really quite ugly and as I look through the list of devices to prune out which ones I've already assigned, I'm missing my little 3-character device names....a lot.  This doesn't seem to be an issue on OpenSolaris systems, but I can't/won't run OpenSolaris at this time.

Second, there's the obvious "expansion through addition" instead of "expansion through reshaping."  I want to believe that two RAID-5-style arrays will give me almost as much redundancy as a single RAID-6, but truth be told any two drives could fail at any time.  I do not think we can safely say that two failing in the same array is less likely than one failing in each array.  If anything, it's just as likely.  If Fate has its say, it's more likely, just to piss off Statistics.

But this is what I've got, and I can't wait another 15 days for all my stores to resync just because I added a drive.  That will be even more true once the redundant server is remote again, and syncing is happening over a tiny 10Mbit link.  I'll just have to bite the bullet and build another raidz1 of four drives, and hope for the best.

Third, I'm just a little disturbed about the fact that once you bring something like a raidz online, there is no initial sync.  I guess the creators of ZFS might have thought it superfluous.  After all, if you've written no data, why bother to sync garbage?  It's just something I've come to expect from things like mdadm and every RAID card there is, but then again I suppose it doesn't make a lot of sense after all.  I'm trying to find a counter-example, but so far I can't seem to think of a good one.

Fourth, the tools remind me a little of something out of the Windows age.  They're quite minimalist, especially when compared to mdadm and LVM.  Those two latter tools provide a plethora of information, and while not all admins will use it, there have been times I've needed it.  I just feel like the conveniences offered by the ZFS command-line tools actually take away from the depth of information I expect to have access to.  I know there is probably a good reason for it, yet it just isn't that satisfying.

The obvious question at this point is: why use it if I have these issues with it?  Well, for the simple fact that it does per-block integrity checks.  Nothing more.  That is the one killer feature I need because I can no longer trust my hard drives not to corrupt my data, and I can't afford to drop another $6K on new hard drives.  I want so badly to have a device driver that implements this under mdadm, but writing one still seems beyond the scope of my available time.

Or is it?

20120907

ZFS on Linux - A Test Drive

What Happened??

I have been suffering through recent data corruption events on my multi-terabyte arrays.  We have a pair of redundant arrays, intended for secure backup of all our data, so data integrity is obviously of importance.  I've chosen DRBD as an integral part of the stack, because it is solid and totally rocks.  I had gone with mdadm and RAID-6, after a few controller melt-downs left me with a bitter aftertaste.  Throw in some crypto and LVM and voila!

Then a drive went bad.

There may even be more than one.  Even though smartd detected it, SeaTools didn't immediately want to believe that the drive was defunct.  It took a long-repair operation before the drive was officially failed.  Meanwhile, I have scoured the Internets in search of a solution to what is evidently a now-growing problem.

The crux of the issue is that drive technology is not necessarily getting better (in terms of error rates and ECC), but we are putting more and more data on it.  I've seen posts where people have argued convincingly that, mathematically, the published bit-corruption rates are now unacceptably high in very large data arrays.  I'm afraid that based on my empirical experience, I must concur.  No sectors reported unreadable, no clicking noises were observed, and yet I lost two file systems and an entire 2TB of redundant data.

Thank goodness it was redundant, but now I seriously fear for the primary array's integrity, for it is composed of the same kind and manufacturer's drives as the array I am now rebuilding.  I guess I'm a little surprised that data integrity has never really been a subject of much work in the file-system community; then again, a few years ago, I probably wouldn't have thought much of it myself, but as I am now acutely aware of the value of data and the volatility of storage mediums, it's now a big issue.

A Non-Trivial Fix


I had originally thought that drives would either return good data or error-out.  This was not the case.  The corruption was extremely silent, but highly visible once a reboot proved that one file system was unrecoverable and the metadata for one DRBD device was obliterated.  RAID of course did nothing - it was not designed to do so.  The author of mdadm has also, in forum posts, said that an option to perform per-read integrity checks of blocks from the RAID was not implemented, and would not be implemented...though it is probably possible.  That's unfortunate.

I looked for a block-device solution, like how cryptsetup and DRBD work, to act as an intermediary between the physical medium and the remainder of the device stack.   Such a device would provide either a check on sector integrity and fail blocks up as needed, or provide ECC on sector data and attempt to fix data as it went, only failing if the data was totally unrecoverable.

I considered some options, and decided that a validity-device would best be placed between the RAID driver and the hard drive, so that it could fail sectors up to the RAID and let RAID recover them from the other disks.  This assumes that the likelihood of data corruption occurring across the array in such a way as to contaminate two (or three) blocks of the same stripe on a RAID-5 (or 6) would be statistically unlikely.

An ECC-device would probably be best placed after the RAID driver, but could also sit before it.  It might not be a bad idea to implement and use both - a validity-device before the RAID and an ECC-device after it.  Obviously we can no longer trust the hardware to do this sort of thing for us.

I performed a cursory examination of Low-Density Parity Check codes, or LDPC, but alas my math is not so good.  There are some libraries available, but writing a whole device driver isn't quite in my time-budget right now.  I'd love to, and I know there are others who would like to make use of it, so maybe someday I will.  Right now I need a solution that works out of the box.

The Options


The latest and greatest open-source file system is Btrfs.  Unfortunately it's much too unstable, from what I've been reading, to be trusted in production environments.  Despite the fact that I tend to take more risks than I should, I can't bring myself to go that route at this time.  That left me with only ZFS, the only other file system that has integral data integrity checks built in.  This looked promising, but being unable to test-drive OpenSolaris on KVM did not please me.

ZFS-Fuse is readily available and relatively stable, but lacks the block-device manufacturing capability that native ZFS offers.  A happy, albeit slightly more dangerous alternative to this is ZFS-On-Linux (http://zfsonlinux.org/), a native port that, due to licensing, cannot be easily packaged with the kernel.  It can however be distributed separately, which is what the project's authors have done.  It offers native performance through a DKMS module, and (most importantly for me) offers the block-device-generating ZFS Volume feature.

Test-Drive!

ZFS likes to have control over everything, from the devices up to the file system.  That's how it was designed.  I toyed around with setting up a RAID and, through LVM, splitting it up into volumes that would be handled by ZFS - not a good idea: a single drive corrupting itself causes massive and wanton destruction of ZFS' integrity, not to mention that the whole setup would be subject to the same risks that compromised my original array.  So, despite my love of mdadm and LVM, I handed the keys over to ZFS.

I did some initial testing on a VM, by first creating a ZFS file system composed of dd-generated files, and then introduced faults.  ZFS handled them quite well.  I did the same with virtual devices, which is where I learned that mdadm was not going to mix well with ZFS.  I have since deployed on my redundant server and have started rebuilding one DRBD device.  So far, so good.

What I like about ZFS is that it does wrap everything up very nicely.  I hope this will result in improved performance, but I will not be able to gather any metrics at this time.  Adding devices to the pool is straightforward, and replacing them is relatively painless.  The redundancy mechanisms are also very nice.  It provides mirroring, RAID-5, RAID-6, and I guess what you could call RAID-(6+1) in terms of how many devices can fail in the array before it becomes a brick (one, one, two, and three devices respectively, in case you were wondering).

What I dislike about ZFS, and what seriously kept me from jumping immediately on it, was its surprisingly poor support for expanding arrays.  mdadm allows you to basically restructure your array across more disks, thus allowing for easy expansion.  It even does this online!  ZFS will only do this over larger disks, not more of them, so if you have an array of 3 disks then you will only ever be able to use 3 disks in that array.  On the bright side, you can add more arrays to your "pool", which is kind of like adding another PV to an LVM stack.  The downside of this is that if you have one RAID-6 with four devices, and you add another RAID-6 of four devices, you are now down four-devices-worth of space when you could be down by only two on mdadm's RAID-6 after restructuring.

So once you choose how your array is going to look to ZFS, you are stuck with it.  Want to change it?  Copy your data off, make the change, and copy it back.  I guess this is what people who use hardware RAID are accustomed to - I've become spoiled by the awesome flexibility of the mdadm/LVM stack.  At this point, however, data integrity is more important to me.

Consequently, with only 8 devices available for my ZFS target (and really right now only 7, because one is failed and removed), I had to choose basic 1-device-redundancy RAIDZ and split the array into two 4-device sub-arrays.  Only one sub-array is currently configured, since I can't bring the other one up until I have replaced my failed drive.  With this being a redundant system, I am hopeful that statistics are on my side and that a dual-drive failure on any given sub-array will not occur at the same time as one on the sibling system.
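
For the record, the first sub-array boils down to something like this (pool name and device ids are placeholders); the second four-device raidz1 gets added to the same pool later with zpool add:

  zpool create tank raidz1 \
      /dev/disk/by-id/scsi-3632120e0a37653430000000000000001 \
      /dev/disk/by-id/scsi-3632120e0a37653430000000000000002 \
      /dev/disk/by-id/scsi-3632120e0a37653430000000000000003 \
      /dev/disk/by-id/scsi-3632120e0a37653430000000000000004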

We Shall See.

20120802

VM Cluster...now with STONITH! (On a Budget)

For some reason my Ubuntu 11.10 VM host servers have been misbehaving.  When one of the nodes died, it caused the DLM to hang and my other nodes quickly perished afterwards.  I've blogged about this a time or two now.  STONITH is the only answer, but how do you do it without spending $450 on an ethernet-controlled PDU?  Well, I "have" $450, but I have yet to put the stupid purchase request through to management...what can I say?  I'm busy...and it grieves me greatly.

What follows is what I did to make super-cheap STONITH a reality.  It does require some hardware, but if you have a lot of servers, you probably have a lot of UPSs, and maybe you might even have some that are APCs and are recent and have the little USB connection for your system.

In my case, I had several APC BackUPS ES (500 and larger) units lying around, all with fresh new batteries.  Most of them had USB connectivity.  These units formed my new, albeit temporary, fencing solution.

Configuring NUT

First, a NUT server is needed.  I chose a non-cluster system for this job, but every system in your cluster could be a NUT server and serve to shoot other nodes.  The chosen system in my configuration just collects statistics and monitors other servers, so it's actually a very nice server for the job of shooting nodes.  For our example, we will call the NUT server stonithserver.

root@stonithserver:~#  apt-get install nut-server

/etc/nut/ups.conf was configured for each APC device as follows:

[cn01]
  driver = usbhid-ups
  port = auto
  serial = "BB0........3"


[cn02]
  driver = usbhid-ups
  port = auto
  serial = "BB0........5"

[cn03]
  driver = usbhid-ups
  port = auto
  serial = "BB0........8"

(The serial numbers here are obfuscated for security reasons, as are the cluster node names.  Your devices' serial numbers should be all alpha-numeric characters.  Methods other than serial numbers can be used to distinguish between devices - consult the NUT documentation for more details.)

You can grab the serial numbers via "lsusb -v | less".  Once you get NUT configured for a new UPS (or all of them), use "upscmd" to test them, first to make sure you didn't screw something up, and second to make sure it's going to work correctly when it needs to work.

root@stonithserver:~# upscmd -l cn01
root@stonithserver:~# upscmd cn01 load.off

The first command should return a list of available commands for your UPS.  The second will, on my APC BackUPS ES units, cause the UPS to switch off for about 1 second.  Use the command appropriate for your unit.  My units switch back on automatically, perhaps because they're still being fed mains power.

It's probably important to secure your NUT server in /etc/nut/upsd.users, although I imagine packet sniffing would end that pretty quickly:

[stonithuser]
  password = ThisIsNotThePasswordYouAreLookingFor
  instcmds = ALL

Note that the above configuration is a very quick and simple (and probably stupid) one.  Review the relevant documentation to make for a more secure configuration.

Make sure that /etc/nut/upsd.conf is configured to allow connections in:

LISTEN 0.0.0.0 3493


Now each node in the cluster needs the nut-client package installed, or it won't be able to talk to the NUT server:

root@cn01:~#  apt-get install nut-client


root@cn02:~#  apt-get install nut-client


...

Configuring STONITH on the Cluster

Finally, some cluster configuration.  On Ubuntu, the NUT binaries are not where they are on Redhat/CentOS.  Also, my UPSs don't understand the "reset" command, so I had to change the reset command to "load.off".  It's enough to nuke a running server, and perhaps the best part is that if the server auto-powers-on (a BIOS option), you have yourself a handy way to remote-reboot any failed machine.  Add Wake-on-LAN, and it's like having IPMI power control...without the nice user interface.
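
For the Wake-on-LAN half of that equation, the wakeonlan package is about as simple as it gets - one command and a MAC address (the one below is obviously made up):

apt-get install wakeonlan
wakeonlan 00:11:22:33:44:55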

For each cluster node, a STONITH primitive is needed:



primitive p_stonith_cn01 stonith:external/nut \
        params hostname="cn01" \
               ups="cn01@stonithserver:3493" \
               username="stonithuser" \
               password="ThisIsNotThePasswordYouAreLookingFor" \
               upscmd="/bin/upscmd" \
               upsc="/bin/upsc" \
               reset="load.off" \
        op start interval="0" timeout="15" \
        op stop interval="0" timeout="15" \
        op monitor start-delay="15" interval="15" timeout="15" \
        meta target-role="Started"


A STONITH primitive is like any other primitive - it runs and can be started and stopped, so it needs a node to run on.  Restrict each one so that it doesn't run on the machine it is supposed to kill - that is, a downed node can't (and shouldn't be expected to) shoot itself:

location l_stonith_cn01 p_stonith_cn01 -inf: cn01
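
The other nodes get the same treatment: copy the cn01 primitive, adjust the hostname and ups parameters, and pin each one away from its victim.  The constraints look like this (same naming scheme assumed):

location l_stonith_cn02 p_stonith_cn02 -inf: cn02
location l_stonith_cn03 p_stonith_cn03 -inf: cn03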

Re-enable STONITH in the cluster options, because, frankly, if you're reading this then you've probably had it disabled this whole time:


property $id="cib-bootstrap-options" \
        stonith-enabled="true" \
...

Test the cluster by faking downed nodes.  Do this one machine at a time, and recover your cluster before testing another machine!  If you have three nodes, nuke one, then bring it back to life and let the cluster become stable again, and then nuke the second one.  Repeat for the third one.  This can be easily done by pulling network cables and watching the machines reboot.  Every machine should get properly nuked.
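
If you'd rather not crawl around behind the racks, there are less physical ways to provoke a fence.  These are sketches, not exactly what I did - cn01 is just the example victim, and the sysrq trick assumes sysrq is enabled on that kernel:

# On the victim itself: hard-crash the kernel, no clean shutdown
echo c > /proc/sysrq-trigger

# Or, from a surviving node: ask the fencing daemon to reboot the target
stonith_admin --reboot cn01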


NB:  BEFORE you enable STONITH in Pacemaker, make sure you have a clean CIB.  I had a few stale machines (ex-nodes) defined in my CIB.  Pacemaker thought they were unclean and tried to STONITH them.  But since they didn't really exist and also didn't have any STONITH primitives defined, the fencing failed, and in doing so prevented pretty much all my resources from loading throughout the cluster.  (I would classify that as a feature, not a bug.)  Once the defunct node definitions were removed, everything came up beautifully.
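
For what it's worth, clearing a stale node out of the CIB should be as simple as the following (old-node-name is whatever defunct entry is haunting your configuration; this is the crmsh way, so adapt if you drive Pacemaker differently):

crm node delete old-node-name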

20120730

Cluster Building, Ubuntu 12.04 - REVISED

This is an updated post about building a Pacemaker server on Ubuntu 12.04 LTS.

I've learned a great deal since my last post, as many intervening posts will demonstrate.  Most of my machines are still on 11.10.  I have finally found some time to work on getting 12.04 to cooperate.

Our goals today will be a Pacemaker+CMAN cluster running DRBD and OCFS2.  This should cover most of the "difficult" stuff that I know anything about.

For those who have tried and failed to get a stable Pacemaker cluster running on 12.04, you might find that having the DLM managed by Pacemaker is not advisable.  In fact, it's not allowed.  I filed a formal bug report and was then informed that the DLM is, indeed, managed by CMAN.  Configuring it to also be managed by Pacemaker caused various crashes every time I put a node into standby.

Installation


Start with a clean, new Ubuntu 12.04 Server and make sure everything is up-to-date.
A few packages are for the good of the nodes themselves:
apt-get install ntp

Pull down the necessary packages for the cluster:
apt-get install cman pacemaker fence-agents openais

and the necessary packages for DRBD:
apt-get install drbd8-utils

and the necessary packages for OCFS2:
apt-get install ocfs2-tools ocfs2-tools-cman ocfs2-tools-pacemaker


Configuration, Part 1

CMAN

Configure CMAN to ignore quorum if you have a two-node cluster...or don't want to wait for quorum on startup:

echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/default/cman

For the cluster.conf, there are some good things to know:
  • The cluster multicast address is,  by default, generated as a hash of the cluster name - make this name unique if you run multiple clusters on the same subnet.  You can configure it manually, though I have not yet tried.
  • The interface element under the totem element appears to be "broken," or useless, and aside from that the Ubuntu docs suggest that any configuration values specified here will be overruled by whatever is under the clusternodes element.  Don't bother trying to set the bind-address here for the time being.
  • If you specify host names for each cluster node, reverse-resolution will attempt to determine what the bind address should be.  This will cause a bind to the loopback adapter unless you either (a) use IP addresses instead of the node names, or (b) remove the 127.0.1.1 address line from /etc/hosts!!  A symptom of this condition is that you bring both nodes up, and each node thinks it's all alone.
  • The two_node="1" attribute reportedly causes CMAN to ignore a loss of quorum for two-node clusters.
  • For added security, generate a keyfile with corosync-keygen and configure CMAN to pass it to Corosync - make sure to distribute it to all member nodes.
  • Always run ccs_config_validate before trying to launch the cman service.
  • Refer to /usr/share/cluster/cluster.rng for more (extremely detailed) info about cluster.conf

I wanted to put my cluster.conf here, but the XML is raising hell with Blogger.  Anyone who really wants to see it may email me.
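
In lieu of the real thing, here is a bare-bones sketch of the general shape, using the l9/l10 node names that appear later in this post.  This is NOT my actual file - treat it as a starting point, and run ccs_config_validate against whatever you end up with.  The keyfile attribute corresponds to the corosync-keygen bullet above and can be dropped if you skip that step:

<?xml version="1.0"?>
<cluster name="mycluster" config_version="1">
  <cman two_node="1" expected_votes="1" keyfile="/etc/corosync/authkey"/>
  <clusternodes>
    <clusternode name="l9" nodeid="1"/>
    <clusternode name="l10" nodeid="2"/>
  </clusternodes>
  <fencedevices/>
</cluster>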

Corosync

The Corosync config file is ignored when launching via CMAN.  cluster.conf is where those options live now.

 

Configuration, Part 2

By this time, if you have started CMAN and Pacemaker (in that order), both nodes should be visible to one another and should show up in crm_mon.  Make sure there are no monitor failures, as this will likely mean you're missing some packages on the reported node(s).
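
A one-shot status dump is the quickest way to check:

crm_mon -1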

DRBD

I tend to place as much as I can into the /etc/drbd.d/global_common.conf, so as to save a lot of extra typing when creating new resources on my cluster.  This may not be best practice, but it works for me.  For my experimental cluster, I have two nodes: l9 and l10.  Here's a slimmed-down global_common.conf, and a single resource called "share".

/etc/drbd.d/global_common.conf
global {
    usage-count no;
}

common {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }

    startup {
        wfc-timeout 15;
        degr-wfc-timeout 60;
    }

    disk {
        on-io-error detach;
        fencing resource-only;
    }

    net {
        data-integrity-alg sha1;
        cram-hmac-alg sha1;
        # This isn't the secret you're looking for...
        shared-secret "234141231231234551";

        sndbuf-size 0;

        allow-two-primaries;

        ### Configure automatic split-brain recovery.
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    syncer {
        rate 35M;
        use-rle;
        verify-alg sha1;
        csums-alg sha1;
    }
}
 
/etc/drbd.d/share.res
resource share  {
  device             /dev/drbd0;
  meta-disk          internal;

  on l9   {
    address   172.18.1.9:7788;
    disk      /dev/l9/share;
  }

  on l10  {
    address   172.18.1.10:7788;
    disk      /dev/l10/share;
  }
}
 
Those of you with a keen eye will note I've used an LVM volume as my backing storage device for DRBD.  Use whatever works for you.  Now, on both nodes:

drbdadm create-md share
drbdadm up share

And on only one node:
drbdadm -- -o primary share
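
You can watch the initial sync progress from either node while it runs:

watch -n1 cat /proc/drbd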

It's probably best to let the sync finish, but I'm in a rush, so...on both nodes:
drbdadm down share
service drbd stop
update-rc.d drbd disable

The last line is particularly important: DRBD cannot be allowed to crank up on its own - it will be Pacemaker's job to do this for us.  The same goes for O2CB and OCFS2:

update-rc.d o2cb disable
update-rc.d ocfs2 disable

OCFS2 also requires a couple of kernel parameters to be set.  Apply these to /etc/sysctl.conf:

echo "kernel.panic = 30" >> /etc/sysctl.conf
echo "kernel.panic_on_oops = 1" >> /etc/sysctl.conf
sysctl -p

With that done, we can go into crm and start configuring our resources.  What follows will be a sort-of run-of-the-mill configuration for a dual-primary resource.  YMMV.  I have used both single-primary and dual-primary configurations.  Use what suits the need.  Here I have a basic cluster configuration that will enable me to format my OCFS2 target:

node l10 \
        attributes standby="off"
node l9 \
        attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
        params drbd_resource="share" \
        op monitor interval="15s" role="Master" timeout="20s" \
        op monitor interval="20s" role="Slave" timeout="20s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive p_o2cb ocf:pacemaker:o2cb \
        params stack="cman" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100" \
        op monitor interval="10" timeout="20"
ms ms_drbd_share p_drbd_share \
        meta master-max="2" notify="true" interleave="true" clone-max="2"
clone cl_o2cb p_o2cb \
        meta interleave="true" globally-unique="false"
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="cman" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"

Of special note - we must specify the stack="cman" parameter for o2cb to function properly; otherwise you will see startup failures for that resource.  To round out this example, a usable store would help.  After a format...

mkfs.ocfs2 /dev/drbd/by-res/share
mkdir /srv/share

Our mount target will be /srv/share.  Make sure to create this directory on both/all applicable nodes.  The additions to the earlier configuration for the OCFS2 resource are the p_fs_share primitive, its cl_fs_share clone, and the colocation and ordering constraints below:
node l10 \
    attributes standby="off"
node l9 \
    attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
    params drbd_resource="share" \
    op monitor interval="15s" role="Master" timeout="20s" \
    op monitor interval="20s" role="Slave" timeout="20s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s"
primitive p_fs_share ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/share" directory="/srv/share" fstype="ocfs2" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    op monitor interval="20" timeout="40"
primitive p_o2cb ocf:pacemaker:o2cb \
    params stack="cman" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    op monitor interval="10" timeout="20"
ms ms_drbd_share p_drbd_share \
    meta master-max="2" notify="true" interleave="true" clone-max="2"
clone cl_fs_share p_fs_share \
    meta interleave="true" notify="true" globally-unique="false"
clone cl_o2cb p_o2cb \
    meta interleave="true" globally-unique="false"
colocation colo_share inf: cl_fs_share ms_drbd_share:Master cl_o2cb
order o_o2cb inf: cl_o2cb cl_fs_share
order o_share inf: ms_drbd_share:promote cl_fs_share
property $id="cib-bootstrap-options" \
    dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
    cluster-infrastructure="cman" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"

A couple of notes here, as well: not ordering the handling of O2CB correctly could wreak havoc when putting nodes into standby.  In this case I've ordered it with the file system mount, but a different approach may be more appropriate if we had multiple OCFS2 file systems to deal with.  Toying with the ordering of the colocations may also have an effect on things.  Read up on all applicable Pacemaker documentation. 
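
As a sketch of what I mean - if there were a second OCFS2 mount (cl_fs_other here is purely hypothetical), I would probably order each file system clone against cl_o2cb individually instead of chaining everything into one constraint:

order o_o2cb_share inf: cl_o2cb cl_fs_share
order o_o2cb_other inf: cl_o2cb cl_fs_other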

To test my cluster, I put each node in standby and brought it back a few times, then put the whole cluster in standby and rebooted all the nodes (all two of them).  Bringing them all back online should happen without incident.  In my case, I had to make one change:

order o_share inf: ms_drbd_share:promote cl_fs_share:start
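
For reference, the standby dance itself is just a pair of crm commands per node (l9 shown here):

crm node standby l9
crm node online l9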


Finally, the one missing piece to this configuration is proper STONITH devices and primitives.  These are a MUST for OCFS2, even if you're running it across virtual machines.   A single downed node will hang the entire cluster until the downed node is fenced.  Adding fencing is an exercise left to the reader, though I will be sharing my own experiences very soon.