20121124

Broken cman init script?!

Nothing is going right in Ubuntu 12.04 for cluster-aware file systems.

OCFS2 seems borked beyond belief.
(Update 2013-02-25: I believe I have made progress on the OCFS2 front:  http://burning-midnight.blogspot.com/2013/02/quick-notes-on-ocfs2-cman-pacemaker.html)

GFS2 has its own issues, some of which I will detail here.

You CAN get these two file systems running on 12.04.  Whether or not the cluster will remain stable when you have to put a node on standby is another question entirely, and a very good question.  It shouldn't even BE a question, but it is, and the answer is a resounding FUCKING NO!  Well, at least, as far as OCFS2 is concerned.  The problems there lie in who manages the ocfs2_controld daemon.  CMAN ought to do it, but CMAN doesn't want to.  Starting it in Pacemaker causes horrible heartburn when you put a node into standby, and things just all fall apart from there.

I decided to try out GFS2.  After installing all the necessary packages, and manually running bits here and there to see things work, I could not get Pacemaker to mount the GFS2 volume.  The first problem was CLVM: if you want to be able to shut down a node without shooting the fucker, you'll need to make sure the LVM system can deactivate volume groups.  The standard method of vgchange -an MyVG doesn't work for the cluster-aware LVM; it complains loudly about "activation/monitoring=0" being an unacceptable condition for vgchange.  This is detailed in this bug: https://bugs.launchpad.net/ubuntu/+source/lvm2/+bug/833368

The solution suggested there, at least where the OCF script is concerned, works: change the lines that use "vgchange" to include "--monitor y" on the command-line, and it will magically work again.
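
For reference, the change is tiny.  Something along these lines in the heartbeat LVM agent is what it amounts to (illustrative only; the exact vgchange invocations and their options vary with the resource-agents version, so find them in your copy of the script and append --monitor y):
# /usr/lib/ocf/resource.d/heartbeat/LVM  (sketch, not a literal diff)
# before:
#   vgchange -a y $OCF_RESKEY_volgrpname
# after:
#   vgchange --monitor y -a y $OCF_RESKEY_volgrpname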

My cluster starts a DRBD resource, promotes it (dual-primary), then starts up clvmd (ocf:lvm2:clvmd), activates the appropriate LVM volume group (ocf:heartbeat:LVM), and finally mounts the GFS2 file system (ocf:heartbeat:Filesystem).  These are all cloned resources.
primitive p_clvmd ocf:lvm2:clvmd \
    op start interval="0" timeout="100" \
    op stop interval="0" timeout="100" \
    op monitor interval="60" timeout="120"
primitive p_drbd_data ocf:linbit:drbd \
    params drbd_resource="data" \
    op start interval="0" timeout="240" \
    op promote interval="0" timeout="90" \
    op demote interval="0" timeout="90" \
    op notify interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    op monitor interval="15s" role="Master" timeout="20s" \
    op monitor interval="20s" role="Slave" timeout="20s"
primitive p_fs_vm ocf:heartbeat:Filesystem \
    params device="/dev/cdata/vm" directory="/opt/vm" fstype="gfs2"
primitive p_lvm_cdata ocf:heartbeat:LVM \
    params volgrpname="cdata"
ms ms_drbd_data p_drbd_data \
    meta master-max="2" clone-max="2" interleave="true" notify="true"
clone cl_clvmd p_clvmd \
    meta clone-max="2" interleave="true" notify="true" globally-unique="false" target-role="Started"
clone cl_fs_vm p_fs_vm \
    meta clone-max="2" interleave="true" notify="false" globally-unique="false" target-role="Started"
clone cl_lvm_cdata p_lvm_cdata \
    meta clone-max="2" interleave="true" notify="true" globally-unique="false" target-role="Started"
colocation colo_lvm_clvm inf: cl_fs_vm cl_lvm_cdata cl_clvmd ms_drbd_data:Master
order o_lvm inf: ms_drbd_data:promote cl_clvmd:start cl_lvm_cdata:start cl_fs_vm:start
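
If the configuration lives in a file, it can be loaded through the crm shell and then checked with crm_mon (the file name below is just a placeholder):
crm configure load update /root/gfs2-cluster.crm
crm_mon -1
crm_mon should show ms_drbd_data promoted on both nodes, with the clvmd, LVM and Filesystem clones running on top of it.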

The LVM clone is necessary so that you can deactivate the VG before disconnecting DRBD during a standby.  Not achieving this will STONITH the node.  The "--monitor y" change is absolutely necessary, or you won't even bring the VG online.  Starting clvmd inside Pacemaker might not be a necessary thing, but in this instance it seems to work very well.  It's also important to note that most of the init.d scripts related to this conundrum have been disabled: clvmd, drbd, to name two.
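
Disabling them is just an update-rc.d affair, along these lines (note that on Ubuntu the clvmd init script is typically named clvm rather than clvmd):
update-rc.d -f clvm remove
update-rc.d -f drbd remove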

The GFS2 file system will not mount without gfs_controld running.  gfs_controld won't start on a clean Ubuntu Server 12.04 system because it seems the cman init script is fucked up.  Can't understand it, but inside /etc/init.d/cman you'll find a line that reads:
gfs_controld_enabled && cd /etc/init.d && ./gfs2-cluster start
Comment out this line and add this below it:
if [[ gfs_controld_enabled ]]; then
      cd /etc/init.d &&  ./gfs2-cluster start
fi
This will make the cman script actually CALL the gfs2-cluster script and thus start the gfs_controld daemon.  Shutdown seems to work correctly with no additional modifications.  You will find that once all these pieces are in place, GFS2 is viable on Ubuntu 12.04 AND you can bring your cluster up and down without watching your nodes commit creative suicide.
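
A quick sanity check that the edit actually took, on a node where the cluster stack isn't already running:
service cman start
pgrep -l gfs_controld
If pgrep comes back empty, the init script still isn't reaching the gfs2-cluster call.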

I honestly don't know why this is the way it is.  I wouldn't know where to even assign blame.  In the Ubuntu Server 12.04 Cluster Guide (work-in-progress), they suggest this resource:
primitive resGFSD ocf:pacemaker:controld \
        params daemon="gfs_controld" args="" \
        op monitor interval="120s"
This seems rather like a bastardization of what the resource agent is really for, but perhaps it works for them.  I strongly suspect, however, that it suffers from the same issue I ran into with OCFS2: if CMAN isn't the one running the controld, putting a node into standby wreaks havoc on both the node and the cluster.  With OCFS2, the problem was the ocfs2_controld daemon, which CMAN was all too happy to bring offline but would NOT, under any circumstances I could find, start up.

Once ocfs2_controld has been started by Pacemaker, you also can't seem to take it down: the resource fails to stop, and that becomes a disqualifying offense for the node.  The issue seems unrelated to the missing killproc command (which is non-standard across distributions), because even when you fix or fake it, nothing improves: ocfs2_controld continues to run in the background, and cman fails to shut down correctly when you try to bring the node down gracefully.  No ideas yet on how to fix this, but I might try for it next.  I had detailed making a working Ubuntu 12.04 OCFS2 cluster in a previous post...I will be double-checking those steps...
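
For reference, the standby round-trip amounts to nothing more than this in the crm shell (the node name is a placeholder):
crm node standby node1
crm node online node1
With the GFS2 setup above, a node survives the trip; with OCFS2, this is exactly where the stop failures and the stuck ocfs2_controld show up.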

20121105

Useful

IPMI v2.0 - accessing SOL from Linux command line.

http://wiki.nikhef.nl/grid/Serial_Consoles

(use a bit rate that makes sense for you, only set if necessary)
ipmitool -I lanplus -H host.ipmi.nikhef.nl -U root sol set volatile-bit-rate 9.6
ipmitool -I lanplus -H host.ipmi.nikhef.nl -U root sol set non-volatile-bit-rate 9.6
 
ipmitool -I lanplus -H IPMI-BMC-IPADDR -U BMCPRIVUSER sol activate
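
To get back out of an active SOL session, the escape sequence is ~. (same convention as ssh); alternatively, it can be torn down from another shell:
ipmitool -I lanplus -H IPMI-BMC-IPADDR -U BMCPRIVUSER sol deactivate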



20121102

Led Astray

It's frustrating, and it's my own damn fault.

I read in the HP v1910 switch documentation that an 802.3ad bond would utilize all connections for the transmission of data.   Even with static aggregation I thought I'd get something different than what, in fact, I received.  To quote their introduction on the concept of link aggregation:

"Link aggregation delivers the following benefits: * Increases bandwidth beyond the limits of any single link.  In an aggregate link, traffic is distributed across the member ports."
I'll spare you the rest.  It's my own damn fault because I read that little piece of marketing with an assumption: that "traffic" meant TCP packets regardless of their source or destination.  I know better now, and I bow and scrape before the Prophets of Linux Bonding, the deities that espouse Whole Technical Truth.  I am not worthy!

Despite my best efforts, I cannot get more than 1G/sec between two LACP-connected machines.  Running iperf -s on one and iperf -c on the other, the connection saturates as though only a single link were available.  The only real benefit is that traffic to and from different machines gets distributed across the member links.  To those reading this who knew better than I did: I am sorry.  I'm an idiot.  May this blog serve to save others from my fate.
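
For completeness, the test looked something like this (the hostname is a placeholder, and the parallel-stream flags are my own addition); even multiple streams land on a single slave, because the default layer2 transmit hash keys on the MAC pair:
iperf -s                         # on the receiving box
iperf -c peer-host -P 4 -t 30    # on the sending box
It still tops out at the same 1G/sec.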

Static aggregation, as far as my HP switches are concerned, does nothing for mode-0 connections.  I can get a little better throughput, but watching the up-and-down of the flow rates suggests there is much evil happening, and I don't like it.  Plus, I can't really distribute a static aggregation across my switches as far as I know - maybe the HP switch stacking feature would help with this, but I also sense much evil there and don't want to go at it.

The only benefit I can derive from round-robin (mode-0) is by placing each connection into its own VLAN.  That, of course, kills any notion of redundancy and shared connectivity.  First, it's like having multiple switches, except that if a single connection from a single machine goes down, that whole machine loses contact with the other machines across those virtual switches.  So, bollocks to that.

Second, it's damn hard to figure out a good, robust and non-impossible way to configure these VLANs to also communicate with the rest of the world.  I guess that it all boils down to my desire to use the maximum possible throughput to and from any given machine, without having to jump through hoops like creating gateway hosts just to aggregate all these connections into something recognizable by other networking hardware.  I am also not willing to sacrifice ports to the roles of active-passive, even though that would allow me at least one switch or link failure before catastrophic consequences took hold.

It's my own damn fault because I didn't take the time to read the bonding driver kernel documentation that the Good Lords of Kernel Development took the time to write.  I didn't, at least, until last night.  I pored through it, reading the telling tales of switches and support and the best ways to get at certain kinds of redundancy or throughput.

802.3ad obviously doesn't do much for me either.  After reading the docs, I know this.  It does make aggregation on a single switch rather easy, but no more or less easy than mode-6 bonding.  Well, I take that back.  It IS less easy because the switch needs its ports configured.  It also doesn't support my need for multi-switch redundancy, so 802.3ad is out, too.

In short, if you're thinking of bonding two bonds together, don't.  It's just not worth it.  The trouble, the init scripts, the switch configuration will do you no good.  You'll still be stuck with 1 G/sec per machine connection.  Even worse, you might not get your links back quickly enough if someone trips over the power strip running your two highly-available switches.

I considered the VLAN solution, minus its connection to the world, thereby encapsulating my SAN-to-Hypervisor subnet in its own universe of ultra-high-throughput.  3 G/sec seemed a nice thing.  I managed to get close to that throughput; but, sadly, given that single-link failures would be catastrophic, I can't afford to take that risk.  Redundancy is too important.  I will relegate myself to mode-6, as it appears to be the most flexible, the most robust and the most reliable with regard to even link distribution.
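
For the curious, the mode-6 setup itself is trivial; a sketch of what it looks like in /etc/network/interfaces on 12.04 with the ifenslave package installed (interface names and addressing are placeholders):
auto bond0
iface bond0 inet static
    address 192.168.10.11
    netmask 255.255.255.0
    bond-mode balance-alb
    bond-miimon 100
    bond-slaves eth2 eth3
No switch-side configuration is required for balance-alb, which is half its appeal.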

I hope the price of 10GigE drops sooner rather than later...