20120507

Cluster Building - Round 2

Rebuilding the Cluster - a Learning Experience

I realized that perhaps a better way to gain experience with cluster building is to do it in my sandbox of virtual machines.  These machines are, for the most part, completely isolated from the greater network; they're on their own subnet, share a restricted virtual switch, and must communicate through a virtual firewall.  I've already used this sandbox to test DRBD+OCFS2+iSCSI.  Now I will use it again to gain experience with the cluster - the bonus here being that if a machine locks up, I can reboot it remotely!

I will record the steps for a fresh, clean install of Ubuntu Server 11.04, running the 'server' kernel (in preference to the 'virtual' kernel), for Pacemaker, Corosync, OCFS2, DRBD, and iSCSI.  As I have mentioned in a previous post, this Ubuntu wiki page has been instrumental in this process.

The following sequence of commands is run as root. I prefer to sudo su - my way into that shell when doing large batches of root-work.  Execute however you please.  As for the validity or accuracy of this document, no warranties are expressed or implied.  In other words, use at your own risk, and YMMV.  If you can dig it, have fun!

I will be executing these steps on l2.sandbox and l3.sandbox.  l4.sandbox will act as a test client.

COROSYNC and PACEMAKER

On l2 and l3...


We'll use PCMK for now, but bear in mind that OCFS2 has CMAN packages available as well.
Both
apt-get install drbd8-utils iscsitarget ocfs2-tools pacemaker corosync libdlm3 openais dlm-pcmk ocfs2-tools-pacemaker

I have already disabled a slew of services - the cluster manager will be responsible for starting these, so they must not come up on their own at boot:
Both

update-rc.d o2cb disable
update-rc.d ocfs2 disable
update-rc.d open-iscsi disable
update-rc.d drbd disable
update-rc.d iscsitarget disable

Now a recommended reboot.
Both
reboot

Sadly, it was at this point that the VMs seemed to go into hyper-death.  I really don't know why that happens, but it happens on occasion.  I had to hard-kill the processes and restart the two machines.


I generated the authkey for corosync on l2 and copied it to l3.  (corosync-keygen pulls its randomness from /dev/random, so on an idle VM it can take a little while to gather enough entropy.)
l2
corosync-keygen
scp /etc/corosync/authkey l3:/etc/corosync/

Also, switch on corosync, or the init script won't touch it.
Both
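# /etc/default/corosync ships with START=no; this flips it to yes so the init script will actually start the daemon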
sed -i 's/=no/=yes/' /etc/default/corosync



Next was to modify corosync.conf to meet my network's requirements: I pointed the totem interface at my sandbox subnet and turned on secauth.
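For reference, the totem section ended up looking something like this (a sketch only - the 10.10.10.0 network is a stand-in for my sandbox subnet, and the multicast settings are just the stock Debian/Ubuntu template values; set bindnetaddr to your own subnet's network address):

totem {
        version: 2
        secauth: on
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.10.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

Then it was time to crank it up!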

/etc/init.d/corosync start && crm_mon


It took about 30-45 seconds before the monitor registered the two machines.  Both came up as online.

DRBD

Both
Now it's time to configure a redundant resource with DRBD.  There are several ways to achieve data redundancy; I just tend to like DRBD.  It certainly isn't the end-all-be-all, but it does the job and is pretty efficient.  I like the block-device approach as well.  For the sake of brevity, I will leave out the full configuration files, but highlight the important parts.  Following the DRBD documentation, we'll set up a device for our data, and another for the iSCSI configuration files - both of which are configured for dual-primary mode.

Another side note:  I tend to use LVM to manage the backing store on my hard drives or RAID storage.  This gives me enormous flexibility without costing too much in throughput.  To be honest, more is lost to the software RAID (if you're crazy like me and do RAID-6) than to LVM, so in my opinion it really doesn't hurt to keep your options open.  The whole device chain would look like this:

     Drives -> RAID (mdadm) -> LVM -> DRBD -> OCFS2

You can toss encryption in there also, but be prepared for an additional penalty. 
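Since I'm leaving the full configs out, here is roughly what the resource file for the data device (ds00) looked like - a sketch only: the /dev/drbd0 minor, the 10.10.10.x addresses, and port 7788 are stand-ins, and the disk paths assume the LVM volumes created below.  The iscsi-cfg resource looks the same, just with its own minor and port.

resource ds00 {
        net {
                # leave this commented out until the initial sync has finished
                # allow-two-primaries;
        }
        on l2.sandbox {
                device    /dev/drbd0;
                disk      /dev/ds/ds00;
                address   10.10.10.2:7788;
                meta-disk internal;
        }
        on l3.sandbox {
                device    /dev/drbd0;
                disk      /dev/ds/ds00;
                address   10.10.10.3:7788;
                meta-disk internal;
        }
}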

My DRBD resources are called ds00 and iscsi-cfg, as are the LVM logical volumes behind them.  Here's what I used for LVM (my VG is called ds):
Both
lvcreate -L+100M -n iscsi-cfg ds
lvcreate -l+100%FREE -n ds00 ds
drbdadm create-md ds00
drbdadm create-md iscsi-cfg

A word of caution - the docs recommend, with good reason, that you keep "allow-two-primaries" turned off until the resources are configured and up-to-date.  Once you start them, DRBD may freak out a little about their inconsistent initial state.
Both
# if you disabled drbd, then
/etc/init.d/drbd start
# otherwise
drbdadm up ds00
drbdadm up iscsi-cfg

Now declare one node the sync source by forcing it primary - this overwrites the peer's data and makes things consistent:
Either
drbdadm -- -o primary ds00
drbdadm -- -o primary iscsi-cfg

Waiting for the sync to complete is not necessary, but I wouldn't recommend rebooting until it's done.  Things might get hairy.  You can watch the sync progress via cat /proc/drbd.
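Once both resources report UpToDate/UpToDate, dual-primary can be switched on: uncomment (or add) allow-two-primaries in each resource's net section, then have DRBD re-read its config on both nodes.  Something like:
Both
drbdadm adjust ds00
drbdadm adjust iscsi-cfg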

Cluster Configuration

Now we must do the cluster configuration.  The important bits are to disable STONITH and to tell Pacemaker to ignore the loss of quorum (in a two-node cluster, losing one node always means losing quorum).  I added these lines to the property statement in the configuration:
stonith-enabled="false" \
no-quorum-policy="ignore"
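
For what it's worth, the same two properties can also be set straight from the shell instead of editing the property block by hand; this should be equivalent:
Either
crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore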

Then came a slew of cluster configuration:
Either, via crm
node l2.sandbox
node l3.sandbox
primitive p_dlm ocf:pacemaker:controld \
        op monitor interval="120s" \
        op start interval="0" timeout="90s" \
        op stop interval="0" timeout="100s"
primitive p_drbd_ds00 ocf:linbit:drbd \
        params drbd_resource="ds00" \
        operations $id="drbd-operations" \
        op monitor interval="20" role="Master" timeout="20" \
        op monitor interval="30" role="Slave" timeout="20" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="240"
primitive p_drbd_iscsi-cfg ocf:linbit:drbd \
        params drbd_resource="iscsi-cfg" \
        op monitor interval="20" role="Master" timeout="20" \
        op monitor interval="30" role="Slave" timeout="20" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="240"
primitive p_o2cb ocf:pacemaker:o2cb \
        op monitor interval="120s" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100"
ms ms_drbd_ds00 p_drbd_ds00 \
        meta resource-stickiness="100" notify="true" master-max="2" interleave="true"
ms ms_drbd_iscsi-cfg p_drbd_iscsi-cfg \
        meta resource-stickiness="100" notify="true" master-max="2" interleave="true"
clone cl_dlm p_dlm \
        meta globally-unique="false" interleave="true"
clone cl_o2cb p_o2cb \
        meta globally-unique="false" interleave="true"
colocation colo_dlm-drbd 0: cl_dlm ( ms_drbd_ds00:Master ms_drbd_iscsi-cfg:Master )
colocation colo_o2cb-dlm inf: cl_o2cb cl_dlm
order o_dlm-o2cb 0: cl_dlm cl_o2cb
order o_drbd-dlm 0: ( ms_drbd_ds00:promote ms_drbd_iscsi-cfg:promote ) cl_dlm
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"

This configuration was built in a shadow CIB.  Once it was committed, it seemed to execute fine.  I tested putting one node into standby, and then back online, and everything went away and came back up as expected.
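The shadow workflow, roughly (the shadow name "active-conf" is just a stand-in; pick whatever you like):

Either
crm
        cib new active-conf      # work against a shadow copy, not the live CIB
        configure edit           # enter the configuration above
        cib commit active-conf   # push the shadow to the live cluster
        quit

(The standby test is just crm node standby l3.sandbox followed by crm node online l3.sandbox.)

With the resources now fully available, I did the OCFS2 file system creation: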

Either
mkfs.ocfs2 -N 32 /dev/drbd/by-res/ds00
mkfs.ocfs2 -N 32 /dev/drbd/by-res/iscsi-cfg
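
(The -N option bakes in the number of node slots; 32 is overkill for two nodes, but it leaves room to add nodes later without having to grow the slot count with tunefs.ocfs2.)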

Unfortunately, I hit this error:
iscsi-cfg is configured to use cluster stack "o2cb", but "pcmk" is currently running
After verifying that I had indeed disabled all the OCFS2 start scripts (via update-rc.d -f o2cb remove) and hitting the error again, I forced (-F) my way past it - and then realized it was trying to tell me that the device already belonged to another cluster (left over, presumably, from my earlier experiments).  The unfortunate choice of names had utterly confused me.  Had I taken the trouble to zero out the whole device with dd, that would not have happened.  Let this be a lesson to you.
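For the record, the clean way out would have been to wipe the device before formatting it, so mkfs.ocfs2 never sees the stale metadata in the first place.  On the small iscsi-cfg volume, zeroing the whole thing through the DRBD device is quick (and, obviously, destroys anything on it):
Either
dd if=/dev/zero of=/dev/drbd/by-res/iscsi-cfg bs=1M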

Remember how I told you to let the DRBD resources fully sync before rebooting?  Well, I didn't, and got a nasty split-brain while trying to resolve the above mkfs issue (due to a desperate reboot of one of the machines).  It was even on the iscsi-cfg device that I hadn't yet tried to initialize with a file system!  For future reference (DRBD 8.3 and below):

On the node you want to kill the data of:
drbdadm secondary resource
drbdadm -- --discard-my-data connect resource
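
And on the surviving node, if it has dropped to StandAlone, reconnect it:
drbdadm connect resource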

With that all done, we can configure the file system access.  I'm rather curious to see whether or not accessing the device in a "mixed-mode" fashion will yield great evil (that is, with the DRBD-local machines accessing the DRBD device directly, and the remote machines going through iSCSI).  Logically, it shouldn't... that's why we're using OCFS2, after all.

Of course, after attempting to configure the file systems, all hell broke loose and suddenly l3 became completely unable to initialize its o2cb resource!  Reboots did not cure the problem.  I even removed the new file system resources, to no avail.  Ultimately I came across a Gossamer Threads post that suggested (loosely) stopping both nodes and investigating the OCFS2 file systems with debugfs.ocfs2, using the 'stats' command to get information.  I discovered from this that the iscsi-cfg device was totally hosed.  After nuking it with dd and rebooting, things came back up normally.
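For reference, debugfs.ocfs2 can run a single command non-interactively, so with the file system resources stopped the check is just:
Either
debugfs.ocfs2 -R "stats" /dev/drbd/by-res/ds00
debugfs.ocfs2 -R "stats" /dev/drbd/by-res/iscsi-cfg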

Taking things slower now, I first created the file system resources, committed them, and watched them appear on the nodes (one per node, as per Pacemaker's known tendencies).  Migrating them worked, so now both nodes and the file systems seem happy.

Working carefully through the cloning, colocation, and ordering directives, I now have a set of operations that work.  The final modifications:
Either, via crm configure edit
primitive p_fs_ds00 ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/ds00" directory="/opt/data/ds00" fstype="ocfs2" \
        op monitor interval="120s" timeout="60s" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"
primitive p_fs_iscsi-cfg ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/iscsi-cfg" directory="/opt/data/iscsi-cfg" fstype="ocfs2" \
        op monitor interval="120s" timeout="60s" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"
clone cl_fs_ds00 p_fs_ds00 \
        meta interleave="true" ordered="true"
clone cl_fs_iscsi-cfg p_fs_iscsi-cfg \
        meta interleave="true" ordered="true"
colocation colo_fs_ds00 inf: cl_fs_ds00 cl_o2cb
colocation colo_fs_iscsi-cfg inf: cl_fs_iscsi-cfg cl_o2cb
order o_fs_ds00 0: cl_o2cb cl_fs_ds00
order o_fs_iscsi-cfg 0: cl_o2cb cl_fs_iscsi-cfg
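
After committing these, a quick sanity check doesn't hurt:
Either
crm_verify -L -V    # check the live CIB for configuration errors
crm_mon -1          # one-shot snapshot of where everything landed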

One thing I am learning about Pacemaker is that you had better not try to outsmart it, even if it's to make your life easier.  Modifying existing colocation or ordering statements will bring the Uruk-hai to your doorstep!  Pacemaker appears quite smart enough to put things in their proper place once it sifts through the basic directives.  That may or may not explain the iscsi-cfg file system corruption (which could have, in fact, been due to bad mkfs.ocfs2 options on the little 100M volume).  Anyway, it was beautiful to watch it all come together.  At last.

For Next Time

The Clusterlabs documentation suggests that CMAN is a good and necessary thing for Active/Active clusters.  I plan on going Active/Active/Active/Active... so perhaps they're right.  I need to investigate how easy it is to add nodes into a CMAN-managed cluster while it's live.

STONITH is required.  I understand and appreciate that now.  I must find out how to make STONITH work for my workstation-class servers.

Whether to manage the cluster with one management framework, or with several, is a big question. I have two hosts (at least) that will offer shared storage.  At least one of those servers may also provide virtualization.  A number of other hosts will use the shared storage and do virtualization exclusively.  Sharing one cluster means control can happen from anywhere, management is "easier", and we gain some redundancy if our storage hosts double as VM hosts (even if we ensure this is NOT their primary responsibility).  On the other hand, splitting the VM cluster management off into a separate ring may be beneficial from a configuration-management perspective.  That does appear, however, to be the only tangible benefit I can think of so far.

The Ubuntu How-to (well, test-case) document I've been following goes on to talk about load-balancing.  I'd like to try implementing that for the two iSCSI target hosts.  Anything we can do to keep the adapters from getting completely overloaded would be A Good Thing (tm).

