20120621

The Good and Bad of OCFS2

It's my own fault, really, for not having yet purchased an ethernet-controlled PDU.  I've been busy and time slips by, and the longer things run without incident the easier it is to forget how fragile it all is.

Whatever is causing the hiccups, it's pretty nasty when it happens.  I now have three hosts in my VM cluster.  I still run my two storage nodes, as a separate cluster.  There are two shared-storage devices, accessed by each VM host node via iSCSI, meant to distribute the load between the two storage nodes.  OCFS2 is the shared-storage file system for this installation.

Long story short, when one node dies, they all die.  15 VMs die with them, all at once.  Again, STONITH would fix this issue.  But what worries me more is the frequency of kernel oopses.  I really can't have my VM hosts going AWOL on me just because they're tired of running load averages into the 60s.  I am beginning to rethink my design.  Here I will discuss a few pros and cons of the two approaches under consideration.

OCFS2 - Pros

  • Easy to share data between systems, or have a unified store that all systems can see.
  • VM images are files in directories, all named appropriately for their target VMs - No confusion, very little chance of human error.
  • Storage node configuration is easy - Set up the store, initialize with OCFS2 (see the sketch just below), and you're done!
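For the record, "initialize with OCFS2" really is a one-liner.  A rough sketch, with a placeholder device, label, and slot count (not my actual values):

# format the shared device with OCFS2; -N is the max number of nodes that
# may mount it at once, -L is just a friendly label
mkfs.ocfs2 -N 8 -L store0 /dev/disk/by-path/<store0-lun>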

OCFS2 - Cons

  • Fencing is so massively important that you might as well not even use clustering without it.  Right now the cluster itself is about as stable as my "big VM host" that has motherboard and/or memory issues and regularly locks up for no apparent reason.
  • You have to configure the kernel to reboot on a panic, and to panic on an oops, per the OCFS2 1.6 documentation (see the sysctl sketch after this list).  I'm not really uncomfortable with that, but again the prevalence of these system failures leaves me wondering about the stability of everything.  I cannot necessarily pin it on OCFS2 without some better logging, or at least some hammering while watching the system monitor closely.
  • One of my systems refuses to reboot on a panic, even though it says it's going to.  Don't have any idea what that's about.
  • The DLM is not terrible, but sometimes I wonder how great it is in terms of performance.  I may be misusing OCFS2.  Of course, I have only one uplink per storage node to the lone gigabit switch in the setup, and the ethernet adapters are of the onboard variety.  Did I mention I need to purchase some badass PCI-e ethernet cards??
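About that kernel configuration: the sysctl sketch mentioned above.  These are just the stock kernel knobs the OCFS2 documentation points at; the 30-second delay is my own arbitrary pick:

# panic whenever an oops occurs, and reboot 30 seconds after a panic
cat >> /etc/sysctl.conf <<'EOF'
kernel.panic_on_oops = 1
kernel.panic = 30
EOF
sysctl -p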

The alternative to OCFS2, when you want to talk about virtualization, is of course straight-up iSCSI.  Libvirt actually has support for this, though I'm not certain how well it works or how robust it is to failures.  However, from what I've read and seen, I'd be very willing to give it a shot.

LIBVIRT iSCSI Storage Pool - Pros

  • STONITH is "less necessary" (even though it is STILL necessary) for the nodes in question, because they no longer have to worry so much about corrupting entire file systems.  They would only be at risk for corrupting a limited number of virtual machines...although, given the right circumstances I bet we could corrupt them all.
  • Single node failures do not disrupt the DLM, because there is no DLM.
  • iSCSI connections are on a per-machine basis, though it would be interesting to see how well this scales out.
  • No shared storage means that the storage nodes themselves can use more traditional or possibly more robust file systems, like ext4 or jfs.

LIBVIRT iSCSI Storage Pools - Cons

  • Storage configuration for new and existing virtuals will require an iSCSI LUN for each one.  To keep them segregated, we could also introduce an iSCSI Target for each one, but that would become a cluster-management nightmare on the storage nodes.  It's already bad enough to think about pumping out new LUNs for the damn things.
  • Since LUNs would be the thing to use, there is greater risk of human error when configuring a new virtual machine (think: Did I start the installer on the right LUN?  Hmmmm....)
  • Changing to this won't necessarily solve the problems with the ethernet bottleneck.  In fact, it could very well exacerbate them.
  • There is no longer a "shared storage" between machines.  No longer a place to store all data and easily migrate it from machine to machine.  At present I keep all VM configuration on the shared storage and update the hosts every so often.  This would become significantly less pleasant without shared storage.
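For reference, defining one of these pools with virsh looks roughly like the following.  The portal address and IQN are placeholders, and I have not battle-tested any of this yet:

# define, start, and autostart an iSCSI-backed storage pool (values are placeholders)
virsh pool-define-as store0 iscsi \
  --source-host 192.168.1.50 \
  --source-dev iqn.2012-06.local.storage:store0 \
  --target /dev/disk/by-path
virsh pool-start store0
virsh pool-autostart store0
virsh vol-list store0   # each LUN on the target shows up as a volume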

It would probably be in my best interest to simply keep the current configuration until I can get my STONITH devices and really see how well the system stays online.  It would also behoove me to configure the VM cluster to monitor and protect the virtuals themselves.  I tested this with one VM, but haven't done a lot to toy with all the features and functions.
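For the curious, the usual way to have Pacemaker babysit a guest is the VirtualDomain resource agent.  A sketch - the name, XML path, and timeouts here are illustrative, not my exact setup:

# let Pacemaker start, stop, and monitor a libvirt guest
crm configure primitive vm_test ocf:heartbeat:VirtualDomain \
  params config="/opt/store0/xml/vm_test.xml" hypervisor="qemu:///system" \
  op start timeout="120s" op stop timeout="120s" \
  op monitor interval="30s" timeout="30s"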

So much to do, so little time.

20120620

Cluster Building - Ubuntu 11.10


Some quick notes on setting up a new VM cluster host on Ubuntu Server 11.10.  Assuming the basic setup, some packages need to be installed:

apt-get install \
  ifenslave bridge-utils openais ocfs2-tools ocfs2-tools-pacemaker pacemaker \
  corosync resource-agents open-iscsi drbd-utils dlm-pcmk ethtool ntp libvirt-bin kvm

If you don't use DRBD on these machines, omit it from all machines.  If you install it on one, you'd probably better install it on all of them.

Copy the /etc/corosync/corosync.conf and /etc/corosync/authkey to the new machine.

Configure /etc/network/interfaces with bonding and bridging.  My configuration:

# The loopback network interface
auto lo
iface lo inet loopback


# The primary network interface
auto eth0 eth1 br0 br1 bond0


iface eth0 inet manual
  bond-master bond0


iface eth1 inet manual
  bond-master bond0


iface bond0 inet manual
  bond-miimon 100
  bond-slaves none
  bond-mode   6


# I can't seem to get br1 to accept the other bridge-* options!! :(
iface br1 inet manual
  pre-up brctl addbr br1
  post-up brctl stp br1 on


iface br0 inet static
  bridge-ports bond0
  address 192.168.1.10
  netmask 255.255.255.0
  gateway 192.168.1.1
  bridge-stp on
  bridge-fd 0
  bridge-maxwait 0
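Once the interfaces come up, a quick sanity check doesn't hurt:

cat /proc/net/bonding/bond0   # both slaves present, MII status up, mode balance-alb
brctl show                    # br0 should list bond0 as a port; br1 exists but has no ports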

Disable the necessary rc.d resources:

update-rc.d corosync disable
update-rc.d o2cb disable
update-rc.d ocfs2 disable
update-rc.d drbd disable


Make sure corosync will start:
sed -i 's/START=no/START=yes/' /etc/default/corosync

Create the necessary mount-points for our shared storage:
mkdir -p /opt/{store0,store1}
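For context, the resources that eventually use these mount-points have roughly the shape sketched below.  This is from memory, the device path is a placeholder, and store1 is omitted, so treat it as an outline rather than a working config:

# DLM + o2cb once per node, with the OCFS2 mount layered on top
crm configure primitive p_dlm ocf:pacemaker:controld op monitor interval="60s"
crm configure primitive p_o2cb ocf:pacemaker:o2cb op monitor interval="60s"
crm configure group g_locking p_dlm p_o2cb
crm configure clone cl_locking g_locking meta interleave="true"
crm configure primitive p_store0 ocf:heartbeat:Filesystem \
  params device="/dev/disk/by-path/<store0-lun>" directory="/opt/store0" fstype="ocfs2" \
  op monitor interval="20s"
crm configure clone cl_store0 p_store0 meta interleave="true"
crm configure order o_store0 inf: cl_locking cl_store0
crm configure colocation c_store0 inf: cl_store0 cl_locking
# ...and the same Filesystem/clone/constraints again for store1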

You should now be ready to reboot and then join the cluster.  I have the following two files on all machines, under the security of root:
go.sh
#!/bin/bash
# start corosync first, give it a second to settle, then start pacemaker
service corosync start
sleep 1
service pacemaker start


force_fastreboot.sh
#!/bin/bash
# enable the magic SysRq key, then trigger an immediate reboot
# (no sync, no unmount - this is the sledgehammer)
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger

Also, the open-iscsi stuff has some evil on reboots.  If there is stuff in the /etc/iscsi/nodes folder, it screws up Pacemaker's attempt to connect.  For lack of a better solution, I have this script called from rc.local:

clean_iscsi.sh
#!/bin/bash
# forget all remembered nodes/targets so Pacemaker starts from a clean slate
rm -rf /etc/iscsi/{nodes,send_targets}/*

Sometimes I need to resize my iSCSI LUNs.  Doing so means rescanning on all affected machines.  This script is called via the crontab entry:

0,15,30,45 * * * * /root/rescan-iscsi.sh

rescan-iscsi.sh
#!/bin/bash
# rescan all logged-in iSCSI sessions so resized LUNs get picked up
iscsiadm -m node -R > /dev/null
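To confirm a node actually noticed the new size after a rescan, something like this does the trick (device path is a placeholder):

blockdev --getsize64 /dev/disk/by-path/<store0-lun>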

20120617

Be Careful With mdadm.conf

I ran into a silly problem with booting my RAID-6.  It seems mdadm inside initrd would not assemble my array on reboot!  Every time the system rebooted, I was dumped after a few seconds into an initrd rescue prompt, and left to my own /dev/*.  Here, again and again, I could manually assemble the array, but that was getting annoying and was certainly something I couldn't do remotely.

During one boot, I tried an mdadm -v --assemble --scan --run, and discovered it was scanning sda, sdb, sdc, etc, but no partitions.

I looked on Google for "initrd not assembling array" and "grub2 not assembling array," to no avail.  I tried updating the initrd with the provided script (update-initramfs).  Finally I happened upon someone mentioning the mdadm.conf file inside initrd.

I peeked in /etc/mdadm, and sure enough, there was a copy of my modified mdadm.conf.  It contained the change I had made: to scan only the five drives of my RAID-6, since the other drives I had in the system at the time were giving me hassles.  Unfortunately, the statement I concocted specified no partitions; my RAID was entirely on partitions.  Entirely my fault. :'-(

The good news is that, once found, it's an easy fix.  Get the system to boot again, reconfigure mdadm.conf, and then perform an update-initramfs.  The system now boots perfectly!
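For illustration, the kind of DEVICE line that bites you, versus one that doesn't (drive letters here are placeholders, not my real ones):

# BAD: whole disks only - partition-based members never get scanned
#DEVICE /dev/sd[abcde]
# BETTER: just scan everything the kernel knows about
DEVICE partitions

Follow it with an update-initramfs -u so the corrected copy actually lands inside the initrd.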

20120616

Resurrection of the RAID

Have you had a day like this?

Hmm... My RAID was working the last time I rebooted.  Why can I not assemble it now?  Only two devices available?  Nonsense...there should be five.  What?  What's this?!  The others are SPARES?!!  No they're not...they're part of the RAID!  Awwww F*** F*** F*** F*** F*** F***....


Thank the heavens for these critical links, which I repeat here for posterity:
http://maillists.uci.edu/mailman/public/uci-linux/2007-December/002225.html
http://www.storageforum.net/forum/archive/index.php/t-5890.html (saves more lives!)
http://neil.brown.name/blog/20120615073245 !!! important read !!!

As you can probably surmise, one of my arrays decided to get a little fancy on me this week.  It all started when I was trying to reallocate some drives to better utilize their storage.  I had moved all my data onto a makeshift array composed of three internal SATA and two USB-connected hard drives.  I know, a glutton for punishment.  The irony of that is the USB drives were the only two with accurate metadata, as will be discussed below.

I really don't know what happened, but somewhere between moving ALL my data onto that makeshift array and booting into my new array, the metadata of the three internal SATA drives got borked....badly.  I discovered this while attempting to load up my makeshift array for data migration.  Upon examining the superblocks of all the drives (mdadm -E), I discovered that the three internals received a metadata change that turned them into nameless spares.  The two USB drives escaped destruction, so thankfully I had some info handy to verify how the array needed to be rebuilt.

After grieving a while, and pondering a while longer, I did a search for something along the lines of "resurrect mdadm raid" and eventually came up with the first two links.  The third was an attempt to find the author of mdadm, just to see if he had anything interesting on his blog.  Turns out he did!  Now, I honestly don't know if I was bit by the bug he mentions in that post, but whatever happened did its job quite well.

So, I started a Google Doc called "raid hell" and started recording important details, like which drives and partitions were in use by that array, what info I was able to glean from mdadm -E, and eventually I felt almost brave enough to try a reassembly.  But before I got the balls for that, I imaged the three drives with dd, piped it through gzip, and dumped them onto the new array that thankfully had just enough space to hold them.  Now, if something bad happened, I had at least one life in reserve.
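The imaging itself was nothing fancy - roughly this, once per suspect drive (device and destination are placeholders):

# raw image of the whole drive, compressed on the fly onto the new array
dd if=/dev/sdX bs=1M | gzip -c > /mnt/newarray/sdX.img.gz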

The next step was now to attempt an array recreation with the --assume-clean option: this option would eliminate the usual sync-up that takes place on array creation, thereby preventing destructive writing from taking place while experimentation was happening.

Oh, did I mention it was a RAID-5?

Of course, with most of the devices suffering amnesia, the challenge now was to figure out what order the drives were originally in.  It would have been as easy as letter-order, maybe, if I had not grown the array two separate times during the course of the previous run of data transfers.  So, to make life a little easier, I wrote a shell-script.  Device names like /dev/sdf1 and /dev/sdi2 were long and hard and ugly, so I replaced them with variables F, G, I, J, and K (the letters of the drives for my array).  It went something like this:

F="/dev/sdf1"
G="/dev/sdg2"
I="/dev/sdi2"
J="/dev/sdj1"
K="/dev/sdk1"
# stop any previous attempt, then re-create with the candidate drive order
mdadm -S /dev/md4
mdadm --create /dev/md4 --assume-clean -l 5 -n 5 --metadata=1.2 $F $G $I $J $K
# eyeball the first 256 bytes for the LVM header bytes I'd seen before
dd if=/dev/md4 bs=256 count=1 | xxd
pvscan

With this, I could easily copy/paste and comment out mdadm --create lines that had the incorrect drive order, and keep track of what permutations I had attempted.  Now the array was originally part of LVM, so I was looking for some LVM header funk with the dd command.  I only knew it was there because I had been running xxd on the array previously, and seen it go whizzing by.  pvscan would be my second litmus test, and should the right first drive appear, the physical volume and subsequently the volume group and logical volumes would all become recognized.  In the end, it took only five runs to get K as the first drive.  I now had four drives left to reorder.

To accomplish this, dd and pvscan would no longer help - they did their work on the first drive only.  Since I had two easily-accessible LVs on that array - root and home - it would be easy to run fsck -fn to determine if the file system was actually readable.  This assumed, of course, that all the data on the array was in good shape.  I honestly had no reason to believe otherwise, and Neil Brown's post gave me a great deal of hope in that regard.  Basically, if only the metadata was getting changed, the rest of the array should be A-OK.  I knew it was not being written to at the time of the last good shutdown...because the array was basically a transfer target and not even supposed to be operating live systems.
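The litmus test itself, with made-up VG/LV names:

# -f forces the check, -n answers "no" to everything, so nothing gets written
fsck -fn /dev/mapper/vg0-root
fsck -fn /dev/mapper/vg0-home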

It took another half-dozen or so attempts before I managed to get the latter four drives in the correct order.  Finally, fsck returned no errors and a very clean pair of file systems.  Of course, this was done with -n, so nothing was actually written to the array.  I kicked off a RAID-check with an

echo check > /sys/block/md4/md/sync_action

and then the power went out.  After rebooting, and reassembling the RAID (this time without having to recreate it, since the metadata was now correct on all the drives), I re-ran the check.  It completed about three and a half hours later.  Nothing was reported amiss, so finally it was time: I performed some final fsck's without the -n to ensure everything was ultra-clean, and started mounting file systems.
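For anyone playing along, this is where the check's verdict shows up (standard md sysfs, nothing exotic):

cat /proc/mdstat                     # progress while the check runs
cat /sys/block/md4/md/mismatch_cnt   # 0 at the end means the parity all agreed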

SUCCESS!!!

Since root and home both weigh in around 30-60G each, it was easy to believe they touched every device in the array.  If something else had been out of order, I should have seen it (let's hope I'm right!!).  Now, with the volumes unmounted, I am migrating all the data off the makeshift array...after all, it has a habit of not actually assembling on boot, probably because of the two USB devices.

It Lives Again.