BURNING MIDNIGHT
m.at.work

20120504

DRBD + Pacemaker

I followed these instructions fairly closely to get the basic DRBD + Pacemaker configuration working. A few notes from the effort:

clone-max must be 2, or else the other node won't come up.
Having the wrong file system on the target drive really hoses things up.
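
A rough sketch of what the first note looks like in crm configure terms, modeled on the DRBD user guide's Pacemaker examples. It assumes a DRBD resource named r0; the names p_drbd_r0 and ms_drbd_r0 are placeholders, not the ones from the cluster above.

```
# Sketch only: "r0", "p_drbd_r0" and "ms_drbd_r0" are hypothetical names.
primitive p_drbd_r0 ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="15s"
ms ms_drbd_r0 p_drbd_r0 \
    meta master-max="1" master-node-max="1" \
         clone-max="2" clone-node-max="1" notify="true"
```

With clone-max="2", Pacemaker runs an instance of the DRBD resource on both nodes, while master-max="1" keeps only one of them promoted to primary.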

During my escapades, I had forgotten that the DRBD device I wanted to use for this example had been formatted as OCFS2. Since that file system requires some special massaging to mount, I specified ext4 in the DRBD resource configuration. Unfortunately, when Pacemaker failed to mount the file system, it got stuck in a state I didn't know how to get out of. The easy solution was to restart Corosync on each node. The first node that restarted came up instantly with the mount (once it was properly configured).

I now feel I have a basic working knowledge of crm's command syntax, and the kinds of resources I am able to configure. I still lack in-depth knowledge about items like ms (master-slave) and its meta fields, and other finer details. I believe they're in the parts of the Pacemaker documentation I have not yet come to, though I have diligently read through a good portion of it before starting this adventure.

I've now reconfigured the resource to work in dual-primary mode. That was easy - just change the master-max to 2, leave clone-max at 2, and remove the other options (don't know if I needed them or not - will find out later). Next, OCFS2 Pacemaker support. First order of business was examining the ocf:pacemaker:controld info. I noticed this line:

It assumes that dlm_controld is in your default PATH.

Habawha?! OK. I go to the prompt and type dlm_controld and find nothing. But Ubuntu is nice enough to point out I should install the cman package if I want this command. So I do so, and allow apt-get to install all the extra packages it believes it needs.

Following the DRBD OCFS2 guide, I notice one change I need - ocf:ocfs2:o2cb is actually ocf:pacemaker:o2cb. I took a gamble and configured the o2cb resource with the parameter stack="cman". Sadly, I endured mucho failure when I committed my changes. None of the new OCFS2 resources seemed to start, and complained loudly about something being "not installed." To this I answered with installing the dlm-pcmk package on both servers, and four of the errors went away (two per machine). I am now left with two monitor errors that still complain that something is "not installed."

Of course, it would have been AWESOME if I had just read further on the Ubuntu wiki page to see the full apt-get line for supporting OCFS2 - one or two more packages later, and that fixed the problem. Still, it was valuable to learn about CMAN, and I may migrate the cluster in that direction since it may help protect against internal split-brain.

I will now reformat the shared data store as OCFS2, modify the Filesystem resource, and prepare for cluster goodness. Tonight or tomorrow I might try to get the iSCSI target working under Pacemaker.

Corosync, Pacemaker, Heartbeat, OCFS2...

Ran into an issue with OCFS2 and Heartbeat: with Heartbeat running, OCFS2 refused to shut down for a reboot of the system. This caused the system to totally hang for 12 hours until I could get to the office and force-shutdown the machines. After a few more reboot tests, I disabled Heartbeat from even starting, and now reboot works again. I've installed Corosync and Pacemaker to become the management systems. I'm sure there's a way to fix Heartbeat, but since there seems to be a community trend toward Pacemaker, we'll go with that. Plus, DRBD documents how to set up OCFS2 resources for HA with Pacemaker.

Installation
apt-get -y install corosync pacemaker build-essential

Don't know why build-essential is needed, but a site referenced it. We'll see. Right now I just want to get this thing running as quickly and as painlessly as possible.

Configuring Corosync
crm won't give any useful cluster status without something - like Corosync - running. So I started with modifying corosync.conf to match most of the settings here. Now I can ask crm for status and it tells me my cluster has zero nodes. Good! That's better than saying it can't connect or tell me anything useful.

Also, to get corosync to start, flip the switch in /etc/default/corosync to allow the init script to run.
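
A minimal sketch of flipping that switch from a script. It assumes the Ubuntu package ships the file with a START=no line (that is how it looked on these machines); the function name enable_corosync_start is made up for this example.

```shell
# Sketch: rewrite the START= line in a corosync defaults file.
# Assumption: the file contains a line of the form START=no.
enable_corosync_start() {
    # $1: path to the defaults file, e.g. /etc/default/corosync
    sed 's/^START=.*/START=yes/' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

Usage would be enable_corosync_start /etc/default/corosync on each node, as root, before starting the corosync service.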

After starting corosync on both nodes, crm status displays two nodes, two votes, no resources configured. Not sure what the "pending" on node 1 is all about yet, but after about 30 seconds it disappeared and now both nodes say they're online. We have GLUE!

Configuring Resources
After reading a good deal of the Pacemaker Explained documentation, I decide to start out by following DRBD's example of configuring OCFS2 with Pacemaker. That ended horribly. OCFS2 kept the servers from wanting to willingly reboot, and it appears that whenever you want Pacemaker to manage something, you basically have to hand over ALL start/stop functionality to it. That makes sense, of course... I just probably skimmed over that part of the documentation. After resetting the cluster configuration to something like a large BLANK, I proceeded to restart from the examples in Clusters From Scratch.

First I configured a virtual IP assignment, which is needed for the iSCSI stuff to work. Ideally, I'd like to load-balance the iSCSI backend between the two machines, by providing two virtual IPs (one for each). If one goes down, the other will assume both IPs and everything should be good. That's for a future project, however, so for right now let's get one virtual IP up and an iSCSI initiator connected. Anyway, the virtual IP works, and it moves from server to server instantly and without fuss whenever one or the other disappears. ZERO loss in ping. Would love to see how quick the response time is for streaming data...
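
For the record, a virtual IP in the style of the Clusters From Scratch examples looks roughly like the crm configure fragment below. The resource name ClusterIP, the address and the netmask are placeholders, not the values used on this cluster.

```
# Sketch only: name, address and netmask are hypothetical.
primitive ClusterIP ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.10" cidr_netmask="24" \
    op monitor interval="30s"
```

The monitor operation is what lets Pacemaker notice the address has gone missing and bring it up on the other node.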

Next up: configuring DRBD in Pacemaker. We'll leave OCFS2 for yet a later time. My goal for later tonight is to get DRBD running correctly under Pacemaker - this means I must disable the drbd init scripts. Note to self: do this.

20120503

Conversion of Flash Boot Media from Solo Device to RAID-1

The target system currently boots off one flash device, which we shall call sdf. sdg is the new device. sdf was configured with a DOS partition table and one 500M partition, beginning at cylinder 2. /dev/sdf1 is the mount source of /boot. The operating system is Ubuntu Server 11.04, with all the latest updates. The hardware is a Dell XPS with an Intel Pentium 4 processor and a relatively ancient BIOS. What follows are the steps for what I did to make it work, and what I observed along the way. YMMV...

1. Create the RAID-1 target

fdisk /dev/sdg

Create a new DOS table and a single partition, starting at cylinder 2 and extending 500M into the device.
Configure the partition as type FD (linux raid member) and set its bootable flag. Not sure what of this is absolutely necessary, so I did it all.

mdadm --create /dev/md1 --force --metadata=1.2 -l 1 -n 1 /dev/sdg1
mdadm -E --brief /dev/sdg1 >> /etc/mdadm/mdadm.conf

--force is required as we're building a 1-device RAID-1, which mdadm finds odd but will do anyway.
We'll try using metadata version 1.2 on here, because I've seen it work on other systems.
The array should have started automatically.
We add the array definition to the config file, so that it will be correctly auto-started AND correctly assigned as md1 on reboot.

dd if=/dev/sdf1 of=/dev/md1 bs=4k

We copy the entire boot file system over, sector for sector, assuming we have enough space on md1. In my case, I had more than enough. Note I'm using a 500M partition for two reasons: (a) I don't feel like waiting 30 minutes for things like copies to finish, and (b) there might be problems with GRUB and the full 8G size, although the other exemplar system had no troubles here. If all goes well, I'll expand the RAID accordingly.

umount /boot

Remove the device that represents /dev/sdf.

partprobe
mount /boot

We will now see that all the boot files are there, and the file system is a perfect mirror, now living on the RAID. The above will work if /etc/fstab lists the /boot source by UUID and not by device. This has become the recommended way of mounting devices anyway.
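
A quick way to check that precondition before pulling the old device. This is a sketch: the function name boot_mounted_by_uuid is made up, and it only looks at the first two fields of each fstab entry.

```shell
# Sketch: succeed only if the /boot entry in an fstab file uses UUID=...
boot_mounted_by_uuid() {
    # $1: fstab path (defaults to /etc/fstab)
    awk '$1 !~ /^#/ && $2 == "/boot" { found = ($1 ~ /^UUID=/) }
         END { exit !found }' "${1:-/etc/fstab}"
}
```

Running boot_mounted_by_uuid || echo "fix /etc/fstab first" before the umount would catch a /boot entry that still points at /dev/sdf1.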

Try a GRUB install:

update-grub
grub-install --allow-floppy /dev/sdg

This first invocation failed for me, with GRUB unable to figure out the file system on /dev/md1. Running grub-probe -v revealed the same results. To fix:

umount /boot
mdadm -S /dev/md1
mdadm --assemble --scan

mdadm should report that our little md1 array was started successfully with one drive.

mount /boot
update-grub
grub-install --allow-floppy /dev/sdg

This time it succeeds. Now we attempt a reboot.
RESULTS: Reboot was mostly successful. Saw a warning about fd0 being unreadable, but GRUB cranked up and booted the OS. One snag: /boot failed to mount, and Ubuntu has the lovely feature of freezing all further booting until the administrator can intervene.

At this point, I am seeing that the operating system is trying to mount /dev/md1 on boot, but for some reason failing and thinking it needs to run fsck on the device. But the device is busy, and so it can't. It then waits in limbo until I hit the S or the I key (to skip mounting or ignore the issue, respectively). To test how a drive-check would affect a sister system, I ran tune2fs -C 30 /dev/md1 on it and then rebooted. This forced a drive check on the next reboot. Happily, or sadly, it did the check without blinking and went right to work bringing the rest of the OS online.

Now, note the two major differences between these two systems. First, the working one uses 0.90 metadata, while the other (malfunctioning) system uses version 1.2 metadata. Second, the malfunctioning system has a strange RAID configuration (a single-device RAID-1). To test a theory, I'm going to toy with trying to build a full-device RAID drive. In the past this has failed for some unknown reason - grub-install wouldn't take. Perhaps now we can explore a bit further.

* * * * * *

I toyed around with doing a whole-device RAID-1 as the boot drive. Not a good idea. GRUB doesn't REALLY understand, and I still have problems with the fsck hosing up and trying to read the individual partitions of the raid device, instead of the raid device itself. I think it may be an even bigger problem because I created a partition table and set the primary partition to be type "Linux". Perhaps fsck is seeing this and deciding to attempt a check even though it really shouldn't? Anyway, all that aside, the rebuild onto the secondary device certainly worked well, and the system DID actually boot once. I fear it's not a stable solution, however, and stability is key to the success of this mission.

As I had copied all the boot files to /boot2 (a temporary on-drive space that was safe), I decided to ditch the single-drive RAID-1 in favor of going all the way in one shot. I again created a new raid, this time using both devices - more specifically, a partition from each device - as I had done above. I made sure the partition was coded as type 0xfd. Once the raid was synced, I created the file system (make sure you assign the correct UUID or things will break!) and copied the boot files over.

I ran dpkg-reconfigure grub-pc a few times. The first few times it just regenerated the grub.cfg file. After running grub-install --allow-floppy /dev/sdf, and again for /dev/sdg, the dpkg-reconfigure grub-pc saw the two drives and re-ran the grub-install on them. I'm not sure if that had any positive effects beyond what I had done above, but the system now boots correctly, loads correctly, and doesn't try to fsck the underlying media!

I'm done.

P.S.

I'm not sure why the steps for the single-device RAID-1 failed. I honestly don't believe the single-device configuration was the fault. I am still at a complete loss to explain why fsck wanted to check the underlying media constantly, unless there is some hidden magick that I was unaware of in the installation scripts. Someday I'll have to play some more to better understand this bit of funk. Right now I have to configure these devices for Heartbeat and make sure I can transparently fail-over an iSCSI connection to a virtual IP. Then it's virtualization GO TIME!

USB Media and the Buggy BIOS

OK, so maybe the BIOS isn't entirely to blame, but in my older machines I'm finding that they treat the flash media a bit oddly.

Something has issues. It may be a convoluted blend of issues. Anyway, nix the GPT partition table if you want any hope of GRUB actually booting. Of course I am referring to whatever version they've stuffed into Ubuntu Server 11.04. My current recipe is as follows:

Create (explicitly) a DOS partition table
Set up a partition starting at cylinder 2
Load up the boot files into that partition (under a sensible file system, like ext2)
grub-install /dev/my-usb-device
Animal sacrifice

The grub-install is a bit dicey. Here's the problem: the older BIOS on a particular machine seems to see the USB device as a floppy drive. grub-install offers the --allow-floppy option, though I'm not sure if this does any good in my situation or not. Regardless, GRUB seems to boot with the right incantation. The array this device services is currently living with one USB boot device, since the RAID-1 configuration was giving it strange heartburn. I still don't know why. Its sister system, which is almost identical in model, has had (as far as I know) no real problems dealing with the RAID-1 file systems.

Of other interest, the partitions are only 500M in size. I got tired of syncing 8G between two USB flash drives. I also noticed that it gets really finicky with regard to the name of the RAID device, if you get that far. On one system the name got mangled by the kernel and/or the md driver. Whatever did the job, it made sure the array remembered its name as md127 instead of md1. After learning my limited way around the grub rescue> prompt, I was able to boot the system regardless of this mishap.

The solution? update-grub to regenerate the grub.cfg file with the correct device name. Other issues that had to be resolved to make this work:

mdadm.conf didn't have my current RAID-1. Remember that each array gets a new UUID, so recreating an array means it's totally different and has to be added to the conf file.
Metadata 0.90 doesn't seem to support the name attribute, and was refusing to assemble the array on that account alone (during testing).
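
A small sketch for catching the first issue before a reboot does: check whether the array's current UUID actually appears in the conf file. The function name array_in_conf is made up, and the path default assumes the Debian/Ubuntu location used earlier in this post.

```shell
# Sketch: succeed if an array UUID is listed in mdadm.conf.
array_in_conf() {
    # $1: array UUID as printed by mdadm (e.g. from mdadm --detail /dev/md1)
    # $2: conf file (defaults to /etc/mdadm/mdadm.conf)
    grep -qiF "UUID=$1" "${2:-/etc/mdadm/mdadm.conf}"
}
```

If the check fails after recreating an array, the old ARRAY line needs replacing with a fresh one, for example from mdadm -E --brief on a member partition.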

I think metadata 1.2 will work - it works for the other two machines. The catch may be the partition table, the grub.cfg and the grub-install --allow-floppy procedure.

My Adaptec 2805 controller came with old and buggy firmware, and lost a drive on my RAID-6 during a cold-boot test. Luckily they had published new firmware, and now the device is up-to-date. Other than that it's been a good card so far...for the 24 hours it's been running. We'll see how it holds up. I only wanted a SATA controller, no RAID, but those seem to be really hard to find. HighPoint seems to offer a solution that may or may not include RAID support. I don't need that support, so I don't want it. But the sites that list the card's features are ambiguous. And it's almost as expensive, if not more so, than the Adaptec! On the flip-side, if it could manage drives without having to taint them with its own adapter meta-funk, that would be rather nice. Automation will flow easier from that spring.

I will perform another cold-boot test tomorrow.

20120501

TCP Offload and Linux, Plus Goodies

Short answer - not happening. Another short answer: unwanted, unneeded.

At least, that's the sentiment from the kernel development team. That works for me - cheaper cards work just as well! I can get about 1.6Gbit/sec during bidirectional bandwidth tests using iperf...probably more if I do a direct-connect (will have to try that one out!!).

In other news - handy piece of knowledge: you can load the Linux bonding driver multiple times to configure multiple bonds with different settings! YAY!

iSCSI HBAs act as iSCSI initiators and manage the connection while presenting a block device to the operating system. Note to self: don't try to use them as ethernet cards. Might as well USE ethernet cards, since many of my systems don't have that much space in them.

I started using micro USB sticks (the really short ones that stick out of the computer by no more than about 1/4 inch) as the boot media for my new servers. Works good, when it works. The trick is getting the install right. Some notes:

Make sure you turn off DOS-compatibility or you'll find that GRUB can't stuff itself into the appropriate region of the device.
When copying an existing boot device to a RAID device, dd the file system over (if possible) and then extend it: dd if=/dev/sda1 of=/dev/md1

The reason is that GRUB gets burned with UUIDs by the Ubuntu install process, and expects to find those later during boot. By creating a new file system on the RAID, you'll get a new UUID or have to hand-copy the old one over.

If turning off DOS-compatibility doesn't work, just push the start of the first partition back a cylinder or two. That should open enough space.
If you have two devices in RAID for /boot, and want to replace them with USB sticks:

Add the two devices, and then fail-rebuild your way on to them by failing one old device at a time. Do NOT fail them both at once!
Make sure to run grub-install /dev/sd_ to install the actual bootloader. Do this for both devices in the RAID.
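
The fail-rebuild dance above, sketched as a script. Everything here is illustrative: the function names and device names are hypothetical, and with DRY_RUN=1 (the default) it only prints the commands rather than running them. It also assumes single-digit partition numbers when deriving the whole-device name for grub-install.

```shell
# Sketch only. DRY_RUN=1 prints commands; DRY_RUN=0 would execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# migrate_boot_raid ARRAY OLD1 OLD2 NEW1 NEW2
migrate_boot_raid() {
    array=$1; old1=$2; old2=$3; new1=$4; new2=$5

    # Add both new partitions; they sit as spares until a member fails.
    run mdadm "$array" --add "$new1" "$new2"

    # Fail ONE old member at a time, and let the rebuild finish before
    # touching the next one. Checking /proc/mdstat by eye is still wise.
    for old in "$old1" "$old2"; do
        run mdadm "$array" --fail "$old"
        run mdadm "$array" --remove "$old"
        run mdadm --wait "$array"
    done

    # Install the bootloader on each new whole device
    # (strips one trailing digit: /dev/sdh1 -> /dev/sdh).
    for part in "$new1" "$new2"; do
        run grub-install "${part%[0-9]}"
    done
}
```

A dry run such as migrate_boot_raid /dev/md1 /dev/sda1 /dev/sdb1 /dev/sdh1 /dev/sdi1 prints the nine commands in order, which makes it easy to eyeball the sequence before doing it for real.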

Why would anyone want to boot off USB devices? Well, it's fast, and /boot rarely gets written to, and I can put my entire system on the RAID with the rest of the storage, meaning it too will be privy to array-wide hot-spares and all the redundancy that comes with RAID-6. Now, that being said, I'm still keeping backups...

20120426

Cluster File Systems, Beginning Trials

My first foray into cluster file systems taught me a great deal, though I still have a great deal to learn. Since coming to understand what OCFS2 can do for me, I have employed it thus (in a sandbox environment):

Create one iSCSI target - this would be our backing store, or "SAN".
Create three different accessing nodes - my iSCSI initiators.
Configure the three initiators into the same OCFS2 cluster.
Use OCFS2 to manage the iSCSI store effectively.
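
For a three-initiator setup like this, the /etc/ocfs2/cluster.conf on each node would look roughly like the fragment below. It is a sketch: the cluster name, node names and addresses are placeholders, and only the first node stanza is shown (the other two follow the same pattern with number = 1 and number = 2). The o2cb tools are picky about layout: stanza headers start at column zero and parameters are indented with a tab.

```
cluster:
	node_count = 3
	name = sandbox

node:
	ip_port = 7777
	ip_address = 192.0.2.11
	number = 0
	name = initiator1
	cluster = sandbox
```

Each node's name here has to match that machine's hostname, and the same file goes on every member of the cluster.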

In short, it worked. All three initiators connected to the iSCSI target simultaneously, and, using OCFS2, were able to read/write the file system, together, in real-time.

After browsing through an Ubuntu document on high-availability iSCSI, I think the road-map will be as follows:

Set up a data-store server (server 1) with DRBD and iSCSI Enterprise Target.
Set up a second data-store server (server 2) to mirror the first server.
Configure High-Availability services.
Configure one or more new iSCSI initiators to use this store and OCFS2.
Test fail-over (during reading and writing).
Test live-migration of some sample VMs.

This raises the question: what are the limits of this cluster? Well, the physical file system limits are the limits of the technology and what I'm able to attach. To that end, I can expand the systems to be quite large, although performance may suffer. Increasing the size of the data-store cluster is an option, and a very viable one even without a technology to bind the cluster members together (in terms of their storage). As far as any member of the OCFS2 cluster is concerned, there can be any number of drive targets on the cluster, which equates (in our case) to any number of iSCSI targets.

Adding more data-store servers means basically adding one or more new iSCSI targets. All the initiators will access all of the targets, and data-migration from one target to another can happen anywhere, at any time. Live-migration will continue to work, and the only thing we really lose out on is increased redundancy - that is to say, the redundancy does not improve, but it also does not necessarily diminish. As the servers are already intended to be highly-available, here I think we've reached the threshold to the point of diminishing returns.

I will hopefully post configuration file samples soon, so that this information may live on.

20120425

OCFS2 - The Short Explanation

What is a Shared Disk Cluster File System? OCFS2. What follows is the translation by example. Some of it also becomes stream of consciousness as I try to work out exactly what I need, what I don't need, and where I should best invest my resources.

The Reason for OCFS2

Suppose you have a SAN, implemented perhaps by iSCSI target. Suppose now that you have multiple machines that want to access this SAN simultaneously. You could create multiple targets for them, or separate LUNs. But if they actually needed a shared file system (for the purpose of, say, VM live migration from one host to another - both hosts must be able to access the VM's hard drive image for that to work), then you need a file system like NFS. But NFS, although good, has some issues.

OCFS2 is a shared-disk-access file system, meaning that it represents a file system that multiple machines can access simultaneously. However, that is the limit of its capabilities. It doesn't provide the storage, it merely shares it. (Don't get me wrong - that's a big deal, in-and-of itself.) So, with our iSCSI target and perhaps 3 or 4 servers connected to it, with only one LUN to Rule Them All, we can use OCFS2 to ensure they don't stomp one another or corrupt the entire file system. In the example of the SAN, the target media and the participating hosts are different machines.

This also works for dual-primary DRBD clusters. In this case you have a file system that is being replicated between two hosts by DRBD, and both hosts have read/write access to it. Typically with DRBD, only one host is the primary, and the other host is a silent, unparticipating standby. Now, DRBD only provides the block device, on which we must put a file system. Using OCFS2 as the file system on our DRBD device allows our two hosts to both be primaries, both read and write, and avoid file system corruption. So, in this case, the target media and the participating hosts are the same machines.

Scalability

Let's talk scalability now. We can scale up the number of participating OCFS2 nodes, meaning we can have lots and lots of hosts all accessing the same file system. Grand. What about the backing storage? Well, since OCFS2 doesn't provide it, OCFS2 doesn't care. That being said, whatever backing storage we use, it must be accessible to all the participating machines. So scaling up our iSCSI target means scaling up the servers, in quality and/or quantity.

By Quality
I'm going to assume for the rest of this article we're talking about a Linux-driven solution. Certainly there are tons of sexy expensive toys you can buy to solve these problems. My problem is I have no sexy expensive toy budget. That being said....

If I have two DRBD-driven storage hosts, I can beef up the (mdadm-governed) RAID stores on them. I can swap out the drives with larger drives and rebuild my way up to larger space. With LVM I can even link multiple RAID arrays together for even larger storage arrays. This would have to be chained as such:

RAID arrays (as PVs) -> VG -> LV -> DRBD -> OCFS2
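
The LVM part of that chain, sketched as a dry-run script. The function names, the volume group and volume names, and the md devices in the usage note are all hypothetical; with DRY_RUN=1 (the default) the commands are printed, not executed. The DRBD and OCFS2 layers are left out because they need their own configuration on top of the resulting logical volume.

```shell
# Sketch only. DRY_RUN=1 prints commands; DRY_RUN=0 would execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_store_stack VG LV ARRAY...
build_store_stack() {
    vg=$1; lv=$2; shift 2

    # Each RAID array becomes an LVM physical volume.
    run pvcreate "$@"
    # One volume group spanning all of the arrays.
    run vgcreate "$vg" "$@"
    # One logical volume using all the space; this is the device that
    # DRBD would then use as its backing disk.
    run lvcreate -l 100%FREE -n "$lv" "$vg"
}
```

For example, build_store_stack vg_store lv_drbd /dev/md2 /dev/md3 prints the three LVM commands; the DRBD resource would then point at /dev/vg_store/lv_drbd, with OCFS2 created on the DRBD device.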

Store Failure Tolerance:

RAID: Single (or dual for level 6) drive failure per array
DRBD: Single host failure

With the right networking equipment, this could be a very fast and reliable configuration, and provide nearly continuous storage up-time. Barring massive power-outages, I would wager that this could serve at least 4-9's.
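
To put a number on "4-9's": the sketch below converts an availability percentage into allowed downtime per year, assuming a 365-day year. The function name is made up for this example.

```shell
# Sketch: minutes of downtime per (365-day) year allowed at a given
# availability percentage.
downtime_minutes_per_year() {
    # $1: availability in percent, e.g. 99.99
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 365 * 24 * 60 }'
}
```

downtime_minutes_per_year 99.99 comes out to 52.56, so four nines leaves a bit under an hour of outage per year, total, across both storage hosts.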

By Quantity
Increasing the number of storage servers could potentially provide additional redundancy, or at least increased performance. The key to redundancy is, of course, redundant copies of the data. The downside to increasing the quantity of servers is, of course, managing to chain all that storage together. The more links we have in the storage management chain, the slower that storage operates. Worse yet, finding a technology that effectively chains together storage is rather difficult, and is not without its risks.

We could, for instance, throw GlusterFS on the stack, since it can tether nodes together in an LVM-fashion and create one unified file system. Is that worth the trouble? Is it worth the risk, considering the state of the technology? That's not to say it's a bad system, but it seems almost as though the sheer cost to increase the size of the cluster does not necessarily justify that much flexibility. And, of course, there are other thoughts that must now go into a separate post.
