
20210623

My Convoluted Coffee Making Process...

...Because Java is Good

This is probably not the most elaborate way to make coffee, and I won't call it the best.  After all, the former assertion requires robust knowledge of coffee-making around the world, and the latter is a matter of taste (no pun intended).  However, this is what we now usually do every morning, and I'm recording (and sharing) it here for posterity - and just in case one day I lose my mind.

Beginning with the French Press

My French press holds about 750 grams of very hot water comfortably, enough to make two decent cups of coffee.  I use the French press' beaker to measure out water into my kettle, putting around 2.25 (two and a quarter) beakers-full of water in.  I then heat it up to 195 degrees F.  Use an instant-read thermometer, preferably a digital one, stuck into the spout to monitor the temp.

The Beans, The Grind

I measure around 38 grams of beans to my intended 750 grams of water.  Measuring the beans before grinding works fine.  This is about a 20:1 ratio of water to coffee.  The smaller your ratio, the more coffee you are adding to your water.  A 15:1 ratio means you'll add 50 grams of coffee to the 750 grams of water.  You'll have to work out your favorite, or whatever gets you the most caffeine before your eyes start twitching.

Now, I realize most people use liquid measures for measuring water, but I do most of my measuring by weight, and it gives excruciatingly predictable and repeatable results.

We have an aging burr grinder.  But it works - for the moment.  I do a coarse grind on the beans, not the absolute coarsest setting, but just a little finer than that.  We've also used store-bought pre-ground coffee (coffee-maker grind), but it tends to be harder to press and might be a little more bitter.

Grind while the water is heating up, or during the preheating step (below) if you have decided to leave everything unattended until the kettle whistle blows.

Preheating the Beaker

By now, your water should be up to temp.  If it's too hot, that's fine.  If it's just below 190, that's probably fine too.  Now, you're wondering why I poured 2.25 beakers-worth of water into the kettle.  It's because now we're going to waste a little for preheating the French press' beaker.  

Pour in a good amount - since I heated up so much water, I pour in as much as I reasonably can.  Let the beaker sit for a few minutes to warm.  The glass will, predictably, get quite hot.  Do not touch it unless you need to wake up faster.  I will put the plunger in and slosh the water in the beaker around over the sink.  Sloshing too vigorously is a good way to test your pain threshold.

If you had heated the water to a full boil, you can either take this time to let the beaker get very hot while the kettle water cools down, or - to rush it - you can add cold water to the kettle until the temperature is around 195 degrees F.  Swirl your instant-read around in the kettle while adding the cold water, to help it mix and to not over-cool.

On my little portable induction cooktop, I often heat to 190, then fill the beaker and leave the kettle on the cooktop at the lowest level with the spout lid open - it keeps its temp and heats very slowly toward 195.

The Pour

Once the beaker is quite hot, dump the water down the drain.  Now add the grinds, zero your scale, and add water from the kettle - I aim for the aforementioned 750 grams.  The temperature should remain in the 190s if you pop your instant-read thermometer in there for curiosity's sake.  

At this point you can set the lid and plunger in place on the top of the press and set a timer for five (5) minutes.  At the end of that time, perform a standard very-very-slow-press (weight-of-hand / gravity-press, but I'm meaty) and pour the magical caffeine-laden tonic into a worthy and deserving cup.

Pouring for Two

I mentioned this makes two decent cups.  The grinds absorb about 50 grams of water in the process, so you can usually get out about 700 grams total.  I put the two cups on the scale and dump around 350 into each.  Or if I'm unsure, I'll shoot for 345 each and then start splitting the extra with back-and-forth pours between the cups.  This works well if the cups are different shapes, as most of ours are.

The Reasoning, and Variations

We had bought some special coffee once from a local bulk retailer, and the instructions on a couple of the bags indicated brewing between 190 and 200 degrees F.  After much playing with that, I now try to stay within that range.  Previously I always poured at 212 degrees and left it for four minutes, but it tended to leach out a lot more acidity.  There is a difference between excellent, dark, strong coffee and obliterated, dark, strong coffee.  Needless to say, I didn't realize what I was missing.

Brewing at or below 200 for the longer time seems to bring out significantly more flavor, without tasting watered-down or harsh.  I have brewed for as long as six minutes, though I can't remember how I felt about it afterwards.

One site I revisited while typing this suggested pouring at exactly 200 degrees.  However, they did not appear to preheat whatever they were pouring into - which I think were the mugs themselves.  Maybe I skipped over that part, though.  Anyway, the minute the water hits the vessel, it loses temperature.  I did some informal testing of this back when I started preheating the beaker, and was astounded to find double-digit drops in temperature (say, from 190 degrees kettle temp to 160 degrees in the beaker, shortly after pouring, but don't quote me on that).

While we could get very scientific about all of this, and confirm my very impromptu and not very scientific findings, it also doesn't appear to hurt to preheat the beaker - aside from wasting a little extra water (hell, save it for tea later!).  The second pour with the grinds loses very little temperature, the glass of the beaker having already absorbed and not immediately lost much of the first pour's heat.  The result is a slightly lower-temperature water striking the beans, and a (probably) more sustained temperature throughout the brewing.

Also, in the past I used to stir the grinds right after pouring, and then right before plunging.  I don't do either of those now, I just make sure all the grinds are wet while I'm pouring.  A fast, dangerous pour will accomplish this.  

Stirring before plunging seems to just gum up the screen.  Stirring after the pour probably does no harm, other than dirtying another utensil.  And in the morning, I prefer not to have to wash extra things.

Further Study

There is obviously a lot more we could do to confirm all of this.  I could sneak in some temperature probes and monitor the water every 30 seconds for the five-minute duration.  I could also test against a room-temp beaker, to see how the water temp varies.  I could try using a double-boiler with one of my glass measuring cups to maintain the brew at exactly 195 degrees, or exactly 200 degrees, or test with temperatures in between, although at that point I think we'd be splitting hairs...

One must also play with the concentration of coffee to water.  I find that - personally - pushing to 40 grams of coffee is just too much, and I end up with shoulder and back pain from a strange muscle tension that has been too consistently observed after enjoying a delicious cup of intense java.  Dialing down to 35 grams also produces quite acceptable results, so feel free to experiment.

Good luck!



20120517

Software RAID Practices

I really like software RAID.  It's cheap, easy, and extremely stable.  It has lots of options, and I get the most awesome RAID levels I could ask for.  I can migrate from a RAID-1 to a RAID-5, to a RAID-6...I can grow the array live, add or remove hot-spares, and allow more than one array to share a hot-spare.  Compared with hardware solutions, it's amazingly complete.

What I love the most is that, unlike a hardware solution, any Linux system can mount it.  There are a few catches to that, in terms of what it names your md devices, but if you are to the point of mounting your raid on another system, and your raid happens to be your root drive, you're probably advanced enough to know what you're doing here.
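
(As an aside, pulling such an array up on another system is usually just a scan-and-assemble, though the md device name it lands on may not match the original:)

  # on the other/rescue system: find and assemble whatever arrays are visible
  mdadm --assemble --scan
  # or assemble explicitly if you already know the member devices
  mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1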

But what is good practice for software RAID?  So many options make it difficult to know what the pros and cons are.  Here I will try to summarize my experiences.  The explorations here are via mdadm, the de facto standard software RAID configuration tool available on nearly every Linux distribution.  Special non-RAID-related features I discuss are part of the Ubuntu distribution, and I know them to exist on 10.04 and beyond.  Earlier versions may also support this greatness, though you may have to do more of the work on the command-line.  This is also a "theory and practice" document, and so does not aim to be a command-line walkthrough - the handful of command sketches included are illustrative, not prescriptive.  If you want more of those, contact me.

Note that NO RAID is a guarantee of safety.  Sometimes multiple drives fail, and even RAID-6 will not save you if three die at once, or if your system is consumed in a fire.  The only true data safeguard is to back up to an offsite data-store.  Backup, Backup, Backup!


Method 1:  Every Partition is a RAID

Each physical disk is partitioned.  Each partition is added to its own RAID.  So you wind up with multiple RAIDs (possibly at different levels).
Here we have four physical disks, the first two of which are large enough for four partitions.  The other two join to augment the latter partitions.  md0 and md1 would very likely be /boot and swap, configured as RAID-1.  md2 could be the system root, as a RAID-5.  Suppose we also need high-speed writes without the overhead of redundancy?  md3 could be a RAID-0.  One of my live systems is actually configured similarly to this.  The reason: it captures a ton of statistics from other servers over the network.  The data comes so fast that it can easily bottleneck at the RAID-1 while the system waits for the drives to flush the data.  Since the data isn't mission-critical, the RAID-0 is more than capable of handling the load.

The benefit of this method is that you can do a lot with just a few drives, as evidenced above.  You could even just have every partition be a RAID-1, if you didn't need the extra space of sdc and sdd.  The drawback is that when a physical drive goes kaput, you need to recreate the exact same partition layout on the new drive.  This isn't really a problem, just a nuisance.  You also have to manually add each partition back into its target array.  RAID rebuilds are usually done one or two at a time, but the smaller devices will go very quickly.
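
For the curious, here is a minimal sketch of what building (and later repairing) such a layout might look like with mdadm - the device names, partition numbers, and levels below are purely illustrative:

  # /boot and swap as small RAID-1 mirrors across the first two disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
  # system root as a four-member RAID-5
  mdadm --create /dev/md2 --level=5 --raid-devices=4 /dev/sda3 /dev/sdb3 /dev/sdc1 /dev/sdd1
  # fast, non-critical scratch space as a RAID-0
  mdadm --create /dev/md3 --level=0 --raid-devices=4 /dev/sda4 /dev/sdb4 /dev/sdc2 /dev/sdd2
  # after swapping a dead disk: copy the surviving disk's partition layout to
  # the replacement, then add each partition back into its array
  sfdisk -d /dev/sdb | sfdisk /dev/sda
  mdadm /dev/md0 --add /dev/sda1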


Method 2:  Full Device RAID

Hardware controllers use almost the whole device; they reserve a bit for their proprietary metadata, even if you go JBOD (on some controllers, which stinks).  Popping a replacement drive into the array means no partition creation - it just adds it in and goes.  I thought, Why can't I do that?  So I did.

You can do this via a fresh install, or by migrating your system over to it, but either way this is a little advanced.  The concept here is to have a boot device separate from your main operating system RAID.  

In this diagram, our /boot partition is kept separate.  GRUB is able to handle this configuration without much issue. LILO could probably handle it, too, with an appropriate initrd image (i.e. you need to make sure your RAID drivers are loaded into the image to bootstrap the remainder of the system).

You can also turn /boot into a RAID-1, just so that you don't run the risk of totally losing your way of booting your awesome array.  I have done this with a pair of very small USB thumb-drives.  They stick out of the computer only about a quarter of an inch, and the pair are partitioned and configured as RAID-1.  /boot is on the md0 device, and md1 is my root drive.  I tend to use LVM to manage the actual root RAID, so that I can easily carve it into a swap space and the root space, plus any additional logical drives I think are necessary.

There are some catches to using the thumb-drives as boot devices:
  • Obviously, you need a mobo that supports booting from USB sticks.
  • You MUST partition the sticks and RAID the partition.  Trying to raid the whole USB stick will give GRUB no place to live.  The GRUB installer on Ubuntu 10.04+ prefers to install to individual members of the RAID, and is smart enough to do so - put another way, it's not smart enough to realize that installing itself to a RAID-1 will practically guarantee its presence on both devices.  This may be a safety measure.
  • USB flash devices can sometimes raise hell with the BIOS regarding their number of cylinders and their size.  Using fdisk alone can be tumultuous, resulting in a partition table that is followed too closely by the first partition.  This results in GRUB complaining that there is no room to install itself on the desired device.  To resolve this, you can try making a new partition table (DOS-compatibility is fine), or moving the first partition up one cylinder.  The latter is almost guaranteed to work, and you won't lose enough space to even care.  After all, what's 8M inside 8G?  I doubt even NTFS is that efficient.
The plus to this is that replacing a failed root RAID device is as easy as adding a new device member back into the array - no partitioning required.   Very simple.  The downside is that sometimes the USB devices don't both join the RAID, so you have to watch for a degraded /boot array.  Also, it could be considered a detriment to have to boot off USB sticks.  It's worked fairly well for me, and is very, very fast, but there is another method that may be a reasonable compromise.
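
A rough sketch of the setup, with sdc and sdd standing in for the two thumb-drives and sda/sdb/sde for the whole-device members (all of these names are illustrative):

  # partition each stick, then mirror the partitions (not the raw sticks) for /boot
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  # the main array uses the whole devices - no partition tables at all
  mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sde
  # LVM on top of md1 makes carving out root and swap painless
  pvcreate /dev/md1 && vgcreate vg0 /dev/md1
  lvcreate -L 4G -n swap vg0
  lvcreate -l 100%FREE -n root vg0
  # replacing a failed member later is then just:
  mdadm /dev/md1 --add /dev/sdf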

Method 3: Single Partition RAIDs

If you just want regular hard drives, easy rebuilds, and only one big happy array to deal with, a single partition RAID is probably your best bet.  Each drive has a partition table and exactly one partition.  This set of partitions is then added to a RAID device - levels 1, 5, and 6 are definitely supported by GRUB, others may be possible.  The purpose here is to provide a place for GRUB to live on each of the drives in the array.  In the event of a drive failure, BIOS should try the other drives in order until it finds one it can boot from.  The Ubuntu grub package configuration (available via grub-setup, or dpkg-reconfigure grub-pc) will take care of all the dirty-work.

Here again it is probably best practice - perhaps even downright necessary - to add your entire RAID device into LVM2 for management.  Your root, swap, and other logical drives will be easy to define, and GRUB will recognize them.  LVM2 provides such awesome flexibility anyway, I tend to feel you are better off using it than not.

The benefits here are fairly obvious: no gotchas with regard to USB sticks (because there are none), easy maintenance of the RAID, GRUB is automatically available on all drives (as long as you configure it that way), and LVM takes care of dividing up your available space however you please.  Growth is but another drive away.
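
A quick sketch of this method, again with made-up device names - the important part is getting GRUB onto every member:

  # one partition per drive, all joined into a single array
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
  # hand the whole array to LVM and divide it however you please
  pvcreate /dev/md0 && vgcreate vg0 /dev/md0
  # make every drive bootable; on Ubuntu, dpkg-reconfigure grub-pc will let you
  # tick all of the member drives and do this for you
  grub-install /dev/sda
  grub-install /dev/sdb
  grub-install /dev/sdc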

Extras: Growing Your RAID, and Migrating Your Data


Growing Your RAID

An array is a great thing.  But sometimes you run out of room.  So, you thought two 80-gig drives in RAID-1 would be sufficient for all your future endeavors, until you started gathering up the LiveCD for every Linux distro under the sun.  First, you have a choice: another drive, or replacement drives.

If you have the ability to add more drives to your array, you can up the RAID level from 1 to 5.  For a RAID-1 on two 80 gig drives, you instantly get double the storage and RAID-5 redundancy.  If you want to replace your drives, you need to do it one at a time.  During this time, your RAID will become degraded, so it may be a little bit dicey to go this route.  You'll pull out one of the old drives, slide in a new drive (2TB, maybe?), and add it into your array.  The rebuild will happen, and will hopefully complete without issue.  The only instance I know of where it wouldn't complete is when your remaining live drive is actually bad.

Once you've replaced all the drives with bigger ones, you can resize your partitions (may require a reboot), order mdadm to enlarge the array to the maximum possible disk size, grow LVM (if applicable) and finally grow your file system.  Don't forget to do grub-installs on each of the new devices, as well, especially if you're using Method 1 or Method 3 for your array configuration.
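
As a sketch, the replace-and-grow route might look something like this - md0, vg0, the LV name, and the device names are placeholders, so check your own layout before borrowing any of it:

  # one member at a time: fail it, remove it, partition the new drive, re-add it
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm /dev/md0 --add /dev/sdb1
  # (let the rebuild finish before touching the next drive)
  # once every member is bigger, let the array, LVM, and the filesystem catch up
  mdadm --grow /dev/md0 --size=max
  pvresize /dev/md0
  lvextend -l +100%FREE /dev/vg0/root
  resize2fs /dev/vg0/root
  grub-install /dev/sdb
  # the add-a-drive route instead: add a third disk and raise RAID-1 to RAID-5
  mdadm /dev/md0 --add /dev/sdc1
  mdadm --grow /dev/md0 --level=5 --raid-devices=3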

Alternatively, you can crank up a whole new array, get it totally sync'd, and migrate your data over to it.  This is easy if you've got LVM running under the hood.

Migrating Your Data

Suppose you've configured your system as such:
  Drives -> RAID (mdadm) -> LVM PV -> LVM VG -> LVM LVs (root, swap, boot, data, junk)

A brief explanation about LVM2:  All physical volumes (PVs) are added to LVM before they can be used.  In this case, our PV is our RAID device.  All PVs are allocated to Volume Groups (VGs).  After all, if you have multiple physical volumes, you might want to allocate them out differently.  Typically I wind up with just one VG for everything, but it's no hard requirement.  Once you have a VG, you can divide it up into several Logical Volumes (LVs).  This is where the real beauty of LVM comes into play.

Suppose we've configured a second (huge) RAID array for our data.  If our system was originally configured as detailed immediately above, we can order LVM to migrate our data from one PV to another.  In other words, we would:
  1. Create our new RAID array.
  2. Add our new RAID to the LVM as a new PV.
  3. Ask LVM to move all data off our old RAID (old PV)
    1. This means it will use any and all available new PVs - in our case, we have only one new PV.
  4. Wait for the migration to complete.
  5. Order LVM to remove the old PV - we don't want to risk using it for future operations.
  6. Order LVM to grow its VG to the maximum size of the new PV.
  7. Resize our LVs accordingly (perhaps we want more space on root, data, and swap).
  8. Resize the affected file systems (can usually be done live).
  9. Make sure GRUB is installed on the new RAID's drives.
  10. Reboot when complete to make sure everything comes up fine.
The last step is not required, but is good practice to make sure your system won't be dead in six months, while you sit scratching your head trying to remember which of the above steps you left off at six months prior.  I recently performed this exact sequence to migrate a live cluster system over to a new set of drives.
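
For reference, here is a hedged sketch of that sequence in mdadm/LVM terms.  The volume group name (vg0), the array names (md0 old, md1 new), the LV names, and the sizes are all placeholders - substitute your own layout:

  # 1-2: build the new array and hand it to the existing volume group
  mdadm --create /dev/md1 --level=6 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
  pvcreate /dev/md1
  vgextend vg0 /dev/md1
  # 3-4: push every extent off the old PV; this runs live and can take hours
  pvmove /dev/md0
  # 5: retire the old PV so nothing ever lands on it again
  vgreduce vg0 /dev/md0
  pvremove /dev/md0
  # 6-8: grow whichever LVs (and their filesystems) you want into the new space
  lvextend -L +50G /dev/vg0/data
  resize2fs /dev/vg0/data
  # 9: make sure the new array's drives can boot the box
  grub-install /dev/sdb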

Hardware RAID Adapters and JBOD

If you've decided to pimp out your system with lots of drives, you'll have some fun trying to find SATA adapters that don't come with RAID.  It's a challenge.  I suppose most manufacturers think, Why wouldn't you want hardware RAID?  Well, when I decide to purchase a competitor's card because your card is a piece of trash, I don't want to lose all my data in the process or have to buy a multi-terabyte array just to play the data shuffle game.

Most adapters at least offer JBOD.  Many better adapters, however, seem to also want to mark each disk they touch.  The result is a disk that is marginally smaller than the original, and tainted with some extra proprietary data.  This comes across as a drive with fewer sectors, meaning that if mdadm puts its metadata at the end of the known device, and you move that device to a system with straight-up SATA, you may not see your metadata!  (It's there, just earlier in the disk than mdadm is expecting, thanks to the hardware adapter from the original system.)

One benefit of partitioning a disk like this is that you can insulate mdadm from the adapter's insanity.  The metadata will reside at the end of the partition instead of the end of the known drive.  Migrating from one RAID controller to another, or to a standard SATA adapter, should be a little safer, although I can't really speak much from experience concerning switching RAID adapters.  In any case, another option is, of course, to have mdadm use a different version of metadata.  Most of my arrays use version 1.2.  There are a few choices, and according to the documentation they put the actual metadata at different locations on the drive.  This may be a beneficial alternative, but it is probably a moot point if you RAID your partitions instead of the physical drive units.
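
If you do want to pin the metadata format yourself, it is a creation-time option - per the mdadm documentation, version 1.2 sits 4K from the start of the device, while 1.0 sits at the very end, the spot most at risk from a controller that quietly shaves off sectors.  A minimal example, with illustrative device names:

  mdadm --create /dev/md0 --metadata=1.2 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1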

20120511

Pacemaker Best Practices

This really isn't a document about the best practices to use when defining Pacemaker resources.  BUT, it IS a document containing things I'm learning along the way as I set up my HA clusters.  This is basically: what works, what doesn't, and why (if I happen to know).  I will try to update it as I learn.

iSCSI Target and Initiator Control


(Happily, I came to the realization that such a thing was probably not possible before I ever came to that article - now at least I feel like less of a dumb-ass.)

Suppose you want to offer up an HA iSCSI target to initiators (clients).  Suppose your initiators and your target are both governed by Pacemaker - in fact, are in the same cluster.  Here's a block diagram.  I'm using DRBD to sync two "data-stores."

The current DRBD primary, ds-1, is operating the iSCSI target.  For simplicity, ds-2 is acting as the only initiator here.  What does this solution give us?  If ds-1 dies for any reason, ideally we'd like the file system on ds-2 to not know anything about the service interruption.  That is, of course, our HA goal.  Now, we can easily clone both the initiator resource and the filesystem resource, and have a copy of those running on every acceptable node in the cluster, including ds-1.  My goal is to have two storage nodes and several other "worker" nodes.  This method totally masks the true location of the backing storage.  The downside?  Everything has to go through iSCSI, even for local access.  No getting around that, it's the cost of this benefit.

The good news is that you can seamlessly migrate the backing storage from node to node (well, between the two here) without any interruption.  I formatted my test case with ext4, mounted it on ds-1, and ran dbench for 60 seconds with 3 simulated clients while I migrated and unmigrated the target from machine to machine about 10 times.  dbench was oblivious.

(Note that for a production system where multiple cluster members would actually mount the iSCSI-based store, you need a cluster-aware file system like GFS or OCFS2.  I only used ext4 here for brevity and testing, and because I was manually mounting the store on exactly one system.)

Some keys to making this work:

  • You obviously must colocate your iSCSITarget, iSCSILogicalDevice, and IPaddr resources together.
  • The IPaddr resource should be started last (and, being in a group, stopped first).  If it isn't, a willful migration of the resource will shut down the iSCSITarget/LUN first, which cleanly severs the connection with the initiator.  To trick the system into not knowing, we steal the communication pathway out from under the initiator, and give it back once the new resource is online.  This may not work for everyone, but it worked for me.
  • The iSCSITarget will need the portals parameter to be set to the virtual IP.  Actually it's the iscsi resource that requires that, as it gets upset when it thinks it sees multihomed targets.
  • Pick exactly one iSCSI target implementation - don't install both ietd and tgt, or evil will befall you.
  • To ensure that the iscsi initiator resource isn't stopped during migration, you must use a score of 0 in the order statement.  Here are the pertinent sections of my configuration:
-------------------

primitive p_drbd_store0 ocf:linbit:drbd \
        params drbd_resource="store0" \
        op monitor interval="15s" role="Master" timeout="20" \
        op monitor interval="20s" role="Slave" timeout="20" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
primitive p_ipaddr-store0 ocf:heartbeat:IPaddr2 \
        params ip="10.32.16.1" cidr_netmask="12" \
        op monitor interval="30s"
primitive p_iscsiclient-store0 ocf:heartbeat:iscsi \
        params portal="10.32.16.1:3260" target="iqn.2012-05.datastore:store0" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        op monitor interval="120" timeout="30"
primitive p_iscsilun_store0 ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn="iqn.2012-05.datastore:store0" lun="0" path="/dev/drbd/by-res/store0"
primitive p_iscsitarget_store0 ocf:heartbeat:iSCSITarget \
        params iqn="iqn.2012-05.datastore:store0" portals="10.32.16.1:3260"
group g_iscsisrv-store0 p_iscsitarget_store0 p_iscsilun_store0 p_ipaddr-store0 \
        meta target-role="Started"
ms ms_drbd_store0 p_drbd_store0 \
        meta master-max="1" notify="true" interleave="true" clone-max="2" target-role="Started"
clone cl_iscsiclient-store0 p_iscsiclient-store0 \
        meta interleave="true" globally-unique="false" target-role="Started"
colocation colo_iscsisrv-store0 inf: g_iscsisrv-store0 ms_drbd_store0:Master
order o_iscsiclient-store0 0: g_iscsisrv-store0:start cl_iscsiclient-store0:start
order o_iscsisrv-store0 inf: ms_drbd_store0:promote g_iscsisrv-store0:start

-------------------

One final note...  To achieve "load balancing," I set up a second DRBD resource between the two servers, and configured a second set of Pacemaker resources to manage it.  In the above configuration snippet, I call the first one store0 - the second one is store1.  I also configured preferential location statements to keep store0 on ds-1 and store1 on ds-2.  Yeah, I know, unfortunate names.  The truth is the stores can go on either box, and either box can fail or be gracefully pulled down.  The initiators should never know.
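
For completeness, the preferential placement boils down to a pair of location constraints, something along these lines - g_iscsisrv-store1 is my assumed name for the second store's group (mirroring the first), and the score of 100 is just an example preference, not a mandate:

-------------------

location l_store0-on-ds1 g_iscsisrv-store0 100: ds-1
location l_store1-on-ds2 g_iscsisrv-store1 100: ds-2

-------------------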

Why Fencing is REQUIRED...

I'll probably be dropping about $450 for this little gem, or something like it, very very soon.  It's the APC AP7900 PDU, rack-mountable ethernet-controllable power distribution unit.  Why?  I'm glad you asked!

While testing the aforementioned iSCSI configuration, and pumping copious amounts of data through it, I decided to see how a simulated failure would affect the cluster.  To "simulate", I killed Pacemaker on ds-2.  Ideally, the cluster should have realized something was amiss, and migrated services.  It did, in fact, realize something went bust, but migration failed - because I have no fencing.  The DRBD resource, primary on ds-2, wouldn't demote because Pacemaker was not there to tell it to do so.  We can do some things with DRBD to help this, but the fact is the iSCSITarget and IP were still assigned to ds-2, and there was no killing them off without STONITH.  Without killing them, reassignment to the new server would have resulted in an IP conflict.  Happy thoughts about what would've happened to our initiators next!

You now see the gremlins crawling forth from the server cage.

During the "failure," the dbench transfer continued like nothing changed, because, for all intents and purposes, nothing had.  DRBD was still replicating, iSCSI was still working, and everything was as it should have been had the node not turned inside-out.  Realize that even killing the corosync process would have no effect here.  If ds-2 had actually been driven batshit crazy, it would have had plenty of time to corrupt our entire datastore before any human would have noticed.  So much for HA!  The only reasonable recourse would have been to reboot or power off the node as soon as the total failure in communication/control was detected.

This was a simulated failure, at least, but one I could very readily see happening.  Do yourself a favor: fence your nodes.

Oh yeah, and before you say anything, I'm doing this on desktop-class hardware, so IPMI isn't available here.  My other server boxen have it, and I love it, and want very much to use it more.  Still, some would advocate that it's a sub-standard fencing mechanism, and more drastic measures are warranted.  I have no opinions there.  FWIW, I'm ready to have a daemon listening on a port for a special command, so that a couple of echos can tell the kernel to kill itself.

Install All Your Resource Agents

I ran across an interesting problem.  On two cluster members, I had iSCSITargets defined.  On a third, I did not.  Running as an asymmetric cluster (symmetric-cluster="false" in the cluster options), I expected that Pacemaker would not try starting an iSCSITarget resource on that third machine without explicit permission.  Unfortunately, when it found it could not start a monitor for that resource on the third machine, the resource itself failed completely, and shut itself down on the other two machines.

Thanks to a handy reply from the mailing list, it is to be understood that Pacemaker will check to make sure a resource isn't running anywhere else on the cluster if it's meant to be run in only one place.  (This could be logically extended.)  Really, the point is: make sure you install all your resource agents on all your machines.  This will keep your cluster sane.

Monitor, Monitor, Monitor

Not sure if this qualifies as a best-practice yet or not.  While trying to determine the source of some DLM strangeness, I realized I had not defined any monitors for either the pacemaker:controld RA or the pacemaker:o2cb RA.  I posited that, as a result, the DLM was, for whatever reason, not correctly loading on the target machine, and consequently the o2cb RA failed terribly; this left me unable to mount my OCFS2 file system on that particular machine.

Pacemaker documentation states that it does not, by default, keep an eye on your resources.  You must tell it explicitly to monitor by defining the monitor operation.  My current word of advice: do this for everything you can, setting reasonable values.  I expect to do some tweaking therein, but having the monitor configured to recommended settings certainly seems less harmful than not having it at all.
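
For illustration only, a sketch of what that might look like for the two RAs above - the primitive names, intervals, and timeouts are my placeholders, and both resources would normally be cloned across the cluster:

-------------------

primitive p_dlm ocf:pacemaker:controld \
        op monitor interval="60" timeout="60"
primitive p_o2cb ocf:pacemaker:o2cb \
        op monitor interval="60" timeout="60"
group g_dlm-o2cb p_dlm p_o2cb
clone cl_dlm-o2cb g_dlm-o2cb meta interleave="true"

-------------------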

iSCSI - Don't Mount Your Own Targets

This applies only if your targets happen to be block devices.  Surely, if you use a file as a backing store for a target, life will be easier (albeit a little slower).  The most recent meltdown occurred during a little node reconfiguration.  Simply, I wanted to convert one of my nodes to use a bridge instead of the straight bond, which would thereby allow it to host virtuals as well as provide storage.  The standby was fine, the upgrade went great, but the restart was disastrous!  Long story short, the logs and the mtab revealed that the two OCFS2 stores which were intended for iSCSI were already mounted!  You can't share out something that is already hooked up, so the iSCSITarget resource agent failed - which also failed out the one initiator that was actively connected to it.  The initiator machine is now in la-la land, and the VMs that were hosted on the affected store are nuked.

If you build your targets as files instead of block devices, this is a non-issue.  The kernel will not sift through files looking for file system identifiers, and you will be safe from unscrupulous UUID mounts to the wrong place.  Otherwise, don't mount your target on your server, unless you're doing it yourself and have prepared very carefully to ensure there is NO WAY you or the OS could possibly mount the wrong version of it.

Adding a New Machine to the Cluster

Some handy bits of useful stuff for Ubuntu 11.10:
  • apt-get install openais ocfs2-tools ocfs2-tools-pacemaker pacemaker corosync resource-agents iscsitarget open-iscsi iscsitarget-dkms drbd8-utils dlm-pcmk
  • for X in {o2cb,drbd,pacemaker,corosync,ocfs2}; do update-rc.d ${X} disable; done
Idea: Have a secondary testing cluster, if feasible, with an identical cluster configuration (other than maybe running in a contained environment, on a different subnet).  Make sure your new machine plays nice with the testing cluster before deployment.  This way you can make sure you have all the necessary packages installed.  The goal here is to avoid contaminating your live cluster with a new, not-yet-configured machine.  Even if your back-end resources (such as the size of your DRBD stores) are different (much smaller), the point is to make sure the cluster configuration is good and stable.  I am finding that this very powerful tool can be rather unforgiving when prodded appropriately.  Luckily, some of my live iSCSI initiators were able to reconnect, as I caught a minor meltdown and averted disaster thanks to some recently-gained experience.

In the above commands, I install more things than I need on a given cluster machine, because Pacemaker doesn't seem to do its thing 100% right unless they are on every machine.  (I am finding this out the hard way.  And no, the Ubuntu resource-agents package alone does not seem to be enough.)  So, things like iscsitarget and DRBD are both unwanted but required.

Test Before Deployment

In the above section on adding a new machine to a cluster, I mention an "idea" that isn't really mine, and is a matter of good practice.  Actually, it's an absolutely necessary practice.  Do yourself a favor and find some scrap machines you are not otherwise using.  Get some reclaims from a local junk store if you have to.  Configure them the same way as your production cluster (from Pacemaker's vantage point), and use them as a test cluster.  It's important - here's why:

Today a fresh Ubuntu 11.10 install went on a new machine that I needed to add to my VM cluster.  I thought I had installed all the necessary resources, but as I wasn't near a browser I didn't check my own blog for the list of commonly "required" packages.  As a result, I installed pretty much everything except the dlm-pcmk and openais packages.  After I brought Pacemaker up, I realized it wasn't working right, and then realized (with subdued horror) that, thanks to those missing packages, my production cluster was now annihilating itself.  Only one machine remained alive: the one machine that successfully STONITHed every other machine.  Thankfully, there were only two other machines.  Less thankfully, between them about 12 virtual machines were simultaneously killed.

Your test cluster should mirror your production cluster in everything except whatever is the minimal amount of change necessary to segregate it from the production cluster; at least a different multicast address, and maybe a different auth-key.  A separate, off-network switch would probably be advisable.  Once you've vetted your new machine, remove it from the test cluster, delete the CIB and let the machine join the production cluster.

A word of warning - I haven't tried this whole method yet, but I plan to...very soon.