20121027

Forget the Brain... DLM in Ubuntu 12.04

This is a stream-of-research post - don't look for answers here, though I do link some interesting articles.

I'm in the process of preparing my cluster for expansion, and in the midst of installing a new server I inadvertently installed 12.04.1 instead of 11.10.  The rest of the cluster uses 11.10.

Some important distinctions:
  • 12.04 seems to support CMAN+Corosync+Pacemaker+OCFS2 quite well.
  • The same is not certain on 11.10.
  • 12.04 NO LONGER has dlm_controld.pcmk.
  • Trying to symlink or otherwise fake the dlm_controld.pcmk binary on 12.04 does not appear to work, as far as I can recall.

You CAN connect 12.04's and 11.10's Corosyncs and Pacemakers, but as far as I can tell, only if you Don't Need DLM.

I Need DLM.

So, I am trying to understand CMAN a bit better.  Here are some interesting articles:

Configuring CMAN and Corosync - this explains why some of my configurations failed brutally - http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf

Understanding CMAN and Corosync - written earlier than the above document - http://people.redhat.com/ccaulfie/docs/Whither%20cman.pdf

In summary - the CMAN-related binaries load Corosync, but CMAN itself is a plugin for Corosync, providing quorum support. 

Uuuhhhgggg...

CMAN generates the necessary Corosync configuration parameters from cluster.conf and defaults. 

Corosync appears to be the phoenix that rose from the, well, ashes of the OpenAIS project, since that project's page announces the cessation of OpenAIS development.  Corosync deals with all the messaging and whatnot, and it thus appears that CMAN provides definitive quorum information to Corosync even though Corosync has its own quorum mechanisms (which, if I read it right, are distilled versions of earlier CMAN incarnations).



20121012

Ubuntu Server 12.04 - Waiting for Network Configuration

Just ran across an interesting issue, and the forums I've read so far don't provide a real clear answer.  I don't know that this is the answer, either, but it may be worth pursuing.  This is a bit of stream-of-consciousness, by the way - my apologies.

I just set up some new servers and was in the midst of securing them.  The first server I started on has a static IP, a valid gateway, valid DNS server, and all the networking checked out.  On reboot, however, it would take forever to kill bind9, and then I'd see almost two minutes' worth of "Waiting for network configuration."  Well, there are only statically-assigned adapters present, and the loopback (which was left in its installer-default state).

I had introduced a slew of rules via iptables and I suspect they were wreaking havoc with the boot/shutdown procedures.  If someone else is experiencing this problem, try nuking your iptables rules and making sure they don't reload on reboot - hopefully you'll see everything come back up quickly.  UFW users would obviously need to disable ufw.  FWIW, I placed my iptables loader script in the /etc/network/if-pre-up.d/ folder, so it's one of the first things to crank up when networking starts.
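
For the curious, that loader is nothing fancy - a shell script along these lines, with the saved-rules path being just an example:

#!/bin/sh
# /etc/network/if-pre-up.d/iptables - no dots in the filename, or run-parts will skip it.
# Restore the saved ruleset before any interface is brought up.
/sbin/iptables-restore < /etc/iptables.rules
exit 0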

Now, I have similar iptables configurations present on other machines, and I don't know that those machines specifically have the same problem.  That being said, I really haven't rebooted them frequently enough to notice.

* * * * *

After a bit more experimentation, it appears there is some dependency on allowing OUTPUT to the loopback.  Specifically, I'm looking at logs that note packets being sent from my machine's configured static address to the loopback, and consequently they're being dropped by my rules.  They're TCP packets to port 953.  This is apparently rndc, which is related to BIND - and that makes sense, since my other machines do not run BIND daemons.

This rule, while not the most elegant, and probably not the most correct, fixes the issue for now:

-A OUTPUT -m comment --comment "rndc" -o lo -p tcp --dport 953 -j ACCEPT
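
The blunt (and arguably more correct) alternative, which most iptables guides recommend anyway, is to trust loopback wholesale - assuming that fits your situation:

-A INPUT -i lo -j ACCEPT
-A OUTPUT -o lo -j ACCEPT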

It is probably important to note that this machine is not a gateway and so drops any packets that would be forwarded.  I suppose I'm hoping this will be secure, but I just get a strange feeling something more needs to be done.

More on this later, hopefully.

20120914

Resizing a Live iSCSI Target

Despite upgrading the hard drives in my SAN servers, I finally hit the drive-limits of my iSCSI targets and now had to make use of all that extra hard drive space.  Unfortunately, it seems there isn't a lot of information or NICE tool-age available to make this happen seamlessly.  That's OK, it wasn't as painful as I thought it was going to be.

My setup, briefly: an mdadm RAID array carved into LVM logical volumes, which back DRBD devices, which are in turn exported as iSCSI targets by ietd under Pacemaker's control, across two storage cluster hosts.

The stores can be managed on either of the two cluster hosts, and usually this results in a splitting of the load.  The first requirement was, of course, to enlarge the RAID store.  That was easy with mdadm.  Second was to resize, via LVM, the two logical volumes that are used as backing stores for the DRBD devices.  Next, DRBD had to be told to resize each volume, which predictably caused a resync event to occur.  Once that was finished, it was time to notify the initiators.
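
In rough command form, those first three steps were something like this (device, volume group, and resource names here are placeholders, not my real ones):

mdadm --grow /dev/md0 --size=max          # grow the array onto the new capacity
lvextend -L +250G /dev/vgsan/iscsistore0  # enlarge the LVM backing volume
drbdadm resize iscsistore0                # tell DRBD the backing device grew (hence the resync)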

This is where a little trickery had to take place.  So far I've not really found anything that made it easy to tell ietd to "rescan" itself or to otherwise realize that its underlying target devices might have changed their sizes.  About the only thing I could really find was to basically remove and re-add the target, or restart it if you will.

Not really a fun idea, but at least Pacemaker gave me an out.  Instead of shutting down each target, I migrated each target and then unmigrated it back:
  crm resource migrate g_iscsistore0
  crm resource unmigrate g_iscsistore0

It's important to realize that you must wait for the migration to actually complete before un-migrating.  The un-migrate is used to remove the constraint that was automatically generated to force the migration.  This effectively causes the target restart I needed, and because the cluster is properly configured no initiators realized the connection was ever terminated.  This was important because the targets are very live and it's not easy to shut them down without shutting down several other machines.  This will probably be a problem for me in the future when I go to upgrade both the VM cluster that relies on these stores, and the storage cluster that serves them, to a newer release of Ubuntu Server.

In the meantime, I have now effectively resized the targets, and the next step is obviously the initiators.  I have this script to check for a resized device by occasionally asking open-iscsi to rescan:

rescan-iscsi.sh

#!/bin/bash
# Ask open-iscsi to rescan all logged-in sessions so resized targets get noticed.
/sbin/iscsiadm -m node -R > /dev/null


This is actually set up as a cron job on the initiators, to run every 15 minutes.  By now all the machines in the cluster should have recognized the new device sizes.  I can now perform the resize online from one of the initiators:
  tunefs.ocfs2 -v -S /dev/sdc

The resize should be transparent and non-interrupting.  It only took a few minutes for each store to complete.  I now have two 500G iSCSI targets, ready for more data!

One thing I'd really like to do in the future is have my initiators NOT use /dev/sd? names.  I'm not quite sure yet how to do that.  I have run into problems where smartd would try to access the iSCSI targets via the initiator connection and cause the SAN nodes to die horrific deaths.  Not sure what that's about, either.

20120913

ZFS, Additional Thoughts

I am about to expand my ZFS array, and I'm a little bit stuck...not because I don't know what to do, but because I am reflecting on my experiences thus far.

I guess I just find ZFS a little, well, uncomfortable.  That's really the best word I can come up with.  It's not necessarily all ZFS' fault, although some of the fault does lie with it.  I'll try to enumerate what's troubling me.

First, the drive references - they recommend adding devices via their /dev/disk/by-id (or similarly unique-but-consistent) identifiers.  This makes sense in terms of making sure that the drives are always properly recognized and dealt with in the correct order, and having been through some RAID hell with drive ordering I can attest that there have been instances where I've cursed the seeming-randomness of how the /dev/sd? identifiers are assigned.  That being said, my Linux device identifiers look like this:

    scsi-3632120e0a37653430e79784212fdb020

That's really quite ugly and as I look through the list of devices to prune out which ones I've already assigned, I'm missing my little 3-character device names....a lot.  This doesn't seem to be an issue on OpenSolaris systems, but I can't/won't run OpenSolaris at this time.

Second, there's the obvious "expansion through addition" instead of "expansion through reshaping."  I want to believe that two RAID-5-style arrays will give me almost as much redundancy as a single RAID-6, but truth be told any two drives could fail at any time.  I do not think we can safely say that two failing in the same array is less likely than one failing in each array.  If anything, it's just as likely.  If Fate has its say, it's more likely, just to piss off Statistics.

But this is what I've got, and I can't wait another 15 days for all my stores to resync just because I added a drive.  That will be even more true once the redundant server is remote again, and syncing is happening over a tiny 10Mbit link.  I'll just have to bite the bullet and build another raidz1 of four drives, and hope for the best.
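
For the record, bolting another four-drive raidz1 onto the existing pool is a one-liner - something like the following, with a placeholder pool name and the by-id names abbreviated:

zpool add tank raidz1 \
    /dev/disk/by-id/scsi-3632...001 /dev/disk/by-id/scsi-3632...002 \
    /dev/disk/by-id/scsi-3632...003 /dev/disk/by-id/scsi-3632...004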

Third, I'm just a little disturbed about the fact that once you bring something like a raidz online, there is no initial sync.  I guess the creators of ZFS might have thought it superfluous.  After all, if you've written no data, why bother to sync garbage?  It's just something I've come to expect from things like mdadm and every RAID card there is, but then again I suppose it doesn't make a lot of sense after all.  I'm trying to find a counter-example, but so far I can't seem to think of a good one.

Fourth, the tools remind me a little of something out of the Windows age.  They're quite minimalist, especially when compared to mdadm and LVM.  The latter two tools provide a plethora of information, and while not all admins will use it, there have been times I've needed it.  I just feel like the conveniences offered by the ZFS command-line tools actually take away from the depth of information I expect to have access to.  I know there is probably a good reason for it, yet it just isn't that satisfying.

The obvious question at this point is: why use it if I have these issues with it?  Well, for the simple fact that it does per-block integrity checks.  Nothing more.  That is the one killer feature I need because I can no longer trust my hard drives not to corrupt my data, and I can't afford to drop another $6K on new hard drives.  I want so badly to have a device driver that implements this under mdadm, but writing one still seems beyond the scope of my available time.

Or is it?

20120907

ZFS on Linux - A Test Drive

What Happened??

I have been suffering through recent data corruption events on my multi-terabyte arrays.  We have a pair of redundant arrays, intended for secure backup of all our data, so data integrity is obviously of importance.  I've chosen DRBD as an integral part of the stack, because it is solid and totally rocks.  I had gone with mdadm and RAID-6, after a few controller melt-downs left me with a bitter aftertaste.  Throw in some crypto and LVM and voilà!

Then a drive went bad.

There may even be more than one.  Even though smartd detected it, SeaTools didn't immediately want to believe that the drive was defunct.  It took a long-repair operation before the drive was officially failed.  Meanwhile, I have scoured the Internets in search of a solution to what is evidently a now-growing problem.

The crux of the issue is that drive technology is not necessarily getting better (in terms of error rates and ECC), but we are putting more and more data on it.  I've seen posts where people have argued convincingly that, mathematically, the published bit-corruption rates are now unacceptably high in very large data arrays.  I'm afraid that based on my empirical experience, I must concur.  No sectors reported unreadable, no clicking noises were observed, and yet I lost two file systems and an entire 2TB of redundant data.

Thank goodness it was redundant, but now I seriously fear for the primary array's integrity, for it is composed of the same make and model of drives as the array I am now rebuilding.  I guess I'm a little surprised that data integrity has never really been a subject of much work in the file-system community; then again, a few years ago, I probably wouldn't have thought much of it myself.  Now that I am acutely aware of the value of data and the volatility of storage media, it's a big issue.

A Non-Trivial Fix


I had originally thought that drives would either return good data or error-out.  This was not the case.  The corruption was extremely silent, but highly visible once a reboot proved that one file system was unrecoverable and the metadata for one DRBD device was obliterated.  RAID of course did nothing - it was not designed to do so.  The author of mdadm has also, in forum posts, said that an option to perform per-read integrity checks of blocks from the RAID was not implemented, and would not be implemented...though it is probably possible.  That's unfortunate.

I looked for a block-device solution - something that, like cryptsetup and DRBD, acts as an intermediary between the physical medium and the remainder of the device stack.  Such a device would either check sector integrity and fail bad blocks up the stack as needed, or keep ECC on sector data and attempt to fix it on the fly, only failing a read if the data was totally unrecoverable.

I considered some options, and decided that a validity-device would best be placed between the RAID driver and the hard drive, so that it could fail sectors up to the RAID and let RAID recover them from the other disks.  This assumes that the likelihood of data corruption occurring across the array in such a way as to contaminate two (or three) blocks of the same stripe on a RAID-5 (or 6) would be statistically unlikely.

An ECC-device would probably be best placed after the RAID driver, but could also sit before it.  It might not be a bad idea to implement and use both - a validity-device before the RAID and an ECC-device after it.  Obviously we can no longer trust the hardware to do this sort of thing for us.

I performed a cursory examination of Low-Density Parity Check codes, or LDPC, but alas my math is not so good.  There are some libraries available, but writing a whole device driver isn't quite in my time-budget right now.  I'd love to, and I know there are others who would like to make use of it, so maybe someday I will.  Right now I need a solution that works out of the box.

The Options


The latest and greatest open-source file system is Btrfs.  Unfortunately it's much too unstable, from what I've been reading, to be trusted in production environments.  Despite the fact that I tend to take more risks than I should, I can't bring myself to go that route at this time.  That left ZFS, the only other file system with integral data-integrity checks.  This looked promising, but being unable to test-drive OpenSolaris on KVM did not please me.

ZFS-Fuse is readily available and relatively stable, but lacks the block-device manufacturing capability that native ZFS offers.  A happy, albeit slightly more dangerous alternative to this is ZFS-On-Linux (http://zfsonlinux.org/), a native port that, due to licensing, cannot be easily packaged with the kernel.  It can however be distributed separately, which is what the project's authors have done.  It offers native performance through a DKMS module, and (most importantly for me) offers the block-device-generating ZFS Volume feature.

Test-Drive!

ZFS likes to have control over everything, from the devices up to the file system.  That's how it was designed.  I toyed around with setting up a RAID and, through LVM, splitting it up into volumes that would be handled by ZFS - not a good idea: a single drive corrupting itself causes massive and wanton destruction of ZFS' integrity, not to mention that the whole setup would be subject to the same risks that compromised my original array.  So, despite my love of mdadm and LVM, I handed the keys over to ZFS.

I did some initial testing on a VM, by first creating a ZFS file system composed of dd-generated files, and then introduced faults.  ZFS handled them quite well.  I did the same with virtual devices, which is where I learned that mdadm was not going to mix well with ZFS.  I have since deployed on my redundant server and have started rebuilding one DRBD device.  So far, so good.
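
If you want to reproduce that kind of test, the file-backed version goes roughly like this (paths, sizes, and the pool name are arbitrary):

# Make four 256MB files to stand in for disks
for i in 1 2 3 4; do dd if=/dev/zero of=/tmp/zdisk$i bs=1M count=256; done

# Build a raidz1 pool out of them (it mounts at /testpool by default)
zpool create testpool raidz1 /tmp/zdisk1 /tmp/zdisk2 /tmp/zdisk3 /tmp/zdisk4

# Put some data on it, then deliberately stomp on the middle of one "disk"
cp -a /usr/share/doc /testpool/
dd if=/dev/urandom of=/tmp/zdisk3 bs=1M count=32 seek=64 conv=notrunc

# Scrub and see what ZFS makes of the damage
zpool scrub testpool
zpool status -v testpool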

What I like about ZFS is that it does wrap everything up very nicely.  I hope this will result in improved performance, but I will not be able to gather any metrics at this time.  Adding devices to the pool is straightforward, and replacing them is relatively painless.  The redundancy mechanisms are also very nice.  It provides mirroring, RAID-5, RAID-6, and I guess what you could call RAID-(6+1) in terms of how many devices can fail in the array before it becomes a brick (one, one, two, and three devices respectively, in case you were wondering).

What I dislike about ZFS, and what seriously kept me from jumping immediately on it, was its surprisingly poor support for expanding arrays.  mdadm allows you to basically restructure your array across more disks, thus allowing for easy expansion.  It even does this online!  ZFS will only do this over larger disks, not more of them, so if you have an array of 3 disks then you will only ever be able to use 3 disks in that array.  On the bright side, you can add more arrays to your "pool", which is kind of like adding another PV to an LVM stack.  The downside of this is that if you have one RAID-6 with four devices, and you add another RAID-6 of four devices, you are now down four-devices-worth of space when you could be down by only two on mdadm's RAID-6 after restructuring.

So once you choose how your array is going to look to ZFS, you are stuck with it.  Want to change it?  Copy your data off, make the change, and copy it back.  I guess this is what people who use hardware RAID are accustomed to - I've become spoiled by the awesome flexibility of the mdadm/LVM stack.  At this point, however, data integrity is more important to me.

Consequently, with only 8 devices available for my ZFS target (and really right now only 7 because one is failed and removed), I had to choose basic 1-device-redundancy RAIDZ and split the array into two 4-device sub-arrays.  Only one sub-array is currently configured, since I can't bring the other one up until I have replaced my failed drive.  With this being a redundant system, I am hopeful that statistics are on my side and that a dual-drive failure on any given sub-array will not occur at the same time as one on the sibling system.

We Shall See.

20120802

VM Cluster...now with STONITH! (On a Budget)

For some reason my Ubuntu 11.10 VM host servers have been misbehaving.  When one of the nodes died, it caused the DLM to hang and my other nodes quickly perished afterwards.  I've blogged about this a time or two now.  STONITH is the only answer, but how do you do it without spending $450 on an ethernet-controlled PDU?  Well, I "have" $450, but I have yet to put the stupid purchase request through to management...what can I say?  I'm busy...and it grieves me greatly.

What follows is what I did to make super-cheap STONITH a reality.  It does require some hardware, but if you have a lot of servers, you probably have a lot of UPSs, and maybe some of them are recent APC units with the little USB connection for your system.

In my case, I had several APC BackUPS ES (500 and larger) units lying around, all with fresh new batteries.  Most of them had USB connectivity.  These USB-capable units formed my new, albeit temporary, fencing solution.

Configuring NUT

First, a NUT server is needed.  I chose a non-cluster system for this job, but every system in your cluster could be a NUT server and serve to shoot other nodes.  The chosen system in my configuration just collects statistics and monitors other servers, so it's actually a very nice server for the job of shooting nodes.  For our example, we will call the NUT server stonithserver.

root@stonithserver:~#  apt-get install nut-server

/etc/nut/ups.conf was configured for each APC device as follows:

[cn01]
  driver = usbhid-ups
  port = auto
  serial = "BB0........3"


[cn02]
  driver = usbhid-ups
  port = auto
  serial = "BB0........5"

[cn03]
  driver = usbhid-ups
  port = auto
  serial = "BB0........8"

(The serial numbers here are obfuscated for security reasons, as are the cluster node names.  Your devices' serial numbers should be all alpha-numeric characters.  Methods other than serial numbers can be used to distinguish between devices - consult the NUT documentation for more details.)

You can grab the serial numbers via "lsusb -v | less".  Once you get NUT configured for a new UPS (or all of them), use "upscmd" to test them, first to make sure you didn't screw something up, and second to make sure it's going to work correctly when it needs to work.

root@stonithserver:~# upscmd -l cn01
root@stonithserver:~# upscmd cn01 load.off

The first command should return a list of available commands for your UPS.  The second will, on my APC BackUPS ES units, cause the UPS to switch off for about 1 second.  Use the command appropriate for your unit.  My units switch back on automatically, perhaps because they're still being fed mains power.

It's probably important to secure your NUT server in /etc/nut/upsd.users, although I imagine packet sniffing would end that pretty quick:

[stonithuser]
  password = ThisIsNotThePasswordYouAreLookingFor
  instcmds = ALL

Note that the above configuration is a very quick and simple (and probably stupid) one.  Review the relevant documentation for a more secure configuration.

Make sure that /etc/nut/upsd.conf is configured to allow connections in:

LISTEN 0.0.0.0 3493


Now each node in the cluster needs the nut-client package installed, or else it won't be able to talk to the NUT server:

root@cn01:~#  apt-get install nut-client


root@cn02:~#  apt-get install nut-client


...
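
It's also worth verifying, from each node, that the NUT server is actually reachable before trusting it with fencing duty:

root@cn02:~# upsc cn01@stonithserver:3493

That should spit back a list of UPS variables; if it doesn't, fix the networking (or the upsd.users entry) before going any further.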

Configuring STONITH on the Cluster

Finally, some cluster configuration.  On Ubuntu, the NUT binaries are not where they are on Redhat/CentOS.  Also, my UPSs don't understand "reset", so I had to change the reset command to "load.off".  It's enough to nuke a running server, and perhaps the best part is that if the server auto-powers-on (a BIOS option), you have yourself a handy way to remote-reboot any failed machine.  Add Wake-on-LAN, and it's like having IPMI power control...without the nice user interface.

For each cluster node, a STONITH primitive is needed:



primitive p_stonith_cn01 stonith:external/nut \
        params hostname="cn01" \
               ups="cn01@stonithserver:3493" \
               username="stonithuser" \
               password="ThisIsNotThePasswordYouAreLookingFor" \
               upscmd="/bin/upscmd" \
               upsc="/bin/upsc" \
               reset="load.off" \
        op start interval="0" timeout="15" \
        op stop interval="0" timeout="15" \
        op monitor start-delay="15" interval="15" timeout="15" \
        meta target-role="Started"


A STONITH primitive is like any other primitive - it runs and can be started and stopped.  Therefore it needs a node to run on.  Restrict them so that they don't run on the machines that are supposed to be killed by them - that is, a downed node can't (or shouldn't be expected to) suicide itself:

location l_stonith_cn01 p_stonith_cn01 -inf: cn01

Re-enable STONITH in the cluster options, because, frankly, if you're reading this then you've probably had it disabled this whole time:


property $id="cib-bootstrap-options" \
        stonith-enabled="true" \
...

Test the cluster by faking downed nodes.  Do this one machine at a time, and recover your cluster before testing another machine!  If you have three nodes, nuke one, then bring it back to life and let the cluster become stable again, and then nuke the second one.  Repeat for the third one.  This can be easily done by pulling network cables and watching the machines reboot.  Every machine should get properly nuked.
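
If pulling cables feels too crude, you can also ask Pacemaker to fence a node directly - if I'm remembering the tool correctly, stonith_admin will do it - and watch the victim power-cycle:

root@cn02:~# stonith_admin --reboot cn01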


NB:  BEFORE you enable STONITH in Pacemaker, make sure you have a clean CIB.  I had a few stale machines (ex-nodes) defined in my CIB.  Pacemaker thought they were unclean and tried to STONITH them.  But since they really didn't exist and also didn't have any STONITH primitives defined, the fencing failed, and in doing so prevented pretty much all my resources from loading throughout the cluster.  (I would classify that as a feature, not a bug.)  Once the defunct node definitions were removed, everything came up beautifully.

20120730

Cluster Building, Ubuntu 12.04 - REVISED

This is an updated post about building a Pacemaker server on Ubuntu 12.04 LTS.

I've learned a great deal since my last post, as many intervening posts will demonstrate.  Most of my machines are still on 11.10.  I have finally found some time to work on getting 12.04 to cooperate.

Our goals today will be a Pacemaker+CMAN cluster running DRBD and OCFS2.  This should cover most of the "difficult" stuff that I know anything about.

If you have tried and failed to get a stable Pacemaker cluster running on 12.04, you may find that having the DLM managed by Pacemaker is not advisable.  In fact, it's not allowable.  I filed a formal bug report and was then informed that the DLM was, indeed, managed by CMAN.  Configuring it to also be managed by Pacemaker caused various crashes every time I put a node into standby.

Installation


Start with a clean, new Ubuntu 12.04 Server and make sure everything is up-to-date.
A few packages are for the good of the nodes themselves:
apt-get install ntp

Pull down the necessary packages for the cluster:
apt-get install cman pacemaker fence-agents openais

and the necessary packages for DRBD:
apt-get install drbd8-utils

and the necessary packages for OCFS2:
apt-get install ocfs2-tools ocfs2-tools-cman ocfs2-tools-pacemaker


Configuration, Part 1

CMAN

Configure CMAN not to wait for quorum at startup if you have a two-node cluster...or simply don't want to wait for quorum on boot:

echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/default/cman

For the cluster.conf, there are some good things to know:
  • The cluster multicast address is, by default, generated as a hash of the cluster name - make this name unique if you run multiple clusters on the same subnet.  You can configure it manually, though I have not yet tried.
  • The interface element under the totem element appears to be "broken," or useless, and aside from that the Ubuntu docs suggest that any configuration values specified here will be overruled by whatever is under the clusternodes element.  Don't bother trying to set the bind-address here for the time being.
  • If you specify host names for each cluster node, reverse-resolution will attempt to determine what the bind address should be.  This will cause a bind to the loopback adapter unless you either (a) use IP addresses instead of the node names, or (b) remove the 127.0.1.1 address line from /etc/hosts!!  A symptom of this condition is that you bring both nodes up, and each node thinks it's all alone.
  • The two_node="1" attribute reportedly causes CMAN to ignore a loss of quorum for two-node clusters.
  • For added security, generate a keyfile with corosync-keygen and configure CMAN to pass it to Corosync - make sure to distribute it to all member nodes.
  • Always run ccs_config_validate before trying to launch the cman service.
  • Refer to /usr/share/cluster/cluster.rng for more (extremely detailed) info about cluster.conf

I wanted to put my cluster.conf here, but the XML is raising hell with Blogger.  Anyone who really wants to see it may email me.
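
For the impatient, the bare skeleton of a two-node cluster.conf looks roughly like this - a sketch only, not my actual file, so adjust the names, add fencing as appropriate, and run it through ccs_config_validate:

<?xml version="1.0"?>
<cluster name="pickauniquename" config_version="1">
  <cman two_node="1" expected_votes="1" keyfile="/etc/corosync/authkey"/>
  <clusternodes>
    <clusternode name="l9" nodeid="1"/>
    <clusternode name="l10" nodeid="2"/>
  </clusternodes>
</cluster>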

Corosync

The Corosync config file is ignored when launching via CMAN.  cluster.conf is where those options live now.

 

Configuration, Part 2

By this time, if you have started CMAN and Pacemaker (in that order), both nodes should be visible to one another and should show up in crm_mon.  Make sure there are no monitor failures, as this will likely mean you're missing some packages on the reported node(s).
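
For reference, getting the stack up on 12.04 is just, on each node:

service cman start
service pacemaker start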

DRBD

I tend to place as much as I can into the /etc/drbd.d/global_common.conf, so as to save a lot of extra typing when creating new resources on my cluster.  This may not be best practice, but it works for me.  For my experimental cluster, I have two nodes: l9 and l10.  Here's a slimmed-down global_common.conf, and a single resource called "share".

/etc/drbd.d/global_common.conf
global {
    usage-count no;
}

common {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }

    startup {
      wfc-timeout      15;
      degr-wfc-timeout 60;
    }

    disk {
      on-io-error detach;
      fencing resource-only;
    }


    net {
           data-integrity-alg sha1;
           cram-hmac-alg sha1;
           # This isn't the secret you're looking for...
           shared-secret "234141231231234551";

           sndbuf-size 0;

           allow-two-primaries;

           ### Configure automatic split-brain recovery.
           after-sb-0pri discard-zero-changes;
           after-sb-1pri discard-secondary;
           after-sb-2pri disconnect;
    }

    syncer {
           rate 35M;
           use-rle;
           verify-alg sha1;
           csums-alg sha1;
    }
}
 
/etc/drbd.d/share.res
resource share  {
  device             /dev/drbd0;
  meta-disk          internal;

  on l9   {
    address   172.18.1.9:7788;
    disk      /dev/l9/share;
  }

  on l10  {
    address   172.18.1.10:7788;
    disk      /dev/l10/share;
  }
}
 
Those of you with a keen eye will note I've used an LVM volume as my backing storage device for DRBD.  Use whatever works for you.  Now, on both nodes:

drbdadm create-md share
drbdadm up share

And on only one node:
drbdadm -- -o primary share

It's probably best to let the sync finish, but I'm in a rush, so...on both nodes:
drbdadm down share
service drbd stop
update-rc.d drbd disable

The update-rc.d line is particularly important: DRBD cannot be allowed to crank up on its own - it will be Pacemaker's job to do this for us.   The same goes for O2CB and OCFS2:

update-rc.d o2cb disable
update-rc.d ocfs2 disable

OCFS2 also requires a couple of kernel parameters to be set.  Apply these to /etc/sysctl.conf:

echo "kernel.panic = 30" >> /etc/sysctl.conf
echo "kernel.panic_on_oops = 1" >> /etc/sysctl.conf
sysctl -p

With that done, we can go into crm and start configuring our resources.  What follows will be a sort-of run-of-the-mill configuration for a dual-primary resource.  YMMV.  I have used both single-primary and dual-primary configurations.  Use what suits the need.  Here I have a basic cluster configuration that will enable me to format my OCFS2 target:

node l10 \
        attributes standby="off"
node l9 \
        attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
        params drbd_resource="share" \
        op monitor interval="15s" role="Master" timeout="20s" \
        op monitor interval="20s" role="Slave" timeout="20s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive p_o2cb ocf:pacemaker:o2cb \
        params stack="cman" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100" \
        op monitor interval="10" timeout="20"
ms ms_drbd_share p_drbd_share \
        meta master-max="2" notify="true" interleave="true" clone-max="2"
clone cl_o2cb p_o2cb \
        meta interleave="true" globally-unique="false"
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="cman" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"

Of special note - we must specify the stack="cman" parameter for o2cb to function properly, otherwise you will see startup failures for that resource.  To round out this example, a usable store would help.  After a format...

mkfs.ocfs2 /dev/drbd/by-res/share
mkdir /srv/share

Our mount target will be /srv/share.  Make sure to create this directory on both/all applicable nodes.  The modifications to the above configuration for the OCFS2 file system are the p_fs_share primitive, its cl_fs_share clone, and the accompanying colocation and ordering constraints:
node l10 \
    attributes standby="off"
node l9 \
    attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
    params drbd_resource="share" \
    op monitor interval="15s" role="Master" timeout="20s" \
    op monitor interval="20s" role="Slave" timeout="20s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s"
primitive p_fs_share ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/share" directory="/srv/share" fstype="ocfs2" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    op monitor interval="20" timeout="40"
primitive p_o2cb ocf:pacemaker:o2cb \
    params stack="cman" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    op monitor interval="10" timeout="20"
ms ms_drbd_share p_drbd_share \
    meta master-max="2" notify="true" interleave="true" clone-max="2"
clone cl_fs_share p_fs_share \
    meta interleave="true" notify="true" globally-unique="false"
clone cl_o2cb p_o2cb \
    meta interleave="true" globally-unique="false"
colocation colo_share inf: cl_fs_share ms_drbd_share:Master cl_o2cb
order o_o2cb inf: cl_o2cb cl_fs_share
order o_share inf: ms_drbd_share:promote cl_fs_share
property $id="cib-bootstrap-options" \
    dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
    cluster-infrastructure="cman" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"

A couple of notes here, as well: not ordering the handling of O2CB correctly could wreak havoc when putting nodes into standby.  In this case I've ordered it with the file system mount, but a different approach may be more appropriate if we had multiple OCFS2 file systems to deal with.  Toying with the ordering of the colocations may also have an effect on things.  Read up on all applicable Pacemaker documentation. 

To test my cluster, I put each node in standby and brought it back a few times, then put the whole cluster in standby and rebooted all the nodes (all two of them).  Bringing them all back online should happen without incident.  In my case, I had to make one change:

order o_share inf: ms_drbd_share:promote cl_fs_share:start
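
For anyone following along, the standby dance itself is just a pair of crm commands per node:

crm node standby l9
crm node online l9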


Finally, the one missing piece to this configuration is proper STONITH devices and primitives.  These are a MUST for OCFS2, even if you're running it across virtual machines.   A single downed node will hang the entire cluster until the downed node is fenced.  Adding fencing is an exercise left to the reader, though I will be sharing my own experiences very soon.