I've learned a great deal since my last post, as many intervening posts will demonstrate. Most of my machines are still on 11.10. I have finally found some time to work on getting 12.04 to cooperate.
Our goals today will be a Pacemaker+CMAN cluster running DRBD and OCFS2. This should cover most of the "difficult" stuff that I know anything about.
If you have tried and failed to get a stable Pacemaker cluster running on 12.04, be aware that having the DLM managed by Pacemaker is not just inadvisable - it isn't allowed. I filed a formal bug report and was informed that the DLM is, indeed, managed by CMAN. Configuring it to also be managed by Pacemaker caused various crashes every time I put a node into standby.
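In concrete terms: under CMAN, dlm_controld is started by the cman init script, so there should be no DLM resource in the CIB at all. If you are carrying one over from an older corosync-plugin setup - something along these lines (the resource names here are only illustrative) - delete it before going any further (crm configure delete cl_dlm p_dlm will take care of both):
primitive p_dlm ocf:pacemaker:controld \
op monitor interval="10"
clone cl_dlm p_dlm \
meta globally-unique="false" interleave="true"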
Installation
Start with a clean, new Ubuntu 12.04 Server and make sure everything is up-to-date.
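That is, the usual:
apt-get update
apt-get dist-upgrade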
A few packages are for the good of the nodes themselves:
apt-get install ntp
Pull down the necessary packages for the cluster:
apt-get install cman pacemaker fence-agents openais
and the necessary packages for DRBD:
apt-get install drbd8-utils
and the necessary packages for OCFS2:
apt-get install ocfs2-tools ocfs2-tools-cman ocfs2-tools-pacemaker
Configuration, Part 1
CMAN
Configure CMAN not to wait for quorum at startup - useful for a two-node cluster, or any time you don't want startup blocked waiting on quorum:
echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/default/cman
For the cluster.conf, there are some good things to know:
- The cluster multicast address is, by default, generated as a hash of the cluster name - make this name unique if you run multiple clusters on the same subnet. You can configure it manually, though I have not yet tried.
- The interface element under the totem element appears to be "broken," or useless, and aside from that the Ubuntu docs suggest that any configuration values specified here will be overruled by whatever is under the clusternodes element. Don't bother trying to set the bind-address here for the time being.
- If you specify host names for each cluster node, reverse-resolution will attempt to determine what the bind address should be. This will cause a bind to the loopback adapter unless you either (a) use IP addresses instead of the node names, or (b) remove the 127.0.1.1 address line from /etc/hosts!! A symptom of this condition is that you bring both nodes up, and each node thinks it's all alone.
- The two_node="1" attribute reportedly causes CMAN to ignore a loss of quorum for two-node clusters.
- For added security, generate a keyfile with corosync-keygen and configure CMAN to pass it to Corosync - make sure to distribute it to all member nodes.
- Always run ccs_config_validate before trying to launch the cman service.
- Refer to /usr/share/cluster/cluster.rng for more (extremely detailed) info about cluster.conf
I wanted to put my cluster.conf here, but the XML is raising hell with Blogger. Anyone who really wants to see it may email me.
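In lieu of the real thing, here is a rough sketch of what a minimal two-node cluster.conf along the lines described above might look like. The cluster name, nodeids, and keyfile path are placeholders (the keyfile itself would be generated with corosync-keygen and copied to both nodes, per the note above); run ccs_config_validate against whatever you end up with:
<?xml version="1.0"?>
<cluster name="pcmk-l9l10" config_version="1">
  <cman two_node="1" expected_votes="1" keyfile="/etc/cluster/corosync.key"/>
  <clusternodes>
    <clusternode name="l9" nodeid="1"/>
    <clusternode name="l10" nodeid="2"/>
  </clusternodes>
  <fencedevices/>
</cluster>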
Corosync
When Corosync is launched via CMAN, /etc/corosync/corosync.conf is ignored; cluster.conf is where those options live now.
Configuration, Part 2
By this time, if you have started CMAN and Pacemaker (in that order), both nodes should be visible to one another and should show up in crm_mon. Make sure there are no monitor failures; if there are, it likely means you're missing some packages on the reported node(s).
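If you haven't brought the stack up yet, that amounts to something like this on each node (crm_mon -1 gives a one-shot status dump):
service cman start
service pacemaker start
crm_mon -1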
DRBD
I tend to place as much as I can into /etc/drbd.d/global_common.conf, so as to save a lot of extra typing when creating new resources on my cluster. This may not be best practice, but it works for me. For my experimental cluster, I have two nodes: l9 and l10. Here's a slimmed-down global_common.conf, and a single resource called "share".
/etc/drbd.d/global_common.conf
global {
usage-count no;
}
common {
protocol C;
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
}
startup {
wfc-timeout 15;
degr-wfc-timeout 60;
}
disk {
on-io-error detach;
fencing resource-only;
}
net {
data-integrity-alg sha1;
cram-hmac-alg sha1;
# This isn't the secret you're looking for...
shared-secret "234141231231234551";
sndbuf-size 0;
allow-two-primaries;
### Configure automatic split-brain recovery.
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}
syncer {
rate 35M;
use-rle;
verify-alg sha1;
csums-alg sha1;
}
}
/etc/drbd.d/share.res
resource share {
device /dev/drbd0;
meta-disk internal;
on l9 {
address 172.18.1.9:7788;
disk /dev/l9/share;
}
on l10 {
address 172.18.1.10:7788;
disk /dev/l10/share;
}
}
Those of you with a keen eye will note I've used LVM volumes as the backing storage devices for DRBD. Use whatever works for you.
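If you want to reproduce that layout, the backing volumes can be carved out with something like the following - the volume group names (l9 and l10, matching the hostnames) come from the resource file above, while the 10G size is just an assumption:
lvcreate -L 10G -n share l9    # run on node l9
lvcreate -L 10G -n share l10   # run on node l10
With the backing devices in place, run the following on both nodes: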
drbdadm create-md share
drbdadm up share
And on only one node:
drbdadm -- -o primary share
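You can watch the initial sync from either node with:
cat /proc/drbd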
It's probably best to let the sync finish, but I'm in a rush, so...on both nodes:
drbdadm down share
and
service drbd stop
update-rc.d drbd disable
on both nodes. The last line is particularly important: DRBD cannot be allowed to crank up on its own - it will be Pacemaker's job to do this for us. The same goes for O2CB and OCFS2:
update-rc.d o2cb disable
update-rc.d ocfs2 disable
OCFS2 also requires a couple of kernel parameters to be set. Apply these to /etc/sysctl.conf:
echo "kernel.panic = 30" >> /etc/sysctl.conf
echo "kernel.panic_on_oops = 1" >> /etc/sysctl.conf
sysctl -p
With that done, we can go into crm and start configuring our resources. What follows will be a sort-of run-of-the-mill configuration for a dual-primary resource. YMMV. I have used both single-primary and dual-primary configurations. Use what suits the need. Here I have a basic cluster configuration that will enable me to format my OCFS2 target:
node l10 \
attributes standby="off"
node l9 \
attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
params drbd_resource="share" \
op monitor interval="15s" role="Master" timeout="20s" \
op monitor interval="20s" role="Slave" timeout="20s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive p_o2cb ocf:pacemaker:o2cb \
params stack="cman" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100" \
op monitor interval="10" timeout="20"
ms ms_drbd_share p_drbd_share \
meta master-max="2" notify="true" interleave="true" clone-max="2"
clone cl_o2cb p_o2cb \
meta interleave="true" globally-unique="false"
property $id="cib-bootstrap-options" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="cman" \
stonith-enabled="false" \
no-quorum-policy="ignore"
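If you keep the configuration in a file, one way to apply it with the crm shell is to load it in one go (the filename here is just a placeholder); pasting it into crm configure edit works too:
crm configure load update cluster.crm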
Of special note: we must specify the stack="cman" parameter for o2cb to function properly, otherwise you will see startup failures for that resource. To round out this example, a usable store would help. After a format...
mkfs.ocfs2 /dev/drbd/by-res/share
mkdir /srv/share
Our mount target will be /srv/share. Make sure to create this directory on both/all applicable nodes. The changes to the earlier configuration that add the OCFS2 resource are the p_fs_share primitive, its cl_fs_share clone, and the colocation and order constraints:
node l10 \
attributes standby="off"
node l9 \
attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
params drbd_resource="share" \
op monitor interval="15s" role="Master" timeout="20s" \
op monitor interval="20s" role="Slave" timeout="20s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive p_fs_share ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/share" directory="/srv/share" fstype="ocfs2" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60" \
op monitor interval="20" timeout="40"
primitive p_o2cb ocf:pacemaker:o2cb \
params stack="cman" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100" \
op monitor interval="10" timeout="20"
ms ms_drbd_share p_drbd_share \
meta master-max="2" notify="true" interleave="true" clone-max="2"
clone cl_fs_share p_fs_share \
meta interleave="true" notify="true" globally-unique="false"
clone cl_o2cb p_o2cb \
meta interleave="true" globally-unique="false"
colocation colo_share inf: cl_fs_share ms_drbd_share:Master cl_o2cb
order o_o2cb inf: cl_o2cb cl_fs_share
order o_share inf: ms_drbd_share:promote cl_fs_share
property $id="cib-bootstrap-options" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="cman" \
stonith-enabled="false" \
no-quorum-policy="ignore"
A couple of notes here, as well: not ordering the handling of O2CB correctly could wreak havoc when putting nodes into standby. In this case I've ordered it with the file system mount, but a different approach may be more appropriate if we had multiple OCFS2 file systems to deal with. The order in which resources are listed inside the colocation constraint can also have an effect on things. Read up on all applicable Pacemaker documentation.
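As one hypothetical example, with a second OCFS2 file system (call its clone cl_fs_other) I would probably give each mount its own ordering against the o2cb clone rather than chaining everything together:
order o_o2cb_share inf: cl_o2cb cl_fs_share
order o_o2cb_other inf: cl_o2cb cl_fs_other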
To test my cluster, I put each node in standby and brought it back a few times, then put the whole cluster in standby and rebooted all the nodes (all two of them). Bringing them all back online should happen without incident. In my case, I had to make one change:
order o_share inf: ms_drbd_share:promote cl_fs_share:start
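For reference, the standby/online cycling described above is done with commands along these lines (l9 being one of my nodes):
crm node standby l9
crm node online l9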
Finally, the one missing piece to this configuration is proper STONITH devices and primitives. These are a MUST for OCFS2, even if you're running it across virtual machines. A single downed node will hang the entire cluster until the downed node is fenced. Adding fencing is an exercise left to the reader, though I will be sharing my own experiences very soon.
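For what it's worth, the pattern documented in the Pacemaker "Clusters from Scratch" guide for CMAN stacks is to point CMAN's own fencing at Pacemaker with fence_pcmk, so that the STONITH devices you eventually configure in Pacemaker are the single source of truth. Roughly, for each node in cluster.conf (shown here for l9; this is only a sketch of the documented pattern, not a tested configuration):
<clusternode name="l9" nodeid="1">
  <fence>
    <method name="pcmk-redirect">
      <device name="pcmk" port="l9"/>
    </method>
  </fence>
</clusternode>
<!-- and, once, in the fencedevices section: -->
<fencedevice name="pcmk" agent="fence_pcmk"/>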