20120519

Fence Your Junk

A little fencing goes a long way, as I am finding out.

I have a problem with my current set of VM hosts.  Every day around 6:30 AM, the active host goes bananas and pukes up a bunch of non-responsive CPU messages.  Lovely.  The worst part of this is not that it takes the VMs it's hosting with it, but that it nukes OCFS2.  More to the point, I believe it hangs the DLM (distributed lock manager) that OCFS2 relies on.  Without the DLM, no other node in the shared-storage cluster can access the shared data, even if those nodes are perfectly sane.

What a pain!

There's a very good reason for it - if your insane node isn't just pumping out inane complaints that its CPU isn't happy, but is instead crunching on your data, you really want it dead ASAP.  Evidently, the DLM understands this and appreciates it more than the average human.  It fully expects STONITH ("shoot the other node in the head") to be properly configured and functional before it will let a dead node lie.
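To make that reasoning concrete, here is a toy model in Python of a lock manager that refuses to hand out locks while a failed node has not been confirmed dead. It is only a sketch of the idea, not how the real DLM is implemented; the class name, method names, and return strings are all made up for illustration.

```python
class ToyLockManager:
    """Toy illustration of 'no fencing, no locks'. Not the real DLM."""

    def __init__(self, nodes):
        self.alive = set(nodes)
        self.unfenced = set()   # failed nodes not yet confirmed dead
        self.holders = {}       # resource name -> node holding the lock

    def node_failed(self, node):
        # The node stopped responding, but it may still be writing to disk.
        self.alive.discard(node)
        self.unfenced.add(node)

    def fence_confirmed(self, node):
        # STONITH reported success: the node is definitely dead,
        # so any locks it held can be released safely.
        self.unfenced.discard(node)
        self.holders = {r: n for r, n in self.holders.items() if n != node}

    def acquire(self, node, resource):
        if node not in self.alive:
            return "rejected"
        if self.unfenced:
            # Someone is in an unknown state: nobody gets a lock.
            return "blocked"
        if resource in self.holders and self.holders[resource] != node:
            return "busy"
        self.holders[resource] = node
        return "granted"


dlm = ToyLockManager(["l2", "l3"])
print(dlm.acquire("l3", "inode-1"))   # l3 takes a lock while healthy
dlm.node_failed("l3")
print(dlm.acquire("l2", "inode-2"))   # l2 is sane but still has to wait
dlm.fence_confirmed("l3")
print(dlm.acquire("l2", "inode-1"))   # l3's old lock is free again
```

The point of the model is the middle call: the healthy node is blocked even on a resource the dead node never touched, which matches the hang I see when fencing is missing.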

To test this, I followed these instructions on setting up libvirt-based STONITH on my virtualized test cluster.  Awesome stuff.  I had to make a few tweaks for my particular configuration.

First, though, I had to validate that I could reproduce the issue I was experiencing on my real machines.  I set up a DRBD-shared storage resource (dual-primary) and mounted it on both cluster nodes (named l2 and l3).  I went to the mount-point on l2 and kicked off dbench for 500 seconds with 5 clients, to really work the system.  After it chugged along for a minute, I switched over to l3 and nuked it.  As predicted - and as experienced on the physical systems - the DLM hung itself.  dbench reported latency climbing second by second, which means no accesses were actually completing, successfully or otherwise.
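The inference from those latency numbers (latency growing by about one second per second means nothing is completing at all) can be written down as a small check. This is a rough sketch in Python: the sample format, the function name, and the 20% tolerance are my own assumptions, and it does not parse real dbench output.

```python
def looks_stalled(samples, tolerance=0.2, min_samples=3):
    """Return True if latency grows about as fast as the clock does.

    samples: list of (elapsed_seconds, max_latency_seconds) pairs,
    in time order. This format is hypothetical, not dbench's own.
    """
    if len(samples) < min_samples:
        return False
    for (t0, lat0), (t1, lat1) in zip(samples, samples[1:]):
        dt = t1 - t0
        dlat = lat1 - lat0
        # A stuck request ages one second for every second of wall time.
        if dt <= 0 or abs(dlat - dt) > tolerance * dt:
            return False
    return True


healthy = [(1, 0.05), (2, 0.07), (3, 0.04)]
hung = [(10, 1.0), (11, 2.0), (12, 3.0), (13, 4.0)]
print(looks_stalled(healthy))
print(looks_stalled(hung))
```

A busy but working filesystem shows latency that bounces around; a hung one shows a single outstanding request getting older in lockstep with the clock.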

Once the libvirt-based STONITH resources were configured, I ran the test again.  I started dbench on l2, let it cook for a few seconds, and then ran killall pacemakerd on l3 to simulate a major application crash.  Of course, pacemaker would never crash... unless you're using CMAN on Ubuntu 12.04 and you put the friggin' node into standby.  Yeah.

Killing pacemakerd by hand left the node insane, as intended.  Happily, STONITH kicked in and rebooted the node.  Meanwhile, the DLM went along its happy way.

The story doesn't end there, though.  After another test with "halt -f -n", I found the DLM still getting snagged.  More research is needed.
