20130905

Can't mount or fsck OCFS2, using CMAN...what you probably shouldn't ever do (but I did)

It's late.  You have only one node of a two-node cluster left, and you're using cman, pacemaker, and OCFS2.  The node gets rebooted.  Suddenly, and for no apparent reason, you can't mount your OCFS2 drives.  You can run fsck.ocfs2 -n, but a real fsck that would actually clean up the drive just goes nowhere.  You wait precious minutes.  What is wrong?

Checking the logs, you see that your stonith mechanisms are failing.  Strange, they used to work.  But now they don't, and cman wants to fence the other node, the one you know is really, really, really fucking dead.  What to do?  Hmm... you can't tell it to unfence the node, because no command you try seems to get it to actually agree.

Desperation sets in.  You have to fsck the damn things.  You've rebooted a dozen times and carefully brought everything back up, and still it sits there, mocking you, not scanning a goddamn thing.  What did I do?  I installed a null stonith device (stonith:null) in pacemaker, and gave it the dead node in the hostlist.  On the next round of fencing attempts that cman made, pacemaker failed at the original stonith device and succeeded at the null device (as expected).  Suddenly cman was happy, and the world rejoiced, and the file system scans flew forth with verbose triumph.  Moments later everything was mounted and life continued as though nothing bad had ever happened.
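A rough sketch of what adding that null device might look like, assuming the crm shell (crmsh) is what you use to configure pacemaker.  The post doesn't say which tool was actually used, and the node name node2 and resource ID st-null below are made up for illustration.  The Python here only builds and prints the command; it doesn't touch the cluster.

```python
import shlex


def null_stonith_command(dead_node, resource_id="st-null"):
    """Build a crmsh command that defines a stonith:null primitive.

    Assumptions (not from the original post): crmsh is installed, and
    `resource_id` is a hypothetical name for the new resource.
    stonith:null always reports success without fencing anything, so
    only point it at a node you know is physically dead.
    """
    return [
        "crm", "configure", "primitive", resource_id,
        "stonith:null",
        "params", f"hostlist={dead_node}",
    ]


if __name__ == "__main__":
    # "node2" is a placeholder for the dead node's cluster name.
    # Print the command for review instead of running it.
    print(shlex.join(null_stonith_command("node2")))
```

Once the real stonith device works again, the null device should come back out of the configuration, since a fence agent that always says "done" defeats the point of fencing.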

Now I have to figure out why my stonith device failed in the first place.  That pisses me off.