20130905

Can't mount or fsck OCFS2, using CMAN...what you probably shouldn't ever do (but I did)

It's late.  You have only one node of a two-node cluster left, and you're using cman, pacemaker, and OCFS2.  The node gets rebooted.  Suddenly, and for no apparent reason, you can't mount your OCFS2 drives.  You can do fsck.ocfs2 -n, but an actual repair run just goes nowhere.  You wait precious minutes.  What is wrong?
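
For concreteness, the difference between the check that works and the repair that doesn't looks roughly like this (the device path is a placeholder for wherever your OCFS2 volume lives):

    # read-only check: inspects the filesystem but changes nothing
    fsck.ocfs2 -n /dev/sdb1

    # actual repair: force a full check and answer yes to every fix;
    # this is the step that just sits there in the scenario above
    fsck.ocfs2 -fy /dev/sdb1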

Checking the logs, you see that your stonith mechanisms are failing.  Strange, they used to work before.  But now they're not, and cman wants to fence the other node that you know is really, really, really fucking dead.  What to do?  Hmm... I can't tell it to unfence the node, because no command I try seems to make it actually agree to those behests.
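
If you want to watch the same thing happen on your own cluster, something along these lines shows the state (the log location assumes the Ubuntu 12.04 syslog layout this cluster used):

    # one-shot Pacemaker status; failed actions on the stonith resources show up here
    crm_mon -1

    # cman's view of the fence domain, including any node it still intends to fence
    fence_tool ls

    # the play-by-play
    grep -iE 'stonith|fence' /var/log/syslog | tail -n 50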

Desperation sets in.  You have to fsck the damn things.  You've rebooted a dozen times.  Carefully brought everything back up, and still it sits there, mocking you, not scanning a goddamn thing.  What did I do?  I installed a null stonith device (stonith:null) in pacemaker, and gave it the dead node in the hostlist.  On the next round of fencing attempts that cman made, pacemaker failed with the original stonith device and, as expected, succeeded with the null one.  Suddenly cman was happy, and the world rejoiced, and the file system scans flew forth with verbose triumph.  Moments later everything was mounted and life continued as though nothing bad ever happened.
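
For the record, the workaround boiled down to something like this in the crm shell (the resource name and hostname are placeholders; stonith:null is a dummy agent that does nothing and always reports success):

    # a fencing device that "succeeds" without doing anything,
    # limited to the node that is already well and truly dead
    crm configure primitive st-null stonith:null \
        params hostlist="dead-node"

It is exactly as dangerous as it sounds: the cluster now believes that node can be fenced on demand, so only do this when you are certain the machine is genuinely off, and take it back out once the real stonith device works again.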

Now I have to figure out why my stonith device failed in the first place.  That pisses me off.

2 comments:

  1. So why did your STONITH device fail?

  2. Wow, that was quite a long time ago now (over a year), and, as it turned out, I've never gotten rid of the stonith-dummy device. But then again, I'm also down to a single node on that cluster. So much for my high-availability setup! Still, there has been positive progress with other (Pacemaker) cluster setups. ;-)

    There are a few things it might have been - forgive me as I try to remember them. One possibility: since this was a two-node cluster, CMAN may have been refusing to let any other cluster actions take place because it couldn't achieve quorum. Since the secondary node was pretty much "permanently" disabled, obtaining quorum wasn't going to happen. If CMAN was waiting for quorum before performing a STONITH, then the STONITH was doomed to fail. If I'd had a third node in the mix, perhaps that would have prevented this issue.

    Alternatively, it may have been a configuration issue, in that perhaps the wrong daemon was managing stonith. CMAN+OCFS2 have to play together: OCFS2 has to be told that a STONITH happened, so CMAN should have been managing STONITH. If Pacemaker was managing it, but CMAN wasn't privy to that, bad things could have happened; but if I remember correctly, Pacemaker is supposed to work well with CMAN+STONITH.

    The first possibility, I think, remains the most likely one. It makes sense: if this were a five-node cluster and a link failure severed two of the nodes from the rest, you wouldn't want those two nodes to STONITH the other three. Quorum requires more than 50% of the nodes to be present, and since a lone survivor in a two-node cluster is only ever 50%, it can never achieve quorum on its own (unless you "ignore" this with various CMAN/Pacemaker options). I know Pacemaker handles two-node clusters and the STONITH problem fairly effectively, but I'm not sure about CMAN, at least on Ubuntu 12.04. Getting the clusters to work at all was, as my blog probably shows, challenging. The two-node case in CMAN required some features that I think I had to really dig for (see the cluster.conf sketch after this comment). There really should have been a third node.

    Really, the important take-away from all this is: TEST your cluster thoroughly, especially the failure cases! Attempting to recover a single node from a multi-node cluster setup is a great test case, and one that I would never have thought of until I was forced to go through it on production hardware.

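For anyone fighting the same two-node CMAN setup, the knobs alluded to in that comment live in /etc/cluster/cluster.conf. A minimal sketch (cluster name and node names are placeholders) looks roughly like this: two_node/expected_votes covers the two-node quorum special case, and the fence_pcmk device hands fencing off to Pacemaker so CMAN and Pacemaker agree on who performs the STONITH.

    <?xml version="1.0"?>
    <cluster config_version="1" name="mycluster">
      <!-- two-node special case: the cluster stays quorate with a single vote -->
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
        <clusternode name="node1" nodeid="1">
          <fence>
            <!-- redirect CMAN fencing requests to Pacemaker's stonith -->
            <method name="pcmk-redirect">
              <device name="pcmk" port="node1"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node2" nodeid="2">
          <fence>
            <method name="pcmk-redirect">
              <device name="pcmk" port="node2"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice name="pcmk" agent="fence_pcmk"/>
      </fencedevices>
    </cluster>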