20120513

Cascading Node Death


This might have something to do with the lack of actual STONITH devices in my configuration.  Well, it's not a "live" cluster yet, though I find the lack of stability disturbing.

It started yesterday, and culminated with a drive into the office to reboot the downed machine.  While I was there I rearranged some things, got everything back on gigabit (there was a 100 megabit switch in the mix), and added one PCI gigabit card that is only capable of about 660 Mbit maximum aggregate throughput.  I need to go shopping.

Anyway, yesterday in the early afternoon, I was finally bringing some VMs online on v5.  The node had taken on a single VM instance without issue, so I decided to try migrating two others.  Then the first crack in the foundation appeared.  The node, for whatever reason, went completely dead.  The other two nodes, d1 and d2, appeared to stay up.
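For the record, the migration itself was nothing exotic.  Assuming the VMs are ordinary Pacemaker primitives managed through the crm shell (the resource names below are made up for illustration), it amounted to something like this:

    # Ask Pacemaker to move two VM resources onto v5.
    # Resource names are hypothetical.
    crm resource migrate vm-rnd-01 v5
    crm resource migrate vm-rnd-02 v5

(A "migrate" just places a location constraint, so it's worth remembering to run "crm resource unmigrate" later so the constraint doesn't stick around.)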

After the reboot and minor reconfiguration, I brought the three VMs up very slowly.  Everything seemed to go OK.  That was at about 1 am last night.  This morning, I came to my workstation to find a report that v5 was again dead.  I suspect a total kernel panic, but unfortunately without a screen attached I'll have to find out later.  To perform at least some manner of "fencing," I popped a couple of rules into iptables to basically drop any and all traffic from v5.  This would theoretically be the same as pulling the power plug, unless there was communication below the IP layer.
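For anyone curious, the "fencing" rules, run on the surviving nodes, were nothing more elaborate than this (the address is hypothetical; substitute v5's actual IP):

    # Drop everything arriving from or destined to v5.
    # Anything talking below the IP layer would slip past this.
    iptables -I INPUT -s 192.168.1.15 -j DROP
    iptables -I OUTPUT -d 192.168.1.15 -j DROP

Crude, but as long as the cluster traffic is all IP, it should look to the other nodes as though v5's plug was pulled.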

Then I thought, "Hmm...maybe I can at least get things ready for some later tests."  I had been using semi-production VMs to date, mainly internal R&D stuff that isn't of much consequence.  After v5 died, I brought them back online on their original hosts.  I had set up several "sandbox" VMs on another server, and since those are most definitely NOT going to be missed by anyone, I thought I'd load them onto my iSCSI-shared storage via d2.  So around 15:00, I started an rsync to copy the VM images over.  They were cruising at about 20MB/sec.
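The copy itself was a plain rsync over SSH.  The paths here are made up, but it was along these lines:

    # Copy the sandbox VM images onto the iSCSI-backed
    # storage mounted on d2 (hypothetical paths).
    rsync -av --progress /var/lib/vms/sandbox/ \
        d2:/mnt/iscsi/vms/sandbox/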

But they never made it.

I came back to my desk to discover that d2 had died.  Meanwhile, d1 was having a bit of trouble bringing the resources back online, so I suspected this was a case of The STONITH That Wasn't - I had read replies to other users on the Pacemaker mailing list suggesting that not having working STONITH can cause a hang-up (well, that's what I gathered, though it may not have been what they really said).  d2 was inaccessible, and after doing some resource cleanup I managed to get d1 to bring all cluster resources back to life.
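The "resource cleanup" was just the standard crm incantation to clear the failed state so Pacemaker would try again (resource name hypothetical):

    # Wipe the fail counts and error state for a resource,
    # then take a one-shot look at cluster status.
    crm resource cleanup vm-rnd-01
    crm_mon -1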

And then d1 died.

No explanation for it yet, but when I get to the office tomorrow I'm gonna beat the thing with a hammer until it tells me what the issue is.

Whilst I type this, however, one thing comes to mind.  I now remember a little section of the OCFS2 installation guide that mentions a mandatory setting to basically force the kernel to reboot in the case of a particular kind of hang.  Come to think of it, I had completely forgotten it until now, and will have to see if that helps.  Naturally, that sort of thing does not seem very desirable on a virtualization host node.  I may have to rethink which file system I want to use for the shared storage.
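If memory serves, the guide is talking about the kernel panic sysctls, so that a node that oopses or hangs self-fences by rebooting rather than sitting there wedged.  I'm going from memory here, so treat this as a sketch rather than gospel:

    # As I recall, the OCFS2 guide recommends roughly these
    # settings, typically persisted in /etc/sysctl.conf:
    sysctl -w kernel.panic_on_oops=1   # panic (not hang) on a kernel oops
    sysctl -w kernel.panic=30          # reboot 30 seconds after a panic

The upshot, of course, is exactly the behavior I've been fighting: the node reboots itself, taking every VM on it down in the process.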
