20120510

Cluster Building - the Fallback!

Well, after a very trying couple of days, I've settled on a configuration that I think should work.  Some extensive testing still needs to be done.  It's late, so this will be rather brief.

For starters, I dropped back to Ubuntu 11.10.  It's a nice intermediate step between 11.04 (with its broken iscsitarget-dkms build) and 12.04 (with its endless CMAN frustration).  Basically, on 12.04 with CMAN running, putting a node in standby was a death sentence for that node.  It seems to be related to the DLM, but I haven't tested enough to confirm that the DLM alone is to blame.  I couldn't find anything in the forums to help, and I don't feel like registering for accounts just to report what may well be my own stupidity, so the fallback position was a good compromise.

Why not just go without CMAN in 12.04?  I couldn't find the dlm-pcmk package!  It's gone...possibly integrated into something else, but for lack of time and/or patience, I did not find it.  It might be there, well hidden.

I watched a great video on YouTube today from the three guys behind the majority of this tech: "High Availability Sprint: from the brink of disaster to the Zen of Pacemaker".  Really cool stuff, watch a cluster get built before your eyes!

After further trial and error, today I finally managed to build and mount an HA iSCSI file store!  What's better?  On my two-node cluster, I successfully tested transparent fail-over during catastrophic node failure, while writing to the store.  Using wget, I pulled down an Ubuntu ISO (I know, I know...but they're easy to find) and then hammered the cluster a bit.  Eventually things got kinda hairy and funky - maybe some 11.10 goodness to be fixed in 12.04?  But for the most part, things ran great, and I was pretty brutal with the ups-and-downs of the resources and nodes.  Chances are, Corosync just had a rough time catching up.

I did notice something strange: Pacemaker seemed to think nodes were back online even though Corosync was the only thing running on the recovered node.

A few words of caution:

  • If your resource isn't starting and you have constraints (colocations, orders, etc.), try lowering their scores or removing them entirely.
  • Remember that you have to enable resources explicitly on an asymmetric cluster (symmetric-cluster="false" in the cluster options): a resource runs nowhere until a location constraint allows it on a node.
  • Groups are a handy way to lump things together for location statements (where applicable).
  • Use the ( ) syntax in ordering to make semi-explicit order events: resources inside the parentheses are not ordered relative to each other.
  • When using iscsitarget stuff, pick an implementation: iscsitarget or tgt - do NOT install both!
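
To make those cautions a bit more concrete, here is a rough sketch in crm shell syntax of how they might fit together.  This is not my actual configuration: every resource name (p_target, p_lun, p_ip, g_iscsi, p_a, p_b, p_c), the node names, the IQN, the IP address, and the device path are made up for illustration, and the scores are arbitrary.  It assumes the iscsitarget (IET) implementation; adjust to taste.

```
# Asymmetric (opt-in) cluster: nothing runs anywhere unless a
# location constraint with a positive score allows it.
property symmetric-cluster="false"

# Hypothetical iSCSI target, LUN, and service IP (all values are placeholders).
primitive p_target ocf:heartbeat:iSCSITarget \
    params implementation="iet" iqn="iqn.2012-05.com.example:store" \
    op monitor interval="10s"
primitive p_lun ocf:heartbeat:iSCSILogicalUnit \
    params implementation="iet" target_iqn="iqn.2012-05.com.example:store" \
        lun="1" path="/dev/vg0/store" \
    op monitor interval="10s"
primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.100" cidr_netmask="24" \
    op monitor interval="10s"

# A group keeps its members together and starts them in the listed order,
# so one location statement covers all three.
group g_iscsi p_target p_lun p_ip

# Explicitly enable the group on each node; node1 is preferred.
location l_iscsi_node1 g_iscsi 200: node1
location l_iscsi_node2 g_iscsi 100: node2

# ( ) syntax: p_a starts first, then p_b and p_c in no particular order.
# p_a, p_b, and p_c are placeholder resources, not defined above.
order o_example inf: p_a ( p_b p_c )
```

If a resource refuses to start with something like this loaded, the first caution applies: drop the order score from inf: to something lower, or pull the constraint out entirely, and see whether the resource comes up.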
