BURNING MIDNIGHT m.at.work: Cluster Building

Well, after a very trying couple of days, I've settled on a configuration that I think should work. Some extensive testing still needs to be done. It's late, so this will be rather brief.

For starters, I dropped back to Ubuntu 11.10. It's a nice intermediate step between 11.04 (with its broken iscsitarget-dkms build) and 12.04 (with its endless amounts of CMAN frustration). Basically, on 12.04, with CMAN running, putting a node in standby was a sentence of death for that particular node. It seems to be related to the DLM, but I haven't done much testing beyond that to verify it's that alone. I couldn't find anything in the forums to help, and don't feel like registering for accounts just to report what may possibly be my own stupidity, so the fallback position was a good compromise.

Why not just go without CMAN in 12.04? I couldn't find the dlm-pcmk package! It's gone...possibly integrated into something else, but for lack of time and/or patience, I did not find it. It might be there, well hidden.

I watched a great video today from the three guys behind the majority of this tech: "High Availability Sprint: from the brink of disaster to the Zen of Pacemaker" (YouTube). Really cool stuff, watch a cluster get built before your eyes!

After further trial and error, today I finally managed to build and mount an HA iSCSI file store! What's better? On my two-node cluster, I successfully tested transparent fail-over during catastrophic node failure, while writing to the store. Using wget, I pulled down an Ubuntu ISO (I know, I know...but they're easy to find) and then hammered the cluster a bit. Eventually things got kinda hairy and funky - maybe some 11.10 goodness to be fixed in 12.04? But for the most part, things ran great. And I was pretty brutal with the ups-and-downs of the resources and nodes. Chances are, Corosync just had a rough time catching up.

I did notice something strange: Pacemaker seemed to think nodes were back online even though Corosync was the only thing running on the recovered node.

A few words of caution:

If your resource isn't starting, and you have constraints (like colocations, orders, etc.), try lowering their scores or removing them entirely.
Remember that you have to enable resources explicitly on an asymmetric cluster (symmetric-cluster="false" in the cluster options): nothing runs on a node until a location statement allows it there.
Groups are handy ways to lump things together for location statements (where applicable).
Use the ( ) syntax in order constraints to make semi-explicit ordering: the resources inside the parentheses are ordered relative to everything outside them, but not relative to each other.
When setting up an iSCSI target, pick one implementation: iscsitarget or tgt - do NOT install both!
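
To make those cautions a bit more concrete, here is a minimal sketch in crm shell syntax (the kind of thing you would paste into "crm configure"). It is not my actual configuration: the node names node1 and node2, all the resource and constraint IDs, and the scores are made up for illustration, and the resources are just placeholders using the ocf:pacemaker:Dummy agent instead of real iSCSI resources.

```
# Asymmetric (opt-in) cluster: nothing runs anywhere unless a
# location statement allows it.
property symmetric-cluster="false"

# Placeholder resources (hypothetical names, Dummy agent).
primitive p_a ocf:pacemaker:Dummy
primitive p_b ocf:pacemaker:Dummy
primitive p_c ocf:pacemaker:Dummy
primitive p_d ocf:pacemaker:Dummy

# A group lets one location statement per node cover both members.
group g_store p_a p_b

# Explicitly enable everything on both nodes, preferring node1.
location l_store_node1 g_store 100: node1
location l_store_node2 g_store 50: node2
location l_c_node1 p_c 100: node1
location l_c_node2 p_c 50: node2
location l_d_node1 p_d 100: node1
location l_d_node2 p_d 50: node2

# A finite score makes the colocation a preference rather than a
# hard requirement (compare with "inf:").
colocation c_c_with_store 100: p_c g_store

# g_store starts first; p_c and p_d both wait for it, but are not
# ordered relative to each other.
order o_store_first inf: g_store ( p_c p_d )
```

If a resource refuses to start with something like this loaded, the first things I would check are a missing location statement for it (easy to forget on an asymmetric cluster) and any "inf:" constraint that cannot be satisfied.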

20120510

Cluster Building - the Fallback!
