20130815

AoE, DRBD...coping strategies

This is a stream of consciousness...

DRBD - Upgrade Paths

When dealing with pre-compiled packages, upgrading from source can be hazardous.  Make sure of the following things:
  • DO make sure you have the correct kernel for the version of DRBD you want to use.
    • The 8.3 series won't compile against kernel 3.10, and possibly others.  It does compile against 3.2.  It appears that changes to procfs have made some of the relevant code in 8.3 out-of-date.
    • The 8.4 series will compile against kernel 3.2.  It is the version that comes with 3.10, so it must also build successfully there.
  • DO make sure you build the tools AND the module.
  • DO configure the tools to use the CORRECT paths.
    • These will depend on the distro and the original configure args used.  Deduce or look into the package sources for their build scripts.
    • Ubuntu has drbdadm in /sbin, the helper scripts in /usr/lib, the configs in /etc, and the state in /var/lib.
    • If you do not have the correct paths, things will break in unexpected ways and you might have a resource that continually tries to connect and then fails with some strange error (such as a protocol error or a mysterious "peer closed connection" on both sides).
  • DO be careful when installing the 8.3 kernel module under Ubuntu 12.04: the install step doesn't seem to copy it to where it needs to go.  Hand-copy it if necessary (and it does seem necessary).
  • DO shut down Pacemaker on the upgrade node before upgrading.  Reboots may be necessary.  Module unloading/reloading is the minimum required.
  • DO reconnect the upgraded DRBD devices to their peers BEFORE bringing Pacemaker back into the mix.  If there is something amiss with your upgrade, you'd rather it simply fail on the upgrade node than have that node and its cluster friends start fencing one another.  Pacemaker won't care if a slave connects as long as it doesn't change the status quo (i.e. no auto master-seizure).  If everything is gold, you should be able to either start Pacemaker as is (it should see the resources and just go with it) or shut down DRBD and let Pacemaker bring it back up when it starts.
  • It's probably a better idea to upgrade DRBD before upgrading the kernel, so that you are only changing one major variable at a time.  In upgrading the kernel from 3.2 to 3.10, I ran into situations where things were subtly broken and the good node was getting fenced by the upgrading node for no good reason.
  • I have found, thus far, that wiping the metadata in the upgrade to 8.4 was not necessary, but it has been noted as a solution in certain circumstances.  This requires a full resync when done.
  • 8.3 and 8.4 WILL communicate with one another if you've done everything right.  If you haven't, they won't, but they will sorta seem like they should...blame yourself and go back through your DRBD install and double-check everything.
  • 8.4 WILL read 8.3's configuration scripts.  Update them when things are stable to the 8.4 syntax.
  • Pulling the source from git is a fun and easy way to obtain it.  Plus you can have the latest and greatest or any of the previously tagged releases.
  • And finally, WRITE DOWN the configure string you used on the upgrade node.  You'll want to replicate it exactly on the other node, especially if you pulled the source from git.
    • Even an rsync-copy doesn't guarantee that the source won't want to rebuild.  Plus if you end up switching to a newer or older revision, stuffing the configure command line into a little shell script makes rebuilding less error-prone.
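
For what it's worth, the "little shell script" could be as simple as the sketch below.  The flag values are assumptions for an Ubuntu-style layout (configs in /etc, state under /var), and --with-utils plus --with-km are there to cover the "tools AND the module" rule.  Check ./configure --help in your own tree and the distro package's build rules before trusting any of them.  By default it only prints the command; pass "build" to actually run it from the top of the source tree.

```shell
#!/bin/sh
# rebuild-drbd.sh: keep the exact configure line in one place so both
# nodes get built the same way.
# ASSUMPTION: every flag value below is a guess at an Ubuntu-style
# layout; adjust to match what your distro's package actually uses.

CONFIGURE_ARGS="--prefix=/usr --sysconfdir=/etc --localstatedir=/var --with-utils --with-km"

# Print the command that would be run, so it can be logged or compared
# between nodes.
drbd_configure_cmd() {
    echo "./configure $CONFIGURE_ARGS"
}

# Configure and build from the top of the DRBD source tree.
drbd_rebuild() {
    # $CONFIGURE_ARGS is deliberately unquoted so it splits into words.
    ./configure $CONFIGURE_ARGS && make
}

# Only build when asked to; the default is just to show the command.
if [ "${1:-}" = "build" ]; then
    drbd_rebuild
else
    drbd_configure_cmd
fi
```

Running it with no arguments on both nodes and diffing the output is a cheap way to confirm the two builds were configured identically.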

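The Pacemaker ordering from the list above, as a dry-run sketch.  It assumes the node being upgraded is already the DRBD secondary, that Pacemaker is managed through the "service" command, and that the new module is already installed; none of that comes from a tested setup.  It prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Upgrade order for ONE node at a time.
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 (and run as root) to really execute them.
# ASSUMPTION: the service name "pacemaker" and the use of "all" as the
# resource name fit your setup.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_node() {
    run service pacemaker stop || return 1   # take this node out of the cluster first
    run drbdadm down all                     # may already be down; failure tolerated
    run modprobe -r drbd || return 1         # unload the old module...
    run modprobe drbd || return 1            # ...and load the freshly built one
    run cat /proc/drbd                       # check which version is now loaded
    run drbdadm up all || return 1           # reconnect to the peer, no promotion
    run drbdadm cstate all || return 1       # inspect the connection state by eye
    run service pacemaker start              # only now let Pacemaker back in
}

upgrade_node
```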
AoE

This site is useful: http://support.coraid.com/support/linux/
To build in Ubuntu:
make install INSTDIR=/lib/modules/`uname -r`/kernel/drivers/block/aoe

Oh, where do I begin?  Things I have issue with and would like to do something about:
  • Absolutely no security at all in this protocol.  Security via hiding the data on a dedicated switch is not an answer, especially when you don't have said dedicated switch to use.  VLANs are a joke.  Routability be damned, this protocol is more than vulnerable to any number of local-area attacks, and those are just as likely to come from a compromised node as attacks on a routable protocol are to come over the Internet.
    • I'd like to see the header or payload enhanced with a cryptographic sequence, at the very least.
      • The sequence could be a pair of 32-bit numbers, representing (1) the number that is the expected sequence number, and (2) the number that will be the next sequence number.
      • Providing these two numbers means that the number source can be anything, including random generation, making sequence prediction difficult (no predictable plaintext).
      • This could, at the very least, provide defense from replay attacks and give the target and the initiator something to identify each other with.
      • Extensions to this could allow for more robust initiator security, whereby a shared secret is used to guarantee a target/initiator is who they say they are, in lieu of MAC filtering (which is pointless in the modern world of spoofable MACs).
  • vblade is, from the looks of things, "as good as it gets."  No other projects appear to be among the living.  Most stuff died back in '07 and '10 (if you're lucky).  Even vblade doesn't get much love from the looks of it.
    • Let's get at least a version number in there somewhere, so that I don't feel like an idiot for typing "vblade --version."
    • Figure out why the "announce" that vblade does falls on deaf ears.  If the initiator transmits a packet, and the recipient has died and "failed over" to another node, why does the initiator not care about this?  (update: evidently it does not fall on deaf ears, but it comes close.  In certain circumstances it works, others it doesn't.)
    • aoe-stat could dump out more info, like the MAC it was expecting to connect to. (update: This info was hidden away in a sysfs file called "debug")
    • The driver doesn't appear to try hard when it comes to reconnecting to a perceived-failed device.  The 12-page super-spec doesn't give any real guidance on how the initiator should behave, or how a target should respond, to any situation other than one in which everything works.  (How wonderfully optimistic!)  (update: OK so maybe I didn't read the whole spec line-for-line...)
  • Driver error detection and recovery appears to be nonexistent.  Again, very optimistic.  Plus, with two vblades running on two separate servers, only one server's MAC is being seen.  Why is this?!  Oh yeah, and that page about revalidating the device by removing and re-probing the driver?  Not gonna happen while the damn device is mounted somewhere.  PUNT!  (update: the mechanisms are evidently far more complex than I originally believed.  I must now examine the code carefully to understand them.  I have already encountered one test case that failed spectacularly.)
  • aoe-discover does dick when it thinks it knows where its devices are.  aoe-flush -a does nothing useful.  I hate that I have to look through their scripts just to find command line options.  Anyway, you CAN perform an aoe-flush e#.# and get it to forget about individual devices.  Then do the aoe-discover and things will work...if aoe-flush ever successfully returns.
  • If you aoe-flush a device that has moved to a new home, your mount is screwed.  Even if it hasn't moved to a new home, you're screwed.  Once the device is borked, the mount is lost.  This makes changing the MAC of the target a requirement if you want to failover to a secondary node.  (update: aoe-flush is not what we need.  The issue is deeper than that.)
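
The flush-then-discover dance described above, as a small sketch.  e0.1 is a placeholder device name, and the script only prints what it would do unless DRY_RUN=0 is set.  Per the caveats above, do not expect this to rescue a device that is currently mounted.

```shell
#!/bin/sh
# Forget one stale AoE device and look for it again.
# Dry-run by default (prints the commands); set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rediscover DEVICE, where DEVICE looks like e0.1 (placeholder name).
rediscover() {
    run aoe-flush "$1" || return 1   # make the driver forget this one device
    run aoe-discover || return 1     # ask the network for targets again
    run aoe-stat                     # list what the driver sees now
}

rediscover "${1:-e0.1}"
```

Flushing a single device rather than using -a keeps the blast radius small if aoe-flush hangs.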

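Returning to the sequence-number wish from the security bullet: here is a toy model of the bookkeeping, nothing more.  Each frame carries the number the receiver currently expects plus the number it should expect next.  The function names and the idea of an initial handshake are assumptions made up for illustration, and without encryption or a keyed checksum anyone sniffing the wire still sees the next number, so this shows replay rejection only.

```shell
#!/bin/sh
# Toy model: every frame carries (cur, nxt).  The receiver accepts a
# frame only if cur matches what the previous frame promised, then
# remembers nxt for the following frame.

# Draw a random unsigned 32-bit number from the kernel's entropy pool.
rand32() {
    od -An -N4 -tu4 /dev/urandom | tr -d ' \n'
}

expected=$(rand32)   # ASSUMPTION: agreed on during some earlier handshake

# receive CUR NXT: returns 0 and advances the state if CUR is expected,
# returns 1 (frame dropped, state unchanged) otherwise.
receive() {
    if [ "$1" = "$expected" ]; then
        expected=$2
        return 0
    fi
    return 1
}

# A sender that shares the state builds its frame like this.
cur=$expected
nxt=$(rand32)
while [ "$nxt" = "$cur" ]; do nxt=$(rand32); done   # never reuse the current number

if receive "$cur" "$nxt"; then
    echo "frame 1: accepted"
fi

# Replaying the very same frame fails, because the receiver has moved on.
if receive "$cur" "$nxt"; then
    echo "replay: accepted"
else
    echo "replay: rejected"
fi
```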
At this point, I am strongly considering moving my resources back to iSCSI, and hoping that LIO can handle the load.

UPDATE: I've taken a different tack and thought to ask some useful questions.  Given the answers, there is perchance still hope for using AoE on my cluster.  We shall soon see.  Time, nonetheless, is running out (and I don't mean to imply my patience...I mean time, like REAL TIME).

