20130822

Die-hard iSCSI SAN and Client Implementation Notes

Building out my next SAN and client

The goal here is die-hard data access and integrity.

SAN Access Mechanisms

AoE - a no-go at this time

My testing to date (2013-08-21) has shown that AoE under vblade migrates well, but does not handle a failed node well.  Data corruption generally happens if writes are active, and there are cases I have encountered (especially during periods of heavy load) where the client fails to talk to the surviving node if that node is not already the primary (more below on that).  In other words, if the primary SAN target node fails, the secondary will come up, but the client might not use it (or might use it for a few seconds before things get borked).   I am actively investigating this and other related issues with guidance from the AoE maintainer.  At this time I cannot use it for what I want to use it for.  Pity, it's damn fast.

iSCSI - Server Setup

Ubuntu 12.04 has a 3.2 kernel and sports the LIO target suite.  In initial testing it worked well, though it will be interesting to see how it performs under more realistic loads.  My next test will involve physical machines to exercise iSCSI responsiveness over real hardware and jumbo-frames.

The Pacemaker (Heartbeat) resource agent for iSCSILogicalUnit suffers from a bug in LIO, whereby if the underlying device/target is receiving writes the logical unit cannot be shut down.  This can cause a SAN node to get fenced for failure to shut down the resource when ordered to standby or migrate.  It can be reliably reproduced.  This post details what needs to be done to fix the issue.  These modifications can be applied with this patch fragment:



--- old/iSCSILogicalUnit 2013-08-21 16:13:20.000000000 -0400
+++ new/iSCSILogicalUnit 2013-08-21 16:12:56.000000000 -0400
@@ -365,6 +365,11 @@
   done
   ;;
      lio)
+               # First stop the TPGs for the given device.
+               for TPG in /sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/*/enable; do
+                       echo 0 > "${TPG}"
+               done
+
                if [ -n "${OCF_RESKEY_allowed_initiators}" ]; then
                        for initiator in ${OCF_RESKEY_allowed_initiators}; do
                                ocf_run lio_node --dellunacl=${OCF_RESKEY_target_iqn} 1 \
@@ -373,6 +378,15 @@
                fi
   ocf_run lio_node --dellun=${OCF_RESKEY_target_iqn} 1 ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC
   ocf_run tcm_node --freedev=iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE} || exit $OCF_ERR_GENERIC
+
+               # Now that the LUN is down, reenable the TPGs...
+               # This is a guess, so we're going to have to test with multiple LUNs per target
+               # to make sure we are doing the right thing here.
+               for TPG in /sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/*/enable; do
+                       echo 1 > "${TPG}"
+               done
+
+
  esac
     fi


Basically, go through all TPGs for a given target, disable them, nuke the logical unit, and then re-enable them.  This has only been tested with one LUN.  It may screw things up for multiple LUNs.  Hopefully not, but you have been warned.  If I get around to testing, I'll update this post.  My setups always involve one LUN per target.
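
Assuming the resource agent lives in the usual OCF location (adjust the path for your distro; the patch file name is just whatever you saved the fragment as), applying it is simply:
patch /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit < iSCSILogicalUnit-lio-tpg.patch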

iSCSI - Pacemaker Setup

On the SERVER...  I group the target's virtual IP, iSCSITarget, and iSCSILogicalUnit together for simplicity (and because they can't exist without each other).  LIO requires the IP be up before it will build a portal to it.  
group g_iscsisrv-o1 p_ipaddr-o1 p_iscsitarget-o1 p_iscsilun-o1
Each target gets its own IP.  I'm using ocf:heartbeat:IPaddr2 for the resource agent.  The iSCSITarget primitives each have unique tids.  Other than that, LIO ignores parameters that iet and tgt care about, so configuration is pretty short.  Make sure to use implementation="lio" absolutely everywhere when specifying the iSCSITarget and iSCSILogicalUnit primitives.
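
For concreteness, the server-side primitives look something like this; the IP, IQN, and backing device below are placeholders from my sandbox, not prescriptions:
primitive p_ipaddr-o1 ocf:heartbeat:IPaddr2 \
        params ip="192.168.10.51" cidr_netmask="24" \
        op monitor interval="10s"
primitive p_iscsitarget-o1 ocf:heartbeat:iSCSITarget \
        params implementation="lio" iqn="iqn.2013-08.local.san:o1" tid="1" \
               portals="192.168.10.51:3260" \
        op monitor interval="10s"
primitive p_iscsilun-o1 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="lio" target_iqn="iqn.2013-08.local.san:o1" \
               lun="1" path="/dev/drbd0" \
        op monitor interval="10s"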

On the CLIENT...  The ocf:heartbeat:iscsi resource agent needs this parameter to keep it from dropping the connection to the target under just the wrong conditions:
try_recovery="true"
Without it, a failed node or a migration will occasionally cause the connection to fail completely, which is not what you want when failover without noticeable interruption is your goal.
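
The client-side primitive ends up looking something like this (the portal and IQN are again placeholders):
primitive p_iscsi-o1 ocf:heartbeat:iscsi \
        params portal="192.168.10.51:3260" target="iqn.2013-08.local.san:o1" \
               try_recovery="true" \
        op monitor interval="10s"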

SAN Components

DRBD

Ubuntu 12.04 ships with 8.3.11, but the DRBD git repo has 8.3.15 and 8.4.3.  In the midst of debugging the Pacemaker bug, I migrated to 8.4.3.  It works fine, and appears to be quite stable.  Make sure that you're using the 8.4.3 resource agent, or else things like dual-primary will fail (if everything is installed to standard locations, you should be fine).

Though it's not absolutely necessary, I am running my DRBD resources in dual-primary.  The allow-two-primaries option seems to shave a few seconds off the recovery, since all we have to migrate are the iSCSI target resources.  LIO migrates very quickly, so most of the waiting appears to be cluster-management-related (waiting to make sure the node is really down, making sure it's fenced, etc.).  We could probably make it faster with a little more work.
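
For reference, the dual-primary knob lives in the resource's net section; in 8.4 syntax the relevant bit looks like this (the resource name is illustrative, and the per-host stanzas are left out):
resource r0 {
        net {
                protocol C;
                allow-two-primaries yes;
        }
        # on-host sections (device, disk, address, meta-disk) omitted
}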

Pacemaker, Corosync

Without the need for OCFS2 on the SAN, I built the cluster suite from source using Corosync 2.3.1 and Pacemaker 1.1.10 plus the latest changes from git.  It's very near bleeding-edge, but it's also working very well at the moment.  Building the cluster stack requires a host of other packages.  I will detail the exact build requirements and sequence in another post; I wrote a script that pretty much automates the install.  The important thing is to make sure you don't have any competing libraries/headers in the way, or parts of the build will break.  Luckily it breaks during the build and not during execution.  (libqb, I am looking at YOU!)
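
The short version, for the impatient - the order is what matters (libqb, then Corosync, then Pacemaker); configure flags and the supporting packages are omitted here:
( cd libqb     && ./autogen.sh && ./configure && make && make install )
( cd corosync  && ./autogen.sh && ./configure && make && make install )
( cd pacemaker && ./autogen.sh && ./configure && make && make install )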

ZFS

I did not do any additional experimentation with this on the sandbox cluster, but it is worth noting that in my most recent experience I have shifted to addressing drives by UUID instead of any of the other available device-naming mechanisms.  The problem I ran into (several times) involved the array not loading on reboot, or (worse) the vdevs not appearing after reboot.  Since the vdevs are the underlying devices for DRBD, it's rather imperative that they be present on reboot.  It appears to be a lingering issue in ZoL, though it's less so in recent releases.
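
In practice that means building (or re-importing) the pool with the persistent names under /dev/disk/ rather than the sdX names; the device ids below are obviously placeholders:
zpool create tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B
zpool export tank                       # for a pool originally built on sdX names...
zpool import -d /dev/disk/by-id tank    # ...re-import it using persistent names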

Testing and Results

For testing I created a cluster of four nodes, all virtual, with external/libvirt as the STONITH device.  The nodes, c6 thru c9, were configured thus:
  • c6 and c7 - SAN targets, synced with each other via DRBD
  • c8 - AoE test client
  • c9 - AoE and iSCSI test client
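
Something along these lines is what an external/libvirt fence device looks like in crm (the resource names and hypervisor URI here are my placeholders, not necessarily what I ran):
primitive p_fence-c6 stonith:external/libvirt \
        params hostlist="c6" hypervisor_uri="qemu+ssh://virthost/system" \
        op monitor interval="60s"
location l_fence-c6 p_fence-c6 -inf: c6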

Server/Target

All test batches included migration tests (moving the target resource from one node to another), failover tests (manually fencing a node so that its partner takes over), single-primary tests (migration/failover when one node has primary and the other node has secondary), and dual-primary tests (migration/failover when both nodes are allowed to be primary).

Between tests, the DRBD stores were allowed to fully resync.  During some long-term tests, resync happened while client file system access continued.  
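
The tests themselves were driven with the usual suspects; something like the following, with my sandbox resource and node names:
crm resource migrate g_iscsisrv-o1 c7     # migration test: push the target group to the peer
stonith_admin --fence c6                  # failover test: hard-kill the active node
crm resource unmigrate g_iscsisrv-o1      # clear the migration constraint afterwards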

Client/Initiator

Two operations were tested: dbench, and data transfer with verification.

dbench is fairly cut-and-dried.  It was set to run for upwards of 5000 seconds with 3 clients, while the SAN target nodes (c6 and c7) were subjected to migrations and fencing.
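
The invocation was nothing fancy; roughly this, run from a directory on the SAN-backed mount (the mount point is a placeholder):
cd /mnt/santest && dbench -t 5000 3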

The data transfer and verification tests were more interesting, as they signaled corruption issues.  For the sake of having options, I created three sets of files with dd if=/dev/urandom.  The first set was 400 1-meg files.  The second set was 40 10-meg files.  The last set was 4 100-meg files.  Random data was chosen to ensure that no compression features would interfere with the transfer, and also to provide useful data for verification.  SHA-512 sums were generated for every file.  As the files were done in three batches, three sum files were generated.  For each test, a selected batch of files was copied to the target via either rsync or cp, while migrations/failovers were being performed.  The batch was then checked for corruption by validating against the appropriate sums file.  Between tests, the target's copy of the data was deleted.  Occasionally the target store was reformatted to ensure that the file system was working correctly (especially after failed failover tests).
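
Generating the batches and their checksums is nothing exotic; a sketch of it (file names and mount points are placeholders):
for i in $(seq -w 1 400); do dd if=/dev/urandom of=rand-1m.$i bs=1M count=1; done
for i in $(seq -w 1 40);  do dd if=/dev/urandom of=rand-10m.$i bs=1M count=10; done
for i in $(seq 1 4);      do dd if=/dev/urandom of=rand-100m.$i bs=1M count=100; done
sha512sum rand-1m.* > sums-1m.sha512      # likewise for the other two batches
# after each copy to the target, verify against the appropriate sums file:
cd /mnt/santest && sha512sum -c /root/sums-1m.sha512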

Results - AoE

AoE performed extremely well on transfer rates and migration, but failed verification during failover testing.  This is interesting because it suggests that the mechanism AoE uses to push its writes to disk is buffering somewhere along the way.  vblade is forcibly terminated during migration, yet no corruption occurred throughout those tests.

Failover reliably demonstrated corruption; the fencing of a node practically guaranteed that 2-4 files would fail their SHA-512 sums.  This can be fixed by using the "-s" option, but I find that to be rather unattractive.  Yet it may be the only option.

Another issue: during a failover, the client might fail to communicate with the new target.  Migration didn't seem to suffer from this.  Yet on failover, aoe-debug sometimes reported both aoetgts receiving packets even though one was dead and one was living.  More often than not, aoe would start talking to the remaining node, only to stop a few seconds later and never resume.  I've spent a good deal of time examining the code, but at this time it's a bit too complex to make any inroads.  At best, I've had intermittent success at reproducing the failure case.

One other point of interest regarding AoE: the failover is a bit slow, regrettably.  This appears to be due to a hard-coded 10-second time limit before scorning a target.  I might add a module parameter for this, and/or see about better logic for dealing with suspected-failed targets.

Results - iSCSI

iSCSI performed, well, like iSCSI with regard to transfer rates - slower than AoE.  My biggest fear with iSCSI is resource contention when multiple processes are accessing the store.  Once the major issues involving the resource agent and the initiator were solved, migration worked like a charm.  During failover testing, no corruption was observed and the remaining node picked up the target almost immediately.  I will probably deploy with allow-two-primaries enabled.


20130815

AoE, DRBD...coping strategies

This is a stream of consciousness...

DRBD - Upgrade Paths

When dealing with pre-compiled packages, upgrading from source can be hazardous.  Make sure of the following things:
  • DO make sure you have the correct kernel for the version of DRBD you want to use.
    • The 8.3 series won't compile against kernel 3.10 (and possibly others), though it does compile against 3.2.  It appears that changes to procfs have made some of the relevant code in 8.3 out-of-date.
    • The 8.4 series will compile against kernel 3.2.  It ships in-tree with 3.10, so it necessarily builds there as well.
  • DO make sure you build the tools AND the module.
  • DO configure the tools to use the CORRECT paths.
    • These will depend on the distro and the original configure args used.  Deduce or look into the package sources for their build scripts.
    • Ubuntu has drbdadm in /sbin, the helper scripts in /usr/lib, the configs in /etc, and the state in /var/lib.
    • If you do not have the correct paths, things will break in unexpected ways and you might have a resource that continually tries to connect and then fails with some strange error (such as a protocol error or a mysterious "peer closed connection" on both sides).
  • DO be careful when installing the 8.3 kernel module under Ubuntu 12.04; it doesn't seem to copy to where it needs to go - hand-copy if necessary (and it does seem necessary).
  • DO shut down Pacemaker on the upgrade node before upgrading.  Reboots may be necessary.  Module unloading/reloading is the minimum required.
  • DO reconnect the upgraded DRBD devices to their peers BEFORE bringing Pacemaker back into the mix.  If there is something amiss with your upgrade, you'd rather it simply fail on the upgrade node than have that node and its cluster friends start fencing one another.  Pacemaker won't care if a slave connects as long as it doesn't change the status quo (i.e. no auto master-seizure).  If everything is gold, you should be able to either start Pacemaker as is (it should see the resources and just go with it) or shut down DRBD and let Pacemaker bring it back up when it starts.
  • It's probably a better idea to upgrade DRBD before upgrading the kernel, so that you are only changing one major variable at a time.  In upgrading the kernel from 3.2 to 3.10, I ran into situations where things were subtly broken and the good node was getting fenced by the upgrading node for no good reason.
  • I have found, thus far, that wiping the metadata in the upgrade to 8.4 was not necessary, but it has been noted as a solution in certain circumstances.  This requires a full resync when done.
  • 8.3 and 8.4 WILL communicate with one another if you've done everything right.  If you haven't, they won't, but they will sorta seem like they should...blame yourself and go back through your DRBD install and double-check everything.
  • 8.4 WILL read 8.3's configuration files.  Update them to the 8.4 syntax once things are stable.
  • Pulling the source from GIT is a fun and easy way to obtain it.  Plus you can have the latest and greatest or any of the previously tagged releases.
  • And finally, WRITE DOWN the configure string you used on the upgrade node.  You'll want to replicate it exactly on the other node, especially if you pulled the source from git.
    • Even an rsync-copy doesn't guarantee that the source won't want to rebuild.  Plus if you end up switching to a newer or older revision, stuffing the configure command line into a little shell script makes rebuilding less error-prone.  (A sketch of such a configure line follows this list.)
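
For what it's worth, a configure line along these lines matches the Ubuntu paths noted above - verify it against your own distro's packaging before trusting it:
./configure --prefix=/usr --sbindir=/sbin --sysconfdir=/etc --localstatedir=/var \
            --with-km --with-utils --with-pacemaker --with-heartbeat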

AoE

This site is useful: http://support.coraid.com/support/linux/
To build in Ubuntu:
make install INSTDIR=/lib/modules/`uname -r`/kernel/drivers/block/aoe
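
After installing, you'll likely need to refresh the module index and reload the driver (don't do the reload while an aoe device is mounted, obviously):
depmod -a
modprobe -r aoe && modprobe aoe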

Oh, where do I begin?  Things I take issue with and would like to do something about:
  • Absolutely no security at all in this protocol.  Security via hiding the data on a dedicated switch is not an answer, especially when you don't have said dedicated-switch to use.  VLANs are a joke.  Routability be damned, this protocol is more than vulnerable to any number of local area attacks, which are equally likely from a compromised node as they are for a routable protocol over the Internet.
    • I'd like to see the header or payload enhanced with a cryptographic sequence, at the very least.
      • The sequence could be a pair of 32-bit numbers, representing (1) the number that is the expected sequence number, and (2) the number that will be the next sequence number.
      • Providing these two numbers means that the number source can be anything, including random generation, making sequence prediction difficult (no predictable plaintext).
      • This could, at the very least, provide defense from replay attacks and give the target and the initiator something to identify each other with.
    • Extensions to this could allow for more robust initiator security, whereby a shared secret is used to guarantee a target/initiator is who they say they are, in lieu of MAC filtering (which is pointless in the modern world of spoofable MACs).
  • vblade is, from the looks of things, "as good as it gets."  No other projects appear to be among the living.  Most stuff died back in '07 and '10 (if you're lucky).  Even vblade doesn't get much love from the looks of it.
    • Let's get at least a version number in there somewhere, so that I don't feel like an idiot for typing "vblade --version."
    • Figure out why the "announce" that vblade does falls on deaf ears.  If the initiator transmits a packet, and the recipient has died and "failed over" to another node, why does the initiator not care about this?  (update: evidently it does not fall on deaf ears, but it comes close.  In certain circumstances it works, others it doesn't.)
    • aoe-stat could dump out more info, like the MAC it was expecting to connect to. (update: This info was hidden away in a sysfs file called "debug")
    • The driver doesn't appear to try hard when it comes to reconnecting to a perceived-failed device.  The 12-page super-spec doesn't give any real guidance on how the initiator should behave, or how a target should respond, to any situation other than one in which everything works.  (How wonderfully optimistic!)  (update: OK so maybe I didn't read the whole spec line-for-line...)
  • Driver error detection and recovery appears to be nonexistent.  Again, very optimistic.  Plus, with two vblades running on two separate servers, only one server's MAC is being seen.  Why is this?!  Oh yeah, and that page about revalidating the device by removing and re-probing the driver?  Not gonna happen while the damn device is mounted somewhere.  PUNT!  (update: the mechanisms are evidently far more complex than I originally believed.  I must now examine the code carefully to understand them.  I have already encountered one test case that failed spectacularly.)
  • aoe-discover does dick when it thinks it knows where its devices are.  aoe-flush -a does nothing useful.  I hate that I have to look through their scripts just to find command line options.  Anyway, you CAN perform an aoe-flush e#.# and get it to forget about individual devices.  Then do the aoe-discover and things will work...if aoe-flush ever successfully returns.
  • If you aoe-flush a device that has moved to a new home, your mount is screwed.  Even if it hasn't moved to a new home, you're screwed.  Once the device is borked, the mount is lost.  This makes changing the MAC of the target a requirement if you want to failover to a secondary node.  (update: aoe-flush is not what we need.  The issue is deeper than that.)

At this point, I am strongly considering moving my resources back to iSCSI, and hoping that LIO can handle the load.

UPDATE: I've taken a different tack and thought to ask some useful questions.  Given the answers, there is perchance still hope for using AoE on my cluster.  We shall soon see.  Time, nonetheless, is running out (and I don't mean to imply my patience...I mean time, like REAL TIME).


20130812

The Unconfirmed Horror

In my efforts to perfect my SAN upgrades before putting them into production, I've configured an iSCSI resource in Pacemaker.  The goal, of course, as with AoE, is high availability with perfect data integrity (or at least as good as we can get it).  My AoE test store has, of late, been suffering under the burden of O_SYNC - synchronous file access to its backing store.  This has guaranteed that my writes make it to disk, at the expense of write performance.  It's hard to say how much overall performance is lost to this, but in terms of pure writes, it appears to be pretty significant.

I had hoped that iSCSI would not be so evil with regard to data writing.  I was unpleasantly surprised when the TGT target I configured demonstrated file transfer errors on fencing the active SAN node.  Somehow the failure of the node was enough reason for a significant number of file blocks to get lost.  Not finding a self-evident way to put the target into a "synchronous" mode with regard to its backing store, I switched to LIO.  So far, it seems to be performing better with regard to not losing writes on node failure...which is to say, it's not losing my data.  That's critical.

Re-evaluating AoE in a head-to-head against LIO, here's the skinny.  With AoE running O_SYNC, it's a dead win (at the moment) for LIO in the 400 Meg-of-Files Challenge: 12 seconds versus 2 minutes.  Yet not all is lost!  We can tune AoE on the client side to do a lot more caching of data before blocking on writes.  This assumes we're cool with losing data as long as we're not corrupting the file system in the process (in my prior post, I noted that file system corruption was among the other niceties of running AoE without the -s flag when a primary storage node fails).  That should boost performance at the cost of additional memory.  Right now AoE bursts about 6-10 files before blocking.

There is one other way, thus far, in which iSCSI appears superior to AoE: takeover time.  For iSCSI, on a migration from one node to the other it's nearly instantaneous.  On a node failure, it takes about 5-10 seconds.  AoE?  No such luck.  Even though it's purported to broadcast an updated MAC whenever vblade is started, the client either fails to see it (or doesn't care) or is too busy doing other things.  I think it's the former, as a failure while no writing is happening on the node causes the same 15-20 second delay before any writes can resume.  Why is this?

One thing does irk me as I test things out.  I had a strange node failure on the good node after fencing the non-good node.  It could just be that the DRBD resources were not yet synced, which would prevent them from starting (and cause a node-fencing).  Yet the logs indicate something about dummy resources running that shouldn't be running.

IDK.

All I know is that I want stability, and I want it now.

20130802

Working With AoE

A few weeks ago I did some major rearrangement of the server room, and in the midst did some badly-needed updates on the HA SAN servers.  The servers are responsible for all the virtual-machine data currently in use.  Consequently it's rather important they work right, and work well.

Sadly, one of the servers "died" in the midst of the updates.  Just as well, the cluster had problems with failover not being as transparent as it was supposed to be.  A cluster where putting a node on standby results in that node's immediate death-by-fencing is not a good cluster.

I thought this would be a good time to try out the latest pacemaker and corosync, so I set up some sandbox machines for play.  Of course, good testing is going to include making sure AoE is also performing up-to-snuff.  So far, I've encountered some interesting results.

For starters, I created a DRBD store between two of my test nodes, and shared it out via AoE.  A third node did read/write tests.  To do these tests, I created 400 1-meg files via dd if=/dev/urandom.  I generated SHA-512 sums for them all, to double-check file integrity after each test.  I also created 40 10-meg files, and 4 100-meg files.  I think you can spot a pattern here.  Transfers were done from a known-good source (the local drive of the AoE client) to the AoE store using cp and rsync.  During the transfer, failover events were simulated by intentionally fencing a node, or issuing a resource-migrate command.

Migration of the resource generally worked fine.  No data corruption was observed, and so long as both nodes were live everything appeared to work OK.  Fencing the active node, however, resulted in data corruption unless vblade was started with the "-s" option.  The up-side is that you're guaranteed that writes will have finished before the sender trashes the data.  The down-side is that writes go a LOT slower.  Strangely, -s is never really mentioned in the available high-availability guides for AoE.  I guess that's not really surprising; AoE is like a little black box that no one talks about in any detail.  Must be so simple as to be mindlessly easy...sadly that's a dangerously bad way to think.

Using the -d flag for direct mode is also damaging to performance; I am not sure how well it fares on failover due to SAN-node failure.
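
For clarity, this is the kind of invocation in question; the shelf/slot, interface, and backing device are placeholders from my sandbox:
vblade -s 1 1 eth1 /dev/drbd0     # synchronous (O_SYNC) writes to the backing store
vblade -d 1 1 eth1 /dev/drbd0     # direct mode (O_DIRECT) instead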

What's the Worst that Can Happen?

Caching/Buffering = Speed at the sacrifice of data security.  So how much are we willing to risk?

If a VM host dies, any uncommitted data dies with it.  We could say that data corruption is worse than no data at all.  The file systems commonly used by my VMs include journaling, so file system corruption in the midst of a write should be minimal as long as the writes are at least in order.  Best of all in this bad situation is that no writes can proceed after the host has died because it's, well, dead.

The next most terrible failure would be death of a store node - specifically, the active store node.  AoE looks to be pretty darn close to fire-and-forget, minus (mostly) the forget part.  Judging from the code, it sends back an acknowledgement to the writer once it has pushed the data to the backing store.  That's nice, except where the backing store is buffering up the data (or something in the long chain leading to the backing store, maybe still inside AoE itself).  So, without -s, killing a node outright caused data corruption and, in some cases, file system corruption.  The latter is exceedingly possible because the guest is continuing to write under the assumption that all its writes have succeeded.  As far as AoE is concerned, they have.  Additional writes to a broken journal after the connection is re-established on the surviving node will only yield horror, and quite possibly undetected horror on a production system that may not be checked for weeks or months.

A link failure between AoE client and server would stop the flow of traffic.  Not much evil here.  In fact, it's tantamount to a VM host failure, except the host and its guests are still operating...just in a sort-of pseudo-detached media state (they can't write and don't know why).  Downside here is that the JBD2 process tends to hang "forever" when enough writes are pushed to a device that is inaccessible for sufficient time ("forever" meaning long enough that I have to reboot the host to clear the bottleneck in a timely manner - lots of blocked-process messages appear in the kern.log when this happens, and everything appears to grind to a wonderful halt).  Maybe JBD2 would clear itself after a while, but I've found that the Windows guests are quite sensitive to write failures, more so than Linux guests, though even the Linux guests have trouble surviving when the store gets brutally interrupted for too many seconds.

Now What Do I Do?

The -s option to vblade causes significant latency for the client when testing with dbench.  Whether or not this is actually a show-stopper remains to be seen.  Throughput drops from around 0.58 MB/sec to 0.15 MB/sec.  This is of course with all defaults for the various and appropriate file system buffers that work on the client and the server, and also running everything purely virtual.  Hardware performance should be markedly better.

I was worrying about using AoE and the risk of migrating a VM while it was writing to an AoE-connected shared storage device (via something like GFS or OCFS2).  My concern was that if the VM was migrated from host A to host B, and was in the middle of writing a huge file to disk, the file writes would still be getting completed on host A while the VM came to life on host B.  The question of "what data will I see?" was bothering me.  I then realized the answer must necessarily be in the cluster-aware file system, as it would certainly be the first to know of any disk-writes, even before they were transmitted to the backing store.  There still may be room for worry, though.  Testing some hypotheses will be therapeutic.