20130812

The Unconfirmed Horror

In my efforts to perfect my SAN upgrades before putting them into production, I've configured an iSCSI resource in Pacemaker.  The goal, as with AoE, is high availability with perfect data integrity (or at least as close to it as we can get).  My AoE test store has, of late, been suffering under the burden of O_SYNC - synchronous file access to its backing store.  That guarantees my writes make it to disk, at the expense of write performance.  It's hard to say how much overall performance is lost to this, but in terms of pure writes it appears to be pretty significant.
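
To put a rough number on what O_SYNC costs, here's the kind of quick-and-dirty comparison I mean - a Python sketch that times buffered writes against O_SYNC writes to a scratch file.  The path, block count, and block size are just placeholders, not my actual backing store:

    import os, time

    def timed_writes(path, extra_flags, count=100, size=64 * 1024):
        # Write `count` blocks of `size` bytes with the given extra open() flags
        # and return the elapsed seconds.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags, 0o644)
        block = b'\0' * size
        start = time.time()
        try:
            for _ in range(count):
                os.write(fd, block)
        finally:
            os.close(fd)
        return time.time() - start

    # Buffered writes land in the page cache and return almost immediately;
    # O_SYNC writes block until the data is actually on disk, which is what
    # the backing store has been suffering under.
    print('buffered: %.2fs' % timed_writes('/tmp/sync-test.img', 0))
    print('O_SYNC:   %.2fs' % timed_writes('/tmp/sync-test.img', os.O_SYNC))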

I had hoped that iSCSI would not be so evil with regard to data writing.  I was unpleasantly surprised when the TGT target I configured demonstrated file transfer errors upon fencing the active SAN node.  Somehow the node failure was enough for a significant number of file blocks to get lost.  Not finding a self-evident way to put the target into a "synchronous" mode with regard to its backing store, I switched to LIO.  So far, it seems to be doing better about not losing writes on node failure...which is to say, it's not losing my data.  That's critical.
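
A sketch of the sort of check I'm relying on here (mount point and file counts are made up): write a batch of files with known SHA-1 sums onto the exported LUN, fence the active node, then re-verify everything once the standby has taken over.

    import hashlib, os

    def write_set(mountpoint, count=100, size=1024 * 1024):
        # Write `count` random files and remember their SHA-1 digests.
        sums = {}
        for i in range(count):
            data = os.urandom(size)
            path = os.path.join(mountpoint, 'chk-%04d' % i)
            with open(path, 'wb') as f:
                f.write(data)
            sums[path] = hashlib.sha1(data).hexdigest()
        return sums

    def verify_set(sums):
        # Re-read every file after the failover; return the ones that changed.
        bad = []
        for path, digest in sorted(sums.items()):
            with open(path, 'rb') as f:
                if hashlib.sha1(f.read()).hexdigest() != digest:
                    bad.append(path)
        return bad

    # write_set('/mnt/iscsi-test'), fence the active node, wait for the
    # takeover, then verify_set() should come back empty if nothing was lost.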

Re-evaluating AoE in a head-to-head against LIO, here's the skinny.  With AoE running O_SYNC, it's a dead win (at the moment) for LIO in the 400 Meg-of-Files Challenge: 12 seconds versus 2 minutes.  Yet not all is lost!  We can tune AoE on the client side to do a lot more caching of data before blocking on writes.  This assumes we're cool with losing data as long as we're not corrupting the file system in the process (in my prior post, I noted that file system corruption was among the other niceties of running AoE without the -s flag when a primary storage node fails).  That should boost performance at the cost of additional memory.  Right now AoE bursts about 6-10 files before blocking.
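
The Challenge itself boils down to timing how long it takes to get roughly 400 MB worth of files onto the mounted store.  Something along these lines would approximate it in Python (file size and destination directory are arbitrary, not the actual test set):

    import os, time

    def file_challenge(destdir, total=400 * 1024 * 1024, filesize=4 * 1024 * 1024):
        # Write roughly `total` bytes as individual files, then sync so the
        # clock doesn't stop until the data is actually out of the page cache.
        block = b'x' * filesize
        start = time.time()
        for i in range(total // filesize):
            with open(os.path.join(destdir, 'file-%03d' % i), 'wb') as f:
                f.write(block)
        os.system('sync')
        return time.time() - start

    print('%.1f seconds' % file_challenge('/mnt/san-test'))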

There is one other way, thus far, in which iSCSI appears superior to AoE: takeover time.  For iSCSI, a migration from one node to the other is nearly instantaneous; on a node failure, it takes about 5-10 seconds.  AoE?  No such luck.  Even though vblade purportedly broadcasts an updated MAC whenever it starts, the client either fails to see it (or doesn't care) or is too busy doing other things.  I think it's the former, as a failure while no writing is happening on the node causes the same 15-20 second delay before any writes can resume.  Why is this?
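
Measuring that stall from the initiator side isn't hard.  A crude probe like the one below (scratch path made up) issues a small O_SYNC write every half-second and reports any write that blocks much longer than that, which would make a 15-20 second gap easy to see:

    import os, time

    def watch_stalls(scratch='/mnt/san-test/.probe', interval=0.5):
        # Rewrite the same 512-byte block with O_SYNC at a fixed interval and
        # report any write that blocks noticeably longer than the interval -
        # a rough measure of how long a takeover actually freezes I/O.
        fd = os.open(scratch, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
        block = b'\0' * 512
        try:
            while True:
                start = time.time()
                os.write(fd, block)
                os.lseek(fd, 0, os.SEEK_SET)
                stall = time.time() - start
                if stall > 2 * interval:
                    print('write stalled for %.1f seconds' % stall)
                time.sleep(interval)
        finally:
            os.close(fd)

    watch_stalls()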

One thing does irk me as I test things out.  I had a strange node failure on the good node after fencing the non-good node.  It could just be that the DRBD resources were not yet synced, which would prevent them from starting (and cause a node-fencing).  Yet the logs indicate something about dummy resources running that shouldn't be running.

IDK.

All I know is that I want stability, and I want it now.
