20130802

Working With AoE

A few weeks ago I did some major rearrangement of the server room, and while I was at it, did some badly-needed updates on the HA SAN servers.  The servers are responsible for all the virtual-machine data currently in use, so it's rather important that they work right, and work well.

Sadly, one of the servers "died" in the midst of the updates.  Just as well; the cluster already had problems with failover not being as transparent as it was supposed to be.  A cluster where putting a node on standby results in that node's immediate death-by-fencing is not a good cluster.

I thought this would be a good time to try out the latest pacemaker and corosync, so I set up some sandbox machines for play.  Of course, good testing is going to include making sure AoE is also performing up to snuff.  So far, I've encountered some interesting results.

For starters, I created a DRBD store between two of my test nodes, and shared it out via AoE.  A third node did read/write tests.  To do these tests, I created 400 1-meg files via dd if=/dev/urandom.  I generated SHA-512 sums for them all, to double-check file integrity after each test.  I also created 40 10-meg files, and 4 100-meg files.  I think you can spot a pattern here.  Transfers were done from a known-good source (the local drive of the AoE client) to the AoE store using cp and rsync.  During the transfer, failover events were simulated by intentionally fencing a node, or issuing a resource-migrate command.
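
For the curious, the gist of the setup was something like this (paths, mount point, and file names are illustrative, not the actual layout):

    # generate test files from /dev/urandom in the sizes described above
    for i in $(seq 1 400); do dd if=/dev/urandom of=file-1m.$i   bs=1M count=1   2>/dev/null; done
    for i in $(seq 1 40);  do dd if=/dev/urandom of=file-10m.$i  bs=1M count=10  2>/dev/null; done
    for i in $(seq 1 4);   do dd if=/dev/urandom of=file-100m.$i bs=1M count=100 2>/dev/null; done

    # checksum the known-good copies on the local drive
    sha512sum file-* > /root/checksums.sha512

    # push the files to the AoE-backed mount, then verify after each failover test
    rsync -av file-* /mnt/aoe-test/
    cd /mnt/aoe-test && sha512sum -c /root/checksums.sha512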

Migration of the resource generally worked fine.  No data corruption was observed, and so long as both nodes were live everything appeared to work OK.  Fencing the active node, however, resulted in data corruption unless vblade was started with the "-s" option.  The up-side is that writes are then guaranteed to have reached the backing store before they're acknowledged, so the sender won't discard data that hasn't actually landed.  The down-side is that writes go a LOT slower.  Strangely, -s is never really mentioned in the available high-availability guides for AoE.  I guess that's not really surprising; AoE is like a little black box that no one talks about in any detail.  Must be so simple as to be mindlessly easy... sadly, that's a dangerously bad way to think.

Using -d for direct mode is also damaging to performance; I am not sure how well it does with failover due to SAN-node failure.
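
For reference, the export itself is just a vblade invocation on whichever node currently holds the DRBD primary; roughly like this (shelf/slot numbers, interface, and device name are made up for illustration):

    # export the DRBD device as AoE target e1.1 on eth1;
    # -s opens the backing store with O_SYNC, so a write is flushed before it is acknowledged
    vblade -s 1 1 eth1 /dev/drbd0

    # the direct-I/O variant discussed above (O_DIRECT, bypassing the page cache):
    # vblade -d 1 1 eth1 /dev/drbd0

In a pacemaker-managed setup that invocation would normally live inside a resource agent rather than be run by hand, but the flags are the interesting part here.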

What's the Worst that Can Happen?

Caching/Buffering = Speed at the sacrifice of data security.  So how much are we willing to risk?

If a VM host dies, any uncommitted data dies with it.  We could say that data corruption is worse than no data at all.  The file systems commonly used by my VMs include journaling, so file system corruption in the midst of a write should be minimal as long as the writes are at least in order.  Best of all in this bad situation is that no writes can proceed after the host has died because it's, well, dead.
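
If you want to verify what ordering guarantees a guest is actually getting, the journal setup is easy enough to check from inside it; a quick sketch, assuming ext3/ext4 guests (device name is illustrative):

    # confirm the root file system actually carries a journal
    tune2fs -l /dev/vda1 | grep -i journal

    # the kernel logs the journaling mode at mount time; ordered mode writes data
    # blocks out before committing the metadata that references them
    dmesg | grep -i 'ordered data mode'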

The next most terrible failure would be death of a store node - specifically, the active store node.  AoE looks to be pretty darn close to fire-and-forget, minus (mostly) the forget part.  Judging from the code, it sends back an acknowledgement to the writer once it has pushed the data to the backing store.  That's nice, except where the backing store is buffering up the data (or something in the long chain leading to the backing store is, maybe still inside AoE itself).  So, without -s, killing a node outright caused data corruption and, in some cases, file system corruption.  The latter is entirely possible because the guest keeps writing under the assumption that all its writes have succeeded.  As far as AoE is concerned, they have.  Additional writes to a broken journal after the connection is re-established on the surviving node will only yield horror, and quite possibly undetected horror on a production system that may not be checked for weeks or months.
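
The difference -s makes is essentially the difference between a buffered write and a synchronous one against the backing store, and you can feel that difference with nothing more than dd on the store node (scratch file and sizes are illustrative):

    # buffered: dd returns as soon as the data is sitting in the page cache,
    # long before it is actually safe on disk
    dd if=/dev/zero of=/tmp/scratch.img bs=1M count=100

    # synchronous: each write must reach the disk before dd continues, which is
    # roughly the behaviour vblade -s (O_SYNC) imposes on every AoE write
    dd if=/dev/zero of=/tmp/scratch.img bs=1M count=100 oflag=sync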

A link failure between AoE client and server would stop the flow of traffic.  Not much evil here.  In fact, it's tantamount to a VM host failure, except the host and its guests are still operating...just in a sort-of pseudo-detached media state (they can't write and don't know why).  Downside here is that the JBD2 process tends to hang "forever" when enough writes are pushed to a device that is inaccessible for sufficient time ("forever" meaning long enough that I have to reboot the host to clear the bottleneck in a timely manner - lots of blocked-process messages appear in the kern.log when this happens, and everything appears to grind to a wonderful halt).  Maybe JBD2 would clear itself after a while, but I've found that the Windows guests are quite sensitive to write failures, more so than Linux guests, though even the Linux guests have trouble surviving when the store gets brutally interrupted for too many seconds.
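
Incidentally, those blocked-process messages come from the kernel's hung-task detector, which is easy to watch (and tune) while reproducing this; a quick sketch, assuming a reasonably recent kernel:

    # blocked-task warnings land in the kernel log once a task has been stuck
    # in uninterruptible sleep past the detector's threshold
    dmesg | grep 'blocked for more than'

    # the threshold defaults to 120 seconds and is tunable (0 disables the warning)
    cat /proc/sys/kernel/hung_task_timeout_secs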

Now What Do I Do?

The -s option to vblade causes significant latency for the client when testing with dbench.  Whether or not this is actually a show-stopper remains to be seen.  Throughput drops from around 0.58 MB/sec to 0.15 MB/sec.  This is, of course, with all defaults for the relevant file system buffers on both the client and the server, and with everything running purely virtual.  Hardware performance should be markedly better.
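
For anyone wanting to reproduce the comparison, a dbench run against the AoE-backed mount is a one-liner (mount point and client count are illustrative):

    # run 4 dbench clients against the AoE-backed mount for 60 seconds,
    # once with the target exported plain and once with vblade -s
    dbench -D /mnt/aoe-test -t 60 4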

I was worrying about using AoE and the risk of migrating a VM while it was writing to an AoE-connected shared storage device (via something like GFS or OCFS2).  My concern was that if the VM was migrated from host A to host B, and was in the middle of writing a huge file to disk, the file writes would still be getting completed on host A while the VM came to life on host B.  The question of "what data will I see?" was bothering me.  I then realized the answer must necessarily be in the cluster-aware file system, as it would certainly be the first to know of any disk-writes, even before they were transmitted to the backing store.  There still may be room for worry, though.  Testing some hypotheses will be therapeutic. 
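
One way to test that hypothesis is simply to start a large write inside the guest, live-migrate the guest mid-stream, and verify the result afterwards; a rough sketch, assuming libvirt/KVM hosts and made-up host/guest names:

    # inside the guest: checksum a known-good source file, then start copying it
    # onto the cluster file system in the background
    sha512sum /srv/source/big.img
    cp /srv/source/big.img /mnt/ocfs2/big.img &

    # on host A, while the copy is still running: live-migrate the guest to host B
    virsh migrate --live testvm qemu+ssh://host-b/system

    # inside the guest afterwards: verify the copy landed intact
    sha512sum /mnt/ocfs2/big.img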
