20130822

Die-hard iSCSI SAN and Client Implementation Notes

Building out my next SAN and client

The goal here is die-hard data access and integrity.

SAN Access Mechanisms

AOE - a no-go at this time

My testing to date (2013-08-21) has shown that AoE under vblade migrates well, but does not handle a failed node well.  Data corruption generally happens if writes are active, and there are cases I have encountered (especially during periods of heavy load) where the client fails to talk to the surviving node if that node is not already the primary (more below on that).  In other words, if the primary SAN target node fails, the secondary will come up, but the client might not use it (or might use it for a few seconds before things get borked).   I am actively investigating this and other related issues with guidance from the AoE maintainer.  At this time I cannot use it for what I want to use it for.  Pity, it's damn fast.

iSCSI - Server Setup

Ubuntu 12.04 has a 3.2 kernel and sports the LIO target suite.  In initial testing it worked well, though it will be interesting to see how it performs under more realistic loads.  My next test will involve physical machines to exercise iSCSI responsiveness over real hardware and jumbo-frames.

The Pacemaker (Heartbeat) resource agent for iSCSILogicalUnit suffers from a bug in LIO, whereby if the underlying device/target is receiving writes the logical unit cannot be shut down.  This can cause a SAN node to get fenced for failure to shut down the resource when ordered to standby or migrate.  It can be reliably reproduced.  This post details what needs to be done to fix the issue.  These modifications can be applied with this patch fragment:



--- old/iSCSILogicalUnit 2013-08-21 16:13:20.000000000 -0400
+++ new/iSCSILogicalUnit 2013-08-21 16:12:56.000000000 -0400
@@ -365,6 +365,11 @@
   done
   ;;
      lio)
+               # First stop the TPGs for the given device.
+               for TPG in /sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/*/enable; do
+                       echo 0 > "${TPG}"
+               done
+
                if [ -n "${OCF_RESKEY_allowed_initiators}" ]; then
                        for initiator in ${OCF_RESKEY_allowed_initiators}; do
                                ocf_run lio_node --dellunacl=${OCF_RESKEY_target_iqn} 1 \
@@ -373,6 +378,15 @@
                fi
   ocf_run lio_node --dellun=${OCF_RESKEY_target_iqn} 1 ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC
   ocf_run tcm_node --freedev=iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE} || exit $OCF_ERR_GENERIC
+
+               # Now that the LUN is down, reenable the TPGs...
+               # This is a guess, so we're going to have to test with multiple LUNs per target
+               # to make sure we are doing the right thing here.
+               for TPG in /sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/*/enable; do
+                       echo 1 > "${TPG}"
+               done
+
+
  esac
     fi


Basically, go through all TPGs for a given target, disable them, nuke the logical unit, and then re-enable them.  This has only been tested with one LUN.  It may screw things up for multiple LUNs.  Hopefully not, but you have been warned.  If I get around to testing, I'll update this post.  My setups always involve one LUN per target.

iSCSI - Pacemaker Setup

On the SERVER...  I group the target's virtual IP, iSCSITarget, and iSCSILogicalUnit together for simplicity (and because they can't exist without each other).  LIO requires the IP be up before it will build a portal to it.  
group g_iscsisrv-o1 p_ipaddr-o1 p_iscsitarget-o1 p_iscsilun-o1
Each target gets its own IP.  I'm using ocf:heartbeat:IPaddr2 for the resource agent.  The iSCSITarget primitives each have unique tids.  Other than that, LIO ignores parameters that iet and tgt care about, so configuration is pretty short.  Make sure to use implementation="lio" absolutely everywhere when specifying the iSCSITarget and iSCSILogicalUnit primitives.
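As a sketch, the server side comes out looking something like this in crm shell syntax.  The IP, IQN, and backing device here are placeholders from my sandbox, and the parameter names should be verified against your resource-agent versions before trusting any of it:

```
primitive p_ipaddr-o1 ocf:heartbeat:IPaddr2 \
        params ip="192.168.10.101" cidr_netmask="24"
primitive p_iscsitarget-o1 ocf:heartbeat:iSCSITarget \
        params implementation="lio" iqn="iqn.2013-08.local.san:o1" tid="1"
primitive p_iscsilun-o1 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="lio" target_iqn="iqn.2013-08.local.san:o1" \
        lun="1" path="/dev/drbd0"
group g_iscsisrv-o1 p_ipaddr-o1 p_iscsitarget-o1 p_iscsilun-o1
```

The group ordering matters: the IP comes first so LIO can build its portal, then the target, then the LUN.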

On the CLIENT...  The ocf:heartbeat:iscsi resource agent needs this parameter to not break connections to the target when the conditions are right:
try_recovery="true"
Without it, a failed node or a migration will occasionally cause the connection to fail completely, which is not what you want when failover without noticeable interruption is your goal.
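On the client, the primitive ends up looking something like this.  The portal address and IQN are placeholders from my sandbox; the parameter names come from ocf:heartbeat:iscsi, but double-check them against your agent version:

```
primitive p_iscsiclient-o1 ocf:heartbeat:iscsi \
        params portal="192.168.10.101:3260" \
        target="iqn.2013-08.local.san:o1" \
        try_recovery="true"
```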

SAN Components

DRBD

Ubuntu 12.04 ships with 8.3.11, but the DRBD git repo has 8.3.15 and 8.4.3.  In the midst of debugging a Pacemaker bug, I migrated to 8.4.3.  It works fine, and appears to be quite stable.  Make sure that you're using the 8.4.3 resource agent, or else things like dual-primary will fail (if everything is installed to standard locations, you should be fine).

Though it's not absolutely necessary, I am running my DRBD resources in dual-primary.  The allow-two-primaries option seems to shave a few seconds off recovery, since all we have to migrate are the iSCSI target resources.  LIO migrates very quickly, so most of the waiting appears to be cluster-management-related (waiting to make sure the node is really down, making sure it's fenced, etc).  We could probably make it faster with a little more work.

Pacemaker, Corosync

Without the need for OCFS2 on the SAN, I built the cluster suite from source using Corosync 2.3.1 and Pacemaker 1.1.10 plus the latest changes from git.  It's very near bleeding-edge, but it's also working very well at the moment.  Building the cluster requires a host of other packages.  I will detail the exact build requirements and sequence in another post; I wrote a script that automates the install almost entirely.  The important thing is to make sure you don't have any competing libraries/headers in the way, or parts of the build will break.  Luckily it breaks during the build and not during execution.  (libqb, I am looking at YOU!)

ZFS

I did not do any additional experimentation with this on the sandbox cluster, but it is worth noting that in my most recent experiences I have shifted to using drive UUIDs instead of any other available device addressing mechanism.  The problem I ran into (several times) involved the array not loading on reboot, or (worse) the vdevs not appearing after reboot.  Since the vdevs are the underlying devices for DRBD, it's rather imperative that they be present on reboot.  This appears to be a lingering issue in ZoL, though it's less so in recent releases.

Testing and Results

For testing I created a cluster of four nodes, all virtual, with external/libvirt as the STONITH device.  The nodes, c6 thru c9, were configured thus:
  • c6 and c7 - SAN targets, synced with each other via DRBD
  • c8 - AoE test client
  • c9 - AoE and iSCSI test client

Server/Target

All test batches included migration tests (moving the target resource from one node to another), failover tests (manually fencing a node so that its partner takes over), single-primary tests (migration/failover when one node has primary and the other node has secondary), and dual-primary tests (migration/failover when both nodes are allowed to be primary).

Between tests, the DRBD stores were allowed to fully resync.  During some long-term tests, resync happened while client file system access continued.  

Client/Initiator

Two workloads were tested: dbench, and data transfer with verification.

dbench is fairly cut-and-dried.  It was set to run for upwards of 5000 seconds with 3 clients, while the SAN target nodes (c6 and c7) were subjected to migrations and fencing.

The data transfer and verification tests were more interesting, as they signaled corruption issues.  For the sake of having options, I created three sets of files with dd if=/dev/urandom.  The first set was 400 1-meg files.  The second set was 40 10-meg files.  The last set was 4 100-meg files.  Random data was chosen to ensure that no compression features would interfere with the transfer, and also to provide useful data for verification.  SHA-512 sums were generated for every file.  As the files were done in three batches, three sum files were generated.  For each test, a selected batch of files was copied to the target via either rsync or cp, while migrations/failovers were being performed.  The batch was then checked for corruption by validating against the appropriate sums file.  Between tests, the target's copy of the data was deleted.  Occasionally the target store was reformatted to ensure that the file system was working correctly (especially after failed failover tests).
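Scaled way down, the generate-and-verify procedure looks like this.  The file counts and sizes are shrunk here for illustration (the real batches were 400x1M, 40x10M, and 4x100M), and the paths are temp-dir stand-ins for the local drive and the SAN-backed mount:

```shell
#!/bin/sh
set -e
SRC=$(mktemp -d)   # stands in for the known-good local drive
DST=$(mktemp -d)   # stands in for the SAN-backed target store

# Batch of random files (shrunk to 4 x 64K for illustration).
for i in 1 2 3 4; do
    dd if=/dev/urandom of="$SRC/file$i" bs=1024 count=64 2>/dev/null
done

# SHA-512 sums for the whole batch.
( cd "$SRC" && sha512sum file* > sums.sha512 )

# Copy to the target store; in the real tests, migrations and fencing
# were performed while this step was in flight.
cp "$SRC"/file* "$DST"/

# Validate the target's copy against the sums file.
( cd "$DST" && sha512sum -c "$SRC/sums.sha512" )
```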

Results - AoE

AoE performed extremely well with transfer rates and migration, but failed verification during failover testing.  This is interesting because it suggests the mechanism that AoE uses to push its writes to disk is buffering somewhere along the way.  vblade is forcibly terminated during migration, yet no corruption occurred throughout those tests.

Failover reliably demonstrated corruption; the fencing of a node practically guaranteed that 2-4 files would fail their SHA-512 sums.  This can be fixed by using the "-s" option, but I find that to be rather unattractive.  Yet it may be the only option.

Another issue: during a failover, the client might fail to communicate with the new target.  Migration didn't seem to suffer from this.  Yet on failover, aoe-debug sometimes reported both aoetgts receiving packets even though one was dead and one was living.  More often than not, aoe would start talking to the remaining node, only to stop a few seconds later and never resume.  I've spent a good deal of time examining the code, but at this time it's a bit too complex to make any inroads.  At best, I've had intermittent success at generating the failure case.

One other point of interest regarding AoE: the failover is a bit slow, regrettably.  This appears to be due to a hard-coded 10-second time limit before scorning a target.  I might add a module parameter for this, and/or see about a better logic-flow for dealing with suspected-failed targets.

Results - iSCSI

iSCSI performed, well, like iSCSI with regard to transfer rates - slower than AoE.  My biggest fear with iSCSI is resource contention when multiple processes are accessing the store.  Once the major issues involving the resource agent and the initiator were solved, migration worked like a charm.  During failover testing, no corruption was observed and the remaining node picked up the target almost immediately.  I will probably deploy with allow-two-primaries enabled.









20130815

AoE, DRBD...coping strategies

This is a stream of consciousness...

DRBD - Upgrade Paths

When dealing with pre-compiled packages, upgrading from source can be hazardous.  Make sure of the following things:
  • DO make sure you have the correct kernel for the version of DRBD you want to use.
    • The 8.3 series won't compile in kernel 3.10, and possibly others.  It does compile in 3.2.  It appears that changes to procfs have made some of the relevant code in 8.3 out-of-date.
    • The 8.4 series will compile in kernel 3.2.  It ships in-tree with kernel 3.10, and therefore must also build successfully there.
  • DO make sure you build the tools AND the module.
  • DO configure the tools to use the CORRECT paths.
    • These will depend on the distro and the original configure args used.  Deduce or look into the package sources for their build scripts.
    • Ubuntu has drbdadm in /sbin, the helper scripts in /usr/lib, the configs in /etc, and the state in /var/lib.
    • If you do not have the correct paths, things will break in unexpected ways and you might have a resource that continually tries to connect and then fails with some strange error (such as a protocol error or a mysterious "peer closed connection" on both sides).
  • DO be careful when installing the 8.3 module under Ubuntu 12.04; it doesn't seem to copy to where it needs to go - hand-copy if necessary (and it does seem necessary).
  • DO shut down Pacemaker on the upgrade node before upgrading.  Reboots may be necessary.  Module unloading/reloading is the minimum required.
  • DO reconnect the upgraded DRBD devices to their peers BEFORE bringing Pacemaker back into the mix.  If there is something amiss with your upgrade, you'll rather it simply fail the upgrade node than for that node and its cluster friends to start fencing one another.  Pacemaker won't care if a slave connects as long as it doesn't change the status quo (i.e. no auto master-seizure).  If everything is gold, you should be able to either start Pacemaker as is (it should see the resources and just go with it) or shut down DRBD and let Pacemaker bring it back up when it starts.
  • It's probably a better idea to upgrade DRBD before upgrading the kernel, so that you are only changing one major variable at a time.  In upgrading the kernel from 3.2 to 3.10, I ran into situations where things were subtly broken and the good node was getting fenced by the upgrading node for no good reason.
  • I have found, thus far, that wiping the metadata in the upgrade to 8.4 was not necessary, but it has been noted as a solution in certain circumstances.  This requires a full resync when done.
  • 8.3 and 8.4 WILL communicate with one another if you've done everything right.  If you haven't, they won't, but they will sorta seem like they should...blame yourself and go back through your DRBD install and double-check everything.
  • 8.4 WILL read 8.3's configuration scripts.  Update them when things are stable to the 8.4 syntax.
  • Pulling the source from git is a fun and easy way to obtain it.  Plus you can have the latest and greatest, or any of the previously tagged releases.
  • And finally, WRITE DOWN the configure string you used on the upgrade node.  You'll want to replicate it exactly on the other node, especially if you pulled the source from git.
    • Even an rsync-copy doesn't guarantee that the source won't want to rebuild.  Plus if you end up switching to a newer or older revision, stuffing the configure command line into a little shell script makes rebuilding less error-prone.
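Per that last point, a small wrapper script is enough.  The paths below match Ubuntu 12.04's packaging (deduced per the notes above), and the flags are ones the DRBD 8.4 autotools build accepts, but treat both as assumptions and double-check against your checkout:

```shell
# Record the exact configure invocation in a script so the second node
# can replicate the first node's build exactly.
cat > /tmp/build-drbd.sh <<'EOF'
#!/bin/sh
set -e
./configure --prefix=/usr --localstatedir=/var --sysconfdir=/etc \
    --with-km --with-utils --with-pacemaker --with-heartbeat
make
make install
EOF
chmod +x /tmp/build-drbd.sh
```

Copy the script to the peer node along with (or instead of) the source tree, and both builds land in the same places.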

AoE

This site is useful: http://support.coraid.com/support/linux/
To build in Ubuntu:
make install INSTDIR=/lib/modules/`uname -r`/kernel/drivers/block/aoe

Oh, where do I begin?  Things I have issue with and would like to do something about:
  • Absolutely no security at all in this protocol.  Security via hiding the data on a dedicated switch is not an answer, especially when you don't have said dedicated-switch to use.  VLANs are a joke.  Routability be damned, this protocol is more than vulnerable to any number of local area attacks, which are equally likely from a compromised node as they are for a routable protocol over the Internet.
    • I'd like to see the header or payload enhanced with a cryptographic sequence, at the very least.
      • The sequence could be a pair of 32-bit numbers, representing (1) the number that is the expected sequence number, and (2) the number that will be the next sequence number.
      • Providing these two numbers means that the number source can be anything, including random generation, making sequence prediction difficult (no predictable plaintext).
      • This could, at the very least, provide defense from replay attacks and give the target and the initiator something to identify each other with.
    • Extensions to this could allow for more robust initiator security, whereby a shared secret is used to guarantee a target/initiator is who they say they are, in lieu of MAC filtering (which is pointless in the modern world of spoofable MACs).
  • vblade is, from the looks of things, "as good as it gets."  No other projects appear to be among the living.  Most stuff died back in '07 and '10 (if you're lucky).  Even vblade doesn't get much love from the looks of it.
    • Let's get at least a version number in there somewhere, so that I don't feel like an idiot for typing "vblade --version."
    • Figure out why the "announce" that vblade does falls on deaf ears.  If the initiator transmits a packet, and the recipient has died and "failed over" to another node, why does the initiator not care about this?  (update: evidently it does not fall on deaf ears, but it comes close.  In certain circumstances it works, others it doesn't.)
    • aoe-stat could dump out more info, like the MAC it was expecting to connect to. (update: This info was hidden away in a sysfs file called "debug")
    • The driver doesn't appear to try hard when it comes to reconnecting to a perceived-failed device.  The 12-page super-spec doesn't give any real guidance on how the initiator should behave, or how a target should respond, to any situation other than one in which everything works.  (How wonderfully optimistic!)  (update: OK so maybe I didn't read the whole spec line-for-line...)
  • Driver error detection and recovery appears to be nonexistent.  Again, very optimistic.  Plus, with two vblades running on two separate servers, only one server's MAC is being seen.  Why is this?!  Oh yeah, and that page about revalidating the device by removing and re-probing the driver?  Not gonna happen while the damn device is mounted somewhere.  PUNT!  (update: the mechanisms are evidently far more complex than I originally believed.  I must now examine the code carefully to understand them.  I have already encountered one test case that failed spectacularly.)
  • aoe-discover does dick when it thinks it knows where its devices are.  aoe-flush -a does nothing useful.  I hate that I have to look through their scripts just to find command line options.  Anyway, you CAN perform an aoe-flush e#.# and get it to forget about individual devices.  Then do the aoe-discover and things will work...if aoe-flush ever successfully returns.
  • If you aoe-flush a device that has moved to a new home, your mount is screwed.  Even if it hasn't moved to a new home, you're screwed.  Once the device is borked, the mount is lost.  This makes changing the MAC of the target a requirement if you want to failover to a secondary node.  (update: aoe-flush is not what we need.  The issue is deeper than that.)

At this point, I am strongly considering moving my resources back to iSCSI, and hoping that LIO can handle the load.

UPDATE: I've taken a different tack and thought to ask some useful questions.  Given the answers, there is perchance still hope for using AoE on my cluster.  We shall soon see.  Time, nonetheless, is running out (and I don't mean to imply my patience...I mean time, like REAL TIME).


20130812

The Unconfirmed Horror

In my efforts to perfect my SAN upgrades before putting them into production, I've configured an iSCSI resource in Pacemaker.  The goal, of course and as with AoE, is high availability with perfect data integrity (or at least as good as we can get it).  My AoE test store has, of late, been suffering under the burden of O_SYNC - synchronous file access to its backing store.  This has guaranteed that my writes make it to disk, at the expense of write performance.  It's hard to say how much overall performance is lost to this, but in terms of pure writes, it appears to be pretty significant.

I had hoped that iSCSI would not be so evil with regard to data writing.  I was unpleasantly surprised when the TGT target I configured demonstrated file transfer errors on fencing the active SAN node.  Somehow the failure of the node was enough reason for a significant number of file blocks to get lost.  Not finding a self-evident way to put the target into a "synchronous" mode with regard to its backing store, I switched to LIO.  So far, it seems to be performing better with regard to not losing writes on node failure...which is to say, it's not losing my data.  That's critical.

Re-evaluating AoE in a head-to-head against LIO, here's the skinny.  With AoE running O_SYNC, it's a dead win (at the moment) for LIO in the 400 Meg-of-Files Challenge: 12 seconds versus 2 minutes.  Yet not all is lost!  We can tune AoE on the client side to do a lot more caching of data before blocking on writes.  This assumes we're cool with losing data as long as we're not corrupting the file system in the process (in my prior post, I noted that file system corruption was among the other niceties of running AoE without the -s flag when a primary storage node fails).  That should boost performance at the cost of additional memory.  Right now AoE bursts about 6-10 files before blocking.

There is one other way, thus far, in which iSCSI appears superior to AoE: takeover time.  For iSCSI, a migration from one node to the other is nearly instantaneous.  On a node failure, it takes about 5-10 seconds.  AoE?  No such luck.  Even though it's purported to broadcast an updated MAC whenever vblade is started, the client either fails to see it (or doesn't care) or is too busy doing other things.  I think it's the former, as a failure while no writing is happening on the node causes the same 15-20 second delay before any writes can resume.  Why is this?

One thing does irk me as I test things out.  I had a strange node failure on the good node after fencing the non-good node.  It could just be that the DRBD resources were not yet synced, which would prevent them from starting (and cause a node-fencing).  Yet the logs indicate something about dummy resources running that shouldn't be running.

IDK.

All I know is that I want stability, and I want it now.

20130802

Working With AoE

A few weeks ago I did some major rearrangement of the server room, and in the midst did some badly-needed updates on the HA SAN servers.  The servers are responsible for all the virtual-machine data currently in use.  Consequently it's rather important they work right, and work well.

Sadly, one of the servers "died" in the midst of the updates.  Just as well, the cluster had problems with failover not being as transparent as it was supposed to be.  A cluster where putting a node on standby results in that node's immediate death-by-fencing is not a good cluster.

I thought this would be a good time to try out the latest pacemaker and corosync, so I set up some sandbox machines for play.  Of course, good testing is going to include making sure AoE is also performing up-to-snuff.  So far, I've encountered some interesting results.

For starters, I created a DRBD store between two of my test nodes, and shared it out via AoE.  A third node did read/write tests.  To do these tests, I created 400 1-meg files via dd if=/dev/urandom.  I generated SHA-512 sums for them all, to double-check file integrity after each test.  I also created 40 10-meg files, and 4 100-meg files.  I think you can spot a pattern here.  Transfers were done from a known-good source (the local drive of the AoE client) to the AoE store using cp and rsync.  During the transfer, failover events were simulated by intentionally fencing a node, or issuing a resource-migrate command.

Migration of the resource generally worked fine.  No data corruption was observed, and so long as both nodes were live everything appeared to work OK.  Fencing the active node, however, resulted in data corruption unless vblade was started with the "-s" option.  The up-side is that you're guaranteed that writes will have finished before the sender trashes the data.  The down-side is that writes go a LOT slower.  Strangely, -s is never really mentioned in the available high-availability guides for AoE.  I guess that's not really surprising; AoE is like a little black box that no one talks in any detail about.  Must be so simple as to be mindlessly easy...sadly that's a dangerously bad way to think.

Using -d for direct mode is also damaging to performance; I am not sure how well it does with failover due to SAN-node failure.

What's the Worst that Can Happen?

Caching/Buffering = Speed at the sacrifice of data security.  So how much are we willing to risk?

If a VM host dies, any uncommitted data dies with it.  We could say that data corruption is worse than no data at all.  The file systems commonly used by my VMs include journaling, so file system corruption in the midst of a write should be minimal as long as the writes are at least in order.  Best of all in this bad situation is that no writes can proceed after the host has died because it's, well, dead.

The next most terrible failure would be the death of a store node - specifically, the active store node.  AoE looks to be pretty darn close to fire-and-forget, minus the forget part.  Judging from the code, it sends back an acknowledgement to the writer once it has pushed the data to the backing store.  That's nice, except where the backing store is buffering up the data (or something in the long chain leading to the backing store, maybe still inside AoE itself).  So, without -s, killing a node outright caused data corruption and, in some cases, file system corruption.  The latter is all the more possible because the guest continues to write under the assumption that all its writes have succeeded.  As far as AoE is concerned, they have.  Additional writes to a broken journal after the connection is re-established on the surviving node will only yield horror, and quite possibly undetected horror on a production system that may not be checked for weeks or months.

A link failure between AoE client and server would stop the flow of traffic.  Not much evil here.  In fact, it's tantamount to a VM host failure, except the host and its guests are still operating...just in a sort-of pseudo-detached media state (they can't write and don't know why).  Downside here is that the JBD2 process tends to hang "forever" when enough writes are pushed to a device that is inaccessible for sufficient time ("forever" meaning long enough that I have to reboot the host to clear the bottleneck in a timely manner - lots of blocked-process messages appear in the kern.log when this happens, and everything appears to grind to a wonderful halt).  Maybe JBD2 would clear itself after a while, but I've found that the Windows guests are quite sensitive to write failures, more so than Linux guests, though even the Linux guests have trouble surviving when the store gets brutally interrupted for too many seconds.

Now What Do I Do?

The -s option to vblade causes significant latency for the client when testing with dbench.  Whether or not this is actually a show-stopper remains to be seen.  Throughput drops from around 0.58 MB/sec to 0.15 MB/sec.  This is of course with all defaults for the various and appropriate file system buffers that work on the client and the server, and also running everything purely virtual.  Hardware performance should be markedly better.

I was worrying about using AoE and the risk of migrating a VM while it was writing to an AoE-connected shared storage device (via something like GFS or OCFS2).  My concern was that if the VM was migrated from host A to host B, and was in the middle of writing a huge file to disk, the file writes would still be getting completed on host A while the VM came to life on host B.  The question of "what data will I see?" was bothering me.  I then realized the answer must necessarily be in the cluster-aware file system, as it would certainly be the first to know of any disk-writes, even before they were transmitted to the backing store.  There still may be room for worry, though.  Testing some hypotheses will be therapeutic. 

20130429

AoE, you big tease...

I did some more testing with AoE today.  I'll try to detail here what it does and doesn't appear to be.

Using Multiple Ethernet Ports

The aoe driver you modprobe will give you the option of using multiple ethernet ports, or at the very least selecting which port to use.  I'm not sure what the intended functionality of this feature is, because if your vblade server is not able to communicate across more than one port at a time, you're really not going to find this very useful.  The only way I've been able to see multi-gigabit speeds is to create RR bonds on both the server and the client.  This requires either direct-connect or some VLAN magic on a managed switch, since most switches don't dig RR traffic on their own.

I could see where this feature would work out well if you have multiple segments or multiple servers, and want to spread the load across multiple ports that way.  Otherwise I don't see much usefulness here.

How did I manage RR on my switch?

So, to do this on a managed switch, I created two VLANs for my two bond channels, and assigned one port from each machine to each channel.  Four switch ports, two VLANs, and upwards of 2Gb/sec bandwidth.  This is thus expandable to any number of machines, if you can handle the caveat that should a machine lose one port, it will lose all ability to communicate effectively with the rest of its network over this bond.  This is because the RR scheduler on both sides expects all paths to be connected.  A sending port cannot see that the recipient has left the party if both are attached to a switch (which should always be online).  ARP monitoring might take care of this issue, maybe, but I don't think it will necessarily tell you not to send to a client on a particular channel, and you'll need all your servers ARPing each other all the time.  Sounds nasty.
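For the record, the bonding half of that setup sketches out like this as an Ubuntu /etc/network/interfaces fragment.  Interface names and addresses are placeholders, and each slave is the port assigned to one of the two VLANs on the switch:

```
auto bond0
iface bond0 inet static
    address 10.0.0.6
    netmask 255.255.255.0
    bond-mode balance-rr
    bond-miimon 100
    bond-slaves eth1 eth2
```

The peer machine gets the mirror-image config with its own address; the switch does the rest.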

AoE did handle RR traffic extremely well.  Anyone familiar with RR will note that packet ordering is not guaranteed, and you will most definitely have some of your later packets arriving before some of your earlier ones.  In UDP tests the reordering counts are usually small at low bandwidth; the higher the transmission rate, the more out-of-order delivery.

The Best Possible Speed

To test the effectiveness of AoE, with explicit attention to the E part, I created a ramdrive on the server, seeded it with a 30G file (I have lots of RAM), and then served that up over vblade.  I ran some tests using dbench and dd.  To ensure that no local caching effects skewed the results, I had to set the various /proc/sys/vm/dirty_* fields to zero - specifically, ratio and background_ratio.  Without doing that, you'll see fantastic rates of 900MB/sec, which is a moonshot above any networking gear I have to work with.
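Concretely, these are the two knobs I zeroed.  Writing them requires root, and zeroing them is strictly a benchmarking measure, so record the defaults first and restore them afterward:

```shell
# Record the current values so they can be restored after testing.
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio

# Zero both so writes are pushed out immediately instead of sitting in
# the page cache (run as root; benchmarking only, not for production):
# echo 0 > /proc/sys/vm/dirty_ratio
# echo 0 > /proc/sys/vm/dirty_background_ratio
```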

With a direct connection between my two machines, and RR bonds in place, I could obtain rates of around 130MB/sec.  The same appeared true for my VLAN'd switch.  Average latency was very low.  In dbench, the WriteX call had the highest average latency, at 267ms.  Even flushes ran extremely fast.  That makes me happy, but the compromise is that there is no fault-tolerance beyond what we'd get if a whole switch died - and that is, by the way, assuming you have your connections and VLANs spread across multiple switches.

Without all of that rigging, the next best thing is balance-alb, and then you're back to standard gigabit with the added benefit of fault-tolerance.  As far as AoE natively using multiple interfaces goes, the reality seems to be that this feature either doesn't exist like it's purported to, or it requires additional hardware (read: Coraid cards).  Since vblade itself requires a single interface to bind to, the best hope is a bond, and no bond mode except RR will utilize all available slaves for everything.  That's the blunt truth of it.  As for the aoe module itself, I really don't know what its story is.  Even with the machines directly connected and the server configured with a RR bond, the client machine did not seem to actively make use of the two adapters.

Dealing with Failures

One thing I like about AoE is that it is fairly die-hard.  Even when I forcefully caused a networking fault, the driver recovered once the connectivity returned and things returned to normal.  I guess as long as you don't actively look to kill the connection with an aoe-flush, you should be in a good state no matter what goes wrong.  

That being said, if you're not pushing everything straight to disk and something bad happens on your client, you're looking at some quantity of data now missing from your backing store.  How much will depend on those dirty_* parameters I mentioned earlier.  And catastrophic faults rarely happen predictably.

Of course, setting the dirty_* parameters to something sensible and greater than zero may not be an entirely bad thing.  Allowing some pages to get cached seems to lend itself to significantly better latency and throughput.  How to measure the risk?  Well, informally, I'm watching the network via ethstatus.  The only traffic on the selected adapter is AoE.  As such, it's pretty easy to see when big accesses start and stop.  In my tests against the ramdrive store, traffic started immediately flowing and stopped a few seconds after the dbench test completed.  Using dd without the oflag=direct option left me with a run that finished very quickly, but that did not appear to be actually committed to disk until about 30 seconds later.  Again, kernel parameters should help this.
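You can watch the same effect without ethstatus; a quick sketch using /proc/meminfo, where the dd target is a throwaway file standing in for an AoE-backed mount:

```shell
# Buffered write: dd returns as soon as the pages land in the cache
dd if=/dev/zero of=/tmp/wb-demo.img bs=1M count=8 2>/dev/null
# Dirty/Writeback show how much data is still waiting to reach the
# backing store; re-run the grep and watch the numbers drain
WB=$(grep -E '^(Dirty|Writeback):' /proc/meminfo)
echo "$WB"
rm -f /tmp/wb-demo.img
```

With oflag=direct on the dd, the Dirty counter barely moves because each block is pushed straight through.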

Hitting an actual disk store yielded similar results, with only occasional hiccups.  For the most part the latency numbers stayed around 10ms, with transfer speeds reaching over 1100MB/sec (however dbench calculates this, especially considering that observed network speeds never reached beyond an aggregate 70MB/sec).

Security Options

I'm honestly still not a fan of the lack of security features in AoE.  All the same, I'm still lured by it, and want to now perform a multi-machine test.  Having multiple clients with multiple adapters on balance-alb will mean that I have to reconfigure vblade to MAC-filter for all those MACs, or not use MAC filtering at all.  That might be an option, and to that end perhaps putting it in a VLAN (for segregation's sake, not for bandwidth) wouldn't be so bad.  Of course, that's all really just to keep honest people honest.

Deploying Targets

If we keep the number of targets to a minimum, this should work out OK.  I still don't like that you have to be mindful of your target numbers - deploying identical numbers from multiple machines might spell certain doom.  For instance, I deployed the same device numbers on another machine, and of course there is no way to distinguish between the two.  vblade doesn't even complain that there are identical numbers in use.  Whether or not this will affect targets that are already mounted and in-use, I know not.  The protocol does not seem to concern itself with this edge-case.  As far as I am concerned, the whole thing could be more easily resolved by using a runtime-generated UUID instead of just the device ID numbers.  I guess we'll see how actively this remains developed.
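The collision hazard can be illustrated with a hypothetical pair of exports (paths and interface names are made up; the vblade lines are shown as comments since they'd need real devices):

```shell
# vblade identifies an export only by shelf.slot, e.g. e0.0:
#   host A:  vblade 0 0 eth0 /srv/aoe/vm-a.img    -> clients see e0.0
#   host B:  vblade 0 0 eth0 /srv/aoe/vm-b.img    -> also e0.0, no warning!
# One defensive convention: reserve a distinct shelf number per host.
SHELF_A=0
SHELF_B=1
echo "host A owns shelf e${SHELF_A}.*, host B owns shelf e${SHELF_B}.*"
```

It's purely an administrative convention - nothing in the protocol enforces it - but it at least keeps two boxes from silently answering for the same target.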

Comparison with iSCSI

I haven't done this yet, but plan to very soon.

Further Testing

I'll be doing some multi-machine testing, and also looking into the applicable kernel parameters more closely.  I want to see how AoE responds to the kind of hammering only a full complement of virtual machines can provide.  I also want to make sure my data is safe - outages happen far more than they should, and I want to be prepared.





20130424

AoE - An Initial Look

I was recently investigating ways to make my SAN work faster, harder, better.  Along the way, I looked at some alternatives.

Enter AoE, or ATA-over-Ethernet.  If you're reading this you've probably already read some of the documentation and/or used it a bit.  It's pretty cool, but I'm a little concerned about, well, several things. I read some scathing reviews from what appear to be mud-slingers, and I don't like mud.  I have also read several documents that almost look like copy-and-paste evangelism.  Having given it a shot, I'm going to summarize my immediate impressions of AoE here.

Protocol Compare: AoE is only 12 pages, iSCSI is 257.
That's nice, and simplicity is often one with elegance.  But that doesn't mean it's flawless, and iSCSI has a long history of development.  With such a small protocol you also lose, from what I can tell, a lot of the fine-tuning knobs that might allow for more stable operation under less-ideal conditions.  That being said, with such a small protocol it would hopefully be hard to screw up its implementation, either in software or hardware.

I like that it's its own layer-2 protocol.  It feels foreign, but it's also very, very fast.  I think it would be awesome to overlay options of iSCSI on the lightweight framework of AoE.

Security At Its Finest: As a non-routeable protocol, it is inherently secure.
OK, I'm gonna have to say, WTF?  Seriously?  OK, seriously, let's talk about security.  First, secure from who or what?  It's not secure on a LAN in an office with dozens of other potentially compromised systems.  It's spitting out data over the wire unencrypted, available for any sniffer to snag.

Second, it can be made routeable (I've seen the HOW-TOs), and that's cool, but I've never heard of a router being a sufficient security mechanism.  Use a VLAN, you say?  VLAN-jumping is now an old trick in the big book of exploits.  Keep your AoE traffic on a dedicated switch and the door to that room barred tightly.  MAC filtering to control access is cute, but stupid.  Sniff the packets, spoof a MAC and you're done.  Switches will not necessarily protect your data from promiscuous adapters, so don't take that for granted.  Of course, we may as well concede that a sufficiently-motivated individual WILL eventually gain access or compromise a system, whether it's AoE-based or iSCSI-based.  But I find the sheer openness of AoE disturbing.  If I could wrap it up with IPsec or at least have some assurance that the target will be extra-finicky about who/what it lets in, I'd be a little happier, even with degraded performance (within reason).

Then there's the notion that I just like to make sure human error is kept to a bare minimum, especially when it's my fingers on the keyboard.  Keeping targets out of reach means I won't accidentally format a volume that is live and important.  Volumes are exported by e-numbers, so running multiple target servers on your network means you have to manage your device exports very carefully.  Of course, none of this is mentioned in any of the documentation, and everyone's just out there vblading their file images as e0.0 on eth0.

Sorry, a little disdain there as the real world crashes in.  I'll try to curb that.

Multiple Interfaces?  Maybe.
If you happen to have several network interfaces that are not busy being used for, say, bonding or bridging, then you can let AoE stream traffic over them all!  Bitchin'.  This is the kind of throughput I've been waiting for...except I can't seem to use it without some sacrifices.

For me, the problem starts with running virtual machines that sometimes need access to iSCSI targets.  These targets are "only available" (not totally but let's say they are) over adapters configured for jumbo frames.  What's more, the adapters are bonded because, well, network cables get unplugged sometimes and switches sometimes die.  The book said: "no single points of failure," so there.  But maybe it is not so much of an issue and I just need to hand over a few ports to AoE and be done with it?

The documentation makes it clear how to do this on the client.  On the server, it's not so clear.  I think you bond some interfaces with RR-scheduling, and then let AoE do the rest.  How this will work on a managed gigabit switch that generally hates RR bonding, I do not yet know.  I also have not (yet) been able to use anything except the top-most adapter of any given stack.  For example, I have 4 ports bonded (in balance-alb) and the bond bridged for my VMs.  I can't publish vblades to the 4 ports directly, nor to the bond, but I can to the bridge.  So I'm stuck with the compromise of having to stream AoE data across the wire at basically the same max rate as iSCSI.  Sadness.
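For reference, an RR bond on Ubuntu 12.04 would look roughly like this in /etc/network/interfaces (interface names, addresses, and the jumbo-frame MTU are placeholders for your own setup):

```
# hypothetical balance-rr bond stanza
auto bond0
iface bond0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bond-mode balance-rr
    bond-miimon 100
    bond-slaves eth2 eth3
    mtu 9000
```

Whether your switch tolerates balance-rr across its ports is another matter entirely, as noted above.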

Control and Reporting
I'm not intimately familiar with the vblade program, but so far it's not exactly blowing my skirt up.  My chief complaints to-date:
  • I want to be able to daemonize it in a way that's more intelligent than just running it in the background.
  • I would like to get info about who's using what resources, how many computing/networking resources they're consuming, etc.
  • I had to hack up a resource agent script so that Pacemaker could reliably start and stop vblade - the issue seemed to involve stdin and stdout handling, where vblade kept crashing.

Since it's not nice to say only negatives, here are some positives:  It starts really fast; it's lightweight;  Fail-over should work flawlessly, and configuration is as easy as naming a device and an adapter to publish it on.  It does one thing really, really well: it provides AoE services.  And that's it.  It will hopefully not crash my hosts.

aoetools is another jewel - in mixed definitions of that word.  Again I find myself pining for documentation, reporting, statistics, load information, and a scheme that is a little more controllable and less haphazard-feeling than the way modprobe aoe hands you your devices.  Believe me, I think it's cool that it's so simple.  I just somehow miss the fine-grained and ordered control of iSCSI.  Maybe this is just alien to me and I need to get used to it.  I fear there are gotchas I have not yet encountered.

It's FAST!
There's a catch to that.  The catch is that AoE caches a great deal of data on the initiator and backgrounds a lot of the real writing to the target.  So you know that guy that did that 1000 client test with dbench?  He probably wasn't watching his storage server wigging out ten minutes after the test completed.  My tests were too good to be true, and after tuning to ensure writes hit the store as quickly as possible, the real rates presented themselves.

I can imagine that where reading is the primary activity, such as when a VM boots, this is no biggie.  But when I may have a VM host suddenly fail, I don't want a lot of dirty data disappearing with the host.  That would be disastrous.

Luckily, they give some hints on tuneables in /proc/sys/vm.  At one point I cranked dirty_ratio and dirty_background_ratio all the way down to zero, just to see how the system responded.  dbench was my tool of choice, and I ran it with a variety of different client sizes.  I think 50 was about the max my systems could handle without huge (50-second) latencies.  A lot of that is probably my store servers, which are both somewhat slow in the hardware and extremely safe (in terms of data corruption protection and total RAID failures).  I'll be dealing with them soon.

Other than that, I think it'd be hard to beat this protocol over the wire, and it's so low-level that overhead really should be at a minimum.  I do wish the kernel-gotchas were not so ominous; since this protocol is so low-level, your tuning controls become kernel tuning controls, and that bothers me a little.  Subtle breakage in the kernel would not be a fun thing to debug.  Read carefully the tuning documentation that is barely referenced in the tutorials (or not referenced at all - did I mention I would like to see better docs?  Maybe I'll write some here after I get better at using this stuff.).

Vendor Lock-in
I read that, and thought: "Gimme a break!"  Seriously guys, if you're using Microsoft, or VMware, you're already locked in.  Don't go shitting yourself about the fact there's only one hardware vendor right now for AoE cards.  Double-standards are bad, man.

Overall Impressions
So to summarize...

I would like more real documentation, less "it's so AWESOME" bullshit, and some concrete examples of various implementations along with their related tripping-hazards and performance bottlenecks.  (Again, I might write some as I go.)

I feel the system as a whole is still a little immature, but has amazing potential.  I'd love to see more development of it, some work on more robust and effective security against local threats, and some tuning controls to help those of us who throw 10 or 15 Windows virtuals at it.  (Yeah, I know, but I have no choice.)  If anyone is using AoE for running gobs of VMs on cluster storage, I'd love to hear from you!!

If iSCSI and AoE had a child, it would be the holy grail of network storage protocols.  It would look something like this:

  • a daemon to manage vblades, query and control their usage, and distribute workload.
  • the low-and-tight AoE protocol, with at least authentication security if not also full data-envelope encryption (options are nice - we like options.  Some may not want or need security; some of us do).
  • target identification, potentially, or at least something to help partition out the vblade-space a little better.  I think of iSCSI target IDs and their LUNs, and though they're painful, they're also explicit.  I like explicitness.
  • Some tuning parameters outside the kernel, so we don't feel like we're sticking our hands in the middle of a gnashing, chomping, chortling machine.
Although billed as competition to iSCSI, I think AoE actually serves a slightly different audience.  Whereas iSCSI provides a great deal of control and flexibility in managing SAN-access to a wide variety of clients, AoE offers unbridled power and throughput on a highly controlled and protected network.  I really could never see using AoE to offer targets to coworkers or clients, since a single slip-up out on the floor could spell disaster.  But I'm thinking iSCSI may be too slow for my virtualization clusters.

iSCSI can be locked down.  AoE can offer near-full-speed data access.

Time will tell which is right for me.


20130417

Sustainable HA MySQL/MariaDB

I ran into this problem just yesterday, and thought I'd write about what I'm trying to do to fix it.  Use at your own risk; hopefully this will work well.

I needed to run updates on my DB cluster.  It's a two-node cluster, and generally stable on Ubuntu 12.04.2 LTS.  Unfortunately, the way I had configured my HA databases, one of the nodes completely broke when I ran updates.  The update process failed because it was unable to start MariaDB, and MariaDB couldn't start because the database files were nowhere to be found on that node at that time.

Not liking the idea of having to update the database server "hot," then migrating over to the second node and updating it "hot" again, I thought perhaps this would be a good time for some manual package management.  This would mean the following:

  • I'd have to get the packages manually and configure the essentials accordingly - factory-default paths be damned!
  • No more automatic updates - a mixed bag: they're awesome when they work and terrible when they don't.  Luckily they usually "Just Work" (tm)
  • I'd have the latest and greatest that MariaDB has to offer.
  • I would have to be more mindful in the future about updates and making sure things don't break en-route to a new version.
OK, so originally I had installed MariaDB via apt-get, and put the database files themselves on an iSCSI target.  I used bind-mounts to place everything (from configuration files to the actual db files) where MySQL/MariaDB expected everything to be.  For this fix, my first thought was to put the binaries (well, the whole MariaDB install) on the iSCSI target.  This would mean one upgrade, one copy of binaries, and only one server capable of starting said database.

That didn't work - Pacemaker needs access to the binaries to make sure the database isn't started elsewhere on the cluster.  So, I set up a directory structure as follows:
  • /opt/mariadb
    • .../versions/ (put your untarred-gzipped deployments here)
    • .../current --> links to versions/(current version you want to use)
    • .../var   --> this is where the iSCSI target will now be mounted
    • .../config  --> my.cnf and conf.d/... will be here
MariaDB offers precompiled tar.gz deployments, which is really nice.  I can put these wherever I want.  In this case I'm building in an escape route for future upgrades by putting the fresh deployment files in a versions/ directory and linking to the version that I want to use.  No changes to configuration files or Pacemaker should be necessary, and upgrades won't stomp existing deployments this way.  Of course, back up your databases frequently and before each upgrade.
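The layout can be bootstrapped with a few commands; a sketch, using a throwaway directory so it's safe to dry-run (use BASE=/opt/mariadb for real, and substitute whatever release tarball you actually downloaded):

```shell
# Hypothetical bootstrap of the versions/current/var/config layout
BASE=${BASE:-/tmp/mariadb-layout-demo}
mkdir -p "$BASE/versions" "$BASE/config" "$BASE/var"
# Untar a release into versions/, then repoint the symlink to upgrade:
#   tar -C "$BASE/versions" -xzf mariadb-5.5.32-linux-x86_64.tar.gz
ln -sfn "$BASE/versions/mariadb-5.5.32-linux-x86_64" "$BASE/current"
ls -ld "$BASE/current"
```

Upgrading later is just another untar plus one ln -sfn; rolling back is repointing the symlink at the old version.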

Inside /opt/mariadb/var, I've placed a log and db directory.  log originally came from /var/log, and has a variety of transaction logs in it.  The db folder contains the actual database files, what would normally be found in /var/lib/mysql.  

The configuration files MIGHT work under the /opt/mariadb/var folder, which would mean it ought to be named something more appropriate.  I left them out for the sake of having them always available on both nodes.  I felt this was a safer route and don't have time to experiment much.

The my.cnf file has to be properly configured.  I snagged the my.cnf file that the original MariaDB apt-get install provided, and changed paths accordingly.  Now there are no bind-mounts, and for all intents and purposes I could simply duplicate the entire /opt/mariadb directory on a new node and be up and running in no-time.  (New node deployment is technically untested as of this writing.)
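For reference, a hypothetical excerpt of what the relocated paths might look like in my.cnf (names follow the layout above; your file will differ):

```
[mysqld]
basedir  = /opt/mariadb/current
datadir  = /opt/mariadb/var/db
socket   = /var/run/mysqld/mysqld.sock
pid-file = /var/run/mysqld/mysqld.pid
log-bin  = /opt/mariadb/var/log/mysql-bin
!includedir /opt/mariadb/config/conf.d/
```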

Note that if you happen to be moving existing log files (especially a .index file), the .index file will contain file paths that need to be updated.  sed will be your friend here, and you can cat the file to see the contents.  Once everything is done, you should be able to perform the following command and see a successful MariaDB launch:
/opt/mariadb/current/bin/mysqld --defaults-file=/opt/mariadb/config/my.cnf
In case you don't know, here's how you shut down your successful launch:
/opt/mariadb/current/bin/mysqladmin --defaults-file=/opt/mariadb/config/my.cnf shutdown -p
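The .index rewrite mentioned above can be sketched like this, using a throwaway file standing in for the real mysql-bin.index (adjust both paths to wherever your logs actually moved):

```shell
# Stand-in for the binlog index file; the real one lives with your logs
INDEX=/tmp/mysql-bin.index.demo
printf '%s\n' /var/lib/mysql/mysql-bin.000001 \
              /var/lib/mysql/mysql-bin.000002 > "$INDEX"
# Repoint every entry from the old datadir to the new location
sed -i 's|^/var/lib/mysql/|/opt/mariadb/var/db/|' "$INDEX"
cat "$INDEX"
```

cat the file afterward to confirm every line points at the new path before starting mysqld.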

The MySQL primitive in Pacemaker needs to be properly configured.  Here is what mine looks like:

primitive p_db-mysql0 ocf:heartbeat:mysql \
    params binary="/opt/mariadb/current/bin/mysqld" \
        config="/opt/mariadb/config/my.cnf" \
        datadir="/opt/mariadb/var/db" \
        pid="/var/run/mysqld/mysqld.pid" \
        socket="/var/run/mysqld/mysqld.sock" \
    op start interval="0" timeout="120s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="20s" timeout="30s"


So far, this new configuration seems to work.  Comments and suggestions are welcome.