Building out my next SAN and client
The goal here is die-hard data access and integrity.
SAN Access Mechanisms
AoE - a no-go at this time
My testing to date (2013-08-21) has shown that AoE under vblade migrates well, but does not handle a failed node well. Data corruption generally happens if writes are active, and there are cases I have encountered (especially during periods of heavy load) where the client fails to talk to the surviving node if that node is not already the primary (more below on that). In other words, if the primary SAN target node fails, the secondary will come up, but the client might not use it (or might use it for a few seconds before things get borked). I am actively investigating this and other related issues with guidance from the AoE maintainer. At this time I cannot use it for what I want to use it for. Pity, it's damn fast.
iSCSI - Server Setup
Ubuntu 12.04 has a 3.2 kernel and sports the LIO target suite. In initial testing it worked well, though it will be interesting to see how it performs under more realistic loads. My next test will involve physical machines to exercise iSCSI responsiveness over real hardware and jumbo frames.
The Pacemaker (Heartbeat) resource agent for iSCSILogicalUnit runs into a bug in LIO: if the underlying device/target is receiving writes, the logical unit cannot be shut down. This can cause a SAN node to get fenced for failing to shut down the resource when ordered to standby or migrate, and it can be reliably reproduced. This post details what needs to be done to fix the issue. These modifications can be applied with this patch fragment:
--- old/iSCSILogicalUnit 2013-08-21 16:13:20.000000000 -0400
+++ new/iSCSILogicalUnit 2013-08-21 16:12:56.000000000 -0400
@@ -365,6 +365,11 @@
done
;;
lio)
+ # First stop the TPGs for the given device.
+ for TPG in /sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/*/enable; do
+ echo 0 > "${TPG}"
+ done
+
if [ -n "${OCF_RESKEY_allowed_initiators}" ]; then
for initiator in ${OCF_RESKEY_allowed_initiators}; do
ocf_run lio_node --dellunacl=${OCF_RESKEY_target_iqn} 1 \
@@ -373,6 +378,15 @@
fi
ocf_run lio_node --dellun=${OCF_RESKEY_target_iqn} 1 ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC
ocf_run tcm_node --freedev=iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE} || exit $OCF_ERR_GENERIC
+
+ # Now that the LUN is down, reenable the TPGs...
+ # This is a guess; we're going to have to test with multiple LUNs per target
+ # to make sure we are doing the right thing here.
+ for TPG in /sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/*/enable; do
+ echo 1 > "${TPG}"
+ done
+
+
esac
fi
Basically, go through all TPGs for a given target, disable them, nuke the logical unit, and then re-enable them. This has only been tested with one LUN. It may screw things up for multiple LUNs. Hopefully not, but you have been warned. If I get around to testing, I'll update this post. My setups always involve one LUN per target.
iSCSI - Pacemaker Setup
On the SERVER... I group the target's virtual IP, iSCSITarget, and iSCSILogicalUnit together for simplicity (and because they can't exist without each other). LIO requires the IP be up before it will build a portal to it.
group g_iscsisrv-o1 p_ipaddr-o1 p_iscsitarget-o1 p_iscsilun-o1
Each target gets its own IP. I'm using ocf:heartbeat:IPaddr2 for the resource agent. The iSCSITarget primitives each have unique tids. Other than that, LIO ignores parameters that iet and tgt care about, so configuration is pretty short. Make sure to use implementation="lio" absolutely everywhere when specifying the iSCSITarget and iSCSILogicalUnit primitives.
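For reference, here is a sketch of what the server-side primitives behind that group might look like in the crm shell. The IP address, IQN, and backing device are placeholder values, and extras like portals and initiator ACLs are omitted:
primitive p_ipaddr-o1 ocf:heartbeat:IPaddr2 \
    params ip="10.1.1.101" cidr_netmask="24"
primitive p_iscsitarget-o1 ocf:heartbeat:iSCSITarget \
    params implementation="lio" tid="1" \
        iqn="iqn.2013-08.com.example:storage.o1"
primitive p_iscsilun-o1 ocf:heartbeat:iSCSILogicalUnit \
    params implementation="lio" lun="1" \
        target_iqn="iqn.2013-08.com.example:storage.o1" \
        path="/dev/drbd0"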
On the CLIENT... The ocf:heartbeat:iscsi resource agent needs this parameter to keep from breaking connections to the target under the wrong conditions:
try_recovery="true"
Without it, a failed node or a migration will occasionally cause the connection to fail completely, which is not what you want when failover without noticeable interruption is your goal.
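For completeness, a sketch of the matching client-side primitive (the portal address and IQN are the same placeholders as above):
primitive p_iscsiclient-o1 ocf:heartbeat:iscsi \
    params portal="10.1.1.101:3260" \
        target="iqn.2013-08.com.example:storage.o1" \
        try_recovery="true"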
SAN Components
DRBD
Ubuntu 12.04 ships with 8.3.11, but the DRBD git repo has 8.3.15 and 8.4.3. In the midst of debugging a Pacemaker bug, I migrated to 8.4.3. It works fine, and appears to be quite stable. Make sure that you're using the 8.4.3 resource agent, or else things like dual-primary will fail (if everything is installed to standard locations, you should be fine).
Though it's not absolutely necessary, I am running my DRBD resources in dual-primary. The allow-two-primaries option seems to shave a few seconds off the recovery, since all we have to migrate are the iSCSI target resources. LIO migrates very quickly, so most of the waiting appears to be cluster-management-related (waiting to make sure the node is really down, making sure it's fenced, etc). We could probably get it faster with a little more work.
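For the record, dual-primary has to be enabled in two places: the DRBD resource's net section and the Pacemaker master/slave definition. A sketch, with placeholder resource names (DRBD 8.4 syntax):
# In the DRBD resource definition:
net {
    protocol C;
    allow-two-primaries yes;
}
# In the crm configuration:
ms ms_drbd-o1 p_drbd-o1 \
    meta master-max="2" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"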
Pacemaker, Corosync
Without the need for OCFS2 on the SAN, I built the cluster suite from source, using Corosync 2.3.1 and Pacemaker 1.1.10 plus the latest changes from git. It's very near bleeding-edge, but it's also working very well at the moment. Building the cluster suite requires a host of other packages; I will detail the exact build requirements and sequence in another post, as I wrote a script that does a pretty much automated install. The important thing is to make sure you don't have any competing libraries/headers in the way, or parts of the build will break. Luckily it breaks during the build and not during execution. (libqb, I am looking at YOU!)
ZFS
I did not do any additional experimentation with this on the sandbox cluster, but it is worth noting that in my most recent experience I have shifted to using drive UUIDs instead of any of the other available device addressing mechanisms. The problem I ran into (several times) involved the array not loading on reboot, or (worse) the vdevs not appearing after reboot. Since the vdevs are the underlying devices for DRBD, it's rather imperative that they be present on reboot. This appears to be a lingering issue in ZoL, though less so in recent releases.
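The practical takeaway: build the pool from the stable paths under /dev/disk (by-id or by-uuid) rather than /dev/sdX names, which can shuffle between reboots. A minimal sketch, with hypothetical disk identifiers:
# List the stable identifiers, then build the mirror vdev from them.
ls -l /dev/disk/by-id/
zpool create tank mirror \
    /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL_1 \
    /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL_2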
Testing and Results
For testing I created a cluster of four nodes, all virtual, with external/libvirt as the STONITH device. The nodes, c6 through c9, were configured thus:
- c6 and c7 - SAN targets, synced with each other via DRBD
- c8 - AoE test client
- c9 - AoE and iSCSI test client
Server/Target
All test batches included migration tests (moving the target resource from one node to another), failover tests (manually fencing a node so that its partner takes over), single-primary tests (migration/failover when one node has primary and the other node has secondary), and dual-primary tests (migration/failover when both nodes are allowed to be primary).
Between tests, the DRBD stores were allowed to fully resync. During some long-term tests, resync happened while client file system access continued.
Client/Initiator
Two workloads were tested: dbench, and data transfer with verification.
dbench is fairly cut-and-dried. It was set to run for upwards of 5000 seconds with 3 clients, while the SAN target nodes (c6 and c7) were subjected to migrations and fencing.
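For reference, the invocation looked roughly like this (the mount point is a placeholder):
# 3 dbench clients, 5000-second time limit, on the SAN-backed mount
dbench -t 5000 -D /mnt/santest 3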
The data transfer and verification tests were more interesting, as they revealed corruption issues. For the sake of having options, I created three sets of files with dd if=/dev/urandom: the first set was 400 1-meg files, the second 40 10-meg files, and the last 4 100-meg files. Random data was chosen to ensure that no compression features would interfere with the transfer, and also to provide useful data for verification. SHA-512 sums were generated for every file; as the files were created in three batches, three sum files were generated. For each test, a selected batch of files was copied to the target via either rsync or cp while migrations/failovers were being performed, then checked for corruption by validating against the appropriate sums file. Between tests, the target's copy of the data was deleted. Occasionally the target store was reformatted to ensure that the file system was working correctly (especially after failed failover tests).
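A sketch of the generation step, reconstructed from the description above (directory and file names are illustrative):
#!/bin/sh
# Create the three batches of random test files.
mkdir -p testdata && cd testdata
for i in $(seq 1 400); do
    dd if=/dev/urandom of=1m-$i.bin bs=1M count=1 2>/dev/null
done
for i in $(seq 1 40); do
    dd if=/dev/urandom of=10m-$i.bin bs=1M count=10 2>/dev/null
done
for i in $(seq 1 4); do
    dd if=/dev/urandom of=100m-$i.bin bs=1M count=100 2>/dev/null
done
# One sums file per batch.
sha512sum 1m-*.bin > sums-1m.sha512
sha512sum 10m-*.bin > sums-10m.sha512
sha512sum 100m-*.bin > sums-100m.sha512
# After copying a batch to the target, verify it there with e.g.:
#   sha512sum -c sums-1m.sha512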
Results - AoE
AoE performed extremely well on transfer rates and migration, but failed its verification tests during failover testing. This is interesting because it suggests that whatever mechanism AoE uses to push its writes to disk is buffering somewhere along the way: vblade is forcibly terminated during migration, yet no corruption occurred throughout those tests.
Failover reliably demonstrated corruption; fencing a node practically guaranteed that 2-4 files would fail their SHA-512 sums. This can be fixed by using vblade's "-s" (synchronous write) option, but I find that to be rather unattractive. Yet it may be the only option.
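If it comes to that, it's a single extra flag at export time (the shelf/slot, interface, and backing device here are placeholders):
# Export /dev/drbd0 as shelf 0, slot 1 on eth0, with synchronous writes
vblade -s 0 1 eth0 /dev/drbd0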
Another issue: during a failover, the client might fail to communicate with the new target. Migration didn't seem to suffer from this, yet on failover, aoe-debug sometimes reported both aoetgts receiving packets even though one was dead and one was alive. More often than not, aoe would start talking to the remaining node, only to stop a few seconds later and never resume. I've spent a good deal of time examining the code, but at this time it's a bit too complex to make any inroads. At best, I've had intermittent success at generating the failure case.
One other point of interest regarding AoE: failover is regrettably a bit slow. This appears to be due to a hard-coded 10-second time limit before scorning a target. I might add a module parameter for this, and/or see about better logic for dealing with suspected-failed targets.
Results - iSCSI
iSCSI performed, well, like iSCSI with regard to transfer rates - slower than AoE. My biggest fear with iSCSI is resource contention when multiple processes are accessing the store. Once the major issues involving the resource agent and the initiator were solved, migration worked like a charm. During failover testing, no corruption was observed and the remaining node picked up the target almost immediately. I will probably deploy with allow-two-primaries enabled.