20131130

STUNned... when a useful feature burns your ass

So I've been doing some experimentation with Asterisk.  I have deployed a couple of servers so far, one at work and one at home, both pure testing environments.  Yeah, don't worry, I have horribly miserable-to-type (i.e. strong) passwords on all my test accounts.  Anyway, I kept running into a small problem with my home box.

It all started with intermittent audio.  It was an issue I kept experiencing on my desktop clients, on my phone clients (such as CSipSimple and ZoIPer), but strangely not on my tablet.  Basically, sometimes the audio would work, and sometimes it wouldn't.  Now, I was simply calling a test extension, one which would play "Hello, world!" over and over again.  On occasion the audio would kick in part way through the sequence.  Usually it would either work or, more often, be absent altogether.

The SIP registration was fine.  Receiving calls even worked, to the right client.  The legacy desktop version of ZoIPer, for instance, seemed to handle things well, but the newer version kept having issues.  I toyed with the sound card settings.  I tried removing Bluetooth from the equation (which is probably just as well, as I keep having BT connectivity issues with my headset - probably an adapter problem).  I even tried connecting to my work server.

The work server worked fine.

While considering whether or not it could be an issue with my bleeding-edge release of Asterisk (which I had to compile to get Google Voice/Chat integration working correctly, at least in version 11), another observation finally struck me.  Earlier in the evening, as I watched the SIP debug packets fly by, I noticed that the client IP was not always a local address.  In fact, it had no reason not to be local; the server and the client were on the same subnet!  STUN support, it appears, was part of the culprit.  Given that I don't know nearly enough about STUN, ICE, or SIP to speak authoritatively about this stuff, I'll give it my best guess:  the client was using STUN to find its external IP, which it relayed to Asterisk during part of the call.  At other times it used its well-known internal IP.  The server, consequently, tried talking to an unregistered IP and decided it wasn't such a good idea.

Intermittently.

Anyway, disabling STUN in the clients seemed to work well, especially with the newer desktop version of ZoIPer.  In my Asterisk config I had once issued an externaddr directive in sip.conf, but ended up commenting it out since my server currently lives on a dynamic IP.  I will eventually fix that, one way or another.  Disabling STUN is, naturally, not an ideal solution.  Things should work regardless, and the client ought, in my opinion, to be smart enough to figure out when it does and doesn't need STUN.  Or maybe Asterisk really is missing that externaddr directive and isn't telling the client the right thing to do.
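For reference, here's roughly the sip.conf plumbing in question; the addresses and hostname below are stand-ins, with externhost being the directive I'd presumably want for a dynamic IP (at the cost of periodic DNS lookups):

[general]
externaddr = 203.0.113.10                ; static public IP, if you have one
;externhost = myserver.example.org       ; or a dynamic-DNS name instead
;externrefresh = 120                     ; how often to re-resolve externhost
localnet = 192.168.1.0/255.255.255.0     ; peers here get the private address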

In either case, I'm just happy to finally have working clients.  Now to pick out an FXO/FXS card...

20130905

Can't mount or fsck OCFS2, using CMAN...what you probably shouldn't ever do (but I did)

It's late.  You have only one node of a two-node cluster left, and you're using cman, pacemaker, and OCFS2.  The node gets rebooted.  Suddenly, and for no apparent reason, you can't mount your OCFS2 drives.  You can do fsck.ocfs2 -n, but any attempt to actually repair the drive just goes nowhere.  You wait precious minutes.  What is wrong?

Checking the logs, you see that your stonith mechanisms are failing.  Strange, they used to work before.  But now they're not, and cman wants to fence the other node that you know is really, really, really fucking dead.  What to do?  Hmm... I can't tell it to unfence the node, because no commands I try seem to make it actually agree to those behests.

Desperation sets in.  You have to fsck the damn things.  You've rebooted a dozen times.  Carefully brought everything back up, and still it sits there, mocking you, not scanning a goddamn thing.  What did I do?  I installed a null stonith device (stonith:null) in pacemaker, and gave it the dead node in the hostlist.  On the next round of fencing attempts that cman made, pacemaker failed at the original stonith and succeeded at the null device (expectedly).  Suddenly cman was happy, and the world rejoiced, and the file system scans flew forth with verbose triumph.  Moments later everything was mounted and life continued as though nothing bad ever happened.
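For the record, the null stonith trick is a one-liner in the crm shell (the node name is a placeholder; stonith:null always "succeeds" at fencing, which is exactly the point and exactly the danger):

primitive st-null stonith:null \
        params hostlist="dead-node-name"

Delete it the moment the crisis is over; a fencing device that lies is not something to leave in a running cluster.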

Now I have to figure out why my stonith device failed in the first place.  That pisses me off.

20130822

Die-hard iSCSI SAN and Client Implementation Notes

Building out my next SAN and client

The goal here is die-hard data access and integrity.

SAN Access Mechanisms

AoE - a no-go at this time

My testing to date (2013-08-21) has shown that AoE under vblade migrates well, but does not handle a failed node well.  Data corruption generally happens if writes are active, and there are cases I have encountered (especially during periods of heavy load) where the client fails to talk to the surviving node if that node is not already the primary (more below on that).  In other words, if the primary SAN target node fails, the secondary will come up, but the client might not use it (or might use it for a few seconds before things get borked).   I am actively investigating this and other related issues with guidance from the AoE maintainer.  At this time I cannot use it for what I want to use it for.  Pity, it's damn fast.

iSCSI - Server Setup

Ubuntu 12.04 has a 3.2 kernel and sports the LIO target suite.  In initial testing it worked well, though it will be interesting to see how it performs under more realistic loads.  My next test will involve physical machines to exercise iSCSI responsiveness over real hardware and jumbo-frames.

The Pacemaker (Heartbeat) resource agent for iSCSILogicalUnit suffers from a bug in LIO, whereby if the underlying device/target is receiving writes, the logical unit cannot be shut down.  This can cause a SAN node to get fenced for failure to shut down the resource when ordered to standby or migrate.  It can be reliably reproduced.  This post details what needs to be done to fix the issue.  These modifications can be applied with this patch fragment:

--- old/iSCSILogicalUnit 2013-08-21 16:13:20.000000000 -0400
+++ new/iSCSILogicalUnit 2013-08-21 16:12:56.000000000 -0400
@@ -365,6 +365,11 @@
   done
   ;;
      lio)
+               # First stop the TPGs for the given device.
+               for TPG in /sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/*/enable; do
+                       echo 0 > "${TPG}"
+               done
+
                if [ -n "${OCF_RESKEY_allowed_initiators}" ]; then
                        for initiator in ${OCF_RESKEY_allowed_initiators}; do
                                ocf_run lio_node --dellunacl=${OCF_RESKEY_target_iqn} 1 \
@@ -373,6 +378,15 @@
                fi
   ocf_run lio_node --dellun=${OCF_RESKEY_target_iqn} 1 ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC
   ocf_run tcm_node --freedev=iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE} || exit $OCF_ERR_GENERIC
+
+               # Now that the LUN is down, reenable the TPGs...
+               # This is a guess, so we're gonna have to test with multiple LUNs per target
+               # to make sure we are doing the right thing here.
+               for TPG in /sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/*/enable; do
+                       echo 1 > "${TPG}"
+               done
+
+
  esac
     fi


Basically, go through all TPGs for a given target, disable them, nuke the logical unit, and then re-enable them.  This has only been tested with one LUN.  It may screw things up for multiple LUNs.  Hopefully not, but you have been warned.  If I get around to testing, I'll update this post.  My setups always involve one LUN per target.

iSCSI - Pacemaker Setup

On the SERVER...  I group the target's virtual IP, iSCSITarget, and iSCSILogicalUnit together for simplicity (and because they can't exist without each other).  LIO requires the IP be up before it will build a portal to it.  
group g_iscsisrv-o1 p_ipaddr-o1 p_iscsitarget-o1 p_iscsilun-o1
Each target gets its own IP.  I'm using ocf:heartbeat:IPaddr2 for the resource agent.  The iSCSITarget primitives each have unique tids.  Other than that, LIO ignores parameters that iet and tgt care about, so configuration is pretty short.  Make sure to use implementation="lio" absolutely everywhere when specifying the iSCSITarget and iSCSILogicalUnit primitives.
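To make that concrete, here's a sketch of one target's trio of primitives (the IQN, IP, tid, and backing device are all placeholders, not my real values):

primitive p_ipaddr-o1 ocf:heartbeat:IPaddr2 \
        params ip="192.168.50.11" cidr_netmask="24"
primitive p_iscsitarget-o1 ocf:heartbeat:iSCSITarget \
        params implementation="lio" iqn="iqn.2013-08.local.san:o1" tid="1"
primitive p_iscsilun-o1 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="lio" target_iqn="iqn.2013-08.local.san:o1" \
                lun="1" path="/dev/drbd0"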

On the CLIENT...  The ocf:heartbeat:iscsi resource agent needs this parameter to not break connections to the target when the conditions are right:
try_recovery="true"
Without it, a failed node or a migration will occasionally cause the connection to fail completely, which is not what you want when failover without noticeable interruption is your goal.
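On the client side, that translates to something like this sketch (portal and IQN are the same placeholders as above):

primitive p_iscsi-o1 ocf:heartbeat:iscsi \
        params portal="192.168.50.11:3260" target="iqn.2013-08.local.san:o1" \
                try_recovery="true" \
        op monitor interval="20s"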

SAN Components

DRBD

Ubuntu 12.04 ships with 8.3.11, but the DRBD git repo has 8.3.15 and 8.4.3.  In the midst of debugging a Pacemaker bug, I migrated to 8.4.3.  It works fine, and appears to be quite stable.  Make sure that you're using the 8.4.3 resource agent, or else things like dual-primary will fail (if everything is installed to standard locations, you should be fine).

Though it's not absolutely necessary, I am running my DRBD resources in dual-primary.  The allow-two-primaries option seems to shave a few seconds off the recovery, since all we have to migrate are the iSCSI target resources.  LIO migrates very quickly, so most of the waiting appears to be cluster-management-related (waiting to make sure the node is really down, making sure it's fenced, etc.).  We could probably get it faster with a little more work.

Pacemaker, Corosync

Without the need for OCFS2 on the SAN, I built the cluster suite from source using Corosync 2.3.1 and Pacemaker 1.1.10 plus the latest changes from git.  It's very near bleeding-edge, but it's also working very well at the moment.  Building the cluster requires a host of other packages.  I will detail the exact build requirements and sequence in another post; I wrote a script that does a pretty much automated install.  The important thing is to make sure you don't have any competing libraries/headers in the way, or parts of the build will break.  Luckily it breaks during the build and not during execution.  (libqb, I am looking at YOU!)

ZFS

I did not do any additional experimentation with this on the sandbox cluster, but it is worth noting that in my most recent experiences I have shifted to using drive UUIDs instead of any of the other available device addressing mechanisms.  The problem I ran into (several times) involved the array not loading on reboot, or (worse) the vdevs not appearing after reboot.  Since the vdevs are the underlying devices for DRBD, it's rather imperative that they be present on reboot.  It appears to be a lingering issue in ZoL, though less so in recent releases.
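The practical upshot is to build pools against stable identifiers.  A sketch (pool name and device IDs invented):

zpool create tank mirror \
        /dev/disk/by-id/ata-EXAMPLE_SERIAL_1 \
        /dev/disk/by-id/ata-EXAMPLE_SERIAL_2

The by-id (or by-uuid) paths survive the sdX reshuffling that appears to be behind the missing-vdev problem.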

Testing and Results

For testing I created a cluster of four nodes, all virtual, with external/libvirt as the STONITH device.  The nodes, c6 thru c9, were configured thus:
  • c6 and c7 - SAN targets, synced with each other via DRBD
  • c8 - AoE test client
  • c9 - AoE and iSCSI test client

Server/Target

All test batches included migration tests (moving the target resource from one node to another), failover tests (manually fencing a node so that its partner takes over), single-primary tests (migration/failover when one node has primary and the other node has secondary), and dual-primary tests (migration/failover when both nodes are allowed to be primary).

Between tests, the DRBD stores were allowed to fully resync.  During some long-term tests, resync happened while client file system access continued.  

Client/Initiator

Two operations were tested: dbench, and data transfer with verification.

dbench is fairly cut-and-dried.  It was set to run for upwards of 5000 seconds with 3 clients, while the SAN target nodes (c6 and c7) were subjected to migrations and fencing.

The data transfer and verification tests were more interesting, as they signaled corruption issues.  For sake of having options, I created three sets of files with dd if=/dev/urandom.  The first set was 400 1-meg files.  The second set was 40 10-meg files.  The last set was 4 100-meg files.  Random data was chosen to ensure that no compression features would interfere with the transfer, and also to provide useful data for verification.  SHA-512 sums were generated for every file.  As the files were done in three batches, three sum files were generated.  For each test, a selected batch of files was copied to the target via either rsync or cp, while migrations/failovers were being performed.  The batch was then checked for corruption by validating against the appropriate sums file.  Between tests, the target's copy of the data was deleted.  Occasionally the target store was reformatted to ensure that the file system was working correctly (especially after failed failover tests).
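Generating the batches and their sums is trivially scriptable; here's a sketch of the first batch (names are arbitrary):

mkdir batch1 && cd batch1
for i in $(seq 1 400); do
        dd if=/dev/urandom of=file$i.bin bs=1M count=1 2>/dev/null
done
sha512sum *.bin > ../batch1.sha512

# ...and verification of the target's copy after each test:
sha512sum -c /path/to/batch1.sha512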

Results - AoE

AoE performed extremely well with transfer rates and migration, but failed during verification tests on failover testing.  This is interesting because it suggests the mechanism that AoE is using to push its writes to disk is buffering somewhere along the way.  vblade is forcibly terminated during migration, yet no corruption occurred throughout those tests.  

Failover reliably demonstrated corruption; the fencing of a node practically guaranteed that 2-4 files would fail their SHA-512 sums.  This can be fixed by using the "-s" option, but I find that to be rather unattractive.  Yet it may be the only option.

Another issue: during a failover, the client might fail to communicate with the new target.  Migration didn't seem to suffer from this.  Yet on failover, aoe-debug sometimes reported both aoetgts receiving packets even though one was dead and one was living.  More often than not, aoe would start talking to the remaining node, only to stop a few seconds later and never resume.  I've spent a good deal of time examining the code, but at this time it's a bit too complex to make any in-roads.  At best, I've had intermittent success at generating the failure case.

One other point of interest regarding AoE: the failover is a bit slow, regrettably.  This appears to be due to a hard-coded 10-second time limit before scorning a target.  I might add a module parameter for this, and/or see about a better logic-flow for dealing with suspected-failed targets.

Results - iSCSI

iSCSI performed, well, like iSCSI with regard to transfer rates - slower than AoE.  My biggest fear with iSCSI is resource contention when multiple processes are accessing the store.  Once the major issues involving the resource agent and the initiator were solved, migration worked like a charm.  During failover testing, no corruption was observed and the remaining node picked up the target almost immediately.  I will probably deploy with allow-two-primaries enabled.

20130815

AoE, DRBD...coping strategies

This is a stream of consciousness...

DRBD - Upgrade Paths

When dealing with pre-compiled packages, upgrading from source can be hazardous.  Make sure of the following things:
  • DO make sure you have the correct kernel for the version of DRBD you want to use.
    • The 8.3 series won't compile against kernel 3.10, and possibly others.  It does compile against 3.2.  It appears that changes to procfs have made some of the relevant code in 8.3 out-of-date.
    • The 8.4 series will compile against kernel 3.2.  It ships in-tree with 3.10, so it must also build successfully there.
  • DO make sure you build the tools AND the module.
  • DO configure the tools to use the CORRECT paths.
    • These will depend on the distro and the original configure args used.  Deduce or look into the package sources for their build scripts.
    • Ubuntu has drbdadm in /sbin, the helper scripts in /usr/lib, the configs in /etc, and the state in /var/lib.
    • If you do not have the correct paths, things will break in unexpected ways and you might have a resource that continually tries to connect and then fails with some strange error (such as a protocol error or a mysterious "peer closed connection" on both sides).
  • DO be careful when installing the 8.3 module under Ubuntu 12.04; it doesn't seem to copy to where it needs to go - hand-copy if necessary (and it does seem necessary).
  • DO shut down Pacemaker on the upgrade node before upgrading.  Reboots may be necessary.  Module unloading/reloading is the minimum required.
  • DO reconnect the upgraded DRBD devices to their peers BEFORE bringing Pacemaker back into the mix.  If there is something amiss with your upgrade, you'd rather it simply fail on the upgrade node than have that node and its cluster friends start fencing one another.  Pacemaker won't care if a slave connects as long as it doesn't change the status quo (i.e. no auto master-seizure).  If everything is gold, you should be able to either start Pacemaker as is (it should see the resources and just go with it) or shut down DRBD and let Pacemaker bring it back up when it starts.
  • It's probably a better idea to upgrade DRBD before upgrading the kernel, so that you are only changing one major variable at a time.  In upgrading the kernel from 3.2 to 3.10, I ran into situations where things were subtly broken and the good node was getting fenced by the upgrading node for no good reason.
  • I have found, thus far, that wiping the metadata in the upgrade to 8.4 was not necessary, but it has been noted as a solution in certain circumstances.  This requires a full resync when done.
  • 8.3 and 8.4 WILL communicate with one another if you've done everything right.  If you haven't, they won't, but they will sorta seem like they should...blame yourself and go back through your DRBD install and double-check everything.
  • 8.4 WILL read 8.3's configuration files.  Update them to the 8.4 syntax once things are stable.
  • Pulling the source from GIT is a fun and easy way to obtain it.  Plus you can have the latest and greatest or any of the previously tagged releases.
  • And finally, WRITE DOWN the configure string you used on the upgrade node.  You'll want to replicate it exactly on the other node, especially if you pulled the source from git.
    • Even an rsync-copy doesn't guarantee that the source won't want to rebuild.  Plus if you end up switching to a newer or older revision, stuffing the configure command line into a little shell script makes rebuilding less error-prone (see the sketch just below this list).
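Something like this is all it takes (the flags here are examples matching the Ubuntu paths noted above; use whatever you actually configured with):

#!/bin/sh
# rebuild-drbd.sh - the one true configure line for this cluster
set -e
./configure --prefix=/usr --localstatedir=/var --sysconfdir=/etc --with-km
make
make install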

AoE

This site is useful: http://support.coraid.com/support/linux/
To build in Ubuntu:
make install INSTDIR=/lib/modules/`uname -r`/kernel/drivers/block/aoe

Oh, where do I begin?  Things I have issue with and would like to do something about:
  • Absolutely no security at all in this protocol.  Security via hiding the data on a dedicated switch is not an answer, especially when you don't have said dedicated switch to use.  VLANs are a joke.  Routability be damned, this protocol is more than vulnerable to any number of local-area attacks, which are just as likely from a compromised local node as Internet-side attacks are for a routable protocol.
    • I'd like to see the header or payload enhanced with a cryptographic sequence, at the very least.
      • The sequence could be a pair of 32-bit numbers, representing (1) the number that is the expected sequence number, and (2) the number that will be the next sequence number.
      • Providing these two numbers means that the number source can be anything, including random generation, making sequence prediction difficult (no predictable plaintext).
      • This could, at the very least, provide defense from replay attacks and give the target and the initiator something to identify each other with.
      • Extensions to this could allow for more robust initiator security, whereby a shared secret is used to guarantee a target/initiator is who they say they are, in lieu of MAC filtering (which is pointless in the modern world of spoofable MACs).
  • vblade is, from the looks of things, "as good as it gets."  No other projects appear to be among the living.  Most stuff died back in '07 and '10 (if you're lucky).  Even vblade doesn't get much love from the looks of it.
    • Let's get at least a version number in there somewhere, so that I don't feel like an idiot for typing "vblade --version."
    • Figure out why the "announce" that vblade does falls on deaf ears.  If the initiator transmits a packet, and the recipient has died and "failed over" to another node, why does the initiator not care about this?  (update: evidently it does not fall on deaf ears, but it comes close.  In certain circumstances it works, others it doesn't.)
    • aoe-stat could dump out more info, like the MAC it was expecting to connect to. (update: This info was hidden away in a sysfs file called "debug")
    • The driver doesn't appear to try hard when it comes to reconnecting to a perceived-failed device.  The 12-page super-spec doesn't give any real guidance on how the initiator should behave, or how a target should respond, to any situation other than one in which everything works.  (How wonderfully optimistic!)  (update: OK so maybe I didn't read the whole spec line-for-line...)
  • Driver error detection and recovery appears to be nonexistent.  Again, very optimistic.  Plus, with two vblades running on two separate servers, only one server's MAC is being seen.  Why is this?!  Oh yeah, and that page about revalidating the device by removing and re-probing the driver?  Not gonna happen while the damn device is mounted somewhere.  PUNT!  (update: the mechanisms are evidently far more complex than I originally believed.  I must now examine the code carefully to understand them.  I have already encountered one test case that failed spectacularly.)
  • aoe-discover does dick when it thinks it knows where its devices are.  aoe-flush -a does nothing useful.  I hate that I have to look through their scripts just to find command line options.  Anyway, you CAN perform an aoe-flush e#.# and get it to forget about individual devices.  Then do the aoe-discover and things will work...if aoe-flush ever successfully returns.  (See the snippet just after this list.)
  • If you aoe-flush a device that has moved to a new home, your mount is screwed.  Even if it hasn't moved to a new home, you're screwed.  Once the device is borked, the mount is lost.  This makes changing the MAC of the target a requirement if you want to failover to a secondary node.  (update: aoe-flush is not what we need.  The issue is deeper than that.)
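The flush-then-rediscover dance from the list above, spelled out (e1.1 is an example shelf.slot):

aoe-flush e1.1     # forget the one stale device
aoe-discover       # ask the driver to go looking again
aoe-stat           # confirm what it found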

At this point, I am strongly considering moving my resources back to iSCSI, and hoping that LIO can handle the load.

UPDATE: I've taken a different tack and thought to ask some useful questions.  Given the answers, there is perchance still hope for using AoE on my cluster.  We shall soon see.  Time, nonetheless, is running out (and I don't mean to imply my patience...I mean time, like REAL TIME).


20130812

The Unconfirmed Horror

In my efforts to perfect my SAN upgrades before putting them into production, I've configured an iSCSI resource in Pacemaker.  The goal, of course, as with AoE, is high availability with perfect data integrity (or at least as good as we can get it).  My AoE test store has, of late, been suffering under the burden of O_SYNC - synchronous file access to its backing store.  This has guaranteed that my writes make it to disk, at the expense of write performance.  It's hard to say how much overall performance is lost to this, but in terms of pure writes, it appears to be pretty significant.

I had hoped that iSCSI would not be so evil with regard to data writing.  I was unpleasantly surprised when the TGT target I configured demonstrated file transfer errors on fencing the active SAN node.  Somehow the failure of the node was enough reason for a significant number of file blocks to get lost.  Not finding a self-evident way to put the target into a "synchronous" mode with regard to its backing store, I switched to LIO.  So far, it seems to be performing better with regard to not losing writes on node failure...which is to say, it's not losing my data.  That's critical.

Re-evaluating AoE in a head-to-head against LIO, here's the skinny.  With AoE running O_SYNC, it's a dead win (at the moment) for LIO in the 400 Meg-of-Files Challenge: 12 seconds versus 2 minutes.  Yet not all is lost!  We can tune AoE on the client-side to do a lot more caching of data before blocking on writes.  This assumes we're cool with losing data as long as we're not corrupting the file system in the process (in my prior post, I noted that file system corruption was among the other niceties of running AoE without the -s flag when a primary storage node fails).  That should boost performance at the cost of additional memory.  Right now AoE bursts about 6-10 files before blocking.

There is one other way, thus far, in which iSCSI appears superior to AoE: takeover time.  For iSCSI, a migration from one node to the other is nearly instantaneous.  On a node failure, it takes about 5-10 seconds.  AoE?  No such luck.  Even though it's purported to broadcast an updated MAC whenever vblade is started, the client either fails to see it (or doesn't care) or is too busy doing other things.  I think it's the former, as a failure while no writing is happening on the node causes the same 15-20 second delay before any writes can resume.  Why is this?

One thing does irk me as I test things out.  I had a strange node failure on the good node after fencing the non-good node.  It could just be that the DRBD resources were not yet synced, which would prevent them from starting (and cause a node-fencing).  Yet the logs indicate something about dummy resources running that shouldn't be running.

IDK.

All I know is that I want stability, and I want it now.

20130802

Working With AoE

A few weeks ago I did some major rearrangement of the server room, and in the midst did some badly-needed updates on the HA SAN servers.  The servers are responsible for all the virtual-machine data currently in use.  Consequently it's rather important they work right, and work well.

Sadly, one of the servers "died" in the midst of the updates.  Just as well, the cluster had problems with failover not being as transparent as it was supposed to be.  A cluster where putting a node on standby results in that node's immediate death-by-fencing is not a good cluster.

I thought this would be a good time to try out the latest Pacemaker and Corosync, so I set up some sandbox machines for play.  Of course, good testing is going to include making sure AoE is also performing up to snuff.  So far, I've encountered some interesting results.

For starters, I created a DRBD store between two of my test nodes, and shared it out via AoE.  A third node did read/write tests.  To do these tests, I created 400 1-meg files via dd if=/dev/urandom.  I generated SHA-512 sums for them all, to double-check file integrity after each test.  I also created 40 10-meg files, and 4 100-meg files.  I think you can spot a pattern here.  Transfers were done from a known-good source (the local drive of the AoE client) to the AoE store using cp and rsync.  During the transfer, failover events were simulated by intentionally fencing a node, or issuing a resource-migrate command.

Migration of the resource generally worked fine.  No data corruption was observed, and so long as both nodes were live everything appeared to work OK.  Fencing the active node, however, resulted in data corruption unless vblade was started with the "-s" option.  The up-side is that you're guaranteed that writes will have finished before the sender trashes the data.  The down-side is that writes go a LOT slower.  Strangely, -s is never really mentioned in the available high-availability guides for AoE.  I guess that's not really surprising; AoE is like a little black box that no one talks in any detail about.  Must be so simple as to be mindlessly easy...sadly that's a dangerously bad way to think.

Using -d for direct mode is also damaging to performance; I am not sure how well it does with failover due to SAN-node failure.
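For concreteness, here's how those flags land on a vblade invocation (the shelf/slot, interface, and backing device are placeholders):

vblade -s 0 0 eth1 /dev/drbd0    # O_SYNC: every write hits the backing store before the ack
vblade -d 0 0 eth1 /dev/drbd0    # O_DIRECT: bypass the page cache instead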

What's the Worst that Can Happen?

Caching/Buffering = Speed at the sacrifice of data security.  So how much are we willing to risk?

If a VM host dies, any uncommitted data dies with it.  We could say that data corruption is worse than no data at all.  The file systems commonly used by my VMs include journaling, so file system corruption in the midst of a write should be minimal as long as the writes are at least in order.  Best of all in this bad situation is that no writes can proceed after the host has died because it's, well, dead.

The next most terrible failure would be the death of a store node - specifically, the active store node.  AoE looks to be pretty darn close to fire-and-forget, without totally committing to the forget part.  Judging from the code, it sends back an acknowledgement to the writer once it has pushed the data to the backing store.  That's nice, except where the backing store is buffering up the data (or something in the long chain leading to the backing store, maybe still inside AoE itself).  So, without -s, killing a node outright caused data corruption and, in some cases, file system corruption.  The latter is exceedingly possible because the guest continues to write under the assumption that all its writes have succeeded.  As far as AoE is concerned, they have.  Additional writes to a broken journal after the connection is re-established on the surviving node will only yield horror, and quite possibly undetected horror on a production system that may not be checked for weeks or months.

A link failure between AoE client and server would stop the flow of traffic.  Not much evil here.  In fact, it's tantamount to a VM host failure, except the host and its guests are still operating...just in a sort-of pseudo-detached media state (they can't write and don't know why).  Downside here is that the JBD2 process tends to hang "forever" when enough writes are pushed to a device that is inaccessible for sufficient time ("forever" meaning long enough that I have to reboot the host to clear the bottleneck in a timely manner - lots of blocked-process messages appear in the kern.log when this happens, and everything appears to grind to a wonderful halt).  Maybe JBD2 would clear itself after a while, but I've found that the Windows guests are quite sensitive to write failures, more so than Linux guests, though even the Linux guests have trouble surviving when the store gets brutally interrupted for too many seconds.

Now What Do I Do?

The -s option to vblade causes significant latency for the client when testing with dbench.  Whether or not this is actually a show-stopper remains to be seen.  Throughput drops from around 0.58 MB/sec to 0.15 MB/sec.  This is of course with all defaults for the various and appropriate file system buffers that work on the client and the server, and also running everything purely virtual.  Hardware performance should be markedly better.

I was worrying about using AoE and the risk of migrating a VM while it was writing to an AoE-connected shared storage device (via something like GFS or OCFS2).  My concern was that if the VM was migrated from host A to host B, and was in the middle of writing a huge file to disk, the file writes would still be getting completed on host A while the VM came to life on host B.  The question of "what data will I see?" was bothering me.  I then realized the answer must necessarily be in the cluster-aware file system, as it would certainly be the first to know of any disk-writes, even before they were transmitted to the backing store.  There still may be room for worry, though.  Testing some hypotheses will be therapeutic. 

20130429

AoE, you big tease...

I did some more testing with AoE today.  I'll try to detail here what it does and doesn't appear to be.

Using Multiple Ethernet Ports

The aoe driver you modprobe will give you the option of using multiple ethernet ports, or at the very least selecting which port to use.  I'm not sure what the intended functionality of this feature is, because if your vblade server is not able to communicate across more than 1 port at a time, you're really not going to find this very useful.  The only way I've been able to see multi-gigabit speeds is to create RR bonds on both the server and the client.  This requires either direct-connect or some VLAN magic on a managed switch, since many/most switches don't dig RR traffic on its own.

I could see where this feature would work out well if you have multiple segments or multiple servers, and want to spread the load across multiple ports that way.  Otherwise I don't see much usefulness here.

How did I manage RR on my switch?

So, to do this on a managed switch, I created two VLANs for my two bond channels, and assigned one port from each machine to each channel.  Four switch ports, two VLANs, and upwards of 2Gb/sec bandwidth.  This is thus expandable to any number of machines if you can handle the caveat that should a machine lose one port, it will lose all ability to communicate effectively with the rest of its network over this bond.  This is because the RR scheduler on both sides expects all paths to be connected.  A sending port cannot see that the recipient has left the party if both are attached to a switch (which should always be online).  ARP monitoring might take care of this issue, maybe, but then I don't think it will necessarily tell you not to send to a client on a particular channel, and you'll need all your servers ARPing each other all the time.  Sounds nasty.
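For one side of that setup, the /etc/network/interfaces fragment looks roughly like this (assuming the ifenslave package is installed; names and addresses are mine to illustrate):

auto bond0
iface bond0 inet static
        address 10.10.10.1
        netmask 255.255.255.0
        bond-mode balance-rr
        bond-miimon 100
        bond-slaves eth1 eth2

Each slave's switch port then goes into its respective VLAN.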

AoE did handle RR traffic extremely well.  Anyone familiar with RR will note that packet ordering is not guaranteed, and you will most definitely have some of your later packets arriving before some of your earlier packets.  In UDP tests the out-of-order counts are usually not very large for small bandwidth tests.  The higher the transmission rate, the more out-of-order delivery you get.

The Best Possible Speed

To test the effectiveness of AoE, with explicit attention to the E part, I created a ramdrive on the server, seeded it with a 30G file (I have lots of RAM), and then served that up over vblade.  I ran some tests using dbench and dd.  To ensure that no local caching effects skewed the results, I had to set the various /proc/sys/vm/dirty_* fields to zero - specifically, ratio and background_ratio.  Without doing that, you'll see fantastic rates of 900MB/sec, which is a moonshot above any networking gear I have to work with.
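That is, for the duration of the test:

echo 0 > /proc/sys/vm/dirty_ratio
echo 0 > /proc/sys/vm/dirty_background_ratio

With both at zero, dirty pages get flushed more or less immediately, so the numbers reflect the wire and the store rather than the client's page cache.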

With a direct connection between my two machines, and RR bonds in place, I could obtain rates of around 130MB/sec.  The same appeared true for my VLAN'd switch.  Average latency was very low.  In dbench, the WriteX call had the highest average latency at 267ms.  Even flushes ran extremely fast.  That makes me happy, but the compromise is that there is no fault-tolerance, other than what we'd see if a whole switch dies - and that is, by the way, assuming you have your connections and VLANs spread across multiple switches.

Without all of that rigging, the next best thing is balance-alb, and then you're back to standard gigabit with the added benefit of fault-tolerance.  As far as AoE natively using multiple interfaces goes, the reality seems to be that this feature either doesn't exist like it's purported to, or it requires additional hardware (read: Coraid cards).  Since vblade itself requires a single interface to bind to, the best hope is a bond, and no bond mode except RR will utilize all available slaves for everything.  That's the blunt truth of it.  As for the aoe module itself, I really don't know what its story is.  Even with the machines directly connected and the server configured with a RR bond, the client machine did not seem to actively make use of the two adapters.

Dealing with Failures

One thing I like about AoE is that it is fairly die-hard.  Even when I forcefully caused a networking fault, the driver recovered once the connectivity returned and things returned to normal.  I guess as long as you don't actively look to kill the connection with an aoe-flush, you should be in a good state no matter what goes wrong.  

That being said, if you're not pushing everything straight to disk and something bad happens on your client, you're looking at some quantity of data now missing from your backing store.  How much will depend on those dirty_* parameters I mentioned earlier.  And catastrophic faults rarely happen predictably.

Of course, setting the dirty_* parameters to something sensible and greater than zero may not be an entirely bad thing.  Allowing some pages to get cached seems to lend significantly better latency and throughput.  How to measure the risk?  Well, informally, I'm watching the network via ethstatus.  The only traffic on the selected adapter is AoE.  As such, it's pretty easy to see when big accesses start and stop.  In my tests against the ramdrive store, traffic started flowing immediately and stopped a few seconds after the dbench test completed.  Using dd without the oflag=direct option left me with a run that finished very quickly, but that did not appear to be actually committed to disk until about 30 seconds later.  Again, kernel parameters should help this.

Hitting an actual disk store yielded similar results, with only occasional hiccups.  For the most part the latency numbers stayed around 10ms, with transfer speeds reaching over 1100MB/sec (however dbench calculates this, especially considering that observed network speeds never reached beyond an aggregate 70MB/sec).

Security Options

I'm honestly still not a fan of the lack of security features in AoE.  All the same, I'm allured to it, and want to now perform a multi-machine test.  Having multiple clients with multiple adapters on balance-alb will mean that I have to reconfigure vblade to MAC-filter for all those MACs, or not use MAC filtering at all.  That might be an option, and to that end perhaps putting it in a VLAN (for segregation sake, not for bandwidth) wouldn't be so bad.  Of course, that's all really just to keep honest people honest.

Deploying Targets

If we keep the number of targets to a minimum, this should work out OK.  I still don't like that you have to be mindful of your target numbers - deploying identical numbers from multiple machines might spell certain doom.  For instance, I deployed the same device numbers on another machine, and of course there is no way to distinguish between the two.  vblade doesn't even complain that there are identical numbers in use.  Whether or not this will affect targets that are already mounted and in-use, I know not.  The protocol does not seem to concern itself with this edge-case.  As far as I am concerned, the whole thing could be more easily resolved by using a runtime-generated UUID instead of just the device ID numbers.  I guess we'll see how actively this remains developed.

Comparison with iSCSI

I haven't done this yet, but plan to very soon.

Further Testing

I'll be doing some multi-machine testing, and also looking into the applicable kernel parameters more closely.  I want to see how AoE responds to the kind of hammering only a full complement of virtual machines can provide.  I also want to make sure my data is safe - outages happen far more than they should, and I want to be prepared.

20130424

AoE - An Initial Look

I was recently investigating ways to make my SAN work faster, harder, better.  Along the way, I looked at some alternatives.

Enter AoE, or ATA-over-Ethernet.  If you're reading this you've probably already read some of the documentation and/or used it a bit.  It's pretty cool, but I'm a little concerned about, well, several things. I read some scathing reviews from what appear to be mud-slingers, and I don't like mud.  I have also read several documents that almost look like copy-and-paste evangelism.  Having given it a shot, I'm going to summarize my immediate impressions of AoE here.

Protocol Compare: AoE is only 12 pages, iSCSI is 257.
That's nice, and simplicity is often one with elegance.  But that doesn't mean it's flawless, and iSCSI has a long history of development.  With such a small protocol you also lose, from what I can tell, a lot of the fine-tuning knobs that might allow for more stable operation under less-ideal conditions.  That being said, with such a small protocol it would hopefully be hard to screw up its implementation, either in software or hardware.

I like that it's its own layer-2 protocol.  It feels foreign, but it's also very, very fast.  I think it would be awesome to overlay options of iSCSI on the lightweight framework of AoE.

Security At Its Finest: As a non-routeable protocol, it is inherently secure.
OK, I'm gonna have to say, WTF?  Seriously?  OK, seriously, let's talk about security.  First, secure from who or what?  It's not secure on a LAN in an office with dozens of other potentially compromised systems.  It's spitting out data over the wire unencrypted, available for any sniffer to snag.

Second, it can be made routeable (I've seen the HOW-TOs), and that's cool, but I've never heard of a router being a sufficient security mechanism.  Use a VLAN, you say?  VLAN-jumping is now an old trick in the big book of exploits.  Keep your AoE traffic on a dedicated switch and the door to that room barred tightly.  MAC filtering to control access is cute, but stupid.  Sniff the packets, spoof a MAC and you're done.  Switches will not necessarily protect your data from promiscuous adapters, so don't take that for granted.  Of course, we may as well concede that a sufficiently-motivated individual WILL eventually gain access or compromise a system, whether it's AoE-based or iSCSI-based.  But I find the sheer openness of AoE disturbing.  If I could wrap it up with IPsec or at least have some assurance that the target will be extra-finicky about who/what it lets in, I'd be a little happier, even with degraded performance (within reason).

Then there's the notion that I just like to make sure human error is kept to a bare minimum, especially when it's my fingers on the keyboard.  Keeping targets out of reach means I won't accidentally format a volume that is live and important.  Volumes are exported by e-numbers, so running multiple target servers on your network means you have to manage your device exports very carefully.  Of course, none of this is mentioned in any of the documentation, and everyone's just out there vblading their file images as e0.0 on eth0.

Sorry, a little disdain there as the real world crashes in.  I'll try to curb that.

Multiple Interfaces?  Maybe.
If you happen to have several interfaces on your machine that are not busy being used for, say, bonding or bridging, then you can let AoE stream traffic over them all!  Bitchin'.  This is the kind of throughput I've been waiting for...except I can't seem to use it without some sacrifices.

For me, the problem starts with running virtual machines that sometimes need access to iSCSI targets.  These targets are "only available" (not totally but let's say they are) over adapters configured for jumbo frames.  What's more, the adapters are bonded because, well, network cables get unplugged sometimes and switches sometimes die.  The book said: "no single points of failure," so there.  But maybe it is not so much of an issue and I just need to hand over a few ports to AoE and be done with it?

The documentation makes it clear how to do this on the client.  On the server, it's not so clear.  I think you bond some interfaces with RR-scheduling, and then let AoE do the rest.  How this will work on a managed gigabit switch that generally hates RR bonding, I do not yet know.  I also have not (yet) been able to use anything except the top-most adapter of any given stack.  For example, I have 4 ports bonded (in balance-alb) and the bond bridged for my VMs.  I can't publish vblades to the 4 ports directly, nor to the bond, but I can to the bridge.  So I'm stuck with the compromise of having to stream AoE data across the wire at basically the same max rate as iSCSI.  Sadness.

Control and Reporting
I'm not intimately familiar with the vblade program, but so far it's not exactly blowing my skirt up.  My chief complaints to-date:  I want to be able to daemonize it in a way that's more intelligent than just running in the background;  I would like to get info about who's using what resources, how many computing/networking resources they're consuming, etc;  I had to hack up a resource agent script so that Pacemaker could reliably start and stop vblade - the issue seemed to involve stdin and stdout handling, where vblade kept crashing.

Since it's not nice to say only negatives, here are some positives:  It starts really fast; it's lightweight;  Fail-over should work flawlessly, and configuration is as easy as naming a device and an adapter to publish it on.  It does one thing really, really well: it provides AoE services.  And that's it.  It will hopefully not crash my hosts.

aoetools is another jewel - in mixed definitions of that word.  Again I find myself pining for documentation, reporting, statistics, load information, and a scheme that is a little more controllable and less haphazard-feeling than the way modprobe aoe just hands you your devices.  Believe me, I think it's cool that it's so simple.  I just somehow miss the fine-grained and ordered control of iSCSI.  Maybe this is just alien to me and I need to get used to it.  I fear there are gotchas I have not yet encountered.

It's FAST!
There's a catch to that.  The catch is that AoE caches a great deal of data on the initiator and backgrounds a lot of the real writing to the target.  So you know that guy that did that 1000 client test with dbench?  He probably wasn't watching his storage server wigging out ten minutes after the test completed.  My tests were too good to be true, and after tuning to ensure writes hit the store as quickly as possible, the real rates presented themselves.

I can imagine that where reading is the primary activity, such as when a VM boots, this is no biggie.  But when I may have a VM host suddenly fail, I don't want a lot of dirty data disappearing with the host.  That would be disastrous.

Luckily, they give some hints on tuneables in /proc/sys/vm.  At one point I cranked the dirty-pages and dirty-ratio settings all the way down to zero, just to see how the system responded.  dbench was my tool of choice, and I ran it with a variety of different client counts.  I think 50 was about the max my systems could handle without huge (50-second) latencies.  A lot of that is probably my store servers, which are both somewhat slow in the hardware and extremely safe (in terms of data corruption protection and total RAID failures).  I'll be dealing with them soon.

Other than that, I think it'd be hard to beat this protocol over the wire, and it's so low-level that overhead really should be at a minimum.  I do wish the kernel-gotchas were not so ominous; since this protocol is so low-level, your tuning controls become kernel tuning controls, and that bothers me a little.  Subtle breakage in the kernel would not be a fun thing to debug.  Read carefully the tuning documentation that is barely referenced in the tutorials (or not referenced at all - did I mention I would like to see better docs?  Maybe I'll write some here after I get better at using this stuff.).

Vendor Lock-in
I read that, and thought: "Gimme a break!"  Seriously guys, if you're using Microsoft, or VMware, you're already locked in.  Don't go shitting yourself about the fact there's only one hardware vendor right now for AoE cards.  Double-standards are bad, man.

Overall Impressions
So to summarize...

I would like more real documentation, less "it's so AWESOME" bullshit, and some concrete examples of various implementations along with their related tripping-hazards and performance bottlenecks.  (Again, I might write some as I go.)

I feel the system as a whole is still a little immature, but has amazing potential.  I'd love to see more development of it, some work on more robust and effective security against local threats, and some tuning controls to help those of us who throw 10 or 15 Windows virtuals at it.  (Yeah, I know, but I have no choice.)  If anyone is using AoE for running gobs of VMs on cluster storage, I'd love to hear from you!!

If iSCSI and AoE had a child, it would be the holy grail of network storage protocols.  It would look something like this:

  • a daemon to manage vblades, query and control their usage, and distribute workload.
  • the low-and-tight AoE protocol, with at least authentication security if not also full data envelope (options are nice - we like options.  Some may not want or need security, some of us do).
  • target identification, potentially, or at least something to help partition out the vblade-space a little better.  I think of iSCSI target IDs and their LUNs, and though they're painful, they're also explicit.  I like explicitness.
  • Some tuning parameters outside the kernel, so we don't feel like we're sticking our hands in the middle of a gnashing, chomping, chortling machine.

Although billed as competition to iSCSI, I think AoE actually serves a slightly different audience.  Whereas iSCSI provides a great deal of control and flexibility in managing SAN access for a wide variety of clients, AoE offers unbridled power and throughput on a highly controlled and protected network.  I really could never see using AoE to offer targets to coworkers or clients, since a single slip-up out on the floor could spell disaster.  But I'm thinking iSCSI may be too slow for my virtualization clusters.

iSCSI can be locked down.  AoE can offer near-full-speed data access.

Time will tell which is right for me.


20130417

Sustainable HA MySQL/MariaDB

I ran into this problem just yesterday, and thought I'd write about what I'm trying to do to fix it.  Use at your own risk; hopefully this will work well.

I needed to run updates on my DB cluster.  It's a two-node cluster, and generally stable on Ubuntu 12.04.2 LTS.  Unfortunately, the way I had configured my HA databases, one of the nodes completely broke when I ran updates.  The update process failed because it couldn't start MariaDB, and MariaDB couldn't start because the database files were nowhere to be found on that node at that time.

Not liking the idea of having to update the database server "hot," then migrating over to the second node and updating it "hot" again, I thought perhaps this would be a good time for some manual package management.  This would mean the following:

  • I'd have to get the packages manually and configure the essentials accordingly - factory-default paths be damned!
  • No more automatic updates - a mixed bag: they're awesome when they work and terrible when they don't.  Luckily they usually "Just Work" (tm)
  • I'd have the latest and greatest that MariaDB has to offer.
  • I would have to be more mindful in the future about updates and making sure things don't break en route to a new version.

OK, so originally I had installed MariaDB via apt-get, and put the database files themselves on an iSCSI target.  I used bind-mounts to place everything (from configuration files to the actual db files) where MySQL/MariaDB expected everything to be.  For this fix, my first thought was to put the binaries (well, the whole MariaDB install) on the iSCSI target.  This would mean one upgrade, one copy of binaries, and only one server capable of starting said database.

That didn't work - Pacemaker needs access to the binaries to make sure the database isn't started elsewhere on the cluster.  So, I set up a directory structure as follows:
  • /opt/mariadb
    • .../versions/ (put your untarred-gzipped deployments here)
    • .../current --> links to versions/(current version you want to use)
    • .../var   --> this is where the iSCSI target will now be mounted
    • .../config  --> my.cnf and conf.d/... will be here

MariaDB offers precompiled tar.gz deployments, which is really nice.  I can put these wherever I want.  In this case I'm leaving myself an escape route for future upgrades by putting the fresh deployment files in a versions/ directory and linking to the version that I want to use.  No changes to configuration files or Pacemaker should be necessary, and upgrades won't stomp existing deployments this way.  Of course, back up your databases frequently and before each upgrade.

Inside /opt/mariadb/var, I've placed a log and db directory.  log originally came from /var/log, and has a variety of transaction logs in it.  The db folder contains the actual database files, what would normally be found in /var/lib/mysql.  

The configuration files MIGHT work under the /opt/mariadb/var folder, which would mean it ought to be named something more appropriate.  I left them out for the sake of having them always available on both nodes.  I felt this was a safer route and don't have time to experiment much.

The my.cnf file has to be properly configured.  I snagged the my.cnf file that the original MariaDB apt-get install provided, and changed paths accordingly.  Now there are no bind-mounts, and for all intents and purposes I could simply duplicate the entire /opt/mariadb directory on a new node and be up and running in no-time.  (New node deployment is technically untested as of this writing.)

Note that if you happen to be moving existing log files (especially a .index file), the .index file will contain file paths that need to be updated.  sed will be your friend here, and you can cat the file to see the contents.  Once everything is done, you should be able to perform the following command and see a successful MariaDB launch:
/opt/mariadb/current/bin/mysqld --defaults-file=/opt/mariadb/config/my.cnf
In case you don't know, here's how you shut down your successful launch:
/opt/mariadb/current/bin/mysqladmin --defaults-file=/opt/mariadb/config/my.cnf shutdown -p
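As for the .index rewrite mentioned above, it amounts to a sed one-liner; the old path and the index filename here are examples, so check what your installation actually uses:

sed -i 's|/var/log/mysql|/opt/mariadb/var/log|g' /opt/mariadb/var/log/mysql-bin.index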

The MySQL primitive in Pacemaker needs to be properly configured.  Here is what mine looks like:

primitive p_db-mysql0 ocf:heartbeat:mysql \
        params binary="/opt/mariadb/current/bin/mysqld" \
                config="/opt/mariadb/config/my.cnf" \
                datadir="/opt/mariadb/var/db" \
                pid="/var/run/mysqld/mysqld.pid" \
                socket="/var/run/mysqld/mysqld.sock" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="20s" timeout="30s"


So far, this new configuration seems to work.  Comments and suggestions are welcome.


20130312

Quick Notes: Samba 3, CTDB, Pacemaker, and Ubuntu 12.04

I started with these links:



Great stuff, but it doesn't appear to have been touched in a long time.  On Ubuntu, if you follow the path, you'll wind up broken.  Here are some intrusive fixes.

The CTDB scripts expect to find the "service" binary in /sbin.  Ubuntu 12.04 has it in /usr/sbin.  Provide a symbolic link:
  ln -s /usr/sbin/service /sbin/service

The CTDB event script 50.samba "correctly" identifies the system as a Debian one, but Samba here runs as two scripts: smbd and nmbd.  Fix the 50.samba script at the top where the switch gives us variables and what they should be.

        debian)
                CTDB_SERVICE_SMB=${CTDB_SERVICE_SMB:-smbd}
                CTDB_SERVICE_NMB=${CTDB_SERVICE_NMB:-nmbd}
                CTDB_SERVICE_WINBIND=${CTDB_SERVICE_WINBIND:-winbind}


That will get this resource working "correctly":

primitive p_ctdb ocf:heartbeat:CTDB \
        params ctdb_recovery_lock="/opt/samba0/samba/ctdb.lock" ctdb_manages_samba="yes" ctdb_manages_winbind="yes" ctdb_start_as_disabled="yes" \
        op monitor interval="10" timeout="20" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100"
Of course, make sure to install winbind along with Samba3 before executing.  Watch this command:
   ctdb --socket=/var/lib/ctdb/ctdb.socket status

and the log file in /var/log/ctdb/log.ctdb while trying to start the resource, in case you have further problems.  I went with the special cluster IP resource, per the tutorial link at the top, so I expect that the DISABLED status of CTDB is normal; I hope I am right.  Other than making sure Samba actually starts, I've not connected any clients to it yet.  But without the above, Samba will not start.  Also, I did not find any useful resource agents for running Samba itself from Pacemaker, and attempting to use the LSB script appears to break.  I wonder if it breaks for the same reason that the 50.samba script can't use it?  (50.samba calls the /etc/init.d/smbd script when it can't find /sbin/service, but that for whatever reason fails to function correctly - smbd never starts.  Invoking smbd via service appears to work fine.)

NB.  For the record, I did find one ocf script that was written for Samba on Gentoo.  It looked promising, but I didn't feel like trying to port it over to Ubuntu 12.04.


20130305

open-iscsi and pacemaker, connection issue workaround

I ranted previously - much to my chagrin, because I had not looked closely enough at the problem.  What I initially took as a problem with either open-iscsi or its init.d script actually appears, at the moment, to be unrelated to those things.

The problem seems non-deterministic, and I haven't pinned down exactly where the failure occurs, but here's the landscape:

We have two database server nodes attached via iSCSI to our little home-made SAN.  On each machine we have /etc/iscsi/nodes directories that are chock FULL of various targets.  However they got there, they're there, and they're not going away on their own.  Reboot as much as you like.  Now, something happens...  In my case, it was something in the combination of the SAN going completely down, one of the two attached nodes surviving the outage, and one requiring reboot (because its VM image was...well...also on the SAN).

I had learned from prior experience that when the nodes directory is loaded with junk, something fails when Pacemaker tries to reconnect the iSCSI resource.  Maybe it's the resource script.  Maybe it's open-iscsi.  Who knows!  And better yet, it's not guaranteed to fail, although I noticed a lot of failures while I was testing the fencing of my nodes.  Node would go down, node would come back up, iSCSI would NOT reconnect.  Errors galore.

What I do know is that cleaning out the /etc/iscsi/nodes folder on boot tends to make this problem go away, 99.999% guaranteed.

On some clusters I have a shell script called from /etc/rc.local that kills off anything left lingering in the /etc/iscsi/{nodes,send_targets}/ folders.  Here's another way - add the following to /etc/fstab:
none /etc/iscsi/nodes ramfs defaults 0 0
none /etc/iscsi/send_targets ramfs defaults 0 0
The contents of these two folders, which appear to be relatively inconsequential (if you're not using any automatic iSCSI targets), will go away on reboot.  (The last two fstab fields are zero because there's nothing to dump or fsck on a ramfs.)  They don't take up much room anyway, so hopefully a ramdrive is within your budget.
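For the record, the rc.local variant boils down to something like this (a sketch, not my exact script):
rm -rf /etc/iscsi/nodes/* /etc/iscsi/send_targets/*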

Applies to Ubuntu 11.10 and 12.04.


20130225

"Quick" notes on OCFS2 + cman + pacemaker + Ubuntu 12.04

Ha ha - "quick" is funny because now this document has become huge.  The good stuff is at the end.

Getting this working is my punishment for wanting what I evidently ought not to have.

When configuring CMAN, thou shalt NOT use "sctp" as the DLM communication protocol.  ocfs2_controld.cman does not seem to be compatible with it, and will forever bork itself while trying to initialize.  This presents as something like:
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 1 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 2 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 4 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 8 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 16 times while opening checkpoint "ocfs2:controld:00000003", still trying
Feb 22 13:29:46 hv03 ocfs2_controld[7402]: TRY_AGAIN seen 32 times while opening checkpoint "ocfs2:controld:00000003"

And it goes on forever.

To make the init scripts work, some evil might be required...
In /etc/default/o2cb, add 
O2CB_STACK=cman
The /etc/init.d/o2cb script tries to start ocfs2_controld.cman before it should.  Commenting out the appropriate line makes the script do just part B - setting up everything it should - so that CMAN can handle parts A and C.  Or you can run the o2cb script AFTER cman starts, and not worry that CMAN is "controlling" o2cb...which it really doesn't anyway.

The fact is, the cman script's main job is to call a lot of other utilities and start a bunch of daemons.  cman_tool will crank up corosync, and configure it from the /etc/cluster/cluster.conf file.  It ignores /etc/corosync/corosync.conf entirely, as proved by experimentation and documented by the cman author(s).  As far as o2cb is concerned, cman runs ocfs2_controld.cman only if it finds the appropriate things configured in the configfs mount-point.  It won't find those unless you've configured them yourself or run a modified o2cb init script.

Now, it gets better.  The /etc/default/o2cb file doesn't document the cluster stack option - you have to find that out by reading the o2cb init script instead.  If you let the default stack (which is, ironically, "o2cb") stand, then ocfs2_controld.cman won't run and instead complains that you're using the wrong cluster stack.  Of course, running with the default stack then runs the default ocfs2_controld, which doesn't complain about anything at all.  But does it play nice with cman and corosync and pacemaker??

Fact is, it doesn't play with cman/corosync/pacemaker at all when it plays as an "o2cb" stack.

How is this a big deal?

The crux of all of this comes down to fencing.  OK, so suppose you have a cluster and OCFS2 configured and something terrible happens to one node.  That node gets fenced.  Then what?  Well, OCFS2 and everyone else involved should go on with life, assuming quorum is maintained.

When o2cb is configured to use the o2cb stack, it appears to operate sort of "stand-alone," meaning it doesn't seem to talk to the corosync/pacemaker/cman stack.  It doesn't get informed when a node dies, it has to find this out on its own.  Moreover, it does its own thing regardless of quorum.  Here's the thing I just did:  configure a two node cluster, configure o2cb to use the o2cb stack, and then crash one of the two nodes while the other node is doing disk access (I'm using dbench, just because it does a lot of disk access and gives latency times - great way to watch how long a recovery takes!).

Watching the log on the surviving node (s-node), you can see the o2cb stack recover based on (I assume) the timeouts configured in the /etc/default/o2cb file.  About 90 seconds later access to the OCFS2 file system is restored, regardless of the state of the crashed node (c-node).

Now the good of this is that when you start up and shut down the o2cb stack and the cman stack, they don't care about each other.  This is great because on Ubuntu these start-up and shut-down sequences seem to be all fucked up.  More about that later.  The bad news is that because these stacks are not talking, the recovery takes (on my default-configured cluster) 90 seconds, which would probably nuke any VM instances running on it and wreak all sorts of havoc.  Not acceptable, and I'm not crazy about modifying defaults downward when the documentation says (and I paraphrase): "You might want to increase these values..."

Reconfigure o2cb to use the cman stack instead (O2CB_STACK=cman).  Start the o2cb service, ignore the ocfs2_controld.cman failure, and start the cman service.  Cman starts ocfs2_controld.cman.  Update the OCFS2 cluster stack, mount, and start another dbench on s-node.  Crash c-node.  This time o2cb appears to find out from the three amigos that c-node died.  However, quorum is managed by cman, and since it's a two-node cluster it halts cluster operations (such as recovery) until quorum is reestablished.  This can be done simply by restarting cman (regardless of o2cb) on c-node...once c-node is rebooted.  Unfortunately, if you're not watching your cluster crash, it could be many minutes or hours before you notice that s-node isn't able to access its data.  Or maybe never, if c-node died due to, say, releasing its magic smoke.

What else to do?  cman documentation dictates using the two_node="1" and the expected_votes="1" attributes in the cman configuration tag in /etc/cluster/cluster.conf.  Now a single node is quorate.  Let's start dbench on s-node and crash c-node again.  Recovery after c-node bites the dust takes place after about 30 seconds of downtime.  That's better.  After adding some options to configure totem for greater responsiveness (hopefully not at the cost of stability), the only thing that takes a long time now is the ocfs2 journal replay.  And that's only because my SAN is overworked and under-powered.  Donations, anyone?
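For reference, those attributes live on the cman tag in /etc/cluster/cluster.conf.  A minimal sketch - the cluster name and node entries are placeholders, and your fencing configuration still belongs in here too:

<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="s-node" nodeid="1"/>
    <clusternode name="c-node" nodeid="2"/>
  </clusternodes>
</cluster>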

Lessons Learned

To get the benefit of ocfs2 + cman + pacemaker (under Ubuntu), you need to have ocfs2_controld.cman and it has to run when "cman" is running.  That is to say, when some particular daemons - notably dlm_controld - are running.

ocfs2 can run on its own (o2cb stack), but then you lose quorum control, so to speak, and it has to be configured and managed separately from cman-and-friends.  Ugly.

For two-node clusters, make absolutely sure you have correctly configured cman to know it's a two-node cluster and expect only one vote cluster-wide, otherwise there will be no recovery for node S when node C dies.  Two node clusters under cman demand:  two_node="1" and expected_votes="1"

ocfs2_controld.cman does NOT like to talk to the DLM via sctp.  You must NOT use sctp as the communication protocol.

When configuring cluster resources, about the only things you need under this setup are a connection to the data source and a mount of the store.  In my case, that's an iSCSI initiator resource and a mount of the OCFS2 partition once I'm connected to the target.  There is NO:
  • dlm_controld resource
  • o2cb control resource
  • gfs2 control resource
Basically, Pacemaker will not be managing any of those low-level things, unlike what you had to do back in Ubuntu 11.10.  Literally all I have in my cluster configuration is fencing, the iSCSI initiator, and the mount.  If you do anything else with the above three resources, you will find much pain when trying to put your nodes into standby or do anything with them other than leaving them running forever and ever.

Start-up sequence:
(Update 2013-02-28: The start-up order can now be as listed below.  ocfs2_controld.cman will connect to the dlm.  However, shutdown must take an alternate path.)
  1. service cman start
  2. service o2cb start
  3. service pacemaker start
If you start o2cb first: o2cb WILL complain about not being able to start ocfs2_controld.cman.  Let it complain, modify the init script to not even try, or start cman first and don't worry that cman won't try to start ocfs2_controld.cman.  But you MUST use "start" and not "load", because otherwise the script will not configure the necessary attributes under configfs (/sys/kernel/config) and cman will see an o2cb-leaning ocfs2 cluster instead of a cman-leaning one.

Shutdown is almost the reverse.  Whether you start o2cb then cman, or cman then o2cb, you must kill cman before killing o2cb.  Sometimes on shutdown, I think fenced dies before cman can kill it, and the cman init script throws an error.  Run it again ("service cman stop" - yes, again), and when it completes successfully you can do "service o2cb stop".  If you try to stop o2cb before cman is totally dead, you will wind up with a minor mess.  Given all of this, I'd recommend disabling all of these scripts from being run at system boot.
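Spelled out, the shutdown sequence looks like this (stopping pacemaker first is my assumption, on the grounds that it started last):
  1. service pacemaker stop
  2. service cman stop (run it a second time if fenced died early and it errors out)
  3. service o2cb stop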

CMAN-based o2cb requires O2CB_STACK=cman in /etc/default/o2cb.

If you are upgrading from Ubuntu 11.10 to 12.04, and you want to move your ocfs2 stack from whatever it's named to cman, remember to run tunefs.ocfs2 --update-cluster-stack [target] AFTER you have o2cb properly configured and running under cman.  This means your whole cluster will be unable to use that particular ocfs2 store during the update, but if you're doing this kind of upgrade you probably shouldn't be using it live anyway.  Since I had my resources in groups, I configured the mount to be stopped before bringing the nodes up, and allowed the iSCSI initiator to connect to the target.  Then I was able to update the stack and start the mount resource, which succeeded as expected.

I hope you find this information useful.

20130205

HA MySQL (MariaDB) on Ubuntu 12.04 LTS

A few notes concerning this.


The tutorial provided on the Linbit site for HA-mysql is totally AWESOME!  Highly recommended.  It will get you 99% of the way there.

The resource definition for the MySQL server instance on Ubuntu 12.04 varies slightly due to AppArmor's need for things to line up neatly.  Specifically, the file names for the pid and socket files must be correct.  Referencing the original Ubuntu configuration, we have this for a resource:
primitive p_db-mysql0 ocf:heartbeat:mysql \
        params binary="/usr/sbin/mysqld" \
               config="/etc/mysql/my.cnf" \
               datadir="/var/lib/mysql" \
               pid="/var/run/mysqld/mysqld.pid" \
               socket="/var/run/mysqld/mysqld.sock" \
               additional_parameters="--bind-address=127.0.0.1" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="20s" timeout="30s"

Of course, the bind-address listed here is only for testing and must be changed to the bind address of the virtual IP that will be assigned to the database resource group.
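For example, if the resource group's virtual IP were 192.168.10.50 (a placeholder), the line would become:
               additional_parameters="--bind-address=192.168.10.50"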

I chose to have the database files stored on iSCSI, since my iSCSI SAN is HA already.  I realize that there is still the possibility of network switch failure causing runtime coma, but if that happens then there will be much larger problems at hand, since both database servers (two node cluster) are virtual machines.  To that end I must remember to configure them for virtual STONITH.

I'm still not sure virtualized database servers are the best idea; I can think of a few reasons not to love them, but also a few reasons to totally dig them. 

Minuses:
  • The VM is subject to the same iSCSI risks as the databases' backing store right now - dedicated DRBD would be better; in my case, this isn't really applicable because the VMs are actually on DRBD and hosted via iSCSI, so I'd be doing double-duty there.
  • A VM migration SHOULDN'T cause any sort of db-cluster failure, but we will have to test to know for certain.  Perhaps modifying the corosync timeouts will be a beneficial thing.
Pluses:
  • The standard reason: hardware provisioning!!  No need to stand up more hard drives to watch die, or use more power than what I'm already using.
  • VMs means easy migration to other places, like a redundant VM cluster for instance.
  • Provisioning additional cluster nodes should be relatively painless.
  • The iSCSI backing store will soon be using ZFS, which will be more difficult to do for standalone nodes unless I spend $$ on drives, and ideally hot-swap cages.
  • If one of the VMs dies suddenly, we still won't suffer (hopefully) a major database access outage.  I'd like to move all internal database use over to this cluster, ultimately.  I am tempted to even put an LDAP server instance on there.  Then it can be all things data-access-related.
Hopefully I get more than I paid for.


Concerning MariaDB

This appears to be where the future is going, and more than one distro agrees.  So, to that end, I looked at the specs and the ideas behind MariaDB.  Satisfied that it was designed to be a literal "drop-in replacement" for MySQL, I immediately transitioned both machines over.  Now we will see how well it really works.  I had to follow their instructions on adding their repo to my servers (a rough sketch follows).  The upgrade was painless, and all I have left now is to set up the virtual IPs and start connecting machines to the database instances.
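From memory, the repo setup amounted to something like this on 12.04 - treat the key ID and especially the mirror URL as placeholders, and grab the real values from MariaDB's repository configurator:
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xcbcb082a1bb943db
add-apt-repository 'deb http://mirror.example.org/mariadb/repo/5.5/ubuntu precise main'
apt-get update && apt-get install mariadb-server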

Concerning PostgreSQL

My DB cluster also hosts PostgreSQL 9.1.  This guide was followed to a tee and works quite well, as far as I have tested so far:

http://wiki.postgresql.org/images/0/07/Ha_postgres.pdf



20130128

LVM on ZFS

I don't know why one would ever do this, but here is an esoteric trick in case you're ever interested in turning a ZFS volume into an LVM physical device...

I create a ZFS volume:
zfs create -V 1G trunk/funk
To turn this into a PV, evidently one cannot simply pvcreate either /dev/trunk/funk or /dev/zd0 (in this case).  LVM complains that it cannot find the drive or that it was filtered out.  Without digging through LVM's options, I chose what feels like a very dirty but successful approach - loopback devices:
losetup /dev/loop0 /dev/trunk/funk
pvcreate /dev/loop0
Voila!  Now I have a ZFS backing store for my LVM, meaning I can pvmove all sorts of interesting things into ZFS and then back out, without invoking a single ZFS command.  Not that I have anything against ZFS commands, mind you.

The Good

You can do what I mentioned above with regard to LVM's logical extents (a sketch follows).  You get to use familiar tools, and can migrate between two different volume managers...sort of.
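A minimal sketch of such a migration (the volume group and device names are hypothetical):
vgextend data /dev/loop0          # add the ZFS-backed PV to an existing VG
pvmove /dev/sdb1 /dev/loop0       # push extents onto the ZFS volume
pvmove /dev/loop0 /dev/sdb1       # ...and pull them back out again
vgreduce data /dev/loop0          # detach the loopback PV when done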

The Bad

The loopback device does not survive reboot; you have to losetup it again and run pvscan to get your volumes back (a sketch follows).  Thus, it's not a transparent solution for things like moving your root partition, or possibly even your /usr folder.  Since you're cramming data through three virtual devices instead of one, you also necessarily take a performance hit.  I figured this would be the case going in, but wanted to see what could be done.
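In the meantime, a couple of lines in /etc/rc.local (or an init script) will bring it back - assuming the same names as above:
losetup /dev/loop0 /dev/trunk/funk   # re-create the loopback mapping
pvscan                               # let LVM rediscover the PV
vgchange -ay                         # reactivate any affected volume groups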

The Ugly

Here are some results from two tests.  In both tests, dbench was run for 120 seconds with 50 clients.
Vol-1 here is a direct ZFS volume, 2G in size, formatted with XFS and mounted locally.

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    3550706     0.084   184.590
 Close        2607715     0.027   114.792
 Rename        150360     0.049    14.309
 Unlink        717466     0.338   180.865
 Deltree          100    12.302   106.897
 Mkdir             50     0.003     0.012
 Qpathinfo    3218035     0.005    25.145
 Qfileinfo     564208     0.001     7.601
 Qfsinfo       590419     0.003     6.306
 Sfileinfo     289141     0.042    14.303
 Find         1244465     0.014    18.228
 WriteX       1772158     0.026    17.727
 ReadX        5566389     0.006    19.958
 LockX          11566     0.004     2.074
 UnlockX        11566     0.003     5.996
 Flush         248977    20.776   264.706
Throughput 931.291 MB/sec  50 clients  50 procs  max_latency=264.710 ms
Vol-2 was my ZFS -> losetup -> LVM volume, also roughly 2G in size and formatted with XFS (and mounted locally):



 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    2019488     0.112   346.872
 Close        1481790     0.032   246.645
 Rename         85494     0.064    23.981
 Unlink        409039     0.385   324.935
 Qpathinfo    1830417     0.007    90.319
 Qfileinfo     318618     0.001     8.176
 Qfsinfo       335946     0.004     8.134
 Sfileinfo     164346     0.035    22.344
 Find          707310     0.019    67.141
 WriteX        996612     0.035   134.271
 ReadX        3163203     0.008    19.165
 LockX           6556     0.004     0.117
 UnlockX         6556     0.003     0.423
 Flush         141610    38.198   420.011
Throughput 524.834 MB/sec  50 clients  50 procs  max_latency=420.017 ms


Other Thoughts

It's possible that LVM is not treating the device very nicely, writing in 512-byte sectors instead of the 4K sectors that my ZFS pool has been configured to use.  If that were fixed, or if there were a way to avoid the loopback device, we might see better performance.  Maybe.

Conclusion

The moral of this story is:  You can do it, but it'll perform like shit.



20130115

Hot Add/Remove Strangeness

I'm encountering a strange phenomenon.  I was busy playing with the ZFS add/remove/online/offline functions, to get a better feel for how it does its thing.  (On that note, it seems to me that ZFS has to really decide a device is actually BAD before it will initiate replacement with a hot-spare.  I can't find a way to force it, so maybe I don't really understand how ZFS views hot-spares.  Better to keep some spare devices on hand, I guess.)

I did the following experiment:

  1. Offline a disk via zpool.
  2. Remove said disk by deleting it from the system.
  3. Pop the disk out of the array, then pop it back in so the controller will think it was replaced.
  4. Rescan the SCSI buses and do a udevadm trigger.
  5. If the disk was found, bring it back into the zpool.
What I found interesting was that the device was not always, well, fully attached to the system.  Specifically, when searching for the device directory under /sys (find /sys -iname "6:1:5:0" in this case), I would normally see three entries:
  • /sys/scsi_device/6:1:5:0
  • /sys/bsg/6:1:5:0
  • /sys/scsi_disk/6:1:5:0
Occasionally only the first two would appear.  With the third missing, the device never appeared to the kernel beyond a log entry reporting that the "scsi generic" device was added.  There would be no drive letter assigned, no report on its write-caching, etc.  Feels like a race condition.

In order for the device to appear, you can issue an "echo 1 > /sys/scsi_device/6\:1\:5\:0/device/delete" and then rescan the buses AGAIN.  It should find it.  Or not.  Race condition...yes.... ;-)

I honestly don't know if this is a driver issue, a kernel issue, or a controller issue.  That the kernel SEES the device suggests the controller is not at fault.  What populates the scsi_disk portion of the sys tree?  That may be what is failing here.  I would have to dig deeper to know for certain, but am unsure where in the source to start...

For reference: this is on Ubuntu 12.04.1 LTS, currently running kernel 3.2.0-35-generic x86_64.

Hot-remove, Hot-add drives under Linux

UPDATE: See http://burning-midnight.blogspot.com/2013/01/hot-addremove-strangeness.html for some strangeness I encountered while doing the following...

I keep looking for this because I keep forgetting it.  Now I have two scripts that make my job a lot easier.  I also recently started using device aliasing under ZFSonLinux, meaning I can type things like "a1" and "b3" instead of scsi-1ATA_WDC_WD10JPVT-00A1YT0_WD-WXB1EA......

BUT the device aliasing has a downside; I'd still have to dig through the dev tree and match up zpool device names to their semi-real system counterparts...up till now!

Here's a script to scan every SCSI bus on the system, so that when you add a drive it should just find it (my system has 8 somehow, by the way):


for X in /sys/class/scsi_host/host*; do
  # "- - -" is a wildcard for channel/target/LUN: rescan everything on this host
  echo "- - -" > ${X}/scan
done

And here's a script I found and made one minor change to (had to fix what was maybe a typo or a difference in shells).  If you supply the exact device path (such as /dev/zpool/a5), it will hot-remove it for you.  Someone commented that calling the device by name (sda, sdb) works too, but this does not seem to be applicable where the ZFS device aliasing is concerned.  Anyway...

#!/bin/bash
# (c) 2009 by Dennis Birkholz (firstname DOT lastname [at] nexxes.net)
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You can receive a copy of the GNU General Public License at
# .

function usage {
        echo "Usage: $0 [device]"
        echo
        echo "Disable supplied SCSI device"
        exit
}

# Need a parameter
[ "$1" == "" ] && usage

# Verify the parameter exists and is a block device
( [ ! -e "$1" ] || [ ! -b "$1" ] ) && {
        echo "Supplied device does not exist or is not a block device." >&2
        exit 1
}

# Verify SCSI disk entries exist in /sys
[ ! -d "/sys/class/scsi_disk/" ] && {
        echo "Could not find SCSI disk entries in sys, aborting." >&2
        exit 2
}

# Get the major:minor device string; --dereference follows symlinks,
# which is what makes this work with ZFS's device aliases
major=$(stat --dereference --format='%t' "$1")
major=$(printf '%d\n' "0x${major}")
minor=$(stat --dereference --format='%T' "$1")
minor=$(printf '%d\n' "0x${minor}")
deviceID="${major}:${minor}"
echo "Major/Minor number for device '$1' is '${deviceID}'..."

# Walk the SCSI disks and delete the one whose block device matches
for device in /sys/class/scsi_disk/*; do
        [ "$(< ${device}/device/block/*/dev)" != "${deviceID}" ] && continue
        scsiID=$(basename "${device}")
        echo "Found SCSI ID '${scsiID}' for device '${1}'..."
        echo 1 > ${device}/device/delete
        echo "SCSI device removed."
        exit 0
done

echo "Could not identify device as SCSI device, aborting." >&2
exit 4
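Saved as, say, scsi-remove.sh (the name is mine), usage looks like:
./scsi-remove.sh /dev/zpool/a5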
I will say I'm a little disappointed that after all this time a better-publicized way of doing this hasn't emerged.  Of course, most of us don't hot-swap our drives, so maybe I shouldn't be TOO disappointed.

20130110

Fixing Balance-ALB (Mode 6) Bonding for KVM

I ended up contacting the netdev list, looking to see if the problems I was experiencing with Balance-ALB were fixable and if a fix would be accepted.

Good news!  It was already fixed!

Bad news...it's only fixed in the 3.8 release candidate right now.

The responder pointed me to the patch submission that fixed the issue at hand: balance-ALB would no longer stomp MACs that did not originate from the host itself.  Simple enough to apply to the 3.0 kernel, but there had been some other changes that caused both a hunk to fail and the build to fail.  I had to pull in a function from upstream and backport it into one of the headers.  The next challenge was getting the .deb packages built...I made the mistake of doing this on a ramdrive, not realizing it would compile everything three times and generate three images.  24G of ramdrive later, it was done.

The installation, at least, was easy enough...thanks to the .debs.  After rebooting, the bond worked correctly, and the MACs for all my virtuals are now visible and correct!

For posterity, this is the link I was given for the original patch:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=patch;h=567b871e503316b0927e54a3d7c86d50b722d955

Below is the patch for the 3.0 kernel.  The patch appears to build for kernels up to (but not including) the 3.7 series.  3.7 should work if you omit the etherdevice.h portion of the patch.

diff -uNr linux-3.0.0-a/drivers/net/bonding/bond_alb.c linux-3.0.0-b/drivers/net/bonding/bond_alb.c
--- linux-3.0.0-a/drivers/net/bonding/bond_alb.c        2013-01-10 12:47:53.000000000 -0500
+++ linux-3.0.0-b/drivers/net/bonding/bond_alb.c        2013-01-10 12:50:58.000000000 -0500
@@ -666,6 +666,12 @@
        struct arp_pkt *arp = arp_pkt(skb);
        struct slave *tx_slave = NULL;

+       /* Don't modify or load balance ARPs that do not originate locally
+        * (e.g.,arrive via a bridge).
+        */
+       if (!bond_slave_has_mac(bond, arp->mac_src))
+               return NULL;
+
        if (arp->op_code == htons(ARPOP_REPLY)) {
                /* the arp must be sent on the selected
                * rx channel
diff -uNr linux-3.0.0-a/drivers/net/bonding/bonding.h linux-3.0.0-b/drivers/net/bonding/bonding.h
--- linux-3.0.0-a/drivers/net/bonding/bonding.h 2011-07-21 22:17:23.000000000 -0400
+++ linux-3.0.0-b/drivers/net/bonding/bonding.h 2013-01-10 12:51:05.000000000 -0500
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include <linux/etherdevice.h>
 #include
 #include
 #include
@@ -431,6 +432,18 @@
 }
 #endif

+static inline struct slave *bond_slave_has_mac(struct bonding *bond,
+                                              const u8 *mac)
+{
+       int i = 0;
+       struct slave *tmp;
+
+       bond_for_each_slave(bond, tmp, i)
+               if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr))
+                       return tmp;
+
+       return NULL;
+}

 /* exported from bond_main.c */
 extern int bond_net_id;
diff -uNr linux-3.0.0-a/include/linux/etherdevice.h linux-3.0.0-b/include/linux/etherdevice.h
--- linux-3.0.0-a/include/linux/etherdevice.h   2011-07-21 22:17:23.000000000 -0400
+++ linux-3.0.0-b/include/linux/etherdevice.h   2013-01-10 12:51:16.000000000 -0500
@@ -275,4 +275,37 @@
 #endif
 }

+/**
+ * ether_addr_equal_64bits - Compare two Ethernet addresses
+ * @addr1: Pointer to an array of 8 bytes
+ * @addr2: Pointer to an other array of 8 bytes
+ *
+ * Compare two Ethernet addresses, returns true if equal, false otherwise.
+ *
+ * The function doesn't need any conditional branches and possibly uses
+ * word memory accesses on CPU allowing cheap unaligned memory reads.
+ * arrays = { byte1, byte2, byte3, byte4, byte5, byte6, pad1, pad2 }
+ *
+ * Please note that alignment of addr1 & addr2 are only guaranteed to be 16 bits.
+ */
+
+static inline bool ether_addr_equal_64bits(const u8 addr1[6+2],
+                                           const u8 addr2[6+2])
+{
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+        unsigned long fold = ((*(unsigned long *)addr1) ^
+                              (*(unsigned long *)addr2));
+
+        if (sizeof(fold) == 8)
+                return zap_last_2bytes(fold) == 0;
+
+        fold |= zap_last_2bytes((*(unsigned long *)(addr1 + 4)) ^
+                                (*(unsigned long *)(addr2 + 4)));
+        return fold == 0;
+#else
+        return !compare_ether_addr(addr1, addr2);
+#endif
+}
+
+
 #endif /* _LINUX_ETHERDEVICE_H */
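
For completeness, applying the patch and building packages goes roughly like this from the kernel source tree - a sketch, not necessarily the exact commands I used (the patch filename is made up):
cd linux-3.0
patch -p1 < bond-alb-3.0.patch
make oldconfig
make -j4 deb-pkg    # builds installable .deb packages of the patched kernel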