20130128

LVM on ZFS

I don't know why one would ever do this, but here's an esoteric trick in case you're ever interested in turning a ZFS volume into an LVM physical volume...

I create a ZFS volume:
zfs create -V 1G trunk/funk
To turn this into a PV, evidently one cannot simply pvcreate either /dev/trunk/funk or /dev/zd0 (in this case); LVM complains that it cannot find the drive, or that it was filtered out.  Without digging through LVM's device-filter options, I chose what feels like a very dirty but successful approach - loopback devices:
losetup /dev/loop0 /dev/trunk/funk
pvcreate /dev/loop0
Voilà!  Now I have a ZFS backing store for my LVM, meaning I can pvmove all sorts of interesting things into ZFS and then back out, without invoking a single ZFS command.  Not that I have anything against ZFS commands, mind you.
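As a minimal sketch of that kind of migration - the volume group (vg0) and the source PV (/dev/sda5) here are hypothetical:

vgextend vg0 /dev/loop0        # add the loop-backed PV to an existing volume group
pvmove /dev/sda5 /dev/loop0    # shuffle extents off the old PV onto the ZFS-backed one
pvmove /dev/loop0 /dev/sda5    # ...and back out again, should you change your mind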

The Good

You can do what I mentioned above with regard to LVM's extents.  You get to use familiar tools, and you can migrate between two different volume managers...sort of.

The Bad

The loopback device does not survive a reboot; you have to losetup it again and run pvscan to get your volumes back.  Thus, it's not a transparent solution for things like moving your root partition, or possibly even your /usr folder.  And since you're cramming data through three virtual devices instead of one, you necessarily take a performance hit.  I figured this would be the case going in, but wanted to see what could be done.
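For reference, a minimal sketch of what has to happen after every boot (device names from the example above):

losetup /dev/loop0 /dev/trunk/funk    # re-attach the loop device to the zvol
pvscan                                # rediscover the PV sitting on it
vgchange -ay                          # re-activate any volume groups found there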

The Ugly

Here are some results from two tests.  In both tests, dbench was run for 120 seconds with 50 clients.
Vol-1 here is a direct ZFS volume, 2G in size, formatted with XFS and mounted locally.

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    3550706     0.084   184.590
 Close        2607715     0.027   114.792
 Rename        150360     0.049    14.309
 Unlink        717466     0.338   180.865
 Deltree          100    12.302   106.897
 Mkdir             50     0.003     0.012
 Qpathinfo    3218035     0.005    25.145
 Qfileinfo     564208     0.001     7.601
 Qfsinfo       590419     0.003     6.306
 Sfileinfo     289141     0.042    14.303
 Find         1244465     0.014    18.228
 WriteX       1772158     0.026    17.727
 ReadX        5566389     0.006    19.958
 LockX          11566     0.004     2.074
 UnlockX        11566     0.003     5.996
 Flush         248977    20.776   264.706
Throughput 931.291 MB/sec  50 clients  50 procs  max_latency=264.710 ms
Vol-2 was my ZFS -> losetup -> LVM volume, also roughly 2G in size and formatted with XFS (and mounted locally):

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    2019488     0.112   346.872
 Close        1481790     0.032   246.645
 Rename         85494     0.064    23.981
 Unlink        409039     0.385   324.935
 Qpathinfo    1830417     0.007    90.319
 Qfileinfo     318618     0.001     8.176
 Qfsinfo       335946     0.004     8.134
 Sfileinfo     164346     0.035    22.344
 Find          707310     0.019    67.141
 WriteX        996612     0.035   134.271
 ReadX        3163203     0.008    19.165
 LockX           6556     0.004     0.117
 UnlockX         6556     0.003     0.423
 Flush         141610    38.198   420.011
Throughput 524.834 MB/sec  50 clients  50 procs  max_latency=420.017 ms

Other Thoughts

It's possible that LVM is not treating the device very nicely, writing in 512-byte sectors instead of the 4K sectors my ZFS pool is configured to use.  If that were fixed, or if there were a way to avoid the loopback device entirely, we might see better performance.  Maybe.
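If you want to check which block sizes are actually in play, something like this should do it (pool and volume names from the example above):

zfs get volblocksize trunk/funk    # block size of the zvol itself
zdb -C trunk | grep ashift         # ashift=12 means the pool writes 4K sectors
blockdev --getbsz /dev/loop0       # block size the loop device reports to LVM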

Conclusion

The moral of this story is:  You can do it, but it'll perform like shit.



20130115

Hot Add/Remove Strangeness

I'm encountering a strange phenomenon.  I was busy playing with the ZFS add/remove/online/offline functions, to get a better feel for how it does its thing.  (To that end, it seems to me that ZFS has to really decide a device is actually BAD before it will initiate replacement with a hot spare.  I can't find a way to force it, so maybe I don't really understand how ZFS views hot spares.  Better to keep some spare devices on hand, I guess.)

I did the following experiment:

  1. Offline a disk via zpool.
  2. Remove said disk by deleting it from the system.
  3. Pop the disk out of the array, then pop it back in so the controller will think it was replaced.
  4. Rescan the SCSI buses and do a udevadm trigger.
  5. If the disk was found, bring it back into the zpool.  (A rough sketch of these steps follows below.)
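Roughly, with a hypothetical pool (tank) and disk (sdg), the zpool side of that loop looks like this; step 3 is the manual part:

zpool offline tank sdg                    # 1. take the disk out of service
echo 1 > /sys/block/sdg/device/delete     # 2. delete it from the system
                                          # 3. ...pop the disk out and back in...
for X in /sys/class/scsi_host/host?; do   # 4. rescan every SCSI host...
  echo "- - -" > ${X}/scan
done
udevadm trigger                           #    ...and poke udev
zpool online tank sdg                     # 5. bring it back into the pool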
What I found interesting was that the device was not always, well, fully attached to the system.  Specifically, when searching for the device directory under /sys (find /sys -iname "6:1:5:0" in this case), I would normally see three entries:
  • /sys/scsi_device/6:1:5:0
  • /sys/bsg/6:1:5:0
  • /sys/scsi_disk/6:1:5:0
Occasionally only the first two would appear.  With the third missing, the device never appeared to the kernel beyond a log entry reporting that the "scsi generic" device was added: no drive letter assigned, no report on its write caching, etc.  Feels like a race condition.

In order for the device to appear, you can issue an "echo 1 > /sys/scsi_device/6\:1\:5\:0/device/delete" and then rescan the buses AGAIN.  It should find it.  Or not.  Race condition...yes.... ;-)
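In script form, the workaround amounts to the following (SCSI address as in the example above; the backslashes are just harmless shell escaping):

echo 1 > /sys/scsi_device/6\:1\:5\:0/device/delete   # drop the half-attached device
for X in /sys/class/scsi_host/host?; do              # ...then rescan the buses AGAIN
  echo "- - -" > ${X}/scan
done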

I honestly don't know if this is a driver issue, a kernel issue, or a controller issue.  That the kernel SEES the device suggests the controller is not at fault.  What populates the scsi_disk portion of the /sys tree?  That may be what is failing here.  I would have to dig deeper to know for certain, but I'm unsure where in the source to start...

For reference: this is on Ubuntu 12.04.1 LTS, currently running kernel 3.2.0-35-generic x86_64.

Hot-remove, Hot-add drives under Linux

UPDATE: See http://burning-midnight.blogspot.com/2013/01/hot-addremove-strangeness.html for some strangeness I encountered while doing the following...

I keep looking for this because I keep forgetting it.  Now I have two scripts that make my job a lot easier.  I also recently started using device aliasing under ZFSonLinux, meaning I can type things like "a1" and "b3" instead of scsi-1ATA_WDC_WD10JPVT-00A1YT0_WD-WXB1EA......

BUT the device aliasing has a downside: I'd still have to dig through the dev tree and match up zpool device names to their semi-real system counterparts...until now!

Here's a script to scan every SCSI bus on the system, so that when you add a drive the kernel should just find it (my system somehow has 8 SCSI hosts, by the way):


# The "- - -" wildcard means all channels, targets, and LUNs on each host.
for X in /sys/class/scsi_host/host?; do
  echo "- - -" > ${X}/scan
done

And here's a script I found and made one minor change to (I had to fix what was either a typo or a difference in shells).  If you supply the exact device path (such as /dev/zpool/a5), it will hot-remove the device for you.  Someone commented that calling the device by name (sda, sdb) works too, but that does not seem applicable where the ZFS device aliasing is concerned.  Anyway...

#!/bin/bash
# (c) 2009 by Dennis Birkholz (firstname DOT lastname [at] nexxes.net)
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You can receive a copy of the GNU General Public License at
# <http://www.gnu.org/licenses/>.
function usage {
  echo "Usage: $0 [device]"
  echo
  echo "Disable supplied SCSI device"
  exit
}
# Need a parameter
[ "$1" == "" ] &&
usage
# Verify parameter exists
( [ ! -e "$1" ] || [ ! -b "$1" ] ) &&
echo "Supplied devices does not exist or is not a block device." >/dev/stderr &&
exit 1
# Verify SCSI disk entries exist in /sys
[ ! -d "/sys/class/scsi_disk/" ] &&
echo "Could not find SCSI disk entries in sys, aborting." >/dev/stderr &&
exit 2
# Get major/minor device string of device
major=$(stat --dereference --format='%t' "$1")
major=$(printf '%d\n' "0x${major}")
minor=$(stat --dereference --format='%T' "$1")
minor=$(printf '%d\n' "0x${minor}")
deviceID="${major}:${minor}"
echo "Major/Minor number for device '$1' is '${deviceID}'..."
for device in /sys/class/scsi_disk/*; do
  [ "$(< ${device}/device/block/*/dev)" != "${deviceID}" ] && continue
  scsiID=$(basename "${device}")
  echo "Found SCSI ID '${scsiID}' for device '${1}'..."
  echo 1 > ${device}/device/delete
  echo "SCSI device removed."
  exit 0
done
echo "Could not identify device as SCSI device, aborting." >/dev/stderr
exit 4
I will say I'm a little disappointed that after all this time no one has come up with a better-publicized way to do this.  Of course, most of us don't hot-swap our drives, so maybe I shouldn't be TOO disappointed.

20130110

Fixing Balance-ALB (Mode 6) Bonding for KVM

I ended up contacting the netdev list, looking to see if the problems I was experiencing with Balance-ALB were fixable and if a fix would be accepted.

Good news!  It was already fixed!

Bad news...it's only fixed in the 3.8 release candidate right now.

The responder pointed me to the patch submission that fixed the issue at hand: balance-ALB would no longer stomp MACs that did not originate from the host itself.  It was simple enough to apply to the 3.0 kernel, but other changes since then caused both a hunk to fail and the build to fail, so I had to pull in a function from upstream and backport it into one of the headers.  The next challenge was getting the .deb packages built...I made the mistake of doing this on a ramdrive, not realizing it would compile everything three times and generate three images.  24G of ramdrive later, it was done.
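For the curious, the process boiled down to something like this (a sketch from memory; the source path and patch file name here are made up, and make deb-pkg is just one of several ways to produce installable kernel packages):

cd /usr/src/linux-3.0
patch -p1 < ~/bond-alb-bridge.patch   # the backport below; one hunk needed hand-fixing
make oldconfig
make -j8 deb-pkg                      # emits .deb packages; allow a LOT of scratch space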

The installation, at least, was easy enough...thanks to the .debs.  After rebooting, the bond worked correctly, and the MACs for all my virtuals are now visible and correct!

For posterity, this is the link I was given for the original patch:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=patch;h=567b871e503316b0927e54a3d7c86d50b722d955

Below is the patch for the 3.0 kernel.  The patch appears to build for kernels up to (but not including) the 3.7 series.  3.7 should work if you omit the etherdevice.h portion of the patch.

diff -uNr linux-3.0.0-a/drivers/net/bonding/bond_alb.c linux-3.0.0-b/drivers/net/bonding/bond_alb.c
--- linux-3.0.0-a/drivers/net/bonding/bond_alb.c        2013-01-10 12:47:53.000000000 -0500
+++ linux-3.0.0-b/drivers/net/bonding/bond_alb.c        2013-01-10 12:50:58.000000000 -0500
@@ -666,6 +666,12 @@
        struct arp_pkt *arp = arp_pkt(skb);
        struct slave *tx_slave = NULL;

+       /* Don't modify or load balance ARPs that do not originate locally
+        * (e.g.,arrive via a bridge).
+        */
+       if (!bond_slave_has_mac(bond, arp->mac_src))
+               return NULL;
+
        if (arp->op_code == htons(ARPOP_REPLY)) {
                /* the arp must be sent on the selected
                * rx channel
diff -uNr linux-3.0.0-a/drivers/net/bonding/bonding.h linux-3.0.0-b/drivers/net/bonding/bonding.h
--- linux-3.0.0-a/drivers/net/bonding/bonding.h 2011-07-21 22:17:23.000000000 -0400
+++ linux-3.0.0-b/drivers/net/bonding/bonding.h 2013-01-10 12:51:05.000000000 -0500
@@ -18,6 +18,7 @@
 #include <linux/timer.h>
 #include <linux/proc_fs.h>
 #include <linux/if_bonding.h>
+#include <linux/etherdevice.h>
 #include <linux/cpumask.h>
 #include <linux/in6.h>
 #include <linux/netpoll.h>
@@ -431,6 +432,18 @@
 }
 #endif

+static inline struct slave *bond_slave_has_mac(struct bonding *bond,
+                                              const u8 *mac)
+{
+       int i = 0;
+       struct slave *tmp;
+
+       bond_for_each_slave(bond, tmp, i)
+               if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr))
+                       return tmp;
+
+       return NULL;
+}

 /* exported from bond_main.c */
 extern int bond_net_id;
diff -uNr linux-3.0.0-a/include/linux/etherdevice.h linux-3.0.0-b/include/linux/etherdevice.h
--- linux-3.0.0-a/include/linux/etherdevice.h   2011-07-21 22:17:23.000000000 -0400
+++ linux-3.0.0-b/include/linux/etherdevice.h   2013-01-10 12:51:16.000000000 -0500
@@ -275,4 +275,37 @@
 #endif
 }

+/**
+ * ether_addr_equal_64bits - Compare two Ethernet addresses
+ * @addr1: Pointer to an array of 8 bytes
+ * @addr2: Pointer to an other array of 8 bytes
+ *
+ * Compare two Ethernet addresses, returns true if equal, false otherwise.
+ *
+ * The function doesn't need any conditional branches and possibly uses
+ * word memory accesses on CPU allowing cheap unaligned memory reads.
+ * arrays = { byte1, byte2, byte3, byte4, byte5, byte6, pad1, pad2 }
+ *
+ * Please note that alignment of addr1 & addr2 are only guaranteed to be 16 bits.
+ */
+
+static inline bool ether_addr_equal_64bits(const u8 addr1[6+2],
+                                           const u8 addr2[6+2])
+{
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+        unsigned long fold = ((*(unsigned long *)addr1) ^
+                              (*(unsigned long *)addr2));
+
+        if (sizeof(fold) == 8)
+                return zap_last_2bytes(fold) == 0;
+
+        fold |= zap_last_2bytes((*(unsigned long *)(addr1 + 4)) ^
+                                (*(unsigned long *)(addr2 + 4)));
+        return fold == 0;
+#else
+        return ether_addr_equal(addr1, addr2);
+#endif
+}
+
+
 #endif /* _LINUX_ETHERDEVICE_H */


20130107

Balance-ALB Woes...

I've concluded my research and here is the answer.

It's dead, Jim.

If your virtuals' MACs are getting squashed, look no further than Balance-ALB (mode 6).

I'm not sure if this can be fixed or not, but right now it sucks.  After much testing and lots more reading, it seems this is a "known problem" and doesn't look like it's going to be fixed.  For reference, here's the configuration:

    vnet0 -> br0 -> bond0 -> eth1, eth2, ...  (note to self: make this a pretty picture)

Where bond0 is mode-6 over the listed interfaces.  ALB is supposed to balance transmits AND receives; to accomplish the latter, it apparently snags ARP replies off the wire and rewrites the source MAC to that of one of its slaves - picking the slave, if I recall correctly, in round-robin fashion.  Anyway, the problem seems to be that when ARPs for the virtuals under the bridge pass through, ALB snags those as well, stomps them, and sends out its own MAC.

Thus, the spice does not flow.

Symptoms include: intermittent ping, intermittent connectivity, ARP table reading one of the bond's MACs instead of the virtual's, headache, nausea, and some minor vomiting.
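If you want to watch the stomping happen, something like this from another machine on the LAN should show it (interface name and VM address are hypothetical):

ip neigh flush dev eth0    # clear the cache so fresh ARP requests are forced
tcpdump -eni eth0 arp &    # -e prints the Ethernet source MAC of each reply
ping -c3 192.168.1.50      # the VM; the replying MAC varies from run to run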

Nothing else, save mode-0 (which really doesn't count), does any sort of receive-side load-balancing.  What I would REALLY like to see is ALB handling ARP requests more intelligently, so that it doesn't squash those of the virtuals that are properly serviced elsewhere.  Ideally, it should not squash any ARP replies that have nothing to do with its own physical adapters.  That seems like it ought to be a relatively easy fix...except I don't know where in the code to fix it...yet.  I might try someday...

Use Mode-4, you say?  Nope, it doesn't load-balance receives under the bonding driver.  Check out libteam instead, and then go crazy trying to build it under Ubuntu 11.10.  It looks like it's in 12.04, except my stupid cluster doesn't run 12.04 because of issues with OCFS2 and the DLM and a lot of other bullshit that is taking way too much energy to solve.  Plus, Mode-4 is great if you're using a single switch; I'm using two, for redundancy.  I could go back to Sins of the Bond and team two mode-4's under maybe a mode-1 (active-backup), but then I'm kinda back where I started, with no really awesome receive-load-balancing.

The ALB problem is especially nefarious because occasionally the virtual's REAL MAC will appear in the ARP cache.  Now, just What The Fuck is up with that?  It makes me think this issue with ALB is more of a bug than a feature.

Temporary fixes in the meantime: switch to TLB (mode 5), or any other mode that doesn't involve borking ARPs.  Even though TLB load-balances transmits, it doesn't display the ARP hell that ALB suffers - which makes some sense, since only ALB's receive-balancing half needs to rewrite ARP replies in the first place.
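On Ubuntu, that's a one-line change in the bond stanza of /etc/network/interfaces; the interface names here are assumptions, and your stanza will differ:

auto bond0
iface bond0 inet manual
  bond-slaves eth1 eth2
  bond-mode balance-tlb    # was balance-alb; mode 5 skips the ARP rewriting
  bond-miimon 100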


20130103

Bridging, Bonding, and VM MACs...OH MY!!

I've been poring through various swaths of documentation and forum posts all day, to little avail.  Perhaps it's buried in the "how Linux bridging works" doc somewhere in the kernel files, but I'm gonna ask here for kicks.  I've noticed a strange trend, and ordinarily I wouldn't care, except I lose connectivity with my virtuals for short periods.  I have several hosts that all network by bridging virtual ethernet adapters with a bond of the physical adapters.  For sake of clarity: phy -> bond -> bridge, with mode-6.  So, I ping away at a host, and some of the pings come back "TTL Exceeded".  Not fun, especially since I also have trouble ssh'ing into the vm.  I can get to the console with VNC, play around with it, and even ping from the vm to, say, the firewall without any issues.  After some reading, the notion of MAC conflicts and bridging-and-bonding issues came up.

I pulled up Wireshark and started examining the ARP responses for said host.  The MAC given for the vm's IP varied among the bonded adapters - I half expected this, since mode-6 is supposed to load-balance automatically.  However, when I started pinging FROM my vm TO my Wireshark machine, the MAC in my cache suddenly changed to the vm's actual MAC.  Once the pinging stopped, the MAC eventually reverted to one of the physical adapters.

To sum up: pinging WS -> VM = bond MAC, whereas pinging VM -> WS = VM MAC.  It's like the VM's pings become unintentional gratuitous ARPs.
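A quick way to reproduce the observation from the workstation side (the VM's address here is hypothetical):

ping 192.168.1.50 &              # ping the VM from the workstation...
watch -n1 arp -n 192.168.1.50    # ...the cache shows one of the bond's slave MACs
# now ping the workstation FROM the VM's console and watch the cached
# entry flip to the VM's real MAC, then revert once the pinging stops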

So, what's the deal here?  Is this "working right" and my problem with intermittent connectivity somewhere else?  Is it normal for either the bridge or the bond to reply with their own selective MAC address to ARP requests?  But then why bother doing that when you could just as easily let the VM publish its own response?  Obviously the MAC isn't getting nuked when the VM transmits, since it appears in my ARP cache during that ping experiment.....   ??  I get the feeling there is a setting that needs to be modified.