20120914

Resizing a Live iSCSI Target

Having upgraded the hard drives in my SAN servers, I finally hit the size limits of my iSCSI targets and needed to make use of all that extra hard drive space.  Unfortunately, it seems there isn't a lot of information or nice tooling available to make this happen seamlessly.  That's OK; it wasn't as painful as I thought it was going to be.

My setup is as follows:

  - two storage hosts in a Pacemaker cluster
  - an mdadm RAID store on each host
  - LVM logical volumes carved from the RAID, one per store
  - DRBD devices replicating each logical volume between the hosts
  - iSCSI targets (ietd) exported from the DRBD devices
  - OCFS2 on the initiators, sitting on top of the iSCSI devices

The stores can be managed on either of the two cluster hosts, and usually this results in a splitting of the load.  The first requirement was, of course, to enlarge the RAID store.  That was easy with mdadm.  Second was to resize, via LVM, the two logical volumes that are used as backing stores for the DRBD devices.  Next, DRBD had to be told to resize each volume, which predictably caused a resync event to occur.  Once that was finished, it was time to notify the initiators.
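
For the record, the storage-side steps looked roughly like this; the device, volume, and resource names (and the sizes) are placeholders rather than my actual ones:
  mdadm --grow /dev/md0 --size=max        # let the array use the space on the upgraded drives
  pvresize /dev/md0                       # tell LVM the physical volume grew
  lvextend -L +250G /dev/vg_san/store0    # grow each DRBD backing volume
  lvextend -L +250G /dev/vg_san/store1
  drbdadm resize store0                   # tell DRBD; cue the resync
  drbdadm resize store1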

This is where a little trickery had to take place.  So far I've not really found anything that made it easy to tell ietd to "rescan" itself or to otherwise realize that its underlying target devices might have changed their sizes.  About the only thing I could really find was to basically remove and re-add the target, or restart it if you will.

Not really a fun idea, but at least Pacemaker gave me an out.  Instead of shutting down each target, I migrated each target and then unmigrated it back:
  crm resource migrate g_iscsistore0
  crm resource unmigrate g_iscsistore0

It's important to realize that you must wait for the migration to actually complete before un-migrating.  The un-migrate is used to remove the constraint that was automatically generated to force the migration.  This effectively causes the target restart I needed, and because the cluster is properly configured no initiators realized the connection was ever terminated.  This was important because the targets are very live and it's not easy to shut them down without shutting down several other machines.  This will probably be a problem for me in the future when I go to upgrade both the VM cluster that relies on these stores, and the storage cluster that serves them, to a newer release of Ubuntu Server.
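
A one-shot way to eyeball that the move has actually finished before issuing the unmigrate; the group's members should all show Started on the new node (the group name is from my setup, and the grep is just for brevity):
  crm_mon -1 | grep -A 3 g_iscsistore0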

In the meantime, I have now effectively resized the targets, and the next step is obviously the initiators.  I have this script to check for a resized device by occasionally asking open-iscsi to rescan:

rescan-iscsi.sh

#!/bin/bash
# Ask open-iscsi to rescan all logged-in nodes so resized LUNs get noticed.
/sbin/iscsiadm -m node -R > /dev/null
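
It's wired into root's crontab on each initiator, something like the line below; the path is simply where I happen to keep the script:
  */15 * * * * /usr/local/bin/rescan-iscsi.sh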


With that running every 15 minutes, by now all the machines in the cluster should have recognized the new device sizes.  I can now perform the resize online from one of the initiators:
  tunefs.ocfs2 -v -S /dev/sdc

The resize should be transparent and non-disruptive.  It only took a few minutes for each store to complete.  I now have two 500G iSCSI targets, ready for more data!

One thing I'd really like to do in the future is have my initiators NOT use /dev/sd? names.  I'm not quite sure yet how to do that.  I have run into problems where smartd would try to access the iSCSI targets via the initiator connection and cause the SAN nodes to die horrific deaths.  Not sure what that's about, either.

20120913

ZFS, Additional Thoughts

I am about to expand my ZFS array, and I'm a little bit stuck...not because I don't know what to do, but because I am reflecting on my experiences thus far.

I guess I just find ZFS a little, well, uncomfortable.  That's really the best word I can come up with.  It's not necessarily all ZFS' fault, although some of the fault does lie with it.  I'll try to enumerate what's troubling me.

First, the drive references - they recommend adding devices via their /dev/disk/by-id (or similarly unique-but-consistent) identifiers.  This makes sense in terms of making sure that the drives are always properly recognized and dealt with in the correct order.  Having been through some RAID hell with drive ordering, I can attest that there have been instances where I've cursed the seeming randomness of how the /dev/sd? identifiers are assigned.  That being said, my Linux device identifiers look like this:

    scsi-3632120e0a37653430e79784212fdb020

That's really quite ugly, and as I look through the list of devices to prune out which ones I've already assigned, I'm missing my little 3-character device names... a lot.  This doesn't seem to be an issue on OpenSolaris systems, but I can't/won't run OpenSolaris at this time.
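
For what it's worth, the long names can at least be mapped back to the short ones by listing the symlinks, since each by-id entry points at its ../../sdX node:
  ls -l /dev/disk/by-id/ | grep -v part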

Second, there's the obvious "expansion through addition" instead of "expansion through reshaping."  I want to believe that two RAID-5-style arrays will give me almost as much redundancy as a single RAID-6, but truth be told any two drives could fail at any time, and if both happen to land in the same four-drive array (roughly a 3-in-7 chance, assuming independent failures), that array is gone where a RAID-6 would have survived.  If Fate has its say, it's even more likely than that, just to piss off Statistics.

But this is what I've got, and I can't wait another 15 days for all my stores to resync just because I added a drive.  That will be even more true once the redundant server is remote again, and syncing is happening over a tiny 10Mbit link.  I'll just have to bite the bullet and build another raidz1 of four drives, and hope for the best.
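
The expansion itself should at least be just one command along these lines; the pool name and device IDs are obviously placeholders:
  zpool add tank raidz1 \
      /dev/disk/by-id/scsi-DISK5 /dev/disk/by-id/scsi-DISK6 \
      /dev/disk/by-id/scsi-DISK7 /dev/disk/by-id/scsi-DISK8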

Third, I'm just a little disturbed by the fact that once you bring something like a raidz online, there is no initial sync.  I guess the creators of ZFS might have thought it superfluous.  After all, if you've written no data, why bother to sync garbage?  It's just something I've come to expect from things like mdadm and every RAID card there is, but then again I suppose it isn't actually necessary.  I'm trying to find a counter-example, but so far I can't seem to think of a good one.

Fourth, the tools remind me a little of something out of the Windows age.  They're quite minimalist, especially when compared to mdadm and LVM.  Those latter two tools provide a plethora of information, and while not all admins will use it, there have been times I've needed it.  I just feel like the convenience offered by the ZFS command-line tools actually takes away from the depth of information I expect to have access to.  I know there is probably a good reason for it, yet it just isn't that satisfying.

The obvious question at this point is: why use it if I have these issues with it?  Well, for the simple fact that it does per-block integrity checks.  Nothing more.  That is the one killer feature I need because I can no longer trust my hard drives not to corrupt my data, and I can't afford to drop another $6K on new hard drives.  I want so badly to have a device driver that implements this under mdadm, but writing one still seems beyond the scope of my available time.

Or is it?

20120907

ZFS on Linux - A Test Drive

What Happened??

I have been suffering through recent data corruption events on my multi-terabyte arrays.  We have a pair of redundant arrays, intended for secure backup of all our data, so data integrity is obviously important.  I've chosen DRBD as an integral part of the stack, because it is solid and totally rocks.  I had gone with mdadm and RAID-6, after a few controller melt-downs left me with a bitter aftertaste.  Throw in some crypto and LVM and voilà!

Then a drive went bad.

There may even be more than one.  Even though smartd detected it, SeaTools didn't immediately want to believe that the drive was defunct.  It took a long-repair operation before the drive was officially failed.  Meanwhile, I have scoured the Internets in search of a solution to what is evidently a now-growing problem.

The crux of the issue is that drive technology is not necessarily getting better (in terms of error rates and ECC), but we are putting more and more data on it.  I've seen posts where people have argued convincingly that, mathematically, the published bit-corruption rates are now unacceptably high in very large data arrays.  I'm afraid that based on my empirical experience, I must concur.  No sectors reported unreadable, no clicking noises were observed, and yet I lost two file systems and an entire 2TB of redundant data.
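
For what it's worth, the back-of-the-envelope version of that argument, using the oft-quoted consumer-drive spec of one unrecoverable read error per 10^14 bits:
  1 error per 10^14 bits  =  1 error per 12.5 TB read
  one full pass over an 8 TB array: 8 / 12.5 = 0.64 expected errors,
  which is getting close to coin-flip odds of hitting at least one
And that only counts the errors the drive admits to; silent corruption isn't on the spec sheet at all.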

Thank goodness it was redundant, but now I seriously fear for the primary array's integrity, for it is composed of the same kind of drives, from the same manufacturer, as the array I am now rebuilding.  I guess I'm a little surprised that data integrity has never really been a subject of much work in the file-system community.  Then again, a few years ago I probably wouldn't have thought much of it myself; now that I am acutely aware of the value of data and the volatility of storage media, it's a big issue.

A Non-Trivial Fix


I had originally thought that drives would either return good data or error-out.  This was not the case.  The corruption was extremely silent, but highly visible once a reboot proved that one file system was unrecoverable and the metadata for one DRBD device was obliterated.  RAID of course did nothing - it was not designed to do so.  The author of mdadm has also, in forum posts, said that an option to perform per-read integrity checks of blocks from the RAID was not implemented, and would not be implemented...though it is probably possible.  That's unfortunate.

I looked for a block-device solution, something that works the way cryptsetup and DRBD do, acting as an intermediary between the physical medium and the remainder of the device stack.  Such a device would either check sector integrity and fail bad blocks upward as needed, or add ECC to sector data and attempt to fix it on the fly, failing only when the data was totally unrecoverable.

I considered some options, and decided that a validity-device would best be placed between the RAID driver and the hard drive, so that it could fail sectors up to the RAID and let the RAID recover them from the other disks.  This assumes that corruption contaminating two (or three) blocks of the same stripe on a RAID-5 (or 6) at the same time is statistically unlikely.

An ECC-device would probably be best placed after the RAID driver, but could also sit before it.  It might not be a bad idea to implement and use both - a validity-device before the RAID and an ECC-device after it.  Obviously we can no longer trust the hardware to do this sort of thing for us.

I performed a cursory examination of Low-Density Parity Check codes, or LDPC, but alas my math is not so good.  There are some libraries available, but writing a whole device driver isn't quite in my time-budget right now.  I'd love to, and I know there are others who would like to make use of it, so maybe someday I will.  Right now I need a solution that works out of the box.

The Options


The latest and greatest open-source file system is Btrfs.  Unfortunately, from what I've been reading, it's much too unstable to be trusted in production environments.  Despite the fact that I tend to take more risks than I should, I can't bring myself to go that route at this time.  That left ZFS, the only other file system with integral data-integrity checks.  It looked promising, but being unable to test-drive OpenSolaris on KVM did not please me.

ZFS-Fuse is readily available and relatively stable, but lacks the block-device manufacturing capability that native ZFS offers.  A happy, albeit slightly more dangerous alternative to this is ZFS-On-Linux (http://zfsonlinux.org/), a native port that, due to licensing, cannot be easily packaged with the kernel.  It can however be distributed separately, which is what the project's authors have done.  It offers native performance through a DKMS module, and (most importantly for me) offers the block-device-generating ZFS Volume feature.
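
On Ubuntu the install is a PPA and a metapackage away; roughly the following, though check the project page for the current repository name, as I'm going from memory here:
  add-apt-repository ppa:zfs-native/stable
  apt-get update
  apt-get install ubuntu-zfs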

Test-Drive!

ZFS likes to have control over everything, from the devices up to the file system.  That's how it was designed.  I toyed around with setting up a RAID and, through LVM, splitting it up into volumes that would be handled by ZFS - not a good idea: a single drive corrupting itself causes massive and wanton destruction of ZFS' integrity, not to mention that the whole setup would be subject to the same risks that compromised my original array.  So, despite my love of mdadm and LVM, I handed the keys over to ZFS.

I did some initial testing on a VM, by first creating a ZFS file system composed of dd-generated files, and then introduced faults.  ZFS handled them quite well.  I did the same with virtual devices, which is where I learned that mdadm was not going to mix well with ZFS.  I have since deployed on my redundant server and have started rebuilding one DRBD device.  So far, so good.
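
The file-backed experiment is easy enough to reproduce; this is roughly what I did, with made-up names and sizes:
  # four 256MB files standing in for disks
  for i in 1 2 3 4; do dd if=/dev/zero of=/tmp/zdisk$i bs=1M count=256; done
  zpool create testpool raidz1 /tmp/zdisk1 /tmp/zdisk2 /tmp/zdisk3 /tmp/zdisk4

  # clobber the middle of one "disk" to simulate silent corruption
  dd if=/dev/urandom of=/tmp/zdisk2 bs=1M count=16 seek=64 conv=notrunc

  # a scrub finds the checksum errors and repairs them from parity
  zpool scrub testpool
  zpool status -v testpool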

What I like about ZFS is that it does wrap everything up very nicely.  I hope this will result in improved performance, but I will not be able to gather any metrics at this time.  Adding devices to the pool is straightforward, and replacing them is relatively painless.  The redundancy mechanisms are also very nice.  It provides mirroring, RAID-5, RAID-6, and I guess what you could call RAID-(6+1) in terms of how many devices can fail in the array before it becomes a brick (one, one, two, and three devices respectively, in case you were wondering).

What I dislike about ZFS, and what seriously kept me from jumping immediately on it, was its surprisingly poor support for expanding arrays.  mdadm allows you to basically restructure your array across more disks, thus allowing for easy expansion.  It even does this online!  ZFS will only do this over larger disks, not more of them, so if you have an array of 3 disks then you will only ever be able to use 3 disks in that array.  On the bright side, you can add more arrays to your "pool", which is kind of like adding another PV to an LVM stack.  The downside is that if you have one RAID-6 with four devices, and you add another RAID-6 of four devices, you are now down four devices' worth of space when you could be down only two with mdadm's RAID-6 after restructuring.
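
For comparison, taking an mdadm RAID-6 from four devices to five is just a reshape, and the array stays online the whole time (device names illustrative):
  mdadm --add /dev/md0 /dev/sde
  mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-grow.backup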

So once you choose how your array is going to look to ZFS, you are stuck with it.  Want to change it?  Copy your data off, make the change, and copy it back.  I guess this is what people who use hardware RAID are accustomed to - I've become spoiled by the awesome flexibility of the mdadm/LVM stack.  At this point, however, data integrity is more important to me.

Consequently, with only 8 devices available for my ZFS target (and really right now only 7, because one has failed and been removed), I had to choose basic 1-device-redundancy RAIDZ and split the array into two 4-device sub-arrays.  Only one sub-array is currently configured, since I can't bring the other one up until I have replaced my failed drive.  With this being a redundant system, I am hopeful that statistics are on my side and that a dual-drive failure on any given sub-array will not occur at the same time as one on the sibling system.

We Shall See.