What Happened??
I have been suffering through recent data corruption events on my multi-terabyte arrays. We have a pair of redundant arrays, intended for secure backup of all our data, so data integrity is obviously important. I've chosen DRBD as an integral part of the stack, because it is solid and totally rocks. I had gone with mdadm and RAID-6 after a few controller melt-downs left me with a bitter aftertaste. Throw in some crypto and LVM and voilà!
Then a drive went bad.
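For the record, the stack looked roughly like this. The device names, sizes and the DRBD resource name are placeholders, not my exact commands - just a sketch of the layering:

  # software RAID-6 across the drives
  mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]
  # encryption on top of the array
  cryptsetup luksFormat /dev/md0
  cryptsetup luksOpen /dev/md0 crypt-md0
  # LVM on top of the encrypted device
  pvcreate /dev/mapper/crypt-md0
  vgcreate vg0 /dev/mapper/crypt-md0
  lvcreate -L 2T -n backup vg0
  # DRBD replicates the logical volume to the peer
  # (assumes a resource named "backup" is defined in drbd.conf
  #  with /dev/vg0/backup as its backing disk)
  drbdadm create-md backup
  drbdadm up backup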
There may even be more than one bad drive. Even though smartd detected the problem, SeaTools didn't immediately want to believe that the drive was defunct. It took a long repair operation before the drive was officially failed. Meanwhile, I have scoured the Internets in search of a solution to what is evidently a growing problem.
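If you want to poke at your own drives, smartctl exposes the same counters smartd watches; the device name here is just an example:

  # dump SMART health status and attribute counters
  smartctl -a /dev/sdc
  # kick off the drive's own extended self-test (can take hours)
  smartctl -t long /dev/sdc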
The crux of the issue is that drive technology is not necessarily getting better (in terms of error rates and ECC), but we are putting more and more data on it. I've seen posts where people argue convincingly that, mathematically, the published bit-corruption rates are now unacceptably high for very large data arrays. Based on my empirical experience, I'm afraid I must concur. No sectors were reported unreadable, no clicking noises were heard, and yet I lost two file systems and an entire 2TB of redundant data.
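As a back-of-the-envelope illustration of that math (using the commonly quoted consumer-drive spec of one unrecoverable read error per 10^14 bits, which may or may not match your particular drives):

  2 TB ~= 2 x 10^12 bytes ~= 1.6 x 10^13 bits
  expected errors per full pass ~= (1.6 x 10^13) / 10^14 ~= 0.16

That works out to roughly a 15% chance of hitting at least one bad read every time you sweep the entire array - and a RAID rebuild is exactly such a sweep. And that figure says nothing about the silent corruption that actually bit me.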
Thank goodness that data was redundant, but now I seriously fear for the primary array's integrity, for it is composed of drives of the same model and manufacturer as the array I am now rebuilding. I guess I'm a little surprised that data integrity has never really been a subject of much work in the file-system community; then again, a few years ago I probably wouldn't have thought much of it myself. Now that I am acutely aware of the value of data and the volatility of storage media, it's a big issue.
A Non-Trivial Fix
I had originally thought that drives would either return good data or error out. This was not the case. The corruption was completely silent, yet highly visible once a reboot proved that one file system was unrecoverable and the metadata for one DRBD device was obliterated. RAID, of course, did nothing - it was not designed to. The author of mdadm has said in forum posts that an option to perform per-read integrity checks of blocks from the RAID was not implemented, and would not be implemented... though it is probably possible. That's unfortunate.
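The closest mdadm gets is an explicit, after-the-fact whole-array scrub rather than a per-read check; a quick sketch, assuming the array is /dev/md0:

  # walk the array and compare data blocks against parity
  echo check > /sys/block/md0/md/sync_action
  # see how many inconsistencies were found
  cat /sys/block/md0/md/mismatch_cnt
  # regenerate parity to match the data
  echo repair > /sys/block/md0/md/sync_action

Note that "repair" simply rewrites parity from the data blocks; it has no way of knowing whether it was the data or the parity that rotted.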
I looked for a block-device solution - something that, like cryptsetup and DRBD, would act as an intermediary between the physical medium and the rest of the device stack. Such a device would either check sector integrity and fail bad blocks up the stack as needed, or store ECC alongside the sector data and attempt repairs as it went, only failing when the data was totally unrecoverable.
I considered some options and decided that a validity-device would best be placed between the RAID driver and the hard drive, so that it could fail sectors up to the RAID and let the RAID recover them from the other disks. This assumes that corruption contaminating two (or three) blocks of the same stripe on a RAID-5 (or RAID-6) is statistically unlikely.
An ECC-device would probably be best placed after the RAID driver, but could also sit before it. It might not be a bad idea to implement and use both - a validity-device before the RAID and an ECC-device after it. Obviously we can no longer trust the hardware to do this sort of thing for us.
I performed a cursory examination of Low-Density Parity Check codes, or LDPC, but alas my math is not so good. There are some libraries available, but writing a whole device driver isn't quite in my time-budget right now. I'd love to, and I know there are others who would like to make use of it, so maybe someday I will. Right now I need a solution that works out of the box.
The Options
The latest and greatest open-source file system is Btrfs. Unfortunately, from what I've been reading, it is much too unstable to be trusted in production environments. Despite the fact that I tend to take more risks than I should, I can't bring myself to go that route at this time. That left ZFS, the only other file system with integral data-integrity checks. This looked promising, but being unable to test-drive OpenSolaris on KVM did not please me.
ZFS-Fuse is readily available and relatively stable, but lacks the block-device manufacturing capability that native ZFS offers. A happier, albeit slightly more dangerous, alternative is ZFS-On-Linux (http://zfsonlinux.org/), a native port that, due to licensing, cannot be easily packaged with the kernel. It can, however, be distributed separately, which is what the project's authors have done. It offers native performance through a DKMS module and (most importantly for me) the block-device-generating ZFS Volume feature.
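That ZFS Volume feature is what lets ZFS slot under DRBD in my stack: a zvol shows up as a plain block device that a DRBD resource can use as its backing disk. A minimal sketch, with the pool and volume names being placeholders:

  # carve a 2 TB block device out of the pool
  zfs create -V 2T tank/drbd0
  # it appears under /dev/zvol/<pool>/<volume>, ready to be named
  # in a DRBD resource's "disk" directive
  ls -l /dev/zvol/tank/drbd0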
Test-Drive!
ZFS likes to have control over everything, from the devices up to the file system. That's how it was designed. I toyed around with setting up a RAID and, through LVM, splitting it up into volumes that would be handled by ZFS - not a good idea: a single drive corrupting itself causes massive and wanton destruction of ZFS' integrity, not to mention that the whole setup would be subject to the same risks that compromised my original array. So, despite my love of mdadm and LVM, I handed the keys over to ZFS.
I did some initial testing on a VM, first creating a ZFS file system composed of dd-generated files and then introducing faults. ZFS handled them quite well. I did the same with virtual devices, which is where I learned that mdadm was not going to mix well with ZFS. I have since deployed on my redundant server and have started rebuilding one DRBD device. So far, so good.
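For anyone who wants to repeat the fault-injection experiment, file-backed vdevs make it painless; the paths and sizes here are arbitrary:

  # four 256 MB files standing in for disks
  for i in 1 2 3 4; do dd if=/dev/zero of=/var/tmp/zdisk$i bs=1M count=256; done
  zpool create testpool raidz /var/tmp/zdisk1 /var/tmp/zdisk2 /var/tmp/zdisk3 /var/tmp/zdisk4
  # put some data on it
  dd if=/dev/urandom of=/testpool/testfile bs=1M count=100
  # silently corrupt the middle of one "disk" behind ZFS' back
  dd if=/dev/urandom of=/var/tmp/zdisk3 bs=1M count=16 seek=64 conv=notrunc
  # scrub and watch ZFS find and repair the damage from parity
  zpool scrub testpool
  zpool status -v testpool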
What I like about ZFS is that it wraps everything up very nicely. I hope this will translate into improved performance, but I won't be able to gather any metrics at this time. Adding devices to the pool is straightforward, and replacing them is relatively painless. The redundancy mechanisms are also very nice: mirroring, RAID-5, RAID-6, and I guess what you could call RAID-(6+1), which tolerate one, one, two, and three failed devices respectively before the array becomes a brick.
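In zpool terms those map to mirror, raidz, raidz2 and raidz3 vdevs - roughly like so (device names are placeholders, and these are alternatives, not a sequence):

  zpool create tank mirror sdb sdc              # survives 1 failure
  zpool create tank raidz  sdb sdc sdd sde      # ~RAID-5, survives 1
  zpool create tank raidz2 sdb sdc sdd sde      # ~RAID-6, survives 2
  zpool create tank raidz3 sdb sdc sdd sde sdf  # survives 3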
What I dislike about ZFS, and what seriously kept me from jumping on it immediately, is its surprisingly poor support for expanding arrays. mdadm lets you restructure your array across more disks, allowing for easy expansion - and it even does this online! ZFS will only grow an array onto larger disks, not more of them, so if you build an array out of 3 disks, that array will only ever use 3 disks. On the bright side, you can add more arrays to your "pool", which is a bit like adding another PV to an LVM stack. The downside is that if you have one RAID-6 of four devices and you add another RAID-6 of four devices, you are now down four devices' worth of space, when a restructured mdadm RAID-6 across all eight would cost you only two.
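To make the contrast concrete (hypothetical device names; mdadm's reshape also wants a backup file for safety):

  # mdadm: grow an existing RAID-6 onto a ninth disk, online
  mdadm --add /dev/md0 /dev/sdj
  mdadm --grow /dev/md0 --raid-devices=9 --backup-file=/root/md0-grow.bak
  # ZFS: you cannot widen a raidz vdev; you can only bolt a new one onto the pool
  zpool add tank raidz2 sdf sdg sdh sdi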
So once you choose how your array is going to look to ZFS, you are stuck with it. Want to change it? Copy your data off, make the change, and copy it back. I guess this is what people who use hardware RAID are accustomed to - I've become spoiled by the awesome flexibility of the mdadm/LVM stack. At this point, however, data integrity is more important to me.
Consequently, with only 8 devices available for my ZFS target (really only 7 right now, because one has failed and been removed), I had to choose basic single-device-redundancy RAIDZ and split the array into two 4-device sub-arrays. Only one sub-array is currently configured, since I can't bring the other one up until I have replaced the failed drive. With this being a redundant system, I am hopeful that statistics are on my side and that a dual-drive failure in any given sub-array will not occur at the same time as one on the sibling system.
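Concretely, the plan looks something like this - pool and device names are placeholders, and the second command has to wait for the replacement drive to arrive:

  # first 4-disk RAIDZ sub-array, usable today
  zpool create backup raidz sdb sdc sdd sde
  # later, once the dead drive has been replaced:
  zpool add backup raidz sdf sdg sdh sdi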
We Shall See.