20120724

Reinventing the...cog?

WANTED: Error Detecting and Correcting Drive Surface

I am looking for a strong solution, and I fear I may have to implement it myself.  I want to boost the error-detection capability of typical hard drives.  Someone on a thread asked about creating a block device to do this for any given underlying device, like what LUKS does.  I want to at least detect, and _maybe_ (big maybe) correct, errors during reads from a drive or set of drives.  After having some drives corrupt a vast amount of data, I'm looking to subdue that threat for a very, very, very long time.

Error Detection

 

To avoid reinventing RAID - which, by the way, doesn't seem to care if the drive is spitting out bad data on reads, and from what I read that's evidently _correct RAID operation_ - I would propose writing a block driver that "resurfaces" the disk with advanced error detection/correction.  So, for detection, suppose that each 4k sector had an associated SHA512 hash we could verify.  We'd want to store these hashes either with the sector or away from it; in the latter case, consolidating hashes into dedicated hash-sectors might be handy.  The block driver would transparently verify sectors on reads and rewrite SHA hashes on writes - all for the cost of around 1.5625% of your disk.  Where this meets RAID is that the driver would simply fail any sector that didn't match its hash, forcing RAID to reconstruct the sector from its hopefully-good stash of other disks...where redundant RAID is used, anyway.
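
To put a rough number on the overhead: a SHA-512 digest is 64 bytes, so one digest per 4k sector costs 64/4096 = 1.5625% of capacity, and exactly 64 digests pack into one 4k hash-sector.  Below is a toy userspace sketch of the read-side check - purely illustrative, since the real thing would live in a block driver, and the device and digest-store paths here are invented:

#!/bin/bash
# verify_sector.sh N - check 4k sector N of $DEV against its stored SHA-512
DEV=/dev/sdX                              # hypothetical data device
HASHFILE=/var/lib/resurface/sdX.hashes    # hypothetical digest store, 64 bytes per sector
N=$1

stored=$(dd if="$HASHFILE" bs=64 skip="$N" count=1 2>/dev/null | xxd -p | tr -d '\n')
actual=$(dd if="$DEV" bs=4096 skip="$N" count=1 2>/dev/null | sha512sum | cut -d' ' -f1)

if [ "$stored" = "$actual" ]; then
    echo "sector $N: OK"
else
    echo "sector $N: hash mismatch - fail the read and let RAID reconstruct" >&2
    exit 1
fi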

Error Correction


Error correction using LDPC codes would be even more awesome, and I found some work from 2006 that is basically a general-use LDPC codec, LGPL'd.  Perhaps I'd need to write my own, to avoid any issues with eventual inclusion in the kernel.  The codes would be stored in a manner similar to the hashes, though probably conglomerated into chunks on the disk and stored with their own error-detection hash.

Questions, Questions, Questions


Anyway, lots of questions/thoughts arise from this concept:
  • How do we make this work with bootloaders?  Perhaps it would work regardless of the bootloader if only used on partitions.
  • It's an intermediary between the drive and the actual end-use by whatever system will use it, so it HAS to be there and working before an end-user gets hold of the media.
    • In other words, suppose we did use this between mdadm and the physical drive - how do we prevent mdadm from trying to assemble the raw drives versus the protected drives (via the intermediary block device)?  If it assembles the raw drives and does a sync or rebuild, it could wipe out the detection/correction codes, or (worse) obliterate valuable file system contents.
  • Where would be the best place for LDPC/hash codes?  Before a block of sectors, or following?
  • How sparsely should LDPC/hash codes be distributed across the disk surface?
  • Is it better to inline hash codes with sectors and cause two sector reads where one would ordinarily suffice, or push hash codes into their own sector and still do two sector reads, but a little more cleanly?
    • The difference being that in the former, a sector of data would most likely be split across two sectors - sounds rather ugly.
    • The latter case would keep sector data in sectors, possibly allowing the drive to operate a little more efficiently than in the former case.
  • How much space is required for LDPC to correct a sector out of a block of, say, 64 4k sectors?  How strong can we make this, and how quickly do we lose the benefits of storage space?
  • If a sector starts getting a history of being bad, do we relocate it, or let the system above us handle that?
  • How best do we "resurface" or format the underlying block device?  I would imagine writing out code sectors that either indicate nothing has been written there yet, or that are generated from whatever content currently exists on disk.  A mode for just accepting what's there (--assume-clean, for example) should probably be available, too, but then do we seed the code sectors with magics and check to see that everything is how it SHOULD be before participating in system reads/writes?
  • Do we write out the code sector before the real sector?  How do we detect faulty code sectors?  What happens if we're writing a code sector when the power dies?
I guess this really boils down to the following: RAID doesn't verify data integrity on read, and that is bugging me.  Knowing it would be a major performance hit to touch every drive in the array for every read, I can understand why it's that way.  If we could do the job underneath RAID, however, maybe it wouldn't be so bad?  Most file systems also don't seem to know/care when bad data shows up at the party, and writing one that does (as the guys working on btrfs can attest) is no easy task.

And I guess the ultimate question is this: has anyone done this already?  Is it already in the kernel and I just haven't done the right Google search yet?

Please say yes.

20120720

Loving iSCSI, Hating Hard Drives

A user of mine recently suffered a lovely virus attack that ate her Windows XP machine.  Well, it was time for an upgrade anyway, so a reformat-reinstall to Windows 7 was the logical choice.  What about the user's data?

That which wasn't virus-laden needed to be captured.  I must admit I really enjoy how seamlessly iSCSI can make drive space available.  As I snooped around my office for a spare drive and a USB adapter, the thought of copying gigs of critical data off a user's machine and onto an aging piece of storage medium just didn't seem appealing.  So, I logged into my HA iSCSI cluster, created a new resource, and mounted it from the user's machine (via a secure, CD-boot operating system....read: KNOPPIX).  Moments later...well, about an hour later...the data was secure and the machine ready to nuke.
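
For anyone curious, the client side of that is pleasantly short.  From the KNOPPIX boot it went roughly like the following - the portal address and IQN here are made up, and the device letter obviously depends on what the kernel hands you:

# discover the targets the portal offers, then log in
iscsiadm -m discovery -t sendtargets -p 192.168.1.20
iscsiadm -m node -T iqn.2012-07.local.san:rescue -p 192.168.1.20 --login
dmesg | tail                  # note which /dev/sdX just appeared
mkfs.ext3 /dev/sdX            # fresh resource, so give it a file system
mount /dev/sdX /mnt/rescue    # and copy the user's data into it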

But there were a few takeaways from this.  First, it was a royal pain in the balls to get my new resource up.  It's not that it was mysterious, or difficult, but that there were in fact six or seven different components that needed to be instantiated and configured (copy-paste-change) just to get the target online.  I would have very dearly loved a "create a target for me with this much space, on this IP, named this," and all the appropriate cluster configuration would be done.  This isn't so much a complaint as it is a wish.  I have no issues with cluster configuration now, and see it as powerful and flexible.  Automation has to show a positive ROI, though, and anyone manufacturing HA virtualization clusters would certainly get good ROI from the right automation.  The lack of it is, in fact, one of the things preventing me from just spawning an independent iSCSI target for each of my VMs.
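
For the record, the wish is really just for a wrapper script.  Here is a rough sketch of what I mean, assuming an LVM-backed store, the ocf:heartbeat iSCSI resource agents, and the crm shell - every name, IQN, and parameter below is illustrative, not what my cluster actually runs:

#!/bin/bash
# new_target.sh NAME SIZE IP - carve out an LV and publish it as an HA iSCSI target
NAME=$1; SIZE=$2; IP=$3
VG=vg_store                               # assumed volume group backing the targets
IQN="iqn.2012-07.local.san:${NAME}"       # assumed IQN prefix

lvcreate -L "$SIZE" -n "$NAME" "$VG"

crm configure primitive "p_ip_${NAME}" ocf:heartbeat:IPaddr2 \
    params ip="$IP" cidr_netmask="24"
crm configure primitive "p_tgt_${NAME}" ocf:heartbeat:iSCSITarget \
    params iqn="$IQN" implementation="tgt"
crm configure primitive "p_lun_${NAME}" ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="$IQN" lun="1" path="/dev/${VG}/${NAME}"
crm configure group "g_${NAME}" "p_tgt_${NAME}" "p_lun_${NAME}" "p_ip_${NAME}"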

Maybe that's A Good Thing (tm).

On another front, failing hard drives (at least one confirmed) and transparent, unnoticed corruption of file systems have me again thumbing through the ZFS manual, and trying to reconcile its use with the system I currently have built.  I have some requirements: DRBD replication and heavy-duty data encryption.  I get these with my current RAID+LVM stack.  But RAID, it seems, does not really care about ensuring data integrity, except during scans or rebuilds.  As a matter of fact, whatever corruption took place, I'm not even sure it was the hard drives that were at fault - except for the one that is spitting out the SMART message: "FAILURE IMMINENT!".  It could also have been the RAID adapter (which is only doing JBOD at the moment), or a file system fluke...or perhaps the resync over DRBD went horribly wrong in just the wrong place.

We may never know.

I can happily say that it appears a good majority of my static data was intact.  I'm thankful, because that static data was data I really only had one highly-redundant copy of.  Thus the case for tape drives, I guess.  I'm considering trying something along the following lines: RAID+LVM+ZFS+DRBD+crypto+FS.  Does that seem a bit asinine?  Perhaps:

  • RAID+LVM+DRBD+CRYPTO+ZFS
  • RAID+LVM+DRBD+ZFS+CRYPTO+FS
  • ZFS+DRBD+CRYPTO+FS
The problem I am running into is keeping DRBD in the mix.  So, the convoluted hierarchy may be the only one that makes sense.  Of course, if ZFS handles replication to secondary servers that are WAN-connected, maybe going purely ZFS would be better.  That then begs the question: stick with Linux+ZFS-FUSE, or go with OpenSolaris?  
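
For what it's worth, the first two options are mechanically easy to prototype above an existing DRBD device.  A sketch of the crypto-under-ZFS variant, with all device, mapping, and pool names invented for the example (and "ZFS" here meaning zfs-fuse on Linux):

# /dev/drbd0 already sits on RAID+LVM and replicates to the peer
cryptsetup luksFormat /dev/drbd0
cryptsetup luksOpen /dev/drbd0 store_crypt
zpool create tank /dev/mapper/store_crypt    # ZFS checksumming takes over from here
zfs create tank/vmstore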

I'm not sure I'm ready for OpenSolaris. 


20120709

Rethinking Property

I was looking for an RSS feed from a photography site, to use as screensaver images automatically in xscreensaver.  I just wanted some low-res stuff, nothing fancy - preview-quality images would've been great!  Unfortunately, I did not readily find an RSS link, but I found some Q&A pages under their support. One question was from a photographer who was upset about finding her images on other websites, due to people who had subscribed to her RSS feed from the parent site.  She was also upset that the subscribed users had no profiles, names, or other information (for what purposes I know not, since I have not yet delved into their membership policies or community expectations).  The support reply basically said that the RSS feed was doing what it was supposed to do, and that it was not possible to "block" someone from using an RSS feed.

Now, to me this underscores a basic disconnect that pervades several facets of the internet-connected world.  A most notable facet is DRM-laden media from content producers.  Another is authored, copyrighted works published as e-books.  Ostensibly, the argument goes like this:

"I perform my trade to make money.  If people access my works for free, I don't make money.  Therefore, I must prohibit people from accessing my works for free, or else there is no longer an incentive for me to produce new works."

Herein lies the rub: published works, cut records, stamped CDs, printed photographs and assembled pieces of furniture were all relatively "difficult" to reproduce before the advent of modern technology.  Unless you went to some serious trouble or had some serious equipment, you could not easily copy, say, a nice picture of a lake and have it look anything like the original.  The quality would be shot, the color gone.  These days a good photocopier does most of the work and the result comes out rather good.  Go all-digital, and there is zero quality loss.

Bring in modern computer technology, and interconnect it to a massive, standardized backbone (i.e. the Internet), and now we have a new reproduction and delivery mechanism for products.  Better yet, the cost of reproducing electronic goods is practically zero.  Just make another copy - the operating system does all the work, and with the right tech it happens in a heartbeat.  Put out a link and millions of people can download it - all for a minuscule fraction of the cost of what a photo or musical work would have cost to manually reproduce.

We could argue here that property is property, and people have a right to make money on however many duplicates of their works they choose to make.  But let's step back a moment and consider another example.  It takes a great deal of effort, experience, materials and tooling to produce a set of fine kitchen cabinets.  You don't just grab a handsaw and start hacking at plywood.  There is, of course, the wood, and the fasteners, the jigs, the saws, blades, planers, router/shaper bits, glues, clamps, work space, sandpapers, paints and finishes, just to name maybe 20% of everything you need for the cabinet production.  The cost of the materials, plus the cost of the labor (which equates to experience, knowledge and skill) to turn those materials into the finished goods, plus around 10% profit, is all factored into the basis of cabinet prices for a generic cabinet shop. Case in point: there is a substantial investment to produce, and therefore a substantial fee for the product.

Let's now suppose that you could produce a cabinet electronically.  That really doesn't make sense, but to just explore the issue, let's suppose it could happen.  After you go through the trouble to do the design work, you get everything ready and then push a button.  Into a machine goes wood chips, glue, and paint.  Whir whir whir, and out pops a perfectly finished cabinet!  This is our advanced cabinet shop.

What is that electronically-manufactured cabinet worth?  Is it worth more, less, or the same as the one painstakingly made by the generic cabinet shop above?

The corollary applies to everything we can download from the Internet.  Our servers, operating systems, and content management systems all serve the purpose of reproducing data over and over again for countless consumers.  The difference is that most of those consumers aren't paying for most of that content.  Most of the content isn't something we'd charge for anyway.  If they do pay for it, they are not willing to pay the cost of what they could buy a physical version for - to them, the value isn't there.  Even if it costs fractions of a penny to stamp out music CDs, what does it cost to provide that CD's worth of data for download?  The fact is, a stamped CD must be manufactured, quality-checked, packaged, labeled, wrapped, boxed, and shipped to a "brick-and-mortar" or online store for sale and distribution.  Therein lies the cost of a $15 CD.  Why pay $15 - or even $10 - for the content when you cut out the last seven or eight steps of the process?

The same applies to movies, books, and just about everything that is now electronically produced.  I'm not saying it's right or wrong to charge whatever is charged for goods available online.  I'm saying that in this new economic playground, there are very different rules at hand.  I do not agree with people who publish content in so grossly a public place as on the Internet, and then complain about that content appearing in a dozen unauthorized locations.  It would be no different than printing a thousand low-quality copies of a picture and dropping them out of a tall building window.  A thousand people grabbing those photos later, do you have any right to complain about "unauthorized" reproductions or the "unauthorized" display of the work you scattered to the four winds?

There are those who would look at all of this and think: "Look at all that money we're losing!  We need to capture all of that!  Everything costs something, right?"  Well, could you sell the air we breathe?  Is the air in my home better than the air in yours, and could I bottle it and sell it to you?  Would you buy it?  There are limits to what is salable and what is not.  Forcing the creation of an economy where none exists leads inexorably to destruction.

So, the issue here shouldn't be about property rights on the Internet, but rather: how do people make money with the Internet as their distribution system?  Forget property rights, and assume they don't exist on the Internet.  Really, I think they never did, and despite the best efforts of governments around the world, property on the internet will never work the way it works in the real, physical, hard-to-duplicate world.  We have to rethink the economics of Internet-based production/consumption.  The hard part is lining this up with aspects of the world that do not translate easily to the Internet, such as the production, delivery, and purchase of food-stuffs (my "People Need To Eat" axiom).

Conversely, a product that fetches $1 on the internet but is downloaded one million times certainly earns as much as a product costing $50 and purchased from a store 20,000 times.  Perhaps we humans are just not ready for an Internet-based economy.

20120621

The Good and Bad of OCFS2

It's my own fault, really, for not having yet purchased an ethernet-controlled PDU.  I've been busy and time slips by, and the longer things run without incident the easier it is to forget how fragile it all is.

Whatever is causing the hiccups, it's pretty nasty when it happens.  I now have three hosts in my VM cluster.  I still run my two storage nodes, as a separate cluster.  There are two shared-storage devices, accessed by each VM host node via iSCSI, meant to distribute the load between the two storage nodes.  OCFS2 is the shared-storage file system for this installation.

Long story short, when one node dies, they all die.  15 VMs die with them, all at once.  Again, STONITH would fix this issue.  But what worries me more is the frequency of kernel oopses.  I really can't have my VM hosts going AWOL on me just because they're tired of running load averages into the 60s.  I am beginning to rethink my design.  Here I will discuss a few pros and cons of the two approaches under consideration.

OCFS2 - Pros

  • Easy to share data between systems, or have a unified store that all systems can see.
  • VM images are files in directories, all named appropriately for their target VMs - No confusion, very little chance of human error.
  • Storage node configuration is easy - Set up the store, initialize with OCFS2, and you're done!

OCFS2 - Cons

  • Fencing is so massively important that you might as well not even use clustering without it.  Right now the cluster itself is about as stable as my "big VM host" that has motherboard and/or memory issues and regularly locks up for no apparent reason.
  • You have to configure the kernel to reboot on a panic, and to panic on an oops, per the OCFS2 1.6 documentation (see the sysctl sketch after this list).  I'm not really uncomfortable with that, but again the prevalence of these system failures leaves me wondering about the stability of everything.  I cannot necessarily pin it on OCFS2 without some better logging, or at least some hammering while watching the system monitor closely.
  • One of my systems refuses to reboot on a panic, even though it says it's going to.  Don't have any idea what that's about.
  • The DLM is not terrible, but sometimes I wonder how great it is in terms of performance.  I may be misusing OCFS2.  Of course, I have only one uplink per storage node to the lone gigabit switch in the setup, and the ethernet adapters are of the onboard variety.  Did I mention I need to purchase some badass PCI-e ethernet cards??
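
The sysctl changes for that reboot-on-panic requirement are at least small.  This is what I mean - the 30-second delay is my own choice, not anything the docs mandate:

# make an oops escalate to a panic, and make a panic reboot the box after 30 seconds
sysctl -w kernel.panic_on_oops=1
sysctl -w kernel.panic=30
# persist across reboots
echo "kernel.panic_on_oops = 1" >> /etc/sysctl.conf
echo "kernel.panic = 30" >> /etc/sysctl.conf
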
The alternative to OCFS2, when you want to talk about virtualization, is of course straight-up iSCSI.  Libvirt actually has support for this, though I'm not certain how well it works or how robust it is to failures.  However, from what I've read and seen, I'd be very willing to give it a shot.
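
As a taste of what that looks like, defining an iSCSI pool is roughly the following (host, IQN, and pool name invented here); each LUN on the target then shows up as a storage volume that a guest's disk can be pointed at:

cat > store0-pool.xml <<'EOF'
<pool type='iscsi'>
  <name>store0</name>
  <source>
    <host name='192.168.1.20'/>
    <device path='iqn.2012-06.local.san:store0'/>
  </source>
  <target>
    <path>/dev/disk/by-path</path>
  </target>
</pool>
EOF
virsh pool-define store0-pool.xml
virsh pool-start store0
virsh vol-list store0        # one volume per LUN on the target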

LIBVIRT iSCSI Storage Pool - Pros

  • STONITH is "less necessary" (even though it is STILL necessary) for the nodes in question, because they no longer have to worry so much about corrupting entire file systems.  They would only be at risk for corrupting a limited number of virtual machines...although, given the right circumstances I bet we could corrupt them all.
  • Single node failures do not disrupt the DLM, because there is no DLM.
  • iSCSI connections are on a per-machine basis, though it would be interesting to see how well this scales out.
  • No shared-storage means that the storage nodes themselves can use more traditional or possibly more robust file systems, like ext4 or jfs.

LIBVIRT iSCSI Storage Pools - Cons

  • Storage configuration for new and existing virtuals will require an iSCSI LUN for each one.  To keep them segregated, we could also introduce an iSCSI Target for each one, but that would become a cluster-management nightmare on the storage nodes.  It's already bad enough to think about pumping out new LUNs for the damn things.
  • Since LUNs would be the thing to use, there is greater risk of human error when configuring a new virtual machine (think: Did I start the installer on the right LUN?  Hmmmm....)
  • Changing to this won't necessarily solve the problems with the ethernet bottleneck.  In fact, it could very well exacerbate them.
  • There is no longer a "shared storage" between machines.  No longer a place to store all data and easily migrate it from machine to machine.  At present I keep all VM configuration on the shared storage and update the hosts every so often.  This would become significantly less pleasant without shared storage.
It would probably be in my best interest to simply keep the current configuration until I can get my STONITH devices and really see how well the system stays online.  It would also behoove me to configure the VM cluster to also monitor and protect the virtuals themselves.  I tested this with one VM, but haven't done a lot to toy with all the features and functions.
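
For reference, the one-VM test was essentially just a VirtualDomain primitive.  A sketch, with the config path and timings illustrative rather than copied from my cluster:

crm configure primitive vm_test ocf:heartbeat:VirtualDomain \
    params config="/opt/store0/xml/test.xml" \
           hypervisor="qemu:///system" \
           migration_transport="ssh" \
    meta allow-migrate="true" \
    op monitor interval="30s" timeout="60s" \
    op start timeout="120s" op stop timeout="120s"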

So much to do, so little time.

20120620

Cluster Building - Ubuntu 11.10


Some quick notes on setting up a new VM cluster host on Ubuntu Server 11.10.  Assuming the basic setup, some packages need to be installed:

apt-get install \
  ifenslave bridge-utils openais ocfs2-tools ocfs2-tools-pacemaker pacemaker \
  corosync resource-agents open-iscsi drbd-utils dlm-pcmk ethtool ntp libvirt-bin kvm

If you don't use DRBD on these machines, omit it from the package list on all of them.  If you install it on one, you'd better install it on all of them.

Copy the /etc/corosync/corosync.conf and /etc/corosync/authkey to the new machine.
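
In practice that's a one-liner, assuming the new host is reachable as "newnode":

scp /etc/corosync/corosync.conf /etc/corosync/authkey newnode:/etc/corosync/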

Configure /etc/network/interfaces with bonding and bridging.  My configuration:

# The loopback network interface
auto lo
iface lo inet loopback


# The primary network interface
auto eth0 eth1 br0 br1 bond0


iface eth0 inet manual
  bond-master bond0


iface eth1 inet manual
  bond-master bond0


iface bond0 inet manual
  bond-miimon 100
  bond-slaves none
  bond-mode   6


# I can't seem to get br1 to accept the other bridge-* options!! :(
iface br1 inet manual
  pre-up brctl addbr br1
  post-up brctl stp br1 on


iface br0 inet static
  bridge-ports bond0
  address 192.168.1.10
  netmask 255.255.255.0
  gateway 192.168.1.1
  bridge-stp on
  bridge-fd 0
  bridge-maxwait 0

Disable the necessary rc.d resources:

update-rc.d corosync disable
update-rc.d o2cb disable
update-rc.d ocfs2 disable
update-rc.d drbd disable


Make sure corosync will start:
sed -i 's/START=no/START=yes/' /etc/default/corosync

Create the necessary mount-points for our shared storage:
mkdir -p /opt/{store0,store1}

You should now be ready to reboot and then join the cluster.  I have the following two files on all machines, under the security of root:
go.sh
#!/bin/bash
service corosync start
sleep 1
service pacemaker start


force_fastreboot.sh
#!/bin/bash
echo 1 > /proc/sys/kernel/sysrq    # enable the magic SysRq interface
echo b > /proc/sysrq-trigger       # trigger an immediate reboot (no sync, no unmount)

Also, the open-iscsi stuff has some evil on reboots.  If there is stuff in the /etc/iscsi/nodes folder, it screws up Pacemaker's attempt to connect.  For lack of a better solution, I have this script called from rc.local:

clean_iscsi.sh
#!/bin/bash
rm -rf /etc/iscsi/{nodes,send_targets}/*

Sometimes I need to resize my iSCSI LUNs.  Doing so means rescanning on all affected machines.  This script is called via the following crontab entry:
0,15,30,45 * * * * /root/rescan-iscsi.sh
rescan-iscsi.sh
#!/bin/bash
iscsiadm -m node -R > /dev/null

20120617

Be Careful With mdadm.conf

I ran into a silly problem with booting my RAID-6.  It seems mdadm inside initrd would not assemble my array on reboot!  Every time the system rebooted, I was dumped after a few seconds into an initrd rescue prompt, and left to my own /dev/*.  Here, again and again, I could manually assemble the array, but that was getting annoying and was certainly something I couldn't do remotely.

During one boot, I tried an mdadm -v --assemble --scan --run, and discovered it was scanning sda, sdb, sdc, etc, but no partitions.

I looked on Google for "initrd not assembling array" and "grub2 not assembling array," to no avail.  I tried updating the initrd with the provided script (update-initramfs).  Finally I happened upon someone mentioning the mdadm.conf file inside initrd.

I peeked in /etc/mdadm, and sure enough, there was a copy of my modified mdadm.conf.  It contained the change I had made: to scan only the five drives of my RAID-6, since the other drives I had in the system at the time were giving me hassles.  Unfortunately, the statement I concocted specified no partitions; my RAID was entirely on partitions.  Entirely my fault. :'-(

The good news is that, once found, it's an easy fix.  Get the system to boot again, reconfigure mdadm.conf, and then perform an update-initramfs.  The system now boots perfectly!
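
For the curious, the damage and the cure both live in one DEVICE line of /etc/mdadm/mdadm.conf.  Illustrative, not my literal drive letters:

# what I had: whole disks only, so the partitions holding the RAID never got scanned
#DEVICE /dev/sd[bcdef]
# what it needed: the partitions themselves (or just let mdadm scan everything)
DEVICE /dev/sd[bcdef]1
#DEVICE partitions

# then rebuild the initrd so the fixed file rides along
update-initramfs -u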

20120616

Resurrection of the RAID

Have you had a day like this?

Hmm... My RAID was working the last time I rebooted.  Why can I not assemble it now?  Only two devices available?  Nonsense...there should be five.  What?  What's this?!  The others are SPARES?!!  No they're not...they're part of the RAID!  Awwww F*** F*** F*** F*** F*** F***....


Thank the heavens for these critical links, which I repeat here for posterity:
http://maillists.uci.edu/mailman/public/uci-linux/2007-December/002225.html
http://www.storageforum.net/forum/archive/index.php/t-5890.html (saves more lives!)
http://neil.brown.name/blog/20120615073245 !!! important read !!!
As you can probably surmise, one of my arrays decided to get a little fancy on me this week.  It all started when I was trying to reallocate some drives to better utilize their storage.  I had moved all my data onto a makeshift array composed of three internal SATA and two USB-connected hard drives.  I know, a glutton for punishment.  The irony of that is the USB drives were the only two with accurate metadata, as will be discussed below.

I really don't know what happened, but somewhere between moving ALL my data onto that makeshift array and booting into my new array, the metadata of the three internal SATA drives got borked....badly.  I discovered this while attempting to load up my makeshift array for data migration.  Upon examining the superblocks of all the drives (mdadm -E), I discovered that the three internals received a metadata change that turned them into nameless spares.  The two USB drives escaped destruction, so thankfully I had some info handy to verify how the array needed to be rebuilt.

After grieving a while, and pondering a while longer, I did a search for something along the lines of "resurrect mdadm raid" and eventually came up with the first two links.  The third was an attempt to find the author of mdadm, just to see if he had anything interesting on his blog.  Turns out he did!  Now, I honestly don't know if I was bit by the bug he mentions in that post, but whatever happened did its job quite well.

So, I started a Google Doc called "raid hell" and started recording important details, like which drives and partitions were in use by that array and what info I was able to glean from mdadm -E, and eventually I felt almost brave enough to try a reassembly.  But before I got the balls for that, I imaged the three drives with dd, piped the output through gzip, and dumped the images onto the new array, which thankfully had just enough space to hold them.  Now, if something bad happened, I had at least one life in reserve.
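
The imaging itself was nothing fancier than the following (drive letters and destination path are placeholders here):

# one compressed image per suspect drive, parked on the new array
for d in sdX sdY sdZ; do
    dd if=/dev/$d bs=1M | gzip -c > /mnt/newarray/${d}.img.gz
done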

The next step was to attempt an array re-creation with the --assume-clean option: this option eliminates the usual sync-up that takes place on array creation, thereby preventing any destructive writes while I experimented.

Oh, did I mention it was a RAID-5?

Of course, with most of the devices suffering amnesia, the challenge now was to figure out what order the drives were originally in.  It would have been as easy as letter-order, maybe, if I had not grown the array two separate times during the course of the previous run of data transfers.  So, to make life a little easier, I wrote a shell-script.  Device names like /dev/sdf1 and /dev/sdi2 were long and hard and ugly, so I replaced them with variables F, G, I, J, and K (the letters of the drives for my array).  It went something like this:

F="/dev/sdf1"
G="/dev/sdg2"
I="/dev/sdi2"
J="/dev/sdj1"
K="/dev/sdk1"
mdadm -S /dev/md4                      # stop the previous attempt
mdadm --create /dev/md4 --assume-clean -l 5 -n 5 --metadata=1.2 $F $G $I $J $K
dd if=/dev/md4 bs=256 count=1 | xxd    # eyeball the start of the array for the expected LVM header funk
pvscan                                 # second test: does LVM recognize its physical volume?
With this, I could easily copy/paste and comment out mdadm --create lines that had the incorrect drive order, and keep track of what permutations I had attempted.  Now, the array was originally part of LVM, so I was looking for some LVM header funk with the dd command.  I only knew it was there because I had been running xxd on the array previously, and had seen it go whizzing by.  pvscan would be my second litmus test: should the right first drive appear, the physical volume, and subsequently the volume group and logical volumes, would all be recognized.  It took only five runs to get K as the first drive.  I now had four drives left to reorder.

To accomplish this, dd and pvscan would no longer help - they did their work on the first drive only.  Since I had two easily-accessible LVs on that array - root and home - it would be easy to run fsck -fn to determine if the file system was actually readable.  This assumed, of course, that all the data on the array was in good shape.  I honestly had no reason to believe otherwise, and Neil Brown's post gave me a great deal of hope in that regard.  Basically, if only the metadata was getting changed, the rest of the array should be A-OK.  I knew it was not being written to at the time of the last good shutdown...because the array was basically a transfer target and not even supposed to be operating live systems.
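
The test pass after each create attempt looked roughly like this, with the volume group and LV names standing in for my real ones:

vgchange -ay vg_main            # activate LVM on the freshly-created md4
fsck -fn /dev/vg_main/root      # -n: report problems only, never write
fsck -fn /dev/vg_main/home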

It took another half-dozen or so attempts before I managed to get the latter four drives in the correct order.  Finally, fsck returned no errors and a very clean pair of file systems.  Of course, this was done with -n, so nothing was actually written to the array.  I kicked off a RAID-check with an

echo check > /sys/block/md4/md/sync_action
and then the power went out.  After rebooting, and reassembling the RAID (this time without having to recreate it, since the metadata was now correct on all the drives), I re-ran the check.  It completed about three and a half hours later.  Nothing was reported amiss, so finally it was time: I performed some final fsck's without the -n to ensure everything was ultra-clean, and started mounting file systems.

SUCCESS!!!

Since root and home both weigh in around 30-60G each, it was easy to believe they touched every device in the array.  If something else had been out of order, I should have seen it (let's hope I'm right!!).  Now, with the volumes unmounted, I am migrating all the data off the makeshift array...after all, it has a habit of not actually assembling on boot, probably because of the two USB devices.

It Lives Again.