20120517

Software RAID Practices

I really like software RAID.  It's cheap, easy, and extremely stable.  It has lots of options, and I get the most awesome RAID levels I could ask for.  I can migrate from a RAID-1 to a RAID-5 to a RAID-6...I can grow the array live, add or remove hot-spares, and allow more than one array to share a hot-spare.  Compared with hardware solutions, it's amazingly complete.

What I love the most is that, unlike a hardware solution, any Linux system can mount it.  There are a few catches to that, in terms of what it names your md devices, but if you are to the point of mounting your RAID on another system, and that RAID happens to be your root drive, you're probably advanced enough to know what you're doing here.

But what is a good practice for software RAID?  So many options make it difficult to know what the pros and cons are.  Here I will try to summarize my experiences.  The explorations here are via mdadm, the de facto standard software RAID configuration tool available on nearly every Linux distribution.  Special non-RAID-related features I discuss are part of the Ubuntu distribution, and I know them to exist on 10.04 and beyond.  Earlier versions may also support this greatness, though you may have to do more of the work on the command-line.  This is primarily a "theory and practice" document, so the command-line snippets below are only rough sketches rather than full recipes.  If you want more detail, contact me.

Note that NO RAID is a guarantee of safety.  Sometimes multiple drives fail, and even RAID-6 will not save you if three die at once, or if your system is consumed in a fire.  The only true data safeguard is to back up to an offsite data store.  Backup, Backup, Backup!


Method 1:  Every Partition is a RAID

Each physical disk is partitioned.  Each partition is added to its own RAID.  So you wind up with multiple RAIDs (possibly at different levels).
Picture four physical disks, the first two of which are large enough for four partitions.  The other two join to augment the latter partitions.  md0 and md1 would very likely be /boot and swap, and configured as RAID-1.  md2 could be the system root, as a RAID-5.  Suppose we also need high-speed writes and can live without redundancy?  md3 could be a RAID-0.  One of my live systems is actually configured similarly to this.  The reason: it captures a ton of statistics from other servers over the network.  The data comes so fast that it can easily bottleneck at the RAID-1 while the system waits for the drives to flush the data.  Since the data isn't mission-critical, the RAID-0 is more than capable of handling the load.
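
For the curious, a rough mdadm sketch of a layout like that might look something like this (all device names and levels here are just examples, not a prescription):

  # /boot and swap as small RAID-1 mirrors on the first two disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

  # system root as RAID-5 across all four disks
  mdadm --create /dev/md2 --level=5 --raid-devices=4 /dev/sda3 /dev/sdb3 /dev/sdc1 /dev/sdd1

  # fast scratch space as RAID-0 (no redundancy)
  mdadm --create /dev/md3 --level=0 --raid-devices=4 /dev/sda4 /dev/sdb4 /dev/sdc2 /dev/sdd2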

The benefit of this method is that you can do a lot with just a few drives, as evidenced above.  You could even just have every partition be a RAID-1, if you didn't need the extra space of sdc and sdd.  The detriment is that when a physical drive goes kaput, you need to recreate the exact same partition layout on the new drive.  This isn't really a problem, just a nuisance.  You also have to manually add each partition back into its target array.  RAID rebuilds are usually done one or two at a time, but the smaller devices will go very quickly.
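
When a drive does die, the replacement dance is roughly this (assuming the dead disk comes back as sdb, and sda is the survivor whose partition table we can clone; adjust the names to your reality):

  # copy the partition layout from the surviving disk to the new one
  sfdisk -d /dev/sda | sfdisk /dev/sdb

  # add each new partition back into its target array
  mdadm --add /dev/md0 /dev/sdb1
  mdadm --add /dev/md1 /dev/sdb2
  mdadm --add /dev/md2 /dev/sdb3
  mdadm --add /dev/md3 /dev/sdb4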


Method 2:  Full Device RAID

Hardware controllers use almost the whole device; they reserve a bit for their proprietary metadata, even if you go JBOD (at least on some controllers, which stinks).  Popping a replacement drive into the array means no partition creation - it just adds it in and goes.  I thought, Why can't I do that?  So I did.

You can do this via a fresh install, or by migrating your existing system over to this layout, but either way it is a little advanced.  The concept here is to have a boot device separate from your main operating system RAID.

In this layout, our /boot partition is kept separate.  GRUB is able to handle this configuration without much issue.  LILO could probably handle it, too, with an appropriate initrd image (i.e. you need to make sure your RAID drivers are loaded into the image to bootstrap the remainder of the system).

You can also turn /boot into a RAID-1, just so that you don't run the risk of totally losing your way of booting your awesome array.  I have done this with a pair of very small USB thumb-drives.  They stick out of the computer only about a quarter of an inch, and the pair are partitioned and configured as RAID-1.  /boot is on the md0 device, and md1 is my root drive.  I tend to use LVM to manage the actual root RAID, so that I can easily carve it into a swap space and the root space, plus any additional logical drives I think are necessary.
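
As a sketch, assuming sdc and sdd are the thumb-drives and sda/sdb are the big drives (your device names will certainly differ):

  # /boot: RAID-1 across a partition on each USB stick
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

  # root: RAID-1 across the whole drives, no partition tables at all
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb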

There are some catches to using the thumb-drives as boot devices:
  • Obviously, you need a mobo that supports booting from USB sticks.
  • You MUST partition the sticks and RAID the partitions.  Trying to RAID the whole USB stick will give GRUB no place to live.  The GRUB installer on Ubuntu 10.04+ prefers to install to the individual member devices of the RAID rather than to the md device itself, and is smart enough to do so - put another way, it's not smart enough to realize that installing itself to a RAID-1 would practically guarantee its presence on both devices.  This may be a safety measure.
  • USB flash devices can sometimes raise hell with the BIOS regarding their number of cylinders and their size.  Using fdisk alone can be tumultuous, resulting in a partition table that is followed too closely by the first partition.  This results in GRUB complaining that there is no room to install itself on the desired device.  To resolve this, you can try making a new partition table (DOS-compatibility is fine), or moving the first partition up one cylinder.  The latter is almost guaranteed to work, and you won't lose enough space to even care.  After all, what's 8M inside 8G?  I doubt even NTFS is that efficient.
The plus to this is that replacing a failed root RAID device is as easy as adding a new member device back into the array - no partitioning required.  Very simple.  The downside is that sometimes the USB devices don't both join the RAID, so you have to watch for a degraded /boot array.  Also, it could be considered a detriment to have to boot off USB sticks.  It's worked fairly well for me, and is very, very fast, but there is another method that may be a reasonable compromise.
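
A couple of quick ways to keep an eye on array health (the mail address is obviously a placeholder):

  # quick health checks
  cat /proc/mdstat
  mdadm --detail /dev/md0

  # mdadm's monitor mode can also email you when an array degrades;
  # set MAILADDR in /etc/mdadm/mdadm.conf, e.g.:
  #   MAILADDR you@example.com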

Method 3: Single Partition RAIDs

If you just want regular hard drives, easy rebuilds, and only one big happy array to deal with, a single-partition RAID is probably your best bet.  Each drive has a partition table and exactly one partition.  This set of partitions is then added to a RAID device - levels 1, 5, and 6 are definitely supported by GRUB; others may be possible.  The purpose here is to provide a place for GRUB to live on each of the drives in the array.  In the event of a drive failure, the BIOS should try the other drives in order until it finds one it can boot from.  The Ubuntu grub package configuration (available via grub-setup, or dpkg-reconfigure grub-pc) will take care of all the dirty-work.
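
A rough sketch of the idea, assuming four drives with one big partition apiece (names and the RAID level are just examples):

  # give every drive the same single-partition layout
  # (partition sda by hand, then clone it to the others)
  sfdisk -d /dev/sda | sfdisk /dev/sdb

  # one array across all of the partitions
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

  # make sure GRUB lives on every member drive
  grub-install /dev/sda
  grub-install /dev/sdb
  grub-install /dev/sdc
  grub-install /dev/sdd
  # or let the package handle it: dpkg-reconfigure grub-pc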

Here again it is probably best practice - perhaps even downright necessary - to add your entire RAID device into LVM2 for management.  Your root, swap, and other logical drives will be easy to define, and GRUB will recognize them.  LVM2 provides such awesome flexibility anyway that I tend to feel you are better off using it than not.
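
A minimal LVM sketch on top of such an array might be (the volume group name, volume names, and sizes are only examples):

  pvcreate /dev/md0                      # the whole array becomes a physical volume
  vgcreate vg0 /dev/md0                  # one volume group to hold everything
  lvcreate -L 2G  -n swap vg0            # carve out swap
  lvcreate -L 20G -n root vg0            # ...and root
  lvcreate -l 100%FREE -n data vg0       # ...and give the rest to data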

The benefits here are fairly obvious: no gotchas with regard to USB sticks (because there are none), easy maintenance of the RAID, GRUB is automatically available on all drives (as long as you configure it that way), and LVM takes care of dividing up your available space however you please.  Growth is but another drive away.

Extras: Growing Your RAID, and Migrating Your Data


Growing Your RAID

An array is a great thing.  But sometimes you run out of room.  So, you thought two 80-gig drives in RAID-1 would be sufficient for all your future endeavors, until you started gathering up the LiveCD for every Linux distro under the sun.  First, you have a choice: another drive, or replacement drives.

If you have the ability to add more drives to your array, you can up the RAID level from 1 to 5.  For a RAID-1 on two 80-gig drives, adding a third drive and reshaping to RAID-5 instantly gets you double the storage along with RAID-5 redundancy.  If you want to replace your drives instead, you need to do it one at a time.  During each swap your RAID will be degraded, so it can be a little dicey to go this route.  You'll pull out one of the old drives, slide in a new drive (2TB, maybe?), and add it into your array.  The rebuild will happen, and will hopefully complete without issue.  The only instance I know of where it wouldn't complete is when your remaining live drive is actually bad.
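
Both routes boil down to a few mdadm commands.  Roughly, and with example names (check the mdadm man page for your version, since reshape support has evolved over time):

  # route 1: add a third drive and reshape the mirror into a RAID-5
  mdadm --grow /dev/md0 --level=5           # 2-disk RAID-1 becomes a 2-disk RAID-5
  mdadm --add /dev/md0 /dev/sdc1            # add the new drive
  mdadm --grow /dev/md0 --raid-devices=3    # reshape onto all three drives

  # route 2: replace drives one at a time (array runs degraded during each rebuild)
  mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
  # ...swap the physical drive, partition it if your layout requires it...
  mdadm --add /dev/md0 /dev/sda1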

Once you've replaced all the drives with bigger ones, you can resize your partitions (may require a reboot), order mdadm to enlarge the array to the maximum possible disk size, grow LVM (if applicable) and finally grow your file system.  Don't forget to do grub-installs on each of the new devices, as well, especially if you're using Method 1 or Method 3 for your array configuration.
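
The tail end of that process looks roughly like this, assuming LVM and an ext3/ext4 file system on top (all names and sizes are examples):

  mdadm --grow /dev/md0 --size=max      # let the array use the bigger drives
  pvresize /dev/md0                     # tell LVM the physical volume grew
  lvextend -L +500G /dev/vg0/data       # grow whichever logical volumes need it
  resize2fs /dev/vg0/data               # grow the file system (usually doable live)
  grub-install /dev/sda                 # repeat for each new drive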

Alternatively, you can crank up a whole new array, get it totally sync'd, and migrate your data over to it.  This is easy if you've got LVM running under the hood.

Migrating Your Data

Suppose you've configured your system as such:
  Drives -> RAID (mdadm) -> LVM PV -> LVM VG -> LVM LVs (root, swap, boot, data, junk)

A brief explanation about LVM2:  All physical volumes (PVs) are added to LVM before they can be used.  In this case, our PV is our RAID device.  All PVs are allocated to Volume Groups (VGs).  After all, if you have multiple physical volumes, you might want to allocate them out differently.  Typically I wind up with just one VG for everything, but that's not a hard requirement.  Once you have a VG, you can divide it up into several Logical Volumes (LVs).  This is where the real beauty of LVM comes into play.
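
If you want to see that hierarchy on a live system, LVM's reporting tools lay it out nicely:

  pvs    # physical volumes and which VG each belongs to
  vgs    # volume groups and their free space
  lvs    # logical volumes carved out of each VG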

Suppose we've configured a second (huge) RAID array for our data.  If our system was originally configured as detailed immediately above, we can order LVM to migrate our data from one PV to another.  In other words, we would:
  1. Create our new RAID array.
  2. Add our new RAID to the LVM as a new PV.
  3. Ask LVM to move all data off our old RAID (old PV).
    1. This means it will use any and all available new PVs - in our case, we have only one new PV.
  4. Wait for the migration to complete.
  5. Order LVM to remove the old PV - we don't want to risk using it for future operations.
  6. Order LVM to grow its VG to the maximum size of the new PV.
  7. Resize our LVs accordingly (perhaps we want more space on root, data, and swap).
  8. Resize the affected file systems (can usually be done live).
  9. Make sure GRUB is installed on the new RAID's drives.
  10. Reboot when complete to make sure everything comes up fine.
The last step is not required, but it is good practice: it confirms the system boots cleanly now, rather than leaving you scratching your head six months from now, trying to remember which of the above steps you left off at.  I recently performed this exact sequence to migrate a live cluster system over to a new set of drives.
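
In command form, the heart of that sequence is roughly the following, with md0 as the old array, md1 as the new one, and vg0 as the volume group (all example names):

  pvcreate /dev/md1             # the new RAID becomes a PV
  vgextend vg0 /dev/md1         # ...and joins the volume group
  pvmove /dev/md0               # migrate everything off the old PV (this is the long wait)
  vgreduce vg0 /dev/md0         # drop the old PV from the VG
  pvremove /dev/md0             # ...and wipe its LVM label
  # then lvextend / resize2fs as needed, and grub-install on each of the new drives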

Hardware RAID Adapters and JBOD

If you've decided to pimp out your system with lots of drives, you'll have some fun trying to find SATA adapters that don't come with RAID.  It's a challenge.  I suppose most manufacturers think, Why wouldn't you want hardware RAID?  Well, when I decide to purchase a competitor's card because your card is a piece of trash, I don't want to lose all my data in the process or have to buy a multi-terabyte array just to play the data shuffle game.

Most adapters at least offer JBOD.  Many better adapters, however, seem to also want to mark each disk they touch.  The result is a disk that is marginally smaller than the original, and tainted with some extra proprietary data.  This comes across as a drive with fewer sectors, meaning that if mdadm puts its metadata at the end of the known device, and you move that device to a system with straight-up SATA, you may not see your metadata!  (It's there, just earlier in the disk than mdadm is expecting, thanks to the hardware adapter from the original system.)

One benefit of partitioning a disk (as in Methods 1 and 3) is that you can insulate mdadm from the adapter's insanity.  The metadata will reside at the end of the partition instead of the end of the known drive.  Migrating from one RAID controller to another, or to a standard SATA adapter, should be a little safer, although I can't really speak much from experience concerning switching RAID adapters.  In any case, another option is, of course, to have mdadm use a different version of metadata.  Most of my arrays use version 1.2.  There are a few choices, and according to the documentation they put the actual metadata at different locations on the drive.  This may be a beneficial alternative, but is really probably a moot point if you RAID your partitions instead of the physical drive units.
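
If you want to pin the metadata version at creation time, or see where an existing member's superblock lives, mdadm obliges.  A sketch, with example names:

  # pick the metadata format explicitly when creating an array
  mdadm --create /dev/md0 --metadata=1.2 --level=5 --raid-devices=4 /dev/sd[a-d]1

  # inspect an existing member's superblock (shows the version and its offset)
  mdadm --examine /dev/sda1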
