20120503

Conversion of Flash Boot Media from Solo Device to RAID-1

The target system currently boots off one flash device, which we shall call sdf.  sdg is the new device.  sdf was configured with a DOS partition table and one 500M partition, beginning at cylinder 2.  /dev/sdf1 is the mount source of /boot.  The operating system is Ubuntu Server 11.04, with all the latest updates.  The hardware is a Dell XPS with an Intel Pentium 4 processor and a relatively ancient BIOS.  What follows are the steps I took to make it work, and what I observed along the way.  YMMV...

1. Create the RAID-1 target
fdisk /dev/sdg

  • Create a new DOS table and a single partition, starting at cylinder 2 and extending 500M into the device.
  • Configure the partition as type FD (Linux raid autodetect) and set its bootable flag.  I'm not sure which of these is strictly necessary, so I did both.

mdadm --create /dev/md1 --force --metadata=1.2 -l 1 -n 1 /dev/sdg1
mdadm -E --brief /dev/sdg1 >> /etc/mdadm/mdadm.conf

  • --force is required as we're building a 1-device RAID-1, which mdadm finds odd but will do anyway.
  • We'll try using metadata version 1.2 on here, because I've seen it work on other systems.
  • The array should have started automatically.
  • We add the array definition to the config file, so that it will be correctly auto-started AND correctly assigned as md1 on reboot.
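
The >> redirect above appends blindly, so repeating the step leaves duplicate ARRAY lines in mdadm.conf.  Here is a minimal sketch of a guard, assuming the ARRAY line carries a UUID= field (as the output of mdadm -E --brief does).  The helper name add_array_line is my own invention and is not part of mdadm.

```shell
#!/bin/sh
# add_array_line LINE [CONF]
# Append an ARRAY line to the mdadm config only if its UUID is not already
# listed there.  Hypothetical helper; CONF defaults to Ubuntu's usual path.
add_array_line() {
    line=$1
    conf=${2:-/etc/mdadm/mdadm.conf}
    # Pull the value of the UUID= field out of the line.
    uuid=$(printf '%s\n' "$line" | sed -n 's/.*UUID=\([^ ]*\).*/\1/p')
    [ -n "$uuid" ] || return 1          # refuse lines without a UUID
    if grep -qF "UUID=$uuid" "$conf" 2>/dev/null; then
        return 0                        # already present, nothing to do
    fi
    printf '%s\n' "$line" >> "$conf"
}

# Usage (as root):
#   add_array_line "$(mdadm -E --brief /dev/sdg1)"
```

Calling it twice with the same line leaves a single entry in the file.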
dd if=/dev/sdf1 of=/dev/md1 bs=4k
  • We copy the entire boot file system over, sector for sector, assuming we have enough space on md1.  In my case, I had more than enough.  Note I'm using a 500M partition for two reasons: (a) I don't feel like waiting 30 minutes for things like copies to finish, and (b) there might be problems with GRUB and the full 8G size, although the other exemplar system had no troubles here.  If all goes well, I'll expand the RAID accordingly.
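
If the target is smaller than the source, dd stops partway with a "No space left on device" error.  An md device with version 1.x metadata is also a little smaller than its member partition, because the superblock and data offset take up room at the front.  Below is a minimal sketch of a pre-flight check, assuming blockdev --getsize64 is available on the system.  fits_in is a hypothetical helper name.

```shell
#!/bin/sh
# fits_in SRC_BYTES DST_BYTES
# Succeed when a source of SRC_BYTES fits into a target of DST_BYTES.
fits_in() {
    [ "$1" -le "$2" ]
}

# Usage (as root), with the device names from this post:
#   src=$(blockdev --getsize64 /dev/sdf1)
#   dst=$(blockdev --getsize64 /dev/md1)
#   fits_in "$src" "$dst" && dd if=/dev/sdf1 of=/dev/md1 bs=4k
```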
umount /boot
  • Remove (unplug) the old flash device, /dev/sdf, so that the copy on the RAID is the only file system left carrying the /boot UUID.
partprobe
mount /boot
  • We will now see that all the boot files are there, and the file system is a perfect copy, now living on the RAID.  The above will work only if /etc/fstab lists the /boot source by UUID rather than by device name, because dd copied the file system's UUID along with everything else.  This has become the recommended way of mounting devices anyway.
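
Before pulling the old device, it is worth confirming how /etc/fstab actually refers to /boot.  This is a minimal sketch that assumes the usual whitespace-separated fstab layout.  The function names fstab_source and mounted_by_uuid are made up for illustration.

```shell
#!/bin/sh
# fstab_source MOUNTPOINT [FSTAB]
# Print the source field (column 1) of the fstab entry for MOUNTPOINT.
fstab_source() {
    awk -v mp="$1" '$1 !~ /^#/ && $2 == mp { print $1 }' "${2:-/etc/fstab}"
}

# mounted_by_uuid MOUNTPOINT [FSTAB]
# Succeed only if that entry names its source as UUID=...
mounted_by_uuid() {
    case "$(fstab_source "$1" "$2")" in
        UUID=*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage:
#   mounted_by_uuid /boot || echo "/boot is mounted by device name"
```

If the entry uses a device name such as /dev/sdf1 instead, the mount will not follow the file system over to /dev/md1.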
Try a GRUB install:
update-grub
grub-install --allow-floppy /dev/sdg
  • This first invocation failed for me, with GRUB unable to figure out the file system on /dev/md1.  Running grub-probe -v revealed the same results.  To fix:
umount /boot
mdadm -S /dev/md1
mdadm --assemble --scan
  • mdadm should report that our little md1 array was started successfully with one drive.
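
mdadm's own message is the main thing to watch here, but the same state can be read back from /proc/mdstat.  The sketch below assumes the usual "md1 : active raid1 ..." line format.  md_is_active is a hypothetical helper, not an mdadm command.

```shell
#!/bin/sh
# md_is_active NAME [MDSTAT]
# Succeed if the array NAME (e.g. md1) is listed as active in /proc/mdstat.
md_is_active() {
    awk -v md="$1" '
        $1 == md && $3 == "active" { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mdstat}"
}

# Usage:
#   md_is_active md1 && mount /boot
```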
mount /boot
update-grub
grub-install --allow-floppy /dev/sdg
  • This time it succeeds.  Now we attempt a reboot.
  • RESULTS: Reboot was mostly successful.  Saw a warning about fd0 being unreadable, but GRUB cranked up and booted the OS.  One snag: /boot failed to mount, and Ubuntu has the lovely feature of freezing all further booting until the administrator can intervene.
At this point, I am seeing that the operating system is trying to mount /dev/md1 on boot, but for some reason it fails and decides it needs to run fsck on the device.  But the device is busy, and so it can't.  It then waits in limbo until I hit the S or the I key (to skip mounting or ignore the issue, respectively).  To test how a file system check would affect a sister system, I ran tune2fs -C 30 /dev/md1 on it and then rebooted.  This forced a check on the next boot.  Happily, or sadly, it did the check without blinking and went right to work bringing the rest of the OS online.
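
As far as I understand it, the tune2fs -C trick works because e2fsck forces a check once the mount count reaches the file system's maximum mount count, provided that maximum is a positive number.  Here is a minimal sketch that reads tune2fs -l output and reports whether a check looks due.  It assumes the "Mount count:" and "Maximum mount count:" lines are present, and check_due is a made-up name.

```shell
#!/bin/sh
# check_due
# Read `tune2fs -l` style output on stdin.  Succeed if the mount count has
# reached a positive maximum mount count (my assumption about the rule).
check_due() {
    awk -F: '
        $1 == "Mount count"         { cnt = $2 + 0 }
        $1 == "Maximum mount count" { max = $2 + 0 }
        END { exit !(max > 0 && cnt >= max) }
    '
}

# Usage (as root):
#   tune2fs -l /dev/md1 | check_due && echo "fsck expected on next boot"
```

A maximum of -1 or 0 disables the mount-count check, in which case bumping the count with -C would not trigger anything.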

Now, note that the major difference between these two systems is that one uses 0.90 metadata, and the other (malfunctioning) system uses version 1.2 metadata.  Also, the malfunctioning system has a strange RAID configuration (a single-device RAID-1).  To test a theory, I'm going to toy with trying to build a full-device RAID drive.  In the past this has failed for some unknown reason - grub-install wouldn't take.  Perhaps now we can explore a bit further.

* * * * * *

I toyed around with doing a whole-device RAID-1 as the boot drive.  Not a good idea.  GRUB doesn't REALLY understand, and I still have problems with fsck hosing up and trying to read the individual partitions of the RAID device, instead of the RAID device itself.  I think it may be an even bigger problem because I created a partition table and set the primary partition to be type "Linux".  Perhaps fsck is seeing this and deciding to attempt a check even though it really shouldn't?  Anyway, all that aside, the rebuild onto the secondary device certainly worked well, and the system DID actually boot once.  I fear it's not a stable solution, however, and stability is key to the success of this mission.

As I had copied all the boot files to /boot2 (a temporary on-drive space that was safe), I decided to ditch the single-drive RAID-1 in favor of going all the way in one shot.  I again created a new RAID, this time using both devices - more specifically, a partition from each device, set up as I had done above.  I made sure each partition was coded as type 0xfd.  Once the RAID was synced, I created the file system (make sure you assign the correct UUID, the one /etc/fstab expects for /boot, or things will break!) and copied the boot files over.
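
I did not record the exact commands for this pass, so what follows is a reconstruction under assumptions rather than a transcript: ext2 as the file system, mdadm's default metadata version, and the device names used earlier in this post.  The sketch only prints the commands so they can be reviewed first.  plan_mirror is a hypothetical helper, and the UUID argument is a placeholder for whatever /etc/fstab lists for /boot.

```shell
#!/bin/sh
# plan_mirror MD PART_A PART_B FS_UUID SRC_DIR
# Print (do not run) the commands for a two-member RAID-1 holding /boot.
plan_mirror() {
    md=$1 a=$2 b=$3 uuid=$4 src=$5
    echo "mdadm --create $md -l 1 -n 2 $a $b"
    echo "cat /proc/mdstat    # repeat until the resync has finished"
    echo "mkfs.ext2 -U $uuid $md"    # ext2 is an assumption; -U sets the UUID
    echo "mount /boot"               # relies on the UUID entry in /etc/fstab
    echo "cp -a $src/. /boot/"
}

# Usage:
#   plan_mirror /dev/md1 /dev/sdf1 /dev/sdg1 "<UUID from fstab>" /boot2
```

Creating the file system with the UUID that fstab already expects is what lets a plain "mount /boot" keep working without editing fstab.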

I ran dpkg-reconfigure grub-pc a few times.  The first few times it just regenerated the grub.cfg file.  After I ran grub-install --allow-floppy /dev/sdf, and again for /dev/sdg, dpkg-reconfigure grub-pc saw the two drives and re-ran grub-install on them.  I'm not sure if that had any positive effects beyond what I had done above, but the system now boots correctly, loads correctly, and doesn't try to fsck the underlying media!
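
Each member of the mirror needs GRUB's boot code in its own MBR, otherwise the machine can only start from whichever drive happened to get it.  Here is a minimal sketch of the two grub-install runs as a loop.  The GRUB_INSTALL override is an illustrative knob of this sketch for previewing the commands, not a GRUB feature.

```shell
#!/bin/sh
# install_grub_all DISK...
# Run grub-install on every listed disk and stop at the first failure.
# Set GRUB_INSTALL="echo grub-install" to preview without touching any disk.
install_grub_all() {
    for disk in "$@"; do
        ${GRUB_INSTALL:-grub-install} --allow-floppy "$disk" || return 1
    done
}

# Usage (as root), with the device names from this post:
#   install_grub_all /dev/sdf /dev/sdg
```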

I'm done.

P.S.
I'm not sure why the steps for the single-device RAID-1 failed.  I honestly don't believe the single-device configuration was the fault.  I am still at a complete loss to explain why fsck wanted to check the underlying media constantly, unless there is some hidden magick that I was unaware of in the installation scripts.  Someday I'll have to play some more to better understand this bit of funk.  Right now I have to configure these devices for Heartbeat and make sure I can transparently fail-over an iSCSI connection to a virtual IP.  Then it's virtualization GO TIME!

