The Case of the Missing RAID

I have a few servers with hardware RAID built into the motherboard. They’re not the best boards in the world, but they process my data and serve up the information I want. Recently, I noticed that one of the servers was running on the /dev/sdb* devices, which was extremely odd. Digging some more, I found that the /dev/sda* devices existed and appeared to be fine, but weren’t being used.
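For anyone wondering how you’d spot something like this, checking the mounted filesystems is the quickest way to see which devices a box is actually running on:

df -h
mount

Both will show whether / and /boot live on /dev/sda*, /dev/sdb*, or a /dev/mapper/* device.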

After some searching, I was able to determine that the server, when built, actually booted from /dev/mapper/via_* devices, which were the onboard RAID. At some point those devices disappeared. To make matters worse, it seems that kernel updates weren’t being applied correctly. My guess is that either the GRUB update was failing, or it was updating a boot loader somewhere that wasn’t actually being used to boot. As a result, an older kernel was loading, with no way to get to the newer one.
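A quick way to see this kind of kernel mismatch is to compare the running kernel against the kernels that are installed. On a CentOS box that’s roughly:

uname -r
rpm -q kernel

If the newest kernel in the rpm output never shows up in uname -r after a reboot, the boot loader isn’t picking up the updates.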

I spent some time tonight digging around with Google, posting messages on the CentOS forums, and poking around on the system itself. With guidance from a user on the forums, I discovered that my system should be using dmraid, a tool that discovers and activates RAID sets like the one I have. Digging around a bit more with dmraid, I found this:

[user@dev ~]$ sudo /sbin/dmraid -ay -v
Password:
INFO: via: version 2; format handler specified for version 0+1 only
INFO: via: version 2; format handler specified for version 0+1 only
RAID set "via_bfjibfadia" was not activated
[user@dev ~]$
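For what it’s worth, dmraid can also report the raw metadata it finds on each disk and the status of the sets it knows about:

sudo /sbin/dmraid -r
sudo /sbin/dmraid -s

The via_* set is clearly being discovered (it’s named in the error above); it just refuses to activate.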

Apparently my RAID metadata is version 2, and dmraid’s via format handler only supports versions 0 and 1. Since this was working initially, I’m at a loss as to why my RAID is suddenly not supported. I suppose I could rebuild the machine, again, and check, but it’s 60+ miles from me and I’d rather not have to migrate the data anyway.

So how does one go about fixing such a problem? Is my RAID truly not supported? Why did it work when I built the system? What changed? If you know what I’m doing wrong, I’d love to hear from you… This one has me stumped. But fear not, when I have an answer, I’ll post a full writeup!


Linux Software RAID

I had to replace a bad hard drive in a Linux box recently, and I thought I’d detail the procedure I used. This particular box uses software RAID, so there are a few extra steps involved in getting the new drive up and running.

Normally when a hard drive fails, you lose any data on it. This is, of course, why we back things up. In my case, I have two drives in a RAID level 1 configuration. There are a number of RAID levels offering varying degrees of redundancy (or, in the case of level 0, none at all). The standard RAID levels are as follows (copied from Wikipedia):

  • RAID 0: Striped Set
  • RAID 1: Mirrored Set
  • RAID 3/4: Striped with Dedicated Parity
  • RAID 5: Striped Set with Distributed Parity
  • RAID 6: Striped Set with Dual Distributed Parity

There are additional RAID levels for nested RAID as well as some non-standard levels. For more information on those, see the Wikipedia article referenced above.


The hard drive in my case failed in kind of a weird way: only one of the partitions on the drive was malfunctioning. Upon booting the server, however, the BIOS complained about the drive being bad. So, better safe than sorry, I replaced it.

RAID level 1 is a mirrored set. As with most RAID levels, the drives in the array should be identical. It is possible to mix different models and sizes in the same array, but there are drawbacks such as reduced speed, potentially higher failure rates, wasted space, and so on. Replacing a drive in a mirrored array is pretty straightforward. After identifying the problem drive, I physically removed the faulty drive and replaced it with a new one.

In my case the secondary drive was the one that failed, so the replacement was pretty easy. In the case of a primary drive failure, it’s easiest to move the secondary drive into the primary slot before replacing the failed drive.
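If you’re not sure which member of the array has failed, the kernel’s view of the array is the place to look. Assuming the array is /dev/md1, either of these will show it (run as root):

cat /proc/mdstat
mdadm --detail /dev/md1

A failed member shows up flagged with (F) in /proc/mdstat. If the drive hasn’t already been kicked out of the array, it can be marked failed and removed before you pull it, e.g.:

mdadm /dev/md1 --fail /dev/hdb1 --remove /dev/hdb1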

Once the new drive has been installed, boot the system back up. It should come up into your favorite Linux distro normally, aside from a few errors about the degraded RAID arrays.

After the system has booted, log in and use fdisk to partition the new drive, making sure to set the partition IDs to Linux raid autodetect. When finished, the partition table will look something like this:

   Device Boot      Start         End      Blocks   Id  System
/dev/hdb1   *           1          26      208813+  fd  Linux raid autodetect
/dev/hdb2              27        3850    30716280   fd  Linux raid autodetect
/dev/hdb3            3851        5125    10241437+  fd  Linux raid autodetect
/dev/hdb4            5126       19457   115121790    f  W95 Ext'd (LBA)
/dev/hdb5            5126        6400    10241406   fd  Linux raid autodetect
/dev/hdb6            6401        7037     5116671   fd  Linux raid autodetect
/dev/hdb7            7038        7164     1020096   82  Linux swap
/dev/hdb8            7165       19457    98743491   fd  Linux raid autodetect
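If you’d rather not recreate the partition table by hand in fdisk, it can be copied straight from the surviving drive. Assuming the good drive is /dev/hda and the new drive is /dev/hdb, sfdisk will do it in one shot:

sfdisk -d /dev/hda | sfdisk /dev/hdb

The -d flag dumps the existing table in a format sfdisk can read back, so the new drive ends up with the same layout, raid autodetect IDs and all.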

Once the partitions have been set up, you need to put a filesystem on each one. This is a pretty painless process, depending on your filesystem of choice. I happen to be using ext3, so I use the mke2fs program to format the partitions. To format an ext3 partition, use the following command (this command, as well as the commands that follow, needs to be run as root, so be sure to use sudo):

mke2fs -j /dev/hdb1
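The same command gets run for each of the other RAID partitions, so with the layout above that means /dev/hdb2, /dev/hdb3, /dev/hdb5, /dev/hdb6, and /dev/hdb8 as well. If you don’t feel like typing it five times, a quick shell loop takes care of it:

for part in 2 3 5 6 8; do mke2fs -j /dev/hdb$part; done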

Once all of the RAID partitions have been formatted, you can move on to creating the swap partition. This is done using the mkswap program as follows:

mkswap /dev/hdb7

Once the swap partition has been formatted, activate it so the system can use it. The swapon command achieves this goal:

swapon /dev/hdb7

And finally, you can add the new partitions back into the RAID arrays using mdadm. mdadm is a single command with a plethora of uses; it builds, monitors, and alters RAID arrays. To add a partition to an array, use the following:

mdadm --add /dev/md1 /dev/hdb1
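Each RAID partition goes back into its own array, so this step gets repeated for every md device on the system. I’m showing /dev/md1 and /dev/hdb1 here; check /proc/mdstat (or /etc/mdadm.conf) to see which partitions belong to which array, then add each one accordingly, for example:

mdadm --add /dev/md2 /dev/hdb2

and so on for the rest of the arrays.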

If you’d like to watch the array rebuild itself, which is about as much fun as watching paint dry, you can do the following:

watch cat /proc/mdstat
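If watching the full list is overkill, mdadm will also report the rebuild progress for a single array:

mdadm --detail /dev/md1

which shows the rebuild percentage while the resync is running.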

And that’s all there is to it. Software RAID has come a long way and is quite stable these days. I’ve been happily running it on my Linux machines for several years now. It works well when hardware RAID isn’t available, or simply as a cheaper alternative. I’m quite happy with the performance and reliability of software RAID, and I definitely recommend it.