raid – Technological Musings

The Hard Drive Shuffle

I recently decided to pick up some new hard drives to replace the ones in my server that have been dutifully running for far too many years (somewhere in the 5-10 years range) without any failures. Because the process of swapping things out was a tad more involved that simply replacing the drives, I thought I’d write it up.

My server has nine drives in it across several SATA controllers. One of the drives is an SSD where I put the OS, two drives are mirrored on a slower SATA controller, and the remaining six are in a RAID6 array. The RAID6 array is the target of this upgrade. I picked up some shiny new Seagate Exos X16 16TB drives to replace them.

Before I could start replacing the drives, I needed to figure out which slots corresponded to which drives. Unfortunately, there are no lights I can blink on the case to figure this out, so a quick shutdown of the system so I can pull all the drives and write down the serial number and what slot they were in. Once complete, I was able to use udevadm to figure out which device corresponded to which serial number.

[root@server root]# udevadm info --query=all --name=/dev/sda | grep ID_SERIAL
E: ID_SERIAL=ST16000NM001G-SERIALNUM

The RAID array is set up using Linux Software Raid, so mdadm is the tool I’ll be using to fail and replace the drives, one by one. After some reading, the process is as follows :

First, fail and then remove the device to be replaced from the array :

[root@server root]# mdadm --manage /dev/md0 --fail /dev/sda1
[root@server root]# mdadm --manage /dev/md0 --remove /dev/sda1

The drive can now be replaced with the new drive. I took an additional step to print out labels for each drive with the serial number on them and labelled the slots. Once the drive has been re-added, you can watch /var/log/messages to see when the new drive is detected by the OS. It can take a minute or two for this to happen.

May 5 12:06:00 server kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 5 12:06:00 server kernel: ata11.00: ATA-11: ST16000NM001G-2KK103, SNA3, max UDMA/133
May 5 12:06:00 server kernel: ata11.00: 31251759104 sectors, multi 16: LBA48 NCQ (depth 32), AA
May 5 12:06:00 server kernel: ata11.00: Features: NCQ-sndrcv
May 5 12:06:00 server kernel: ata11.00: configured for UDMA/133
May 5 12:06:00 server kernel: scsi 10:0:0:0: Direct-Access ATA ST16000NM001G-2K SNA3 PQ: 0 ANSI: 5
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] 31251759104 512-byte logical blocks: (16.0 TB/14.6 TiB)
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] 4096-byte physical blocks
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] Write Protect is off
May 5 12:06:00 server kernel: sd 10:0:0:0: Attached scsi generic sg0 type 0
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] Preferred minimum I/O size 4096 bytes
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] Attached SCSI disk

Now that the drive has been detected, you can partition it and then re-add it to the array. These are pretty huge disks, so I used gdisk to partition them. The process is quite easy. Run gdisk on the drive you want to partition, choose “o” to write a new GPT, choose “n” to create a new partition. I simply hit enter to choose the partition number, first, and last sector. Then enter “fd00” to set the partition to Linux RAID. Finally, you can print out what you’ve done with “p” or jump right to entering “x” to write the partition table and exit.

[root@server root]# gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.7

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries in memory.

Command (? for help): o
This option deletes all partitions and creates a new protective MBR.
Proceed? (Y/N): y

Command (? for help): n
Partition number (1-128, default 1):
First sector (34-31251759070, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-31251759070, default = 31251759070) or {+-}size{KMGTP}:
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): fd00
Changed type of partition to 'Linux RAID'

Command (? for help): p
Disk /dev/sdf: 31251759104 sectors, 14.6 TiB
Model: ST16000NM001G-2K
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 31251759070
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048     31251759070   14.6 TiB    FD00  Linux RAID

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/sda.
mdThe operation has completed successfully.

Now that the drive is paritioned, you can add it into the array.

[root@server root]# mdadm --manage /dev/md0 --add /dev/sda1

And now you have to wait as the array resyncs. With a RAID6 array, you can technically replace two drives at the same time, but you increase your risk because if an additional drive fails, you can kiss goodbye to the data. In fact, when I started this process on my server, the first drive I pulled caused some sort of restart on the SATA controller and I ended up with two failed drives instead of just the one. Since the array already saw the drives as failed, I just went ahead and replaced both at the same time. I have backups, though, so I wouldn’t recommend this to everyone.

You can see the current state of the resync by watching the /proc filesystem. Specifically, the /proc/mdstat file.

[root@server root]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[11] sdg1[10] sdh1[9] sdi1[8] sdb1[7] sda1[6]
      7813525504 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      [>....................]  recovery =  0.0% (148008/1953381376) finish=659.8min speed=49336K/sec
      bitmap: 3/15 pages [12KB], 65536KB chunk

unused devices: <none>

It can take a while. Each drive took about 12-14 hours to resync.

Once you’ve replace the last drive and the array has resynced (YAY!), you have a few more steps to follow. First, we need to tell the system that the array size has increased. Again, a simple process.

[root@server root]# mdadm --grow /dev/md0 --size=max

And once again, you’re resyncing, but your array has now grown to it’s new size.

I wasn’t quite complete, however. On my system, I have an LVM on top of the RAID array, so I needed to tell LVM that the array was larger so I could then allocate that space where needed. This was easily accomplished with one simple command.

[root@server root]# pvresize /dev/md0
  Physical volume "/dev/md0" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized

And poof, the PV was resized as was the associated VG. Lots more space available to allocate where necessary.

The Case of the Missing RAID

I have a few servers with hardware RAID directly on the motherboard. They’re not the best boards in the world, but they process my data and serve up the information I want. Recently, I noticed that one of the servers was running on the /dev/sdb* devices, which was extremely odd. Digging some more, it seemed that /dev/sda* existed and seemed to be ok, but wasn’t being used.

After some searching, I was able to determine that the server, when built, actually booted up on /dev/mapper/via_* devices, which were actually the hardware RAID. At some point these devices disappeared. To make matters worse, it seems that kernel updates weren’t being applied correctly. My guess is that either the grub update was failing, or it updated a boot loader somewhere that wasn’t actually being used to boot. As a result, an older kernel was loading, with no way to get to the newer kernel.

I spent some time tonight digging around with Google, posting messages on the CentOS forums, and digging around on the system itself. With guidance from a user via the forums, I discovered that my system should be using dmraid, which is a program that discovers and runs RAID devices such as the one I have. Digging around a bit more with dmraid and I found this :

[user@dev ~]$ sudo /sbin/dmraid -ay -v
Password:
INFO: via: version 2; format handler specified for version 0+1 only
INFO: via: version 2; format handler specified for version 0+1 only
RAID set “via_bfjibfadia” was not activated
[user@dev ~]$

Apparently my RAID is running version 2 and dmraid only supports versions 0 and 1. Since this was initially working, I’m at a loss as to why my RAID is suddenly not supported. I suppose I can rebuild the machine, again, and check, but the machine is about 60+ miles from me and I’d rather not have to migrate data anyway.

So how does one go about fixing such a problem? Is my RAID truly not supported? Why did it work when I built the system? What changed? If you know what I’m doing wrong, I’d love to hear from you… This one has me stumped. But fear not, when I have an answer, I’ll post a full writeup!

Linux Software Raid

I had to replace a bad hard drive in a Linux box recently and I thought perhaps I’d detail the procedure I used. This particular box uses software raid, so there are a few extra steps to getting the drive up and running.

Normally when a hard drive fails, you lose any data on it. This is, of course, why we back things up. In my case, I have two drives in a raid level 1 configuration. There are a number of raid levels that dictate various states of redundancy (or lack thereof in the instance of level 0). The raid levels are as follows (Copied from Wikipedia):

RAID 0: Striped Set
RAID 1: Mirrored Set
RAID 3/4: Striped with Dedicated Parity
RAID 5: Striped Set with Distributed Parity
RAID 6: Striped Set with Dual Distributed Parity

There are additional raid levels for nested raid as well as some non-standard raid levels. For more information on those, see the Wikipedia article referenced above.

The hard drive in my case failed in kind of a weird way. Only one of the partitions on the drive was malfunctioning. Upon booting the server, however, the bios complained about the drive being bad. So, better safe than sorry, I replaced the drive.

Raid level 1 is a mirrored raid. As with most raid levels, the hard drives being raided should be identical. It is possible to use different models and sizes in the same raid, but there are drawbacks such as a reduction in speed, possible increased failure rates, wasted space, etc. Replacing a drive in a mirrored raid is pretty straightforward. After identifying the problem drive, I physically removed the faulty drive and replaced it with a new one.

The secondary drive was the failed drive, so this replacement was pretty easy. In the case of a primary drive failure, it’s easiest to move the secondary drive into the primary slot before replacing the failed drive.

Once the new drive has been installed, boot the system up and it should load up your favorite Linux distro. The system should boot normally with a few errors regarding the degraded raid state.

After the system has booted, login to the system and use fdisk to partition the new drive. Make sure you set the drive IDs back to Linux raid. When finished, the partition table will look something like this :

   Device Boot      Start         End      Blocks   Id  System
/dev/hdb1   *           1          26      208813+  fd  Linux raid autodetect
/dev/hdb2              27        3850    30716280   fd  Linux raid autodetect
/dev/hdb3            3851        5125    10241437+  fd  Linux raid autodetect
/dev/hdb4            5126       19457   115121790    f  W95 Ext'd (LBA)
/dev/hdb5            5126        6400    10241406   fd  Linux raid autodetect
/dev/hdb6            6401        7037     5116671   fd  Linux raid autodetect
/dev/hdb7            7038        7164     1020096   82  Linux swap
/dev/hdb8            7165       19457    98743491   fd  Linux raid autodetect

Once the partitions have been set up, you need to format the drive with a filesystem. This is a pretty painless process depending on your filesystem of choice. I happen to be using ext3 as my filesystem, so I use the mke2fs program to format the drive. To format an ext3 partition use the following command (This command, as well as the commands that follow, need to be run as root, so be sure to use sudo.) :

mke2fs -j /dev/hdb1

Once all of the drives have been formatted you can move on to creating the swap partition. This is done using the mkswap program as follows :

mkswap /dev/hdb7

Once the swap drive has been formatted, activate it so the system can use it. The swapon command achieves this goal :

swapon -a /dev/hdb7

And finally you can add the drives to the raid using mdadm. mdadm is a single command with a plethora of uses. It builds, monitors, and alters raid arrays. To add a drive to the array use the following :

mdadm -a /dev/md1 /dev/hdb1

And that’s all there is to it. If you’d like to watch the array rebuild itself, about as much fun as watching paint dry, you can do the following :

watch cat /proc/mdstat

And that’s all there is to it. Software raid has come a long way and it’s quite stable these days. I’ve been happily running it on my Linux machines for several years now. It works well when hardware raid is not available or as a cheaper solution. I’m quite happy with the performance and reliability of software raid and I definitely recommend it.