Software RAID issues

This page covers common issues you may encounter with software RAID.

⏏️ Disk replacement with software RAID

If XCP-ng was installed using the software RAID 1 full-disk mirror method, a disk failure can be fixed by replacing the disk. Here's how:

If the host can't boot anymore

Boot to the XCP-ng installer ISO in shell mode.

Once booted into your XCP-ng install or the ISO

Enter the following command to check the state of the array:

cat /proc/mdstat

This will return output similar to:

Personalities : [raid1]
md127 : active raid1 nvme0n2[3] nvme0n1[2]
62914432 blocks super 1.0 [2/1] [U_]

unused devices: <none>

[U_] indicates that the RAID is degraded: each U is a healthy member, and _ marks a missing or failed one. Now we will repair it.
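For a more detailed view than mdstat provides (per-member state, event count, array UUID), you can also query the array directly. This is a read-only mdadm command and won't modify anything:

mdadm --detail /dev/md127

In a degraded array, the device list at the bottom of its output should show one slot as removed or faulty.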

Remove the damaged disk

Let's assume the failed disk is nvme0n1:

mdadm --manage /dev/md127 --fail /dev/nvme0n1

Now mdstat shows nvme0n1 as failed:

Personalities : [raid1]
md127 : active raid1 nvme0n2[3] nvme0n1[2](F)
62914432 blocks super 1.0 [2/1] [U_]

unused devices: <none>

Now we can remove the disk from the RAID:

mdadm --manage /dev/md127 --remove /dev/nvme0n1

The disk no longer appears in mdstat:

Personalities : [raid1]
md127 : active raid1 nvme0n2[3]
62914432 blocks super 1.0 [2/1] [U_]

unused devices: <none>

The disk is successfully removed.
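
Before powering off, it's a good idea to note the failed disk's serial number so you pull the right drive from the chassis. For NVMe disks the kernel exposes it through sysfs (the controller nvme0 corresponds to the namespace nvme0n1):

cat /sys/class/nvme/nvme0/serial

If smartmontools is available in your environment, smartctl -i /dev/nvme0n1 prints the same information along with the model name.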

Add a new/replacement disk to the RAID

Now we can add a replacement disk. Shut down your host, install the new disk in your system, then boot back into your XCP-ng install or the installer ISO.
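
Before adding it, you may want to confirm that the replacement is at least as large as the surviving member, otherwise mdadm will refuse it. Assuming the new disk appears as /dev/nvme0n1 again (device names can change after a swap, so double-check yours):

blockdev --getsize64 /dev/nvme0n1 /dev/nvme0n2

Both sizes are printed in bytes, one per line. Now add the disk to the RAID: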

mdadm --manage /dev/md127 --add /dev/nvme0n1

mdstat shows that disk nvme0n1 is in the RAID and is synchronizing with nvme0n2:

Personalities : [raid1]
md127 : active raid1 nvme0n2[3] nvme0n1[4]
62914432 blocks super 1.0 [2/1] [U_]
[=>...................] recovery = 9.9% (6242368/62914432) finish=7.4min speed=127535K/sec

unused devices: <none>
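
Rather than re-running cat /proc/mdstat by hand, you can monitor the rebuild with watch, or let mdadm block until the recovery finishes:

watch cat /proc/mdstat

mdadm --wait /dev/md127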

The rebuild is complete once mdstat looks like this:

md127 : active raid1 nvme0n2[3] nvme0n1[4]
62914432 blocks super 1.0 [2/2] [UU]

unused devices: <none>

[UU] is back: the RAID is repaired, and you can now reboot the host.

If the system is still unbootable

This might happen for various reasons. If you haven't backed up the contents of the disks yet, you really should do so now, in case data was corrupted on more than one disk. Clonezilla is a good open-source live ISO for this if you don't already have a favorite tool. It can back up to another disk, or to a network share.
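
If you prefer to stay in the shell you already have, a raw disk image over SSH is one minimal alternative. This is a generic sketch, not an XCP-ng-specific procedure: the user, host and path are placeholders, and status=progress requires a reasonably recent dd (drop it if yours doesn't support it):

dd if=/dev/nvme0n1 bs=4M status=progress | ssh user@backuphost 'cat > /backup/nvme0n1.img'

Keep in mind this copies the entire disk, free space included, so the destination needs at least as much room as the disk itself.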

It has been reported to us that some non-enterprise motherboards may have limited UEFI firmware that does not cope well with disk changes.

In most cases, you should be able to restore the bootloader by upgrading your host to the same version it is already running (e.g. upgrading 8.2 to 8.2 using the 8.2 install ISO). Check the upgrade docs for the usual instructions and warnings. Alternatively, a more manual fix is to run the appropriate efibootmgr commands from the installer's shell; refer to its documentation.
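
As a rough illustration of what that involves (not a recipe to run blindly: the partition number and loader path below are assumptions, so check your actual EFI system partition first), efibootmgr -v lists the current boot entries, and a command along these lines recreates one for the replacement disk:

efibootmgr -v

efibootmgr -c -d /dev/nvme0n1 -p 1 -L "XCP-ng" -l '\EFI\xenserver\grubx64.efi'

Here -d and -p point at the disk and the EFI system partition on it, -L sets the entry label, and -l gives the path of the bootloader within that partition.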