Hot spare on RAID
Today I learned about an important feature available in RAID: the hot spare. When one of the hard drives in a RAID array fails, the hot spare drive, which sits idle in standby mode, is immediately switched into operation and replaces it without the need for a system shutdown.
This “hot spare” hard drive does not store any data while it is idle. When one of the disks in the array fails, the hot spare takes the place of the faulty disk, and the lost data is rebuilt onto it. The rebuild relies entirely on the data redundancy provided by the other RAID disks, so RAID 0, which has no redundancy, cannot support hot spares. Once the rebuild has finished, we can take out the faulty disk, insert a new one, and assign it as the new hot spare.
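To illustrate how data can be rebuilt from the redundancy on the other disks, here is a minimal sketch of single-parity (XOR) reconstruction, the idea behind RAID 5. The block contents and the `xor_blocks` helper are made up for illustration; real controllers work on whole stripes and rotate parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (illustrative values).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The disk holding data[1] fails. Its block is rebuilt onto the hot spare
# by XOR-ing every surviving block in the stripe, parity included.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The same XOR that produced the parity block also recovers any single missing block, which is why a RAID 0 stripe, with no parity at all, has nothing to rebuild from.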
Suggestions for hot spares:
- You have to check that the drive supports hot sparing.
- The capacity must be larger than or equal to that of the faulty drive.
- Use the same drives (brand, specifications, speed, etc.) if possible. For example, when you use 3 SAS 15000 RPM hard drives in a RAID 5, it’s better that the hot spare is a SAS 15000 RPM drive, too. If you use a slower SATA III drive instead, it will slow down the entire RAID’s performance once it joins the array.
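The capacity and matching-drive suggestions above can be sketched as a small check. The `Drive` record and `check_spare` function are hypothetical names, not part of any RAID tool, and comparing against the largest member is a conservative assumption, since any member could be the one that fails.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    capacity_gb: int
    interface: str  # e.g. "SAS" or "SATA"
    rpm: int


def check_spare(spare, members):
    """Return (usable, warnings) for a candidate hot spare."""
    # Any member may be the one that fails, so compare against the largest.
    usable = spare.capacity_gb >= max(d.capacity_gb for d in members)
    warnings = []
    if any(d.interface != spare.interface for d in members):
        warnings.append("interface differs from array members")
    if any(d.rpm != spare.rpm for d in members):
        warnings.append("rotational speed differs from array members")
    return usable, warnings


# Three SAS 15000 RPM members, with a larger but slower SATA candidate.
array = [Drive(600, "SAS", 15000) for _ in range(3)]
usable, warnings = check_spare(Drive(1000, "SATA", 7200), array)
print(usable, warnings)
```

Here the SATA drive is large enough to act as a spare, but both warnings fire, which matches the performance concern in the last suggestion.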
RAID 5 + Hot Spare vs. RAID 6: which one is better?
RAID 5 + Hot Spare: (N – 2) x (min. HDD capacity), where N includes the hot spare
You need at least four drives. Three are for RAID 5, and one acts as a hot spare. (Note: a RAID group can have one hot spare only.)
RAID 6: (N – 2) x (min. HDD capacity)
You need at least four drives. It offers two-drive redundancy: data is striped across multiple disks along with two sets of parity information. The parity data allows lost blocks to be reconstructed after a drive failure.
Both combinations use the same number of drives and give the same capacity, but RAID 6 offers a higher level of data redundancy than RAID 5 and handles faults better, so its overall data security is better. The reason is that RAID 5 can only withstand one drive failure (with the hot spare ready to replace the failed drive). If two drives fail at the same time, the data cannot be saved. RAID 6, on the other hand, can withstand two drive failures at the same time. RAID 5 with a hot spare also needs time to rebuild the data onto the spare, and that time depends on the total capacity of the drives and the amount of data; until the rebuild completes, the array has no redundancy left. RAID 6 has one drawback: because of its additional fault tolerance mechanism, its performance, especially for writes, is weaker than that of RAID 5.
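As a quick check of the two capacity formulas above, the sketch below computes usable capacity for both layouts. The function names are made up for illustration, and N is assumed to count every drive in the enclosure, including the hot spare.

```python
def raid5_hot_spare_capacity(n_drives, min_capacity_tb):
    """Usable capacity of RAID 5 plus one hot spare: (N - 2) x min capacity."""
    if n_drives < 4:
        raise ValueError("RAID 5 with a hot spare needs at least 4 drives")
    # One drive's worth of space goes to parity, one drive sits idle as spare.
    return (n_drives - 2) * min_capacity_tb


def raid6_capacity(n_drives, min_capacity_tb):
    """Usable capacity of RAID 6: (N - 2) x min capacity."""
    if n_drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    # Two drives' worth of space goes to parity.
    return (n_drives - 2) * min_capacity_tb


# Four 4 TB drives give the same usable space either way.
print(raid5_hot_spare_capacity(4, 4), raid6_capacity(4, 4))
```

The capacity is identical, so the choice between the two comes down to fault tolerance and performance rather than space.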
We usually use drives from the same batch, so once one of the drives fails, it’s possible that the other drives are close to failing, too. Taking this into account, on a RAID 5 system with a hot spare there is a risk that another drive will fail while the data is being rebuilt.
Therefore, we should keep in mind that RAID data redundancy is not the same as data backup. To ensure data safety, we need to back up our data daily to offline or off-site storage.
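To get a rough feel for this risk, here is a minimal sketch that estimates the rebuild time and the chance of a second failure during it. It assumes drives fail independently at a constant rate, which is optimistic for drives from the same batch, and the failure rate, capacity and rebuild speed in the example are placeholder values, not measurements.

```python
import math


def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Hours needed to rewrite one drive of the given size at a steady rate."""
    return capacity_tb * 1_000_000 / rebuild_mb_per_s / 3600


def second_failure_probability(surviving_drives, annual_failure_rate, hours):
    """Chance that at least one surviving drive fails within the window.

    Assumes independent drives with a constant failure rate, so treat the
    result as a lower bound for drives that come from the same batch.
    """
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / 8760
    return 1.0 - math.exp(-surviving_drives * rate_per_hour * hours)


# Placeholder inputs: 4 TB drives, 100 MB/s rebuild, 3 survivors, 5% AFR.
hours = rebuild_hours(4, 100)
risk = second_failure_probability(3, 0.05, hours)
print(f"rebuild takes about {hours:.1f} h, second-failure risk {risk:.4%}")
```

Larger drives and slower rebuilds widen the window, and rebuild stress or a shared batch defect pushes the real figure above what this simple model gives.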
Is Hot Spare safe?
It’s true that a hot spare helps to minimize the time an array spends in a degraded state, but our goal in creating a Redundant Array of Inexpensive Disks is to continue operating and not lose data in the event of a drive failure. Anything that increases the risk of data loss is a bad idea.
During a RAID rebuild the probability of an additional drive failure is quite high – a rebuild is stressful on the existing drives. It is advisable to follow this procedure once the array shows a degraded state as a result of a drive failure.
- Run a full data backup.
- Verify the backed-up data for consistency, and verify whether the data restore mechanism works.
- Identify the source of the problem, i.e. find the faulty hard disk. If possible, shut down the server, and make sure the serial number of the hard disk matches the one reported by the RAID controller.
- Replace the hard disk identified as bad with a new, unused one. If the replacement hard drive had already been used within another RAID array, make sure that any residual RAID metadata on it has been deleted via the original RAID controller.
- Start the rebuild of the RAID.
So with this approach, the rebuild is the 5th step! With a hot spare, your RAID skips the first two very important steps and then runs steps 3, 4 and 5 automatically. The rebuild is therefore done before the critical steps that ensure your data is safe.
Being aware of Murphy’s Law, no one would risk an immediate rebuild after a drive failure – but with a hot spare this is exactly what will happen. If you stop and think about the integrity of your data, you will come to the same conclusion: a hot spare drive is a very bad idea.