Hard Disk Failure Advisory for the Intel® Modular Server

Last Reviewed: 26-Jul-2016
Article ID: 000007020

Hard disk drives remain the component most likely to fail in a computer, because their heads and platters are constantly moving. Hard disk failures are a common cause of data loss. A redundant RAID array (RAID 1, RAID 10, RAID 5 or RAID 6) provides a certain amount of protection, but it does not replace a regular backup of business or personal data.

The Intel® Modular Server has a sophisticated storage structure. The available physical hard drives are used to create storage pools. One or more virtual drives are created in each pool and assigned to the different servers in the system.

A drive failure in a storage pool that holds more than one virtual drive can therefore affect all the virtual drives in that pool.
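
The relationship between physical drives, storage pools, virtual drives and servers can be illustrated with a minimal sketch. The class names, slot labels and server names below are hypothetical and chosen only for illustration; they are not part of the modular server software or its GUI.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualDrive:
    name: str
    assigned_server: str  # hypothetical label for the server using this virtual drive


@dataclass
class StoragePool:
    name: str
    physical_drives: set  # hypothetical slot labels of the member drives
    virtual_drives: list = field(default_factory=list)


def affected_by_failure(pools, failed_drive):
    """List (virtual drive, server) pairs in every pool that contains the failed drive."""
    return [
        (vd.name, vd.assigned_server)
        for pool in pools
        if failed_drive in pool.physical_drives
        for vd in pool.virtual_drives
    ]


# Illustrative layout: two pools, the first one holding two virtual drives.
pools = [
    StoragePool("pool-a", {"slot1", "slot2", "slot3"},
                [VirtualDrive("vd1", "server1"), VirtualDrive("vd2", "server2")]),
    StoragePool("pool-b", {"slot4", "slot5"},
                [VirtualDrive("vd3", "server3")]),
]

print(affected_by_failure(pools, "slot2"))
# prints [('vd1', 'server1'), ('vd2', 'server2')]
```

In this example a single failed drive in "pool-a" touches both virtual drives in that pool, and therefore both servers, while "pool-b" is unaffected.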

Owners or administrators of an Intel® Modular Server can take the following steps to prepare for hard drive failures in this system.

  • Purchase one or two extra hard drives together with the modular server. Drives purchased at the same time are likely to have the same drive firmware and are immediately at hand if a disk in a production system fails.

  • Configure one drive as a general or dedicated hot spare.

  • Configure e-mail alerts to get warnings of drive failures or Predictive Failure Alert (PFA) conditions.

    • The drive firmware sends a PFA to the Storage Controller Module (SCM) when it detects a real or suspected problem on the disk.

  • If a hot spare is configured, a PFA condition typically results in an immediate Predictive Drive Migration (PDM).

  • A PFA condition is predictive, so the affected drive may continue to operate even after a PDM has taken place. To replace such a drive, force it offline manually before removing it from the chassis. This ensures that the data previously migrated to the hot spare is transitioned back to the new drive once it is inserted in the same slot. The event log records this activity as a transition.

  • PDMs and transitions back to a new drive run as background activities. Depending on the load of the server during a normal working day, these activities may reduce performance or take longer to complete.

  • Any migration can leave a hard disk in a “stale” condition. A drive becomes stale when its data is outdated. This can occur when the drive is taken offline by the user (using the Force Offline action), is physically removed, suffers a disk error, or undergoes a PDM. When the storage pool is rebuilt to compensate for the missing drive, the drive is marked as stale. To make a stale physical disk available again, select the drive in the modular server GUI and use the “Clear Stale Condition” action to bring it back online. (This is not recommended if the drive was marked stale because the drive itself failed.)

  • A real, sudden disk failure appears in the event log as constant drive resets and command timeouts for the drive. A drive failure triggers an e-mail alert if alerting is set up. Such a failure can leave the storage pool in a critical condition until the drive is rebuilt, and it can affect the compute modules’ access to this storage pool. If a hot spare is configured, the data from the failed drive migrates to the hot spare and transitions back when the defective drive is replaced. If no hot spare is configured, the storage pool, virtual drives and compute modules may be vulnerable to a second drive failure, depending on the configured RAID level. In that situation it is imperative to replace the failed drive as soon as possible, so that a rebuild can start and the impact on the compute modules and their operation is kept as brief as possible.

  • More than one drive can fail at around the same time or in quick succession. Only one background activity can run at a time, so it is recommended to replace the most vulnerable physical disk first (one drive in a RAID 5 array, for example) to ensure that the rebuild of that array starts first. Once the migration of this disk has completed, replace the second drive.

  • The modular server GUI includes a Help function. Consult the Help for any action offered for any of the modular server components to establish what the action does and when to use it.
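
The advice above on replacing the most vulnerable disk first can be sketched as a simple ordering by remaining redundancy. This is only an illustration under stated assumptions: the tolerance values are general worst-case properties of the RAID levels named in this document (RAID 1 is assumed to be a two-drive mirror), and the pool names and the function `replacement_order` are hypothetical, not part of the modular server software.

```python
# Guaranteed number of drive failures each RAID level survives (worst case).
# General RAID properties, not values read from the modular server.
GUARANTEED_TOLERANCE = {
    "RAID 1": 1,   # assumes a two-drive mirror
    "RAID 10": 1,  # worst case: both failures hit the same mirror pair
    "RAID 5": 1,
    "RAID 6": 2,
}


def replacement_order(degraded_pools):
    """Sort degraded pools so the one with the least remaining redundancy comes first.

    degraded_pools: list of (pool_name, raid_level, failed_drive_count) tuples.
    """
    def remaining(pool):
        _, level, failed = pool
        return GUARANTEED_TOLERANCE[level] - failed

    return sorted(degraded_pools, key=remaining)


# Hypothetical situation: one failed drive in each of two pools.
pools = [
    ("pool-a", "RAID 6", 1),  # can still survive one more failure
    ("pool-b", "RAID 5", 1),  # no redundancy left
]

for name, level, failed in replacement_order(pools):
    print(name, level, failed)
# prints:
# pool-b RAID 5 1
# pool-a RAID 6 1
```

Here the RAID 5 pool is listed first because a further drive failure there would exceed its redundancy, whereas the RAID 6 pool can still tolerate one more failure.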

Owners or administrators of the Intel® Modular Server who encounter any other drive-failure-related conditions that are not explained in the GUI Help or this document should contact Intel customer support in their region for assistance.