How Fault Resilient Booting (FRB) Works on Intel® Server Boards and Intel® Server Systems

000007197

12/11/2023


Symptom(s):

  • What is FRB?
  • How do I know if it is working?
  • Fault resilient booting on Intel® servers.


Solution:

Fault resilient booting

The BMC (Baseboard Management Controller) implements FRB levels 1, 2, and 3. If the default bootstrap processor (BSP) fails to complete the boot process, FRB attempts to boot using an alternate processor.

  • FRB level 1 is intended to recover from a BIST failure detected during POST. This FRB recovery is fully handled by BIOS code.
  • FRB level 2 is intended to recover from a watchdog timeout during POST. The watchdog timer for FRB level 2 is implemented in the BMC.
  • FRB level 3 is intended to recover from a watchdog timeout on hard reset or power-up. Hardware support is provided for this level of FRB.

FRB-1

In a multiprocessor system, the BIOS registers the application processors (APs) in the multi-processor (MP) table and the ACPI APIC tables. If an AP, once started by the BSP, fails to complete initialization within a certain time, it is assumed to be nonfunctional. If the BIOS detects that an AP has failed BIST or is nonfunctional, it asks the BMC to disable that processor.

The BMC then generates a system reset while disabling the processor, so the BIOS does not see the failed processor in the next boot cycle. The failing AP is not listed in the MP table or the ACPI APIC tables, and is invisible to the OS. If the BIOS detects that the BSP itself has failed BIST, it asks the BMC to disable the current BSP. If no alternate processor is available, the BMC beeps the speaker and halts the system. If the BMC can find another processor, BSP ownership is transferred to that processor via a system reset.
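
The FRB-1 flow above can be sketched as a small simulation. This is only an illustrative model, not Intel BIOS or BMC code: the `Processor` class, the `bist_ok` flag, the `frb1_boot` function, and the rule that the lowest enabled slot becomes the BSP are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    slot: int
    bist_ok: bool = True   # hypothetical flag: passed BIST and initialized in time
    enabled: bool = True   # cleared when the BMC disables the processor


def frb1_boot(processors):
    """Simulate FRB-1 boot cycles.

    Returns (bsp_slot, visible_slots, resets). bsp_slot is None when the
    system halts because no alternate processor is available.
    """
    resets = 0
    while True:
        enabled = [p for p in processors if p.enabled]
        if not enabled:
            return None, [], resets
        bsp = enabled[0]  # assumption: lowest enabled slot becomes the BSP
        # The BIOS looks for a processor that failed BIST or did not initialize.
        failed = next((p for p in enabled if not p.bist_ok), None)
        if failed is None:
            # All enabled processors are healthy; they are the ones the OS sees.
            return bsp.slot, [p.slot for p in enabled], resets
        failed.enabled = False  # the BIOS asks the BMC to disable it
        if failed is bsp and len(enabled) == 1:
            # No alternate processor: the BMC beeps and halts the system.
            return None, [], resets
        resets += 1  # the BMC resets the system and the next boot cycle starts


cpus = [Processor(0, bist_ok=False), Processor(1), Processor(2, bist_ok=False)]
print(frb1_boot(cpus))  # (1, [1], 2)
```

In this run, slot 0 fails BIST as the BSP and slot 2 fails as an AP, so after two resets slot 1 owns the BSP role and is the only processor visible to the OS.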

FRB-2

The second watchdog timer (FRB-2) in the BMC is set by the BIOS for approximately 6 minutes and is designed to guarantee that the system completes BIOS POST. The FRB-2 timer is enabled before the FRB-3 timer is disabled, so there is no unprotected window of time. Near the end of POST, before the option ROMs are initialized, the BIOS disables the FRB-2 timer in the BMC.

If the system contains more than 1 GB of memory and the user chooses to test every DWORD of memory, the watchdog timer is disabled before the extended memory test starts, because the memory test can take more than 6 minutes under this configuration. If the system hangs during POST, the BIOS never disables the timer in the BMC; the timer expires and the BMC generates an asynchronous system reset (ASR).
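
The FRB-2 behavior can be illustrated with a simple simulated clock. This is a minimal sketch under stated assumptions: `run_post`, the step names, and the step durations are hypothetical, a duration of `None` stands for a hang, and the timer is assumed to stay disabled once it has been turned off for the extended memory test (the text does not say whether it is re-armed).

```python
FRB2_TIMEOUT_S = 6 * 60  # "approximately 6 minutes" per the text


def run_post(steps, timeout_s=FRB2_TIMEOUT_S, disable_before=None):
    """Walk through POST steps while the FRB-2 watchdog is armed.

    steps: list of (name, seconds); seconds=None models a hang.
    disable_before: name of the step before which the BIOS disables the
    timer (for example, an extended memory test).
    Returns "ok", "asr" (watchdog expired, BMC resets), or "hung".
    """
    armed = True
    elapsed = 0.0
    for name, seconds in steps:
        if name == disable_before:
            armed = False  # BIOS disables FRB-2 before this step
        if armed and (seconds is None or elapsed + seconds >= timeout_s):
            return "asr"  # timer expires before the BIOS can disable it
        if seconds is None:
            return "hung"  # hang with the watchdog disabled: no recovery
        elapsed += seconds
    return "ok"  # POST finished and the BIOS disabled the timer


normal = [("cpu init", 20), ("memory init", 60), ("video init", 10)]
hang = [("cpu init", 20), ("memory init", None)]
long_test = [("cpu init", 20), ("extended memory test", 600)]

print(run_post(normal))                                            # ok
print(run_post(hang))                                              # asr
print(run_post(long_test))                                         # asr
print(run_post(long_test, disable_before="extended memory test"))  # ok
```

The last two calls show why the timer is disabled before a long memory test: without that step, a healthy but slow POST would be reset as if it had hung.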

FRB-3

The first timer (FRB-3), usually set to about 5 seconds, starts counting down whenever the system comes out of hard reset. If the BSP successfully resets and starts executing, the BIOS disables the FRB-3 timer in the BMC by de-asserting the FRB_TIMER_HLT signal (GPIO), and the system continues with POST. If the timer expires because the BSP failed to fetch or execute BIOS code, the BMC resets the system and disables the failed processor.

The system continues to change the BSP until the BIOS POST gets past disabling the FRB-3 timer in the BMC. The BMC sounds beep codes on the speaker if it fails to find a good processor. The process of cycling through all the processors is repeated upon system reset or power cycle.
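
The FRB-3 cycling described above can be sketched in the same way. The function name `frb3_select_bsp`, the per-processor start-up times, and the 5-second default are assumptions for illustration; a start-up time of `None` models a processor that never fetches BIOS code.

```python
FRB3_TIMEOUT_S = 5.0  # assumed timer length, based on "about 5 seconds"


def frb3_select_bsp(startup_times, timeout_s=FRB3_TIMEOUT_S):
    """Try each processor as BSP until one disables the FRB-3 timer in time.

    startup_times: seconds each processor needs to start executing BIOS
    code after reset, indexed by slot; None means it never starts.
    Returns (bsp_slot, disabled_slots); bsp_slot is None when no good
    processor is found and the BMC sounds beep codes.
    """
    disabled = []
    for slot, seconds in enumerate(startup_times):
        if seconds is not None and seconds < timeout_s:
            # BIOS runs and disables the FRB-3 timer; POST continues.
            return slot, disabled
        # Timer expired: the BMC resets the system and disables this processor.
        disabled.append(slot)
    return None, disabled


print(frb3_select_bsp([None, 0.2]))   # (1, [0])
print(frb3_select_bsp([None, None]))  # (None, [0, 1])
```

Because the function keeps no state between calls, calling it again models the text's point that the cycle through all processors is repeated on the next system reset or power cycle.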