Agilex™ 7 Hard Processor System Technical Reference Manual

ID 683567
Date 4/01/2024
Public

4.2.7. Error Handling

The CCU reports two kinds of errors: uncorrectable errors and correctable errors. Uncorrectable errors are errors that the hardware is unable to correct (for example, a double-bit ECC error). Correctable errors are errors that the hardware can correct; the hardware still reports them to software.

When the CCU detects an error, it logs information about the error and signals it through an interrupt (correctible_error_irq or uncorrectible_error_irq). Each CCU component implements two sets of registers: control registers for enabling error reporting and status registers for logging errors. The error control registers consist of an error detection enable bit, an error interrupt enable bit, and an error threshold field. Software should enable error detection, correction, and logging for all active components. The error threshold field in the correctable error control register determines the number of correctable errors that must occur before a correctable error is logged. Once the programmed threshold is reached, the next detected correctable error is logged.
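
The threshold behavior described above can be illustrated with a small software model. This is only a sketch of one reading of the text: the type ce_logger_model_t, its fields, and the function ce_logger_model_detect are hypothetical names, not CCU registers, and the assumption that the count stops at the threshold value is ours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one correctable error logger.
 * Field names are illustrative; they are not the CCU register layout. */
typedef struct {
    uint32_t threshold; /* programmed error threshold field */
    uint32_t count;     /* correctable errors counted so far */
    bool     valid;     /* error valid bit: an error has been logged */
} ce_logger_model_t;

/* Feed one detected correctable error into the model.
 * Returns true if this error is the one that gets logged. */
bool ce_logger_model_detect(ce_logger_model_t *m)
{
    assert(m != NULL);
    if (m->valid) {
        return false; /* already logged; later behavior is not modeled */
    }
    if (m->count < m->threshold) {
        m->count++;   /* still below the threshold: count only */
        return false;
    }
    m->valid = true;  /* threshold reached: this error is logged */
    return true;
}
```

In this model, a threshold of 2 means the first two correctable errors are only counted and the third is logged; a threshold of 0 means the first error is logged.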

Software is responsible for programming the error registers and handling any errors logged by the hardware. At reset, all the error control registers are initialized to zero, which disables detection, correction, and logging of errors and masks the interrupts. Software must set the enable bits to 1 and program the threshold field to the desired value. The error valid bit and the error overflow bit are also initialized to zero at reset. As errors are detected, these bits are set, which causes the error interrupt signals to be asserted. Software must write a 1 to each set bit to clear the error.
Note: Writing 1 to a bit that is not set is UNDEFINED and may result in a loss of errors. In addition, writing 1 to the error valid bit also clears the error count field.
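
Because writing 1 to a bit that is not set is UNDEFINED, one cautious approach is to derive the acknowledge value from the status value that was just read. The sketch below assumes hypothetical bit positions (ERR_VALID, ERR_OVERFLOW) and uses a plain function, w1c_model_write, to model write-1-to-clear behavior in software; none of these names come from the register map.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define ERR_VALID    (1u << 0)
#define ERR_OVERFLOW (1u << 1)

/* Compute the value to write back to an error status register so that
 * only the error bits that are currently set are acknowledged. */
uint32_t err_ack_value(uint32_t status)
{
    return status & (ERR_VALID | ERR_OVERFLOW);
}

/* Software model of a write-1-to-clear register: every bit written as 1
 * is cleared, and every other bit keeps its value. */
uint32_t w1c_model_write(uint32_t reg, uint32_t written)
{
    /* The model covers only the defined case: written bits must be set. */
    assert((written & ~reg) == 0);
    return reg & ~written;
}
```

With this pattern, a status value with only the error valid bit set produces an acknowledge value that leaves the error overflow bit untouched.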
In Intel Agilex® 7 HPS, the correctable and uncorrectable error interrupts are combined to generate a single CCU_INTERRUPT. The error ISR should execute the following steps:
  1. The ISR reads the Coherent Subsystem Correctable Error Interrupt Status Registers (CSCEISRn) and the Coherent Subsystem Uncorrectable Error Interrupt Status Registers (CSUEISRn) to determine which unit detected the error.
  2. In the case that multiple errors have occurred, the ISR prioritizes the errors and chooses one to handle.
  3. The ISR reads the correctable error status register or the uncorrectable error status register in the unit with the highest-priority error.
  4. If the error valid bit is set in the error status register, the ISR reads the appropriate error location registers.
  5. The ISR acknowledges the error by writing 1 to the error valid bit, the error overflow bit, or both, depending on which of these bits are set.
  6. Using the information from the error status register and error location registers, the ISR performs the desired action to handle the error and returns.
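
The six steps above can be sketched against a software model of the registers. This is a minimal illustration under stated assumptions, not the actual driver: the structure layout, NUM_UNITS, the bit positions, the one-pending-bit-per-unit encoding, and the priority policy (uncorrectable before correctable, lowest unit index first) are all hypothetical. The model also assumes that acknowledging an error clears the corresponding pending bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All sizes, bit positions, and field names below are hypothetical. */
#define NUM_UNITS    4u
#define ERR_VALID    (1u << 0)
#define ERR_OVERFLOW (1u << 1)

typedef struct {
    uint32_t status;   /* modeled error status register */
    uint32_t location; /* modeled error location register */
} unit_err_regs_t;

typedef struct {
    uint32_t csceisr;  /* modeled CSCEISRn: one pending bit per unit */
    uint32_t csueisr;  /* modeled CSUEISRn: one pending bit per unit */
    unit_err_regs_t ce[NUM_UNITS];
    unit_err_regs_t ue[NUM_UNITS];
} ccu_model_t;

typedef struct {
    bool     found;         /* false if no unit had an error pending */
    bool     uncorrectable; /* true if the handled error was uncorrectable */
    unsigned unit;          /* index of the unit that was handled */
    uint32_t status;        /* status value read in step 3 */
    uint32_t location;      /* meaningful only if status has ERR_VALID */
} err_report_t;

/* Return the lowest unit index with a pending bit, or -1 if none. */
static int lowest_pending_unit(uint32_t pending)
{
    for (unsigned i = 0; i < NUM_UNITS; i++) {
        if (pending & (1u << i)) {
            return (int)i;
        }
    }
    return -1;
}

err_report_t ccu_error_isr(ccu_model_t *ccu)
{
    assert(ccu != NULL);
    err_report_t r = { false, false, 0u, 0u, 0u };

    /* Step 1: read the interrupt status registers. */
    uint32_t ce_pending = ccu->csceisr;
    uint32_t ue_pending = ccu->csueisr;

    /* Step 2: choose one error (assumed policy: uncorrectable first). */
    int unit = lowest_pending_unit(ue_pending);
    r.uncorrectable = (unit >= 0);
    if (unit < 0) {
        unit = lowest_pending_unit(ce_pending);
    }
    if (unit < 0) {
        return r; /* nothing pending */
    }
    r.found = true;
    r.unit = (unsigned)unit;
    unit_err_regs_t *regs = r.uncorrectable ? &ccu->ue[r.unit] : &ccu->ce[r.unit];

    /* Step 3: read the error status register of the chosen unit. */
    r.status = regs->status;

    /* Step 4: read the location register only if the error is valid. */
    if (r.status & ERR_VALID) {
        r.location = regs->location;
    }

    /* Step 5: acknowledge only the bits that are set. The model applies
     * write-1-to-clear and drops the unit's pending bit. */
    uint32_t ack = r.status & (ERR_VALID | ERR_OVERFLOW);
    regs->status &= ~ack;
    if (r.uncorrectable) {
        ccu->csueisr &= ~(1u << r.unit);
    } else {
        ccu->csceisr &= ~(1u << r.unit);
    }

    /* Step 6: return the gathered information so the caller can act on it. */
    return r;
}
```

Each call handles one error, so a caller with several pending errors would invoke the routine repeatedly until the returned report has found set to false.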