Intel® Acceleration Stack User Guide: Intel FPGA Programmable Acceleration Card N3000

ID 683040
Date 6/14/2021
Public

10.2. Using OPAE

The fpgad service can help protect the server from crashing when the hardware reaches an upper non-recoverable or lower non-recoverable sensor threshold (also called the fatal threshold). It can monitor each of the 20 sensors reported by the Board Management Controller. To display the sensor readings, run:
$ sudo fpgainfo bmc
For more information about sensors, refer to the Intel FPGA Programmable Acceleration Card N3000 Board Management Controller (BMC) User Guide.
Note: Qualified OEM server systems should provide the required cooling for your workloads. Therefore, using fpgad may be optional.
When the opae-tools-extra-1.3.6-4.x86_64.rpm package is installed, fpgad is placed in the OPAE binaries directory (default: /usr/bin). The configuration file, fpgad.cfg, is located at /etc/opae. The log file, fpgad.log, which records fpgad actions, is located at /var/lib/opae/.

The fpgad service periodically reads the sensor values. If a value exceeds the warning threshold stated in fpgad.cfg or the hardware-defined warning threshold, fpgad masks the PCIe Advanced Error Reporting (AER) registers for the Intel® FPGA PAC to avoid a system reset.

Use the following command to start the fpgad service:
$ sudo systemctl start fpgad

The configuration file only includes threshold settings for the critical sensors 12V Aux Voltage (sensor 25) and 12V Backplane Voltage (sensor 3). These sensors do not have a hardware-defined warning threshold, so fpgad relies on the configuration file. The other two critical sensors, FPGA Core Temperature (sensor 12) and Board Temperature (sensor 13), have hardware-defined warning and fatal thresholds set to the values mentioned in the above table. The fpgad service uses this information to mask the PCIe AER registers when a sensor reaches the warning threshold.
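The threshold comparison described above can be sketched in a few lines of Python. This is an illustration only, not the fpgad implementation: the function name `classify`, the state labels, and the choice to treat a reading exactly at a threshold as tripped are all assumptions. The values used in the usage lines, 11.40 and 10.56, are the example thresholds for sensor 25 in fpgad.cfg.

```python
def classify(value, low_warn=None, low_fatal=None, high_warn=None, high_fatal=None):
    """Return "normal", "warning", or "fatal" for one sensor reading.

    Any threshold left as None is not checked. Treating a reading that
    equals a threshold as tripped is an assumption of this sketch.
    """
    # Fatal limits are checked first because they lie beyond the warning limits.
    if low_fatal is not None and value <= low_fatal:
        return "fatal"
    if high_fatal is not None and value >= high_fatal:
        return "fatal"
    if low_warn is not None and value <= low_warn:
        return "warning"
    if high_warn is not None and value >= high_warn:
        return "warning"
    return "normal"


# Voltage readings checked against the sensor 25 example thresholds.
for reading in (12.0, 11.2, 10.5):
    print(reading, classify(reading, low_warn=11.40, low_fatal=10.56))
```

The voltage sensors only need the lower pair of thresholds, while a temperature sensor would pass only the upper pair.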

The following snapshot of the fpgad.cfg file in /etc/opae/ configures the 12V Aux Voltage sensor (sensor 25):
"fpgad-vc": {
    "configuration": {
        "cool-down": 30,
        "config-sensors-enabled": true,
        "sensors": [
            {
                "id": 25,
                "low-warn": 11.40,
                "low-fatal": 10.56
            }
        ]
    },
    "enabled": true,
    "plugin": "libfpgad-vc.so",
    "devices": [
        [ "0x8086", "0x0b30" ],
        [ "0x8086", "0x0b31" ]
    ]
}
You must create another entry below the 12V Aux Voltage entry for 12V Backplane Voltage (sensor 3). The updated configuration file should have the following entries:
"fpgad-vc": {
    "configuration": {
        "cool-down": 30,
        "config-sensors-enabled": true,
        "sensors": [
            {
                "id": 25,
                "low-warn": 11.40,
                "low-fatal": 10.56
            }
        ]
    },
    "enabled": true,
    "plugin": "libfpgad-vc.so",
    "devices": [
        [ "0x8086", "0x0b30" ],
        [ "0x8086", "0x0b31" ]
    ]
},
"fpgad-vc": {
    "configuration": {
        "cool-down": 30,
        "config-sensors-enabled": true,
        "sensors": [
            {
                "id": 3,
                "low-warn": 11.40,
                "low-fatal": 10.56
            }
        ]
    },
    "enabled": true,
    "plugin": "libfpgad-vc.so",
    "devices": [
        [ "0x8086", "0x0b30" ],
        [ "0x8086", "0x0b31" ]
    ]
}
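After editing the file, it can be useful to confirm which sensor thresholds a parser actually sees. The Python sketch below is an assumption-laden illustration, not an OPAE tool: the helper name `sensor_entries` and the abbreviated `SAMPLE` fragment are hypothetical, and the fragment is wrapped in braces so that it parses as a standalone JSON object. Because the layout above repeats the "fpgad-vc" key, and Python's `json` module keeps only the last value for a repeated key, the sketch uses `object_pairs_hook` to collect every sensor entry as it is decoded.

```python
import json

# Abbreviated, hypothetical fragment with the same shape as the entries above.
SAMPLE = """
"fpgad-vc": {"configuration": {"sensors": [
    {"id": 25, "low-warn": 11.40, "low-fatal": 10.56}]}},
"fpgad-vc": {"configuration": {"sensors": [
    {"id": 3, "low-warn": 11.40, "low-fatal": 10.56}]}}
"""


def sensor_entries(fragment):
    """Return (id, low-warn, low-fatal) for every sensor object in the fragment."""
    found = []

    def collect(pairs):
        # Called once per decoded JSON object, innermost objects first.
        obj = dict(pairs)
        if {"id", "low-warn", "low-fatal"} <= obj.keys():
            found.append((obj["id"], obj["low-warn"], obj["low-fatal"]))
        return obj

    # Wrap the fragment in braces so it is a complete JSON object.
    json.loads("{" + fragment + "}", object_pairs_hook=collect)
    return found


for sensor_id, warn, fatal in sensor_entries(SAMPLE):
    # A lower fatal limit is expected to sit below the lower warning limit.
    status = "ok" if fatal < warn else "check thresholds"
    print(f"sensor {sensor_id}: low-warn={warn} low-fatal={fatal} ({status})")
```

A syntax error in the fragment, such as a trailing comma, makes `json.loads` raise `json.JSONDecodeError`, which gives a quick validity check before restarting the service.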
You can monitor the log file to see if upper or lower warning threshold levels are hit. For example:
$ tail -f /var/lib/opae/fpgad.log | grep "sensor.*warning"
fpgad-vc: sensor 'FPGA Die Temperature' warning

You must take appropriate action to recover from this warning before the sensor value reaches upper or lower fatal limits. On reaching the warning threshold limit, the daemon masks the AER registers and the log file will indicate that the sensor is tripped.

Sample output: Warning message when the FPGA Core Temperature exceeds the upper warning threshold limit.

$ tail -f /var/lib/opae/fpgad.log
fpgad-vc: saving previous ECAP_AER+0x08 value 0x003ff030 for 0000:5d:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x000031c1 for 0000:5d:00.0
fpgad-vc: sensor 'FPGA Die Temperature' still tripped.
Sample output: Warning message when the voltage falls below the lower warning threshold limit.
fpgad-vc: sensor '12V AUX Voltage' warning.
fpgad-vc: saving previous ECAP_AER+0x08 value 0x00100000 for 0000:ae:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x00002000 for 0000:ae:00.0
fpgad-vc: sensor '12V AUX Voltage' still tripped.
fpgad-vc: sensor '12V AUX Voltage' still tripped.
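Log lines in this form can also be summarized programmatically. The following Python sketch is a minimal illustration that assumes the log messages match the samples shown in this section; the function name `tripped_sensors` and the regular expression are hypothetical and would need adjusting if the actual message format differs.

```python
import re

# Matches the two sensor messages shown in the sample output above.
SENSOR_LINE = re.compile(r"fpgad-vc: sensor '([^']+)' (warning|still tripped)")


def tripped_sensors(lines):
    """Group 'warning' and 'still tripped' events by sensor name."""
    events = {}
    for line in lines:
        match = SENSOR_LINE.search(line)
        if match:
            name, state = match.groups()
            events.setdefault(name, []).append(state)
    return events


if __name__ == "__main__":
    # Reads the default log location given in this guide.
    with open("/var/lib/opae/fpgad.log") as log:
        for name, states in tripped_sensors(log).items():
            print(f"{name}: {len(states)} event(s), last state: {states[-1]}")
```

Lines that report the saved ECAP_AER values are ignored by the pattern, so only sensor state changes are counted.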

If the upper or lower fatal threshold limit is reached, a power cycle of the server is required to recover the Intel® FPGA PAC N3000. The fpgad service unmasks AER once the sensor values return to the normal range, that is, above the lower warning threshold and below the upper warning threshold.
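The mask and unmask behavior described in this section can be pictured as a small state machine. The Python sketch below is an illustration under stated assumptions, not the daemon's code: the function name `aer_actions`, the state labels, and the action strings are hypothetical, and the sketch ignores the "cool-down" setting because its semantics are not described here.

```python
def aer_actions(states):
    """Map successive sensor states to the actions described in this section.

    states: iterable of "normal", "warning", or "fatal" (hypothetical labels).
    """
    masked = False
    actions = []
    for state in states:
        if state == "fatal":
            # Per the text, only a server power cycle recovers the card.
            actions.append("power cycle required")
            break
        if state == "warning" and not masked:
            # First reading at or past a warning threshold: mask AER.
            masked = True
            actions.append("mask AER")
        elif state == "normal" and masked:
            # Reading is back in the normal range: restore AER.
            masked = False
            actions.append("unmask AER")
        else:
            actions.append("no change")
    return actions


print(aer_actions(["normal", "warning", "warning", "normal"]))
```

Repeated warning readings leave the registers masked, which corresponds to the "still tripped" messages in the log.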

Sample output when upper or lower fatal threshold is reached:
fpgad-vc: failed to read sensor xx
To stop fpgad:
$ sudo systemctl stop fpgad.service
To check status of fpgad:
$ sudo systemctl status fpgad.service
Optional: To enable fpgad to start automatically on boot, run:
$ sudo systemctl enable fpgad.service
For a full list of systemctl commands, run the following command:
$ systemctl -h