10.2. Using OPAE
$ sudo fpgainfo bmc
For more information about sensors, refer to the Intel FPGA Programmable Acceleration Card N3000 Board Management Controller (BMC) User Guide.
The fpgad daemon periodically reads the sensor values. If a value exceeds the warning threshold stated in fpgad.conf or the hardware-defined warning threshold, fpgad masks the PCIe Advanced Error Reporting (AER) registers for the Intel® FPGA PAC to avoid a system reset.
$ sudo systemctl start fpgad
The configuration file includes threshold settings only for the critical sensors 12V Aux Voltage (sensor 25) and 12V Backplane Voltage (sensor 3). These sensors do not have a hardware-defined warning threshold, so fpgad relies on the configuration file. The other two critical sensors, FPGA Core Temperature (sensor 12) and Board Temperature (sensor 13), have hardware-defined warning and fatal thresholds set to the values listed in the table above. The fpgad uses these thresholds to mask the PCIe AER registers when a sensor reaches its warning threshold.
"fpgad-vc": {
"configuration": {
"cool-down": 30,
"config-sensors-enabled": true,
"sensors": [
{
"id": 25,
"low-warn": 11.40,
"low-fatal": 10.56
}
]
},
"enabled": true,
"plugin": "libfpgad-vc.so",
"devices": [
[ "0x8086", "0x0b30" ],
[ "0x8086", "0x0b31" ]
]
}
"fpgad-vc": {
"configuration": {
"cool-down": 30,
"config-sensors-enabled": true,
"sensors": [
{
"id": 25,
"low-warn": 11.40,
"low-fatal": 10.56
}
]
},
"enabled": true,
"plugin": "libfpgad-vc.so",
"devices": [
[ "0x8086", "0x0b30" ],
[ "0x8086", "0x0b31" ]
]
},
"fpgad-vc": {
"configuration": {
"cool-down": 30,
"config-sensors-enabled": true,
"sensors": [
{
"id": 3,
"low-warn": 11.40,
"low-fatal": 10.56
}
]
},
"enabled": true,
"plugin": "libfpgad-vc.so",
"devices": [
[ "0x8086", "0x0b30" ],
[ "0x8086", "0x0b31" ]
]
}
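As an illustration only, the following Python sketch reads a fragment shaped like the samples above and lists the thresholds configured for each sensor. The enclosing braces, the helper name sensor_thresholds, and the choice to return no thresholds when "config-sensors-enabled" is false are assumptions made for this sketch, not documented fpgad behavior.

```python
import json

# Fragment in the same shape as the samples above. The outer braces are an
# assumption added so that the fragment parses as a standalone JSON document.
FRAGMENT = """
{
  "fpgad-vc": {
    "configuration": {
      "cool-down": 30,
      "config-sensors-enabled": true,
      "sensors": [
        { "id": 25, "low-warn": 11.40, "low-fatal": 10.56 }
      ]
    },
    "enabled": true,
    "plugin": "libfpgad-vc.so",
    "devices": [
      [ "0x8086", "0x0b30" ],
      [ "0x8086", "0x0b31" ]
    ]
  }
}
"""


def sensor_thresholds(text, section="fpgad-vc"):
    """Return {sensor id: {threshold name: value}} for one plugin section.

    Hypothetical helper: it only inspects the keys that appear in the sample.
    """
    cfg = json.loads(text)[section]["configuration"]
    # Assumption: when the flag is false, configured thresholds are not used.
    if not cfg.get("config-sensors-enabled", False):
        return {}
    return {
        sensor["id"]: {k: v for k, v in sensor.items() if k != "id"}
        for sensor in cfg.get("sensors", [])
    }


thresholds = sensor_thresholds(FRAGMENT)
print(thresholds)  # {25: {'low-warn': 11.4, 'low-fatal': 10.56}}
```

Because json.loads rejects malformed input such as a trailing comma after the last array element, running a check like this before restarting the daemon can catch editing mistakes early.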
$ tail -f /var/lib/opae/fpgad.log | grep "sensor.*warning"
fpgad-vc: sensor 'FPGA Die Temperature' warning
You must take appropriate action to recover from this warning before the sensor value reaches the upper or lower fatal limit. When a sensor reaches its warning threshold, the daemon masks the AER registers and the log file indicates that the sensor is tripped.
Sample output: Warning message when the FPGA Core Temperature exceeds the upper warning threshold limit.
Example:
$ tail -f /var/lib/opae/fpgad.log
fpgad-vc: saving previous ECAP_AER+0x08 value 0x003ff030 for 0000:5d:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x000031c1 for 0000:5d:00.0
fpgad-vc: sensor 'FPGA Die Temperature' still tripped.
fpgad-vc: sensor '12V AUX Voltage' warning.
fpgad-vc: saving previous ECAP_AER+0x08 value 0x00100000 for 0000:ae:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x00002000 for 0000:ae:00.0
fpgad-vc: sensor '12V AUX Voltage' still tripped.
fpgad-vc: sensor '12V AUX Voltage' still tripped.
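Log output in this form can be summarized with a short script. The Python sketch below is one possible approach: it matches only the two message shapes that appear in the sample output above, and the regular expressions and the helper name summarize are assumptions for illustration, not part of OPAE.

```python
import re

# Patterns derived from the sample output above; other fpgad messages are ignored.
SAVE_RE = re.compile(
    r"fpgad-vc: saving previous ECAP_AER\+(0x[0-9a-fA-F]+) "
    r"value (0x[0-9a-fA-F]+) for (\S+)"
)
SENSOR_RE = re.compile(r"fpgad-vc: sensor '([^']+)' (warning|still tripped)")


def summarize(lines):
    """Collect saved AER register values per device and the sensors reported."""
    saved = {}    # PCIe address -> {AER register offset: saved value}
    tripped = []  # sensor names, in order of first appearance
    for line in lines:
        match = SAVE_RE.search(line)
        if match:
            offset, value, address = match.groups()
            saved.setdefault(address, {})[offset] = value
            continue
        match = SENSOR_RE.search(line)
        if match and match.group(1) not in tripped:
            tripped.append(match.group(1))
    return saved, tripped


# Lines copied from the sample output above.
SAMPLE = [
    "fpgad-vc: saving previous ECAP_AER+0x08 value 0x003ff030 for 0000:5d:00.0",
    "fpgad-vc: saving previous ECAP_AER+0x14 value 0x000031c1 for 0000:5d:00.0",
    "fpgad-vc: sensor 'FPGA Die Temperature' still tripped.",
    "fpgad-vc: sensor '12V AUX Voltage' warning.",
    "fpgad-vc: saving previous ECAP_AER+0x08 value 0x00100000 for 0000:ae:00.0",
    "fpgad-vc: saving previous ECAP_AER+0x14 value 0x00002000 for 0000:ae:00.0",
    "fpgad-vc: sensor '12V AUX Voltage' still tripped.",
]

saved, tripped = summarize(SAMPLE)
print(tripped)        # ['FPGA Die Temperature', '12V AUX Voltage']
print(sorted(saved))  # ['0000:5d:00.0', '0000:ae:00.0']
```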
If the upper or lower fatal threshold limit is reached, a power cycle of the server is required to recover the Intel® FPGA PAC N3000. The fpgad unmasks AER after the sensor values return to the normal range, that is, above the lower warning threshold and below the upper warning threshold.
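The relationship between thresholds and AER masking described above can be sketched in a few lines of Python. This is a simplified model, not fpgad's implementation: the function names are hypothetical, treating a reading equal to a threshold as tripped is an assumption, and the "cool-down" setting is ignored.

```python
def classify(value, low_warn=None, low_fatal=None, high_warn=None, high_fatal=None):
    """Return 'fatal', 'warning', or 'normal' for a single sensor reading.

    Thresholds that are not set (None) are skipped. Inclusive comparisons
    are an assumption of this sketch.
    """
    if low_fatal is not None and value <= low_fatal:
        return "fatal"
    if high_fatal is not None and value >= high_fatal:
        return "fatal"
    if low_warn is not None and value <= low_warn:
        return "warning"
    if high_warn is not None and value >= high_warn:
        return "warning"
    return "normal"


def replay(readings, **limits):
    """Return (value, state, aer_masked) for each reading in order.

    AER is modeled as masked whenever the reading is outside the normal
    range. In practice, a fatal state requires a server power cycle.
    """
    trace = []
    for value in readings:
        state = classify(value, **limits)
        trace.append((value, state, state != "normal"))
    return trace


# 12V Aux Voltage thresholds taken from the sample configuration (sensor 25).
trace = replay([12.0, 11.3, 11.5, 10.5], low_warn=11.40, low_fatal=10.56)
for value, state, masked in trace:
    print(value, state, "masked" if masked else "unmasked")
```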
fpgad-vc: failed to read sensor xx
$ sudo systemctl stop fpgad.service
$ sudo systemctl status fpgad.service
$ sudo systemctl enable fpgad.service
$ systemctl -h