Intel® Acceleration Stack User Guide: Intel FPGA Programmable Acceleration Card N3000

ID 683040
Date 8/17/2020
Public

11.1. OPAE Handling of SEU

The OPAE tool fpgad monitors SEU events and records them in the log file /var/lib/opae/fpgad.log.

Start fpgad:
$ sudo systemctl start fpgad
  • Intel® MAX® 10 SEU:
    The fpgad.log file shows the following output:
    $ tail -f /var/lib/opae/fpgad.log
    fpgad-vc: failed to get value object for sensor38.
    fpgad-vc: poll count = 1
    fpgad-vc: SEU error occurred on bmc @ 0000:b2:00.0
    fpgad-vc: failed to get value object for sensor15.
    fpgad-vc: failed to get value object for sensor38.
    
    Ignore the message "failed to get value object for sensor". Sensor 15 and sensor 38 report the QSFP temperature, so this failure indicates that no QSFP cable is plugged in.
  • FPGA SEU:
    The fpgad.log file shows the following output:
    $ tail -f /var/lib/opae/fpgad.log
    fpgad-vc: failed to get value object for sensor38.
    fpgad-vc: poll count = 1
    fpgad-vc: SEU error occurred on fpga @ 0000:b2:00.0
    fpgad-vc: failed to get value object for sensor15.
    fpgad-vc: failed to get value object for sensor38.
    
    Ignore the message "failed to get value object for sensor". Sensor 15 and sensor 38 report the QSFP temperature, so this failure indicates that no QSFP cable is plugged in.
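When the log is long, it can help to pull out only the SEU reports and skip the sensor messages that the text says to ignore. The following is a minimal sketch, not part of OPAE: seu_events is a hypothetical helper name, and the pattern assumes the log lines look exactly like the sample output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPAE): read fpgad log text on stdin and
# print "<component> <PCI address>" for each SEU report, for example
# "bmc 0000:b2:00.0" or "fpga 0000:b2:00.0". All other lines, including the
# "failed to get value object for sensor" messages, are dropped.
seu_events() {
    sed -n 's/^.*SEU error occurred on \([a-z]*\) @ \([0-9a-fA-F:.]*\).*$/\1 \2/p'
}
```

A possible invocation is: seu_events < /var/lib/opae/fpgad.log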
To recover from both Intel® MAX® 10 and FPGA SEU events, reset the Intel® FPGA PAC N3000 with the following command:
$ rsu bmcimg <PCI BDF>
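Because the reset command takes a PCI address, a wrapper script might check the argument before resetting the card. The sketch below is an illustration under assumptions: is_bdf and reset_command are hypothetical helper names, the address is assumed to be in the full domain:bus:device.function form shown in the log output (for example 0000:b2:00.0), and the helper only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 looks like a full PCI address
# (domain:bus:device.function, for example 0000:b2:00.0).
is_bdf() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$'
}

# Hypothetical helper: print the reset command for a validated address.
# It prints rather than executes, so the reset stays an explicit operator step.
reset_command() {
    if is_bdf "$1"; then
        printf 'rsu bmcimg %s\n' "$1"
    else
        printf 'not a PCI BDF: %s\n' "$1" >&2
        return 1
    fi
}
```

Printing instead of executing is a deliberate choice in this sketch, since a reset interrupts whatever is running on the card.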
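To test the system's response to SEU events, Intel provides a mechanism for injecting errors. fpgad logs an injected error in much the same way as it logs an SEU event.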
  1. Start fpgad
    $ sudo systemctl start fpgad
  2. Terminal 2: Monitor fpgad.log
    $ sudo tail -f /var/lib/opae/fpgad.log
  3. Terminal 1: Inject an error
    $ sudo sh -c "echo 1 > /sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/errors/inject_error"
    Sample output:
    fpgad-vc: error interrupt event received.
    fpgad-vc: poll count = 1.
    fpgad-vc: detect inject_error 0x1 @ 0000:15:00.0
    fpgad-vc: detect catfatal_errors 0x800 @ 0000:15:00.0
    Note: poll count = 1 indicates that one error was detected.
  4. Clear the injected error:
    $ sudo sh -c "echo 0 > /sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/errors/inject_error"
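The injection test above could be scripted roughly as follows. This is a sketch under assumptions: write_inject and injected_errors are hypothetical helper names, INJECT_FILE defaults to the sysfs path used in the steps above (device index 0), and the log pattern assumes lines formatted like the sample output.

```shell
#!/bin/sh
# sysfs file from the steps above; override INJECT_FILE if the card is not
# device 0 (the index is an assumption for a single-card system).
INJECT_FILE=${INJECT_FILE:-/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/errors/inject_error}

# Hypothetical helper: write 1 (inject) or 0 (clear) to the given file.
# Run as root when the target is the real sysfs file.
write_inject() {
    printf '%s\n' "$1" > "$2"
}

# Hypothetical helper: read fpgad log text on stdin and print
# "<value> <PCI address>" for each injected error that fpgad reported.
injected_errors() {
    sed -n 's/^.*detect inject_error \(0x[0-9a-fA-F]*\) @ \([0-9a-fA-F:.]*\).*$/\1 \2/p'
}
```

One possible sequence, run as root: write_inject 1 "$INJECT_FILE", then injected_errors < /var/lib/opae/fpgad.log to confirm that fpgad saw the event, then write_inject 0 "$INJECT_FILE" to clear it.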