10.2. Using OPAE
$ sudo fpgainfo bmc
For more information about the sensors, refer to the Intel FPGA Programmable Acceleration Card N3000 Board Management Controller (BMC) User Guide.
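fpgad periodically reads the sensor values. If a value exceeds the warning threshold specified in fpgad.conf or the hardware-defined warning threshold, fpgad masks the PCIe Advanced Error Reporting (AER) registers, which allows the Intel® FPGA PAC to avoid a system reset.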
$ sudo systemctl start fpgad
The configuration file contains threshold settings only for the critical sensors 12V Aux Voltage (sensor 25) and 12V Backplane Voltage (sensor 3). These sensors have no hardware-defined warning thresholds, so fpgad relies on the configuration file. The other two critical sensors, FPGA Core Temperature (sensor 12) and Board Temperature (sensor 13), have hardware-defined warning thresholds as well as the fatal thresholds listed in the table above. When a sensor reaches its warning threshold, fpgad uses this information to mask the PCIe AER registers.

"fpgad-vc": {
    "configuration": {
        "cool-down": 30,
        "config-sensors-enabled": true,
        "sensors": [
            { "id": 25, "low-warn": 11.40, "low-fatal": 10.56 }
        ]
    },
    "enabled": true,
    "plugin": "libfpgad-vc.so",
    "devices": [
        [ "0x8086", "0x0b30" ],
        [ "0x8086", "0x0b31" ]
    ]
}

"fpgad-vc": {
    "configuration": {
        "cool-down": 30,
        "config-sensors-enabled": true,
        "sensors": [
            { "id": 25, "low-warn": 11.40, "low-fatal": 10.56 }
        ]
    },
    "enabled": true,
    "plugin": "libfpgad-vc.so",
    "devices": [
        [ "0x8086", "0x0b30" ],
        [ "0x8086", "0x0b31" ]
    ]
},
"fpgad-vc": {
    "configuration": {
        "cool-down": 30,
        "config-sensors-enabled": true,
        "sensors": [
            { "id": 3, "low-warn": 11.40, "low-fatal": 10.56 }
        ]
    },
    "enabled": true,
    "plugin": "libfpgad-vc.so",
    "devices": [
        [ "0x8086", "0x0b30" ],
        [ "0x8086", "0x0b31" ]
    ]
}
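Before restarting fpgad with edited thresholds, it can help to confirm that the edited fragment still parses. The following is a minimal sketch, not part of OPAE: it assumes the fragment is wrapped in an outer `{ }` so that it forms a standalone JSON object, and the names `SAMPLE_CONFIG` and `sensor_thresholds` are hypothetical. The layout of the complete fpgad.conf file may differ.

```python
import json

# Assumption: a single "fpgad-vc" entry wrapped in braces to make it valid
# standalone JSON. The values are copied from the sample in this section.
SAMPLE_CONFIG = """
{
  "fpgad-vc": {
    "configuration": {
      "cool-down": 30,
      "config-sensors-enabled": true,
      "sensors": [
        { "id": 25, "low-warn": 11.40, "low-fatal": 10.56 }
      ]
    },
    "enabled": true,
    "plugin": "libfpgad-vc.so",
    "devices": [ [ "0x8086", "0x0b30" ], [ "0x8086", "0x0b31" ] ]
  }
}
"""


def sensor_thresholds(text):
    """Map each configured sensor id to its (low-warn, low-fatal) pair."""
    config = json.loads(text)  # raises json.JSONDecodeError on malformed input
    sensors = config["fpgad-vc"]["configuration"]["sensors"]
    return {s["id"]: (s["low-warn"], s["low-fatal"]) for s in sensors}


thresholds = sensor_thresholds(SAMPLE_CONFIG)
print(thresholds)
```

Python's `json` module is strict: it raises `json.JSONDecodeError` on a trailing comma after the last array element, so a check like this also catches that kind of typo. Whether fpgad's own parser tolerates such input is not covered here.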
$ tail -f /var/lib/opae/fpgad.log | grep "sensor.*warning"
fpgad-vc: sensor 'FPGA Die Temperature' warning
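You must take appropriate action to recover from this warning before the sensor value reaches the upper or lower fatal threshold. When a warning threshold limit is reached, the daemon masks the AER registers and the log file indicates that the sensor has tripped.
Sample output: warning messages logged when FPGA Core Temperature exceeds the upper warning threshold limit.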
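The relationship between the warning and fatal limits can be illustrated with a small sketch for a sensor that has only lower limits, such as the 12V rails. This is not fpgad's implementation. The function name `classify_low_threshold` is hypothetical, and treating a reading exactly equal to a limit as having reached it is an assumption.

```python
def classify_low_threshold(value, low_warn, low_fatal):
    """Classify a reading against lower warning and fatal limits.

    Assumption: a reading equal to a limit counts as having reached it.
    """
    if value <= low_fatal:
        return "fatal"
    if value <= low_warn:
        return "warning"
    return "normal"


# Thresholds taken from the sample configuration (low-warn 11.40, low-fatal 10.56).
for reading in (12.0, 11.2, 10.0):
    print(reading, classify_low_threshold(reading, 11.40, 10.56))
```

The window between the two limits is the interval in which corrective action is still possible.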
$ tail -f /var/lib/opae/fpgad.log
fpgad-vc: saving previous ECAP_AER+0x08 value 0x003ff030 for 0000:5d:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x000031c1 for 0000:5d:00.0
fpgad-vc: sensor 'FPGA Die Temperature' still tripped.
fpgad-vc: sensor '12V AUX Voltage' warning.
fpgad-vc: saving previous ECAP_AER+0x08 value 0x00100000 for 0000:ae:00.0
fpgad-vc: saving previous ECAP_AER+0x14 value 0x00002000 for 0000:ae:00.0
fpgad-vc: sensor '12V AUX Voltage' still tripped.
fpgad-vc: sensor '12V AUX Voltage' still tripped.
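Log messages of this form can be summarized with a short script. The sketch below assumes that the message format matches the sample lines shown in this section. The pattern `SENSOR_EVENT` and the function `tripped_sensors` are hypothetical helpers, not part of OPAE.

```python
import re

# Matches the "warning" and "still tripped" messages shown in the sample output.
SENSOR_EVENT = re.compile(
    r"fpgad-vc: sensor '(?P<name>[^']+)' (?P<state>warning|still tripped)"
)


def tripped_sensors(lines):
    """Return sensor names that logged a warning or trip, in first-seen order."""
    seen = []
    for line in lines:
        match = SENSOR_EVENT.search(line)
        if match and match.group("name") not in seen:
            seen.append(match.group("name"))
    return seen


sample_log = [
    "fpgad-vc: sensor '12V AUX Voltage' warning.",
    "fpgad-vc: saving previous ECAP_AER+0x08 value 0x00100000 for 0000:ae:00.0",
    "fpgad-vc: sensor '12V AUX Voltage' still tripped.",
    "fpgad-vc: sensor 'FPGA Die Temperature' still tripped.",
]
print(tripped_sensors(sample_log))
```

To use it on a live system, the same function could be fed the lines of /var/lib/opae/fpgad.log instead of the sample list.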
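If the upper or lower fatal threshold limit is reached, you must reboot the server to recover the Intel® FPGA PAC N3000. After the sensor values return to the normal range (above the lower warning threshold or below the upper warning threshold), fpgad unmasks AER.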
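The mask and unmask behavior described here can be modeled as a simple two-state tracker. The sketch below is an illustration under stated assumptions, not fpgad's actual logic: it considers only a lower warning limit, ignores the cool-down setting and the fatal limits, and the function name `aer_mask_actions` is hypothetical.

```python
def aer_mask_actions(readings, low_warn):
    """Return the mask/unmask transitions implied by a series of readings.

    Assumption: masking happens when a reading first reaches low_warn, and
    unmasking happens when a later reading rises back above it.
    """
    masked = False
    actions = []
    for value in readings:
        if not masked and value <= low_warn:
            masked = True
            actions.append("mask")
        elif masked and value > low_warn:
            masked = False
            actions.append("unmask")
    return actions


print(aer_mask_actions([12.0, 11.3, 11.2, 11.8], 11.40))
```

Repeated readings below the limit produce no further action in this model, which corresponds to the repeated "still tripped" messages in the log rather than repeated masking.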
fpgad-vc: failed to read sensor xx
$ sudo systemctl stop fpgad.service
$ sudo systemctl status fpgad.service
$ sudo systemctl enable fpgad.service
$ systemctl -h