9.4.3. Using the Mixed Precision Feature
To use the mixed precision feature, first enable it in the Architecture Description File:
enable_mixed_precision:true
Specifying Graph Layers to Run at High Precision
Specify the desired precision for a graph layer by manually annotating it onto the OpenVINO™ Model Converter output, the OpenVINO™ graph (graph.xml).
If you rerun the OpenVINO™ Model Converter during development, the graph.xml file is regenerated and you must re-annotate this new file.
Annotating the OpenVINO™ Graph
Refer to the FPGA AI Suite documentation for instructions on running the OpenVINO™ Model Converter tool on your trained network.
The primary OpenVINO™ Model Converter outputs are the OpenVINO™ graph (graph.xml) that describes your network topology, and the weights and biases for your model (graph.bin). These OpenVINO™ outputs are the inputs of the FPGA AI Suite dla_compiler tool.
To annotate the precision of a layer onto the OpenVINO™ graph, modify that layer's name string directly in graph.xml. The precision annotation is appended to the end of the layer name, following a double underscore, as shown in the example below.
<layer id="2" name="model/conv1/Conv2D__fpga_precision=high" type="Convolution" version="opset1">
<data strides="1, 1" dilations="1, 1" pads_begin="0, 0" pads_end="0, 0" auto_pad="same_upper"/>
<input>
<port id="0" precision="FP32">
...
Do not modify the graph.bin file.
| Annotation | Description |
|---|---|
| __fpga_precision=default | Run this layer at the default BFP precision. The default BFP arch precision is indicated by the arch_precision field in the Architecture Description File. |
| __fpga_precision=high | Run this layer with both the feature and the filter at high BFP precision. |
| __fpga_precision=high-feature | Run this layer with the feature at high BFP precision. The filter is represented at the default arch_precision. |
| (no annotation used) | Equivalent to __fpga_precision=default. |
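If you prefer not to edit graph.xml by hand (for example, because you must re-annotate after every Model Converter rerun), a small script can append the annotation for you. The following is a minimal Python sketch, not part of the FPGA AI Suite tooling; the layer name, annotation, and file names are illustrative assumptions.

# annotate_layer.py: minimal sketch for appending a precision annotation to one layer.
# The layer name, annotation, and file names below are illustrative assumptions.
import xml.etree.ElementTree as ET

GRAPH_XML = "graph.xml"                      # OpenVINO Model Converter output
TARGET_LAYER = "model/conv1/Conv2D"          # hypothetical layer to annotate
ANNOTATION = "__fpga_precision=high"         # or __fpga_precision=high-feature

tree = ET.parse(GRAPH_XML)
for layer in tree.getroot().iter("layer"):
    if layer.get("name") == TARGET_LAYER:
        layer.set("name", TARGET_LAYER + ANNOTATION)
tree.write("graph_annotated.xml", encoding="utf-8", xml_declaration=True)
# Do not modify graph.bin; supply it unchanged alongside the annotated XML.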
Selecting Which Graph Layers to Annotate
The FPGA AI Suite IP runs the most computationally intensive graph layers at block floating point precision. Other graph layers operate at fp16 precision. For details, refer to Block Floating Point (BFP).
The following table summarizes the information that you need to select which graph layers to annotate with precision directives. For each graph layer/primitive, the table describes the following information:
- The corresponding layer type in the graph.xml
- The FPGA AI Suite IP block to which the dla_compiler tool maps layers of this type.
- The default precision at which this IP block operates (that is, if __fpga_precision=default).
- Whether this layer type is a legal target for the fpga_precision annotation.
| Layer / Primitive | OpenVINO™ IR (.xml) Layer Type | IP Block Mapping | Default Precision | Supports Mixed Precision? |
|---|---|---|---|---|
| Convolution (2D or 3D) | Convolution | PE array | arch_precision (BFP) | Yes |
| Convolution including bias (2D or 3D) | Convolution + Add | PE array | arch_precision (BFP) | Yes |
| Depthwise | GroupConvolution + Add (if bias is present) | Depthwise auxiliary block (if possible); otherwise, PE array | fp16 (aux block) / arch_precision (BFP) (PE array) | No (if aux block) / Yes (if PE array) |
| Scale-Shift | Multiply (single input) | PE array | arch_precision (BFP) | Yes |
| Deconv / Transpose Convolution | ConvolutionBackpropData | PE array | arch_precision (BFP) | No |
| Elementwise addition of feature + feature tensors | Add | PE array | arch_precision (BFP) | Yes |
| Elementwise multiplication of feature * filter tensors | Multiply (single input) | PE array | arch_precision (BFP) | Yes |
| Elementwise multiplication of feature * feature tensors | Multiply (two inputs) | PE array | arch_precision (BFP) | No |
| Fully connected | MatMul + Add (if bias is present) | PE array | arch_precision (BFP) | Yes |
| ChannelToSpace / DepthToSpace / PixelShuffle | DepthToSpace | PE array | arch_precision (BFP) | No |
| ReLU / pReLU / Leaky ReLU | ReLU, PReLU | Activation auxiliary block | fp16 | N/A |
| Clamp / Round Clamp | Clamp | Activation auxiliary block | fp16 | N/A |
| H-sigmoid | HSigmoid | Activation auxiliary block | fp16 | N/A |
| H-swish | HSwish | Activation auxiliary block | fp16 | N/A |
| Sigmoid | Sigmoid | Activation auxiliary block | fp16 | N/A |
| Swish | Swish | Activation auxiliary block | fp16 | N/A |
| Max Pool | MaxPool | Pooling auxiliary block | fp16 | N/A |
| Max Pool (for 3D inputs, i.e., DHWC) | MaxPool | Pooling auxiliary block + PE array | fp16 & arch_precision (BFP) | No |
| Average Pool | AvgPool or ReduceMean | PE array | arch_precision (BFP) | Yes |
| Softmax | Softmax | Activation auxiliary block | fp16 | N/A |
Most graph layers that map to the FPGA AI Suite PE array are legal targets for the fpga_precision annotation. These layers are marked "Yes" in the Supports Mixed Precision? column of the preceding table.
Layers that map to auxiliary blocks of the FPGA AI Suite IP are computed at fp16 precision, which is already high precision by default. These layers are marked "N/A" in the preceding table.
The dla_compiler tool processes your annotations as follows:
- When a precision annotation is successfully applied, the dla_compiler tool records the selected precision in the layer visualization that the compiler generates as "...after_all_passes.dot" under $coredla_work/visualizations. The annotated layer shows Precision_Mode: with the selected precision (High or High_Feature). A small verification sketch follows this list. To render the visualization, convert the .dot file to an SVG file:
dot -Tsvg ...after_all_passes.dot -o ...after_all_passes.svg
- For an annotation applied to an unsupported layer type (that is, a layer marked "No" in the Supports Mixed Precision? column of the preceding table), dla_compiler issues an error and terminates the compilation.
- If dla_compiler maps a layer to an auxiliary block (that is, a layer marked "N/A" in the preceding table), it ignores any associated precision annotation and prints Precision Mode: Arch, which indicates the default arch precision.
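To confirm which precision each layer actually received, you can scan the generated .dot file for the Precision_Mode strings instead of opening the rendered SVG. The following is a minimal Python sketch, not an FPGA AI Suite tool; it assumes the Precision_Mode text appears verbatim in the node labels of the .dot file.

# check_precision_modes.py: minimal sketch; assumes Precision_Mode appears verbatim
# in the node labels of the "...after_all_passes.dot" file.
import re
import sys

dot_path = sys.argv[1]   # path to the ...after_all_passes.dot file
with open(dot_path) as f:
    text = f.read()

# Count how many layers report each precision mode (for example, High, High_Feature, Arch).
counts = {}
for match in re.finditer(r"Precision_Mode:\s*([A-Za-z_]+)", text):
    mode = match.group(1)
    counts[mode] = counts.get(mode, 0) + 1

for mode, count in sorted(counts.items()):
    print(f"{mode}: {count} layer(s)")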
Achieving Your Numerical Objectives
The mixed precision feature is useful for preserving fine-grained information in the input feature data. Information encoded in the least significant mantissa bits of the feature fp16 input is lost when data is quantized to the default block floating point arch_precision.
Use the __fpga_precision=high or __fpga_precision=high-feature annotations as follows:
- Apply them to the initial layers of a graph. Annotating these layers preserves fine-grained input data that would otherwise be lost when quantizing to default-precision block floating point (see the sketch after this list).
- Apply them to consecutive layers of a graph. It is unclear whether there is significant benefit to running high-BFP precision layers after the first default block floating point precision layer.
- Apply them to the fewest possible graph layers needed to achieve your numerical objectives. Running a layer at high BFP precision is more computationally intensive, which reduces the overall inference throughput.
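The following variation of the earlier annotation sketch applies __fpga_precision=high-feature to only the first few Convolution layers. It is a minimal Python sketch; the layer count, annotation, and file names are illustrative assumptions, and it assumes the layer order in graph.xml follows the graph topology.

# annotate_initial_convs.py: minimal sketch; layer count, annotation, and file names
# are illustrative, and the layer order in graph.xml is assumed to follow the topology.
import xml.etree.ElementTree as ET

GRAPH_XML = "graph.xml"
NUM_LAYERS = 2                                  # annotate only the first two Convolution layers
ANNOTATION = "__fpga_precision=high-feature"

tree = ET.parse(GRAPH_XML)
annotated = 0
for layer in tree.getroot().iter("layer"):
    if annotated >= NUM_LAYERS:
        break
    if layer.get("type") == "Convolution":
        layer.set("name", layer.get("name") + ANNOTATION)
        annotated += 1
tree.write("graph_annotated.xml", encoding="utf-8", xml_declaration=True)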
If your graph contains depthwise convolution layers, leverage the dedicated depthwise auxiliary block:
- Enable the depthwise auxiliary block by modifying the Architecture Description File. The depthwise auxiliary block provides superior throughput and higher numerical precision for depthwise convolutions.
- Ensure that the parameterization of the depthwise auxiliary block (for maximum window size, stride, and dilation) is set large enough to support all of the depthwise convolutions in your graph (the sketch at the end of this section can help you survey these values).
- Use the FPGA AI Suite Area Estimator tool to estimate the FPGA resource utilization required by the depthwise auxiliary block.
For information on using the depthwise auxiliary block, refer to Parameter Group: depthwise.
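To check which depthwise convolution parameters the auxiliary block must cover, you can list the GroupConvolution layers in graph.xml along with their strides and dilations. The following is a minimal Python sketch under the assumption that depthwise convolutions appear as GroupConvolution layers (as in the preceding table) and that strides and dilations are stored as attributes of each layer's data element; the maximum window size must still be read from the weights shape.

# list_depthwise_params.py: minimal sketch; assumes depthwise convolutions appear
# as GroupConvolution layers in graph.xml and that strides/dilations are stored
# as attributes of each layer's <data> element (as in the Convolution example above).
import xml.etree.ElementTree as ET

tree = ET.parse("graph.xml")
for layer in tree.getroot().iter("layer"):
    if layer.get("type") == "GroupConvolution":
        data = layer.find("data")
        strides = data.get("strides") if data is not None else "?"
        dilations = data.get("dilations") if data is not None else "?"
        print(f'{layer.get("name")}: strides={strides}, dilations={dilations}')
# The maximum window (kernel) size comes from the weights tensor shape, which is not
# a <data> attribute; check the weights input dimensions of each GroupConvolution layer.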