Developer Guide and Reference

  • 2022.1
  • 04/11/2022
  • Public Content

Matrix Multiplication

General

The matrix multiplication (MatMul) primitive computes the product of two 2D tensors with optional bias addition (the variable names follow the standard Naming Conventions):
dst(m, n) = \sum_{k=0}^{K-1} src(m, k) \cdot weights(k, n) + bias(m, n)
The MatMul primitive also supports batching multiple independent matrix multiplication operations, in which case the tensors can be up to 12D:
dst(bs_0, bs_1, \ldots, m, n) = \sum_{k=0}^{K-1} src(bs_0, bs_1, \ldots, m, k) \cdot weights(bs_0, bs_1, \ldots, k, n) + bias(bs_0, bs_1, \ldots, m, n)
MatMul also supports implicit broadcast semantics, i.e., src can be broadcast into weights if the corresponding dimension in src is 1 (and vice versa). However, all tensors (including bias, if it exists) must have the same number of dimensions.
The shape of dst depends only on the src and weights tensors. The bias cannot change the dimensions of dst by broadcasting. In other words, for every dimension, the following constraint must hold true:
dimension(bias) == dimension(dst) || dimension(bias) == 1
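The 2D formula and the bias constraint above can be sketched as a plain C++ reference implementation. This is illustrative only, not oneDNN API code; all names here are made up for the sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference 2D MatMul: dst(m, n) = sum_k src(m, k) * weights(k, n) + bias(m, n).
// bias may be broadcast along either axis (dimension 1), matching the
// dimension(bias) == dimension(dst) || dimension(bias) == 1 constraint.
std::vector<float> matmul_2d(const std::vector<float> &src, size_t M, size_t K,
                             const std::vector<float> &weights, size_t N,
                             const std::vector<float> &bias, size_t bias_m,
                             size_t bias_n) {
    assert((bias_m == M || bias_m == 1) && (bias_n == N || bias_n == 1));
    std::vector<float> dst(M * N, 0.f);
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.f;
            for (size_t k = 0; k < K; ++k)
                acc += src[m * K + k] * weights[k * N + n];
            // Broadcast bias: an axis of size 1 reuses the same value.
            dst[m * N + n] = acc
                    + bias[(bias_m == 1 ? 0 : m) * bias_n
                            + (bias_n == 1 ? 0 : n)];
        }
    return dst;
}
```

Note how a bias of shape {1, 1} (both dimensions broadcast) adds one scalar to every element of dst without changing its shape.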

Execution Arguments

When executed, the inputs and outputs should be mapped to an execution argument index as specified by the following table.
Primitive input/output    Execution argument index
src                       DNNL_ARG_SRC
weights                   DNNL_ARG_WEIGHTS
bias                      DNNL_ARG_BIAS
dst                       DNNL_ARG_DST
binary post-op            DNNL_ARG_ATTR_MULTIPLE_POST_OP(binary_post_op_position) | DNNL_ARG_SRC_1

Implementation Details

General Notes
  1. The MatMul primitive supports input and output tensors with run-time specified shapes and memory formats. Run-time specified dimensions or strides are specified using the DNNL_RUNTIME_DIM_VAL wildcard value during the primitive initialization and creation stage. At the execution stage, the user must pass fully specified memory objects so that the primitive is able to perform the computations. Note that the less information about shapes or formats is available at the creation stage, the less performant the execution will be. In particular, if the shape is not known at the creation stage, one cannot use the special format tag dnnl::memory::format_tag::any to let an implementation choose the most appropriate memory format for the corresponding input or output shapes. On the other hand, run-time specified shapes enable users to create a primitive once and use it in different situations.
  2. A dimension must be consistently either “primitive-creation-time-defined” or “run-time-defined” across tensors. For example, src and weights with dimensions set to {3, 4, 4} and {DNNL_RUNTIME_DIM_VAL, 4, 4} respectively are invalid.
  3. The broadcasting shape consistency check is not done for dimensions with DNNL_RUNTIME_DIM_VAL. It is the user's responsibility to make sure the dimensions for the tensors are valid.
  4. Multiple batch dimensions and broadcasting of batch dimensions of src and weights are supported for both CPU and GPU engines.
Please check the tutorials below to see DNNL_RUNTIME_DIM_VAL support in use.
Data Types
The MatMul primitive supports the following combinations of data types for source, destination, weights, and bias tensors:
Source    Weights    Destination               Bias
f32       f32        f32                       f32
f16       f16        f16, u8, s8               f16
bf16      bf16       f32, bf16                 bf16, f32
u8, s8    s8         u8, s8, s32, f32, bf16    u8, s8, s32, f32, bf16
Data Representation
The MatMul primitive expects the following tensors:
Dims    Source       Weights      Destination    Bias
2D      M × K        K × N        M × N          None or (M or 1) × (N or 1)
ND      S × M × K    W × K × N    D × M × N      None or B
where, for the sake of notational convenience, S, W, D, and B denote the batch (leading) dimensions of the source, weights, destination, and bias tensors, respectively, subject to the broadcast rules described above.
The MatMul primitive is generally optimized for the case in which memory objects use plain memory formats. Additionally, src must have at least one of the m and k axes contiguous (i.e., stride = 1), and weights must have at least one of the n and k axes contiguous. However, it is recommended to use the placeholder memory format dnnl::memory::format_tag::any if an input tensor is reused across multiple executions. In this case, the primitive will set the most appropriate memory format for the corresponding input tensor.
The memory format of the destination tensor should always be plain with the n axis contiguous. For example, dnnl::memory::format_tag::ab for the 2D case and dnnl::memory::format_tag::abc or dnnl::memory::format_tag::bac for the 3D one.
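For intuition, the plain formats mentioned above correspond to simple stride patterns. The sketch below models them for a dense, unpadded M × N tensor; it is illustrative plain C++, not oneDNN memory-descriptor code:

```cpp
#include <array>
#include <cstdint>

// Dense 2D strides for the two plain formats of an M x N tensor.
// format_tag::ab ("row-major"): element (m, n) sits at m * N + n, so the
// n axis is contiguous (stride 1) -- the layout the destination requires.
std::array<int64_t, 2> strides_ab(int64_t M, int64_t N) {
    (void)M;
    return {N, 1};
}

// format_tag::ba ("column-major"): element (m, n) sits at n * M + m, so the
// m axis is contiguous instead; acceptable for src, but not for dst.
std::array<int64_t, 2> strides_ba(int64_t M, int64_t N) {
    (void)N;
    return {1, M};
}

// dst must be plain with the n axis contiguous, i.e., the last stride is 1.
bool valid_dst_strides(const std::array<int64_t, 2> &s) { return s[1] == 1; }
```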
Attributes and Post-ops
Attributes and post-ops enable modifying the behavior of the MatMul primitive. The following attributes and post-ops are supported:
Type         Operation        Description                                                                      Restrictions
Attribute    Output scales    Scales the result by given scale factor(s)
Attribute    Zero points      Sets zero point(s) for the corresponding tensors                                 Int8 computations only
Post-op      Eltwise          Applies an Eltwise operation to the result
Post-op      Sum              Adds the operation result to the destination tensor instead of overwriting it
Post-op      Binary           Applies a Binary operation to the result                                         General binary post-op restrictions
To facilitate dynamic quantization, the primitive supports run-time output scales. That is, a user can configure attributes with output scales set to the DNNL_RUNTIME_F32_VAL wildcard value instead of the actual scales if the scales are not known at the primitive descriptor creation stage. In this case, the user must provide the scales as an additional input memory object with the argument DNNL_ARG_ATTR_OUTPUT_SCALES during the execution stage.
Similarly to run-time output scales, the primitive supports run-time zero points. The wildcard value for zero points is DNNL_RUNTIME_S32_VAL. The following masks are supported by the primitive:
  • 0, which applies one zero point value to an entire tensor, and
  • 2, which applies a zero point value per each element in the k or n dimension for the DNNL_ARG_SRC or DNNL_ARG_DST argument, respectively.
During the execution stage, the corresponding memory object needs to be passed in the argument with index set to (DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_${MEMORY_INDEX}).
  • For instance, the source tensor zero points memory argument would be passed with index (DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_SRC).
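Taken together, output scales and zero points implement the usual int8 affine quantization arithmetic. The sketch below shows the mask = 0 case (one zero point per tensor, one output scale) as plain reference math; it is an illustrative model, not oneDNN API code, and real int8 kernels may differ in internal details:

```cpp
#include <cstdint>
#include <vector>

// Reference int8 MatMul with single (mask = 0) zero points and one output
// scale, modeled as:
//   dst(m, n) = scale * sum_k (src(m, k) - src_zp) * weights(k, n) + dst_zp
// Accumulation is done in s32, as is typical for int8 computations.
std::vector<float> int8_matmul_ref(const std::vector<int8_t> &src, int M, int K,
                                   const std::vector<int8_t> &weights, int N,
                                   int32_t src_zp, int32_t dst_zp,
                                   float scale) {
    std::vector<float> dst(M * N);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0; // s32 accumulator
            for (int k = 0; k < K; ++k)
                acc += (int32_t(src[m * K + k]) - src_zp)
                        * int32_t(weights[k * N + n]);
            dst[m * N + n] = scale * float(acc) + float(dst_zp);
        }
    return dst;
}
```

With run-time attributes, scale and the zero points would be the values delivered at execution time through DNNL_ARG_ATTR_OUTPUT_SCALES and DNNL_ARG_ATTR_ZERO_POINTS arguments rather than fixed at creation.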
Please check tutorials below to see run-time attributes in use.

Implementation Limitations

  1. Check Data Types.
  2. The CPU engine does not support the u8 data type for weights.
  3. The CPU engine does not support the u8 or s8 data type for dst with f16 src and weights.
  4. The GPU implementation is limited to 6D tensors and plain memory formats.

Performance Tips

  • Use dnnl::memory::format_tag::any for either of the input tensors if and only if the shape of the corresponding tensor is fully known at creation time and it is possible to cache reordered tensors across multiple primitive executions. For instance, good candidates for reuse are the weights tensors during inference: their shapes and data types are known in advance; thus they can be reordered during the first inference pass and reused during the subsequent passes. However, if any of the input tensors cannot be reused, it is best to force the primitive to use the same format as the one used by the tensors.

Examples

The following examples are available:
Matrix Multiplication Primitive Examples
This C++ API example demonstrates how to create and execute a MatMul primitive.
Key optimizations included in this example:
  • Primitive attributes with fused post-ops.
C++ API example demonstrating MatMul as a replacement for SGEMM functions.
C++ API example demonstrating how one can use MatMul fused with ReLU in INT8 inference.
C++ API example demonstrating how one can perform reduced precision matrix-matrix multiplication using MatMul and compare the accuracy of the result against floating point computations.
Concepts:
  • Static and dynamic quantization
  • Asymmetric quantization

Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.