7.7.12. Single-Precision Complex Floating-Point Matrix Multiply
A matrix multiplication computes the dot product of one row of A and one column of B for each output element. For 8×8 matrices A and B:
(AB)_{ij} = \sum_{k=1}^{8} A_{ik} B_{kj}
If you do not need to consider latency, you can accumulate adjacent partial results or build adder trees. However, to implement the design with a smaller dot product, consider folding to reduce resource usage: the design uses fewer multipliers rather than performing every multiplication in parallel. Split the loop over k into smaller chunks, then reorder the calculations so that consecutive accumulations do not target the same output element, because each accumulation must wait for the previous sum to emerge from the adder.
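The folding and reordering described above can be sketched as a small software model. This is only an illustration under stated assumptions, not the DSP Builder design itself: the function name `matmul_folded` and the parameter `fold` are hypothetical, and plain Python complex numbers stand in for single-precision complex hardware values. The loop over k is split into `fold` chunks, and the chunk loop is placed outermost so that two accumulations into the same output element are separated by all the other output elements.

```python
def matmul_folded(a, b, fold=2):
    """Multiply two square complex matrices with a folded dot product.

    Hypothetical model: the dot product is only n // fold wide, so each
    output element is built from `fold` partial sums.
    """
    n = len(a)
    if n % fold != 0:
        raise ValueError("matrix size must be a multiple of the folding factor")
    width = n // fold  # number of multipliers in the smaller dot product

    # One running sum per output element.
    acc = [[0j] * n for _ in range(n)]

    # Chunk loop outermost: consecutive accumulations go to different
    # output elements, so no two adjacent additions share an accumulator.
    for chunk in range(fold):
        k0 = chunk * width
        for i in range(n):
            for j in range(n):
                partial = sum(a[i][k] * b[k][j] for k in range(k0, k0 + width))
                acc[i][j] += partial
    return acc
```

With `fold=1` the model reduces to the fully parallel dot product; larger folding factors trade multipliers for more passes over the output elements.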
A traditional implementation of a matrix multiply design is structured around a delay line and an adder tree:
A11B11 + A12B21 + A13B31, and so on.
The traditional implementation has the following features:
- The length and size of the delay line grow with the folding size (typically 8 to 12)
- Uses adder trees of 7 to 10 adders that are used only once every 10 cycles
- Each matrix size needs a different delay-line length, so you must provide for the worst case
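To make the adder-tree cost in the list above concrete, the following minimal sketch sums a set of products pairwise and counts the adders and tree levels it needs. The function name `adder_tree` is hypothetical and the model ignores the delay line entirely; it only shows that a tree over n inputs needs n - 1 adders.

```python
def adder_tree(values):
    """Sum a non-empty sequence pairwise; return (total, adders, depth)."""
    level = list(values)
    adders = 0
    depth = 0
    while len(level) > 1:
        # Add neighbouring pairs; each pair costs one adder.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes to the next level unchanged
        level = nxt
        depth += 1
    return level[0], adders, depth


# Eight products need 4 + 2 + 1 = 7 adders over 3 levels.
print(adder_tree([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 7, 3)
```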
A better implementation uses FIFO buffers to provide self-timed control: new data is accumulated only when both FIFO buffers have data. This implementation has the following advantages:
- Runs as fast as possible
- Is not sensitive to how the latency of the dot product changes with the device or fMAX
- Is not sensitive to matrix size (hardware just stalls for small N)
- Can respond to back pressure, which stops the FIFO buffers from emptying; the FIFO full signals feed back to the control
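The self-timed scheme above can be illustrated with a cycle-by-cycle software model. This is a sketch under assumptions that the text does not confirm: one FIFO is assumed to hold partial dot products and the other to hold the running sums fed back from the adder, and the names `self_timed_accumulate`, `dot_latency`, and `add_latency` are hypothetical. An addition is issued only on cycles where both FIFO buffers hold data; otherwise the model stalls.

```python
from collections import deque


def self_timed_accumulate(partials, num_outputs, dot_latency=3, add_latency=2):
    """Accumulate chunk-ordered partial sums; return (sums, cycles).

    `partials` is a list holding one partial per output element for the
    first chunk, then one per output element for the second chunk, etc.
    """
    dot_pipe = deque([None] * dot_latency)   # models dot-product latency
    add_pipe = deque([None] * add_latency)   # models adder latency
    data_fifo = deque()                      # partial dot products
    acc_fifo = deque([0j] * num_outputs)     # running sums (assumed feedback FIFO)
    source = iter(partials)
    pending = len(partials)                  # sums not yet written back
    cycles = 0

    while pending:
        # Advance the dot-product pipeline by one cycle.
        dot_pipe.append(next(source, None))
        arrived = dot_pipe.popleft()
        if arrived is not None:
            data_fifo.append(arrived)

        # Issue an addition only when both FIFO buffers have data.
        if data_fifo and acc_fifo:
            add_pipe.append(acc_fifo.popleft() + data_fifo.popleft())
        else:
            add_pipe.append(None)            # stall: insert a bubble

        # Write a finished sum back to the accumulator FIFO.
        done = add_pipe.popleft()
        if done is not None:
            acc_fifo.append(done)
            pending -= 1
        cycles += 1

    return list(acc_fifo), cycles
```

In this model the returned sums do not depend on `dot_latency` or `add_latency`; only the cycle count changes. When `num_outputs` is smaller than the adder latency, the accumulator FIFO runs empty and the loop simply stalls, which mirrors the behavior described for small N.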
The model file is matmul_CS.mdl.