This example contains a high-performance implementation of matrix multiplication and demonstrates optimizations that can be expressed in Open Computing Language (OpenCL™) to improve performance significantly. At the algorithmic level, the kernel in this example shows how to describe loop tiling so that the data reuse inherent in the computation is exploited.
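As an illustration of the loop tiling idea, the following is a minimal sketch in plain C rather than the kernel shipped in the example package. The function name `matmul_tiled`, the `TILE` constant, and the small local arrays standing in for on-chip local memory are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical tile edge, not a value taken from the example package */

/* Computes C = A * B for square n x n row-major matrices.
 * Assumption for this sketch: n is a multiple of TILE. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    assert(n % TILE == 0);

    for (size_t i = 0; i < n * n; ++i)
        C[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            for (size_t kk = 0; kk < n; kk += TILE) {
                /* Stage one tile of A and one tile of B in small buffers
                 * (a stand-in for fast local memory on the device). */
                float a_tile[TILE][TILE];
                float b_tile[TILE][TILE];
                for (size_t i = 0; i < TILE; ++i) {
                    for (size_t j = 0; j < TILE; ++j) {
                        a_tile[i][j] = A[(ii + i) * n + (kk + j)];
                        b_tile[i][j] = B[(kk + i) * n + (jj + j)];
                    }
                }
                /* Each staged element is reused TILE times here, which is
                 * the data reuse that tiling is meant to capture. */
                for (size_t i = 0; i < TILE; ++i) {
                    for (size_t j = 0; j < TILE; ++j) {
                        float acc = 0.0f;
                        for (size_t k = 0; k < TILE; ++k)
                            acc += a_tile[i][k] * b_tile[k][j];
                        C[(ii + i) * n + (jj + j)] += acc;
                    }
                }
            }
        }
    }
}
```

With this structure, every element loaded into a tile buffer contributes to `TILE` multiply-accumulate operations before it is replaced, so traffic to the large matrices drops by roughly a factor of `TILE` compared with the untiled triple loop.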
This example also demonstrates how loop unrolling and SIMD-style compiler optimizations increase the performance of the kernel. In the example package, the parameters for each precompiled device binary have been chosen to maximize performance on that particular board. The package also includes additional details on how to parameterize the kernel to target different performance and resource requirements.
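To make the unrolling and parameterization idea concrete, here is a small sketch in plain C, not the kernel from the example package. It writes out by hand the effect an unroll directive would request from a compiler: a fixed number of independent multiply-accumulate lanes per loop iteration. The `UNROLL` macro and the function name `dot_unrolled` are hypothetical; the actual parameter names used in the package may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time knob; could be overridden with e.g. -DUNROLL=8
 * to trade more hardware resources for more work per iteration. */
#ifndef UNROLL
#define UNROLL 4
#endif

/* Dot product of two length-k vectors, the inner operation of matrix
 * multiplication. Assumption for this sketch: k is a multiple of UNROLL. */
float dot_unrolled(const float *a, const float *b, size_t k)
{
    assert(k % UNROLL == 0);

    /* One partial sum per lane, so the lanes do not depend on each other. */
    float lanes[UNROLL] = {0.0f};

    for (size_t base = 0; base < k; base += UNROLL) {
        /* Fixed trip count: a compiler can replicate this body UNROLL times
         * and execute the copies side by side. */
        for (size_t u = 0; u < UNROLL; ++u)
            lanes[u] += a[base + u] * b[base + u];
    }

    /* Combine the partial sums at the end. */
    float sum = 0.0f;
    for (size_t u = 0; u < UNROLL; ++u)
        sum += lanes[u];
    return sum;
}
```

Because the factor is a single macro, the same source can be rebuilt with a larger value for a board with more resources or a smaller one when resources are tight.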
The host application is set up to automatically take advantage of multiple OpenCL devices by distributing the computation across them, which adds further parallelism.
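One plausible way to distribute the work is sketched below in plain C. It is an assumption, not a description of the host code in the example package: the rows of the output matrix are split into contiguous ranges, one per device, with any remainder rows handed to the first devices. The `row_range` type and the `rows_for_device` function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous block of output rows assigned to one device (hypothetical type). */
typedef struct {
    size_t first_row;
    size_t num_rows;
} row_range;

/* Splits total_rows as evenly as possible across num_devices.
 * The first (total_rows % num_devices) devices get one extra row. */
row_range rows_for_device(size_t total_rows, size_t num_devices,
                          size_t device_index)
{
    assert(num_devices > 0 && device_index < num_devices);

    size_t base = total_rows / num_devices;
    size_t extra = total_rows % num_devices;

    row_range r;
    r.num_rows = base + (device_index < extra ? 1 : 0);
    r.first_row = device_index * base
                + (device_index < extra ? device_index : extra);
    return r;
}
```

For example, 10 rows across 3 devices yields ranges starting at rows 0, 4, and 7 with 4, 3, and 3 rows respectively. Under this scheme each device would need its own slice of the first input matrix plus the whole second input matrix.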
Peak Matrix Multiplication Performance