The multiplication of a matrix and a vector is a common operation in applications such as the skinning and physics code of 3D graphics games. We investigated a few ways to write the code for this operation and assessed the performance of each version on a 2.66GHz Intel® Core™ 2 Extreme quad-core processor.
Figure 1. Various ways of implementing this matrix-vector multiplication were investigated
Six different versions of the code were written. The first version was written in C++. It used two nested loops that iterated through each element of the data sets. This version served as the reference against which the performance of the other versions was measured. It was then multithreaded using Windows* threading functions.
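As an illustration of the reference version, the following is a minimal sketch of a scalar multiplication with two nested loops. The article does not state the matrix dimensions, so the 4x4 matrix, the 4-component vector, and the names Vec4, Mat4, multiply and multiply_all are assumptions made for this example. It is not the code that was measured.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types: a 4x4 row-major matrix and a 4-component vector
// are assumed, since the article does not give the dimensions.
struct Vec4 { float v[4]; };
struct Mat4 { float m[4][4]; };

// Reference version: two nested loops over rows and columns.
inline Vec4 multiply(const Mat4& a, const Vec4& x) {
    Vec4 y{};
    for (std::size_t row = 0; row < 4; ++row) {
        float sum = 0.0f;
        for (std::size_t col = 0; col < 4; ++col) {
            sum += a.m[row][col] * x.v[col];
        }
        y.v[row] = sum;
    }
    return y;
}

// Apply one matrix per vector across the whole data set
// (pairing each vector with its own matrix is also an assumption).
inline void multiply_all(const std::vector<Mat4>& mats,
                         const std::vector<Vec4>& in,
                         std::vector<Vec4>& out) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = multiply(mats[i], in[i]);
    }
}
```

Every multiply and add here is issued one float at a time, which is what makes this version a useful baseline for the SIMD and threaded variants.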
The SIMD versions were written in assembly and operated on four floating-point values in each loop iteration. One version assumed that the data was in the array of structures (AOS) format, while the other assumed that the data was in the structure of arrays (SOA) format. Figure 2 shows an example of the structure of arrays construct. Two other versions were subsequently derived from these SIMD versions by converting them to multi-threaded code.
Figure 2. An example of the 'structure of arrays' format. An array of these structures was declared to store all the vectors.
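To make the SOA idea concrete, here is a minimal sketch that uses SSE compiler intrinsics in place of the hand-written assembly described above. The struct name Vec4Soa, the function multiply_soa, the 4x4 row-major matrix, and the use of a single matrix for all four vectors in a block are assumptions for illustration only. They are not taken from the measured code.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical SOA block: the same component of four consecutive
// vectors is stored contiguously, so one 128-bit load fetches four
// x values, four y values, and so on.
struct Vec4Soa {
    float x[4];
    float y[4];
    float z[4];
    float w[4];
};

// Multiply four vectors by one row-major 4x4 matrix m (16 floats).
// Sharing a single matrix across the block is an assumption.
inline void multiply_soa(const float m[16], const Vec4Soa& in, Vec4Soa& out) {
    // Load all inputs first so that in and out may refer to the same block.
    const __m128 x = _mm_loadu_ps(in.x);
    const __m128 y = _mm_loadu_ps(in.y);
    const __m128 z = _mm_loadu_ps(in.z);
    const __m128 w = _mm_loadu_ps(in.w);

    float* const rows[4] = {out.x, out.y, out.z, out.w};
    for (std::size_t r = 0; r < 4; ++r) {
        // Broadcast each matrix element and accumulate four dot
        // products at once, one per SIMD lane.
        __m128 acc = _mm_mul_ps(_mm_set1_ps(m[4 * r + 0]), x);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m[4 * r + 1]), y));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m[4 * r + 2]), z));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m[4 * r + 3]), w));
        _mm_storeu_ps(rows[r], acc);
    }
}
```

With this layout no shuffles or horizontal adds are needed, because each lane holds the same component of a different vector. An AOS layout typically needs extra instructions to combine the partial products within one register.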
The time, in milliseconds, to compute the multiplication using the different versions is recorded in Table 1. The test used 40,000 randomly generated vectors and matrices. This number was chosen to ensure that the data set would fit into the 2 x 4MB L2 cache, which gives a better assessment of the different versions of the code without having to account for the impact of cache misses.
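The measurement itself can be sketched as follows. This is a hypothetical harness, not the one used for the article: the names time_ms and speedup are invented here, and std::chrono::steady_clock is an assumed choice of timer. It runs a kernel for a fixed number of iterations (Table 1 reports 100) and returns the elapsed time in milliseconds.

```cpp
#include <chrono>

// Hypothetical helper: run `kernel` `iterations` times and return the
// total elapsed wall-clock time in milliseconds.
template <typename Kernel>
double time_ms(Kernel&& kernel, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        kernel();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Speedup of an optimized version relative to the baseline time.
inline double speedup(double baseline_ms, double optimized_ms) {
    return baseline_ms / optimized_ms;
}
```

A speedup would then be obtained by timing the baseline and an optimized kernel on the same data and dividing the two results.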
Using the data for the C++ version as the baseline, the speedup for the 4-thread SOA SIMD version was the highest at 20.8x, as shown in Table 1. This was a significant gain. If the data was not arranged in the SOA format, the gain from using SIMD instructions and threading was 13.56x. Speedups from multi-threading alone scaled almost linearly with the number of threads. The benefit of using SIMD instructions was also evident in the data. Comparing the C++ version and the SIMD version, the speedup was 3.76x.
Table 1. The time it took to run the different versions for 100 iterations on an Intel® Core™ 2 Extreme quad-core processor was recorded. The highest speedup was 20.8x. The fourth column shows only the threading speedups, excluding speedups from the other optimizations. Using SIMD instructions improved performance by 3.76x over the C++ code.
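The multi-threaded versions divide the data set among threads. The sketch below shows one way to do that split, using the portable std::thread in place of the Windows* threading functions used for the measured code. The helper name parallel_for and the contiguous-chunk partitioning are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: split [0, count) into contiguous chunks and run
// work(begin, end) on each chunk in its own thread.
template <typename Work>
void parallel_for(std::size_t count, unsigned num_threads, Work work) {
    if (num_threads == 0) num_threads = 1;
    // Round up so that every element is covered.
    const std::size_t chunk = (count + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(count, t * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin == end) break;  // no work left for further threads
        threads.emplace_back([&work, begin, end] { work(begin, end); });
    }
    // Wait for all chunks to finish before returning.
    for (std::thread& th : threads) th.join();
}
```

Because each vector is transformed independently, the chunks share no output data and need no locking, which is consistent with the near-linear thread scaling reported above.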