Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.04

  • Published November 15, 2007

  Section 5 of 12  

Intel® Performance Libraries: Multi-Core-Ready Software for Numeric-Intensive Computation

VECTOR MATH LIBRARY (VML)

As we suggested earlier, the main issue in threading the various math functions is not so much whether they can be threaded (that is, whether the operations are separable) but whether enough operations are performed on the data once they are in cache to give other cores or processors time to fetch data to work on as well. In other words, it all comes down to memory bandwidth.

The transcendental functions of VML typically require 10-50 cycles per element (CPE), usually with one input and one output value per element. Taking this into account, we can roughly estimate the break-even point for threading with the following inequality:

S/T+O < S -> N*CPE/T+O < N*CPE -> N > O * T / (CPE * (T-1)),

where S = CPE*N is the number of cycles needed to execute a particular function in serial mode, N is the vector length, T is the total number of threads used, O is the overhead in clocks for starting the threads (which itself depends on T), and CPE is the cycles per element of the function in the serial case. One can see that as CPE increases (more complex functions), shorter vectors can be parallelized effectively. The greatest difficulty here is estimating O.
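
To make the inequality concrete, here is a minimal Python sketch that evaluates its right-hand side. The function name and the overhead value of 10,000 cycles are assumptions chosen for illustration only; they are not measurements from the article.

```python
import math

def break_even_length(overhead_cycles, cpe, threads):
    """Vector length N above which `threads` threads beat serial execution.

    Evaluates the right-hand side of N > O * T / (CPE * (T - 1)).
    """
    if threads < 2:
        # With a single thread there is nothing to gain from threading.
        return math.inf
    return overhead_cycles * threads / (cpe * (threads - 1))

# Hypothetical overhead O = 10,000 cycles (placeholder, not a measured value).
for cpe in (10, 50):
    n = break_even_length(10_000, cpe, threads=4)
    print(f"CPE={cpe}: threading pays off for N > {n:.0f}")
```

Under these assumed numbers, the function with the higher CPE reaches its break-even point at a vector one fifth as long, which is the trend the inequality predicts.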

Our computations show that O, measured in cycles, depends mostly on the number of sockets, the number of cores, and whether Hyper-Threading is turned on or off. This inequality, the estimate of O, and a table of CPE values for each function are used to choose the number of threads for a particular function call at run time.
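
One way such a run-time choice could be made is sketched below in Python. This is not the actual VML selection logic, which the article does not show. The overhead table is entirely hypothetical, the helper name `choose_threads` is invented for this example, and only the dCbrt CPE value (27.96, for uniformly distributed data) is taken from this article.

```python
# Hypothetical overhead O(T) in cycles for starting T threads.
# These are placeholder values, not measurements.
OVERHEAD_CYCLES = {1: 0, 2: 5_000, 4: 12_000}

# Serial cycles per element for each function. The dCbrt value is the one
# quoted in this article for uniformly distributed input.
CPE_TABLE = {"dCbrt": 27.96}

def choose_threads(func_name, n):
    """Pick the thread count T that minimizes the estimate N*CPE/T + O(T)."""
    cpe = CPE_TABLE[func_name]

    def estimated_cycles(t):
        return n * cpe / t + OVERHEAD_CYCLES[t]

    return min(OVERHEAD_CYCLES, key=estimated_cycles)

for n in (100, 400, 1_000_000):
    print(n, choose_threads("dCbrt", n))
```

With these assumed overheads, short vectors stay serial and only long vectors are given all four threads, which mirrors the break-even reasoning above.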

Figure 8 shows the speedups for three VML functions on a Woodcrest system (dual socket, dual core) compared to single-thread performance on the same processor.



Figure 8: VML scaling on selected functions

Although VML can choose a particular number of threads, it is difficult to do so accurately, for several reasons:

  • Performance is often data dependent. For example, the dCbrt function (cube root for double-precision vectors) takes 27.96 CPE for data uniformly distributed on the interval [-10000, 10000], but 15.00 CPE if the vector is all zeros.
  • Data location. If the input/output vectors are in cache, scaling and performance will be much better than if the data are in memory.
  • When several successive vector functions work on the same vectors, a different number of threads may be chosen for each call, and as a result data might sit in the wrong cache if the cache is not shared.
  • The influence of overhead might be significantly lowered by using threading at a higher level (i.e., if the user calls the VML functions from a threaded application).
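
The last point, threading at a higher level, can be illustrated structurally with the Python sketch below. The application splits the vector into contiguous chunks and each application thread makes one serial call on its own chunk. `serial_cbrt` is a hypothetical stand-in for a serial vector library call, and the helper names are invented for this example. In CPython the global interpreter lock prevents any real speedup for this pure-Python loop, so the sketch shows only the shape of the approach, not its performance.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def serial_cbrt(chunk):
    """Hypothetical stand-in for a serial (unthreaded) vector cube-root call."""
    return [math.copysign(abs(x) ** (1.0 / 3.0), x) for x in chunk]

def split(data, parts):
    """Split data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def threaded_cbrt(data, threads=4):
    """Application-level threading: one serial call per thread, per chunk."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(serial_cbrt, split(data, threads))
    # pool.map preserves chunk order, so the output lines up with the input.
    return [y for chunk in results for y in chunk]

print(threaded_cbrt([-8.0, 0.0, 27.0, 1.0, 64.0]))
```

Because each thread keeps working on the same slice, the thread start-up cost is paid once by the application rather than once per vector function call.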

In summary, for VML, multi-core shared cache architectures have opened opportunities for threading that did not exist previously, but the performance is dependent on factors the VML developer can only partly control. It is likely that in most cases calling VML functions from a threaded application will result in better performance than invoking the threaded VML.
