- Home›
- Technology and Research›
- Intel Technology Journal›
- Multi-Core Software
Multi-Core Software
Future-Proof Data Parallel Algorithms and Software on Intel® Multi-Core Architecture
TVECs
The basic type in Ct is a generic vector type, called TVEC. TVECs are allocated and managed in a segregated memory space that is accessible only by Ct operators, to ensure the safety of parallel operation on vectors. TVEC is polymorphic in terms of its base types and shapes.
The base types of TVECs are drawn from a set of typical pre-defined scalar (or value) types. Examples of base types include I32 (32-bit integer), I64 (64-bit integer), F32 (Float), F64 (Double), and Bool (Boolean). In future, Ct will also support the Bit type and user-defined base types, for example, C struct-like base type, TSTRUCT, and the C array-like base type, TUPLE, for more complicated application scenarios.
A TVEC may be declared as follows:
The TVEC constructor copies data explicitly from the unmanaged C/C++ memory to the managed vector space, in the form of either plain element-wise copy, or the strided memory copy (red in the example above takes one byte from every four of the data stream). There are also exceptional cases when it is not preferable to copy the data all at once (because of long latency) or we do not want to copy at all. Thus, there are several TVEC traits that may be applied, including Stream for copying data in a streaming fashion, or Direct for not copying.
Constant TVECs may also be constructed by factory methods. For example, an identity matrix, in the form of TVEC2D (a TVEC derivative for matrices), may be created as follows:
Nested data parallelism is a distinguished property for programming irregular data structures and algorithms. TVECs assume a number of shapes, including flat, multi-dimensional, irregular nested, and indexed forms. For example, a matrix TVEC could be constructed as follows:
TVECs may also be associated with certain accuracy attributes, which may allow experienced programmers to influence the compiler's code generation. For example:
The above TVEC declaration specifies 2 ulp (units-in-the-last-place) as the tolerable accuracy threshold, which gives a hint to the compiler that the square root operator may be translated into a simpler code sequence with lower-order polynomials and less fix-up code. However, if 0.5 ulp is specified, the compiler may generate a more complicated code sequence that might be up to 60+% slower on some architectures.
When the computation on TVECs is completed, the computed results may be transferred back to the unmanaged space through the copyOut primitive.
Ct Operators
The only operators allowed on TVECs are Ct operators. Ct operators are functionally pure (free of side effects). That is, TVECs are passed around by value, and each Ct operator logically returns a new TVEC. For example:
This property guarantees the safety of parallelism and the aggressive optimizations that make parallelism efficient. The Ct API provides a broad range of Ct operators with rich functionalities. Operator overloading is used extensively to support a programming style, based on C++, particularly for the arithmetic, bitwise, and logical/comparison operators. For example, the '*' operator in the above example is overloaded to the TVEC multiply operator.
Basically Ct operators can be categorized into element-wise/vector-scalar, collective communication, and permutation operators.
Element-wise/vector-scalar operators are typically referred to as "embarrassingly" parallel, requiring no interactions between the computations on each vector element. An example of an element-wise operation is the addition of two vectors:
Note that this code generically performs an element-wise addition of two vectors, regardless of the "shape" of the two vectors (i.e., their length, dimensionality, irregularity).
Collective communication operators tend to provide distilled computations over entire vectors and are highly coordinated3. While they have a high degree of interference, they can be structured so that there is parallelism in colliding writes, and they typically scale in performance linearly with processor count, with little or no hardware support.
There are two kinds of collective communication primitives in general, namely reductions and prefix-sums (also called scans). Reductions apply an operator over an entire vector to compute a distilled value (or values, depending on the type of vector). Prefix-sums perform a similar operation, but return a partial result for each vector element. For example, an addReduce sums over all the elements of a vector if the vector is flat. More concretely, addReduce([1 0 2 -1 4]) yields 1+0+2+(-1)+4=6. Likewise, addScan([1 0 2 -1 4]) yields [1 1+0 1+0+2 1+0+2+(-1) 1+0+2+(-1)+4].
A permutation operator in Ct is any operator that requires moving data from its original position to a different position. An example is the gather operation, which uses an index array to collect values of a vector in a particular order; and the scatter operator does the reverse. Permutations run the gamut, from arbitrary permutations with arbitrary collisions (occurring when two values want to reside in the same location) to well-structured and predictable permutations where no collisions can occur. For collisions, it is recommended that programmers make use of the collective communication operators. An example of a well-structured (and efficient) permutation operator is pack, which uses a flag vector to select values from a vector in the source vector order. With proper hardware support on multi-core IA, these operators can be implemented fairly efficiently. In contrast, these operators could not be implemented efficiently on constrained architectures (for example, most GPUs do not efficiently support scatter operations).
Besides these built-in operators, Ct also supports generic user-defined operators through Ct functions. As implied by their name, Ct functions define a block of code that is applicable to a collection of vectors, which allows programmers to define new generic operators or functions for repeated application (mitigating compilation overhead). The following code defines a Ct function that performs a fused multiply-add:
return a + b * c;
We use a map operator that takes as arguments the Ct function pointer and three vector arguments to apply this function in an element-wise manner:
The implementation of the map operator employs compile-time type inference to prevent programmers from specifying improper arguments, such as TVEC<I32> (which is not conformant to this function's definition), or wrong numbers of arguments. Just like C/C++ routines, Ct functions are composable, greatly extending Ct's expressiveness.
Nested Vectors
Ct's support for nested vectors is a generalization that allows a greater degree of flexibility than is otherwise found in most data parallel models. TVECs may be flat vectors or regular multi-dimensional vectors. They also may be nested vectors of varying length, which allows for very expressive coding of irregular algorithms, such as other variants of sparse matrix representations, or byproducts of divide-and-conquer algorithms.

Figure 2: The usage and implementation of Ct
click image for larger view
The vector value [a b c d e f] is a flat (or 1-dimensional) vector. The vector [[a b][c d e f]] holds the same element values, but is a vector of two vectors of lengths 2 and 4. The second vector might represent a partitioning of the first vector's data based on certain criteria (e.g., the relationship to a pivot value in a quicksort). Practically, the nested format enables a lot of irregular data structures and algorithms. Figure 2 gives a few such examples.
All Ct operators work on nested TVECs generically. The behavior of element-wise operators is the same for nested TVECs as for flat vectors. For example, [[a b][c d e f]] + [[g h][i j k l]] yields [[a+g b+h][c+i d+j e+k f+l]].
The power of nested versus flat TVECs is primarily differentiated through the behavior of collective communication and permutation primitives. Collective communication primitives applied to nested TVECs "respect the boundaries" of the subvectors by applying the operator to each subvector independently. For example, addReduce([a b c d e f]) yields the singleton vector [a+b+c+d+e+f], while addReduce([[a b][c d e f]) yields the two-element vector [a+b c+d+e+f].
[3] These operators are called collective communication operators in MPI and reductions in OpenMP, though neither provides the rich set of operations that Ct does. In functional languages, these are termed fold operations or list homomorphisms.
