- Home›
- Technology and Research›
- Intel Technology Journal›
- Tera-scale Computing
Tera-scale Computing
High-Performance Physical Simulations on Next-Generation Architecture with Many Cores
PARALLELIZATION METHODOLOGY
The applications we study are all computationally demandingon a 4-way Intel® Xeon® processor 3.0GHz system, with 16GB of DDR2-3200 and three levels of cache on each processor (16KB L1, 1MB L2, and 8MB L3), they take on average 188, 14, and 5 seconds to process a single frame for production fluid, face, and cloth simulations, respectively. Similarly, high-complexity scenes in game rigid body dynamics, fluid and cloth take on average 1, 0.4, and 0.1 seconds to process a single frame. While game physics performance may seem much better than production physics, one needs to perform at least 30 frames per second for real-time interactive experiences. Since they will all benefit from a large performance boost, we parallelize the applications, targeting a multi-core processor with tens of cores.
We took the conventional approach to parallelizing large code bases. We first profiled each application to determine the most expensive modules in a serial execution. After that, we prioritize the modules of each application and parallelize them in decreasing order of importance.
The applications were parallelized using the fork-join model in which the program consists of alternating serial and parallel sections. This model is attractive because it allows one to start with a serial program and selectively parallelize the most profitable portions of the program until satisfactory performance is achieved. We use a standard task queue technique [8], similar to Intel Thread Building Blocks (TBB) [6] and OpenMP [10], to parallelize all modules.
In the rest of this section, we discuss how the various modules were parallelized to scale to a large number of cores.
Parallelizing Loops
The majority of modules were parallelized via loop parallelization. These modules typically involve operations on arrays of elements, such as grid cells of a 3D grid (production fluid), an array of particles (game fluid), vertices of a triangle mesh (production cloth), and contact constraints (game rigid bodies).
In most of the cases, the iterations of the loops are independent of each other. For instance, computing the aggregate force on a vertex of the triangular mesh (production cloth) requires simply adding all the forces on that vertex. These loops are parallelized by partitioning the iterations of the loop among the cores. In a few instances, multiple iterations update the same piece of data. However, even in these instances, the final result is independent of the ordering of the iterations. These loops are also parallelized by splitting iterations among cores while using fine-grained locking to guard updates on the shared data.
Parallelizing Graph Operations
A few modules have more complex forms of parallelism and typically incur more parallelization overheads.

Figure 2: Scene configuration
click image for larger view

Figure 3: Constraints between various objects
click image for larger view

Figure 4: Reordered constraints into two batches to expose parallel computation
click image for larger view

Figure 5: Relative speedups for the constraint solver with and without reordering of the constraints
click image for larger view
An example of this is the broad-phase in collision detection. Collision detection requires checking every pair of objects for collisions. Since only a small fraction of the objects actually collide at any given instant, collision detection is performed in two phases: a broad-phase that performs quick checks to rule out a large fraction of the object pairs, and a narrow-phase that performs the exact (and more computationally expensive) check on the remaining pairs. A standard technique to accelerate the broad-phase is to build a bounding volume hierarchy (a tree) containing the objects. A leaf node of this tree consists of a single object. The computation starts at the root and traverses down. At each step, it checks pairs of nodes of the tree. If the bounding volumes at the two nodes do not overlap, then none of the objects in the first subtree can possibly collide with any object in the second subtree. Otherwise, more checks have to be performed using the children of the two subtrees. Each of these pairs of subtrees represents independent computation and can be performed in parallel. Consequently, in the broad-phase, each unit of parallel work can spawn off more parallel work.
Using Alternative Algorithms
Sometimes, the best serial algorithm has poor parallel scalability. In such cases, we often use an alternative algorithm whose serial version is not as efficient as the original, but whose parallel version scales much better. Sometimes, we use an additional phase to reorder data and expose more parallelism. In this section, we describe two specific examples in detail.
The first example is from rigid-body dynamics from game physics. During the execution of the physical simulation pipeline, the collision detection phase computes the pairs of colliding bodies, which are used as inputs to the constraint solving phase. The physics solver operates on these pairs and computes the separating contact forces, which keep the bodies from inter-penetrating each other. In Figure 2, we show one such case involving four bodies (three boxes and one ground plane), where the corresponding pairs of colliding bodies are listed in Figure 3. The resulting constraints C1, C2, C3, and C4 need to be resolved to update the body positions.
To parallelize this phase, we would ideally like to distribute the constraints amongst the available threads and resolve them in parallel. However, there is often an inherent dependency between consecutive constraints. In our example, constraints C1 and C2 both involve body b2 and thus cannot be resolved in parallel. These dependencies can force a significant serialization of the computation. However, we can reorder the constraints into different batches such that there are no conflicting constraints in each batch. That is, each batch will contain at most one constraint that refers to any given body.
Reordering algorithms traverse the constraints, maintaining an ordered list of partially filled batches. Each constraint is assigned to the earliest batch with no conflicting constraints. As a result, all constraints within a batch can be processed in parallel, while the different batches have to be processed sequentially.
For example, we reorder the constraints in Figure 3 and obtain two batches, (C1, C3) and (C4, C2), as shown in Figure 4. Note that C1 and C3 from the first batch do not refer to any body more than once and can be resolved in parallel. A similar observation holds for C4 and C2. As a result C1 and C3 in the first batch are solved in parallel and the results are fed as part of the input to the second batch. The bottom curve in Figure 5 shows the speedup of the physics solver using the original order of the constraints relative to the serial version for up to 64 cores. The top curve shows the speedup using reordered constraints. We see that without reordering, the speedup is limited to 4x on 64 cores. However, reordering the constraints enables a speedup of 35x, including the overhead for reordering. This example highlights the case where some extra computation needs to be performed to expose the parallelism.
The second example of the need for alternative algorithms is from fluid simulation for production physics. Our fluid simulation application implicitly tracks the interface (boundary) between the air and the fluid. For each grid cell in the modeled space, it computes the distance to the interface. The most common technique to do this is the Fast Marching Method (FMM), which iteratively advances the wave front. For each iteration, it finds and updates the closest grid cell to the front that is not already on or behind the front (Figure 6). This is inherently serial. However, these distance values are required only for a narrow band around the interface. Thus, we parallelize FMM by dividing the grid into overlapping blocks, padded by the width of the narrow band, and working on each block independently (Figure 7). This works well for a small number of threads. However, the total overlap region becomes large quickly as the number of blocks increases. As a result, the application scales relatively poorly, achieving a scaling of around 21x on 64 threads.
We instead use an alternative scheme known as the Fast Sweeping Method (FSM) [5] (Figure 8). FSM traverses the grid cells in all eight possible combinations of the X, Y, and Z directions. Each "sweep" updates the distance of a cell from the distances computed for its neighbors in the previous sweeps. We obtain the correct distance for each cell after completing all eight sweeps. We parallelize FSM in a similar manner to FMM (i.e., with overlapping blocks). The serial version of FSM is around 30% slower than the serial version of FMM. However, FSM has more parallelism since the sweeps are independent. Thus, we achieve scaling of around 55x on 64 threads. In Figure 9, we compare the speedup of FSM relative to FMM. Up to 16 cores, FMM provides higher performance, but beyond 16 cores, FSM is better, giving about 2x the performance of FMM on 64 cores.

Figure 6: Fast Marching Method (FMM) advances the front to incrementally compute the signed distance of nodes from the interface
click image for larger view

Figure 7: Parallelizing the algorithm by dividing the region into overlapping cells
click image for larger view

Figure 8: Fast Sweeping Method (FSM) traverses the grid nodes and incrementally updates the minimum distance to the interface
click image for larger view
