Tera-scale Computing
High-Performance Physical Simulations on Next-Generation Architecture with Many Cores
CONCLUSION
We consider two broad categories of physical simulation applications: production physics and game physics. Production physics is used by movie studios to create special effects, and a single frame may take many minutes to process. In contrast, game physics is used by the gaming industry and has a far more stringent real-time requirement of about 30-60 frames per second. This difference in execution time requirements affects the choice and design of algorithms for the two categories of physical simulation.
We have parallelized applications in both categories and achieved parallel scalability of 30-60x on a cycle-accurate simulator of a multi-core chip with 64 cores. Many modules of these applications require extensive effort to achieve good performance scaling. In some cases, the best serial algorithms have poor parallel scalability. For these, we use alternative algorithms that are slower on one core but expose more parallelism. In other cases, we modify the algorithm itself to expose more parallelism. The overhead of exposing the parallelism is often small compared to the benefits of improved scaling.
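The trade-off between a fast serial algorithm and a slower but more parallel alternative can be illustrated with Amdahl's law. The following is a minimal sketch, not the model used in this study: the run times and parallel fractions are hypothetical values chosen only for illustration, and the names `amdahl_time` and `crossover_cores` are illustrative as well.

```python
def amdahl_time(serial_time, parallel_fraction, cores):
    """Run time on `cores` cores under Amdahl's law.

    serial_time: run time on one core (arbitrary units).
    parallel_fraction: share of that time that parallelizes perfectly.
    """
    serial_part = serial_time * (1.0 - parallel_fraction)
    parallel_part = serial_time * parallel_fraction / cores
    return serial_part + parallel_part


def crossover_cores(fast_serial, more_parallel, max_cores=64):
    """Smallest core count at which `more_parallel` beats `fast_serial`.

    Returns None if that never happens up to `max_cores`.
    """
    for cores in range(1, max_cores + 1):
        if amdahl_time(cores=cores, **more_parallel) < amdahl_time(cores=cores, **fast_serial):
            return cores
    return None


# Hypothetical algorithms (assumed numbers, not measurements):
# the best serial algorithm is 1.5x faster on one core but only 80% parallel.
BEST_SERIAL = {"serial_time": 1.0, "parallel_fraction": 0.80}
ALTERNATIVE = {"serial_time": 1.5, "parallel_fraction": 0.99}

if __name__ == "__main__":
    print("crossover at", crossover_cores(BEST_SERIAL, ALTERNATIVE), "cores")
    for cores in (1, 8, 64):
        print(cores,
              round(amdahl_time(cores=cores, **BEST_SERIAL), 4),
              round(amdahl_time(cores=cores, **ALTERNATIVE), 4))
```

Under these assumed numbers the alternative algorithm loses on one core but wins from four cores onward, and the gap widens as the core count grows.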
While our applications scale well, some modules are far from the theoretical maximum scaling. This is primarily due to overheads in the task queues and to imperfect load balancing.
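Both effects can be seen in a toy model of a shared task queue. This is only a sketch under simplifying assumptions: tasks are handed out greedily to whichever core becomes free first, every dequeue costs a fixed amount of time, and the task costs are invented for illustration. The function name `simulate_task_queue` is hypothetical and does not correspond to the task queue implementation used in this study.

```python
import heapq


def simulate_task_queue(task_costs, cores, dequeue_overhead=0.0):
    """Return the speedup over serial execution for a greedy task queue.

    Each task is assigned to the core that becomes free first and pays
    `dequeue_overhead` time units before it starts running.
    """
    finish_times = [0.0] * cores          # time at which each core is free
    heapq.heapify(finish_times)
    for cost in task_costs:
        free_at = heapq.heappop(finish_times)
        heapq.heappush(finish_times, free_at + dequeue_overhead + cost)
    makespan = max(finish_times)          # the slowest core sets the run time
    return sum(task_costs) / makespan


if __name__ == "__main__":
    balanced = [1] * 16                   # 16 equal tasks
    imbalanced = [8] + [1] * 8            # same total work, one large task
    print(simulate_task_queue(balanced, cores=4))
    print(simulate_task_queue(imbalanced, cores=4))
    print(simulate_task_queue(balanced, cores=4, dequeue_overhead=0.25))
```

With four cores, the balanced workload reaches the ideal speedup of 4, the imbalanced one is limited to 2 because one core is busy with the large task while the others idle, and a dequeue cost of 0.25 per task reduces the balanced case to 3.2.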
Some modules also have significant overheads from locking, but these overheads do not grow with the number of cores (i.e., the locks have low contention), and therefore do not impact scalability. However, the cost of locking still has a significant impact on the overall performance of the parallelized application.
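The distinction between scalability and overall performance can be made concrete with a small calculation. The sketch below assumes that uncontended locking adds a fixed amount of extra work that parallelizes along with the rest of the computation; the numbers and function names are hypothetical.

```python
def parallel_time(work, lock_overhead, cores):
    """Run time when uncontended locking adds `lock_overhead` to `work`
    and the total is divided evenly across `cores` (an assumption)."""
    return (work + lock_overhead) / cores


def scaling(work, lock_overhead, cores):
    """Speedup of the parallel code relative to itself on one core."""
    return parallel_time(work, lock_overhead, 1) / parallel_time(work, lock_overhead, cores)


def speedup_over_lock_free_serial(work, lock_overhead, cores):
    """Speedup relative to serial code that needs no locks at all."""
    return work / parallel_time(work, lock_overhead, cores)


if __name__ == "__main__":
    # Hypothetical module: locking adds 25% on top of the useful work.
    print(scaling(1.0, 0.25, 64))
    print(speedup_over_lock_free_serial(1.0, 0.25, 64))
```

In this model the module still scales linearly to 64 cores, yet it runs only about 51 times faster than the lock-free serial code, so the locking cost shows up in absolute performance rather than in the scaling curve.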
We find that future physics workloads will require large last-level caches (i.e., 128MB) or main memory bandwidths in excess of 100GB/s. This is due to the applications' use of streaming access patterns combined with large data sets (e.g., tens of thousands of objects for game physics and hundreds of thousands to a few million objects for production physics).
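A back-of-the-envelope estimate shows how streaming access over a large data set turns into a bandwidth requirement. This is a rough sketch, not the methodology of this study: it assumes that every pass streams the entire working set from main memory whenever that set does not fit in the cache, and the object size and number of passes per frame are invented for illustration.

```python
MIB = 2 ** 20


def streaming_bandwidth_bytes_per_s(num_objects, bytes_per_object,
                                    passes_per_frame, frames_per_second):
    """Bytes per second moved if each pass streams the whole working set."""
    working_set = num_objects * bytes_per_object
    return working_set * passes_per_frame * frames_per_second


def fits_in_cache(num_objects, bytes_per_object, cache_bytes):
    """True if the whole working set fits in a cache of `cache_bytes`."""
    return num_objects * bytes_per_object <= cache_bytes


if __name__ == "__main__":
    # Hypothetical game-physics scene: 50,000 objects of 1 KiB each,
    # touched by 20 streaming passes per frame at 60 frames per second.
    demand = streaming_bandwidth_bytes_per_s(50_000, 1024, 20, 60)
    print(demand / 1e9, "GB/s if nothing is cached")
    print(fits_in_cache(50_000, 1024, 128 * MIB))
```

For these assumed values the uncached demand is about 61 GB/s, while the 51.2 MB working set would fit in a 128MB cache, which is the kind of trade-off between cache capacity and memory bandwidth described above.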
We also find that physical simulation applications have very different memory characteristics from traditional benchmarks such as TPC-C, SPECjAppServer, and SPECjbb. These traditional benchmarks gain little from a large last-level cache because their working sets are extremely large. Physical simulation applications, by contrast, benefit greatly from a 128MB cache because it can hold the entire working set of every application module.
