The Intel Guide for Developing Multithreaded Applications (PDF 2.86MB)
This article is part of the Intel® Modern Code Developer Community documentation, which helps developers improve application performance through a systematic, step-by-step optimization framework. This article addresses parallelization.
Table Of Contents:
This chapter covers general topics in parallel performance but occasionally refers to API-specific issues.
1-1 - Predicting and Measuring Parallel Performance
1-2 - Loop Modifications to Enhance Data-Parallel Performance
1-3 - Granularity and Parallel Performance
1-4 - Load Balance and Parallel Performance
1-5 - Expose Parallelism by Avoiding or Removing Artificial Dependencies
1-6 - Using Tasks Instead of Threads
1-7 - Exploiting Data Parallelism in Ordered Data Streams
1-8 - Using AVX Without Writing AVX Code - New
The topics in this chapter discuss techniques to mitigate the negative impact of synchronization on performance.
2-1 - Managing Lock Contention: Large and Small Critical Sections
2-2 - Use Synchronization Routines Provided by the Threading API Rather than Hand-Coded Synchronization
2-3 - Choosing Appropriate Synchronization Primitives to Minimize Overhead
2-4 - Use Non-blocking Locks When Possible
Threads add another dimension to memory management that should not be ignored. This chapter covers memory issues that are unique to multithreaded applications.
3-1 - Avoiding Heap Contention Among Threads
3-2 - Use Thread-local Storage to Reduce Synchronization
3-3 - Detecting Memory Bandwidth Saturation in Threaded Applications
3-4 - Avoiding and Identifying False Sharing Among Threads
3-5 - Optimizing Applications for NUMA - New
This chapter describes how to use Intel software products to develop, debug, and optimize multithreaded applications.
4-1 - Automatic Parallelization with Intel® Compilers
4-2 - Parallelism in the Intel® Math Kernel Library
4-3 - Threading and Intel® Integrated Performance Primitives
4-4 - Using Intel® Inspector XE 2011 to Find Data Races in Multithreaded Code - Retired
4-5 - Curing Thread Imbalance Using Intel® Parallel Amplifier
4-6 - Getting Code Ready for Parallel Execution with Intel® Parallel Composer - Retired
4-7 - Optimize Data Structures and Memory Access Patterns to Improve Data Locality - New
On March 9, 2010, the Parallel Programming Community (renamed the Intel Modern Code Developer Community in July 2015) published a collection of technical papers to provide software developers with the most current technical information on Application Threading, Synchronization, Memory Management, and Programming Tools. These threading techniques remain important to achieving high performance on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. We look forward to your thoughts and feedback, and encourage you to participate in the discussion and ask questions in our Modern Code for Parallel Architectures forum.
The objective of the Intel® Guide for Developing Multithreaded Applications is to provide guidelines for developing efficient multithreaded applications across Intel-based symmetric multiprocessors (SMP) and/or systems with Hyper-Threading Technology. An application developer can use the advice in this document to improve multithreading performance and minimize unexpected performance variations on current as well as future SMP architectures built with Intel® processors.
The Guide provides general advice on multithreaded performance. Hardware-specific optimizations have deliberately been kept to a minimum. In future versions of the Guide, topics covering hardware-specific optimizations will be added for developers willing to sacrifice portability for higher performance.
Readers should have programming experience in a high-level language, preferably C, C++, and/or Fortran, though many of the recommendations in this document also apply to languages such as Java, C#, and Perl. Readers must also understand basic concurrent programming and be familiar with one or more threading methods, preferably OpenMP*, POSIX threads (also referred to as Pthreads), or the Win32* threading API.
The main objective of the Guide is to provide a quick reference to design and optimization guidelines for multithreaded applications on Intel® platforms. This Guide is not intended to serve as a textbook on multithreading nor is it a porting guide to Intel platforms.
The Intel® Guide for Developing Multithreaded Applications covers topics ranging from general advice applicable to any multithreading method to usage guidelines for Intel® software products to API-specific issues. Each topic in the Intel® Guide for Developing Multithreaded Applications is designed to stand on its own. However, the topics fall naturally into four categories: Application Threading, Synchronization, Memory Management and Programming Tools. Though each topic is a standalone discussion of some issue important to threading, many topics complement each other. Cross-references to related topics are provided throughout.
1.5 Authors and Editors
The following Intel engineers and technical experts contributed to writing, reviewing, and editing the Intel® Guide for Developing Multithreaded Applications: Henry Gabb, Martyn Corden, Todd Rosenquist, Paul Fischer, Julia Fedorova, Clay Breshears, Thomas Zipplies, Vladimir Tsymbal, Levent Akyil, Anton Pegushin, Alexey Kukanov, Paul Petersen, Mike Voss, Aaron Tersteeg, and Jay Hoeflinger.
1.6 Early Review of the Intel Guide for Developing Multithreaded Applications from Parallel Programmers
Tom Spyrou shares his opinion on how to tune the work assigned to threads, from the programming tools paper Curing Thread Imbalance Using Intel® Parallel Amplifier; his experience with detecting when a program's bottleneck is caused by main-memory bandwidth; and his thoughts on false sharing, from the paper Avoiding and Identifying False Sharing Among Threads.
Asaf Shelly discusses the correct use of memory allocations by giving each thread its own heap, in response to the paper Avoiding Heap Contention Among Threads.
Clay Breshears wrote about The Art of Snow Blowing as it relates to the paper Granularity and Parallel Performance.