- Home›
- Technology and Research›
- Intel Technology Journal›
- Multi-Core Software
Multi-Core Software
Parallel Software Development with Intel® Threading Analysis Tools
PARALLEL COMPUTING WITH INTEL® TBB
Challenges of Parallelizing with the Win32* Threading Library
While the Win32 threading library gives programmers great flexibility by giving them detailed control over threads, the library brings challenges for them too: programmers must rewrite the common parallel programming utilities; troubleshoot threading bugs, and try to avoid thread-safe issues. There is no way to maintain load balance and achieve good scalability unless the programmers perform these tasks manually, and these tasks require a comprehensive understanding of the threading method, application, and the features of the system.
Intel Threading Building Blocks
To make it easier for programmers to realize parallelism, Intel® TBB provides a high-level generic implementation of parallel patterns and concurrent data structures. With TBB, you specify task patterns instead of threads. Logical tasks are mapped automatically onto physical threads, thus hiding the complexity of operating system threads.
Implementation
According to the analysis detailed in the "Parallel with the Win32 Threading Library" section, the AC_BM algorithm can be parallelized by using data decomposition. We can separate the text into several blocks, calculate the matched number in each block, and sum the total matched number from all the blocks. Intel TBB provides the template parallel_reduce to do this kind of decomposition. Figure 15 and Figure 16 show the source code of the implementation with Intel TBB. Figure 15 shows the definition of Search task that performs the pattern matching operations. Figure 16 shows the main() function.

Figure 15: AC_BM search task
click image for larger view

Figure 16: Main() function
click image for larger view

Figure 17: Declaration of parallel_reduce()
click image for larger view
Figure 17 shows the declaration of the parallel_reduce() function. A parallel_reduce performs parallel reduction of body over each value in range. Grain size specifies the biggest range considered indivisible, i.e., Intel TBB splits the range into two subranges, if the size of the range exceeds the grain size. A too-small grain size may cause scheduling overhead within the loop templates and counteract the speedup gained from parallelism. A too large grain size, on the other hand, may unnecessarily limit parallelism. For example, if the grain size is so large that the range cannot be split, then the application will not be parallelized. Therefore, it is important to set the proper grain size. Normally, setting grain size to 10,000 is high enough to amortize the scheduler overhead; the correct size also depends on the ratio between the task number and processor number.
From the result of the Intel Thread Checker, there are no data races and thread-safety errors. Figure 18 shows that there are no transitions or imbalances between different threads either.

Figure 18: Thread Profiler Timeline view
click image for larger view
Figure 19 shows the performance for different grain sizes when the test file is 32M. The application achieves the best performance when the grain size equals 8M.

Figure 19: TBB performance of different grain sizes
click image for larger view