- Home›
- Technology and Research›
- Intel Technology Journal›
- Multi-Core Software
Multi-Core Software
Parallel Software Development with Intel® Threading Analysis Tools
PARALLELIZATION WITH THE WIN32 THREADING LIBRARY
Decomposition Method
In call graph view, we can see that the acbm_search function is called only once, but it takes up a significant amount of CPU time. The operation of acbm_search is to match the pattern tree against the given file to count the total matches between the patterns and the file. Since different text portions of the file are independent, each portion can be matched against the pattern tree separately. Data decomposition is used to logically divide the file into several parts and to use one thread to match against the patterns and return the matches of each part. Thus for the acbm_search function, these threads only have two points of serialization, i.e., loading file and combining the final result of total matches after all threads are finished.
Implementation
Figure 5 and Figure 6 show the source codes that illustrate the basic operations of threading the algorithm with the Windows* threading API.
When splitting the file into several parts, some matched strings may be split into different text portions. So in each search thread, after getting the matched pattern number from acbm_search() for their assigned portion of text, we use acbm_connection() to calculate the matched number of strings on the text boundary.

Figure 5: Main() function
click image for larger view

Figure 6: AC_BM Search thread
click image for larger view
Debugging
When introducing the threads, it is necessary to ensure their correctness. There are some notorious thread bugs, such as data race, stall, and deadlock, and they are usually difficult to detect. In our application, we use the Intel® Thread Checker to check whether there are any bugs in the threads.

Figure 7: Data race detected by Thread Checker
click image for larger view
From the results shown in Figure 7, we can see that there is data race on the result calculating part. In order to avoid incorrect access to gMatched, we use a CRITICAL_SECTION to protect gMatched between different threads.

Figure 8: Fix the result calculating part
click image for larger view
Test and Performance Tuning
After multi-threading the application, we test the performance with a 32M file: the result is 339.553ms. The scaling performance to the serial version, 383.353ms, is not very good, which means we need to tune the performance.
Performance tuning involves several steps:
- Gather performance data.
- Analyze data and identify issue.
- Generate alternatives to resolve the issue.
- Implement enhancements.
- Test results.
All these steps are performed in one iteration of the performance tuning process. Usually, several iterations are required to get the best performance. Before any tuning process, we need to ensure our program is correct and ready for tuning. Another thing that should be kept in mind is to only make one change for each tuning iteration.
In order to find tuning opportunities for a multi-threaded program, we use the Intel® Thread Profiler to gather thread-related data.

Figure 9: Thread Profiler Timeline view
click image for larger view
In the Timeline view (shown in Figure 9), we can see that a large majority of the threads are executed in serial (the orange line indicates serial execution) and there are overhead and transitions between different threads (represented by yellow lines). We can double click one of the yellow lines and it will drill down to the source code where the problem occurs.

Figure 10: Computing the search result
click image for larger view
The transition and overhead are caused by acquiring access to the critical section. And, since only one thread can get access to the critical section at one time, the code essentially allows only one thread to do the search, which causes the serial problem. We modify the code like this (in Figure 11):

Figure 11: Improved approach to computing the search result
click image for larger view
In this way, each thread performs the search independently and only needs to access the critical section when it adds the result to global variable gMatched.
After the modification, we tested the application and found that the time it spent on a 32M file was reduced to 255.500ms.
Now we have completed one iteration of tuning. We achieved improved performance after this iteration. Next we run the Thread Profiler again to see whether there are any other issues.

Figure 12: Thread Profiler Timeline view
click image for larger view
In Figure 12, we can see that the serial execution of different threads has been solved. Notice that in the timeline view, the thread that ends last still has a small portion of serial execution, and the main thread needs to wait for all threads to exit and join the result. Instead of allowing the main thread to idle, we may let the main thread share a portion of data and do the search too. After the modification, the time spent is reduced to 252.643ms. See the timeline view in Figure 13.

Figure 13: Thread Profiler Timeline view
click image for larger view
We can see that there are no more overhead and transitions between different threads; Thread No. 3 is fully utilized during its life cycle. However, there are actually only two threads running at a time; the other two threads are just waiting. Remember, the execution environment is a dual-core system so there won't be more than two threads running at one time in a CPU. Delegating more than two threads may only incur extra overhead, due to thread creation and scheduling. We therefore change the THREAD_NUM to 2. The time spent is now reduced to 248.702ms.
![]()
Figure 14: Thread Profiler Timeline view
click image for larger view
Now the threads are all well utilized, and Figure 14 shows that there are no transitions and imbalances any more. The performance scalability compared to serial processing is 383.353ms/248.702ms = 1.54 (the ideal is 1.867). Since the performance for the recent tuning iterations is not improved dramatically, we decide to accept the improvement.
From the tuning process presented above, we can see that the number of threads impacts the overall performance. In our example, the application works best when there are two threads, and the application happens to run on a dual core system. However, it doesn't mean that all applications running on a dual core system have the best performance when they have two threads. In the AC_BM algorithm, ACBMThreadProc only does some searches and calculations. However, for some other applications, the thread may need to wait for some resources, such as network connection or data loading. If this is the case, such a thread can relinquish the CPU and let another thread that is not waiting for resources to have the CPU. So in the case of this application, it may perform better if it has more threads than the number of CPU cores. There is no easy way to determine the number of threads for an application especially when the application is large and complex. Further, it is not easy to determine which task should be assigned to each thread to get the best performance. It is better to use the Thread Profiler to get the execution information about each thread and discover the tuning opportunities.
