- Home›
- Technology and Research›
- Intel Technology Journal›
- Multi-Core Software
Multi-Core Software
Parallelization Made Easier with Intel® Performance-Tuning Utility
TUNING FOR DATA-LEVEL PARALLELISM
In this section we provide a real tuning example to highlight the capabilities of Intel® PTU in real-world software analysis. We start with discovering parallel execution opportunities, and then we analyze the efficiency of parallelization by locating thread interaction and data layout issues. In the course of our analysis we consider the data-level parallelism wherein different data ranges are processed in parallel on a shared-memory multi-processor.
SP Application and Environment
For our example we took the SP code from the NAS 2.3 Benchmark Suite (NPB2.3) [10]. We started by profiling the serial version of SP, then took the OpenMP C implementation, made by the OMNI compiler team [11, 12], analyzed, and tuned it.
SP is a simulated computational fluid dynamics application. The finite difference solution is based on an approximate factorization that decouples the x, y, and z dimensions [13]. The data set we use is class A, for which a problem size is equal to 64. The simulation is done in 400 high-level iterations over time. The main loop contains the following calls:
Where x_solve calls lhsx and ninvr; y_solve lhsy and pinvr; z_solve lhsz and tzetar.
At each iteration, SP re-calculates a number of three- (64x64x64) and four-dimensional (5x64x64x64) arrays consisting of double precision floating-point numbers and consuming ~76 Mb of the memory space in total.
Our environment was Red Hat Linux* 3.0 Update 8 running on a 2.66 GHz Quad-Core Intel® Xeon® processor 53001 series system. This system had eight cores configured in four paired CPUs, with two such pairs per package. Each CPU had a 4Mb L2 cache. The application was compiled using the Intel® Compiler 10.0 with options "O3 openmp g."
Identifying Synchronization Overhead Using Statistical Call Graph
We started with the Basic Statistical Call Graph and Loop Analysis to understand the behavior of the serial version of SP. As can be seen in Figure 6, the tool identified a number of hotspot functions (most of the samples reside in compute_rhs). These hotspots have significant numbers of samples that fall in loops inside the hotspots (circled arrow icons). In addition every hotspot is called from within a loop (exclamation mark icons).

Figure 6: SP serial version hotspots in Statistical Call Graph display
click image for larger view
Inspection of the source for the hotspot functions suggests that we cannot parallelize the program in question assigning a different thread to every time iteration (that is, trying to multi-thread the loop surrounding the hotspot calls), because every new time iteration depends on the arrays produced by the previous iterations. Instead, we can employ data decomposition and assign multiple threads to different iterations of the loops inside the hotspots. For such an approach, OpenMP [11] is the obvious choice.
The timing of the parallel version of SP from the OMNI package [12] when running from two to eight threads is shown in Table 1. The two-threaded version's execution time decreased from 130 seconds to 91 seconds.

Table 1: SP OpenMP execution time and relative __kmpc_barrier contribution for different numbers of threads
click image for larger view
However, running the code with four threads or more shows no additional performance gain. This clearly indicates that there are some problems with SP OpenMP implementation.
The Statistical Call Graph profile for the 4-thread execution (Figure 7) shows that one of the main hotspots, namely the __kmp_wait_sleep function, belongs to the libguide library. Another substantial hotspot that belongs to libguide is __kmp_x86_pause. These hotspots all have the __kmpc_barrier function on their stacks. __kmpc_barrier in turn is called from many SP functions. This can be seen either from the expanded stack of the __kmp_wait_sleep hotspot (Figure 7) or, in an aggregated form, in the caller-callee view (Figure 8). __kmpc_barrier is dominantly called from the lhsx, lhsy and lhsz functions as the total number of samples for these three functions clearly account for the majority of the time (Figure 8).

Figure 7: Hotspots for SP OpenMP. Number of threads = 4. The partial stack for the __kmp_wait_sleep hotspot is shown.
click image for larger view

Figure 8: Caller/callee view for SP OpenMP with __kmpc_barrier as a target function. This view is useful in evaluating an aggregated target contribution and the relative contributions of its callers.
click image for larger view
The data for the __kmpc_barrier itself and its callees contribution are summarized in Table 1. The tables show that the number of total samples for __kmpc_barrier grows up to one quarter of the application's total number of samples when running with more than four threads. By total samples we mean the self sample count plus the self counts of all the functions down the call chain (callees).
The conclusion from the profiling session is that the initial SP OpenMP implementation doesn't scale because of a significant synchronization overhead exposed as a substantial number of total samples associated with the __kmpc_barrier function.
Note that Intel PTU significantly simplifies the identification of the total contribution of a function by automatically synchronizing views for the focus function. Thus, the Caller/Callee view displays aggregated total samples for a function selected in the Hotspot view.
To find the cause of the synchronization overhead, we look at the lhs[*] functions code (Figure 9) and find where the OpenMP "omp for" pragmas are applied. Instead of being applied to the outer loops, they are applied to the inner loop, decomposing the leading dimensions of the multi-dimensional arrays and causing a considerable overhead at an implicit barrier.

Figure 9: A code fragment from the lhsy function causing barrier overhead
click image for larger view

Figure 10: An optimized code fragment at the lhsy function
click image for larger view
Similar problems were observed in other functions where the "omp for" pragmas were applied to the middle loop of three nested loops. We modified the initial SP OpenMP implementation by making changes to lhs* and *_solve functions so that the "omp for" pragmas were properly applied to the outermost loop. Further, we merged some separate loops under one "omp for" pragma. We also had to privatize several variables as part of the changes to ensure the correctness of the program.
The improved version of the same non-optimal code fragment (from Figure 9) is shown in Figure 10. We refer to this version of the SP code as "SP OpenMP Opt."
Data Layout Analysis Using Sampling and Data Access Profiling
However, while the issue of the large barrier overhead was fixed by these modifications, the overall performance and scaling did not improve much beyond two threads (see Table 2).

Table 2: An optimized code fragment at the lhsy function
click image for larger view
Since the SP application uses ~76 Mb of data space and our system has only 16 Mb of shared L2 cache, the memory usage approach might be the reason for the poor improvement in scaling. To prove this we launched sampling, and we collected the MEM_LOAD_RETIERED.L2_LINE_MISS event for SP OpenMP Opt running with thread numbers 1 through 8. The results (summarized in Table 3) clearly indicate that the number of L2 cache line misses grows with the increasing number of threads. Although for the code to be scalable the number of cache misses should remain the same or even decrease.
Profiling runs for four and eight threads reveal that compute_rhs and z_solve functions are the main hotspots, contributing ~28% and ~12% L2 cache line misses, respectively, in both runs. The other main hotspots are the x_solve and y_solve functions.

Table 3: Count of the MEM_LOAD_RETIERED.L2_LINE_MISS event for the SP OpenMP Opt code
click image for larger view
The Data Profiling analysis for the four- and eight-thread runs confirms the same functions as the memory access bottlenecks. The Data Access Display shows that compute_rhs, z_solve, y_solve and x_solve functions are also hotspots in terms of Last Level Cache (LLC) misses, Total Latency, and Cachelines accessed (Figure 11).

Figure 11: Main hotspots for the SP OpenMP Opt 8 threads run in the Data Access Display. The figure also illustrates how to filter the cachelines accessed by selected functions.
click image for larger view
The reason for the growing number of cache misses (refer to Table 3) for four- and eight-thread runs might be interfering data accesses. The data access profile allows us to investigate if there are access contention issues for the cachelines used in the hotspots compute_rhs, z_solve, y_solve, and x_solve, particularly those caused by the threads running on different cores but accessing the same cachelines. This will show if there are any contentious lines associated with high average latencies and accessed by several threads.
To explore contention issues we ran the SP OpenMP Opt version with eight threads, bounding each thread to a distinct core using the KMP_AFFINITY environment variable supported by the Intel® Compiler OpenMP run-time library. The execution time and hotspots do not change with respect to the non-bound run.
In data access view we select the hotspot functions and use the "Filter by Selection in Code Hotspots" button (circled in Figure 11), to display only the cachelines accessed by these functions.
A number of filtered cachelines are marked in pink as likely suffering from false-sharing. But false-sharing (as well as true-sharing) is a particular case of thread access contention. Consequently, we sort cachelines by average latency and select the high latency lines. We then filter back either on a specific cacheline or a few of them to identify the functions that are associated with the contention. This is done by using the "Filter by Selection in Data Hotspots" button (triangled in Figure 11).
We found a number of cases (one example is in Figure 12) where the same cachelines were accessed from different threads by the functions compute_rhs & x_solve, x_solve & y_solve, y_solve & z_solve. Specifically, Figure 12 demonstrates that the same highlighted cacheline was accessed by the different threads in the x_solve and y_solve functions. The second access (by y_solve, as it called after x_solve in the code) is associated with the high latency (250 cycles) equal to an L2 miss penalty.

Figure 12: The IP hotspot view filtered by the selected cacheline
click image for larger view
We drill down from the filtered hotspot view to the source view (now only displaying the filtered accesses) of the functions, e.g., x_solve and y_solve ones, to identify the source lines that generated the access contention.
The source code identified by the access counts on these lines in turn identifies a number of cache contention patterns. Figure 13 displays the typical one we discovered.
In this case the x_solve function writes to the elements of the arrays rhs and lhs, and the y_solve function reads from them. The "omp for" pragma is placed in such a way that the data decomposition of these arrays is different in these two function fragments. In x_solve the decomposition, over the third index of rhs, causes thread_1 to write into rhs[*][*][T1_range][*], thread_2 writes into rhs[*][*][T2_range][*] and so on. While in y_solve, the decomposition is over the second index, so thread_1 here reads from rhs[*][T1_range][*][*]. This results in multiple cores having to shuffle the cachelines between themselves as they execute x_solve and then y_solve. This in turn results in a large number of load-driven cache misses and the resulting execution stalls.
We didn't go further with optimizing the SP code since our purpose was just to demonstrate how an application using data parallelism is analyzed and tuned with Intel PTU.
Possible ways for further optimization could be code transformations to make the "omp for" pragmas apply to the outermost loops, iterating over the same dimension indices. This would decrease the shuffling of the cachelines between cores and thereby improve the performance.
It would be also useful to consider decreasing some array sizes (to apply data blocking optimization), as described in [13]. This would bring an even bigger performance gain due to a more efficient cache usage.

Figure 13: Code fragments causing cacheline contention
click image for larger view
In this section we have shown that Statistical Call Graph analysis may be very helpful in the initial stages of parallel code tuning. Proceeding with the analysis requires some knowledge of the processor architecture to identify the hardware events to collect and to interpret the collected data. Advanced scalability estimations can hardly be performed without the help of data access profiling whose automatic analysis and flexible filtering interface enable pinpointing of such problems as cache contention (particularly false-sharing) and high latency loads at the source code level.
[1] Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.
