Tera-scale Computing
Media Mining: Emerging Tera-scale Computing Applications
PERFORMANCE ANALYSIS ON MULTI-CORE SYSTEMS
In this section we analyze three typical media-mining workloads (Player, Face, and Shot detection), which are parallelized via our video-mining parallel framework. To generate the best-performing executables, we use the Intel® 9.1 OpenMP compiler tool chain and the highly optimized OpenCV and IPP libraries [16]. We also use the Intel VTune™ Performance Analyzer [17] to identify hotspots through functional profiling and to guide the optimizations. To characterize parallel performance, we use the Intel® Thread Profiler to quantify metrics such as synchronization, locks, and load imbalance.
We evaluate the scaling performance of these parallel media-mining workloads on a real multi-core machine and on a large-scale CMP simulator. The multi-core platform is a dual-socket machine with two Intel® Core™2 Quad processors running at 2.33GHz. Each socket has four cores, and each core is equipped with a 32KB L1 data cache and a 32KB L1 instruction cache. Within a socket, each pair of cores shares a 4MB unified L2 cache. The maximum Front-Side-Bus (FSB) bandwidth is 21GB/s. In addition to this existing multi-core system, we study the performance of these media-mining applications on a large-scale CMP simulator with cycle-accurate simulation to see how they scale as the number of cores increases. For the simulation we assume a very high main memory bandwidth so that we do not artificially limit the scalability of the modules.
For the workloads studied in this experiment, we choose application parameters and datasets so as to represent realistic executions. For Player detection, we used a 30-minute MPEG-2 soccer video as the input. For Face and Shot detection, we used a 10-minute MPEG-2 movie video as the input.
Performance Scalability Analysis
Our video-mining workloads scale very well as the number of threads increases, as shown in Figures 5 and 7. That is, media-mining applications can efficiently use the computational power provided by multi-core processors.

Figure 5: Scalability of parallel video-mining workloads on an 8-core system
However, as Figure 5 also shows, our workloads, Shot detection in particular, do not scale linearly on the 8-core system. To understand the factors that limit scaling on an 8-core system, we characterize the parallel performance from two perspectives: high-level parallelization overhead (e.g., synchronization penalties, load imbalance, and sequential regions) and detailed memory behavior (e.g., cache miss rates and FSB bandwidth).

Figure 6: Execution time breakdown
In general, our parallelized workloads show good parallel performance metrics. Figure 6 depicts the parallel profiling metrics for the three workloads. The larger the share of execution time spent in parallel regions, the better the speedup that can be achieved on highly threaded architectures. Shot detection has slightly more load imbalance than the other workloads: because of frame dependencies, two-level task queues are harder to implement for Shot detection, so we use a static scheduling scheme there, which leads to slightly higher load imbalance. If we assume the parallel region scales perfectly, the three workloads should achieve theoretical speedups of 7.95, 7.93, and 7.56, respectively, on eight cores. These are higher than the measured results shown in Figure 5. Therefore, we believe the scalability of our workloads is limited by other factors, which are discussed in the next subsection.
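The relation between parallel-region share and theoretical speedup can be illustrated with Amdahl's law. The sketch below is an assumption about how such bounds are typically computed, not the exact procedure used with the Thread Profiler data: it treats all time outside the parallel region as non-scaling and inverts the law to recover the parallel fraction implied by each quoted 8-core speedup. The function names are illustrative.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only the parallel fraction scales with core count."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def implied_parallel_fraction(speedup: float, cores: int) -> float:
    """Invert Amdahl's law: the parallel fraction that yields `speedup` on `cores`."""
    return (1.0 - 1.0 / speedup) * cores / (cores - 1)


# Theoretical 8-core speedups quoted in the text, in the order given there.
quoted_speedups = [7.95, 7.93, 7.56]

for s in quoted_speedups:
    p = implied_parallel_fraction(s, 8)
    # Under this simple model, 1 - p is the non-scaling share of run time.
    print(f"speedup {s:.2f} -> parallel fraction {p:.4f}, non-scaling {1 - p:.4f}")
```

Under this model, even the lowest quoted bound corresponds to a non-scaling share below 1% of the single-thread run time, which is why the gap to the measured speedups has to come from elsewhere.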
On the simulated 32-core CMP system with a very large memory bandwidth, the two selected parallel video-mining workloads scale very well, as depicted in Figure 7. First, the serial sections of the applications are reasonably small: the serial code accounts for much less than 1% of the execution time in the one-thread runs. Second, there is little lock contention: because of the coarse-grained parallelism, the locking overhead does not increase with the number of threads. Third, load imbalance is not a major issue, since most of our video-mining workloads adopt a dynamic hybrid parallelization scheme. In short, when main memory bandwidth is high enough not to artificially limit the workloads, these applications scale very well.
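To illustrate why dynamic scheduling reduces load imbalance compared with static scheduling, the following minimal sketch compares the finish time of the slowest worker under both policies. It is a generic single-level task queue, not the two-level task queues or the hybrid scheme of our framework, and the task costs are hypothetical values chosen to make the effect visible.

```python
import heapq


def static_makespan(costs, workers):
    """Contiguous equal-count blocks assigned up front; the slowest worker sets the finish time."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def dynamic_makespan(costs, workers):
    """Shared task queue: each task goes to whichever worker becomes free first."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


# Hypothetical per-task costs: twelve cheap tasks followed by four expensive ones.
costs = [1] * 12 + [5] * 4
ideal = sum(costs) / 4

print("ideal  :", ideal)
print("static :", static_makespan(costs, 4))
print("dynamic:", dynamic_makespan(costs, 4))
```

With these costs the static split hands all expensive tasks to one worker, while the queue spreads them evenly. Real workloads with frame dependencies, such as Shot detection, cannot always be queued this freely.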

Figure 7: Scalability of two video-mining workloads on a 32-core CMP simulator
Memory Behavior Analysis
Besides the general parallel performance metrics, the memory subsystem also plays an important role in scalability. As shown earlier in Figure 6, our workloads have good parallel performance metrics and should achieve a theoretical speedup of 7.6x to 7.9x on 8 cores if the parallel region scales perfectly. We now investigate, from the perspective of the memory subsystem, why these workloads fall short of this. We use the Intel VTune Performance Analyzer and a command-line tool for hardware-based performance counter sampling to analyze the memory behavior of the applications on the real system, e.g., system memory bandwidth and L1/L2 cache miss rates.
Our first observation is that average bus bandwidth does not limit the scalability of these workloads on the 8-core system. Figure 8 shows how the average FSB bandwidth utilization varies with the number of threads. The bandwidth usage of every workload is far below the 21GB/s capacity supported by the system, which suggests that average bus bandwidth is not the limiting factor on the 8-core system.

Figure 8: Average FSB bandwidth utilization vs. number of cores
Although the workloads are not bounded by average bandwidth usage, their scalability is limited by instantaneous bandwidth usage. We perform interval sampling of the memory subsystem behavior over time. Figure 9 shows a representative phase of the bandwidth usage over time for the single-threaded Shot detection workload. Memory accesses are clearly bursty: the instantaneous bandwidth usage is at times much higher than the average. In particular, one of the modules demands about 7x the average bandwidth. When the bandwidth demand of this module exceeds the system's capability, its speedup on 8 cores is less than 3x, and it becomes the scalability bottleneck. In short, a workload cannot scale perfectly when its instantaneous bandwidth demand exceeds what the system can deliver.
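The difference between average and instantaneous bandwidth can be made concrete with a simple model. The sketch below is a simplification under stated assumptions, not our measurement method: every thread is assumed to be in the same phase at the same time, a phase scales linearly until aggregate demand reaches the 21GB/s FSB capacity, and the sampled trace is hypothetical (chosen so that one burst is about 7x the mean).

```python
FSB_CAPACITY_GBPS = 21.0  # peak FSB bandwidth stated in the text


def phase_speedup(demand_gbps, threads, capacity_gbps=FSB_CAPACITY_GBPS):
    """Speedup of one phase: linear in threads until aggregate demand hits capacity."""
    return min(threads, capacity_gbps / demand_gbps)


def overall_speedup(trace_gbps, threads, capacity_gbps=FSB_CAPACITY_GBPS):
    """Each sample is one unit of single-thread time; sum the scaled phase times."""
    parallel_time = sum(
        1.0 / phase_speedup(b, threads, capacity_gbps) for b in trace_gbps
    )
    return len(trace_gbps) / parallel_time


# Hypothetical single-thread trace in GB/s: 13 quiet samples and one burst.
trace = [0.6] * 13 + [7.8]
average = sum(trace) / len(trace)

print("average demand x 8 threads (GB/s):", average * 8)
print("burst phase speedup on 8 threads :", phase_speedup(max(trace), 8))
print("overall speedup on 8 threads     :", overall_speedup(trace, 8))
```

In this model the average demand of eight threads stays well under the bus capacity, yet the burst phase alone cannot scale past capacity divided by its single-thread demand, which caps the overall speedup below eight.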

Figure 9: Bandwidth usage over time for single-threaded Shot detection workload
Additionally, bandwidth usage for Shot detection increases significantly from four threads to eight threads. Figure 10 shows that L1 cache miss rates vary little with the number of threads, while L2 cache performance deteriorates as the thread count grows. In particular, the external memory access rate for Shot detection increases from 0.05 bytes per instruction for a single thread to 0.30 bytes per instruction for eight threads. Because we exploit coarse-grained parallelism for these three workloads, each thread operates on a large private working set: about 32MB per thread for Player detection, 8MB per thread for Face detection, and 4MB per thread for Shot detection. Since the total working set grows with the number of threads, more threads incur more L2 cache misses. For Shot detection, the working set of four threads fits in the 16MB of L2 cache, but the working set of eight threads does not. This explains the significant increase in cache misses from four threads to eight threads. Combined with the high instantaneous bandwidth usage, this makes the speedup of Shot detection from four to eight threads much smaller than the speedup from two to four threads.
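The working-set argument reduces to simple arithmetic, sketched below using the per-thread working-set sizes and the 16MB aggregate L2 capacity quoted above. This is a rough capacity check only: it assumes the full 16MB is usable regardless of where threads are placed and ignores sharing, associativity, and replacement effects.

```python
L2_TOTAL_MB = 16  # aggregate L2 capacity cited in the text

# Approximate private working set per thread (MB), from the text.
WORKING_SET_MB = {"Player": 32, "Face": 8, "Shot": 4}


def total_working_set_mb(workload, threads):
    """Private working sets add up because each thread has its own data."""
    return WORKING_SET_MB[workload] * threads


def fits_in_l2(workload, threads, l2_total_mb=L2_TOTAL_MB):
    """True when the combined working set is no larger than the aggregate L2."""
    return total_working_set_mb(workload, threads) <= l2_total_mb


for name in WORKING_SET_MB:
    for threads in (1, 2, 4, 8):
        total = total_working_set_mb(name, threads)
        verdict = "fits" if fits_in_l2(name, threads) else "exceeds L2"
        print(f"{name:6s} {threads} threads: {total:3d} MB -> {verdict}")
```

For Shot detection the check flips between four and eight threads, matching the jump in L2 misses described above.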

Figure 10: L1/L2 cache miss rates
To summarize, most of the video-mining workloads demonstrate fairly good parallel performance on both existing multi-core systems and future large-scale CMP platforms. Because most of them can be partitioned into a large number of parallel tasks, they have little lock overhead and only small serial regions. However, the workloads are parallelized in a coarse-grained fashion, so the total working set grows with the number of threads; large caches and sufficient memory bandwidth will therefore be necessary to enable large-scale video-mining computing. To reduce working set sizes and external bandwidth usage in the future, we may need to exploit fine-grained parallelism, which involves a tradeoff between memory subsystem performance and parallelization overhead.
