Multi-Core Software
Accelerating Video Feature Extractions in CBVIR on Multi-Core Systems
PERFORMANCE ANALYSIS ON MULTI-CORE SYSTEMS
In this section we first present twelve typical visual feature extraction workloads and accelerate them with serial optimizations. We then parallelize six of the most compute-intensive workloads using the methodology introduced in the previous section. We evaluate these workloads on an 8-core system: a dual-socket, quad-core machine with two Intel® Core™2 Quad processors running at 2.33GHz. Each socket has four cores, and each core is equipped with a 32KB L1 data cache and a 32KB L1 instruction cache. The two cores on each chip share a 4MB unified L2 cache, so the system has four L2 caches in total. The maximum FSB bandwidth is 21GB/s.
For the workloads studied in this work, we carefully choose the data sets to represent realistic scenarios. All the experiments are based on the TRECVID 2005 [20] development data sets. The 141st and 142nd video sequences are chosen to evaluate the performance; together they consist of around one hour of MPEG-1 video (352x240 resolution) and 791 key frames. The evaluations are performed directly on the extracted key frames.
Serial Performance Improvement
As shown in Figure 5, more than half of the workloads were originally slower than real time, i.e., 30 frames per second (FPS), when run serially on the 8-core system. After a series of optimizations, these kernels achieved an average speedup of 3.3x, about 60% of which came from using Intel's highly optimized libraries and SIMD optimization. Even so, five workloads (Correlogram, MRSAR, Gabor, SIFT, and OpticalFlow) are still slower than real time. To harness the thread-level parallelism offered by a multi-core system, we further parallelized these workloads and analyzed their performance on the 8-core system. In addition, to make our work more comprehensive, we also included a representative shape descriptor, Shape Context, in the parallelization study.

Figure 5: Serial processing speed (FPS) of CBVIR workloads on an 8-core system
Performance Scalability Analysis
These six workloads scale well as the number of threads increases, as shown in Figure 6. Four of them exhibit almost linear speedups, and the other two achieve respectable speedups. This indicates that CBVIR workloads can efficiently use the computational power provided by multi-core processors.

Figure 6: Scalability of parallel CBVIR workloads on an 8-core system
To understand the factors that limit scaling on an 8-core system, we characterize parallel performance at two levels: high-level general parallel overheads (e.g., synchronization penalties, load imbalance, and sequential regions) and detailed memory hierarchy behavior (e.g., cache miss rates and FSB bandwidth).
We profile the workloads with the Intel® Thread Profiler to identify their general parallel limiting factors. Figure 7 shows that the parallel region dominates the execution time breakdown, which suggests that these CBVIR workloads parallelize well. However, some workloads, especially SIFT, suffer considerably from load imbalance when the number of threads increases to four and eight, which leads to SIFT's poor speedup. If we assume the parallel region scales perfectly, Gabor and SIFT should achieve theoretical speedups of 7.6 and 6.2, respectively, on eight cores. These theoretical speedups are much higher than the measured results shown in Figure 6. Therefore, we believe the scalability of our workloads is also limited by other factors.

Figure 7: Execution time breakdown
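The theoretical speedups quoted above can be related to the share of execution time that does not scale. The following is a minimal sketch that assumes a plain Amdahl's-law model, in which everything outside the parallel region (sequential code, synchronization, imbalance) is lumped into one non-scaling fraction. The function names are hypothetical, and the exact model behind the 7.6 and 6.2 figures is not stated in the text.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper-bound speedup when only the parallel part scales with the core count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def implied_serial_fraction(speedup: float, cores: int) -> float:
    """Invert Amdahl's law: the non-scaling fraction that yields `speedup` on `cores`."""
    return (cores / speedup - 1.0) / (cores - 1.0)


# Theoretical 8-core speedups quoted in the text (assumed here to follow Amdahl's law).
for name, bound in (("Gabor", 7.6), ("SIFT", 6.2)):
    fraction = implied_serial_fraction(bound, cores=8)
    print(f"{name}: implied non-scaling fraction = {fraction:.2%}")
```

Under this simplified model, a non-scaling share of under one percent (Gabor) or roughly four percent (SIFT) is already enough to cap the 8-core speedup at the quoted values, which illustrates why small overheads matter at higher core counts.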
Besides the general scalability factors, the memory subsystem also plays an important role in identifying scaling bottlenecks. To investigate further, we collect memory-hierarchy micro-architectural statistics with the Intel VTune™ Performance Analyzer, as shown in Figure 8. The figure shows that L1 cache miss rates vary little with the number of threads, while for some workloads L2 cache performance varies considerably as the thread count scales. For most workloads, L2 cache misses are reduced when the number of threads increases to four or eight, because the aggregate L2 cache available to the workload grows from 4MB to 8MB and 16MB. Since SIFT uses a hierarchical parallel decomposition method, the downscaled image has to be broadcast to all the private L2 caches after each iteration, thereby incurring significant cache coherency misses when we scale to four and eight cores.

Figure 8: L1/L2 cache miss rates
Generally speaking, memory bandwidth is a key factor that may limit speedup on multi-core systems. Figure 9 shows how the average FSB bandwidth utilization varies with the number of threads. The bandwidth usage of every workload is far below the peak FSB bandwidth supported by the system. This seems to indicate that bus bandwidth does not limit the scalability of our workloads on an 8-core system. However, for some workloads, such as Gabor, scalability is limited by the instantaneous bandwidth usage. We perform interval sampling of the memory subsystem behavior over time. Figure 10 shows a representative phase of the bandwidth usage over time for Gabor on eight cores. Several modules in this workload require more bandwidth than the system can provide.

Figure 9: Average FSB bandwidth utilization vs. number of threads

Figure 10: Bandwidth usage over time for eight-threaded Gabor workload
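The gap between average and instantaneous bandwidth can be illustrated with a small calculation. The sketch below assumes a list of per-interval bandwidth samples; the sample values are invented for illustration and are not measurements from the study, and the function name and the 90% saturation threshold are likewise hypothetical. Only the 21GB/s peak comes from the system description.

```python
FSB_PEAK_GBPS = 21.0  # maximum FSB bandwidth stated for the test system


def summarize_bandwidth(samples_gbps, peak=FSB_PEAK_GBPS, threshold=0.9):
    """Return (average utilization, share of intervals at or above threshold * peak)."""
    average_utilization = sum(samples_gbps) / len(samples_gbps) / peak
    saturated = sum(1 for sample in samples_gbps if sample >= threshold * peak)
    return average_utilization, saturated / len(samples_gbps)


# Hypothetical interval samples in GB/s (NOT measured data): short bursts, long quiet phases.
samples = [20.5, 20.8, 4.0, 3.5, 5.0, 20.9, 4.2, 3.8, 4.5, 4.0]

average_utilization, saturated_share = summarize_bandwidth(samples)
print(f"average utilization: {average_utilization:.0%}")
print(f"intervals near saturation: {saturated_share:.0%}")
```

With samples shaped like these, the average stays well below the bus capacity even though a noticeable share of intervals runs close to the peak, which is the pattern that interval sampling is meant to expose and that an average alone hides.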
In addition to studying memory subsystem performance, we also use different thread-scheduling mechanisms to further improve the workloads' performance on a multi-core system. As mentioned earlier, there are three scheduling policies: "clustered," "non-clustered," and "os." The "clustered" policy tries as much as possible to schedule all the threads onto closely coupled cores; e.g., it schedules two threads onto two cores residing on one chip. In contrast, the "non-clustered" policy tries to schedule the threads onto loosely coupled cores; e.g., it schedules two threads onto two cores on two chips instead of one. The "os" policy is the default scheduling policy of the operating system, which is not aware of the hardware architecture.
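The difference between the "clustered" and "non-clustered" policies can be sketched as two core-selection orders. This is a minimal illustration, not the scheduler used in the study: the core numbering (cores 2k and 2k+1 sharing an L2 cache, cores 0-3 on one socket and 4-7 on the other) is an assumption, real operating systems may enumerate cores differently, and all names below are hypothetical.

```python
CORES_PER_L2 = 2      # assumed numbering: cores 2k and 2k+1 share one L2 cache
CORES_PER_SOCKET = 4  # assumed numbering: cores 0-3 on socket 0, cores 4-7 on socket 1
L2_SIZE_MB = 4        # L2 size per core pair, from the system description

# Fill a shared L2 first, then the socket, then the second socket.
CLUSTERED_ORDER = [0, 1, 2, 3, 4, 5, 6, 7]
# Spread across sockets first, then across L2 caches, then fill the pairs.
NON_CLUSTERED_ORDER = [0, 4, 2, 6, 1, 5, 3, 7]


def place_threads(num_threads, policy):
    """Return the core IDs that `num_threads` threads would be pinned to."""
    orders = {"clustered": CLUSTERED_ORDER, "non-clustered": NON_CLUSTERED_ORDER}
    order = orders[policy]
    if not 1 <= num_threads <= len(order):
        raise ValueError("num_threads must be between 1 and 8")
    return order[:num_threads]


def l2_capacity_mb(cores):
    """Aggregate L2 capacity reachable by the given cores."""
    return L2_SIZE_MB * len({core // CORES_PER_L2 for core in cores})


def sockets_used(cores):
    """Number of distinct sockets the given cores occupy."""
    return len({core // CORES_PER_SOCKET for core in cores})


for policy in ("clustered", "non-clustered"):
    for threads in (2, 4, 8):
        cores = place_threads(threads, policy)
        print(policy, threads, cores, f"{l2_capacity_mb(cores)}MB L2",
              f"{sockets_used(cores)} socket(s)")
```

Under these assumptions, the clustered placement keeps threads behind as few L2 caches as possible, while the non-clustered placement reaches the full aggregate L2 capacity with fewer threads. On Linux, a list like this could be applied through an affinity call such as `os.sched_setaffinity`; the text does not state which mechanism was actually used.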
Our results show that some workloads are sensitive to the scheduling policy. Figure 11 shows the scaling performance of Gabor and SIFT using different scheduling policies on an 8-core system. Gabor performs better with the "non-clustered" policy, while SIFT performs better with the "clustered" policy. This is because Gabor has a higher bandwidth requirement, as shown in Figure 9. The "non-clustered" policy can make full use of the available L2 cache capacity and bandwidth, resulting in better cache performance, as depicted in Figure 12. SIFT performs better with the "clustered" policy because its data can remain in the same L2 cache across several consecutive parallel regions. Otherwise, the data generated by one thread have to be transferred to another core that does not share the same L2 cache, generating significant cache coherency traffic and slowing down the program. As shown in Figure 12, the "clustered" policy gives SIFT far fewer L2 cache misses and lower FSB bandwidth utilization than the "non-clustered" policy. Accordingly, all the experimental results in the previous sections were obtained by choosing the best policy for each individual workload.

Figure 11: Effects of thread scheduling for two feature extraction workloads on an 8-core system

Figure 12: Effects of thread scheduling on L2 miss rate and FSB utilization rate for two feature extraction workloads on an 8-core system
