Tera-scale Computing
Media Mining: Emerging Tera-scale Computing Applications
MEDIA-MINING PARALLEL FRAMEWORK
With the boom in multi-core processors and the prevalence of shared-memory processing, applications must exploit thread-level parallelism to take advantage of next-generation microprocessors. In this section, we present our parallelization methodology, characterize different parallel schemes, and provide insights for parallelizing media-mining applications on future multi-core and many-core systems. In addition to the parallelization study, we applied intensive optimizations, e.g., generic loop-level optimization, SIMD acceleration, and cache-conscious optimizations, to provide a fully optimized baseline for further workload parallelization and analysis.
Video-Mining Parallelism
Most video-mining applications can be partitioned into three modules: video decoding, feature extraction, and post-processing. An MPEG-2 video decoder turns the input video stream into a sequence of consecutive decoded frames. A feature extraction module then extracts a set of visual features from these decoded frames. This process continues until all the frames are processed. Finally, all the feature results are fed into a post-processing module to detect the final visual information. The breakdown of execution time indicates that the video decoding and feature extraction (image-processing) modules are the most time consuming. The post-processing module is extremely fast and is therefore not the focus of this paper.
We use a top-down methodology to analyze the coarse-grained parallelism in each module and in the whole application. In general, data-domain decomposition is used more often than functional-domain decomposition to take advantage of the inherent parallelism in multimedia applications. Although fine-grained parallelism within each module is of interest, we do not explore it, because serial regions and insufficient parallelism in these modules make it unprofitable. Instead, we focus on coarse-grained parallelism and explore both functional-domain and data-domain decomposition with the goal of load balancing, and we examine scalability on a large number of processors.
A task-level parallel scheme uses the producer-consumer model: the video decoder works as the producer, generating a sequence of video frames, while the image-processing modules act as consumers, operating on decoded frames to obtain the visual feature information for each frame. This multi-threading scheme is very similar to the task queue model provided by the Intel® OpenMP extension [14], which offers an efficient way to exploit functional-domain decomposition. The video decoder serves as the task producer: it encapsulates each decoded frame as a task and conceptually puts it in the task queue, while the worker threads wait until a task is available. Although this scheme is straightforward, its scalability on a large number of processors may be limited when the ratio of feature extraction work to video decoding work is not sufficiently high.
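The producer-consumer scheme described above can be sketched with standard C++ threads. This is a minimal illustration rather than the authors' implementation: it uses `std::thread` instead of the OpenMP task queue extension, and `Frame`, `FrameQueue`, `extract_feature`, and `run_pipeline` are hypothetical names. Decoding is simulated by emitting frame indices, and the "feature" is a placeholder computation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a decoded frame; a real decoder would fill pixel data.
struct Frame {
    int index;
};

// Minimal thread-safe task queue shared by the producer and the consumers.
class FrameQueue {
public:
    void push(const Frame& f) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            frames_.push(f);
        }
        ready_.notify_one();
    }

    // Called once by the producer after the last frame has been enqueued.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a frame is available; returns false once closed and drained.
    bool pop(Frame& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !frames_.empty() || closed_; });
        if (frames_.empty()) return false;
        out = frames_.front();
        frames_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Frame> frames_;
    bool closed_ = false;
};

// Placeholder feature: any per-frame computation would go here.
int extract_feature(const Frame& f) { return f.index * 2; }

// One decoder (producer) thread feeds `num_workers` feature extraction threads.
std::vector<int> run_pipeline(int num_frames, int num_workers) {
    FrameQueue queue;
    std::vector<int> features(static_cast<std::size_t>(num_frames), 0);

    std::thread producer([&] {
        for (int i = 0; i < num_frames; ++i) queue.push(Frame{i});  // simulated decode
        queue.close();
    });

    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            Frame frame{0};
            while (queue.pop(frame)) {
                // Each frame owns a distinct output slot, so no lock is needed here.
                features[static_cast<std::size_t>(frame.index)] = extract_feature(frame);
            }
        });
    }

    producer.join();
    for (auto& t : workers) t.join();
    return features;
}
```

Calling `run_pipeline(100, 4)` would process 100 simulated frames with one producer and four consumers. Because there is a single producer, throughput in this sketch cannot exceed the decode rate, which is the scalability limit noted above.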
A static data-slicing parallel scheme slices the raw data into several video bit-stream chunks. Each thread performs a routine similar to that of the sequential application: it decodes its bit-stream chunk and extracts features from the decoded frames. Because the raw video stream is split manually, each thread has to find the next sequence synchronization position in its chunk. Two adjacent threads must therefore synchronize explicitly to guarantee that no decoded frame is duplicated or lost. In addition, the static data-slicing scheme may suffer from load imbalance when the work is not evenly distributed across threads.
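A minimal sketch of static slicing follows. It makes a simplifying assumption that differs from the text: chunks are ranges of frame indices rather than byte ranges of an MPEG-2 bit-stream, so the resynchronization step between adjacent threads is not modeled. The names `Chunk`, `slice_frames`, `slice_feature`, and `run_static_slicing` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Half-open frame range [first, second) assigned to one thread.
using Chunk = std::pair<int, int>;

// Split `num_frames` into `num_threads` contiguous chunks whose sizes differ
// by at most one. Assumes num_threads > 0.
std::vector<Chunk> slice_frames(int num_frames, int num_threads) {
    std::vector<Chunk> chunks;
    const int base = num_frames / num_threads;
    const int extra = num_frames % num_threads;
    int begin = 0;
    for (int t = 0; t < num_threads; ++t) {
        const int size = base + (t < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + size);
        begin += size;
    }
    return chunks;
}

// Placeholder for "decode the frame and extract its features".
int slice_feature(int frame_index) { return frame_index * 2; }

// Each thread runs the sequential routine on its own chunk.
std::vector<int> run_static_slicing(int num_frames, int num_threads) {
    std::vector<int> features(static_cast<std::size_t>(num_frames), 0);
    std::vector<std::thread> threads;
    for (const Chunk& c : slice_frames(num_frames, num_threads)) {
        threads.emplace_back([c, &features] {
            for (int i = c.first; i < c.second; ++i)
                features[static_cast<std::size_t>(i)] = slice_feature(i);
        });
    }
    for (auto& t : threads) t.join();
    return features;
}
```

The assignment is fixed before any work starts. If frames in one chunk are more expensive to process than those in another, the thread holding the cheap chunk idles while the others finish, which is the load imbalance described above.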
To take advantage of both the task-level and data-level parallel schemes, we propose a dynamic hybrid parallelization approach that combines them. First, we decompose the video stream into several chunks to exploit data-domain decomposition; then we exploit functional-domain decomposition on each chunk, as described previously. In this parallel scheme, multiple queues buffer the tasks. Master threads are responsible for video decoding and for enqueueing the feature extraction tasks in the different queues. Worker threads fetch tasks from their associated queue and execute them. To reduce load imbalance, we use a work-stealing strategy: when a queue is empty, its worker threads steal tasks from other non-empty queues and execute them on the otherwise idle physical processor. With work stealing enabled, the load imbalance disappears. In addition, because contention on each queue is reduced, the synchronization overhead drops significantly. Figure 4 illustrates the hybrid task-stealing scheme. The whole video is partitioned into four chunks, which are assigned to four thread groups. Within each thread group, a task queue is served by one master thread and three worker threads. A thread group takes tasks from other queues only when its own private task queue is out of tasks.

Figure 4: Dynamic hybrid task-stealing scheme
In summary, the dynamic hybrid parallelization scheme has several advantages. First, it significantly improves performance by handling multiple queues of video data in parallel, which reduces competition for shared resources. Second, it solves the load imbalance issue through dynamic task scheduling and stealing. Finally, it is flexible: the number of decoding and worker threads can be specified to maximize system resource utilization and deliver good scalability. From a programming perspective, however, the dynamic hybrid approach is the most difficult of the three parallel schemes to build.
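The core of the work-stealing strategy, taking from the private queue first and from other queues only when it is empty, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the master threads' decoding is replaced by pre-filling each queue before the workers start, so "all queues empty" is a valid termination test. The names `TaskQueue`, `next_task`, `HybridResult`, and `run_hybrid` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// One task queue per thread group, protected by its own mutex.
struct TaskQueue {
    std::mutex mutex;
    std::deque<int> frames;  // indices of decoded frames awaiting feature extraction

    bool try_pop(int& frame) {
        std::lock_guard<std::mutex> lock(mutex);
        if (frames.empty()) return false;
        frame = frames.front();
        frames.pop_front();
        return true;
    }
};

// Try the group's own queue first; visit the other queues only if it is empty.
// `stolen` reports whether the task came from another group's queue.
bool next_task(std::vector<TaskQueue>& queues, std::size_t home, int& frame, bool& stolen) {
    const std::size_t n = queues.size();
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t q = (home + k) % n;
        if (queues[q].try_pop(frame)) {
            stolen = (k != 0);
            return true;
        }
    }
    return false;  // every queue is empty
}

struct HybridResult {
    std::vector<int> features;
    int stolen_tasks;
};

// frames_per_chunk[g] tasks are pre-enqueued for group g (simulated decoding).
HybridResult run_hybrid(const std::vector<int>& frames_per_chunk, int workers_per_group) {
    std::vector<TaskQueue> queues(frames_per_chunk.size());
    int total = 0;
    for (std::size_t g = 0; g < queues.size(); ++g)
        for (int i = 0; i < frames_per_chunk[g]; ++i)
            queues[g].frames.push_back(total++);

    HybridResult result{std::vector<int>(static_cast<std::size_t>(total), 0), 0};
    std::atomic<int> stolen_count{0};
    std::vector<std::thread> workers;
    for (std::size_t g = 0; g < queues.size(); ++g) {
        for (int w = 0; w < workers_per_group; ++w) {
            workers.emplace_back([&, g] {
                int frame = 0;
                bool stolen = false;
                while (next_task(queues, g, frame, stolen)) {
                    // Placeholder feature; each frame writes to its own slot.
                    result.features[static_cast<std::size_t>(frame)] = frame * 2;
                    if (stolen) ++stolen_count;
                }
            });
        }
    }
    for (auto& t : workers) t.join();
    result.stolen_tasks = stolen_count.load();
    return result;
}
```

With an uneven distribution such as `run_hybrid({40, 0, 0, 0}, 3)`, workers in the three empty groups can pick up tasks from the first queue instead of idling. How many tasks are actually stolen depends on thread timing and varies from run to run.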
Parallel Pattern in Video-Mining Applications
Because these media-mining applications are difficult to parallelize, we construct a universal parallel video-mining framework that encapsulates the parallel scheme and provides an easy-to-use interface to the programmer.
The video-mining parallel framework [15] is built in C++, with OpenMP as the default parallel programming model. It includes the parallel implementation, an abstract interface, and a set of configuration parameters. There are four primary components in this framework:
- An image-processing engine that serves as the interface to invoke the user code in the library and perform feature extraction for each decoded frame.
- A video-decoding engine that acts as an interface to enable most video codec standards with parallel support.
- A portal video-mining function that serves as an interface to link the user code with the framework.
- Configuration parameters and core image data structures.
The parallel video-mining framework has several advantages. It provides a unified parallel computing environment for video-mining applications. Programs written in this framework can be automatically parallelized and efficiently executed on a multi-core architecture, with the run-time library taking care of the details. Furthermore, the framework is easy to extend and maintain, so the programmer can adapt it to meet new requirements.
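To make the four components concrete, the following sketch shows one way such an interface could look. It is not the API of the framework in [15]: every name here (`Image`, `MiningConfig`, `VideoDecoder`, `FeatureExtractor`, `mine_video`, and the two example classes) is hypothetical, and `std::thread` stands in for OpenMP. Decoding is done serially for simplicity; only feature extraction is parallelized.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Core image data structure (hypothetical; a real one would carry pixel planes).
struct Image {
    int frame_index;
    std::vector<unsigned char> pixels;
};

// Configuration parameters (hypothetical).
struct MiningConfig {
    int num_worker_threads;
};

// Video-decoding engine interface: one implementation per codec.
class VideoDecoder {
public:
    virtual ~VideoDecoder() {}
    // Writes the next decoded frame to `out`; returns false at end of stream.
    virtual bool next_frame(Image& out) = 0;
};

// Image-processing engine interface: user code implements feature extraction.
class FeatureExtractor {
public:
    virtual ~FeatureExtractor() {}
    // Must be safe to call concurrently on different frames.
    virtual double extract(const Image& frame) const = 0;
};

// Portal function: the single entry point linking user code to the framework.
std::vector<double> mine_video(VideoDecoder& decoder, const FeatureExtractor& extractor,
                               const MiningConfig& config) {
    std::vector<Image> frames;
    Image img;
    while (decoder.next_frame(img)) frames.push_back(img);  // serial decode in this sketch

    std::vector<double> features(frames.size(), 0.0);
    const std::size_t n = config.num_worker_threads > 0
                              ? static_cast<std::size_t>(config.num_worker_threads)
                              : 1;
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < n; ++w) {
        workers.emplace_back([&, w] {
            // Worker w handles frames w, w + n, w + 2n, ...
            for (std::size_t i = w; i < frames.size(); i += n)
                features[i] = extractor.extract(frames[i]);
        });
    }
    for (auto& t : workers) t.join();
    return features;
}

// Example user code: a synthetic decoder yielding `count` one-pixel frames.
class SyntheticDecoder : public VideoDecoder {
public:
    explicit SyntheticDecoder(int count) : count_(count), next_(0) {}
    bool next_frame(Image& out) override {
        if (next_ >= count_) return false;
        out.frame_index = next_;
        out.pixels.assign(1, static_cast<unsigned char>(next_));
        ++next_;
        return true;
    }

private:
    int count_;
    int next_;
};

// Example user code: the feature is the mean pixel value of the frame.
class MeanBrightness : public FeatureExtractor {
public:
    double extract(const Image& frame) const override {
        if (frame.pixels.empty()) return 0.0;
        double sum = 0.0;
        for (unsigned char p : frame.pixels) sum += p;
        return sum / static_cast<double>(frame.pixels.size());
    }
};
```

In this design the user writes only the two subclasses and a call to `mine_video`; thread creation and work distribution stay inside the portal function, which mirrors the division of labor described above.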
To summarize, video-mining workloads have abundant parallelism. Of the three schemes considered, the dynamic hybrid parallelization approach, which combines functional-domain and data-domain decomposition, delivers the best parallel performance. In addition, the particular execution pattern of video-mining applications can be abstracted into a parallel video-processing framework that helps programmers construct parallelized video-mining applications easily.