Volume 11, Issue 03

Tera-scale Computing


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1103.07

  • Volume 11
  • Issue 03
  • Published August 22, 2007

Tera-scale Computing

  Section 3 of 9  

Media Mining—Emerging Tera-scale Computing Applications

MEDIA-MINING APPLICATIONS

Media mining has a huge number of emerging applications with different usage models. We highlight three typical usage models developed at Intel.

Media-Mining Usage Models

  • Sports video analysis: Broadcast sports videos are very popular on television. Using highlight detection, consumers can quickly retrieve specific video clips without having to browse through the whole video. Sports video analytics can be viewed from the perspective of an editor. Based on a predefined semantic intention, an editor combines certain multimedia content elements and their temporal layout to achieve the desired highlighted events. Hence, detecting highlighted events is similar to a reverse process of authoring. The system framework consists of three levels: low-level audio/visual feature extraction, mid-level semantic keyword generation, and high-level event detection [8]. To minimize the semantic gap between low-level features and high-level events, we use mid-level semantic "keywords" followed by a classifier to infer events of interest. Our sports video analysis system can work with a multitude of sports including soccer, hockey, badminton, tennis, and diving. Given a video in a specific domain with predefined semantic intentions, the system can extract the desired events and features and produce a summary video described in terms of high-level semantics.
  • Personal video editing: Home videos are increasingly popular as digital video cameras become more user friendly and portable. However, because home videos are for the most part shot by amateurs, shaking, blurring, under-exposure artifacts, and redundant content are common. Therefore, the demand for an automated home video editing system [2] is high. Such a system has to be able to recognize how many people and how many scenes are involved, mine the relationships between the various people and scenes, and synthesize a short artistic video clip from a long raw video. A typical personal video editing system includes three key modules: intelligent analysis, adaptive selection, and seamless composition. The first module extracts the multi-modal and multi-level audio-visual features; the second module selects the most interesting, important, and informative content; and the third module produces a near-professional story with incidental music. The overall automated home video editing system should be easy to extend to personal video recorders and digital home entertainment systems.
  • Personal video retrieval: A personal video retrieval system is a desktop application that works much like Google Desktop search to help end users manage the growing volume of personal multimedia data from all kinds of mobile digital camera devices. In response to a user query, the personal video retrieval application finds the relevant video clips in a large video database containing, for example, movies, TV, sports games, and home videos. Generally, a retrieval system first extracts low-level audio/visual features from videos, and then detects semantic concepts (keywords) to represent the video content. Finally, a query engine returns retrieval results based on the user’s query and on a similarity model. The query can be text keywords, image examples, hand-drawn sketches, or short video clips, and the output is relevant video clips ranked not only by their content similarity to the query, but also by their importance, according to a concept-link relationship analysis. To gradually improve system performance during the query procedure, the system provides user-friendly relevance feedback and active learning modules.

Key Media-Mining Techniques

Although the above usage models are quite different from one another, the underlying technologies are common and can be extended to a broad range of media-mining applications. In this paper, four key techniques are extracted from the usage models above to show how media-mining applications are built.

  • Sports keyword detection: The mid-level module generates semantic "keywords" from the previously described low-level extraction. Listed below are some keywords in sports video analysis. These keywords are used as input for high-level event detection.
    • View type: Based on color histograms of each frame, we can obtain the dominant color to segment the playing field region. We then classify each frame as a global view, medium view, close-up view, or out of view [5].
    • Play-field: A Hough transform from digital image processing is used to detect field boundaries and penalty box sections. Then a decision-tree-based classifier determines the play position according to the slope and position of the lines.
    • Replay: In broadcast sports videos, a replay typically follows an important event, which provides a clue for detecting significant events. At the beginning and end of each replay, there is generally a logo flying across the screen at high speed. We detect logos to identify replays by discovering repeated video segments through dynamic programming [6].
    • Audio keywords: There are two types of audio keywords: the commentator’s excited speech and the referee’s whistle. Both have a strong correlation to key events in the game such as a foul, a goal, or player entanglements. A Gaussian Mixture Model (GMM) is used to detect keywords from low-level audio features including Mel-frequency cepstral coefficients (MFCC), energy, and pitch [7].
  • Human detection and tracking: Human detection and tracking is a significant and challenging task in many application scenarios. Unlike rigid objects, humans are articulated: the body consists of several parts connected by joints, which leads to pose variation, self-occlusion, and similar difficulties. In human detection, the first problem is to select proper features to characterize human regions or parts; Haar wavelets [3] and orientation histograms are most commonly used. The second problem is to apply a discriminator that determines whether humans are present and, if so, where they are. The Boosting learning-based detector is preferred [3]. It is an aggressive learning algorithm that produces a strong classifier by choosing features from a family of simple classifiers and combining them linearly. A cascaded structure is then introduced to quickly reject background regions. Human tracking essentially finds the correspondence of body regions or parts across successive frames by using data association and occlusion inference techniques.
  • Face detection and tracking: Face detection and face tracking are important technologies and prerequisites for many person-analysis applications, such as face recognition/identification, emotion analysis, and cast indexing. Face detection has been studied for many years. The Boosting learning-based detection algorithm of Viola and Jones is the most successful to date [2]. Recently, improvements have been proposed that enable the algorithm to handle multi-view faces more efficiently in high-quality videos [12]. Generally, Boosting-based face detection characterizes image regions by very simple Haar wavelet features and learns a cascade detector from a training set to separate faces from non-faces. In the detection phase, the learned detector slides a window over the image and decides whether each window contains a face. Face tracking [13] is an extension of face detection technology that follows a person’s face continuously through a video sequence. Spatial and temporal constraints are employed to avoid unnecessary computation: because faces are detected only in predicted face regions, no time is wasted scanning every position of every frame.
  • Concept ontology indexing: Concept ontology indexing represents multimedia data by a large-scale concept ontology for indexing and fast retrieval. Of the several concept lexicons for multimedia, the Large-Scale Concept Ontology for Multimedia (LSCOM) [9] is the most popular. LSCOM currently contains about 1000 concepts relevant to objects, people, locations, scenes, and events. LSCOM has been successfully used in the TREC video retrieval evaluation (TRECVID) hosted by NIST [10]. Concepts are detected from more than 20 low-level MPEG-7 compatible audio/visual features, e.g., color histogram, Gabor texture, shape context, edge histogram, motion, and MFCC audio features. Given these low-level features, a supervised classifier (such as an SVM) is learned for each concept from a training set to identify whether the concept is present in each video shot [11]. With all of the concept detectors applied, a video shot is represented and indexed by the semantic concept ontology, which makes the next-stage search similar to text retrieval.
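
To illustrate why concept indexing makes search resemble text retrieval, the following is a minimal sketch of an inverted index over per-shot concept scores. The shot identifiers, concept names, scores, the 0.5 threshold, and the ranking by summed score are all illustrative assumptions; the source does not specify the index structure or ranking model.

```python
from collections import defaultdict


def build_concept_index(shot_scores, threshold=0.5):
    """Map each concept to the shots whose detector score reaches the threshold.

    shot_scores: {shot_id: {concept: score in [0, 1]}} (hypothetical detector output).
    """
    index = defaultdict(list)
    for shot_id, scores in shot_scores.items():
        for concept, score in scores.items():
            if score >= threshold:
                index[concept].append((shot_id, score))
    return dict(index)


def search(index, query_concepts):
    """Return shots containing every query concept, ranked by summed score."""
    totals = None
    for concept in query_concepts:
        postings = dict(index.get(concept, []))
        if totals is None:
            totals = postings
        else:
            # Keep only shots that also contain this concept (AND query).
            totals = {s: totals[s] + postings[s] for s in totals if s in postings}
    ranked = sorted((totals or {}).items(), key=lambda item: (-item[1], item[0]))
    return [shot_id for shot_id, _ in ranked]


# Hypothetical detector scores for three video shots.
shot_scores = {
    "shot_001": {"person": 0.9, "outdoor": 0.8, "crowd": 0.2},
    "shot_002": {"person": 0.7, "outdoor": 0.1, "crowd": 0.9},
    "shot_003": {"person": 0.3, "outdoor": 0.95, "crowd": 0.6},
}
index = build_concept_index(shot_scores)
print(search(index, ["person", "outdoor"]))
print(search(index, ["crowd"]))
```

Once every shot is reduced to a set of concept labels, a query becomes a lookup and intersection of posting lists, exactly as with words in a text search engine.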

Common Characteristics in Media Mining

Three attributes of media-mining applications can be summarized as follows:

  • First, a media-mining system is basically a bottom-up framework as shown in Figure 1. The framework is a three-layer architecture, i.e., low-level feature extraction, mid-level semantic keyword detection, and high-level concept detection. In processing, low-level visual/audio/textual features are extracted from raw media data. Then in the second layer, mid-level features or keyword concepts are detected from low-level features to bridge the semantic gap between low-level features and high-level concepts. Finally, high-level modules infer the desired concepts in the semantic keyword spaces.



Figure 1: General video-mining framework
 

  • Second, media mining is a hybrid technique of computer vision, pattern recognition, machine learning, and data mining. For example, human detection/tracking techniques involve Haar and HoG feature extraction from video frames, Boosting (cascade learning) training-based candidate detection, and association rule learning from large sets of examples to identify relationships between articulations. In these techniques, Haar and HoG features are essentially computer vision methods; Boosting is a well-known machine-learning algorithm; and association rule learning is a typical data-mining method.
  • Third, media-mining applications usually combine multiple components. For example, in the automatic home video editing application, the application needs to recognize people, mine the relationship between people, and synthesize a short artistic video clip from a long raw video.

Media mining has mass-market potential and is therefore a suitable and important proxy not only for workload analysis on future architectures, but also for developing parallel programming models for multimedia applications. Furthermore, because the different usage models share a similar framework, we use only one technique as an example to study the computational requirements.

Computational Requirement: a Case Study



Figure 2: Flowchart of player detection, tracking and classification
 

In the sports domain, we look at multiple player detection, tracking, and classification in broadcast soccer video for our example. Its flowchart is shown in Figure 2 [4]. To make the algorithm robust and adaptive, we construct the background (playfield) color model and three player appearance models (Team A, Team B, and Referee) through unsupervised learning procedures. In the learning phase, the background color model is obtained by accumulating color histograms in HSV color space over hundreds of frames in the video. Player appearance models are learned by player sample collection with a boosted player detector, color histogram representation, and clustering. In the testing phase, we first perform background segmentation, playfield extraction, and view-type classification. Only global views are selected for player detection. We then apply a boosted cascade of Haar features for player detection on each foreground pixel within the playfield. Multiple detections will usually occur around each player after scanning the image. We merge adjacent detected rectangles and get final detections with proper scale and position. In the player classification procedure, each player sample is represented by the learned codebook histogram. We calculate the Bhattacharyya distance between the histogram and each sub-model. The player sample is assigned the sub-model’s label by the nearest neighbor rule. With this procedure, players are labeled as Team A, Team B, Referee, or Outlier (if the minimum distance is larger than a threshold). Player tracking is performed by efficient forward and backward nearest neighbor data association. We take both binary mask overlap and color histogram intersection in the player’s upper body as observations within a certain spatial displacement range to find the optimal correspondence between player regions, and we generate players’ trajectories across frames.
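
The classification step described above can be sketched as a nearest-neighbor rule over histogram distances. This is a minimal illustration under stated assumptions: the four-bin histograms, the single histogram per label (the source uses several sub-models), the threshold of 0.4, and the distance form sqrt(1 - BC) (one common definition of the Bhattacharyya distance; the source does not say which form it uses) are all hypothetical choices.

```python
import math


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = float(sum(hist))
    if total == 0:
        raise ValueError("empty histogram")
    return [h / total for h in hist]


def bhattacharyya_distance(p, q):
    """Distance sqrt(1 - BC), where BC is the Bhattacharyya coefficient."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))


def classify_player(sample_hist, models, outlier_threshold=0.4):
    """Assign the label of the nearest model, or "Outlier" if none is close."""
    sample = normalize(sample_hist)
    best_label, best_dist = None, float("inf")
    for label, model_hist in models.items():
        dist = bhattacharyya_distance(sample, normalize(model_hist))
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_dist > outlier_threshold:
        return "Outlier", best_dist
    return best_label, best_dist


# Hypothetical appearance models: one color histogram per label.
models = {
    "Team A": [8, 1, 1, 0],
    "Team B": [1, 8, 1, 0],
    "Referee": [0, 1, 1, 8],
}
print(classify_player([7, 2, 1, 0], models))
print(classify_player([1, 1, 1, 1], models))
```

The first sample is dominated by the same bin as the Team A model and is labeled accordingly; the second is spread evenly across bins, lies far from every model, and is rejected as an outlier.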

Figure 3 is an example of player tracking results, in which white ellipses and rectangles indicate two teams’ players and a black rectangle is the referee.



Figure 3: Player tracking results on soccer video
 

Player detection is achieved by background elimination and a boosted cascade of Haar features. In this paper, we show only the detailed detection procedure, since it is the most compute-intensive step compared to tracking and classification. The multi-stage cascade detector can quickly reject background regions and focus on the harder-to-classify windows. The number of features selected in each stage differs depending on the expected performance and sampling criterion, so increasingly complex classifiers are combined sequentially. This improves both detection speed and efficiency.

  • Input: image frame, background model
  • Playfield elimination and view-type classification
  • Player detection
    • For each scale
      • Scan each point to be detected
      • For each point
        • Evaluate its response with cascaded stages
        • Calculate normalization constant
        • For each stage
          • Evaluate the response
          • For each selected Haar feature
            • Calculate Haar feature response
            • Normalize Haar feature response
            • Get weak classifier response
          • Accumulate all weak classifier responses
          • If verified by the threshold, begin next stage;
          • else, label the point as negative, break;
        • If all stages pass, label the point as positive
  • Post-processing to merge adjacent detection instances
  • Output: vector of player regions (rectangles)
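
The detection loop above can be sketched in a few dozen lines. This is an illustrative sketch, not the authors' implementation: the 4x4 base window, the single top-minus-bottom Haar feature, the hand-set thresholds in `TOY_CASCADE`, and the normalization by window standard deviation times area are all assumptions. Background elimination and the final merging of adjacent detections are omitted.

```python
import math


def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y) and size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def window_norm(ii, ii_sq, x, y, size):
    """Normalization constant: window standard deviation times window area."""
    area = size * size
    mean = rect_sum(ii, x, y, size, size) / area
    var = rect_sum(ii_sq, x, y, size, size) / area - mean * mean
    std = math.sqrt(var) if var > 1e-9 else 1.0
    return std * area


def haar_response(ii, x, y, scale, rects):
    """Weighted sum of rectangle sums; rects are in base-window coordinates."""
    return sum(
        weight * rect_sum(ii, x + dx * scale, y + dy * scale, w * scale, h * scale)
        for dx, dy, w, h, weight in rects
    )


def passes_cascade(ii, ii_sq, x, y, scale, cascade, base):
    norm = window_norm(ii, ii_sq, x, y, base * scale)
    for weak_classifiers, stage_threshold in cascade:
        stage_sum = 0.0
        for rects, threshold, alpha in weak_classifiers:
            response = haar_response(ii, x, y, scale, rects) / norm
            stage_sum += alpha if response > threshold else 0.0
        if stage_sum < stage_threshold:
            return False  # rejected early; later stages are never evaluated
    return True


def detect(img, cascade, base=4, scales=(1, 2), step=1):
    """Scan every window position at every scale; return passing rectangles."""
    ii = integral_image(img)
    ii_sq = integral_image([[p * p for p in row] for row in img])
    h, w = len(img), len(img[0])
    hits = []
    for scale in scales:
        size = base * scale
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                if passes_cascade(ii, ii_sq, x, y, scale, cascade, base):
                    hits.append((x, y, size, size))
    return hits


# Hypothetical two-stage cascade: a lenient stage followed by a strict one.
TOP_MINUS_BOTTOM = [(0, 0, 4, 2, 1), (0, 2, 4, 2, -1)]
TOY_CASCADE = [
    ([(TOP_MINUS_BOTTOM, 0.5, 1.0)], 1.0),
    ([(TOP_MINUS_BOTTOM, 0.9, 1.0)], 1.0),
]

# Toy 8x8 frame: gray background with a bright-over-dark 4x4 patch at (2, 2).
image = [[100] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(2, 6):
        image[y][x] = 200 if y < 4 else 0

hits = detect(image, TOY_CASCADE)
print(hits)
```

Most windows fail the first, cheap stage, so the stricter stage runs on only a few candidates. This early rejection is what makes the cascade fast in practice, while the worst-case cost still grows with frame size, number of weak classifiers, and number of scales.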

Based on the above description, one can infer the computational complexity of detection, which is proportional to the size of the video frame, the number of weak classifiers, and the number of scales. The complexity of player tracking between two adjacent frames is proportional to the number of players and the player size. The complexity of player classification is linear in the number of players, player size, codebook size, and sub-model size. For an MPEG-2 video, the frame size is 720x576; we use about 1000 weak classifiers and three different scales. Thus, one minute of MPEG-2 video needs 1.86 tera-operations. Serial processing speed on today’s processors is about 3 frames per second, which is roughly 10x slower than real time.
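
The 1.86 tera-operation figure can be reproduced with simple arithmetic. The source does not state its derivation, so the sketch below rests on two assumptions: a real-time rate of 25 frames per second (the PAL rate usually paired with 720x576 video), and one operation per weak classifier per pixel position per scale, which is a worst case that ignores early rejection by the cascade.

```python
# Values taken from the text.
FRAME_WIDTH, FRAME_HEIGHT = 720, 576
WEAK_CLASSIFIERS = 1000
SCALES = 3
SERIAL_FPS = 3

# Assumption: PAL frame rate; the text does not state the frame rate.
REALTIME_FPS = 25

ops_per_frame = FRAME_WIDTH * FRAME_HEIGHT * WEAK_CLASSIFIERS * SCALES
ops_per_minute = ops_per_frame * REALTIME_FPS * 60
slowdown = REALTIME_FPS / SERIAL_FPS

print(f"operations per frame:  {ops_per_frame:.3e}")
print(f"operations per minute: {ops_per_minute:.3e}")
print(f"slowdown vs real time: {slowdown:.1f}x")
```

Under these assumptions the product comes to about 1.87 x 10^12 operations per minute, in line with the quoted figure, and the ratio of 25 to 3 frames per second gives a slowdown of a little over 8x, consistent with the rounded 10x.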
