Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.08

  • Volume 11
  • Issue 04
  • Published November 15, 2007

  Section 3 of 9  

Accelerating Video Feature Extractions in CBVIR on Multi-Core Systems

CBVIR AND LOW-LEVEL VISUAL DESCRIPTORS

Video information differs from conventional text or numerical data in that video data require a large amount of memory and special processing operations. Video retrieval depends on how the contents of a sequence of images are represented. Computational techniques that aim to index such unstructured visual information are called CBVIR [1, 4]. A typical CBVIR system has two components: a back-end for video indexing and a front-end for retrieval query processing. The back-end extracts low-level audio/visual features for video data indexing, while the front-end is a query engine that returns retrieval results based on the similarity between the query example and the indexed video data [4]. A typical system framework is illustrated in Figure 1.



Figure 1: Framework of a typical CBVIR system

The well-known maxim "Garbage in, garbage out" applies here: good features greatly improve the retrieval performance of a CBVIR system. For this reason, the MPEG-7 standard, formally known as the "Multimedia Content Description Interface," was proposed to guide content retrieval and feature extraction from video data. It includes a set of low-level color descriptors, texture descriptors, shape descriptors, and motion descriptors [2]. Because MPEG-7 is currently an experimental standard, the descriptors exist only at the conceptual level. In practice, therefore, most CBVIR systems use MPEG-7 only as a guideline for low-level feature extraction [1, 5]. In our experiments, we also use MPEG-7 as a guideline, and we briefly introduce the most widely used visual features. In each category, we also select one or two typical features and describe them in detail. These features are widely adopted and have very good retrieval performance [6].

Color Descriptors

Because of its expressive power, color is one of the first attributes used in image description, similarity, and retrieval tasks [7]. MPEG-7 divides color descriptors into several sub-categories: scalable color, color structure, color layout, and so on [2]. In practice, there are four widely used color descriptors: Color Histogram, Color Moments, Color Coherence Vector (CCV), and Color Correlogram. The first two can be viewed as scalable color descriptors, and the latter two can be viewed as color structure descriptors. A color histogram or a set of low-order color moments captures the overall color distribution, but neither captures any spatial relationships among colors. The CCV extends the color histogram by partitioning the pixels that fall in each histogram bin into coherent pixels and non-coherent pixels.
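
The CCV partition described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: it assumes the image is already quantized to color indices, that a pixel is coherent when its 8-connected same-color region has at least tau pixels, and the names color_coherence_vector and tau are hypothetical.

```python
from collections import deque


def color_coherence_vector(img, n_colors, tau):
    """Return a (coherent, non_coherent) pixel count for each color bin.

    img is a 2D list of quantized color indices in range(n_colors).
    Assumption: a pixel is coherent if its 8-connected region of the
    same color contains at least tau pixels.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    ccv = [[0, 0] for _ in range(n_colors)]
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            color = img[y][x]
            # Breadth-first flood fill to measure the connected region.
            size = 0
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx]
                                and img[ny][nx] == color):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Add the whole region to the coherent or non-coherent count.
            ccv[color][0 if size >= tau else 1] += size
    return [tuple(pair) for pair in ccv]
```

Summing the two counts of each bin gives back the ordinary color histogram, which is why the CCV can be read as a refinement of it.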

The Color Correlogram was proposed to characterize how the spatial correlation of pairs of colors changes with distance [8]. It provides much better performance than color histograms, color moments, and the CCV [6, 8] and has been widely used in CBVIR systems [1, 5]. The Color Correlogram extends the co-occurrence matrix method in texture analysis to the color domain. In short, a correlogram is a square table in which the entry at (i, j) specifies the probability of finding a pixel of color c_j at a fixed distance from a given pixel of color c_i. To capture more local spatial information, the co-occurrence can also be defined over banded neighborhoods, which leads to the banded color correlogram. In practice, {0, 1, 3, 5, 7} are the most commonly used banded distances.
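
As a rough illustration of the correlogram table for one fixed distance, the sketch below counts color pairs directly. It assumes a quantized image and measures distance with the chessboard (L-infinity) norm, which is a common choice but an assumption here; the function name color_correlogram is hypothetical, and the brute-force loop is not meant to be efficient.

```python
def color_correlogram(img, n_colors, k):
    """Return table[i][j]: estimated probability that a pixel at
    chessboard distance k from a pixel of color i has color j.

    img is a 2D list of quantized color indices in range(n_colors).
    """
    h, w = len(img), len(img[0])
    counts = [[0] * n_colors for _ in range(n_colors)]
    totals = [0] * n_colors
    for y in range(h):
        for x in range(w):
            ci = img[y][x]
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    # Keep only offsets that lie exactly at distance k.
                    if max(abs(dy), abs(dx)) != k:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        counts[ci][img[ny][nx]] += 1
                        totals[ci] += 1
    # Normalize each row; colors that never occur get a zero row.
    return [[counts[i][j] / totals[i] if totals[i] else 0.0
             for j in range(n_colors)]
            for i in range(n_colors)]
```

A banded variant could be approximated by calling this function once per distance in the band set and concatenating the resulting tables.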

Texture Descriptors

Textural features describe local arrangements of image signals in the spatial domain or, through spectral transforms, in the frequency domain. There are many kinds of texture features, such as the Gray-Level Co-occurrence Matrix (GLCM), edge histogram features, multi-resolution simultaneous autoregressive models (MRSAR), wavelet coefficients, and Gabor textures. Specifically, the GLCM provides the sufficient statistics of Markov random fields with multiple pairwise pixel interactions. The edge histogram feature is used to characterize non-homogeneous texture regions. The MRSAR is a random field texture model that characterizes the geometric structure and the quantitative strength of interactions among neighbors. At present, the most promising features for texture retrieval are multi-resolution features obtained from orthogonal wavelet transforms or from Gabor transforms in the frequency domain [7].

MPEG-7 has three texture descriptors: homogeneous texture, texture browsing, and edge histograms. The first two are based on the Gabor transform [2]. The Gabor transform offers the best simultaneous localization of spatial and frequency information. It has emerged as an important visual primitive and is widely applied in tasks such as edge detection, invariant object recognition, and compression [9, 10]. The 2-dimensional (2D) Gabor filters are defined as a series of multi-scale and multi-orientation cosine-modulated Gaussian kernels. The Gabor texture representation of an image is derived by convolving the image with the Gabor filters, and the convolution is implemented efficiently by using the Fast Fourier Transform (FFT). The MPEG-7 standard suggests using 6-orientation and 5-scale Gabor filters for the homogeneous texture descriptor and the texture browsing descriptor, which gives 30 filters and therefore requires one forward 2D FFT for the image and 30 inverse 2D FFTs for the frequency-domain results.
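
To make the filter definition concrete, the following sketch builds a bank of cosine-modulated Gaussian kernels with 5 scales and 6 orientations. It illustrates only the kernel definition, not the FFT-based convolution. The kernel size, the wavelength schedule, and the sigma-to-wavelength ratio are arbitrary assumptions chosen for the example, and the function names are hypothetical.

```python
import math


def gabor_kernel(size, sigma, wavelength, theta):
    """Build a size x size cosine-modulated Gaussian kernel.

    theta is the orientation in radians. The parameterization is a
    simple illustrative choice, not the one prescribed by MPEG-7.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the direction of the cosine carrier.
            xr = x * math.cos(theta) + y * math.sin(theta)
            gauss = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(gauss * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def gabor_bank(n_scales=5, n_orientations=6, size=9):
    """Return n_scales * n_orientations kernels (30 by default)."""
    bank = []
    for s in range(n_scales):
        wavelength = 3.0 * (1.5 ** s)  # assumed scale schedule
        sigma = 0.5 * wavelength       # assumed bandwidth
        for o in range(n_orientations):
            theta = math.pi * o / n_orientations
            bank.append(gabor_kernel(size, sigma, wavelength, theta))
    return bank
```

With the default arguments, len(gabor_bank()) is 30, which matches the 30 inverse FFTs mentioned above: one per filter response.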

MRSAR is another texture feature studied in this paper. It models the texture as a second-order, non-causal Markov random field [15]. MRSAR slides a 21x21 window across the input image with a fixed pixel step (seven pixels in our experiments) at three resolutions. The least squares estimation is carried out at each resolution independently. Together with the standard deviation of the error term, five parameters are estimated for each resolution, and the parameters of the three resolutions are concatenated into a 15-dimensional feature vector. The final feature is the mean and covariance matrix of the 15-dimensional feature vectors over all sliding windows.
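
The bookkeeping around the MRSAR estimation can be sketched as follows. The sketch covers only the sliding-window positions and the final mean and covariance aggregation; the per-window least squares estimation is omitted. The function names are hypothetical, and dividing the covariance by n rather than n - 1 is an assumption.

```python
def window_origins(height, width, win=21, step=7):
    """Top-left corners of all win x win windows that fit in the image."""
    return [(y, x)
            for y in range(0, height - win + 1, step)
            for x in range(0, width - win + 1, step)]


def mean_and_covariance(vectors):
    """Mean vector and covariance matrix of equal-length vectors.

    Assumption: population covariance (divide by n).
    """
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / n
            for j in range(d)]
           for i in range(d)]
    return mean, cov
```

For 15-dimensional window features, the result is a 15-element mean and a 15x15 covariance matrix, regardless of how many windows the image contains.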

Shape Descriptors

An object's shape plays a critical role in searching for similar image objects (e.g., text or trademarks in binary images, or specific boundaries of target objects in images). In image/video retrieval, the shape description is expected to be invariant to scaling, rotation, and translation of the object. Shape features are less developed than their color and texture counterparts because of the inherent complexity of representing shapes. MPEG-7 supports region-based and contour-based shape descriptors [2]. However, these kinds of shape descriptors depend on the quality of the shape extraction process.

Recently, shape context has been proposed as a global shape descriptor, and it has demonstrated great success in image matching, recognition, and retrieval [11, 12]. Computing it involves two steps: shape extraction and feature formulation. In practice, the shape can be provided by a boundary detector, an edge detector, or a segmentation boundary. Our implementation adopts the simplest choice, a Canny edge detector. For each shape point p, the method calculates the distance r and orientation θ between p and every other shape point, and then quantizes each pair (r, θ) into one of nine bins of a log-polar coordinate system, as shown in Figure 2. The resulting 9-bin histogram represents the features at point p. Finally, the histograms of the selected key points are flattened and concatenated to form the context description of the shape.
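
The histogram for a single reference point can be sketched as below. The split of the nine bins into 3 radial by 3 angular bins, the radius limits r_min and r_max, and the function name shape_context are all assumptions made for illustration; the source specifies only that nine log-polar bins are used.

```python
import math


def shape_context(points, index, n_r=3, n_theta=3, r_min=1.0, r_max=64.0):
    """Log-polar histogram of the other points around points[index].

    Assumptions: n_r x n_theta = 9 bins, logarithmic radial edges between
    r_min and r_max, and points farther than r_max are ignored.
    """
    px, py = points[index]
    hist = [0] * (n_r * n_theta)
    log_span = math.log(r_max / r_min)
    for i, (qx, qy) in enumerate(points):
        if i == index:
            continue
        r = math.hypot(qx - px, qy - py)
        if r > r_max:
            continue
        r = max(r, r_min)  # very close points fall into the first ring
        theta = math.atan2(qy - py, qx - px) % (2.0 * math.pi)
        r_bin = min(int(n_r * math.log(r / r_min) / log_span), n_r - 1)
        t_bin = min(int(n_theta * theta / (2.0 * math.pi)), n_theta - 1)
        hist[r_bin * n_theta + t_bin] += 1
    return hist
```

Because the radial edges grow geometrically (1, 4, 16, 64 with the default arguments), nearby points are resolved more finely than distant ones, which is the purpose of the log-polar layout.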



Figure 2: An example of shape context for the reference point

Localization Descriptors

Local descriptors for regions of interest have proved to be very successful in applications such as object recognition, image/video retrieval, and matching different views of an object or scene [12]. They are distinctive, robust to occlusion, and do not require segmentation. The idea is to detect image regions that are covariant to a class of transformations and then to use these regions as support regions for computing invariant descriptors. MPEG-7 contains a region locator and spatial-temporal locators [2]. In this paper we discuss only one of the most widely used localization descriptors: the scale-invariant feature transform (SIFT), which is known to be invariant to changes in illumination, image noise, scaling, and small changes in viewpoint [13].

SIFT feature detection can be divided into four steps. The first step detects local extrema in scale space. SIFT progressively blurs the input image with a Gaussian kernel, producing a series of blurred images (the scale space). Each blurred image is then subtracted from its direct neighbor in the scale space to produce a series of difference-of-Gaussian (DoG) images. Blob detection is then performed at each pixel by comparing the pixel with its eight direct neighbors in the same image and with the 18 neighboring pixels in the two adjacent images in the scale space. The second step localizes key points among the scale-space extrema by removing low-contrast points and noise points. The third step assigns an orientation to each key point and computes histograms of gradient directions in a 16x16 window around it. The fourth step formulates the key point descriptor, which is a 128-dimensional vector of the normalized histograms.
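
The first step can be sketched as follows, assuming the blurred images are already available as equally sized 2D lists. The function names are hypothetical, the test uses strict inequality (ties are not counted as extrema, which is an assumption), and only interior positions with a full set of 26 neighbors are handled.

```python
def difference_of_gaussians(blurred):
    """Subtract each blurred image from the next one in the series."""
    return [[[b[y][x] - a[y][x] for x in range(len(a[0]))]
             for y in range(len(a))]
            for a, b in zip(blurred, blurred[1:])]


def is_scale_space_extremum(dog, s, y, x):
    """Check pixel (y, x) of DoG image s against its 26 neighbors:
    8 in the same image and 9 in each of the two adjacent images.

    The caller must pass an interior position (not on any border and
    not in the first or last DoG image).
    """
    center = dog[s][y][x]
    neighbors = [dog[s + ds][y + dy][x + dx]
                 for ds in (-1, 0, 1)
                 for dy in (-1, 0, 1)
                 for dx in (-1, 0, 1)
                 if (ds, dy, dx) != (0, 0, 0)]
    return center > max(neighbors) or center < min(neighbors)
```

Each candidate costs 26 comparisons in the worst case, and the test is independent for every pixel, so this step is a natural target for data-parallel execution.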

Motion Descriptors

MPEG-7 defines four motion descriptors: camera motion, motion trajectory, parametric motion, and motion activity, which characterize 3-D camera motion parameters, the temporal evolution of key points, the motion of regions, and the intensity or pace of motion, respectively [2]. Some MPEG video compression methods already encode macro-block level motion vectors. However, when pixel-level or object-level motion estimation is required, we must resort to other techniques such as optical flow.

Because motion can be represented as vectors originating or terminating at pixels in a digital image sequence, optical flow denotes a vector field defined across the image plane that can warp one image into the next [14]. Estimating the optical flow is very useful in pattern recognition, computer vision, and other image-processing applications. In this work, we study the Lucas-Kanade method, which is known as the most popular two-frame differential method for optical flow estimation. This method calculates, at every pixel position, the motion between two image frames taken at times t and t + δt. A pixel at location (x, y, t) with intensity I(x, y, t) will have moved by δx, δy, and δt between the two frames, and optical flow assumes that the same part of an object keeps the same intensity in both frames, i.e., I(x + δx, y + δy, t + δt) = I(x, y, t). Applying a first-order Taylor expansion to the left side and omitting higher-order terms gives the basic constraint I_x V_x + I_y V_y = -I_t, where I_x, I_y, and I_t are the partial derivatives of the intensity with respect to x, y, and t, and (V_x, V_y) = (δx/δt, δy/δt) is the flow. The Lucas-Kanade method assumes that the flow (V_x, V_y) is constant in a small window of n pixels, so substituting the n pixels into the basic constraint yields n linear equations. Since there are more equations than unknowns (i.e., n > 2), the system is over-determined and can be solved by the least squares method.
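
For a single window, the least squares solution can be written in closed form through the 2x2 normal equations of the constraint I_x V_x + I_y V_y = -I_t. The sketch below assumes that the per-pixel derivatives have already been computed and are passed as flat lists; the function name lucas_kanade_window and the singularity threshold are hypothetical.

```python
def lucas_kanade_window(ix, iy, it):
    """Least squares flow (vx, vy) for one window.

    ix, iy, it are equal-length lists holding the spatial and temporal
    intensity derivatives at the n pixels of the window. Each pixel
    contributes one equation ix*vx + iy*vy = -it.
    Returns None when the normal equations are (near) singular.
    """
    sxx = sum(a * a for a in ix)
    sxy = sum(a * b for a, b in zip(ix, iy))
    syy = sum(b * b for b in iy)
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:  # assumed threshold for a degenerate window
        return None
    # Solve [[sxx, sxy], [sxy, syy]] * [vx, vy] = [-sxt, -syt].
    vx = (-syy * sxt + sxy * syt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy
```

The singular case corresponds to a window whose gradients all point in the same direction (or vanish), where the flow cannot be determined uniquely from the constraint alone.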
