Analyzing and Understanding Visual Data

Yurong Chen

Currently, more than 75% of all internet traffic is visual (video and images). Total traffic is exploding: it is projected to jump from 1.2 zettabytes per year in 2016 to 3.3 zettabytes per year by 2021, with visual data comprising roughly 2.6 zettabytes of that.

A major challenge for applications is how to process and understand this visual data, a capability called “visual understanding”. So what exactly is visual understanding?

Visual understanding (VU) is an important part of computer vision. According to Wikipedia, computer vision is a field that includes methods for acquiring, processing, analyzing and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information. Put simply, VU is the process of analyzing and understanding images and videos. It focuses mainly on object-level processing, in contrast to the pixel-level processing of the imaging stage. The objective of VU is to derive knowledge from images and videos of the real world. VU encompasses the following capabilities, among others:

  • Text recognition
  • Face detection & recognition
  • Emotion recognition
  • Object detection and classification
  • Scene classification and understanding
  • Activity recognition and classification
  • Video classification/summarization

In the following example, the Person and Camera have been detected and classified, and the action has been recognized as Taking Pictures.

Intel Labs China, directed by Dr. Yurong Chen, has been making dramatic strides in VU. Dr. Chen is a Senior Research Director and Principal Research Scientist at Intel Corporation and Director of Cognitive Computing Lab at Intel Labs China (ILC). Under his direction ILC has made significant progress in these three key areas:

  • Face analysis and emotion recognition
  • Deep-learning based visual recognition
  • Visual parsing and multimodal analysis

Face Analysis Technology Research

ILC has developed a full face analysis pipeline with “best in class” algorithms, resulting in more than twenty patents awarded for this work.

Face analysis research at ILC is advancing rapidly. Current technologies can generally recognize a subject’s face, gender, age, expression, and emotion, and, in real time, create live 3D facial animations with emotional enhancements. Applications include avatar representations, virtual reality, augmented reality, gaming, and more. These face analysis technologies have been integrated and leveraged in a variety of other technologies and applications, including Intel® RealSense™ technology, the OpenVINO™ toolkit, client application prototyping, and IoT video end-to-end (E2E) solutions.

3D Face Technology

Intel’s 3D face technology is able to recognize emotions and perform 3D face modeling, tracking, and enhancements in real time, for applications in virtual reality, augmented reality, and gaming. Using Intel Labs China’s 3D face technology, Intel collaborated with Chinese pop star Chris Lee to create the world’s first AI music promotion video.
More information here.

Visual Emotion Recognition

Visual emotion recognition will be key for smart devices. ILC’s Action Units-Aware Features and Interactions (AUAFI) technology leverages multi-task learning to decode facial muscle movements and their inherent interactions. Tested against the CK+ expression dataset (327 videos, 7 basic facial expressions, 118 subjects), AUAFI achieved a 98.7% overall recognition rate. Tested against the MMI facial expression database (205 videos, 6 basic facial expressions, 23 subjects with large pose variations), AUAFI achieved an 80.27% overall recognition rate. Intel Labs China presented AUAFI at the ACM ICMI Conference in 2015.
More detailed information about this research here.

Audio is another important cue for emotion recognition. ILC proposed “Importance-Aware Features” with selective grouping for audio emotion recognition, and designed a fusion framework to make the best use of the visual and audio modalities for emotion recognition in the wild.

Intel Labs China wins Emotion Recognition in the Wild Challenge

Competing against seventy-four teams from around the world, including teams from Carnegie Mellon, the University of Illinois Urbana-Champaign, and Microsoft Research, Intel Labs China won First Place in the audio-video based task of the 2015 Emotion Recognition in the Wild Challenge (EmotiW 2015). With its entry, Capturing AU-Aware Facial Features and Their Latent Relations for Emotion Recognition in the Wild, ILC scored an overall recognition rate of 53.8% on the EmotiW 2015 AFEW dataset (against a baseline of 39.33%) and 55.38% on the EmotiW 2015 SFEW dataset (against a baseline of 39.13%). (AFEW stands for Acted Facial Expressions in the Wild; SFEW stands for Static Facial Expressions in the Wild.)
More detailed information on this research here.

Following are samples of the EmotiW 2015 video clips.

Source: 1) A. Dhall, R. Goecke, S. Lucey and T. Gedeon, “Collecting Large, Richly Annotated Facial Expression Databases from Movies”, IEEE MultiMedia 19 (2012) 34–41.

2) A. Dhall, R. Goecke, J. Joshi, K. Sikka and T. Gedeon, “Emotion Recognition in the Wild Challenge 2014: Baseline, Data and Protocol”, ACM ICMI 2014.


In 2016, ILC invented a deep yet computationally efficient CNN framework named HoloNet (represented in the figure below) for robust emotion recognition. With HoloNet, ILC won First Runner-Up against 100 registered teams in the audio-video based task of EmotiW 2016 (ACM ICMI ’16), as well as the Most Influential Paper award across the past four years’ challenges. Intel’s method, a fusion of ILC’s HoloNet convolutional neural network models (A&B) plus one audio model and one iDT model, achieved a test score of 57.84% on the AFEW 6.0 dataset. Abhinav Dhall, EmotiW 2016 Chairperson, had this to say regarding ILC’s submission: “… You showed me a really novel method, no use of extra data and its speed is hundreds of times faster than the other competitors.”
More detailed information on this research here.

Supervised Scoring Ensemble

Supervised Scoring Ensemble (SSE) is an approach to emotion recognition invented by ILC that applies supervision not only to deep layers, but also to intermediate and shallow layers of a convolutional neural network. The method also employs a new fusion structure in which class-wise scoring activations at diverse complementary-feature layers are concatenated and used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN. This approach brings large accuracy gains over diverse backbone networks. ILC presented SSE at the ACM International Conference on Multimodal Interaction 2017, achieving a recognition rate of 60.34% on that year’s audio-video based emotion recognition task and surpassing all existing records.
More detailed information on this research here.
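The class-wise fusion idea can be sketched in a few lines of Python. This is a toy illustration only: the layer scores, fusion weights, and averaging scheme below are invented for clarity, whereas ILC's actual SSE learns the second-level supervision weights inside a single CNN.

```python
# Toy sketch of SSE-style fusion: class-wise score vectors produced at
# shallow, intermediate, and deep layers are concatenated and fed to a
# second-level scorer (here a plain linear combination per class).

NUM_CLASSES = 7  # e.g., 7 basic emotions

def fuse_scores(per_layer_scores, fusion_weights):
    """Concatenate class-wise scores from several supervised layers and
    apply a simple second-level linear scorer per class."""
    # Flatten [layer][class] scores into one ensemble feature vector.
    ensemble = [s for layer in per_layer_scores for s in layer]
    fused = []
    for class_weights in fusion_weights:  # one weight row per emotion class
        fused.append(sum(w * x for w, x in zip(class_weights, ensemble)))
    return fused

# Invented scores from 3 supervised layers (shallow / intermediate / deep).
layers = [
    [0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 0.1],   # shallow-layer class scores
    [0.0, 0.1, 0.1, 0.6, 0.0, 0.1, 0.1],   # intermediate-layer class scores
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.1, 0.0],   # deep-layer class scores
]
# Identity-like fusion weights: average the three per-class scores.
weights = [[(1 / 3 if j % NUM_CLASSES == i else 0.0)
            for j in range(3 * NUM_CLASSES)] for i in range(NUM_CLASSES)]
fused = fuse_scores(layers, weights)
predicted = max(range(NUM_CLASSES), key=lambda i: fused[i])
```

In a trained SSE model, the fusion weights would be learned jointly with the backbone, so complementary layers contribute according to their reliability per class.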

Deep Learning Based Visual Recognition

Over time, Intel Labs China has applied for and been granted tens of patents for its methods of designing and training large, deep convolutional neural networks (CNNs) for Intel® platforms. ILC has developed and optimized algorithms and models for general visual recognition—performing large-scale object classification and multiclass object detection—and has then developed specific applications targeted at edge (IoT) devices for recognizing objects in real-life scenarios, e.g., pedestrians, cars, etc. More recently, ILC has been working on advances in CNN algorithm design to better balance accuracy, speed, memory consumption, and power efficiency, in order to support edge-device deployment with FPGAs, VPUs, etc.

HyperNet – An Efficient Object Detection Solution

Most top-performing object detection networks employ region proposals to guide the search for objects. Although leading region proposal network methods may achieve promising detection accuracy, this is usually after several hundred proposals, which is inefficient, and this method still struggles to detect and precisely locate smaller objects.

Intel Labs China, in conjunction with Tsinghua University, designed HyperNet to alleviate these shortcomings. HyperNet handles region proposal generation and object detection jointly. It is primarily based on an elaborately designed Hyper Feature, which first aggregates hierarchical feature maps and then compresses them into a uniform space. Hyper Features incorporate highly semantic, complementary, high-resolution features of the image, allowing HyperNet to generate proposals and detect objects via an end-to-end joint training strategy. Using the deep VGG-16 pre-trained CNN model, HyperNet achieves leading recall and state-of-the-art object detection accuracy on the PASCAL VOC 2007 and 2012 datasets using only 100 proposals per image. Advantages include high recall, a smaller memory footprint, and speed. In tests, HyperNet ran at five frames per second (including all steps), giving it the potential for real-time processing in deployment.
More detailed information on this research here.
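The Hyper Feature aggregation step can be illustrated with a small sketch: pool the high-resolution shallow map down, upsample the low-resolution deep map up, and stack everything at one common resolution. The map sizes and values here are invented, and the real HyperNet also applies learned convolutions before concatenation; this only shows the resize-and-stack idea.

```python
# Toy illustration of Hyper Feature aggregation: feature maps from
# different CNN depths are brought to a uniform resolution and
# concatenated channel-wise.

def max_pool_2x(fmap):
    """Downsample a 2D feature map by taking the max of each 2x2 block."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, w, 2)] for i in range(0, h, 2)]

def upsample_2x(fmap):
    """Upsample a 2D feature map by nearest-neighbor replication."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def hyper_feature(shallow, middle, deep):
    """Pool the high-resolution shallow map, keep the middle map, and
    upsample the low-resolution deep map, then stack them as channels."""
    return [max_pool_2x(shallow), middle, upsample_2x(deep)]

shallow = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]  # 4x4
middle = [[1, 0], [0, 1]]                                                   # 2x2
deep = [[7]]                                                                # 1x1
channels = hyper_feature(shallow, middle, deep)  # three 2x2 channels
```

Because every channel now lives at the same resolution, a single downstream head can consume semantic (deep) and fine-grained (shallow) information together, which is what helps with small objects.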

Reverse Connection with Objectness Prior Networks

At the 2017 Conference on Computer Vision and Pattern Recognition, ILC presented its work on a new framework for object detection called “Reverse Connection with Objectness Prior Networks” (RON). RON is a fully convolutional framework that combines the merits of the two mainstream solution families (region-based and region-free) while addressing their two major shortcomings, through:

  • Multi-scale object localization with Reverse Connection Pyramids
  • Efficient negative space mining with Objectness Prior Networks

RON can directly predict final detection results from all locations of various feature maps. Extensive experiments on the standard datasets demonstrate the competitive performance of RON. Specifically, with VGG-16 and a low-resolution 384×384 input size, RON achieves 81.3% mean Average Precision (mAP) on the PASCAL Visual Object Classes (VOC) 2007 dataset and 80.7% mAP on the PASCAL VOC 2012 dataset. Its superiority increases as datasets become larger and more difficult, as demonstrated by the results on the Microsoft Common Objects in Context (COCO) dataset, where RON excelled in both state-of-the-art accuracy and speed.
More detailed information on this research here.
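The objectness prior component lends itself to a compact sketch. Most spatial locations in a detector's feature maps are background, so an objectness score can gate which locations are even passed to the per-class detection head. The candidate locations, scores, and threshold below are all invented for illustration; RON learns its objectness maps end to end.

```python
# Toy sketch of negative-space mining with an objectness prior: discard
# candidate locations whose objectness score falls below a threshold
# before running the (more expensive) per-class detector on them.

OBJECTNESS_THRESHOLD = 0.5  # hypothetical cutoff

def mine_with_objectness(locations, threshold=OBJECTNESS_THRESHOLD):
    """Keep only locations whose objectness prior exceeds the threshold,
    suppressing the vast negative (background) space up front."""
    return [loc for loc in locations if loc["objectness"] > threshold]

# Invented candidate locations pooled from several feature maps.
candidates = [
    {"xy": (10, 12), "objectness": 0.92},  # likely an object
    {"xy": (40, 33), "objectness": 0.07},  # background
    {"xy": (55, 21), "objectness": 0.64},  # likely an object
    {"xy": (70, 70), "objectness": 0.11},  # background
]
survivors = mine_with_objectness(candidates)
```

Filtering this way keeps the positive/negative balance manageable during training and cuts inference cost, which is part of how RON stays fast at low input resolutions.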

Training Deeply Supervised Object Detectors “from Scratch”

State-of-the-art object detectors rely heavily on off-the-shelf networks pre-trained on large-scale classification datasets, such as ImageNet. This approach incurs learning bias due to the differences in loss functions and category distributions between the classification and detection tasks. Fine-tuning the model for detection can alleviate this bias to some extent, but not entirely. Moreover, transferring pre-trained models from classification to detection across discrepant domains is difficult (for example, from RGB to depth images). A better way to tackle these two problems is to train object detectors from untrained models, and this is what ILC’s Deeply Supervised Object Detector (DSOD) framework achieves.

Previous efforts in this direction largely failed due to excessively complicated loss functions and the limited training data available for object detection. For DSOD, ILC developed a set of design principles for training object detectors. One of the team’s key findings is that deep supervision, enabled by dense layer-wise connections, plays a critical role in training a good detector. Combining this with several other improvements, ILC developed DSOD following the single-shot detection (SSD) framework. Experiments on the PASCAL VOC 2007 and 2012 datasets and the MS COCO dataset demonstrate that DSOD can achieve better results than state-of-the-art solutions with highly compact models. ILC’s DSOD outperforms SSD on all three benchmarks above, yet requires only 1/2 of the parameters of SSD and 1/10 of the parameters of Faster R-CNN. These features make DSOD suitable for training with limited data for specific problems, and open doors to other domains, such as depth, medical, and multi-spectral images.
More detailed information on this research here.
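The dense layer-wise connectivity that enables deep supervision can be sketched abstractly. In the toy version below, "features" are plain numbers and the layer transform is a stand-in mean function, not DSOD's actual convolutional layers; the point is only the wiring, where each layer consumes the concatenation of everything before it.

```python
# Sketch of DenseNet-style dense connectivity, the wiring DSOD relies on:
# each layer receives the concatenation of ALL earlier outputs, so the
# supervision signal reaches every layer through short paths.

def dense_block(initial_features, num_layers, transform):
    """Run `num_layers` layers, each fed the concatenation of all earlier
    outputs, and return the final concatenated feature list."""
    outputs = [list(initial_features)]
    for _ in range(num_layers):
        concatenated = [f for out in outputs for f in out]
        outputs.append(transform(concatenated))
    return [f for out in outputs for f in out]

# Toy transform: each layer emits one new feature, the mean of all inputs.
mean_layer = lambda feats: [sum(feats) / len(feats)]
features = dense_block([2.0, 4.0], num_layers=3, transform=mean_layer)
```

Because early features are re-used by every later layer, parameters are shared aggressively, which is consistent with DSOD's much smaller model sizes.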

Multiclass Object Detection

HyperNet, RON, and DSOD are all CNN-based, multiclass object detection algorithms. Based on these algorithms, ILC has developed a multiclass object detection prototype system that can perform multiclass object detection in real time and deliver accurate results in complex scenes. This prototype can be widely used in applications such as automated driving and video analysis.

Deploying to the Edge: Low-bit Deep Compression

Artificial intelligence will have greater impact once fully functional models can be deployed to edge devices, such as mobile and IoT devices. However, pre-trained, full-precision convolutional neural networks are resource-intensive, making them difficult to deploy to devices with limited computational resources. The need is to greatly reduce CNN complexity—to prune and compress the trained model to improve performance and allow compressed models to run efficiently on edge devices. ILC has developed an impressive and elegant solution, called Low-bit Deep Compression (LDC), which can achieve lossless compression on the order of 100X on deep neural networks (DNNs) using low-precision weights and low-precision activations. It thus paves the way for efficient inference engines in both hardware and software implementations. LDC includes three key modules:

  • Dynamic Network Surgery: Seeks optimal DNN architectures
  • Incremental Network Quantization: Constrains optimal DNN with low-bit weights
  • Multi-level Quantization: Constrains optimal DNN with low-bit activations

Dynamic Network Surgery

ILC has developed a model compression process called Dynamic Network Surgery that performs intelligent network pruning on the fly. This process incorporates a new tool: connection splicing. Parameter importance can change during the pruning process. The loss of some connections due to excessive pruning can actually result in accuracy loss and network damage. With Dynamic Network Surgery, pruned connections can be spliced back to the network, recovering needed parameters and network accuracy. Network pruning and maintenance become a continual process.
More detailed information on this research here.
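The prune-and-splice mechanism can be sketched directly. In this toy version (thresholds and weight values are invented), each weight keeps a full-precision value plus a binary mask: pruning zeroes the mask when the weight's magnitude drops below one threshold, and splicing restores the mask when the weight grows back above a higher threshold. The weight itself is never discarded, so a wrongly pruned connection can recover.

```python
# Minimal sketch of Dynamic Network Surgery's prune/splice cycle.
T_PRUNE, T_SPLICE = 0.1, 0.25  # hypothetical thresholds, T_PRUNE < T_SPLICE

def surgery_step(weights, masks):
    """Update masks in place: prune weak connections, splice strong ones.
    Returns the effective (masked) weights."""
    for i, w in enumerate(weights):
        if masks[i] == 1 and abs(w) < T_PRUNE:
            masks[i] = 0          # prune: connection removed from the net
        elif masks[i] == 0 and abs(w) > T_SPLICE:
            masks[i] = 1          # splice: connection restored
    return [w * m for w, m in zip(weights, masks)]

weights = [0.30, 0.05, -0.02, 0.40]
masks = [1, 1, 1, 1]
effective = surgery_step(weights, masks)   # the two weak weights are pruned
weights[1] = 0.35                          # training re-grows a pruned weight...
effective = surgery_step(weights, masks)   # ...and it is spliced back in
```

In the real method the weight updates come from gradient descent on the full (masked) network, so pruning and maintenance become one continual process rather than a single destructive pass.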

Incremental Network Quantization

Incremental Network Quantization (INQ) is an innovative technique created by ILC that converts a pre-trained, full-precision CNN into a low-precision version, the weights of which are constrained to be either powers of two, or zero. INQ employs three novel operations: parameter partitioning, quantization, and re-training. This procedure is incremental, permitting consecutive model partitioning, quantization, and training cycles to optimize for greatest model compression along with sufficient model accuracy.
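The weight constraint at the heart of INQ can be sketched as follows. The exponent range and the fixed partition fraction are simplifying assumptions for illustration, and the re-training cycle that lets the remaining full-precision weights absorb the quantization error is reduced here to a comment.

```python
# Toy sketch of INQ-style quantization: each quantized weight becomes
# either zero or +/- a power of two, and weights are quantized
# incrementally, largest-magnitude first.

POW_MIN, POW_MAX = -4, 0  # assumed allowed exponents: 2^-4 ... 2^0

def quantize_pow2(w):
    """Snap w to the closest value in {0} plus {+/- 2^k, POW_MIN<=k<=POW_MAX}."""
    candidates = [0.0] + [s * 2.0 ** k
                          for k in range(POW_MIN, POW_MAX + 1) for s in (1, -1)]
    return min(candidates, key=lambda c: abs(w - c))

def incremental_quantize(weights, fraction):
    """Quantize the largest-magnitude `fraction` of weights; the rest stay
    full-precision (and would be re-trained to absorb the error)."""
    n_quant = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    out = list(weights)
    for i in order[:n_quant]:
        out[i] = quantize_pow2(weights[i])
    return out

weights = [0.9, -0.26, 0.03, 0.55]
step1 = incremental_quantize(weights, 0.5)  # first partition quantized
```

Powers-of-two weights are attractive on hardware because multiplication reduces to bit shifts, which is part of why this constraint pairs well with efficient inference engines.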

The results delivered by INQ are impressive. INQ is able to deliver a lossless, low-precision CNN model from any full-precision reference. ILC conducted extensive experiments on the ImageNet large-scale classification task using almost all known deep CNN architectures and were able to show that:

  • For AlexNet, VGG-16, GoogleNet, and ResNets with 5-bit quantization, INQ achieved improved accuracy in comparison to their full-precision baselines. The absolute top-1 accuracy gain ranges from 0.13% to 2.28%, and the absolute top-5 accuracy gain ranges from 0.23% to 1.65%.
  • INQ supports easier convergence in training. In general, re-training with fewer than 8 epochs consistently generated a lossless model with 5-bit weights.
  • Using ResNet-18 as an example, ILC’s quantized models with 4-bit, 3-bit, and 2-bit ternary weights also improved or gave very similar accuracy compared with the 32-bit floating-point baseline.
  • Using AlexNet as an example, the combination of network pruning and INQ outperforms deep compression methods (Han et al., 2016) with significant margins.

More detailed information on this research here.

Multi-level Quantization

ILC has developed innovative methods that quantize networks at both the width and depth levels, employing two novel techniques: single-level network quantization (SLQ) for high-bit quantization and multi-level network quantization (MLQ) for extremely low-bit (ternary) quantization. ILC was the first to consider network quantization at both the width and depth levels. At the width level, parameters are divided into two parts—one for quantization and the other for re-training—to eliminate the quantization loss; SLQ leverages the distribution of the parameters to improve this width-level step. At the depth level, ILC introduces incremental layer compensation to quantize layers iteratively, decreasing the quantization loss at each iteration. Together, SLQ and MLQ achieve impressive results, validated with extensive experiments on state-of-the-art neural networks including AlexNet, VGG-16, GoogleNet, and ResNet-18.
Details regarding these results here.
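The depth-level idea (incremental layer compensation) can be sketched roughly. The ternary quantizer and the "nudge" compensation step below are toy placeholders standing in for real quantization and re-training; only the iterate-layer-by-layer structure reflects the described method.

```python
# Rough sketch of incremental layer compensation: quantize one layer at a
# time and let the remaining full-precision layers compensate for the
# error introduced so far.

def quantize_layerwise(layer_weights, quantizer, compensate):
    """Iteratively freeze each layer in quantized form, then adjust the
    still-unquantized layers to absorb the quantization loss."""
    layers = [list(layer) for layer in layer_weights]
    for idx in range(len(layers)):
        layers[idx] = [quantizer(w) for w in layers[idx]]   # freeze this layer
        for later in range(idx + 1, len(layers)):           # compensate others
            layers[later] = compensate(layers[later])
    return layers

# Toy ternary quantizer {-0.5, 0, 0.5} and a stand-in "re-training" step.
ternary = lambda w: 0.0 if abs(w) < 0.25 else (0.5 if w > 0 else -0.5)
nudge = lambda layer: [w * 1.01 for w in layer]
result = quantize_layerwise([[0.3, 0.1], [0.6, -0.4]], ternary, nudge)
```

Quantizing all layers at once would let errors compound unchecked; the iterative schedule gives each quantization step a chance to be corrected before the next layer is frozen.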

Low-bit Deep Compression Results

ILC found that performing low-bit deep compression with the three methods described above—Dynamic Network Surgery, Incremental Network Quantization, and Multi-level Quantization—yielded truly impressive results: the resulting models retained accuracy while achieving compression ratios greater than 100X with 2-bit quantization, compared to their pre-trained, full-precision, uncompressed counterparts. These combined levels of accuracy and efficiency were unmatched at the time of testing.

The following table compares the accuracy of the LDC-derived models against other state-of-the-art models and shows compression rates relative to the original inference model size, using AlexNet on the ImageNet dataset as an example. LDC outperformed the state-of-the-art deep compression solution* by a clear margin on AlexNet, achieving >100X compression with 2-bit weights. For example, in the last row, the LDC-compressed model achieved a compression ratio of 142X while suffering an accuracy loss of only 0.96 percentage points in the Top-5 recognition rate.

Method   Bit Width (Conv/FC)   Bit Width (Act)   Compression Ratio   Decrease in Top-1 / Top-5 Error Rate (%)
P+Q*     8/5                   32                27X                 0.00 / 0.03
P+Q+H*   8/5                   32                35X                 0.00 / 0.03
LDC      4/4                   4                 71X                 0.08 / 0.03
P+Q+H*   4/2                   32                -                   -1.99 / -2.60
LDC      3/3                   4                 89X                 -0.52 / 0.20
LDC      2/2                   4                 142X                -1.47 / -0.96

Comparison of ILC’s Low-bit Deep Compression and the Deep Compression method (P+Q+H, ICLR’16 and ISCA’16) on AlexNet.
Conv: convolutional layer, FC: fully connected layer, Act: activation, P: pruning, Q: quantization, H: Huffman coding.
* S. Han, J. Pool, J. Tran, and W. Dally. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.” Best Paper, ICLR 2016.
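A back-of-the-envelope calculation shows where ratios above 100X can come from. The sparsity and bit-width numbers below are assumed for illustration (and real ratios also reflect pruning-index and codebook overheads); the table above reports ILC's measured results.

```python
# Rough model-size arithmetic for pruning + low-bit quantization.

def compression_ratio(sparsity, bits, baseline_bits=32):
    """Ratio of a dense full-precision model's weight storage to that of a
    pruned, quantized one, ignoring index/codebook overhead."""
    kept_fraction = 1.0 - sparsity
    return baseline_bits / (kept_fraction * bits)

# e.g., pruning ~89% of the weights and storing the rest with 2-bit codes
# already lands in the ~145X range before any overhead is counted.
ratio = compression_ratio(sparsity=0.89, bits=2)
```

This is why the pruning module (Dynamic Network Surgery) and the quantization modules multiply rather than merely add: each attacks a different factor in the storage product.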

Network Slimming

A final approach to reducing complexity in CNNs, called “network slimming”, has been jointly developed by researchers from Intel Labs China, Tsinghua University, Fudan University, and Cornell University. Network slimming takes wide, large networks as input models and, during training, identifies insignificant channels of the network. These are pruned, yielding thin, compact models with accuracies comparable to the input networks. This technique reduces the complexity of deep neural networks at the channel level and can reduce model size by up to 20X and the number of floating-point operations by up to 5X, all without accuracy loss, for network structures such as VGGNet, ResNet, and DenseNet. With limited accuracy loss, network slimming can further reduce the number of floating-point operations by more than 10X. Unlike low-bit deep compression, the inference speedup from network slimming does not require any special hardware accelerators; conventional floating-point hardware suffices.
More information here.
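The channel selection step can be sketched with toy numbers. In the published technique, each channel's batch-normalization scaling factor is trained with a sparsity penalty and channels with small factors are pruned; the gamma values and global pruning ratio below are invented for illustration.

```python
# Toy sketch of network slimming's channel selection: rank channels by the
# magnitude of their batch-norm scaling factor and drop the weakest ones.

def slim_channels(gammas, prune_ratio):
    """Return the indices of channels to KEEP, dropping the `prune_ratio`
    fraction with the smallest |gamma| (the least significant channels)."""
    n_prune = int(len(gammas) * prune_ratio)
    ranked = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
    pruned = set(ranked[:n_prune])
    return [i for i in range(len(gammas)) if i not in pruned]

# Invented BN scaling factors for an 8-channel layer; prune half of them.
gammas = [0.91, 0.02, 0.45, 0.01, 0.77, 0.05, 0.33, 0.08]
kept = slim_channels(gammas, prune_ratio=0.5)
```

Because whole channels disappear, the slimmed model is simply a smaller dense network, which is why no special sparse-compute hardware is needed to realize the speedup.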

Visual Parsing & Multimodal Analysis

ILC also conducts advanced multimodal fusion and learning research to bridge the gap between visual recognition and visual understanding. The idea is to connect the dots from vision, speech, language, knowledge, and machine learning to make machines able to see and infer in ways indistinguishable from human beings. This effort includes research on video to text (VTT), visual question answering, visual relation detection, and so on. These capabilities will enable many important applications, such as natural interaction, intelligent visual cloud, personal visual assistants, and advanced visual control and decision making.

Dense Video Captioning

For video to text, ILC has focused on a novel and challenging vision task: dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences. ILC invented a weakly supervised dense video captioning approach that requires only video-level sentence annotations during training. First, the team proposed lexical fully convolutional neural networks (Lexical-FCN) with weakly supervised multi-instance multi-label learning to weakly link video regions with lexical labels. Second, they introduced a novel submodular maximization scheme to generate multiple informative and diverse region-sequences based on the Lexical-FCN outputs. Third, they trained a sequence-to-sequence learning-based language model using the weakly supervised information obtained through the association process. The proposed method not only produces informative, diverse, and dense captions, but also outperforms many state-of-the-art single video captioning methods by a large margin.
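The submodular maximization step can be sketched with a coverage objective. The label sets below are toy stand-ins for Lexical-FCN outputs, and real region-sequence scoring is richer than plain set coverage; the sketch only shows why greedy selection naturally yields sequences that are both informative and diverse.

```python
# Toy sketch of greedy submodular selection of region-sequences: pick the
# candidates that maximize marginal coverage of lexical labels. Coverage is
# monotone submodular, so greedy selection carries the classic (1 - 1/e)
# approximation guarantee.

def greedy_submodular_select(candidates, k):
    """Greedily select up to k candidates by marginal coverage gain."""
    selected, covered = [], set()
    for _ in range(k):
        best = max(candidates, key=lambda c: len(c["labels"] - covered))
        if not (best["labels"] - covered):
            break  # no remaining candidate adds new information
        selected.append(best["id"])
        covered |= best["labels"]
    return selected, covered

# Invented region-sequences with the lexical labels they evoke.
candidates = [
    {"id": "seq-A", "labels": {"man", "guitar", "stage"}},
    {"id": "seq-B", "labels": {"man", "guitar"}},      # redundant with seq-A
    {"id": "seq-C", "labels": {"crowd", "clapping"}},  # diverse addition
]
selected, covered = greedy_submodular_select(candidates, k=2)
```

The redundant sequence contributes zero marginal gain once its labels are covered, so the selection automatically favors diversity without an explicit diversity penalty.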


The ever-growing explosion in the volume of online visual data demands the ability to analyze, understand, and respond to that data, even in real time. Researchers at Intel Labs China, with partners in academia, are making significant, often breakthrough progress in AI research. ILC is developing deep convolutional neural networks that can accurately and rapidly detect and recognize objects, understand complex scenes, recognize actions and activities, and recognize faces and their emotions. ILC is also developing a leading 3D face technology that can perform 3D face modeling and rendering in real time, with enhancements such as emotion cues, and is conducting research on automatic video-to-text transcription and captioning, among other capabilities.

These advances are clearly impressive, but such complex networks require substantial compute and memory resources. Deployment for inference requires that they be dramatically compressed and pruned to run on low-power platforms, at the edge, and in the IoT space, all while maintaining accuracy. Here, too, ILC is making dramatic progress, developing technologies that achieve low-bit deep compression of up to 100X over trained, uncompressed networks. These ongoing advances in visual data analysis and understanding, coupled with low-bit deep model compression, are leading the way to deployment on a range of Intel AI devices, including low-powered edge devices. The opportunities are substantial. Watch this space.

Stay Connected

Keep tabs on all the latest news with our monthly newsletter.