Emotion Detection using Intel® Distribution for Python* on Intel® Xeon® Scalable Processors

Published: 10/22/2019  

Last Updated: 10/22/2019

Understanding the emotions of customers is important in areas like smart retail. It helps brands measure consumers' cognitive and emotional responses to content or product experiences and optimize those experiences to resonate emotionally with the consumer. This emotional feedback can also help retailers improve future content or products, positively impacting their ROI.

Capturing emotional responses to content or product experiences, along with customer ratings of that content, requires near-real-time emotion analytics across multiple in-store video streams. High-throughput face and emotion detection are fundamental to an emotion AI pipeline.

Face detection, the first step in an emotion detection pipeline, can take one of two approaches. The first uses a traditional histogram of oriented gradients (HOG) with a support vector machine (SVM). The second uses convolutional neural networks (CNNs). The HOG + SVM approach takes approximately 8.5x more detection time than the CNN approach, which slows down the inference pipeline. Further, the HOG + SVM algorithm is not well optimized for Intel® architecture.

The goal of this research was to improve the performance of face detection, which in turn speeds up emotion detection. We conducted a set of experiments to identify ideal settings for an inference workload on a CPU using the DLib* framework, parallelizing the workload to run faster. The experiment compared the inference performance of face detection using a baseline script against an improved script that exploits the strengths of Intel architecture.

A well-optimized solution achieves high physical core utilization (> 80 percent) on multisocket and multicore systems. In addition to core utilization, vectorization must be high (> 70 percent) to take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512). Memory stalls should be limited to less than 10 percent so that non-uniform memory access (NUMA) effects do not degrade overall performance on multisocket systems.
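On a two-socket system like the one used here, one way to keep memory accesses local and limit NUMA-related stalls is to pin a process to a single node with numactl. This is a sketch, and inference.py is a placeholder script name:

```shell
# Inspect the node layout first to confirm node IDs and memory sizes
numactl --hardware

# Pin the inference process to NUMA node 0 (both CPU and memory),
# avoiding remote-memory accesses across the socket interconnect
numactl --cpunodebind=0 --membind=0 python inference.py
```

When the pipeline is parallelized across both sockets, one process can be bound to each node so every process works against local memory.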

The inference in this experiment was performed on private, proprietary real-time datasets to measure the accuracy of predictions under realistic conditions.

Hardware Configuration

Table 1 lists the hardware configurations used in the experiment.

Table 1. Hardware configurations

Architecture Intel® Xeon® Scalable processor
CPU op-modes 32-bit, 64-bit
Byte order Little Endian
Number of CPUs 80
Online CPUs list 0-79
Threads per core 2
Cores per socket 20
Sockets 2
NUMA nodes 2
Vendor ID GenuineIntel
CPU family 6
Model 85
Model name Intel® Xeon® Gold 6248 processor @ 2.50 GHz
Stepping 6
CPU MHz 1000
BogoMIPS 5000
L1d cache 32K
L1i cache 32K
L2 cache 1024K
L3 cache 28160K

Software Configuration

Table 2 lists the software configurations used in the experiment. 

Table 2. Software configurations and versions

Python*  3.6.8
DLib 19.17.99
Intel® Math Kernel Library (Intel® MKL) 2019.4
OpenCV 3.4.3
NumPy 1.16.4

Solution Approach

We performed these steps during the experiment:

  1. Ran Intel® VTune™ Amplifier analysis, or created an Application Performance Snapshot, on the inference workload to identify physical core utilization, memory stalls, and vectorization.
  2. If all three parameters from Step 1 were above the defined targets, there was little scope for improvement on the given hardware.
  3. In most cases, we found that all of these parameters deviated from the targets, making a case for fine-tuning the workload.
  4. Physical core utilization could be improved by introducing parallelism; that is, inferring multiple video frames at the same time. The multiprocessing or mpi4py libraries help parallelize the pipeline.
  5. Prefetching the data to memory and replicating it to the parallel processes avoids repeated reads from disk and makes full use of the memory bandwidth available on Intel® hardware.
  6. The framework or library of choice for inference should be able to take advantage of Intel AVX for better throughput.
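Steps 4 and 5 above can be sketched with the standard-library multiprocessing module. Here, detect_faces is a placeholder for the real per-frame inference call, not the actual workload code:

```python
import multiprocessing as mp

def detect_faces(frame):
    # Placeholder for the real per-frame inference call
    # (e.g., a DLib face detector); here it just tags the frame.
    return ("faces", frame)

def run_parallel(frames, workers=4):
    # Prefetch the frames into memory once, then fan them out to a
    # pool of worker processes so physical cores on both sockets
    # stay busy instead of one process handling frames serially.
    with mp.Pool(processes=workers) as pool:
        return pool.map(detect_faces, frames)

if __name__ == "__main__":
    results = run_parallel(list(range(8)), workers=4)
    print(results[:2])
```

The mpi4py library follows the same fan-out pattern, with frames distributed across MPI ranks instead of pool workers.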

Environment Setup

We set up the environment as follows:

  1. Install the Intel® Distribution for Python*.
  2. Activate the conda* environment (source activate).
  3. git clone https://github.com/davisking/dlib.git
  4. cd dlib
  5. python setup.py install

Note: Currently, DLib can only take advantage of Intel AVX when built from source. Support for Intel AVX-512 would make inference using DLib even faster.
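The setup steps above can be consolidated into a shell sketch; the environment name idp is a placeholder for whatever name the Intel Distribution for Python environment was created under:

```shell
# Activate the Intel Distribution for Python conda environment
source activate idp

# Build DLib from source so the compiler can enable AVX support
git clone https://github.com/davisking/dlib.git
cd dlib
python setup.py install

# Sanity-check that the source-built module imports correctly
python -c "import dlib; print(dlib.__version__)"
```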

Optimization Approach

We ran the optimization as follows:

  1. Baseline the inference time using the shared code provided by the ISV in a conda environment.
  2. Build DLib from source to take advantage of Intel AVX, and then document the inference time in the source-build environment compared to the default conda install.
  3. Tune settings such as OMP_NUM_THREADS, KMP_BLOCKTIME, and KMP_AFFINITY for better throughput.
  4. Parallelize the pipeline using the multiprocessing and mpi4py libraries, and then document the inference time after parallelization. (In our experiment, seven approaches were tried and the best one was chosen for the comparison.)
  5. Execute the workload on the Intel® Xeon® Gold 6248 processor to understand the performance with and without optimizations.
  6. Document all the changes to arrive at the best configurations and best approaches.
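The thread-tuning step can be sketched as environment settings. The value 20 matches the physical cores per socket on this system; the exact values are workload-dependent assumptions, not the settings used in the experiment:

```shell
# One OpenMP thread per physical core on the socket running the process
export OMP_NUM_THREADS=20

# Let idle threads sleep almost immediately after parallel work
# completes, freeing cores for the other processes in the pipeline
export KMP_BLOCKTIME=1

# Pin threads to cores so data stays local in each core's caches
export KMP_AFFINITY=granularity=fine,compact,1,0
```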


Results

Figures 1 and 2 show the experiment results.

Figure 1. Intel VTune Amplifier results before optimization

Figure 2. Intel VTune Amplifier results after optimization

Figure 3. Inference time (Intel Xeon Gold 6248 processor at 2.50 GHz)

Figure 3 shows that switching from a single-process to a multiprocessing approach can yield a performance gain of up to 5.1x.


Conclusion

Emotion detection at the edge and in the data center can be performed efficiently on Intel architecture when the optimizations and best practices are well understood. This experiment revealed the following:

  • Throughput can be increased by parallelizing the inference workload to take advantage of multisocket and multicore capabilities.
  • Parallelizing the code using the multiprocessing and mpi4py libraries improved performance by around 5.1x in this specific case.
  • Good vectorization and memory management help to further improve throughput.
  • Tuning additional parameters in Intel architecture allows greater throughput on the same hardware.
  • Newer CNN-based approaches normally perform better than traditional face detection algorithms (HOG + SVM) because the CNN-based approaches are better optimized for Intel architecture.




Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.