Optimizing Face Recognition Inference on Intel® Xeon® Scalable Processors

Published: 06/24/2019  

Last Updated: 06/24/2019

Overview

New technologies, particularly artificial intelligence (AI), are bringing new productivity to modern society, and deep learning plays a central role. Above all, it has brought powerful improvements to face recognition, which is widely used in security, traffic, finance, retail, criminal identification, and many other scenarios.

In this case study we ran a face recognition inference experiment on Intel® Xeon® Scalable processors and observed how they bring AI power to face recognition applications.

A typical end-to-end deep learning pipeline for face recognition is:

  1. Prepare a dataset of face images (from a public open dataset or on-premise sources)
  2. Train a face detection model using the ResNet50 deep learning topology
  3. Run inference with the trained detection model
  4. Compare the features of the inferred face against a local database that already stores millions of identified faces, then score and sort the candidates by similarity using face matching algorithms
  5. Pick the face with the highest match score and report its similarity
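Steps 4 and 5, the matching stage, typically score faces by the similarity of their feature vectors. The matching algorithm is not detailed in this study, but one common choice is cosine similarity. The sketch below is illustrative only; the identity names and toy 3-dimensional "features" are invented for the example:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two face feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_match(query, database):
    """Return the (identity, score) pair with the highest similarity
    to `query`. `database` maps identity -> stored feature vector."""
    scores = {name: cosine_similarity(query, feat)
              for name, feat in database.items()}
    return max(scores.items(), key=lambda kv: kv[1])

# Toy example with 3-dimensional "features":
db = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(best_match([0.9, 0.1, 0.0], db))  # "alice" scores highest
```

In production the vectors would be the embeddings produced by the trained network, and the database scan would be replaced by an indexed nearest-neighbor search.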

In this paper we address only step 3 (inference with the trained detection model), one of the vital steps of this pipeline, and show how to optimize face recognition inference on Intel® Xeon® Scalable processors.

Solution Architecture and Design

For the framework, we used Intel® Optimization for Caffe* for this experiment. Its advantages are:

  1. Intel® Optimization for Caffe* takes good advantage of libraries such as Intel® Math Kernel Library (Intel® MKL) and Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to accelerate matrix multiply and add computations.
  2. It also performs topology optimization: similar layers are fused together and computed only once, so subsequent layers can query and reuse the computed result very quickly.
  3. Most importantly, the training of the face feature detection model was also done with Intel® Optimization for Caffe*, keeping the models and the end-to-end pipeline consistent.
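A well-known instance of the layer fusion described above is folding a batch normalization layer into the weights of the preceding convolution or linear layer, so that inference computes one fused layer instead of two. The sketch below uses simplified elementwise weights and invented values to show the idea; it is not the actual Intel® MKL-DNN implementation:

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding layer's
    weights and bias, producing one fused layer for inference."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# The fused layer produces the same output as linear + batch norm:
rng = np.random.default_rng(1)
x = rng.standard_normal(4)
w, b = rng.standard_normal(4), rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var = rng.standard_normal(4), rng.random(4) + 0.5

y_two_layers = gamma * ((w * x + b) - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fold_batchnorm(w, b, gamma, beta, mean, var)
y_fused = wf * x + bf
print(np.allclose(y_two_layers, y_fused))  # True
```

Because the fused layer is mathematically identical, the second pass over the data (and its memory traffic) disappears at inference time.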

Our workflow is shown below:

Architecture workflow

Topologies

We used ResNet50 as the main inference topology for the following reasons:

  • Consider the use case where a driver is sitting in a car and a camera monitors the driver through the front window, a typical face recognition scenario. Compared to other topologies such as SSD or YOLO, ResNet50 delivers better accuracy and more stable performance, which suits this scenario.
  • ResNet50 is well optimized by the topology fusion discussed in Solution Architecture and Design.
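The accuracy and training stability of ResNet50 come from its residual (skip) connections, which add a block's input back to its output. A simplified sketch of a residual unit in plain NumPy follows; the shapes are hypothetical, and the real ResNet50 block uses 1x1/3x3/1x1 convolutions with batch normalization rather than these dense layers:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Simplified ResNet residual unit: y = ReLU(F(x) + x),
    where F is two linear transforms with a ReLU in between."""
    f = relu(x @ w1) @ w2
    return relu(f + x)  # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)  # (1, 8)
```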

Hardware Configuration

|                                          | Config1                                        | Config2                                          |
|------------------------------------------|------------------------------------------------|--------------------------------------------------|
| Platform                                 | x86_64                                         | x86_64                                           |
| # Nodes                                  | 1                                              | 1                                                |
| # Sockets                                | 2                                              | 2                                                |
| CPU                                      | Intel® Xeon® E5-2699 v4 (55M Cache, 2.20 GHz)  | Intel® Xeon® Platinum 8180 (38.5M Cache, 2.50 GHz) |
| Cores/socket, Threads/socket             | 22, 44                                         | 28, 56                                           |
| ucode                                    | NA                                             | NA                                               |
| HT                                       | OFF                                            | OFF                                              |
| Turbo                                    | ON                                             | ON                                               |
| BIOS version (dmidecode -s bios-version) | SE5C620.86B.00.01.0015.110720180833            | SE5C620.86B.00.01.0015.110720180833              |
| System DDR Mem Config                    | 8 slot / 16 GB / 2400 MHz                      | 12 slot / 16 GB / 2666 MHz                       |
| Total Memory/Node (DDR+DCPMM)            | 128 GB                                         | 192 GB                                           |
| Storage - boot                           | 480 GB                                         | 300 GB                                           |
| Storage - application drives             | 480 GB (shared with boot)                      | 300 GB (shared with boot)                        |
| NIC                                      | NA                                             | NA                                               |
| PCH                                      | NA                                             | NA                                               |
| OS                                       | CentOS 7.2                                     | CentOS 7.2                                       |
| Kernel                                   | NA                                             | NA                                               |
| Mitigation variants                      | NA                                             | NA                                               |
| Compiler                                 | GCC 4.8.5                                      | GCC 4.8.5                                        |
| Libraries                                | OpenCV 3.4                                     | OpenCV 3.4                                       |
| Framework version                        | Intel® Optimization for Caffe* 1.1.0           | Intel® Optimization for Caffe* 1.1.0             |
| MKL-DNN version                          | Intel® MKL, Intel® MKL-DNN 2018 update 2       | Intel® MKL, Intel® MKL-DNN 2018 update 2         |
| Dataset                                  | On-premise dataset                             | On-premise dataset                               |
| Topology                                 | ResNet50                                       | ResNet50                                         |
| Batch size                               | 1, 2, 4, 8, 16, 32, 50, 64, 128                | 1, 2, 4, 8, 16, 32, 50, 64, 128                  |

Software Used

  • Tools
    • Intel® Optimization for Caffe* -v1.1.0
    • BVLC Caffe* -v1.0
  • Language
    • Python* -v2.7
  • Topology
    • ResNet50 -Integrated in Intel® Optimization for Caffe*

Installing Required Software

Follow these links to install the required software:

Installing Intel® Optimization for Caffe*

Installing BVLC Caffe*

What We Evaluated

We compared the performance of Intel® Optimization for Caffe* and BVLC Caffe* on both hardware candidates. On the software side, we used BVLC Caffe* (public Caffe) as the baseline and Intel® Optimization for Caffe* as the comparison target. (See Figures 1 and 4.)

We also compared the performance of Intel® Optimization for Caffe* across the two hardware candidates. (See Figures 2 and 3.)

As discussed in the architecture section, Intel® MKL-DNN is designed to accelerate deep learning computation, while Intel® MKL underlies general matrix calculation. We therefore designed an experiment to compare performance when the topology runs on the Intel® MKL-DNN engine versus the Intel® MKL engine. (See Figure 5.)

Test Command

caffe time --forward_only --phase TEST --iterations 100 --model <model.caffemodel> --engine <mkl/mkldnn>

The command above benchmarks inference: --forward_only times only the forward pass, --iterations 100 runs 100 timed passes, model.caffemodel stands for the trained Caffe model, and --engine selects either Intel® MKL (mkl) or Intel® MKL-DNN (mkldnn).
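The caffe time tool reports average timings in milliseconds (lines of the form "Average Forward pass: X ms."), so throughput in images per second follows from the batch size. A small helper for that conversion might look like this; the sample log fragment and numbers below are illustrative, not measured results:

```python
import re

def throughput_from_log(log_text, batch_size):
    """Parse the 'Average Forward pass' time reported by `caffe time`
    and convert it to throughput in images per second."""
    match = re.search(r"Average Forward pass:\s*([\d.]+)\s*ms", log_text)
    if match is None:
        raise ValueError("no forward-pass timing found in log")
    ms_per_batch = float(match.group(1))
    return batch_size * 1000.0 / ms_per_batch

# Illustrative log fragment (not a measured result):
sample_log = "I0918 caffe.cpp:417] Average Forward pass: 125.0 ms."
print(throughput_from_log(sample_log, batch_size=16))  # 16 * 1000 / 125 = 128.0 images/s
```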

Experiment Results

The graph below shows that Intel® Optimization for Caffe* with the Intel® MKL-DNN engine achieved a 5.69X performance gain over BVLC Caffe* on the Intel® Xeon® E5-2699 v4, and 6.37X on the Intel® Xeon® Platinum 8180. Comparing Intel® Optimization for Caffe* on the two processors, the Intel® Xeon® Platinum 8180 delivers a 1.67X speedup over the Intel® Xeon® E5-2699 v4.

inference throughput comparison 1
Figure 1. BVLC Caffe* vs Intel® Optimization for Caffe* inference throughput compared on Intel® Xeon® E5-2699 v4 and Intel® Xeon® Platinum 8180

The next graph shows Intel® Optimization for Caffe* inference throughput on the Intel® Xeon® E5-2699 v4 and Intel® Xeon® Platinum 8180 with all cores in use. We can conclude that the Intel® Xeon® Platinum 8180 is 2.13X faster than the Intel® Xeon® E5-2699 v4.

inference throughput comparison 2
Figure 2. Intel® Optimization for Caffe* inference throughput compared on Intel® Xeon® E5-2699 v4 and Intel® Xeon® Platinum 8180 (Full cores)

Figure 3 differs from Figure 2 in that throughput is normalized from all cores to a single core. On this per-core basis, the Intel® Xeon® Platinum 8180 shows a 1.63X improvement.

inference throughput comparison 3
Figure 3. Intel® Optimization for Caffe* inference throughput compared on Intel® Xeon® E5-2699 v4 and Intel® Xeon® Platinum 8180 (1 core)

Figure 4 differs from Figure 1 in that we compare inference throughput across groups of batch sizes.

inference throughput comparison 4
Figure 4. BVLC Caffe* vs Intel® Optimization for Caffe* inference throughput compared on Intel® Xeon® E5-2699 v4 and Intel® Xeon® Platinum 8180 (Batch size groups)

Figure 5 compares Intel® Optimization for Caffe* inference performance with the Intel® MKL and Intel® MKL-DNN engines on the Intel® Xeon® E5-2699 v4 and Intel® Xeon® Platinum 8180. The graph shows that inference with the Intel® MKL-DNN engine is faster than with the Intel® MKL engine on the Intel® Xeon® Platinum 8180, and faster still relative to either engine on the Intel® Xeon® E5-2699 v4.

inference throughput comparison 5
Figure 5. Intel® Optimization for Caffe* inference throughput compared with Intel® MKL and Intel® MKL-DNN engine on Intel® Xeon® E5-2699 v4 and Intel® Xeon® Platinum 8180 (Batch size groups)

Conclusion

With all cores in use, face recognition inference on the Intel® Xeon® Platinum 8180 is 2.13X faster than on the Intel® Xeon® E5-2699 v4, with a 1.67X per-core performance gain as well. Intel® Optimization for Caffe* with Intel® MKL-DNN achieved a 6.37X speedup over BVLC Caffe*. Switching the engine from Intel® MKL to Intel® MKL-DNN brought around a 10% performance improvement.

References

Installing Intel® Optimization for Caffe*

Installing BVLC Caffe*

Intel® MKL

Intel® MKL-DNN

 

DISCLAIMERS:

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Configurations: Intel® Xeon® E5-2699 v4 and Intel® Xeon® Platinum 8180, Intel® Optimization for Caffe* 1.1.0. Test by ISV on 18/09/2018.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.