# Benefits of Intel® Optimized Caffe* in comparison with BVLC Caffe*

Published: 04/09/2017

Last Updated: 04/10/2017

### Overview

This article introduces Berkeley Vision and Learning Center (BVLC) Caffe* and a custom version of Caffe*, Intel® Optimized Caffe*. We explain why and how Intel® Optimized Caffe* performs efficiently on Intel® architecture, using Intel® VTune™ Amplifier and the time profiling option of Caffe* itself.

### Introduction to BVLC Caffe* and Intel® Optimized Caffe*

Caffe* is a well-known and widely used deep learning framework for machine vision, developed by the Berkeley Vision and Learning Center (BVLC). It is open source and under active development. Before you build Caffe*, you can control a variety of options through 'Makefile.config', such as the BLAS library, CPU- or GPU-focused computation, CUDA, OpenCV*, MATLAB and Python*. The options are easy to change in the configuration file, and BVLC provides intuitive instructions for developers on its project web page.

Intel® Optimized Caffe* is an Intel-distributed, customized version of Caffe* for Intel architecture. It offers all the goodness of main Caffe*, with the addition of Intel architecture-optimized functionality and multi-node distributed training and scoring. Intel® Optimized Caffe* makes it possible to utilize CPU resources more efficiently.

To see in detail how Intel® Optimized Caffe* was changed to optimize it for Intel architecture, please refer to Caffe* Optimized for Intel® Architecture: Applying Modern Code Techniques.

In this article, we first profile the performance of BVLC Caffe* with the Cifar 10 example, and then profile Intel® Optimized Caffe* with the same example. Profiling is conducted with two different methods.

Tested platform : Xeon Phi™ 7210 ( 1.3Ghz, 64 Cores ) with 96GB RAM, CentOS 7.2

1. Caffe* provides its own timing option, for example :

./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000

2. Intel® VTune™ Amplifier : Intel® VTune™ Amplifier is a powerful profiling tool that provides advanced CPU profiling features with a modern analysis interface. https://software.intel.com/en-us/intel-vtune-amplifier-xe
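
The per-layer timings printed by method 1 can be collected for later comparison. The following is a minimal sketch, not part of Caffe*: it assumes that each per-layer log line ends with the layer name, a tab, `forward:` or `backward:`, and a value in milliseconds. The `sample_log` text and the helper name `parse_layer_times` are hypothetical; check the pattern against the output of your own build before relying on it.

```python
import re

# Assumed line shape: "...] <layer>\t<forward|backward>: <value> ms."
LAYER_RE = re.compile(r"\]\s+(\S+)\s+(forward|backward):\s+([0-9.eE+-]+)\s+ms")


def parse_layer_times(log_text):
    """Return {(layer, direction): milliseconds} for every matching line."""
    times = {}
    for line in log_text.splitlines():
        match = LAYER_RE.search(line)
        if match:
            layer, direction, value = match.groups()
            times[(layer, direction)] = float(value)
    return times


# Hypothetical log excerpt; the summary line is intentionally not matched.
sample_log = """\
I0409 10:00:00.000000  1234 caffe.cpp:404]      conv1\tforward: 40.2966 ms.
I0409 10:00:00.000001  1234 caffe.cpp:407]      conv1\tbackward: 54.5911 ms.
I0409 10:00:00.000002  1234 caffe.cpp:412] Average Forward pass: 735.534 ms.
"""

for key, ms in parse_layer_times(sample_log).items():
    print(key, ms)
```

Keeping the result keyed by layer and direction makes it straightforward to line up two runs, one per Caffe* build.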

### How to Install BVLC Caffe*

Please refer to the BVLC Caffe project web page for installation : http://caffe.berkeleyvision.org/installation.html

If you have Intel® MKL installed on your system, it is better to use MKL as the BLAS library.

In your Makefile.config, choose BLAS := mkl and specify the MKL include and library paths. ( The default setting is BLAS := atlas )

In our test, we kept all configuration options at their defaults, except that we enabled the CPU-only option.

### Test example

In this article, we use the 'Cifar 10' example that is included in the Caffe* package by default.

You can run the Cifar 10 training example as follows :

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./examples/cifar10/train_full_sigmoid_bn.sh

First, we use Caffe*'s own benchmark method to obtain performance results :

./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000

As a result, we got the layer-by-layer forward and backward propagation times. The command above measures the time of each forward and backward pass over a batch of images. At the end it shows the average execution time per iteration, over 1,000 iterations, for each layer and for the entire calculation.

This test was run on a Xeon Phi™ 7210 ( 1.3 GHz, 64 Cores ) with 96GB of DDR4 RAM, running CentOS 7.2.

These numbers will be compared later with the results of Intel® Optimized Caffe*.

Before that, let's also take a look at the VTune™ results to observe the behavior of Caffe* in detail.

### VTune Profiling

Intel® VTune™ Amplifier is a modern processor performance profiler that can quickly analyze the top hotspots and help you tune your target application. You can find the details of Intel® VTune™ Amplifier at the following link :

Intel® VTune™ Amplifier : https://software.intel.com/en-us/intel-vtune-amplifier-xe

We used Intel® VTune™ Amplifier in this article to find the function with the highest total CPU utilization time, and to observe how the OpenMP threads behave.

### VTune result analysis

In the VTune™ result, the functions listed on the left side of the screen are the ones taking most of the CPU time. They are called 'hotspots' and are candidate target functions for performance optimization.

In this case, we will focus on the 'caffe::im2col_cpu<float>' function as an optimization candidate.

'im2col_cpu<float>' is one of the steps in performing convolution as a GEMM operation, so that highly optimized BLAS libraries can be used: it rearranges the image patches under the convolution window into the columns of a matrix.
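
To illustrate the idea behind im2col, here is a simplified sketch in plain Python. It is not Caffe*'s implementation: it assumes a single channel, stride 1 and no padding, and the function names `im2col`, `conv_as_gemm` and `conv_direct` are chosen for this example only. It unfolds the image into a column matrix, computes the convolution as one matrix product (the step a BLAS GEMM would perform), and checks the result against a direct sliding-window computation.

```python
def im2col(image, kh, kw):
    """Unfold a 2-D image into a (kh*kw) x (out_h*out_w) column matrix."""
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = []
    for i in range(kh):
        for j in range(kw):
            # One matrix row per kernel offset (i, j), one column per output pixel.
            cols.append([image[y + i][x + j]
                         for y in range(out_h) for x in range(out_w)])
    return cols, out_h, out_w


def conv_as_gemm(image, kernel):
    """Convolution expressed as a (1 x K) by (K x N) matrix product."""
    kh, kw = len(kernel), len(kernel[0])
    cols, out_h, out_w = im2col(image, kh, kw)
    flat_kernel = [v for row in kernel for v in row]
    flat_out = [sum(flat_kernel[k] * cols[k][n] for k in range(kh * kw))
                for n in range(out_h * out_w)]
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


def conv_direct(image, kernel):
    """Reference: slide the kernel over the image directly."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)] for y in range(out_h)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, -1]]

# Both formulations must agree.
assert conv_as_gemm(image, kernel) == conv_direct(image, kernel)
```

The unfolding step only copies data, which is why it tends to be memory-bound and shows up as a hotspot when it runs on a single thread.
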
The 'caffe::im2col_cpu<float>' function took the largest share of CPU time in our test of training the Cifar 10 model with BVLC Caffe*.

Let's take a look at the thread behavior of this function. In VTune™, you can choose a function and filter out other workloads to observe only the workload of the specified function.

In the filtered result, the CPI ( Cycles Per Instruction ) of the function is 0.907, and the function uses only a single thread for the entire calculation.

Intel® VTune™ Amplifier also provides a more intuitive view, the 'CPU Usage Histogram', which shows how many CPUs were running simultaneously. The number of CPUs the training process utilized appears to be about 25. The platform has 64 physical cores with Intel® Hyper-Threading Technology ( four hardware threads per core ), so it has 256 logical CPUs. The CPU usage histogram therefore suggests that the process is not efficiently threaded.

However, we cannot simply conclude that these results are 'bad', because we have not set any performance standard or desired performance to judge them against. We will compare these results with the results of Intel® Optimized Caffe* later.

Let's move on to Intel® Optimized Caffe* now.

### How to Install Intel® Optimized Caffe*

The basic installation procedure for Intel® Optimized Caffe* is the same as for BVLC Caffe*.

To clone Intel® Optimized Caffe* from Git, use this repository instead :

git clone https://github.com/intel/caffe

Additionally, Intel® MKL must be installed to bring out the best performance of Intel® Optimized Caffe*.

Please download and install Intel® MKL. Intel offers MKL for free without technical support, or for a license fee with one-on-one private support. The default BLAS library of Intel® Optimized Caffe* is set to MKL.
Intel® MKL : https://software.intel.com/en-us/intel-mkl

After downloading Intel® Optimized Caffe* and installing MKL, make sure that your Makefile.config selects MKL as the BLAS library and that BLAS_INCLUDE and BLAS_LIB point to the MKL include and lib folders :

BLAS := mkl
BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64

If you encounter a 'libstdc++' related error while compiling Intel® Optimized Caffe*, please install 'libstdc++-static'. For example :

sudo yum install libstdc++-static

### Optimization factors and tunes

Before we run the examples and test their performance, there are some options we need to change or adjust to optimize performance.

• Use 'mkl' as the BLAS library : Specify 'BLAS := mkl' in Makefile.config and configure the locations of your MKL include and lib folders.

• Set CPU utilization limit :

echo "100" | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo "0" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

• Put 'engine:"MKL2017" ' at the top of your train_val.prototxt or solver.prototxt file, or use this option with the caffe tool : -engine "MKL2017"

• The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. It is, however, possible to use your own configuration through OpenMP environment variables such as KMP_AFFINITY, OMP_NUM_THREADS or GOMP_CPU_AFFINITY. For the example run below, 'OMP_NUM_THREADS=64' was used.

• Intel® Optimized Caffe* has modified many parts of the original BVLC Caffe* code to achieve better code parallelization with OpenMP*. Depending on the other processes running in the background, it is often useful to adjust the number of threads used by OpenMP*. For a single node of the Intel Xeon Phi™ product family, we recommend OMP_NUM_THREADS = number_of_cores - 2.
• Please also refer here : Intel Recommendation to Achieve the best performance

If you observe too much overhead because the OS moves threads too frequently, you can try adjusting the OpenMP* affinity environment variable :

KMP_AFFINITY=compact,granularity=fine

### Test example

For Intel® Optimized Caffe* we run the same example to compare the results with the previous results.

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000
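
The OpenMP* settings discussed in the tuning section can also be applied per run instead of system-wide. The sketch below is one possible way to do that from Python and is not part of Intel® Optimized Caffe*. The helper names `openmp_env` and `run_caffe_time` are hypothetical, the core count must be supplied by you (64 on the test platform), and the `reserved_cores=2` default simply encodes the "number of cores minus 2" recommendation for Intel Xeon Phi™.

```python
import os
import subprocess


def openmp_env(physical_cores, reserved_cores=2,
               affinity="compact,granularity=fine"):
    """Return a copy of the current environment with OpenMP settings added."""
    threads = max(1, physical_cores - reserved_cores)
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads)
    env["KMP_AFFINITY"] = affinity
    return env


def run_caffe_time(physical_cores, caffe_root="."):
    """Run the timing command from this article with the settings above."""
    cmd = [
        "./build/tools/caffe", "time",
        "--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt",
        "-iterations", "1000",
    ]
    return subprocess.run(cmd, cwd=caffe_root, env=openmp_env(physical_cores))


# Inspect the settings without launching anything.
env = openmp_env(64)
print(env["OMP_NUM_THREADS"], env["KMP_AFFINITY"])
```

Passing the environment explicitly keeps the settings local to one benchmark run, which makes it easier to try several thread counts back to back.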

### Comparison

The results for the above example are as follows :

Again, the platform used for the test is : Xeon Phi™ 7210 ( 1.3 GHz, 64 Cores ) with 96GB RAM, CentOS 7.2

First, let's look at the results of BVLC Caffe* and Intel® Optimized Caffe* together.

To make them easy to compare, please see the table below. It lists the time each layer took in milliseconds, and the 5th column states how many times faster Intel® Optimized Caffe* is than BVLC Caffe* at each layer. You can observe significant performance improvements, with relatively small gains for the bn layers. Bn stands for "Batch Normalization", which requires fairly simple calculations with little optimization potential. The bn forward layers show better results, while the bn backward layers are up to about 3% slower than the original. Worse performance can occur here as a result of threading overhead. Overall, Intel® Optimized Caffe* was about 28 times faster in this case.

| Layer | Direction | BVLC (ms) | Intel (ms) | Performance Benefit (x) |
|---|---|---|---|---|
| conv1 | Forward | 40.2966 | 1.65063 | 24.413 |
| conv1 | Backward | 54.5911 | 2.24787 | 24.286 |
| pool1 | Forward | 162.288 | 1.97146 | 82.319 |
| pool1 | Backward | 21.7133 | 0.459767 | 47.227 |
| bn1 | Forward | 1.60717 | 0.812487 | 1.978 |
| bn1 | Backward | 1.22236 | 1.24449 | 0.982 |
| Sigmoid1 | Forward | 132.515 | 2.24764 | 58.957 |
| Sigmoid1 | Backward | 17.9085 | 0.262797 | 68.146 |
| conv2 | Forward | 125.811 | 3.8915 | 32.330 |
| conv2 | Backward | 239.459 | 8.45695 | 28.315 |
| bn2 | Forward | 1.58582 | 0.854936 | 1.855 |
| bn2 | Backward | 1.2253 | 1.25895 | 0.973 |
| Sigmoid2 | Forward | 132.443 | 2.2247 | 59.533 |
| Sigmoid2 | Backward | 17.9186 | 0.234701 | 76.347 |
| pool2 | Forward | 17.2868 | 0.38456 | 44.952 |
| pool2 | Backward | 27.0168 | 0.661755 | 40.826 |
| conv3 | Forward | 40.6405 | 1.74722 | 23.260 |
| conv3 | Backward | 79.0186 | 4.95822 | 15.937 |
| bn3 | Forward | 0.918853 | 0.779927 | 1.178 |
| bn3 | Backward | 1.18006 | 1.18185 | 0.998 |
| Sigmoid3 | Forward | 66.2918 | 1.1543 | 57.430 |
| Sigmoid3 | Backward | 8.98023 | 0.121766 | 73.750 |
| pool3 | Forward | 12.5598 | 0.220369 | 56.994 |
| pool3 | Backward | 17.3557 | 0.333837 | 51.989 |
| ip1 | Forward | 0.301847 | 0.186466 | 1.619 |
| ip1 | Backward | 0.301837 | 0.184209 | 1.639 |
| loss | Forward | 0.802242 | 0.641221 | 1.251 |
| loss | Backward | 0.013722 | 0.013825 | 0.993 |
| Ave. | Forward | 735.534 | 21.6799 | 33.927 |
| Ave. | Backward | 488.049 | 21.7214 | 22.469 |
| Ave. | Forward-Backward | 1223.86 | 43.636 | 28.047 |
| Total | | 1223860 | 43636 | 28.047 |
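
The "Performance Benefit" column is the BVLC time divided by the Intel time for the same layer and direction. As a small illustration, the sketch below recomputes it for three rows copied from the table; the dictionary layout and the `speedup` helper are only for this example.

```python
# (layer, direction) -> (BVLC ms, Intel ms), copied from the table above.
timings = {
    ("conv1", "Forward"): (40.2966, 1.65063),
    ("bn1", "Backward"): (1.22236, 1.24449),
    ("Ave.", "Forward-Backward"): (1223.86, 43.636),
}


def speedup(bvlc_ms, intel_ms):
    """How many times faster the Intel build is; below 1.0 means slower."""
    return bvlc_ms / intel_ms


for (layer, direction), (bvlc_ms, intel_ms) in timings.items():
    print(f"{layer:6s} {direction:17s} {speedup(bvlc_ms, intel_ms):.3f}x")
```

A value below 1.0, as for the bn1 backward pass, marks a layer where the optimized build is slightly slower.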

Some of the many reasons this optimization was possible are :

• Code vectorization for SIMD
• Finding hotspot functions and reducing function complexity and the amount of calculations
• CPU / system specific optimizations
• Efficient OpenMP* utilization

Additionally, let's compare the VTune results of this example between BVLC Caffe* and Intel® Optimized Caffe*.

We will simply look at how efficiently the im2col_cpu function runs in each case.

In BVLC Caffe*, the im2col_cpu function had a CPI of 0.907 and was single threaded.

In Intel® Optimized Caffe*, im2col_cpu has a CPI of 2.747 and is multi-threaded by OMP workers.

The CPI increased here for two reasons: vectorization, which raises CPI because each vector instruction has a longer latency, and multi-threading, which can introduce spinning while threads wait for other threads to finish their jobs. However, in this example, the benefits of vectorization and multi-threading outweigh the added latency and overhead, and bring a net performance improvement.

VTune suggests that a CPI close to 2.0 is theoretically ideal, and in our case we achieved about the right CPI for the function. The training workload for the Cifar 10 example handles 32 x 32 pixel images in each iteration, so when that workload is split across many threads, each thread can get a very small task, which may cause transition overhead for multi-threading. With larger images we would see lower spinning time and a smaller CPI.
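
Why a higher CPI can still mean faster execution follows from cycles = instructions x CPI: if vectorization reduces the instruction count by more than the CPI grows, the total cycle count drops. The sketch below works through this with the two CPI values measured above. The instruction count and the 16 values per vector instruction (a 512-bit vector of 32-bit floats) are assumptions made purely for illustration; they were not measured in this article, and the sketch ignores multi-threading.

```python
def cycles(instructions, cpi):
    """Total cycles for a given instruction count and cycles per instruction."""
    return instructions * cpi


def cycle_reduction(scalar_cpi, vector_cpi, lanes, scalar_instructions=1_000_000):
    """Ratio of scalar cycles to vectorized cycles for the same work.

    Assumes, hypothetically, that every vector instruction replaces
    `lanes` scalar instructions.
    """
    vector_instructions = scalar_instructions / lanes
    return (cycles(scalar_instructions, scalar_cpi)
            / cycles(vector_instructions, vector_cpi))


# CPI values from the VTune results; lanes=16 is an illustrative assumption.
ratio = cycle_reduction(scalar_cpi=0.907, vector_cpi=2.747, lanes=16)
print(f"{ratio:.2f}x fewer cycles per thread")
```

Under these assumptions a single thread would already need several times fewer cycles despite the tripled CPI, before any gain from running on many cores is counted.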

The CPU Usage Histogram for the whole process also shows better threading results in this case.

BVLC Caffe* Project : http://caffe.berkeleyvision.org/
BVLC Caffe* Git : https://github.com/BVLC/caffe

Intel® Optimized Caffe* Introduction : https://software.intel.com/en-us/videos/what-is-intel-optimized-caffe
Intel® Optimized Caffe* Git : https://github.com/intel/caffe
Intel® Optimized Caffe* Recommendations for the best performance : https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance
Intel® Optimized Caffe* Modern Code Techniques : /content/www/us/en/develop/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques.html

### Summary

Intel® Optimized Caffe* is a customized Caffe* version for Intel Architectures with modern code techniques.

In Intel® Optimized Caffe*, Intel leverages optimization tools and Intel® performance libraries, performs scalar and serial optimizations, and implements vectorization and parallelization.

#### Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.