Execution Analysis of Training a Deep Neural Network Task

Published: 09/14/2018  

Last Updated: 09/14/2018

High-performance computing (HPC) can be defined as the use of a set of techniques that enable the maximum performance of a processing platform. We need to go beyond the simple insertion of OpenMP* directives when developing a multithreaded algorithm for execution in a shared memory system. A major concern is how the algorithm will behave during execution and how the environment can be configured to provide better performance. In this paper we propose a set of techniques for execution analysis of a multithreaded algorithm developed for a shared memory environment. The target processing platform for this work is the Intel® Xeon® Platinum 8160 processor.


The identification of entities in texts is a well-known problem in natural language processing. It consists of recognizing categories of words such as names of people, places, dates, and many others [1]. Named entity recognition (NER) has many applications, including the extraction of entities from documents and social networks in order to automate the analysis of large amounts of data. The scientific literature offers several methods that seek to solve this problem, among which the approaches based on neural networks stand out [1].

One of the major problems of supervised methods is the need for labeled databases, which give the algorithm a minimum of knowledge to learn from. This need imposes a number of limitations, since learning is tied to the context of the labeled text. In order to generalize the problem and also aim at real applications, a semi-supervised method was proposed that first labels a dataset automatically and then learns from it, and is therefore generalizable to texts in any context. This brings the proposed method closer to an actual application that can be put into production. The composition of the dataset follows a top-down rule: (1) Name: a word is a name if it appears in a dictionary of names; (2) Not name: all words that are certainly not names, such as articles, prepositions, verbs, and common nouns.
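The top-down labeling rule can be illustrated with a minimal sketch. The word lists below are hypothetical placeholders (the article does not publish its dictionaries), and leaving unmatched words unlabeled is an assumption made here for illustration.

```python
# Hypothetical word lists; the article does not publish its dictionaries.
NAME_DICTIONARY = {"maria", "pedro", "london"}
NOT_NAME_WORDS = {"the", "of", "in", "runs", "house"}


def auto_label(tokens):
    """Label each token as NAME, NOT_NAME, or None when neither rule applies."""
    labels = []
    for token in tokens:
        key = token.lower()
        if key in NAME_DICTIONARY:
            labels.append("NAME")
        elif key in NOT_NAME_WORDS:
            labels.append("NOT_NAME")
        else:
            # Assumption: words matched by neither rule stay unlabeled.
            labels.append(None)
    return labels


labeled = auto_label(["Maria", "runs", "in", "London", "quickly"])
```

Only the words that one of the two rules covers with certainty receive a label, which is what allows the dataset to be built without manual annotation.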

After the dataset is labeled, learning takes place through deep neural networks. Before that, a series of preprocessing steps is applied to the text until the words are transformed into vectors. The steps are as follows:

  • Preprocessing. Texts are tokenized and cleaned by removing stop words, symbols, and accents.
  • Tagger. Morphological analysis of the text is performed. This information is also used in learning.
  • Vectorization of words. Both the processed text and the result of the morphological analysis are transformed into vectors through Word2Vector*. The result can then be employed in the composition of tensors.
  • Finally, the tensors are used in the training of neural networks. The architecture employed is summarized in Figure 1.
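The preprocessing step listed above can be sketched with the Python standard library alone. The stop-word list is a hypothetical placeholder, and the tokenization rule (runs of ASCII letters and digits) is an assumption; the article does not say which tokenizer or stop-word list it uses.

```python
import re
import unicodedata

# Hypothetical stop-word list; the article does not specify which one it uses.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to"}


def strip_accents(text):
    """Remove combining accent marks so that accented letters become plain ones."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Strip accents, tokenize into alphanumeric runs, and drop stop words."""
    cleaned = strip_accents(text)
    tokens = re.findall(r"[A-Za-z0-9]+", cleaned)  # symbols and punctuation are discarded
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


tokens = preprocess("José moved to São Paulo in 1998!")
```

Case is preserved in this sketch because capitalization can be a useful signal for names; whether the original pipeline lowercases its tokens is not stated.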

The architecture employed has the purpose of identifying the word together with its context and morphological structure. For this, the tensor is divided in two: One of them contains the whole context and another contains only the word. In the end, the architecture consolidates the complete information. Convolutional layers were used to combine input information, both in terms of vocabulary and morphology. A bidirectional long short-term memory (LSTM) layer was also used to learn word order in both directions. The resulting network is shown in Figure 1.
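The division of the input into a context part and a word part can be pictured with a small sketch. It works on plain strings rather than word vectors, and the fixed-size window, the `window` parameter, and the `<pad>` token are all assumptions made for illustration; the article does not describe how the context is delimited.

```python
PAD = "<pad>"  # hypothetical padding token for positions outside the sentence


def split_context_and_word(tokens, window=2):
    """Return (context, word) pairs: a padded window around each token, and the token itself."""
    padded = [PAD] * window + list(tokens) + [PAD] * window
    pairs = []
    for i, word in enumerate(tokens):
        # The slice holds the word plus `window` neighbors on each side.
        context = padded[i : i + 2 * window + 1]
        pairs.append((context, word))
    return pairs


pairs = split_context_and_word(["Maria", "visited", "London"], window=1)
```

In a real pipeline each string would be replaced by its vector, so the two parts would become two tensors fed to separate input branches of the network.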

Figure 1. Recurrent Neural Network (RNN) architecture

Training neural networks, such as the one employed for the NER problem, demands a high computational effort [6]. This kind of application is basically composed of tensor operations on high-dimensional spaces. Therefore, it is necessary to apply high-performance computing (HPC) techniques to improve performance [7].

In this paper we investigate environment variables that are essential for the execution of the network training, using the method developed for entity recognition as a reference. Performance improvement is assessed with two environment variables: KMP_BLOCKTIME and KMP_AFFINITY. The KMP_BLOCKTIME variable sets the time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping. The KMP_AFFINITY variable enables the run-time library to bind threads to physical processing units.
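One way to apply these two variables from inside a Python training script is sketched below. The values are illustrative. The affinity string uses the `granularity=fine` spelling from Intel's OpenMP documentation, which is an assumption about how the shorter form quoted in this article maps to the runtime syntax.

```python
import os

# Set the variables before importing the framework that loads the Intel OpenMP
# runtime, because the runtime reads them when it initializes.
os.environ["KMP_BLOCKTIME"] = "0"  # threads sleep right after a parallel region ends
# Assumed long-form spelling of the affinity setting (see the note above).
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

# import tensorflow as tf  # the framework import would follow at this point
```

Exporting the same variables in the shell before launching Python has the same effect and avoids any dependence on import order.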


All experiments were run on a compute node composed of two Intel® Xeon® Platinum 8160 processors @ 2.10 GHz, each one with 24 physical cores (48 logical) and 33 MB of cache memory, 190 GB of RAM, two Intel® Solid State Drive Data Center S3520 Series drives with 1.2 TB and 240 GB of storage capacity, and a CentOS* 7 operating system running kernel version 3.10.0-693.21.1.el7.x86_64. The dataset used in the experiments was a set of 101 books in PDF format, totaling 20.8 MB.


To ensure the performance measurement, we compiled the most relevant libraries and Intel® Distribution for Python* for our application. First, we compiled Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) 0.14; after that, we compiled TensorFlow* 1.7 using Intel MKL-DNN as a back-end mathematical library; and then we compiled Keras* over this new TensorFlow compilation.

We divided the results into three environments: Default, Test1, and Test2. In the Default environment no variable is set, that is, all variables have their default values; in the Test1 environment we set KMP_BLOCKTIME=0; in the Test2 environment we set KMP_BLOCKTIME=0 and KMP_AFFINITY=fine,compact,1,0.
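A small harness along the following lines could be used to time one workload under the three environments. It is only a sketch: the workload is a placeholder command, the helper names are hypothetical, and the affinity value uses the `granularity=fine` spelling from Intel's OpenMP documentation as an assumption.

```python
import os
import subprocess
import sys
import time

# Variable overrides per environment, mirroring Default, Test1, and Test2.
ENVIRONMENTS = {
    "Default": {},
    "Test1": {"KMP_BLOCKTIME": "0"},
    "Test2": {"KMP_BLOCKTIME": "0", "KMP_AFFINITY": "granularity=fine,compact,1,0"},
}


def build_env(overrides):
    """Copy the current environment, clear both KMP variables, then apply overrides."""
    env = dict(os.environ)
    for name in ("KMP_BLOCKTIME", "KMP_AFFINITY"):
        env.pop(name, None)
    env.update(overrides)
    return env


def time_command(command, overrides):
    """Run `command` under the given overrides and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, env=build_env(overrides), check=True)
    return time.perf_counter() - start


# Placeholder workload; a real run would launch the training script instead.
command = [sys.executable, "-c", "pass"]
timings = {name: time_command(command, overrides) for name, overrides in ENVIRONMENTS.items()}
```

Clearing both variables before applying the overrides keeps the Default run from inheriting a value that happens to be exported in the calling shell.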

Figure 2 presents the total execution time for each environment. The x-axis shows the environment. The y-axis shows the execution time in seconds.

Figure 2. Total execution time

Figure 2 shows that the Default environment ran in 10,697.011 seconds; the Test1 environment ran in 4,729.111 seconds; and the Test2 environment ran in 5,517.951 seconds.

Figure 3. Speedup in relation to Default environment

Figure 3 shows that the Test1 environment performed 2.26x faster than the Default environment, which corresponds to a reduction of 55.79 percent in the total execution time. The results also show that the Test2 environment performed 1.93x faster than the Default environment, a reduction of 48.41 percent in the total execution time.
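The speedup and time-reduction figures follow directly from the measured execution times. A minimal sketch of the arithmetic is shown below; the function names are illustrative, and the last digit may differ from the quoted figures depending on whether values are rounded or truncated.

```python
# Execution times in seconds, as reported for Figure 2.
times = {"Default": 10697.011, "Test1": 4729.111, "Test2": 5517.951}


def speedup(baseline, measured):
    """How many times faster the measured run is than the baseline."""
    return baseline / measured


def reduction_percent(baseline, measured):
    """Percentage of the baseline execution time that was saved."""
    return (1 - measured / baseline) * 100


for name in ("Test1", "Test2"):
    s = speedup(times["Default"], times[name])
    r = reduction_percent(times["Default"], times[name])
    print(f"{name}: {s:.2f}x faster, {r:.2f}% less time")
```

The two quantities are linked: a speedup of s corresponds to a reduction of (1 - 1/s) × 100 percent.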

We used Intel® VTune™ Amplifier to obtain the execution profile of the NER application under the three environment settings described previously: Default, in which no variable is set; Test1, in which we set KMP_BLOCKTIME=0; and Test2, in which we set KMP_BLOCKTIME=0 and KMP_AFFINITY=fine,compact,1,0.

Intel VTune Amplifier is a toolkit for profiling the execution of an algorithm. The main metrics it reports are effective time, spin time, and overhead time. Effective time is CPU time spent in the user code. This metric does not include spin and overhead time. Spin time is wait time during which the CPU is busy. This often occurs when a synchronization API causes the CPU to poll while the software thread is waiting. Some spin time may be preferable to the alternative of increased thread context switches. Too much spin time, however, can reflect lost opportunity for productive work. Overhead time is CPU time spent on the overhead of known synchronization and threading libraries, such as system synchronization APIs, Intel® Threading Building Blocks (Intel® TBB), and OpenMP*.

We observed that when we set KMP_BLOCKTIME=0, the effective time percentage increases from 2.64 percent in the Default environment to 75.04 percent in the Test1 environment and to 72.40 percent in the Test2 environment. This happens because threads go to sleep immediately after finishing their work in a parallel region instead of spinning. The processing resources therefore become available sooner, and other threads can be scheduled on them. This conclusion is reinforced by the decrease in the spin time percentage, which falls from 97.29 percent in the Default environment to 21.64 percent in the Test1 environment and to 23.55 percent in the Test2 environment. Spin time also counts the time during which a thread occupied a processing resource without doing useful work, so in this case a decrease in spin time indicates better execution efficiency.
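Because these metrics are shares of CPU time, the changes between environments are most naturally expressed in percentage points. The sketch below computes them from the percentages reported above; the function name is illustrative.

```python
# Share of CPU time per metric, in percent, as reported by the profiler.
effective = {"Default": 2.64, "Test1": 75.04, "Test2": 72.40}
spin = {"Default": 97.29, "Test1": 21.64, "Test2": 23.55}


def point_change(metric, environment):
    """Change relative to Default, in percentage points (positive means an increase)."""
    return round(metric[environment] - metric["Default"], 2)


changes = {
    env: {"effective": point_change(effective, env), "spin": point_change(spin, env)}
    for env in ("Test1", "Test2")
}
```

Effective time rises by roughly as many points as spin time falls, which is consistent with the CPU time previously spent spinning being converted into useful work.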


In this paper we presented an execution analysis of training a deep neural network for solving NER problems. The algorithm was developed using Intel Distribution for Python. To ensure the performance measurement, we compiled the most relevant libraries and Python packages for our application: Intel MKL-DNN 0.14, TensorFlow* 1.7, and Keras.

Figure 3 shows that a single change to one environment variable, KMP_BLOCKTIME=0, can increase the performance of the algorithm, and by using Intel VTune Amplifier we showed that this improvement was caused by a better use of computational resources. The effective time share increases by 72.4 percentage points in the Test1 environment in comparison to the Default environment. In the Test2 environment the effective time share increases by 69.76 percentage points in comparison to the Default environment. The spin time share decreases by 75.65 and 73.74 percentage points in the Test1 and Test2 environments, respectively.

Ultimately, we can see that the environment configuration is as important as the use of performance-enhancing techniques.

Compiling TensorFlow and Intel® Math Kernel Library for Deep Neural Networks

Following are the steps for compiling and installing TensorFlow with Intel MKL-DNN as a back-end.

Compiling Intel® MKL-DNN
  1. git clone https://github.com/intel/mkl-dnn.git
  2. cd mkl-dnn
  3. cd scripts && ./prepare_mkl.sh && cd ..
  4. mkdir -p build && cd build && cmake .. && make
  5. sudo make install


Compiling TensorFlow*
  1. git clone -b v1.8.0 https://github.com/tensorflow/tensorflow.git
  2. cd tensorflow
  3. ./configure (keep all options at their defaults)
  4. bazel build --config=mkl -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mavx512f --copt=-mavx512pf --copt=-mavx512cd --copt=-mavx512er --copt="-DEIGEN_USE_VML" //tensorflow/tools/pip_package:build_pip_package
  5. bazel-bin/tensorflow/tools/pip_package/build_pip_package ./



  1. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C. (2016). Neural Architectures for Named Entity Recognition. Proceedings of NAACL-HLT 2016, pages 260–270.
  2. Bengio, Y., Courville, A., Vincent, P. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, nr. 8.
  3. Collobert, R., Bottou, L., Weston, J., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, vol. 12, pages 2493–2537.
  4. Hirschberg, J., Manning, C.D. (2015). Advances in natural language processing, Science, vol. 349, nr. 6245.
  5. Kalchbrenner, N., Grefenstette, E., Blunsom, P. (2014). A convolutional neural network for modelling sentences, arXiv.org.
  6. Barney, Blaise. Introduction to parallel computing. Lawrence Livermore National Laboratory 6.13 (2010): 10.
  7. Jeffers, Jim, and James Reinders. High Performance Parallelism Pearls Volume Two: Multicore and Many-core Programming Approaches. Morgan Kaufmann, 2015.
