# Code Sample: Optimizing Binarized Neural Networks on Intel® Xeon® Scalable Processors

Published: 08/15/2018

Last Updated: 08/14/2018

File(s): |
Download |

License: | Intel Sample Source Code License Agreement |

Optimized for... | |
---|---|

OS: |
Linux* |

Hardware: |
Intel® Xeon® Gold 6128 processor |

Software:(Programming Language, tool, IDE, Framework) |
OpenMP*, C++ (for kernel development), Wolfram Language (for model development) |

Prerequisites: |
Access to Intel® AI DevCloud |

In a previous article, we discussed the performance and accuracy of Binarized Neural Networks (BNN). We also introduced a BNN coded from scratch in the Wolfram Language. The key component of this neural network is Matrix Multiplication. General Matrix-to-Matrix Multiplication (GEMM) is at the heart of deep learning. We can clearly see in the image below that almost 89 percent of the CPU forward time is spent on convolutional and fully connected layers in a typical neural network.

Image source: Learning Semantic Image Representations at a Large Scale

As a result, it is exceedingly important to accelerate GEMM. Convolutions can be reduced to GEMM operations as well, using the im2col algorithm. Therefore, developing good approximations to convolution could allow a 10x speed up in inference with respect to a full precision convolution implementation.

## Binarizing Convolutions

As mentioned above, it is possible to reduce convolutions to GEMM operations by utilizing the im2col algorithm, which is visualized below.

Image source: Implement Convolution in CNN

By virtue of the im2col algorithm, it is possible to develop approximations to convolution. im2col arranges data in a way that makes memory accesses regular. Regular indicating that the matrices can be accessed in a row-major format. This leads to significantly lesser cache-misses. Cache miss is a state where the data requested for processing by a component or application is not found in the cache memory.^{1} However, it is important to note that there is a large memory overhead generated by using im2col. This can be dealt with effectively during the binarization process. Row-wise binarization is one of the dominant strategies to reduce model size but at the same time preserve the representational capacity of the network. More in-depth study of this field will be required to truly parallelize convolutions.

## Implementing Binary Classic Matrix Multiplication on Intel® Xeon® Gold Processors

### Execution

To run the code, use this download link and upload the file (xbin.c) to the Intel® AI DevCloud and use the following command:

`icpc -O3 -xCORE-AVX512 -qopenmp -mkl -fp-model fast=2 -fma -unroll=4 xbin.c -o xcmma.out && echo ~/xcmma.out | qsub`

Note: This will work if the xbin.c file is saved in the home directory. Make sure the file addresses are changed accordingly.

We will begin the process of developing this algorithm by first implementing the Classic Matrix Multiplication Algorithm (CMMA).

### Optimized CMMA Loop Processing Scheme

```
for(int i = 0; i < m; i++){
for(int j = 0; j < n; j++){
for(int k = 0; k < p; k++){
pC[i][j] += pA[i][k]*pB[k][j];
}
}
}
```

There are a few issues with the unoptimized implementation of CMMA. We encounter k memory accesses to pC, which can be avoided. Moreover, we have not utilized any OpenMP* clauses to help parallelize this operation.

Most of the constructs in OpenMP are compiler directives.

` #pragma omp construct [clause [clause]...]`

We will be using #pragma omp parallel to define a parallel region.

```
#pragma omp parallel [clauses]
{
code_block
}
```

Defines a parallel region, which is code that will be executed by multiple threads in parallel.

We will also define private variables.

`private(var)`

Each thread should have its own instance of the variable specified

### Optimized CMMA Loop Processing Scheme

```
float sum = 0;
#pragma omp parallel for private(i, j, k, sum) num_threads(NUM_OF_THREADS)
for(int i = 0; i < m; i++){
for(int j = 0; j < n; j++){
sum = 0.0;
for(int k = 0; k < p; k++){
sum += pA[i][k]*pB[k][j];
}
pC[i][j] = sum;
}
}
```

Note: The `NUM_OF_THREADS`

has been defined in the main code.

If pB denotes the weights in a neural network, the memory accesses are made more efficient by transposing them. (Row major memory access pattern). We will however, omit that optimization for now as this will remain common to both the binary multiplication and full precision multiplication.

## Binarizing Activations

In a neural network, there are two elements: the activations and the weights. The weights for a trained neural network are fixed; however, the activations are dependent upon the input.

Therefore, we will introduce separate kernels for multiplication and binarization. This is for better understanding, although combining both the processes gives marginal performance improvements.

```
unsigned int tbA; int sign;
#pragma omp parallel for private(i, seg, j, sign, tbA) num_threads(NUM_OF_THREADS)
for (i = 0; i < MX_SIZE; i++)
{
for(seg = 0; seg < MX_SIZE/16; seg++)
{
tbA = 0;
for(j = 0; j < 16; j++)
{
sign = (int) (pA[i][seg*16 + j] >= 0);
tbA = tbA|(sign<<j);
}
bA[i][seg] = tbA;
}
}
```

Notice that the second matrix being multiplied is the *Weight Matrix* and has already been binarized.

## Binary Multiplication

We will attempt to introduce C code for binary multiplication with the performance metrics, as demonstrated in the figure above. This benchmark was discussed in a previous article.

Keeping in mind the constructs used in the ”Binarizing Activations” section, inspect the code below for binary multiplication. We divide the `MX_SIZE`

(Matrix Size) by 16 since that is the bit width that data types `bA`

and `bB`

are initialized with.

```
float temp;
#pragma omp parallel for private(i, j, k, temp) num_threads(NUM_OF_THREADS)
for(i = 0; i < m; i++){
for(j = 0; j < n; j++){
temp = 0;
for(k = 0; k < MX_SIZE/16; k++){
temp +=
2*(__builtin_popcount(~(bA[i][k]^bB[j][k]))) - 48;
}
pC[i][j] = temp;
}
}
```

As shown in the graph above, by simply adding the #pragma clause, we achieved a consistent 10-fold speed-up. Further optimizations could involve parallelizing loops with pointers to row arrays.

## Intel® Development Tools Used

The project was developed using the Intel® AI DevCloud (using Intel® Xeon® Scalable processors). Information from the Intel® AI Developer Program forum was also used for optimization purposes with Intel Xeon processors.

More details about implementation:

Processor | Intel® Xeon® Gold 6128 processor |

Multiprocessing API | OpenMP* |

Programming language |
C++ (for kernel development) Wolfram Language (for model development) |

Compiler |
icpc version 18.0.3 (gcc version 4.8.5 compatibility) |

## References

^{1}

#### Product and Performance Information

^{1}

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.