NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks



Convolutional Neural Networks (CNNs), a subcategory of Deep Learning Neural Networks (DNNs), have become increasingly popular in industry and academia for their powerful capabilities in pattern classification, image processing, and speech recognition. Recently, they have been widely adopted in High Performance Computing (HPC) environments for solving complex problems related to modeling, runtime prediction, and big data analysis. Current state-of-the-art DNN designs for modern multi- and many-core CPU architectures, such as variants of Caffe, have reported promising speedup and scalability, comparable to GPU implementations. However, modern CPU architectures employ Non-Uniform Memory Access (NUMA) to integrate multiple sockets, which poses unique challenges for designing highly efficient CNN frameworks. Without a careful design, DNN frameworks can easily suffer from long memory latency caused by frequent accesses to remote NUMA domains, resulting in poor scalability. To address this challenge, we propose a NUMA-aware, multi-solver-based CNN design, named NUMA-Caffe, for accelerating deep learning neural networks on multi- and many-core CPU architectures. NUMA-Caffe is independent of DNN topology, does not affect network convergence rates, and provides superior scalability over existing Caffe variants. Through a thorough empirical study on four contemporary NUMA-based multi- and many-core architectures, our experimental results demonstrate that NUMA-Caffe significantly outperforms the state-of-the-art Caffe designs in terms of both throughput and scalability...