Emphysema, a progressive lung disease that impacts breathing ability, affects more than 3 million people in the United States, and more than 65 million people worldwide. Early detection is key in stopping the progression of emphysema, which in severe cases is life-threatening. Pneumonia, a lung infection that also impacts breathing, causes another 1.4 million deaths annually around the world. In most cases, it, too, is treatable with early detection.
Figure 1: Healthy lung (left) and lungs with severe emphysema (right)
Figure 2: Chest X-Ray Images
Research Using CheXNet at Stanford: CheXNet is a deep learning Convolutional Neural Network (CNN) model developed at Stanford University to identify thoracic pathologies in the NIH ChestX-ray14 dataset. CheXNet is a 121-layer CNN that takes a chest X-ray image as input and predicts output probabilities for each pathology. It detects pneumonia by localizing the areas of the image that are most indicative of the pathology. Stanford researchers trained CheXNet-121 on the ChestX-ray14 dataset, starting from a model pre-trained on the ImageNet2012-1K dataset. The NIH dataset consists of over one hundred thousand frontal chest X-ray images from over 30,000 unique patients, annotated with up to 14 thoracic diseases including pneumonia and emphysema. CheXNet-121 outperforms the best previously published results on all 14 pathologies in the ChestX-ray14 dataset.
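The model described above amounts to a multi-label classifier: a DenseNet-121 backbone with one sigmoid output per pathology. The following is a minimal sketch, not the authors' code; `weights=None` keeps it runnable offline, whereas the research initializes from ImageNet-pre-trained weights.

```python
import tensorflow as tf

def build_chexnet_like(num_pathologies=14):
    # DenseNet-121 backbone; weights=None keeps this sketch offline,
    # while the actual study starts from ImageNet-pre-trained weights.
    base = tf.keras.applications.DenseNet121(
        weights=None, include_top=False, pooling="avg",
        input_shape=(224, 224, 3))
    # One independent sigmoid per pathology: a patient can present
    # several diseases at once, so this is multi-label, not softmax.
    probs = tf.keras.layers.Dense(
        num_pathologies, activation="sigmoid")(base.output)
    model = tf.keras.Model(base.input, probs)
    # Binary cross-entropy, summed over the 14 independent labels.
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_chexnet_like()
```

Because the head sits on global average pooling, class activation maps can be recovered from the last convolutional features, which is broadly how CheXNet-style localization highlights the indicative regions.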
Extending the Research on HPC Infrastructure: In this joint work, Dell EMC, SURFsara, and Intel extended the research using VGG-16 and ResNet-50 CNN models, pre-trained on the ImageNet2012-1K dataset and scaled out across a large number of Intel® Xeon® Scalable processors on Dell EMC’s Zenith supercomputer. Our team was able to significantly reduce the training time and outperform the published CheXNet-121 results in four pathology categories using VGG-16, and in up to 10 categories (including pneumonia and emphysema) using ResNet-50.
Transfer Learning: From Benchmark to Real Use Cases
We first pre-trained the network on the ImageNet2012 dataset on 200 nodes of Dell EMC’s Zenith HPC cluster using the Intel® Optimization for TensorFlow* and the Horovod distributed training framework. The chart below shows ResNet-50 pre-trained to > 75% Top-1 accuracy, with a time-to-train speedup of 3X on 200 nodes relative to 64 nodes on the Zenith cluster. We followed a ResNet-50 training methodology on ImageNet2012-1K similar to previous work by SURFsara and Intel.
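At this scale the training loop is data-parallel. Below is a minimal Horovod-with-TensorFlow sketch under stated assumptions (Horovod installed, TF 1.x-style session API); the inline loss is a stand-in for the real ResNet-50 graph, and the step count mirrors the configuration listed in the disclosures.

```python
def scaled_lr(base_lr, num_workers):
    # Linear-scaling rule: the global batch grows with the worker count,
    # so the learning rate grows proportionally to keep updates comparable.
    return base_lr * num_workers

def train_distributed(base_lr=0.1, last_step=14075):
    # Local imports keep this sketch loadable without TF/Horovod installed.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()  # one rank per worker, launched via mpirun
    # Stand-in loss; the real run minimizes the ResNet-50 training loss.
    w = tf.Variable([1.0])
    loss = tf.reduce_mean(tf.square(w))
    opt = tf.train.MomentumOptimizer(scaled_lr(base_lr, hvd.size()), 0.9)
    opt = hvd.DistributedOptimizer(opt)  # allreduce gradients across ranks
    step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=step)
    hooks = [hvd.BroadcastGlobalVariablesHook(0),      # rank 0 seeds all ranks
             tf.train.StopAtStepHook(last_step=last_step)]
    with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```

In this pattern each MPI rank reads its own shard of the data, gradients are averaged with an allreduce each step, and rank 0 broadcasts the initial weights so all workers start identically.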
Training Performance and Accuracy with VGG-16:
The charts below show throughput performance and accuracy for a model pre-trained on the ImageNet2012 dataset, using the default implementation in Keras* with the Intel® Optimization for TensorFlow* and the Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN), exploiting NUMA domains with multiple workers per node.
We parallelized, optimized and scaled both VGG-16 and CheXNet-121 models on up to 64 Intel Xeon processor nodes. Figure 4 shows that using the pre-trained VGG-16 model, we were able to achieve 6.3X faster throughput performance on 64 nodes than CheXNet-121 on 32 nodes on the Dell EMC Zenith cluster.
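The NUMA-aware layout above can be illustrated with a launch sketch (illustrative only; `train_chexnet.py` is a hypothetical script name, and the thread counts assume Zenith's dual-socket, 20-cores-per-socket nodes):

```shell
# Two workers per dual-socket node (-ppn 2): one MPI rank per NUMA domain,
# so each worker's threads stay close to local memory.
export OMP_NUM_THREADS=20   # one OpenMP thread per physical core of a socket
export KMP_BLOCKTIME=1      # MKL-DNN threads yield quickly between ops
# Intra-op threads match a socket's core count; a few inter-op threads
# let independent graph ops overlap.
mpirun -np 128 -ppn 2 python train_chexnet.py \
  --num_intra_threads 20 --num_inter_threads 2 --mkl=True
```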
Training Performance and Accuracy with ResNet-50 Using TensorFlow Only:
Next, we fine-tuned the pre-trained ResNet-50 model and measured its performance on the ChestX-ray14 dataset. We achieved a 4.7X throughput speedup with a TensorFlow-only implementation compared to the Keras*+TensorFlow implementation on 128 Intel Xeon nodes on the Zenith cluster. This result indicates that the Keras front end adds significant performance overhead.
Figure 6 shows that with pre-trained ResNet-50, throughput using TensorFlow on 128 nodes is 104X faster than single-node performance on the Dell EMC Zenith cluster. Figure 6 also shows scale-out training performance with ResNet-50, relative to single-node performance, up to 256 nodes on the Zenith cluster.
Figure 7 shows the accuracy of ResNet-50 relative to CheXNet-121. Using the pre-trained ResNet-50 model, we achieved up to 4% higher accuracy (AUROC) than the published CheXNet-121 results in 10 of the 14 pathology categories.
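AUROC, the per-pathology metric behind Figure 7, can be computed with a short rank-based (Mann-Whitney) routine. This is a minimal NumPy sketch, not the evaluation code from the study:

```python
import numpy as np

def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Tied scores receive their average rank.
    for s in np.unique(scores):
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    # Probability that a random positive outranks a random negative.
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# One AUROC per pathology column of a (patients x 14) label matrix,
# e.g. [auroc(y[:, k], p[:, k]) for k in range(14)].
```

An AUROC of 0.5 is chance level and 1.0 is perfect ranking, so "up to 4% higher AUROC" is a direct per-pathology comparison of these scores.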
In healthcare, prevention is key to saving lives, improving outcomes, and reducing costs. Models which can help identify disease will be critical to providing quality care to everyone in a timely fashion. As we’ve shown, scale-out training of neural network models can reduce the time to solution from weeks to minutes, using the same compute infrastructure that is already being used for everyday operations in hospitals and medical research labs around the world.
If you’d like to learn more, check out a recording of our presentation on this topic at Intel® AI DevCon in May 2018. Two of our authors, Valeriu and Damian, also shared their insights earlier this month at the Artificial Intelligence Conference in London, jointly presented by O’Reilly Media and Intel Corporation.
- SURFsara/Intel Paper: Residual network training on ImageNet-1K with improved accuracy and reduced time to train
- Intel Blog: Accelerating Deep Learning Training and Inference with System Level Optimizations
- SURFsara* Best Practices for Caffe*
- SURFsara Best Practices for TensorFlow
- Dell EMC: Ready Solutions for AI
Notices and Disclaimers:
Intel does not control or audit third-party benchmark data or the websites referenced in this document. You should visit the referenced website and confirm whether referenced data are accurate.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Performance results are based on testing as of May 17, 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
Testing Configuration: Dell EMC Zenith HPC Supercomputer platform: dual-socket Dell EMC PowerEdge C6420 server, Intel® Xeon® Gold 6148 processor, 20 cores each @ 2.40GHz for a total of 40 cores per node, 2 threads per core; L1d cache 32K; L1i cache 32K; L2 cache 1024K; L3 cache 33792K; 96 GB of DDR4; Intel® Omni-Path Host Fabric Interface, dual-rail. Software: Intel® MPI Library 2017 Update 4, Intel® MPI Library 2019 Technical Preview OFI 1.5.0 PSM2 w/ Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat* Enterprise Linux 6.7.
TensorFlow 1.6: Built & Installed from source: https://www.tensorflow.org/install/install_sources,
ResNet-50 Model: Topology specs from https://github.com/tensorflow/tpu/tree/master/models/official/resnet,
DenseNet-121 Model: Topology specs from https://github.com/liuzhuang13/DenseNet,
Convergence & Performance Model: https://surfdrive.surf.nl/files/index.php/s/xrEFLPvo7IDRARs,
Dataset: ImageNet2012-1K: http://www.image-net.org/challenges/LSVRC/2012/,
Run command and environment:
export OMP_NUM_THREADS=20
export HOROVOD_FUSION_THRESHOLD=134217728
export I_MPI_FABRICS=tmi
export I_MPI_TMI_PROVIDER=psm2
mpirun -np 512 -ppn 2 python resnet_main.py --train_batch_size 8192 --train_steps 14075 --num_intra_threads 20 --num_inter_threads 2 --mkl=True --data_dir=/scratch/04611/valeriuc/tf-1.6/tpu_rec/train --model_dir model_batch_8k_90ep --use_tpu=False --kmp_blocktime 1
Baseline configuration: 64 nodes.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2018 Intel Corporation. All rights reserved.