Using the CPU for Effective and Efficient Medical Image Analysis

Published: 01/10/2018  

Last Updated: 01/10/2018

By Xiao Hu, Hui Wu, and Weifeng Yao

A Quantitative Report Based on the Alibaba Tianchi Healthcare AI Competition 2017


This paper is based on the Tianchi Healthcare AI Competition, an online challenge for automatically detecting lung nodules in computed tomography (CT) scans, co-sponsored by Alibaba Cloud, Intel, and LinkDoc. The competition concluded successfully in October 2017 after seven intense months of competition among 2,887 teams across the globe. It was hosted on Alibaba’s public cloud service, built entirely upon Intel’s deep learning hardware and software stack. Intel was deeply engaged in the architecture design, hardware and software development, performance optimization, and online support throughout the competition, and thus gained many insights into the medical artificial intelligence (AI) domain. This paper reports the key findings from our experiments.

First, we implemented a 3D convolutional neural network (CNN) model that reflects state-of-the-art lung nodule detection, following the design philosophy common among the Tianchi participants.

Second, we trained the model on input data of different resolutions and quantitatively showed that a model trained with higher-resolution data achieves better detection performance, especially for small nodules. At the same time, the higher-resolution model consumed much more memory than the lower-resolution ones.

Third, we compared the behaviors of general-purpose computing on graphics processing units (GPGPU) and the CPU, and showed that the CPU architecture provides a larger memory capacity, which enables medical AI developers to explore higher-resolution designs in pursuit of optimal detection performance.

We also introduced a customized deep learning framework, called Extended-Caffe*10, the core of Tianchi’s software stack, as an example to demonstrate that the CPU architecture can support highly efficient 3D CNN computations, so that developers can use the CPU both effectively and efficiently for 3D CNN model development.


The Tianchi Healthcare AI Competition1 is the first AI healthcare competition in China and, in terms of scale and data volume, the only one of its kind worldwide. Sixteen top cancer hospitals in China provided labeled lung CT scans of nearly 3,000 patients for this competition. Lung nodule detection was chosen because the incidence of lung cancer in China has increased during the past 30 years and has become the leading cause of cancer death. Early screening for lung nodules is therefore an urgent problem that needs to be addressed2. After an intense seven-month online competition among 2,887 teams across the globe, the team from Peking University won the contest.

This online competition was hosted on Alibaba’s public cloud service, built entirely upon Intel’s deep learning hardware and software stack. The underlying hardware infrastructure is a cluster of Intel® Xeon® and Intel® Xeon Phi™ processor-based platforms, offering a total of more than 400 TFLOPS of computing power. Intel also provided a series of deep learning software components to facilitate model training and inference, the core of which was a customized deep learning framework, called Extended-Caffe, specifically optimized for medical AI usages. Intel experts also helped hundreds of online participants run their models efficiently and, in return, gained valuable insights into the medical AI domain.

This competition revealed that, although deep learning has been applied to computer vision for more than a decade, medical image analysis still poses unique and significant challenges to domain experts and engineers. In particular, almost all state-of-the-art solutions for medical image analysis rely heavily on 3D, or even 4D/5D, CNNs, which call for very different engineering considerations than the 2D CNNs common in other areas of computer vision. We found that the CPU platform, compared to the traditional GPGPU platform, can support 3D CNNs for medical image analysis more effectively, owing to the CPU’s larger memory capacity, while maintaining high computing efficiency through careful algorithm implementations of the 3D CNN primitives.

Although the Tianchi dataset and models are confidential, this paper discusses our key findings based on developing and experimenting with a 3D CNN model that reflects state-of-the-art lung nodule detection. The following sections describe how we preprocessed the CT dataset, designed our model, implemented highly efficient 3D primitives on the CPU, conducted our experiments, and drew conclusions through quantitative analysis.

CT Data Preprocessing

Every raw CT image contains a sequence of 2D images. The interval between adjacent 2D images is called the Z-interval. Every 2D image is a matrix of gray-scale pixels, where the horizontal and vertical intervals between pixels are called the X- and Y-intervals, respectively. These intervals are measured in millimeters. Because CT instruments differ from one another, different raw CT images have different intervals. For example, in the LUNA’16 dataset3, Z-intervals range from 0.625 mm to 2.5 mm; the Tianchi dataset is similar. To make a deep learning model work on a unified dataset, we have to interpolate the original pixels in the X, Y, and Z directions at a fixed sampling distance, so that a raw CT image is converted to a new 3D image whose pixel-to-pixel intervals in all three directions equal the sampling distance. If we then measure things on a pixel basis, the sizes (that is, resolutions) of the new 3D images and of the nodules are determined by the sampling distance. Table 1 shows that smaller sampling distances lead to higher resolutions.

Note that, unlike ordinary object detection, lung nodule detection suffers from a unique problem: a nodule occupies only about one millionth of the whole CT volume. Therefore, for a model to effectively extract the features of nodules, we must crop the CT image into smaller 3D regions and feed those crops one by one into the model. Again, smaller sampling distances lead to bigger crops.

Table 1. Different Sampling Distances Generate Different Resolutions of 3D Data

Sampling Distance (mm) | 3D Image Resolution (Pixel × Pixel × Pixel) | Nodule Resolution (Diameter: Pixel) | Crop Resolution (Pixel × Pixel × Pixel)
1.00 | 249 × 256 × 302 | 3.66 | 128 × 128 × 128
1.33 | 188 × 196 × 231 | 2.74 | 96 × 96 × 96
2.00 | 127 × 136 × 159 | 1.83 | 64 × 64 × 64
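The resampling arithmetic described above can be sketched as follows. This is a minimal sketch of the bookkeeping only; the raw CT shape and intervals below are hypothetical examples, not values from the Tianchi data.

```python
import numpy as np

def resampled_shape(shape_px, intervals_mm, d_mm):
    """New pixel resolution after resampling to a fixed sampling distance d_mm.

    The physical extent of each axis (pixels * interval, in mm) is divided
    by the new sampling distance (mm per pixel).
    """
    shape = np.asarray(shape_px, dtype=float)
    ivals = np.asarray(intervals_mm, dtype=float)
    return tuple(int(round(s)) for s in shape * ivals / d_mm)

def nodule_diameter_px(diameter_mm, d_mm):
    """Nodule diameter in pixels at sampling distance d_mm."""
    return diameter_mm / d_mm

# hypothetical raw CT: 133 slices at a 2.0 mm Z-interval,
# each 512 x 512 pixels at 0.7 mm X/Y-intervals
print(resampled_shape((133, 512, 512), (2.0, 0.7, 0.7), 1.0))  # (266, 358, 358)
print(nodule_diameter_px(3.66, 2.0))                           # 1.83
```

As in Table 1, halving the sampling distance doubles the pixel diameter of a nodule, so small nodules occupy more pixels and are easier for the model to characterize.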

Our 3D CNN Model for Lung Nodule Detection

Figure 1. Our 3D CNN model architecture (crop size = 128 x 128 x 128).

Using the common philosophy of prior networks4‒6 and the Tianchi models as our guides, we constructed a 3D CNN model for lung nodule detection, as shown in Figure 1. The model is divided into a down-sampling part and an up-sampling part. The down-sampling part consists of five 3D residual blocks interleaved with four pooling layers. Each residual block is made up of convolution, batch normalization, ReLU, and other operations, together with a residual structure (C1 and C2 in Figure 1). The up-sampling is done through two de-convolutions (Deconv in Figure 1). We combined the output of each deconvolution with the output of the corresponding down-sampling layer to generate feature maps containing both local and global information from the original input data.

For each input crop (m × m × m), our model generated (m/4) × (m/4) × (m/4) × 3 bounding cubes, called candidates, and associated each cube with a probability (that is, the likelihood that the cube contains a nodule), the coordinates of the cube’s center, and the size of the cube.
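The candidate-grid geometry can be traced with simple arithmetic. This is a sketch of the shape bookkeeping only, not the model itself; the assumption of 2× pooling and 2× deconvolution strides follows the architecture described above.

```python
def candidate_grid(m, num_pools=4, num_deconvs=2, anchors=3):
    """Trace the spatial side length of an m x m x m crop through the network.

    Four 2x poolings shrink the side to m/16; two 2x deconvolutions bring
    it back up to m/4, giving (m/4)^3 anchor positions with `anchors`
    candidate cubes each.
    """
    side = m
    for _ in range(num_pools):
        side //= 2
    for _ in range(num_deconvs):
        side *= 2
    return side, side ** 3 * anchors

side, n_candidates = candidate_grid(128)
print(side, n_candidates)  # 32 98304
```

So a single 128 × 128 × 128 crop already yields nearly 100,000 candidates, which is why the false-positive filtering discussed below matters in a full pipeline.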

Post-processing steps, such as false-positive reduction, usually follow the model to filter out false-positive candidates. However, since this paper focuses on engineering considerations that impact the effectiveness and efficiency of the model itself, we did not develop these steps. Even without such post-processing enhancements, our trained model (CCELargeCubeCnn9), submitted to the LUNA’16 competition, ranked number 14 on its leaderboard, which demonstrates that our model indeed reflects state-of-the-art lung nodule detection.

Highly Efficient 3D CNN Primitives on the CPU

Our experiments compared the GPGPU and CPU platforms by running our model with different resolutions and hyperparameters (for example, batch size). Therefore, computing efficiency on the CPU platform had to be guaranteed first, especially for 3D convolution, the most frequently used primitive.

3D Convolution on the CPU

We implemented a highly efficient 3D convolution primitive on the CPU by leveraging the highly optimized 2D convolution in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)7 (see Figure 2). First, we treated the 3D data and kernels as groups of 2D slices. A 3D convolution is then equivalent to convolving the corresponding 2D slices (those with the same color in Figure 2) and summing all the intermediate results. Because the Intel MKL-DNN 2D convolutions are highly optimized for the CPU, our 3D convolutions also run highly efficiently on the CPU.

Figure 2. Highly efficient 3D convolution (leveraging the Intel® Math Kernel Library for Deep Neural Networks 2D convolution).
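The slice decomposition can be sketched in NumPy. This is a naive reference implementation for clarity, not the Intel MKL-DNN code; the `conv2d` helper stands in for the optimized 2D primitive.

```python
import numpy as np

def conv2d(img, ker):
    """Naive 'valid' 2D convolution (stand-in for the MKL-DNN 2D primitive)."""
    kh, kw = ker.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * ker)
    return out

def conv3d_by_slices(vol, ker):
    """3D convolution expressed as a sum of 2D slice convolutions."""
    kd = ker.shape[0]
    od = vol.shape[0] - kd + 1
    out = None
    for dz in range(kd):
        # convolve matching 2D slices, then accumulate the partial results
        partial = np.stack([conv2d(vol[z + dz], ker[dz]) for z in range(od)])
        out = partial if out is None else out + partial
    return out

rng = np.random.default_rng(0)
vol, ker = rng.standard_normal((6, 6, 6)), rng.standard_normal((3, 3, 3))
print(conv3d_by_slices(vol, ker).shape)  # (4, 4, 4)
```

The outer loop runs only `kd` times (the kernel depth), so nearly all the work lands in the 2D convolutions, which is exactly where the library optimization pays off.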

To show the effect, we also provided a baseline implementation, called single-precision general matrix multiplication (SGEMM)-based 3D convolution, which follows a more straightforward approach. Figure 3 illustrates how a 2D GEMM-based convolution works: the data and kernels are rearranged so that the original convolution can be computed as a matrix multiplication. We applied the same idea to 3D data and kernels to get a 3D GEMM-based convolution. Since the core computations are matrix multiplications, for which we could leverage the highly optimized SGEMM implementation in the Intel® Math Kernel Library (Intel® MKL)8, this baseline implementation achieves reasonable performance on the CPU.

Figure 3. GEMM-based 2D convolution (leveraging the Intel® Math Kernel Library SGEMM).
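The GEMM lowering can be sketched as follows: an im2col-style rearrangement turns every kernel-sized 3D patch into one column of a matrix, so the whole convolution becomes one matrix product. NumPy's `@` stands in for the Intel MKL SGEMM call; this is an illustrative single-channel sketch, not the baseline code itself.

```python
import numpy as np

def im2col3d(vol, kshape):
    """Rearrange every kernel-sized 3D patch into one column of a matrix."""
    kd, kh, kw = kshape
    od = vol.shape[0] - kd + 1
    oh = vol.shape[1] - kh + 1
    ow = vol.shape[2] - kw + 1
    cols = np.empty((kd * kh * kw, od * oh * ow))
    i = 0
    for z in range(od):
        for y in range(oh):
            for x in range(ow):
                cols[:, i] = vol[z:z + kd, y:y + kh, x:x + kw].ravel()
                i += 1
    return cols, (od, oh, ow)

def conv3d_gemm(vol, ker):
    """3D convolution as a single matrix multiplication (SGEMM stand-in)."""
    cols, oshape = im2col3d(vol, ker.shape)
    return (ker.ravel() @ cols).reshape(oshape)

# an all-ones kernel turns the convolution into a sliding-window sum
vol = np.arange(27.0).reshape(3, 3, 3)
print(conv3d_gemm(vol, np.ones((2, 2, 2)))[0, 0, 0])  # 52.0
```

The rearrangement duplicates overlapping patch data, which is why the GEMM-based baseline trades extra memory traffic for the convenience of a single, highly tuned matrix multiply.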

Figure 4 shows that our highly efficient 3D convolution implementation outperformed the GEMM-based one: the forward pass was accelerated by 4X, the backward pass by 30 percent, and the overall computation by 2X.

Figure 4. Execution time comparison (highly efficient 3D convolution versus GEMM-based 3D convolution).

Other 3D CNN Primitives on the CPU

In addition to 3D convolution, the efficiency of all related 3D primitives must be guaranteed. Thanks to Intel MKL and Intel MKL-DNN, we implemented highly efficient 3D primitives, including 3D batch normalization, 3D deconvolution, 3D pooling, 3D softmax loss, 3D cross-entropy loss, 3D smooth L1 loss, 3D concat, and so on, and packaged them into Extended-Caffe10, the core of the Tianchi software stack. Figure 5 shows the overall efficiency improvement for the training and inference of our model.

Figure 5. Overall efficiency improvement (optimized versus unoptimized). Training time is measured per iteration (that is, the time to process one crop), while inference time is measured per CT image (that is, the time to process all the crops of one CT image).

Experimental Results and Quantitative Analysis

Model Training

We used stochastic gradient descent (SGD) with a stepwise learning rate schedule to train our model. All network parameters were initialized randomly, and the initial learning rate was set to 0.01. We trained the model for 100 epochs and reduced the learning rate by 10X at the 50th and 80th epochs. Figure 6 records the loss trends when we trained on the LUNA’16 dataset (subset 0, as an example). The model successfully converged on this dataset between the 80th and 100th epochs.

Figure 6. An example of model training (with LUNA’16 subset 0).
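The stepwise schedule described above amounts to a simple lookup. A minimal sketch, assuming zero-based epoch numbering:

```python
def learning_rate(epoch, base_lr=0.01, steps=(50, 80), factor=0.1):
    """Stepwise schedule: scale base_lr by `factor` at each step epoch.

    With the defaults this reproduces the training setup in the text:
    lr = 0.01 for epochs 0-49, 0.001 for 50-79, and 0.0001 for 80-99.
    """
    lr = base_lr
    for step in steps:
        if epoch >= step:
            lr *= factor
    return lr

print(learning_rate(0), learning_rate(60), learning_rate(90))
```

Dropping the learning rate late in training is what produces the two visible loss plateaus before final convergence in curves like Figure 6.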

Detection Performance Evaluation

Well-known competitions such as Tianchi and LUNA’16 evaluate a model’s detection performance using FROC (free-response receiver operating characteristic) analysis11. A candidate (that is, a bounding cube) is considered a true positive if the distance between the center of the candidate and the center of the ground truth nodule is shorter than the radius of the nodule. A FROC score is then calculated to quantify the sensitivity of the model versus the average number of false positives.
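The hit criterion can be sketched as follows. This is an illustrative sketch, not the official evaluation script; the `sensitivity` helper and the sample coordinates are made up for demonstration.

```python
import numpy as np

def is_hit(cand_center, nodule_center, nodule_diameter):
    """True positive: candidate center lies within the nodule's radius."""
    dist = np.linalg.norm(np.asarray(cand_center) - np.asarray(nodule_center))
    return dist < nodule_diameter / 2.0

def sensitivity(candidates, nodules):
    """Fraction of ground-truth nodules hit by at least one candidate."""
    hits = sum(any(is_hit(c, center, d) for c in candidates)
               for center, d in nodules)
    return hits / len(nodules)

# two ground-truth nodules (center, diameter); two candidate centers
nodules = [((10.0, 10.0, 10.0), 6.0), ((40.0, 40.0, 40.0), 4.0)]
candidates = [(11.0, 10.0, 10.0), (80.0, 80.0, 80.0)]
print(sensitivity(candidates, nodules))  # 0.5
```

A full FROC curve then plots this sensitivity while sweeping the probability threshold on the candidates, against the resulting average number of false positives per scan.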

The Impact of Resolution

Figure 7 shows the FROC scores of our 3D CNN model versus the resolution of the input data used for model training. The model trained with higher resolutions achieved higher FROC scores, which quantitatively demonstrates that higher resolution helps improve a model’s detection performance.

Figure 7. Higher resolution leads to higher FROC scores.

Because human radiologists can easily detect large nodules but find it much harder to detect smaller ones, the capability of an AI solution to detect small nodules is especially in demand. Figure 8 compares our model’s accuracy across different resolutions in terms of detecting nodules of different sizes. Higher resolutions especially improve the detection performance on smaller nodules.

Figure 8. Higher resolution improves the detection accuracy on smaller nodules.

Memory Consumption Analysis

We analyzed memory consumption when training our model at different resolutions and compared the behaviors of the CPU and GPGPU platforms. Figures 9 (a) and (b) record the cases where the training batch size equals 1 and 4, respectively. When the batch size equals 1, a modern GPGPU with 12 GB of memory can only support resolutions up to 128 x 128 x 128, while a CPU platform with 384 GB of memory can easily support up to 448 x 448 x 448. When the batch size equals 4, the GPGPU situation worsens: only up to 96 x 96 x 96 can be supported, while the CPU can easily support up to 256 x 256 x 256.

(a) Batch size=1

(b) Batch size=4

Figure 9. Memory consumption versus different resolutions.
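The cubic growth of activation memory with resolution explains the gap. A rough back-of-the-envelope model, where the channel count and forward/backward factor are made-up illustrative constants rather than measured values:

```python
def activation_bytes(m, batch=1, channels=24, bytes_per_value=4,
                     fwd_bwd_factor=2):
    """Rough activation-memory model: storage grows with batch * m^3.

    channels and fwd_bwd_factor are illustrative assumptions; the point
    is the cubic dependence on the crop side length m.
    """
    return batch * channels * m ** 3 * bytes_per_value * fwd_bwd_factor

# the resolution ratio dominates: (448/128)^3 = 42.875x more activation memory
ratio = activation_bytes(448) / activation_bytes(128)
print(round(ratio, 1))  # 42.9
```

Whatever the exact constants, going from 128 to 448 per side multiplies activation storage by roughly 43x, and batch size multiplies it again linearly, which matches the qualitative picture in Figure 9.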

Since a modern CPU server can have terabytes of memory capacity, which will especially be the case once Intel’s Apache Pass technology becomes available, the CPU platform will offer model designers in the medical image analysis domain nearly unlimited flexibility to explore extremely high-resolution solutions in pursuit of optimal detection performance.


The Tianchi Healthcare AI Competition, co-sponsored by Alibaba Cloud, Intel, and LinkDoc, was a cloud-based AI challenge built upon Intel’s deep learning hardware and software. In this paper, we discussed the key insights into the medical AI domain that we derived from our Tianchi-based experiments. First, we developed a 3D CNN model that reflects the state of the art in lung nodule detection. Next, we quantitatively showed that a model trained with higher-resolution data achieves better detection performance, especially for small nodules, while consuming much more memory. We then compared the GPGPU and CPU platforms and showed that the CPU, thanks to its large memory capacity, enables medical AI designers to explore much higher-resolution solutions in pursuit of optimal detection performance. We also introduced the Extended-Caffe framework as an example to demonstrate that the CPU architecture supports highly efficient 3D CNN computations, so that developers can use the CPU both effectively and efficiently for 3D CNN model development.


  2. Bush, I. Lung Nodule Detection and Classification. Technical report, Stanford Computer Science, 2016.
  4. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  5. Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015, Springer International Publishing, 2015, pp. 234–241.
