How Can AI Advance Cervical Cancer Detection Using Convolutional Neural Networks

Published: 12/29/2017  

Last Updated: 12/28/2017

Indrayana Rustandi employs convolutional neural networks to improve cervical cancer screening.

This is one in a series of case studies showcasing finalists in the Kaggle* Competition sponsored by Intel and MobileODT*. The goal of this competition was to use artificial intelligence to improve the precision and accuracy of cervical cancer screening.


More than 1,000 participants from over 800 data scientist teams developed algorithms to accurately identify a woman’s cervix type based on images as part of the Intel and MobileODT* Competition on Kaggle. Such identification can help prevent ineffectual treatments and allow health care providers to offer proper referrals for cases requiring more advanced treatment.

This case study follows the process used by second-place winner Indrayana Rustandi to build a deep learning model that improves this life-saving diagnostic procedure. His approach relied primarily on convolutional neural networks.

Kaggle Competitions: Data Scientists Solve Real-World Problems with Machine Learning

The goal of Kaggle competitions is to challenge and incentivize data scientists globally to create machine-learning solutions in a wide range of industries and disciplines. In this particular competition – sponsored by Intel and MobileODT, developer of mobile diagnostic tools – more than 1,000 participants from over 800 data scientist teams each developed algorithms to correctly classify cervix types based on cervical images.

In the screening process for cervical cancer, some patients require further testing while others don't; because this decision is so critical, an algorithm-aided determination can improve the quality and efficiency of cervical cancer screening for these patients. The challenge for each team was to develop the most efficient deep learning model for that purpose.

A Kaggle Competition Veteran Takes on Cervical Cancer Screening

Indrayana Rustandi is a quantitative analyst in option market making at Citigroup providing data analytics services. Prior to entering the financial industry, he earned his Ph.D. in computer science at Carnegie Mellon University working on machine learning methods for brain imaging.

Rustandi began dabbling in Kaggle challenges about a year ago and credits his machine learning experience with success as a competitor. “But until this particular competition, I felt I did not spend sufficient time and focus for any single one,” he said. “I wanted to make sure that I could devote enough proper time to work on the competition and only make a submission when I am confident that I have given my best.”

Choosing an Approach to Code Optimization

The methods were largely based on convolutional neural networks, in particular DenseNet-161 and ResNet-152*, both pre-trained on the ImageNet dataset. Custom classification layers were added on top of these networks.

The main framework for the solution was PyTorch*. On a single GPU, one component of the ensemble takes about six hours on average to train with early stopping. Rustandi’s only feature engineering was to apply the cervix segmentation posted in one of the competition kernels and to combine models trained on segmented images with models trained on non-segmented images. “Instead, what I found most useful was the incorporation of the additional data, along with extensive manual filtering of the train+additional data to be used for training – especially because a lot of the images are blurry or might not be relevant at all to the task,” he said.

Software and Hardware Resources Brought into Play

Software used by Rustandi included:

  • Python* 3.5+ in an Anaconda* 4.1 installation
  • PyTorch 0.1.12
  • torchsample 0.1.2
  • TQDM 4.11.2
  • PyCrayon (optional, for TensorBoard* logging)

Rustandi used two workstations, both based on Intel technology. “That way I could run four experiments in parallel,” he explained, “one experiment per GPU.”

Machine Learning Model Design for Training and Testing

Figure 1. A 5-layer dense block with a growth rate of k = 4.

Variations of the model used:

  • Original images, one hidden layer with 192 hidden units prior to the classification layer
  • Original images, two hidden layers each with 192 hidden units prior to the classification layer. “I found that 192 gives the best performance,” he commented.
  • Cropped images using cervix segmentation, one hidden layer with 192 hidden units prior to the classification layer
  • Cropped images using cervix segmentation, two hidden layers each with 192 hidden units prior to the classification layer

For each of these variations, applied to both base networks (DenseNet-161 and ResNet-152), he trained three models with different random seeds, each serving as one component of the ensemble. In all, there were 24 models in the ensemble (4 variations × 2 base networks × 3 seeds). Ensembling, an approach derived from a solution to a previous Kaggle competition, was performed simply by averaging the class probabilities output by each ensemble component.
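The ensemble arithmetic and the averaging step can be sketched in a few lines of plain Python. This is a minimal illustration, not Rustandi's code: the identifiers, the seed values, and the example probabilities are all assumptions made for the sketch.

```python
from itertools import product

# Names below are illustrative labels, not identifiers from the original solution.
BASE_MODELS = ["densenet161", "resnet152"]   # the two base networks named in the text
INPUTS = ["original", "cropped"]             # without / with cervix segmentation
HIDDEN_LAYERS = [1, 2]                       # hidden layers of 192 units each
SEEDS = [0, 1, 2]                            # hypothetical seed values


def ensemble_components():
    """Enumerate every (base model, input, hidden layers, seed) combination."""
    return list(product(BASE_MODELS, INPUTS, HIDDEN_LAYERS, SEEDS))


def average_probabilities(per_model_probs):
    """Average class probabilities across models for a single image.

    per_model_probs: list of [p_type1, p_type2, p_type3], one entry per model.
    """
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[c] for p in per_model_probs) / n_models for c in range(n_classes)]


print(len(ensemble_components()))  # 24

# Made-up outputs of three models for one image.
print(average_probabilities([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.4, 0.5, 0.1]]))
```

Because each model's probabilities sum to one, the averaged vector also sums to one and can be submitted directly as a class-probability prediction.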

Learning the Neural Network Weights

Rustandi trained the networks using stochastic gradient descent with momentum. He used a learning rate of 1e-3 for the first five epochs, followed by a learning rate of 1e-4 with decay.
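A rough sketch of such a schedule and update rule follows. The article gives neither the decay rate nor the momentum coefficient, so the exponential decay of 0.95 per epoch and the momentum of 0.9 below are assumed values chosen only for illustration, and the function names are hypothetical.

```python
def learning_rate(epoch, warm_lr=1e-3, base_lr=1e-4, warm_epochs=5, decay=0.95):
    """Learning rate for a 0-indexed epoch.

    The first `warm_epochs` epochs use `warm_lr`; later epochs use `base_lr`
    multiplied by an exponential decay. decay=0.95 is an assumption: the
    article does not state the decay rate or its form.
    """
    if epoch < warm_epochs:
        return warm_lr
    return base_lr * decay ** (epoch - warm_epochs)


def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update on flat lists of floats.

    v <- momentum * v + g
    w <- w - lr * v
    momentum=0.9 is an assumed, commonly used value.
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


for epoch in (0, 4, 5, 6):
    print(epoch, learning_rate(epoch))
```

The velocity term accumulates past gradients, so consecutive steps in the same direction grow larger while oscillating gradients partly cancel.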

The model was trained for a maximum of 150 epochs with early stopping: “The stopping criterion is to stop training if the current validation error exceeds the best validation error by 0.1,” he explained. “The model with the best validation error is the one chosen to be part of the ensemble.”
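The stopping rule he describes might look like the sketch below. It reads "exceeds the best validation error by 0.1" as "is more than 0.1 above the best error so far", which is an interpretation, and it takes a list of validation errors as a stand-in for actually training and evaluating a network each epoch.

```python
def train_with_early_stopping(validation_errors, max_epochs=150, tolerance=0.1):
    """Return (best_epoch, best_error, last_epoch_run).

    validation_errors: iterable of per-epoch validation errors, standing in
    for "train one epoch, then evaluate on the validation set".
    """
    best_error = float("inf")
    best_epoch = None
    last_epoch = None
    for epoch, error in enumerate(validation_errors):
        if epoch >= max_epochs:
            break
        last_epoch = epoch
        if error < best_error:
            # A real training loop would save a checkpoint of the weights here.
            best_error, best_epoch = error, epoch
        elif error > best_error + tolerance:
            # Current error is more than `tolerance` above the best: stop.
            break
    return best_epoch, best_error, last_epoch


# Made-up validation errors: training stops at epoch 4, best model is epoch 2.
print(train_with_early_stopping([0.9, 0.7, 0.65, 0.72, 0.80, 0.6]))
```

The checkpoint from the best epoch, not the last one, is what would join the ensemble.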

Rustandi noticed in the validation set that there were some non-overlapping instances that the models misclassified. “When I looked at the saliency maps, I saw that the base models might focus on different areas of the images when making their decision,” he explained, “hence confirming even further the wisdom of including both base models as part of the ensemble.”

A straightforward way to simplify the methods would be to reduce the number of components in the ensemble. Further improvements might come from using the patient IDs to make sure that the validation dataset does not contain patients present in the training data. Finally, simpler models could be achieved by using less complex base models, such as versions of DenseNet and ResNet with fewer parameters.
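A patient-aware split of this kind can be sketched as follows. It assumes each image comes with a patient ID as a `(patient_id, image_path)` pair; that record format, the 20% validation fraction, and the function name are assumptions for the example rather than details from the competition.

```python
import random
from collections import defaultdict


def split_by_patient(records, val_fraction=0.2, seed=0):
    """Split (patient_id, image_path) records so no patient is in both sets."""
    by_patient = defaultdict(list)
    for patient_id, image_path in records:
        by_patient[patient_id].append(image_path)

    # Shuffle patients, not images, so all images of a patient stay together.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)
    n_val = max(1, round(len(patient_ids) * val_fraction))

    val = [(p, img) for p in patient_ids[:n_val] for img in by_patient[p]]
    train = [(p, img) for p in patient_ids[n_val:] for img in by_patient[p]]
    return train, val


# Made-up records: 10 patients with 2 images each.
records = [(f"patient{i}", f"img_{i}_{j}.jpg") for i in range(10) for j in range(2)]
train, val = split_by_patient(records)
print(len(train), len(val))  # 16 4
```

Splitting at the patient level keeps near-identical images of the same cervix from appearing on both sides, which would otherwise make the validation error look better than it is.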

Data Augmentation

Generally, data augmentation is a useful way to artificially increase the size of the dataset through different transformations, he said. This is particularly helpful in training neural networks because they converge to a better model (as measured by out-of-sample performance) when the dataset used to train them is sufficiently large and representative.
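The article does not say which transformations were used. As a generic illustration only, the sketch below applies random flips and a 90-degree rotation to an image stored as a nested list of pixel values; all function names are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image, rng):
    """Apply each transformation independently with probability 0.5."""
    for transform in (horizontal_flip, vertical_flip, rotate_90):
        if rng.random() < 0.5:
            image = transform(image)
    return image


tiny_image = [[1, 2], [3, 4]]
print(rotate_90(tiny_image))                   # [[3, 1], [4, 2]]
print(augment(tiny_image, random.Random(0)))
```

Each call to `augment` yields a differently oriented copy of the same picture, so the network sees more varied inputs without any new images being collected.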

Training and Prediction Time

On a single GPU (either GTX 1070 or GTX 1080) with 16 images in a mini-batch, each epoch took three to four minutes. So, for the maximum number of 150 epochs, it would take 450-600 minutes (7.5-10 hours) to train a particular model, although early stopping can shorten the time.
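The arithmetic behind these figures is easy to check. The second function is a rough extrapolation that combines numbers quoted elsewhere in this case study (24 ensemble components, about six hours per component with early stopping, four experiments in parallel); the resulting total is an estimate, not a figure reported by Rustandi.

```python
import math


def epoch_budget_hours(epochs, minutes_per_epoch):
    """Worst-case training time in hours when every epoch is run."""
    return epochs * minutes_per_epoch / 60


def ensemble_wall_clock_hours(n_models, hours_per_model, n_gpus):
    """Wall-clock estimate when models are trained one per GPU in rounds."""
    rounds = math.ceil(n_models / n_gpus)
    return rounds * hours_per_model


print(epoch_budget_hours(150, 3), epoch_budget_hours(150, 4))  # 7.5 10.0
print(ensemble_wall_clock_hours(24, 6, 4))                     # 36
```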

If done on individual instances, generating predictions can take up to one second for each instance. However, it proved more efficient to generate predictions for multiple instances simultaneously. On a single GPU, predicting a batch of multiple instances takes roughly the same amount of time as predicting a single instance: up to one second.

Results and Key Findings: ‘The Personal Touch’ Sets This Approach Apart

Because the competition images resembled ordinary RGB images, Rustandi found that, with some refinement, pretrained ImageNet models could extract informative features for classification quite well. The biggest challenge was the existence of two available datasets: the REGULAR dataset and the ADDITIONAL dataset. The ADDITIONAL dataset in particular had the potential to help train the networks. But to get there, certain problems in the dataset had to be addressed, namely its inconsistent image quality (images that were blurry, duplicated, or irrelevant to the task). Similar problems possibly affected the REGULAR dataset as well, but to a lesser degree.

In the end, Rustandi chose to examine each image in both datasets manually, flagging those determined to be problematic for exclusion from training. “I think this step in particular gave me quite an edge over the other competitors,” he said, “since a majority of them seemed to end up not using the ADDITIONAL dataset at all, and hence, not availing themselves of the useful information present in this dataset.”

During stage 1, he chose not to probe the leaderboard at all, declining to make any submissions until stage 2 of the competition. “Instead, I decided to rely on my own validation,” he said. “Also, the final models did not incorporate any stage 1 test data.”





[Confusion matrix: rows are the true classes Type 1, Type 2, and Type 3; columns are the predicted classes Type 1, Type 2, and Type 3.]




Each row in the confusion matrix marks the true class while each column marks the predicted class, using data from stage 1. Each element in the confusion matrix specifies the number of cases for the class in the corresponding row that gets classified as the class in the corresponding column. The sum of elements in each row is the number of total cases for each class in stage 1 data.
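A confusion matrix laid out this way can be computed with a short function such as the sketch below. The labels in the example are invented to show the mechanics and are not the stage 1 results.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count cases per (true class, predicted class) pair.

    Rows follow the true class and columns the predicted class,
    both in the order given by `classes`.
    """
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


classes = ["Type 1", "Type 2", "Type 3"]
# Invented labels for six images, for illustration only.
true = ["Type 1", "Type 2", "Type 2", "Type 3", "Type 3", "Type 3"]
pred = ["Type 1", "Type 2", "Type 3", "Type 3", "Type 2", "Type 3"]

matrix = confusion_matrix(true, pred, classes)
for name, row in zip(classes, matrix):
    print(name, row, "row total:", sum(row))
```

Diagonal entries count correct predictions, and each row total equals the number of cases of that true class.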

Rustandi’s patience and hands-on attention to detail paid off, earning him second place in the Kaggle Competition and a $20,000 prize.

Learn More About Intel Initiatives in AI

Intel commends the AI developers who contributed their time and talent to help improve diagnosis and treatment for this life-threatening disease. Committed to helping scale AI solutions through the developer community, Intel makes AI training and tools broadly accessible through the Intel® AI Developer Program.

Take part as AI drives the next big wave of computing, delivering solutions that create, use and analyze the massive amount of data that is generated every minute.

Sign up with Intel AI Developer Program to get the latest updates on competitions and access to tools, optimized frameworks, and training for artificial intelligence, machine learning, and deep learning.
