Class-imbalanced data can impede the training of deep neural networks for biomedical image analysis. New techniques for approaching this challenge can help improve the accuracy of the results.
“Machine learning algorithms typically work in a closed-loop manner. At a high level, this involves training the model on the available training dataset, followed by evaluation of the prediction performance and repeating the process as more data is available. The choice of evaluation techniques depends on factors such as target variable class, the type of algorithm, etc. As more data points become available, the algorithm learns better and its performance is expected to get better. However, the catch here is not just having more data points, but data that is meaningful and diverse. For instance, an image recognition model trained on numerous images of Labradors in meadows ended up classifying green grass as Labradors. Predictions like these not only seem trivial, but also negatively impact the end user’s credibility of the recommendations.”1
— Pavitra Srinivasan, founding member of humans for AI
Typically, standard algorithms assume or expect balanced class distributions or equal misclassification costs. When presented with complex imbalanced data sets, however, these algorithms do not properly represent the distributive characteristics of the data, resulting in inaccuracies across the data classes. In real-world applications, this imbalanced learning problem often recurs, representing an important issue with wide-ranging implications.
Research by Intel® Student Ambassador Subhashis Banerjee into this imbalanced dataset challenge is heading in promising directions that could significantly improve how deep learning models are trained on such data.
Background and Project History
Subhashis Banerjee attends the Indian Statistical Institute at Kolkata while also serving as an Intel Student Ambassador, sharing his research and insights into the latest AI methodologies and emerging techniques. His recent research focuses on biomedical image analysis, including computer-aided disease localization and segmentation from medical images, such as those provided by magnetic resonance image (MRI) scans. Another area of interest is disease classification using quantitative imaging features extracted from medical images, based on computational intelligence, computer vision, and machine learning techniques.
“While working with biomedical data,” Subhashis said, “I have observed that medical imaging datasets are heavily imbalanced in nature, where the frequency of one class—for example, an image showing a cancerous tumor—can be a thousand times less than another class, such as the images for healthy patients or images when the tumor is benign.”
Deep Learning Handicapped by Imbalanced Data
“Class imbalance can have significant detrimental effects on training of traditional classifiers,” Subhashis said. “It affects both convergence during the training phase and generalization of a model on the test set. Lately, deep learning has achieved great success because of its high learning capacity and by automatically learning accurate underlying descriptors of input data. However, deep learning is still handicapped by the negative impact of imbalanced data. While this issue is addressed thoroughly in traditional machine learning algorithms, no significant research on this issue for deep networks (with application to real medical imaging datasets) is available in the literature.”
“Medical image datasets are predominantly composed of normal samples with only a small percentage of abnormal ones, leading to the so-called class imbalance problems,” Subhashis explained. He noted that class imbalance negatively affects training of machine learning classifiers because most classifiers focus on learning the large classes. This leads to poor classification accuracy for the small classes. In the case of medical diagnosis, however, misclassification costs are often unequal. Classifying the minority samples (for example, the image containing a cancerous tumor) as majority (those images collected from healthy patients or those that depict a benign tumor) implies serious consequences.
Figure 1. Medical team examining an X-ray image.
Key Problems Remain to Be Solved
Despite widespread awareness of the issues associated with data imbalance, Subhashis thinks that many of the key problems remain unresolved, and these are sometimes encountered even more frequently, particularly when medical image datasets are involved. Techniques used to handle class imbalance in cases of classical machine learning (or shallow models) don’t apply well to deep learning applications based on medical image datasets. “In my current project,” Subhashis said, “I would like to do a comprehensive evaluation of the impact of class imbalance in the training dataset for the performance of deep neural networks in medical image analysis. Through investigation, I hope to develop a better approach for applying deep learning techniques to class-imbalanced data.”
A key challenge that Subhashis addressed in the project is that class imbalance can take a variety of forms, particularly in the context of multiclass classification for ConvNets. In some cases, only one class might be under- or over-represented, while in other cases, every class may have a different number of examples. The methodology he adopted first explored the impact of class imbalance on the performance of ConvNets for three main medical image analysis problems:
- Disease or abnormality detection
- Region-of-interest segmentation
- Disease classification from real medical image datasets
“In this project,” Subhashis said, “I am using three large, massively imbalanced datasets (suitable for deep learning) of three different modalities. The modalities include X-ray (NIH Chest X-ray Dataset of 14 Common Thorax Diseases), MRI (MICCAI Brain Tumor Segmentation Challenge dataset), and Color Fundus Photography (Kaggle Diabetic Retinopathy dataset), dedicated to each of the three medical image analysis tasks.”
These analysis tasks, Subhashis explained, include:
- Automated detection of 14 common thorax diseases
- Segmentation of high- and low-grade brain tumors from brain MRI images
- Classification of diabetic retinopathy into five classes
“Imagine you are a medical professional who is training a classifier to detect whether an individual has an extremely rare disease. You train your classifier, and it yields 99.9% accuracy on your test set. You’re overcome with joy by these results, but when you check the labels output by the classifier, you see it always output ‘No Disease,’ regardless of the patient data. What’s going on? Because the disease is extremely rare, there were only a handful of patients with the disease in your dataset, compared to the thousands of patients without the disease. Because over 99.9% of the patients in your dataset don’t have the disease, any classifier can achieve an impressively high accuracy simply by returning ‘No Disease’ to every new patient.”
— Manojit Nandi, author at Data Science Blog by Domino2
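The scenario in this quote can be reproduced in a few lines. The following is a minimal sketch using made-up numbers (1 diseased patient in 1,000, which is an assumption chosen to match the 99.9% figure) and a degenerate classifier that always answers "No Disease".

```python
# Hypothetical screening set: 1 = disease, 0 = no disease.
labels = [1] * 1 + [0] * 999

# A degenerate "classifier" that ignores its input and always says "no disease".
predictions = [0 for _ in labels]

# Overall accuracy: fraction of patients labeled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the disease class: fraction of diseased patients that were found.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
positives = sum(y == 1 for y in labels)
recall = true_positives / positives

print(f"accuracy = {accuracy:.3f}")
print(f"recall on disease class = {recall:.3f}")
```

The accuracy is very high even though not a single diseased patient is detected, which is why accuracy alone says little about a classifier trained on heavily imbalanced data.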
Figure 2 shows X-ray examples of the disease screening performed and graphically depicts the classes in the dataset.
Similarly, Figure 3 illustrates the subtypes of diabetic retinopathy progression, corresponding with five ordered classes, and broken down by the frequency of their occurrence as shown in the pie chart.
Figure 4 depicts a sample segmentation of a brain tumor, using voxel counts and labels to indicate the frequency of occurrences.
Figure 4. Sample segmentation of a brain tumor with the corresponding voxel count for each class.
“This project is ongoing,” Subhashis said. “I have been working on it for the last six months. Already, I have proposed a new loss function, which is a sum of two losses: Generalized Dice Loss (GDL) and Weighted Log Loss (WLL) for handling class imbalance for the brain tumor segmentation from MRI data (MICCAI Brain Tumor Segmentation Challenge 2018 dataset).”
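The article does not give the exact form of this combined loss, so the following is only an illustrative NumPy sketch, not the author's implementation. It assumes the commonly used Generalized Dice formulation (each class weighted by the inverse square of its voxel count) and a weighted cross-entropy whose class weights are inverse class frequencies. The function names, the weights, and the toy data are all assumptions.

```python
import numpy as np

EPS = 1e-7  # guards against division by zero and log(0)

def generalized_dice_loss(y_true, y_pred):
    """y_true: one-hot labels, y_pred: probabilities; both (n_voxels, n_classes)."""
    # Inverse squared class volume, so rare classes count as much as common ones.
    # Assumes every class occurs at least once in y_true.
    w = 1.0 / (y_true.sum(axis=0) ** 2 + EPS)
    intersection = (w * (y_true * y_pred).sum(axis=0)).sum()
    union = (w * (y_true + y_pred).sum(axis=0)).sum()
    return 1.0 - 2.0 * intersection / (union + EPS)

def weighted_log_loss(y_true, y_pred, class_weights):
    """Cross-entropy in which each class's term is scaled by class_weights."""
    p = np.clip(y_pred, EPS, 1.0)
    per_voxel = -(class_weights * y_true * np.log(p)).sum(axis=1)
    return per_voxel.mean()

def combined_loss(y_true, y_pred, class_weights):
    """Sum of the two losses, as described in the text."""
    return (generalized_dice_loss(y_true, y_pred)
            + weighted_log_loss(y_true, y_pred, class_weights))

# Toy example: 97 background voxels and 3 tumor voxels.
voxel_labels = np.array([0] * 97 + [1] * 3)
y_true = np.eye(2)[voxel_labels]
counts = y_true.sum(axis=0)
class_weights = counts.sum() / (len(counts) * counts)  # inverse frequency (assumed)

perfect = y_true.copy()
all_background = np.tile([1.0, 0.0], (100, 1))

loss_perfect = combined_loss(y_true, perfect, class_weights)
loss_background = combined_loss(y_true, all_background, class_weights)
print(loss_perfect, loss_background)
```

In this toy case, predicting "background" everywhere is 97% accurate per voxel, yet the combined loss stays large because both terms up-weight the rare tumor class.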
The research outcomes to date have been presented in a poster titled Multi Planar Spatial-ConvNet for Segmentation and Survival Prediction in Brain Cancer, which was presented at the MICCAI BraTS 2018 challenge in Spain (see Figure 5). The annual MICCAI (Medical Image Computing and Computer Assisted Intervention) conference is attended by top biomedical scientists, clinicians, engineers, and others involved in medical imaging research and computer-assisted intervention.
Figure 5. Subhashis presented his research project at MICCAI 2018.
“Currently,” Subhashis said, “I am conducting research to solve the class imbalance problem based on two novel ideas: one-class learning and two-phase training with pre-training on randomly oversampled or undersampled datasets. Finally, I will test the effects of class imbalance on classification performance and compare with three proposed methods to determine how to most effectively resolve the class imbalance issue.”
For this research project, Subhashis is using Intel® AI DevCloud with Intel optimized versions of Python*, TensorFlow*, and Keras for implementation and training the models.
“The deep neural networks trained on Intel® AI DevCloud were powered by clusters of Intel® Xeon® Scalable processors, achieving a high degree of training and testing accuracy. This could be because of the performance value of the Intel optimized libraries, as well as the rounding of floating-point numbers on CPU, which helps achieve effective training of the network through gradient descent optimization.”
Subhashis explained that since the proposed solutions to handle the class imbalance problem require custom loss functions and layers (that are not available in the existing libraries provided by Keras and TensorFlow), CPUs offer a better solution than GPUs for implementation. Typically, GPUs require more time for compiling. Another factor is that CPUs provide an interface for integrating the custom layer definitions into the underlying framework.
The real-world applications of this project promise to improve diagnostic efforts for efficiently and accurately detecting disease using radiological images that are analyzed with deep learning techniques. This could provide significant benefits to researchers and medical practitioners, specifically those individuals working in medical image analysis, machine learning, and radiogenomics.
“The huge amounts of data generated by the healthcare industry are too complex and voluminous to be processed and analyzed manually. Machine learning and deep learning provide the methodology and technology to transform these mounds of data into useful information for decision making. To efficiently train a deep/machine learning model on medical datasets we have to deal with the inherent class imbalance issue.”
— Subhashis Banerjee, Intel Student Ambassador
New Ideas for Solving the Class Imbalance Problem
The research being conducted by Subhashis Banerjee includes two novel approaches for dealing with the issue of imbalanced medical image datasets.
One-class learning – Using this approach, the training process focuses on a single classifier for each class. The samples of that class are designated positive and all other samples are considered negative. As the training progresses, this concept-learning technique recognizes positive instances, rather than discriminating between two classes. The novel approach uses deep autoencoders trained to perform auto-associative mapping (which is essentially an identity function). Classification of a new example is defined based on a reconstruction error between the input and output patterns (for example, absolute error, sum of squared errors, Euclidean or Mahalanobis distance).
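The reconstruction-error idea can be sketched without a deep network. The example below is an illustration only: it swaps the deep autoencoder for a linear one obtained from a truncated SVD (equivalent to PCA), uses synthetic 10-dimensional features, and sets the decision threshold at the 95th percentile of the training errors. All of these choices are assumptions made for the sketch and are not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "target class" samples lying close to a 2-D subspace of a 10-D space.
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(200, 2)) @ basis + 0.05 * rng.normal(size=(200, 10))

# Linear autoencoder stand-in: the top-2 principal directions act as the
# encoder weights, and their transpose acts as the decoder.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]

def reconstruction_error(x):
    """Sum of squared errors between each input row and its reconstruction."""
    centered = x - mean
    code = centered @ components.T      # encode
    reconstructed = code @ components   # decode
    return ((centered - reconstructed) ** 2).sum(axis=1)

# Accept a sample as "target class" if its error is no larger than the
# 95th percentile of the errors seen on the training samples (assumed rule).
threshold = np.percentile(reconstruction_error(train), 95)

def is_target_class(x):
    return reconstruction_error(x) <= threshold

inlier = rng.normal(size=(1, 2)) @ basis    # drawn from the same subspace
outlier = 3.0 * rng.normal(size=(1, 10))    # unrelated to the training class
print(is_target_class(inlier), is_target_class(outlier))
```

Only samples of the target class are needed to fit the model, which is what makes the approach attractive when the other classes are scarce or poorly defined.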
Two-phase training with pre-training on a randomly oversampled or undersampled dataset – Two variants of the two-phase training method are being explored through the current research. One is conducted on an oversampled dataset; the other, on an undersampled dataset. An initial attempt will be made to balance the dataset with oversampling through data augmentation or undersampling through elimination of redundant examples near the boundary between classes. The next stage is pre-training the convolutional neural network (ConvNet) on the balanced dataset and then fine-tuning the last output layer before softmax on the original, imbalanced dataset. This is done while maintaining the same hyperparameters and learning rate decay policy as in the first phase.
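The balancing step of the first phase can be sketched as follows. This is a simplified illustration that uses plain random duplication and random removal of sample indices, not the augmentation-based oversampling or boundary-aware undersampling described above. The helper names and the 95/5 class split are hypothetical, and the second phase (fine-tuning the output layer on the original data) is not shown.

```python
import random
from collections import Counter, defaultdict

def group_by_class(labels):
    """Map each class label to the list of sample indices that carry it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return groups

def random_oversample(labels, rng):
    """Duplicate minority-class indices at random until all classes match the largest."""
    groups = group_by_class(labels)
    target = max(len(idx) for idx in groups.values())
    balanced = []
    for idx in groups.values():
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    rng.shuffle(balanced)
    return balanced

def random_undersample(labels, rng):
    """Keep a random subset of each class, sized to match the smallest class."""
    groups = group_by_class(labels)
    target = min(len(idx) for idx in groups.values())
    balanced = []
    for idx in groups.values():
        balanced.extend(rng.sample(idx, target))
    rng.shuffle(balanced)
    return balanced

rng = random.Random(0)
labels = [0] * 95 + [1] * 5  # hypothetical imbalanced training labels

over = random_oversample(labels, rng)
under = random_undersample(labels, rng)
over_counts = Counter(labels[i] for i in over)
under_counts = Counter(labels[i] for i in under)
print(over_counts, under_counts)
```

The returned index lists would select the samples used for pre-training, while fine-tuning would go back to the full, unmodified label distribution.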
Continuous rounds of testing determine the effects of class imbalance on the classification performance and make it possible to compare other methods to discover which approach achieves the best results. Overall accuracy is generally used as a metric for classifier performance with ConvNets, calculating the proportion of test examples that are correctly classified. This metric, however, has known limitations, particularly when working with imbalanced datasets. When the test set is imbalanced, accuracy favors classes that are over-represented, which in some cases can lead to a misleading assessment. Another prospective issue arises when the test set is balanced, but the training set is imbalanced. This can result in a situation in which a decision threshold is moved to reflect the estimated class prior probabilities and causes a low accuracy measure in the test set, while the true discriminative power of the classifier does not change.
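To make the limitation of overall accuracy concrete, the sketch below compares it with per-class recall and with balanced accuracy (the mean of the per-class recalls) on a made-up three-class test set. Balanced accuracy is shown here only as one widely used alternative. The article does not say which metric the project adopts, and the labels and predictions are invented for illustration.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each class's examples that were classified correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Hypothetical test set: 90 examples of class 0, 8 of class 1, 2 of class 2.
y_true = [0] * 90 + [1] * 8 + [2] * 2
# Hypothetical predictions: class 0 is always right, half of class 1 is
# missed, and class 2 is never predicted.
y_pred = [0] * 90 + [0] * 4 + [1] * 4 + [0] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, balanced accuracy = {balanced:.2f}")
print(recalls)
```

Overall accuracy is dominated by the over-represented class, while the per-class view exposes that the smallest class is never recognized.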
Additional Resources for Developers
“I have gone through technical articles on the Intel® Developer Zone (Intel® DZ),” Subhashis noted, “and attended many live training and webinar sessions to better understand how to improve the model accuracy using libraries optimized on Intel® architecture. I also found that a number of tutorials explaining Intel architecture proved quite useful.”
“I recommend checking out the references to study more about class imbalance,” Subhashis continued, “and Intel DZ, as there are many libraries, tutorials, technical articles, and a plethora of digital content available on machine learning and deep learning that developers can learn from.” (See the Resources section at the end of this document for additional references.)
AI is Expanding the Boundaries of Medicine
Through the design and development of specialized chips, research, educational outreach, and industry partnerships, Intel is accelerating the progress of AI to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, robotics, and other industry sectors. Intel works closely with policymakers, educational institutions, and enterprises of all kinds to uncover and advance solutions that address major challenges in the sciences.
Figure 6. Patient monitoring and treatment options are being enhanced through AI techniques.
“Data streams are often just a messy representation of information, but there’s typically just a small dimension of things that are really important. So, we need to summarize what’s important in a very compact way. That’s where AI comes in. AI allows machines to be the translator of very large data sets that could take a lifetime to go through. We can train a system to look for the usable data that people can derive use from right away.”3
— Naveen Rao, corporate vice president and general manager of the Artificial Intelligence Product Group, Intel
The Intel® AI Developer Program portfolio includes:
Framework Optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.
Intel® Xeon® Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.
Intel® Movidius™ Myriad™ X Vision Processing Unit (VPU): Delivers advanced features for the most demanding computer vision workloads and deep neural network implementations.
Intel® Neural Compute Stick 2: Provides deep learning prototyping at the network edge with always-on vision processing, making it ideal for use in smart security cameras, gesture-controlled drones, industrial machine vision equipment, and more.
Intel® FPGA: Create specialized, custom functionality for a wide variety of electronic equipment, including AI-based solutions and monitoring devices, medical equipment, aircraft navigation devices, system accelerators, and more.
Reinforcement Learning Coach: Provides an open-source research framework for training and evaluating RL agents by harnessing the power of multicore CPU processing to achieve state-of-the-art results.
Intel® Distribution of OpenVINO™ toolkit: Make your vision a reality on Intel® platforms—from smart cameras and video surveillance to robotics, transportation, and more.
Intel® Distribution for Python*: Supercharge applications and speed up core computational packages with this performance-oriented distribution.
Intel® Data Analytics Acceleration Library (Intel® DAAL): Boost machine learning and data analytics performance with this easy-to-use library.
Intel® Math Kernel Library (Intel® MKL): Accelerate math processing routines, increase application performance, and reduce development time.
For more information, visit the portfolio page.
1. Srinivasan, Pavitra. “The art and science of dealing with imbalanced datasets.” Medium.com. February 26, 2018.
2. Nandi, Manojit. “Imbalanced Datasets.” Domino. May 2017.
3. “Creating a smarter AI future for a better world.” Politico. July 2018.