Deep Learning: Build a Black Box Model for Medical Professionals

Published: 07/02/2018  

Last Updated: 07/02/2018

Building a Black Box Model Using Transfer Learning


In the 21st century, an era of big data and rapid innovation in medicine, we frequently hear about artificial intelligence (AI) solutions, based on statistical and machine learning models, that could improve disease prevention, diagnosis, and treatment.

In this paper we present a method for creating models that predict the occurrence of illness from cheap and widely available medical imaging methods such as X-rays. The method reuses some of the trained weights of a state-of-the-art deep learning model.

Black boxes

Sometimes machine learning models are used to decide whether a disease has occurred or which drug will work best in a specific situation. They are called black boxes because their implementation and principle of operation aren't released to the public or aren't well understood (even by their creators); they take input data and output a diagnosis without justifying their conclusion [1]. The secrecy is often driven by competition between companies that don't want to release their secrets. Combined with the private datasets that the algorithms need to work properly (the data has to come from real patients), this can slow down innovation.

On the other hand, deep learning models are sets of millions of parameters whose predictions are very hard to interpret because of the high level of abstraction in their layers and outputs.

Fortunately, they are mostly based on published algorithms that are often implemented in popular open source projects (such as scikit-learn*, TensorFlow*, PyTorch*, or Keras), and we can try to replicate them using public datasets that can be found on websites like Kaggle*. Most libraries have a Python* API, which allows users to write their programs in an easy-to-understand, higher-level language.

For this purpose, the Intel® Distribution for Python* is supplied with most machine learning libraries built with optimizations for Intel® Advanced Vector Extensions (Intel® AVX) instructions and more, which allows for a 10–200 times speedup [2].

Transfer learning and its use in various applications

Transfer learning [3] is a training technique often used with deep convolutional neural networks. It decreases the number of training samples the neural network needs to converge, along with the computation cost. We can assume that low-level and sometimes mid-level features are versatile, that is, similar for the base image dataset and the purpose-specific one. To reuse the pretrained weights, the first few layers are frozen (not adjusted) during training, and the others are fine-tuned (adjusted for a specific task, sometimes with a smaller learning rate).

This training method has been successfully applied by researchers from Yonsei University College of Engineering to classify histopathology images of breast cancer with an area under the receiver-operating characteristic (ROC) curve (AUC) of 0.93, by reusing the Google Inception* v3 pretrained model [4].

Models such as VGG-16*, Inception, and others are mostly trained on the ImageNet* Large Scale Visual Recognition Challenge dataset, which contains many example images for each of 1,000 categories. Using VGG-16 as an example, we can follow the progression of features during forward propagation through the neural network [5]. The filters of the first convolutional layers focus on basic shapes such as lines, arcs, and so on; deeper in the network, the features that the filters look for become more abstract.

flow of features filtering

We can assume that filters for basic shape detection will be the same for each dataset. Later layers correspond to higher-level features that could differ for a new dataset; still, layer weights trained on the ImageNet dataset can make training easier when used as a starting point for fine-tuning.

Lastly, the final dense layers (including prediction layers) are often initialized from scratch, especially if the number of classes in the new dataset differs from the ImageNet one. Another reason to train dense layers from scratch is that their weights often hold nothing worth reusing: unlike convolutional layers, dense connections correspond only to the old output classes.

With libraries such as Keras, we can import ready-to-go pretrained models like VGG-16 with an option to drop the top dense layers and add our own. Layers that we choose not to train can then be frozen with a simple for loop:

for layer in model_final.layers[:10]:
    layer.trainable = False
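
To see what that slice covers, the sketch below lists the layer names Keras typically reports for VGG-16 built without the top dense layers and splits them at index 10. The names are written out by hand here as an assumption about the Keras layout (running model.summary() on the real model is the reliable way to confirm them), and split_layers is a hypothetical helper. Note that the input layer counts as layer 0.

```python
# Layer names as they typically appear in Keras VGG-16 without the top
# dense layers (assumed here; verify with model.summary()).
VGG16_LAYER_NAMES = [
    "input_1",
    "block1_conv1", "block1_conv2", "block1_pool",
    "block2_conv1", "block2_conv2", "block2_pool",
    "block3_conv1", "block3_conv2", "block3_conv3", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_pool",
    "block5_conv1", "block5_conv2", "block5_conv3", "block5_pool",
]


def split_layers(layer_names, n_frozen):
    """Return (frozen, trainable) name lists for a layers[:n_frozen] freeze."""
    return layer_names[:n_frozen], layer_names[n_frozen:]


frozen, trainable = split_layers(VGG16_LAYER_NAMES, 10)
print("frozen:   ", frozen)
print("trainable:", trainable)
```

Under this layout, the slice freezes everything up to and including the third block's convolutions, and blocks 4 and 5 stay trainable.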

We can apply transfer learning to a varying range of layers, depending on the size and type of the training dataset, the target accuracy of the model, and hardware resources. Getting the best accuracy sometimes takes several experiments.

This simple scheme presents a few possible options for applying transfer learning. Option A is less invasive: it trains only the dense layers (from scratch) and leaves the convolutional layers frozen during training. Option B (the one I have chosen for this use case) is slightly more invasive because, as in A, we train the dense layers from scratch but also train some convolutional layers to adapt them to the new data (images).

flow of features filtering

The following table details transfer learning possibilities, both advantages and disadvantages:

Range of Trained Layers | A | B
Layers trained from scratch | Hidden and final (prediction) dense layers | Hidden and final (prediction) dense layers
Fine-tuned layers | - | Part of the higher-level convolutional layers
Frozen layers | All non-dense layers | Lower-level convolutional layers
Amount of training data | Small | Big
Similarity of training dataset to the base one | Big | Small
Flexibility | Small: higher-level convolutional filters stay the same | Big: there is room for improving the convolutional filters of some layers
Utilization | Quick training for the easy use case | Length of an epoch may be similar to training from scratch, but thanks to pretrained layers neural network performance may be better

You might ask what the difference is in training only dense layers or fine-tuning convolutional layers. The answer is simple: Even if VGG-16 architecture works very well for ImageNet challenge image categories, your medical use case data might be completely different, and filters in convolutional layers might need to be adjusted. In our case, VGG-16 with just the A-option type of learning did not seem to converge, and since we have a large dataset to utilize I have chosen to train more layers. It allowed me to get a few percent more in accuracy. However, your data situation might be different, and you will need to try different numbers of trained layers.

Use case

Transfer learning applied on National Institutes of Health (NIH) Chest X-ray dataset from Kaggle.

Problem statement

To present deep learning methods for medical imaging diagnostics we use the transfer learning method to fine-tune VGG-16 pretrained on an ImageNet dataset for classification of chest X-ray images to determine whether the patient is healthy or whether he or she has pulmonary infiltration. Due to the significant difference of this data compared to the ImageNet dataset, most of the convolutional layers have been fine-tuned, and dense layers have been trained from scratch.

Software and hardware configurations

All data manipulations and deep neural network training were conducted on the Intel® AI DevCloud, which allows for using multiple nodes, with Intel Distribution for Python 2018 version 2. Each node consists of Intel® Xeon® Gold 6128 processors and 192 GB of RAM. Each user also gets 200 GB of storage and a preconfigured Intel Distribution for Python with Jupyter Notebook* enabled, along with optimized distributions of the following libraries:

  • neon™ framework
  • Intel® Optimization for Theano*
  • Intel® Optimization for TensorFlow*
  • Intel® Optimization for Caffe*
  • Intel Distribution for Python (including NumPy*, SciPy*, and scikit-learn)
  • Keras (2.2.0)
  • Keras-Applications (1.0.2)
  • Keras-Preprocessing (1.0.1)
  • keras-vis (0.4.1)

TensorFlow was additionally updated to 1.6 to gain maximum performance [6].

These are optimized for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions and use the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) library for highly optimized and fast computations.


This dataset was recently released by NIH [7] and consists of 112,120 X-ray images with disease labels from 30,805 unique patients. This medical imaging technique is much cheaper than other methods including computed tomography (CT) imaging, although medical diagnosis based on this data may be more difficult in some cases.

The authors state that labeling isn't 100 percent accurate; rather, its accuracy is estimated at 90 percent, because the labels come from natural language processing (NLP) data mining of text diagnoses, so this should be our baseline.

Code snippets with explanations

First, we need to preprocess image data.

Image Index | Finding Labels | Follow-Up # | Patient ID | Patient Age | Patient Gender | View Position | Original Image Width | Original Image Height | Original Image Pixel Spacing x
 | Mass | 1 | 9142 | 40 | M | AP | 2500 | 2048 | 0.168000
 | Atelectasis | 0 | 20264 | 7 | M | PA | 2458 | 1953 | 0.143000
 | No Finding | 0 | 538 | 72 | F | PA | 2992 | 2991 | 0.143000
 | No Finding | 11 | 14465 | 64 | M | AP | 2500 | 2048 | 0.168000
 | No Finding | 0 | 25919 | 53 | F | PA | 2021 | 2021 | 0.194311

As we can see, the comma-separated values file gives us basic information about each image's location and the patient's gender and age. Most importantly, it shows that the image size isn't constant, whereas training and final evaluation require a constant size.

But before that, we need to extract only images of healthy patients and patients with pulmonary infiltration.

For this use case we will train a neural network for a binary problem: healthy versus pulmonary infiltration.

In [13]:

import os
from glob import glob
# map each image file name to its path on disk
all_image_paths = {os.path.basename(x): x for x in 
                   glob(os.path.join('data',  'images', '*.png'))}
print('Scans found:', len(all_image_paths), ', Total Headers', all_xray_df.shape[0])
all_xray_df['path'] = all_xray_df['Image Index'].map(all_image_paths.get)
# binary label: True when 'Infiltration' appears among the findings
all_xray_df['infiltration'] = all_xray_df['Finding Labels'].map(lambda x: 'Infiltration' in x)

Scans found: 112120, Total Headers 112120

 | Image Index | Finding Labels | Follow-up # | Patient ID | Patient Age | Patient Gender | View Position | Original Image [Width | Height] | Original Image Pixel Spacing [X | y] | Unnamed: 11 | path | infiltration
12540 | 00003275_004.png | No Finding | 4 | 3275 | 41 | F | PA | 2048 | 2500 | 0.168 | 0.168 | NaN | data/images/00003275_004.png | FALSE
45791 | 00011723_018.png | No Finding | 18 | 11723 | 66 | M | AP | 2500 | 2048 | 0.168 | 0.168 | NaN | data/images/00011723_018.png | FALSE
89096 | 00022116_000.png | No Finding | 0 | 22116 | 46 | M | PA | 3056 | 2544 | 0.139 | 0.139 | NaN | data/images/00022116_000.png | FALSE

By resampling the data, we can balance the number of images in both categories, drawing them in random order.

Let's examine the distribution of binary labels.

In [15]:

all_xray_df['infiltration'].hist(figsize = (10, 5))


Now, we balance the distribution in sets.

In [16]:

all_xray_df = all_xray_df.groupby(['infiltration']).apply(lambda x: x.sample(6000, replace = True)).reset_index(drop = True)
all_xray_df[['infiltration']].hist(figsize = (10, 5))


Another important aspect of training a neural model is to split data into training and test datasets, and not allow any information to leak, for proper evaluation.

Split data into training and validation.

In [17]:

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(all_xray_df, 
                                   test_size = 2000, 
                                   random_state = 2018,
                                   stratify = all_xray_df[['infiltration', 'Patient Gender']])

Train samples: 10,000; test samples: 2,000.
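
The split above stratifies by label and gender. Because one patient can contribute several images (see the Follow-up # column), a stricter variant, which is not what the code in this article does, keeps all images of a patient on the same side of the split. The following is a minimal NumPy sketch of that idea; split_by_patient and the toy patient IDs are hypothetical.

```python
import numpy as np


def split_by_patient(patient_ids, test_fraction, seed=2018):
    """Return boolean (train, test) masks such that no patient is in both."""
    patient_ids = np.asarray(patient_ids)
    rng = np.random.default_rng(seed)
    # Shuffle the unique patients, then assign whole patients to the test set.
    unique_ids = rng.permutation(np.unique(patient_ids))
    n_test = max(1, int(round(len(unique_ids) * test_fraction)))
    test_ids = unique_ids[:n_test]
    test_mask = np.isin(patient_ids, test_ids)
    return ~test_mask, test_mask


# Toy example: 8 images from 4 hypothetical patients.
toy_patient_ids = [1, 1, 2, 2, 2, 3, 4, 4]
train_mask, test_mask = split_by_patient(toy_patient_ids, test_fraction=0.25)
print("train images:", int(train_mask.sum()), "test images:", int(test_mask.sum()))
```

With a data frame, the masks could be built from the Patient ID column and used to index the rows, at the cost of losing exact control over the test set size.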

Create Train and Test Datasets

In [2]:

import h5py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skimage import transform, color
from tqdm import tqdm
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
IMG_SIZE = (384, 384)
def load_data(in_df,IMG_SIZE=IMG_SIZE,y_col='infiltration'):
    images = []
    for file in tqdm(in_df['path'].values):
        image = plt.imread(file)
        image_resized = transform.resize(image, IMG_SIZE, mode = 'constant')
        # some scans are stored with color channels; convert them to grayscale
        if not len(image_resized.shape) == 2:
            image_resized = color.rgb2gray(image_resized)
        # rescale each image to the [0, 1] range
        image_resized = (image_resized - image_resized.min()) / (image_resized.max() - image_resized.min())
        images.append(image_resized)
    # add a channel axis and repeat it three times to get a 3-channel input
    out_img = np.expand_dims(np.array(images),axis=-1)
    out_img = np.concatenate((out_img, out_img, out_img), axis = -1)
    return (out_img,in_df[y_col].values)
train_X, train_Y = load_data(train_df)
test_X, test_Y = load_data(test_df)
with h5py.File("xray_dataset4.h5", "w") as h5f:
    h5f.create_dataset('train_X', data=train_X)
    h5f.create_dataset('train_Y', data=train_Y)
    h5f.create_dataset('test_X', data=test_X)
    h5f.create_dataset('test_Y', data=test_Y)
    # mean and standard deviation of the training images, for later standardization
    h5f.create_dataset('z_param', data=np.array([train_X.mean(), train_X.std()]))
print('Images have been saved')

100%|██████████| 10000/10000 [04:26<00:00, 37.50it/s]
100%|██████████| 2000/2000 [02:05<00:00, 15.93it/s]
Images have been saved

As you probably know, deep learning likes a lot of data for training. We are providing 10,000 images for training, which should be more than enough for the binary classification problem.

This simple script does its job by loading all the images, resizing them to 384 x 384 resolution, and saving them to an HDF5 file for later use. We also save the mean and standard deviation of the training images so we can standardize the data later.

Another often-used trick is to augment the data with small rotations, zooms, and shifts, so that in each epoch the neural network doesn't see exactly the same images. We use random horizontal flips, shifts in both width and height, random rotations of at most five degrees, shearing of at most 1 percent, and zooming in the 0–10 percent range.
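
The numeric steps of this preprocessing can be sketched on toy arrays, which may make the shapes easier to follow. This is a minimal NumPy illustration under stated assumptions rather than the article's pipeline: the helper names and the random toy scans are hypothetical, and only the operations (per-image rescaling to [0, 1], repeating the gray channel three times, and standardizing with a saved mean and standard deviation) mirror the script.

```python
import numpy as np


def to_unit_range(image):
    """Rescale one image to [0, 1] (assumes the image is not constant)."""
    return (image - image.min()) / (image.max() - image.min())


def gray_to_three_channels(images):
    """Turn (n, h, w) grayscale images into (n, h, w, 3) by repeating the channel."""
    return np.repeat(images[..., np.newaxis], 3, axis=-1)


def standardize(batch, mean, std):
    """Z-score a batch with a precomputed mean and standard deviation."""
    return (batch - mean) / std


rng = np.random.default_rng(0)
toy_scans = rng.uniform(0, 255, size=(4, 8, 8))  # 4 hypothetical 8 x 8 scans
scaled = np.stack([to_unit_range(img) for img in toy_scans])
batch = gray_to_three_channels(scaled)
z_param = np.array([batch.mean(), batch.std()])  # same role as 'z_param' in the HDF5 file
standardized = standardize(batch, z_param[0], z_param[1])
print(batch.shape, standardized.mean(), standardized.std())
```

Computing the statistics on the training images only and reusing them for the test images keeps the two sets on the same scale.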

At each epoch, the neural network won't be able to overfit too much to training data, because each time it will be differently distorted.

In [14]:

from keras.preprocessing.image import ImageDataGenerator
def get_img_gen():
    # random flips, shifts, rotations, shear, and zoom applied on the fly
    core_idg = ImageDataGenerator(samplewise_center=False, 
                                  horizontal_flip = True, 
                                  vertical_flip = False, 
                                  height_shift_range = 0.15, 
                                  width_shift_range = 0.15, 
                                  rotation_range = 5, 
                                  shear_range = 0.01,
                                  fill_mode = 'nearest',
                                  zoom_range = 0.1)  # value assumed from the 0-10 percent zoom described above
    return core_idg
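
To make the effect of such a generator concrete, here is a minimal NumPy sketch of two of the distortions, a random horizontal flip and a random shift with edge replication (comparable in spirit to fill_mode = 'nearest'). It is an illustration only, not how Keras implements augmentation; random_flip_and_shift and the toy image are hypothetical.

```python
import numpy as np


def random_flip_and_shift(image, rng, max_shift_frac=0.15):
    """Apply a random horizontal flip and a random shift to one 2-D image."""
    height, width = image.shape
    if rng.random() < 0.5:
        image = image[:, ::-1]  # horizontal flip
    max_dy = int(height * max_shift_frac)
    max_dx = int(width * max_shift_frac)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    # Pad by repeating edge pixels, then crop a window offset by the shift.
    padded = np.pad(image, ((max_dy, max_dy), (max_dx, max_dx)), mode="edge")
    top, left = max_dy - dy, max_dx - dx
    return padded[top:top + height, left:left + width]


rng = np.random.default_rng(2018)
toy_image = np.arange(100.0).reshape(10, 10)
# Each call draws new random parameters, so every epoch sees a new variant.
epoch_variants = [random_flip_and_shift(toy_image, rng) for _ in range(3)]
print([variant.shape for variant in epoch_variants])
```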


Here are example results of augmented training samples. As you can see they are distorted, which hopefully helps to train the neural network with more diverse data so it will generalize better on the test dataset.

multiple lungs x-rays

The next part is to create a model and freeze specific layers to preserve already trained, low-level features trained on the ImageNet dataset. On the first epoch, we will pre-train newly added layers so they can keep up with other layers that had a warm start. Then, we will train all layers but the first 10, as in option B.

In [45]:

from keras.applications.vgg16 import VGG16
from keras.layers import GlobalAveragePooling2D, Dense, Dropout
from keras.models import Model
from keras.optimizers import SGD
from keras.callbacks import ModelCheckpoint, EarlyStopping
vgg16 = VGG16(input_shape =  train_X.shape[1:], 
              include_top = False, 
              weights = 'imagenet')
x = vgg16.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(2, activation='softmax')(x)

model_final = Model(inputs=vgg16.input, outputs=x)
# freeze all VGG-16 layers so only the new dense layers are updated in the warm-start epoch
for layer in vgg16.layers:
    layer.trainable = False
# train the newly added dense layers for just one epoch,
# so their weights won't be random when fine-tuning starts
model_final.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model_final.fit_generator(get_img_gen().flow(train_X, train_Y, batch_size = batch_size), 
                    steps_per_epoch=len(train_X) // batch_size, 
                    epochs = 1, max_queue_size=10, workers=12, verbose = 1)
# now freeze only the first 10 layers and unfreeze the rest
for layer in model_final.layers[:10]:
    layer.trainable = False
for layer in model_final.layers[10:]:
    layer.trainable = True
sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True)
# recompile so the new trainable flags and the SGD optimizer take effect
model_final.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
weight_path = "model_best4.h5"
checkpoint = ModelCheckpoint(weight_path, monitor='val_loss', verbose=1, 
                             save_best_only=True, mode='min', save_weights_only = False)
early = EarlyStopping(monitor="val_loss")  # remaining arguments (for example patience) left at their defaults
callbacks_list = [checkpoint, early]
history = model_final.fit_generator(get_img_gen().flow(train_X, train_Y, batch_size = batch_size), 
                    steps_per_epoch=len(train_X) // batch_size,
                    validation_data = (test_X, test_Y), 
                    epochs = 30, max_queue_size=10, workers=12, 
                    callbacks = callbacks_list, verbose = 1)
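
As a quick sanity check on the size of the new head, the arithmetic below works out the spatial size of the last VGG-16 feature map for a 384 x 384 input and the number of weights in the two added dense layers. It assumes the standard VGG-16 layout (five 2 x 2 max-pooling layers and 512 channels in the last convolutional block); the helper names are hypothetical.

```python
def vgg16_feature_map_size(input_size, n_pool_layers=5):
    """Spatial size after VGG-16's five 2x2 max-pooling layers."""
    size = input_size
    for _ in range(n_pool_layers):
        size //= 2  # each pooling layer halves height and width
    return size


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


feature_size = vgg16_feature_map_size(384)
# Global average pooling reduces the 512-channel feature map to 512 values.
hidden_params = dense_params(512, 256)
output_params = dense_params(256, 2)
print(feature_size, hidden_params, output_params)  # 12 131328 514
```

The head therefore adds roughly 130,000 parameters that start from random values, which is small next to the convolutional base and is one reason a single warm-start epoch can be enough for it to catch up.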


By first training only the new dense layers for one epoch and then training all layers but the first 10 (which could be useful even with different data), we give the dense layers a starting point from which they can keep up with the training of the other layers. In my tests, this helped to gain an additional 0.01–0.02 accuracy on the test data. To improve generalization, I used dropout and data augmentation, which allowed for steady training.


This figure presents training after the initial warm-start epoch, which is why the curves aren't so steep. The x-axis shows epochs (indexed from 0, which is the first epoch), and the y-axis shows the measured values: the model's accuracy and loss for both the training and validation datasets. Validation statistics are measured at the end of each epoch, while training statistics are measured after each batch.


Now let's evaluate the trained network's performance on the test data (a separate dataset) without data augmentation.

 | Precision | Recall | F1-score | Support
Healthy | 0.68 | 0.76 | 0.72 | 1000
Infiltration | 0.73 | 0.64 | 0.68 | 1000
Average / total | 0.70 | 0.70 | 0.70 | 2000
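
For readers who want to see how these columns relate, the sketch below recomputes precision, recall, and F1 from confusion counts. The counts are back-computed from the rounded recalls in the table (0.76 and 0.64 of 1,000 images each), so they are an approximation and not figures reported by the source; precision_recall_f1 is a hypothetical helper.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard per-class precision, recall, and F1 from confusion counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Approximate counts implied by the rounded recalls, not reported values.
healthy_correct, healthy_missed = 760, 240
infiltration_correct, infiltration_missed = 640, 360

# A healthy image predicted as infiltration is a false positive for infiltration,
# and the other way around.
healthy = precision_recall_f1(healthy_correct, infiltration_missed, healthy_missed)
infiltration = precision_recall_f1(infiltration_correct, healthy_missed, infiltration_missed)
accuracy = (healthy_correct + infiltration_correct) / 2000

print("healthy:     ", [round(v, 2) for v in healthy])
print("infiltration:", [round(v, 2) for v in infiltration])
print("accuracy:    ", accuracy)
```

Under these assumed counts the rounded values match the table, and overall accuracy comes out at 0.70.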

This seems really nice for such data. The ROC curve looks even better (0.76 AUC), which means that the model ranks infiltration cases above healthy ones clearly better than chance.
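
One way to read an AUC value is as the fraction of (infiltration, healthy) image pairs in which the infiltration image receives the higher score. The NumPy sketch below computes it that way on made-up scores; roc_auc is a hypothetical helper and the numbers are toy data, not the model's outputs.

```python
import numpy as np


def roc_auc(scores_positive, scores_negative):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = np.asarray(scores_positive, dtype=float)[:, np.newaxis]
    neg = np.asarray(scores_negative, dtype=float)[np.newaxis, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)


# Toy scores: 8 of the 9 pairs are ordered correctly.
toy_auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
# A model that gives every image the same score is no better than guessing.
chance_auc = roc_auc([0.5, 0.5], [0.5, 0.5])
print(toy_auc, chance_auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which gives a scale for reading the 0.76 reported here.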


We can compare it to the accuracy measured on test data with the same augmentation as that used during the training.

 | Precision | Recall | F1-score | Support
Healthy | 0.49 | 0.60 | 0.54 | 1000
Infiltration | 0.48 | 0.37 | 0.42 | 1000
Average / total | 0.48 | 0.48 | 0.48 | 2000


This ROC curve for test data with augmentation presents how hard it is to classify unknown data with additional distortions. It is similar to random guessing; this is why we use test data without distortions.

To visualize which regions the neural network focused on when diagnosing patients with pulmonary infiltration, we use the keras-vis* library, which provides an easy-to-use API for extracting gradient-weighted class activation maps (Grad-CAM). These heat maps show which regions are most important for classifying an image into a specific class.
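
The core combination step of Grad-CAM is simple enough to sketch in NumPy: average the gradients of the class score over the spatial positions to get one weight per channel, take the weighted sum of the feature maps, and keep only the positive part. In the sketch below the activations and gradients are random stand-ins (in practice keras-vis obtains them from the trained network), and grad_cam is a hypothetical helper. The 12 x 12 x 512 shape corresponds to the last VGG-16 convolutional block for a 384 x 384 input.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Combine feature maps (h, w, k) with class-score gradients (h, w, k) into a heat map."""
    # One importance weight per channel: the spatial mean of its gradients.
    channel_weights = gradients.mean(axis=(0, 1))
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(activations, channel_weights, axes=([2], [0]))
    cam = np.maximum(cam, 0.0)  # keep only evidence in favor of the class
    peak = cam.max()
    return cam / peak if peak > 0 else cam


rng = np.random.default_rng(0)
toy_activations = rng.random((12, 12, 512))       # stand-in for conv feature maps
toy_gradients = rng.normal(size=(12, 12, 512))    # stand-in for d(score)/d(activation)
heat_map = grad_cam(toy_activations, toy_gradients)
print(heat_map.shape, heat_map.min(), heat_map.max())
```

The resulting low-resolution map is then typically upsampled to the input size and overlaid on the X-ray.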

heat maps of multiple lungs

Wow! Our black box model seems to understand that lungs are important in this task. It appears able to detect infiltration, which could help doctors save time in detecting pulmonary abnormalities and maybe other diseases.


We have successfully trained a VGG-16 neural network on new chest X-ray data using transfer learning, reusing some of the layers and fine-tuning others. This method can be further extended to new labels and data. Class activation maps suggest that the neural network uses the visual data of the lungs to classify the images.

This table presents a comparison of training time (with data augmentation) on the standard TensorFlow wheel and the Intel optimized one.

 | Warm-Start Training Epoch (Dense Layers Only) | Epoch (All But 10 First Layers Were Trained)
Intel® optimized TensorFlow* wheel 1.6 | 650 s/epoch | 1301 s/epoch
Pip* TensorFlow wheel 1.6 | 1713 s/epoch | 3277 s/epoch

By using the Intel® Optimization for TensorFlow* 1.6 wheel with Intel MKL-DNN, I cut training time per epoch by a factor of about 2.5 compared with the standard wheel [6].
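
The ratio can be checked directly from the epoch times in the table. The short calculation below uses only those four numbers; the dictionary keys are hypothetical labels.

```python
# Seconds per epoch, taken from the table above.
epoch_seconds = {
    "warm_start": {"intel": 650, "pip": 1713},
    "fine_tuning": {"intel": 1301, "pip": 3277},
}

# Speedup = time with the standard wheel / time with the Intel optimized wheel.
speedups = {phase: t["pip"] / t["intel"] for phase, t in epoch_seconds.items()}
for phase, ratio in speedups.items():
    print(f"{phase}: {ratio:.2f}x")
# warm_start: 2.64x
# fine_tuning: 2.52x
```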

This can lead to time and money savings, allowing professionals to train deep learning models quicker by utilizing resources better.

GitHub* gist link for the project: Pulmonary infiltration prediction from Chest x-rays with pretrained VGG16 and fine tuning of dense layers.


  1. W. Nicholson Price II, Regulating Black-Box Medicine, Michigan Law Review, vol. 116, no. 3, 2017.
  2. Built for Speed.
  3. M. Oquab, L. Bottou, and I. Laptev, Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks, IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  4. J. Chang, J. Yu, and T. Han, A method for classifying medical images using transfer learning: A pilot study on histopathology of breast cancer, IEEE 19th International Conference on e-Health Networking, Applications and Services, 2017.
  5. K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, Computer Vision and Pattern Recognition, 2014.
  6. Intel / Packages / Tensorflow.
  7. NIH Chest X-rays.
