Using the Intel® Distribution for Python* to Solve the Scene-Classification Problem Efficiently

Published: 05/25/2018  

Last Updated: 05/24/2018

Abstract: The objective of this task is to become familiar with image and scene categorization. First, we extract image features, train a classifier on the training samples, and evaluate the classifier on the test set. Next, we fine-tune pre-trained AlexNet and ResNet models and apply them to the same dataset.

Technology stack: Intel® Distribution for Python*

Frameworks: Intel® Optimization for Caffe* and Keras

Libraries used: NumPy, scikit-learn*, SciPy stack

Systems used: Intel® Core™ i7-6500U processor with 16 GB RAM (Model: HP envy17t-s000cto) and Intel® AI DevCloud


The scene database provides pictures from eight classes: coast, mountain, forest, open country, street, inside city, tall buildings, and highways. The dataset is divided into a training set (1,888 images) and a testing set (800 images), which are placed in separate folders. The associated labels are stored in "train labels.csv" and "test labels.csv." The SIFT word descriptors are also included, in the "train sift features" and "test sift features" directories.

The following are a few of the images from the dataset:

[Figure: sample images. Training set: mountain view, ocean view. Testing set: street view, mountain view, ocean view.]

K-Nearest Neighbor (kNN) Classifier

Bag of visual words

We run the k-means clustering algorithm to build a visual word dictionary. Each SIFT descriptor has 128 dimensions. To build the bag of visual words, we use the SIFT word descriptors included in the "train sift features" and "test sift features" directories.
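The dictionary-building step can be sketched as follows. This is a minimal illustration, not the article's actual code: it assumes scikit-learn's KMeans, uses random arrays in place of the SIFT descriptor files (whose format is not described here), and the function names and the L1 normalization of the histograms are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans


def build_vocabulary(descriptor_sets, n_words, seed=0):
    """Cluster all training descriptors into n_words visual words."""
    all_descriptors = np.vstack(descriptor_sets)  # shape: (total_descriptors, 128)
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(all_descriptors)


def bovw_histogram(descriptors, vocabulary):
    """Turn one image's descriptors into a normalized visual-word histogram."""
    words = vocabulary.predict(descriptors)  # nearest cluster center per descriptor
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # assumption: L1-normalize the counts


# Assumption: random data stands in for the real SIFT descriptors
# (20 images, 30 descriptors each, 128 dimensions per descriptor).
rng = np.random.default_rng(0)
train_descriptors = [rng.random((30, 128)) for _ in range(20)]

vocabulary = build_vocabulary(train_descriptors, n_words=10)
train_histograms = np.array(
    [bovw_histogram(d, vocabulary) for d in train_descriptors]
)
```

Each image is then represented by one fixed-length histogram, whatever the number of descriptors it contains, which is what lets a standard classifier operate on it.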

Classifying the test images

We classify the test images with a k-nearest neighbor (kNN) classifier. The table below lists the accuracy for different numbers of clusters and values of k.
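A kNN classification of bag-of-visual-words histograms might look like the sketch below. It assumes scikit-learn's KNeighborsClassifier with its default Euclidean distance; the article does not state the distance metric it used. The toy histograms and the helper names are hypothetical and only make the example runnable.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def make_toy_histograms(per_class, n_classes=8, seed=0):
    """Hypothetical stand-in data: noisy one-hot histograms, one pattern per class."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_classes), per_class)
    hists = np.eye(n_classes)[labels] + 0.05 * rng.random((labels.size, n_classes))
    return hists / hists.sum(axis=1, keepdims=True), labels


def knn_accuracy(train_x, train_y, test_x, test_y, k):
    """Fit a kNN classifier and return its test accuracy in percent."""
    clf = KNeighborsClassifier(n_neighbors=k)  # assumption: default Euclidean metric
    clf.fit(train_x, train_y)
    return 100.0 * clf.score(test_x, test_y)


train_x, train_y = make_toy_histograms(per_class=10, seed=0)
test_x, test_y = make_toy_histograms(per_class=5, seed=1)
accuracy = knn_accuracy(train_x, train_y, test_x, test_y, k=5)
```

On real data, the number of clusters and k would be swept as in the reported results; the toy classes here are well separated, so the sketch says nothing about the accuracy to expect on the scene dataset.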


Number of Clusters    Number of Neighbors (k)    Accuracy (%)
50                    5                          49.375
50                    15                         52.25
64                    15                         53.125
75                    15                         52.375
100                   15                         54.5
100                   9                          55.25
150                   18                         53.125

Discriminative Classifier: Support Vector Machines (SVMs)

Bag of visual words

We run the k-means clustering algorithm to build a visual word dictionary. Each SIFT descriptor has 128 dimensions. We use the same bag of visual words representation as for the kNN classifier.


Support Vector Machines (SVMs) are inherently two-class classifiers. We train one-vs.-all SVMs to build the multiclass classifier.
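The one-vs.-all scheme can be sketched with scikit-learn's OneVsRestClassifier, which trains one binary SVM per class and predicts the class whose SVM gives the highest decision score. The linear kernel, the value of C, and the toy histograms below are assumptions for illustration; the article does not state the kernel or hyperparameters it used.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


def make_toy_histograms(per_class, n_classes=8, seed=0):
    """Hypothetical stand-in data: noisy one-hot histograms, one pattern per class."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_classes), per_class)
    hists = np.eye(n_classes)[labels] + 0.05 * rng.random((labels.size, n_classes))
    return hists / hists.sum(axis=1, keepdims=True), labels


train_x, train_y = make_toy_histograms(per_class=10, seed=0)
test_x, test_y = make_toy_histograms(per_class=5, seed=1)

# Assumption: linear kernel with C=1.0; one binary SVM is fit per scene class.
svm = OneVsRestClassifier(SVC(kernel="linear", C=1.0))
svm.fit(train_x, train_y)

svm_accuracy = 100.0 * svm.score(test_x, test_y)
n_binary_svms = len(svm.estimators_)  # one "this class vs. the rest" SVM per class
```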


Number of Clusters    Accuracy (%)
50                    40.625
64                    46.375
75                    47.375
100                   52.5
150                   51.875

Transfer Learning and Fine Tuning

Transfer learning

Transfer learning is a popular approach in deep learning in which a model pre-trained on one task is used as the starting point for a model on a second task.

Fine tuning

Fine tuning takes a network model that has already been trained for a given task and adapts it to perform a second, similar task.

How to use it

  1. Select source model: A pre-trained source model is chosen from available models. Many research institutions release models trained on large and challenging datasets, and these may be included in the pool of candidate models from which to choose.
  2. Reuse model: The pre-trained model can then be used as the starting point for a model on the second task of interest. This may involve using all or parts of the model, depending on the modeling technique used.
  3. Tune model: Optionally, the model may need to be adapted or refined on the input-output pair data available for the task of interest.

When and why to use it

Transfer learning is an optimization; it's a shortcut to save time or get better performance.

In general, it is not obvious that there will be a benefit to using transfer learning in the domain until after the model has been developed and evaluated.

There are three possible benefits to look for when using transfer learning:

  1. Higher start: The initial skill (before refining the model) on the source model is higher than it otherwise would be.
  2. Higher slope: The rate of improvement of skill during training of the source model is steeper than it otherwise would be.
  3. Higher asymptote: The converged skill of the trained model is better than it otherwise would be.

We apply transfer learning with the pre-trained AlexNet model to demonstrate the results on the chosen subset of the Places database. We replace only the class-score layer with a new fully connected layer that has eight nodes, one for each of the eight classes.
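Replacing the class-score layer can be sketched in Keras as shown below. This is an illustration under stated assumptions, not the article's setup: ResNet50 stands in for the pre-trained network because AlexNet is not among the models bundled in keras.applications, freezing the pre-trained layers is an assumption (the article does not say whether they were frozen), and the function name, optimizer, and learning rate are hypothetical.

```python
from tensorflow import keras


def build_finetune_model(num_classes=8, weights="imagenet"):
    """Pre-trained backbone with a new fully connected class-score layer."""
    # include_top=False drops the original 1000-way class-score layer.
    base = keras.applications.ResNet50(
        weights=weights,
        include_top=False,
        pooling="avg",
        input_shape=(224, 224, 3),
    )
    base.trainable = False  # assumption: keep the pre-trained layers frozen

    # New class-score layer: one node per scene class.
    outputs = keras.layers.Dense(
        num_classes, activation="softmax", name="scene_scores"
    )(base.output)
    model = keras.Model(inputs=base.input, outputs=outputs)

    # Assumption: SGD with a small learning rate; labels are integer class ids.
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling build_finetune_model() downloads the ImageNet weights on first use; passing weights=None builds the same architecture with random weights, which is useful only for checking shapes. Training would then proceed with model.fit on the resized training images and their labels.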


Architecture Used    Top-1 Accuracy (%)    Top-3 Accuracy (%)    Top-5 Accuracy (%)
AlexNet              51.25                 68.65                 81.35
ResNet               53.45                 74                    87.25
GoogLeNet            52.33                 71.36                 82.84

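Top-1 Accuracy: The percentage of test images whose true label matches the model's highest-scoring prediction.

Top-3 Accuracy: The percentage of test images whose true label is among the model's three highest-scoring predictions.

Top-5 Accuracy: The percentage of test images whose true label is among the model's five highest-scoring predictions.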
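Top-k accuracy can be computed from a matrix of class scores as in the NumPy sketch below. The function name and the small score matrix are made up for illustration; they are not the article's data.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores per row
    hits = (top_k == labels[:, None]).any(axis=1)
    return 100.0 * hits.mean()


# Hypothetical scores for 4 images over 4 classes (rows sum to 1).
scores = np.array([
    [0.6, 0.3, 0.1, 0.0],
    [0.2, 0.5, 0.3, 0.0],
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.1, 0.3, 0.2],
])
labels = np.array([0, 2, 0, 3])

top1 = top_k_accuracy(scores, labels, k=1)  # 25.0
top2 = top_k_accuracy(scores, labels, k=2)  # 50.0
top3 = top_k_accuracy(scores, labels, k=3)  # 75.0
```

Because the set of the k highest-scoring classes grows with k, top-k accuracy never decreases as k increases.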

Training Time Periods (For Fine Tuning)

Architecture Used    System                  Training Time
AlexNet              Intel® AI DevCloud      ~23 min
AlexNet              HP envy17t-s000cto      ~95 min
ResNet               Intel® AI DevCloud      ~27 min
ResNet               HP envy17t-s000cto      ~135 min
GoogLeNet            Intel® AI DevCloud      ~23 min
GoogLeNet            HP envy17t-s000cto      ~105 min

Note: We used smaller datasets to test the speeds and accuracies that can be achieved with the Intel® Distribution for Python*.


From the above experiments, the fine-tuned deep-learning models reach top-1 accuracies of 51.25% to 53.45%, comparable to the best results obtained by extracting features with traditional methods and applying machine-learning techniques (55.25% with kNN and 52.5% with SVMs), and they reach up to 87.25% when the top-5 predictions are considered.

In the future, I want to design a new deep neural network by making some changes to the proposed architecture so that accuracy can be increased further. I would also like to deploy the model on AWS* DeepLens and run it in real time.

The source code is available on GitHub.

Please visit Places for more advanced techniques and datasets.
