Abstract: The objective of this task is to become familiar with image and scene categorization. We first extract image features, train a classifier on the training samples, and then evaluate the classifier on the test set. We then fine-tune pre-trained AlexNet and ResNet models and apply them to the same dataset.
Technology stack: Intel® Distribution for Python*
Frameworks: Intel® Optimization for Caffe* and Keras
Libraries used: NumPy, scikit-learn*, SciPy stack
Systems used: Intel® Core™ i7-6500U processor with 16 GB RAM (Model: HP envy17t-s000cto) and Intel® AI DevCloud
The scene database provides pictures from eight classes: coast, mountain, forest, open country, street, inside city, tall buildings, and highways. The dataset is divided into a training set (1,888 images) and a testing set (800 images), which are placed in separate folders. The associated labels are stored in "train labels.csv" and "test labels.csv." The SIFT word descriptors are also included in the "train sift features" and "test sift features" directories.
The following are a few of the images from the dataset:
K-nearest neighbor (kNN) classifier
Bag of visual words
We run the k-means clustering algorithm to build a visual word dictionary. Each SIFT descriptor has a feature dimension of 128. To build the bag of visual words, we use the SIFT word descriptors provided in the "train sift features" and "test sift features" directories.
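The dictionary-building step can be sketched as follows. This is a minimal illustration using scikit-learn, not the project's actual code: the function names `build_vocabulary` and `bovw_histogram`, the cluster count, and the random stand-in descriptors are all assumptions made for the example. In practice the descriptors would be loaded from the provided SIFT feature directories.

```python
import numpy as np
from sklearn.cluster import KMeans


def build_vocabulary(descriptor_sets, n_clusters, seed=0):
    """Cluster all training SIFT descriptors into n_clusters visual words."""
    # Stack every image's descriptors into one (total_keypoints, 128) matrix.
    all_descriptors = np.vstack(descriptor_sets)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    kmeans.fit(all_descriptors)
    return kmeans


def bovw_histogram(descriptors, kmeans):
    """Turn one image's descriptors into an L1-normalized word histogram."""
    words = kmeans.predict(descriptors)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)


# Hypothetical stand-in data: 10 images, each with 20 to 39 random
# 128-dimensional descriptors (the real ones come from the dataset).
rng = np.random.default_rng(0)
train_descriptors = [
    rng.random((int(rng.integers(20, 40)), 128)) for _ in range(10)
]

vocabulary = build_vocabulary(train_descriptors, n_clusters=5)
train_features = np.array(
    [bovw_histogram(d, vocabulary) for d in train_descriptors]
)
print(train_features.shape)
```

Each image ends up as a fixed-length vector with one entry per visual word, regardless of how many keypoints it had, which is what allows a standard classifier to be trained on top.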
Classifying the test images
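The test images are classified with a k-nearest neighbor (kNN) classifier, which assigns each test image the majority label among its k closest training images in the bag-of-visual-words feature space.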
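As a rough sketch of this step, the snippet below fits scikit-learn's `KNeighborsClassifier` on synthetic histogram-like features. The data, the helper `make_split`, the 50-word vocabulary size, and `n_neighbors=5` are assumptions chosen only so the example runs on its own; they are not the settings used in the experiments.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_classes, n_words = 8, 50

# Hypothetical data: one random prototype histogram per scene class.
prototypes = rng.random((n_classes, n_words))


def make_split(n_per_class):
    """Sample noisy copies of each class prototype (stand-in for BoVW features)."""
    labels = np.repeat(np.arange(n_classes), n_per_class)
    noise = 0.05 * rng.standard_normal((labels.size, n_words))
    return prototypes[labels] + noise, labels


X_train, y_train = make_split(20)
X_test, y_test = make_split(5)

# k is a hyperparameter; 5 is an arbitrary choice for this sketch.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"kNN accuracy: {accuracy:.2%}")
```

In a real run, both the number of clusters and the value of k would be varied, and the resulting test accuracy recorded for each combination.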
|Number of Clusters||k value||Accuracy (%)|
Discriminative Classifier: Support Vector Machines (SVMs)
Bag of visual words
We run the k-means clustering algorithm to build a visual word dictionary. Each SIFT descriptor has a feature dimension of 128. We use the same technique as above for the bag-of-visual-words representation.
Support Vector Machines (SVMs) are inherently two-class classifiers. We use one-vs.-all SVMs to train the multiclass classifier: one binary SVM is trained per class, and a test image is assigned to the class whose SVM gives the highest score.
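A one-vs.-all setup of this kind might look like the sketch below, which wraps scikit-learn's `LinearSVC` in `OneVsRestClassifier`. The linear kernel, `C=1.0`, and the synthetic features are assumptions for illustration; the original experiments may have used a different kernel or settings.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_classes, n_words = 8, 50

# Hypothetical stand-in for bag-of-visual-words histograms.
prototypes = rng.random((n_classes, n_words))
y_train = np.repeat(np.arange(n_classes), 20)
X_train = prototypes[y_train] + 0.05 * rng.standard_normal((y_train.size, n_words))
y_test = np.repeat(np.arange(n_classes), 5)
X_test = prototypes[y_test] + 0.05 * rng.standard_normal((y_test.size, n_words))

# One binary SVM per class; prediction picks the class with the largest score.
svm = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000))
svm.fit(X_train, y_train)

print("binary SVMs trained:", len(svm.estimators_))
print(f"SVM accuracy: {svm.score(X_test, y_test):.2%}")
```

The `estimators_` attribute holds the eight underlying two-class SVMs, which makes the one-vs.-all structure explicit.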
|Number of Clusters||Accuracy (%)|
Transfer Learning and Fine Tuning
Transfer learning is a popular approach in deep learning in which a pre-trained model developed for one task is used as the starting point for a model on a second task.
This process takes a network model that has already been trained for a given task, and makes it perform a second similar task.
How to use it
- Select source model: A pre-trained source model is chosen from the available models. Many research institutions release models trained on large and challenging datasets, and these can be included in the pool of candidate models to choose from.
- Reuse model: The pre-trained model can then be used as the starting point for a model on the second task of interest. This may involve using all or parts of the model, depending on the modeling technique used.
- Tune model: Optionally, the model may need to be adapted or refined on the input-output pair data available for the task of interest.
When and why to use it
Transfer learning is an optimization; it's a shortcut to save time or get better performance.
In general, it is not obvious that there will be a benefit to using transfer learning in the domain until after the model has been developed and evaluated.
There are three possible benefits to look for when using transfer learning:
- Higher start: The initial skill (before refining the model) on the source model is higher than it otherwise would be.
- Higher slope: The rate of improvement of skill during training of the source model is steeper than it otherwise would be.
- Higher asymptote: The converged skill of the trained model is better than it otherwise would be.
We apply transfer learning with the pre-trained AlexNet model to demonstrate the results on the chosen subset of the Places database. We replace only the class score layer with a new fully connected layer that has eight nodes, one for each of the eight classes.
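Replacing only the class score layer can be sketched in Keras as shown below. Several things here are assumptions rather than the project's setup: `keras.applications` does not ship AlexNet, so ResNet50 stands in as the pre-trained backbone; the backbone is frozen so that only the new layer trains; and the function name `build_finetune_model`, the layer name `scene_scores`, the input size, and the optimizer settings are illustrative choices.

```python
from tensorflow import keras


def build_finetune_model(n_classes=8, weights="imagenet"):
    """Reuse a pre-trained backbone and attach a new n_classes-way score layer."""
    # include_top=False drops the original 1000-way ImageNet score layer.
    base = keras.applications.ResNet50(
        weights=weights,
        include_top=False,
        pooling="avg",
        input_shape=(224, 224, 3),
    )
    # Assumption: freeze the pre-trained layers so only the new layer is trained.
    base.trainable = False

    # New fully connected class score layer with one node per scene class.
    scores = keras.layers.Dense(
        n_classes, activation="softmax", name="scene_scores"
    )(base.output)
    model = keras.Model(inputs=base.input, outputs=scores)
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_finetune_model()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture without them, which is useful for checking shapes. Setting `base.trainable = True` with a small learning rate would instead fine-tune the whole network.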
|Architectures Used||Top-1 Accuracy (%)||Top-3 Accuracy (%)||Top-5 Accuracy (%)|
Top-1 Accuracy: Accuracies obtained while considering the top-1 prediction.
Top-3 Accuracy: Accuracies obtained while considering the top-3 predictions.
Top-5 Accuracy: Accuracies obtained while considering the top-5 predictions.
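To make these definitions concrete, the following NumPy sketch counts a prediction as correct when the true label is among the k highest-scoring classes. The function name `top_k_accuracy` and the small score matrix are made up for the example.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    # argsort is ascending, so the last k columns hold the top-k class indices.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()


# Hypothetical class scores for 3 images over 4 classes.
scores = np.array([
    [0.6, 0.3, 0.1, 0.0],
    [0.2, 0.5, 0.3, 0.0],
    [0.1, 0.2, 0.3, 0.4],
])
labels = np.array([0, 2, 0])

for k in (1, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(scores, labels, k):.3f}")
```

In this toy case only the first image is right at top-1, while the second is also counted at top-3 because its true class has the second-highest score. Top-k accuracy can therefore never decrease as k grows.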
Training Time Periods (For Fine Tuning)
|Architecture Used||System||Training Time|
|AlexNet||Intel® AI DevCloud||~23 min|
|AlexNet||HP envy17t-s000cto||~95 min|
|ResNet||Intel® AI DevCloud||~27 min|
|ResNet||HP envy17t-s000cto||~135 min|
|GoogLeNet||Intel® AI DevCloud||~23 min|
|GoogLeNet||HP envy17t-s000cto||~105 min|
Note: We used smaller datasets in these experiments to test the speeds and accuracies that can be achieved with the Intel® Distribution for Python*.
From the above experiments, the deep-learning methods perform much better on the scene-classification problem than extracting features with traditional methods and applying machine-learning techniques.
In the future, I want to design a new deep neural network by making some changes to the proposed architecture so that accuracies can be further increased. I would also like to deploy the model on AWS* DeepLens so that it runs in real time.
Click GitHub for source code.
Please visit Places for more advanced techniques and datasets.