AI Developer Project Part 2: Combating Distracted-Driver Behavior

Published: 09/26/2018  

Last Updated: 09/26/2018

Experimental Design and Data Preparation for a Distracted-Driver AI Project

The first Combating Distracted-Driver Behavior article in this five-part series, Overview of a Use Case: Combating Distracted Driving Behavior, covers conceptualizing a product with a cross-functional team, using the five stages of design-thinking, and formulating a final concept to hand off to a development team.

This second article covers how research and development helps you to build your project. It mainly discusses how to prepare a dataset, how to approach a solution, and how to create a topology and design for an experiment.


The research and development team does most of the concept feasibility and technology-development work that’s of interest to us as developers.

Based on the requirements, our developer team decided to use artificial intelligence (AI) to detect driver behavior. The most important requirement for an AI project is to acquire a dataset of images to train the model. The following link has lots of resources to explore: Datasets.

For our purposes, the distracted-driver dataset was a perfect fit: State Farm Distracted Driver Detection.

Dataset Preparation and Wrangling

The dataset has been extracted from the Kaggle* platform for predictive modeling and analytic competitions. It pertains specifically to the State Farm Distracted Driver Detection competition. It comprises driver images that are taken in a car when the driver is performing some kind of activity, such as texting, eating, talking on the phone, applying makeup, reaching behind, etc. For each of the images, the goal is to predict the likelihood that the driver is engaged in certain classes of activity.

The following are the ten classes to be predicted:

Class Class Name
c0 safe driving
c1 texting - right
c2 talking on the phone - right
c3 texting - left
c4 talking on the phone - left
c5 operating the radio
c6 drinking
c7 reaching behind
c8 hair and makeup
c9 talking to passenger

The images made available by the competition for training and testing do not contain associated metadata, which ensures that this is a computer vision problem. The training and testing data are split on the drivers, so that any one driver appears in either the training or the testing dataset, but not both.

Data link: State Farm Distracted Driver Detection.

To download the dataset, we used Kaggle-CLI, an unofficial Kaggle command-line tool. For reference, see its GitHub* repository.

The downloaded dataset consists of:

  • A training dataset: 22,424 files (1,900 to 2,500 images per class), 640 x 480 pixels, size - 44 KB, total size - 950 MB
  • A testing dataset: 79,726 files, 640 x 480 pixels (unprocessed images), total size - 3.27 GB
  • A CSV file: driver_imgs_list.csv with the driver IDs, class labels, and the image filename
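As a quick sanity check on the download, the per-class and per-driver image counts can be tallied from the CSV file. The sketch below is illustrative only: the column names `subject`, `classname`, and `img` are an assumption (check the header row of your copy of driver_imgs_list.csv), and the sample rows are made up.

```python
import csv
import io
from collections import Counter

def summarize(csv_file):
    """Count images per class and per driver from an open CSV file object."""
    per_class, per_driver = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        per_class[row["classname"]] += 1   # assumed column name
        per_driver[row["subject"]] += 1    # assumed column name
    return per_class, per_driver

# Made-up rows standing in for driver_imgs_list.csv.
SAMPLE = """subject,classname,img
p002,c0,img_1.jpg
p002,c1,img_2.jpg
p012,c0,img_3.jpg
"""

per_class, per_driver = summarize(io.StringIO(SAMPLE))
print(dict(per_class))    # images per class
print(dict(per_driver))   # images per driver
```

With the real file, pass `open("driver_imgs_list.csv", newline="")` to `summarize` instead of the in-memory sample.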

Solution Approach and Design of Experiments

For classification problems in computer vision, the most promising solution is a convolutional neural network (CNN)-based approach. You can either build the entire network from scratch or go with the state-of-the-art standard topologies made available with deep-learning frameworks. To obtain optimal results within minimal time, the latter is advantageous.

Topology Selection

As there are no standard guidelines for topology selection, we resorted to a parallel exploration of three of the highest-performing CNNs in the ImageNet classification challenge: Inception-ResNet-V2, Inception V3, and Inception V4. Currently, transfer learning with the selected topologies is available with both Intel Optimization for Keras* and TensorFlow*.


Design Considerations

Design considerations in computer vision mostly fall into three categories: speed, memory, and accuracy.


Speed

Transfer learning reduces training time severalfold compared to training from scratch. Parallelizing data preprocessing across multiple threads will be considered and can speed up data wrangling. Since we want real-time prediction, it is important that the model predicts fast enough. The TensorFlow framework gives predictions fast enough and hence is a suitable option.


Memory

Data resizing is expected to help the model generalize better over noise and, at the same time, reduce the memory required for data processing.

Networks train faster and require less memory when files are batched. The appropriate batch size is determined with respect to its effect on accuracy. The actual image size is 640 x 480. Using the images without resizing consumes more memory and increases the chances of the system crashing. Lower memory usage also means that results come sooner, which is an important design consideration for real-time distraction detection.
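To put rough numbers on the memory argument, the sketch below estimates the size of one batch of input images. It assumes the pixels are stored as 32-bit floats and counts only the input tensor, not the network's activations or weights, so it is a lower bound rather than a measurement.

```python
def batch_megabytes(width, height, channels, batch_size, bytes_per_value=4):
    """Memory for one batch of input images, in MB (10**6 bytes).

    bytes_per_value=4 assumes float32 storage.
    """
    return width * height * channels * batch_size * bytes_per_value / 1e6

full = batch_megabytes(640, 480, 3, 32)    # original image size
small = batch_megabytes(299, 299, 3, 32)   # resized for Inception v3

print(f"640 x 480: {full:.1f} MB per batch of 32")
print(f"299 x 299: {small:.1f} MB per batch of 32")
print(f"reduction: {full / small:.2f}x")
```

Under these assumptions, resizing cuts the input footprint by a factor of roughly 3.4 for any batch size.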


Accuracy

22,424 files (1,900 to 2,500 images per class) is a small training dataset for a computer-vision challenge. The dataset size can be increased by implementing the data-augmentation strategies readily available with deep-learning frameworks. Even with the image-preprocessing overhead, achieving decent accuracy is a challenge. The dataset is already susceptible to overfitting because it contains many similar images. Hence, training a network with millions of parameters could worsen overfitting.

Initialize the network with weights pre-trained on the ImageNet dataset, built on Inception-ResNet-V2, Inception V3, or Inception V4, to extract the lower-level features. Retrain only the last fully connected classification layer with the distracted-driver dataset. Transfer learning used in this way has been shown to give better accuracy than training the neural network from scratch. Accuracy is highly important for this problem: if the model cannot reliably identify the moments when the driver is distracted, it will cause more trouble than it solves. Drivers will be irritated if they get warnings while they are driving safely.

Step | Design | Alternatives | Tradeoffs
1 | Image | Color; Grayscale |
2 | Topology | Inception-ResNet-V2; Inception V3; Inception V4 | Speed vs. memory
3 | Resizing | Direct resize; Padding to scale 1 followed by scaling in | Processing time vs. preserving spatial information
4 | Framework | TensorFlow | Greater handle on code vs. quick prototyping

Image Alternatives

Whether a color or a grayscale input affects model accuracy depends on the task. For color-specific objects, color matters. (For example, oranges are usually visually identifiable by the color orange.) For recognizing driver distraction, on the other hand, it is the actions that are relevant, and actions may not be identified by color. While color-to-grayscale conversion can result in loss of information, it can avoid potential overfitting caused by the CNN learning color-sensitive filters, while also reducing computation time. The intention behind this exercise is to improve accuracy.


Topology Alternatives

Several state-of-the-art deep neural networks can be considered for the use case at hand. To condense the choices to a few that can be run in parallel, due weight needs to be given to the memory and speed requirements. A small increment in accuracy costs a lot of computation time. The choices can then be made according to the available resources. Memory considerations for the different topologies are weighed along with the published accuracy comparisons to arrive at the topologies to be selected. ResNet topologies are too memory intensive, while AlexNet and VGG-16 fall short on accuracy. Hence, we decided to use Inception models.

The following comparison table is taken from the research paper Using simple architectures to outperform deeper and more complex architectures.

Table 1. Flops and Parameter Comparison

Network | MACC | COMP | ADD | DIV | EXP | Activations | Params | SIZE (MB)
SimpleNet | 652 M | 0.838 M | 10 | 10 | 10 | 1 M | 5 M | 20.9
SqueezeNet | 861 M | 10 M | 226 K | 1.51 M | 1 K | 13 M | 1 M | 4.7
Inception v4 | 12270 M | 21.9 M | 5.34 M | 897 K | 1 K | 73 M | 43 M | 163
Inception v3 | 5710 M | 16.5 M | 2.59 M | 1.71 M | 11 K | 33 M | 24 M | 91
Inception-ResNetv2 | 9210 M | 17.6 M | 2.36 M | 1 K | 1 K | 74 M | 32 M | 210
ResNet-152 | 11300 M | 22.33 M | 35.27 M | 22.03 M | 1 K | 100.26 M | 60.19 M | 230
ResNet-50 | 3870 M | 10.9 M | 1.62 M | 1.06 M | 1 K | 47 M | 26 M | 97.70
AlexNet | 1140 M | 1.77 M | 4.78 K | 955 K | 478 K | 2 M | 62 M | 217.00
GoogleNet | 1600 M | 16.1 M | 883 K | 166 K | 833 K | 10 M | 7 M | 22.82
Network in Network | 1100 M | 2.86 M | 370 K | 1 K | 1 K | 3.8 M | 8 M | 29
VGG16 | 15740 M | 19.7 M | 1 K | 1 K | 1 K | 29 M | 138 M | 512.2

Column description of the above table:

  • MACC: The hardware unit that performs the multiply–accumulate operation is known as a multiplier–accumulator (MAC, or MAC unit). A MAC operation computes the product of two numbers and adds that product to an accumulator. MACC here represents the number of multiply-accumulate operations for the model. These are element-wise mathematical operations.
  • COMP: The number of comparison operations in a model.
  • ADD: The number of addition operations in a model.
  • DIV: The number of division operations in a model.
  • EXP: The number of exponential operations in a model.
  • Activations: Transfer functions in neural networks that are applied to the output of a layer, either between layers or at the end of the network. The purpose of an activation is to convert the input signal of a node in a neural network to an output signal. Activations here gives the total number of activations for the model.
  • Params: The number of learnable parameters (weights and biases) in a given model.
  • SIZE (MB): The size of the model in megabytes.
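To make the MACC, Params, and SIZE columns more concrete, here is a minimal sketch of how those quantities can be estimated for a single convolution layer and for a whole model. The layer dimensions are hypothetical, and the size estimate assumes float32 weights and mebibytes; the paper may use different conventions.

```python
def conv_layer_cost(k, c_in, c_out, h_out, w_out):
    """Return (MACCs, parameters) for one k x k convolution layer with bias."""
    # One multiply-accumulate per kernel weight per output position.
    maccs = k * k * c_in * c_out * h_out * w_out
    # One weight per kernel element plus one bias per output channel.
    params = k * k * c_in * c_out + c_out
    return maccs, params

def model_size_mb(num_params, bytes_per_param=4):
    """Approximate size in MiB of a model whose weights are float32."""
    return num_params * bytes_per_param / 2**20

# Hypothetical layer: 3 x 3 kernel, RGB input, 32 filters, 149 x 149 output.
maccs, params = conv_layer_cost(k=3, c_in=3, c_out=32, h_out=149, w_out=149)
print(f"MACCs: {maccs:,}  params: {params:,}")

# Rough cross-check against the table: Inception v3 lists 24 M parameters.
print(f"Estimated size: {model_size_mb(24_000_000):.1f} MB")
```

The estimate for 24 M parameters comes out close to the 91 MB listed for Inception v3, although agreement is looser for some other rows because the parameter counts in the table are rounded.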


Resizing Alternatives

Preserving the spatial information of an image is expected to improve performance. Whether to resize an image directly or to first pad it to an aspect ratio of 1 and then scale it (scaling in, since in our case the image size is reduced) is therefore open to debate. Padding adds overhead in terms of compute time. At the same time, padding slows the reduction in volume size and so supports deeper networks: for a non-padded image, the volume reduction after each convolution could result in loss of information at the borders too quickly.

We experimented with the direct image resize vs. padding and then resizing the image on one of the training images. The results are displayed below.

Original image (640 x 480 pixels).

Direct resize of the original image (downsized to 300 x 300 pixels).

Padding the original image (640 x 640 pixels) followed by a resize of the padded image (downsized to 300 x 300 pixels).
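The geometric difference between the two resizing options can be worked out without touching any pixels. The sketch below only computes scale factors and padding amounts; the function names are ours, and an actual pipeline would apply these numbers with an image library.

```python
def direct_resize_scales(width, height, target):
    """Horizontal and vertical scale factors for a direct resize to target x target."""
    return target / width, target / height

def pad_then_resize(width, height, target):
    """Pad the shorter side to a square, then resize.

    Returns (padding added on each side of the shorter dimension in pixels,
    uniform scale factor applied afterwards).
    """
    side = max(width, height)
    pad_each_side = (side - min(width, height)) / 2
    return pad_each_side, target / side

sx, sy = direct_resize_scales(640, 480, 300)
pad, s = pad_then_resize(640, 480, 300)

print(f"direct resize: x scale {sx:.3f}, y scale {sy:.3f}")
print(f"pad then resize: {pad:.0f} px added top and bottom, uniform scale {s:.3f}")
```

For a 640 x 480 image, the direct resize scales the width by about 0.469 and the height by 0.625, so the driver appears stretched vertically. Padding first adds 80 pixels at the top and bottom and then applies the same 0.469 scale in both directions, which preserves proportions at the cost of spending part of the 300 x 300 input on padding.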


Framework Alternatives

For quick prototyping and testing of neural networks, one can consider the more user-friendly option, Keras. On the other hand, TensorFlow, as a lower-level library, offers more control over the model. From a research perspective, TensorFlow also has more functionality to offer, such as threads and queues, which can speed up operations through parallel computation. So the tradeoff here is between a user-friendly framework that facilitates quick development and greater functionality with more control over the network.


Introduce random rotation, shifts, shear, and flips using the data-augmentation techniques available with deep-learning frameworks to help the model generalize. Convert the images to grayscale, and verify the conversion’s effect on accuracy.

Hardware Configuration

Name Description
Intel® architecture x86_64
CPU op-modes 32-bit, 64-bit
Byte order Little endian
CPUs 8
On-line CPUs list 0-7
Threads per core 1
Cores per socket 1
Sockets 8
NUMA nodes 1
Vendor ID GenuineIntel
CPU family 6
Model 61
Model name Intel® Core™ processor (Broadwell)
Stepping 2
CPU MHz 2099.998
BogoMIPS 4199.99
Hypervisor vendor KVM
Virtualization type Full
L1d cache 32K
L1i cache 32K
L2 cache 4096K
NUMA node0 CPU(s) 0-7

Software Installation

Python* Installation

Intel® AI DevCloud comes with Python* installed by default. You can also create a separate environment with the desired Python version.


To create and activate a conda* environment named distracted_driver with Python version 3.5, run:

conda create -n distracted_driver -c intel python=3.5
source activate distracted_driver 

TensorFlow Installation/Keras* Installation

TensorFlow Installation

To install TensorFlow, follow the instructions provided in the link below.

Intel® Optimization for TensorFlow* Installation Guide


conda install

Keras Installation

conda install keras

Solution Design

From the perspective of a computer-vision use case, with the available training dataset, we observed the following major drawbacks:


The training set contains 22,424 files across 10 classes but only 26 drivers, so many images are near-duplicates of one another and are largely interchangeable for training.


Certain images are difficult to classify even for human vision. For example, images of drivers with a hand on the steering wheel and looking at the mirror could fall under either c8 (hair and makeup) or c0 (safe driving).

The initial focus was on a broad set of factors that challenge the assumptions behind the drawbacks outlined above. Starting from the levels listed for each factor in the initial experiment, the end goal was to filter out the noise factors that do not contribute significantly to building a comprehensive design.

Experiment Design

Factor Levels
Topology Inception-ResNet-V2, Inception V3, Inception V4
Weight initialization dataset ImageNet
Batch size 16, 32, 64, 128
Iterations 50,000; 100,000; 200,000
Learning rate 0.01, 0.001
Sampling method k-fold cross-validation
Sample size 5, 7, 10
Image resizing 300 x 300 (Inception V3)
Image channels 3, 1
Invariance Rotation, shifts, shear and flips
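To see why the factor levels need to be condensed, it helps to count the runs a full factorial design would require. The sketch below enumerates every combination of the multi-level factors in the table; reading "Sample size" as the number of folds k is our interpretation, and single-level factors are left out because they do not multiply the count.

```python
from itertools import product

# Multi-level factors from the experiment-design table.
factors = {
    "topology": ["Inception-ResNet-V2", "Inception V3", "Inception V4"],
    "batch_size": [16, 32, 64, 128],
    "iterations": [50_000, 100_000, 200_000],
    "learning_rate": [0.01, 0.001],
    "k_folds": [5, 7, 10],      # "Sample size", interpreted as k
    "channels": [3, 1],
}

# One dict per run, covering every combination of levels.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(runs))   # number of runs in a full factorial design
print(runs[0])     # first combination
```

A full factorial over these levels comes to 432 training runs, which is the motivation for screening out factors that turn out not to matter.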

Design factors selection and their relevance are detailed below.


Topology

With top-five accuracy rates of 93.9% (Inception v3), 95.3% (Inception-ResNet-V2), and 95.2% (Inception v4), these topologies, used with transfer learning, have an advantage over traditional convolutional neural networks in terms of accuracy and computation time.

Weight-Initialization Dataset

ImageNet is a large-scale visual database comprising 14,197,122 images across various categories that helps researchers undertake computer-vision use cases. Instead of random weight initialization, the network will be initialized with weights pre-trained on ImageNet for better learning.

Batch Size

Given memory and time constraints, we evaluate batch sizes of 16, 32, 64, and 128 to find the batch size that gives the best accuracy.


Iterations

Increasing the number of iterations is expected to improve the accuracy of the model at the cost of time. For quick-and-dirty testing to narrow down factor levels at a faster pace, 4,000 is a good number, although based on past experience, 100,000 is deemed a good count for the final runs with the finalized factors.
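Iteration counts are easier to compare when converted to passes over the training set. The sketch below assumes that one iteration processes one mini-batch and that all 22,424 training images are used; with cross-validation, the training portion is smaller, so the true number of passes is somewhat higher.

```python
def epochs_seen(iterations, batch_size, num_images=22_424):
    """Passes over the training set, assuming one mini-batch per iteration."""
    return iterations * batch_size / num_images

for iterations in (4_000, 100_000):
    for batch_size in (16, 32, 64, 128):
        print(f"{iterations:>7,} iterations x batch {batch_size:>3}: "
              f"{epochs_seen(iterations, batch_size):7.1f} epochs")
```

This makes the interaction between the batch-size and iteration factors visible: the same iteration count corresponds to eight times as many passes at batch size 128 as at batch size 16.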

Learning Rate

A lower value for this tuning variable requires more training iterations but provides a greater chance of reaching an optimal solution. The levels chosen here are 0.01 and 0.001, balancing computation time against finding the optimal solution.
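The relationship between learning rate and iteration count can be illustrated with a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, which is not the network's loss surface; it only shows how the number of steps scales with the two learning-rate levels.

```python
def steps_to_converge(learning_rate, start=1.0, tolerance=1e-3, max_steps=1_000_000):
    """Gradient-descent steps on f(x) = x**2 until |x| < tolerance."""
    x, steps = start, 0
    while abs(x) >= tolerance and steps < max_steps:
        x -= learning_rate * 2 * x   # the gradient of x**2 is 2x
        steps += 1
    return steps

fast = steps_to_converge(0.01)
slow = steps_to_converge(0.001)
print(f"learning rate 0.01:  {fast} steps")
print(f"learning rate 0.001: {slow} steps")
```

On this toy function, the tenfold smaller learning rate needs roughly ten times as many steps. Real training is less predictable, which is why both levels are kept in the experiment.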

Sampling Method

k-fold cross-validation partitions the dataset into k subsamples, assigns one subsample to the test set, and treats the remaining k-1 as the training set; this is repeated so that each subsample serves as the test set once. It can be computationally expensive but is needed to ensure that the resulting model generalizes well to unseen data.

Sample Size

As a rule of thumb, k-fold cross-validation uses k >= 5, although this is not a hard-and-fast rule; k can be any integer from 2 up to the number of samples.
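A minimal sketch of generating the k folds follows. Folding on driver IDs rather than on individual images is an assumption of this sketch, motivated by the competition's driver-wise split between training and testing data; the driver IDs are hypothetical placeholders.

```python
def driver_folds(driver_ids, k):
    """Yield (train_drivers, validation_drivers) pairs, one per fold."""
    drivers = sorted(set(driver_ids))
    # Deal drivers into k folds round-robin so fold sizes differ by at most one.
    folds = [drivers[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, held_out

drivers = [f"p{n:03d}" for n in range(26)]   # hypothetical IDs for 26 drivers
splits = list(driver_folds(drivers, k=5))

for train, held_out in splits:
    print(f"train on {len(train)} drivers, validate on {len(held_out)}: {held_out}")
```

Keeping each driver entirely inside one fold prevents near-identical frames of the same person from appearing on both sides of a split, which would otherwise inflate the validation accuracy.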

Image Resizing

To match the input format expected by the Inception and VGG models, the files are to be resized to 299 x 299 for Inception v3 and 150 x 150 for VGG-16. Resizing also reduces computation time, and the resulting deformation of the images is expected to help the model generalize.

Image Channels

A colored image is not expected to hold relevance in capturing driver actions. If the color of the image does not contribute significantly to the accuracy, the image can be reduced to a single monochrome channel. This reduces the input data by a factor of 3 and is expected to reduce computation time accordingly.
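The channel reduction itself is a weighted sum per pixel. The sketch below uses the common ITU-R BT.601 luma weights on a tiny made-up image stored as nested lists; this weighting is our choice for illustration, and a real pipeline would use the conversion routine of its image library.

```python
def to_grayscale(rgb_image):
    """Convert a nested list of (R, G, B) pixels to single-channel luma values."""
    # ITU-R BT.601 luma weights; other weightings exist.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

# Made-up 2 x 2 image: white, red, green, blue.
image = [[(255, 255, 255), (255, 0, 0)],
         [(0, 255, 0), (0, 0, 255)]]
gray = to_grayscale(image)

values_rgb = sum(len(pixel) for row in image for pixel in row)
values_gray = sum(len(row) for row in gray)
print(gray)
print(f"values stored: {values_rgb} (RGB) vs. {values_gray} (grayscale)")
```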


Invariance

This model can be susceptible to errors induced by variance from images captured under poor lighting conditions or from different angles. To avoid this, invariance is to be introduced by using augmentation methods for rotation, shifts, shear, and flips. These methods are available with most deep-learning frameworks.

Hand Labeling of Test Images

As the labels for the test images are not available, the manual labeling of the images is crowdsourced. For quick testing, 2,000 of the 79,726 images in the test dataset can be hand-labeled for model verification. To tackle ambiguity, the image class should be decided by consensus among labelers.
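One simple way to turn several hand labels into a consensus is a majority vote with an agreement threshold. The sketch below is one possible rule, not necessarily the one used in this project; the threshold and the choice to return None for ambiguous images are assumptions.

```python
from collections import Counter

def consensus_label(votes, min_agreement=0.5):
    """Return the majority label, or None when agreement is not above the threshold."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None

print(consensus_label(["c0", "c0", "c8"]))   # two of three labelers agree
print(consensus_label(["c0", "c8"]))         # split vote, no consensus
```

Images that come back as None, such as the c0 vs. c8 cases described earlier, can be sent to additional labelers or excluded from the verification set.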


Next Steps

The third article of this Combating Distracted-Driver Behavior series, Training and Evaluation of a Distracted-Driver AI Model, provides a consolidated set of instructions and commands to run and reproduce the results. The first article is Overview of a Use Case: Combating Distracted Driving Behavior.

