Exploratory Data Analysis

Learn about exploratory data analysis (EDA), the starting point for solving any AI problem.
 

Hi, I'm Meghana Rao, and this is the AI from the Data Center to the Edge video series. In this episode, we talk about the exploratory data analysis phase, or EDA, which is the starting point for solving any AI problem. We focus on how to obtain a starter dataset for the challenge you are trying to solve, how to make an initial assessment of the data, and how to prepare the dataset for the problem at hand before training. Some of the techniques we cover include preprocessing and data augmentation.

AI problems require large amounts of data. Most often, a dataset is not readily available in the quality and quantity that we need. As practitioners, we either have to create one from scratch or adapt one from an existing repository. EDA refers to the stage of collecting and cleaning data and preparing it for training. Through the challenge of identifying the 10 most stolen cars in the US, this chapter teaches you how to perform EDA. EDA is a complex, time-consuming task, and if it is not done right, the trained model can perform poorly.

As a starting point, we use the Vehicle Make and Model Recognition Dataset, or VMMRdb, which contains over 9,000 classes, consisting of over 200,000 images. It covers models manufactured between 1950 and 2016. We assess the amount of data per class and narrow the dataset down to the 10 classes needed. In the second phase, we show you how to preprocess the data using techniques like grayscaling, samplewise centering, samplewise standard normalization, and RGB to BGR conversion, which may be necessary depending on the framework and the network used.
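The preprocessing steps named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course notebook's code: the function names are hypothetical, the grayscale weights are the common ITU-R BT.601 luma coefficients (an assumption, since the episode does not say which conversion is used), and a small random array stands in for a real car image.

```python
import numpy as np

# ITU-R BT.601 luma weights: an assumed, commonly used RGB-to-grayscale choice.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB image to (H, W) with a weighted channel sum."""
    return rgb.astype(np.float32) @ LUMA_WEIGHTS


def samplewise_center(img):
    """Subtract the image's own mean so its pixel values average to zero."""
    img = img.astype(np.float32)
    return img - img.mean()


def samplewise_std_normalize(img, eps=1e-7):
    """Divide the image by its own standard deviation (eps avoids division by zero)."""
    img = img.astype(np.float32)
    return img / (img.std() + eps)


def rgb_to_bgr(img):
    """Reverse the channel axis; some frameworks and pretrained networks expect BGR."""
    return img[..., ::-1]


# A random 4x4 RGB array stands in for a real image in this sketch.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

gray = to_grayscale(image)
standardized = samplewise_std_normalize(samplewise_center(image))
bgr = rgb_to_bgr(image)
```

Centering and normalization are computed per image here, which is what "samplewise" means: each sample uses its own statistics rather than statistics of the whole dataset.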

To ensure that the data is not skewed toward a few classes, we show you how to oversample the minority classes with data augmentation techniques. The lecture and Jupyter* Notebooks for this chapter are designed to teach you each of these techniques. By the end of this chapter, you will have a dataset that is ready for training.
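One way to picture oversampling with augmentation is the sketch below. It assumes images are held in memory as uint8 NumPy arrays, uses a random horizontal flip and brightness scaling as the augmentations (illustrative choices, not necessarily the ones used in the course notebooks), and tops up every class to the size of the largest class. The `augment` and `oversample` helpers and the class labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

import numpy as np


def augment(img, rng):
    """Return a randomly flipped, brightness-scaled copy of an (H, W, C) uint8 image."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    out = out * rng.uniform(0.8, 1.2)  # brightness scaling (assumed range)
    return np.clip(out, 0, 255).astype(np.uint8)


def oversample(images, labels, seed=0):
    """Add augmented copies to minority classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img, label in zip(images, labels):
        by_class[label].append(img)

    target = max(len(members) for members in by_class.values())
    out_images, out_labels = list(images), list(labels)
    for label, members in by_class.items():
        for _ in range(target - len(members)):
            out_images.append(augment(rng.choice(members), rng))
            out_labels.append(label)
    return out_images, out_labels


# Toy data: 5 images of one class and 2 of another (labels are placeholders).
data_rng = np.random.default_rng(0)
images = [data_rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8) for _ in range(7)]
labels = ["class_a"] * 5 + ["class_b"] * 2

balanced_images, balanced_labels = oversample(images, labels)
class_counts = Counter(balanced_labels)
```

Counting images per class before and after, as `Counter` does here, is also a quick way to check whether a dataset is skewed in the first place.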

Thanks for watching this episode of AI from the Data Center to the Edge. Make sure to check out the links to register for the course. You can complete the lecture and the notebooks listed in the resources for this course, and join me in the next episode to learn more about training a deep neural network model on Intel® architecture.