Video Series: Hands-On AI—Part 2: Cleaning Transformations

Explore the concepts of data cleaning, an important aspect of data preprocessing for machine learning workloads. Get an introduction to common Keras (an open source neural network library) transformations: rescaling, grayscaling, sample-wise centering, sample-wise normalization, feature-wise centering, and feature-wise normalization.

Welcome back. I'm Karl, and this is the second video in our Hands-On AI series. In this video, we cover data cleaning, an extremely important practice for anyone building a machine learning application. This is a critical step when dealing with your own data: we have to get the images into a format that our models can understand.

This means both scaling them to a usable size and normalizing the data so that there aren't any extreme outliers, since outliers can break your network or cause exploding gradients. Remember that in machine learning, images are not images as far as the algorithms are concerned; they are simply multidimensional arrays of integers. Consider data cleaning a way of simplifying the data and reducing noise and outliers.
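As a quick illustration, here is a minimal sketch using Keras's image utilities (the file name example.jpg is a placeholder) that loads a picture and inspects the array the algorithm actually sees:

```python
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# 'example.jpg' is a placeholder path; substitute any image file.
img = img_to_array(load_img('example.jpg'))

# A color image is just a height x width x 3 channel array of 0-255 values.
print(img.shape)             # e.g. (256, 256, 3)
print(img.min(), img.max())  # typically 0.0 and 255.0 for raw RGB data
```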

Now let's take a look at a set of transformations that are commonly applied for cleaning up the data and their influence on images. All the code snippets can be found in the preprocessing Jupyter* Notebook. We will also be working through these in a later episode.

The first transformation is called rescaling. Rescaling is an operation that moves your data from one numerical range to another by dividing by a predefined constant. In other words, we transform the raw pixel values from their original range into the range we have predetermined to be suitable input for our machine learning model.
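In Keras, this is a single argument to ImageDataGenerator. A minimal sketch, assuming the tensorflow.keras namespace, where dividing by 255 is expressed as multiplying by the constant 1/255:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# rescale multiplies every pixel by the given constant, so 1/255
# maps the raw 0-255 range onto 0-1.
datagen = ImageDataGenerator(rescale=1. / 255)
```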

Another type of transformation is grayscaling, which turns a color RGB image into an image with only shades of gray representing colors. Grayscaling discards color information that often adds noise, which can make it easier to detect shapes in the picture. Depending on your application, grayscaling might be a useful preprocessing step.
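Keras itself can load images as grayscale (for example, color_mode='grayscale' in flow_from_directory); if you already have RGB arrays, a minimal sketch is the standard luminosity-weighted average of the three channels:

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB array into a single (H, W) channel
    using the standard luminosity weights for R, G, and B."""
    return rgb @ np.array([0.299, 0.587, 0.114])
```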

Next is sample-wise centering. The raw RGB values in images range from 0 to 255. However, to get rid of vanishing or saturating values, we can normalize each sample so that its mean value is equal to zero. This reduces the chance of outliers within each image and centers its pixel values around zero.
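In Keras this is the samplewise_center flag on ImageDataGenerator. A minimal sketch with random stand-in data:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in data: one 4x4 RGB "image" with raw 0-255 values.
x = np.random.randint(0, 256, size=(1, 4, 4, 3)).astype('float32')

datagen = ImageDataGenerator(samplewise_center=True)
batch = next(datagen.flow(x, batch_size=1, shuffle=False))
print(batch.mean())  # ~0.0: each sample's mean is shifted to zero
```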

In conjunction with sample-wise centering, you can use sample-wise standard normalization. This preprocessing step follows the same idea as sample-wise centering, but instead of setting the mean value to zero, it sets the standard deviation to 1. So now, instead of running from 0 to 255, most values fall in a small range around zero, roughly -1 to 1. This minimizes extreme outliers and gives us values that are ready to be digested by common machine learning functions.
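In Keras the corresponding flag is samplewise_std_normalization, typically enabled together with samplewise_center. A minimal sketch with the same kind of stand-in data:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x = np.random.randint(0, 256, size=(1, 4, 4, 3)).astype('float32')

# Center each sample, then scale it to unit standard deviation.
datagen = ImageDataGenerator(samplewise_center=True,
                             samplewise_std_normalization=True)
batch = next(datagen.flow(x, batch_size=1, shuffle=False))
print(batch.mean(), batch.std())  # roughly 0.0 and 1.0 for each sample
```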

Feature-wise transformations are similar to sample-wise ones. A sample refers to an individual image, whereas a feature is a specific trait shared across the whole data set; in this case, the features are the R, G, and B values. Feature-wise centering is therefore performed on the entire data set instead of on each individual image. To do this easily, and to understand what it looks like, we compute the mean image of our data set and subtract it from each image, as sketched below.
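In Keras this is the featurewise_center flag, which needs a fit() call so the generator can compute statistics over the whole data set first. Note that recent Keras versions subtract a per-channel mean (one value each for R, G, and B); the full mean image shown in the video can be computed directly with NumPy. A sketch with random stand-in data:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in data set: 100 RGB images of size 32x32 with raw 0-255 values.
x_train = np.random.randint(0, 256, size=(100, 32, 32, 3)).astype('float32')

datagen = ImageDataGenerator(featurewise_center=True)
datagen.fit(x_train)  # computes the per-channel means to subtract
print(datagen.mean)   # one mean value per R, G, B channel

# The "mean image" visualized in the video:
mean_image = x_train.mean(axis=0)  # shape (32, 32, 3)
```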

While the mean image doesn't look great to us, it does show the most common features of our data set. Since we have already calculated the mean image, we can also calculate the data set's standard deviation for feature-wise standard normalization. Every pixel in every image is then divided by this value.
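A sketch enabling both feature-wise flags and fitting on the same kind of stand-in data set:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_train = np.random.randint(0, 256, size=(100, 32, 32, 3)).astype('float32')

# Subtract the data set mean, then divide by the data set standard deviation.
datagen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True)
datagen.fit(x_train)

batch = next(datagen.flow(x_train, batch_size=32, shuffle=False))
print(batch.mean(), batch.std())  # roughly 0.0 and 1.0 across the data set
```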

These transformations can help clean your data set and reduce its complexity, helping you train a more accurate model. Thanks for watching. Be sure to check out the links to read the article associated with this series and to learn more about AI. Also, stay tuned for the next episode, where we learn to augment our data set to increase the real-world accuracy of our model.