Find the Humor in Text Data: NLP with Intel & Habana*

Learn how to train a binary classification natural language processing (NLP) model on a humor dataset, where each statement is labeled as humorous or not humorous. The training is performed on a powerful Habana Gaudi* AI processor unit (HPU). 

Quantize a model to speed up inference by 1.8x, taking it from FP32 format to INT8 format without significant accuracy loss. Complete this work on a 3rd generation Intel® Xeon® CPU.

Hello and welcome to this NLP deep learning notebook. I will first briefly introduce the problem, the model architecture, the hardware, and the software we will be using in the two notebooks. For the demo, I have already run all the cells so that you are not sitting and waiting for anything, but you are of course encouraged to run them yourself and to change as much as you’d like.

In a world where negativity in speech and media is prominent, humor can help uplift the human spirit. "How to create a method or model to discover the structures behind humor, recognize humor…remains a challenge because of its subjective nature" (Jain, 2017). Machine learning and deep learning have been progressing to produce powerful language models. The proposed challenge here is to teach a computer how to distinguish between a humorous and a nonhumorous statement in English.

In the first of two Jupyter* Notebooks, we will train a binary text classification model to determine if a statement is humorous. For the demonstration, I will only use a small portion of the data for training. 

You are encouraged to use more/all of the data to improve the efficacy of your model.

You are also encouraged to experiment with data tokenization, preprocessing, and augmentation.

In the second Jupyter Notebook, we will increase the performance of prediction (or inference) in a simulated production environment. We will also use an open source MLOps tool called MLflow to simulate a production model-serving environment.

Today, we will use a distilled version of the BERT transformer-based model architecture, called DistilBERT. You are free and encouraged to experiment with other architectures. 

BERT stands for Bidirectional Encoder Representations from Transformers, and it is a deep learning model for natural language processing (NLP) that can be used for a variety of language tasks.

For the first notebook, we will be training our model using a Habana Gaudi HPU (Habana Processing Unit) accelerator, hosted on AWS*. The instance is an Amazon EC2 dl1.24xlarge. It provides eight HPU accelerators working in parallel and is reported to beat comparable GPU-based instances "by up to 40%" at a much-reduced cost.

Due to the smaller size of the dataset and relatively low training time, I am only covering single-HPU training here, but if you would like to try distributed training over multiple HPUs, you can visit the Optimum Habana GitHub* repository to learn from the examples of distributed training there.

The Habana® Gaudi® DL1 instances come with 96 2nd generation Intel® Xeon® vCPUs (48 physical cores).

In the second notebook, I will be showing you how to speed up inference time using a technique called "quantization" on a production-capable 3rd Generation Intel® Xeon® Platinum 8375C Ice Lake CPU. This instance is called m6i.*xlarge on AWS.

Intel will be releasing a 4th generation Xeon® Sapphire Rapids CPU with Advanced Matrix Extensions that is expected to offer an inference speedup of up to 8X on INT8 models, as compared to INT8 on the 3rd generation Xeon Ice Lake CPU. For more information about the upcoming performance benefits, you can visit:

To actively monitor the compute on the HPUs, you can use watch hl-smi, similar to watch nvidia-smi on NVIDIA* GPUs.

To monitor the compute on the CPU cores and memory usage, you can use htop in the command line. And to get a printout of the CPU information, you can use a command called lscpu.

Though we are using these specific hardware architectures, I have attempted to make the code as accessible as possible by offering alternative code in the notebooks for other hardware.

I will now briefly highlight some of the key Python* libraries I will be using in the notebooks. 

We will be using the Habana SynapseAI fork of PyTorch*. It looks and feels much like the stock PyTorch, but it has been optimized for Habana Gaudi HPUs.

For inference on the CPU, we will be using stock PyTorch.

We are using the Transformers library to pull our pretrained DistilBERT model and its associated configuration prior to training.

For setting up the training, we will be using optimum.habana, which is "the interface between the Transformers library and Habana’s Gaudi HPU."

To speed up model inference, we will be using Optimum Intel, which is "the interface between the Hugging Face* Transformers library and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures." In particular, the Intel® Neural Compressor (INC) is used in the back end for quantization of a model from FP32 to INT8.

We now move on to the second main section: Importing of Libraries. I have also included a requirements.txt file, in case you need to know the versions of libraries I am using.

Before importing tools, I just run these couple of lines starting with %load_ext autoreload to automatically reload any updated local Python libraries into the Jupyter Notebook. 

I have divided the importing of Python libraries into two main cells: 

The first cell is for the libraries that can be loaded and work on any hardware, including PyTorch, pandas, and transformers.

The second cell is for the Habana specific frameworks for training on the HPU, including Optimum Habana.

Here, we declare which device to use during training and inference for PyTorch. For this notebook, we select HPU, but there are others listed pertaining to other hardware.

Let’s first read the CSV dataset into a pandas dataframe and call it hdf. We can see that we have 130,000 rows of data, each with a statement and a corresponding True or False label. To use the data for training, we must first convert the True or False label into a 1 or 0, respectively.
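As a minimal sketch of the label conversion just described: the column names "text" and "humor" and the two rows below are assumptions made up for illustration, not taken from the notebook.

```python
import pandas as pd

# Tiny stand-in for the humor CSV. The column names "text" and "humor"
# are assumptions, and the rows are invented for illustration.
hdf = pd.DataFrame(
    {
        "text": [
            "Why did the scarecrow win an award? He was outstanding in his field.",
            "The city council meets on the first Tuesday of each month.",
        ],
        "humor": [True, False],
    }
)

# Map the boolean label to an integer: True -> 1, False -> 0.
hdf["label"] = hdf["humor"].astype(int)

print(hdf[["humor", "label"]])
```

With the real file, the dataframe would come from pd.read_csv instead of being built by hand, and the same astype(int) call would apply as long as the label column is parsed as booleans.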

For this demonstration, I am only going to make use of 13,000 of the samples from the humor dataset so that everything is very fast for the demo. But you will likely want to use as much of the data as you can in your training, keeping in mind that you will want to keep a subset of the data for holdout to test your model inference.

Using NumPy, we can split up the data into training, validation, and test sets, where the training data have 11,700 rows, the validation data have 650 rows, and the test data have 650 rows.
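One way such a 90/5/5 split could look with NumPy is sketched here. The seed and the choice to shuffle row indices are assumptions; the notebook may split differently.

```python
import numpy as np

n_samples = 13_000                      # size of the demo subset
rng = np.random.default_rng(seed=42)    # the seed is an arbitrary choice
shuffled = rng.permutation(n_samples)   # shuffled row indices

# Cut points at 90% and 95% of the rows give a 90/5/5 split.
train_end = n_samples * 90 // 100
val_end = n_samples * 95 // 100
train_idx, val_idx, test_idx = np.split(shuffled, [train_end, val_end])

print(len(train_idx), len(val_idx), len(test_idx))  # 11700 650 650
```

The index arrays could then select rows from the dataframe, for example with hdf.iloc[train_idx].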

To fine-tune an NLP model, we must first tokenize the text data. Tokenizing is a process where we convert each piece of text into a vector of integers, with each word, part of a word, or symbol mapped to a number according to the tokenizer's vocabulary.

You are free and encouraged to experiment with alternate tokenization methods to improve your model, as well as change the model architecture altogether. 

Here is some more information about the DistilBERT config that we are using here for our model.

I instantiate the tokenizer here by pulling from DistilBertTokenizerFast, and we can see some of the configurations of the tokenizer here. The word vocabulary size is 30,522, vectors are padded on the right, and there is a list of special tokens, including the separator token [SEP] and the pad token [PAD].

Now that we have instantiated the tokenizer, we can tokenize the text data from text to a vector of integers. 

Once tokenized, we can see that the data have two keys: input_ids and attention_mask.

The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

The attention mask is an optional argument used when batching sequences together. This argument indicates to the model which tokens should be attended to, and which should not.
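To make input_ids and attention_mask concrete, here is a toy sketch. It is not the DistilBERT tokenizer: the ten-entry vocabulary, the whitespace splitting, and the encode helper are all hypothetical, and the real tokenizer uses different token ids for its special tokens.

```python
# Toy vocabulary; real WordPiece vocabularies are learned from data.
# Id 0 is reserved for padding here.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "why": 4, "did": 5, "the": 6, "chicken": 7, "cross": 8, "road": 9}

def encode(text, max_length=10):
    """Return input_ids and attention_mask for one whitespace-split text."""
    words = text.lower().split()
    ids = [vocab["[CLS]"]] + [vocab.get(w, vocab["[UNK]"]) for w in words]
    ids = ids[: max_length - 1] + [vocab["[SEP]"]]   # truncate, then close
    mask = [1] * len(ids)                            # 1 = real token
    pad = max_length - len(ids)
    return {"input_ids": ids + [vocab["[PAD]"]] * pad,   # pad on the right
            "attention_mask": mask + [0] * pad}          # 0 = ignore

encoded = encode("Why did the chicken cross the road")
print(encoded["input_ids"])       # [2, 4, 5, 6, 7, 8, 6, 9, 3, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
```

The single trailing 0 in each list marks the one padded position that the model should not attend to.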

Now, let's explore the values of the data. We can look at a histogram of the character lengths of the text samples in the training dataset.
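A rough sketch of how such a length histogram could be computed follows. The four statements and the 20-character bin width are assumptions chosen for illustration, and the text bars stand in for a real plot.

```python
import numpy as np

# Invented sample statements standing in for the training text column.
texts = [
    "I told my computer a joke, but it did not get it.",
    "The museum opens at nine.",
    "Parallel lines have so much in common. It is a shame they will never meet.",
    "Rain is expected on Friday.",
]

lengths = np.array([len(t) for t in texts])   # characters per statement

# Bucket the lengths into fixed-width bins of 20 characters.
bin_edges = np.arange(0, lengths.max() + 20, 20)
counts, edges = np.histogram(lengths, bins=bin_edges)

# Print one row of '#' marks per bin as a text-only histogram.
for left, right, count in zip(edges[:-1].tolist(), edges[1:].tolist(), counts.tolist()):
    print(f"{left:>3}-{right:<3} {'#' * count}")
```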

Let's look at which words are the most common, and which words are the least common with a word cloud.

In the word cloud, we exclude certain very common words, called STOPWORDS, to make it more interesting, but you can use an empty list instead if you would like to include all the common words like don't, to, etc.

We concatenate all the words into a large string, and we can look at a sample of the nonhumorous words, and the length of the whole string. And now we can see a plot of the word cloud, with the more common words appearing larger.
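A word cloud sizes each word by how often it appears, so the underlying counts can be sketched with the standard library alone. The stop word set and the two statements below are made up for illustration.

```python
from collections import Counter

# A small hand-picked stop word set; an assumption for illustration.
# Use an empty set to keep every word.
stopwords = {"the", "a", "to", "is", "and", "of", "in", "it", "my"}

texts = [
    "The weather in the city is mild",
    "The city council is meeting in the city hall",
]

# Join all statements into one long lowercase string.
all_words = " ".join(texts).lower()

# Count every word that is not a stop word.
counts = Counter(w for w in all_words.split() if w not in stopwords)

print(len(all_words))          # length of the combined string
print(counts.most_common(1))   # [('city', 3)]
```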

We can go through the same process for humorous phrases and plot a similar word cloud.

Now that we have tokenized the data, we need to put the dataset in the torch.tensor format that PyTorch expects before training. 

Now that we’ve completely prepared the data, we can look at an example of the original dataset and after preprocessing it for training. 

We can see that we do have some padded 0s at the end of the input IDs, and the attention mask shows that we are ignoring these pads for training.

We can also decode the data from a vector of integers back to text with the tokenizer.

Let's first define our training arguments, our metrics, load a pretrained model, and then start training. Feel free to adjust the hyperparameters to tune it.

I begin by defining training arguments using the previously loaded GaudiTrainingArguments class from optimum.habana. You can always read more about a particular class or function with the Python help function. Of note here, I have provided the number of training epochs, the batch size (the number of text examples to handle at once in memory on the HPU), and the config from the Optimum.Habana framework.

If you are not using Habana Gaudi HPU hardware, I have given some sample code for a GPU and a CPU.

I am now defining a metrics function so that during training we can use our previously defined validation dataset to measure progress with inference at certain intervals. I am loading the F1 metric, because that is what we want to measure in this notebook.
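For reference, binary F1 can be computed by hand as below. This is a sketch and not the metric object loaded in the notebook; it assumes the trainer passes a (logits, labels) pair, and the logits are invented.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Binary F1 from (logits, labels); class 1 means 'humorous'."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)       # the highest logit wins
    tp = int(np.sum((preds == 1) & (labels == 1)))   # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))   # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))   # false negatives
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"f1": f1}

# Invented logits for four statements; columns = [not humorous, humorous].
logits = np.array([[2.0, -1.0], [0.1, 1.5], [-0.3, 0.8], [1.2, 0.4]])
labels = np.array([0, 1, 0, 1])
print(compute_metrics((logits, labels)))  # {'f1': 0.5}
```

Here one humorous statement is caught, one is missed, and one nonhumorous statement is flagged by mistake, which gives F1 = 2·1 / (2·1 + 1 + 1) = 0.5.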

Before we run the training, you can go to a command line window and type watch hl-smi to see the use of the HPUs during training.

This next cell is where I complete the training of my model. Technically, what I am doing is fine-tuning a pretrained NLP model called distilbert-base-uncased. I first load the pretrained model, and make sure to place the model on the HPU device. 

Then, I define the trainer with the GaudiTrainer class, and give the training_args previously defined, the training dataset, the validation dataset, and the compute_metrics function I defined. If you are using other hardware, I have provided the alternate code in the comments to just use Trainer.

I am using a timer to see the length of time the training takes. 

By watching the training loss, we can see that it decreases with more training steps, which is what we want. Feel free to adjust some of the training parameters to optimize your model. The model is saved at several checkpoints in the output_results folder.

If you want to save the model manually to your local machine, you should be able to navigate to one of the checkpoint folders, find the pytorch_model.bin file, right-click it, and click Download.

Now, we can start to evaluate the performance of the model by running inference on an unseen test dataset.

We again must set up the trainer as before, but this time with the eval dataset as the test_dataset, which is unseen so far by our model. We then can run trainer.evaluate() to measure the F1 score on the test dataset, the loss, and the speed at which it runs.

Testing the inference on just a single example helps illustrate how the model works. Let’s first look at a single row of text, and its label. We then can define the tokenized dataset in terms of its input IDs and attention mask.

We see from the input IDs above that there are many 0s, which indicate that the phrase has been padded to a certain vector length.

From the attention mask, we can see that there are also 0s at the end of the vector, indicating that we are telling the model not to pay attention to the numbers in those positions.

Let's run prediction on a single example with our fine-tuned model.

We must convert the logits tensor to a prediction output. If your output tensor is 1, the model is predicting a humorous statement. If your tensor is 0, the model is predicting a nonhumorous statement.
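The conversion from logits to a 0 or 1 prediction is an argmax over the two class scores. The sketch below uses NumPy rather than PyTorch to stay self-contained, and the logit values are invented.

```python
import numpy as np

# Invented logits for one statement: [not humorous, humorous].
logits = np.array([[-1.2, 2.3]])

# Softmax turns logits into probabilities that sum to 1.
shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

prediction = int(np.argmax(logits, axis=-1)[0])   # index of the largest logit
label_names = {0: "not humorous", 1: "humorous"}
print(prediction, label_names[prediction], probs.round(3))
```

The softmax step is optional for picking the class, since argmax over logits and argmax over probabilities agree, but it shows how confident the model is.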

Now, let's decode the vector back into text as a sanity check.

We have gotten our text back, along with the padding. So, we did indeed predict on the text we wanted.

If you have followed along through the notebook, we have accomplished a lot. We have:

  • Introduced the humor dataset, the Habana Gaudi HPU hardware, the Habana Gaudi PyTorch software, and the DistilBERT model architecture.
  • Explored through the text data and tokenized it for training.
  • Fine-tuned the pretrained DistilBERT model on the binary classification task on a Habana Gaudi HPU.
  • Assessed the model's performance across a test set and completed sample inference on a single text example.

We reserve the optimization of inference (with quantization) for the second notebook. Quantization should greatly reduce inference time while retaining accuracy for your model. You could expect your model file size to shrink by 4-5X.
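The size reduction follows from the storage width of each weight. As back-of-the-envelope arithmetic, assuming roughly 66 million parameters for DistilBERT base and counting weights only:

```python
# Approximate parameter count for DistilBERT base (about 66 million).
n_params = 66_000_000

bytes_fp32 = n_params * 4   # FP32 stores each weight in 4 bytes
bytes_int8 = n_params * 1   # INT8 stores each weight in 1 byte

mb = 1_000_000
print(f"FP32 weights: {bytes_fp32 / mb:.0f} MB")   # 264 MB
print(f"INT8 weights: {bytes_int8 / mb:.0f} MB")   # 66 MB
print(f"Ratio: {bytes_fp32 / bytes_int8:.0f}x")    # 4x
```

The ratio for an actual saved file can differ from this estimate, because not every tensor is necessarily quantized and the file also holds metadata.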

The sample code for quantization will be in the second notebook, which must be completed on the 3rd generation Intel Xeon Platinum 8375C Ice Lake CPU instance. 

There is some other bonus work below, but I recommend that you work on quantization first before trying any of the bonus work, as you should see the most significant benefit in inference speed by quantizing your model.

I have provided some references to some of the important documentation and GitHub sites mentioned throughout this notebook. 

I now would like to turn your attention to the second notebook, where I have a Jupyter environment open on a 3rd generation Xeon machine with the notebook loaded.

I will run pretty quickly through this notebook, as some of the notebook is a repeat from the first notebook, just to get the data loaded and ready.

I am not going to repeat everything from the initial introduction, but I just want to point out that this second notebook uses different hardware than the first: a 3rd generation Intel Xeon® CPU.

Instead of using the SynapseAI PyTorch, we will now use the CPU version of PyTorch so that we can run it on the CPU here. 

Also of note here, we are loading two libraries that we need for quantization: neural_compressor and

And I am setting the torch device to cpu instead of hpu.

I have simplified the data loading only to the essentials with no data exploration here, but I am taking the same steps as I did in the first notebook to load the data.

We now turn to quantization. To save time at inference, we can quantize the model from FP32 to INT8, without much drop in accuracy. This step is not necessary to submit your model, but it should save on inference time. To learn more about the functions and quantization, you can visit the Optimum Intel GitHub repository with text classification examples. This is where I learned how to apply quantization and present it to you here in this notebook.
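To give a feel for what going from FP32 to INT8 means numerically, here is a generic sketch of symmetric per-tensor quantization on random values. It illustrates the idea only and is not a description of what Intel Neural Compressor does internally; the weight distribution and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
weights = rng.normal(0.0, 0.05, size=1000).astype(np.float32)  # fake FP32 weights

# Symmetric quantization: map [-max|w|, +max|w|] onto the integers [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision was lost in the round trip.
restored = q_weights.astype(np.float32) * scale
max_error = np.abs(weights - restored).max()

print(q_weights.dtype, weights.nbytes, q_weights.nbytes)
print(f"max round-trip error: {max_error:.6f} (scale = {scale:.6f})")
```

Each value is stored in a quarter of the bytes, and the round-trip error stays within about half of one quantization step, which is why accuracy often changes only slightly.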

First, I need to load the previously trained model.

I then set up the trainer, but this time use the IncTrainer, or the Intel Neural Compressor Trainer class. We can then run a baseline model for inference using the FP32 model.

We now set up everything that we need to run quantization. One thing I do want to point out is that I am using a configuration file called quantization.yml that I downloaded from the Optimum Intel GitHub page. You can adjust some of these parameters if you would like to adjust how the model is quantized.

We can then go ahead and run quantization. If you have a relatively small test dataset size, it should quantize the model fairly quickly, within a minute or less.

And now that we have an optimized model, we can run evaluation and then compare it to the baseline model. 

We now can save the quantized model, and have both the FP32 and the newly quantized INT8 model available.

Now that we have a quantized model, let's test the inference on a test dataset to make sure that we have not lost significantly in accuracy.

In the first code snippet, I am loading the FP32 model, putting it on the CPU device, and then running inference on a test dataset.

In the second code snippet, I am loading the INT8 model, and running inference on the same test set, to compare it to the FP32 model. What you should see is a very similar accuracy/F1 score, but the INT8 model should show that it can handle more samples per second than the FP32 model.
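Samples per second can also be measured by hand with a small timing helper. In this sketch, samples_per_second and dummy_predict are hypothetical names, and the dummy function stands in for a real model call.

```python
import time

def samples_per_second(predict_fn, batch, repeats=20):
    """Average throughput of predict_fn over several timed runs."""
    predict_fn(batch)                          # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(batch)
    elapsed = time.perf_counter() - start
    return repeats * len(batch) / elapsed

# Stand-in "model": any callable that takes a batch works here.
def dummy_predict(batch):
    return [len(text) % 2 for text in batch]

batch = ["a short statement"] * 64
rate = samples_per_second(dummy_predict, batch)
print(f"{rate:,.0f} samples/second")
```

Running the helper once with the FP32 model and once with the INT8 model on the same batch, then dividing the INT8 rate by the FP32 rate, would give the speedup factor.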

To summarize, in this notebook we:

  • Loaded, split train-val-test, and tokenized the humor text data.
  • Quantized our output model from FP32 to INT8 for faster inference speed.
  • Evaluated the quantized model inference speed on a small test dataset.

Thank you for listening and happy coding!