Using Neuraltalk2 Automatic Sentence Generation to Create Deep Learning Workflows

Published: 11/16/2018

Last Updated: 11/16/2018

Introduction

Start working with deep learning (DL) workflows, or simply learn what they involve, with this educational article. It shares experiences from preparing and delivering a training session based on picking up a publicly available DL repository and attempting to replicate it and get it working by following only the tutorial in the corresponding GitHub* repository.

In working through the process and documenting the issues encountered, this article gives an accurate account of the challenges developers hoping to work in this field should expect to face. This article presents a general understanding of the challenges and issues encountered when using a deep learning workflow (DLW) that employs object detection modules used in autonomous driving and various other fields. It is by no means an exhaustive review, but rather aims to impart an appreciation for the overall process.

Whether it’s for the purpose of research or the need to set up third-party solutions, a major part of working with neural net models involves following a workflow that usually requires pulling a DL model from a GitHub repository or developing your own models from scratch. Trying to reproduce work done by third parties usually involves finding public repositories and attempting to follow the setup instructions in the README file. This work is targeted at developers looking to go through that initial process of replicating or deploying a DL setup, field engineers who need to set up inferencing demonstrations or retrain a third-party model, and anyone curious to understand the pain points and considerations common to most DLWs, especially when the starting point is a third-party model.

The approach taken here is to go through the stages of the DLW by following the Neuraltalk2 [1] automatic sentence generation tutorial instructions, while highlighting and commenting on issues encountered and pain points to look out for in any similar DLW. Working through and documenting the issues and their resolutions highlights the stages of the workflow and the pain points encountered or expected at each one. It also demonstrates the complete pipeline, from data gathering or acquisition and model selection through to training, evaluation, and inferencing, for anyone curious about the full DLW.

Setup and Methodology

From the original Neuraltalk2 [1] GitHub repository, start with the README file. The idea is to follow the setup instructions for a specific platform. This stage rarely works straight out of the box and usually involves a lot of online searching and trying out various fixes for any number of issues that pop up.

The specifications for the system used to run automatic labeling in this work are shown below in table 1. The issues encountered vary as widely as the host platforms do, even across generations of the same operating system, e.g., Ubuntu* 16.04, 17.04 or 18.04 LTS. Issues at this stage may include installing Lua, making sure the Neuraltalk2 scripts actually run with it, and debugging when they do not.

A specific issue encountered here was that the Torch*8 install script provided in the original Neuraltalk2 README file didn’t install as expected. The solution used here is to look up the actual Torch website and follow through the Torch installation instructions.

Table 1. System setup.

System Specifications   Description
Operating System        4.4.0-93-generic 14.04.1-Ubuntu* x86_64 GNU/Linux*
Processor               Intel® Core™ i7-6700K CPU @ 4.00GHz x 4
System Memory           DIMM synchronous 2133 MHz (0.5ns) x 4, 32 GiB
Hard drive              ATA Disk, Intel® SSD SC2BW48, 480 GB

This constant cycle (attempt to execute the lines in the README file, encounter an issue, debug, find a workaround) appears to be the norm rather than the exception when replicating third-party DLW repositories in the public arena. It is a substantial time sink, compounded by the fact that workarounds found for one host may apply only to that specific generation of that specific host. Using another version of Linux*, for example, would bring up different issues from the ones encountered here. Once a working setup is found, the user can focus on the rest of the workflow. The DLW can be divided into four stages or phases. The following sections work through these four main phases while adhering to the Neuraltalk2 sentence generation tutorial [2], taking note of other considerations needed while attempting to reproduce the setup and highlighting issues bound to pop up with any similar DLW.

Deep Learning Workflow Phases

The four main phases of the deep learning workflow, once a problem has been selected, are:

1. Data search, selection, and preparation
2. Model selection and setup
3. Model training
4. Evaluation of model performance

As pointed out earlier, the kinds of setup issues faced will depend on the host system used (Apple OS*, Ubuntu, etc.) and on whether the training is run in CPU or GPU mode. CPU mode is used here (see figure 1). The use case that follows is object detection with sentence generation to give context to the detected objects.

Phase 1: Data search, selection, and preparation

Assuming a problem has already been chosen and a deep learning neural net is found to be a compelling or appropriate solution, the next logical step is dataset search, selection, and preparation (see figure 1). This involves finding or preparing good-quality labeled data to train the neural net on. Neuraltalk2 used the MS COCO dataset [3] as its training dataset. This is a well-curated and well-known dataset.

Well-labeled datasets are not usually easy to find. A lot of progress is being made to address this, but we are still far from being able to easily find inexpensive, well-labeled data for specific use cases. Some universities and institutions make their data available for free, for instance the KITTI Dataset [4], and, as of this write-up, Microsoft* has just launched an online open sharing repository for AI and data science datasets [5]. For “supervised” training, the selected dataset must have ground truth labels and is usually split into “train”, “test” and “evaluation” datasets.
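
As a rough illustration of the split described above, the following Python sketch shuffles a list of labeled items and carves off validation and test subsets. It is not taken from the Neuraltalk2 scripts; the function name `split_dataset` and the sample image ids are hypothetical, and the parameter names simply echo the `--num_val` and `--num_test` flags used in the preprocessing command in this article.

```python
import random


def split_dataset(items, num_val, num_test, seed=0):
    """Shuffle items and split them into train, validation, and test lists.

    Hypothetical helper for illustration only; it is not the logic used
    by any Neuraltalk2 script.
    """
    if num_val + num_test > len(items):
        raise ValueError("num_val + num_test exceeds the dataset size")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:num_val]
    test = shuffled[num_val:num_val + num_test]
    train = shuffled[num_val + num_test:]
    return train, val, test


# Made-up image ids standing in for a labeled dataset.
image_ids = list(range(100))
train, val, test = split_dataset(image_ids, num_val=10, num_test=10)
print(len(train), len(val), len(test))  # prints: 80 10 10
```

Fixing the seed is a design choice that makes it possible to compare training runs against the same held-out images.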

Figure 1. Dataset search, selection, and preparation.

In this case the dataset is free and already curated for download and use. After the MS COCO dataset is downloaded, a Neuraltalk2 script is used to preprocess the data and split off validation and test bins, as shown in the code snippet:

%%time
# Image preprocessing stage
!python prepro.py --input_json coco/coco_raw.json \
--num_val 5000 \
--num_test 5000 \
--images_root coco/images \
--word_count_threshold 5 \
--output_json coco/cocotalk.json \
--output_h5 coco/cocotalk.h5


Phase 2: Model selection and setup

Phase 2, depicted in figure 2, involves selecting an appropriate model for your use case. There is currently a veritable zoo of models and frameworks, and faster, more efficient, higher-accuracy models are being released at an unprecedented rate. One thing to consider when picking a model is which framework [8] you plan to use. Widely used frameworks such as TensorFlow* have a larger number of models to choose from and more community support for setup and deployment issues. On the other hand, a model for your use case (here, object detection with sentence generation that describes the context of the detected object) may not be available in the framework of your choice. Neuraltalk2, for instance, is written in the Lua language on the Torch framework. Lua does not have as wide a community uptake as more popular languages such as Python*, which have garnered much wider traction [6].

Once a model is chosen, one usually pulls it from a public repository and attempts to set it up by following the “README.md” file and using internet searches to chase down and resolve the many inevitable issues. Some are easy to chase down by inspecting the error and tracing it through the relevant code or scripts. Other errors are so obscure that finding a solution online becomes a considerable time sink. For example, an issue that popped up repeatedly involved the steps in the Neuraltalk2 README file where LuaRocks (the package manager for Lua modules) is used to install required packages. The failed commands eventually worked after Git was configured to use the https protocol instead of the git protocol behind the corporate network. Such seemingly small things can waste quite a bit of time spent figuring out why what should be a straightforward terminal command is failing.

$ git config --global url."https://".insteadOf git://
$ luarocks install nn
$ luarocks install nngraph
$ luarocks install image


Searching online, for example with Google*, is often the solution. If others have already worked through the setup, use their findings and hints to guide troubleshooting. When taking the first tentative steps in deep learning, it is advisable to try a framework with a large user base first, simply because a substantially larger user community will likely have already encountered, solved, and shared solutions for common problems.

Figure 2. Model selection and setup.

A lot of model repositories now offer models as Docker* containers. The use of containers, whether you’re deploying locally or in a cluster or cloud environment, has expanded enough that it offers a means of taming the complexity of setup and environment. One can download the required Docker image of a targeted framework with a neural net model already set up and run it without the pain of setting up the environment. Containerizing is also the method of choice for deploying in cluster or cloud environments.

Third-party models don’t always work for the purpose you intend straight out of the box. For instance, your particular use case may target a different number of classes, some or all of which may not overlap with the classes the third-party model was trained to detect. This means the configuration scripts need to be edited, or the output stage replaced with a new one, before training is done. Once this phase is complete, there will be a working model modified with the right number of classes in its output stage for your use case.

Phase 3: Train model on new dataset

Figure 3. Retraining model on new dataset.

In phase 3, training the model depicted in figure 3, the modified model is run in training mode with the labeled dataset acquired in phase 1. Issues may arise from incomplete training if the model crashes after a few steps or further along after a few epochs. Troubleshooting this stage also involves a mixture of following through the errors combined with internet searches for similar issues reported by other users. Quite a few times the problem pointed to by the crashed scripts may have nothing to do with what the output says.

For the Neuraltalk2 setup, one easily encountered problem is that the checkpoint model comes in two versions, GPU and CPU. Take care to pick the correct one, or attempts to run training will fail repeatedly, especially since there may be no feedback that no GPU was detected on the host.

The training snippet is shown below with the model selected to train on CPU by setting the flag “-gpuid” to -1:

%%time
# Train the model (CPU mode: -gpuid -1)
!th train.lua -input_h5 coco/cocotalk.h5 \
-input_json coco/cocotalk.json \
-checkpoint_path checkpoints \
-cnn_proto model/VGG_ILSVRC_16_layers_deploy.prototxt \
-cnn_model model/VGG_ILSVRC_16_layers.caffemodel \
-gpuid -1


Figure 4. Training iterations with falling validation loss in Jupyter notebook.

Phase 4: Validate/Evaluate model performance

Figure 5. Evaluation, Validation and Field testing.

Once training is complete (or a checkpoint model has been obtained, if you are not training one yourself), the next step is to evaluate how “well” the model performs at the object detection task set for it. Various metrics are used to evaluate the performance of models, such as precision, F-score, and accuracy. A confusion matrix [12], illustrated in figure 5, is a good way of capturing the statistical distribution of “predicted classes” (what the model ‘thinks’ the object is) versus “actual classes” (what the object really is) and lends itself quite easily to calculating various metrics.
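
To make the link between a confusion matrix and these metrics concrete, here is a minimal Python sketch. It assumes a simple list of actual labels and a list of predicted labels; the function names and the sample labels are made up for illustration and are not part of Neuraltalk2.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def metrics_for(label, matrix):
    """Return (precision, recall, F1) for one class from the pair counts."""
    tp = matrix[(label, label)]
    fp = sum(n for (a, p), n in matrix.items() if p == label and a != label)
    fn = sum(n for (a, p), n in matrix.items() if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(matrix):
    """Fraction of all samples whose predicted class equals the actual class."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else 0.0


# Made-up labels, for illustration only.
actual = ["car", "car", "person", "car", "person", "bike"]
predicted = ["car", "person", "person", "car", "person", "car"]

matrix = confusion_matrix(actual, predicted)
print(metrics_for("car", matrix))
print(accuracy(matrix))
```

In this example the "car" class has two true positives, one false positive (a bike labeled as a car), and one false negative (a car labeled as a person), so precision and recall are both 2/3.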

Neuraltalk2 offers switches to use various evaluation metrics such as Bilingual Evaluation Understudy (BLEU) and Consensus-based Image Description Evaluation (CIDEr). But for our purposes the model’s performance is evaluated by running inference on some test images it hasn’t “seen” before and directly examining how accurate the generated description of each scene is. Running an evaluation on 10 images is shown using the Jupyter* Notebook code snippet below:

%%time
# Run inference on 10 images
NUM_IMAGES = 10
!th eval.lua -model ../pretrained-chkpts/model_id1-501-1448236541.t7_cpu.t7 \
-image_folder coco/images/val2014/ \
-num_images $NUM_IMAGES -gpuid -1
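
For readers curious about what a metric such as BLEU measures, the Python sketch below computes a simplified, sentence-level BLEU-style score: clipped n-gram precision up to bigrams, combined with a brevity penalty and no smoothing. It is an illustration under those stated simplifications, not the scoring code behind the Neuraltalk2 evaluation switches, and the function names and sample captions are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with counts clipped."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    overlap = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())


def simple_bleu(candidate, references, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Use the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(candidate))
    return penalty * math.exp(log_mean)


# Made-up captions, for illustration only.
generated = "a man riding a bike down a street".split()
references = ["a man rides a bike down the street".split()]
print(simple_bleu(generated, references))
```

A score of 1.0 means the generated caption matches a reference exactly at the n-gram level, while 0.0 means at least one n-gram order had no overlap at all.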


Figure 6 below shows the outcome of this evaluation. The descriptions of the objects in the images are pretty accurate. But these are from a dataset, MSCOCO, that was also used to train the model so it has been exposed to a good selection of the types of objects found in the MSCOCO dataset.

Figure 6. Inference on MSCOCO dataset.

What happens when the model is given images from a completely different dataset, one it has never been exposed to and that was designed for the very specific use case of automated vehicle object detection? Figure 7 below shows the scene descriptions generated by the trained model. It is evident that the model would need to be exposed to a lot more images relevant to driving scenarios before ever being deployed in any kind of automated vehicle.

Figure 7. Inference on KITTI dataset.

Other Considerations

Other key performance indicators (KPIs) to keep in mind when choosing systems for deploying DL neural nets include latency (the seconds per frame required for video decode and inference) and throughput (at a minimum, 24 fps should be supported for video). For edge and cloud deployment, considerations may also include performance per watt and scaling efficiency, that is, how much speedup is obtained as cores or nodes are added.
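
The KPIs above can be estimated with very little code. The Python sketch below times a placeholder per-frame function, converts the latency into frames per second, and computes scaling efficiency as speedup divided by the number of units. All names and numbers here are hypothetical, and the throughput formula assumes frames are processed one at a time on a single stream.

```python
import time


def measure_latency(process_frame, frames):
    """Average wall-clock seconds per frame for a per-frame callable."""
    if not frames:
        raise ValueError("frames must not be empty")
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    return (time.perf_counter() - start) / len(frames)


def throughput_fps(seconds_per_frame):
    """Frames per second, assuming sequential single-stream processing."""
    return 1.0 / seconds_per_frame


def scaling_efficiency(time_one_unit, time_n_units, n_units):
    """Speedup gained from n cores or nodes, divided by n (1.0 is ideal)."""
    speedup = time_one_unit / time_n_units
    return speedup / n_units


def fake_inference(frame):
    """Placeholder workload standing in for video decode plus inference."""
    return sum(i * i for i in range(1000))


latency = measure_latency(fake_inference, list(range(50)))
fps = throughput_fps(latency)
print("meets 24 fps target:", fps >= 24)

# Made-up timings: 100 s on one node, 30 s on four nodes.
print("scaling efficiency:", scaling_efficiency(100.0, 30.0, 4))
```

Measured numbers depend heavily on the host, so the timing output is best compared only between runs on the same system.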

Summary

Working through the Neuraltalk2 setup, inference, and training, along with a discussion of the typical issues expected in a DL workflow, gives an appreciation for the kinds of considerations and challenges involved. The discussion is pitched at a level that lets those just starting out, or those seeking to understand the overall process, appreciate the pain points and issues to expect when making use of the many public deep learning implementations. Get started today using the Neuraltalk2 sentence generation tutorial.