Evaluating Automatic Labeling Models for Automated Vehicle Object Detection

Published: 11/13/2018  

Last Updated: 11/13/2018


The purpose of this article is to highlight the importance of selecting metrics that give a fuller evaluation of a neural net model’s performance than a single measure like accuracy, which is much too often the only one reported. The work also highlights the importance of taking into account the statistical distributions of the datasets used to train models. Though the target audience is developers working on object detection for Autonomous Vehicles (AV), the concepts are useful for any task targeting the problem of object detection or automated labeling. The approach used here is to take three well-known pre-trained neural net models from the TensorFlow* Zoo created and shared by Google Inc., set them up for inference without retraining, and evaluate how well they perform on a previously unseen AV dataset. Each of the convolutional neural network (CNN) models used here is an instance of one of three meta-architectures. Meta-architectures are defined as unified implementations of three different types of CNNs. The KITTI dataset is used as the AV dataset. All work is done on an Intel® Core™ i7-6700K processor 4.0 GHz x 4. Three of the models used were trained on the MS COCO dataset [1]; for comparison, a fourth model, trained on the KITTI dataset [2], was also used.

The performance of each of these models was then evaluated using the metrics Precision, Recall, and F-score as defined using a confusion matrix. It was found that the MS COCO trained models performed surprisingly well on the new AV dataset. As was to be expected, the KITTI trained model performed better with regard to the two object classes, {‘car’, ‘pedestrian’}, it was trained on. This investigation also focuses primarily on these two object classes. Despite not having been trained on an AV dataset, the MS COCO trained models perform quite well, as evidenced both by the metrics and by the observed labeling. It stands to reason that they would only get better if further trained (transfer learning) on the KITTI dataset. The models were able to identify and correctly label objects in the images that were almost entirely obscured or only partially visible and objects partially or completely in shade, and even managed to detect objects at a distance that human eyes would struggle with. The most salient lessons from this work are that developers working on object detection would do well to better understand the statistics of the datasets they use in training CNNs, and to pay more attention to the evaluation metrics used to determine how “well” their models do. Using only a single metric falls drastically short of a true characterization of a model’s performance.


There is an ever-growing choice of publicly available CNN models that have been trained on one of a number of datasets created and labeled for specific fields (medical, biological, physical, image, text, etc.), with the many associated classes representative of each of these fields. The rapidly evolving field of autonomous driving is no exception. The KITTI dataset was chosen for this study because it is an open dataset and is fairly representative of the kind of images one typically expects to encounter when driving around a mid-sized city, in this case the German city of Karlsruhe, where the driving data in the KITTI dataset was gathered. The KITTI dataset has a large number of object classes, and its recordings are organized by the type of location where the data was collected, such as city, residential, road, campus, etc.

To narrow our scope, we focus primarily on the two object classes {‘car’, ‘pedestrian’}, which are quite common and very important classes for self-driving vehicles to be able to identify reliably and consistently. The three meta-architectures we use for our CNN models in this investigation are Faster Region Based CNN (Faster R-CNN) [3], Single Shot Detector (SSD) [4], and Regional Fully Convolutional Network (R-FCN) [5]. Of course, one can question whether a fair comparison can be made between different models with many different adjustable hyperparameters, but the idea behind this initial work is merely to ascertain how well pre-trained models perform, especially with regard to object classes found in both the target dataset (KITTI, in our case) and the dataset these models were originally trained on (MS COCO). A good overview of each of these CNNs, including a comparative evaluation of their performance with regard to speed and accuracy trade-offs, can be found in the article published by the Google research team [6]. The models used in this work were obtained from the TensorFlow Detection Model Zoo [7]. To compare the performance of the three chosen models, we also include another Faster R-CNN model from the same TensorFlow Model Zoo. This particular model is pre-trained on the KITTI dataset, and for precisely the two object classes mentioned above, {‘car’, ‘pedestrian’}.

In the following sections, the questions we address include: How well do CNN models trained on one dataset (MS COCO dataset) perform on labeling a different dataset (KITTI dataset [2]) without retraining? What kind of image classification scenarios do they struggle with? These questions are investigated in the context of how well these off-the-shelf models perform relative to each other on the same dataset.

Setup and Methodology

CNN models

Table 1 displays the selected model architectures and the particular implementations chosen for this study. All models are based on the TensorFlow framework [8]. Models 1-3 were trained on the MS COCO dataset. Model 4 is a KITTI trained version of Faster R-CNN and is added to get a comparative idea of how the MS COCO trained models fare when they are tasked with labeling images from the KITTI dataset.

Table 1. Convolutional neural network (CNN) model instances used from TensorFlow* Model Zoo. Models 1-3 were trained on the MS COCO dataset. Model 4 was trained on the KITTI dataset.

CNN Type | Specific Instance of Model Used
1. Faster R-CNN | faster_rcnn_resnet101_coco_11_06_2017
2. Single Shot Detector (SSD) | ssd_inception_v2_coco_2017_11_17
3. Regional Fully Convolutional Network (R-FCN) | rfcn_resnet101_coco_11_06_2017
4. Faster R-CNN KITTI | faster_rcnn_resnet101_kitti_2017_11_08

Treating different models as instances of what are called meta-architectures introduces a clear means of comparing different architectures by treating them as being built from swappable modules [4]. As such, all three types of models considered in this work can use one of several different types of feature extractors. Specifically, referring to table 1, models 1, 3, and 4 all use ResNet-101 as the feature extractor, whereas model 2 (SSD) uses Inception V2 as its feature extractor. In terms of operation, SSD uses a single feed-forward CNN, doing away with the need for region proposal generation or feature resampling [5] as used by Faster R-CNN and R-FCN. Having all computation in one network results in SSD having inference times that are considerably shorter than those of Faster R-CNN and R-FCN. Faster R-CNN uses what is found to be an expensive per-region subnetwork, which is applied a few hundred times on each image. R-FCN avoids this costly hit by pushing feature cropping to the very last layer before class prediction, achieving performance metrics comparable to Faster R-CNN with lower inference times.

System setup

The automatic labeling runs are conducted on a system that would typically be used as a higher end desktop computer. The specifications for the system used to run automatic labeling are shown in table 2.

Table 2. System setup

System Specifications | Description
Operating System | 4.4.0-93-generic 14.04.1-Ubuntu* x86_64 GNU/Linux*
Processor | Intel® Core™ i7-6700K processor, 4.00 GHz x 4
System Memory | DIMM synchronous 2133 MHz (0.5 ns) x 4, 32 GiB
Hard Drive | ATA disk, Intel® SSD SC2BW48, 480 GB

The KITTI dataset

Our target dataset is the KITTI dataset. We primarily focus on the two KITTI object categories {‘car’, ‘pedestrian’}, as these have a relatively high rate of occurrence in the KITTI dataset and are of high importance for any object-detection task for a self-driving car. The KITTI dataset collects data in driving scenarios it categorizes as {‘city’, ‘residential’, ‘road’, ‘campus’, ‘person’} where ‘person’ is data recorded by a stationary observer as opposed to recording from a moving vehicle, which is the case with the rest of the categories. It is evident that the number of instances of the object class {‘car’} by far outstrips that of object class {‘pedestrian’} in the three scenarios that arguably matter most in terms of where most cars spend a good portion of their time in typical driving scenarios {‘city’, ‘residential’, ‘road’}. These are also the two categories that the labeling effort for the KITTI dataset was focused on. Statistically, the total number of images, approximately 20,000, with a ‘car’ object count of 1-6 cars per image is significantly higher than the number of images, approximately 14,000, with a ‘pedestrian’ object count of 1-6 pedestrians per image.

The MS COCO dataset has similar object classes, which appear as {‘car’, ‘person’}. It was therefore necessary to first convert the KITTI dataset into a format suitable for model ingestion (JSON format in this case), and to extend the default scripts a little to handle the conversion of object class names between the two datasets, so that KITTI’s two object classes {‘car’, ‘pedestrian’} map onto MS COCO’s corresponding classes {‘car’, ‘person’}. Interestingly enough, the total number of labeled instances is approximately 180,000 (‘car’) and 105,000 (‘pedestrian’) for KITTI, and approximately 100,000 (‘car’) and 800,000 (‘person’) for COCO. So, unlike CNNs trained on KITTI, CNNs trained on COCO see training data that is very heavily skewed towards person/pedestrian.
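The class mapping and format conversion described above can be sketched roughly as follows. This is not the script used in the study; it is a minimal illustration that assumes the standard KITTI object label layout (type, truncation, occlusion, alpha, then the 2D box as left, top, right, bottom) and the usual MS COCO category ids (1 for ‘person’, 3 for ‘car’). The function names and the sample label lines are hypothetical.

```python
import json

# Assumed mapping from KITTI type names to MS COCO (category id, name).
KITTI_TO_COCO = {"Car": (3, "car"), "Pedestrian": (1, "person")}


def kitti_line_to_coco(line, image_id, annotation_id):
    """Convert one KITTI label line to a COCO-style annotation, or None."""
    fields = line.split()
    if len(fields) < 8 or fields[0] not in KITTI_TO_COCO:
        return None  # class ignored here (e.g. 'Van', 'Truck', 'DontCare')
    left, top, right, bottom = (float(v) for v in fields[4:8])
    category_id, _ = KITTI_TO_COCO[fields[0]]
    width, height = right - left, bottom - top
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [left, top, width, height],  # COCO boxes are [x, y, w, h]
        "area": width * height,
        "iscrowd": 0,
    }


def kitti_labels_to_coco(label_lines, image_id):
    """Convert all label lines of one image, skipping ignored classes."""
    annotations = []
    for line in label_lines:
        ann = kitti_line_to_coco(line, image_id, len(annotations) + 1)
        if ann is not None:
            annotations.append(ann)
    return annotations


# Made-up label lines in KITTI layout, for illustration only.
sample = [
    "Car 0.00 0 -1.57 100.0 150.0 300.0 250.0 1.5 1.6 3.7 1.0 1.7 20.0 -1.5",
    "Pedestrian 0.00 1 0.20 400.0 160.0 440.0 260.0 1.8 0.6 0.9 5.0 1.6 15.0 0.2",
    "Van 0.00 0 -1.50 500.0 150.0 600.0 220.0 2.0 1.9 5.0 8.0 1.8 30.0 -1.5",
]
coco_annotations = kitti_labels_to_coco(sample, image_id=1)
print(json.dumps(coco_annotations, indent=2))
```

In this sketch the ‘Van’ line is dropped, which mirrors the choice made in this work to ignore classes other than {‘car’, ‘pedestrian’}.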

It is also worth pointing out that both the KITTI and COCO datasets have other categories one would likely class as {‘car’}, such as {‘truck’, ‘van’} in KITTI and {‘truck’, ‘bus’} in COCO. We ignore these in our work, as they occur in significantly lower numbers in both datasets.

Model evaluation

Time to inference

All the models are run on the same system shown in table 2. The time to inference, ∆tinf, is the average time a given model takes to label objects on any given image. It is used here only as a relative measure of how much time each model takes to go through a labeling task for an image, and little emphasis is placed on the absolute times as these would obviously differ based on which underlying system models are run on as well as the size of the images used.
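A minimal sketch of how such an average per-image time might be measured is shown below. The timing harness, the name `mean_inference_time`, and the stand-in detector are assumptions for illustration; the article does not describe how its timings were collected.

```python
import time


def mean_inference_time(run_inference, images):
    """Return the mean wall-clock seconds run_inference takes per image."""
    if not images:
        raise ValueError("need at least one image")
    start = time.perf_counter()
    for image in images:
        run_inference(image)
    return (time.perf_counter() - start) / len(images)


def fake_detector(image):
    # Hypothetical stand-in for a real model call; it just sums the pixels.
    return sum(image)


images = [[0] * 1000 for _ in range(10)]
dt_inf = mean_inference_time(fake_detector, images)
print(f"mean time to inference: {dt_inf:.6f} s/image")
```

Because only relative times matter here, the same harness would be applied unchanged to every model on the same set of images.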

Figure 1. Illustrated confusion matrix for object class {‘car’}, where TP, FP, TN, and FN are True Positive, False Positive, True Negative, and False Negative, respectively.

Performance metrics

In obtaining evaluation metrics, we rely on an evaluation script that comes with the MS COCO trained models. This evaluation script reports metrics for a given model in terms of Recall and Precision for a given threshold of the Intersection over Union (IoU) value, defined as

$$\mathrm{IoU} = \frac{A_{pred} \cap A_{gt}}{A_{pred} \cup A_{gt}}$$

where A_pred is the area of the model’s ‘predicted’ bounding box and A_gt is the area of the actual or ‘Ground Truth’ bounding box for the labeled object located in the image under consideration. The metrics derived from the confusion matrix of figure 1 can be framed in a more intuitive manner. We can think of Precision (ϕ) as follows: out of all the objects (TP + FP) a model classified as belonging to object class ‘car’ (or ‘pedestrian’), how many actually turned out to be valid predictions (TP) of object class ‘car’ (or ‘pedestrian’)? Similarly, the metric Recall (ρ) can be understood as follows: out of all the real (actual) instances of ‘car’ (or ‘pedestrian’) present in the image under consideration (TP + FN), how many did our model identify as ‘car’ (or ‘pedestrian’), as opposed to missing them (classifying them as FN)?
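The definitions above can be made concrete with a short sketch. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and uses a simple greedy matching at a single IoU threshold; the MS COCO evaluation script is more involved (for example, it orders predictions by confidence score), so this is an illustration of the concepts rather than a reimplementation. All function names and box coordinates are made up.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, threshold=0.5):
    """Greedily match predictions to ground truth; return (TP, FP, FN)."""
    unmatched = list(ground_truths)
    tp = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)  # each ground-truth box is used once
            tp += 1
    return tp, len(predictions) - tp, len(unmatched)


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up boxes: two good detections, one spurious one, one missed object.
ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
predictions = [(1, 1, 10, 10), (21, 19, 31, 29), (80, 80, 90, 90)]
tp, fp, fn = count_matches(predictions, ground_truths)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, precision, recall)
```

Note that True Negatives never enter either ratio, which is why they play no role in the Precision and Recall values.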

To evaluate the models, we use MS COCO’s standard evaluation metric [1], which evaluates mean Average Precision (mAP) averaged over 10 evaluation thresholds, IoU ∈ [0.5 : 0.05 : 0.95]. We use the mAP as defined by MS COCO (denoted by ϕ) along with the mean Average Recall (denoted by ρ), to define a corresponding F-score for each model as

$$F = \frac{2\,\phi\,\rho}{\phi + \rho}$$
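As a rough illustration of how these quantities combine, the sketch below builds the ten IoU thresholds and computes an F-score from ϕ and ρ, assuming the balanced harmonic-mean form (the F1 score that the discussion later states is used). The per-threshold precision values and the recall value are invented placeholders, not results from this study.

```python
def f_score(phi, rho):
    """Harmonic mean of phi (mAP) and rho (mean Average Recall)."""
    return 2 * phi * rho / (phi + rho) if phi + rho else 0.0


# The ten COCO-style IoU thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical average precision at each threshold (placeholders only).
ap_per_threshold = [0.60, 0.57, 0.53, 0.48, 0.42, 0.35, 0.27, 0.18, 0.09, 0.02]
mean_ap = sum(ap_per_threshold) / len(ap_per_threshold)
mean_ar = 0.45  # hypothetical mean Average Recall

print(iou_thresholds)
print(round(f_score(mean_ap, mean_ar), 3))
```

The harmonic mean is pulled towards the smaller of the two inputs, so a model cannot reach a high F-score by doing well on only one of Precision and Recall.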

Qualitative performance

Finally, we also examine some labeling samples and evaluate how well the off-the-shelf MS COCO trained models do on the labeling of the object classes of interest, {‘car’, ‘pedestrian’}. We look for how well the models handle object detection with regards to object occlusion and truncation; distance and size of objects; instances of false positives (mislabeling); false negatives or undetected objects; performance on objects in various conditions of lighting and shadow; etc.


The following sections are the results of inference runs on the KITTI dataset. For the 3 types of models used, we present the relative time to inference, precision, recall and F-Score. A few select images are included to highlight some aspects of the inference results.

Time to inference

Table 3. Absolute inference times in seconds, ∆tinf, per image.

CNN Type | Specific Instance of Model Used | ∆tinf/Img (s)
1. Faster R-CNN | faster_rcnn_resnet101_coco_11_06_2017 | ≈ 8.98
2. Single Shot Detector (SSD) | ssd_inception_v2_coco_2017_11_17 | ≈ 1.17
3. Regional Fully Convolutional Network (R-FCN) | rfcn_resnet101_coco_11_06_2017 | ≈ 4.50
4. Faster R-CNN KITTI | faster_rcnn_resnet101_kitti_2017_11_08 | ≈ 11.89

Figure 2. Inference times, ∆tinf, relative to Faster R-CNN.
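The relative times plotted in figure 2 follow directly from the per-image times in table 3. The short sketch below recomputes them, along with each model's time as a multiple of SSD's; the dictionary keys are shorthand labels chosen for this example.

```python
# Approximate per-image inference times in seconds, taken from table 3.
times = {
    "Faster R-CNN (COCO)": 8.98,
    "SSD (COCO)": 1.17,
    "R-FCN (COCO)": 4.50,
    "Faster R-CNN (KITTI)": 11.89,
}

baseline = times["Faster R-CNN (COCO)"]
fastest = times["SSD (COCO)"]

# Time relative to COCO trained Faster R-CNN, and as a multiple of SSD's time.
relative = {name: t / baseline for name, t in times.items()}
multiple_of_ssd = {name: t / fastest for name, t in times.items()}

for name in times:
    print(f"{name:22s} relative={relative[name]:.2f} "
          f"x SSD={multiple_of_ssd[name]:.1f}")
```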

Precision, recall, and F-Score

Figure 3. Mean average precision against size of bounding box enclosing detected object.

Figure 4. Mean average recall against size of bounding box enclosing detected object.

Figure 5. F-Score against size of bounding box enclosing detected object.

Figure 6. Occlusion, truncation or partial object visibility.

Figure 7. Small and Distant Objects.

Figure 8. Undetected or false negatives.

Figure 9. Lighting and shadow.

Figure 10. Mislabeled or false positives.


For sheer relative speed of labeling per image (table 3), the SSD model beats the Faster R-CNN and R-FCN by factors of approximately 8x and 4x, respectively, with the KITTI trained Faster R-CNN model taking roughly 1.3x as long as the COCO trained Faster R-CNN model. This falls within the relative speeds reported by the developers of the SSD [4] model, which they quote as 59 frames per second (FPS) for their SSD vs. 7 FPS for Faster R-CNN running labeling on the same dataset. Similarly, the authors of R-FCN [5] report R-FCN speeds greater than 2.5x compared to Faster R-CNN, which is broadly in line with the roughly 2x found here. Although the metrics we used have their shortcomings [9, 10], we can see the relative performance of the three types of CNN models we have used based on their relative scores. The F-measure captures the balance between Recall and Precision and is a better measure than accuracy, especially for skewed datasets. As mentioned in “The KITTI dataset” section, the KITTI dataset appears to be skewed with regard to the object classes we focus on, {‘car’, ‘pedestrian’}. The relative number of each object class in the KITTI dataset [2] shows a greater number of occurrences of the object class {‘car’} than of object class {‘pedestrian’}, indicating that a better choice of F-score would likely be the generalized F-score given by

$$F_\beta = \frac{(1 + \beta^2)\,\phi\,\rho}{\beta^2\,\phi + \rho}$$

where ϕ = Precision, ρ = Recall, and β adjusts the relative importance of Precision vs. Recall. Lower β values reduce the importance of Recall relative to Precision [11]. However, for our model evaluation purposes, we use the F1 metric with β = 1, giving equal weighting to both Recall and Precision. One aspect of the performance that our chosen metric, F1, doesn’t capture is the instances of True Negatives (TN), as illustrated in figure 1. For our purposes though, the higher the F1 score the better, so by this measure the COCO trained Faster R-CNN and R-FCN models do marginally better than the SSD model, as shown in figure 5. The Faster R-CNN and R-FCN models perform equally well with regard to Precision and Recall for all object sizes (figures 3 and 4). The SSD model is found to have a lower mAP than the other models, which have a longer inference time per image; but interestingly enough, the SSD model’s average Recall values for all object sizes (large, medium, and small) are quite a bit higher than those of the COCO trained Faster R-CNN and R-FCN models, and are only bested by the KITTI trained Faster R-CNN model. However, note that SSD’s performance metrics seem to fall off more rapidly as the object size decreases compared to all the other models.
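The effect of β in the generalized F-score can be seen with a small numerical sketch. It uses the standard F-beta definition, under which β > 1 weights Recall more heavily and β < 1 weights Precision more heavily; the Precision and Recall values are invented for illustration.

```python
def f_beta(phi, rho, beta=1.0):
    """Generalized F-score for precision phi and recall rho."""
    b2 = beta * beta
    denom = b2 * phi + rho
    return (1 + b2) * phi * rho / denom if denom else 0.0


# Hypothetical model with high Precision but low Recall.
phi, rho = 0.8, 0.4
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(phi, rho, beta):.3f}")
```

For this hypothetical model the score is highest at β = 0.5 and lowest at β = 2, since its weak Recall is penalized more as β grows. With β = 1 the expression reduces to the F1 score used in this evaluation.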

For all metrics shown in figures 3-5, the KITTI trained Faster R-CNN appears to have the highest average scores for all object sizes for this particular sample of KITTI data. This seems to be corroborated by the fact that no False Positive (FP) examples were found (for the samples used) for the KITTI trained Faster R-CNN (shown in figure 10). Of course, this only suggests that the KITTI trained model, which has a vocabulary of only two classes {‘car’, ‘pedestrian’}, is able to detect a larger number of “actual” instances of these two classes (higher Recall) and has a higher rate of return of True Positives (TP) among all the objects it identifies as belonging to {‘car’, ‘pedestrian’} (that is, higher Precision). The KITTI trained Faster R-CNN model scores substantially higher than the rest when all object sizes are taken into consideration.

When examining the qualitative performance of the models, it is rather impressive that they are able to identify a car even when the car is only partially visible (figure 6). This was generally found to be the case even when cars in a line of parked cars occluded each other, as shown on the right-hand side of figure 6(d). All the models seem to struggle somewhat with more distant objects (figure 7), which are mostly passed over as FNs or sometimes mislabeled as FPs, as is the case in figure 10(b), where an object of class {‘car’} is mislabeled as belonging to object class {‘person’}. Most of the FN occurrences (figure 8) for all models happened for distant or small objects, but occasionally spectacularly wrong FN classifications would occur, such as that by SSD in figure 8(b), where pedestrians in the foreground are missed but those in the image background, which are obviously smaller, are correctly identified. There were more than a few instances of Faster R-CNN, SSD, and R-FCN misidentifying inanimate objects as belonging to classes {‘car’, ‘pedestrian’}, but this did not seem to be the case for the KITTI trained Faster R-CNN used in this investigation. For the most part, all the models fared pretty well in identifying vehicles in various conditions of lighting. However, the particular samples illustrated in figure 9 show some examples of SSD (figure 9(b)) and KITTI trained Faster R-CNN (figure 9(c)) not detecting some obvious objects of class {‘car’} that are in the shadows but can clearly be made out by a human. The R-FCN model (figure 9(c)) impressively picks out a couple of cars that are not only in the shadows but also small and distant; and even though it groups the cars as one car, this is a case that would also be slightly challenging to a human observer.


Despite being trained on a different dataset (MS COCO) with 90 different object classes, the three straight “off-the-shelf” CNN models fared pretty well on a dataset (KITTI) they weren’t trained on. The KITTI trained model could only identify two object classes {‘car’, ‘pedestrian’}, and this is what the study focused on. The KITTI model was found to perform quantitatively better than the MS COCO trained models on the objects it detected, consistently scoring a higher Recall, Precision, and F-score. Of course, the KITTI dataset was specifically created for object detection for the AV use case, and the KITTI trained model used KITTI’s training and test datasets for training, meaning it was exposed to a preponderance of the kind of object classes one would expect to find when driving around a typical modern city. The MS COCO dataset is more generic, covering a large number of everyday objects, of which vehicles and pedestrians make up a statistically smaller share. The expectation then is that the MS COCO trained models would perform just as well if they were actually trained on an AV dataset such as the recent and extensive Berkeley Deep Drive dataset [12]. An important question raised by this investigation is whether a neural net, NN1, trained to recognize just a handful of objects would perform better on these objects than a neural net, NN2, trained on a much larger set of classes that includes all the classes NN1 recognizes. Answering this question would determine whether object detection should be designed around “specialized” or “generalized” neural nets. In the specialized case, object detection would be composed of several neural nets, each specialized to detect only a handful of objects, but which taken as a collective are able to scale to more classes more effectively, without a drop in performance, than a single neural net designed to recognize a large number of classes.

About the Author

Mambwe Mumba is a software engineer for the System Simulation Center (SCC) where he played a critical early role in pushing and supporting adoption of the virtual platform Simics. He was awarded the Intel Achievement award for his contributions. Simics is now used extensively across Intel and by Intel's partners from firmware and BIOS development; test and continuous integration; architectural studies; fault injection; and pretty much anything for which actual silicon doesn't yet exist or is too expensive to procure. Mambwe holds a PhD in theoretical physics and has also recently embarked on a journey into applied research in the blossoming field of deep learning. Mambwe has an enthusiastic fascination for cultures and languages of the world; he dabbles in language learning (the spoken kind) in his spare time.


  1. Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
  2. Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
  3. Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
  4. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
  5. Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: object detection via region-based fully convolutional networks. CoRR, abs/1605.06409, 2016.
  6. Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
  7. Open. TensorFlow model zoo, 2017.
  8. Martín Abadi, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  9. David M. W. Powers. What the F-measure doesn’t measure: Features, flaws, fallacies and fixes. CoRR, abs/1503.06410, 2015.
  10. David Powers. Evaluation: From precision, recall and F-factor to ROC, informedness, markedness and correlation. 2, 01 2008.
  11. Mohamed Bekkar, Dr. Hassiba Kheliouane Djemaa, and Dr. Taklit Akrouf Alitouche. Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications, 2013.
  12. Berkeley Deep Drive.
