Intel® Labs Presents Latest Research at 2021 International Conference on Computer Vision


  • The 2021 International Conference on Computer Vision runs virtually from October 11-17.

  • Intel® Labs presents its latest computer vision research, including work on vision transformers, self-attention networks, interactive vision-based driving policies, and new approaches for training visual relationship models.

  • Intel Labs is a co-organizer of the third edition of the “Deep Learning for Geometric Computing” workshop and challenge.



The International Conference on Computer Vision (ICCV 2021) is the industry’s premier international event on computer vision and takes place virtually this year from October 11-17. Intel’s computer vision strategy includes researching solutions for applications ranging from industrial machine vision to artificial intelligence (AI) to robotics. Its portfolio of computer vision products enables companies to automate computer vision systems and accelerate their path to production.

At ICCV, Intel will present its latest research in computer vision, including seven papers on topics such as vision transformers, self-attention networks, interactive vision-based driving policies, and new approaches for training visual relationship models. Intel will also participate in the ICCV 2021 Low-Power Computer Vision Workshop on October 11th, the ICCV 2021 DeepMLT Workshop on Multi-Task Learning in Computer Vision on October 16th, and the ICCV 2021 Autonomous Vehicle Vision Workshop on October 17th.

In addition, Intel will co-organize the third edition of the “Deep Learning for Geometric Computing” workshop and challenge on October 11th. The workshop gathers computer vision researchers to advance the state of the art in topological and geometric shape analysis using deep learning and will include competitions, proceedings, keynotes, paper presentations, and a fair.

Highlighted Papers 

Following is a complete list of Intel’s workshops and papers at this year’s conference: 


ICCV 2021 Deep Learning for Geometric Computing Workshop - October 11th
ICCV 2021 Low-Power Computer Vision Workshop - October 11th
ICCV 2021 Workshop on Multi-Task Learning in Computer Vision - October 16th
ICCV 2021 Autonomous Vehicle Vision Workshop - October 17th

Main Conference Papers 

Vision Transformers for Dense Prediction
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at

Point Transformer
Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Inspired by this success, we investigate the application of self-attention networks to 3D point cloud processing. We design self-attention layers for point clouds and use these to construct self-attention networks for tasks such as semantic scene segmentation, object part segmentation, and object classification. Our Point Transformer design improves upon prior work across domains and tasks. For example, on the challenging S3DIS dataset for large-scale semantic scene segmentation, the Point Transformer attains an mIoU of 70.4% on Area 5, outperforming the strongest prior model by 3.3 absolute percentage points and crossing the 70% mIoU threshold for the first time.
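For readers unfamiliar with the mIoU metric quoted in these segmentation results, the following is a minimal sketch of how mean intersection-over-union is commonly computed from per-point (or per-pixel) labels. The function name `mean_iou` and the toy labels are illustrative assumptions, not code from the paper, and benchmark-specific conventions (such as ignored classes) are not modeled.

```python
from collections import Counter

def mean_iou(y_true, y_pred, num_classes):
    """Average per-class intersection-over-union (hypothetical helper)."""
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    # Intersection: positions where the prediction matches the label.
    inter = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    ious = []
    for c in range(num_classes):
        union = true_count[c] + pred_count[c] - inter[c]
        if union > 0:  # skip classes absent from both labelings
            ious.append(inter[c] / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with three classes (made-up labels).
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
score = mean_iou(labels, preds, num_classes=3)
print(f"mIoU = {score:.3f}")  # per-class IoUs are 1/3, 2/3 and 1/2
```

Averaging per class rather than per point means that rare classes weigh as much as common ones in the final score.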

Adaptive Surface Reconstruction with Multiscale Convolutional Kernels
We propose generalized convolutional kernels for 3D reconstruction with ConvNets from point clouds. Our method uses multiscale convolutional kernels that can be applied to adaptive grids as generated with octrees. In addition to standard kernels in which each element has a distinct spatial location relative to the center, our elements have a distinct relative location as well as a relative scale level. Making our kernels span multiple resolutions allows us to apply ConvNets to adaptive grids for large problem sizes where the input data is sparse but the entire domain needs to be processed. Our ConvNet architecture can predict the signed and unsigned distance fields for large data sets with millions of input points and is faster and more accurate than classic energy minimization or recent learning approaches. We demonstrate this in a zero-shot setting where we only train on synthetic data and evaluate on the Tanks and Temples dataset of real-world large-scale 3D scenes.

Learning to Drive from a World on Rails
We learn an interactive vision-based driving policy from pre-recorded driving logs via a model-based approach. A forward model of the world supervises a driving policy that predicts the outcome of any potential driving trajectory. To support learning from pre-recorded logs, we assume that the world is on rails, meaning neither the agent nor its actions influence the environment. This assumption greatly simplifies the learning problem, factorizing the dynamics into a nonreactive world model and a low-dimensional and compact forward model of the ego-vehicle. Our approach computes action-values for each training trajectory using a tabular dynamic-programming evaluation of the Bellman equations; these action-values in turn supervise the final vision-based driving policy. Despite the world-on-rails assumption, the final driving policy acts well in a dynamic and reactive world. At the time of writing, our method ranks first on the CARLA leaderboard, attaining a 25% higher driving score while using 40 times less data. Our method is also an order of magnitude more sample-efficient than state-of-the-art model-free reinforcement learning techniques on navigational tasks in the ProcGen benchmark.
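The tabular evaluation described above can be illustrated with a toy sketch. Because the logged world is assumed not to react to the ego-vehicle, rewards can be indexed by log time step and ego state alone, and action-values follow from a backward pass over the log. Everything below (the three-lane world, the reward table, and the names `action_values` and `step`) is a hypothetical illustration under those assumptions, not the authors' implementation.

```python
def action_values(rewards, actions, step, gamma=0.9):
    """Backward induction over a fixed, non-reactive log.

    rewards[t][s]: reward for arriving in ego state s at log step t
    step(s, a):    deterministic ego forward model (next ego state)
    Returns q where q[t][s][i] is the value of actions[i] in state s at step t.
    """
    horizon = len(rewards)
    num_states = len(rewards[0])
    next_v = [0.0] * num_states  # value beyond the end of the log
    q = [None] * horizon
    for t in reversed(range(horizon)):
        q[t] = [
            [rewards[t][step(s, a)] + gamma * next_v[step(s, a)] for a in actions]
            for s in range(num_states)
        ]
        next_v = [max(row) for row in q[t]]  # greedy value per ego state
    return q

# Toy world: three lanes, actions shift the ego one lane left, stay, or right.
ACTIONS = (-1, 0, 1)

def step(lane, action):
    return min(2, max(0, lane + action))

# Logged rewards: an obstacle in lane 1 at step 0, a goal in lane 2 at step 1.
rewards = [[0.0, -1.0, 0.0],
           [0.0, 0.0, 1.0]]

q = action_values(rewards, ACTIONS, step, gamma=1.0)
print(q[0][1])  # [0.0, 0.0, 1.0]: moving right from lane 1 is best
```

In the paper's setting, tables like `q` serve as supervision targets for the vision-based policy; that distillation step is not shown here.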

Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data 
Continual learning is the problem of learning and retaining knowledge through time over multiple tasks and environments. Research has primarily focused on the task-incremental setting, where a new task is added at discrete time intervals. Such an “offline” setting does not evaluate the ability of agents to learn effectively and efficiently, since an agent can perform multiple learning epochs without any time limitation when a task is added. We argue that “online” continual learning, where data is a single continuous stream without task boundaries, enables evaluating both information retention and online learning efficacy. In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online. Trained models are later evaluated on historical data to assess information retention. We introduce a new benchmark for online continual visual learning that exhibits large-scale and natural distribution shifts. Through a large-scale analysis, we identify critical and previously unobserved phenomena of gradient-based optimization in continual learning, and propose effective strategies for improving gradient-based online continual learning with real data.

Continual Neural Mapping: Learning An Implicit Scene Representation from Sequential Observations
Recent advances have enabled a single neural network to serve as an implicit scene representation, establishing the mapping function between spatial coordinates and scene properties. In this paper, we make a further step towards continual learning of the implicit scene representation directly from sequential observations, namely Continual Neural Mapping. The proposed problem setting bridges the gap between batch-trained implicit neural representations and commonly used streaming data in robotics and vision communities. We introduce an experience replay approach to tackle an exemplary task of continual neural mapping: approximating a continuous signed distance function (SDF) from sequential depth images as a scene geometry representation. We show for the first time that a single network can represent scene geometry over time continually without catastrophic forgetting, while achieving promising tradeoffs between accuracy and efficiency.
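As a rough illustration of experience replay over sequential observations, the sketch below keeps a bounded buffer of past frames and mixes a few of them into every new training batch. The reservoir-sampling policy and the names `ReplayBuffer` and `training_batch` are assumptions chosen for simplicity; the paper's actual replay strategy may differ.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling (an assumed policy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep each item seen so far with equal probability.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

def training_batch(new_frames, buffer, replay_size):
    """Combine fresh frames with replayed ones, then store the fresh frames."""
    batch = list(new_frames) + buffer.sample(replay_size)
    for frame in new_frames:
        buffer.add(frame)
    return batch

# Integers stand in for depth frames arriving two at a time.
buffer = ReplayBuffer(capacity=4)
batches = [training_batch([i, i + 1], buffer, replay_size=2) for i in range(0, 10, 2)]
print(batches[0])         # [0, 1]: nothing to replay yet
print(len(buffer.items))  # 4: the buffer stays bounded
```

Training on such mixed batches is one common way to limit catastrophic forgetting, since earlier observations keep contributing to the loss after they have left the input stream.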

Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks

Learning of Visual Relations: The Devil is in the Tails 
Significant effort has been recently devoted to modeling visual relations. This has mostly addressed the design of architectures, typically by adding parameters and increasing model complexity. However, visual relation learning is a long-tailed problem, due to the combinatorial nature of joint reasoning about groups of objects. Increasing model complexity is, in general, ill-suited for long-tailed problems due to their tendency to overfit. In this paper, we explore an alternative hypothesis, denoted the Devil is in the Tails. Under this hypothesis, better performance is achieved by keeping the model simple but improving its ability to cope with long-tailed distributions. To test this hypothesis, we devise a new approach for training visual relationships models, which is inspired by state-of-the-art long-tailed recognition literature. This is based on an iterative decoupled training scheme, denoted Decoupled Training for Devil in the Tails (DT2). DT2 employs a novel sampling approach, Alternating Class-Balanced Sampling (ACBS), to capture the interplay between the long-tailed entity and predicate distributions of visual relations. Results show that, with an extremely simple architecture, DT2-ACBS significantly outperforms much more complex state-of-the-art methods on scene graph generation tasks. This suggests that the development of sophisticated models must be considered in tandem with the long-tailed nature of the problem. For more information, see

In Defense of Scene Graphs for Image Captioning
The mainstream image captioning models rely on Convolutional Neural Network (CNN) image features to generate captions via recurrent models. Recently, image scene graphs have been used to augment captioning models to leverage their structural semantics, such as object entities, relationships, and attributes. Several studies have noted that the naive use of scene graphs from a black-box scene graph generator harms image captioning performance and that scene graph-based captioning models must incur the overhead of explicit use of image features to generate decent captions. Addressing these challenges, we propose SG2Caps, a framework that utilizes only the scene graph labels for competitive image captioning performance. The basic idea is to close the semantic gap between the two scene graphs - one derived from the input image and the other from its caption. To achieve this, we leverage the spatial location of objects and the Human-Object-Interaction (HOI) labels as an additional HOI graph. SG2Caps outperforms existing scene graph-only captioning models by a large margin, indicating scene graphs as a promising representation for image captioning. Direct utilization of scene graph labels avoids expensive graph convolutions over high-dimensional CNN features resulting in 49% fewer trainable parameters. Our code is available at

Workshop Papers 

ICCV 2021 Low-Power Computer Vision Workshop 

Post-training Deep Neural Network Pruning via Layer-wise Calibration 

We present a post-training weight pruning method for deep neural networks that achieves accuracy levels tolerable for the production setting and that is sufficiently fast to be run on commodity hardware such as desktop CPUs or edge devices. We propose a data-free extension of the approach for computer vision models based on automatically-generated synthetic fractal images. We obtain state-of-the-art results for data-free neural network pruning, with ∼1.5% top@1 accuracy drop for a ResNet50 on ImageNet at 50% sparsity rate. When using real data, we are able to get a ResNet50 model on ImageNet with 65% sparsity rate in 8-bit precision in a post-training setting with a ∼1% top@1 accuracy drop. We release the code as a part of the OpenVINO™ Post-Training Optimization tool.
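To make the term "sparsity rate" concrete, here is a minimal sketch of plain unstructured magnitude pruning, in which a chosen fraction of the smallest-magnitude weights is set to zero. This is a simpler, generic technique shown only for illustration; it does not implement the layer-wise calibration proposed in the paper, and the function name `magnitude_prune` and the sample weights are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

layer = [0.5, -0.1, 0.03, -0.8, 0.2, 0.01]  # made-up weights
pruned = magnitude_prune(layer, sparsity=0.5)
print(pruned)  # [0.5, 0.0, 0.0, -0.8, 0.2, 0.0]
```

At a 50% sparsity rate, half of the entries are zero after pruning. Post-training methods such as the one described above then aim to recover accuracy without a full retraining run.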

ICCV 2021 Workshop on Multi-Task Learning in Computer Vision 

Distribution-Aware Multitask Learning for Visual Relations 

ICCV 2021 Autonomous Vehicle Vision Workshop

Few-shot Batch Incremental Road Object Detection Via Detector Fusion

Incremental few-shot learning has emerged as a new and challenging area in deep learning, whose objective is to train deep learning models using very few samples of new class data, and none of the old class data. In this work we tackle the problem of batch incremental few-shot road object detection using data from the India Driving Dataset (IDD). Our approach, DualFusion, combines object detectors in a manner that allows us to learn to detect rare objects with very limited data, all without severely degrading the performance of the detector on the abundant classes. In the IDD OpenSet incremental few-shot detection task, we achieve a mAP50 score of 40.0 on the base classes and an overall mAP50 score of 38.8, both of which are the highest to date. In the COCO batch incremental few-shot detection task, we achieve a novel AP score of 9.9, surpassing the state-of-the-art novel class performance on the same by over 6.6 times.