Intel Labs Presents Latest Research at The Conference on Computer Vision and Pattern Recognition (CVPR)


  • CVPR takes place virtually from June 19-25, 2021.

  • Intel Labs researcher René Ranftl will give a presentation on general-purpose monocular depth estimation on Tuesday, June 22nd, from 4:30 p.m. – 5:00 p.m. (PST).



The Conference on Computer Vision and Pattern Recognition (CVPR) is the industry’s largest event exploring artificial intelligence (AI), machine learning, and computer vision research and applications. This year, the conference will take place virtually from June 19-25. Featuring presentations, tutorials, workshops, and panels delivered by leading authors, academics, and experts, the event is expected to attract more than 7,500 attendees.

Intel is a conference sponsor and will present several papers on some of the company’s latest research on computer vision and pattern recognition, an aspect of AI that trains computers to interpret and understand the visual world. This year’s topics include neural architecture search, novel view synthesis, and self-supervised geometric perception. CVPR is an important conference for Intel Labs because it enables the company to learn about the latest advancements and cutting-edge AI techniques, which help inform its design of future generations of hardware.

Intel Labs Researcher René Ranftl, Ph.D., is a featured speaker and will present “Towards Ubiquitous Depth Sensing: Data, Training, and Architectures for General Purpose Monocular Depth Estimation” on Tuesday, June 22nd, from 4:30 p.m. – 5:00 p.m. (PST). The presentation will also be available to view on-demand. 

Following is a list of papers to be presented during the conference:

“Landmark Regularization: Ranking Guided Super-Net Training in Neural Architecture Search”

Weight sharing has become a de facto standard in neural architecture search (NAS) because it enables the search to be done on commodity hardware. However, recent works have empirically shown a ranking disorder between the performance of stand-alone architectures and that of the corresponding shared-weight networks. This violates the main assumption of weight-sharing NAS algorithms, thus limiting their effectiveness.

We tackle this issue by proposing a regularization term that aims to maximize the correlation between the performance rankings of the shared-weight network and those of the stand-alone architectures, using a small set of landmark architectures. We incorporate our regularization term into three different NAS algorithms and show that it consistently improves performance across algorithms, search spaces, and tasks.
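The landmark idea above can be sketched as a simple pairwise ranking penalty: given stand-alone accuracies for a handful of landmark architectures and the scores the shared-weight network assigns them, penalize every pair the supernet ranks in the wrong order. This is an illustrative sketch, not the paper's exact regularization term; all names are hypothetical.

```python
import itertools

def landmark_ranking_penalty(supernet_scores, standalone_scores, margin=0.0):
    """Hinge-style penalty that is zero only when the supernet ranks every
    pair of landmark architectures in the same order as their stand-alone
    performance. (Illustrative sketch, not the paper's formulation.)"""
    penalty = 0.0
    for i, j in itertools.combinations(range(len(supernet_scores)), 2):
        # Orient each pair so index i is the stronger stand-alone architecture.
        if standalone_scores[i] < standalone_scores[j]:
            i, j = j, i
        # Penalize the supernet when it scores the weaker architecture higher.
        penalty += max(0.0, margin - (supernet_scores[i] - supernet_scores[j]))
    return penalty
```

Added to the search objective, a term like this pushes the shared-weight network toward agreeing with stand-alone rankings on the landmark set, which is the correlation the paper aims to maximize.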

“Stable View Synthesis”

We present Stable View Synthesis (SVS). Given a set of source images depicting a scene from freely distributed viewpoints, SVS synthesizes new views of the scene. The method operates on a geometric scaffold computed via structure-from-motion and multi-view stereo. Each point on this 3D scaffold is associated with view rays and corresponding feature vectors that encode the appearance of this point in the input images. The core of SVS is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view.

The target view is then rendered by a convolutional network from a tensor of features synthesized in this way for all pixels. The method is composed of differentiable modules and is trained end-to-end. It supports spatially-varying view-dependent importance weighting and feature transformation of source images at each point; spatial and temporal stability due to the smooth dependence of on-surface feature aggregation on the target view; and synthesis of view-dependent effects such as specular reflection. Experimental results demonstrate that SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets, achieving unprecedented levels of realism in free-viewpoint video of challenging large-scale scenes.
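A minimal sketch of view-dependent aggregation at a single surface point might look like the following, with a fixed cosine-similarity softmax standing in for the learned directional network that SVS actually uses; all names here are hypothetical.

```python
import numpy as np

def aggregate_on_surface_features(src_feats, src_dirs, tgt_dir):
    """Combine per-source-view feature vectors at one 3D surface point into a
    single feature for the target ray, weighting each source view by how
    closely its viewing direction matches the target direction.
    (Simplified stand-in for the learned aggregation in SVS.)"""
    src_dirs = src_dirs / np.linalg.norm(src_dirs, axis=1, keepdims=True)
    tgt_dir = tgt_dir / np.linalg.norm(tgt_dir)
    scores = src_dirs @ tgt_dir                      # cosine similarity per view
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over source views
    return weights @ src_feats                       # (views,) @ (views, dim)
```

Because the weights vary smoothly with the target direction, an aggregation of this shape gives the spatial and temporal stability the abstract describes, while still letting view-aligned sources dominate for view-dependent effects.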

“Self-supervised Geometric Perception”

We present self-supervised geometric perception (SGP), the first general framework to learn a feature descriptor for correspondence matching without any ground-truth geometric model labels (e.g., camera poses, rigid transformations). Our first contribution is to formulate geometric perception as an optimization problem that jointly optimizes the feature descriptor and the geometric models given a large corpus of visual measurements (e.g., images, point clouds). Under this optimization formulation, we show that two important streams of research in vision, namely robust model fitting and deep feature learning, correspond to optimizing one block of the unknown variables while fixing the other block.

This analysis naturally leads to our second contribution – the SGP algorithm that performs alternating minimization to solve the joint optimization. SGP iteratively executes two meta-algorithms: a teacher that performs robust model fitting given learned features to generate geometric pseudo-labels, and a student that performs deep feature learning under noisy supervision of the pseudo-labels. As a third contribution, we apply SGP to two perception problems on large-scale real datasets, namely relative camera pose estimation on MegaDepth and point cloud registration on 3DMatch. We demonstrate that SGP achieves state-of-the-art performance on par with or superior to supervised oracles trained using ground-truth labels.
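The alternating teacher/student structure can be summarized in a few lines. The callables below are placeholders for the robust model-fitting teacher and the feature-learning student described above; this skeleton is illustrative, not the authors' code.

```python
def sgp(measurements, features, robust_fit, train_features, num_rounds=5):
    """Alternating-minimization skeleton of SGP (illustrative sketch).

    robust_fit(measurements, features) -> pseudo_labels      # teacher step
    train_features(measurements, pseudo_labels) -> features  # student step
    """
    for _ in range(num_rounds):
        # Fix the features; robustly fit geometric models to get pseudo-labels.
        pseudo_labels = robust_fit(measurements, features)
        # Fix the pseudo-labels; learn features under their noisy supervision.
        features = train_features(measurements, pseudo_labels)
    return features, pseudo_labels
```

Each round optimizes one block of variables while holding the other fixed, which is exactly the alternating view of robust fitting and feature learning the abstract describes.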

“ContactOpt: Optimizing Contact to Improve Grasps”

Physical contact between hands and objects plays a critical role in human grasps. We show that optimizing the pose of a hand to achieve expected contact with an object can improve hand poses inferred via image-based methods. Given a hand mesh and an object mesh, a deep model trained on ground-truth contact data infers desirable contact across the surfaces of the meshes. Then, ContactOpt efficiently optimizes the pose of the hand to achieve desirable contact using a differentiable contact model.

Notably, our contact model encourages mesh interpenetration to approximate deformable soft tissue in the hand. In our evaluations, our methods result in grasps that better match ground truth contact, have lower kinematic error, and are significantly preferred by human participants. Code and models are available online.
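As a toy illustration of the optimization idea, the sketch below adjusts a single scalar "pose" parameter by gradient descent so that a smooth, differentiable contact model reaches a desired contact value. The real method optimizes full hand poses against inferred contact maps over meshes; everything here is a hypothetical one-dimensional stand-in.

```python
import math

def contact_model(d):
    """Toy differentiable contact model: contact rises smoothly toward 1 as the
    hand-to-surface distance d shrinks (including slightly negative d, echoing
    the soft-tissue interpenetration the paper's model allows)."""
    return 1.0 / (1.0 + math.exp(4.0 * d))

def optimize_pose(pose, target_contact, model, steps=200, lr=0.1):
    """Gradient-descent sketch of the ContactOpt idea: move the pose parameter
    until the contact model's output matches the desired contact. Uses a
    forward finite-difference gradient for simplicity."""
    eps = 1e-5
    for _ in range(steps):
        loss = (model(pose) - target_contact) ** 2
        grad = ((model(pose + eps) - target_contact) ** 2 - loss) / eps
        pose -= lr * grad
    return pose
```

Because the contact model is smooth, gradients flow from the contact mismatch back to the pose, which is what makes end-to-end contact-driven pose refinement possible.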