Intel Labs Presents 21 Papers at NeurIPS 2022

Highlights:

  • The Conference on Neural Information Processing Systems (NeurIPS) 2022 takes place from November 28th to December 9th as a hybrid event.

  • Intel Labs presents 21 papers, including five main conference works regarding zero-cost proxies for neural architecture scoring, 3D keypoint detection and shape reconstruction, scene graph generation, and other novel frameworks for information processing systems.

  • Intel Labs also organized conference workshops on Memory in Artificial and Real Intelligence (MemARI) and AI for Accelerated Materials Design (AI4Mat).

The Conference on Neural Information Processing Systems (NeurIPS) 2022 takes place from November 28th through December 9th as a hybrid event. The first week will comprise the physical component at the New Orleans Convention Center, followed by virtual content in the second week. The conference will kick off with an Expo Day preceding the main conference sessions, workshops, and live streams.

This year, Intel Labs is presenting 21 papers at NeurIPS, including five at the main conference. Researchers will also present papers at several workshops, including the Foundational Models for Decision Making Workshop, the second Workshop on Efficient Natural Language and Speech Processing, and the Workshop on Human in the Loop Learning. In addition, Intel Labs will host and present papers at two workshops: the Memory in Artificial and Real Intelligence (MemARI) Workshop and the AI for Accelerated Materials Design (AI4Mat) Workshop.

Intel Labs’ research contributions include a genetic programming framework to automate the discovery of zero-cost proxies for neural architecture scoring and a shape-aware neural 3D keypoint field that naturally entangles 3D keypoint detection and shape reconstruction without supervision. Labs researchers also propose a new formulation for scene graph generation that avoids the multi-task learning problem and the combinatorial entity pair distribution, a novel method for guaranteeing conservation of linear momentum in learned physics simulations, and a novel noun-pronoun distillation framework to leverage pre-trained noun-referring expression comprehension models.

Main Conference Papers

EZNAS: Evolving Zero-Cost Proxies for Neural Architecture Scoring
Neural Architecture Search (NAS) has significantly improved productivity in designing and deploying neural networks (NN). However, as NAS typically evaluates multiple models by training them partially or completely, the improved productivity comes at the cost of a significant carbon footprint. To alleviate this expensive training routine, zero-cost (also called zero-shot) proxies analyze an NN at initialization to generate a score that correlates highly with its true accuracy. Zero-cost proxies are currently designed by experts conducting multiple cycles of empirical testing on possible algorithms, datasets, and neural architecture design spaces. This experimentation lowers productivity and is an unsustainable approach to zero-cost proxy design as deep learning use cases diversify. Additionally, existing zero-cost proxies fail to generalize across neural architecture design spaces. This paper proposes a genetic programming framework to automate the discovery of zero-cost proxies for neural architecture scoring. The methodology efficiently discovers an interpretable and generalizable zero-cost proxy that delivers state-of-the-art score-accuracy correlation on all datasets and search spaces of NASBench-201 and Network Design Spaces (NDS). This research indicates a promising direction toward automatically discovering zero-cost proxies that can work across network architecture design spaces, datasets, and tasks.
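
As a rough illustration of what "score-accuracy correlation" means, the sketch below computes a Kendall rank correlation between proxy scores and trained accuracies. The choice of Kendall's tau, the function name, and the numbers are assumptions made for this example; they are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(scores, accuracies):
    """Kendall rank correlation (tau-a) between two equal-length sequences."""
    if len(scores) != len(accuracies) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(scores)), 2):
        # A positive product means the proxy and the accuracy rank
        # architectures i and j in the same order.
        product = (scores[i] - scores[j]) * (accuracies[i] - accuracies[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores) * (len(scores) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for five architectures (illustration only).
proxy_scores = [0.12, 0.55, 0.31, 0.90, 0.47]
test_accuracies = [61.0, 70.2, 66.5, 73.9, 71.0]

tau = kendall_tau(proxy_scores, test_accuracies)
print(f"Kendall tau: {tau:.2f}")
```

A value near 1 means the proxy ranks untrained architectures almost exactly as their trained accuracies would, which is what makes such a proxy usable in place of training.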

Guaranteed Conservation of Momentum for Learning Particle-based Fluid Dynamics
We present a novel method for guaranteeing conservation of linear momentum in learned physics simulations. Unlike existing methods, the proposed method enforces the conservation of momentum with a hard constraint realized via antisymmetric continuous convolutional layers. Furthermore, this method combines these strict constraints with a hierarchical network architecture, a carefully constructed resampling scheme, and a training approach for temporal coherence. In combination, these components substantially increase the learned simulator's physical accuracy. In addition, the induced physical bias leads to significantly better generalization performance and makes this approach more reliable in unseen test cases. The method is evaluated on a range of different, challenging fluid scenarios, demonstrating that the approach generalizes to new scenarios with up to one million particles. Additional results show that the proposed algorithm can learn complex dynamics while outperforming existing systems in generalization and training performance. An implementation of the approach is available at https://github.com/tum-pbs/DMCF.
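
The principle behind an antisymmetric kernel can be shown with a toy example: if the pairwise update satisfies k(-r) = -k(r), every interaction is matched by an equal and opposite one, so the summed momentum change is zero. The sketch below is not the paper's layer. The kernel functions, the equal-mass assumption, and the 2D particle positions are all hypothetical stand-ins.

```python
import math


def raw_kernel(dx, dy):
    """Arbitrary, non-antisymmetric stand-in for a learned kernel (hypothetical)."""
    return (math.sin(1.7 * dx) + 0.3 * dy * dy + 0.5,
            math.cos(dx) * dy + 0.2 * dx)


def antisymmetric_kernel(dx, dy):
    """Enforce k(-r) = -k(r) by construction: k(r) = (g(r) - g(-r)) / 2."""
    gx, gy = raw_kernel(dx, dy)
    hx, hy = raw_kernel(-dx, -dy)
    return (0.5 * (gx - hx), 0.5 * (gy - hy))


def total_momentum_change(positions, kernel):
    """Sum of all pairwise updates, assuming equal particle masses."""
    total_x = 0.0
    total_y = 0.0
    for i, (xi, yi) in enumerate(positions):
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            fx, fy = kernel(xj - xi, yj - yi)
            total_x += fx
            total_y += fy
    return total_x, total_y


# Hypothetical 2D particle positions.
particles = [(0.0, 0.0), (0.4, 0.1), (-0.3, 0.7), (0.9, -0.5), (0.2, 0.6)]

unconstrained = total_momentum_change(particles, raw_kernel)
constrained = total_momentum_change(particles, antisymmetric_kernel)
print("raw kernel:          ", unconstrained)
print("antisymmetric kernel:", constrained)
```

With the raw kernel the net change is clearly nonzero, while the antisymmetrized version cancels up to floating-point rounding regardless of what the underlying function has learned. This is why such a constraint is described as hard rather than as a loss-based penalty.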

SNAKE: Shape-aware Neural 3D Keypoint Field
Detecting 3D keypoints from point clouds is important for shape reconstruction. This work investigates the dual question: can shape reconstruction benefit 3D keypoint detection? Existing methods either seek salient features according to statistics of different orders or learn to predict keypoints that are invariant to transformation. Nevertheless, the idea of incorporating shape reconstruction into 3D keypoint detection is under-explored. The paper argues that this is restricted by former problem formulations. To this end, a novel unsupervised paradigm named SNAKE is proposed, which is short for shape-aware neural 3D keypoint field. Similar to recent coordinate-based radiance or distance fields, the proposed network takes 3D coordinates as inputs and predicts implicit shape indicators and keypoint saliency simultaneously, thus naturally entangling 3D keypoint detection and shape reconstruction. It achieves superior performance on various public benchmarks, including the standalone object datasets ModelNet40, KeypointNet, and SMPL meshes and the scene-level datasets 3DMatch and Redwood. Intrinsic shape awareness brings several advantages: (1) SNAKE generates 3D keypoints consistent with human semantic annotation, even without such supervision. (2) SNAKE outperforms counterparts in terms of repeatability, especially when the input point clouds are down-sampled. (3) The generated keypoints allow accurate geometric registration, notably in a zero-shot setting. Code is available at https://github.com/zhongcl-thu/SNAKE.
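
Repeatability is commonly measured as the fraction of keypoints detected in one view that reappear, within a distance threshold, in a second view after the two are aligned. The sketch below assumes that common definition. The function name, threshold, and coordinates are illustrative and are not taken from the SNAKE evaluation code.

```python
import math


def repeatability(keypoints_a, keypoints_b, threshold):
    """Fraction of keypoints in A with a keypoint in B within `threshold`.

    Both sets are assumed to be expressed in the same coordinate frame
    already (i.e. the ground-truth alignment has been applied).
    """
    if not keypoints_a:
        return 0.0
    repeated = 0
    for p in keypoints_a:
        # Distance from p to its nearest neighbor in the other view.
        nearest = min((math.dist(p, q) for q in keypoints_b), default=math.inf)
        if nearest <= threshold:
            repeated += 1
    return repeated / len(keypoints_a)


# Hypothetical keypoints from two aligned scans of the same object.
view_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 3.0, 3.0)]
view_b = [(0.02, 0.0, 0.0), (1.0, 0.05, 0.0), (0.0, 2.5, 0.0)]

score = repeatability(view_a, view_b, threshold=0.1)
print(f"repeatability: {score:.2f}")
```

Here two of the four keypoints in the first view have a counterpart within the threshold, so the score is 0.5. Down-sampling the input tends to shift detections, which is why repeatability under down-sampling is a meaningful stress test.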

Single-Stage Visual Relationship Learning using Conditional Queries
Research in scene graph generation (SGG) usually considers two-stage models, that is, detecting a set of entities, followed by combining them and labelling all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead, and typically hinders end-to-end optimization. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR, a set-based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. This paper proposes Transformers with conditional queries for SGG, namely TraCQ, with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. Using a DETR-based encoder-decoder design, the system leverages conditional queries to significantly reduce the entity label space, leading to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset, while remaining capable of end-to-end training and faster inference.
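
To make the "combinatorial entity pair" cost concrete, the short sketch below enumerates the ordered subject-object candidates that a pipeline labelling all possible relationships would have to consider. It is not TraCQ code, and the entity names and function name are made up for illustration.

```python
from itertools import permutations


def candidate_pairs(entities):
    """All ordered (subject, object) pairs of distinct detected entities."""
    return list(permutations(entities, 2))


# Hypothetical detections from the first stage.
detections = ["person", "horse", "hat", "fence"]
pairs = candidate_pairs(detections)
print(len(pairs), "candidate pairs, e.g.", pairs[:3])

# The number of ordered pairs grows quadratically: n * (n - 1).
pair_counts = {n: n * (n - 1) for n in (4, 10, 100)}
print(pair_counts)
```

Four detections already give 12 ordered pairs, and 100 detections give 9,900, which is the growth that motivates predicting triplets directly rather than classifying every pair.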

TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation
Current referring expression comprehension algorithms can effectively detect or segment objects indicated by nouns, but how to understand verb reference is still under-explored. As such, this work investigates the challenging problem of task-oriented detection, which aims to find objects that best afford an action indicated by verbs like "sit comfortably on." Towards a finer localization that better serves downstream applications, such as robot interactions, this method extends the problem into task-oriented instance segmentation. A unique requirement of this task is to select preferred candidates among possible alternatives. The transformer architecture, which naturally models pairwise query relationships with attention, is well suited to this requirement and forms the basis of the TOIST method. A novel noun-pronoun distillation framework is proposed to leverage pre-trained noun-referring expression comprehension models and the fact that privileged noun ground truth is accessible during training. Noun prototypes are generated in an unsupervised manner, and contextual pronoun features are trained to select prototypes. As such, the network remains noun-agnostic during inference. TOIST was evaluated on the large-scale task-oriented dataset COCO-Tasks and achieved 10.9% higher mAP^box than the best-reported results. The proposed noun-pronoun distillation can boost mAP^box and mAP^mask by 2.8% and 3.8%, respectively.