Intel® Labs Presents Research at NeurIPS 2021

Highlights:

  • The Thirty-Fifth Conference on Neural Information Processing Systems takes place virtually from December 6 to December 14.

  • Intel Labs will present its latest research on Deep-SWIM, natural language processing, and iterative causal discovery (ICD), and will participate in the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) Workshop.


The Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS’21) takes place online this year from December 6 to 14. Intel Labs is excited to join other members of the artificial intelligence (AI) and machine learning (ML) communities to share the industry’s latest discoveries and developments.

Intel Labs is pleased to present eight published papers, including three poster presentations at the main conference, one of them from Habana Labs. Researchers will also present papers at the Efficient Natural Language and Speech Processing workshop (ENLSP NeurIPS Workshop 2021), which focus on optimization problems in natural language processing (NLP), and at the Workshop on Deployable Decision Making in Embodied Systems.

In addition, Intel Labs is a participant in the NeurIPS’21 competition track, the Billion-Scale Approximate Nearest Neighbor Search Challenge. Intel AI also sponsors the LatinX AI Workshop, a one-day event for Latino/Latina faculty, graduate students, research scientists, and engineers to discuss AI research trends and career choices.

NeurIPS is a multi-track interdisciplinary annual meeting and exposition focusing on ML in practice, with topical workshops that facilitate the exchange of ideas. It is hosted by the Neural Information Processing Systems Foundation, a nonprofit that promotes the exchange of research and ideas in AI and ML. Read the official NeurIPS blog to keep up with the latest conference papers and proceedings; a full list of published papers is available on the NeurIPS website.

Following are Intel Labs’ papers, poster presentations and workshops: 

NeurIPS Main Conference Presentations: 
 

  • Low-dimensional Structure in the Space of Language Representations is Reflected in Brain Responses
    Richard Antonello, Javier S. Turek, Vy Vo, Alexander Huth

    How related are the representations learned by neural language models, translation models, and language tagging tasks? We answer this question by adapting an encoder-decoder transfer learning method from computer vision to investigate the structure among 100 different feature spaces extracted from hidden representations of various networks trained on language tasks.

    This method reveals a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings. We call this low-dimensional structure a language representation embedding because it encodes the relationships between representations needed to process language for various NLP tasks. 

    This representation embedding can predict how well each feature space maps to human brain responses to natural language stimuli recorded using fMRI. Additionally, we find that the principal dimension of this structure can be used to create a metric that highlights the brain's natural language processing hierarchy. This finding suggests that the embedding captures some part of the brain's natural language representation structure.
  • Iterative Causal Discovery in the Possible Presence of Latent Confounders and Selection Bias 
    Raanan Y. Rohekar, Shami Nisimov, Yaniv Gurwicz, Gal Novik

    We present a sound and complete algorithm, called iterative causal discovery (ICD), for recovering causal graphs in the presence of latent confounders and selection bias. ICD relies on the causal Markov and faithfulness assumptions and recovers the equivalence class of the underlying causal graph. It starts with a complete graph and consists of a single iterative stage that gradually refines this graph by identifying conditional independence (CI) between connected nodes. 

    Independence and causal relations entailed after any iteration are correct, rendering ICD an anytime algorithm. Essentially, we tie the size of the CI conditioning set to its distance on the graph from the tested nodes and increase this value in successive iterations. Thus, each iteration refines a graph recovered by previous iterations using smaller conditioning sets, which have higher statistical power and contribute to stability. We demonstrate that ICD requires significantly fewer CI tests and learns more accurate causal graphs than the FCI, FCI+, and RFCI algorithms.
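The flavor of this iterative refinement can be illustrated with a much-simplified, PC-style skeleton search: start from a complete graph, and in iteration r remove every edge whose endpoints test as independent given some size-r set of neighbors. This is only a toy sketch (a Gaussian CI test via partial correlation, no latent-confounder handling, and no graph-distance bookkeeping); ICD itself is considerably more involved.

```python
import numpy as np
from itertools import combinations

def gaussian_ci(data, i, j, cond, z_thresh=3.0):
    """CI test for Gaussian data: Fisher z-transform of the partial
    correlation of columns i and j given the columns in `cond`."""
    idx = [i, j] + list(cond)
    prec = np.linalg.pinv(np.corrcoef(data[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)
    z = np.arctanh(r) * np.sqrt(len(data) - len(cond) - 3)
    return abs(z) < z_thresh  # True -> conditionally independent

def iterative_skeleton(data):
    """Start from a complete graph; iteration r removes every edge whose
    endpoints are independent given a size-r subset of their neighbors."""
    n = data.shape[1]
    adj = {v: set(range(n)) - {v} for v in range(n)}
    r = 0
    while any(len(adj[i] - {j}) >= r for i in adj for j in adj[i]):
        for i in range(n):
            for j in list(adj[i]):
                for cond in combinations(sorted(adj[i] - {j}), r):
                    if gaussian_ci(data, i, j, cond):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        r += 1
    return adj
```

On data generated from a chain x -> y -> z, the edge between x and z is removed once the conditioning set {y} is reached, while the two true edges survive.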
  • Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks
    Itay Hubara, Brian Chmiel, Moshe Island, Ron Banner, Joseph Naor, Daniel Soudry

    Unstructured pruning reduces the memory footprint of deep neural networks (DNNs). Recently, researchers have proposed different types of structured pruning that also aim to reduce computational complexity. In this work, we first suggest a new measure, called mask diversity, which correlates with the expected accuracy of the different types of structured pruning. We focus on the recently suggested N:M fine-grained block sparsity mask, in which each block of M weights contains at least N zeros. 

    While N:M fine-grained block sparsity allows acceleration on modern hardware, it can be used only to accelerate the inference phase. To allow for similar accelerations in the training phase, we suggest a novel transposable fine-grained sparsity mask, where the same mask can be used for both forward and backward passes. Our transposable mask guarantees that both the weight matrix and its transpose follow the same sparsity pattern; thus, the matrix multiplication required for passing the error backward can also be accelerated. 

    We formulate the problem of finding the optimal transposable mask as a minimum-cost flow problem. To speed up the minimum-cost flow computation, we also introduce a fast linear-time approximation that can be used when the masks change dynamically during training. Our experiments suggest a 2x speed-up in matrix multiplications with no accuracy degradation over vision and language models. 

    Finally, to solve the problem of switching between different structure constraints, we suggest a method to convert a pre-trained model with unstructured sparsity to an N:M fine-grained block sparsity model with little to no training. 
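To make the transposable constraint concrete, the toy sketch below brute-forces, for a single 4x4 block, the 2:4 mask with exactly two nonzeros in every row and every column that keeps the most weight magnitude. The paper solves this at scale as a minimum-cost flow problem; exhaustive search is only feasible for one tiny block.

```python
import numpy as np
from itertools import combinations, product

# The 6 ways to keep exactly 2 of 4 entries in a row.
ROW_PATTERNS = [np.array([1 if k in c else 0 for k in range(4)])
                for c in combinations(range(4), 2)]

def transposable_2to4_mask(block):
    """Search all row-pattern combinations for the mask with 2 nonzeros in
    every row AND every column that preserves the most magnitude, so the
    same pattern is a valid 2:4 mask for the block's transpose as well."""
    best, best_score = None, -1.0
    for rows in product(ROW_PATTERNS, repeat=4):
        mask = np.stack(rows)
        if not np.all(mask.sum(axis=0) == 2):  # enforce 2:4 down the columns too
            continue
        score = float(np.abs(block * mask).sum())
        if score > best_score:
            best, best_score = mask, score
    return best
```

Because rows and columns both satisfy the 2:4 constraint, the transposed mask is immediately usable for the backward-pass multiplication.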
  • Looking Beyond Single Images for Contrastive Semantic Segmentation Learning
    Feihu Zhang, Philip Torr, René Ranftl, Stephan R. Richter

    We present an approach to contrastive representation learning for semantic segmentation. Our approach leverages the representational power of existing feature extractors to find corresponding regions across images. These cross-image correspondences are used as auxiliary labels to guide the pixel-level selection of positive and negative samples for more effective contrastive learning in semantic segmentation.

     We show that auxiliary labels can be generated from a variety of feature extractors, ranging from image classification networks that have been trained using unsupervised contrastive learning to segmentation models that have been trained on a small amount of labeled data. We additionally introduce a novel metric for rapidly judging the quality of a given auxiliary-labeling strategy, and empirically analyze various factors that influence the performance of contrastive learning for semantic segmentation. 

    We demonstrate the effectiveness of our method both in the low-data as well as the high-data regime on various datasets. Our experiments show that contrastive learning with our auxiliary-labeling approach consistently boosts semantic segmentation accuracy when compared to standard ImageNet pre-training and outperforms existing approaches of contrastive and semi-supervised semantic segmentation.
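As a rough illustration of the pixel-level contrastive objective, the sketch below computes an InfoNCE-style loss for a single pixel embedding, where positives would come from corresponding regions in other images and negatives from non-corresponding ones. The function and the exact loss form are illustrative, not the paper's implementation.

```python
import numpy as np

def pixel_info_nce(anchor, positives, negatives, tau=0.1):
    """InfoNCE-style loss for one pixel embedding `anchor` (shape (d,)):
    pulled toward positive features (P, d) from corresponding regions in
    other images, pushed away from negative features (N, d)."""
    def sims(a, b):  # cosine similarity of a against each row of b
        return (b @ a) / (np.linalg.norm(a) * np.linalg.norm(b, axis=1) + 1e-12)
    pos = np.exp(sims(anchor, positives) / tau)
    neg = np.exp(sims(anchor, negatives) / tau)
    return float(-np.log(pos.sum() / (pos.sum() + neg.sum())))
```

The loss is lowest when the anchor is close to its positives and far from its negatives, which is why the quality of the auxiliary labels that pick those samples matters so much.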

Deployable Decision Making in Embodied Systems Workshop: 
 

  • Validate on Sim, Detect on Real -- Model Selection for Domain Randomization
    Gal Leibovich, Guy Jacob, Shadi Endrawis, Gal Novik, Aviv Tamar

    A practical approach to learning robot skills, often termed sim2real, is to train control policies in simulation and then deploy them on a real robot. Popular techniques to improve the sim2real transfer build on domain randomization (DR): Training the policy on a diverse set of randomly generated domains with the hope of better generalization to the real world. 

    Due to the large number of hyper-parameters in both the policy learning and DR algorithms, one often ends up with many trained models, where choosing the best model among them demands costly evaluation on the real robot. In this work we ask: Can we rank the policies without running them in the real world? Our main idea is that a predefined set of real-world data can be used to evaluate all policies, using out-of-distribution detection (OOD) techniques. In a sense, this approach can be seen as a "unit test" to evaluate policies before any real-world execution. 

    However, we find that by itself, the OOD score can be inaccurate and very sensitive to the OOD method. Our main contribution is a simple yet effective policy score that combines OOD with an evaluation in simulation. We show that our score, VSDR, can significantly improve the accuracy of policy ranking without requiring additional real-world data. 

    We evaluate the effectiveness of VSDR on sim2real transfer in a robotic grasping task with image inputs. We extensively evaluate different DR parameters and OOD methods and show that VSDR improves policy selection across the board. More importantly, our method achieves significantly better ranking, and uses significantly less data compared to baselines.
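In spirit, such a ranking could be sketched as follows, with a Mahalanobis-distance stand-in for the OOD detector and a hypothetical linear combination of the two signals. Both choices are illustrative; the paper's actual OOD method and VSDR score may be formed differently.

```python
import numpy as np

def ood_score(real_feats, policy_feats):
    """Toy OOD score: mean Mahalanobis-style distance of the features a
    policy visits from a reference set of real-world features.
    (Any off-the-shelf OOD detector could stand in here.)"""
    mu = real_feats.mean(axis=0)
    cov = np.cov(real_feats, rowvar=False) + 1e-6 * np.eye(real_feats.shape[1])
    inv = np.linalg.inv(cov)
    d = policy_feats - mu
    return float(np.mean(np.einsum('ij,jk,ik->i', d, inv, d)))

def vsdr_rank(sim_success, ood_scores):
    """Rank policies (best first) by simulated success penalized by the
    OOD score; the linear combination is a hypothetical stand-in."""
    sim = np.asarray(sim_success, dtype=float)
    ood = np.asarray(ood_scores, dtype=float)
    norm = lambda v: (v - v.min()) / (np.ptp(v) + 1e-12)
    score = norm(sim) - norm(ood)
    return [int(i) for i in np.argsort(-score)]
```

A policy that shines in simulation but drives its observations far outside the real-world reference set gets demoted, which is the intuition behind combining the two signals.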

Efficient Natural Language and Speech Processing (ENLSP) Workshop:
 

  • Undivided Attention: Are Intermediate Layers Necessary for BERT? 
    Sharath Nittur Sridhar, Anthony Sarah

    In recent times, BERT-based models have been extremely successful in solving a variety of natural language processing (NLP) tasks such as reading comprehension, natural language inference, sentiment analysis, etc. All BERT-based architectures have a self-attention block followed by a block of intermediate layers as the basic building component. However, a strong justification for the inclusion of these intermediate layers remains missing in the literature. 

    In this work, we investigate the importance of the intermediate layers on the overall network performance of downstream tasks. We show that reducing the number of intermediate layers and modifying the architecture for BERT-Base results in minimal loss in fine-tuning accuracy for downstream tasks while decreasing the number of parameters and training time of the model. Additionally, we use the centered kernel alignment (CKA) similarity metric and probing classifiers to demonstrate that removing intermediate layers has little impact on the learned self-attention representations. 
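A quick back-of-envelope count shows why the intermediate block is an attractive target: for BERT-Base (hidden size 768, feed-forward size 3072), the feed-forward sub-layer holds roughly two-thirds of each encoder layer's parameters. This is a rough accounting that ignores embeddings and the task head.

```python
# Back-of-envelope parameter count for one BERT-Base encoder layer
# (hidden size H = 768, intermediate feed-forward size I = 4H = 3072).
H, I = 768, 3072

attention = 4 * (H * H + H)      # Q, K, V and output projections (+ biases)
ffn = (H * I + I) + (I * H + H)  # intermediate dense + output dense
layer_norms = 2 * (2 * H)        # two LayerNorms, each with gain and bias

full_layer = attention + ffn + layer_norms
print(full_layer)                # 7087872 parameters per layer
print(ffn / full_layer)          # ~0.67: the intermediate block dominates
```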
     
  • Prune Once for All: Sparse Pre-Trained Language Models
    Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat 

    Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to improve the efficiency of deploying large Transformer-based models on target hardware. 

    This work presents a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained versions of BERT-Base, BERT-Large, and DistilBERT. 

    We show how the compressed sparse pre-trained models transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models’ weights to 8-bit precision using quantization-aware training. 

    For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a compression ratio of 40X for the encoder with less than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
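The core idea of pruning once and then preserving the sparsity pattern during downstream fine-tuning can be sketched in a few lines. This is a toy magnitude-pruning version; the actual method integrates pruning with model distillation during pre-training.

```python
import numpy as np

def magnitude_mask(w, sparsity=0.9):
    """Prune once: keep only the largest-magnitude weights
    (assumes 0 <= sparsity < 1)."""
    k = int(w.size * (1 - sparsity))            # number of weights to keep
    thresh = np.sort(np.abs(w), axis=None)[-k]  # k-th largest magnitude
    return (np.abs(w) >= thresh).astype(w.dtype)

def sparse_finetune_step(w, grad, mask, lr=1e-3):
    """One downstream fine-tuning step that preserves the pre-training
    sparsity pattern: pruned weights stay exactly zero."""
    return (w - lr * grad) * mask
```

Because the mask is fixed after pre-training, every downstream task inherits the same sparse structure, which is what makes a single sparse pre-trained model reusable "for all" tasks.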
     
  • Dynamic-TinyBERT: Further Enhance the Inference Efficiency of TinyBERT by Dynamic Sequence Length 
    Shira Guskin, Moshe Wasserblat, Ke Ding, Gyuwan Kim 

    Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. TinyBERT [8] addresses this by self-distilling BERT [4] into a smaller transformer representation with fewer layers and a smaller internal embedding. However, TinyBERT’s performance drops when we reduce the number of layers by 50%, and drops even more abruptly when we reduce the number of layers by 75%, for advanced NLP tasks such as span question answering. 

    Additionally, a separate model must be trained for each inference scenario with its distinct computational budget. In this work, we present Dynamic-TinyBERT, a TinyBERT model that utilizes sequence-length reduction and hyperparameter optimization for enhanced inference efficiency under any computational budget. Dynamic-TinyBERT is trained only once, performs on par with BERT, and achieves an accuracy-speedup trade-off superior to other efficient approaches (up to 3.3x speedup with less than a 1% accuracy drop). Upon publication, the code to reproduce our work will be open-sourced.
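Sequence-length reduction itself is easy to picture: drop the least important tokens before later layers so that the quadratic cost of self-attention shrinks. In the sketch below, the importance score and keep ratio are illustrative choices, not necessarily how Dynamic-TinyBERT selects tokens.

```python
import numpy as np

def shorten_sequence(hidden, importance, keep_ratio=0.5):
    """Keep only the most important tokens before the next layer.
    `hidden` is (seq_len, dim); `importance` is a per-token score,
    e.g. the attention mass each token receives (illustrative choice)."""
    k = max(1, int(hidden.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(-importance)[:k])  # top-k tokens, original order
    return hidden[keep]
```

Since self-attention cost grows with the square of the sequence length, halving the sequence roughly quarters the attention compute, which is where length-reduction speedups come from.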