Intel proudly sponsors the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR). This leading conference, recognized as the “premier annual computer vision event,” is a place for students, academics, and industry researchers to connect and stay up to date on the latest innovations in the computer vision field. CVPR 2022 will be held in New Orleans, LA, from June 19-24. This year the conference is a hybrid event, with options for in-person and virtual attendance. In addition to the main program, the conference will offer auxiliary workshops and short courses at the event site.
Intel’s contributions at the conference include six conference papers and four workshop papers detailing significant research efforts and results. These papers present innovations spanning multiple aspects of computer vision and pattern recognition. The works include an attention-based method for semantic segmentation, a video denoiser for submillilux conditions, a multi-task dense prediction transformer, an end-to-end trainable framework for text detection and recognition, and a method for forecasting future hand-object interactions in egocentric video. Many of these advancements showcase state-of-the-art results with real-world applications for the future of autonomous driving, robotics, and consumer electronics. Intel also has a hand in organizing several conference workshops and tutorial sessions.
OpenVINO™ is an open-source toolkit that boosts deep learning performance in computer vision and offers several tools for making a model run faster and use less memory. Intel will provide opportunities for hands-on experience with the toolkit through an OpenVINO tutorial session held on the 19th. After attendees complete some pre-work steps, the 4-hour session will offer interactive practice with OpenVINO Notebooks, exercises on the optimization process, a tutorial on object detection with OpenVINO Training Extensions (OTE), and end-to-end experience using Anomalib to evaluate and deploy solutions.
Additionally, Intel will host an AI Happy Hour Networking Reception during the conference. Held at The Gallery on June 21st, this event will be the perfect occasion to connect with other CVPR attendees and Intel experts while enjoying food, drinks, demo experiences, raffles, and giveaways. Intel will also cover happy hour for the first 100 conference attendees to arrive, so be sure to reserve your spot ahead of time.
Conference content hosted on the virtual platform will be available exclusively to CVPR registered attendees. However, the conference proceedings will be publicly available via the CVF website, with the final version posted to IEEE Xplore after the conference.
Segment-Fusion: Hierarchical Context Fusion for Robust 3D Semantic Segmentation
3D semantic segmentation addresses the task of classifying the objects to which the various points in a scene belong. This ability makes it a vital component in autonomous driving, robotics, virtual reality, and other scene understanding applications. However, many leading semantic segmentation methods suffer from the part-misclassification problem, where parts of the same object are labeled incorrectly. In response to this challenge, researchers from Intel Labs present Segment-Fusion, a novel attention-based method that uses a hierarchical fusion of semantic and instance information to improve the performance of a generic semantic segmentation model. Segment-Fusion can also adapt to inputs from other datasets and base networks and offers the possibility of end-to-end training while maintaining efficiency. To demonstrate the efficacy of their approach, the team evaluated multiple semantic backbones (MinkowskiNet-42, PointConv, and SparseConvNet) on the ScanNet validation set. Results show that the method improves the qualitative and quantitative performance of these backbones by up to 5% on the ScanNet and S3DIS datasets.
Dancing Under the Stars: Video Denoising in Starlight
Capturing images or video in extremely low-light conditions, i.e., with only starlight, is incredibly difficult. Current successful approaches need at least moonlight, which provides 0.05-0.3 lux of illumination, and even this level of lighting requires sensitive CMOS cameras to capture workable video. While other approaches have provided denoising at 0.1-0.3 lux, this work from Intel Labs demonstrates the first instance of photorealistic denoising at submillilux (<0.001 lux) illumination levels. Researchers achieved this feat by training a video denoiser using the noise model of a high-quality camera optimized for low-light imaging. Building on the state-of-the-art video denoiser FastDVDNet, they first modified the network’s denoising blocks. Then, in addition to pre-training steps, the team refined the network and performed gamma correction, finishing with several post-processing steps. Using this denoising method, researchers captured a 5-10 frame-per-second (fps) video dataset with significant motion at approximately 0.6-0.7 millilux with no active illumination. Compared to alternative methods, this model achieves improved video quality at the lowest light levels, demonstrating photorealistic video denoising in starlight for the first time. This technology has broad applications and the potential to become a part of consumer devices in the future.
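The gamma-correction step mentioned above is a standard post-processing operation that lifts dark pixel values into a visible range. A minimal sketch, not the paper's implementation (the gamma value and pixel values below are illustrative):

```python
def gamma_correct(pixels, gamma=2.2):
    """Apply gamma correction to normalized [0, 1] pixel values.

    Raising values to the power 1/gamma (with gamma > 1) brightens
    dark regions, a common final step after low-light denoising.
    """
    return [min(max(p, 0.0), 1.0) ** (1.0 / gamma) for p in pixels]

# A dark denoised row: gamma correction lifts the low values toward visibility.
row = [0.001, 0.01, 0.1, 0.5]
corrected = gamma_correct(row)
```

Because the exponent 1/2.2 is below 1, every value in (0, 1) is mapped upward, with the darkest values boosted proportionally the most.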
Cerberus Transformer: Joint Semantic, Affordance, and Attribute Parsing
Understanding indoor scenes is a foundational tenet of computer vision with broad robotics and metaverse applications. However, achieving a well-rounded understanding requires addressing several sub-tasks. With the help of university collaborators, Intel Labs presents Cerberus, a novel multi-task dense prediction transformer for indoor scenes. Cerberus performs three main tasks: object attribute parsing, semantic categorization of a region, and affordance prediction. Compared to the best published results, Cerberus achieves state-of-the-art performance across all three tasks. Additional in-depth analysis showed concept affinity consistent with human cognition. Furthermore, even when weakly supervised, Cerberus delivers strong performance using only 0.1%-1% annotation.
Text Spotting Transformers
Researchers from UC San Diego, Shanghai Jiao Tong University, and Intel Labs present an end-to-end trainable framework called TExt Spotting TRansformers (TESTR), which jointly addresses text detection and recognition. TESTR builds upon a single encoder and dual decoders for the joint text-box control point regression and character recognition. This framework uses a joint approach to directly perform set prediction without some of the post-processing and operations required by most existing models. Another key component of the method is the bounding-box guided polygon (box-to-polygon) detection procedure the researchers designed, which allows for a more efficient detection of arbitrarily-shaped texts. To demonstrate the efficacy of their model for both Bezier curve and polygon annotations, researchers tested it on a few text benchmarks. In addition to significant results for text spotting regular and irregular texts, TESTR leverages multi-scale feature maps within the box-to-polygon detection process to overcome the issue of small text and dramatically improve end-to-end results by 10.8%. Ultimately, the experiments on curved and arbitrarily shaped datasets demonstrate state-of-the-art performances of the proposed TESTR algorithm.
Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos
Demonstrating promising results, this work from Intel Labs, in collaboration with the University of Illinois Urbana-Champaign and UC San Diego, has also been accepted for presentation at the EPIC workshop within the conference. This research brings AI systems one step closer to accurately predicting a person’s intent, preferences, and future activities. Given an egocentric video, the proposed method directly predicts the hand motion trajectory and future contact points. To accomplish this task, researchers created an automatic collection method for trajectory and hotspot labels. Then, using the collected data, the team trained an Object-Centric Transformer (OCT) model for prediction. The OCT performs hand and object interaction reasoning and provides a probabilistic framework to handle prediction uncertainty. Experimental results show that this method outperforms previous state-of-the-art approaches by a large margin, improving the Average Displacement Error (ADE) by 50% and the Final Displacement Error (FDE) by 27.3% on the EK100 dataset against the second-best method on each metric. It also achieves similar performance with the Divided Attention Transformer encoder design. This demonstrates the strength of using Transformers to capture hand, object, and environment context interactions in egocentric videos.
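ADE and FDE are standard trajectory-forecasting metrics: ADE averages the Euclidean distance between predicted and ground-truth positions over all timesteps, while FDE measures the distance at the final timestep only. A minimal sketch (the coordinates below are made up for illustration):

```python
import math

def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance over all timesteps."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

def fde(pred, gt):
    """Final Displacement Error: Euclidean distance at the last timestep."""
    return math.dist(pred[-1], gt[-1])

# Toy 2D hand trajectories: predicted vs. ground truth over three timesteps.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt   = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
```

Here the per-timestep distances are 0, 1, and 2, so ADE is 1.0 while FDE, which looks only at the endpoint, is 2.0.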
BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster
Most AI projects start with a Python notebook running on a single laptop. However, subsequently scaling them to handle larger datasets is typically a heavy undertaking, entailing many manual and error-prone steps, such as quantization, memory-allocation optimization, data partitioning, and distributed computing, before data scientists can fully exploit the available hardware. To address this challenge, Intel researchers present BigDL 2.0, an open-source Big Data AI toolkit. Using BigDL 2.0, users can simply build conventional Python notebooks on their laptops, which can then be transparently accelerated on a single node and seamlessly scaled out to a large cluster. BigDL 2.0 takes a holistic approach to optimizing the entire AI pipeline end to end: it automatically integrates optimized libraries, best-known configurations, and software optimizations, and it handles all tuning, acceleration, and scaling out behind simple APIs familiar to data scientists. Experiments show that the toolkit can accelerate inference pipelines by up to 9.6x. Due to this success, BigDL 2.0 was accepted in the CVPR Demo Track and has already been adopted into production by many real-world users, including Mastercard, Burger King, and Inspur.
Adversarial Deepfake Generation for Detector Misclassification
Synthetic content generation has bloomed with advancements in deep learning, especially for deepfakes. However, deepfakes have the potential for increasingly harmful uses, from celebrity impersonation to political influence and misinformation. Fortunately, deepfake detection methods are advancing at a similar rate to counteract malicious uses. This work introduces adversarial attacks on deepfake detectors to assess their capabilities and limitations. Designed as a score-based black-box attack, the approach develops a new loss function and utilizes a lightweight generative neural network to create adversarial fakes that are detected as real. Evaluated on five different attack and detector models, the approach decreases fake-detection accuracy by over 80% using only perturbed fakes, and by up to 93% with post-processing operations.
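To illustrate the score-based black-box setting, where the attacker can only query the detector's score and has no access to its gradients or internals, here is a toy random-search sketch. The detector, loss, and generative network in the paper are far more sophisticated; all names and values below are hypothetical:

```python
import random

def toy_detector_score(x):
    """Stand-in 'fake probability' score: here just the mean of the input.
    A real deepfake detector would be a neural network queried as a black box."""
    return sum(x) / len(x)

def random_search_attack(x, score_fn, budget=200, eps=0.05, seed=0):
    """Score-based black-box attack sketch: query only the detector's score
    (no gradients) and keep small random perturbations that lower it."""
    rng = random.Random(seed)
    best = list(x)
    best_score = score_fn(best)
    for _ in range(budget):
        # Propose a small perturbation, clipped to the valid pixel range.
        cand = [min(max(v + rng.uniform(-eps, eps), 0.0), 1.0) for v in best]
        s = score_fn(cand)
        if s < best_score:  # keep only perturbations that fool the detector more
            best, best_score = cand, s
    return best, best_score

fake = [0.6] * 8  # toy 'fake' input that the detector scores as likely fake
adv, score = random_search_attack(fake, toy_detector_score)
```

The accepted perturbations stay small (bounded by `eps` per query), mirroring the constraint that adversarial fakes must remain visually indistinguishable from the originals.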
A New Non-central Model for Fisheye Calibration
Fisheye lenses are commonly used in security systems, automotive applications, robotics, and many other domains because of their large field of view. Building on the Matlab Computer Vision Toolbox fisheye calibration tool, this work presents a new non-central model for calibrating fisheye cameras. This model allows applications that already use the central model to adapt to a non-central projection that is more accurate, especially when objects are close to the camera. It also makes it possible to switch easily between the more accurate non-central characterization of the fisheye camera and the more convenient central approximation, as needed. Additional innovations improve the performance of the central model and its non-central extension.
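For context, a common central fisheye model is the equidistant projection, where the image radius grows linearly with the ray's angle off the optical axis and all rays are assumed to pass through a single projection center; the non-central model relaxes exactly that single-center assumption. A minimal sketch of the central model only (not the paper's model; the focal length below is arbitrary):

```python
import math

def equidistant_project(X, Y, Z, f):
    """Central equidistant fisheye model: image radius r = f * theta,
    where theta is the angle between the 3D ray and the optical axis.
    All rays pass through one center; a non-central model relaxes this."""
    theta = math.atan2(math.hypot(X, Y), Z)   # angle off the optical axis
    r = f * theta                             # equidistant mapping
    phi = math.atan2(Y, X)                    # azimuth in the image plane
    return r * math.cos(phi), r * math.sin(phi)

# A point on the optical axis projects to the image center.
u, v = equidistant_project(0.0, 0.0, 5.0, f=300.0)
```

Unlike a pinhole projection, the equidistant mapping keeps even rays near 90 degrees off-axis at a finite image radius, which is why fisheye lenses can cover such a wide field of view.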
Continual Active Adaptation to Distributional Shifts
In applications with a continually evolving data distribution, it is important to have neural network models that can adapt to new distributions without suffering catastrophic forgetting. The replay method is commonly used to prevent forgetting; however, it involves storing a subset of past data, which is not always feasible. This research therefore proposes a source-free approach that uses batch normalization to fine-tune the model to the ever-changing data. Experiments on CIFAR10-C, whose corruptions mimic the sequential evolution of the data distribution, show that the model outperforms existing methods and adapts to new data distributions without forgetting past information.
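One way to picture batch-normalization-based adaptation is that the running statistics of each BN layer are updated from unlabeled target batches, shifting normalization toward the new distribution without storing any past data. A minimal one-channel sketch, not the paper's exact procedure (the momentum and values below are illustrative):

```python
def adapt_bn_stats(running_mean, running_var, batch, momentum=0.1):
    """Source-free adaptation sketch: blend a BatchNorm layer's running
    statistics with statistics of an unlabeled target batch, nudging
    normalization toward the shifted distribution."""
    n = len(batch)
    batch_mean = sum(batch) / n
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / n
    # Exponential moving average, as in standard BatchNorm updates.
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

# Source stats centered at 0; a shifted (corrupted) target batch nudges them.
mean, var = adapt_bn_stats(0.0, 1.0, [2.0, 2.5, 3.0])
```

Because only the normalization statistics (and optionally the BN affine parameters) change, no source data needs to be kept around, which is what makes the approach source-free.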
Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection
Active speaker detection (ASD) is challenging in videos with multiple speakers, as it requires learning effective multimodal features and spatial-temporal correlations over long temporal windows. In this work, researchers from Intel, the University of California, Riverside, and the University of Glasgow describe SPELL, a novel spatial-temporal graph learning framework that solves ASD by reducing it to a node classification task. SPELL can reason over long temporal contexts for all nodes without relying on computationally expensive fully connected GNNs. Experiments on the AVA-ActiveSpeaker dataset show that SPELL outperforms all previous state-of-the-art approaches on the validation split while requiring significantly less memory and computation. SPELL also placed second on the ActivityNet Challenge 2022 leaderboard.
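To give a flavor of this kind of graph construction, each (timestamp, person) detection can become a node, with edges only between temporally close detections, keeping the graph sparse compared to a fully connected one. A hypothetical sketch, not SPELL's actual construction:

```python
def build_temporal_graph(tracks, window=3):
    """Sketch of sparse spatial-temporal graph construction: each
    (timestamp, person_id) detection becomes a node, and two nodes are
    linked only when their timestamps fall within a temporal window,
    so a GNN can aggregate context without a fully connected graph."""
    nodes = sorted(tracks)  # (timestamp, person_id) pairs, ordered by time
    edges = set()
    for i, (t_i, p_i) in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            t_j, p_j = nodes[j]
            if t_j - t_i > window:
                break  # nodes are time-ordered; no later node can qualify
            edges.add((i, j))
    return nodes, sorted(edges)

# Two people seen at four timestamps; only temporally close detections connect.
nodes, edges = build_temporal_graph([(0, "a"), (1, "b"), (5, "a"), (6, "b")])
```

With the window set to 3, the detections at times 0 and 1 form one connected cluster and those at times 5 and 6 another, while the distant pairs stay unconnected, which is what keeps message passing cheap over long videos.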
Workshops and Tutorials:
Graph Machine Learning for Visual Computing
Graph Machine Learning (GML) provides powerful tools for tackling various problems in visual computing, such as geometric processing, scene graph generation, video understanding, multi-object relational mining, physical reasoning from vision, graphics simulation, and visual navigation. Intel co-organized this tutorial to spark diverse interests and creative thinking around GML within the computer vision community. The half-day session will cover a wide variety of topics, including the core theory of graph machine learning, its applications in visual computing, and an introduction to one of the most popular GML programming frameworks.
How to Get a Quick and Performant Model for Your Edge Application: From Data to Application
The OpenVINO Toolkit offers several tools for making a model run faster and use less memory. However, the quality of the results also depends on maintaining a consistent dataset distribution across training, testing, and validation. To address these challenges, the OpenVINO team developed OpenVINO Training Extensions (OTE), a convenient environment for training a new, more efficient model architecture on your own dataset while preserving its distribution, so the best possible model can be deployed at the edge; with this approach, an SSD-300 model achieved a 3.6x increase in processing speed over the original FP16 model. This tutorial session will provide opportunities to learn about OpenVINO and OTE while gaining hands-on experience.
The Fourth Workshop on Deep Learning for Geometric Computing
Computer vision approaches have made tremendous strides toward understanding shape from various data formats, especially since entering the deep learning era. However, attention to and research on extracting topological and geometric information from shapes are still lacking. These geometric representations can provide compact and intuitive abstractions for modeling, synthesis, compression, matching, and analysis. In its fourth edition, this workshop gathers researchers from computer vision, computational geometry, computer graphics, and machine learning to advance topological and geometric shape analysis using deep learning. The workshop will include competitions with prizes, proceedings, keynotes, paper presentations, and a fair and diverse environment for brainstorming about future research collaborations.