Meeting the Challenges of Lifelong Robotic Vision

MaryT_Intel · ‎12-20-2019

Introduction

Humans have the remarkable ability to learn from both their current environments and their past experiences. One of the goals of robot development is to create an artificial learning agent that can develop an autonomous understanding of the world from both its present surroundings and its previous knowledge.

Recent advances in computer vision and deep learning techniques have been made possible due to the availability of large-scale datasets such as ImageNet and COCO. Breakthroughs in object/person recognition, detection, and segmentation have relied heavily on the availability of these large representative datasets for training. However, robotic vision poses new challenges for applying visual algorithms that were developed from computer vision datasets for a fixed set of categories and tasks, since semantic concepts of a real environment are dynamically changing over time. Specifically, robots operate continually under open-set and sometimes detrimental conditions, which require them to have the capacity for lifelong learning by using reliable uncertainty estimates and robust algorithm designs.

The major challenge for lifelong robotic vision is continual understanding of a dynamic environment. At the object level, robots should be able to learn new object models incrementally without forgetting previously learned ones. The robotic vision system should explore how previously learned knowledge can enhance the current perception task, and whether the current task can improve performance on old tasks. At the scene level, the robot should be able to incrementally update its world model without getting lost. That’s why we have focused our research on lifelong object recognition and lifelong SLAM (simultaneous localization and mapping), with the goal of providing benchmarks for both tasks. Earlier this month, we organized the first-ever lifelong robotic vision competition at the 2019 International Conference on Intelligent Robots and Systems (IROS), held in Macau, China.

We refer to our (L)ifel(O)ng (R)obotic V(IS)ion research as OpenLORIS, which already contains two datasets: OpenLORIS-Object for lifelong object recognition and OpenLORIS-Scene for lifelong SLAM. Both contain multi-sensory data recorded primarily with Intel® RealSense™ cameras in real-world scenes. New evaluation metrics have been established to benchmark algorithm performance in a practicability-first fashion.

Datasets

We have used the unique characteristics of robotics to enhance robotic vision research by adding high-resolution sensor data (e.g. depth and point clouds), controlling the direction and number of sensors, and even reducing the intense labeling effort through self-supervision. To enable lifelong robotic vision research, we are providing robotic data (RGB-D, IMU, etc.) in typical indoor scenarios (home, office, café, etc.) with commonly used objects, varied scenes, and ground-truth trajectories acquired from auxiliary measurements with high-resolution sensors. Besides covering a diverse range of sensors, scenarios, and task types, our dataset embraces environment dynamics, which to our knowledge makes it the first real-world dataset to be used in a lifelong robotic vision setting.

The two datasets already included in OpenLORIS are OpenLORIS-Object and OpenLORIS-Scene. OpenLORIS-Object, collected with an Intel® RealSense™ Depth Camera D435i, contains objects recorded from different views under widely varying illumination, occlusion, and object-camera distance. The OpenLORIS-Scene dataset contains multi-sensor data collected with a wheeled robot in several different scenes, with multiple data sequences in each scene. Between sequences of the same scene there can be significant changes caused by human activities or day-night shifts. The primary sensors are an Intel® RealSense™ Depth Camera D435i and an Intel® RealSense™ Tracking Camera T265, both mounted at a fixed height of about 1 m. The color and depth images from the D435i camera can be used for monocular/RGB-D algorithms, while the dual fisheye images from the T265 camera can be used for stereo vision algorithms. Both cameras provide IMU measurements that are hardware-synchronized with the corresponding images. Odometry data from wheel encoders and ground-truth trajectories based on either a high-resolution LiDAR (light detection and ranging) or a motion capture system are also provided.

Competitions

This year’s inaugural Lifelong Robotic Vision Competition has just ended. In the first part of the challenge, more than 150 teams registered, and 40 teams submitted valid results over two months of online competition. For each of the two challenges (Lifelong Object Recognition and Lifelong SLAM), the top 8 teams were invited to the final round. The finalists, representing seven different countries, came from both academia and industry. They competed not only in the online algorithm benchmarking but also by delivering onsite talks at the Lifelong Robotic Vision Challenge Workshop at IROS 2019. The two tasks in the competition were:

Lifelong Object Recognition (with OpenLORIS-Object dataset): The goal of this competition was to test a model’s capability to continuously learn objects in a service-robot scenario. The challenge explored how to leverage the knowledge learned from previous tasks to perform new tasks better, and how to efficiently retain previously learned tasks. Essentially, the goal is to make a robot behave more like a human in terms of its knowledge transfer, association, and combination capabilities. In this competition, model prediction accuracy (forward transfer accuracy, backward transfer accuracy), inference time, model size, CPU/GPU memory usage, and the size of training data from previous tasks were all considered in the final results. The Intel Labs China team has made great efforts to reconstruct all 8 finalists’ solutions for the Lifelong Object Recognition challenge in a unified framework. The top 2 solutions (knowledge distillation with incremental structures, and a latent rehearsal approach) were recognized by the IROS 2019 competition chairs.
Lifelong SLAM (with OpenLORIS-Scene dataset): This competition presented new challenges to SLAM algorithms by introducing out-of-sight scene changes. For example, in home scenarios, most objects may be movable, replaceable, or deformable, and the visual features of the same place may differ significantly from one day to the next. Such out-of-sight dynamics pose great challenges to the robustness of pose estimation, and hence to a robot’s long-term deployment and operation. The term Lifelong SLAM is used here to address SLAM problems in an ever-changing environment over a long period of time.
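
To make the forward and backward transfer accuracy mentioned for the Lifelong Object Recognition task more concrete, here is a minimal sketch of how such scores are commonly computed in the continual learning literature from a matrix of per-task accuracies. It is an illustration under stated assumptions, not the competition's official scoring code: the function names, the accuracy matrix layout, the baseline vector, and the example numbers are all hypothetical, and the competition's exact definitions may differ.

```python
from typing import Sequence

# Assumed layout (hypothetical): acc[i][j] is the test accuracy on task j
# measured after the model has been trained sequentially on tasks 0..i.
Matrix = Sequence[Sequence[float]]


def average_accuracy(acc: Matrix) -> float:
    """Mean accuracy over all tasks after training on the last task."""
    n_tasks = len(acc)
    return sum(acc[n_tasks - 1]) / n_tasks


def backward_transfer(acc: Matrix) -> float:
    """Average change in accuracy on earlier tasks after learning later ones.

    Negative values indicate forgetting; positive values indicate that
    later tasks improved performance on earlier ones.
    """
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    diffs = [acc[n_tasks - 1][j] - acc[j][j] for j in range(n_tasks - 1)]
    return sum(diffs) / (n_tasks - 1)


def forward_transfer(acc: Matrix, baseline: Sequence[float]) -> float:
    """Average gain on a task *before* training on it, relative to a baseline.

    baseline[j] is the accuracy on task j of a model that has seen no
    training data at all (for example, a randomly initialized model).
    """
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("forward transfer needs at least two tasks")
    diffs = [acc[j - 1][j] - baseline[j] for j in range(1, n_tasks)]
    return sum(diffs) / (n_tasks - 1)


# Made-up numbers for three sequential tasks, for illustration only.
example_acc = [
    [0.90, 0.20, 0.10],
    [0.80, 0.85, 0.30],
    [0.70, 0.80, 0.90],
]
example_baseline = [0.10, 0.10, 0.10]

avg = average_accuracy(example_acc)
bwt = backward_transfer(example_acc)
fwt = forward_transfer(example_acc, example_baseline)
print(f"average={avg:.3f} backward={bwt:.3f} forward={fwt:.3f}")
```

In this made-up example the backward transfer is negative, meaning the model lost some accuracy on earlier tasks as it learned new ones, which is the forgetting behavior that lifelong learning methods try to limit.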

Lifelong Object Recognition Challenge Finalists at IROS 2019

Lifelong SLAM Challenge Finalists at IROS 2019

Why We Did This

Both lifelong object recognition and lifelong SLAM are new research topics that we believe will attract many researchers in the next few years, yet there is no satisfactory public dataset or well-established benchmarking metric for either. With a good dataset and a well-designed benchmark, it will be much easier for researchers to work on these topics without requiring a physical robot.

For lifelong object recognition, the robotics community has been interested in machine vision/computer vision analysis for decades. However, the research is much harder than standard computer vision tasks because the data distributions vary over time and the set of categories and tasks keeps growing. One of the biggest obstacles is that there are few benchmarks for evaluating algorithms and models. Furthermore, current research focuses mostly on accuracy, and there are no well-established benchmarks for inference time, model size, CPU/GPU usage, and the size of training data, which are the key properties that real applications care about. Our datasets and benchmarks can be used to evaluate all of these, and we believe our competition is a good starting point for the research community.

For SLAM, many researchers have started to work on semantic SLAM and learning-based algorithms, since the traditional feature-based algorithms are generally considered to be mature from an academic point of view. However, the advantages of such novel approaches are often not significant on existing SLAM benchmarks (which are mostly static, or contain only in-sight dynamics, and focus on accuracy instead of robustness), since feature-based algorithms may work well enough in low-dynamic scenes. On the other hand, the ability to continually adapt to out-of-sight scene changes over an extended period of time is crucial for real-world deployment of autonomous robots, yet is not well addressed in the existing literature. The proposed lifelong SLAM task could help fill the gap between conventional SLAM research and its real-world practicability, and may encourage SLAM researchers to investigate semantic and learning-based algorithms further for their potential to provide robust tracking and re-localization in dynamic and changing scenes.
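
The difference between an accuracy-focused and a robustness-focused benchmark can be illustrated with a small sketch. The code below is a hypothetical example, not the OpenLORIS-Scene evaluation code: the function names, the 2D position format, and the 0.5 m threshold are assumptions, and it presumes that estimated and ground-truth positions have already been aligned and matched by timestamp. It contrasts an error averaged only over frames where tracking succeeded with the fraction of all frames that received an acceptable estimate.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in meters, assumed format


def tracked_rmse(estimates: Sequence[Optional[Point]],
                 ground_truth: Sequence[Point]) -> float:
    """Accuracy-style score: RMSE over frames that have an estimate.

    Frames where tracking was lost (None) are silently ignored, so this
    number can look good even when the system fails most of the time.
    """
    sq_errors = [
        (est[0] - gt[0]) ** 2 + (est[1] - gt[1]) ** 2
        for est, gt in zip(estimates, ground_truth)
        if est is not None
    ]
    if not sq_errors:
        raise ValueError("no tracked frames to evaluate")
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def correct_fraction(estimates: Sequence[Optional[Point]],
                     ground_truth: Sequence[Point],
                     max_error_m: float = 0.5) -> float:
    """Robustness-style score: share of ALL frames with an acceptable pose.

    A lost frame counts as a failure, so long tracking outages lower the
    score even if the remaining estimates are very precise.
    """
    if len(estimates) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = 0
    for est, gt in zip(estimates, ground_truth):
        if est is None:
            continue  # tracking lost at this frame
        if math.hypot(est[0] - gt[0], est[1] - gt[1]) <= max_error_m:
            correct += 1
    return correct / len(ground_truth)


# Made-up trajectory: tracking is precise at first, then lost entirely.
gt_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
est_path = [(0.0, 0.1), (1.0, 0.1), None, None]

rmse = tracked_rmse(est_path, gt_path)
fraction = correct_fraction(est_path, gt_path)
print(f"tracked RMSE = {rmse:.2f} m, correct fraction = {fraction:.2f}")
```

In this made-up case the tracked error is small, yet the robot has a usable pose for only half of the trajectory, which is the kind of failure an accuracy-only benchmark would hide.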

Participants in the inaugural Lifelong Robotic Vision Challenge Workshop

What’s Next

Our competition was the most popular competition at this year’s IROS. We intend to make it an annual event and to introduce more data and new tasks based on the datasets. For object recognition, we plan to expand from 2D to 3D vision tasks by adding depth and point cloud information from real robots. Lifelong detection and segmentation tasks can also benefit from our collected benchmarks. For SLAM, most existing benchmarks focus on robot localization, but we hope to expand to more general scene understanding tasks in the future to support intelligent robot behaviors such as semantic navigation, object finding, and natural language Q&A. If you would like to learn more about our work, we encourage you to read our papers, “Are We Ready for Service Robots? The OpenLORIS-Scene Datasets for Lifelong SLAM” and “OpenLORIS-Object: A Dataset and Benchmark towards Lifelong Object Recognition”, and to check out our datasets online.

The next “Continual Learning in Computer Vision Workshop” has been accepted into CVPR 2020. This one-day workshop will include paper presentations, invited speakers, and technical benchmark challenges. The goal is to review the status quo as well as discuss the limitations and future directions for continual learning in computer vision. The OpenLORIS-Object dataset will be one of two datasets used for evaluating the continual learning algorithms.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.