AI Medical Robot Learns How to Suture by Imitating Videos

Using imitation learning, artificial intelligence (AI) researchers have found a promising approach for teaching medical robots surgical manipulation skills simply by imitating video demonstrations performed by surgeons. Researchers from the University of California, Berkeley, Google, and Intel trained an algorithm on these videos to segment surgical procedures into simple gestures, enabling robotic surgical tools to perform with greater accuracy than prior metric and sequence learning methods.

It's challenging for a surgical robot to train using videos. While surgeons can easily watch and understand a video, a robot may only see the demonstration as a stream of pixels. In addition, uncontrolled variables such as lighting, background, and camera viewpoint pose a challenge to robot learning.

To overcome this, the researchers trained the Motion2Vec algorithm, built around a Siamese network, in a semi-supervised manner, enabling the robot to analyze a video and segment it into meaningful sequences. In contrast to approaches that learn trajectories directly from demonstrations, the team focused on self-supervised and semi-supervised methods that learn skills from video observations where only a few segment labels may be available.
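As a rough illustration of the idea, and not the authors' implementation, segmentation in an embedding space can be sketched as nearest-centroid assignment: the few labeled frames define one centroid per action segment, and every other embedded frame is assigned to the closest centroid. All function and variable names here are hypothetical.

```python
import numpy as np

def segment_by_nearest_centroid(frame_embeddings, labeled_embeddings, labels):
    """Assign each embedded video frame to the action segment whose
    labeled examples lie closest in the embedding space."""
    classes = sorted(set(labels))
    # Mean embedding of the few labeled frames for each action segment
    centroids = np.stack([
        np.mean([e for e, l in zip(labeled_embeddings, labels) if l == c], axis=0)
        for c in classes
    ])
    # Distance from every frame to every centroid, then pick the nearest
    dists = np.linalg.norm(frame_embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]
```

In the real system the embeddings come from the trained Siamese network, where metric learning has already pulled frames of the same action segment close together, which is what makes a distance-based assignment like this meaningful.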

Fig. 1: Motion2Vec groups similar action segments together in an embedding space in a semi-supervised manner.

“Imitation learning provides a promising approach to teach new robotic manipulation skills from expert demonstrations,” said Mariano Phielipp, a principal engineer at the Intel AI Lab in Intel Labs. “Imitation learning is a technique in reinforcement learning that uses demonstrations from an expert to learn how to perform the intended task. There are many algorithms that perform imitation learning, and it continues to be an active area of research.”

To be successful, this technique usually requires a significant number of high-quality demonstrations, Phielipp said. “Even with this data, depending on the problem the algorithm could fail to learn to a degree that the task requires.”

Using 78 demonstrations from publicly available suturing videos in the JIGSAWS dataset, the Motion2Vec algorithm imitated suturing motions at the kinematic level on a dual-arm da Vinci robot. Each demonstration consists of a pair of videos from stereo cameras, kinematic data from the robot arm end effectors, and an action segment label for each video frame. The action segment labels correspond to a distinct set of 11 suturing subtasks annotated by surgeons, such as reach needle with right hand, position needle, push needle through tissue, and so on.
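The per-demonstration record described above might be modeled as follows; the class name, field names, and array dimensions are illustrative assumptions, not the actual JIGSAWS file format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SuturingDemo:
    """One JIGSAWS-style suturing demonstration (illustrative layout)."""
    left_video: np.ndarray      # stereo pair, shape (frames, H, W, 3)
    right_video: np.ndarray
    kinematics: np.ndarray      # per-frame end-effector state for both arms
    segment_labels: np.ndarray  # one of the 11 subtask ids per frame

    def __post_init__(self):
        # Every per-frame stream must cover the same number of frames
        frames = self.segment_labels.shape[0]
        assert self.left_video.shape[0] == frames
        assert self.right_video.shape[0] == frames
        assert self.kinematics.shape[0] == frames
```

The key structural point from the article is that video frames, kinematics, and subtask labels are aligned frame by frame, which is what lets the labels supervise both segmentation and pose imitation.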

Learning in twin networks

The research team used metric learning with a Siamese network to bring similar surgical action segments, such as needle insertion, needle extraction, and needle hand-off, together in an embedding space. Designed for verification tasks, a Siamese network uses two parallel neural networks with shared weights that take in different input vectors and compute comparable output vectors. This method uses a triplet loss to attract similar action segments in the embedding space and repel samples from other action segments in a semi-supervised manner. The resulting outputs are combined to provide a prediction.
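The triplet loss mentioned above can be written compactly. This is the generic formulation with an assumed margin value, not the paper's exact hyperparameters: an anchor frame is pulled toward a positive example from the same action segment and pushed away from a negative example from a different segment.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the positive (same action segment) toward the anchor and
    push the negative (different segment) at least `margin` farther away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # Loss is zero once the negative is sufficiently farther than the positive
    return max(d_pos**2 - d_neg**2 + margin, 0.0)
```

Minimizing this loss over many (anchor, positive, negative) triplets is what shapes the embedding space so that frames from the same surgical gesture cluster together.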

The Siamese network was tasked with matching the video input of motions made by the robotic end effectors with the video input of a surgeon making the same movements. After pre-training the network, a recurrent neural network (RNN) was used to predict pseudo-labels on unlabeled embedded sequences that are fed back to the Siamese network to improve the alignment of the action segments. Motion2Vec moved the video observations into a vector domain where closeness refers to spatiotemporal grouping of the same action segments.
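The pseudo-labeling feedback step can be sketched as a generic self-training round: the current model labels the unlabeled embedded sequences, and only confident predictions are fed back as extra training data. Here `model_predict` and the confidence threshold are illustrative stand-ins for the RNN and the authors' selection rule.

```python
def pseudo_label_round(model_predict, unlabeled, threshold=0.9):
    """One round of self-training: keep only the predictions the current
    model is confident about, to feed back as additional labels."""
    kept = []
    for x in unlabeled:
        label, confidence = model_predict(x)
        if confidence >= threshold:
            kept.append((x, label))
    return kept
```

Iterating rounds like this, then retraining on the labeled-plus-pseudo-labeled pool, is how a small set of surgeon-annotated segments can supervise a much larger body of unlabeled video.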

Conclusions on imitation learning

Using this imitation learning approach, the robot achieved an average segmentation accuracy of 85.5%, improving on several state-of-the-art baselines for metric and sequence learning, including temporal cycle consistency (TCC), single-view time-contrastive networks (svTCN), temporal convolutional networks (TCN), hidden Markov models (HMM), hidden semi-Markov models (HSMM), and conditional random fields (CRF). Motion2Vec's kinematic pose imitation gave a 0.94 cm average position error per observation on the test set.
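The two reported numbers, frame-wise segmentation accuracy and average position error, can be computed as follows. This is a generic sketch of the metrics, not the paper's evaluation code.

```python
import numpy as np

def segmentation_accuracy(pred_labels, true_labels):
    """Fraction of frames whose predicted action segment matches the annotation."""
    pred, true = np.asarray(pred_labels), np.asarray(true_labels)
    return float((pred == true).mean())

def mean_position_error(pred_xyz, true_xyz):
    """Average Euclidean distance (e.g. in cm) between predicted and
    recorded end-effector positions, one distance per observation."""
    return float(np.linalg.norm(pred_xyz - true_xyz, axis=1).mean())
```

Under these definitions, the reported 85.5% means roughly six of every seven frames were assigned the correct suturing subtask, and the 0.94 cm figure is the per-observation average of the position distances.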

“Using a number of demonstrations provided in a reasonable amount of time from the expert surgeon, the algorithm was able to learn to a certain degree to perform the intended surgery,” said Phielipp. “The results are an early indication that simple and repeated surgeries, such as skin suturing, could be assisted by a robot after it is taught how to perform the task.”