Applying Deep Learning to Genomics Analysis

Yinyin Liu
Harini Eavani
Zach Dwiel

Synthetic Genomics, Incorporated (SGI) is a synthetic biology company that aims to bring genomic-driven solutions to market. It designs and builds biological systems and conducts interdisciplinary research, combining biology and engineering to address global sustainability problems.

SGI asked for Intel’s help to conduct a deep learning proof of concept that would automatically tag protein sequences. Tagging protein sequences with their corresponding protein family labels and other annotations is important for genomic research, because there are thousands of protein families and millions of sequences. SGI and Intel collaborated to design a deep learning software framework flexible enough to design and train various kinds of models, each predicting multiple properties of a protein sequence using the amino-acid sequence as the only input. The system was further trained on the Intel® Deep Learning Cloud (Intel® DL Cloud) to produce basic protein descriptions for each sequence.
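To make "amino-acid sequences as the only input" concrete, the following is a minimal sketch of one common way to turn a protein string into model input. The alphabet ordering, the reserved padding and unknown indices, and the function names `encode` and `one_hot` are assumptions made for illustration; the post does not describe the encoding the project actually used.

```python
# Illustrative encoding only; the alphabet ordering, padding scheme, and
# function names are assumptions, not details from the SGI/Intel project.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD, UNKNOWN = 0, 1                   # reserved indices (assumed convention)
INDEX = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, max_len):
    """Map a protein string to a fixed-length list of integer indices."""
    # Truncate to max_len, map unrecognized letters to UNKNOWN, then pad.
    ids = [INDEX.get(aa, UNKNOWN) for aa in sequence.upper()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def one_hot(ids):
    """Expand integer indices into one-hot rows of width len(INDEX) + 2."""
    width = len(INDEX) + 2
    return [[1 if j == i else 0 for j in range(width)] for i in ids]


encoded = encode("MKTAYX", max_len=8)
print(encoded)  # [12, 10, 18, 2, 21, 1, 0, 0]
```

Here the non-standard letter `X` falls back to the unknown index and the sequence is padded to a fixed length, so that sequences of different lengths can be batched together.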

To do this, Intel designed a topology to handle multi-task learning. A single deep network was designed to generate several output types, including classifications, tags, segmentations, and text descriptions. Multiple tasks of each type were trained, and all of these tasks shared the same feature extraction layers and contributed to a single embedding. This embedding has been shown to be a useful representation not only for the tasks it was trained on but also for tasks it had never seen. This architecture makes it possible to predict a large number of properties in a single pass, requiring far less computation than classical methods, which run an individual model for each prediction and need a complex pipeline to coordinate the execution of those models.
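The shared-trunk idea described above can be sketched in a few lines of plain Python. This is a toy forward pass under stated assumptions, not the network Intel built: the layer sizes, the mean-pooling encoder, the random untrained weights, and the task names `family` and `tags` are all hypothetical, and the segmentation and text-description heads are omitted for brevity.

```python
import math
import random

# Toy sketch: sizes, task names, and random weights are all hypothetical.
random.seed(0)
EMBED_DIM = 8
VOCAB = 22  # 20 amino acids plus padding and unknown (assumed encoding)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SharedEncoder:
    """Feature-extraction layers shared by every task."""

    def __init__(self):
        self.table = rand_matrix(VOCAB, EMBED_DIM)      # per-residue vectors
        self.dense = rand_matrix(EMBED_DIM, EMBED_DIM)  # one dense layer
        self.calls = 0                                  # counts forward passes

    def __call__(self, ids):
        self.calls += 1
        # Mean-pool the residue vectors, then apply the dense layer with tanh.
        pooled = [sum(self.table[i][d] for i in ids) / len(ids)
                  for d in range(EMBED_DIM)]
        return [math.tanh(v) for v in matvec(self.dense, pooled)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


encoder = SharedEncoder()
heads = {
    "family": (rand_matrix(5, EMBED_DIM), softmax),  # one label out of 5
    "tags": (rand_matrix(3, EMBED_DIM), sigmoid),    # 3 independent tags
}


def predict(ids):
    embedding = encoder(ids)  # computed once, reused by every head
    outputs = {name: act(matvec(weights, embedding))
               for name, (weights, act) in heads.items()}
    return embedding, outputs


embedding, outputs = predict([12, 10, 18, 2, 21])
print(len(embedding), {k: len(v) for k, v in outputs.items()}, encoder.calls)
```

Adding another prediction task in this sketch means adding one more entry to `heads`; the encoder still runs once per sequence, which is the source of the computational savings the paragraph describes.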

This effort relied on the Intel DL Cloud and the latest in deep learning techniques to facilitate analysis, classification, and prediction in synthetic biology applications.

In this collaborative data science effort, the Intel AI Lab team applied AI techniques and computing power to bring additional tools and insights to SGI’s protein sequence analysis. This allowed the data scientists at SGI to make use of the large amount of existing data and extract insights that traditional methods couldn’t provide. In the next phase of the project, more capabilities are being added to the model. The ultimate goal is to bring these methods of protein sequence analysis into production. If your company or organization is interested in working with Intel to solve deep learning challenges, please contact us.
