Analyze High-Dimension Data Using the Intel® AI Analytics Toolkit

Learn how the Intel® AI Analytics Toolkit can be applied in medical science to single-cell data, such as single-cell RNA sequencing (scRNA-seq) data.

Clustergrammer2, a web-based tool for visualizing and analyzing high-dimensional data (such as single-cell RNA sequencing data) as interactive and shareable heat maps, has been ported to the Intel AI Analytics Toolkit. The tool has several optional biology-specific features (such as enrichment analysis) that facilitate the exploration of gene-level biological data.

This talk explores gene expression data and shows an implementation suited to studying diseases such as cancer. Examining the heat maps can give useful information for studying where gene mutations occurred.

Porting Clustergrammer2 to the Intel AI Analytics Toolkit gives an edge for interactively exploring data from 2,700 peripheral blood mononuclear cells (PBMC) obtained from a 10x Genomics* dataset. Use the Intel® Distribution for Python* from the toolkit and run the programs in Intel® Developer Cloud.

CIBERSORT (an external dataset for exploration) is used to estimate the abundance of cell types in a mixed population from gene expression data. The data is loaded in a sparse matrix format. The dataset consists of 32,000 genes and 2,700 single cells.
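As a rough illustration of what a sparse matrix format looks like in practice, the sketch below builds a small genes-by-cells matrix from (gene, cell, count) triplets with SciPy. The toy counts and all variable names are assumptions made for illustration; the talk's own loading code is not reproduced here.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy counts: 5 genes x 4 cells, given as (gene, cell, count) triplets.
# Single-cell matrices are mostly zeros, so only the nonzero entries are stored.
gene_idx = np.array([0, 0, 2, 3, 4, 4])
cell_idx = np.array([0, 3, 1, 2, 0, 3])
counts = np.array([5, 1, 7, 2, 3, 9])

# Compressed sparse column matrix with genes as rows and cells as columns.
gex = sparse.csc_matrix((counts, (gene_idx, cell_idx)), shape=(5, 4))

# Fraction of entries that are nonzero.
density = gex.nnz / (gex.shape[0] * gex.shape[1])

# Total counts per cell (column sums), flattened to a 1-D array.
counts_per_cell = np.asarray(gex.sum(axis=0)).ravel()
```

At the scale described above (32,000 genes by 2,700 cells), storing only the nonzero entries is usually far more compact than a dense array.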

Using the Intel® Distribution for Python*, the gene expression (GEX) data is normalized and the top expressing genes are identified. Then an arcsinh transform and Z-score normalization are applied. After that, the data is loaded into Clustergrammer2 to view interactive heat maps.
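The preprocessing steps described above can be sketched with NumPy as follows. This is a minimal sketch on random toy data, not the code from the talk: the function names, the per-cell target sum of 10,000, and the choice to rank genes by total expression are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy counts: 50 genes (rows) x 20 cells (columns).
counts = rng.poisson(lam=2.0, size=(50, 20)).astype(float)


def normalize_per_cell(matrix, target_sum=10_000.0):
    """Scale each cell (column) so that its counts sum to target_sum (assumed value)."""
    totals = matrix.sum(axis=0, keepdims=True)
    totals[totals == 0] = 1.0  # leave empty cells unchanged instead of dividing by zero
    return matrix / totals * target_sum


def top_expressing_genes(matrix, n_top):
    """Return row indices of the n_top genes with the highest total expression."""
    gene_totals = matrix.sum(axis=1)
    return np.argsort(gene_totals)[::-1][:n_top]


def zscore_rows(matrix):
    """Z-score each gene (row) across cells."""
    mean = matrix.mean(axis=1, keepdims=True)
    std = matrix.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # constant rows become all zeros
    return (matrix - mean) / std


normalized = normalize_per_cell(counts)
top_idx = top_expressing_genes(normalized, n_top=10)
transformed = np.arcsinh(normalized[top_idx])  # arcsinh compresses large values
scaled = zscore_rows(transformed)
```

The resulting `scaled` array (genes by cells) is the kind of matrix that would then be wrapped in a labeled table and passed to a heat map tool.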

The features of Clustergrammer2 are:

  • Zoom into and pan across a heat map by scrolling and dragging.
  • Hover over elements in the heat map to bring up more information via tooltips.
  • Reorder rows and columns.
  • Reduce high-dimensional datasets to a number of dimensions that can be visualized, using interactive dimensionality reduction (a data analysis method).
  • Depict the hierarchy of row and column clusters produced by hierarchical clustering with interactive dendrogram trees. The height of the branches shows the distance between clusters, and trapezoids display this hierarchical tree one slice at a time.
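To make the hierarchical clustering behind the dendrograms more concrete, here is a small sketch using SciPy. It is not Clustergrammer2's internal code: the toy data, the average linkage, and the Euclidean metric are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage

rng = np.random.default_rng(1)

# Hypothetical toy data: 6 rows drawn from two well-separated groups, 4 features each.
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 4))
group_b = rng.normal(loc=5.0, scale=0.1, size=(3, 4))
data = np.vstack([group_a, group_b])

# Agglomerative clustering. Each row of `tree` records one merge:
# [cluster_i, cluster_j, merge_distance, size_of_new_cluster].
tree = linkage(data, method="average", metric="euclidean")

# Cut the tree into two flat clusters, loosely comparable to viewing
# one slice of the dendrogram.
labels = fcluster(tree, t=2, criterion="maxclust")

# Leaf order that a dendrogram would use to reorder the heat map rows.
row_order = leaves_list(tree)
```

The merge distances in the third column of `tree` correspond to the branch heights mentioned above: the last merge, which joins the two groups, has a much larger distance than the merges within a group.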

The uses are:

  • Visualize bulk gene expression data.
  • Access gene expression data from the Cancer Cell Line Encyclopedia.
  • Explore post-translational modification and gene expression regulation in lung cancer.

 

Speaker

Abhishek Nandy's experience is a mix of research and broad industry exposure. He is an entrepreneur, teacher, author, researcher, and dream catcher. Abhishek has worked in the pharmaceutical, manufacturing, and retail industries and has led several teams in research and product development.

In the past, Abhishek was a principal engineer at P360*, where he established the AI and IoT product teams. He has a bachelor of technology degree and is an Intel® Black Belt Software Developer, a coveted open source Intel award given to people who have contributed to the Intel® Open Source Alliance. Abhishek presented his research work on reinforcement learning at the Association for Computing Machinery (ACM) SIGGRAPH 2018. He has been an invited educator at several premier educational institutes in India.

Abhishek has also authored books on reinforcement learning, Unity* machine learning, and Leap Motion* game engines. He was also among the top 50 innovators at the first Make in India initiative.