High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of k-mers, the fixed-length substrings of DNA sequences. Counting k-mers is often accomplished via hashing, and distributed-memory k-mer counting algorithms for large datasets are bound by memory access and network communication. In this work, we present two optimized distributed parallel hash table techniques that utilize cache-friendly algorithms for local hashing, overlapped communication and computation to hide communication costs, and vectorized hash functions that are specialized for k-mer and other short-key indices. On 4096 cores of the NERSC Cori supercomputer, our implementation completed index construction and query on an approximately 1 TB human genome dataset in just 11.8 seconds and 5.8 seconds, demonstrating speedups of 2.06× and 3.7×, respectively, over the previous state-of-the-art distributed-memory k-mer counter.
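To make the basic operation concrete, the following is a minimal single-process sketch of hash-based k-mer counting. It is not the implementation described above: the 2-bit base encoding, the function names `count_kmers`, `decode`, and `owner_rank`, and the multiplicative hash constant are all illustrative assumptions. The `owner_rank` helper only hints at how a distributed-memory counter might assign each k-mer to a process by hashing its key.

```python
from collections import Counter

# Illustrative 2-bit encoding of the four DNA bases (an assumption, not from the paper).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq: str, k: int) -> Counter:
    """Count all k-mers in seq, keyed by a 2-bit packed integer."""
    counts = Counter()
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases in the key
    key = 0
    valid = 0  # consecutive unambiguous bases seen since the last reset
    for base in seq.upper():
        code = ENCODE.get(base)
        if code is None:
            # Ambiguous base such as N: restart the sliding window.
            key, valid = 0, 0
            continue
        key = ((key << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[key] += 1  # Counter is backed by a hash table (dict)
    return counts


def decode(key: int, k: int) -> str:
    """Turn a packed integer key back into its k-mer string."""
    return "".join("ACGT"[(key >> (2 * (k - 1 - i))) & 3] for i in range(k))


def owner_rank(key: int, num_ranks: int) -> int:
    """Hypothetical owner assignment: hash the key, then reduce modulo the rank count."""
    mixed = (key * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF  # simple multiplicative mix
    return (mixed >> 32) % num_ranks


if __name__ == "__main__":
    table = count_kmers("ACGTACGT", 3)
    for packed, n in sorted(table.items()):
        print(decode(packed, 3), n, "owner:", owner_rank(packed, 4))
```

In a distributed setting, each process would typically send the k-mers it parses to their owner ranks and insert them into a local table there; the communication and local-hashing costs of that step are what the techniques in this work target.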