Assembly by Short Sequences* (ABySS*)

ABySS* is an open-source de novo genome assembler for short paired-end reads.

Wall clock time sees 4X improvement [1]

The Michael Smith Genome Sciences Centre at the BC Cancer Agency faced two challenges: reducing the execution time of its parallel de novo genome assembler, ABySS, and reducing the memory requirements of general alignment tools such as BWA, Bowtie, Novoalign, and ABySS-map. Intel worked with the agency to improve parallelization in ABySS version 1.9.0.

ABySS stands out for its ability to scale to large genomes, which comes from its Message Passing Interface (MPI)-based implementation of the de Bruijn graph assembly algorithm. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The relevant code optimizations are included and enabled by default in ABySS* 1.9.0.
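To make the de Bruijn graph idea concrete, here is a minimal single-process Python sketch. It is not ABySS code: the function names, the toy reads, and k = 4 are illustrative assumptions, reverse complements are ignored, and the hash-based owner_rank function only suggests one way a distributed assembler could assign k-mers to MPI ranks.

```python
import zlib


def kmers(read, k):
    """Yield every k-mer (substring of length k) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each k-mer to the k-mers that overlap it by k-1 bases on the right.

    A k-mer B is a successor of A when B is present in the reads and
    B == A[1:] + base for some base in "ACGT".
    """
    nodes = {kmer for read in reads for kmer in kmers(read, k)}
    return {
        kmer: sorted(
            kmer[1:] + base for base in "ACGT" if kmer[1:] + base in nodes
        )
        for kmer in nodes
    }


def owner_rank(kmer, n_ranks):
    """Pick a rank for a k-mer with a deterministic hash (assumed scheme)."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GGCGTGC"]  # hypothetical overlapping reads
    graph = build_de_bruijn(toy_reads, k=4)
    for kmer in sorted(graph):
        print(kmer, "->", graph[kmer], "rank", owner_rank(kmer, n_ranks=8))
```

Because each k-mer's owner is computed from the k-mer itself, any process can work out where a neighboring k-mer lives without a central lookup table, which is the property that lets this kind of graph be spread across many nodes.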

Performance Results

ABySS 1.3.5, the baseline version, required 25 hours of wall clock time to assemble a human genome. The optimized version, ABySS 1.9.0, completed the same assembly in only 6 hours when run on multiple processors with the input file split to exploit that parallelism further. This is roughly a 4X improvement over the baseline version on the same data set [1].
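The 4X figure follows directly from the two wall clock times. As a quick check, the short Python snippet below recomputes it; the function names are illustrative, and the only inputs are the 25-hour and 6-hour figures quoted above.

```python
def speedup(baseline_hours, optimized_hours):
    """Ratio of baseline to optimized wall clock time."""
    return baseline_hours / optimized_hours


def time_reduction(baseline_hours, optimized_hours):
    """Fraction of the baseline wall clock time that was eliminated."""
    return 1 - optimized_hours / baseline_hours


if __name__ == "__main__":
    baseline = 25   # hours, ABySS 1.3.5
    optimized = 6   # hours, ABySS 1.9.0
    print(f"speedup: {speedup(baseline, optimized):.2f}x")
    print(f"time reduction: {time_reduction(baseline, optimized):.0%}")
```

The ratio comes out slightly above 4, which the summary rounds to 4X.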

Wall clock times for the main genome assembly stage of the ABySS pipeline, measured on a human genome read dataset (NA12878), are shown in the accompanying figure. The leftmost bar is the baseline run time before optimization. The middle bar is the run time of the optimized version with all data in a single, monolithic input file. The rightmost bar shows the combined effect of the code optimizations and splitting the input file into 10 equal-sized parts.

Related Codes

Distributed Indexing Dispatched Alignment* (DIDA*)

Publications

J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. "ABySS: A Parallel Assembler for Short Read Sequence Data." Genome Research 19, no. 6 (2009): 1117-1123. doi:10.1101/gr.089532.108.

İnanç Birol, Shaun D. Jackman, Cydney Nielsen, Jenny Q. Qian, Richard Varhol, Greg Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E. Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M. Jones. "De Novo Transcriptome Assembly with ABySS." Bioinformatics 25, no. 21 (2009): 2872-2877. doi:10.1093/bioinformatics/btp367.

Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D. Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q. Qian, Malachi Griffith, Anthony Raymond, Nina Thiessen, Timothee Cezard, Yaron S. Butterfield, Richard Newsome, Simon K. Chan, Rong She, Richard Varhol, Baljit Kamoh, Anna-Liisa Prabhu, Angela Tam, YongJun Zhao, Richard A. Moore, Martin Hirst, Marco A. Marra, Steven J. M. Jones, Pamela A. Hoodless, and İnanç Birol. "De Novo Assembly and Analysis of RNA-seq Data." Nature Methods. 10 October 2010.

Configuration Table

System Overview

Nodes: Eight HPC nodes interconnected by 40 Gbps InfiniBand

Processor: Two Intel® Xeon® X5650 processors (2.67 GHz) per node

RAM: 48 GB per node

Operating System: CentOS 5.4, with Intel® Cluster Studio 2013

Baseline: ABySS version 1.3.5

Optimized: ABySS version 1.9.0

Input dataset: Subset of the following BAM file (272GB)

Input data were split into 10 approximately equal-sized BAM files. Equivalent gzipped FASTQ files should perform equally well.
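As an illustration of the splitting step, the Python sketch below distributes the records of a gzipped FASTQ file round-robin across 10 output files. It is an assumption-laden example rather than the procedure used for the benchmark: the function name, the output naming scheme, and the records_per_group parameter are hypothetical, and it assumes plain 4-line FASTQ records.

```python
import gzip
from itertools import islice


def split_fastq(src_path, out_prefix, n_parts=10, records_per_group=1):
    """Distribute FASTQ records round-robin across n_parts gzipped files.

    Assumes 4-line FASTQ records. Set records_per_group=2 for an interleaved
    paired-end file so that both mates land in the same part.
    Returns the list of output paths.
    """
    lines_per_group = 4 * records_per_group
    paths = [f"{out_prefix}.part{i:02d}.fastq.gz" for i in range(1, n_parts + 1)]
    outputs = [gzip.open(path, "wt") for path in paths]
    try:
        with gzip.open(src_path, "rt") as src:
            part = 0
            while True:
                # Read one group of records (4 lines per record).
                group = list(islice(src, lines_per_group))
                if not group:
                    break
                outputs[part].writelines(group)
                part = (part + 1) % n_parts
    finally:
        for out in outputs:
            out.close()
    return paths
```

Round-robin assignment keeps the parts within one group of each other in record count, which approximates the "approximately equal-sized" split described above without needing to know the total number of reads in advance.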

Data subset: The data subset corresponds to the following eight lane IDs:

1. 20FUKAAXX100202_1

2. 20FUKAAXX100202_2

3. 20FUKAAXX100202_3

4. 20FUKAAXX100202_4

5. 20FUKAAXX100202_5

6. 20FUKAAXX100202_6

7. 20FUKAAXX100202_7

8. 20FUKAAXX100202_8

Product and Performance Information

[1] Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks.
