elPrep*

elPrep* is a high-performance tool for preparing SAM/BAM/CRAM files for variant calling in genomic sequencing pipelines.

Execution Time Cut to 15 Minutes¹

elPrep* is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can replace SAMtools* and Picard* for preparation steps such as filtering, sorting, marking duplicates, and reordering contigs, while producing identical results. What sets elPrep* apart is its software architecture, which executes a preparation pipeline in a single pass through the data, no matter how many preparation steps the pipeline contains. elPrep* is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly reduce execution time.

Performance Results

For a preparation pipeline of five steps on a whole-exome BAM file (NA12878), elPrep* reduces the execution time from about 1 hour 40 minutes with a combination of SAMtools* and Picard* to about 15 minutes, while using the same server resources (48 threads and 23 GB RAM)¹. Tested using picard-tools-1.229*, samtools-1.2*, elprep-2.2*.


Background

Sequence analysis generally consists of a mapping phase followed by an analysis phase. In the mapping phase, an alignment tool maps the reads produced by the wet lab to a known reference genome. Afterwards, the mapped reads are processed by an analysis tool, for example for variant detection.

Alignment and analysis tools communicate via sequence alignment/map (SAM) files, a standardized format for storing mapped reads (Li et al., 2009), or the compressed variants thereof (BAM/CRAM). In practice, different alignment tools produce slightly different outputs, and different analysis tools depend on slightly different SAM structures to work properly.
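To make the format concrete, the following minimal sketch splits one SAM alignment line into the eleven mandatory tab-separated fields defined by the SAM specification. The `SamRecord` type, the `parse_sam_line` helper, and the example read are illustrative names and data introduced here; they are not part of elPrep*, SAMtools*, or Picard*.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM fields, plus any optional tags."""
    qname: str   # read name
    flag: int    # bitwise flags (0x4 = unmapped, 0x400 = duplicate)
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read bases
    qual: str    # base qualities
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line: str) -> SamRecord:
    """Parse a single alignment line (header lines starting with '@' are not handled)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# Made-up example read, for illustration only.
example = "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(example)
is_duplicate = bool(record.flag & 0x400)
```

Preparation steps largely amount to rewriting fields such as `flag`, `rname`, or the optional tags so that the downstream analysis tool accepts the file.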

This is why pipelines typically include several steps between the alignment and analysis tools that rewrite the SAM files into a form the analysis tool accepts. For example, the GATK best-practice pipeline (Van der Auwera et al., 2013) requires five preparation steps between alignment (BWA) and analysis (GATK). These steps take up roughly 30% of the runtime of the complete pipeline.

Figure: Pipeline Execution Without elPrep*

We developed elPrep*, a new tool intended as a high-performance alternative to existing tools for manipulating SAM, BAM, and CRAM files. elPrep* is multithreaded from the ground up: all preparation steps are executed in parallel. The application runs entirely in memory, which avoids repeated file I/O between the preparation steps and allows their computations to be merged for more efficient execution.
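One way to picture merging several preparation steps is to treat each step as a function on a single read and to apply all of them while the data is traversed once. The sketch below illustrates that idea only; the read representation (a dictionary), the step names, and `single_pass` are assumptions made for this example and do not reflect elPrep*'s actual implementation.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Sequence

Read = Dict[str, object]
# A step returns the (possibly rewritten) read, or None to drop it.
Step = Callable[[Read], Optional[Read]]


def drop_unmapped(read: Read) -> Optional[Read]:
    """Hypothetical step: remove reads whose SAM flag has the 0x4 (unmapped) bit set."""
    return None if int(read["flag"]) & 0x4 else read


def replace_read_group(new_group: str) -> Step:
    """Hypothetical step factory: overwrite the read group of every read."""
    def step(read: Read) -> Optional[Read]:
        return {**read, "rg": new_group}
    return step


def single_pass(reads: Iterable[Read], steps: Sequence[Step]) -> Iterator[Read]:
    """Apply every step to each read during one traversal of the input."""
    for read in reads:
        current: Optional[Read] = read
        for step in steps:
            current = step(current)
            if current is None:
                break  # read was dropped; skip the remaining steps
        else:
            yield current


reads = [{"name": "a", "flag": 0}, {"name": "b", "flag": 4}]
prepared = list(single_pass(reads, [drop_unmapped, replace_read_group("sample1")]))
```

With separate tools, each step would read and write the whole file; here the input is traversed once regardless of how many steps are listed.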

Figure: Hypothetical Execution with Parallelized Tools

We had to reformulate the preparation steps as filters. In many cases, this was straightforward, but some steps required finding alternative algorithms. For example, the algorithm for marking duplicates in Picard* is based on comparing adapted mapping positions of all reads. Its implementation traverses the entire read set multiple times to compare the reads' mapping positions one by one. We reformulated this as a single-pass algorithm that uses memoization to keep track of the best-scoring read seen so far at each mapping position. If a subsequent read maps to the same position as a previous one, but with a better quality score, it replaces the old one in the memoization table, and the old one is marked as a duplicate; otherwise, the new read is marked as a duplicate. Despite such algorithmic reformulations, the output of elPrep* is identical to the output produced by SAMtools* and Picard*.
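The single-pass idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the elPrep* or Picard* implementation: the position key (`chrom`, `pos`, `strand`) and the `score` field are hypothetical stand-ins for the adapted mapping position and the quality score used by the real tools, and ties are resolved in favor of the read seen first.

```python
from typing import Dict, List, Tuple

Read = Dict[str, object]


def mark_duplicates(reads: List[Read]) -> List[Read]:
    """Mark duplicates in one pass, remembering the best read per position."""
    best: Dict[Tuple[object, object, object], Read] = {}  # memoization table
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        current = best.get(key)
        if current is None:
            # First read seen at this position.
            read["duplicate"] = False
            best[key] = read
        elif read["score"] > current["score"]:
            # Better read: it replaces the old entry, which becomes a duplicate.
            current["duplicate"] = True
            read["duplicate"] = False
            best[key] = read
        else:
            # Equal or worse read: it is the duplicate.
            read["duplicate"] = True
    return reads


reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "score": 30},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "score": 40},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "score": 20},
    {"name": "r4", "chrom": "chr1", "pos": 250, "strand": "+", "score": 10},
]
flags = [r["duplicate"] for r in mark_duplicates(reads)]
```

Each read is visited exactly once, and the table holds at most one entry per distinct position, which is what allows the step to run as a filter alongside the others.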

Figure: Pipeline Execution with elPrep*

Once all data is streamed into memory and all filters are applied, the operations that work on the whole data set, such as sorting, are executed. elPrep* implements this phase using fork-join patterns, which are executed on a work-stealing scheduler for load balancing. After the sorting phase, the worker threads transform the data back into SAM file entries in parallel, while possibly applying additional filters, to write the result to the output file.
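The fork-join shape of such a sorting phase can be sketched in a few lines: split the data into chunks, sort the chunks concurrently (fork), then merge the sorted chunks (join). This is a structural sketch only. It uses Python's standard thread pool, which hands out tasks from a shared queue rather than a work-stealing scheduler, and CPython threads do not provide the parallel speedup of a native implementation. The function name `fork_join_sort` and its parameters are assumptions made for this example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def fork_join_sort(records: Sequence[T], key: Callable[[T], object], workers: int = 4) -> List[T]:
    """Sort records by splitting them into chunks, sorting each chunk in a
    worker thread, and merging the sorted chunks."""
    if not records:
        return []
    chunk_size = max(1, -(-len(records) // workers))  # ceiling division
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork: one sorting task per chunk.
        sorted_chunks = list(pool.map(lambda chunk: sorted(chunk, key=key), chunks))
    # Join: k-way merge of the already sorted chunks.
    return list(heapq.merge(*sorted_chunks, key=key))


# Made-up (reference name, position) pairs standing in for mapped reads.
records = [("chr1", 300), ("chr1", 100), ("chr2", 50), ("chr1", 200), ("chr2", 10)]
by_coordinate = fork_join_sort(records, key=lambda r: (r[0], r[1]), workers=2)
```

A work-stealing scheduler improves on this static chunking by letting idle workers take pending tasks from busy ones, which balances the load when chunks take unequal time.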

Publications

Charlotte Herzeel, Pascal Costanza, Dries Decap, Jan Fostier, and Joke Reumers. "elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling." PLoS ONE 10, no. 7 (2015). doi:10.1371/journal.pone.0132868.

Configuration Table

System Overview

Software: picard-tools-1.229*, samtools-1.2*, elprep-2.32*, CentOS* release 7.0.1406 (Core), Python* 2.7.5, GCC* 4.8.2 (optional), GNU parallel* 20150222 (optional)
Processor: 2x 12-core Intel® Xeon® E5-2690 processor (2.6 GHz)
RAM: 256 GB
Storage: 2 TB Intel® P3700 SSD

Product and Performance Information

1. Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.