High throughput sequencing has revolutionized the new world of bioinformatics research. Since everyone is aware of the Human Genome project in which the human genome has been sequenced, millions of species have been sequenced so far. Sequencing is a very important aspect of bioinformatics so new faster and better sequencing techniques are needed . New sequencing platforms produce biological sequence fragments faster and cheaper.
Ideal read mappers should accomplish the following aspects:
- Maximal speed
- Minimal memory
- Maximal accuracy
- Shoot at a moving target (since fast evolving technologies differ in length distribution and sequencing errors).
Recent advances in next generation sequencing technologies have led to increase read lengths, higher error rates and error models showing more and longer indels (insertions and deletions). A preprocessing step of indexing reference genomes and/or sequencing reads must guarantee fast substring matching. The overall search space is pruned to candidate genomic regions by searching matching segments (called seeds) between reads and the reference genome. These candidate regions are then further investigated to look for acceptable alignments that reach a particular score. Then the sequencing is done.
ALFALFA is a new platform for sequencing is extremely fast and accurate at mapping long reads (> 500bp), while still being competitive for moderately sized reads (> 100bp). Both end-to-end (i.e., global) and local read alignment is supported and several strategies for paired-end (i.e., global) mapping can efficiently handle large variations in insert size (i.e., input genome to be sequenced). The name is an acronym for “A Long Fragment Aligner/A Long Fragment Aligner”. It is repeated twice as a pun on repetitive and overlapping fragments observed in genome sequences that heavily distort read mapping and genome assembly.
The most fascinating feature of ALFALFA is that it uniquely uses the ‘enhanced sparse suffix arrays’ to index reference genome (the genome to be sequenced). Index refers to a data structure that allow for the quick location of all occurrences of patterns starting at interesting positions only. Sparse suffix array is a technology which uses LCP (Longest Common Prefix) series which reduces the solution space and forms a suffix tree efficiently. Sparse Suffix Array uses a chaining algorithm to speed up dynamic programming extensions of the candidate region. This data structure facilitates fast calculation of maximal and super-maximal exact matches. The speed-memory trade-off is tuned by setting the sparseness value of the index.
ALFALFA follows a canonical seed-and-extend work- flow for mapping reads onto a reference genome:
- Root system
Reference genome indexed by enhanced sparse suffix array
- Seed
Super-maximal exact match between reference genome and read
(To enable quick retrieval of variable-length seeds called super-maximal exact matches between a read and the reference genome).
- Flower bud
Cluster of seed forms candidate genomic region
(Seeds are then grouped in to non-overlapping clusters that mark candidate genomic regions for read alignment).
- Flower
Gaps between seeds filled by dynamic programming
(Handling of candidate region is prioritized by agglomerate base pair coverage of seeds. the final extend phase sample seeds from candidate regions to form collinear chains that are bridged using dynamic programming).
Features of ALFALFA:
ALFALFA uses the technological evolution for the production of longer reads by using maximal exact matches [MEMs] and super-maximal exact matches [SMEMs] as seeds. (Since MEMs between a read and a reference genome may overlap, super- maximal exact matches are defined as MEMs that are not contained in another MEM in the read ) . These seeds are then extensively filtered and then decide the order of alignment to allow for more accurate prioritization of candidate regions. To reduce the number of expensive dynamic programming computations needed, ALFALFA chains seeds together to form a gapped alignment. As a result, the extension phase (aligning the matches) is limited to filling gaps in between chains while evaluating alignment quality.
The sparseness value ‘s ‘of sparse suffix arrays (controlled by the option -s) provides an easily tunable trade-off to balance performance and memory footprint. In theory, sparse suffix arrays take up 9/s + 1 bytes of memory per indexed base. A sparse suffix array with sparseness factor 12 thus indexes the entire human genome with a memory footprint of 5.8GB. It shows that ALFALFA is good to perform sequencing at maximal speed acquiring minimal memory space.
ALFALFA tries to balance the number and the quality of seeds using a combination of maximal and super-maximal exact matches. The intervals [i..i + l−1] and [j..j + l-1] correspond to a maximal exact match between a read and a reference genome if there is a perfect match between both subsequences of length ` starting at position i in the read and at position j in the reference genome, with mismatches occurring at positions (i−1,j−1) and (i + l, j + l) just before and after the location of the matching subsequence.
A combination of neighboring seeds increases the evidence that some region in the reference genome holds potential to serve as a mapping location. ALFALFA therefore sorts seeds according to their starting position in the reference genome and bins them into non-overlapping clusters using the locally longest seeds as anchors around which regions are built. This results in a list of candidate regions along the reference genome. To limit the number of candidate regions requiring further examination, only SMEMs and rare MEMs are used for candidate region identification. . Candidate regions are then ranked by their cov- erage of read bases, calculated from the seeds that make up the clusters. Sequential processing of these prioritized candidate regions halts when either a high number of feasible alignments has been found, a series of consecutive candidate regions failed to produce an acceptable alignment or read coverage drops below a certain threshold.
The dimensions of a dynamic programming matrix correspond to the bounds of a candidate region, but computations are often restricted to a band around the main diagonal of the matrix. The width of this band depends on the minimal alignment score required.ALFALFA further reduces the dimensions of the matrix by forming a collinear chain of a subset of the seeds that make up a candidate region. Dynamic programming can then be restricted to fill the gaps in between consecutive non-overlapping seeds. The chaining algorithm starts from an anchor seed and greedily adds new seeds that do not exhibit a high skew to the chain. The skew is defined as the difference of the distances between two seeds on the read sequence and the reference genome. The amount of skew allowed is automatically decided based on the gap between the seeds and the parameters that influence the feasibility of an alignment. ALFALFA allows multiple chains per candidate region, based on the available anchor seeds. Anchor selection is based on seed length and seeds contained in chains can no longer be used as anchors in successive chain construction.
Overall, Bowtie 2 has the highest sensitivity, which reaches 100%. However, Bowtie 2 is also less able to distinguish between good and bad alignments. CUSHAW3, BWA-MEM and ALFALFA exhibit the best trade-off between true positives and false positives. The mapping quality is determined by ROC (receiver operating characteristic ) curve.
The benchmark results demonstrate that ALFALFA is extremely fast at mapping long reads, while still being competitive for moderately sized reads. Together with BWA-SW and BWA-MEM, it is one of a few mappers that scale well for read lengths up to several kilobases.
Reference:
BMC Bioinformatics Sample 16:59
doi:10.1186/s12859-015-0533-0
Michaël Vyverman ([email protected]) Bernard De Baets ([email protected]) Veerle Fack ([email protected]) Peter Dawyndt