Benchmark databases for multiple sequence alignment: An overview

Multiple sequence alignment (MSA) is a crucial step in most molecular analyses and evolutionary studies. Many MSA programs have been developed, based on different approaches, each attempting to produce an optimal alignment with high accuracy. The basic algorithms behind these programs include progressive [1], iterative [2], and consistency-based [3] approaches. Some programs, such as M-Coffee [4] and PCMA [5], combine several methods when building an alignment.

Different MSA programs outperform one another in different respects and reach different levels of accuracy. Their accuracy and efficiency are assessed against benchmark databases. These benchmarks are created either manually or semi-automatically and are built on protein structure alignments. Because multiple structure alignment is complex, pairwise structure alignment is generally preferred. The alignments produced by MSA programs are compared with the reference alignment sets that the benchmark databases provide.

Several benchmark databases are available, of which BAliBASE (Benchmark Alignment dataBASE) [6-8] is the most widely used. BAliBASE is built by combining automated and manual methods and provides a variety of reference alignment sets covering, for example, repeats, circular permutations, sequences with highly divergent orphans, and N/C-terminal extensions. HOMSTRAD (Homologous Structure Alignment Database) [9-11] is another database of protein structure alignments that is frequently used as a benchmark, although it was not created for this purpose. Several other benchmarks have been developed in the last decade, including OXBench [12], PREFAB (Protein reference alignment benchmark) [13], SABmark (Sequence alignment benchmark) [14], and IRMBASE (Implanted ROSE motifs base) [15].

Most of the reference alignments in these benchmark databases are global alignments, and the tests built on them measure sensitivity (i.e., the number of correctly aligned positions) rather than specificity. The IRMBASE benchmark consists of simulated conserved motifs generated with the ROSE software [16], which models insertions, deletions, and substitutions. Because these sequences are simulated, their evolutionary history is known and the correct multiple alignment is available; this makes it possible to assess how well MSA programs detect isolated motifs within sequences [15].
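To make the distinction between sensitivity and specificity concrete, the sketch below compares a test alignment with a reference alignment at the level of aligned residue pairs. It is a minimal illustration, not the scoring code of any benchmark: the function names and the toy sequences are hypothetical, and specificity is approximated here by a precision-style ratio (the fraction of pairs in the test alignment that also occur in the reference), which is an assumption of this sketch.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Each residue is identified as (sequence_index, residue_index).
    """
    counters = [0] * len(alignment)  # next residue index for each sequence
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_id, char in enumerate(column):
            if char != "-":
                residues.append((seq_id, counters[seq_id]))
                counters[seq_id] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sensitivity_and_precision(test, reference):
    """Sensitivity: share of reference pairs recovered by the test alignment.
    Precision (assumed stand-in for specificity): share of test pairs that
    are also present in the reference."""
    test_pairs = aligned_pairs(test)
    ref_pairs = aligned_pairs(reference)
    shared = len(test_pairs & ref_pairs)
    sensitivity = shared / len(ref_pairs) if ref_pairs else 0.0
    precision = shared / len(test_pairs) if test_pairs else 0.0
    return sensitivity, precision


# Toy data (hypothetical): the reference leaves the sequence tails unaligned,
# while the test alignment forces an extra residue pair.
reference = ["AC-GT", "ACT--"]
test = ["ACGT", "ACT-"]
sens, prec = sensitivity_and_precision(test, reference)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In this toy case the test alignment recovers every reference pair, so its sensitivity is perfect, yet it also aligns one pair that the reference leaves unaligned. A score based on sensitivity alone would not penalize that over-alignment.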

MSA programs are evaluated using scores such as the sum-of-pairs (SP) score, column score, maximum likelihood, minimum entropy, consensus, and star, calculated with respect to the reference alignments provided by the benchmark databases. The SP score is the most widely used evaluation function. These evaluation functions will be discussed in detail in an upcoming article. For further reading, please refer to the references below. For any other queries, write to muniba@bioinformaticsreview.com.
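As a rough illustration of the two most common scores, the following sketch computes an SP score and a column score for a test alignment against a reference. It assumes the usual benchmark-style definitions (SP as the fraction of reference residue pairs reproduced, column score as the fraction of reference columns reproduced exactly) and scores every column, whereas real benchmark tools may restrict scoring to selected regions. All function names and the toy alignments are hypothetical.

```python
from itertools import combinations


def residue_columns(alignment):
    """Turn gapped rows into columns of residue indices (None for a gap)."""
    counters = [0] * len(alignment)  # next residue index for each sequence
    columns = []
    for column in zip(*alignment):
        indexed = []
        for seq_id, char in enumerate(column):
            if char == "-":
                indexed.append(None)
            else:
                indexed.append(counters[seq_id])
                counters[seq_id] += 1
        columns.append(tuple(indexed))
    return columns


def pair_set(columns):
    """Collect every pair of residues that shares a column."""
    pairs = set()
    for column in columns:
        present = [(s, idx) for s, idx in enumerate(column) if idx is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = pair_set(residue_columns(reference))
    test_pairs = pair_set(residue_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def column_score(test, reference):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    ref_cols = residue_columns(reference)
    test_cols = set(residue_columns(test))
    if not ref_cols:
        return 0.0
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


# Toy data (hypothetical): the second sequence is shifted in the test alignment.
reference = ["ACGT-", "AC-TA", "A-GTA"]
test = ["ACGT-", "ACT-A", "A-GTA"]
sp = sp_score(test, reference)
cs = column_score(test, reference)
print(f"SP={sp:.3f} column score={cs:.3f}")
```

The column score is the stricter of the two: shifting a single residue breaks every column it touches, while the SP score still credits the residue pairs in those columns that remain correctly aligned.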

References

1. Fitch, W. M., & Yasunobu, K. T. (1975). Phylogenies from amino acid sequences aligned with gaps: The problem of gap weighting. Journal of Molecular Evolution, 5(1), 1–24. https://doi.org/10.1007/BF01732010
2. Berger, M. P., & Munson, P. J. (1991). A novel randomized iterative strategy for aligning multiple protein sequences. Bioinformatics, 7(4), 479–484. https://doi.org/10.1093/bioinformatics/7.4.479
3. Gotoh, O. (1990). Consistency of optimal sequence alignments. Bulletin of Mathematical Biology, 52(4), 509–525. https://doi.org/10.1016/S0092-8240(05)80359-3
4. Wallace, I. M., O’Sullivan, O., Higgins, D. G., & Notredame, C. (2006). M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Research, 34(6), 1692–1699. https://doi.org/10.1093/nar/gkl091
5. Pei, J., Sadreyev, R., & Grishin, N. V. (2003). PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19(3), 427–428. https://doi.org/10.1093/bioinformatics/btg008
6. Thompson, J., Plewniak, F., & Poch, O. (1999). BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15(1), 87–88. https://doi.org/10.1093/bioinformatics/15.1.87
7. Bahr, A., Thompson, J. D., Thierry, J.-C., & Poch, O. (2001). BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Research, 29(1), 323–326. https://doi.org/10.1093/nar/29.1.323
8. Thompson, J. D., Koehl, P., Ripp, R., & Poch, O. (2005). BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function, and Bioinformatics, 61(1), 127–136. https://doi.org/10.1002/prot.20527
9. Mizuguchi, K., Deane, C. M., Blundell, T. L., & Overington, J. P. (1998). HOMSTRAD: A database of protein structure alignments for homologous families. Protein Science, 7(11), 2469–2471. https://doi.org/10.1002/pro.5560071126
10. De Bakker, P. I. W., Bateman, A., Burke, D. F., Miguel, R. N., Mizuguchi, K., Shi, J., … Blundell, T. L. (2001). HOMSTRAD: Adding sequence information to structure-based alignments of homologous protein families. Bioinformatics, 17(8), 748–749. https://doi.org/10.1093/bioinformatics/17.8.748
11. Stebbings, L. A., & Mizuguchi, K. (2004). HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database. Nucleic Acids Research, 32(Database issue), D203–D207.
12. Raghava, G., & Searle, S. (2003). OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC.
13. Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797. https://doi.org/10.1093/nar/gkh340
14. Wallace, I. M., Blackshields, G., & Higgins, D. G. (2005). Multiple sequence alignments. Current Opinion in Structural Biology, 15(3), 261–266. https://doi.org/10.1016/J.SBI.2005.04.002
15. Subramanian, A. R., Weyer-Menkhoff, J., Kaufmann, M., & Morgenstern, B. (2005). DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics, 6(1), 66. https://doi.org/10.1186/1471-2105-6-66
16. Stoye, J., Evers, D., & Meyer, F. (1998). Rose: generating sequence families. Bioinformatics, 14(2), 157–163. https://doi.org/10.1093/bioinformatics/14.2.157