Multiple sequence alignment (MSA) is a basic step in the phylogenetic analysis of organisms. In an MSA, all the sequences under study are aligned together on the basis of the similar regions within them. The goal of pairwise alignment is to identify the alignment that maximizes protein sequence similarity. MSA extends this idea by seeking the alignment that maximizes the sum of similarities over all pairs of sequences, which is called the sum-of-pairs (SP) score. The SP score is the basis of many alignment algorithms.
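To make the SP score concrete, here is a minimal sketch of how it could be computed for a small alignment. The scoring values (`MATCH`, `MISMATCH`, `GAP`) and the function names are assumptions chosen for illustration; real programs generally use a substitution matrix and more elaborate gap penalties.

```python
from itertools import combinations

# Hypothetical toy scoring scheme (an assumption, not taken from any real program).
MATCH, MISMATCH, GAP = 1, -1, -2


def column_pair_score(a: str, b: str) -> int:
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # a gap facing a gap is ignored in this sketch
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sp_score(alignment: list[str]) -> int:
    """Sum the pair scores over every pair of rows and every column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        total += sum(column_pair_score(a, b) for a, b in zip(row_a, row_b))
    return total


if __name__ == "__main__":
    msa = ["MK-VL", "MKAVL", "M--VL"]
    print(sp_score(msa))
```

With three rows there are three pairs to score. An alignment algorithm based on the SP score searches for the gap placement that makes this total as large as possible.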
The most widely used approach for constructing an MSA is “progressive alignment”, in which a set of n proteins is aligned by performing n-1 pairwise alignments of pairs of proteins or pairs of intermediate alignments, guided by a phylogenetic tree connecting the sequences. A methodology that has been used successfully to improve on progressive alignment based on the SP score is “consistency-based scoring”, which takes into account how well the pairwise alignments agree with one another. For example, given three sequences A, B, and C, the pairwise alignments A-B and B-C imply an alignment of A-C, which may differ from the directly computed A-C alignment.
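The A, B, C example can be sketched in code. The snippet below only shows how an A-C alignment is implied by composing A-B and B-C through B, and how it can disagree with a direct A-C alignment; it is not the scoring scheme of any particular program. The sequences and the function names are made up for illustration.

```python
def aligned_pairs(row_x: str, row_y: str) -> set[tuple[int, int]]:
    """Return (i, j) for each column aligning residue i of x with residue j of y."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1  # advance the residue counter of x past a non-gap symbol
        if b != "-":
            j += 1  # advance the residue counter of y past a non-gap symbol
    return pairs


def implied_pairs(ab: set[tuple[int, int]], bc: set[tuple[int, int]]) -> set[tuple[int, int]]:
    """Compose A-B and B-C residue pairs through B to get the implied A-C pairs."""
    b_to_c = dict(bc)  # each residue of B aligns with at most one residue of C
    return {(i, b_to_c[j]) for i, j in ab if j in b_to_c}


if __name__ == "__main__":
    # Hypothetical pairwise alignments of A = ACDE, B = ACE, C = ADE.
    ab = aligned_pairs("ACDE", "AC-E")
    bc = aligned_pairs("ACE", "ADE")
    direct_ac = aligned_pairs("ACDE", "A-DE")

    implied_ac = implied_pairs(ab, bc)
    print("implied A-C:", sorted(implied_ac))
    print("direct  A-C:", sorted(direct_ac))
    print("agreed pairs:", sorted(implied_ac & direct_ac))
```

Here the path through B pairs the second residue of A with the second residue of C, while the direct alignment pairs the third residue of A with it instead. A consistency-based method would give more weight to the pairs on which both routes agree.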
This raises two questions: how much can we rely on an MSA once it has been obtained, and how is an MSA validated?
Validation of an MSA program typically uses a benchmark data set of reference alignments. An MSA produced by the program is compared with the corresponding reference alignment, which yields an accuracy score.
Before 2004, the standard benchmark was BAliBASE (Benchmark Alignment dataBASE), a database of manually refined MSAs consisting of high-quality, documented alignments, designed to identify the strong and weak points of the numerous alignment programs now available.
More recently, several new benchmarks have become available, namely OXBENCH, PREFAB, SABmark, IRMBASE, and a new, extended version of BAliBASE.
Another basic measure used in assessing most alignment programs is the fM score. It assesses the specificity of an alignment tool: it is the proportion of predicted matched residues that also appear in the reference alignment. In practice, some regions of the sequences are alignable and some are not, but there are usually also intermediate cases, where sequence and structure have diverged to the point at which homology is not reliably detectable. In such cases the fM score provides, at best, a noisy assessment of alignment tool specificity, one that becomes increasingly less reliable as one considers sequences of increasing structural divergence.
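As a rough illustration of the definition above, the fM score can be computed by listing every pair of residues that the predicted alignment places in the same column and checking how many of those pairs also occur in the reference. This is a simplified sketch: the function names and toy alignments are assumptions, and it scores every column, whereas a benchmark may restrict scoring to selected regions.

```python
from itertools import combinations


def residue_pairs(alignment: list[str]) -> set[tuple[int, int, int, int]]:
    """Collect (seq_a, pos_a, seq_b, pos_b) for residues that share a column."""
    # positions[s][col] is the residue index of sequence s at that column,
    # or None where the sequence has a gap.
    positions = []
    for row in alignment:
        index, row_positions = 0, []
        for symbol in row:
            if symbol == "-":
                row_positions.append(None)
            else:
                row_positions.append(index)
                index += 1
        positions.append(row_positions)

    pairs = set()
    for a, b in combinations(range(len(alignment)), 2):
        for pos_a, pos_b in zip(positions[a], positions[b]):
            if pos_a is not None and pos_b is not None:
                pairs.add((a, pos_a, b, pos_b))
    return pairs


def fm_score(predicted: list[str], reference: list[str]) -> float:
    """Fraction of predicted residue pairs that also appear in the reference."""
    predicted_pairs = residue_pairs(predicted)
    if not predicted_pairs:
        return 0.0  # nothing was predicted as matched
    reference_pairs = residue_pairs(reference)
    return len(predicted_pairs & reference_pairs) / len(predicted_pairs)


if __name__ == "__main__":
    reference = ["ACDE", "A-DE"]  # hypothetical reference alignment
    predicted = ["ACDE", "AD-E"]  # hypothetical program output, same sequences
    print(round(fm_score(predicted, reference), 3))
```

Rows must appear in the same order in both alignments, since pairs are keyed by row index. A pair that the program predicts in a region the reference cannot align reliably counts against the score, which is one way the noise described above arises.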
Even with reference alignments, however, the accuracy of the results remains open to question, because the reference alignments themselves are of varying quality.
- Multiple sequence alignment
Robert C. Edgar and Serafim Batzoglou
- BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs
Julie D. Thompson, Frédéric Plewniak and Olivier Poch