Basic Concept of Multiple Sequence Alignment


Multiple Sequence Alignment (MSA) is a basic step in the phylogenetic analysis of organisms. In an MSA, all the sequences under study are aligned together on the basis of the similar regions within them. The main goal is to identify the alignment that maximizes the similarity among the sequences. This is commonly done by seeking the alignment that “maximizes the sum of similarities for all pairs of sequences”, a quantity called the ‘sum-of-pairs’ or SP score. The SP score is the basis of many alignment algorithms.
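As a rough illustration of the SP score, the sketch below sums a pairwise score over every column and every pair of rows of a small alignment. The scoring values (`MATCH`, `MISMATCH`, `GAP`) and the function names are assumptions chosen for this example; real programs use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations

# Toy scoring scheme: an assumption for illustration, not a real substitution matrix.
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(a, b):
    """Score one aligned pair of symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # gap-gap pairs contribute nothing
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def sp_score(alignment):
    """Sum the pairwise scores over every column and every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total

msa = ["ACGT-", "ACGTA", "A-GTA"]
print(sp_score(msa))
```

An alignment program that optimizes the SP score would, in effect, search for the placement of gaps that makes this total as large as possible.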

The most widely used approach for constructing an MSA is “progressive alignment”, in which a set of n proteins is aligned by performing n-1 pairwise alignments of pairs of proteins or pairs of intermediate alignments, guided by a phylogenetic tree connecting the sequences. An approach that has been used successfully to improve on progressive alignment based on the SP score is “consistency-based scoring”, in which the score for aligning two residues also takes into account how consistent that match is with the alignments involving the other sequences. For example, given three sequences A, B and C, the pairwise alignments A-B and B-C imply an alignment of A-C, which may differ from the directly computed A-C alignment.
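The three-sequence example can be made concrete with a small sketch. It assumes each pairwise alignment is given as two gapped strings of equal length, and the helper names `aligned_pairs` and `implied_pairs` are hypothetical. The sketch only shows how A-B and B-C imply an A-C alignment and where that differs from a direct one; it is not a consistency-based aligner.

```python
def aligned_pairs(row_x, row_y):
    """Return {(i, j)}: residue i of x is aligned to residue j of y (0-based)."""
    pairs, i, j = set(), 0, 0
    for x, y in zip(row_x, row_y):
        if x != "-" and y != "-":
            pairs.add((i, j))
        i += (x != "-")  # advance only when a residue was consumed
        j += (y != "-")
    return pairs

def implied_pairs(ab, bc):
    """Compose A-B and B-C residue pairs into the A-C pairs they imply."""
    b_to_c = dict(bc)  # each residue of B aligns to at most one residue of C
    return {(a, b_to_c[b]) for a, b in ab if b in b_to_c}

# Toy sequences: A = ACGT, B = AGT, C = ACT
ab = aligned_pairs("ACGT", "A-GT")      # pairwise alignment A-B
bc = aligned_pairs("AGT", "ACT")        # pairwise alignment B-C
direct = aligned_pairs("ACGT", "AC-T")  # directly computed A-C

implied = implied_pairs(ab, bc)
print(sorted(implied))           # A-C pairs implied through B
print(sorted(direct))            # A-C pairs from the direct alignment
print(sorted(implied & direct))  # pairs on which both agree
```

Here the route through B aligns the G of A with the C of C, while the direct alignment pairs the two C residues. Consistency-based methods use this kind of agreement or disagreement as evidence when scoring residue matches.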

This raises two questions: how much can we rely on the resulting MSA, and how is an MSA validated?

Validation of an MSA program typically uses a benchmark data set of reference alignments. Each alignment produced by the program is compared with the corresponding reference alignment, and the comparison yields an accuracy score.
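The text does not say which accuracy score is meant. As one simple possibility, the sketch below computes a column score: the fraction of reference columns that the program's alignment reproduces exactly. The function names and the toy alignments are assumptions for illustration, and both alignments are assumed to contain the same sequences in the same row order.

```python
def columns(alignment):
    """Describe each column by the residue index each row contributes (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for column in zip(*alignment):
        entry = []
        for row, symbol in enumerate(column):
            if symbol == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        cols.append(tuple(entry))
    return cols

def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    test_cols = set(columns(test))
    ref_cols = columns(reference)
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)

reference = ["ACGT", "A-GT", "AC-T"]  # hypothetical reference alignment
test = ["ACGT", "AG-T", "AC-T"]       # hypothetical program output

print(column_score(test, reference))
```

In this toy case only the first and last reference columns are reproduced, so the score is 0.5. Comparing residue indices rather than characters means that a column counts as correct only if it aligns the very same residues.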

Before 2004, the standard benchmark was BAliBASE (Benchmark Alignment dataBASE), a database of manually refined, high-quality, documented MSAs designed to identify the strong and weak points of the numerous alignment programs available.

More recently, several new benchmarks have become available, namely OXBENCH, PREFAB, SABmark, IRMBASE and a new, extended version of BAliBASE.

Another measure used in assessing alignment programs is the fM score. It assesses the specificity of an alignment tool: it is the proportion of matched residue pairs predicted by the tool that also appear in the reference alignment. Often some regions of the sequences are alignable and others are not, but there are usually also intermediate cases in which sequence and structure have diverged to the point where homology is not reliably detectable. In such cases the fM score provides, at best, a noisy assessment of a tool’s specificity, and it becomes increasingly unreliable as the sequences considered become more structurally divergent.
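A minimal sketch of the fM score as described here: collect every pair of residues that the predicted alignment places in the same column, and report the proportion of those pairs that the reference alignment also contains. The function names and the toy alignments are assumptions, and both alignments are assumed to list the same sequences in the same row order.

```python
from itertools import combinations

def residue_pairs(alignment):
    """Return all aligned residue pairs as ((row_a, idx_a), (row_b, idx_b))."""
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        # every two residues sharing a column form one matched pair
        pairs.update(combinations(present, 2))
    return pairs

def fm_score(predicted, reference):
    """Proportion of predicted aligned pairs that also occur in the reference."""
    pred = residue_pairs(predicted)
    if not pred:
        return 0.0
    return len(pred & residue_pairs(reference)) / len(pred)

reference = ["ACGT", "A-GT", "AC-T"]  # hypothetical reference alignment
predicted = ["ACGT", "AG-T", "AC-T"]  # hypothetical program output

print(round(fm_score(predicted, reference), 3))
```

The denominator is the number of predicted pairs, which is what makes this a specificity-style measure. It also shows why the score is noisy for divergent sequences: a predicted pair that is absent from the reference is counted as wrong even if the reference itself is uncertain in that region.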

Even when reference alignments are used, however, the accuracy results remain open to question, because the reference alignments themselves are of varying quality.

 

REFERENCES:

  • Multiple sequence alignment

Robert C. Edgar and Serafim Batzoglou

  • BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs

Julie D. Thompson, Frédéric Plewniak and Olivier Poch


Muniba is a bioinformatician based at the South China University of Technology. She has cutting-edge knowledge of bioinformatics tools, algorithms, and drug design. When she is not reading, she can be found enjoying time with her family.

5 Comments

  1. The wealth of existing methods and their similar, improved accuracy have made it difficult to select one tool over the others.
    While discussing the methods, it is worth mentioning that tools like M-Coffee and T-Coffee should be described objectively, so that readers can understand the basic algorithm at their core.

  2. Thanks for your concern, Sir. I didn’t mention specific tools because I wanted to present the basic idea of MSA as simply as possible; otherwise I could have included benchmark-tested tools such as MUSCLE, MAFFT, T-COFFEE, etc.
    The algorithms behind these tools will be explained in the next article on MSA.

  3. Is sequence alignment done only to find regions of similarity, or also to find how much the sequences differ, i.e. the regions of dissimilarity? If yes, why? If no, why not?
