Multiple Sequence Alignment (MSA) is a very basic step in the phylogeny analysis of organisms. In MSA, all the sequences under study are aligned together pairwise on the basis of similar regions with in them. The major goal of MSA pairwise alignment is to identify the alignment that maximizes the protein sequence similarity. This is done by seeking an alignment that “maximizes the sum of similarities for all the pair of sequences”, which is called as the ‘Sum-of-scores or SP Score’. The SP Score is the basic of many alignment algorithms.
The most widely used approach for constructing MSA is “Progressive Alignment”, where a set of n proteins are aligned by performing n-1 pairwise alignments of pairs of proteins or pairs of intermediate alignments guided by a phylogeny tree connecting the sequences. A methodology that has been successfully used as an improvement of progressive alignment based on the SP Score is “Consistency-based Scoring”,where the alignment is consistently dependent on the previously obtained alignment, for example, we have 3 sequences namely, A,B, & C ,the pairwise alignment A-B, B-C imply an alignment of A-C which may be different from the directly computed A to C alignment.
Now, the question arises that how much can we rely on the obtained MSA? and how an MSA is validated?
The validation of MSA program typically uses a benchmark data set of reference alignments. An MSA produced by the program is compared with the corresponding reference alignment which gives an accuracy score.
Before 2004, the standard benchmark was BAliBASE ( Benchmark Alignment dataBASE) , a database of manually refined MSAs consisting of high quality documented alignments to identify the strong and weak points of the numerous alignment programs now available.
“Recently, several new benchmark are made available, namely, OXBENCH, PREFAB, SABmark, IRMBASE and a new extended version of BAliBASE.”
Another parameter which is considered as basic in most of the alignment programs is fM Score. It is used to assess the specificity of an alignment tool and identifies the proportion of matched residues predicted that also appears in the reference alignment. Many of the times, it is encountered that some regions of the sequences are alignable and some are not, however, there are usually also intermediate cases , where sequence and structure have been diverged to a point at which homology is not reliably detectable.In such a case, the fM Score , at best, provides a noisy assessment of alignment tool specificity, that becomes increasingly less reliable as one considers sequences of increasing structural divergence.
However, after considering the reference alignments, the accuracy of results is still questionable as the reference alignments generated are of varying quality.
REFERENCES:
- Multiple sequence alignment
Robert C Edgar1 and Serafim Batzoglou2
- BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs
Julie D. Thompson, Frédréric Plewniak and Olivier Poch
The wealth of existing methods and their improved similar accuracy has made selection of one tool over the others.
While discussing the methods, it is worth mentioning that tools like M-Coffee & T-Coffee should be objectively elaborated, so that one can come to know the basic algorithm at Kernel.
Thanks for your concern Sir. Actually I didn’t mention about tools because I wanted to represent the basic idea of MSA as simple as possible, otherwise I could have include about the benchmark test approved tools like MUSCLE, MAFFT, T-COFFEE, etc.
The tools algorithm will be explained in next article regarding MSA.
Is sequence Alignment is done only to find out region of similarity only or also to find out how much the sequences get differed i.e the region of dissimilarity? If yes , why? If No, why?
Dear Mr Prashant,
It totally depends upon the case of study on which you are doing the analysis. For example, imagine a case where you are considering the closely related species for your analysis, then of course you already know that the sequences are going to be mostly same and you would be interested in the regions of dissimilarity/difference in order to get the pattern of their evolution e.g. Hemoglobin of Man, Monkey and Chimpanzee. While, the other case where species are distantly related to each other or are having just similar kind of functions but different structures and then of course you must be interested in finding out the region of similarity between them or there active regions (catalytic domain) e.g. plant hemoglobin, bacterial hemoglobin and bacterial hemoglobin.
Besides this, if we can relate the sequences, then we can also predict the structure of some proteins because of the principle “Sequence decides structure and structure decides function”.
for reference you may read:
for hemoglobin: http://www.bioquest.org/summer2006/The_Evolution_of_Hemoglobin.pdf
for MSA:
http://statweb.stanford.edu/~nzhang/345_web/sequence_slides3.pdf
http://onlinelibrary.wiley.com/doi/10.1002/1097-0134(20000815)40:3%3C502::AID-PROT170%3E3.0.CO;2-Q/pdf
http://epubs.siam.org/doi/pdf/10.1137/0148063
http://epubs.siam.org/doi/abs/10.1137/0148063
http://www.pnas.org/content/86/12/4412.full.pdf
For further queries, you may contact me on the links provided.
Thanks,
Best regards,
Sanjay
Generally sequence alignment is done to find out the similarity between the organisms, but yes we can also find out the dissimilarity in the scenario where we just want to study the differences among the species or to calculate how much the species differ to study variation during evolution or other phylogeny analysis. We can find out the dissimilar sequences with the help of Discontiguous Megablast (a kind of Megablast) and then we can simply align all of them using MUSCLE, CLUSTAL W, etc.