

Basic Concept of Multiple Sequence Alignment



Multiple Sequence Alignment (MSA) is a basic step in the phylogenetic analysis of organisms. In an MSA, all the sequences under study are aligned together on the basis of the similar regions within them. The major goal of MSA is to identify the alignment that maximizes the similarity of the sequences, here protein sequences. This is done by seeking an alignment that “maximizes the sum of similarities for all pairs of sequences”, a quantity called the ‘sum-of-pairs’ or ‘SP’ score. The SP score is the basis of many alignment algorithms.
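To make the SP score concrete, here is a minimal Python sketch. The scoring values (match = 1, mismatch = -1, gap = -2) and the function names are assumptions chosen for illustration; real programs use substitution matrices such as BLOSUM and more elaborate gap penalties.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two characters taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # a gap facing a gap carries no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the column-by-column scores over every pair of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(a, b, match, mismatch, gap)
        for row_x, row_y in combinations(alignment, 2)
        for a, b in zip(row_x, row_y)
    )


# Two candidate alignments of the same three toy sequences (ACT, ACGT, AGT).
candidate_1 = ["AC-T", "ACGT", "A-GT"]
candidate_2 = ["ACT-", "ACGT", "AGT-"]
print(sp_score(candidate_1), sp_score(candidate_2))
```

Under these assumed scores the first candidate gets the higher SP score, so an SP-based method would prefer it.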

The most widely used approach for constructing an MSA is “progressive alignment”, in which a set of n proteins is aligned by performing n-1 pairwise alignments of pairs of proteins or pairs of intermediate alignments, guided by a phylogenetic tree (the guide tree) connecting the sequences. A methodology that has been used successfully to improve on progressive alignment based on the SP score is “consistency-based scoring”, which takes into account how well the pairwise alignments agree with one another. For example, given three sequences A, B, and C, the pairwise alignments A-B and B-C imply an alignment of A-C, which may differ from the directly computed A-C alignment.
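The A, B, C example can be sketched in Python. This is only an illustration of the idea of an implied alignment, not the scoring scheme of any particular tool; the toy sequences and the function names are assumptions.

```python
def aligned_pairs(row_x, row_y):
    """Return the set of (i, j) residue index pairs matched in a pairwise alignment."""
    pairs = set()
    i = j = 0  # residue counters for the two sequences (gaps are not counted)
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def implied_pairs(pairs_ab, pairs_bc):
    """Compose A-B and B-C matches through B to get the implied A-C matches."""
    b_to_c = dict(pairs_bc)
    return {(a, b_to_c[b]) for a, b in pairs_ab if b in b_to_c}


# Toy sequences: A = ACGT, B = AGT, C = ACT
pairs_ab = aligned_pairs("ACGT", "A-GT")
pairs_bc = aligned_pairs("AGT", "ACT")
direct_ac = aligned_pairs("ACGT", "AC-T")

implied_ac = implied_pairs(pairs_ab, pairs_bc)
consistent = implied_ac & direct_ac  # matches supported by both routes
print(sorted(implied_ac), sorted(direct_ac), sorted(consistent))
```

Here the alignment implied through B matches the G of A to the C of sequence C, while the direct alignment matches C to C; only the matches that both routes agree on are consistent.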

This raises two questions: how much can we rely on the MSA obtained, and how is an MSA validated?

The validation of an MSA program typically uses a benchmark data set of reference alignments. An MSA produced by the program is compared with the corresponding reference alignment, which gives an accuracy score.
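The text does not say which accuracy score a given benchmark uses. As one plausible sketch, assume the score is the fraction of residue pairs aligned in the reference that the program's alignment also aligns; the function names and toy alignments below are hypothetical.

```python
from itertools import combinations


def residue_pairs(alignment):
    """Return all aligned residue pairs as ((seq_x, pos_x), (seq_y, pos_y))."""
    index_rows = []
    for row in alignment:
        counter = 0
        indices = []
        for ch in row:
            if ch == "-":
                indices.append(None)  # gap: no residue in this column
            else:
                indices.append(counter)
                counter += 1
        index_rows.append(indices)

    pairs = set()
    for (x, row_x), (y, row_y) in combinations(enumerate(index_rows), 2):
        for i, j in zip(row_x, row_y):
            if i is not None and j is not None:
                pairs.add(((x, i), (y, j)))
    return pairs


def reference_recovery(test, reference):
    """Fraction of the reference's aligned residue pairs found in the test MSA."""
    ref = residue_pairs(reference)
    if not ref:
        return 0.0
    return len(residue_pairs(test) & ref) / len(ref)


reference = ["ACGT", "A-GT", "AC-T"]  # hand-made "reference" for illustration
test = ["ACGT", "AG-T", "AC-T"]       # same sequences, aligned differently
print(reference_recovery(test, reference))
```

Both alignments must contain the same sequences in the same order; only the gap placement may differ.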

Before 2004, the standard benchmark was BAliBASE (Benchmark Alignment dataBASE), a database of manually refined, high-quality, documented MSAs designed to identify the strong and weak points of the numerous alignment programs now available.

More recently, several new benchmarks have been made available, namely OXBENCH, PREFAB, SABmark, IRMBASE, and a new, extended version of BAliBASE.

Another measure considered basic in assessing most alignment programs is the fM score. It is used to assess the specificity of an alignment tool: it is the proportion of predicted matched residues that also appear in the reference alignment. Often some regions of the sequences are alignable and some are not; however, there are usually also intermediate cases, where sequence and structure have diverged to the point at which homology is not reliably detectable. In such cases the fM score provides, at best, a noisy assessment of alignment tool specificity, one that becomes increasingly less reliable as one considers sequences of increasing structural divergence.
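Following the definition above, the fM score can be sketched in Python as the share of predicted residue matches that the reference confirms. The helper names and the toy alignments are assumptions for illustration only.

```python
from itertools import combinations


def matched_pairs(alignment):
    """Collect every residue pair placed in the same column of an MSA."""
    columns_to_residues = []
    for row in alignment:
        seen = 0
        positions = []
        for ch in row:
            if ch == "-":
                positions.append(None)
            else:
                positions.append(seen)
                seen += 1
        columns_to_residues.append(positions)

    pairs = set()
    for (x, px), (y, py) in combinations(enumerate(columns_to_residues), 2):
        pairs.update(
            ((x, i), (y, j))
            for i, j in zip(px, py)
            if i is not None and j is not None
        )
    return pairs


def fm_score(predicted, reference):
    """Proportion of predicted matches that also appear in the reference."""
    predicted_pairs = matched_pairs(predicted)
    if not predicted_pairs:
        return 0.0
    return len(predicted_pairs & matched_pairs(reference)) / len(predicted_pairs)


reference = ["ACGT", "A-GT", "AC-T"]  # hand-made "reference" for illustration
predicted = ["ACGT", "AG-T", "AC-T"]  # same sequences, different gap placement
print(fm_score(predicted, reference))
```

Note the denominator: the score is normalized by the number of predicted matches, not by the number of reference matches, which is why it measures specificity rather than coverage.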

However, even when reference alignments are used, the accuracy of the results remains questionable, because the reference alignments themselves are of varying quality.



  • Multiple sequence alignment

Robert C. Edgar and Serafim Batzoglou

  • BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs

Julie D. Thompson, Frédéric Plewniak and Olivier Poch

Dr. Muniba is a bioinformatician based in New Delhi, India. She completed her PhD in Bioinformatics at South China University of Technology, Guangzhou, China. She has cutting-edge knowledge of bioinformatics tools, algorithms, and drug design. When she is not reading, she enjoys spending time with her family.



  1. Fozail Ahmad

    October 19, 2015 at 5:29 pm

    The wealth of existing methods and their similar, improved accuracy has made it difficult to select one tool over the others.
    While discussing the methods, it is worth mentioning that tools like M-Coffee and T-Coffee should be explained objectively, so that readers can understand the core algorithm behind them.

  2. Muniba Faiza

    October 19, 2015 at 6:02 pm

    Thanks for your concern, Sir. Actually, I didn’t mention the tools because I wanted to present the basic idea of MSA as simply as possible; otherwise I could have included benchmark-tested tools such as MUSCLE, MAFFT, T-COFFEE, etc.
    The tools’ algorithms will be explained in the next article on MSA.

  3. prashant

    October 19, 2015 at 7:48 pm

    Is sequence alignment done only to find regions of similarity, or also to find out how much the sequences differ, i.e., the regions of dissimilarity? If yes, why? If no, why?

    • Sanjay_infection_biologist

      October 20, 2015 at 10:09 am

      Dear Mr Prashant,

      It depends entirely on the study for which you are doing the analysis. For example, if you are analyzing closely related species, you already know that the sequences are going to be mostly the same, and you would be interested in the regions of dissimilarity in order to trace the pattern of their evolution, e.g. the hemoglobin of man, monkey, and chimpanzee. In the other case, where the species are distantly related to each other, or have similar functions but different structures, you would be interested in finding the regions of similarity between them or their active regions (catalytic domains), e.g. plant hemoglobin and bacterial hemoglobin.
      Besides this, if we can relate the sequences, we can also predict the structure of some proteins, because of the principle “Sequence decides structure and structure decides function”.
      for reference you may read:

      for hemoglobin:

      for MSA:;2-Q/pdf

      For further queries, you may contact me on the links provided.

      Best regards,

    • Muniba Faiza

      October 20, 2015 at 1:15 pm

      Generally, sequence alignment is done to find the similarity between organisms, but yes, we can also find the dissimilarity when we just want to study the differences among species, or to calculate how much the species differ in order to study variation during evolution or for other phylogenetic analysis. We can find dissimilar sequences with the help of Discontiguous Megablast (a kind of Megablast) and then simply align all of them using MUSCLE, CLUSTAL W, etc.
