In my earlier articles I discussed two multiple sequence alignment (MSA) tools, MUSCLE and T-COFFEE. In this article, we will compare different aspects of these tools and consider which one is preferable. Both MUSCLE and T-COFFEE are multiple sequence alignment tools that also help in studying the evolutionary relationships among species.
T-Coffee is a multiple sequence alignment tool whose name stands for Tree-based Consistency Objective Function for alignment Evaluation. It is a progressive alignment method that considers information from all of the sequences at each alignment step and combines the best properties of local and global alignment; for the local part it relies on a Smith-Waterman-style algorithm. T-Coffee is regarded as an advancement over purely progressive alignment tools such as ClustalW; MUSCLE was discussed in an earlier article.
It has two main features. First, it builds the multiple alignment from various data sources, collected in a library of pairwise alignments (global + local). Second, its optimization method finds the multiple alignment that best fits the input library.
Fig.1 Layout of the T-Coffee strategy; the main steps required to compute a multiple sequence alignment using the T-Coffee method. Square blocks designate procedures while rounded blocks indicate data structures.
How does T-Coffee work?
- Generate Primary library of alignments:
The primary library consists of a set of pairwise alignments between all of the sequences to be aligned. It may include two or more different alignments of the same pair of sequences: here, one alignment source is local (Lalign) and the other is global (ClustalW).
- Derive primary library weights:
In this step, a weighting scheme is used to identify the most reliable residue pairs: a weight is assigned to each pair of aligned residues in the library. Sequence identity is the criterion used, since it is a reasonable indicator of accuracy for sequences with more than 30% identity. For each set of sequences, two libraries are constructed along with their weights, one using ClustalW and the other using Lalign (a program of the FASTA package).
- Combine Libraries:
In this step, the two libraries are merged. A residue pair that is duplicated between them is merged into a single entry whose weight is the sum of the two weights; otherwise, a new entry is created for the pair being considered.
- Extend library:
A triplet approach based on the intermediate-sequence method is used. For example, given four sequences A, B, C and D, the alignment of A and B is examined not only directly but also through C and through D, to check how consistent it is with the rest of the library.
- Progressive alignment strategy:
In this alignment strategy, a distance matrix is constructed from the pairwise alignments between all the sequences, and a guide tree is built from it using the Neighbor Joining (NJ) method. The two closest sequences on the tree are aligned first and the resulting pair is checked for gaps; the next closest two sequences are then aligned, or a sequence is added to the existing alignment. This continues until all the sequences have been aligned.
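The weighting and combining steps above can be sketched in a few lines of Python. This is only a minimal illustration, not T-Coffee's actual code: the function names, the dictionary layout and the two toy alignments of sequences "A" and "B" are all assumptions made for the example. Each aligned residue pair receives the percent identity of the pairwise alignment it came from, and a pair seen in both the global and the local alignment ends up with the sum of the two weights.

```python
from collections import defaultdict


def percent_identity(row_a, row_b):
    """Percent identity over the columns where both rows hold a residue."""
    matched = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not matched:
        return 0.0
    same = sum(1 for x, y in matched if x == y)
    return 100.0 * same / len(matched)


def residue_pairs(row_a, row_b, start_a=0, start_b=0):
    """Yield (i, j) residue indices that share a column.

    start_a / start_b are the offsets of a local alignment inside the
    full sequences (0 for a global alignment).
    """
    i, j = start_a, start_b
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            yield i, j
        if x != "-":
            i += 1
        if y != "-":
            j += 1


def add_alignment(library, name_a, row_a, name_b, row_b, start_a=0, start_b=0):
    """Add one pairwise alignment; duplicated residue pairs sum their weights."""
    weight = percent_identity(row_a, row_b)
    for i, j in residue_pairs(row_a, row_b, start_a, start_b):
        library[(name_a, i, name_b, j)] += weight


library = defaultdict(float)
# Hypothetical global alignment (the ClustalW-like source): 6 of 7 pairs identical.
add_alignment(library, "A", "GARFIELD", "B", "GARFI-LT")
# Hypothetical local alignment (the Lalign-like source): 5 of 5 pairs identical.
add_alignment(library, "A", "GARFI", "B", "GARFI")

for key in sorted(library):
    print(key, round(library[key], 1))
```

Residue pairs found by both sources (the shared GARFI stretch) carry the summed weight, while pairs seen only in the global alignment keep its single weight.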
Fig.2 The library extension. (a) Progressive alignment. Four sequences have been designed. The tree indicates the order in which the sequences are aligned when using a progressive method such as ClustalW. The resulting alignment is shown, with the word CAT misaligned. (b) Primary library. Each pair of sequences is aligned using ClustalW. In these alignments, each pair of aligned residues is associated with a weight equal to the average identity among matched residues within the complete alignment (mismatches are indicated in bold type). (c) Library extension for a pair of sequences. The three possible alignments of sequence A and B are shown (A and B, A and B through C, A and B through D). These alignments are combined, as explained in the text, to produce the position-specific library. This library is resolved by dynamic programming to give the correct alignment. The thickness of the lines indicates the strength of the weight.
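The library extension can also be sketched in Python. The sketch assumes the rule described in the original T-Coffee paper: when residues of A and B are both aligned to the same residue of an intermediate sequence, that triplet contributes the minimum of the two primary weights, and the contribution is added to the direct A-B weight. The function names and the three toy weights below are illustrative assumptions, not T-Coffee's own data structures.

```python
from collections import defaultdict


def canonical(res1, res2):
    """Return a library key with the two (sequence, position) residues sorted."""
    a, b = sorted([res1, res2])
    return (a[0], a[1], b[0], b[1])


def extend_library(library):
    """Add triplet support (minimum of the two link weights) to direct weights."""
    # neighbours[residue] maps every residue aligned to it onto the link weight.
    neighbours = defaultdict(dict)
    extended = defaultdict(float)
    for (seq_a, i, seq_b, j), weight in library.items():
        neighbours[(seq_a, i)][(seq_b, j)] = weight
        neighbours[(seq_b, j)][(seq_a, i)] = weight
        extended[canonical((seq_a, i), (seq_b, j))] += weight  # direct weight

    # Each residue acts in turn as the intermediate of a triplet.
    for middle, partners in neighbours.items():
        items = sorted(partners.items())
        for idx, (left, w_left) in enumerate(items):
            for right, w_right in items[idx + 1:]:
                if left[0] == right[0]:
                    continue  # both partners lie in the same sequence
                extended[canonical(left, right)] += min(w_left, w_right)
    return dict(extended)


# Illustrative primary weights for one residue of each of A, B and C.
primary = {
    ("A", 0, "B", 0): 88.0,
    ("A", 0, "C", 0): 77.0,
    ("B", 0, "C", 0): 100.0,
}
extended = extend_library(primary)
print(extended[("A", 0, "B", 0)])  # 88 direct + min(77, 100) through C = 165.0
```

A pair that is supported by many intermediate sequences gains weight, while a pair that is inconsistent with the rest of the library keeps only its direct weight.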
An exhaustive list of references for this article is available from the author on personal request; for more details write to [email protected]
HyPhy, an acronym for Hypothesis Testing Using Phylogenies (www.hyphy.org), was designed and written by Kosakovsky Pond and co-workers to provide likelihood-based analyses of molecular evolutionary data sets and to help detect differential rates of variability within coding sequence datasets. It is freely available, has a graphical user interface and can be used by anyone, with or without much programming exposure.
It was earlier presumed that substitution rates were uniform over an alignment of homologous DNA/protein sequences, but many workers studying the molecular evolutionary processes that influence rates and patterns of evolution have refuted this presumption with a large amount of data. This is especially true for rapidly evolving gene families and for viral genomes. Natural selection acts differently on different domains, regions and sites, which may be under positive, negative or neutral selection pressure.

Positive selection is indicated by an excess of non-synonymous substitutions in a protein coding sequence, that is, changes that alter the protein and confer a fitness advantage on the organism. Negative selection is indicated by an excess of synonymous substitutions, which leave the amino acid sequence, and hence protein structure and function, unchanged, because non-synonymous changes are removed. Neutral evolution is said to take place when non-synonymous substitutions do not affect fitness and therefore accumulate at the same rate as synonymous ones.

The rates of synonymous and non-synonymous substitutions are given by dS and dN respectively. In the case of neutral evolution, dS and dN are observed to be in equilibrium. Accordingly, the ratio ω = β/α = dN/dS, where β and α denote the non-synonymous and synonymous rates, has become a standard measure of selective pressure. The total ω for a sequence alignment is referred to as global ω. A global ω of approximately 1 signifies neutral evolution, a value below 1 suggests negative selection, and a value above 1 implies positive selection.

To start the analyses, all one needs is a suitable substitution model, as selected by the MODELTEST program (available online), a NEXUS-formatted sequence alignment file (it must be a codon data file) and a Maximum Likelihood tree of the data.
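The idea behind ω can be illustrated with a short Python sketch. It is deliberately crude and is not how HyPhy estimates dN and dS (HyPhy fits codon models by maximum likelihood): the sketch only counts which codon differences between two aligned coding sequences change the amino acid, and interprets a given dN/dS ratio using the thresholds described above. The function names, the toy sequences and the `tolerance` around 1 are assumptions made for the example.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG-ordered table.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_codon_changes(seq1, seq2):
    """Count differing codons as (synonymous, non_synonymous).

    Naive counting only: codons with gaps or ambiguous bases are skipped,
    and multiple hits or pathways between codons are ignored.
    """
    synonymous = non_synonymous = 0
    for pos in range(0, min(len(seq1), len(seq2)) - 2, 3):
        codon1 = seq1[pos:pos + 3].upper()
        codon2 = seq2[pos:pos + 3].upper()
        if codon1 == codon2:
            continue
        if codon1 not in CODON_TABLE or codon2 not in CODON_TABLE:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


def interpret_omega(dn, ds, tolerance=0.05):
    """Return (omega, label) for given rates; tolerance is an arbitrary choice."""
    if ds <= 0:
        raise ValueError("dS must be positive to compute dN/dS")
    omega = dn / ds
    if abs(omega - 1.0) <= tolerance:
        label = "neutral evolution"
    elif omega < 1.0:
        label = "negative (purifying) selection"
    else:
        label = "positive selection"
    return omega, label


# Toy coding sequences: AAA->AAG keeps lysine, CTG->CGG changes leucine to arginine.
print(count_codon_changes("ATGAAACTG", "ATGAAGCGG"))
print(interpret_omega(dn=0.2, ds=0.8))
```

In a real analysis the decision is not made with a fixed tolerance; statistical tests on fitted codon models are used to decide whether ω differs significantly from 1.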
Datamonkey is a web interface (http://www.datamonkey.org) which uses HyPhy batch files to execute most of its tools and packages for computational analyses. It can be used for estimating dS and dN over an alignment of coding sequences and also for identifying codons and lineages under selection. It provides “state of the art” tests based on codon models to infer signatures of positive Darwinian selection by comparing rates of synonymous (dS) versus non-synonymous (dN) substitutions, even in the presence of recombination. In practice it reports ω (= dN/dS) under a variety of evolutionary models. Apart from this, Datamonkey also offers a number of packages such as GARD, SLAC, REL, FEL, EVOBLAST etc. These will be discussed in the next issue. Keep reading!!
A comprehensive list of references for this article is available upon request to the author ([email protected])