A practical guide to selection analyses of coding sequence datasets and their intricacies

This is the second article in the popular series “Do you HYPHY…”, recently published online in BiR. In this issue, I would like to take substitution/selection analyses and their intricacies to the next level. As previously mentioned, HyPhy offers a collection of programs such as GARD (Genetic Algorithm for Recombination Detection), SLAC (Single Likelihood Ancestor Counting), REL (Random Effects Likelihood), and FEL (Fixed Effects Likelihood). Together, these programs support a holistic selection analysis that shows whether there are signatures of positive selection and, if so, which sites are under positive selection. To start with, one needs a coding sequence dataset. An intron-free coding sequence dataset is the minimum requirement for these programs; without it, the analyses will simply not work. For this reason, HyPhy cannot be used for selection analyses of non-coding sequences such as rRNA or repetitive DNA, or of molecular-marker-based sequences (Sequence Tagged Sites, STS) that contain composites of non-coding and coding regions.
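Before submitting anything to HyPhy, it can help to run a quick sanity check on each sequence. The sketch below is not part of HyPhy; it is a minimal illustration that assumes unaligned nucleotide sequences, the standard genetic code, and a hypothetical helper name `coding_problems`. It flags sequences whose length is not a multiple of three or that contain an internal stop codon, two common signs that a sequence is not a clean coding sequence.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_problems(seq):
    """Return reasons why `seq` does not look like a clean coding sequence.

    Hypothetical helper for illustration; an empty list means no problem
    was found by these two simple checks.
    """
    seq = seq.upper()
    problems = []
    remainder = len(seq) % 3
    if remainder != 0:
        problems.append("length is not a multiple of 3")
    # Only complete codons are inspected.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - remainder, 3)]
    # A stop codon is tolerated only in the final position.
    for index, codon in enumerate(codons[:-1]):
        if CODON_TABLE.get(codon) == "*":
            problems.append(f"internal stop codon {codon} at codon {index + 1}")
    return problems


if __name__ == "__main__":
    for name, seq in {"ok": "ATGGCCTAA", "bad": "ATGTAAGCC"}.items():
        print(name, coding_problems(seq))
```

A sequence that passes these checks is not guaranteed to be a genuine coding sequence, but one that fails them is very unlikely to work in a codon-based analysis.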

Prior to selection analyses, the first thing to look for is potential chimeras in the sequences. Chimeric sequences are artifacts generated during amplification and sequencing. They are highly undesirable: they can cause drastic anomalies in selection analyses and skew the results towards an extreme (unreal) signal of positive selection. To reveal potential recombinants, HyPhy offers GARD, which uses statistical analyses to identify the sites (essentially bases) at which breakpoints have occurred, creating chimeras or putative recombinants. Apart from GARD, one can also refer to RDP (Recombination Detection Program) by Darren Martin (2000). The two programs differ somewhat in how they deliver results but essentially do the same thing: detect recombinants. The advantage of using GARD at the Datamonkey server is that it allows the sequence dataset to be partitioned into clean non-recombinant partitions, setting aside the possible recombinant/chimeric regions and thus preventing overestimation.
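The idea of partitioning an alignment at breakpoints can be sketched in a few lines. The code below does not reproduce GARD or its output format; it assumes the breakpoints are already known (here a made-up list of 0-based column positions) and uses a hypothetical function name, `split_at_breakpoints`, to cut every aligned sequence into the corresponding partitions.

```python
def split_at_breakpoints(alignment, breakpoints):
    """Split an alignment into partitions at the given breakpoints.

    alignment   -- dict mapping sequence name to aligned sequence
    breakpoints -- 0-based column indices at which a new partition starts
                   (illustrative input; not GARD's native output format)
    Returns a list of alignments (dicts), one per partition.
    """
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("sequences must be aligned to the same length")
    # Partition edges: start of alignment, each breakpoint, end of alignment.
    edges = [0] + sorted(breakpoints) + [length]
    return [
        {name: seq[start:end] for name, seq in alignment.items()}
        for start, end in zip(edges, edges[1:])
    ]


if __name__ == "__main__":
    toy = {"seq1": "ATGGCCAAATTT", "seq2": "ATGGCGAAGTTT"}
    for number, part in enumerate(split_at_breakpoints(toy, [6]), start=1):
        print("partition", number, part)
```

For codon-based analyses, breakpoints that fall on a multiple of three keep every partition in frame; the sketch leaves that choice to the caller.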

Once the recombinants have been identified and excluded from the dataset, the analysis can proceed. To identify signatures of positive selection, three methods are employed: SLAC, REL, and FEL. The likely reason for using three methods rather than one is to obtain a more rigorous estimate of positive selection and to guard against false positives (overestimation). SLAC is basically a “counting method” that employs either a single most likely ancestral reconstruction, weighting across all possible ancestral reconstructions, or sampling from ancestral reconstructions. REL models variation in non-synonymous (nucleotide substitutions that change the encoded amino acid) and synonymous (nucleotide substitutions that leave the encoded amino acid unchanged) rates across sites according to a predefined distribution. The selection pressure at an individual site is then inferred from this distribution using an empirical Bayes approach. The major demerit of the REL method is that it can suffer from high false-positive rates. FEL is the most robust method of the three: the estimation is done site by site, comparing the rate of non-synonymous substitution (given by ‘dN’, also referred to as β) with the rate of synonymous substitution (given by ‘dS’, also referred to as α). In this manner, a site-by-site dN/dS analysis identifies the codon sites under positive selection (α<β). Essentially, a site is said to be evolving neutrally when dN/dS=1, a value greater than 1 indicates positive selection, and a value less than 1 indicates negative selection.
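The distinction between synonymous and non-synonymous substitutions, on which dN and dS rest, can be illustrated with a small sketch. It assumes the standard genetic code and uses a hypothetical function name, `classify_change`; it only labels a pair of codons and does not estimate any rates, which is the job of SLAC, REL, and FEL.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_a, codon_b):
    """Label the difference between two codons.

    Returns "identical", "synonymous" (same amino acid encoded) or
    "non-synonymous" (different amino acid encoded).
    """
    a, b = codon_a.upper(), codon_b.upper()
    if a == b:
        return "identical"
    if CODON_TABLE[a] == CODON_TABLE[b]:
        return "synonymous"
    return "non-synonymous"


if __name__ == "__main__":
    print(classify_change("CTT", "CTC"))  # both codons encode leucine
    print(classify_change("CTT", "CCT"))  # leucine versus proline
```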
FEL is considered the most accurate and precise of the three methods. Unlike SLAC, which is based on a modified Suzuki and Gojobori counting method, FEL estimates dN and dS independently at each codon site and makes no a priori assumption about the rate distribution across sites. Uncertainty in the local α and β estimates is handled by a p-value computed for every site, which acts as the level of significance for calling that site selected. In this manner, sites under selection in a coding sequence dataset are studied using a combined approach. Apart from this, there are many other ways to carry out selection analyses, and they will be taken up in future issues. For a practical application of these methods in a field study, readers can refer to the author’s previously published article in Archives of Virology (Singh-Pant et al., 2012; DOI: 10.1007/s00705-012-1287-x).
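How a per-site decision might be reached from α, β, and a p-value can be sketched as follows. This is an illustration, not HyPhy code: it assumes the p-value comes from a likelihood ratio test comparing a model with α = β against one with α and β free, evaluated against a chi-square distribution with one degree of freedom. The function names and the 0.1 significance threshold are illustrative choices.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """p-value of a likelihood ratio test with one degree of freedom.

    lnl_null -- log-likelihood of the constrained model (assumed: alpha = beta)
    lnl_alt  -- log-likelihood of the unconstrained model
    For one degree of freedom the chi-square tail probability is
    erfc(sqrt(statistic / 2)).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return math.erfc(math.sqrt(statistic / 2.0))


def call_site(alpha, beta, p_value, threshold=0.1):
    """Classify one codon site; `threshold` is an illustrative cut-off."""
    if p_value > threshold or alpha == beta:
        return "not significant"
    return "positive" if beta > alpha else "negative"


if __name__ == "__main__":
    # Made-up numbers for a single site, for demonstration only.
    p = lrt_p_value(lnl_null=-102.0, lnl_alt=-100.0)
    print(round(p, 4), call_site(alpha=0.5, beta=2.0, p_value=p))
```

Keeping the significance test separate from the direction of selection mirrors the text: the comparison of β with α says which way a site leans, while the p-value says whether that difference can be trusted.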

The author has a comprehensive list of references, which is available upon request. For more information contact [email protected]

Dr. Pant is a researcher with a keen interest in software-driven analysis of DNA/protein sequence data for taxonomic, phylogenetic, and other homology-based studies. He is currently working on understanding microbial diversity using next-generation sequencing approaches and on the analysis of sequence and metagenomic datasets using computational biology approaches. He is presently engaged in undergraduate teaching as an Assistant Professor at the University of Delhi.
Dr. Pratibha Pant (née Singh) is a plant molecular biologist and plant virologist. Her main area of interest is plant begomoviral infections. She has worked on viral genome sequencing, sequence analyses, and molecular phylogenetic studies. To her credit are a number of viral gene sequences submitted to GenBank, including one novel report.
