A practical guide to selection analyses of coding sequence datasets and their intricacies

This is the second article in the popular series “Do you HYPHY…”, recently published online in BiR. In this issue, I would like to take substitution/selection analyses and their intricacies to the next level. As previously mentioned, HyPhy offers a collection of programs such as GARD (Genetic Algorithm for Recombination Detection), SLAC (Single Likelihood Ancestor Counting), REL (Random Effects Likelihood), and FEL (Fixed Effects Likelihood). Together, these programs contribute to a holistic selection analysis that asks whether there are signatures of positive selection and, if so, which sites are under positive selection. To start with, one needs a coding sequence dataset. It is important to point out that an intron-free, in-frame coding sequence dataset is the minimum requirement for these programs; without it, they will simply not work. For the same reason, HyPhy cannot be deployed for selection analyses of non-coding sequences such as rRNA or repetitive DNA, or of molecular-marker-based sequences (Sequence Tagged Sites, STS) that are composites of non-coding and coding regions.
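
Before uploading anything, it can help to sanity-check that each sequence really looks like coding data. The sketch below is only an illustration and not part of HyPhy: the function name `check_coding_sequence` and the example sequences are hypothetical, and it assumes the standard genetic code. It flags sequences whose length is not a multiple of three or that contain a stop codon before the final codon. Whether a terminal stop codon is tolerated depends on the tool, so check its documentation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_coding_sequence(seq):
    """Return a list of problems that make `seq` unsuitable as codon data."""
    seq = seq.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons only; a trailing partial codon is ignored here.
    complete = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, complete, 3)]
    # A stop codon is tolerated only as the very last complete codon.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index + 1}")
    return problems


# Hypothetical toy sequences, for illustration only.
examples = {
    "clean": "ATGGCTTGGAAATAA",
    "frameshift": "ATGGCTTGGAAATA",
    "internal_stop": "ATGTAAGCTTGGAAA",
}
for name, seq in examples.items():
    print(name, check_coding_sequence(seq) or "looks usable")
```

A sequence that fails either check usually points to leftover intron or untranslated sequence, or to a frameshift introduced during sequencing or alignment.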

Prior to selection analyses, the first thing to look for is potential chimeras in the sequences. Chimeric sequences are artifacts of amplification and sequencing in which segments from different templates end up joined in a single sequence. They are highly undesirable: they can cause drastic anomalies in selection analyses and skew the results towards extreme (spurious) positive selection. To reveal potential recombinants, HyPhy offers GARD, which uses a statistical approach to locate the sites (essentially bases) at which breakpoints have occurred, creating chimeras or putative recombinants. Apart from GARD, one can also refer to RDP (Recombination Detection Program) by Darren Martin (2000). The two programs differ somewhat in how they deliver their results but essentially do the same thing: detect recombinants. The advantage of using GARD at the Datamonkey server is that it allows the sequence dataset to be partitioned into clean, non-recombinant partitions, setting aside the possible recombinant/chimeric regions, and thus prevents overestimation of positive selection.
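
To make the idea of partitioning concrete, here is a minimal sketch of cutting an alignment at breakpoint positions. It is not GARD itself: the function `split_at_breakpoints`, the toy alignment, and the breakpoint at column 6 are all hypothetical, and the breakpoint is deliberately placed on a codon boundary so that each partition stays in frame.

```python
def split_at_breakpoints(alignment, breakpoints):
    """Split every aligned sequence at the given 0-based column indices.

    Each breakpoint is treated as the first column of a new partition.
    """
    length = len(next(iter(alignment.values())))
    edges = [0] + sorted(breakpoints) + [length]
    partitions = []
    for start, end in zip(edges, edges[1:]):
        partitions.append({name: seq[start:end] for name, seq in alignment.items()})
    return partitions


# Hypothetical toy alignment; a real one would come from a FASTA or NEXUS file.
alignment = {
    "seq1": "ATGGCTTGGAAACCC",
    "seq2": "ATGGCATGGAAGCCC",
    "seq3": "ATGGCTTGGAAGCCA",
}
parts = split_at_breakpoints(alignment, [6])
for number, part in enumerate(parts, start=1):
    print(f"partition {number}:", part)
```

Each partition can then be treated as its own non-recombinant block in the downstream analyses.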

Once the recombinants have been identified and excluded from the dataset, it is ready for analysis. Three methods are employed to identify signatures of positive selection: SLAC, REL, and FEL. The probable reason for employing three methods rather than one is to obtain a more rigorous estimate of positive selection and to guard against false positives (over-estimation). SLAC is basically a “counting method” that employs either a single most likely ancestral reconstruction, weighting across all possible ancestral reconstructions, or sampling from ancestral reconstructions. REL models variation in non-synonymous (nucleotide changes that alter the encoded amino acid) and synonymous (nucleotide changes that leave the encoded amino acid unchanged) rates across sites according to a predefined distribution model. The selection pressure at an individual site is then inferred from this fitted distribution using an empirical Bayes approach. The major demerit of the REL method is that it can suffer from high false-positive rates. FEL is the most robust method of the three; here the estimation is done site by site, comparing the rate of non-synonymous substitution (‘dN’, also referred to as β) with the rate of synonymous substitution (‘dS’, also referred to as α). In this manner, a site-by-site dN/dS analysis identifies the codon sites under positive selection (α<β). Essentially, neutral evolution is said to occur when dN/dS=1, a value greater than 1 indicates positive selection, and a value less than 1 indicates negative selection.
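
The two ideas in this paragraph, what makes a substitution synonymous and how dN/dS is read, can be sketched in a few lines. This is an illustration only, not how SLAC, REL, or FEL are implemented: the helper names `substitution_type` and `selection_regime` are hypothetical, the standard genetic code is assumed, and the tolerance used to call a ratio "equal to 1" is arbitrary.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def substitution_type(codon_a, codon_b):
    """Classify a codon change by whether the encoded amino acid changes."""
    if codon_a == codon_b:
        return "identical"
    same_residue = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
    return "synonymous" if same_residue else "non-synonymous"


def selection_regime(dn, ds, tolerance=1e-9):
    """Interpret a dN/dS point estimate (the tolerance is illustrative)."""
    if ds == 0:
        raise ValueError("dS is zero, so dN/dS is undefined")
    omega = dn / ds
    if abs(omega - 1.0) <= tolerance:
        return "neutral"
    return "positive" if omega > 1.0 else "negative"


print(substitution_type("GCT", "GCC"))  # Ala -> Ala
print(substitution_type("GCT", "GAT"))  # Ala -> Asp
print(selection_regime(dn=0.9, ds=0.3))
print(selection_regime(dn=0.1, ds=0.5))
```

Note that a point estimate above 1 is not, on its own, evidence of positive selection; it still has to pass a significance test.
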
FEL is considered the most accurate and precise of the three methods because it fits dN and dS directly at each codon site by maximum likelihood, whereas SLAC relies on counting in a modified Suzuki and Gojobori framework, and because, unlike REL, it makes no a priori assumption about the distribution of rates across sites. Uncertainty in the local α and β estimates is handled through a p-value reported for every site, which serves as the level of significance for the difference between the two rates. In this manner, the three methods together provide a combined approach to studying sites under selection in a coding sequence dataset. There are, however, many other ways to approach selection analyses, and they will be taken up in future issues. For a practical application of these methods in a field study, readers can refer to the author’s previously published article in Archives of Virology (Singh-Pant et al., 2012; DOI: 10.1007/s00705-012-1287-x).
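
One common way to obtain such a per-site p-value is a likelihood ratio test that compares a model with α = β at the site against a model in which the two rates are free. The sketch below assumes that setup with one degree of freedom, which is my understanding of how FEL tests each site. The function `lrt_p_value`, the log-likelihoods, the rate estimates, and the 0.05 cut-off are all hypothetical values chosen for illustration, not output from HyPhy.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test against chi-square with 1 d.f."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For one degree of freedom, the chi-square tail equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical per-site results: (codon, alpha, beta, lnL null, lnL alternative).
sites = [
    (12, 0.20, 1.90, -15.2, -12.1),
    (47, 0.80, 0.90, -9.4, -9.3),
    (103, 1.50, 0.10, -20.0, -16.5),
]
threshold = 0.05  # illustrative significance level

for codon, alpha, beta, lnl_null, lnl_alt in sites:
    p = lrt_p_value(lnl_null, lnl_alt)
    if p >= threshold:
        call = "not significant"
    else:
        call = "positive selection" if beta > alpha else "negative selection"
    print(f"codon {codon}: alpha={alpha}, beta={beta}, p={p:.4f} -> {call}")
```

The direction of selection comes from comparing β with α, while the p-value only says whether that difference is statistically supported.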

A comprehensive list of references is available from the author upon request. For more information, contact prashant@bioinformaticsreview.com
