This is the second article in the popular series “Do you HYPHY…”, recently published online in BiR. In this issue, I would like to take substitution/selection analysis and its intricacies to the next level. As previously mentioned, HyPhy offers a collection of methods such as GARD (Genetic Algorithm for Recombination Detection), SLAC (Single Likelihood Ancestor Counting), REL (Random Effects Likelihood), and FEL (Fixed Effects Likelihood). Together, these methods support a holistic selection analysis that shows whether there are signatures of positive selection and, if so, which sites are under positive selection. To start with, one needs a coding sequence dataset. It is important to point out that an intron-free coding sequence dataset is the minimum requirement for these methods; without it, the analyses will simply not run. For this reason, HyPhy cannot be deployed for selection analyses on non-coding sequences such as rRNA or repetitive DNA sequences, or on molecular-marker-based sequences (Sequence Tagged Sites, STS) that contain composites of non-coding and coding regions.
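The coding-sequence requirement above can be pre-checked before submitting data. The following is only a minimal sketch, not part of HyPhy: the function name `coding_problems` is hypothetical, it assumes unaligned, gap-free nucleotide sequences and the standard genetic code, and it only flags the most obvious symptoms of non-coding or out-of-frame input.

```python
# Stop codons of the standard genetic code (an assumption; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def coding_problems(seq):
    """Return reasons why `seq` does not look like a clean coding sequence.

    Hypothetical helper for illustration; expects an unaligned, gap-free
    nucleotide string.
    """
    seq = seq.upper()
    problems = []
    remainder = len(seq) % 3
    if remainder != 0:
        problems.append("length is not a multiple of 3")
    # Split the complete codons only; a trailing partial codon is ignored.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - remainder, 3)]
    # A stop codon before the last codon suggests introns, a frame shift,
    # or non-coding sequence.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"internal stop codon {codon} at codon {index + 1}")
    return problems


if __name__ == "__main__":
    # Made-up sequences, used only to exercise the check.
    for name, seq in [("ok", "ATGAAATGA"), ("bad", "ATGTAAAAATGA")]:
        print(name, coding_problems(seq))
```

An empty list means the sequence passed these basic checks; it does not guarantee that the sequence is a genuine coding region.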
Prior to selection analyses, the first thing to look for is potential chimeras in the sequences. Chimeric sequences are artifacts generated during amplification and sequencing. They are highly undesirable: they can cause drastic anomalies in selection analyses and skew the data toward an extreme (unreal) signal of positive selection. To reveal potential recombinants, HyPhy offers GARD, which statistically evaluates the alignment to identify the sites (essentially bases) at which breakpoints have occurred, creating chimeras or putative recombinants. Apart from GARD, one can also use RDP (Recombination Detection Program) by Darren Martin (2000). The two programs differ somewhat in how they deliver results but essentially do the same thing: detect recombinants. The advantage of using GARD on the Datamonkey server is that it partitions the sequence dataset into clean non-recombinant partitions, setting aside the possible recombinant/chimeric regions and thus preventing overestimation.
Once the recombinants have been identified and excluded, the dataset is ready for analysis. Three methods are employed to identify signatures of positive selection: SLAC, REL, and FEL. The probable reason for employing three methods rather than one is to obtain a more rigorous estimate of positive selection and to account for false positives (overestimations). SLAC is basically a “counting method” that employs either a single most likely ancestral reconstruction, weighting across all possible ancestral reconstructions, or sampling from ancestral reconstructions. REL models variation in non-synonymous (nucleotide substitutions that change the encoded amino acid) and synonymous (nucleotide substitutions that leave the encoded amino acid unchanged) rates across sites according to a predefined distribution model. The selection pressure at an individual site is then inferred from the fitted distribution using an empirical Bayes approach. The major demerit of the REL method is that it suffers from high false-positive rates. FEL is the most robust method of all; here the estimation is site by site, comparing the rate of non-synonymous substitution (‘dN’, also referred to as β) with the rate of synonymous substitution (‘dS’, also referred to as α). In this manner, a site-by-site dN/dS analysis identifies the codon sites under positive selection (α<β). Essentially, a sequence dataset is said to be evolving neutrally when dN/dS = 1, a value greater than 1 indicates positive selection, and a value less than 1 indicates negative selection.
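The dN/dS rule described above can be written as a small sketch. This is an illustration only, not HyPhy code: the function name `classify_site`, the `tolerance` parameter, and the example rates are assumptions. The rates are compared directly rather than divided, so a site with α = 0 does not cause a division by zero.

```python
def classify_site(alpha, beta, tolerance=1e-9):
    """Label a codon site from its synonymous (alpha, dS) and
    non-synonymous (beta, dN) rates.

    Hypothetical helper: beta > alpha corresponds to dN/dS > 1.
    """
    if abs(beta - alpha) <= tolerance:
        return "neutral"
    return "positive" if beta > alpha else "negative"


if __name__ == "__main__":
    # Made-up (alpha, beta) pairs for three codon sites.
    for alpha, beta in [(0.5, 1.5), (1.0, 1.0), (2.0, 0.5)]:
        print(alpha, beta, classify_site(alpha, beta))
```

A label from this rule alone says nothing about statistical support; a site only counts as selected once the difference between α and β is also significant.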
FEL is considered to be the most accurate and precise of the three methods. Unlike SLAC, which relies on a modified Suzuki and Gojobori counting approach, FEL estimates dN and dS independently at each codon site by likelihood and makes no a priori assumption about the rate distribution, making the estimation all the more accurate. Uncertainty in the site-level dN and dS (local β and α) estimates is accounted for by a p-value computed for every site, which serves as the level of significance. In this manner, a combined approach is used to study sites under selection in a coding sequence dataset. However, there are many other ways to approach selection analyses, and they will be taken up in future issues. For a practical application of these methods in a field study, readers can refer to the author’s previously published article in Archives of Virology (Singh-Pant et al., 2012; DOI: 10.1007/s00705-012-1287-x).
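The use of per-site p-values, and of several methods to guard against false positives, can be sketched as follows. This is a hedged illustration rather than the output format of any HyPhy or Datamonkey method: the row layout `(site, alpha, beta, p_value)`, the function names, the 0.1 threshold, and the example numbers are all assumptions.

```python
from collections import Counter


def significant_sites(rows, p_threshold=0.1):
    """Return the codon sites with beta > alpha and a p-value below the threshold.

    rows: iterable of (site, alpha, beta, p_value) tuples (hypothetical layout).
    """
    return {site for site, alpha, beta, p in rows
            if beta > alpha and p < p_threshold}


def consensus_sites(site_sets, min_methods=2):
    """Return the sorted sites reported by at least `min_methods` methods."""
    counts = Counter(site for sites in site_sets for site in sites)
    return sorted(site for site, n in counts.items() if n >= min_methods)


if __name__ == "__main__":
    # Made-up per-site results for one method.
    fel_rows = [(10, 0.2, 1.4, 0.03), (25, 0.5, 0.9, 0.40), (31, 1.2, 0.1, 0.01)]
    fel_sites = significant_sites(fel_rows)
    # Made-up site sets standing in for two other methods.
    print(consensus_sites([fel_sites, {10, 31}, {10, 25, 31}]))
```

Requiring agreement between at least two methods is one simple way to be conservative; the appropriate threshold and level of agreement depend on the dataset.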
The author has a comprehensive list of references, which is available upon request. For more information, contact [email protected]
HMMER- Uses & Applications
Easy installation of some alignment software on Ubuntu (Linux) 18.04 & 20.04
There are commonly used alignment programs such as muscle, blast, clustalx, and so on, that can be easily installed from the repository. In this article, we are going to install such software on Ubuntu 18.04 & 20.04.
FEGS- A New Feature Extraction Model for Protein Sequence Analysis
Protein sequence analyses include protein similarity, protein function prediction, protein interactions, and so on. A new feature extraction model has been developed for easy analysis of protein sequences.
Installing RDPTools on Ubuntu (Linux)
RDP provides analysis tools called RDPTools. These tools are used to analyze high-throughput sequencing data, including single-strand and paired-end reads. In this article, we are going to install RDPTools on Ubuntu (Linux).
NGlyAlign- A New Tool to Align Highly Variable Regions in HIV Sequences
It is necessary to detect highly variable regions in the envelopes of viruses, as these regions allow the viruses to establish themselves in the human body. A new tool has been developed to build and align the highly variable regions in HIV sequences.
How to install ClustalW2 on Ubuntu?
Clustal packages [1,2] are quite useful in multiple sequence alignments, especially when you need specific outputs from the command line. In this article, we will install the ClustalW2 command-line tool on Ubuntu.
Installing HMMER package on Ubuntu
The HMMER tool is used for searching sequence homologs using profile hidden Markov Models (HMMs). It is also one of the most widely used alignment tools. In this article, we will install the latest HMMER package on Ubuntu.
Installing FASTX-toolkit on Ubuntu
FASTX-toolkit is a command-line bioinformatics software package for the preprocessing of short-read FASTQ/A files. These files contain multiple short-read sequences obtained as an output of next-generation sequencing. In this article, we are going to install FASTX-toolkit on Ubuntu.
Aligning DNA reads against a local database using DIAMOND
DIAMOND is a program for high-throughput pairwise alignment of DNA reads and protein sequences. It is used for the high-performance analysis of large sequence data. In this article, we will make a local database of protein sequences and align protein sequences against the reference database.
Installing MEME suite on Ubuntu
Installing BLAT- A Pairwise Alignment Tool on Ubuntu
Homology search against a local dataset using NCBI-BLAST+ command-line tool
The NCBI-BLAST+ command-line tool offers multiple functions to be performed on a large dataset of sequences. Previously, we have shown how to blast against a local dataset of sequences. This article will explain how to search for homologous sequences for a query sequence against a local database of sequences and how to obtain the top 100 hits from the search results.
How to use Clustal Omega and MUSCLE command-line tools for multiple sequence alignment?
Clustal Omega [1,2] and MUSCLE are bioinformatics tools that are used for multiple sequence alignment (MSA). In one of our previous articles, we explained the usage of the ClustalW2 command-line tool for MSA and phylogenetic tree construction. In this article, we will use Clustal Omega and MUSCLE for MSA exploring other arguments that facilitate different output formats.
Multiple Sequence Alignment and Phylogenetic Tree construction using ClustalW2 command-line tool
ClustalW2 is a bioinformatics tool for multiple sequence alignment of DNA or protein sequences. It can easily align sequences and generate a phylogenetic tree online (https://www.genome.jp/tools-bin/clustalw). However, in some cases, we need to perform these operations on a large number of FASTA sequences using the command-line tool of ClustalW2.
Sequence search against a set of local sequences (local database) using phmmer
PHMMER is a sequence analysis tool used for protein sequences (http://hmmer.org; version 3.1 b2). It is available online as a web server as well as part of the HMMER stand-alone package. HMMER offers various useful features such as multiple sequence alignment and file format conversion.
Biotite: A bioinformatics framework for sequence and structure data analysis
Sequence and structural data in bioinformatics are ever-increasing, and so is the demand for their analysis. Just as bioinformaticians analyze the data with their keen knowledge and reach important conclusions, bioinformaticists provide enhanced and advanced tools and software for data analysis.
Simulated sequence alignment software: An alternative to MSA benchmarks
In our previous article, we discussed different multiple sequence alignment (MSA) benchmarks to compare and assess the available MSA programs. However, over the last decade, several sequence simulation programs have been introduced and are gaining interest. In this article, we will discuss various sequence simulation programs being used as alternatives to MSA benchmarks.
Benchmark databases for multiple sequence alignment: An overview
Multiple sequence alignment (MSA) is a crucial step in most molecular analyses and evolutionary studies. Many MSA programs have been developed so far based on different approaches that attempt to provide an optimal alignment with high accuracy. Basic algorithms employed to develop MSA programs include progressive, iterative, and consistency-based algorithms. Some programs, such as M-COFFEE and PCMA, incorporate several other methods into the process of creating an optimal alignment.
The basic local alignment search tool (BLAST) [1,2] is known for its speed and results, and it is also a primary step in sequence analysis. The ever-increasing demand for processing huge amounts of genomic data has led to the development of new scalable and highly efficient computational tools/algorithms. For example, MapReduce is the most widely accepted framework that supports design patterns representing general reusable solutions to some problems, including biological assembly, and it is highly efficient at handling large datasets running over hundreds to thousands of processing nodes. But the implementation frameworks of MapReduce (such as Hadoop) limit its capability to process smaller data.
Role of Information Theory, Chaos Theory, and Linear Algebra and Statistics in the development of alignment-free sequence analysis
Sequence alignment is customarily used not only to find similar regions between a pair of sequences but also to study the structural, functional, and evolutionary relationships between organisms. Many tools have been developed to align a pair of sequences, separately for nucleotide and amino acid sequences; BLOSUM & PAM are a few to name.