How BLAST works – Concepts, Types, & Methods Explained
BLAST stands for Basic Local Alignment Search Tool. It is a local alignment tool that compares a query sequence against a database of sequences to find regions of similarity, which helps in identifying related sequences across species. In this article, we will explain the different kinds of BLAST tools and how the BLAST algorithm works.
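BLAST is a heuristic method: instead of computing an optimal alignment by full dynamic programming (as the Smith-Waterman algorithm does), it uses shortcuts to locate likely matches. This makes it much faster and more efficient, but relatively less sensitive.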
To BLAST a sequence, you need a query sequence and a target sequence/database. The query is the sequence for which we want to find similar sequences, and the target is the sequence/database against which the query is aligned. BLAST returns its output as a hit table, sorted by default from the most to the least significant match, with separate columns for the accession number, title, query coverage, sequence identity, score, and E-value of each hit. The E-value measures the reliability of a match: it is the number of alignments with an equal or better score that would be expected by chance in a database of that size, so the lower the E-value, the more significant the hit.
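To make the E-value concrete: under the Karlin-Altschul statistics used by BLAST, the expected number of chance alignments with a raw score of at least S is E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The Python sketch below only illustrates this formula. The function names are hypothetical, the default lambda and K values are approximately those commonly quoted for ungapped BLOSUM62 scoring and are assumptions here, and the length corrections that real BLAST applies are left out.

```python
import math

def expected_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S).

    lam and k depend on the scoring system; the defaults are illustrative.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalized (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

if __name__ == "__main__":
    # The same raw score becomes less significant as the database grows.
    for db_len in (1_000_000, 100_000_000):
        print(db_len, expected_value(raw_score=80, query_len=250, db_len=db_len))
```

With the bit score, the same quantity can be written as E = m * n * 2^(-S'), which is why bit scores can be compared across searches that use different scoring systems.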
BLAST has different programs for aligning nucleotide and protein sequences. There are many of them, but the basic kinds of BLAST are as follows:
- blastn
The query is a nucleotide sequence and the target is also a nucleotide sequence/database, i.e., it is nucleotide against nucleotide.
- blastp
Blastp is a protein-to-protein BLAST where the query is a protein and the target is also a protein sequence/database.
- blastx
The query is a nucleotide sequence and the target is a protein sequence/database. The nucleotide query is first translated into protein in all six reading frames (three on each strand), and the translations are then searched against the protein database.
- tblastn
The query is a protein and the target is a nucleotide sequence/database. The nucleotide database is translated into protein in all six reading frames (three on each strand), and the protein query is searched against these translations.
- tblastx
The query is a nucleotide sequence searched against a nucleotide database, but at the protein level. In other words, both the query and the target sequences are translated into their corresponding protein sequences in all six reading frames and then aligned.
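The translated searches above (blastx, tblastn, tblastx) all rely on six-frame translation. The following is a minimal Python sketch of that idea, not BLAST's own code: the function names are hypothetical, only the standard genetic code is assumed, and any codon containing a character other than A, C, G, or T is written as "X".

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six protein strings would then be compared at the protein level, which is why translated searches cost more than a plain blastn or blastp run.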
Special kinds of BLAST:
- Megablast
Megablast is a variant of blastn optimized for aligning highly similar nucleotide sequences, for example, sequences from the same or closely related species. It uses a longer word size than blastn and a greedy algorithm for gapped alignment. It copes well with long sequences and with large batches of queries, which are concatenated into one large query sequence so that the database is scanned only once. Because of these features, megablast is faster than blastn but less sensitive, and it is very useful when a large number of similar sequences are to be aligned in one go.
- Discontiguous Megablast
Discontiguous megablast is intended for more dissimilar sequences than megablast can detect. Instead of requiring an exact match over a contiguous word, it uses a discontiguous word template in which some positions are ignored, so the initial match can tolerate mismatches. This makes it useful for comparing related genes across more distant species. The output is still a list of sequences similar to the query, but it includes more divergent matches than megablast would find.
- PSI Blast
Position-specific iterated (PSI) BLAST is very sensitive and is usually used for protein similarity searches. The query sequence is first subjected to blastp, and the most similar sequences found are assembled into a multiple sequence alignment (MSA). From this MSA, a position-specific scoring matrix (profile) that captures the conserved pattern of the query and its homologs is built, and this profile is then searched against the database in place of the original query. The cycle of building a profile from the MSA, searching the database with it, building a new MSA from the hits, and refining the profile is repeated over several iterations.
- PHI Blast
Pattern Hit Initiated (PHI) BLAST is related to PSI BLAST, but the search is seeded by a user-supplied pattern that occurs in the query rather than by a profile built over iterations. It reports only those database sequences that contain the pattern and are also similar to the query in the region around it. It is most commonly applied to protein queries.
- RPS Blast
Reverse Position Specific (RPS) BLAST works in the opposite direction to PSI BLAST: instead of searching a profile against a sequence database, it searches the query sequence against a database of precomputed profiles. The query (a protein, or a nucleotide sequence that is translated first) is compared with an existing collection of conserved domains, each represented by a profile derived from a pre-built MSA.
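PSI BLAST and RPS BLAST both depend on a position-specific profile derived from an MSA. As a rough illustration of what such a profile is, the Python sketch below builds a log-odds matrix from a small gap-free alignment and slides it along a sequence. It is a simplification under stated assumptions, not the scheme BLAST uses: the function names are hypothetical, the background frequencies are taken to be uniform, a flat pseudocount is added, and sequence weighting is ignored.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds profile (one dict per column) from an ungapped MSA.

    Assumes equal-length sequences, standard amino acids only, and a
    uniform background frequency (a simplification).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def best_window(pssm, sequence):
    """Slide the profile along the sequence; return (best_score, start).

    The sequence must be at least as long as the profile.
    """
    width = len(pssm)
    scores = [
        (sum(col[aa] for col, aa in zip(pssm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)

if __name__ == "__main__":
    profile = build_pssm(["HEAG", "HEAG", "HDAG", "HEAA"])
    print(best_window(profile, "WWHEAGWW"))
```

Conserved columns produce strongly positive scores for the conserved residue and negative scores for everything else, which is what lets a profile detect distant homologs that a single query sequence would miss.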
How does BLAST work?
BLAST is a heuristic algorithm that was developed by Altschul et al. [1]. It is similar to FASTA but more efficient. Just as FASTA uses a ktup parameter, BLAST uses a word size (W), which by default is typically 3 for proteins and 11 for nucleotides. Both methods assume that good alignments contain short stretches of identical or very similar residues. BLAST is an improvement over FASTA in the sense that it is faster, provides rigorous statistics for its matches, and is easy to use. BLAST also uses a threshold known as the minimal score, denoted ‘S’: an aligned segment between the query and a database sequence is reported only if its score is equal to or greater than S.
BLAST performs the alignment in 3 basic steps:
- First, BLAST builds a word list: it masks the low-complexity regions of the query and then breaks the remaining sequence into short words of a fixed length.
- Secondly, BLAST scans the database for matches to these words. For nucleotides the matches are exact; for proteins, any database word that scores at or above a word threshold (T) against a query word counts. These word matches are called “Hits”.
- Lastly, BLAST extends each hit in both directions as an ungapped alignment, stopping once the score falls too far below the maximum reached so far. Extended segments that score at least S are kept as high-scoring segment pairs (HSPs), and current versions of BLAST then also allow gaps when producing the final alignment.
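The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the actual BLAST implementation: all function names and the scoring values (match, mismatch, x_drop, min_score) are hypothetical, only exact nucleotide word matches are used, low-complexity masking is skipped, and no gapped extension is performed.

```python
from collections import defaultdict

def index_words(query, word_size):
    """Step 1: map each word of length word_size in the query to its positions."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    return index

def find_hits(index, subject, word_size):
    """Step 2: return (query_pos, subject_pos) for every exact word match."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits

def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk one way (step is +1 or -1); return (best_gain, length_at_best)."""
    gain = best_gain = length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        sj += step
    return best_gain, best_len

def extend_hit(query, subject, i, j, word_size, match=1, mismatch=-3, x_drop=5):
    """Step 3: ungapped extension of one hit in both directions."""
    right_gain, right_len = _extend(
        query, subject, i + word_size, j + word_size, 1, match, mismatch, x_drop)
    left_gain, left_len = _extend(
        query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    score = word_size * match + left_gain + right_gain
    return score, i - left_len, i + word_size + right_len

def seed_and_extend(query, subject, word_size=4, min_score=6):
    """Return unique (score, query_start, query_end, subject_start), best first."""
    index = index_words(query, word_size)
    segments = set()
    for i, j in find_hits(index, subject, word_size):
        score, q_start, q_end = extend_hit(query, subject, i, j, word_size)
        if score >= min_score:  # min_score plays the role of the threshold S
            segments.add((score, q_start, q_end, j - (i - q_start)))
    return sorted(segments, reverse=True)

if __name__ == "__main__":
    print(seed_and_extend("ACGTACGTTAGC", "TTTTACGTACGTTAGCAAAA"))
```

Several hits on the same diagonal extend to the same segment, so the results are collected in a set. Hits that arise by chance extend poorly and fall below min_score, which is the filtering role that S plays in the description above.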
References
- Altschul, S. F. (2001). BLAST algorithm. eLS.