BLAST stands for Basic Local Alignment Search Tool. It is a local-alignment tool that compares a query sequence against a database of sequences to find regions of similarity, for example between genes or proteins of different species. In this article, we explain the different kinds of BLAST tools and how the BLAST algorithm works.
BLAST is a heuristic method: rather than computing an optimal alignment by full dynamic programming, as the Smith-Waterman algorithm does, it uses shortcuts that make it much faster but somewhat less sensitive.
To BLAST a sequence, we need a query sequence and a target sequence or database. The query is the sequence for which we want to find similar sequences, and the target is the sequence or database against which the query is aligned. BLAST returns its output as a hit table, sorted by default from the most to the least significant match, with the accession number, title, query coverage, sequence identity, score, and e-value of each hit in separate columns. The e-value is the number of hits with an equal or better score that would be expected by chance in a database of that size, so the lower the e-value, the more significant the match.
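To illustrate how the e-value relates to the alignment score, the sketch below applies the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n is the total length of the database, and S is the raw score. The function names and the values of lambda and K are assumptions chosen for illustration; BLAST derives these constants from the scoring system in use.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw score to bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative constants only (assumed for this sketch).
LAMBDA, K = 0.267, 0.041

for raw in (40, 80, 160):
    bits = round(bit_score(raw, LAMBDA, K), 1)
    print(raw, bits, e_value(raw, 250, 50_000_000, LAMBDA, K))
```

A higher score gives a smaller e-value, and the same score becomes less significant as the database grows.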
BLAST provides different programs for aligning nucleotide and protein sequences. There are many of them, but the basic kinds of BLAST are as follows:
-
blastn
It is a type of blast where both the query and the target are nucleotide sequences, i.e., it is a nucleotide-against-nucleotide search.
-
blastp
Blastp is a protein-to-protein blast where the query sequence is a protein and the target sequence is also a protein.
-
blastx
In this type of blast, the query is a nucleotide sequence and the target is a protein sequence/database. First, the nucleotide query is translated into protein in all six reading frames (three on each strand), and then each translation is searched against the protein database.
-
tblastn
In tblastn, the query is a protein and the target is a nucleotide sequence/database. Here, the protein query is searched against a nucleotide database that is translated into its corresponding proteins. The translation is done in all six reading frames, three on each strand, because a coding region may lie on either strand of a database sequence.
-
tblastx
It is a type of blast in which a nucleotide query is searched against a nucleotide database, but the comparison is made at the protein level. In other words, the nucleotide query and target sequences are translated into their corresponding protein sequences and then aligned. Both the query and the target are translated in all 6 reading frames.
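The translated searches above (blastx, tblastn, tblastx) rely on reading a nucleotide sequence in several frames. A minimal sketch of six-frame translation follows, assuming the standard genetic code; the helper names are hypothetical and the frame labels (+1 to +3 for the given strand, -1 to -3 for the reverse complement) are a common convention rather than anything BLAST-specific.

```python
# Standard genetic code, laid out in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translations(dna):
    """Map frame labels +1..+3 and -1..-3 to their protein translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(seq[offset:])
    return frames

print(six_frame_translations("ATGGCC"))
```

Each of the six protein strings can then be compared with a protein sequence, which is why translated searches cost more than blastn or blastp.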
Special kinds of BLASTs:
-
Megablast
It is a variant of blastn optimized for highly similar sequences, such as sequences from the same or closely related species. It uses a much longer word size than blastn (28 by default, against 11) and a greedy algorithm for the gapped alignment. When many queries are submitted together, they are concatenated into one large query sequence to save time scanning the database. Because of these features, megablast is faster than blastn but less sensitive to divergent sequences, so it is most useful when a large number of long, similar sequences are to be aligned in one go.
-
Discontiguous Megablast
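It is a more sensitive variant of megablast, intended for more dissimilar sequences. Instead of requiring an exact match over a contiguous word, it uses a discontiguous word template that ignores some positions, for example the third base of each codon, so that initial matches can still be found when the sequences have diverged. It is therefore suited to cross-species comparisons, such as finding homologs of a gene in a distant species. The output is still ranked from the most to the least similar sequence; the program does not search for the least similar sequences.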
-
PSI Blast
Position-specific iterated (PSI) Blast is very sensitive and is used to find distantly related proteins. The query is first searched with blastp, and the significant hits are combined into a multiple sequence alignment (MSA). From this MSA, a position-specific scoring matrix (PSSM, also called a profile) is computed, which captures the pattern of conservation at each position of the query and its homologs. The PSSM then replaces the query in the next search of the database, the new significant hits are added to the MSA, and a refined PSSM is computed. This cycle of building a profile from the MSA and searching the database with it is repeated for a set number of iterations or until no new hits are found.
-
PHI Blast
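Pattern Hit Initiated (PHI) blast takes a pattern (motif) in addition to the query sequence. It reports only those database sequences that contain the pattern and are also similar to the query in the region around it. Its results can be used to start PSI Blast iterations. In the current NCBI tools it is offered for protein queries.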
-
RPS Blast
Reverse Position Specific (RPS) Blast reverses the PSI Blast approach: instead of searching a database of sequences with a profile, it searches a database of precomputed profiles with the query sequence. The profile database is a collection of conserved domains, each derived from a pre-built MSA, such as the NCBI Conserved Domain Database (CDD). The query can be a protein or a nucleotide sequence; a nucleotide query is translated before the search.
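The position-specific profiles that PSI Blast builds and RPS Blast searches against can be illustrated with a small position-specific scoring matrix. The sketch below is a simplification under stated assumptions: a gapless alignment, a uniform background frequency for the 20 amino acids, and a simple pseudocount. PSI Blast itself also weights sequences and uses real background frequencies; the function names here are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_sequence(pssm, sequence):
    """Sum the position-specific scores of a sequence as long as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))

# Toy alignment of three equal-length peptides (made-up data).
profile = build_pssm(["MKV", "MKI", "MRV"])
print(score_sequence(profile, "MKV"), score_sequence(profile, "AAA"))
```

A residue that is conserved in a column scores positively only at that position, which is what lets a profile detect relatives that a single query sequence would miss.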
How does Blast work?
BLAST is a heuristic algorithm that was developed by Altschul et al. [1]. It is similar to FASTA but generally faster. Just as FASTA uses a ktup parameter, BLAST uses a word size (W), classically 3 for proteins and 11 for nucleotides. Both assume that good alignments contain short stretches of identical or high-scoring matches. BLAST is usually faster than FASTA, although not necessarily more sensitive, and it reports a statistical significance (the e-value) for each hit. BLAST uses two thresholds. The word threshold ‘T’ is the minimum score that a word pair must reach to seed an alignment. The minimal score ‘S’ is the cutoff for the final alignment: whatever the match is between the query and the database, it is reported only if its score is equal to or greater than S.
BLAST performs the alignment in 3 basic steps:
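- First, BLAST masks the low-complexity regions of the query and then breaks the query into short overlapping words of a fixed length. For proteins, it also lists the similar (“neighborhood”) words that score at least a word threshold T against each query word.
- Secondly, BLAST scans the database for exact matches to these words. Each match is called a “hit” and serves as the seed of an alignment.
- Lastly, BLAST extends each hit in both directions as an ungapped alignment. The extension stops when the score falls a set amount below the maximum reached so far, and the alignment is trimmed back to that maximum. Such an alignment is called a high-scoring segment pair (HSP), and only HSPs with a score equal to or greater than the minimal score S are kept. Later versions of BLAST then extend the best HSPs again with gaps allowed.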
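The three steps above can be sketched for nucleotide sequences as a toy seed-and-extend search. This is an illustration under simplifying assumptions, not the BLAST implementation: it skips low-complexity masking, seeds on exact words only, does no gapped extension, and uses made-up scoring parameters (match, mismatch, x_drop, min_score). All function names are hypothetical.

```python
from collections import defaultdict

def index_words(subject, w):
    """Map every word of length w in the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - w + 1):
        index[subject[pos:pos + w]].append(pos)
    return index

def extend_one_way(query, subject, q, s, step, score, match, mismatch, x_drop):
    """Extend without gaps from (q, s); return (best score, positions kept)."""
    best, best_len, length = score, 0, 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the maximum: stop extending
        q += step
        s += step
    return best, best_len

def toy_blast(query, subject, w=4, match=1, mismatch=-3, x_drop=5, min_score=6):
    """Return sorted (score, query_start, query_end, subject_start) tuples."""
    index = index_words(subject, w)
    hsps = set()  # a set, because several seeds can extend to the same HSP
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):  # exact word hits
            score = w * match
            score, right = extend_one_way(query, subject, q + w, s + w, 1,
                                          score, match, mismatch, x_drop)
            score, left = extend_one_way(query, subject, q - 1, s - 1, -1,
                                         score, match, mismatch, x_drop)
            if score >= min_score:  # keep only segments reaching the cutoff
                hsps.add((score, q - left, q + w + right, s - left))
    return sorted(hsps, reverse=True)

print(toy_blast("TTACGTACGG", "CCCACGTACGAAA"))  # prints [(7, 2, 9, 3)]
```

In this example the shared segment ACGTACG is found from four overlapping seeds that all extend to the same segment, while a fifth seed on another diagonal scores only 4 and is discarded.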
References
- Altschul, S. F. (2001). BLAST algorithm. eLS.