Homology search against a local dataset using NCBI-BLAST+ command-line tool

Tariq Abdullah

The NCBI-BLAST+ [1] command-line tools offer many functions for working with large sequence datasets. Previously, we showed how to run BLAST against a local dataset of sequences. This article explains how to search a local database for sequences homologous to a query sequence and how to retrieve the top 100 hits from the results.

To perform a homology search against a local database, follow the steps below:

1. Install NCBI-BLAST+ on Ubuntu

Open a terminal (Ctrl+Alt+T) and type the following command:

$ sudo apt-get install ncbi-blast+
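After installation, it can help to confirm that the three programs used in this article are available. The following is a minimal sketch; the function name check_blast_tools is our own and is not part of BLAST+.

```shell
# check_blast_tools is a hypothetical helper name, not part of BLAST+.
# It prints one line per program, saying whether it is on the PATH.
check_blast_tools() {
    for tool in makeblastdb blastp blastdbcmd; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

check_blast_tools
```

If all three are reported as found, running blastp -version prints the installed version.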

2. Make a BLAST database of your sequences

$ makeblastdb -in input.fasta -parse_seqids -dbtype prot -out blastdb

The details of these arguments are given in the previous article.

We have used -dbtype prot here because we are demonstrating with protein sequences. If you are working with nucleotide sequences, set -dbtype nucl and use blastn instead of blastp in the search step.
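Before building the database, a quick check of the input FASTA file can save time. The sketch below counts the records and lists any repeated sequence IDs, since makeblastdb with -parse_seqids generally expects each ID to be unique. The function names are our own assumptions, and the ID is taken to be the first word after '>'.

```shell
# Hypothetical helpers (the names are ours, not part of BLAST+).

# Count FASTA records by counting header lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print every sequence ID (first word after '>') that occurs more than once.
duplicate_fasta_ids() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}

# Example usage, assuming input.fasta is in the current directory:
# count_fasta_records input.fasta
# duplicate_fasta_ids input.fasta
```

If duplicate_fasta_ids prints nothing, no repeated IDs were found.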

3. Perform homology search

$ blastp -query query.fasta -db blastdb -outfmt '6 sseqid' -max_target_seqs 100 -out homologousids.txt

Here, -query specifies the file containing the input query sequence (‘query.fasta’).

-db is the local BLAST database.

-outfmt defines the output format. ‘6 sseqid’ means tabular format with only the subject sequence ID column.

-max_target_seqs sets the maximum number of hits to keep in the output; here it is set to 100. You can set it to any positive number.

-out defines the output filename.

This command produces a plain text file containing the sequence IDs of the homologous sequences (up to 100), one per line.
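Tabular output is written per alignment, so a subject sequence that aligns to the query in more than one region may appear on several lines. The following sketch removes repeated IDs while keeping the original order of the hits. The function name and the output file name are assumptions made for illustration.

```shell
# unique_ids is a hypothetical helper name.
# Keep only the first occurrence of each line, preserving the original order.
unique_ids() {
    awk '!seen[$0]++' "$1"
}

# Example usage (homologousids.unique.txt is an assumed file name):
# unique_ids homologousids.txt > homologousids.unique.txt
```

Using awk rather than sort -u keeps the IDs in the order in which BLAST reported them.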

4. Extract sequences of those homologous sequence ids.

In this step, we will obtain the sequences of all homologous sequence ids from the constructed local database. This can be achieved by using the blastdbcmd binary of the NCBI-BLAST+ package.

$ blastdbcmd -db blastdb -entry_batch homologousids.txt -out homologseqs.fasta -outfmt %f

Here, -entry_batch specifies a file for batch processing. The file should contain one entry per line; each line begins with a sequence ID, optionally followed by space-delimited specifiers.

-outfmt %f means output in FASTA format.

There are several other output formats; run blastdbcmd -help for the full list.

The output file (homologseqs.fasta) will contain the sequences of the top hits (up to 100) of the homology search.
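As a final sanity check, the number of extracted sequences can be compared with the number of distinct IDs that were requested. This is only a sketch: compare_counts is a hypothetical helper name, and a mismatch may point to repeated IDs in the list or to IDs that were not found in the database.

```shell
# compare_counts is a hypothetical helper name.
# Usage: compare_counts IDS_FILE FASTA_FILE
# Succeeds (exit status 0) only when the two counts are equal.
compare_counts() {
    # Number of distinct, non-empty lines in the ID list.
    ids=$(sort -u "$1" | grep -c . || true)
    # Number of FASTA header lines in the extracted sequences.
    seqs=$(grep -c '^>' "$2" || true)
    echo "unique ids: $ids, sequences: $seqs"
    [ "$ids" -eq "$seqs" ]
}

# Example usage with the ID list from step 3 and the FASTA file from step 4:
# compare_counts homologousids.txt homologseqs.fasta
```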

References

1. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10(1), 421.
Tariq is founder of Bioinformatics Review and Lead Developer at IQL Technologies. His areas of expertise include algorithm design, phylogenetics, MicroArray, Plant Systematics, and genome data analysis. If you have questions, reach out to him via his homepage.