The NCBI-BLAST+ [1] command-line tools offer multiple functions for working with large sets of sequences. Previously, we showed how to run BLAST against a local dataset of sequences. This article explains how to search a local database for sequences homologous to a query sequence and how to retrieve the top 100 hits from the search results.
To perform a homology search against a local database, follow the steps given below:
1. Install NCBI-BLAST+ on Ubuntu
Open a terminal (Ctrl+Alt+T) and type the following command:
$ sudo apt-get install ncbi-blast+
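After installation, it can help to confirm that the three programs used in this article are available on your PATH. The following is a minimal sketch; the function name check_blast_tools is hypothetical and chosen only for illustration.

```shell
# Hypothetical helper: report whether each BLAST+ program used in this
# article can be found on the PATH.
check_blast_tools() {
    for tool in makeblastdb blastp blastdbcmd; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: not found"
        fi
    done
}

check_blast_tools
```

If a program is reported as found, running it with -version (for example, blastp -version) prints the installed version.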
2. Make BLAST database of your sequences
$ makeblastdb -in input.fasta -parse_seqids -dbtype prot -out blastdb
The details of these arguments are given in the previous article.
We have used -dbtype prot since we are demonstrating with protein sequences. If you are working on nucleotide sequences, define the database type as -dbtype nucl and use blastn instead of blastp in the next step.
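Before building the database, it may be worth checking the input FASTA file, because repeated sequence identifiers typically cause makeblastdb to stop with an error when -parse_seqids is used. The sketch below counts the records and lists repeated identifiers. The helper name duplicate_fasta_ids and the demo file are assumptions made for illustration, and the identifier is taken to be the first word after ">".

```shell
# Hypothetical helper: list sequence identifiers that occur more than once
# in a FASTA file. The identifier is the first word after ">".
duplicate_fasta_ids() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort | uniq -d
}

# Demo on a small made-up protein file; use input.fasta in practice.
demo_dir=$(mktemp -d)
cat > "$demo_dir/demo.fasta" <<'EOF'
>seq1 example protein
MKTAYIAKQR
>seq2 another example
MSDNELLKQA
>seq1 repeated identifier
MKTAYIAKQR
EOF

# Number of records (header lines) in the file.
echo "records: $(grep -c '^>' "$demo_dir/demo.fasta")"
# Identifiers that appear more than once (empty if there are none).
echo "duplicate ids: $(duplicate_fasta_ids "$demo_dir/demo.fasta")"
```

On the demo file this reports three records and seq1 as the repeated identifier.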
3. Perform homology search
$ blastp -query query.fasta -db blastdb -outfmt '6 sseqid' -max_target_seqs 100 -out homologousids.txt
Here:
-query defines the input query sequence, saved in the file ‘query.fasta’.
-db is the local BLAST database.
-outfmt defines the output format. ‘6 sseqid’ means tabular output containing only the subject sequence ID (Subject Seq-id).
-max_target_seqs defines the maximum number of hits to report; here it is set to 100. You can set it to any positive integer.
-out defines the output filename.
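This command produces a plain text file containing the sequence IDs of the hits, one per line, up to the limit of 100 target sequences. A subject sequence that aligns to the query in more than one region can be listed more than once.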
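Because the tabular output has one line per alignment, the same ID can appear several times in the file, which may lead to repeated sequences in the extraction step. A minimal sketch for removing repeats while keeping the original order is shown here. The helper name unique_ids and the demo IDs are made up for illustration.

```shell
# Hypothetical helper: drop repeated IDs and blank lines, keeping the first
# occurrence of each ID so the original hit order is preserved.
unique_ids() {
    awk 'NF && !seen[$1]++ { print $1 }' "$1"
}

# Demo with made-up IDs; use homologousids.txt in practice.
demo_dir=$(mktemp -d)
printf '%s\n' seqA seqB seqA seqC seqB > "$demo_dir/ids.txt"

# Prints seqA, seqB and seqC, one per line.
unique_ids "$demo_dir/ids.txt"
```

In practice you could run unique_ids homologousids.txt > uniqueids.txt (uniqueids.txt is a hypothetical filename) and pass that file to the extraction step instead.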
4. Extract the sequences of the homologous sequence IDs
In this step, we will obtain the sequences of all homologous sequence ids from the constructed local database. This can be achieved by using the blastdbcmd binary of the NCBI-BLAST+ package.
$ blastdbcmd -db blastdb -entry_batch homologousids.txt -out homologseqs.fasta -outfmt %f
Here:
-entry_batch names the input file for batch processing. Each entry must be on its own line, beginning with the sequence ID, optionally followed by space-delimited specifiers (such as a range or strand).
-outfmt %f writes the output in FASTA format.
There are several other output formats; running blastdbcmd -help lists them in detail.
The output file (homologseqs.fasta) will contain the sequences of the top hits of the homology search (up to 100).
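As a final sanity check, you may want to confirm that the number of extracted sequences matches the number of distinct IDs that were requested. The following sketch compares the two counts. The helper name check_extraction and the demo files are assumptions for illustration only.

```shell
# Hypothetical helper: succeed when the FASTA file ($2) contains exactly one
# record per distinct ID listed in the ID file ($1).
check_extraction() {
    n_ids=$(awk 'NF { print $1 }' "$1" | sort -u | wc -l | tr -d ' ')
    # grep -c exits non-zero when there are no matches, so guard it.
    n_seqs=$(grep -c '^>' "$2" || true)
    echo "unique ids: $n_ids, sequences: $n_seqs"
    [ "$n_ids" -eq "$n_seqs" ]
}

# Demo with made-up files; in practice use homologousids.txt and
# homologseqs.fasta.
demo_dir=$(mktemp -d)
printf '%s\n' seqA seqB seqA > "$demo_dir/ids.txt"
printf '>seqA\nMKTAYIAKQR\n>seqB\nMSDNELLKQA\n' > "$demo_dir/seqs.fasta"

if check_extraction "$demo_dir/ids.txt" "$demo_dir/seqs.fasta"; then
    echo "counts match"
else
    echo "counts differ"
fi
```

A mismatch does not say which entries are missing or repeated; it only signals that the ID list and the FASTA file should be inspected more closely.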
References
1. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10(1), 421.