Modifying multi-FASTA files using Bash: ‘Sed’ Command

Editing thousands of FASTA sequences by hand is tedious. Command-line tools make small, repetitive edits easy: removing, adding, or substituting characters in headers, changing the sequence format, and so on. For such tasks, Bash and standard shell utilities such as sed provide a simple way to work on multi-FASTA files.

Here are some simple sed commands for manipulating FASTA headers in multi-FASTA files.

1. To remove everything after the first ‘/’ or ‘_’ (along with the delimiter itself) from FASTA headers.

$ sed 's|/.*||' input.fasta > output.fasta

$ sed 's|_.*||' input.fasta > output.fasta
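As a quick check, the sketch below applies both substitutions to a small made-up file. The file names and headers are assumptions for illustration only, and the `/^>/` address is an optional addition that limits the edit to header lines.

```shell
# Create a small, made-up multi-FASTA file (illustrative only).
printf '>seq1/partA_extra\nATGCATGC\n>seq2_variant/1\nGGCCTTAA\n' > sample.fasta

# /^>/ limits the substitution to header lines (those starting with '>').
# Drop everything from the first '/' onward:
sed '/^>/ s|/.*||' sample.fasta > no_slash.fasta

# Drop everything from the first '_' onward:
sed '/^>/ s|_.*||' sample.fasta > no_underscore.fasta

cat no_slash.fasta no_underscore.fasta
```

With this input, the first command leaves the headers `>seq1` and `>seq2_variant`, and the second leaves `>seq1/partA` and `>seq2`. The sequence lines are untouched in both cases.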

2. To remove everything after the last ‘/’ or ‘_’ (along with the delimiter itself) from FASTA headers.

$ sed 's|/[^/]*$||' input.fasta > output.fasta

$ sed 's|_[^_]*$||' input.fasta > output.fasta
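The difference from the first-delimiter case is easiest to see on a header that contains the delimiter more than once. This is a minimal sketch with a made-up header; the file names are hypothetical and the `/^>/` address is an optional safeguard that restricts the edit to header lines.

```shell
# Made-up record whose header contains several '/' and '_' characters.
printf '>sp/P12345/HUMAN_v1_final\nATGCATGC\n' > sample_last.fasta

# [^/]*$ matches only characters that are not '/', up to the end of the line,
# so the match must start at the LAST '/'.
sed '/^>/ s|/[^/]*$||' sample_last.fasta > last_slash.fasta

# Same idea for the last '_'.
sed '/^>/ s|_[^_]*$||' sample_last.fasta > last_underscore.fasta

cat last_slash.fasta last_underscore.fasta
```

Here the header becomes `>sp/P12345` in the first output and `>sp/P12345/HUMAN_v1` in the second.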

3. To remove all FASTA headers and output only the sequences.

$ sed '/^>/d' input.fasta > output.fasta

4. To remove everything after the first dot (.), along with the dot itself, from FASTA headers.

$ sed 's|[.].*$||' input.fasta > output.fasta

5. To replace every dot with an underscore (_) in FASTA headers (provided no dot is present in the sequences).

$ sed 's|\.|_|g' input.fasta > output.fasta
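If the sequence lines might contain dots (for example, as gap characters in an alignment), one option is to limit both dot-related edits to header lines with a `/^>/` address. The sketch below assumes a made-up file; all file names are hypothetical.

```shell
# Made-up record with dots in both the header and the sequence line.
printf '>NM_000546.6.extra\nATGC.ATGC\n' > sample_dot.fasta

# Remove everything from the first dot onward, on header lines only.
sed '/^>/ s|[.].*$||' sample_dot.fasta > dot_trimmed.fasta

# Replace every dot with an underscore, on header lines only.
sed '/^>/ s|[.]|_|g' sample_dot.fasta > dot_replaced.fasta

cat dot_trimmed.fasta dot_replaced.fasta
```

The headers become `>NM_000546` and `>NM_000546_6_extra` respectively, while the sequence line `ATGC.ATGC` keeps its dot in both outputs.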

6. To delete a specific number of characters (n) from the end of each FASTA header.

$ sed '/^>/ s/.\{n\}$//' input.fasta > output.fasta

Here, replace n with the specific number, for example, 3, 5, 10, etc.
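Rather than editing the number inside the expression each time, the count can be passed in through a shell variable. This is a sketch under the assumption that the characters to drop sit at the end of the header; the variable name `n` and the file names are illustrative.

```shell
# Made-up records whose headers end in a 3-character version suffix.
printf '>gene1.v1\nATGCATGC\n>gene2.v2\nGGCCTTAA\n' > sample_trim.fasta

n=3  # number of trailing characters to delete from each header

# Double quotes let the shell expand $n; the end-of-line anchor is written
# as \$ so the shell passes a literal '$' to sed.
sed "/^>/ s/.\{$n\}\$//" sample_trim.fasta > trimmed.fasta

cat trimmed.fasta
```

With n=3, the headers `>gene1.v1` and `>gene2.v2` become `>gene1` and `>gene2`. Headers shorter than n characters do not match and are left as they are.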

7. To remove all characters after the first space (along with the space itself) in FASTA headers.

$ sed '/^>/ s/ .*//' input.fasta > output.fasta
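A common use of this is reducing long database headers to the bare identifier. The example below is a minimal sketch on a made-up record; the file names are hypothetical.

```shell
# Made-up record with an identifier followed by a free-text description.
printf '>NM_000546.6 Homo sapiens tumor protein p53\nATGCATGC\n' > sample_desc.fasta

# On header lines, delete the first space and everything after it.
sed '/^>/ s/ .*//' sample_desc.fasta > ids_only.fasta

cat ids_only.fasta
```

The header is reduced to `>NM_000546.6` and the sequence line is unchanged.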

These are a few examples of ‘sed’ commands. Other tools, such as awk, grep, and Perl one-liners (perl -e), can be used for similar operations on multi-FASTA files.
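For comparison, here is a rough sketch of how two of the operations might look with grep and awk instead of sed. These are possible equivalents rather than commands from the original list, and the file names are made up.

```shell
# Made-up two-record file for illustration.
printf '>seq1 first record\nATGC\n>seq2 second record\nGGCC\n' > sample_tools.fasta

# grep -v prints only lines that do NOT match, so this drops all headers.
grep -v '^>' sample_tools.fasta > seqs_grep.txt

# awk: on header lines, delete from the first space onward; print every line.
awk '/^>/ { sub(/ .*/, "") } { print }' sample_tools.fasta > ids_awk.fasta

cat seqs_grep.txt ids_awk.fasta
```

The grep command keeps only the sequence lines `ATGC` and `GGCC`, and the awk command shortens the headers to `>seq1` and `>seq2`.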

Tariq is founder of Bioinformatics Review and a professional Software Developer at IQL Technologies. His areas of expertise include algorithm design, phylogenetics, MicroArray, Plant Systematics, and genome data analysis. If you have questions, reach out to him via his homepage.

HOW TO CITE THIS ARTICLE Tariq Abdullah (2020). Modifying multi-FASTA files using Bash: ‘Sed’ Command. Bioinformatics Review, 6 (06)