Handling thousands of FASTA sequences by hand is tedious. A little scripting simplifies many small tasks on FASTA sequences or their headers, such as removing, adding, or substituting characters in the header, or changing the sequence format. For such tasks, shell commands offer a quick solution.
Here are some simple sed commands to manipulate FASTA headers in multi-FASTA files.
1. To remove everything from the first ‘/’ or ‘_’ onward in FASTA headers.
$ sed 's|/.*||' input.fasta > output.fasta
$ sed 's|_.*||' input.fasta > output.fasta
2. To remove everything from the last ‘/’ or ‘_’ onward in FASTA headers.
$ sed 's|/[^/]*$||' input.fasta > output.fasta
$ sed 's|_[^_]*$||' input.fasta > output.fasta
3. To remove all FASTA headers and output only sequences.
$ sed '/^>/d' input.fasta > output.fasta
4. To remove everything from the first dot (.) onward in FASTA headers.
$ sed 's|[.].*$||' input.fasta > output.fasta
5. To replace a dot with an underscore (_) in FASTA header (provided no dot is present in the sequence).
$ sed 's|\.|_|g' input.fasta > output.fasta
6. To delete a specific number of characters (n) from the end of the FASTA header.
$ sed '/^>/ s|.\{n\}$||' input.fasta > output.fasta
Here, replace n with the specific number, for example, 3, 5, 10, etc.
7. To remove all characters from the first space onward in the FASTA header.
$ sed '/^>/ s| .*||' input.fasta > output.fasta
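To see how the "first" and "last" patterns differ, it can help to run a few of these commands on a tiny test file. The sketch below is only an illustration: the record names (gene_A_v2, gene_B_v1) and the temporary directory are made up, and the `/^>/` address is added so that each substitution touches header lines only.

```shell
# Hypothetical two-record sample file, created in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/input.fasta" <<'EOF'
>gene_A_v2 sample one
ATGCGT
>gene_B_v1 sample two
TTGACA
EOF

# First '_' onward: '.*' is greedy, so the match runs from the first '_'
# to the end of the line. Both headers are reduced to '>gene'.
sed '/^>/ s|_.*||' "$tmpdir/input.fasta"

# Last '_' onward: '[^_]*$' cannot cross another '_', so only the final
# '_' and what follows it are removed ('>gene_A', '>gene_B').
sed '/^>/ s|_[^_]*$||' "$tmpdir/input.fasta"

# First space onward: keeps only the ID ('>gene_A_v2', '>gene_B_v1').
sed '/^>/ s| .*||' "$tmpdir/input.fasta" > "$tmpdir/ids.fasta"

# Sequences only: every header line is deleted.
sed '/^>/d' "$tmpdir/input.fasta" > "$tmpdir/seqs.txt"
```

Note that trimming at the first ‘_’ turns both headers in this sample into the same ID, so it is worth checking that headers stay unique after any trimming step.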
These are a few examples of ‘sed’ commands. Other tools such as awk, grep, and perl -e can perform similar operations on multi-FASTA files.
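As a rough illustration of how grep and awk can cover some of the same ground, here is a minimal sketch. The sample file and its record names are hypothetical, and these are just one possible way to write each operation.

```shell
# Hypothetical two-record sample, created only so the commands run as-is.
demo=$(mktemp -d)
printf '>seq1 first record\nATGC\n>seq2 second record\nGGTA\n' > "$demo/sample.fasta"

# grep: count the records by counting header lines.
grep -c '^>' "$demo/sample.fasta"

# grep: print sequences only (same result as sed '/^>/d').
grep -v '^>' "$demo/sample.fasta"

# awk: trim header lines at the first space (same result as sed '/^>/ s| .*||').
awk '/^>/ { sub(/ .*/, "") } { print }' "$demo/sample.fasta"
```

In the awk command, the first rule edits header lines only, and the second rule prints every line whether or not it was changed.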