Bioinformatics Programming

Extract FASTA sequences based on sequence length using Perl

Dr. Muniba Faiza

Here are simple Perl scripts to extract sequences from a multi-FASTA file based on sequence length.

Let’s say our input file containing multiple FASTA sequences is ‘input.fasta’. The script below prints every sequence whose length is at least the given minimum.
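#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl extractfasta.pl input.fasta <minlen>
my ($infile, $minlen) = @ARGV;
die "Usage: $0 input.fasta <minlen>\n" unless defined $minlen;

# Open the input file for reading.
open(my $fh, '<', $infile) or die "Cannot open $infile: $!\n";
{
    local $/ = ">";    # read one FASTA record at a time
    while (my $record = <$fh>) {
        chomp $record;                      # remove the trailing ">"
        next unless $record =~ /\w/;        # skip the empty first chunk
        my ($header, @seq) = split /\n/, $record;
        my $seqlen = length join "", @seq;  # sequence length without newlines
        print ">$record" if $seqlen >= $minlen;
    }
}
close($fh);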

Save this Perl script as ‘extractfasta.pl’ and run it in the terminal as follows:

$ perl extractfasta.pl input.fasta <minlen> > output.fasta

For example,

$ perl extractfasta.pl input.fasta 100 > output.fasta
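
To see what the record-separator approach does, here is a minimal sketch that applies the same minimum-length logic to a small in-memory FASTA string instead of a file. The sub name `filter_min_length` and the sample sequences are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch): returns the records of
# a multi-FASTA string whose sequence length is at least $minlen.
sub filter_min_length {
    my ($fasta, $minlen) = @_;
    my @kept;
    # Open a read-only filehandle on the string so it can be read like a file.
    open(my $fh, '<', \$fasta) or die "Cannot read in-memory FASTA: $!\n";
    {
        local $/ = ">";                      # one FASTA record per read
        while (my $record = <$fh>) {
            chomp $record;                   # strip the trailing ">"
            next unless $record =~ /\w/;     # skip the empty first chunk
            my ($header, @seq) = split /\n/, $record;
            my $seqlen = length join "", @seq;
            push @kept, ">$record" if $seqlen >= $minlen;
        }
    }
    close($fh);
    return @kept;
}

# Made-up sample data: the sequence lengths are 4, 10 and 6.
my $sample = ">seq1\nACGT\n>seq2\nACGTACGTAC\n>seq3\nACG\nTTT\n";

# With a minimum length of 6, only the seq2 and seq3 records are printed.
print filter_min_length($sample, 6);
```

Note that seq3 is split over two lines; its length is counted after the lines are joined, so line wrapping in the input does not affect the result.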

If you want to set a maximum length limit as well, then use the following script.
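#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl extractfasta.pl input.fasta <minlen> <maxlen>
my ($infile, $minlen, $maxlen) = @ARGV;
die "Usage: $0 input.fasta <minlen> <maxlen>\n" unless defined $maxlen;

# Open the input file for reading.
open(my $fh, '<', $infile) or die "Cannot open $infile: $!\n";
{
    local $/ = ">";    # read one FASTA record at a time
    while (my $record = <$fh>) {
        chomp $record;                      # remove the trailing ">"
        next unless $record =~ /\w/;        # skip the empty first chunk
        my ($header, @seq) = split /\n/, $record;
        my $seqlen = length join "", @seq;  # sequence length without newlines
        # Keep records whose length lies between minlen and maxlen, inclusive.
        print ">$record" if $seqlen >= $minlen && $seqlen <= $maxlen;
    }
}
close($fh);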

Save this Perl script as ‘extractfasta.pl’ and run it in the terminal as follows:

$ perl extractfasta.pl input.fasta <minlen> <maxlen> > output.fasta

For example,

$ perl extractfasta.pl input.fasta 100 350 > output.fasta
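
The length window can also be written as a small predicate, which makes the boundary behaviour explicit. The sketch below assumes that both limits are inclusive and that the maximum is optional. The sub name `length_in_range` is hypothetical, and the lengths tested are made up around the 100 to 350 window from the example command.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical predicate: true when $len lies within [$min, $max].
# Assumption: both bounds are inclusive; an undefined $max means no upper limit.
sub length_in_range {
    my ($len, $min, $max) = @_;
    return 0 if $len < $min;
    return 0 if defined $max && $len > $max;
    return 1;
}

# Made-up lengths checked against the 100..350 window.
for my $len (99, 100, 350, 351) {
    printf "%d: %s\n", $len, length_in_range($len, 100, 350) ? "keep" : "skip";
}
```

Under these assumptions, sequences of exactly 100 or 350 residues are kept, while 99 and 351 are skipped.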

Dr. Muniba is a bioinformatician based in New Delhi, India. She completed her PhD in Bioinformatics at South China University of Technology, Guangzhou, China. She has cutting-edge knowledge of bioinformatics tools, algorithms, and drug design. When she is not reading, she enjoys spending time with her family.

1 Comment

  1. [email protected]

    April 13, 2021 at 3:52 am

    I got an error when using the second command:

    readline() on unopened filehandle at extractfasta.pl line 7

    This is what I ran:
    perl extractfasta.pl /path-to/BAC4A_L00M_R1_001.fasta 50 100 > 100_maxln.fasta

    The input FASTA is OK; I don't know what is wrong ):
