Here are simple Perl scripts to extract FASTA sequences from a multi-FASTA file based on sequence length.
Let’s say our input file consisting of multiple FASTA sequences is ‘input.fasta’.
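#!/usr/bin/perl
use strict;
use warnings;
# Usage: perl extractfasta.pl input.fasta <minlen>
my ($infile, $minlen) = @ARGV;
die "Usage: $0 input.fasta <minlen>\n" unless defined $minlen;
# $infile is only a file name; it must be opened before it can be read
open(my $fh, '<', $infile) or die "Cannot open $infile: $!\n";
{
    local $/=">";    # read one FASTA record (up to the next ">") at a time
    while(<$fh>) {
        chomp;              # remove the trailing ">" separator
        next unless /\w/;   # skip the empty chunk before the first ">"
        my @keep = split /\n/;
        my $header = shift @keep;    # first line is the header, the rest is sequence
        my $seqlen = length join "", @keep;
        if($seqlen >= $minlen){
            print ">$_";
        }
    }
}
close($fh);
exit;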
Save this Perl script as ‘extractfasta.pl’ and run it in the terminal as
$ perl extractfasta.pl input.fasta <minlen> > output.fasta
For example,
$ perl extractfasta.pl input.fasta 100 > output.fasta
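The script relies on setting Perl's input record separator `$/` to ">", so that each read returns one whole FASTA record instead of one line. The sketch below shows that idea on a small in-memory string. The helper name `fasta_lengths` and the demo sequences are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [header, sequence length] pairs for FASTA text.
sub fasta_lengths {
    my ($text) = @_;
    # Opening a reference to a scalar gives an in-memory filehandle
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!\n";
    local $/ = ">";    # each read now returns one record, ending at the next ">"
    my @out;
    while (my $rec = <$fh>) {
        chomp $rec;                  # strip the trailing ">" separator
        next unless $rec =~ /\w/;    # the chunk before the first ">" is empty
        my ($header, @seq) = split /\n/, $rec;
        push @out, [$header, length(join("", @seq))];
    }
    close($fh);
    return @out;
}

# Made-up demo input: two records, the second wrapped over two lines
my $demo = ">seq1 short\nACGT\n>seq2 longer\nACGTACGT\nACGT\n";
printf "%s\t%d\n", @$_ for fasta_lengths($demo);
```

Sequence lines are joined before measuring, so a sequence wrapped over several lines is counted by its total length.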
If you want to set a maximum length limit as well, then use the following script.
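#!/usr/bin/perl
use strict;
use warnings;
# Usage: perl extractfasta.pl input.fasta <minlen> <maxlen>
my ($infile, $minlen, $maxlen) = @ARGV;
die "Usage: $0 input.fasta <minlen> <maxlen>\n" unless defined $maxlen;
# $infile is only a file name; it must be opened before it can be read
open(my $fh, '<', $infile) or die "Cannot open $infile: $!\n";
{
    local $/=">";    # read one FASTA record (up to the next ">") at a time
    while(<$fh>) {
        chomp;              # remove the trailing ">" separator
        next unless /\w/;   # skip the empty chunk before the first ">"
        my @keep = split /\n/;
        my $header = shift @keep;    # first line is the header, the rest is sequence
        my $seqlen = length join "", @keep;
        # keep only sequences within both limits (inclusive)
        if($seqlen >= $minlen && $seqlen <= $maxlen){
            print ">$_";
        }
    }
}
close($fh);
exit;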
Save this Perl script as ‘extractfasta.pl’ and run it in the terminal as
$ perl extractfasta.pl input.fasta <minlen> <maxlen> > output.fasta
For example,
$ perl extractfasta.pl input.fasta 100 350 > output.fasta
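If you would rather keep a single script for both cases, the length check can be pulled into a small function that treats a missing maximum as "no upper limit". This is only a sketch: the name `length_ok` is hypothetical, and both limits are assumed to be inclusive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $len lies within [$min, $max], both inclusive.
# An undefined $max means there is no upper limit.
sub length_ok {
    my ($len, $min, $max) = @_;
    return 0 if $len < $min;
    return 0 if defined $max && $len > $max;
    return 1;
}

# Show how lengths at and around the limits are treated
for my $len (99, 100, 350, 351) {
    printf "%d with limits 100-350: %s\n",
        $len, length_ok($len, 100, 350) ? "kept" : "dropped";
}
```

With this check, `my ($infile, $minlen, $maxlen) = @ARGV;` leaves `$maxlen` undefined when only two arguments are given, so the same script would behave like the minimum-only version.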
I got an error when using the second command:
readline() on unopened filehandle at extractfasta.pl line 7
This is what I ran:
perl extractfasta.pl /path-to/BAC4A_L00M_R1_001.fasta 50 100 > 100_maxln.fasta
The input FASTA is OK, I don't know what is wrong ):
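That message usually means the `<...>` read was given something that was never opened. A read of the form `<$infile>`, where `$infile` holds only the file name as a string, does exactly that, so the input file itself can be fine. A minimal sketch of opening the file first and reading through the resulting filehandle follows. The helper names `open_fasta` and `count_records` are hypothetical and only serve to show the pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: open a FASTA file for reading, or die with the reason.
sub open_fasta {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    return $fh;
}

# Hypothetical helper: count the records, reading from the filehandle
# returned by open_fasta rather than from the file name.
sub count_records {
    my ($path) = @_;
    my $fh = open_fasta($path);
    local $/ = ">";    # one FASTA record per read
    my $n = 0;
    while (my $rec = <$fh>) {
        chomp $rec;
        $n++ if $rec =~ /\w/;    # ignore the empty chunk before the first ">"
    }
    close($fh);
    return $n;
}

# Only runs when a file name is given on the command line
print count_records($ARGV[0]), "\n" if @ARGV;
```

If the path is wrong or unreadable, this dies with the system's reason instead of silently reading nothing.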