Perl one-liners for bioinformaticians


Perl one-liners are extremely short Perl scripts written as a string of commands that fits on one line, which for most purposes means a bit less than 80 characters. Here’s the obligatory “Hello World!” one-liner in Perl and its output:

$ perl -e 'print "Hello World!\n";'
Hello World!

Try it! (Of course, Perl must be installed on your computer for the “perl” command to work.)

The most common and useful way to use such one-liners is as stream processors on the command line, sometimes connected by pipes to other utilities typical of a Linux command-line environment. To process the stream, one would commonly use Perl regular expression syntax to match (m/string/) or substitute (s/string1/string2/). Let us use “echo” to generate a single empty line to act upon and “-p” to tell Perl to loop over the input lines and print the $_ variable (the current line) after processing each one:

$ echo | perl -pe 's/$_/Hello World!\n/;'
Hello World!

Notice that Perl iterates over all lines of the input. To see this, first create a file named “test” containing three empty lines, then run:

$ cat test | perl -pe 's/$_/Hello World!\n/;'
Hello World!
Hello World!
Hello World!

Finally, let us introduce the “-i” switch, which makes Perl apply the changes directly to a supplied file:

$ perl -pi -e 's/$_/Hello World!\n/;' test2

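This will overwrite the contents of test2, leaving “Hello World!” on every line (provided the lines contain no regular-expression metacharacters, since $_ is used as the pattern here; s/.*/Hello World!/ would be more robust). Needless to say, the “-i” switch can be quite dangerous because of its ability to completely overwrite files. Giving it a suffix, as in “-i.bak”, makes Perl keep a backup copy of the original file.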

Suppose you would like to number the lines of a file. This is a no-brainer with Perl one-liners! Just replace the beginning of each line with its number (to change the file itself rather than print the result, add the “-i” switch and pass the file name as an argument):

$ cat test2 | perl -pe '$i++; s/^/$i: /;'
1: Hello World!
2: Hello World!
3: Hello World!

The “^” symbol denotes the beginning of the line in Perl regular expressions. Notice that the one-liner actually contains two Perl statements separated by a semicolon (;).

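Bioinformaticians often process FASTA files with nucleotide or amino-acid sequences. Suppose you have a FASTA file (here called sequences.fasta, a placeholder name) that you would like to convert to a format where every sequence occupies only one line, so that you can apply “grep” to look for a specific k-mer in the sequence (say TATATAA for a TATA box). This can be done by removing the end-of-line character from every non-header line and putting a line break back in front of every header except the first:

$ cat sequences.fasta | perl -pe 's/^([^>]+)\n/$1/; s/^>/\n>/ if $. > 1; END{print "\n"}' | grep -B1 TATATAA

The “$1” is a special Perl variable that holds whatever the first pair of parentheses in the regular expression matched. Here that is the entire content of a line that does not begin with a “>” character (“^” inside brackets, as in “[^>]”, means NOT “>”, so only non-header lines are selected). The second substitution uses “$.”, Perl’s current input line number, to start every header after the first on a new line; without it, each header would end up glued to the end of the preceding sequence. The END block prints a final newline after the last sequence, and “grep -B1” prints each matching sequence together with the header line above it.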

Perl one-liners can be very useful in ad-hoc processing or parsing of files and streams from a plethora of sources. Additional examples of clever Perl one-liners can be found here or here.


Matej Lexa is a bioinformatician at the Faculty of Informatics of Masaryk University in Brno, Czech Republic. His main interests include mathematical modelling, sequence analysis, structural bioinformatics and evolutionary genomics. He has published more than 20 peer-reviewed papers on these and similar subjects.

4 Comments

  1. Very nice article for a beginner. Thanks for that! I would ask you to turn this article into a monthly or weekly series with gradually increasing levels of difficulty, to help readers advance in learning Perl.

  2. Thank you for the encouragement! I have a plan to put together a few more articles in a similar “hacking for bioinformaticians” spirit, although some may go beyond Perl. I have plans for introducing Biopieces, for example. There is a lot of information on specific tools out there, but sometimes it is not immediately clear which and how they can be applied in bioinformatics. I’d like to focus on the “which” and “how”.
