Bioinformatics Programming - Page 7

A collection of articles on bioinformatics programming published in Bioinformatics Review.

Perl one-liners for bioinformaticians

Perl one-liners are extremely short Perl scripts written as a string of commands that fits on one line, which for most purposes means a bit less than 80 characters. Here is the obligatory “Hello World!” one-liner in Perl and its output:

$ perl -e 'print "Hello World!\n";'
Hello World!

Try it! (Of course, Perl must be installed on your computer for the “perl” command to work.)

The most common and useful way to use such one-liners is as stream processors on the command line, sometimes connected by pipes to other utilities typical of a Linux command-line environment. To process the stream, one commonly uses Perl regular expression syntax to match (m/string/) or substitute (s/string1/string2/). Let us use “echo” to generate a single empty line to act upon and “-p” to tell Perl to loop over the input and print the $_ variable (the current line) after processing each line:

$ echo | perl -pe 's/$_/Hello World!\n/;'
Hello World!

Notice that Perl iterates over all lines of the input (first create a file named “test” containing three empty lines):

$ cat test | perl -pe 's/$_/Hello World!\n/;'
Hello World!
Hello World!
Hello World!

Finally, let us introduce the “-i” switch to make Perl apply the changes directly to the supplied file:

$ perl -pi -e 's/$_/Hello World!\n/;' test2

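This will result in the contents of test2 getting overwritten, with “Hello World!” now present on every line! Note that $_ is interpolated into the pattern as a regular expression, so lines containing metacharacters such as “(” or “+” may fail to match or raise an error. Needless to say, the “-i” switch can be quite dangerous because of its ability to completely overwrite files.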

Suppose you would like to number the lines of a file. This is a no-brainer with Perl one-liners! Just replace the beginning of each line with its number (add the “-i” switch to write the result back to the file instead of printing it):

$ cat test2 | perl -pe '$i++; s/^/$i: /;'
1: Hello World!
2: Hello World!
3: Hello World!

The “^” symbol denotes the beginning of the line in Perl regular expressions. Notice that the one-liner actually contains two Perl statements separated by a semicolon (;).

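Bioinformaticians often process FASTA files with nucleotide or amino-acid sequences. Suppose you have a FASTA file (called sequences.fasta here) that you would like to convert to a format where every sequence occupies only one line, so that you can apply “grep” to look for a specific k-mer in the sequence (say TATATAA for the TATA-box). This can be done by removing the end-of-line character on non-header lines and starting a new line before every header except the first:

$ cat sequences.fasta | perl -pe 's/^([^>]+)\n/$1/; s/^>/\n>/ if $. > 1; END{print "\n"}' | grep -B1 TATATAA

The “$1” is a special Perl variable created in regular expressions whenever you enclose something in parentheses. Here we do that with entire lines that do not begin with a “>” character (“^” inside brackets, as in “[^>]”, means NOT “>”; in this case it selects non-header lines). The second substitution puts a newline in front of every header after the first one (“$.” holds the current line number); without it, each header would end up glued to the end of the preceding sequence. The “-B1” option makes “grep” print the line before each match, which here is the header.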

Perl one-liners can be very useful in ad-hoc processing or parsing of files and streams from a plethora of sources. Additional examples of clever Perl one-liners can be found here or here.
