Bioinformatics Programming - Page 8

A collection of articles on bioinformatics programming published in Bioinformatics Review.

HTSeq: A Python framework to analyze high-throughput sequencing data


High-throughput sequencing is widely used because it saves a lot of time and gives good results, but it produces huge amounts of data that are difficult to manage and even harder to process. To ease these tasks, Simon Anders and colleagues introduced a Python framework known as “HTSeq”.

TIN: R package to analyze Transcriptome Instability


Alternative splicing plays an essential role in the proper functioning of eukaryotic cells. It acts as a regulatory mechanism for gene expression, and alternative splicing of pre-mRNA is a major source of genetic variation in human beings. Disruption of the splicing process may lead to human diseases such as cancer.

Perl one-liners for bioinformaticians


Perl one-liners are extremely short Perl scripts written as a string of commands that fits onto one line. For most purposes that amounts to a bit less than 80 characters. Here’s the obligatory “Hello World!” one-liner in Perl and its output:

$ perl -e 'print "Hello World!\n";'
Hello World!

Try it! (Of course, Perl must be installed on your computer for the “perl” command to work.)

The most common and useful way to use such one-liners is as stream processors on the command line, sometimes connected by pipes to other utilities typical of a Linux command-line environment. To process the stream one would commonly use Perl regular expression syntax to match (m/string/) or substitute (s/string1/string2/). Let us use “echo” to generate a single empty line to act upon and “-p” to tell Perl to loop over the input and print the $_ variable (the current line) after processing each line. Here $_ holds only the newline, so the substitution replaces that newline with the greeting:

$ echo | perl -pe 's/$_/Hello World!\n/;'
Hello World!

Notice that Perl iterates over all lines of the input (first create a file test with 3 empty lines):

$ cat test | perl -pe 's/$_/Hello World!\n/;'
Hello World!
Hello World!
Hello World!

Finally, let us introduce the “-i” switch to make Perl do the changes directly on a supplied file:

$ perl -pi -e 's/$_/Hello World!\n/;' test2

This overwrites the contents of test2, so that every line now reads “Hello World!”. Needless to say, the “-i” switch can be quite dangerous because of its ability to completely overwrite files.

Suppose you would like to number the lines of a file. This is a no-brainer with Perl one-liners! Just replace the beginning of each line with its number (add the “-i” switch to make the change directly in the file):

$ cat test2 | perl -pe '$i++; s/^/$i: /;'
1: Hello World!
2: Hello World!
3: Hello World!

The “^” symbol denotes the beginning of the line in Perl regular expressions. Notice that the one-liner actually contains two Perl statements separated by a semicolon (;).

Bioinformaticians often process FASTA files with nucleotide or amino-acid sequences. Suppose you have a FASTA file you would like to convert to a format where every sequence occupies only one line, so that you can apply “grep” to look for a specific k-mer in the sequence (say TATATAA for TATA-box). This can be easily done by removing every end-of-line symbol on non-header lines:

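$ cat test2 | perl -pe 's/^([^>]+)\n/$1/; s/^>/\n>/ if $. > 1; END{print "\n"}' | grep -B1 TATATAA

The “$1” is a special Perl variable that holds whatever the first pair of parentheses in the regular expression matched. Here the parentheses capture entire lines that do not contain a “>” character (“^” in brackets, as in “[^>]”, means NOT “>”), that is, the non-header lines, and the substitution puts them back without their newline. The second substitution adds a newline in front of every header except the first (“$.” is the current line number); without it, each header after the first would end up glued to the end of the preceding sequence. The END block prints the final newline once all input has been read.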

Perl one-liners can be very useful for ad-hoc processing or parsing of files and streams from a plethora of sources. Many additional examples of clever Perl one-liners can be found online.
