
Cloud Computing

Big Data in Bioinformatics

Dr. Muniba Faiza



With the ever-increasing amount of biological data generated by advanced tools and techniques, a number of methods have been developed in parallel to handle this vast amount of data and to make it presentable, accessible, and logically organized. Because the data are so voluminous, Big Data management methods have proven effective at managing biological data in terms of both accessibility and cost.

Big data describes a large volume of data. In bioinformatics and computational biology, it represents a new paradigm that is transforming studies into large-scale research.

High-throughput experiments in bioinformatics and the growing trend toward personalized medicine are increasing the need to produce, store, and analyze massive datasets in a manageable time. The role of big data in bioinformatics is to provide data repositories, better computing facilities, and data-manipulation tools for analyzing the data.

Parallel computing is one of the fundamental infrastructures for managing big data tasks [1]. It allows algorithms to be executed simultaneously on a cluster of machines or supercomputers. Google proposed MapReduce, a novel parallel computing model, as a new big data infrastructure [2]. Similarly, Hadoop, an open-source MapReduce implementation, was introduced by Apache for distributed data management and has been successfully applied in bioinformatics [3]. Hadoop also provides cloud computing facilities for centralized data storage with remote access.
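To illustrate the MapReduce model described above, here is a minimal sketch in plain Python that simulates the map, shuffle, and reduce phases for counting k-mers in DNA reads. In a real Hadoop deployment these phases run across many machines; the function names and toy reads here are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(read, k=3):
    """Mapper: emit a (k-mer, 1) pair for every k-mer in a read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts for each k-mer."""
    return {kmer: sum(counts) for kmer, counts in grouped.items()}

reads = ["ATCGAT", "TCGATC"]
pairs = [pair for read in reads for pair in map_phase(read)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts["TCG"])  # "TCG" occurs once in each read, so 2
```

Because each mapper and each reducer touches only its own slice of the data, the framework can scale this pattern from two toy reads to billions of reads by adding machines.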

In the field of bioinformatics, big data technologies and tools can be grouped into four categories:

1. Data storage and retrieval:

Sequencing data need to be mapped to specific reference genomes for further analysis. For this purpose, CloudBurst, a parallel computing model, was developed [4]. It facilitates genome mapping by parallelizing the short-read mapping process, improving scalability for large sequencing datasets. The same group also developed new tools such as Contrail, for assembling large genomes, and Crossbow, for identifying SNPs in sequence datasets. Similarly, various other tools have been developed, such as DistMap (a toolkit for distributed short-read mapping on a Hadoop cluster) [5], SeqWare (for accessing large-scale whole-genome datasets) [6], the DDBJ Read Annotation Pipeline (a cloud-based pipeline for analyzing NGS data) [7], and Hydra (for processing large peptide and spectra databases) [8].
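The core idea behind this style of parallelized read mapping — partition the reads, map each partition independently, then merge the hits — can be sketched in plain Python. Exact substring matching stands in for a real alignment algorithm, and thread workers stand in for cluster nodes; the reference and reads are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

REFERENCE = "ATCGGCTAATCGGA"

def map_chunk(reads):
    """Map one partition of reads: report every exact hit position in the reference."""
    hits = []
    for read in reads:
        start = REFERENCE.find(read)
        while start != -1:
            hits.append((read, start))
            start = REFERENCE.find(read, start + 1)
    return hits

def parallel_map(reads, n_workers=2):
    """Split the reads into n_workers partitions and map them concurrently."""
    chunks = [reads[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(map_chunk, chunks)
    # Merge the per-partition hit lists into one sorted result.
    return sorted(hit for chunk_hits in results for hit in chunk_hits)

hits = parallel_map(["ATCGG", "TAAT", "GGGG"])
print(hits)  # [('ATCGG', 0), ('ATCGG', 8), ('TAAT', 6)]
```

The partitions share nothing, which is exactly what lets a framework like Hadoop distribute them across nodes instead of threads.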

2. Error Identification:

It is necessary to identify errors in sequence datasets, and many cloud-based software packages have been developed for this purpose. Examples include SAMQA [9], which identifies errors and ensures that large-scale genomic data meet minimum quality standards; ART [10], which simulates data for three major sequencing platforms, viz., 454, Illumina, and SOLiD; and CloudRS [11], which corrects errors in high-throughput sequencing data.
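A basic form of error screening — discarding reads whose mean Phred quality falls below a threshold, a common first step before full error correction — can be sketched as follows (the FASTQ-style records and the threshold are invented for illustration):

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred quality of a read, decoded from ASCII (Phred+33 encoding)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)

def filter_reads(records, min_quality=20):
    """Keep only (sequence, quality) records whose mean quality meets the threshold."""
    return [(seq, qual) for seq, qual in records if mean_phred(qual) >= min_quality]

records = [
    ("ATCG", "IIII"),  # 'I' decodes to Phred 40: high quality
    ("GGTA", "!!!!"),  # '!' decodes to Phred 0: low quality
]
kept = filter_reads(records)
print(len(kept))  # only the high-quality read survives, so 1
```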

3. Data Analysis:

This category of big data tools allows researchers to analyze the data obtained from their experiments. For example, GATK (Genome Analysis Toolkit) is a MapReduce-based programming framework for large-scale DNA sequence analysis [12] and supports many data formats (SAM, BAM, and others); the ArrayExpress Archive of Functional Genomics data repository is an international collaboration for integrating high-throughput genomics data [13]; and BlueSNP is used to analyze genome-wide association studies [14], among others.
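To give a concrete sense of the SAM format that tools like GATK consume, a minimal parser for the eleven mandatory fields of one alignment line might look like this (the record below is a hand-made example, not real data):

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Parse the 11 mandatory tab-separated fields of a SAM alignment record."""
    values = line.rstrip("\n").split("\t")[:11]
    record = dict(zip(SAM_FIELDS, values))
    # FLAG, POS, MAPQ, PNEXT, and TLEN are integers in the SAM specification.
    for field in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[field] = int(record[field])
    return record

line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tATCG\tIIII"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"])  # chr1 100
```

Real SAM files also carry header lines (starting with `@`) and optional tags after the eleventh field, which this sketch deliberately ignores.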

4. Platform Integration and Deployment:

Since not everyone has a good grasp of computing and networking, novel methods are needed to integrate big data technologies into user-friendly operations. Several software packages have been introduced for this purpose: SeqPig reduces the technical skill required to use MapReduce by reading large formatted files and feeding them to analysis applications [15]; CloVR is a sequence analysis package distributed through a virtual machine [16]; and CloudBioLinux provides pre-configured bioinformatics computing on demand [17], among others.



  1. Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: A literature review. Biomedical informatics insights, 8, 1.
  2. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
  3. Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC bioinformatics, 11(12), S1.
  4. Schatz, M. C. (2009). CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11), 1363-1369.
  5. Pandey, R. V., & Schlötterer, C. (2013). DistMap: a toolkit for distributed short read mapping on a Hadoop cluster. PLoS One, 8(8), e72614.
  6. O’Connor, B. D., Merriman, B., & Nelson, S. F. (2010). SeqWare Query Engine: storing and searching sequence data in the cloud. BMC bioinformatics, 11(12), S2.
  7. Nagasaki, H., Mochizuki, T., Kodama, Y., Saruhashi, S., Morizaki, S., Sugawara, H., … & Kaminuma, E. (2013). DDBJ read annotation pipeline: a cloud computing-based pipeline for high-throughput analysis of next-generation sequencing data. DNA research, dst017.
  8. Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H., Hoopmann, M. R., Moritz, R. L., … & Boyle, J. (2012). Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC bioinformatics, 13(1), 324.
  9. Robinson, T., Killcoyne, S., Bressler, R., & Boyle, J. (2011). SAMQA: error classification and validation of high-throughput sequenced read data. BMC genomics, 12(1), 419.
  10. Huang, W., Li, L., Myers, J. R., & Marth, G. T. (2012). ART: a next-generation sequencing read simulator. Bioinformatics, 28(4), 593-594.
  11. Chen, C. C., Chang, Y. J., Chung, W. C., Lee, D. T., & Ho, J. M. (2013, October). CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework. In Big Data, 2013 IEEE International Conference on (pp. 717-722). IEEE.
  12. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … & DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research, 20(9), 1297-1303.
  13. Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., … & Oezcimen, A. (2003). ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic acids research, 31(1), 68-71.
  14. Huang, H., Tata, S., & Prill, R. J. (2013). BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters. Bioinformatics, 29(1), 135-136.
  15. Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., & Heljanko, K. (2014). SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics, 30(1), 119-120.
  16. Angiuoli, S. V., Matalka, M., Gussman, A., Galens, K., Vangala, M., Riley, D. R., … & Fricke, W. F. (2011). CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC bioinformatics, 12(1), 356.
  17. Krampis, K., Booth, T., Chapman, B., Tiwari, B., Bicak, M., Field, D., & Nelson, K. E. (2012). Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC bioinformatics, 13(1), 42.

For more details write to [email protected].

How to cite this article:
Faiza, M., 2016. Big Data in Bioinformatics, 2(3):page 14-18. The article is available at


Dr. Muniba is a bioinformatician based in New Delhi, India. She completed her PhD in Bioinformatics at the South China University of Technology, Guangzhou, China. She has cutting-edge knowledge of bioinformatics tools, algorithms, and drug design. When she is not reading, she enjoys spending time with her family.



SparkBLAST: Introduction

Dr. Muniba Faiza



The basic local alignment search tool (BLAST) [1,2] is known for its speed and its results, and it is a primary step in sequence analysis. The ever-increasing demand for processing huge amounts of genomic data has led to the development of new scalable and highly efficient computational tools and algorithms. For example, MapReduce is the most widely accepted framework; it supports design patterns representing general reusable solutions to problems including biological assembly [3], and it handles large datasets efficiently, running over hundreds to thousands of processing nodes [4]. However, MapReduce implementation frameworks (such as Hadoop) are limited in their ability to process smaller data.
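The seeding step that underlies BLAST's speed — indexing the database into fixed-length words and looking up query words in that index before any extension is attempted — can be sketched in Python. The word length and sequences are illustrative, and real BLAST goes on to extend and score these seeds, which this sketch omits:

```python
from collections import defaultdict

def build_word_index(database, w=3):
    """Index every length-w word of the database sequence by its start positions."""
    index = defaultdict(list)
    for i in range(len(database) - w + 1):
        index[database[i:i + w]].append(i)
    return index

def find_seeds(query, index, w=3):
    """Return (query_pos, db_pos) pairs where a query word hits the index."""
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds

index = build_word_index("GATTACAGATTACA")
seeds = find_seeds("TTAC", index)
print(seeds)  # [(0, 2), (0, 9), (1, 3), (1, 10)]
```

Because the index is built once and each lookup is a hash-table hit, seeding avoids scanning the whole database for every query, which is what makes the subsequent (more expensive) extension step affordable.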



Cl-Dash: speeding up cloud computing in bioinformatics

Dr. Muniba Faiza



After much work in the field of bioinformatics, the genomes of many living organisms have been sequenced, and a great deal of information has been generated at the RNA and protein levels. This has given rise to a huge amount of biological data whose storage is now an issue, because such enormous data cannot be stored on a personal computer or a local server. For this purpose, cloud computing, the practice of managing and processing data using remote servers hosted on the internet, has been introduced into bioinformatics, though the origin of cloud computing itself is not very clear.

Cl-dash is a tool that facilitates research on novel bioinformatics data using Hadoop, a software framework that stores huge amounts of data and provides easy access to those data in relatively little time. The tool was developed by Paul Hodor, Amandeep Chawla, Andrew Clark, and Lauren Neal of Booz Allen Hamilton, USA.

The tool, “cl-dash”, is a starter kit that configures and deploys new Hadoop clusters in a few minutes. It runs on AWS (Amazon Web Services).

According to a paper published in Bioinformatics (November 2015), cl-dash is based on the distributed file system and the MapReduce programming pattern. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware. With cl-dash, a user creates clusters (nodes that store huge amounts of data) as an ‘admin’ through a set of command-line tools whose names begin with ‘cl-’ (hence the name “cl-dash”). A YAML configuration file (config.yml) is required, and a new cluster can be created in minutes. Once the Hadoop cluster is formed, the user can easily access the data.

Such tools are needed because biological data keep increasing, and with them the demand for large data storage space. Cl-dash provides a good pathway for managing such huge data.


An exhaustive list of references for this article is available with the author and is available on personal request, for more details write to [email protected].
