Alignment-free Approaches for Sequence Analysis

Multiple Sequence Alignment (MSA) is fundamental to bioinformatics: it is used to identify species and their functions, to reconstruct phylogeny, to study novel genes and proteins, and so on. Many MSA tools are available with different specifications, and most rely on heuristic algorithms that favour speed over accuracy. Because MSA underpins most sequence analysis, much research depends on it, yet it has long-recognized limitations that can affect results. These shortcomings may arise from differences in how individual algorithms and tools recognize indel events, introduce gaps, apply gap penalties, etc. As a consequence, different tools can produce different results from the same data; for example, it has been reported that phylogenetic trees constructed with different programs can differ [1]. The existence of several MSA benchmarks, such as OXBench and SABmark, may also affect reported alignment accuracy and reliability [2-4].

These limitations of MSA programs have led to the development of alignment-free sequence analysis [5]. A handful of theoretical frameworks, such as Information Theory, Chaos Theory, Linear Algebra, and Statistical Theory, have been proposed for multiple sequence comparison [5,6]. Among these, Information Theory has been found to be the most promising for multiple sequence analysis [5]; however, efforts to implement it are still ongoing.
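To make the information-theoretic idea concrete, the following minimal sketch compares two sequences without aligning them: each sequence is reduced to a distribution of overlapping k-mers (words of length k), and the two distributions are compared with the Jensen-Shannon divergence. This is only an illustration under stated assumptions; the function names, the choice of k = 3, and the choice of Jensen-Shannon divergence are ours and are not taken from the methods reviewed in [5].

```python
from collections import Counter
from math import log2


def kmer_distribution(seq, k=3):
    """Return the relative frequency of each overlapping k-mer in seq."""
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two distributions.

    The value is 0 for identical distributions and 1 for distributions
    that share no k-mer at all.
    """
    divergence = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            divergence += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * log2(qi / mi)
    return divergence


def alignment_free_distance(seq_a, seq_b, k=3):
    """Hypothetical helper: dissimilarity of two sequences, no alignment."""
    return jensen_shannon_divergence(
        kmer_distribution(seq_a, k), kmer_distribution(seq_b, k)
    )


if __name__ == "__main__":
    print(alignment_free_distance("ACGTACGTACGT", "ACGTTCGTACGA"))
```

Because only word frequencies are compared, no gaps or gap penalties are involved, which is the property that motivates alignment-free methods in the first place.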

Another recent approach, Biogeography-based Optimization (BBO), has been applied to the MSA problem [7]. It is based on the concept of emigration and immigration of species from one habitat to another. It has since been improved and proposed in the form of IBBOMSA (An Improved Biogeography-based Approach for Multiple Sequence Alignment) [8]. Its algorithm implements a mutation operator that calculates the probability of mutation in a given species; in the authors' comparison tests, IBBOMSA was found to be the most accurate of the tools considered [8].
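The migration idea behind BBO can be sketched on a toy problem. In the sketch below each habitat is a candidate solution (a list of 0/1 features), good habitats share features with poor ones through emigration and immigration, and a mutation step occasionally flips a feature. It is not the IBBOMSA algorithm: the binary encoding, the linear rank-based migration rates, the fixed `mutation_rate`, and the function name `bbo_maximize` are all assumptions made for illustration. In an MSA setting a habitat would instead encode a candidate alignment and the fitness would be an alignment score.

```python
import random


def bbo_maximize(fitness, n_features, pop_size=20, generations=50,
                 mutation_rate=0.02, seed=0):
    """Toy Biogeography-based Optimization over binary feature vectors."""
    rng = random.Random(seed)
    habitats = [[rng.randint(0, 1) for _ in range(n_features)]
                for _ in range(pop_size)]

    # Linear migration model by rank (rank 0 = best habitat):
    # good habitats emigrate a lot and immigrate little, and vice versa.
    emigration = [(pop_size - rank) / pop_size for rank in range(pop_size)]
    immigration = [1.0 - mu for mu in emigration]

    for _ in range(generations):
        habitats.sort(key=fitness, reverse=True)  # best habitat first
        elite = habitats[0][:]
        new_habitats = [h[:] for h in habitats]
        for i in range(pop_size):
            for f in range(n_features):
                # Immigration: copy feature f from a habitat chosen with
                # probability proportional to its emigration rate.
                if rng.random() < immigration[i]:
                    source = rng.choices(range(pop_size),
                                         weights=emigration)[0]
                    new_habitats[i][f] = habitats[source][f]
                # Mutation: flip the feature with a small fixed probability
                # (a simplification of species-count-based mutation).
                if rng.random() < mutation_rate:
                    new_habitats[i][f] = 1 - new_habitats[i][f]
        # Elitism: the previous best solution replaces the habitat that
        # was ranked worst before migration, so the best is never lost.
        new_habitats[-1] = elite
        habitats = new_habitats

    return max(habitats, key=fitness)


if __name__ == "__main__":
    # Toy fitness: the number of ones in the vector.
    best = bbo_maximize(sum, n_features=16)
    print(best, sum(best))
```

The elitism step is a common design choice rather than part of the basic migration idea; it guarantees that the best fitness found so far never decreases between generations.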

A related alignment-free approach to the analysis of DNA sequences, based on the characterization of complex networks, was proposed by Zhou et al. (2016) [9]. It builds on a code of three cis nucleotides in a gene that could code for an amino acid [9]. Graphical representations of DNA have also been proposed for sequence comparison [10], including 2D [11-16], 3D [17-21], 4D [22], 5D [23], and 6D [24] representations of DNA sequences in the form of matrices.
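As a rough illustration of what a 2D graphical representation involves, the sketch below turns a DNA sequence into a curve by taking one unit step per nucleotide and then summarizes the curve by its centroid. The axis assignment in `STEPS`, the centroid descriptor, and the function names are assumptions chosen for simplicity; they do not reproduce any specific representation from [11-16].

```python
from math import hypot

# Assumed mapping: purines move along the x axis, pyrimidines along y.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def dna_walk_2d(seq):
    """Return the list of (x, y) points visited by the 2D walk of seq."""
    x = y = 0
    points = [(0, 0)]
    for base in seq.upper():
        dx, dy = STEPS[base]  # raises KeyError for non-ACGT symbols
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def curve_centroid(points):
    """Geometric centre of the curve, used as a crude numerical descriptor."""
    n = len(points)
    return (sum(px for px, _ in points) / n, sum(py for _, py in points) / n)


def centroid_distance(seq_a, seq_b):
    """Euclidean distance between the centroids of two sequence curves."""
    ax, ay = curve_centroid(dna_walk_2d(seq_a))
    bx, by = curve_centroid(dna_walk_2d(seq_b))
    return hypot(ax - bx, ay - by)


if __name__ == "__main__":
    print(dna_walk_2d("ACGT"))
    print(centroid_distance("ACGTACGT", "ACGTTTGT"))
```

A single descriptor such as the centroid discards most of the curve's shape; published methods typically derive richer numerical characterizations, for example matrices built from the curve, before comparing sequences.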

Comparing sequences without aligning them may be a good alternative to alignment programs, but considerable work by the scientific community is still needed before such methods are fully usable.

References:

1. Wong KM, Suchard MA, Huelsenbeck JP. Alignment uncertainty and genomic analysis. Science. 2008;319:473–6.
2. Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999;27(13):2682–90. http://doi.org/10.1093/nar/27.13.2682
3. Raghava GPS, Searle SMJ, Audley PC, Barber JD, Barton GJ. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics. 2003;4:47.
4. Van Walle I, Lasters I, Wyns L. SABmark – A benchmark for sequence alignment that covers the entire known fold space. Bioinformatics. 2005;21(7):1267–8.
5. Vinga S, Almeida J. Alignment-free sequence comparison—a review. Bioinformatics. 2003;19(4):513–23.
6. Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27:379–423, 623–56.
7. Simon D. Biogeography-based optimization. IEEE Trans Evol Comput. 2008;12:702–13.
8. Yadav RK, Banka H. IBBOMSA: An Improved Biogeography-based Approach for Multiple Sequence Alignment. Evol Bioinform Online. 2016;12:237.
9. Zhou J, Zhong P, Zhang T. A Novel Method for Alignment-free DNA Sequence Similarity Analysis Based on the Characterization of Complex Networks. Evol Bioinform Online. 2016;12:229.
10. Qi X, Wu Q, Zhang Y, Fuller E, Zhang C-Q. A novel model for DNA sequence similarity analysis based on graph theory. Evol Bioinform Online. 2011;7:149–58.
11. Guo X, Randić M, Basak SC. A novel 2-D graphical representation of DNA sequences of low degeneracy. Chem Phys Lett. 2001;350:106–12.
12. Randić M, Vračko M, Lerš N, Plavšić D. Analysis of similarity/dissimilarity of DNA sequence based on novel 2-D graphical representation. Chem Phys Lett. 2003;371:202–7.
13. Randić M, Vračko M, Zupan J, Novič M. Compact 2-D graphical representation of DNA. Chem Phys Lett. 2003;373:558–62.
14. Randić M. Graphical representations of DNA as 2-D map. Chem Phys Lett. 2004;386:468–71.
15. Liu X, Dai Q, Xiu Z, Wang T. PNN-curve: a new 2D graphical representation of DNA sequences and its application. J Theor Biol. 2006;243:555–61.
16. Huang G, Liao B, Li Y, Liu Z. H curves: a novel 2D graphical representation for DNA sequences. Chem Phys Lett. 2008;462:129–32.
17. Liao B, Wang T. 3-D graphical representation of DNA sequences and their numerical characterization. J Mol Struct (Theochem). 2004;681:209–12.
18. Qi X, Wen J, Qi Z. New 3D graphical representation of DNA sequence based on dual nucleotides. J Theor Biol. 2007;249:681–90.
19. Qi Z, Fan T. PN-curve: a 3D graphical representation of DNA sequences and their numerical characterization. Chem Phys Lett. 2007;442:434–40.
20. Cao Z, Liao B, Li R. A group of 3D graphical representation of DNA sequences based on dual nucleotides. Int J Quantum Chem. 2008;108:1485–90.
21. Yu J, Sun X, Wang J. TN curve: a novel 3D graphical representation of DNA sequence based on trinucleotides and its applications. J Theor Biol. 2009;261:459–68.
22. Chi R, Ding K. Novel 4D numerical representation of DNA sequences. Chem Phys Lett. 2005;407:63–7.
23. Liao B, Li R, Zhu W. On the similarity of DNA primary sequences based on 5-D representation. J Math Chem. 2007;42:47–57.
24. Liao B, Wang T. Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping trinucleotides of nucleotide bases. J Chem Inform Comput Sci. 2004;44:1666–70.