Bioinformaticians handle large amounts of data, ranging from gigabytes up to terabytes, so it is important not only to store such massive data but also to make sense of it. In this article, I will explain what data mining is and how bioinformaticians can benefit from it.
What is data mining?
Data mining is the process of discovering new patterns, information, or understandable models in large amounts of data that already exist. It is sometimes also referred to as “Knowledge Discovery in Databases” (KDD). It has been applied successfully in bioinformatics, a data-rich field that depends on tasks such as gene expression analysis, protein modeling, and drug discovery. The development of novel data mining methods provides a useful way to understand rapidly expanding biological data. Let’s first discuss the basic concepts of data mining and then move on to its applications in bioinformatics. I will also discuss some data mining tools in upcoming articles.
As defined earlier, data mining is the process of automatically extracting information from existing data. Its two major goals are “prediction” and “description”. The main tasks it can perform are as follows:
- Classification: Learning a function that maps (classifies) an input data item into one of several predefined classes (i.e., classes already present in the existing data).
- Estimation: Assigning a continuous numeric value to an input data item, rather than a discrete class.
- Prediction: Involves classification or estimation, but the data items are classified on the basis of some future behavior or estimated future value.
- Association rules: Also known as dependency modeling; it determines which data items are associated with each other and what outcomes those associations may lead to.
- Clustering: Separating the population into subgroups, or clusters, of similar items.
- Description & Visualization: Describing and representing the data with the help of visualization techniques and tools.
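To make the association rules task a little more concrete, here is a minimal sketch of the two measures usually used to score a rule, support and confidence. The annotation labels and the toy "transactions" are made up purely for illustration, and the helper names `support` and `confidence` are my own, not taken from any particular library.

```python
# Toy "transactions": each set lists hypothetical annotations observed
# together for one gene. The data is invented for illustration only.
transactions = [
    {"kinase", "membrane", "upregulated"},
    {"kinase", "upregulated"},
    {"kinase", "membrane"},
    {"transporter", "membrane"},
    {"kinase", "upregulated", "nuclear"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base


# Score the hypothetical rule "kinase -> upregulated".
rule_support = support({"kinase", "upregulated"}, transactions)
rule_confidence = confidence({"kinase"}, {"upregulated"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data the two items occur together in three of the five transactions, and three of the four kinase transactions are also upregulated. Real association rule miners search over many candidate itemsets rather than scoring one hand-picked rule.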
Learning from data falls into two main categories:
Directed (supervised) learning and undirected (unsupervised) learning.
Classification, estimation, and prediction fall under supervised learning, while the remaining three tasks (association rules, clustering, and description & visualization) fall under unsupervised learning. In the former category, relationships are established between the input variables and a known target; in the latter, patterns are identified in the data without any predefined target.
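The difference between the two categories can be sketched on the same toy data. In the snippet below, the expression values and the "healthy"/"disease" labels are invented, and the function names (`train_centroids`, `classify`, `two_means`) are hypothetical helpers of mine. The supervised part uses the labels to learn one mean per class; the unsupervised part throws the labels away and simply groups the values with a very simple two-cluster k-means.

```python
from statistics import mean

# Hypothetical expression levels of one gene across six samples (made-up numbers).
labeled = [
    (1.0, "healthy"), (1.2, "healthy"), (0.9, "healthy"),
    (3.1, "disease"), (2.9, "disease"), (3.3, "disease"),
]


# --- Supervised: the labels are known and used during training. ---
def train_centroids(samples):
    """Compute the mean expression value of each labeled class."""
    by_class = {}
    for value, label in samples:
        by_class.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_class.items()}


def classify(value, centroids):
    """Assign a new value to the class whose mean is closest."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


centroids = train_centroids(labeled)
predicted = classify(2.7, centroids)


# --- Unsupervised: the same values, but the labels are discarded. ---
def two_means(values, iterations=10):
    """Very small 1-D k-means with k=2, seeded at the min and max values."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups


clusters = two_means([value for value, _ in labeled])
print(predicted)
print(clusters)
```

Note that the clustering step recovers two groups but has no idea what they mean; attaching names such as "healthy" or "disease" to them still requires outside knowledge, which is exactly what the labels supply in the supervised case.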
Data mining has proved very effective and useful in bioinformatics, in areas such as microarray analysis, gene finding, domain identification, protein function prediction, disease identification, and drug discovery.