Question: So, more than 200 web servers and 30 databases, and more than 200 research papers in reputed journals. So how do you come up with a team and diverse ideas to implement those?
Answer: That’s an interesting question. I never thought that I would publish 200 papers or develop 200 web servers. After completing my M. Tech from IIT Delhi, I took a job in 1986 at the Institute of Microbial Technology, Chandigarh. I was really happy to get a Class-1 officer position in the central government at the age of 23. I was the odd one out in the institute, as most of the scientists were biologists. I was hired to look after the computer center, so I was called the computer scientist. Initially, I wondered why they gave me the designation of computer scientist when my duty was mainly to operate and maintain the institute’s computers. From 1986 to 1990, I developed different types of general software, such as payslip and inventory management programs. I realized that this type of general software has a limited impact (service at the institute level) and a limited half-life. I was not happy with the work, as it was not utilizing my full potential, so I was searching for more challenging problems. Meanwhile, in 1990 a research scholar, Mr. Anish Joshi, came to me and asked me to implement software in immunology. The software was simple: “Calculation of antibody-antigen concentration from ELISA data”. I studied the algorithm behind the existing software and realized that I could develop a better one. Thus, we developed more efficient software and demonstrated experimentally that it was better than any existing software. We sent a paper based on our software to the Journal of Immunological Methods (impact factor ~3), and it was published without any revision. It was basically a wow moment: a person who had been hired as the computer scientist had published a paper in a high-impact-factor journal. After that, I thought that whether or not I am there, my contributions and papers will remain in the literature. This way I could provide better service to the whole scientific community rather than limited services to the users of my institute.
So, that was the starting point. After that, we developed software for different types of biological problems. We talk to the scientific community and read the literature to identify the problems the community faces. We develop software for these problems and publish software-based papers. To serve the community, we distribute these software packages, including source code, to the public. We talk to researchers and students to understand their problems, particularly in the field of biology, so that we can provide solutions. In simple words, we are not working for ourselves; we are working for the community. This is the reason we have contributed to diverse fields over the years.
Question: Sir, bioinformatics is really an interesting field. How did you get interested in this particular area? Was your background in biology, in computer science, or in both?
Answer: As I told you, I was hired by the institute to provide computer services, so my background is computer science. During my M. Tech I did a lot of programming, which is why my programming is strong. When I started my career, though, I had nearly no knowledge of biology, only up to high school like everyone else. I applied my computer science skills in the field of biological science; the major challenge was to understand the problems faced by biologists. Computers have always been in demand in biology, though in India the field has become popular only in the last 20 years. In the 1960s, Prof. G.N. Ramachandran computed the possible dihedral angles in a protein structure; the result is known as the “Ramachandran plot”. I consider bioinformatics a challenging field, as it integrates two diverse fields, computers and biology. In addition, it has huge potential because it allows us to mine useful information from experimental data that is growing at an exponential rate. This is the reason it is my favorite field, and I have been working in this area for the last 30 years.
Question: Sir, you have worked with international organizations like EBI, Oxford, and UAMS. Based on your experience, what suggestions do you have for people working in India to reach the same level?
Answer: There are two or three points I would like to mention. One of the major challenges in India is forming a team to solve complex problems. As individuals, we Indians are as good as anybody in the world, so there is no problem in performing at an individual level. The major challenge in India is solving a problem as a team, because of internal fighting. It is commonly observed that team members blame each other: if you talk to the junior students, they will blame their seniors or guides, and if you talk to the guides, they will blame their students. This is the major problem in India; otherwise, as individuals, we are as good as anybody in the world. In interdisciplinary areas like bioinformatics, a team is a must, as it is nearly impossible for an individual to become an expert in multiple fields. The second thing is infrastructure, because infrastructure is important for performing world-class research in India. For example, in 1996 Oxford University had all the facilities, such as fast internet and an intranet over fiber optics, while at that time we had limited facilities; the internet was slow and not available at most organizations in India. Another point is that people there know a hire-and-fire policy is in place: if you work, you get all the advantages, and if you don’t work, you get fired. In India this is not the situation; particularly in the government sector, the hire-and-fire policy is not effective. Once you join, you get the same salary whether you work or not, so people adopt that kind of behavior. In India, those who have a strong desire to contribute to society are performing to international standards.
Question: Sir, you have been a significant contributor to the open source community, you have developed many web servers and databases, and you have kept everything free for the users. Although you could have made a fortune over that and you could have commercialized the codes. What inspires you to do that?
Answer: I will go back again to 1986-1990, when I joined the institute and was appointed as the computer scientist. At that time, there was no internet, and email was the only electronic medium of communication. Fortunately, we established an email facility in our institute in 1991, which most institutes in our country did not have. We used it to access and retrieve biological data from databases for our biologists. At that time, there was an email server at EMBL (EBI did not exist yet), which maintained a repository of free software. One could send an email with the appropriate commands to EMBL to download the desired software with its source code. Even then, you could not download big software in a single email, but we could get it in pieces; we compiled it and provided service to the users. Organizations in Europe and the USA maintained email servers and provided all resources free to the public. These facilities and resources are heavily used by the Indian scientific community. I have not seen a similar trend in India; most of us do not share our resources with the community. In India, we are so possessive. We think that we will live forever, and we do not share information with anybody; that is why we are not growing. We have done a lot of work in the past that we could claim, but it is not well documented and not shared with the public. As a result, when the person who generated the data dies, the data dies too; this is the common problem. Instead of following the Indian system, we followed the international standard, where any discovery is documented and shared with the public in order to utilize its full potential. The USA has already built a number of public repositories like PubMed, and Europe also maintains resources at EBI, so why can’t India do it? Our group has contributed open source software and resources over the years; sometimes we even faced objections about why we were not charging.
Our logic for providing free resources was that Indians are also using resources developed by other countries. In developing countries like India, it is difficult to afford costly commercial software, and freeware provides an alternative. Thus, researchers in India can use state-of-the-art free software without spending huge amounts on commercial software. This saves the country a lot of money, so indirectly I am helping to avoid unnecessary spending. In one of our calculations, we showed that usage of our software and resources has saved India more than Rs 800 crores. It’s a big amount; it’s not visible, but the heavy usage of our software by the scientific community indicates it clearly.
Question: Sir, what’s up with your lab nowadays? What are you working on?
Answer: I have around 8-10 Ph.D. students working in my lab. My research group is different from others, as I do not follow the traditional path. In the traditional system, the guide usually assigns a research project to a student depending on the needs of the guide’s own project. Most PIs work in a focused area, so the student has to work in that area only. This is not the situation in my group, because I am not interested in my own career; I am interested in solving biological problems and in training Ph.D. students. When a student joins my group, I ask a few questions, such as “What are your interests?”, “What are your skills?” and “What important problem do you wish to solve?”. Based on the student’s expertise and skills, we assign a problem. This is the reason that if I have 10 students in my group, they are working on different problems. One of the students is working on “Probiotics and Prebiotics”. Though this is a new field to me, the student is interested; she wants to understand what probiotics and prebiotics are, so she is working on that. Recently, a girl joined my group as an M. Tech student; she wants to work on rare genetic diseases. I have never worked on rare diseases before, but I allow students to work on them. In this process, I will also learn a new topic along with my student. So, currently, we are working on a rare disease caused by lysosomal enzymes. Other students are working on different problems, such as protein-ligand interactions and immunosuppressive peptides. Vaccine development is a major part of our lab; we have been working in the field of immunology for the last 20 years. Recently, we completed a very interesting project on “P-features”. In this project, we integrated algorithms for computing the features of a protein that have been discovered by different researchers over the last 30 years.
Generating protein features is an essential part of any software developed for the classification or annotation of a protein. To make predictions about a protein, you have to compute its features, such as the amino acid composition, and all these features have been discovered over the years by my group as well as by other groups. The problem is that if a new student joins and has to learn all these tools, it takes a long time. So, in the last month, my whole group worked together so that all the possible features of a protein can be calculated by a single piece of software, which will be available as a web server as well as source code. So, anybody working in the field of protein annotation or protein structure prediction can easily use our tools and will not have to reinvent the wheel.
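As an illustration of the simplest feature mentioned in the answer above, amino acid composition, here is a minimal Python sketch. It is not the code of the software described in the interview; the function name `amino_acid_composition` and the example sequence are hypothetical, and the sketch assumes composition is reported as the percentage of each of the 20 standard residues.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each standard amino acid in a sequence.

    Non-standard characters (e.g. 'X' or gaps) are ignored.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # A fixed-order dictionary gives a 20-dimensional feature vector.
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical 10-residue peptide, used only for demonstration.
example = "MKTAYIAKQR"
composition = amino_acid_composition(example)
feature_vector = [composition[aa] for aa in AMINO_ACIDS]
```

The resulting 20 numbers can be fed to a classifier as a fixed-length description of a protein, whatever its length.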
Question: Sir, what do you think are the most interesting areas in bioinformatics?
Answer: That’s an interesting question; in my opinion, all fields are interesting. We need to understand the difference between biologists, experimentalists, and bioinformaticians. Bioinformaticians do not generate their own data, so we cannot discover entirely new things on our own. To work in bioinformatics, one needs sufficient data. For example, we are getting a lot of data, particularly in genomics and proteomics, so genomics-based and proteomics-based biomarkers have a lot of scope. As a lot of data is available, we can discover different types of biomarkers. If you are going to work in bioinformatics, first you have to see what your interest is and what major problems you wish to address. Once you figure out the problem, the next step is to assess your own potential, whether you can do it or not. It is important to judge your own capability to solve the problem. A large number of researchers come into bioinformatics without judging whether they are capable of developing tools or not. So, that’s the important part: you have to judge your strength. After knowing your strength and understanding the problem, the next question is whether sufficient data is available. Even if you are highly skilled and have figured out the problem, you can’t do anything if sufficient data is not available, because in bioinformatics everything is based on data. Overall, I would say it more or less depends on the person: whether a particular problem excites you, and whether you have the capability to solve it.
Question: As new technologies are evolving, where do you see a bioinformatician working in 50 years? Does he have a future?
Answer: Yes, and the reason is simple. In biology, despite all the developments and progress over the years, such as microarray data, ChIP-seq data, and RNA-Seq data, we still do not understand even 1% of living organisms. We have limited knowledge and we are working on pieces; that is why, although a lot of data has already been generated, we still do not fully understand the function of a cell. Bioinformatics has huge scope in the future, because biologists are generating data at an exponential rate; even maintaining this data is a challenge. Mining important information from big biological data is a major challenge for bioinformaticians. The most important question is whether an individual has the ability to solve these bioinformatics problems. That’s the big problem: anybody who jumps into bioinformatics considering only its scope will not be successful. You have to check what your capability is, where you can fit in, and what you can do. It’s not about whether genomics, proteomics, or protein structure prediction is important; there are a lot of areas in bioinformatics, and you have to see where you fit in. If you are good at computers, you take one type of problem; if you are good at biology, you take another type; if you are good at chemistry, you take yet another. So, there is a lot of scope, without any doubt.
Question: What method have you found most helpful in training your research staff/team in the use of databases? Which technique have you found quite helpful?
Answer: Regarding learning within the group, frankly speaking, I do not support the traditional approach, where one person teaches and all the others learn. In my group, we support learning from each other, like a network, one-to-many and many-to-one. I am there, but they work in a healthy environment; they talk to and learn from each other. If they are not able to understand something, they come to me; similarly, if I do not understand something, I talk to and learn from my students. So, training, that is, human resource development, is a major challenge for me (not only for database development but for any bioinformatics work), because if we do not train the next generations, we will lose that information, and eventually the data will be lost. I want to make sure the field grows; therefore, I have different concepts for learning. First, it has been my record over the years, including at my previous institute, that when a student who is not getting paid or is facing problems in their Ph.D. comes into my group, I try my best to shape their career. So, the trust between me and the students is quite strong. We maintain a healthy environment in the group, where we talk to each other and learn. Currently, we are developing software; I gave you the example of P-feature, where we have computed all the features of a protein and made it a package. If anyone starts working in that area from scratch, it will take a long time, but if he uses my software, he can learn in a few days. In the same way, the infrastructure is there in my group and everybody has access, even to each other’s workspace. Let’s say one web server is developed by student X; student Y also has access to it, so that they can learn from each other, because you learn faster from examples than from big lectures. So, openness helps in training my own students, because they are not fighting or competing with each other; they are happily working together.
We have organized a large number of workshops, seminars, training programs, and conferences in the last 25 years to share knowledge with the new generation. On average, we have organized at least two programs every year. In these programs, we provide training on the latest trends, including databases. For example, earlier they were using the MySQL database to store data, as the data was small and structured. Due to the growth of data, particularly unstructured data, we need to use NoSQL technology. Thus, we are teaching modern data management technologies like Hadoop and MongoDB. It is important for students to learn the next-generation data management systems.
Question: What is your all-time favorite piece of bioinformatics software and why would you prefer that?
Answer: If you are asking about software developed by other groups, I would say BLAST, specifically PSI-BLAST. It not only performs the search but also helps you generate evolutionary profiles (in the form of PSSM profiles). Evolutionary information is important for predicting the structure or function of a protein.
Question: Sir, you have developed a lot of software and tools, which computer language do you use for developing them?
Answer: In the last 30 years, I have used many programming languages for developing software; I enjoy learning a new programming language. My first scientific software was in GW-BASIC, the next was in Pascal, and another was in FORTRAN, so I enjoy writing code. If you ask for a number, I would say that I know at least 20 programming languages. So, the important question is which language you use for what kind of work. Earlier, during the initial phase, I frequently used C, as it has many advantages. In 1996, I started to use Perl, which I learned during my stay with the Oxford group. Most of my work, such as structure prediction, can be done in Perl, and probably quickly. I am not saying that Perl is a faster language than other programming languages, but if you are doing small jobs, Perl is one of the best choices. So, I have used Perl for a number of software packages. Nowadays, I am switching more towards Python, because Python has a lot of libraries; even if you are not writing much code yourself, you can use these libraries to implement machine learning or data mining techniques. So, nowadays, we are focusing more on Python.
Question: So, coming to Python and Perl, big data, AI, and blockchain: as a bioinformatician, which one are you focusing on, for example big data? Do you see any role for blockchain in the future for bioinformaticians?
Answer: Regarding AI and big data, that’s an interesting question; it is like old wine in a new bottle. For example, in deep learning, we use a neural network with a large number of hidden layers and units, as well as combinations of neural networks. This concept was developed a long time back; the only limitation was that at that time we had limited data and resources. Our group used combinations of neural networks, called hybrid and cascade networks, in methods developed for predicting the structure and function of proteins. One should be careful in implementing AI techniques. In the last two decades, we have used support vector machines (SVMs) heavily for developing methods. SVM is especially useful when the training/testing dataset is small, as it is less prone to over-optimization. A neural network is fast and gives excellent performance on a large training/testing dataset. One should select the AI technique for mining data carefully, based on the type and size of the data. In the case of big data, one should use NoSQL technology for managing the data and systems like Hadoop to process it efficiently. Fortunately, implementing new technology is easy, as a number of free software packages are available for these technologies. I advise youngsters to learn all these new technologies; it is not difficult, and it is important for your growth. Regarding blockchain, it is an important technology that can be used to protect and secure our personal data, particularly genomics and proteomics data. With advances in technology, person-specific medical data will grow at an exponential rate in the future, and it needs to be protected using encryption technologies.
Question: Currently, one of the biggest concerns in bioinformatics is the data deluge. A few weeks ago, I read an article published in Nature, and those people were actually confused about which data to archive and which data to discard, because from our point of view everything is important. So, what measures do you think we should take? Also, some researchers have recently been trying to reuse the keywords that are already present in the datasets. What should we do in this regard?
Answer: Maintaining the important data generated by our experimental researchers is a big challenge. It is difficult to answer, as nobody knows; we are processing a lot of data, and most of it is of no use. Even if you look at TCGA data, it’s huge, and unfortunately the data we need is not there. That’s the problem: we use a lot of cancer genomics data, but only limited samples are available. So, the requirement is very high, and the existing techniques and storage capacity are not up to the level. We should take care of the data, because maybe somebody will come along and mine it better than those before. This happened with microarray data: earlier, researchers were submitting only finished information (the final results). Later, the academic community pushed data producers to make the raw data available too, because a new individual may be smarter at data mining than the previous one. So, there are limitations, and I cannot say whether we should discard data; ideally, we should maintain it.
Question: One concern in bioinformatics is that, unlike the software you developed, which is freely available, some software is not free and is sometimes overpriced. That affects research for people who cannot afford it. What are your views on this?
Answer: This is one big challenge. I have sat on most of the committees, and most bioinformatics researchers request commercial software. These software packages are costly because the vendor looks at your pocket rather than the actual cost. So, I am a strong opponent of commercial software, and I simply say, “No, you have to use academic software”, because I have not seen any commercial software that is better than academic software from the algorithmic point of view. Academic software is as good as commercial software and it is free; you can even build it into your own software. Unfortunately, most researchers are not economical; in fact, we take pride in bringing in grants rather than in achieving maximum output with minimum input. We do not compute the cost per research paper; instead, we show performance by the amount of grant money. Our young researchers should learn how to minimize the cost per paper, which is possible if we use free software. I want to give an important message here: if you want to become a good researcher, you have to think about why you are doing this. You are doing research because you want to serve the community, but if you are consuming a lot of the country’s money unnecessarily, then you are not serving the country. So, we should think about it, use the minimum budget, and get the maximum output using freeware tools and data.
Question: As a Ph.D. student, when I am working on a particular project and need some software, I search and I will definitely find one. But Ph.D. students should be aware of all these pieces of software, because they are very important for bioinformatics. Research scholars are often not aware of all kinds of software, because it is difficult to read each and every issue of the many journals. So, how important is this awareness, and what should we do about it?
Answer: I learn a lot, particularly about new topics, from the Internet, which is easily available. When I wish to learn about a topic, first I search Wikipedia, which provides summarized information. Secondly, I go to Google and search for tutorials and documents (mainly ppt, pdf, and doc files), which provide most of the information on a given topic. Thirdly, I search for videos on the topic, particularly on YouTube; most of the time I get excellent material. To read more, I search PubMed and Google Scholar, which list the scientific papers published on a given topic. This way, I get most of the information on a topic. I learn most new topics and the latest information using this process; students may adopt a similar strategy. If you are in research, go to PubMed and type the keywords; if you enter the right keywords, you will get the published articles related to your topic. If a reprint is not freely available, you may write or email the authors; in most cases, authors send a reprint. The only challenge for our students is to change their process of learning; most of our students learn in class, which is not practical in research. Most of us have a habit of being spoon-fed; for everything we go to the teacher and say, “Please, Sir, tell me this”. That habit should be dropped at the Ph.D. level. Self-learning is most important for a Ph.D. student or a researcher; it is nearly impossible to keep up if we do not have self-learning capabilities. Students are fortunate to be in this era of the internet, where you can get all the information you want; just enter the right keywords in your search.
Question: Here is a question that every bioinformatician in India and around the world wants answered: how can someone join your lab, and what are the criteria you look for in an ideal Ph.D. candidate or in anyone joining your lab?
Answer: That’s an interesting question; I was thinking about whether I should answer it or not. The reason is that I am not too worried about students’ qualifications; I feel that any student who has a master’s degree has a unique set of skills. For a Ph.D. in bioinformatics, a student should ideally have knowledge of both biology and computers, but that is rarely possible, so we prefer a student with either a biology or a computer background. If you are an M.Sc. holder, you should qualify in the national exam for a fellowship to get admission to our institute. I have shown examples in the past where students were thrown out by some PIs because they were not considered good enough, and their performance was excellent in my group. In my view, getting high performance from highly talented students is not a challenge; the major challenge is to train an average student to achieve high performance. If any student comes to me and wants to do a Ph.D., for me that’s a challenge. So, I work with students as a team rather than as a guide. I learn from students as much as they learn from me, so it is an exchange of knowledge in both directions. This is the reason we have been contributing successfully to diverse areas over the years. Rather than having a single source of knowledge, we prefer multiple sources of knowledge and learn in a networked fashion. For me, training a student is a service to society; if we train our students, then they will further contribute to society. This chain is important for the growth of society and science. Frankly speaking, I am not interested in my career; I already got the job of a government official in 1986. So, during my whole life, I have just worked to provide service to the community. The service takes four or five forms. The first is, basically, to train manpower. I provide training; whatever knowledge we gain in our group, we also give to others, so the competition will also be there.
It’s not that we build up some expertise in our group and it is not available outside; whatever expertise we have developed in our group, we have shared. Second, for me, my students are not there just to do research for me; they are the future of science, and I try my best to make them future researchers. In the last 20 years, more than 30 students have completed their Ph.D. in my group, and a number of students are working on projects, without any internal fighting. One of the major reasons for the success of our group is that we have learned how to work in a team; working in a team is our strength.
Question: Lastly, I want to ask: what is your opinion about bioinformatics? Do you think it is just building castles in the air, or that it is only prediction-based or simulation-based and there is nothing more we can do with it?
Answer: For me, bioinformatics means extracting or mining knowledge from biological data to provide service to the community. In simple words, we provide an interface between the user and the knowledge generated from experimental data. Sometimes there is a misconception that with bioinformatics we can predict anything, which is not true; I consider that bioinformatics helps you to prioritize things. Let’s say you are a biologist and you want to work on a problem: how will you plan your experiment so that you can optimize your cost and time? For example, suppose you want to identify epitopes of length 9 amino acids in a protein of 200 amino acids; there are 192 possible overlapping peptides, and testing all of them will take a lot of time and money. Alternatively, one can predict 10 potential epitopes and test only those in the lab; maybe not all 10 will be epitopes, but at least 7 or 8 will be actual epitopes. This way we save cost as well as time in experimental validation. In simple words, experimental research and bioinformatics are complementary to each other. Recently, we combined bioinformatics and an experimental approach to discover drug delivery peptides. In this project, we first developed highly accurate methods for predicting cell-penetrating peptides; then we scanned the whole SwissProt database to predict the best cell-penetrating peptides in proteins. We synthesized these predicted peptides and tested them in the wet lab. We observed that some of the peptides have better efficacy than any existing cell-penetrating peptide discovered in the world. In contrast, our biologist counterparts had not been able to discover these peptides over the years, while we were able to discover them in the last 5 years. So, bioinformatics has a lot of power; we demonstrated that if you combine experimental science and this theoretical science, you can do better. This is the reason our papers are heavily cited by the scientific community. These papers are read not only by bioinformaticians; they are also used by biologists.
Unfortunately, in India, we don’t respect each other’s fields; a biologist would say, “What is there in bioinformatics?”, and vice versa. That’s why they are not collaborating, and they are not taking full advantage of each other. I believe that in a few years, biologists will place more trust in bioinformaticians and will use that knowledge in their own experimental work.
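The epitope example in the answer above (a 200-residue protein scanned for 9-residue peptides) can be sketched in Python as a sliding window. This is only an illustration under stated assumptions: the function names, the example protein, and the scoring function are hypothetical placeholders, and `toy_score` is not a real epitope predictor.

```python
def overlapping_peptides(sequence, length=9):
    """Return every overlapping peptide of the given length.

    A sequence of n residues yields n - length + 1 peptides,
    or none if the sequence is shorter than the window.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [sequence[i:i + length]
            for i in range(len(sequence) - length + 1)]


def top_candidates(peptides, score, k=10):
    """Keep the k highest-scoring peptides for experimental testing.

    `score` is any function mapping a peptide to a number; in practice
    it would be a trained prediction model.
    """
    return sorted(peptides, key=score, reverse=True)[:k]


def toy_score(peptide):
    # Placeholder scoring rule (counts a few hydrophobic residues).
    # It only stands in for a real predictor.
    return sum(peptide.count(aa) for aa in "LIVFM")


# Hypothetical 200-residue protein built by repeating a 20-letter motif.
protein = "ACDEFGHIKLMNPQRSTVWY" * 10
peptides = overlapping_peptides(protein, length=9)
shortlist = top_candidates(peptides, toy_score, k=10)
```

Here `len(peptides)` is 200 - 9 + 1 = 192, and `shortlist` holds the 10 peptides that would be sent for wet-lab validation instead of all 192.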