cheML.io: ML-generated molecules database

Dr. Muniba Faiza


Advances in machine learning (ML) methods have led to a growing number of applications in bioinformatics. ML is used in personalized medicine, similarity searches in DNA and protein sequences, phylogenetics (mapping selected species onto phylogenetic trees), gene and protein function annotation, the generation of chemical compounds, and more. In this article, we discuss cheML.io [1], an online database of ML-generated molecules.

cheML.io is a database of ML-generated molecules along with their calculated chemical properties. The molecules were generated by 10 different ML frameworks, including CDN, ORGANIC, ChemVAE, CVAE, grammarVAE, JT-VAE, SSVAE, ORGAN, and MolCycleGAN. The training molecules were collected from the ZINC and ChEMBL databases.

How does cheML.io work?

Here is the step-wise breakdown of the workflow used to generate the cheML.io database:

  • First, molecules are generated using 10 ML methods, for a total of 2.9 million molecules.
  • All molecules are checked with RDKit, and the invalid ones (174,000) are discarded.
  • The remaining molecules are converted into canonical SMILES format using RDKit.
  • SMILES and calculated chemical properties are inserted into a relational database (PostgreSQL).
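The validation, canonicalization, and storage steps above can be sketched roughly as follows. This is a minimal illustration, not the cheML.io code: the function names, the table layout, and the choice of molecular weight as the stored property are all hypothetical, and SQLite from the Python standard library stands in for PostgreSQL so that the sketch stays self-contained. Only the RDKit calls (`Chem.MolFromSmiles`, `Chem.MolToSmiles`, `Descriptors.MolWt`) are part of the real library.

```python
import sqlite3
from typing import Callable, Iterable, Optional, Tuple

# (canonical SMILES, molecular weight); the property choice is illustrative.
Record = Tuple[str, float]


def rdkit_record(smiles: str) -> Optional[Record]:
    """Return (canonical SMILES, molecular weight), or None if RDKit rejects the input."""
    # Imported lazily so the storage logic below can be exercised without RDKit.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)  # None when the SMILES cannot be parsed
    if mol is None:
        return None
    return Chem.MolToSmiles(mol), Descriptors.MolWt(mol)


def store_valid_molecules(
    conn: sqlite3.Connection,
    generated: Iterable[str],
    to_record: Callable[[str], Optional[Record]] = rdkit_record,
) -> int:
    """Insert valid molecules into the database and return how many were discarded."""
    # Hypothetical schema; SQLite is used here as a stand-in for PostgreSQL.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS molecules (smiles TEXT PRIMARY KEY, mol_wt REAL)"
    )
    discarded = 0
    for smiles in generated:
        record = to_record(smiles)
        if record is None:
            discarded += 1
            continue
        # The primary key on the canonical SMILES keeps each molecule only once.
        conn.execute("INSERT OR IGNORE INTO molecules VALUES (?, ?)", record)
    conn.commit()
    return discarded
```

With RDKit installed, calling `store_valid_molecules(sqlite3.connect(":memory:"), smiles_list)` runs the whole sketch. Passing a different `to_record` callable is one way to swap in other property calculations.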

How are new molecules added to the database?

  • Search the ZINC, ChEMBL, and cheML databases for molecules that are similar to the query molecule.
  • Use these molecules as the training dataset for the first training run.
  • Run a substructure search against the same databases.
  • Use these molecules as the training dataset for the second training run.
  • Combine the molecules from both searches into the training dataset for the third training run.
  • Generate molecules.
  • Filter out the molecules that are already present in cheML and add novel molecules to the database.
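The bookkeeping around these steps (assembling the three training sets and keeping only novel molecules) might look like the sketch below. It assumes all molecules are already canonical SMILES strings, so that plain string comparison identifies duplicates, and it reads "combination" as the union of the two search results. Both are assumptions, and the function names are hypothetical. The model training and generation steps are left out.

```python
from typing import Iterable, List, Set


def build_training_sets(
    similar: Iterable[str], substructure: Iterable[str]
) -> List[Set[str]]:
    """Return the training sets for the first, second, and third training runs."""
    similar_set, substructure_set = set(similar), set(substructure)
    # Assumption: "combination" is read here as the union of both search results.
    return [similar_set, substructure_set, similar_set | substructure_set]


def novel_molecules(generated: Iterable[str], existing: Iterable[str]) -> List[str]:
    """Keep generated molecules that are not in the database yet, in input order."""
    seen = set(existing)
    novel = []
    for smiles in generated:
        if smiles not in seen:
            seen.add(smiles)  # also drops duplicates within the generated batch
            novel.append(smiles)
    return novel
```

Using a set for the lookup keeps the membership test cheap even when the existing database holds millions of molecules.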

The web interface is user-friendly and lets users retrieve data easily. It supports similarity and substructure searches, with minimum and maximum values that can be set not only for the similarity but also for the other parameters. Users can also draw molecules using the doodle widget. The database is freely downloadable and is available at http://cheml.io/.
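To make the idea of a similarity search with minimum and maximum bounds concrete, here is a small sketch. The article does not say which similarity measure cheML.io uses, so the Tanimoto coefficient (a common choice for molecular fingerprints) is an assumption. The fingerprints are represented as plain sets of on-bit indices, and the function names and example data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits in either set."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    min_sim: float = 0.0,
    max_sim: float = 1.0,
) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs within [min_sim, max_sim], best match first."""
    hits = []
    for name, fingerprint in library.items():
        score = tanimoto(query, fingerprint)
        if min_sim <= score <= max_sim:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints, written as sets of on-bit positions.
library = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 5, 6}),
    "mol_c": frozenset({7, 8}),
}
hits = similarity_search(frozenset({1, 2, 3, 4}), library, min_sim=0.3, max_sim=0.9)
```

An upper bound below 1.0 excludes exact matches such as `mol_a`, which can be useful when looking for related but distinct molecules.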

For further details, see the original publication [1].


References

  1. Zhumagambetov, R., Kazbek, D., Shakipov, M., Maksut, D., Peshkov, V. A., & Fazli, S. (2020). cheML.io: an online database of ML-generated molecules. RSC Advances, 10(73), 45189-45198.

Dr. Muniba is a bioinformatician based in New Delhi, India. She completed her PhD in Bioinformatics at South China University of Technology, Guangzhou, China. She has cutting-edge knowledge of bioinformatics tools, algorithms, and drug design. When she is not reading, she enjoys spending time with her family.
