cheML.io: ML-generated molecules database

Advances in machine learning (ML) have led to a growing number of applications in bioinformatics. ML is used in personalized medicine, similarity searches in DNA and protein sequences, phylogenetics (mapping selected species onto phylogenetic trees), gene and protein function annotation, the generation of chemical compounds, and more. In this article, we discuss cheML.io [1], an online database of ML-generated molecules.

Contents

How does cheML.io work?
How are new molecules added to the database?
References

cheML.io is a database of ML-generated molecules together with their calculated chemical properties. The molecules were generated using 10 different ML frameworks, including CDN, ORGANIC, ChemVAE, CVAE, grammarVAE, JT-VAE, SSVAE, ORGAN, and MolCycleGAN. The training molecules were collected from the ZINC and ChEMBL databases.

How does cheML.io work?

Here is the step-wise breakdown of the workflow used to generate the cheML.io database:

First, molecules are generated using 10 ML methods, for a total of 2.9 million molecules.
All molecules are checked with RDKit, and the invalid ones (174,000) are discarded.
The remaining molecules are converted into canonical SMILES format using RDKit.
SMILES and calculated chemical properties are inserted into a relational database (PostgreSQL).
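
The validation, canonicalization, and storage steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline actually used by cheML.io: the table layout, the `load_molecules` and `rdkit_describe` names, and the choice of molecular weight as the stored property are assumptions, and SQLite (from the Python standard library) stands in for PostgreSQL so that the sketch is self-contained. The RDKit calls (`Chem.MolFromSmiles`, `Chem.MolToSmiles`, `Descriptors.MolWt`) are imported lazily inside `rdkit_describe`.

```python
import sqlite3


def rdkit_describe(smiles):
    """Return (canonical_smiles, mol_weight), or None if RDKit cannot parse the input."""
    # Lazy import: RDKit is only needed when this default describer is used.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)  # returns None for an invalid SMILES string
    if mol is None:
        return None
    # MolToSmiles produces canonical SMILES by default.
    return Chem.MolToSmiles(mol), Descriptors.MolWt(mol)


def load_molecules(conn, raw_smiles, describe=rdkit_describe):
    """Validate, canonicalize, and store molecules; return (inserted, discarded)."""
    # Hypothetical schema: one row per unique canonical SMILES.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS molecules ("
        "smiles TEXT PRIMARY KEY, mol_weight REAL)"
    )
    count_sql = "SELECT COUNT(*) FROM molecules"
    before = conn.execute(count_sql).fetchone()[0]
    discarded = 0
    for smiles in raw_smiles:
        record = describe(smiles)
        if record is None:
            discarded += 1  # invalid molecule, dropped
            continue
        # Duplicates collapse because the canonical SMILES is the primary key.
        conn.execute(
            "INSERT OR IGNORE INTO molecules (smiles, mol_weight) VALUES (?, ?)",
            record,
        )
    conn.commit()
    after = conn.execute(count_sql).fetchone()[0]
    return after - before, discarded
```

With RDKit installed, calling `load_molecules(sqlite3.connect(":memory:"), ["OCC", "CCO", "C1CC"])` would store ethanol only once, since both spellings canonicalize to the same string, and count the unclosed-ring input as discarded.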

How are new molecules added to the database?

Search the ZINC, ChEMBL, and cheML.io databases for molecules similar to the query molecule.
These molecules serve as the training dataset for the first training run.
Search the same databases using a substructure search.
These molecules serve as the training dataset for the second training run.
The combination of both search results serves as the training dataset for the third training run.
Generate molecules.
Filter out the molecules already present in cheML.io and add the novel molecules to the database.
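
The data handling around these steps can be sketched as below. The sketch leaves out the searches and the generative models themselves and only shows how the three training sets relate to each other and how novel molecules are separated from known ones. The function names are hypothetical, and all inputs are assumed to be canonical SMILES strings so that plain string comparison identifies identical molecules.

```python
def training_sets(similar_hits, substructure_hits):
    """Return the training sets for the first, second, and third training runs."""
    first = set(similar_hits)        # molecules found by the similarity search
    second = set(substructure_hits)  # molecules found by the substructure search
    third = first | second           # combination of both result sets
    return first, second, third


def novel_molecules(generated, existing):
    """Keep generated molecules that are not yet in the database, preserving order."""
    seen = set(existing)
    novel = []
    for smiles in generated:
        if smiles not in seen:
            seen.add(smiles)  # also drops repeats within the generated batch
            novel.append(smiles)
    return novel
```

Using a set for the known molecules keeps each membership check cheap, which matters when the existing collection holds millions of entries.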

The web interface is user-friendly and makes it easy to retrieve data. It supports similarity and substructure searches, and users can set minimum and maximum values not only for the similarity but also for the other parameters. Additionally, users can draw molecules using the doodle widget. The database can be downloaded freely and is available at http://cheml.io/.
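
A similarity search with minimum and maximum bounds, as described above, might look like the following sketch. It is not the cheML.io implementation: fingerprints are represented as plain Python sets of on-bit indices, the record layout and the `mol_weight` filter are hypothetical stand-ins for "the other parameters", and the Tanimoto coefficient is assumed as the similarity measure because it is a common choice for molecular fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def range_search(query_fp, records, min_sim=0.0, max_sim=1.0,
                 min_weight=None, max_weight=None):
    """Return (smiles, similarity) pairs within the given bounds, best match first."""
    hits = []
    for rec in records:
        sim = tanimoto(query_fp, rec["fp"])
        if not (min_sim <= sim <= max_sim):
            continue  # outside the requested similarity window
        weight = rec["mol_weight"]
        if min_weight is not None and weight < min_weight:
            continue
        if max_weight is not None and weight > max_weight:
            continue
        hits.append((rec["smiles"], sim))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits
```

In practice the fingerprints would come from a cheminformatics toolkit such as RDKit rather than hand-written sets, and a database-side search would replace the linear scan.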


References

[1] Zhumagambetov, R., Kazbek, D., Shakipov, M., Maksut, D., Peshkov, V. A., & Fazli, S. (2020). cheML.io: an online database of ML-generated molecules. RSC Advances, 10(73), 45189-45198.
