
Bioinformatics Programming

How to perform graph-based clustering of peptide/protein sequences using MCL?

Clustering using MCL

The Markov Cluster Algorithm (MCL) is a fast, scalable algorithm for clustering graphs (networks) [1]. One of its applications is clustering protein or peptide sequences. Previously, we showed protein/peptide sequence clustering using CD-HIT [2].

In this article, we will cluster a set of sequences using the MCL software, which can be downloaded from here.

Prepare input file

MCL takes a graph as input. A simple way to describe one is the ABC format, in which each line holds two node labels and a numerical value for the edge between them. We will convert a .tsv file into this format (i.e., .abc) by keeping only columns 1, 2, and 11. The conversion uses the standard Unix cut utility, which is not part of MCL. Provide the full path to cut if needed; it is generally /usr/bin/cut.

For this, open a terminal and type the following command:

$ /usr/bin/cut -f 1,2,11 input.tsv > input.abc 
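
To make the conversion concrete, here is a minimal sketch on a toy file. It assumes that input.tsv is BLAST-style tabular output with 12 columns, where columns 1 and 2 are the query and subject IDs and column 11 is the E-value; the article does not state this, so check it against your own file. The peptide IDs and values are made up for illustration.

```shell
# Toy stand-in for input.tsv: three made-up BLAST-style hits (12 tab-separated columns).
tmpdir=$(mktemp -d)
printf 'pepA\tpepB\t98.5\t40\t1\t0\t1\t40\t1\t40\t1e-20\t85.1\n' >  "$tmpdir/input.tsv"
printf 'pepB\tpepA\t98.5\t40\t1\t0\t1\t40\t1\t40\t2e-20\t84.0\n' >> "$tmpdir/input.tsv"
printf 'pepC\tpepD\t75.0\t38\t9\t1\t2\t39\t1\t38\t3e-05\t32.7\n' >> "$tmpdir/input.tsv"

# Keep columns 1, 2, and 11 (assumed here to be query ID, subject ID, E-value).
cut -f 1,2,11 "$tmpdir/input.tsv" > "$tmpdir/input.abc"

# Each line of the result is one edge: label, label, value.
cat "$tmpdir/input.abc"
```

If column 11 of your file holds something other than an E-value, the -log10 transform applied in the next step may not be appropriate.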

The preferred way of loading networks into MCL is with mcxload, a separate program in the MCL suite. Therefore, we will convert the input.abc file into MCL's native matrix format (final_input.mci). For that, type the following command:

$ /usr/bin/mcxload -abc input.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o final_input.mci -write-tab input.tab
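
The stream options in this command reshape the edge values before clustering: --stream-mirror makes the network undirected, --stream-neg-log10 replaces each value with its negative base-10 logarithm, and -stream-tf 'ceil(200)' caps the result at 200. The sketch below only mimics the last two steps with awk so that the arithmetic is visible; it is not how mcxload is implemented, it does not mirror edges, and the helper name neg_log10_capped is hypothetical.

```shell
# Hypothetical helper: mimic -log10 followed by a cap of 200 on label/label/value lines.
neg_log10_capped() {
  awk -F'\t' -v cap=200 '
    {
      # -log10(E); a value of 0 would give infinity, so it goes straight to the cap.
      w = ($3 + 0 > 0) ? -log($3) / log(10) : cap
      if (w > cap) w = cap
      printf "%s\t%s\t%.1f\n", $1, $2, w
    }'
}

# Two made-up edges: one ordinary E-value and one reported as zero.
printf 'pepA\tpepB\t1e-20\npepC\tpepD\t0.0\n' | neg_log10_capped
```

Smaller E-values therefore become larger edge weights, and the cap keeps extremely small or zero E-values from dominating the network.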

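Our final input file is final_input.mci. The accompanying input.tab file maps MCL's node indices back to the original sequence labels and is needed later to read the clusters.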

Clustering

For clustering, type the following command:

$ /usr/bin/mcl final_input.mci -I 1.4 -o output_mci

Here, -I sets the inflation value, which controls the granularity of the clustering. It is usually chosen in the range 1.2 to 5.0 (the default is 2.0). Setting -I to 1.2 tends to give very coarse-grained clusters, while higher values give finer-grained ones. You can set its value according to your data. For more information, click here.
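
Since the best inflation value depends on the data, it can help to run MCL at several values and compare the results. The following is a minimal sketch of such a sweep. The function name sweep_inflation, the MCL_BIN override, and the output naming scheme are assumptions made for illustration, not part of MCL.

```shell
# Hypothetical helper: run mcl once per inflation value on the same input.
# MCL_BIN is an assumed override for illustration; it defaults to mcl on PATH.
sweep_inflation() {
  input=$1
  shift
  for inflation in "$@"; do
    # 1.4 -> I14, 2 -> I2: dropping the dot gives a compact file suffix.
    suffix=I$(printf '%s' "$inflation" | tr -d '.')
    "${MCL_BIN:-mcl}" "$input" -I "$inflation" -o "output_mci.$suffix"
  done
}

# Usage (requires MCL to be installed):
# sweep_inflation final_input.mci 1.4 2 4
```

Each run writes its own output file, so the clusterings can be dumped and compared side by side.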

Output refining

Now, to get the final output clusters with the original sequence labels, run the following command:

$ /usr/bin/mcxdump -icl output_mci -tabr input.tab -o dumpclusters.i14 

The file dumpclusters.i14 lists the clusters one per line, each as a tab-separated list of the sequence IDs used in the .tsv file (the very first input file).
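
Because each line of the dump is one cluster with tab-separated members, a quick size summary can be sketched with awk. The helper name cluster_sizes and the toy IDs below are hypothetical; point the function at your own dump file instead.

```shell
# Hypothetical helper: print "cluster_<line number> <TAB> member count" per cluster.
cluster_sizes() {
  awk -F'\t' '{ printf "cluster_%d\t%d\n", NR, NF }' "$1"
}

# Toy dump with two clusters (made-up IDs), standing in for dumpclusters.i14.
toy_dump=$(mktemp)
printf 'pepA\tpepB\tpepE\npepC\tpepD\n' > "$toy_dump"
cluster_sizes "$toy_dump"
```

On real data, run cluster_sizes dumpclusters.i14 to see how many clusters were found and how large each one is.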

If these IDs are the headers of the FASTA sequences you wanted to cluster, you can use this script to extract their sequences from the main multi-FASTA input file.
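
As an alternative illustration (not the script linked above), the extraction can be sketched with awk. It assumes that each sequence ID in the dump matches the first whitespace-delimited word of a FASTA header; the function name extract_cluster_fasta and the file name sequences.fasta are hypothetical.

```shell
# Hypothetical helper: print the FASTA records whose IDs are on one cluster line.
# Usage: extract_cluster_fasta <cluster_number> <dump_file> <fasta_file>
extract_cluster_fasta() {
  awk -F'\t' -v n="$1" '
    # First file (cluster dump): remember the IDs listed on line n.
    NR == FNR { if (FNR == n) for (i = 1; i <= NF; i++) want[$i] = 1; next }
    # Second file (FASTA): at each header, decide whether to keep the record.
    /^>/ { id = substr($1, 2); sub(/[ \t].*/, "", id); keep = (id in want) }
    # Print header and sequence lines while the current record is wanted.
    keep
  ' "$2" "$3"
}

# Usage, with sequences.fasta as a placeholder for your multi-FASTA file:
# extract_cluster_fasta 1 dumpclusters.i14 sequences.fasta > cluster1.fasta
```

The cluster number is simply the line number in the dump file, so cluster 1 is the first line.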

References

  1. van Dongen, S. (2000). Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht.
  2. Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659.

 

Tariq is the founder of Bioinformatics Review and CEO at IQL Technologies. His areas of expertise include algorithm design, phylogenetics, microarrays, plant systematics, and genome data analysis. If you have questions, reach out to him via his homepage.
