Connect with us

Bioinformatics Programming

How to perform graph-based clustering of peptide/protein sequences using MCL?

Published

on

Clustering using MCL

Markov Cluster Algorithm (MCL) is a clustering algorithm that clusters networks [1]. One of its applications is in clustering protein or peptide sequences. This is a fast and scalable clustering algorithm. Previously, we have shown protein/peptide sequence clustering using Cd-hit software.

In this article, we will cluster a bunch of sequences using MCL software. This software can be easily downloaded from here.

Prepare input file

The input file for MCL should be a graph or a matrix (e.g., .tsv) in the MCL input format (ABC format). We will convert a .tsv file into MCL input format (i.e., .abc). The cut mode of MCL will be used for file conversion. Provide the full path to cut, generally its /usr/bin/.

For this, open a terminal and type the following command:

$ /usr/bin/cut -f 1,2,11 input.tsv > input.abc 

The preferred way of loading maps in MCL is using another mode called mcxload. Therefore, we will convert the obtained input.abc file into the final_input.mci format. For that, type the following command:

$ /usr/bin/mcxload -abc input.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o final_input.mci -write-tab input.tab

Our final input file is input.mci.

Clustering

For clustering, type the following commands:

$ /usr/bin/mcl final_input.mci -I 1.4 -o output_mci

here, -I is to the inflation value that handles the cluster’s granularity. The default range is 1.2 to 5.0. -I set to 1.2 results in very coarse-grained clustering. You can set its value according to your data. For more information, click here.

Output refining

Now, to get final output clusters, run the following command:

$ /usr/bin/mcxdump -icl output_mci -tabr input.tab -o dumpclusters.i14 

This file dumpclusters.i14 will show all selected sequence ids as defined in the .tsv file (the very first input file).

If this file consists of headers of FASTA sequences you wanted to cluster, then you can use this script to extract their sequences from the main multifasta input file.

References

  1. Stijn van Dongen, Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht, May 2000.
  2. Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics22(13), 1658-1659.

 

Tariq is founder of Bioinformatics Review and CEO at IQL Technologies. His areas of expertise include algorithm design, phylogenetics, MicroArray, Plant Systematics, and genome data analysis. If you have questions, reach out to him via his homepage.

Bioinformatics Programming

tanimoto_similarities.py: A Python script to calculate Tanimoto similarities of multiple compounds using RDKit.

Published

on

tanimoto_similarities.py: A Python script to calculate Tanimoto similarities of multiple compounds using RDKit.

RDKit [1] is a very nice cheminformatics software. It allows us to perform a wide range of operations on chemical compounds/ ligands. We have provided a Python script to perform fingerprinting using Tanimoto similarity on multiple compounds using RDKit. (more…)

Continue Reading

Bioinformatics Programming

How to commit changes to GitHub repository using vs code?

Published

on

How to commit changes to GitHub repository using vs code?

In this article, we are providing a few commands that are used to commit changes to GitHub repositories using VS code terminal.

(more…)

Continue Reading

Bioinformatics Programming

Extracting first and last residue from helix file in DSSP format.

Published

on

Extracting first and last residue from helix file in DSSP format.

Previously, we have provided a tutorial on using dssp_parser to extract all helices including long and short separately. Now, we have provided a new python script to find the first and last residue in each helix file. (more…)

Continue Reading

LATEST ISSUE

ADVERT