Clustering using MCL

How to perform graph-based clustering of peptide/protein sequences using MCL?

The Markov Cluster Algorithm (MCL) is a fast and scalable algorithm for clustering graphs (networks) [1]. One of its applications is clustering protein or peptide sequences. In a previous article, we showed how to cluster protein/peptide sequences using Cd-hit [2].

In this article, we will cluster a set of sequences using the MCL software. The software can be easily downloaded from here.

Prepare input file

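MCL reads a graph, which can be supplied in its simple label format (ABC format): one edge per line, consisting of two node labels and a weight. We will convert a .tsv file into this format (i.e., .abc) using cut, which is a standard Unix utility and not part of MCL. The command keeps columns 1, 2, and 11; in tabular BLAST output, for example, these columns hold the query id, the subject id, and the e-value. Provide the full path to cut, which is generally /usr/bin/cut.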

For this, open a terminal and type the following command:

$ /usr/bin/cut -f 1,2,11 input.tsv > input.abc 
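To see what this step produces, here is a minimal sketch on a made-up two-line table. The file names demo.tsv and demo.abc and all values are hypothetical, and the 12-column layout assumes tabular BLAST output, which the article does not state explicitly.

```shell
# Build a tiny 12-column table that mimics tabular BLAST output (made-up values).
printf 'seqA\tseqB\t98.5\t120\t2\t0\t1\t120\t1\t120\t1e-50\t200\n' > demo.tsv
printf 'seqA\tseqC\t45.0\t80\t40\t2\t5\t84\t3\t82\t3e-05\t48.1\n' >> demo.tsv

# Keep only columns 1, 2 and 11: the two sequence ids and the e-value.
cut -f 1,2,11 demo.tsv > demo.abc

# Show the resulting label-pair-plus-weight lines.
cat demo.abc
```

Each output line should contain two ids and one e-value, separated by tabs.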

The preferred way of loading a network into MCL is with mcxload, a separate program in the MCL package. Therefore, we will convert the obtained input.abc file into MCL's native matrix format (final_input.mci) and write the label dictionary to input.tab. For that, type the following command:

$ /usr/bin/mcxload -abc input.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o final_input.mci -write-tab input.tab
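In this command, --stream-mirror makes the graph undirected, --stream-neg-log10 replaces each e-value with its negative base-10 logarithm, and -stream-tf 'ceil(200)' caps the result at 200. The awk sketch below only imitates the arithmetic of the last two options on made-up data so you can see the resulting edge weights. It is not how mcxload is implemented, it does not mirror the edges, and the file names are hypothetical.

```shell
# Made-up edges: id, id, e-value.
printf 'seqA\tseqB\t1e-50\n'  >  demo_evalues.abc
printf 'seqA\tseqC\t3e-05\n'  >> demo_evalues.abc
printf 'seqB\tseqD\t0.0\n'    >> demo_evalues.abc
printf 'seqC\tseqD\t1e-250\n' >> demo_evalues.abc

# Weight = -log10(e-value), capped at 200; an e-value of 0 is treated as the cap.
awk -F'\t' '{
    evalue = $3 + 0
    weight = (evalue > 0) ? -log(evalue) / log(10) : 200
    if (weight > 200) weight = 200
    printf "%s\t%s\t%.2f\n", $1, $2, weight
}' demo_evalues.abc > demo_weights.abc

cat demo_weights.abc
```

Smaller e-values (better hits) become larger edge weights, and the cap keeps extremely small or zero e-values from dominating the graph.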

Our final input file is final_input.mci.

Clustering

For clustering, type the following command:

$ /usr/bin/mcl final_input.mci -I 1.4 -o output_mci

Here, -I sets the inflation value, which controls the granularity of the clusters. The default is 2.0, and the useful range is roughly 1.2 to 5.0. Setting -I to 1.2 results in very coarse-grained clustering, while 5.0 tends to give fine-grained clustering. You can set its value according to your data. For more information, see the MCL manual.
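Because the right inflation value depends on the data, it can help to try several values and compare the results. The sketch below only prints one mcl command per value so you can review them first. The three inflation values, the function name sweep_commands, and the output file names are illustrative choices, not recommendations from the MCL authors.

```shell
# Print one mcl command per inflation value (values chosen for illustration).
sweep_commands() {
    for inflation in 1.4 2.0 4.0; do
        # Turn 1.4 into 14, 2.0 into 20, and so on, for use in the output name.
        suffix=$(printf '%s' "$inflation" | tr -d '.')
        echo "mcl final_input.mci -I $inflation -o output_mci.I$suffix"
    done
}

# Review the commands; to actually run them, use: sweep_commands | sh
sweep_commands
```

Each resulting file can then be passed to mcxdump in the same way as output_mci.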

Output refining

Now, to get final output clusters, run the following command:

$ /usr/bin/mcxdump -icl output_mci -tabr input.tab -o dumpclusters.i14 

The file dumpclusters.i14 lists the clusters, one per line, with the member sequence ids separated by tabs. These are the ids defined in the .tsv file (the very first input file).
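A quick way to summarize the result is to count the members of each cluster. The sketch below assumes one cluster per line with tab-separated ids (the layout mcxdump writes for this command) and runs on a made-up file. The file names and the cluster_N labels are hypothetical; point the awk command at dumpclusters.i14 for real data.

```shell
# Made-up cluster dump: three clusters with 3, 2 and 1 members.
printf 'seqA\tseqB\tseqC\nseqD\tseqE\nseqF\n' > demo_clusters.txt

# For each line (cluster), print a label and the number of tab-separated ids.
awk -F'\t' '{ printf "cluster_%d\t%d\n", NR, NF }' demo_clusters.txt > demo_cluster_sizes.txt

cat demo_cluster_sizes.txt
```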

If these ids are the headers of the FASTA sequences you wanted to cluster, you can use this script to extract their sequences from the main multi-FASTA input file.
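If that script is not at hand, a minimal awk sketch of the same idea might look like the following. It is not the author's script. It assumes that the id in each cluster line matches the first word of a FASTA header, and all file names and sequences are made up.

```shell
# Made-up multi-FASTA file and a one-cluster dump.
printf '>seqA demo protein\nMKT\nLLV\n>seqB\nGGA\n>seqC\nPPQ\n' > demo.fasta
printf 'seqA\tseqC\n' > demo_dump.txt

# Turn the first cluster line into one id per line.
head -n 1 demo_dump.txt | tr '\t' '\n' > demo_ids.txt

# First file: remember the wanted ids. Second file: print the records whose
# header id (first word after ">") is in that set.
awk 'NR == FNR { want[$1] = 1; next }
     /^>/      { id = substr($1, 2); keep = (id in want) }
     keep' demo_ids.txt demo.fasta > demo_cluster1.fasta

cat demo_cluster1.fasta
```

Changing `head -n 1` to, for example, `sed -n 2p` would select the second cluster instead.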

References

  1. van Dongen, S. (2000). Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht.
  2. Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659.

 

Tariq is founder of Bioinformatics Review and a professional Software Developer at IQL Technologies. His areas of expertise include algorithm design, phylogenetics, MicroArray, Plant Systematics, and genome data analysis. If you have questions, reach out to him via his homepage.

HOW TO CITE THIS ARTICLE Tariq Abdullah (2020). How to perform graph-based clustering of peptide/protein sequences using MCL?. Bioinformatics Review, 6 (06)