How to perform graph-based clustering of peptide/protein sequences using MCL?

The Markov Cluster Algorithm (MCL) is a fast, scalable algorithm for clustering networks (graphs) [1]. One of its applications is clustering protein or peptide sequences. Previously, we showed protein/peptide sequence clustering using the Cd-hit software [2].

In this article, we will cluster a set of sequences using the MCL software. This software can be easily downloaded from here.

Prepare input file

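The input for MCL must be a graph. Here we supply it in MCL's label format (ABC format), which has one line per edge: two labels followed by a weight. We will convert a tab-separated file (.tsv) into this format (.abc) with cut. Note that cut is a standard Unix utility, not a part of MCL. Provide the full path to cut, generally /usr/bin/cut. The command below keeps columns 1, 2, and 11; assuming the .tsv file is BLAST tabular output, these are the query ID, the subject ID, and the E-value.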

For this, open a terminal and type the following command:

$ /usr/bin/cut -f 1,2,11 input.tsv > input.abc
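To see what this conversion keeps, here is a minimal sketch on a made-up table. It assumes a 12-column, BLAST-style tabular layout in which column 11 holds the E-value; the file names example.tsv and example.abc and all sequence IDs are hypothetical.

```shell
# Build a tiny, made-up 12-column tab-separated table (hypothetical IDs and values).
printf 'seqA\tseqB\t98.5\t120\t2\t0\t1\t120\t1\t120\t1e-50\t230\n'    > example.tsv
printf 'seqA\tseqC\t45.0\t110\t60\t1\t5\t114\t3\t112\t1e-5\t52.1\n'  >> example.tsv
printf 'seqB\tseqC\t40.2\t100\t58\t2\t9\t108\t7\t106\t0.003\t40.3\n' >> example.tsv

# Keep columns 1, 2 and 11; cut splits on tabs by default.
cut -f 1,2,11 example.tsv > example.abc

# Each remaining line is one edge: label, label, weight.
cat example.abc
```

Each output line should then hold exactly three tab-separated fields, which is the shape the ABC format expects.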

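The preferred way of loading label input into MCL is with mcxload, a separate program in the MCL package. Therefore, we will convert the obtained input.abc file into MCL's native matrix format, final_input.mci. In the command below, --stream-mirror makes the graph undirected, --stream-neg-log10 replaces each weight with its negative base-10 logarithm, -stream-tf 'ceil(200)' caps the transformed weights at 200, and -write-tab saves the mapping between labels and matrix indices in input.tab. Type the following command: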

$ /usr/bin/mcxload -abc input.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o final_input.mci -write-tab input.tab

Our final input file is final_input.mci.
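The effect of --stream-neg-log10 together with 'ceil(200)' on the edge weights can be illustrated with plain awk. This is only a sketch of the arithmetic, not a reimplementation of mcxload: it does not mirror edges or write the .mci format, and the file names toy.abc and toy_weights.tsv are hypothetical.

```shell
# Made-up ABC input; the last edge has an E-value of exactly 0.
printf 'seqA\tseqB\t1e-50\nseqA\tseqC\t1e-5\nseqB\tseqC\t0\n' > toy.abc

# Weight = -log10(E-value), capped at 200.
# An E-value of 0 has no finite logarithm, so it is set straight to the cap.
awk -F'\t' '{
    if ($3 + 0 == 0) {
        w = 200
    } else {
        w = -log($3) / log(10)
        if (w > 200) w = 200
    }
    printf "%s\t%s\t%.1f\n", $1, $2, w
}' toy.abc > toy_weights.tsv

cat toy_weights.tsv
```

Smaller E-values thus become larger edge weights, and the cap keeps extremely small or zero E-values from dominating the graph.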

Clustering

For clustering, type the following command:

$ /usr/bin/mcl final_input.mci -I 1.4 -o output_mci

Here, -I sets the inflation value, which controls the granularity of the clusters. Useful values range from 1.2 to 5.0, and the default is 2.0. Setting -I to 1.2 results in very coarse-grained clustering, while 5.0 results in fine-grained clustering. You can set its value according to your data. For more information, click here.
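Because the best inflation value depends on the data, it can help to run mcl at a few settings and compare the results. The loop below is a minimal sketch: the values 1.4, 2, and 4 and the output naming scheme (out_name) are assumptions for illustration, and mcl is only invoked if it is installed and final_input.mci exists.

```shell
# Turn an inflation value such as 1.4 into an output name such as output_mci.I14
# (hypothetical naming scheme).
out_name() {
    printf 'output_mci.I%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

for inflation in 1.4 2 4; do
    out="$(out_name "$inflation")"
    # Only run mcl when it is available and the input file from the previous step exists.
    if command -v mcl >/dev/null 2>&1 && [ -f final_input.mci ]; then
        mcl final_input.mci -I "$inflation" -o "$out"
    else
        echo "skipping: would run mcl final_input.mci -I $inflation -o $out"
    fi
done
```

Each run writes its own output file, so the clusterings can be dumped and compared separately.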

Output refining

Now, to get final output clusters, run the following command:

$ /usr/bin/mcxdump -icl output_mci -tabr input.tab -o dumpclusters.i14

The file dumpclusters.i14 lists one cluster per line, as tab-separated sequence IDs taken from the .tsv file (the very first input file).
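A quick way to inspect such a file is to count the members of each cluster. The sketch below assumes one cluster per line with tab-separated IDs; the file toy_clusters.txt and its contents are made up, and the cluster_N names are just line numbers added for readability.

```shell
# Made-up dump file: three clusters with 3, 2 and 1 members.
printf 'seqA\tseqB\tseqC\nseqD\tseqE\nseqF\n' > toy_clusters.txt

# NR is the line (cluster) number, NF is the number of IDs on that line.
awk -F'\t' '{ printf "cluster_%d\t%d\n", NR, NF }' toy_clusters.txt > toy_cluster_sizes.tsv

cat toy_cluster_sizes.tsv
```

Replacing toy_clusters.txt with dumpclusters.i14 would give the same summary for the real output.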

If these IDs are the headers of the FASTA sequences you wanted to cluster, then you can use this script to extract their sequences from the main multifasta input file.
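As a rough alternative sketch (not the script mentioned above), the sequences of a single cluster can be pulled out with standard tools. It assumes that each FASTA header starts with the sequence ID as its first whitespace-delimited word; all file names and sequences here are hypothetical.

```shell
# Made-up multi-FASTA file and cluster dump.
printf '>seqA\nMKT\nAAL\n>seqB\nMKV\n>seqD\nGGS\n' > toy.fasta
printf 'seqA\tseqB\tseqC\nseqD\tseqE\n' > toy_dump.txt

# Write the IDs of the first cluster (line 1), one per line.
sed -n '1p' toy_dump.txt | tr '\t' '\n' > cluster1_ids.txt

# First file: remember the wanted IDs.
# Second file: on each header decide whether to keep the record, then print kept lines.
awk 'NR == FNR { want[$1] = 1; next }
     /^>/      { id = substr($1, 2); keep = (id in want) }
     keep' cluster1_ids.txt toy.fasta > cluster1.fasta

cat cluster1.fasta
```

IDs that appear in the cluster but not in the FASTA file (seqC in this toy data) are silently skipped.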

References

[1] van Dongen, S. (2000). Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht.
[2] Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659.