Bioinformatics Review

How to cluster peptide/protein sequences using cd-hit software?

Tariq Abdullah
Last updated: May 20, 2020 5:47 pm

Cd-hit is one of the most widely used programs for clustering biological sequences [1]. It removes redundant sequences, which reduces the size of the dataset and can improve downstream sequence analyses. Cd-hit groups sequences whose identity meets a cut-off supplied by the user. It uses a greedy incremental clustering algorithm: sequences are sorted from longest to shortest, the longest becomes the representative of the first cluster, and each remaining sequence either joins the cluster of a representative it matches at or above the cut-off or becomes the representative of a new cluster. In this article, we will learn how to cluster a set of protein sequences using cd-hit.

The cd-hit package contains several programs for clustering different kinds of sequences. For example, cd-hit clusters peptide sequences and cd-hit-est clusters nucleotide sequences. The package can also compare two databases: cd-hit-2d compares peptide databases and cd-hit-est-2d compares nucleotide databases [1]. In this tutorial, we use the cd-hit program to cluster a group of peptide sequences. The complete package of cd-hit can be downloaded from here.

Prepare input file

The input file contains all the peptide or protein sequences in FASTA format. There is no need to reformat the FASTA headers; cd-hit handles them on its own.
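Before clustering, it can help to confirm that the input really is FASTA and to note how many sequences it holds, so that the number of clusters can be compared against it later. The following is a minimal sketch, not part of cd-hit: the file name example.fasta, the sequences in it, and the helper count_fasta_records are all hypothetical and only serve as an illustration.

```shell
# Hypothetical example input: three short peptide records in FASTA format.
printf '%s\n' '>pep1 first peptide' 'MKTAYIAKQR' \
              '>pep2 second peptide' 'MKTAYIAKQR' \
              '>pep3' 'GSHMLEDPVA' > example.fasta

# Count records by counting header lines (lines that start with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

count_fasta_records example.fasta   # prints 3
```

If the count is 0, the file is probably not in FASTA format.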

Basic commands

$ cd-hit -i input.fasta -o db100 -c 1.00 -n 5 -M 2000

where,

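-i = input file (FASTA format)

-o = output file name

-c = sequence identity cut-off (1.00 means 100% identity)

-n = word size, chosen according to the cut-off:

n=5 for thresholds 0.7 ~ 1.0

n=4 for thresholds 0.6 ~ 0.7

n=3 for thresholds 0.5 ~ 0.6

n=2 for thresholds 0.4 ~ 0.5

-M = maximum available memory, in MB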
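The word-size ranges listed above can be wrapped in a small helper so that -n always matches -c. This is only a sketch: pick_word_size is a hypothetical helper name, not a cd-hit option, and the treatment of boundary values (for example 0.7, which appears in two ranges) is an assumption that picks the larger word size. The output name "clustered" is likewise just a placeholder.

```shell
# Hypothetical helper: print the -n word size suggested for a given -c threshold.
pick_word_size() {
    awk -v c="$1" 'BEGIN {
        if (c < 0.4 || c > 1.0) {
            print "threshold must be between 0.4 and 1.0" > "/dev/stderr"
            exit 1
        }
        # Assumption: at a shared boundary (0.7, 0.6, 0.5) use the larger word size.
        if      (c >= 0.7) print 5
        else if (c >= 0.6) print 4
        else if (c >= 0.5) print 3
        else               print 2
    }'
}

c=0.97
n=$(pick_word_size "$c")
# Print the command instead of running it, so the sketch works without cd-hit installed.
echo "cd-hit -i input.fasta -o clustered -c $c -n $n -M 2000"
```

Removing the echo and the surrounding quotes would run the command for real, provided cd-hit is on the PATH.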

To cluster the sequences at a 97% identity cut-off:

$ cd-hit -i input.fasta -o db97 -c 0.97 -n 5 -M 2000

Output

Cd-hit writes two output files:

1. A FASTA file containing the representative sequence of each cluster. It is saved under the name given with -o (db100 in the first command above).

2. A text file with the same name plus the .clstr extension, listing every cluster and its member sequences. The representative sequence of each cluster is marked with a ‘*’ at the end of its line.
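For a quick overview of a clustering run, the .clstr file can be summarised with standard tools. The sketch below assumes the usual .clstr layout, in which each cluster starts with a ">Cluster N" line and the line of the representative ends with "*". The sample file, its sequence names and identities, and the helper summarize_clstr are made up for illustration and are not produced by or shipped with cd-hit.

```shell
# Hypothetical sample in .clstr layout (tab after the member index).
printf '%b\n' '>Cluster 0' \
              '0\t120aa, >seqA... *' \
              '1\t118aa, >seqB... at 98.31%' \
              '>Cluster 1' \
              '0\t95aa, >seqC... *' > sample.clstr

# Count clusters, member sequences, and representatives in a .clstr file.
summarize_clstr() {
    awk '
        /^>Cluster/ { clusters++; next }   # a new cluster starts here
        NF          { members++ }          # every other non-empty line is a member
        /\*[ \t]*$/ { reps++ }             # representative lines end with "*"
        END { printf "clusters=%d members=%d representatives=%d\n", clusters, members, reps }
    ' "$1"
}

summarize_clstr sample.clstr   # prints clusters=2 members=3 representatives=2
```

The number of representatives should equal the number of clusters; a mismatch suggests the file is truncated or not in the assumed layout.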

Many other options can be set on the command line, including -G to use global sequence identity, -t to set the tolerance for redundancy, -l to set the length of throw_away_sequences, and -d to set the length of the sequence description in the .clstr output file. You can read about all the command-line options either in the user guide provided on the cd-hit website (http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.pdf) or by entering the help command ($ cd-hit --help).

References

  1. Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659.
By Tariq Abdullah
Tariq is founder of Bioinformatics Review and Lead Developer at IQL Technologies. His areas of expertise include algorithm design, phylogenetics, MicroArray, Plant Systematics, and genome data analysis. If you have questions, reach out to him via his homepage.