How to create a pangenome of isolated genome sequences using Roary and Prokka?

Roary is a pangenome pipeline that calculates the pangenome of a set of related prokaryotic isolates [1]. It takes annotated assemblies in gff3 format, as generated by Prokka [2], and produces the pangenome. The working methodology has been explained in our previous article. In this article, we will learn how to create the pangenome of a few isolated genome sequences using Roary [1] and Prokka [2].

Input for Roary

  1. Annotated genome sequences in the form of gff3 files (annotation together with the nucleotide sequence), one file per isolate.

Downloading the genome sequences

First, download the genome sequences you need, either manually or with the ncbi-genome-download package, which provides scripts to download genome sequences from the NCBI FTP servers. To install this package, open a terminal (Ctrl + Alt + T on Ubuntu) and type the following command:

$ pip install ncbi-genome-download

After installing this package, you can download genome assemblies as per your requirements, such as fasta sequences of all bacteria, viral genomes, RefSeq genome sequences in GenBank format, fungal genomes, and so on. (Remember, if you want to use gff3 files taken directly from NCBI, you also need to download the GenBank files with the nucleotide sequence, because gff3 files on the NCBI website contain annotation only.) The following command downloads all bacterial sequences in fasta format, which is a very large set; this example uses only a few of them:

$ ncbi-genome-download --format fasta bacteria
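The downloaded files usually need unpacking before they can be annotated. As a rough sketch, assuming ncbi-genome-download has saved gzip-compressed files named *_genomic.fna.gz somewhere below a refseq folder (this layout may differ with your version and options), the hypothetical helper collect_fasta below decompresses them into a single folder:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ncbi-genome-download or Roary).
# Assumption: the downloads are gzip-compressed files ending in
# _genomic.fna.gz, located anywhere below the source folder.
collect_fasta() {
    local src="$1" dest="$2" gz
    mkdir -p "$dest"
    # Decompress every matching file into the destination folder,
    # keeping the original file name without the .gz suffix.
    find "$src" -type f -name '*_genomic.fna.gz' | while IFS= read -r gz; do
        gunzip -c "$gz" > "$dest/$(basename "${gz%.gz}")"
    done
}

# Example call (folder names are placeholders):
#   collect_fasta refseq example
```

The original compressed files are left untouched, so the step can be repeated safely.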

Annotating the genome sequences

Go into the Roary directory, create a new folder (let’s name it ‘example’), and save the downloaded sequences there. You will now see many fasta files in that folder. Next, annotate them to determine the location and attributes of the genes they contain, and to obtain the gff3 files that Roary takes as input. This can be easily done with Prokka [2]. Open the terminal and type the following commands:

$ cd Downloads/Roary/example/

$ prokka --kingdom Bacteria --outdir prokka_GCA_000006285 --genus Salmonella --locustag GCA_000006285 GCA_000006285.2_ASM828v3_genomic.fna

You can add further details about the organism (genus, species, etc.) through additional Prokka options. Make sure you annotate every genome sequence you are working with, changing the output directory name, locus tag, and input file name (including the assembly version) each time. Each run creates a new output directory named after the sequence, containing 12 files with different extensions, including the gff3 file.
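Typing the Prokka command once per genome becomes tedious with more than a few sequences. The sketch below is one possible way to generate the commands, assuming every fasta file ends in .fna and starts with its accession followed by a dot, as in the example file name above. The function name prokka_commands is hypothetical, and the options are exactly those used in the command above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one prokka command per .fna file in a folder.
# Assumption: file names look like GCA_000006285.2_ASM828v3_genomic.fna,
# so everything before the first dot is the accession.
prokka_commands() {
    local dir="$1" genus="$2" fasta base acc
    for fasta in "$dir"/*.fna; do
        [ -e "$fasta" ] || continue      # skip if no .fna files are present
        base=$(basename "$fasta")
        acc=${base%%.*}                  # e.g. GCA_000006285
        echo "prokka --kingdom Bacteria --outdir prokka_${acc} --genus ${genus} --locustag ${acc} ${fasta}"
    done
}

# Example call: review the printed commands first, then run them.
#   prokka_commands . Salmonella
#   prokka_commands . Salmonella | sh
```

Printing the commands before running them makes it easy to check the output directory names and locus tags for each genome.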

Creating pangenome/Running Roary

We now have a gff3 file for each genome sequence in its own directory. Copy the gff3 file from each directory into a single directory (let’s say, gff_all). After that, open the terminal again and type the following command to run Roary:

$ roary -f ./tutorial -e -n -v ./gff_all/*.gff
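The copying step that comes before the roary command can also be scripted. One thing to watch for: as far as we know, Prokka names its output files after the run date by default, so gff3 files annotated on the same day can share a file name and overwrite each other when copied into one folder. The sketch below, with the hypothetical function name collect_gffs, assumes the output directories are named prokka_<accession> as above and renames each copy after its directory:

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the .gff file from every prokka_* directory
# into one folder, naming each copy after the accession in the directory name.
# Assumption: each prokka_<accession> directory holds a single .gff file.
collect_gffs() {
    local src="$1" dest="$2" dir gff acc
    mkdir -p "$dest"
    for dir in "$src"/prokka_*/; do
        [ -d "$dir" ] || continue        # skip if nothing matches
        acc=$(basename "$dir")
        acc=${acc#prokka_}               # prokka_GCA_000006285 -> GCA_000006285
        for gff in "$dir"*.gff; do
            [ -e "$gff" ] || continue
            cp "$gff" "$dest/${acc}.gff"
        done
    done
}

# Example call from the folder that contains the prokka_* directories:
#   collect_gffs . gff_all
```

Using the accession as the file name also makes the isolate names in the Roary output easier to read.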

When the roary command runs, Roary extracts all the coding sequences, translates them into protein sequences, and generates pre-clusters. It then performs an all-against-all comparison using blastp [3], creates clusters using MCL [4], and splits the clusters that contain paralogs. Finally, it takes every isolate and orders them according to the presence/absence of genes. The run time depends on the number of sequences (or gff3 files) you are using.

If you want to create a pangenome without the core gene alignment, omit the -e and -n options:

$ roary -f ./tutorial -v ./gff_all/*.gff

If you want to change the minimum percentage identity for blastp (the default is 95%, and going below 90% is not advised), use the -i option:

$ roary -f ./tutorial -i 90 -v ./gff_all/*.gff

These commands create a new directory called tutorial (the name given in the command), where all the result files will be found. The summary statistics are in the file named ‘summary_statistics.txt’, which will look like this:

                  summary_statistics.txt

Core genes (99% <= strains <= 100%) 2031

Softcore genes (95% <= strains < 99%) 0

Shell genes (15% <= strains < 95%) 2497

Cloud genes (0% <= strains < 15%) 0

Total genes (0% <= strains <= 100%) 4528
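As a quick sanity check, the four category counts should add up to the total (here, 2031 + 0 + 2497 + 0 = 4528). The sketch below is one way to verify this with awk, assuming the count is the last field on each line as in the listing above. The function name check_summary is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that core + softcore + shell + cloud
# equals the reported total in summary_statistics.txt.
# Assumption: the gene count is the last whitespace-separated field per line.
check_summary() {
    awk '
        /^Total genes/ { total = $NF; next }   # remember the reported total
        /genes/        { sum += $NF }          # add up the four categories
        END {
            printf "categories=%d total=%d\n", sum, total
            exit (sum == total ? 0 : 1)        # non-zero exit on a mismatch
        }
    ' "$1"
}

# Example call from the folder that contains the tutorial directory:
#   check_summary tutorial/summary_statistics.txt
```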

Visualizing results

You will also find other output files, such as ‘gene_presence_absence.csv’ and ‘accessory_binary_genes.fa.newick’. To visualize the results, use the ‘roary_plots.py’ script (written by Marco Galardini), which is present inside the directory named contrib in the main Roary directory. Open the terminal, go into the tutorial directory (where all the result files are present), and type the following:

$ cd tutorial
$ /home/user/Downloads/roary/contrib/roary_plots/roary_plots.py accessory_binary_genes.fa.newick gene_presence_absence.csv

Three png files will be added to the same tutorial directory: pangenome_frequency.png (Fig. 1), pangenome_matrix.png (Fig. 2), and pangenome_pie.png (Fig. 3), as shown below.

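Fig. 1 Gene frequency plot showing the number of genes against the number of genomes in which they are present.

Fig. 2 Presence/absence matrix of the gene clusters, shown alongside the tree.

Fig. 3 Pie chart showing the breakdown of genes into core, softcore, shell, and cloud categories.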

Additionally, you can visualize the Newick file in phylogeny software such as MEGA for further analysis.

This article demonstrated the creation of a pangenome of isolated genome sequences using Roary. In case of any queries, please write to us at [email protected] or [email protected].

References

  1. Page, A. J., Cummins, C. A., Hunt, M., Wong, V. K., Reuter, S., Holden, M. T., … & Parkhill, J. (2015). Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, 31(22), 3691-3693.
  2. Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068-2069.
  3. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.
  4. van Dongen, S. (2000). Graph Clustering by Flow Simulation. University of Utrecht.

Tariq is founder of Bioinformatics Review and CEO at IQL Technologies. His areas of expertise include algorithm design, phylogenetics, MicroArray, Plant Systematics, and genome data analysis. If you have questions, reach out to him via his homepage.
