How to create a pangenome of isolated genome sequences using Roary and Prokka?

Roary is a pipeline that calculates the pangenome of a set of related prokaryotic isolates [1]. It takes annotated assemblies in gff3 format, such as those generated by Prokka [2], and computes the pangenome. The working methodology has been explained in our previous article. In this article, we will create the pangenome of a few isolated genome sequences using Roary [1] and Prokka [2].
Input for Roary
- Genome sequences in the form of gff3 files.
Downloading the genome sequences
First, download the genome sequences you need, either manually or with the ncbi-genome-download package, which provides scripts for downloading genome sequences from the NCBI FTP servers. To install this package, open a terminal (Ctrl + Alt + T on Ubuntu) and type the following command:
$ pip install ncbi-genome-download
After installing this package, you can download genome assemblies as required, such as FASTA sequences of all bacteria, viral genomes, RefSeq genome sequences in GenBank format, fungal genomes, and so on. (Remember: if you want to use the NCBI annotation instead of annotating the sequences yourself, download the GenBank files, which include the nucleotide sequence, because the gff3 files on the NCBI website contain the annotation only.) The following command downloads all bacterial sequences in FASTA format, which is a very large download; this example uses only a few of them:
$ ncbi-genome-download --format fasta bacteria
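To keep the download small, the request can be narrowed. The following is a minimal sketch, assuming you only want complete Salmonella assemblies from GenBank. The genus, the assembly level, the folder name ncbi_download, and the helper unpack_fasta are illustrative choices and not part of the original workflow. The option spellings (--formats, --genera, --assembly-levels) follow recent releases of ncbi-genome-download, so check --help on your version. Downloaded files are typically gzip-compressed and stored in per-assembly folders, so the helper copies them into one folder and decompresses them.

```shell
#!/bin/sh
# Preview a narrowed download (assumed example values: Salmonella, complete assemblies).
# --dry-run only lists the matching assemblies; remove it to actually download.
if command -v ncbi-genome-download >/dev/null 2>&1; then
    ncbi-genome-download --dry-run --section genbank --formats fasta \
        --genera Salmonella --assembly-levels complete \
        --output-folder ncbi_download bacteria
fi

# Hypothetical helper: find every *.fna.gz below SRC and write a
# decompressed copy into DEST (the compressed originals are left untouched).
unpack_fasta() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    find "$src" -type f -name '*.fna.gz' | while IFS= read -r gz; do
        gunzip -c "$gz" > "$dest/$(basename "$gz" .gz)"
    done
}
```

After a real download, something like `unpack_fasta ncbi_download example` would place the uncompressed .fna files in the example folder used in the next step.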
Annotating the genome sequences
Go into the Roary directory, create a new folder (let’s name it ‘example’), and save the downloaded sequences there. You will now see many FASTA files in that folder. The next step is to annotate them, which determines the attributes and locations of the genes they contain and produces the gff3 files that Roary takes as input. This can be easily done with Prokka [2]. Open the terminal and type the following commands:
$ cd Downloads/Roary/example/
$ prokka --kingdom Bacteria --outdir prokka_GCA_000006285 --genus Salmonella --locustag GCA_000006285 GCA_000006285.2_ASM828v3_genomic.fna
You can add further details such as the organism (genus, species, etc.). Make sure you annotate all the genome sequences you are dealing with, and remember to change the output directory name, locus tag, and input file name for each one. Each run creates a new directory, named by --outdir, containing 12 files with different extensions, including the gff3 file.
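Running the command by hand for every genome quickly becomes tedious. The loop below is a minimal sketch of one way to repeat it for each .fna file in the current folder. It assumes the file names begin with the assembly accession, as in the example above. The helper accession_of is hypothetical, and the use of Prokka's --prefix option to name the output files after the accession is an addition to the original command.

```shell
#!/bin/sh
# Hypothetical helper: take the part of the file name before the first dot,
# e.g. GCA_000006285.2_ASM828v3_genomic.fna -> GCA_000006285
accession_of() {
    base=$(basename "$1")
    printf '%s\n' "${base%%.*}"
}

# Annotate every FASTA file in the current directory with Prokka.
for fna in ./*.fna; do
    [ -e "$fna" ] || continue      # skip when no .fna files are present
    acc=$(accession_of "$fna")
    # --genus Salmonella is the example value from the article; change as needed.
    prokka --kingdom Bacteria --genus Salmonella \
        --outdir "prokka_$acc" --locustag "$acc" --prefix "$acc" "$fna"
done
```

With --prefix set, each output directory contains files named after the accession (for example GCA_000006285.gff) instead of sharing Prokka's default date-based name.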
Creating pangenome/Running Roary
We now have a gff3 file for each genome sequence in its own directory. Copy the gff3 file from each directory into a single directory (let’s say, gff_all). If several files share the same name, rename them while copying so that they do not overwrite each other. After that, open the terminal again and type the following command to run Roary:
$ roary -f ./tutorial -e -n -v ./gff_all/*.gff
At this stage, Roary extracts all the coding sequences, translates them into protein sequences, and generates pre-clusters. It then compares the sequences all-against-all using blastp [3] and creates clusters using MCL [4], after which clusters containing paralogs are split into groups of orthologs. Finally, it orders the isolates according to the presence/absence of orthologs. The run time depends on the number of sequences (gff3 files) you are using.
If you want to create a pangenome without the core gene alignment, omit the -e and -n options:
$ roary -f ./tutorial -v ./gff_all/*.gff
If you want to change the minimum percentage identity for blastp (the default is 95%, and going below 90% is not advised), then use the following command:
$ roary -f ./tutorial -i 90 -v ./gff_all/*.gff
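The copy step that precedes these Roary runs can also be scripted. The function below is a minimal sketch, assuming one Prokka output directory per genome as created earlier. The name collect_gffs is hypothetical. Each copy is named after its source directory so that files with identical names do not overwrite one another.

```shell
#!/bin/sh
# Hypothetical helper: copy the .gff file from each given Prokka output
# directory into DEST, naming each copy after the directory it came from.
collect_gffs() {
    dest=$1
    shift
    mkdir -p "$dest"
    for dir in "$@"; do
        for gff in "$dir"/*.gff; do
            [ -e "$gff" ] || continue   # directory without a .gff file
            cp "$gff" "$dest/$(basename "$dir").gff"
        done
    done
}
```

For example, `collect_gffs gff_all prokka_*` gathers the files into gff_all, after which any of the roary commands shown above can be run on ./gff_all/*.gff.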
These commands create a new directory called tutorial (the name given with -f in the command), where all result files will be found. The summary statistics are in the file named ‘summary_statistics.txt’, which looks like this:
summary_statistics.txt
Core genes (99% <= strains <= 100%) 2031
Softcore genes (95% <= strains < 99%) 0
Shell genes (15% <= strains < 95%) 2497
Cloud genes (0% <= strains < 15%) 0
Total genes (0% <= strains <= 100%) 4528
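The counts are easier to compare as shares of the total. The function below is a minimal sketch that prints each category's share of the pangenome. It assumes one category per line, with the label before the bracketed range and the count as the last field. The name pangenome_shares is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print label, count, and percentage of the total
# for every category in a Roary summary_statistics.txt file.
pangenome_shares() {
    awk '
        NF == 0 { next }                     # ignore blank lines
        {
            label = $0
            sub(/[ \t]*\(.*$/, "", label)    # keep the text before the "(...)" range
            value = $NF                      # the count is the last field
        }
        label ~ /^Total/ { total = value; next }
        { n++; name[n] = label; count[n] = value }
        END {
            if (total <= 0) exit 1           # no usable "Total genes" line
            for (i = 1; i <= n; i++)
                printf "%s\t%d\t%.1f%%\n", name[i], count[i], 100 * count[i] / total
        }
    ' "$1"
}
```

Called as `pangenome_shares tutorial/summary_statistics.txt` on the numbers above, it reports 44.9% core genes and 55.1% shell genes.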
Visualizing results
You will also find other output files such as ‘gene_presence_absence.csv’ and ‘accessory_binary_genes.fa.newick’. To visualize the results, we will use the ‘roary_plots.py’ script (written by Marco Galardini), which is located in the contrib directory of the main Roary directory. Open the terminal, go into the tutorial directory (where all the result files are present), and type the following, adjusting the path to your own Roary installation:
$ cd tutorial
$ /home/user/Downloads/roary/contrib/roary_plots/roary_plots.py accessory_binary_genes.fa.newick gene_presence_absence.csv
Three png files will be added to the same tutorial directory: pangenome_frequency.png (Fig. 1), pangenome_matrix.png (Fig. 2), and pangenome_pie.png (Fig. 3), as shown below.
Fig. 1 Frequency of genes plotted against the number of genomes in which they are present.
Fig. 2 Presence/absence matrix of the gene clusters, shown alongside the tree.
Fig. 3 Pie chart showing the breakdown of genes into core, soft-core, shell, and cloud categories.
You can also visualize the Newick file in phylogeny software such as MEGA for further analysis.
This article demonstrated the creation of a pangenome of isolated genome sequences using Roary.
References
1. Page, A. J., Cummins, C. A., Hunt, M., Wong, V. K., Reuter, S., Holden, M. T., … & Parkhill, J. (2015). Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, 31(22), 3691-3693.
2. Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068-2069.
3. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
4. van Dongen, S. (2000). Graph clustering by flow simulation. University of Utrecht.