Connect with us

Meta Analysis

Predictive metagenomics profiling: why, what and how ?



What is predictive metagenomics profiling?

Recently, predictive metagenomics profiling (PMP) has been added to the microbial ecologist’s arsenal of strategies for probing microbial communities. Currently, there are two platforms available for PMP; PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States; and tax4fun available at (Langille et al., 2013; Aßhauer et al., 2015). The aim of PMP is to predict the abundance of functional gene families present in microbial communities (i.e. the community’s functional potential) using amplicon-based sequencing data, such as 16S rRNA data.

If PMP reveals that the predicted functional potential of a community changes between treatments, then researchers will have a strong incentive to invest the time, money and computational power to investigate community function further via techniques including shotgun metagenomics or functional microarrays such as GeoChip (He et al., 2010). Additionally, the types of gene families predicted to vary by PMP could give rise to new hypotheses that will direct future experimental design.

How does it work?

PICRUSt and tax4fun are described in detail in their respective publications but briefly:

Both programs generate a database of organisms with known gene content. For tax4fun, this is achieved using KEGG organisms (Kyoto Encyclopedia of Genes and Genomes; with sequenced genomes. For PICRUSt, a reference phylogenetic tree with GreenGenes identifiers (available at is created. The gene content for organisms in the PICRUSt reference phylogenetic tree is either a) compiled directly from databases containing the well annotated, sequenced genomes or, b) inferred using an ancestral state reconstruction algorithm. The latter process is used when organisms in the reference phylogenetic tree have not been sequenced. The underlying assumption is that taxonomically similar organisms will have functionally similar capabilities. Indeed, the authors of PICRUSt highlight that phylogeny and biomolecular function are highly correlated (Langille et al., 2013).

Gene family predictions for experimental data then are made by associating the taxonomic ID of OTUs (Operational Taxonomic Units) in experimental data to organisms in the reference database with precomputed gene content. For PICRUSt, GreenGenes taxonomic identifiers are used to create the phylogenetic reference tree, thus to map the experimental 16S rRNA data to this tree, taxonomic assignments of OTUs must also be performed using GreenGenes. Conversely, tax4fun requires SILVA ( assigned taxonomies which are then mapped to the reference database of KEGG organisms. Both programs use 16S rRNA copy number information to normalize the abundance of identified OTUs.

Why use predictive metagenomics profiling?

Unlike metagenomics, which requires massive amounts of sequencing in order to achieve adequate read-coverage of rare genomes – which is particularly challenging for highly diverse communities such as soils – PMP only requires sufficient sequencing depth to cover a single target gene – i.e. the 16S rRNA gene (Howe et al., 2014). This significantly reduces the amount of data required to take a first look at the community functional profile. The output of a PMP pipeline is a table of predicted gene family counts (in the form of identifiers such as KEGG ontologies) which can be clustered to path-way level categories (i.e. KO modules) if so desired. Additionally, for each OTU, 16S rRNA copy number is used to scale the contribution of predicted gene families to the community’s functional potential.

A natural limitation for PMP lies in the diversity of available reference organisms that have their genomes fully sequenced. For an OTU to have its gene content predicted it must be able to be mapped to a reference organism. If it is not similar enough to a reference organism then gene content cannot be predicted. Additionally, even though PICRUSt can predict the gene content of unsequenced organisms, these predictions will only be accurate if there are a number of closely-related, sequenced genomes available to make predictions from. Nevertheless, PMP is a simple and elegant strategy for adding value to 16S rRNA metagenomic data, such as the one that would be generated on an Illumina Miseq.

Using predictive metagenomics profiling

The input for PMP is a normalized OTU table whereby taxonomic assignments have been made using the database appropriate to the PMP strategy (i.e. SILVA or GreenGenes). As with any sequencing data, appropriate QC and data trimming strategies need to be observed. Data QC, taxonomic assignments, generation of an OTU table and normalisation (i.e. rarefied counts or model-based normalisation) can be performed using existing pipelines such as QIIME (Quantitative Insights Into Microbial ecology; and USEARCH 16S rRNA pipelines (Caporaso et al., 2010; Edgar, 2010; McMurdie and Holmes, 2014). Both PICRUSt and tax4fun are compatible with QIIME outputs, making it easy for users to add PMP to existing workflows. Through the QIIME-PICRUSt help pages, there is a range of recommended scripts that can be used to assess changes to functional profiles between communities (Caporaso et al., 2010). Additionally, KO data can be exported and used in conjunction with traditional ecological tools, such as ordinations (NMDS, PCO etc.,) or species indicator tests (available using the multipatt function in the indicspecies package in R) to identify gene families or functional modules that are indicative of different communities (De Cáceres et al., 2010).

Thus, given a 16S rRNA dataset, PMP is a cost-effective, a logical starting point for deciding a) whether to proceed with an examination of functional changes in a community and b) the best experimental strategy to investigate said community.


Aßhauer, K.P., Wemheuer, B., Daniel, R., Meinicke, P., 2015. Tax4Fun: Predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics, 31, 2882-2884.

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Pẽa, A.G., Goodrich, J.K., Gordon, J.I., Huttley, G.A., Kelley, S.T., Knights, D., Koenig, J.E., Ley, R.E., Lozupone, C.A., McDonald, D., Muegge, B.D., Pirrung, M., Reeder, J., Sevinsky, J.R., Turnbaugh, P.J., Walters, W.A., Widmann, J., Yatsunenko, T., Zaneveld, J., Knight, R., 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335-336.

De Cáceres, M., Legendre, P., Moretti, M., 2010. Improving indicator species analysis by combining groups of sites. Oikos, 119, 1674-1684.

Edgar, R.C., 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460-2461.

He, Z., Deng, Y., Van Nostrand, J.D., Tu, Q., Xu, M., Hemme, C.L., Li, X., Wu, L., Gentry, T.J., Yin, Y., Liebich, J., Hazen, T.C., Zhou, J., 2010. GeoChip 3.0 as a high-throughput tool for analyzing microbial community composition, structure and functional activity. ISME Journal, 4, 1167-1179.

Howe, A.C., Jansson, J.K., Malfatti, S.A., Tringe, S.G., Tiedje, J.M., Brown, C.T., 2014. Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences of the United States of America, 111, 4904-4909.

Langille, M.G.I., Zaneveld, J., Caporaso, J.G., McDonald, D., Knights, D., Reyes, J.A., Clemente, J.C., Burkepile, D.E., Vega Thurber, R.L., Knight, R., Beiko, R.G., Huttenhower, C., 2013. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology, 31, 814-821.

McMurdie, P.J., Holmes, S., 2014. Waste Not, Want Not: Why rarefying microbiome data Is inadmissible. PLoS Computational Biology, 10.

For a more comprehensive list of references, readers are directed to mail us at [email protected]

My research interest is in understanding the ecology of soil microbial communities and the practical applications this information will have for global sustainability. In particular I am interested in the contribution that soil microbial communities can make to the plant-based bioremediation of heavy metal pollution and to sustainable agricultural practices. I collaborate with soil scientists, community ecologists and bioinformaticians and utilize community-based approaches including community finger-printing and next-generation sequencing technologies to understand the function of soil microbial communities in a variety of soil conditions. Ultimately I am interested in understanding how soil microbial communities promote plant growth and what drives them to do so. With this information we can utilize soil microbial communities to improve both agriculture and bioremediation.

Meta Analysis

How to develop a search strategy for meta-analysis and systematic reviews?



It all begins with finding a topic to work on! Find yourself a topic which has current relevance and with publications of clinical trials, case-control or response-no response studies in recent years, Randomized control trials are the gold standard for such analyses and are also hard to find. The procedure emphasizes the studies being recent as that makes the data for analysis updated to the latest methods, techniques, and error minimization procedures. Also, the recentness of included studies empowers the analyst’s view of the relevance of the topic in current times.

To construct a systematic review and conduct a meta-analysis of all literature published on any topic, one must know how to precisely extract meta-data from varied databases at their disposal. One may create his corpus of experimental studies from any single database, say PubMed, by submitting a query containing words/phrases outlining names of a disorder, prescribed/ on trial drugs, and type of study expected all connected by use of operators. PubMed usually will return with results in five-figures and this is a fairly large dataset to sort out manually.

We must consider at this point that PubMed is not the ultimate collection of all the biological literature ever published and in current publications, it is a sufficiently well-informed database but mustn’t be regarded as complete. There are other databases which act as topic-specific databases and can be expected to contain a more concentrated literature on the theme they nurture.

Moving on to the actual development of query, the analyst must ascertain that the search string developed is complete in all respects, i.e. it contains all the MeSH terms for the items queried. Even though PubMed will automatically generate synonyms for a query from its database of MeSH terms in our experience we found providing the MeSH terms saw a steep fall in the number of results returned, the drop was so significant that previously obtained five-figured corpus was reduced to a measly three figured and was manually sortable.

It is noteworthy that if possible one mustn’t filter out studies published in languages other than English as it may make the study biased by eliminating participation of regional studies and data. The analyst may always try to contact the authors for the English version of the study and thus reduce the amount of bias, making the analysis robust.


For  any query write to [email protected]

Continue Reading

Meta Analysis

Venice Criteria: Overview



The plethora of research literature available to the modern day biologists provides the luxury to conduct a unique procedure- an analysis of the meta(data of data). (more…)

Continue Reading


Meta-analysis of biological literature : Explained

The new cool word to biological realm “meta-analysis” can be better understood by understanding the meaning of first half of the term; META, meaning data of data..



It’s a fine Monday morning, and the new intern finds his way to the laboratory of biological data mining procedures. His brief interview with the concerned scientist has allowed him to have very limited understanding of the subject…. (more…)

Continue Reading