MetaQuery: quantitative analysis of the human gut microbiome


Documentation on metagenomes, reference genes, and annotation pipeline


Microbiome researchers frequently want to know the abundance of a particular microbial gene, pathway, or species across different human hosts and its association with disease. While there are now thousands of publicly available metagenomes from the human gut, computational barriers prevent most researchers from conducting such analyses. MetaQuery is a web application that enables rapid and quantitative analysis of specific genes, functions and taxa across >2,000 publicly available human gut metagenomes. These data span several continents (Europe, China, North America) and disease states (IBD, diabetes, obesity, rheumatoid arthritis, colorectal cancer, and liver cirrhosis). The speed and accessibility of MetaQuery are a step toward democratizing metagenomics research, which should allow many researchers to query the abundance and variation of specific genes in the human gut microbiome.

You can read more about MetaQuery here:
Nayfach S, Fischbach MA, Pollard KS. MetaQuery: a web server for rapid annotation and quantitative analysis of specific genes in the human gut microbiome. Bioinformatics 2015;31(14). doi:10.1093/bioinformatics/btv382

How it works:

  1. Align query genes to the microbiome gene catalog: Query genes specified by the user are aligned to the integrated catalog of reference genes in the human gut microbiome using BLAST. Read more about the gene catalog.
  2. Identify homologous microbiome genes: Gene sequences similar to any of the user's queries are identified in the gene catalog. Based on the specified alignment parameters, the user can choose to target genes that are highly similar (e.g. >95% identity) or more distantly related (e.g. >30% identity).
  3. Quantify abundance of homologs: The abundance of all 9.88M microbiome genes is precomputed for >2,000 samples. This enables MetaQuery to rapidly look up gene abundances for the identified homologs. Gene abundance is defined as the average number of gene copies per cell. Read more about how gene abundances were estimated.

Flowchart for estimating the abundance of a query gene in the human gut.
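
The three steps above can be sketched in a few lines of Python. This is only an illustration, not MetaQuery's actual code: the function names, the dictionary layout of the BLAST hits and of the precomputed abundance table, and the choice to sum the copy numbers of all homologs within a sample are assumptions made for the example.

```python
# Illustrative sketch only: data layout and function names are hypothetical.

def find_homologs(blast_hits, min_identity=30.0):
    """Return the IDs of catalog genes whose alignment to any query
    meets the user's percent-identity threshold."""
    return {hit["gene_id"] for hit in blast_hits
            if hit["pct_identity"] >= min_identity}


def query_abundance(homologs, abundance_table):
    """Look up precomputed copy numbers for the homologs in each sample.

    Assumption: the abundance of a query is taken to be the sum over its
    homologs; genes not detected in a sample count as 0.
    """
    return {sample: sum(gene_abund.get(gene, 0.0) for gene in homologs)
            for sample, gene_abund in abundance_table.items()}


# Toy BLAST hits against the gene catalog (step 1 is assumed already done).
hits = [
    {"gene_id": "g1", "pct_identity": 98.2},
    {"gene_id": "g2", "pct_identity": 41.0},
    {"gene_id": "g3", "pct_identity": 25.0},
]
# Toy precomputed table: sample -> {gene -> average copies per cell}.
abundance = {
    "s1": {"g1": 0.5, "g2": 0.25},
    "s2": {"g3": 1.0},
}

strict = find_homologs(hits, min_identity=95.0)   # highly similar genes only
relaxed = find_homologs(hits, min_identity=30.0)  # includes distant homologs
per_sample = query_abundance(relaxed, abundance)
```

Because the abundances are precomputed, changing the identity threshold only changes which genes are looked up; no reads need to be re-mapped.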

Microbiome gene catalog:

attribute                      value
count genes in catalog         9,879,900
% complete genes               57.74
% from reference genomes       2.46
% annotated at phylum level    21.31
% annotated at genus level     16.3
% annotated by KEGG            42.1
% annotated by eggNOG          60.44
Statistics on 9.88M reference genes
For more information, see:

Public metagenomes used:

reference  giga_bases  sequencing_runs  samples  subjects
Human Microbiome Project Consortium, The. A framework for human microbiome research. Nature 2012;486(7402):215-221. doi:10.1038/nature11209  3771.27  1576  337  180
Le Chatelier, E., et al. Richness of human gut microbiome correlates with metabolic markers. Nature 2013;500(7464):541-546. doi:10.1038/nature12506  1641.29  595  292  292
Li, J., et al. An integrated catalog of reference genes in the human gut microbiome. Nature Biotechnology 2014;32(8):834-841. doi:10.1038/nbt.2942  1471.59  579  260  238
Nielsen, H.B., et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature Biotechnology 2014;32(8):822-828. doi:10.1038/nbt.2939  1455.11  1704  396  318
Zhang, X., et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nature Medicine 2015;21(8):895-905. doi:10.1038/nm.3914  1454.22  232  232  232
Qin, J., et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 2012;490(7418):55-60. doi:10.1038/nature11450  1295.18  365  365  365
Qin, N., et al. Alterations of the human gut microbiome in liver cirrhosis. Nature 2014;513(7516):59-64. doi:10.1038/nature13568  1207.97  314  237  237
Feng, Q., et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat Commun. 2015;6:6528. doi:10.1038/ncomms7528  778.91  312  156  156
Karlsson, F.H., et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 2013;498(7452):99-103. doi:10.1038/nature12198  460.08  147  145  145
Qin, J., et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010;464(7285):59-65. doi:10.1038/nature08821  402.17  264  124  124
Statistics on human gut metagenomes used by MetaQuery
All metagenomes were downloaded from the NCBI Sequence Read Archive.
Datasets were identified with the aid of SRAdb (doi:10.1186/1471-2105-14-19).
FastQC was used to filter out low-quality metagenomes, defined as those with a read length <50 bp, an average read quality <20, or >2% of reads with adaptor contamination.
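
The quality filter above amounts to three thresholds applied to per-metagenome summary statistics. A minimal sketch follows; the function and argument names are hypothetical and do not correspond to FastQC's output format, which would need to be parsed separately.

```python
# Illustrative sketch only: argument names are hypothetical summary values,
# not fields produced directly by FastQC.

def passes_quality_filter(read_length_bp, mean_read_quality, pct_adaptor_reads):
    """Return True if a metagenome is kept.

    A metagenome is discarded when any one of these holds:
      - read length < 50 bp
      - average read quality < 20
      - more than 2% of reads show adaptor contamination
    """
    return (read_length_bp >= 50
            and mean_read_quality >= 20
            and pct_adaptor_reads <= 2.0)


kept = passes_quality_filter(100, 34.5, 0.3)          # passes all three checks
short_reads = passes_quality_filter(44, 34.5, 0.3)    # fails on read length
contaminated = passes_quality_filter(100, 34.5, 2.5)  # fails on adaptor content
```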

Mapping reads and quantifying genes:

Reads from high-quality metagenomes were mapped to genes from the integrated gene catalog using Bowtie2 (doi:10.1038/nmeth.1923) with settings: --sensitive-local. Alignments with <70% nucleotide identity, or in which <80% of the read's length was aligned, were discarded. The read-depth of each reference gene was quantified based on mapped reads, and these values were normalized by the median read-depth across a panel of 30 universal single-copy genes (doi:10.1371/journal.pone.0077033). The resulting statistic, average genomic copy number, is an estimate of the average number of copies of the gene per cell. We also estimated gene relative abundance, which was obtained by scaling abundances to sum to 1.0 across genes within each sample. Gene abundances were aggregated across technical replicate datasets where applicable.
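
The filtering and normalization described above can be illustrated with a short sketch. It is not the pipeline's actual code: the function names and inputs are hypothetical, the read-coverage check assumes that "coverage" means aligned length divided by read length, and the marker-gene depths are assumed to be supplied as a separate list.

```python
# Illustrative sketch only: names and data layout are assumptions.
from statistics import median


def alignment_passes(pct_identity, aligned_length, read_length):
    """Keep an alignment with >=70% nucleotide identity that covers
    >=80% of the read (assumed: aligned length / read length)."""
    return pct_identity >= 70.0 and aligned_length / read_length >= 0.80


def average_copy_number(gene_depth, marker_depths):
    """Divide each gene's read-depth by the median read-depth of the
    universal single-copy marker genes in the same sample."""
    marker_median = median(marker_depths)
    if marker_median == 0:
        raise ValueError("marker genes have zero median read-depth")
    return {gene: depth / marker_median for gene, depth in gene_depth.items()}


def relative_abundance(abundances):
    """Scale abundances so they sum to 1.0 across genes in one sample."""
    total = sum(abundances.values())
    return {gene: value / total for gene, value in abundances.items()}


# Toy sample: three catalog genes and three marker genes (the real panel has 30).
depth = {"geneA": 10.0, "geneB": 5.0, "geneC": 0.0}
copies = average_copy_number(depth, marker_depths=[4.0, 5.0, 6.0])
fractions = relative_abundance(copies)
```

In this toy sample the marker genes have a median depth of 5.0, so geneA is estimated at 2 copies per cell and geneB at 1 copy per cell. Using the median rather than the mean makes the normalizer less sensitive to a few marker genes with unusually high or low coverage.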

Estimating the abundance of functional and taxonomic groups:

The abundance of taxonomic groups was estimated for all high-quality metagenomes using MetaPhlAn2 (doi:10.1038/nmeth.3589) and mOTU (doi:10.1038/nmeth.2693). The abundance of functional groups was estimated using the KEGG (doi:10.1093/nar/28.1.27) and eggNOG (doi:10.1093/nar/gkr1060) databases. Reference gene annotations were obtained from the GigaScience database (doi:10.5524/100064).

Statistical tests to identify biomarkers:

Wilcoxon rank-sum tests were used to identify genes, functions, and taxa that were differentially abundant between cases and controls. We performed these tests on cohorts from the same country and excluded individuals with comorbidities or with known drug treatment (e.g. metformin). To avoid repeated measures, gene abundances were averaged across samples from the same individual.
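
The two steps above (averaging per individual, then a rank-sum test) can be sketched as follows. This is a simplified standard-library illustration under stated assumptions, not the code used by MetaQuery: the function names and the input layout are hypothetical, and the p-value uses the normal approximation with midranks for ties and no tie correction. In practice an established routine such as SciPy's `scipy.stats.ranksums` would normally be used.

```python
# Illustrative sketch only: names and input layout are hypothetical.
from collections import defaultdict
from math import sqrt
from statistics import NormalDist, mean


def average_per_subject(records):
    """Collapse (subject_id, abundance) pairs to one mean value per subject,
    so that no individual contributes repeated measures."""
    by_subject = defaultdict(list)
    for subject_id, abundance in records:
        by_subject[subject_id].append(abundance)
    return {subject: mean(values) for subject, values in by_subject.items()}


def rank_sum_test(cases, controls):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.
    Tied values receive the average of their ranks."""
    n1, n2 = len(cases), len(controls)
    pooled = sorted((value, idx)
                    for idx, value in enumerate(list(cases) + list(controls)))
    ranks = [0.0] * (n1 + n2)
    pos = 0
    while pos < len(pooled):
        end = pos
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[pos][0]:
            end += 1
        midrank = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[pooled[k][1]] = midrank
        pos = end + 1
    rank_sum_cases = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_cases - expected) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Toy data: one subject in the case group was sampled twice.
case_means = average_per_subject([("c1", 4.0), ("c1", 6.0), ("c2", 6.0),
                                  ("c3", 7.0), ("c4", 8.0)])
control_values = [1.0, 2.0, 3.0, 4.0]
z, p_value = rank_sum_test(list(case_means.values()), control_values)
```

Averaging first means each individual enters the test exactly once, which keeps the independence assumption of the rank-sum test intact.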