Documentation on metagenomes, reference genes, and annotation pipeline


Microbiome researchers frequently want to know the abundance of a particular microbial gene, pathway, or species across different human hosts and its association with disease. While there are now thousands of publicly available metagenomes from the human gut, computational barriers prevent most researchers from conducting such analyses. MetaQuery is a web application that enables rapid and quantitative analysis of specific genes, functions and taxa across 1,267 publicly available human gut metagenomes. These data span several continents (Europe, China, North America) and disease states (IBD, diabetes, obesity, rheumatoid arthritis, colorectal cancer, and liver cirrhosis). The speed and accessibility of MetaQuery are a step toward democratizing metagenomics research, which should allow many researchers to query the abundance and variation of specific genes in the human gut microbiome.

You can read more about MetaQuery here:
Nayfach S, Fischbach MA, Pollard KS. MetaQuery: a web server for rapid annotation and quantitative analysis of specific genes in the human gut microbiome. Bioinformatics 2015;31(14). doi:10.1093/bioinformatics/btv382

MetaQuery web development team at Gladstone Institutes:
Chunyu Zhao, Stephen Nayfach, Ayushi Agrawal, Andrew Davis, Alexander R. Pico, Katherine S. Pollard

How it works:

  1. Align query genes to the microbiome gene catalog: Query genes specified by the user are aligned to the integrated catalog of reference genes in the human gut microbiome using BLAST. More details about the gene catalog are provided below.
  2. Identify homologous microbiome genes: Gene sequences similar to any of the user's queries are identified in the gene catalog. Based on specified alignment parameters, the user can choose to target genes that are highly similar (e.g. >95% identity) or more distantly related (e.g. >30% identity).
  3. Quantify abundance of homologs: The abundance of all 9.9 million microbiome genes is precomputed for 1,267 samples. This enables MetaQuery to rapidly look up gene abundances for identified homologs. Gene abundances are defined as the average gene copies per cell. Read below for more information about how gene abundances are estimated.
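Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not MetaQuery's actual code: the `Hit` record, the function names, and the nested-dictionary layout of the precomputed abundances are all hypothetical, and summing abundance over the selected homologs is an assumption about how the per-sample value is combined.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One BLAST hit of a query gene against the gene catalog (hypothetical record)."""
    query_id: str
    gene_id: str          # identifier of the catalog gene that was hit
    pct_identity: float   # percent identity of the alignment


def select_homologs(hits: Iterable[Hit], min_identity: float) -> Set[str]:
    """Step 2: keep catalog genes whose hit exceeds the identity threshold."""
    return {h.gene_id for h in hits if h.pct_identity > min_identity}


def lookup_abundance(
    homologs: Set[str],
    abundance: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Step 3: look up precomputed abundances (sample -> gene -> copies per cell).

    Assumption: the abundances of all selected homologs are summed per sample.
    Genes absent from a sample are treated as having zero abundance.
    """
    return {
        sample: sum(genes.get(g, 0.0) for g in homologs)
        for sample, genes in abundance.items()
    }


if __name__ == "__main__":
    hits = [Hit("q1", "g1", 98.0), Hit("q1", "g2", 40.0), Hit("q1", "g3", 25.0)]
    table = {"s1": {"g1": 0.5, "g2": 0.25}, "s2": {"g2": 1.0}}
    close = select_homologs(hits, min_identity=95.0)    # highly similar genes only
    distant = select_homologs(hits, min_identity=30.0)  # also distant homologs
    print(lookup_abundance(close, table))
    print(lookup_abundance(distant, table))
```

Because the abundances are precomputed, the lookup is only a dictionary access per sample, which is why queries can be answered quickly once the BLAST step has finished.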

Flowchart for estimating the abundance of a query gene in the human gut.

Microbiome gene catalog:

| Statistic | Value |
|---|---|
| Genes in catalog | 9,879,900 |
| Complete genes (%) | 57.74 |
| From reference genomes (%) | 2.46 |
| Annotated at phylum level (%) | 21.31 |
| Annotated at genus level (%) | 16.3 |
| Annotated by KEGG (%) | 42.1 |
| Annotated by eggNOG (%) | 60.44 |

Statistics on 9.9M reference genes
For more information, see:

Public metagenomes used:

| Reference | Gigabases | Sequencing runs | Samples | Subjects |
|---|---|---|---|---|
| Human Microbiome Project Consortium. The framework for human microbiome research. Nature 2012;486(7402):215-221. doi:10.1038/nature11209 | 3771.27 | 1576 | 337 | 180 |
| Le Chatelier, E., et al. Richness of human gut microbiome correlates with metabolic markers. Nature 2013;500(7464):541-546. doi:10.1038/nature12506 | 1641.29 | 595 | 292 | 292 |
| Li, J., et al. An integrated catalog of reference genes in the human gut microbiome. Nature Biotechnology 2014;32(8):834-841. doi:10.1038/nbt.2942 | 1641.29 | 595 | 292 | 292 |
| Nielsen, H.B., et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature Biotechnology 2014;32(8):822-828. doi:10.1038/nbt.2939 | 1455.11 | 1704 | 396 | 318 |
| Zhang, X., et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nature Medicine 2015;21(8):895-905. doi:10.1038/nm.3914 | 1454.22 | 232 | 232 | 232 |
| Qin, J., et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 2012;490(7418):55-60. doi:10.1038/nature11450 | 1295.18 | 365 | 365 | 365 |
| Qin, N., et al. Alterations of the human gut microbiome in liver cirrhosis. Nature 2014;513(7516):59-64. doi:10.1038/nature13568 | 1207.97 | 314 | 237 | 237 |
| Feng, Q., et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat Commun. 2015;6:6528. doi:10.1038/ncomms7528 | 778.91 | 312 | 156 | 156 |
| Karlsson, F.H., et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 2013;498(7452):99-103. doi:10.1038/nature12198 | 460.08 | 147 | 145 | 145 |
| Qin, J., et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010;464(7285):59-65. doi:10.1038/nature08821 | 402.17 | 264 | 124 | 124 |
| Total | 13937.8 | 6088 | 2271 | 1962 |

Statistics on human gut metagenomes used by MetaQuery
All metagenomes were downloaded from the NCBI Sequence Read Archive.
Datasets were identified with the aid of SRAdb (doi:10.1186/1471-2105-14-19).
FastQC was used to filter out low-quality metagenomes, defined as those with a read length <50 bp, an average read quality <20, or >2% of reads with adaptor contamination.
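The three quality thresholds can be written as a single predicate. The sketch below is illustrative only: the `RunQC` record and its field names are hypothetical, and it assumes the per-run summary values have already been extracted from the quality reports.

```python
from dataclasses import dataclass


@dataclass
class RunQC:
    """Per-metagenome quality summary (hypothetical record, not a FastQC type)."""
    read_length: float       # read length in bp
    mean_quality: float      # average read quality score
    adaptor_fraction: float  # fraction of reads with adaptor contamination (0 to 1)


def is_high_quality(qc: RunQC) -> bool:
    """Return False if the run fails any one of the three stated thresholds."""
    too_short = qc.read_length < 50
    low_quality = qc.mean_quality < 20
    contaminated = qc.adaptor_fraction > 0.02
    return not (too_short or low_quality or contaminated)


if __name__ == "__main__":
    runs = {
        "run_a": RunQC(read_length=100, mean_quality=35.0, adaptor_fraction=0.001),
        "run_b": RunQC(read_length=44, mean_quality=35.0, adaptor_fraction=0.001),
    }
    kept = [name for name, qc in runs.items() if is_high_quality(qc)]
    print(kept)
```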

Mapping reads and quantifying genes:

Reads from high-quality metagenomes were mapped to genes from the integrated gene catalog using Bowtie2 (doi:10.1038/nmeth.1923) with the setting --sensitive-local. Alignments with <70% nucleotide identity, or that covered <80% of the read's length, were discarded. The read depth of each reference gene was quantified from the mapped reads, and these values were normalized by the median read depth across a panel of 30 universal single-copy genes (doi:10.1371/journal.pone.0077033). The resulting statistic, average genomic copy number, is an estimate of the average number of copies of the gene per cell. We also estimated gene relative abundance, obtained by scaling abundances to sum to 1.0 across genes within each sample. Gene abundances were aggregated across technical replicate datasets where applicable.
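The alignment filter and the two normalizations can be illustrated with a small sketch. It assumes read depths have already been computed per gene for one sample. The function names and the dictionary layout are hypothetical. Computing relative abundance from the copy-number values is also an assumption, since the text does not say which abundances are scaled.

```python
from statistics import median
from typing import Dict, Iterable


def keep_alignment(pct_identity: float, aligned_len: int, read_len: int) -> bool:
    """Discard alignments with <70% identity or covering <80% of the read."""
    return pct_identity >= 70.0 and aligned_len / read_len >= 0.80


def copy_number(gene_depth: Dict[str, float], marker_ids: Iterable[str]) -> Dict[str, float]:
    """Divide each gene's read depth by the median depth of the marker genes.

    The result estimates the average number of copies of each gene per cell.
    """
    marker_depth = median(gene_depth[m] for m in marker_ids)
    if marker_depth == 0:
        raise ValueError("median marker-gene depth is zero; cannot normalize")
    return {gene: depth / marker_depth for gene, depth in gene_depth.items()}


def relative_abundance(values: Dict[str, float]) -> Dict[str, float]:
    """Scale abundances so that they sum to 1.0 across genes in one sample."""
    total = sum(values.values())
    if total == 0:
        raise ValueError("total abundance is zero; cannot scale")
    return {gene: v / total for gene, v in values.items()}


if __name__ == "__main__":
    # Toy sample: three marker genes (the real panel has 30) and one gene of interest.
    depth = {"marker1": 10.0, "marker2": 20.0, "marker3": 30.0, "geneA": 40.0}
    copies = copy_number(depth, ["marker1", "marker2", "marker3"])
    print(copies["geneA"])                      # depth 40 / median marker depth 20
    print(relative_abundance(copies)["geneA"])
```

Using the median rather than the mean of the marker-gene depths keeps the normalization robust to a few markers with unusually high or low coverage.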

Estimating the abundance of functional and taxonomic groups:

The abundances of taxonomic groups were estimated for all high-quality metagenomes using MetaPhlAn2 (doi:10.1038/nmeth.3589) and mOTU (doi:10.1038/nmeth.2693). The abundances of functional groups were estimated using the KEGG (doi:10.1093/nar/28.1.27) and eggNOG (doi:10.1093/nar/gkr1060) databases. Reference gene annotations were obtained from the GigaScience database (doi:10.5524/100064).

Statistical tests to identify biomarkers:

Wilcoxon rank-sum tests are used to identify genes, functions, and taxa that are differentially abundant between cases and controls. MetaQuery performs these tests on cohorts from the same country and excludes individuals with comorbidities or with known drug treatment (e.g. metformin). Gene abundances are averaged across samples from the same individual.
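The per-individual averaging and the rank-sum test can be sketched as follows. This is a standard-library illustration under stated assumptions, not MetaQuery's implementation: the function names are hypothetical, and the p-value uses the normal approximation with a tie correction and no continuity correction, which may differ from the exact procedure the server uses.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Sequence, Tuple


def average_by_subject(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average abundances over samples that come from the same individual."""
    grouped = defaultdict(list)
    for subject, value in samples:
        grouped[subject].append(value)
    return {subject: mean(values) for subject, values in grouped.items()}


def rank_sum_test(cases: Sequence[float], controls: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (z, p). A positive z means cases tend to rank above controls.
    """
    n1, n2 = len(cases), len(controls)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    n = n1 + n2
    # Sort pooled values while remembering each value's original position.
    pooled = sorted((v, i) for i, v in enumerate(list(cases) + list(controls)))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied values share the mean of their ranks
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w = sum(ranks[:n1])  # rank sum of the cases
    expected = n1 * (n + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 0.0, 1.0  # all values tied: no evidence of a difference
    z = (w - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    case_samples = [("c1", 5.0), ("c1", 7.0), ("c2", 8.0), ("c3", 9.0), ("c4", 7.5)]
    control_samples = [("h1", 1.0), ("h2", 2.0), ("h3", 3.0), ("h4", 4.0)]
    cases = list(average_by_subject(case_samples).values())
    controls = list(average_by_subject(control_samples).values())
    print(rank_sum_test(cases, controls))
```

Averaging within an individual before testing keeps each subject from contributing more than one observation, so repeated samples do not inflate the apparent sample size.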