MetaQuery

Frequently Asked Questions

How does MetaQuery work?
What are the outputs of MetaQuery?
Does MetaQuery save my input data?
What are the best alignment parameters to use?
What does "average copy number" mean, and how does MetaQuery estimate this?
How do I cite MetaQuery?

Q: How does MetaQuery work?

A: MetaQuery estimates the abundance of a query sequence across 1,267 publicly available fecal metagenomes from human subjects.

The workflow is as follows:

The user enters one or more protein sequences in FASTA format. These sequences are searched against the integrated catalog of reference genes in the human gut microbiome (IGC) (Li, et al., 2014) using BLAST (Altschul, et al., 1990). The IGC is composed of 9.9 million genes that originate from microbial reference genomes assembled from sequencing of isolates and metagenomes.
Homologs of the query sequence are identified in the IGC based on the BLAST alignments and the set of alignment parameters entered by the user. These parameters include maximum E-value and minimum percent identity (%ID). Because over 40% of the genes in the IGC lack a start codon, a stop codon, or both (Li, et al., 2014), many alignments cannot globally cover both the query and the target sequence. Therefore, we enforce a minimum glocal alignment coverage of 70%, defined as max(Laln/Lquery, Laln/Ltarget), where Laln is the alignment length, Lquery is the length of the query, and Ltarget is the length of the target.
Next, we obtain the relative abundances of the identified homologs from a precomputed abundance matrix built by Li, et al. (2014). This matrix contains the relative abundances of 9.9 million genes across 1,267 metagenomic samples, where the relative abundances of all genes are scaled to sum to 1.0 per sample. For each query, we sum the relative abundances of all identified homologs within each sample.
Optionally, our software normalizes gene relative abundances using a panel of 30 universal single-copy genes (Nayfach and Pollard, 2015). The result of this normalization is a metric called Average Genomic Copy Number, which represents the estimated average copy number of a gene across microbial cells (Manor and Borenstein, 2015). Without normalization, the resulting metric is Relative Abundance, which is scaled to sum to 1.0 across all genes for a sample.
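
The homolog selection and per-sample summation described above can be sketched as follows. This is a minimal illustration, not the MetaQuery source code: the dictionary keys (`target`, `evalue`, `pid`, `aln_len`, `query_len`, `target_len`), the function names, and the toy data are all assumptions made for the example.

```python
def glocal_coverage(aln_len, query_len, target_len):
    """Coverage as defined above: max(Laln/Lquery, Laln/Ltarget)."""
    return max(aln_len / query_len, aln_len / target_len)


def select_homologs(hits, max_evalue, min_pid, min_cov=0.70):
    """Return the set of target genes whose alignment passes all three cutoffs.

    `hits` is a list of dicts with hypothetical keys (not the real
    BLAST output column names).
    """
    keep = set()
    for h in hits:
        cov = glocal_coverage(h["aln_len"], h["query_len"], h["target_len"])
        if h["evalue"] <= max_evalue and h["pid"] >= min_pid and cov >= min_cov:
            keep.add(h["target"])
    return keep


def sum_abundance(homologs, abundance):
    """Sum the relative abundances of all homologs within each sample.

    `abundance` maps sample -> {gene: relative abundance}; genes absent
    from a sample count as 0.0.
    """
    return {
        sample: sum(genes.get(g, 0.0) for g in homologs)
        for sample, genes in abundance.items()
    }


# Toy data (invented for illustration).
hits = [
    # Near full-length alignment: passes.
    {"target": "geneA", "evalue": 1e-50, "pid": 95.0,
     "aln_len": 290, "query_len": 300, "target_len": 310},
    # Fragmentary target gene: covers the target (150/160) but not the query.
    {"target": "geneB", "evalue": 1e-20, "pid": 80.0,
     "aln_len": 150, "query_len": 300, "target_len": 160},
    # Short local alignment: covers neither sequence, so it is rejected.
    {"target": "geneC", "evalue": 1e-10, "pid": 85.0,
     "aln_len": 100, "query_len": 300, "target_len": 400},
]
abundance = {
    "sample1": {"geneA": 0.002, "geneB": 0.001, "geneC": 0.010},
    "sample2": {"geneB": 0.0005},
}

homologs = select_homologs(hits, max_evalue=1e-5, min_pid=70.0)
per_sample = sum_abundance(homologs, abundance)
```

Taking the maximum of the two coverage ratios is what lets a fragmentary catalog gene (geneB in the toy data) count as a homolog even though it covers only half of the query.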

Q: What are the outputs of MetaQuery?

A: MetaQuery outputs include figures and tables.

Expected Output: Figures
- Abundance Plot abundance.png: The abundance of identified homologs across gut microbiome samples. For taxonomic groups (e.g. species), abundance is defined as the proportion of cells that belong to the taxonomic group. For functional groups (e.g. gene families), abundance is the average genomic copy number of the function per cell (with normalization) or the relative abundance (without normalization).
  - Left panel: the abundance of the identified homologs across human gut metagenomes. Samples with an abundance of zero were assigned the smallest non-zero value.
  - Right panel: the average abundance of identified homologs compared to the average abundance of other groups at the same functional or taxonomic level.
- Prevalence Plot prevalence.png: The prevalence of identified homologs across gut microbiome samples. Prevalence is defined as the percentage of samples in which identified homologs are found.
  - Left panel: the prevalence of identified homologs across human gut metagenomes at different abundance thresholds.
  - Right panel: the prevalence of identified homologs at a minimum abundance of 0.001 compared to the prevalence of other groups at the same functional or taxonomic level.
- Boxplots Showing the Association of Abundance with Clinical Phenotypes: MetaQuery generates several figures showing associations between the abundance of identified homologs and clinical phenotypes. Wilcoxon rank-sum tests are performed to determine whether the abundance of identified homologs differs between cases and controls for several diseases (see table). For each disease, case and control individuals were selected from the same country, and individuals with comorbidities were excluded. See the documentation for more information on these cohorts.
  - p_value indicates whether there is a significant difference in the abundance of identified homologs between cases and controls.
  - rank and percentile indicate how the p_value for identified homologs compares to other functional or taxonomic groups.
  For example, a percentile of 5.0 indicates that the p_value for the query was more significant than 95% of other functions or taxa. Boxplots include:
  - Ulcerative colitis.Spain.png
  - Crohns disease.Spain.png
  - Obesity.Denmark.png
  - Type II diabetes.China.png
  - Type II diabetes.Denmark.png
  - Type II diabetes.Sweden.png
  - Liver cirrhosis.China.png
  - Rheumatoid arthritis.China.png
  - Colorectal cancer.Austria.png
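
The definitions used in the figure descriptions above (prevalence at an abundance threshold, replacement of zeros for plotting, and the rank and percentile of a p_value) can be sketched as follows. This is only an illustration under stated assumptions: the function names and toy values are invented, the `>=` comparison for the threshold is assumed, and the exact percentile formula used by MetaQuery is not given in this FAQ, so the one below (rank divided by the number of groups including the query) is a guess that matches the "percentile of 5.0" example.

```python
def prevalence(abundances, min_abundance):
    """Percent of samples whose abundance is at least `min_abundance`."""
    found = sum(1 for a in abundances if a >= min_abundance)
    return 100.0 * found / len(abundances)


def floor_zeros(abundances):
    """Assign zero-abundance samples the smallest non-zero value."""
    smallest = min(a for a in abundances if a > 0)
    return [a if a > 0 else smallest for a in abundances]


def rank_and_percentile(query_p, other_ps):
    """Rank the query p-value among all groups (rank 1 = smallest p-value).

    Assumed formula: percentile = 100 * rank / (number of groups,
    query included).
    """
    rank = 1 + sum(1 for p in other_ps if p < query_p)
    percentile = 100.0 * rank / (len(other_ps) + 1)
    return rank, percentile


# Toy values (invented for illustration).
abundances = [0.0, 0.002, 0.0005, 0.01, 0.0]
prev = prevalence(abundances, 0.001)           # 2 of 5 samples pass
plotted = floor_zeros(abundances)              # zeros become 0.0005
rank, pct = rank_and_percentile(0.01, [0.001, 0.5, 0.2, 0.03])
```
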
Expected Output: Tables
- Search-by-Sequence: For each job, MetaQuery assigns a random job_id, and all results can be found in the folder metaquery_output_{job_id}. MetaQuery generates the following tables:
  - homolog_table.tsv
  - homologs_abundance.tsv
  - homologs_annotations.tsv
  - taxa_covariates.tsv
  - pheno_covariates.tsv
  In addition, users can download the raw BLAST results (blast_results.tsv) and the full metadata of the subjects (subject_attributes.tsv).
- Search-by-Name: MetaQuery returns a search_results.tsv table listing Query Type, Database, Level, and Name. For each result, MetaQuery produces a statistics table, pheno_table.tsv, as well as the figures described above, and saves them in the folder metaquery_output_{name}.

Q: Does MetaQuery save my input data?

A: No, MetaQuery does not save any user inputs. MetaQuery outputs are retained for 24 hours so that users can download them, and are then deleted.

Q: What are the best alignment parameters to use?

A: This depends on whether you are interested in close or remote homologs of your query. For close homologs, use a high percent identity cutoff (e.g. 90%, 95%, or 98%) and/or a low E-value cutoff. For remote homologs, use a lower percent identity cutoff and/or a higher E-value cutoff. The default values may be too lenient for your application. You can also run MetaQuery with several cutoffs and compare the results.
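
As a rough illustration of comparing cutoffs, the sketch below counts how many alignments survive each combination of minimum percent identity and maximum E-value. The hit values, the cutoff pairs, and the function name are invented for the example; they are not MetaQuery defaults or recommendations.

```python
def count_hits(hits, min_pid, max_evalue):
    """Count alignments passing both the %ID and the E-value cutoff."""
    return sum(1 for pid, evalue in hits
               if pid >= min_pid and evalue <= max_evalue)


# Each hit is (percent identity, E-value); values are invented.
hits = [
    (99.1, 1e-80), (96.0, 1e-75), (91.5, 1e-60),
    (72.0, 1e-30), (45.0, 1e-8), (31.0, 1e-4),
]

# (min %ID, max E-value) pairs, from strict to lenient.
cutoffs = [(95.0, 1e-10), (90.0, 1e-10), (50.0, 1e-5), (30.0, 1e-3)]

summary = {cutoff: count_hits(hits, *cutoff) for cutoff in cutoffs}
for (min_pid, max_evalue), n in summary.items():
    print(f"%ID >= {min_pid}, E-value <= {max_evalue}: {n} hits")
```

A large jump in the number of retained hits between two neighboring cutoffs suggests that the more lenient setting is pulling in a more distantly related group of sequences.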

Q: What does "average copy number" mean, and how does MetaQuery estimate this?

A: This is an abundance metric for a gene or gene family. It indicates the average number of copies of the gene per cell in a microbial community. It is obtained by normalizing the gene's abundance by the abundance of a panel of universal single-copy genes. A value of 1.0 therefore indicates that the gene is present once per cell on average, and a value of 0.01 indicates that it is present once per 100 cells on average.
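
The arithmetic behind this metric can be sketched as follows. The sketch assumes that the marker-gene abundances are summarized by their mean; this FAQ does not say how MetaQuery combines the 30 marker genes, and the function name and toy values are invented for the example.

```python
def average_copy_number(gene_abundance, marker_abundances):
    """Divide a gene's relative abundance by the mean relative abundance
    of universal single-copy marker genes in the same sample.

    Assumption: the markers are summarized by their mean.
    """
    marker_mean = sum(marker_abundances) / len(marker_abundances)
    return gene_abundance / marker_mean


# Toy relative abundances for three marker genes in one sample (invented).
markers = [0.0004, 0.0005, 0.0006]

once_per_cell = average_copy_number(0.0005, markers)        # about 1.0
once_per_100_cells = average_copy_number(0.000005, markers)  # about 0.01
```

Because universal single-copy genes are expected to occur once in every genome, their abundance serves as a per-sample estimate of the abundance of "one copy per cell", which is why the ratio can be read as a copy number.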

Q: How do I cite MetaQuery?

A: If you use MetaQuery, please use the following citation:
Nayfach S, Fischbach MA, Pollard KS. MetaQuery: a web server for rapid annotation and quantitative analysis of specific genes in the human gut microbiome. Bioinformatics 2015;31(14). doi:10.1093/bioinformatics/btv382
Also, be sure to cite the various resources, studies, and tools utilized by MetaQuery. These references can be found on the About page.