mikemc / metacal

Metagenomics calibration R package

Home Page: https://mikemc.github.io/metacal

License: Other

R 100.00%

metagenomics amplicon-sequencing r-package marker-gene-analysis microbiome-analysis

metacal's Introduction

metacal

The metacal package provides tools for bias estimation and calibration in marker-gene and metagenomics sequencing experiments. It implements the methods described in McLaren MR, Willis AD, Callahan BJ (2019) and is used for the analysis associated with that manuscript, available at the manuscript's repository.

Installation

Install the development version of metacal from GitHub:

# install.packages("devtools")
devtools::install_github("mikemc/metacal")

Usage

See the package tutorial for a demonstration of how to estimate bias from control samples with known composition (i.e., mock community samples), and how to calibrate the relative abundances in unknown samples of the taxa that were in the controls.

The primary utility of this package is quantitatively estimating the bias of protocols in quality control experiments, where samples with known composition are measured or samples with unknown composition are measured by multiple protocols.

It is currently not possible to calibrate the composition of a natural community without making strong and untested assumptions about bias being the same for constructed and natural samples and about the efficiencies of taxa not in the controls (e.g., approximating them by that of the closest relative or the average efficiency). For this and other limitations described in the Discussion of our manuscript, calibration as a practical method to obtain quantitatively accurate composition measurements is not currently feasible using this or any package. However, calibration using a hypothesized bias (perhaps partially informed by experimental measurement) can still be useful to analyze the sensitivity of downstream results to bias, a use case we will illustrate in a future vignette.
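As a rough illustration of calibrating with a hypothesized bias, the sketch below uses base R only and does not call metacal itself. It assumes the multiplicative model from the manuscript (observed proportions are proportional to actual proportions times taxon efficiencies). The matrix `observed`, the vector `bias`, and the helper `calibrate_props()` are all hypothetical names and values invented for this example.

```r
# Hypothetical observed read counts: rows are samples, columns are taxa.
observed <- matrix(
  c(100, 300, 600,
    500, 250, 250),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"), c("T1", "T2", "T3"))
)

# Hypothesized relative efficiencies (assumed values, not measurements).
bias <- c(T1 = 1, T2 = 3, T3 = 6)

# Hypothetical helper, not part of metacal: divide each taxon by its
# efficiency, then renormalize every sample to proportions.
calibrate_props <- function(observed, bias) {
  corrected <- sweep(observed, 2, bias[colnames(observed)], "/")
  corrected / rowSums(corrected)
}

calibrated <- calibrate_props(observed, bias)
```

Rerunning a downstream analysis on `calibrated` for several plausible `bias` vectors gives a sense of how sensitive the results are to bias.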

metacal's Issues

Add continuous integration with Travis-CI

Update package functions and tutorial to use `pivot_*()` from tidyr v1.0.0

Add support for regression of taxon efficiencies against sample covariates

for both defined and undefined control samples
using the least-squares approach currently used for bias estimation with center()
also demonstrate how to do this with nnet::multinom()

Fix R CMD check warnings

Can I use this package for correcting batch effect using technical replicates?

Hi,

I came across your package through an issue (batch effect) that I was trying to solve with Benjamin Callahan (dada2) - benjjneb/dada2#876

Benjamin suggested that I check metacal. I don't have mock communities but I did add technical replicates in the different sequencing runs. Based on the description of the package, I was left with the impression that I could use this package with the technical replicates, but from the tutorial, it is not clear.

Can you please clarify if the technical replicates are useful for the usage of this package?

Thanks

Add `estimate_bias()` and `calibrate()` functions that work with phyloseq objects

Idea is to create a higher-level interface for estimating bias using data already stored in a phyloseq object.

Bias estimation. User has a phyloseq object or otu_table object that contains the observed abundances for target + control samples, or just the control samples, and an otu_table object that contains the actual abundances for the control samples. This function returns an estimate of bias, perhaps w/ bootstrap replicates and standard errors.

Calibration. The user supplies a phyloseq object or otu_table of observed abundances, and an estimated bias vector. Returns a modified phyloseq object with calibrated abundances.

Eventually, when the new function for doing differential bias estimation via compositional regression is implemented, should also support that
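The bias-estimation step described above can be sketched in base R for the simplest case, where every taxon is present with a positive abundance in every control sample. This is only an illustration under that assumption, not the package's implementation; `estimate_bias_sketch()` and the example matrices are hypothetical.

```r
# Hypothetical control samples: rows are samples, columns are taxa.
taxa <- c("T1", "T2", "T3")
actual <- matrix(
  c(1 / 3, 1 / 3, 1 / 3,
    0.50,  0.25,  0.25),
  nrow = 2, byrow = TRUE, dimnames = list(c("M1", "M2"), taxa)
)
observed <- matrix(
  c(100, 200, 400,
    250, 250, 500),
  nrow = 2, byrow = TRUE, dimnames = list(c("M1", "M2"), taxa)
)

# Hypothetical helper: the compositional error in each sample is
# observed / actual. Center each sample's log error, average across
# samples, and exponentiate. The result has geometric mean 1.
estimate_bias_sketch <- function(observed, actual) {
  log_ratio <- log(observed) - log(actual)
  centered <- log_ratio - rowMeans(log_ratio)
  exp(colMeans(centered))
}

bias_hat <- estimate_bias_sketch(observed, actual)
```

With missing taxa or zero counts the log ratios are undefined, and this simple average no longer applies.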

consider soft-deprecating `mutate_by()`

An experimental with_groups() function was added to dplyr that should be able to do the mutate_by() functionality and more.

Add failsafe default to `center()` when the estimate is not fully determined

If there is insufficient taxonomic overlap to determine a unique best estimate (up to compositional equivalence), then center() should halt with an informative error message, unless an argument is passed to indicate that an estimate should be returned anyways

allow calibrate() to work with mc_bias_fit objects

How it should work: if the 'bias' argument of calibrate() is an 'mc_bias_fit' object, then the estimated bias should be used for the calibration. An additional feature could be an option to use the bootreps, returning an array calibrated by the bootreps.

Extend `center()` to find and return connected groups of taxa

When a user wishes to compute the center when the taxa co-occurrence graph is not fully connected, then we can return the least-squares vector along with assignments of taxa to subgroups within which the center is fully estimated.
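One possible way to find such subgroups is to treat taxa as nodes, connect two taxa whenever they occur together in at least one sample, and label the connected components. The base R sketch below does this with a simple breadth-first search; `taxa_groups()` and the `present` matrix are hypothetical and are not taken from the package.

```r
# Hypothetical presence/absence data: rows are samples, columns are taxa.
present <- matrix(
  c(TRUE,  TRUE,  FALSE, FALSE, FALSE,
    FALSE, TRUE,  TRUE,  FALSE, FALSE,
    FALSE, FALSE, FALSE, TRUE,  TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("S", 1:3), paste0("T", 1:5))
)

# Hypothetical helper: assign each taxon to a connected group of the
# taxon co-occurrence graph.
taxa_groups <- function(present) {
  # Taxa-by-taxa matrix: TRUE where two taxa share at least one sample.
  adj <- crossprod(present * 1) > 0
  group <- rep(NA_integer_, ncol(present))
  g <- 0L
  for (start in seq_len(ncol(present))) {
    if (!is.na(group[start])) next
    g <- g + 1L
    frontier <- start
    while (length(frontier) > 0) {
      group[frontier] <- g
      neighbors <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
      frontier <- neighbors[is.na(group[neighbors])]
    }
  }
  setNames(group, colnames(present))
}

groups <- taxa_groups(present)
```

Here T1, T2, and T3 fall in one group and T4 and T5 in another, so relative efficiencies could only be compared within each group.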

Should we force setting the 'type' argument in `mean_efficiency()`?

As I've been working with the new mean_efficiency() function with phyloseq objects of observed read counts, I have several times forgotten to set type = 'observed', creating some confusing results that took some time to debug. It might be best to remove the default value of type = 'actual' so that the user is always forced to specify (except when calling on an 'mc_bias_fit' object).
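A minimal sketch of what forcing the argument could look like is given below. It is not the package's `mean_efficiency()`; the function name `mean_efficiency_sketch()` and its arguments are hypothetical, and the two formulas assume the multiplicative bias model (observed proportions proportional to actual proportions times efficiencies).

```r
# Hypothetical function: `type` has no default, so the caller must choose.
mean_efficiency_sketch <- function(x, bias, type) {
  if (missing(type)) {
    stop("`type` must be given explicitly: 'actual' or 'observed'.")
  }
  type <- match.arg(type, c("actual", "observed"))
  props <- x / rowSums(x)      # samples as rows, normalized to proportions
  bias <- bias[colnames(x)]    # align efficiencies with taxa columns
  switch(type,
    # Actual proportions A: mean efficiency is sum(A * B).
    actual = as.vector(props %*% bias),
    # Observed proportions O: mean efficiency is 1 / sum(O / B).
    observed = 1 / as.vector(props %*% (1 / bias))
  )
}

bias <- c(T1 = 1, T2 = 2, T3 = 4)
actual <- matrix(c(1, 1, 2), nrow = 1, dimnames = list("S1", names(bias)))
observed <- actual * bias[col(actual)]   # what the protocol would report

me_actual <- mean_efficiency_sketch(actual, bias, type = "actual")
me_observed <- mean_efficiency_sketch(observed, bias, type = "observed")
```

Both calls return the same value for matching inputs, while passing observed counts with `type = "actual"` would silently give a different number, which is the confusion described above.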

`center()` fails when `.data` doesn't have rownames

I think center() should be able to handle this case; rows correspond to samples and there is no need for the samples to have names for computing the center to make sense. Note, the function works fine w/o column names if enframe is not used.

Z <- matrix(c(
        NaN, 1, 3.5, 
        -1, NaN, 4,
        -2, 3, NaN,
        -1, 2, 3), ncol = 3, byrow = TRUE)
# colnames(Z) <- paste0("T", 1:3)
# rownames(Z) <- paste0("S", 1:4)
metacal::center(Z, in_scale = "log")
#> Object passed to `as_tibble()` must have row names if the `rownames`
#> argument is set.

Created on 2019-08-06 by the reprex package (v0.3.0)

Create function to facilitate performing calibration from a reference species

It might be convenient to have a function that performs a simple plug-in version of the 'reference-species' approach to calibration described in https://github.com/mikemc/differential-abundance-theory. In its simplest form, the function needs only an 'observed' matrix and a set of reference measurements for one or more species. It can then calibrate all species in observed by multiplying each sample by the geometric mean, over the reference species, of the reference measurements divided by the observed measurements.

However, since I don't necessarily recommend such a non-statistical approach except for exploration and demonstration, it might be better to instead make a function that facilitates applying sample-specific normalizations: essentially, an easier-to-use version of 'sweep()'. This function would allow any type of normalization, including to the total abundance (as in so-called 'quantitative microbiome profiling').
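The plug-in calculation can be sketched with base R and `sweep()`. The matrices below are hypothetical, with samples as rows, and `reference` stands in for independent measurements (for example, from a targeted assay) of one or more reference species.

```r
# Hypothetical observed abundances: rows are samples, columns are species.
observed <- matrix(
  c(10, 20, 70,
    40, 40, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"), c("T1", "T2", "T3"))
)

# Hypothetical reference measurements for a single reference species, T1.
reference <- matrix(
  c(100, 80),
  ncol = 1, dimnames = list(c("S1", "S2"), "T1")
)

ref_taxa <- colnames(reference)
ref <- reference[rownames(observed), ref_taxa, drop = FALSE]
obs <- observed[, ref_taxa, drop = FALSE]

# Per-sample factor: geometric mean over reference species of
# reference / observed, computed on the log scale.
sample_factor <- exp(rowMeans(log(ref) - log(obs)))

# Sample-specific normalization: multiply every row by its factor.
calibrated <- sweep(observed, 1, sample_factor, "*")
```

With a single reference species, the calibrated values for that species reproduce the reference measurements exactly; with several, they agree only in geometric mean.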

Add documentation to math and utilities functions

Installation from github using devtools, not working

Hi Mike,

I tried installing the package with devtools, as described in the README.md, and got the following error.

> devtools::install_github("mikemc/metacal")
Error: Failed to install 'unknown package' from GitHub:
  HTTP error 404.
  No commit found for the ref master

  Did you spell the repo owner (`mikemc`) and repo name (`metacal`) correctly?
  - If spelling is correct, check that you have the required permissions to access the repo.

I installed it using the source code and installation was successful.

Cheers !!!
Anubhav

`build_matrix()` behaves incorrectly on grouped tibbles

E.g.

tb %>%
  group_by(var1, var2) %>%
  summarize_at("var3", sum) %>%
  build_matrix(var1, var2, var3)

First time noticing this bug, I got a message about the grouped row var1 being added, and the elements (var3) being coerced from numeric to characters.

Allow `estimate_bias()` to take "observed" objects with extra samples

Use case: You have a full OTU table with natural and mock samples; you want to estimate bias from just the mock samples. As long as the sample names match with actual, we can just subset the samples rather than making the user do this first. As long as all the samples in actual are found in observed then we can be pretty confident this is good input and can proceed. I think we should still throw an error if there are samples in actual that are not in observed.
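The proposed check could look roughly like the base R sketch below, which assumes plain matrices with samples as rows rather than phyloseq objects. The helper name `subset_to_controls()` and the example data are hypothetical.

```r
# Hypothetical helper: keep only the samples of `observed` that appear in
# `actual`, and fail if any sample in `actual` is missing from `observed`.
subset_to_controls <- function(observed, actual) {
  missing_samples <- setdiff(rownames(actual), rownames(observed))
  if (length(missing_samples) > 0) {
    stop(
      "Samples in `actual` not found in `observed`: ",
      paste(missing_samples, collapse = ", ")
    )
  }
  observed[rownames(actual), , drop = FALSE]
}

taxa <- c("T1", "T2", "T3")
observed <- matrix(
  1:12, nrow = 4,
  dimnames = list(c("mock1", "gut1", "mock2", "gut2"), taxa)
)
actual <- matrix(
  1, nrow = 2, ncol = 3,
  dimnames = list(c("mock1", "mock2"), taxa)
)

controls <- subset_to_controls(observed, actual)
```

The returned rows follow the order of `actual`, so the two matrices line up sample by sample.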

Comments on the package

I came here because of this post.

I don't know whether (or where) you plan to submit this: CRAN or Bioconductor. Given the topic of the package, I would recommend Bioconductor. In either case, you'll get a more thorough review if you submit to one of these repositories.

In any case, it seems that the package doesn't work well with other packages like phyloseq, or metagenomeSeq, or with other useful classes like SummarizedExperiment (used in Bioconductor to store data about a sequencing experiment). Doing so would help to use the package in existing pipelines/scripts.

Some functions would need more documentation of the parameters that they need and have some examples (at least that is a requirement for Bioconductor packages).

To get the error matrix, it would be perfect if we could distinguish what type of NA is a 0/0 (which imho for the purpose of the error matrix it should be then 0) or a 500/0.
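For plain numeric division in R, the two cases are already distinguishable: 0/0 gives `NaN`, while a positive number divided by 0 gives `Inf`. The sketch below shows this with hypothetical vectors; whether the package's error matrix preserves the distinction depends on how it is computed, which is not assumed here.

```r
# Hypothetical observed and actual abundances for three taxa in one sample.
observed <- c(0, 500, 20)
actual   <- c(0,   0, 10)

ratio <- observed / actual

both_zero   <- is.nan(ratio)       # the 0/0 case
only_actual <- is.infinite(ratio)  # the 500/0 case
```

Note that `is.na()` is `TRUE` for `NaN` but `FALSE` for `Inf`, so a test based only on `is.na()` would flag just the 0/0 entries.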

The vignette clearly explains how the package works. It would be interesting to know how to use this information in downstream analyses. It also focuses a lot on tidy data frames, which might reduce the memory footprint of the data if it is very sparse, but there are other solutions like data.table or Matrix, so I'm not sure so much space should be given to it in the tutorial.
The vignette focuses on the error matrix and estimating bias, but I couldn't find any function to do it.

I've looked at the tests, and they should be more minimal and include just the data and the tests (you can create data just for tests). At the same time, they should test more than just the center function.

Many thanks for taking the effort to create this nice package. I'm sure it will be very well received by the community.

Create a function to facilitate mean efficiency computations

Idea: Have a mean efficiency function that acts on mc_bias_fit objects or bias vectors + actual and/or observed matrices and returns the mean efficiency in each sample.

The mean efficiency for a set of samples in a matrix with samples as rows can be calculated by first normalizing to proportions and then doing perturb(mat, bias, margin = 1, norm = 'none') %>% rowSums; or we could compute weighted means with apply() and weighted.mean()
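Both routes can be sketched in base R without calling metacal's `perturb()`. The example assumes `actual` holds actual abundances with samples as rows and that `bias` is a vector of efficiencies named by taxon; all values are hypothetical.

```r
# Hypothetical actual abundances: rows are samples, columns are taxa.
actual <- matrix(
  c(1, 1, 2,
    3, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"), c("T1", "T2", "T3"))
)
bias <- c(T1 = 1, T2 = 2, T3 = 4)

# Route 1: normalize to proportions, multiply each taxon by its
# efficiency, and sum within each sample.
props <- actual / rowSums(actual)
mean_eff <- rowSums(sweep(props, 2, bias[colnames(props)], "*"))

# Route 2: a weighted mean of the efficiencies, weighted by abundance.
mean_eff_wm <- apply(actual, 1, function(w) weighted.mean(bias, w))
```

The two routes agree because `weighted.mean()` normalizes the weights itself, so the abundances do not need to be converted to proportions first.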

Taxa names sometimes lost when using `pairwise_ratios()`

The problem seems to arise specifically when there are only two taxa in the original phyloseq object (so just one taxon in the result) and ratios are computed on taxa prior to being computed on samples.
This suggests the problem arises when ratios are computed on samples and there is just one taxon.

library(phyloseq)
library(magrittr)
data(enterotype)
p <- enterotype %>%
  prune_taxa(taxa_names(.)[2:3], .)
p %>%
  metacal::pairwise_ratios("taxa") %>%
  metacal::pairwise_ratios("samples", filter = FALSE) %>%
  taxa_names
#> [1] "sp1"
p %>%
  metacal::pairwise_ratios("samples", filter = FALSE) %>%
  metacal::pairwise_ratios("taxa") %>%
  taxa_names
#> [1] "Bacteria:Prosthecochloris"

Created on 2021-06-15 by the reprex package (v2.0.0)