aametwally / metalonda Goto Github PK

METAgenomic LONgitudinal Differential Abundance method

License: Other

R 100.00%

longitudinal-studies metagenomics microbiome smoothing-splines negative-binomial-distribution differential-abundance-methods

metalonda's Introduction

What is MetaLonDA?

MetaLonDA (METAgenomic LONgitudinal Differential Abundance method) is a method that identifies the significant time intervals of microbial features in longitudinal studies. MetaLonDA has the ability to handle the inconsistencies and common challenges associated with human studies, such as variable sample collection times and uneven number of time points along the subjects’ longitudinal study. The method employs a negative binomial distribution in conjunction with a semi-parametric SS-ANOVA to model the read counts. Then, it performs the significance testing based on unit time intervals using permutation testing procedure.

Publications:

Ahmed A. Metwally, Jie Yang, Christian Ascoli, Yang Dai, Patricia W. Finn, and David L. Perkins. "MetaLonDA: a flexible R package for identifying time intervals of differentially abundant features in metagenomic longitudinal studies", Microbiome, 2018. [paper]
Ahmed A. Metwally, Patricia W. Finn, Yang Dai, and David L. Perkins. "Detection of Differential Abundance Intervals in Longitudinal Metagenomic Data Using Negative Binomial Smoothing Spline ANOVA." ACM BCB, 2017. [paper]

Getting Started

This section details steps for installing and running MetaLonDA. If you experience difficulty installing or running the software, please contact (Ahmed Metwally: [email protected]).

Prerequisites

R(>= 3.2.0)

Installation

Install MetaLonDA from CRAN:

Install the latest released version from CRAN: https://cran.r-project.org/web/packages/MetaLonDA/index.html

install.packages("MetaLonDA")

Install MetaLonDA from GitHub:

Download the latest development code of MetaLonDA from GitHub using devtools

library(devtools)
install_github("aametwally/MetaLonDA", ref = "master")

Example:

library(MetaLonDA)

## Load read counts of 8 features from 100 samples. Samples are from 2 groups, 5 subjects per group, and 10 time points per subject.
data(metalonda_test_data)
View(metalonda_test_data[,1:20])

## Create Group, Time, and ID annotation vectors
n.group = 2
n.sample = 5 
n.timepoints = 10
Group = factor(c(rep("A", n.sample*n.timepoints), rep("B",n.sample*n.timepoints)))
Time = rep(rep(1:n.timepoints, times = n.sample), 2)
ID = factor(rep(1:(2*n.sample), each = n.timepoints))

## Define the prediction timeponits 
points = seq(1, 10, length.out = 100)

Test one feature

## Identify significant time intervals of the 5th feature: 
output.metalonda.f5 = metalonda(Count = metalonda_test_data[5,], Time = Time, Group = Group,
                                ID = ID, n.perm = 100, fit.method = "nbinomial", points = points,
                                text = rownames(metalonda_test_data)[5], parall = FALSE, pvalue.threshold = 0.05,
                                adjust.method = "BH", time.unit = "days", ylabel = "Normalized Count",
                                col = c("black", "green"), prefix = "Test_F5")

In our example, we used 20 permutations just to showcase how MetaLonDA works. In real analysis, this number should be much higher. Three figures are generated after running the above snippet:

Figure shows the trajectory of the feature's count across different time points:

2. Figure shows fitted spline for each group:

In case there is a significant time interval, the following figure is generated to highlight the significant intervals.

Test all features

## Identify significant time intervals for all features: 
output.metalonda.all = metalondaAll(Count = metalonda_test_data, Time = Time, Group = Group,
                                    ID = ID, n.perm = 100, fit.method = "nbinomial", num.intervals = 100, 
                                    parall = FALSE, pvalue.threshold = 0.05, adjust.method = "BH", time.unit = "hours", 
                                    norm.method = "none", prefix = "Test_metalondaALL", ylabel = "Read Counts",
                                    col = c("black", "green"))

After running the above snippet for testing all features in the count matrix, along with the 2-3 figures for each feature, MetaLonDA produces the following two summary files:

1. Figure illustrates time intervals of differentially abundant features between the two groups. Each line represents a range of the significant time interval of the corresponding feature.

2. Descriptive summary all signifcant time intervals of differentially abundant features.

Bugs and Suggestions

MetaLonDA is under active research development. Please report any bugs/suggestions to Ahmed Metwally ([email protected]).

metalonda's People

Contributors

Stargazers

Watchers

Forkers

askerdb psilopsych sjanssen2 anigwe gadams1959 jialonghao1997

metalonda's Issues

image output format

Hello Ahmed A. Metwally,
I am trying to use MetaLonda remotely on university server. but while running I am not getting any image (empty image, 0 bite in size) as output with the following warning message:

Warning messages:
1: In grDevices::dev.off() : No jpeg support in this version of R
2: guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.
3: In grDevices::dev.off() : No jpeg support in this version of R

then I tried to search the r version ( supported graphic formats using capabilities() it resulted in jpeg FALSE. So I wanted to know if its possible to get graphic output in different (png) format?

uneven group numbers

Hi I am trying to compare the longitudinal data from 6 time points from 4 groups of samples. But each group has different number of samples like 6, 6, 7 and 6. I am not sure how I can use metalonDA for this. Any help with code will be greatly appreciated. Thanks

Issue with metalonda() command: Error: $ operator is invalid for atomic vectors

Hello, I am having issues running the metalonda() command because I keep producing the error: $ operator is invalid for atomic vectors. This error seems strange because I have ensured all of my vectors are numeric. I am also confused as to what the "points" vector is and how to create it appropriately. For now, my points vector is a list 1:49 since my samples are from days 1:49. Thank you for your assistance. A reproducible answer is below:

otu <- structure(c(2547, 0, 600, 3, 17, 0, 372, 202, 0, 339, 10491,
0, 1417, 2083, 0, 325, 4521, 0, 12366, 6, 5891, 0, 0, 0, 643,
399, 0, 301, 343, 0, 1194, 9, 15835, 0, 56, 0, 20554, 0, 11336,
2, 0, 0, 538, 1104, 0, 2736, 3430, 0, 1347, 1249, 0, 977, 6455,
0, 80, 39, 0, 429, 341, 0, 156, 70, 0, 382, 186, 0, 1751, 925,
0, 0, 7634, 0, 561, 99, 0, 1067, 655, 0, 160, 654, 0, 0, 2009,
0, 139, 0, 5433, 0, 0, 0, 169, 85, 0, 406, 206, 0, 374, 236,
4, 459, 718, 51, 123, 81, 0, 623, 1992, 0, 0, 0, 344, 0, 40,
0, 304, 88, 0, 284, 1372, 0, 79, 96, 0, 130, 403, 0, 144, 61,
0, 357, 731, 40, 95, 504, 0, 2912, 9435, 0, 323, 15, 0, 485,
340, 0, 254, 100, 4, 863, 2210, 0, 32, 21, 0, 219, 443, 0, 1925,
103, 0, 327, 1420, 0, 457, 23, 0, 95, 370, 0, 68, 399, 0, 159,
932, 0, 82, 13, 0, 105, 225, 0, 244, 46, 0, 153, 786, 0, 3105,
1114, 0, 0, 1011, 0, 276, 364, 0, 604, 2936, 0, 417, 154, 0,
224, 710, 0, 55, 26, 0, 198, 254, 0, 92, 8, 0, 236, 37, 0, 243,
123, 0, 195, 161, 0, 115, 123, 0, 869, 577, 0, 1272, 730, 0,
520, 844, 0, 50, 368, 8, 5, 902, 0, 602, 52, 0, 828, 2176, 0,
338, 143, 3, 416, 1341, 31, 1032, 12727, 25637, 0, 5566, 0, 41,
11, 0, 309, 205, 0, 70, 4, 0, 359, 198, 0, 154, 60, 0, 51, 1221,
0, 3, 4, 0, 77, 85, 0, 65, 152, 0, 257, 917, 0, 136, 60, 0, 78,
746, 0), .Dim = c(6L, 49L), .Dimnames = list(c("taxa 1", "taxa 2",
"taxa 3", "taxa 4", "taxa 5", "taxa 6"), c("LID_100039", "LID_100080",
"LID_100105", "LID_100125", "LID_100194", "LID_100225", "LID_100229",
"LID_100551", "LID_101556", "LID_101558", "LID_101560", "LID_101562",
"LID_101564", "LID_101566", "LID_101569", "LID_101570", "LID_101571",
"LID_101573", "LID_101575", "LID_101577", "LID_101581", "LID_101584",
"LID_101587", "LID_101590", "LID_101591", "LID_101592", "LID_101593",
"LID_101594", "LID_101596", "LID_101597", "LID_101601", "LID_101602",
"LID_101604", "LID_101606", "LID_101608", "LID_101610", "LID_101612",
"LID_101613", "LID_101616", "LID_101617", "LID_101618", "LID_101619",
"LID_101620", "LID_101621", "LID_101622", "LID_101623", "LID_101624",
"LID_101630", "LID_101632")))

Group <- structure(c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), .Label = c("Female", "Male"))

Individual <- structure(c(64L, 34L, 5L, 23L, 53L, 4L, 47L, 80L, 71L, 43L, 1L,
87L, 48L, 45L, 55L, 18L, 11L, 47L, 75L, 44L, 47L, 10L, 86L, 44L,
28L, 61L, 21L, 73L, 70L, 86L, 45L, 1L, 43L, 47L, 48L, 11L, 16L,
60L, 58L, 40L, 30L, 31L, 50L, 17L, 5L, 86L, 76L, 60L, 69L), .Label = c("F100",
"F101", "F102", "F103", "F104", "F105", "F106", "F107", "F108",
"F109", "F110", "F111", "F112", "F113", "F114", "F115", "F116",
"F117", "F118", "F119", "F120", "F121", "F122", "F123", "F124",
"F125", "F126", "F127", "F128", "F129", "F130", "F131", "F132",
"F32", "F74", "F84", "F85", "F86", "F87", "F88", "F89", "F90",
"F91", "F92", "F93", "F94", "F95", "F96", "F97", "F98", "F99",
"M49", "M50", "M51", "M52", "M53", "M54", "M55", "M56", "M57",
"M58", "M59", "M60", "M61", "M62", "M63", "M64", "M65", "M66",
"M67", "M68", "M69", "M70", "M71", "M72", "M73", "M74", "M75",
"M76", "M77", "M78", "M79", "M80", "M81", "M82", "M83", "M84",
"M85", "M86"))

Time <- structure(c(13L, 5L, 2L, 6L, 4L, 1L, 12L, 3L, 44L, 45L, 44L,
43L, 42L, 42L, 40L, 41L, 39L, 38L, 36L, 37L, 34L, 35L, 33L, 32L,
8L, 9L, 7L, 7L, 26L, 27L, 23L, 22L, 24L, 21L, 28L, 29L, 25L,
19L, 20L, 14L, 15L, 16L, 17L, 18L, 18L, 11L, 10L, 30L, 31L), .Label = c("2015-02-20",
"2015-04-01", "2015-04-03", "2015-05-13", "2015-05-14", "2015-05-18",
"2015-11-25", "2016-01-09", "2016-01-12", "2016-02-02", "2016-02-13",
"2016-02-22", "2016-04-06", "2016-04-15", "2016-04-18", "2016-05-19",
"2016-05-23", "2016-07-02", "2017-06-02", "2017-06-06", "2017-06-16",
"2017-06-23", "2017-06-29", "2017-08-03", "2017-08-12", "2017-08-15",
"2017-08-21", "2017-10-03", "2017-10-06", "2017-12-02", "2017-12-21",
"2018-02-16", "2018-03-02", "2018-05-01", "2018-05-02", "2018-05-24",
"2018-06-01", "2018-06-02", "2018-06-05", "2018-06-11", "2018-06-16",
"2018-07-03", "2018-07-14", "2018-08-01", "2018-08-02"))

points = as.numeric(seq(1, 49, length.out = 49))

output2.metalonda.f5 = metalonda(Count = otu[5,], Time = Time, Group = Group,
ID = Individual, n.perm = 100, fit.method = "nbinomial", points = points,
text = rownames(subset_otu)[5], parall = FALSE, pvalue.threshold = 0.05,
adjust.method = "BH", time.unit = "days", ylabel = "Normalized Count",
col = c("black", "green"), prefix = "Test2_F5")

Defining prediction timepoints

Hello again,

I now have 2 different groups I would like to compare the microbial features for overtime. I have fecal bacteria for two groups of cats with 6 individuals in each group over 6 days (12 total individuals split between two groups (A and B)).

Group A - Day 0 ,1 ,2, 3, 4, 5 (for n = 6 cats)
Group B - Day 0, 1, 2, 3, 4, 5 (for n = 6 other cats)

n.group = 2
n.sample = 6
n.timepoints = 6

I am having difficulty understanding how to define the prediction timepoints. Could you please clarify this a bit for me?

Thank you for your time!

Sincerely,
Morgan

curveFitting

Hey, I wish to get the output of metalondaAll in some data-structure (especially the curve fits) so I can later plot it myself using other frameworks. I saw that metalondaAll only saves jpg files, so I tried to call curveFittig myself.
I got strange errors and when tried to follow the exact snippet as in the vignette I get the same:

Error in gssanova(Count ~ Time, data = group.0, family = "nbinomial",  : 
  gss error in gssanova: unpenalized terms are linearly dependent
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x, y) : no non-missing arguments to min; returning Inf
6: In max(x, y) : no non-missing arguments to max; returning -Inf

Obviously, the curveFitting function works when called from within metalondaAll, and again, my only wish is to have the data points for the fitted curve and the normalized data in a nice dataframe so I can plot these myself. Can I get some help with that?

My sessionInfo:

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /apps/RH7U2/gnu/R/3.4.2/lib64/R/lib/libRblas.so
LAPACK: /apps/RH7U2/gnu/R/3.4.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.7.8          readr_1.3.1          MetaLonDA_1.1.0      BiocInstaller_1.28.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0           lattice_0.20-38      prettyunits_1.0.2    gtools_3.8.1         utf8_1.1.4          
 [6] assertthat_0.2.0     glmnet_2.0-16        digest_0.6.18        foreach_1.4.4        R6_2.3.0            
[11] plyr_1.8.4           stats4_3.4.2         RSQLite_2.1.1        httr_1.4.0           ggplot2_3.1.0       
[16] pillar_1.3.1         gplots_3.0.1         rlang_0.3.1          progress_1.2.0       lazyeval_0.2.1      
[21] gdata_2.18.0         rstudioapi_0.9.0     blob_1.1.1           S4Vectors_0.16.0     Matrix_1.2-15       
[26] labeling_0.3         stringr_1.3.1        RCurl_1.95-4.11      bit_1.1-14           biomaRt_2.34.2      
[31] munsell_0.5.0        compiler_3.4.2       pkgconfig_2.0.2      BiocGenerics_0.24.0  tidyselect_0.2.5    
[36] tibble_2.0.1         gss_2.1-9            IRanges_2.12.0       codetools_0.2-16     matrixStats_0.54.0  
[41] XML_3.98-1.16        fansi_0.4.0          crayon_1.3.4         bitops_1.0-6         grid_3.4.2          
[46] gtable_0.2.0         DBI_1.0.0            magrittr_1.5         scales_1.0.0         KernSmooth_2.23-15  
[51] cli_1.0.1            stringi_1.2.4        doParallel_1.0.14    bindrcpp_0.2.2       metagenomeSeq_1.20.1
[56] limma_3.34.9         RColorBrewer_1.1-2   iterators_1.0.10     tools_3.4.2          bit64_0.9-7         
[61] Biobase_2.38.0       glue_1.3.0           purrr_0.2.5          hms_0.4.2            parallel_3.4.2      
[66] yaml_2.2.0           AnnotationDbi_1.40.0 colorspace_1.3-2     caTools_1.17.1.1     memoise_1.1.0       
[71] bindr_0.1.1

Thanks!

Creating last plot

Hi,

I did the tutorial and it worked like a charm, however, when I tried on my own dataset, all plots are created except the last one wich is a summary of all plots, the error is :

Error in file(file, ifelse(append, "a", "w")) :
impossible d'ouvrir la connexion

Thanks in advance !

Replace group names on time-interval figure to reflect the actual names.

NB & Poisson checker

Visualizing plots

Hello,

I am able to install MetaLonDA and follow the tutorial fine except when I get to making the plots. When I run output.metalonda.f5 or output.metalonda.all the plots do not show up. Am I missing something?

Thank you,
Mia

Microbiome over time - queries on input data

Hi @aametwally,

I'm currently working on a project investigating potential changes in gut microbiome over time (within a 24 hour period).

I feel this is likely more of an experimental design question but I just wanted to query whether you think your package may be appropriate to use on my data.

I sampled from two cohorts of fish, one in summer and one in winter.

I sampled 3 fish at 9 time points on each of these two occasions (calculated as hours post feed):

Summer - 1, 2, 3, 4, 5, 6, 8, 21 (hours)
Winter - 1, 2, 3, 4, 5, 6, 7, 23 (hours)

These are also not the same time points for each group (I was limited in my flexibility to access the animals for sampling at various times).

I don't have a treatment vs control study design but I am assuming I should be able to compare the two cohorts as "groups"?

I am using Rstudio for all of my analysis and have my data organised within a phyloseq object - which I can obviously extract the abundance table from easily enough for the count = input. However, generating the time, group and ID inputs is a little confusing to me.

I am unsure if you will be able to help me but any advice you might have on whether my data could be used with your package would be greatly appreciated.

a

cannot open the connection

Hi,
I am trying to run your wonderful tool on ~90 features, 1200 samples. After a while, I get the following error:

Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
Calls: metalondaAll ... write.csv -> eval.parent -> eval -> eval -> write.table -> file

Unfortunately, I don't know how to make R print line numbers or a more elaborate stack trace. I suspect the number of open file handles is exceeded. Do you know how to raise this number?

Appropriate number of timepoint samples?

Hi,

I am interested in using this package for a longitudinal study characterizing fecal microbiota over time as the feces decays. I subsampled fecal material once per day for 4-7 days (depending on the sample). I noticed in your package test data you have 10 time-points per sample. What do you think would be the minimum number of time points to use this package effectively? I'm worried I may not have enough time points for each of my samples (again, between 4 and 7).

Thank you!