
MGnifyR

An R package for searching and retrieving data from the EBI Metagenomics (MGnify) resource. In most cases, MGnifyR interacts directly with the JSON:API rather than relying on downloading analysis outputs as TSV files. This makes it more general, allowing, for example, multiple studies and analyses to be combined intuitively into a single workflow, although it is in some cases slower than the aforementioned direct TSV access. Local caching of results on disk helps offset some of the overhead, but data downloads can still be slow, particularly for functional annotation retrieval.

The MGnifyR package is part of the miaverse microbiome analysis ecosystem, enabling use of mia and other miaverse packages.

This research has received funding from the Horizon 2020 Programme of the European Union within the FindingPheno project under grant agreement No 952914. FindingPheno, an EU-funded project, is dedicated to developing computational tools and methodologies for the integration and analysis of multi-omics data. Its primary objective is to deepen our understanding of the interactions between hosts and their microbiomes. More information is available on the FindingPheno website.

Installation

Bioc-release

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MGnifyR")

Bioc-devel

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("MGnifyR")

GitHub

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")

remotes::install_github("EBI-Metagenomics/MGnifyR")

Basic usage

For more detailed instructions, read the associated function help pages and the package vignette (vignette("MGnifyR")).

library(MGnifyR)

# Set up the MGnify client instance
mgclnt <- MgnifyClient(usecache = TRUE, cache_dir = '/tmp/MGnify_cache')

# Retrieve the list of analyses associated with a study
accession_list <- searchAnalysis(mgclnt, "studies", "MGYS00005058", usecache = TRUE)

# Download all associated study, sample, and analysis metadata
meta_dataframe <- getMetadata(mgclnt, accession_list, usecache = TRUE)

# Convert analysis outputs to a single `MultiAssayExperiment` object
mae <- getResult(mgclnt, meta_dataframe$analysis_accession, usecache = TRUE)
mae

MGnifyR's People

Contributors

tuomasborman, beadyallen, antagomir, artur-sannikov, daenarys8


MGnifyR's Issues

Error during installation

Hi,
I am trying to install MGnifyR on my R desktop (version 4.3.3; Bioconductor is up to date). I tried using BiocManager but failed to install the package. While installing from GitHub using remotes, I stumbled upon the following error. My understanding is that 4.3.3 is the latest R release and there isn't yet a version 4.4.0?
Could you please help me out with this?

Thanks so much in advance
Gopikaa

(Screenshot of the error attached: Screenshot 2024-03-07 093844)

Diversity indices

Hi,

For further analyses, such as calculating diversity indices, detailed analysis of the taxonomy, and visualisation of multiple samples, which analysis tools do you recommend? Are there tutorials that work on the results of MGnifyR?

Best,
Michael
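As one illustration of what such an index involves, here is a minimal Shannon diversity calculation in base R on a made-up abundance vector. The counts below are toy values, not real MGnify data; in practice the miaverse packages mentioned in the README compute these and other indices directly on MGnifyR output.

```r
# Minimal sketch: Shannon diversity for one sample, base R only.
# The counts vector is a made-up toy example, not real MGnify data.
counts <- c(10, 5, 85)          # toy taxon counts for a single sample
p <- counts / sum(counts)       # relative abundances
shannon <- -sum(p * log(p))     # Shannon index, natural log
round(shannon, 3)               # about 0.518
```

The same formula generalises row-wise across an abundance table, which is essentially what dedicated estimators do under the hood.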

mgnify_get_analyses_results stops with the error: Error in t.default(unlist(c(json$attributes[baseattrlist], metlist))) : argument is not a matrix

Hello @beadyallen and @bsattelb, I have been using this package successfully but recently I encountered an issue while using mgnify_get_analyses_results

> tax_results <- mgnify_get_analyses_results(mg, accession_list, retrievelist = c("taxonomy"), usecache = T)

 |======================================                                |  55%
Error in t.default(unlist(c(json$attributes[baseattrlist], metlist))) :
      argument is not a matrix
Calls: mgnify_get_analyses_results ... mgnify_attr_list_to_df_row -> as.data.frame -> t -> t.default

Previous attempts using a smaller list of analysis accessions (accession_list) worked well, but my latest tries have resulted in this error at different points of completion, sometimes at 22%, sometimes at 55%. I am not sure what the cause of this error is.

I have attached accession_list in case it helps reproduce the error.

Thank you for developing this package and for your availability.

accession_list.csv

Locate fastq data associated with MGnify analyses

Adding this as an issue in case anyone has a similar problem.

MGnifyR::get_download_urls() doesn't return links for the read data (fastq) used to generate assemblies/analyses, but this may be useful for some scenarios.

A workaround is to run MGnifyR::mgnify_get_analyses_metadata() then use the ENA accessions returned under the sample_biosample column.

These accessions can be used with the enaGroupGet command from the enaBrowserTools scripts, e.g.:
enaGroupGet -f fastq SAMN04360062

Note that enaDataGet throws an error and doesn't find the accession (for me, at least).
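The workaround above can be sketched in R. Here `meta` mocks the data frame returned by mgnify_get_analyses_metadata() rather than fetching it live; the sample_biosample column name follows the description in this issue.

```r
# Sketch of the workaround described above. `meta` mocks the metadata
# data frame; only the sample_biosample column (ENA sample accessions)
# is needed here.
meta <- data.frame(
    sample_biosample = c("SAMN04360062", "SAMN04360062", "SAMN04360063"),
    stringsAsFactors = FALSE
)
# Build one enaGroupGet command per unique ENA accession
cmds <- paste("enaGroupGet -f fastq", unique(meta$sample_biosample))
cmds
```

Each element of cmds could then be run with system(), assuming the enaBrowserTools scripts are on the PATH.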

Thanks @beadyallen for the package!

MGnifyR - query samples by depth

Dear Ben,
As part of the BlueCloud2026 project, I am trying to build a data access function to retrieve species and (probably) KEGG annotations, and the corresponding reads, from the MGnify database.
The first step would be to get the list of marine samples within a certain depth and time range.
If I understood correctly, the MGnifyR package does not allow multiple filters in one query. That is why I first tried to query by depth level, which should be the most discriminant factor for filtering samples.

However, I encountered some issues with the mgnify_query() function. Here are some examples:

library(MGnifyR)
library(tidyverse)
mg <- mgnify_client(usecache = TRUE)

Trying with the metadata_value_gte argument. I guess it refers to "greater than or equal to".

foo <- mgnify_query(mg, "samples", biome_name = "Marine", 
                     metadata_key = "depth", 
                     metadata_value_gte = 100,
                     maxhits = 5,
                     usecache = TRUE)
foo$depth %>% unique()
 [1] "1988.0" "75.0"   "102.0"  "1008.0" "119.0"  "182.0"  "101.0"  "30.0"   "111.0"  "5601.0" "202.0"  "143.0"  "150.0"  "151.0"  "100.0"  "135.0"  "149.0" 
[18] "200.0"  "233.0"  "201.0"  "380.0"  "175.0" 

Trying with the metadata_value_lte argument. I guess it refers to "lower than or equal to".

foo <- mgnify_query(mg, "samples", biome_name = "Marine", 
                    metadata_key = "depth", 
                    metadata_value_lte = 100,
                    maxhits = 5,
                    usecache = TRUE)
foo$depth %>% unique()
 [1] "1988.0" "75.0"   "15.0"   "76.0"   "52.0"   "30.0"   "16.0"   "10.0"   "91.0"   "51.0"   "2.0"    "119.0"  "21.0"   "182.0"  "50.0"   "111.0"  "202.0" 
[18] "33.0"   "49.0"   "74.0"   "14.0"   "68.0"   "40.0"   "151.0"  "69.0" 

Neither _gte nor _lte seems to work properly. Therefore, I tried to query only a single depth layer, which would then be parallelized to query all depth levels within our range.

foo <- mgnify_query(mg, "samples", biome_name = "Marine", 
                    metadata_key = "depth", 
                    metadata_value = 100,
                    maxhits = -1,
                    usecache = TRUE)
foo$depth %>% unique()
[1] "100.0"

Querying samples at exactly 100 m depth seems to work. However, if we try another depth level, it no longer does...

foo <- mgnify_query(mg, "samples", biome_name = "Marine", 
                    metadata_key = "depth", 
                    metadata_value = 5,
                    maxhits = -1,
                    usecache = TRUE)
foo$depth %>% unique()
[1] "5.0"  "2.0"  "0.3"  "0.29" "0.33" "0.28"
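Until the server-side gte/lte filters behave as expected, one possible workaround is to query broadly and filter depths client-side. This is a sketch under the assumption (consistent with the output above) that mgnify_query() returns depth as character strings; `foo` is mocked here rather than fetched.

```r
# Client-side filtering sketch. `foo` mocks the data frame returned by
# mgnify_query(); depth values arrive as strings, as in the output above.
foo <- data.frame(depth = c("1988.0", "75.0", "102.0", "30.0", "150.0"),
                  stringsAsFactors = FALSE)
foo$depth_m <- as.numeric(foo$depth)            # convert before comparing
in_range <- foo[foo$depth_m >= 50 & foo$depth_m <= 150, ]
in_range$depth
```

Numeric conversion matters: comparing the raw strings would sort lexicographically ("1988.0" < "75.0") and silently give wrong subsets.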

Therefore, I am wondering whether I am missing something in these functions or misusing them?
Thank you in advance for your feedback,
Best,

Alexandre

Improve documentation

The documentation of getResult() is not as clear as it could be. For example, taxa.su and the taxonomy annotation fetched with get.func=TRUE are not well documented. The documentation should clearly state what kinds of data can be fetched and with what options.

Support for the (Tree)SummarizedExperiment data container

I propose adding support for the (Tree)SummarizedExperiment data container, which is an alternative to phyloseq, in the MGnifyR R package.

Can we open a Pull Request?

An increasing number of microbiome data analysis tools support the TreeSummarizedExperiment (Huang et al. 2021) and MultiAssayExperiment (Ramos et al. 2017) data containers, and the curatedMetagenomicData project has started to distribute data in this format. Converters between TreeSummarizedExperiment and phyloseq, the two alternative formats, are available via the mia R/Bioc package. We have created an online tutorial, Orchestrating Microbiome Analysis with R/Bioconductor, around the TreeSummarizedExperiment framework.

doQuery() error with assemblies

Hello,
With the new version of this package I get an error when running "doQuery" when my qtype is assemblies.

results <- doQuery(client, "assemblies", max.hits = 500, as.df = FALSE,
                   accession = list(accession = accessions))

Error in match.arg(type, several.ok = FALSE) :
'arg' should be one of "studies", "samples", "runs", "analyses", "biomes"
Calls: get_sample_ids -> doQuery -> doQuery -> .local -> match.arg
Execution halted

Any help would be greatly appreciated.

Thank you,
Cal Thoma UMN

mgnify_get_download_urls stops with the following error: Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) : attempt to set 'colnames' on an object with less than two dimensions

As the title says, the mgnify_get_download_urls function stops with the following error:

dl_urls <- mgnify_get_download_urls(mg, Marine_samples_metagenome$analysis_accession, accession_type = "analyses")
  |=================================================                                                               |  44%
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) : 
  attempt to set 'colnames' on an object with less than two dimensions

This error is triggered by some of the accessions I have in Marine_samples_metagenome, e.g. MGYA00278521 or MGYA00278684, which have only two URLs. However, the error is not triggered if I run mgnify_get_download_urls on a single accession:

dl_urls_MGYA00278684<-mgnify_get_download_urls(mg, "MGYA00278684", accession_type = "analyses")

I have attached Marine_samples_metagenome to reproduce the error
Marine_samples_metagenome.txt

Could not find function "mgnify_get_downloads"

Hi, thanks @beadyallen for developing this package! I've recently started using it and found a small issue. The man file for mgnify_download() contains the following line in the examples: downloads <- mgnify_get_downloads(mg, accession_vect, "analyses"). However, when I try to run it, it gives the following error:

Error in mgnify_get_downloads(mg, accession_vect, "analyses") : 
  could not find function "mgnify_get_downloads"

I suspect that the function has been renamed to mgnify_get_download_urls(), which works flawlessly.

MGnifyR filter out samples during the phyloseq object building step

Hello there!
I'm creating some Jupyter Notebooks for the MGnify team to show examples of how to use the tool, and I noticed that it filters out samples during the phyloseq object building step. I tested different studies and the same thing happens. There's no obvious reason to filter them out by QC. This is what I did:

tara_all <- mgnify_analyses_from_studies(mg, 'MGYS00000410')
metadata <- mgnify_get_analyses_metadata(mg, tara_all)

To keep only surface and mesopelagic samples:

sub1=metadata[str_detect(metadata$'sample_environment-feature', "ENVO:00002042"), ]
sub2=metadata[str_detect(metadata$'sample_environment-feature', "ENVO:00000213"), ]
filtered_samples=rbind(sub1,sub2)
ps <- mgnify_get_analyses_phyloseq(mg, filtered_samples$analysis_accession)

And to check the number of samples before and after the phyloseq object building:

accessions_prePS=filtered_samples$analysis_accession
accessions_postPS=sample_names(ps)
missing_acc = setdiff(accessions_prePS, accessions_postPS)

To check quality filtering issues:

discarded=c()
for (accession in missing_acc) {
    current_match=metadata[which(metadata$'analysis_accession'==accession), ]
    discarded=rbind(discarded,current_match)
}
discarded[, c('sample_environment-feature', 'analysis_Submitted nucleotide sequences', 'analysis_Nucleotide sequences after length filtering','analysis_Nucleotide sequences after undetermined bases filtering')]

I also checked in the MGnify portal, and there is taxonomic annotation for all five of these samples. I don't know what I'm missing; I couldn't find an explanation in the documentation or examples.

Thanks in advance!
