metabarfactory / metabar

metabaR is an R package to curate and visualise DNA metabarcoding data after basic bioinformatics analyses.

Home Page: http://metabaRfactory.github.io/metabaR

R 5.49% TeX 0.43% HTML 94.08%

edna metabarcoding r-package data-cleaning bioinformatics-pipeline edna-pipeline

metabar's Introduction

metabaR is an R package enabling the import, handling and processing of DNA metabarcoding data that have already been processed through bioinformatic pipelines. It provides functions to reveal and filter common molecular artifacts produced during the experimental workflow.

This package can easily be used in combination with other R packages commonly used in ecology (vegan, ade4, ape, picante, etc.), and provides flexible graphic systems based on ggplot2 to visualise the data under both ecological and experimental perspectives.

More specifically, metabaR provides:

Functions to import DNA metabarcoding data from different bioinformatics pipelines.
Functions to manipulate the different types of tables one usually deals with when working with DNA metabarcoding.
Data curation functions that are absent from the above pipelines and that detect/flag potential molecular artifacts such as contaminants, dysfunctional PCRs, etc.
Functions to visualise the data under both ecological (e.g. type of samples, rarefaction curves) and experimental (e.g. type of controls, distribution across the PCR plate design) perspectives.

metabaR is developed on GitHub:

https://github.com/metabaRfactory/metabaR

Overall overview

Installation

metabaR can be installed from GitHub using:

# install bioconductor dependencies
install.packages("BiocManager")
BiocManager::install("biomformat")

# install metabaR package
install.packages("remotes")
remotes::install_github("metabaRfactory/metabaR")

Package dependencies:

for graphical purposes: igraph, ggplot2 and cowplot
for formatting purposes: reshape2, seqinr, biomformat
for analysis purposes: vegan, ade4

Example

This is a basic example of use:

library(metabaR)
data(soil_euk)
summary_metabarlist(soil_euk)
#> $dataset_dimension
#>         n_row n_col
#> reads     384 12647
#> motus   12647    15
#> pcrs      384    11
#> samples    64     8
#> 
#> $dataset_statistics
#>         nb_reads nb_motus avg_reads sd_reads avg_motus sd_motus
#> pcrs     3538913    12647  9215.919 10283.45  333.6849  295.440
#> samples  2797294    12382 10926.930 10346.66  489.5117  239.685

Citation

Zinger, L., Lionnet, C., Benoiston, A.‐S., Donald, J., Mercier, C. and Boyer, F. (2021), metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality. Methods Ecol Evol. https://doi.org/10.1111/2041-210X.13552

metabar's Issues

Correct silva_annotator for SILVA 138.1

There are 'false' paths in tax_slv_ssu_138.1.txt:
Eukaryota;**Amorphea;Amoebozoa;**SAR;Stramenopiles;
Eukaryota;**Amorphea;Amoebozoa;**SAR;Stramenopiles;Labyrinthulomycetes;
Eukaryota;**Amorphea;Amoebozoa;**SAR;Stramenopiles;Labyrinthulomycetes;Sorodiplophrys;

This should be corrected to get true paths and parse the paths correctly with silva_annotator.
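
As a rough illustration of the kind of correction meant here, the base R sketch below flags and repairs such paths. It rests on two assumptions that are not confirmed by the issue: that the segment marked with `**` is the spurious part, and that SAR should sit directly under Eukaryota. The third path is a hypothetical valid entry added only for contrast.

```r
# Example paths from the issue, with the bold markers removed.
paths <- c(
  "Eukaryota;Amorphea;Amoebozoa;SAR;Stramenopiles;",
  "Eukaryota;Amorphea;Amoebozoa;SAR;Stramenopiles;Labyrinthulomycetes;",
  "Eukaryota;Amorphea;Amoebozoa;Tubulinea;"  # hypothetical valid path, for contrast
)

# Assumption: a path is 'false' when SAR appears nested under Amoebozoa.
is_false <- grepl("Amoebozoa;SAR;", paths, fixed = TRUE)

# Assumption: the intended path places SAR directly under Eukaryota.
repaired <- sub("Amorphea;Amoebozoa;SAR;", "SAR;", paths, fixed = TRUE)

print(data.frame(original = paths, is_false = is_false, repaired = repaired))
```

Paths that do not contain the flagged pattern are left unchanged, so the repair can be applied to the whole file before parsing.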

licence

Add licence at the end of the README file to indicate the licence type.

data import takes a long time

Hello,

This package looks amazing, and I am really excited because it seems to help with the issue of tag-jumping. The problem is that dealing with a full sequencing run with this package is currently unmanageable. The output files for full runs are really big, and just importing the data with tabfiles_to_metabarlist takes forever. I cannot see the point of exploring this tool without having to deal with large files, so I was wondering if there is a workaround for this issue.

I am trying my best to deal with this, but I cannot be the only one that plans to load a full run. Was this not supposed to deal with full PCR plates?

Thank you in advance!

error message of taxoparser when using ggtaxplot

When the type of the column containing the taxonomic paths (specified with the argument taxo of ggtaxplot()) is not "character", the error message displays:
"`taxo` should be a character string or vector" (line 49 of taxoparser.R)
This is confusing because the user may think (I did) that it refers to the value of the argument taxo of ggtaxplot() while in taxoparser() the test displaying this error message refers to the type of the column containing the taxonomic paths.
I suggest changing the error message of taxoparser() as:
if (!is.character(taxopath)) {
  stop(paste("column", taxo, "should be a character vector"))
}
(or something similar)
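
A minimal sketch of how the reworded message would read to a user. `check_taxo_column()` is a hypothetical helper written for this illustration, not a metabaR function, and the taxonomic paths are made-up values.

```r
# Hypothetical helper mirroring the suggested check; not part of metabaR.
check_taxo_column <- function(motus, taxo) {
  taxopath <- motus[[taxo]]
  if (!is.character(taxopath)) {
    stop(paste("column", taxo, "should be a character vector"))
  }
  invisible(TRUE)
}

# A factor column triggers the error and the message names the column.
motus <- data.frame(path = factor(c("Eukaryota:Fungi", "Eukaryota:Metazoa")))
msg <- tryCatch(check_taxo_column(motus, "path"),
                error = function(e) conditionMessage(e))
print(msg)

# Converting the column to character makes the check pass.
motus$path <- as.character(motus$path)
ok <- check_taxo_column(motus, "path")
```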

EMBL import: zlib.error: Error -3 while decompressing data: invalid distance code

Hi Celine Mercier,
I recently downloaded STD_INV_13.dat.gz from the EMBL database. When I try to import it while following the "Build a reference database" step of the wolf tutorial, on either the Mac Terminal or a Linux system, I get the following error:

(obi3-env)@ [aspinus]$ obi import --embl EMBL coi_embl2023/embl_refs
2023-03-21 09:22:05,456 [import : INFO ] obi import: imports an object (file(s), obiview, taxonomy...) into a DMS
2023-03-21 09:22:05,471 [import : INFO ] Importing an unknown number of entries
Parsing file EMBL/STD_INV_11.dat.gz (1/13)
2023-03-21 09:22:07,994 [import : INFO ] Imported 0 entries
2023-03-21 09:22:16,768 [import : INFO ] Imported 50000 entries
…………
…………
…………

Parsing file EMBL/STD_INV_13.dat.gz (13/13)
2023-03-21 13:06:30,239 [import : INFO ] Imported 12250000 entries
2023-03-21 13:06:52,489 [import : INFO ] Imported 12300000 entries
Traceback (most recent call last):
  File "/data/home/obitools3/obi3-env/bin/obi", line 4, in <module>
    __import__('pkg_resources').run_script('OBITools3==3.0.1b21', 'obi')
  File "/data/home/obitools3/obi3-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/data/home/obitools3/obi3-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "/data/home/obitools3/obi3-env/lib/python3.8/site-packages/OBITools3-3.0.1b21-py3.8-linux-x86_64.egg/EGG-INFO/scripts/obi", line 62, in <module>
    config[root_config_name]['module'].run(config)
  File "python/obitools3/commands/import.pyx", line 333, in obitools3.commands.import.run
  File "python/obitools3/parsers/embl.pyx", line 186, in emblIterator_dir
  File "python/obitools3/parsers/embl.pyx", line 143, in emblIterator_file
  File "python/obitools3/files/linebuffer.pyx", line 22, in __iter__
  File "/usr/local/lib/python3.8/gzip.py", line 384, in readline
    return self._buffer.readline(size)
  File "/usr/local/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/local/lib/python3.8/gzip.py", line 481, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid distance code

Any advice given would be greatly appreciated and thanks for your work on OBITools! Aspinus

there is a ###TO FINISH in the doc of pcrslayer

Finish the documentation of pcrslayer.

Use of deprecated/undocumented igraph functions

The following igraph functions, imported in metabaR/NAMESPACE (lines 52 to 55 in 23f3c8e), are all deprecated or undocumented, and may go away in the future:

importFrom(igraph,get.data.frame)
importFrom(igraph,get.vertex.attribute)
importFrom(igraph,graph)
importFrom(igraph,layout.reingold.tilford)

They should be as_data_frame, vertex_attr, make_graph and layout_as_tree, respectively.

A typo in function definition `pcr_threshold_estimate`

Hello!
I had a problem while using the function pcr_within_between and related functions. Then I looked into the source code and found there might be a typo in the definition of function pcr_threshold_estimate. Within the function code there is this line:

p <- which(ddintra$y - ddinter$y > 0 & ddinter$y > dintra.mode)

but I think it should be:

p <- which(ddintra$y - ddinter$y > 0 & ddinter$x > dintra.mode)

as now it doesn't return the right threshold estimate even if the plot generated by check_pcr_thresh looks correct. Hope I understood it correctly. Thank you!
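
To make the x versus y distinction concrete, here is a small base R sketch on simulated distances. It is not the package's implementation: the data are invented, and taking the first crossing point past the mode as the threshold is an assumption made for illustration.

```r
set.seed(1)
# Simulated distances (illustrative values only).
intra <- rnorm(200, mean = 0.2, sd = 0.05)  # between replicates of one sample
inter <- rnorm(200, mean = 0.7, sd = 0.10)  # between different samples

# Evaluate both densities on the same x grid so that indices are comparable.
dd_intra <- density(intra, from = 0, to = 1, n = 512)
dd_inter <- density(inter, from = 0, to = 1, n = 512)

# The mode is a position on the x axis (a distance), not a density height.
intra_mode <- dd_intra$x[which.max(dd_intra$y)]

# Proposed condition: grid points right of the mode where the intra density dominates.
p <- which(dd_intra$y - dd_inter$y > 0 & dd_inter$x > intra_mode)

# Assumed threshold rule: first grid point past the mode where the inter density takes over.
cross <- which(dd_inter$x > intra_mode & dd_inter$y > dd_intra$y)[1]
threshold <- dd_inter$x[cross]
print(c(mode = intra_mode, threshold = threshold))
```

Comparing `dd_inter$y` with `intra_mode`, as in the current line, tests a density height against a distance, which is why the selected indices need not match what the plot suggests.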

add example for m_silva_annotator

Add a use case to the documentation of the silva_annotator function.

Metabarlist

In order to create a metabarlist, I must create 4 different files (motus, pcrs, ...).
How can I create those files starting from an Obitools3 output? In particular, which of the outputs produced by Obitools3 do I need?
Thank you

Extra 'simplify' parameter needed in subset_metabarlist

subset_metabarlist <- function(metabarlist, table, indices)

would become

subset_metabarlist <- function(metabarlist, table, indices, simplify=TRUE)

=================

when simplify=FALSE, do not remove any columns in the reads table:

reads <- reads[, colSums(reads) > 0, drop = F]

would become

reads <- reads[, (colSums(reads) > 0) | !simplify, drop = F]

=================

Useful when you want to compare the reads tables of two subsetted metabarlists that were originally the same:

Example:

soil_euk2 <- soil_euk
soil_euk2$reads <- corrected

metabaR::ggpcrtag(soil_euk2)
metabaR::ggpcrtag(soil_euk)

s1<- subset_metabarlist(soil_euk, 'pcrs', {v <- soil_euk$pcrs$control_type=="sequencing";v[is.na(v)]<-F;v})
s2 <- subset_metabarlist(soil_euk2, 'pcrs', {v <- soil_euk$pcrs$control_type=="sequencing";v[is.na(v)]<-F;v})

s(1|2)$reads and s(1|2)$motus tables are not easily comparable
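
The column filter behind the proposed `simplify` argument can be sketched in base R as follows. `drop_empty_motus()` is a hypothetical stand-in for that single step, not the package function. Note that combining the test with `& simplify` would drop every column when `simplify = FALSE`, so the sketch uses `| !simplify`.

```r
# Hypothetical stand-in for the column filter inside subset_metabarlist.
drop_empty_motus <- function(reads, simplify = TRUE) {
  reads[, (colSums(reads) > 0) | !simplify, drop = FALSE]
}

reads <- matrix(c(3, 0, 0,
                  1, 0, 2),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("pcr1", "pcr2"), c("motu1", "motu2", "motu3")))

one_pcr <- reads["pcr1", , drop = FALSE]  # subset to a single PCR

dim_simplified <- dim(drop_empty_motus(one_pcr))                    # empty MOTUs removed
dim_full       <- dim(drop_empty_motus(one_pcr, simplify = FALSE))  # all MOTUs kept
dim_and_false  <- dim(one_pcr[, (colSums(one_pcr) > 0) & FALSE, drop = FALSE])  # nothing left

print(rbind(dim_simplified, dim_full, dim_and_false))
```

With `simplify = FALSE`, two subsets of the same original metabarlist keep identical column sets, so their reads tables can be compared cell by cell.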

aggregate_pcrs transforms plate_col in factor on Windows 10 - R 3.6.1

I observed that aggregate_pcrs, on a particular OS/OS version/R version (Microsoft Windows 10 Professionnel 10.0.19044 N/A build 19044, R 3.6.1) converts metabarlist$pcrs$plate_col to factor at this step (actually all the columns that enter the 'else' condition are converted to factor):

pcr.out <- data.frame(t(sapply(rownames(reads.out), function(x) {
            sub <- metabarlist$pcrs[replicates == x, ]
            apply(sub, 2, function(x) {
                if (is.numeric(x)) {
                  mean(x)
                }
                else {
                  ifelse(length(unique(x)) == 1, x[1], paste(unique(sort(x)), 
                    collapse = "|")) ## HERE!
                }
            })
        })))

The consequence is that, when the metabarlist (object out in aggregate_pcrs) is passed to check_metabarlist, it raises the error message "column plate_col of table pcrs in metabarlist.name must contain numeric values ranging from 1 to 12" because factors are internally stored as numeric variables (that can be > 12 when the number of levels > 12).

I don't know why it has this behaviour on this particular computer configuration, but a solution would be to convert this column to character, either in the aggregate_pcrs function or when checking the values of plate_col in check_metabarlist:

if (!(all(as.numeric(ifelse(grepl("\\|", metabarlist$pcrs$plate_col), 
                            1, as.character(metabarlist$pcrs$plate_col))) %in% 1:12))) {
  stop(paste("column `plate_col` of table `pcrs` in", 
             metabarlist.name, "must contain numeric values ranging from 1 to 12"))
}

@ metabaR developers: Do you agree with one of these solutions? Could this modification create a problem I hadn't considered?
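
One likely explanation, offered as an assumption rather than a confirmed diagnosis: before R 4.0.0, data.frame() defaulted to stringsAsFactors = TRUE, so a character matrix such as the one returned by t(sapply(...)) becomes factor columns on R 3.6.1 whatever the operating system. The base R sketch below reproduces both defaults explicitly on made-up values.

```r
# Character matrix similar in shape to what t(sapply(...)) returns above.
m <- matrix(c("1", "10", "12", "A", "B", "C"), ncol = 2,
            dimnames = list(NULL, c("plate_col", "plate_row")))

df_old <- data.frame(m, stringsAsFactors = TRUE)   # default before R 4.0.0
df_new <- data.frame(m, stringsAsFactors = FALSE)  # default since R 4.0.0

codes  <- as.numeric(df_old$plate_col)                # factor level codes, not plate columns
values <- as.numeric(as.character(df_old$plate_col))  # original values recovered
direct <- as.numeric(df_new$plate_col)                # character column converts directly

print(list(codes = codes, values = values, direct = direct))
```

If this is indeed the cause, passing `stringsAsFactors = FALSE` to the data.frame() call in aggregate_pcrs would give the same result on every R version.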

check_pcr_repl won't work if your sample names start with digits

Hello,

first I would like to thank you guys for this amazing package!

I encountered a little problem with my data which is not major at all but thought I should report it.

I happened to name my samples using a code starting with digits (e.g., 05_R_F_3), which caused a problem when plotting the PCoA with the check_pcr_repl function, as R automatically adds an X before names starting with digits (i.e., X05_R_F_3).

This caused the function to fail (lines 41 to 43).

I found a workaround: removing the X from the rownames by locally modifying the function, adding rownames(d) <- gsub("X","",rownames(d)) between lines 31 and 32 (not clean, but effective in my case, as there is no X in my sample names).

This really is a minor issue but I think it might be nice to fix it or at least mention in the "data format and structure" section of the metabaR tutorial that sample names should not start with digits (which is, as I am learning, a bad habit in general).

Thank you again!
Cheers,
Tristan.
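
The renaming described above is what base R's make.names() does, and data.frame() applies it to column names unless check.names = FALSE. Whether check_pcr_repl goes through exactly this route is an assumption. The sketch also shows a narrower clean-up than gsub("X", "", ...), which would strip every X in a name.

```r
samples <- c("05_R_F_3", "soil_A1")

# Names starting with a digit are not syntactically valid, so R prefixes an X.
mangled <- make.names(samples)

m <- matrix(1:4, nrow = 2, dimnames = list(c("a", "b"), samples))
names_checked   <- names(data.frame(m))                       # names altered
names_unchecked <- names(data.frame(m, check.names = FALSE))  # names preserved

# Remove only a leading X that is followed by a digit.
restored <- sub("^X(?=[0-9])", "", mangled, perl = TRUE)

print(list(mangled = mangled, restored = restored))
```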

Qiime2 to metabaR

Hello there, anyone already designed a script to import data from Qiime2 to metabaR? Would be very useful. Thanks !

import phyloseq object

Thanks for developing this useful package.

I am quite interested in some of the functions and especially the ggpcrtag(). Any chance you can add a function to convert a phyloseq object into a metabaR dataset?

contaslayer Error: replacement has length zero

Hello,

I am running this package in RStudio v.1.3 and R v.4.0.3. I am processing this table and following the tutorial. When I get to the contaslayer step, I get the following error:

list_12S <- contaslayer(list_12S, control_types = "extraction", output_col = "not_extraction_cont")
Error in reads_matrix.max[i] <- rownames(reads_matrix.fcol)[which.max(reads_matrix.fcol[,  : 
  replacement has length zero

Could this have something to do with an earlier warning I received when creating the metabarlist?

list_12S <- metabarlist_generator(reads, motus, pcrs, samples)
Warning messages:
1: In check_metabarlist(out) :
  PCR plate design not properly provided: columns tag_fwd is missing in table `pcrs` of out!
PCR plate design not properly provided: columns tag_rev is missing in table `pcrs` of out!
PCR plate design not properly provided: columns primer_fwd is missing in table `pcrs` of out!
PCR plate design not properly provided: columns primer_rev is missing in table `pcrs` of out!
PCR plate design not properly provided: columns plate_no is missing in table `pcrs` of out!
PCR plate design not properly provided: columns plate_col is missing in table `pcrs` of out!
PCR plate design not properly provided: columns plate_row is missing in table `pcrs` of out!

2: In check_metabarlist(out) :
  Some MOTUsin out have a number of reads of zero in table `reads`!

Does that mean that somehow I have MOTUs in my table with zero reads?
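
The second warning does say that some MOTUs have no reads, and they can be located with colSums(). Whether they cause the contaslayer error is not established here. The sketch below, on a made-up reads matrix, also shows one plausible mechanism (an assumption) for "replacement has length zero": which.max() returns a zero-length result when given an empty vector, for instance if no PCR matched the requested control type.

```r
reads <- matrix(c(5, 0, 2,
                  1, 0, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("pcr1", "pcr2"), c("motu1", "motu2", "motu3")))

# MOTUs without any read, and the table with those MOTUs removed.
empty_motus <- colnames(reads)[colSums(reads) == 0]
reads_clean <- reads[, colSums(reads) > 0, drop = FALSE]

# which.max() on an empty vector returns integer(0), so the indexed value
# has length zero and cannot be assigned to a single element.
res <- character(1)
err <- tryCatch({
  res[1] <- rownames(reads)[which.max(numeric(0))]
  NULL
}, error = function(e) e)

print(empty_motus)
print(conditionMessage(err))
```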

other types of negative controls?

more flexibility for negative controls (e.g. different type of field negative controls).

Customising the text on ggtaxplot

I have created a taxonomic table using metabaR and have been using theme() to edit aspects of it. However, when the figure is initially produced, it has names that are angled, difficult to read, overlapping, and running past the plot boundary.

I know that theme() works in layers, and I have tried using theme(text = element_blank()) to get rid of these names so that I can then customise them with more options, but this just clears all the text around my legends and not the taxa names.

Which packages could tidy these? I know that theme() is for manipulating non-data components.

read_ngsfilter

I noticed that I can't load an NGSfilter file if one or more samples have the same forward and reverse tags, because the read_ngsfilter function doesn't take into account whether there is a T instead of an F in the 6th column.

Connection with DAMe/Begum?

To do later: create import/export functions to connect the package with DAMe / Begum.

function to import phyloseq object

Thanks for developing this useful package.

I am quite interested in some of the functions and especially the ggpcrtag(). Any chance you can add a function to convert a phyloseq object into a metabaR dataset?

Website is not synchronized with repository

Travis-CI does not synchronize the website on GitHub Pages with the contents of the repository.

metabarlist_to_phyloseq function

Hi,
I wrote a new function to convert a metabarlist object into a phyloseq object. My idea while writing this function was to directly convert the reads matrix into a phyloseq otu_table object, the samples data frame into a phyloseq sample_data object and part of the motus data frame into a phyloseq tax_table object to group them in a phyloseq object (see this page for phyloseq objects structure: https://joey711.github.io/phyloseq/import-data.html).
This implies that the sample names in the reads matrix (i.e. rownames) should be the same as in the samples data frame, meaning that controls have been removed and the metabarlist has been aggregated at sample level (see example in https://github.com/metabaRfactory/metabaR/blob/phyloseq_functions/R/metabarlist_to_phyloseq.R).
Do you agree with that, or would you suggest making it possible to convert the metabarlist into a phyloseq object at 'pcrs' level (and thus include information from the pcrs data frame in the phyloseq sample_data object)?
AnneSoBen
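
The precondition described above can be checked in base R before any conversion is attempted. The tables and names below are hypothetical and only illustrate the difference between sample-level and PCR-level data.

```r
# Hypothetical sample-level tables (names are illustrative only).
reads <- matrix(c(10, 0, 4, 7), nrow = 2,
                dimnames = list(c("s1", "s2"), c("motu1", "motu2")))
samples <- data.frame(site = c("A", "B"), row.names = c("s1", "s2"))

# Sample-level data: every row of reads has a matching row in samples.
ok_sample_level <- setequal(rownames(reads), rownames(samples))

# PCR-level data: replicates and controls have no matching row in samples.
reads_pcr <- matrix(0, nrow = 3, ncol = 2,
                    dimnames = list(c("s1_r1", "s1_r2", "neg_ctrl"), colnames(reads)))
unmatched <- setdiff(rownames(reads_pcr), rownames(samples))

print(ok_sample_level)
print(unmatched)
```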

Little Typo in Let's MetabaR! tutorial

Hi,

I just noticed a typo in the "removing spurious signal" section of the official tutorial.

It seems to me that the second row should use rowSums as pcrs are the rows of the reads matrix of the metabarlist object.

Cheers,
Tristan
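
For reference, a small base R illustration of the point, using a made-up reads matrix with PCRs as rows and MOTUs as columns:

```r
reads <- matrix(c(5, 0, 2,
                  1, 3, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("pcr1", "pcr2"), c("motu1", "motu2", "motu3")))

reads_per_pcr  <- rowSums(reads)  # one total per PCR (row)
reads_per_motu <- colSums(reads)  # one total per MOTU (column)

print(reads_per_pcr)
print(reads_per_motu)
```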
