
nanoasv's Introduction


NanoASV

NanoASV is a container-based workflow using state-of-the-art bioinformatic software to process full-length SSU rRNA (16S/18S) amplicons acquired with Oxford Nanopore sequencing technology. Its strengths are reproducibility, portability and the ability to run offline. It can be installed on the Nanopore MK1C sequencing device to process data locally.

Installation

At the moment, the only way to install NanoASV is to build it from source with Docker. You can then either run it with Docker, or build a Singularity image file (SIF) from the Docker image and run it with Singularity.

ADVANCED - Build from source with Docker

Building takes 75 min on my computer (32 GB RAM, 12 cores). The longest part is the SILVA indexing step. Avoid this step by downloading the (heavy) NanoASV.tar archive.

git clone https://github.com/ImagoXV/NanoASV
docker build -t nanoasv NanoASV/.

Create Docker archive to build with Singularity

docker save -o NanoASV.tar nanoasv

ADVANCED - Build image with Singularity

I recommend building the SIF file from the Docker archive:

singularity build nanoasv docker-archive://NanoASV.tar

NOT WORKING ATM - EASY - Download for Singularity

Archive is too big at the moment to be a GitHub release. You have to build from source.

wget path/to/archive
tar -xvzf nanoasv.tar.gz 
sudo mv nanoasv /opt/
echo 'export PATH=$PATH:/opt/' >> ~/.bashrc && source ~/.bashrc

Then test that everything is working properly. The low vsearch clustering identity threshold (-i 0.3) makes it possible to recover OTUs from the small number of sequences in the Minimal dataset. You should not use such a low identity threshold for real analyses; -i 0.7 works fine.

singularity run nanoasv -d Minimal -o Out_test -i 0.3 [--options]
docker run -v $(pwd)/Minimal:/data/Minimal -it nanoasv -d /data/Minimal -o out --docker -i 0.3 [--options] 

ADVANCED - Install on MK1C sequencing device

All the previous steps can be used to install on the MK1C, but be sure to use the aarch64 version. IT WILL NOT RUN IF IT IS NOT THE AARCH64 VERSION.

Usage

RECOMMENDED - With Singularity

If added to the PATH:

nanoasv -d path/to/sequences -o out [--options]

Or

singularity run nanoasv -d path/to/sequences -o out [--options]

Or if installed elsewhere

/path/to/installation/nanoasv -d path/to/sequences -o out [--options] 

ADVANCED - With Docker

I recommend not running it with Docker because of the root privileges it requires. Don't forget the --docker flag.

docker run -v $(pwd)/Minimal:/data/Minimal -it nanoasv -d /data/Minimal -o out --docker

You can mount your sequences directory anywhere in the container, but I recommend mounting it in /data/.

Technical recommendations

If running on a PC, I suggest using no more than two threads with 32 GB of RAM; otherwise, you might crash your system. I highly suggest running it on a cluster: 96 samples (--subsampling 50000) took 4 h (without tree) with 150 GB of RAM and 8 threads. Building the tree is highly compute-intensive.

Options

| Option               | Description                                                                            |
| -------------------- | -------------------------------------------------------------------------------------- |
| `-h`, `--help`       | Show help message                                                                      |
| `-v`, `--version`    | Show version information                                                               |
| `-d`, `--dir`        | Path to fastq_pass/                                                                    |
| `-q`, `--quality`    | Quality threshold for Chopper, default: 8                                              |
| `-l`, `--minlength`  | Minimum amplicon length for Chopper, default: 1300                                     |
| `-L`, `--maxlength`  | Maximum amplicon length for Chopper, default: 1700                                     |
| `-i`, `--id-vsearch` | Identity threshold for the vsearch unknown-sequence clustering step, default: 0.7      |
| `-p`, `--num-process`| Number of cores for parallelization, default: 1                                        |
| `--subsampling`      | Max number of sequences per barcode, default: 50,000                                   |
| `--no-r-cleaning`    | Flag - keep Eukaryota, Chloroplast, and Mitochondria sequences in the phyloseq object  |
| `--metadata`         | Specify metadata.csv file directory, default is the demultiplexed directory (--dir)    |
| `--notree`           | Flag - skip the phylogeny step and omit the tree from the phyloseq object              |
| `--docker`           | Flag - run NanoASV with Docker                                                         |
| `--ronly`            | Flag - run only the R phyloseq step                                                    |

How it works

Building from source

Building from source is pretty long at the moment. The main time bottleneck is the bwa SILVA 138.1 indexing step (~60 min on a 32 GB RAM PC). It is much faster to download the archive and build with Singularity; however, the archive is pretty heavy and not available for download at the moment.

Data preparation

Directly input your /path/to/sequence/data/fastq_pass directory. The 4000-sequence fastq.gz files written by the sequencer are concatenated by barcode identity to make one barcodeXX.fastq.gz file per barcode.
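As a sketch, that concatenation amounts to the following (directory layout assumed from a standard fastq_pass/; concatenated gzip files remain valid gzip):

# Merge the sequencer's per-barcode chunks into one file per barcode
for BARCODE in fastq_pass/barcode*/; do
  NAME=$(basename "$BARCODE")
  cat "$BARCODE"/*.fastq.gz > "${NAME}.fastq.gz"
done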

Filtering

Chopper filters out inappropriate sequences. This step is executed in parallel (default --num-process = 1). Default parameters keep sequences with quality > 8 and 1300 bp < length < 1700 bp.
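As a rough sketch, a per-barcode Chopper call with these defaults would look something like this (file names are illustrative, not NanoASV's internals):

# Chopper reads FASTQ on stdin and writes the filtered FASTQ to stdout
gunzip -c barcode01.fastq.gz \
  | chopper -q 8 --minlength 1300 --maxlength 1700 \
  | gzip > barcode01_filtered.fastq.gz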

Chimera detection

There is no efficient chimera detection step at the moment.

Adapter trimming

Porechop trims known adapters. This step is executed in parallel (default --num-process = 1).
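A standalone equivalent of this step would look something like the following (file names are illustrative):

# Porechop detects and trims known adapter sequences from Nanopore reads
porechop -i barcode01_filtered.fastq.gz -o barcode01_trimmed.fastq.gz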

Subsampling

50,000 sequences per barcode are enough for most common questions. The default is set to 50,000 sequences per barcode and can be modified with --subsampling <int>.
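For intuition, keeping the first 50,000 reads of a FASTQ file can be done like this (4 lines per read, so 200,000 lines; NanoASV's exact method may differ):

zcat barcode01_trimmed.fastq.gz | head -n 200000 | gzip > barcode01_sub.fastq.gz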

Alignment

bwa aligns the previously filtered sequences against SILVA 138.1. This step is executed in parallel (default --num-process = 1). In the future, I will add the possibility to use a database other than SILVA. Files like barcode*_abundance.tsv, Taxonomy_barcode*.csv and barcode*_exact_affiliations.tsv are produced and can be found in the Results directory.
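The exact command shape can be seen in NanoASV's own log output (quoted verbatim in the MK1C issue further down), where ${DB} is the database directory and ${FILE} a barcode fastq:

# Map one barcode's reads against the SILVA index, keeping the SAM output
bwa mem ${DB}/SILVA_IDX "${FILE}" 2> /dev/null > "${FILE}.sam"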

Unknown sequences clustering

The fastq of non-matching sequences is extracted, then clustered with vsearch (default --id 0.7). Clusters with an abundance under 5 are discarded to avoid useless heavy computing. Output goes into Results/Unknown_clusters.
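A sketch of what the clustering and abundance filtering could look like with vsearch (file names hypothetical; NanoASV's exact invocation may differ):

# Cluster unknown sequences at 70% identity, recording cluster sizes
vsearch --cluster_fast unknown.fasta --id 0.7 --centroids unknown_clusters.fasta --sizeout
# Discard clusters with abundance under 5
vsearch --sortbysize unknown_clusters.fasta --minsize 5 --output unknown_clusters_min5.fasta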

Phylogenetic tree generation

Reference ASV sequences from SILVA 138.1 are extracted according to the detected references. The seed sequences of unknown OTUs are added. The final file is fed to FastTree to produce a tree file, which is then included in the final phyloseq object. This provides a phylogenetic placement for unknown OTUs and a 16S-based phylogenetic estimate of their taxonomy.
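A minimal sketch of the alignment and tree-building pair (file names hypothetical; depending on the install, the FastTree binary may be named fasttree):

# Align all reference and unknown-seed sequences, then build a nucleotide tree
mafft --auto asv_sequences.fasta > asv_aligned.fasta
FastTree -nt asv_aligned.fasta > tree.nwk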

Phylosequization

Alignment results, taxonomy, clustered unknown entities and the 16S-based phylogenetic tree are used to produce a phyloseq object: NanoASV.rdata. Please refer to the metadata.csv file in the Minimal dataset to be sure to input the correct file format, so that phyloseq produces a correct object. You can choose not to remove Eukaryota, Chloroplast and Mitochondria sequences (pruned by default) with the --no-r-cleaning flag.
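As a quick sanity check, the resulting object can be inspected from the shell; a minimal sketch assuming NanoASV.rdata ends up in your output directory (the exact path may differ):

# verbose = TRUE prints the names of the objects restored from the .rdata file
Rscript -e 'load("out/NanoASV.rdata", verbose = TRUE)'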

--ronly option

Sometimes your metadata.csv file will not meet phyloseq standards. To avoid recomputing all the previous steps, a --ronly flag can be added. Just specify --dir and --out as in your first run. NanoASV will find the final datasets and run only the R script. This will save you time.
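For example, reusing the paths from your first run:

nanoasv -d path/to/sequences -o out --ronly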

Citation

Please don't forget to cite NanoASV and its dependencies if it helped you process your Nanopore data. Thank you!

nanoasv's People

Contributors

frederic-mahe, imagoxv


nanoasv's Issues

Subsampling before chimera detection

Chimera detection seems long for some highly sequenced barcodes.
I need to add a subsampling step before chimera detection, maybe something like --subsampling XX * 2 to allow for some buffer sequences.
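A minimal sketch of that idea, assuming FASTQ input (4 lines per read) and the default 50,000-read target; names are illustrative:

TARGET=50000
# Keep 2x the target as buffer before chimera detection
zcat barcode01.fastq.gz | head -n "$((TARGET * 2 * 4))" > barcode01_buffered.fastq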

Docker image run with Singularity: phyloseq R package error "cannot open shared object"

Error: package or namespace load failed for ‘phyloseq’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/imago/R/x86_64-pc-linux-gnu-library/4.3/stringi/libs/stringi.so':
  libicui18n.so.70: cannot open shared object file: No such file or directory
Execution halted

Something might have changed.

I'm pretty sure that's because of ubuntu:latest.

Got to change to ubuntu:22.04.

I'm sure at some point it was specified. I don't know what happened.

Need to specify software versions for reproducibility

Need to pin the versions in the Dockerfile installation so the same tool versions are always used:

  • bwa 0.7.17-r1188 (might consider upgrading to a more recent one)
  • Chopper v0.7.0
  • FastTree (only one version? Might consider using FastTree2)
  • MAFFT v7.490 (2021/Oct/30)
  • Porechop 0.2.4
  • R 4.1.2 (2021-11-01) -- "Bird Hippie"
  • samtools 1.13
  • vsearch v2.21.1_linux_x86_64

I should probably pin the library versions as well.
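For illustration, one way to pin a tool is to fetch an exact versioned release binary instead of a floating distro package; vsearch, for example, publishes versioned tarballs (URL pattern assumed from the vsearch GitHub releases page):

# Fetch and unpack the exact vsearch 2.21.1 release binary
wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-x86_64.tar.gz
tar -xzf vsearch-2.21.1-linux-x86_64.tar.gz
vsearch-2.21.1-linux-x86_64/bin/vsearch --version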

First Test on real dataset

The first test on a real dataset had memory-handling issues for some barcodes. The dataset has to be subsampled.

Numerical taxonomic richness seems to have increased compared to my previous treatment. I have to investigate.

Produce phylo tree for phyloseq object

Need to single-line the SILVA FASTA, directly in the Dockerfile, while I'm at it.

Need to extract the reference sequences to build the tree with FastTree.

Need to feed the Rscript with the tree and then inject it into the phyloseq object.

Almost there

Binary release?

Hi @frederic-mahe, I tried to make a release to see how it works. However, my binary is too big (~5 GB); the maximum allowed is 2 GB.

Any idea how to overcome this?

Arthur

aarch64 - MK1C fail on minimal dataset

Step 4/9 : Adapter trimming with Porechop
Step 5/9 : Subsampling
Step 6/9 : Reads alignements with bwa against SILVA_138.1
environment: line 1:   218 Segmentation fault      (core dumped) bwa mem ${DB}/SILVA_IDX "${FILE}" 2> /dev/null > "${FILE}.sam"
environment: line 1:   221 Segmentation fault      (core dumped) bwa mem ${DB}/SILVA_IDX "${FILE}" 2> /dev/null > "${FILE}.sam"
Step 7/9 : Skipped - no unknown sequence
Step 8/9 : Phylogeny with MAFFT and FastTree
Step 9/9 : Phylosequization with R and phyloseq
Data treatment is over.
NanoASV took 144 seconds to perform.

This indicates a memory-related error.

If only the MK1C weren't running dozens of useless jobs in the background.

I'll find a way.

Not running on aarch64

I cannot make it run on the MK1C because chopper was compiled for amd64.

Need to find a way.

Running nanoasv smoothly

It seems that Singularity is instantly called when running nanoasv, which makes singularity run nanoasv --options unnecessary. If the nanoasv Singularity file is executable, then just ./nanoasv, or nanoasv if you put it in /opt/ and add it to the $PATH.

A nice way to do it

echo 'export PATH=$PATH:/opt/' >> ~/.bashrc && source ~/.bashrc

which makes

~$ nanoasv 
WARNING: could not mount /etc/localtime: not a directory
 ______________________________________
/ Error: -d needs an argument, I don't \
\ know where your sequences are.       /
 --------------------------------------
        \   ^__^
         \  (xx)\_______
            (__)\       )\/\
             U  ||----w |
                ||     ||

Lovely

Chimera detection #2

vsearch seems to never detect chimeras with default parameters.

I think it stems from the fact that sequences are not dereplicated and therefore lack an abundance annotation in the fasta header.
However, I think dereplication would not work because vsearch expects 100% similarity, which is rarely (if ever) achieved with Nanopore amplicon sequencing.
Efficient dereplication would mean accepting a certain variability threshold, which would end up being clustering. Such clustering with vsearch performs well with --id 0.7, which is significantly lower than what we would accept for dereplication. And if we cluster, then it's not an ASV treatment anymore.

I need to discuss it with you @frederic-mahe
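To illustrate the tension, here are the two vsearch modes in question (file names hypothetical; --sizeout adds the abundance annotations that chimera detection expects):

# Strict dereplication: only merges 100% identical sequences
vsearch --derep_fulllength barcode01.fasta --output barcode01_derep.fasta --sizeout

# Clustering at 70% identity merges noisy reads, but is no longer ASV-level resolution
vsearch --cluster_fast barcode01.fasta --id 0.7 --centroids barcode01_centroids.fasta --sizeout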

No control over Porechop CPU usage.

Porechop uses as many CPUs as possible, no matter the parallelization I wrote.

It might detect and use whatever is available, and it might even do that in parallel somehow.

It still seems a pretty lightweight computation.

Checkpoint system

I need to add a checkpoint system that would allow resuming the data analysis after an error, to avoid re-computing everything.

Likewise, I should add an R_ONLY option to re-run the phyloseq step in case the metadata.csv file was not working.

Indexing parallelisation?

I wonder if it is possible to parallelize the indexing step (which is clearly the most compute-intensive part of the build process).

Need to update bwa to bwa-mem2

I used bwa for simplicity's sake, but now I need to change it for bwa-mem2, which is supposed to be more memory-efficient and faster.
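bwa-mem2 keeps the same index-then-map workflow, so the swap should be mostly mechanical (file names illustrative):

# Build the bwa-mem2 index once, then map each barcode against it
bwa-mem2 index SILVA_138.1.fasta
bwa-mem2 mem -t 4 SILVA_138.1.fasta barcode01.fastq.gz > barcode01.sam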

