NanoASV is a container-based workflow using state-of-the-art bioinformatics software to process full-length SSU rRNA (16S/18S) amplicons acquired with Oxford Nanopore sequencing technology. Its strengths are reproducibility, portability and the ability to run offline. It can be installed on the Nanopore MK1C sequencing device and process data locally.
At the moment, the only way to install NanoASV is to build it from source with Docker. From there you can either run it with Docker or build a Singularity image file (SIF) from the Docker image and run it with Singularity.
Building takes about 75 min on my computer (32 GB RAM, 12 cores). The longest part is the SILVA indexing step. You can avoid it by downloading the (heavy) NanoASV.tar archive.
git clone https://github.com/ImagoXV/NanoASV
docker build -t nanoasv NanoASV/.
docker save -o NanoASV.tar nanoasv
I recommend building the SIF file from the Docker archive:
singularity build nanoasv docker-archive://NanoASV.tar
The archive is currently too big to be distributed as a GitHub release, so you have to build from source.
wget path/to/archive
tar -xvzf nanoasv.tar.gz
sudo mv nanoasv /opt/
echo 'export PATH=$PATH:/opt/' >> ~/.bashrc && source ~/.bashrc
Then test that everything is working properly. The low vsearch clustering identity threshold makes it possible to recover OTUs from a small number of sequences. You should not use such a low identity threshold for real analyses; -i 0.7 works fine.
singularity run nanoasv -d Minimal -o Out_test -i 0.3 [--options]
docker run -v $(pwd)/Minimal:/data/Minimal -it nanoasv -d /data/Minimal -o out --docker -i 0.3 [--options]
All previous steps can be used to install on the MK1C, but be sure to use the aarch64 version. IT WILL NOT RUN IF IT IS NOT THE AARCH64 VERSION.
If added to the PATH:
nanoasv -d path/to/sequences -o out [--options]
Or
singularity run nanoasv -d path/to/sequences -o out [--options]
Or if installed elsewhere
/path/to/installation/nanoasv -d path/to/sequences -o out [--options]
I recommend not running it with Docker because of the root privileges it requires. If you do, don't forget the --docker flag:
docker run -v $(pwd)/Minimal:/data/Minimal -it nanoasv -d /data/Minimal -o out --docker
You can mount your sequences directory anywhere in the container, but I recommend mounting it under /data/.
If running on a PC, I suggest using no more than two threads with 32 GB of RAM; otherwise, you might crash your system. I highly recommend running it on a cluster. 96 samples (--subsampling 50000) took 4 h (without the tree) with 150 GB of RAM and 8 threads. The tree step is highly compute-intensive.
| Option | Description |
| -------------------- | ------------------------------------------------------------------------------ |
| `-h`, `--help` | Show help message |
| `-v`, `--version` | Show version information |
| `-d`, `--dir` | Path to fastq_pass/ |
| `-q`, `--quality` | Quality threshold for Chopper, default: 8 |
| `-l`, `--minlength` | Minimum amplicon length for Chopper, default: 1300 |
| `-L`, `--maxlength` | Maximum amplicon length for Chopper, default: 1700 |
| `-i`, `--id-vsearch` | Identity threshold for vsearch unknown sequences clustering step, default: 0.7 |
| `-p`, `--num-process`| Number of cores for parallelization, default: 1 |
| `--subsampling` | Max number of sequences per barcode, default: 50,000 |
| `--no-r-cleaning`    | Flag - to keep Eukaryota, Chloroplast, and Mitochondria sequences in the phyloseq object |
| `--metadata` | Specify metadata.csv file directory, default is demultiplexed directory (--dir)|
| `--notree` | Flag - To remove phylogeny step and subsequent tree from phyloseq object |
| `--docker` | Flag - To run NanoASV with Docker |
| `--ronly` | Flag - To run only the R phyloseq step |
Building from source is pretty long at the moment. The main time bottleneck is the bwa SILVA 138.1 indexing step (~60 min on a 32 GB RAM PC). It is much faster to download the archive and build with Singularity. However, the archive is pretty heavy and not available for download at the moment.
Directly input your /path/to/sequence/data/fastq_pass directory. The 4000-sequence fastq.gz files are concatenated by barcode identity to make one barcodeXX.fastq.gz file per barcode.
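The concatenation step amounts to joining the gzipped chunks per barcode, which is valid because concatenated gzip files form one multi-member gzip stream. A minimal sketch with illustrative file names (not NanoASV's actual code):

```shell
# Stand-in for MinKNOW output: two tiny gzipped fastq chunks (one read each).
mkdir -p fastq_pass/barcode01
printf '@read1\nACGT\n+\nIIII\n' | gzip > fastq_pass/barcode01/chunk_0.fastq.gz
printf '@read2\nTTGG\n+\nIIII\n' | gzip > fastq_pass/barcode01/chunk_1.fastq.gz

# Concatenating gzip files directly yields a valid gzip stream.
cat fastq_pass/barcode01/*.fastq.gz > barcode01.fastq.gz

zcat barcode01.fastq.gz | wc -l   # 8 lines = 2 fastq records
```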
Chopper filters out inappropriate sequences. This step runs in parallel (default --num-process 1). Default parameters keep sequences with quality > 8 and 1300 bp < length < 1700 bp.
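The length part of that filter can be sketched in plain awk (quality filtering is omitted here; chopper itself also filters on mean read quality). This is an illustration of the default length window, not NanoASV's actual command:

```shell
# Build a two-read dummy fastq: one 4 bp read (too short) and one 1500 bp read (kept).
SEQ=$(head -c 1500 /dev/zero | tr '\0' 'A')
QUAL=$(head -c 1500 /dev/zero | tr '\0' 'I')
printf '@short\nACGT\n+\nIIII\n@ok\n%s\n+\n%s\n' "$SEQ" "$QUAL" > reads.fastq

# Keep only records whose sequence length is strictly between 1300 and 1700 bp.
awk -v min=1300 -v max=1700 '
  NR%4==1 {h=$0} NR%4==2 {s=$0} NR%4==3 {p=$0}
  NR%4==0 { if (length(s) > min && length(s) < max)
              printf "%s\n%s\n%s\n%s\n", h, s, p, $0 }
' reads.fastq > filtered.fastq
```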
There is no efficient chimera detection step at the moment
Porechop trims known adapters. This step runs in parallel (default --num-process 1).
50,000 sequences per barcode is enough for most common questions. The default is 50,000 sequences per barcode; it can be changed with --subsampling <int>.
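Since a fastq record is exactly four lines, capping a barcode at N reads can be sketched with head (NanoASV's actual subsampling implementation may differ; this just shows the idea):

```shell
# Cap at N reads per barcode.
N=2

# Dummy 5-read fastq.
: > big.fastq
for i in 1 2 3 4 5; do
  printf '@r%s\nACGT\n+\nIIII\n' "$i" >> big.fastq
done

# Each fastq record is 4 lines, so keep the first 4*N lines.
head -n $((4 * N)) big.fastq > sub.fastq
wc -l < sub.fastq   # 8 lines = 2 reads
```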
bwa aligns the previously filtered sequences against SILVA 138.1. This step runs in parallel (default --num-process 1). In the future, I will add the possibility to use a database other than SILVA. Files such as barcode*_abundance.tsv, Taxonomy_barcode*.csv and barcode*_exact_affiliations.tsv are produced and can be found in the Results directory.
Non-matching sequences are extracted as fastq and then clustered with vsearch (default --id 0.7). Clusters with abundance under 5 are discarded to avoid needless heavy computation. Output goes to Results/Unknown_clusters.
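The abundance cutoff itself is a simple tabular filter. A sketch on a hypothetical two-column table (cluster id, abundance) — the column layout is illustrative, not NanoASV's exact output format:

```shell
# Hypothetical cluster abundance table: <cluster_id>\t<abundance>.
printf 'OTU_1\t12\nOTU_2\t3\nOTU_3\t5\n' > clusters.tsv

# Keep clusters with abundance >= 5, matching the cutoff described above.
awk -F'\t' '$2 >= 5' clusters.tsv > clusters.kept.tsv

cat clusters.kept.tsv   # OTU_1 and OTU_3 survive; OTU_2 (abundance 3) is dropped
```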
Reference ASV sequences from SILVA 138.1 are extracted according to the detected references, and the unknown OTU seed sequences are added. The resulting file is fed to FastTree to produce a tree file, which is then included in the final phyloseq object. This provides a phylogeny for unknown OTUs and a 16S-based phylogenetic placement of those entities.
Alignment results, taxonomy, clustered unknown entities and the 16S-based phylogenetic tree are used to produce a phyloseq object: NanoASV.rdata. Please refer to the metadata.csv file in the Minimal dataset to be sure you input the correct file format for phyloseq to produce a valid phyloseq object. You can choose to keep Eukaryota, Chloroplast and Mitochondria sequences (pruned by default) with the --no-r-cleaning flag.
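For orientation, a hypothetical metadata.csv might look like the sketch below: one row per barcode, with the first column matching the barcode directory names. The column names here are illustrative assumptions; check the metadata.csv shipped with the Minimal dataset for the exact format NanoASV expects.

```shell
# Write a minimal, hypothetical metadata.csv (column names are illustrative).
cat > metadata.csv <<'EOF'
barcode,sample,condition
barcode01,soil_A,control
barcode02,soil_B,treated
EOF

head -n 1 metadata.csv   # barcode,sample,condition
```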
Sometimes, your metadata.csv file will not meet phyloseq standards. To avoid recomputing all the previous steps, a --ronly flag can be added. Just specify --dir and --out as in your first run. NanoASV will find the final datasets and run only the R script, saving you time.
Please don't forget to cite NanoASV and its dependencies if it helped you process your Nanopore data. Thank you!