lentendu / deltamp


A flexible, reproducible and resource efficient metabarcoding amplicon pipeline for HPC

License: GNU General Public License v3.0

Makefile 3.65% Shell 74.35% R 20.27% Awk 1.74%

deltamp's Issues

Resolve consensus for vsearch assigned taxonomy with different number of ranks

In GitLab by @lentendu on Jun 22, 2018, 14:35

Test datasets

In GitLab by @lentendu on Feb 15, 2018, 22:00

test directory
configuration files for test datasets (one for 454 and one for Illumina)

Add dependencies check and version control at make

In GitLab by @lentendu on Sep 17, 2018, 15:14

include classifier choice in config file example

In GitLab by @ahb-ufz on Sep 18, 2018, 11:15

There is code to use vsearch rather than the Bayesian classifier for taxonomy, but no corresponding field in the example config file.

Is the code ready to go?

Store jobids in config to ease subsequent debugging and performance statistics

In GitLab by @lentendu on Sep 19, 2018, 21:16

At the end of pipeline_master, echo every jobid, together with its corresponding step name, from the variables in which they are stored into a file:

echo -e "get\t${get_jobid}#${TECH}_${RAW_EXT}\t${raw_jobid} ... " | tr "#" "\n" > config/jobid
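A minimal sketch of the same idea as a reusable function, assuming each step is passed as a `step=jobid` pair. The function name `write_jobids` and the step names and job ids in the usage line are hypothetical placeholders, not the actual DeltaMP variables:

```shell
# Sketch only: write one "step<TAB>jobid" line per argument to a file.
write_jobids() {
    # usage: write_jobids OUTFILE STEP=JOBID [STEP=JOBID ...]
    local outfile="$1"
    local pair
    shift
    mkdir -p "$(dirname "$outfile")"
    for pair in "$@"; do
        # split "step=jobid" at the first "=" into two tab-separated columns
        printf '%s\t%s\n' "${pair%%=*}" "${pair#*=}"
    done > "$outfile"
}
```

It could be called as `write_jobids config/jobid get=1001 raw=1002`, which keeps the file format identical to the echo/tr one-liner while avoiding the `#` placeholder.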

Add trim length options for unpaired Illumina reads, linked to #31

In GitLab by @lentendu on Jun 15, 2018, 18:18

Good practice would be to run first with the quality-only option, then check the quality of the reads and choose the position at which the quality drops as the cut point (as long as enough nt remain to allow proper paired-end merging).

Merge consecutive steps with identical memory, nodes, array requirement

In GitLab by @lentendu on Jun 12, 2018, 10:19

Illumina_fastq and Illumina_pair_end

Create a DeltaMP module by make

In GitLab by @lentendu on Feb 6, 2018, 16:23

approximately contains:

#%Module1.0

module-whatis   "DeltaMP version ${VERSION[DELTAMP]}"

module load xxx
[..]
prepend-path    PATH CLONE_DIRECTORY_PATH/DeltaMP/${VERSION[DELTAMP]}/bin

the modulefiles directory path needs to be read from a configuration file (e.g. config.txt)
the module(s) containing all dependencies need to be read from a configuration file (e.g. config.txt)
makefiles and step scripts must request that the module be loaded
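One way the make step could generate such a modulefile is sketched below. The function name `write_modulefile`, its argument order and the `DeltaMP/VERSION` layout under the modulefiles directory are assumptions made for illustration; the values would come from the configuration file mentioned above:

```shell
# Sketch only: write a Tcl modulefile for a given DeltaMP version.
write_modulefile() {
    # usage: write_modulefile MODULEFILES_DIR VERSION BIN_DIR [DEPENDENCY_MODULE ...]
    local moddir="$1"
    local version="$2"
    local bindir="$3"
    local dep
    shift 3
    mkdir -p "$moddir/DeltaMP"
    {
        printf '#%%Module1.0\n\n'
        printf 'module-whatis   "DeltaMP version %s"\n\n' "$version"
        # one "module load" line per dependency module
        for dep in "$@"; do
            printf 'module load %s\n' "$dep"
        done
        printf 'prepend-path    PATH %s\n' "$bindir"
    } > "$moddir/DeltaMP/$version"
}
```

For example, `write_modulefile "$HOME/modulefiles" 0.2 /path/to/DeltaMP/0.2/bin vsearch mothur` (all values are placeholders) would create `$HOME/modulefiles/DeltaMP/0.2`.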

repair integration of mcl

In GitLab by @ahb-ufz on Sep 18, 2018, 11:12

After MCL clustering, the taxonomy displayed in the all_OTUs table and the taxonomy files don't agree. Likely, the taxonomy data in the all_OTUs table are wrong.

A config file to check if this error is reproduced is attached.
configuration.cerc_dnY_preclC_chiB_clM_dbredY_454_18S_4117590897.tsv

some detail:
-> unique.mcl.pick.0.wang.cons.taxo has OTU, repseq and tax string; the taxonomy-repseq links agree with the all_OTUs.tsv, but the Otu-taxonomy and OTU-repseq links don't

-> the Otu-taxonomy link in unique.mcl.pick.0.wang.cons.taxonomy doesn't agree with the all_OTUs.tsv

-> the repseq-taxonomy link in unique.mcl.pick.0.wang.taxonomy doesn't agree with the all_OTUs.tsv

-> the bottom of the all_OTUs.tsv is missing taxonomic annotation.

-> the reason seems to be that the .list file does not always feature the repseq as the first member. Later on the tables are not really joined but pasted, which messes up the results
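To illustrate the difference between pasting and joining, here is a minimal sketch of a key-based join of two tab-separated tables on their first column. The function name `join_by_key` and the file layout (key in column 1) are assumptions, not the actual OTU script:

```shell
# Sketch only: join two tab-separated files on column 1, regardless of row order.
join_by_key() {
    # usage: join_by_key FILE1 FILE2
    local tab
    local tmpdir
    tab=$(printf '\t')
    tmpdir=$(mktemp -d)
    # join requires both inputs sorted on the key with the same collation
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$tmpdir/a"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$tmpdir/b"
    LC_ALL=C join -t "$tab" "$tmpdir/a" "$tmpdir/b"
    rm -rf "$tmpdir"
}
```

Unlike `paste`, which combines line N of one file with line N of the other, this only combines rows that share the same key, so a representative sequence that is not listed first or a differently ordered file cannot shift the annotation.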

Format cut database into vsearch UDB binary format

In GitLab by @lentendu on May 24, 2018, 14:39

During cut_db step

Add trimming methods based on maximum error rate for Illumina

In GitLab by @lentendu on Jun 13, 2018, 15:19

vsearch --fastq_filter (maxEE)
DADA2 as a replacement for trimming and pre-clustering. Problem: it should be applied to primer-clipped unpaired reads (between the Illumina_fastq and Illumina_pair_end steps, skipping the Illumina_raw_stats step) and would require a significant change to the workflow
This could look like the following (for each library in an array job):

Rscript --vanilla dada2_wrap.R $LIB

and the content of dada2_wrap.R:

library(dada2)
samp<-commandArgs(trailingOnly=TRUE)[1]
fnFs<-paste0(samp,".fwd.fastq")
fnRs<-paste0(samp,".rvs.fastq")
pdf(paste0("dada_qual_",samp,".pdf"),width=10,height=5)
plotQualityProfile(c(fnFs,fnRs))
dev.off()
#some tricks to get optimal truncation length (length reached by at least 80 % of reads)
filtFs<-sub("\\.fastq","\\.filter\\.fastq",fnFs)
filtRs<-sub("\\.fastq","\\.filter\\.fastq",fnRs)
out<-filterAndTrim(fnFs,filtFs,fnRs,filtRs,truncLen=c(260,250),maxEE=c(5,5))
# some exports of filtering read counts
errF<-learnErrors(filtFs)
errR<-learnErrors(filtRs)
# pdf(paste0("dada_err_",samp,".pdf"),height=6,width=6)
# plotErrors(errF, nominalQ=TRUE)
# dev.off()
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
# some exports of dereplicated read counts
dadaFs <- dada(derepFs, err=errF)
dadaRs <- dada(derepRs, err=errR)
# some exports of ASV counts
# format [email protected][[2]][,1:2] and [email protected][[2]][,1:2] to fasta files
# add sequence tracking to name ASV sequences properly and create mothur like names files (if necessary...)
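For the maximum expected error (maxEE) criterion mentioned in this issue, the quantity being thresholded is the sum of the per-base error probabilities, 10^(-Q/10). The sketch below computes it per read with awk, assuming FASTQ records of exactly four lines with Phred+33 qualities. The function name `fastq_expected_errors` is hypothetical, and this is only an illustration of what the filter measures, not a replacement for vsearch or DADA2:

```shell
# Sketch only: print "read_id<TAB>expected_errors" for each FASTQ record.
fastq_expected_errors() {
    # usage: fastq_expected_errors FILE.fastq
    awk '
        # lookup table from printable character to ASCII code
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # header line: keep the identifier without the leading "@"
        NR % 4 == 1 { id = substr($1, 2) }
        # quality line: sum the error probability of every base
        NR % 4 == 0 {
            ee = 0
            n = length($0)
            for (j = 1; j <= n; j++) {
                q = ord[substr($0, j, 1)] - 33
                ee += 10 ^ (-q / 10)
            }
            printf "%s\t%.4f\n", id, ee
        }' "$1"
}
```

In vsearch itself the corresponding filter would be along the lines of `vsearch --fastq_filter reads.fastq --fastq_maxee 1.0 --fastqout filtered.fastq`, where the file names and the threshold are placeholders.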

Make check_previous_step with queueing specific variables

In GitLab by @lentendu on Feb 16, 2018, 11:23

License header

In GitLab by @lentendu on Feb 6, 2018, 16:49

insert a reminder of the GNU licence at the beginning of each script

Correct check_previous_step to match with steps.final format

In GitLab by @lentendu on Aug 8, 2018, 17:04

Only variables containing the jobid of the dependent step are used in steps.final, not jobnames.

At OTU step, when annotating sequence count, use join and not paste against the names file to avoid any issue

In GitLab by @lentendu on Sep 19, 2018, 14:27

At 454_raw_stat, length and quality values are missing or were erased

In GitLab by @lentendu on Sep 19, 2018, 16:35

Make deltamp and pipeline master

In GitLab by @lentendu on Feb 6, 2018, 16:43

make by replacing the batch system specific commands
avoid copies in the lib directories

Database selection in the configuration file

In GitLab by @lentendu on Feb 14, 2018, 17:25

No default database set in deltamp
The path to the database directory and the database prefix $DB need to be provided in the configuration file to match $DB.fasta and $DB.taxonomy
The aligned version of the database will be searched under $DB.align.fasta (e.g. for SILVA database)
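A check of these files at start-up could look like the following sketch. The function name `check_database` and its messages are hypothetical; the aligned version is treated as optional here, which is an assumption:

```shell
# Sketch only: verify that $DB.fasta and $DB.taxonomy exist and are non-empty.
check_database() {
    # usage: check_database DB_DIR DB_PREFIX
    local dir="$1"
    local db="$2"
    local status=0
    local f
    for f in "$db.fasta" "$db.taxonomy"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty database file: $dir/$f" >&2
            status=1
        fi
    done
    # only report the absence of the aligned version, do not fail on it
    if [ ! -s "$dir/$db.align.fasta" ]; then
        echo "note: no aligned version found at $dir/$db.align.fasta" >&2
    fi
    return "$status"
}
```

Calling `check_database /path/to/databases silva || exit 1` (path and prefix are placeholders) before queueing any job would stop early when the configuration points to a missing database.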

Disable pre-clustering for Swarm based clustering

In GitLab by @lentendu on Jun 12, 2018, 13:59

DeltaMP configuration file example

In GitLab by @lentendu on Feb 6, 2018, 16:57

Provide the example configuration files needed to reproduce the full bioinformatic workflow of published studies, at least one per target gene (16S, 18S, ITS, COI)

Add an option to deltamp to allow configuring the project/account name for job queueing.

In GitLab by @lentendu on Mar 7, 2018, 11:29

mcl2mothur bad list format

In GitLab by @lentendu on Sep 19, 2018, 16:24

no value for numOtus on second line

no otu label on first line
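For reference, a sketch of a conversion that produces both fields is given below. It assumes the MCL output has one cluster per line with tab-separated members, and that the list file consists of a header line (`label`, `numOtus`, then one OTU label per cluster) followed by one data line. The function name `mcl_to_mothur_list` and the default label `0` are hypothetical, and this is not the existing mcl2mothur code:

```shell
# Sketch only: convert MCL clusters (one per line) into a two-line list file.
mcl_to_mothur_list() {
    # usage: mcl_to_mothur_list MCL_OUTPUT [LABEL]
    awk -v label="${2:-0}" '
        BEGIN { FS = "\t" }
        NF > 0 {
            # members of one cluster become a comma-separated field
            n++
            members = $1
            for (i = 2; i <= NF; i++) members = members "," $i
            otu[n] = members
        }
        END {
            # zero-pad OTU labels to the width of the OTU count
            w = length(n "")
            header = "label\tnumOtus"
            line = label "\t" (n + 0)
            for (k = 1; k <= n; k++) {
                header = header "\t" sprintf("Otu%0" w "d", k)
                line = line "\t" otu[k]
            }
            print header
            print line
        }' "$1"
}
```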

Example configuration file

In GitLab by @lentendu on May 24, 2018, 16:21

SOP and paper based

Choice between pre-clustering algorithm

In GitLab by @lentendu on Feb 15, 2018, 14:20

Allow selection between MOTHUR pre.cluster or cd-hit-454 for preclustering step

Write README

In GitLab by @lentendu on Feb 6, 2018, 16:54

Dynamically set the memory request based on database size for cut_db

In GitLab by @lentendu on May 30, 2018, 13:52

Clip primers with linked strategy with cutadapt, and check for reverse-complement sequence orientation

In GitLab by @lentendu on Sep 14, 2018, 15:24

Use the linked strategy with an anchored 5' adapter and a non-anchored 3' adapter, which works both when the read traverses the whole amplicon and when it covers it only partially.

Libraries often contain reads oriented in both directions, with, for example, half of the R1 library having the forward primer at the 5' end and half having the reverse primer at the 5' end.
So each library needs to be checked for both directions.
If a library contains reads in only one direction, this will have no effect.
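The two adapter specifications needed to cover both orientations can be built as in the sketch below. The function names `revcomp` and `linked_adapters` are hypothetical, and the `^ADAPTER1...ADAPTER2` form (anchored 5' part, non-anchored 3' part) is assumed to be the cutadapt linked adapter syntax of the installed version:

```shell
# Sketch only: build linked adapter strings for both read orientations.
revcomp() {
    # reverse-complement a DNA sequence, IUPAC ambiguity codes included
    printf '%s\n' "$1" |
        tr 'ACGTRYKMBVDHacgtrykmbvdh' 'TGCAYRMKVBHDtgcayrmkvbhd' |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

linked_adapters() {
    # usage: linked_adapters FORWARD_PRIMER REVERSE_PRIMER
    # line 1: read starts with the forward primer
    printf '^%s...%s\n' "$1" "$(revcomp "$2")"
    # line 2: read starts with the reverse primer
    printf '^%s...%s\n' "$2" "$(revcomp "$1")"
}
```

Each printed line could then be passed to cutadapt with its own `-a` option, so that reads in either direction are clipped in a single run.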

Add archiving script

In GitLab by @lentendu on Feb 15, 2018, 09:53

script to tar.gz archive and MD5sum check outputs, processing files and demultiplexed read files (if needed)
add archiving to queueing commands in pipeline master
control symlinking of raw reads in DeltaMP main
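The archiving and checksum part could be sketched as follows, assuming GNU tar and md5sum are available on the cluster. The function names `archive_with_md5` and `verify_archive` are hypothetical:

```shell
# Sketch only: archive a directory as tar.gz and store an MD5 checksum next to it.
archive_with_md5() {
    # usage: archive_with_md5 ARCHIVE.tar.gz DIRECTORY
    local archive="$1"
    local dir="$2"
    tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")" || return 1
    # run md5sum inside the archive directory so the .md5 file holds a relative name
    ( cd "$(dirname "$archive")" &&
      md5sum "$(basename "$archive")" > "$(basename "$archive").md5" )
}

verify_archive() {
    # usage: verify_archive ARCHIVE.tar.gz ; returns non-zero on checksum mismatch
    ( cd "$(dirname "$1")" && md5sum -c "$(basename "$1").md5" > /dev/null 2>&1 )
}
```

Running `verify_archive` after a transfer gives a simple pass/fail signal before the original outputs are deleted.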

check header of archived input for compliance with pandaseq (Illumina)

In GitLab by @ahb-ufz on Mar 15, 2018, 15:22

As pandaseq is very strict about the header format, and many datasets in ENA have headers that do not conform to it, we could include
pandaseq-checkid
to check the format after the download and abort the pipeline with a report of the problem. At the moment the failure is not always transparent: sometimes a BADID warning is issued, and sometimes it is not and the pipeline simply gets stuck in the quality step when no reads are merged.

Add utility to list DeltaMP queued jobs and status

In GitLab by @lentendu on Aug 8, 2018, 15:40

It has to output the full jobnames, jobids, status, etc.

Integrate job control utilities as option to deltamp

In GitLab by @lentendu on Feb 6, 2018, 16:47

integrate deltamp.restart_from_step and deltamp.delete_subproject into deltamp as options
the input for these options will be a configuration file
restart_from_step needs a named option

vsearch based taxonomic assignment

In GitLab by @lentendu on May 24, 2018, 16:15

Allow DBCHOP without DBALIGN

In GitLab by @lentendu on Sep 19, 2018, 20:51

No need for the full database aligned version if the aligned version for the region between the primers is available.

add size information to fasta file for sumaclust or remove -s size from sumaclust command

In GitLab by @ahb-ufz on Sep 18, 2018, 12:02

workflow runs into error in OTU step with sumaclust, because *unique.sort.fasta has no size argument in the header, which is required by the command. The size could be added to the fasta file or the "-s size" could be removed from l. 118 of the OTU script.
(second option would mean sorting by count, which is reasonable; the workflow runs to the end with default setting)
@lentendu , you decide.
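The first option could be sketched as below, assuming a mothur-style names file (representative sequence, a tab, then a comma-separated member list) and assuming that a `size=N;` attribute after the sequence identifier is the form the `-s size` option expects. The function name `add_size_to_fasta` is hypothetical:

```shell
# Sketch only: append "size=N;" to each FASTA header, N taken from a names file.
add_size_to_fasta() {
    # usage: add_size_to_fasta NAMES_FILE FASTA_FILE
    awk -F '\t' '
        # first file: count the members listed for each representative sequence
        NR == FNR { size[$1] = split($2, members, ","); next }
        # second file, header lines: keep the identifier and add its size
        /^>/ {
            split(substr($0, 2), h, /[ \t]/)
            id = h[1]
            n = (id in size) ? size[id] : 1
            print ">" id " size=" n ";"
            next
        }
        # sequence lines are passed through unchanged
        { print }' "$1" "$2"
}
```

Sequences absent from the names file are given a size of 1, which is also an assumption.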

For small reference databases (not Silva), cut_db with primers and then precluster for sequences with identical taxonomic path only

In GitLab by @lentendu on Jun 13, 2018, 01:02

Use Swarm for (pre)clustering, without fastidious and OTU breaking options