hariszaf / pema Goto Github PK

View Code? Open in Web Editor NEW

26.0 26.0 12.0 336.1 MB

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes

Shell 46.28% R 22.67% Perl 2.01% Roff 12.01% HTML 2.84% CSS 1.88% Python 11.21% Singularity 1.12%

pema's People

Contributors

Stargazers

Watchers

Forkers

biomobst gitter-badger ahderojas ekakurniasih cpavloud chhhhai marc-portier paveleg emagallong jeyu54217 savvas-paragkamian justinepa

pema's Issues

Enhancement: Add option for users to provide a custom adapter file

It is very common that NGS data contain adapter sequences other than the ones used during library construction, mainly because sequencing centers may sequence many different libraries at the same time. This cross-contamination, and any other project-specific contamination of reads with short sequences, can be handled easily if the user has the option to provide a custom adapter file in PEMA.
Thank you for considering!

Convert raw data from other than MiSeq sequencers to PEMA format

In the convertIllumunaRawDataToEnaFormat.sh replace reg expression with more alternative options, like from:

/^@M0.*/

to:

/^@[M|A]0.*/

to work for multiple cases such as novaseq etc

Consider incorporating tools for pseudogene detection

Could be very informative especially in the case of COI samples (in general, useful for protein coding markers).
A tool that could be added are metaMATE (https://github.com/tjcreedy/metamate).
Also, this publication (https://doi.org/10.1186/s12859-021-04180-x) discusses how to remove putative pseudogenes and the method (based on the NCBI ORFfinder program) is implemented in MetaWorks (https://github.com/terrimporter/MetaWorks).

PEMA to create a run summary file

It would be nice to also include in one file all the metadata that are required when PEMA outputs are published (e.g. to obis): this is just a summary of information available elsewhere in the outputs of PEMA, but gathered in one file for ease of findability. I can provide guidance as to the content of this when this issue is worked on.

Fatal error: ./pema_latest.bds, line 446

Hello there,

I am reaching out for help regarding an error I am facing twice and I don't know what is exactly the problem and how to solve it.

I am running my data using PEMA latest version through docker V 3.2.2 on mac OS V10.13.6

I have 100 samples and the problem is happing after 3 days and 5 hours of running approximately.
the error to me is timing out error ( perhaps I am mistaken! ) however, I copied the error below and I would be thankful if you could guide me on that.

Thanks in advance and looking forward to hearing from you

Best Regards,
Elham

SPAdes log can be found here: /mnt/analysis/pema_result_22April2021/3.correct_by_BayesHammer/ERR0000100/spades.log

Thank you for using SPAdes!
Error correction using BayesHammer is completed!
filtered_max_ERR0000011_1.fastq.gz.1P.fastq.00.0_0.cor.fastq.gz
filtered_max_ERR0000011_2.fastq.gz.2P.fastq.00.0_0.cor.fastq.gz
ERR NOFILE filtered_max_ERR0000011_1.fastq.gz.1P.fastq.00.0_0.cor.fastq.gz
Too confused to continue.
Try -h for help.
Task failed:
Program & line : './pema_latest.bds', line 444
Task Name : ''
Task ID : 'pema_latest.bds.20210422_075345_852/task.pema_latest.line_444.id_659'
Task PID : '55544'
Task hint : '/home/tools/PANDAseq/bin/pandaseq -f filtered_max_ERR0000011_1.fastq.gz.1P.fastq.00.0_0.cor.fastq.gz -r filtered_max_ERR0000011_2.fastq.gz.2P.fastq.00'
Task resources : 'cpus: 1 mem: -1.0 B timeout: 86400 wall-timeout: 86400'
State : 'ERROR'
Dependency state : 'ERROR'
Retries available : '1'
Input files : '[]'
Output files : '[]'
Script file : '/home/pema_latest.bds.20210422_075345_852/task.pema_latest.line_444.id_659.sh'
Exit status : '1'
StdErr (10 lines) :
ERR NOFILE filtered_max_ERR0000011_1.fastq.gz.1P.fastq.00.0_0.cor.fastq.gz
Too confused to continue.
Try -h for help.

Fatal error: ./pema_latest.bds, line 446, pos 4. Task/s failed.

Creating checkpoint file '/home/pema_latest.bds.line_446.chp'

Add alternative ref db for the case of COI

Consider of adding as an alternative the Midori dbs
the Eukaryote CO1 Reference Set For The RDP Classifier.

Parameters file in a structured format

This issue is about describing the parameter file that is required to run PEMA in a machine-interoperable way: describing the formats and including defaults for all entries in the file, for example
Work has begun on this, but it needs a review and completion

@marc-portier BigDataScript supports reading json files; see here.

The readParameterFile function of pema reads the .tsv parameters file line-by-line to return a bds "dictionary".

We could edit this function to read a .json file instead and have this .json file in an RO-Crate oriented way. 😎

Describe formally the pema output

This is about describing the PEMA outputs in a formal way, so that they are more machine-interoperable.

Similarly to the parameters.tsv file and its structured format (see issue #35 ) we need to do so with the pema output.

In the PEMA's output files file we had started working on it.

We shall go for it step-by-step.
@cpavloud feel free to contribute 😛

pandaseq error "Something is wrong with this ID"

Hi, I'm getting an error from pandaseq when using BGI reads. Before that, I ran a small dataset from the same BGI run (different barcode) successfully.
I'm using singularity v. 3.8.6 and pema_v.2.1.4.sif.
How can I run pandaseq-checkid "V350194505L1C001R00100000782/1 BH:ok" using a container?
Here's the output:

ERR	BADID	V350194505L1C001R00100000782:::1:0:0:0:	V350194505L1C001R00100000782/1 BH:ok
* * * * * Something is wrong with this ID. If tags are absent, try passing the -B option.
* * * * * Consult `pandaseq-checkid "V350194505L1C001R00100000782/1 BH:ok"` to get an idea of the problem..
Task failed:
	Program & line     : '/home/modules/preprocess.bds', line 334
	Task Name          : ''
	Task ID            : 'pema_latest.bds.20240326_090348_413/task.preprocess.line_334.id_15'
	Task PID           : '1399067'
	Task hint          : '/home/tools/PANDAseq/bin/pandaseq -f filtered_max_ERR0000001_1.fastq.gz.1P.fastq.00.0_0.cor.fastq.gz -r filtered_max_ERR0000001_2.fastq.gz.2P.fastq.00'
	Task resources     : 'cpus: 1	mem: -1.0 B	timeout: 86400	wall-timeout: 86400'
	State              : 'ERROR'
	Dependency state   : 'ERROR'
	Retries available  : '1'
	Input files        : '[]'
	Output files       : '[]'
	Script file        : '/home/bioinf/pema_latest.bds.20240326_090348_413/task.preprocess.line_334.id_15.sh'
	Exit status        : '1'
	StdErr (10 lines)  :
		0x1677340:1	STAT	READS	0
		0x1677340:1	STAT	NOALGN	0
		0x1677340:1	STAT	LOWQ	0
		0x1677340:1	STAT	BADR	0
		0x1677340:1	STAT	SLOW	0
		0x1677340:1	STAT	OK	0
		0x1677340:1	STAT	OVERLAPS	0
		ERR	BADID	V350194505L1C001R00100000782:::1:0:0:0:	V350194505L1C001R00100000782/1 BH:ok
		* * * * * Something is wrong with this ID. If tags are absent, try passing the -B option.
		* * * * * Consult `pandaseq-checkid "V350194505L1C001R00100000782/1 BH:ok"` to get an idea of the problem..

Option of running PEMA in distinct steps

It would be very useful and time-saving when we are trying to determine the appropriate parameters for each dataset, if we could run pema in distinct steps so as to be able to run the analysis partially.
It would be nice to consider adding this option!
Thank you

Conversion to ENA format - Sample names

Using the current PEMA version (pema v.2.1.3), when choosing
EnaData No
the sequence files are converted to the necessary format for PEMA to run.

During this conversion, they get new names such as ERR0000001, ERR0000002, ERR0000003 etc.

However, in the finalTable.tsv (and the other output files at the XXX_taxon_assign folder), the sample names are ERR1, ERR2, ERR3 etc.

NCBI Taxon ID included in the final_table.tsv file?

One think that has been requested is to enhance the final_table.tsv file to include (apart from the columns it already includes), the NCBI Taxon ID for each ASV/OTU and the accession number of the sequence that was its closest match in the database used. The NCBI Taxon ID could then be used as the taxonConceptID when submitting data to GBIF/OBIS using the DwC-A format (as discussed here)

For example, instead of the current final_table.tsv file, which looks like this
OTU_id,ERR0000008,ERR0000009,Classification
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis

(Ideally) It could be something like this
OTU_id,ERR0000008,ERR0000009,Classification,Accession_number,NCBI_Taxon_ID
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora,JN200445,608846
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis,NC_023834,1473587

If it is not possible to retrieve the accession number and/or the NCBI taxon ID, I think we can find some workarounds.
Perhaps it will be possible to retrieve the NCBI Taxon ID using the Bio.Entrez package

ITS sanity check fails

Running PEMA for the ITS sample under /sanity_checks with the given parameters file fails with the following message:

mv: target 'data_after_cutadapt/' is not a directory
Fatal error: ./pema_latest.bds, line 342, pos 5. Exec failed.
Exit value : 1
Command : mv *fastq.gz data_after_cutadapt/

Error PEMA ASV inference: Directory swarm does not exists

Error with the PEMA ASV inference. Possibly due to spelling error.

The line that possibly was skipped because of spelling error in the parameters file.

} else if (paramsDereplication{'clusteringAlgo'} == 'algo_Swarm') {

In the parameters file I write clusteringAlgo algo_swarm, while it is suggested to write (write "Swarm" or "vsearch" or "CROP" after algo_).

In the initialize.bds script there is a line that creates the folder Swarm.

The error

Fatal error: /home/modules/taxAssignment.bds, line 11. Directory '/mnt/analysis/isd_crete_2016_20230823/7.mainOutput/gene_16S/swarm' does not exists
pema_latest.bds, line 156 :     if ( paramsForTaxAssign{'custom_ref_db'} != 'Yes'){
pema_latest.bds, line 158 :        if ( paramsForTaxAssign{'gene'} == 'gene_16S') {
pema_latest.bds, line 170 :           if (paramsForTaxAssign{'taxonomyAssignmentMethod'} != 'phylogeny') {
pema_latest.bds, line 172 :              crestAssign(paramsForTaxAssign, globalVars)
taxAssignment.bds, line 4 :     string crestAssign(string{} params, string{} globalVars) {
taxAssignment.bds, line 6 :        if ( params{'custom_ref_db'} != 'Yes') {
taxAssignment.bds, line 9 :           if ( (params{'gene'} == 'gene_16S' || params{'gene'} == 'gene_18S') && params{'taxonomyAssignmentMethod'} != 'phylogeny' ) {
taxAssignment.bds, line 11 :             globalVars{'assignmentPath'}.chdir()

The parameters file:
parameters0f.isd_crete_2016_20230823.txt

why this Fatal error: /home/pema_latest.bds, line 175, pos 1. Task/s failed

I run the tutorial command in step 3 singularity run -B /root/Desktop/pema/analysis_directory/:/mnt/analysis /root/Desktop/pema_v.1.3.1.sif
Approx 95% complete for SRR3231901_1.fastq.gz
Analysis complete for SRR3231901_1.fastq.gz
Task failed:
Program & line : '/home/pema_latest.bds', line 173
Task Name : ''
Task ID : 'pema_latest.bds.20201123_083118_572/task.pema_latest.line_173.id_6'
Task PID : '561'
Task hint : '/home/tools/fastqc/FastQC/fastqc --outdir /mnt/analysis/16S_final_test/1.quality_control /mnt/analysis/mydata/README.md'
Task resources : 'cpus: 1 mem: -1.0 B timeout: 86400 wall-timeout: 86400'
State : 'ERROR'
Dependency state : 'ERROR'
Retries available : '1'
Input files : '[]'
Output files : '[]'
Script file : '/root/Desktop/pema_latest.bds.20201123_083118_572/task.pema_latest.line_173.id_6.sh'
Exit status : '1'
StdErr (10 lines) :
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on -Dswing.aatext=true
Failed to process /mnt/analysis/mydata/README.md
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '@'
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:158)
at uk.ac.babraham.FastQC.Sequence.FastQFile.(FastQFile.java:89)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:152)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.(OfflineRunner.java:121)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)

Fatal error: /home/pema_latest.bds, line 175, pos 1. Task/s failed.
pema_latest.bds, line 175 : wait

Wrong sample names in OutputPerSample

Hi!

When checking "outputPerSample", for some of my samples the accession numbers are correct in the file names, but in the file itself it's another accession number. See an example below with files for ERR4914067.
This ERR number does not appear inside these files, but ERR4914068 and ERR4914071 do. "profile_ERR4914067.csv" even has 3 ERR numbers including two ERR4914068.

profile_ERR4914067.csv
Relative_Abundance_ERR4914067.csv
Richness_ERR4914067.csv
All_Cumulative_ERR4914067.csv

Additionally, ERR4614067 does not appear in the final table. Maybe it could be related to what I described above.

Thanks for your input!

MIDORI updates

According to recent emails with the MIDORI developers, it seems wise to update PEMA to where the midori db is now published. Hopefully this will solve a couple of issues that we have had (1) the gaps in the taxonomic classification output when there are missing taxon nodes (2) some were errors and discrepancies in the classifications wrt NCBI

Copy of the emails (latest to first):

Sorry to say that we are no more updating the databases in "MIDORI server”.
We are updating only databased you can download from here : http://www.reference-midori.info/download.php#

Hi Christina,
Thank you for your email.
I think PEMA is using old MIDORI database.
I have fixed this problem quite long time ago.
In all formats, except RAW files, we have inserted missing taxonomy by creating it from a lower taxonomic ranking (ex. description in class-level was missing, so it was created from order-level in the following example, >JF502242.1.7041.7724 root_1;Eukaryota_2759;Chordata_7711;class_Crocodylia_1294634;Crocodylia_1294634;Crocodylidae_8493;Crocodylus_8500;Crocodylus intermedius_184240).
Will it be possible that you download recent databases from our cite and locally perform the taxonomic assignment?
We are using NCBI taxonomy for all MIDORI databases.
I think those inconsistency is happening because PEMA is using old database (NCBI taxonomy has been consistently revised).
If you have further questions, please write me back again.
Best regards, Ryuji

Dear Dr Machida,
My name is Christina Pavloudi and I am a Post Doctoral Researcher at the CNRS.
In my previouds Post Doc position, I was working for the ARMS-MBON project (my colleagues are in CC), where we were sequencing ARMS samples for COI (among other genes) and we were using PEMA for the analyses of the results.
PEMA is using MIDORI for the taxonomic assignment of COI reads, hence I am contacting you regarding an issue we came across.
At the moment, the MIDORI output does not always have the same number of columns, i.e. the same number of taxonomic levels, for all the assignments.
You can see an example in the the attached file ("Example_species_notall.tsv")
For some assignments, the output has all the 8 levels: root, superkingdom, phylum, class, order, family, genus, species (see attached file "Example_species_alllevels.tsv").
It would be extremely helpful, in terms of FAIRness for the ARMS-MBON project, if the MIDORI output was consistent and always contained the 8 levels, even if some columns were empty (see attached "Example_species_emptylevels.tsv"). Do you perhaps consider doing something like this for future versions of MIDORI?
Also, could I ask which taxonomy you are using in MIDORI?
Because, as you can see in "Example_species_emptylevels_completed.tsv", for some of the species in question the missing taxonomic levels do exist (if we check at the WoRMS, but also at the NCBI Taxonomy). Also, some of them are different from the output that is produced by MIDORI.

time consuming preprocessing

It has been observed that the current setup (we'll call it original approach), can be time consuming.
In cases of hundreds of samples this can be a limiting factor for pema.

An alternative would be to try the fastp tool (fastp preprocessing approach).

Unexpected Singletons in Final Table despite removeSingletons set to "Yes"

Issue
When setting the "removeSingletons" parameter to "Yes" in PEMA, I'm noticing that many singletons still appear in the final output table (often hundreds of them towards the end of the table) while they should not be there.

Details
I am using PEMA version 2.1.4, running it on ARMS data using the LifeWatch workflow on the Tesseract platform.

It would be appreciated if this behavior could be looked into, as it's crucial for the accuracy of our analyses to ensure that unwanted singletons are excluded when specified.

Thank you!

Storage usage of PEMA

This is more of a question of how PEMA uses storage for each run. For my project I have 140 samples with PE sequences resulting to 14 gb of data.

14G ./my data
196G /pema215_otu

Is possible to reduce the storage needed for a run of PEMA or all output is required?

For example I have 2 all_samples.fasta (one in mainOutput and one in PEMA folder) files and 1 final_all_samples.fasta, are all necessary?

Also some intermediate folders like linearizedSequences, mergedSequences take up similar space as the mydata folder.

The reason for this issue is that in large scale projects this can lead to exceeding disk quota.

Needed updates on "PEMA's output files.md"

To facilitate use and understanding of PEMA by external users (e.g. future Tesseract users), the file PEMA's output files.md needs a few updates and more details:

"Pre-processing steps": the names of the folders are not the same anymore, update to the new ones
"7.gene_dependent":
information regarding 18S and ITS are missing (even if there are similarities with 16S and COI, there are also differences that should be explained)
for COI and 16S, many file names changed and some are missing
each file that is found in the PEMA output needs to be described here (or in another place that you consider is best, where users can easily find it), along with an explanation of how it is obtained
add description of "8.outputPerSample" folder and checkpoints folder

A new user should be able to understand what each file/folder corresponds to without having to dive in the depths of the PEMA code.

Empty 8.outputPerSample folder

When I am running 16S data with the vsearch algorithm using the current version of PEMA (pema v.2.1.3), the "8.outputPerSample" folder is empty.
I haven't tried it yet with other genes/sets of parameters to see if the error is being repeated.

Upgrade to crest4

New version of the CREST algorithm is also available now. In this version, the latest Silva version and the PR2 database are included.

Thus, by integrating this CREST algorithm, then the related issues (#21 and #26 ) will be addressed.

Here is the repo of the previous CREST version, the one that is currently used in PEMA.

@lanzen could you please let us know when ready? 🎉 thanks a lot in advance!

PEAR as merging algorithm in PEMA

Choosing PEAR as a merging algorithm option in PANDAseq seems to cause disruption and failing in PEMA v. 2.1.4, while the steps before PANDAseq run fine.
I mention the last lines of the output file, after SPADes is finished:

Adjusting sequences using the BayesHammer algorithm of SPAdes has been completed.
Fatal error: /home/modules/preprocess.bds, line 334, pos 1. Map 'params' does not have key 'elimination'.
pema_latest.bds, line 84 : merging(paramsSpadesMerging, globalVars)
preprocess.bds, line 263 : string merging(string{} params, string{} globalVars){
preprocess.bds, line 287 : for ( string correctFile : correct ) {
preprocess.bds, line 334 : task $globalVars{'path'}/tools/PANDAseq/bin/pandaseq -f $forwardFile -r $reverseFile -6 \

ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 9 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 -> 4779 -> 4903 -> 4904 -> 4906
Node Id : 4927
bdsNode Id : 4906
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 8 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 -> 4779 -> 4903 -> 4904
Node Id : 4906
bdsNode Id : 4904
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 7 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 -> 4779 -> 4903
Node Id : 4904
bdsNode Id : 4903
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 6 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 -> 4779
Node Id : 4903
bdsNode Id : 4779
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 5 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775
Node Id : 4779
bdsNode Id : 4775
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 4 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554
Node Id : 4775
bdsNode Id : 4554
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 3 / 0, nodes: 1422 -> 9771 -> 9772
Node Id : 4554
bdsNode Id : 9772
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 2 / 0, nodes: 1422 -> 9771
Node Id : 9772
bdsNode Id : 9771
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 1 / 0, nodes: 1422
Node Id : 9771
bdsNode Id : 1422

It is noted that elimination parameter is on as mentioned above, copied from the parameters file used:

################################################################
################# PANDAseq (v. 2.11) ################### // https://storage.googleapis.com/pandaseq/pandaseq.html
################################################################

PANDAseq is the algorithm that PEMA uses in order to merge the paired-end reads.

PANDAseq has more than one merging algorithms.

Here, we set the algorithm used for assembly. The most common of them are:

pear --> uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provide

simple_bayesian --> uses the formula described in the original paper (Masella 2012), optionally with an error estimation (ε) provided.

other options are stich, flash and more that you can fing in the above link.

pandaseqAlgorithm pear

PANDAseq is a I/O bound algorithm. That means that it needs severous time in order to handle the ipnut and output files

while the process is quite fast. However, it does support multithreading and here you can set the number of threads it is going to use.

pandaseqThreads 20

The 'minlen' parameter sets the minimum length for a sequence, after primers are removed.

By default, all sequences are kept. With this option, sequences shorter than desired can be discarded.

In case you need to use this parameter, be sure you leave a tab after 'minlen' and set it like this: '-l 80'

If you do not want to use this parameter, please remove everything after the 'minlen'

pandaseqMinlen

The 'minoverlap' parameter sets the minimum overlap between forward and reverse reads.

By default, this is at least one nucleotide of overlap.

Raising this number does not generally increase the quality of the output as alignments with small overlaps tend to score poorly and are discarded anyway.

minoverlap 12

The 'threshold' parameter sets the score, between zero and one, that a sequence must meet to be kept in the output.

Any alignments lower than this will be discarded as low quality.

Increasing this number will not necessarily prevent uncalled bases (Ns) from appearing in the final sequence.

It is also used as the threshold to match primers, if primers are supplied. The default value is 0.6.

threshold 0.6

The '-N' parameter eliminates all sequences with uncalled nucleotides in the output.

Otherwise, during assembly, uncalled bases (Ns) from unpaired regions may be emitted.

If you need -N to be on your analysis, please add '-N' after 'elimination'. Please make sure you leave a tab.

If you do not want the parameter to be on, please make sure there is nothing after the 'elimination' parameter.

elimination -N

PEMA performs the PANDAseq algorithm, with the -a and the -B parameters also on.

That it for striping the primers after assembly, rather than before and allowing input sequences to lack a barcode/tag correspondingly.

Nevertheless, PEAR can run standalone in Zorbas with the following commands:

module load pear/0.9.11
pear -f -r -o

Could this be considered please, in order to be fixed?

custom_ref_db doesn't work.

When I analyze 18S using my custom_ref_db, I get this error.

This is the parameter and custom db I made.
parameters.txt (Original file name is "parameters.tsv")
hikim_test_SSEQ.nds.txt (Original file name is "hikim_test_SSEQ.nds")
hikim_test.fasta.txt (Original file name is "hikim_test.fasta")

So, I tested your "crest_algo_example" (http://pema/analysis_directory/custom_ref_db/crest_algo_example/).
Even if I use the crest example file you uploaded, I get the same error.

I have my doubts about whether your example files work well.

So what to ask for

Is it possible to get .fasta and .nds file of silva 132 embeded in PEMA.
Is it possible to get an crest example of using custom_ref_db correctly (parameters and custom db).

Thank you!

Add new database for 16S/18S (SILVA 138)

Consider adding the new SILVA release
https://www.arb-silva.de/documentation/release-1381/

PEMA Ro-Crate

We would like that the outputs from PEMA are described in a ro-crate.json file. VLIZ can help in how to fill the file with content. It will be the same idea as the ro-crate produced by MetaGOflow (see comment below), but with some extra provenance-related fields (see e.g. https://github.com/emo-bon/observatory-bpns-data/blob/main/ro-crate-metadata.json)

@kmexter can advice on the content of this file.

Add 12S marker gene

Using reference databases from this repo, PEMA could integrate the analysis of the 12S rRNA marker gene.

Fail with 18S data and vsearch algorithm

Christina has found a bug when using 18S data.
@cpavloud Could you please describe it?

MAPseq alternative classifier for the case of rRNA genes

Check repo and publication

cannot remove 'nospace.*': No such file or directory

Hello, I ran through the demo data given and there wasn't a problem. I have a few fastq files that I converted to fastq.gz using the provided convertIllumunaRawDataToEnaFormat.sh. This created a directory and file in the mydata dir (namely mapping_files_for_PEMA.txt rawDataInEnaFormat/) I ran with those files in the directory and got an error (as others have gotten by keeping the README.md as I thought I addressed initially here with Akhilbiju01's question) I promptly moved them out so that I only have the fastq.gz files and am getting this error involving no "no space*". I looked through the source code in PEMA_v1.2.bds line 507 to 515 the file is created (a temp file I imagine) then deleted. It appears this non-existent folder is supposed to # merge all lines of a fastq entry into one and only one line given by this line:
sys awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' se.$derepl > nospace.$derepl

So at this point I'm uncertain if the data is bad or if there's bug in the code. Here's my full output from the run:

$ singularity run  -B /p/home/tclack/bio/pema-1.2/test/analysis_folder/:/mnt/analysis ./pema_v.1.1.sif
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
this ouput file already exists
1E_S29_L001_R1_001.fastq.gz1E_S29_L001_R2_001.fastq.gz2G_S42_L001_R2_001.fastq.gz2H_S48_L001_R1_001.fastq.gz3C_S17_L001_R1_001.fastq.gz3C_S17_L001_R2_001.fastq.gzperl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "en_US.UTF-8",
	LC_ALL = (unset),
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "en_US.UTF-8",
	LC_ALL = (unset),
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "en_US.UTF-8",
	LC_ALL = (unset),
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "en_US.UTF-8",
	LC_ALL = (unset),
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "en_US.UTF-8",
	LC_ALL = (unset),
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
Started analysis of 3C_S17_L001_R1_001.fastq.gz
Started analysis of 2H_S48_L001_R1_001.fastq.gz
Started analysis of 1E_S29_L001_R1_001.fastq.gz
Started analysis of 1E_S29_L001_R2_001.fastq.gz
Started analysis of 2G_S42_L001_R2_001.fastq.gz
Approx 5% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 5% complete for 2H_S48_L001_R1_001.fastq.gz
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "en_US.UTF-8",
	LC_ALL = (unset),
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
Started analysis of 3C_S17_L001_R2_001.fastq.gz
Approx 5% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 10% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 10% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 15% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 5% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 10% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 15% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 15% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 20% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 10% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 20% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 25% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 15% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 5% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 5% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 20% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 25% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 25% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 30% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 20% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 30% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 30% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 35% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 40% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 25% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 35% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 35% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 45% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 50% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 30% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 10% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 40% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 45% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 40% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 55% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 35% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 10% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 50% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 45% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 60% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 55% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 60% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 50% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 65% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 70% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 40% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 15% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 65% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 55% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 75% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 45% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 70% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 60% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 80% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 50% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 15% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 75% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 65% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 85% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 55% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 20% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 80% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 70% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 90% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 95% complete for 2H_S48_L001_R1_001.fastq.gz
Approx 60% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 85% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 75% complete for 3C_S17_L001_R1_001.fastq.gz
Analysis complete for 2H_S48_L001_R1_001.fastq.gz
Approx 65% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 20% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 90% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 95% complete for 2G_S42_L001_R2_001.fastq.gz
Approx 80% complete for 3C_S17_L001_R1_001.fastq.gz
Analysis complete for 2G_S42_L001_R2_001.fastq.gz
Approx 70% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 25% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 85% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 75% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 90% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 80% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 95% complete for 3C_S17_L001_R1_001.fastq.gz
Approx 30% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 25% complete for 1E_S29_L001_R2_001.fastq.gz
Analysis complete for 3C_S17_L001_R1_001.fastq.gz
Approx 85% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 90% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 95% complete for 3C_S17_L001_R2_001.fastq.gz
Approx 35% complete for 1E_S29_L001_R1_001.fastq.gz
Analysis complete for 3C_S17_L001_R2_001.fastq.gz
Approx 30% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 40% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 35% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 45% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 50% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 40% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 55% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 45% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 60% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 50% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 55% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 65% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 60% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 70% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 65% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 75% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 70% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 80% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 85% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 75% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 80% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 90% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 95% complete for 1E_S29_L001_R1_001.fastq.gz
Approx 85% complete for 1E_S29_L001_R2_001.fastq.gz
Analysis complete for 1E_S29_L001_R1_001.fastq.gz
Approx 90% complete for 1E_S29_L001_R2_001.fastq.gz
Approx 95% complete for 1E_S29_L001_R2_001.fastq.gz
Analysis complete for 1E_S29_L001_R2_001.fastq.gz
FastQC is completed!
readF is: 1E_S29_L001_R1_001.fastq.gz
readF is: 1E_S29_L001_R2_001.fastq.gz
readF is: 2G_S42_L001_R2_001.fastq.gz
readF is: 2H_S48_L001_R1_001.fastq.gz
readF is: 3C_S17_L001_R1_001.fastq.gz
readF is: 3C_S17_L001_R2_001.fastq.gz
Trimmomatic  is done

Error correction using BayesHammer is completed!
Merging step by SPAdes is completed
all the first steps are done! clustering is about to start!
rm: cannot remove 'se.*': No such file or directory
rm: cannot remove 'nospace.*': No such file or directory
Fatal error: /home/PEMA_v1.bds, line 517, pos 1. Exec failed.
	Exit value : 1
	Command    :  rm se.* nospace.*
PEMA_v1.bds, line 517 :	sys rm se.* nospace.*

HPC job specifications

Hi Haris! I'd like to use PEMA on some metabarcoding data for my graduate work. I successfully installed the Singularity image on my university's HPC environment, but I was hoping for some advice about how to estimate the HPC resources I would need to run PEMA in my job script (we use Slurm). Specifically, do you have any guidance for the #SBATCH specifications and values I should use?

For context, I have 96 samples that were PE sequenced for COI, 18S, and 16S amplicons (euk eDNA metabarcoding) on an Illumina MiSeq, so I have 576 fastq files as my raw sequencing data. I would like to use a custom ref db for COI, so I will follow your instructions about training the RDP classifier (I know this will likely affect computational load). Thank you!

PEMAbase needs to be indenpendent from any local files

A new version of the pemabase images needs to be built
making sure that:

everything that is needed is publicly available
updates and upgrades that have been suggested to be included, e.g. ref databases updates (see #21) or pseudogene step (#27 )

Getting an error about '16S_otutab.txt': No such file or directory with the example

Reading file all.nonchimeras.fasta 100%
1242678 nt in 4409 seqs, min 200, max 390, avg 282
Masking 100%
Sorting by abundance 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 155 Size min 2, max 692, avg 28.4
Singletons: 0, 0.0% of seqs, 0.0% of clusters
Writing OTU table (classic) 100%
cp: cannot stat '16S_otutab.txt': No such file or directory
Fatal error: ./pema_latest.bds, line 694, pos 3. Exec failed.
Exit value : 1
Command : cp 16S_otutab.txt rna_otutab_its_taxon_assign.txt

Data provenance considerations

We should add in the parameters file the version of SWARM algorithm that is implemented in PEMA.
Also, the version of CROP and of the RAxML-ng (and PaPaRa and EPA-ng).
And the version of cutadapt that is being used for the primer removal in the case of ITS.
And for the MIDORI database, we need to specify the GenBank release that it was based on.
I think that for all the other tools, the versioning information is already there.

Also, we should mention somewhere in the parameters file that RPDClassifier is being used for the COI gene and we should also mention the version of the RPDClassifier.
Similarly, we should also mention the CREST is being used for the 16S, 18S and ITS markers.

Also, we should add the thresholds/default values used by the classifiers for the taxonomic identification of the sequences.
Then, we could add this information in the otu_seq_comp_appr term when submitting data to GBIF/OBIS using the DwC-A format.

Then, after every analysis, the user will have full provenance (regarding tools and parameters implemented) stored in the copy of the parameters file inside the output folder.

alternative database for COI marker gene

A new db mostly for metazoan was recently published:
repo
publication

consider training crest or rdp with that and integrate it the pema workfow.

@cpavloud check and share your thougts! 😅

extended final table

In the next release if PEMA, could the extended final table that is produced when asked for in the parameter files

be called extendedFinalTable.tsv to be the same as the finalTable.tsv (and to correct the current spelling mistake in the word "extened")
be put in the same place as the finalTable.tsv in the outputs - so top level directory, not lower down

This is to make it so that the ARMS workflow in the tesseract can look in just one place to get this file, rather than a slightly different place depending on the parameters set by tye user

Fail with 18S data and swarm algorithm

I wanted to run 18S data using the current version (pema v.2.1.3) and with the swarm algorithm.

I used the attached parameter setting
parameters_1st_try.txt

However, the analysis went until step 4.mergingPairedEndFiles and then an error came up
`
Merging step by SPAdes is completed
Marker gene under study 18S.
Fatal error: /home/modules/initialize.bds, line 193, pos 18. Map 'params' does not have key 'clusteringAlgoFor16S_18SrRNA'.
pema_latest.bds, line 95 : buildDirectories(paramsSpadesMerging, globalVars )
initialize.bds, line 158 : string buildDirectories(string{} params, string{ } globalVars){
initialize.bds, line 162 : if ( params{'gene'} == 'gene_COI' ) {
initialize.bds, line 175 : } else if ( params{'gene'} == 'gene_16S' ) {
initialize.bds, line 188 : } else if ( params{'gene'} == 'gene_18S' ) {
initialize.bds, line 193 : if ( params{'clusteringAlgoFor16S_18SrRNA' } == 'algo_Swarm' ) {

ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 10 / 0, nodes: 1422 -> 8178 -> 8179 -> 2295 -> 230 3 -> 2356 -> 2409 -> 2415 -> 2432 -> 2433
Node Id : 2434
bdsNode Id : 2433
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 9 / 0, nodes: 1422 -> 8178 -> 8179 -> 2295 -> 2303 -> 2356 -> 2409 -> 2415 -> 2432
Node Id : 2433
bdsNode Id : 2432
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 8 / 0, nodes: 1422 -> 8178 -> 8179 -> 2295 -> 2303 -> 2356 -> 2409 -> 2415
Node Id : 2432
bdsNode Id : 2415
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 7 / 0, nodes: 1422 -> 8178 -> 8179 -> 2295 -> 2303 -> 2356 -> 2409
Node Id : 2415
bdsNode Id : 2409
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 6 / 0, nodes: 1422 -> 8178 -> 8179 -> 2295 -> 2303 -> 2356
Node Id : 2409
bdsNode Id : 2356
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 5 / 0, nodes: 1422 -> 8178 -> 8179 -> 2295 -> 2303
Node Id : 2356
bdsNode Id : 2303
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 4 / 0, nodes: 1422 -> 8178 -> 8179 -> 2295
Node Id : 2303
bdsNode Id : 2295
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 3 / 0, nodes: 1422 -> 8178 -> 8179
Node Id : 2295
bdsNode Id : 8179
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 2 / 0, nodes: 1422 -> 8178
Node Id : 8179
bdsNode Id : 8178
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 1 / 0, nodes: 1422
Node Id : 8178
bdsNode Id : 1422
`

Then, I thought that maybe it needed a little tricking (like the one I did here), so I changed the gene (in the parameters)
parameters_2nd_try.txt

This time, the analysis went until step 7.mainOutput and the files
asvs_representatives_all_samples.fasta all.denovo.nonchimeras.fasta asvs_repr_with_singletons.fasta all_samples.fasta asvs.stats all_sequences_grouped.fa asvs.swarms amplicon_contingency_table.tsv mysilvamod132_18S_taxon_assign.xml asvs_contingency_table.tsv
were created
but nothing was added in the 18S_taxon_assign folder and this error came up

`
Traceback (most recent call last):
File "/home/tools/CREST/LCAClassifier/bin/classify", line 16, in
sys.exit(LCAClassifier.classify.main())
File "/home/tools/CREST/LCAClassifier/src/LCAClassifier/classify.py", line 662, in main
otuFile=open(options.otus,"r")
IOError: [Errno 2] No such file or directory: 'allTab_18S_taxon_assign.tsv'
Task failed:
Program & line : '/home/modules/taxAssignment.bds', line 59
Task Name : ''
Task ID : 'pema_latest.bds.20211019_114721_581/task.taxAssignment.line_59.id_1843'
Task PID : '3380'
Task hint : '/home/tools/CREST/LCAClassifier/bin/classify; -c /home/tools/CREST/LCAClassifier/parts/etc/lcaclassifier.conf; -d silva132; -t allTab_18S_taxon_assign'
Task resources : 'cpus: 1 mem: -1.0 B timeout: 86400 wall-timeout: 86400'
State : 'ERROR'
Dependency state : 'ERROR'
Retries available : '1'
Input files : '[]'
Output files : '[]'
Script file : '/home1/bilbao/pema_latest.bds.20211019_114721_581/task.taxAssignment.line_59.id_1843.sh'
Exit status : '1'
StdErr (10 lines) :
Traceback (most recent call last):
File "/home/tools/CREST/LCAClassifier/bin/classify", line 16, in
sys.exit(LCAClassifier.classify.main())
File "/home/tools/CREST/LCAClassifier/src/LCAClassifier/classify.py", line 662, in main
otuFile=open(options.otus,"r")
IOError: [Errno 2] No such file or directory: 'allTab_18S_taxon_assign.tsv'

Fatal error: /home/modules/taxAssignment.bds, line 65, pos 13. Task/s failed.
pema_latest.bds, line 151 : if ( paramsForTaxAssign{'custom_ref_db'} != 'Yes'){
pema_latest.bds, line 153 : if ( paramsForTaxAssign{'gene'} == 'gene_16S' || paramsForTaxAssign{'gene'} == 'gene_18S' || paramsForTaxAssign{'gene'} == 'gene_ITS') {
pema_latest.bds, line 165 : if (paramsForTaxAssign{'taxonomyAssignmentMethod'} != 'phylogeny') {
pema_latest.bds, line 167 : crestAssign(paramsForTaxAssign, globalVars)
taxAssignment.bds, line 4 : string crestAssign(string{} params, string{} globalVars) {
taxAssignment.bds, line 6 : if ( params{'custom_ref_db'} != 'Yes') {
taxAssignment.bds, line 9 : if ( (params{'gene'} == 'gene_16S' || params{'gene'} == 'gene_18S') && params{'taxonomyAssignmentMethod'} != 'phylogeny' ) {
taxAssignment.bds, line 24 : if ( params{'silvaVersion'} == 'silva_128' ) {
taxAssignment.bds, line 46 : } else if ( params{'silvaVersion'} == 'silva_132' ) {
taxAssignment.bds, line 65 : wait
`

16s copy number

Consider a normalization of 16S results using
https://rrndb.umms.med.umich.edu

Checkpoint not working

Hello,

I've been testing out PEMA with my data for the past few days and have run into an issue. I got it going, and all was fine until my job ran out of memory. It was partway through the clustering step when this happened. I wasn't too worried because I know PEMA has a checkpoint system I can use to restart at this point. I am having issues doing this, however.

I am running the code on a slurm controlled HPC cluster:

singularity exec -B /nesi/nobackup/uoaxxxxx/PEMA/analysis_folder/:/mnt/analysis /nesi/nobackup/uoaxxxxx/PEMA/pema_latest.sif /home/tools/BDS/.bds/bds -r /nesi/nobackup/uoaxxxxx/PEMA/analysis_folder/trimming.chp

and I get the error:

Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport Exception in thread "main" java.lang.RuntimeException: File not found '/nesi/nobackup/uoa02559/PEMA/analysis_folder/trimming.chp' at org.bds.util.Gpr.reader(Gpr.java:553) at org.bds.util.Gpr.reader(Gpr.java:534) at org.bds.serialize.BdsSerializer.load(BdsSerializer.java:271) at org.bds.Bds.runCheckpoint(Bds.java:886) at org.bds.Bds.run(Bds.java:853) at org.bds.Bds.main(Bds.java:185)

I have tried with other checkpoint files also but with no luck. Am I being dense or should this restart the process at that stage?

Congrats on the pipeline and publication!

Thanks for your help,

Jed

Species level missed in `finalTable.tsv`

As implemented, by keeping column 2 (line 18) of this file, the species assigment is missed.

Enhancement: Add bbmap suite for read preprocessing

BBmap is available here:
https://sourceforge.net/projects/bbmap/
it has been published in PloS One:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657622/
and adopted by a wide community, including the JGI (here is a guide in their website for bbmerge: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmerge-guide/)

Depending on how the suite of tools may be integrated in Pema, merging can be achieved through bbmerge, but additional steps (trimming, adapter removal) may also be handled with the same package in a very fast and efficient way.

a note on trimming*
The Bbmmerge developer doesnot recommend trimming prior to merging, though, similarly to a growing number of researchers apparently (eg https://www.sciencedirect.com/science/article/abs/pii/S1874778721000805).

Thank you for considering adding the tool, it appears to handle better than pandaseq the merging step of fully overlapping reads (insert size equal to read length cases)!

Add alternative db for the case of 18S rRNA

Consider of adding PR2 as an alternative to the Silva database.

Here you may see how you could use PR2.

provide pema main data product in a 7-level taxonomy format

It would be super useful to return the pema main output (otu/asv table) in a 7-level taxonomy format, meaning all taxonomy assignments are as:

d__Bacteria; p__Abyssubacteria; c__SURF-5; o__SURF-5; f__SURF-5; g__SURF-5; s__SURF-5 sp003598085

Long recurrent taxonomic labels

Hi @hariszaf ,
The dev version (2.1.5) produces very long recurrent taxonomy labels like this one:

Main genome;Eukaryota;Excavata;Discoba;Kinetoplastea;Kinetoplastea (class);X (Kinetoplastea (class));Kinetoplastea (X (Kinetoplastea (class)));XX (Kinetoplastea (X (Kinetoplastea (class))));Kinetoplastea (XX (Kinetoplastea (X (Kinetoplastea (class)))));XXX (Kinetoplastea (XX (Kinetoplastea (X (Kinetoplastea (class))))));Kinetoplastea (XXX (Kinetoplastea (XX (Kinetoplastea (X (Kinetoplastea (class)))))));XXX (Kinetoplastea (XXX (Kinetoplastea (XX (Kinetoplastea (X (Kinetoplastea (class))))))));sp. (XXX (Kinetoplastea (XXX (Kinetoplastea (XX (Kinetoplastea (X (Kinetoplastea (class)))))))))

Main genome may be irrelevant, as well.

Error when unzipping using a checkpoing

PEMA tries to unzip folders that have already been unzipped when running using a checkpoint.

Fix that by finding a workaround in line 35

PEMA dependencies

Hi,
I am currently trying to understand what your dockerfiles do and the installation using build-from-source approach, purpose of this being support by EasyBuild.
In my understanding, dockerfiles in ./pemabase folder take care of environment (dependencies) and the ./Dockerfile.pema does little more than just copying PEMA scripts to correct directories inside docker container.
Im guessing then, that it shouldn't be too hard to achieve what I'm trying to do, had I known all the dependencies.
Therefore, let me ask you, in case I did get the PEMA installation process correctly, have you, by any chance, got a list of dependencies, maybe containing even their required versions, other than those in pemabase Dockerfiles (since they are a tad harder to read and definitely not very specific)?
Thanks a lot for your answer! :-)

add remove_singletons option in OTU pipeline

Currently, the option remove_singletons is only available in the ASV pipleline, please add it for the OTU one as well. Οr better, an option of filtering OTUs based on a user-defined filter :-)
Thanks!
Natassa

When Swarm is chosen for the ITS, the names of the sequences are OtuXX and not ASVXX

Hi!

Just a comment for PEMA v.2.1.4, probably it's a small bug.
When Swarm is chosen as a clustering algorithm for ITS, altough the asvs.swarms file is created in the end, the names of the sequences and, subsequently, the names at the extenedFinalTable.tsv and finalTable.tsv are OtuXXX instead of ASVXXX.