ddocent's Issues

Suggested parameters for bam filtering

Hey,

I have been experimenting with some of my datasets and have found that filtering the BAM files prior to SNP calling not only reduces the number of SNPs that come out of FreeBayes, but also reduces the filtering needed to remove artifacts from my data. Using bamtools filter, I remove:

  • reads that are not properly paired
  • secondary alignments
  • reads with mapping quality below a given value

If you added this as a module before FreeBayes, people who want it could supply their own filters in a filter script.

Something like:

if [[ $FILTER = "yes" ]]; then
    if [ ! -f filter.txt ]; then
        echo "Please provide the bam filters in filter.txt"
        exit 1
    fi
    bamtools filter -script filter.txt -in cat-RRG.bam -out cat-RRG_filter.bam
    mv cat-RRG_filter.bam cat-RRG.bam
fi

Then a filter.txt script could look something like this:

{
    "isPrimaryAlignment" : "true",
    "isProperPair" : "true",
    "mapQuality" : ">=40"
}

according to the bamtools filter manual.

remove duplicates prior to variant calling

I've been reading up on FreeBayes best practices and found several recommendations (1,2) for running samtools markdup on alignment BAM files prior to variant calling.
By doing so, freebayes will skip over alignments marked as duplicates.

Is there a reason dDocent doesn't perform this post-alignment processing step?
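For reference, the standard samtools duplicate-marking workflow looks roughly like this (a sketch; file names are illustrative and dDocent's actual BAM names may differ):

# name-sort so fixmate can see both reads of a pair
samtools sort -n -o namesorted.bam input.bam
# add mate-score (ms) tags that markdup uses to pick the best read
samtools fixmate -m namesorted.bam fixmate.bam
# return to coordinate order
samtools sort -o possorted.bam fixmate.bam
# flag duplicates; freebayes skips reads with the duplicate flag set
samtools markdup possorted.bam markdup.bam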

Fastp not getting number of processors

When trying to run dDocent with a config file, we're getting the following error:

mawk: cannot open uniq.fq1 (No such file or directory)

This seems to point to this section of dDocent:

fastp -i uniq.fq -o uniq.fq1 -w $NP -Q &> assemble.trim.log
mawk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' uniq.fq1 | paste - - | sort -k1,1 -V | tr "\t" "\n" > uniq.fasta
mawk '!/>/' uniq.fasta > totaluniqseq
rm uniq.fq*

When we look at the assemble.trim.log file, we can see that $NP is empty, so fastp parses the following -Q flag as the value of -w:

head logfiles/assemble.trim.log
option value is invalid: --thread=-Q
usage: fastp [options] ... 
options:
  -i, --in1                            read1 input file name (string [=])
  -o, --out1                           read1 output file name (string [=])
  -I, --in2                            read2 input file name (string [=])
  -O, --out2                           read2 output file name (string [=])
      --unpaired1                      for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
      --unpaired2                      for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --umpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
      --failed_out                     specify the file to store reads that cannot pass the filters. (string [=])

We confirmed that $NUMProc is getting the correct number of processors that it pulls from the config file:

head dDocent.runs

Variables used in dDocent Run at Tue Nov 26 04:38:06 PST 2019
Number of Processors
8
Maximum Memory
0
Trimming
yes
Assembly?
yes
Type_of_Assembly

Any idea what's happening and/or suggestions on how to fix it?

We haven't attempted to "hard code" the number of processors in the fastp -w option, and would like to avoid doing so.
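In the meantime, a defensive fallback in the script might avoid the silent flag-swallowing (a sketch, using the variable names from the snippet above):

# default to a single thread if NP is unset or empty, so -Q is never consumed as the -w value
NP=${NP:-1}
fastp -i uniq.fq -o uniq.fq1 -w "$NP" -Q &> assemble.trim.log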

Thanks for any insight.

installation instructions

Hi, I am very naively and very literally trying to follow your installation instructions at https://ddocent.wordpress.com/ddocent-pipeline-user-guide/. Is "$ sh install_dDocent.FB_requirements" the same as install_dDocent_requirements in this GitHub repo (the names are different)? A link, a mention of copying this file to the user's directory, and a note on what format it should be in (I'm guessing .sh?) would all be useful things for the naive user. If these are not the same thing, can you tell me where to look?

Incorrect path searching for dependent software

The dDocent.FB script appears to look up the location of the samtools executable and assumes that all its other dependencies (e.g. rainbow, dDocent itself) are located there as well. While this may work in common situations, it does not play nicely with HPC environments where each package is installed in its own directory and activated via 'environment modules'. Please consider revising the software search code; looking up an environment variable such as RAINBOW_DIR might be a good solution.
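A minimal sketch of that override (RAINBOW_DIR is a hypothetical variable name; the fallback is the usual PATH lookup):

# prefer a module-provided directory if the user exported one, else use PATH
RAINBOW=${RAINBOW_DIR:+$RAINBOW_DIR/rainbow}
RAINBOW=${RAINBOW:-$(command -v rainbow)}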

Simple Suggestions To Increase Speed of Freebayes

suggestion

apply a minimum coverage filter to cov.stats before making the mapped.*.bed files, which ultimately reduces the number of contigs genotyped

the minimum coverage value

either the number of individuals or 2x the number of individuals. There's really not much point in evaluating a contig if there's not at least 1 read per allele per locus per individual.

speedup

2x in my current situation, plus the time saved later in filtering

implementation

# calculate the threshold: 2 x NumInd - 1
minCOV=$(( $(wc -l < namelist) * 2 - 1 ))
# list contigs whose coverage falls below the threshold
mawk -v minCOV=$minCOV '$4 < minCOV {print $1}' cov.stats | uniq > low.cov.contigs
# keep only the contigs with sufficient coverage
grep -f low.cov.contigs -vF cov.stats > high.cov.stats
# apply changes to cov.stats
mv high.cov.stats cov.stats
# clean up intermediate files
rm low.cov.contigs

Error in install_dDocent_requirements

Starting on line 54 there is a chunk of code that looks like it should install vcflib, but it installs FreeBayes instead.

echo "Checking for vcflib"
if which vcfcombine &>/dev/null; then
echo "FreeBayes is already installed"
else
echo "Downloading and installing FreeBayes"
git clone --recursive git://github.com/ekg/freebayes.git
cd freebayes
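Presumably the block was meant to look something like this (a sketch; the vcflib repository URL is an assumption):

echo "Checking for vcflib"
if which vcfcombine &>/dev/null; then
    echo "vcflib is already installed"
else
    echo "Downloading and installing vcflib"
    git clone --recursive https://github.com/vcflib/vcflib.git
    cd vcflib
fi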

please make a new release on github

Dear @jpuritz, could you create a new release of dDocent? The latest is 2.7.8 from May 2019, and the current code has advanced a bit (2.8.12).

A new release would also give us an updated conda package.

RPE Uniq Seqs Correct?

These lines of code (895-901) in the main bash script take unique sequences from the forward reads only, then rejoin the forward and reverse reads even though they are no longer paired and there are more reverse records than forward ones.

		if [ "$ATYPE" = "RPE" ]; then
			cat namelist | parallel --no-notice -j $NUMProc "paste {}.forward {}.reverse | $sort -k1 -S 200M > {}.fr"
			cat namelist | parallel --no-notice -j $NUMProc "cut -f1 {}.fr | uniq -c > {}.f.uniq && cut -f2 {}.fr > {}.r"
			cat namelist | parallel --no-notice -j $NUMProc "mawk '$AWK4' {}.f.uniq > {}.f.uniq.e" 
			cat namelist | parallel --no-notice -j $NUMProc "paste -d '-' {}.f.uniq.e {}.r | mawk '$AWK3'| sed -e 's/-/NNNNNNNNNN/' | sed -e '$SED1' | sed -e '$SED2'> {}.uniq.seqs"
			rm *.f.uniq.e *.f.uniq *.r *.fr
		else

awk program choice

This is more a curiosity than an issue. Why did dDocent restrict itself to mawk? In an earlier version gawk was apparently supported, then dropped at commit 1310cbc. I would suggest using the more portable name (awk) if there is no absolute reason to use only mawk. I understand that mawk seems to be the fastest awk out there.

too many arguments for ls and rm

I haven't experienced this directly, but I was contacted by someone running dDocent on enough samples that calls to ls piped into other commands failed with "too many arguments" errors. The same thing occurred towards the end of the pipeline with commands like rm mapped.*.bed.

solution

  1. Replace all calls to ls with calls to find
  2. Replace instances of rm ... with find . -name ... -delete or similar (piping to xargs rm also works; see the sketch below)
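A minimal sketch of the rm replacement, using the mapped.*.bed case from above:

# -delete avoids building one huge argument list
find . -maxdepth 1 -name 'mapped.*.bed' -delete
# or, where find lacks -delete:
find . -maxdepth 1 -name 'mapped.*.bed' -print0 | xargs -0 rm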

Suggestion: add bioconda installation instructions

The conda package manager now supports dDocent installation. You could add this alternative installation method to the documentation somewhere:

install with bioconda

Easy local install with bioconda

Install Miniconda: http://conda.pydata.org/miniconda.html

Add the bioconda channel:

conda config --add channels bioconda

Create a dDocent conda environment:

conda create -n ddocent_env ddocent=2.1

Activate the dDocent environment:

source activate ddocent_env

Run dDocent:

dDocent ...

Close the environment when you're done:

source deactivate

dDOCENT on HPC

Hi Jon,

we've been trying to set up the dDocent pipeline on our university HPC (VSC) and we have some issues.

We installed all the tools, and also had to reinstall some different versions, but I fear it is still a bit tricky to run the pipeline on its own. We might need to go for custom scripts that run each tool separately.

Our first analysis returned a gzip broken pipe error (see attached error log); any idea what we should fine-tune to get it to run on an HPC? A local run with the same test dataset was fine.

Cheers,

Greg

ddocent_e20281763.txt

Very inefficient scan of full reference genome

Four times in the code there's a line like

if head -1 reference.fasta | grep -e 'dDocent' reference.fasta 1>/dev/null; then

The logic is to take the first line of the reference FASTA (the head command), which contains the name of the first chromosome/scaffold, and pass it to grep to check whether the file was created by dDocent. However, the grep command specifies the reference file again, forcing a scan of the entire uncompressed reference, which can easily be gigabytes of data. The correct command should be

if head -1 reference.fasta | grep -e 'dDocent' 1>/dev/null; then

without the file name. In this way useless computation is avoided.

This issue is present four times in the code:

FIRST INSTANCE

dDocent/dDocent

Line 341 in 9718247

if head -1 reference.fasta | grep -e 'dDocent_' reference.fasta 1>/dev/null; then

SECOND INSTANCE

dDocent/dDocent

Line 424 in 9718247

if head -1 reference.fasta | grep -e 'dDocent' reference.fasta 1>/dev/null; then

THIRD INSTANCE, slightly different in the searched string. The issue remains.

dDocent/dDocent

Line 341 in 9718247

if head -1 reference.fasta | grep -e 'dDocent_' reference.fasta 1>/dev/null; then

FOURTH INSTANCE

dDocent/dDocent

Line 424 in 9718247

if head -1 reference.fasta | grep -e 'dDocent' reference.fasta 1>/dev/null; then

performance bottleneck

It just occurred to me that

dDocent/dDocent

Line 364 in 3615f8e

samtools sort -@$SAMProc $i.bam -o $i.bam 2>>$i.bam.log
is a performance bottleneck: the mapping process all but halts to sort a single BAM file before finishing each iteration. Would you be interested in me modifying this loop (or these loops) so the sorting exploits the available threads more evenly?
My thinking is that the options are:

  • adding more available threads in the loop body for samtools sort
  • move the sorting to outside the loop body and into a parallel call

As it is now, the mapping part of dDocent maximizes threading per individual rather than across individuals (I'm a fan, as it quickens time-till-first-error), so my guess is that same strategy can be applied to the samtools sort call, or just split sort jobs across threads.

Do you have any thoughts on this?
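For the second option, a sketch of what moving the sort into a parallel call might look like (this mirrors dDocent's existing parallel usage; the per-individual BAM names are assumed from the snippet above):

# sort each individual's BAM in its own job instead of serially inside the mapping loop
cat namelist | parallel --no-notice -j $NUMProc \
    "samtools sort {}.bam -o {}.sorted.bam 2>> {}.bam.log && mv {}.sorted.bam {}.bam"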

ulimit silent error (bug)

I recently had variant calling finish, and somewhere in the terminal log was a minor mention of mawk being unable to open mapped.somenumber.bed because of the open-file limit. The problem is that despite this error, freebayes still ran and dDocent still finished; had I not spotted that single mention of the mawk failure, I wouldn't have noticed anything was wrong. That instance of freebayes ran for ~24 hrs, and since raising the ulimit it has been running >48 hrs (and still going), so the error had a substantial impact on how many instances of freebayes dDocent was creating.

I think I isolated the mawk code that creates the mapped.??.bed files. Is there a way to add logic so it errors and exits when this step fails, rather than continuing on to run freebayes anyway?

dDocent/dDocent

Lines 497 to 504 in 270bc93

mawk -v x=$DP '$4 < x' cov.split.stats |sort -V -k1,1 -k2,2 | mawk -v cutoff=$CC -v tt=$TT 'BEGIN{i=1}
{len=$3-$2;lc=len*$4;cov = cov + lc
if (NR == 1 && lc > tt) {x="mapped."i".bed";print $1"\t"$2"\t"$3 > x; i=i+1; e=1}
else if ( cov < cutoff && lc < tt) {x="mapped."i".bed";print $1"\t"$2"\t"$3 > x; e=0}
else if (lc > tt && e > 0 ) {x="mapped."i".bed"; print $1"\t"$2"\t"$3 > x; cov=0;i=i+1; e=1}
else if (lc > tt && e < 1 ) {i=i+1; x="mapped."i".bed"; print $1"\t"$2"\t"$3 > x; cov=0;i=i+1;e=1}
else if (cov > cutoff && lc < tt ) {i=i+1; x="mapped."i".bed"; print $1"\t"$2"\t"$3 > x; cov=lc;e=0}
}'
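A minimal fail-fast sketch (hypothetical; it relies on mawk exiting nonzero on a fatal "Too many open files" error, and on the pipeline's exit status coming from the final mawk):

# run the interval-splitting step, then bail out instead of calling freebayes on failure
# (the mawk body here is a stand-in; the real splitting program from above goes in its place)
if ! mawk -v x=$DP '$4 < x' cov.split.stats | sort -V -k1,1 -k2,2 | \
     mawk '{x="mapped.1.bed"; print $1"\t"$2"\t"$3 > x}'; then
    echo "Genomic interval creation failed. Check the open-file limit (ulimit -n)." >&2
    exit 1
fi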

"Checking for required software" not operating as expected

When I start dDocent, I get the following error, but dDocent allows me to keep going; this might be causing other errors downstream. I have recently installed dDocent 2.7 and it pinged me to update quite a few dependencies.

Looking at line 70 of the dDocent script on GitHub, it is the check that seqtk be at least version 1.2; mine was only 1.0.

$ dDocent
dDocent 2.7.1

Contact [email protected] with any problems

Checking for required software
/local/home/michelles/14_programs/dDocent/dDocent: line 70: [: 1.0-r82: integer expression expected

All required software is installed!

dDocent run started Sun Nov 18 09:55:01 EST 2018
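The failure is that the test treats the version string "1.0-r82" as an integer. A suffix-tolerant comparison using sort -V might look like this (a sketch; how the script obtains the installed version string is assumed):

REQUIRED=1.2
INSTALLED=1.0-r82   # e.g. parsed from seqtk's usage output
# sort -V orders version strings naturally; if REQUIRED does not sort first, INSTALLED is too old
if [ "$(printf '%s\n' "$REQUIRED" "$INSTALLED" | sort -V | head -1)" != "$REQUIRED" ]; then
    echo "seqtk >= $REQUIRED required, found $INSTALLED"
fi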

install_dDocent.FB_requirements

When installing locally, cutadapt installation fails with this error:
error: could not create '/usr/lib64/python2.6/site-packages/cutadapt': Permission denied

Running python setup.py install --user works, but then you have to append ~/.local/bin to the PATH.

Also, FreeBayes v0.9.21-19-gc003c1e is installed, which leads to this error/warning:

-bash-4.1$ dDocent.FB
dDocent 1.2

Contact [email protected] with any problems


Checking for required software
The version of FreeBayes installed in your $PATH is not optimized for dDocent.
Please install a version 0.9.20

Tutorial

We've been going through the tutorial at https://github.com/jpuritz/dDocent/blob/master/tutorials/Reference%20Assembly%20Tutorial.md.

First, this is fabulous! Thanks for your efforts to create such a high-quality document and corresponding scripts...

We're running dDocent here on CentOS 6 and FreeBSD and we encountered a few portability issues with the tutorial on the FreeBSD side:

  1. This is not an issue for someone who follows your instructions to the letter, but if someone tries to run your downloaded bash scripts as "./script.sh" instead of "bash script.sh", the #!/bin/bash will fail. Changing the shebang line to #!/usr/bin/env bash would make it 99% portable.

  2. In generating rainbow.fasta, the cat rbasm.out <(echo "E") construct fails. I replaced it with the following:

echo "E" > endmarker
cat rbasm.out endmarker |sed 's/[0-9]:[0-9]://g' | mawk ' {

  3. Some of the scripts contain "awk" commands (not mawk) that depend on GNU extensions. If you use "gawk" in place of "awk", it will still work on Linux systems and also on systems that don't use GNU awk by default. On all the Linux systems I've used, awk is a link to gawk, so this change should not impact Linux users.

`getAssemblyInfo` is missing a closing `}`

getAssemblyInfo(){
#Have user estimate/enter assembly parameters if unentered

if [ -z "$CUTOFF" ]; then

	for i in {2..20};
	do 
	echo $i >> pfile
	done
	cat pfile | parallel -j $NUMProc --no-notice "echo -n {}xxx && mawk -v x={} '\$1 >= x' uniq.seqs | wc -l" | mawk  '{gsub("xxx","\t",$0); print;}'| $sort -g > uniqseq.data
	rm pfile


	#Plot graph of above data
	gnuplot << \EOF 
	set terminal dumb size 120, 30
	set autoscale
	set xrange [2:20] 
	unset label
	set title "Number of Unique Sequences with More than X Coverage (Counted within individuals)"
	set xlabel "Coverage"
	set ylabel "Number of Unique Sequences"
	set lmargin 10
	plot 'uniqseq.data' with lines notitle
	pause -1
EOF


	echo -en "\007"
	echo -en "\007"
	echo -en "\007"
	echo -e "Please choose data cutoff.  In essence, you are picking a minimum (within individual) coverage level for a read (allele) to be used in the reference assembly"
	
	read CUTOFF
fi

parallel --env <function> throws error on some systems

Solution confirmed on 3 different HPCs:

# add the following line before the 'parallel --env' line; it saves the whole current environment
parallel --record-env
# then replace '--env <function>' with '--env _', which passes everything in the recorded environment to parallel
parallel --no-notice -j $NUMProc --env _

Program stuck while mapping reads with BWA

Hi,

I am running dDocent 2.7.8 on 164 individuals with paired-end sequencing and a reference genome, using a configuration file on an HPC. The trimming step went really well, but now I'm stuck mapping reads with bwa. It has been stuck on one of my individuals for more than ten hours; it cannot generate *-RG.bam and *-RG.bam.bai, and there is nothing in its *.bam.log file. My raw data is about 60 GB and I set the maximum memory to 200 GB, but there are now 151 GB of files in total; could this be the reason?

Here is the config.file I used.

Number of Processors
20
Maximum Memory
200
Trimming
yes
Assembly?
no
Type_of_Assembly
PE
Clustering_Similarity%
0.85
Minimum within individual coverage level to include a read for assembly (K1)
6
Minimum number of individuals a read must be present in to include for assembly (K2)
6
Mapping_Reads?
yes
Mapping_Match_Value
1
Mapping_MisMatch_Value
4
Mapping_GapOpen_Penalty
6
Calling_SNPs?
yes

I'd really appreciate it if anyone could give me some suggestions.

Thanks,
Tong

consider using new versions of vcftools?

Hi Jon,
have you considered updating the vcftools commands so dDocent could use newer versions of vcftools?
So far, I have found these changes in vcftools flags to be needed:

--geno to --max-missing
--missing to --missing-indv

Am I missing something?
I could make these changes if you want.
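For illustration, the renamed flags in a current vcftools invocation (the thresholds and file names here are placeholders):

# --geno became --max-missing (site-level missingness threshold)
vcftools --vcf input.vcf --max-missing 0.9 --recode --out filtered
# --missing became --missing-indv (per-individual missingness report)
vcftools --vcf input.vcf --missing-indv --out report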

Error with main script - EOF management

I encountered errors like:

/home/ubuntu/software/dDocent/dDocent: line 1383: warning: here-document at line 833 delimited by end-of-file (wanted `EOF')
/home/ubuntu/software/dDocent/dDocent: line 1384: syntax error: unexpected end of file

There are extra spaces after the EOF delimiters at these lines (which I removed):

https://github.com/jpuritz/dDocent/blob/master/dDocent#L778
https://github.com/jpuritz/dDocent/blob/master/dDocent#L833

I've also removed the tab preceding the EOF delimiter at these lines:

https://github.com/jpuritz/dDocent/blob/master/dDocent#L789
https://github.com/jpuritz/dDocent/blob/master/dDocent#L843

And then it works fine.

dDocent version: 2.8.9
bash version: 4.4.19(1)-release (x86_64-pc-linux-gnu)
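For context, bash only terminates a here-document when the delimiter line matches exactly, so stray whitespace keeps it reading to the end of the file (a minimal illustration):

# the closing delimiter must be the only text on its line, with no trailing spaces;
# "EOF " or " EOF" never matches, so bash reads on to the end of the script,
# producing exactly the "delimited by end-of-file" warning above
cat << EOF
heredoc body
EOF
# the <<- form additionally strips leading tabs, so a tab-indented EOF works only with <<-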

missing data in merged vcf file

I've noticed that missing data in my merged VCF file following genotyping are haploid ".:" while all genotypes with data are correctly diploid. Running the sed -e 's/ \.\:/ \.\/\.\:/g' line from the pipeline on the merged VCF file does not resolve the issue. However, using 's/.:.:.:.:.:.:.:./.\/\.:.:.:.:.:.:.:./g' does resolve the ploidy of the missing genotypes. I'm just curious why the line from the pipeline was not working.

Missing data within the raw*.vcf files is coded as a single "." without additional fields.

Using vcflib ver 1.0.2 for vcfcombine.

Cheers,
-Alex

Use bedops instead of bedtools merge for memory efficiency

I found that the command

bedtools merge -i cat-RRG.bam -bed > mapped.bed

uses a lot of memory and often gets the pipeline killed. An efficient alternative is to use the bedops suite, and in particular substitute the previous code with:

bedtools bamtobed -i cat-RRG.bam > cat-RRG.bed
bedops --merge cat-RRG.bed > mapped.bed

This approach uses a trivial amount of memory. It requires the extra BAM -> BED conversion, but it allowed me to run the pipeline on a previously unusable system.
Since bedops is an extra tool, this would require an update to the installation instructions. Fortunately bedops is in conda, so

conda install bedops

is enough. The problematic command is present in two places in the dDocent code:

dDocent/dDocent

Line 407 in 9718247

bedtools merge -i cat-RRG.bam -bed > mapped.bed

dDocent/dDocent

Line 1197 in 9718247

bedtools merge -i cat-RRG.bam -bed > mapped.bed

RefMapOpt error

RefMapOpt.sh gives this error which I think is related to commit 251dd42:

Checking for required software

All required software is installed!

dDocent RefMapOpt version 2.8.8

Trimmed sequences found, proceeding with optimization.
/bin/bash: line 11: 226.075 + 100 : syntax error: invalid arithmetic operator (error token is ".075 + 100 ")
[...]
mawk: cannot open 2.2.results (No such file or directory)

cannot see how config file improperly configured

Trying to run dDocent in non-interactive mode and I get an error that my config file is improperly configured:

")syntax error: invalid arithmetic operator (error token is "ine 202: [[: 6

Configuration file is not properly configured.  Please see example on dDocent.com or the dDocent GitHub page.

I copied and pasted the config file structure from the dDocent tutorial, input my own parameters, and can't tell what's wrong (below). Am I missing something?

Thanks!

My config file:

Number of Processors
6
Maximum Memory
50
Trimming
yes
Assembly?
no
Type_of_Assembly

Clustering_Similarity%

Minimum within individual coverage level to include a read for assembly (K1)

Minimum number of individuals a read must be present in to include for assembly (K2)

Mapping_Reads?
yes
Mapping_Match_Value
1
Mapping_MisMatch_Value
4
Mapping_GapOpen_Penalty
6
Calling_SNPs?
yes
Email
[email protected]

gzip and core dumping

I needed to run some older Illumina reads through dDocent, and I am getting these two specific errors that are preventing the pipeline from completing:

  1. The first error appears when initiating trimming:
Trimming reads and simultaneously assembling reference sequences
Removing the _1 character and replacing with /1 in the name of every sequence

gzip: FRS_001.F.fq.gz: unexpected end of file

gzip: FRS_002.F.fq.gz: unexpected end of file

gzip: FRS_003.F.fq.gz: unexpected end of file

gzip: FRS_004.F.fq.gz: unexpected end of file

It's unclear why this is happening, as the files were zipped (gzip) using the default settings. The trim logs seem to indicate that trimming proceeds to completion.

  2. This may be related to the first issue, but after trimming completes and assembly parameters are entered following the gnuplot prompts, this rainbow error shuts the whole process down:
Now sit back, relax, and wait for your analysis to finish
/home/saillantslab/miniconda3/envs/ddocent/bin/dDocent: line 841:  1269 Segmentation fault      (core dumped) rainbow div -i rcluster -o rbdiv.out -f 0.5 -K 10
/home/saillantslab/miniconda3/envs/ddocent/bin/dDocent: line 882: / 100 + 1: syntax error: operand expected (error token is "/ 100 + 1")

After the process shuts down, here is a list of what's left in the working directory, with filesizes on the left:
4096 unpaired
0 trim.log
1027337 rcluster.gz
1040 assemble.trim.log
109509 uniq.F.fasta.gz
1104453 totaluniqseq.gz
1112 cdhit.log
1255 FRS_003.trim.log
1256 FRS_001.trim.log
1256 FRS_002.trim.log
1256 FRS_004.trim.log
1264501 uniq.fasta.gz
136 sort.contig.cluster.ids
164 uniqseq.data
215 xxx
24 uniqseq.peri.data
30 rbdiv.out.gz
324 lengths.txt
32 namelist
4373229 contig.cluster.totaluniqseq
43948224 FRS_003.uniq.seqs
4415013 uniq.k.4.c.2.seqs
44420402 FRS_002.uniq.seqs
46093742 FRS_001.uniq.seqs
475 dDocent.runs
50114803 FRS_004.uniq.seqs
50720656 FRS_003.R2.fq.gz
50840802 uniq.seqs.gz
51329776 FRS_003.R.fq.gz
51936612 FRS_002.R2.fq.gz
5234827 uniq.full.fasta
52537196 FRS_002.R.fq.gz
58284180 FRS_001.R2.fq.gz
58939121 FRS_001.R.fq.gz
60670176 FRS_003.R1.fq.gz
61136939 FRS_003.F.fq.gz
62141619 FRS_002.R1.fq.gz
62677396 FRS_002.F.fq.gz
63877600 FRS_004.R2.fq.gz
64686395 FRS_004.R.fq.gz
69506759 FRS_001.R1.fq.gz
70096018 FRS_001.F.fq.gz
76271641 FRS_004.R1.fq.gz
76923721 FRS_004.F.fq.gz
890 xxx.clstr
9863900 uniqCperindv
9939 dDocent_main.LOG

To be honest, this is old data that someone else preprocessed years ago to remove UMI elements, so it's unclear to me whether the issue is with the software or with the inputs I am giving dDocent. For the sake of being thorough, here is what the format of the input fastq files looks like:

@cluster_1562	CATCTCCT
ATGAAGGGAACTACATTTCCCATATTTCATGAAAAGAGTGGGTGAGCATGATGTTTTCACACCAACTTTCAGGTGTCGTTC
+
?BB@?A:3@??A=>@EEB?<??=>EEBA@BB@DC===AAB;@AB=B:7B@<9@DEEBA636?:A=>DEBA:A@9@A=0EB?
@cluster_1579	GATATGGT
TTGCGAAGCATCTAGTATTGTCACACTCCGTTACTCAACACTATGTATGATGCGCTTTTCTGTGATATCTCGTGGTACTCTTTTT
+
A<->:03<,4.5@1692=2/119>:=400/4B0/592<4=.26244.55/1)/B74;3;?@4:1/1-94D.<A6/;3244BDD@C

Any insight as to where the issue stems from would be appreciated. Thank you!
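For the first error, a quick integrity check on the inputs might narrow things down (a sketch):

# gzip -t tests each archive for truncation/corruption without extracting it
for f in *.F.fq.gz *.R.fq.gz; do
    gzip -t "$f" 2>/dev/null || echo "$f failed integrity check"
done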

Discrepancy between dDocent_filters.sh and SNP Filtering Tutorial

Several of the filters (allele balance, mapping quality, quality/depth) in dDocent_filters.sh are not run with the same parameters as the step-by-step examples in the SNP filtering tutorial. Here's the relevant line (17) in dDocent_filters.sh:

vcffilter -s -f "AB > 0.2 & AB < 0.8 | AB < 0.01 | AB > 0.99" -s -g "QR > 0 | QA > 0 " $1 | vcffilter -s -f "QUAL / DP > 0.2" | vcffilter -s -f "MQM / MQMR > 0.25 & MQM / MQMR < 1.75" > $2

And the respective parameters from the step-by-step tutorial:

AB > 0.25 & AB < 0.75 | AB < 0.01
MQM / MQMR > 0.9 & MQM / MQMR < 1.05
QUAL / DP > 0.25
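For comparison, dropping the tutorial's thresholds into the same one-liner would look like this (a sketch for illustration, not a claim about which parameter set is correct):

vcffilter -s -f "AB > 0.25 & AB < 0.75 | AB < 0.01" -s -g "QR > 0 | QA > 0 " $1 | vcffilter -s -f "QUAL / DP > 0.25" | vcffilter -s -f "MQM / MQMR > 0.9 & MQM / MQMR < 1.05" > $2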

Conda dDocent recipe has incorrect FreeBayes

A fresh conda install of dDocent with conda install -c bioconda ddocent installs this version of freebayes:

 freebayes          bioconda/linux-64::freebayes-0.9.21.7-0

such that when dDocent is invoked, this error appears and dDocent exits:

/home/pdimens/.conda/envs/ddocent2/bin/dDocent: line 6: export: `2.7.8': not a valid identifier
dDocent 2.7.8 

Contact [email protected] with any problems 

 
Checking for required software
The version of FreeBayes installed in your $PATH is not optimized for dDocent.
Please install at least version 1.0.0

The problem is easily fixed with conda update freebayes though.

Downgrade conda vcftools version dependency

I'm unsure if this is an actual issue, but it seems the output of freebayes doesn't agree with the vcftools (0.1.16) that's bundled with conda's dDocent recipe, as it spits out many initial warnings:

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
        --vcf BFT_biallelic_noindel.recode.vcf

Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
Warning: Expected at least 2 parts in INFO entry: ID=PRO,Number=1,Type=Float,Description="Reference allele observation count, with partial observations recorded fractionally">
Warning: Expected at least 2 parts in INFO entry: ID=PAO,Number=A,Type=Float,Description="Alternate allele observations, with partial observations recorded fractionally">
Warning: Expected at least 2 parts in INFO entry: ID=SRP,Number=1,Type=Float,Description="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=SAP,Number=A,Type=Float,Description="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
Warning: Expected at least 2 parts in INFO entry: ID=ABP,Number=A,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=RPP,Number=A,Type=Float,Description="Read Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=RPPR,Number=1,Type=Float,Description="Read Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=EPP,Number=A,Type=Float,Description="End Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=EPPR,Number=1,Type=Float,Description="End Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
Warning: Expected at least 2 parts in FORMAT entry: ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
Warning: Expected at least 2 parts in FORMAT entry: ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
After filtering, kept 646 out of 646 Individuals

fastp v 0.19.5 issue

With the recent version of dDocent (v2.9.1), fastp 0.19.5 does not have the correct arguments.

It appears that fastp v0.19.7 does have the correct arguments, so I would recommend adding a check that fastp is version 0.19.7 or higher to the software check.

Trimmomatic SE with PE adapters

This might be something that isn't easily generalizable across all users, but I ran across this today while digging through the dDocent wrapper.

ADAPTERS appear to be set here:

if find ${PATH//:/ } -maxdepth 1 -name TruSeq2-PE.fa 2> /dev/null | grep -q 'Tru' ; then
    ADAPTERS=$(find ${PATH//:/ } -maxdepth 1 -name TruSeq2-PE.fa 2> /dev/null | head -1)
        ...

Then used here:

#Function for trimming reads using trimmomatic
TrimReads () { 
...
for i in "${NAMES[@]}"
do
#echo "Trimming Sample $i"
if [ -f $i.R.fq.gz ]; then
java -jar $TRIMMOMATIC PE -threads $NUMProc -phred33 $i.F.fq.gz $i.R.fq.gz $i.R1.fq.gz $i.unpairedF.fq.gz $i.R2.fq.gz $i.unpairedR.fq.gz ILLUMINACLIP:$ADAPTERS:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:10 $TW &> $i.trim.log
else java -jar $TRIMMOMATIC SE -threads $NUMProc -phred33 $i.F.fq.gz $i.R1.fq.gz ILLUMINACLIP:$ADAPTERS:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:10 $TW &> $i.trim.log
...
}

Shouldn't this use the SE adapters for SE ddRAD, or at least offer a choice of adapter set (TruSeq2 vs TruSeq3, for example)? Maybe this is more of a feature request than an issue, but I wanted to get it logged. ;-)

This is dDocent 2.24. I'm happy to come up with something if you want a PR. Just let me know.

Thanks!
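A sketch of one way to choose the adapter file per sample (TruSeq2-SE.fa ships with Trimmomatic; doing the selection per sample inside the loop is an assumption):

# fall back to the SE adapter file when no reverse-read file exists for the sample
if [ -f $i.R.fq.gz ]; then
    ADAPTERS=$(find ${PATH//:/ } -maxdepth 1 -name TruSeq2-PE.fa 2> /dev/null | head -1)
else
    ADAPTERS=$(find ${PATH//:/ } -maxdepth 1 -name TruSeq2-SE.fa 2> /dev/null | head -1)
fi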

miniconda recipe with samtools bug?

Hey John,

I reinstalled dDocent on a new system through miniconda and got this error:

RefMapOpt.sh 3 3 5 5 0.990 PE 16
Checking for required software
/miniconda3/envs/ddocent_env/bin/RefMapOpt.sh: line 27: [: : integer expression expected
The version of Samtools installed in your $PATH is not optimized for dDocent.
Please install at least version 1.3.0

Let me know if you want any log info.

Best,

~ JT

Change ulimit for large sample sizes

Ubuntu 14.04 by default has a limit of 1024 open files, controlled by the ulimit value, and the CreateIntervals function will fail at the samtools merge step if there are more than 1024 BAM files to merge. Adding a few lines of code can avoid this problem by auto-adjusting the limit for the current terminal session.
Lines 133-135 of the dDocent script evaluate the number of individuals and create the NumInd variable; one way to deal with the open-file limit is to test for $NumInd > 1000 and reset the ulimit value if necessary:

# Increase limit on number of open files if $NumInd > 1000 (default limit is 1024 on Ubuntu)

if [ "$NumInd" -gt 1000 ]; then
NewLimit=$(($NumInd + 100))
ulimit -n $NewLimit
fi

sh: /dev/tty: No such device or address

I consistently get the error message below when I run dDocent 2.8.13 (conda installation) on the SLURM-based compute cluster (CentOS Linux release 7.9.2009) at my end. The correct final output files seem to be created, though. I was wondering if there is a quick fix for this? Thank you very much.

Using FreeBayes to call SNPs
sh: /dev/tty: No such device or address

How to give a population FILE to dDocent.FB?

I have samples of the same species from different populations (different locations). I would like to incorporate this information while calling SNPs with FreeBayes. How can I do that?
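For reference, freebayes itself accepts a --populations file with one sample-to-population mapping per line; a minimal sketch (sample and file names are hypothetical, and the sample names must match those in the BAM read groups):

# popmap.txt: <sample><TAB><population>
printf 'Loc1_001\tLoc1\nLoc1_002\tLoc1\nLoc2_001\tLoc2\n' > popmap.txt
freebayes -f reference.fasta --populations popmap.txt cat-RRG.bam > raw.vcf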

error in trimming

dDocent hit an error during trimming and I can't tell why.

Here's what's printed to the terminal:

/home/nclowell/miniconda2/envs/updated_dDocent/bin/dDocent: line 6: export: `2.7.8': not a valid identifier
dDocent 2.7.8 

Contact [email protected] with any problems 

 
Checking for required software

All required software is installed!

dDocent version 2.7.8 started Wed Sep 11 10:10:23 PDT 2019 

At this point, all configuration information has been entered and dDocent may take several hours to run.
It is recommended that you move this script to a background operation and disable terminal input and output.
All data and logfiles will still be recorded.
To do this:
Press control and Z simultaneously
Type 'bg' without the quotes and press enter
Type 'disown -h' again without the quotes and press enter

Now sit back, relax, and wait for your analysis to finish

Trimming reads
(updated_dDocent) nclowell@ubuntu:/mnt/hgfs/E/dDocent_configfile_test$

Here's the content of the trim log:

mawk: cannot open lengths.txt (No such file or directory)
/home/nclowell/miniconda2/envs/updated_dDocent/bin/dDocent: line 688: / 2: syntax error: operand expected (error token is "/ 2")

PEAR is not accessible or licensed under a suitable software license. Please consider substituting PANDAseq.

Hi Jonathan,

As part of my work with the Genomics for Aotearoa New Zealand society, I am building docker containers for bioinformatics tools. It would be great to build a fully functional one for dDocent. Unfortunately, the PEAR software has two significant issues prohibiting its inclusion. The first is that it is simply not available in a way that one can script an install. The second is that it does not have a suitable license for software distribution.

On the other hand, PANDAseq is available from their github repo under free software licensing terms. https://github.com/neufeld/pandaseq It implements many algorithms, including the one used by PEAR. If you wouldn't mind substituting PANDAseq for PEAR, I would then be able to build a docker container of dDocent with full functionality.

Thank you for considering my request.

--Rob

"rename.for.dDocent" not working

Hi.

I'm pretty new to genomics and using the command line, but I was wondering if this is an easy fix:

/Documents/ddocent-master$ ls
bars.names
dDocent.FB
dDocent.GATK
install_dDocent.FB_requirements
install_dDocent.GATK_requirements
LICENSE
mergefq.pl
README.md
rename.for.dDocent
Rename_for_dDocent.sh

~/Documents/ddocent-master$ sh rename.for.dDocent bars.names
rename.for.dDocent: 9: rename.for.dDocent: Syntax error: "(" unexpected (expecting "fi")

I can't seem to get the barcode file to work; the instructions were followed exactly as on the blog. As far as I am aware I have the newest version of dDocent.

Also, are there sample data sets available for dDocent?

Cheers,

~ jat101

Adaptive trimming of adapters with fastp

Hi Jon,

I think that trimming with fastp would benefit from supplying the actual adapters with the --adapter_fasta flag. Below I've pasted the results of a small test showing adapters left by read-through of the P2 in the forward reads of one sample: P_007.F.fq.gz is untrimmed, P_007.nofa.fq.gz is trimmed with only the adaptive trimming, and P_007.trimfa.fq.gz was trimmed using the --adapter_fasta flag and the TruSeq adapter FASTA. I'm grepping for the P2 here:

$ zgrep -c AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC P_007.F.fq.gz 
2144
$ zgrep -c AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC P_007.nofa.fq.gz 
2144
$ zgrep -c AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC P_007.trimfa.fq.gz 
0

I think what might be happening is the auto adapter detection in fastp is mistaking some other repetitive sequences in the data for the adapters. It prints the sequence it detects, but it's not the true adapter:

Detecting adapter sequence for read1...
CATGCAATATTCTAACTGATAAATCATGACATGACATCCAAGATTTCAAAATGTCATGCC

I can't see a way to turn off the auto-detection, but adding the TruSeq adapters explicitly makes fastp search for those in addition to the ones it detects.
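A sketch of the invocation (file names taken from the test above; the adapter FASTA path is an assumption):

fastp -i P_007.F.fq.gz -o P_007.trimfa.fq.gz --adapter_fasta TruSeq2-PE.fa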

Incorrect error prompt

dDocent/dDocent

Line 789 in eed5a7d

echo "Please double check that all fastq files are named Ind01.F.fq.gz and Ind01.R.fq.gz"

When answering "no" to an incorrect count of individuals in a directory, dDocent returns an error citing the wrong naming convention for the individuals:
Please double check that all fastq files are named Ind01.F.fq.gz and Ind01.R.fq.gz
instead of the convention that appears correctly in other error prompts:
Please rename individuals to: Locality_Individual.F.fq.gz

Add `--without-curses` samtools ./configure in the install script

The install fails on any cluster compute node that doesn't have the curses libraries (probably most of them). This is an easy fix: update install_dDocent_requirements line 129 to read:
./configure --without-curses

I would make a PR, but it seems silly for such a small change.

Genomic interval creation failed

Received this error:

Creating alignment intervals
mawk: cannot open "mapped.2918.bed" for output (Too many open files)
Genomic interval creation failed. This may be related to the maximum number of open files.

I am unsure how to fix this.
Please advise.
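This looks like the same open-file limit discussed in the ulimit issues above; a likely workaround (the value is arbitrary, size it above your interval count) is to raise the limit for the current session before rerunning:

# check the current soft limit, then raise it above the number of mapped.*.bed intervals
ulimit -n
ulimit -n 5000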

ReferenceOpt.sh - kopt.data output

Not sure if this is a common issue, but when running ReferenceOpt.sh I got standard output in my kopt.data file.

See excerpt:

4 1 0.88 7   [this kopt.data line arrived interleaved with the full PEAR v0.9.6 banners and per-sample assembly logs for DH10, DH15, DH16, DH20, DH22, DH33, DH34, DH35, DH38, and DH39; the log text is omitted here]
4 1 0.90 8
4 1 0.92 8
4 1 0.94 8
4 1 0.96 13
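
For reference, a PEAR command of roughly the following shape would map onto the settings reported in the log above (-p p-value, -v minimum overlap, -n minimum assembly length, -j threads, per the PEAR manual). The literal command dDocent builds is not shown in this log, so treat this as an illustration only:

# Illustrative only: reproduces the logged settings
# (p-value 0.01, min overlap 10, min assembly length 100, 6 threads).
pear -f DH39.F.fq.gz -r DH39.R.fq.gz -o DH39 -p 0.01 -v 10 -n 100 -j 6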

Patch GNU sort assumption out of remake_reference.sh

Also set SHELL to bash, since GNU parallel uses $SHELL to launch child processes and does not work under some other shells.

Lastly, fix the logic for choosing the awk command: if "awk" is not GNU awk, the script should fall back to "gawk".
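
For context on that last point: the original test, `if ! awk --version | fgrep -v GNU`, succeeds whenever `--version` prints any line not containing "GNU", which is true for GNU and non-GNU awks alike, so the gawk fallback is effectively unreachable. A quick check along these lines (illustrative, not part of the patch) shows which branch the corrected logic takes on a given system:

# Illustrative check: report which implementation the corrected logic selects.
if awk --version 2>/dev/null | fgrep -q GNU; then
	echo "awk is GNU awk; the script will use awk"
else
	echo "awk is not GNU awk; the script will fall back to gawk"
fi
if sort --version 2>/dev/null | fgrep -q GNU; then
	echo "sort is GNU sort; the script will use sort"
else
	echo "sort is not GNU sort; the script will fall back to gsort"
fi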

--- dDocent-test/Data/remake_reference.sh	2018-06-13 15:38:07.762081000 -0500
+++ remake_reference.sh	2018-06-13 15:30:42.031595000 -0500
@@ -1,16 +1,23 @@
 export LC_ALL=en_US.UTF-8
+export SHELL=bash
 
 if [[ -z "$5" ]]; then
 echo "Usage is sh remake_reference.sh K1 K2 similarity% Assembly_Type Number_of_Processors"
 exit 1
 fi
 
-if ! awk --version | fgrep -v GNU &>/dev/null; then
+if ! awk --version | fgrep GNU &>/dev/null; then
 	awk=gawk
 else
 	awk=awk
 fi
 
+if ! sort --version | fgrep GNU &>/dev/null; then
+	sort=gsort
+else
+	sort=sort
+fi
+
 if find ${PATH//:/ } -maxdepth 1 -name trimmomatic*jar 2> /dev/null| grep -q 'trim' ; then
 	TRIMMOMATIC=$(find ${PATH//:/ } -maxdepth 1 -name trimmomatic*jar 2> /dev/null | head -1)
 else
@@ -52,7 +59,7 @@
 cat namelist | parallel --no-notice -j $NUMProc "zcat {}.F.fq.gz | mawk '$AWK1' | mawk '$AWK2' > {}.forward"
 cat namelist | parallel --no-notice -j $NUMProc "zcat {}.R.fq.gz | mawk '$AWK1' | mawk '$AWK2' > {}.reverse"
 if [ "$ATYPE" = "RPE" ]; then
-	cat namelist | parallel --no-notice -j $NUMProc "paste {}.forward {}.reverse | sort -k1 -S 100M > {}.fr"
+	cat namelist | parallel --no-notice -j $NUMProc "paste {}.forward {}.reverse | $sort -k1 -S 100M > {}.fr"
 	cat namelist | parallel --no-notice -j $NUMProc "cut -f1 {}.fr | uniq -c > {}.f.uniq && cut -f2 {}.fr > {}.r"
 	cat namelist | parallel --no-notice -j $NUMProc "mawk '$AWK4' {}.f.uniq > {}.f.uniq.e"
 	cat namelist | parallel --no-notice -j $NUMProc "paste -d '-' {}.f.uniq.e {}.r | mawk '$AWK3'| sed 's/-/NNNNNNNNNN/' | sed -e '$SED1' | sed -e '$SED2'> {}.uniq.seqs"
@@ -110,12 +117,12 @@
 	parallel --no-notice -j $NUMProc mawk -v x=$CUTOFF \''$1 >= x'\' ::: *.uniq.seqs | cut -f2 | sed 's/NNNNNNNNNN/-/' > total.uniqs
 	cut -f 1 -d "-" total.uniqs > total.u.F
 	cut -f 2 -d "-" total.uniqs > total.u.R
-	paste total.u.F total.u.R | sort -k1 --parallel=$NUMProc -S 2G > total.fr
+	paste total.u.F total.u.R | $sort -k1 --parallel=$NUMProc -S 2G > total.fr
 	special_uniq(){
 	mawk -v x=$1 '$1 >= x' $2 |cut -f2 | sed -e 's/NNNNNNNNNN/	/g' | cut -f1 | uniq
 	}
 	export -f special_uniq
-	parallel --no-notice --env special_uniq special_uniq $CUTOFF {} ::: *.uniq.seqs | sort --parallel=$NUMProc -S 2G | uniq -c > total.f.uniq
+	parallel --no-notice --env special_uniq special_uniq $CUTOFF {} ::: *.uniq.seqs | $sort --parallel=$NUMProc -S 2G | uniq -c > total.f.uniq
 	join -1 2 -2 1 -o 1.1,1.2,2.2 total.f.uniq total.fr | mawk '{print $1 "\t" $2 "NNNNNNNNNN" $3}' | mawk -v x=$CUTOFF2 '$1 >= x' > uniq.k.$CUTOFF.c.$CUTOFF2.seqs
 	rm total.uniqs total.u.* total.fr total.f.uniq*
 
@@ -123,7 +130,7 @@
 	parallel --no-notice mawk -v x=$CUTOFF \''$1 >= x'\' ::: *.uniq.seqs | cut -f2 | perl -e 'while (<>) {chomp; $z{$_}++;} while(($k,$v) = each(%z)) {print "$v\t$k\n";}' | mawk -v x=$CUTOFF2 '$1 >= x' > uniq.k.$CUTOFF.c.$CUTOFF2.seqs
 fi
 
-sort -k1 -r -n --parallel=$NUMProc -S 2G uniq.k.$CUTOFF.c.$CUTOFF2.seqs |cut -f2 > totaluniqseq
+$sort -k1 -r -n --parallel=$NUMProc -S 2G uniq.k.$CUTOFF.c.$CUTOFF2.seqs |cut -f2 > totaluniqseq
 mawk '{c= c + 1; print ">dDocent_Contig_" c "\n" $1}' totaluniqseq > uniq.full.fasta
 LENGTH=$(mawk '!/>/' uniq.full.fasta | mawk '(NR==1||length<shortest){shortest=length} END {print shortest}')
 LENGTH=$(($LENGTH * 3 / 4))
@@ -149,6 +156,7 @@
 	if [ -s "rbdiv.out.$1" ]; then
 		rainbow merge -o rbasm.out.$1 -a -i rbdiv.out.$1 -r 2 -N10000 -R10000 -l 20 -f 0.75
+		more rbasm.out.$1
 	fi
 }
@@ -159,21 +167,21 @@
 	sed -e 's/NNNNNNNNNN/	/g' uniq.fasta | cut -f1 > uniq.F.fasta
 	CDHIT=$(python -c "print (max("$simC" - 0.1,0.8))")
 	cd-hit-est -i uniq.F.fasta -o xxx -c $CDHIT -T $NUMProc -M 0 -g 1 -d 100 &>cdhit.log
-	mawk '{if ($1 ~ /Cl/) clus = clus + 1; else  print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids
+	mawk '{if ($1 ~ /Cl/) clus = clus + 1; else  print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | $sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids
 	paste sort.contig.cluster.ids totaluniqseq > contig.cluster.totaluniqseq
 else
-	sed -e 's/NNNNNNNNNN/	/g' totaluniqseq | cut -f1 | sort --parallel=$NUMProc -S 2G| uniq | mawk '{c= c + 1; print ">dDocent_Contig_" c "\n" $1}' > uniq.F.fasta
+	sed -e 's/NNNNNNNNNN/	/g' totaluniqseq | cut -f1 | $sort --parallel=$NUMProc -S 2G| uniq | mawk '{c= c + 1; print ">dDocent_Contig_" c "\n" $1}' > uniq.F.fasta
 	CDHIT=$(python -c "print (max("$simC" - 0.1,0.8))")
 	cd-hit-est -i uniq.F.fasta -o xxx -c $CDHIT -T $NUMProc -M 0 -g 1 -d 100 &>cdhit.log
-	mawk '{if ($1 ~ /Cl/) clus = clus + 1; else  print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids
+	mawk '{if ($1 ~ /Cl/) clus = clus + 1; else  print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | $sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids
 	paste sort.contig.cluster.ids <(mawk '!/>/' uniq.F.fasta) > contig.cluster.Funiq
-	sed -e 's/NNNNNNNNNN/	/g' totaluniqseq | sort --parallel=$NUMProc -k1 -S 2G | mawk '{print $0 "\t" NR}'  > totaluniqseq.CN
+	sed -e 's/NNNNNNNNNN/	/g' totaluniqseq | $sort --parallel=$NUMProc -k1 -S 2G | mawk '{print $0 "\t" NR}'  > totaluniqseq.CN
 	join -t $'\t' -1 3 -2 1 contig.cluster.Funiq totaluniqseq.CN -o 2.3,1.2,2.1,2.2 > contig.cluster.totaluniqseq
 fi
 
 #CD-hit output is converted to rainbow format
-sort -k2,2 -g contig.cluster.totaluniqseq -S 2G --parallel=$NUMProc | sed -e 's/NNNNNNNNNN/	/g' > rcluster
+$sort -k2,2 -g contig.cluster.totaluniqseq -S 2G --parallel=$NUMProc | sed -e 's/NNNNNNNNNN/	/g' > rcluster
 rainbow div -i rcluster -o rbdiv.out -f 0.5 -K 10
 CLUST=(`tail -1 rbdiv.out | cut -f5`)
 CLUST1=$(( $CLUST / 100 + 1))
@@ -242,9 +250,9 @@
 	sed -e 's/NNNNNNNNNN/	/g' uniq.ua.fasta | cut -f1 > uniq.F.ua.fasta
 	CDHIT=$(python -c "print(max("$simC" - 0.1,0.8))")
 	cd-hit-est -i uniq.F.ua.fasta -o xxx -c $CDHIT -T 0 -M 0 -g 1 -d 100 &>cdhit.log
-	mawk '{if ($1 ~ /Cl/) clus = clus + 1; else  print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids.ua
+	mawk '{if ($1 ~ /Cl/) clus = clus + 1; else  print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | $sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids.ua
 	paste sort.contig.cluster.ids.ua totaluniqseq.ua > contig.cluster.totaluniqseq.ua
-	sort -k2,2 -g -S 2G --parallel=$NUMProc contig.cluster.totaluniqseq.ua | sed -e 's/NNNNNNNNNN/	/g' > rcluster.ua
+	$sort -k2,2 -g -S 2G --parallel=$NUMProc contig.cluster.totaluniqseq.ua | sed -e 's/NNNNNNNNNN/	/g' > rcluster.ua
 	#CD-hit output is converted to rainbow format
 	rainbow div -i rcluster.ua -o rbdiv.ua.out -f 0.5 -K 10
 	if [ "$ATYPE" == "PE" ]; then
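
On systems where the g-prefixed fallbacks do not already exist (typically macOS/BSD), one way to provide them is shown below; this assumes Homebrew is available and is only one of several options:

# Assumption: Homebrew is installed. coreutils provides GNU tools under
# g-prefixed names (gsort, gsed, ...) and gawk provides GNU awk as "gawk".
brew install coreutils gawk
gsort --version | head -1   # should report GNU coreutils
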
bedtools version

Is it really necessary to use bedtools 2.23.0, or should the latest version work as well?

I see the note about the latest vcftools lacking certain options, but couldn't find an explanation of why bedtools 2.23.0 should be required.

Thanks,

Jason
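
For anyone who wants to compare behaviour against the release the documentation names, a pinned install along these lines should work; this assumes conda is installed and that the bioconda channel still carries that version:

# Assumption: conda with the bioconda channel configured; 2.23.0 is the
# release named above, not necessarily the only version that works.
conda install -c bioconda bedtools=2.23.0
bedtools --version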
