jpuritz / ddocent Goto Github PK
View Code? Open in Web Editor NEWa bash pipeline for RAD sequencing
Home Page: ddocent.com
License: MIT License
a bash pipeline for RAD sequencing
Home Page: ddocent.com
License: MIT License
Hey,
I have been experimenting with some of my datasets and I have found that filtering the bam files prior to SNP calling not only reduces the number of SNPs which come out of freebayes, but also reduces the filtering needed to remove artifacts from my data. Using bamtools filter I remove
I think if you added this as a module before freebayes, then people could add the filters they wanted in a filter script if they wanted to use it.
Something like:
if [[ $FILTER = "yes" ]]; then
if [ -z filter.txt ]; then
echo "Please provide the bam filters"
exit 1
fi
bamtools filter -script filter.txt -in cat-RRG.bam -out cat-RRG_filter.bam;
mv cat-RRG_filter.bam cat-RRG.bam
fi
Then filter.txt
scripts could be something like this:
{
"isPrimaryAlignment" : "true",
"isProperPair" : "true",
"mapQuality" : ">=40"
}
according to the bamtools filter manual.
I've been reading up on Freebayes and some best-practices and found several recommendations (1,2) for using samtools markdup
on alignment .bam files prior to variant calling.
By doing so, freebayes
will skip over alignments marked as duplicates.
Is there a reason dDocent doesn't perform this post-alignment processing step?
When trying to run dDocent
with a config file, we're getting the following error:
mawk: cannot open uniq.fq1 (No such file or directory)
This seems to point to this section of dDocent:
fastp -i uniq.fq -o uniq.fq1 -w $NP -Q &> assemble.trim.log
mawk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' uniq.fq1 | paste - - | sort -k1,1 -V | tr "\t" "\n" > uniq.fasta
mawk '!/>/' uniq.fasta > totaluniqseq
rm uniq.fq*
When we look at the assemble.trim.log
file, we can see that the value for $NP
is not getting passed to the -w
option:
head assemble.trim.log
head logfiles/assemble.trim.log
option value is invalid: --thread=-Q
usage: fastp [options] ...
options:
-i, --in1 read1 input file name (string [=])
-o, --out1 read1 output file name (string [=])
-I, --in2 read2 input file name (string [=])
-O, --out2 read2 output file name (string [=])
--unpaired1 for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
--unpaired2 for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --umpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
--failed_out specify the file to store reads that cannot pass the filters. (string [=])
We confirmed that $NUMProc
is getting the correct number of processors that it pulls from the config file:
head dDocent.runs
Variables used in dDocent Run at Tue Nov 26 04:38:06 PST 2019
Number of Processors
8
Maximum Memory
0
Trimming
yes
Assembly?
yes
Type_of_Assembly
Any idea what's happening and/or suggestions on how to fix it?
We haven't attempted to "hard code" the number of processors in the fastp -w
option, and would like to avoid doing so.
Thanks for any insight.
Hi, I am very naively and very literally trying to follow your installation instructions on https://ddocent.wordpress.com/ddocent-pipeline-user-guide/. Is this " $sh install_dDocent.FB_requirements" the same as install_dDocent_requirements in the this Github repo (cuz the names are different)? A link, a mention of copying this file to the users directory, and what format is should be in (I'm guessing .sh?) would all be useful things for the naive user. If these are not the same thing, then can you tell me where to look?
The dDocent.FB script appears to lookup the location for the samtools executable and assumes that all its dependencies are also located here (e.g. rainbow, ddocent itself etc.). While this may work in common situations, it does not play happy with HPC environments where each software is installed in its own directory and activated via 'environment modules'. Please consider revising the software search code; looking up an environment variable such as RAINBOW_DIR etc. might be a good solution.
apply a minimum coverage filter to the cov.stats
prior to making the mapped.*.bed
files, which will ultimately reduce the number of contigs genotyped
either the number of individuals or 2x the number of individuals. There's really not much point in evaluating a contig if there's not at least 1 read per allele per locus per individual.
2x in my current situation, plus the time saved later in filtering
# calculate 2 x NumInd - 1
minCOV=$(echo $(($(wc -l namelist | cut -d" " -f1) * 2 - 1)))
# make file of contigs with low coverage
mawk -v minCOV=$minCOV '$4 < minCOV {print $1}' cov.stats | uniq > low.cov.contigs
# isolate contigs with high coverage
grep -f low.cov.contigs -vF cov.stats > low.cov.stats
# apply changes to cov.stats
mv low.cov.stats cov.stats
# cleanup intermediate files
rm low.cov.contigs
Starting on line 54 there is a chunk of code that looks like it should install vcflib, but instead installs freebayes instead.
echo "Checking for vcflib"
if which vcfcombine &>/dev/null; then
echo "FreeBayes is already installed"
else
echo "Downloading and installing FreeBayes"
git clone --recursive git://github.com/ekg/freebayes.git
cd freebayes
Dear @jpuritz could you create a new release of dDocent? The latest is 2.7.8 from May 2019. Seems that the current code advanced a bit (2.8.12).
Then we also get a new conda package.
These lines of code (895-901) in the main bash script only get uniq seqs from F reads, then rejoin F&R even though they are no longer paired and there are more R than F.
if [ "$ATYPE" = "RPE" ]; then
cat namelist | parallel --no-notice -j $NUMProc "paste {}.forward {}.reverse | $sort -k1 -S 200M > {}.fr"
cat namelist | parallel --no-notice -j $NUMProc "cut -f1 {}.fr | uniq -c > {}.f.uniq && cut -f2 {}.fr > {}.r"
cat namelist | parallel --no-notice -j $NUMProc "mawk '$AWK4' {}.f.uniq > {}.f.uniq.e"
cat namelist | parallel --no-notice -j $NUMProc "paste -d '-' {}.f.uniq.e {}.r | mawk '$AWK3'| sed -e 's/-/NNNNNNNNNN/' | sed -e '$SED1' | sed -e '$SED2'> {}.uniq.seqs"
rm *.f.uniq.e *.f.uniq *.r *.fr
else
This is more of curiosity than issue. Why did dDocent restrict itself to using mawk
? In an earlier version apparently gawk was supported, then dropped at commit 1310cbc. I would like to suggest using the more portable name (awk
) if there is no absolute reason to use only mawk
. I understand that mawk
seems to be the fastest awk out there.
I haven't experienced this directly, but was contacted by someone using dDocent on a lot of samples such that calls to ls
piped into other things created errors where the message said there were too many arguments for ls
. This also occurred towards the end of the pipeline where things like rm mapped.*.bed
were occurring.
ls
with calls to find
rm ...
with find . -name ... | rm
or similarThe conda package manager now supports dDocent installation. You could add this alternative installation method to the documentation somewhere:
Install Miniconda: http://conda.pydata.org/miniconda.html
Add the bioconda channel:
conda config --add channels bioconda
Create a dDocent conda environment:
conda create -n ddocent_env ddocent=2.1
Activate the dDocent environment:
source activate ddocent_env
Run dDocent:
dDocent ...
Close the environment when you're done:
source deactivate
Hi Jon,
we've been trying to set-up the dDOCENT pipeline on our University HPC (VSC) and we have some issues.
We installed all tools, also had to reinstall some different versions, but I fear it is still a bit tricky to run the pipeline on its own. We might need to go for a custom scripts to run each tool separately.
Our first analysis returned a gzip broken pipeline error (see attached error log), any idea what we should finetune to get it to run on a HPC (local run was ok with same test dataset).
Cheers,
Greg
Four times in the code there's a line like
if head -1 reference.fasta | grep -e 'dDocent' reference.fasta 1>/dev/null; then
The logic is to take the first line of the reference fasta (head command) containing the name of the first chromosome/scaffold. The string is then passed to grep to check if it's a dDocent-created file. However the grep command specifies again the reference file, thus forcing the scan of the entire uncompressed reference, which can easily be in the GB of data. The correct command should be
if head -1 reference.fasta | grep -e 'dDocent' 1>/dev/null; then
without the file name. In this way useless computation is avoided.
This issue is present four times in the code:
FIRST INSTANCE
Line 341 in 9718247
SECOND INSTANCE
Line 424 in 9718247
THIRD INSTANCE, slightly different in the searched string. The issue remains.
Line 341 in 9718247
FOURTH INSTANCE
Line 424 in 9718247
It just occurred to me that
Line 364 in 3615f8e
samtools sort
parallel
callAs it is now, the mapping part of dDocent maximizes threading per individual rather than across individuals (I'm a fan, as it quickens time-till-first-error), so my guess is that same strategy can be applied to the samtools sort
call, or just split sort jobs across threads.
Do you have any thoughts on this?
I recently had variant calling finish, and somewhere in the terminal log was a minor mention of mawk being unable to open mapped.somenumber.bed
because of an open file limit. The issue is that despite this error, freebayes still ran and dDocent still finished, and had I not spotted that single mention of mawk failure, I wouldn't have noticed something was wrong. That instance of freebayes ran for ~24hrs and since changing the ulimit, it's been running >48hrs (and still going), so the error had a pretty substantial impact on how many instances of freebayes dDocent was creating.
I think I isolated the mawk
part creating the mapped.??.bed
files. Is there a way to introduce logic to this so it will error and exit if it fails, rather than continuing to run freebayes anyway?
Lines 497 to 504 in 270bc93
When I start dDocent, I get the following error but dDocent allows me to keep going. This might be causing other errors downstream. I have recently installed dDocent 2.7 and it pinged me to update quite a few dependencies.
When I looked at line 70 of the dDocent script on github, it is the requirement that seqtk be at least 1.2, mine was only 1.0.
$ dDocent
dDocent 2.7.1
Contact [email protected] with any problems
Checking for required software
/local/home/michelles/14_programs/dDocent/dDocent: line 70: [: 1.0-r82: integer expression expected
All required software is installed!
dDocent run started Sun Nov 18 09:55:01 EST 2018
When installing locally, cutadapt installation fails with this error:
error: could not create '/usr/lib64/python2.6/site-packages/cutadapt': Permission denied
running python setup.py install --user
works, but then you have to append ~/.local/bin
to the PATH
Also, FreeBays v0.9.21-19-gc003c1e is install which leads to this error/warning.
-bash-4.1$ dDocent.FB
dDocent 1.2
Contact [email protected] with any problems
Checking for required software
The version of FreeBayes installed in your $PATH is not optimized for dDocent.
Please install a version 0.9.20
We've been going through the tutorial at https://github.com/jpuritz/dDocent/blob/master/tutorials/Reference%20Assembly%20Tutorial.md.
First, this is fabulous! Thanks for your efforts to create such a high-quality document and corresponding scripts...
We're running dDocent here on CentOS 6 and FreeBSD and we encountered a few portability issues with the tutorial on the FreeBSD side:
This is not an issue for someone who follows your instructions to the letter, but if someone tries to run your downloaded bash scripts as "./script.sh" instead of "bash script.sh", the #!/bin/bash will fail. Changing the shebang line to #!/usr/bin/env bash would make it 99% portable.
In generating rainbow.fasta, the "cat rbasm.out <(echo "E")" fails. I replaced this with the following:
echo "E" > endmarker
cat rbasm.out endmarker |sed 's/[0-9]:[0-9]://g' | mawk ' {
getAssemblyInfo(){
#Have user estimate/enter assembly parameters if unentered
if [ -z "$CUTOFF" ]; then
for i in {2..20};
do
echo $i >> pfile
done
cat pfile | parallel -j $NUMProc --no-notice "echo -n {}xxx && mawk -v x={} '\$1 >= x' uniq.seqs | wc -l" | mawk '{gsub("xxx","\t",$0); print;}'| $sort -g > uniqseq.data
rm pfile
#Plot graph of above data
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
set xrange [2:20]
unset label
set title "Number of Unique Sequences with More than X Coverage (Counted within individuals)"
set xlabel "Coverage"
set ylabel "Number of Unique Sequences"
set lmargin 10
plot 'uniqseq.data' with lines notitle
pause -1
EOF
echo -en "\007"
echo -en "\007"
echo -en "\007"
echo -e "Please choose data cutoff. In essence, you are picking a minimum (within individual) coverage level for a read (allele) to be used in the reference assembly"
read CUTOFF
fi
Solution confirmed on 3 different HPC:
# add the following line prior to parallel --env line, which saves the whole current environment
parallel --record-env
# modify by replacing --env <function> with --env _ which give everything in the current environment to parallel
parallel --no-notice -j $NUMProc --env _
Hi,
I am running dDocent for 164 individuals with pair-end sequencing and reference genome on dDocent 2.7.8. I used a configuration file to run in HPC. The trimming step went really well, while now I'm stuck in mapping reads with bwa. It had been stuck in one of my individuals for more than ten hours, and can not generate *-RG.bam and *-RG.bam.bai, and nothing in its *.bam.log file. My raw data is about 60 GB, and I set the maximum memory to 200G, while now there are a total of 151GB files, is this the reason?
Here is the config.file I used.
Number of Processors
20
Maximum Memory
200
Trimming
yes
Assembly?
no
Type_of_Assembly
PE
Clustering_Similarity%
0.85
Minimum within individual coverage level to include a read for assembly (K1)
6
Minimum number of individuals a read must be present in to include for assembly (K2)
6
Mapping_Reads?
yes
Mapping_Match_Value
1
Mapping_MisMatch_Value
4
Mapping_GapOpen_Penalty
6
Calling_SNPs?
yes
I'd really appreciate it if anyone could give me some suggestions.
Thanks,
Tong
Hi Jon,
did you consider updating vcftools commands, so dDocent could use new versions of vcftools?
So far, I found these changes in vcftools
flags to be made:
--geno to --max-missing
--missing to --missing-indv
Am I missing something?
I could make these changes if you want.
I encountered errors like:
/home/ubuntu/software/dDocent/dDocent: line 1383: warning: here-document at line 833 delimited by end-of-file (wanted `EOF')
/home/ubuntu/software/dDocent/dDocent: line 1384: syntax error: unexpected end of file
There are extra spaces after EOF at lines (which I removed):
https://github.com/jpuritz/dDocent/blob/master/dDocent#L778
https://github.com/jpuritz/dDocent/blob/master/dDocent#L833
I've also removed the trailing tab before EOF at lines:
https://github.com/jpuritz/dDocent/blob/master/dDocent#L789
https://github.com/jpuritz/dDocent/blob/master/dDocent#L843
And then it works fine.
dDocent version: 2.8.9
bash version: 4.4.19(1)-release (x86_64-pc-linux-gnu)
I've noticed that missing data in my merged vcf file following genotyping are haploid ".:" while all genotypes with data are correctly diploid. Running thesed -e 's/ \.\:/ \.\/\.\:/g'
line on the merged vcf file does not resolve the issue. However, using 's/.:.:.:.:.:.:.:./.\/\.:.:.:.:.:.:.:./g'
does resolve ploidy of missing genotypes. Just curious as to why the line from the pipeline was not functioning.
Missing data within the raw*.vcf files is coded as a single "." without additional fields
Using vcflib ver 1.0.2 for vcfcombine.
Cheers,
-Alex
I found that the command
bedtools merge -i cat-RRG.bam -bed > mapped.bed
uses a lot of memory and often gets the pipeline killed. An efficient alternative is to use the bedops suite, and in particular substitute the previous code with:
bedtools bamtobed -i cat-RRG.bam > cat-RRG.bed
bedops --merge cat-RRG.bed > mapped.bed
This approach uses a trivial amount of memory. It requires the extra bam -> bed transformation, but it allowed me to run the pipeline on a previously unavailable system.
Since it is an extra tool, it would require an update in the installation instruction. Fortunately bedops is in conda, so
conda install bedops
is enough. The problematic command is present in two places in the dDocent code:
Line 407 in 9718247
Line 1197 in 9718247
RefMapOpt.sh gives this error which I think is related to commit 251dd42
:
Checking for required software
All required software is installed!
dDocent RefMapOpt version 2.8.8
Trimmed sequences found, proceeding with optimization.
/bin/bash: line 11: 226.075 + 100 : syntax error: invalid arithmetic operator (error token is ".075 + 100 ")
[...]
mawk: cannot open 2.2.results (No such file or directory)
Trying to run dDocent in non-interactive mode and I get an error that my config file is improperly configured:
")syntax error: invalid arithmetic operator (error token is "ine 202: [[: 6
Configuration file is not properly configured. Please see example on dDocent.com or the dDocent GitHub page.
I copied and pasted the config file structure from the dDocent tutorial, input my parameters instead, and can't tell what's wrong (below). Am I missing something?
Thanks!
My config file:
Number of Processors
6
Maximum Memory
50
Trimming
yes
Assembly?
no
Type_of_Assembly
Clustering_Similarity%
Minimum within individual coverage level to include a read for assembly (K1)
Minimum number of individuals a read must be present in to include for assembly (K2)
Mapping_Reads?
yes
Mapping_Match_Value
1
Mapping_MisMatch_Value
4
Mapping_GapOpen_Penalty
6
Calling_SNPs?
yes
Email
[email protected]
I needed to run some older Illumina reads through dDocent, and I am getting these two specific errors that are preventing the pipeline from completing:
Trimming reads and simultaneously assembling reference sequences
Removing the _1 character and replacing with /1 in the name of every sequence
gzip: FRS_001.F.fq.gz: unexpected end of file
gzip: FRS_002.F.fq.gz: unexpected end of file
gzip: FRS_003.F.fq.gz: unexpected end of file
gzip: FRS_004.F.fq.gz: unexpected end of file
It's unclear why this is happening, as the files were zipped (gzip) using the default settings. The trim logs seem to indicate the trimming proceeds to completion,
rainbow
error appears that shuts the whole process down:Now sit back, relax, and wait for your analysis to finish
/home/saillantslab/miniconda3/envs/ddocent/bin/dDocent: line 841: 1269 Segmentation fault (core dumped) rainbow div -i rcluster -o rbdiv.out -f 0.5 -K 10
/home/saillantslab/miniconda3/envs/ddocent/bin/dDocent: line 882: / 100 + 1: syntax error: operand expected (error token is "/ 100 + 1")
After the process shuts down, here is a list of what's left in the working directory, with filesizes on the left:
4096 unpaired
0 trim.log
1027337 rcluster.gz
1040 assemble.trim.log
109509 uniq.F.fasta.gz
1104453 totaluniqseq.gz
1112 cdhit.log
1255 FRS_003.trim.log
1256 FRS_001.trim.log
1256 FRS_002.trim.log
1256 FRS_004.trim.log
1264501 uniq.fasta.gz
136 sort.contig.cluster.ids
164 uniqseq.data
215 xxx
24 uniqseq.peri.data
30 rbdiv.out.gz
324 lengths.txt
32 namelist
4373229 contig.cluster.totaluniqseq
43948224 FRS_003.uniq.seqs
4415013 uniq.k.4.c.2.seqs
44420402 FRS_002.uniq.seqs
46093742 FRS_001.uniq.seqs
475 dDocent.runs
50114803 FRS_004.uniq.seqs
50720656 FRS_003.R2.fq.gz
50840802 uniq.seqs.gz
51329776 FRS_003.R.fq.gz
51936612 FRS_002.R2.fq.gz
5234827 uniq.full.fasta
52537196 FRS_002.R.fq.gz
58284180 FRS_001.R2.fq.gz
58939121 FRS_001.R.fq.gz
60670176 FRS_003.R1.fq.gz
61136939 FRS_003.F.fq.gz
62141619 FRS_002.R1.fq.gz
62677396 FRS_002.F.fq.gz
63877600 FRS_004.R2.fq.gz
64686395 FRS_004.R.fq.gz
69506759 FRS_001.R1.fq.gz
70096018 FRS_001.F.fq.gz
76271641 FRS_004.R1.fq.gz
76923721 FRS_004.F.fq.gz
890 xxx.clstr
9863900 uniqCperindv
9939 dDocent_main.LOG
To be honest, this is old data that someone else (years ago) has preprocessed somewhat to remove UMI elements, so it's unclear to me if the issue is with the software, or the inputs I am giving the dDocent. For the sake of being thorough, here is what the format of the input fasta files looks like:
@cluster_1562 CATCTCCT
ATGAAGGGAACTACATTTCCCATATTTCATGAAAAGAGTGGGTGAGCATGATGTTTTCACACCAACTTTCAGGTGTCGTTC
+
?BB@?A:3@??A=>@EEB?<??=>EEBA@BB@DC===AAB;@AB=B:7B@<9@DEEBA636?:A=>DEBA:A@9@A=0EB?
@cluster_1579 GATATGGT
TTGCGAAGCATCTAGTATTGTCACACTCCGTTACTCAACACTATGTATGATGCGCTTTTCTGTGATATCTCGTGGTACTCTTTTT
+
A<->:03<,4.5@1692=2/119>:=400/4B0/592<4=.26244.55/1)/B74;3;?@4:1/1-94D.<A6/;3244BDD@C
Any insight as to where the issue stems from would be appreciated. Thank you!
Several of the filters (allele balance, mapping quality, quality/depth) contained in dDocent_filters.sh are not run with the same parameters as the step-by-step examples in the SNP filtering tutorial. Here's the relevant line (17) in dDocent_filters.sh:
vcffilter -s -f "AB > 0.2 & AB < 0.8 | AB < 0.01 | AB > 0.99" -s -g "QR > 0 | QA > 0 " $1 | vcffilter -s -f "QUAL / DP > 0.2" | vcffilter -s -f "MQM / MQMR > 0.25 & MQM / MQMR < 1.75" > $2
And the respective parameters from the step-by-step tutorial:
AB > 0.25 & AB < 0.75 | AB < 0.01
MQM / MQMR > 0.9 & MQM / MQMR < 1.05
QUAL / DP > 0.25
A fresh conda
install of dDocent with conda install -c bioconda ddocent
installs this version of freebayes:
freebayes bioconda/linux-64::freebayes-0.9.21.7-0
such that when dDocent
is invoked, this error appears and dDocent exits:
/home/pdimens/.conda/envs/ddocent2/bin/dDocent: line 6: export: `2.7.8': not a valid identifier
dDocent 2.7.8
Contact [email protected] with any problems
Checking for required software
The version of FreeBayes installed in your $PATH is not optimized for dDocent.
Please install at least version 1.0.0
The problem is easily fixed with conda update freebayes
though.
I'm unsure if this is an actual issues, but it seems the output of freebayes doesn't jive with the vcftools (0.1.16) that's bundled with conda's dDocent recipe, as it spits out many initial warnings:
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf BFT_biallelic_noindel.recode.vcf
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
Warning: Expected at least 2 parts in INFO entry: ID=PRO,Number=1,Type=Float,Description="Reference allele observation count, with partial observations recorded fractionally">
Warning: Expected at least 2 parts in INFO entry: ID=PAO,Number=A,Type=Float,Description="Alternate allele observations, with partial observations recorded fractionally">
Warning: Expected at least 2 parts in INFO entry: ID=SRP,Number=1,Type=Float,Description="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=SAP,Number=A,Type=Float,Description="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
Warning: Expected at least 2 parts in INFO entry: ID=ABP,Number=A,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=RPP,Number=A,Type=Float,Description="Read Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=RPPR,Number=1,Type=Float,Description="Read Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=EPP,Number=A,Type=Float,Description="End Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=EPPR,Number=1,Type=Float,Description="End Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing. Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
Warning: Expected at least 2 parts in FORMAT entry: ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
Warning: Expected at least 2 parts in FORMAT entry: ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
After filtering, kept 646 out of 646 Individuals
With the recent version of dDocent (v 2.9.1) fastp 0.19.5 does not have the correct arguments.
It appears that fastp v 0.19.7 does have the correct arguments so I would recommend adding to the software check that fastp is version 0.19.7 or higher.
This might be something that isn't easily generalizable across all users, but ran across this today digging through the dDocent wrapper.
ADAPTERS
appear to be set here:
if find ${PATH//:/ } -maxdepth 1 -name TruSeq2-PE.fa 2> /dev/null | grep -q 'Tru' ; then
ADAPTERS=$(find ${PATH//:/ } -maxdepth 1 -name TruSeq2-PE.fa 2> /dev/null | head -1)
...
Then used here:
#Function for trimming reads using trimmomatic
TrimReads () {
...
for i in "${NAMES[@]}"
do
#echo "Trimming Sample $i"
if [ -f $i.R.fq.gz ]; then
java -jar $TRIMMOMATIC PE -threads $NUMProc -phred33 $i.F.fq.gz $i.R.fq.gz $i.R1.fq.gz $i.unpairedF.fq.gz $i.R2.fq.gz $i.unpairedR.fq.gz ILLUMINACLIP:$ADAPTERS:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:10 $TW &> $i.trim.log
else java -jar $TRIMMOMATIC SE -threads $NUMProc -phred33 $i.F.fq.gz $i.R1.fq.gz ILLUMINACLIP:$ADAPTERS:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:10 $TW &> $i.trim.log
...
}
Shouldn't this be the SE adapters for SE ddRAD, or at least some choice between which adapter set to use TruSeq2 vs 3, for example? Maybe this is more of a feature request than an issue, but wanted to get it logged. ;-)
This is dDocent 2.24
I'm happy to come up with something, if you want a PR. Just let me know.
Thanks!
Hey John,
Reinstalled dDocent on a new systems through miniconda, got this error:
RefMapOpt.sh 3 3 5 5 0.990 PE 16
Checking for required software
/miniconda3/envs/ddocent_env/bin/RefMapOpt.sh: line 27: [: : integer expression expected
The version of Samtools installed in your $PATH is not optimized for dDocent.
Please install at least version 1.3.0
Let me know if you want any log info.
Best,
~ JT
Ubuntu 14.04 by default has a limit of 1024 open files, controlled by the value of the ulimit variable, and the CreateIntervals function will fail at the samtools merge step if there are more than 1024 bam files to be merged. Adding a few lines of code can avoid this problem by auto-adjusting the ulimit variable for the current terminal session.
Line 133-135 of the dDocent script evaluate the number of individuals and create the NumInd variable; one way to deal with the limit on open files is to test for $NumInd > 1000 and reset the ulimit value if necessary:
if [ "$NumInd" -gt 1000 ]; then
NewLimit=$(($NumInd + 100))
ulimit -n $NewLimit
fi
I consistently get the below error message when I run dDocent 2.8.13 (conda installation) on the SLURM-based computation cluster (CentOS Linux release 7.9.2009) at my end. The correct final output files seem to be created though. I was wondering if there were any quick-fix solutions here? Thank you very much.
Using FreeBayes to call SNPs
sh: /dev/tty: No such device or address
I have samples of same species from different populations (different locations). I would like to incorporate this information while calling SNPs using freeBayes. How can I do that ?
Hi,
a recent run on a large dataset (both min thresholds set to 2) was stuck on the CD-HIT step; perhaps something like
weizhongli/cdhit#18
I suggest to replace CD-HIT in dDocent by linclust:
https://github.com/soedinglab/MMseqs2/wiki#linclust
I just tried it out and it will produce almost the same result and be much faster than CD-HIT. You can install it via conda:
conda install -c conda-forge -c bioconda mmseqs2
Best regards, Mathias
dDocent hit an error during trimming and I can't tell why.
Here's what's printed to the terminal:
/home/nclowell/miniconda2/envs/updated_dDocent/bin/dDocent: line 6: export: `2.7.8': not a valid identifier
dDocent 2.7.8
Contact [email protected] with any problems
Checking for required software
All required software is installed!
dDocent version 2.7.8 started Wed Sep 11 10:10:23 PDT 2019
At this point, all configuration information has been entered and dDocent may take several hours to run.
It is recommended that you move this script to a background operation and disable terminal input and output.
All data and logfiles will still be recorded.
To do this:
Press control and Z simultaneously
Type 'bg' without the quotes and press enter
Type 'disown -h' again without the quotes and press enter
Now sit back, relax, and wait for your analysis to finish
Trimming reads
(updated_dDocent) nclowell@ubuntu:/mnt/hgfs/E/dDocent_configfile_test$
Here's the content of the trim log:
mawk: cannot open lengths.txt (No such file or directory)
/home/nclowell/miniconda2/envs/updated_dDocent/bin/dDocent: line 688: / 2: syntax error: operand expected (error token is "/ 2")
Hi Jonathan,
As part of my work with the Genomics for Aotearoa New Zealand society, I am building docker containers for bioinformatics tools. It would be great to build a fully functional one for dDocent. Unfortunately, the PEAR software has two significant issues prohibiting inclusion. The first is that it is just not available in a way one can script up an install. The second is that it does not have a suitable license for software distribution.
On the other hand, PANDAseq is available from their github repo under free software licensing terms. https://github.com/neufeld/pandaseq It implements many algorithms, including the one used by PEAR. If you wouldn't mind substituting PANDAseq for PEAR, I would then be able to build a docker container of dDocent with full functionality.
Thank you for considering my request.
--Rob
Hi.
I'm pretty new to genomics and using the command line, but I was wondering if this is an easy fix:
/Documents/ddocent-master$ ls install_dDocent.GATK_requirements rename.for.dDocent
bars.names install_dDocent.FB_requirements README.md
bars.names
dDocent.FB LICENSE Rename_for_dDocent.sh
dDocent.GATK mergefq.pl
~/Documents/ddocent-master$ sh rename.for.dDocent bars.names
rename.for.dDocent: 9: rename.for.dDocent: Syntax error: "(" unexpected (expecting "fi")
Can't seem to get barcode file to work - instructions followed exactly as on the blog. As far as I am aware I have the newest version of dDocent.
Also, are there sample data sets to available for dDocent?
Cheers,
~ jat101
Hi Jon,
I think that trimming with fastp
would benefit from adding the actual adapters using the --adapter_fasta flag. Below I've pasted the results of a small test showing the presence of adapters from readthrough of the P2 in the forward reads of one sample. P_007.F.fq.gz
is untrimmed, P_007.nofa.fq.gz
is trimmed with only the adaptive trimming, and P_007.trimfa.fq.gz
uses the --adapter_fasta flag and the TruSeq adapter FASTA. I'm grepping for the P2 here:
$ zgrep -c AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC P_007.F.fq.gz
2144
$ zgrep -c AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC P_007.nofa.fq.gz
2144
$ zgrep -c AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC P_007.trimfa.fq.gz
0
I think what might be happening is the auto adapter detection in fastp
is mistaking some other repetitive sequences in the data for the adapters. It prints the sequence it detects, but it's not the true adapter:
Detecting adapter sequence for read1...
CATGCAATATTCTAACTGATAAATCATGACATGACATCCAAGATTTCAAAATGTCATGCC
I can't see that there is a way to turn off the auto-detection, but it seems that adding the TrueSeq adapters specifically lets it search for those in addition to the ones it detects.
Line 789 in eed5a7d
When answering no
to the incorrect count of individuals in a directory, dDocent returns an error indicating the incorrect naming convention for the individuals:
Please double check that all fastq files are named Ind01.F.fq.gz and Ind01.R.fq.gz
instead of how it appears correctly in different error prompts:
Please rename individuals to: Locality_Individual.F.fq.gz
Install fails on any cluster compute node that doesn't have curses libraries (probably most of them). This is an easy fix by updating install_dDocent_requirements line 129 to read:
./configure --without-curses
I would make a PR but seems silly for such a small change
Received this error:
Creating alignment intervals
mawk: cannot open "mapped.2918.bed" for output (Too many open files)
Genomic interval creation failed. This may be related to the maximum number of open files.
I am unsure how to fix.
Please advise.
Not sure if this is a common issue, but when running ReferenceOpt.sh
I got standard output in my kopt.data
file.
See excerpt:
4 1 0.88 ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH10.F.fq.gz Reverse reads file.................: DH10.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.213765 C: 0.305644 G: 0.278709 T: 0.201882 19671 uncalled bases Assemblying reads: 0%
Assemblying reads: 26%
Assemblying reads: 51%
Assemblying reads: 79%
Assemblying reads: 100% Assembled reads ...................: 248,318 / 262,242 (94.690%) Discarded reads ...................: 405 / 262,242 (0.154%) Not assembled reads ...............: 13,519 / 262,242 (5.155%) Assembled reads file...............: DH10.assembled.fastq Discarded reads file...............: DH10.discarded.fastq Unassembled forward reads file.....: DH10.unassembled.forward.fastq Unassembled reverse reads file.....: DH10.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH15.F.fq.gz Reverse reads file.................: DH15.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.213404 C: 0.306030 G: 0.278267 T: 0.202299 9590 uncalled bases Assemblying reads: 0%
Assemblying reads: 26%
Assemblying reads: 51%
Assemblying reads: 79%
Assemblying reads: 100% Assembled reads ...................: 249,053 / 262,194 (94.988%) Discarded reads ...................: 204 / 262,194 (0.078%) Not assembled reads ...............: 12,937 / 262,194 (4.934%) Assembled reads file...............: DH15.assembled.fastq Discarded reads file...............: DH15.discarded.fastq Unassembled forward reads file.....: DH15.unassembled.forward.fastq Unassembled reverse reads file.....: DH15.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH16.F.fq.gz Reverse reads file.................: DH16.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.213534 C: 0.306293 G: 0.278708 T: 0.201464 8191 uncalled bases Assemblying reads: 0%
Assemblying reads: 26%
Assemblying reads: 50%
Assemblying reads: 77%
Assemblying reads: 100% Assembled reads ...................: 254,211 / 267,994 (94.857%) Discarded reads ...................: 167 / 267,994 (0.062%) Not assembled reads ...............: 13,616 / 267,994 (5.081%) Assembled reads file...............: DH16.assembled.fastq Discarded reads file...............: DH16.discarded.fastq Unassembled forward reads file.....: DH16.unassembled.forward.fastq Unassembled reverse reads file.....: DH16.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH20.F.fq.gz Reverse reads file.................: DH20.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.214051 C: 0.304835 G: 0.279797 T: 0.201317 3080 uncalled bases Assemblying reads: 0%
Assemblying reads: 22%
Assemblying reads: 43%
Assemblying reads: 67%
Assemblying reads: 90%
Assemblying reads: 100% Assembled reads ...................: 295,485 / 307,638 (96.050%) Discarded reads ...................: 70 / 307,638 (0.023%) Not assembled reads ...............: 12,083 / 307,638 (3.928%) Assembled reads file...............: DH20.assembled.fastq Discarded reads file...............: DH20.discarded.fastq Unassembled forward reads file.....: DH20.unassembled.forward.fastq Unassembled reverse reads file.....: DH20.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH22.F.fq.gz Reverse reads file.................: DH22.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.213903 C: 0.304844 G: 0.279855 T: 0.201398 6650 uncalled bases Assemblying reads: 0%
Assemblying reads: 19%
Assemblying reads: 38%
Assemblying reads: 57%
Assemblying reads: 77%
Assemblying reads: 96%
Assemblying reads: 100% Assembled reads ...................: 344,335 / 358,909 (95.939%) Discarded reads ...................: 140 / 358,909 (0.039%) Not assembled reads ...............: 14,434 / 358,909 (4.022%) Assembled reads file...............: DH22.assembled.fastq Discarded reads file...............: DH22.discarded.fastq Unassembled forward reads file.....: DH22.unassembled.forward.fastq Unassembled reverse reads file.....: DH22.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH33.F.fq.gz Reverse reads file.................: DH33.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.214020 C: 0.304925 G: 0.280528 T: 0.200526 12950 uncalled bases Assemblying reads: 0%
Assemblying reads: 26%
Assemblying reads: 50%
Assemblying reads: 77%
Assemblying reads: 100% Assembled reads ...................: 259,069 / 267,732 (96.764%) Discarded reads ...................: 282 / 267,732 (0.105%) Not assembled reads ...............: 8,381 / 267,732 (3.130%) Assembled reads file...............: DH33.assembled.fastq Discarded reads file...............: DH33.discarded.fastq Unassembled forward reads file.....: DH33.unassembled.forward.fastq Unassembled reverse reads file.....: DH33.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH34.F.fq.gz Reverse reads file.................: DH34.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.214097 C: 0.304848 G: 0.279696 T: 0.201359 16310 uncalled bases Assemblying reads: 0%
Assemblying reads: 19%
Assemblying reads: 38%
Assemblying reads: 57%
Assemblying reads: 78%
Assemblying reads: 96%
Assemblying reads: 100% Assembled reads ...................: 344,165 / 355,884 (96.707%) Discarded reads ...................: 331 / 355,884 (0.093%) Not assembled reads ...............: 11,388 / 355,884 (3.200%) Assembled reads file...............: DH34.assembled.fastq Discarded reads file...............: DH34.discarded.fastq Unassembled forward reads file.....: DH34.unassembled.forward.fastq Unassembled reverse reads file.....: DH34.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH35.F.fq.gz Reverse reads file.................: DH35.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.214095 C: 0.304591 G: 0.280127 T: 0.201186 3290 uncalled bases Assemblying reads: 0%
Assemblying reads: 23%
Assemblying reads: 44%
Assemblying reads: 69%
Assemblying reads: 92%
Assemblying reads: 100% Assembled reads ...................: 288,858 / 299,368 (96.489%) Discarded reads ...................: 64 / 299,368 (0.021%) Not assembled reads ...............: 10,446 / 299,368 (3.489%) Assembled reads file...............: DH35.assembled.fastq Discarded reads file...............: DH35.discarded.fastq Unassembled forward reads file.....: DH35.unassembled.forward.fastq Unassembled reverse reads file.....: DH35.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH38.F.fq.gz Reverse reads file.................: DH38.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.213124 C: 0.305703 G: 0.279985 T: 0.201188 6160 uncalled bases Assemblying reads: 0%
Assemblying reads: 22%
Assemblying reads: 43%
Assemblying reads: 66%
Assemblying reads: 88%
Assemblying reads: 100% Assembled reads ...................: 300,670 / 312,379 (96.252%) Discarded reads ...................: 118 / 312,379 (0.038%) Not assembled reads ...............: 11,591 / 312,379 (3.711%) Assembled reads file...............: DH38.assembled.fastq Discarded reads file...............: DH38.discarded.fastq Unassembled forward reads file.....: DH38.unassembled.forward.fastq Unassembled reverse reads file.....: DH38.unassembled.reverse.fastq ____ _____ _ ____ | _ \| ____| / \ | _ \ | |_) | _| / _ \ | |_) | | __/| |___ / ___ \| _ < |_| |_____/_/ \_\_| \_\ PEAR v0.9.6 [January 15, 2015] Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593 Forward reads file.................: DH39.F.fq.gz Reverse reads file.................: DH39.R.fq.gz PHRED..............................: 33 Using empirical frequencies........: YES Statistical method.................: OES Maximum assembly length............: 999999 Minimum assembly length............: 100 p-value............................: 0.010000 Quality score threshold (trimming).: 0 Minimum read size after trimming...: 1 Maximal ratio of uncalled bases....: 1.000000 Minimum overlap....................: 10 Scoring method.....................: Scaled score Threads............................: 6 Allocating memory..................: 200,000,000 bytes Computing empirical frequencies....: DONE A: 0.213268 C: 0.305838 G: 0.279499 T: 0.201395 9065 uncalled bases Assemblying reads: 0%
Assemblying reads: 18%
Assemblying reads: 35%
Assemblying reads: 52%
Assemblying reads: 71%
Assemblying reads: 88%
Assemblying reads: 100% Assembled reads ...................: 377,077 / 389,570 (96.793%) Discarded reads ...................: 199 / 389,570 (0.051%) Not assembled reads ...............: 12,294 / 389,570 (3.156%) Assembled reads file...............: DH39.assembled.fastq Discarded reads file...............: DH39.discarded.fastq Unassembled forward reads file.....: DH39.unassembled.forward.fastq Unassembled reverse reads file.....: DH39.unassembled.reverse.fastq 7
4 1 0.90 8
4 1 0.92 8
4 1 0.94 8
4 1 0.96 13
This is not a real issue but just a suggestion.
it would be nice to have releases tags for the application so it's easier to download the required version. You can see an example of release tags here https://github.com/samtools/htslib/releases
and here is the github documentation about it https://help.github.com/articles/creating-releases/
Also set SHELL to bash as "parallel" does not work with some other shells and uses $SHELL to run child processes.
Lastly fix logic for choosing awk command. If "awk" is not GNU then we should use "gawk".
--- dDocent-test/Data/remake_reference.sh 2018-06-13 15:38:07.762081000 -0500
+++ remake_reference.sh 2018-06-13 15:30:42.031595000 -0500
@@ -1,16 +1,23 @@
export LC_ALL=en_US.UTF-8
+export SHELL=bash
if [[ -z "$5" ]]; then
echo "Usage is sh remake_reference.sh K1 K2 similarity% Assembly_Type Number_of_Processors"
exit 1
fi
-if ! awk --version | fgrep -v GNU &>/dev/null; then
+if ! awk --version | fgrep GNU &>/dev/null; then
awk=gawk
else
awk=awk
fi
+if ! sort --version | fgrep GNU &>/dev/null; then
if find ${PATH//:/ } -maxdepth 1 -name trimmomaticjar 2> /dev/null| grep -q 'trim' ; then
TRIMMOMATIC=$(find ${PATH//:/ } -maxdepth 1 -name trimmomaticjar 2> /dev/null | head -1)
else
@@ -52,7 +59,7 @@
cat namelist | parallel --no-notice -j $NUMProc "zcat {}.F.fq.gz | mawk '$AWK1' | mawk '$AWK2' > {}.forward"
cat namelist | parallel --no-notice -j $NUMProc "zcat {}.R.fq.gz | mawk '$AWK1' | mawk '$AWK2' > {}.reverse"
if [ "$ATYPE" = "RPE" ]; then
cat namelist | parallel --no-notice -j $NUMProc "paste {}.forward {}.reverse | sort -k1 -S 100M > {}.fr"
cat namelist | parallel --no-notice -j $NUMProc "paste {}.forward {}.reverse | $sort -k1 -S 100M > {}.fr"
cat namelist | parallel --no-notice -j $NUMProc "cut -f1 {}.fr | uniq -c > {}.f.uniq && cut -f2 {}.fr > {}.r"
cat namelist | parallel --no-notice -j $NUMProc "mawk '$AWK4' {}.f.uniq > {}.f.uniq.e"
cat namelist | parallel --no-notice -j $NUMProc "paste -d '-' {}.f.uniq.e {}.r | mawk '$AWK3'| sed 's/-/NNNNNNNNNN/' | sed -e '$SED1' | sed -e '$SED2'> {}.uniq.seqs"
@@ -110,12 +117,12 @@
parallel --no-notice -j $NUMProc mawk -v x=$CUTOFF ''$1 >= x'' ::: *.uniq.seqs | cut -f2 | sed 's/NNNNNNNNNN/-/' > total.uniqs
cut -f 1 -d "-" total.uniqs > total.u.F
cut -f 2 -d "-" total.uniqs > total.u.R
@@ -123,7 +130,7 @@
parallel --no-notice mawk -v x=$CUTOFF ''$1 >= x'' ::: *.uniq.seqs | cut -f2 | perl -e 'while (<>) {chomp; $z{$_}++;} while(($k,$v) = each(%z)) {print "$v\t$k\n";}' | mawk -v x=$CUTOFF2 '$1 >= x' > uniq.k.$CUTOFF.c.$CUTOFF2.seqs
fi
-sort -k1 -r -n --parallel=$NUMProc -S 2G uniq.k.$CUTOFF.c.$CUTOFF2.seqs |cut -f2 > totaluniqseq
+$sort -k1 -r -n --parallel=$NUMProc -S 2G uniq.k.$CUTOFF.c.$CUTOFF2.seqs |cut -f2 > totaluniqseq
mawk '{c= c + 1; print ">dDocent_Contig_" c "\n" $1}' totaluniqseq > uniq.full.fasta
LENGTH=$(mawk '!/>/' uniq.full.fasta | mawk '(NR==1||length<shortest){shortest=length} END {print shortest}')
LENGTH=$(($LENGTH * 3 / 4))
@@ -149,6 +156,7 @@
if [ -s "rbdiv.out.$1" ]; then
rainbow merge -o rbasm.out.$1 -a -i rbdiv.out.$1 -r 2 -N10000 -R10000 -l 20 -f 0.75
more rbasm.out.$1
fi
}
@@ -159,21 +167,21 @@
sed -e 's/NNNNNNNNNN/ /g' uniq.fasta | cut -f1 > uniq.F.fasta
CDHIT=$(python -c "print (max("$simC" - 0.1,0.8))")
cd-hit-est -i uniq.F.fasta -o xxx -c $CDHIT -T $NUMProc -M 0 -g 1 -d 100 &>cdhit.log
mawk '{if ($1 ~ /Cl/) clus = clus + 1; else print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids
mawk '{if ($1 ~ /Cl/) clus = clus + 1; else print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | $sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids
paste sort.contig.cluster.ids totaluniqseq > contig.cluster.totaluniqseq
else
sed -e 's/NNNNNNNNNN/ /g' totaluniqseq | cut -f1 | sort --parallel=$NUMProc -S 2G| uniq | mawk '{c= c + 1; print ">dDocent_Contig_" c "\n" $1}' > uniq.F.fasta
sed -e 's/NNNNNNNNNN/ /g' totaluniqseq | cut -f1 | $sort --parallel=$NUMProc -S 2G| uniq | mawk '{c= c + 1; print ">dDocent_Contig_" c "\n" $1}' > uniq.F.fasta
CDHIT=$(python -c "print (max("$simC" - 0.1,0.8))")
cd-hit-est -i uniq.F.fasta -o xxx -c $CDHIT -T $NUMProc -M 0 -g 1 -d 100 &>cdhit.log
mawk '{if ($1 ~ /Cl/) clus = clus + 1; else print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids
mawk '{if ($1 ~ /Cl/) clus = clus + 1; else print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | $sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids
paste sort.contig.cluster.ids <(mawk '!/>/' uniq.F.fasta) > contig.cluster.Funiq
sed -e 's/NNNNNNNNNN/ /g' totaluniqseq | sort --parallel=$NUMProc -k1 -S 2G | mawk '{print $0 "\t" NR}' > totaluniqseq.CN
sed -e 's/NNNNNNNNNN/ /g' totaluniqseq | $sort --parallel=$NUMProc -k1 -S 2G | mawk '{print $0 "\t" NR}' > totaluniqseq.CN
join -t $'\t' -1 3 -2 1 contig.cluster.Funiq totaluniqseq.CN -o 2.3,1.2,2.1,2.2 > contig.cluster.totaluniqseq
fi
#CD-hit output is converted to rainbow format
tail -1 rbdiv.out | cut -f5
) mawk '{if ($1 ~ /Cl/) clus = clus + 1; else print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids.ua
mawk '{if ($1 ~ /Cl/) clus = clus + 1; else print $3 "\t" clus}' xxx.clstr | sed 's/[>dDocent_Contig_,...]//g' | $sort -g -k1 -S 2G --parallel=$NUMProc > sort.contig.cluster.ids.ua
paste sort.contig.cluster.ids.ua totaluniqseq.ua > contig.cluster.totaluniqseq.ua
sort -k2,2 -g -S 2G --parallel=$NUMProc contig.cluster.totaluniqseq.ua | sed -e 's/NNNNNNNNNN/ /g' > rcluster.ua
$sort -k2,2 -g -S 2G --parallel=$NUMProc contig.cluster.totaluniqseq.ua | sed -e 's/NNNNNNNNNN/ /g' > rcluster.ua
#CD-hit output is converted to rainbow format
rainbow div -i rcluster.ua -o rbdiv.ua.out -f 0.5 -K 10
if [ "$ATYPE" == "PE" ]; then
Is it really necessary to use bedtools 2.23.0, or should the latest version work as well?
I see the note about the latest vcftools lacking certain options, but couldn't find an explanation of why bedtools 2.23.0 should be required.
Thanks,
Jason
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.