msettles / exphts
Python application for "Experimental High Throughput Sequencing"
License: Apache License 2.0
Synopsis: -A works but -c does not. (Matt fixed this in his local repo.)
Complete info:
I created a folder named contaminants. In the folder there is a symlink to a FASTA file that I downloaded from NCBI (and have already used with bowtie2 and bwa). When I try to run this command:
expHTS preprocess -t 24 -w -c contaminants -f samples.txt -a
I get this error message:
Traceback (most recent call last):
  File "/home/msettles/Python_venv/bin/expHTS", line 9, in <module>
    load_entry_point('expHTS==1.0.1.dev2', 'console_scripts', 'expHTS')()
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/__init__.py", line 18, in main
    expHTS.main()
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/expHTS.py", line 167, in main
    commands[args.command].execute(args)
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/preprocessCMD.py", line 108, in execute
    if os.path.exists(args.contaminatesFolder):
AttributeError: 'Namespace' object has no attribute 'contaminatesFolder'
But the usage also shows -A as contaminants (or adapter contaminants). So I changed the command to -A, and it ran, but I don't think that's correct.
Also, it should be spelled "contaminants" (not "contaminates").
Thanks,
Monica
When -O (no overlap), -S (force split), and -a (polyA/T trimming) are used together, a memory issue corrupts the R2 file. -O with -S: no error. -S with -a: no errors. -O with -a: still no errors. Something odd is happening with poly-A/T trimming; it will be turned back on once things are fixed.
It would be nice to report the time per sample and the total processing time.
In the SPAdes output directory there is another subdirectory named after the sample, alongside the normal SPAdes output.
Add an argument --forcePairs to mapping. R1 and R2 reads will be processed normally; SE reads, however, will be split in half, with the front half processed as R1 and the back half reverse-complemented and processed as R2.
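The proposed splitting could be sketched roughly like this (a hypothetical helper, not expHTS code; the function name and return shape are my assumptions):

```python
def force_pair(seq, qual):
    """Split an SE read into a synthetic pair: the front half becomes R1,
    the back half is reverse-complemented (with qualities reversed) to
    become R2. Hypothetical sketch of the --forcePairs idea."""
    comp = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    mid = len(seq) // 2
    r1_seq, r1_qual = seq[:mid], qual[:mid]
    # Reverse-complement the back half so it reads like a true R2
    r2_seq = seq[mid:].translate(comp)[::-1]
    r2_qual = qual[mid:][::-1]
    return (r1_seq, r1_qual), (r2_seq, r2_qual)
```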
It can't be run on new systems; it exits saying contaminant_screen.sh was not found.
The GitHub copy appears to have lagged behind the developments and bug fixes on the GRC server.
One of my labmates grabbed data from the SRA that had the old pair information at the end of the reads (i.e., the additional :1/:2 field after the coordinate information). expHTS ran to completion with no errors but did not output any fastqs. I ran a script that converted the headers to the newer format (a space followed by 1 or 2) and everything worked fine. It is probably specific to one of the called applications, so the older notation is worth considering when writing the new code.
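The conversion script mentioned could look something like the following sketch (the exact set of legacy suffixes handled, /1, /2, :1, :2, is my assumption):

```python
import re

def modernize_header(header):
    """Convert an old-style read header ending in /1, /2, :1, or :2 into
    the newer space-separated pair notation (e.g. '@read... 1').
    Headers that already use the new notation pass through unchanged."""
    m = re.match(r"^(@.+?)[/:]([12])$", header.strip())
    if m:
        return "{} {}".format(m.group(1), m.group(2))
    return header
```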
I'm not sure if you want this installable outside of Linux distros at first, but there are errors and warnings thrown when trying to install, at least on OS X 10.10.5. It appears restricted to finalCleanup.
expHTS fails, with uninformative errors or no errors, when the 'incorrect' versions of called applications (sickle, scythe, etc.) are in $PATH. This could be handled cleanly with git submodules; at a bare minimum, the required versions should be listed.
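Until versions are pinned, a minimal PATH sanity check might look like this (a sketch only: the required minimum versions are undocumented, and some tools print their version on stderr or don't accept --version at all):

```python
import re
import subprocess

def version_ok(version_output, min_version):
    """Pull the first dotted version token out of a tool's version output
    and compare it numerically, component by component, to the minimum."""
    m = re.search(r"(\d+(?:\.\d+)+)", version_output)
    if not m:
        return False
    found = [int(x) for x in m.group(1).split(".")]
    needed = [int(x) for x in min_version.split(".")]
    return found >= needed

def tool_ok(tool, min_version):
    """Check that `tool` exists on PATH and reports a new-enough version.
    Assumes the tool understands --version, which is not universal."""
    try:
        r = subprocess.run([tool, "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        return False
    return version_ok(r.stdout + r.stderr, min_version)
```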
-More robust options
-R readable tables.
-Handling PE and SE
Running expHTS preprocess with polyA trimming, the output fastq files have invalid lines. See the attachment with two examples of this.
Monica Britton
invalid_fastq_lines.txt
log.txt
I see two issues: 1) it's not assembling within the sample folder, and 2) something (possibly post-assembly) with mapping is failing.
Reduce the number of function calls (unique functions).
Go through the preprocess -h help text and clean it up; there are some errors and incorrect text in there.
-Base pairs
-Average Qual
-Number PE and SE reads
I am doing some RNAseq experiments, so I don't want to dedup. My call is:
expHTS preprocessing -m 25 -t 10 --skip-duplicate
finalCleanup.log exists in each folder, i.e.:
A T G C N PolyA_Removed_Reads PolyT_Removed_Reads Short_discarded PE_Kept SE_Kept Forced_Pairs R1_Ave_Len R2_Ave_Len SE_Ave_Len AverageQual
19049046 20709390 18174348 18719907 1784 0 0 0 0 1018970 0 -nan -nan 75.23 34.90
This is just SE data. The summary log is completely empty.
change stats.log to preprocessing_summary.txt
I think this bug was introduced since the last time I tried some cleaning, because I cannot get folders with just PE data to be processed.
I previously processed some data made up of two PE FASTQ files and had no issues. This looks like it was over a month ago. I must have updated things in the interim as part of general server maintenance.
I can no longer get the data processed above to clean. It completes but does nothing, as described in Issue 52. Additionally, when I try to pass various flags/arguments to it, I generally get the same thing EXCEPT for the --skip-duplicates flag, which throws an index error:
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py <(flash2 -Ti <(sickle pe -c <(super_deduper -i <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R1.fastq.gz -2 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R2.fastq.gz -t 5 2>/dev/null) -o stdout 2>02-Cleaned/22MO/PE_filter_info.log) -p stdout 2>02-Cleaned/22MO/PE_deduper_info.log) -m stdout -s /dev/null -t sanger -T 2>02-Cleaned/22MO/PE_sickle_info.log) -M 700 --allow-outies -o 22MO -d 02-Cleaned/22MO -To -c 2>02-Cleaned/22MO/flash_info.log) 0 0 25 02-Cleaned/22MO/22MO
02-Cleaned/22MO
Seconds: 0.0303189754486
Total amount of seconds to run all samples
Seconds: 0.0303189754486
And, with the flag, after blowing away the folder:
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py <(flash2 -Ti <(sickle pe -c <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R1.fastq.gz -2 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R2.fastq.gz -t 5 2>/dev/null) -o stdout 2>02-Cleaned/22MO/PE_filter_info.log) -m stdout -s /dev/null -t sanger -T 2>02-Cleaned/22MO/PE_sickle_info.log) -M 700 --allow-outies -o 22MO -d 02-Cleaned/22MO -To -c 2>02-Cleaned/22MO/flash_info.log) 0 0 25 02-Cleaned/22MO/22MO
02-Cleaned/22MO
Seconds: 1.76817893982
Traceback (most recent call last):
File "/usr/local/bin/expHTS", line 9, in <module>
load_entry_point('expHTS==1.0.1.dev2', 'console_scripts', 'expHTS')()
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/__init__.py", line 18, in main
expHTS.main()
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/expHTS.py", line 167, in main
commands[args.command].execute(args)
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/preprocessCMD.py", line 136, in execute
logFiles.append(parseOut(key[1], key[1].split("/")[-1]))
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/parse_files.py", line 257, in parseOut
printToFile(os.path.join(base, sample + "_SummaryStats.log"), header, data)
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/parse_files.py", line 235, in printToFile
if '\t'.join(map(str, data))[-1] != '\n':
IndexError: string index out of range
I spent several hours trying to track this down (I thought I was going crazy, because what should be a straightforward process was now broken with seemingly no cause). Again, note that this proceeds fine on SE reads. Additionally, I was monitoring server resource usage, and things appear to exit uncleanly; there are Perl, Python, gzip, and Bowtie2 calls that hang around for a while (~1 minute), seemingly doing nothing in a sleep or deferred state, before ultimately being resolved.
This might be a documentation issue, but I've had cases where expHTS just won't work: it runs, takes a fraction of a second, and produces empty results folders minus a few (empty or effectively empty) logs. Sometimes I've gone through and renamed folders, symlinks, etc., and things then work fine for seemingly no reason. This is especially prevalent when combining data from many different sources and experiments (such as a trial sequencing run, an initial sequencing effort, and an additional coverage run) that might have different naming schemes. Other times, there are no problems. This caveat needs to be on the very front of the GitHub page. A recent example:
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py <(flash2 -Ti <(sickle pe -c <(super_deduper -i <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_10460_001_R1.fastq.gz,/scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_1875_001_R1.fastq.gz -2 /scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_10460_001_R2.fastq.gz,/scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_1875_001_R2.fastq.gz -t 20 2>/dev/null) -o stdout 2>02-Cleaned/caroli/PE_filter_info.log) -p stdout 2>02-Cleaned/caroli/PE_deduper_info.log) -m stdout -s /dev/null -t sanger -T 2>02-Cleaned/caroli/PE_sickle_info.log) -M 700 --allow-outies -o caroli -d 02-Cleaned/caroli -To -c 2>02-Cleaned/caroli/flash_info.log) 0 0 50 02-Cleaned/caroli/caroli
02-Cleaned/caroli
Seconds: 0.0292479991913
Total amount of seconds to run all samples
Seconds: 0.0292479991913
/mnt/storage2/scratch2/projects/mouse_references$ expHTS preprocess -t 16
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
No fastq files were found under this directory - 00-RawData/cast
Failure in Validation - exiting process
/mnt/storage2/scratch2/projects/mouse_references/00-RawData/cast$ ls
H12_88996_ERR019144_1.fastq.gz H12_88996_ERR019144_2.fastq.gz
/mnt/storage2/scratch2/projects/mouse_references/00-RawData/cast$ zcat H12_88996_ERR019144_1.fastq.gz | head
@ERR019144.1 IL19_4344:5:1:1016:2407/1
NTACCTACAAATTTAATATTTTCTAAGAGCAGAATGAAACTTCATAGTGTATGTATAAAACATTCCACTATTCTTTCCTGTTCATCAGTTGTTCGATACACAGACTGA
+
!+''+)('%556668:87977577565208855588666+.1-40-544566626525+/-/03+0//11././,,/13(((),60-//(&(&),,)+'(&($$/++
@ERR019144.2 IL19_4344:5:1:1016:13842/1
NGATCTGCACGCCAGACTCACCTTGTGACCTGCCTCTAGTCGGGGGTGTGCTCTCCTGAGGAGAGAGAGGATAGCTACAGAGAGAAGATCACAGCATCGACGTCCTGT
+
!)(,(1/2//023077:;994,3540/+1'01,,-1113388'6%(&%$-/-,.-$#'&$#%(%'&$#%#$%$#'%%&#%#$$%$#$#$)%%%$%&'('%'%)&'+'
@ERR019144.3 IL19_4344:5:1:1016:7439/1
NTTGTTTTACTCAGGTTCCACACATGTGCCTTCAAATTTCATATCATTGTTTTTAATATCTGAGTAATAATCTACTGTGTAAATGTATCAAATTTTATTTATCTATCT
The bug is in this function:
def directoryFiles(self, sampleSeq):
    sampleSeq = tuple(sampleSeq)
    #print sampleSeq
    directoryTest = sampleSeq[0].rstrip()
    #fastqCount insures at least one fastq files is under the directory
    fastqCount = 0
    if self.testDirectory(directoryTest):
        for subdir, dir, files in os.walk(directoryTest):
            for file in files:
                file = os.path.abspath(os.path.join(directoryTest, file))
                if "_R1" in file and ".fastq" in file:
                    fastqCount += 1
                    if not sampleSeq in self.sampleFiles:
                        self.sampleFiles[sampleSeq] = []
                    self.sampleFiles[sampleSeq].append(self.isPairedEnd(file))
        if (fastqCount == 0):
            self.exitTime("No fastq files were found under this directory - " + directoryTest)
    else:
        self.exitTime("Directory " + directoryTest + " does not exists")
You set fastqCount to 0 and then call self.exitTime() below. The counter is never updated because the reads are named (e.g.) H12_88996_ERR019144_1.fastq while the function looks for _R1. A little edge-case action going on.
Edit: new_devel branch too.
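A more permissive filename test, covering both the _R1 convention and the SRA-style _1 suffix seen above, might look like this (a sketch of the idea, not the actual expHTS fix):

```python
import re

# Matches Sample_R1.fastq(.gz), Sample_R1_001.fastq, and SRA-style
# names like H12_88996_ERR019144_1.fastq.gz that `"_R1" in file` misses.
READ1_RE = re.compile(r"_R?1(?:_\d+)?\.(?:fastq|fq)(?:\.gz)?$")

def looks_like_read1(filename):
    """Return True if the filename looks like a read-1 fastq file."""
    return READ1_RE.search(filename) is not None
```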
There are some applications where you wouldn't want to dedup but still do the other stuff. A flag would be a welcome addition for some things other people are working on (expression and all that - not necessarily anything I'm doing currently).
Input/output spelling errors.
This is very important for directional RNA-Seq. If an R1 is discarded, leaving just an R2, it should be reverse-complemented before writing to the SE.fastq file to preserve orientation.
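The requested behavior could be sketched as follows (a hypothetical helper, not expHTS code):

```python
# Translation table for DNA complement, including N and lowercase bases
COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def orphan_r2_to_se(seq, qual):
    """When an R1 is discarded and its mate survives alone, reverse-complement
    the R2 sequence (and reverse its qualities) before writing to SE.fastq,
    so a directional library keeps a consistent strand orientation."""
    return seq.translate(COMP)[::-1], qual[::-1]
```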
Monica Britton
-gzip
-skip cont. screening
I ran:
expHTS preprocess -s -a -O -F 02-Cleaned_dups -w -t 12
with the -s option to see how leaving duplicates affected my results.
I didn't receive any warnings or errors, but my read counts are off for the cleaned fastq files:
Raw read count:
sed -n '1~4p' B2ST1_TTAGGC_L005_R1_001.fastq | grep "^@" | wc -l
6948683
Cleaned read count after running the above command:
sed -n '1~4p' B2_R1.fastq | grep "^@" | wc -l
12843598
(twice as many as it should be, based on my calculations from the preprocessing summary)
First 10 lines of the cleaned fastq file:
head -10 B2_R1.fastq
@HWI-D00289:175:C6K3VANXX:5:1101:1661:2064#0/1
GATGCTTACAAATACAGAGAGACATTTAGATGTGCACCGTCGTCGTCGTCGTCACGCTCGTTCACACACGCGCGCACAAAAGGAAGAAAAAAAGACTTAAGACAAAAGGTTTAAGTACTACAGAC
+
<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFBFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFB
@HWI-D00289:175:C6K3VANXX:5:1101:4385:2074#0/1
GCTAGCTTCTGTATTAAGTCGTAGAGCAGATATATCCCAAACTAATGAAACGTCCTCTCCCCATTTATGAACAAAGTCATTTAACAACATATAGTACCAGGTGTGCGCTATCAGAGTATCCACGAA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-D00289:175:C6K3VANXX:5:1101:4383:2135#0/1
GTCGAACAAACACATTTGATGACTTTAAACTGTTGAATTAGGGTTTTAACTTATGAAGCGATGGATAAATGAGCAGTCCAGGCTAGTTGTAGGGTTTACTGGCTTCCACTGAAATAGGACACAAAG
Cleaned read count WITHOUT the -s option:
5368273
Files are located: /mnt/home/mbenner/GRC_Pipeline/02-Cleaned_dups
Thank you!
discovar errors out with bubbles output and doesn't give a reason
manually running each segment (contaminant screen, dedupe, sickle) doesn't produce the same problem
no flash overlap for any of them
I noticed that some of the individual applications will gzip fastqs (at the expense of speed). Perhaps include a flag to do this? Since gzipping is CPU-light (usually a fraction of one core) yet heavier on I/O, perhaps a thread could be set aside and allocated to compression while other libraries are cleaned. I (and, I'm guessing, others) am going to have to do this anyway to save disk space, so this would let me ditch the find-and-gzip mini-script I'd otherwise need to run afterward.
More relevant error messages could be implemented in several places. For example, I ran expHTS with "-c Contaminants" but did not have a "Contaminants" folder. Instead of an error message like ""Contaminants" does not exist", I got this uninformative error:
Traceback (most recent call last):
  File "/home/msettles/Python_venv/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py", line 112, in <module>
    flag = int(line2[1])
ValueError: invalid literal for int() with base 10: 'does'
Monica Britton
Column SEQUENCE_ID looks for the raw data, and column SAMPLE_ID is the new sample name; when SAMPLE_ID is the first column, expHTS thinks it's SEQUENCE_ID and fails.
All read id headers end in #0; R1 should end in #0/1, R2 should end in #0/2, and SE should end in #0.
The @ is also missing from the beginning of each header line, for R1 and R2.
Start documenting all the madness.