msettles / exphts
Python application for "Experimental High Throughput Sequencing"
License: Apache License 2.0
Synopsis: -A works but -c does not. (Matt fixed this in his local repo.)
Complete info:
I created a folder named contaminants. In the folder there is a symlink to a FASTA file that I downloaded from NCBI (and have already used with bowtie2 and bwa). When I try to run this command:
expHTS preprocess -t 24 -w -c contaminants -f samples.txt -a
I get this error message:
Traceback (most recent call last):
  File "/home/msettles/Python_venv/bin/expHTS", line 9, in <module>
    load_entry_point('expHTS==1.0.1.dev2', 'console_scripts', 'expHTS')()
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/__init__.py", line 18, in main
    expHTS.main()
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/expHTS.py", line 167, in main
    commands[args.command].execute(args)
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/preprocessCMD.py", line 108, in execute
    if os.path.exists(args.contaminatesFolder):
AttributeError: 'Namespace' object has no attribute 'contaminatesFolder'
But the usage also shows -A as contaminants (or adapter contaminants). So I changed the command to -A, and it ran, but I don't think that's correct.
Also, it should be spelled "contaminants" (not "contaminates").
Thanks,
Monica
When -O (no overlap), -S (force split), and -a (polyA/T trimming) are used together, a memory issue corrupts the R2 file. -O with -S: no error. -S with -a: no errors. -O with -a: still no errors. Something odd is happening with poly-A/T trimming; it will be turned back on once things are fixed.
It would be nice to report the time per sample and the total processing time.
In the SPAdes output directory there is another subdirectory named after the sample, alongside the normal SPAdes output.
Add an argument --forcePairs to mapping. R1 and R2 reads will be processed normally; SE reads, however, will be split in half, with the front half processed as R1 and the back half reverse-complemented and processed as R2.
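The proposed splitting could be sketched roughly like this (a hypothetical helper, not expHTS code; the function name and return shape are my assumptions):

```python
def force_pair(seq, qual):
    """Split an SE read into a synthetic pair: the front half becomes R1,
    the back half is reverse-complemented (with qualities reversed) to
    become R2. Hypothetical sketch of the --forcePairs idea."""
    comp = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    mid = len(seq) // 2
    r1_seq, r1_qual = seq[:mid], qual[:mid]
    # Reverse-complement the back half so it reads like a true R2
    r2_seq = seq[mid:].translate(comp)[::-1]
    r2_qual = qual[mid:][::-1]
    return (r1_seq, r1_qual), (r2_seq, r2_qual)
```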
It can't be run on new systems; it exits saying contaminant_screen.sh was not found.
The GitHub copy appears to have lagged behind the developments and bug fixes on the GRC server.
One of my labmates grabbed data from the SRA that had the old pair information at the end of the reads (i.e., the additional :1/:2 field after the coordinate information). expHTS ran to completion with no errors but did not output any fastqs. I ran a script that converted the headers to the newer format (a space followed by 1 or 2) and everything worked fine. It is probably specific to one of the called applications, so the older notation is worth considering when writing the new code.
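The conversion script mentioned could look something like the following sketch (the exact set of legacy suffixes handled, /1, /2, :1, :2, is my assumption):

```python
import re

def modernize_header(header):
    """Convert an old-style read header ending in /1, /2, :1, or :2 into
    the newer space-separated pair notation (e.g. '@read... 1').
    Headers that already use the new notation pass through unchanged."""
    m = re.match(r"^(@.+?)[/:]([12])$", header.strip())
    if m:
        return "{} {}".format(m.group(1), m.group(2))
    return header
```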
I'm not sure if you want this installable outside of Linux distros at first, but there are errors and warnings thrown when trying to install, at least on OS X 10.10.5. It appears restricted to finalCleanup.
expHTS fails, with uninformative errors or no errors, when the 'incorrect' versions of called applications (sickle, scythe, etc.) are in $PATH. This could be handled cleanly with git submodules; at a bare minimum, the required versions should be listed.
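Until versions are pinned, a minimal PATH sanity check might look like this (a sketch only: the required minimum versions are undocumented, and some tools print their version on stderr or don't accept --version at all):

```python
import re
import subprocess

def version_ok(version_output, min_version):
    """Pull the first dotted version token out of a tool's version output
    and compare it numerically, component by component, to the minimum."""
    m = re.search(r"(\d+(?:\.\d+)+)", version_output)
    if not m:
        return False
    found = [int(x) for x in m.group(1).split(".")]
    needed = [int(x) for x in min_version.split(".")]
    return found >= needed

def tool_ok(tool, min_version):
    """Check that `tool` exists on PATH and reports a new-enough version.
    Assumes the tool understands --version, which is not universal."""
    try:
        r = subprocess.run([tool, "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        return False
    return version_ok(r.stdout + r.stderr, min_version)
```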
-More robust options
-R readable tables.
-Handling PE and SE
Running expHTS preprocess with polyA trimming, the output fastq files have invalid lines. See the attachment with two examples of this.
Monica Britton
invalid_fastq_lines.txt
log.txt
I see two issues: 1) it's not assembling within the sample folder, and 2) something (possibly post-assembly) with mapping is failing.
Reduce the number of function calls (unique functions).
Go through the preprocess -h help text and clean it up; there are some errors and incorrect text in there.
-Base pairs
-Average Qual
-Number PE and SE reads
I am doing some RNAseq experiments, so I don't want to dedup. My call is:
expHTS preprocessing -m 25 -t 10 --skip-duplicate
finalCleanup.log exists in each folder, i.e.:
A T G C N PolyA_Removed_Reads PolyT_Removed_Reads Short_discarded PE_Kept SE_Kept Forced_Pairs R1_Ave_Len R2_Ave_Len SE_Ave_Len AverageQual
19049046 20709390 18174348 18719907 1784 0 0 0 0 1018970 0 -nan -nan 75.23 34.90
This is just SE data. The summary log is completely empty.
change stats.log to preprocessing_summary.txt
I think this bug was introduced since the last time I tried some cleaning, because I cannot get folders with just PE data to be processed.
I previously processed some data made up of two PE FASTQ files and had no issues. This looks like it was over a month ago. I must have updated things in the interim as part of general server maintenance.
I can no longer get the data processed above to clean. It completes but does nothing, as described in Issue 52. Additionally, when I try to pass various flags/arguments to it, I generally get the same thing EXCEPT for the --skip-duplicates flag, which throws an index error:
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py <(flash2 -Ti <(sickle pe -c <(super_deduper -i <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R1.fastq.gz -2 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R2.fastq.gz -t 5 2>/dev/null) -o stdout 2>02-Cleaned/22MO/PE_filter_info.log) -p stdout 2>02-Cleaned/22MO/PE_deduper_info.log) -m stdout -s /dev/null -t sanger -T 2>02-Cleaned/22MO/PE_sickle_info.log) -M 700 --allow-outies -o 22MO -d 02-Cleaned/22MO -To -c 2>02-Cleaned/22MO/flash_info.log) 0 0 25 02-Cleaned/22MO/22MO
02-Cleaned/22MO
Seconds: 0.0303189754486
Total amount of seconds to run all samples
Seconds: 0.0303189754486
And, with the flag, after blowing away the folder:
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py <(flash2 -Ti <(sickle pe -c <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R1.fastq.gz -2 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R2.fastq.gz -t 5 2>/dev/null) -o stdout 2>02-Cleaned/22MO/PE_filter_info.log) -m stdout -s /dev/null -t sanger -T 2>02-Cleaned/22MO/PE_sickle_info.log) -M 700 --allow-outies -o 22MO -d 02-Cleaned/22MO -To -c 2>02-Cleaned/22MO/flash_info.log) 0 0 25 02-Cleaned/22MO/22MO
02-Cleaned/22MO
Seconds: 1.76817893982
Traceback (most recent call last):
File "/usr/local/bin/expHTS", line 9, in <module>
load_entry_point('expHTS==1.0.1.dev2', 'console_scripts', 'expHTS')()
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/__init__.py", line 18, in main
expHTS.main()
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/expHTS.py", line 167, in main
commands[args.command].execute(args)
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/preprocessCMD.py", line 136, in execute
logFiles.append(parseOut(key[1], key[1].split("/")[-1]))
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/parse_files.py", line 257, in parseOut
printToFile(os.path.join(base, sample + "_SummaryStats.log"), header, data)
File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/parse_files.py", line 235, in printToFile
if '\t'.join(map(str, data))[-1] != '\n':
IndexError: string index out of range
I spent several hours trying to track this down (I thought I was going crazy, because what should be a straightforward process was now broken with seemingly no cause). Again, note that this proceeds fine on SE reads. Additionally, I was monitoring server resource usage, and things appear to exit uncleanly; there are Perl, Python, gzip, and Bowtie2 calls that hang around for a while (~1 minute), seemingly doing nothing in a sleep or deferred state, before ultimately being resolved.
This might be a documentation issue, but I've had cases where expHTS just won't work: it runs, takes a fraction of a second, and produces empty results folders minus a few (empty or effectively empty) logs. Sometimes I've gone through and renamed folders, symlinks, etc., and things then work fine for seemingly no reason. This is especially prevalent when combining data from many different sources and experiments (such as a trial sequencing run, an initial sequencing effort, and an additional coverage run) that might have different naming schemes. Other times, there are no problems. This caveat needs to be on the very front of the GitHub page. A recent example:
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py <(flash2 -Ti <(sickle pe -c <(super_deduper -i <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_10460_001_R1.fastq.gz,/scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_1875_001_R1.fastq.gz -2 /scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_10460_001_R2.fastq.gz,/scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_1875_001_R2.fastq.gz -t 20 2>/dev/null) -o stdout 2>02-Cleaned/caroli/PE_filter_info.log) -p stdout 2>02-Cleaned/caroli/PE_deduper_info.log) -m stdout -s /dev/null -t sanger -T 2>02-Cleaned/caroli/PE_sickle_info.log) -M 700 --allow-outies -o caroli -d 02-Cleaned/caroli -To -c 2>02-Cleaned/caroli/flash_info.log) 0 0 50 02-Cleaned/caroli/caroli
02-Cleaned/caroli
Seconds: 0.0292479991913
Total amount of seconds to run all samples
Seconds: 0.0292479991913
/mnt/storage2/scratch2/projects/mouse_references$ expHTS preprocess -t 16
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
No fastq files were found under this directory - 00-RawData/cast
Failure in Validation - exiting process
/mnt/storage2/scratch2/projects/mouse_references/00-RawData/cast$ ls
H12_88996_ERR019144_1.fastq.gz H12_88996_ERR019144_2.fastq.gz
/mnt/storage2/scratch2/projects/mouse_references/00-RawData/cast$ zcat H12_88996_ERR019144_1.fastq.gz | head
@ERR019144.1 IL19_4344:5:1:1016:2407/1
NTACCTACAAATTTAATATTTTCTAAGAGCAGAATGAAACTTCATAGTGTATGTATAAAACATTCCACTATTCTTTCCTGTTCATCAGTTGTTCGATACACAGACTGA
+
!+''+)('%556668:87977577565208855588666+.1-40-544566626525+/-/03+0//11././,,/13(((),60-//(&(&),,)+'(&($$/++
@ERR019144.2 IL19_4344:5:1:1016:13842/1
NGATCTGCACGCCAGACTCACCTTGTGACCTGCCTCTAGTCGGGGGTGTGCTCTCCTGAGGAGAGAGAGGATAGCTACAGAGAGAAGATCACAGCATCGACGTCCTGT
+
!)(,(1/2//023077:;994,3540/+1'01,,-1113388'6%(&%$-/-,.-$#'&$#%(%'&$#%#$%$#'%%&#%#$$%$#$#$)%%%$%&'('%'%)&'+'
@ERR019144.3 IL19_4344:5:1:1016:7439/1
NTTGTTTTACTCAGGTTCCACACATGTGCCTTCAAATTTCATATCATTGTTTTTAATATCTGAGTAATAATCTACTGTGTAAATGTATCAAATTTTATTTATCTATCT
The bug is in this function:
def directoryFiles(self, sampleSeq):
    sampleSeq = tuple(sampleSeq)
    #print sampleSeq
    directoryTest = sampleSeq[0].rstrip()
    #fastqCount insures at least one fastq files is under the directory
    fastqCount = 0
    if self.testDirectory(directoryTest):
        for subdir, dir, files in os.walk(directoryTest):
            for file in files:
                file = os.path.abspath(os.path.join(directoryTest, file))
                if "_R1" in file and ".fastq" in file:
                    fastqCount += 1
                    if not sampleSeq in self.sampleFiles:
                        self.sampleFiles[sampleSeq] = []
                    self.sampleFiles[sampleSeq].append(self.isPairedEnd(file))
        if (fastqCount == 0):
            self.exitTime("No fastq files were found under this directory - " + directoryTest)
    else:
        self.exitTime("Directory " + directoryTest + " does not exists")
You set fastqCount to 0 and then call self.exitTime() below. The counter is never updated because the reads are named (e.g.) H12_88996_ERR019144_1.fastq while the function looks for _R1. A little edge-case action going on.
Edit: new_devel branch too.
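A more permissive filename test, covering both the _R1 convention and the SRA-style _1 suffix seen above, might look like this (a sketch of the idea, not the actual expHTS fix):

```python
import re

# Matches Sample_R1.fastq(.gz), Sample_R1_001.fastq, and SRA-style
# names like H12_88996_ERR019144_1.fastq.gz that `"_R1" in file` misses.
READ1_RE = re.compile(r"_R?1(?:_\d+)?\.(?:fastq|fq)(?:\.gz)?$")

def looks_like_read1(filename):
    """Return True if the filename looks like a read-1 fastq file."""
    return READ1_RE.search(filename) is not None
```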
There are some applications where you wouldn't want to dedup but still do the other stuff. A flag would be a welcome addition for some things other people are working on (expression and all that - not necessarily anything I'm doing currently).
Input/output spelling errors.
This is very important for directional RNA-Seq. If an R1 is discarded, leaving just an R2, it should be reverse-complemented before writing to the SE.fastq file to preserve orientation.
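The requested behavior could be sketched as follows (a hypothetical helper, not expHTS code):

```python
# Translation table for DNA complement, including N and lowercase bases
COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def orphan_r2_to_se(seq, qual):
    """When an R1 is discarded and its mate survives alone, reverse-complement
    the R2 sequence (and reverse its qualities) before writing to SE.fastq,
    so a directional library keeps a consistent strand orientation."""
    return seq.translate(COMP)[::-1], qual[::-1]
```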
Monica Britton
-gzip
-skip cont. screening
I ran:
expHTS preprocess -s -a -O -F 02-Cleaned_dups -w -t 12
with the -s option to see how leaving duplicates affected my results.
I didn't receive any warnings or errors, but my read counts are off for the cleaned fastq files:
Raw read count:
sed -n '1~4p' B2ST1_TTAGGC_L005_R1_001.fastq | grep "^@" | wc -l
6948683
Cleaned read count after running the above command:
sed -n '1~4p' B2_R1.fastq | grep "^@" | wc -l
12843598
(twice as many as it should be, based on my calculations from the preprocessing summary)
First 10 lines of the cleaned fastq file:
head -10 B2_R1.fastq
@HWI-D00289:175:C6K3VANXX:5:1101:1661:2064#0/1
GATGCTTACAAATACAGAGAGACATTTAGATGTGCACCGTCGTCGTCGTCGTCACGCTCGTTCACACACGCGCGCACAAAAGGAAGAAAAAAAGACTTAAGACAAAAGGTTTAAGTACTACAGAC
+
<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFBFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFB
@HWI-D00289:175:C6K3VANXX:5:1101:4385:2074#0/1
GCTAGCTTCTGTATTAAGTCGTAGAGCAGATATATCCCAAACTAATGAAACGTCCTCTCCCCATTTATGAACAAAGTCATTTAACAACATATAGTACCAGGTGTGCGCTATCAGAGTATCCACGAA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-D00289:175:C6K3VANXX:5:1101:4383:2135#0/1
GTCGAACAAACACATTTGATGACTTTAAACTGTTGAATTAGGGTTTTAACTTATGAAGCGATGGATAAATGAGCAGTCCAGGCTAGTTGTAGGGTTTACTGGCTTCCACTGAAATAGGACACAAAG
Cleaned read count WITHOUT the -s option:
5368273
Files are located: /mnt/home/mbenner/GRC_Pipeline/02-Cleaned_dups
Thank you!
discovar errors out with bubbles output and doesn't give a reason
manually running each segment (contaminant screen, dedupe, sickle) doesn't produce the same problem
no flash overlap for any of them
I noticed that some of the individual applications will gzip fastqs (at the expense of speed). Perhaps include a flag to do this? Since gzipping is CPU-light (usually a fraction of one core) yet heavier on I/O, perhaps a thread could be set aside and allocated to compression while other libraries are cleaned. I (and, I'm guessing, others) am going to have to do this anyway to save disk space, so this would let me ditch the find-and-gzip mini-script I'd otherwise need to run afterward.
More relevant error messages could be implemented in several places. For example, I ran expHTS with "-c Contaminants" but did not have a "Contaminants" folder. Instead of an error message like ""Contaminants" does not exist", I got this uninformative error:
Traceback (most recent call last):
  File "/home/msettles/Python_venv/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py", line 112, in <module>
    flag = int(line2[1])
ValueError: invalid literal for int() with base 10: 'does'
Monica Britton
Column SEQUENCE_ID looks for the raw data, and column SAMPLE_ID is the new sample name; when SAMPLE_ID is the first column, expHTS thinks it's SEQUENCE_ID and fails.
All read id headers end in #0; R1 should end in #0/1, R2 should end in #0/2, and SE should end in #0.
The @ is also missing from the beginning of each header line, for R1 and R2.
Start documenting all the madness.