
exphts's People

Contributors

msettles

exphts's Issues

Contaminants confusion

Synopsis: -A works but -c does not. (Matt fixed this in his local repo.)

Complete info:
I created a folder called contaminants. In the folder there is a symlink to a FASTA file that I downloaded from NCBI (and have already used with bowtie2 and bwa). When I try to run this command:

expHTS preprocess -t 24 -w -c contaminants -f samples.txt -a

I get this error message:
Traceback (most recent call last):
  File "/home/msettles/Python_venv/bin/expHTS", line 9, in <module>
    load_entry_point('expHTS==1.0.1.dev2', 'console_scripts', 'expHTS')()
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/__init__.py", line 18, in main
    expHTS.main()
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/expHTS.py", line 167, in main
    commands[args.command].execute(args)
  File "/home/msettles/Python_venv/local/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/preprocessCMD.py", line 108, in execute
    if os.path.exists(args.contaminatesFolder):
AttributeError: 'Namespace' object has no attribute 'contaminatesFolder'

But the usage also shows -A as contaminants (or adapter contaminants), so I changed the command to -A and it ran, but I don't think that's correct.

Also, it should be spelled "contaminants".
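For context, this traceback pattern is what argparse produces when an option is registered under one `dest` name but the code reads a differently spelled attribute. A minimal, hypothetical reproduction (the real flag and dest names in expHTS may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical: the flag is registered under one attribute name...
parser.add_argument("-c", "--contaminants", dest="contaminantsFolder")
args = parser.parse_args(["-c", "contaminants"])

print(args.contaminantsFolder)     # the registered attribute is fine
try:
    args.contaminatesFolder        # ...the misspelled one is not
except AttributeError as e:
    print(e)                       # mirrors the traceback above
```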

Thanks,

Monica

finalCleanup.py - memory leak

When -O (no overlap), -S (force split), and -a (polyA/T trimming) are used together, a memory leak messes up the R2 file. -O and -S: no error; -S and -a: no errors; -O and -a: still no errors. Something odd is happening with polyA/T trimming; it will be turned back on once things are fixed.

--forcePairs in mapping

Add an argument --forcePairs within mapping. R1 and R2 reads will be processed normally; however, SE reads will be split in half, with the front half processed as R1 and the second half reverse-complemented (RC) and processed as R2.
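The described splitting could be sketched like this (an illustrative sketch, not the expHTS implementation; here an odd-length read gives the extra base to R2):

```python
def force_pair(seq, qual):
    """Split a single-end read into a synthetic pair: the front half
    becomes R1, and the back half is reverse-complemented for R2.
    (Illustrative sketch only, not the expHTS implementation.)"""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    mid = len(seq) // 2
    r1 = (seq[:mid], qual[:mid])
    r2_seq = "".join(comp[b] for b in reversed(seq[mid:]))
    r2 = (r2_seq, qual[mid:][::-1])  # qualities reversed to stay in sync
    return r1, r2
```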

Older-formatted reads with pair information do not produce output fastq files

One of my labmates grabbed data from the SRA that had the old pair information at the end of the read names (i.e., the additional /1 or /2 field after the coordinate information). expHTS ran to completion with no errors but did not output any fastqs. I ran a script that converted the names to the newer format (\s1 or \s2) and everything worked fine. This is probably specific to one of the applications, so you might want to account for the older notation when writing the new code.
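The workaround script could look something like this (a hedged sketch; `modernize_header` is a hypothetical name, and the exact Casava-style replacement your downstream tools expect may differ):

```python
import re

def modernize_header(header):
    """Convert an old-style FASTQ header ending in /1 or /2 into the
    newer Casava 1.8+ style with a space-separated pair field.
    (Hypothetical sketch of the conversion described above.)"""
    m = re.match(r"^(@.*)/([12])$", header.rstrip())
    if m:
        return "{} {}:N:0:".format(m.group(1), m.group(2))
    return header.rstrip()
```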

not portable to OSX

Not sure if you want this installable outside of Linux distros at first, but errors and warnings are thrown when trying to install, at least on OS X 10.10.5. The problem appears restricted to finalCleanup.

Explicit version checks and application version information

expHTS fails, with uninformative errors or no errors at all, when the 'incorrect' versions of the called applications (sickle, scythe, etc.) are in $PATH. This could be handled cleanly with git submodules; alternatively, at a bare minimum, the required versions should be listed.
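A minimal version probe could be sketched as follows (assumptions: that each tool accepts a `--version` flag, and the minimum versions themselves; note that many bioinformatics tools print their version to stderr rather than stdout):

```python
import re
import subprocess

def parse_version(text):
    """Extract the first dotted version number as a comparable tuple."""
    m = re.search(r"(\d+(?:\.\d+)+)", text)
    return tuple(map(int, m.group(1).split("."))) if m else None

def tool_version(tool):
    """Run `tool --version`, capturing both streams, since tools
    differ in where they print. (Sketch only; assumes the tool
    supports a --version flag.)"""
    p = subprocess.run([tool, "--version"], capture_output=True, text=True)
    return parse_version(p.stdout + p.stderr)

# e.g. fail fast at startup if tool_version("sickle") < (1, 33)
```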

expHTS mapping

- More robust options
- R-readable tables
- Handling of PE and SE

Refactor

Reduce the number of function calls (unique functions).

help doc

Go through the preprocess -h help text and clean it up; there are some errors and incorrect text in there.

Preprocessing_Summary.log is empty when skipping certain steps

I am doing some RNAseq experiments, so I don't want to dedup. My call is:

expHTS preprocessing -m 25 -t 10 --skip-duplicate

finalCleanup.log exists in each folder, i.e.:

A       T       G       C       N       PolyA_Removed_Reads     PolyT_Removed_Reads     Short_discarded    PE_Kept SE_Kept Forced_Pairs    R1_Ave_Len      R2_Ave_Len      SE_Ave_Len      AverageQual

19049046        20709390        18174348        18719907        1784    0       0       0       0 1018970  0                       -nan    -nan    75.23   34.90

This is just SE data. The summary log is completely empty.
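Incidentally, the -nan averages in the log above come from dividing by a zero read count when a read class is empty (here R1/R2 for SE-only input). A guard along these lines would keep the log clean (hypothetical helper, not expHTS code):

```python
def safe_mean(total_length, read_count):
    """Average read length, reporting 0.0 instead of -nan/NaN
    when the read class is empty. (Hypothetical guard.)"""
    return float(total_length) / read_count if read_count else 0.0
```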

Stand-alone PE cleaning does not appear to be working

I think this bug was introduced since the last time I did some cleaning, because I can no longer get folders with just PE data to process.

I previously processed some data made up of two PE FASTQ files and had no issues. This looks like it was over a month ago. I must have updated things in the interim as part of general server maintenance.

I can no longer get the data processed above to clean. It completes but does nothing, as described in Issue 52. Additionally, when I try to pass various flags/arguments to it, I generally get the same thing EXCEPT for the --skip-duplicates flag, which throws an index error:

super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py  <(flash2 -Ti <(sickle pe -c <(super_deduper -i <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py  <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R1.fastq.gz  -2 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R2.fastq.gz   -t 5 2>/dev/null)   -o stdout 2>02-Cleaned/22MO/PE_filter_info.log)  -p stdout 2>02-Cleaned/22MO/PE_deduper_info.log)   -m stdout -s /dev/null -t sanger -T  2>02-Cleaned/22MO/PE_sickle_info.log)   -M 700 --allow-outies -o 22MO -d 02-Cleaned/22MO -To -c  2>02-Cleaned/22MO/flash_info.log)   0 0 25 02-Cleaned/22MO/22MO
02-Cleaned/22MO
Seconds: 0.0303189754486
Total amount of seconds to run all samples
Seconds: 0.0303189754486

And, with the flag, after blowing away the folder:

super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py  <(flash2 -Ti <(sickle pe -c <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py  <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R1.fastq.gz  -2 /scratch/1/brice/coding_expression/capture_data/test/00-RawData/22MO/22MO_R2.fastq.gz   -t 5 2>/dev/null)   -o stdout 2>02-Cleaned/22MO/PE_filter_info.log)   -m stdout -s /dev/null -t sanger -T  2>02-Cleaned/22MO/PE_sickle_info.log)   -M 700 --allow-outies -o 22MO -d 02-Cleaned/22MO -To -c  2>02-Cleaned/22MO/flash_info.log)   0 0 25 02-Cleaned/22MO/22MO
02-Cleaned/22MO
Seconds: 1.76817893982
Traceback (most recent call last):
  File "/usr/local/bin/expHTS", line 9, in <module>
    load_entry_point('expHTS==1.0.1.dev2', 'console_scripts', 'expHTS')()
  File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/__init__.py", line 18, in main
    expHTS.main()
  File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/expHTS.py", line 167, in main
    commands[args.command].execute(args)
  File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/preprocessCMD.py", line 136, in execute
    logFiles.append(parseOut(key[1], key[1].split("/")[-1]))
  File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/parse_files.py", line 257, in parseOut
    printToFile(os.path.join(base, sample + "_SummaryStats.log"), header, data)
  File "/usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/parse_files.py", line 235, in printToFile
    if '\t'.join(map(str, data))[-1] != '\n':
IndexError: string index out of range

I spent several hours trying to track this down (I thought I was going crazy, because what should be a straightforward process was now broken with seemingly no cause). Again, note that this proceeds fine on SE reads. Additionally, I was monitoring server resource usage, and things appear to exit uncleanly; there are Perl, Python, gzip, and Bowtie2 calls that hang around for a while (~1 min), seemingly doing nothing in a sleep or deferred state, before ultimately being resolved.
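For what it's worth, the IndexError above happens because `'\t'.join(map(str, data))[-1]` indexes into an empty string when `data` is empty. A guard along these lines would avoid it (a sketch of a fix, not the actual expHTS `printToFile`):

```python
def print_to_file(path, header, data):
    """Append a header line and a data row, guarding against empty
    data: indexing row[-1] on an empty string raises IndexError,
    so test with endswith instead. (Sketch, not expHTS code.)"""
    with open(path, "a") as fh:
        fh.write("\t".join(map(str, header)) + "\n")
        row = "\t".join(map(str, data))
        if not row.endswith("\n"):
            row += "\n"
        fh.write(row)
```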

Explicit clarification on folders/naming/etc.

This might be a documentation issue, but I've had cases where expHTS just won't work: it runs, takes a fraction of a second, and produces empty results folders, minus a few (empty or effectively empty) logs. Sometimes I've gone through and renamed folders, symlinks, etc., and things then work fine for seemingly no reason. This is especially prevalent when combining data from many different sources and experiments (e.g., a trial sequencing run, an initial sequencing effort, and an additional coverage run) that may have different naming schemes; other times, there are no problems. Whatever the naming requirements are, they need to be on the very front of the GitHub page. A recent example:

super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
___ PE COMMANDS ___
python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/cleanupWrapper.py  <(flash2 -Ti <(sickle pe -c <(super_deduper -i <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py  <(python /usr/local/lib/python2.7/dist-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/screen.py -1 /scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_10460_001_R1.fastq.gz,/scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_1875_001_R1.fastq.gz  -2 /scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_10460_001_R2.fastq.gz,/scratch/1/brice/coding_expression/phylogenetic_data/test/00-RawData/caroli/caroli_1875_001_R2.fastq.gz   -t 20 2>/dev/null)   -o stdout 2>02-Cleaned/caroli/PE_filter_info.log)  -p stdout 2>02-Cleaned/caroli/PE_deduper_info.log)   -m stdout -s /dev/null -t sanger -T  2>02-Cleaned/caroli/PE_sickle_info.log)   -M 700 --allow-outies -o caroli -d 02-Cleaned/caroli -To -c  2>02-Cleaned/caroli/flash_info.log)   0 0 50 02-Cleaned/caroli/caroli
02-Cleaned/caroli
Seconds: 0.0292479991913
Total amount of seconds to run all samples
Seconds: 0.0292479991913

develop_new version does not identify gzipped fastqs

/mnt/storage2/scratch2/projects/mouse_references$ expHTS preprocess -t 16
super_deduper found
sickle found
flash2 found
bowtie2 found
scythe found
No fastq files were found under this directory - 00-RawData/cast
Failure in Validation - exiting process

/mnt/storage2/scratch2/projects/mouse_references/00-RawData/cast$ ls
H12_88996_ERR019144_1.fastq.gz H12_88996_ERR019144_2.fastq.gz
/mnt/storage2/scratch2/projects/mouse_references/00-RawData/cast$ zcat H12_88996_ERR019144_1.fastq.gz | head
@ERR019144.1 IL19_4344:5:1:1016:2407/1
NTACCTACAAATTTAATATTTTCTAAGAGCAGAATGAAACTTCATAGTGTATGTATAAAACATTCCACTATTCTTTCCTGTTCATCAGTTGTTCGATACACAGACTGA
+
!+''+)('%556668:87977577565208855588666+.1-40-544566626525+/-/03+0//11././,,/13(((),60-//(&(&),,)+'(&($$/++
@ERR019144.2 IL19_4344:5:1:1016:13842/1
NGATCTGCACGCCAGACTCACCTTGTGACCTGCCTCTAGTCGGGGGTGTGCTCTCCTGAGGAGAGAGAGGATAGCTACAGAGAGAAGATCACAGCATCGACGTCCTGT
+
!)(,(1/2//023077:;994,3540/+1'01,,-1113388'6%(&%$-/-,.-$#'&$#%(%'&$#%#$%$#'%%&#%#$$%$#$#$)%%%$%&'('%'%)&'+'
@ERR019144.3 IL19_4344:5:1:1016:7439/1
NTTGTTTTACTCAGGTTCCACACATGTGCCTTCAAATTTCATATCATTGTTTTTAATATCTGAGTAATAATCTACTGTGTAAATGTATCAAATTTTATTTATCTATCT

The bug is in this function:

def directoryFiles(self, sampleSeq):
    sampleSeq = tuple(sampleSeq)
    #print sampleSeq
    directoryTest = sampleSeq[0].rstrip();
    #fastqCount insures at least one fastq files is under the directory
    fastqCount = 0
    if self.testDirectory(directoryTest):
        for subdir, dir, files in os.walk(directoryTest):
            for file in files:
                file = os.path.abspath(os.path.join(directoryTest, file))
                if "_R1" in file and ".fastq" in file:
                    fastqCount += 1
                    if not sampleSeq in self.sampleFiles:
                        self.sampleFiles[sampleSeq] = []

                    self.sampleFiles[sampleSeq].append(self.isPairedEnd(file))


        if (fastqCount == 0):
            self.exitTime("No fastq files were found under this directory - " + directoryTest)

    else:
        self.exitTime("Directory " + directoryTest + " does not exists")

fastqCount is set to 0 and never incremented, so self.exitTime() is called below: the reads are named (e.g.) H12_88996_ERR019144_1.fastq.gz, while the function only looks for _R1. A little edge-case action going on.

Edit: new_devel branch too.
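A possible fix, assuming the intent is to accept both Illumina-style `_R1` and SRA-style `_1` naming, would be to broaden the filename test (hypothetical helper, sketching a replacement for the `"_R1" in file` check above):

```python
import re

def looks_like_r1_fastq(filename):
    """Match read-one FASTQ names in both Illumina (_R1) and
    SRA (_1) styles, gzipped or not. (Hypothetical sketch.)"""
    return re.search(r"_R?1[._].*fastq(\.gz)?$", filename) is not None
```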

Flag to skip deduping

There are some applications where you wouldn't want to dedup but still do the other stuff. A flag would be a welcome addition for some things other people are working on (expression and all that - not necessarily anything I'm doing currently).

expHTS preprocess -s option issue

I ran:
expHTS preprocess -s -a -O -F 02-Cleaned_dups -w -t 12
with the -s option to see how leaving duplicates affected my results.

I didn't receive any warnings or errors, but my read counts are off for the cleaned fastq files:

Raw read count:
sed -n '1~4p' B2ST1_TTAGGC_L005_R1_001.fastq | grep "^@" | wc -l
6948683

Cleaned read count after running the above command:
sed -n '1~4p' B2_R1.fastq | grep "^@" | wc -l
12843598
(twice as many as there should be, based on my calculations from the preprocessing summary)

First 10 lines of the cleaned fastq file:
head -10 B2_R1.fastq
@HWI-D00289:175:C6K3VANXX:5:1101:1661:2064#0/1
GATGCTTACAAATACAGAGAGACATTTAGATGTGCACCGTCGTCGTCGTCGTCACGCTCGTTCACACACGCGCGCACAAAAGGAAGAAAAAAAGACTTAAGACAAAAGGTTTAAGTACTACAGAC
+
<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFBFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFB
@HWI-D00289:175:C6K3VANXX:5:1101:4385:2074#0/1
GCTAGCTTCTGTATTAAGTCGTAGAGCAGATATATCCCAAACTAATGAAACGTCCTCTCCCCATTTATGAACAAAGTCATTTAACAACATATAGTACCAGGTGTGCGCTATCAGAGTATCCACGAA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-D00289:175:C6K3VANXX:5:1101:4383:2135#0/1
GTCGAACAAACACATTTGATGACTTTAAACTGTTGAATTAGGGTTTTAACTTATGAAGCGATGGATAAATGAGCAGTCCAGGCTAGTTGTAGGGTTTACTGGCTTCCACTGAAATAGGACACAAAG

Cleaned read count WITHOUT the -s option:
5368273
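As an aside, since every FASTQ record is exactly four lines, the counts above can also be obtained without the sed/grep pipeline (assuming uncompressed input; pipe through zcat first for .gz files):

```shell
# Count reads in a FASTQ file: total line count divided by four.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}
```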

Files are located: /mnt/home/mbenner/GRC_Pipeline/02-Cleaned_dups

Thank you!

02-Cleaned not compatible with Discovar

- Discovar errors out with bubbles output; it doesn't give a reason.
- Manually running each segment (contaminant screen, dedupe, sickle) doesn't produce the same problem.
- There is no flash overlap for any of them.

Final fastq compression

I noticed that some of the individual applications will gzip fastqs (at the expense of speed). Perhaps include a flag to do this? Since gzipping is CPU-light (usually a fraction of one core) yet heavier on I/O, perhaps a thread could be set aside and allocated to compression as other libraries are cleaned. I (and, I'm guessing, others) am going to have to do this anyway to save disk space, so it would let me ditch the find-and-gzip mini-script I'd otherwise need to run afterward.
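The proposed dedicated compression thread could be sketched like this in Python 3 (hypothetical names; not part of expHTS):

```python
import gzip
import queue
import shutil
import threading

def start_compressor():
    """Start one background thread that gzips finished FASTQs while
    other samples are still being cleaned. Feed it paths with
    q.put(path); send q.put(None) and t.join() when done.
    (Hypothetical sketch of the idea proposed above.)"""
    q = queue.Queue()

    def worker():
        while True:
            path = q.get()
            if path is None:  # sentinel: no more files to compress
                break
            with open(path, "rb") as src, \
                    gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)

    t = threading.Thread(target=worker)
    t.start()
    return q, t
```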

More Relevant Error Messages

More relevant error messages could be implemented in several places. For example, I ran expHTS with "-c Contaminants", but I did not have a "Contaminants" folder. Instead of an error message like "Contaminants does not exist", I got this uninformative error:

Traceback (most recent call last):
  File "/home/msettles/Python_venv/lib/python2.7/site-packages/expHTS-1.0.1.dev2-py2.7-linux-x86_64.egg/expHTS/extract_unmapped_reads.py", line 112, in <module>
    flag = int(line2[1])
ValueError: invalid literal for int() with base 10: 'does'
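One cheap way to get the clearer message is to validate the folder at argument-parsing time, sketched here with argparse (assuming that is what the expHTS CLI uses; names are hypothetical):

```python
import argparse
import os

def existing_dir(path):
    """argparse type check: fail fast with a clear message instead of
    letting a missing folder surface later as an opaque ValueError
    deep inside a piped command. (Hypothetical sketch.)"""
    if not os.path.isdir(path):
        raise argparse.ArgumentTypeError(
            "contaminants folder %r does not exist" % path)
    return path

# e.g. parser.add_argument("-c", "--contaminants", type=existing_dir)
```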

Monica Britton
