opengene / afterqc

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data

License: MIT License

Python 93.33% Makefile 0.10% C++ 5.72% C 0.85%

quality-control fastq ngs sequencing bioinformatics overlap trimming filtering error qc

afterqc's Issues

AfterQC total bases calculation

AfterQC consistently miscounts the total bases in paired-end MiSeq FASTQ reads. For example, two paired-end files with 611,060,153 total bases (as calculated by various other programs) are reported by AfterQC as having only 582,146,000 total bases.
I was wondering why this difference exists and whether it affects the downstream filtering process.
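
To cross-check a total-base figure independently of any QC tool, one can count the sequence characters directly. The following is a minimal sketch, not AfterQC's own code; it assumes Python 3 and standard four-line FASTQ records, and `count_bases` and `count_bases_in_file` are hypothetical helper names.

```python
import gzip


def count_bases(lines):
    """Return (reads, bases) for an iterable of FASTQ text lines.

    Assumes four-line records: header, sequence, '+', quality.
    """
    reads = 0
    bases = 0
    for index, line in enumerate(lines):
        if index % 4 == 1:  # the sequence line of each record
            reads += 1
            bases += len(line.strip())
    return reads, bases


def count_bases_in_file(path):
    """Count reads and bases in a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_bases(handle)
```

Summing the base counts of the read 1 and read 2 files gives a paired-end total that can be compared with the number in the report.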

TODO: change trimming function to make trimmed read1/read2 have identical length

This is to make some mark duplication tool work well.
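
A minimal sketch of what such a trimming step could look like, assuming both mates are cut from the 3' end down to the length of the shorter one. The function name is hypothetical and this is not the project's implementation.

```python
def trim_pair_to_same_length(seq1, qual1, seq2, qual2):
    """Trim both mates from the 3' end to the shorter mate's length."""
    length = min(len(seq1), len(seq2))
    return seq1[:length], qual1[:length], seq2[:length], qual2[:length]
```

For example, a 5-base read 1 paired with a 3-base read 2 would come back as two 3-base reads with matching quality strings.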

default good folder

Hello!
Because of default = "good" in line 27, the description in line 28 contains a mistake. If the user does not set option -g, the good folder is created in the current directory, not in the directory of read1. That behavior is fine! Could you please change the description in line 28?
Also, lines 287 and 288 of the same file are useless.

bubble

Hi, what principle does AfterQC use to detect bubbles, and how do bubbles show up in the data?

ValueError: max() arg is an empty sequence

Hi, when I run:

python after.py --qc_only -1 ../fastq1/SRR1294494_1.fastq.gz -2 ../fastq1/SRR1294494_2.fastq.gz
../fastq1/SRR1294494_1.fastq.gz options:

{'qc_only': True, 'version': '0.9.6', 'seq_len_req': 35, 'index1_file': None, 'trim_tail': 1, 'report_output_folder': None, 'trim_pair_same': True, 'no_correction': False, 'debubble_dir': 'debubble', 'barcode_flag': 'barcode', 'read2_file': '../fastq1/SRR1294494_2.fastq.gz', 'barcode_length': 12, 'trim_tail2': 1, 'unqualified_base_limit': 60, 'allow_mismatch_in_poly': 2, 'read2_flag': 'R2', 'store_overlap': False, 'debubble': False, 'read1_flag': 'R1', 'index2_flag': 'I2', 'draw': True, 'index1_flag': 'I1', 'mask_mismatch': False, 'barcode': False, 'overlap_output_folder': None, 'barcode_verify': 'CAGTA', 'index2_file': None, 'qualified_quality_phred': 15, 'trim_front': 2, 'good_output_folder': 'good', 'poly_size_limit': 35, 'n_base_limit': 5, 'qc_sample': 200000, 'trim_front2': 2, 'no_overlap': False, 'input_dir': None, 'read1_file': '../fastq1/SRR1294494_1.fastq.gz', 'qc_kmer': 8, 'bad_output_folder': None}

I get this error:

Traceback (most recent call last):
File "after.py", line 224, in <module>
main()
File "after.py", line 218, in main
processOptions(options)
File "after.py", line 171, in processOptions
filter.run()
File "/disk/zhw/cross_talk/GSE57872/fastq/AfterQC-master/preprocesser.py", line 768, in run
self.addFiguresToReport(reporter)
File "/disk/zhw/cross_talk/GSE57872/fastq/AfterQC-master/preprocesser.py", line 783, in addFiguresToReport
reporter.addFigure('Read1 per base discontinuity after filtering', self.r1qc_postfilter.discontinuityPlotly("r1_post_discontinuity", 'Read1 discontinuity curve after filtering'), 'r1_post_discontinuity', "")
File "/disk/zhw/cross_talk/GSE57872/fastq/AfterQC-master/qualitycontrol.py", line 234, in discontinuityPlotly
json_str += "var layout={title:'" + title + "', xaxis:{title:'cycles'}, yaxis:{title:'discontinuity', range:" + makeRange(0.0, max(self.meanDiscontinuity)*1.5) + "}};\n"

ValueError: max() arg is an empty sequence
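
The traceback shows `max(self.meanDiscontinuity)` being called on an empty sequence while the plot's axis range is built. Why the list is empty is not clear from the report. As a sketch only, a guard like the following would avoid the crash; `safe_axis_max` and its fallback value are assumptions, not part of AfterQC.

```python
def safe_axis_max(values, scale=1.5, fallback=1.0):
    """Return max(values) * scale, or fallback when values is empty."""
    if not values:
        return fallback
    return max(values) * scale
```

A guard of this kind only hides the symptom; an empty statistics list after filtering may also mean that no reads passed the filter, which would be worth checking separately.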

filter only for poly-X but nothing else

What would be the command-line to run AfterQC in order to filter only for poly-X reads but nothing else?

AfterQC in FASTQ joined

I would like to ask a question. Can I parse paired-end files after the process joined by AfterQC?

Adapters trimming

As far as I understood you don't have predefined sequences for adapters:

"By searching the best overlapping of each pair, AfterQC automatically detects and cuts adapters for pair-end data, with no need of adapter sequence input"

I used AfterQC on metagenomics data and it seemed to reduce the numbers of adapters, but not entirely. What could be the cause? Also, what is important, I couldn't see it from the AfterQC report, I checked it with fastqc/multiqc, so from the user's point of view I would want to still see the validation for typical adapters in QC report.

Please, check the last post here - Sudden quality drop in the middle of HiSeq R1 reads but not in R2

If it is somehow beneficial for you I can send you full reports.

Plans for reading .gz ?

Hi, nice tool.

Are there any plans for reading gzipped files in the future? This can be quite helpful, especially when reanalyzing the quality of older compressed projects.

Also, is there an internal adapter library, or does AfterQC find adapters by itself in the reads? I couldn't understand the command line help about barcodes.

Thanks.
Colin

My bioconda install Python version

Hi my bioconda install is reporting this needs Py <3.0. I can set up a Py 2.7 environment for just this of course, but wondered if there would be a Py3.6+ version or if I am being silly in some way?
Kind Regards, sounds a fabulous tool

Report

Hi,

I have two suggestions.

1. Could you implement an option to output the graphs separately in a folder? That would make it possible to include them in an automated report covering my entire NGS pipeline.
2. The current order is all graphs before filtering, then all graphs after filtering. Could you place them side by side, e.g. forward before next to forward after? That way one could compare before and after filtering immediately instead of scrolling back and forth.

Thanks
Anselm

output files are truncated

Hello,

I'm processing some files that were output by afterQC, but I'm double checking with FastQC, and it looks like the output from AfterQC is truncated:

Failed to process file SRR5335803_1.bz2.good.fq.bz2
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry.  Your file is probably truncated

perhaps afterQC isn't compatible with reading bz2 format?

-Dave

Default multiprocessing behavior

Do you create as many jobs with Python multiprocessing as there are input files? I ran it with default parameters and it seemed to occupy all 32 cores. Is it possible to limit the number of jobs created? It is not always possible to use all the resources when computing on a shared server.
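
For context, the standard library can cap the number of worker processes with `multiprocessing.Pool(processes=N)`. The sketch below illustrates that general technique only; `process_file`, `run_limited` and `pick_worker_count` are hypothetical names, and nothing here is taken from AfterQC's code.

```python
import multiprocessing


def pick_worker_count(requested, n_jobs):
    """Clamp the worker count to at least 1 and at most the number of jobs."""
    return max(1, min(requested, n_jobs))


def process_file(path):
    """Placeholder for per-file work; here it only returns the name length."""
    return len(path)


def run_limited(paths, max_workers):
    """Process paths with at most max_workers processes alive at once."""
    workers = pick_worker_count(max_workers, len(paths))
    pool = multiprocessing.Pool(processes=workers)
    try:
        return pool.map(process_file, paths)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(run_limited(["a.fq", "bb.fq", "ccc.fq"], max_workers=2))
```

With a pool, extra files wait in a queue instead of each getting its own process, so a 32-core node is not saturated by a directory of 32 or more FASTQ files.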

AfterQC

Could somebody please guide me through installing AfterQC? I have downloaded the zip folder and extracted the contents but cannot find any executable file. I have Python 3.7.

Issue with overlap analysis

Hi,

I'm working on SRA data (SRR4292097). I get the following error:

after.py
specify current dir as input dir
SRR4292097_R1.fastq.gz
./SRR4292097_R1.fastq.gz options:
{'read1_file': './SRR4292097_R1.fastq.gz', 'read2_file': './SRR4292097_R2.fastq.gz', 'index1_file': None, 'index2_file': None, 'input_dir': '.', 'good_output_folder': 'good', 'bad_output_folder': None, 'report_output_folder': None, 'read1_flag': 'R1', 'read2_flag': 'R2', 'index1_flag': 'I1', 'index2_flag': 'I2', 'trim_front': 8, 'trim_tail': 0, 'trim_pair_same': True, 'qualified_quality_phred': 15, 'unqualified_base_limit': 60, 'poly_size_limit': 35, 'allow_mismatch_in_poly': 2, 'n_base_limit': 5, 'seq_len_req': 35, 'debubble': False, 'debubble_dir': 'debubble', 'draw': True, 'barcode': False, 'barcode_length': 12, 'barcode_flag': 'barcode', 'barcode_verify': 'CAGTA', 'store_overlap': False, 'overlap_output_folder': None, 'qc_only': False, 'qc_sample': 200000, 'qc_kmer': 8, 'no_correction': False, 'mask_mismatch': False, 'no_overlap': False, 'version': '0.9.6', 'trim_front2': 8, 'trim_tail2': 0}
Process Process-1:
Traceback (most recent call last):
File "/home/XX/Softs/miniconda3/lib-python/2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/XX/Softs/miniconda3/lib-python/2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/XX/Softs/miniconda3/envs/py27/bin/after.py", line 171, in processOptions
filter.run()
File "/home/XX/Softs/miniconda3/envs/py27/share/afterqc-0.9.6-0/preprocesser.py", line 512, in run
overlap_histgram[overlap_len] += 1
IndexError: list index out of range
Time used: 16.7429320812

Everything is working fine with the option --no_overlap.

Thanks,
Maxime
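
The `IndexError` on `overlap_histgram[overlap_len] += 1` means the overlap length fell outside the bounds of the histogram list. How that list is sized in preprocesser.py is not shown in the report, so the following is only a sketch of one defensive pattern: grow the list on demand. `add_to_histogram` is a hypothetical helper.

```python
def add_to_histogram(histogram, index):
    """Increment histogram[index], growing the list when index is too large."""
    if index < 0:
        raise ValueError("negative overlap length: %d" % index)
    if index >= len(histogram):
        histogram.extend([0] * (index + 1 - len(histogram)))
    histogram[index] += 1
    return histogram
```

Rejecting negative values explicitly matters here, because a negative index would otherwise silently count from the end of the list.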

Tool to keep reads where all bases are above a specific quality score.

Hi. I discovered and tested your tool only a few days ago, so if my request makes no sense or stems from something I misunderstood, feel free to remove this message.

I would like to filter FASTQ reads to keep only very high-quality sequences: reads where ALL bases are above Q30, for example. Two filter options seem appropriate:
-q QUALIFIED_QUALITY_PHRED --> set to "-q 30"
-u UNQUALIFIED_BASE_LIMIT --> it would be logical to set "-u 0" to remove every read with at least one base under Q30. Sadly, "-u 0" disables filtering by low-quality base count altogether (see the -u UNQUALIFIED_BASE_LIMIT help text). So I guess it is impossible to keep only "perfect quality" reads (reads with no base below a specific quality). Shouldn't the value that deactivates UNQUALIFIED_BASE_LIMIT be something other than 0?

Thanks a lot,
Max
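
As a workaround sketch outside AfterQC, the "every base at or above a threshold" test is straightforward to express on a quality string. The code assumes Phred+33 encoding (offset 33) and treats "at least Q30" as passing; both function names are hypothetical.

```python
def all_bases_at_least(quality_string, min_phred, offset=33):
    """True when every base quality is >= min_phred (Phred+offset encoding)."""
    return all(ord(char) - offset >= min_phred for char in quality_string)


def count_low_quality(quality_string, min_phred, offset=33):
    """Number of bases whose quality is below min_phred."""
    return sum(1 for char in quality_string if ord(char) - offset < min_phred)
```

Note that an empty quality string passes `all_bases_at_least`, so a separate minimum-length check would still be needed.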

output gzipped data

If the input is gzipped data, the output should be automatically gzipped. Otherwise users would encounter large files
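
One way to express this rule is to pick the output opener from the input file name. This is a sketch assuming Python 3 and a simple `.gz` suffix check; the helper names are hypothetical and do not come from AfterQC.

```python
import gzip


def output_name_like_input(input_path, output_path):
    """Append '.gz' to output_path when the input is gzipped and it lacks it."""
    if input_path.endswith(".gz") and not output_path.endswith(".gz"):
        return output_path + ".gz"
    return output_path


def open_output_like_input(input_path, output_path):
    """Open a text-mode writer, gzip-compressed when the name ends with .gz."""
    final_path = output_name_like_input(input_path, output_path)
    opener = gzip.open if final_path.endswith(".gz") else open
    return opener(final_path, "wt")
```

Keeping the naming decision in its own function makes it easy to check without touching the file system.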

TODO: integrate pair-end sequencing based error correction

To reduce SV false alarm rate.

AttributeError: 'tuple' object has no attribute 'major'

Hi,
Firstly, I am new to Linux, so the solution might be obvious, but I can't get AfterQC to run. I keep getting this error:

Traceback (most recent call last):
File "after.py", line 208, in <module>
main()
File "after.py", line 175, in main
if sys.version_info.major >2:
AttributeError: 'tuple' object has no attribute 'major'

I tried the -h option and adding a path to my data but get the same error. I installed the editdistance with the make command in the AfterQC directory. I am using python version 2.6.6.
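
The named attributes of `sys.version_info` (such as `.major`) were added in Python 2.7; on 2.6 it is a plain tuple, which explains the `AttributeError`. Indexing works on both old and new interpreters. A small sketch, with `python_major_version` as a hypothetical helper name:

```python
import sys


def python_major_version():
    """Return the major version; index 0 works on old and new interpreters."""
    return sys.version_info[0]


if python_major_version() > 2:
    print("running on Python 3 or later")
else:
    print("running on Python 2")
```

Whether the rest of AfterQC runs on Python 2.6 is a separate question that this check does not answer.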

1) AfterQC is slow 2)Aggregate results from many samples into a single report

Hi,
I have been running 20 paired-end RNA-seq samples since yesterday (more than 24 hours) on a computer with 16 GB RAM, and only 7 samples have completed; the others are still running.
Is there any way to make it faster?

Secondly, is there any way to aggregate results from many samples into a single report? According to MultiQC [https://github.com/ewels/MultiQC], AfterQC is not in its list of supported tools.

Any suggestions please!
Thanks!

support bzip2 format

Need to support bzip2 format since it has a better compression ratio

Remove overrepresented sequences

It may be a good idea to implement an option to remove specific sequences such as overrepresented sequences, which are usually rRNA or PCR artefacts, either by passing these sequences in a FASTA file or by having AfterQC identify them itself.

Pack (gzip) output good and bad reads

I think it would be great to pack the output reads (good & bad)!

Specify output folder name

Is it possible to add an option for specifying the name of the output folder that holds the report files, rather than deriving it from the input file names?

Yours faithfully,
Katerina

Packaging: make available on Pypi/conda

Hi,

Would it be possible to make this package available in Pypi and/or conda? This would greatly improve the accessibility of your tool.
We're interested in integrating AfterQC in our QC pipeline, but this depends on the package being available from Pypi or Conda.

Thanks!
M

extra whitespace before shebang line in after.py

Hi,

please remove the first space before the first-line # in after.py, it is causing this error (the file interpreter is not recognized):

$ ./after.py
from: can't read /var/mail/optparse
from: can't read /var/mail/multiprocessing
from: can't read /var/mail/util
./after.py: line 12: syntax error near unexpected token `('
./after.py: line 12: `def parseCommand():

Thanks :)
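
The operating system only treats the first line as an interpreter directive when the file begins with `#!` at the very first byte; otherwise the shell runs the file as a shell script, which is why `from` is executed as a command in the output above. A sketch of a check and a repair, with hypothetical function names:

```python
def has_valid_shebang(first_line):
    """True when the line starts with '#!' at column zero."""
    return first_line.startswith("#!")


def strip_leading_whitespace_before_shebang(text):
    """Remove whitespace before a leading '#!', leaving other text unchanged."""
    stripped = text.lstrip(" \t")
    if stripped.startswith("#!") and stripped != text:
        return stripped
    return text
```

Running the script as `python after.py` sidesteps the problem, since the interpreter is then named explicitly.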

Wrong (low quality) bases in overlapped regions failed to be corrected

AfterQC has helped me to improve the quality of my data. Thanks for creating it.

I did notice, though, that some wrong (low quality) bases in overlapped regions were not corrected, and I am not sure why.

Could you perhaps have a look? One read pair is shown below as an example.
---------------- test_R1.fastq
@NS500261:202:HKMTKAFXX:1:21112:5370:11983 1:N:0:ATCTAGCCGGCC
CTGCGCCTGGTTGGGCATCGCTCCGCTAGGTGTCAGCGGCTCCACCAGCTGGGGTGAGGGGGTGGTGGGTCAGTGCTGGGGGCCGGTGCAGACCCCACGCGGGCTGGGAGGACTTCACCCCGCCTCACCTCCGTTTCCTGCAGATCGGAAG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEAEEEEEEEEEEEEEEEEEE/EEEE/EEEEEEEEEEEAEE/AA<AAEEAAAAAEE<EAAA<A<6AAAAE<<AEAAAAAEE
---------------- test_R2.fastq
@NS500261:202:HKMTKAFXX:1:21112:5370:11983 2:N:0:ATCTAGCCGGCC
GCAGGAAACGGAGGTGAGGCGGGGTGAAGTCCTCCCAGCCCGCGTGGGGTCTGCACCGGCCCCCAGCCCTGACCCACCACCCCCTCACCCCAGCTGGTGGAGCCGCGGCCCCCTAGCGGCGCGATGCCCAACCAGGCGCAGAGCTC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEAEEEEEEE/AE/E/E/EEE/EE/EEEEE/EAE6EE/EE//E/E//EEE/A/A/E/AE/AAE/E/EE</<AAEE//AE//<E6E////6<<

Error despite creating env with 2.7 in conda

Hi there
I am getting the following error
Traceback (most recent call last):
File "/Users/apple/miniconda3/envs/py27/bin/after.py", line 228, in <module>
main()
File "/Users/apple/miniconda3/envs/py27/bin/after.py", line 222, in main
processOptions(options)
File "/Users/apple/miniconda3/envs/py27/bin/after.py", line 175, in processOptions
filter.run()
File "/Users/apple/miniconda3/envs/py27/share/afterqc-0.9.7-3/preprocesser.py", line 249, in run
self.r1qc_prefilter.statFile(self.options.read1_file)
File "/Users/apple/miniconda3/envs/py27/share/afterqc-0.9.7-3/qualitycontrol.py", line 350, in statFile
self.statRead(read)
File "/Users/apple/miniconda3/envs/py27/share/afterqc-0.9.7-3/qualitycontrol.py", line 80, in statRead
self.totalNum[i] += 1
IndexError: list index out of range

Regards
Dinesh

Overlap merge can be optimized

Fast forward mode can be applied

Parallel mode for one pair of reads

What do you think about some kind of parallel mode for processing a single pair of PE reads (-1 and -2 options) when good reads are to be generated?

File exists: './QC'

Even when removing previous runs' QC folders, I still get an error when running after.py

  File "/home/sdt5z/anaconda/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: './QC'
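
One possible explanation, which the report does not confirm, is that several worker processes try to create the same folder at nearly the same time, so a check-then-create sequence fails in all but one of them. A sketch of directory creation that tolerates this on Python 2.7 and 3 (`ensure_dir` is a hypothetical helper):

```python
import errno
import os


def ensure_dir(path):
    """Create path if needed; tolerate it already existing as a directory."""
    try:
        os.makedirs(path)
    except OSError as error:
        if error.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

The error is still raised when the path exists but is a regular file, since that case cannot be used as an output folder.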

Python 2 is a requirement?

I installed this in a venv with Python3 and got:

  Traceback (most recent call last):
  File "/config/binaries/afterqc/0.9.2/afterqc/after.py", line 7, in <module>
    import preprocesser
  File "/config/binaries/afterqc/0.9.2/afterqc/preprocesser.py", line 8, in <module>
    import util
  File "/config/binaries/afterqc/0.9.2/afterqc/util.py", line 167
    print overlap(r1, r2)

which I presume is a python3 thing.

You should note that python2 is a requirement in the install notes.
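
The statement form `print overlap(r1, r2)` is valid only on Python 2 and is a syntax error on Python 3. As an illustration, calling `print` with a single parenthesised argument behaves the same on both; for several arguments, `from __future__ import print_function` at the top of the file is the usual route. `overlap_report` below is a hypothetical stand-in, not the function from util.py.

```python
def overlap_report(r1, r2):
    """Hypothetical stand-in for a function whose result gets printed."""
    return "r1=%d bases, r2=%d bases" % (len(r1), len(r2))


# Python 2 only:      print overlap_report("ACGT", "ACG")
# Python 2 and 3:     a single parenthesised argument
print(overlap_report("ACGT", "ACG"))
```

Fixing this one line would not by itself make the tool run on Python 3; it only removes the first error the interpreter reports.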

Too slow with python implementation

Should try Rust or Julia

Float division by zero in circledetector.py

Hi, Shifu

My colleague (@yodeng) and I encountered this issue recently.

This issue may be caused by the reassignment of an empty list to self.records at line 19 in circledetector.py. The initial assignment is at line 14.

When we commented out line 19, the division by zero error disappeared and we got the circles data.

Best,
Richard

Removal of PCR/RNA primers

I am currently using afterqc to do QC trimming of the reads. Generally it performs well, removes adapters etc. However I noticed it doesn't seem to screen out primers. Would it be possible to have this as an add on?

Cheers
Amali

String index out of range.

Hi!

I managed to run AfterQC successfully the first time I used it, on one of my libraries, with the following command line:

pypy /home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/after.py

I tried to use the exact same command to run AfterQC on my second pair of libraries (from a different directory, where these libraries are located) and this error appeared:

specify current dir as input dir
l4i2-unpaired_R1.fastq
l4i2-trimmo_R1.fastq
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/pypy/lib-python/2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/usr/lib/pypy/lib-python/2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/after.py", line 175, in processOptions
filter.run()
File "/home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/preprocesser.py", line 249, in run
self.r1qc_prefilter.statFile(self.options.read1_file)
File "/home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/qualitycontrol.py", line 350, in statFile
self.statRead(read)
File "/home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/qualitycontrol.py", line 107, in statRead
if seq[j] != seq[j+1]:
IndexError: string index out of range
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/pypy/lib-python/2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/usr/lib/pypy/lib-python/2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/after.py", line 175, in processOptions
filter.run()
File "/home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/preprocesser.py", line 249, in run
self.r1qc_prefilter.statFile(self.options.read1_file)
File "/home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/qualitycontrol.py", line 350, in statFile
self.statRead(read)
File "/home/pop_manuel/proyecto_transcriptoma_rhizophagus/software/AfterQC-master/qualitycontrol.py", line 107, in statRead
if seq[j] != seq[j+1]:
IndexError: string index out of range
Time used: 0.0947189331055

I tried executing the program with pypy and python2, but it returns the same error.

Thank you for your help!
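
The traceback ends at `if seq[j] != seq[j+1]:`, which fails whenever `j + 1` reaches the length of the read. How the loop bound is computed in qualitycontrol.py is not shown; reads of varying length (for instance files already trimmed by another tool) are one unconfirmed possibility. A sketch of the same comparison with a bound derived from each read's own length, using a hypothetical function name:

```python
def count_base_changes(seq):
    """Count positions j where seq[j] differs from seq[j + 1]."""
    changes = 0
    for j in range(len(seq) - 1):  # last valid j is len(seq) - 2
        if seq[j] != seq[j + 1]:
            changes += 1
    return changes
```

Because the range is empty for reads of length 0 or 1, very short reads no longer raise an exception.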

Afterqc with pypy

Hello,
I have 12 paired-end files (~50 GB). Each file takes approximately 14 hours to run with native Python. I read that it runs 3 times faster with the pypy command, but when I edit my script to use pypy, it returns an error saying there is no such command. Do I need to download something to use the pypy command?

multithreading

Is there a way to limit the number of threads that the program uses? It looks like it is running one thread per fastq file, but I would like it to only run a limited number of threads on our HPC node. I suppose that I could run afterqc on only a limited number of fastq files at a time, but I thought that there might be a more elegant solution.
Thanks,
Ken

Float division by zero in circledetector.py

I've encountered the following error:

finished polyX stat for all files
write records to poly_X.csv
process records by tile
Traceback (most recent call last):
  File "AfterQC/after.py", line 221, in <module>
    main()
  File "AfterQC/after.py", line 205, in main
    runDebubble(options)
  File "AfterQC/after.py", line 180, in runDebubble
    debubble.debubbleDir(options.input_dir, 20, options.debubble_dir, options.draw)
  File "AfterQC/debubble.py", line 47, in debubbleDir
    circles = bp.run()
  File "AfterQC/bubbleprocesser.py", line 74, in run
    self.processByTile()
  File "AfterQC/bubbleprocesser.py", line 161, in processByTile
    self.detectBubbleForTile(tileRecords, lastTileWithBothSurface, laneOfLastTile)
  File "AfterQC/bubbleprocesser.py", line 142, in detectBubbleForTile
    c = bd.detect()
  File "AfterQC/bubbledetector.py", line 60, in detect
    self.detectCircles()
  File "AfterQC/bubbledetector.py", line 284, in detectCircles
    labelCircles = cd.detect()
  File "AfterQC/circledetector.py", line 25, in detect
    if self.isInCorner():
  File "AfterQC/circledetector.py", line 102, in isInCorner
    if float(cornerCount) / len(self.records) > 0.1:
ZeroDivisionError: float division by zero
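
The failing expression is `float(cornerCount) / len(self.records)`, so the error occurs when `self.records` is empty. As a sketch only, the ratio test can be guarded as below; returning `False` for an empty list is an assumption about the desired behavior, and `mostly_in_corner` is a hypothetical name.

```python
def mostly_in_corner(corner_count, records, threshold=0.1):
    """True when more than `threshold` of records lie in a corner.

    Returns False for an empty record list instead of dividing by zero.
    """
    if not records:
        return False
    return float(corner_count) / len(records) > threshold
```

A guard like this avoids the crash but does not explain why the record list is empty at that point.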

question about AfterQC/preprocesser.py

Hi, in preprocesser.py line 498, your code is:
if lowQual1 > self.options.unqualified_base_limit or lowQual1 > self.options.unqualified_base_limit:
I suppose the code should be:
if lowQual1 > self.options.unqualified_base_limit or lowQual2 > self.options.unqualified_base_limit:
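
With the condition as quoted, the low-quality count of read 2 is never tested, so a pair can pass even when read 2 exceeds the limit. The intended check can be sketched as a small standalone function; the name and signature are hypothetical.

```python
def pair_exceeds_low_quality_limit(low_qual1, low_qual2, limit):
    """True when either mate has more low-quality bases than limit."""
    return low_qual1 > limit or low_qual2 > limit
```

For example, with a limit of 60, a pair with counts (0, 61) is rejected, while (60, 60) is kept.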

Read length distribution after processing

Hello, I have been using afterqc with default settings to remove adapter sequences from RNA-Seq read files. In my original files all reads have a uniform length, but after processing with afterqc the read lengths are no longer uniform. I therefore have two questions: 1) Can I somehow make all the reads a uniform length? 2) What are the implications of non-uniform lengths for downstream analysis? (The original read length for all reads was 151 bases; after afterqc the read length is between 35 and 142.)
