Streamlining clinical bioinformatics workflows in production environments.
brwnj / fastq-multx
Demultiplexes a fastq.
I'm trying this tool for the first time to demultiplex a fastq file, Undetermined_L001_R1.fastq.gz,
from an Illumina run that was done using dual indexes and single-end reads. I have been getting a Segmentation fault
error in a number of situations when running this tool.
So I decided to test the tool using the scripts and data in the test folder of this repository. Whenever I use the parameter -H
to specify that the indexes are in the header of the reads, the tool always fails if I supply only one fastq file as input. However, when I supply two fastq files as input (entering the same file twice), the program runs without issues. My commands are below, with an excerpt of the data.
This fails:
$ fastq-multx -H -B indexes.txt test.fastq -o test_%.fastq
Using Barcode File: indexes.txt
Segmentation fault (core dumped)
but this runs (using the input file twice and suppressing one of the outputs, as a sort of workaround):
$ fastq-multx -H -B indexes.txt test.fastq test.fastq -o test_%.fastq -o n/a
Using Barcode File: indexes.txt
End used: start
Id Count File(s)
S1 17800 test_S1.fastq
S2 32100 test_S2.fastq
S3 3900 test_S3.fastq
...
The files I am using look like this:
==> index.txt <==
S1 TTACCGAC-CGTATTCG
S2 TCGTCTGA-TCAAGGAC
S3 TTCCAGGT-AAGCACTG
S4 TACGGTCT-GCAATGGA
S5 AAGACCGT-CAATCGAC
==> test.fastq <==
@A00929:83:HL75TDRXX:1:2101:13919:1047 1:N:0:TTACCGAC+CGTATTCG
CATATTGATAGTTCGCACAGGTAG
+
FFFFFFFFFFFFFFFFFFFFFFFF
@A00929:83:HL75TDRXX:1:2101:14009:1047 1:N:0:TCGTCTGA+TCAAGGAC
GTGCGTATCTATCAAAAATGTATA
+
I installed fastq-multx using conda on Ubuntu 20.04.
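For cross-checking the output of the workaround, here is a minimal Python sketch that looks up the sample for a header index pair by exact match. It is not fastq-multx's own logic. The helper names are hypothetical, and rewriting the header's '+' separator to the '-' used in the barcode file is an assumption about how the two should be compared, not a known cause of the segfault.

```python
def load_barcodes(lines):
    """Map 'INDEX1-INDEX2' strings to sample IDs from two-column barcode lines."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            sample, barcode = fields[0], fields[1]
            table[barcode] = sample
    return table


def header_barcode(header):
    """Return the index field of an Illumina-style header, with '+' rewritten as '-'."""
    # The index pair is the last colon-separated field of the comment part.
    comment = header.split()[-1]
    return comment.rsplit(":", 1)[-1].replace("+", "-")


# Barcode lines and header copied from the excerpt above.
barcodes = load_barcodes([
    "S1 TTACCGAC-CGTATTCG",
    "S2 TCGTCTGA-TCAAGGAC",
    "S3 TTCCAGGT-AAGCACTG",
])
header = "@A00929:83:HL75TDRXX:1:2101:13919:1047 1:N:0:TTACCGAC+CGTATTCG"
sample = barcodes.get(header_barcode(header), "unmatched")
print(sample)  # S1
```

Unlike fastq-multx, this allows no mismatches, so its per-sample counts can only be a lower bound on what the tool reports.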
Hi @brwnj ,
It's actually more of a question than an issue.
As I understand it, your script can split raw reads into different files according to a barcode.
So I launched it as follows:
fastq-multx -m 2 -b -x -B preprocess/barcodes.txt Undetermined_S0_L001_R1_001.fastq.gz Undetermined_S0_L001_R2_001.fastq.gz -o preprocess/multx/%_R1.fq.gz preprocess/multx/%_R2.fq.gz
The script runs well, but the results are unexpected: fastq-multx sends 1/3 of my reads to unmatched.fq.gz.
Many reads in unmatched.fq.gz contain my barcodes.
Is there more than the barcode match involved in the unmatched decision, such as the insert length or something else?
Thanks !
Bastien
I need to run multiple demultiplexing steps because my barcodes are unique only in pairs (i5/i7).
The first round using i5 barcodes works fine as usual, but when running the next round of demultiplexing with i7 barcodes the program doesn't run because the length of the barcode and read files is different (because the reads were derived from a previous demultiplexing round).
error msg: # of rows in mate file 'xxxxx.fq' doesn't match primary file, quitting!
Is there a way in which fastq-multx can demultiplex a read1.fq file that is shorter than the barcode file?
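One possible workaround outside fastq-multx is to subset the barcode file to the read IDs that survived the first round, so both files have the same number of records again. The sketch below is an assumption-laden illustration, not a feature of the tool: the function names are hypothetical, it assumes plain 4-line FASTQ records, it assumes both files keep the original read order, and it holds every surviving read ID in memory.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def read_id(header):
    """Return the read name without the leading '@', the comment, or a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def subset_to_match(primary, mate, out):
    """Write only the mate records whose read ID occurs in primary; return the count kept."""
    wanted = {read_id(rec[0]) for rec in read_fastq(primary)}
    kept = 0
    for rec in read_fastq(mate):
        if read_id(rec[0]) in wanted:
            out.write("\n".join(rec) + "\n")
            kept += 1
    return kept


# Tiny in-memory example: the read file lost r2 in the first round.
reads = io.StringIO("@r1 1:N:0\nACGT\n+\nFFFF\n@r3 1:N:0\nTTTT\n+\nFFFF\n")
index = io.StringIO(
    "@r1 2:N:0\nAAAA\n+\nFFFF\n@r2 2:N:0\nCCCC\n+\nFFFF\n@r3 2:N:0\nGGGG\n+\nFFFF\n"
)
synced = io.StringIO()
kept = subset_to_match(reads, index, synced)
print(kept)  # 2
```

For real data the handles would be opened files (or gzip.open in text mode) instead of io.StringIO objects.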
index1-read1 --- read2-index2
sample1 index1-a index2-b
sample2 index1-a index2-c
sample3 index1-d index2-c
Paired-end reads,
with only an 8 bp i7 index at the tail of R2.
SX20G0032 TTCTGGTG
SX20G0033 CCGAAAAC
SX20G0034 CGAAAAGG
SX20G0035 AACCAGCT
SX20G0081 GTTTGTGC
SX20G0083 CGGTTTTC
SX20G0084 TGACCGAA
SX20G0085 CCCTATTC
SX20G0086 ACGTTGTG
SX20G0087 AAAAGCGC
SX20G0088 TCACTCGT
SX20G0089 GCCGATTT
SX20G0090 TGCAAGAG
SX20U00079 CCCTTAAC
SX20U00080 CCTACCTA
SX20U00081 GTTTCAGC
commandline
fastq-multx -e -B barcode.txt -m 1 E100004487_L01_read_2.fq.gz E100004487_L01_read_1.fq.gz -o %_R2.fastq -o %_R1.fastq
question
Did I use the right barcode format and command line?
I got output like "Skipped because of distance < 2 : 9439859". What kind of reads did it skip?
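As far as I understand the usage text, -m is the number of mismatches allowed and -d (default 2) is the minimum distance required between the best and the next-best barcode match, so the "skipped" reads would be those matching two barcodes almost equally well. The Python sketch below illustrates that idea only. The function names and the order of the two checks are assumptions, not the tool's actual code, and it compares a candidate 8-mer directly rather than locating it in the read.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign(candidate, barcodes, max_mismatch=1, min_distance=2):
    """Return a sample ID, 'unmatched', or 'skipped' for one candidate barcode."""
    scored = sorted((hamming(candidate, bc), sample) for sample, bc in barcodes.items())
    best, best_sample = scored[0]
    if best > max_mismatch:
        return "unmatched"  # nothing is close enough
    if len(scored) > 1 and scored[1][0] - best < min_distance:
        return "skipped"  # best and next-best are too close to call
    return best_sample


# Three barcodes copied from the list above; two of them differ at only 2 positions.
barcodes = {
    "SX20G0032": "TTCTGGTG",
    "SX20G0081": "GTTTGTGC",
    "SX20U00081": "GTTTCAGC",
}
print(assign("GTTTGTGC", barcodes))  # SX20G0081 (exact match, next best is 2 away)
print(assign("GTTTGAGC", barcodes))  # skipped (1 mismatch to two different barcodes)
print(assign("AAAAAAAA", barcodes))  # unmatched
```

Under this reading, a barcode set containing pairs that differ at only two positions, such as GTTTGTGC and GTTTCAGC here, leaves little room for sequencing errors before a read becomes ambiguous.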
split rate:
Id Count File(s)
SX20G0032 232298140 SX20G0032_R2.fastq SX20G0032_R1.fastq
SX20G0033 16034321 SX20G0033_R2.fastq SX20G0033_R1.fastq
SX20G0034 245363709 SX20G0034_R2.fastq SX20G0034_R1.fastq
SX20G0035 599969474 SX20G0035_R2.fastq SX20G0035_R1.fastq
SX20G0081 596771042 SX20G0081_R2.fastq SX20G0081_R1.fastq
SX20G0083 40790595 SX20G0083_R2.fastq SX20G0083_R1.fastq
SX20G0084 1038522885 SX20G0084_R2.fastq SX20G0084_R1.fastq
SX20G0085 5594048 SX20G0085_R2.fastq SX20G0085_R1.fastq
SX20G0086 8100620 SX20G0086_R2.fastq SX20G0086_R1.fastq
SX20G0087 5314546 SX20G0087_R2.fastq SX20G0087_R1.fastq
SX20G0088 206429619 SX20G0088_R2.fastq SX20G0088_R1.fastq
SX20G0089 433365557 SX20G0089_R2.fastq SX20G0089_R1.fastq
SX20G0090 429231416 SX20G0090_R2.fastq SX20G0090_R1.fastq
SX20U00079 281572893 SX20U00079_R2.fastq SX20U00079_R1.fastq
SX20U00080 270611344 SX20U00080_R2.fastq SX20U00080_R1.fastq
SX20U00081 304485507 SX20U00081_R2.fastq SX20U00081_R1.fastq
unmatched 144289659 unmatched_R2.fastq unmatched_R1.fastq
total 563778079
So many reads are unmatched, and I am not sure whether this is a normal rate. If not, is it the result of a bad command line or barcode setting?
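A side note on the numbers above: the reported total (563778079) is smaller than several of the individual counts. Summing the Count column gives a value that equals the reported total modulo 2^32, which looks consistent with a 32-bit counter wrapping around. The short Python check below uses only the counts from the table; the wraparound interpretation is an inference from the arithmetic, not something confirmed in the source.

```python
# Counts copied from the "split rate" table above.
counts = {
    "SX20G0032": 232298140, "SX20G0033": 16034321, "SX20G0034": 245363709,
    "SX20G0035": 599969474, "SX20G0081": 596771042, "SX20G0083": 40790595,
    "SX20G0084": 1038522885, "SX20G0085": 5594048, "SX20G0086": 8100620,
    "SX20G0087": 5314546, "SX20G0088": 206429619, "SX20G0089": 433365557,
    "SX20G0090": 429231416, "SX20U00079": 281572893, "SX20U00080": 270611344,
    "SX20U00081": 304485507, "unmatched": 144289659,
}

true_total = sum(counts.values())
wrapped_total = true_total % 2**32  # what a 32-bit counter would show
unmatched_pct = round(100 * counts["unmatched"] / true_total, 2)

print(true_total)     # 4858745375
print(wrapped_total)  # 563778079, the "total" printed by the tool
print(unmatched_pct)  # 2.97
```

Measured against the summed counts rather than the printed total, the unmatched reads are about 3% of the run.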
Hi,
I have paired-end Illumina reads where the barcode is at the start of the read in either the R1 file OR the R2 file. To demultiplex, my thought was to run fastq-multx looking first for the barcode in the R1 file, then to repeat with the unmatched reads, looking for the barcode in the R2 file. Unfortunately, fastq-multx appears to be adding the full sequence of the read to the header of each unmatched read. Is there any way to prevent this? There does not seem to be an issue with the headers in the successfully demultiplexed reads.
I am running the command as follows:
fastq-multx -B barcode_file_plate1.txt Europe_R1_001.fastq Europe_R2_001.fastq -m 1 -o R1_has_barcode/R1.%.fastq -o R1_has_barcode/R2.%.fastq
Thanks!
Hello,
Is there a way to calculate the RAM requirement for the program? I run the program on a cluster, and the jobs are killed because of high memory usage. For example, my latest job exceeded 256 GB of RAM and was killed. Is that normal?
Thanks,
Yeserin.
I don't get this error on local machines running macos or ubuntu, but when I run it on a gzip file in a github action using ubuntu-latest, I get the following error on STDERR:
gzip: stdout: Broken pipe
It might be due to a version of gzip or else the timing of the closing of the STDOUT handle of the gzip popen call. Not sure.
See the commented test 26 from: https://github.com/hepcat72/htseq2multx/blob/2d96a9a53f0e7741be554689294fa42b729c20aa/tests/run_tests.sh#L355
How can I extract fastq files with the cell barcode retained?
We do not want to lose the cell barcode information in the extracted fastq files.
Could be improved with various barcode usage examples. And fix 'batcode' (https://github.com/brwnj/fastq-multx/blob/master/fastq-multx.cpp#L1150).
I don't know if this is specific to my system (macOS Catalina 10.15.6), but tests 10-12 fail due to a segfault (on master):
sh: line 1: 98519 Segmentation fault: 11 ../fastq-multx -H -v ' ' -l in/multx/master-barcodes.txt in/multx/mxtest-h_1.fastq in/multx/mxtest-h_2.fastq -o tmp/multx.t.R0jAb/mxout_%_1.fq -o tmp/multx.t.R0jAb/mxout_%_2.fq > tmp/multx.t.R0jAb/test4.out 2> tmp/multx.t.R0jAb/test4.err
not ok 10 - test4 worked (../fastq-multx -H -v ' ' -l in/multx/master-barcodes.txt in/multx/mxtest-h_1.fastq in/multx/mxtest-h_2.fastq -o tmp/multx.t.R0jAb/mxout_%_1.fq -o tmp/multx.t.R0jAb/mxout_%_2.fq > tmp/multx.t.R0jAb/test4.out 2> tmp/multx.t.R0jAb/test4.err)
# Failed test 'test4 worked (../fastq-multx -H -v ' ' -l in/multx/master-barcodes.txt in/multx/mxtest-h_1.fastq in/multx/mxtest-h_2.fastq -o tmp/multx.t.R0jAb/mxout_%_1.fq -o tmp/multx.t.R0jAb/mxout_%_2.fq > tmp/multx.t.R0jAb/test4.out 2> tmp/multx.t.R0jAb/test4.err)'
# at multx.t line 25.
not ok 11 - Files equal: tmp/multx.t.R0jAb/test4.out == out/multx/test4.out
# Failed test 'Files equal: tmp/multx.t.R0jAb/test4.out == out/multx/test4.out'
# at ./test-prep.pl line 48.
not ok 12 - Files equal: tmp/multx.t.R0jAb/test4.err == out/multx/test4.err
# Failed test 'Files equal: tmp/multx.t.R0jAb/test4.err == out/multx/test4.err'
# at ./test-prep.pl line 48.
We just integrated fastq-multx into our sequencing core post-processing and the first job revealed a bug. The standard output showed:
total -2085122201
We're guessing it's due to the int here:
Line 1081 in b2735e8
It should probably be changed to a type with a higher maximum value.
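To illustrate the guess, the Python sketch below emulates how a signed 32-bit int wraps and recovers the smallest true total that would print as the negative number above. The helper name is hypothetical, and the real total could be larger by any multiple of 2^32, so this is a plausibility check rather than a reconstruction.

```python
def as_int32(n):
    """Emulate storing n in a signed 32-bit integer (two's complement wraparound)."""
    return (n + 2**31) % 2**32 - 2**31


reported = -2085122201            # the "total" from the standard output above
candidate = reported + 2**32      # smallest non-negative count that wraps to it

print(candidate)                  # 2209845095
print(as_int32(candidate))        # -2085122201
print(as_int32(2**31 - 1))        # 2147483647, the largest value that still fits
```

A 64-bit counter would hold either value without wrapping.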
Hi!
First of all, thank you for this useful tool. I have a little remark:
When demultiplexing from separate R1, R2, I1, I2 files, the resulting demultiplexed fastq files get their respective barcode sequences added to the headers separated by a space. Depending on the original header format, this sometimes results in a non-standard header like this:
@NB551430:403:HGKFLBGXG:1:11101:16941:1077/1 GTAGAGGA
This causes problems in subsequent processing with some tools. I would like an option to keep the headers untouched, with no barcodes added at all (currently I have to remove them in a subsequent step).
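For reference, the removal step can be as small as the Python sketch below. It is a hedged illustration of a post-processing workaround, not part of fastq-multx: the function names are hypothetical, it assumes strict 4-line FASTQ records, and it assumes the appended barcode is a final whitespace-separated field made only of A, C, G, T, or N. A header whose original last field has that form would be trimmed too.

```python
import re

# Trailing whitespace followed by a run of nucleotide letters at the end of the line.
_BARCODE_SUFFIX = re.compile(r"\s+[ACGTN]+$")


def strip_header_barcode(header):
    """Remove a barcode appended after whitespace at the end of a FASTQ header."""
    return _BARCODE_SUFFIX.sub("", header)


def clean_fastq(lines):
    """Yield FASTQ lines with the appended barcode removed from each header line."""
    for i, line in enumerate(lines):
        text = line.rstrip("\n")
        # Only every fourth line, starting with the first, is a header.
        yield strip_header_barcode(text) if i % 4 == 0 else text


record = [
    "@NB551430:403:HGKFLBGXG:1:11101:16941:1077/1 GTAGAGGA\n",
    "ACGTACGT\n",
    "+\n",
    "FFFFFFFF\n",
]
cleaned = list(clean_fastq(record))
print(cleaned[0])  # @NB551430:403:HGKFLBGXG:1:11101:16941:1077/1
```

Headers that end in an index after a colon, such as "1:N:0:TTACCGAC+CGTATTCG", are left alone because no whitespace precedes the nucleotide run.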
Thank you!
This is basically the same issue as #14, but instead of affecting the total read count, it happens when an individual sample's read count exceeds the maximum int value of 2147483647.
In our case, it occurred specifically with unmatched output, but given the growth of single-cell sequencing analyses and technologies, it seems that ending up with real samples exceeding this max int is imminent.
A tag/release of version 1.4 would make it easier to build a bioconda recipe or generally instruct users on how to build/install the latest updates. Thanks.
Hi Joe,
I just ran across a Segmentation fault
error when demultiplexing from barcodes in the header. However, all %.fastq.gz
files are created as empty files, so the error must occur afterwards.
My call is fastq-multx -H -m1 -B barcodes.txt input.fastq.gz -o %.fastq.gz
and a fastq header line looks like:
@NS500475:199:HHML2BGX2:1:11101:21358:1116 2:N:0:1 AACCAATCGT
GCGGTTAAGAGTACTGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
AAAA/EEEAEEEEEEEE###############################################################
Maybe you can provide me with some directions on how demultiplexing from a barcode in the header is possible.
Best and many thanks!
Jens
The order of the options in the usage:
Usage: fastq-multx [-g|-l|-B] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...
suggests that input files and output file naming templates can be interspersed, such as:
fastq-multx ... read1.fq -o 'r1.%.fq' mate.fq -o 'r2.%.fq'
However, supplying in that order results in the error:
Error: number of input files (1) must match number of output files following '-o'.
and no output files are generated. Whereas this ordering does not generate an error:
fastq-multx ... read1.fq mate.fq -o 'r1.%.fq' -o 'r2.%.fq'
Also, the exit code from a run with the indicated error is 0 (success).
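Until the exit code is fixed, a pipeline wrapper can scan the captured output as well as the return code. The Python sketch below is one hedged way to do that. The function name is hypothetical, and it assumes the message starts with "Error:" at the beginning of a line, as in the text quoted above. Whether that line goes to stdout or stderr is not stated here, so the sketch expects both streams combined.

```python
def run_failed(returncode, output_text):
    """Return True if a fastq-multx run should be treated as failed.

    output_text is the combined stdout and stderr captured from the run.
    """
    if returncode != 0:
        # Covers crashes too: subprocess reports a segfault as a negative code on POSIX.
        return True
    return any(line.startswith("Error:") for line in output_text.splitlines())


# With subprocess, the inputs could be gathered like this (not executed here):
#   proc = subprocess.run(cmd, capture_output=True, text=True)
#   failed = run_failed(proc.returncode, proc.stdout + proc.stderr)

message = "Error: number of input files (1) must match number of output files following '-o'.\n"
print(run_failed(0, message))                              # True
print(run_failed(0, "Using Barcode File: indexes.txt\n"))  # False
print(run_failed(-11, ""))                                 # True
```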
I will likely submit another PR that addresses these and any other issues I encounter in my efforts.