brwnj / fastq-multx

Demultiplexes a fastq.

Makefile 0.29% C++ 89.43% C 5.82% Shell 0.10% Perl 4.36%

fastq-multx's Introduction

Streamlining clinical bioinformatics workflows in production environments.

fastq-multx's People

fjrossello y9c jun-lizst hepcat72 dauss75

fastq-multx's Issues

Demuxing dual-indexed, single-end fastq files with indexes in header fails

I'm trying this tool for the first time to demultiplex a fastq file, Undetermined_L001_R1.fastq.gz, from an Illumina run that used dual indexes and single-end reads. Several different invocations of the tool end in a segmentation fault.

So I decided to test the tool using the scripts and data in the test folder of this repository. Whenever I use the -H parameter to specify that the indexes are in the read headers, the tool fails if I supply only one fastq file as input. However, when I supply two fastq files as input (entering the same file twice), the program runs without issues. My commands are below, with an excerpt of the data.

This fails:

$ fastq-multx -H -B indexes.txt test.fastq -o test_%.fastq
Using Barcode File: indexes.txt
Segmentation fault (core dumped)

but this runs (using the input file twice and suppressing one of the outputs, as a sort of workaround):

$ fastq-multx -H -B indexes.txt test.fastq test.fastq -o  test_%.fastq -o n/a
Using Barcode File: indexes.txt
End used: start
Id      Count   File(s)
S1    17800     test_S1.fastq
S2    32100     test_S2.fastq
S3    3900      test_S3.fastq
...

The files I am using look like this:

==> index.txt <==
S1     TTACCGAC-CGTATTCG
S2     TCGTCTGA-TCAAGGAC
S3     TTCCAGGT-AAGCACTG
S4     TACGGTCT-GCAATGGA
S5     AAGACCGT-CAATCGAC

==> test.fastq <==
@A00929:83:HL75TDRXX:1:2101:13919:1047 1:N:0:TTACCGAC+CGTATTCG
CATATTGATAGTTCGCACAGGTAG
+
FFFFFFFFFFFFFFFFFFFFFFFF
@A00929:83:HL75TDRXX:1:2101:14009:1047 1:N:0:TCGTCTGA+TCAAGGAC
GTGCGTATCTATCAAAAATGTATA
+

I installed fastq-multx using conda in Ubuntu 20.04
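
For reference, the assignment that -H is expected to produce for this data can be sketched outside the tool. This is a minimal illustration, not fastq-multx's implementation: it takes the last colon-separated field of the header comment, converts the `+` separator used in the read headers to the `-` used in the barcode file, and looks up the pair by exact match. The function names `header_barcode` and `assign_sample` are hypothetical.

```python
def header_barcode(header: str) -> str:
    """Return the index field of an Illumina-style header, e.g. 'TTACCGAC+CGTATTCG'."""
    # The comment follows the first space; the index is its last ':'-separated field.
    comment = header.rstrip().split(" ", 1)[1]
    return comment.rsplit(":", 1)[1]


def assign_sample(header: str, barcodes: dict[str, str]) -> str:
    """Map a read header to a sample ID by exact match, or 'unmatched'."""
    # Read headers join the two indexes with '+'; the barcode file uses '-'.
    key = header_barcode(header).replace("+", "-")
    return barcodes.get(key, "unmatched")


# Barcode pairs copied from the index file excerpt above.
barcodes = {
    "TTACCGAC-CGTATTCG": "S1",
    "TCGTCTGA-TCAAGGAC": "S2",
}

header = "@A00929:83:HL75TDRXX:1:2101:13919:1047 1:N:0:TTACCGAC+CGTATTCG"
print(assign_sample(header, barcodes))  # S1
```

Comparing a tally from a script like this against the tool's counts is one way to check whether the two-file workaround assigns reads as intended.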

Unmatch decision question

Hi @brwnj,
This is more a question than an issue.
As I understand it, your tool splits raw reads into different files according to a barcode.
I ran it as follows:
fastq-multx -m 2 -b -x -B preprocess/barcodes.txt Undetermined_S0_L001_R1_001.fastq.gz Undetermined_S0_L001_R2_001.fastq.gz -o preprocess/multx/%_R1.fq.gz preprocess/multx/%_R2.fq.gz
The run completes, but the results are unexpected: fastq-multx sends 1/3 of my reads to unmatched.fq.gz.
Many of the reads in unmatched.fq.gz contain my barcodes.
Is anything other than the barcode match involved in the unmatched decision, such as the insert length?
Thanks !
Bastien

Length of Read1.fq file different from length of indexfile.fq

I need to run multiple demultiplexing steps because my barcodes are unique only in pairs (i5/i7).

The first round using i5 barcodes works fine as usual, but the next round of demultiplexing with i7 barcodes does not run, because the barcode file and the read file contain different numbers of records (the reads were derived from the previous demultiplexing round).

error msg: # of rows in mate file 'xxxxx.fq' doesn't match primary file, quitting!

Is there a way in which fastq-multx can demultiplex a read1.fq file that is shorter than the barcode file?
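
One possible workaround outside fastq-multx is to subset the barcode file so that it contains exactly the records still present in the read file, then run the second round on the matched pair. The sketch below assumes uncompressed four-line FASTQ records, that both files keep reads in the same relative order, and that the read ID is the header up to the first whitespace (minus any trailing /1 or /2). The function names are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_id(header: str) -> str:
    """Header up to the first whitespace, without a trailing /1 or /2."""
    rid = header.split()[0]
    return rid[:-2] if rid.endswith(("/1", "/2")) else rid


def records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group an iterable of lines into four-line FASTQ records."""
    rec: List[str] = []
    for line in lines:
        rec.append(line.rstrip("\n"))
        if len(rec) == 4:
            yield rec
            rec = []


def subset_to_match(reads: Iterable[str], index: Iterable[str]) -> Iterator[List[str]]:
    """Yield only the index records whose ID also appears in the reads."""
    wanted = {read_id(rec[0]) for rec in records(reads)}  # all IDs held in memory
    for rec in records(index):
        if read_id(rec[0]) in wanted:
            yield rec


# Tiny in-memory example; real use would pass open file handles instead.
reads = ["@r2 1:N:0", "ACGT", "+", "FFFF"]
index = ["@r1 2:N:0", "AAAA", "+", "FFFF", "@r2 2:N:0", "CCCC", "+", "FFFF"]
print([rec[1] for rec in subset_to_match(reads, index)])  # ['CCCC']
```

Holding every read ID in a set costs memory on large runs; a streaming merge over the two files would avoid that if the ordering assumption holds.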

how to demultiplex dual index on paired end reads

index1-read1 --- read2-index2

sample1 index1-a index2-b
sample2 index1-a index2-c
sample3 index1-d index2-c

what does "Skipped because of distance < 2 : 9439859" mean?

I have a usage question.

data type:

paired end
only 8bp I7 index on R2 tail.

barcode file:

SX20G0032	TTCTGGTG
SX20G0033	CCGAAAAC
SX20G0034	CGAAAAGG
SX20G0035	AACCAGCT
SX20G0081	GTTTGTGC
SX20G0083	CGGTTTTC
SX20G0084	TGACCGAA
SX20G0085	CCCTATTC
SX20G0086	ACGTTGTG
SX20G0087	AAAAGCGC
SX20G0088	TCACTCGT
SX20G0089	GCCGATTT
SX20G0090	TGCAAGAG
SX20U00079	CCCTTAAC
SX20U00080	CCTACCTA
SX20U00081	GTTTCAGC

command line:
fastq-multx -e -B barcode.txt -m 1 E100004487_L01_read_2.fq.gz E100004487_L01_read_1.fq.gz -o %_R2.fastq -o %_R1.fastq
questions:
Did I use the right barcode file format and command line?
I got output like "Skipped because of distance < 2 : 9439859". What kind of reads exactly were skipped?
split rate:

Id	Count   File(s)
SX20G0032 232298140 SX20G0032_R2.fastq SX20G0032_R1.fastq
SX20G0033 16034321 SX20G0033_R2.fastq SX20G0033_R1.fastq
SX20G0034 245363709 SX20G0034_R2.fastq SX20G0034_R1.fastq
SX20G0035 599969474 SX20G0035_R2.fastq SX20G0035_R1.fastq
SX20G0081 596771042 SX20G0081_R2.fastq SX20G0081_R1.fastq
SX20G0083 40790595 SX20G0083_R2.fastq SX20G0083_R1.fastq
SX20G0084 1038522885 SX20G0084_R2.fastq SX20G0084_R1.fastq
SX20G0085 5594048 SX20G0085_R2.fastq SX20G0085_R1.fastq
SX20G0086 8100620 SX20G0086_R2.fastq SX20G0086_R1.fastq
SX20G0087 5314546 SX20G0087_R2.fastq SX20G0087_R1.fastq
SX20G0088 206429619 SX20G0088_R2.fastq SX20G0088_R1.fastq
SX20G0089 433365557 SX20G0089_R2.fastq SX20G0089_R1.fastq
SX20G0090 429231416 SX20G0090_R2.fastq SX20G0090_R1.fastq
SX20U00079 281572893 SX20U00079_R2.fastq SX20U00079_R1.fastq
SX20U00080 270611344 SX20U00080_R2.fastq SX20U00080_R1.fastq
SX20U00081 304485507 SX20U00081_R2.fastq SX20U00081_R1.fastq
unmatched 144289659 unmatched_R2.fastq unmatched_R1.fastq
total 563778079

So many reads are unmatched, and I am not sure whether this is a normal rate. If not, is it the result of a bad command line or barcode setting?
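
As far as I can tell from the tool's help text, the "distance" in that message is the gap between the best and the next-best barcode match (the -d option, default 2), while -m limits the mismatches allowed for the best match. Treat that reading as an assumption. Under it, the decision for one candidate index sequence can be sketched as below. This is an illustration of the idea, not the tool's code, and `classify` is a hypothetical name.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def classify(seq: str, barcodes: dict[str, str],
             max_mismatch: int = 1, min_gap: int = 2) -> str:
    """Return a sample ID, 'unmatched', or 'skipped' (needs at least two barcodes)."""
    scored = sorted((hamming(seq, bc), name) for name, bc in barcodes.items())
    (best, name), (second, _) = scored[0], scored[1]
    if best > max_mismatch:
        return "unmatched"
    if second - best < min_gap:
        return "skipped"  # two barcodes explain the read almost equally well
    return name


# Three barcodes taken from the barcode file above.
barcodes = {
    "SX20G0032": "TTCTGGTG",
    "SX20G0081": "GTTTGTGC",
    "SX20U00081": "GTTTCAGC",
}

print(classify("GTTTGTGC", barcodes))  # SX20G0081 (exact match, next best is 2 away)
print(classify("GTTTGAGC", barcodes))  # skipped (1 mismatch to two different barcodes)
```

SX20G0081 and SX20U00081 differ at only two positions, so a single sequencing error between them is enough to make a read ambiguous in this model.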

headers in unmatched file

Hi,

I have paired-end Illumina reads where the barcode is at the start of the read in either the R1 file OR the R2 file. To demultiplex, my plan was to run fastq-multx looking first for the barcode in the R1 file, then to repeat with the unmatched reads looking for the barcode in the R2 file. Unfortunately, fastq-multx appears to add the full sequence of the read to the header of each unmatched read. Is there any way to prevent this? The headers of the successfully demultiplexed reads do not seem to have this problem.

I am running the command as follows:
fastq-multx -B barcode_file_plate1.txt Europe_R1_001.fastq Europe_R2_001.fastq -m 1 -o R1_has_barcode/R1.%.fastq -o R1_has_barcode/R2.%.fastq

Thanks!

memory issues

Hello,

Is there a way to calculate the RAM requirement for the program? I run the program on a cluster, and the jobs are killed because of high memory usage. For example, my latest job exceeded 256 GB of RAM and was killed. Is this normal?

Thanks,
Yeserin.

`gzip: stdout: Broken pipe`

I don't get this error on local machines running macOS or Ubuntu, but when I run it on a gzip file in a GitHub Action using ubuntu-latest, I get the following error on STDERR:

gzip: stdout: Broken pipe

It might be due to a version of gzip or else the timing of the closing of the STDOUT handle of the gzip popen call. Not sure.

See the commented test 26 from: https://github.com/hepcat72/htseq2multx/blob/2d96a9a53f0e7741be554689294fa42b729c20aa/tests/run_tests.sh#L355

How to extract fastq files with the cell barcode retained?

We just do not want to lose the cell barcode information in the extracted fastq files.

Help message

The help message could be improved with examples of the various barcode usages. The typo 'batcode' should also be fixed (https://github.com/brwnj/fastq-multx/blob/master/fastq-multx.cpp#L1150).

Tests 10-12 fail

I don't know if this is specific to my system (macOS Catalina 10.15.6), but tests 10-12 fail due to a segfault (on master):

sh: line 1: 98519 Segmentation fault: 11  ../fastq-multx -H -v ' ' -l in/multx/master-barcodes.txt in/multx/mxtest-h_1.fastq in/multx/mxtest-h_2.fastq -o tmp/multx.t.R0jAb/mxout_%_1.fq -o tmp/multx.t.R0jAb/mxout_%_2.fq > tmp/multx.t.R0jAb/test4.out 2> tmp/multx.t.R0jAb/test4.err
not ok 10 - test4 worked (../fastq-multx -H -v ' ' -l in/multx/master-barcodes.txt in/multx/mxtest-h_1.fastq in/multx/mxtest-h_2.fastq -o tmp/multx.t.R0jAb/mxout_%_1.fq -o tmp/multx.t.R0jAb/mxout_%_2.fq > tmp/multx.t.R0jAb/test4.out 2> tmp/multx.t.R0jAb/test4.err)
#   Failed test 'test4 worked (../fastq-multx -H -v ' ' -l in/multx/master-barcodes.txt in/multx/mxtest-h_1.fastq in/multx/mxtest-h_2.fastq -o tmp/multx.t.R0jAb/mxout_%_1.fq -o tmp/multx.t.R0jAb/mxout_%_2.fq > tmp/multx.t.R0jAb/test4.out 2> tmp/multx.t.R0jAb/test4.err)'
#   at multx.t line 25.
not ok 11 - Files equal: tmp/multx.t.R0jAb/test4.out == out/multx/test4.out
#   Failed test 'Files equal: tmp/multx.t.R0jAb/test4.out == out/multx/test4.out'
#   at ./test-prep.pl line 48.
not ok 12 - Files equal: tmp/multx.t.R0jAb/test4.err == out/multx/test4.err
#   Failed test 'Files equal: tmp/multx.t.R0jAb/test4.err == out/multx/test4.err'
#   at ./test-prep.pl line 48.

Possible int overflow issue

We just integrated fastq-multx into our sequencing core post-processing and the first job revealed a bug. The standard output showed:

total	-2085122201

We're guessing it's due to the int here:

fastq-multx/fastq-multx.cpp

Line 1081 in b2735e8

int tot=0;

Probably should change that to something with a higher max value.
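
The same wraparound can be reproduced with numbers already on this page. The per-sample counts in the split table of the "Skipped because of distance < 2" issue above sum to 4858745375, and reducing that to a 32-bit signed value gives exactly the 563778079 reported there as the total. The sketch below simulates a 32-bit two's-complement int in Python. Signed overflow is formally undefined in C++, so treating it as wraparound is an assumption about typical platforms, and `wrap_int32` is a hypothetical helper.

```python
def wrap_int32(n: int) -> int:
    """Interpret an arbitrary integer as a 32-bit two's-complement value."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


# Per-sample counts (including 'unmatched') from the split table in the
# "Skipped because of distance < 2" issue above.
counts = [
    232298140, 16034321, 245363709, 599969474, 596771042, 40790595,
    1038522885, 5594048, 8100620, 5314546, 206429619, 433365557,
    429231416, 281572893, 270611344, 304485507, 144289659,
]

true_total = sum(counts)
print(true_total)              # 4858745375
print(wrap_int32(true_total))  # 563778079, the total printed in that issue

# If the counter in this issue wrapped exactly once, the real total would be:
print(-2085122201 + 2**32)     # 2209845095
```

A 64-bit counter would hold these totals, which is in line with the suggestion above to use a type with a higher maximum value.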

space in fastq header

Hi!
First of all, thank you for this useful tool. I have a little remark:

When demultiplexing from separate R1, R2, I1, I2 files, the resulting demultiplexed fastq files get their respective barcode sequences added to the headers separated by a space. Depending on the original header format, this sometimes results in a non-standard header like this:

@NB551430:403:HGKFLBGXG:1:11101:16941:1077/1 GTAGAGGA

This causes problems in subsequent processing with some tools. I would like an option to keep the headers untouched, with no barcodes added at all (currently I have to remove them in a subsequent step).

Thank you!
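
Until such an option exists, the removal step can be a small filter. This is a minimal sketch, assuming the appended barcode is always the last whitespace-separated token of the header and contains only A, C, G, T, N, `+` or `-`. Both the pattern and the name `strip_barcode` are illustrative, and the function should be applied to header lines only (every fourth line, starting with the first).

```python
import re

# Assumption: the barcode is the final token and uses only these characters.
TRAILING_BARCODE = re.compile(r"\s+[ACGTN+-]+$")


def strip_barcode(header: str) -> str:
    """Remove a trailing barcode token from a FASTQ header line."""
    return TRAILING_BARCODE.sub("", header.rstrip("\n"))


print(strip_barcode("@NB551430:403:HGKFLBGXG:1:11101:16941:1077/1 GTAGAGGA"))
# @NB551430:403:HGKFLBGXG:1:11101:16941:1077/1
```

Headers whose last token contains other characters, such as `1:N:0:1`, are left unchanged.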

int overflow when sample contains more reads than max int

This is basically the same issue as #14, but instead of affecting the total read count, it happens when an individual sample's read count exceeds the maximum int value of 2147483647.

In our case, it occurred specifically with unmatched output, but given the growth of single-cell sequencing analyses and technologies, it seems that ending up with real samples exceeding this max int is imminent.

Request a tag/release of version 1.4

A tag/release of version 1.4 would make it easier to build a bioconda recipe or generally instruct users on how to build/install the latest updates. Thanks.

Demultiplexing from barcodes in headers

Hi Joe,

I just ran into a segmentation fault when demultiplexing from barcodes in the header. All the %.fastq.gz files are created, but they are empty, so the error must occur after they are created.
My call is fastq-multx -H -m1 -B barcodes.txt input.fastq.gz -o %.fastq.gz and a fastq record looks like:

@NS500475:199:HHML2BGX2:1:11101:21358:1116 2:N:0:1 AACCAATCGT
GCGGTTAAGAGTACTGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
AAAA/EEEAEEEEEEEE###############################################################

Maybe you can provide me with some directions on how demultiplexing from a barcode in the header is possible.
Best and many thanks!
Jens

Usage suggests an option order that generates an error, no outputs, and exit code 0

The order of the options in the usage:

Usage: fastq-multx [-g|-l|-B] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...

suggests that input files and output file naming templates can be interspersed, such as:

fastq-multx ... read1.fq -o 'r1.%.fq' mate.fq -o 'r2.%.fq'

However, supplying in that order results in the error:

Error: number of input files (1) must match number of output files following '-o'.

and no output files are generated. Whereas this ordering does not generate an error:

fastq-multx ... read1.fq mate.fq -o 'r1.%.fq' -o 'r2.%.fq'

Also, the exit code from a run with the indicated error is 0 (success).

I will likely submit another PR that addresses these and any other issues I encounter in my efforts.
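
Until the exit code is fixed, a pipeline cannot rely on the return status alone. One defensive option is a wrapper that also scans the captured output for the `Error:` prefix quoted above. This is a sketch under assumptions: I have not checked whether the message goes to stdout or stderr, so both are scanned, and `run_checked` is a hypothetical helper name.

```python
import subprocess
import sys


def run_checked(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command; fail on a nonzero exit code or on an 'Error:' line in its output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = (proc.stdout + "\n" + proc.stderr).splitlines()
    errors = [ln for ln in lines if ln.startswith("Error:")]
    if proc.returncode != 0 or errors:
        raise RuntimeError(f"{cmd[0]} failed (exit {proc.returncode}): {errors}")
    return proc


# Stand-in for fastq-multx: prints an error message and still exits 0.
fake = [sys.executable, "-c",
        "import sys; print('Error: number of input files (1) must match', file=sys.stderr)"]
try:
    run_checked(fake)
except RuntimeError as exc:
    print("caught:", exc)
```

In real use the command list would be the fastq-multx invocation itself, with inputs first and the `-o` templates afterwards, as in the ordering that works above.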