bgi-flexlab / soapnuke

A Tool for integrated Quality Control and Preprocessing on FASTQ or BAM/CRAM files

License: GNU General Public License v3.0

C++ 97.76% R 1.86% Makefile 0.38%

Introduction

As a novel analysis tool developed for quality control and preprocessing of FASTQ and SAM/BAM data, SOAPnuke includes 5 modules for different usage scenarios, namely filter, filterHts, filterStLFR, filtersRNA and filterMeta.

filter: Preprocesses FASTQ files: trims reads if trimming is enabled (adapters, low-quality ends, etc.), discards reads (adapter contamination, low quality, high N-base ratio, etc.), and generates a statistics report.

filterHts: Preprocesses BAM/CRAM files. The processing procedure is the same as in the filter module.

(Note: Input BAM/CRAM files should be sorted by read ID when they contain paired-end data.)

filterStLFR: Preprocesses stLFR FASTQ files. It adds a barcode-detection step at the beginning and accepts a list of FASTQ files as input.

filtersRNA: Preprocesses sRNA FASTQ files. This module is still under testing, so please let us know if you encounter any bugs.

filterMeta: Preprocesses Meta FASTQ files. This module is still under testing, so please let us know if you encounter any bugs.

PERFORMANCE

SOAPnuke 2.X performs considerably better than 1.X. The speedup comes from refactoring the whole framework and optimizing multithreading and I/O.

This table presents benchmark results on 628M paired-end 150 bp reads. Run time decreases as the number of threads increases.

Software	ThreadNum	RunTime(min)	MaxMem(MB)	Parameter
SOAPnuke	16	35.7	2270	filter module
SOAPnuke	8	48.4	881	filter module
SOAPnuke	4	72.1	275	filter module
fastp	8	62.0	1004	-A -w 8

Getting started

Requirements

gcc: 4.7 or higher
zlib: 1.2.3.5 or higher
htslib: 1.9 or higher
pthread library

Install

git clone https://github.com/BGI-flexlab/SOAPnuke.git
cd SOAPnuke
 
# The filterHts module is rarely used and has complex compile dependencies, so it is turned off by default.
# To use the filterHts module, set USEHTS to true in the Makefile like this:
# USEHTS=true
 
make

QuickStart

All usages start with executable file SOAPnuke, and different modules are invoked with different sub-commands. Here are some usage examples:

    #filter:

    #QC the input FASTQ files and extract 10M clean reads to the output files.
    echo "totalReadsNum=10000000" >config.txt

    SOAPnuke filter -1 test.r1.fq.gz -2 test.r2.fq.gz -C clean_1.fq.gz -D clean_2.fq.gz -o result -T 8 -c config.txt


    #filterHts:

    SOAPnuke filterHts --ref chr21.fa -1 input.bam -2 output.cram -o result
    SOAPnuke filterHts -1 input.bam -2 output.bam -o result


    #filterStLFR:

    SOAPnuke filterStLFR -1 fq1.list -2 fq2.list -C clean1.gz -D clean2.gz -o result -T 8 -c config

Detailed QC steps

If trim-related parameters are set, trimming is done first (no trimming is done otherwise):

Read ID

If the parameter “index” is set in the config file, the index sequence is removed from the read ID.

Once “index” is set, if seqType is 0 (the default), the read ID is expected to look like:

@FCD1PB1ACXX:4:1101:1799:2201#GAAGCACG/2

“#GAAGCACG” is then removed.

If seqType is 1, the read ID is expected to look like:

@HISEQ:310:C5MH9ANXX:1:1101:3517:2043 2:N:0:TCGGTCAC

“:TCGGTCAC” is then removed.
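
The index removal described above can be sketched in a few lines of Python. This is only an illustration of the two ID layouts, not SOAPnuke's own code: the function name `strip_index` and the regular expressions are assumptions, and keeping the trailing `/2` in the old-style ID follows from the README stating that only “#GAAGCACG” is removed.

```python
import re


def strip_index(read_id: str, seq_type: int = 0) -> str:
    """Remove the index sequence from a FASTQ read ID (illustrative only).

    seq_type 0: old-style ID, the index follows '#'.
    seq_type 1: new-style ID, the index is the last ':'-separated field.
    """
    if seq_type == 0:
        # Drop '#' plus the index bases; a trailing /1 or /2 is kept.
        return re.sub(r"#[ACGTN]+", "", read_id, count=1)
    if seq_type == 1:
        # Drop the final ':INDEX' field ('+' allows dual indexes).
        return re.sub(r":[ACGTN+]+$", "", read_id)
    raise ValueError("seq_type must be 0 or 1")


print(strip_index("@FCD1PB1ACXX:4:1101:1799:2201#GAAGCACG/2", 0))
print(strip_index("@HISEQ:310:C5MH9ANXX:1:1101:3517:2043 2:N:0:TCGGTCAC", 1))
```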

Read sequence and quality

First, the cut length for every enabled trimming type is calculated, including hard trim, low-quality end trim, adapter trim and tail-polyG trim. The longest cut is then applied.

hard trim: directly removes a fixed number of bases from the head or tail of the read sequence
low quality end trim: removes low-quality bases starting from the end until a base with quality higher than the cutoff is reached
adapter trim: when an adapter is found, the base sequence and quality sequence are trimmed from the position where the adapter match starts
tail-polyG trim: if the number of consecutive G bases at the tail is greater than the cutoff, the polyG tail is trimmed
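
The "longest cut wins" rule can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the SOAPnuke implementation: only the tail side is handled, the adapter is located by exact string match (the real tool tolerates mismatches), and all function and parameter names are hypothetical.

```python
def low_quality_tail_cut(quals, cutoff):
    """Count trailing bases whose quality is not higher than `cutoff`."""
    cut = 0
    for q in reversed(quals):
        if q > cutoff:
            break
        cut += 1
    return cut


def adapter_cut(seq, adapter):
    """Cut from the first exact adapter match to the end of the read."""
    pos = seq.find(adapter)
    return len(seq) - pos if pos >= 0 else 0


def poly_g_tail_cut(seq, min_run):
    """Cut the trailing run of G bases if it is longer than `min_run`."""
    run = len(seq) - len(seq.rstrip("G"))
    return run if run > min_run else 0


def trim_tail(seq, quals, hard_tail=0, qual_cutoff=None, adapter=None, poly_g=None):
    """Compute every enabled tail cut and apply only the longest one."""
    cuts = [hard_tail]
    if qual_cutoff is not None:
        cuts.append(low_quality_tail_cut(quals, qual_cutoff))
    if adapter:
        cuts.append(adapter_cut(seq, adapter))
    if poly_g is not None:
        cuts.append(poly_g_tail_cut(seq, poly_g))
    keep = len(seq) - min(max(cuts), len(seq))
    return seq[:keep], quals[:keep]


seq = "ACGTACGTACAGATCGGAAG"
quals = [30] * 18 + [2, 2]
# The adapter cut (10 bases) is longer than the hard trim (3) and the quality trim (2).
print(trim_tail(seq, quals, hard_tail=3, qual_cutoff=5, adapter="AGATCGG"))
```

Computing all cut lengths first and applying a single cut means the sequence and quality strings are sliced once, so they cannot drift out of sync.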

Then do filtering:

Note that both reads of a pair are discarded when either of them fails QC.

Priority(High to Low):

Tile, may be used in some types of BGI data.

If you want to discard reads with certain tile ID, set the parameter like “1101-1104,1205”.

Fov, may be used in data from zebra-platform.

If you want to discard reads with certain FOV ID, set the parameter like “C001R003,C003R004”.

Minimal read length

Discard a read with sequence length shorter than the parameter.

Maximal read length

Discard a read with sequence length longer than the parameter.

N ratio

Discard a read with N base ratio not smaller than the parameter.

High A ratio

Discard a read with A base ratio not smaller than the parameter.

polyX number (X means any one base)

Discard a read with poly-X number not smaller than the parameter.

Low quality base ratio

Discard a read with low-quality bases ratio not smaller than the parameter.

Mean quality

Discard a read whose mean base quality is smaller than the parameter.

Overlapped length if PE

Discard a read pair whose suspected overlap is longer than the parameter.

Adapter

Discard a read which contains an adapter.
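
As a rough illustration of how several of these checks combine, the Python sketch below applies the length, N ratio, low-quality ratio and mean quality checks in the listed order and discards a pair when either mate fails. The class and function names are hypothetical, the defaults are taken from the parameter list in this README, and treating a base as low quality when its score is at most the `-l` value is an assumption (the exact comparison is not stated here).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Thresholds:
    min_len: int = 30                  # -4 / --minReadLen
    max_len: Optional[int] = None      # maxReadLen, unchecked by default
    n_rate: float = 0.05               # -n / --nRate
    low_qual: int = 5                  # -l / --lowQual
    qual_rate: float = 0.5             # -q / --qualRate
    mean_qual: Optional[float] = None  # -m / --mean, unchecked by default


def fail_reason(seq: str, quals: List[int], t: Thresholds) -> Optional[str]:
    """Return the first failed check, or None if the read passes."""
    n = len(seq)
    if n == 0 or n < t.min_len:
        return "too short"
    if t.max_len is not None and n > t.max_len:
        return "too long"
    if seq.upper().count("N") / n >= t.n_rate:      # "not smaller than"
        return "N ratio"
    low = sum(1 for q in quals if q <= t.low_qual)  # assumption: <= counts as low
    if low / n >= t.qual_rate:
        return "low quality"
    if t.mean_qual is not None and sum(quals) / n < t.mean_qual:
        return "mean quality"
    return None


def pair_passes(read1, read2, t: Thresholds) -> bool:
    """A pair is kept only when both mates pass every check."""
    return fail_reason(*read1, t) is None and fail_reason(*read2, t) is None


good = ("ACGT" * 10, [30] * 40)
noisy = ("NN" + "ACGT" * 9 + "AC", [30] * 40)
print(fail_reason(*noisy, Thresholds()), pair_passes(good, noisy, Thresholds()))
```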

Parameter

Commonly used parameters

filter module

-1 / --fq1

fq1 file (required); both .gz and plain-text formats are supported

-2 / --fq2

fq2 file (used when processing PE data); the format must match the fq1 file, i.e. both .gz or both plain text

-C / --cleanFq1

reads from the fq1 file that pass QC are written to this file

-D / --cleanFq2

reads from the fq2 file that pass QC are written to this file

-o / --out

Output directory. Processed fq files and statistics are written here

-f / --adapter1

adapter sequence or list file of read1

-r / --adapter2

adapter sequence or list file of read2

-J / --ada_trim

trim the read when an adapter is found. This is a boolean parameter; the default is false, which means the read is discarded when an adapter is found

-T / --thread

number of threads used for processing, default value is 6

-c / --configFile

config file that holds the uncommonly used parameters. Each line contains one parameter, e.g. adaMis=2 for a parameter that takes a value, or contam_trim for a boolean parameter, which sets the mode to discard when a contaminant sequence is found

-l / --lowQual

low quality threshold, default value is 5

-q / --qualRate

low quality rate threshold, default value is 0.5

-n / --nRate

N rate threshold, default value is 0.05

-m / --mean

low average quality threshold. Set a value if you want to discard reads with low average quality; the software does NOT check this item by default

-p / --highA

threshold for the ratio of A bases in a read; the software does NOT check this item by default

-g / --polyG_tail

threshold for the number of polyG bases in the read tail; the software does NOT check this item by default

-X / --polyX

polyX number threshold; the software does NOT check this item by default

-4 / --minReadLen

read minimal length, default value is 30

-h / --help

Show help information

-v / --version

Show version information

filterHts module

Here we only present options different from filter module.

-E / --ref

reference file (required when processing CRAM format)

-1

input bam/cram file(required)

-2

output bam/cram file(required)

filterStLFR module

Here we only present options different from filter module.

-1 / --fq1

Accepts a list of FASTQ files as input

-2 / --fq2

Accepts a list of FASTQ files as input

Uncommonly used parameters

ctMatchR

Contaminant sequence shortest consistent matching ratio [default:0.2]

seqType

Sequence fq name type, 0->old fastq name, 1->new fastq name [0]

old fastq name: @FCD1PB1ACXX:4:1101:1799:2201#GAAGCACG/2

new fastq name: @HISEQ:310:C5MH9ANXX:1:1101:3517:2043 2:N:0:TCGGTCAC

trimFq1

trim fq1 file name(gz format) [optional]

trimFq2

trim fq2 file name [optional]. If trim-related parameters are set, these output files contain all reads after trimming only. For example, if read A fails QC after trimming, it is still written to -R/-W, but not to -C/-D

tile

tile number to ignore reads, such as [1101-1104,1205]

fov number to ignore reads (only for zebra-platform data), such as [C001R003,C003R004]

barcodeListPath

barcode list with two columns: sequence and barcodeID

barcodeRegionStr

barcode regions, such as: 101_10,117_10,145_10 or 101_10,117_10,133_10

notCutNoLFR

do not cut the sequence when no barcode is found

inputAsList

the input is a list of files rather than a single file

tenX

output tenX format

outFileType

output file format: fastq or fasta[default: fastq]

index

remove index

totalReadsNum

number/fraction of reads to keep in the output clean FASTQ file (cannot be assigned when -w is given). By default, reads are extracted randomly from the whole set of clean reads; to save time, you can instead take the reads from the head by adding the suffix head to the integer

trim

trim some bp from the read's head and tail. The values mean: (PE type: read1's head and tail and read2's head and tail [0,0,0,0]; SE type: read head and tail [0,0])

trimBadHead

Trim from the head end until a high-quality base is met or the length threshold is reached; set as (quality threshold,MaxLengthForTrim) [0,0]

trimBadTail

Trim from the tail end until a high-quality base is met or the length threshold is reached; set as (quality threshold,MaxLengthForTrim) [0,0]

overlap

filter reads with a small insert size. No filtering is done until the value exceeds 1 [-1]

the maximum mismatch ratio when finding overlap between PE reads (depends on -O) [0.1]

patch

reads number of a patch processed [400000]

qualSys

quality system 1:64, 2:33 [default:2]

outQualSys

out quality system 1:64, 2:33 [default:2]

maxReadLen

read max length, default 49 for filtersRNA; the software does NOT check this item by default in other modules

cleanOutSplit

max reads number in each output clean FASTQ file

pe_info

Add /1, /2 at the end of FASTQ name. [default: not add]

baseConvert

convert base when write data, example: TtoU , means convert base T to base U in the output

log file output path
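
To make the config file format concrete (one parameter per line, `name=value` for valued parameters and a bare name for boolean ones), here is a small Python sketch that reads such a file and expands a tile specification. It illustrates the format only and is not SOAPnuke's parser; `parse_config` and `expand_tiles` are hypothetical helper names, and the sample values are the ones used as examples elsewhere in this README.

```python
def parse_config(text):
    """Parse 'name=value' lines and bare boolean names into a dict."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        # A line without '=' is treated as a boolean switch.
        params[key.strip()] = value.strip() if sep else True
    return params


def expand_tiles(spec):
    """Expand a tile spec such as '1101-1104,1205' into a set of tile IDs."""
    tiles = set()
    for part in spec.split(","):
        first, sep, last = part.partition("-")
        tiles.update(range(int(first), int(last if sep else first) + 1))
    return tiles


config_text = """\
adaMis=2
contam_trim
totalReadsNum=10000000
tile=1101-1104,1205
"""

cfg = parse_config(config_text)
print(cfg)
print(sorted(expand_tiles(cfg["tile"])))
```

A file with the same four lines could be passed to the filter module with `-c`, as in the QuickStart example.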

Plotting

The three scripts in src/Rscripts/ are used for plotting QC stats from SOAPnuke.

Q20Q30.R

USAGE:

Rscript src/Rscripts/Q20Q30.R Distribution_of_Q20_Q30_bases_by_read_position_1.txt Distribution_of_Q20_Q30_bases_by_read_position_2.txt q2030.png

base.R

USAGE:

Rscript src/Rscripts/base.R Base_distributions_by_read_position_1.txt Base_distributions_by_read_position_2.txt raw.png clean.png

quality.R

Rscript src/Rscripts/quality.R Base_quality_value_distribution_by_read_position_1.txt Base_quality_value_distribution_by_read_position_2.txt rawQuality.png cleanQuality.png 0 0

Availability

SOAPnuke is released under GPLv3. The latest source code is freely available on GitHub.

Citing SOAPnuke

Chen Y, Chen Y, Shi C, et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience. 2018;7(1):1-6. doi:10.1093/gigascience/gix120 [PMID: 29220494]


soapnuke's Issues

"seqType" and "qualSys" can't use

Hi,

When I used SOAPnuke-2.1.6, I found that the parameters "seqType" and "qualSys" can't be used.

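The -o parameter

The help text for -o says it defaults to the current directory, but in practice it must be specified.

Question: the -J parameter in SOAPnuke 2.0

My raw data consist of 2 samples, with 15,000 reads each in R1 and R2 from transcriptome resequencing; species: Arabidopsis thaliana. With ordinary filtering settings and without -J, the adapters in both R1 and R2 are removed. The parameters are: filter -f AGATCGGAAGAGC -r AATGATACGGCGA -l 10 -q 0.5 -n 0.05 -Q 2 -G 2.

After adding -J, the parameters are: filter -f AGATCGGAAGAGC -r AATGATACGGCGA -l 10 -q 0.5 -n 0.05 -Q 2 -G 2 -J. The R1 adapters are removed, but the R2 adapters are barely reduced (down by 0.01%).

When filtering R1 alone with -J, the parameters are: filter -f AGATCGGAAGAGC -l 10 -q 0.5 -n 0.05 -Q 2 -G 2 -J. The R1 adapters are not removed at all (inferred from the read length distribution, which is entirely 150 bp).

Am I using this parameter incorrectly, or does it currently support only paired-end data?
Thanks!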

fqcheck dose not exists

Dear Young,

I used SOAPnuke to filter my RNAseq data

SOAPnuke filter -l 15 -q 0.2 -n 0.05 -i -Q 2   -c 3 -1 *1.fq.gz -2 *2.fq.gz -f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA -r AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG $tile -o */ -C *_1.fq.gz -D *_2.fq.gz -R *_1.rawdata.fq.gz -W *_2.rawdata.fq.gz &
& \

However, the process reported the following error, and I could not find any information about it online.

[findNtile 2020-05-18 15:54:56.804 ERROR] /WhereInputSeqFile/1.fqcheck dose not exists (SeqType: 2)

Could you help to fix it?

Thank you very much!

Fan

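Bug: the -t parameter of SOAPnuke filter

Hello developers,

When using SOAPnuke filter with -t [2,0,20,0] to trim PE reads, I found that the head of read1 cannot be trimmed.
I tried SOAPnuke 2.0.7 and 2.0.5; both show this problem.

split clean fastq

Please add a parameter that sets the maximum size or maximum number of reads of each output clean_fq.gz file, with the excess written to additional files.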

-l and -q

Hi,

Can anyone help me to understand more about the parameter used in SOAPnuke?

-l / --lowQual = low quality threshold, default value is 5
-q / --qualRate= low quality rate threshold, default value is 0.5

I am more familiar with Phred value score, so, can I know which parameter in SOAPnuke is similar to this?
And what is the difference between -l and -q?

Much appreciated for the help.

SOAPnuke filter error

Command:
/home/luna/Desktop/Software/SOAPnuke/SOAPnuke filter -n 0.009 -l 10 -q 0.1 -Q 1 -G 1 -1 SRX247249_1.fastq.gz -2 SRX247249_2.fastq.gz -C SRX247249_S1.fastq.gz -D SRX247249_S2.fastq.gz -T 10 -o . -0 log.SOAPnuke
Error:quality is too low,please check the quality system parameter or fastq file

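Parameter question when processing mouse sequencing data with SOAPnuke

Hello. When processing mouse gut metagenomic data with SOAPnuke's default parameters, 100% of the mouse data were filtered out, with the error "unexpected end of file". After changing l to 1 and q to 0.5, everything was still filtered out. How should the parameters be set so that they suit mouse metagenomic data?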

Fail to build with GCC 4, 5, 6, 7

Quality value parameters

	-Q, --qualSys		INT		quality system 1:illumina, 2:sanger[1]
	-G, --outQualSys	INT		out quality system 1:illumina, 2:sanger[1]

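The description of these parameters is ambiguous. At present, Illumina, Sanger and BGISEQ-500 data all use the Phred33 quality system, i.e. -Q 2.
Only very old sequencing data use Phred64.
Currently almost all analysis pipelines fix -Q 2 -G (SOAPnuke 1 parameters).
The outQualSys parameter is incompatible with SOAPnuke 1 (consider removing it and always writing Phred33?).
So the default settings of the quality parameters need to change.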

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 41)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
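
The difference between the two quality systems comes down to the ASCII offset used when decoding the quality string. The Python sketch below shows this, assuming the mapping given in the README parameter list (qualSys 1 means offset 64, 2 means offset 33); the function names are hypothetical and this is not SOAPnuke code. Decoding a Phred+33 string with offset 64 yields negative scores, which is one way a wrong quality system setting can be detected.

```python
OFFSETS = {1: 64, 2: 33}  # qualSys / outQualSys value -> ASCII offset


def decode_quals(qual_string, qual_sys=2):
    """Turn a FASTQ quality string into a list of integer scores."""
    offset = OFFSETS[qual_sys]
    scores = [ord(c) - offset for c in qual_string]
    if any(s < 0 for s in scores):
        raise ValueError("negative quality: check the quality system setting")
    return scores


def convert_quals(qual_string, qual_sys, out_qual_sys):
    """Re-encode a quality string from one offset to another."""
    shift = OFFSETS[out_qual_sys] - OFFSETS[qual_sys]
    return "".join(chr(ord(c) + shift) for c in qual_string)


print(decode_quals("II5", qual_sys=2))   # Phred+33 string
print(decode_quals("hh", qual_sys=1))    # Phred+64 string
print(convert_quals("hh", 1, 2))         # Phred+64 re-encoded as Phred+33
```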

merge window version and linux version

we need to make soapnuke works on both linux and windows with just one copy of source code.

Statistics error

I got this result from my test; it is obviously wrong.

Compilation failed

/home/glad_zarl/Biosoft/soapnuke/SOAPnuke/src/threadpool/detail/locking_ptr.hpp:24:29: fatal error: boost/utility.hpp: No such file or directory
#include <boost/utility.hpp>
^
compilation terminated.
make[2]: *** [CMakeFiles/SOAPnuke.dir/Main.cpp.o] Error 1
make[1]: *** [CMakeFiles/SOAPnuke.dir/all] Error 2
make: *** [all] Error 2

The return value is false when command finished

The return value is false when command finished.
"SOAPnuke2.0 && ls", the command "ls" would not be run.

Occur Segmentation fault (core dumped)

Hello Mr.
I try to use SOAPnuke2 to have a test, and i install it as below:

git clone https://github.com/BGI-flexlab/SOAPnuke.git
cd SOAPnuke
make

Then I test it using Ubuntu 18.04.3 LTS 64-bit and 32GB RAM.

SOAPnuke filter -1 SRR2061818.fastq -C SRR2061818.fastq.gz -o ./cleandata

However it showed this error:
Segmentation fault (core dumped)

I have already install these library:

sudo apt-get install libboost-all-dev
sudo apt-get install openssl
sudo apt-get install zlib1g-dev

SRR2061818.fastq:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/008/SRR2061818/SRR2061818.fastq.gz

did i do any wrong thing to this software? I don't know how to do with it.
looking forward to your help..

make ERROR

Hello,I met the new make error, and how can I solve it, thank you.
"GCC version Passes, 4.8.5 >= 4.7"
"ZLIB version Passes, 1.2.7 >= 1.2.3.5"
Makefile:49: *** missing separator (did you mean TAB instead of 8 spaces?). Stop.

various reads length

Hi,
The reads in my fq file have various lengths, but the read length reported in the stats output is determined by the first read. Could this be addressed? The longest read length is 151 in both fq1 and fq2.
2017-08-31 16:00:38 INFO - [processParams] [416] fq1 read length: 94
2017-08-31 16:00:38 INFO - [processParams] [441] fq2 read length: 151

Parameters "-d" and "-7 1" are in conflict in "filter" command

As mentioned in help message ("filter" command):
-d, --rmdup : remove PCR duplications
-7, --outType: Add /1, /2 at the end of fastq name, 0:not add, 1:add [default: 0]
When I set "-d" and "-7 1", the parameter "-7 1" does not work.

In FilterProcessor.cpp, the scripts are written as below, maybe sth is wrong when running this loop?
if (rmdup_)
{
    char tempFile[1024];
    sprintf(tempFile, "%s/%d.sort.temp", outDir_.c_str(), fileNum);
    tempOFS.open(tempFile);
    outputTempData(tempOFS, reads1, reads2, outType_);
    if(dupRateOnly_){
        ofstream tempOFS1;
        char dupTempFile[1024];
        sprintf(dupTempFile, "%s/%d.dup.temp", outDir_.c_str(), fileNum);
        tempOFS1.open(dupTempFile);
        outputDupData(tempOFS1, reads1, reads2,size,outType_);
    }
    duplications_.clear();
}
Could you double-check this problem? Thanks so much!!

number of filtered reads is wrong

The following is the output in Statistics_of_Filtered_Reads.txt.

The count for "Read with n rate exceed: (%)" is wrong: 4450 != 477 + 2079

Item	Total	Percentage	Counts(fq1)	Percentage	Counts(fq2)	Percentage
Total filtered reads (%)	6788	100.00%	3394	100.00%	3394	100.00%
Reads with adapter (%)	2316	34.12%	1155	34.03%	1110	32.70%
Reads with low quality (%)	22	0.32%	2	0.06%	9	0.27%
Reads with low mean quality (%)	0	0.00%	0	0.00%	0	0.00%
Reads with duplications (%)	0	0.00%	0	0.00%	0	0.00%
Read with n rate exceed: (%)	4450	65.56%	477	14.05%	2079	61.26%
Read with small insert size: (%)	0	0.00%	0	0.00%	0	0.00%
Reads with PolyA (%)	0	0.00%	0	0.00%	0	0.00%

SOAPnuke make failed

CMakeFiles/SOAPnuke.dir/build.make:230: recipe for target 'CMakeFiles/SOAPnuke.dir/SRNAProcessor.cpp.o' failed
make[2]: *** [CMakeFiles/SOAPnuke.dir/SRNAProcessor.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/SOAPnuke.dir/all' failed
make[1]: *** [CMakeFiles/SOAPnuke.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

The error occurs in the "make" step.
The "cmake" step reported no error, but there was no "SRNAProcessor.cpp.o" file in the "CMakeFiles/SOAPnuke.dir" directory. What should I do? I need your help. Thanks.

demo config file

Could you please show a demo config file? I would like to know how to use tile and the other parameters.

add code

use git to add code

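Usage of the -L parameter in SOAPnuke.2.2.6

Hi, I recently ran into a problem when filtering with SOAPnuke.2.2.6 and would appreciate some help. I want to obtain >= 22M clean reads, so I set the -L parameter. The command line is:
SOAPnuke.2.2.6 filter -R 41011723 -L 22000000 -f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA -r AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG -1 raw.1.fq.gz -2 raw.2.fq.gz -C clean_1.fq.gz -D clean_2.fq.gz -o result

However, the filtered output has fewer than 22M clean reads, while the raw reads number exactly 22M. So does -L set the number of raw reads rather than the number of clean reads obtained?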
cat Basic_Statistics_of_Sequencing_Quality.txt

item	raw reads(fq1)	clean reads(fq1)	raw reads(fq2)	clean reads(fq2)
Read length	150.0	150.0	150.0	150.0
Total number of reads	22000000 (100.00%)	19892573 (100.00%)	22000000 (100.00%)	19892573 (100.00%)
Number of filtered reads	2107427 (9.58%)	-	2107427 (9.58%)	-
Total number of bases	3300000000 (100.00%)	2983885950 (100.00%)	3300000000 (100.00%)	2983885950 (100.00%)
Number of filtered bases	316114050 (9.58%)	-	316114050 (9.58%)	-
Number of base A	768797016 (23.30%)	691175550 (23.16%)	765981849 (23.21%)	692845445 (23.22%)
Number of base C	858757499 (26.02%)	781336195 (26.19%)	937064495 (28.40%)	857786803 (28.75%)
Number of base G	905225714 (27.43%)	815179421 (27.32%)	835232863 (25.31%)	748190945 (25.07%)
Number of base T	765924074 (23.21%)	695247404 (23.30%)	761060830 (23.06%)	684637740 (22.94%)
Number of base N	1295697 (0.04%)	947380 (0.03%)	659963 (0.02%)	425017 (0.01%)
Q20 number	3246623126 (98.38%)	2932381006 (98.27%)	3174678684 (96.20%)	2869463274 (96.17%)
Q30 number	3132356444 (94.92%)	2823194601 (94.61%)	2970663655 (90.02%)	2682127128 (89.89%)

Looking forward to your reply. Best wishes.

Install using conda

It would be great to support SOAPnuke installation via Bioconda (https://bioconda.github.io). The Bioconda repository contains numerous bioinformatics tools that can be installed easily with conda install -c bioconda tool_name. This kind of installation is very convenient for pipelines and computational clusters.

processHts.cp

Duplicates Count

I can't find duplicates count in the results with -d parameter, did 2.0 version remove the rmdup function?
Is there any way I can use soapnuke 2.0 version counts dups?

And when I was installing soapnuke 1.6.5, there was an error in make:
< CMakeFiles/SOAPnuke.dir/build.make:230: recipe for target 'CMakeFiles/SOAPnuke.dir/SRNAProcessor.cpp.o' failed >
How can I solve this error?

Any comments or suggestions would be appreciated!

Thank you!

parameters

The details of parameters in this software are missing, so please add them in README.md.

Make error in SOAPnuke2.

./obj/peprocess.o: In function `peProcess::process_nonssd()':
SOAPnuke/SOAPnuke-SOAPnuke2.0/src/peprocess.cpp:1304: undefined reference to `gzbuffer'
/lustre/liyan/01.software/SOAPnuke/SOAPnuke-SOAPnuke2.0/src/peprocess.cpp:1309: undefined reference to `gzbuffer'
SOAPnuke/SOAPnuke-SOAPnuke2.0/src/peprocess.cpp:1294: undefined reference to `gzbuffer'
SOAPnuke/SOAPnuke-SOAPnuke2.0/src/peprocess.cpp:1298: undefined reference to `gzbuffer'
SOAPnuke/SOAPnuke-SOAPnuke2.0/src/peprocess.cpp:1314: undefined reference to `gzbuffer'
./obj/peprocess.o:SOAPnuke/SOAPnuke-SOAPnuke2.0/src/peprocess.cpp:1317: more undefined references to `gzbuffer' follow

make error-undefined reference

Hi,
I downloaded the latest version of SOAPnuke, but make fails. The error messages are below:
$ make
"GCC version Passes, 4.8.5 >= 4.7"
"ZLIB version Passes, 1.2.7 1.2.7 >= 1.2.3.5"
g++ -std=c++11 -g -O3 -c src/peprocess.cpp -o obj/peprocess.o
g++ -std=c++11 -g -O3 -c src/sequence.cpp -o obj/sequence.o
g++ -std=c++11 -g -O3 -c src/gc.cpp -o obj/gc.o
g++ -std=c++11 -g -O3 -c src/read_filter.cpp -o obj/read_filter.o
g++ -std=c++11 -g -O3 -c src/seprocess.cpp -o obj/seprocess.o
g++ -std=c++11 -g -O3 -c src/Main.cpp -o obj/Main.o
g++ -std=c++11 -g -O3 -c src/processStLFR.cpp -o obj/processStLFR.o
g++ -std=c++11 -g -O3 -c src/global_variable.cpp -o obj/global_variable.o
g++ -std=c++11 -g -O3 -c src/process_argv.cpp -o obj/process_argv.o
g++ ./obj/peprocess.o ./obj/sequence.o ./obj/gc.o ./obj/read_filter.o ./obj/seprocess.o ./obj/Main.o ./obj/processStLFR.o ./obj/global_variable.o ./obj/process_argv.o -o SOAPnuke -lz -lpthread
./obj/Main.o: In function ‘main’:
./src/Main.cpp:21: undefined reference to ‘mGzip::check_mGzip(std::string)’
./src/Main.cpp:33: undefined reference to ‘mGzip::allocate(int, std::vector<std::string, std::allocator<std::string> >)’
./src/Main.cpp:37: undefined reference to ‘processHts::processHts(C_global_parameter)’
./src/Main.cpp:39: undefined reference to ‘processHts::processPE()’
./src/Main.cpp:41: undefined reference to ‘processHts::processSE()’
collect2: error: ld returned 1 exit status
make: *** [SOAPnuke] Error 1

Can you tell me how to solve this?
Thanks .

Does SOAPnuke 2 not remove PCR duplications?

Dear Young,

I use SOAPnuke to filter NGS data. Previously I used SOAPnuke 1.5.6 to remove PCR duplications with "-d", but in SOAPnuke 2 I can't find "-d" or any other parameter related to PCR duplication.

Does SOAPnuke 2 not remove PCR duplications?

Thanks

[email protected]

make error in 2.X

g++ -std=c++11 -g -O3 -c src/gc.cpp -o obj/gc.o
g++ -std=c++11 -g -O3 -c src/global_variable.cpp -o obj/global_variable.o
g++ -std=c++11 -g -O3 -c src/Main.cpp -o obj/Main.o
g++ -std=c++11 -g -O3 -c src/peprocess.cpp -o obj/peprocess.o
src/peprocess.cpp: In member function ‘void* peProcess::sub_thread(PEthreadOpt)’:
src/peprocess.cpp:1057:45: error: ‘gzbuffer’ was not declared in this scope
gzbuffer(gz_trim_out1[index],1024*1024*160);
^
src/peprocess.cpp:1068:46: error: ‘gzbuffer’ was not declared in this scope
gzbuffer(gz_clean_out1[index],1024*1024*160);
^
src/peprocess.cpp: In member function ‘void peProcess::process_nonssd()’:
src/peprocess.cpp:1294:45: error: ‘gzbuffer’ was not declared in this scope
gzbuffer(gz_trim_out1_nonssd,1024*1024*160);
^
src/peprocess.cpp:1304:46: error: ‘gzbuffer’ was not declared in this scope
gzbuffer(gz_clean_out1_nonssd,1024*1024*160);
^
src/peprocess.cpp:1314:27: error: ‘gzbuffer’ was not declared in this scope
gzbuffer(gzfp1,2048*2048);
^
make: *** [obj/peprocess.o] Error 1

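Merging gz sequencing files and filtering

Hello. I am filtering second-generation sequencing data with SOAPnuke version 2.1.7. Because I needed to merge data from two sequencing runs, I first merged the raw data with zcat followed by gzip compression and ran filter on the result. Finding zcat too slow, I then merged the fq.gz files from the two runs directly with cat and ran filter. The filter results from the zcat merge and the cat merge turned out to be different. Why is that? I later ran the zcat approach twice and got identical results, which shows that the filtering is reproducible. I also compared the FASTQ files decompressed from the zcat-merged and cat-merged data and found them identical. So why do the two merge methods give different filter results? Is it due to how the program decompresses the input internally, or is there another reason? Can the filter results from the cat-merged fq.gz data be used?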
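
The comparison described above can be reproduced on a tiny scale in Python. The sketch shows that concatenating two gzip files with cat gives a valid multi-member gzip stream whose decompressed content equals that of a single recompressed stream, even though the compressed bytes differ. The toy records and function names are made up for illustration; whether SOAPnuke 2.1.7 reads multi-member input the same way as single-member input is not something this sketch can establish.

```python
import gzip

# Two tiny FASTQ records standing in for the two sequencing runs (toy data).
RUN1 = b"@r1\nACGT\n+\nIIII\n"
RUN2 = b"@r2\nTTGA\n+\nIIII\n"


def merge_like_cat(parts):
    """Concatenate already-compressed gzip members, as `cat a.gz b.gz` does."""
    return b"".join(gzip.compress(p) for p in parts)


def merge_like_zcat(parts):
    """Decompressed concatenation followed by one compression pass."""
    return gzip.compress(b"".join(parts))


cat_merged = merge_like_cat([RUN1, RUN2])
zcat_merged = merge_like_zcat([RUN1, RUN2])

# gzip.decompress reads every member of a multi-member stream.
print(gzip.decompress(cat_merged) == gzip.decompress(zcat_merged))
print(cat_merged == zcat_merged)
```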

Fail to build with GCC 4, 5, 6, 7 due to `no match for ‘operator==’`

I failed to build SOAPnuke with GCC 4.8, 5.4, 6.4 and 7.2 on CentOS 7. The error messages are the same, saying "SRNAProcessor.cpp:687:8: error: no match for ‘operator==’ (operand types are ‘std::ifstream {aka std::basic_ifstream<char>}’ and ‘long int’)".

Steps to reproduce this issue

$ spack install soapnuke %[email protected]
...
==> Error: ProcessError: Command exited with status 2:
'make'                                                                                                         

10 errors found in build log:                                                                                      
     [ ... ]                                                                                                       
     812                    from /lustre/home/rpm/spack/var/spack/stage/soapnuke-1.6.2-ky6vy6ihbglnit2xjsnzm4fxerd66ll7/SOAPnuke-SOAPnuke1.6.2/src/SRNAProcessor.h:12,                                                                
     813                    from /lustre/home/rpm/spack/var/spack/stage/soapnuke-1.6.2-ky6vy6ihbglnit2xjsnzm4fxerd6
6ll7/SOAPnuke-SOAPnuke1.6.2/src/SRNAProcessor.cpp:8:
     814   /lustre/spack/tools/linux-centos7-x86_64/gcc-4.8.5/gcc-7.2.0-s6dfmbnzhj4rig5ilivoawy5kyk2f6dp/include/c+
+/7.2.0/bits/unique_ptr.h:51:28: note: declared here
     815      template<typename> class auto_ptr;
     816                               ^~~~~~~~
     817   /lustre/home/rpm/spack/var/spack/stage/soapnuke-1.6.2-ky6vy6ihbglnit2xjsnzm4fxerd66ll7/SOAPnuke-SOAPnuke
1.6.2/src/SRNAProcessor.cpp: In member function ‘long int SRNAProcessTool::RNAProcessor::splitFile(std::__cxx11::st
ring&, std::__cxx11::string&, std::vector<std::__cxx11::basic_string<char> >&)’:
  >> 818   /lustre/home/rpm/spack/var/spack/stage/soapnuke-1.6.2-ky6vy6ihbglnit2xjsnzm4fxerd66ll7/SOAPnuke-SOAPnuke
1.6.2/src/SRNAProcessor.cpp:687:8: error: no match for ‘operator==’ (operand types are ‘std::ifstream {aka std::bas
ic_ifstream<char>’ and ‘long int’)}
     819      if(in==NULL)
     820           ^
...
>> 8478  make[2]: *** [CMakeFiles/SOAPnuke.dir/SRNAProcessor.cpp.o] Error 1 
>> 8479  make[2]: Leaving directory `/lustre/home/rpm/spack/var/spack/stage/soapnuke-1.6.2-ky6vy6ihbglnit2xjsnzm4fxerd66ll7/SOAPnuke-SOAPnuke1.6.2/spack-build'                                                                       >> 8480  make[1]: *** [CMakeFiles/SOAPnuke.dir/all] Error 2
     8481  make[1]: Leaving directory `/lustre/home/rpm/spack/var/spack/stage/soapnuke-1.6.2-ky6vy6ihbglnit2xjsnzm4fxerd66ll7/SOAPnuke-SOAPnuke1.6.2/spack-build'                               
  >> 8482  make: *** [all] Error 2

See build log for details:                              
  /lustre/home/rpm/spack/var/spack/stage/soapnuke-1.6.2-ky6vy6ihbglnit2xjsnzm4fxerd66ll7/SOAPnuke-SOAPnuke1.6.2/spack-build.out

Dependency libraries

$ spack spec soapnuke %[email protected]
Input spec
--------------------------------
soapnuke%[email protected]

Concretized
--------------------------------
[email protected]%[email protected] build_type=RelWithDebInfo arch=linux-centos7-x86_64 
    ^[email protected]%[email protected]+atomic+chrono~clanglibcpp+date_time~debug+exception+filesystem~graph~icu+iostreams+locale+log+math~mpi+multithreaded patches=2ab6c72d03dec6a4ae20220a9dfd5c8c572c5294252155b85c6874d97c323199 +program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout+wave arch=linux-centos7-x86_64 
        ^[email protected]%[email protected]+shared arch=linux-centos7-x86_64 
        ^[email protected]%[email protected]+optimize+pic+shared arch=linux-centos7-x86_64 
    ^[email protected]%[email protected]~doc+ncurses~openssl+ownlibs~qt arch=linux-centos7-x86_64 
    ^[email protected]%[email protected] build_type=RelWithDebInfo arch=linux-centos7-x86_64 
    ^openssl@system%[email protected] arch=linux-centos7-x86_64

Detailed logs are attached below.

spack-build.env.txt
spack-build.out.txt

make Error

"GCC version Passes, 6.3.0 >= "4.7""
readlink: missing operand
Try 'readlink --help' for more information.
sed: -e expression #1, char 20: unterminated `s' command
"Warning: ZLIB version is lower than "1.2.3.5"."
g++ -std=c++11 -g -O3 -c src/peprocess.cpp -o obj/peprocess.o
src/peprocess.cpp: In member function ‘void peProcess::print_stat()’:
src/peprocess.cpp:339:41: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(gv.raw1_stat.qs.position_qual[i][j]>0){
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/peprocess.cpp:338:16: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

src/peprocess.cpp: In member function ‘void peProcess::update_stat(C_fastq_file_stat&, C_fastq_file_stat&, C_filter_stat&, std::__cxx11::string)’:
src/peprocess.cpp:611:39: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(fq1s_stat.qs.position_qual[i][j]>0){
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/peprocess.cpp:610:17: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

src/peprocess.cpp:732:39: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(fq1s_stat.qs.position_qual[i][j]>0){
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/peprocess.cpp:731:17: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

src/peprocess.cpp:826:39: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(fq2s_stat.qs.position_qual[i][j]>0){
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/peprocess.cpp:825:17: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

src/peprocess.cpp:814:39: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(fq1s_stat.qs.position_qual[i][j]>0){
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/peprocess.cpp:813:17: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

g++ -std=c++11 -g -O3 -c src/sequence.cpp -o obj/sequence.o
g++ -std=c++11 -g -O3 -c src/gc.cpp -o obj/gc.o
g++ -std=c++11 -g -O3 -c src/read_filter.cpp -o obj/read_filter.o
g++ -std=c++11 -g -O3 -c src/seprocess.cpp -o obj/seprocess.o
src/seprocess.cpp: In member function ‘void seProcess::update_stat(C_fastq_file_stat&, C_filter_stat&, std::__cxx11::string)’:
src/seprocess.cpp:355:39: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(fq1s_stat.qs.position_qual[i][j]>0)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/seprocess.cpp:354:17: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

src/seprocess.cpp:415:39: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(fq1s_stat.qs.position_qual[i][j]>0)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/seprocess.cpp:414:17: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

src/seprocess.cpp:462:39: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(fq1s_stat.qs.position_qual[i][j]>0)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/seprocess.cpp:461:17: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

src/seprocess.cpp: In member function ‘void seProcess::print_stat()’:
src/seprocess.cpp:206:41: warning: iteration 41 invokes undefined behavior [-Waggressive-loop-optimizations]
if(gv.raw1_stat.qs.position_qual[i][j]>0){
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
src/seprocess.cpp:205:16: note: within this loop
for(int j=1;j<=MAX_QUAL;j++){

g++ -std=c++11 -g -O3 -c src/Main.cpp -o obj/Main.o
g++ -std=c++11 -g -O3 -c src/global_variable.cpp -o obj/global_variable.o
g++ -std=c++11 -g -O3 -c src/processHts.cpp -o obj/processHts.o
src/processHts.cpp: In member function ‘void processHts::catBam(std::vector<std::__cxx11::basic_string >, BGZF*)’:
src/processHts.cpp:319:9: error: ‘sam_hdr_t’ was not declared in this scope
sam_hdr_t *old = bam_hdr_read(in);
^~~~~~~~~
src/processHts.cpp:319:20: error: ‘old’ was not declared in this scope
sam_hdr_t *old = bam_hdr_read(in);
^~~
src/processHts.cpp: In member function ‘void processHts::catCram(std::vector<std::__cxx11::basic_string >, htsFile)’:
src/processHts.cpp:404:9: error: ‘sam_hdr_t’ was not declared in this scope
sam_hdr_t *old_h;
^~~~~~~~~
src/processHts.cpp:404:20: error: ‘old_h’ was not declared in this scope
sam_hdr_t *old_h;
^~~~~
Makefile:35: recipe for target 'obj/processHts.o' failed
make: *** [obj/processHts.o] Error 1
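
The -Waggressive-loop-optimizations warnings in this log all flag loops of the form for(int j=1;j<=MAX_QUAL;j++). GCC reports this when it detects that some iteration of a fixed-count loop has undefined behavior, which typically means an inclusive bound is paired with an array that is one column too short. The sketch below shows one way to keep an inclusive loop in bounds. The names kMaxQual, kReadLen, QualTable and count_nonzero, and the sizes used, are hypothetical and are not SOAPnuke's actual declarations.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes for illustration only; not SOAPnuke's real values.
constexpr std::size_t kMaxQual = 42;
constexpr std::size_t kReadLen = 4;

// One extra column so that index kMaxQual itself is a valid element.
using QualTable =
    std::array<std::array<std::uint64_t, kMaxQual + 1>, kReadLen>;

// Counts the non-zero cells for quality values 1..kMaxQual inclusive.
std::size_t count_nonzero(const QualTable& table) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < table.size(); ++i) {
        // Inclusive bound is safe here: valid columns are 0..kMaxQual.
        for (std::size_t j = 1; j <= kMaxQual; ++j) {
            if (table[i][j] > 0) ++n;
        }
    }
    return n;
}
```

The alternative is to keep the array at kMaxQual columns and change the loop condition to j < kMaxQual. Which of the two is correct depends on whether the highest quality value is meant to be counted.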

No filterStLFR module in the conda version

Dear Young,

I'm trying to use SOAPnuke on stLFR data. I noticed that the conda installation is an older version (2.0) of the software, and I cannot find the filterStLFR command.

best regards,

Yann

Segmentation fault (core dumped)

I'm getting an error when trying to run SOAPnuke on raw/cleaned samples. Each FASTQ file is ~101Gb in size (300,000,000 reads for each read pair).

#!/bin/bash
#BSUB -q normal
#BSUB -J SOAPnuke_BIG
#BSUB -n 1
##BSUB -R "span[hosts=1]"
#BSUB -R "select[hname!=node036]"
#BSUB -M 2500000
#BSUB -W 72:00
#BSUB -u [email protected]
#BSUB -B
#BSUB -N
#BSUB -o SOAPnuke_script_BIG.out
#BSUB -e SOAPnuke_script_BIG.err

export LD_LIBRARY_PATH=/home/moldach/bin/SOAPnuke/

/home/moldach/bin/SOAPnuke/SOAPnuke filter \
        -1 ./data/Raw_Fastq/R1/S1aR1.fastq \
        -2 ./data/Raw_Fastq/R2/S1aR2.fastq \
        -C ./data/Trimmed_Fastq/R1/S1aR1.fastq \
        -D ./data/Trimmed_Fastq/R2/S1aR2.fastq \
        -o ./

LSF's e-mail is:

Exited with exit code 139.

Resource usage summary:

    CPU time :                                   7576.88 sec.
    Max Memory :                                 325 MB
    Average Memory :                             273.10 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   2 MB
    Max Processes :                              8
    Max Threads :                                18
    Run time :                                   1498 sec.
    Turnaround time :                            1500 sec.

And the .err log is:

/home/moldach/.lsbatch/1595546086.86175.shell: line 22: 139963 Segmentation fault      (core dumped) /home/moldach/bin/SOAPnuke/SOAPnuke filter -1 ./data/Raw_Fastq/R1/S1aR1.fastq -2 ./data/Raw_Fastq/R2/S1aR2.fastq -C ./data/Trimmed_Fastq/R1/S1aR1.fastq -D ./data/Trimmed_Fastq/R2/S1aR2.fastq -o ./
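
The exit code 139 in the LSF report is consistent with the "Segmentation fault" line in the .err log: shells conventionally report a process killed by signal N as exit status 128 + N, and SIGSEGV is signal 11 on Linux. A minimal sketch of that decoding, with the helper names signal_from_exit_status and is_segfault chosen here only for illustration:

```cpp
#include <csignal>

// Returns the signal number encoded in a shell exit status (128 + N),
// or -1 if the status does not follow that convention.
int signal_from_exit_status(int status) {
    return (status > 128 && status < 256) ? status - 128 : -1;
}

// True if the exit status corresponds to termination by SIGSEGV.
bool is_segfault(int status) {
    return signal_from_exit_status(status) == SIGSEGV;
}
```

This only identifies the signal. It says nothing about where in the program the invalid memory access happened, which would need the core dump or a debugger.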

Analyzing MinION data using SOAPnuke

Hi,
I am looking to analyze MinION data using this pipeline. I have an input file in FASTQ format and a reference file.
Please let me know how to install the pipeline.
Please let me know a command to run the pipeline.

Thank you,

Manoj

install problem

Hi,
When I run make, I hit the following problem:
collect2: error: ld returned 1 exit status
make[2]: *** [SOAPnuke] Error 1
make[1]: *** [CMakeFiles/SOAPnuke.dir/all] Error 2
make: *** [all] Error 2

I installed log4cplus 1.0.3 and added it to my ENV.
How can I resolve this problem?

Recommended htslib version

Hi,
I noticed that version 2.1.2 seems to need htslib in order to compile.
What version do you recommend?

Problem with bowtie2

Hi, I am currently working with BGI RNA-seq data. I am facing a problem at the mapping phase after running SOAPnuke filter on my data.

The command I am using is:
SOAPnuke filter -1 File1.fq -2 File2.fq -C File1_clean.fq -D File_2_clean.fq -n 0.001 -l 20 -q 0.4 -A 0.25 -Q 2 -G 2

I obtain "clean reads", but when I try to use bowtie2, this error appears:
Saw ASCII character 10 but expected 33-based Phred qual

I have also tried reads without filtering, but I obtain only around 60% successful alignment. As far as I know, a good alignment rate should be around 80%.

Thanks in advance for your help.
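
ASCII character 10 is the line feed, so the bowtie2 message means a newline was found where a quality character was expected. One plausible cause (an assumption, not confirmed in this report) is a record whose sequence or quality line is empty or whose two lengths differ. The sketch below scans a FASTQ stream for such records. It assumes strict 4-line records with no wrapped sequences, and FastqCheck, chomp_cr and check_fastq are hypothetical names, not part of SOAPnuke or bowtie2.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

struct FastqCheck {
    std::size_t records = 0;          // complete 4-line records seen
    std::size_t empty_reads = 0;      // records with an empty sequence line
    std::size_t length_mismatch = 0;  // sequence and quality lengths differ
};

// Strip a trailing '\r' so Windows line endings are not miscounted.
inline void chomp_cr(std::string& s) {
    if (!s.empty() && s.back() == '\r') s.pop_back();
}

// Reads 4-line FASTQ records and tallies the suspicious ones.
FastqCheck check_fastq(std::istream& in) {
    FastqCheck result;
    std::string header, seq, plus, qual;
    while (std::getline(in, header) && std::getline(in, seq) &&
           std::getline(in, plus) && std::getline(in, qual)) {
        chomp_cr(seq);
        chomp_cr(qual);
        ++result.records;
        if (seq.empty()) ++result.empty_reads;
        if (seq.size() != qual.size()) ++result.length_mismatch;
    }
    return result;
}
```

Running check_fastq over each cleaned file (for example through a std::ifstream) and finding non-zero empty_reads or length_mismatch would point at the records bowtie2 is rejecting.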
