tianshilu / qbrc-somatic-pipeline

QBRC Somatic Mutation Calling Pipeline

qbrc-somatic-pipeline's Introduction

The QBRC somatic mutation calling pipeline

Introduction

The QBRC mutation calling pipeline is a flexible and comprehensive pipeline that glues together commonly used software and data processing steps for mutation calling. The software includes sambamba, speedseq, varscan, shimmer, strelka, manta, and lofreq_star. It identifies somatic and germline variants from whole exome sequencing (WXS), RNA sequencing, and deep sequencing data. It can be used for human, PDX, and mouse data, with fastq files or bam files as input.
Please refer to the lab website of Dr. Tao Wang, https://qbrc.swmed.edu/labs/wanglab/index.php, for more information.

Citation

If you use the pipeline, please cite:
T. Lu, S. Wang, L. Xu, Q. Zhou, N. Singla, J. Gao, S. Manna, L. Pop, Z. Xie, M. Chen, J. J. Luke, J. Brugarolas, R. Hannan, T. Wang, Tumor neoantigenicity assessment with CSiN score incorporates clonality and immunogenicity to predict immunotherapy outcomes. Sci. Immunol. 5, eaaz3199 (2020).
For license information, please refer to:
https://github.com/Somatic-pipeline/QBRC-Somatic-Pipeline/blob/master/Liscense.txt

Running time

For a pair of 'fastq.gz' files of 200M, it takes around 2 hours to finish somatic mutation calling.

Dependencies

64 bit linux operating system
BWA (version >=0.7.15)
STAR (required if applied for RNA sequencing data)
sambamba
speedseq
varscan
samtools (version >=1.6)
shimmer
annovar (database downloaded in default folder: refGene,ljb26_all,cosmic70,esp6500siv2_all,exac03,1000g2015aug)
python2
strelka (version >=2.8.3, note: strelka is tuned to run exome sequencing or RNA sequencing)
manta (version >=1.4.0)
java (version 1.8)
perl (Parallel::ForkManager)
lofreq_star (version >=2.1.3, for tumor-only calling)
bowtie2 (version >=2.3.4.3, for Patient Derived Xenograft models)
picard.jar (please download the file https://drive.google.com/file/d/1lL_vUgrY6VAtjG87bf9PXgYubuYd-st2/view?usp=sharing and place it under the folder named "somatic_script" before running the pipeline)

Input files

Input can be fastq files or bam files or a mixture of fastq and bam files.

Main procedures:

Genome Alignment:
Genome sequencing files are aligned to the human reference genome by BWA-MEM (please contact [email protected] for genome reference files). Picard was used to add read group information and sambamba was used to mark PCR duplicates. The GATK toolkit was used to perform base quality score recalibration and local realignment around Indels.
Variant Calling:
MuTect, VarScan, Shimmer, SpeedSeq, Manta, and Strelka2 were used to call SNPs and Indels. A mutation was retained if it was called by at least two of these tools.
Mutation Annotation:
Annovar was used to annotate SNPs, Indels, and protein sequence changes. Somatic mutations and germline mutations were annotated according to the mutation allele frequencies in the normal and tumor samples.
Filter False Mutations:
All SNPs and Indels were combined and kept only if there were at least 7 total (wild type and variant) reads in the normal sample and at least 3 variant reads in the tumor sample. Somatic variants are kept only if their allele frequency in the tumor sample is more than 2 times the allele frequency in the corresponding normal sample, and only if their allele frequency in the normal (background) sample is less than 5%.
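
The consensus rule and the read-count filter described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the pipeline's own code: the `Variant` class, its field names, the function names, and the mutation key format are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Read counts for one candidate mutation (field names are illustrative)."""
    normal_ref: int
    normal_alt: int
    tumor_ref: int
    tumor_alt: int


def called_by_two(calls_per_caller):
    """Keep mutation keys reported by at least two different callers.

    calls_per_caller maps a caller name to the set of mutation keys it reported.
    """
    votes = Counter(key for keys in calls_per_caller.values() for key in set(keys))
    return {key for key, count in votes.items() if count >= 2}


def passes_somatic_filter(v):
    """Apply the read-count and allele-frequency thresholds from the text."""
    normal_depth = v.normal_ref + v.normal_alt
    # At least 7 total reads in normal and at least 3 variant reads in tumor.
    if normal_depth < 7 or v.tumor_alt < 3:
        return False
    normal_vaf = v.normal_alt / normal_depth
    tumor_vaf = v.tumor_alt / (v.tumor_ref + v.tumor_alt)
    # Tumor VAF must exceed twice the normal VAF; normal VAF must be below 5%.
    return normal_vaf < tumor_vaf / 2 and normal_vaf < 0.05
```

For instance, a site with 18 reference and 2 variant reads in the normal sample has a normal allele frequency of 10%, so it fails the 5% background threshold regardless of the tumor counts.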

Guided Tutorial

somatic.pl

The code for somatic and germline mutation calling for a pair of normal and tumor sequencing files.

Usage

perl /Path/to/somatic.pl <normal_fastq1> <normal_fastq2/NA> <tumor_fastq1> <tumor_fastq2/NA> <thread> <build> <index> <java17> </Path/to/output> <pdx> <keep_coverage> <disambiguate_pipeline>

fastq files:
- fastq1 and fastq2 of the normal control sample, then fastq1 and fastq2 of the sample of interest (must be .gz). The default input is the full paths to the 4 fastq files of the normal control sample and the sample of interest.
- To input bam files directly, use "bam path_to_bam.bam" in place of the two corresponding fastq input files. The input can be a mixture of fastq and bam files.
- If RNA-Seq data are used, use "RNA:fastq1" or "RNA:bam" at the first or third slot.
- If Agilent SureSelect (deep exome sequencing) data are used, use "Deep:fastq1" at the first or third slot.
  Optional: run somatic_script/SurecallTrimmer.jar on the fastq files before running somatic.pl.
- For tumor-only calling, put "NA NA" in the slots of the normal control sample. Results will be written to the germline files.
- If only single-end fastq data are available, put the fastq file(s) at the first and/or the third slots, then put NA in the second and/or fourth slot.
thread: number of threads to use. Recommended: 32
build: hg19 or hg38 or mm10
index: path (including file name) to the human/mouse reference genome
java17: path (including the executable file name) to java 1.7 (needed only for mutect)
output: the output folder, it will be deleted (if pre-existing) and re-created during analysis
pdx: "PDX" or "human" or "mouse"
keep_coverage: whether to keep per-base coverage information. Default is 0. Set to 1 to keep coverage information.
disambiguate_pipeline: the directory to disambiguate_pipeline which is for distinguishing human and mouse reads in PDX data.

Example:

perl ~/somatic/somatic.pl ~/seq/1799-01N.R1.fastq.gz ~/seq/1799-01N.R2.fastq.gz ~/seq/1799-01T.R1.fastq.gz ~/seq/1799-01T.R2.fastq.gz 32 hg38 ~/ref/hg38/hs38d1.fa /cm/shared/apps/java/oracle/jdk1.7.0_51/bin/java ~/somatic_result/1799-01/ human 1 ~/disambiguate_pipeline
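
When many runs are launched from a script, it can help to assemble the argument list in one place. The sketch below is a hypothetical helper (`build_somatic_command` is not part of the pipeline). It assumes the argument order shown in the example command above, with keep_coverage placed between pdx and disambiguate_pipeline, and it only checks the build and pdx values listed in the usage notes.

```python
def build_somatic_command(somatic_pl, normal, tumor, thread, build, index,
                          java17, output, pdx, keep_coverage, disambiguate):
    """Return the argument list for one somatic.pl run.

    normal and tumor are 2-item sequences holding the two input slots of each
    sample, e.g. ("N.R1.fastq.gz", "N.R2.fastq.gz"), ("bam", "N.bam"), or
    ("NA", "NA") when there is no normal control sample.
    """
    if len(normal) != 2 or len(tumor) != 2:
        raise ValueError("each sample needs exactly two input slots")
    if build not in {"hg19", "hg38", "mm10"}:
        raise ValueError(f"unsupported build: {build}")
    if pdx not in {"PDX", "human", "mouse"}:
        raise ValueError(f"unsupported pdx value: {pdx}")
    return ["perl", somatic_pl, *normal, *tumor, str(thread), build, index,
            java17, output, pdx, str(keep_coverage), disambiguate]


cmd = build_somatic_command(
    "~/somatic/somatic.pl",
    ("~/seq/1799-01N.R1.fastq.gz", "~/seq/1799-01N.R2.fastq.gz"),
    ("~/seq/1799-01T.R1.fastq.gz", "~/seq/1799-01T.R2.fastq.gz"),
    32, "hg38", "~/ref/hg38/hs38d1.fa",
    "/cm/shared/apps/java/oracle/jdk1.7.0_51/bin/java",
    "~/somatic_result/1799-01/", "human", 1, "~/disambiguate_pipeline")
# Joined with spaces, this reproduces the example command line above.
print(" ".join(cmd))
```

Returning a list rather than a string keeps each argument separate, which is convenient if the command is later passed to a process launcher instead of a shell.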

Note:

Input sequencing files:
(1) If input are fastq files, they must be 'gz' files. 'sequencing_file_1', 'sequencing_file_2' are paths to fastq1 and fastq2 of the control sample; 'sequencing_file_3', 'sequencing_file_4' are paths to fastq1 and fastq2 of the sample of interest.
(2) If input are bam files, use "bam /path/to/bam/files.bam" in place of the two corresponding fastq input files.
(3) If input are RNA sequencing files, use "RNA:fastq1" or "RNA:bam" at the first or third slot.
(4) If input are deep exome sequencing data, use "Deep:fastq1" at the first or third slot.
(5) For tumor-only calling, put "NA NA" in the first two slots. Results will be written to germline output files.
(6) Optional: for deep sequencing files, run somatic_script/SurecallTrimmer.jar on the fastq files before running somatic.pl.
(7) If only single-end fastq data are available, put the fastq file(s) at the first and/or the third slots, then put NA in the second and/or fourth slot.
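
The slot rules above can be checked before a run is submitted. The following is a rough sketch under stated assumptions: the function names and the returned labels are hypothetical, "RNA:bam" is treated as the "RNA:" prefix followed by the literal word "bam", and rejecting a run with no sample of interest is an assumption, since the notes describe no such mode.

```python
def describe_sample(first, second):
    """Classify the two input slots of one sample (labels are illustrative)."""
    if first == "NA" and second == "NA":
        return "absent"
    assay = "default"
    for tag, name in (("RNA:", "RNA"), ("Deep:", "deep")):
        if first.startswith(tag):
            assay, first = name, first[len(tag):]
    if first == "bam":
        # The second slot then holds the path to the bam file.
        return f"{assay} bam"
    if not first.endswith(".gz"):
        raise ValueError(f"fastq input must be gzipped: {first}")
    layout = "single-end" if second == "NA" else "paired-end"
    return f"{assay} {layout} fastq"


def describe_run(slots):
    """Summarize the four slots: normal 1, normal 2, tumor 1, tumor 2."""
    if len(slots) != 4:
        raise ValueError("expected exactly four input slots")
    normal = describe_sample(slots[0], slots[1])
    tumor = describe_sample(slots[2], slots[3])
    if tumor == "absent":
        # Assumption: a run always needs a sample of interest.
        raise ValueError("slots 3 and 4 must describe the sample of interest")
    mode = "tumor-only" if normal == "absent" else "paired"
    return mode, normal, tumor
```

A call such as `describe_run(["NA", "NA", "T_1.fastq.gz", "T_2.fastq.gz"])` reports a tumor-only run, which per note (5) writes its results to the germline output files.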

thread : number of threads to use.
build : genome build, hg19 or hg38.
index : path (including file names) to the reference genome fasta file of the reference bundle hg38 or hg19. (The pipeline will search for other files in that bundle folder automatically.)
java17 : path (including the executable file name) to java 1.7 (needed only for MuTect).
output : the output folder; it will be deleted (if pre-existing) and re-created during analysis.
pdx : "PDX" or "human". If this is a PDX sample, reads will be aligned to the mouse genome first, and the unmapped reads will then be mapped to the human genome.
Example data for running the QBRC somatic mutation pipeline can be found at https://github.com/Somatic-pipeline/QBRC-Somatic-Pipeline/tree/master/example/example_dataset/sequencing. The output for the example data can be found at https://github.com/Somatic-pipeline/QBRC-Somatic-Pipeline/tree/master/example/example_dataset/example_output.

job_somatic.pl

Slurm wrapper of somatic.pl for a batch of samples. It is easy to adapt to other job scheduler systems by revising this line of code: "system("sbatch ".$job)" and using a proper demo job submission shell script.

Command

perl /Directory/to/folder/of/code/job_somatic.pl design.txt example_file thread build index java17 keep_coverage n disambiguate_pipeline

somatic_design.txt example

(6 columns; columns separated by tab):

~/seq/1799-01N.R1.fastq.gz ~/seq/1799-01N.R2.fastq.gz ~/seq/1799-01T.R1.fastq.gz ~/seq/1799-01T.R2.fastq.gz ~/out/1799-01/ human 
~/seq/1799-02N.R1.fastq.gz ~/seq/1799-02N.R2.fastq.gz ~/seq/1799-02T.R1.fastq.gz ~/seq/1799-02T.R2.fastq.gz ~/out/1799-02/ human 
~/seq/1799-03N.R1.fastq.gz ~/seq/1799-03N.R2.fastq.gz ~/seq/1799-03T.R1.fastq.gz ~/seq/1799-03T.R2.fastq.gz ~/out/1799-03/ human

Command example

perl ~/somatic/job_somatic.pl somatic_design.txt ~/somatic/example/example.sh 32 hg38 ~/ref/hg38/hs38d1.fa /cm/shared/apps/java/oracle/jdk1.7.0_51/bin/java 0 2 ~/disambiguate_pipeline

Note:

design.txt: the batch job design file. It has 6 columns separated by '\t', the first four slots are fastq files or bam files for normal control sample and sample of interest. The fifth is the output folder, and the last is "PDX" or "human".
example_file : the demo job submission shell script. A default one is in example/.
thread : number of threads to use. Recommended: 32
build : genome build, hg19 or hg38.
index : path (including file names) to the reference genome in the reference bundle.
java17 : path (including the executable file name) to java 1.7 (needed only for MuTect).
keep_coverage: whether to keep per-base coverage information. Default is 0. Set to 1 to keep coverage information.
n : bundle n somatic calling jobs into one submission.
disambiguate_pipeline: the directory to disambiguate_pipeline
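
As a rough illustration of the design file layout and of the n parameter, the sketch below parses a 6-column tab-separated design file and groups its rows n at a time. The helper names `read_design` and `bundle` and the short file names are hypothetical; how job_somatic.pl itself groups jobs is not shown in this README.

```python
import csv
import io


def read_design(handle):
    """Parse a 6-column, tab-separated design file into a list of rows."""
    rows = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Tolerate trailing tabs or spaces at the end of a line.
        while row and not row[-1].strip():
            row.pop()
        if not row:
            continue  # skip blank lines
        if len(row) != 6:
            raise ValueError(f"line {line_no}: expected 6 columns, got {len(row)}")
        rows.append([field.strip() for field in row])
    return rows


def bundle(rows, n):
    """Group rows into chunks of at most n, one chunk per job submission."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [rows[i:i + n] for i in range(0, len(rows), n)]


design = io.StringIO(
    "N1.R1.fastq.gz\tN1.R2.fastq.gz\tT1.R1.fastq.gz\tT1.R2.fastq.gz\tout/1/\thuman\t\n"
    "N2.R1.fastq.gz\tN2.R2.fastq.gz\tT2.R1.fastq.gz\tT2.R2.fastq.gz\tout/2/\thuman\n"
    "NA\tNA\tT3.R1.fastq.gz\tT3.R2.fastq.gz\tout/3/\thuman\n"
)
rows = read_design(design)
batches = bundle(rows, 2)
print([len(batch) for batch in batches])
```

Validating the column count up front catches a design file that was saved with spaces instead of tabs before any job is submitted.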

filter.R

Post-processing script for somatic mutations for a batch of samples.

Usage

Rscript filter.R  design.txt output build index VAF_cutoff filter

Note:

design.txt : tab-delimited file with three columns: sample_id, patient_id, output folder.
output : the output folder to place all filtering results.
build : the reference genome build, hg38, hg19 etc.
index : the path to the reference genome file in the reference bundle.
VAF_cutoff : the minimum VAF of the mutations in the tumor sample (recommended: 0.001-0.05).
filter : TRUE or FALSE. Whether to filter out extremely long genes in the list "TTN","KCNQ1OT1","MUC16","ANKRD20A9P","TSIX","SYNE1","ZBTB20","OBSCN", "SH3TC2","NEB","MUC19","MUC4","NEAT1","SYNE2","CCDC168","AAK1","HYDIN","RNF213","LOC100131257","FSIP2". These genes usually turn out to have somatic mutations in any cohort of patients. Default is FALSE.
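
The effect of the VAF_cutoff and filter options can be pictured with a small sketch. This is not a port of filter.R: the function name, the record layout (dicts with "gene" and "tumor_vaf" keys), and the exact comparison at the cutoff are assumptions made for illustration. Only the gene list is taken from the description above.

```python
# Gene list copied from the description of the "filter" option.
LONG_GENES = {
    "TTN", "KCNQ1OT1", "MUC16", "ANKRD20A9P", "TSIX", "SYNE1", "ZBTB20",
    "OBSCN", "SH3TC2", "NEB", "MUC19", "MUC4", "NEAT1", "SYNE2", "CCDC168",
    "AAK1", "HYDIN", "RNF213", "LOC100131257", "FSIP2",
}


def post_filter(mutations, vaf_cutoff, drop_long_genes=False):
    """Keep mutations at or above the tumor VAF cutoff.

    mutations is an iterable of dicts with hypothetical keys "gene" and
    "tumor_vaf". When drop_long_genes is True, mutations in LONG_GENES
    are removed as well.
    """
    kept = []
    for mutation in mutations:
        if mutation["tumor_vaf"] < vaf_cutoff:
            continue
        if drop_long_genes and mutation["gene"] in LONG_GENES:
            continue
        kept.append(mutation)
    return kept
```

With `drop_long_genes` left at its default, a TTN mutation above the cutoff is retained, matching the documented default of FALSE for the filter option.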

filter_design.txt example (3 columns; columns separated by tab; header):

sample_id patient_id folder
1799-01 pat-01 ~/filter/1799-01/
1799-02 pat-02 ~/filter/1799-02/
1799-03 pat-03 ~/filter/1799-03/

Command example:

Rscript ~/somatic/filter.R filter_design.txt ~/filter/ hg38 ~/ref/hg38/hs38d1.fa 0.01 FALSE

cnv.pl

Pipeline for somatic copy number variation calling and quality check for each sample

Command

perl cnv.pl sequencing_file_1 sequencing_file_2 sequencing_file_3 sequencing_file_4 thread index somatic_mutation_result output

Note:

prerequisite in path: R; BWA; sambamba; perl (Parallel::ForkManager); samtools (version >=1.6); cnvkit; fastqc
Input sequencing files:
- (1) If input are fastq files, they must be 'gz' files.
- 'sequencing_file_1', 'sequencing_file_2' are paths to fastq1 and fastq2 of the normal control sample;
- 'sequencing_file_3', 'sequencing_file_4' are paths to fastq1 and fastq2 of the sample of interest.
- (2) If input are bam files, use "bam /path/to/bam/files.bam" in place of the two corresponding fastq input files.
thread : number of threads to use. Recommended: 32
index : the path to the reference genome file in the reference bundle.
somatic_mutation_result : somatic mutation calling output file. This is for adjusting CNV by somatic mutation VAF. Set to 1 to turn off this adjustment.
output : the output folder. It will be deleted (if pre-existing) and re-created during analysis. The CNV calling needs at least 128GB of memory.

Example

perl ~/somatic/cnv.pl ~/seq/1799-01N.R1.fastq.gz ~/seq/1799-01N.R2.fastq.gz ~/seq/1799-01T.R1.fastq.gz ~/seq/1799-01T.R2.fastq.gz 32 ~/ref/hg38/hs38d1.fa ~/somatic_result/1799-01/somatic_mutations_hg38.txt ~/cnv_result/1799-01

job_cnv.pl

Slurm wrapper of cnv.pl for a batch of samples. It is easy to adapt to other job scheduler systems by revising this line of code: "system("sbatch ".$job)" and using a proper demo job submission shell script.

Command

perl job_cnv.pl design.txt example.sh thread index n

cnv_design.txt example

(6 columns; columns separated by tab):

~/seq/1799-01N.R1.fastq.gz ~/seq/1799-01N.R2.fastq.gz ~/seq/1799-01T.R1.fastq.gz ~/seq/1799-01T.R2.fastq.gz ~/somatic_result/1799-01/somatic_mutations_hg38.txt ~/cnv_result/1799-01/ 
~/seq/1799-02N.R1.fastq.gz ~/seq/1799-02N.R2.fastq.gz ~/seq/1799-02T.R1.fastq.gz ~/seq/1799-02T.R2.fastq.gz ~/somatic_result/1799-02/somatic_mutations_hg38.txt ~/cnv_result/1799-02/ 
~/seq/1799-03N.R1.fastq.gz ~/seq/1799-03N.R2.fastq.gz ~/seq/1799-03T.R1.fastq.gz ~/seq/1799-03T.R2.fastq.gz ~/somatic_result/1799-03/somatic_mutations_hg38.txt ~/cnv_result/1799-03/

Command example

perl ~/somatic/job_cnv.pl cnv_design.txt ~/somatic/example/example.sh 32 ~/ref/hg38/hs38d1.fa 2

summarize_cnv.R

Script that summarizes the CNV calls and quality checks for a batch of samples.

Command

Rscript summarize_cnv.R design.txt output index

cnv_sum_design.txt example

(2 columns; columns separated by tab; header):

sample_id folder 
1799-01 ~/cnv_result/1799-01 
1799-02 ~/cnv_result/1799-02 
1799-03 ~/cnv_result/1799-03

Command example:

Rscript ~/somatic/summarize_cnv.R cnv_sum_design.txt ~/cnv_sum/ ~/ref/hg38/

qbrc-somatic-pipeline's Issues

Unable to access picard.jar

Hi,
I have a problem when using somatic.pl. The error is "Error: Unable to access jarfile /QBRC-somatic-mutation/somatic_script//somatic_script/picard.jar". This is my command: perl /QBRC-somatic-mutation/somatic_script/somatic.pl NA NA SRR7246238_1.fastq.gz SRR7246238_2.fastq.gz 32 hg38 $gatkgenomeFasta /usr/bin/java $output human 1 /QBRC-somatic-mutation/disambiguate_pipeline .
I have picard.jar in the path "/QBRC-somatic-mutation/somatic_script/somatic_script". Maybe the error is caused by the extra slashes in "//somatic_script"? But when I tried deleting the extra slashes, a new problem appeared:

Use of /c modifier is meaningless in s/// at /QBRC-somatic-mutation/somatic_script/somatic.pl line 74.
String found where operator expected at /QBRC-somatic-mutation/somatic_script/somatic.pl line 75, near "$mutect=$path.""
(Missing semicolon on previous line?)
Use of /c modifier is meaningless without /g at /QBRC-somatic-mutation/somatic_script/somatic.pl line 75.
String found where operator expected at /QBRC-somatic-mutation/somatic_script/somatic.pl line 75, near "$picard=$path.""
(Missing semicolon on previous line?)
Use of /c modifier is meaningless without /g at /QBRC-somatic-mutation/somatic_script/somatic.pl line 75.
String found where operator expected at /QBRC-somatic-mutation/somatic_script/somatic.pl line 75, near "$bam2fastq=$path.""
(Missing semicolon on previous line?)
Unknown regexp modifier "/t" at /QBRC-somatic-mutation/somatic_script/somatic.pl line 74, at end of line
Unknown regexp modifier "/_" at /QBRC-somatic-mutation/somatic_script/somatic.pl line 74, at end of line
Unknown regexp modifier "/t" at /QBRC-somatic-mutation/somatic_script/somatic.pl line 74, at end of line
syntax error at /QBRC-somatic-mutation/somatic_script/somatic.pl line 75, near "$mutect=$path.""
/QBRC-somatic-mutation/somatic_script/somatic.pl has too many errors.

Running somatic.pl through sbatch is very slow

Dear professors,
I am sorry to trouble you.
When I use the following “.sbatch” file to run “somatic.pl” for mutation calling, it is particularly slow. It takes about 14 hours to get to the GATK BaseRecalibrator step.
I would like to ask if there is any way to solve this problem.
The following is the content of the ".sbatch" file:
#!/bin/bash

#SBATCH -n 80
#SBATCH -t 0-30:00
#SBATCH -p xhacnormalb
#SBATCH --mem=150000
#SBATCH -o /public/home/wumeng01/NeoantigenML/SomaticMutationCalling.o
#SBATCH -e /public/home/wumeng01/NeoantigenML/SomaticMutationCalling.e

module load /public/software/modules/apps/biosoft/sambamba/0.8.1-linux-amd64
module load /public/software/modules/apps/biosoft/bwa/0.7.17-gcc-4.5.8
perl somatic.pl /public/home/wumeng01/NeoantigenML/PatientCohort/Patient1/SRR37_38N.R1.fastq.gz /public/home/wumeng01/NeoantigenML/PatientCohort/Patient1/SRR37_38N.R2.fastq.gz /public/home/wumeng01/NeoantigenML/PatientCohort/Patient1/SRR37_38T.R1.fastq.gz /public/home/wumeng01/NeoantigenML/PatientCohort/Patient1/SRR37_38T.R2.fastq.gz 32 hg38 /public/home/wumeng01/NeoantigenML/QBRC-Somatic-Pipeline/genome/hg38/hg38.fa /public/share/yujijun01/wumeng/software/java/jdk1.7.0_80/bin/java /public/home/wumeng01/NeoantigenML/QBRC-Somatic-Pipeline/output/Patient1/ human 1 /public/home/wumeng01/NeoantigenML/QBRC-Somatic-Pipeline/disambiguate_pipeline

Hoping for your suggestions! Thank you for your nice work!
Best regards,
Wu

Issue on filter_vcf.R

Dear @tianshilu ,

Thank you for your dealing with the issue I raised last time! That helped us a lot.

For filter_vcf.R, I notice that from line 108 to line 122,

  if (caller!="strelka_germline") 
  {
    vcf=vcf[vcf$normal_ref+vcf$normal_alt>=7,]
    vcf=vcf[vcf$tumor_alt>=3,]
    if (type=="somatic")
    {
      vcf=vcf[vcf$normal_alt/(vcf$normal_ref+vcf$normal_alt)<
                vcf$tumor_alt/(vcf$tumor_ref+vcf$tumor_alt)/2,]
      vcf=vcf[vcf$normal_alt/(vcf$normal_ref+vcf$normal_alt)<0.05,]
    }else
    {
      vcf=vcf[vcf$normal_alt>=3,]
    } 
  }else # for tumor-only calling, make the calling super sensitive
  {
    vcf=vcf[vcf$normal_ref+vcf$normal_alt>=3,]
    vcf=vcf[vcf$normal_alt>=1,]
  }

several filtering criteria are used. I am a bit curious and confused, however, about why these filtering criteria are applied. To be more specific:

For both tumor and normal samples, it requires vcf=vcf[vcf$normal_ref+vcf$normal_alt>=7,] and vcf=vcf[vcf$tumor_alt>=3,]. I am curious why the total number of ref and alt reads in the normal sample should be at least 7, and why the number of alt reads in the tumor should be at least 3. Is it because of the usual sequencing depth and coverage of tumor samples?
For somatic mutations, it requires vcf=vcf[vcf$normal_alt/(vcf$normal_ref+vcf$normal_alt)< vcf$tumor_alt/(vcf$tumor_ref+vcf$tumor_alt)/2,]. Why does the alt read fraction in the tumor sample need to be divided by two here specifically?

I would highly appreciate it if you would like to help me on this issue. Thank you in advance!

Best wishes,
Jianning

Problem with GATK3

Hi @tianshilu,
You used GATK3 in somatic.pl, but when I use GATK3 now, the RealignerTargetCreator, IndelRealigner and PrintReads functions cannot be found in my GATK3; maybe GATK4 has replaced GATK3. But when I use GATK4, it gives errors for the RealignerTargetCreator, IndelRealigner and PrintReads functions. Someone said that RealignerTargetCreator and IndelRealigner have little impact, so I wonder if it is OK to use GATK4 in your somatic.pl without the RealignerTargetCreator and IndelRealigner steps.

A USER ERROR has occurred: IndelRealigner is no longer included in GATK as of version 4.0.0.0. Please use GATK3 to run this tool

Thanks!

Problems with job_somatic.pl

Hi, tianshi
Sorry to trouble you. I am trying to use your job_somatic.pl. However, I met an error: "Illegal modulus zero at /home/QBRC-somatic-mutation/job_somatic.pl line 31, line 1"
This is my command: "perl /home/QBRC-somatic-mutation/job_somatic.pl somatic_design.txt example.sh 32 hg38 $genomeFasta /usr/bin/java /output 0 2 /disambiguate_3human_hepa"

Somatic_design.txt:
RNA:home/SRR8990697_R1.fastq.gz home/SRR8990697_R2.fastq.gz NA NA ./output human
RNA:home/SRR8990698_R1.fastq.gz home/SRR8990698_R2.fastq.gz NA NA ./output human

Best wishes!

Potential bug in somatic.pl

Dear @tianshilu ,

I am sorry to interrupt you again during this hard time. In about February, we came across some confusing code in somatic.pl, and we are not sure whether it is a bug or not. I recently remembered it, so I am raising an issue here.

In around line 487-491,

  system_call("lofreq call-parallel --pp-threads ".$thread." -s --sig 0.1 --bonf 1 -C 7 -f ".$index.
    " -S ".$resource_dbsnp." --call-indels -l ".$index.".exon.bed -o ".$output."/lofreq_t.vcf ".$tumor_bam);
  system_call("lofreq call-parallel --pp-threads ".$thread." -s --sig 1 --bonf 1 -C 7 -f ".$index.
    " -S ".$resource_dbsnp." --call-indels -l ".$index.".exon.bed -o ".$output."/lofreq_n.vcf ".$normal_bam);

We are not sure why in the tumor sample, the significance level was set at 0.1 whilst in the normal sample, the significance level was set at 1. Was it a typo or set on purpose?

Thank you in advance!

Best regards,
Jianning

About LocatIt_

Hi, Professor Wang @wtwt5237
Sorry to bother you, and thank you for doing such an excellent job! I am trying to use the "QBRC-Somatic-Pipeline" on my deep exome sequencing data. I see that the software LocatIt_v4.0.1.jar is used here, and I get an error in the "mark duplicates" step when I use this software. I don't know why this is. By the way, I would like to ask what the difference is between it and picard in the "mark duplicates" step.
Errors:
Saving /tumor/tmp/_login01_fccf7c70-7e84-4817-83b3-3f6c7c576376_041.bam, #reads: 800000 (0), 703822 amplicons written to file.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
at java.lang.StringCoding.encode(StringCoding.java:344)
at java.lang.StringCoding.encode(StringCoding.java:387)
at java.lang.String.getBytes(String.java:958)
at com.agilent.locatit.main.SequencingRead.readBarcode(SequencingRead.java:485)
at com.agilent.locatit.main.MolecularBarcodePairedEndProcess.createSequencingReadsFromSAMRecs(MolecularBarcodePairedEndProcess.java:295)
at com.agilent.locatit.main.MolecularBarcodePairedEndProcess.processRestOfCache(MolecularBarcodePairedEndProcess.java:580)
at com.agilent.locatit.main.MolecularBarcodePairedEndProcess.locate(MolecularBarcodePairedEndProcess.java:678)
at com.agilent.locatit.main.LocatIt.main(LocatIt.java:665)

Internal Error caught: 30.

disambiguate_pipeline/conda_env/

Hi,
Sorry to trouble you. I want to know how I can download the conda_env "QBRC-Somatic-Pipeline/tree/master/disambiguate_pipeline/conda_env/". Since many things in the script already need to be downloaded, do I still have to download conda_env after downloading them, or not?

Is there any labeling error in fig (1B.1)?

Dear professors,

I am sorry to trouble you again! These days, I am using the K563 patient in the CML dataset and want to reproduce the heatmap and the histogram in fig (1.A/B). So far, I have some questions about this figure.

(1) Firstly, I have not found the ChrX 12975141 T-->A variant in my variant calling result files! However, the other mutation sites you listed in fig (1.B) are all found in our results, and their trends are similar. The result is attached to this email. The VAF of ChrX 12975141 T-->A is 1 and the count is up to 80, yet we found nothing in our result files! That is really weird! So we are sending this email to confirm whether the variant label in fig (1.B.1) has been misplaced. We hope you can help us resolve this doubt, because this result is really important for verifying whether there is any error in our calling pipeline! Only by confirming this can we continue to push our work forward!

Because of the set-up of your email, I cannot send the pictures. So we just want you to check whether the variant site ChrX 12975141 T-->A is available in your result files, such as the vcf/germline_mutations.txt/somatic_mutations.txt!

Thanks!
Xiu

What is the meaning to the normal/tumor files for the input in somatic.pl

Dear Professors @wtwt5237:

Recently, I have been learning this pipeline and hope to apply it to our own data. However, I have some questions about the input data of somatic.pl. Why do you supply both the normal and tumor samples at the same time? Does it mean that the tumor sample is compared with the normal sample, and that the mutations in the tumor sample are reported relative to the normal?

I also noticed your statement that "For tumor-only calling, put "NA NA" in the slots of the normal samples. Results will be written to germline files". So we can use the tumor only to produce the germline files, but what would using the normal only produce?
I do not understand the pairing of normal and tumor samples. How is it defined? Are they cells from the normal and tumor tissue of one patient, or cells from a healthy person and a cancer patient respectively?

In other words, if I want to call a healthy person's somatic mutations in one particular tissue to trace its developmental lineage, how should I input my files?

Hoping for your suggestions! Thank you for your nice work!
Best regards,
Xiu

Issues on somatic.pl

Dear @tianshilu ,

Hi! My colleague and I got to know about the QBRC Somatic Pipeline several months ago through a Cell paper discussed in our journal club, and thanks to your pipeline, we have managed to build a local pipeline on our computer cluster.

However, there are still several lines of code in somatic.pl that we could not fully understand. If you could help us, I would highly appreciate it. Thank you in advance!

Here are the issues:

 system_call("annotate_variation.pl -geneanno -dbtype refGene -buildver ".$build." ".$output."/".$type."_mutations_".$build.".txt ".$annovar_path.$annovar_db);
  system_call("coding_change.pl --includesnp --alltranscript --newevf ".$output."/".$type."_mutations_".$build.".txt_tmp.txt ".$output."/".$type."_mutations_".$build.".txt".
    ".exonic_variant_function ".$annovar_path.$annovar_db."/".$build."_refGene.txt ".$annovar_path.$annovar_db."/".$build."_refGeneMrna.fa >/dev/null 2>/dev/null");
  system_call("Rscript ".$path."/somatic_script/add_fs_annotation.R ".$output." ".$build." ".$type);
  system_call("rm -f ".$output."/".$type."_mutations_".$build.".txt?*");
}

In these lines of code, you end up calling add_fs_annotation.R. In the home directory, however, we also find a filter.R script. So what is the usage of filter.R? Does it need to be run before calling add_fs_annotation.R?

Does it work with 10x scRNA-seq BAM?

Hi,

I am just wondering if this workflow works with 10x data?

Wilson

picard.jar missing

Hi Tianshi,

I was trying to run your pipeline on the example data and I run into the following error:

java -Djava.io.tmpdir=../test_results//tumor/tmp -jar /mnt/home/icb/laura.martens/QBRCpipeline/QBRC-Somatic-Pipeline/somatic_script//somatic_script/picard.jar AddOrReplaceReadGroups INPUT=../test_results//tumor/alignment.sam OUTPUT=../test_results//tumor/rgAdded.bam SORT_ORDER=coordinate RGID=tumor RGLB=tumor RGPL=illumina RGPU=tumor RGSM=tumor CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0

Error: Unable to access jarfile /mnt/home/icb/laura.martens/QBRCpipeline/QBRC-Somatic-Pipeline/somatic_script//somatic_script/picard.jar

When I check the somatic_script folder there is no picard.jar file in there, so I was wondering if I am missing anything?

Thanks a lot for your help,
Laura
