getzlab / detin

DeTiN is designed to measure tumor-in-normal contamination and improve somatic variant detection sensitivity when using a contaminated matched control.

License: BSD 3-Clause "New" or "Revised" License

Python 16.65% Jupyter Notebook 83.35%
cancer-genomics cancer-variants cancer-genome-atlas tumor detection tumor-in-normal somatic-mutations tin

detin's People

Contributors

amarotaylor, briandiamondage-sema4, cbirger, jett-crowdis

detin's Issues

Error message

Hi

I got this error message when running deTiN on my own data. I have tried it with the example_data and it runs fine and produces output. I think this error occurs because I haven't provided ExAC data. I wanted to run my data through without the ExAC info first to see how it does, and then maybe add it in later.

Traceback (most recent call last):
  File "build/lib/deTiN/deTiN.py", line 611, in <module>
    main()
  File "build/lib/deTiN/deTiN.py", line 544, in main
    di.candidates = du.select_candidate_mutations(di.call_stats_table, di.exac_db_file)
  File "/rds/project/rds-70r4qFasPsQ/PBCP/Work/scripts/deTiN/build/lib/deTiN/deTiN_utilities.py", line 336, in select_candidate_mutations
    candidate_sites = remove_exac_sites_from_call_stats(candidate_sites, exac_db_file)
  File "/rds/project/rds-70r4qFasPsQ/PBCP/Work/scripts/deTiN/build/lib/deTiN/deTiN_utilities.py", line 681, in remove_exac_sites_from_call_stats
    with open(exac_file, 'rb') as handle:
TypeError: coercing to Unicode: need string or buffer, NoneType found

The ExAC data isn't required, but deTiN seems to throw this error without it. If I comment out line 336 in deTiN_utilities.py, the error message goes away.
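
A minimal sketch of the kind of guard that would avoid this crash, assuming the ExAC filter is meant to be optional. `filter_candidates` and the `exac_filter` callable are hypothetical names used for illustration, not deTiN's API; in deTiN the filtering step is `remove_exac_sites_from_call_stats`, as the traceback shows.

```python
def filter_candidates(candidate_sites, exac_db_file, exac_filter):
    """Apply an ExAC-based filter only when an ExAC file was supplied.

    `exac_filter` is a hypothetical stand-in for whatever function removes
    ExAC sites; it is called as exac_filter(candidate_sites, exac_db_file).
    """
    if exac_db_file is None:
        # No ExAC file given: skip filtering instead of calling open(None).
        return candidate_sites
    return exac_filter(candidate_sites, exac_db_file)


# Toy usage: the "filter" here simply drops the first site.
sites = ["1:100", "2:200"]
drop_first = lambda s, path: s[1:]
print(filter_candidates(sites, None, drop_first))
print(filter_candidates(sites, "exac.pickle", drop_first))
```

With a check like this, a run without `--exac_data_path` would pass the candidate sites through unchanged rather than failing in `open()`.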

ExAC pickle file for hg38

Hello,

I've managed to generate all the input files for my samples, and the final thing I'm missing to run deTiN is an ExAC pickle file for hg38. I assume the version you have provided in the example_data folder is for hg19. What's the easiest way to generate such a file for hg38?

I assume I can write a Python script to iterate over the lines in the ExAC VCF, populate a dictionary, and save it as a pickle. Just wondering if there are any easier ways or pre-written scripts that do this exact thing (I'm not very familiar with tools for processing VCFs in Python, sorry!).

Thanks for any help!
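
The approach described above (iterate over the VCF, populate a dictionary, save a pickle) can be sketched as follows. The dictionary layout deTiN expects is not stated in this thread, so the `(contig, position) -> allele frequency` layout, the `min_af` threshold, and all function names here are assumptions for illustration only. The example file name `exac.pickle_high_af` suggests that some allele-frequency cutoff is involved, but the actual value is not given.

```python
import gzip
import pickle


def parse_af(info_field):
    """Return the largest AF value in a VCF INFO string, or None if absent."""
    for entry in info_field.split(";"):
        if entry.startswith("AF="):
            values = [float(x) for x in entry[3:].split(",") if x != "."]
            return max(values) if values else None
    return None


def build_site_dict(vcf_path, min_af=0.01):
    """Collect (contig, position) -> allele frequency for common sites.

    The key/value layout and the min_af default are assumptions, not
    deTiN's documented format.
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    sites = {}
    with opener(vcf_path, "rt") as handle:  # text mode, so lines are str
        for line in handle:
            if line.startswith("#"):
                continue  # skip VCF meta and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # not a complete VCF record
            contig, pos, info = fields[0], int(fields[1]), fields[7]
            af = parse_af(info)
            if af is not None and af >= min_af:
                sites[(contig, pos)] = af
    return sites


def save_pickle(sites, out_path):
    """Write the site dictionary to disk as a pickle."""
    with open(out_path, "wb") as handle:
        pickle.dump(sites, handle)
```

Before relying on a file built this way, it would be worth loading the hg19 pickle from example_data and comparing its structure with the dictionary produced here.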

Fails without INDEL data

Hi Amaro,

Thanks a lot for sharing this useful tool with the community!
I want to run the tool without INDEL data but get the following error:

pre-processing SSNV data
initialized TiN to 0
TiN inference after 1 iterations = 0.0
SSNV based TiN estimate converged: TiN = 0.0 based on 8320 sites
calculating aSCNA based TiN estimate using data from chromosomes: [ 8 10 19]
aSCNA based TiN estimate: TiN = 0.0
Traceback (most recent call last):
  File "deTin/deTiN.py", line 607, in <module>
    main()
  File "deTin/deTiN.py", line 565, in main
    do = output(di, ssnv_based_model, ascna_based_model)
  File "deTin/deTiN.py", line 260, in __init__
    if self.input.indel_table.isnull().values.sum() == 0:
AttributeError: 'list' object has no attribute 'isnull'

I also tried to use an empty file with the headers from the example indel data but got the following error:

Traceback (most recent call last):
  File "deTin/deTiN.py", line 607, in <module>
    main()
  File "deTin/deTiN.py", line 537, in main
    di.read_and_preprocess_data()
  File "deTin/deTiN.py", line 223, in read_and_preprocess_data
    self.read_and_preprocess_SSNVs()
  File "deTin/deTiN.py", line 208, in read_and_preprocess_SSNVs
    self.indel_table = du.read_indel_vcf(self.indel_file, self.seg_table, self.indel_type)
  File "/Users/schnidd/Downloads/deTiN-master/deTiN/deTiN_utilities.py", line 562, in read_indel_vcf
    counts_format = indel_table['format'][0].split(':')
AttributeError: 'float' object has no attribute 'split'

The tool is working if I use the example data and also when I use my data plus the full INDEL data from the example files. Could you possibly help me to understand the cause of this issue?
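
The first traceback shows `.isnull()` being called on a list, which suggests that `indel_table` is a plain list when no indel file is given. A minimal sketch of a defensive check follows, assuming that placeholder is a list and that an empty table should be treated as "no indel data". `has_usable_indels` is a hypothetical helper, not part of deTiN; its last line mirrors the condition shown in the traceback.

```python
import pandas as pd


def has_usable_indels(indel_table):
    """Return True only for a non-empty DataFrame with no missing values."""
    if not isinstance(indel_table, pd.DataFrame):
        # e.g. a list placeholder when no indel file was supplied
        return False
    if indel_table.empty:
        # header-only input: nothing to evaluate
        return False
    # Same test as in the traceback, applied only to real tables.
    return bool(indel_table.isnull().values.sum() == 0)


print(has_usable_indels([]))
print(has_usable_indels(pd.DataFrame(columns=["format"])))
print(has_usable_indels(pd.DataFrame({"format": ["GT:AD"]})))
```

The second traceback (the header-only file) would be covered by the `empty` branch, since there is no first row whose `format` field could be split.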

Which version of GATK should I use?

Hello, I'm using GATK4 and I can't find AllelicCNV.

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gpfs/users/yanghao/software/anaconda2/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar AllelicCNV

A USER ERROR has occurred: 'AllelicCNV' is not a valid command.

Can you tell me which version I should use? Many thanks.

@amarotaylor

Does deTiN accept SSNV and SCNA data from callers other than Mutect1 and AllelicCNV?

Hi Amaro,
Thank you for developing the deTiN tool; we appreciate your great work!

My understanding is that as long as we can prepare the input files conforming to the required input definition and data format, deTiN should still have the capabilities to use the input data, train the models, estimate the TiN and do the inferences.

So I am wondering if deTiN supports using SSNV/SCNA data from callers other than Broad's Mutect1 and allelicCNV?

For example,

  • For the mutation statistics file: if we follow the field definitions and convert the SNV VCF results from either the Sanger CaVEMan caller or Mutect2 to the call-stats format, I think the deTiN results should still be valid, right?
  • For the aSCNA segmentation file: I was trying to find a mapping between the CNV results from the Sanger ASCAT caller and the required fields f, tau and n_probe in the segmentation file. However, it is not that straightforward. Could you give us more information about how to generate those values from existing VCF files other than the outputs of AllelicCNV? Any suggestions would be much appreciated.

Thanks again!

-Linda

AttributeError: 'list' object has no attribute 'isnull'

Dear amarotaylor

I prepared all the input files and ran deTiN. The standard output log (below) shows progress up to the SSNV-based TiN estimate, but the run ended with "AttributeError: 'list' object has no attribute 'isnull'" and no output files were generated.

I looked at my aSCNA segmentation file and found that some lines had "NaN" values, so I deleted those lines (as well as the first several lines starting with "@", which also cause an error) and reran deTiN, but the result was the same.
As mentioned in #16, I generated my aSCNA segment file using ModelSegments from the latest GATK4 version (4.1.1.0).
Do you have any clues from the error message?
My installed versions are python (2.7.15), numpy (1.15.4), pandas (0.24.2), and scipy (1.2.1).

standard output

/home/ngs_dev/Analysis/Pipelines/development/20190409.mod/Exome_pipeline/tools/deTiN/deTiN/deTiN_utilities.py:364: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return C[[x.astype(int)]] + position
changing header of seg file from CONTIG to Chromosome
changing header of seg file from START to Start.bp
changing header of seg file from END to End.bp
changing header of seg file from MINOR_ALLELE_FRACTION_POSTERIOR_50 to f
transforming log2 data tau column to 2 centered: 2^(CNratio)+1
changing header of seg file from LOG2_COPY_RATIO_POSTERIOR_50 to tau
changing header of seg file from NUM_POINTS_COPY_RATIO to n_probes
changing header of seg file from CONTIG to Chromosome
changing header of seg file from START to Start.bp
changing header of seg file from END to End.bp
changing header of seg file from MINOR_ALLELE_FRACTION_POSTERIOR_50 to f
transforming log2 data tau column to 2 centered: 2^(CNratio)+1
changing header of seg file from LOG2_COPY_RATIO_POSTERIOR_50 to tau
changing header of seg file from NUM_POINTS_COPY_RATIO to n_probes
pre-processing SSNV data
initialized TiN to 0
TiN inference after 1 iterations = 0.44
TiN inference after 2 iterations = 0.47000000000000003
TiN inference after 3 iterations = 0.48
TiN inference after 4 iterations = 0.49
TiN inference after 5 iterations = 0.49
SSNV based TiN estimate converged: TiN = 0.49 based on 1080 sites

then error output

Traceback (most recent call last):
  File "/home/ngs_dev/Analysis/Pipelines/development/20190409.mod/Exome_pipeline/tools/deTiN/deTiN/deTiN.py", line 606, in <module>
    main()
  File "/home/ngs_dev/Analysis/Pipelines/development/20190409.mod/Exome_pipeline/tools/deTiN/deTiN/deTiN.py", line 564, in main
    do = output(di, ssnv_based_model, ascna_based_model)
  File "/home/ngs_dev/Analysis/Pipelines/development/20190409.mod/Exome_pipeline/tools/deTiN/deTiN/deTiN.py", line 259, in __init__
    if self.input.indel_table.isnull().values.sum() == 0:
AttributeError: 'list' object has no attribute 'isnull'

generating inputs with GATK4

Hi, apologies for cross-posting, I wonder if I can get some information regarding this from the deTiN team too.

Is it possible to generate inputs for deTiN using GATK4 (4.0.6.0)? Specifically, is there a way to get the call-stats file from Mutect2 and is there an equivalent way to generate the SNP statistics file and aSCNV seg file using GATK 4.0.6.0? For example, do f and tau referred to here https://github.com/broadinstitute/deTiN/wiki/Description-of-inputs relate to MINOR_ALLELE_FRACTION_POSTERIOR_50 and LOG2_COPY_RATIO_POSTERIOR_50 respectively in GATK4 (4.0.6.0) ModelSegments output?

Thanks.
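
The deTiN log output posted in another issue in this list reports the header renames it applies to a ModelSegments file, along with the message "transforming log2 data tau column to 2 centered: 2^(CNratio)+1". A minimal sketch of the same mapping done by hand follows. The rename table and the transform are taken from those log messages and have not been checked against the deTiN code. Dropping rows with NaN values mirrors what one user did manually and is an assumption, and `model_segments_to_seg` is a hypothetical helper name.

```python
import io

import pandas as pd

# Header renames as reported in the deTiN log output.
RENAMES = {
    "CONTIG": "Chromosome",
    "START": "Start.bp",
    "END": "End.bp",
    "MINOR_ALLELE_FRACTION_POSTERIOR_50": "f",
    "LOG2_COPY_RATIO_POSTERIOR_50": "tau",
    "NUM_POINTS_COPY_RATIO": "n_probes",
}


def model_segments_to_seg(source):
    """Read a ModelSegments-style table and return deTiN-style columns."""
    # Lines starting with '@' are header lines that precede the table.
    table = pd.read_csv(source, sep="\t", comment="@")
    # Assumption: segments with missing values are simply dropped.
    table = table.dropna(subset=list(RENAMES)).rename(columns=RENAMES)
    # Transform quoted in the log message: 2^(log2 copy ratio) + 1.
    table["tau"] = 2.0 ** table["tau"] + 1.0
    return table[list(RENAMES.values())]


example = io.StringIO(
    "@HD\tVN:1.6\n"
    "CONTIG\tSTART\tEND\tNUM_POINTS_COPY_RATIO\t"
    "MINOR_ALLELE_FRACTION_POSTERIOR_50\tLOG2_COPY_RATIO_POSTERIOR_50\n"
    "1\t100\t200\t10\t0.4\t0.0\n"
    "1\t300\t400\t5\tNaN\t1.0\n"
)
print(model_segments_to_seg(example))
```

Since the log shows deTiN doing these renames itself, a conversion like this is mainly useful for inspecting what values deTiN would end up working with.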

TypeError: object of type 'numpy.float64' has no len(): len(self.ascna_based_model.centroids)

I recently encountered this error across several CLL samples:

Traceback (most recent call last):
  File "/root/deTiN/deTiN/deTiN.py", line 606, in <module>
    main()
  File "/root/deTiN/deTiN/deTiN.py", line 565, in main
    do.calculate_joint_estimate()
  File "/root/deTiN/deTiN/deTiN.py", line 266, in calculate_joint_estimate
    if len(self.ascna_based_model.centroids) > 1:
TypeError: object of type 'numpy.float64' has no len()

Here is the code that initializes object ascna_based_model:

import deTiN_aSCNA_based_estimate as dascna
ascna_based_model = dascna.model(di.seg_table, di.het_table, di.resolution)

When I checked the centroids parameter for the object it is initialized to numpy array in the init function:

 self.centroids = np.zeros([3, 1])

I am not sure where in the code it is reinitialized and hence fails this condition:

if len(self.ascna_based_model.centroids) > 1:
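
The error implies that `centroids` holds a bare `numpy.float64` scalar by the time `len()` is called. Where that reassignment happens is not shown in this thread. A minimal sketch of a defensive length check follows; `n_centroids` is a hypothetical helper, not part of deTiN.

```python
import numpy as np


def n_centroids(centroids):
    """Count centroids whether they arrive as an array or a bare scalar."""
    # np.atleast_1d wraps a scalar in a 1-element array and leaves
    # arrays with one or more dimensions unchanged.
    return len(np.atleast_1d(centroids))


print(n_centroids(np.zeros([3, 1])))  # the initial value from __init__
print(n_centroids(np.float64(0.3)))   # a scalar like the one in the error
```

A check of the form `n_centroids(self.ascna_based_model.centroids) > 1` would then evaluate without raising for both shapes.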

Running with SSNV data only

I was wondering if and how it's possible to run deTiN using only SSNV data.

Running deTiN without any input produces the error message

"One of CN data or SSNV data are required."

implying that it should be possible to run with SSNV data only. When I try this I get

Traceback (most recent call last):
  File "/home/jmitchell1/deTiN_env27/bin/deTiN", line 8, in <module>
    sys.exit(main())
  File "/home/jmitchell1/deTiN_env27/local/lib/python2.7/site-packages/deTiN/deTiN.py", line 462, in main
    di.read_and_preprocess_SSNVs()
  File "/home/jmitchell1/deTiN_env27/local/lib/python2.7/site-packages/deTiN/deTiN.py", line 195, in read_and_preprocess_SSNVs
    self.annotate_call_stats_with_allelic_cn_data()
  File "/home/jmitchell1/deTiN_env27/local/lib/python2.7/site-packages/deTiN/deTiN.py", line 172, in annotate_call_stats_with_allelic_cn_data
    self.call_stats_table['tau'] = tau
  File "/home/jmitchell1/deTiN_env27/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3370, in __setitem__
    self._set_item(key, value)
  File "/home/jmitchell1/deTiN_env27/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3444, in _set_item
    self._ensure_valid_index(value)
  File "/home/jmitchell1/deTiN_env27/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3424, in _ensure_valid_index
    value = Series(value)
  File "/home/jmitchell1/deTiN_env27/local/lib/python2.7/site-packages/pandas/core/series.py", line 262, in __init__
    raise_cast_failure=True)
  File "/home/jmitchell1/deTiN_env27/local/lib/python2.7/site-packages/pandas/core/internals/construction.py", line 658, in sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional

which looks like it's expecting the CN data.

Is it possible to run with the --mutation_data_path file only, and if so, do you have any suggestions as to what I might be doing wrong?

Thanks very much.

How can we get the value "tau" in aSCNA segmentation file?

Dear @amarotaylor, I have seen that in the description of the aSCNA segmentation file, the value "tau" is defined as:

Two centered segment copy ratio data. Copy ratio is the relative amount of DNA in the tumor sample at the segment compared to normal (or a panel of normals). We use this ratio plus 2.

Does it mean that:

tau = N(tumor) / N(normal) + 2 ?

However, in the example data "HCC-1143_100_T-sim-final.acs.seg", many tau values are smaller than 2. Unless I have misunderstood, tau should not be smaller than 2 under that definition.

Could you tell me how to calculate tau? Thanks a lot!

Best regards.
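
The arithmetic behind this question can be sketched by comparing two readings of "two centered". The first is the literal reading of the wiki sentence (copy ratio plus 2). The second is the transform quoted in a deTiN log message elsewhere in this list, "2^(CNratio)+1", applied to a log2 copy ratio. Which one deTiN actually uses for the example file is not confirmed in this thread, and both function names are hypothetical.

```python
import math


def tau_literal_wiki(copy_ratio):
    """Literal reading of the wiki text: copy ratio plus 2."""
    return copy_ratio + 2.0


def tau_from_log_message(log2_copy_ratio):
    """Transform quoted in the deTiN log: 2^(log2 copy ratio) + 1."""
    return 2.0 ** log2_copy_ratio + 1.0


# A segment with half the normal amount of DNA (copy ratio 0.5).
ratio = 0.5
print(tau_literal_wiki(ratio))                 # never below 2 for ratio >= 0
print(tau_from_log_message(math.log2(ratio)))  # can fall below 2
```

Under the literal reading tau cannot drop below 2 for any non-negative copy ratio, which is the inconsistency raised above. Under the second reading a copy-neutral segment (ratio 1) gives exactly 2 and losses give values below 2, which would be compatible with the example data.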

n_probes argument requirement in aSCNA segmentation file:

Hello!

Thanks for developing DeTiN to provide a solution for this very important bioinformatics problem!

I am writing to ask about the n_probes argument in the aSCNA segmentation file. It is not mentioned as one of the required inputs in the wiki (https://github.com/broadinstitute/deTiN/wiki/Description-of-inputs), but we don't seem to be able to run the program without it. My questions are:

  1. If we do not have information on the exome capture kit for the sequencing files, but we have generated f and tau from ASCAT, can we still run deTiN?
  2. If n_probes is absolutely required: we are able to get the number of properly covered (heterozygous) SNP loci in each segment through ASCAT, which seems to be correlated with n_probes. Can we use this value somehow?
  3. How can we use deTiN on WGS data?

gatk except AllelicCNV

Hi,

I have an adjacent normal sample and want to check whether it contains tumor cells.
I tried to use deTiN, but my GATK version does not include AllelicCNV.
I also tried to obtain the AllelicCNV tool separately, but that failed.

How can I run deTiN in this situation?

Best Regards and Happy new year!

Jeongmin

Results

[Screenshot: screen shot 2018-10-18 at 10 14 38]

Hi, I've run deTiN on a truth set of artificially created TiN from 1-30%, for two samples. The results file is attached. Two observations:

  1. I seem to get higher scores with aSCNA than with SSNV.
  2. aSCNA at very low levels of contamination seems to report artificially high values.

Have you got any advice to offer to improve/interpret the results?

Also, I've noticed that even with 1% TiN contamination my SNV calling loses a lot of accuracy. Do you believe deTiN can be tuned to detect such low levels of TiN?

Many thanks for your input!

Description of outputs

Hi,
It would be very helpful if the content of the output files were described as comprehensively on the wiki as the content of the input files.
Especially for the text files, it is not always clear from the column headers what has been calculated.

Thanks!!!

Validation data on SRA

Hi,

I would like to do some benchmarking with the validation data you have generated and shared through SRA PRJNA422575.

Could you please provide more information about what exactly the Library Names mean?

  • There are two prefixes - HCC and TiN - what do they correspond to?
  • In samples whose Library names start with HCC (e.g. HCC_70_30) - is 70 for the tumor and 30 for the normal or vice versa?
  • In samples whose Library names start with TiN (e.g. TiN_12_5) - what do numbers 12 and 5 correspond to?
  • For TiN samples, what are the calculated/expected purities?

Thank you very much in advance and thank you for this amazing data!

probalica

aSCNA segmentation file from Canvas

Hi,

We use Canvas in our somatic CNA calling workflow. Is it possible to use the output of Canvas to create the aSCNA segmentation file required for detin? In particular, how would tau be calculated?

Thanks for any help provided.

error while generating ExAC file

Hi,

I am trying to get your tool to actually work (which has proven not to be an easy task) and while trying to find instructions on how to generate the ExAC file (besides your only one line in the Wiki), I get the following error:

>>> deTiN.deTiN_utilities.build_exac_pickle("./ExAC.r0.3.1.sites.vep.vcf.gz")
Filtering ExAC sites from candidate mutations
processed 0 ExAC sites
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "deTiN-2.0.1/deTiN/deTiN_utilities.py", line 655, in build_exac_pickle
    spl = line.strip("\n").split("\t")
TypeError: a bytes-like object is required, not 'str'

The file was downloaded from ftp://ftp.broadinstitute.org/pub/ExAC_release/current

Any idea why this is? Did you maintain forward compatibility with the ExAC files? Do you still maintain this tool?
Thank you in advance

Marius
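
This `TypeError` is what Python 3 raises when a `str` argument is passed to a method of a `bytes` object, which is consistent with the gzipped VCF being read in binary mode. That reading of line 655 is an assumption, since the surrounding deTiN code is not shown here. A minimal sketch of the failure and of a text-mode read that avoids it follows; `read_vcf_fields` is a hypothetical helper, not part of deTiN.

```python
import gzip

# Reproduce the failure mode: str argument to a bytes method.
try:
    b"1\t100\n".strip("\n")
except TypeError as err:
    print("binary-mode line:", err)


def read_vcf_fields(path):
    """Yield tab-split fields for each non-header line of a gzipped VCF."""
    # Mode "rt" decodes to str, so strip("\n") and split("\t") both work.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                yield line.strip("\n").split("\t")
```

An alternative with the same effect is to keep binary mode and call `line.decode()` before stripping and splitting.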

error message "can't set using a multi-index" when running example data

hi,

We recently installed deTiN. Python 3.7 was required, and we also installed all packages. When we tried to run the example data with the recommended command:

python3.7 deTiN/deTiN.py --mutation_data_path example_data/HCC_10_90.call_stats.pon_removed.txt --cn_data_path example_data/HCC-1143_100_T-sim-final.acs.seg --tumor_het_data example_data/HCC_10_90.tumor.hets.tsv --normal_het_data example_data/HCC_10_90.normal.hets.tsv --exac_data_path example_data/exac.pickle_high_af --output_name 10_percent_TiN_simulation --indel_data_path example_data/MuTect2.call_stats.txt --indel_data_type MuTect2 --output_dir example_data/

we got this error message:

joint TiN estimate = 0.08
Traceback (most recent call last):
  File "deTiN/deTiN.py", line 611, in <module>
    main()
  File "deTiN/deTiN.py", line 572, in main
    do.reclassify_mutations()
  File "deTiN/deTiN.py", line 387, in reclassify_mutations
    self.SSNVs.loc[:, ('p_somatic_given_TiN')] = np.nan_to_num(np.true_divide(numerator, denominator))
  File "/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py", line 670, in __setitem__
    iloc._setitem_with_indexer(indexer, value)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py", line 1601, in _setitem_with_indexer
    self._setitem_with_indexer(new_indexer, value)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py", line 1667, in _setitem_with_indexer
    "cannot set using a multi-index "
ValueError: cannot set using a multi-index selection indexer with a different length than the value

Thank you very much for your help.

Missing required input fields

Hi again Amaro,

Thanks for all your help so far! I've now successfully run deTiN, but I ran into a few errors with missing input fields (not mentioned on the Wiki) that I figured I'd report here.

First, I got an error from the mutation statistics file:

Error reading call stats skipping first two rows and trying again
Traceback (most recent call last):
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 588, in <module>
    main()
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 518, in main
    di.read_and_preprocess_data()
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 216, in read_and_preprocess_data
    self.read_and_preprocess_SSNVs()
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 196, in read_and_preprocess_SSNVs
    self.read_call_stats_file()
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 111, in read_call_stats_file
    comment='#', skiprows=2, usecols=fields, dtype=fields_type)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1749, in __init__
    _validate_usecols_names(usecols, self.orig_names)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1134, in _validate_usecols_names
    "columns expected but not found: {missing}".format(missing=missing)
ValueError: Usecols do not match columns, columns expected but not found: ['alt_allele', 't_ref_sum', 'n_alt_count', 'tumor_name', 'normal_name', 'n_ref_count', 'judgement', 't_alt_sum', 't_alt_count', 'position', 'contig', 'ref_allele', 't_ref_count', 'failure_reasons']

Adding dummy columns t_ref_sum and t_alt_sum fixed this issue. I used MuTect2 rather than MuTect to call variants, and thus had to assemble my own input file rather than using a pre-made call_stats file.

The other error I got was from the aSCNA segmentation file:

changing header of seg file from Start to Start.bp
changing header of seg file from End to End.bp
missing required header n_probes and could not replace with any one of alternates

I fixed this by adding a column n_probes to the input, set equal to Num_SNPs from the Allelic CNV output (I wasn't sure if I should use Num_SNPs or Num_Targets?)

Thanks again!
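
A minimal sketch of a pre-flight check for a hand-assembled mutation statistics file. The column list is copied from the `ValueError` above; the helper names and the dummy value of 0 for `t_ref_sum` and `t_alt_sum` are assumptions, since the report does not say what values were used.

```python
import pandas as pd

# Columns named in the "Usecols do not match columns" error above.
REQUIRED_CALL_STATS_COLUMNS = [
    "contig", "position", "ref_allele", "alt_allele",
    "tumor_name", "normal_name",
    "t_ref_count", "t_alt_count", "n_ref_count", "n_alt_count",
    "t_ref_sum", "t_alt_sum", "judgement", "failure_reasons",
]


def missing_call_stats_columns(table):
    """Return the required columns that are absent from the table."""
    return [c for c in REQUIRED_CALL_STATS_COLUMNS if c not in table.columns]


def add_dummy_columns(table, names=("t_ref_sum", "t_alt_sum"), value=0):
    """Return a copy with placeholder columns added where missing."""
    out = table.copy()
    for name in names:
        if name not in out.columns:
            out[name] = value  # placeholder value; an assumption
    return out


calls = pd.DataFrame({"contig": ["1"], "position": [100]})
print(missing_call_stats_columns(calls))
print(missing_call_stats_columns(add_dummy_columns(calls)))
```

Running the check before deTiN gives a short list of what to add, instead of the long pandas traceback.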

Easy Question: Is "alt_allele_in_normal" from Mutect1 similar to "normal artifact" in Mutect2 (GATK4)?

Hello,

I have a very quick and perhaps basic question. I am trying to use the latest version of Mutect2 via GATK4. The only constraint I have is that the latest version of Mutect2 does not have the FILTER "alt_allele_in_normal" in the VCF. Instead, I see "normal artifact" in the newer version.

Do you have any experience running DeTiN with a variant noted as "normal artifact" as the failure reason?

Best

AllelicCNV input files

Hello,

I've been going through the GATK pre-processing steps to generate all of the input files for DeTiN. I ran into an error in the AllelicCNV (missing Spark dependency), and googling the issue led to a GitHub issue that concluded with the entire CNV pipeline being deprecated:
broadinstitute/gatk#3599

Is there an alternative way to generate these files going forward?

Thanks!
