HATCHet paper

This repository contains the simulated data, the results of all the methods considered in the comparison, and the results of HATCHet (HATCHet repository) on the published whole-genome multi-sample tumor sequencing datasets. All of these are described in the HATCHet paper:

Simone Zaccaria and Ben Raphael, 2018

Simulated data
- Tumors
- Patients and samples
- Results
  - Fixed
  - Free
Cancer data
- Prostate cancer
- Pancreas cancer
Analysis
- Mutated copies, predicted VAF, CCF, and explaining of mutations

Simulated data

All the simulated data are included in the folder simulation. The simulated data comprise 256 mixed samples with 2-3 tumor clones for 64 patients (3-5 samples per patient), half with a whole-genome duplication (WGD). These simulated data have been generated using MASCoTE, which is also described in the HATCHet paper and is available here:

MASCoTE

Due to space limitations, we are unable to publish the sequencing reads from all samples in this repository. Instead, for every sample we provide a BB file that encodes the read-depth ratio (RDR) and B-allele frequency (BAF) for every genomic bin of the reference genome in every sample. The BB files are produced by the pre-processing steps of HATCHet and summarize all the basic input data needed by CNA-inference methods.

Tumors

All these data are reported in the folder data. The patients are divided between those with a tumor without a WGD, in the noWGD folder, and those with a tumor with a WGD, in the WGD folder. In both cases, a dataset corresponds to a collection of clones and is named in the format dataset_nX_sY, where X is the number of tumor clones (in addition to a normal diploid clone) and Y is the random seed used by MASCoTE for reproducibility. The copy-number profiles of the tumor clones and the corresponding phylogenetic tree with CNAs and WGDs are reported in the following two files, contained in the related subfolder tumor:

Filename	Description	Format
`copynumber.cvs`	The allele and clone-specific copy-number profiles resulting from the CNAs and WGDs simulated by MASCoTE	The file is a tab-separated file with the following fields: `CHR`: the name of a chromosome; `START`: the genomic position in `CHR` determining the start of a genomic segment; `END`: the genomic position in `CHR` determining the end of the corresponding genomic segment; `cloneX`: the copy-number state of `cloneX` (with `X` from 0 to N-1) in the corresponding genomic segment. The copy-number state of `cloneX` is given in the format `A|B` where `A` and `B` are the two allele-specific copy numbers
`tumor.dot`	The phylogenetic tree describing the tumor evolution where there is a node for every clone and the edges are labeled by the corresponding CNAs and WGDs.	The phylogenetic tree is encoded in the DOT format. The mutations are given in the following formats: A CNA in an edge is reported in the format `(START,END) del/tdup in P/M-CHR` where `START`, `END` are the genomic coordinates of the corresponding genomic segment, `del` or `tdup` indicates whether the corresponding CNA is a deletion or duplication respectively, `M` or `P` indicates whether the maternal or paternal copy has been affected, and `CHR` is the corresponding chromosome. A chromosomal arm aberration is reported in the format `(START,END) del/tdup of P/M-CHR arm` where `START`, `END` are the genomic coordinates of the corresponding chromosomal arm, `del` or `tdup` indicates whether the corresponding aberration is a deletion or duplication respectively, `M` or `P` indicates whether the maternal or paternal copy has been affected, and `CHR` is the corresponding chromosome. A chromosomal loss is given in the format `M/P-CHR loss` where `M` or `P` indicates whether the maternal or paternal copy has been lost and `CHR` indicates the corresponding chromosome. A WGD is reported in the format `WGD`
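As an illustration of the copy-number file layout described above, the following is a minimal Python sketch that reads such a tab-separated file and splits each `A|B` state into two integers. It assumes that the first row is a header naming the columns (`CHR`, `START`, `END`, `clone0`, ...); this is not confirmed by the description above, so adjust the reader if the files have no header. The function names and the example row are hypothetical.

```python
import csv
import io


def parse_cn_state(state):
    """Split an allele-specific state such as '2|1' into a pair of integers."""
    a, b = state.split("|")
    return int(a), int(b)


def read_copy_numbers(handle):
    """Yield (chrom, start, end, {clone: (A, B)}) for every genomic segment.

    Assumption: the first row is a header; a leading '#' on a column name
    is tolerated and stripped.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    for row in reader:
        if not row:
            continue  # ignore blank lines
        record = dict(zip(header, row))
        states = {
            name: parse_cn_state(value)
            for name, value in record.items()
            if name.startswith("clone")
        }
        yield record["CHR"], int(record["START"]), int(record["END"]), states


# Made-up example content, only to show the expected shape of the input.
example = io.StringIO(
    "CHR\tSTART\tEND\tclone0\tclone1\n"
    "chr1\t0\t1000000\t1|1\t2|1\n"
)
segments = list(read_copy_numbers(example))
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the in-memory example.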

Patients and samples

Each dataset includes two patients, and for each patient a BB file describes the RDR and BAF of every genomic bin in all samples of the corresponding patient. The name of each BB file specifies the number of samples for the related patient, the number of clones, and the corresponding clone proportions. More specifically, the BB filename is a _-separated list where the first element, preceded by the letter k, specifies the number of samples and each other element specifies the clone proportions of a sample, listed such that the first is the proportion of the normal diploid clone and the proportions of the tumor clones follow in the corresponding order. The name of a sample is a _-separated list which starts with the word bulk and in which each element specifies the clone proportion (without the dot) of one clone.

For example, k4_01090_02008_00506035_00504055.bb.gz is a BB file for a patient with 4 samples which include 2 tumor clones (clone0 and clone1) and a normal diploid clone normal. In particular, the samples have the following clonal compositions:

Name of sample	`normal` proportion	`clone0` proportion	`clone1` proportion
`bulk_01normal_09clone0_Noneclone1`	`0.1`	`0.9`	Not present
`bulk_02normal_Noneclone0_08clone1`	`0.2`	Not present	`0.8`
`bulk_005normal_06clone0_035clone1`	`0.05`	`0.6`	`0.35`
`bulk_005normal_04clone0_055clone1`	`0.05`	`0.4`	`0.55`

As another example, k7_040600_010090_020008_0103060_0205003_0100504_01030303.bb.gz is a BB file for a patient with 7 samples which include 3 tumor clones (clone0, clone1, and clone2) and a normal diploid clone normal. In particular, the samples have the following clonal compositions:

Name of sample	`normal` proportion	`clone0` proportion	`clone1` proportion	`clone2` proportion
`bulk_04normal_06clone0_Noneclone1_Noneclone2`	`0.4`	`0.6`	Not present	Not present
`bulk_01normal_Noneclone0_09clone1_Noneclone2`	`0.1`	Not present	`0.9`	Not present
`bulk_02normal_Noneclone0_Noneclone1_08clone2`	`0.2`	Not present	Not present	`0.8`
`bulk_01normal_03clone0_06clone1_Noneclone2`	`0.1`	`0.3`	`0.6`	Not present
`bulk_02normal_05clone0_Noneclone1_03clone2`	`0.2`	`0.5`	Not present	`0.3`
`bulk_01normal_Noneclone0_05clone1_04clone2`	`0.1`	Not present	`0.5`	`0.4`
`bulk_01normal_03clone0_03clone1_03clone2`	`0.1`	`0.3`	`0.3`	`0.3`
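The sample-naming convention illustrated in the tables above can be decoded mechanically. The following Python sketch is one possible way to do it; the function name `parse_sample_name` is hypothetical, and the sketch assumes that every proportion is written as its decimal representation with the dot removed (so `005` means `0.05`) and that `None` marks a clone that is not present in the sample.

```python
import re

# One element of a sample name: digits (or 'None') followed by a clone label.
TOKEN = re.compile(r"^(None|\d+)(normal|clone\d+)$")


def parse_sample_name(name):
    """Map each clone label to its proportion, or to None if not present."""
    parts = name.split("_")
    if parts[0] != "bulk":
        raise ValueError(f"unexpected sample name: {name!r}")
    proportions = {}
    for part in parts[1:]:
        match = TOKEN.match(part)
        if match is None:
            raise ValueError(f"unexpected element {part!r} in {name!r}")
        digits, clone = match.groups()
        if digits == "None":
            proportions[clone] = None
        else:
            # Re-insert the dot after the first digit: '005' -> '0.05'.
            proportions[clone] = float(digits[0] + "." + digits[1:])
    return proportions


composition = parse_sample_name("bulk_005normal_06clone0_035clone1")
```

Here `composition` maps `normal`, `clone0`, and `clone1` to 0.05, 0.6, and 0.35, matching the corresponding row of the first table.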

Each BB file corresponds to a patient and is a tab-separated file describing the RDR and BAF of every genomic bin in all samples in the following format:

Field	Description
`CHR`	Name of a chromosome
`START`	Starting genomic position of a genomic bin in `CHR`
`END`	Ending genomic position of a genomic bin in `CHR`
`SAMPLE`	Name of a tumor sample
`RD`	RDR of the bin in `SAMPLE`
`#SNPS`	Number of SNPs present in the bin in `SAMPLE`
`COV`	Average coverage in the bin in `SAMPLE`
`ALPHA`	Alpha parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically total number of reads from A allele
`BETA`	Beta parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically total number of reads from B allele
`BAF`	BAF of the bin in `SAMPLE`

Due to space limitations, each BB file has been compressed using gzip with level of compression 9. The file can be easily decompressed with the command gzip -d BBFILE.
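A BB file can also be read directly without decompressing it on disk. The following Python sketch uses only the standard library; it assumes that the columns appear in the order of the table above and that a header line, if present, starts with the symbol `#`. Both points are assumptions rather than guarantees, and the function names and the example values are made up for illustration.

```python
import csv
import gzip
import io

# Column order taken from the field table above (assumed to match the file).
BB_FIELDS = ["CHR", "START", "END", "SAMPLE", "RD",
             "#SNPS", "COV", "ALPHA", "BETA", "BAF"]


def read_bb(handle):
    """Yield one dictionary per genomic bin and sample from an open text handle."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and a possible header line
        record = dict(zip(BB_FIELDS, row))
        yield {
            "CHR": record["CHR"],
            "START": int(record["START"]),
            "END": int(record["END"]),
            "SAMPLE": record["SAMPLE"],
            "RD": float(record["RD"]),
            "BAF": float(record["BAF"]),
        }


def read_bb_gz(path):
    """Read a gzip-compressed BB file from disk."""
    with gzip.open(path, "rt") as handle:
        yield from read_bb(handle)


# Made-up content, compressed in memory, only to exercise the reader.
raw = (
    "#CHR\tSTART\tEND\tSAMPLE\tRD\t#SNPS\tCOV\tALPHA\tBETA\tBAF\n"
    "chr1\t0\t50000\tbulk_01normal_09clone0_Noneclone1\t1.05\t12\t60.0\t350\t370\t0.486\n"
)
with gzip.open(io.BytesIO(gzip.compress(raw.encode())), "rt") as handle:
    bins = list(read_bb(handle))
```

For the files in this repository, `read_bb_gz` would be called with the path of a `.bb.gz` file.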

Results

HATCHet has been compared with 4 current state-of-the-art methods for CNA inference:

Method	Reference	Repository
Battenberg	(Nik-Zainal et al., Cell, 2012)	cgpBattenberg and Wedge-Oxford Battenberg
TITAN	(Ha et al., Genome Research, 2014)	TitanCNA
THetA	(Oesper et al., Genome Biology, 2013)	THetA/THetA2
cloneHD	(Fischer et al., Cell Reports, 2014)	cloneHD

Each of these methods and HATCHet has been applied on the simulated samples. More specifically, Battenberg, TITAN, and THetA have been applied on each sample individually, cloneHD has been applied jointly on all samples from the same patient, and HATCHet has been applied both on each sample individually (single-sample HATCHet) and jointly on all samples from the same patient. We consider two different settings when executing the methods on simulated data.

Fixed

First, every method has been applied to all 128 samples of the 32 patients without a WGD by providing the true values of the main parameters, including tumor ploidy, number of clones, and maximum copy number. In this case, the results obtained by every method are reported in the folder fixed, in the subfolder of the corresponding dataset. The results of Battenberg, TITAN, and THetA are reported for every sample, the results of cloneHD are reported for every patient, and the results of HATCHet are reported both for every sample (when obtained by executing HATCHet on each sample individually) and for every patient (when obtained by executing HATCHet jointly on all samples from the same patient).

Free

Second, every method has been applied to all 256 samples of the 64 patients with and without a WGD, requiring that each method infers all the relevant parameters, including tumor ploidy and number of clones, and setting the maximum copy number to 8. THetA has been excluded from this analysis as it does not automatically infer the presence or absence of a WGD. In this case, the results obtained by every method are reported in the folder free, in the subfolders of the corresponding datasets, which are divided according to the presence or absence of a WGD. The results of Battenberg and TITAN are reported for every sample, the results of cloneHD are reported for every patient, and the results of HATCHet are reported both for every sample (when obtained by executing HATCHet on each sample individually) and for every patient (when obtained by executing HATCHet jointly on all samples from the same patient).

For every method, all the most important and relevant output files are reported. The largest of these files have been compressed due to space limitations using the command gzip -9 and they can be easily decompressed by using the corresponding command gzip -d.

Cancer data

HATCHet has been applied to two whole-genome multi-sample tumor sequencing datasets: the first dataset comprises 10 prostate cancer patients analyzed in (Gundem et al., Nature, 2015) and the second dataset comprises 4 pancreas cancer patients described in (Makohon-Moore et al., Nature Genetics, 2017).

Prostate cancer

The data for all prostate cancer patients are contained in the subfolder prostate. For each of the 10 prostate cancer patients (A10, A12, A17, A21, A22, A24, A29, A31, A32, and A34), the results inferred by HATCHet are reported in a subfolder with the corresponding name. More specifically, the following files encode the results inferred by HATCHet for each prostate cancer patient:

Name	Description	Format
`best.seg.ucn`	Clone and allele-specific copy number profiles and clone proportions for every genomic segment	the format is described in the HATCHet repository here
`best.bbc.ucn.gz`	Clone and allele-specific copy number profiles and clone proportions for every clustered bin with the corresponding RDR and BAF	the format is described in the HATCHet repository here. Due to space limitations, this file is compressed
`chosen.diploid.seg.ucn`	The best result inferred by HATCHet assuming there is no WGD	The format is the same as that of `best.seg.ucn`
`chosen.tetraploid.seg.ucn`	The best result inferred by HATCHet assuming there is a WGD	The format is the same as that of `best.seg.ucn`

The mutations inferred from all samples of every prostate cancer patient are reported in a subfolder mutations. The SNVs and small indels are reported in two comma-separated files indel_hc.csv and snv_hc.csv with the following fields

Name	Description
`Patient`	The name of a patient
`Sample`	A sample from the patient `Patient`
`chrom`	The name of a chromosome
`position`	The genomic position of a somatic-point mutation in `chrom`
`ref`	Number of sequencing reads covering `position` with the reference allele
`var`	Number of sequencing reads covering `position` with the alternate allele, i.e. harboring the mutation
`normal_reads1`	Reads supporting the reference allele of `position` in the matched-normal sample (the corresponding fields with plus/minus are specific to reads belonging to +/- strand)
`normal_reads2`	Reads supporting the variant allele of `position` in the matched-normal sample (the corresponding fields with plus/minus are specific to reads belonging to +/- strand)
`normal_var_freq`	Variant-allele frequency of `position` in the matched-normal sample
`normal_gt`	Genotype call for `position` in matched-normal sample
`tumor_reads1`	Reads supporting the reference allele of `position` in the tumor sample `Sample` (the corresponding fields with plus/minus are specific to reads belonging to +/- strand)
`tumor_reads2`	Reads supporting the variant allele of `position` in the tumor sample `Sample` (the corresponding fields with plus/minus are specific to reads belonging to +/- strand)
`tumor_var_freq`	Variant-allele frequency (VAF) of `position` in the tumor sample `Sample`
`tumor_gt`	Genotype call for `position` in the tumor sample `Sample`
`somatic_status`	Status of the variant (Germline, Somatic, or LOH). Here, all the mutations are Somatic
`variant_p_value`	Significance of variant read count compared to baseline error rate
`somatic_p_value`	Significance of tumor read count compared to normal read count

Pancreas cancer

The data for all pancreas cancer patients are contained in the subfolder pancreas. For each of the 4 pancreas cancer patients (Pam01, Pam02, Pam03, and Pam04), the results inferred by HATCHet are reported in a subfolder with the corresponding name. More specifically, the following files encode the results inferred by HATCHet for each pancreas cancer patient:

Name	Description	Format
`best.seg.ucn`	Clone and allele-specific copy number profiles and clone proportions for every genomic segment	the format is described in the HATCHet repository here
`best.bbc.ucn.gz`	Clone and allele-specific copy number profiles and clone proportions for every clustered bin with the corresponding RDR and BAF	the format is described in the HATCHet repository here. Due to space limitations, this file is compressed
`chosen.diploid.seg.ucn`	The best result inferred by HATCHet assuming there is no WGD	The format is the same as that of `best.seg.ucn`
`chosen.tetraploid.seg.ucn`	The best result inferred by HATCHet assuming there is a WGD	The format is the same as that of `best.seg.ucn`

The mutations inferred from all samples of every pancreas cancer patient are reported in a subfolder mutations. The SNVs and small indels are reported in two comma-separated files indel_hc.csv and snv_hc.csv with the following fields

Name	Description
`Patient`	The name of a patient
`Sample`	A sample from the patient `Patient`
`chrom`	The name of a chromosome
`position`	The genomic position of a somatic-point mutation in `chrom`
`ref`	Number of sequencing reads covering `position` with the reference allele
`var`	Number of sequencing reads covering `position` with the alternate allele, i.e. harboring the mutation
`normal_reads1`	Reads supporting the reference allele of `position` in the matched-normal sample (the corresponding fields with plus/minus are specific to reads belonging to +/- strand)
`normal_reads2`	Reads supporting the variant allele of `position` in the matched-normal sample (the corresponding fields with plus/minus are specific to reads belonging to +/- strand)
`normal_var_freq`	Variant-allele frequency of `position` in the matched-normal sample
`normal_gt`	Genotype call for `position` in matched-normal sample
`tumor_reads1`	Reads supporting the reference allele of `position` in the tumor sample `Sample` (the corresponding fields with plus/minus are specific to reads belonging to +/- strand)
`tumor_reads2`	Reads supporting the variant allele of `position` in the tumor sample `Sample` (the corresponding fields with plus/minus are specific to reads belonging to +/- strand)
`tumor_var_freq`	Variant-allele frequency (VAF) of `position` in the tumor sample `Sample`
`tumor_gt`	Genotype call for `position` in the tumor sample `Sample`
`somatic_status`	Status of the variant (Germline, Somatic, or LOH). Here, all the mutations are Somatic
`variant_p_value`	Significance of variant read count compared to baseline error rate
`somatic_p_value`	Significance of tumor read count compared to normal read count

Analysis

This section contains the tools that have been used to obtain the analyses presented in the HATCHet paper.

Analysis	Tool	Requirement
Compute mutated copies, predicted VAF, CCF, and explaining of mutations	explainMutationsCCF.py	The tool requires as input a SEG file with allele and clone-specific copy-number states and proportions, and a CSV file with the following fields (whose names must be specified in the first-row header): `chrom`: name of a chromosome; `position`: genomic position of the mutation; `Patient`: name of the patient; `Sample`: name of the sample; `somatic_status`: `Somatic` or `Germline`, only somatic mutations are considered; `tumor_var_freq`: observed VAF in either percentage format, e.g. `10.789%`, or floating format, e.g. `0.10789`; `tumor_reads1`: `REF` count for the mutation; `tumor_reads2`: `ALT` count for the mutation
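Since `tumor_var_freq` may be given either as a percentage or as a fraction, a script that reads the same CSV input needs to normalize it. The following Python sketch shows one way to do this and to keep only somatic mutations; it is not taken from explainMutationsCCF.py, the function names are hypothetical, and the patient, sample, and count values in the example are made up.

```python
import csv
import io


def parse_vaf(value):
    """Return a VAF as a fraction from either '10.789%' or '0.10789'."""
    text = value.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


def iter_somatic(handle):
    """Yield (chrom, position, sample, vaf) for rows whose status is Somatic."""
    for row in csv.DictReader(handle):
        if row["somatic_status"] != "Somatic":
            continue  # only somatic mutations are considered
        yield (row["chrom"], int(row["position"]), row["Sample"],
               parse_vaf(row["tumor_var_freq"]))


# Made-up rows that use the field names listed above.
example = io.StringIO(
    "Patient,Sample,chrom,position,somatic_status,tumor_var_freq,tumor_reads1,tumor_reads2\n"
    "P1,S1,1,12345,Somatic,10.789%,91,11\n"
    "P1,S1,1,67890,Germline,0.5,20,20\n"
)
rows = list(iter_somatic(example))
```

Only the first example row is kept, because the second one is marked `Germline`.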

The tool for the analysis and clustering of SNVs computes and outputs the following fields (specified in the header which starts with the symbol #):

Field	Description
`CHR`	The name of a chromosome
`POS`	The genomic position of the mutation in `CHR`
`PATIENT-SAMPLE`	Patient-sample name in the format `P-S` where `P` is the name of the patient and `S` is the name of the sample
`TOOL`	The name of the method which inferred the copy numbers
`COV`	Total number of reads covering the mutation
`COUNTS`	Comma-separated numbers of reads without and with the mutation
`ObservedVAF`	Observed variant-allele frequency of the mutation
`predicted_VAF`	Predicted VAF when considering the given copy-number states and clone proportions
`Error`	Error in the prediction of VAF
`CNStates`	Given copy-number states and clone proportion for the mutation in `POS`. The state and proportion of the mutation in every clone i are reported in a comma separated list (where clones are sorted according to the input) and the entry for clone i is in the format `A_i
`MutatedCopies`	Inferred number of mutated copies for every clone, these are reported in a comma-separated list such that, also in this field, the clones are sorted according to the same order in the input
`CCF`	Computed cancer-cell fraction for the mutation
`Explained`	True or False to indicate whether the mutation is explained
`SNVState`	Name of the cluster of the mutation based on its SNVState which is defined by the unique combination of its `CNStates` and `MutatedCopies`
`SPRUCEState`	This is the state of the mutation as defined in the SPRUCE model. More specifically, this corresponds to a comma-separated list with an element for every clone i equal to `MAJ_i
`SPRUCECluster`	Name of the cluster of the mutation based on the unique values of `SPRUCEState`
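To make the relationship between `CNStates`, `MutatedCopies`, `predicted_VAF`, and `CCF` more concrete, the following Python sketch uses a commonly used formulation: the predicted VAF is the proportion-weighted number of mutated copies divided by the proportion-weighted total number of copies, and the CCF is the fraction of tumor cells belonging to clones that carry the mutation. This is an assumption made for illustration; the exact computation performed by explainMutationsCCF.py is not described here and may differ. The function names, and the convention that the first clone is the normal clone, are hypothetical.

```python
def predicted_vaf(proportions, total_copies, mutated_copies):
    """Expected VAF given clone proportions and per-clone copy numbers.

    proportions[i]    : proportion u_i of clone i in the sample
    total_copies[i]   : total copy number A_i + B_i of clone i at the locus
    mutated_copies[i] : number of copies carrying the mutation in clone i
    """
    mutated = sum(u * m for u, m in zip(proportions, mutated_copies))
    total = sum(u * c for u, c in zip(proportions, total_copies))
    if total == 0:
        raise ValueError("locus has zero copies in the whole sample")
    return mutated / total


def cancer_cell_fraction(proportions, mutated_copies):
    """Fraction of tumor cells in clones with at least one mutated copy.

    Assumes that index 0 is the normal clone and all others are tumor clones.
    """
    purity = sum(proportions[1:])
    if purity == 0:
        raise ValueError("sample contains no tumor cells")
    carrying = sum(u for u, m in zip(proportions[1:], mutated_copies[1:]) if m > 0)
    return carrying / purity


# Normal clone (0.2) plus two tumor clones (0.5 and 0.3); the mutation is
# present on one of two copies in the first tumor clone only.
vaf = predicted_vaf([0.2, 0.5, 0.3], [2, 2, 2], [0, 1, 0])
ccf = cancer_cell_fraction([0.2, 0.5, 0.3], [0, 1, 0])
```

In this example the predicted VAF is 0.25 (up to floating-point rounding) and the CCF is 0.625, since the carrying clone accounts for 0.5 of the 0.8 tumor purity.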
