TCGAmutations - An R data package for TCGA somatic mutations


Introduction

TCGAmutations is an R data package containing somatic mutations from TCGA cohorts. It is particularly useful for those working with mutation data from TCGA studies, where much of the time is spent searching various databases and downloading, compiling, and tidying the data before the actual analysis even starts. This package tries to mitigate the issue by providing pre-compiled, curated somatic mutations from 33 TCGA cohorts, along with relevant clinical information for all sequenced samples.

Installation

BiocManager::install("PoisonAlien/TCGAmutations")

TCGA cohorts

The tcga_available() function lists the available cohorts along with the source, sample size, and DOI for citation.

> TCGAmutations::tcga_available()
 Study_Abbreviation   MC3                          Firehose
                <char> <int>                            <char>
 1:                ACC    92  62 [dx.doi.org/10.7908/C1610ZNC]
 2:               BLCA   411 395 [dx.doi.org/10.7908/C1MW2GGF]
 3:               BRCA  1026 978 [dx.doi.org/10.7908/C1TB167Z]
 4:               CESC   291 194 [dx.doi.org/10.7908/C1MG7NV6]
 5:               CHOL    36  35 [dx.doi.org/10.7908/C1K936V8]
 6:               COAD   406 367 [dx.doi.org/10.7908/C1DF6QJD]
 7:               DLBC    37  48 [dx.doi.org/10.7908/C1X066DK]
 8:               ESCA   185 185 [dx.doi.org/10.7908/C1BV7FZC]
 9:                GBM   400 283 [dx.doi.org/10.7908/C1XG9QGN]
10:               HNSC   509 511 [dx.doi.org/10.7908/C18C9VM5]
11:               KICH    66  66 [dx.doi.org/10.7908/C1765DQK]
12:               KIRC   370 476 [dx.doi.org/10.7908/C10864RM]
13:               KIRP   282 282 [dx.doi.org/10.7908/C19C6WTF]
14:               LAML   140 193 [dx.doi.org/10.7908/C1D21X2X]
15:                LGG   525 516 [dx.doi.org/10.7908/C1MC8ZDF]
16:               LIHC   365 373 [dx.doi.org/10.7908/C128070B]
17:               LUAD   517 533 [dx.doi.org/10.7908/C17P8XT3]
18:               LUSC   485 178 [dx.doi.org/10.7908/C1X34WXV]
19:               MESO    82                                  
20:                 OV   411 466 [dx.doi.org/10.7908/C1736QC5]
21:               PAAD   178 126 [dx.doi.org/10.7908/C1513XNS]
22:               PCPG   184 179 [dx.doi.org/10.7908/C13T9GN0]
23:               PRAD   498 498 [dx.doi.org/10.7908/C1Z037MV]
24:               READ   150 122 [dx.doi.org/10.7908/C1S46RDB]
25:               SARC   239 247 [dx.doi.org/10.7908/C137785M]
26:               SKCM   468 290 [dx.doi.org/10.7908/C1J67GCG]
27:               STAD   439 393 [dx.doi.org/10.7908/C1C828SM]
28:               TGCT   134 147 [dx.doi.org/10.7908/C1S1820D]
29:               THCA   500 496 [dx.doi.org/10.7908/C16W99KN]
30:               THYM   123 120 [dx.doi.org/10.7908/C15T3JZ6]
31:               UCEC   531 248 [dx.doi.org/10.7908/C1C828T2]
32:                UCS    57  57 [dx.doi.org/10.7908/C1PC31W8]
33:                UVM    80  80 [dx.doi.org/10.7908/C1S1821V]
    Study_Abbreviation   MC3                          Firehose
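In the output above, the Firehose column packs the sample size and the DOI into one string. As a minimal sketch in base R, the two can be split apart. The three rows are copied from the output above; the column names Firehose_n and Firehose_doi are made up for this illustration, and the blank MESO cell is assumed to be an empty string.

```r
# Three rows copied from the tcga_available() output above.
cohorts <- data.frame(
  Study_Abbreviation = c("ACC", "LUSC", "MESO"),
  MC3 = c(92L, 485L, 82L),
  Firehose = c(
    "62 [dx.doi.org/10.7908/C1610ZNC]",
    "178 [dx.doi.org/10.7908/C1X34WXV]",
    ""  # assumption: MESO's blank Firehose cell is an empty string
  ),
  stringsAsFactors = FALSE
)

# The leading integer is the Firehose sample size; empty cells become NA.
cohorts$Firehose_n <- as.integer(
  sub("^[[:space:]]*([0-9]*).*$", "\\1", cohorts$Firehose)
)

# The text inside the square brackets is the DOI; empty cells stay empty.
cohorts$Firehose_doi <- sub("^.*\\[(.*)\\][[:space:]]*$", "\\1", cohorts$Firehose)

# Cohorts where MC3 has more samples than Firehose (NA rows are dropped).
bigger_in_mc3 <- cohorts[which(cohorts$MC3 > cohorts$Firehose_n),
                         c("Study_Abbreviation", "MC3", "Firehose_n")]
print(bigger_in_mc3)
```

The same two sub() calls should work on the full table returned by tcga_available(), provided the column keeps the "N [DOI]" layout shown above.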

Usage

There are only two commands:

  • tcga_available() - Lists the available cohorts in the package
  • tcga_load() - Takes a cohort name and returns the corresponding MAF object

There are two sources from which MAF files were compiled:

MC3

> luad <- TCGAmutations::tcga_load(study = "LUAD")
Loading LUAD. Please cite: https://doi.org/10.1016/j.cels.2018.03.002 for reference
> luad
An object of class  MAF 
                        ID summary    Mean Median
                    <char>  <char>   <num>  <num>
 1:             NCBI_Build  GRCh37      NA     NA
 2:                 Center       .      NA     NA
 3:                Samples     517      NA     NA
 4:                 nGenes   17130      NA     NA
 5:        Frame_Shift_Del    4021   7.778      5
 6:        Frame_Shift_Ins    1185   2.292      1
 7:           In_Frame_Del     388   0.750      0
 8:           In_Frame_Ins      37   0.072      0
 9:      Missense_Mutation  133671 258.551    177
10:      Nonsense_Mutation   11074  21.420     13
11:       Nonstop_Mutation     179   0.346      0
12:            Splice_Site    4469   8.644      5
13: Translation_Start_Site     225   0.435      0
14:                  total  155249 300.288    202
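The Mean column in the summary above is consistent with the variant count divided by the number of samples. A quick check in base R, using values copied from the printed summary (this is an observation about these numbers, not a statement of how maftools computes the column internally):

```r
# Values copied from the MC3 LUAD summary above.
samples <- 517
totals <- c(Missense_Mutation = 133671, Nonsense_Mutation = 11074, total = 155249)

# Variants per sample, rounded to three decimals as in the printed summary.
per_sample_mean <- round(totals / samples, 3)
print(per_sample_mean)
```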

Firehose

Set the source argument to "Firehose" to load MAF files from Broad Firehose.

WARNING: Use Firehose data at your own risk. The MAF data has not been updated in a long time. It is strongly recommended to use the default MC3 cohort.

> TCGAmutations::tcga_load(study = "LUAD", source = "Firehose")
Loading LUAD. Please cite: dx.doi.org/10.7908/C17P8XT3 for reference
An object of class  MAF 
                   ID       summary    Mean Median
 1:        NCBI_Build            37      NA     NA
 2:            Center broad.mit.edu      NA     NA
 3:           Samples           533      NA     NA
 4:            nGenes         16515      NA     NA
 5:   Frame_Shift_Del          4018   7.538      5
 6:   Frame_Shift_Ins          1409   2.644      2
 7:      In_Frame_Del           526   0.987      1
 8:      In_Frame_Ins            74   0.139      0
 9: Missense_Mutation        119156 223.557    157
10: Nonsense_Mutation          9521  17.863     12
11:  Nonstop_Mutation           157   0.295      0
12:       Splice_Site          7675  14.400      9
13:             total        142536 267.422    187

Returned MAF objects can be passed to any functions from maftools for visualization and analysis.
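As a rough sketch of what that looks like in practice, the snippet below loads a cohort and hands it to a few maftools functions. It assumes both packages are installed and skips the calls otherwise; the variable have_pkgs and the choice of top = 10 are only for illustration.

```r
# Only run the example when both packages are available.
have_pkgs <- requireNamespace("TCGAmutations", quietly = TRUE) &&
  requireNamespace("maftools", quietly = TRUE)

if (have_pkgs) {
  # Load the MC3 LUAD cohort as a MAF object.
  luad <- TCGAmutations::tcga_load(study = "LUAD")

  # Cohort-level summary plot.
  maftools::plotmafSummary(maf = luad)

  # Oncoplot of the ten most frequently mutated genes.
  maftools::oncoplot(maf = luad, top = 10)

  # Per-gene mutation counts as a table.
  print(head(maftools::getGeneSummary(luad)))
} else {
  message("Install TCGAmutations and maftools to run this example.")
}
```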

Clinical data

Clinical data for MC3 are obtained from the harmonized clinical data resource (see References). Thanks to @mitchellcheung8 for pointing out the reference and the data source.

Recommendations for survival analysis (as suggested by the publication)

Recommended use of the endpoints: For clinical outcome endpoints, we recommend the use of PFI for progression-free interval, and OS for overall survival. Both endpoints are relatively accurate. Given the relatively short follow-up time, PFI is preferred over OS. For detailed recommendations, please refer to Table 3 in the accompanying paper.

Below are the column names for the event and the timepoint.

Endpoint                          Event column name   Timepoint column name
PFI (Progression-free interval)   CDR_PFI             CDR_PFI.time
OS (Overall survival)             CDR_OS              CDR_OS.time
DSS (Disease-specific survival)   CDR_DSS             CDR_DSS.time
DFI (Disease-free interval)       CDR_DFI             CDR_DFI.time
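The column names in the table above follow a regular pattern, so they can be built from the endpoint abbreviation. The helper below is hypothetical (it is not part of TCGAmutations or maftools) and is only a small base R sketch of that pattern.

```r
# Hypothetical helper: map an endpoint abbreviation to the event and
# timepoint column names listed in the table above.
cdr_columns <- function(endpoint = c("PFI", "OS", "DSS", "DFI")) {
  endpoint <- match.arg(endpoint)
  list(
    status = paste0("CDR_", endpoint),
    time   = paste0("CDR_", endpoint, ".time")
  )
}

cols <- cdr_columns("PFI")
print(cols$status)
print(cols$time)
```

The two returned strings are intended to be passed as the Status and time arguments of maftools::mafSurvival(). An abbreviation outside the four listed endpoints makes match.arg() stop with an error.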

Example usage for survival:

# Load the cohort first (BRCA is used here as the example)
brca <- TCGAmutations::tcga_load(study = "BRCA")

# OS
maftools::mafSurvival(maf = brca, genes = c("TP53"), time = "CDR_OS.time", Status = "CDR_OS")

# PFI
maftools::mafSurvival(maf = brca, genes = c("TP53"), time = "CDR_PFI.time", Status = "CDR_PFI")

FAQ

Q: How did I compile the data?

A: See compile_MC3.R for the details.

Q: Are there any non-TCGA/external cohorts?

A: Please open an issue if you have a particular publication in mind that you want me to include in the package.

References

For maftools

Maftools: efficient and comprehensive analysis of somatic variants in cancer. Mayakonda A, Lin DC, et al. Genome Research

For MC3 cohort

Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Kyle Ellrott, Matthew H. Bailey, Gordon Saksena, et al. Cell Syst

For clinical data resource

An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Liu, Jianfang et al. Cell

tcgamutations's People

Contributors

poisonalien, selkamand, shixiangwang

tcgamutations's Issues

TCGA visualisation in lollipop plots

Hi, thank you so much for all your powerful tools!

I was thinking of comparing my data with the pre-compiled TCGA MAFs by visualizing the mutations in certain genes (i.e., lollipop plots for two MAF files). The problem is that, due to space, there is no protein change or AA change column. And when I use HGVSp_Short, I get an error (Error in subsetMaf(maf = maf, includeSyn = FALSE, genes = geneID, query = "Variant_Type != 'CNV'") :
trying to get slot "maf.silent" from an object (class "data.table") that is not an S4 object).

Do you think there is a faster way to do that through maftools?
I was thinking of annotating the MAF, but it doesn't really work... I also tried to extract only chr, start, and end and make a new MAF for the genes that I want, which is why I am reaching out in case you have a "quicker" way to do this.

best
Iris

cannot read ICGC simple somatic mutation file in MAF

Hi,
I am using the commands below:
icgc = system.file("extdata", "ssm_orca_In.tsv.gz", package="maftools")
maf <- icgcSimpleMutationToMAF(icgc = icgc, addHugoSymbol = TRUE)
Converting Ensemble Gene IDs into HGNC gene symbols.
Done! Original ensemble gene IDs are preserved under field name ens_id
--Removed 904661 duplicated variants
--Found 101696 variants with no Gene Symbols
--Annotating them as 'Unknown' for convenience
--Non MAF specific values in Variant_Classification column:
NA
--Non MAF specific values in Variant_Type column:
NA

I am getting the below error:
plotmafSummary(maf)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘getSampleSummary’ for signature ‘"data.table"’

Request for additional metadata: Capture Kit

Hi,

Just wondering about the feasibility of adding additional sample metadata, for example which capture kit was used?

Reason we care about capture kit:

Capture kit biases have led to false negative mutation calls in multiple TCGA cohorts. Variability in the capture kits used can contribute to inter-cohort TMB variation. More importantly, there are some genes that may be mutated in a sample, but whose mutation you will or won't see because of the specific capture kit that was used. See Wang et al. 2018 for details.

Downloading mc3.v0.2.8.PUBLIC.maf

So I am new to R, new to GitHub, new to coding... I apologize for simple-minded questions. I am trying to generate a gene expression analysis heatmap using 10 genes. I basically want to see the number of mutations in these genes in each of the TCGA cancers. I was directed to download the MAF file from the https://gdc.cancer.gov/about-data/publications/mc3-2017 website. I am also using a Mac. I downloaded the file "mc3.v0.2.8.PUBLIC.maf", but every time I download it and check how big the file is, it's only 66 KB on disk! Would you be able to help me?

Many duplicated variants

@PoisonAlien It is strange that there are many duplicated variants when I recreate the MAF. Why does this happen?

> tcga_load("luad")
Loading objects:
  tcga_luad_mc3
Successfully loaded TCGA LUAD!
Please cite https://doi.org/10.1016/j.cels.2018.03.002 for MAF source.
> maftools::read.maf(rbind(tcga_luad_mc3@data, tcga_luad_mc3@maf.silent))
-Validating
--Removed 18834 duplicated variants
-Silent variants: 63940 
-Summarizing
--Possible FLAGS among top ten genes:
  TTN
  MUC16
  USH2A
  FLG
-Processing clinical data
--Missing clinical data
-Finished in 5.521s elapsed (5.024s cpu) 
An object of class  MAF 
                        ID summary    Mean Median
 1:             NCBI_Build      NA      NA     NA
 2:                 Center      NA      NA     NA
 3:                Samples     515      NA     NA
 4:                 nGenes   17130      NA     NA
 5:        Frame_Shift_Del    3675   7.150    5.0
 6:        Frame_Shift_Ins    1096   2.132    1.0
 7:           In_Frame_Del     360   0.700    0.0
 8:           In_Frame_Ins      33   0.064    0.0
 9:      Missense_Mutation  122329 237.994  165.5
10:      Nonsense_Mutation   10152  19.751   12.5
11:       Nonstop_Mutation     164   0.319    0.0
12:            Splice_Site    4094   7.965    5.0
13: Translation_Start_Site     206   0.401    0.0
14:                  total  142109 276.477  189.0

Broad/MC3 PTEN mutational frequency discrepancy

Hi,
I am getting really different mutational frequencies in some genes, when looking at data from the Broad Firehose, compared to the MC3 source. I would expect some differences due to the documented differences in preprocessing, but this effect is quite extreme in some cases. For example, looking at the top 10 mutated genes in colon cancer for Broad and MC3:
[Screenshot: top 10 mutated genes in colon cancer, Broad]
[Screenshot: top 10 mutated genes in colon cancer, MC3]
If you look at the PTEN gene for example, this is mutated in over 40% of subjects in the Broad firehose data. This is way above its mutation rate in this cancer from the literature, including in the TCGA marker paper. Further, this gene is mutated in only 3% of cases in the MC3 data. Do you have an idea where such a large discrepancy could come from? Help here much appreciated!

CCLE maf files

Hi PoisonAlien, I am very new to bioinformatics and I apologize if this is a dumb question. Would it be possible to show how to apply your tools to a CCLE MAF file? I tried separating the file into individual lineage MAF files but am having difficulty running MutSigCV on them. I am sorry if I am not making sense. My question is whether your tool can be applied to a CCLE MAF file.

Thanks.

Shwetha

Are the available datasets whole exome sequencing data?

Hello,
I am very interested in comparing my data with the TCGA mutations. My data (mutations in RNA-seq data) is in VCF file format, which I converted to MAF files. I wanted to check with you whether the available datasets in your package are from whole exome sequencing data. Do you have any datasets that are mutations from TCGA RNA-seq cohorts?
Thanks, K

cannot read maf

Hi!
After loading the study, how can I use it in maftools by reading the MAF?

tcga_load("luad")
Loading LUAD. Please cite: https://doi.org/10.1016/j.cels.2018.03.002 for reference
An object of class MAF
                        ID summary    Mean Median
 1:             NCBI_Build      NA      NA     NA
 2:                 Center      NA      NA     NA
 3:                Samples     517      NA     NA
 4:                 nGenes   17130      NA     NA
 5:        Frame_Shift_Del    4021   7.778      5
 6:        Frame_Shift_Ins    1185   2.292      1
 7:           In_Frame_Del     388   0.750      0
 8:           In_Frame_Ins      37   0.072      0
 9:      Missense_Mutation  133671 258.551    177
10:      Nonsense_Mutation   11074  21.420     13
11:       Nonstop_Mutation     179   0.346      0
12:            Splice_Site    4469   8.644      5
13: Translation_Start_Site     225   0.435      0
14:                  total  155249 300.288    202

I tried...

maftools::read.maf(rbind(tcga_luad@data))
Error in rbind(tcga_luad@data) : object 'tcga_luad' not found

or

luad = read.maf(maf = 'inst/extdata/MC3/luad.RDs')
-Reading
Error in data.table::fread(file = maf, sep = "\t", stringsAsFactors = FALSE, :
File 'inst/extdata/MC3/luad.RDs' does not exist or is non-readable. getwd()=='/Users/dingdingdingplus'

Thank you very much!

status arg not found due to binary value change

Error in mafSurvival(maf = tcga_stad, genes = "TP53", time = "days_to_last_followup", Status = 'vital_status', isTCGA = TRUE):

Overall_Survival_Status not found in clinicalData. Use argument Status to povide column name containing events (Dead or Alive).

The error message says that vital status should take the values Dead/Alive, while the package assigns it the binary values 0/1.

How should I fix it?

Significant number of discrepancies for LUAD clinical data - vital_status / days_to_death / or days_to_last_followup

I was using TCGAmutations to obtain LUAD mutation and clinical data from 'TCGA MC3', so that I could plot survival data (Kaplan–Meier) using the mafSurvival function of the maftools library.

I noticed differences between the mafsurvival generated graph (LUAD for the CDKN2A gene) with the survival graph generated on-line (https://www.tcga-survival.com/data-table?view=gene&gene=CDKN2A&filter=cancer_type&cancer_type=LUAD).

To determine the cause of this, I then downloaded the mc3 MAF and the clinical data from "https://gdc.cancer.gov/node/905/":

  1. mc3.v0.2.8.PUBLIC.maf
  2. clinical_PANCAN_patient_with_followup.tsv

I then processed the files with maftools after subsetting for LUAD samples.
When I performed 'mafSurvival' for the CDKN2A gene with this data, the result was identical to the www.tcga-survival.com website.

To determine the discrepancies, I compared the specific clinical data downloaded by TCGAmutations and the ones downloaded from "https://gdc.cancer.gov/node/905/".

Here are a few examples of the inaccurate info I found from the pre-compiled TCGAmutations clinical data:
TCGA-44-5645-01A-51D-A27T-08 , days to last followup=208, (should be 852)
TCGA-44-7662-01A-11D-2063-08 , days to last followup=50, (should be 218)
TCGA-53-A4EZ-01A-12D-A24P-08 , days to last followup=280, (should be 1071)
TCGA-55-8089-01A-11D-2238-08 , vital status = Alive ; days to death=NA, (should be dead and 702, respectively)
TCGA-73-A9RS-01A-11D-A410-08 , vital status = Alive ; days to death=NA, (should be dead and 340, respectively)

I also went to "https://portal.gdc.cancer.gov/" to search for these specific cases and confirmed that the pre-compiled TCGAmutations data was inaccurate.

Maybe I'm not processing the clinical data from TCGAmutations correctly:

library(TCGAmutations)
library(maftools)
tcga_luad <- tcga_load(study = "LUAD")
luad_maf = read.maf(maf = tcga_luad@data, clinicalData = tcga_luad@clinical.data)

Clinical data from TCGAmutations to Maftools

Hello!

I am having some issues passing the samples' clinical data so that I can use it as described in the maftools vignettes. In desperation, I tried:

cancer.maf <- tcga_acc_mc3
cancer.clinical <- getClinicalData(tcga_acc_mc3)

cancer <- read.maf(maf = cancer.maf, clinicalData = cancer.clinical)

and it gives an error:

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'getClinicalData' for signature '"character"'

However, after reading the papers for TCGAmutations and maftools, I couldn't figure out how to connect one with the other.

Could you please help me?

Thank you
