d3b-center / d3b-bixu-data-assembly
License: Apache License 2.0
Updating this ticket with the latest histologies:
There are some minor issues with the merged histologies file that we can discuss in the next toolkit meeting. Using the latest merged histologies from the data assembly project:
Some `Kids_First_Biospecimen_ID` values are duplicated. For each of these duplicated rows, one row has `NA` and one row is properly populated for certain fields.
library(dplyr)
dat <- read.delim('~/Downloads/tumor-board-1022-histology-ops-BS_ST7KGV85.tsv')
length(unique(dat$Kids_First_Biospecimen_ID)) # 36829 unique BS ids
dim(dat) # 39814 number of total rows
# one example is BS_007JTNB8 where the first row has NA for fields like cns_region, short_histology, etc but second row is properly populated
> dat %>%
+ filter(Kids_First_Biospecimen_ID == "BS_007JTNB8")
Kids_First_Biospecimen_ID cns_region sample_id aliquot_id Kids_First_Participant_ID
1 BS_007JTNB8 <NA> 7316-2558 655073 PT_1MW98VR1
2 BS_007JTNB8 Posterior fossa 7316-2558 655073 PT_1MW98VR1
experimental_strategy sample_type composition tumor_descriptor primary_site
1 WGS Tumor Solid Tissue Initial CNS Tumor Cerebellum/Posterior Fossa
2 WGS Tumor Solid Tissue Initial CNS Tumor Cerebellum/Posterior Fossa
reported_gender race ethnicity age_at_diagnosis_days pathology_diagnosis
1 Male White Not Hispanic or Latino 1872 Ependymoma
2 Male White Not Hispanic or Latino 1872 Ependymoma
integrated_diagnosis short_histology broad_histology Notes germline_sex_estimate RNA_library
1 <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> Ependymal tumor Ependymal tumor <NA> Male <NA>
OS_days OS_status PFS_days cohort age_last_update_days seq_center normal_fraction
1 687 LIVING 687 CBTN 2559 NantOmics 0.3224044
2 687 LIVING 687 PBTA 2559 NantOmics 0.3224044
tumor_fraction tumor_ploidy cancer_predispositions molecular_subtype cohort_participant_id
1 0.6775956 2 None documented <NA> C632220
2 0.6775956 2 None documented EPN, To be classified C632220
extent_of_tumor_resection
1 Gross/Near total resection
2 Gross/Near total resection
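One possible way to resolve such duplicates is to keep, for each biospecimen ID, the row with the fewest missing fields. The following is a minimal base R sketch on made-up values, not the data assembly project's own logic. Note that in the `BS_007JTNB8` example the two rows also differ in `cohort` (CBTN vs PBTA), so "keep the most complete row" is an assumption that should be confirmed before use.

```r
# Toy stand-in for the histologies table (illustrative values, not real data)
dat <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_A", "BS_B"),
  cns_region = c(NA, "Posterior fossa", "Hemispheric"),
  short_histology = c(NA, "Ependymal tumor", NA),
  stringsAsFactors = FALSE
)

# Count missing fields per row
dat$n_missing <- rowSums(is.na(dat))

# Sort so the most complete row of each ID comes first,
# then keep only the first row per ID
dat_ordered <- dat[order(dat$Kids_First_Biospecimen_ID, dat$n_missing), ]
dedup <- dat_ordered[!duplicated(dat_ordered$Kids_First_Biospecimen_ID), ]
dedup$n_missing <- NULL

print(dedup)
```

Rows that tie on completeness would be resolved by their original order, so a stricter rule may be needed if the duplicates disagree on populated fields.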
# an example with patient 47, 48 and 49
dat %>%
filter(cohort_participant_id %in% c("C3507468", "C3507591", "C3507714")) %>%
dplyr::select(cohort_participant_id, experimental_strategy) %>%
unique()
cohort_participant_id experimental_strategy
1 C3507714 WXS
2 C3507591 WXS
3 C3507468 WXS
`short_histology` info missing for many samples
# an example with patient 47, 48 and 49
dat %>%
filter(cohort_participant_id %in% c("C3507468", "C3507591", "C3507714")) %>%
dplyr::select(cohort_participant_id, short_histology) %>%
unique()
cohort_participant_id short_histology
1 C3507714 <NA>
2 C3507591 <NA>
3 C3507468 <NA>
`RNA_library` info is missing for some RNA-Seq tumor samples
# ids where RNA_library is NA
unique_bs_ids <- dat %>%
filter(sample_type == "Tumor",
experimental_strategy == "RNA-Seq",
is.na(RNA_library)) %>%
pull(Kids_First_Biospecimen_ID) %>%
unique()
# 188 are from CBTN and 1 from GMKF
dat %>%
filter(Kids_First_Biospecimen_ID %in% unique_bs_ids) %>%
group_by(cohort) %>%
summarise(n = n())
# A tibble: 2 × 2
cohort n
<chr> <int>
1 CBTN 188
2 GMKF 1
# latest data assembly histology file
dat = read.delim('tumor-board-1022-histology-ops-BS_ST7KGV85.tsv')
# pnoc008 clinical data manifest from Kids First data tracker
pnoc008_manifest = read_xlsx('FV_D8H9K61X_Copy of 03.01.2022 PNOC008 Clinical Data Manifest.xlsx', sheet = 2)
# identify ids that are in clinical data manifest but not in data assembly histology file
pnoc008_ids <- pnoc008_manifest$`Research ID`
pnoc008_not_found <- setdiff(pnoc008_ids, dat$cohort_participant_id)
pnoc008_manifest[which(pnoc008_manifest$`Research ID` %in% pnoc008_not_found),"PNOC Subject ID"]
# A tibble: 5 × 1
`PNOC Subject ID`
<chr>
1 P-01
2 P-07
3 P-43
4 P-44
5 P-50
cc: @aadamk
Hi @zhangb1,
Some data assembly requests:
For the progression sample reports, I was able to fetch the StarFusion output file for Patient-43-P and merge it into v11, so I no longer need it, but I wanted to point this out for future patients.
Would it be possible to generate collapsed matrices (i.e., rownames are unique gene symbols) of Counts and TPM? Not high priority, but it would be nice to have so we don't have to collapse in the reporting code.
Would it be possible to integrate the fusion annotation code to generate `fusion-putative-oncogenic.tsv`? This is not high priority but would be nice to have.
cc: @aadamk
Thanks!
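A collapsed matrix of the kind requested above could be produced along these lines. This is a minimal base R sketch on made-up values. Keeping the row with the highest mean expression for each duplicated symbol is an assumed rule, not necessarily what the data assembly pipeline would use, and `gene_symbol`, `sample_1`, and `sample_2` are illustrative names.

```r
# Toy expression table with a duplicated gene symbol (values are made up)
expr <- data.frame(
  gene_symbol = c("TP53", "TP53", "EGFR"),
  sample_1 = c(1, 10, 5),
  sample_2 = c(3, 20, 7),
  stringsAsFactors = FALSE
)

sample_cols <- setdiff(colnames(expr), "gene_symbol")

# Mean expression per row, used to choose among duplicated symbols
expr$mean_expr <- rowMeans(expr[, sample_cols, drop = FALSE])

# Put the highest-mean row first within each symbol, then keep the first row
expr <- expr[order(expr$gene_symbol, -expr$mean_expr), ]
collapsed <- expr[!duplicated(expr$gene_symbol), ]

# Unique gene symbols become the rownames of a numeric matrix
rownames(collapsed) <- collapsed$gene_symbol
collapsed <- as.matrix(collapsed[, sample_cols, drop = FALSE])

print(collapsed)
```

The same steps would apply to both the Counts and TPM tables, although the choice of which duplicate to keep should be made once (for example on TPM) and reused so the two matrices stay aligned.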
Some issues that we should look into to ensure PNOC008 sample information is correctly filled in the data assembly histologies file:
# columns relevant to generalized patient reporting
hist_cols_used <- c("Kids_First_Biospecimen_ID","sample_id","cohort","cohort_participant_id",
"parent_aliquot_id","aliquot_id",
"experimental_strategy","RNA_library",
"sample_type","tumor_descriptor",
"pathology_diagnosis","integrated_diagnosis",
"short_histology","broad_histology","molecular_subtype",
"gtex_group","cancer_group",
"OS_days","OS_status")
# read most recent data assembly and keep only columns relevant to generalized patient reporting
data_assembly <- read.delim('~/Downloads/histologies-add-bixops-uptoPNOC008-52.tsv')
data_assembly <- data_assembly %>%
filter(cohort %in% c("PNOC", "CBTN")) %>%
mutate(parent_aliquot_id = NA,
gtex_group = NA,
cancer_group = NA) %>%
dplyr::select(dplyr::all_of(hist_cols_used)) # 3687 rows
# read PNOC008 clinical manifest from KF data tracker
pnoc008_clinical <- readxl::read_xlsx('~/Projects/OMPARE/data/manifest/pnoc008_manifest.xlsx')
# integrated_diagnosis, short_histology, broad_histology, molecular_subtype missing entirely
data_assembly %>%
dplyr::filter(cohort_participant_id %in% pnoc008_clinical$`Research ID`) %>%
dplyr::select(integrated_diagnosis, short_histology, broad_histology, molecular_subtype) %>%
unique()
integrated_diagnosis short_histology broad_histology molecular_subtype
1 NA <NA> <NA> <NA>
# combine PNOC008 clinical with data assembly on common cohort participant identifiers
pnoc008_clinical %>%
dplyr::select(`Research ID`, `Last Known Status`, `Age at Diagnosis (in days)`, `Age at Collection (in days)`, `Age at Age At Last Known Status (if deceased, this is days to death)`) %>%
inner_join(data_assembly %>%
dplyr::select(cohort_participant_id, OS_status, OS_days) %>%
unique(), by = c("Research ID" = "cohort_participant_id"))
The first five columns are from the KF clinical file and the last two are from the data assembly file:
Research ID | Last Known Status | Age at Diagnosis (in days) | Age at Collection (in days) | Age at Age At Last Known Status (if deceased, this is days to death) | OS_status | OS_days |
---|---|---|---|---|---|---|
C3064299 | Living | 4065 | 4067 | NA | NA | NA |
C3064422 | Deceased | 2758 | 2758 | 3074 | NA | NA |
C3070818 | Living | 737 | 737 | NA | NA | NA |
C3070941 | Living | 4143 | 4143 | NA | NA | NA |
C3071064 | Living | 6784 | 6784 | NA | NA | NA |
C3071310 | Living | 3922 | 3922 | NA | NA | NA |
C3071433 | Living | NA | NA | NA | NA | NA |
C3077337 | Living | 5151 | 5151 | NA | NA | NA |
C3077460 | Living | 2288 | 2288 | NA | NA | NA |
C3077583 | Living | 4049 | 4049 | NA | NA | NA |
C3077706 | Living | 5233 | 5233 | NA | NA | NA |
C3077829 | Living | 7309 | NA | NA | NA | NA |
C3078075 | Living | 3480 | 3480 | NA | NA | NA |
C3077952 | Living | 2801 | 2801 | NA | NA | NA |
C3078198 | Living | 3147 | 3147 | NA | NA | NA |
C3183978 | Living | NA | 2435 | NA | NA | NA |
C3172416 | Living | NA | 1552 | NA | NA | NA |
C3172539 | Living | NA | 5688 | NA | NA | NA |
C3172662 | Living | NA | 4867 | NA | NA | NA |
C3172785 | Living | NA | 6904 | NA | NA | NA |
C3172908 | Living | NA | 3072 | NA | NA | NA |
C3173031 | Living | NA | 5080 | NA | NA | NA |
C3173154 | Living | NA | 3863 | NA | NA | NA |
C3173277 | Living | NA | 5353 | NA | NA | NA |
C3173400 | Living | NA | 1886 | NA | NA | NA |
C3173523 | Living | NA | 1825 | NA | NA | NA |
C3173646 | Living | NA | 4532 | NA | NA | NA |
C3173769 | Living | NA | 2038 | NA | NA | NA |
C3505500 | Living | NA | 4715 | NA | NA | NA |
C3505623 | Living | NA | 1065 | NA | NA | NA |
C3505746 | Living | NA | 3376 | NA | NA | NA |
C3505869 | Living | NA | 6783 | NA | NA | NA |
C3505992 | Living | NA | 1991 | NA | NA | NA |
C3506115 | Living | NA | 5569 | NA | NA | NA |
C3506238 | Living | NA | 1765 | NA | NA | NA |
C3506361 | Living | NA | 1126 | NA | NA | NA |
C3506484 | Living | NA | 2952 | NA | NA | NA |
C3506607 | Living | NA | 6026 | NA | NA | NA |
C3506730 | Living | NA | 4687 | NA | NA | NA |
C3506853 | Living | NA | 5874 | NA | NA | NA |
C3506976 | Living | NA | 4352 | NA | NA | NA |
C3507099 | Living | NA | 4015 | NA | NA | NA |
C3507222 | Living | NA | 4991 | NA | NA | NA |
C3507345 | Living | NA | 6300 | NA | NA | NA |
C3507468 | Living | NA | 6817 | NA | NA | NA |
C3507714 | Living | NA | 2069 | NA | NA | NA |
C3507591 | Living | NA | 4869 | NA | NA | NA |
C3507837 | Living | NA | 4595 | NA | NA | NA |
C3507960 | Living | NA | 4169 | NA | NA | NA |
C3508083 | Living | NA | 3165 | NA | NA | NA |
cc: @aadamk
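As a rough illustration of how `OS_status` and `OS_days` might be filled from the clinical manifest, here is a base R sketch using two rows from the table above. It assumes `OS_days` is the age at last known status minus the age at diagnosis, which matches the `BS_007JTNB8` row shown earlier (2559 - 1872 = 687), and that `OS_status` is the uppercased last known status (only `LIVING` is actually seen in the histology file). The shortened column names are hypothetical.

```r
# Two rows taken from the manifest table, with shortened (hypothetical) column names
clinical <- data.frame(
  research_id = c("C3064299", "C3064422"),
  last_known_status = c("Living", "Deceased"),
  age_at_diagnosis_days = c(4065, 2758),
  age_at_last_known_status_days = c(NA, 3074),
  stringsAsFactors = FALSE
)

# Assumed mapping: uppercase the manifest status to match values like "LIVING"
clinical$OS_status <- toupper(clinical$last_known_status)

# Assumed definition: days from diagnosis to last known status
clinical$OS_days <- clinical$age_at_last_known_status_days -
  clinical$age_at_diagnosis_days

print(clinical[, c("research_id", "OS_status", "OS_days")])
```

Under these assumptions `OS_days` stays `NA` for living patients whose age at last known status is not recorded in the manifest, so that field would still need to be sourced elsewhere.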
Currently, the generalized patient reporting code cannot be automated because, for each new patient report, I have to manually check and map all missing information between the various file sources, i.e. the KF clinical data (new patients are not in OT), Open Targets, and the data assembly histology file.
Here are the differences in some columns between the OT histology file and the data assembly file (to be discussed in toolkit as well):
# minimal set of columns to be used for pediatric samples within the generalized reporting code
hist_cols_used <- c("Kids_First_Biospecimen_ID","sample_id","cohort","cohort_participant_id",
"parent_aliquot_id","aliquot_id",
"experimental_strategy","RNA_library",
"sample_type","tumor_descriptor",
"pathology_diagnosis","integrated_diagnosis",
"short_histology","broad_histology","molecular_subtype",
"OS_days","OS_status")
# read data assembly histology and subset to minimal columns
# add parent_aliquot_id as this field is not present in the data assembly file
data_assembly <- read.delim('~/Downloads/histologies-add-bixops-uptoPNOC008-52.tsv')
data_assembly <- data_assembly %>%
filter(cohort %in% c("PNOC", "CBTN")) %>%
mutate(parent_aliquot_id = NA) %>%
dplyr::select(dplyr::all_of(hist_cols_used)) # 3687
# read open targets and subset to minimal columns
open_targets <- read.delim('~/Projects/PediatricOpenTargets/OpenPedCan-analysis/data/histologies.tsv')
open_targets <- open_targets %>%
dplyr::filter(cohort == "PBTA") %>%
dplyr::select(dplyr::all_of(hist_cols_used)) # 2984
# subset both files on common samples
common_samples <- intersect(data_assembly$Kids_First_Biospecimen_ID, open_targets$Kids_First_Biospecimen_ID)
data_assembly <- data_assembly %>%
filter(Kids_First_Biospecimen_ID %in% common_samples) %>%
arrange(Kids_First_Biospecimen_ID)
open_targets <- open_targets %>%
filter(Kids_First_Biospecimen_ID %in% common_samples) %>%
arrange(Kids_First_Biospecimen_ID)
# replace NA with blank so we can compare the columns in both histology files
data_assembly[is.na(data_assembly)] <- ""
open_targets[is.na(open_targets)] <- ""
# check differences in the minimal set of columns only
for(i in seq_along(hist_cols_used)){
col_to_check <- hist_cols_used[i]
diff_rows <- open_targets[which(open_targets[,col_to_check] != data_assembly[,col_to_check]),] %>%
nrow()
print(paste(col_to_check, diff_rows, sep = ": "))
}
[1] "Kids_First_Biospecimen_ID: 0"
[1] "sample_id: 0"
[1] "cohort: 2984"
[1] "cohort_participant_id: 0"
[1] "parent_aliquot_id: 2840"
[1] "aliquot_id: 0"
[1] "experimental_strategy: 0"
[1] "RNA_library: 0"
[1] "sample_type: 0"
[1] "tumor_descriptor: 94"
[1] "pathology_diagnosis: 80"
[1] "integrated_diagnosis: 1364"
[1] "short_histology: 2111"
[1] "broad_histology: 2111"
[1] "molecular_subtype: 1615"
[1] "OS_days: 472"
[1] "OS_status: 0"
So, in short:
`parent_aliquot_id` is absent in data assembly.
The following columns differ between the two files: `cohort` (OT uses `PBTA` and data assembly uses `PNOC/CBTN`), `tumor_descriptor`, `pathology_diagnosis`, `integrated_diagnosis`, `short_histology`, `broad_histology`, `molecular_subtype`, `OS_days`.
cc: @aadamk
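The column comparison above depends on both tables being sorted identically with exactly one row per biospecimen. A variant that joins on `Kids_First_Biospecimen_ID` first does not depend on row order. The following is a base R sketch on made-up rows; the `.da` and `.ot` suffixes and the toy values are assumptions for illustration only.

```r
# Toy versions of the two histology tables, deliberately in different row order
data_assembly <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3"),
  cohort = c("CBTN", "CBTN", "PNOC"),
  short_histology = c("HGAT", NA, "LGAT"),
  stringsAsFactors = FALSE
)
open_targets <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_3", "BS_1", "BS_2"),
  cohort = c("PBTA", "PBTA", "PBTA"),
  short_histology = c("LGAT", "HGAT", "Ependymoma"),
  stringsAsFactors = FALSE
)

# Inner join on the biospecimen ID; shared columns get .da / .ot suffixes
merged <- merge(data_assembly, open_targets,
                by = "Kids_First_Biospecimen_ID",
                suffixes = c(".da", ".ot"))

cols_to_check <- c("cohort", "short_histology")

# Count rows where the two sources disagree; two NAs count as a match,
# NA against a value counts as a difference
n_diff <- sapply(cols_to_check, function(col) {
  da <- merged[[paste0(col, ".da")]]
  ot <- merged[[paste0(col, ".ot")]]
  sum(!(is.na(da) & is.na(ot)) & (is.na(da) | is.na(ot) | da != ot))
})

print(n_diff)
```

Handling `NA` explicitly avoids converting the whole table to character with blank strings, so numeric columns such as `OS_days` keep their type.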
To be discussed in the next toolkit meeting.
Using the latest merged TPM/Count expression matrices from the data assembly project:
The `gene_id` column is duplicated
# TPM matrix
tpm <- readRDS('~/Downloads/gene-expression-rsem-tpm.BS_MADCWWMX.rds')
grep('gene_id', colnames(tpm))
[1] 1 2527
This is also the case with the counts matrix.
cc: @aadamk
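Until the merged matrices are regenerated, the extra `gene_id` column could be dropped on the reporting side. The following is a base R sketch on a made-up table. It assumes the duplicated columns hold identical values, which the code checks before dropping anything.

```r
# Toy matrix with gene_id appearing twice (values are made up)
tpm <- data.frame(
  c("ENSG_A", "ENSG_B"),
  c(1.5, 0),
  c(2, 3),
  c("ENSG_A", "ENSG_B"),
  stringsAsFactors = FALSE
)
colnames(tpm) <- c("gene_id", "BS_1", "BS_2", "gene_id")

# Positions of every column named gene_id
gene_id_idx <- which(colnames(tpm) == "gene_id")

# Safety check: every extra copy must match the first gene_id column
same_as_first <- vapply(
  gene_id_idx,
  function(i) identical(tpm[[i]], tpm[[gene_id_idx[1]]]),
  logical(1)
)
stopifnot(all(same_as_first))

# Keep only the first occurrence of each column name
tpm_clean <- tpm[, !duplicated(colnames(tpm)), drop = FALSE]

print(colnames(tpm_clean))
```

If the check fails, the two `gene_id` columns disagree and the matrix should be inspected rather than silently deduplicated.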
Comparing the primary sample information for Patient 43 and 35 from v11 to the corresponding progression samples in the new histology file histologies_v11-base-add-BS_AG4BP2PM.tsv, I think some of the fields have missing information. Is this because there is no information available?
`pathology_diagnosis` should be `High-grade glioma/astrocytoma (WHO grade III/IV)` (?). This is only for the +1 patient.
`broad_histology` should be `Diffuse astrocytic and oligodendroglial tumor` (?). This is only for the +1 patient.
`short_histology` should be `HGAT` (?). This is only for the +1 patient.
`RNA_library` seems to have moved to `PFS_days` for `BS_A7Y1Y314` (`C3506976`). This is from histologies_v11-base-add-BS_AG4BP2PM.tsv
`molecular_subtype` seems to be missing entirely from histologies_v11-base-add-BS_AG4BP2PM.tsv and histologies_v11-base-add-BS_HSXARQ1K.tsv. I think it was present in v11 histologies.tsv (but not in histologies-base.tsv). So they are NA across all samples.
`BS_J4E9SW51` and `BS_H1XPVS9A` (both correspond to `C334437`) are annotated as `LGAT` in v10/v11 but `HGAT` in data assembly histology files:
# OpenPedCan v10 histology
v10_histology <- read_tsv('data/OpenPedCan-analysis/data/v10/histologies.tsv')
v10_histology %>%
filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
pull(short_histology)
[1] "Low-grade astrocytic tumor" "Low-grade astrocytic tumor"
# OpenPedCan v11 histology
v11_histology <- read_tsv('data/OpenPedCan-analysis/data/v11/histologies.tsv')
v11_histology %>%
filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
pull(short_histology)
[1] "LGAT" "LGAT"
# data assembly file generated with BS_HSXARQ1K
dat <- read_tsv('histologies_v11-base-add-BS_HSXARQ1K.tsv')
dat %>%
filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
pull(short_histology)
[1] "HGAT" "HGAT"
# data assembly file with BS_AG4BP2PM
dat <- read_tsv('histologies_v11-base-add-BS_AG4BP2PM.tsv')
dat %>%
filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
pull(short_histology)
[1] "HGAT" "HGAT"
I manually filled these in for the report generation, so I don't need them. I just wanted to make a note of it.
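A shifted value like the `RNA_library` entry that ended up in `PFS_days` could be caught with a simple type check on columns that are expected to be numeric. The following is a base R sketch; the biospecimen IDs and the `stranded` value are made up for illustration, and the list of numeric columns is an assumption.

```r
# Toy histology rows read as character, as they would be after a column shift
hist <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_X1", "BS_X2"),
  RNA_library = c("stranded", NA),
  PFS_days = c("120", "stranded"),
  stringsAsFactors = FALSE
)

# Columns assumed to hold only numbers or NA
numeric_cols <- c("PFS_days")

# For each such column, list the IDs whose value is present but not numeric
shifted <- lapply(numeric_cols, function(col) {
  values <- hist[[col]]
  bad <- !is.na(values) & is.na(suppressWarnings(as.numeric(values)))
  hist$Kids_First_Biospecimen_ID[bad]
})
names(shifted) <- numeric_cols

print(shifted)
```

Reading the file with all columns as character (for example `colClasses = "character"` in `read.delim`) keeps the offending strings visible instead of letting the reader coerce the whole column.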
cc @aadamk
For OT, we will need each data release to include processed GTEx v8 data in `gene-expression-rsem-tpm-collapsed.rds` and `gene-counts-rsem-expected_count-collapsed.rds`.
In the last data release I did:
gtex-gene-expression-rsem-tpm-collapsed.rds
pbta-gmkf-gene-expression-rsem-tpm-collapsed.rds
gene-expression-rsem-tpm-collapsed.rds
Update the collapse-rnaseq to be able to handle adding GTEx processed files.
CC @zhangb1 @yuankunzhu @jharenza for discussion.
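For discussion, here is one way the GTEx and PBTA/GMKF collapsed matrices might be combined into a single matrix. This base R sketch uses made-up values and assumes both inputs already have unique gene symbols as rownames. Restricting to the genes shared by both matrices is an assumption, not a description of how the last release was built.

```r
# Toy collapsed matrices with gene symbols as rownames (values are made up)
gtex <- matrix(
  c(1, 2, 3, 4), nrow = 2,
  dimnames = list(c("TP53", "EGFR"), c("GTEX_1", "GTEX_2"))
)
pbta <- matrix(
  c(5, 6, 7, 8, 9, 10), nrow = 3,
  dimnames = list(c("EGFR", "TP53", "MYCN"), c("BS_1", "BS_2"))
)

# Genes present in both matrices
common_genes <- intersect(rownames(gtex), rownames(pbta))

# Align rows by gene symbol, then bind the sample columns side by side
combined <- cbind(
  pbta[common_genes, , drop = FALSE],
  gtex[common_genes, , drop = FALSE]
)

print(combined)
```

Indexing both matrices by `common_genes` before `cbind` is what guarantees the rows line up, since `cbind` itself binds by position and not by rowname.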