d3b-center / d3b-cds-manifest-prep
scripts to prep manifests for cds
License: Apache License 2.0
Version 0.5
156 rows in the File file have null controlled_access
try to incorporate bailey's testing and reporting harness into qc
No response
No response
No response
This is so that when the file is unzipped, it is clear which version the data are.
PT_95S99RWE is not in the diagnosis manifest
PT_95S99RWE should have a diagnosis
0.13.0
V0.50
There were no null values in these 2 columns in previous releases.
These values are not in the approved enum list. This test previously passed.
platform | count |
---|---|
"hiseq" | 98 |
"dnbseq" | 42 |
"not reported" | 6860 |
"hiseq x" | 301 |
V0.50
There are 91 subjects in participant but not in diagnosis
No response
See the model google sheet here.
Diagnoses should be at the sample/ event level (7316 number).
Diagnosis IDs should take the form `dg_[7316-1234]_[x]`, where the id starts with `DG`, then the 7316 number, then the diagnosis number within that event when events have multiple diagnoses attached to them.
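A minimal sketch of how IDs of this shape could be generated, assuming the event (7316) number and the diagnoses attached to it are already known. The helper name `make_diagnosis_ids`, the uppercase prefix, the single-underscore separator, and the zero-based counter are assumptions for illustration; the model sheet would settle the exact convention.

```python
def make_diagnosis_ids(event_id, diagnoses, prefix="DG"):
    """Return one ID per diagnosis attached to a single event.

    Hypothetical helper: prefix, separator and zero-based index are
    assumptions, not the repository's confirmed convention.
    """
    return [f"{prefix}_{event_id}_{i}" for i, _ in enumerate(diagnoses)]


# Illustrative input: one event with two diagnoses attached.
ids = make_diagnosis_ids(
    "7316-1234",
    ["Craniopharyngioma", "Low-grade glioma/astrocytoma (WHO grade I/II)"],
)
print(ids)
```

Numbering within the event keeps IDs unique even when several diagnoses share one 7316 number.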
0.12.0
Investigate why diagnoses have the text Not Reported
No diagnosis should be `Not Reported`.
0.9.0
germline samples are in the diagnosis sample map
germline samples shouldn't be in the diagnosis sample map. participants that have only germline samples will have a diagnosis in the diagnosis file but will not have a diagnosis in the diagnosis-sample map file.
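The expected behavior above could be sketched as a simple filter over the diagnosis-sample map. How germline samples are identified is not stated here, so this example assumes a list of germline sample IDs is supplied; the function name and the sample IDs are made up for illustration.

```python
def drop_germline(mapping_rows, germline_sample_ids):
    """Keep only diagnosis-sample rows whose sample is not germline."""
    germline = set(germline_sample_ids)
    return [row for row in mapping_rows if row["sample_id"] not in germline]


# Made-up rows: one tumor sample and one germline sample.
mapping = [
    {"diagnosis_id": "DG_A_0", "sample_id": "BS_TUMOR1"},
    {"diagnosis_id": "DG_A_0", "sample_id": "BS_GERM1"},
]
filtered = drop_germline(mapping, ["BS_GERM1"])
print(filtered)
```

Because only the map is filtered, a participant with germline-only samples would still keep a row in the diagnosis file, as described above.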
0.13.0
23 diagnosis ids are not unique
all ids need to be unique
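A uniqueness check like the one that would surface these duplicates can be sketched as follows. The function name and the example IDs are hypothetical; the actual QC harness may do this differently.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the IDs that occur more than once, sorted for stable output."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Made-up IDs for illustration.
dupes = duplicated_ids(["DG_1", "DG_2", "DG_1", "DG_3", "DG_2"])
print(dupes)
```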
0.8.1
There are 3 values for platform, strategy and library that are not in the allowed enum
all values for these columns should be enums
0.8.1
v0.50 ... File attached
-- count the library_ids that appear more than once in genomic_info
select count(*)
from (select library_id, count(*)
      from genomic_info
      group by library_id
      having count(*) > 1) as _
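To try the query without access to the real database, a small in-memory SQLite table can stand in for `genomic_info`. This is only a sketch: the two-column schema and the rows are invented, and the production database may not be SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; the real genomic_info table has more columns.
conn.execute("create table genomic_info (library_id text, file_id text)")
conn.executemany(
    "insert into genomic_info values (?, ?)",
    [("LIB_1", "GF_A"), ("LIB_1", "GF_B"), ("LIB_2", "GF_C")],
)

query = """
select count(*)
from (select library_id, count(*)
      from genomic_info
      group by library_id
      having count(*) > 1) as _
"""
# The outer count is the number of library_ids that are repeated.
(n_repeated,) = conn.execute(query).fetchone()
conn.close()
print(n_repeated)
```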
This version of CDS will have data from a few sources:
Note: the below list will be updated periodically with links to the related manifests.
To establish item 3: these participants/samples/files are ones that are released in either CAVATICA, OpenPedCan Histologies v12, or on PedCBioPortal but not in cds v0.14.1.
Sequencing Information is incomplete for some files to be submitted in the second CDS release.
There are 8756 unique sequencing experiments associated with files being submitted.
The export from the dataservice with information about each of these experiments is here.
AB Capillary
ABI Solid
BGISEQ
Complete Genomics
Helicos
Illumina
Ion Torrent
LS 454
Oxford Nanopore
PacBio SMRT
platform | count |
---|---|
Illumina | 8503 |
Not Reported | 242 |
Other | 11 |
The issue is with the last two platforms. We need to decide what platform these experiments were performed on.
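The platform counts can be checked against the platform list mechanically. The sketch below assumes the ten platform names listed above are the accepted values; the variable names are illustrative.

```python
# Assumption: these ten names, taken from the list above, are the accepted enum.
ALLOWED_PLATFORMS = {
    "AB Capillary", "ABI Solid", "BGISEQ", "Complete Genomics", "Helicos",
    "Illumina", "Ion Torrent", "LS 454", "Oxford Nanopore", "PacBio SMRT",
}

# Counts copied from the platform table above.
platform_counts = {"Illumina": 8503, "Not Reported": 242, "Other": 11}

# Platforms that still need a decision, with their experiment counts.
unresolved = {
    p: n for p, n in platform_counts.items() if p not in ALLOWED_PLATFORMS
}
total_unresolved = sum(unresolved.values())
print(unresolved, total_unresolved)
```

The three counts sum to the 8756 sequencing experiments mentioned earlier, so the experiments outside the list are the ones left to resolve.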
The 11 experiments where platform is `other` are all rna-seq samples, where the instrument model is `DNBSeq`, that were sequenced at `BGI`.
@chris-s-friedman to get the platform for the above from bix
For the 242, their composition of strategy, instrument model, and sequencing center is below. Note that none of these experiments have a value for instrument model.
library_strategy | instrument_model | sequencing_center_id | sequencing center name | count |
---|---|---|---|---|
RNA-Seq | Not Reported | SC_2ZBAMKK0 | Novogene | 81 |
WGS | Not Reported | SC_2ZBAMKK0 | Novogene | 131 |
WGS | Not Reported | SC_FAD4KCQG | BGI | 15 |
WGS | Not Reported | SC_N1EVHSME | NantOmics | 10 |
WGS | Not Reported | SC_WWEQ9HFY | BGI@CHOP Genome Center | 5 |
@chris-s-friedman to look through past files to get previously investigated platform
Instrument Model | Count |
---|---|
Not Reported | 5838 |
HiSeq | 1809 |
HiSeq X | 1007 |
Novaseq 6000 | 91 |
DNBSeq | 11 |
None of these instrument models are accepted values in their data model
Neither `HiSeq` nor `HiSeq X` is an accepted value, but they do have values for `HiSeq X Five` and `HiSeq X Ten`.
There is no Novaseq instrument model in their enumerated values.
There is no DNBSeq instrument model in their enumerated values.
@baileyckelly to ask ccdi if these values above are acceptable
Of the `Not Reported` instrument models:
`SD_8C478S85`, `High Incidence of Pediatric CNS Tumors`, `D3B-PCNST`.
Items 1 and 2 will need some further investigation.
Items 3, 4, and 5 are all from the cbtn x01 and should all have similar instrument models.
For RNA-Seq samples, this is missing for all pre-x01 data
For WGS, WXS, and Targeted Capture, this is missing for pre-x01 data and x01 data
From the metadata template:
For sequencing files, please try to provide all metadata, if applicable, for the following properties: avg_read_length, number_of_reads, number_of_bp, coverage
missing for 3192 experiments. All pre x01
missing for 3192 experiments. All pre x01
Missing for all experiments
missing for all experiments
No response
None
One sample (`BS_0J5MCBZV`) and three files (`GF_8A1T39FW`, `GF_H6Z3Q10Y`, `GF_QN2WX9M5`) are missing from the genomic_info manifest.
These items should be in the genomic_info manifest
0.8.0
CDS found that some samples had multiple diagnoses associated with them in the diagnosis-sample mapping. They expect a sample to only have one diagnosis.
For example:
in the diagnosis-sample mapping table:
diagnosis_id | sample_id |
---|---|
DG__BS_1GFP3T8N__0 | BS_1GFP3T8N |
DG__BS_1GFP3T8N__1 | BS_1GFP3T8N |
DG__BS_1GFP3T8N__2 | BS_1GFP3T8N |
and in the diagnosis table
diagnosis_id | primary_diagnosis | participant_id |
---|---|---|
DG__BS_1GFP3T8N__0 | Craniopharyngioma | PT_P1F0AHMT |
DG__BS_1GFP3T8N__1 | High-grade glioma/astrocytoma (WHO grade III/IV) | PT_P1F0AHMT |
DG__BS_1GFP3T8N__2 | Low-grade glioma/astrocytoma (WHO grade I/II) | PT_P1F0AHMT |
Is this expected and true?
from the cds team:
If this is expected (e.g. because of heterogeneity in the tumor), would it be possible to modify the sample_ids so that there could be 1:1 mapping of sample ID : diagnosis ID?
If the participant_age_at_collection and the anatomic_site of a set of samples was the same, but they each had a unique diagnosis, could a secondary user infer that the tumor was heterogeneous for tumor grade or type?
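The one-to-many mapping described in this issue can be detected by grouping the diagnosis-sample map by sample. This is a minimal sketch using the rows from the example table; the function name is hypothetical.

```python
from collections import defaultdict


def samples_with_multiple_diagnoses(mapping_rows):
    """Map each sample_id to its diagnosis_ids when it has more than one."""
    by_sample = defaultdict(set)
    for row in mapping_rows:
        by_sample[row["sample_id"]].add(row["diagnosis_id"])
    return {s: sorted(d) for s, d in by_sample.items() if len(d) > 1}


# Rows taken from the diagnosis-sample mapping example above.
mapping = [
    {"diagnosis_id": "DG__BS_1GFP3T8N__0", "sample_id": "BS_1GFP3T8N"},
    {"diagnosis_id": "DG__BS_1GFP3T8N__1", "sample_id": "BS_1GFP3T8N"},
    {"diagnosis_id": "DG__BS_1GFP3T8N__2", "sample_id": "BS_1GFP3T8N"},
]
flagged = samples_with_multiple_diagnoses(mapping)
print(flagged)
```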
0.9.0
1 sample that is in genomic info that is not in sample
all samples in genomic info should be in sample
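Cross-manifest checks of this kind reduce to a set difference between the IDs a manifest references and the IDs its parent manifest defines. A sketch with made-up sample IDs and a hypothetical function name:

```python
def missing_references(child_ids, parent_ids):
    """Return IDs referenced by the child manifest but absent from the parent."""
    return sorted(set(child_ids) - set(parent_ids))


# Made-up IDs: genomic_info references BS_B, which the sample manifest lacks.
genomic_info_samples = ["BS_A", "BS_B", "BS_C"]
sample_manifest = ["BS_A", "BS_C"]
orphans = missing_references(genomic_info_samples, sample_manifest)
print(orphans)
```

The same function would apply to files in genomic_info versus the file manifest, or participants versus the diagnosis manifest.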
0.8.1
Is your feature request related to a problem? Please describe.
The output order of entities in the different output manifests is not controlled. Occasionally this order can change between versions without underlying changes to the data. This causes unexpected diffs when updating manifests in GitHub. Ordered IDs would make it easier to understand diffs between versions.
Describe the solution you'd like
Order the output manifests by key ID
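One way the requested ordering could look, assuming manifests are written as CSV from a list of row dictionaries. The function name, column names, and rows are illustrative, not taken from the repository.

```python
import csv
import io


def write_sorted(rows, key, fieldnames):
    """Serialize rows as CSV, ordered by the key ID column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(sorted(rows, key=lambda r: r[key]))
    return buffer.getvalue()


# Made-up rows arriving in arbitrary order.
rows = [
    {"sample_id": "BS_B", "participant_id": "PT_2"},
    {"sample_id": "BS_A", "participant_id": "PT_1"},
]
text = write_sorted(rows, "sample_id", ["sample_id", "participant_id"])
print(text)
```

Sorting on the key ID makes the output independent of input order, so a diff between versions shows only real data changes.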
one file in genomic_info not in file
all files in genomic_info should be in file
0.8.1