File: 156 files with null controlled_access value

Version 0.5
156 rows in the File manifest have a null controlled_access value.

Feature Request: try to incorporate bailey's testing harness

Is your feature request related to a problem? Please describe

Try to incorporate Bailey's testing and reporting harness into QC.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional Context

No response

Directory within the zip file should have version ID in it

This is so that when the file is unzipped, it is clear which version the data are.
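A minimal sketch of how a package could be zipped under a versioned top-level directory, so the version is visible after unzipping. The directory name pattern `cds_submission_v<version>` and both function names are assumptions made for illustration, not the repository's actual naming.

```python
import io
import zipfile


def build_versioned_zip(files: dict[str, bytes], version: str) -> bytes:
    """Zip the given files under one top-level directory named after the version."""
    # Hypothetical naming scheme; the real package may use a different prefix.
    top_dir = f"cds_submission_v{version}"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sort names so the archive layout is stable between runs.
        for name, content in sorted(files.items()):
            archive.writestr(f"{top_dir}/{name}", content)
    return buffer.getvalue()


def list_members(zip_bytes: bytes) -> list[str]:
    """Return the member paths stored in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()
```

Unzipping such an archive produces a single folder whose name carries the version ID, which is the behavior this request asks for.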

Submission Package bug - PT_95S99RWE not in diagnoses manifest

Describe the bug

PT_95S99RWE is not in the diagnosis manifest

Expected behavior

PT_95S99RWE should have a diagnosis

Version ID

0.13.0

Affected file(s)

Genomic Info: 368 files missing bases & number of reads

V0.50

There were no null values in these 2 columns in previous releases.

Genomic Info: 7301 files with an instrument model not in the enum list

These values are not in the approved enum list. This test previously passed.

platform | count
"hiseq" | 98
"dnbseq" | 42
"not reported" | 6860
"hiseq x" | 301

Participant/DX: 91 participants missing diagnosis

V0.50
There are 91 subjects in participant but not in diagnosis

Submission Package bug - Diagnoses should be at the event level, not the aliquot level

Describe the bug

No response

Expected behavior

See the model google sheet here.

Diagnoses should be at the sample/event level (7316 number).

Diagnosis IDs should take the form dg_[7316-1234]_[x], where the id starts with DG, then the 7316 number, then the diagnosis number within that event when events have multiple diagnoses attached to them.
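The ID format described above could be generated along these lines. This is only a sketch: the uppercase `DG` prefix, the single-underscore separators, and the zero-based diagnosis index are assumptions, and `build_diagnosis_ids` is a hypothetical helper name.

```python
def build_diagnosis_ids(event_id: str, n_diagnoses: int) -> list[str]:
    """Build one diagnosis ID per diagnosis attached to an event.

    Assumed format: DG_<event id>_<index>, with a zero-based index.
    """
    if n_diagnoses < 1:
        raise ValueError("an event needs at least one diagnosis")
    return [f"DG_{event_id}_{index}" for index in range(n_diagnoses)]
```

For an event with a single diagnosis this yields one ID ending in `_0`; events with several diagnoses get consecutive indices.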

Version ID

0.12.0

Affected file(s)

Submission Package bug - Diagnoses Not Reported

Describe the bug

Investigate why diagnoses have the text Not Reported

Expected behavior

No diagnosis should be Not Reported.

Version ID

0.9.0

Affected file(s)

Submission Package bug - germline samples are in the diagnosis-sample map

Describe the bug

germline samples are in the diagnosis sample map

Expected behavior

Germline samples shouldn't be in the diagnosis-sample map. Participants that have only germline samples will have a diagnosis in the diagnosis file but will not have a diagnosis in the diagnosis-sample map file.

Version ID

0.13.0

Affected file(s)

Submission Package bug -

Describe the bug

23 diagnosis ids are not unique

Expected behavior

all ids need to be unique

Version ID

0.8.1

Affected file(s)

Submission Package bug - genomic_info values not in enums

Describe the bug

There are 3 values for platform, strategy and library that are not in the allowed enum

Expected behavior

all values for these columns should be enums

Version ID

0.8.1

Affected file(s)

Genomic Info: 1064 non unique library ids in genomic info

v0.50 ... File attached

-- count the library_id values that appear on more than one genomic_info row
select count(*)
from (select library_id, count(*)
      from genomic_info
      group by library_id
      having count(*) > 1) as dup
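To try the duplicate-count query on a small data set, it can be run against an in-memory SQLite database. This is a sketch under assumptions: the project database is probably not SQLite, the two-column `genomic_info` table is a cut-down stand-in, and `count_duplicated_library_ids` is a hypothetical helper.

```python
import sqlite3

# Same logic as the query above: count library_ids that occur more than once.
DUPLICATE_LIBRARY_QUERY = """
select count(*)
from (select library_id, count(*)
      from genomic_info
      group by library_id
      having count(*) > 1) as dup
"""


def count_duplicated_library_ids(rows: list[tuple[str, str]]) -> int:
    """Load (file_id, library_id) rows into SQLite and run the duplicate check."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table genomic_info (file_id text, library_id text)")
        conn.executemany("insert into genomic_info values (?, ?)", rows)
        return conn.execute(DUPLICATE_LIBRARY_QUERY).fetchone()[0]
    finally:
        conn.close()
```

The function returns the number of distinct library IDs that are shared by more than one row, which is the figure reported in the issue title.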

Feature Request: CDS v1.x.x

CDS v1.x.x

This version of CDS will have data from a few sources:

Note: the below list will be updated periodically with links to the related manifests.

1. CBTN X01 source and Harmonized data
   a. source data file-sample-participant mapping: https://data-tracker.kidsfirstdrc.org/study/SD_BHJXBDQK/documents/SF_Z4B1Q5XE
   b. harmonized data post-harmonization manifest: https://data-tracker.kidsfirstdrc.org/study/SD_BHJXBDQK/documents/SF_R8XTMZAN
2. CBTN Pre-X01 DNA and RNA files that went through new gencode
3. CBTN Pre-X01 data that was not included in v0.14.1 that can be identified as coming from a particular participant and sample
4. PNOC008 samples collected and analyzed after the file-sample-participant manifest for v0.14.1 was closed.

To establish item 3: these participants/samples/files are ones that are released in either CAVATICA, OpenPedCan Histologies v12, or on PedCBioPortal but not in cds v0.14.1.
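The rule for item 3 amounts to a set difference: everything released in any of the three places, minus what is already in CDS v0.14.1. A minimal sketch, assuming each source can be reduced to a set of IDs; the function name and the way the sets are obtained are hypothetical.

```python
def released_but_not_in_cds(
    released_by_source: dict[str, set[str]], cds_ids: set[str]
) -> set[str]:
    """Return IDs released in at least one source but absent from the CDS release."""
    # Union of all release sources (e.g. CAVATICA, OpenPedCan, PedCBioPortal).
    released = set().union(*released_by_source.values())
    return released - cds_ids
```

The same function can be applied separately to participant, sample, and file IDs.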

Edits

Edit 1: 2023-02-16 - add links for cbtn x01 source file-sample-participant mapping and harmonized post-harmonization data manifests

Submission Package bug - Sequencing File Information is Missing

Describe the bug

Sequencing Information is incomplete for some files to be submitted in the second CDS release.

There are 8756 unique sequencing experiments associated with files being submitted.

The export from the dataservice with information about each of these experiments is here.

Platform

Accepted values:

AB Capillary
ABI Solid
BGISEQ
Complete Genomics
Helicos
Illumina
Ion Torrent
LS 454
Oxford Nanopore
PacBio SMRT

Actual Values

platform	count
Illumina	8503
Not Reported	242
Other	11

The issue is with the last two platforms. We need to decide what platform these experiments were performed on.
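The comparison between accepted and actual platform values can be reproduced with a small check. The accepted list and the counts are copied from this issue; the function name `values_outside_enum` is an assumption for illustration.

```python
# Accepted platform values, as listed in this issue.
ACCEPTED_PLATFORMS = {
    "AB Capillary", "ABI Solid", "BGISEQ", "Complete Genomics", "Helicos",
    "Illumina", "Ion Torrent", "LS 454", "Oxford Nanopore", "PacBio SMRT",
}

# Actual platform counts, as reported in this issue.
ACTUAL_PLATFORM_COUNTS = {"Illumina": 8503, "Not Reported": 242, "Other": 11}


def values_outside_enum(counts: dict[str, int], accepted: set[str]) -> dict[str, int]:
    """Return the values (with their counts) that are not in the accepted list."""
    return {value: n for value, n in counts.items() if value not in accepted}
```

Calling `values_outside_enum(ACTUAL_PLATFORM_COUNTS, ACCEPTED_PLATFORMS)` flags the Not Reported and Other rows, and the three counts add up to the 8756 experiments in scope.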

The 11 experiments where platform is Other are all RNA-Seq samples with instrument model DNBSeq that were sequenced at BGI.

@chris-s-friedman to get the platform for the above from bix

For the 242, their composition of strategy, instrument model, and sequencing center is below. Note that none of these experiments have a value for instrument model.

library_strategy	instrument_model	sequencing_center_id	sequencing center name	count
RNA-Seq	Not Reported	SC_2ZBAMKK0	Novogene	81
WGS	Not Reported	SC_2ZBAMKK0	Novogene	131
WGS	Not Reported	SC_FAD4KCQG	BGI	15
WGS	Not Reported	SC_N1EVHSME	NantOmics	10
WGS	Not Reported	SC_WWEQ9HFY	BGI@CHOP Genome Center	5

@chris-s-friedman to look through past files to get previously investigated platform

Instrument Model

Actual Values

Instrument Model	Count
Not Reported	5838
HiSeq	1809
HiSeq X	1007
Novaseq 6000	91
DNBSeq	11

None of these instrument models are accepted values in their data model

Neither HiSeq nor HiSeq X is an accepted value, but the data model does have values for HiSeq X Five and HiSeq X Ten.

There is no Novaseq instrument model in their enumerated values.

There is no DNBSeq instrument model in their enumerated values.

@baileyckelly to ask ccdi if these values above are acceptable

Of the Not Reported instrument models:

1. 199 experiments are cbtn experiments from pre-x01
2. 76 experiments are pnoc 003/008 experiments created before February 2023
3. 5449 experiments are from cbtn x01
4. 40 experiments are pnoc 003/008 experiments created on 2/6/2023 and 2/8/2023 that look to be associated with cbtn x01
5. 74 experiments are associated with cbtn x01 under the study ID SD_8C478S85, High Incidence of Pediatric CNS Tumors, D3B-PCNST.

Items 1 and 2 will need some further investigation.

Items 3, 4, and 5 are all from the cbtn x01 and should all have similar instrument models.

Library Selection

For RNA-Seq samples, this is missing for all pre-x01 data.
For WGS, WXS, and Targeted Capture, this is missing for pre-x01 data and x01 data.

From the metadata template:

For sequencing files, please try to provide all metadata, if applicable, for the following properties: avg_read_length, number_of_reads, number_of_bp, coverage

Number of Reads

missing for 3192 experiments. All pre x01

Mean read length

missing for 3192 experiments. All pre x01

Coverage

Missing for all experiments

number of bp

missing for all experiments

Expected behavior

No response

Version ID

None

Affected file(s)

Submission Package bug - missing sample and files

Describe the bug

one sample (BS_0J5MCBZV) and three files (GF_8A1T39FW, GF_H6Z3Q10Y, GF_QN2WX9M5) are missing from the genomic_info manifest.

Expected behavior

These items should be in the genomic_info manifest

Version ID

0.8.0

Affected file(s)

Submission Package bug - diagnosis-sample mapping

Describe the bug

CDS found that some samples had multiple diagnoses associated with them in the diagnosis-sample mapping. They expect a sample to only have one diagnosis.

For example:

in the diagnosis-sample mapping table:

diagnosis_id	sample_id
DG__BS_1GFP3T8N__0	BS_1GFP3T8N
DG__BS_1GFP3T8N__1	BS_1GFP3T8N
DG__BS_1GFP3T8N__2	BS_1GFP3T8N

and in the diagnosis table

diagnosis_id	primary_diagnosis	participant_id
DG__BS_1GFP3T8N__0	Craniopharyngioma	PT_P1F0AHMT
DG__BS_1GFP3T8N__1	High-grade glioma/astrocytoma (WHO grade III/IV)	PT_P1F0AHMT
DG__BS_1GFP3T8N__2	Low-grade glioma/astrocytoma (WHO grade I/II)	PT_P1F0AHMT

Is this expected and true?
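Samples with more than one diagnosis can be found by grouping the diagnosis-sample mapping by sample. A minimal sketch using the three mapping rows from the example above; the function name is hypothetical.

```python
from collections import defaultdict

# (diagnosis_id, sample_id) rows copied from the example in this issue.
EXAMPLE_MAPPING = [
    ("DG__BS_1GFP3T8N__0", "BS_1GFP3T8N"),
    ("DG__BS_1GFP3T8N__1", "BS_1GFP3T8N"),
    ("DG__BS_1GFP3T8N__2", "BS_1GFP3T8N"),
]


def samples_with_multiple_diagnoses(
    mapping: list[tuple[str, str]],
) -> dict[str, list[str]]:
    """Return each sample that maps to more than one diagnosis ID."""
    by_sample: dict[str, set[str]] = defaultdict(set)
    for diagnosis_id, sample_id in mapping:
        by_sample[sample_id].add(diagnosis_id)
    return {
        sample_id: sorted(diagnosis_ids)
        for sample_id, diagnosis_ids in by_sample.items()
        if len(diagnosis_ids) > 1
    }
```

An empty result would mean the mapping already satisfies the one-diagnosis-per-sample expectation.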

Expected behavior

from the cds team:

If this is expected (e.g. because of heterogeneity in the tumor), would it be possible to modify the sample_ids so that there could be 1:1 mapping of sample ID : diagnosis ID?
If the participant_age_at_collection and the anatomic_site of a set of samples was the same, but they each had a unique diagnosis, could a secondary user infer that the tumor was heterogeneous for tumor grade or type?

Version ID

0.9.0

Affected file(s)

Submission Package bug - sample in genomic_info not in sample

Describe the bug

1 sample is in genomic_info but not in sample

Expected behavior

all samples in genomic info should be in sample
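This expectation can be checked by comparing the ID column of the two manifests. A sketch assuming both manifests are CSV text with a shared `sample_id` column; the helper names and the CSV layout are assumptions.

```python
import csv
import io


def column_values(csv_text: str, column: str) -> set[str]:
    """Collect the distinct values of one column from CSV text."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


def orphaned_ids(child_csv: str, parent_csv: str, column: str) -> list[str]:
    """Return IDs present in the child manifest but missing from the parent."""
    return sorted(column_values(child_csv, column) - column_values(parent_csv, column))
```

Here genomic_info would be the child manifest and sample the parent; an empty list means every sample in genomic_info is also in sample.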

Version ID

0.8.1

Affected file(s)

order IDs in the output manifests

Is your feature request related to a problem? Please describe.
The output order of entities in the different output manifests is not controlled. Occasionally this order can change between versions without underlying changes to the data. This causes unexpected diffs when updating manifests in GitHub. Ordered IDs would make it easier to understand diffs between versions.

Describe the solution you'd like
Order the output manifests by key ID
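One way to get stable output is to sort rows by the key ID column just before writing. A minimal sketch, assuming the manifests are CSV text; `sort_manifest` is a hypothetical helper, and the column used as the key would differ per manifest.

```python
import csv
import io


def sort_manifest(csv_text: str, key_column: str) -> str:
    """Return the same CSV with its rows ordered by the key ID column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda row: row[key_column])
    out = io.StringIO()
    # Keep the original column order; only the row order changes.
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

With rows always written in key order, a diff between two versions only shows rows whose content actually changed.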

Submission Package bug -

Describe the bug

one file in genomic_info not in file

Expected behavior

all files in genomic_info should be in file

Version ID

0.8.1

d3b-center / d3b-cds-manifest-prep
