icdc-model-tool's Issues

Update model to support Publications associated with Studies

Need to update the data model to support Publications associated with Studies:

  1. Add new node: publication

  2. Add node properties to capture:
    Up to 3 author names; longer lists would be given as first author et al.
    Title
    Journal
    Year
    DOI link
    PubMed ID (if known)
    etc., TBD

  3. Add one to many relationship between study and publication nodes
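
The author rule above (up to 3 names, otherwise first author et al.) could be sketched as follows. This is only an illustration; the function name `format_authors` and the comma separator are assumptions, not part of the model.

```python
def format_authors(authors):
    """Collapse an author list per the rule in this issue.

    Up to three names are kept as given; longer lists are reduced to
    the first author followed by "et al.". Hypothetical helper.
    """
    if len(authors) <= 3:
        return ", ".join(authors)
    return f"{authors[0]} et al."
```

A list of four or more names would come back as the first name plus "et al.", so the property never has to hold more than three names.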

Update data model with additional Program properties

Need to update data model with additional Program properties.

Specifically, need to add properties for curated, i.e. non-derived, program-centric metadata as follows:

  • Program description - a short (2-3 sentence) descriptive text that outlines the nature of any given Program

  • External program URL - an outward pointing URL for the web page that's deemed to be the single most informative starting point for someone wanting to learn more about any given Program

Update data model to include a Biobank "look-up" node

As part of implementing biobank-based searching, the data model needs to be updated to include a "floating" Biobank node to function as a look-up table for conversion of registration_origin values, i.e. biobank/tissue repository names in acronym form, into their corresponding full text versions, which will be displayed via mouse-over tooltips.

Specifically:

  1. Add a node called simply "biobank"
  2. Define two properties within the node, one for the biobank/tissue repository acronyms, one for the full text rendering of the acronyms
  3. At least in the immediate term, have the new node "float" within the data model, i.e. not be tethered to other nodes by any relationships

Once biobank-based searching has been enabled, the "biobank" node will more than likely be tethered to the registration node, at which time, ICDC's registration loading files will have to be updated with a pointer to the appropriate biobank record.
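
As a rough sketch of the look-up behaviour described above, the "biobank" node can be thought of as a simple acronym-to-name table. The acronym and full name below are placeholders, and the function name `biobank_full_name` is hypothetical.

```python
# Hypothetical look-up table: one entry per biobank record,
# keyed by acronym, valued by the full text rendering.
BIOBANK_LOOKUP = {
    "XYZ": "Example Tissue Repository",  # placeholder, not a real ICDC value
}


def biobank_full_name(registration_origin, lookup=BIOBANK_LOOKUP):
    """Return the full name for an acronym, or the acronym itself if unknown."""
    return lookup.get(registration_origin, registration_origin)
```

Falling back to the acronym when no record exists is a design assumption; it keeps the tooltip usable while the node is still "floating" and not every repository has been curated.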

Update the data model to include multiple tags for all nodes

The data model should be updated so as to include multiple tags for all nodes, via edits to the icdc-model.yml file.

Specifically:

  1. Modify the existing node-level Category property such that it is one of several tags applied to each and every node
  2. Add tags for Assignment (node is core vs. extended in nature), Class (node is primary aka "absolutely core" vs. secondary aka "not always needed/used"), and Color (for future use in model visualization tools, set to "black" until such time as needed)

For example, tags as applied to the case node:
case:
  Tags:
    Category: case
    Assignment: core
    Class: primary
    Color: black

Change the designation for enumerated lists of acceptable values from Type: to Enum:

Within the model property definitions file, change the designation for enumerated lists of acceptable values from Type: <followed by indented list of acceptable terms> to Enum: <followed by the same list>

Once we're able to generate any sort of data dictionary and submission guide outputs from the model definition files, use of the term Enum should make our designation of the data type for properties that use controlled vocabularies rather more intuitive to consumers of these outputs.
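
The rename could be applied mechanically once the property definitions are loaded into memory. The sketch below assumes they are held as a plain dict of dicts, and that an enumerated property is one whose Type value is a list; both are assumptions about the file layout, and `type_list_to_enum` is a hypothetical name.

```python
def type_list_to_enum(prop_defs):
    """Return a copy of prop_defs where list-valued Type keys become Enum.

    Properties whose Type is a plain data type (e.g. "number") are left
    untouched. The input mapping is not modified.
    """
    converted = {}
    for name, definition in prop_defs.items():
        definition = dict(definition)  # shallow copy so the input survives
        if isinstance(definition.get("Type"), list):
            definition["Enum"] = definition.pop("Type")
        converted[name] = definition
    return converted
```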

Update Data Model to accommodate new studies currently being on-boarded

The data model should be updated in various ways in order to accommodate values being provided by new studies currently being on-boarded:

  1. Need to add a new relationship from Case directly to Study Arm for UBC02

  2. Need additional acceptable values for Stage of Disease for MGC01 and UBC02 studies

  3. Need/want to extend list of acceptable values for various sample-level annotations and/or add constraints to them, e.g. physical_sample_type (needs "Cell Line" for OSA01 and CCL01 studies), tumor_grade (needs acceptable values and should be set to req=true), sample_preservation (needs "Not Applicable" for cell line samples), necropsy_sample (needs "Not Applicable" for cell line samples), etc.

  4. Need to extend the list of acceptable values for file_type even beyond those listed in Issue #59, to include "Array CGH Analysis File" for MGC01 study

  5. May need to dial back registration: is_primary_id to no longer be required, possibly no longer even needed pending further discussion of registration best practice

Remove the crf_id property from all core nodes

As part of the continued tidy up of the data model, the crf_id property should be removed from all core nodes.

It was included in good faith when the model was first assembled, but has turned out to be too specific to COTC studies to warrant including it to the extent to which it was, and the newly-implemented model viewer is raising its visibility.

Also:

  • Fix the typo in the description of cohort_dose:
    The intended or protocol dose of the therapeutic agnet used in any given cohort

  • Relocate the property definition for the arm_description property from line 34, and nest it under study_arm props where it belongs, and add a description for the property

  • Correct a minor error in the description for patient_age_at_enrollment
    The age the canine patient/subject/donor as of study/trial enrollment, expressed in standard human years, as opposed to dog years

  • Specify values of "Internally-curated" for the Src: property definition for the biospecimen_repository_acronym and the biospecimen_repository_full_name properties

Multiple Diagnoses and consequent changes.

The system needs to allow for multiple diagnoses for a single case. This is particularly important in human cases (CTN) where someone may have more than one cancer diagnosed at the same time due to mutation-specific diagnoses. This would necessitate a "next" relationship to be able to traverse the links.

The system needs to allow for a case to be associated with different arms throughout its history. For example, a case may be associated with a single arm, fail treatment and then be assigned to a different arm for different treatment. Treatment needs to be associated directly with the arm and have the specifics of that treatment listed (may require elevating treatment to a node of its own). Cases which move from one treatment arm to another will need to have an off-treatment node associated when they go off, so the system needs to allow for multiple off-treatment nodes per case. In talking with Mark Jensen, a relationship of "former_member_of" may be needed for previous relationships that were listed as "member_of" and the new nodes would get "member_of" relationships. These "former_member_of" nodes would need to get a property of "date_membership_stopped". Cases should only be allowed to be associated with a single treatment at one time.
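
One way to picture the arm-change bookkeeping described above is sketched here. Memberships are modelled as plain dicts, and storing "date_membership_stopped" on the membership record itself is an assumption; the function name `move_case_to_arm` is hypothetical.

```python
def move_case_to_arm(memberships, case_id, new_arm, date_stopped):
    """Retire a case's current arm membership and add a new one.

    Any existing "member_of" record for the case is relabelled
    "former_member_of" and stamped with date_membership_stopped, so the
    case is only ever a current member of one arm.
    """
    for membership in memberships:
        if membership["case"] == case_id and membership["type"] == "member_of":
            membership["type"] = "former_member_of"
            membership["date_membership_stopped"] = date_stopped
    memberships.append({"case": case_id, "arm": new_arm, "type": "member_of"})
    return memberships
```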

The system needs to allow for a case to be associated with more than one study (similar to that described for treatment above). It may be the case that once the patient fails one treatment, they are removed from the study, but then recommended to be put on a second study of a different treatment. In this case, the treatments are not associated at the study level, but at the program level through different studies. The link to the case might remain.

Extend list of acceptable values for the File Type property

Prior to us actually having to load any study-level files, the list of acceptable values for the File Type property (where file type = type of file content) needs to be extended to include values such as:

Study Protocol
Supplemental Data File
Variant Calling File (?)
Data Analysis Whitepaper (???)
etc.

Current list of acceptable values for File Type focuses very much upon different types of sequencing file, with no coverage for the types of files expected to be associated directly with a study.

For properties with numberWithUnits type, make "units" value an array everywhere

Based on discussion today (8/14/19), decision taken to express the possible 'units' values as an array on every property with numberWithUnits type, even if there is only one acceptable unit of measurement for the property.

The convention will be that the first unit of measurement in the "units" array will be the default for data coming in.
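
The convention could be handled by consumers of the model along these lines. This is a sketch only; `normalize_units` and `default_unit` are hypothetical helper names, and the unit strings in the usage note are examples rather than values from the model.

```python
def normalize_units(units):
    """Always return units as a list, wrapping a single string if needed."""
    if isinstance(units, str):
        return [units]
    return list(units)


def default_unit(units):
    """Return the first unit in the array, which is the default for incoming data."""
    return normalize_units(units)[0]
```

For instance, `default_unit(["mg/kg", "mg/m2"])` would pick "mg/kg", and a property with a single acceptable unit still goes through the same code path.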

Revise model per submission rules and emerging needs

The ICDC model needs to be revised per evolving data submission ground rules and guidelines, and per emerging needs driven by the Glioma01 and Vemurafenib studies currently being processed. Specifically:

  1. Enumeration values are now needed for certain properties for which they were not previously specified, e.g. stage_of_disease
  2. Existing enumeration values need to be expanded, e.g. for file_type
  3. Constraints can now be added to various properties, e.g. required, data type
  4. Limited additional properties are needed on some existing nodes, e.g. sample_chronology

Deprecate "crf_id" property

CCDH noted via Slack that the crf_id property occurs on only 10 nodes and is only populated in the data on one (disease_extent). Since this was very specific to 007 study, maybe we ought to retire it.
@PhilipMusk

Update model to support representing multi-study participants as "Individuals"

Need to update the model to support representing multi-study participants as "Individuals":

  1. Add a new node: individual

  2. Set node properties = null

  3. Add a new one-to-many relationship between multi-study participant "Individuals" and study-specific Cases

"Individual" nodes will be auto-created and linked to the appropriate study-specific Cases representing them based upon the data loader's detection of registration-based matches

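The registration-based matching mentioned above might look roughly like the sketch below. It assumes each registration is available as a (case_id, registration_origin, registration_id) tuple and that an exact match on origin and ID identifies the same animal; the real data loader may use different inputs and rules, and `find_individuals` is a hypothetical name.

```python
from collections import defaultdict


def find_individuals(registrations):
    """Group case IDs that share an identical registration.

    Returns one sorted list of case IDs per registration that is shared
    by more than one case. Chains of matches across different
    registrations are not merged in this simple version.
    """
    cases_by_registration = defaultdict(set)
    for case_id, origin, registration_id in registrations:
        cases_by_registration[(origin, registration_id)].add(case_id)
    return [
        sorted(cases)
        for cases in cases_by_registration.values()
        if len(cases) > 1
    ]
```

Each returned group would correspond to one auto-created "individual" node linked to the listed cases.
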
Update data model with additional Program properties

Need to add a couple more Program properties in order to more fully support front end rendering of program views.

Specifically:
program_acronym
program_short_description - of questionable value
program_sort_order - also of questionable value

Make program_acronym in the program node a required property

The load file for the study node includes a column for program.program_acronym. This is how a study is linked to a program. Since the load file for study requires this value, then the program node's program_acronym should be a required property, so that it will always be available to create the link.

Add properties to File node

With the NCATS study now loaded, and pathology reports available as files to associate with its Case and/or Diagnosis records, need to add the appropriate properties to the File node.

Add "Best Response" property; update various single-letter enumerations to words

Need to perform another round of minor model updates.

Specifically, need to:

  1. Add a new "Best Response" property within Diagnosis
  2. Update various single-letter enumerations to words, e.g. sex (currently M,F,U), neutered status (currently Y,N,U), concurrent disease (currently Y,N,U), etc.
  3. Replace sample: sample_type with sample: physical_sample_type
  4. Change sample: percent_tumor from number to text to allow for values such as ">90" and "50-75"
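
For item 2, existing data would need its single-letter codes translated when loaded against the updated model. A minimal sketch follows; the replacement words ("Male", "Yes", "Unknown", etc.) are assumptions, since the issue does not state the final terms, and `expand_code` is a hypothetical helper.

```python
# Assumed word forms; the actual enumeration terms are not given in this issue.
SEX_WORDS = {"M": "Male", "F": "Female", "U": "Unknown"}
YES_NO_WORDS = {"Y": "Yes", "N": "No", "U": "Unknown"}


def expand_code(value, mapping):
    """Translate a single-letter code to its word; pass other values through."""
    return mapping.get(value.strip().upper(), value)
```

Values that are already words pass through unchanged, so the same function can be run over mixed old and new data.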

Modify "individual" node to include an ID property

The properties around the recently-added "individual" node, specifically the complete absence thereof, need to be modified to include an ID property, such that an auto-generated ID value can be inserted into the property in question, and subsequently used as the "dog tag" by which multi-study participants can be identified and tracked.

The "individual" node should be re-named to indicate its canine origin per EVS recommendations

The relationship between "case" and "registration" nodes needs to be changed from one-to-many to many-to-many

Update data model to include units for various properties on Sample

Per discussions around how best to include expectations around units in various places within our data model, while not being overly prescriptive/restrictive, model needs to be updated to include units for various properties on Sample.

For example:
need units for area of tissue subject to analysis
need units for length of tumor
need units for width of tumor
etc.

Add constraints to properties on Study

Add constraints to properties on Study:
Add acceptable Type (data type) to each of the 7 properties
Make properties required where applicable

Add "formal" description for each of the 6 properties

Specify Src as "curated" where applicable

Local terminology references in the model desc

There are references to a localhost URL for the definition of enumerations for properties in the model-desc. Please see the two URLs below for examples.

The CCDH terminology team is trying to pull the enumeration into our harmonization model and would like to see how to get access to these values, or if the model-desc could be updated with these values.

Thank you!

- http://localhost/terms/domain/disease_term

- http://localhost/terms/domain/primary_disease_site

Update data model: remove Evaluation node and its relationships:

Update the data model so as to completely remove the Evaluation node and its relationships:
Remove the Evaluation node itself
Remove the relationships between Visit and Evaluation, and between Evaluation and Physical Exam, Vital Signs, Extent of Disease and any other connected nodes
Add the appropriate relationships between Visit and the downstream nodes mentioned above

Update data model with additional Sample properties

Need to update data model with additional Sample properties relevant to NCATS-derived data.

Specifically, need to add properties for pathology annotations specific to tumor samples:

  • sample_id

  • sample_type (to indicate tissue vs. blood vs. urine, etc.)

  • general_sample_pathology (to indicate normal vs. tumor sample)

  • date_of_sample_collection

  • necropsy_sample

  • length_of_tumor (the length of the tumor from which the tumor sample was taken)

  • width_of_tumor (the width of the tumor from which the tumor sample was taken)

  • analysis_area

analysis_area_percentage_tumor
analysis_area_percentage_stroma
analysis_area_percentage_glass
analysis_area_percenatge_pigmented_tumor
total_tissue_area
tumor_tissue_area
non_tumor_tissue_area

  • percentage_tumor

  • percentage_stroma

  • comment

Some of these fields (especially those not bulleted above) are so NCATS-specific that I question the value of us accommodating them and their data. But on the other hand, Greg included the fields and their values, so he sees this as useful data.

Update data model to include a relationship directly from Sample to Case

The data model should be updated to include a relationship directly from Sample to Case, such that we can correctly represent the way in which the samples analyzed for the NCATS01 study were acquired in a one-off manner as part of a much bigger biobanking effort, as opposed to being acquired in a longitudinal, visit-based manner.

Assorted model updates required

Assorted model updates are required in order to support upload of minimum viable content for MVP release:
Need a relationship from File directly to Study, for study-level data files
Need a relationship from Visit to Case, for situations where a Visit can't be unambiguously mapped to a Cycle
Need a relationship from Adverse Event to Case, to replace the relationship from Adverse Event to Visit - AEs are based on date of onset, not visit
Advocate proactively adding sample_site and specific_sample_pathology properties to Sample
Need to add "not applicable" as an acceptable term for general_sample_pathology to accommodate blood samples
Need to remove the s from the file_locations property
Need to change cohort_dose property to type = string to facilitate upload of inconsistent data

Updates to properties on Case and Sample

Given the gradual evolution of data submission guidelines and ground rules, various updates to the properties within the Case and Sample nodes can now be made, such that we can move towards validating inbound data against these nascent data submission requirements.

Multiple Case Identifiers

We had a conversation at the DGAB meeting about how cases may be derived from CCOGC samples, which have a fixed identifier format that must be maintained. However, users may attach their own local identifier, or collect their own cases with their own identifier and then submit those cases to CCOGC (where they would get a different identifier). There is a need in ICDC to be able to attach a separate identifier (or more than one?) to a case in order to track uniqueness across identifier systems. Suggest that, in addition to the local identifier provided, a separate listing of ALL identifiers attached to that case (including the local identifier) be created. The local identifier then stays as the unique key, but the listing can be searched across multiple studies in case that local identifier is used as a foreign identifier in another study.
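
The suggested listing of all identifiers could be searched with an index like the one sketched below. It assumes each case is a dict with a "case_id" key (the local identifier) and an optional "identifiers" list; these key names, the function name `build_identifier_index`, and the identifier formats in any example data are hypothetical.

```python
from collections import defaultdict


def build_identifier_index(cases):
    """Map every identifier to the set of case IDs that carry it.

    The local identifier is always included in the listing, so a local
    ID reused as a foreign identifier in another study will resolve to
    both cases.
    """
    index = defaultdict(set)
    for case in cases:
        all_identifiers = {case["case_id"], *case.get("identifiers", [])}
        for identifier in all_identifiers:
            index[identifier].add(case["case_id"])
    return index
```

Looking up an identifier then returns every case it is attached to, across studies, while "case_id" remains the unique key.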

Update data model with additional Demographic properties

Need to update data model with additional Demographic properties relevant to NCATS-derived data.

Specifically, need to add the following properties:

  • weight - to indicate the subject's weight at the time the subject was enrolled and biospecimens were acquired, at least in the case of studies that are not longitudinal in nature

  • neutered_indicator - a Boolean indicator as to whether the subject is spayed or neutered. Not needed thus far because, at least for the COTC007B study, the data in C3D captures whether the subject has been spayed/neutered as part of whether the subject is female/male, whereas the NCATS data captures sex and spayed/neutered status as separate data elements.

Add constraints to properties on Program

Add constraints to properties on Program:
Add acceptable Type (data type) to each of the 6 properties
Make properties required where applicable

Add "formal" description for each of the 6 properties

Specify Src as "curated" for each of the 6 properties

Update data model with additional Diagnosis properties

Need to update data model with additional Diagnosis properties relevant to NCATS-derived data

Specifically, need to add the following properties:

  • concurrent_disease

  • concurrent_disease_type

  • immunophenotype - of questionable value; within the NCATS data, this field is used to delineate B cell vs. T cell lymphoma. We could so easily concatenate the values in this field with those in the NCATS "tumor_type_specific" field to produce a very specific value for ICDC's "disease_term", and thereby avoid adding this property, yet make use of its data.

Add Grantee Information

On the Steering Committee call today, it came up that multiple grantees may be contributing to a specific study. For tracking purposes, it will be useful and necessary to associate grants with studies. There will be a need for multiple grantees associated with a specific study (Glioma, for example, has at least 3 grantees associated with it). For each grant, there is an associated PI, Grant Number and Location which should be tracked. These locations may show up in the study itself as sites (see Glioma which has samples from the site Texas A&M University (TAMU) with associated Grantees - Amy Heimberger and Jonathan Levine).

How to generate SVG

Hi, I'm playing a bit with the tool and I wanted to generate an SVG.
I ran bin/model-tool -g image.svg icdc-model.yml icdc-model-props.yml, but I didn't get any SVG output. The YAML files are from this repo.

Can you tell me what I am missing?

Add Tags entity to MDF syntax

Add a "Tags" object that can contain an array of metadata tags for any element in the MDF. Can be used for marking up the model to graph portions of it with certain colors, etc. Can also be used to annotate nodes and properties with other metadata, e.g., project or submitter associations.

Update model to support study-level Image Collections

Need to update the data model to support study-level Image Collections:

  1. Add a new node: image_collection

  2. Add node properties to capture:
    The name of each image collection
    The types of images included in each image collection (using property type of "list", and define acceptable terms that can be listed)
    The url at which each image collection can be found
    The repository within which each collection exists
    The accessibility of each collection - download vs. access via cloud

  3. Add one to many relationship between clinical_study and image_collection nodes

Update data model with additional Study properties.

Need to update data model with additional Study properties.

Specifically, need to add properties for curated, i.e. non-derived, study-centric metadata as follows:

  • Study description - a relatively short (5-6 sentence) descriptive text that outlines the nature of any given Study. For a Study that's identified as a clinical trial by the proposed "study type" property (see below), this text would be culled directly from the study protocol's precis or summary section

  • Study type - a 2-3 word classification of any given study as a "clinical trial", or a "transcriptomics and proteomics analysis", etc., with the verbiage used assigned on a case by case basis

  • Conducted when - a date range indicating the time period over which the study was conducted

Update model in terms of assorted minor changes dictated by data evolution

The model needs to be updated in terms of assorted minor changes dictated by data evolution, as follows:

  1. Add a many to one relationship from file to case, to support the vcf files for the GLIOMA01 study being associated with cases rather than samples
  2. Add "Mutation Annotation File" to the list of acceptable values for the file: file_type property, to accommodate the MAF files generated by the GLIOMA01 study
  3. Add "Ultrasound" and "Optical" to the list of acceptable values for the image_collection: image_type property
  4. Correct two horrible typos in comments pertaining to acceptable values for the diagnosis: best_response property
