Comments (15)

LucaCappelletti94 commented on June 12, 2024

One more thing: if you have node features and edge features, whether categorical or metric (even things like BED coordinates, if some nodes are genomic regions), they can be useful when running GNN and GCN models on the graph.

from pheknowlator.

fmellomascarenhas commented on June 12, 2024

Hi, if you don't mind me sharing my two cents: in the file PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified_Triples_Identifiers.txt there are 291 unique edge_types. Are all of these edge_types in the dataset?

Also, in DisGeNet, for example, the edges between DIS-GENE can have multiple labels, which completely changes the meaning of the interaction. Would it be possible to add the edge subtypes? See the paragraph "The DisGeNET Association Type Ontology" at https://www.disgenet.org/dbinfo

In PheKnowLator's data sources description, it doesn't seem that any particular type of edge was filtered/selected: https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet

For protein-protein edges from STRING, there are multiple scores. Some are lab-based, others come from the literature, and others from predictions. I would imagine that some stakeholders would prefer to keep only the scores coming from lab-based experiments. So maybe all the provided scores could be made available for filtering/enhancement? The same logic applies to DisGeNet, where the number of papers seems to be a good score metric too.

callahantiff commented on June 12, 2024

Hi @fmellomascarenhas - I always appreciate your feedback!

You are right that I have not yet included specific edge typing from the resources that we import, like DisGeNet and STRING. I agree it's time that we do!

I will spend some time working on and thinking through this tomorrow and will create a spec for what we can add from each source we import. I will also include a brief plan/overview of how I might approach integrating them (there will always be some solutions that are easier or better than others 😄). I can post both here so you all can take a look. I will set aside time next week to make the changes as part of a new major release.

How does that sound?

fmellomascarenhas commented on June 12, 2024

Sounds great! :)

From a Machine Learning point of view, I see four main uses for that:

  1. Filter data that my stakeholders might not want there, for example, edges created from protein-protein predictions;
  2. Predict specific edgetypes, for example, drug-approved_treatment-disease;
  3. Predict multiple edgetypes at once, as a multitask learning problem;
  4. Use as an edge feature;
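
To make use 1 concrete, here is a minimal sketch of filtering edges by evidence type. The record layout and the field names (`evidence`, `edge_type`) are assumptions made up for illustration; they are not columns that PheKnowLator currently outputs.

```python
# Hypothetical edge records; the field names are illustrative assumptions,
# not part of any existing PheKnowLator output file.
edges = [
    {"source": "P1", "target": "P2", "edge_type": "interacts_with", "evidence": "experimental"},
    {"source": "P1", "target": "P3", "edge_type": "interacts_with", "evidence": "predicted"},
    {"source": "D1", "target": "X1", "edge_type": "approved_treatment", "evidence": "curated"},
]


def keep_edges(records, allowed_evidence):
    """Return only the edges whose evidence label is in allowed_evidence."""
    allowed = set(allowed_evidence)
    return [edge for edge in records if edge["evidence"] in allowed]


# Drop edges that come from predictions, keep everything else.
filtered = keep_edges(edges, {"experimental", "curated"})
print(len(filtered))
```

The same per-edge labels could then double as prediction targets (uses 2 and 3) or as an edge feature (use 4).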

fmellomascarenhas commented on June 12, 2024

Hi @callahantiff, I am on vacation this week, but I will get back to you soon! Thanks :)

sanyabt commented on June 12, 2024

Hi @callahantiff, just caught up with the discussion here and I agree that this would be a great solution! I can envision adding timestamps to the edge metadata and other metrics (e.g., node centrality) to the node metadata if needed. Thank you for figuring out a solution so quickly 😄

callahantiff commented on June 12, 2024

Absolutely, that's what I was envisioning too: we would provide a baseline amount of metadata that users can choose from and/or extend (with things like timestamps) as needed.

I will likely make the updates the week after next and will let you know when it's ready. Thanks again for your feedback!

fmellomascarenhas commented on June 12, 2024

Hi @callahantiff, I had the time today to read everything. I think it sounds good! I don't think I am in a position to propose a better way of organizing the files, but if that helps, I thought of some additional ideas of features/metadata. Maybe this can help with the brainstorming process :) :

Edge related:
1. Timestamp: So users can split train/test/validation by time, which is much more robust;
2. Edge subtypes: The meaning of gene-biomarker-disease is completely different than gene-causalMutation-disease;
3. Edge scores: Many edgetypes have multiple scores. DisGeNet, for example, has multiple ways of scoring the edges. The number of papers is completely different from the number of distinct sources, which in turn is completely different from the ratio of positive papers to total papers. You can have a DIS-GENE edge with 500+ papers that is reported by only one source and thus has a low score. OpenBioLink, for example, provides two datasets: one with all scores, and one containing only high-confidence edges. This allows users to set their own threshold using their favorite score type.

4. Edge features: Examples

  • Knowing if a drug-dis is FDA approved;
  • Gene expression;

5. Source information: If paper ID is available, possibly add it. This can help with generating a timestamp. Also, I remember once checking one edge that had 3 sources, but when checking the paper ID, they were 2 different versions of the same manuscript and a third paper of the same group citing themselves. So there weren't 3 sources, just one. This information can help stakeholders validate why the edge exists.

Node related:
6. Node features: Examples

  • Is the DRUG a small molecule or an antibody?
  • Drug chemical properties;

7. Parent/Children: Some biological entities can be described as a tree structure. Diseases, for example, branch into multiple disease subtypes. This information can be very useful to:

  • Find missing IDs (if a child ID is missing, replace with parent);
  • Group diseases: In OpenBioLink, Alzheimer's disease has about 20 subtypes (AZ1, AZ2, etc);
  • Avoid data leakage: If you mask a DIS-GENE connection, should you also mask the parent and children of that DIS-GENE?
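
As a rough illustration of the data leakage point, the sketch below masks a held-out DIS-GENE edge together with the same gene's edges to the disease's direct parent and children. Whether that is the right policy is exactly the open question above; the `parent_of` mapping and the disease/gene identifiers are invented for the example.

```python
def related_diseases(disease, parent_of):
    """Return the disease plus its direct parent and direct children."""
    related = {disease}
    if disease in parent_of:
        related.add(parent_of[disease])
    related.update(child for child, parent in parent_of.items() if parent == disease)
    return related


def mask_with_relatives(edges, held_out, parent_of):
    """Remove the held-out (disease, gene) edge and the same gene's edges
    to the disease's parent and children from the training edges."""
    disease, gene = held_out
    blocked = related_diseases(disease, parent_of)
    return [(d, g) for d, g in edges if not (g == gene and d in blocked)]


# Hypothetical hierarchy: AD1 and AD2 are subtypes of AD.
parent_of = {"AD1": "AD", "AD2": "AD"}
edges = [("AD", "APP"), ("AD1", "APP"), ("AD2", "PSEN1"), ("PD", "APP")]

train_edges = mask_with_relatives(edges, ("AD1", "APP"), parent_of)
print(train_edges)
```

Here the AD-APP edge is dropped along with the held-out AD1-APP edge, while edges to unrelated diseases or other genes stay in the training set.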

One thing I haven't had the bandwidth to think about is edge properties that are true only when other conditions are also true. For example, the gene expression in a cell type is X1 when disease D is present; otherwise the expression is X2. The same goes for features that differ by gender/race/age. But this is probably way too complex for this stage.

Thanks for all of your great work!

callahantiff commented on June 12, 2024

Per discussion with @LucaCappelletti94 - create two separate tsv files:

  • Node Data: contains node identifiers and node type. For nodes with multiple types, separate each type label with a |
  • Edge Data: contains columns for source and destination node identifiers, edge weight, and edge type. For edges with multiple types, separate each type label with a |. Example: relation:edge type | relation:edge type
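
For reference, a minimal sketch of how such an edge file could be read back, splitting multi-type cells on the | separator. The column names (`source`, `destination`, `weight`, `edge_type`) and the sample rows are assumptions; the final header names have not been decided in this thread.

```python
import csv
import io

# Hypothetical edge file content; the header names are assumptions.
edge_tsv = (
    "source\tdestination\tweight\tedge_type\n"
    "gene:1\tdisease:9\t0.8\tcausal_mutation|biomarker\n"
    "gene:2\tdisease:9\t0.3\tbiomarker\n"
)


def read_edges(handle):
    """Parse a tab-delimited edge file, splitting multi-type cells on '|'."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["weight"] = float(row["weight"])
        row["edge_type"] = [label.strip() for label in row["edge_type"].split("|")]
        rows.append(row)
    return rows


edges = read_edges(io.StringIO(edge_tsv))
print(edges[0]["edge_type"])
```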

Will let you know as soon as this is ready @LucaCappelletti94!

callahantiff commented on June 12, 2024

Great suggestion @LucaCappelletti94! One thought that immediately comes to mind is gene expression values (for specific tissues); we can definitely add that. I'll think through what else might make a good edge type.

Sometimes the distinction between what should be used as a weight or a type gets blurred. However, if we were to first mark everything that may be interesting/useful as an edge type (i.e., all categorical and metric-based attributes), then we would also give the user the ability to select from those what they want to use for an edge based on their use case. I like this idea a lot!

callahantiff commented on June 12, 2024

Excellent points and even more motivation for me to make these changes! 👍

callahantiff commented on June 12, 2024

Just an update -- I was not able to get to this last week, but plan on coming back to it next week. Sorry for the delay!

callahantiff commented on June 12, 2024

Sorry for the delay, I think we are close to being able to make the updates we have been discussing in this thread. I have been reviewing the different resources that we bring in and thinking through some of the challenges with @bill-baumgartner, who has been involved with me from the beginning in building pkt.

I think we came up with the best possible solution in terms of being able to incorporate the greatest amount of edge and node metadata from all input data sources. This should also allow us to easily incorporate other attributes or data types like timestamps and multiple edge weights. Note that this approach is specifically not designed to extend the base OWL knowledge representation, as that can quickly get complicated. Instead, this approach is meant to supplement the existing output files in the least complicated way. Ideally, it gives each user the most flexibility and options, enabling full downstream customization without us having to know the details of each use case ahead of time. A brief description of what I am proposing is included below. Hope you like it!


Current Approach and Output

Currently, we are only producing metadata output for nodes and relations, not for triples or edges.

Node Metadata

Filename: XXXX_OWL_NodeLabels.txt
Output: tab-delimited txt file containing six columns. This file is somewhat misleading as it also contains information for each relation.

entity_type   integer_id   entity_uri                                      label      description/definition   synonym
NODES         375312       <http://www.ncbi.nlm.nih.gov/gene/58155>        PTBP2 (human)         A protein coding gene PTBP2 in human.   None
NODES         6297907      <https://www.ncbi.nlm.nih.gov/snp/rs10902762>   NM_000203.5(IDUA):c.60G>A (p.Ala20=)   This variant is a germline/unknown single nucleotide variant located on chromosome 4 (NC_000004.12, start:987144/stop:987144 positions, cytogenetic location:4p16.3) and has clinical significance 'Benign'. This entry is for the GRCh38 and was last reviewed on Nov 26, 2020 with review status 'criteria provided, multiple submitters, no conflicts'    None
RELATIONS     2057563      <http://purl.obolibrary.org/obo/RO_0002002>     has boundary   a relation between a material entity and a 2D immaterial entity (the boundary), in which the boundary delimits the material entity   None
RELATIONS     958453       <http://purl.obolibrary.org/obo/RO_0002444>     parasite of   None   direct parasite of
...


Proposed Representation

All data for nodes and edges will be output to a JSON Lines (jsonl) file. This format stores a separate JSON object on each line, one per element. A more detailed description of the benefits of using this type of file can be found here. An example output is shown below (taken from https://jsonlines.org).

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
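
Reading a file like this needs only the standard library. The sketch below parses two of the example lines above, one JSON object per line; the helper name `read_jsonl` is just an illustrative choice.

```python
import io
import json


def read_jsonl(handle):
    """Parse a JSON Lines stream into a list, one JSON value per non-empty line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Two lines from the jsonlines.org example, held in memory instead of a file.
sample = io.StringIO(
    '{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}\n'
    '{"name": "May", "wins": []}\n'
)
records = read_jsonl(sample)
print(records[0]["name"])
```

Because each line is parsed on its own, a large file can also be processed line by line without loading everything into memory.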

Node Metadata

Output Filename: XXXXX_node_metadata.jsonl

The node metadata file will be keyed by URI, and each node will contain, at a minimum, the following metadata:

  • integer_id: the numeric identifier generated for the integer-based edge lists
  • primary_type: a string stating the type of node, taken from the source ontology or input data type. Additional node types can be added, but will appear as <<src>>_entity_type, where src is the name of the data source that the type was obtained from (more details below)
  • primary_label: a string label for the node
  • primary_description/definition: a string containing the definition for the node
  • primary_synonym: a list of synonyms for the node

Additional types of metadata at the node level will be added and the general format will be: <<src>>_<<type>>, where src is the name of the data source and type is the metadata type for that node (e.g., type, weight).
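
To give a feel for what one line of the node file might look like, here is a rough sketch that builds a record from the PTBP2 row shown earlier. The exact JSON layout (a single-key object keyed by URI), the `primary_type` value "gene", and the `examplesrc` source name are all assumptions; only the field names come from the list above.

```python
import json


def node_record(uri, integer_id, primary_type, label, description, synonyms, extra=None):
    """Serialize one node as a JSON Lines entry keyed by its URI.

    `extra` maps (src, type) pairs to values and is written out using the
    <<src>>_<<type>> naming pattern.
    """
    fields = {
        "integer_id": integer_id,
        "primary_type": primary_type,
        "primary_label": label,
        "primary_description/definition": description,
        "primary_synonym": list(synonyms),
    }
    for (src, kind), value in (extra or {}).items():
        fields[f"{src}_{kind}"] = value
    return json.dumps({uri: fields}, ensure_ascii=False)


uri = "<http://www.ncbi.nlm.nih.gov/gene/58155>"
line = node_record(
    uri,
    integer_id=375312,
    primary_type="gene",  # assumed value, for illustration only
    label="PTBP2 (human)",
    description="A protein coding gene PTBP2 in human.",
    synonyms=[],
    extra={("examplesrc", "entity_type"): "gene"},  # hypothetical data source
)
print(line)
```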

Edge Metadata

Output Filename: XXXXX_edge_metadata.jsonl

The edge metadata file will be keyed by a triple or edge identifier created as the MD5 hash of the identifiers in the edge (i.e., MD5(subj_uri, relation_uri, obj_uri)), and each edge will contain, at a minimum, the following metadata:

  • primary_relation_type: a string containing the primary type for the edge
  • weight: all edges will be initialized to 0.0

Additional types of metadata at the edge level will be added and the general format will be: <<src>>_<<type>>, where src is the name of the data source and type is the metadata type for that edge (e.g., weight).

As a result of including this file, I will also update the two flat-file outputs (XXXX_Triples_Identifiers.txt and XXXX_Triples_Integers.txt) to include the triple identifiers, even though these can easily be generated on the fly.
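
Generating the triple identifiers on the fly could look roughly like the sketch below. How the three URIs are combined before hashing is not specified above, so joining them with a tab character is an assumption, as is the choice of example URIs (taken from the node file sample and paired arbitrarily).

```python
import hashlib


def triple_id(subj_uri, relation_uri, obj_uri, sep="\t"):
    """Return the MD5 hex digest of the three URIs joined with `sep`.

    The join strategy is an assumption; the proposal only states that the
    identifier is an MD5 hash over the subject, relation, and object.
    """
    joined = sep.join((subj_uri, relation_uri, obj_uri))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


subj = "<http://www.ncbi.nlm.nih.gov/gene/58155>"
rel = "<http://purl.obolibrary.org/obo/RO_0002002>"
obj = "<https://www.ncbi.nlm.nih.gov/snp/rs10902762>"

edge_key = triple_id(subj, rel, obj)

# Minimum edge metadata from the proposal, keyed by the triple identifier.
edge_entry = {edge_key: {"primary_relation_type": rel, "weight": 0.0}}
print(edge_key)
```

Using an explicit separator keeps the hash unambiguous: without one, different triples whose concatenated strings happen to match would collide.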


Feedback/Questions

@bill-baumgartner - does that seem correct and cover everything we talked about?

@LucaCappelletti94 - I realize that the proposed output would not readily work as input to Embiggen. I am still very happy to produce a file in the input format we originally discussed.

@fmellomascarenhas and @sanyabt - Please let me know if you have any comments/feedback or if you have any issues with this approach. I think it will be the best overall and hopefully, be flexible enough to be useful for most use cases.

callahantiff commented on June 12, 2024

That sounds great! Have a great vacation! 😄

callahantiff commented on June 12, 2024

@fmellomascarenhas this is fantastic feedback, thank you very much! I also really enjoy the examples. I am not sure we can accommodate everything in the first pass, but this format will allow easy integration of the types of metadata you suggest (and likely things neither of us has thought of yet [I think 🤔 and hope 😄 ])! OK, will keep you posted as I begin working on this over the next few weeks.

Thanks so much for the feedback and suggestions!
