
callahantiff / pheknowlator


PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models

Home Page: https://github.com/callahantiff/PheKnowLator/wiki

License: Apache License 2.0

Python 65.02% Jupyter Notebook 34.09% Dockerfile 0.61% Shell 0.29%
knowledge-graph ontologies biomedical-applications mechanisms translational-research linked-open-data owl semantic-web benchmarks obofoundry

pheknowlator's Introduction



What is PheKnowLator?

PheKnowLator (Phenotype Knowledge Translator), or pkt_kg, is the first fully customizable knowledge graph (KG) construction framework. It enables users to build complex KGs that are Semantic Web compliant and amenable to automatic Web Ontology Language (OWL) reasoning, to generate contemporary property graphs, and to produce output importable by today's popular graph toolkits. Please see the project Wiki for additional information.

📢 Please see our preprint 👉 https://arxiv.org/abs/2307.05727

What Does This Repository Provide?

  1. A Knowledge Graph Sharing Hub: Prebuilt KGs and associated metadata. Each KG is provided as triple edge lists, OWL API-formatted RDF/XML, and NetworkX graph-pickled MultiDiGraphs. We also make available text files containing node and relation metadata.
  2. A Knowledge Graph Building Framework: An automated Python 3 library designed for optimized construction of semantically-rich, large-scale biomedical KGs from complex heterogeneous data. The framework also includes Jupyter Notebooks to greatly simplify the generation of required input dependencies.

NOTE. A table listing and describing all output files generated for each build along with example output from each file can be found here.

How do I Learn More?

  • Join and/or start a Discussion
  • The Project Wiki for available knowledge graphs, pkt_kg data sources, and the knowledge graph construction process
  • A Zenodo Community has been established to provide access to software releases, presentations, and preprints related to this project

Getting Started

Install Library

This program requires Python version 3.6. To install the library from PyPI, run:

pip install pkt_kg

You can also clone the repository directly from GitHub by running:

git clone https://github.com/callahantiff/PheKnowLator.git

Note. Sometimes OWLTools, which comes with the cloned/forked repository (./pkt_kg/libs/owltools), loses "executable" permission. To avoid any potential issues, I recommend running the following in the terminal from the PheKnowLator directory:

chmod +x pkt_kg/libs/owltools

Set-Up Environment

The pkt_kg library requires a specific project directory structure.

  • If you plan to run the code from a cloned version of this repository, then no additional steps are needed.
  • If you are planning to utilize the library without cloning the repository, please make sure that your project directory matches the following:
PheKnowLator/
    |
    |---- resources/
              |
              |---- construction_approach/
              |---- edge_data/
              |---- knowledge_graphs/
              |---- node_data/
              |---- ontologies/
              |---- owl_decoding/
              |---- relations_data/
Dependencies

Several input documents must be created before the pkt_kg library can be utilized. Each of the input documents is listed below by knowledge graph build step:

DOWNLOAD DATA

This code requires three documents within the resources directory to run successfully. For more information on these documents, see Document Dependencies.

For assistance in creating these documents, please run the following from the root directory:

python3 generates_dependency_documents.py

Prior to running this step, make sure that all mapping and filtering data referenced in resources/resource_info.txt have been created. To generate these data yourself, please see the Data_Preparation.ipynb Jupyter Notebook for detailed examples of the steps used to build the v2.0.0 knowledge graph.

Note. To ensure reproducibility, after downloading data, a metadata file is output for the ontologies (ontology_source_metadata.txt) and edge data sources (edge_source_metadata.txt).

CONSTRUCT KNOWLEDGE GRAPH

The KG Construction Wiki page provides a detailed description of the knowledge graph construction process (please see the knowledge graph README for more information). Please make sure the documents listed below are present in the specified locations prior to constructing a knowledge graph. Click on each document for additional information. Note that cloning this library will include a version of these documents that points to the current build. If you use this version, there is no need to download anything prior to running the program.


Running the pkt Library

pkt_kg can be run via the provided main.py script, the main.ipynb Jupyter Notebook, or a Docker container.

Main Script or Jupyter Notebook

The program can be run locally using the main.py script or the main.ipynb Jupyter Notebook. An example of the workflow used by both approaches is shown below.

import psutil
import ray
import pkt_kg as pkt

# initialize ray
ray.init()

# determine number of cpus available
available_cpus = psutil.cpu_count(logical=False)

# DOWNLOAD DATA
# ontology data
ont = pkt.OntData('resources/ontology_source_list.txt')
ont.downloads_data_from_url()
ont.writes_source_metadata_locally()

# edge data sources
edges = pkt.LinkedData('resources/edge_source_list.txt')
edges.downloads_data_from_url()
edges.writes_source_metadata_locally()

# CREATE MASTER EDGE LIST
combined_edges = dict(edges.data_files, **ont.data_files)

# initialize edge dictionary class
master_edges = pkt.CreatesEdgeList(data_files=combined_edges, source_file='./resources/resource_info.txt')
master_edges.runs_creates_knowledge_graph_edges(source_file='./resources/resource_info.txt',
                                                data_files=combined_edges,
                                                cpus=available_cpus)

# BUILD KNOWLEDGE GRAPH
# full build, subclass construction approach, with inverse relations and node metadata, and decode owl
kg = pkt.FullBuild(kg_version='v2.0.0',
                   write_location='./resources/knowledge_graphs',
                   construction='subclass',
                   node_data='yes',
                   inverse_relations='yes',
                   cpus=available_cpus,
                   decode_owl='yes')

kg.construct_knowledge_graph()
ray.shutdown()

main.py

The example below provides the details needed to run pkt_kg using ./main.py.

python3 main.py -h
usage: main.py [-h] [-p CPUS] -g ONTS -e EDG -a APP -t RES -b KG -o OUT -n NDE -r REL -s OWL -m KGM

PheKnowLator: This program builds a biomedical knowledge graph using Open Biomedical Ontologies
and linked open data. The program takes the following arguments:

optional arguments:
-h, --help            show this help message and exit
-p CPUS, --cpus CPUS  # workers to use; defaults to use all available cores
-g ONTS, --onts ONTS  name/path to text file containing ontologies
-e EDG,  --edg EDG    name/path to text file containing edge sources
-a APP,  --app APP    construction approach to use (i.e. instance or subclass)
-t RES,  --res RES    name/path to text file containing resource_info
-b KG,   --kg KG      the build, can be "partial", "full", or "post-closure"
-o OUT,  --out OUT    name/path to directory where to write knowledge graph
-n NDE,  --nde NDE    yes/no - adding node metadata to knowledge graph
-r REL,  --rel REL    yes/no - adding inverse relations to knowledge graph
-s OWL,  --owl OWL    yes/no - removing OWL Semantics from knowledge graph

main.ipynb

The ./main.ipynb Jupyter notebook provides detailed instructions for how to run the pkt_kg algorithm and build a knowledge graph from scratch.

Docker Container

pkt_kg can be run using a Docker instance. In order to utilize the Dockerized version of the code, please make sure that you have installed the newest version of Docker. There are two ways to utilize Docker with this repository:

  • Obtain Pre-Built Container from DockerHub
  • Build the Container (see details below)

Obtaining a Container

Obtain a Pre-Built Container: Pre-built containers can be obtained directly from DockerHub.

Build the Container: To build the container yourself, download a stable release of this repository (or fork/clone it). Once downloaded, you will have everything needed to build the container, including the ./Dockerfile and ./.dockerignore. The code shown below builds the container. Make sure to replace [VERSION] with the current pkt_kg version before running the code.

cd /path/to/PheKnowLator  # the directory containing the Dockerfile
docker build -t pkt:[VERSION] .

Notes:

  • Before building the container locally, update PheKnowLator/resources/resource_info.txt, PheKnowLator/resources/edge_source_list.txt, and PheKnowLator/resources/ontology_source_list.txt
  • Building the container "as-is" off of DockerHub will include a download of the data used in the latest release; there is no need to update any scripts or pre-download any data

Running a Container

The following code can be used to run pkt_kg from outside of the container (after obtaining a prebuilt container or after building the container locally):

docker run --name [DOCKER CONTAINER NAME] -it pkt:[VERSION] --app subclass --kg full --nde yes --rel yes --owl no --kgm yes

Notes:

  • The example shown above builds a full version of the knowledge graph using the subclass construction approach with node metadata, inverse relations, and decoding of OWL classes. See the Running the pkt Library section for more information on the parameters that can be passed to pkt_kg.
  • The Docker container cannot write to an encrypted filesystem, so please make sure /local/path/to/PheKnowLator/resources/knowledge_graphs references a directory that is not encrypted.

Finding Data Inside a Container

In order to enable persistent data, a volume is mounted within the Dockerfile. By default, Docker names volumes using a hash, so to find the correct mounted volume you can run the following:

Command 1: Obtains the volume hash:

docker inspect --format='{{json .Mounts}}' [DOCKER CONTAINER NAME] | python -m json.tool

Command 2: View data written to the volume:

sudo ls /var/lib/docker/volumes/[VOLUME HASH]/_data

Get In Touch or Get Involved

Contribution

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Contact Us

We’d love to hear from you! To get in touch with us, please join or start a new Discussion, create an issue or send us an email 💌

Attribution

Licensing

This project is licensed under Apache License 2.0 - see the LICENSE.md file for details.

Citing this Work

Please see our preprint: https://arxiv.org/abs/2307.05727

pheknowlator's People

Contributors

bill-baumgartner, callahantiff, dependabot[bot], jwyrwa, lucacappelletti94


pheknowlator's Issues

Cannot Apply OWLAPI Formatting to Very Large KGs

Problem: When applying OWLAPI formatting to very large KGs (~86 million triples; i.e. the subclass + inverse relations KG with non-ontology metadata added), the OWLAPI hangs and the process never completes. See the error message from Dave Farrell below:

The current process that is in the “S” state is hung on a futex wait. At one point, there was a thread with process id of 3912 that was holding a resource. The thread has ended or crashed but it did not release the hold which is why, I believe, the current parent process 3907 is “waiting/sleeping”. In Java, methods such as lock(), park() or unpark() use futex_wait(). Strace is the utility that can help track these events but I do not know how to use it to pinpoint the actual resource being blocked. The command that I used to see what was happening with your process was:

strace -p 3907
strace: Process 3907 attached
futex(0x7f77b8f7f9d0, FUTEX_WAIT, 3912, NULL

Script: knowledge_graph.py

Current Solution: Not adding non-ontology metadata to the KG and leaving the rest of the KG build workflow intact. All ontology and non-ontology metadata (i.e. labels, definitions, and synonyms) get written out to a .txt file, so the information is still available to users, just not as part of the KG.

Multi-Stage Docker Build -- Not Runnable Outside of Container

The multi-stage Docker container is runnable from within the container, but not from outside of it. No error message is generated, but the container hangs in main() and never seems to instantiate the first class in the workflow.


Dockerfile Location: https://github.com/callahantiff/PheKnowLator/blob/adding_docker/Dockerfile


Updates:

  • Docker runs successfully on my laptop
  • Dave is helping me update Tantor to enable updating Docker to 17.05, the version needed to run multi-stage containers

Preparing for ISMB Bio-Ontologies Task

Task

Complete the needed tasks in order to get PheKnowLator prepared for the 2021 ISMB Bio-Ontologies Challenge. The specific tasks are organized and described below.

SPARQL Endpoint

Not sure this is actually needed to support the challenge, but if it is, we need to answer the following questions:

  • What do we use?
  • Where do we host it?
  • Do we want to think about using something other than SPARQL (i.e. Neo4J) that can support other

Frequent Refresh of KG Builds

  • Confirm with organizing committee which pkt build type to use - OWL-NETS versions of the builds are not true property graphs and may not be compatible with the evaluation framework. Probably want an instance and subclass build and their
  • Discuss potential changes that may be needed in order to create monthly KG updates to support challenge
    • What needs to be done in the build framework for ontology preprocessing and other preprocessing steps in order to automate monthly builds
      • Figure out about API set-up and requiring/providing keys for folks to download resources
      • Probably best to support a system that refreshes existing edge lists and not yet worry about sustaining/adding new edges sources yet
      • Set-up CI system that builds it once a month and deposits files (see if this works with GitHub Actions)

TODO: Perform OWL Reasoner Evaluation

TODO: Perform comparison of OWL reasoners.

During today's meeting with @bill-baumgartner, we outlined how we will perform an evaluation of OWL reasoners on the PheKnowLator V.2.0 KG.

OWL Reasoner Selection Criteria
Using the following reviews (shown below), we selected reasoners that met the following criteria:

  1. Low response time
  2. Available via the OWLAPI
  3. Open source
Khamparia A, Pandey B. Comprehensive analysis of semantic web reasoners and tools: a survey. Education and Information Technologies. 2017 Nov 1;22(6):3121-45.

Parsia B, Matentzoglu N, Gonçalves RS, Glimm B, Steigmiller A. The OWL reasoner evaluation (ORE) 2015 competition report. Journal of Automated Reasoning. 2017 Dec 1;59(4):455-82.

Eligible Reasoners:

Reasoner   Language   OWLTools
ELK        EL         Yes
ELepHant   EL         No
Pellet     DL         Yes
RACER      DL         No
FaCT++     DL         No
Chainsaw   DL         No
Konclude   DL         No
Crack      DL         No
TrOWL      DL+EL      No
MORe       DL+EL      No

@bill-baumgartner - to determine what reasoners were available in OWLTools, I ran the following:

./owltools -h

Next Steps:

  • Verify which of the reasoners above are currently included in OWLTools. For those that are not in the list, @bill-baumgartner and I will look into what is involved with adding the missing algorithms to OWLTools.

Evaluation Steps:

  1. Benchmark each of the algorithms on HPO+Imports

    • Run-time
    • Justifications
    • Count of inferred axioms
    • Consistency
  2. For all algorithms that pass the benchmark, run them against PheKnowLator

    • Including disjointness axioms
    • Excluding disjointness axioms
  3. Clinician Evaluation via @jwyrwa

    • Create a spreadsheet of the inferred axioms by algorithm and mark them as:
      • Correct/Incorrect
      • Definitely clinically relevant, maybe clinically relevant, or not clinically relevant

@bill-baumgartner - did I forget anything?

Add integer and identifiers to node metadata

Problem: Right now, the node metadata that is output is keyed by an identifier, which means that if you use the integer edge lists but want node labels, you first have to use the provided dictionary that maps node integers to identifiers.

Solution: In the next iteration, I will add a new column that includes both the identifier and the integer. Examples of each output are shown below.

An example of the current output:

node_id | label  | description/definition                                                                               | synonym
388324  | INCA1  | INCA1 has locus group 'protein-coding' and is located on chromosome 17 (map_location: 17p13.2).      | HSD45; protein INCA1
92106   | OXNAD1 | OXNAD1 has locus group 'protein-coding' and is located on chromosome 3 (map_location: 3p25.1-p24.3). | oxidoreductase NAD-binding domain-containing protein 1
56140   | PCDHA8 | PCDHA8 has locus group 'protein-coding' and is located on chromosome 5 (map_location: 5q31.3).       | PCDH-ALPHA8; protocadherin alpha-8

An example of the improved output:

node_integer | node_id | label  | description/definition                                                                               | synonym
0            | 388324  | INCA1  | INCA1 has locus group 'protein-coding' and is located on chromosome 17 (map_location: 17p13.2).      | HSD45; protein INCA1
1            | 92106   | OXNAD1 | OXNAD1 has locus group 'protein-coding' and is located on chromosome 3 (map_location: 3p25.1-p24.3). | oxidoreductase NAD-binding domain-containing protein 1
2            | 56140   | PCDHA8 | PCDHA8 has locus group 'protein-coding' and is located on chromosome 5 (map_location: 5q31.3).       | PCDH-ALPHA8; protocadherin alpha-8
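
For context, the sketch below shows the two-step lookup the current output forces on users of the integer edge lists; the file names are hypothetical. With the proposed node_integer column, this merge becomes unnecessary because the integer ships in the metadata file itself.

import pandas as pd

# Current workflow: join the integer-to-identifier map against the
# identifier-keyed metadata before labels can be used (file names hypothetical)
node_int_map = pd.read_csv('node_integer_identifier_map.tsv', sep='\t')  # columns: node_integer, node_id
metadata = pd.read_csv('node_metadata.tsv', sep='\t')                    # columns: node_id, label, ...
labeled = node_int_map.merge(metadata, on='node_id', how='left')
print(labeled[['node_integer', 'node_id', 'label']].head())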

Build V3 - edge_list.py Work Needed

V3 Build Changes.
Script: edge_list.py

Requested Changes:

  • using eval() to handle the filtering of downloaded data; consider replacing this usage in the filter_data() method
  • modify the data_reader() method to stream/chunk large data files instead of reading them all into memory (a sketch of this behavior follows below)
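
A minimal sketch of the requested streaming behavior, assuming a tab-delimited edge file and a hypothetical filter column (the function name is illustrative, not part of the current codebase):

import pandas as pd

def stream_filter(path, column, keep_value, chunksize=100_000):
    """Reads a large edge file in fixed-size chunks and filters each chunk,
    rather than loading the entire file into memory at once."""
    filtered = [chunk[chunk[column] == keep_value]
                for chunk in pd.read_csv(path, sep='\t', chunksize=chunksize)]
    return pd.concat(filtered, ignore_index=True)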

Coding: Improve storage of original triples when running OWL-NETS

TASK

Task Type: CODEBASE

Improve the storage of removed OWL semantics when running the OWL-NETS version of the build. Replace the dictionary constraint with something network-based.

TODO

  • Come up with a better solution to handle triples removed from the full KG when creating OWL-NETS
    • Information that is compressed when converting a class from OWL to OWL-NETS
    • Information that is purposefully ignored or deleted (i.e. specific triples, disjointness)
  • Determine if we want to update OWL-NETS to follow any of the OWL to RDF specifications mentioned in the OWL2Vec* paper:
    • W3C Defined Graph Mapping (here)
    • Projection rule-based approach (here)

@bill-baumgartner - can we talk about the other OWL transformations soon? From what I can tell, OWL-NETS is a more extreme version of the transformations described above.

Other: Also use persistent RDFlib store for output graphs

Once a graph has been built, it may be useful to also import the resulting .owl file into an RDFlib persistent store. Use of a persistent store allows for the graph to be accessed using RDFlib without having to import the entire structure into memory, which may be advantageous when working with large graphs. Below is a sample implementation that uses the Berkeley Database as a persistent backend. RDFlib has built-in support for this particular backend. Note that Berkeley DB was formerly developed by Sleepycat Software, hence the use of "Sleepycat" as the backend name when creating the Graph object.

import rdflib
# The persistent store requires an identifier
graph_id = rdflib.URIRef(identifier)
# Open the graph with the "Sleepycat" Berkeley DB Backend
graph = rdflib.Graph("Sleepycat", identifier=graph_id)
# Open the graph and create it if it doesn't exist
graph.open(uri, create=True)
# Parse the graph at 'graph_path', typically XML formatted
# This could take many hours if the graph is large
graph.parse(graph_path)
# Close the graph to free resources. Mostly unnecessary due
# to the small overhead of the on-disk store
graph.close()

Alternatively, the following code wraps the above functionality in a context manager, allowing the graph to be managed inside of a with block for convenience:

from contextlib import contextmanager
import rdflib


@contextmanager
def open_persistent_graph(uri, identifier, graph_path=None):
    """Provides a context manager for working with an OWL graph while also
    automatically closing it afterward. URI is the location of the
    graph store directory and IDENTIFIER is the name of the graph
    within that store. Optional argument GRAPH_PATH specifies an
    appropriately formatted RDF file to import when opening the graph.

    """
    # Only force create if a path is provided
    create_graph = bool(graph_path)
    # Create the graph object before entering the try block so that the
    # finally clause never references an unbound name
    graph_id = rdflib.URIRef(identifier)
    graph = rdflib.Graph("Sleepycat", identifier=graph_id)
    try:
        # Open and load the on-disk store
        graph.open(uri, create=create_graph)
        # Parse the file at GRAPH_PATH if set
        if graph_path:
            graph.parse(graph_path)
        yield graph
    finally:
        graph.close()
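
A hypothetical usage example follows; the store directory, graph name, and input file are placeholders. Note that the Sleepycat store also requires the bsddb3 package and was renamed in newer RDFlib releases, so this assumes an older RDFlib version.

# Import a build into the store once, then reopen it cheaply afterward
with open_persistent_graph('./kg_store', 'pheknowlator', graph_path='PheKnowLator_full.owl') as graph:
    print(len(graph))  # triple count, served from the on-disk store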

TODO: Create YouTube PKT Video

TASK

Task Type: PKT DATA DELIVERY

Create a YouTube video (maybe more than one) that can help new users understand the codebase and learn how to build their own KG and/or obtain data on current releases/builds

TODO

  • Create a YouTube video that provides a walk through the repo and includes examples of:
    • Where to find a current build's data
    • How to build the project from PyPI vs. command line vs. Jupyter Notebook vs. Docker

HELP - Verifying README Content

Thanks so much for being willing to help with this too @jwyrwa! 🙏

I am hoping that you can proof each of the following README pages (listed below) to verify each is free of spelling/grammar errors and to make sure that the content makes sense (i.e. if you were trying to use this repo, this content would be helpful):

TODO: Finalize KG Construction Survey

I am working on the qualitative component of our evaluation and am requesting your review of a Google Form I created to help organize this information.

TODO: Please take a look at the Google Form (link to form can be found here) and let me know if you have any edits by 11:59pm on May 22, 2020.

Verify KG Output File Types

TASK

Task Type: CODEBASE

Improve the naming of generated data and verify the output file types we will provide for each build.

TODO

Problem:

  • Add n-triples format for OWL-NETS builds
  • Check input file specifications for GraphDB
  • Check input file specifications for Neo4J
  • Fix OWL-NETS build output
    • Problem: When running the OWL-NETS parameter, the full .owl file created during the build is named "NoOWL" when it is actually the original OWL KG. Make sure to fix the name of this file to ensure it's clear that it is not part of the OWL-NETS output. Thanks to @rkboyce for helping identify this error!

Update Dockerfile

See issue #43 - Need to update current Dockerfile to copy pkt repo rather than cloning from GitHub.

Bug: Instance-based OWL-NETS builds have anonymous nodes

Bug: Thanks to @MSBradshaw for noticing that the OWL-NETS code, when applied to the instance-based builds, includes blank nodes.

TODO: Clean up the code to identify what the blank nodes are and add tests to catch this type of error in the future. A potential fix could be something as simple as replacing the instance URI with the node URI that is represented by that instance. A quick check for this bug is sketched below.
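
A minimal sketch of such a test, assuming the NetworkX pickle keeps rdflib terms as node objects; the file name is hypothetical:

import pickle
from rdflib import BNode

# Regression check: an OWL-NETS graph should contain no blank nodes
with open('PheKnowLator_OWLNETS_NetworkxMultiDiGraph.gpickle', 'rb') as f:
    graph = pickle.load(f)

blank_nodes = [n for n in graph.nodes() if isinstance(n, BNode)]
assert not blank_nodes, f'OWL-NETS graph contains {len(blank_nodes)} blank nodes'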

HELP: Creating New Ontology Classes with Constructors

@bill-baumgartner

I was hoping I could ask for your advice on how to go about adding new terms, which reflect my many-HP-concepts-to-1-clinical-concept mappings, to an existing ontology. I understand that I will create a new class, give it an identifier (making sure that identifier does not already exist in HP) and a label, and then create the connection between it and existing terms using an equivalent class. This equivalent class would be constructed using the and (owl:intersectionOf), or (owl:unionOf), and not (owl:complementOf) operators.

OK, so with that in mind, I'm not entirely sure that I fully understand how to do this in a way that still permits me to close my knowledge graph. Below I include 3 examples I found in HP/CL, along with my attempt at applying this logic to my own cases. Note that I reuse the anonymous nodes from the examples for ease.

Would you mind taking a look and letting me know if this is correct?




EXAMPLE 1: owl:intersectionOf
Class: http://purl.obolibrary.org/obo/HP_0040261

class 'has part'
  some ('increased size'
    and ('inheres in' some 'pharyngeal tonsil')
    and ('has modifier' some 'abnormal'))
# class has_part
http://purl.obolibrary.org/obo/HP_0040261, http://www.w3.org/2002/07/owl#equivalentClass, Nf8a96b7801764cb7859eb8289ee641e4
Nf8a96b7801764cb7859eb8289ee641e4, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Restriction
Nf8a96b7801764cb7859eb8289ee641e4, http://www.w3.org/2002/07/owl#onProperty, http://purl.obolibrary.org/obo/BFO_0000051

# some
Nf8a96b7801764cb7859eb8289ee641e4, http://www.w3.org/2002/07/owl#someValuesFrom, N21ee3aecd67e44f3a64920531702a4a7
N21ee3aecd67e44f3a64920531702a4a7, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Class

# and  
N21ee3aecd67e44f3a64920531702a4a7, http://www.w3.org/2002/07/owl#intersectionOf, Nb60bf88cd2554c9d8ee95d3bd8d7caf5
Nb60bf88cd2554c9d8ee95d3bd8d7caf5, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, Nc5e5c1d339564599a1824bbc4d35b0ec

# increased size
Nb60bf88cd2554c9d8ee95d3bd8d7caf5, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/PATO_0000586
Nb60bf88cd2554c9d8ee95d3bd8d7caf5, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, Nca0ddadf8e974e9a91db380d88cd5ad7

# inheres in 
Nca0ddadf8e974e9a91db380d88cd5ad7, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, Nfca2a5a88bc44e5da8ff8a38601fd834
Nca0ddadf8e974e9a91db380d88cd5ad7, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, N3a404d2e114e46da98d369033e6ee7c0
Nfca2a5a88bc44e5da8ff8a38601fd834, http://www.w3.org/2002/07/owl#onProperty, http://purl.obolibrary.org/obo/RO_0000052
Nfca2a5a88bc44e5da8ff8a38601fd834, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Restriction

# some pharyngeal tonsil
Nfca2a5a88bc44e5da8ff8a38601fd834, http://www.w3.org/2002/07/owl#someValuesFrom, http://purl.obolibrary.org/obo/UBERON_0001732

# and
N3a404d2e114e46da98d369033e6ee7c0, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, Nc5e5c1d339564599a1824bbc4d35b0ec
N3a404d2e114e46da98d369033e6ee7c0, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil

# has modifier abnormal
Nc5e5c1d339564599a1824bbc4d35b0ec, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Restriction
Nc5e5c1d339564599a1824bbc4d35b0ec, http://www.w3.org/2002/07/owl#onProperty, http://purl.obolibrary.org/obo/RO_0002573
Nc5e5c1d339564599a1824bbc4d35b0ec, http://www.w3.org/2002/07/owl#someValuesFrom, http://purl.obolibrary.org/obo/PATO_0000460

My Example
The OMOP_4128371 concept (Acute rejection of renal transplant) maps to AND(Acute, Renal insufficiency, Status post organ transplantation)

Class: https://github.com/callahantiff/PheKnowLator/obo/ext/OMOP_4128371 (Or does this need to be a NEW HP term?)

class 'has part'
  some ('acute'
    and ('renal insufficiency' some 'status post organ transplantation'))
# class has_part
https://github.com/callahantiff/PheKnowLator/obo/ext/OMOP_4128371, http://www.w3.org/2002/07/owl#equivalentClass, Nf8a96b7801764cb7859eb8289ee641e4
Nf8a96b7801764cb7859eb8289ee641e4, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Restriction
Nf8a96b7801764cb7859eb8289ee641e4, http://www.w3.org/2002/07/owl#onProperty, http://purl.obolibrary.org/obo/BFO_0000051

# some
Nf8a96b7801764cb7859eb8289ee641e4, http://www.w3.org/2002/07/owl#someValuesFrom, N21ee3aecd67e44f3a64920531702a4a7 
N21ee3aecd67e44f3a64920531702a4a7, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Class

# and  
N21ee3aecd67e44f3a64920531702a4a7, http://www.w3.org/2002/07/owl#intersectionOf, Nb60bf88cd2554c9d8ee95d3bd8d7caf5
Nb60bf88cd2554c9d8ee95d3bd8d7caf5, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, Nc5e5c1d339564599a1824bbc4d35b0ec

# acute
Nb60bf88cd2554c9d8ee95d3bd8d7caf5, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/HP_0011009
Nb60bf88cd2554c9d8ee95d3bd8d7caf5, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, Nca0ddadf8e974e9a91db380d88cd5ad7

# and
Nca0ddadf8e974e9a91db380d88cd5ad7, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, Nfca2a5a88bc44e5da8ff8a38601fd834
Nca0ddadf8e974e9a91db380d88cd5ad7, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil

# renal insufficiency
Nfca2a5a88bc44e5da8ff8a38601fd834, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/HP_0000083
Nfca2a5a88bc44e5da8ff8a38601fd834, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil

# and
Nc5e5c1d339564599a1824bbc4d35b0ec, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, N3a404d2e114e46da98d369033e6ee7c0
Nc5e5c1d339564599a1824bbc4d35b0ec, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil

# status post organ transplantation
N3a404d2e114e46da98d369033e6ee7c0, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/HP_0032444
N3a404d2e114e46da98d369033e6ee7c0, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil



EXAMPLE 2: owl:unionOf
Class: http://purl.obolibrary.org/obo/HP_0100258

class 'has part'
  some ('Preaxial hand polydactyly'
    or 'Preaxial foot polydactyly')
# some
http://purl.obolibrary.org/obo/HP_0100258, http://www.w3.org/2002/07/owl#equivalentClass, N26938016efc64d77800b84e87e7ec4f9
N26938016efc64d77800b84e87e7ec4f9, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Restriction
N26938016efc64d77800b84e87e7ec4f9, http://www.w3.org/2002/07/owl#onProperty, http://purl.obolibrary.org/obo/BFO_0000051
N26938016efc64d77800b84e87e7ec4f9, http://www.w3.org/2002/07/owl#someValuesFrom, N1619f38e707c466db37a30fc78b91db5

# or
N1619f38e707c466db37a30fc78b91db5, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Class
N1619f38e707c466db37a30fc78b91db5, http://www.w3.org/2002/07/owl#unionOf, N931ca444e1c84f1ea3e1250819f9f47d

# preaxial hand polydactyly
N931ca444e1c84f1ea3e1250819f9f47d, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/HP_0001177
N931ca444e1c84f1ea3e1250819f9f47d, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, Nd5c2d15afbda49e89f803d0a426523f2

# preaxial foot polydactyly
Nd5c2d15afbda49e89f803d0a426523f2, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/HP_0001841
Nd5c2d15afbda49e89f803d0a426523f2, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil

My Example
The OMOP_4048191 concept (Enlargement of tonsil or adenoid) maps to OR(Enlarged tonsils, Increased size of nasopharyngeal adenoids)

Class: https://github.com/callahantiff/PheKnowLator/obo/ext/OMOP_4048191 (Or does this need to be a NEW HP term?)

class 'has part'
  some ('Enlarged tonsils'
    or 'Increased size of nasopharyngeal adenoids')
# some
https://github.com/callahantiff/PheKnowLator/obo/ext/OMOP_4048191, http://www.w3.org/2002/07/owl#equivalentClass, N26938016efc64d77800b84e87e7ec4f9
N26938016efc64d77800b84e87e7ec4f9, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Restriction
N26938016efc64d77800b84e87e7ec4f9, http://www.w3.org/2002/07/owl#onProperty, http://purl.obolibrary.org/obo/BFO_0000051
N26938016efc64d77800b84e87e7ec4f9, http://www.w3.org/2002/07/owl#someValuesFrom, N1619f38e707c466db37a30fc78b91db5

# or
N1619f38e707c466db37a30fc78b91db5, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Class
N1619f38e707c466db37a30fc78b91db5, http://www.w3.org/2002/07/owl#unionOf, N931ca444e1c84f1ea3e1250819f9f47d

# enlarged tonsils
N931ca444e1c84f1ea3e1250819f9f47d, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/HP_0030812
N931ca444e1c84f1ea3e1250819f9f47d, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, Nd5c2d15afbda49e89f803d0a426523f2

# increased size of nasopharyngeal adenoids
Nd5c2d15afbda49e89f803d0a426523f2, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/HP_0040261
Nd5c2d15afbda49e89f803d0a426523f2, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil



EXAMPLE 3: owl:complementOf
Class: http://purl.obolibrary.org/obo/CL_0001068

class 'group 1 innate lymphoid cell'
  and (not ('capable of'
    some 'leukocyte mediated cytotoxicity'))
# and group 1 innate lymphoid cell
http://purl.obolibrary.org/obo/CL_0001068, http://www.w3.org/2002/07/owl#equivalentClass, N3b65c766ed0b4fa5a5a8cca6d1371af9
N3b65c766ed0b4fa5a5a8cca6d1371af9, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Class
N3b65c766ed0b4fa5a5a8cca6d1371af9, http://www.w3.org/2002/07/owl#intersectionOf, N00ff83ae947d44fba9c2a2ec4c4a443b
N00ff83ae947d44fba9c2a2ec4c4a443b, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/CL_0001067
N00ff83ae947d44fba9c2a2ec4c4a443b, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, N5332defaa8f149d28d5ea2b137ea6315

# not
N5332defaa8f149d28d5ea2b137ea6315, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, N0e9ef34e56a7491eae04492497dcf34d
N5332defaa8f149d28d5ea2b137ea6315, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil
N0e9ef34e56a7491eae04492497dcf34d, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Class
N0e9ef34e56a7491eae04492497dcf34d, http://www.w3.org/2002/07/owl#complementOf, N9df33997e3d14fea8be760c624beae89

# capable of some leukocyte mediated cytotoxicity
N9df33997e3d14fea8be760c624beae89, http://www.w3.org/2002/07/owl#onProperty, http://purl.obolibrary.org/obo/RO_0002215
N9df33997e3d14fea8be760c624beae89, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Restriction
N9df33997e3d14fea8be760c624beae89, http://www.w3.org/2002/07/owl#someValuesFrom, http://purl.obolibrary.org/obo/GO_0001909

My Example
The OMOP_4021760 concept (Non-infectious pneumonia) maps to AND(pneumonia, NOT(disease by infectious agent))

Class: https://github.com/callahantiff/PheKnowLator/obo/ext/OMOP_4021760 (Or does this need to be a NEW DOID term?)

class 'pneumonia'
  and (not (
    some 'disease by infectious agent'))
# and pneumonia
https://github.com/callahantiff/PheKnowLator/obo/ext/OMOP_4021760, http://www.w3.org/2002/07/owl#equivalentClass, N3b65c766ed0b4fa5a5a8cca6d1371af9  
N3b65c766ed0b4fa5a5a8cca6d1371af9, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Class
N3b65c766ed0b4fa5a5a8cca6d1371af9, http://www.w3.org/2002/07/owl#intersectionOf, N00ff83ae947d44fba9c2a2ec4c4a443b
N00ff83ae947d44fba9c2a2ec4c4a443b, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://purl.obolibrary.org/obo/DOID_552
N00ff83ae947d44fba9c2a2ec4c4a443b, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, N5332defaa8f149d28d5ea2b137ea6315

# not
N5332defaa8f149d28d5ea2b137ea6315, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, N0e9ef34e56a7491eae04492497dcf34d
N5332defaa8f149d28d5ea2b137ea6315, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil
N0e9ef34e56a7491eae04492497dcf34d, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Class
N0e9ef34e56a7491eae04492497dcf34d, http://www.w3.org/2002/07/owl#complementOf, N9df33997e3d14fea8be760c624beae89

# disease by infectious agent
N9df33997e3d14fea8be760c624beae89, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#Restriction
N9df33997e3d14fea8be760c624beae89, http://www.w3.org/2002/07/owl#someValuesFrom, http://purl.obolibrary.org/obo/DOID_0050117

Coding: Add Jupyter Notebook to Dockerfile

Enhancement: It would be useful to add a Jupyter Notebook to the Dockerfile to facilitate running the example notebooks within a Docker container. This would also include an updated Docker usage example in the README. I'm happy to put forth a pull request with this if you'd like, @callahantiff - just let me know. The work you're doing is very cool!

Add Baseline KG Embeddings To CI/CD

TASK

Task Type: CODEBASE

Create code that generates a single set of baseline embeddings for each KG build output by PheKnowLator. This code should be run as part of the CI/CD workflow.

TODO

  • Choose an embedding method
  • Select reasonable hyperparameter settings for each OWL and OWL-NETS build
  • Incorporate build into CI/CD workflow (#68)

TODO: Update DOID nodes with MONDO Identifiers

TASK

Task Type: CODEBASE
Update all DOID identifiers with MONDO identifiers

TODO

Script(s) Impacted: Data_Preparation.ipynb

Proposed Solution:

  • Add a small function to the notebook that pulls all DbXRefs from MONDO to DOID and stores them as a dictionary (a sketch follows below)
  • Add dictionary with mappings to Wiki
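
A minimal sketch of the proposed function, assuming a local copy of mondo.owl; MONDO stores its cross-references as oboInOwl:hasDbXref literals of the form "DOID:1234":

from rdflib import Graph, URIRef

HAS_DBXREF = URIRef('http://www.geneontology.org/formats/oboInOwl#hasDbXref')

# Map DOID cross-references to their corresponding MONDO classes
mondo = Graph().parse('mondo.owl')
doid_to_mondo = {}
for mondo_class, _, xref in mondo.triples((None, HAS_DBXREF, None)):
    if str(xref).startswith('DOID:'):
        doid_to_mondo[str(xref)] = str(mondo_class)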

Other: Add sparse KG representation output

TASK

Task Type: CODEBASE

TODO

  • Consider including an additional KG output format that is sparse. A sparse representation is needed for most graph representation learning algorithms and significantly decreases the time needed to load the graph into memory (see the sketch below)

Libraries to Consider: CSRGraph
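
As a sketch of what this output could look like with SciPy, assuming a two-column, integer-encoded subject/object edge list (the file names are hypothetical):

import numpy as np
from scipy.sparse import coo_matrix, save_npz

# Build a sparse adjacency matrix from an integer edge list and save it
edges = np.loadtxt('PheKnowLator_edges_integers.txt', dtype=np.int64)
n_nodes = int(edges.max()) + 1
adjacency = coo_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])),
                       shape=(n_nodes, n_nodes))
save_npz('PheKnowLator_adjacency.npz', adjacency.tocsr())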

Enhancement: Improve Networkx MultiDiGraph Metadata

TASK

Task Type: CODEBASE

Improve the node and edge metadata when outputting the NetworkX MultiDiGraph versions of each build. Thanks to @rkboyce, who suggested that we could make very small changes to the current NetworkX graph and drastically improve the usability of the output structure.

TODO

Impacted Scripts:

  • knowledge_graph.py
  • converts_rdflib_to_networkx() in utils/kg_utils.py

Needed Functionality:

  • Add a helper function to utils/kg_utils.py that can be called by converts_rdflib_to_networkx(). The helper function will set graph attributes for edges (a sketch follows below):
    • key: a unique value for each predicate with respect to the triple it appears in; this could be a hash of the triple, as long as the value is unique
    • weight: default to 0
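
A minimal sketch of the proposed helper; the function name is illustrative, and it assumes the MultiDiGraph produced by converts_rdflib_to_networkx(), where the predicate serves as the edge key:

import hashlib
import networkx as nx

def sets_edge_attributes(nx_graph: nx.MultiDiGraph) -> None:
    """Adds a unique 'key' (a hash of the triple) and a default
    'weight' of 0 to every edge in the graph."""
    for subj, obj, pred in nx_graph.edges(keys=True):
        triple_hash = hashlib.md5(f'{subj}{pred}{obj}'.encode()).hexdigest()
        nx_graph[subj][obj][pred]['key'] = triple_hash
        nx_graph[subj][obj][pred]['weight'] = 0.0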

@rkboyce, can you please verify that I have covered the needed changes that we discussed this week correctly above?

I will also be implementing a few changes to the OWL-NETS architecture (issue #56) and will be storing the collapsed semantic information from the full graph as attributes of the transformed OWL-NETS graph, likely in the form of edge and node dictionary entries.

Codacy and Code Climate test coverage set-up

It looks like both the Codacy and Code Climate test coverage reports are not being generated correctly. The web dashboards for each of these apps state that coverage still needs to be set up.

@LucaCappelletti94 - would you be willing to take a look at this with me? I'm sure it is something simple I am missing.

TODO - Project Organization: add contributing information

Task: Add documentation for contributing

Description: Create or modify contribution information for the project. A good example of how to do this can be seen here.

Here is a general outline:

Contributing to the PheKnowLator Project

🎉 👏 First off, thanks so much for being willing to contribute to our project! 👍 :bowtie:

We welcome contributions to our project and ask that you please follow the Code of Conduct.


We Support Reproducible Research

Please also take a look at how we use GitHub to enable reproducible research. We are also working on creating guidelines we would like our project collaborators to follow. Please take a look; if you have suggestions, we'd love to hear them here.


Contributing

Issues

When contributing to this repository, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository before making a change. Whenever possible, the issue templates should be selected according to their description:

  • Bug report: A built-in issue template that should be used when you find an issue in the code base that needs to be fixed.
  • Coding Tasks: A custom template that should be used when you want to request a change be made to existing code or when you want to suggest new code that could be added to the code base.
  • Feature request: A built-in issue template that is used when you have a new idea or suggestion that you would like to share with the project developers.
  • Help: A custom template that should be used when you have a question on how to contribute to the repository. This can also be used as a place for asking any question about contributing to this repository.
  • Manuscript Tasks: A custom template that should be used when you want to create a task that is related to a manuscript being written about/using this project.
  • Meetings: A custom template that should be used when you want follow-up on a task assigned during a meeting or when you want to suggest a new topic for discussion at an upcoming meeting.
  • Other: A custom template that should be used when you are unable to use any of the other issue templates (e.g. general questions about the project).
  • Project Organization Tasks: A custom template that should be used when you want to add a task related to the organization of the project (e.g. adding collaborators or modifying project boards or milestones).
  • Wiki: A custom template that should be used when you want to suggest an edit to the project Wiki page.

Once you have selected the type of issue you want to submit, you will be presented with an empty template, specific to that issue, and asked to provide certain information.

Pull Requests

In general, we follow the "fork-and-pull" Git workflow.

  1. Fork the repo on GitHub
  2. Clone the project to your own machine
  3. Commit changes to your own branch
  4. Push your work back up to your fork
  5. Submit a Pull request to our development branch so that we can review and add your changes

NOTE: Be sure to merge the latest from "development" before making a pull request!

Code of Conduct

Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to creating a positive environment include:

  • Using welcoming and inclusive language
  • Being respectful of differing viewpoints and experiences
  • Gracefully accepting constructive criticism
  • Focusing on what is best for the community
  • Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

  • The use of sexualized language or imagery and unwelcome sexual attention or
    advances
  • Trolling, insulting/derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others' private information, such as a physical or electronic
    address, without explicit permission
  • Other conduct which could reasonably be considered inappropriate in a
    professional setting

Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team here. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

Attribution

This document was inspired by Atom's CONTRIBUTING documentation.

This Code of Conduct is adapted from the Contributor Covenant, version 1.4,
available at http://contributor-covenant.org/version/1/4

Handling Unicode Encoding Errors in Ontology Metadata

Problem: Unicode errors occur when writing out knowledge graph metadata locally, depending on the OS and Python version used.

Script: metadata.py

Current Solution: encode/decode ontology term labels, definitions, and synonyms, explicitly ignoring UnicodeEncodeError.

Proposed Solution: Add functionality to better handle the processing of UnicodeEncodeError (the current workaround and one alternative are sketched below)
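
For reference, a minimal sketch of the current workaround and one possible improvement; the function names are illustrative, not taken from metadata.py:

import unicodedata

# Current workaround: round-trip the string, silently dropping characters
# the target encoding cannot represent
def cleans_metadata_string(text: str) -> str:
    return text.encode('ascii', errors='ignore').decode('ascii')

# Possible improvement: normalize first so accented characters degrade to
# their ASCII base form instead of disappearing entirely
def normalizes_metadata_string(text: str) -> str:
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode('ascii')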

Build V3 - knowledge_graph.py

V3 Build Changes.
Script: knowledge_graph.py

Requested Changes:

  • Extend functionality of the code to improve KR for:
    • Instance-based builds that include connections between 2 instance nodes (for example, the complex and reaction nodes from Reactome, shown in the figure below)
    • The ability to combine instance and subclass-based methods

[Figure: PheKnowLator v2.0.0 knowledge representation, instance-based build]

Set-up SPARQL Endpoint

TASK

Task Type: PKT DATA DELIVERY

Select and set-up a SPARQL endpoint for exploring KG build data

TODO

  • Pick an endpoint. Here is a Medium article that compares and contrasts existing triplestores. Considering what we care about (Docker compatibility, RDF support, and querying speed), I have selected a few and ordered them from best to worst:
  • Configure with CI/CD
  • Figure out where to host it
    • Through Google Cloud Run?

Questions:

  • @bill-baumgartner - Which one do you think we should use?
  • Two versions are needed: 1 for our production build and 1 for those who want to build their own

TODO: Move print statements to logging

TASK

Task Type: CODEBASE

There are a lot of print statements created during the KG build that provide useful information as well as track build progress. Most of these should be moved to logging.

TODO

  • Move all print statements and processing to logging (a sketch of the target pattern is shown below)
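
A minimal sketch of the target pattern; the log file name and message are illustrative:

import logging

logging.basicConfig(filename='pkt_kg_build.log', level=logging.INFO,
                    format='%(asctime)s %(name)s %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# before: print('Processed {} edges'.format(edge_count))
# after:
edge_count = 0  # placeholder value for illustration
logger.info('Processed %s edges', edge_count)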

Project Meeting -- 09/18/2019 @ 13:00

Meeting Date: September 18, 2019
Topic: Weekly Meeting
Attendees: @bill-baumgartner

Proposed Agenda:

  • Comparing full KG to full KG + reasoner inferences
    • Discuss KaBOB script used to convert anonymous nodes to non-anonymous nodes to properly add edges inferred from running reasoners
  • Continue discussing developing KR for clinical to biological concept mappings (see #12)

KG V2.0.0 - Finalizing Knowledge Representation

Extending Knowledge Representation for current KG

Current Release: v2.0.0

Description
Adding the following entities/data sources to the current KG build:

  • Variants via Clinvar
  • Proteins via PRO
  • New connections from existing ChEBI, GO, and Reactome concepts to proteins and genes - here:
    • protein-protein, RO_0002434 (interacts with)
    • protein-gobp, RO_0000056 (participates in)
    • protein-gomf, RO_0000085 (has function)
    • protein-gocc, RO_0001025 (located in)
    • protein-cofactor/catalyst (ChEBI), RO_0002436 (molecularly interacts with)
    • protein-complex (reactome), RO_0002436 (molecularly interacts with)
    • gene-protein, RO_0002211 (regulates)
    • chemical (ChEBI)-complex, RO_0002436 (molecularly interacts with)
    • complex (reactome)-complex (reactome), RO_0002436 (molecularly interacts with)
    • protein-pathway (reactome), RO_0000056 (participates in)
    • protein-reaction (reactome), RO_0000056 (participates in)

TODO 📋 💻 📝

  • Create edge types to connect variants to KG
  • Verify ontological assumptions for edges provided by @ignaciot to ensure satisfiability and consistency with existing KR
  • Investigate which version of PRO to download, specifically searching for one which only includes human proteins
  • Update KR schema and verify it
  • Update input documentation
  • Add new data sources to wiki

@callahantiff Due Dates:

  • Have KR and Wiki updated and finalized by 10/23/19
  • Begin building KG v2.1.0 by 10/23/19

Setting Up an End-to-End CI/CD Framework

Task

Task Type: INFRASTRUCTURE
Determine which tools we will use in order to set-up an end-to-end CI/CD framework.

TODO

The requirements for this system include:

  • Leveraging GitHub Actions to:
    • Test the codebase
    • Download needed resources and build the Docker container
    • Deploy and run the Docker container via Google Cloud Run (one for each KG build type)
    • Generate baseline embeddings (#71)
    • Return all results
    • Push certain files to a Neo4J instance and the SPARQL endpoint

Potential Configurations:

  • CI/CD with Serverless Containers on GCP - Described here
  • Consider using Google Cloud Composer to kick off the first task of the monthly build process, which downloads and preprocesses the data used for each build (LOD and ontology data)

Proposed Tasks for CI/CD

  • Download all LOD and Ontology data
  • Preprocess and Clean data
  • KG Build

Related GitHub Issues: #47, #49

CTD Data Source - CAPTCHA

Issue: CTD now has a CAPTCHA in place to prevent automatic downloading of data. This impacts the current build, as there is no solution currently in place to work around it.

Temporary Workaround: All CTD data sources need to be manually downloaded to the resources/edge_data directory prior to running the download step of the build. Each downloaded file also needs to be unzipped and have the edge type label appended to the front of the file name (example below; these steps are also sketched in Python at the end of this issue).


File: edge_source_list.txt

chemical-disease, http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz
chemical-gene, http://ctdbase.org/reports/CTD_chem_gene_ixns.tsv.gz
chemical-phenotype, http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz
chemical-protein, http://ctdbase.org/reports/CTD_chem_gene_ixns.tsv.gz

Directory: resources/edge_data/
chemical-disease_CTD_chemicals_diseases.tsv
chemical-gene_CTD_chem_gene_ixns.tsv
chemical-phenotype_CTD_chemicals_diseases.tsv
chemical-protein_CTD_chem_gene_ixns.tsv
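
A sketch of these preparation steps in Python, one edge type shown; because of the CAPTCHA, the download itself may need to happen manually in a browser, with the resulting local file path substituted below:

import gzip
import shutil

# Unzip the manually downloaded CTD file and prepend the edge type label
edge_type = 'chemical-disease'
source_file = 'CTD_chemicals_diseases.tsv.gz'          # manually downloaded
target_file = f'resources/edge_data/{edge_type}_CTD_chemicals_diseases.tsv'

with gzip.open(source_file, 'rb') as f_in, open(target_file, 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)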
