ORION

Operational Routine for the Ingest and Output of Networks

This package takes data sets from various sources and converts them into Knowledge Graphs.

Each data source will go through the following pipeline before it can be included in a graph:

  1. Fetch (retrieve an original data source)
  2. Parse (convert the data source into KGX files)
  3. Normalize (use normalization services to convert identifiers and ontology terms to preferred synonyms)
  4. Supplement (add supplementary knowledge specific to that source)

To build a graph, use a Graph Spec yaml file to specify the sources you want.

ORION will automatically run each data source specified through the necessary pipeline. Then it will merge the specified sources into a Knowledge Graph.
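The order of operations described above can be sketched in a few lines of Python. This is only a conceptual illustration, not ORION's actual code: the names PIPELINE_STAGES and run_source_pipeline are hypothetical, and each stage here is a placeholder function that just prints what it would do.

```python
from typing import Callable, Dict, List

# The four per-source stages, in the order listed above.
PIPELINE_STAGES = ["fetch", "parse", "normalize", "supplement"]


def run_source_pipeline(source_id: str,
                        stages: Dict[str, Callable[[str], None]]) -> List[str]:
    """Run every stage for one source, in order; return the stages completed."""
    completed = []
    for stage_name in PIPELINE_STAGES:
        stages[stage_name](source_id)
        completed.append(stage_name)
    return completed


def make_logging_stage(stage_name: str) -> Callable[[str], None]:
    """Placeholder stage (hypothetical): prints instead of doing real work."""
    def stage(source_id: str) -> None:
        print(f"{stage_name}: {source_id}")
    return stage


demo_stages = {name: make_logging_stage(name) for name in PIPELINE_STAGES}
demo_sources = ["Biolink", "HGNC"]
for demo_source in demo_sources:
    run_source_pipeline(demo_source, demo_stages)
# Only after every source has finished its pipeline are the sources merged.
print("merge: " + ", ".join(demo_sources))
```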

Using ORION

Create a parent directory:

mkdir ~/ORION_root

Clone the code repository:

cd ~/ORION_root
git clone https://github.com/RobokopU24/ORION.git

Next, create directories where data sources, graphs, and logs will be stored:

  ORION_STORAGE - for storing data sources
  ORION_GRAPHS - for storing knowledge graphs
  ORION_LOGS - for storing logs

You can do this manually (Option 1), or use the provided script to set up a standard configuration (Option 2).

Option 1: Create three directories and set environment variables specifying paths to the locations of those directories.

mkdir ~/ORION_root/storage/
export ORION_STORAGE=~/ORION_root/storage/ 

mkdir ~/ORION_root/graphs/
export ORION_GRAPHS=~/ORION_root/graphs/

mkdir ~/ORION_root/logs/
export ORION_LOGS=~/ORION_root/logs/

Option 2: Use this script to create the directories and set the environment variables:

cd ~/ORION_root/ORION/
source ./set_up_test_env.sh
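Whichever option you choose, it can help to confirm that the three variables are set and point to real directories before starting a build. The following is a minimal sketch of such a check; the function name find_directory_problems is hypothetical and not part of ORION.

```python
import os
from typing import List, Mapping

# The three environment variables named in this README.
REQUIRED_DIR_VARS = ("ORION_STORAGE", "ORION_GRAPHS", "ORION_LOGS")


def find_directory_problems(env: Mapping[str, str]) -> List[str]:
    """Return one message per variable that is unset or not an existing directory."""
    problems = []
    for name in REQUIRED_DIR_VARS:
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(os.path.expanduser(path)):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems


# Check the current shell environment and report anything that looks wrong.
for problem in find_directory_problems(os.environ):
    print(problem)
```

An empty result means all three directories were found.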

Next, create or select a Graph Spec yaml file that specifies the content of the knowledge graphs to be built.

Use either of the following options, but not both. Note that running the setup script set_up_test_env.sh performs Option 1 for you.

Option 1: ORION_GRAPH_SPEC - the name of a Graph Spec file located in the graph_specs directory of ORION

export ORION_GRAPH_SPEC=testing-graph-spec.yml

Option 2: ORION_GRAPH_SPEC_URL - a URL pointing to a Graph Spec file

export ORION_GRAPH_SPEC_URL=https://example.com/example-graph-spec.yml
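The "either, but not both" rule can be expressed as a small check. This sketch only illustrates the rule as stated here; resolve_graph_spec is a hypothetical helper, and how ORION itself reacts to a conflicting configuration is not described in this README.

```python
from typing import Mapping, Tuple


def resolve_graph_spec(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return ("file", name) or ("url", url); raise if neither or both are set."""
    spec_file = env.get("ORION_GRAPH_SPEC")
    spec_url = env.get("ORION_GRAPH_SPEC_URL")
    if spec_file and spec_url:
        raise ValueError("Set ORION_GRAPH_SPEC or ORION_GRAPH_SPEC_URL, not both")
    if spec_file:
        return ("file", spec_file)
    if spec_url:
        return ("url", spec_url)
    raise ValueError("Set either ORION_GRAPH_SPEC or ORION_GRAPH_SPEC_URL")


# Example with an explicit mapping instead of the real environment.
print(resolve_graph_spec({"ORION_GRAPH_SPEC": "testing-graph-spec.yml"}))
```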

To build a custom graph, alter the Graph Spec file. See the graph_specs directory for examples.

TODO: explain options available in the graph spec (normalization version, source data version can be specified)

graphs:
  - graph_id: Example_Graph_ID
    graph_name: Example Graph
    graph_description: This is a description of what is in the graph.
    output_format: neo4j
    sources:
      - source_id: Biolink
      - source_id: HGNC
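To see how the pieces of the example above relate, here is a rough Python sketch that summarizes a Graph Spec after it has been loaded into a dictionary (for instance with a YAML parser). The helper summarize_graph_spec is hypothetical, and the assumption that graph IDs must be unique is mine, not something this README states.

```python
from typing import Any, Dict, List


def summarize_graph_spec(spec: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each graph_id to the list of source_ids it requests."""
    summary: Dict[str, List[str]] = {}
    for graph in spec.get("graphs", []):
        graph_id = graph["graph_id"]
        if graph_id in summary:
            # Assumption: a graph_id should appear only once per spec.
            raise ValueError(f"duplicate graph_id: {graph_id}")
        summary[graph_id] = [source["source_id"] for source in graph.get("sources", [])]
    return summary


# The example spec above, written as the dictionary a YAML parser would produce.
example_spec = {
    "graphs": [
        {
            "graph_id": "Example_Graph_ID",
            "graph_name": "Example Graph",
            "graph_description": "This is a description of what is in the graph.",
            "output_format": "neo4j",
            "sources": [{"source_id": "Biolink"}, {"source_id": "HGNC"}],
        }
    ]
}
print(summarize_graph_spec(example_spec))
# prints {'Example_Graph_ID': ['Biolink', 'HGNC']}
```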

Install Docker to create and run the necessary containers.

By default, docker-compose up builds every graph in your Graph Spec. It runs the command: python /ORION/Common/build_manager.py all.

docker-compose up

If you want to specify an individual graph you can override the default command with a graph id from your Spec.

docker-compose run --rm data_services python /ORION/Common/build_manager.py Example_Graph_ID

To run the ORION pipeline for a single data source, you can use:

docker-compose run --rm data_services python /ORION/Common/load_manager.py Example_Source

To see available arguments and a list of supported data sources:

python /ORION/Common/load_manager.py -h
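If you launch builds from a script, the commands above can be assembled programmatically. The sketch below is a convenience wrapper of my own, not part of ORION: the function names are hypothetical, and it only constructs and prints the command lines shown in this section without running them.

```python
import shlex
from typing import List

# Common prefix of the docker-compose commands shown above.
COMPOSE_PREFIX = ["docker-compose", "run", "--rm", "data_services", "python"]


def build_graph_command(graph_id: str = "all") -> List[str]:
    """Command to build one graph, or every graph when graph_id is "all"."""
    return COMPOSE_PREFIX + ["/ORION/Common/build_manager.py", graph_id]


def load_source_command(source_id: str) -> List[str]:
    """Command to run the pipeline for a single data source."""
    return COMPOSE_PREFIX + ["/ORION/Common/load_manager.py", source_id]


print(shlex.join(build_graph_command("Example_Graph_ID")))
print(shlex.join(load_source_command("Example_Source")))
```

Either list can be passed to subprocess.run(command, check=True) to execute it.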

For Developers

To add a new data source to ORION, create a new parser. Each parser extends the SourceDataLoader interface in Common/loader_interface.py.

To implement the interface you will need to write a class that fulfills the following.

Set the class level variables for the source ID and provenance:

source_id: str = 'ExampleSourceID'
provenance_id: str = 'infores:example_source'

In initialization, call the parent init function first and pass the initialization arguments. Then set the file names for the data file or files.

super().__init__(test_mode=test_mode, source_data_dir=source_data_dir)

self.data_file = 'example_file.gz'
# OR, when the source has multiple files:
self.example_file_1 = 'example_file_1.csv'
self.example_file_2 = 'example_file_2.csv'
self.data_files = [self.example_file_1, self.example_file_2]

Note that self.data_path is set by the parent class and by default refers to a specific directory for the current version of that source in the storage directory.

Implement get_latest_source_version(). This function should return a string representing the latest available version of the source data.

Implement get_data(). This function should retrieve any source data files. The files should be stored with the file names specified by self.data_file or self.data_files. They should be saved in the directory specified by self.data_path.

Implement parse_data(). This function should parse the data files and populate lists of node and edge objects: self.final_node_list (kgxnode), self.final_edge_list (kgxedge).
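Putting the steps above together, a parser might look roughly like the sketch below. So that it runs on its own, it uses a hypothetical stand-in class instead of the real SourceDataLoader, plain dictionaries instead of kgxnode and kgxedge objects, and a locally written CSV file instead of a download. The identifiers, the version string, the dictionary keys, and the method return types are all made up for illustration; check Common/loader_interface.py for the actual interface.

```python
import csv
import os
import tempfile
from typing import Any, Dict, List, Optional


class SourceDataLoaderStandIn:
    """Hypothetical stand-in for SourceDataLoader so the sketch runs without ORION."""

    def __init__(self, test_mode: bool = False, source_data_dir: Optional[str] = None):
        self.test_mode = test_mode
        # Assumption: the real parent class derives a versioned directory here.
        self.data_path = source_data_dir
        self.final_node_list: List[Dict[str, Any]] = []
        self.final_edge_list: List[Dict[str, Any]] = []


class ExampleLoader(SourceDataLoaderStandIn):
    source_id: str = 'ExampleSourceID'
    provenance_id: str = 'infores:example_source'

    def __init__(self, test_mode: bool = False, source_data_dir: Optional[str] = None):
        # Call the parent init first, then set the data file name.
        super().__init__(test_mode=test_mode, source_data_dir=source_data_dir)
        self.data_file = 'example_file.csv'

    def get_latest_source_version(self) -> str:
        # A real parser would ask the source itself; this value is a placeholder.
        return 'example_version_1'

    def get_data(self) -> None:
        # A real parser would download the file; here one data row is written locally.
        file_path = os.path.join(self.data_path, self.data_file)
        with open(file_path, 'w', newline='') as data_file:
            writer = csv.writer(data_file)
            writer.writerow(['subject', 'predicate', 'object'])
            writer.writerow(['EX:1', 'biolink:related_to', 'EX:2'])

    def parse_data(self) -> None:
        file_path = os.path.join(self.data_path, self.data_file)
        seen_node_ids = set()
        with open(file_path, newline='') as data_file:
            for row in csv.DictReader(data_file):
                for node_id in (row['subject'], row['object']):
                    if node_id not in seen_node_ids:
                        seen_node_ids.add(node_id)
                        # Plain dict standing in for a kgxnode object.
                        self.final_node_list.append({'id': node_id})
                # Plain dict standing in for a kgxedge object.
                self.final_edge_list.append({
                    'subject': row['subject'],
                    'predicate': row['predicate'],
                    'object': row['object'],
                    'provenance': self.provenance_id,
                })


def run_example() -> ExampleLoader:
    """Fetch and parse inside a temporary directory, then return the loader."""
    with tempfile.TemporaryDirectory() as temp_dir:
        example_loader = ExampleLoader(test_mode=True, source_data_dir=temp_dir)
        example_loader.get_data()
        example_loader.parse_data()
    return example_loader


loader = run_example()
print(len(loader.final_node_list), len(loader.final_edge_list))
# prints 2 1
```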

Finally, add your source to the list of sources in Common/data_sources.py. The source ID string here should match the one specified in the new parser. Also add your source to the SOURCE_DATA_LOADER_CLASS_IMPORTS dictionary, mapping it to the new parser class.

Now you can use that source ID in a Graph Spec to include your new source in a graph, or pass it as the source ID to load_manager.py.

Testing and Troubleshooting

After you alter the codebase, or if you are experiencing issues or errors, you may want to run the tests:

docker-compose run --rm data_services pytest /ORION

Contributors

evandietzmorris, phillipsowen, beaslejt, wumirose, beasleyjonm, shalsh23, bpow, cbizon, yaphetkg, dnlrkorn, dependabot[bot], cthoyt
