
nmdc-runtime's Introduction

A runtime system for NMDC data management and orchestration.

Service Status

http://nmdcstatus.polyneme.xyz/

How It Fits In

  • issues
    tracks issues related to NMDC, which may necessitate work across multiple repos.

  • nmdc-schema houses the LinkML schema specification, as well as generated artifacts (e.g. JSON Schema).

  • nmdc-server houses code specific to the data portal -- its database, back-end API, and front-end application.

  • workflow_documentation references workflow code, spread across several repositories, that takes source data and produces computed data.

  • This repo (nmdc-runtime)

    • houses code that takes source data and computed data, and transforms it to broadly accommodate downstream applications such as the data portal
    • manages execution of the above (i.e., lightweight data transformations) and also of computationally- and data-intensive workflows performed at other sites, ensuring that claimed jobs have access to needed configuration and data resources.

Data exports

The NMDC metadata as of 2021-10 is available here:

https://drs.microbiomedata.org/ga4gh/drs/v1/objects/sys086d541

The link returns a GA4GH DRS API bundle object record, with the NMDC metadata collections (study_set, biosample_set, etc.) as contents, each a DRS API blob object.

For example, the blob for the study_set collection export, named "study_set.jsonl.gz", is listed with DRS API ID "sys0xsry70". Thus, it is retrievable via

https://drs.microbiomedata.org/ga4gh/drs/v1/objects/sys0xsry70

The returned blob object record lists https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/2021-10-14/study_set.jsonl.gz as the url for an access method.

The 2021-10 exports are currently all accessible at https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/2021-10-14/${COLLECTION_NAME}.jsonl.gz, but the DRS API indirection allows these links to change in the future, for mirroring via other URLs, etc. So, the DRS API links should be the links you share.
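The bundle and its blobs can also be fetched programmatically. The following Python sketch assumes only the standard GA4GH DRS v1 response layout (a contents list on the bundle object, and access_methods with an access_url on each blob); adjust as needed.

import requests

DRS_BASE = "https://drs.microbiomedata.org/ga4gh/drs/v1/objects"

# List the collections contained in the 2021-10 bundle object.
bundle = requests.get(f"{DRS_BASE}/sys086d541").json()
for item in bundle["contents"]:
    print(item["name"], item.get("id"))

# Fetch the study_set blob record and download it via its access-method URL.
blob = requests.get(f"{DRS_BASE}/sys0xsry70").json()
url = blob["access_methods"][0]["access_url"]["url"]
with open("study_set.jsonl.gz", "wb") as f:
    f.write(requests.get(url).content)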

Overview

The runtime features:

  1. Dagster orchestration:

    • dagit - a web UI to monitor and manage the running system.
    • dagster-daemon - a service that triggers pipeline runs based on time or external state.
    • PostgreSQL database - for storing run history, event logs, and scheduler state.
    • workspace code
      • Code to run is loaded into a Dagster workspace. This code is loaded from one or more dagster repositories. Each Dagster repository may be run with a different Python virtual environment if need be, and may be loaded from a local Python file or pip installed from an external source. In our case, each Dagster repository is simply loaded from a Python file local to the nmdc-runtime GitHub repository, and all code is run in the same Python environment.
      • A Dagster repository consists of solids and pipelines, and optionally schedules and sensors (a minimal code sketch follows this feature list):
        • solids represent individual units of computation
        • pipelines are built up from solids
        • schedules trigger recurring pipeline runs based on time
        • sensors trigger pipeline runs based on external state
      • Each pipeline can declare dependencies on any runtime resources or additional configuration. There are TerminusDB and MongoDB resources defined, as well as preset configuration definitions for both "dev" and "prod" modes. The presets tell Dagster to look to a set of known environment variables to load resource configurations, depending on the mode.
  2. A TerminusDB database supporting revision control of schema-validated data.

  3. A MongoDB database supporting write-once, high-throughput internal data storage by the nmdc-runtime FastAPI instance.

  4. A FastAPI service to interface with the orchestrator and database, as a hub for data management and workflow automation.
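To make the Dagster vocabulary above concrete, here is a minimal sketch in the legacy solid/pipeline API (Dagster 0.x); the names are illustrative and do not come from this repository's workspace code.

from dagster import pipeline, repository, schedule, solid


@solid
def say_hello(context):
    # A solid is an individual unit of computation.
    context.log.info("hello from a solid")


@pipeline
def hello_pipeline():
    # A pipeline is built up from solids.
    say_hello()


@schedule(cron_schedule="0 6 * * *", pipeline_name="hello_pipeline")
def daily_hello(_context):
    # A schedule triggers recurring runs based on time; it returns the
    # run config to use for each scheduled run.
    return {}


@repository
def hello_repository():
    # dagit and dagster-daemon load definitions from repositories like this one.
    return [hello_pipeline, daily_hello]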

Local Development

Ensure Docker (and Docker Compose) is installed and that the Docker engine is running.

docker --version
docker compose version
docker info

Ensure the permissions of ./mongoKeyFile are such that only the file's owner can read or write the file.

chmod 600 ./mongoKeyFile

Ensure you have a .env file for the Docker services to source from. You may copy .env.example to .env (which is gitignore'd) to get started.

cp .env.example .env

Create environment variables in your shell session, based upon the contents of the .env file.

export $(grep -v '^#' .env | xargs)

If you are connecting to resources that require an SSH tunnel—for example, a MongoDB server that is only accessible on the NERSC network—set up the SSH tunnel.

The following command could be useful to you, either directly or as a template (see Makefile).

make nersc-mongo-tunnels

Finally, spin up the Docker Compose stack.

make up-dev

Docker Compose is used to start local MongoDB and PostgreSQL (used by Dagster) instances, as well as a Dagster web server (dagit) and daemon (dagster-daemon).

The Dagit web server is viewable at http://127.0.0.1:3000/.

The FastAPI service is viewable at http://127.0.0.1:8000/ -- e.g., rendered documentation at http://127.0.0.1:8000/redoc/.
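Once the stack is up, a quick sanity check from Python can confirm both services respond (a sketch; it simply assumes the endpoints above return HTTP 200):

import requests

# Confirm the local dagit and FastAPI services are responding.
for name, url in [("dagit", "http://127.0.0.1:3000/"),
                  ("fastapi", "http://127.0.0.1:8000/redoc/")]:
    resp = requests.get(url, timeout=5)
    print(name, resp.status_code)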

Local Testing

Tests can be found in tests and are run with the following commands:

On an M1 Mac? May need to export DOCKER_DEFAULT_PLATFORM=linux/amd64.

make up-test
make test

As you create Dagster solids and pipelines, add tests in tests/ to check that your code behaves as desired and does not break over time.

For hints on how to write tests for solids and pipelines in Dagster, see their documentation tutorial on Testing.
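As a starting point, a solid in the legacy API can be unit-tested directly with execute_solid. The sketch below uses a hypothetical add_one solid rather than one from this repo:

from dagster import execute_solid, solid


@solid
def add_one(_context, num: int) -> int:
    return num + 1


def test_add_one():
    # execute_solid runs a single solid in isolation and returns its result.
    result = execute_solid(add_one, input_values={"num": 1})
    assert result.success
    assert result.output_value() == 2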

Publish to PyPI

This repository contains a GitHub Actions workflow that publishes a Python package to PyPI.

You can also manually publish the Python package to PyPI by issuing the following commands in the root directory of the repository:

rm -rf dist
python -m build
twine upload dist/*

Links

Here are links related to this repository:

nmdc-runtime's People

Contributors

aclum, brynnz22, cmungall, dehays, dwinston, eecavanna, elais, mbthornton-lbl, peoplemakeculture, pkalita-lbl, scanon, shreddd, sujaypatil96, turbomam, wdduncan


nmdc-runtime's Issues

clarify purpose of nmdc_database.json file in this repo and make it easier to

I frequently need access to the latest NMDC sample metadata for various purposes, e.g., analyzing patterns of ENVO usage, GOLD usage, sparsity of MIxS variables, etc. We also have various collaborators, e.g., ENSURABLE, who would like access.

Ultimately this will all be available via the API, but in the interim it would be great to make all this more transparent.

Where should I download data?

It looks like the one in this repo is just test data? I only see 716 GOLD samples, whereas the polyneme URL has 32k GOLD samples.

Does the one in this repo serve any purpose? I recommend removing it if it doesn't.

There was a ticket somewhere about adding metadata to the database object with provenance about when the ingest was performed, what the version/date of the source was, etc. I can't find it now, but we need this.

Release documentation

Document all components and the release process steps, share them before the November test release, and make the documentation available to all team members.

Add github actions to check PRs

I think this is a required part of #56

We have GitHub Actions, but it seems they are just for deployment and don't check PRs. I did an experiment here:

#64

This introduced a deliberate error, yet I am still able to merge:

(screenshot omitted)

S4G1 - Data Processing, Workflow definition

Workflows defined for each of the test raw data files - yes
Metagenome - done
Metabolomics (GCMS?) - done >>> future sprint

Site available with computing capacity to process the raw data files

  • Develop/deploy services to turn NERSC into a site (Shane?)
  • Develop/deploy services to turn EMSL into a site (Yuri) >>>> EMSL will be done in future sprint

Step 2 of Goal 1 for Sprint 4
See https://docs.google.com/document/d/1iBNXkBn24ZkmJkeptoqpyjcz5PAQU59ZjMOzDObnx4E/edit?ts=60d55721#heading=h.gy54da3ooy3b for more details

mongo access over HTTPS

NERSC Spin doesn't allow exposing a MongoDB service at this time.

I propose setting up an Eve REST API to expose external access to the MongoDB.

@jbeezley have you worked with Eve?

data_object_type is not being set on new data objects

New data objects are not getting the data_object_type attribute set on them. It looks like this is done using the regexes from the filter attributes on the Mongo file_type_enum collection.

Kitware then sets the file name and description based on the data_object_type attributes they see. When the attribute is missing, no files are displayed.

{
    "_id": {
        "$oid": "6148a0a9e9822b255a68d6ad"
    },
    "description": "Assembled contigs fasta for gold:Gp0138727",
    "url": "https://data.microbiomedata.org/data/nmdc:mga0ac72/assembly/nmdc_mga0ac72_contigs.fna",
    "md5_checksum": "8ea1e1eab9d34bfb48cf83dccb8e95de",
    "file_size_bytes": 209350820,
    "id": "nmdc:8ea1e1eab9d34bfb48cf83dccb8e95de",
    "name": "gold:Gp0138727_Assembled contigs fasta"
}

This is one that does work (from a different study)...

{
    "_id": {
        "$oid": "60e840cde9822b255ad950da"
    },
    "description": "Assembled AGP file for gold:Gp0061273",
    "url": "https://data.microbiomedata.org/data/1472_51277/assembly/assembly.agp",
    "file_size_bytes": 27785827,
    "type": "nmdc:DataObject",
    "id": "nmdc:8e504039a96e9ab885eef69155127754",
    "name": "assembly.agp",
    "data_object_type": "0hna-73pd-79"
}
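To scope the problem, a pymongo sketch like the following could list the affected records; the connection URI is a placeholder and the data_object_set collection name is an assumption:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
coll = client["nmdc"]["data_object_set"]  # assumed collection name

# Data objects with no data_object_type set (the records the portal cannot display).
missing = coll.find({"data_object_type": {"$exists": False}}, {"id": 1, "name": 1})
for doc in missing:
    print(doc["id"], doc.get("name"))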

change sheet does not allow for multiple values when inserting object into array

When inserting an object into an array, only the last value is set. E.g., if inserting a credit association in which multiple properties are set for an object, only the last property is set.
E.g., this is supposed to update all the values for the person object p1.

gold:Gs0114675 update has_credit_associations ca1
ca1 update applied_role Conceptualization
ca1 update applies_to_person p1
p1 update name Kelly Wrighton
p1 update email [email protected]
p1 update orcid orcid:0000-0003-0434-4217

But, only the orcid (i.e., the last property) is set.

cc @dehays @dwinston

test metadata ingest

Using SOP (#53).

Options: additional Spruce, EMP500, NEON, other EMSL & JGI metadata.

/cc @microbiomedata/architecture-wg

please remove documents from nmdc.metaproteomics_analysis_activity_set that have non-prefixed IDs

Those that can be removed have IDs that DO NOT begin with "nmdc:"

Backstory: Anubhav provided metaP activities to go along with the new data objects he provided. In both cases, because the IDs were different, the metadata-in Dagster job created new documents (because there were no documents with the same ID to replace). So there are duplicates of all the metaproteomics activities that were in the recent 'update' JSON.

The only thing that changed on the activities was their IDs: the Mongo hash IDs that Anubhav had mistakenly used were replaced with "nmdc:" + md5. So the ones not in that form can be removed.

The duplicate analysis activities appear on the portal - but each activity in the pair points to the same files - so the file downloads all work.
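A possible cleanup with pymongo is sketched below: dry-run the count first, then delete. The connection details are placeholders.

import re

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
coll = client["nmdc"]["metaproteomics_analysis_activity_set"]

# Documents whose id does NOT begin with "nmdc:" are the duplicates to remove.
query = {"id": {"$not": re.compile(r"^nmdc:")}}
print("would remove:", coll.count_documents(query))
# coll.delete_many(query)  # uncomment only after reviewing the count above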

convert notebook-based JSON->Mongo ETL to dagster-based JSON->Terminus ETL

Currently, the portal-directed metadata ETL has two parts:

  1. The ETL for dumped JGI GOLD metadata to a NMDC-Schema-compliant JSON file. This process is currently wrapped as the nmdc_runtime.solids.jgi.get_json_db dagster solid.
  2. A series of Jupyter notebooks that takes the above JSON file as input, fetches additional metadata source files that are hosted elsewhere, and ensures NMDC-Schema-compliant JSON document collections in a MongoDB database. These notebooks (currently in the microbiomedata/nmdc-metadata repo's metadata-translation/notebooks directory) are:
    1. gold_ids_to_igsns.ipynb
    2. metaP_stegen.ipynb
    3. mongo_etl_demo.ipynb
    4. ghissue_252_253_linked_samples.ipynb
    5. ghissue_255.ipynb
    6. ghissue_272.ipynb
    7. ensure_biosample_set_study_id.ipynb

The notebook-based JSON->Mongo ETL above needs to be converted to a dagster-based JSON->Terminus ETL.

  • translate notebook logic to dagster solids (retain MongoDB target)
  • create and test full pipeline
  • create and test new solids/pipeline for TerminusDB target

This is a follow-on to microbiomedata/nmdc-metadata#316.

Add protections to main branch

Now that the tests are running after the changes related to #64 and #71, the next step would be to add protections to the main branch that require the tests to pass before a pull request can be merged.

If this is already done, please close this issue; it is still good for it to exist to document that the administrative changes have been made.

missing 108 samples from the NMDC_DUMP_Jun_21_2021 GOLD data dump

The ETL process on the NMDC_DUMP_Jun_21_2021 GOLD data dump failed to translate all the biosamples. The GOLD IDs of the 108 failures are listed below.

cc @dwinston

1 gold:Gb0291799
2 gold:Gb0291728
3 gold:Gb0291794
4 gold:Gb0291771
5 gold:Gb0291740
6 gold:Gb0291713
7 gold:Gb0291768
8 gold:Gb0291757
9 gold:Gb0291795
10 gold:Gb0291766
11 gold:Gb0291797
12 gold:Gb0291716
13 gold:Gb0291739
14 gold:Gb0291699
15 gold:Gb0291790
16 gold:Gb0291787
17 gold:Gb0291769
18 gold:Gb0291732
19 gold:Gb0291756
20 gold:Gb0291733
21 gold:Gb0291791
22 gold:Gb0291726
23 gold:Gb0291693
24 gold:Gb0291785
25 gold:Gb0291719
26 gold:Gb0291717
27 gold:Gb0291746
28 gold:Gb0291765
29 gold:Gb0291738
30 gold:Gb0291777
31 gold:Gb0291714
32 gold:Gb0291712
33 gold:Gb0291751
34 gold:Gb0291792
35 gold:Gb0291744
36 gold:Gb0291718
37 gold:Gb0291758
38 gold:Gb0291727
39 gold:Gb0291783
40 gold:Gb0291708
41 gold:Gb0291711
42 gold:Gb0291722
43 gold:Gb0291775
44 gold:Gb0291700
45 gold:Gb0291779
46 gold:Gb0291748
47 gold:Gb0291752
48 gold:Gb0291761
49 gold:Gb0291764
50 gold:Gb0291729
51 gold:Gb0291720
52 gold:Gb0291696
53 gold:Gb0291702
54 gold:Gb0291709
55 gold:Gb0291698
56 gold:Gb0291701
57 gold:Gb0291710
58 gold:Gb0291697
59 gold:Gb0291776
60 gold:Gb0291737
61 gold:Gb0291734
62 gold:Gb0291721
63 gold:Gb0291731
64 gold:Gb0291706
65 gold:Gb0291793
66 gold:Gb0291692
67 gold:Gb0291784
68 gold:Gb0291789
69 gold:Gb0291778
70 gold:Gb0291767
71 gold:Gb0291747
72 gold:Gb0291694
73 gold:Gb0291798
74 gold:Gb0291695
75 gold:Gb0291741
76 gold:Gb0291770
77 gold:Gb0291782
78 gold:Gb0291742
79 gold:Gb0291735
80 gold:Gb0291715
81 gold:Gb0291760
82 gold:Gb0291763
83 gold:Gb0291780
84 gold:Gb0291703
85 gold:Gb0291781
86 gold:Gb0291707
87 gold:Gb0291753
88 gold:Gb0291749
89 gold:Gb0291704
90 gold:Gb0291755
91 gold:Gb0291796
92 gold:Gb0291736
93 gold:Gb0291750
94 gold:Gb0291754
95 gold:Gb0291743
96 gold:Gb0291705
97 gold:Gb0291773
98 gold:Gb0291723
99 gold:Gb0291730
100 gold:Gb0291788
101 gold:Gb0291725
102 gold:Gb0291724
103 gold:Gb0291786
104 gold:Gb0291745
105 gold:Gb0291774
106 gold:Gb0291772
107 gold:Gb0291759
108 gold:Gb0291762

hotfixes.csv to ensure metadata overrides on re-processing

A spreadsheet of changes to metadata entity (e.g. study) attributes was compiled in preparation for the upcoming release. It was saved as CSV and persisted in this repo as metadata-translation/src/data/2021-07-02-study-changes.csv.

Recently, another "hotfix" was applied to correct the lat_lon.longitude value, and by extension the lat_lon.has_raw_value, for a particular biosample. @cmungall suggested this fix be recorded to ensure re-application if necessary, e.g. in a version-controlled hotfixes.yaml file.

Seeing as this may be a practice going forward, I'd like to refactor the 2021-07-02-study-changes.csv file to be a hotfixes.csv in this repo, to be the source of truth for hotfixes / manually determined corrections to be applied. There is already a flow to extract and apply changes from such a file.

@emileyfadrosh @pvangay is it feasible to expect such fixes to be entered via GitHub's interface as pull requests, rather than the initial workflow of using a Google Sheet?

coordinate range-attribute schema changes with Kitware prior to ETL

For attributes that hold ranges (e.g., depth, depth2), the schema has been changed (see #44) so that min/max of the range is represented like so:

    depth: 5.0-10.0
      has_unit: meter
      has_minimum_numeric_value: 5.0
      has_maximum_numeric_value: 10.0

Before pushing the changed docs to production we need to make sure Kitware can ingest the new docs.

Steps:

  • Ingest new ETL into Mongo (i.e., run the Dagster operations)
  • Add post-ETL modifications (e.g., change sheets, or notebooks)
  • Coordinate with Kitware to download the changed docs (in the biosample collection)
  • Verify that new data is ingested into portal
  • Remove old depth2 and subsurface_depth2 attributes (see microbiomedata/nmdc-schema#193)

cc @dwinston @jeffbaumes @subdavis
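For the verification step, a pymongo sketch like this could check how the changed biosample docs look; the connection is a placeholder, and the field paths follow the depth snippet shown above:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
biosamples = client["nmdc"]["biosample_set"]

# Biosamples already carrying the new range representation for depth,
# and those still carrying the old depth2 attribute slated for removal.
new_style = biosamples.count_documents(
    {"depth.has_minimum_numeric_value": {"$exists": True}})
old_style = biosamples.count_documents({"depth2": {"$exists": True}})
print("new-style depth:", new_style, "| old depth2 remaining:", old_style)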

S4G1 - Data Processing, Run Workflows (NERSC)

New files are in the system (Shane); the service at the NERSC site should pick up the raw data for metaG processing:

  • Workflows are run
  • New data and metadata are generated
  • NERSC - outputs moved to www directory in the project directory
  • New data (URLs) and metadata are registered with the NMDC runtime service

Step 4 of Goal 1 for Sprint 4
See https://docs.google.com/document/d/1iBNXkBn24ZkmJkeptoqpyjcz5PAQU59ZjMOzDObnx4E/edit?ts=60d55721#heading=h.gy54da3ooy3b for more details
