
nmdc-runtime's Introduction

A runtime system for NMDC data management and orchestration.

Service Status

http://nmdcstatus.polyneme.xyz/

How It Fits In

  • issues
    tracks issues related to NMDC, which may necessitate work across multiple repos.

  • nmdc-schema houses the LinkML schema specification, as well as generated artifacts (e.g. JSON Schema).

  • nmdc-server houses code specific to the data portal -- its database, back-end API, and front-end application.

  • workflow_documentation references workflow code, spread across several repositories, that takes source data and produces computed data.

  • This repo (nmdc-runtime)

    • houses code that takes source data and computed data, and transforms it to broadly accommodate downstream applications such as the data portal
    • manages execution of the above (i.e., lightweight data transformations) and also of computationally- and data-intensive workflows performed at other sites, ensuring that claimed jobs have access to needed configuration and data resources.

Data exports

The NMDC metadata as of 2021-10 is available here:

https://drs.microbiomedata.org/ga4gh/drs/v1/objects/sys086d541

The link returns a GA4GH DRS API bundle object record, with the NMDC metadata collections (study_set, biosample_set, etc.) as contents, each a DRS API blob object.

For example, the blob for the study_set collection export, named "study_set.jsonl.gz", is listed with DRS API ID "sys0xsry70". Thus, it is retrievable via

https://drs.microbiomedata.org/ga4gh/drs/v1/objects/sys0xsry70

The returned blob object record lists https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/2021-10-14/study_set.jsonl.gz as the url for an access method.

The 2021-10 exports are currently all accessible at https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/2021-10-14/${COLLECTION_NAME}.jsonl.gz, but the DRS API indirection allows these links to change in the future, for mirroring via other URLs, etc. So, the DRS API links should be the links you share.
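The bundle and its blobs can also be fetched programmatically. The following Python sketch assumes only the standard GA4GH DRS v1 response layout (a contents list on the bundle object, and access_methods with an access_url on each blob); adjust as needed.

import requests

DRS_BASE = "https://drs.microbiomedata.org/ga4gh/drs/v1/objects"

# List the collections contained in the 2021-10 bundle object.
bundle = requests.get(f"{DRS_BASE}/sys086d541").json()
for item in bundle["contents"]:
    print(item["name"], item.get("id"))

# Fetch the study_set blob record and download it via its access-method URL.
blob = requests.get(f"{DRS_BASE}/sys0xsry70").json()
url = blob["access_methods"][0]["access_url"]["url"]
with open("study_set.jsonl.gz", "wb") as f:
    f.write(requests.get(url).content)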

Overview

The runtime features:

  1. Dagster orchestration:

    • dagit - a web UI to monitor and manage the running system.
    • dagster-daemon - a service that triggers pipeline runs based on time or external state.
    • PostgreSQL database - for storing run history, event logs, and scheduler state.
    • workspace code
      • Code to run is loaded into a Dagster workspace. This code is loaded from one or more dagster repositories. Each Dagster repository may be run with a different Python virtual environment if need be, and may be loaded from a local Python file or pip installed from an external source. In our case, each Dagster repository is simply loaded from a Python file local to the nmdc-runtime GitHub repository, and all code is run in the same Python environment.
      • A Dagster repository consists of solids and pipelines, and optionally schedules and sensors (a minimal code sketch follows this feature list):
        • solids represent individual units of computation
        • pipelines are built up from solids
        • schedules trigger recurring pipeline runs based on time
        • sensors trigger pipeline runs based on external state
      • Each pipeline can declare dependencies on any runtime resources or additional configuration. There are TerminusDB and MongoDB resources defined, as well as preset configuration definitions for both "dev" and "prod" modes. The presets tell Dagster to look to a set of known environment variables to load resource configurations, depending on the mode.
  2. A TerminusDB database supporting revision control of schema-validated data.

  3. A MongoDB database supporting write-once, high-throughput internal data storage by the nmdc-runtime FastAPI instance.

  4. A FastAPI service to interface with the orchestrator and database, as a hub for data management and workflow automation.
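To make the Dagster vocabulary above concrete, here is a minimal sketch in the legacy solid/pipeline API (Dagster 0.x); the names are illustrative and do not come from this repository's workspace code.

from dagster import pipeline, repository, schedule, solid


@solid
def say_hello(context):
    # A solid is an individual unit of computation.
    context.log.info("hello from a solid")


@pipeline
def hello_pipeline():
    # A pipeline is built up from solids.
    say_hello()


@schedule(cron_schedule="0 6 * * *", pipeline_name="hello_pipeline")
def daily_hello(_context):
    # A schedule triggers recurring runs based on time; it returns the
    # run config to use for each scheduled run.
    return {}


@repository
def hello_repository():
    # dagit and dagster-daemon load definitions from repositories like this one.
    return [hello_pipeline, daily_hello]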

Local Development

Ensure Docker (and Docker Compose) is installed and that the Docker engine is running.

docker --version
docker compose version
docker info

Ensure the permissions of ./mongoKeyFile are such that only the file's owner can read or write the file.

chmod 600 ./mongoKeyFile

Ensure you have a .env file for the Docker services to source from. You may copy .env.example to .env (which is gitignore'd) to get started.

cp .env.example .env

Create environment variables in your shell session, based upon the contents of the .env file.

export $(grep -v '^#' .env | xargs)

If you are connecting to resources that require an SSH tunnel—for example, a MongoDB server that is only accessible on the NERSC network—set up the SSH tunnel.

The following command could be useful to you, either directly or as a template (see Makefile).

make nersc-mongo-tunnels

Finally, spin up the Docker Compose stack.

make up-dev

Docker Compose is used to start local MongoDB and PostgreSQL (used by Dagster) instances, as well as a Dagster web server (dagit) and daemon (dagster-daemon).

The Dagit web server is viewable at http://127.0.0.1:3000/.

The FastAPI service is viewable at http://127.0.0.1:8000/ -- e.g., rendered documentation at http://127.0.0.1:8000/redoc/.
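Once the stack is up, a quick sanity check from Python can confirm both services respond (a sketch; it simply assumes the endpoints above return HTTP 200):

import requests

# Confirm the local dagit and FastAPI services are responding.
for name, url in [("dagit", "http://127.0.0.1:3000/"),
                  ("fastapi", "http://127.0.0.1:8000/redoc/")]:
    resp = requests.get(url, timeout=5)
    print(name, resp.status_code)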

Local Testing

Tests can be found in tests and are run with the following commands:

On an M1 Mac? May need to export DOCKER_DEFAULT_PLATFORM=linux/amd64.

make up-test
make test

As you create Dagster solids and pipelines, add tests in tests/ to check that your code behaves as desired and does not break over time.

For hints on how to write tests for solids and pipelines in Dagster, see their documentation tutorial on Testing.
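As a starting point, a solid in the legacy API can be unit-tested directly with execute_solid. The sketch below uses a hypothetical add_one solid rather than one from this repo:

from dagster import execute_solid, solid


@solid
def add_one(_context, num: int) -> int:
    return num + 1


def test_add_one():
    # execute_solid runs a single solid in isolation and returns its result.
    result = execute_solid(add_one, input_values={"num": 1})
    assert result.success
    assert result.output_value() == 2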

Publish to PyPI

This repository contains a GitHub Actions workflow that publishes a Python package to PyPI.

You can also manually publish the Python package to PyPI by issuing the following commands in the root directory of the repository:

rm -rf dist
python -m build
twine upload dist/*

Links

Here are links related to this repository:

nmdc-runtime's People

Contributors

aclum, brynnz22, cmungall, dehays, dwinston, eecavanna, elais, mbthornton-lbl, peoplemakeculture, pkalita-lbl, scanon, shreddd, sujaypatil96, turbomam, wdduncan


nmdc-runtime's Issues

clarify purpose of nmdc_database.json file in this repo and make it easier to

I frequently need access to the latest NMDC sample metadata for various purposes, e.g., analyzing patterns of ENVO usage, GOLD usage, sparsity of MIxS variables, etc. We also have various collaborators, e.g., ENSURABLE, who would like access.

Ultimately this will all be available via the API, but in the interim it would be great to make all this more transparent.

Where should I download data?

It looks like the one in this repo is just test data? I only see 716 GOLD samples, whereas the polyneme URL has 32k GOLD samples.

Does the one in this repo serve any purpose? I recommend removing it if it doesn't.

There was a ticket somewhere about adding metadata to the database object with provenance about when the ingest was performed, what the version/date of the source was, etc. I can't find it now, but we need this.

Release documentation

Document all components and the release process steps, share them before the November test release, and make the documentation available to all team members.

Add github actions to check PRs

I think this is a required part of #56

We have GitHub Actions, but it seems they are just for deployment and don't check PRs. I did an experiment here:

#64

This introduced a deliberate error, yet I am still able to merge:

(screenshot omitted)

S4G1 - Data Processing, Workflow definition

Workflows defined for each of the test raw data files - yes
Metagenome - done
Metabolomics (GCMS?) - done >>> future sprint

Site available with computing capacity to process the raw data files

  • Develop/deploy services to turn NERSC into a site (Shane?)
  • Develop/deploy services to turn EMSL into a site (Yuri) >>>> EMSL will be done in future sprint

Step 2 of Goal 1 for Sprint 4
See https://docs.google.com/document/d/1iBNXkBn24ZkmJkeptoqpyjcz5PAQU59ZjMOzDObnx4E/edit?ts=60d55721#heading=h.gy54da3ooy3b for more details

mongo access over HTTPS

NERSC Spin doesn't allow exposing a MongoDB service at this time.

I propose setting up an Eve REST API to expose external access to the MongoDB.

@jbeezley have you worked with Eve?

data_object_type is not being set on new data objects

New data objects are not getting the data_object_type attribute set on them. It looks like this is done using the regexes from the filter attributes on the Mongo file_type_enum collection.

Kitware then sets the file name and description based on the data_object_type attributes they see. When the attribute is missing, no files are displayed.

{
    "_id": {
        "$oid": "6148a0a9e9822b255a68d6ad"
    },
    "description": "Assembled contigs fasta for gold:Gp0138727",
    "url": "https://data.microbiomedata.org/data/nmdc:mga0ac72/assembly/nmdc_mga0ac72_contigs.fna",
    "md5_checksum": "8ea1e1eab9d34bfb48cf83dccb8e95de",
    "file_size_bytes": 209350820,
    "id": "nmdc:8ea1e1eab9d34bfb48cf83dccb8e95de",
    "name": "gold:Gp0138727_Assembled contigs fasta"
}

This is one that does work (from a different study)...

{
    "_id": {
        "$oid": "60e840cde9822b255ad950da"
    },
    "description": "Assembled AGP file for gold:Gp0061273",
    "url": "https://data.microbiomedata.org/data/1472_51277/assembly/assembly.agp",
    "file_size_bytes": 27785827,
    "type": "nmdc:DataObject",
    "id": "nmdc:8e504039a96e9ab885eef69155127754",
    "name": "assembly.agp",
    "data_object_type": "0hna-73pd-79"
}
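To scope the problem, a pymongo sketch like the following could list the affected records; the connection URI is a placeholder and the data_object_set collection name is an assumption:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
coll = client["nmdc"]["data_object_set"]  # assumed collection name

# Data objects with no data_object_type set (the records the portal cannot display).
missing = coll.find({"data_object_type": {"$exists": False}}, {"id": 1, "name": 1})
for doc in missing:
    print(doc["id"], doc.get("name"))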

change sheet does not allow for multiple values when inserting object into array

When inserting an object into an array, only the last value is set. E.g., if inserting a credit association in which multiple properties are set for an object, only the last property is set.
E.g., this is supposed to update all the values for the person object p1.

gold:Gs0114675 update has_credit_associations ca1
ca1 update applied_role Conceptualization
ca1 update applies_to_person p1
p1 update name Kelly Wrighton
p1 update email [email protected]
p1 update orcid orcid:0000-0003-0434-4217

But, only the orcid (i.e., the last property) is set.

cc @dehays @dwinston

test metadata ingest

Using SOP (#53).

Options: additional Spruce, EMP500, NEON, other EMSL & JGI metadata.

/cc @microbiomedata/architecture-wg

please remove documents from nmdc.metaproteomics_analysis_activity_set that have non-prefixed IDs

Those that can be removed have IDs that DO NOT begin with "nmdc:"

Backstory: Anubhav provided metaP activities to go along with the new data objects he provided. In both cases, because the IDs were different, the metadata-in Dagster job created new documents (because there were no documents with the same ID to replace). So there are duplicates of all the metaproteomics activities that were in the recent 'update' JSON.

The only thing that changed on the activities was their IDs: the Mongo hash IDs that Anubhav had mistakenly used were replaced with "nmdc:" + md5. So the ones not in that form can be removed.

The duplicate analysis activities appear on the portal - but each activity in the pair points to the same files - so the file downloads all work.
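A possible cleanup with pymongo is sketched below: dry-run the count first, then delete. The connection details are placeholders.

import re

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
coll = client["nmdc"]["metaproteomics_analysis_activity_set"]

# Documents whose id does NOT begin with "nmdc:" are the duplicates to remove.
query = {"id": {"$not": re.compile(r"^nmdc:")}}
print("would remove:", coll.count_documents(query))
# coll.delete_many(query)  # uncomment only after reviewing the count above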

convert notebook-based JSON->Mongo ETL to dagster-based JSON->Terminus ETL

Currently, the portal-directed metadata ETL has two parts:

  1. The ETL for dumped JGI GOLD metadata to a NMDC-Schema-compliant JSON file. This process is currently wrapped as the nmdc_runtime.solids.jgi.get_json_db dagster solid.
  2. A series of Jupyter notebooks that takes the above JSON file as input, fetches additional metadata source files that are hosted elsewhere, and ensures NMDC-Schema-compliant JSON document collections in a MongoDB database. These notebooks (currently in the microbiomedata/nmdc-metadata repo's metadata-translation/notebooks directory) are:
    1. gold_ids_to_igsns.ipynb
    2. metaP_stegen.ipynb
    3. mongo_etl_demo.ipynb
    4. ghissue_252_253_linked_samples.ipynb
    5. ghissue_255.ipynb
    6. ghissue_272.ipynb
    7. ensure_biosample_set_study_id.ipynb

The notebook-based JSON->Mongo ETL above needs to be converted to a dagster-based JSON->Terminus ETL.

  • translate notebook logic to dagster solids (retain MongoDB target)
  • create and test full pipeline
  • create and test new solids/pipeline for TerminusDB target

This is a follow-on to microbiomedata/nmdc-metadata#316.

Add protections to main branch

Now that the tests are running after the changes related to #64 and #71, the next step would be to add protections to the main branch that require the tests to pass before a pull request can be merged.

If this is already done, please close this issue; it is still good for it to exist to document that the administrative changes have been made.

missing 108 samples from the NMDC_DUMP_Jun_21_2021 GOLD data dump

The ETL process on the NMDC_DUMP_Jun_21_2021 GOLD data dump failed to translate all the biosamples. The GOLD IDs of the 108 failures are listed below.

cc @dwinston

1 gold:Gb0291799
2 gold:Gb0291728
3 gold:Gb0291794
4 gold:Gb0291771
5 gold:Gb0291740
6 gold:Gb0291713
7 gold:Gb0291768
8 gold:Gb0291757
9 gold:Gb0291795
10 gold:Gb0291766
11 gold:Gb0291797
12 gold:Gb0291716
13 gold:Gb0291739
14 gold:Gb0291699
15 gold:Gb0291790
16 gold:Gb0291787
17 gold:Gb0291769
18 gold:Gb0291732
19 gold:Gb0291756
20 gold:Gb0291733
21 gold:Gb0291791
22 gold:Gb0291726
23 gold:Gb0291693
24 gold:Gb0291785
25 gold:Gb0291719
26 gold:Gb0291717
27 gold:Gb0291746
28 gold:Gb0291765
29 gold:Gb0291738
30 gold:Gb0291777
31 gold:Gb0291714
32 gold:Gb0291712
33 gold:Gb0291751
34 gold:Gb0291792
35 gold:Gb0291744
36 gold:Gb0291718
37 gold:Gb0291758
38 gold:Gb0291727
39 gold:Gb0291783
40 gold:Gb0291708
41 gold:Gb0291711
42 gold:Gb0291722
43 gold:Gb0291775
44 gold:Gb0291700
45 gold:Gb0291779
46 gold:Gb0291748
47 gold:Gb0291752
48 gold:Gb0291761
49 gold:Gb0291764
50 gold:Gb0291729
51 gold:Gb0291720
52 gold:Gb0291696
53 gold:Gb0291702
54 gold:Gb0291709
55 gold:Gb0291698
56 gold:Gb0291701
57 gold:Gb0291710
58 gold:Gb0291697
59 gold:Gb0291776
60 gold:Gb0291737
61 gold:Gb0291734
62 gold:Gb0291721
63 gold:Gb0291731
64 gold:Gb0291706
65 gold:Gb0291793
66 gold:Gb0291692
67 gold:Gb0291784
68 gold:Gb0291789
69 gold:Gb0291778
70 gold:Gb0291767
71 gold:Gb0291747
72 gold:Gb0291694
73 gold:Gb0291798
74 gold:Gb0291695
75 gold:Gb0291741
76 gold:Gb0291770
77 gold:Gb0291782
78 gold:Gb0291742
79 gold:Gb0291735
80 gold:Gb0291715
81 gold:Gb0291760
82 gold:Gb0291763
83 gold:Gb0291780
84 gold:Gb0291703
85 gold:Gb0291781
86 gold:Gb0291707
87 gold:Gb0291753
88 gold:Gb0291749
89 gold:Gb0291704
90 gold:Gb0291755
91 gold:Gb0291796
92 gold:Gb0291736
93 gold:Gb0291750
94 gold:Gb0291754
95 gold:Gb0291743
96 gold:Gb0291705
97 gold:Gb0291773
98 gold:Gb0291723
99 gold:Gb0291730
100 gold:Gb0291788
101 gold:Gb0291725
102 gold:Gb0291724
103 gold:Gb0291786
104 gold:Gb0291745
105 gold:Gb0291774
106 gold:Gb0291772
107 gold:Gb0291759
108 gold:Gb0291762

hotfixes.csv to ensure metadata overrides on re-processing

A spreadsheet of changes to metadata entity (e.g. study) attributes was compiled in preparation for the upcoming release. It was saved as CSV and persisted in this repo as metadata-translation/src/data/2021-07-02-study-changes.csv.

Recently, another "hotfix" was applied to correct the lat_lon.longitude value, and by extension the lat_lon.has_raw_value, for a particular biosample. @cmungall suggested this fix be recorded to ensure re-application if necessary, e.g. in a version-controlled hotfixes.yaml file.

Seeing as this may be a practice going forward, I'd like to refactor the 2021-07-02-study-changes.csv file to be a hotfixes.csv in this repo, to be the source of truth for hotfixes / manually determined corrections to be applied. There is already a flow to extract and apply changes from such a file.

@emileyfadrosh @pvangay is it feasible to expect such fixes to be entered via GitHub's interface as pull requests, rather than the initial workflow of using a Google Sheet?

coordinate range-attribute schema changes with Kitware prior to ETL

For attributes that hold ranges (e.g., depth, depth2), the schema has been changed (see #44) so that min/max of the range is represented like so:

    depth: 5.0-10.0
      has_unit: meter
      has_minimum_numeric_value: 5.0
      has_maximum_numeric_value: 10.0

Before pushing the changed docs to production we need to make sure Kitware can ingest the new docs.

Steps:

  • Ingest new ETL into Mongo (i.e., run the Dagster operations)
  • Add post-ETL modifications (e.g., change sheets, or notebooks)
  • Coordinate with Kitware to download the changed docs (in the biosample collection)
  • Verify that new data is ingested into portal
  • Remove old depth2 and subsurface_depth2 attributes (see microbiomedata/nmdc-schema#193)

cc @dwinston @jeffbaumes @subdavis
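For the verification step, a pymongo sketch like this could check how the changed biosample docs look; the connection is a placeholder, and the field paths follow the depth snippet shown above:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
biosamples = client["nmdc"]["biosample_set"]

# Biosamples already carrying the new range representation for depth,
# and those still carrying the old depth2 attribute slated for removal.
new_style = biosamples.count_documents(
    {"depth.has_minimum_numeric_value": {"$exists": True}})
old_style = biosamples.count_documents({"depth2": {"$exists": True}})
print("new-style depth:", new_style, "| old depth2 remaining:", old_style)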

S4G1 - Data Processing, Run Workflows (NERSC)

New files are in the system (Shane); the service at the NERSC site should pick up the raw data for metaG processing:

  • Workflows are run
  • New data and metadata are generated
  • NERSC - outputs moved to www directory in the project directory
  • New data (URLs) and metadata are registered with the NMDC runtime service

Step 4 of Goal 1 for Sprint 4
See https://docs.google.com/document/d/1iBNXkBn24ZkmJkeptoqpyjcz5PAQU59ZjMOzDObnx4E/edit?ts=60d55721#heading=h.gy54da3ooy3b for more details
