MDF Connect

The Materials Data Facility Connect service is the ETL flow to deeply index datasets into MDF Search. It is not intended to be run by end-users. To submit data to the MDF, visit the Materials Data Facility.

Architecture

The MDF Connect service is a serverless REST service deployed on AWS. It consists of an AWS API Gateway that uses a Lambda authorizer function to authenticate requests against Globus Auth. If the request is authorized, the endpoint triggers an AWS Lambda function. Each endpoint is implemented as a Lambda function contained in a Python file in the aws/ directory. The Lambda functions are deployed via GitHub Actions as described in a later section.

The API Endpoints are:

  • POST /submit: Submits a dataset to the MDF Connect service. This triggers a Globus Automate flow
  • GET /status: Returns the status of a dataset submission
  • POST /submissions: Forms a query and returns a list of submissions
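
As a rough illustration of how a client interacts with these endpoints, the sketch below POSTs a submission and then checks its status. The base URL, token handling, response fields, and submission body are all simplified assumptions; the real request format is defined by the MDF schemas.

import requests

# Placeholders -- the real API base URL and Globus Auth token acquisition are not shown here
API_BASE = "https://<api-id>.execute-api.us-east-1.amazonaws.com/prod"
TOKEN = "<globus-auth-bearer-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Minimal, illustrative submission body; the actual fields come from the MDF schemas
submission = {
    "dc": {"titles": [{"title": "Example Dataset"}]},
    "data_sources": ["globus://<endpoint-id>/path/to/dataset/"],
}

# POST /submit kicks off the Globus Automate flow for this dataset
resp = requests.post(f"{API_BASE}/submit", json=submission, headers=HEADERS)
resp.raise_for_status()
source_id = resp.json().get("source_id")  # assumed response field

# GET /status reports where the submission is in the flow (URL shape assumed)
status = requests.get(f"{API_BASE}/status/{source_id}", headers=HEADERS)
print(status.json())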

Globus Automate Flow

The Globus Automate flow is a series of steps that are triggered by the POST /submit endpoint. The flow is defined using a Python DSL that can be found in automate/minimus_mdf_flow.py. At a high level, the flow:

  1. Notifies the admin that a dataset has been submitted
  2. Checks to see if the data files have been updated or if this is a metadata only submission
  3. If there is a dataset, it starts a Globus transfer
  4. Once the transfer is complete, it may trigger a curation step if the organization is configured to do so
  5. A DOI is minted if the organization is configured to do so
  6. The dataset is indexed in MDF Search
  7. The user is notified of the completion of the submission
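
For orientation, the submit Lambda hands the flow an input document that drives these steps. The dictionary below is an illustrative assumption, not the exact schema used by minimus_mdf_flow.py; the field names are hypothetical.

# Hypothetical flow input; the real field names live alongside automate/minimus_mdf_flow.py
flow_input = {
    "source_id": "example_dataset_v1.0",                       # assumed submission identifier
    "admin_email": "admin@example.org",                        # step 1: admin notification
    "update_metadata_only": False,                             # step 2: metadata-only switch
    "data_sources": ["globus://<endpoint-id>/path/to/data/"],  # step 3: what to transfer
    "curation": True,                                          # step 4: organization requires curation
    "mint_doi": True,                                          # step 5: organization mints DOIs
    "search_index": "mdf",                                     # step 6: target MDF Search index
    "submitter_email": "user@example.org",                     # step 7: completion notification
}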

Development Workflow

Changes should be made in a feature branch based off of the dev branch. Create a PR and have another developer review your changes. Once the PR is approved, merge it into the dev branch, which is automatically deployed to the dev environment. Once the changes have been tested in the dev environment, create a PR from dev to main. Once that PR is approved, merge it into main; the main branch is automatically deployed to the prod environment.

Deployment

The MDF Connect service is deployed on AWS into development and production environments. The automate flow is deployed into the Globus Automate service via a second GitHub action.

Deploy the Automate Flow

Changes to the automate flow are deployed via a GitHub action, triggered by the publication of a new GitHub release. If the release is marked as a pre-release, it will be deployed to the dev environment; otherwise it will be deployed to the prod environment.

The flow IDs for dev and prod are stored in automate/mdf_dev_flow_info.json and automate/mdf_prod_flow_info.json respectively. The flow ID is stored in the flow_id key.
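
For example, the deployment action can read the target flow ID roughly like this (a minimal sketch; the surrounding deployment logic is omitted):

import json

# Pick the file for the environment being deployed (dev shown here)
with open("automate/mdf_dev_flow_info.json") as f:
    flow_info = json.load(f)

flow_id = flow_info["flow_id"]  # the existing Globus flow to update
print(flow_id)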

Deploy a Dev Release of the Flow

  1. Merge your changes into the dev branch
  2. On the GitHub website, click on the Releases link on the repo home page.
  3. Click on the Draft a new release button
  4. Fill in the tag version as X.Y.Z-alpha.1 where X.Y.Z is the version number. You can use subsequent alpha tags if you need to make further changes.
  5. Fill in the release title and description
  6. Select dev as the Target branch
  7. Check the Set as a pre-release checkbox
  8. Click the Publish release button

Deploy a Prod Release of the Flow

  1. Merge your changes into the main branch
  2. On the GitHub website, click on the Releases link on the repo home page.
  3. Click on the Draft a new release button
  4. Fill in the tag version as X.Y.Z where X.Y.Z is the version number.
  5. Fill in the release title and description
  6. Select main as the Target branch
  7. Check the Set as the latest release checkbox
  8. Click the Publish release button
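
If you prefer the command line to the web UI, either release can also be published through the GitHub REST API; the sketch below assumes a placeholder repo path and token.

import requests

# Publish a release via the GitHub REST API (org/repo and token are placeholders)
resp = requests.post(
    "https://api.github.com/repos/<org>/connect_server/releases",
    headers={
        "Authorization": "Bearer <github-token>",
        "Accept": "application/vnd.github+json",
    },
    json={
        "tag_name": "1.2.3",          # use "1.2.3-alpha.1" for a dev release
        "target_commitish": "main",   # use "dev" for a dev release
        "name": "1.2.3",
        "body": "Release notes here",
        "prerelease": False,          # True deploys the flow to dev, False to prod
    },
)
resp.raise_for_status()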

You can verify deployment of the flows in the Globus Automate Console.

Deploy the MDF Connect Service

The MDF Connect service is deployed via a GitHub action. The action is triggered by a push to the dev or main branch. The action will deploy the service to the dev or prod environment respectively.

Updating Schemas

Schemas and the MDF organization database are managed in the automate branch of the Data Schemas Repo.

The schemas are deployed into the Docker images used to serve the Lambda functions.

Running Tests

To run the tests, first make sure that you are running Python 3.7.10. Then install the test dependencies:

$ cd aws/tests
$ pip3 install -r requirements-test.txt

Now you can run the tests using the command:

$ PYTHONPATH=.. python -m pytest --ignore schemas

Support

This work was performed under financial assistance awards 70NANB14H012 and 70NANB19H005 from the U.S. Department of Commerce, National Institute of Standards and Technology, as part of the Center for Hierarchical Materials Design (CHiMaD). This work was also supported by the National Science Foundation as part of the Midwest Big Data Hub under NSF Award Number 1636950, "BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate".


connect_server's Issues

Request: remove mass_update_search or convert to use Ingest API

Search is unifying this on the backend right now, so that calls to POST or PUT an entry will actually submit an Ingest task. This is being done to unify a lot of the logic around handling these updates and avoid many of the pitfalls that we had with synchronous Ingest operations in the past.

However, making these the same internally leaves me questioning the value of having a separate API for these at all. I think it would be better -- both for Search and for its users, like MDF -- if we eliminated the POST/PUT entry APIs ( update_entry and create_entry in the SDK) and removed the relevant SDK methods.
I don't want to cause unnecessary breakage, so I'd like to see the code here using update_entry switch to using ingest:

update_res = ingest_client.update_entry(index, gmeta_update)

Either that or remove it if it's unused. The multiprocessing ingest flow is probably better and faster in all ways, based on my reading of this module.
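
A minimal sketch of the suggested switch, assuming gmeta_update is a single GMetaEntry and ingest_client is an authenticated globus_sdk SearchClient:

# Wrap the existing update document in a GIngest envelope and submit it as an ingest task
ingest_doc = {
    "ingest_type": "GMetaEntry",   # assumes gmeta_update is one GMetaEntry document
    "ingest_data": gmeta_update,
}
ingest_res = ingest_client.ingest(index, ingest_doc)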

Clear "test" submissions

We should put into place scripts that run regularly to clear out old test submissions from the database.

Perhaps the script could run from the server each day and clear out any "test" submissions older than 14 days.
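
A sketch of what such a cleanup job could look like, assuming the submissions live in a DynamoDB table; the table, attribute, and key names below are hypothetical.

import datetime
import boto3

TABLE_NAME = "mdf-connect-submissions"  # hypothetical table name
cutoff = (datetime.datetime.utcnow() - datetime.timedelta(days=14)).isoformat()

table = boto3.resource("dynamodb").Table(TABLE_NAME)

# Find "test" submissions older than the 14-day cutoff (attribute names are assumptions)
resp = table.scan(
    FilterExpression="#test = :t AND #time < :cutoff",
    ExpressionAttributeNames={"#test": "test", "#time": "submission_time"},
    ExpressionAttributeValues={":t": True, ":cutoff": cutoff},
)
for item in resp.get("Items", []):
    table.delete_item(Key={"source_id": item["source_id"]})  # assumed primary key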

Resilience to Directories that Do Not End with "/"

If a user provides a path to Connect that is a directory, Connect currently requires that the path end with a trailing "/". We should make Connect determine whether the user's path is a directory using a more robust mechanism (e.g., operation_ls via the Globus SDK).
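
A rough sketch of that check using the Globus SDK, listing the parent path and inspecting the entry type (endpoint handling and error cases are simplified):

import posixpath
from globus_sdk import TransferClient, TransferAPIError

def is_directory(tc: TransferClient, endpoint_id: str, path: str) -> bool:
    """Return True if `path` on `endpoint_id` is a directory, with or without a trailing '/'."""
    parent, name = posixpath.split(path.rstrip("/"))
    try:
        listing = tc.operation_ls(endpoint_id, path=parent or "/")
    except TransferAPIError:
        return False  # parent not listable; simplified handling for this sketch
    return any(entry["name"] == name and entry["type"] == "dir" for entry in listing)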

Deploy MDF Serverless Lambda With Containers

Problem

Right now we deploy the Python code and dependencies to AWS Lambda using a GitHub action. It always deploys to a specific Lambda. It also builds a new set of dependencies and deploys them to the Lambda, but doesn't update the dependency link in the functions.

Assumptions

  1. Add Dockerfile to repo
  2. Make a GHA to build the docker image and push to ECR
  3. Use aws lambda update to force the lambda to pick up the new image
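
For step 3, a minimal sketch of forcing the Lambda to pick up a freshly pushed image with boto3; the function name and ECR URI are placeholders.

import boto3

lambda_client = boto3.client("lambda")

# Point the function at the new container image; Lambda redeploys from that image
lambda_client.update_function_code(
    FunctionName="mdf-connect-submit-dev",  # placeholder function name
    ImageUri="123456789012.dkr.ecr.us-east-1.amazonaws.com/mdf-connect:latest",  # placeholder
)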

Design domain agnostic MDF concept and UI

DOUBLE CHECK THIS WITH BEN B

Is this changing MDF into Foundry, and having Foundry-ML be a part of the new generic Foundry?

Foundry is basically the XDF we've been talking about.

Foundry-ML is Foundry-ML.

Switch to Role Based Auth in AWS Account

As an Accelerate Developer I want access to only the AWS resources I need so I can securely perform my job

Description

Create a Terraform script to create an MDF Connect server role in the AWS account. Assume that Ben B or someone with full permissions runs the Terraform script to actually create the role and assign the permissions.

Personalized Greeting on Acceptance Emails

As an MDF Contributor I want to see my own name in the submission email so I have increased confidence it was correctly accepted

Description

MDF Connect 1.0 put the submitting user's first name as a greeting in the acceptance email. The first name is known in the Authorizer lambda but is not passed down to the submit lambda. It will need to be added to the JSON document placed in the Submit lambda's context and then put into the Flow input document.
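
A sketch of the plumbing involved, assuming a REST API Gateway Lambda authorizer whose context is forwarded to the submit Lambda; the key and field names are assumptions.

# Authorizer Lambda: include the first name in the context returned with the IAM policy
def generate_policy(principal_id, effect, resource, user_info):
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": resource}
            ],
        },
        # API Gateway forwards this dict to the backing Lambda
        "context": {"name": user_info.get("name", "")},
    }

# Submit Lambda: read the name back and place it into the Flow input document
def lambda_handler(event, context):
    first_name = event["requestContext"]["authorizer"].get("name", "")
    flow_input = {"submitter_name": first_name}  # "submitter_name" is an assumed field
    ...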

[blocking user flow] Look into success email validation issues in MDF Flow

The success email after curation and successful ingest seems to fail sometimes. It looks like something to do with DKIM authentication.

To: [email protected]
Subject: Undeliverable: Returned mail: see transcript for details
 
The original message was received at Fri, 9 Feb 2024 17:05:48 -0500
from mail-bn8nam12lp2168.outbound.protection.outlook.com [104.47.55.168]

   ----- The following addresses had permanent fatal errors -----
[email protected]
    (reason: 550-5.7.26 This mail has been blocked because the sender is unauthenticated.)
    (expanded from: <[email protected]>)

   ----- Transcript of session follows -----
... while talking to gmail-smtp-in.l.google.com.:
>>> DATA
<<< 550-5.7.26 This mail has been blocked because the sender is unauthenticated.
<<< 550-5.7.26 Gmail requires all senders to authenticate with either SPF or DKIM.
<<< 550-5.7.26
<<< 550-5.7.26  Authentication results:
<<< 550-5.7.26  DKIM = did not pass
<<< 550-5.7.26  SPF [amazonses.com] with ip: [130.127.237.235] = did not pass
<<< 550-5.7.26
<<< 550-5.7.26  For instructions on setting up authentication, go to
<<< 550 5.7.26  https://support.google.com/mail/answer/81126#authentication a18-20020a25ae12000000b00dc748a2a044si1000667ybj.47 - gsmtp
554 5.0.0 Service unavailable

Move Data Retrieval Code to New Repository

As we modify the code that publishes data from certain repositories, should we move it into repositories of its own? These scripts provide great examples of how to use MDF Connect, and are somewhat hidden here.

I'm willing to do this. Does anyone object?

Support HTTP Upload to MDF

As an MDF Contributor I want to be able to upload a file from a public HTTP endpoint so I don't have to have access to a Globus Endpoint

Description

MDF Connect Server 2.0 can only transfer files with the Globus Transfer Action Provider, which does not support public HTTP files. We will need to upload the user's file to a scratch directory on the NCSA endpoint and transfer it from there.

Consider adding a POST from the client to this scratch directory. This assumes that the directory is writable by members of the MDF Publishers Globus group. There will need to be a periodic job to clear out old files from this scratch directory.
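
A rough sketch of that two-step approach: stage the public HTTP file into the scratch space, then hand it to a normal Globus transfer. The endpoint IDs and paths are placeholders, and it is assumed this code runs somewhere with the scratch directory mounted.

import requests
from globus_sdk import TransferClient, TransferData

SCRATCH_ENDPOINT = "<ncsa-endpoint-uuid>"  # placeholder endpoint ID
SCRATCH_DIR = "/scratch/mdf-connect/"      # placeholder scratch path (assumed locally mounted)

def stage_http_file(url: str, filename: str) -> None:
    """Download a public HTTP file into the scratch directory."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(SCRATCH_DIR + filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

def transfer_staged_file(tc: TransferClient, dest_endpoint: str, dest_dir: str, filename: str):
    """Submit a Globus transfer of the staged file from the scratch endpoint to the destination."""
    tdata = TransferData(tc, SCRATCH_ENDPOINT, dest_endpoint, label="MDF Connect HTTP staging")
    tdata.add_item(SCRATCH_DIR + filename, dest_dir + filename)
    return tc.submit_transfer(tdata)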
