aind-codeocean-utils

A library of utility methods for interfacing with Code Ocean.

Installation

To use the package, you can install it from PyPI:

pip install aind-codeocean-utils

To install the package from source, in the root directory, run

pip install -e .

To develop the code, run

pip install -e .[dev]

Usage

The package includes helper functions to interact with Code Ocean:

CodeOceanJob

This class enables one to run a job that:

  1. Registers a new asset to Code Ocean from s3
  2. Runs a capsule/pipeline on the newly registered asset (or an existing asset)
  3. Captures the run results into a new asset

Steps 1 and 3 are optional, while step 2 (running the computation) is mandatory.

Here is a full example that registers a new ecephys asset, runs the spike sorting capsule with some parameters, and registers the results:

import os

from aind_codeocean_api.codeocean import CodeOceanClient
from aind_codeocean_utils.codeocean_job import (
    CodeOceanJob, CodeOceanJobConfig
)

# Set up the CodeOceanClient from aind_codeocean_api
CO_TOKEN = os.environ["CO_TOKEN"]
CO_DOMAIN = os.environ["CO_DOMAIN"]

co_client = CodeOceanClient(domain=CO_DOMAIN, token=CO_TOKEN)

# Define Job Parameters
job_config_dict = dict(
    register_config = dict(
        asset_name="test_dataset_for_codeocean_job",
        mount="ecephys_701305_2023-12-26_12-22-25",
        bucket="aind-ephys-data",
        prefix="ecephys_701305_2023-12-26_12-22-25",
        tags=["codeocean_job_test", "ecephys", "701305", "raw"],
        custom_metadata={
            "modality": "extracellular electrophysiology",
            "data level": "raw data",
        },
        viewable_to_everyone=True
    ),
    run_capsule_config = dict(
        data_assets=None, # when None, the newly registered asset will be used
        capsule_id="a31e6c81-49a5-4f1c-b89c-2d47ae3e02b4",
        run_parameters=["--debug", "--no-remove-out-channels"]
    ),
    capture_result_config = dict(
        process_name="sorted",
        tags=["np-ultra"] # additional tags to the ones inherited from input
    )
)

# instantiate config model
job_config = CodeOceanJobConfig(**job_config_dict)

# instantiate code ocean job
co_job = CodeOceanJob(co_client=co_client, job_config=job_config)

# run and wait for results
job_response = co_job.run_job()

This job will:

  1. Register the test_dataset_for_codeocean_job asset from the specified s3 bucket and prefix
  2. Run the capsule a31e6c81-49a5-4f1c-b89c-2d47ae3e02b4 with the specified parameters
  3. Register the result as test_dataset_for_codeocean_job_sorted_{date-time}

To run a computation on existing data assets, omit the register_config and provide the data_assets field in the run_capsule_config.

To skip capturing the result, do not provide the capture_result_config option.
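Putting both options together, a reprocess-only configuration might look like the following. This is a sketch: the asset ID is a placeholder, and the exact shape of the data_assets entries (id plus mount) is assumed from aind-codeocean-api's run requests rather than confirmed here.

```python
# Sketch of a reprocess-only job config (no registration, no capture).
# The asset ID below is a placeholder, and the data_assets entry shape
# (id + mount) is an assumption based on aind-codeocean-api run requests.
job_config_dict = dict(
    run_capsule_config=dict(
        data_assets=[
            dict(
                id="00000000-0000-0000-0000-000000000000",  # existing asset ID
                mount="ecephys_701305_2023-12-26_12-22-25",
            )
        ],
        capsule_id="a31e6c81-49a5-4f1c-b89c-2d47ae3e02b4",
        run_parameters=["--debug"],
    ),
    # no register_config: run on an already-registered asset
    # no capture_result_config: the result is not captured as a new asset
)
```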

Contributing

Linters and testing

There are several libraries used to run linters, check documentation, and run tests.

  • Please test your changes using the coverage library, which will run the tests and log a coverage report:
coverage run -m unittest discover && coverage report
  • Use interrogate to check that modules, methods, etc. have been documented thoroughly:
interrogate .
  • Use flake8 to check that code is up to standards (no unused imports, etc.):
flake8 .
  • Use black to automatically format the code into PEP standards:
black .
  • Use isort to automatically sort import statements:
isort .

Pull requests

For internal members, please create a branch. For external members, please fork the repository and open a pull request from the fork. We'll primarily use Angular style for commit messages. Roughly, they should follow the pattern:

<type>(<scope>): <short summary>

where scope (optional) describes the packages affected by the code changes and type (mandatory) is one of:

  • build: Changes that affect build tools or external dependencies (example scopes: pyproject.toml, setup.py)
  • ci: Changes to our CI configuration files and scripts (examples: .github/workflows/ci.yml)
  • docs: Documentation only changes
  • feat: A new feature
  • fix: A bugfix
  • perf: A code change that improves performance
  • refactor: A code change that neither fixes a bug nor adds a feature
  • test: Adding missing tests or correcting existing tests

Semantic Release

The table below, from semantic release, shows which commit message gets you which release type when semantic-release runs (using the default configuration):

  • Commit message: fix(pencil): stop graphite breaking when too much pressure applied
    Release type: Patch Fix Release
  • Commit message: feat(pencil): add 'graphiteWidth' option
    Release type: Minor Feature Release
  • Commit message: perf(pencil): remove graphiteWidth option
    BREAKING CHANGE: The graphiteWidth option has been removed.
    The default graphite width of 10mm is always used for performance reasons.
    Release type: Major Breaking Release
    (Note that the BREAKING CHANGE: token must be in the footer of the commit)

Documentation

To generate the rst files source files for documentation, run

sphinx-apidoc -o doc_template/source/ src 

Then to create the documentation HTML files, run

sphinx-build -b html doc_template/source/ doc_template/build/html

More info on sphinx installation can be found here.

aind-codeocean-utils's Issues

Publish code on pypi

User story

As a user, I want the code published to PyPI, so I can easily install it in other packages.

Acceptance criteria

  • When code is merged into main, then the package is published to PyPI

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Write script to query Code Ocean for stale runs and runs that can be removed

User story

As an admin, I would like to know which Runs we could potentially delete so as to save space/cost.

Code Ocean Capsules and Pipelines have many Runs that store output files. In many cases these runs are part of the normal testing cycle and can be removed.

Acceptance criteria

  • A script we can run regularly that produces a CSV with columns: capsule name, run datetime, size

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

This script should not actually delete data - that will be a separate task.

Original issue:

Code Ocean Capsules and Pipelines have many Runs that store output files. In many cases these runs are part of the normal testing cycle and can be removed.

Write a script that uses the Code Ocean API to identify Runs that we could potentially delete. The script should output a CSV with the following columns:
1. Capsule name
2. Run date / time
3. Whether the Run was captured as a Data Asset
4. Total size of files

We should be able to generate this report whenever we like. Actually deleting data will be a separate task.

Simplify CodeOceanJob, particularly metadata handling

Is your feature request related to a problem? Please describe.
When trying to use CodeOceanJob to handle metadata tags correctly, I spent a couple of hours looking at the control flow, and I'm still not sure I understand it. Some other inconsistencies were also confusing, such as wrappers around configuration objects in aind-codeocean-api that really only renamed things. We should also support a reprocessing workflow, so registration should be optional.

Describe the solution you'd like
I propose CodeOceanJob be organized as follows:

  • Use aind-codeocean-api's request/configuration objects directly for the Register, Run, and Capture steps.
  • Add a pass_metadata_to_result flag, default True
  • Add an add_data_level_tags flag, default True
  • Ensure that all metadata tags are unique (use a Set rather than a List)
  • Make sure that Run+Capture can be used independently of Register+Run+Capture

Method to apply custom metadata to data assets

User story

As a user, I want to run a method to easily update custom metadata.

Acceptance criteria

  • When a user runs the method with a supplied set of data asset IDs, those assets will have modality, subject id, platform, collection date, institution, and data level filled out correctly.
  • The values for these fields should come from the document db API.
  • There is an option to print out the changes to a log file (default to True)
  • Appropriate docstrings and unit tests
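The field mapping behind the criteria above could be sketched as follows. This is hypothetical: the function name, the document-db field names, and the custom metadata keys are assumptions for illustration, not the actual schema.

```python
from typing import Dict

# Hypothetical mapping from document-db record fields to Code Ocean
# custom metadata keys; the names on both sides are assumptions.
FIELD_MAP = {
    "modality": "modality",
    "subject_id": "subject id",
    "platform": "platform",
    "collection_date": "collection date",
    "institution": "institution",
    "data_level": "data level",
}


def build_custom_metadata(docdb_record: Dict) -> Dict[str, str]:
    """Build the custom metadata dict for one asset from a docdb record,
    skipping any fields missing from the record."""
    return {
        co_key: str(docdb_record[db_key])
        for db_key, co_key in FIELD_MAP.items()
        if db_key in docdb_record
    }
```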

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

method to delete archived data assets

Is your feature request related to a problem? Please describe.
We have a large number of assets that users have archived. These are using unnecessary space and cost.

Describe the solution you'd like
A method that lets me see a list of all archived data assets older than a particular age and then separately decide to delete them. I should optionally be able to exclude assets that have attachments.
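The filtering step could be sketched like this. It is a hypothetical sketch only: the function name and the 'state', 'created' (unix timestamp), and 'attached' keys are assumed for illustration, not the actual Code Ocean asset schema, and it deliberately does not delete anything.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List


def find_stale_archived_assets(
    assets: Iterable[Dict],
    min_age_days: int,
    exclude_with_attachments: bool = True,
) -> List[Dict]:
    """List archived assets older than min_age_days; deletion is a
    separate, explicit decision. The 'state', 'created', and 'attached'
    keys are assumed names, not the real Code Ocean schema."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=min_age_days)
    stale = []
    for asset in assets:
        if asset.get("state") != "archived":
            continue
        created = datetime.fromtimestamp(asset["created"], tz=timezone.utc)
        if created > cutoff:
            continue
        if exclude_with_attachments and asset.get("attached", False):
            continue
        stale.append(asset)
    return stale
```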

Class to run a generic job on Code Ocean

Is your feature request related to a problem? Please describe.
Currently, the aind-trigger-codeocean repo has some classes to run specific capsules, register assets, and capture results (see: https://github.com/AllenNeuralDynamics/aind-trigger-codeocean/blob/main/code/aind_trigger_codeocean/pipelines.py#L104)
However, such a class should live here, and the aind-trigger-codeocean repo should become a CO capsule that triggers jobs using this class.

Describe the solution you'd like
Ideally, a CodeOceanJob class should:

  1. register an asset (if needed) or mount a registered asset
  2. run a capsule/pipeline from an ID
  3. optionally wait
  4. optionally capture results
  5. optionally send Teams notifications

Describe alternatives you've considered
One could use the aind-codeocean-api directly, but registration and waiting for results to capture require some additional, non-trivial coding.

Better warnings

User story

As a user, I want more logging so that we can better understand the root cause of issues.

Acceptance criteria

  • Given a timeout, log the failure with enough context to diagnose its root cause

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Reliable fetch all records method

Is your feature request related to a problem? Please describe.
Code Ocean's REST API returns at most 10,000 records when paging with its start and limit parameters.

Describe the solution you'd like
We need a temporary workaround until they fix the bug.
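The paging loop itself could be sketched as below. This is a sketch, not the library's API: fetch_page stands in for whichever client call exposes the start/limit parameters. Note that this loop only handles paging correctly below the 10,000-record cap; getting past the cap itself would need the query to be split (for example by sort order or filters), which is what the workaround must add.

```python
from typing import Callable, Dict, List


def fetch_all_records(
    fetch_page: Callable[[int, int], List[Dict]],
    limit: int = 100,
) -> List[Dict]:
    """Page through results with start/limit until a short page
    signals the end. fetch_page(start, limit) is a hypothetical
    stand-in for the real client call."""
    records: List[Dict] = []
    start = 0
    while True:
        page = fetch_page(start, limit)
        records.extend(page)
        if len(page) < limit:  # short page: no more records
            break
        start += limit
    return records
```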


Use codeocean's python api

Is your feature request related to a problem? Please describe.
We are currently using aind-codeocean-api as the main Code Ocean client.

Describe the solution you'd like
Now that Code Ocean has rolled out their own client, we should switch to using that one.


add default tags to asset capture configs

Is your feature request related to a problem? Please describe.
Assets are not being tagged automatically, which is making them difficult to find.

Describe the solution you'd like
Raw data should be tagged with the DataLevel.RAW tag, and derived data should be tagged with the DataLevel.DERIVED tag.
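The tagging rule could be sketched as below. The string values are stand-ins for the DataLevel tags mentioned above; in AIND tooling these would come from the DataLevel enum rather than bare strings, so treat the constants as assumptions.

```python
from typing import List

# Stand-in values for the DataLevel tags; real code would use the
# DataLevel enum rather than these bare strings.
RAW_TAG = "raw"
DERIVED_TAG = "derived"


def with_data_level_tag(tags: List[str], is_derived: bool) -> List[str]:
    """Return tags with the appropriate data-level tag added,
    deduplicated and sorted for stable comparison."""
    level = DERIVED_TAG if is_derived else RAW_TAG
    return sorted(set(tags) | {level})
```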

Use a map to update tags

User story

As a user, I'd like to use a map to replace tags, to make it easier to replace tags instead of running remove and add separately.

Acceptance criteria

  • Given a user calls update_tags, then they can supply an arg tags_to_replace: Optional[Dict[str,str]] = None that will change tags in the data_assets list.
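The per-asset replacement logic behind the proposed tags_to_replace argument could be sketched as follows; the function name is hypothetical, and only the mapping semantics come from the acceptance criterion above.

```python
from typing import Dict, List, Optional


def apply_tag_map(
    tags: List[str],
    tags_to_replace: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Replace tags according to a mapping; tags not in the map are
    kept as-is, and a None map leaves the list unchanged."""
    if not tags_to_replace:
        return list(tags)
    return [tags_to_replace.get(tag, tag) for tag in tags]
```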

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Method to update tags for assets

User story

As a user, I want a method to update tags on assets, so I can easily update tags.

Acceptance criteria

  • When a user runs the method with arguments old_tag and new_tag and an iterator of data asset IDs, then all assets in Code Ocean with those IDs that have old_tag will have old_tag changed to new_tag.
  • If old_tag is None, then all assets satisfying the filter will be tagged with new_tag.
  • There is an option to print out the changes to a log file (default to True)
  • Appropriate docstrings and unit tests
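The per-asset update described by the criteria above could be sketched like this; the function name is hypothetical, and only the old_tag/new_tag semantics (including the old_tag is None case) come from the criteria.

```python
from typing import List, Optional


def retag(tags: List[str], old_tag: Optional[str], new_tag: str) -> List[str]:
    """Apply the old_tag -> new_tag update to one asset's tag list.

    If old_tag is None, new_tag is simply added (if missing);
    otherwise every occurrence of old_tag becomes new_tag."""
    if old_tag is None:
        return tags if new_tag in tags else tags + [new_tag]
    return [new_tag if tag == old_tag else tag for tag in tags]
```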

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized
