
aind-data-asset-indexer's Introduction

aind-data-asset-indexer


Script to create a metadata analytics table and write it to a Redshift table. The script parses through a list of S3 buckets and documents whether the data asset records in each of those buckets do or do not contain a metadata.nd.json file.

Usage

  • Define the environment variables in .env.template:
    • REDSHIFT_SECRETS_NAME: secrets name for Amazon Redshift
    • BUCKETS: comma-separated list of buckets (e.g. "bucket_name1, bucket_name2")
    • TABLE_NAME: name of the table in Redshift
    • FOLDERS_FILEPATH: intended filepath for the txt file
    • METADATA_DIRECTORY: intended path for the directory containing copies of metadata records
    • AWS_DEFAULT_REGION: default AWS region
  • Records containing a metadata.nd.json file will be copied to METADATA_DIRECTORY and compared against the list of all records in FOLDERS_FILEPATH
  • An analytics table containing the columns s3_prefix, bucket_name, and metadata_bool will be written to TABLE_NAME in Redshift

Development

  • It's a bit tedious, but the dependencies listed in the pyproject.toml file need to be manually updated

aind-data-asset-indexer's People

Contributors

github-actions[bot], helen-m-lin, jtyoung84, mekhlakapoor, yosefmaru


aind-data-asset-indexer's Issues

Check if _id exists in metadata.nd.json

Is your feature request related to a problem? Please describe.
If the metadata.nd.json exists in S3 but not in DocDB, AindIndexBucketJob._process_prefix will write it to DocDB. If the metadata.nd.json does not have an _id field, an unhandled error causes the indexer to crash.

Describe the solution you'd like
Add a check that _id exists in the metadata.nd.json. We should also check that other services are not creating the .nd.json files, or if they do, they should use something like:

metadata_dict = Metadata.model_construct(**metadata_dict).model_dump_json(warnings=False, by_alias=True)
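A minimal sketch of the proposed check, assuming the record has already been read from S3 into a dict (the warning-and-skip behavior is illustrative, not the job's actual handling):

    import logging

    def has_valid_id(metadata_dict: dict) -> bool:
        """Return True if the metadata record contains a non-empty _id field."""
        record_id = metadata_dict.get("_id")
        if not record_id:
            logging.warning("Record is missing _id; skipping DocDB write.")
            return False
        return True

    # Hypothetical usage inside the prefix-processing step:
    # if has_valid_id(metadata_dict):
    #     collection.insert_one(metadata_dict)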

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Add a check in S3 crawler script to catch empty pandas dataframes

User story

As a software engineer, I want to add a check in the S3 crawler script, so I can catch empty dataframes before merging them.
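A minimal sketch of such a check, assuming the crawler collects per-bucket pandas DataFrames before concatenating them (the function and variable names are illustrative):

    import pandas as pd

    def merge_non_empty(frames: list) -> pd.DataFrame:
        """Concatenate only non-empty DataFrames; return an empty frame if none remain."""
        non_empty = [df for df in frames if not df.empty]
        if not non_empty:
            return pd.DataFrame()
        return pd.concat(non_empty, ignore_index=True)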

Acceptance criteria

  • Check for empty dataframes before merging

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Use ec2 instance role to create boto3 client

User story

As a user, I want to optionally pass an EC2 instance's assumed-role AWS credentials, so I can run the capsule via pipelines.
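A minimal sketch of building a client that can fall back to instance-profile credentials, assuming boto3 is used (boto3's default credential chain already checks env vars, shared config, and finally the EC2 instance metadata, so the explicit check below is only illustrative; the region default is a placeholder):

    import boto3

    def get_s3_client(region: str = "us-west-2"):
        """Build an S3 client using whatever credentials the boto3 chain resolves."""
        session = boto3.Session(region_name=region)
        if session.get_credentials() is None:
            # No env vars, shared config, or instance profile were found.
            raise RuntimeError("No AWS credentials found in the credential chain.")
        return session.client("s3")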

Acceptance criteria

  • If the default aws credentials attached to a capsule are not found, then it will attempt to use the ec2 instance's credentials.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Indexer should include processed Code Ocean results

The indexer needs to include assets that are processed results in the Code Ocean datasets bucket. This is needed so that science teams can analyze data as soon as it is processed, regardless of whether we are capturing it to an external bucket.

Acceptance Criteria

  • The indexer crawls the Code Ocean datasets bucket.
  • The indexer looks at all top-level prefixes. If a prefix directly contains (not recursively) any of the well-known JSON files (data_description.json, subject.json, procedures.json, rig.json, session.json, instrument.json, acquisition.json), include it in the index (see the sketch after this list).
  • Query Code Ocean to populate relevant Metadata record fields (name, creation time, etc).
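A minimal sketch of that per-prefix check, assuming boto3's list_objects_v2 with a Delimiter is used so only the top level of each prefix is inspected (everything apart from the file names listed above is illustrative):

    import boto3

    CORE_JSON_FILES = {
        "data_description.json", "subject.json", "procedures.json",
        "rig.json", "session.json", "instrument.json", "acquisition.json",
    }

    def prefix_has_core_json(s3_client, bucket: str, prefix: str) -> bool:
        """Return True if the top level of the prefix contains a well-known JSON file."""
        response = s3_client.list_objects_v2(
            Bucket=bucket, Prefix=prefix.rstrip("/") + "/", Delimiter="/"
        )
        for obj in response.get("Contents", []):
            if obj["Key"].split("/")[-1] in CORE_JSON_FILES:
                return True
        return False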

Update S3 based on DocDB

User story

This is part of the AIND Metadata Update POC.
As a service admin, I want user updates to metadata in DocDB to also be reflected in S3, so I can ensure data is in sync.

The new AIND metadata update process will allow users to update metadata directly in DocDB. We can leverage the existing AIND Data Asset Indexer to make the appropriate downstream changes to S3.

Acceptance criteria

  • Nightly job checks records in DocDB and makes appropriate updates in S3 if required.
  • Ensure data updates between DocDB and S3 do not run in a loop.
  • If desired, retain a history of updates by writing to logs or other store.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

The new DocDB to S3 workflow should only run in Dev as part of the POC.

Run AIND bucket indexer on a schedule (twice a day)

User story

As a developer, I want to run the aind_bucket_indexer.py script twice a day, so I can index updates in docdb to the s3 buckets.

Acceptance criteria

  • run.sh will run the aind_bucket_indexer job
  • Parameters added in Parameter Store and used in code
  • Update code to pull secret from secret manager if needed.
  • Update Task Definition in AWS console to run script twice daily.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Ensure the full aind-data-access-api option is added on the next upgrade

User story

As a user, I want to ensure that the full aind-data-access-api option is added on the next upgrade, so that I can access the latest changes

Acceptance criteria

  • Change pyproject.toml to use the full aind-data-access-api

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Missing assets from DocDB

Describe the bug
The following prefixes, which all exist in s3://aind-open-data, are not showing up in the docdb and therefore not showing on the SmartSPIM dashboard.

  • SmartSPIM_705130_2024-02-14_14-07-29_stitched_2024-02-15_21-47-28
  • SmartSPIM_705129_2024-02-14_10-39-43_stitched_2024-02-15_20-58-26
  • SmartSPIM_717200_2024-04-23_22-21-26_stitched_2024-04-25_00-39-12
  • SmartSPIM_716949_2024-04-23_12-53-35_stitched_2024-04-24_15-46-00
  • SmartSPIM_716950_2024-04-23_17-36-59_stitched_2024-04-24_22-47-42

Nonexistent assets appearing in docdb

Describe the bug
The SmartSPIM report shows the following records (screenshot omitted):

These were for runs that have since been deleted from S3, and as far as I can tell they also do not exist in Code Ocean.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the SmartSPIM dashboard
  2. Filter to subject_id 695464.
  3. See all the zombie assets.

Expected behavior
Only real data should appear in docdb :).

Add missing check for /original_metadata for edge case

User story

As a developer, I want to check that /original_metadata exists, so I can ensure correct files are not overwritten.

This check is only missing in one edge case: if a metadata record exists in S3 but not in DocDB, was not picked up by the lambda function, and /original_metadata was already copied.
Other appropriate cases already have the check.

Acceptance criteria

Given that a metadata record file exists in S3 but not in DocDB...

  • ...and /original_metadata does not exist, the top-level core jsons are copied and then overwritten/synced with metadata.nd.json
  • ...and /original_metadata already exists, the top-level core jsons are not copied to /original_metadata, and the top level jsons are still synced with metadata.nd.json

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Check if s3 prefix format matches expected regex

User story

As a user, I want to see only metadata records from valid s3 prefixes according to a certain format, so that invalid folders are ignored.

Note that in the lambda function, invalid s3 prefixes are already ignored.

Acceptance criteria

A valid s3 prefix should be in the format: {modality}_{id}_{acq_datetime}

  • Given that the populate job is run, an s3 prefix with an invalid format should not be processed.
  • Given that the bucket indexer job is run, an s3 prefix with an invalid format should not be processed.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

We can check using DATA = f"^(?P<label>.+?)_(?P<c_date>{RegexParts.DATE.value})_(?P<c_time>{RegexParts.TIME.value})$" from aind-data-schema
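A minimal sketch of the validation, approximating that pattern with hard-coded date/time parts (the real RegexParts.DATE and RegexParts.TIME values should be imported from aind-data-schema; the ones below are assumptions based on the example prefixes in this repo):

    import re

    DATE = r"\d{4}-\d{2}-\d{2}"   # placeholder for RegexParts.DATE
    TIME = r"\d{2}-\d{2}-\d{2}"   # placeholder for RegexParts.TIME
    DATA_PATTERN = re.compile(rf"^(?P<label>.+?)_(?P<c_date>{DATE})_(?P<c_time>{TIME})$")

    def is_valid_prefix(s3_prefix: str) -> bool:
        """Return True if the prefix matches the expected {label}_{date}_{time} format."""
        return DATA_PATTERN.match(s3_prefix.strip("/")) is not None

    # is_valid_prefix("SmartSPIM_705130_2024-02-14_14-07-29") -> True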

Reference docdb secret from Parameter Store parameters

User story

As a developer, I want to reference any AWS Secrets Manager Secrets directly from Parameter store, so I can get the DocDB secret without using the secrets manager client.

Currently we have a parameter in Parameter Store, which contains the name of the DocDB secret, and we use a separate secret client to get the secret value and add it to the appropriate job settings.
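A minimal sketch of the change, assuming the built-in SSM reference path for Secrets Manager secrets (/aws/reference/secretsmanager/<secret-name>) is used, so only an SSM client is needed (the secret name below is a placeholder):

    import json
    import boto3

    def get_docdb_secret(secret_name: str, region: str = "us-west-2") -> dict:
        """Fetch a Secrets Manager secret through the SSM Parameter Store reference path."""
        ssm_client = boto3.client("ssm", region_name=region)
        response = ssm_client.get_parameter(
            Name=f"/aws/reference/secretsmanager/{secret_name}",
            WithDecryption=True,
        )
        return json.loads(response["Parameter"]["Value"])

    # docdb_creds = get_docdb_secret("<docdb-secret-name>")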

Acceptance criteria

  • Given the run.sh script is run with the correct parameter, the docdb secret should be accessed from Parameter store rather than Secrets Manager.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add check for processed data in non-Code Ocean results bucket.

User story

As a user, I want to index processed data assets stored in aws buckets other than the Code Ocean results bucket, so I can analyze data from there.

Acceptance criteria

  • When a run is executed, data assets tagged with processed that are in other S3 buckets will be added to the document store.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Add main job runners that loop through buckets. Remove legacy jobs.

Is your feature request related to a problem? Please describe.
Add job runners to index data for a list of aind buckets. Remove the legacy jobs.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Update Dockerfile for running update_docdb.py script

User story

As a DevOps engineer, I want to update the Dockerfile, so I can publish an image that runs both the s3_crawler.py and update_docdb.py scripts

Acceptance criteria

  • Modify Dockerfile so that it runs both s3_crawler and update_docdb scripts.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Use updated "derived" tags to filter records

User story

As a user, I want to see processed data in the docdb. The tags are being changed from "processed" to "derived" so we need to update that here.

Acceptance criteria

  • We should keep "processed" for legacy purposes.
  • We need to add aind_data_schema.data_description.DataLevel.derived also

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Add testing and linting on pull request

User story

As a developer, I want to have checks on tests and code formats, so I can maintain the code base easier.

Acceptance criteria

  • When a Pull Request is opened, tests and linters will run automatically.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Delete records from docDB if not in s3

User story

As a user, I want the metadata docdb to be synced with S3. Delete records in the docdb data assets collection if they are not found in S3.
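A minimal sketch of the sync check, assuming a pymongo collection whose records carry a location field like s3://bucket/prefix (an assumption about the record shape; the batching and safety checks a real job would need are omitted):

    import boto3

    def prefix_exists(s3_client, bucket: str, prefix: str) -> bool:
        """Return True if at least one object exists under the prefix."""
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        return response.get("KeyCount", 0) > 0

    def delete_orphaned_records(collection, bucket: str) -> None:
        """Remove DocDB records whose S3 prefix no longer exists."""
        s3_client = boto3.client("s3")
        for record in collection.find({}, {"_id": 1, "location": 1}):
            prefix = record["location"].replace(f"s3://{bucket}/", "")
            if not prefix_exists(s3_client, bucket, prefix):
                collection.delete_one({"_id": record["_id"]})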

Acceptance criteria

  • Given a record is in docdb, check if it is in s3.
  • If record is not in s3, delete.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Add script to crawl through S3

User story

As an engineer, I want a script to do sanity checks on DocDB.

Acceptance criteria

  • Given script is run, then all data asset records in S3 are downloaded
  • Given downloaded data asset records, check whether metadata file exists and create a table
  • Given table, save table in redshift

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Figure out how we want to log discrepancies.

Improve logging

Is your feature request related to a problem? Please describe.
As a service maintainer, I want to easily review job summary logs in AWS cloudwatch so that I can troubleshoot issues faster.

  • Currently there are no summary logs for each bucket, and the s3 and docdb operations are being logged as raw results:
INFO:root:{'ResponseMetadata': {'RequestId': '*****', 'HostId': '*****', 'HTTPStatusCode': 200, ...}, 'ETag': '"*****"', 'VersionId': '*****'}

Describe the solution you'd like

  • If applicable, update logs in populate_s3_with_metadata_files.py and aind_bucket_indexer.py to include summary counts (num created, updated, deleted, etc.) after each bucket (see the sketch after this list).
  • If applicable, log record ID, operation, and location before any logging.info(response)
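A possible sketch of the per-bucket summary, assuming counters are tracked in the job loop (the counter names and log format are illustrative):

    import logging
    from collections import Counter

    def log_bucket_summary(bucket_name: str, counts: Counter) -> None:
        """Emit a one-line summary after a bucket has been processed."""
        logging.info(
            "Finished %s: %d created, %d updated, %d deleted",
            bucket_name,
            counts["created"],
            counts["updated"],
            counts["deleted"],
        )

    # counts = Counter(); ...; log_bucket_summary("aind-open-data", counts)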

Describe alternatives you've considered
Leaving it as is does not affect running the jobs.

Additional context
Add any other context or screenshots about the feature request here.

is_dict_corrupt fails if json is a list instead of a dict

Describe the bug
Ran into this bug while running the populate job:

aind-data-asset-indexer/src/aind_data_asset_indexer/utils.py", line 453, in is_dict_corrupt
    for key, value in input_dict.items():
AttributeError: 'list' object has no attribute 'items'

To Reproduce
Steps to reproduce the behavior:

  1. Run is_dict_corrupt([])
  2. See error

Expected behavior
If the json file is a list, we can flag it as corrupt
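A minimal sketch of the fix, treating any non-dict payload as corrupt before iterating over items (the key checks shown, for "$" and "." characters that DocDB/MongoDB restrict, are an assumption about what the existing function validates):

    def is_dict_corrupt(input_dict) -> bool:
        """Return True if the object is not a dict or contains keys DocDB rejects."""
        if not isinstance(input_dict, dict):
            return True
        for key, value in input_dict.items():
            if "$" in key or "." in key:
                return True
            if isinstance(value, dict) and is_dict_corrupt(value):
                return True
        return False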

Additional context
Add any other context about the problem here.

Upgrade dependencies

User story

As a user I want to use the latest versions of the dependencies so that I can access the latest changes

Acceptance criteria

  • Change the Dockerfile
  • Change pyproject.toml to use the latest dependencies

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

aind-data-access-api needs the full option

Remove redundant `copy_original_md_subdir` from utils

User story

As a developer, I want to remove any redundant copy_original_md_subdir defaults, so that it is easier to maintain the codebase in case of future changes.

Currently, copy_original_md_subdir is set in the job settings and has a default in the utils.copy_then_overwrite_core_json_files method. Remove one of them.

Acceptance criteria

  • There is only 1 place that copy_original_md_subdir is set
  • "original_metadata" should not be hardcoded anywhere in the indexer job

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Add code to index data assets

User story

As a user, I want to update the document store with data asset metadata, so I can search them more easily.

Acceptance criteria

  • When the capsule is triggered, the document store will be populated.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Maybe this will be moot with future releases of Code Ocean?

Update README

Is your feature request related to a problem? Please describe.
As a developer, I want to use an updated README so that I can easily set up and contribute to the repo.

Describe the solution you'd like

  • Add contribution guidelines, how to run tests/linters, and instructions for running or testing jobs locally.
    • E.g. you must start an ssh session locally and set host to 'localhost' in the JobSettings.
  • Add a description and example usage for the new populate_s3 and indexer jobs.
  • Remove legacy jobs from the README

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Latest api no longer returns certain keys in responses

Describe the bug
In the past, "tags" used to always be in the response returned from Code Ocean.

To Reproduce
Steps to reproduce the behavior:

  1. Run response = co_client.search_all_data_assets() and then results = response.json()["results"]
  2. Notice that some of the latest results don't have a "tags" field.

Expected behavior
In the past, the tags field was always present. We'll have to update the code to handle this new behavior.
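A minimal sketch of handling the missing field, assuming the results are plain dicts from the response (the client call is copied from the reproduction steps above):

    response = co_client.search_all_data_assets()
    results = response.json()["results"]

    for result in results:
        # Default to an empty list when a result has no "tags" field.
        tags = result.get("tags", [])
        print(result.get("name"), tags)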

Screenshots
If applicable, add screenshots to help explain your problem.


Additional context
Add any other context about the problem here.

Update DocDB

User story

As an engineer, I want DocDB to be updated to match S3.

Acceptance criteria

  • Given metadata directory with downloaded metadata files from s3, overwrite records in DocDB.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Instead of using the redshift table, use s3 as the source of truth. Log discrepancies (maybe in a future ticket).

AIND Bucket Indexer should also update individual metadata files in S3

User story

As a user, I want to see changes updated in the core schema files in S3 after updating a metadata record in DocDB, so changes are synced in s3 and docdb.

Currently, the aind_bucket_indexer.py job will check for updates to records in DocDB and update the metadata.nd.json files in S3.
We also want the individual metadata JSONs (subject, rig, etc.) updated in the S3 buckets.
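A minimal sketch of writing the core fields back out as individual JSON files, assuming the record is loaded as a dict and the core field names match the well-known file names (both are assumptions; the actual job may structure this differently):

    import json

    CORE_FIELDS = [
        "data_description", "subject", "procedures",
        "rig", "session", "instrument", "acquisition",
    ]

    def write_core_jsons(s3_client, bucket: str, prefix: str, metadata: dict) -> None:
        """Write each non-null core field of the metadata record to {field}.json in S3."""
        for field in CORE_FIELDS:
            contents = metadata.get(field)
            if contents is not None:
                s3_client.put_object(
                    Bucket=bucket,
                    Key=f"{prefix.rstrip('/')}/{field}.json",
                    Body=json.dumps(contents).encode("utf-8"),
                )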

Acceptance criteria

  • Given the populate_s3_with_metadata_files.py job is run, then the core fields from the metadata.nd.json get saved to json files.
  • Given the populate_s3_with_metadata_files.py job is run and there already is a {core_schema}.json, the original contents are copied to another file as {core_schema}.old.json.
  • Given the aind_bucket_indexer.py job is run and there were updates to a metadata record in docdb, the core schema jsons get updated in S3 as well.
  • Given the aind_bucket_indexer.py job is run and a metadata.nd.json is found or created in S3, also ensure core jsons are copied and in sync.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Dockerize and Publish docker image

User story

As an engineer, I want to publish a docker image so that I can run the code in the cloud.

Acceptance criteria

  • Given a branch is merged into main, then github action will build and publish a docker image.
  • Figure out where to publish docker image.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.

Refactor `does_s3_prefix_exist` to use head_object operation

User story

As a developer, I want to refactor the utils.does_s3_prefix_exist method, so I can use a less expensive operation. The head_object operation is approximately 12.5x less expensive than list_objects_v2, even with MaxKeys set to 1.

Note: This may be lower priority since does_s3_prefix_exist is only called 1x when any metadata record is updated in docdb.
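A minimal sketch of the refactor, assuming the prefix check can be keyed off a known object under the prefix (e.g. metadata.nd.json), since head_object targets objects rather than prefixes; this is an assumption about how the method would be rewritten:

    from botocore.exceptions import ClientError

    def does_s3_prefix_exist(s3_client, bucket: str, prefix: str) -> bool:
        """Check prefix existence by heading a known object under the prefix."""
        key = f"{prefix.rstrip('/')}/metadata.nd.json"
        try:
            s3_client.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "404":
                return False
            raise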

Acceptance criteria

  • Given the does_s3_prefix_exist method is called for any prefix, the head_object operation is used.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Job to populate S3 with metadata.nd.json files

User story

As a user, I want all the data assets in aind-managed buckets to have a metadata.nd.json file so I can query them in DocDB

Acceptance criteria

  • Given a bucket, when a user runs a job, then all the data assets will have a metadata.nd.json file after the job is run
  • Given the user sets an option to overwrite, then all the metadata.nd.json files will be overwritten with the original core metadata json files such as subject.json, procedures.json, etc.

Sprint Ready Checklist

  • 1. Acceptance criteria defined
  • 2. Team understands acceptance criteria
  • 3. Team has defined solution / steps to satisfy acceptance criteria
  • 4. Acceptance criteria is verifiable / testable
  • 5. External / 3rd Party dependencies identified
  • 6. Ticket is prioritized and sized

Notes

Add any helpful notes here.
