allenneuraldynamics / aind-data-asset-indexer
License: MIT License
From @jtyoung84: it seems like ECS is not configured correctly.
As a developer, I want to check that /original_metadata exists, so I can ensure correct files are not overwritten.
This check is only missing in one edge case: if a metadata record exists in S3 but not in DocDB, was not picked up by the lambda function, and /original_metadata was already copied.
The other relevant cases already have the check.
Given that a metadata record file exists in S3 but not in DocDB...
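A minimal sketch of that check, assuming boto3 and that the copied files live under an original_metadata/ subfolder of the prefix; the bucket and prefix values below are only illustrative, not the indexer's actual code:

import boto3

s3_client = boto3.client("s3")

def original_metadata_exists(bucket: str, prefix: str) -> bool:
    """Return True if anything already exists under {prefix}/original_metadata/."""
    response = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=f"{prefix.rstrip('/')}/original_metadata/",
        MaxKeys=1,
    )
    return response.get("KeyCount", 0) > 0

# Skip the copy if the folder is already there, so earlier copies are not overwritten.
if not original_metadata_exists("aind-open-data", "ecephys_123456_2023-01-01_01-01-01"):
    pass  # copy the core json files to original_metadata here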
Is your feature request related to a problem? Please describe.
As a developer, I want to use an updated README so that I can easily set up and contribute to the repo.
Describe the bug
The following prefixes, which all exist in s3://aind-open-data, are not showing up in the docdb and therefore are not showing on the SmartSPIM dashboard.
Describe the bug
Missing output file(s) `capsule/results/*` expected by process `capsule_aind_data_asset_indexer_docdb_1 (capsule-3506143)`
To Reproduce
Steps to reproduce the behavior:
Describe the bug
Missing output file(s) `capsule/results/*` expected by process `capsule_aind_data_asset_indexer_docdb_1 (capsule-1936612)`
To Reproduce
Steps to reproduce the behavior:
ValueError: You are trying to merge on object and float64 columns for key 's3_prefix'. If you wish to proceed you should use pd.concat
More can be found in the CloudWatch logs in prod:
As a software engineer, I want to add a check in S3 crawler script, so I can catch empty dataframes before merging them.
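A hedged sketch of such a check, assuming the crawler merges pandas DataFrames on an s3_prefix column; the function name and empty-frame handling are illustrative, not the crawler's actual code:

import pandas as pd

def safe_merge_on_prefix(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # An empty frame leaves s3_prefix typed as float64 (all NaN), which is what
    # triggers "trying to merge on object and float64 columns" in pd.merge.
    if left.empty or right.empty:
        return pd.DataFrame(columns=["s3_prefix"])
    # Force a consistent dtype on the join key before merging.
    left = left.assign(s3_prefix=left["s3_prefix"].astype(str))
    right = right.assign(s3_prefix=right["s3_prefix"].astype(str))
    return left.merge(right, on="s3_prefix", how="outer")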
As a developer, I want to refactor the utils.does_s3_prefix_exist method, so I can use a less expensive operation. The head_object operation is approximately 12.5x less expensive than list_objects_v2, even with MaxKeys set to 1.
Note: This may be lower priority since does_s3_prefix_exist is only called once when any metadata record is updated in docdb.
When the does_s3_prefix_exist method is called for any prefix, the head_object operation is used.
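A rough sketch of the cheaper call, assuming a prefix can be treated as existing when its metadata.nd.json object exists (head_object needs a concrete key, not a bare prefix); the function name matches the issue, but the body is illustrative:

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

def does_s3_prefix_exist(bucket: str, prefix: str) -> bool:
    """Check for the prefix with head_object instead of list_objects_v2."""
    try:
        s3_client.head_object(
            Bucket=bucket, Key=f"{prefix.rstrip('/')}/metadata.nd.json"
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise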
The indexer needs to include assets that are processed results in the Code Ocean datasets bucket. This is needed so that science teams can analyze data as soon as it is processed, regardless of whether we are capturing it to an external bucket.
Acceptance Criteria
Is your feature request related to a problem? Please describe.
Add job runners to index data for a list of aind buckets. Remove the legacy jobs.
As a user, I want to index processed data assets stored in aws buckets other than the Code Ocean results bucket, so I can analyze data from there.
Describe the bug
The SmartSPIM report shows the following records:
These were for runs that have since been deleted from S3, and as far as I can tell they also do not exist in Code Ocean.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Only real data should appear in docdb :).
As a user, I want to see changes updated in the core schema files in S3 after updating a metadata record in DocDB, so changes are synced in s3 and docdb.
Currently, the aind_bucket_indexer.py job will check for updates to records in DocDB and update the metadata.nd.json files in S3.
We also want the individual metadata JSONs (subject, rig, etc.) updated in the S3 buckets.
When the populate_s3_with_metadata_files.py job is run, the core fields from the metadata.nd.json get saved to json files.
When the populate_s3_with_metadata_files.py job is run and there already is a {core_schema}.json, the original contents are copied to another file as {core_schema}.old.json.
When the aind_bucket_indexer.py job is run and there were updates to a metadata record in docdb, the core schema jsons get updated in S3 as well.
When the aind_bucket_indexer.py job is run and a metadata.nd.json is found or created in S3, also ensure the core jsons are copied and in sync.
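A simplified sketch of the first criterion, assuming the core schemas are top-level fields of metadata.nd.json and are written back with put_object; the schema list and key layout are assumptions, not the job's actual implementation:

import json
import boto3

s3_client = boto3.client("s3")
CORE_SCHEMAS = ["subject", "procedures", "rig", "session", "processing"]  # illustrative subset

def write_core_jsons(bucket: str, prefix: str, metadata_nd: dict) -> None:
    """Save each core field of metadata.nd.json to its own {core_schema}.json in S3."""
    for core_schema in CORE_SCHEMAS:
        contents = metadata_nd.get(core_schema)
        if contents is None:
            continue
        s3_client.put_object(
            Bucket=bucket,
            Key=f"{prefix.rstrip('/')}/{core_schema}.json",
            Body=json.dumps(contents).encode("utf-8"),
        )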
As a user, I want all the data assets in aind managed buckets to have a metadata.nd.json file so I can query them in DocDB.
The indexer gets a KeyError when a bucket is empty. I believe this can be fixed by using .get("CommonPrefixes", []).
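A minimal sketch of that fix, assuming the prefixes are listed with list_objects_v2 and a Delimiter; the exact call site in the indexer may differ:

import boto3

s3_client = boto3.client("s3")

def list_prefixes(bucket: str) -> list:
    response = s3_client.list_objects_v2(Bucket=bucket, Delimiter="/")
    # An empty bucket returns no "CommonPrefixes" key at all, so .get avoids the KeyError.
    return [cp["Prefix"] for cp in response.get("CommonPrefixes", [])]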
Is your feature request related to a problem? Please describe.
As a service maintainer, I want to easily review job summary logs in AWS cloudwatch so that I can troubleshoot issues faster.
INFO:root:{'ResponseMetadata': {'RequestId': '*************', 'HostId': '*************', 'HTTPStatusCode': 200, 'HTTPHeaders': {..., 'date': 'Fri, 17 May 2024 20:08:09 GMT', 'x-amz-version-id': '**************', 'x-amz-server-side-encryption': '***', 'etag': '"*******************************"', 'content-length': '0'}, 'RetryAttempts': 0}, 'ETag': '"*******************************"', 'VersionId': '*************************'}
Describe the solution you'd like
Update populate_s3_with_metadata_files.py and aind_bucket_indexer.py to include summary counts (num created, updated, deleted, etc.) after each bucket, rather than only the raw logging.info(response) output.
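One possible shape for the summary, assuming a plain counter per bucket; the counter keys and message format are illustrative:

import logging

def log_bucket_summary(bucket: str, counts: dict) -> None:
    """Log a single summary line per bucket instead of one full response per record."""
    logging.info(
        "Finished %s: %s created, %s updated, %s deleted",
        bucket,
        counts.get("created", 0),
        counts.get("updated", 0),
        counts.get("deleted", 0),
    )

# Hypothetical usage inside a job run:
# counts = {"created": 0, "updated": 0, "deleted": 0}
# ... increment counts while processing records ...
# log_bucket_summary("aind-open-data", counts)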
Describe alternatives you've considered
Leaving it as is does not affect running the jobs.
As a user, I want to see processed data in the docdb. The tags are being changed from "processed" to "derived" so we need to update that here.
As a user, I want to ensure that aind-data-access-api is fully added on the next upgrade, so that I can access the latest changes.
As an engineer, I want to publish a docker image so that I can run the code in the cloud.
As a user, I want to use the latest versions of the dependencies so that I can access the latest changes.
aind-data-access-api needs the full option.
As a developer, I want to remove any redundant copy_original_md_subdir defaults, so that it is easier to maintain the codebase in case of future changes.
Currently, copy_original_md_subdir is set in the job settings and has a default in the utils.copy_then_overwrite_core_json_files method. Remove one of them, so that copy_original_md_subdir is set in only one place.
As a developer, I want to reference any AWS Secrets Manager Secrets directly from Parameter store, so I can get the DocDB secret without using the secrets manager client.
Currently, we have a parameter in Parameter Store that contains the name of the DocDB secret, and we use a separate Secrets Manager client to get the secret value and add it to the appropriate job settings.
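Parameter Store can resolve a Secrets Manager secret through its /aws/reference/secretsmanager/ path, so a single-client sketch might look like the following; the secret name is hypothetical:

import json
import boto3

ssm_client = boto3.client("ssm")

def get_docdb_secret() -> dict:
    """Read the DocDB secret via Parameter Store's Secrets Manager reference."""
    response = ssm_client.get_parameter(
        Name="/aws/reference/secretsmanager/docdb_credentials",  # hypothetical secret name
        WithDecryption=True,
    )
    return json.loads(response["Parameter"]["Value"])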
As a DevOps engineer, I want to update the Dockerfile, so I can publish an image that runs both the s3_crawler.py and update_docdb.py scripts.
As a developer, I want to have checks on tests and code formats, so I can maintain the code base easier.
As an engineer, I want a script to do sanity checks on DocDB.
Figure out how we want to log discrepancies.
As a user, I want to update the document store with data asset metadata, so I can search them more easily.
Maybe this will be moot with future releases of Code Ocean?
Is your feature request related to a problem? Please describe.
If the metadata.nd.json exists in s3 but not in docdb, AindIndexBucketJob._process_prefix will write it to DocDB. There is an unhandled error if the metadata.nd.json does not have an _id field, causing the indexer to crash.
Describe the solution you'd like
Add a check that _id exists in the metadata.nd.json. We should also check that other services are not creating the .nd.json files, or if they do, they should use something like:
metadata_dict = Metadata.model_construct(**metadata_dict).model_dump_json(warnings=False, by_alias=True)
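A hedged sketch of the _id check inside the indexer, assuming the record is available as a dict before it is written to DocDB; skipping with a warning is just one option:

import logging

def should_write_to_docdb(metadata_dict: dict, prefix: str) -> bool:
    """Skip records whose metadata.nd.json lacks an _id instead of crashing."""
    if not metadata_dict.get("_id"):
        logging.warning("metadata.nd.json for %s has no _id field; skipping.", prefix)
        return False
    return True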
As a user, I want to optionally pass an EC2 instance's assumed-role AWS credentials, so I can run the capsule via pipelines.
As an engineer, I want DocDB to be updated to match S3.
Instead of using the Redshift table, use S3 as the source of truth. Log discrepancies (maybe in a future ticket).
Describe the bug
Ran into this bug while running the populate job:
File "aind-data-asset-indexer/src/aind_data_asset_indexer/utils.py", line 453, in is_dict_corrupt
for key, value in input_dict.items():
AttributeError: 'list' object has no attribute 'items'
To Reproduce
Steps to reproduce the behavior:
is_dict_corrupt([])
Expected behavior
If the json file is a list, we can flag it as corrupt
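A sketch of the fix, keeping a recursive walk but treating any non-dict top level (such as a list) as corrupt; the invalid-key rule shown here is an assumption about what utils.is_dict_corrupt checks:

def is_dict_corrupt(input_dict) -> bool:
    """Flag non-dict inputs (e.g. a top-level list) as corrupt before iterating."""
    if not isinstance(input_dict, dict):
        return True
    for key, value in input_dict.items():
        # Assumed rule: keys DocDB cannot store are treated as corrupt.
        if "$" in key or "." in key:
            return True
        if isinstance(value, dict) and is_dict_corrupt(value):
            return True
    return False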
This is part of the AIND Metadata Update POC.
As a service admin, I want user updates to metadata in DocDB to also be reflected in S3, so I can ensure data is in sync.
The new AIND metadata update process will allow users to update metadata directly to DocDB. We can leverage existing AIND Data Asset Indexer to make the appropriate downstream changes to S3.
The new DocDB to S3 workflow should only run in Dev as part of the POC.
As a developer, I want to run the aind_bucket_indexer.py script twice a day, so I can index updates in docdb to the s3 buckets.
run.sh will run the aind_bucket_indexer job.
Describe the bug
In the past, "tags" used to always be in the response returned from Code Ocean.
To Reproduce
Steps to reproduce the behavior:
response = co_client.search_all_data_assets()
and then results = response.json()["results"]
Expected behavior
In the past, the tags field was always present. We'll have to update the code to handle this new behavior.
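A small sketch of the defensive handling, assuming each result is a dict from the Code Ocean client; .get with a default keeps assets without tags from raising a KeyError:

def get_tags(result: dict) -> list:
    """Return the asset's tags, or an empty list if the field is missing."""
    return result.get("tags", [])

# Hypothetical usage with the Code Ocean client:
# response = co_client.search_all_data_assets()
# for result in response.json()["results"]:
#     tags = get_tags(result)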
As a user, I want the metadata docdb to be synced with S3. Delete records in the docdb data assets collection if they are not found in S3.
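A rough sketch of that sync step, assuming a pymongo-style collection and an S3 existence check like the one sketched earlier; the collection handle and the location field are assumptions:

def delete_records_missing_from_s3(docdb_collection, bucket: str) -> int:
    """Remove DocDB records whose S3 prefix no longer exists. Returns the number deleted."""
    deleted = 0
    for record in docdb_collection.find({}, {"_id": 1, "location": 1}):
        location = record.get("location") or ""
        prefix = location.replace(f"s3://{bucket}/", "")
        if prefix and not does_s3_prefix_exist(bucket, prefix):  # existence check sketched above
            docdb_collection.delete_one({"_id": record["_id"]})
            deleted += 1
    return deleted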
As a user, I want to see only metadata records from valid s3 prefixes according to a certain format, so that invalid folders are ignored.
Note that in the lambda function, invalid s3 prefixes are already ignored.
A valid s3 prefix should be in the format: {modality}_{id}_{acq_datetime}
We can check using DATA = f"^(?P<label>.+?)_(?P<c_date>{RegexParts.DATE.value})_(?P<c_time>{RegexParts.TIME.value})$" from aind-data-schema.
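An illustrative version of that check, with the date and time patterns spelled out here as assumptions rather than imported from aind-data-schema:

import re

# Assumed expansions of RegexParts.DATE and RegexParts.TIME.
DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"
TIME_PATTERN = r"\d{2}-\d{2}-\d{2}"
DATA_PATTERN = re.compile(
    rf"^(?P<label>.+?)_(?P<c_date>{DATE_PATTERN})_(?P<c_time>{TIME_PATTERN})$"
)

def is_valid_prefix(s3_prefix: str) -> bool:
    """Return True if the prefix looks like {modality}_{id}_{acq_datetime}."""
    return DATA_PATTERN.match(s3_prefix.strip("/")) is not None

# Example: is_valid_prefix("ecephys_123456_2023-01-01_01-01-01") -> True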