
ingest-validation-tools

HuBMAP data upload guidelines, and instructions for checking that uploads adhere to those guidelines. Assay documentation is on GitHub Pages.

HuBMAP has three distinct metadata processes:

  • Donor metadata is handled by Jonathan Silverstein on an ad hoc basis: he works with whatever format the TMC can provide, and aligns it with controlled vocabularies.
  • Sample metadata is ingested by the HuBMAP Data Ingest Portal; see "Upload Sample Metadata" at the top of the page.
  • Dataset uploads should be validated first by the TMCs. Dataset upload validation is the focus of this repo. Details below.

For assay type working groups:

Before we can write code to validate a particular assay type, there are some prerequisites:

  • A document describing the experimental techniques involved.
  • A list of the metadata fields for this type, along with descriptions and constraints.
  • A list of the files to be expected in each dataset directory, along with descriptions. Suggestions for describing directories.

When all the parts are finalized,

  • The document will be translated into markdown, and added here.
  • The list of fields will be translated into a table schema, like those here.
  • The list of files will be translated into a directory schema, like those here.

Stability

Once approved, both the CEDAR Metadata Template (metadata schema) and the list of files (directory schema) are fixed in a particular version. The metadata for a particular assay type needs to be consistent for all datasets, as does the set of files which comprise a dataset. Edits to descriptions are welcome, as are improved validations.

If a more significant change is necessary, a new version is required, and when the older form is no longer acceptable, the schema should be deprecated.

HuBMAP HIVE members: For questions about the stability of metadata, contact Nils Gehlenborg (@ngehlenborg), or add him as a reviewer on the PR. For the stability of directory structures, contact Phil Blood (@pdblood).

For data submitters and curators:

Validate TSVs

To validate your metadata TSV files, use the HuBMAP Metadata Spreadsheet Validator. This tool is a web-based application that will categorize any errors in your spreadsheet and provide help fixing those errors. More detailed instructions about using the tool can be found in the Spreadsheet Validator Documentation.

Validate Directory Structure

Check out the repo and install dependencies:

python --version  # Should be Python 3.
git clone https://github.com/hubmapconsortium/ingest-validation-tools.git
cd ingest-validation-tools
# Optionally, set up venv or conda, then:
pip install -r requirements.txt
src/validate_upload.py --help

You should see the documentation for validate_upload.py.

Now run it against one of the included examples, giving the path to an upload directory:

src/validate_upload.py \
  --local_directory examples/dataset-examples/bad-tsv-formats/upload \
  --no_url_checks \
  --output as_text

Note: URL checking is not supported via validate_upload.py at this time, and is disabled with the use of the --no_url_checks flag. Please ensure that any fields containing a HuBMAP ID (such as parent-sample_id) or an ORCID (orcid) are accurate.
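Since URL checks are disabled, a quick local format check can catch obvious typos in ID fields before upload. A minimal sketch; the regexes below are illustrative assumptions about the ID shapes (ORCID is four hyphen-separated groups of four characters, the last of which may be an X), not the official validators, and a real ORCID check would also verify the checksum digit:

```python
import re

# Illustrative patterns (assumptions, not the official validators):
ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
HUBMAP_ID_RE = re.compile(r"^HBM\d{3}\.[A-Z]{4}\.\d{3}$")

def looks_like_orcid(value: str) -> bool:
    """Format-only check; does not verify the ORCID checksum."""
    return bool(ORCID_RE.match(value))

print(looks_like_orcid("0000-0002-1825-0097"))  # True
print(looks_like_orcid("0000-0002-1825"))       # False
```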

You should now see an extensive error message. This example TSV has been constructed with a mistake in every column, just to demonstrate the checks which are available. Hopefully, more often your experience will be like this:

src/validate_upload.py \
  --local_directory examples/dataset-examples/good-codex-akoya-metadata-v1/upload \
  --no_url_checks
No errors!

Documentation and metadata TSV templates for each assay type are here.

Running plugin tests:

Additional plugin tests can also be run. These additional tests confirm that the files themselves are valid, not just that the directory structures are correct. These additional tests are in a separate repo, and have their own dependencies.

Starting from ingest-validation-tools...

cd ..
git clone https://github.com/hubmapconsortium/ingest-validation-tests.git
cd ingest-validation-tests
pip install -r requirements.txt

Back to ingest-validation-tools...

cd ../ingest-validation-tools

Failing example, see README.md

src/validate_upload.py \
  --local_directory examples/plugin-tests/expected-failure/upload \
  --run_plugins \
  --no_url_checks \
  --plugin_directory ../ingest-validation-tests/src/ingest_validation_tests/

For developers and contributors:

An example of the core error-reporting functionality underlying validate_upload.py:

from ingest_validation_tools.error_report import ErrorReport
from ingest_validation_tools.upload import Upload

upload = Upload(directory_path=path)  # path: a pathlib.Path to the upload directory
report = ErrorReport(errors=upload.get_errors(), info=upload.get_info())
print(report.as_text())

(If it would be useful for this to be installable with pip, please file an issue.)

To make contributions, check out the project, cd into it, set up a virtual environment, and then:

pip install -r requirements.txt
pip install -r requirements-dev.txt
brew install parallel    # On macOS
apt-get install parallel # On Ubuntu
./test.sh

After making tweaks to the schema, you will need to regenerate the docs: The test error message will tell you what to do.

GitHub Actions

This repo uses GitHub Actions to check formatting and linting of code using black, isort, and flake8. Especially before submitting a PR, make sure your code is compliant. Run the following from the base ingest-validation-tools directory:

black --line-length 99 .
isort --profile black --multi-line 3 .
flake8

Integrating black and potentially isort/flake8 with your editor may allow you to skip this step.

Releases

For releases we're just using git tags:

$ git tag v0.0.x
$ git push origin v0.0.x

Repo structure

Checking in the built documentation is not the typical approach, but has worked well for this project:

  • It's a sanity check when making schema changes. Since the schema for an assay actually comes from multiple sources, having the result of include resolution checked in makes it possible to catch unintended changes.
  • It simplifies administration, since a separate static documentation site is not required.
  • It enables easy review of the history of a schema, since the usual git/github tools can be used.

Upload process and upload directory structure

Data upload to HuBMAP is composed of discrete phases:

  • Upload preparation and validation
  • Upload and re-validation
  • Restructuring
  • Re-re-validation and pipeline runs

(Diagram: upload process)

Uploads are based on directories containing at a minimum:

  • one or more *-metadata.tsv files.
  • top-level dataset directories in a 1-to-1 relationship with the rows of the TSVs.

The type of a metadata TSV is determined by reading its first row (the column headers).
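Reading the header row is enough to see which schema a TSV claims to follow. A minimal sketch using only the standard library; the column names in the sample TSV are illustrative, not a complete schema:

```python
import csv
import io

# A tiny in-memory metadata TSV for illustration (field names are examples):
tsv = (
    "assay_type\tdata_path\tcontributors_path\n"
    "codex\t./dataset-1\t./contributors.tsv\n"
)

with io.StringIO(tsv) as f:
    reader = csv.DictReader(f, dialect="excel-tab")
    header = reader.fieldnames      # first row: the column names
    first_record = next(reader)     # subsequent rows: one dataset each

print(header)                       # ['assay_type', 'data_path', 'contributors_path']
print(first_record["assay_type"])   # codex
```

For a real upload, replace the StringIO with `open("…-metadata.tsv", newline="")`.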

The antibodies_path (for applicable types), contributors_path, and data_path are relative to the location of the TSV. The antibodies and contributors TSV will typically be at the top level of the upload, but if they are applicable to only a single dataset, they can be placed within that dataset's extras/ directory.

You can validate your upload directory locally, then upload it to Globus, and the same validation will be run there.

(Diagram: upload directory structure)

Contributors

bhonick, cebriggs7135, dependabot[bot], gesinaphillips, icaoberg, j-uranic, jamestwebber, jpuerto-psc, jswelling, lukasz-migas, maxsibilla, mccalluc, mruffalo, nhpatterson, ropelews, sfd5311, sunset666


Issues

data_path exhaustive

We want to be sure that all the files are referenced by exactly one data_path.
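The check amounts to: for every file in the upload, count how many data_paths contain it, and flag anything whose count is not exactly one. A minimal sketch under the assumption that paths are POSIX-style and relative to the upload root:

```python
from pathlib import PurePosixPath

def count_owners(file_path: str, data_paths: list[str]) -> int:
    """Count how many data_paths contain the given file.
    A valid upload should yield exactly 1 for every file."""
    fp = PurePosixPath(file_path)
    return sum(1 for dp in data_paths if PurePosixPath(dp) in fp.parents)

data_paths = ["dataset-1", "dataset-2"]
print(count_owners("dataset-1/raw/image.tiff", data_paths))  # 1
print(count_owners("stray-file.txt", data_paths))            # 0: unreferenced
```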

add pre.md and post.md

The generated documentation would be more useful if it could be part of a larger document: Consider adding pre.md and post.md which could be used to bracket the generated content in a larger document.

Fix CODEX TODOs

Here's an example of a DOI: 10.1038/ejhg.2009.142

name: assay_category
description: 'Each assay is placed into one of the following 3 general categories: generation of images of microscopic entities, identification & quantitation of molecules by mass spectrometry, and determination of nucleotide sequence.'
# TODO: What are the exact strings to expect?
# Candidates: imaging, mass_spectrometry, sequence_data

name: assay_type
description: The specific type of assay being executed.
# TODO: What are the exact strings to expect?
# Candidate assay names (perhaps as a dropdown menu; see
# https://docs.google.com/spreadsheets/d/1gSSwCi9kx7-x_wcEDQFv-MLNgm_hNNN4PPIL0mg4GSI/edit#gid=1899790107):
#   scRNA-Seq (10xGenomics), AF, bulk RNA, bulk ATAC, CODEX,
#   Imaging Mass Cytometry, LC-MS (metabolomics), LC-MS/MS (label-free proteomics),
#   MxIF, IMS positive, IMS negative, MS (shotgun lipidomics), PAS microscopy,
#   sci-ATAC-seq, sci-RNA-seq, seqFISH, SNARE-SEQ2, snATAC, snRNA, SPLiT-Seq,
#   TMT (proteomics), WGS

name: analyte_class
description: Analytes are the target molecules being measured with the assay.
# TODO: What are the exact strings to expect? Currently free text.

name: is_targeted
description: Specifies whether or not a specific molecule(s) is/are targeted for detection/measurement by the assay. The CODEX analyte is protein.
type: boolean
# TODO: More strict? TRUE or FALSE

Use %z, not %Z

@cebriggs7135 : Looping you in: I think it's not feasible to support timezone abbreviations in the datetimes. The three letter codes are just not specific, maintaining our own list of HuBMAP approved timezones will be an on-going nightmare, and the code needed to parse the date becomes much more complicated at every level.

As an alternative, I think folks can just use the +/- notation: +06:00 for example.
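Python's strptime illustrates the difference: %z parses a numeric UTC offset (including the colon form, since Python 3.7) unambiguously, whereas a three-letter abbreviation like CST is inherently ambiguous (China Standard Time? US Central?) and cannot be resolved to a single offset:

```python
from datetime import datetime, timedelta

# %z handles numeric offsets, so '+06:00' parses unambiguously:
dt = datetime.strptime("2020-01-15 13:00 +06:00", "%Y-%m-%d %H:%M %z")
print(dt.utcoffset() == timedelta(hours=6))  # True

# By contrast, there is no reliable way to map 'CST' (and most other
# three-letter abbreviations) to an offset, which is why %Z is avoided.
```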

data_path non-intersection

We want to make sure that the data_paths are not only unique, but that none is a leading substring (path prefix) of another; i.e., the directory trees don't overlap.
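The prefix check can be done pairwise on the parsed paths. A minimal sketch, assuming POSIX-style data_paths relative to the upload root (a quadratic scan is fine at the scale of a typical upload):

```python
from pathlib import PurePosixPath

def overlapping_data_paths(data_paths: list[str]) -> list[tuple[str, str]]:
    """Return (outer, inner) pairs where one data_path is nested inside another."""
    paths = [PurePosixPath(p) for p in data_paths]
    return [
        (str(a), str(b))
        for a in paths
        for b in paths
        if a != b and a in b.parents
    ]

print(overlapping_data_paths(["dataset-1", "dataset-2"]))         # []
print(overlapping_data_paths(["dataset-1", "dataset-1/nested"]))  # one overlap
```

Using PurePosixPath.parents rather than str.startswith avoids false positives like "dataset-1" vs "dataset-10".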

Add timezone "%Z"

Test passed locally, failed on Travis; guessing it pulled from an OS localization file which wasn't present there?

Modify definitions for barcode size and offset

cell_barcode_offset definition should read:
Positions in the read at which the cell barcodes start. Cell barcodes are 3 x 8 bp sequences that are spaced by constant sequences (the offsets). First barcode at position 0, then 38, then 76. (Does not apply to SNARE-seq and BulkATAC)

cell_barcode_size definition should read:
Length of the cell barcode in base pairs. Cell barcodes are, for example, 3 x 8 bp sequences that are spaced by constant sequences, the offsets. (Does not apply to SNARE-seq and BulkATAC)

Update readme

Just an explanation of the generate and validate scripts: What their input and output is.

Desired interface? -> Function that returns list of strings

Right now, I have validate.py which takes command line arguments, and either exits with 0 on success, or non-0, and with human-readable errors on STDOUT… is that the easiest interface for you to use, or would something else be better?

Subtyping of codex-submissions?

For codex submissions, I’m currently distinguishing akoya and stanford… it could be merged, but that would make it a lot harder to understand the error messages.

Get other documentation to point here

This (or documentation generated from this source code) needs to be the one place people look for how to structure their submission. Higher level concerns can still be described in docs, but the details need to be here, and only here.

  • First, I need to talk with Joel and figure out whether this will continue to exist independently, or if it will be merged into another repo.
  • With that settled, other docs can be simplified so that they point here for details.
  • Example spreadsheets should preferably be removed entirely, or if necessary, retitled to indicate they are deprecated.

We know the standards are going to continue to evolve, and this is the way I think it can work. If you see a problem with this course, please raise an alarm!

How should this be packaged?

I can think of two main alternatives, though there may be others:

  • Keep this repo: publish to PyPI, and the ingest pipeline will use it.
  • This whole repo is assimilated into the ingest pipeline.

Columns referring to barcode size and offset take 3 comma-separated values

For ATACseq data, metadata columns cell_barcode_offset (column AE) and cell_barcode_size (column AF) take 3 comma-separated values.
Cell barcodes are 3 x 8 bp sequences (for UCSD) that are spaced by constant sequences (the offsets).
Barcode size: 8,8,8
Barcode offset: 0,38,76
First barcode at position 0, then 38, then 76.
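Parsing such a field is a small exercise, but worth doing strictly so a malformed value fails loudly. A minimal sketch; the function name and the expected-length check are illustrative, not part of the actual validator:

```python
def parse_int_list(field: str, expected_len: int = 3) -> list[int]:
    """Parse a comma-separated metadata field like '0,38,76' into ints,
    enforcing the expected number of values."""
    values = [int(v) for v in field.split(",")]
    if len(values) != expected_len:
        raise ValueError(f"expected {expected_len} values, got {len(values)}")
    return values

print(parse_int_list("0,38,76"))  # [0, 38, 76]  (cell_barcode_offset)
print(parse_int_list("8,8,8"))    # [8, 8, 8]    (cell_barcode_size)
```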

Evaluate tools for "CSV Schema"

The original pointer was to cidc-schemas, but I have some concerns after looking at an example:

  • It's not generating a template from json schema: This is their own syntax, though it looks like it does reference types in json schema.
  • The PyPI package is pretty huge, with a lot of dependencies, and when I start it up there are warnings about deprecations.
  • The template that it produces is Excel, not CSV.

Alternatives:

  • Write our own JSON Schema.
  • Table Schema and its python implementation: This looks well thought out, and in a JSON syntax supports things like unique values across columns, and referential integrity. Used by outside projects like csvlint.
  • goodtables-py is under the same umbrella: Not sure how it relates.
  • PandasSchema would require pandas, and the schema is python.
  • CSV Schema from LoC only seems to have a Java implementation: csv-validator

For completeness:

Generate documentation from schema

We need to generate user facing documentation from the schema.

  • Nils suggests OpenAPI, but that is for REST APIs, not simply schemas. And the presentation it has for schemas is not much better than looking at source code: petstore demo
  • CIDC does use JSON Schema for their manifests, and I don't see anything about doc generation there.
  • The JSON schema site itself has code generation, and form generation, but no doc generation.

Should we support "*" globs in paths?

In the first metadata.tsv we got (from Dinh), they’re expecting to be able to use globs (actually, [1-3] , but *-globs could work) for datasets, instead of having neatly separated subdirectories. Would this be feasible to handle downstream? I’d really prefer not, but if it would be feasible for you, I don’t want the validation to be an arbitrary stumbling block.
