hubmapconsortium / ingest-validation-tools
HuBMAP data submission guidelines, and tools which check that submissions adhere to those guidelines.
License: MIT License
We need to generate user facing documentation from the schema.
We want to make sure the data_paths are not only unique, but that none is a prefix (left substring) of another (i.e., the directory trees don't overlap).
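A minimal sketch of such an overlap check follows. The function name `find_overlapping_paths` is hypothetical and not taken from the repo. It assumes data_paths are POSIX-style relative paths and compares whole path components rather than raw string prefixes, so `a/b` and `a/bc` are not reported as overlapping.

```python
from pathlib import PurePosixPath


def find_overlapping_paths(data_paths):
    """Return (ancestor, descendant) pairs where one data_path contains another."""
    # Compare by path components so "a/b" is not treated as a prefix of "a/bc".
    parts = sorted({PurePosixPath(p).parts for p in data_paths})
    overlaps = []
    for i, shorter in enumerate(parts):
        # After sorting, any tuple that extends `shorter` comes later in the list.
        for longer in parts[i + 1:]:
            if longer[:len(shorter)] == shorter:
                overlaps.append(
                    (str(PurePosixPath(*shorter)), str(PurePosixPath(*longer)))
                )
    return overlaps
```

If a plain string-prefix test is what is actually wanted, `str.startswith` on the raw values would be the stricter alternative. Uniqueness is left as a separate check.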
For ATACseq data, metadata columns cell_barcode_offset (column AE) and cell_barcode_size (column AF) take 3 comma-separated values.
Cell barcodes are 3 x 8 bp (for UCSD, it's 3 x 8bp) sequences that are spaced by constant sequences (the offsets).
Barcode size: 8,8,8
Barcode offset: 0,38,76
First barcode at position 0, then 38, then 76.
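To illustrate how the two columns fit together, here is a rough sketch that slices barcodes out of a read. The helper names are hypothetical, and it assumes offsets are 0-based positions in the read, as the example values suggest.

```python
def parse_int_list(value):
    """Turn a comma-separated cell such as '0,38,76' into [0, 38, 76]."""
    return [int(item) for item in value.split(",")]


def extract_barcodes(read, cell_barcode_offset, cell_barcode_size):
    """Slice each barcode out of a read, given the two metadata cells."""
    offsets = parse_int_list(cell_barcode_offset)
    sizes = parse_int_list(cell_barcode_size)
    if len(offsets) != len(sizes):
        raise ValueError("offset and size must list the same number of values")
    # Assumption: offsets are 0-based start positions within the read.
    return [read[start:start + size] for start, size in zip(offsets, sizes)]
```

With offsets `0,38,76` and sizes `8,8,8`, each 8 bp barcode is followed by a 30 bp constant spacer before the next barcode begins.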
Delete the metadata fieldname "subspecimen_assay_input_number" on line 124 of the yaml. This information is captured by another filename.
In the first metadata.tsv we got (from Dinh), they’re expecting to be able to use globs (actually, [1-3] , but *-globs could work) for datasets, instead of having neatly separated subdirectories. Would this be feasible to handle downstream? I’d really prefer not, but if it would be feasible for you, I don’t want the validation to be an arbitrary stumbling block.
Will you be able to pass in a local filesystem path for the submission, or will it be something else… maybe I need to make globus API calls, or maybe you’ll pass me a directory listing, or maybe …?
... in particular: The protocols.io field got a different name in the atac spec.
Test passed locally, failed on travis... guessing it pulled from some OS localization file which wasn't present there?
I can think of two main alternatives, though there may be others:
The allowed values are now:
change
cell_barcode_offset definition should read:
Positions in the read at which the cell barcodes start. Cell barcodes are 3 x 8 bp sequences that are spaced by constant sequences (the offsets). First barcode at position 0, then 38, then 76. (Does not apply to SNARE-seq and BulkATAC)
cell_barcode_size definition should read:
Length of the cell barcode in base pairs. Cell barcodes are, for example, 3 x 8 bp sequences that are spaced by constant sequences, the offsets. (Does not apply to SNARE-seq and BulkATAC)
For codex submissions, I’m currently distinguishing akoya and stanford… it could be merged, but that would make it a lot harder to understand the error messages.
I think this might be the location?
https://docs.google.com/document/d/1UqAq04xSzd7PhhdXaiMVUk-lVaZB1orGXvyFS5CHKI8/edit
Matt Ruffalo 1:19 PM
UCSD dataset ea42a70873cd0eb7e2cc91fd6d79fc8b
The same fixture might be used in multiple doctests, but I know I have gotten confused when I've duplicated a file to make a new doctest, and then forgot to change the target inside.
The issue regarding transposition_transposase_source has been resolved but the TODO note is still there causing confusion to viewers.
@cebriggs7135 : Looping you in: I think it's not feasible to support timezone abbreviations in the datetimes. The three letter codes are just not specific, maintaining our own list of HuBMAP approved timezones will be an on-going nightmare, and the code needed to parse the date becomes much more complicated at every level.
As an alternative, I think folks can just use the +/- notation: +06:00 for example.
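A sketch of what parsing with a numeric offset could look like, using only the standard library. The `YYYY-MM-DD hh:mm +hh:mm` layout and the name `parse_timestamp` are assumptions for illustration, not the format the spec settles on. The colon inside `%z` requires Python 3.7 or later.

```python
from datetime import datetime

# Assumed layout for illustration only: "2020-01-15 13:05 +06:00".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M %z"


def parse_timestamp(value):
    """Parse a timestamp with a numeric UTC offset; raise ValueError otherwise."""
    # %z accepts +hh:mm / -hh:mm offsets, but not abbreviations like "CST".
    return datetime.strptime(value, TIMESTAMP_FORMAT)
```

Because `%z` rejects three-letter abbreviations outright, no list of approved timezones has to be maintained.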
Here's an example of a DOI: DOI: 10.1038/ejhg.2009.142
1:05
name: assay_category
description: ‘Each assay is placed into one of the following 3 general categories: generation of images of microscopic entities, identification & quantitation of molecules by mass spectrometry, and determination of nucleotide sequence.’
# TODO: What are the exact strings to expect?
imaging, mass_spectrometry, sequence_data
name: assay_type
description: The specific type of assay being executed.
# TODO: What are the exact strings to expect?
The assay names here as a dropdown menu????: Assay Type https://docs.google.com/spreadsheets/d/1gSSwCi9kx7-x_wcEDQFv-MLNgm_hNNN4PPIL0mg4GSI/edit#gid=1899790107
scRNA-Seq (10xGenomics)
AF
bulk RNA
bulk ATAC
CODEX
Imaging Mass Cytometry
LC-MS (metabolomics)
LC-MS/MS (label-free proteomics)
MxIF
IMS positive
IMS negative
MS (shotgun lipidomics)
PAS microscopy
sci-ATAC-seq
sci-RNA-seq
seqFISH
SNARE-SEQ2
snATAC
snRNA
SPLiT-Seq
TMT (proteomics)
WGS
name: analyte_class
description: Analytes are the target molecules being measured with the assay.
# TODO: What are the exact strings to expect?
free text-
name: is_targeted
description: Specifies whether or not a specific molecule(s) is/are targeted for detection/measurement by the assay. The CODEX analyte is protein.
type: boolean
# TODO: More strict? TRUE or FALSE
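If the answer to the "more strict" TODO is yes, one possible reading is to accept only the literal strings TRUE and FALSE. The sketch below assumes exactly that; the function name is hypothetical.

```python
def parse_strict_boolean(value):
    """Accept only the exact strings 'TRUE' and 'FALSE'."""
    # Assumption: no trimming and no case folding, so 'true', 'Yes' and '1' fail.
    if value == "TRUE":
        return True
    if value == "FALSE":
        return False
    raise ValueError(f"expected TRUE or FALSE, got {value!r}")
```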
Look at globus and the codex submission spec.
The original pointer was to cidc-schemas, but I have some concerns after looking at an example:
Alternatives to:
For completeness:
I think a key-value table would work well.
The generated documentation would be more useful if it could be part of a larger document: Consider adding pre.md and post.md which could be used to bracket the generated content in a larger document.
Just an explanation of the generate and validate scripts: what their inputs and outputs are.
This field has a TODO question:
TODO: Should this be updated? Seems like copy and paste from CODEX.
The answer is no, it should not be updated. CODEX is just offered as an example of a targeted assay.
Make sure the donor and sample IDs passed in match what we see in the file.
We want to be sure that all the files are referenced by exactly one data_path.
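One way this could be checked, sketched under the assumption that both the file listing and the data_paths are POSIX-style paths relative to the submission root. The function name is hypothetical.

```python
from pathlib import PurePosixPath


def files_without_single_owner(files, data_paths):
    """Map each problem file to the data_paths that contain it (zero or several)."""
    roots = [PurePosixPath(p) for p in data_paths]
    problems = {}
    for name in files:
        file_path = PurePosixPath(name)
        # A data_path owns a file if it is the file itself or one of its ancestors.
        owners = [str(r) for r in roots if r == file_path or r in file_path.parents]
        if len(owners) != 1:
            problems[name] = owners
    return problems
```

An empty list in the result means the file is not covered by any data_path. More than one entry means the data_paths overlap.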
The CODEX metadata ingest form has 2 fields for data_path and metadata_path. Please add the same 2 fields for submitters to paste the Globus link to their data and a link to metadata.
Right now, I have validate.py which takes command line arguments, and either exits with 0 on success, or non-0, and with human-readable errors on STDOUT… is that the easiest interface for you to use, or would something else be better?
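For reference, a caller could consume that interface roughly as below. This is only a sketch: `run_validator` is a hypothetical wrapper, and the command is whatever invocation is appropriate, since no actual validate.py arguments are assumed here.

```python
import subprocess
import sys


def run_validator(command):
    """Run a validator command and return (passed, report_text)."""
    # Exit status 0 means success; anything else means the report lists errors.
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout


# Stand-in command for illustration; it mimics a failing validator.
FAKE_FAILING_VALIDATOR = [
    sys.executable, "-c", "print('row 2: bad value'); raise SystemExit(1)"
]
```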
We want to make sure the data_paths actually exist in the submission
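Assuming the submission is available as a local directory (which the question about filesystem paths versus Globus leaves open), the existence check might be as small as this. The function name is hypothetical.

```python
from pathlib import Path


def missing_data_paths(submission_dir, data_paths):
    """Return the data_paths that are not directories under submission_dir."""
    base = Path(submission_dir)
    # Assumption: each data_path is relative to the submission root.
    return [p for p in data_paths if not (base / p).is_dir()]
```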
Table schema is richer, but we can get some of the constraints encoded in the json schema as well.
They are just to keep ugly <BLANKLINE> out of doctests.
This (or documentation generated from this source code) needs to be the one place people look for how to structure their submission. Higher level concerns can still be described in docs, but the details need to be here, and only here.
We know the standards are going to continue to evolve, and this is the way I think it can work. If you see a problem with this course, please raise an alarm!
The 3 options should be:
Should I put this on PyPI, or would you like something else?
Telling someone the error is in column 46 on a spreadsheet isn't helpful.
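A small sketch of converting a 1-based column number into the letters a spreadsheet shows, so an error can say "column AT" instead of "column 46". The function name is hypothetical.

```python
def column_letter(index):
    """Convert a 1-based column number to spreadsheet letters (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column numbers start at 1")
    letters = ""
    while index > 0:
        # Spreadsheet columns are a base-26 system without a zero digit.
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters
```

Including the column header next to the letter in the message would likely help further.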