
ingest-validation-tools's Issues

Generate documentation from schema

We need to generate user facing documentation from the schema.

  • Nils suggests OpenAPI, but that is for REST APIs, not standalone schemas, and its presentation of schemas is not much better than looking at source code: petstore demo
  • CIDC does use JSON Schema for their manifests, and I don't see anything about doc generation there.
  • The JSON schema site itself has code generation, and form generation, but no doc generation.

data_path non-intersection

We want to make sure that the data_paths are not only unique, but that none is a left substring (prefix) of another (i.e., the trees don't overlap).
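A minimal sketch of such a check, assuming the data_paths are POSIX-style relative paths. The function name is hypothetical. It compares path components rather than raw strings, which is an assumption on my part: a raw left-substring test would also flag `dataset-1` against `dataset-10`, even though those trees do not overlap.

```python
from pathlib import PurePosixPath


def find_overlapping_paths(data_paths):
    """Return (ancestor, descendant) pairs whose directory trees overlap.

    Exact duplicates are reported once, as a pair of identical paths.
    """
    parts = [PurePosixPath(p).parts for p in data_paths]
    overlaps = []
    for i, a in enumerate(parts):
        for j, b in enumerate(parts):
            if i == j or b[:len(a)] != a:
                continue
            # Strict ancestors match in one direction only; duplicates match
            # in both, so keep just the i < j occurrence.
            if len(a) < len(b) or i < j:
                overlaps.append((data_paths[i], data_paths[j]))
    return overlaps
```

An empty result would mean the data_paths are both unique and non-intersecting.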

Columns referring to barcode size and offset take 3 comma-separated values

For ATACseq data, metadata columns cell_barcode_offset (column AE) and cell_barcode_size (column AF) take 3 comma-separated values.
Cell barcodes are 3 x 8 bp (for UCSD, it's 3 x 8bp) sequences that are spaced by constant sequences (the offsets).
Barcode size: 8,8,8
Barcode offset: 0,38,76
First barcode at position 0, then 38, then 76.
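To illustrate how the two columns fit together, here is a small sketch that parses the comma-separated values and slices the barcodes out of a read. The function names and the example read are made up for illustration; only the `8,8,8` and `0,38,76` values come from the description above.

```python
def parse_int_list(value):
    """Parse a comma-separated cell such as '0,38,76' into a list of ints."""
    return [int(item) for item in value.split(",")]


def extract_barcodes(read, offsets_cell, sizes_cell):
    """Slice each barcode out of the read using paired offsets and sizes."""
    offsets = parse_int_list(offsets_cell)
    sizes = parse_int_list(sizes_cell)
    if len(offsets) != len(sizes):
        raise ValueError("offset and size columns must have the same number of values")
    return [read[offset:offset + size] for offset, size in zip(offsets, sizes)]


# Hypothetical read: three 8 bp barcodes separated by 30 bp constant spacers.
example_read = "A" * 8 + "N" * 30 + "C" * 8 + "N" * 30 + "G" * 8
barcodes = extract_barcodes(example_read, "0,38,76", "8,8,8")
```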

Should we support "*" globs in paths?

In the first metadata.tsv we got (from Dinh), they’re expecting to be able to use globs (actually, [1-3] , but *-globs could work) for datasets, instead of having neatly separated subdirectories. Would this be feasible to handle downstream? I’d really prefer not, but if it would be feasible for you, I don’t want the validation to be an arbitrary stumbling block.
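For discussion, a rough sketch of what expanding such a pattern against the submitted directory names might look like, using the standard library's `fnmatch`, which understands both `*` and `[1-3]`. The function name and the directory names are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_data_path(pattern, directory_names):
    """Return the directory names matched by a glob-style data_path."""
    return [name for name in directory_names if fnmatchcase(name, pattern)]


# Hypothetical directory listing for one submission.
names = ["dataset-1", "dataset-2", "dataset-4", "other"]
bracket_matches = expand_data_path("dataset-[1-3]", names)
star_matches = expand_data_path("dataset-*", names)
```

One consequence is that a single metadata row could then refer to several directories, which anything downstream would need to expect.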

Add timezone "%Z"

The test passed locally but failed on Travis. My guess is that %Z pulls from some OS localization file which wasn't present there.

How should this be packaged?

I can think of two main alternatives, though there may be others:

  • Keep this repo: publish to PyPI, and the ingest pipeline will use it.
  • This whole repo is assimilated into the ingest pipeline.

Modify definitions for barcode size and offset

cell_barcode_offset definition should read:
Positions in the read at which the cell barcodes start. Cell barcodes are 3 x 8 bp sequences that are spaced by constant sequences (the offsets). First barcode at position 0, then 38, then 76. (Does not apply to SNARE-seq and BulkATAC)

cell_barcode_size definition should read:
Length of the cell barcode in base pairs. Cell barcodes are, for example, 3 x 8 bp sequences that are spaced by constant sequences, the offsets. (Does not apply to SNARE-seq and BulkATAC)

Subtyping of codex-submissions?

For codex submissions, I’m currently distinguishing akoya and stanford… it could be merged, but that would make it a lot harder to understand the error messages.

Use %z, not %Z

@cebriggs7135: Looping you in: I think it's not feasible to support timezone abbreviations in the datetimes. The three-letter codes are just not specific, maintaining our own list of HuBMAP-approved timezones will be an ongoing nightmare, and the code needed to parse the date becomes much more complicated at every level.

As an alternative, I think folks can just use the +/- notation: +06:00 for example.
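A short sketch of parsing with `%z`, assuming Python 3.7 or later (where `strptime` accepts a colon in the offset). The date layout in the format string is an assumption for illustration, not necessarily the format the schema requires.

```python
from datetime import datetime

# Assumed layout for illustration; only the trailing %z matters here.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M %z"


def parse_timestamp(value):
    """Parse a timestamp carrying a numeric UTC offset such as +06:00."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


parsed = parse_timestamp("2019-11-05 13:05 +06:00")
offset_hours = parsed.utcoffset().total_seconds() / 3600
```

With this format, a value ending in an abbreviation such as `CST` raises `ValueError`, so it would surface as a validation error rather than being guessed at.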

Fix CODEX TODOs

Here's an example of a DOI:  DOI: 10.1038/ejhg.2009.142

name: assay_category
    description: ‘Each assay is placed into one of the following 3 general categories: generation of images of microscopic entities, identification & quantitation of molecules by mass spectrometry, and determination of nucleotide sequence.’
    # TODO: What are the exact strings to expect?
  imaging, mass_spectrometry, sequence_data
    name: assay_type
    description: The specific type of assay being executed.
    # TODO: What are the exact strings to expect?
  The assay names here as a dropdown menu????: Assay Type  https://docs.google.com/spreadsheets/d/1gSSwCi9kx7-x_wcEDQFv-MLNgm_hNNN4PPIL0mg4GSI/edit#gid=1899790107
scRNA-Seq (10xGenomics)
AF
bulk RNA
bulk ATAC
CODEX
Imaging Mass Cytometry
LC-MS (metabolomics)
LC-MS/MS (label-free proteomics)
MxIF
IMS positive
IMS negative
MS (shotgun lipidomics)
PAS microscopy
sci-ATAC-seq
sci-RNA-seq
seqFISH
SNARE-SEQ2
snATAC
snRNA
SPLiT-Seq
TMT (proteomics)
WGS
    name: analyte_class
    description: Analytes are the target molecules being measured with the assay.
    # TODO: What are the exact strings to expect?
  free text-
    name: is_targeted
    description: Specifies whether or not a specific molecule(s) is/are targeted for detection/measurement by the assay. The CODEX analyte is protein.
    type: boolean
    # TODO: More strict?   TRUE or FALSE
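
If the answers above are adopted as exact strings, the stricter checks could be sketched roughly as follows. The function name and the row-as-dict shape are assumptions; the allowed values are taken from the replies above.

```python
# Allowed values as given in the replies above.
ASSAY_CATEGORIES = {"imaging", "mass_spectrometry", "sequence_data"}
BOOLEAN_STRINGS = {"TRUE", "FALSE"}


def check_codex_row(row):
    """Return human-readable errors for one metadata row (a dict of strings)."""
    errors = []
    category = row.get("assay_category")
    if category not in ASSAY_CATEGORIES:
        errors.append(
            f"assay_category {category!r} is not one of {sorted(ASSAY_CATEGORIES)}"
        )
    targeted = row.get("is_targeted")
    if targeted not in BOOLEAN_STRINGS:
        errors.append(f"is_targeted {targeted!r} must be TRUE or FALSE")
    return errors
```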

Evaluate tools for "CSV Schema"

The original pointer was to cidc-schemas, but I have some concerns after looking at an example:

  • It's not generating a template from json schema: This is their own syntax, though it looks like it does reference types in json schema.
  • The PyPI package is pretty huge, with a lot of dependencies, and when I start it up there are warnings about deprecations.
  • The template that it produces is Excel, not CSV.

Alternatives:

  • Write our own JSON Schema.
  • Table Schema and its python implementation: This looks well thought out, and in a JSON syntax supports things like unique values across columns, and referential integrity. Used by outside projects like csvlint.
  • goodtables-py is under the same umbrella: Not sure how it relates.
  • PandasSchema would require pandas, and the schema is python.
  • CSV Schema from LoC only seems to have Java implementation: csv-validator

For completeness:

add pre.md and post.md

The generated documentation would be more useful if it could be part of a larger document: Consider adding pre.md and post.md which could be used to bracket the generated content in a larger document.
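A minimal sketch of the bracketing, assuming both files are optional and live in one directory. The function names are hypothetical.

```python
from pathlib import Path


def read_optional(path):
    """Return the file's text, or an empty string if the file is absent."""
    return path.read_text(encoding="utf-8") if path.is_file() else ""


def wrap_generated_docs(generated, docs_dir):
    """Bracket generated Markdown with pre.md and post.md when they exist."""
    docs_dir = Path(docs_dir)
    parts = [
        read_optional(docs_dir / "pre.md"),
        generated,
        read_optional(docs_dir / "post.md"),
    ]
    # Drop empty pieces and separate the rest with one blank line.
    return "\n\n".join(part.strip() for part in parts if part.strip()) + "\n"
```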

Update readme

Just an explanation of the generate and validate scripts: what their inputs and outputs are.

data_path exhaustive

We want to be sure that all the files are referenced by exactly one data_path.
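One way this could be checked, sketched under the assumption that files and data_paths are POSIX-style paths relative to the submission root. The function names are hypothetical.

```python
from pathlib import PurePosixPath


def is_under(file_path, data_path):
    """True if file_path lies inside the directory tree rooted at data_path."""
    root = PurePosixPath(data_path).parts
    return PurePosixPath(file_path).parts[:len(root)] == root


def check_exhaustive(files, data_paths):
    """Return an error for each file not covered by exactly one data_path."""
    errors = []
    for file_path in files:
        owners = [dp for dp in data_paths if is_under(file_path, dp)]
        if len(owners) != 1:
            errors.append(
                f"{file_path}: referenced by {len(owners)} data_paths, expected exactly 1"
            )
    return errors
```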

Desired interface? -> Function that returns list of strings

Right now, I have validate.py, which takes command line arguments and either exits with 0 on success, or exits with non-0 and prints human-readable errors to STDOUT… is that the easiest interface for you to use, or would something else be better?
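As a sketch of the alternative named in the title: a function that returns a list of strings, with the command-line behavior layered on top. The names and the single required-field check are placeholders, not the real validation logic.

```python
def validate(rows):
    """Return a list of human-readable error strings; empty means valid.

    The one check here is a stand-in for the real validation.
    """
    errors = []
    # Line 1 is assumed to be the TSV header, so data rows start at line 2.
    for line_number, row in enumerate(rows, start=2):
        if not row.get("data_path"):
            errors.append(f"line {line_number}: data_path is required")
    return errors


def main(rows):
    """Command-line style wrapper: print the errors and return an exit status."""
    errors = validate(rows)
    for error in errors:
        print(error)
    return 1 if errors else 0
```

A pipeline could call `validate` directly and decide for itself how to report the strings, while the script keeps its current exit-code behavior through `main`.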

Get other documentation to point here

This (or documentation generated from this source code) needs to be the one place people look for how to structure their submission. Higher level concerns can still be described in docs, but the details need to be here, and only here.

  • First, I need to talk with Joel and figure out whether this will continue to exist independently, or if it will be merged into another repo.
  • With that settled, other docs can be simplified so that they point here for details.
  • Example spreadsheets should preferably be removed entirely, or if necessary, retitled to indicate they are deprecated.

We know the standards are going to continue to evolve, and this is the way I think it can work. If you see a problem with this course, please raise an alarm!
