
phes-odm-validation's People

Contributors

dougmanuel, yulric, zargot

phes-odm-validation's Issues

Quarto/Jupyter notebook example

On the topic of documentation: I still think we should also add an executable example of the rule as a Quarto or Jupyter notebook. This is worth a discussion. I've viewed this not as unit testing but as 'how to' documentation.

We generally try to follow the Divio documentation approach. I think the executable examples would go high up in the file structure or be incorporated into the individual rule md files (convert the .md files to, say, a .ipynb or .qmd). Discuss?

Originally posted by @DougManuel in #5 (comment)

Add suggestions to errors/warnings

Once we have a coercion library, we will probably want to update the warnings to give specific advice on what to do when a user receives a warning that their data can be coerced to a number. This made me wonder if we should have an additional error message section for "suggestions". Meaning, we would have "error", "warning", and "suggestion". We can always put a suggestion into an error or warning message, so this isn't really required. Just raising it as a thought or discussion point.

Originally posted by @DougManuel in #64 (review)

Rename PartData fields and use constants

Have names for functions and PartData more closely adhere to the definitions in the dictionary. Instead of is_cat_val(p) use is_category(p). Instead of all_rows use all_parts. Instead of att_rows use attributes, etc.

Originally posted by @DougManuel in #9 (review)

I don't mind the helper functions you have, but I agree that it would be nice if the names aligned more with the ODM. In addition, can we move all the magic strings into an enum for the partType, lots of repetitions.

Originally posted by @yulric in #9 (comment)
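
A minimal sketch of what the renaming and the partType enum could look like. The enum members, their string values, the categories field, and make_part_data are assumptions made for illustration; only is_category, all_parts, and attributes come from the discussion above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PartType(str, Enum):
    # Hypothetical subset of partType values; the real list lives in the ODM dictionary.
    ATTRIBUTE = "attributes"
    CATEGORY = "categories"
    TABLE = "tables"


def is_category(part: dict) -> bool:
    """Return True if the part row describes a category (replaces is_cat_val)."""
    return part.get("partType") == PartType.CATEGORY


def is_attribute(part: dict) -> bool:
    """Return True if the part row describes an attribute."""
    return part.get("partType") == PartType.ATTRIBUTE


@dataclass
class PartData:
    # Field names follow the ODM terms instead of all_rows / att_rows.
    all_parts: List[dict] = field(default_factory=list)
    attributes: List[dict] = field(default_factory=list)
    categories: List[dict] = field(default_factory=list)


def make_part_data(parts: List[dict]) -> PartData:
    """Hypothetical helper that sorts part rows into the renamed fields."""
    return PartData(
        all_parts=list(parts),
        attributes=[p for p in parts if is_attribute(p)],
        categories=[p for p in parts if is_category(p)],
    )
```

Because PartType subclasses str, rows read from a CSV can be compared to the enum members directly, so the magic strings only appear once.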

mandatory-if validation rules

Some fields are mandatory only under certain conditions. For example, sampleID is a required header if the measure is taken from a sample.

measureID = covN1: sampleID is mandatory.
measureID = flowRate: sampleID is not mandatory.

Such fields are identified by mandatory-if.
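
As a rough sketch of how such a rule might be checked per row: the function name, the lookup set, and the error wording below are all hypothetical, and a real implementation would presumably derive the condition from the dictionary rather than hard-coding it.

```python
# Hypothetical lookup of measures that are taken from a sample.
# In practice this would come from the ODM dictionary, not a constant.
MEASURES_REQUIRING_SAMPLE = {"covN1"}


def missing_mandatory_if(row: dict) -> list:
    """Return error messages for a row that violates the conditional rule."""
    errors = []
    measure = row.get("measureID")
    # sampleID is only required when the measure is taken from a sample.
    if measure in MEASURES_REQUIRING_SAMPLE and not row.get("sampleID"):
        errors.append(f"sampleID is mandatory when measureID is {measure}")
    return errors


print(missing_mandatory_if({"measureID": "covN1", "sampleID": ""}))
print(missing_mandatory_if({"measureID": "flowRate"}))
```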

refactor doc-gen actions

The github actions for testing and publishing documentation share 90% of the same code and should be refactored.

The initial attempt was made in branch 15-ci-docgen-before-deduplication, which has been renamed to suit the needs of this issue and can be deleted afterwards.
https://github.com/Big-Life-Lab/PHES-ODM-Validation/blob/15-ci-docgen-before-deduplication/.github/workflows/gen-doc/action.yml

Issues in previous attempt:

  • Had to add an additional checkout at the end of the composite action to avoid failing post run.
  • Publishing didn't work properly and ended in a weird-looking page.

Generate Quarto document from Python API

Write a script/parser (in Python) that takes the Python source code as input and generates a Quarto document as output.
The output document should describe the Python API in the same way that Sphinx would. It will then be part of the existing Quarto docs.

Rationale:

  • We want to use Quarto for our documentation, and it's way better to use Quarto for everything than to try and fit Quarto into Sphinx.
  • We want to generate API docs for our code, but existing solutions like Sphinx generate whole websites and not just the API part that we need. Extracting that part alone isn't very easy either, neither is creating a Quarto-theme for Sphinx.
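
One possible starting point, sketched with the standard-library ast module (Python 3.9+ for ast.unparse). The function name api_to_quarto and the output layout are assumptions; this only covers top-level public functions and is far from what Sphinx produces.

```python
import ast


def api_to_quarto(source: str, title: str = "API reference") -> str:
    """Render the public top-level functions of a module as Quarto markdown."""
    tree = ast.parse(source)
    # Quarto documents start with a YAML front-matter block.
    lines = ["---", f'title: "{title}"', "---", ""]
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
            # ast.unparse turns the arguments node back into source text.
            signature = f"{node.name}({ast.unparse(node.args)})"
            docstring = ast.get_docstring(node) or "No docstring."
            lines += [f"## `{signature}`", "", docstring, ""]
    return "\n".join(lines)


example_source = '''
def validate_data(schema, data):
    """Validate data against a schema."""
    return []

def _private_helper():
    pass
'''

print(api_to_quarto(example_source))
```

Working from the syntax tree rather than importing the module means the script does not need the package's dependencies installed, at the cost of not seeing anything generated at runtime.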

Duplicate table and catset structures

It looks like the table name is being duplicated between this field and the table_attr and table_catset_attr fields. What do you think about converting table_names into a dictionary and nesting its categories and attributes into it? We can then retrieve the table names from the dictionary keys.

You could also argue that the same thing is happening with the catset_values and table_catset_attr fields, where the catset names are being duplicated.

I bring this up because we want to avoid the situation where there are two different sources of truth for the same thing; keeping them synchronized may cause bugs in the future.

Originally posted by @yulric in #11 (comment)
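
A sketch of the nested layout, assuming plain dictionaries. The table, attribute, and category names are placeholders for illustration, and table_names / catset_names are hypothetical helpers that derive the lists instead of storing them twice.

```python
# Each table name appears once, as a key; its attributes and category
# sets are nested underneath. All names here are illustrative only.
tables = {
    "samples": {
        "attributes": ["sampleID", "collection"],
        "catsets": {"collection": ["valueA", "valueB"]},
    },
    "measures": {
        "attributes": ["measureID", "sampleID"],
        "catsets": {},
    },
}


def table_names(tables: dict) -> list:
    """Table names are the dictionary keys, so there is no second copy to sync."""
    return list(tables)


def catset_names(tables: dict) -> list:
    """Category-set names are likewise derived from the nested keys."""
    names = set()
    for table in tables.values():
        names.update(table["catsets"])
    return sorted(names)


print(table_names(tables))
print(catset_names(tables))
```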

Language support

ODM supports languages in the languages and translations tables. Any part can have translations. That means we will need to support languages during validation.

Specifically, we will need to support table names and categories in different languages, as well as datatype = boolean.

We can use languageID to identify the language that would be evaluated, with English as the default.

Language ID Language family Language name Native Name ISO639-1 ISO639-2B ISO639-2T ISO639-3 ISO639-6 First released version Last updated date
eng en Indo-European English English en eng eng eng 2.0.0 2.0.0
fra fr Indo-European French Français fr fra fre fra 2.0.0 2.0.0
spa es Indo-European Spanish Español es spa spa spa 2.0.0 2.0.0

documentation generation

Use Sphinx to generate documentation via github actions.

  • Include quarto docs, converted to markdown.
  • Add a project/config file for Quarto, so that we can render all quarto-parts with a single Quarto command.
  • Make sure that the readme link anchor works (after #75).

spec summary report

All the data in a long table format (our main format) will have numbers as text. This means we will generate 1000s of warnings. We may want to have a "warnings and errors summary" that would give a count of warnings and errors, without the specific information about which lines the errors occurred on.

This means we would have the option of generating two reports: a summary report and a detailed report. We currently follow this approach in other applications (our planning tool).

Originally posted by @DougManuel in #64 (review)

Write specifications for the following:

  • Summary report (high level)
  • API function "summarize_report"
  • CLI tool "summarize":
    • options: output dir or stdout, output format (yaml, markdown)
  • CLI tool "validate":
    • options: output dir
    • yaml output
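
To make the summary idea concrete, here is a minimal sketch of summarize_report. The report layout (a dict with "errors" and "warnings" lists whose entries carry a "rule" key) and the warning rule id are assumptions, not the package's actual format.

```python
from collections import Counter


def summarize_report(report: dict) -> dict:
    """Count entries per rule for each severity level, dropping row details."""
    summary = {}
    for level in ("errors", "warnings"):
        counts = Counter(entry["rule"] for entry in report.get(level, []))
        summary[level] = dict(counts)
    return summary


# Hypothetical detailed report; the "rule" and "row" keys are assumed.
detailed_report = {
    "errors": [
        {"rule": "invalid_category", "row": 2},
        {"rule": "invalid_category", "row": 7},
    ],
    "warnings": [
        {"rule": "example_coercion_warning", "row": 3},
    ],
}

print(summarize_report(detailed_report))
```

The summary keeps only counts per rule, so its size depends on the number of rules rather than the number of rows.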

move validation-rules-list.csv

I suggest deleting the metadata folder and putting the validation-rule-list.csv into the validation-rules folder. Just in the main directory, validation-rules/validation-rule-list.csv.

If you do want to keep the metadata folder, I'd rename it because really much of the repo is metadata -- so it is hard to know what to expect is in this specific folder.

I agree with doing something about the metadata folder. I'm about to move /validation-rules into /docs, and the rule-list csv is more of an asset than documentation, so maybe move it to /assets?

Originally posted by @zargot in #62 (comment)

tweak error message format

Just a small issue. The convention for the ODM is [table]_[part] not [table].[part]. We do that because using _ generates a valid name in all programming languages. (.) does not.

Originally posted by @DougManuel in #21 (comment)

tweak tutorial

Consider starting with a minimal example like the text below, and then go into more complex examples such as changing directories and prettier print.

I'd probably put the 'generating a schema' into a different file that would focus just on that. Not sure.

samples = utils.import_dataset(join("assets/v2/samples-invalid.csv"))
data = {"samples": samples}
errors = validate_data(schema, data)

Originally posted by @DougManuel in #21 (review)

Development Dependencies

Figure out where to put dependencies not required to run the package functions but for development activities like running tests, the quarto notebooks etc.

For example, the rich library is needed to run the explanation-document.qmd file but not required to run the package functions.

error union

Can the error list be typed out into a union type with each type object being the error object for each validation rule?

Originally posted by @yulric in #28 (comment)

Or maybe we can use an enum for the rule ids.
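
A sketch of the union idea using TypedDict and Literal from the standard typing module. The field names (other than invalidValue, which appears in the rule discussion) and the describe helper are hypothetical.

```python
from typing import Literal, TypedDict, Union


class InvalidCategoryError(TypedDict):
    # The literal rule id acts as the discriminator of the union.
    rule: Literal["invalid_category"]
    table: str
    column: str
    row: int
    invalidValue: str


class MissingMandatoryColumnError(TypedDict):
    rule: Literal["missing_mandatory_column"]
    table: str
    column: str


# One type object per validation rule, combined into a single union.
ValidationError = Union[InvalidCategoryError, MissingMandatoryColumnError]


def describe(error: ValidationError) -> str:
    """Format an error; a type checker can narrow on the rule id."""
    location = f"{error['table']}_{error['column']}"
    if error["rule"] == "invalid_category":
        return f"{location}: invalid value {error['invalidValue']!r} in row {error['row']}"
    return f"{location}: mandatory column is missing"
```

An enum of rule ids would serve a similar purpose at runtime; the union additionally lets a static type checker know which fields each error carries.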

Use snake_case for rule names in the spec

What should we do about the rule names tho?

  • def invalid_category(): current. We may change the rule name to this like @DougManuel mentioned.
  • def InvalidCategory(): would break convention and look like a class name.
  • def invalidCategory(): may be ok, but is again breaking convention.
  • def <prefix>_InvalidCategory(): may be a little verbose and isn't equal to the rule name anyway.

Edit: We may also just ignore the difference between the rule names in documentation and code, and just write a single notice about how the rules are implemented in code. For example:

The rules are currently implemented as functions in rules.py. They are directly named after the rules, using snake_case. For example, the rule "Invalid Category" becomes invalid_category.

What about changing the names and conventions in validation-rules-list to snake case?
MissingMandatoryColumn --> missing_mandatory_column

Originally posted by @DougManuel in #5 (comment)
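
If the names in the rule list are converted mechanically, a small helper along these lines might do. The name to_snake_case is an assumption, and the regex splits before every capital letter, so acronyms or names containing spaces would need extra handling.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a PascalCase or camelCase rule name to snake_case."""
    # Insert "_" before every capital letter that is not at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(to_snake_case("MissingMandatoryColumn"))
print(to_snake_case("InvalidCategory"))
```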

Tasks:

  • Change the rule names to snake_case
  • Update the invalid_category_rule error report to use invalidValue rather than invalidCategoryValue

ci: add static type checking

Depends on #25, due to the big changes it makes.

This seems to require a lot of refactoring, in addition to adding the CI action.
