big-life-lab / phes-odm-validation

A toolkit to assist in validating whether data conforms to the PHES-ODM dictionary.

Home Page: https://validate-docs.phes-odm.org/

License: Creative Commons Attribution 4.0 International

Python 99.75% Shell 0.25%

validation validator phes-odm

phes-odm-validation's Issues

Update tests to use (parts/schema-)files in assets folder

CI linting

Quarto/Jupyter notebook example

On the topic of documentation. I still think we should also add an executable example of the rule as a Quarto or Jupyter notebook. This is worth a discussion. I've viewed this not as unit testing but as 'how to' documentation.

We generally try to follow the Divio documentation approach. I think the executable examples would go high up in the file structure or be incorporated into the individual rule md files (convert the .md files to, say, a .ipynb or .qmd). Discuss?

Originally posted by @DougManuel in #5 (comment)

ci: run tests

Add suggestions to errors/warnings

Once we have a coercion library, we will probably want to update the warnings to give specific advice on what to do when a user receives a warning that their data can be coerced to a number. This made me wonder if we should have an additional message section for "suggestions". Meaning, we would have "error", "warning", and "suggestion". We can always put a suggestion into an error or warning message, so this isn't really required. Just raising it as a thought or discussion point.

Originally posted by @DougManuel in #64 (review)

Rename PartData fields and use constants

Have names for functions and PartData more closely adhere to the definitions in the dictionary. Instead of is_cat_val(p) use is_category(p). Instead of all_rows use all_parts. Instead of att_rows use attributes, etc.

Originally posted by @DougManuel in #9 (review)

I don't mind the helper functions you have, but I agree that it would be nice if the names aligned more with the ODM. In addition, can we move all the magic strings into an enum for the partType, lots of repetitions.

Originally posted by @yulric in #9 (comment)

update requirements.txt

it doesn't list all the current requirements.

mandatory-if validation rules

Some fields are mandatory with conditions. For example, a sampleID is a required header if the measure is taken from a sample.

measureID = covN1: sampleID is mandatory.
measureID = flowRate: sampleID is not mandatory.

This field is identified by mandatory-if.
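
A minimal sketch of how such a conditional check could look, assuming each row is a plain dict. The `SAMPLE_MEASURES` set and the function name are hypothetical; in practice the condition would come from the dictionary's mandatory-if information rather than a hardcoded set.

```python
# Hypothetical set of measures that are taken from a sample.
SAMPLE_MEASURES = {"covN1"}


def missing_mandatory_if(row: dict) -> list:
    """Return the conditionally mandatory fields that are missing from row."""
    missing = []
    if row.get("measureID") in SAMPLE_MEASURES and not row.get("sampleID"):
        missing.append("sampleID")
    return missing
```

With this sketch, a `covN1` row without a `sampleID` is reported, while a `flowRate` row without one is not.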

update contributing style

refactor doc-gen actions

The github actions for testing and publishing documentation share 90% of the same code and should be refactored.

The initial attempt was made in branch 15-ci-docgen-before-deduplication, which has been renamed to suit the needs of solving this issue, and can be deleted after.
https://github.com/Big-Life-Lab/PHES-ODM-Validation/blob/15-ci-docgen-before-deduplication/.github/workflows/gen-doc/action.yml

Issues in previous attempt:

Had to add an additional checkout at the end of the composite action to avoid failing post run.
Publishing didn't work properly and ended in a weird-looking page.

add warning field to validation report

implement rules for min/max value

Add a new validation rule to support minimum and maximum values

Specs

skip parts without version1 fields

Log error and skip part when:

required version1 fields are missing

Originally suggested by @yulric in #62 (comment)
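
One possible shape for this, as a sketch only: the field names in `REQUIRED_V1_FIELDS`, the `partID` key and the function name are assumptions for illustration, not the names used in the code base.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical names of the version1 fields a part must carry.
REQUIRED_V1_FIELDS = ("version1Table", "version1Variable")


def usable_parts(parts):
    """Return the parts that have all required version1 fields.

    Parts with missing fields are logged as errors and skipped.
    """
    kept = []
    for part in parts:
        missing = [f for f in REQUIRED_V1_FIELDS if not part.get(f)]
        if missing:
            logger.error("skipping part %r: missing %s",
                         part.get("partID"), ", ".join(missing))
            continue
        kept.append(part)
    return kept
```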

New rule - type validation rule

Specification: https://big-life-lab.github.io/PHES-ODM-Validation/validation-rules/invalid_type.html

No need to validate the email type for now since it will only be in v0.2 of the dictionary

finish missing sections in project.toml

Originally posted by @DougManuel in #53 (review)

Generate Quarto document from Python API

Write a script/parser (in Python) that takes the python source code as input and generates a quarto document as output.
The output document should describe the Python API in the same way that Sphinx would. It will then be part of the existing quarto docs.

Rationale:

We want to use Quarto for our documentation, and it's way better to use Quarto for everything than to try and fit Quarto into Sphinx.
We want to generate API docs for our code, but existing solutions like Sphinx generate whole websites and not just the API part that we need. Extracting that part alone isn't very easy either, neither is creating a Quarto-theme for Sphinx.
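
As a rough starting point, the standard-library `ast` module can extract function signatures and docstrings without importing the code. The sketch below is an assumption about how such a script might begin; the function name `api_to_markdown` and the heading layout are made up, and it only handles top-level functions and positional arguments.

```python
import ast


def api_to_markdown(source: str, module_name: str) -> str:
    """Render the public top-level functions of a module as Markdown."""
    lines = [f"## {module_name}", ""]
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
            # Only plain positional arguments are listed in this sketch.
            args = ", ".join(a.arg for a in node.args.args)
            lines += [f"### `{node.name}({args})`", ""]
            lines += [ast.get_docstring(node) or "(no docstring)", ""]
    return "\n".join(lines)
```

The returned text could be written to a `.qmd` file and included in the existing Quarto docs.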

Duplicate table and catset structures

It looks like the table name is being duplicated between this field and the table_attr and table_catset_attr fields. What do you think about converting table_names into a dictionary and nesting its categories and attributes into it? We can then retrieve the table names from the dictionary keys.

You could also argue that the same thing is happening with the catset_values and table_catset_attr fields, where the catset names are being duplicated.

I bring this up because we want to avoid the situation where there are two different sources of truth for the same thing; keeping them synchronized may cause bugs in the future.

Originally posted by @yulric in #11 (comment)
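
The nesting suggested above might look roughly like the following. The table, attribute and category names are invented for illustration, and the key names `attributes` and `catsets` are assumptions rather than the existing field names.

```python
# Hypothetical table, attribute and category-set names, for illustration only.
tables = {
    "samples": {
        "attributes": ["sampleID", "collection"],
        "catsets": {"collection": ["grab", "composite"]},
    },
}

# Names are derived from the keys, so they are never stored a second time.
table_names = list(tables)
catset_names = sorted(
    name for table in tables.values() for name in table["catsets"]
)
```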

doc: add presentation

Depends on #25.

Fix min/max rule specification

Specify new rule for validating email fields

This new rule will validate those columns in the dictionary that are emails. For now, the list of email columns will be hardcoded in the rule.
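
A sketch of what a hardcoded-column check could look like. The column name in `EMAIL_COLUMNS` is a placeholder, and the pattern is deliberately loose (it only checks the general shape of an address), so treat both as assumptions rather than the specified rule.

```python
import re

# Hypothetical hardcoded list of email columns.
EMAIL_COLUMNS = {"email"}

# Deliberately loose: one "@", no whitespace, and a dot in the domain part.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def invalid_emails(row: dict) -> list:
    """Return the email columns in row whose value is not email-shaped."""
    return [
        col for col in sorted(row.keys() & EMAIL_COLUMNS)
        if not EMAIL_PATTERN.fullmatch(str(row[col]))
    ]
```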

Update min/max value since the cerberus min/max rules are already inclusive
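
Inclusive bounds mean that a value equal to the minimum or maximum passes. The plain-Python sketch below illustrates that semantics without using Cerberus; the function name and signature are hypothetical.

```python
def out_of_range(value, min_value=None, max_value=None) -> bool:
    """Return True when value lies outside the inclusive [min, max] range."""
    if min_value is not None and value < min_value:
        return True
    if max_value is not None and value > max_value:
        return True
    return False
```

Because it relies only on comparison operators, the same check would also work for other comparable types such as dates.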

use datasets and error report assets in the tests

For each test, can you use all the datasets and error report in the assets directory? I put them in there so we can test what I hope are all the cases.

Originally posted by @yulric in #62 (review)

Check the examples in the rule specifications

Language support

ODM supports languages in the languages and translations tables. Any part can have translations. That means we will need to support languages during validation.

Specifically, we will need to support table names and categories in different languages, as well as datatype = boolean.

We can use languageID to identify the language that would be evaluated, with English as the default.

Language ID	Language family	Language name	Native Name	ISO639-1	ISO639-2B	ISO639-2T	ISO639-3	ISO639-6	First released version	Last updated date
eng	en	Indo-European	English	English	en	eng	eng	eng	2.0.0	2.0.0
fra	fr	Indo-European	French	Français	fr	fra	fre	fra	2.0.0	2.0.0
spa	es	Indo-European	Spanish	Español	es	spa	spa	spa	2.0.0	2.0.0

Escape angle brackets (<>) in the validation rule files

The unescaped version results in the text within them not being rendered in the final document
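
If the escaping were scripted rather than done by hand, it could be as small as the sketch below. The function name is hypothetical, and applying it blindly would also escape any intentional HTML in the files.

```python
def escape_angle_brackets(text: str) -> str:
    """Replace < and > with HTML entities so they render literally."""
    return text.replace("<", "&lt;").replace(">", "&gt;")
```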

Specify new rule for validating URL columns

This rule will validate URL columns ensuring they're a valid URL. For now, the list of URL columns will be hardcoded in the rule.
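
A minimal check using the standard library might look like this. Restricting the scheme to http and https is an assumption, not something the spec states, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Return True if value has an http(s) scheme and a network location."""
    try:
        parts = urlparse(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```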

Add meta fields to the Cerberus schema

New validation rule, missing_values_found

Specification: https://github.com/Big-Life-Lab/PHES-ODM-Validation/blob/main/validation-rules/missing_values_found.md

documentation generation

Use Sphinx to generate documentation via github actions.

Include quarto docs, converted to markdown.
~~Add a project/config file for Quarto, so that we can render all quarto-parts with a single Quarto command.~~
Make sure that the readme link anchor works (after #75).

Specify support for date types in the min and max validation rule

spec summary report

All the data in a long table format (our main format) will have numbers as text. This means we will generate 1000s of warnings. We may want to have a "warning and errors summary" that would give a count of warnings and errors, without the specific information about which lines the errors occurred on.

This means we would have the option of generating two reports: a summary report and a detailed report. We currently follow this approach in other applications (our planning tool).

Originally posted by @DougManuel in #64 (review)

Write specifications for the following:

Summary report (high level)
API function "summarize_report"
CLI tool "summarize":
- options: output dir or stdout, output format (yaml, markdown)
CLI tool "validate":
- options: output dir
- yaml output
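
To make the idea concrete, here is a sketch of what "summarize_report" could do. The report shape (top-level `errors` and `warnings` lists whose entries carry a `rule_id` key) is an assumption for illustration; the real report format may differ.

```python
from collections import Counter


def summarize_report(report: dict) -> dict:
    """Count errors and warnings per rule, dropping row-level detail."""
    return {
        kind: dict(Counter(entry["rule_id"] for entry in report.get(kind, [])))
        for kind in ("errors", "warnings")
    }
```

A report with thousands of identical coercion warnings would then collapse to a single count per rule.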

move validation-rules-list.csv

I suggest deleting the metadata folder and putting the validation-rule-list.csv into the validation-rules folder. Just in the main directory, validation-rules/validation-rule-list.csv.

If you do want to keep the metadata folder, I'd rename it because really much of the repo is metadata -- so it is hard to know what to expect is in this specific folder.

I agree with doing something about the metadata folder. I'm about to move /validation-rules into /docs, and the rule-list csv is more of an asset than documentation, so maybe move it to /assets?

Originally posted by @zargot in #62 (comment)

implement duplicate_entries_found rule

Spec: https://validate-docs.phes-odm.org/validation-rules/duplicate_entries_found.html

tweak error message format

Just a small issue. The convention for the ODM is [table]_[part] not [table].[part]. We do that because using _ generates a valid name in all programming languages. (.) does not.

Originally posted by @DougManuel in #21 (comment)

test validation schema gen for v2

Depends on #25.

Can we also test version 2 validation schema generation?

Originally posted by @yulric in #28 (review)

README: Update the reference documents section

Originally posted by @DougManuel in #53 (review)

Validation rule spec file reorganization

versioning

implement the spec from #4

tweak tutorial

Consider starting with a minimal example like the text below, and then go into more complex examples such as changing directories and prettier print.

I'd probably put the 'generating a schema' into a different file that would focus just on that. Not sure.

samples = utils.import_dataset(join("assets/v2/samples-invalid.csv"))
data = {"samples": samples}
errors = validate_data(schema, data)

Originally posted by @DougManuel in #21 (review)

Development Dependencies

Figure out where to put dependencies not required to run the package functions but for development activities like running tests, the quarto notebooks etc.

For example, the rich library is needed to run the explanation-document.qmd file but not required to run the package functions.

error union

Can the error list be typed out into a union type with each type object being the error object for each validation rule?

Originally posted by @yulric in #28 (comment)

Or maybe we can use an enum for the rule ids.
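
The enum option could be sketched as follows. The members use rule names that appear in the specs, but the list is not exhaustive, and the `rule_id` key and the helper `rule_of` are hypothetical.

```python
from enum import Enum


class RuleId(Enum):
    """Rule ids named in the specs; this list is not exhaustive."""
    INVALID_CATEGORY = "invalid_category"
    INVALID_TYPE = "invalid_type"
    MISSING_VALUES_FOUND = "missing_values_found"
    DUPLICATE_ENTRIES_FOUND = "duplicate_entries_found"


def rule_of(error: dict) -> RuleId:
    """Look up the rule for an error entry; raises ValueError if unknown."""
    return RuleId(error["rule_id"])
```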

Use snake_case for rule names in the spec

What should we do about the rule names tho?

def invalid_category(): current. We may change the rule name to this like @DougManuel mentioned.

def InvalidCategory(): would break convention and look like a class name.

def invalidCategory(): may be ok, but is again breaking convention.

def <prefix>_InvalidCategory(): may be a little verbose and isn't equal to the rule name anyway.

Edit: We may also just ignore the difference between the rule names in documentation and code, and just write a single notice about how the rules are implemented in code. For example:

The rules are currently implemented as functions in rules.py. They are directly named after the rules, using snake_case. For example, the rule "Invalid Category" becomes invalid_category.

What about changing the names and conventions in validation-rules-list to snake case?
MissingMandatoryColumn --> missing_mandatory_column

Originally posted by @DougManuel in #5 (comment)
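
If the renaming were scripted, a small helper like this sketch could convert the existing names. The function name is hypothetical, and it assumes names without runs of consecutive capitals.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a PascalCase or camelCase rule name to snake_case."""
    # Insert "_" at every lower/digit-to-upper boundary, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()
```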

Tasks:

Change the rule names to snake_case
Update the invalid_category_rule error report to use invalidValue rather than invalidCategoryValue

and

https://github.com/Big-Life-Lab/PHES-ODM-Validation/blob/main/validation-rules/less_than_min_length.md

https://github.com/mcanouil/awesome-quarto

ci: add static type checking

Depends on #25, due to the big changes it makes.

This seems to require a lot of refactoring, in addition to adding the CI action.
