
dataharmonizer's People

Contributors

cmrn-rhi, ddooley, griffie, ivansg44, mgopez, subdavis, sujaypatil96, turbomam


dataharmonizer's Issues

may need a gsheets authenticating script

I moved test_sntc.py into tests/, and when I next tried to run the test suite (by right-clicking on the tests folder in PyCharm), it appeared that a new Google Sheets authentication step was required:

Please go to this URL and finish the authentication flow

etc., but the test script didn't pause for the next step:

Enter the authorization code:

I ran this in the PyCharm Python Shell, after cd-ing to tests\

import pygsheets
sntc_id = '1pSmxX6XGOxmoA7S7rKyj5OaEl3PmAl4jAOlROuNHrU0'
client_secret_json = "../local/client_secret.apps.googleusercontent.com.json"
gc = pygsheets.authorize(client_secret=client_secret_json)

Please go to this URL and finish the authentication flow: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id...
Enter the authorization code: >? ...

That created a sheets.googleapis.com-python.json in tests/ and the tests continued successfully. Maybe I could have just copied the previously created sheets.googleapis.com-python.json from the project root?

Creating a sheets.googleapis.com-python.json requires a client secrets file, as defined in client_secret_json = "../local/client_secret.apps.googleusercontent.com.json" above. Neither of those should be checked into GitHub!
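
On the question of reusing the token from the project root: a minimal sketch, assuming the token keeps the file name quoted above and lives somewhere at or above the directory the tests run from. The helper name find_token_dir is hypothetical. As far as I know, pygsheets.authorize also accepts a credentials_directory argument that the result could be passed to, but that is an assumption to check against the installed pygsheets version.

```python
from pathlib import Path
from typing import Optional

# Token file name as quoted in this issue.
TOKEN_NAME = "sheets.googleapis.com-python.json"


def find_token_dir(start: Path, token_name: str = TOKEN_NAME) -> Optional[Path]:
    """Return the nearest directory at or above `start` that holds the token file.

    Returns None when no such directory exists, in which case the
    interactive authentication flow would still be needed.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / token_name).is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical usage from inside tests/: reports the project root if the
    # token was created there earlier.
    print(find_token_dir(Path.cwd()))
```

Searching upward avoids copying the token into tests/, which would leave a second secret file to keep out of version control.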

Use EnvO subsets for pull-downs in triad columns

@sujaypatil96 here's another one we could work on together

Right now, the MIxS environmental triad columns in the MIXS:environment field section of the NMDC Data Harmonizer accept free text (xs:token), but are validated against .* \[ENVO:\d+\]. An appropriate response for the environmental broad scale might be 'ocean [ENVO:00000015]', but the user would have to type all of that in.
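
For reference, a small sketch of what that validation accepts and rejects. It assumes the pattern is applied to the whole cell value; whether the DataHarmonizer anchors the pattern this way is an assumption, and is_valid_triad_value is a hypothetical helper name.

```python
import re

# The pattern quoted above, compiled as-is.
TRIAD_PATTERN = re.compile(r".* \[ENVO:\d+\]")


def is_valid_triad_value(value: str) -> bool:
    """True when the whole value looks like 'label [ENVO:digits]'."""
    return TRIAD_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("ocean [ENVO:00000015]", "ocean", "[ENVO:00000015]"):
        print(repr(candidate), is_valid_triad_value(candidate))
```

Note that the pattern requires a space before the bracket, so a bare identifier without a label is rejected along with a bare label.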

Problems

  • First of all, eventually we will want to accept terms from ontologies other than EnvO, like UBERON for anatomical sites like colon [UBERON:0001155]
  • also, these columns could use pull-down menus, but they would be unmanageable with all terms from EnvO, much less other ontologies
    • @mslarae13 was involved in determining reasonable subsets of EnvO terms for various MIxS packages (soil, water, etc.) x each of the environmental triad columns. We could build enumerations into the LinkML that precedes the DataHarmonizer interface, which would then appear as pull-down menus.
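
A rough sketch of the enumeration idea, assuming a curated subset arrives as a list of 'label [PREFIX:id]' strings. It builds the enums / permissible_values / meaning structure that LinkML uses and prints it as JSON for inspection. The enum name and the build_enum helper are hypothetical, and the two terms are just the examples quoted in this issue, not a curated subset.

```python
import json
import re
from typing import Dict, List

# Splits 'ocean [ENVO:00000015]' into a label and a CURIE.
LABEL_AND_CURIE = re.compile(r"(?P<label>.+) \[(?P<curie>[A-Za-z]+:\d+)\]")


def build_enum(enum_name: str, terms: List[str]) -> Dict:
    """Build a LinkML-style enum fragment from 'label [CURIE]' strings."""
    permissible_values = {}
    for term in terms:
        match = LABEL_AND_CURIE.fullmatch(term)
        if match is None:
            raise ValueError(f"not in 'label [PREFIX:id]' form: {term!r}")
        # The full string stays the permissible value, so it still satisfies
        # the existing validation pattern; the CURIE is kept as the meaning.
        permissible_values[term] = {"meaning": match.group("curie")}
    return {"enums": {enum_name: {"permissible_values": permissible_values}}}


if __name__ == "__main__":
    fragment = build_enum(
        "example_env_broad_scale_enum",  # hypothetical enum name
        ["ocean [ENVO:00000015]", "colon [UBERON:0001155]"],
    )
    print(json.dumps(fragment, indent=2))
```

One enum per package and triad column would keep each pull-down short, at the cost of many small enums in the schema.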

@mslarae13 can you please remind me where to find the curated subsets?

@sujaypatil96 I can walk you through more of this if you can budget some time for it.

Triad columns

term id       description
MIXS:0000012  broad-scale environmental context
MIXS:0000013  local environmental context
MIXS:0000014  environmental medium

CC @cmungall

Keep Soil-NMDC-Template_Compiled, esp. SheetIdentification tab tidy and authoritative

SheetIdentification

Tests have been started for this task

see also #59

  • all tabs that appear in Soil-NMDC-Template_Compiled must be listed in the SheetIdentification tab
  • the tabs should (?) be listed in the order they appear in Soil-NMDC-Template_Compiled
  • tabs that are expected to be parsed in creation of DH templates must be tagged as input for soil DH template generation in SheetIdentification
  • tabs that aren't tagged as to be parsed in creation of DH templates, or don't serve some other well-characterized purpose should be deleted
    • maybe we should designate some tabs as scratch space?
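
The first two rules above could be checked along these lines. This is only a sketch: it assumes the tab names have already been read from the spreadsheet into plain lists, and check_sheet_identification is a hypothetical helper, not something that exists in test_sntc.py.

```python
from typing import Dict, List


def check_sheet_identification(
    actual_tabs: List[str], listed_tabs: List[str]
) -> Dict[str, object]:
    """Compare the workbook's tabs with those listed in SheetIdentification.

    actual_tabs: tab names in the order they appear in the workbook.
    listed_tabs: values of the listing column, in row order.
    """
    listed = set(listed_tabs)
    actual = set(actual_tabs)
    return {
        # Tabs present in the workbook but never listed.
        "unlisted": [tab for tab in actual_tabs if tab not in listed],
        # Listed tabs that no longer exist in the workbook.
        "stale": [tab for tab in listed_tabs if tab not in actual],
        # True only when both lists agree on content and order.
        "same_order": actual_tabs == listed_tabs,
    }


if __name__ == "__main__":
    report = check_sheet_identification(
        ["SheetIdentification", "Terms", "Example Use"],
        ["SheetIdentification", "Example Use", "Terms"],
    )
    print(report)
```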

expectations of SNTC from test_sntc.py

I need to do a better job of communicating these to @mslarae13 or agreeing on alternatives

Regarding Soil-NMDC-Template_Compiled

  • tab name and ordering won't change.
    • support for adding new tabs to the left should be added.
  • tab names and order will match the sheet_name column in tab SheetIdentification
    • not implemented yet
  • no otherwise populated row in the Terms tab will have an empty Column Header
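
The last expectation could be expressed roughly as follows, assuming the Terms tab has been read into a list of dicts keyed by column name. Only Column Header comes from the sheet; the Definition column in the example and the helper name are hypothetical.

```python
from typing import Dict, List, Optional


def rows_missing_column_header(
    rows: List[Dict[str, Optional[str]]], header_key: str = "Column Header"
) -> List[int]:
    """Return indexes of rows that have content but an empty header cell."""
    offending = []
    for index, row in enumerate(rows):
        header = (row.get(header_key) or "").strip()
        # A row counts as populated if any other cell has non-blank content.
        otherwise_populated = any(
            str(value).strip()
            for key, value in row.items()
            if key != header_key and value is not None
        )
        if otherwise_populated and not header:
            offending.append(index)
    return offending


if __name__ == "__main__":
    example_rows = [
        {"Column Header": "ph", "Definition": "some text"},
        {"Column Header": "", "Definition": "orphaned text"},
        {"Column Header": "", "Definition": ""},  # fully empty rows are fine
    ]
    print(rows_missing_column_header(example_rows))
```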

bugs in data.tsv (and upstream yaml) from use_modular_gd.py

  • if the pattern looks like a list, make it an enumeration/pulldown

    • solved by processing MIxS after NMDC because the list-like patterns came from NMDC and get overwritten by MIxS?
    • they are still in MIxS string serialization
  • add hierarchical indentation of enumerated values

    • I have some old code for that somewhere
    • probably requires sem-sql code and database (vs a SPARQL solution)
    • use any ontologies besides EnvO?
    • see #58
  • Add support for partial date columns and time columns

  • align section composition and ordering with @mslarae13's Example Use tab

    • that info will be added to other more structured tabs
  • tidy the descriptions

  • add meanings for enumerated values

    • lookup with enum_annotator (see example in Makefile)
    • expose as Ontology IDs
    • some enum labels from MIxS have added parenthetic content that makes string matching difficult: meadows (grasses,alfalfa,fescue,bromegrass,timothy)
  • add more patterns based on

    • string_serialization
    • slot's range... pretty thorough at this point
  • populate examples column?

  • are terms being included even though they are marked skip on nmdc_biosample_slots ?

  • Ontology ID

  • terse labels (from apparent prioritization of NMDC over MIxS annotations?)

  • parent classes with "https:" prefixes

    • long-term solution: align section composition and ordering below
    • short-term solution comes from full URLs (prefer prefixed) above
  • seems like number of required fields too low

    • added requirements from slot usages
  • elaborate on the use of regular expressions in the guidance column. Also include the string serialization?

  • Where is the default PV in the sample_type enum coming from

    • @click.option('--default_data_status', default="default", show_default=True)
  • what does the Null values section in the double-click header help mean? see cidgoh#244

    • shows the contents of the data status column in data.tsv, which I was populating with --default_data_status
  • take advantage of min and max values for pH (anything else?)

  • whose id-like fields should be used? The ones from NMDC or ones created by @mslarae13

    • using identifiers from biosample_identification_slots
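
For the first item in this list, a sketch of what "looks like a list" might mean. It assumes list-like patterns take the form [choice|choice|...]; that shape is an assumption about the string serializations, and both the example choices and the helper name are hypothetical.

```python
import re
from typing import List, Optional

# Assumed shape of a list-like pattern: square brackets around two or more
# pipe-separated choices, with no nested brackets.
LIST_LIKE = re.compile(r"\[([^\[\]|]+(?:\|[^\[\]|]+)+)\]")


def pattern_to_choices(pattern: str) -> Optional[List[str]]:
    """Return the choices if the pattern is list-like, otherwise None."""
    match = LIST_LIKE.fullmatch(pattern.strip())
    if match is None:
        return None
    return [choice.strip() for choice in match.group(1).split("|")]


if __name__ == "__main__":
    # Placeholder choices, not taken from MIxS or NMDC.
    print(pattern_to_choices("[first choice|second choice]"))
    # The triad validation pattern is not list-like and is left alone.
    print(pattern_to_choices(r".* \[ENVO:\d+\]"))
```

Anything that returns None would keep its pattern; anything that returns a list would become an enumeration with those permissible values.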

DRY about linkml model enrichment in Makefile

@cmungall , could you help me try to generalize a pattern for running LME enum annotation over several different enums, in series?

We'll probably want to tune the ontology_string, max_cosine and trim_parentheticals for each requested_enum_name

I do remember your request to separate the annotation from the NMDC DH build
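
One possible shape for this, sketched outside the Makefile: shared defaults plus per-enum overrides, resolved into one settings dict per enum that a loop could then hand to the annotator. The parameter names are the ones mentioned above; every value and enum name here is a placeholder, and resolve_settings is a hypothetical helper.

```python
from typing import Dict, List

# Placeholder defaults; real values would need tuning.
DEFAULTS = {
    "ontology_string": "ENVO",
    "max_cosine": 0.05,
    "trim_parentheticals": False,
}

# Placeholder per-enum overrides, keyed by requested_enum_name.
OVERRIDES = {
    "second_enum": {"max_cosine": 0.1, "trim_parentheticals": True},
}


def resolve_settings(
    requested_enum_names: List[str],
    defaults: Dict[str, object],
    overrides: Dict[str, Dict[str, object]],
) -> List[Dict[str, object]]:
    """Merge defaults with per-enum overrides, preserving the requested order."""
    resolved = []
    for name in requested_enum_names:
        settings = {**defaults, **overrides.get(name, {})}
        settings["requested_enum_name"] = name
        resolved.append(settings)
    return resolved


if __name__ == "__main__":
    for settings in resolve_settings(["first_enum", "second_enum"], DEFAULTS, OVERRIDES):
        # Each dict would drive one annotation run, in series.
        print(settings)
```

Keeping the settings in data rather than in repeated Makefile rules would also make it easier to separate the annotation step from the NMDC DH build.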

general factoring

  • more class orientation
  • better name for linkml_round_trips folder, broken out into topical subdirectories

conflicts between MIxS and NMDC's MIxS namespaces

  1 WARNING:Namespaces:MIXS namespace is already mapped to https://w3id.org/gensc/ - Mapping to https://w3id.org/mixs/terms/ ignored
276 WARNING:YAMLGenerator:File "<file>" Prefix case mismatch - supplied: MIXS expected: mixs
  1 WARNING:YAMLGenerator:Overlapping subset and class names: soil

inconsistent prefix syntax for asserting prefixes

prefixes:
  SNTC_exact_mixs_usages:
    prefix_prefix: SNTC_exact_mixs_usages
    prefix_reference: http://example.com/SNTC_exact_mixs_usages/
  linkml: https://w3id.org/linkml/
  mixs.vocab: https://w3id.org/mixs/vocab/
  MIXS: https://w3id.org/mixs/terms/
  MIGS: https://w3id.org/mixs/migs/
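
A sketch of one way to make the block above consistent: collapse the expanded prefix_prefix / prefix_reference form into the short name: URI form. It assumes the prefixes section has already been loaded into a dict; normalize_prefixes is a hypothetical helper.

```python
from typing import Dict, Union

PrefixValue = Union[str, Dict[str, str]]


def normalize_prefixes(prefixes: Dict[str, PrefixValue]) -> Dict[str, str]:
    """Rewrite every prefix entry in the short 'name: URI' form."""
    normalized = {}
    for name, value in prefixes.items():
        if isinstance(value, str):
            # Already in the short form.
            normalized[name] = value
        else:
            # Expanded form: take the name from prefix_prefix when present.
            normalized[value.get("prefix_prefix", name)] = value["prefix_reference"]
    return normalized


if __name__ == "__main__":
    mixed = {
        "SNTC_exact_mixs_usages": {
            "prefix_prefix": "SNTC_exact_mixs_usages",
            "prefix_reference": "http://example.com/SNTC_exact_mixs_usages/",
        },
        "linkml": "https://w3id.org/linkml/",
        "MIXS": "https://w3id.org/mixs/terms/",
    }
    print(normalize_prefixes(mixed))
```

Going the other way (expanding everything) would work just as well; the point is only to pick one form.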

Generate valid MIxS soil + NMDC biosample LinkML based on Soil-NMDC-Template_Compiled

Add EMSL and JGI terms ASAP too

Related work

  • See also #20
  • make all already comes close to this but is too inclusive (and the workflow is a mess)
  • See also modular_gd.py

Terms will come from Soil-NMDC-Template_Compiled

The result should be a LinkML YAML file that passes basic validation, like serving as the input into gen-yaml

Take slot terms from the mixs_packages_x_slots tab where the package column = 'soil' and the package column = 'use as-is' or 'borrowed'

Take name terms from tab nmdc_biosample_slots where column from_schema != 'https://microbiomedata/schema/mixs'

All of those slots should be assigned to a new class like soil_biosample
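
The selection could be sketched like this, assuming both tabs have been read into lists of dicts. The text above names the package column twice, so the column holding 'use as-is' / 'borrowed' is passed in as a parameter here; its default name, the slot names in the example, and the helper name are all hypothetical.

```python
from typing import Dict, List

MIXS_SCHEMA = "https://microbiomedata/schema/mixs"  # value quoted above


def select_soil_biosample_slots(
    mixs_rows: List[Dict[str, str]],
    nmdc_rows: List[Dict[str, str]],
    usage_column: str = "usage",  # hypothetical column name
) -> Dict:
    """Collect the slots for a soil_biosample class as a LinkML-style fragment."""
    mixs_slots = [
        row["slot"]
        for row in mixs_rows
        if row["package"] == "soil"
        and row[usage_column] in {"use as-is", "borrowed"}
    ]
    nmdc_slots = [
        row["name"] for row in nmdc_rows if row["from_schema"] != MIXS_SCHEMA
    ]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    merged = list(dict.fromkeys(mixs_slots + nmdc_slots))
    return {"classes": {"soil_biosample": {"slots": merged}}}


if __name__ == "__main__":
    mixs = [
        {"package": "soil", "slot": "slot_a", "usage": "use as-is"},
        {"package": "water", "slot": "slot_b", "usage": "use as-is"},
    ]
    nmdc = [
        {"name": "slot_c", "from_schema": "https://example.org/other"},
        {"name": "slot_d", "from_schema": MIXS_SCHEMA},
    ]
    print(select_soil_biosample_slots(mixs, nmdc))
```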

section-based jump and hide

Requesting section-oriented navigation in addition to column-based navigation.

Who's in the best position to implement these? Somebody in NMDC, Kitware, or CIDGOH?

Hoping to implement by early February, for a late February conference demo

  • add jump to section
  • add show/hide section

make: *** [Makefile:42: target/nmdc_biosample_generated.yaml] Error 1

Traceback (most recent call last):
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/bin/gen-yaml", line 8, in <module>
sys.exit(cli())
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/linkml/generators/yamlgen.py", line 45, in cli
gen = YAMLGenerator(yamlfile, **args)
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/linkml/generators/yamlgen.py", line 24, in __init__
super().__init__(schema, **kwargs)
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/linkml/utils/generator.py", line 91, in __init__
loader = SchemaLoader(schema, self.base_dir, useuris=useuris, importmap=importmap, logger=self.logger,
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/linkml/utils/schemaloader.py", line 52, in __init__
self.schema = load_raw_schema(data, base_dir=base_dir, merge_modules=mergeimports,
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/linkml/utils/rawloader.py", line 69, in load_raw_schema
schema = yaml_loader.load(copy.deepcopy(data) if isinstance(data, dict) else data,
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/linkml_runtime/loaders/yaml_loader.py", line 22, in load
return self.load_source(source, loader, target_class, accept_header="text/yaml, application/yaml;q=0.9",
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/linkml_runtime/loaders/loader_root.py", line 60, in load_source
data_as_dict = loader(data, metadata)
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/linkml_runtime/loaders/yaml_loader.py", line 16, in loader
return yaml.load(StringIO(data), DupCheckYamlLoader) if isinstance(data, str) else data
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/__init__.py", line 114, in load
return loader.get_single_data()
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/constructor.py", line 49, in get_single_data
node = self.get_single_node()
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/composer.py", line 58, in compose_document
self.get_event()
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/parser.py", line 118, in get_event
self.current_event = self.state()
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/parser.py", line 193, in parse_document_end
token = self.peek_token()
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/scanner.py", line 129, in peek_token
self.fetch_more_tokens()
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/scanner.py", line 223, in fetch_more_tokens
return self.fetch_value()
File "/home/mark/.cache/pypoetry/virtualenvs/linkml-round-trips-2m4fR-mZ-py3.9/lib/python3.9/site-packages/yaml/scanner.py", line 577, in fetch_value
raise ScannerError(None, None,
yaml.scanner.ScannerError: mapping values are not allowed here

Document Google sheets authentication

Authentication is required before these scripts (and tests) can access the Soil-NMDC-Template_Compiled Google Sheet. Documentation is required.

This is especially tricky when running tests like test_sntc.py while authentication is incomplete. If local/client_secret.apps.googleusercontent.com.json is present but sheets.googleapis.com-python.json isn't, run the test like this the first time around:

poetry run pytest -s test_sntc.py

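You will be asked to open a URL, grant authorization to these scripts, and then paste a code back into the shell. The -s flag turns off pytest's output capturing, which is what lets the test pause at that prompt and read the code from the terminal.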
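
The documentation could include a quick pre-flight check like the sketch below, which reports which of the two files is missing before any test is run. The paths are the ones named in this issue; the function name and the three state labels are hypothetical.

```python
from pathlib import Path


def auth_state(client_secret: Path, token: Path) -> str:
    """Describe how far Google Sheets authentication has progressed.

    client_secret: the client secrets file downloaded from Google.
    token: the token file written after the interactive flow completes.
    """
    if not client_secret.is_file():
        return "missing_client_secret"
    if not token.is_file():
        # First run: the interactive flow is needed, so use `pytest -s`.
        return "needs_interactive_login"
    return "ready"


if __name__ == "__main__":
    state = auth_state(
        Path("local/client_secret.apps.googleusercontent.com.json"),
        Path("sheets.googleapis.com-python.json"),
    )
    print(state)
```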
