An established rule or principle, a self-evident truth
Axiom is a prototype utility for validating and applying metadata templates for scientific data files.
Documentation can be found in the docs directory of this repository.
License: MIT License
I need to write new documentation to include the DRS utilities.
The configuration subsystem returns a default of False when a key is missing, which allows rapid development: new features can be tried with a simple toggle in the configuration.
An unintended side effect is that accessing a missing project/model key elsewhere in the system also returns False, triggering an error further down the line.
While this is technically the correct behaviour, it is frustrating and can be difficult to diagnose.
There are two options to address this:
I think both options should be implemented to cover all cases.
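One way the two behaviours could be separated is to distinguish feature toggles (where a missing key means "off") from required project/model keys (where a missing key should fail loudly). A minimal sketch; the `flag`/`require` names are hypothetical, not Axiom's actual API:

```python
class Config(dict):
    """Toy configuration wrapper illustrating the two proposed behaviours."""

    def flag(self, key):
        # Feature toggles: a missing key simply means "off".
        return self.get(key, False)

    def require(self, key):
        # Project/model keys: fail loudly and early, rather than returning
        # False and triggering an obscure error downstream.
        if key not in self:
            raise KeyError(f"Missing required configuration key: {key!r}")
        return self[key]
```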
As per discussions with Jake, the ability to automatically generate a user's .axiom directory would be useful.
This will probably be a CLI option like axiom drs_generate_user_config or similar.
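The generation step itself could be as simple as the following sketch; the default content is a placeholder (the real template would ship with the package or axiom-schemas):

```python
import json
from pathlib import Path


def generate_user_config(home=None):
    """Create a user .axiom directory with a default drs.json, if absent."""
    base = Path(home or Path.home()) / ".axiom"
    base.mkdir(parents=True, exist_ok=True)
    config = base / "drs.json"
    if not config.exists():
        # Placeholder defaults only; not the real drs.json schema.
        config.write_text(json.dumps({"projects": {}, "models": {}}, indent=2))
    return config
```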
Some data is precut to domains, axiom should be able to detect this and work out the domain.
This is basically a reverse-lookup table for the internal domain specifications from the drs.json data file.
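The reverse lookup could match a file's spatial extent against the known domain bounds. A sketch with made-up bounds (the real values would come from the drs.json data file):

```python
# Illustrative inversion of the internal domain specifications; the bounds
# below are invented for the example.
DOMAINS = {
    "AUS-10i": {"lat": (-45.0, -9.0), "lon": (110.0, 155.0)},
    "global": {"lat": (-90.0, 90.0), "lon": (0.0, 360.0)},
}


def detect_domain(lat_min, lat_max, lon_min, lon_max, tol=0.5):
    """Return the first domain whose bounds match the data extent,
    or None if the data is precut to an unknown domain."""
    for name, spec in DOMAINS.items():
        (la0, la1), (lo0, lo1) = spec["lat"], spec["lon"]
        if (abs(lat_min - la0) <= tol and abs(lat_max - la1) <= tol
                and abs(lon_min - lo0) <= tol and abs(lon_max - lo1) <= tol):
            return name
    return None
```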
Fixed-frequency files are duplicated with dates in their filenames - they need to be made singular.
As per discussion with @tha051 05/09/2022 we need to process CCAM ERA5 @ aus10i. All payloads should be ready.
Axiom drs should fail files if there are any placeholders remaining in the output path or metadata, alerting the user as to what is missing in an error digest at runtime.
The DRS subsystem should be able to, based on the last processed variable, work out how long the next would take and if there isn't sufficient walltime remaining (+threshold for PBS) resubmit itself.
This will require inspecting the PBS environment to capture the submission details as well as implementing checks to see if a file has already been written.
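The decision itself is straightforward once the remaining walltime is known; how to obtain it (e.g. by querying PBS for the job in $PBS_JOBID) is deliberately left out of this sketch:

```python
def should_resubmit(seconds_remaining, last_variable_seconds, threshold=300):
    """Decide whether to resubmit before starting the next variable.

    Uses the last processed variable's duration as the estimate for the
    next one, plus a safety threshold (here 5 minutes, an assumption) for
    PBS queue/teardown overhead.
    """
    return seconds_remaining < last_variable_seconds + threshold
```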
Filename variable filtering requires variables to be delimited by underscores. Variables followed by a dot (e.g. a variable just before the file extension) are not identified correctly, as matching continues until the next underscore or the end of the filename. Adding additional delimiters would be preferable.
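A sketch of tokenising on both underscores and dots after dropping the extension, so a variable sitting just before the extension is still found; the delimiter set is a suggestion, not Axiom's current rule:

```python
import re


def extract_tokens(filename):
    """Tokenise a filename for variable filtering.

    Dropping the extension first keeps hyphenated tokens such as AUS-11i
    intact while still catching a variable directly before the extension.
    """
    stem = filename.rsplit(".", 1)[0]
    return [t for t in re.split(r"[_.]", stem) if t]
```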
A preset conda environment with all packages and modules needed to run Axiom would make the installation more streamlined and would eliminate issues caused by missing packages or package contents.
As per discussions with @tha051, the JSON payload format has subtly changed, we need to test that the consumption CLI and underlying code still hold.
Data converted to daily from hourly appears to have repeating time steps and incorrect dates.
This is using data found here: /g/data/xv83/mxt599/ccam_era5_aus-11i_12.5km_coupled/drs_cordex/CORDEX-CMIP6/output/AUS-11i/CSIRO/ECMWF-ERA5/historical/r1i1p1f1/CSIRO-CCAM-2204/v1/day/pr
CDO output shows:
$ cdo showdate pr_AUS-11i_ECMWF-ERA5_historical_r1i1p1f1_CSIRO-CCAM-2204_v1_day_19870101-19871231.nc
1987-01-15 1987-02-14 1987-03-15 1987-04-15 1987-05-15 1987-06-15 1987-07-15 1987-08-15 1987-09-15 1987-10-15 1987-11-15 1987-12-15
cdo showdate: Processed 1 variable over 365 timesteps [0.03s 33MB].
Currently, the fx variables are written under the frequency directory. There needs to be a conditional added to switch directory to fx, detecting existing files first.
Also remove the time information from the filename.
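The conditional could look something like the following; the fx variable set, directory layout and filename pattern are assumptions for illustration:

```python
import re
from pathlib import Path

# Illustrative subset of fixed variables; not an exhaustive list.
FX_VARIABLES = {"orog", "sftlf", "areacella"}


def output_path(base, frequency, variable, filename):
    """Route fixed variables into an fx directory and strip the time
    range from their filename."""
    if variable in FX_VARIABLES:
        frequency = "fx"
        # Drop a trailing _YYYYMMDD-YYYYMMDD time range, if present.
        filename = re.sub(r"_\d{8}-\d{8}(?=\.nc$)", "", filename)
    return Path(base) / frequency / variable / filename
```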
FYI @tha051.
Currently the hourly data will compute in around 30-60s per file, but then take a long time to write out the netcdf. This is more or less expected given the 60-70GB output file per variable. Total processing < 10min per file in most cases.
I've tried a number of approaches to speed up the write:
The last option seems the most promising, but when RAM is shared with the scheduler it fills up and crashes fairly quickly.
Just raising this issue for future investigation if anyone is interested in taking a look.
Attempted to install axiom and axiom-schemas, but dummy test failed:
[]$ axiom -h
Traceback (most recent call last):
File "/$HOME/.local/bin/axiom", line 211, in <module>
parser = get_parser()
File "/$HOME/.local/bin/axiom", line 201, in get_parser
parser_drs = ad.get_parser(parent=subparsers)
File "/$HOME/.local/lib/python3.9/site-packages/axiom/drs/__init__.py", line 24, in get_parser
config = au.load_package_data('data/drs.json')
File "/$HOME/.local/lib/python3.9/site-packages/axiom/utilities.py", line 299, in load_package_data
raw = pkgutil.get_data(package_name, slug)
File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/pkgutil.py", line 639, in get_data
return loader.get_data(resource_name)
File "<frozen importlib._bootstrap_external>", line 1039, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/$HOME/.local/lib/python3.9/site-packages/axiom/data/drs.json'
While JSON files are highly portable, they are verbose, with lots of brackets and quoting.
YAML might be an option should we find it too cumbersome.
Inputs:
/datastore/raf018/CCAM/WINE/ACCESS1-0/50km/cordex
Outputs:
/datastore/raf018/CCAM/WINE/ACCESS1-0/50km/DRS/CORDEX/
Set up RTD for this repo with automatic module doc generation.
There are still remaining schemas in the specifications directory that have not yet been moved to axiom-schemas. Low-hanging fruit for anyone interested.
axiom validate -h
fails to import items from modules distributed, dask and click. I had to install these modules manually for it to work.
pip install distributed dask click
This is now handled via GitHub Actions on the dev branch. All tests can be added to the test directory (follow the existing examples). Please run all tests and verify that they pass BEFORE pushing.
As per discussions with @tha051, there should be a way to consume a JSON message containing DRS configuration in the DRS scripts for automated workflow.
One of the issues we are facing is differences in some variable names in the input data, for example rlat0, rlong0, etc.
The current approach to remapping these to standard variable names (lat and lon in this case) is to use a preprocessor. However, this is going to be an issue going forward as we start to ingest more data from different models - we don't necessarily want to write a preprocessor for each and every model.
I propose that we develop a small sub-package that can maintain a standard mapping, at least for coordinate variables, that can do an automatic aliasing of common variable names to standard names.
This functionality would have to be limited at first, mapping variants of certain coordinate spellings etc. to something standard and conditionally renaming them if need be.
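A minimal sketch of the proposed aliasing: one mapping of common coordinate spellings to standard names, applied conditionally so only names that need changing are renamed. The mapping entries are examples only:

```python
# Example alias table; the real sub-package would maintain this centrally.
COORD_ALIASES = {
    "rlat0": "lat",
    "rlong0": "lon",
    "latitude": "lat",
    "longitude": "lon",
}


def standardise_names(names):
    """Return a rename map covering only the names that need changing,
    suitable for passing to e.g. xarray's Dataset.rename()."""
    return {n: COORD_ALIASES[n] for n in names if n in COORD_ALIASES}
```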
As above.
Model metadata is currently configured on a user basis, however we may have a path to automatically pull in the source_id tables:
https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_source_id.html
Currently, DRS processing will work through all configured variables for a single processing job. For daily data this can be time-consuming.
In a small test, an individual variable/domain/year processed in the order of about 8 mins for daily data, which would put ~200 variables at over 24hrs at the default parallel settings.
This speed constraint can be overcome by either:
a) Scaling horizontally over more cores, which is limited by the stability of the HPC running Dask.
b) Splitting the variables into batches and running those in parallel.
Axiom is already configured to run either all configured variables or the user-provided variables, it is just a matter of splitting them up on the command line somehow.
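The splitting itself is trivial; a round-robin sketch, where each batch could then be passed as the variable list of a separate axiom drs invocation:

```python
def batch_variables(variables, n_batches):
    """Split the configured variable list into roughly equal batches
    for parallel processing jobs."""
    batches = [[] for _ in range(n_batches)]
    for i, var in enumerate(variables):
        batches[i % n_batches].append(var)
    # Drop empty batches when there are fewer variables than batches.
    return [b for b in batches if b]
```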
Now that RTD is set up, we need to update the documentation as it is outdated and still points to the original CRE location.
Some additional pages:
Adding this to the repo
Most of these should be specified in axiom-schemas, but good to check regardless.
http://is-enes-data.github.io/CORDEX_variables_requirement_table.pdf
As above
Some elements of processing are becoming unwieldy with standard string interpolation.
We may need to move the templating over to use Jinja.
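A sketch of what the switch might look like, assuming jinja2 is available; the template string is illustrative. Jinja's StrictUndefined would also make missing metadata fail loudly instead of rendering silently, which ties in with the placeholder-digest issue:

```python
from jinja2 import Environment, StrictUndefined

# StrictUndefined raises on missing context instead of emitting an
# empty string.
env = Environment(undefined=StrictUndefined)

# Illustrative filename template, not Axiom's actual one.
template = env.from_string("{{ variable }}_{{ domain }}_{{ frequency }}.nc")


def render_filename(**context):
    return template.render(**context)
```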
This can be done by differencing the first time steps.
1h, 3h, 6h, daily, monthly.
Look at time deltas and pick the closest interval.
Whatever the user provides gets applied as a time aggregation after the auto-detection.
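A sketch of the detection step over the candidate intervals listed above; the interval labels are arbitrary and months are approximated at 30 days:

```python
from datetime import timedelta

# Candidate intervals: 1h, 3h, 6h, daily, monthly.
INTERVALS = {
    "1H": timedelta(hours=1),
    "3H": timedelta(hours=3),
    "6H": timedelta(hours=6),
    "1D": timedelta(days=1),
    "1M": timedelta(days=30),  # approximate; calendar months are irregular
}


def detect_frequency(t0, t1):
    """Pick the candidate interval closest to the delta between the first
    two time steps."""
    delta = t1 - t0
    return min(INTERVALS, key=lambda k: abs(INTERVALS[k] - delta))
```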
FYI @tha051
Currently, Axiom is unable to process NetCDF version 3 files, as it attempts to read the data with h5netcdf, which only works on NetCDF version 4 files. For now, files can be converted from NetCDF-3 to NetCDF-4 as a workaround.
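A possible fallback would be to pick the read engine from the file signature: classic NetCDF-3 files begin with the bytes CDF, while NetCDF-4/HDF5 files begin with \x89HDF. A sketch, not Axiom's current behaviour:

```python
def choose_engine(path):
    """Pick an xarray backend engine from the file's magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(4)
    if magic.startswith(b"CDF"):
        # Classic NetCDF-3; readable by the scipy or netcdf4 backends.
        return "scipy"
    if magic.startswith(b"\x89HDF"):
        # NetCDF-4 is HDF5 underneath.
        return "h5netcdf"
    raise ValueError(f"Unrecognised file signature: {magic!r}")
```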
As per meeting 20220315 some variables will be coming in on multiple levels and will come out on multiple levels.
The current approach will process a single level and single variable - we can set this up to vectorise the operations automatically, but we will still need to be careful with the number of files passing through the system.
The file sizes are about 3x bigger than they theoretically should be.
Fixing this will also somewhat address #44.
File "/scratch/xv83/at2708/miniconda3/envs/conda_axiom/lib/python3.10/site-packages/axiom/drs/utilities.py", line 465, in detect_input_frequency
total_seconds = (ds.time.data[1] - ds.time.data[0]).astype('timedelta64[s]').astype(np.int32)
AttributeError: 'datetime.timedelta' object has no attribute 'astype'
solution:
import numpy as np
diff = ds.time.data[1] - ds.time.data[0]
# diff is a datetime.timedelta here, which has no .astype() but does provide
# .total_seconds(); note the originally suggested fix cast to hours
# ('timedelta64[h]'), which did not match the variable name.
total_seconds = np.int32(diff.total_seconds())
The last update broke the CI, there was a change in the API that was not reflected in the tests as it was a legacy portion of the code.
Transient dask/distributed errors can result in rare exceptions during processing. In almost every case this can be resolved by simply re-running the processing task. However, it would be useful to have a known lookup of error messages that could be evaluated to safely and automatically rerun the particular DRS processing exercise.
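A sketch of that lookup-and-retry idea; the error-message fragments below are examples of the kind of thing the table would hold, not a verified list:

```python
import time

# Hypothetical fragments identifying known-transient dask/distributed errors.
TRANSIENT_ERRORS = ("CommClosedError", "Worker stream died")


def run_with_retry(task, retries=2, delay=0):
    """Re-run a processing task only when its failure matches a known
    transient error; anything else is re-raised immediately."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            transient = any(frag in repr(exc) for frag in TRANSIENT_ERRORS)
            if not transient or attempt == retries:
                raise
            time.sleep(delay)
```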
For various legacy reasons, the schema directive (i.e. what schema gets applied) is configured in the user drs.json configuration file. This is fine when always processing the same schema.
It would be better to select the schema from the incoming payload, or at the very least make it so that it can be used as an override.
Significant changes to the DRS subsystem are underway for the JSON consumption approach for CCAM, which will require changes to the single-instance axiom drs command.
Rather than packing it all into a single commit I'll raise this as another issue/task to be completed once the JSON consumption task is complete.
As discussed in meeting, arbitrary domains should be allowed in the CLI.
As Axiom matures we will need to be able to install it via pip.
The same will need to be done for axiom-schemas.
For debugging purposes in the drs_launch command it would be useful to have an interactive flag to be added in conjunction with the dump flag.
In order to verify the Conventions metadata in output files, we would need to integrate CF/ACDD compliance checks into the processing chain.
http://cfconventions.org/software.html
https://github.com/ioos/compliance-checker
No timeline at this stage, just a thought starter.
The API docstring here is missing a few arguments.
Line 326 in beaf680
Axiom seems to auto-detect CCAM in the file name and initiate the CCAM pre-processor. The pre-processor causes errors when processing data files with missing metadata/attributes (e.g. missing attributes rlon0, rlat0).
Removing "CCAM" from the file name prevents the pre-processing.
As it says in the name - schemas to be removed from this repository and moved into their own so that an update of the schema does not trigger an update of the utility.