An established rule or principle, a self-evident truth
Axiom is a prototype utility for validating and applying metadata templates for scientific data files.
Documentation can be found in the docs directory of this repository.
License: MIT License
I need to write new documentation to include the DRS utilities.
The configuration subsystem returns a default of False when a key is missing, which allows rapid development: new features can be tried with a simple toggle in the configuration.
An unintended side effect is that accessing a missing project/model key elsewhere in the system also returns False, triggering an error further down the line.
While this is technically the correct behaviour, it is frustrating and can be difficult to diagnose.
There are two options to address this:
I think both options should be implemented to cover all cases.
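One way the two behaviours could be separated is to distinguish feature toggles (where a missing key means "off") from required project/model keys (where a missing key should fail loudly). A minimal sketch; the `flag`/`require` names are hypothetical, not Axiom's actual API:

```python
class Config(dict):
    """Toy configuration wrapper illustrating the two proposed behaviours."""

    def flag(self, key):
        # Feature toggles: a missing key simply means "off".
        return self.get(key, False)

    def require(self, key):
        # Project/model keys: fail loudly and early, rather than returning
        # False and triggering an obscure error downstream.
        if key not in self:
            raise KeyError(f"Missing required configuration key: {key!r}")
        return self[key]
```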
As per discussions with Jake, the ability to automatically generate a user's .axiom directory would be useful.
This will probably be a CLI option like axiom drs_generate_user_config or similar.
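The generation step itself could be as simple as the following sketch; the default content is a placeholder (the real template would ship with the package or axiom-schemas):

```python
import json
from pathlib import Path


def generate_user_config(home=None):
    """Create a user .axiom directory with a default drs.json, if absent."""
    base = Path(home or Path.home()) / ".axiom"
    base.mkdir(parents=True, exist_ok=True)
    config = base / "drs.json"
    if not config.exists():
        # Placeholder defaults only; not the real drs.json schema.
        config.write_text(json.dumps({"projects": {}, "models": {}}, indent=2))
    return config
```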
Some data is precut to domains, axiom should be able to detect this and work out the domain.
This is basically a reverse-lookup table for the internal domain specifications from the drs.json data file.
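The reverse lookup could match a file's spatial extent against the known domain bounds. A sketch with made-up bounds (the real values would come from the drs.json data file):

```python
# Illustrative inversion of the internal domain specifications; the bounds
# below are invented for the example.
DOMAINS = {
    "AUS-10i": {"lat": (-45.0, -9.0), "lon": (110.0, 155.0)},
    "global": {"lat": (-90.0, 90.0), "lon": (0.0, 360.0)},
}


def detect_domain(lat_min, lat_max, lon_min, lon_max, tol=0.5):
    """Return the first domain whose bounds match the data extent,
    or None if the data is precut to an unknown domain."""
    for name, spec in DOMAINS.items():
        (la0, la1), (lo0, lo1) = spec["lat"], spec["lon"]
        if (abs(lat_min - la0) <= tol and abs(lat_max - la1) <= tol
                and abs(lon_min - lo0) <= tol and abs(lon_max - lo1) <= tol):
            return name
    return None
```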
Fixed-frequency files are duplicated with dates in their filenames - they need to be made singular.
As per discussion with @tha051 05/09/2022 we need to process CCAM ERA5 @ aus10i. All payloads should be ready.
Axiom drs should fail files if there are any placeholders remaining in the output path or metadata, alerting the user as to what is missing in an error digest at runtime.
The DRS subsystem should be able to, based on the last processed variable, work out how long the next would take and if there isn't sufficient walltime remaining (+threshold for PBS) resubmit itself.
This will require inspecting the PBS environment to capture the submission details as well as implementing checks to see if a file has already been written.
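The decision itself is straightforward once the remaining walltime is known; how to obtain it (e.g. by querying PBS for the job in $PBS_JOBID) is deliberately left out of this sketch:

```python
def should_resubmit(seconds_remaining, last_variable_seconds, threshold=300):
    """Decide whether to resubmit before starting the next variable.

    Uses the last processed variable's duration as the estimate for the
    next one, plus a safety threshold (here 5 minutes, an assumption) for
    PBS queue/teardown overhead.
    """
    return seconds_remaining < last_variable_seconds + threshold
```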
Filename variable filtering requires variables to be delimited by underscores. Variables followed by a dot (e.g. a variable just before the file extension) are not identified correctly, as matching continues until the next underscore or the end of the filename. Adding additional delimiters would be preferable.
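A sketch of tokenising on both underscores and dots after dropping the extension, so a variable sitting just before the extension is still found; the delimiter set is a suggestion, not Axiom's current rule:

```python
import re


def extract_tokens(filename):
    """Tokenise a filename for variable filtering.

    Dropping the extension first keeps hyphenated tokens such as AUS-11i
    intact while still catching a variable directly before the extension.
    """
    stem = filename.rsplit(".", 1)[0]
    return [t for t in re.split(r"[_.]", stem) if t]
```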
A preset conda environment with all packages and modules needed to run Axiom would make the installation more streamlined and would eliminate issues caused by missing packages or package contents.
As per discussions with @tha051, the JSON payload format has subtly changed, we need to test that the consumption CLI and underlying code still hold.
Data converted to daily from hourly appears to have repeating time steps and incorrect dates.
This is using data found here: /g/data/xv83/mxt599/ccam_era5_aus-11i_12.5km_coupled/drs_cordex/CORDEX-CMIP6/output/AUS-11i/CSIRO/ECMWF-ERA5/historical/r1i1p1f1/CSIRO-CCAM-2204/v1/day/pr
CDO output shows:
$ cdo showdate pr_AUS-11i_ECMWF-ERA5_historical_r1i1p1f1_CSIRO-CCAM-2204_v1_day_19870101-19871231.nc
1987-01-15 1987-02-14 1987-03-15 1987-04-15 1987-05-15 1987-06-15 1987-07-15 1987-08-15 1987-09-15 1987-10-15 1987-11-15 1987-12-15
cdo showdate: Processed 1 variable over 365 timesteps [0.03s 33MB].
Currently, the fx variables are written under the frequency directory. There needs to be a conditional added to switch directory to fx, detecting existing files first.
Also remove the time information from the filename.
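The conditional could look something like the following; the fx variable set, directory layout and filename pattern are assumptions for illustration:

```python
import re
from pathlib import Path

# Illustrative subset of fixed variables; not an exhaustive list.
FX_VARIABLES = {"orog", "sftlf", "areacella"}


def output_path(base, frequency, variable, filename):
    """Route fixed variables into an fx directory and strip the time
    range from their filename."""
    if variable in FX_VARIABLES:
        frequency = "fx"
        # Drop a trailing _YYYYMMDD-YYYYMMDD time range, if present.
        filename = re.sub(r"_\d{8}-\d{8}(?=\.nc$)", "", filename)
    return Path(base) / frequency / variable / filename
```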
FYI @tha051.
Currently the hourly data will compute in around 30-60s per file, but then take a long time to write out the netcdf. This is more or less expected given the 60-70GB output file per variable. Total processing < 10min per file in most cases.
I've tried a number of approaches to speed up the write:
The last option seems the most promising, but when RAM is shared with the scheduler it fills up and crashes fairly quickly.
Just raising this issue for future investigation if anyone is interested in taking a look.
Attempted to install axiom and axiom-schemas, but dummy test failed:
[]$ axiom -h
Traceback (most recent call last):
File "/$HOME/.local/bin/axiom", line 211, in <module>
parser = get_parser()
File "/$HOME/.local/bin/axiom", line 201, in get_parser
parser_drs = ad.get_parser(parent=subparsers)
File "/$HOME/.local/lib/python3.9/site-packages/axiom/drs/__init__.py", line 24, in get_parser
config = au.load_package_data('data/drs.json')
File "/$HOME/.local/lib/python3.9/site-packages/axiom/utilities.py", line 299, in load_package_data
raw = pkgutil.get_data(package_name, slug)
File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/pkgutil.py", line 639, in get_data
return loader.get_data(resource_name)
File "<frozen importlib._bootstrap_external>", line 1039, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/$HOME/.local/lib/python3.9/site-packages/axiom/data/drs.json'
While JSON files are highly portable, they are verbose, with lots of brackets and quoting.
YAML might be an option should we find it too cumbersome.
Inputs:
/datastore/raf018/CCAM/WINE/ACCESS1-0/50km/cordex
Outputs:
/datastore/raf018/CCAM/WINE/ACCESS1-0/50km/DRS/CORDEX/
Set up RTD for this repo with automatic module doc generation.
There are still remaining schemas in the specifications directory that have not yet been moved to axiom-schemas. Low-hanging fruit for anyone interested.
axiom validate -h
fails to import items from modules distributed, dask and click. I had to install these modules manually for it to work.
pip install distributed dask click
This is now handled via GitHub Actions on the dev branch. All tests can be added to the test directory (follow the existing examples). Please run all tests and verify that they pass BEFORE pushing.
As per discussions with @tha051, there should be a way to consume a JSON message containing DRS configuration in the DRS scripts for automated workflow.
One of the issues we are facing is differences in some variable names in the input data, for example rlat0, rlong0, etc.
The current approach to remapping these to standard variable names (lat and lon in this case) is to use a preprocessor. However, this is going to be an issue going forward as we start to ingest more data from different models - we don't necessarily want to write a preprocessor for each and every model.
I propose that we develop a small sub-package that can maintain a standard mapping, at least for coordinate variables, that can do an automatic aliasing of common variable names to standard names.
This functionality would have to be limited at first, mapping variants of certain coordinate spellings etc. to something standard and conditionally renaming them if need be.
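A minimal sketch of the proposed aliasing: one mapping of common coordinate spellings to standard names, applied conditionally so only names that need changing are renamed. The mapping entries are examples only:

```python
# Example alias table; the real sub-package would maintain this centrally.
COORD_ALIASES = {
    "rlat0": "lat",
    "rlong0": "lon",
    "latitude": "lat",
    "longitude": "lon",
}


def standardise_names(names):
    """Return a rename map covering only the names that need changing,
    suitable for passing to e.g. xarray's Dataset.rename()."""
    return {n: COORD_ALIASES[n] for n in names if n in COORD_ALIASES}
```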
As above.
Model metadata is currently configured on a user basis, however we may have a path to automatically pull in the source_id tables:
https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_source_id.html
Currently, DRS processing will work through all configured variables for a single processing job. For daily data this can be time-consuming.
In a small test, an individual variable/domain/year processed in the order of about 8 mins for daily data, which would put ~200 variables at over 24hrs at the default parallel settings.
This speed constraint can be overcome by either:
a) Scaling horizontally over more cores, which is limited by the stability of the HPC running Dask.
b) Splitting the variables into batches and running those in parallel.
Axiom is already configured to run either all configured variables or the user-provided variables, it is just a matter of splitting them up on the command line somehow.
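The splitting itself is trivial; a round-robin sketch, where each batch could then be passed as the variable list of a separate axiom drs invocation:

```python
def batch_variables(variables, n_batches):
    """Split the configured variable list into roughly equal batches
    for parallel processing jobs."""
    batches = [[] for _ in range(n_batches)]
    for i, var in enumerate(variables):
        batches[i % n_batches].append(var)
    # Drop empty batches when there are fewer variables than batches.
    return [b for b in batches if b]
```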
Now that RTD is set up, we need to update the documentation as it is outdated and still points to the original CRE location.
Some additional pages:
Adding this to the repo
Most of these should be specified in axiom-schemas, but good to check regardless.
http://is-enes-data.github.io/CORDEX_variables_requirement_table.pdf
As above
Some elements of processing are becoming unwieldy with standard string interpolation.
We may need to move the templating over to use Jinja.
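A sketch of what the switch might look like, assuming jinja2 is available; the template string is illustrative. Jinja's StrictUndefined would also make missing metadata fail loudly instead of rendering silently, which ties in with the placeholder-digest issue:

```python
from jinja2 import Environment, StrictUndefined

# StrictUndefined raises on missing context instead of emitting an
# empty string.
env = Environment(undefined=StrictUndefined)

# Illustrative filename template, not Axiom's actual one.
template = env.from_string("{{ variable }}_{{ domain }}_{{ frequency }}.nc")


def render_filename(**context):
    return template.render(**context)
```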
This can be done by differencing the first time steps.
1h, 3h, 6h, daily, monthly.
Look at time deltas and pick the closest interval.
Whatever the user provides gets applied as a time aggregation after the auto-detection.
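A sketch of the detection step over the candidate intervals listed above; the interval labels are arbitrary and months are approximated at 30 days:

```python
from datetime import timedelta

# Candidate intervals: 1h, 3h, 6h, daily, monthly.
INTERVALS = {
    "1H": timedelta(hours=1),
    "3H": timedelta(hours=3),
    "6H": timedelta(hours=6),
    "1D": timedelta(days=1),
    "1M": timedelta(days=30),  # approximate; calendar months are irregular
}


def detect_frequency(t0, t1):
    """Pick the candidate interval closest to the delta between the first
    two time steps."""
    delta = t1 - t0
    return min(INTERVALS, key=lambda k: abs(INTERVALS[k] - delta))
```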
FYI @tha051
Currently, Axiom is unable to process NetCDF version 3 files, as it attempts to read the data with h5netcdf, which only works on NetCDF version 4 files. For now, files can be converted from NetCDF-3 to NetCDF-4 as a workaround.
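A possible fallback would be to pick the read engine from the file signature: classic NetCDF-3 files begin with the bytes CDF, while NetCDF-4/HDF5 files begin with \x89HDF. A sketch, not Axiom's current behaviour:

```python
def choose_engine(path):
    """Pick an xarray backend engine from the file's magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(4)
    if magic.startswith(b"CDF"):
        # Classic NetCDF-3; readable by the scipy or netcdf4 backends.
        return "scipy"
    if magic.startswith(b"\x89HDF"):
        # NetCDF-4 is HDF5 underneath.
        return "h5netcdf"
    raise ValueError(f"Unrecognised file signature: {magic!r}")
```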
As per meeting 20220315 some variables will be coming in on multiple levels and will come out on multiple levels.
The current approach will process a single level and single variable - we can set this up to vectorise the operations automatically, but we will still need to be careful with the number of files passing through the system.
The file sizes are about 3x bigger than they theoretically should be.
Fixing this will also somewhat address #44.
File "/scratch/xv83/at2708/miniconda3/envs/conda_axiom/lib/python3.10/site-packages/axiom/drs/utilities.py", line 465, in detect_input_frequency
total_seconds = (ds.time.data[1] - ds.time.data[0]).astype('timedelta64[s]').astype(np.int32)
AttributeError: 'datetime.timedelta' object has no attribute 'astype'
solution:
import numpy as np
diff = ds.time.data[1] - ds.time.data[0]
# diff is a datetime.timedelta here, which has no .astype() but does provide
# .total_seconds(); note the originally suggested fix cast to hours
# ('timedelta64[h]'), which did not match the variable name.
total_seconds = np.int32(diff.total_seconds())
The last update broke the CI, there was a change in the API that was not reflected in the tests as it was a legacy portion of the code.
Transient dask/distributed errors can result in rare exceptions during processing. In almost every case this can be resolved by simply re-running the processing task. However, it would be useful to have a known lookup of error messages that could be evaluated to safely and automatically rerun the particular DRS processing exercise.
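A sketch of that lookup-and-retry idea; the error-message fragments below are examples of the kind of thing the table would hold, not a verified list:

```python
import time

# Hypothetical fragments identifying known-transient dask/distributed errors.
TRANSIENT_ERRORS = ("CommClosedError", "Worker stream died")


def run_with_retry(task, retries=2, delay=0):
    """Re-run a processing task only when its failure matches a known
    transient error; anything else is re-raised immediately."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            transient = any(frag in repr(exc) for frag in TRANSIENT_ERRORS)
            if not transient or attempt == retries:
                raise
            time.sleep(delay)
```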
For various legacy reasons, the schema directive (i.e. what schema gets applied) is configured in the user drs.json configuration file. This is fine when always processing the same schema.
It would be better to select the schema from the incoming payload, or at the very least make it so that it can be used as an override.
Significant changes to the DRS subsystem are underway for the JSON consumption approach for CCAM, which will require changes to the single-instance axiom drs command.
Rather than packing it all into a single commit I'll raise this as another issue/task to be completed once the JSON consumption task is complete.
As discussed in meeting, arbitrary domains should be allowed in the CLI.
As Axiom matures we will need to be able to install it via pip.
The same will need to be done for axiom-schemas.
For debugging purposes in the drs_launch command it would be useful to have an interactive flag to be added in conjunction with the dump flag.
In order to verify the Conventions metadata in output files, we would need to integrate CF/ACDD compliance checks into the processing chain.
http://cfconventions.org/software.html
https://github.com/ioos/compliance-checker
No timeline at this stage, just a thought starter.
The API docstring here is missing a few arguments.
Line 326 in beaf680
Axiom seems to auto-detect CCAM in the file name and initiate the CCAM pre-processor. The pre-processor causes errors when processing data files with missing metadata/attributes (e.g. missing attributes rlon0, rlat0).
Removing "CCAM" from the file name prevents the pre-processing.
As it says in the name - schemas to be removed from this repository and moved into their own so that an update of the schema does not trigger an update of the utility.