
axiom's Introduction


Axiom

An established rule or principle, a self-evident truth

Axiom is a prototype utility for validating/applying metadata templates for scientific data files.

Documentation can be found in the docs directory of this repository.

axiom's People

Contributors

bschroeter, aliciatak

Stargazers

Benjamin Ng, Romain Beucher

Watchers

Claire Trenham

axiom's Issues

Configuration boolean default has unintended side-effects

The configuration subsystem returns a default of False when a key is missing. This allows for rapid development: new features can be tried with a simple toggle in the configuration.

An unintended side-effect arises when accessing missing project/model keys elsewhere in the system: the missing key returns False and triggers an error further down the line.

While this is the correct behaviour, it is frustrating and can be difficult to diagnose.

There are two options to address this:

  • Check that the model/project keys actually return a dict prior to using them
  • Update the config loader to disable the "default to False" behaviour if requested.

I think both options should be implemented to cover both cases.
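The second option could be sketched as follows. This is a minimal illustration, assuming a dict-like configuration object; the `strict` flag and class name are assumptions, not Axiom's actual API:

```python
class Config(dict):
    """Dict-like config whose missing keys default to False unless strict."""

    def __init__(self, data=None, strict=False):
        super().__init__(data or {})
        self.strict = strict  # hypothetical flag to disable the False default

    def __missing__(self, key):
        # dict.__getitem__ calls __missing__ for absent keys on subclasses
        if self.strict:
            raise KeyError(f'Missing configuration key: {key}')
        return False


cfg = Config({'models': {'ccam': {}}})
assert cfg['nonexistent_feature'] is False  # rapid-toggle behaviour preserved

strict_cfg = Config({'models': {'ccam': {}}}, strict=True)
# strict_cfg['nonexistent_feature'] would now raise KeyError immediately,
# rather than letting a False value trigger an error further downstream.

# Option 1 from above: verify the key actually resolves to a dict before use.
assert isinstance(cfg['models'], dict), 'models key did not resolve to a dict'
```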

Command to generate user configuration

As per discussions with Jake, the ability to automatically generate a user's .axiom directory would be useful.

This will probably be a CLI option like axiom drs_generate_user_config or similar.
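A minimal sketch of such a helper, assuming the directory layout described elsewhere in these issues (a `.axiom` directory holding `drs.json`); the stub contents are a placeholder, not the real defaults:

```python
import json
from pathlib import Path


def drs_generate_user_config(home=None):
    """Create the user's .axiom directory with a stub drs.json, if absent."""
    home = Path(home) if home is not None else Path.home()
    axiom_dir = home / '.axiom'
    axiom_dir.mkdir(parents=True, exist_ok=True)
    config_path = axiom_dir / 'drs.json'
    if not config_path.exists():
        # Stub only: the real default configuration is not known here.
        config_path.write_text(json.dumps({}, indent=2))
    return config_path
```

The function is idempotent, so wiring it to a CLI subcommand would be safe to run repeatedly.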

Reactive walltime and resubmission

The DRS subsystem should be able to estimate, based on the last processed variable, how long the next will take, and resubmit itself if there isn't sufficient walltime remaining (plus a threshold for PBS).

This will require inspecting the PBS environment to capture the submission details as well as implementing checks to see if a file has already been written.
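PBS does not export remaining walltime directly; it would need to be derived (e.g. from the requested walltime minus elapsed time, or by querying the job via qstat). The decision logic itself is simple; a sketch, where the function name and the naive next-variable estimate are assumptions:

```python
def should_resubmit(last_duration_s, walltime_remaining_s, threshold_s=300):
    """Resubmit when the projected next-variable runtime won't fit.

    Naive estimate: the next variable takes as long as the last one did.
    threshold_s is a safety margin for PBS teardown/queueing overheads.
    """
    return walltime_remaining_s < last_duration_s + threshold_s
```

Resubmission itself would then be a subprocess call to qsub using the captured submission details, combined with the check for already-written files so the rerun skips completed work.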

Filename variable filtering causes issue with variables followed by dot

Filename variable filtering requires variables to be delimited by underscores. Variables followed by a dot (e.g. a variable just before the file extension) are not identified correctly, as matching continues until the next underscore or the end of the file name. Adding additional delimiters would be preferable.
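A sketch of the proposed fix, tokenising on both delimiters (the function name and variable set are illustrative, not Axiom's actual code):

```python
import re


def extract_filename_variable(filename, known_variables):
    """Tokenise a filename on underscores AND dots, then return the first
    token that matches a known variable name."""
    tokens = re.split(r'[_.]', filename)
    for token in tokens:
        if token in known_variables:
            return token
    return None


# With underscore-only splitting, a variable just before the extension
# becomes the token 'pr.nc' and is never matched; '[_.]' fixes this.
assert extract_filename_variable('output_day_pr.nc', {'pr'}) == 'pr'
```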

Provide a preset Axiom conda environment

A preset conda environment with all packages and modules needed to run Axiom would make the installation more streamlined and would eliminate issues caused by missing packages or package contents.
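Such an environment file might look like the following sketch. The package list is an assumption pieced together from dependencies mentioned elsewhere in these issues (xarray/dask for processing, netCDF backends), not an authoritative manifest:

```yaml
# environment.yml (sketch) -- package list is illustrative, not authoritative
name: axiom
channels:
  - conda-forge
dependencies:
  - python>=3.9
  - xarray
  - dask
  - distributed
  - netcdf4
  - h5netcdf
  - pip
```

Users would then run `conda env create -f environment.yml` and install Axiom into the resulting environment.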

Daily data has incorrect timesteps

Data converted to daily from hourly appears to have repeating time steps and incorrect dates.

This is using data found here: /g/data/xv83/mxt599/ccam_era5_aus-11i_12.5km_coupled/drs_cordex/CORDEX-CMIP6/output/AUS-11i/CSIRO/ECMWF-ERA5/historical/r1i1p1f1/CSIRO-CCAM-2204/v1/day/pr

CDO output shows:
$ cdo showdate pr_AUS-11i_ECMWF-ERA5_historical_r1i1p1f1_CSIRO-CCAM-2204_v1_day_19870101-19871231.nc
1987-01-15 1987-02-14 1987-03-15 1987-04-15 1987-05-15 1987-06-15 1987-07-15 1987-08-15 1987-09-15 1987-10-15 1987-11-15 1987-12-15
cdo showdate: Processed 1 variable over 365 timesteps [0.03s 33MB].

fx files to go into their own directory

Currently, the fx variables are written under the frequency directory. A conditional needs to be added to switch the directory to fx, checking for existing files first.

Also remove the time information from the filename.

FYI @tha051.

Investigate speeding up netcdf writes.

Currently, the hourly data computes in around 30-60s per file, but then takes a long time to write out the netCDF. This is more or less expected given the 60-70GB output file per variable. Total processing is < 10 min per file in most cases.

I've tried a number of approaches to speed up the write:

  • Various options to persist/load/compute etc. before the write.
  • Writing to JOBFS then moving to the destination
  • Writing to Ramdisk then moving.

The last option seems the most promising, but when RAM is shared with the scheduler it fills up and crashes pretty quickly.

Just raising this issue for future investigation if anyone is interested in taking a look.
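For reference, the JOBFS option above could be sketched like this. The helper name is hypothetical, and the writer callable stands in for the actual netCDF write (e.g. an xarray `to_netcdf` call):

```python
import os
import shutil
import tempfile


def write_then_move(write_fn, dest_path, scratch_dir=None):
    """Write a large file to fast local storage, then move it into place.

    write_fn: a callable taking the output path, e.g.
              lambda p: ds.to_netcdf(p)  (xarray assumed).
    scratch_dir: fast local storage; on PBS systems this could be
                 os.environ.get('PBS_JOBFS').
    """
    scratch_dir = scratch_dir or tempfile.gettempdir()
    tmp_path = os.path.join(scratch_dir, os.path.basename(dest_path))
    write_fn(tmp_path)
    shutil.move(tmp_path, dest_path)  # readers never observe a partial file
    return dest_path
```

A side benefit is that the destination only ever contains complete files, which simplifies the "has this file already been written" checks mentioned in other issues.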

Installation fails - data directory not created

Attempted to install axiom and axiom-schemas, but dummy test failed:

[]$ axiom -h
Traceback (most recent call last):
  File "/$HOME/.local/bin/axiom", line 211, in <module>
    parser = get_parser()
  File "/$HOME/.local/bin/axiom", line 201, in get_parser
    parser_drs = ad.get_parser(parent=subparsers)
  File "/$HOME/.local/lib/python3.9/site-packages/axiom/drs/__init__.py", line 24, in get_parser
    config = au.load_package_data('data/drs.json')
  File "/$HOME/.local/lib/python3.9/site-packages/axiom/utilities.py", line 299, in load_package_data
    raw = pkgutil.get_data(package_name, slug)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/pkgutil.py", line 639, in get_data
    return loader.get_data(resource_name)
  File "<frozen importlib._bootstrap_external>", line 1039, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/$HOME/.local/lib/python3.9/site-packages/axiom/data/drs.json'
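A common cause of this error is package data not being included in the installed distribution. A configuration sketch of the setuptools fix that would ship axiom/data/*.json with the package (a guess at the cause, not a confirmed diagnosis):

```python
# setup.py (configuration sketch, not meant to be executed here)
from setuptools import setup, find_packages

setup(
    name='axiom',
    packages=find_packages(),
    include_package_data=True,               # honour MANIFEST.in entries
    package_data={'axiom': ['data/*.json']}, # install axiom/data/drs.json etc.
)
```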

Test DRS on CCAM data.

Inputs:

/datastore/raf018/CCAM/WINE/ACCESS1-0/50km/cordex

Outputs:

/datastore/raf018/CCAM/WINE/ACCESS1-0/50km/DRS/CORDEX/

Integrate DRS with Axiom CLI

This is now handled via GitHub Actions on the dev branch. All tests can be added to the test directory (follow the example). Please run all tests and verify that they pass BEFORE pushing.

Aliasing sub package

One of the issues we are facing is differences in some variable names in the input data, for example rlat0, rlong0, etc.

The current approach to remap these to standard variable names (lat, lon in this case) is to use a preprocessor. However, I can see that this is going to be an issue going forward as we start to ingest more data from different models; we don't necessarily want to write a preprocessor for each and every model.

I propose that we develop a small sub-package that can maintain a standard mapping, at least for coordinate variables, that can do an automatic aliasing of common variable names to standard names.

This functionality would have to be limited at first, mapping variants of certain coordinate spellings etc. to something standard and conditionally renaming them if need be.
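A minimal sketch of such a mapping, where the alias table entries beyond rlat0/rlong0 are assumptions, and applying the result via `ds.rename(...)` assumes xarray:

```python
# Hypothetical alias table; rlat0/rlong0 are the variants named in this issue.
COORDINATE_ALIASES = {
    'rlat0': 'lat', 'rlat': 'lat', 'latitude': 'lat',
    'rlong0': 'lon', 'rlon': 'lon', 'longitude': 'lon',
}


def build_rename_map(names, aliases=COORDINATE_ALIASES):
    """Return only the names that need renaming, suitable for ds.rename(...)."""
    return {name: aliases[name] for name in names if name in aliases}


rename_map = build_rename_map(['rlat0', 'rlong0', 'pr'])
assert rename_map == {'rlat0': 'lat', 'rlong0': 'lon'}
```

Keeping the table in one sub-package means new variants are a one-line addition rather than a new preprocessor.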

Allow DRS processing to work on batches of variables.

Currently, DRS processing will work through all configured variables for a single processing job. For daily data this can be time-consuming.

In a small test, an individual variable/domain/year took on the order of 8 minutes for daily data, which would put ~200 variables at over 24 hrs at the default parallel settings.

This speed constraint can be overcome by either:
a) Scaling horizontally over more cores, which is limited by the stability of the HPC running Dask.
b) Splitting the variables into batches and running those in parallel.

Axiom is already configured to run either all configured variables or the user-provided variables; it is just a matter of splitting them up on the command line somehow.
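The splitting itself (option b) is straightforward; a sketch, where the function name and variable names are illustrative:

```python
def batch_variables(variables, batch_size):
    """Split the configured variable list into batches for separate jobs."""
    return [variables[i:i + batch_size]
            for i in range(0, len(variables), batch_size)]


batches = batch_variables(['pr', 'tas', 'uas', 'vas', 'huss'], 2)
assert batches == [['pr', 'tas'], ['uas', 'vas'], ['huss']]
# Each batch could then be passed as the user-provided variable list of a
# separate processing job, e.g. one PBS submission per batch.
```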

Update documentation

Now that RTD is set up, we need to update the documentation as it is outdated and still points to the original CRE location.

Some additional pages:

  • Anatomy of a Payload
  • DRS configuration
  • Example DRS workflow

Move to Jinja templating

Some elements of processing are becoming unwieldy with standard string interpolation.

We may need to move the templating over to use Jinja.
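As a flavour of what Jinja buys us over plain interpolation, conditionals and loops live inside the template itself. A sketch (requires jinja2; the template and field names are illustrative, not Axiom's actual templates):

```python
from jinja2 import Template  # third-party: pip install jinja2

# An optional component like 'version' needs awkward branching with plain
# string interpolation; Jinja expresses it inline.
template = Template(
    'pr_{{ domain }}_{{ model }}{% if version %}_{{ version }}{% endif %}.nc'
)
filename = template.render(domain='AUS-11i', model='CCAM', version='v1')
assert filename == 'pr_AUS-11i_CCAM_v1.nc'
```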

Provide option for NetCDF-3 processing

Currently, Axiom is unable to process NetCDF version 3 files, as it attempts to read the data with h5netcdf, which only works on NetCDF version 4 files. To work around this issue for now, files can be converted from NetCDF-3 to NetCDF-4.
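One way to handle this automatically might be to pick the reader backend from the file's magic bytes, since NetCDF-3 and NetCDF-4 files are distinguishable from their first bytes alone. A sketch (the engine names match those accepted by xarray's `open_dataset`, but wiring this into Axiom is an assumption):

```python
def detect_netcdf_engine(path):
    """Choose a reader backend from the file's leading magic bytes.

    NetCDF-3 (classic) files begin with b'CDF'; NetCDF-4 files are HDF5
    containers beginning with b'\x89HDF'. h5netcdf handles only the latter.
    """
    with open(path, 'rb') as f:
        magic = f.read(4)
    if magic.startswith(b'CDF'):
        return 'scipy'      # the scipy and netcdf4 engines both read classic files
    if magic == b'\x89HDF':
        return 'h5netcdf'
    raise ValueError(f'Not a recognised NetCDF file: {path!r}')
```

Usage would then be along the lines of `xr.open_dataset(path, engine=detect_netcdf_engine(path))`.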

Add ability to process multiple levels at the same time.

As per the 20220315 meeting, some variables will come in on multiple levels and will go out on multiple levels.

The current approach will process a single level and single variable - we can set this up to vectorise the operations automatically, but we will still need to be careful with the number of files passing through the system.

AttributeError: 'datetime.timedelta' object has no attribute 'astype'

File "/scratch/xv83/at2708/miniconda3/envs/conda_axiom/lib/python3.10/site-packages/axiom/drs/utilities.py", line 465, in detect_input_frequency
total_seconds = (ds.time.data[1] - ds.time.data[0]).astype('timedelta64[s]').astype(np.int32)
AttributeError: 'datetime.timedelta' object has no attribute 'astype'

solution:

import numpy as np
import pandas as pd

diff = ds.time.data[1] - ds.time.data[0]
total_seconds = pd.to_timedelta([diff]).astype('timedelta64[s]')[0].astype(np.int32)

Fix tests in CI

The last update broke the CI; a change in the API was not reflected in the tests, as it was a legacy portion of the code.

Create list of recoverable errors for DRS

Transient dask/distributed errors can result in rare exceptions when processing. In almost every case this can be resolved by simply re-running the processing task. However, it would be useful if there were a known lookup of error messages that could be evaluated to safely and automatically rerun the particular DRS processing task.
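A sketch of such a lookup and retry wrapper. The error-message fragments below are illustrative placeholders for whatever transient messages are actually observed, not a vetted list:

```python
# Hypothetical lookup of message fragments considered transient/recoverable.
RECOVERABLE_ERRORS = (
    'CommClosedError',              # illustrative fragments only
    'Connection to scheduler lost',
)


def is_recoverable(exc):
    message = str(exc)
    return any(fragment in message for fragment in RECOVERABLE_ERRORS)


def run_with_retries(task, max_retries=2):
    """Re-run a processing task when it fails with a known-transient error."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if not is_recoverable(exc) or attempt == max_retries:
                raise  # unknown or persistent errors still surface normally
```

Unknown errors propagate immediately, so genuine bugs are never silently retried.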

Move schema directive into payloads.

For various legacy reasons, the schema directive (i.e. what schema gets applied) is configured in the user drs.json configuration file. This is fine when always processing the same schema.

It would be better to select the schema from the incoming payload, or at the very least make it so that it can be used as an override.
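The override behaviour could be as simple as the following sketch, where the 'schema' key name is an assumption for illustration:

```python
def resolve_schema(payload, user_config):
    """Prefer the schema named in the payload, falling back to drs.json."""
    schema = payload.get('schema') or user_config.get('schema')
    if schema is None:
        raise ValueError('No schema specified in payload or configuration')
    return schema


# The payload wins over the user configuration, acting as an override.
assert resolve_schema({'schema': 'cordex-cmip6'}, {'schema': 'cordex'}) == 'cordex-cmip6'
assert resolve_schema({}, {'schema': 'cordex'}) == 'cordex'
```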

Update single instance DRS cli to reflect new changes.

Significant changes to the DRS subsystem are underway for the JSON consumption approach for CCAM, which will require changes to the single-instance axiom drs command.

Rather than packing it all into a single commit I'll raise this as another issue/task to be completed once the JSON consumption task is complete.

Package for pip

As Axiom matures we will need to be able to install it via pip.
The same will need to be done for axiom-schemas.

Add interactive flag to drs_launch

For debugging purposes, it would be useful to add an interactive flag to the drs_launch command, to be used in conjunction with the dump flag.

Add configuration option for pre-processing

Axiom appears to auto-detect "CCAM" in the file name and initiates the CCAM pre-processor. The pre-processor causes errors when processing data files with missing metadata/attributes (e.g. missing attributes rlon0, rlat0).

Removing "CCAM" from the file name prevents the pre-processing.
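The requested configuration option could look like this sketch, where the 'preprocessor' key name is an assumption; an explicit entry would take precedence over filename auto-detection, and a falsy value would disable pre-processing entirely:

```python
def select_preprocessor(filename, config):
    """Prefer an explicit configuration entry over filename auto-detection."""
    if 'preprocessor' in config:
        return config['preprocessor'] or None  # falsy value disables it
    if 'CCAM' in filename:
        return 'ccam'  # current auto-detection behaviour
    return None


assert select_preprocessor('CCAM_output.nc', {}) == 'ccam'
assert select_preprocessor('CCAM_output.nc', {'preprocessor': None}) is None
assert select_preprocessor('CCAM_output.nc', {'preprocessor': 'other'}) == 'other'
```

This removes the need to rename files just to avoid the auto-detection.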

Extract schemas into their own repository

As the title says: the schemas are to be removed from this repository and moved into their own, so that a schema update does not trigger an update of the utility.
