
daops's Introduction

daops - data-aware operations


The daops library (pronounced "day-ops") provides a Python interface to a set of operations suitable for working with climate simulation outputs. It is typically used with ESGF datasets that are described in NetCDF files. daops is unique in that it accesses a store of fixes defined for datasets that are irregular when compared with others in their population.

When a daops operation, such as subset, is requested, the library will look up a database of known fixes before performing any calculations or transformations. The data will be loaded and fixed using the xarray library before any actual operations are sent to its sister library clisops.

Features

The package has the following features:

  • Ability to run data-reduction operations on large climate data sets.
  • Knowledge of irregularities/anomalies in some climate data sets.
  • Ability to apply fixes to those data sets before operating on them. This process is called normalisation of the data sets.

Credits

This package was created with Cookiecutter and the cedadev/cookiecutter-pypackage project template.


daops's People

Contributors

agstephens, alaniwi, cehbrecht, charlesgauthier-udm, dependabot[bot], ellesmith88, huard, jhaigh0, zeitsperre


daops's Issues

Could decadal fixes derive leadtime values on-the-fly?

  • daops version: branch: decadal-fixes
  • Python version: all
  • Operating System: all

Description

We have a system for generating fixes for adding the lead time variable, which uses:

https://github.com/roocs/daops/blob/decadal_fixes/daops/data_utils/coord_utils.py#L47-L76

For each fix, we include a string that includes a list of values, e.g.:

            "fix_id": "AddCoordFix",
            "operands": {
                "var_id": "leadtime",
                "value": "15,45,74,105,135,166,196,227,258,288,319,349,380,410,439,470,500,531,561,592,623,653,684,714,745,775,804,835,865,896,926,957,988,1018,1049,1079,1110,
1140,1169,1200,1231,1262,1292,1323,1354,1384,1415,1445,1476,1506,1535,1566,1596,1627,1657,1688,1719,1749,1780,1810,1841,1871,1900,1931,1961,1992,2022,2053,2084,2114,2145,2175,
2206,2236,2265,2296,2326,2357,2387,2418,2449,2479,2510,2540,2571,2601,2630,2661,2692,2723,2753,2784,2815,2845,2876,2906,2937,2967,2996,3027,3057,3088,3118,3149,3180,3210,3241,
3271,3302,3332,3361,3392,3422,3453,3483,3514,3545,3575,3606,3636,3667,3697",
                "dim": [
                    "time"
                ],
                "dtype": "float64",
                "attrs": {
                    "long_name": "Time elapsed since the start of the forecast",
                    "standard_name": "forecast_period",
                    "units": "days"
                },
                "encoding": {
                    "dtype": "double"
                }
            },
            "source": {
                "name": "ceda",
                "version": "",
                "comments": "",
                "url": "https://github.com/cp4cds/c3s34g_master/tree/master/Decadal"
            }

An alternative would be to encode a rule that tells the fix function to look up the required values and add them to the new coordinate variable. Instead of value being set to a list of values, it could be some kind of rule, such as:

"value": "derive: daops.data_utils.time_utils._get_lead_times"

@ellesmith88 This might be overkill but it is probably worth a discussion.

Create a "roocs-utils" package

daops and dachar should use the same common root dirs.

We should remove root_dir from the operation function parameters in daops, but it should be possible to override their values somewhere, e.g. in a Python object or an environment variable.

daops needs clisops:: all
dachar needs clisops:: general xarray utils
daops needs dachar:: root dirs

roocs-utils - must be lightweight and have no dependencies except xarray

Add unit tests for command-line

Update daops to interrogate the Fixes index for dataset fixes

Pre-requisites:

  • Fix index exists and has public read-only access
    • which implies: other indexes exist and are working with dachar.

Task:

  1. Update daops so that normalise will look up whether fixes need to be applied to each dataset and will then apply them.
  2. Requires integration with the "Fixer" class, and updating that class to query the Fix Index by dataset id.

daops production error: ZeroDivisionError: float divmod() - with tiny (lat, lon) box

  • daops version: production
  • Python version: 3.7
  • Operating System: Centos7

Description

The error logs have shown that this request fails:

from daops.ops.subset import subset

inputs = {
  'collection': 'c3s-cmip6.ScenarioMIP.NCC.NorESM2-MM.ssp245.r1i1p1f1.day.tasmax.gn.v20191108',
  'area': (8.37, 39.12, 8.56, 39.26),
  'level': None,
  'time': ('2006-01-01T00:00:00', '2099-12-30T00:00:00'),
  'output_type': 'netcdf',
  'output_dir': '.',
  'split_method': 'time:auto',
  'file_namer': 'standard'
}

resp = subset(**inputs)

The error we are seeing is:

/usr/local/Miniconda3-py39_4.9.2-Linux-x86_64/envs/rook/lib/python3.7/site-packages/clisops/core/subset.py: found within input date time range. Defaulting to minimum time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)
/usr/local/Miniconda3-py39_4.9.2-Linux-x86_64/envs/rook/lib/python3.7/site-packages/clisops/core/subset.py:een nudged to nearest valid time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)
ZeroDivisionError: float divmod()

My first guess is that the lat and lon selection is coming back with no data because the range is too small - that doesn't actually trigger an exception in xarray, so it is the subsequent processing that surfaces the error.
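One way to confirm that guess (a rough sketch, not the daops code; the file glob is illustrative) is to count how many grid points fall inside the requested box before subsetting:

import xarray as xr

ds = xr.open_mfdataset("tasmax_day_NorESM2-MM_ssp245_r1i1p1f1_gn_*.nc", use_cftime=True)

lon0, lat0, lon1, lat1 = 8.37, 39.12, 8.56, 39.26
n_lon = int(((ds.lon >= lon0) & (ds.lon <= lon1)).sum())
n_lat = int(((ds.lat >= lat0) & (ds.lat <= lat1)).sum())
print(n_lon, n_lat)  # zero in either direction would mean the box falls between grid points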

@ellesmith88 please take a look. Thanks

Order of elements of `area` needed for docstrings

New intake catalog approach could speed up `consolidate` step

  • daops version: master
  • Python version: 3.7
  • Operating System: all

Description

In our consolidate step, we read the datasets and decide which are in the requested time range, by opening them all with xarray:

https://github.com/roocs/daops/blob/master/daops/utils/consolidate.py#L42-L69

In our new intake catalog approach, we have the time information for each file directly accessible. We could allow daops to look up an intake catalog (if we can work out a clean way to make this connection).

This would speed things up.
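A rough sketch of the idea, assuming the catalog exposes per-file start/end times as columns (the catalog path and column names are illustrative):

import intake

cat = intake.open_esm_datastore("/path/to/c3s-cmip6-catalog.json")  # illustrative path
df = cat.df

# keep only files overlapping the requested time range, without opening any NetCDF files
start, end = "2006-01-01", "2099-12-30"
files = df[(df["start_time"] <= end) & (df["end_time"] >= start)]["path"].tolist()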

Tagging: @cehbrecht @ellesmith88

daops: subset-by-point

Extend the subset operation to support the proposed extension in: roocs/34e-mngmt#105

Key issues:

  • level: allow x1/x2 and x1,x2,x3,x4
  • time: allow x1/x2 and x1,x2,x3,x4
  • year, month, day: add these arguments as options instead of time - either one or the other - if both: default to use time
  • how to distinguish between a range of (<start>, <end>) and a sequence of (<value1>, <value2>) - we need a way for our parser to know the difference. Maybe work from rook downwards: rook will know whether it is a range or a sequence, so maybe the range should be a special object rather than a tuple (see the sketch below).
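For illustration only (not the actual roocs parser), a range could be marked with a dedicated type so that a plain string or tuple is always treated as a sequence:

class Interval:
    """Marker type for a range (start/end), as opposed to a plain sequence of values."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


def parse_time(value):
    # "x1/x2" means a range; "x1,x2,x3" (or a tuple/list) means an explicit sequence
    if isinstance(value, Interval):
        return "range", (value.start, value.end)
    if isinstance(value, str) and "/" in value:
        start, end = value.split("/")
        return "range", (start, end)
    if isinstance(value, str):
        return "series", tuple(value.split(","))
    return "series", tuple(value)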

Refactor consolidate to use roocs-utils functions

Description

Once the roocs-utils functions have been written:

# daops.utils.consolidate (sketch)
import os
import roocs_utils.utils.project_utils

def _consolidate_dset(dset):
    if dset.startswith("http"):
        raise Exception("...not supported (yet)...")
    if os.path.isfile(dset):
        return dset
    if dset.count(".") > 6:
        base_dir = roocs_utils.utils.project_utils.get_project_base_dir(dset)  # id handling left incomplete here
    if os.path.isdir(dset):
        return os.path.join(dset, "*.nc")
    raise Exception(f"No idea what this input is: {dset}")

Support for subsetting to a list of coordinate values

  • should we support subsetting to a specific list of coordinate values?
  • E.g.: get levels 850, 500, 200 and 50 hPa from a much longer list
  • Or could we provide that via a separate "extract" process/operation?
    • Does xarray correctly support this "picker" functionality? (see the sketch below)
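For reference, a minimal xarray sketch of the "picker" behaviour (the file, variable and coordinate names are illustrative):

import xarray as xr

ds = xr.open_dataset("ta_Amon_example.nc")  # illustrative file

# pick specific pressure levels (Pa) rather than a contiguous range
levels = [85000, 50000, 20000, 5000]
picked = ds["ta"].sel(plev=levels)                            # exact matches only
picked_nearest = ds["ta"].sel(plev=levels, method="nearest")  # nearest available levels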

Fixing the file paths list in ResultSet

import collections
import os


class ResultSet(object):
    def __init__(self, inputs=None):
        self._results = collections.OrderedDict()
        self.metadata = {"inputs": inputs, "process": "something", "version": 0.1}
        self.file_paths = []

    def add(self, dset, result):
        self._results[dset] = result

        for item in result:
            if isinstance(item, str) and os.path.isfile(item):
                self.file_paths.append(item)

regrid - consult CDS input

  • daops version: *
  • Python version: *
  • Operating System: Linux

Description

ECMWF have done classifications of variables into their required regridding types:

  • linear interpolation
  • nearest neighbour
  • mass-conserving

They can provide this.

They describe the regridding problem as a sparse matrix calculation in which a set of weights is applied. Once the matrices are pre-computed, the computation is efficient.
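For illustration (this is not the ECMWF/CDS code), applying precomputed weights is a sparse matrix-vector product over the flattened source and target grids:

import numpy as np
from scipy import sparse

n_src, n_tgt = 96 * 144, 180 * 360  # illustrative grid sizes
weights = sparse.random(n_tgt, n_src, density=1e-4, format="csr")  # stand-in for real weights

src_field = np.random.rand(96, 144)
# regridding reduces to one sparse multiply once the weights have been computed
tgt_field = (weights @ src_field.ravel()).reshape(180, 360)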

release 0.1.0?

  • daops version:
  • Python version:
  • Operating System:

Description

@agstephens @ellesmith88 can we make a 0.1.0 release with a reference to clisops 0.1.0? After that we can move to the xclim subset module integrated in clisops.

I can make the release ... but probably need permissions.

See also: roocs/clisops#1

Replicate content between ES indexes

To get the mapping, just use the <index_name>/_mapping endpoint (e.g. https://es14.ceda.ac.uk:9200/c3s-roocs-fix-prop/_mapping). It is worth paring this down, as you will get all the default stuff in there too; you only need the mappings which are non-standard.

Loading is as simple as:

from elasticsearch import Elasticsearch
import json

with open('mapping_file.json') as reader:
    mapping = json.load(reader)

index_name = 'index_name'

es = Elasticsearch()
if not es.indices.exists(index_name):
    es.indices.create(index_name, body=mapping)

You can do a cross-cluster re-index to copy the data across:

https://elasticsearch-py.readthedocs.io/en/v7.11.0/helpers.html#reindex
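For example, with the elasticsearch-py helper (the target host is a placeholder):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex

source_es = Elasticsearch(["https://es14.ceda.ac.uk:9200"])
target_es = Elasticsearch(["https://target-cluster:9200"])  # placeholder host

# copy all documents from the source index into the same-named index on the target cluster
reindex(source_es, source_index="c3s-roocs-fix-prop",
        target_index="c3s-roocs-fix-prop", target_client=target_es)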

Note: CEDA public end-point is: https://elasticsearch.ceda.ac.uk/c3s-roocs-fix-prop/_mapping

Example Search with no body specified:
https://elasticsearch.ceda.ac.uk/c3s-roocs-fix-prop/_search

Get xarray aggregation tests working

  • daops version: *
  • Python version: 3.7+
  • Operating System: linux

Description

@ellesmith88: I have created the following unit test module:

https://github.com/roocs/daops/blob/master/tests/test_xarray/test_xarray_aggregation.py

Most of it is in the form of skeleton code/stubs. Please can you get it working as a valid unit test.

The purpose of it is to ensure that we have tested the normal behaviour of xarray.open_mfdataset() - just to make sure that our assumptions throughout roocs are appropriate.
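For context, a minimal self-contained example of the kind of check intended (this is not the actual test module):

import numpy as np
import pandas as pd
import xarray as xr


def test_open_mfdataset_concatenates_along_time(tmp_path):
    # write two single-variable files covering consecutive years
    for year in (2000, 2001):
        times = pd.date_range(f"{year}-01-01", periods=12, freq="MS")
        ds = xr.Dataset({"tas": ("time", np.arange(12.0))}, coords={"time": times})
        ds.to_netcdf(tmp_path / f"tas_{year}.nc")

    agg = xr.open_mfdataset(str(tmp_path / "tas_*.nc"), use_cftime=True, combine="by_coords")
    assert agg.time.size == 24
    assert agg.tas.shape == (24,)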

Enable selection of variables in daops interface

Should we allow the daops interface to include the selection of variables?

Philosophically, we created daops and rook to deal with dataset identifiers, which tend to include only a single data variable (along with its metadata and coordinate variables). As we consider the wider use of roocs we find, as with the ESA CCI datasets at CEDA, that some datasets have many variables. For example, this kerchunk file links to NetCDF files that contain 204 variables!

https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json

Here is an example request to remind us of the existing interface (using the command-line daops subset approach):

daops subset --area 30,-10,65,30 --time 2000-01-01/2000-02-30 --levels "/" --time-components ""  --output-dir /tmp --file-namer simple https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json

So, should we extend the daops interface to allow specific selection of variables?

If yes, what are the options?

If we decide to support this extension, then maybe we have two options:

  1. Expand the dataset identifier so that it includes variable IDs, such as:
  • use a hash to separate the identifier (or path/URL) from a comma-separated list of variables:
https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json#toa_swup,toa_swup_clr,toa_swup_hig

So a full command might be:

daops subset \
  --area 30,-10,65,30 \
  --time 2000-01-01/2000-02-30 \
  --levels "/" \
  --time-components "" \
  --output-dir /tmp \
  --file-namer simple \
  https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json#toa_swup,toa_swup_clr,toa_swup_hig
  2. Add a new parameter, such as variables (see the sketch below):
  • variables: list of strings (or variable IDs) - DEFAULT = None (i.e. include all variables)
  • time
  • area
  • level
  • collection
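For illustration, option 2 might look like this from the Python API (the variables parameter is hypothetical and does not exist yet; the other arguments mirror the command above):

from daops.ops.subset import subset

resp = subset(
    collection="https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json",
    variables=["toa_swup", "toa_swup_clr", "toa_swup_hig"],  # hypothetical new parameter
    area="30,-10,65,30",
    time="2000-01-01/2000-02-30",
    output_dir="/tmp",
    file_namer="simple",
)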

@cehbrecht: what are your thoughts on this proposal?

Check all occurrences of `open_dataset` and `open_xr_dataset` use `roocs-utils` version of the function

  • daops version: current
  • Python version: all
  • Operating System: all

Description

We have a common set of arguments that need to be sent to either of the open functions in xarray:

xr.open_dataset(...)
xr.open_mfdataset(...)

We need to make sure that all such calls go through:

roocs_utils.xarray_utils.xarray_utils.open_xr_dataset(...)
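In other words, wherever we currently open data directly, a call like the first line below should become the second (a sketch of the intended pattern; the path is illustrative):

import xarray as xr
from roocs_utils.xarray_utils.xarray_utils import open_xr_dataset

file_paths = "/path/to/dataset/*.nc"  # illustrative

ds = xr.open_mfdataset(file_paths, use_cftime=True)  # before: arguments passed to xarray directly
ds = open_xr_dataset(file_paths)                     # after: roocs-utils applies the common arguments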

We need to review the daops, dachar and clisops code to check they are all doing this correctly. @ellesmith88, please can you take a look at this. Thanks

Process error: Cannot apply_along_axis when any iteration dimensions are 0

  • daops version: 0.5.0
  • Python version:
  • Operating System:

Description

Error on user request in production system.

See notebook, error 21, 24.03:
https://nbviewer.jupyter.org/github/roocs/rooki/blob/master/notebooks/tests/test-c3s-cmip6-subset-errors-dkrz-2021-03-23.ipynb

What I Did

Run:

wf = ops.Subset(
        ops.Input(
            'tos', ['c3s-cmip6.ScenarioMIP.CNRM-CERFACS.CNRM-CM6-1.ssp245.r1i1p1f2.Omon.tos.gn.v20190219']
        ),
        # time="2021-01-01/2050-12-31",
        area="1,40,2,4"
)
resp = wf.orchestrate()
resp.status

Error:

Process error: Cannot apply_along_axis when any iteration dimensions are 0

Traceback (most recent call last):
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/rook/director/director.py", line 156, in process
    file_uris = runner(self.inputs)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/rook/utils/subset_utils.py", line 5, in run_subset
    result = subset(**args)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/daops/ops/subset.py", line 77, in subset
    result_set = Subset(**locals()).calculate()
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/daops/ops/base.py", line 88, in calculate
    process(self.get_operation_callable(), norm_collection, **self.params),
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/daops/processor.py", line 19, in process
    result = operation(dset, **kwargs)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/ops/subset.py", line 165, in subset
    return op.process()
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/ops/base_operation.py", line 89, in process
    processed_ds = self._calculate()
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/ops/subset.py", line 63, in _calculate
    result = subset_bbox(ds, **self.params)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/core/subset.py", line 251, in func_checker
    return func(*args, **kwargs)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/core/subset.py", line 875, in subset_bbox
    da[var] = da[var].where(lon_cond & lat_cond, drop=True)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/common.py", line 1273, in where
    return ops.where_method(self, cond, other)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/ops.py", line 203, in where_method
    keep_attrs=True,
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 1134, in apply_ufunc
    keep_attrs=keep_attrs,
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 271, in apply_dataarray_vfunc
    result_var = func(*data_vars)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 632, in apply_variable_ufunc
    for arg, core_dims in zip(args, signature.input_core_dims)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 632, in <listcomp>
    for arg, core_dims in zip(args, signature.input_core_dims)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 542, in broadcast_compat_data
    data = variable.data
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/variable.py", line 374, in data
    return self.values
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/variable.py", line 554, in values
    return _as_array_or_item(self._data)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/variable.py", line 287, in _as_array_or_item
    data = np.asarray(data)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 693, in __array__
    self._ensure_cached()
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 690, in _ensure_cached
    self.array = NumpyIndexingAdapter(np.asarray(self.array))
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 663, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 568, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 86, in __getitem__
    key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 853, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)

In `check_result(...)` - check that data is not all missing

Hi @alaniwi,

In the check_result(...) function, here, ...

https://github.com/roocs/daops/blob/test_data_pools_new/tests/data_pools_checks/run_data_pools_checks.py#L619

Please can you add a check on the entire output array to assert that it is not all NaNs or fill values, e.g. using something like numpy.isnan or the equivalent in xarray.

This would help us spot cases where a subsetting operation has gone wrong and returned an xarray.Dataset with no valid data in the array.
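For example (a rough sketch; the file and variable names are illustrative):

import xarray as xr

ds = xr.open_dataset("output_001.nc")  # illustrative output file from the subset operation
da = ds["tas"]                         # illustrative variable name

# fail the check if the subsetted array contains no valid data at all
assert not bool(da.isnull().all()), "output array is all NaN/fill values"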

consolidate expects file path with DRS folder structure

  • daops version: 0.3.0
  • Python version:
  • Operating System:

Description

The consolidate function works for a dataset id:

parameterise(
collection='c3s-cmip5.output1.ICHEC.EC-EARTH.historical.day.atmos.day.r1i1p1.tas.latest', 
time='1850/1855')

... and files with DRS folder structure:

parameterise(
collection='/data/c3s-cmip5/output1/ICHEC/EC-EARTH/historical/day/atmos/day/r1i1p1/tas/latest/tas_day_EC-EARTH_historical_r1i1p1_18500101-18591231.nc', 
time='1850/1855')

But it fails when only the file name is given without the DRS folders:

parameterise(
collection='/data/tas_day_EC-EARTH_historical_r1i1p1_18500101-18591231.nc', 
time='1850/1855')

Error message:

p = parameterise(
collection='/data/tas_day_EC-EARTH_historical_r1i1p1_18500101-18591231.nc', 
time='1850/1855')
result = subset(**p)

ValueError: max() arg is an empty sequence

The results of subset only have the filename but not the DRS folder structure:

p = parameterise(collection='c3s-cmip5.output1.ICHEC.EC-EARTH.historical.day.atmos.day.r1i1p1.tas.latest', time='1850/1855')
result = subset(**p)
result.file_paths
Out[9]: ['./tas_day_EC-EARTH_historical_r1i1p1_18500101-18551229.nc']

When we chain the subset operators, the second subset operation will fail on the output generated by the first one.

What I Did

Running in ipython:

In [1]: from daops.ops.subset import subset

In [2]: from roocs_utils.parameter import parameterise

In [3]: p = parameterise(collection='/Users/pingu/tmp/data/tas_day_EC-EARTH_historical_r1i1p1_18500101-18591231.nc', time='1850/1855')

In [4]: result = subset(**p)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-b4d4a92f06bc> in <module>
----> 1 result = subset(**p)

~/miniconda3/envs/rook/lib/python3.7/site-packages/daops/ops/subset.py in subset(collection, time, area, level, output_dir, output_type, split_method, file_namer)
     59 
     60     collection = consolidate.consolidate(
---> 61         parameters.get("collection"), time=parameters.get("time")
     62     )
     63 

~/miniconda3/envs/rook/lib/python3.7/site-packages/daops/utils/consolidate.py in consolidate(collection, **kwargs)
     79         # convert dset to ds_id to work with elasticsearch index
     80         if not dset.count(".") > 6:
---> 81             dset = convert_to_ds_id(dset)
     82 
     83         if "time" in kwargs:

~/miniconda3/envs/rook/lib/python3.7/site-packages/daops/utils/consolidate.py in convert_to_ds_id(dset)
     48     elif os.path.isfile(dset) or dset.endswith(".nc"):
     49         dset = dset.split("/")
---> 50         i = max(loc for loc, val in enumerate(dset) if val.lower() in projects)
     51         ds_id = ".".join(dset[i:-1])
     52         return ds_id

ValueError: max() arg is an empty sequence

Release for 0.10.1

We're getting ready to open a rook PR adding the average_shape process, and we'd need a daops release with the latest PR supporting that new operation here.

Compatible with clisops 0.12.1

Add features: CLI (subset only), Dockerfile and CWL file to the repo

@huard @cehbrecht: we are working on a branch that will allow daops to be called as a command-line utility (for subset only), for testing with the ESA Earth Observation Exploitation Platform Common Architecture (EOEPCA) framework. EOEPCA uses ADES (Application Deployment and Execution Service) to generate a WPS and allow applications to be deployed to it and run in the following way:

  • A CWL file: describing the inputs, outputs and arguments of the application
  • A Dockerfile: encapsulating the recipe for building a container to run the application (in our case, a daops/cli.py interface)

Are you happy for us to add these features to the master branch of daops? There should be no disruption to the existing components.

daops.consolidate() is inefficient with lots of files - can we provide hints in the file path mapper?

  • daops version: all
  • Python version: all
  • Operating System: all

Description

I have got daops working with another data set (not ESGF) that uses file path mappings in the config, e.g.:

[project:haduk_grid]
base_dir = {{ ceda_base_dir }}/archive/badc/ukmo-hadobs/data/insitu/MOHC/HadOBS/HadUK-Grid
file_name_template = {__derive__var_id}_hadukgrid_uk_{spatial_average}_{frequency}_{__derive__time_range}.{__derive__extension}
facet_rule = project version_major version_minor version_patch version_extra spatial_average frequency variable version
fixed_path_modifiers =
    variable:groundfrost pv rainfall sfcWind snowLying sun tas tasmin
    frequency:mon
fixed_path_mappings =
    haduk_grid.v1.0.3.0.1km.{frequency}.{variable}.v20210712:v1.0.3.0/1km/{variable}/{frequency}/v20210712/*.nc
    haduk_grid.v1.0.2.1.1km.{frequency}.{variable}.v20200731:v1.0.2.1/1km/{variable}/{frequency}/v20200731/*.nc

The following code is inefficient:

https://github.com/roocs/daops/blob/master/daops/utils/consolidate.py#L58-L87

It reads:

  1. Each file
  2. The aggregated dataset

We could try to provide hints to tell daops the date ranges in the files without having to open them. That would speed things up massively.

E.g.:

fixed_path_mappings =
    haduk_grid.v1.0.3.0.1km.{frequency}.{variable}.v20210712:v1.0.3.0/1km/{variable}/{frequency}/v20210712/*_(?P<startYYYYMM>\d{6})-(?P<endYYYYMM>\d{6}).nc

Then the code could parse the date range out of each file name using the regex and would not have to open the file(s).
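A rough sketch of how such a hint could be used (the glob path is illustrative; the regex matches the mapping above):

import glob
import re

pattern = re.compile(r"_(?P<startYYYYMM>\d{6})-(?P<endYYYYMM>\d{6})\.nc$")

file_ranges = {}
for path in glob.glob("/badc/example-archive/v20210712/*.nc"):  # illustrative glob
    match = pattern.search(path)
    if match:
        # record the time range parsed from the file name, without opening the file
        file_ranges[path] = (match.group("startYYYYMM"), match.group("endYYYYMM"))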

IndexError: list index out of range

  • daops version: 0.5.0
  • Python version:
  • Operating System:

Description

Error seen on user request on production service.

See error 41 from 26.03 in this notebook:
https://nbviewer.jupyter.org/github/roocs/rooki/blob/master/notebooks/tests/test-c3s-cmip6-subset-errors-dkrz-2021-03-23.ipynb

What I Did

Run this request:

wf = ops.Subset(
        ops.Input(
            'tas', ['c3s-cmip6.ScenarioMIP.NIMS-KMA.KACE-1-0-G.ssp245.r1i1p1f1.Amon.tas.gr.v20191217']
        ),
        time="2021-01-01/2100-12-31",
        area="-10,30,35,70"
)
resp = wf.orchestrate()
resp.status

Get into this error:

Process error: list index out of range

  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/daops/utils/consolidate.py", line 42, in consolidate
    ds = open_xr_dataset(dset)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/roocs_utils/xarray_utils/xarray_utils.py", line 33, in open_xr_dataset
    return xr.open_dataset(dset[0], use_cftime=True)
IndexError: list index out of range

Handling of large datasets

  • how to handle pre/post-processors when dataset is too big?
    • will lazy evaluation be needed?
    • will Xarray do it for free?
    • could you chain operators lazily and then do workflow.apply()

Testing integration of daops as backend for cmip6_preprocessing

Hi everyone,

I firstly wanted to say thank you for all the effort that has already been put into this framework. I would love to contribute to daops and integrate it more into my workflow.

I am maintaining cmip6_preprocessing and am very interested in migrating some of the things I fix (in a quite ad-hoc fashion for now) in a more general way over here.

My primary goal for cmip6_preprocessing is to use it with Python and the scientific pangeo stack, but I like the idea of documenting the actual problems (needing 'fixes') in a general and language-agnostic way over here. I was very impressed by the demo @agstephens gave a while ago during the CMIP6 cloud meeting and am now thinking of finally getting to work on this.

I am still really unsure how to actually contribute fixes to this repo, though. What I propose is to work my way through this using some quite simple fixes that are relatively easy to apply and are already documented in errata.

Specifically, I am currently testing this Python code, which changes some of the metadata necessary to determine the point in time where a dataset was branched off from the parent model run.

def fix_metadata_issues(ds):
    # https://errata.es-doc.org/static/view.html?uid=2f6b5963-f87e-b2df-a5b0-2f12b6b68d32
    if ds.attrs["source_id"] == "GFDL-CM4" and ds.attrs["experiment_id"] in [
        "1pctCO2",
        "abrupt-4xCO2",
        "historical",
    ]:
        ds.attrs["branch_time_in_parent"] = 91250
    # https://errata.es-doc.org/static/view.html?uid=61fb170e-91bb-4c64-8f1d-6f5e342ee421
    if ds.attrs["source_id"] == "GFDL-CM4" and ds.attrs["experiment_id"] in [
        "ssp245",
        "ssp585",
    ]:
        ds.attrs["branch_time_in_child"] = 60225
    return ds

These functions ingest an xarray.Dataset, check certain conditions in the attributes, and then overwrite attributes accordingly. I could easily split those out into dataset-specific 'fixes'.

Where exactly could I translate this into a fix within the daops framework? Very happy to start a PR (and then test the implementation from cmip6_preprocessing), but I am afraid I am still a bit unsure about the daops internals. Any pointers would be greatly appreciated.
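For comparison with the AddCoordFix entry shown in the lead-time issue above, one of these attribute changes might be expressed as a fix entry along these lines (the fix_id and operand names here are hypothetical, just to illustrate the shape):

# hypothetical fix entry, modelled on the AddCoordFix example elsewhere on this page
fix = {
    "fix_id": "GlobalAttrFix",  # hypothetical fix class name
    "operands": {
        "attrs": {"branch_time_in_parent": 91250},
    },
    "source": {
        "name": "cmip6_preprocessing",
        "url": "https://errata.es-doc.org/static/view.html?uid=2f6b5963-f87e-b2df-a5b0-2f12b6b68d32",
    },
}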

Implement fixes requested by ESMValTool team

Required steps to manually add fixes:

  • check that our data exhibits the error to be fixed
  • write the xarray fixing code in daops, with associated unit tests
  • add the fix class to dachar.fixes...., with associated unit tests if required
  • decide/agree on content of fix metadata:
    • URL to issue, or code, or both in ESMValTool repo
    • description of the fix
    • "source" property to identify that this came from ESMValTool, maybe with release version or github commit number.
  • use inventory to identify all data sets that will be affected
  • generate (by hand?) fix proposals:
    1. create fix proposal per datasets, OR
    2. create fix proposal template and list of datasets file
  • do we need some kind of QC when the fixes are being proposed?
  • publish fix proposals:
    • Start with a one-off proposal: dachar propose-fixes -p cmip6 <json_file>
    • Adapt to: dachar propose-fixes -p cmip6 --file-list=datasets_files.txt <fix_template.json>
  • process fixes as usual - to accept them

Relevant to:

ESMValGroup/ESMValCore#755

ESMValGroup/ESMValCore#787

Define API parameters for subset and average processes

Follow OGC good practice and other relevant services to define sensible inputs for spatial/temporal parameters etc.

Needs some research into existing services.

E.g. what should the time window look like?

"1999-01-01T00:00:00Z/2000-10-10T12:00:00Z"

Is our "Fixer" class approach appropriate?

Please check whether the Fixer approach that we have implemented will be flexible enough to address all issues with CMIP5, CMIP6 and CORDEX data.

Use the ESMValTool repository to review examples of fixes that are needed.

Do we need something more than:

  • pre-processor
  • post-processor
