
daops's Introduction

daops - data-aware operations


The daops library (pronounced "day-ops") provides a Python interface to a set of operations suitable for working with climate simulation outputs. It is typically used with ESGF datasets that are described in NetCDF files. daops is unique in that it accesses a store of fixes defined for datasets that are irregular when compared with others in their population.

When a daops operation, such as subset, is requested, the library will look up a database of known fixes before performing any calculations or transformations. The data will be loaded and fixed using the xarray library before any actual operations are sent to its sister library clisops.

Features

The package has the following features:

  • Ability to run data-reduction operations on large climate data sets.
  • Knowledge of irregularities/anomalies in some climate data sets.
  • Ability to apply fixes to those data sets before operating on them. This process is called normalisation of the data sets.

Credits

This package was created with Cookiecutter and the cedadev/cookiecutter-pypackage project template.


daops's People

Contributors

agstephens, alaniwi, cehbrecht, charlesgauthier-udm, dependabot[bot], ellesmith88, huard, jhaigh0, zeitsperre


daops's Issues

Could decadal fixes derive leadtime values on-the-fly?

  • daops version: branch: decadal-fixes
  • Python version: all
  • Operating System: all

Description

We have a system for generating fixes for adding the lead time variable, which uses:

https://github.com/roocs/daops/blob/decadal_fixes/daops/data_utils/coord_utils.py#L47-L76

For each fix, we include a string that includes a list of values, e.g.:

            "fix_id": "AddCoordFix",
            "operands": {
                "var_id": "leadtime",
                "value": "15,45,74,105,135,166,196,227,258,288,319,349,380,410,439,470,500,531,561,592,623,653,684,714,745,775,804,835,865,896,926,957,988,1018,1049,1079,1110,
1140,1169,1200,1231,1262,1292,1323,1354,1384,1415,1445,1476,1506,1535,1566,1596,1627,1657,1688,1719,1749,1780,1810,1841,1871,1900,1931,1961,1992,2022,2053,2084,2114,2145,2175,
2206,2236,2265,2296,2326,2357,2387,2418,2449,2479,2510,2540,2571,2601,2630,2661,2692,2723,2753,2784,2815,2845,2876,2906,2937,2967,2996,3027,3057,3088,3118,3149,3180,3210,3241,
3271,3302,3332,3361,3392,3422,3453,3483,3514,3545,3575,3606,3636,3667,3697",
                "dim": [
                    "time"
                ],
                "dtype": "float64",
                "attrs": {
                    "long_name": "Time elapsed since the start of the forecast",
                    "standard_name": "forecast_period",
                    "units": "days"
                },
                "encoding": {
                    "dtype": "double"
                }
            },
            "source": {
                "name": "ceda",
                "version": "",
                "comments": "",
                "url": "https://github.com/cp4cds/c3s34g_master/tree/master/Decadal"
            }

An alternative would be to encode a rule that tells the fix function to look up the required values and add them to the new coordinate variable. Instead of value being set to a list of values, it could be some kind of rule, such as:

"value": "derive: daops.data_utils.time_utils._get_lead_times"

@ellesmith88 This might be overkill but it is probably worth a discussion.

Create a "roocs-utils" package

daops and dachar should use the same common root dirs.

We should remove root_dir from the operation function parameters in daops, but it should be possible to override their values somewhere, e.g. in a Python object or an environment variable.

daops needs clisops:: all
dachar needs clisops:: general xarray utils
daops needs dachar:: root dirs

roocs-utils - must be lightweight and have no dependencies except xarray

Add unit tests for command-line

Update daops to interrogate the Fixes index for dataset fixes

Pre-requisites:

  • Fix index exists and has public read-only access
    • which implies: other indexes exist and are working with dachar.

Task:

  1. Update daops so that normalise will look up whether fixes need to be applied to each dataset and will then apply them.
  2. Requires integration with the "Fixer" class, and updating that class to query the Fix Index by dataset id.

daops production error: ZeroDivisionError: float divmod() - with tiny (lat, lon) box

  • daops version: production
  • Python version: 3.7
  • Operating System: Centos7

Description

The error logs have shown that this request fails:

from daops.ops.subset import subset

inputs = {
  'collection': 'c3s-cmip6.ScenarioMIP.NCC.NorESM2-MM.ssp245.r1i1p1f1.day.tasmax.gn.v20191108',
  'area': (8.37, 39.12, 8.56, 39.26),
  'level': None,
  'time': ('2006-01-01T00:00:00', '2099-12-30T00:00:00'),
  'output_type': 'netcdf',
  'output_dir': '.',
  'split_method': 'time:auto',
  'file_namer': 'standard'
}

resp = subset(**inputs)

The error we are seeing is:

/usr/local/Miniconda3-py39_4.9.2-Linux-x86_64/envs/rook/lib/python3.7/site-packages/clisops/core/subset.py: found within input date time range. Defaulting to minimum time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)
/usr/local/Miniconda3-py39_4.9.2-Linux-x86_64/envs/rook/lib/python3.7/site-packages/clisops/core/subset.py:een nudged to nearest valid time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)
ZeroDivisionError: float divmod()

My first guess is that the lat and lon selection is coming back with no data because the range is too small - that doesn't actually trigger an exception in xarray, so it is the subsequent processing that surfaces the error.
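One way to confirm that guess (a rough sketch, not the daops code; the file glob is illustrative) is to count how many grid points fall inside the requested box before subsetting:

import xarray as xr

ds = xr.open_mfdataset("tasmax_day_NorESM2-MM_ssp245_r1i1p1f1_gn_*.nc", use_cftime=True)

lon0, lat0, lon1, lat1 = 8.37, 39.12, 8.56, 39.26
n_lon = int(((ds.lon >= lon0) & (ds.lon <= lon1)).sum())
n_lat = int(((ds.lat >= lat0) & (ds.lat <= lat1)).sum())
print(n_lon, n_lat)  # zero in either direction would mean the box falls between grid points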

@ellesmith88 please take a look. Thanks

Order of elements of `area` needed for docstrings

New intake catalog approach could speed up `consolidate` step

  • daops version: master
  • Python version: 3.7
  • Operating System: all

Description

In our consolidate step, we read the datasets and decide which are in the requested time range, by opening them all with xarray:

https://github.com/roocs/daops/blob/master/daops/utils/consolidate.py#L42-L69

In our new intake catalog approach, we have the time information for each file directly accessible. We could allow daops to look up an intake catalog (if we can work out a clean way to make this connection).

This would speed things up.
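A rough sketch of the idea, assuming the catalog exposes per-file start/end times as columns (the catalog path and column names are illustrative):

import intake

cat = intake.open_esm_datastore("/path/to/c3s-cmip6-catalog.json")  # illustrative path
df = cat.df

# keep only files overlapping the requested time range, without opening any NetCDF files
start, end = "2006-01-01", "2099-12-30"
files = df[(df["start_time"] <= end) & (df["end_time"] >= start)]["path"].tolist()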

Tagging: @cehbrecht @ellesmith88

daops: subset-by-point

Extend the subset operation to support the proposed extension in: roocs/34e-mngmt#105

Key issues:

  • level: allow x1/x2 and x1,x2,x3,x4
  • time: allow x1/x2 and x1,x2,x3,x4
  • year, month, day: add these arguments as options instead of time - either one or the other - if both: default to use time
  • how to distinguish between a range of (<start>, <end>) and a sequence of (<value1>, <value2>) - we need a way for our parser to know the difference. Maybe work from rook downwards: rook will know whether it is a range or a sequence, so maybe the range should be a special object rather than a tuple (see the sketch below).
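For illustration only (not the actual roocs parser), a range could be marked with a dedicated type so that a plain string or tuple is always treated as a sequence:

class Interval:
    """Marker type for a range (start/end), as opposed to a plain sequence of values."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


def parse_time(value):
    # "x1/x2" means a range; "x1,x2,x3" (or a tuple/list) means an explicit sequence
    if isinstance(value, Interval):
        return "range", (value.start, value.end)
    if isinstance(value, str) and "/" in value:
        start, end = value.split("/")
        return "range", (start, end)
    if isinstance(value, str):
        return "series", tuple(value.split(","))
    return "series", tuple(value)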

Refactor consolidate to use roocs-utils functions

Description

Once the roocs-utils functions have been written:

# daops.utils.consolidate (sketch)
import os
import roocs_utils.utils.project_utils

def _consolidate_dset(dset):
    if dset.startswith("http"):
        raise Exception("...not supported (yet)...")
    if os.path.isfile(dset):
        return dset
    if dset.count(".") > 6:
        base_dir = roocs_utils.utils.project_utils.get_project_base_dir(dset)  # id handling left incomplete here
    if os.path.isdir(dset):
        return os.path.join(dset, "*.nc")
    raise Exception(f"No idea what this input is: {dset}")

Support for subsetting to a list of coordinate values

  • should we support subsetting to a specific list of coordinate values?
  • E.g.: get levels 850, 500, 200 and 50 hPa from a much longer list
  • Or could we provide that via a separate "extract" process/operation?
    • Does xarray correctly support this "picker" functionality? (see the sketch below)
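For reference, a minimal xarray sketch of the "picker" behaviour (the file, variable and coordinate names are illustrative):

import xarray as xr

ds = xr.open_dataset("ta_Amon_example.nc")  # illustrative file

# pick specific pressure levels (Pa) rather than a contiguous range
levels = [85000, 50000, 20000, 5000]
picked = ds["ta"].sel(plev=levels)                            # exact matches only
picked_nearest = ds["ta"].sel(plev=levels, method="nearest")  # nearest available levels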

Fixing the file paths list in ResultSet

import collections
import os


class ResultSet(object):
    def __init__(self, inputs=None):
        self._results = collections.OrderedDict()
        self.metadata = {"inputs": inputs, "process": "something", "version": 0.1}
        self.file_paths = []

    def add(self, dset, result):
        self._results[dset] = result

        for item in result:
            if isinstance(item, str) and os.path.isfile(item):
                self.file_paths.append(item)

regrid - consult CDS input

  • daops version: *
  • Python version: *
  • Operating System: Linux

Description

ECMWF have done classifications of variables into their required regridding types:

  • linear interpolation
  • nearest neighbour
  • mass-conserving

They can provide this.

They describe the regridding problem as a sparse matrix calculation in which a set of weights is applied. Once the matrices are pre-computed, the computation is efficient.
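For illustration (this is not the ECMWF/CDS code), applying precomputed weights is a sparse matrix-vector product over the flattened source and target grids:

import numpy as np
from scipy import sparse

n_src, n_tgt = 96 * 144, 180 * 360  # illustrative grid sizes
weights = sparse.random(n_tgt, n_src, density=1e-4, format="csr")  # stand-in for real weights

src_field = np.random.rand(96, 144)
# regridding reduces to one sparse multiply once the weights have been computed
tgt_field = (weights @ src_field.ravel()).reshape(180, 360)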

release 0.1.0?

  • daops version:
  • Python version:
  • Operating System:

Description

@agstephens @ellesmith88 can we make a 0.1.0 release with a reference to clisops 0.1.0? After that we can move to the xclim subset module integrated in clisops.

I can make the release ... but probably need permissions.

See also: roocs/clisops#1

Replicate content between ES indexes

To get the mapping, just use the <index_name>/_mapping endpoint (e.g. https://es14.ceda.ac.uk:9200/c3s-roocs-fix-prop/_mapping). It is worth paring this down, as you will get all the default stuff in there too; you only need the mappings which are non-standard.

Loading is as simple as:

from elasticsearch import Elasticsearch
import json

with open('mapping_file.json') as reader:
    mapping = json.load(reader)

index_name = 'index_name'

es = Elasticsearch()
if not es.indices.exists(index_name):
    es.indices.create(index_name, body=mapping)

You can do a cross-cluster re-index to copy the data across:

https://elasticsearch-py.readthedocs.io/en/v7.11.0/helpers.html#reindex
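For example, with the elasticsearch-py helper (the target host is a placeholder):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex

source_es = Elasticsearch(["https://es14.ceda.ac.uk:9200"])
target_es = Elasticsearch(["https://target-cluster:9200"])  # placeholder host

# copy all documents from the source index into the same-named index on the target cluster
reindex(source_es, source_index="c3s-roocs-fix-prop",
        target_index="c3s-roocs-fix-prop", target_client=target_es)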

Note: CEDA public end-point is: https://elasticsearch.ceda.ac.uk/c3s-roocs-fix-prop/_mapping

Example Search with no body specified:
https://elasticsearch.ceda.ac.uk/c3s-roocs-fix-prop/_search

Get xarray aggregation tests working

  • daops version: *
  • Python version: 3.7+
  • Operating System: linux

Description

@ellesmith88: I have created the following unit test module:

https://github.com/roocs/daops/blob/master/tests/test_xarray/test_xarray_aggregation.py

Most of it is in the form of skeleton code/stubs. Please can you get it working as a valid unit test.

The purpose of it is to ensure that we have tested the normal behaviour of xarray.open_mfdataset() - just to make sure that our assumptions throughout roocs are appropriate.
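For context, a minimal self-contained example of the kind of check intended (this is not the actual test module):

import numpy as np
import pandas as pd
import xarray as xr


def test_open_mfdataset_concatenates_along_time(tmp_path):
    # write two single-variable files covering consecutive years
    for year in (2000, 2001):
        times = pd.date_range(f"{year}-01-01", periods=12, freq="MS")
        ds = xr.Dataset({"tas": ("time", np.arange(12.0))}, coords={"time": times})
        ds.to_netcdf(tmp_path / f"tas_{year}.nc")

    agg = xr.open_mfdataset(str(tmp_path / "tas_*.nc"), use_cftime=True, combine="by_coords")
    assert agg.time.size == 24
    assert agg.tas.shape == (24,)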

Enable selection of variables in daops interface

Should we allow the daops interface to include the selection of variables?

Philosophically, we created daops and rook to deal with dataset identifiers, which tend to include only a single data variable (along with its metadata and coordinate variables). As we consider the wider use of roocs we find, as with the ESA CCI datasets at CEDA, that some datasets have many variables. For example, this kerchunk file links to NetCDF files that contain 204 variables!

https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json

Here is an example request to remind us of the existing interface (using the command-line daops subset approach):

daops subset --area 30,-10,65,30 --time 2000-01-01/2000-02-30 --levels "/" --time-components ""  --output-dir /tmp --file-namer simple https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json

So, should we extend the daops interface to allow specific selection of variables?

If yes, what are the options?

If we decide to support this extension, then maybe we have two options:

  1. Expand the dataset identifier so that it includes variable IDs, such as:
  • use a hash to separate the identifier (or path/URL) from a comma-separated list of variables:
https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json#toa_swup,toa_swup_clr,toa_swup_hig

So a full command might be:

daops subset \
  --area 30,-10,65,30 \
  --time 2000-01-01/2000-02-30 \
  --levels "/" \
  --time-components "" \
  --output-dir /tmp \
  --file-namer simple \
  https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json#toa_swup,toa_swup_clr,toa_swup_hig
  2. Add a new parameter, such as variables (see the sketch below):
  • variables: list of strings (or variable IDs) - DEFAULT = None (i.e. include all variables)
  • time
  • area
  • level
  • collection
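For illustration, option 2 might look like this from the Python API (the variables parameter is hypothetical and does not exist yet; the other arguments mirror the command above):

from daops.ops.subset import subset

resp = subset(
    collection="https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json",
    variables=["toa_swup", "toa_swup_clr", "toa_swup_hig"],  # hypothetical new parameter
    area="30,-10,65,30",
    time="2000-01-01/2000-02-30",
    output_dir="/tmp",
    file_namer="simple",
)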

@cehbrecht: what are your thoughts on this proposal?

Check all occurrences of `open_dataset` and `open_xr_dataset` use `roocs-utils` version of the function

  • daops version: current
  • Python version: all
  • Operating System: all

Description

We have a common set of arguments that need to be sent to either of the open functions in xarray:

xr.open_dataset(...)
xr.open_mfdataset(...)

We need to make sure that all such calls go through:

roocs_utils.xarray_utils.xarray_utils.open_xr_dataset(...)
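In other words, wherever we currently open data directly, a call like the first line below should become the second (a sketch of the intended pattern; the path is illustrative):

import xarray as xr
from roocs_utils.xarray_utils.xarray_utils import open_xr_dataset

file_paths = "/path/to/dataset/*.nc"  # illustrative

ds = xr.open_mfdataset(file_paths, use_cftime=True)  # before: arguments passed to xarray directly
ds = open_xr_dataset(file_paths)                     # after: roocs-utils applies the common arguments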

We need to review the daops, dachar and clisops code to check they are all doing this correctly. @ellesmith88, please can you take a look at this. Thanks

Process error: Cannot apply_along_axis when any iteration dimensions are 0

  • daops version: 0.5.0
  • Python version:
  • Operating System:

Description

Error on user request in production system.

See notebook, error 21, 24.03:
https://nbviewer.jupyter.org/github/roocs/rooki/blob/master/notebooks/tests/test-c3s-cmip6-subset-errors-dkrz-2021-03-23.ipynb

What I Did

Run:

wf = ops.Subset(
        ops.Input(
            'tos', ['c3s-cmip6.ScenarioMIP.CNRM-CERFACS.CNRM-CM6-1.ssp245.r1i1p1f2.Omon.tos.gn.v20190219']
        ),
        # time="2021-01-01/2050-12-31",
        area="1,40,2,4"
)
resp = wf.orchestrate()
resp.status

Error:

Process error: Cannot apply_along_axis when any iteration dimensions are 0

Traceback (most recent call last):
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/rook/director/director.py", line 156, in process
    file_uris = runner(self.inputs)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/rook/utils/subset_utils.py", line 5, in run_subset
    result = subset(**args)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/daops/ops/subset.py", line 77, in subset
    result_set = Subset(**locals()).calculate()
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/daops/ops/base.py", line 88, in calculate
    process(self.get_operation_callable(), norm_collection, **self.params),
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/daops/processor.py", line 19, in process
    result = operation(dset, **kwargs)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/ops/subset.py", line 165, in subset
    return op.process()
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/ops/base_operation.py", line 89, in process
    processed_ds = self._calculate()
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/ops/subset.py", line 63, in _calculate
    result = subset_bbox(ds, **self.params)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/core/subset.py", line 251, in func_checker
    return func(*args, **kwargs)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/clisops/core/subset.py", line 875, in subset_bbox
    da[var] = da[var].where(lon_cond & lat_cond, drop=True)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/common.py", line 1273, in where
    return ops.where_method(self, cond, other)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/ops.py", line 203, in where_method
    keep_attrs=True,
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 1134, in apply_ufunc
    keep_attrs=keep_attrs,
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 271, in apply_dataarray_vfunc
    result_var = func(*data_vars)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 632, in apply_variable_ufunc
    for arg, core_dims in zip(args, signature.input_core_dims)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 632, in <listcomp>
    for arg, core_dims in zip(args, signature.input_core_dims)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/computation.py", line 542, in broadcast_compat_data
    data = variable.data
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/variable.py", line 374, in data
    return self.values
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/variable.py", line 554, in values
    return _as_array_or_item(self._data)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/variable.py", line 287, in _as_array_or_item
    data = np.asarray(data)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 693, in __array__
    self._ensure_cached()
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 690, in _ensure_cached
    self.array = NumpyIndexingAdapter(np.asarray(self.array))
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 663, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 568, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 86, in __getitem__
    key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/xarray/core/indexing.py", line 853, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)

In `check_result(...)` - check that data is not all missing

Hi @alaniwi,

In the check_result(...) function, here, ...

https://github.com/roocs/daops/blob/test_data_pools_new/tests/data_pools_checks/run_data_pools_checks.py#L619

Please can you add a check on the entire output array to assert that it is not all NaNs or fill values, e.g. using something like numpy.isnan or the equivalent in xarray.

This would help us spot cases where a subsetting operation has gone wrong and returned an xarray.Dataset with no valid data in the array.
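For example (a rough sketch; the file and variable names are illustrative):

import xarray as xr

ds = xr.open_dataset("output_001.nc")  # illustrative output file from the subset operation
da = ds["tas"]                         # illustrative variable name

# fail the check if the subsetted array contains no valid data at all
assert not bool(da.isnull().all()), "output array is all NaN/fill values"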

consolidate expects file path with DRS folder structure

  • daops version: 0.3.0
  • Python version:
  • Operating System:

Description

The consolidate function works for a dataset id:

parameterise(
collection='c3s-cmip5.output1.ICHEC.EC-EARTH.historical.day.atmos.day.r1i1p1.tas.latest', 
time='1850/1855')

... and files with DRS folder structure:

parameterise(
collection='/data/c3s-cmip5/output1/ICHEC/EC-EARTH/historical/day/atmos/day/r1i1p1/tas/latest/tas_day_EC-EARTH_historical_r1i1p1_18500101-18591231.nc', 
time='1850/1855')

But it fails when only the file name is given without the DRS folders:

parameterise(
collection='/data/tas_day_EC-EARTH_historical_r1i1p1_18500101-18591231.nc', 
time='1850/1855')

Error message:

p = parameterise(
collection='/data/tas_day_EC-EARTH_historical_r1i1p1_18500101-18591231.nc', 
time='1850/1855')
result = subset(**p)

ValueError: max() arg is an empty sequence

The results of subset only have the filename but not the DRS folder structure:

p = parameterise(collection='c3s-cmip5.output1.ICHEC.EC-EARTH.historical.day.atmos.day.r1i1p1.tas.latest', time='1850/1855')
result = subset(**p)
result.file_paths
Out[9]: ['./tas_day_EC-EARTH_historical_r1i1p1_18500101-18551229.nc']

When we chain the subset operators, the second subset operation will fail on the output generated by the first one.

What I Did

Running in ipython:

In [1]: from daops.ops.subset import subset

In [2]: from roocs_utils.parameter import parameterise

In [3]: p = parameterise(collection='/Users/pingu/tmp/data/tas_day_EC-EARTH_historical_r1i1p1_18500101-18591231.nc', time='1850/1855')

In [4]: result = subset(**p)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-b4d4a92f06bc> in <module>
----> 1 result = subset(**p)

~/miniconda3/envs/rook/lib/python3.7/site-packages/daops/ops/subset.py in subset(collection, time, area, level, output_dir, output_type, split_method, file_namer)
     59 
     60     collection = consolidate.consolidate(
---> 61         parameters.get("collection"), time=parameters.get("time")
     62     )
     63 

~/miniconda3/envs/rook/lib/python3.7/site-packages/daops/utils/consolidate.py in consolidate(collection, **kwargs)
     79         # convert dset to ds_id to work with elasticsearch index
     80         if not dset.count(".") > 6:
---> 81             dset = convert_to_ds_id(dset)
     82 
     83         if "time" in kwargs:

~/miniconda3/envs/rook/lib/python3.7/site-packages/daops/utils/consolidate.py in convert_to_ds_id(dset)
     48     elif os.path.isfile(dset) or dset.endswith(".nc"):
     49         dset = dset.split("/")
---> 50         i = max(loc for loc, val in enumerate(dset) if val.lower() in projects)
     51         ds_id = ".".join(dset[i:-1])
     52         return ds_id

ValueError: max() arg is an empty sequence

Release for 0.10.1

We're getting ready to open a rook PR adding the average_shape process, and we'd need a daops release with the latest PR supporting that new operation here.

Compatible with clisops 0.12.1

Add features: CLI (subset only), Dockerfile and CWL file to the repo

@huard @cehbrecht: we are working on a branch that will allow daops to be called as a command-line utility (for subset only), for testing with the ESA Earth Observation Exploitation Platform Common Architecture (EOEPCA) framework. EOEPCA uses ADES (Application Deployment and Execution Service) to generate a WPS and allow applications to be deployed to it and run in the following way:

  • A CWL file: describing the inputs, outputs and arguments of the application
  • A Dockerfile: encapsulating the recipe for building a container to run the application (in our case, a daops/cli.py interface)

Are you happy for us to add these features to the master branch of daops? There should be no disruption to the existing components.

daops.consolidate() is inefficient with lots of files - can we provide hints in the file path mapper?

  • daops version: all
  • Python version: all
  • Operating System: all

Description

I have got daops working with another data set (not ESGF) that uses file path mappings in the config, e.g.:

[project:haduk_grid]
base_dir = {{ ceda_base_dir }}/archive/badc/ukmo-hadobs/data/insitu/MOHC/HadOBS/HadUK-Grid
file_name_template = {__derive__var_id}_hadukgrid_uk_{spatial_average}_{frequency}_{__derive__time_range}.{__derive__extension}
facet_rule = project version_major version_minor version_patch version_extra spatial_average frequency variable version
fixed_path_modifiers =
    variable:groundfrost pv rainfall sfcWind snowLying sun tas tasmin
    frequency:mon
fixed_path_mappings =
    haduk_grid.v1.0.3.0.1km.{frequency}.{variable}.v20210712:v1.0.3.0/1km/{variable}/{frequency}/v20210712/*.nc
    haduk_grid.v1.0.2.1.1km.{frequency}.{variable}.v20200731:v1.0.2.1/1km/{variable}/{frequency}/v20200731/*.nc

The following code is inefficient:

https://github.com/roocs/daops/blob/master/daops/utils/consolidate.py#L58-L87

It reads:

  1. Each file
  2. The aggregated dataset

We could try to provide hints to tell daops the date ranges in the files without having to open them. That would speed things up massively.

E.g.:

fixed_path_mappings =
    haduk_grid.v1.0.3.0.1km.{frequency}.{variable}.v20210712:v1.0.3.0/1km/{variable}/{frequency}/v20210712/*_(?P<startYYYYMM>\d{6})-(?P<endYYYYMM>\d{6}).nc

Then the code could parse the date range out of each file name using the regex and would not have to open the file(s).
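A rough sketch of how such a hint could be used (the glob path is illustrative; the regex matches the mapping above):

import glob
import re

pattern = re.compile(r"_(?P<startYYYYMM>\d{6})-(?P<endYYYYMM>\d{6})\.nc$")

file_ranges = {}
for path in glob.glob("/badc/example-archive/v20210712/*.nc"):  # illustrative glob
    match = pattern.search(path)
    if match:
        # record the time range parsed from the file name, without opening the file
        file_ranges[path] = (match.group("startYYYYMM"), match.group("endYYYYMM"))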

IndexError: list index out of range

  • daops version: 0.5.0
  • Python version:
  • Operating System:

Description

Error seen on user request on production service.

See error 41 from 26.03 in this notebook:
https://nbviewer.jupyter.org/github/roocs/rooki/blob/master/notebooks/tests/test-c3s-cmip6-subset-errors-dkrz-2021-03-23.ipynb

What I Did

Run this request:

wf = ops.Subset(
        ops.Input(
            'tas', ['c3s-cmip6.ScenarioMIP.NIMS-KMA.KACE-1-0-G.ssp245.r1i1p1f1.Amon.tas.gr.v20191217']
        ),
        time="2021-01-01/2100-12-31",
        area="-10,30,35,70"
)
resp = wf.orchestrate()
resp.status

Get into this error:

Process error: list index out of range

  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/daops/utils/consolidate.py", line 42, in consolidate
    ds = open_xr_dataset(dset)
  File "/usr/local/anaconda/envs/rook/lib/python3.7/site-packages/roocs_utils/xarray_utils/xarray_utils.py", line 33, in open_xr_dataset
    return xr.open_dataset(dset[0], use_cftime=True)
IndexError: list index out of range

Handling of large datasets

  • how to handle pre/post-processors when dataset is too big?
    • will lazy evaluation be needed?
    • will Xarray do it for free?
    • could you chain operators lazily and then do workflow.apply()

Testing integration of daops as backend for cmip6_preprocessing

Hi everyone,

I firstly wanted to say thank you for all the effort that has already been put into this framework. I would love to contribute to daops and integrate it more into my workflow.

I am maintaining cmip6_preprocessing and am very interested in migrating some of the things I fix (in a quite ad-hoc fashion for now) in a more general way over here.

My primary goal for cmip6_preprocessing is to use it with Python and the scientific pangeo stack, but I like the idea of documenting the actual problems (needing 'fixes') in a general and language-agnostic way over here. I was very impressed by the demo @agstephens gave a while ago during the CMIP6 cloud meeting and am now thinking of finally getting to work on this.

I am still really unsure how to actually contribute fixes to this repo, though. What I propose is to work my way through this using some quite simple fixes that are relatively easy to apply and are already documented in errata.

Specifically, I am currently testing this Python code, which changes some of the metadata necessary to determine the point in time where a dataset was branched off from the parent model run.

def fix_metadata_issues(ds):
    # https://errata.es-doc.org/static/view.html?uid=2f6b5963-f87e-b2df-a5b0-2f12b6b68d32
    if ds.attrs["source_id"] == "GFDL-CM4" and ds.attrs["experiment_id"] in [
        "1pctCO2",
        "abrupt-4xCO2",
        "historical",
    ]:
        ds.attrs["branch_time_in_parent"] = 91250
    # https://errata.es-doc.org/static/view.html?uid=61fb170e-91bb-4c64-8f1d-6f5e342ee421
    if ds.attrs["source_id"] == "GFDL-CM4" and ds.attrs["experiment_id"] in [
        "ssp245",
        "ssp585",
    ]:
        ds.attrs["branch_time_in_child"] = 60225
    return ds

These functions ingest an xarray.Dataset, check certain conditions in the attributes, and then overwrite attributes accordingly. I could easily split those out into dataset-specific 'fixes'.

Where exactly could I translate this into a fix within the daops framework? Very happy to start a PR (and then test the implementation from cmip6_preprocessing), but I am afraid I am still a bit unsure about the daops internals. Any pointers would be greatly appreciated.
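For comparison with the AddCoordFix entry shown in the lead-time issue above, one of these attribute changes might be expressed as a fix entry along these lines (the fix_id and operand names here are hypothetical, just to illustrate the shape):

# hypothetical fix entry, modelled on the AddCoordFix example elsewhere on this page
fix = {
    "fix_id": "GlobalAttrFix",  # hypothetical fix class name
    "operands": {
        "attrs": {"branch_time_in_parent": 91250},
    },
    "source": {
        "name": "cmip6_preprocessing",
        "url": "https://errata.es-doc.org/static/view.html?uid=2f6b5963-f87e-b2df-a5b0-2f12b6b68d32",
    },
}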

Implement fixes requested by ESMValTool team

Required steps to manually add fixes:

  • check that our data exhibits the error to be fixed
  • write the xarray fixing code in daops, with associated unit tests
  • add the fix class to dachar.fixes...., with associated unit tests if required
  • decide/agree on content of fix metadata:
    • URL to issue, or code, or both in ESMValTool repo
    • description of the fix
    • "source" property to identify that this came from ESMValTool, maybe with release version or github commit number.
  • use inventory to identify all data sets that will be affected
  • generate (by hand?) fix proposals:
    1. create fix proposal per datasets, OR
    2. create fix proposal template and list of datasets file
  • do we need some kind of QC when the fixes are being proposed?
  • publish fix proposals:
    • Start with a one-off proposal: dachar propose-fixes -p cmip6 <json_file>
    • Adapt to: dachar propose-fixes -p cmip6 --file-list=datasets_files.txt <fix_template.json>
  • process fixes as usual - to accept them

Relevant to:

ESMValGroup/ESMValCore#755

ESMValGroup/ESMValCore#787

Define API parameters for subset and average processes

Follow OGC good practice and other relevant services to define sensible inputs for spatial/temporal parameters etc.

Needs some research into existing services.

E.g. what should the time window look like?

"1999-01-01T00:00:00Z/2000-10-10T12:00:00Z"

Is our "Fixer" class approach appropriate?

Please check whether the Fixer approach that we have implemented will be flexible enough to address all issues with CMIP5, CMIP6 and CORDEX data.

Use the ESMValTool repository to review examples of fixes that are needed.

Do we need something more than:

  • pre-processor
  • post-processor
