ECmean4's Issues

Refactor of global mean

We would like a more elaborate version of the global mean, able to deal with multiple seasons and regions, as recently done in #46 for the performance indices.

We will need a script that performs the same global mean operations on a restricted set of observations and then writes everything out as a YAML file. Ideally we would also store the variance for each variable/region/season, so that we can provide an estimate of the error.

The global mean output should also be converted to a PDF, similarly to the performance indices. However, the color should indicate how many standard deviations the model is away from the observations, while the heatmap cell reports the model value itself. It would also be great to show the observational values.

Slowdown when processing CMIP6 data

I noticed a significant slowdown of the global mean tool, especially when processing some CMIP6 data.
It is more evident for models that store multiple years in a single big file, and not evident when single years are processed.
I suspect this is due to the way the file list is created; it might deserve a substantial refactoring.

(CMOR) sftof in global_mean for ocean vars

In computing global means of ocean variables for CMOR data, we should actually use the information in sftof (the fraction of a cell covered by ocean). For NEMO this is always 100%, but not necessarily so for other models, which might have fractional coverage.
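A minimal xarray sketch of such a weighting; the file names and the ("j", "i") horizontal dimensions are illustrative assumptions:

import xarray as xr

# hypothetical inputs: the field, the cell areas (m^2) and the ocean fraction
# (percent, as in the CMOR sftof variable), all on the same ("j", "i") grid
field = xr.open_dataset("tos_Omon_model.nc")["tos"]
areacello = xr.open_dataset("areacello_Ofx_model.nc")["areacello"]
sftof = xr.open_dataset("sftof_Ofx_model.nc")["sftof"]

# weight each cell by its area times the fraction actually covered by ocean
weights = (areacello * sftof / 100.0).fillna(0)
ocean_mean = field.weighted(weights).mean(dim=("j", "i"))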

Introduce semi-automatic benchmarking

It would be nice to have a tool that robustly estimates the code speed on a pre-defined set of data (say, 30 years of CMIP6 data) with different numbers of cores. This would be very useful when developing new functionality or to identify bottlenecks, as mentioned in #50, and perhaps the results could be reported directly in the documentation.

It is unclear whether some existing tool can be used for this; it needs to be investigated.
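One candidate is pytest-benchmark; a minimal sketch, where run_global_mean and its arguments are hypothetical placeholders for the real entry point:

# test_speed.py: run with `pytest test_speed.py` (requires pytest-benchmark)
from ecmean import run_global_mean  # hypothetical entry point

def test_global_mean_speed(benchmark):
    # time a 30-year global mean run on 4 cores (all arguments illustrative)
    benchmark(run_global_mean, exp="historical", year1=1981, year2=2010, numproc=4)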

SSS variance climatology is wrong

The file variance_levitus_SSS.nc actually contains the sea ice variance (it is identical to variance_GISS_SICE.nc)! Obviously this leads to crazy values of the SSS performance index.

CMIP6 compatibility for synda

The PR #31 (Sphinx docs) also changed interfaces/interface_CMIP6.yml (I guess by mistake), probably to adapt it to the particular file structure used by @oloapinivad.
The previous version was actually compatible with the directory structure exposed by synda (available on our machines wilma and mafalda).
In the fixcmip6 branch I reintroduced the previous version, but I also kept Paolo's version as interfaces/interface_CMIP6_PD.yml (so it can be used with -i CMIP6_PD). We can have different interface files for different directory structures.

Efficiency of multiprocessing of performance indices

So far the multiprocessing of the performance indices is based on subdividing the variables into chunks according to the number of available processors and the variable list.
However, it often happens that all the 3d variables - which are the most computationally intensive - end up in the same chunk. It would be significantly more efficient to spread them among the processors, as sketched below. Exploiting parallel computation via dask should also be investigated.
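A minimal sketch of a more balanced subdivision; the variable list, the set of 3d variables and the round-robin slicing are illustrative:

varlist = ["tas", "psl", "pr", "ta", "ua", "va", "hus"]  # illustrative list
vars3d = {"ta", "ua", "va", "hus"}                       # assumed 3d variables
nprocs = 3

# put the expensive 3d variables first, then deal the list out round-robin,
# so that each chunk gets roughly the same number of 3d variables
ordered = sorted(varlist, key=lambda v: v not in vars3d)
chunks = [ordered[i::nprocs] for i in range(nprocs)]
# -> [['ta', 'hus', 'pr'], ['ua', 'tas'], ['va', 'psl']]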

More generally, we could think about a way to estimate the speed of the code with a testing procedure. Unfortunately, I am not sure that speed can be reliably measured on GitHub.

Documentation with read the docs

In the documentation branch (and in documentation-api) we implemented an initial version of sphinx+readthedocs to provide an automatic documentation procedure.
The default documentation works fine and can be found at https://ecmean4.readthedocs.io/en/latest/

However, the autodoc part, which builds the description of the functions starting from their docstrings, appears to be a pain. It works smoothly locally using sphinx in two different configurations, which involve the use of sphinx-api (7376c01) or of autosummary (cae29f2).

From a "style" point of view, the sphinx-api solution seems better, so it is considered the preferred one so far and is the one installed on the webpage.

When we move to the readthedocs server, both fail, since they appear unable to find the correct modules. I suspect this is associated with the interdependencies of the two scripts, which should probably be two different modules, but it still puzzles me a lot that both crash.

An example of the crash for the sphinx-api solution (these are actually just errors):

WARNING: autodoc: failed to import module 'global_mean'; the following exception was raised:
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/ecmean4/envs/latest/lib/python3.9/site-packages/sphinx/ext/autodoc/importer.py", line 62, in import_module
    return importlib.import_module(modname)
  File "/home/docs/.asdf/installs/python/3.9.7/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/docs/checkouts/readthedocs.org/user_builds/ecmean4/checkouts/latest/global_mean.py", line 23, in <module>
    from ecmean import var_is_there, load_yaml, \
  File "/home/docs/checkouts/readthedocs.org/user_builds/ecmean4/checkouts/latest/ecmean.py", line 17, in <module>
    cdo = Cdo()
  File "/home/docs/checkouts/readthedocs.org/user_builds/ecmean4/envs/latest/lib/python3.9/site-packages/cdo.py", line 187, in __init__
    self.operators         = self.__getOperators()
  File "/home/docs/checkouts/readthedocs.org/user_builds/ecmean4/envs/latest/lib/python3.9/site-packages/cdo.py", line 278, in __getOperators
    version = parse_version(getCdoVersion(self.CDO))
  File "/home/docs/checkouts/readthedocs.org/user_builds/ecmean4/envs/latest/lib/python3.9/site-packages/cdo.py", line 78, in getCdoVersion
    proc = subprocess.Popen([path2cdo, '-V'], stderr=subprocess.PIPE, stdout=subprocess.PIPE)
  File "/home/docs/.asdf/installs/python/3.9.7/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/docs/.asdf/installs/python/3.9.7/lib/python3.9/subprocess.py", line 1696, in _execute_child
    and os.path.dirname(executable)
  File "/home/docs/.asdf/installs/python/3.9.7/lib/python3.9/posixpath.py", line 152, in dirname
    p = os.fspath(p)

TypeError: expected str, bytes or os.PathLike object, not NoneType

Improve warning for missing files

There are a few cases where the warning/error presented when missing files are encountered is hard to interpret. At least two cases have been identified:

  • When the mask and areacello files are missing (an ERROR is raised, but it is impossible to understand what it means)
  • When a year is requested but not found in the files (global mean raises an error, but performance indices run without any warning!)

Work with CMOR output

In order to facilitate comparison with previous model iterations and with other models, it is desirable to be able to run the tools also on CMIP output in CMOR format. This is being explored in the new cmor branch.

Non-cmor variables

Interface files are very convenient, but apparently we forgot to consider the case where a variable is not saved under a CMOR name. Consider, as an example, 2t as output from raw IFS data or from the ERA5 reanalysis. We cannot change the variable name in the config.yml, since this would break any comparison.

We should introduce the possibility for the interface file to specify which variable to load/seek, which might differ from the CMOR name. Something like:

 tas:
    varload: '2t'
    varname: '2m Temperature'
    filetype: atm2d

The new varload should be used everywhere in the code and set identical to var when it is not defined; a sketch of the fallback is given below. This should also come with a revision of the make_input_filename() function, which currently does not allow for such flexibility.
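A minimal sketch of the fallback logic, with face being the loaded interface dictionary as elsewhere in the code:

# fall back to the CMOR name when 'varload' is not defined in the interface
varload = face["variables"][var].get("varload", var)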

Revisiting the `component` and its files in the interface YAML files

ECmean4 is built on the presence of a series of components which tell the inner code how to deal with land-sea masks, grid areas and grid specifications. These are listed within the component block of the interface files.

As an example, for CMOR they read:

component:
  cmoratm:
    inifile: 'r1i1p1f*/sftlf/sftlf_fx_{model}_{expname}_r1i1p1f*_{grid}.nc'
    atmfile: 'r1i1p1f*/sftlf/sftlf_fx_{model}_{expname}_r1i1p1f*_{grid}.nc'
  cmoroce:
    gridfile: 'r1i1p1f*/sftof/sftof_Ofx_{model}_{expname}_r1i1p1f*_{grid}.nc'
    areafile: 'r1i1p1f*/areacello/areacello_Ofx_{model}_{expname}_r1i1p1f*_{grid}.nc'

As another example, the EC-Earth4 one reads:

component:
  oifs:
    inifile: 'ICMGG{expname}INIT'
    atmfile: 'output/oifs/{expname}_atm_cmip6_1m_{year1}-{year2}.nc'
  nemo:
    gridfile: nemo-initial-state.nc
    areafile: domain_cfg.nc

The naming is not consistent, and the purpose of each of these files can be non-trivial to understand. In principle, what we strictly need is:

  1. Land-sea mask file on the atmospheric grid (mandatory)
  2. Land-sea mask file on the oceanic grid (not mandatory, see #38)
  3. Area file for the atmospheric grid (it could be almost any file in most cases, since _area_cell() has been shown to cover many of them)
  4. Area file for the oceanic grid (the standard areacello in CMOR)
  5. Area file for the ice grid (some CMOR models have this peculiarity, see #47)

Whether the oceanic gridfile is needed in this configuration remains to be assessed: it was originally introduced for EC-Earth4, but maybe it can be folded into the areafile.

Furthermore, each component implies a different treatment by the mask and interpolation functions, and this is very much ad hoc.

A general reorganization must be carried out to support different models.

oceanic mask

The oceanic mask is made to work only with official CMIP6 data, and it cannot load other variables. Also, a somewhat dangerous unit check is performed, and this can be a problem in certain situations.

Furthermore, there is no need to print
WARNING -> No oceanic mask available for oceanic vars, this might lead to inconsistent results...
for every variable; we should issue this warning only once, at the beginning.

RK08 SSS variance is bugged

The variance for the Levitus dataset included in the climatology is bugged: it is actually the variance of sea ice, and it needs to be replaced.

Coding standards and workflow testing

The pytest tests implemented in the testing branch include two flake8 checks that are now part of the workflow. They run before the python -m pytest call, which executes the two basic tests for atmosphere-only and coupled runs. The workflow is executed on every new commit and on every new pull request on the main branch.

The first flake8 call produces errors and makes the workflow fail. It checks for undefined variables and syntax errors (and it made me spot one undefined variable!):

flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

The second produces a full report in the form of a warning. The line length is set according to the GitHub standard:

flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

It is important to run these tests BEFORE opening a pull request or committing on the main branch.

A very basic way to fix most of the minor issues that arise while coding is running autopep8:

autopep8 --in-place --max-line-length=127 --recursive .

These three commands should always be run before merging a pull request.

wfo unit

This is more related to EC-Earth, but it is a good collateral aspect of using pint:

wfo
kg/m2/s ---> m/100years
Unit converson required...
3.1688087814028946e-06 kilogram / meter ** 3 / second ** 2
Units mismatch, this cannot be handled!

What unit is kg/m2/s? Perhaps it should be kg m-2 s-1?

global mean does not select years for files with more than one year

ECmean4/global_mean.py

Lines 142 to 146 in 171f6b1

yrange = range(diag.year1, diag.year2+1)
for year in yrange:
    infile = make_input_filename(var, dervars, year, year, face, diag)
    x = cdop.output(infile, keep=True)
    a.append(x)

There is no selyear call in the global_mean.py code, so if a file includes more than one year, the entire file is currently processed. This is wrong and occurs quite often for CMIP data. Spotted while developing the xarray code. A sketch of a possible fix is given below.
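A hedged sketch of the missing year selection using the cdo Python bindings; the operator chain is illustrative, and diag/infile are the names used in the snippet above:

from cdo import Cdo

cdo = Cdo()

# restrict the file to the requested window before any averaging
out = cdo.output(input=f"-fldmean -timmean -selyear,{diag.year1}/{diag.year2} {infile}")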

Inconsistency in fldmean following grid definition

Global means are computed via the cdo fldmean command.
However, the result seems to depend on the grid definition:

cdo output -fldmean -timmean -selname,tas /lus/h2resw01/scratch/ccpd/ece4/MALE/output/oifs/MALE_atm_cmip6_1m_1990-1990.nc
 287.088
cdo output -fldmean -timmean -setgridtype,regular -setgrid,/lus/h2resw01/scratch/ccpd/ece4/MALE/ICMGGMALEINIT -selname,tas /lus/h2resw01/scratch/ccpd/ece4/MALE/output/oifs/MALE_atm_cmip6_1m_1990-1990.nc
 287.083

Not a big difference, but we should consider whether it has any implications.

More flexible way to specify input files

Input files for each variable can change vastly across models. We will attempt to specify this in a more general and flexible way in the interface_* files. This is being developed in the filenames branch (a sub-branch of parallel, since that branch, still to be merged, contains too many structural changes and it is better to restart from there).

Issues with CMIP6 data

While trying to provide a complete assessment of CMIP6 data with both global mean and performance indices using the brand new xarray engine, we are facing issues with some model grids/files. This could be 1) an issue of ECmean4 and the current xESMF interpolation method, or 2) some non-regular CMOR files in the CMIP6 archive.

Models on which ECmean4 works completely:

(in parentheses, the time to process 30 years of data for global mean and performance indices, with 8 cores on wilma)

  • IPSL-CM6A-LR
  • CanESM5
  • EC-Earth3
  • MIROC6
  • TaiESM1
  • CNRM-CM6-1
  • AWI-CM-1-1-MR: oceanic interpolation from the unstructured grid originally did not work, but nearest-neighbour interpolation is now used when an unstructured grid is identified (not neat, but fine)
  • CESM2
  • GFDL-CM4

Models on which ECmean4 works partially:

  • CMCC-CM2-SR5
  • NorESM2-MM
  • ACCESS-CM2
    With these three models a weird issue associated with sea ice is found. Apparently the remapping of sea ice does not work with the weight matrix computed on the ocean variables, since the sea ice model has a different grid!
    An example error:
ValueError: The horizontal shape of input data is (291, 360), different from that of the regridder (292, 362)!

Moving from CDO to xarray+dask

This is to keep track of the development in devel/xarray. The idea is to convert the CDO engine to xarray, which is more self-consistent and allows for out-of-core computation via dask.

The first commit 82a0661 introduced xr_global_mean.py, which has a similar parsing structure and supports only EC-Earth4. It requires a conda environment to correctly handle the dependencies (safer than the pyenv originally used).

Positive aspects:

  • The code is extremely fast (roughly 25-30% of the runtime of the original CDO code, even without using dask).
  • It removes the entire cdopipe class.
  • No need to handle the grid: simply open the file with xarray.

Negative aspects:

  • The area-weighted mean/sum is clumsy, since the weights need to be generated manually; possible solutions with external packages seem non-trivial (see the sketch after this list).
  • GRIB and NetCDF have very different file structures, and this requires manual intervention when using data from both sources at the same time (as for example with the EC-Earth4 masks).
  • A parser for the expressions used to evaluate derived variables had to be written, and it may not be robust.
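For the weighted mean, xarray's built-in weighted() helper could reduce the clumsiness; a minimal sketch, where the cell-area file, variable names and dimension labels are illustrative:

import xarray as xr

ds = xr.open_dataset("tas_1990-1990.nc")         # illustrative input
area = xr.open_dataset("areas.nc")["cell_area"]  # hypothetical 2D cell areas

# weighted() normalizes the weights internally, so no manual bookkeeping
global_mean = ds["tas"].weighted(area.fillna(0)).mean(dim=("lat", "lon"))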

Performance indices should use remapcon

Currently remapbil is used to interpolate all the variables in the performance indices regridding. This is not appropriate for upscaling, and something like remapcon should be used instead. The problem is that the unstructured NEMO grid is missing the corner data and so does not allow the use of remapcon.

CMIP6 interface mixes Amon and Omon

While developing the xarray version of global_mean.py, I found out that the current filename expansion, i.e. _expand_filename(), does not distinguish between Amon and Omon variables, so that if both are available, both are loaded and averaged together. This occurs specifically for historical EC-Earth3 on mafalda when computing net_sfc, which requires snowfall (prsn), available in both Amon and Omon, but other situations like this are possible.
CDO does not complain, since it just performs a global average, but xarray crashes, which is how I spotted the issue.

This has to be solved in the main branch: a possible workaround, currently used in the xarray branch, is to prepend an A or an O to the filename definition in interface_CMIP6.yml, as done here below:

filetype:
  atm2d:
    filename: '{var}_A{frequency}_{model}_{expname}_{ensemble}_{grid}_{year1}01-{year2}12.nc'
    dir: '{ensemble}/{frequency}/{var}/{grid}/{version}'
    component: cmoratm
  atm3d:
    filename: '{var}_A{frequency}_{model}_{expname}_{ensemble}_{grid}_{year1}01-{year2}12.nc'
    dir: '{ensemble}/{frequency}/{var}/{grid}/{version}'
    component: cmoratm
  oce2d:
    filename: '{var}_O{frequency}_{model}_{expname}_{ensemble}_{grid}_{year1}01-{year2}12.nc'
    dir: '{ensemble}/{frequency}/{var}/{grid}/{version}'
    component: cmoroce
  ice:
    filename: '{var}_{frequency}_{model}_{expname}_{ensemble}_{grid}_{year1}01-{year2}12.nc'
    dir: '{ensemble}/{frequency}/{var}/{grid}/{version}'
    component: cmoroce

It is not clean, but it solves all the issues so far.

ECmean4 not running with python 3.11

So far it is not possible to run with Python 3.11, since there is a conflict between xESMF, Python and esmpy.
xESMF can't work with the most recent version of esmpy, so we run with 8.3.1; however, 8.3.1 does not run with Python 3.11.

This should be addressed within days by xESMF: pangeo-data/xESMF#218

lean requirements.txt

The requirements file currently contains outdated pinned packages and far more than what is strictly needed.
A short, lean requirements file with only what is needed should be added.

no more support for atm-only runs

I have just found that the main branch no longer works with atmosphere-only runs.
I can't check now, but I will have a look at this later in the afternoon.

(.ECmean4) [ccpd@aa6-100 ECmean4]$ ./performance_indices.py ALFA 1990 1990
PI for  net_sfc 18.288219451904297
PI for  tas 15.55960464477539
PI for  psl 1.8526122570037842
PI for  pr 25.827404022216797
PI for  tauu 12.488482475280762
PI for  tauv 4.452453136444092
PI for  ta 6.034306049346924
PI for  ua 3.1787490844726562
PI for  va 2.8235251903533936
PI for  hus 4.428908824920654
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/apps/python3/3.8.8-01/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/apps/python3/3.8.8-01/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "./performance_indices.py", line 38, in worker
    isavail, varunit = var_is_there(infile, var, face['variables'])
  File "/perm/ccpd/ecearth4/ECmean4/ecmean.py", line 99, in var_is_there
    ffile = infile[0]
IndexError: list index out of range
Done in 5.4165 seconds
Traceback (most recent call last):
  File "./performance_indices.py", line 265, in <module>
    main(args)
  File "./performance_indices.py", line 218, in main
    out_sequence = [var, varstat[var], piclim[var]['mask'], piclim[var]
  File "<string>", line 2, in __getitem__
  File "/usr/local/apps/python3/3.8.8-01/lib/python3.8/multiprocessing/managers.py", line 850, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'tos'

Unpacking the ecmean.py

To date, all the functions are gathered together in ecmean.py.
This is not very efficient, since it implies tons of imports and also makes it quite complicated to browse the code. We should create an ecmean folder with different .py modules, where functions are clustered according to their usage/imports.

nemo oceinifile does not include information on the curvilinear grid

This emerged in the genbil branch, which precomputes the remap weights to speed up the calculation of the performance indices.

Currently, the chosen oceinifile is domain_cfg.nc, since it includes the information needed to compute the grid areas, which are fundamental for the averaging operations. However, this file has an incomplete grid description, so that CDO recognizes its grid as generic. Therefore this file cannot be used to generate the weights, and the OCEGRIDFILE variable points to a useless txt file.

An alternative could be to use output from the model itself, which has a curvilinear grid, or the nemo-initial-state.nc file. However, neither currently includes grid area information, so two different initial files would be required - something we would like to avoid at first.

On the other hand, it might not be obvious how to reproduce the curvilinear grid description from domain_cfg.nc.

Including sea ice area in global mean

As noticed by Klaus W., the global mean lacks a measure for sea ice.

After a long discussion on whether it is better to implement sea ice area or sea ice extent (see here for the subtle difference), we decided to stick to a version of sea ice area (siconc * gridarea), which is even simpler and does not require the 15% threshold.

It is a bit incomplete, but it can be robustly computed from monthly means and it is accepted by the SIMIP community (https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GL086749). A sketch is given below.
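A minimal xarray sketch of the computation, assuming siconc in percent and areacello in m^2 on the same ("j", "i") grid; the file names are illustrative:

import xarray as xr

siconc = xr.open_dataset("siconc_SImon_model.nc")["siconc"]         # percent
areacello = xr.open_dataset("areacello_Ofx_model.nc")["areacello"]  # m^2

# sea ice area: concentration-weighted sum of the cell areas, in 10^6 km^2
sia = (siconc / 100.0 * areacello).sum(dim=("j", "i")) / 1e12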

Similarly, it should be possible to implement a measure for sea ice volume.

I will work on this in a branch in the next weeks.

pattern correlation branch

I have just introduced a pattern correlation branch, https://github.com/oloapinivad/ECmean4/tree/correlation, which also computes the spatial correlation for a few selected variables. Currently the reference is the PI dataset. The structure is a simplified version of the performance indices script, performing very similar operations; I preferred to keep it separate in order to provide statistics for different diagnostics.
However, we will need to update our datasets.

An example run:

(.ECmean4) [ccpd@aa6-100 ECmean4]$ ./correlation_pattern.py BETA 1990 1990
Pattern Correlation for  tas 0.990241
Pattern Correlation for  pr 0.839326
Pattern Correlation for  psl 0.967952
Done in 1.4894 seconds
/ec/res4/scratch/ccpd/ece4/ECmean4/table/CP4_RK08_BETA_1990_1990.txt

| Var | Correlation | Domain | Dataset |
|-------+---------------+----------+-----------|
| tas | 0.990241 | land | CRU |
| pr | 0.839326 | global | CMAP |
| psl | 0.967952 | global | COADS |

Procedure to create a new climatology

Summary

This is to document the methodology used to create a new climatology. A new script, py-climatology-create.py, has been introduced in devel/clim-cmip6 (https://github.com/oloapinivad/ECmean4/blob/devel/clim-cmip6/py-climatology-create.py) and aims at providing a cornerstone for any future climatology development. This is very important since it allows updating the existing EC22 climatology (or, more correctly, building the new definitive version, EC23).

The present code

The tool is based on a YAML file that tells it where the variables are stored on the local machine (currently configured to work on wilma). The script exploits dask, so it can run with multiple processors, and it takes a few hours to cover the 10 variables. It uses CDO for interpolation, with a different remapping method according to the variable. It provides both the mean and the variance needed to compute the performance indices: for each variable, the default "yearly" climatology is now provided together with season-averaged climatologies, so that PIs can be produced for multiple seasons.

The script simply computes the yearly or seasonal mean, then estimates the interannual variance, as sketched below. An outlier filtering of the variance is then applied, but we might not be happy with it (see further below). A cool feature is that it automatically produces the climatology YAML file (and, combined with the CMIP6 evaluation script cmip6-clim-evaluate.py, it also provides the CMIP6 average values for each variable, season and domain): https://github.com/oloapinivad/ECmean4/blob/devel/clim-cmip6/climatology/EC23/pi_climatology_EC23.yml
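A minimal xarray sketch of the mean/variance step, assuming a monthly-mean input; the QS-DEC resampling is one common way to build DJF/MAM/JJA/SON averages:

import xarray as xr

da = xr.open_dataset("tas_monthly_1990-2019.nc")["tas"]  # illustrative input

# seasonal means for each individual year (quarters starting in December)
seas = da.resample(time="QS-DEC").mean()

# climatological mean and interannual variance, season by season
clim = seas.groupby("time.season").mean("time")
var = seas.groupby("time.season").var("time")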

Things to be discussed:

  • Outlier removal: this is the hardest part, and a bit of an issue of the RK PI itself. Some grid points have very low variance due to limitations of the original dataset, and when dividing by the variance these few grid points dominate the PI estimate, which reaches incredibly large values. A filtering of these outliers has been introduced, based on 5 sigma of the log10 distribution (see the sketch after this list). It works decently, but its behaviour needs to be investigated further. Also, some datasets have strange values here and there, and this should be addressed as well. Perhaps a monitoring of the variance distribution should be introduced in the form of a plot?
  • Naming convention: we need to fix the names of the files and of the climatology in a definitive way, so that future updates guarantee backward compatibility with older climatologies.
    • So far the files are split between the yearly averages (as in the RK08 climatology) and the seasonal files. This guarantees backward compatibility, but the naming is quite clumsy.
    • The original climatology is named RK08, the new one EC22. Wouldn't it be better to use versioning such as ECmean-v1, ECmean-v2 as we produce new releases? Should we have a separate repo?
  • Datasets to be used: we might want to discuss this in a separate issue, but we need more observational datasets for most of the variables, since there is a massive reliance on ERA5, which can be criticized. Should we provide multiple observational datasets? This gives the user a choice, but it might get dispersive (a non-unique output of the ECmean tool could take us far from our original goal).
  • Year definition: so far 1990-2019 is the reference time frame. We cannot provide the yearly time series, otherwise the dataset would be too big - unless we create a specific repository/archive for it. I do not like this idea too much, since then the CMIP6 comparison could not be provided. However, a few different options might be offered.
  • Resolution: r180x90 is being replaced by r360x180; is this okay? Should we keep both, or even keep the original dataset and interpolate onto each target grid (this would slow down the operation, since the weights would need to be computed every time)?
  • Interpolation: we are using CDO, not xESMF, since it is much more practical. We set remapbil as the default, and remapcon for conservative variables (precipitation and net_sfc). Should we always use remapcon?
  • Other points?
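A minimal sketch of the 5-sigma log10 filter described in the first point above, assuming variance is a strictly positive DataArray; the input file is illustrative:

import numpy as np
import xarray as xr

variance = xr.open_dataset("variance_EC23_tas.nc")["tas"]  # illustrative file

# work in log10 space, where the variance distribution is closer to normal
logv = np.log10(variance.where(variance > 0))
mu, sigma = float(logv.mean()), float(logv.std())

# mask out grid points more than 5 sigma away from the distribution mean
filtered = variance.where(np.abs(logv - mu) < 5 * sigma)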

Value of the spatial integrals

Following the introduction of the pint package, a small inconsistency in the values of the spatial integrals is observed. I think this is correct as treated in #12, but I write it down for the sake of my own clarity.

Take for example pr_oce and pr_land. They are measured in Sverdrup, i.e. 10^9 m3/s

Precipitation is provided by the model in kg/m2/s which, divided by the water density (1000 kg/m3) to get a volume flux, means 10^-3 m/s (i.e. 1 mm/s, or 86400 mm/day). What we currently do is integrate in space over the ocean or land surface, multiplying by the cell area and then summing (the masked_mean() function in cdopipe.py), so that we multiply by square meters and get 10^-3 m3/s.

It is thus correct to assume that the conversion factor is 10^-12 (from the original units to the target units). ERA5 data suggest we have about 17 Sv of precipitation. However, the result of

 cdo output -fldsum -mul -timmean -selname,pr BETA_atm_cmip6_1m_1990-1990.nc area.nc 
  1.69177e+10       

which, multiplied by 10^-12, does not give the same order of magnitude.
Indeed, this is taken into account by pint, so that we get...

| pr_oce       | Precipitation (ocean)          | Sv        |      0.0135796  |     13.4499   | ERA5       | 1990-2019 |
| pme_oce      | Precip. minus evap. (ocean)    | Sv        |     -0.00111626 |     -1.24691  | ERA5       | 1990-2019 |
| pr_land      | Precipitation (land)           | Sv        |      0.00333878 |      3.82094  | ERA5       | 1990-2019 |
| pme_land     | Precip. minus evap. (land)     | Sv        |      0.00100797 |      1.39091  | ERA5       | 1990-2019 |

there is a 10^3 factor between the older "reference" data and what we currently get. Is this correct? Or was it wrong in the past?

As a further confirmation, I found on Wikipedia:

Approximately 505,000 cubic kilometres (121,000 cu mi) of water falls as precipitation each year; 398,000 cubic kilometres (95,000 cu mi) of it over the oceans.

398000 km3/year = 398000 * 10^9 m3/year = 0.012 * 10^9 m3/s (Sv), which is in line with what we get from the new version.
@jhardenberg could you confirm that the above is correct, and that we need to fix the reference values (and perhaps change the reference unit, which is not very handy)?
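As a cross-check, a minimal pint sketch of the conversion chain; note that the standard oceanographic definition is 1 Sv = 10^6 m^3/s, which is not guaranteed to be in pint's default registry, so it is defined explicitly here:

import pint

ureg = pint.UnitRegistry()
ureg.define("Sverdrup = 1e6 * meter**3 / second")  # standard definition

# global fldsum of (pr * cell area) from the cdo call above, in kg/s
total_mass_flux = 1.69177e10 * ureg("kg / s")
water_density = 1000 * ureg("kg / m**3")

print((total_mass_flux / water_density).to("Sverdrup"))  # ~16.9 Sverdrup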

Installing from PyPI causes dependency errors

Hi,

When trying to install ECmean4 into a fresh Python virtual environment, I get:

> pip install ECmean4

Collecting ECmean4
  Using cached ECmean4-0.1.1-py3-none-any.whl (11.3 MB)
Collecting xarray (from ECmean4)
  Using cached xarray-2023.5.0-py3-none-any.whl (994 kB)
Collecting netcdf4 (from ECmean4)
  Using cached netCDF4-1.6.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting dask (from ECmean4)
  Using cached dask-2023.6.0-py3-none-any.whl (1.2 MB)
INFO: pip is looking at multiple versions of ecmean4 to determine which version is compatible with other requirements. This could take a while.
Collecting ECmean4
  Using cached ECmean4-0.1.0-py3-none-any.whl (11.3 MB)
ERROR: Cannot install ecmean4==0.1.0 and ecmean4==0.1.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    ecmean4 0.1.1 depends on esmpy
    ecmean4 0.1.0 depends on esmpy

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

Explicitly requesting version 0.1.1 does not work either:

pip install 'ECmean4==0.1.1'
Collecting ECmean4==0.1.1
  Using cached ECmean4-0.1.1-py3-none-any.whl (11.3 MB)
Collecting xarray (from ECmean4==0.1.1)
  Using cached xarray-2023.5.0-py3-none-any.whl (994 kB)
Collecting netcdf4 (from ECmean4==0.1.1)
  Using cached netCDF4-1.6.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting dask (from ECmean4==0.1.1)
  Using cached dask-2023.6.0-py3-none-any.whl (1.2 MB)
INFO: pip is looking at multiple versions of ecmean4 to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement esmpy (from ecmean4) (from versions: none)
ERROR: No matching distribution found for esmpy

Interpolation with CDO to replace xESMF

We are not happy with the current xesmf+esmpy+ESMF dependency, since it has a lot of limitations, especially for unstructured grids.
We might want to exploit the CDO bindings to perform these operations: generating the weights should be straightforward, and later calls could potentially be performed online, as sketched below.
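A minimal sketch with the cdo Python bindings, assuming CDO is installed; the target grid and file names are illustrative:

from cdo import Cdo

cdo = Cdo()
infile = "tas_1990-1990.nc"  # illustrative input

# generate the bilinear interpolation weights once for the target grid ...
weights = cdo.genbil("r180x90", input=infile)

# ... then reuse them for every subsequent remapping
remapped = cdo.remap(f"r180x90,{weights}", input=infile, returnXDataset=True)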

Coherence and alignment of `global_mean` and `performance_indices`

The two most important scripts of ECmean4, or rather their two worker functions pi_worker and gm_worker, share large portions of code (and this is also true for some basic operations of the main functions, such as config file loading and setup).

For example, the browsing for variables and the unit adjustments are almost identical. It thus makes sense to create a set of common, imported functions that handle these operations, to reduce the risk of mistakes, increase modularity and compact the code.

Most importantly, data access is handled in two different ways by the two functions: global_mean still accesses the files year by year, while performance_indices loads everything required at once. In principle the second option should be more efficient given the xarray performance, but we need to double-check this. This should be made uniform as well.

Further update of docs

The docs need further improvements (e.g. the usage section does not cover CMOR compatibility).
I created a docs2 branch to be used for rolling updates.

Revisiting the file-handling functions

The structure that ECmean4 uses to seek data is currently made of a series of file-handling functions such as make_input_filenames(), _filter_filename_by_year(), _expand_filename(), etc.

This should be generalized to support #63, but also to cover cases where, for example, the year is not included in the filename structure.

Introduce testing

We need to have some testing in place: ideally unit testing, but also a simple integrated test just to make sure that results do not change. I will start exploring this in a testing branch using pytest; a sketch of such a regression test is given below.
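A minimal sketch of an integrated regression test with pytest; run_global_mean, its arguments and the reference table path are all hypothetical placeholders:

import filecmp
from ecmean import run_global_mean  # hypothetical entry point

def test_global_mean_regression(tmp_path):
    # run the tool on a small bundled sample and compare against a stored table
    run_global_mean(exp="cpld", year1=1990, year2=1990, outdir=tmp_path)
    assert filecmp.cmp(tmp_path / "global_mean.txt", "tests/ref/global_mean.txt")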
