ncar / intake-esm-datastore
Intake-esm Datastore
License: Apache License 2.0
@matt-long, I saw this a while ago, and I meant to ask you: is the member_id 0 or 5?
intake-esm-datastore/collection-definitions/cesm1-le-collection.yml
Lines 18 to 21 in a706bee
@matt-long, I suggest we move the build script build_collection.py into intake-esm. This way, once intake-esm is installed, the user gets this functionality out of the box.
Would it be useful and feasible to build a catalog for SubX? Or would it be more useful/easier to just build a catalog based on intake-xarray? With SubX the data will be remote.
https://github.com/kpegion/SubX/blob/master/Python/download_data/generate_ts_py_ens_files.ksh
Now that intake-esm supports a relative path from the json to the csv.gz, I think all the json files in catalogs/ with a full path should change to the relative path.
$ cd /glade/collections/cmip/catalog/intake-esm-datastore/catalogs
$ grep catalog_file *.json
campaign-cesm2-cmip6-timeseries.json: "catalog_file": "/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.csv.gz",
glade-cesm1-cmip5-timeseries.json: "catalog_file": "/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm1-cmip5-timeseries.csv.gz",
glade-cesm1-le.json: "catalog_file": "/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm1-le.csv.gz",
glade-cmip5.json: "catalog_file": "/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip5.csv.gz",
glade-cmip6.json: "catalog_file": "/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.csv.gz",
mistral-cmip5.json: "catalog_file": "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip5.csv.gz",
mistral-cmip6.json: "catalog_file": "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.csv.gz",
mistral-miklip.json: "catalog_file": "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-miklip.csv.gz",
mistral-MPI-GE.json: "catalog_file": "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-MPI-GE.csv.gz",
pangeo-cmip6.json: "catalog_file": "https://storage.googleapis.com/cmip6/pangeo-cmip6.csv",
stratus-cesm1-le.json: "catalog_file": "stratus-cesm1-le.csv",
I plan on changing campaign-cesm2-cmip6-timeseries.json, glade-cesm1-cmip5-timeseries.json, and glade-cesm1-le.json when I next update those catalogs; stratus-cesm1-le.json is already doing this. I definitely think glade-cmip5.json and glade-cmip6.json should follow suit, though I am less familiar with mistral.
We currently have three catalogs for different CESM output accessible from cheyenne / dav (excluding the CMOR-ized CMIP output): campaign-cesm2-cmip6-timeseries, glade-cesm1-cmip5-timeseries, and glade-cesm1-le (which should actually point to data on campaign storage and be renamed campaign-cesm1-le). I think all three of these should follow the same naming convention for columns in the csv file, and should include:
experiment
case
file_fullpath
file_basename
date_range
sequence_order
member_id
component
grid
stream
variable
year_offset
parent_experiment
parent_member_id
branch_year_in_parent
branch_year_in_child
pertlim
With the following notes:
- Since branch_year_in_parent == branch_year_in_child, can we define the catalog with a YAML file that simply specifies branch_year and sets both columns to that one value?
- If pertlim is not specified in the YAML file, it should be set to zero.
- I've toyed with the idea of including a machine column as well, namely as a way to note the differences between ensemble members 101 - 105 and 001 - 005 in the CESM1 Large Ensemble, but I think that might be too burdensome when creating future catalogs. I'm open to other people's thoughts on that, though.
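To make the convention concrete, here is a hypothetical sketch (helper names are mine, not intake-esm API) that checks a catalog DataFrame against the proposed column list and applies the two defaulting rules from the notes:

```python
import pandas as pd

# Proposed shared column set for the three CESM catalogs (from the list above).
REQUIRED_COLUMNS = [
    "experiment", "case", "file_fullpath", "file_basename", "date_range",
    "sequence_order", "member_id", "component", "grid", "stream", "variable",
    "year_offset", "parent_experiment", "parent_member_id",
    "branch_year_in_parent", "branch_year_in_child", "pertlim",
]

def missing_columns(df: pd.DataFrame) -> list:
    """Columns from the proposed convention absent from a catalog DataFrame."""
    return [c for c in REQUIRED_COLUMNS if c not in df.columns]

def expand_entry(entry: dict) -> dict:
    """Apply the notes above: a single branch_year fills both branch-year
    columns, and pertlim defaults to zero when unspecified."""
    out = dict(entry)
    if "branch_year" in out:
        year = out.pop("branch_year")
        out.setdefault("branch_year_in_parent", year)
        out.setdefault("branch_year_in_child", year)
    out.setdefault("pertlim", 0)
    return out
```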
Note that this issue supersedes #48 and #53, and a solution will do the same to PR #49, so I will close them in favor of tracking conversation in a single place (namely this ticket).
The original SSP experiments for CMIP6 in CESM2 had a bug in some emissions data, so they were re-run. Catalogs should not include the original ensemble members. The new runs are
b.e21.BSSP126cmip6.f09_g17.CMIP6-SSP1-2.6.101
b.e21.BSSP126cmip6.f09_g17.CMIP6-SSP1-2.6.102 (MOAR)
b.e21.BSSP126cmip6.f09_g17.CMIP6-SSP1-2.6.103
b.e21.BSSP245cmip6.f09_g17.CMIP6-SSP2-4.5.101
b.e21.BSSP245cmip6.f09_g17.CMIP6-SSP2-4.5.102 (MOAR)
b.e21.BSSP245cmip6.f09_g17.CMIP6-SSP2-4.5.103
b.e21.BSSP370cmip6.f09_g17.CMIP6-SSP3-7.0.101
b.e21.BSSP370cmip6.f09_g17.CMIP6-SSP3-7.0.102 (MOAR)
b.e21.BSSP370cmip6.f09_g17.CMIP6-SSP3-7.0.103
b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.101
b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.102 (MOAR)
b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.103
Provenance of the new runs:
SSP.101: RUN_REFCASE=b.e21.BHIST.f09_g17.CMIP6-historical.010 (piCntl=y841)
SSP.102: RUN_REFCASE=b.e21.BHIST.f09_g17.CMIP6-historical.011 (piCntl=y871)
SSP.103: RUN_REFCASE=b.e21.BHIST.f09_g17.CMIP6-historical.004 (piCntl=y501)
Currently the raw output is available on glade and is in the process of being CMORized. Once that is complete, the raw output will be moved to campaign, at which point I will update campaign-cesm2-cmip6-timeseries; hopefully someone else can jump in and update glade-cmip6 plus the cloud catalogs. (If you are that someone else, please add yourself to the assignees list!)
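As a sketch of how a catalog could be scrubbed of the original ensemble members, assuming it carries experiment and member_id columns with the CMIP6-SSP* names from the case strings above (the helper and the exact column values are my assumptions):

```python
import pandas as pd

# Hypothetical filter: drop the original (buggy-emissions) SSP ensemble
# members, keeping only the re-run members 101-103 listed above.
SSP_EXPERIMENTS = {"CMIP6-SSP1-2.6", "CMIP6-SSP2-4.5",
                   "CMIP6-SSP3-7.0", "CMIP6-SSP5-8.5"}
RERUN_MEMBERS = {101, 102, 103}

def drop_original_ssp_members(df: pd.DataFrame) -> pd.DataFrame:
    """Keep non-SSP rows as-is; for SSP rows keep only the re-run members."""
    is_ssp = df["experiment"].isin(SSP_EXPERIMENTS)
    keep = ~is_ssp | df["member_id"].isin(RERUN_MEMBERS)
    return df[keep].reset_index(drop=True)
```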
With help from @andersy005 I turned @matt-long's netcdf-based CESM2-CMIP6 metadata store into a gzipped CSV file. It's on glade at /glade/work/mlevy/CMIP6-CESM2_only-NOT_CMORIZED.csv.gz.
Python code to generate this:
import xarray as xr  # the xarray import was missing from the original snippet

# Open the legacy netcdf metadata store and flatten it into a DataFrame
ds = xr.open_dataset('/glade/u/home/mclong/.intake_esm/collections/CESM2-CMIP6.nc')
df = ds.to_dataframe()
# Drop columns that are not needed in the CSV catalog
df = df.drop(columns=['resource', 'resource_type', 'direct_access', 'file_basename',
                      'ctrl_branch_year', 'sequence_order', 'grid'])
# Rename file_fullpath -> path
df = df.rename(columns={'file_fullpath': 'path'})
df.to_csv('/glade/work/mlevy/CMIP6-CESM2_only-NOT_CMORIZED.csv.gz',
          compression='gzip', index=False)
Looking at cesm1-le-collection.yml, the existing collection is missing the RCP-4.5 dataset and can also take advantage of the ctrl_experiment / ctrl_member_id fields we recently added to the legacy intake-esm code used to go from yaml -> netcdf (ahead of the netcdf -> csv conversion).
I'll update the yaml as well as the json / csv in a coming PR.
Hi all,
I understand there is a flag that can be used to ensure that you only ingest the latest version of files in a CMIP6 catalogue. My query is how to apply it when ingesting from a .json store, e.g.:
col_url = 'https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json'
col = intake.open_esm_datastore(col_url)
How can you apply the --pick-latest-version flag to the above?
Many thanks.
Just wondering if anyone objects to this change happening soon. Any local repos would need to run these commands:
git branch -m master main
git fetch origin
git branch -u origin/main main
I think this is an issue with the catalog itself, rather than an intake-esm problem? @sgyeager reported this in Zulip, and I've verified that I am running into the same error both on Casper and my local laptop.
1) Create a test environment (mamba create -n test-intake intake-esm)
2) git clone https://github.com/NCAR/intake-esm-datastore.git
3) cd intake-esm-datastore
Python 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:01:35) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import intake
>>> import intake_esm
>>> intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.6.3
dask: 2024.2.0
fastprogress: 1.0.3
fsspec: 2024.2.0
gcsfs: None
intake: 0.7.0
intake_esm: 2024.2.6
netCDF4: 1.6.5
pandas: 2.2.0
requests: 2.31.0
s3fs: None
xarray: 2024.1.1
zarr: 2.17.0
>>> col = intake.open_esm_datastore('catalogs/glade-cesm2-le.json')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mlevy/miniconda3/envs/test-intake/lib/python3.12/site-packages/intake_esm/core.py", line 107, in __init__
self.esmcat = ESMCatalogModel.load(
^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlevy/miniconda3/envs/test-intake/lib/python3.12/site-packages/intake_esm/cat.py", line 242, in load
cat = cls.model_validate(data)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlevy/miniconda3/envs/test-intake/lib/python3.12/site-packages/pydantic/main.py", line 509, in model_validate
return cls.__pydantic_validator__.validate_python(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for ESMCatalogModel
aggregation_control.aggregations.0.options
Input should be a valid dictionary [type=dict_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.6/v/dict_type
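If the problem really is in the catalog JSON, one possible workaround (my own sketch, not an official fix) is to replace a null aggregation "options" entry with an empty dict before handing the data to intake-esm, since the pydantic model validates it as a dict:

```python
import json

def fill_null_aggregation_options(esmcat: dict) -> dict:
    """Workaround sketch: recent intake-esm versions validate each
    aggregation's 'options' as a dict, so replace any null with {}."""
    aggs = esmcat.get("aggregation_control", {}).get("aggregations", [])
    for agg in aggs:
        if agg.get("options") is None:
            agg["options"] = {}
    return esmcat
```

The cleaner long-term fix is presumably to patch glade-cesm2-le.json itself so "options" is never null.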
There are ~26000 files on campaign that we know we don't want to include in our catalog; rather than filtering them out in every notebook where we read the catalog, it would be easier for me to update the .csv.gz file here.
@matt-long has recommended better file names:
campaign_via_glade-cmip6_NOT_CMORIZED -> campaign-cesm2-cmip6-timeseries
glade-cmip5_NOT_CMORIZED -> glade-cesm1-cmip5-timeseries
The former DKRZ cluster Mistral is no longer running (see e.g. the last sentence of this article). This makes the following catalogs outdated:
marbl-ecosys/cesm2-marbl#3 and marbl-ecosys/cesm2-marbl#4 (and probably some others) make it seem like it would be useful to have a catalog pointing to a variety of observational datasets on glade. Does this exist already? If it doesn't, should I build it? And does it make sense to have a single catalog for all obs, or should each component have its own? (e.g. ocean-obs, atm-obs...)
I'm trying to add an RCP8.5 column to a table I'm generating from /glade/work/mlevy/intake-esm-collection/json/glade-cesm1-cmip5-timeseries.json, and the catalog doesn't currently include that data. I've added some of the HPSS time series for b40.rcp8_5.1deg.bdrd.001 and b40.rcp8_5.1deg.bprp.002 to /glade/p/cgd/oce/projects/cesm2-marbl/intake-esm-data for the time being; I just need to regenerate the csv file to include these new files.
We have many different CESM simulations and I would like to create an intake-esm collection of them. The output files are monthly mean netcdf files and contain many variables.
I have created a collection.json file:
{
"esmcat_version": "0.1.0",
"id": "CESM_simulations",
"description": "This is an ESM collection for CESM1 simulations.",
"catalog_file": "simulations.csv",
"attributes": [
{ "column_name": "component", "vocabulary": ""},
{ "column_name": "frequency", "vocabulary": ""},
{ "column_name": "experiment", "vocabulary": ""},
{ "column_name": "variable", "vocabulary": ""}
],
"assets": {
"column_name": "path",
"format": "netcdf"
}
}
and a simulations.csv:
component,frequency,experiment,path
ocn,monthly,CTRL,simulation1.pop.h.0001-01.nc
ocn,monthly,CTRL,simulation1.pop.h.0001-02.nc
I can create a catalogue with
cat = intake.open_esm_datastore('collection.json').search(experiment=['CTRL'])
which results in
CESM_simulations-ESM Collection with 2 entries:
> 1 component(s)
> 1 frequency(s)
> 1 experiment(s)
> 2 path(s)
but when I create a dataset with dset_dict = cat.to_dataset_dict(cdf_kwargs={'decode_times': False}), it returns a dataset with only a single time coordinate. Calling dset_dict['ocn.monthly.CTRL'] yields
<xarray.Dataset>
Dimensions: (bnds: 2, d2: 2, nlat: 2400, nlon: 3600, time: 1, z_t: 42, z_t_150m: 12, z_w: 42, z_w_bot: 42, z_w_top: 42)
Coordinates:
* time (time) float64 7.302e+04
* z_t (z_t) float32 500.622 1506.873 ... 562499.9 587499.9
* z_t_150m (z_t_150m) float32 500.622 1506.873 ... 14895.824
* z_w (z_w) float32 0.0 1001.244 ... 549999.9 574999.9
* z_w_top (z_w_top) float32 0.0 1001.244 ... 549999.9 574999.9
* z_w_bot (z_w_bot) float32 1001.244 2012.502 ... 599999.9
ULONG (nlat, nlon) float64 ...
ULAT (nlat, nlon) float64 ...
TLONG (nlat, nlon) float64 ...
TLAT (nlat, nlon) float64 ...
Dimensions without coordinates: bnds, d2, nlat, nlon
Data variables:
time_bound (time, d2) float64 ...
dz (z_t) float32 ...
dzw (z_w) float32 ...
KMT (nlat, nlon) float64 ...
KMU (nlat, nlon) float64 ...
REGION_MASK (nlat, nlon) float64 ...
UAREA (nlat, nlon) float64 ...
TAREA (nlat, nlon) float64 ...
HU (nlat, nlon) float64 ...
HT (nlat, nlon) float64 ...
DXU (nlat, nlon) float64 ...
DYU (nlat, nlon) float64 ...
DXT (nlat, nlon) float64 ...
DYT (nlat, nlon) float64 ...
HTN (nlat, nlon) float64 ...
HTE (nlat, nlon) float64 ...
HUS (nlat, nlon) float64 ...
HUW (nlat, nlon) float64 ...
ANGLE (nlat, nlon) float64 ...
ANGLET (nlat, nlon) float64 ...
days_in_norm_year float64 ...
grav float64 ...
omega float64 ...
radius float64 ...
cp_sw float64 ...
sound float64 ...
vonkar float64 ...
cp_air float64 ...
rho_air float64 ...
rho_sw float64 ...
rho_fw float64 ...
stefan_boltzmann float64 ...
latent_heat_vapor float64 ...
latent_heat_fusion float64 ...
ocn_ref_salinity float64 ...
sea_ice_salinity float64 ...
T0_Kelvin float64 ...
salt_to_ppt float64 ...
ppt_to_salt float64 ...
mass_to_Sv float64 ...
heat_to_PW float64 ...
salt_to_Svppt float64 ...
salt_to_mmday float64 ...
momentum_factor float64 ...
hflux_factor float64 ...
fwflux_factor float64 ...
salinity_factor float64 ...
sflux_factor float64 ...
nsurface_t float64 ...
nsurface_u float64 ...
KE (time, z_t, nlat, nlon) float32 ...
TEMP (time, z_t, nlat, nlon) float32 ...
SALT (time, z_t, nlat, nlon) float32 ...
SSH2 (time, nlat, nlon) float32 ...
SHF (time, nlat, nlon) float32 ...
SFWF (time, nlat, nlon) float32 ...
EVAP_F (time, nlat, nlon) float32 ...
PREC_F (time, nlat, nlon) float32 ...
SNOW_F (time, nlat, nlon) float32 ...
MELT_F (time, nlat, nlon) float32 ...
ROFF_F (time, nlat, nlon) float32 ...
SALT_F (time, nlat, nlon) float32 ...
SENH_F (time, nlat, nlon) float32 ...
LWUP_F (time, nlat, nlon) float32 ...
LWDN_F (time, nlat, nlon) float32 ...
MELTH_F (time, nlat, nlon) float32 ...
IAGE (time, z_t, nlat, nlon) float32 ...
WVEL (time, z_w_top, nlat, nlon) float32 ...
UET (time, z_t, nlat, nlon) float32 ...
VNT (time, z_t, nlat, nlon) float32 ...
UES (time, z_t, nlat, nlon) float32 ...
VNS (time, z_t, nlat, nlon) float32 ...
PD (time, z_t, nlat, nlon) float32 ...
HMXL (time, nlat, nlon) float32 ...
XMXL (time, nlat, nlon) float32 ...
TMXL (time, nlat, nlon) float32 ...
HBLT (time, nlat, nlon) float32 ...
XBLT (time, nlat, nlon) float32 ...
TBLT (time, nlat, nlon) float32 ...
SSH (time, nlat, nlon) float64 ...
time_bnds (time, bnds) float64 ...
TAUX (time, nlat, nlon) float64 ...
TAUY (time, nlat, nlon) float64 ...
UVEL (time, z_t, nlat, nlon) float64 ...
VVEL (time, z_t, nlat, nlon) float64 ...
Attributes:
title: spinup_pd_maxcores_f05_t12
history: Thu Sep 14 23:06:30 2017: ncks -A /projects/0...
Conventions: CF-1.0; http://www.cgd.ucar.edu/cms/eaton/net...
contents: Diagnostic and Prognostic Variables
source: CCSM POP2, the CCSM Ocean Component
revision: $Id: tavg.F90 34115 2012-01-25 22:35:19Z njn01 $
calendar: All years have exactly 365 days.
start_time: This dataset was created on 2017-04-15 at 12:...
cell_methods: cell_methods = time: mean ==> the variable va...
nsteps_total: 25052952
tavg_sum: 86399.99999999974
CDI: Climate Data Interface version 1.7.0 (http://...
CDO: Climate Data Operators version 1.7.0 (http://...
NCO: "4.6.0"
history_of_appended_files: Thu Sep 14 23:06:30 2017: Appended file /proj...
intake_esm_varname: None
How do I concatenate along the time axis?
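A likely cause: the collection.json above has no aggregation_control section, so intake-esm has nothing telling it to concatenate the per-month files along time. A hedged sketch of what could be added, assuming a time_range column (and a variable column, which the attributes list already declares) is also added to simulations.csv so files within one group can be ordered; the field names follow the ESM collection spec, but treat the exact options as an assumption:

```json
"aggregation_control": {
  "variable_column_name": "variable",
  "groupby_attrs": ["component", "experiment", "frequency"],
  "aggregations": [
    {
      "type": "join_existing",
      "attribute_name": "time_range",
      "options": {"dim": "time"}
    }
  ]
}
```

With a join_existing aggregation over time_range in place, to_dataset_dict should return one dataset per (component, experiment, frequency) group with all time slices concatenated.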
I am currently converting our CMIP6 nc files to zarr, since I am experiencing severe problems when trying to work with the giant chunks of netcdf files.
Is there a way to use the builder scripts to create a catalog for just the zarr files?
This seems very similar to what @naomi-henderson has done for the Pangeo cloud. Are those scripts available publicly?
Consistent with intake-esm terminology, let's change collection-definitions to collection-input.
I've put together a file parser for the cesm2_cmip6 collection; however, I am not sure I am getting everything right, especially the experiment attribute for the Decadal Prediction (DCPP) output.
For instance, here's what I get for one file:
/glade/collections/cdg/timeseries-cmip6/DCPP/011-020/b.e11.BDP.f09_g16.1969-11.014/atm/proc/tseries/month_1/b.e11.BDP.f09_g16.1969-11.014.cam.h0.SOLIN.196911-197912.nc
In [1]: from cesm import cesm2_cmip6_parser
In [2]: f = "/glade/collections/cdg/timeseries-cmip6/DCPP/011-020/b.e11.BDP.f09_g16.1969-11.014/atm/proc/tseries/month_1/b.e11.BDP.f09_g16.1969-11.014.cam.h0.SOLIN.196911-197912.nc"
In [3]: cesm2_cmip6_parser(f)
Out[3]:
{'path': '/glade/collections/cdg/timeseries-cmip6/DCPP/011-020/b.e11.BDP.f09_g16.1969-11.014/atm/proc/tseries/month_1/b.e11.BDP.f09_g16.1969-11.014.cam.h0.SOLIN.196911-197912.nc',
'case': 'b.e11.BDP.f09_g16.1969-11.014',
'variable': 'SOLIN',
'date_range': '196911-197912',
'stream': 'cam.h0',
'component': 'atm',
'experiment': '1969-11'}
Note that I am getting experiment=1969-11. Is this right, or should we treat DCPP outputs as a special case?
I seem to be getting the right attributes for outputs from other experiments:
In [4]: f2 = '/glade/collections/cdg/timeseries-cmip6/f.e21.F1850_BGC.f09_f09_mg17.CFMIP-amip-piForcing.001/atm/proc/tseries/month_1/f.e21.F1850_BGC.f09_f09_mg17.CFMIP-amip-piForcing.001.cam.h0.CLD_CAL_UN.187001-191912.nc'
In [5]: cesm2_cmip6_parser(f2)
Out[5]:
{'path': '/glade/collections/cdg/timeseries-cmip6/f.e21.F1850_BGC.f09_f09_mg17.CFMIP-amip-piForcing.001/atm/proc/tseries/month_1/f.e21.F1850_BGC.f09_f09_mg17.CFMIP-amip-piForcing.001.cam.h0.CLD_CAL_UN.187001-191912.nc',
'case': 'f.e21.F1850_BGC.f09_f09_mg17.CFMIP-amip-piForcing.001',
'variable': 'CLD_CAL_UN',
'date_range': '187001-191912',
'stream': 'cam.h0',
'component': 'atm',
'experiment': 'CFMIP-amip-piForcing'}
Originally posted by @andersy005 in #47
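If DCPP does get treated as a special case, one hedged option (my own sketch, not part of cesm2_cmip6_parser) is to post-process the token parsed from the case name: a YYYY-MM value is a DCPP start date rather than an experiment name, so report the experiment as DCPP and keep the date separately:

```python
import re

def split_dcpp_experiment(case: str) -> dict:
    """Hypothetical post-processing of a CESM case name: if the experiment
    token (second-to-last dot-separated field) looks like a YYYY-MM start
    date, label the experiment 'DCPP' and keep the date as dcpp_init_date."""
    token = case.split(".")[-2]  # e.g. '1969-11' in b.e11.BDP.f09_g16.1969-11.014
    if re.fullmatch(r"\d{4}-\d{2}", token):
        return {"experiment": "DCPP", "dcpp_init_date": token}
    return {"experiment": token, "dcpp_init_date": None}
```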
Hi, I get an error when trying to open the GLADE CMIP5 catalog; maybe the catalog_file path in the JSON needs updating?
import intake
import intake_esm
url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/glade-cmip5.json"
col = intake.open_esm_datastore(url)
col
FileNotFoundError: [Errno 2] No such file or directory: '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip5.csv.gz'
https://rda.ucar.edu/datasets/ds630.0/
Resources/Documentation:
I have built a catalog file using cmip.py with the --pick-latest-version flag. I am still seeing duplicate versions in my catalog file.
I experimented with _pick_latest_version locally and was able to rectify my dataframe by adding the field 'dcpp_init_year' to the fields that are excluded from the groupby call:
grpby = list(set(df.columns.tolist()) - {'path', 'version', 'dcpp_init_year'})
In my case the dcpp_init_year is populated by NaNs, which might throw off the pandas groupby.
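Putting the fix together, a self-contained sketch (my own helper, not the cmip.py implementation) that excludes the NaN-filled dcpp_init_year from the groupby and keeps only the highest version in each group:

```python
import pandas as pd

def pick_latest_version(df: pd.DataFrame) -> pd.DataFrame:
    """Group on every column except path, version, and dcpp_init_year (whose
    NaNs would otherwise break group membership) and keep the row with the
    highest version string in each group."""
    grpby = list(set(df.columns) - {"path", "version", "dcpp_init_year"})
    return (df.sort_values("version")
              .groupby(grpby, dropna=False, as_index=False)
              .last())
```

Sorting by version works here because vYYYYMMDD strings order lexicographically; an alternative is passing dropna=False to the original groupby so NaN keys form their own group.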