Each CMIP6 dataset in the ESGF-CoG nodes consists of an identifier (e.g., CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr) and a version (e.g., 20190920), as seen, for example, here:
- CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr
Data Node: noresg.nird.sigma2.no
Version: 20190920
Total Number of Files (for all variables): 17
When we look at this dataset, we normally start by concatenating the netcdf files in time (here there are 17), using, for example, the xarray function `open_mfdataset`.
The problem comes when the netcdf files are not contiguous, so the resulting xarray dataset has an incomplete time grid. Some gaps are relatively easy to spot. For example, if just one of five files is missing, it might be obvious that there is a problem.
Example 1: S3 has 4 netcdf files, 5 are needed for continuity
In the current `s3://esgf-world/CMIP6` bucket, there are 4 netcdf files starting with
`https://aws-cloudnode.esgfed.org/thredds/fileServer/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp370/r1i1p1f1/Omon/thetao/gr/v20180701/`:
['thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_201501-203412.nc',
'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_203501-205412.nc',
'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_205501-207412.nc',
'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_209501-210012.nc']
At https://esgf-node.llnl.gov/search/cmip6/ there is another file available:
['thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_207501-209412.nc']
Never mind why one is missing; these things easily happen. But if we blindly concatenate the 4 files, we get a 20-year gap (2075-2094) in the time grid.
The real problem comes when there are many files and just one is missing.
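A gap like this can be caught from the file names alone, before anything is opened. The following is only a minimal sketch, assuming the names end in `_YYYYMM-YYYYMM.nc` as in Example 1; `month_index` and `find_monthly_gaps` are hypothetical helper names, not part of any library.

```python
import re

def month_index(yyyymm):
    """Convert 'YYYYMM' to a running month count, so consecutive months differ by 1."""
    return int(yyyymm[:4]) * 12 + int(yyyymm[4:6]) - 1

def find_monthly_gaps(filenames):
    """Return (end_of_previous_file, start_of_next_file) for every pair of
    neighbouring files whose ranges do not follow one another exactly.
    Note that overlapping ranges are reported too, not only gaps."""
    spans = []
    for name in filenames:
        m = re.search(r"_(\d{6})-(\d{6})\.nc$", name)
        if m is None:
            raise ValueError(f"no YYYYMM-YYYYMM range in {name!r}")
        spans.append((m.group(1), m.group(2)))
    spans.sort(key=lambda s: month_index(s[0]))
    gaps = []
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if month_index(next_start) != month_index(prev_end) + 1:
            gaps.append((prev_end, next_start))
    return gaps

# The four files currently in S3 (Example 1).
files = [
    "thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_201501-203412.nc",
    "thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_203501-205412.nc",
    "thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_205501-207412.nc",
    "thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_209501-210012.nc",
]
gaps = find_monthly_gaps(files)
print(gaps)  # [('207412', '209501')]
```

Adding the `207501-209412` file from ESGF to the list makes the result empty.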
Example 2: S3 has 85 netcdf files, 86 are needed for continuity
In the current `s3://esgf-world/CMIP6` bucket, there are 85 netcdf files starting with
`https://aws-cloudnode.esgfed.org/thredds/fileServer/CMIP6/ScenarioMIP/MIROC/MIROC-ES2L/ssp370/r1i1p1f2/day/vas/gn/v20200318/`:
['vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20150101-20151231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20160101-20161231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20170101-20171231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20180101-20181231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20190101-20191231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20200101-20201231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20210101-20211231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20220101-20221231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20230101-20231231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20240101-20241231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20250101-20251231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20260101-20261231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20270101-20271231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20280101-20281231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20290101-20291231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20300101-20301231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20310101-20311231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20320101-20321231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20330101-20331231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20340101-20341231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20350101-20351231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20360101-20361231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20370101-20371231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20380101-20381231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20390101-20391231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20400101-20401231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20410101-20411231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20420101-20421231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20430101-20431231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20440101-20441231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20450101-20451231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20460101-20461231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20470101-20471231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20480101-20481231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20490101-20491231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20500101-20501231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20510101-20511231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20520101-20521231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20530101-20531231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20540101-20541231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20550101-20551231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20560101-20561231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20570101-20571231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20580101-20581231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20590101-20591231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20600101-20601231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20610101-20611231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20620101-20621231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20630101-20631231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20640101-20641231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20650101-20651231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20660101-20661231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20670101-20671231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20680101-20681231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20690101-20691231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20700101-20701231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20720101-20721231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20730101-20731231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20740101-20741231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20750101-20751231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20760101-20761231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20770101-20771231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20780101-20781231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20790101-20791231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20800101-20801231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20810101-20811231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20820101-20821231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20830101-20831231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20840101-20841231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20850101-20851231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20860101-20861231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20870101-20871231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20880101-20881231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20890101-20891231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20900101-20901231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20910101-20911231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20920101-20921231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20930101-20931231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20940101-20941231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20950101-20951231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20960101-20961231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20970101-20971231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20980101-20981231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20990101-20991231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_21000101-21001231.nc']
The year 2071 is missing from S3, although it is available through ESGF-CoG.
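With 85 files the gap is hard to see by eye, but the same file-name approach finds it. This is a rough sketch only: it assumes names ending in `_YYYYMMDD-YYYYMMDD.nc` and a standard calendar (models using a 360-day calendar produce dates that `datetime` rejects, so they would need separate handling). `parse_day_range` and `find_daily_gaps` are hypothetical names, and the list of S3 file names is rebuilt from the pattern above rather than read from the bucket.

```python
import re
from datetime import date, datetime, timedelta

def parse_day_range(name):
    """Return (start, end) dates from a file name ending in _YYYYMMDD-YYYYMMDD.nc."""
    m = re.search(r"_(\d{8})-(\d{8})\.nc$", name)
    if m is None:
        raise ValueError(f"no YYYYMMDD-YYYYMMDD range in {name!r}")
    start = datetime.strptime(m.group(1), "%Y%m%d").date()
    end = datetime.strptime(m.group(2), "%Y%m%d").date()
    return start, end

def find_daily_gaps(filenames):
    """Return (end_of_previous_file, start_of_next_file) wherever the next
    file does not start on the day after the previous file ends."""
    spans = sorted(parse_day_range(name) for name in filenames)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if next_start != prev_end + timedelta(days=1):
            gaps.append((prev_end, next_start))
    return gaps

# Rebuild the Example 2 listing: one file per year, 2015-2100, without 2071.
prefix = "vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_"
s3_files = [f"{prefix}{y}0101-{y}1231.nc" for y in range(2015, 2101) if y != 2071]

daily_gaps = find_daily_gaps(s3_files)
for prev_end, next_start in daily_gaps:
    print(f"gap between {prev_end} and {next_start}")
```

For the 85 files above, the only pair reported is 2070-12-31 followed by 2072-01-01.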
In these two examples, the netcdf files are missing from S3 but do exist elsewhere. There are many other examples where the missing files are not available because of some oversight. For others, the files were never meant to be uploaded. For instance, particular experiments are often reported (by some, not all, modeling centers) for just a subset of the run time. For example, some of the 'abrupt-4xCO2' datasets only report one chunk at the beginning of the experiment (adjustment phase) and one chunk at the end (equilibrium). So I have allowed discontinuities in the 'abrupt-4xCO2' datasets (legitimate or not). Some datasets seem to have one year of daily data for only a subset of the years, so there are many discontinuities.
So here are some questions for opening this issue:
- Should we somehow allow the 'legitimate' non-contiguous datasets? If so, should we divide them up into contiguous chunks and serve them separately?
- What to do about datasets with missing netcdf files? Certainly we could try to complete the list, but if the files do not exist, what then?
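To make the "contiguous chunks" option in the first question concrete, here is one way the grouping could work. It is a sketch under assumptions, not a proposal for the actual pipeline: it handles only monthly file names ending in `_YYYYMM-YYYYMM.nc`, and `month_span` and `split_into_contiguous_runs` are hypothetical names.

```python
import re

def month_span(name):
    """Return (start, end) of a 'YYYYMM-YYYYMM' file name as running month counts."""
    m = re.search(r"_(\d{4})(\d{2})-(\d{4})(\d{2})\.nc$", name)
    if m is None:
        raise ValueError(f"no YYYYMM-YYYYMM range in {name!r}")
    y0, m0, y1, m1 = map(int, m.groups())
    return y0 * 12 + m0 - 1, y1 * 12 + m1 - 1

def split_into_contiguous_runs(filenames):
    """Group files into lists; within each list every file starts in the
    month right after the previous file ends."""
    runs = []
    prev_end = None
    for name in sorted(filenames, key=month_span):
        start, end = month_span(name)
        if runs and start == prev_end + 1:
            runs[-1].append(name)   # continues the current run
        else:
            runs.append([name])     # first file, or a gap: start a new run
        prev_end = end
    return runs

# The four S3 files from Example 1.
example_files = [
    "thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_201501-203412.nc",
    "thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_203501-205412.nc",
    "thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_205501-207412.nc",
    "thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_209501-210012.nc",
]
runs = split_into_contiguous_runs(example_files)
for run in runs:
    print(len(run), "file(s):", run[0], "...", run[-1])
```

For Example 1 this yields two runs: the three files covering 201501-207412, and the single 209501-210012 file. Each run could then be concatenated and served on its own.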
A cursory check of the current contents of the 's3://esgf-world/CMIP6' collection of netcdf files shows the following for the 212,299 datasets (collections of netcdf files) currently in the bucket, where 'total' is the number of datasets at the given frequency and 'non-contiguous' is the number of these datasets which have a non-contiguous set of netcdf files. I didn't check the hourly and sub-hourly datasets, since my crude method of using the netcdf file names to infer missing days is not as reliable for sub-daily datasets.
| frequency | total | non-contiguous | percent |
|---|---|---|---|
| yearly | 3368 | 94 | 3% |
| monthly | 179953 | 2490 | 1.4% |
| daily | 23758 | 1947 | 8% |
| hourly | 4471 | not checked | |
| sub-hourly | 749 | not checked | |