
cmip6-pipeline's Introduction

cmip6-pipeline

Pipeline for cloud-based CMIP6 data ingestion.

Coordination Meetings

The group of people working on "CMIP6 in the cloud" includes folks from LDEO, GFDL, Rhodium Group, and CEDA.

We hold regular coordination meetings every other Friday at 10am ET (Location: https://columbiauniversity.zoom.us/j/6902819781).

Google Calendar Link

cmip6-pipeline's People

Contributors

charlesbluca, dgergel, naomi-henderson, rabernat


cmip6-pipeline's Issues

Proposed addition to the GCS bucket, gs://cmip6/

I propose to add any new versions of datasets to our existing Google Cloud zarr bucket by adding a new prefix to our CMIP6 collection, gs://cmip6/CMIP6, which solves multiple problems.

  • This follows the naming of the other CMIP collections, gs://cmip6/CMIP5 and gs://cmip6/CMIP3

  • This allows us to start using the version in the object names (see the sketch after this list). For example:

     gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn, version 20200310
    

would now be stored in:

   gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310
  • The CSV catalog files will still be generated as before, but with the URL pointing only to the latest version, so novice users will not be required to find the latest version on their own
  • The old versions will not need to be removed, so existing datasets stay where they are and there will be no glitches or misunderstandings as new versions are added
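To make the mapping concrete, here is a minimal sketch (an illustrative helper, not part of the pipeline code) that builds the proposed versioned object prefix from the DRS facets:

def zarr_store_path(activity_id, institution_id, source_id, experiment_id,
                    member_id, table_id, variable_id, grid_label, version):
    # Build the proposed gs://cmip6/CMIP6/.../v<version> prefix from the DRS facets
    return ("gs://cmip6/CMIP6/"
            f"{activity_id}/{institution_id}/{source_id}/{experiment_id}/"
            f"{member_id}/{table_id}/{variable_id}/{grid_label}/v{version}")

# Reproduces the example above:
zarr_store_path("AerChemMIP", "AS-RCEC", "TaiESM1", "histSST",
                "r1i1p1f1", "AERmon", "ps", "gn", "20200310")
# -> 'gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310'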

Discussion of directory structure and catalog options

I'm opening this issue to follow up on the discussion we had at today's meeting. It would be great to align on conventions for how these cloud data are organized and cataloged. This will allow users to move freely between the different stores with minimal friction.

@RuthPetrie mentioned that CEDA was trying to figure out the optimal directory structure for their data. Our directory structure was documented by @charlesbluca and @naomi-henderson here:
https://pangeo-data.github.io/pangeo-cmip6-cloud/overview.html#directory-structure

I also made the point that I think it's better if we do not force users to rely on assumptions about directory structure in their analysis code. It's better to think of the directory structure as ephemeral and changeable. The reasons are:

  1. We don't want to get locked in to a specific storage location. We want the flexibility to be able to move data (across buckets, clouds, etc.) in the future.
  2. If we rely just on directories, in order to discover what data is actually available, users will need to either
    1. list the bucket (expensive, slow, might be impossible without credentials)
    2. have a bunch of try / fail logic to deal with data they expect to be there but is missing

Instead, I advocated for having all data requests go through a catalog. This doesn't have to be heavy-handed or complex. At its core, a catalog is a mapping between a dataset ID and a storage path. Working with NCAR and @andersy005, we defined the ESM collection spec: https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md
This is how all of our cloud data is currently cataloged. The ESM collection spec uses a very simple CSV file that anyone can open and parse.
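As a rough illustration of what catalog-mediated access looks like today with the ESM collection spec and intake-esm (using the public catalog URL that appears later in these issues; the search facets below are just an example):

import intake

# Open the ESM collection (a JSON description pointing at a simple CSV catalog)
col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")

# Find a dataset by its facets rather than by guessing a directory path
cat = col.search(source_id="NorESM2-LM", experiment_id="historical",
                 table_id="Omon", variable_id="thetao", member_id="r2i1p1f1")

# Load whatever matched; the storage location comes from the catalog, not from user code
dsets = cat.to_dataset_dict(zarr_kwargs={"consolidated": True})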

Work is underway to align with STAC (see NCAR/esm-collection-spec#21), although this has stalled a bit due to lack of effort. We should definitely try to revive this as I believe strongly that STAC is the future for cloud data catalogs.

Whatever we choose, it's very important that we align going forward.

cc @pangeo-forge/cmip6

Handling non-contiguous datasets

Each CMIP6 dataset in the ESGF-CoG nodes consists of an identifier (e.g., CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr) and a version (e.g., 20190920), as seen, for example, here:

  1. CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr
    Data Node: noresg.nird.sigma2.no
    Version: 20190920
    Total Number of Files (for all variables): 17

When we look at this dataset, we normally start by concatenating the netcdf files in time (here there are 17), using, for example, the xarray function 'open_mfdataset'.

The problem comes when the netcdf files are not contiguous and therefore the resulting xarray dataset has a time grid which is not complete. Some are relatively easy to spot. For example, if just one of five files is missing it might be obvious that there is a problem.

Example 1: S3 has 4 netcdf files, 5 are needed for continuity

In the current `s3://esgf-world/CMIP6` bucket, there are 4 netcdf files starting with `https://aws-cloudnode.esgfed.org/thredds/fileServer/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp370/r1i1p1f1/Omon/thetao/gr/v20180701/`:
['thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_201501-203412.nc',
 'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_203501-205412.nc',
 'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_205501-207412.nc',
 'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_209501-210012.nc']

At https://esgf-node.llnl.gov/search/cmip6/ there is another file available:

['thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_207501-209412.nc']

Never mind why one is missing, these things easily happen. But if we blindly concatenate the 4 files, we have a large gap in the time grid.

The real problem comes when there are many files and just one is missing.

Example 2: S3 has 85 netcdf files, 86 are needed for continuity

In the current `s3://esgf-world/CMIP6` bucket, there are 85 netcdf files starting with `https://aws-cloudnode.esgfed.org/thredds/fileServer/CMIP6/ScenarioMIP/MIROC/MIROC-ES2L/ssp370/r1i1p1f2/day/vas/gn/v20200318/`:

['vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20150101-20151231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20160101-20161231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20170101-20171231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20180101-20181231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20190101-20191231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20200101-20201231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20210101-20211231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20220101-20221231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20230101-20231231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20240101-20241231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20250101-20251231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20260101-20261231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20270101-20271231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20280101-20281231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20290101-20291231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20300101-20301231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20310101-20311231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20320101-20321231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20330101-20331231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20340101-20341231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20350101-20351231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20360101-20361231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20370101-20371231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20380101-20381231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20390101-20391231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20400101-20401231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20410101-20411231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20420101-20421231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20430101-20431231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20440101-20441231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20450101-20451231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20460101-20461231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20470101-20471231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20480101-20481231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20490101-20491231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20500101-20501231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20510101-20511231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20520101-20521231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20530101-20531231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20540101-20541231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20550101-20551231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20560101-20561231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20570101-20571231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20580101-20581231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20590101-20591231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20600101-20601231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20610101-20611231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20620101-20621231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20630101-20631231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20640101-20641231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20650101-20651231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20660101-20661231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20670101-20671231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20680101-20681231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20690101-20691231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20700101-20701231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20720101-20721231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20730101-20731231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20740101-20741231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20750101-20751231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20760101-20761231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20770101-20771231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20780101-20781231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20790101-20791231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20800101-20801231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20810101-20811231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20820101-20821231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20830101-20831231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20840101-20841231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20850101-20851231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20860101-20861231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20870101-20871231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20880101-20881231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20890101-20891231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20900101-20901231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20910101-20911231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20920101-20921231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20930101-20931231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20940101-20941231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20950101-20951231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20960101-20961231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20970101-20971231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20980101-20981231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20990101-20991231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_21000101-21001231.nc']

The year 2071 is missing from S3, although it is available through ESGF-CoG.

In these two examples, the missing netcdf files do exist elsewhere, but there are many other examples where the missing files are simply unavailable because of some oversight, and others where the files were never meant to be uploaded. For instance, particular experiments are often reported (by some, not all, modeling centers) for just a subset of the run time. For example, some of the 'abrupt-4xCO2' datasets only report one chunk at the beginning of the experiment (adjustment phase) and one chunk at the end (equilibrium). So I have allowed discontinuities in the 'abrupt-4xCO2' datasets (legitimate or not). Some datasets seem to have one year of daily data for only a subset of the years, so there are many discontinuities.

So here are some questions for opening this issue:

  1. Should we somehow allow the 'legitimate' non-contiguous datasets? If so, should we divide them up into contiguous chunks and serve them separately?
  2. What should we do about datasets with missing netcdf files? Certainly we could try to complete the list, but if the files do not exist, what then?

A cursory check of the current contents of the 's3://esgf-world/CMIP6' collection of netcdf files shows the following for the 212,299 datasets (collections of netcdf files) currently in the bucket, where 'total' is the number of datasets at the given frequency and 'non-contiguous' is the number of those datasets with a non-contiguous set of netcdf files. I didn't check the hourly and sub-hourly datasets, since my crude method of using the netcdf file names to infer missing days is not as reliable for sub-daily data.

frequency    total     non-contiguous   percent
yearly       23758     94               3%
monthly      179953    2490             1.4%
daily        23758     1947             8%
hourly       4471      not checked
sub-hourly   749       not checked
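For reference, a minimal sketch of the kind of filename-based contiguity check used for the daily counts above (daily files only; the helper names are illustrative):

import re
from datetime import datetime, timedelta

def daily_ranges(filenames):
    # Pull (start, end) dates like 20150101-20341231 out of CMIP6 daily file names
    for name in filenames:
        m = re.search(r"_(\d{8})-(\d{8})\.nc$", name)
        if m:
            yield m.group(1), m.group(2)

def is_contiguous_daily(filenames):
    # True only if each file starts the day after the previous file ends
    ranges = sorted(daily_ranges(filenames))
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
        gap = (datetime.strptime(next_start, "%Y%m%d")
               - datetime.strptime(prev_end, "%Y%m%d"))
        if gap != timedelta(days=1):
            return False
    return True

# The MIROC-ES2L example above returns False because 2071 is missing.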

Comparison of LDEO / ESGF / CEDA Cloud holdings

@naomi-henderson and @aradhakrishnanGFDL did a very interesting "diff" comparison of the two different cloud data stores.

I'm linking to Naomi's gist to have a persistent record of this: https://gist.github.com/naomi-henderson/d2ea493e8bc705f5551f7fce9c0402e5

It would be great to get CEDA included in this as well once they are up and running (cc @RuthPetrie). This exercise is somewhat related to the discussion of directory structure / catalog format in #7.

finding various implementation errors in the netcdf file tracking_ids

The dataset version has always caused trouble in our cmip6 pipeline. It is the only DRS element which is not stored in the netcdf file's metadata, yet we use the version to keep track of Datasets which have been modified. I have been using the tracking_id to look up the version through the handle services (http://hdl.handle.net / https://handle-esgf.dkrz.de), somewhat successfully.

But many implementation errors pop up. The gs://cmip6 zarr Datasets have tracking_ids which are concatenations of the tracking_ids of the netcdf files from which they were aggregated. In a perfect world, each of these tracking_ids would correspond to one and only one netcdf file, and each netcdf file would correspond to one and only one version. So I am collecting and categorizing the various issues and trying to come up with some sensible work-arounds. I will be collecting them here, if you want to help ...
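For context, a minimal sketch of the version lookup through the handle service; the REST endpoint is the standard Handle API, but the field name holding the version in ESGF records is an assumption and would need checking against real responses:

import requests

def version_from_tracking_id(tracking_id):
    # tracking_id looks like 'hdl:21.14100/....'; strip the scheme for the API call
    handle = tracking_id.replace("hdl:", "")
    resp = requests.get(f"https://hdl.handle.net/api/handles/{handle}")
    resp.raise_for_status()
    for value in resp.json().get("values", []):
        if value.get("type") == "VERSION_NUMBER":   # field name is an assumption
            return value["data"]["value"]
    return None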

A dynamic view of GCS dataset acquisition

I have been working on a quick-and-dirty way to make our dataset collection process more transparent. Here is an experimental google sheet which is updated automatically from the scripts (which are still running on local resources - ugh - four jobs in parallel).

The top sheet is a list of on-going processes, with two additional sheets to define the terms and random issues.

A new request came in this morning for 3-hourly data. This will take quite a long time, so I decided to share the page to keep the requestor in the loop (and help explain why it will take a while). I have paused our usual background jobs and will use this request to see what problems are encountered in the 3-hourly datasets!

Suggestions welcome! I am trying to work within the Google Sheets update quotas - a bit of a nuisance ...

aggregate data holdings by GB instead of count

This fantastic notebook developed by @aradhakrishnanGFDL and @naomi-henderson is super useful for understanding what data we have in the cloud: https://github.com/aradhakrishnanGFDL/gfdl-aws-analysis/blob/master/examples/a1r-CompareAWS-NCvsZarr.ipynb

However, it's currently aggregated by number of datasets.

Once #15 is implemented, it should be easy to get the size of each dataset. So we can aggregate by summing the sizes, rather than just counting datasets.

This is required for the AWS folks to move forward in allocating more storage space for us. We need the numbers in GB.
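As a sketch of what the aggregation could look like once a size column exists (the 'size_gb' column name, and even the catalog CSV location, are assumptions here):

import pandas as pd

df = pd.read_csv("https://storage.googleapis.com/cmip6/pangeo-cmip6.csv")  # master catalog (location assumed)

# Sum volume per activity/table instead of counting datasets
summary = (df.groupby(["activity_id", "table_id"])["size_gb"]   # 'size_gb' assumed to come from #15
             .sum()
             .sort_values(ascending=False))
print(summary.head())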

Issue with CanESM5-r*i1p1f1 ssp245 Omon data

There is an issue loading data for CanESM5 r*i1p1f1 for ssp245 from the GCS catalogue. Loading the original ESGF data works as expected. This appears to be an issue for variables in the Omon table. The same variables for different experiments / for r1i1p2f1 work as expected.

It is not immediately obvious what the issue is. The basic metadata seems ok, but any attempt to load / plot the data fails (verified across multiple users and machines). Could this be an issue in the conversion from netcdf?

Thanks!

A minimal working example:

import intake

cat_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"  # use this outside of CCCma / for public data
col = intake.open_esm_datastore(cat_url)
query = dict(experiment_id=['ssp245'], table_id=['Omon'], member_id='r1i1p1f1',
             variable_id=['fgco2'], source_id='CanESM5')
cat = col.search(**query)

ds_g = cat.to_dataset_dict(zarr_kwargs={"consolidated": True, "decode_times": True},
                           cdf_kwargs={'chunks': {'time': 1032}})['ScenarioMIP.CCCma.CanESM5.ssp245.Omon.gn']

ds_g.load()

ValueError                                Traceback (most recent call last)
...
ValueError: destination buffer too small; expected at least 838080, got 419040

Also see https://github.com/swartn/canesm5_pangeo_issues/blob/main/canesm5_r1i1p1f1_pangeo_issue.ipynb

Test request for CMIP6 data processing and uploading to Cloud

@dgergel , I am going to start documenting our work together here so others can follow if interested.

Diana and I are working through a particular CMIP6 data request in order to rewrite/extend/fix the code started in CMIP6-collection.

Our test request (see the last entries in the request sheet).

This represents a project of mutual interest to obtain the following datasets for as many models as possible:

table_id = 'day'
variable_id = ['pr', 'tasmin', 'tasmax']
experiment_id = ['historical', 'ssp126', 'ssp245', 'ssp370']

We started with member_id = 'r1i1p1f1'. Many models do not have a member_id matching this value, so I have now extended our wish list to at least one ensemble member from each experiment_id/source_id combination (chosen so that 'pr' is not from one run and 'tasmax' from another).

Note that there are separate rows for each variable - that is just so that I can process them in parallel.
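For reference, the same request expressed as an intake-esm query against the public catalog (a sketch; the catalog URL is the one used elsewhere in these issues):

import intake

col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
cat = col.search(
    table_id="day",
    variable_id=["pr", "tasmin", "tasmax"],
    experiment_id=["historical", "ssp126", "ssp245", "ssp370"],
)
# How many ensemble members each model provides per experiment
print(cat.df.groupby(["source_id", "experiment_id"]).member_id.nunique())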

Adding nominal resolution to catalog?

It would be great to have the nominal_resolution attribute in the intake catalogs so that people can search it / view it as a table. Is this something that is possible?
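A minimal sketch of one way the column could be filled in, assuming the nominal_resolution global attribute was carried over into each zarr store's metadata during conversion (the catalog CSV location and the 'zstore' column name follow the current Pangeo catalog, but treat them as assumptions):

import gcsfs
import pandas as pd
import xarray as xr

fs = gcsfs.GCSFileSystem(token="anon")
df = pd.read_csv("https://storage.googleapis.com/cmip6/pangeo-cmip6.csv")  # master catalog (location assumed)

def nominal_resolution(zstore):
    # Read the attribute from the store's consolidated metadata
    ds = xr.open_zarr(fs.get_mapper(zstore), consolidated=True)
    return ds.attrs.get("nominal_resolution")

df["nominal_resolution"] = df["zstore"].map(nominal_resolution)  # slow over the full catalog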

Migrate master database to bigquery (instead of CSV)

We currently use .csv files as our master source of truth on what is in the CMIP6 cloud archive.

A more robust and cloud-native way to do this would be to use BigQuery, Google Cloud's database product, to store this information. Then we could run SQL queries on the database, rather than downloading a big CSV file every time.

@charlesbluca, could you play around with exporting the CSV into BigQuery?

@naomi-henderson, can you summarize the process we are currently using to keep this CSV file up to date?
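As a starting point for the export, a minimal sketch of a CSV load using the official BigQuery client; the project/dataset/table names and the GCS location of the CSV are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,                      # infer the schema from the header row
    write_disposition="WRITE_TRUNCATE",   # replace the table on each refresh
)
job = client.load_table_from_uri(
    "gs://cmip6/pangeo-cmip6.csv",        # assumed location of the master CSV
    "my-project.cmip6.pangeo_cmip6",      # placeholder project.dataset.table
    job_config=job_config,
)
job.result()                              # block until the load job finishes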

Comparison of bucket indexing tools

As talked about in #7, regularly generating an index of all the files in a bucket/directory could be very useful in:

  • Generating/updating catalogs or databases of relevant datasets
  • Keeping track of files for synchronization purposes

There are a lot of tools that could do this work - some exclusive to specific cloud providers, others not. Some of these tools include:

  • gsutil (supports both Google Cloud and S3)
  • Rclone (supports a variety of cloud providers, including Google Cloud and S3)
  • AWS CLI (supports only S3)
  • S3P (supports only S3)

I tested the above tools on both Google Cloud and S3 (when relevant) to get a sense of which would have the best utility in listing the entirety of a large bucket. Some basic parameters of the testing include:

  • Target bucket(s) - Pangeo's CMIP6 buckets in both Google Cloud and S3 storage; both are ~550 TB of data comprising some 20,000,000+ files
  • Command - flat listing of each bucket's entire contents with size and modification time (relevant mainly for synchronization purposes)
  • Output redirection - currently all output is written to a file unaltered; this may change if we want to edit output in place before writing to file using something like sed

The output of these tests can be found here. Some observations:

  • For S3 listing, S3P was by far the fastest, running 4-6x faster than the AWS CLI listing (~40 min versus ~165 min)
  • For Google Cloud listing, Rclone was by far the fastest, running nearly 4x faster than the gsutil listing (~47 min versus ~173 min)
  • Both gsutil and Rclone had trouble listing S3 storage, with both commands failing to list the bucket within the 6 hour timeout; the listing of modification time likely influenced these results, as listing is significantly faster in both cases when excluding this information

Obviously additional testing of more cloud listing tools (MinIO client for example) would be ideal, but these results provide some motivation to dig deeper into Rclone and S3P to index CMIP6 data in Google Cloud and S3 storage, respectively.

Making GCS -> S3 pipeline

Since I completed the initial copy of CMIP6 from GCS to S3, I have been stuck on how to keep the two datasets synchronized moving forward. After getting in touch with our contact at Amazon, I was given the following strategy:

  • Turn on Pub/Sub notifications for gs://cmip6 to capture object creation events
  • Write a handler for the notifications that sends them to AWS as AWS Batch/Lambda jobs (depending on individual object size)
  • Write a Batch/Lambda container to receive the messages and copy files from GCS to S3

The main obstacle I foresee here is that I am unsure what permissions we have to set up Pub/Sub notifications for the copy of CMIP6 on GCS; @rabernat, do we have a point of contact for Google's public dataset program? Beyond this, the other issue to consider is that all catalog/collection files in the root of the bucket will have to be altered after they are uploaded to/updated on S3, as they all use URLs pointing to the GCS bucket.

Once this is in place, the only remaining step would be to sync the data altered in the time between the initial upload and the creation of this pipeline.
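For concreteness, a minimal sketch of the copy step from the strategy above, written as a Lambda-style handler; the event shape and the target bucket name are assumptions, not the final design:

import boto3
import requests

s3 = boto3.client("s3")

def handler(event, context):
    # Assume the relayed Pub/Sub notification carries the new object's name
    key = event["object_name"]
    gcs_url = f"https://storage.googleapis.com/cmip6/{key}"   # gs://cmip6 is publicly readable
    resp = requests.get(gcs_url, stream=True)
    resp.raise_for_status()
    # Stream the object straight into the (assumed) S3 mirror bucket
    s3.upload_fileobj(resp.raw, "cmip6-pds", key)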

Defining the inputs to a single zarr store output

On today's call, we identified a certain "breakpoint" in Naomi's workflow; after she is done cleaning and checking the user's request, she has a list of zarr stores that need to be created / uploaded and theoretically should be able to just "let it go" and have this happen automatically.

Notwithstanding all the problems that can occur in this step, let's try to define the atomic structure of a single one of these jobs.

What are the inputs that completely define a single CMIP6 zarr store output?

Imagine we were writing a function

def create_single_cmip6_zarr_store(*args, **kwargs):

What are the arguments to this function?

If we can define this, we can write the function, based on Naomi's existing notebooks. Once we write the function, we can cloudify it and investigate different execution frameworks like Apache Beam / Google Cloud Dataflow to help us automate this part of the workflow.
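As a straw-man answer (an assumption to be argued with, not a decision), the DRS facets plus the source files and the target location seem close to sufficient:

def create_single_cmip6_zarr_store(
    activity_id, institution_id, source_id, experiment_id,
    member_id, table_id, variable_id, grid_label, version,
    netcdf_urls,            # ordered list of the source netcdf files for this dataset
    target_store,           # e.g. "gs://cmip6/CMIP6/.../v20200310" per the versioning proposal
    time_chunk_size=None,   # optional chunking hint
):
    ...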

Consolidating metadata in Zarr holdings

Hi @naomi-henderson,

In our CEDA pipeline for CMIP6 cloud data, we do the following:

  1. Read NetCDF files from local disk
  2. Define xarray/dask/zarr chunks
  3. Write directly to object store

We are using the consolidated=True option when we write the data, which is supposed to group all the (small) metadata files into one large consolidated metadata file.

The code does consolidate the metadata, but it also writes all the separate small metadata files to the object store before the consolidation. We think this is slowing down our write processes.

In your workflow, I think you write the data to POSIX disk, then copy the files into object store. When you do this, do you delete all the extra small metadata files and only preserve the consolidated file?
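For concreteness, a minimal sketch of that write-to-POSIX-then-copy pattern as I understand it (paths are illustrative); the store is written and consolidated locally, so only finished stores ever hit the object store:

import subprocess
import xarray as xr

ds = xr.open_mfdataset("thetao_Omon_*.nc", combine="by_coords")

local_store = "/scratch/thetao_Omon_example.zarr"       # illustrative scratch path
ds.to_zarr(local_store, mode="w", consolidated=True)    # .zmetadata written locally

# Copy the finished store into the bucket in one pass (target path illustrative)
target = "gs://cmip6/CMIP6/CMIP/NCC/NorESM2-LM/historical/r2i1p1f1/Omon/thetao/gr/v20190920"
subprocess.run(["gsutil", "-m", "cp", "-r", local_store, target], check=True)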

Also, do you have an idea of the write speeds that you get to object store?

Thanks
