Comments (8)
My case is extremely simple since all zarr stores are stored in a very regular directory structure and I am using consolidated metadata. So I just list all of the .zmetadata files and parse the paths into a pandas dataframe. This is the script we used on the GC collection itself. If you use glob to create a list of all .zmetadata files, then you can just use the second half of the script.
import datetime
import io
from google.cloud import storage
import pandas as pd

def index_cmip6(request):
    SOURCE_BUCKET = 'pangeo-cmip6'
    TARGET_BUCKET = 'pangeo-cmip6'
    TARGET_FILENAME = 'pangeo-cmip6.csv'
    storage_client = storage.Client()
    zarr_blobs = [blob.name for blob in storage_client.list_blobs(SOURCE_BUCKET)
                  if '.zmetadata' in blob.name]
    df = pd.DataFrame(zarr_blobs, columns=['store'])
    files = df.store.values
    ddict = {}
    for item, tdir in enumerate(files):
        store = 'gs://cmip6/' + tdir.split('.zmetadata')[0]
        vlist = tdir.split('/')[-9:-1]
        vlist += [store]
        ddict[item] = vlist
    dz = pd.DataFrame.from_dict(ddict, orient='index')
    dz = dz.rename(columns={0: "activity_id", 1: "institution_id", 2: "source_id",
                            3: "experiment_id", 4: "member_id", 5: "table_id", 6: "variable_id",
                            7: "grid_label", 8: "zstore"})
    bucket = storage_client.get_bucket(TARGET_BUCKET)
    blob = bucket.blob(TARGET_FILENAME)
    with io.StringIO() as f:
        dz.to_csv(f, mode='w', index=False)
        f.seek(0)
        blob.upload_from_string(f.read(), content_type='text/csv')
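For stores on a local or mounted filesystem, the glob variant mentioned above could look roughly like the sketch below. It is only an illustration, not the script that was actually run: the names `find_zmetadata`, `parse_zmetadata_paths`, and the `prefix` argument are hypothetical, and it assumes `/`-separated paths with the same eight-level directory layout directly above each `.zmetadata` file.

```python
import glob
import os

import pandas as pd

# Column names taken from the script above; the directory layout is assumed
# to be .../<activity_id>/<institution_id>/.../<grid_label>/.zmetadata
CATALOG_COLUMNS = ["activity_id", "institution_id", "source_id", "experiment_id",
                   "member_id", "table_id", "variable_id", "grid_label"]


def find_zmetadata(root):
    """Recursively collect every .zmetadata file below root."""
    pattern = os.path.join(root, "**", ".zmetadata")
    return sorted(glob.glob(pattern, recursive=True))


def parse_zmetadata_paths(paths, prefix=""):
    """Turn .zmetadata paths into one catalog row per store.

    prefix is a hypothetical argument, prepended to each store path
    (for example a bucket URL); paths must use '/' as the separator.
    """
    rows = []
    for path in paths:
        store = path.split(".zmetadata")[0]
        # The eight directories directly above .zmetadata hold the attributes.
        attributes = path.split("/")[-9:-1]
        rows.append(attributes + [prefix + store])
    return pd.DataFrame(rows, columns=CATALOG_COLUMNS + ["zstore"])
```

A path with fewer than eight directories above `.zmetadata` would produce a short row and make the `DataFrame` constructor raise, which is a reasonable early warning that a store does not follow the expected layout.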
from intake-esm-datastore.
That is very helpful. I think my setup is actually very similar! I'll give those a shot. I will also try your suggestion, @andersy005, and give you some feedback on what is returned.
@andersy005, quite right. I have fixed the problem in the data catalog and added a check in the scripts. What happens is that, if two separate requests for the same dataset are being filled at the same time, one zarr store is put inside the other (hence the 'gn/gn' in the pathname). It is kind of a freak occurrence, but now my scripts should catch it. Thanks!
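A check of this kind might be sketched as follows. This is an assumption about what such a check could look like, not the check that was actually added to the scripts; `find_nested_stores` and `has_repeated_tail` are hypothetical names, and the input is assumed to be a list of `/`-separated paths ending in `.zmetadata`.

```python
def find_nested_stores(zmetadata_paths):
    """Return (inner, outer) pairs where one store directory lies inside another."""
    # Store directories keep their trailing '/', so 'a/gn/' is not treated
    # as a prefix of 'a/gnx/'.
    stores = sorted({p.split(".zmetadata")[0] for p in zmetadata_paths})
    nested = []
    for i, outer in enumerate(stores):
        # After sorting, everything sharing a prefix directly follows that prefix.
        for inner in stores[i + 1:]:
            if not inner.startswith(outer):
                break
            nested.append((inner, outer))
    return nested


def has_repeated_tail(zmetadata_path):
    """True when the last two directory names are identical, e.g. '.../gn/gn/'."""
    parts = zmetadata_path.split(".zmetadata")[0].rstrip("/").split("/")
    return len(parts) >= 2 and parts[-1] == parts[-2]
```

The prefix test only fires when both the outer and the inner store have their own `.zmetadata`; the repeated-name test also catches the case where only the inner one was listed.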
Quick reply. I was able to construct a catalog by modifying @naomi-henderson's script. I haven't been able to try the intake-esm-datastore builders yet. I'll try to get to it soon.
Closing this as it appears to have been addressed.
Can you post an example path of one of your zarr stores?
This seems very similar to what @naomi-henderson has done for the pangeo cloud. Are those scripts available publicly?
CCing @charlesbluca, as he may have pointers to the scripts used for the CMIP6 data.
Assuming you saved your zarr stores with a .zarr extension, can you try calling the get_asset_list() function to see if you can at least get the list of all zarr stores?
intake-esm-datastore/builders/core.py
Line 81 in 6cd2a36
I am not 100% confident that it will work out of the box...
Once you have a list of your zarr stores, constructing the dataframe should be straightforward.
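If the builder function does not work as-is, a plain directory walk is one fallback for getting that list. The sketch below is a stand-in and does not reproduce `get_asset_list()` or its signature; `list_zarr_stores` is a hypothetical helper that assumes the stores are directories on a local or mounted filesystem whose names end in `.zarr`.

```python
import os


def list_zarr_stores(root, suffix=".zarr"):
    """Return every directory below root whose name ends with suffix."""
    stores = []
    for dirpath, dirnames, _filenames in os.walk(root):
        matches = [d for d in dirnames if d.endswith(suffix)]
        stores.extend(os.path.join(dirpath, d) for d in matches)
        # Prune matched directories in place so os.walk does not descend into
        # a store and list its internal chunk directories.
        dirnames[:] = [d for d in dirnames if not d.endswith(suffix)]
    return sorted(stores)
```

Pruning matters for speed on large archives, since a single zarr store can contain many thousands of chunk files.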
For some reason, there are two zarr stores in the Pangeo catalog whose attributes were not properly parsed. See screenshot below.
Notice how there's activity_id = 'AWI'. I believe the activity_id for these stores should have been CMIP, with institution_id = 'AWI', etc.
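Shifted columns like this can be caught by comparing a parsed column against the values it is allowed to take. The following is a minimal sketch under stated assumptions: `KNOWN_ACTIVITY_IDS` is a deliberately partial, illustrative set of CMIP6 activity names rather than the full controlled vocabulary, and `flag_unknown_activity` is a hypothetical helper.

```python
import pandas as pd

# Partial, illustrative list; extend it from the CMIP6 controlled vocabulary.
KNOWN_ACTIVITY_IDS = {"CMIP", "ScenarioMIP", "HighResMIP", "DAMIP", "OMIP", "PMIP"}


def flag_unknown_activity(catalog, known=KNOWN_ACTIVITY_IDS):
    """Return the catalog rows whose activity_id is not a known activity name."""
    return catalog[~catalog["activity_id"].isin(known)]
```

For example, a row parsed with activity_id = 'AWI' would be returned by this function, while a row with activity_id = 'CMIP' would not.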