Source Dataset
Several years of high-frequency (15-min) GCM-timestep-level output from the Superparameterized CESM2, isolating state variables "before" and "after" a key code region containing the computationally intensive superparameterization calculations. Intended for building a superparameterization emulator of explicitly resolved clouds, their radiative influence, and turbulence, usable in a real-geography CESM2 framework to sidestep the usual computational cost of SP. Similar in spirit to the proofs of concept in Rasp, Pritchard & Gentine (2018) and Mooers et al. (2021), but with new refinements by Mike Pritchard and Tom Beucler toward compatibility with operational, real-geography CESM2 (critically, isolating only tendencies up to surface coupling and including outputs relevant to the CLM land model's expectations; see the concluding discussion of Mooers et al.).
- Link to the website / online documentation for the data
N/A
- The file format (e.g. netCDF, csv)
Raw model output consists of CESM2-formatted netCDF history files.
- How are the source files organized? (e.g. one file per day)
One file per day across 8-10 simulated years, each forced with the same repeating annual cycle of SSTs.
- How are the source files accessed (e.g. FTP)
Not publicly posted yet; staged on a few XSEDE and NERSC clusters. Mike Pritchard's group at UC Irvine can help arrange access.
- provide an example link if possible
- Any special steps required to access the data (e.g. password required)
Transformation / Alignment / Merging
**Apologies in advance if this is TMI. A starter core-dump from Mike Pritchard on a busy afternoon:**
There are multiple pre-processing steps. The raw model output contains many more variables than one would want to analyze, so some trimming is needed. But users may want to experiment with different inputs and outputs, so this trimming may be user-specific. Guidance on the specific variable names for the inputs/outputs of published emulators worth competing with is available on request.
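As a minimal sketch of this trimming step with xarray (the variable names and filename below are purely illustrative placeholders, not the definitive input/output lists of any published emulator):

```python
import xarray as xr

# Hypothetical variable lists -- the real names depend on which published
# emulator you want to compete with; ask for the definitive lists.
INPUT_VARS = ["TBP", "QBP", "PS", "SOLIN"]    # illustrative only
OUTPUT_VARS = ["TPHYSTND", "PHQ"]             # illustrative only

# Open one day's history file and keep only the variables of interest.
ds = xr.open_dataset("spcesm2.cam.h1.0001-01-01-00000.nc")
ds[INPUT_VARS + OUTPUT_VARS].to_netcdf("trimmed.0001-01-01.nc")
```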
Important: a key subset of variables (surface fluxes) that probably everyone would want in their input vector will need to be time-shifted backward by one time step to avoid information leaks, owing to the phase of the integration cycle at which these fluxes were saved on the history files relative to the emulated region. Some users may want to make emulators that include memory of state variables from previous time steps in the input vector (e.g. as in Han et al., JAMES, 2020), in which case the same backward time-shifting preprocessing applies and should be made flexible enough to cover additional variables (physical caveat: likely no more than a few hours, i.e. <= 10 temporal samples at most, so there is never any reason to include contiguous temporal adjacency beyond that limit).
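A minimal sketch of the backward shift, assuming xarray and illustrative CAM-style flux and state variable names (SHFLX, LHFLX, TBP):

```python
import xarray as xr

ds = xr.open_dataset("trimmed.0001-01-01.nc")

# The fluxes on the history file are saved at a later phase of the
# integration cycle than the emulated region, so the flux paired with the
# inputs at time t must come from the previous history record (t-1).
for v in ["SHFLX", "LHFLX"]:          # illustrative flux names
    ds[v] = ds[v].shift(time=1)       # record at t now holds the t-1 value

# Optional memory of prior states (cf. Han et al. 2020): lagged copies of a
# state variable, staying within a few hours (<= 10 samples of 15-min output).
for lag in range(1, 4):
    ds[f"TBP_lag{lag}"] = ds["TBP"].shift(time=lag)

# Drop the leading records that the shifts filled with NaNs.
ds = ds.isel(time=slice(3, None))
```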
Many users may want to subsample lon, lat, and time first, both to reduce data volume and to promote independence of samples, given the spatial and temporal autocorrelations riddled throughout the data. Other users may prefer to keep all of these samples as fuel for ambitious architectures that require very data-rich limits to find good fits. This subsampling is user-specific.
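One way to do a strided thinning in xarray (the strides here are arbitrary and should be tuned to the autocorrelation scales of interest):

```python
import xarray as xr

# Combine the per-day trimmed files, then thin along time/lat/lon.
ds = xr.open_mfdataset("trimmed.*.nc", combine="by_coords")
thinned = ds.isel(
    time=slice(None, None, 8),   # e.g. every 2 hours of 15-min output
    lat=slice(None, None, 2),
    lon=slice(None, None, 2),
)
```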
Many users wanting to make "one-size-fits-all" emulators (i.e. the same NN for all grid cells) will want to flatten lon, lat, and time into a generic "sample" dimension (retaining level variability), shuffle it for ML, and split it into training/validation/test sets. Such users would also want to pre-normalize by means and ranges/standard deviations computed independently for each vertical level but pooled across the flattened lon/lat/time samples. Advanced users may want to train regional- or regime-specific emulators, which might then use regionally aware normalizations, so flexibility here would help.
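A sketch of the flatten/normalize/shuffle/split pipeline, again assuming xarray with dimensions (time, lat, lon, lev) and with the 80/10/10 split fractions as free parameters:

```python
import numpy as np
import xarray as xr

# The thinned dataset from the previous sketch (or the full dataset).
thinned = xr.open_mfdataset("trimmed.*.nc", combine="by_coords")

# Flatten (time, lat, lon) into one generic "sample" dimension, keeping lev.
stacked = thinned.stack(sample=("time", "lat", "lon"))

# Per-level statistics: separate for each lev, pooled over all flattened
# lon/lat/time samples.
mean = stacked.mean(dim="sample")
std = stacked.std(dim="sample")
normed = (stacked - mean) / std

# Shuffle, then split 80/10/10 into train/validation/test.
rng = np.random.default_rng(0)
idx = rng.permutation(normed.sizes["sample"])
n_train = int(0.8 * idx.size)
n_val = int(0.1 * idx.size)
train = normed.isel(sample=idx[:n_train])
val = normed.isel(sample=idx[n_train:n_train + n_val])
test = normed.isel(sample=idx[n_train + n_val:])
```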
Some users may want to convert the specific humidity and temperature input state variables into an equivalent relative humidity, an alternate input that is less prone to out-of-sample extrapolation when the emulator is tested prognostically. The RH conversion should use a fixed set of assumptions consistent with an f90 module so that identical testing online is possible; python and f90 code can be provided when the time comes.
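As a placeholder only (the canonical conversion must match the project's f90 module exactly; the Magnus-type saturation formula below is just a generic stand-in until that code is provided):

```python
import numpy as np

def relative_humidity(q, t, p):
    """RH from specific humidity q [kg/kg], temperature t [K], pressure p [Pa].
    Placeholder assumptions: Magnus-type saturation vapor pressure over liquid
    water; replace with the f90-consistent version for online testing."""
    eps = 0.622                                        # Rd / Rv
    e = q * p / (eps + (1.0 - eps) * q)                # vapor pressure [Pa]
    tc = t - 273.15
    es = 610.94 * np.exp(17.625 * tc / (tc + 243.04))  # saturation pressure [Pa]
    return e / es
```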
The surface pressure is vital information for making the vertical discretization physically relevant under CESM2's hybrid vertical eta coordinate, so it should always be made available. The pressure midpoints and pressure thickness of each vertical level can be derived from this field but vary with lon, lat, and time. Mass-weighting vertically resolved outputs like diabatic heating by the derived pressure thickness could help users who wish to prioritize the column influence of different samples, as in Beucler et al. (2021, PRL).
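A sketch of deriving mid-level pressures and layer thicknesses from the standard CESM2 hybrid coefficients (hyam/hybm at midpoints, hyai/hybi at interfaces, reference pressure P0) and mass-weighting an output, with TPHYSTND as an illustrative heating variable:

```python
import xarray as xr

ds = xr.open_dataset("spcesm2.cam.h1.0001-01-01-00000.nc")

# Hybrid eta coordinate: p(k) = hyam(k)*P0 + hybm(k)*PS, varying with
# lon/lat/time through the surface pressure PS.
p_mid = ds["hyam"] * ds["P0"] + ds["hybm"] * ds["PS"]
p_int = ds["hyai"] * ds["P0"] + ds["hybi"] * ds["PS"]

# Layer pressure thickness [Pa]; relabel the interface dimension so it
# broadcasts against mid-level fields.
dp = p_int.diff("ilev").rename({"ilev": "lev"}).assign_coords(lev=ds["lev"])

# Mass-weight a vertically resolved output (variable name illustrative),
# as in Beucler et al. (2021).
g = 9.80665  # m s^-2
heating_weighted = ds["TPHYSTND"] * dp / g
```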
Output Dataset
I am not qualified to assess the trade-offs of the options listed here, but I am interested in learning.