ai2cm / pace
Implementation of the FV3GFS / SHiELD atmospheric model in Python
Home Page: https://ai2cm.github.io/pace/
License: Apache License 2.0
We discussed as a group that it's challenging to test the integration of PyPI packages we've created (e.g. pace-util) within fv3net. This documents how I am testing pace-util in fv3net before making a release, to try to avoid unexpected failures that would force another release (which is what happened when I made a PR and only ran the tests within the pace repo, because it's hard to anticipate the fv3net integration).
git grep indicates pace-util is used in the following places:
To test the prognostic run image, I locally hacked its Dockerfile to install pace-util from the source repo within fv3net (set to the dev branch) instead of from PyPI, and built the image. Unfortunately I think this was needed because pace-util can't be pip installed from a git url, as it's part of the larger pace monorepo. See instructions here. Then I ran the prognostic run tests in the container.
To test the fv3fit integration, I did a similar thing for the fv3net environment (which is where fv3fit runs in the fv3net container), and ran the fv3fit tests locally. Finally, I built images for the e2e test (fv3net, prognostic_run, artifacts, post_process_run) under a local commit and triggered the e2e test locally.
I think collectively that covers our test exposure to pace-util. We could replicate the above as an optional workflow in CI, but it's probably not worth it, since we only want to do this when making new releases of pace-util, which probably isn't that often.
But the above steps are time-consuming. At the very least, we should not add additional exposure to pace-util outside of the prognostic run image and the fv3fit package (where it's already used) unless there's a good reason, as that would imply more integration testing is needed.
File "/scratch/snx3000/ewu/sandbox/pace/fv3gfs-physics/fv3gfs/physics/stencils/physics.py", line 295, in __call__
self._microphysics(physics_state.microphysics, timestep=timestep)
File "/scratch/snx3000/ewu/sandbox/pace/fv3gfs-physics/fv3gfs/physics/stencils/microphysics.py", line 2332, in __call__
self._warm_rain(
File "/scratch/snx3000/ewu/sandbox/pace/dsl/pace/dsl/stencil.py", line 447, in __call__
self.stencil_object(
File ".gt_cache_000000/py38_1013/gtcgtgpu/fv3gfs/physics/stencils/microphysics/warm_rain/m_warm_rain__gtcgtgpu_dc34e66f28.py", line 108, in __call__
self._call_run(
File "/scratch/snx3000/ewu/sandbox/pace/external/gt4py/src/gt4py/stencil_object.py", line 514, in _call_run
self._validate_args(field_args, parameter_args, domain, origin)
File "/scratch/snx3000/ewu/sandbox/pace/external/gt4py/src/gt4py/stencil_object.py", line 426, in _validate_args
raise TypeError(
TypeError: The type of parameter 'crevp_0' is '<class 'cupy._core.core.ndarray'>' instead of 'float64'
Constants like self._cgmlt should be scalars instead of arrays.
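A minimal illustration of the kind of fix this implies (the surrounding code is hypothetical; only the parameter names come from the traceback): gt4py validates scalar parameters by dtype, so a 0-d cupy array where a float64 is expected triggers the TypeError above, and converting to a Python float when the constant is stored avoids it.

# Hypothetical helper, not the actual microphysics code: coerce a constant that
# may arrive as a 0-d numpy/cupy array into a plain Python float before it is
# passed to a stencil as a scalar parameter.
def as_scalar(value) -> float:
    # float() works for Python numbers, numpy scalars, and 0-d cupy arrays
    # (for the latter it triggers a device-to-host copy)
    return float(value)

# e.g. in the class __init__ (attribute names taken from the traceback/issue):
# self._cgmlt = as_scalar(cgmlt)
# self._crevp_0 = as_scalar(crevp_0)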
A memory leak was introduced by VRAM pooling due to the particular pattern of using self.ptc and self.delpc as return-value reads of C_SW. Those were removed in PR #319 to plug the leak.
A soon-to-be-released version of DaCe deals with it at the source; once merged, we could remove the workaround, which is to use the internal C_SW ptc and delpc.
Another option is to do away with the dyn_core-allocated ones, since this only adds a copy.
gt4py v1 removes the Storage class and allows any object exposing __array_interface__ to be bound. Unfortunately, the default cupy allocation used in our model has a bad stride (it should have unit stride), leading to a performance decrease in the backend.
Potential solution:
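The concrete proposal is not spelled out above; purely as an illustration of one knob that exists (which axis the backend wants contiguous is an assumption to verify against the gt4py GPU layout), cupy lets you choose the allocation order, and therefore the strides, explicitly:

# Illustrative sketch only, not the proposal from this issue: cupy defaults to
# C order (unit stride on the last axis); allocating in Fortran order puts unit
# stride on the first axis instead.
import cupy as cp

shape = (13, 13, 80)  # hypothetical per-rank (I, J, K) field shape

c_ordered = cp.zeros(shape)             # last axis (K) is unit-stride
f_ordered = cp.zeros(shape, order="F")  # first axis (I) is unit-stride

print(c_ordered.strides, f_ordered.strides)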
I am following the readme to get test data, but I run into an error after following these instructions:
First, make sure you have configured authentication with user credentials and configured Docker with the following commands:
gcloud auth login
gcloud auth configure-docker
Next, you can download the test data for the dynamical core and the physics tests.
cd fv3core
make get_test_data
TEST_DATA_ROOT=/scratch/git/pace/fv3core/test_data/ TARGET=dycore EXPERIMENT=c12_6ranks_standard make -C .. get_test_data
make[1]: Entering directory '/scratch/git/pace'
if [ -z "" ] ; then \
if [ ! -f "/scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore/input.nml" ] || \
[ "$(gsutil cp gs://vcm-fv3gfs-serialized-regression-data/8.1.1/c12_6ranks_standard/dycore/md5sums.txt -)" != "$(cat /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore/md5sums.txt)" ] ; then \
rm -rf /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore ; \
make sync_test_data ; \
make unpack_test_data ; \
fi ; \
else \
make sync_test_data_from_ftp ; \
make unpack_test_data ; \
fi
make[2]: Entering directory '/scratch/git/pace'
mkdir -p /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore && gsutil -m rsync -r gs://vcm-fv3gfs-serialized-regression-data/8.1.1/c12_6ranks_standard/dycore/ /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore
WARNING: gsutil rsync uses hashes when modification time is not available at
both the source and destination. Your crcmod installation isn't using the
module's C extension, so checksumming will run very slowly. If this is your
first rsync since updating gsutil, this rsync can take significantly longer than
usual. For help installing the extension, please see "gsutil help crcmod".
Building synchronization state...
Caught non-retryable exception while listing gs://vcm-fv3gfs-serialized-regression-data/8.1.1/c12_6ranks_standard/dycore/: AccessDeniedException: 403 [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket.
CommandException: Caught non-retryable exception - aborting rsync
make[2]: *** [Makefile.data_download:19: sync_test_data] Error 1
make[2]: Leaving directory '/scratch/git/pace'
make[2]: Entering directory '/scratch/git/pace'
if [ -f /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore/dat_files.tar.gz ]; then \
cd /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore && tar -xf dat_files.tar.gz && \
rm dat_files.tar.gz; \
fi
make[2]: Leaving directory '/scratch/git/pace'
make[1]: Leaving directory '/scratch/git/pace'
self._start_time in the driver needs to be incremented every step so that when we write the restart namelist it has the correct time stamp.
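A minimal sketch of the bookkeeping described above, with hypothetical attribute and config names:

import datetime


def advance_time(current: datetime.datetime, dt_atmos: float) -> datetime.datetime:
    """Return the time stamp after one model step of dt_atmos seconds."""
    return current + datetime.timedelta(seconds=dt_atmos)


# e.g. in the driver's step loop (names are assumptions, not the real driver API):
# self._start_time = advance_time(self._start_time, self.config.dt_atmos)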
The current version of the halo exchange integration uses the WrappedHaloExchange class to allow orchestration and stencil backends to co-exist (due to DaCe parsing limitations).
This has two issues: state has to be passed to both __init__ and __call__ of the model.
Solutions: __init__
FVSubgridZ leads to a CUDA 701 error when run under dace:gpu. The underlying cause is an overflow of the registers leading to a CUDA launch failure.
The temporary workaround introduced in PR #400 is a skip_test that cancels our translate test for this particular backend.
Avenues for solutions:
WARNING: this probably also fails when doing orchestration; it remains untested.
PerformanceConfig should build a PerformanceCollector class, which contains the timers, per-step info, and profiler. PerformanceCollector should not be an attribute of PerformanceConfig; it could be either a property or come from a build method. This would also mitigate having to do things like if self.performance_mode and
# TODO: these attributes are popped because they're not really config,
# we should move them off of config classes and onto non-config classes.
config_dict["performance_config"].pop("times_per_step", None)
config_dict["performance_config"].pop("hits_per_step", None)
config_dict["performance_config"].pop("timestep_timer", None)
config_dict["performance_config"].pop("total_timer", None)
The default grid/block is set to "64,8,1" in DaCeConfig.
Clearly this is a default that works well for our experiment size and current hardware (192 grid points per GPU and P100), but it could be suboptimal for any change of experiment size or hardware.
Introduce a simple rule of thumb to deal with more cases (see the sketch below), and introduce an "advanced" user configuration option.
DoD: the DaCe config grid/block setting takes into account hardware and experiment size.
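As a sketch of what such a rule of thumb could look like (the function name, thresholds, and heuristic are all assumptions, not DaCeConfig behavior):

def default_gpu_block(nx: int, ny: int) -> str:
    """Pick a GPU block size from the per-rank horizontal domain (heuristic only)."""
    bx = min(64, nx)  # fast dimension: as wide as possible, capped at 64 threads
    by = min(8, ny)
    return f"{bx},{by},1"


# Reproduces the current default for the current setup (192 points per GPU):
print(default_gpu_block(192, 192))  # -> "64,8,1"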
As an entry point to the repository, we should have an example running a baroclinic test case.
Currently, when a constants.nc file is already present, you need to delete that file if you want to run with a different npx. The error message should be more descriptive.
Currently notebooks can become outdated easily; find a solution to make sure they don't break.
This currently uses both stencil as well as lazy_stencil, but the former can be eliminated. lazy_stencil compiles the first time the implementation attribute is referenced.
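A sketch of that lazy-compilation pattern (class and attribute names are placeholders, not the exact pace.dsl API):

class LazyStencil:
    """Defer the stencil build/compile until the implementation is first needed."""

    def __init__(self, definition, build_func):
        self._definition = definition
        self._build_func = build_func  # expensive gt4py build step
        self._implementation = None

    @property
    def implementation(self):
        if self._implementation is None:
            # compilation happens here, exactly once, on first access
            self._implementation = self._build_func(self._definition)
        return self._implementation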
We originally made the physics state gt4py storages since there was no need to halo exchange them. But using Quantity is more consistent, makes accessing variable documentation easier, and there are fewer switch cases when reading from restart files.
Currently we generate separate data folders for dycore, initialization, and physics regression data. This was done because we must generate these datasets in separate runs, and it is not feasible to merge Serialbox datasets. Now that we convert Serialbox data to netcdf, we could merge this data into a single folder.
Merging these folders would avoid needing to specify TARGET values when running tests, which currently can cause tests to pass without running.
This also requires removing our last dependency on Serialbox so that we can stop archiving Serialbox data. The "serialbox" initialization of the driver must be updated to use netCDF data.
Currently, we write restart files at the end of the simulation. Add the capability to save intermediate restart files at specified times.
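A hedged sketch of what that could look like in the driver loop, with hypothetical names (restart_times would come from configuration, and write_restart is assumed to be the existing end-of-run writer):

import datetime
from typing import Sequence


def maybe_write_restart(state, time: datetime.datetime,
                        restart_times: Sequence[datetime.datetime],
                        write_restart) -> None:
    """Write an intermediate restart if the current time is one of the requested times."""
    if time in restart_times:
        write_restart(state, label=time.strftime("%Y%m%d.%H%M%S"))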
For select checkpoints, we do quantity.transpose to match the data on disk. However, DaCe cannot parse transpose.
VERBOSE: Failed to parse the following program:
def transpose(self, target_dims: Sequence[Union[str, Iterable[str]]]) -> 'Quantity':
'Change the dimension order of this Quantity.\n\n If you know you are working with cell-centered variables, you can do:\n\n >>> from pace.util import X_DIM, Y_DIM, Z_DIM\n >>> transposed_quantity = quantity.transpose([X_DIM, Y_DIM, Z_DIM])\n\n To support re-ordering without checking whether quantities are on\n cell centers or interfaces, the API supports giving a list of dimension names\n for dimensions. For example, to re-order to X-Y-Z dimensions regardless of the\n grid the variable is on, one could do:\n\n >>> from pace.util import X_DIMS, Y_DIMS, Z_DIMS\n >>> transposed_quantity = quantity.transpose([X_DIMS, Y_DIMS, Z_DIMS])\n\n Args:\n target_dims: a list of output dimensions. Instead of a single dimension\n name, an iterable of dimensions can be used instead for any entries.\n For example, you may want to use pace.util.X_DIMS to place an\n x-dimension without knowing whether it is on cell centers or interfaces.\n\n Returns:\n transposed: Quantity with the requested output dimension order\n\n Raises:\n ValueError: if any of the target dimensions do not exist on this Quantity,\n or if this Quantity contains multiple values from an iterable entry\n '
target_dims = <dace.frontend.python.parser.DaceProgram object at 0x15544cac7bb0>(target_dims, ('x', 'y', 'z_interface'))
transpose_order = [self_dims_index(dim) for dim in target_dims]
transposed = Quantity_0(<dace.frontend.python.parser.DaceProgram object at 0x15544cf872e0>(__g_self_data, transpose_order), dims=<dace.frontend.python.parser.DaceProgram object at 0x15544529c040>(('x', 'y', 'z_interface'), transpose_order), units='ln(Pa)', origin=<dace.frontend.python.parser.DaceProgram object at 0x15544cbce820>((3, 3, 0), transpose_order), extent=<dace.frontend.python.parser.DaceProgram object at 0x155447383910>((12, 12, 80), transpose_order), gt4py_backend='dace:gpu')
transposed._attrs = self._attrs
return transposed
As of updating pace to gt4py v1.0, the Docker image built in examples/ with make build installs all the dependencies for pace, but not pace itself. #418 was created to quickly work around the problem, but it is not a proper fix.
What needs to happen is that someone tracks down why e.g. pip3 install -e ./physics -c constraints.txt && python3 -c "import pace.physics" fails, reverts the Dockerfile change in #418, and makes a proper fix.
Currently, pace restart only supports the numpy or CPU backends. We want to extend this to work for GPU backends as well.
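A minimal sketch of the kind of shim this implies (an assumed approach, not the planned implementation): copy device arrays back to host before handing them to the existing netCDF restart writer.

import numpy as np

try:
    import cupy as cp
except ImportError:  # CPU-only environments
    cp = None


def to_host(array) -> np.ndarray:
    """Return a host (numpy) array whether the data lives on CPU or GPU."""
    if cp is not None and isinstance(array, cp.ndarray):
        return cp.asnumpy(array)
    return np.asarray(array)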
There are other modules that also require pace.dsl and do not list it in requirements.
I was writing a test for Pace and realized that I have to essentially specify the domain twice: once when creating the GridIndexing specification for the StencilFactory, and another time in the from_origin_domain(...) call. Are both of these necessary?
DaCeProgress is a custom-made logger that prepends the orchestration mode and prints timing when used as a ContextManager. It uses the default Python logger, but this should be the basis of a global Logger for Pace.
I am trying to set up a lab to replicate "Pace v0.1: A Python-based Performance-Portable Implementation of the FV3 Dynamical Core". Due to #355 I cannot access the Docker environment as provided in the docs (i.e., make dev doesn't work), so I tried to start with the provided Dockerfile.
I am using the v0.1.0 release because I assume this is the version for the submission, and I modified requirements_dev.txt to install gt4py with cuda117 features, i.e. gt4py[cuda117].
For the sake of simplicity, I started with Nvidia's docker images. Here is my Dockerfile:
FROM nvidia/cuda:11.7.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y make \
software-properties-common \
libopenmpi3 \
libopenmpi-dev \
libboost-all-dev \
libhdf5-serial-dev \
netcdf-bin \
libnetcdf-dev \
python3 \
python3-pip
RUN pip3 install --upgrade setuptools wheel pip packaging
COPY . /pace
RUN cd /pace && \
pip3 install -r /pace/requirements_dev.txt && \
python3 -m gt4py.gt_src_manager install
RUN rm -rf /pace
ENV OMPI_ALLOW_RUN_AS_ROOT=1
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
Then I modified driver/examples/configs/baroclinic_c12.yaml to change the backend to cuda (which I believe invokes cupy as the code generator). My modified top part looks like:
stencil_config:
compilation_config:
backend: cuda
rebuild: true
validate_args: true
format_source: false
device_sync: true
Then I run the following command inside the Docker image:
mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none --oversubscribe -n 6 python3 -m pace.driver.run driver/examples/configs/baroclinic_c12.yaml
After the kernels are compiled, the program crashes as follows.
[8181c3a7fe69:00151] Signal: Segmentation fault (11)
[8181c3a7fe69:00151] Signal code: Invalid permissions (2)
[8181c3a7fe69:00151] Failing at address: 0xb02920000
[8181c3a7fe69:00151] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f97d32ff520]
[8181c3a7fe69:00151] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a094d)[0x7f97d345d94d]
[8181c3a7fe69:00151] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x3244)[0x7f974f2df244]
[8181c3a7fe69:00151] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x44)[0x7f974f2b8784]
[8181c3a7fe69:00151] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_isend+0x36d)[0x7f974f2b215d]
[8181c3a7fe69:00151] [ 5] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Isend+0x12d)[0x7f9745f43b5d]
[8181c3a7fe69:00151] [ 6] /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0xf61da)[0x7f97463901da]
[8181c3a7fe69:00151] [ 7] python3(+0x15c8de)[0x55f4995a18de]
[8181c3a7fe69:00151] [ 8] python3(PyObject_Call+0xbb)[0x55f4995b075b]
[8181c3a7fe69:00151] [ 9] python3(_PyEval_EvalFrameDefault+0x2955)[0x55f49958cb25]
[8181c3a7fe69:00151] [10] python3(+0x16ab11)[0x55f4995afb11]
[8181c3a7fe69:00151] [11] python3(_PyEval_EvalFrameDefault+0x1a31)[0x55f49958bc01]
[8181c3a7fe69:00151] [12] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] [13] python3(_PyEval_EvalFrameDefault+0x816)[0x55f49958a9e6]
[8181c3a7fe69:00151] [14] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] [15] python3(_PyEval_EvalFrameDefault+0x816)[0x55f49958a9e6]
[8181c3a7fe69:00151] [16] python3(+0x16ab11)[0x55f4995afb11]
[8181c3a7fe69:00151] [17] python3(_PyEval_EvalFrameDefault+0x1a31)[0x55f49958bc01]
[8181c3a7fe69:00151] [18] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] [19] python3(_PyEval_EvalFrameDefault+0x816)[0x55f49958a9e6]
[8181c3a7fe69:00151] [20] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] [21] python3(_PyObject_FastCallDictTstate+0x16d)[0x55f4995975fd]
[8181c3a7fe69:00151] [22] python3(+0x166d74)[0x55f4995abd74]
[8181c3a7fe69:00151] [23] python3(_PyObject_MakeTpCall+0x1fc)[0x55f49959835c]
[8181c3a7fe69:00151] [24] python3(_PyEval_EvalFrameDefault+0x73b3)[0x55f499591583]
[8181c3a7fe69:00151] [25] python3(+0x16ab11)[0x55f4995afb11]
[8181c3a7fe69:00151] [26] python3(_PyEval_EvalFrameDefault+0x1a31)[0x55f49958bc01]
[8181c3a7fe69:00151] [27] python3(+0x16ab11)[0x55f4995afb11]
[8181c3a7fe69:00151] [28] python3(_PyEval_EvalFrameDefault+0x1a31)[0x55f49958bc01]
[8181c3a7fe69:00151] [29] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] *** End of error message ***
python3(_PyEval_EvalFrameDefault+0x1a31)[0x5594b79d7c01]
[8181c3a7fe69:00152] [27] python3(+0x16ab11)[0x5594b79fbb11]
[8181c3a7fe69:00152] [28] python3(_PyEval_EvalFrameDefault+0x1a31)[0x5594b79d7c01]
[8181c3a7fe69:00152] [29] python3(_PyFunction_Vectorcall+0x7c)[0x5594b79ee12c]
[8181c3a7fe69:00152] *** End of error message ***
By the way, the numpy backend works, but since there is no written guideline for running with the GPU backend, I am not sure I am doing it the right way.
Could you help me triage the issue or provide any additional instructions?
Thank you.
Orchestration attempts to inline everything by default, and in the case of scalars that means freezing some information (like loop sizes, or the n/k split). The grid is passed mostly via self.X of the submodule, which will lead to it being frozen.
Question
Is there any rank-specific data that could lead to errors in the model when we compile on the top tile and send the results across all ranks?
E.g. are we using the top tile to compile all tiles, and is it reproducing the data properly because they are not lat/lon dependent (or just in nature), e.g. area, rarea, or is there an actual dependency that could lead to small errors (cubtolatlon?)
The code that does distributed compile for DaCe has been deactivated in the meantime; see DEACTIVATE_DISTRIBUTED_DACE_COMPILE.
Solutions
dyn call and flag it dace.compiletime
DoD
DEACTIVATE_DISTRIBUTED_DACE_COMPILE has been removed and the underlying problem solved or documented
If a user runs make savepoint_tests_* from within Pace, Docker returns Unable to find image 'us.gcr.io/vcm-ml/pace:latest' locally. We need to run make build first. This artifact doesn't even exist on https://console.cloud.google.com/gcr/images/vcm-ml?referrer=search&project=vcm-ml.
With the removal of future-stencil we are back to a system where every running rank requires its own .gt_cache_X to execute our model.
This approach works on low node-counts (6/54) but is not scalable beyond that.
A solution for this was implemented for the dace backends, but the GridTools backends are still not able to reproduce this.
We need a solution to build on a lower rank count and use the generated and compiled code across ranks.
The version of black pinned in constraints.txt is much newer than the version pre-commit uses. This causes inconsistencies with editors and should be brought in sync.
To fix, update the version in pre-commit to match the newer version in constraints.txt.
There should be a way to produce no diagnostics output at all, including the grids.
As a consumer I find it odd that we call get_test_data for the driver from the fv3core directory with TARGET=driver. Should this be moved to the driver directory at some point?
As written, the halo update code in Pace does not support non-square MPI layouts (e.g. where a tile is decomposed into 3x2 ranks). This is because the abstractions for partitioning of ranks currently assume that for a given rank and a given direction, there is only one rank bordering in that direction. Fortunately, this assumption is not required by anything using the code which has this abstraction.
To fix this, the Partitioner class must be refactored, replacing the boundary(self, boundary_type: int, rank: int) -> Optional[bd.SimpleBoundary] method with a boundaries(self) -> Iterable[bd.SimpleBoundary] method, and code which currently calls boundary while iterating over directions should instead iterate over the boundaries. Once this is done, the implementation of boundaries can be modified to support non-square layouts, which should make all halo update code work for non-square layouts.
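A sketch of the shape of that refactor (signatures and types here are simplified placeholders, not the exact pace.util API):

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SimpleBoundary:
    """Placeholder for bd.SimpleBoundary: one neighboring rank along one edge."""
    boundary_type: int
    from_rank: int
    to_rank: int


class Partitioner:
    def __init__(self, boundaries_by_rank: dict):
        self._boundaries_by_rank = boundaries_by_rank

    def boundary(self, boundary_type: int, rank: int) -> Optional[SimpleBoundary]:
        # current API: assumes at most one neighbor per direction
        for b in self._boundaries_by_rank.get(rank, []):
            if b.boundary_type == boundary_type:
                return b
        return None

    def boundaries(self, rank: int) -> Iterable[SimpleBoundary]:
        # proposed API: yield every boundary of the rank, however many share a
        # direction, which is what non-square (e.g. 3x2) layouts require
        yield from self._boundaries_by_rank.get(rank, [])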
I would strongly recommend test-driven development for this modification, similar to the current partitioner boundaries tests. In developing the current code, it was very, very helpful to be able to try various "guesses" as to what code would work until something passed the (extensive) test cases.
Instead of saving the entire 3D field of tracer mixing ratio, we want the column integrated value so that we only need to save a 2D field.
Use the numpy equivalents on the Quantity. See #375 (comment).
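For illustration, a column integral using plain numpy on the Quantity's data (a mass-weighted vertical sum is assumed here; the variable names are hypothetical):

import numpy as np

GRAVITY = 9.80665  # m s^-2


def column_integrated_tracer(tracer: np.ndarray, delp: np.ndarray) -> np.ndarray:
    """Collapse a 3D mixing ratio [kg/kg] to a 2D column integral [kg/m^2].

    tracer and delp have shape (nx, ny, nz); delp is the layer pressure thickness [Pa].
    """
    return np.sum(tracer * delp, axis=-1) / GRAVITY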
update_dwind_prep_stencil needs to be broken down into 2 stencils.
pace.util.Quantity has an attribute _attrs that isn't accessible via instantiation or a public setter (though it has a getter). Because Quantity is used in the fv3net integration for writing zarrs, this means that fv3net is currently using private API to write a zarr with array attributes. In the spirit of making the fv3net/pace.util integration more robust, it seems like we should add a setter for Quantity._attrs, or go the other way and get rid of Quantity._attrs entirely.
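A minimal sketch of the setter option (assuming Quantity keeps its attributes in a private _attrs dict, as described above; only the attrs handling is shown):

class Quantity:
    def __init__(self, data, dims, units, attrs=None):
        self._data = data
        self._dims = dims
        self._units = units
        self._attrs = dict(attrs or {})

    @property
    def attrs(self) -> dict:
        return self._attrs

    @attrs.setter
    def attrs(self, value: dict) -> None:
        # public setter so downstream code no longer reaches into _attrs
        self._attrs = dict(value)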