pace's Issues

Testing pace-util integration in fv3net

We discussed as a group that it's challenging to test the integration of PyPI packages we've created (e.g. pace-util) within fv3net. This issue documents how I am testing pace-util in fv3net before making a release, to avoid unexpected failures that would force another release. That is what happened last time: I made a PR and only ran the tests within the pace repo, because it's hard to anticipate the fv3net integration.

git grep indicates pace-util is used in the following places:

  • prognostic_c48_run (zarr monitor for prognostic run and for emulation via emulation package; also for nudging and prescribing)
  • fv3fit (halos for CNNs)

To test the prognostic run image, I locally hacked its Dockerfile to install pace-util from a source checkout within fv3net (set to the dev branch) instead of from PyPI, and built the image. I think this was needed because pace-util can't be pip-installed from a git URL, as it's part of the larger pace monorepo. See instructions here. Then I ran the prognostic run tests in the container.

To test the fv3fit integration, I did a similar thing for the fv3net environment (which is where fv3fit runs in the fv3net container) and ran the fv3fit tests locally. Finally, I built images for the e2e test (fv3net, prognostic_run, artifacts, post_process_run) from a local commit and triggered the e2e test locally.

I think that collectively covers our test exposure to pace-util. We could replicate the above as an optional workflow in CI, but it's probably not worth it, since we only need to do this when making new releases of pace-util, which probably isn't that often.

But the above steps are time-consuming. At the very least, we should not add additional exposure to pace-util outside of the prognostic run image and fv3fit package (where it's already used) unless there's a good reason, as that would imply more integration testing needed.

Direct indexing of arrays in microphysics

  File "/scratch/snx3000/ewu/sandbox/pace/fv3gfs-physics/fv3gfs/physics/stencils/physics.py", line 295, in __call__
    self._microphysics(physics_state.microphysics, timestep=timestep)
  File "/scratch/snx3000/ewu/sandbox/pace/fv3gfs-physics/fv3gfs/physics/stencils/microphysics.py", line 2332, in __call__
    self._warm_rain(
  File "/scratch/snx3000/ewu/sandbox/pace/dsl/pace/dsl/stencil.py", line 447, in __call__
    self.stencil_object(
  File ".gt_cache_000000/py38_1013/gtcgtgpu/fv3gfs/physics/stencils/microphysics/warm_rain/m_warm_rain__gtcgtgpu_dc34e66f28.py", line 108, in __call__
    self._call_run(
  File "/scratch/snx3000/ewu/sandbox/pace/external/gt4py/src/gt4py/stencil_object.py", line 514, in _call_run
    self._validate_args(field_args, parameter_args, domain, origin)
  File "/scratch/snx3000/ewu/sandbox/pace/external/gt4py/src/gt4py/stencil_object.py", line 426, in _validate_args
    raise TypeError(
TypeError: The type of parameter 'crevp_0' is '<class 'cupy._core.core.ndarray'>' instead of 'float64'

Parameters like self._cgmlt should be scalars instead of arrays.
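
A minimal sketch of one possible fix, assuming the offending parameters are currently stored as 0-d or 1-element device arrays; as_scalar and the example value are illustrative, not the actual microphysics code (numpy is used so the snippet runs without a GPU; cupy arrays expose the same .item() API):

import numpy as np

def as_scalar(value) -> float:
    """Convert a 0-d / 1-element numpy or cupy array (or a number) to a Python float."""
    return float(value.item()) if hasattr(value, "item") else float(value)

# Convert once at init time so stencil calls receive float64 scalars,
# which is what _validate_args expects for parameters like crevp_0.
crevp_0 = as_scalar(np.array(2.5))  # stand-in for the offending parameter
assert isinstance(crevp_0, float)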

Revert `self.ptc` & `self.delpc` usage in dyn_core or decide to remove them

A memory leak was introduced by VRAM pooling due to the particular pattern of using self.ptc and self.delpc to read the return values of C_SW. Those were removed in PR #319 to plug the leak.
A soon-to-be-released version of DaCe deals with this at the source; once it is merged we could remove the workaround, which is to use the internal C_SW ptc and delpc.
Another option is to do away with the dyn_core-allocated ones entirely, since keeping them only adds a copy.

`gt:gpu` slowdown on `gt4py` v1 due to change in allocation

gt4py v1 removes the Storage class and allows any object exposing __array_interface__ to be bound. Unfortunately, the default cupy allocation used in our model has a bad stride (it should have unit stride), leading to a performance decrease in this backend.

Potential solutions:

  • use the gt4py-provided allocator (which is optimized for the backend)
  • make sure our GPU allocations have unit stride (see the diagnostic sketch below)
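
A hedged diagnostic sketch for the second bullet; which axis the backend wants contiguous is an assumption here, the point is only how to check the stride layout of an allocation (numpy is used so it runs without a GPU; cupy arrays expose the same .strides and .itemsize attributes):

import numpy as np

shape = (13, 13, 80)                   # (I, J, K) -- illustrative sizes
c_order = np.zeros(shape)              # default allocation: K axis is fastest
f_order = np.zeros(shape, order="F")   # I axis has unit stride

itemsize = c_order.dtype.itemsize
print("C-order strides (elements):", tuple(s // itemsize for s in c_order.strides))
print("F-order strides (elements):", tuple(s // itemsize for s in f_order.strides))
# If the backend expects unit stride on a different axis than the default
# allocation provides, that mismatch is the "bad stride" described above.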

Cannot access test data following the instructions in the README

I am following the README to get the test data, but after following these instructions:

First, make sure you have configured the authentication with user credentials and configured Docker with the following commands:

gcloud auth login
gcloud auth configure-docker

Next, you can download the test data for the dynamical core and the physics tests.

cd fv3core
make get_test_data

TEST_DATA_ROOT=/scratch/git/pace/fv3core/test_data/ TARGET=dycore EXPERIMENT=c12_6ranks_standard make -C .. get_test_data
make[1]: Entering directory '/scratch/git/pace'
if [ -z "" ] ; then \
	if [ ! -f "/scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore/input.nml" ] || \
	[ "$(gsutil cp gs://vcm-fv3gfs-serialized-regression-data/8.1.1/c12_6ranks_standard/dycore/md5sums.txt -)" != "$(cat /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore/md5sums.txt)" ] ; then \
		rm -rf /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore ; \
		make sync_test_data ; \
		make unpack_test_data ; \
	fi ; \
else \
	make sync_test_data_from_ftp ; \
	make unpack_test_data ; \
fi
make[2]: Entering directory '/scratch/git/pace'
mkdir -p /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore && gsutil -m rsync -r gs://vcm-fv3gfs-serialized-regression-data/8.1.1/c12_6ranks_standard/dycore/ /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore

WARNING: gsutil rsync uses hashes when modification time is not available at
both the source and destination. Your crcmod installation isn't using the
module's C extension, so checksumming will run very slowly. If this is your
first rsync since updating gsutil, this rsync can take significantly longer than
usual. For help installing the extension, please see "gsutil help crcmod".

Building synchronization state...
Caught non-retryable exception while listing gs://vcm-fv3gfs-serialized-regression-data/8.1.1/c12_6ranks_standard/dycore/: AccessDeniedException: 403 [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket.
CommandException: Caught non-retryable exception - aborting rsync
make[2]: *** [Makefile.data_download:19: sync_test_data] Error 1
make[2]: Leaving directory '/scratch/git/pace'
make[2]: Entering directory '/scratch/git/pace'
if [ -f /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore/dat_files.tar.gz ]; then \
	cd /scratch/git/pace/fv3core/test_data//8.1.1/c12_6ranks_standard/dycore && tar -xf dat_files.tar.gz && \
	rm dat_files.tar.gz; \
fi
make[2]: Leaving directory '/scratch/git/pace'
make[1]: Leaving directory '/scratch/git/pace'

Better integration of halo exchange

The current version of the halo exchange integration uses the WrappedHaloExchange class to allow orchestration and stencil backends to co-exist (due to DaCe parsing limitations).

This has two issues:

  • It goes against the "state-less" design of the code by requiring the same state to be passed to both __init__ and __call__ of the model
  • Performance of the orchestration is limited by the callback, which acts as a black-box funnel for the data-flow

Solutions:

  • Using literal lists, which is pending a DaCe feature being finished, we could keep the callback but remove the state from __init__
  • Furthermore, we could recode the custom pack/unpack in a language that DaCe can parse and then do away with the wrapper altogether

FVSubgridZ failure on `dace:gpu` backend

FVSubgridZ leads to a CUDA error 701 when run under dace:gpu. The underlying cause is register overflow, leading to a CUDA launch failure.

The temporary workaround introduced in PR #400 is a skip_test that disables our translate test for this particular backend.

Avenue for solutions:

  • break the code up so that a smaller number of tracers is updated at once
  • change the block/thread configuration for the stencil
  • skip field_to_scalar OIR pass (in combination with the above)
  • deeper fix of the backend in GT4Py (tbd)

WARNING: this probably also fails under orchestration; that remains untested

Performance config makes performance collector class

PerformanceConfig should build a PerformanceCollector class that holds the timers, per-step info, and profiler. PerformanceCollector should not be an attribute of PerformanceConfig; it could be a property or come from a build method (a sketch of the proposed split follows the snippet below). This would also mitigate having to do things like if self.performance_mode checks and the following:

        # TODO: these attributes are popped because they're not really config,
        # we should move them off of config classes and onto non-config classes.
        config_dict["performance_config"].pop("times_per_step", None)
        config_dict["performance_config"].pop("hits_per_step", None)
        config_dict["performance_config"].pop("timestep_timer", None)
        config_dict["performance_config"].pop("total_timer", None)

Better grid/blocks setting for DaCe

The default grid/block is set to "64,8,1" in DaceConfig.
Clearly this is a default that works well for our current experiment size and hardware (192 grid points per GPU on a P100), but it could be sub-optimal for any change of experiment size or hardware.

Introduce a simple "rule of thumb" to deal with more cases and introduce an "advanced" user configuration option (a toy heuristic is sketched below).
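
A toy sketch of such a rule of thumb; the heuristic and the 512-thread cap are illustrative assumptions, not the actual DaceConfig logic:

def default_gpu_block(nx: int, ny: int, max_threads: int = 512) -> str:
    """Pick a block shape capped at max_threads and clipped to the per-GPU
    horizontal domain (nx, ny)."""
    bx = min(64, nx)
    by = max(1, min(max_threads // bx, ny))
    return f"{bx},{by},1"

print(default_gpu_block(192, 192))  # -> "64,8,1", matching the current default
print(default_gpu_block(48, 48))    # -> "48,10,1" for a smaller per-GPU domain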

DoD: DaCe config grid/block setting takes into account hardware & experiment size

JIRA mirror ticket

Physics state uses quantity

We originally made the physics state gt4py storages since there is no need to halo exchange it. But using Quantity is more consistent, makes accessing variable documentation easier, and means fewer special cases when reading from restart files (see the hedged sketch below).
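
A hedged sketch of the change, assuming the pace.util.Quantity constructor shown elsewhere in these issues (data, dims, units, origin, extent); the variable name, shape, and halo sizes are illustrative:

import numpy as np
import pace.util

# Wrap a physics-state array as a Quantity so dims/units travel with the data.
qvapor = pace.util.Quantity(
    np.zeros((18, 18, 79)),
    dims=[pace.util.X_DIM, pace.util.Y_DIM, pace.util.Z_DIM],
    units="kg/kg",
    origin=(3, 3, 0),
    extent=(12, 12, 79),
)
print(qvapor.units)  # variable documentation is attached to the object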

Regression data folders can be merged

Currently we generate separate data folders for dycore, initialization, and physics regression data. This was done because we must generate these datasets in separate runs, and it is not feasible to merge Serialbox datasets. Now that we convert Serialbox data to netcdf, we could merge this data into a single folder.

Merging these folders would avoid needing to specify TARGET values when running tests, which currently can cause tests to pass without running.

This also requires removing our last dependency on Serialbox so that we can stop archiving Serialbox data. The "serialbox" initialization of the driver must be updated to use netCDF data.

DaCe cannot parse quantity.transpose

For select checkpoints, we call quantity.transpose to match the data on disk. However, DaCe cannot parse transpose.

VERBOSE: Failed to parse the following program:
def transpose(self, target_dims: Sequence[Union[str, Iterable[str]]]) -> 'Quantity':
    'Change the dimension order of this Quantity.\n\n    If you know you are working with cell-centered variables, you can do:\n\n    >>> from pace.util import X_DIM, Y_DIM, Z_DIM\n    >>> transposed_quantity = quantity.transpose([X_DIM, Y_DIM, Z_DIM])\n\n    To support re-ordering without checking whether quantities are on\n    cell centers or interfaces, the API supports giving a list of dimension names\n    for dimensions. For example, to re-order to X-Y-Z dimensions regardless of the\n    grid the variable is on, one could do:\n\n    >>> from pace.util import X_DIMS, Y_DIMS, Z_DIMS\n    >>> transposed_quantity = quantity.transpose([X_DIMS, Y_DIMS, Z_DIMS])\n\n    Args:\n        target_dims: a list of output dimensions. Instead of a single dimension\n            name, an iterable of dimensions can be used instead for any entries.\n            For example, you may want to use pace.util.X_DIMS to place an\n            x-dimension without knowing whether it is on cell centers or interfaces.\n\n    Returns:\n        transposed: Quantity with the requested output dimension order\n\n    Raises:\n        ValueError: if any of the target dimensions do not exist on this Quantity,\n            or if this Quantity contains multiple values from an iterable entry\n    '
    target_dims = <dace.frontend.python.parser.DaceProgram object at 0x15544cac7bb0>(target_dims, ('x', 'y', 'z_interface'))
    transpose_order = [self_dims_index(dim) for dim in target_dims]
    transposed = Quantity_0(<dace.frontend.python.parser.DaceProgram object at 0x15544cf872e0>(__g_self_data, transpose_order), dims=<dace.frontend.python.parser.DaceProgram object at 0x15544529c040>(('x', 'y', 'z_interface'), transpose_order), units='ln(Pa)', origin=<dace.frontend.python.parser.DaceProgram object at 0x15544cbce820>((3, 3, 0), transpose_order), extent=<dace.frontend.python.parser.DaceProgram object at 0x155447383910>((12, 12, 80), transpose_order), gt4py_backend='dace:gpu')
    transposed._attrs = self._attrs
    return transposed

Pace editable install from requirements_dev.txt is broken

Since updating pace to gt4py v1.0, the Docker image built in examples/ with make build installs all of pace's dependencies, but not pace itself. #418 was created to quickly work around the problem, but it is not a proper fix.

What needs to happen: someone should track down why e.g. pip3 install -e ./physics -c constraints.txt && python3 -c "import pace.physics" fails, revert the Dockerfile change from #418, and make a proper fix.

Specifying the domain twice

I was writing a test for Pace and realized that I have to essentially specify the domain twice: once when creating the GridIndexing specification for the StencilFactory, and another time in the from_origin_domain(...) call. Are both these necessary?

Guidelines on running Pace v0.1.0 with GPU?

I am trying to set up a lab to replicate "Pace v0.1: A Python-based Performance-Portable Implementation of the FV3 Dynamical Core". Due to #355 I cannot access the Docker environment provided in the docs (i.e., make dev doesn't work), so I tried to start with the provided Dockerfile.

I am using the v0.1.0 release because I am assuming this is the version for the submission, and I modified requirements_dev.txt to install gt4py with cuda117 features, i.e. gt4py[cuda117].

For the sake of simplicity, I started with Nvidia's Docker images. Here is my Dockerfile:

FROM nvidia/cuda:11.7.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y make \
    software-properties-common \
    libopenmpi3 \
    libopenmpi-dev \
    libboost-all-dev \
    libhdf5-serial-dev \
    netcdf-bin \
    libnetcdf-dev \
    python3 \
    python3-pip
RUN pip3 install --upgrade setuptools wheel pip packaging
COPY . /pace
RUN cd /pace && \
    pip3 install -r /pace/requirements_dev.txt && \
    python3 -m gt4py.gt_src_manager install
RUN rm -rf /pace
ENV OMPI_ALLOW_RUN_AS_ROOT=1
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

Then I modified driver/examples/configs/baroclinic_c12.yaml to change the backend to cuda (which I believe invokes cupy as the code generator). The modified top part looks like:

stencil_config:
  compilation_config:
    backend: cuda
    rebuild: true
    validate_args: true
    format_source: false
    device_sync: true

Then I ran the following command

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none --oversubscribe -n 6 python3 -m pace.driver.run driver/examples/configs/baroclinic_c12.yaml

inside the Docker image.

After the kernels are compiled, the program crashes as follows.

[8181c3a7fe69:00151] Signal: Segmentation fault (11)
[8181c3a7fe69:00151] Signal code: Invalid permissions (2)
[8181c3a7fe69:00151] Failing at address: 0xb02920000
[8181c3a7fe69:00151] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f97d32ff520]
[8181c3a7fe69:00151] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a094d)[0x7f97d345d94d]
[8181c3a7fe69:00151] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x3244)[0x7f974f2df244]
[8181c3a7fe69:00151] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x44)[0x7f974f2b8784]
[8181c3a7fe69:00151] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_isend+0x36d)[0x7f974f2b215d]
[8181c3a7fe69:00151] [ 5] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Isend+0x12d)[0x7f9745f43b5d]
[8181c3a7fe69:00151] [ 6] /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0xf61da)[0x7f97463901da]
[8181c3a7fe69:00151] [ 7] python3(+0x15c8de)[0x55f4995a18de]
[8181c3a7fe69:00151] [ 8] python3(PyObject_Call+0xbb)[0x55f4995b075b]
[8181c3a7fe69:00151] [ 9] python3(_PyEval_EvalFrameDefault+0x2955)[0x55f49958cb25]
[8181c3a7fe69:00151] [10] python3(+0x16ab11)[0x55f4995afb11]
[8181c3a7fe69:00151] [11] python3(_PyEval_EvalFrameDefault+0x1a31)[0x55f49958bc01]
[8181c3a7fe69:00151] [12] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] [13] python3(_PyEval_EvalFrameDefault+0x816)[0x55f49958a9e6]
[8181c3a7fe69:00151] [14] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] [15] python3(_PyEval_EvalFrameDefault+0x816)[0x55f49958a9e6]
[8181c3a7fe69:00151] [16] python3(+0x16ab11)[0x55f4995afb11]
[8181c3a7fe69:00151] [17] python3(_PyEval_EvalFrameDefault+0x1a31)[0x55f49958bc01]
[8181c3a7fe69:00151] [18] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] [19] python3(_PyEval_EvalFrameDefault+0x816)[0x55f49958a9e6]
[8181c3a7fe69:00151] [20] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] [21] python3(_PyObject_FastCallDictTstate+0x16d)[0x55f4995975fd]
[8181c3a7fe69:00151] [22] python3(+0x166d74)[0x55f4995abd74]
[8181c3a7fe69:00151] [23] python3(_PyObject_MakeTpCall+0x1fc)[0x55f49959835c]
[8181c3a7fe69:00151] [24] python3(_PyEval_EvalFrameDefault+0x73b3)[0x55f499591583]
[8181c3a7fe69:00151] [25] python3(+0x16ab11)[0x55f4995afb11]
[8181c3a7fe69:00151] [26] python3(_PyEval_EvalFrameDefault+0x1a31)[0x55f49958bc01]
[8181c3a7fe69:00151] [27] python3(+0x16ab11)[0x55f4995afb11]
[8181c3a7fe69:00151] [28] python3(_PyEval_EvalFrameDefault+0x1a31)[0x55f49958bc01]
[8181c3a7fe69:00151] [29] python3(_PyFunction_Vectorcall+0x7c)[0x55f4995a212c]
[8181c3a7fe69:00151] *** End of error message ***
python3(_PyEval_EvalFrameDefault+0x1a31)[0x5594b79d7c01]
[8181c3a7fe69:00152] [27] python3(+0x16ab11)[0x5594b79fbb11]
[8181c3a7fe69:00152] [28] python3(_PyEval_EvalFrameDefault+0x1a31)[0x5594b79d7c01]
[8181c3a7fe69:00152] [29] python3(_PyFunction_Vectorcall+0x7c)[0x5594b79ee12c]
[8181c3a7fe69:00152] *** End of error message ***

BTW, the numpy backend works, but since there is no written guideline for running with a GPU backend, I am not sure if I am doing this the right way.

Could you help me to triage the issue or provide any additional instructions?

Thank you.

[DaCe Orchestration] - Grid data gets hardcoded in the compiled executable

Orchestration attempts to inline everything by default, and in the case of scalars that means freezing some information (like loop sizes, for example, or the n/k split). The grid is passed mostly via self.X attributes of the submodules, which leads to it being frozen.

Question

Is there any rank-specific data that could lead to errors in the model when we compile on the top tile and send the results to all ranks?

E.g., if we use the top tile to compile for all tiles, are the data reproduced properly because they are not lat/lon-dependent (or just happen to be), e.g. area, rarea, or is there an actual dependency that could lead to small errors (cubed_to_latlon?)

The code that does distributed compilation for DaCe has been deactivated in the meantime; see DEACTIVATE_DISTRIBUTED_DACE_COMPILE.

Solutions

  • Move the grid & init to be parameters on the dyn call and flag them dace.compiletime (see the sketch below)
  • Make sure that no frozen data is in fact different on non-top-tile ranks
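
A minimal sketch of the first option, assuming DaCe's dace.compiletime argument annotation; the function, its trivial body, and the scalar stand-in for the real grid fields are illustrative only:

import dace
import numpy as np

@dace.program
def dyn_step(u: dace.float64[12, 12], rarea: dace.compiletime):
    # rarea is an explicit compile-time argument: what gets frozen into the
    # compiled program is visible at the call boundary instead of being picked
    # up implicitly through self.X attributes.
    u[:] = u * rarea

u = np.ones((12, 12))
dyn_step(u, rarea=0.5)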

DOD
DEACTIVATE_DISTRIBUTED_DACE_COMPILE has been removed and the underlying problem is solved or documented

The GT4Py build process is not scalable

With the removal of future-stencil we are back to a system where every running rank requires its own .gt_cache_X to execute our model.
This approach works at low node counts (6/54 ranks) but is not scalable beyond that.
A solution for this was implemented for the dace backends, but the GridTools backends are not yet able to do the same.

We need a solution to build on a lower rank count and reuse the generated and compiled code across ranks (a toy sketch of the idea is below).
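
A toy illustration of the idea, not the existing dace-backend mechanism: ranks whose stencils are identical because they sit in the same position within a tile (corner, edge, interior) could share one compiled cache, so only a few representative ranks actually build.

def position_class(rank: int, layout: tuple) -> str:
    """Classify a rank by its position within an (nx, ny) tile layout; ranks in
    the same class see the same halo configuration and could share a cache."""
    nx, ny = layout
    i, j = rank % nx, (rank // nx) % ny
    x = "west" if i == 0 else ("east" if i == nx - 1 else "mid")
    y = "south" if j == 0 else ("north" if j == ny - 1 else "mid")
    return f"{x}-{y}"

# For a 3x3 tile layout every rank is its own class; for larger layouts all
# interior ranks map to "mid-mid" and could reuse a single compiled cache.
for rank in range(9):
    print(rank, position_class(rank, (3, 3)))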

Formatting with Black

The version of black pinned in constraints.txt is much newer than the version pre-commit uses. This causes inconsistencies with editors and should be brought in sync.

To fix, update the version in pre-commit to match the newer version in constraints.txt.

Non-square layouts are not supported by halo updates

As written, the halo update code in Pace does not support non-square MPI layouts (e.g. where a tile is decomposed into 3x2 ranks). This is because the abstractions for partitioning of ranks currently assume that for a given rank and a given direction, there is only one rank bordering in that direction. Fortunately, this assumption is not required by anything using the code which has this abstraction.

To fix this, the Partitioner class must be refactored, replacing the boundary(self, boundary_type: int, rank: int) -> Optional[bd.SimpleBoundary] method with a boundaries(self) -> Iterable[bd.SimpleBoundary] method; code which currently calls boundary while iterating over directions should instead iterate over the boundaries (see the sketch below). Once this is done, the implementation of boundaries can be modified to support non-square layouts, which should make all halo update code work for non-square layouts.
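
A minimal self-contained sketch of the interface change and how call sites would adapt; SimpleBoundary, the direction constants, and the toy two-rank layout are stand-ins, not the actual pace.util classes (a rank argument is added to boundaries here so the example runs standalone):

from dataclasses import dataclass
from typing import Iterable, Optional

NORTH, SOUTH, EAST, WEST = range(4)

@dataclass
class SimpleBoundary:
    boundary_type: int
    from_rank: int
    to_rank: int

class Partitioner:
    # Current interface: at most one bordering rank per direction, which is
    # exactly the square-layout assumption to be dropped.
    def boundary(self, boundary_type: int, rank: int) -> Optional[SimpleBoundary]:
        if boundary_type == EAST and rank == 0:
            return SimpleBoundary(EAST, from_rank=0, to_rank=1)
        return None

    # Proposed interface: enumerate every boundary of the rank, so a non-square
    # layout is free to yield several boundaries in the same direction.
    def boundaries(self, rank: int) -> Iterable[SimpleBoundary]:
        for boundary_type in (NORTH, SOUTH, EAST, WEST):
            b = self.boundary(boundary_type, rank)
            if b is not None:
                yield b

# Call sites iterate over boundaries instead of over directions.
for b in Partitioner().boundaries(rank=0):
    print(b)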

I would strongly recommend test-driven development for this modification, similar to the current partitioner boundaries tests. In developing the current code, it was very, very helpful to be able to try various "guesses" as to what code would work until something passed the (extensive) test cases.

pace-util: quantity attrs inaccessible

pace.util.Quantity has an attribute _attrs that can't be set via instantiation or a public setter (though it has a getter). Because Quantity is used in the fv3net integration for writing zarrs, fv3net is currently using private API to write a zarr with array attributes. In the spirit of making the fv3net-pace.util integration more robust, it seems like we should either add a setter for Quantity._attrs (sketched below) or go the other way and get rid of Quantity._attrs entirely.
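
A minimal sketch of the first option (a public setter); the class below is an illustrative stand-in, not the actual pace.util.Quantity:

class Quantity:
    def __init__(self, data, attrs=None):
        self._data = data
        self._attrs = dict(attrs or {})

    @property
    def attrs(self) -> dict:
        return self._attrs

    @attrs.setter
    def attrs(self, value: dict) -> None:
        # Public setter so fv3net can attach zarr array attributes without
        # reaching into the private _attrs field.
        self._attrs = dict(value)

q = Quantity(data=[1.0, 2.0])
q.attrs = {"long_name": "example field"}
print(q.attrs)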
