Comments (38)

mauricehuguenin commented on August 26, 2024

It works! I extended the spin-up by two years and the output is what I expected.

I switched to restart_format = 'pio' in ice/cice_in.nml and also replaced the #Collation and #Misc flags in the config.yaml file with those of the latest COSIMA/025deg_jra55_ryf@2b2be7b commit to avoid segmentation fault errors.

from libaccessom2.

rmholmes commented on August 26, 2024

Ah thanks @russfiedler, that indeed looks promising. @mpudig can you try again including all the options between # Misc and userscripts in the config.yaml that Russ has listed above? Don't add the userscripts because currently those aren't in your config directory.

russfiedler commented on August 26, 2024

There could be other things in the cice_in.nml file that might need checking for PIO use.
You're probably right about that openmpi version. I'm not sure why it's there or what its effect is.

aekiss commented on August 26, 2024

might be worth comparing your whole config with https://github.com/COSIMA/01deg_jra55_ryf to see if there's anything else amiss

aekiss commented on August 26, 2024

hmm, yes the latest libaccessom2 is set up for JRA55-do 1.4 which has separate solid and liquid runoff and is incompatible with JRA55-do 1.3. We may need to set up a JRA55-do 1.3 branch for libaccessom2 and cherry-pick the perturbation code changes.

@nichannah does that sound possible?

aekiss commented on August 26, 2024

@mauricehuguenin your executables are really old - they use libaccessom2 1bb8904 from 10 Dec 2019.

There have been a lot of commits since then, so applying the perturbation code changes could be tricky, but that's really a question for @nichannah.

It looks like the most recent commit supporting JRA55-do v1.3 was f6cf437 from 16 Apr 2020, so that might make a better starting point.

JRA55-do 1.4 support was merged into master at 4198e15 but it looks like this branch also included some unrelated commits.

See https://github.com/COSIMA/libaccessom2/network

rmholmes commented on August 26, 2024

Thanks @aekiss! Yes the 025deg_jra55_ryf9091_gadi spin-up was started at the end of December 2019, soon after Gadi came online. It would be a pity not to continue to use it given the resources that went into it.

@mauricehuguenin a good starting point might be to try using the f6cf437 libaccessom2 commit to extend the control run. If that works, then we can think about building the more recent perturbations code into that.

mauricehuguenin commented on August 26, 2024

I fetched the commit from the 16th of April COSIMA/025deg_jra55_ryf@2eb6a35 that has changes to atmosphere/forcing.json, config.yaml and ice/ice_input.nml. I then changed to the latest _a227a61 executables as those have the additive forcing functions.

Extending the spin-up with the 2eb6a35c commit works fine; with the latest executables, however, I get this abort message:

MPI_ABORT was invoked on rank 1550 in communicator MPI_COMM_WORLD
with errorcode 1.

Do the latest .exe files require the licalvf input files? These are currently not in my atmosphere/forcing.json file from the 2eb6a35c commit.

rmholmes commented on August 26, 2024

@mauricehuguenin I presume your run is the one at /home/561/mv7494/access-om2/025deg_jra55_ryf_ENSOWind/? If so, the error looks like Invalid restart_format: nc. This seems to be a cice error associated with the ice restarts (in https://github.com/COSIMA/cice5/blob/master/io_pio/ice_restart.F90). Something to do with Parallel IO changes??

However, in looking around I also noticed that there are many differences between your configs and the ones used for the spin-up (e.g. /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi/, or equivalently https://github.com/rmholmes/025deg_jra55_ryf/tree/ryf9091_gadi). E.g. you're using input_236a3011 rather than input_20200530 (although this may not make any difference). To me the best approach would be to start with the configs at https://github.com/rmholmes/025deg_jra55_ryf/tree/ryf9091_gadi and update only what we need to. In this case, that means the changes to atmosphere/forcing.json and ice/ice_input.nml in COSIMA/025deg_jra55_ryf@2eb6a35 (and the executables of course).

mauricehuguenin commented on August 26, 2024

I agree that this is the way to go. With the following changes to Ryan's 025deg_jra55_ryf/ryf9091_gadi spin-up:

In atmosphere/forcing.json:

+      "cname": "runof_ai",
+       "domain": "land"

In config.yaml the latest executables:

+      exe: /g/data/ik11/inputs/access-om2/bin/yatm_a227a61.exe
+      exe: /g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_af3a94d_libaccessom2_a227a61.x
+      exe: /g/data/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_2572851_libaccessom2_a227a61.exe

In ice/input_ice.nml:

+    fields_from_atm = 'swfld_i', 'lwfld_i', 'rain_i', 'snow_i', 'press_i', 'runof_i', 'tair_i', 'qair_i', 'uwnd_i', 'vwnd_i'
+    fields_to_ocn = 'strsu_io', 'strsv_io', 'rain_io', 'snow_io', 'stflx_io', 'htflx_io', 'swflx_io', 'qflux_io', 'shflx_io', 'lwflx_io', 'runof_io', 'press_io', 'aice_io', 'melt_io', 'form_io'
+    fields_from_ocn = 'sst_i', 'sss_i', 'ssu_i', 'ssv_i', 'sslx_i', 'ssly_i', 'pfmice_i'
+/

I run into the Invalid restart_format: nc abort. @aekiss, do you know what might be happening here? Is it related to the parallel IO changes Ryan mentioned above (#72 (comment))?

aidanheerdegen commented on August 26, 2024

@rmholmes If you want to keep this spin-up, would an alternative be to spin off a new control with the updated forcing (just use the ocean temp/salt state as the initial conditions), keep running the control you have for a decade, say, and then compare the two? You could then see whether they're broadly similar or, if they differ, whether the differences are what you'd expect. Or does this not really work as a strategy?

rmholmes commented on August 26, 2024

@aidanheerdegen that is another option, although changing forcing mid-way through a run is not very clean. If the differences between v1.3 and v1.4 are not significant it may not make a big difference.

@nichannah - it would be great to get your opinion on whether minor tweaks to the code to make it backwards-compatible are feasible.

aekiss commented on August 26, 2024

The default restart format was changed to pio in recent executables.
You could try setting restart_format = 'nc' in &setup_nml in ice/cice_in.nml.
This will disable parallel IO but that's less important at 0.25deg.
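
For anyone who wants to script this change rather than edit by hand, here is a minimal Python sketch, not part of the COSIMA tooling. The helper name `set_namelist_value` and the sample text (including the `&other_nml` group) are made up for illustration, and it only understands simple one-line `key = value` entries.

```python
import re


def set_namelist_value(text: str, group: str, key: str, value: str) -> str:
    """Return `text` with `key = value` set inside the `&group` block.

    Hypothetical helper: handles only simple one-line entries.
    """
    out, in_group, done = [], False, False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("&"):
            # Entering a namelist group; remember whether it is the target.
            in_group = stripped[1:].strip().lower() == group.lower()
        elif stripped == "/":
            if in_group and not done:
                # Key was absent from the target group: add it before the '/'.
                out.append(f"    {key} = {value}")
                done = True
            in_group = False
        elif in_group and re.match(rf"{re.escape(key)}\s*=", stripped, re.IGNORECASE):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}{key} = {value}"
            done = True
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative sample only, not a real cice_in.nml.
sample = """&setup_nml
    runtype = 'initial'
    restart_format = 'nc'
/
&other_nml
    restart_format = 'untouched'
/
"""

patched = set_namelist_value(sample, "setup_nml", "restart_format", "'pio'")
print(patched)
```

Only the entry in `&setup_nml` is rewritten; the same key in any other group is left alone.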

mauricehuguenin commented on August 26, 2024

Thanks Andrew, this option is already active in ice/cice_in.nml so it might be something else that is causing it.

aekiss commented on August 26, 2024

Ah ok, that may be the problem - have you tried restart_format = 'pio'?

aekiss commented on August 26, 2024

FYI I'm in the process of updating the model executables. This will include a fix to a bug in libaccessom2 a227a61.

aekiss commented on August 26, 2024

@mauricehuguenin I've put the latest executables here. It might be good to use these instead as they include a fix to a rounding error bug in libaccessom2. But they are completely untested so I'd be interested to hear if you have any issues with them.

/g/data/ik11/inputs/access-om2/bin/yatm_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM-BGC_6256fdc_libaccessom2_0ab7295.x
/g/data/ik11/inputs/access-om2/bin/cice_auscom_360x300_24p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_18x15.3600x2700_1682p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_2572851_libaccessom2_0ab7295.exe

mauricehuguenin commented on August 26, 2024

I can confirm that these latest executables work with no issues. 👍

rmholmes commented on August 26, 2024

@mauricehuguenin if this is working - can you close the issue?

mpudig commented on August 26, 2024

Hi - Ryan and I are attempting to run an RYF9091 ACCESS-OM2-01 simulation that supports relative humidity forcing and the perturbations code. We have used the same executables (but for 1/10-deg) posted above by @aekiss and are restarting from /g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/restart995.

The model is crashing because of what we believe to be a parallel I/O problem in the CICE outputs. The error logs are spitting out, among other things, the following:

libhdf5.so 000014827787BD11 H5D__layout_oh_cr Unknown Unknown
libhdf5.so 0000148277870EEF H5D__create Unknown Unknown
libhdf5.so.103.1. 000014827787D455 Unknown Unknown Unknown
libhdf5.so 0000148277974C3B H5O_obj_create Unknown Unknown
libhdf5.so.103.1. 0000148277938445 Unknown Unknown Unknown
libhdf5.so.103.1. 0000148277909FB2 Unknown Unknown Unknown
libhdf5.so 000014827790ABA0 H5G_traverse Unknown Unknown
libhdf5.so.103.1. 0000148277934D73 Unknown Unknown Unknown
libhdf5.so 0000148277939B72 H5L_link_object Unknown Unknown
libhdf5.so 000014827786E574 H5D__create_named Unknown Unknown
libhdf5.so 0000148277849473 H5Dcreate2 Unknown Unknown
libnetcdf.so.18.0 000014827B26BBBD Unknown Unknown Unknown
libnetcdf.so.18.0 000014827B26D099 Unknown Unknown Unknown
libnetcdf.so 000014827B26D854 nc4_rec_write_met Unknown Unknown
libnetcdf.so.18.0 000014827B26FADF Unknown Unknown Unknown
libnetcdf.so 000014827B27061D nc4_enddef_netcdf Unknown Unknown
libnetcdf.so.18.0 000014827B270180 Unknown Unknown Unknown
libnetcdf.so 000014827B27009D NC4__enddef Unknown Unknown
libnetcdf.so 000014827B2193EB nc_enddef Unknown Unknown
cice_auscom_3600x 000000000093A87F Unknown Unknown Unknown
cice_auscom_3600x 00000000006ADFAC ice_history_write 947 ice_history_write.f90
cice_auscom_3600x 000000000066699F ice_history_mp_ac 2023 ice_history.f90
cice_auscom_3600x 00000000004165C5 cice_runmod_mp_ci 411 CICE_RunMod.f90
cice_auscom_3600x 0000000000411212 MAIN__ 70 CICE.f90
cice_auscom_3600x 00000000004111A2 Unknown Unknown Unknown
libc-2.28.so 0000148279999493 __libc_start_main Unknown Unknown
cice_auscom_3600x 00000000004110AE Unknown Unknown Unknown

The model was crashing at the end of the first month when certain icefields_nml fields in cice_in.nml were set to 'm', and crashing at the end of the first day when set to 'd', so we are fairly confident the issue is coming from CICE.

It would be great if someone could have a look at this to see what is going wrong. My files are at /home/561/mp2135/access-om2/01deg_jra55_ryf_cont/ and all my changes have been pushed here: https://github.com/mpudig/01deg_jra55_ryf/tree/v13_rcpcont.

Thanks!

aidanheerdegen commented on August 26, 2024

The stack traces point to different builds, don't know if that is relevant, but if they're built against different MPI and/or PIO/netCDF/HDF5 libraries it might be problematic:

34 0x0000000000933ade pioc_change_def()  /home/156/aek156/github/COSIMA/access-om2-new/src/cice5/ParallelIO/src/clib/pioc_support.c:2985
35 0x00000000006ae0ec ice_history_write_mp_ice_write_hist_.V()  /home/156/aek156/github/COSIMA/access-om2/src/cice5/build_auscom_3600x2700_722p/ice_history_write.f90:947

So specifically /home/156/aek156/github/COSIMA/access-om2-new/ and /home/156/aek156/github/COSIMA/access-om2
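
A quick way to spot this kind of mix in a longer trace is to pull out the source-tree root of every frame. The following is a rough Python sketch; the function name `build_roots` is made up, and it assumes each root is whatever precedes the first `/src/` in the path, which happens to hold for the two frames quoted above.

```python
import re


def build_roots(trace_lines):
    """Return the set of source-tree roots (text before the first '/src/')."""
    roots = set()
    for line in trace_lines:
        # Non-greedy match so we stop at the first '/src/' in the path.
        match = re.search(r"(/\S+?)/src/", line)
        if match:
            roots.add(match.group(1))
    return roots


# The two frames quoted in the comment above.
trace = [
    "34 0x0000000000933ade pioc_change_def()  /home/156/aek156/github/COSIMA/access-om2-new/src/cice5/ParallelIO/src/clib/pioc_support.c:2985",
    "35 0x00000000006ae0ec ice_history_write_mp_ice_write_hist_.V()  /home/156/aek156/github/COSIMA/access-om2/src/cice5/build_auscom_3600x2700_722p/ice_history_write.f90:947",
]

roots = build_roots(trace)
for root in sorted(roots):
    print(root)
if len(roots) > 1:
    print("warning: frames come from more than one source tree")
```

More than one root does not prove the libraries are incompatible; it only flags that the build is worth a closer look.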

aekiss commented on August 26, 2024

Does this configuration work with other executables?

russfiedler commented on August 26, 2024

I also note that the status of the various pio_... calls is hardly ever checked before the pio_enddef call that finally fails is called. Naughty programmers!
Anyway, it's dying in ROMIO

mpudig commented on August 26, 2024

@aekiss, yes, we ran it with

/g/data/ik11/inputs/access-om2/bin/yatm_a227a61.exe
/g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_af3a94d_libaccessom2_a227a61.x
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_2572851_libaccessom2_a227a61.exe

originally and it crashed at the end of the first month as well.

aekiss commented on August 26, 2024

Are these the versions you need to use, or would something else be more ideal? If so I could try compiling that.

mpudig commented on August 26, 2024

Those are the latest ones we have used (https://github.com/mpudig/01deg_jra55_ryf/blob/v13_rcpcont/config.yaml) and the issue is still occurring with them.

I should maybe add too that these executables worked when running the 1/4-degree configuration (with relative humidity forcing)!

russfiedler commented on August 26, 2024

You haven't set the correct switches for using PIO in the mpirun command in config.yaml

e.g.
mpirun: --mca io ompio --mca io_ompio_num_aggregators 1
Also you want to set the UCX_LOG_LEVEL
See, for example

/g/data/ik11/outputs//access-om2-01/01deg_jra55v140_iaf_cycle4/output830/config.yaml
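
To check a flag string before submitting, something like the Python sketch below could help. It is only an illustration: the function names and the `WANTED` dictionary are made up, the wanted values are simply the two switches quoted above, and it looks only at `--mca <name> <value>` triples.

```python
import shlex


def mca_params(mpirun_flags: str) -> dict:
    """Collect `--mca <name> <value>` pairs from an mpirun flag string."""
    tokens = shlex.split(mpirun_flags)
    params = {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "--mca" and i + 2 < len(tokens):
            params[tokens[i + 1]] = tokens[i + 2]
            i += 3
        else:
            i += 1
    return params


def missing_pio_flags(mpirun_flags: str, wanted: dict) -> dict:
    """Return the wanted MCA settings that are absent or set differently."""
    found = mca_params(mpirun_flags)
    return {name: value for name, value in wanted.items() if found.get(name) != value}


# The two switches from the example above (assumed to be the full set needed).
WANTED = {"io": "ompio", "io_ompio_num_aggregators": "1"}

ok = missing_pio_flags("--mca io ompio --mca io_ompio_num_aggregators 1", WANTED)
bad = missing_pio_flags("", WANTED)
print("missing with flags set:", ok)
print("missing with no flags:", bad)
```

An empty result means both switches are present with the expected values.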

rmholmes commented on August 26, 2024

Also - I guess it would be best to remove the specification of openmpi/4.0.1 as that could clash with the versions used for compilation?

aidanheerdegen commented on August 26, 2024

Specifying modules like that overrides the automatic discovery using ldd, which is what mpirun does too I believe. Yes it is best not to do that, and just let it find the right one to use.

rmholmes commented on August 26, 2024

I've compared the cice_in.nml files (see /scratch/e14/rmh561/diff_cice_in.nml). The only differences I see that could be relevant are the history_chunksize ones - do these need to be specified for the parallel I/O?

aekiss commented on August 26, 2024

yes, Nic added these for parallel IO

mpudig commented on August 26, 2024

Hi, thanks all for your comments the other day. Implementing Russ's comments on including mpirun: --mca io ompio --mca io_ompio_num_aggregators 1 in config.yaml and Ryan's on adding history_chunksize to cice_in.nml fixed the original issue: the model ran successfully past month 1 and completed a full 3-month simulation.

However, the output has troubled us slightly. Comparing to the ik11 run over the same period (/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output996) there seem to be some physical differences in sea ice and some other variables. I'm attaching plots of the global average salt in my run and the ik11 run, as well as the difference in sea ice concentration between my run and the ik11 run. There seems to be systematically more sea ice in my run than in the ik11 run. My run is sitting at /scratch/e14/mp2135/access-om2/archive/01deg_jra55_ryf_cont/.

[Attached plots: compare_sea_ice_conc, comparing_salt]

We can't see any major changes in ice configs between our run and the ik11 run. However, there are lots of changes in the CICE executable between commits 2572851 and d3e8bdf which seem mostly to do with parallel I/O and WOMBAT. Do you think the (small) changes we are seeing are realistic with these executable changes, or has something gone awry?

mauricehuguenin commented on August 26, 2024

I can see that the run on ik11 uses additional input for mom:

input:
          - /g/data/ik11/inputs/access-om2/input_08022019/mom_01deg
          - /g/data/x77/amh157/passive/passive4

Matt is running without these passive fields on /g/data/x77. Is this input maybe causing the difference in the global fields? Unfortunately I am not a member of x77 and cannot have a look at the fields.

rmholmes commented on August 26, 2024

@mauricehuguenin that's just a passive tracer that Andy had included in the original control run. It won't influence the physics.

aekiss commented on August 26, 2024

Hmm, that seems surprising to me. Have you carefully checked all your .nml files?
nmltab can make this easier: https://github.com/aekiss/nmltab

aekiss commented on August 26, 2024

You're using all the same input files, right?

mpudig commented on August 26, 2024

One difference is that we use RYF.r_10.1990_1991.nc instead of RYF.q_10.1990_1991.nc as an atmospheric input field. But since no perturbation has been applied this shouldn't change things substantially. I think @rmholmes has tested this pretty extensively.

There are a few differences between some .nml files. I assume they're mostly because of various updates since the ik11 simulation was run (but maybe not...?):

In ocean/input.nml:

  • The ik11 run has max_axes = 100 under &diag_manager_nml, whereas my run doesn't.

In ice/input_ice.nml

  • My run has fields_from_atm, fields_to_ocn and fields_from_ocn options, whereas the ik11 run doesn't.

In ice/cice_in.nml

  • My run has istep0 = 0, whereas the ik11 run has istep0 = 6454080. (Does this seem strange?!)
  • My run has runtype = 'initial', whereas the ik11 run has runtype = 'continue'.
  • My run has restart = .false., whereas the ik11 run has restart = .true..
  • My run has restart_format = 'pio', whereas the ik11 run has restart_format = 'nc'.
  • My run has history_chunksize_x and _y (per Ryan's comment above).
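
A comparison like the one listed above can be roughly reproduced with a few lines of Python. This is only a sketch and not how nmltab works: the function names are made up, the parser understands only one-line `key = value` entries, placing all four entries under `&setup_nml` is an assumption, and `shared_key` is a hypothetical entry added to show that identical settings are dropped.

```python
def parse_namelist(text):
    """Parse simple one-line `key = value` entries into {(group, key): value}."""
    entries, group = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("&"):
            group = line[1:].strip()
        elif line == "/":
            group = None
        elif group and "=" in line:
            key, value = line.split("=", 1)
            entries[(group, key.strip())] = value.strip().rstrip(",")
    return entries


def diff_namelists(text_a, text_b):
    """Return {(group, key): (value_a, value_b)} for entries that differ."""
    a, b = parse_namelist(text_a), parse_namelist(text_b)
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }


# Values taken from the list above; group name and shared_key are assumptions.
mine = """&setup_nml
    istep0 = 0
    runtype = 'initial'
    restart = .false.
    restart_format = 'pio'
    shared_key = 1
/
"""
ik11 = """&setup_nml
    istep0 = 6454080
    runtype = 'continue'
    restart = .true.
    restart_format = 'nc'
    shared_key = 1
/
"""

diff = diff_namelists(mine, ik11)
for (group, key), (a_val, b_val) in diff.items():
    print(f"{group}.{key}: mine={a_val}  ik11={b_val}")
```

A key present in only one file shows up with `None` on the other side.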

rmholmes commented on August 26, 2024

In addition to Matt's comments above, yes we're using the same inputs (input_08022019).
