Description of the issue:
A user was trying to run with `ladjust_bury_coeff` in `user_nl_marbl` (which is not a very common configuration); he was also trying to get 100+ SYPD out of the gx3v7 grid (which is not a very common requirement), so he was running with 288 ocean tasks. `gen_pop_decomp` was giving a layout that created 290 blocks, and he reported the model crashing at `ecosys_driver.F90:513`:
508 allocate(rmean_vals(size(marbl_instances(1)%glo_avg_rmean_interior_tendency)))
509 lscalar = .false.
510 call ecosys_running_mean_saved_state_get_var_vals('interior_tendency', lscalar, rmean_vals(:))
511 do n = 1, size(rmean_vals)
512 do iblock = 1, size(marbl_instances)
513 marbl_instances(iblock)%glo_avg_rmean_interior_tendency(n)%rmean = rmean_vals(n)
514 end do
515 end do
516 deallocate(rmean_vals)
It turns out the issue is that `marbl_instances` is size `max_blocks_clinic` (2, in his configuration) and we only want these loops running through `nblocks_clinic` (1 on most tasks), so `ladjust_bury_coeff` currently can't be true if any task has `nblocks_clinic < max_blocks_clinic`. Fixing that moved the error to `ecosys_driver.F90:640`:
637 if ((size(glo_avg_fields_interior, dim=4) /= 0) .or. (size(glo_avg_fields_surface, dim=4) /= 0)) then
638 allocate(glo_avg_area_masked(nx_block, ny_block, nblocks_clinic))
639 where (land_mask(:,:,:))
640 glo_avg_area_masked(:,:,:) = TAREA(:,:,:)
641 else where
642 glo_avg_area_masked(:,:,:) = c0
643 end where
(I think the third dimension of both `land_mask` and `TAREA` is `max_blocks_clinic`, while the `allocate()` statement for `glo_avg_area_masked` on line 638 uses `nblocks_clinic` instead.)
As you can tell, I've started working on a fix for this... I think I changed the above block to explicitly use `1:nblocks_clinic` for the third dimension of `land_mask` in 639 and `TAREA` in 640, but got yet another error elsewhere.
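For anyone picking this up later, the two changes tried so far can be sketched outside of POP. This is a standalone toy program, not a patch: the array names mirror the ones above, but the sizes, the mask contents, the `r8` kind, and the plain `rmean` array (standing in for the `marbl_instances(:)` components) are all made up for illustration, and it assumes my guess about the third dimension of `land_mask` and `TAREA` is right.

```fortran
! Standalone sketch, not POP source. Names mirror the issue; sizes, mask
! values and the r8 kind are hypothetical stand-ins.
program nblocks_vs_max_blocks
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: nx_block = 4, ny_block = 3
  integer, parameter :: max_blocks_clinic = 2   ! most blocks on any task
  integer, parameter :: nblocks_clinic = 1      ! blocks on this task
  real(r8), parameter :: c0 = 0.0_r8

  ! Stand-in for marbl_instances(:)%glo_avg_rmean_interior_tendency(n)%rmean
  real(r8) :: rmean(max_blocks_clinic)
  ! Assumed to be dimensioned by max_blocks_clinic, as guessed in the issue
  real(r8) :: TAREA(nx_block, ny_block, max_blocks_clinic)
  logical  :: land_mask(nx_block, ny_block, max_blocks_clinic)
  real(r8), allocatable :: glo_avg_area_masked(:,:,:)
  integer :: iblock

  rmean = -1.0_r8
  TAREA = 2.0_r8
  land_mask = .true.
  land_mask(1, :, :) = .false.   ! mask out the first row of every block

  ! Change 1: loop to nblocks_clinic rather than the size of the array,
  ! which is max_blocks_clinic
  do iblock = 1, nblocks_clinic
    rmean(iblock) = 5.0_r8
  end do

  ! Change 2: slice the third dimension so the mask, the source and the
  ! target all have shape (nx_block, ny_block, nblocks_clinic)
  allocate(glo_avg_area_masked(nx_block, ny_block, nblocks_clinic))
  where (land_mask(:,:,1:nblocks_clinic))
    glo_avg_area_masked(:,:,:) = TAREA(:,:,1:nblocks_clinic)
  else where
    glo_avg_area_masked(:,:,:) = c0
  end where

  ! Self-checks: the unused block slot is untouched, masked points are zero
  if (rmean(2) /= -1.0_r8) error stop 'unused block slot was modified'
  if (any(glo_avg_area_masked(1,:,:) /= c0)) error stop 'masked points not zero'
  if (any(glo_avg_area_masked(2:,:,:) /= 2.0_r8)) error stop 'unmasked points wrong'
  print '(a)', 'shapes conform'
  deallocate(glo_avg_area_masked)
end program nblocks_vs_max_blocks
```

Without the `1:nblocks_clinic` slices the `where` construct mixes arrays of different shapes, which is not conforming Fortran; building with bounds checking turned on should flag it. The sketch says nothing about the later error I hit elsewhere.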
The original user who reported the problem was happy to be given a 252-task layout that keeps `max_blocks_clinic=1`, so fixing this is not urgent. I'm putting all this detail in the issue ticket because I'm going to set it aside for a few weeks while I focus on more pressing issues, but it would probably be good to eventually come back and fix the bug.
I also think it would be useful to update the test suite to explicitly test cases where `ladjust_bury_coeff = .true.` and either some tasks have more blocks than others, or some tasks have no blocks. I expect both of those tests would fail currently.
Version:
- CESM: `2_3_beta09`; I believe the first user was running CESM 2.1.x
- POP2: `cesm_pop_2_1_20220322`
Machine/Environment Description:
The error was reported on cheyenne, and that's also where I reproduced the issue in the latest codebase.
Any xml/namelist changes or SourceMods: