Comments (28)

ikalash avatar ikalash commented on August 18, 2024 1

This has been resolved.

from albany.

mperego avatar mperego commented on August 18, 2024

@ikalash , Use Serial Mesh is set to true when ALBANY_PARALELL_EXODUS is defined
https://github.com/sandialabs/Albany/blob/master/tests/landIce/FO_GIS/CMakeLists.txt#L8-L18

ikalash avatar ikalash commented on August 18, 2024

Thanks @mperego . This option doesn't seem to work. When I set it, the configure output for Albany says the following:

-- Performing Test PARALLEL_EXODUS_SUPPORTED
-- Performing Test PARALLEL_EXODUS_SUPPORTED - Success
-- ALBANY_PARALELL_EXODUS set to True

I guess the logic you pointed me to is for FO_GIS only but there are other tests like in corePDEs that have 'Use Serial Mesh = ON'. It looks like the code where the option is set is here:

https://github.com/sandialabs/Albany/blob/master/CMakeLists.txt#L426-L432 .

Strangely, setting the option to False doesn't even turn off the Use Serial Mesh tests for land ice, it looks like. Did the capability get broken at some point?

mperego avatar mperego commented on August 18, 2024

All the tests should have similar logic, unless they run in serial. If they don't, we should fix them.
Anyway, that's not what is causing the issue here.

I'll let @bartgol reply; he understands this logic and the libraries needed much better than I do.

bartgol avatar bartgol commented on August 18, 2024

It's hard to tell what the issue is. More precisely, it's hard to tell what could cause an issue here, while things run fine on other machines.

The error msg suggests that this is NOT a pnetcdf file, but rather a classic netcdf file, but that's the same file we use everywhere. Maybe the installation of trilinos is pointing to some funky netcdf installations. It's hard to tell.
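
One quick way to see which on-disk format a mesh file actually has, independent of how Trilinos or SEACAS opens it, is to look at its leading magic bytes. The following is only a minimal Python sketch: the function name `netcdf_kind` is made up for this example, it inspects offset 0 only, and it says nothing about whether the netCDF library in use was built with parallel support.

```python
def netcdf_kind(path):
    """Classify a file by the magic bytes at offset 0 (illustrative helper)."""
    with open(path, "rb") as handle:
        header = handle.read(8)

    # HDF5 signature: netCDF-4 files are stored in an HDF5 container.
    if header == b"\x89HDF\r\n\x1a\n":
        return "HDF5 (netCDF-4)"

    # Classic-family files start with "CDF" followed by a version byte.
    if len(header) >= 4 and header[:3] == b"CDF":
        variants = {
            1: "classic (CDF-1)",
            2: "64-bit offset (CDF-2)",
            5: "64-bit data (CDF-5)",
        }
        return variants.get(header[3], "unknown CDF variant")

    return "not a netCDF file"
```

Calling `netcdf_kind` on the mesh file used by a failing test would show whether it is a classic-family file or an HDF5-based one, which narrows down which code path the library should take.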

ikalash avatar ikalash commented on August 18, 2024

I think there are 2 separate issues:
1.) ALBANY_PARALELL_EXODUS=FALSE does not actually disable the tests that have 'Use Serial Mesh = True'.
2.) The seacas/netcdf error itself.

I think we can address 1.) on the Albany side. Can I move the logic here https://github.com/sandialabs/Albany/blob/master/tests/landIce/FO_GIS/CMakeLists.txt#L8-L18 to the main Albany/src/CMakeLists.txt, so that it is applied everywhere? I'm not sure why it was applied only to the LandIce tests.

bartgol avatar bartgol commented on August 18, 2024

I'm confused, I thought you said that in that build ALBANY_PARALELL_EXODUS is ON. I think that logic is in landIce since landIce is the only package that can handle both serial/parallel mesh read (for a given test). The only non-landice test failing is corePDEs_SteadyHeatConstrainedOpt2D_Conductivity_Dist_Param_Restart, but that test runs only if ALBANY_PARALLEL_EXODUS is ON, so I don't think moving that logic would help.

That said, if you want to move that snippet of cmake code, I would keep it in the tests folder, so maybe in tests/CMakeLists.txt?

gsjaardema avatar gsjaardema commented on August 18, 2024

The error message is confusing. Basically, there is an error when trying to open the file, and Exodus is saying that the file should be openable, because we have the right configuration for the type of file it is and the file exists, but there is still a problem opening it.

The real error, found at the netCDF level, is "NetCDF: Parallel operation on file opened for non-parallel access", which means that the file was opened as a serial file, but something in the open function then tried to do a parallel operation on it. I'm not sure what that operation would be or how it is getting called. It is inside the parallel Exodus open function ex_open_par rather than the serial open function ex_open, so maybe that is causing it.

But, that code should not have changed recently, so not sure why it would be giving you an issue now...

ikalash avatar ikalash commented on August 18, 2024

> I'm confused, I thought you said that in that build ALBANY_PARALELL_EXODUS is ON. I think that logic is in landIce since landIce is the only package that can handle both serial/parallel mesh read (for a given test). The only non-landice test failing is corePDEs_SteadyHeatConstrainedOpt2D_Conductivity_Dist_Param_Restart, but that test runs only if ALBANY_PARALLEL_EXODUS is ON, so I don't think moving that logic would help.
>
> That said, if you want to move that snippet of cmake code, I would keep it in the tests folder, so maybe in tests/CMakeLists.txt?

I guess what I am saying is that ALBANY_PARALLEL_EXODUS looks like it's not read in from the configure script. It is set here based on the Trilinos configuration:

https://github.com/sandialabs/Albany/blob/master/CMakeLists.txt#L426-L432

I was thinking it should be possible to set that option from the configure script too.

bartgol avatar bartgol commented on August 18, 2024

If you don't want the test to read a serial mesh (which is what landIce does if ALBANY_PARALELL_EXODUS=ON), we can add a FIXTURE_SETUP test that decomposes the mesh (much like the FO_GIS folder does), so that the mesh can be read already partitioned.

ikalash avatar ikalash commented on August 18, 2024

Thanks @bartgol . I am not sure what to do about the attaway tests given @gsjaardema 's comment. Disabling the tests would be sort of a bandaid fix. It seems like the cryptic netcdf error is not really telling us much. I just got the nightly back up with a new compiler - I suspect the compiler is what is causing the problems, or perhaps how netcdf was built using the compiler? I am not sure we want to spend resources troubleshooting this ourselves right now. @mperego do you have any recommendations for how to proceed?

mperego avatar mperego commented on August 18, 2024

@ikalash Maybe we can change the logic so that when ALBANY_PARALLEL_EXODUS is set from cmake configure script it overrides this logic https://github.com/sandialabs/Albany/blob/master/CMakeLists.txt#L426-L432.
In this way you can turn it off for Attaway builds.
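
The precedence being proposed here can be modeled in a few lines. This is a Python illustration of the idea only, not Albany's CMake; the function name and its arguments are made up for the example.

```python
from typing import Optional


def effective_parallel_exodus(detected: bool,
                              user_setting: Optional[bool] = None) -> bool:
    """Return the value the build should use (hypothetical model).

    detected:     result of the automatic check against the Trilinos install
    user_setting: value given explicitly at configure time, or None if absent
    """
    if user_setting is not None:
        return user_setting  # an explicit choice always wins
    return detected          # otherwise fall back to auto-detection
```

Under this model, a machine where detection succeeds but the tests fail could pass an explicit `False` and get the option switched off, while every other build keeps the detected value.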

bartgol avatar bartgol commented on August 18, 2024

Why can't we just partition the mesh offline, and run with Use Serial Mesh: false?

mperego avatar mperego commented on August 18, 2024

We talked about this before; I think it is a handy capability to keep around. But if everyone else thinks we are better off removing the Use Serial Mesh capability, I'm fine with that.

ikalash avatar ikalash commented on August 18, 2024

What's weird about some of the tests is that they use 'Use Serial Mesh = OFF' for the main disc but not for the sideset disc, it looks like, e.g.,

https://github.com/sandialabs/Albany/blob/master/tests/landIce/FO_GIS/input_fo_humboldt_frosch_fluxdiv.yaml#L137
https://github.com/sandialabs/Albany/blob/master/tests/landIce/FO_GIS/input_fo_humboldt_frosch_fluxdiv.yaml#L199

I could switch some of the problematic tests with "mixed" use of "Use Serial Mesh" to just set it to off, if folks are OK with that.

mperego avatar mperego commented on August 18, 2024

The 3d mesh is extruded from the side (basal) mesh, so once you partition the 2d mesh, the extruded mesh is already partitioned and 'Use Serial Mesh' is not needed.
I doubt that's the problem. I think the problem is when importing and partitioning the 2d mesh.

ikalash avatar ikalash commented on August 18, 2024

@mperego : the thing is if I set 'Use Serial Mesh: false' in both instances, the case runs (provided we've partitioned ../AsciiMeshes/Humboldt/humboldt_contiguous_2d.exo).

mperego avatar mperego commented on August 18, 2024

Sure. But then we are not testing the Use Serial Mesh capability. It would be better to disable it with a cmake configuration flag, so that we can do that only on Attaway.

ikalash avatar ikalash commented on August 18, 2024

Yes, I agree. In these tests I forced ALBANY_PARALLEL_EXODUS to be false. What I am saying is that it's not correct in this case to have 'Use Serial Mesh = ON' for the SS mesh, which is the same as what you are saying. Some ifdefs can correct this, I agree.

mperego avatar mperego commented on August 18, 2024

Sorry, I don't understand the issue with the fluxdiv test you linked above.
Use Serial Mesh on the SS mesh can be either OFF, or ON. To be ON, you need the proper mesh libraries, which you don't have on Attaway. There is nothing wrong per se with that test, I think.
If you force ALBANY_PARALLEL_EXODUS to be false on Attaway, then you should not need to modify the test. Am I missing something?

ikalash avatar ikalash commented on August 18, 2024

@mperego : you are right, the logic for USE_SERIAL_MESH is set correctly based on ALBANY_PARALLEL_EXODUS: https://github.com/sandialabs/Albany/blob/master/tests/landIce/FO_GIS/CMakeLists.txt#L9-L11 . I had an error that was causing me not to set the latter to off. Sorry about the confusion. I do think there are still some issues with the logic for the FO_GIS tests in this case, though. For instance, all the decomp tests in landIce need to be disabled if ALBANY_PARALLEL_EXODUS is false. I can fix it.

mperego avatar mperego commented on August 18, 2024

@ikalash the decomp tests normally work when ALBANY_PARALLEL_EXODUS is off. At least on all the other machines we test on.

mperego avatar mperego commented on August 18, 2024

In fact, we use decomp exactly when ALBANY_PARALLEL_EXODUS is off, so that we do not need Use Serial Mesh on.
https://github.com/sandialabs/Albany/blob/master/tests/landIce/FO_GIS/CMakeLists.txt#L64-L76
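
Put together, the behaviour described in this thread for the FO_GIS tests amounts to a two-way switch. Here is a rough Python model of which path a test takes; the function and key names are hypothetical, and the real logic lives in the CMake file linked above.

```python
def mesh_read_plan(parallel_exodus: bool) -> dict:
    """Model of how a test obtains its mesh (illustration only)."""
    if parallel_exodus:
        # The unpartitioned mesh is read directly and distributed at run time.
        return {"Use Serial Mesh": True, "run_decomp_first": False}
    # Otherwise the mesh is partitioned offline by decomp beforehand,
    # so each rank reads its own piece and Use Serial Mesh stays off.
    return {"Use Serial Mesh": False, "run_decomp_first": True}
```

In this model the two settings are mutually exclusive: a test either relies on Use Serial Mesh or on a prior decomp step, never both.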

ikalash avatar ikalash commented on August 18, 2024

@mperego : can you please have a look at my PR to enable setting ALBANY_PARALLEL_EXODUS = OFF from input file? #1050 There are still issues on attaway with decomp and similar utilities not working properly. I will ask @gsjaardema about this once the PR is merged and the errors are showing up on our CDash site.

ikalash avatar ikalash commented on August 18, 2024

@gsjaardema : it looks like the decomp utility in our new attaway build post RHEL8 upgrade is broken. Here is the output: https://sems-cdash-son.sandia.gov/cdash/test/4973558

[swa10:261798] *** Process received signal ***
[swa10:261798] Signal: Aborted (6)
[swa10:261798] Signal code:  (-6)
[swa10:261798] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x15554b328cf0]
[swa10:261798] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x15554af9facf]
[swa10:261798] [ 2] /lib64/libc.so.6(abort+0x127)[0x15554af72ea5]
[swa10:261798] [ 3] /projects/aue/hpc/builds/x86_64/rhel8/2410comp/toolchain-intel-2024.1.0/install/linux-rhel8-x86_64/oneapi-2024.1.0/hdf5-1.14.2-6tbjgnq/lib/libhdf5.so.310(+0x5af91)[0x155554106f91]
[swa10:261798] [ 4] /lib64/libnetcdf.so.15(NC4_create+0x19c)[0x15555504cc7c]
[swa10:261798] [ 5] /lib64/libnetcdf.so.15(NC_create+0x204)[0x15555500aa44]
[swa10:261798] [ 6] /lib64/libnetcdf.so.15(nc__create+0x19)[0x15555500ab39]
[swa10:261798] [ 7] /projects/albany/nightlyCDash/build/TrilinosSerialInstall/bin/nem_slice[0x47eed1]
[swa10:261798] [ 8] /projects/albany/nightlyCDash/build/TrilinosSerialInstall/bin/nem_slice[0x46589d]
[swa10:261798] [ 9] /projects/albany/nightlyCDash/build/TrilinosSerialInstall/bin/nem_slice[0x461014]
[swa10:261798] [10] /projects/albany/nightlyCDash/build/TrilinosSerialInstall/bin/nem_slice[0x45f672]
[swa10:261798] [11] /lib64/libc.so.6(__libc_start_main+0xe5)[0x15554af8bd85]
[swa10:261798] [12] /projects/albany/nightlyCDash/build/TrilinosSerialInstall/bin/nem_slice[0x40a69e]
[swa10:261798] *** End of error message ***
/projects/albany/nightlyCDash/build/TrilinosSerialInstall/bin/decomp: line 146: 261798 Aborted                 ( ${NOOP:+echo }$NEM_SLICE $decomp_type $spheres $do_viz $decomp_method $weighting $nem_slice_flag -o $nemesis -m mesh=$processors $genesis )

ERROR:******************************************************************
ERROR:
ERROR     During nem_slice execution. Check error output above and rerun
ERROR:
ERROR:******************************************************************

Have you seen this before? The decomp utility is built as part of Trilinos and works correctly on other platforms.

gsjaardema avatar gsjaardema commented on August 18, 2024

Very strange. Looks like it is having an issue creating a file during the nem_slice part of decomp.

However, on closer look, it is using the /lib64/libnetcdf.so and the ...aue... libhdf5.

Are those the TPLs that Trilinos usually uses? I thought they were either all from /projects/aue/hpc/... or were from sems?

gsjaardema avatar gsjaardema commented on August 18, 2024

Here is the libnetcdf.settings file from the /lib64/libnetcdf.so on attaway:

# NetCDF C Configuration Summary
==============================

# General
-------
NetCDF Version:         4.7.0
Configured On:          Sun Jul 16 05:24:17 UTC 2023
Host System:            x86_64-redhat-linux-gnu
Build Directory:        /builddir/build/BUILD/netcdf-c-4.7.0/build
Install Prefix:         /usr

# Compiling Options
-----------------
C Compiler:             /usr/bin/gcc
CFLAGS:                 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection
CPPFLAGS:               -I/usr/include/hdf
LDFLAGS:                -Wl,-z,relro  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -L/usr/lib64/hdf
AM_CFLAGS:              
AM_CPPFLAGS:            
AM_LDFLAGS:             
Shared Library:         yes
Static Library:         yes
Extra libraries:        -ljpeg -lmfhdf -ldf -ljpeg -lsz -lhdf5_hl -lhdf5 -lm -ldl -lz -lcurl -ltirpc

It is an old version of netcdf (4.7.0) and it looks like it is supposed to link to the hdf5 library also in /lib64, but is instead using a different libhdf5.so...

I think the issue is that the wrong libraries are being used; but not sure how to fix that problem.
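
The mismatch can be read straight off the backtrace. As a rough illustration, the Python sketch below pulls the shared-object paths out of backtrace frames of the form posted earlier in this thread and reports which directory each netCDF or HDF5 library was loaded from. The function names and the regular expression are assumptions made for this example and are not part of any Trilinos or SEACAS tool.

```python
import re
from collections import defaultdict

# Matches the path in frames such as
# "[host:pid] [ 4] /lib64/libnetcdf.so.15(NC4_create+0x19c)[0x...]"
FRAME_RE = re.compile(r"\[\s*\d+\]\s+(/[^\s(\[]+)")


def libraries_by_directory(backtrace: str) -> dict:
    """Group the binaries and shared objects named in a backtrace by directory."""
    groups = defaultdict(set)
    for line in backtrace.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a stack frame line
        directory, _, name = match.group(1).rpartition("/")
        groups[directory].add(name)
    return dict(groups)


def io_library_directories(backtrace: str) -> dict:
    """Map each netCDF/HDF5 library in the trace to the directory it came from."""
    found = {}
    for directory, names in libraries_by_directory(backtrace).items():
        for name in names:
            if name.startswith(("libnetcdf", "libhdf5")):
                found[name] = directory
    return found
```

If the values returned by `io_library_directories` are not all under the same install prefix, as with the `/lib64` netCDF and the `/projects/aue/...` HDF5 here, that is a hint that the build picked up TPLs from two different installations.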

ikalash avatar ikalash commented on August 18, 2024

You're right, somehow in the Trilinos build, the netcdf path is not being set from the module: https://sems-cdash-son.sandia.gov/cdash/build/73094/configure . I am amazed how Albany works at all then. I will fix the path. I think it will fix the problem. Thanks!
