Comments (9)
You may already know this, but be aware that SchedMD changed the srun command line in 23.11 - you might need to make some adjustments.
from ompi.
Rebuilt Slurm and Open MPI to use PMIx 4.2.9 and dropped the PMIX_MCA_gds=hash setting. Ran ~3000 hello_world jobs in this environment without seeing any core dumps.
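The ~3000-run soak test described above could be scripted as a small harness along these lines; the launch command, process count, and binary name (`hello_world_mpi`) in the usage comment are assumptions, not taken from the actual setup.

```python
import subprocess

def soak_test(cmd, runs=3000):
    """Run `cmd` repeatedly and return the 1-based indices of failed runs.

    A nonzero exit status (e.g. a rank that core-dumped) counts as a failure.
    """
    failures = []
    for i in range(1, runs + 1):
        if subprocess.run(cmd).returncode != 0:
            failures.append(i)
    return failures

# Hypothetical usage on the cluster (srun options are assumptions):
# bad = soak_test(["srun", "--mpi=pmix", "-n", "4", "./hello_world_mpi"])
# print(f"{len(bad)} of 3000 runs failed")
```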
Just to be clear: you originally said you ran 3000 jobs with mpirun using PMIx 5.0.2 and saw no problems. So I'm assuming your last test refers to executing with srun and not mpirun - yes??
I fail to see a connection between PMIx and ob1/recv being caught in a segfault - we don't have anything to do with the MPI message exchange. Likewise, it's hard to see what srun has to do with it, so I have no idea what to suggest. Given everything you have encountered across the two issue reports, I suspect there is something more fundamentally borked in this system.
My apologies - blasted github had me logged into a different account when I wrote the above note. Sigh.
no worries - thanks for taking a look at this for me.
Yep, the new testing with an srun launch against the PMIx 4.2.9 based Slurm/Open MPI build showed no core dumps in ~3000 runs. I'll stick with this new setup for now since things seem happier.
If you can think of any env variables I can set to provide more debug information, please let me know and I can give them a try and report back what I find.
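One generic option for more debug output is to raise Open MPI's MCA verbosity levels through the environment before launching. The sketch below follows Open MPI's `OMPI_MCA_<framework>_base_verbose` naming convention; which particular frameworks would be most informative for this failure is an assumption on my part, not a recommendation from the thread.

```python
import os
import subprocess

def run_with_verbosity(cmd, level=100):
    """Launch `cmd` with some Open MPI MCA verbosity variables set."""
    env = dict(
        os.environ,
        OMPI_MCA_pml_base_verbose=str(level),  # point-to-point layer (e.g. ob1)
        OMPI_MCA_btl_base_verbose=str(level),  # byte transfer layer
        OMPI_MCA_plm_base_verbose=str(level),  # process launch machinery
    )
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Hypothetical usage (srun options are assumptions):
# out = run_with_verbosity(["srun", "--mpi=pmix", "-n", "4", "./hello_world_mpi"])
# print(out.stderr)
```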
Gave this some thought - given that things work fine under mpirun but fail under srun, I'm inclined to think there is some problem in the Slurm-PMIx integration when using PMIx 5.x. I know nothing about debugging Slurm, so I would really encourage you to file a ticket with SchedMD. At the very least, they should be made aware of the situation in case others encounter it.
It still feels to me like there is something else in your environment causing the problem (and the PMIx change being just a canary or flat out red herring), but minus more info, I have no idea how to pursue it.
one last note to add here before closing this one out and turning my focus to the Slurm/SchedMD side of the house. Two interesting things:
- adding strace in front of the hello_world_mpi application buries/hides/avoids the issue
- removing the cgroups options from slurm.conf also appears to bury/hide/avoid the issue
Turns out that I had disabled cgroups in my testing area earlier and forgotten about it. My comments above about PMIx impacting this issue should be ignored. Much more likely it was the change in my Slurm configuration in my test environment that changed the launch behavior.
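For reference, the strace wrapping from the first bullet amounts to prefixing the launch command. Running under strace also slows execution considerably, which can mask timing-sensitive bugs - consistent with the issue disappearing. The exact flags used in the original test are not stated, so the ones below (`-f` to follow child processes, `-o` to keep the trace out of the application's output) are assumptions.

```python
def wrap_with_strace(cmd, logfile="strace.out"):
    """Prefix a command with strace (-f: follow forks, -o: trace to file)."""
    return ["strace", "-f", "-o", logfile] + list(cmd)

# Hypothetical usage:
# wrap_with_strace(["./hello_world_mpi"])
# -> ["strace", "-f", "-o", "strace.out", "./hello_world_mpi"]
```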
@bhendersonPlano If this issue is not in OMPI but rather in Slurm or PMIx, can you please file it with the corresponding community and close it here?
I've started a thread on the slurm-users mailing list - hopefully someone will chime in there.
I'll close this one out as it does not appear to be an Open MPI issue.