Comments (4)
The cmbd issues all overlay bind/connect calls before entering its event loop; however, that is insufficient to ensure that messages are not lost.
This note from zmq_connect(3) explains the fundamental issue:
for most transports and socket types the connection is not
performed immediately but as needed by 0MQ. Thus a successful call
to zmq_connect() does not mean that the connection was or could
actually be established. Because of this, for most transports and
socket types the order in which a server socket is bound and a
client socket is connected to it does not matter. The first
exception is when using the inproc:// transport: you must call
zmq_bind() before calling zmq_connect(). The second exception are
ZMQ_PAIR sockets, which do not automatically reconnect to
endpoints.
I think epgm:// is the exception alluded to, but for testing with multiple cmbds per node, we are distributing events over ipc://.
One solution is to use the request/reply network to obtain events until the first one arrives on the pub/sub event network. We could sequence events and have a fresh cmbd send a request to its parent for event 0, 1, 2... until it receives an event N on its event overlay, then stop doing that.
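The catch-up idea above can be sketched as a small pure-Python simulation. This is only an illustration under stated assumptions, not cmbd code: the `Parent` and `Child` classes, their methods, and the `demo()` driver are all hypothetical, and no real 0MQ sockets are involved. It also assumes that when event N arrives on the event overlay, any sequence numbers not yet fetched below N are backfilled with requests before N is delivered, a detail the proposal does not spell out.

```python
class Parent:
    """Hypothetical parent cmbd that keeps a log of sequenced events."""

    def __init__(self):
        self.log = []          # payloads; list index == sequence number
        self.subscribers = []  # children whose event overlay is connected

    def publish(self, payload):
        """Pub/sub path: only already-connected subscribers see the event."""
        seq = len(self.log)
        self.log.append(payload)
        for child in self.subscribers:
            child.on_event(seq, payload)

    def request_event(self, seq):
        """Request/reply path: return event `seq`, or None if not published."""
        return self.log[seq] if seq < len(self.log) else None


class Child:
    """Hypothetical fresh cmbd catching up on events it may have missed."""

    def __init__(self, parent):
        self.parent = parent
        self.next_seq = 0      # next sequence number we still need
        self.live = False      # True once the event overlay delivered something
        self.delivered = []    # (seq, payload) in delivery order

    def poll(self):
        """Request events 0, 1, 2... until the event overlay is live."""
        while not self.live:
            payload = self.parent.request_event(self.next_seq)
            if payload is None:
                break          # nothing newer yet; try again later
            self._deliver(self.next_seq, payload)

    def on_event(self, seq, payload):
        """Event N arrived on the event overlay: stop polling from now on."""
        self.live = True
        # Assumption: backfill any gap below N via request/reply.
        while self.next_seq < seq:
            self._deliver(self.next_seq,
                          self.parent.request_event(self.next_seq))
        if seq == self.next_seq:
            self._deliver(seq, payload)
        # seq < next_seq means we already fetched it by request; drop it.

    def _deliver(self, seq, payload):
        self.delivered.append((seq, payload))
        self.next_seq = seq + 1


def demo():
    parent = Parent()
    child = Child(parent)
    parent.publish("e0")              # lost on pub/sub: not connected yet
    parent.publish("e1")              # lost on pub/sub as well
    child.poll()                      # recovers e0 and e1 by request
    parent.publish("e2")              # still not connected
    parent.subscribers.append(child)  # the connection finally completes
    parent.publish("e3")              # first event seen on the overlay
    child.poll()                      # no-op: overlay is live
    return child.delivered
```

In this sketch the child ends up with every event exactly once and in order, even though three of the four were published before its subscription was established. The cost is that the parent has to retain (or be able to regenerate) old events for as long as a child might still be catching up.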
from flux-core.
Simple solution: make connect synchronous by using zmonitor_t to monitor for ZMQ_EVENT_CONNECTED.
As zmonitor seems unreliable and its API is not stable, monitoring for ZMQ_EVENT_CONNECTED turns out not to be a very good way to fix this.
Should have been auto-closed when the above was merged.