Giter VIP home page Giter VIP logo

Comments (4)

SLongshaw avatar SLongshaw commented on July 29, 2024

Thank you for the bug report.

This is confirmed using g++9 and openmpi so not compiler/mpi library specific.

The segfault is triggered by the fetch_points/fetch_all call in your example but in fact stems from the creation of the internal message structure within the library triggered by the barrier() function, so is likely representative of a more deeply-rooted bug.

This will be looked into as a matter of urgency and this report updated once a fix is found.

A quick point for the moment, there is a mismatch in the chrono_sampler used, when using the specialisms approach to declaring MUI objects, it is important to match types, the interface and point types are 3d (3-dimensional and double data type) while the chrono_sampler is 1d. This isn't the underlying problem though.

from mui.

chrisrichardson avatar chrisrichardson commented on July 29, 2024

Thanks @SLongshaw. I will update my chrono_sampler dimensionality. Good luck finding the bug.

from mui.

SLongshaw avatar SLongshaw commented on July 29, 2024

Hi @chrisrichardson I think this should now be resolved in both branches.

To summarise, you managed to uncover a situation that doesn't normally arise using MUI, specifically the sending rank was able to exit before the receiving rank had completed its fetch and as MUI relies on a non-blocking send / blocking receive design this meant MPI buffers etc. were lost as the sending rank exited. Normally coupled designs mean this tends not to happen (not specifically by design admittedly).

In reality though, this represented a bug in the general design of MUI as there should always be a corresponding MPI wait (or equivalent) call (which there was) but there was nothing to block to wait to ensure completion.

The solution implemented is a new blocking wait on MPI_Isend completion but with an inbuilt timer to avoid deadlock.

The problem here is that the original design was the safest and most general approach but inherently had this potential problem, the new design is more "correct" in MPI terms but reduces the generality of the library, for most scenarios this shouldn't represent any difference but for those few where it does, the addition of a timeout and user warning messages is the (inelegant) solution.

This change will remain permanent as long as further test cases in the future don't highlight any unforeseen major issues with the new solution. In that case there is a work-around using the original approach and that is to ensure that the sending side finishes using a call to the barrier() function at a time stamp where it hasn't yet issued a fetch() command and, on the receiving side to send a final empty commit() at that time to unblock both sides, in effect adding a synchronisation barrier.

Please try this change out and feel free to re-open this issue if it is not resolved for you, for now though I will close.

from mui.

SLongshaw avatar SLongshaw commented on July 29, 2024

The fix for this problem eventually resulted in a runtime problem with a specific case where the new blocking wait adding significant overhead. The non-blocking wait has therefore been re-instated, however, a final blocking test loop has been added to the destructor of the MPI comm class, this has been tested against the issue raised here and the new solution still offers a fix while removing the need for a blocking MPI test loop after every send.

from mui.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.