autoperf's People

Contributors

carns, shanedsnyder, srini009, sudheerchunduri

autoperf's Issues

application crash when MPI_Alltoallw() wrapped by APMPI uses the MPI_DATATYPE_NULL type

See darshan-hpc/darshan#698 for context from @bebosudo and @jedwards4b. This was originally filed on Darshan, but the problem is actually being encountered in the AutoPerf APMPI module in Darshan.

The underlying issue is that some MPI implementations may crash (or more pedantically, throw an unhandled error) if MPI_Type_size() is called on MPI_DATATYPE_NULL. The MPI spec doesn't specify what is supposed to happen in this case. AutoPerf and Darshan both issue MPI_Type_size() calls without the application's knowledge, and may therefore introduce unexpected application errors.

Since there are MPI libraries in production that do this, the safest solution is to wrap the calls to MPI_Type_size() in the APMPI wrappers with guards (maybe in a macro) that catch the MPI_DATATYPE_NULL case and either replace it with a safe type or just set the size to 0. This is especially important for application codes that already go out of their way to avoid passing MPI_DATATYPE_NULL into particular function calls but don't realize that AutoPerf may trigger additional MPI calls as part of its instrumentation.

Once we confirm a fix here, then the same fix can probably be applied in Darshan's core MPIIO module. The similar hypothetical problem in the MPIIO module can't be triggered right now as best I can tell, but it could be in the future if an MPI library behaved slightly differently.

Here is a single-process reproducer that reproduces the problem both on Cori with its default user environment and on my laptop with MPICH 3.4.1 and Darshan origin/main built with --enable-apmpi-mod:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int          ret;
    int          sendcounts[1] = {0};
    int          sdispls[1]    = {0};
    int          recvcounts[1] = {0};
    int          rdispls[1]    = {0};
    char         rbuffer[1]    = {0};
    char         sbuffer[1]    = {0};
    MPI_Datatype sendtypes[1]  = {MPI_DATATYPE_NULL};
    MPI_Datatype recvtypes[1]  = {MPI_DATATYPE_NULL};
    int          nprocs        = 0;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (nprocs != 1) {
        fprintf(stderr,
                "Error: this test program must be executed with exactly one "
                "process.\n");
        MPI_Finalize(); /* shut MPI down cleanly before bailing out */
        return (-1);
    }

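#if 0
    /* Enable this to have MPI return error codes instead of aborting.
     * MPI_Comm_set_errhandler() replaces the deprecated MPI_Errhandler_set(). */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
#endif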

    ret = MPI_Alltoallw(sbuffer, sendcounts, sdispls, sendtypes, rbuffer,
                        recvcounts, rdispls, recvtypes, MPI_COMM_WORLD);

    if (ret != MPI_SUCCESS)
        fprintf(stderr, "Error: MPI_Alltoallw(); continuing...\n");

    MPI_Finalize();
    return 0;
}

The desired behavior is for it to execute without crashing.

warnings from darshan-job-summary.pl when autoperf APMPI data is present

The darshan-job-summary.pl tool produces output that looks like this when executed on a Darshan log with AutoPerf APMPI data present:

Use of uninitialized value $str in substitution (s///) at /home/carns/working/install/lib/TeX/Encode.pm line 137, <PARSE_OUT> line 467.
Use of uninitialized value $str in substitution (s///) at /home/carns/working/install/lib/TeX/Encode.pm line 138, <PARSE_OUT> line 467.
Use of uninitialized value $str in substitution (s///) at /home/carns/working/install/lib/TeX/Encode.pm line 139, <PARSE_OUT> line 467.
Use of uninitialized value $str in substitution (s///) at /home/carns/working/install/lib/TeX/Encode.pm line 140, <PARSE_OUT> line 467.
...

I'm getting over 2,000 lines similar to that when running it on a single-process MPI-IO benchmark log.

It looks like the issue is that the AutoPerf parsed output does not have the same number of columns as other Darshan modules; in particular, the file name, mount pt, and fs type fields are missing. There also appear to be warnings later on because some counters have string rather than numeric values.

I'm not sure what the best fix is: either APMPI (and any other AutoPerf modules with the same problem; I haven't checked the others) needs to add dummy columns to its output, or darshan-job-summary.pl should be modified to skip AP records.
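
As a rough illustration of the "dummy columns" option, the sketch below formats one counter line with placeholder values in the three file-related fields. It is built on assumptions: the eight tab-separated columns (module, rank, record id, counter, value, file name, mount pt, fs type) are assumed to be what the parser output uses, the "N/A" placeholder is an arbitrary choice, and ap_format_counter_line and ap_count_columns are hypothetical names rather than existing AutoPerf or Darshan functions.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder for fields that have no meaning for a non-file record.
 * The value is an assumption chosen for illustration only. */
#define AP_DUMMY_FIELD "N/A"

/* Hypothetical helper: format one counter line with all eight assumed
 * tab-separated columns, filling file name, mount pt, and fs type with
 * placeholders. Returns the snprintf() result. */
static int ap_format_counter_line(char *buf, size_t len, const char *module,
                                  int64_t rank, uint64_t record_id,
                                  const char *counter, double value)
{
    return snprintf(buf, len,
                    "%s\t%" PRId64 "\t%" PRIu64 "\t%s\t%f\t%s\t%s\t%s",
                    module, rank, record_id, counter, value,
                    AP_DUMMY_FIELD, AP_DUMMY_FIELD, AP_DUMMY_FIELD);
}

/* Hypothetical helper: count the tab-separated columns in a line, which is
 * roughly the check a downstream parser could apply before using a record. */
static int ap_count_columns(const char *line)
{
    int cols = 1;
    for (const char *p = line; *p != '\0'; p++)
        if (*p == '\t')
            cols++;
    return cols;
}
```

The same column count could instead be used on the consuming side to skip short records, which corresponds to the second option; which side should change is the open question here.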

This problem is peculiar to darshan-job-summary.pl, since it uses Perl to parse the string output of darshan-parser, but we need to fix it anyway, at least until the Python tools are mature enough that darshan-job-summary.pl can be deprecated.
