argonne-lcf / autoperf
Core autoperf source
License: Other
The darshan-hpc/darshan#413 changes (already landed in darshan origin/main) made subtle changes, in the interest of performance optimization, to some common API functions and macros used by darshan modules.
I'll see if I can make a pass through to update these and contribute the changes in a PR for review. Once that's fixed we need to update the submodule reference accordingly.
See darshan-hpc/darshan#698 for context from @bebosudo and @jedwards4b. This was originally filed on Darshan, but the problem is actually being encountered in the AutoPerf APMPI module in Darshan.
The underlying issue is that some MPI implementations may crash (or more pedantically, throw an unhandled error) if MPI_Type_size() is called on MPI_DATATYPE_NULL. The MPI spec doesn't specify what is supposed to happen in this case. AutoPerf and Darshan both issue MPI_Type_size() calls without the application's knowledge, and may therefore introduce unexpected application errors.
Since there are MPI libraries in production that do this, the safest solution is to wrap the calls to MPI_Type_size() in the APMPI wrappers with a guard (maybe a macro) that catches the MPI_DATATYPE_NULL case and either substitutes a safe type or just sets the size to 0. This is especially important for application codes that already go out of their way to avoid passing MPI_DATATYPE_NULL into particular function calls but don't realize that autoperf may trigger additional MPI calls as part of its instrumentation.
Once we confirm a fix here, then the same fix can probably be applied in Darshan's core MPIIO module. The similar hypothetical problem in the MPIIO module can't be triggered right now as best I can tell, but it could be in the future if an MPI library behaved slightly differently.
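For illustration, the guard described above could look roughly like the sketch below. It is not the actual APMPI fix. Every name prefixed with sketch_ or SKETCH_ is a hypothetical stand-in, introduced so the example compiles without an MPI installation. In the real wrappers these would be MPI_Datatype, MPI_DATATYPE_NULL, MPI_SUCCESS, and the MPI_Type_size() call itself. The mock size query aborts on a null type to mimic the crash seen with some MPI libraries.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for MPI_Datatype, MPI_DATATYPE_NULL and MPI_SUCCESS,
 * so that this sketch builds without <mpi.h>. */
typedef int sketch_datatype;
#define SKETCH_DATATYPE_NULL ((sketch_datatype)0)
#define SKETCH_SUCCESS 0

/* Mock of MPI_Type_size(): aborts on a null datatype (mimicking an MPI
 * library that raises a fatal error) and reports 4 bytes otherwise. */
static int sketch_type_size(sketch_datatype dtype, int *size)
{
    if (dtype == SKETCH_DATATYPE_NULL) {
        fprintf(stderr, "fatal: size query on null datatype\n");
        abort();
    }
    *size = 4;
    return SKETCH_SUCCESS;
}

/* Guard: never pass a null datatype to the size query. A null type, or a
 * failed query, is recorded as a size of 0 bytes. */
#define SKETCH_TYPE_SIZE_SAFE(dtype, size_out)                             \
    do {                                                                   \
        if ((dtype) == SKETCH_DATATYPE_NULL)                               \
            (size_out) = 0;                                                \
        else if (sketch_type_size((dtype), &(size_out)) != SKETCH_SUCCESS) \
            (size_out) = 0;                                                \
    } while (0)

/* Example use: byte count for one (count, datatype) pair, as an
 * instrumentation wrapper might compute it. */
static long sketch_bytes(int count, sketch_datatype dtype)
{
    int size = 0;
    SKETCH_TYPE_SIZE_SAFE(dtype, size);
    return (long)count * size;
}
```

With this guard, sketch_bytes(3, SKETCH_DATATYPE_NULL) yields 0 instead of reaching the aborting call. Setting the size to 0 rather than substituting a safe type keeps the recorded byte counts from being inflated by a type the application never actually used.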
Here is a single-process reproducer that triggers the problem both on Cori with its default user environment and on my laptop with MPICH 3.4.1 and Darshan origin/main built with --enable-apmpi-mod:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int ret;
    /* All counts are zero, so no data is exchanged. */
    int sendcounts[1] = {0};
    int sdispls[1] = {0};
    int recvcounts[1] = {0};
    int rdispls[1] = {0};
    char rbuffer[1] = {0};
    char sbuffer[1] = {0};
    /* Null datatypes: the case the instrumentation must tolerate. */
    MPI_Datatype sendtypes[1] = {MPI_DATATYPE_NULL};
    MPI_Datatype recvtypes[1] = {MPI_DATATYPE_NULL};
    int nprocs = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != 1) {
        fprintf(stderr,
                "Error: this test program must be executed with exactly one "
                "process.\n");
        MPI_Finalize();
        return (-1);
    }

#if 0
    /* Enable to have MPI errors returned to the caller instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
#endif

    ret = MPI_Alltoallw(sbuffer, sendcounts, sdispls, sendtypes, rbuffer,
                        recvcounts, rdispls, recvtypes, MPI_COMM_WORLD);
    if (ret != MPI_SUCCESS)
        fprintf(stderr, "Error: MPI_Alltoallw(); continuing...\n");

    MPI_Finalize();
    return 0;
}
The desired behavior is for it to execute without crashing.
The darshan-job-summary.pl tool produces output that looks like this if executed on a darshan log with Autoperf APMPI data present:
Use of uninitialized value $str in substitution (s///) at /home/carns/working/install/lib/TeX/Encode.pm line 137, <PARSE_OUT> line 467.
Use of uninitialized value $str in substitution (s///) at /home/carns/working/install/lib/TeX/Encode.pm line 138, <PARSE_OUT> line 467.
Use of uninitialized value $str in substitution (s///) at /home/carns/working/install/lib/TeX/Encode.pm line 139, <PARSE_OUT> line 467.
Use of uninitialized value $str in substitution (s///) at /home/carns/working/install/lib/TeX/Encode.pm line 140, <PARSE_OUT> line 467.
...
I'm getting over 2,000 lines similar to that when running it on a log from a single-process MPI-IO benchmark.
It looks like the issue is that the Autoperf parsed output does not have the same number of columns as other Darshan modules; in particular, the file name, mount pt, and fs type fields are missing. It looks like there are also warnings later due to some counters having string rather than numeric values.
I'm not sure what the best fix is: either APMPI needs to add dummy columns to the output (along with any other Autoperf modules that have the same problem; I haven't checked the others), or darshan-job-summary.pl should be modified to skip AP records.
This problem is peculiar to darshan-job-summary, since it uses Perl to parse the string output from darshan-parser, but we still need to fix it, at least until the Python tools are mature enough for it to be deprecated.