
e3sm-io's Introduction

Parallel I/O Kernel Case Study -- E3SM

This repository contains a case study of the parallel I/O kernel from the E3SM climate simulation model. E3SM is one of the Department of Energy (DOE) mission applications designed to run on the DOE leadership-class parallel computers. The E3SM I/O module, Scorpio, can make use of existing I/O libraries, such as PnetCDF, NetCDF-4, HDF5, and ADIOS. The benchmark program in this repository was developed to evaluate the E3SM I/O kernel performance using the above-mentioned libraries. Achieving good I/O performance for E3SM on HPC systems is challenging because its data access pattern consists of a large number of small, unordered, noncontiguous write requests on each process.

Data Partitioning Pattern in E3SM

The problem domain in an E3SM simulation is represented by a cubed-sphere grid which is partitioned among multiple processes along only the X and Y axes that cover the surface of the problem domain. Domain partitioning among processes uses the Hilbert space-filling curve algorithm to first linearize the 2D cubed-sphere grid into subgrids, where data points in a subgrid are physically close to one another, and then divide the linearized subgrids evenly among the processes. This partitioning strategy produces in each process a long list of small, noncontiguous write requests, and the file offsets of two consecutive requests may not be in increasing order in the file space.

The data partitioning patterns (describing the decomposition of the problem domain represented by multi-dimensional arrays) were captured by the Scorpio library during the E3SM production runs. There can be multiple decomposition maps used by different variables. A data decomposition map records the positions (offsets) of array elements written by each MPI process. The access offsets are stored in a text file, referred to as the "decomposition map file".
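To make the idea concrete, below is a minimal, hypothetical sketch (it does not read Scorpio's actual decomposition map file format): given the element offsets assigned to one MPI rank, in the order they are written, it counts how many noncontiguous write requests they collapse into.

#include <cstddef>
#include <cstdio>
#include <vector>

/* Hypothetical example: count how many contiguous write requests one rank's
 * element offsets collapse into. A new request starts whenever the next
 * offset does not immediately follow the previous one. */
static std::size_t count_noncontig_requests(const std::vector<long long> &offsets)
{
    if (offsets.empty()) return 0;
    std::size_t nreq = 1;
    for (std::size_t i = 1; i < offsets.size(); i++)
        if (offsets[i] != offsets[i - 1] + 1) nreq++;
    return nreq;
}

int main()
{
    /* Offsets for one rank (made up for illustration); note they are neither
     * contiguous nor monotonically increasing, as described above. */
    std::vector<long long> offs = {120, 121, 122, 9, 10, 500, 501, 37};
    std::printf("%zu noncontiguous requests\n", count_noncontig_requests(offs));
    return 0;
}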

Three Case Studies

This benchmark currently studies three cases from E3SM, namely the F, G, and I cases, simulating the atmospheric, oceanic, and land components, respectively. Information about the climate variables written in these three case studies and their decomposition maps can be found in variables.md. The table below shows the information about the decomposition maps, the numbers of variables, and the maximum and minimum numbers of noncontiguous write requests among all processes.

Case                                               F         G         I
Number of MPI processes                        21600      9600      1344
Number of decomposition (partitioning) maps        3         6         5
Number of partitioned variables                  387        41       546
Number of non-partitioned variables               27        11        14
Total number of variables                        414        52       560
MAX no. noncontiguous writes among processes 184,644    21,110    41,400
MIN no. noncontiguous writes among processes 174,926    18,821    33,120

Compile and Run Instructions for E3SM-IO

  • See INSTALL.md, which also describes the command-line run options in detail. An example run command is shown after this list.
  • Current build/test status: MPICH, OpenMPI
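
As a concrete example, the following command (taken verbatim from one of the issue reports further down this page) runs the F case input file with 16 MPI processes, writing through the HDF5 library with the canonical layout and 10 time records:

    mpiexec -n 16 ./e3sm_io f_case_866x72_16p.nc -k -o F_out.h5 -x canonical -a hdf5 -r 10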

Run Options Include Various I/O Libraries and Two Data Layouts

There are several I/O methods implemented in this case study, covering two data layouts (canonical vs. log) and five parallel I/O libraries: PnetCDF, NetCDF-4, HDF5, Log VOL, and ADIOS. The table below summarizes the supported combinations of libraries and data layouts. For the full list of I/O options and more detailed descriptions, readers are referred to INSTALL.md.

layout \ library   PnetCDF   HDF5   Log VOL   ADIOS   NetCDF4
canonical            yes      yes      no       no      yes
log (blob)           yes      yes      yes      yes     no

Performance Results of Log-layout I/O Methods

For the log layout options available in this benchmark, users are referred to BLOB_IO.md for their designs and implementations. Log I/O methods store write requests by appending the write data one after another, like a time log, regardless of the data's position relative to its global structure, e.g. a subarray of a multi-dimensional array. Thus the data stored in the file does not follow the canonical dimensional order. On the other hand, storing data in the canonical order requires expensive communication to reorganize the data among the processes. As the number of processes increases, the communication cost can become significant. All I/O methods that store data in the log layout defer this expensive inter-process communication to the data consumer applications. Usually, "replay" utility programs are made available for users to convert a file in the log layout to the canonical layout.
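As an illustration only (this is not Scorpio's or the Log VOL's actual blob format), the sketch below shows the essential idea: each process appends its write requests back-to-back into its own region of the file and keeps a small index of where each piece belongs in the canonical layout, so no inter-process reordering happens at write time.

#include <mpi.h>
#include <vector>

/* Hypothetical metadata record: enough information for a later "replay"
 * step to place the appended data at its canonical location. */
struct LogEntry {
    int        varid;          /* which variable the data belongs to      */
    MPI_Offset canonical_off;  /* offset in the canonical (global) layout */
    MPI_Offset nbytes;         /* length of the appended data             */
};

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "blob_example.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each process owns a fixed-size blob region; requests are appended
     * back-to-back inside it, regardless of their canonical offsets. */
    const MPI_Offset blob_size = 1048576;
    MPI_Offset blob_end = (MPI_Offset)rank * blob_size;

    std::vector<LogEntry> index;
    double data[8] = {0};
    MPI_Offset canonical_offs[2] = {4096, 1024}; /* not in increasing order */
    for (int i = 0; i < 2; i++) {
        MPI_File_write_at(fh, blob_end, data, 8, MPI_DOUBLE, MPI_STATUS_IGNORE);
        index.push_back({0, canonical_offs[i], (MPI_Offset)(8 * sizeof(double))});
        blob_end += 8 * sizeof(double);   /* simply append, like a time log */
    }
    /* Writing out 'index' and the later canonical reconstruction (the
     * "replay" step) are omitted from this sketch. */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}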

  • Below are the execution times of four log-layout based I/O methods collected in July 2022 on Cori at NERSC.

    Performance of log-layout based I/O methods on Cori

  • Below are the execution times of four log-layout based I/O methods collected in September 2022 on Summit at OLCF.

    Performance of log-layout based I/O methods on Summit

Publications

Developers

Copyright (C) 2021, Northwestern University. See COPYRIGHT notice in top-level directory.

Project funding supports:

This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative.

e3sm-io's People

Contributors

brtnfld, dqwu, khou2020, wkliao, yzanhua


e3sm-io's Issues

Fails when using hdf5 canonical

mpiexec -n 16 ./e3sm_io f_case_866x72_16p.nc -k -o F_out.h5 -x canonical -a hdf5 -r 10
Error in ../../../src/cases/var_wr_case.cpp line 312 function var_wr_case
Error in ../../../src/cases/var_wr_case.cpp line 312 function var_wr_case
Error in ../../../src/cases/var_wr_case.cpp line 312 function var_wr_case
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2

failed to build bpstat on Cori

Error message:

  CXX      bpstat.o
bpstat.cpp(23): catastrophic error: cannot open source file "filesystem"
  #include <filesystem>
                       ^

disable running Cache VOL for netcdf4 + log option

Testing the netcdf4 + log layout option has been disabled for the Cache VOL connector.
See the GitHub issue in the Cache VOL repository: HDFGroup/vol-cache#18

The code below shows the location in test.sh that disables the test.
Note that the environment variable HDF5_VOL_CONNECTOR is overwritten if it was set by the user.

E3SM-IO/test.sh

Lines 102 to 116 in 988537c

elif test "x${ap[0]}" = xnetcdf4 ; then
    FILE_EXT="nc4"
    saved_HDF5_PLUGIN_PATH=$HDF5_PLUGIN_PATH
    saved_HDF5_VOL_CONNECTOR=$HDF5_VOL_CONNECTOR
    if test "x${ap[1]}" = xlog ; then
        # This option requires the two VOL environment variables to be set.
        export HDF5_PLUGIN_PATH="$LOGVOL_LIB_PATH"
        export HDF5_VOL_CONNECTOR="LOG under_vol=0;under_info={}"
        # Decomposition file must be read with native VOL, use nc file
        IN_FILE+=".nc"
    else
        IN_FILE+=".${FILE_EXT}"
        unset HDF5_PLUGIN_PATH
        unset HDF5_VOL_CONNECTOR
    fi

change command-line option -r

@khou2020

In 22fa89b,
the command-line option -r for setting the number of time steps
(number of records) was changed to apply only to the F case f0 file.
See the comments of that commit for more information.

The F and G cases for the scorpio option need to be changed accordingly.

bug in utils/bpstat

GitHub Actions failed with the following messages.
The same error happens in all F, G, and I cases.

CMD = utils/bpstat ./test_output/adios_blob_f_case_866x72_16p.bp_h1.bp
[Fri Jul  8 21:44:06 2022] [ADIOS2 ERROR] <Helper> <adiosSystem> <ExceptionToError> : adios2_open: [Fri Jul  8 21:44:06 2022] [ADIOS2 EXCEPTION] <Toolkit> <transport::file::FilePOSIX> <CheckFile> : couldn't open file ./test_output/adios_blob_f_case_866x72_16p.bp_h1.bp.bp, in call to POSIX open: errno = 2: No such file or directory
: iostream error

In 5980509, using bpstat to test output files is disabled.

run-time error when using hdf5_md

When running with the command-line options -k -o out.nc -a hdf5_md,
the following error appears.

Error at line 1029 in e3sm_io_driver_hdf5.cpp:
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: /hdf5/src/H5Shyper.c line 6889 in H5Sselect_hyperslab(): not a data space
    major: Invalid arguments to routine
    minor: Inappropriate type
Error in var_io_F_case.cpp line 1144 function run_varn_F_case

Timing report for ADIOS destructor

@khou2020
These lines need to be updated; the timer labels below are for HDF5.

e3sm_io_driver_adios2::~e3sm_io_driver_adios2 () {
    int err = 0;
    int rank;
    double tsel_all, twrite_all, text_all;
    // printf("adios2 destructor\n");
    MPI_Allreduce (&twrite, &twrite_all, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    MPI_Allreduce (&tsel, &tsel_all, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    MPI_Allreduce (&text, &text_all, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf ("#%%$: H5Dwrite_time_max: %lf\n", twrite_all);
        printf ("#%%$: H5Sselect_hyperslab_time_max: %lf\n", tsel_all);
        printf ("#%%$: H5Dset_extent_time_max: %lf\n", text_all);
    }
}
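
A possible direction for the fix is sketched below; the new label strings are assumptions for illustration, not names adopted by the project. The timers themselves already belong to the ADIOS driver, so only the printed names would change:

if (rank == 0) {
    /* hypothetical ADIOS-specific labels replacing the HDF5 ones above */
    printf ("#%%$: ADIOS_write_time_max: %lf\n", twrite_all);
    printf ("#%%$: ADIOS_selection_time_max: %lf\n", tsel_all);
    printf ("#%%$: ADIOS_set_extent_time_max: %lf\n", text_all);
}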

compile error

H5Pget_mpio_no_collective_cause (this->driver.dxplid_coll, &local_no_collective_cause,

e3sm_io_driver_hdf5_agg.cpp:246:73: error: ‘local_no_collective_cause’ was not declared in this scope
         H5Pget_mpio_no_collective_cause (this->driver.dxplid_coll, &local_no_collective_cause,
                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~

e3sm_io_driver_hdf5_agg.cpp:308:73: error: ‘local_no_collective_cause’ was not declared in this scope
         H5Pget_mpio_no_collective_cause (this->driver.dxplid_coll, &local_no_collective_cause,
                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~
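
A possible fix is sketched below (the name of the second output argument is an assumption): declare the output variables before querying why collective I/O was not used, matching the documented signature of H5Pget_mpio_no_collective_cause().

uint32_t local_no_collective_cause  = 0;
uint32_t global_no_collective_cause = 0;
H5Pget_mpio_no_collective_cause (this->driver.dxplid_coll,
                                 &local_no_collective_cause,
                                 &global_no_collective_cause);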

e3sm_io_print_profile()

e3sm_io_print_profile() seems relevant only to HDF5 and
dumps HDF5 timing breakdowns. It should be called only in
verbose or debug mode.

e3sm_io_print_profile (cfg);
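
One possible change, assuming the configuration object carries a verbose flag (the member name here is a guess), would be to guard the call:

if (cfg.verbose) e3sm_io_print_profile (cfg);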

Also, what is the first column of each line for? i.e. #%$:

#%$: e3sm_io_timer_hdf5_time_mean: 0.000000

use of H5S_SELECT_OR in H5Sselect_hyperslab()

Affected cases:

  • -a hdf5 -x canonical
  • -a hdf5 -x log
  • -a hdf5_md -x canonical

When using H5S_SELECT_OR in H5Sselect_hyperslab() to take a union of multiple
hyperslabs, the HDF5 library flattens all hyperslabs and sorts the selected elements into
increasing order. Therefore, it is not possible to create a filespace that follows the
same order as the write requests using H5Sselect_hyperslab(). For example, if a process
makes 2 write requests, the 1st being the 1st column of an MxN 2D array and the 2nd
being the 2nd column, then H5S_SELECT_OR will create a single Mx2 hyperslab.
In other words, one cannot write the 1st column followed by the 2nd column in a single
H5Dwrite call. See an example in tests/select_or.c.
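
The following standalone sketch (not the repository's tests/select_or.c) reproduces the behavior described above: two column selections combined with H5S_SELECT_OR become one union whose elements follow the canonical order of the dataspace.

#include <hdf5.h>
#include <stdio.h>

int main(void)
{
    hsize_t dims[2]  = {4, 3};               /* an M x N = 4 x 3 dataspace  */
    hid_t   space    = H5Screate_simple(2, dims, NULL);
    hsize_t start[2] = {0, 1};               /* select the 2nd column first */
    hsize_t count[2] = {4, 1};               /* one whole column at a time  */

    H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, count, NULL);
    start[1] = 0;                            /* then OR in the 1st column   */
    H5Sselect_hyperslab(space, H5S_SELECT_OR, start, NULL, count, NULL);

    /* The union holds 8 elements traversed in the dataspace's canonical
     * (row-major, increasing) order, so a single H5Dwrite using this
     * selection cannot write the 2nd column before the 1st. */
    printf("selected %lld elements\n",
           (long long)H5Sget_select_npoints(space));

    H5Sclose(space);
    return 0;
}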

In the current implementation, the use of H5S_SELECT_OR appears in:

The advantage of using the union of selected spaces is that we can create a single file
space per dataset and make a single H5Dwrite call to write the dataset, which also
allows us to use the MPI collective mode. Otherwise, because the number of noncontiguous
write requests can differ among processes, we would have to call H5Dwrite once per
noncontiguous request without being able to use collective I/O.

One solution would be to flatten all write requests and merge them into increasing order
in the file space. This would also require flattening the memory space of the user buffers
and moving the buffer segments to match their file-space order during the sorting.
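
A minimal sketch of that idea is below (the types and helper are assumptions, not the benchmark's code): keep (file offset, length, buffer pointer) triples for the flattened requests, sort them by file offset, and pack the corresponding pieces of the user buffer into one staging buffer that matches the sorted order.

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

/* One flattened, contiguous piece of a write request. */
struct WriteReq {
    long long   file_off;  /* flattened offset in the file space       */
    std::size_t nbytes;    /* length of this contiguous piece          */
    const char *buf;       /* where the piece lives in the user buffer */
};

/* Sort requests into increasing file order and pack their data into one
 * contiguous staging buffer laid out in that same order. */
std::vector<char> flatten_and_sort(std::vector<WriteReq> &reqs)
{
    std::sort(reqs.begin(), reqs.end(),
              [](const WriteReq &a, const WriteReq &b) {
                  return a.file_off < b.file_off;
              });
    std::size_t total = 0;
    for (const WriteReq &r : reqs) total += r.nbytes;

    std::vector<char> staged(total);
    std::size_t pos = 0;
    for (const WriteReq &r : reqs) {
        std::memcpy(staged.data() + pos, r.buf, r.nbytes);
        pos += r.nbytes;
    }
    return staged;  /* write 'staged' with a file space built in sorted order */
}

int main()
{
    const char a[4] = {'c', 'd', 'a', 'b'};
    /* Two requests given in decreasing file order. */
    std::vector<WriteReq> reqs = {{100, 2, a}, {10, 2, a + 2}};
    std::vector<char> staged = flatten_and_sort(reqs);  /* "ab" then "cd" */
    return staged.size() == 4 ? 0 : 1;
}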
