Comments (87)

dvickernasa avatar dvickernasa commented on September 1, 2024 1

Sorry, this fell off my radar. The "production" transfers I needed to do are done and I didn't get back to my testing. I'm restarting a transfer as a test. I'm going to recreate the original transfer this time, including the original options I used (most notably, without --batch-files). I don't think I've run into a case where this didn't work when using --batch-files, and I'm starting to suspect that is why. This new transfer is the exact same set of files from the first post in this issue. The source and destination filesystems are the same as well; I'm just using a different path in the destination FS. Fingers crossed that this recreates it. Details below. I'm using the debugtimestamps branch for both dsync runs, so if we do catch the problem, we should have some good data to go over.

ml load mpifileutils/debugtimestamps

source=/nobackup/PROJECTS/ccdev.to-delete/
dest=/nobackup2/dvicker/dsync_test/ccdev/

dsync_opts='-X non-lustre'

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out
mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024 1

I've compiled the update and I'll repeat the test. I'm not going to use the new option for now, just to be consistent. I'll email the output for the latest results when it's done.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024 1

I'm re-running my test with the updated code. It should be finished in the morning. Since I've transferred this same set of files a few times now, we should get a good idea of whether the extra fsyncs slow things down much.

from mpifileutils.

ofaaland avatar ofaaland commented on September 1, 2024 1

Hi @dvickernasa Since the destination LFS is ldiskfs, that rules out my theory. Back to the drawing board for us. Thank you.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024 1

I've started my test with the updated code.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Changed the subject since the root problem here is the time stamp info not being preserved.

Here is a more comprehensive example. There is an odd mix of correct and incorrect time stamps in the destination.

[dvicker@monte dsync]$ ls -l /nobackup/PROJECTS/ccdev/spacex/aero/adb/ar/
total 24040207564
  781477845 -rw-r----- 1 dvicker  ccp_spacex   791038201856 Oct 27  2022 ar-108.squashfs
        187 -rw-rw---- 1 dvicker  ccp_spacex         742598 Aug 15 11:12 ar-108.squashfs.toc
 5835780999 -rw-rw---- 1 ema      ccp_spacex  5977433890816 Aug 24 03:22 ar-109.squashfs
       1721 -rw-rw---- 1 ema      ccp_spacex        9413597 Aug 24 03:22 ar-109.squashfs.toc
   30794854 -rw-r----- 1 dvicker  ccp_spacex    25150177280 Oct 29  2022 ar-126.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          89461 Aug 15 11:12 ar-126.squashfs.toc
  561289162 -rw-r----- 1 dvicker  ccp_spacex   568532955136 Oct 29  2022 ar-127.squashfs
        233 -rw-rw---- 1 dvicker  ccp_spacex        1003455 Aug 15 11:12 ar-127.squashfs.toc
 1042993450 -rw-r----- 1 dvicker  ccp_spacex  1056351469568 Oct 29  2022 ar-131.squashfs
       2651 -rw-rw---- 1 dvicker  ccp_spacex       14523295 Aug 15 11:13 ar-131.squashfs.toc
  882730771 -rw-rw---- 1 ema      ccp_spacex   899033808896 Aug 24 11:42 ar-132.squashfs
       4046 -rw-rw---- 1 ema      ccp_spacex       22383607 Aug 24 11:42 ar-132.squashfs.toc
 1777595522 -rw-r----- 1 dvicker  ccp_spacex  1815639597056 Oct 28  2022 ar-133.squashfs
       1721 -rw-rw---- 1 dvicker  ccp_spacex        9352445 Aug 15 11:13 ar-133.squashfs.toc
  101115096 -rw-r----- 1 dvicker  ccp_spacex   102275829760 Oct 28  2022 ar-134.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          93327 Aug 15 11:13 ar-134.squashfs.toc
  753055384 -rw-r----- 1 dvicker  ccp_spacex   771118510080 Oct 30  2022 ar-135.squashfs
        466 -rw-rw---- 1 dvicker  ccp_spacex        2126037 Aug 15 11:13 ar-135.squashfs.toc
  158022760 -rw-r----- 1 dvicker  ccp_spacex   161615790080 Oct 29  2022 ar-136.squashfs
         94 -rw-rw---- 1 dvicker  ccp_spacex         225492 Aug 15 11:13 ar-136.squashfs.toc
11896898702 -rw-r----- 1 dvicker  ccp_spacex 12047965007872 Aug 20 22:27 ar-138.squashfs
      32762 -rw-rw---- 1 dvicker  ccp_spacex       64656431 Aug 20 22:27 ar-138.squashfs.toc
   46078532 -rw-r----- 1 dvicker  ccp_spacex    38567428096 Oct 27  2022 ar-140.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          99596 Aug 15 11:13 ar-140.squashfs.toc
   69136835 -rw-r----- 1 dvicker  ccp_spacex    64205635584 Oct 28  2022 ar-144.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          61816 Aug 15 11:13 ar-144.squashfs.toc
  103192925 -rw-r--r-- 1 dvicker  eg3          105696878592 Aug 22 20:03 Dragon2_RCS_JI_Joe_MSFC.squashfs
        629 -rw-r--r-- 1 dvicker  eg3               3418271 Aug 22 20:03 Dragon2_RCS_JI_Joe_MSFC.squashfs.toc
         36 -rwxr-xr-x 1 aschwing ccp_spacex            203 Feb 27  2019 update_perms.csh*
[dvicker@monte dsync]$ ls -l /nobackup2/PROJECTS/ccdev/spacex/aero/adb/ar/
total 23852316152
  772498852 -rw-r----- 1 dvicker  ccp_spacex   791038201856 Oct 27  2022 ar-108.squashfs
        728 -rw-rw---- 1 dvicker  ccp_spacex         742598 Sep 29 22:48 ar-108.squashfs.toc
 5837342120 -rw-rw---- 1 ema      ccp_spacex  5977433890816 Aug 24 03:22 ar-109.squashfs
       9196 -rw-rw---- 1 ema      ccp_spacex        9413597 Sep 29 22:03 ar-109.squashfs.toc
   24560752 -rw-r----- 1 dvicker  ccp_spacex    25150177280 Sep 29 20:41 ar-126.squashfs
         88 -rw-rw---- 1 dvicker  ccp_spacex          89461 Aug 15 11:12 ar-126.squashfs.toc
  555208412 -rw-r----- 1 dvicker  ccp_spacex   568532955136 Sep 29 23:53 ar-127.squashfs
        980 -rw-rw---- 1 dvicker  ccp_spacex        1003455 Sep 30 00:05 ar-127.squashfs.toc
 1031594016 -rw-r----- 1 dvicker  ccp_spacex  1056351469568 Oct 29  2022 ar-131.squashfs
      14184 -rw-rw---- 1 dvicker  ccp_spacex       14523295 Sep 29 17:44 ar-131.squashfs.toc
  877963384 -rw-rw---- 1 ema      ccp_spacex   899033808896 Sep 29 19:22 ar-132.squashfs
      21860 -rw-rw---- 1 ema      ccp_spacex       22383607 Sep 30 00:14 ar-132.squashfs.toc
 1773086904 -rw-r----- 1 dvicker  ccp_spacex  1815639597056 Sep 29 20:43 ar-133.squashfs
       9136 -rw-rw---- 1 dvicker  ccp_spacex        9352445 Aug 15 11:13 ar-133.squashfs.toc
   99878836 -rw-r----- 1 dvicker  ccp_spacex   102275829760 Sep 30 00:08 ar-134.squashfs
         92 -rw-rw---- 1 dvicker  ccp_spacex          93327 Aug 15 11:13 ar-134.squashfs.toc
  753046000 -rw-r----- 1 dvicker  ccp_spacex   771118510080 Oct 30  2022 ar-135.squashfs
       2080 -rw-rw---- 1 dvicker  ccp_spacex        2126037 Sep 29 22:02 ar-135.squashfs.toc
  157828056 -rw-r----- 1 dvicker  ccp_spacex   161615790080 Sep 29 23:49 ar-136.squashfs
        224 -rw-rw---- 1 dvicker  ccp_spacex         225492 Sep 29 17:19 ar-136.squashfs.toc
11765599492 -rw-r----- 1 dvicker  ccp_spacex 12047965007872 Aug 20 22:27 ar-138.squashfs
      63144 -rw-rw---- 1 dvicker  ccp_spacex       64656431 Sep 29 23:49 ar-138.squashfs.toc
   37663536 -rw-r----- 1 dvicker  ccp_spacex    38567428096 Sep 29 16:54 ar-140.squashfs
        100 -rw-rw---- 1 dvicker  ccp_spacex          99596 Sep 29 22:03 ar-140.squashfs.toc
   62700864 -rw-r----- 1 dvicker  ccp_spacex    64205635584 Oct 28  2022 ar-144.squashfs
         64 -rw-rw---- 1 dvicker  ccp_spacex          61816 Sep 30 02:00 ar-144.squashfs.toc
  103219708 -rw-r--r-- 1 dvicker  eg3          105696878592 Sep 30 02:26 Dragon2_RCS_JI_Joe_MSFC.squashfs
       3340 -rw-r--r-- 1 dvicker  eg3               3418271 Sep 29 16:46 Dragon2_RCS_JI_Joe_MSFC.squashfs.toc
          4 -rwxr-xr-x 1 aschwing ccp_spacex            203 Sep 29 16:46 update_perms.csh*
[dvicker@monte dsync]$ 

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Hi, @dvickernasa . Great to hear mpiFileUtils has been helpful. Thanks.

dsync should be setting the atime and mtime on the destination files to match the source file. However, it does that as the very last step, after all files have been copied. I wonder if the job might have been interrupted or hit a fatal error before it could complete that step.

Do you see any such errors or early termination from your job output log?

Tip: When copying a large number of files, I recommend using the --batch-files option, as in:

dsync --batch-files 100000 ...

This changes dsync to copy files in batches rather than all files at once. It sets the timestamps at the completion of each batch. If the job is interrupted, one can then execute the same dsync command again. dsync skips copying any files that were completed in earlier runs. By default, it avoids copying any file whose destination exists and has a matching file type, size, and mtime.
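
For illustration, here is a rough sketch (an assumption, not dsync's actual code) of the kind of skip check described above, using plain stat() results:

/* Rough sketch (an assumption, not dsync's actual code) of the skip check
 * described above: a destination entry is treated as already copied only if
 * its file type, size, and mtime (in seconds) all match the source. */
#include <stdbool.h>
#include <sys/stat.h>

static bool looks_already_copied(const struct stat *src, const struct stat *dst)
{
    if ((src->st_mode & S_IFMT) != (dst->st_mode & S_IFMT)) {
        return false;   /* different file type */
    }
    if (S_ISREG(src->st_mode) && src->st_size != dst->st_size) {
        return false;   /* different size */
    }
    if (src->st_mtime != dst->st_mtime) {
        return false;   /* different modification time */
    }
    return true;        /* skip re-copying this item */
}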

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

That is possible. Here is the end of the output from the dsync run.

[2023-09-30T08:19:28] Syncing directory updates to disk.
[2023-09-30T08:19:33] Sync completed in 4.942 seconds. 
[2023-09-30T08:19:33] Started: Sep-29-2023,16:34:18
[2023-09-30T08:19:33] Completed: Sep-30-2023,08:19:33
[2023-09-30T08:19:33] Seconds: 56715.151
[2023-09-30T08:19:33] Items: 1076815
[2023-09-30T08:19:33]   Directories: 35829
[2023-09-30T08:19:33]   Files: 1037753
[2023-09-30T08:19:33]   Links: 3233
[2023-09-30T08:19:33] Data: 48.152 TiB (52944066031967 bytes)
[2023-09-30T08:19:33] Rate: 890.263 MiB/s (52944066031967 bytes in 56715.151 seconds)
[2023-09-30T08:19:33] Updating timestamps on newly copied files
[2023-09-30T08:19:33] Completed updating timestamps
[2023-09-30T08:19:33] Completed sync
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[49249,1],5]
  Exit code:    1
--------------------------------------------------------------------------   

I'll start using the --batch-files option, thanks for the tip.

I'm attempting to recover from this using an rsync command to update the time stamps.

rsync -rv --size-only --times $source $dest >> rsync-ccdev.out.times.$( date +%F )

But it's going fairly slowly. Would just repeating the dsync update the timestamps?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Actually, the data I copied was done with two separate dsync runs, since the files were in two different areas. I've attached the full output from both runs. The ccdev run showed the error, but the cev-repo run did not, yet the destinations for both runs have files with incorrect timestamps.

Any recommendations for fixing the timestamps would be appreciated. The rsync command I posted above should do it, but it's slow. I'm thinking dsync won't work, since the output contains Updating timestamps on newly copied files. That makes it sound like dsync won't update the timestamps on files that were previously copied.

dsync-ccdev.out.2023-09-29.txt
dsync-cev-repo.out.2023-09-29.txt

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

I've been doing some tests this afternoon using a smaller dataset. Basically, I did this:

  • dsync /src /dest
  • cd /dest ; lfs find . | xargs touch
  • dsync /src /dest

Even though no actual data was transferred by the 2nd dsync (I think; the 2nd transfer went fast), the timestamps in the destination got updated to their source timestamps. I used rsync after the 2nd dsync to see if it found any files it wanted to transfer again, and it did not, indicating that the sizes and times of all files did indeed get updated.

I've restarted the dsync's of my original, large datasets so we'll see what happens.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Ok, thanks for the update. I'm still puzzled as to why the timestamps aren't being set correctly the first time.

How many nodes and processes are you using?

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Does it do a better job at setting timestamps the first time if you add the --direct option?

from mpifileutils.

adilger avatar adilger commented on September 1, 2024

@adammoody is there a reason not to set the timestamps on a file immediately after it has finished being processed? That would avoid the risk of the timestamps never being set, and it also reduces overhead on the filesystem because it doesn't need to do another set of filename lookups on all of the files.

It is understandable that the timestamps on the directories should be deferred until all of the entries in the directory are updated (and also to avoid setting the directory timestamp millions of times for a large directory), but that is not the case for regular files.

At worst, if a large file is being written by multiple MPI ranks concurrently, the timestamp might be set multiple times, but that is fine given that it only happens for very large files, and whichever rank finishes writing last will reset the timestamp when it is done.
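
As a rough illustration of that per-file approach (hypothetical code, not something dsync currently does), the rank that finishes a file could flush and close it, then immediately copy the source timestamps over with utimensat(), the same call dsync wraps in mfu_utimensat() later in this thread:

/* Hypothetical sketch (not current dsync code): the rank that wrote the last
 * chunk of a file flushes and closes it, then immediately copies the source
 * atime/mtime onto the destination with utimensat(). */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

static int finish_file(int fd, const char *dest_path, const struct stat *src_st)
{
    if (fsync(fd) != 0) {          /* make sure the data is on disk first */
        return -1;
    }
    if (close(fd) != 0) {
        return -1;
    }

    /* copy atime and mtime (including nanoseconds) from the source stat */
    struct timespec times[2] = { src_st->st_atim, src_st->st_mtim };
    return utimensat(AT_FDCWD, dest_path, times, 0);
}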

Separately, is the --batch-files option mainly a workaround for the lack of timestamp updates on files, or are there other reasons to use it? Should it default to, e.g., 10000, or 100 x mpi_ranks, or similar? That would increase the directory timestamp updates marginally, but it is unlikely to hurt overall transfer performance significantly (I guess there is some amount of straggler/idle-thread overhead for each batch).

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

How many nodes and processes are you using?

I'm using 16 cores on a single workstation for these transfers.

Does it do a better job at setting timestamps the first time if you add the --direct option?

In my small tests, I wasn't using --direct and the times were getting set properly. I verified by following up with an rsync, which didn't copy anything new. I can try a big transfer with --direct next time and see if that works better.

The 2nd dsync of the large dataset just completed. The intent there was just to correct the times. For some reason, this happened:

[2023-10-03T18:42:03] Comparing file sizes and modification times of 3093774 items
[2023-10-03T18:42:03] Started   : Oct-03-2023, 18:42:01
[2023-10-03T18:42:03] Completed : Oct-03-2023, 18:42:03
[2023-10-03T18:42:03] Seconds   : 2.043
[2023-10-03T18:42:03] Items     : 3093774
[2023-10-03T18:42:03] Item Rate : 3093774 items in 2.042868 seconds (1514426.851390 items/sec)
[2023-10-03T18:42:03] Deleting items from destination
[2023-10-03T18:42:03] Removing 1740489 items ...

<cut>

[2023-10-03T18:46:44] Removed 1676960 items (96.35%) in 280.076 secs (5987.513 items/sec) 10 secs remaining ...
[2023-10-03T18:46:54] Removed 1739691 items (99.95%) in 290.075 secs (5997.386 items/sec) 0 secs remaining ...
[2023-10-03T18:46:54] Removed 1740489 items (100.00%) in 290.399 secs (5993.446 items/sec) done
[2023-10-03T18:46:54] Removed 1740489 items in 290.457 seconds (5992.237 items/sec)
[2023-10-03T18:46:54] Copying items to destination
[2023-10-03T18:46:54] Copying to /nobackup2/PROJECTS/orion/cev-repo
[2023-10-03T18:46:54] Items: 1740490
[2023-10-03T18:46:54]   Directories: 0
[2023-10-03T18:46:54]   Files: 1740490
[2023-10-03T18:46:54]   Links: 0
[2023-10-03T18:46:54] Data: 94.113 TiB (56.699 MiB per file)

The source data is almost completely static, so I'm not sure why dsync decided it needed to delete and re-copy about half of the data. Perhaps the incorrect dates from the first run threw it off? It's hard to tell for sure, but it looked like far fewer than 50% of the files in the destination had wrong dates after the 1st dsync. Any thoughts on this mass deletion would be welcome.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, deletes the destination files, and re-copies them from the source. It would obviously have been smoother if the first dsync had properly set the mtimes.

From your output logs, I don't see the types of errors that I'd expect to see if the job had been interrupted. The log even prints the message saying all timestamps were successfully updated, so it looks like a bug to me that they somehow ended up wrong.

One guess is that perhaps a trailing write came in to the file after the timestamp update had been processed on the metadata server. That shouldn't be happening, but maybe it is. I wouldn't expect that for a single-node run or when using --direct (which uses O_DIRECT), which is what I was trying to get at with those questions.

I'll also run some tests to see if I can reproduce that locally.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Thank you for following along and for your suggestions, @adilger .

Yes, we currently set the timestamps once at the end to handle files that are split among multiple ranks. We don't have communication to know when all ranks are done with a given file, so instead we wait until all ranks are done with all files.

The --batch-files option is there to add intermediate global sync points for completing files early. It may provide a secondary benefit by syncing procs up more frequently and perhaps improving work balance, but it's mainly there to set timestamps along the way rather than waiting until the very end.

Picking the best batch size is not well defined. It effectively acts like a checkpoint operation, so it makes sense to pick a batch size based on the expected runtime and failure rate. It also depends on the amount of data and the file sizes. I suggested 100k here since there were 4M files in total, so 100k just seemed like a decent checkpoint frequency. Pretty arbitrary...

I like the idea of setting the timestamp multiple times from multiple ranks. That does add extra load on the metadata server, but that's probably not a huge hit.

However, a remaining challenge is to figure out how to handle setting other properties like ownership, permission bits, and extended attributes. We set those at the end, and then we set the timestamps as the very last step. This ensures that the mtime on the destination only matches the source after the file has been fully copied (all data and attributes).

I know some properties can't be set too early, since for example, some of those can disable write access to the destination file.

And I think we'd need to consider the case where dsync might be interrupted and leaves a destination file with the correct size and mtime before all ranks had actually completed their writes. (The file size can be correct since the last bytes in the file might actually be written first.) We could perhaps write to a temporary file and rename to handle that, but then I think we run into a similar problem of knowing when it's time to do the rename.
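
For what it's worth, a minimal sketch of that temp-file-and-rename idea (hypothetical; the .dsync_tmp suffix and the single-buffer write are placeholders, and real code would write in chunks and across ranks):

/* Hypothetical write-to-a-temp-file-then-rename pattern: the real destination
 * name only appears after the data and mtime are final, so an interrupted run
 * cannot leave behind a partial file that looks complete. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

static int copy_via_temp(const char *dest, const void *buf, size_t len,
                         const struct timespec times[2])
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.dsync_tmp", dest);   /* assumed naming scheme */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;

    if (write(fd, buf, len) != (ssize_t)len) { close(fd); unlink(tmp); return -1; }
    if (fsync(fd) != 0)                      { close(fd); unlink(tmp); return -1; }
    if (futimens(fd, times) != 0)            { close(fd); unlink(tmp); return -1; }
    if (close(fd) != 0)                      { unlink(tmp); return -1; }

    return rename(tmp, dest);   /* atomic within the same filesystem */
}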

Do you have more insight into those?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, it deletes the destination files, and it re-copies those files from the source. It would obviously have been smoother if the first dsync properly set the mtimes.

Given this, I think @adilger's suggestion of writing the times right after the file write makes a lot of sense, if it's not too much of a performance hit. If dsync's behavior is to delete the file and start over when the times don't match, it would be very beneficial to get the times updated ASAP, so that if the transfer is interrupted, the next dsync can avoid as much re-transfer as possible.

One other possibility is implementing rsync's --partial behavior. When that is turned on, the part of the file that has been transferred is kept, and subsequent rsync runs pick up where the previous one left off. Of course, this would also require implementing the delta-transfer algorithm, and I have no idea how much effort that would entail.

from mpifileutils.

adilger avatar adilger commented on September 1, 2024

One option for detecting a partial file transfer would be to check the blocks allocated to the file against the source file. If they are grossly smaller on the target, then it clearly has not finished copying the interior.

Alternately (and this might be better for performance reasons as well), you could make the work distribution function smarter and have only a single rank per OST object, copying the objects linearly to avoid holes. With modern Lustre, large O_DIRECT writes are much faster than buffered writes (20 GB/s vs. 2 GB/s).

Also, I imagine you only split a file into multiple parts to copy in parallel once it is over 1 GiB in allocated space, so the cost of an MPI communication to the other ranks is pretty tiny compared to the file data moved; making the client do multiple locks/opens for these files is probably more work in total.
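
A sketch of that heuristic (my interpretation, not existing mpiFileUtils code): compare allocated blocks rather than st_size, since a partially copied file written with holes can already report the full size.

/* Sketch (an assumption, not dsync code) of the partial-transfer heuristic
 * described above: a destination with far less allocated space than the
 * source is probably still missing data in the interior. */
#include <stdbool.h>
#include <sys/stat.h>

static bool probably_incomplete(const struct stat *src, const struct stat *dst)
{
    /* st_blocks counts 512-byte units of allocated space */
    long long src_alloc = (long long)src->st_blocks * 512;
    long long dst_alloc = (long long)dst->st_blocks * 512;

    /* flag the copy as incomplete if far less space is allocated on the target */
    return dst_alloc < src_alloc / 2;
}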

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Sorry, I know this isn't directly related to this issue, but I have a question about dsync's behavior. Here is an example of the end of a transfer.

[2023-10-03T19:32:44] Copied 2.673 TiB (66%) in 2738.947 secs (0.999 GiB/s) 1438 secs left ...
[2023-10-03T19:32:51] Copied 2.684 TiB (66%) in 2746.006 secs (1.001 GiB/s) 1424 secs left ...
[2023-10-03T19:33:04] Copied 2.695 TiB (66%) in 2759.073 secs (1.000 GiB/s) 1414 secs left ...
[2023-10-03T19:33:14] Copied 2.707 TiB (66%) in 2769.405 secs (1.001 GiB/s) 1401 secs left ...
[2023-10-03T19:33:24] Copied 2.719 TiB (67%) in 2779.124 secs (1.002 GiB/s) 1388 secs left ...
[2023-10-03T19:33:34] Copied 2.730 TiB (67%) in 2789.049 secs (1.002 GiB/s) 1376 secs left ...
[2023-10-03T19:33:49] Copied 2.738 TiB (67%) in 2804.159 secs (1.000 GiB/s) 1370 secs left ...
[2023-10-03T19:34:04] Copied 2.755 TiB (68%) in 2818.953 secs (1.001 GiB/s) 1353 secs left ...
[2023-10-03T19:34:15] Copied 2.766 TiB (68%) in 2829.957 secs (1.001 GiB/s) 1341 secs left ...
[2023-10-03T19:34:21] Copied 2.777 TiB (68%) in 2836.497 secs (1.002 GiB/s) 1328 secs left ...
[2023-10-03T19:34:39] Copied 2.788 TiB (68%) in 2854.734 secs (1.000 GiB/s) 1320 secs left ...
[2023-10-03T19:34:48] Copied 2.802 TiB (69%) in 2863.235 secs (1.002 GiB/s) 1303 secs left ...
[2023-10-03T19:35:08] Copied 2.813 TiB (69%) in 2883.456 secs (0.999 GiB/s) 1295 secs left ...
[2023-10-03T20:09:39] Copied 2.831 TiB (69%) in 4954.226 secs (599.206 MiB/s) 2180 secs left ...
[2023-10-03T20:09:39] Copied 4.077 TiB (100%) in 4954.227 secs (862.846 MiB/s) done
[2023-10-03T20:09:39] Copy data: 4.077 TiB (4482383360736 bytes)
[2023-10-03T20:09:39] Copy rate: 862.846 MiB/s (4482383360736 bytes in 4954.227 seconds)
[2023-10-03T20:09:39] Syncing data to disk.
[2023-10-03T20:09:45] Sync completed in 6.131 seconds.

Note that the percentage is progressing along fairly linearly, then suddenly jumps a lot and finishes much earlier than predicted. I see this happen a lot. Do you know what's going on there?

My 2nd dsync of the large dataset finished, and I have an rsync running to verify that it doesn't find anything else to transfer. It's going to take a while for that to finish, but it's been running most of the day and hasn't found anything yet, which is a good sign. My original rsync verification started finding files it wanted to transfer almost right away.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

The missing progress messages are a known problem. This is also noted in #530.

These progress messages are implemented using a tree-based communication pattern across ranks using point-to-point messages. Every 10 seconds or so, rank 0 initiates a message down the tree to request progress. Each intermediate rank receives that request message from its parent and forwards it down to its children. When the request message reaches the leaves, ranks respond and send their current progress status back up the tree. Intermediate ranks receive that data, merge in their own progress status, and respond back up the tree. Rank 0 receives the messages from its children, merges its data, and prints the final result. These reductions are asynchronous. Ranks periodically check for and respond to these messages as they come in.

I think what probably happens for these progress messages to get stalled is that some rank (any rank) gets stuck in an I/O call. Because it is blocked in the I/O call, it doesn't respond to MPI messages, which then blocks the progress tree. If that's what's happening, one solution would be to move the I/O calls into a pthread or maybe use async I/O routines so that the process can still make MPI calls while waiting on its current I/O operation to complete.
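
To make that message pattern concrete, here is a simplified, blocking sketch of a reduction over a binary rank tree (the real code is asynchronous and lives in libcircle; the tags and the byte counter are illustrative only):

/* Blocking sketch of the request-down / reduce-up pattern described above.
 * Rank 0 starts the request; each rank forwards it to its children, adds the
 * children's replies to its own counter, and sends the partial sum to its
 * parent. Rank 0 ends up with the global total. */
#include <mpi.h>
#include <stdint.h>

#define TAG_REQ 100
#define TAG_SUM 101

static uint64_t progress_reduce(uint64_t my_bytes, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int parent  = (rank - 1) / 2;
    int kids[2] = { 2 * rank + 1, 2 * rank + 2 };
    char token  = 0;

    /* wait for the progress request from our parent (rank 0 initiates) */
    if (rank != 0) {
        MPI_Recv(&token, 1, MPI_CHAR, parent, TAG_REQ, comm, MPI_STATUS_IGNORE);
    }

    /* forward the request and fold in each child's partial sum */
    uint64_t total = my_bytes;
    for (int i = 0; i < 2; i++) {
        if (kids[i] < size) {
            MPI_Send(&token, 1, MPI_CHAR, kids[i], TAG_REQ, comm);
            uint64_t child = 0;
            MPI_Recv(&child, 1, MPI_UINT64_T, kids[i], TAG_SUM, comm, MPI_STATUS_IGNORE);
            total += child;
        }
    }

    /* report our subtree's total up to the parent; rank 0 keeps the result */
    if (rank != 0) {
        MPI_Send(&total, 1, MPI_UINT64_T, parent, TAG_SUM, comm);
    }
    return total;
}

If any rank is stuck in an I/O call and never enters this exchange, its parent blocks waiting for the reply, which is exactly the kind of stall described above.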

from mpifileutils.

adilger avatar adilger commented on September 1, 2024

Is it possible to set a timeout on the MPI broadcast that doesn't also result in a total MPI failure? Since the status updates are somewhat advisory, unlike an HPC calculation, getting a partial status is probably fine.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Yes, I think timeouts on the reduction operation could also work. I think we'd need to pair that with adding sequence numbers to the messages or internal counters so that a parent can distinguish late-arriving responses from its children.

It might also get strange, since the progress "sum" would be missing any results from the subtree rooted at the blocked rank. The reported progress would appear to bounce up and down during the run if such stalls are frequent as random ranks come and go. That might be countered by caching the last reported value from each child. I'd have to check whether that makes sense for the various reduction ops...

There may be another advantage to moving the I/O to a thread. For algorithms that use libcircle for work stealing, work on blocked ranks could be moved to other ranks faster. Having said that, changing all of the tools to move I/O to a separate thread is a bigger overhaul.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Oh, wait. I'm describing the point-to-point reduction in libcircle:

https://github.com/hpc/libcircle/blob/406a0464795ef5db783971e578a798707ee79dc8/libcircle/token.c#L97

There is a separate progress reduction algorithm in mpiFileUtils, and I forgot that it uses non-blocking MPI collectives for that:

https://github.com/hpc/mpifileutils/blob/main/src/common/mfu_progress.c

That'd probably first need to be switched over to a point-to-point method (like in libcircle) to add timeouts.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Since this is getting a bit involved, let's migrate discussion about missing progress messages to issue #530.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

I've got a few more datapoints on setting the times.

First of all, the rsync verification of our 141 TB, 2.7M-file dataset is still going. It's been running since 10/5. We closed off the permissions of the top-level directory to ensure people can't access the data. So far, rsync has found a couple dozen files that it wanted to re-transfer.

Over the weekend, we had another few sets of data to transfer, each set containing about 250 GB and 1.7M files. Using a single node, we can transfer one set in less than an hour with dsync. However, we found more indications of timestamps getting set incorrectly. This time we kept running dsync; after 2 or 3 runs that found new files to transfer, we got to the point where dsync itself no longer found anything new. This was still without --direct. The next time I use dsync, I'll try --direct.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

To be sure, your latest transfer of 1.7M files saw mismatched timestamps, and this only ran dsync using a single node, right? That info helps to eliminate one theory I had.

And this is a Lustre-to-Lustre transfer, right?

Did you see any errors reported in the output of that first transfer?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Correct - single node and lustre to lustre.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Sorry, to answer your questions about errors: yes, there were some errors. They did a couple of intermediate transfers while the datasets were in use and being modified, in an effort to reduce the time it would take to transfer the data once the runs were completed.

ERROR: Source file `XXX' shorter than expected 4007 (errno=0 Success)

Although I found this odd (the source data was smaller, not larger), we didn't spend any time digging into it.

Once the jobs were done and the data was static, the transfer did show just a couple of errors.

ERROR: Failed to open output file `XXX' (errno=13 Permission denied)
ERROR: Failed to change ownership on `XXX' lchown() (errno=13 Permission denied)

That is kind of strange, since the source and destination directories were both owned by the person doing the transfer. One more dsync run got all the files and finished without errors.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

FYI, I started another transfer on Thursday. This one is quite large with respect to the number of files (44.7M):

[2023-10-12T14:42:45] Copying to /nobackup2/PROJECTS/orion/gnc/em1data-copy
[2023-10-12T14:42:46] Items: 44656137
[2023-10-12T14:42:46]   Directories: 3593494
[2023-10-12T14:42:46]   Files: 39432055
[2023-10-12T14:42:46]   Links: 1630588
[2023-10-12T14:42:46] Data: 49.134 TiB (1.307 MiB per file)

I used --direct from the beginning on this one.

source=/.scratch00/agdl/em1data/
dest=/nobackup2/PROJECTS/orion/gnc/em1data-copy/
dsync_opts='-X non-lustre --batch-files 100000 --direct'

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-em1.out.$( date +%F-%H-%M )

The scratch filesystem is NFS, so this is an NFS-to-Lustre transfer. Unfortunately, it looks like this one died sometime on Saturday.

[2023-10-14T11:13:45] Copied 262.948 GiB (80%) in 851.497 secs (316.218 MiB/s) 219 secs left ...
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 4679 on node monte exited on signal 9 (Killed).
--------------------------------------------------------------------------

I'm not sure why it died yet - I'll be digging into that some more. However, after poking around in the partially transferred directory structure, it looks like all of the directories (or at least most of them) are owned by root. Here is an example of the top-level directory.

[root@dvicker dsync]# ls -l /.scratch00/agdl/em1data/
total 20381086
drwxrws---  41 root     mpcvdata          57 Mar  6  2023 AR1_MER_GEM
drwxrwxrwx  11 mgualdon cev               11 Mar 21  2023 AR1_MER_POLARIS
-rw-r--r--   1 lhmolina mpcvdata 18450288381 Dec 10  2018 Data_20180219_0300.tgz
drwxr-sr-x   2 root     mpcvdata     2206262 Dec  7  2020 dumap_html
-rw-r--r--   1 root     mpcvdata        2782 Dec  7  2020 dumap.html
-rw-r--r--   1 root     mpcvdata  6744041294 Dec  4  2020 du.txt
drwxr-sr-x   8 choltz   mpcvdata           8 Jun 26  2018 EM1_DATA
drwxr-sr-x   5 choltz   mpcvdata           5 Aug 24  2016 EM1_FSW
drwxrws---   5 lmcev    mpcvdata           5 Jul 13  2020 EM1_iVAC
drwxrws---+ 13 choltz   mpcvdata          13 Jan 25  2023 EM1_RUNS
drwxr-sr-x  20 mfrostad mpcvdata          20 Jan 14  2021 EM1_SLS_Traj
drwxrws---   2 idux     mpcvdata           4 Apr 21  2021 EM1_TMT_Database
drwxr-sr-x   6 dheacock mpcvdata           6 Jul 13  2020 EM1_VAC
drwxr-s---   3 jarnold1 em1tppi            3 Aug  8  2019 EM_Flight_Traj_Limited_Rights
drwxr-sr-x  10 choltz   mpcvdata          16 May  3  2016 jfreechart-1.0.19
drwxrwsrwx   4 jmarcy   mpcvdata           4 Aug 10  2022 jmarcy
-rwxrwxrwx   1 jmarcy   mpcvdata      659220 Jan 19  2021 MEP_28ep9_20201215.xml
drwxrwsrwx   2 mstoft   mpcvdata           2 Jan 11  2022 mstoft
drwxrwsr-x+ 10 bawood   mpcvdata          14 May  8 13:58 orion_veras_archive
drwxr-sr-x   3 lgefert  mpcvdata           3 Feb  7  2019 prop_data
drwxrwsrwx  16 jmarcy   mpcvdata          16 Sep 21  2020 ramtares
drwxrwsrwx   6 mstoft   mpcvdata           7 Feb 10  2021 Rel28eP9Eng_Debug
drwxr-sr-x   5 choltz   mpcvdata           6 Nov 27  2017 socgnc_base
drwxr-s---  36 choltz   em1tppi           39 Oct 17  2022 socgnc_releases
drwxr-sr-x   5 choltz   mpcvdata           5 Mar 22  2016 SSW
[root@dvicker dsync]# ls -l /nobackup2/PROJECTS/orion/gnc/em1data-copy/
total 36212336
drwx------  41 root     root            4096 Oct 14 10:16 AR1_MER_GEM
drwx------  11 root     root            4096 Oct 12 14:42 AR1_MER_POLARIS
-rw-r--r--   1 lhmolina mpcvdata 18450288381 Dec 10  2018 Data_20180219_0300.tgz
drwx------   2 root     root       180535296 Oct 14 10:08 dumap_html
drwx------   8 root     root            4096 Oct 12 14:42 EM1_DATA
drwx------   5 root     root            4096 Oct 12 14:42 EM1_FSW
drwx------   5 root     root            4096 Oct 12 14:42 EM1_iVAC
drwxrwx---+ 13 root     root            4096 Oct 12 14:42 EM1_RUNS
drwx------  20 root     root            4096 Oct 12 14:42 EM1_SLS_Traj
drwx------   2 root     root            4096 Oct 12 14:42 EM1_TMT_Database
drwx------   6 root     root            4096 Oct 12 14:42 EM1_VAC
drwx------   3 root     root            4096 Oct 12 14:42 EM_Flight_Traj_Limited_Rights
drwx------  10 root     root            4096 Oct 14 10:05 jfreechart-1.0.19
drwx------   4 root     root            4096 Oct 12 14:42 jmarcy
drwx------   2 root     root            4096 Oct 12 14:42 mstoft
drwx------+ 10 root     root            4096 Oct 14 10:05 orion_veras_archive
drwx------   3 root     root            4096 Oct 12 14:42 prop_data
drwx------  16 root     root            4096 Oct 12 14:42 ramtares
drwx------   6 root     root            4096 Oct 12 14:42 Rel28eP9Eng_Debug
drwx------   5 root     root            4096 Oct 12 14:42 socgnc_base
drwx------  36 root     root            4096 Oct 12 18:55 socgnc_releases
drwx------   5 root     root            4096 Oct 12 14:42 SSW
[root@dvicker dsync]# 

Is this expected?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

A few updates.

The first dsync run of this latest transfer died due to an OOM event on the machine I was running on. I think it was dsync itself that caused the OOM, since this was a pretty isolated machine and nothing else was running there, but I'm not 100% sure of that. I didn't mention it before, but this is also a single-node run.

The sync has been restarted and, unfortunately, I'm having similar issues. dsync decided to delete a lot of what it had already transferred.

[2023-10-16T11:22:01] Comparing file sizes and modification times of 17155249 items
[2023-10-16T11:22:19] Started   : Oct-16-2023, 11:21:26
[2023-10-16T11:22:19] Completed : Oct-16-2023, 11:22:19
[2023-10-16T11:22:19] Seconds   : 53.312
[2023-10-16T11:22:19] Items     : 17155249
[2023-10-16T11:22:19] Item Rate : 17155249 items in 53.312461 seconds (321786.852653 items/sec)
[2023-10-16T11:22:19] Deleting items from destination
[2023-10-16T11:22:19] Removing 10915151 items

I'm attaching the full output for both the original and the restart for this latest run.

dsync-em1.out.2023-10-16-09-50.txt
dsync-em1.out.2023-10-12-12-52.txt

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Ugh, that's frustrating. I suspect that the timestamps for the files on that first run are still not being set correctly.

It looks like the first run successfully completed a number of 100k file-transfer batches before it died (193 batches to be precise).

[2023-10-14T10:56:04] Copied 19300000 of 44656137 items (43.219%)

So at most it should have had to delete about 100k files, which would be the files just from the batch being worked on when the run died. Instead, it thought 11 million files needed to be replaced.

It seems like it's either missing setting the timestamps on some files, or perhaps it's hitting an error and not reporting it. I'll review the code again with an eye for those kinds of problems.

You might as well drop --direct for future runs. We now know that doesn't seem to help, and it may slow down your transfers. Thanks for checking that.

Having root ownership left on the directories is expected, because the directory permissions are not updated until all files have been transferred. Since the first run didn't complete, it didn't get a chance to change the ownership on the directories. However, it should have changed ownership and set timestamps on all of the files it had already copied.

It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt.

Do you have any idea of how large you have to get before you see things break?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

You might as well drop --direct for future runs. We now know that doesn't seem to help, and it may slow down your transfers. Thanks for checking that.

No problem and I will drop this for future runs.

It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt.

Do you have any idea of how large you have to get before you see things break?

Yes, I agree. I will see if I can find a small-ish dataset to reproduce this.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

This will produce some long output, but let's add some debug statements to print info when dsync sets the timestamp on a file.

Can you modify src/common/mfu_io.c to add the printf and preceding formatting lines as shown below and rebuild?

int mfu_utimensat(int dirfd, const char* pathname, const struct timespec times[2], int flags)
{   
    int rc;
    int tries = MFU_IO_TRIES;
retry:
    errno = 0;
    rc = utimensat(dirfd, pathname, times, flags);
    if (rc != 0) {
        if (errno == EINTR || errno == EIO) {
            tries--;
            if (tries > 0) {
                /* sleep a bit before consecutive tries */
                usleep(MFU_IO_USLEEP);
                goto retry;
            }
        }
    }

    //------------------------------
    int saved_errno = errno;  /* capture before localtime/strftime can clobber it */
    char access_s[30];
    char modify_s[30];
    strftime(access_s, sizeof(access_s) - 1, "%FT%T", localtime(&times[0].tv_sec));
    strftime(modify_s, sizeof(modify_s) - 1, "%FT%T", localtime(&times[1].tv_sec));
    printf("%s rc=%d errno=%d atime_sec=%llu atime_ns=%llu mtime_sec=%llu mtime_ns=%llu atime=%s mtime=%s\n",
        pathname, rc, saved_errno,
        (unsigned long long) times[0].tv_sec, (unsigned long long) times[0].tv_nsec,
        (unsigned long long) times[1].tv_sec, (unsigned long long) times[1].tv_nsec,
        access_s, modify_s);
    //------------------------------

    return rc;
}

Also add an #include <time.h> statement at the top of the file.

Execute a dsync and if you find any files have mismatching timestamps, look for those same paths in the output. Let's verify that a line is actually printed, that the input timestamps are correct, and that the utimensat call returns with rc=0.

BTW, it seems like it's not too hard to find files with mismatched timestamps, but if you need to scan a lot of files manually, we can probably use dcmp to help with that.

No need to copy the actual output here, but let me know what you find.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Lots of updates....

Code changes

Sure, I can do that. If we keep going down this road, it might be nice to make a branch I can check out and compile against so I'm not making the changes by hand.

Smaller test case

I tried to find a smaller test case. I picked a directory tree with 400k files totaling 4.8 TB. I did the initial transfer last night in about 2 hours and repeated it this morning, and it only found 1 file to re-transfer. So perhaps this is a better test case to use for testing.

Update on latest large transfer

This is the 40M-file case. I had restarted the transfer yesterday, but it crashed again just now. If you want me to create a different issue for this, I can. Console output from this crash:

[monte:24626] *** Process received signal ***
[monte:24626] Signal: Segmentation fault (11)
[monte:24626] Signal code: Address not mapped (1)
[monte:24626] Failing at address: 0x80
[monte:24626] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7fc689c8c630]
[monte:24626] [ 1] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(+0xb1b7)[0x7fc687c5b1b7]
[monte:24626] [ 2] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_recv_connect_ack+0xbc2)[0x7fc687c5c222]
[monte:24626] [ 3] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(+0x69f1)[0x7fc687c569f1]
[monte:24626] [ 4] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x7a4)[0x7fc68aa16924]
[monte:24626] [ 5] mpiexec[0x404d52]
[monte:24626] [ 6] mpiexec[0x403610]
[monte:24626] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc6898d1555]
[monte:24626] [ 8] mpiexec[0x403529]
[monte:24626] *** End of error message ***
./dsync-em1.sh: line 22: 24626 Segmentation fault      mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-em1.out.$( date +%F-%H-%M )

Here are the /var/log/messages for this:

Oct 17 10:58:37 monte kernel: dsync[24639]: segfault at 80 ip 00007f6cd1ec01b7 sp 00007f6cd1cafce0 error 4 in mca_oob_tcp.so[7f6cd1eb5000+13000]
Oct 17 10:58:37 monte kernel: dsync[24637]: segfault at 80 ip 00007ff86669d1b7 sp 00007ff86648cce0 error 4 in mca_oob_tcp.so[7ff866692000+13000]
Oct 17 10:58:37 monte kernel: dsync[24683]: segfault at 80 ip 00007f410599e1b7 sp 00007f410578dce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24680]: segfault at 80 ip 00007ff888bf51b7 sp 00007ff8889e4ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24684]: segfault at 80 ip 00007f3725a891b7 sp 00007f3725878ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24682]: segfault at 80 ip 00007fd6cae651b7 sp 00007fd6cac54ce0 error 4
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7f3725a7e000+13000]
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7ff888bea000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd6cae5a000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7f4105993000+13000]
Oct 17 10:58:37 monte kernel: Pid 24670(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24674(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: dsync[24665]: segfault at 80 ip 00007f5faa6201b7 sp 00007f5faa40fce0 error 4 in mca_oob_tcp.so[7f5faa615000+13000]
Oct 17 10:58:37 monte kernel: dsync[24673]: segfault at 80 ip 00007fd64f05b1b7 sp 00007fd64ee4ace0 error 4
Oct 17 10:58:37 monte kernel: dsync[24669]: segfault at 80 ip 00007fd2215721b7 sp 00007fd221361ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24647]: segfault at 80 ip 00007fdb7e2401b7 sp 00007fdb7e02fce0 error 4
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd221567000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fdb7e235000+13000]
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd64f050000+13000]
Oct 17 10:58:37 monte kernel: Pid 24634(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24636(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24640(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24651(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Pid 24632(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24644(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24664(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24668(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24660(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24658(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24646(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24666(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24635(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24656(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24654(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24633(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24662(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24631(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24638(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24648(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24652(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24642(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24626(mpiexec) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24629 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24677 (dsync) of user 0 killed by SIGSEGV - dumping core
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24672 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24630 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:38 monte kernel: RPC: fragment too large: 612067950
Oct 17 10:59:05 monte abrt-hook-ccpp: Saved core dump of pid 24672 to core.24672 at /home/dvicker/dsync (3906400256 bytes)
Oct 17 10:59:05 monte kernel: RPC: fragment too large: 612067950
Oct 17 10:59:08 monte abrt-hook-ccpp: Saved core dump of pid 24630 to core.24630 at /home/dvicker/dsync (5385498624 bytes)
Oct 17 10:59:11 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Refusing connection from 128.157.254.142: insecure port 40552
Oct 17 10:59:20 monte abrt-hook-ccpp: Saved core dump of pid 24629 to core.24629 at /home/dvicker/dsync (5427765248 bytes)
Oct 17 10:59:21 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Refusing connection from 128.157.254.142: insecure port 34726
Oct 17 10:59:21 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Skipped 9 previous similar messages
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-09-21-11:01:07-18361'
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'vmcore-127.0.0.1-2018-07-12-08:53:30'
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-07-28-16:59:54-16739'
Oct 17 10:59:38 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:39 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:39 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:40 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:40 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:41 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:41 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:42 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:42 monte abrt-server: Generating core_backtrace
Oct 17 10:59:45 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:45 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:45 monte abrt-server: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used [y/N] 
Oct 17 10:59:45 monte abrt-server: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be used [y/N] 
Oct 17 10:59:45 monte abrt-server: Undefined variable outside of [[ ]] bracket
Oct 17 10:59:45 monte abrt-server: Sending an email...
Oct 17 10:59:45 monte abrt-server: Sending a notification email to: root@localhost
Oct 17 10:59:45 monte abrt-server: Email was sent to: root@localhost

This created 4 core files if you want them. This is different than the last crash, which was an OOM event.

I don't want to restart this sync again until I have confidence that it's not going to delete a bunch of files again, so let's do some debugging on some smaller cases first.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Good idea regarding the branch. I've got one with the printf added here:

https://github.com/hpc/mpifileutils/tree/debugtimestamps

I don't want to restart this sync again until I have confidence that its not going to delete a bunch of files again. So lets do some debugging on some smaller cases first.

Yes, that makes total sense. I'm hoping the smaller test case works, as well.

If we can get a stack trace from the core files, that would be helpful.

Do you get something with gdb?

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

If it helps, you can use dcmp to scan for files that have mismatching size/mtime values after a dsync using the following command:

dcmp --output EXIST=COMMON@CONTENT=DIFFER:mismatched.txt --text --lite /source /destination

That writes to mismatched.txt a list of any files that dcmp finds exist in both the source and destination paths but have different contents. The --lite option tells dcmp to compare file "content" using size/mtime rather than reading and comparing the actual file data.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Yes, here are the backtraces from gdb.

core.24629.txt
core.24630.txt
core.24672.txt
core.24677.txt

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Thanks for the dcmp command - that will be very helpful. I can use it in combination with rsync to set the times correctly, which will help salvage the files that have already been transferred.

I'll work on compiling the debug code and running some tests next.

I really appreciate your help!

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Glad to help. Sorry for all of the grief you're hitting with this, but thanks for taking the time to chase it down.

It looks like the core dump stack traces are all somewhere within MPI -- I don't see any mpiFileUtils calls in those traces. With just that info, it's hard to say what led to the segfault. If it's reproducible, maybe you could catch it live under a debugger as the next step. To track that, let's open a new issue. I think it may be an independent problem.

Assuming dsync is actually setting timestamps on each file (to the correct value and without error), then one theory I'm still stuck on is that perhaps a write from some process is being processed by the server after the timestamp has already been set. That late write would end up updating the mtime to the current time, and it could be racy. That seems to fit well with what we're seeing here.

I was hoping the --direct option which enables O_DIRECT would avoid the late writes. From your earlier test, we now know the problem still exists with O_DIRECT, which would seem to rule out that theory. Someone suggested to me today that there might still be cases where O_DIRECT writes could be getting cached, so maybe this idea is still in play.

One more thing I'd like to test is whether forcing an fsync() before closing each file would help. We already have sync() + MPI_Barrier() calls between completing the writes and setting the timestamps. That sync() is meant to flush any cached data to avoid any late write problem, but maybe it's not working for some reason. I have a change in the pipeline that will add an fsync() that I can push to the new branch. We can try that next.

I'd like to first verify that the timestamps are actually being set to the correct values using the printf statement we just added.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

I ran dcmp on the partial transfer of my 40M file case and have some questions. Here's the last part of the output.

[2023-10-17T20:26:13] Walked 18572701 items in 1192.205 seconds (15578.441 items/sec)
[2023-10-17T20:26:38] Comparing items
[2023-10-17T20:26:54] Started   : Oct-17-2023, 20:26:38
[2023-10-17T20:26:54] Completed : Oct-17-2023, 20:26:54
[2023-10-17T20:26:54] Seconds   : 15.308
[2023-10-17T20:26:54] Items     : 44656137
[2023-10-17T20:26:54] Item Rate : 44656137 items in 15.308301 seconds (2917119.089945 items/sec)
[2023-10-17T20:26:54] Bytes read: 0.000 B (0 bytes)
[2023-10-17T20:26:54] Byte Rate : 0.000 B/s (000 bytes in 15.308 seconds)
[2023-10-17T20:27:02] Writing to output file: mismatched.txt
[2023-10-17T20:27:55] Wrote 16081904 files in 53.309 seconds (301674.447 files/sec)
Number of items that exist in both directories and have different contents: 8040952 (Src: 8040952 Dest: 8040952), dumped to "mismatched.txt"

Looking at the results,

[root@monte dsync]# grep JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv mismatched.txt 
-rw-rw---- 7135 7135 596.170 KiB Mar  6 2022 10:10 /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
-rw-rw---- 7135 7135 596.170 KiB Mar  6 2022 10:10 /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
[root@monte dsync]# 

Those look to be identical in size and time. Just to confirm, I did a regular ls and diff too.

[root@monte dsync]# ls -l /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
-rw-rw---- 1 pgarcia6 pgarcia6 610478 Mar  6  2022 /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
[root@monte dsync]# ls -l /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
-rw-rw---- 1 pgarcia6 pgarcia6 610478 Mar  6  2022 /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
[root@monte dsync]# diff /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
[root@monte dsync]# 

So why was this reported as mismatched? This could be a contributor to the dsync issue too.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Sorry, I meant to mention that of the 18M files that have been transferred, it's reporting that 16M of them differ, which doesn't seem to make sense.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Oh, that's interesting. I'm only printing the integer portion of mtime, but maybe the nanosecond values are different.

Can you run stat on a pair of those files and check the fractional seconds of mtime?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Sure. That does appear to be the issue.

[root@monte dsync]# stat /.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv /nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv
  File: ‘/.scratch00/agdl/em1data/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv’
  Size: 610478    	Blocks: 202        IO Block: 1048576 regular file
Device: 3dh/61d	Inode: 274783365   Links: 1
Access: (0660/-rw-rw----)  Uid: ( 7135/pgarcia6)   Gid: ( 7135/pgarcia6)
Access: 2022-03-06 10:10:09.757465521 -0600
Modify: 2022-03-06 10:10:10.735468368 -0600
Change: 2022-03-06 10:12:16.267833886 -0600
 Birth: -
  File: ‘/nobackup2/PROJECTS/orion/gnc/em1data-copy/AR1_MER_GEM/03_02_22_Landing_Site_Select/DFTO-15/JIS21-DFTO-15_CASE3-TIG_TO_4155_ASCII.csv’
  Size: 610478    	Blocks: 2400       IO Block: 4194304 regular file
Device: f87a3582h/4168758658d	Inode: 198162015687024217  Links: 1
Access: (0660/-rw-rw----)  Uid: ( 7135/pgarcia6)   Gid: ( 7135/pgarcia6)
Access: 2023-10-18 11:28:01.000000000 -0500
Modify: 2022-03-06 10:10:10.000000000 -0600
Change: 2023-10-16 15:50:36.000000000 -0500
 Birth: -
[root@monte dsync]# 

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Ok, and here em1data-copy is the destination file, right?

Now I wonder if dsync set that nanosecond value on the destination to be 0 to begin with, or did it try to set it to a non-zero value but the file system didn't accept it. It'd be good to figure that bit out. I think the output from the printf debug statement should help determine that.

In the meantime, the rsync man page shows that it offers some control over how timestamps are compared:

-@, --modify-window
When comparing two timestamps, rsync treats the timestamps as being equal if they differ by no more than the modify-window value. The default is 0, which matches just integer seconds. If you specify a negative value (and the receiver is at least version 3.1.3) then nanoseconds will also be taken into account. Specifying 1 is useful for copies to/from MS Windows FAT filesystems, because FAT represents times with a 2-second resolution (allowing times to differ from the original by up to 1 second).

If you want all your transfers to default to comparing nanoseconds, you can create a ~/.popt file and put these lines in it:

rsync alias -a -a@-1
rsync alias -t -t@-1

With that as the default, you’d need to specify --modify-window=0 (aka -@0) to override it and ignore nanoseconds, e.g. if you’re copying between ext3 and ext4, or if the receiving rsync is older than 3.1.3.

We may need to add a similar option for dsync.
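
Roughly speaking, the check would end up looking something like this (just a sketch; the function name and sign convention mirror rsync's description above, not any existing dsync code):

#include <stdlib.h>
#include <time.h>

/* Sketch: compare two mtimes the way rsync's --modify-window describes.
 *   window  > 0 : equal if the integer seconds differ by at most 'window'
 *   window == 0 : compare integer seconds only (ignore nanoseconds)
 *   window  < 0 : seconds and nanoseconds must both match exactly */
static int mtime_equal(const struct timespec *a, const struct timespec *b, long window)
{
    if (window < 0) {
        return a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec;
    }
    long diff = labs((long)(a->tv_sec - b->tv_sec));
    return diff <= window;
}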

This fractional second difference may be a second issue. I recall your other examples showed timestamps that differed by calendar date, so they were off by way more than just the nanosecond field.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Got the file. I dropped the comment just to hide the file names. So dsync is setting the nanosecond fields to 0.

rc=0 errno=0 atime_sec=1697655016 atime_ns=0 mtime_sec=1589398871 mtime_ns=0 atime=2023-10-18T13:50:16 mtime=2020-05-13T14:41:11

To double check, if you stat one of the source files used for that test, can you confirm that the fractional seconds are something other than 0?

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

I pushed a commit to the debug branch that will print the atime/mtime values for files from the source list, specifically files that exist in the source but not the destination (first time copy). That will tell us whether dsync is reading the source times correctly.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Sorry, I'm switching things back and forth on you quite a bit. There's a lot going on, and it's making things harder to keep straight...

Ok, and here em1data-copy is the destination file, right?

Correct.

We may need to add a similar option for dsync.

I agree.

This fractional second difference may be a second issue. I recall your other examples showed timestamps that differed by calendar date, so they were off by way more than just the nanosecond field.

Yes, I agree with this and meant to mention it myself before. I'm doing both NFS to lustre transfers and lustre to lustre transfers here and I think we are potentially running into separate issues. Like you said, I've seen cases where the times in the destination were off by weeks or months compared to the source. This nanosecond difference is due to filesystem differences.

The 40M file case is an NFS to lustre transfer where the backing filesystem on the NFS side is ZFS. ZFS keeps track of fractional seconds. I'm guessing you know this already, but a lustre filesystem does not keep track of fractional seconds.

The small test case I did with the debug version (dsync-timestamp.2023-10-18-14-05.txt) was a lustre to lustre transfer. I'm trying not to stress our NFS filesystem with too much parallel access if I don't need to.

To double check, if you stat one of the source files used for that test, can you confirm that the fractional seconds are something other than 0?

Just to be clear, that was a lustre to lustre copy so there should have been no fractional seconds in either case. But for completeness, here is one of the source files.

[dvicker@monte dsync]$ stat /nobackup/dvicker/cst-grid-scripts/README.md 
  File: ‘/nobackup/dvicker/cst-grid-scripts/README.md’
  Size: 703       	Blocks: 71         IO Block: 4194304 regular file
Device: f90c7a4h/261146532d	Inode: 144120192501284956  Links: 1
Access: (0600/-rw-------)  Uid: ( 3179/ dvicker)   Gid: ( 3000/     eg3)
Access: 2023-10-18 14:05:57.000000000 -0500
Modify: 2020-05-13 14:41:09.000000000 -0500
Change: 2023-10-03 17:33:56.000000000 -0500
 Birth: -
[dvicker@monte dsync]$ 

I pushed a commit to the debug branch that will print the atime/mtime values for files from the source list, specifically files that exist in the source but not the destination (first time copy). That will tell us whether dsync is reading the source times correctly.

Sure. I'm assuming you want the same small test I did before, which is a lustre to lustre copy?

FYI, the decoder ring for our systems:

  • /.scratch00 = NFS and ZFS
  • /nobackup = lustre
  • /nobackup2 = lustre

Again, sorry for confusion. I'll try to keep things more consistent (or at least communicate the details more clearly).

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

I pushed a commit to the debug branch that will print the atime/mtime values for files from the source list, specifically files that exist in the source but not the destination (first time copy). That will tell us whether dsync is reading the source times correctly.

I repeated the same lustre to lustre test I did before with the updated debug branch. Happy to do something else if needed. This was a clean sync with no existing files in the destination.

dsync-timestamp.2023-10-18-17-39.txt

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Ah, I had forgotten (or never realized) that Lustre only uses the integer field in mtime. I'm surprised no one has noted this problem with repeated dsyncs from NFS to Lustre already.

I'll look to add an option like rsync's to flip between checking just seconds and seconds+nanoseconds. In the short term, if you need to do more NFS-to-Lustre transfers before that's available, we can hack the internal check to only consider the integer portion. That's an easy change.

Coming back to the problem regarding the bigger timestamp differences in your Lustre-to-Lustre transfers, if you can reproduce that problem using the debug branch, the extra output should provide some clues. I just pushed another update to drop an extra printf we likely don't need now that we found the source of the nanosecond problem.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

I repeated the same lustre to lustre test I did before with the updated debug branch. Happy to do something else if needed. This was a clean sync with no existing files in the destination.

Does dcmp report any mismatching timestamps in that most recent Lustre-to-Lustre test?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Not exactly, but I did delete one file in the source and repeated the dsync with the --delete option (I've been wanting to test --delete anyway). I used the debug version for this. The sync did what was expected (removed the one file), but it also did a lot of timestamp setting. I think it set the times again for all files, which is not consistent with the "Updating timestamps on newly copied files" message. Results attached.

dsync-timestamp.2023-10-18T18-25.txt

I have some more data for a new, independent transfer as well. Last night I started a lustre to lustre sync we needed. My typical 1 node transfer. 40+ TB so it took most of the night. Here is the initial transfer. I used the non-debug version for this. Output for the first run:

dsync-orion-aero.2023-10-18T16-54.txt

There are some interesting errors in there. I did a dcmp this morning after the transfer was complete; it reported 8 mismatched files:

Number of items that exist in both directories and have different contents: 8 (Src: 8 Dest: 8), dumped to "orion-aero.mismatched.txt"

orion-aero.mismatched.txt

I repeated the dsync for this with the debug code.

dsync-orion-aero.2023-10-19T09-01.txt.gz

Hopefully this is still useful. Or do I need to run the debug code on the initial sync that gets the timestamps wrong to begin with?

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

which is not consistent with the Updating timestamps on newly copied files output.

I had recently noticed that description is not quite accurate, as well. I plan to update that. Thanks.

Do you find any wrong timestamps after using the debug transfer?

This should also be a transfer from scratch. For any files that you find with dcmp that have the wrong timestamps, I'd like to then look back at the output from the debug run to see what timestamp values it tried to apply to those destination files.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Do you find any wrong timestamps after using the debug transfer?

No.

This should also be a transfer from scratch. For any files that you find with dcmp that that have the wrong timestamps, I'd like to then look back at the output from the debug run to see what timestamp values it tried to apply to those destination files.

I figured this was the case. I will start using the debug code for all transfers for now. It seems to me like it takes a fairly large transfer to induce this behavior. All of my smaller transfers seem to work the first time.

I'll look to add an option like with rsync to flip between checking just seconds and seconds+nanoseconds. In the short term, if you need to do more NFS-to-Lustre transfers before that's available, we can hack that the internal check to only consider the integer portion. That's an easy change.

I would like to get my NFS to lustre transfer going again. Can you provide a branch for this?

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Sure, this branch ignores the fractional seconds when comparing file mtime values.

https://github.com/hpc/mpifileutils/tree/mtimensec

This does not have the debug statements, so it should be quieter.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Excellent, thanks. I'm running dcmp from this branch on the NFS to lustre transfer to verify it finds no (or few) differences. If so, I'll resume the dsync. I'm assuming the time-comparison code is shared between the two tools, so that will work, correct?

I'm trying to run some small tests with the timestamp debug build to capture some data there.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

This reminds me of a question I wanted to ask. I see that dcp can take input from dwalk to reduce the number of times you walk the actual filesystem. For many of the transfers I'm doing, the source data is static and it would be great to do something similar - especially for this 40M file NFS to lustre transfer. Does it make sense to add --input to dsync? Or is there a way to do that with the existing codes?

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Yes, the branch updates both dsync and dcmp to use only the integer portion, so you should see consistent results.

I'll look at the --input option. That's a good idea. Obviously, that could provide a stale view of the source in the general case, but it would be helpful on large file systems where files don't change. I can't remember offhand if there might be some gotchas involved.

Crossing my fingers that one of the smaller lustre-to-lustre runs can reproduce the timestamp mismatch. Thanks for taking the time to run these tests.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

The dcmp walk is 13M files into the source FS. I'll let you know how that turns out.

Great, thanks, that will be handy.

I tried a very small test a couple of times and everything worked in one shot. Now I'm trying the 400k-file, ~5 TB case that had a handful of files not transfer properly the first time. That takes a couple hours, so I'll let you know.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

OK, 40M NFS to lustre case with the updated branch looking good. Here's the dcmp results.

[2023-10-19T20:22:41] Walked 18572701 items in 1222.146 seconds (15196.794 items/sec)
[2023-10-19T20:23:06] Comparing items
[2023-10-19T20:23:22] Started   : Oct-19-2023, 20:23:06
[2023-10-19T20:23:22] Completed : Oct-19-2023, 20:23:22
[2023-10-19T20:23:22] Seconds   : 15.899
[2023-10-19T20:23:22] Items     : 44464314
[2023-10-19T20:23:22] Item Rate : 44464314 items in 15.898896 seconds (2796691.924957 items/sec)
[2023-10-19T20:23:22] Bytes read: 0.000 B (0 bytes)
[2023-10-19T20:23:22] Byte Rate : 0.000 B/s (000 bytes in 15.899 seconds)
[2023-10-19T20:23:29] Writing to output file: mismatched.em1.2023-10-19-18-13T18-13-00txt
[2023-10-19T20:23:31] Wrote 81904 files in 2.380 seconds (34419.940 files/sec)
Number of items that exist in both directories and have different contents: 40952 (Src: 40952 Dest: 40952), dumped to "mismatched.em1.2023-10-19-18-13T18-13-00txt"

Starting the transfer again.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

The 40M NFS to lustre transfer is working great. Thanks a lot for the updated code.

I've been trying to run a test where I do a clean sync followed by a repeat to catch a re-transfer of a file that didn't get set properly the first time. This is the setup.

ml load mpifileutils/debugtimestamps

source=/nobackup/PROJECTS/ccdev.to-delete/spacex/chutes/                                                   
dest=/nobackup2/dvicker/dsync_test/chutes/
dsync_opts='-X non-lustre --batch-files 100000'

mpiexec --allow-run-as-root drm $dest

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out
mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out

I've run it a few times now and I think it works properly the first time for all my tests. Here is the output from the last run.

dsync-testing.2023-10-19T23-24-16.out.gz
dsync-testing.2023-10-20T01-30-16.out.gz

Maybe there is still some useful info there. I'll keep running this, hoping to catch a re-transfer but let me know if you have any suggestions.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

The 40M NFS to lustre transfer is working great. Thanks a lot for the updated code.

Great! Thanks again for helping to track that one down. I'll have a PR with an option for that soon.

Yes, your setup for the lustre-to-lustre dsync test looks good. It may help to include a dcmp in between the two dsync commands, since that output makes it a bit easier to see the timestamp mismatches.

If these jobs don't yet reproduce the problem, I suppose the next step would be to start increasing the file count. Hopefully, there is some smallish size that will expose the problem reliably.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Oh, never mind about adding dcmp. We can spot any differences if we see any files show up during the delete/overwrite portion in the second dsync.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

@dvickernasa , have you had any transfers reproduce the Lustre timestamp problem?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

OK, I think we caught a few files this time.

FYI, the source was closed off to regular users so I'm certain it was static.

[root@dvicker dsync]# ls -ld /nobackup/PROJECTS/ccdev.to-delete 
d--------- 5 root root 11776 Aug 24 17:08 /nobackup/PROJECTS/ccdev.to-delete
You have new mail in /var/spool/mail/root
[root@dvicker dsync]# 

Here is the end of the first transfer:

[dvicker@dvicker dsync]$ tail -15 dsync-testing.2023-11-03T13-27-57.out
[2023-11-04T06:10:03] Updated 1151121 items in 408.878 seconds (2815.316 items/sec)
[2023-11-04T06:10:03] Syncing directory updates to disk.
[2023-11-04T06:10:06] Sync completed in 2.394 seconds.
[2023-11-04T06:10:06] Started: Nov-03-2023,14:05:41
[2023-11-04T06:10:06] Completed: Nov-04-2023,06:10:06
[2023-11-04T06:10:06] Seconds: 57864.748
[2023-11-04T06:10:06] Items: 1151121
[2023-11-04T06:10:06]   Directories: 35775
[2023-11-04T06:10:06]   Files: 1112126
[2023-11-04T06:10:06]   Links: 3220
[2023-11-04T06:10:06] Data: 47.312 TiB (52019905315792 bytes)
[2023-11-04T06:10:06] Rate: 857.345 MiB/s (52019905315792 bytes in 57864.748 seconds)
[2023-11-04T06:10:06] Updating timestamps on newly copied files
[2023-11-04T06:10:06] Completed updating timestamps
[2023-11-04T06:10:06] Completed sync
[dvicker@dvicker dsync]$ 

And the output from immediately running the second dsync.

[2023-11-04T06:41:24] Comparing file sizes and modification times of 1112126 items
[2023-11-04T06:41:24] Started   : Nov-04-2023, 06:41:23
[2023-11-04T06:41:24] Completed : Nov-04-2023, 06:41:24
[2023-11-04T06:41:24] Seconds   : 0.702
[2023-11-04T06:41:24] Items     : 1112126
[2023-11-04T06:41:24] Item Rate : 1112126 items in 0.702232 seconds (1583701.944237 items/sec)
[2023-11-04T06:41:24] Deleting items from destination
[2023-11-04T06:41:24] Removing 3 items
[2023-11-04T06:41:24] Removed 3 items in 0.002 seconds (1491.750 items/sec)
[2023-11-04T06:41:24] Copying items to destination
[2023-11-04T06:41:24] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-04T06:41:24] Items: 3
[2023-11-04T06:41:24]   Directories: 0
[2023-11-04T06:41:24]   Files: 3
[2023-11-04T06:41:24]   Links: 0
[2023-11-04T06:41:24] Data: 55.123 GiB (18.374 GiB per file)
[2023-11-04T06:41:24] Creating 3 files.
[2023-11-04T06:41:24] Copying data.
[2023-11-04T06:41:45] Copied 10.328 GiB (19%) in 20.811 secs (508.184 MiB/s) 90 secs left ...
[2023-11-04T06:41:58] Copied 21.477 GiB (39%) in 34.094 secs (645.048 MiB/s) 53 secs left ...
[2023-11-04T06:42:21] Copied 34.646 GiB (63%) in 56.483 secs (628.105 MiB/s) 33 secs left ...
[2023-11-04T06:42:21] Copied 55.123 GiB (100%) in 56.483 secs (999.351 MiB/s) done
[2023-11-04T06:42:21] Copy data: 55.123 GiB (59188151153 bytes)
[2023-11-04T06:42:21] Copy rate: 999.351 MiB/s (59188151153 bytes in 56.483 seconds)
[2023-11-04T06:42:21] Syncing data to disk.
[2023-11-04T06:42:23] Sync completed in 2.393 seconds.
[2023-11-04T06:42:23] Setting ownership, permissions, and timestamps.
/nobackup2/dvicker/dsync_test/ccdev/boeing/aerothermal/steam_database_solutions/bp_results/MIIIb_traj_MC_690to700_run_00214.cdat/misc/processed_hifi_trajs.h5 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/Crew-3/boat-and-recovery-video/TRaTS Raw Converted Stills/00025276_000000000062B5E5_Lk5x5_B1250_L16384.bmp atime_sec=1699081733 atime_ns=0 mtime_sec=1651831020 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/boeing/aerothermal/steam_database_solutions/bp_results/MIIIb_traj_MC_690to700_run_00214.cdat/misc/processed_hifi_trajs.h5 rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/Crew-3/boat-and-recovery-video/TRaTS Raw Converted Stills/00025276_000000000062B5E5_Lk5x5_B1250_L16384.bmp rc=0 errno=0 atime_sec=1699081733 atime_ns=0 mtime_sec=1651831020 mtime_ns=0 atime=2023-11-04T02:08:53 mtime=2022-05-06T04:57:00
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
[2023-11-04T06:42:23] Updated 3 items in 0.008 seconds (398.938 items/sec)
[2023-11-04T06:42:23] Syncing directory updates to disk.
[2023-11-04T06:42:25] Sync completed in 2.375 seconds.
[2023-11-04T06:42:25] Started: Nov-04-2023,06:41:24
[2023-11-04T06:42:25] Completed: Nov-04-2023,06:42:25
[2023-11-04T06:42:25] Seconds: 61.262
[2023-11-04T06:42:25] Items: 3
[2023-11-04T06:42:25]   Directories: 0
[2023-11-04T06:42:25]   Files: 3
[2023-11-04T06:42:25]   Links: 0
[2023-11-04T06:42:25] Data: 55.123 GiB (59188151153 bytes)
[2023-11-04T06:42:25] Rate: 921.392 MiB/s (59188151153 bytes in 61.262 seconds)
[2023-11-04T06:42:25] Updating timestamps on newly copied files

One thing stood out to me. The output said it was deleting and re-transferring 3 files, but the subsequent output shows 6 files.

I picked one of those files and grepped it out of the output.

[dvicker@dvicker dsync]$ grep 2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV dsync-testing.2023-11-03T13-27-57.out dsync-testing.2023-11-04T06-10-06.out
dsync-testing.2023-11-03T13-27-57.out:/nobackup/PROJECTS/ccdev.to/nobackup/PROJECTS/ccdev.to-delete/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-03T13-27-57.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-03T13-27-57.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-10-20T22:47:34 mtime=2023-01-11T05:34:17
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
[dvicker@dvicker dsync]$ sudo ls -l /nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV
-rw-rw---- 1 sajansse ccp_spacex 17839545344 Jan 11  2023 /nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV
[dvicker@dvicker dsync]$ 

Even with gzip --best, these output files are over the upload limit, so I split them into pieces. That way I can upload them and you can see the entire output.

[dvicker@dvicker dsync]$ split -n 3 dsync-testing.2023-11-03T13-27-57.out{,.split}
[dvicker@dvicker dsync]$ split -n 2 dsync-testing.2023-11-04T06-10-06.out{,.split}

dsync-testing.2023-11-03T13-27-57.out.splitaa.gz
dsync-testing.2023-11-03T13-27-57.out.splitab.gz
dsync-testing.2023-11-03T13-27-57.out.splitac.gz
dsync-testing.2023-11-04T06-10-06.out.splitaa.gz
dsync-testing.2023-11-04T06-10-06.out.splitab.gz

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Thanks for running these additional tests.

I think there are just 3 files, but it prints two debug messages for each file.

Looking at the *.h5 file in the set, grep prints these 2 lines in the first dsync:

atime_sec=1696039618 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1696039618 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-09-29T21:06:58 mtime=2023-06-06T11:29:24

And then for that file in the second dsync, I see these 4 lines:

atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24
atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24

So for this particular file, it looks like dsync succeeded each time it set the timestamps (rc=0 every time).

The mtime values did not change between the two runs (mtime_sec=1686068964 mtime_ns=0).

I do see that the access time changed from atime=2023-09-29T21:06:58 to atime=2023-11-04T02:06:19. That's probably due to dsync updating the atime when it reads the source file to begin with. The next release will have the --open-noatime to avoid that.

Given that the mtime is the same, I'm not sure why dsync decided to delete and re-copy the file. Perhaps the file size was different between the two?

Given that the mtime didn't change on the source file, and knowing that the file system was locked down, I really doubt the size of the source file changed. Maybe dsync never fully copied the file to the destination in the first run, even though it thought it had?

It would have been a good idea to print the file size, too, so that I could verify. There might be a couple other checks, like if the file type came up different somehow. I'll look to add that.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

@dvickernasa , I've updated the debugtimestamps branch. I merged a commit which required a force push, so hopefully that doesn't trip you up.

https://github.com/hpc/mpifileutils/tree/debugtimestamps

Since the mtime handling in the output of your last test looked to be correct, I've disabled the messages that print getting/setting timestamps on each file so that it's not as noisy. I added new messages to print what difference dsync finds between the source/destination when it decides to replace a destination file.

If you have a chance, would you please kick off a repeat of that test?

I realize that takes a while to run, so I appreciate your time. Thanks.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Oh, this branch does include the new --open-noatime option, too, if you want to use that. I don't think that will interfere with the bug we're currently chasing. It should prevent dsync from updating the source atime values, which should also speed up the second dsync a bit. Without that option, I would expect the second dsync to be setting the timestamps on all destination files to update atime.
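
For what it's worth, the usual pattern for this kind of option is to request O_NOATIME and fall back when the kernel refuses it (a sketch of the idea, not the exact code in the branch):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Sketch: O_NOATIME is only allowed for the file owner or a privileged
 * process, so retry with a plain open if it is rejected with EPERM. */
static int open_source_noatime(const char *path)
{
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM) {
        fd = open(path, O_RDONLY);
    }
    return fd;
}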

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Thanks, @dvickernasa . In this latest test, the second dsync copied 3 files. The new "DIFF" debug statements show that it detected a difference in the size/mtime check:

DIFF size/mtime src ... src_size=41347294011 dst_size=41347294011 src_mtime=1686068964 dst_mtime=1699435045
DIFF size/mtime src ... src_size=5307980 dst_size=5307980 src_mtime=1502986683 dst_mtime=1699435081
DIFF size/mtime src ... src_size=4759 dst_size=4759 src_mtime=1444167688 dst_mtime=1699434866

Here the file size values match, but the mtime values are different. The destination mtimes are all much more recent than the source file mtimes, and the destination mtime values look like they probably match up with the time at which the dsync was running.

From your previous test, we saw that dsync had correctly set the mtime on all files, including the 3 files that were copied by the second run. Unfortunately, I can't verify whether the mtime values had been set correctly in this test, because I dropped those messages.

I suspect that to be the case again, but I'd like to add those messages back and run once more to be certain.

Would you mind pulling those changes and running once more?

Thanks!

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

With this next test, the goal is to look for messages that verify that the timestamps are being set correctly and then must somehow be getting changed again behind the scenes.

I'm back to wondering if this is due to outstanding writes that are buffered on the server and then getting applied after the timestamps have been set. Those "late writes" would change the mtime to the time that last write is applied. dsync has a sync() call and then an MPI_Barrier() which is meant to finalize all writes before setting any timestamps, but maybe that's not sufficient.

write all data
sync()
MPI_Barrier(MPI_COMM_WORLD);
set timestamps

For this test, you don't need to send the full output to me. You can grep for "DIFF" in the second run to find any files that were re-copied. Even one file should be sufficient. Then also grep for the source and destination path of that file in both the first and second run output. We should only need those lines.

Assuming that behaves as expected, I'd like to run one more test where we modify dsync to call fsync() before closing every file. I'd like to check whether that forces the writes to happen.

Thanks again for your help with this, @dvickernasa .

from mpifileutils.

adilger avatar adilger commented on September 1, 2024

@adammoody just following along at home, and wanted to note that sync() on one client node is not going to flush out dirty data on all of the other client nodes; otherwise the servers would see a crippling number of sync writes, since every connected client would be triggering sync writes on every other client. Also, IIRC, sync() at the OS level may start writes, but does not necessarily wait for all writes on all filesystems to finish (this has changed over the years between kernel versions).

Once the data has been flushed from the clients to the servers, Lustre will generally write it to storage in order, so a later "sync" will ensure all earlier writes are also committed to storage.

The "guaranteed" way to ensure all of the data is flushed is to call fsync(fd) or fdatasync(fd) on each files written by each node. If dsync is writing files in batches, then it could f(data)sync(fd) in reverse order (and/or after some delay) so that most of the time it is not waiting for any data to be written or actually triggering a sync on the storage.

Calling the llapi_file_flush() interface will revoke the locks and cache on all clients to flush their data, but of course has no control over whether there are still active processes writing new data to the file, so that would need to be done at the end (I don't know if there is any coordination between writers?). Internally, this is calling llapi_get_data_version(). For debugging, calling llapi_get_data_version() when the data writers are finished, and then again before setting the timestamp, would tell you if the data has been modified on any OST objects since it was supposedly flushed.

Since these APIs could trigger sync writes for every file accessed (if they have dirty data), it would best to call it on a file only when all of the writers are done, just before the timestamp is reset. That should avoid too much overhead.

While there is an S_NOCMTIME flag that we set on inodes when writing on the OSS (so that the local server clock does not mess with the client timestamps) there is unfortunately no kernel API to set this on the clients.
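
To illustrate the data-version debugging idea, the check would be along these lines (a sketch from memory; please double-check the declarations against lustre/lustreapi.h):

#include <stdio.h>
#include <lustre/lustreapi.h>

/* Sketch: record the data version once the writers are done, then check it
 * again just before setting the timestamp.  LL_DV_WR_FLUSH also forces any
 * dirty data on other clients to be flushed to the OST objects. */
static int data_version_changed(int fd, __u64 *saved)
{
    __u64 dv = 0;
    if (llapi_get_data_version(fd, &dv, LL_DV_WR_FLUSH) != 0) {
        perror("llapi_get_data_version");
        return -1;
    }
    if (*saved == 0) {       /* first call: just record the version */
        *saved = dv;
        return 0;
    }
    return dv != *saved;     /* nonzero: the objects were modified in between */
}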

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Thanks, @adilger . I was hoping you might still be following this thread. The global barrier should ensure that sync has been called by all ranks before any rank starts to set timestamps. However, if sync returns before all data has been written by the servers, then that could lead to the problem that we're seeing.

It sounds like I'll need the fsync or fdatasync for this. We used to have an fsync in there. I think we took that out to improve performance and added the single sync() at the end in the hopes it would accomplish the same thing but more efficiently. I like the idea of reversing the order of fsync calls to speed things up, though I might need to reduce the batch size to avoid keeping too many files open.

For a large file that gets split across clients, I presume a single call to llapi_file_flush would be more efficient than having each client call fsync individually? That might be a good Lustre-specific optimization.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

@adammoody, the latest test completed. It didn't re-transfer files on the 2nd dsync this time so I'm not sure how useful the results are but I'll send them to you anyway.

I'm pulling the latest changes and repeating the test now.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Thanks, @dvickernasa . Right, unfortunately it looks like this latest run didn't trigger the problem. Hopefully, the next run will catch it. Thanks as always for your help.

When you run using that latest branch, it will be quite verbose again as it prints a message whenever getting/setting timestamps.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

Sorry for the delay. I was at SC23 all last week and didn't get back to this. Looks like my latest transfer with the updated code caught one file.

[2023-11-14T03:38:05] Comparing file sizes and modification times of 1112126 items
[2023-11-14T03:38:05] Started   : Nov-14-2023, 03:38:05
[2023-11-14T03:38:05] Completed : Nov-14-2023, 03:38:05
[2023-11-14T03:38:05] Seconds   : 0.721
[2023-11-14T03:38:05] Items     : 1112126
[2023-11-14T03:38:05] Item Rate : 1112126 items in 0.721121 seconds (1542218.650501 items/sec)
[2023-11-14T03:38:05] Deleting items from destination
[2023-11-14T03:38:05] Removing 1 items
[2023-11-14T03:38:05] Removed 1 items in 0.001 seconds (964.208 items/sec)
[2023-11-14T03:38:05] Copying items to destination
[2023-11-14T03:38:05] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-14T03:38:05] Items: 1
[2023-11-14T03:38:05]   Directories: 0
[2023-11-14T03:38:05]   Files: 1
[2023-11-14T03:38:05]   Links: 0
[2023-11-14T03:38:05] Data: 917.512 MiB (917.512 MiB per file)
[2023-11-14T03:38:05] Creating 1 files.
[2023-11-14T03:38:05] Copying data.
[2023-11-14T03:38:09] Copy data: 917.512 MiB (962081164 bytes)
[2023-11-14T03:38:09] Copy rate: 236.609 MiB/s (962081164 bytes in 3.878 seconds)
[2023-11-14T03:38:09] Syncing data to disk.
[2023-11-14T03:38:12] Sync completed in 2.778 seconds.
[2023-11-14T03:38:12] Setting ownership, permissions, and timestamps.           

@adammoody I'll email you the latest output.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Thanks again, @dvickernasa . No problem, and I hope you had a good time at SC.

Yes, this set of runs captured good evidence of what we were looking for. Copying bits from your email, here is the note about the second run that finds a difference in mtime:

DIFF /src/file.mpg src_size=962081164 src_mstr=2023-04-15T08:54:32 src_mtime=1681566872 src_mtime2=1681566872
DIFF /dst/file.mpg dst_size=962081164 dst_mstr=2023-11-13T23:02:23 dst_mtime=1699938143 dst_mtime2=1699938143

The source file has an mtime of 2023-04-15T08:54:32 while the destination file has something more recent at 2023-11-13T23:02:23.

However, for this file, I see that the output from the first run reports that it had set the mtime to the correct value 2023-04-15T08:54:32 and without error rc=0:

dsync-testing.2023-11-13T10-00-06.out:dst /dst/file.mpg rc=0 errno=0 atime_sec=1699621797 atime_ns=0 mtime_sec=1681566872 mtime_ns=0 atime=2023-11-10T07:09:57 mtime=2023-04-15T08:54:32

Let me update the branch to enable the fsync() before closing each file. This may slow things down during the copy, but I'd like to see whether the problem still shows up in that case.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

@dvickernasa , I've pushed a commit to the debugtimestamps branch that forces the fsync during the copy.

c17d0c6

When you get a chance, would you please pull this change, rebuild, and try another test?

This will add a lot of fsync calls, which may slow down the copy speed. It will also change the timing of things.

Given that this looks to be a race condition, the timing change alone might hide the problem. It'll be hard to trust that it's really gone without running more tests or running at a larger scale. On the other hand, if we still see the problem show up with fsync, we might then try the llapi_file_flush API.

I think what we're seeing is that some outstanding writes are being buffered by a server and applied after the timestamp. I'm looking for different ways to force those writes to finish.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

@adilger , assuming this is what is happening, I see the following in the notes section of the man 2 sync page on my system:

According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However Linux waits for I/O completions, and thus sync() or syncfs() provide the same guarantees as fsync called on every file in the system or filesystem respectively.

dsync currently uses sync as though it flushes (to disk) any data that the calling process has written. I think the Linux behavior described above would be sufficient, but the POSIX.1-2001 specification would not be.

Do you know how Lustre treats sync?

From your earlier comments, it sounds like it may be more along the lines of POSIX, where sync starts the writes but does not wait for them to finish. You've already described that above, but I just wanted to confirm. I feel like the text in the man page confuses things, since I know Lustre is POSIX-compliant, but this is also on a Linux system.
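
Just to spell out what I mean by the different guarantees, here is the set of flush primitives involved (Linux man-page semantics; how Lustre maps each of them is exactly what I'm asking about):

#define _GNU_SOURCE
#include <unistd.h>

/* Sketch of the flush calls with increasingly narrow scope.
 *   sync()   - flushes all filesystems known to this client; on Linux it
 *              waits, per POSIX it may return before the writes finish.
 *   syncfs() - Linux-specific: limited to the filesystem containing fd.
 *   fsync()  - flushes this one file's data and metadata and waits. */
static void flush_everything(int fd)
{
    sync();
    syncfs(fd);
    fsync(fd);
}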

from mpifileutils.

adilger avatar adilger commented on September 1, 2024

@adammoody I don't think sync() will do more than flush a single client's dirty writes to the filesystem.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

The updated test ran this morning. The 2nd transfer did not re-transfer anything, so that's encouraging. I'll be out the rest of the week for Thanksgiving, but I'm going to put this in a loop and run it several times while I'm gone, so we'll have some more data next week.

Also, here is some data on how the extra syncs affected the transfer times. There is a lot of variation in the runs, but it would appear that there isn't a huge effect. Still, it might be nice to add some (potentially debug-only) timing statements to directly measure the overhead of calling the sync; see the sketch after the listing.

[dvicker@dvicker dsync]$ for file in  dsync-testing.*.out; do
> grep Rate: $file | tail -1 
> done
[2023-10-16T20:12:58] Rate: 730.639 MiB/s (5357437501746 bytes in 6992.849 seconds)
[2023-10-17T10:19:23] Rate: 35.958 B/s (753 bytes in 20.941 seconds)
[2023-10-19T20:44:39] Rate: 756.352 MiB/s (5736324718486 bytes in 7232.852 seconds)
[2023-10-20T14:05:21] Rate: 655.065 MiB/s (5736324718486 bytes in 8351.211 seconds)
[2023-10-20T16:24:18] Rate: 700.910 MiB/s (5736324718486 bytes in 7804.975 seconds)
[2023-10-20T18:43:53] Rate: 693.610 MiB/s (5736324718486 bytes in 7887.125 seconds)
[2023-10-20T21:00:51] Rate: 701.543 MiB/s (5736324718486 bytes in 7797.935 seconds)
[2023-10-20T23:18:05] Rate: 703.717 MiB/s (5736324718486 bytes in 7773.846 seconds)
[2023-10-21T12:12:59] Rate: 479.823 MiB/s (2842477276278 bytes in 5649.579 seconds)
[2023-11-04T06:10:06] Rate: 857.345 MiB/s (52019905315792 bytes in 57864.748 seconds)
[2023-11-04T06:42:25] Rate: 921.392 MiB/s (59188151153 bytes in 61.262 seconds)
[2023-11-08T07:21:31] Rate: 843.019 MiB/s (52019905315792 bytes in 58848.101 seconds)
[2023-11-08T07:58:05] Rate: 877.777 MiB/s (41352606750 bytes in 44.928 seconds)
[2023-11-10T11:29:01] Rate: 836.292 MiB/s (52019905315792 bytes in 59321.474 seconds)
[2023-11-14T03:05:10] Rate: 835.834 MiB/s (52019905315792 bytes in 59353.928 seconds)
[2023-11-14T03:38:15] Rate: 98.241 MiB/s (962081164 bytes in 9.339 seconds)
[2023-11-21T09:28:14] Rate: 814.605 MiB/s (52019905315792 bytes in 60900.757 seconds)
[dvicker@dvicker dsync]$ 
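
For the timing idea, wrapping each fsync like this would give a per-rank total that could be reported at the end (just a sketch; the accumulator is made up):

#include <mpi.h>
#include <unistd.h>

/* Sketch: accumulate the time spent in fsync on each rank so the overhead
 * of the extra flushes can be reported (or summed with MPI_Reduce). */
static double total_fsync_secs;   /* hypothetical per-rank accumulator */

static int timed_fsync(int fd)
{
    double start = MPI_Wtime();
    int rc = fsync(fd);
    total_fsync_secs += MPI_Wtime() - start;
    return rc;
}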

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

That's great. Thanks, @dvickernasa . Have a Happy Thanksgiving!

from mpifileutils.

ofaaland avatar ofaaland commented on September 1, 2024

Hi @dvickernasa, hope you had a good Thanksgiving.

Is the destination file system built with OSTs backed by ZFS, and if so, is "options osd_zfs osd_object_sync_delay_us=<some_positive_number>" set on those servers? If so I might have an explanation for what we're seeing (although I'm not sure). Thanks!

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

@adammoody, I got 6 more iterations on this done over the break. It looks like all of them found more data to copy on the 2nd transfer.

[dvicker@dvicker dsync]$ grep -e Removing dsync-testing.1.2023-11-22T11-04-07.out dsync-testing.2.2023-11-23T05-21-15.out dsync-testing.3.2023-11-24T00-25-20.out dsync-testing.4.2023-11-24T19-19-35.out dsync-testing.5.2023-11-25T13-22-08.out dsync-testing.6.2023-11-26T06-59-07.out
dsync-testing.1.2023-11-22T11-04-07.out:[2023-11-22T11:41:24] Removing 2 items
dsync-testing.2.2023-11-23T05-21-15.out:[2023-11-23T06:03:11] Removing 5 items
dsync-testing.3.2023-11-24T00-25-20.out:[2023-11-24T01:08:11] Removing 1 items
dsync-testing.4.2023-11-24T19-19-35.out:[2023-11-24T19:57:52] Removing 1 items
dsync-testing.5.2023-11-25T13-22-08.out:[2023-11-25T13:56:00] Removing 2 items
dsync-testing.6.2023-11-26T06-59-07.out:[2023-11-26T07:36:36] Removing 1 items
[dvicker@dvicker dsync]$ ll dsync-testing.?.*.out
 54186 -rw-r--r-- 1 root root 328067432 Nov 22 11:04 dsync-testing.1.2023-11-21T17-21-53.out
 67614 -rw-r--r-- 1 root root 321097260 Nov 22 11:52 dsync-testing.1.2023-11-22T11-04-07.out
 54351 -rw-r--r-- 1 root root 328053051 Nov 23 05:21 dsync-testing.2.2023-11-22T11-55-54.out
 68126 -rw-r--r-- 1 root root 322586738 Nov 23 06:15 dsync-testing.2.2023-11-23T05-21-15.out
 55228 -rw-r--r-- 1 root root 328051558 Nov 24 00:25 dsync-testing.3.2023-11-23T06-18-43.out
 68071 -rw-r--r-- 1 root root 322618344 Nov 24 01:29 dsync-testing.3.2023-11-24T00-25-20.out
 55603 -rw-r--r-- 1 root root 328055348 Nov 24 19:19 dsync-testing.4.2023-11-24T01-32-36.out
 68034 -rw-r--r-- 1 root root 322569590 Nov 24 20:20 dsync-testing.4.2023-11-24T19-19-35.out
 54397 -rw-r--r-- 1 root root 328050794 Nov 25 13:22 dsync-testing.5.2023-11-24T20-23-39.out
 67888 -rw-r--r-- 1 root root 322593290 Nov 25 14:07 dsync-testing.5.2023-11-25T13-22-08.out
 54698 -rw-r--r-- 1 root root 328059803 Nov 26 06:59 dsync-testing.6.2023-11-25T14-10-39.out
 68098 -rw-r--r-- 1 root root 322600133 Nov 26 07:37 dsync-testing.6.2023-11-26T06-59-07.out
[dvicker@dvicker dsync]$ 

I'll send you some more details via email.

@ofaaland, for the testing I've been doing in the last several posts (and most of the other examples in this issue), the source LFS does have ZFS OSTs. The destination LFS is ldiskfs. Even so, we are setting the parameter you mentioned to zero.

[root@hpfs-fsl-oss00 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre modprobe configuration

options lnet networks=tcp0(enp4s0),o2ib0(ib1),o2ib1(ib0)
options ko2iblnd map_on_demand=32
options osd_zfs osd_txg_sync_delay_us=0 osd_object_sync_delay_us=0
options zfs zfs_txg_history=120 zfs_txg_timeout=15 zfs_prefetch_disable=1 zfs_txg_timeout=30 zfs_vdev_scheduler=deadline

[root@hpfs-fsl-oss00 ~]# 

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Thanks for that additional test output, @dvickernasa . This shows that the problem is still there even if we fsync() every file. I was hoping that might be a workaround, but it's instructive to know that it's not.

I've pushed a couple more commits to the branch.

The first one calls llapi_file_flush() as @adilger suggested. I left the fsync in there, so now dsync will be doing both. I'd like to see whether this might do better than fsync. As added, this will only work properly when the destination file system is Lustre.

I also added a call to stat the destination file and check the timestamps immediately after setting them. This will print a new DIFF message if it finds a difference. For any file that is re-copied in the second run, if we do not see those new DIFF messages show up in the output of the first run, then it helps further verify that the mtime was set correctly to begin with.
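
The verify-after-set check is along these lines (a simplified sketch of the idea, not the exact code in the branch):

#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>

/* Sketch: apply atime/mtime to the destination, then immediately stat it
 * and print a DIFF message if the mtime that comes back is not what we set. */
static int set_and_verify_mtime(const char *path, const struct timespec times[2])
{
    if (utimensat(AT_FDCWD, path, times, AT_SYMLINK_NOFOLLOW) != 0) {
        perror("utimensat");
        return -1;
    }
    struct stat st;
    if (lstat(path, &st) != 0) {
        perror("lstat");
        return -1;
    }
    if (st.st_mtim.tv_sec != times[1].tv_sec) {
        printf("DIFF just-set mtime mismatch on %s\n", path);
        return 1;
    }
    return 0;
}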

When you have a chance, would you please pull, rebuild, and submit a new round of tests?

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

@adammoody, we talked a while ago about adding the --input option to dsync. If its easy, that would come in handy for all these tests. It takes about 30 minutes to walk the source data for the test I've been running (which is on our older lustre filesystem).

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Oh, right. That slipped my mind. I'll work on that.

from mpifileutils.

dvickernasa avatar dvickernasa commented on September 1, 2024

The latest test finished (sorry for the delay). This time the second run found 9 files to re-transfer.

[2023-11-29T11:34:37] Started   : Nov-29-2023, 11:34:36
[2023-11-29T11:34:37] Completed : Nov-29-2023, 11:34:37
[2023-11-29T11:34:37] Seconds   : 0.717
[2023-11-29T11:34:37] Items     : 1112126
[2023-11-29T11:34:37] Item Rate : 1112126 items in 0.717182 seconds (1550688.322885 items/sec)
[2023-11-29T11:34:37] Deleting items from destination
[2023-11-29T11:34:37] Removing 9 items
[2023-11-29T11:34:37] Removed 9 items in 0.019 seconds (473.855 items/sec)
[2023-11-29T11:34:37] Copying items to destination
[2023-11-29T11:34:37] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-29T11:34:37] Items: 9
[2023-11-29T11:34:37]   Directories: 0
[2023-11-29T11:34:37]   Files: 9
[2023-11-29T11:34:37]   Links: 0
[2023-11-29T11:34:37] Data: 101.952 GiB (11.328 GiB per file)
[2023-11-29T11:34:37] Creating 9 files.
[2023-11-29T11:34:37] Copying data.
[2023-11-29T11:34:51] Copied 9.484 GiB (9%) in 14.754 secs (658.244 MiB/s) 144 secs left ...
[2023-11-29T11:36:20] Copied 20.059 GiB (20%) in 103.611 secs (198.241 MiB/s) 423 secs left ...
[2023-11-29T11:36:20] Copied 101.952 GiB (100%) in 103.611 secs (0.984 GiB/s) done
[2023-11-29T11:36:20] Copy data: 101.952 GiB (109470191676 bytes)
[2023-11-29T11:36:20] Copy rate: 0.984 GiB/s (109470191676 bytes in 103.611 seconds)
[2023-11-29T11:36:20] Syncing data to disk.
[2023-11-29T11:36:23] Sync completed in 2.766 seconds.
[2023-11-29T11:36:23] Setting ownership, permissions, and timestamps.

from mpifileutils.

adammoody avatar adammoody commented on September 1, 2024

Thanks, @dvickernasa . Wow, this problem is stubborn. So it would seem the llapi_file_flush() calls are also not helping.

Based on the data from the latest run that you emailed to me separately, I don't see any of the new DIFF messages that would have been printed by the first run if it detected a wrong mtime immediately after it set the timestamp. That further suggests that the timestamps are being set correctly initially and then being changed again later.

Even more puzzling is that several of the files that have incorrect mtimes are of size=0. There should not be any write() calls issued for those files. Maybe this is not a buffered write after all, but something else. Let me take another look through the code with that in mind.

BTW, I am working to have the --input-src option ready for your next run.

from mpifileutils.
