
Comments (40)

nzmsv avatar nzmsv commented on July 28, 2024 2

Does this mean you tried the patch in #77 and it didn't help? In that case I am very interested!


AWSNB avatar AWSNB commented on July 28, 2024 1

@stephenroller we recently discovered that NCCL may enable some modes that EFA doesn't support, which could lead to inconsistency and potentially corruption. We updated the documentation to require setting the env variable: -x NCCL_PROTO=simple

Could you check what the current setting is in your runs, and try this variable if it is not set?
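For reference, a minimal sketch of checking and setting this from a training script launched with torchrun (an assumption on my part; the -x form above is mpirun syntax). The variable just needs to be in the environment before the NCCL communicator is created:

import os
import torch.distributed as dist

# Equivalent of `-x NCCL_PROTO=simple` / `export NCCL_PROTO=simple`;
# NCCL reads this when the communicator is initialized, so set it first.
os.environ.setdefault("NCCL_PROTO", "simple")
print("NCCL_PROTO =", os.environ.get("NCCL_PROTO"))

dist.init_process_group(backend="nccl")  # rank/world size supplied by torchrun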


rashikakheria avatar rashikakheria commented on July 28, 2024 1

These segfaults could be related to non-aligned buffer registrations done by PyTorch's DataLoader, but we would need to confirm whether any such registrations are happening (more likely via libfabric than via the plugin). I understand from your conversation that you are using PyTorch 1.9.1 with CUDA 11.4.1 (but PyTorch itself is compiled with 11.3.1). Is that right?

If yes, we will try to reproduce with these versions and debug it.

Also, if misaligned buffers are actually causing this issue, it should be independent of toggling FI_EFA_USE_DEVICE_RDMA=1 (though it may be more likely to trigger when using CUDA buffers?).


rashikakheria avatar rashikakheria commented on July 28, 2024 1

@nzmsv A kernel upgrade would only help if this is indeed a page-aligned memory registration issue. I think we should try to reproduce it at our end and verify whether that's the case. We can propose solutions based on our findings. Please keep in mind that customers ingest new AMIs to do kernel upgrades.


nzmsv avatar nzmsv commented on July 28, 2024 1

We'll work on an internal reproduction. Thank you!


nzmsv avatar nzmsv commented on July 28, 2024 1

Does this combination stop segfaults in the child process?


rashikakheria avatar rashikakheria commented on July 28, 2024

I have seen these errors in the past when my training data wasn't behind a fast enough link (I had to switch from NFS-based storage to local storage). Do you still experience this error?


rashikakheria avatar rashikakheria commented on July 28, 2024

No follow-up


stephenroller avatar stephenroller commented on July 28, 2024

I've replicated this on a recent set of p4d.24xlarges


stephenroller avatar stephenroller commented on July 28, 2024

Some info here: I've found that using FI_EFA_USE_DEVICE_RDMA is quite unstable (causing segfaults) pretty much whenever there's a forked subprocess, so DataLoaders with num_workers>0 have issues. The mitigation is either to disable RDMA or to avoid forks.

I would have hoped this would be addressed by RDMAV_FORK_SAFE=1, but that doesn't seem to help. If there is any guidance, I would greatly appreciate it.
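For concreteness, a sketch of the two environment toggles mentioned here, set from Python before anything initializes libfabric or NCCL (setting them in the job launcher's shell environment works just as well; and note that RDMAV_FORK_SAFE=1 reportedly did not help in this case):

import os

# Set before libfabric / the NCCL plugin is initialized.
os.environ["FI_EFA_USE_DEVICE_RDMA"] = "0"  # take the non-GPUDirect-RDMA path
os.environ["RDMAV_FORK_SAFE"] = "1"         # ibverbs fork protection (reportedly did not help here)

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # assumes a torchrun-style launcher set rank/world size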


nzmsv avatar nzmsv commented on July 28, 2024

This should be addressed by #77


stephenroller avatar stephenroller commented on July 28, 2024

Ah I tried but that doesn't seem to help. Not sure.


stephenroller avatar stephenroller commented on July 28, 2024

Correct. I checked out #77, recompiled aws-ofi-nccl, and found it didn't mitigate my segfaults.

I've gotten the workers to dump their cores and done some inspecting, and they're all crashing in pretty innocuous/random spots. Something is messing with their memory space out from under them.


nzmsv avatar nzmsv commented on July 28, 2024

Thinking about this some more, the problem could be caused by registering any buffer that is not page aligned. There is definitely some work that needs to be done in this area.
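As a rough illustration (just a sketch, not the plugin's actual registration path), one can check whether a given tensor's buffer happens to land on a page boundary:

import mmap

import torch

PAGE_SIZE = mmap.PAGESIZE  # typically 4096 bytes

t = torch.empty(1000)  # hypothetical host-side buffer
print(hex(t.data_ptr()), "page-aligned:", t.data_ptr() % PAGE_SIZE == 0)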

What does your application use fork() for?


stephenroller avatar stephenroller commented on July 28, 2024

What does your application use fork() for?

PyTorch DataLoaders use fork under the hood. The main purpose is to load/preprocess data in a background process so that the GPUs don't have to be blocked waiting on data I/O.
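For context, a minimal sketch of that behavior with a stand-in dataset: on Linux the workers are created with the default multiprocessing start method, fork, so they inherit the parent's address space (including any RDMA-registered regions). Passing multiprocessing_context="spawn" is one way to avoid fork entirely (not something discussed in this thread), at the cost of slower worker startup:

import multiprocessing as mp

import torch
from torch.utils.data import DataLoader, TensorDataset

print(mp.get_start_method())  # 'fork' by default on Linux

dataset = TensorDataset(torch.randn(1024, 16))  # stand-in for the real dataset

# num_workers > 0: each worker is a forked copy of the training process.
forked_loader = DataLoader(dataset, batch_size=32, num_workers=4)

# One alternative: start workers with spawn instead of fork.
spawned_loader = DataLoader(dataset, batch_size=32, num_workers=4,
                            multiprocessing_context="spawn")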


stephenroller avatar stephenroller commented on July 28, 2024

The only remaining path I haven't explored is recompiling pytorch. The latest UltraCluster AMIs bundle CUDA 11.4, but pytorch has only been compiled with 11.3 (or lower in my case; I'm on pytorch 1.9.1 for unrelated reasons).


alexeib avatar alexeib commented on July 28, 2024

I have this issue intermittently as well. It only happens while the dataloaders are being created; once they are created successfully, the job runs to completion. There's about a 30%-50% chance that a given job will fail, and I just have to keep restarting (on the same nodes) until it works. I also notice that the larger the model, the more likely we are to hit this issue.

I was not setting NCCL_PROTO at all. I tried it set to simple and it did not help.


stephenroller avatar stephenroller commented on July 28, 2024

@stephenroller we recently discovered that NCCL may enable some modes that EFA doesn't support, which could lead to inconsistency and potentially corruption. We updated the documentation to require setting the env variable: -x NCCL_PROTO=simple

Could you check what the current setting is in your runs, and try this variable if it is not set?

Explicitly setting NCCL_PROTO did not resolve the issue.

My models are larger (3-13B parameters) and roughly comparable to the megatron-zero3 setup, so there's a lot of complexity and memory thrashing within my models. I also find things are less predictable: sometimes I crash on the first SGD step, sometimes after a few hundred steps, sometimes in the middle of validation.

That said, @alexeib sees it on smaller models (600M parameters) that are closer to standard vision transformers, and in a reasonably distinct codebase.



stephenroller avatar stephenroller commented on July 28, 2024

I'm indeed using the newest AMI: https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-cuda-11-4-ubuntu-18-04/. Since my instances are a managed resource, our best options are to find mitigations, or to release new AMIs that don't have the issue (if this is indeed an AMI issue). Please contact me directly at [email protected] if you would prefer to coordinate through existing official support channels.

I understand there are internal issues being tracked for that image (case 9248945321), but I'm unsure what other issues are being tracked right now (e.g. the segfaults).


stephenroller avatar stephenroller commented on July 28, 2024

Tracking in T108814700 on the Meta side.


nzmsv avatar nzmsv commented on July 28, 2024

I'll come up with a patch that warns when unaligned memory regions are registered to see if we get any hints.

Also, compatibility between fork and RDMA has been significantly improved in newer Linux kernels (though the plugin will need some work to take advantage of these features). Which kernel version are you using?


alexeib avatar alexeib commented on July 28, 2024

If it helps, I had this (or a similar) issue with CUDA 11.1 and pytorch 1.9.1 as well as pytorch 1.10.0 (both compiled to target CUDA 11.1 and NCCL 2.8.4). The error message was different then: we got "RuntimeError: CUDA error: unspecified launch failure" with a cryptic CUDA stacktrace, and Xid NVIDIA driver errors in the logs.

We then upgraded to the new AMI with CUDA 11.4 and compiled pytorch 1.10.1 to target CUDA 11.4, and now we get the same problem, but this time the segfault is in the dataloader instead. Same symptoms and same resolution (keep restarting jobs until it works).

Another training task used to work on the CUDA 11.1 / pytorch 1.9.1 setup but stopped working on the new AMI with the same issue, only appearing after training for one or a couple of epochs. The fix was to enable "persistent_workers" in the dataloader so the workers don't get re-created every epoch (see the sketch below); that seemed to make things a bit more stable.

All of these issues go away if RDMA is disabled.
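A minimal sketch of that persistent_workers setting with a stand-in dataset (persistent_workers=True keeps the forked workers alive across epochs instead of tearing them down and re-forking them each epoch):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(4096, 32))  # stand-in for the real dataset

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,
    persistent_workers=True,  # workers survive across epochs; fork() happens only once
)

for epoch in range(3):
    for (batch,) in loader:  # no worker re-creation at the start of each epoch
        pass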


stephenroller avatar stephenroller commented on July 28, 2024

It's probably an issue with cuda 11.4 + efa.

Environments that fail:

  • pytorch 1.9.1 (compiled for 11.1) + cuda 11.4 + nccl 2.11.4+cuda11.4 (as shipped in the AMI referenced)
  • pytorch 1.9.1 (compiled for 11.1) + self compiled cuda 11.4, nccl and aws-ofi-nccl
  • pytorch 1.10.1 (compiled for 11.3) + cuda 11.4 + nccl 2.11.4+cuda11.4 (as shipped in the AMI referenced)

Did not try:

  • pytorch 1.10.1 (self-compiled for 11.4) + cuda 11.4, etc., with 470 drivers (Alexei did try this, though, and also saw issues)

Environments that succeed (470 drivers):

  • pytorch 1.10.1 (compiled for 11.3) + cuda 11.3 + nccl 2.11.4+cuda11.4 (manually downloaded cuda 11.3, set it)

To try on Monday:

  • new AMI with cuda 11.3, 460 drivers, and pytorch 1.10.1 compiled for 11.3


nzmsv avatar nzmsv commented on July 28, 2024

I think my question got buried, re-raising. What is the Linux kernel version?


stephenroller avatar stephenroller commented on July 28, 2024

$ uname -a
Linux ip-[redacted] 5.4.0-1060-aws #63~18.04.1-Ubuntu SMP Mon Nov 15 14:31:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


wzamazon avatar wzamazon commented on July 28, 2024

@stephenroller

Can you confirm that the segfault happens in the child process?

Also, can you provide the backtrace from the core dump?


stephenroller avatar stephenroller commented on July 28, 2024

Yes, the child process segfaults.

I have lost the core dumps :( We wiped the cluster today to change to the new AMI with CUDA 11.3. I'm verifying that it works now.

When I analyzed the core dumps, they were generally in innocuous places inside standard libraries (once in vec::fold inside a Rust library; once deep inside the Python interpreter, etc.). They were inconsistent and happening in well-tested places, so I was led to believe something was touching memory underneath me.


nzmsv avatar nzmsv commented on July 28, 2024

Is upgrading the kernel a possibility or should we be looking at alternative solutions? Kernel 5.15 (LTS) includes all the changes to make fork work well with RDMA.


stephenroller avatar stephenroller commented on July 28, 2024

Okay I'm on the brand new AMIs and replicated the issue again.

With pytorch 1.10.1, launching with the variables set from /etc/profile.d/dlami.sh, I observe that EFA/RDMA is not enabled. I manually added
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDA_HOME/efa/lib"
to my environment and relaunched, and confirmed that EFA/RDMA is now being used correctly (this should really be done in dlami.sh for me).

I observe, however, that it's currently using the NCCL version bundled with pytorch (2.10.3). When I manually set
export LD_PRELOAD=$CUDA_HOME/lib/libnccl.so.2.11.4
to absolutely force the NCCL version, I get the segfaults (and the logs report that I'm using 2.11.4+cuda11.5, so I guess that's what's bundled in the newest AMI).

So the issues seem to be tied to newer versions of CUDA (11.4 or 11.5), or to NCCL compiled with newer versions of CUDA. It could be an environment mismatch, but my previous attempts to compile everything myself to force alignment didn't really seem to help.

EDIT:
Okay, I observe the crash with the NCCL bundled with pytorch as well.
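For what it's worth, a quick way to check which NCCL build a given PyTorch process reports at runtime, relevant to the LD_PRELOAD experiment above (a sketch; whether this reflects a preloaded library depends on whether that PyTorch build links NCCL dynamically or statically, and the return format varies across PyTorch versions):

import torch

print("torch:", torch.__version__)
print("nccl :", torch.cuda.nccl.version())  # e.g. (2, 10, 3)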


stephenroller avatar stephenroller commented on July 28, 2024

So far what has worked for me is downloading NCCL 2.11.4+cuda11.4 directly from NVIDIA and forcing it to be used at runtime via LD_PRELOAD:

export LD_PRELOAD=/data/home/roller/lib/nccl_2.11.4-1+cuda11.4_x86_64/lib/libnccl.so.2.11.4

This unblocks my research for now.


stephenroller avatar stephenroller commented on July 28, 2024

Okay the latest AMI from yesterday doesn't seem to have issues.


stephenroller avatar stephenroller commented on July 28, 2024

I was asked to put together a proxy workload that triggers these behaviors.

Here's a rough proxy of our workload. If you can't replicate with this public benchmark, we can start bringing in some of the more complex behavior we have implemented in our private repo.

Replicate the 13B model from here. Increase the number of GPUs to increase the probability of the issue:
https://github.com/pytorch/fairseq/tree/main/examples/fully_sharded_data_parallel

But you may need to add something like --num-workers 8 to explicitly turn on background workers.

You can download a public dataset compatible with fairseq with the instructions here:
https://github.com/pytorch/fairseq/tree/main/examples/language_model#1-preprocess-the-data


nzmsv avatar nzmsv commented on July 28, 2024

Thank you! I am still trying to reproduce the crash. Does the crash reproduce with this proxy workload in your environment?


nzmsv avatar nzmsv commented on July 28, 2024

One more question: for all the environments described above (good and bad) were these running in Docker or directly on the hosts?


stephenroller avatar stephenroller commented on July 28, 2024

Unfortunately, in order to unblock the research, I presently only have access to a cluster with the latest 11.3 image (which doesn't crash). We'll need to work with Six Nines to stand up a cluster with 11.4 in order to test this proxy workload on my end.

I want to keep this discussion mostly on official support channels (cc @AWSNB: emailed you; could you add @nzmsv and the other persons we met this week?), but wanted to leave some info here.


stephenroller avatar stephenroller commented on July 28, 2024

One more question: for all the environments described above (good and bad) were these running in Docker or directly on the hosts?

"bare" metal


rashikakheria avatar rashikakheria commented on July 28, 2024

Thanks Stephen. I will open another support channel with you.


Tete-Xiao avatar Tete-Xiao commented on July 28, 2024

This issue was noticed by a separate AWS team in April 2021. Unfortunately, it is still a problem.

https://github.com/aws/sagemaker-training-toolkit/releases/tag/v3.9.2


nzmsv avatar nzmsv commented on July 28, 2024

This was found to be caused by an issue in Libfabric. Resolved by ofiwg/libfabric#7431

