
Comments (40)

nzmsv avatar nzmsv commented on July 28, 2024 2

Does this mean you tried the patch in #77 and it didn't help? In that case I am very interested!


AWSNB avatar AWSNB commented on July 28, 2024 1

@stephenroller we recently discovered that NCCL may enable some modes that EFA doesn't support, which could lead to inconsistency and potentially corruption. We updated the documentation to require setting the env variable: -x NCCL_PROTO=simple

Could you check what the current setting is in your runs, and try this variable if it is not set?
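For reference, a minimal sketch of checking and setting this from a training script launched with torchrun (an assumption on my part; the -x form above is mpirun syntax). The variable just needs to be in the environment before the NCCL communicator is created:

import os
import torch.distributed as dist

# Equivalent of `-x NCCL_PROTO=simple` / `export NCCL_PROTO=simple`;
# NCCL reads this when the communicator is initialized, so set it first.
os.environ.setdefault("NCCL_PROTO", "simple")
print("NCCL_PROTO =", os.environ.get("NCCL_PROTO"))

dist.init_process_group(backend="nccl")  # rank/world size supplied by torchrun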


rashikakheria avatar rashikakheria commented on July 28, 2024 1

These segfaults could be related to non-aligned buffer registrations done by PyTorch's DataLoader, but we would need to confirm whether any such registrations are happening (more likely via libfabric than via the plugin). I understand from your conversation that you are using PyTorch 1.9.1 with CUDA 11.4.1 (but PyTorch itself is compiled with 11.3.1). Is that right?

If yes, we will try to reproduce with these versions and debug it.

Also, if misaligned buffers are actually causing this issue, it should be independent of toggling FI_EFA_USE_DEVICE_RDMA=1 (though it may be more likely to trigger when using CUDA buffers?).


rashikakheria avatar rashikakheria commented on July 28, 2024 1

@nzmsv A kernel upgrade would only help if this is indeed a page-aligned memory registration issue. I think we should try to reproduce it at our end and verify whether that's the case. We can propose solutions based on our findings. Please keep in mind that customers ingest new AMIs to do kernel upgrades.


nzmsv avatar nzmsv commented on July 28, 2024 1

We'll work on an internal reproduction. Thank you!


nzmsv avatar nzmsv commented on July 28, 2024 1

Does this combination stop segfaults in the child process?


rashikakheria avatar rashikakheria commented on July 28, 2024

I have seen these errors in the past when my training data wasn't behind a fast enough link (I had to switch from NFS-based storage to local storage). Do you still experience this error?


rashikakheria avatar rashikakheria commented on July 28, 2024

No follow-up


stephenroller avatar stephenroller commented on July 28, 2024

I've replicated this on a recent set of p4d.24xlarges


stephenroller avatar stephenroller commented on July 28, 2024

Some info here: I've found that using FI_EFA_USE_DEVICE_RDMA is quite unstable (causing segfaults) pretty much whenever there's a forked subprocess, so DataLoaders with num_workers>0 have issues. The mitigation is either to disable RDMA or to avoid forks.

I would have hoped this would be addressed by RDMAV_FORK_SAFE=1, but that doesn't seem to help. If there is any guidance, I would greatly appreciate it.
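For concreteness, a sketch of the two environment toggles mentioned here, set from Python before anything initializes libfabric or NCCL (setting them in the job launcher's shell environment works just as well; and note that RDMAV_FORK_SAFE=1 reportedly did not help in this case):

import os

# Set before libfabric / the NCCL plugin is initialized.
os.environ["FI_EFA_USE_DEVICE_RDMA"] = "0"  # take the non-GPUDirect-RDMA path
os.environ["RDMAV_FORK_SAFE"] = "1"         # ibverbs fork protection (reportedly did not help here)

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # assumes a torchrun-style launcher set rank/world size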


nzmsv avatar nzmsv commented on July 28, 2024

This should be addressed by #77


stephenroller avatar stephenroller commented on July 28, 2024

Ah I tried but that doesn't seem to help. Not sure.


stephenroller avatar stephenroller commented on July 28, 2024

Correct. I checked out #77, recompiled aws-ofi-nccl, and found it didn't mitigate my segfaults.

I've gotten the workers to dump their cores and done some inspecting, and they're all crashing in pretty innocuous/random spots. Something is messing with their memory space out from under them.


nzmsv avatar nzmsv commented on July 28, 2024

Thinking about this some more, the problem could be caused by registering any buffer that is not page aligned. There is definitely some work that needs to be done in this area.
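As a rough illustration (just a sketch, not the plugin's actual registration path), one can check whether a given tensor's buffer happens to land on a page boundary:

import mmap

import torch

PAGE_SIZE = mmap.PAGESIZE  # typically 4096 bytes

t = torch.empty(1000)  # hypothetical host-side buffer
print(hex(t.data_ptr()), "page-aligned:", t.data_ptr() % PAGE_SIZE == 0)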

What does your application use fork() for?


stephenroller avatar stephenroller commented on July 28, 2024

What does your application use fork() for?

PyTorch DataLoaders use fork under the hood. The main purpose is to load/preprocess data in a background process so that the GPUs don't have to be blocked waiting on data I/O.
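For context, a minimal sketch of that behavior with a stand-in dataset: on Linux the workers are created with the default multiprocessing start method, fork, so they inherit the parent's address space (including any RDMA-registered regions). Passing multiprocessing_context="spawn" is one way to avoid fork entirely (not something discussed in this thread), at the cost of slower worker startup:

import multiprocessing as mp

import torch
from torch.utils.data import DataLoader, TensorDataset

print(mp.get_start_method())  # 'fork' by default on Linux

dataset = TensorDataset(torch.randn(1024, 16))  # stand-in for the real dataset

# num_workers > 0: each worker is a forked copy of the training process.
forked_loader = DataLoader(dataset, batch_size=32, num_workers=4)

# One alternative: start workers with spawn instead of fork.
spawned_loader = DataLoader(dataset, batch_size=32, num_workers=4,
                            multiprocessing_context="spawn")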


stephenroller avatar stephenroller commented on July 28, 2024

The only remaining path I haven't explored is recompiling pytorch. The latest UltraCluster AMIs bundle CUDA 11.4, but pytorch has only been compiled with 11.3 (or lower in my case; I'm on pytorch 1.9.1 for unrelated reasons).


alexeib avatar alexeib commented on July 28, 2024

I have this issue intermittently as well. It only happens while the dataloaders are being created; once they are created successfully, the job runs to completion. There's about a 30%-50% chance that a given job will fail, and I just have to keep restarting (on the same nodes) until it works. I also notice that the larger the model, the more likely we are to hit this issue.

I was not setting NCCL_PROTO at all. I tried it set to simple and it did not help.


stephenroller avatar stephenroller commented on July 28, 2024

@stephenroller we recently discovered that NCCL may enable some modes that EFA doesn't support, which could lead to inconsistency and potentially corruption. We updated the documentation to require setting the env variable: -x NCCL_PROTO=simple

Could you check what the current setting is in your runs, and try this variable if it is not set?

Explicitly setting NCCL_PROTO did not resolve the issue.

My models are larger (3-13B parameters) and roughly comparable to the megatron-zero3 setup, so there's a lot of complexity and memory thrashing within my models. I also find things are less predictable: sometimes I crash on the first SGD step, sometimes after a few hundred steps, sometimes in the middle of validation.

That said, @alexeib sees it on smaller models (600M parameters) that are closer to standard vision transformers, and in a reasonably distinct codebase.



stephenroller avatar stephenroller commented on July 28, 2024

I'm indeed using the newest AMI: https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-cuda-11-4-ubuntu-18-04/. Since my instances are a managed resource, our best options are to find mitigations, or to release new AMIs that don't have the issue (if this is indeed an AMI issue). Please contact me directly at [email protected] if you would prefer to coordinate through existing official support channels.

I understand there are internal issues being tracked for that image (case 9248945321), but I'm unsure what other issues are being tracked right now (e.g. the segfaults).


stephenroller avatar stephenroller commented on July 28, 2024

Tracking in T108814700 on the Meta side.


nzmsv avatar nzmsv commented on July 28, 2024

I'll come up with a patch that warns when unaligned memory regions are registered to see if we get any hints.

Also, compatibility between fork and RDMA has been significantly improved in newer Linux kernels (though the plugin will need some work to take advantage of these features). Which kernel version are you using?


alexeib avatar alexeib commented on July 28, 2024

If it helps, I had this (or a similar) issue with CUDA 11.1 and pytorch 1.9.1 as well as pytorch 1.10.0 (both compiled to target CUDA 11.1 and NCCL 2.8.4). The error message was different then: we got "RuntimeError: CUDA error: unspecified launch failure" with a cryptic CUDA stacktrace, and Xid NVIDIA driver errors in the logs.

We then upgraded to the new AMI with CUDA 11.4 and compiled pytorch 1.10.1 to target CUDA 11.4, and now we get the same problem, but this time the segfault is in the dataloader instead. Same symptoms and same resolution (keep restarting jobs until it works).

Another training task used to work on the CUDA 11.1 / pytorch 1.9.1 setup but stopped working on the new AMI with the same issue, only appearing after training for one or a couple of epochs. The fix was to enable "persistent_workers" in the dataloader so the workers don't get re-created every epoch (see the sketch below); that seemed to make things a bit more stable.

All of these issues go away if RDMA is disabled.
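A minimal sketch of that persistent_workers setting with a stand-in dataset (persistent_workers=True keeps the forked workers alive across epochs instead of tearing them down and re-forking them each epoch):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(4096, 32))  # stand-in for the real dataset

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,
    persistent_workers=True,  # workers survive across epochs; fork() happens only once
)

for epoch in range(3):
    for (batch,) in loader:  # no worker re-creation at the start of each epoch
        pass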


stephenroller avatar stephenroller commented on July 28, 2024

It's probably an issue with cuda 11.4 + efa.

Environments that fail:

  • pytorch 1.9.1 (compiled for 11.1) + cuda 11.4 + nccl 2.11.4+cuda11.4 (as shipped in the AMI referenced)
  • pytorch 1.9.1 (compiled for 11.1) + self compiled cuda 11.4, nccl and aws-ofi-nccl
  • pytorch 1.10.1 (compiled for 11.3) + cuda 11.4 + nccl 2.11.4+cuda11.4 (as shipped in the AMI referenced)

Did not try:

  • pytorch 1.10.1 (self-compiled for 11.4) + cuda 11.4, etc., with 470 drivers (Alexei did try this, though, and also saw issues)

Environments that succeed (470 drivers):

  • pytorch 1.10.1 (compiled for 11.3) + cuda 11.3 + nccl 2.11.4+cuda11.4 (manually downloaded cuda 11.3, set it)

To try on Monday:

  • new AMI with cuda 11.3, 460 drivers, and pytorch 1.10.1 compiled for 11.3


nzmsv avatar nzmsv commented on July 28, 2024

I think my question got buried, re-raising. What is the Linux kernel version?


stephenroller avatar stephenroller commented on July 28, 2024

$ uname -a
Linux ip-[redacted] 5.4.0-1060-aws #63~18.04.1-Ubuntu SMP Mon Nov 15 14:31:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


wzamazon avatar wzamazon commented on July 28, 2024

@stephenroller

Can you confirm that the segfault happens in the child process?

Also, can you provide the backtrace from the core dump?


stephenroller avatar stephenroller commented on July 28, 2024

Yes, the child process segfaults.

I have lost the core dumps :( We wiped the cluster today to change to the new AMI with CUDA 11.3. I'm verifying that it works now.

When I analyzed the core dumps, they were generally in innocuous places inside standard libraries (once in vec::fold inside a Rust library; once deep inside the Python interpreter, etc.). They were inconsistent and happening in well-tested places, so I was led to believe something was touching memory underneath me.


nzmsv avatar nzmsv commented on July 28, 2024

Is upgrading the kernel a possibility or should we be looking at alternative solutions? Kernel 5.15 (LTS) includes all the changes to make fork work well with RDMA.


stephenroller avatar stephenroller commented on July 28, 2024

Okay I'm on the brand new AMIs and replicated the issue again.

With pytorch 1.10.1, launching with the variables set from /etc/profile.d/dlami.sh, I observe that EFA/RDMA is not enabled. I manually added
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDA_HOME/efa/lib"
to my environment and relaunched, and confirmed that EFA/RDMA is now being used correctly (this should really be done in dlami.sh for me).

I observe, however, that it's currently using the NCCL version bundled with pytorch (2.10.3). When I manually set
export LD_PRELOAD=$CUDA_HOME/lib/libnccl.so.2.11.4
to absolutely force the NCCL version, I get the segfaults (and the logs report that I'm using 2.11.4+cuda11.5, so I guess that's what's bundled in the newest AMI).

So the issues seem to be tied to newer versions of CUDA (11.4 or 11.5), or to NCCL compiled with newer versions of CUDA. It could be an environment mismatch, but my previous attempts to compile everything myself to force alignment didn't really seem to help.

EDIT:
Okay, I observe the crash with the NCCL bundled with pytorch as well.
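For what it's worth, a quick way to check which NCCL build a given PyTorch process reports at runtime, relevant to the LD_PRELOAD experiment above (a sketch; whether this reflects a preloaded library depends on whether that PyTorch build links NCCL dynamically or statically, and the return format varies across PyTorch versions):

import torch

print("torch:", torch.__version__)
print("nccl :", torch.cuda.nccl.version())  # e.g. (2, 10, 3)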


stephenroller avatar stephenroller commented on July 28, 2024

So far what has worked for me is downloading NCCL 2.11.4+cuda11.4 directly from NVIDIA and forcing it to be used at runtime via LD_PRELOAD:

export LD_PRELOAD=/data/home/roller/lib/nccl_2.11.4-1+cuda11.4_x86_64/lib/libnccl.so.2.11.4

This unblocks my research for now.


stephenroller avatar stephenroller commented on July 28, 2024

Okay the latest AMI from yesterday doesn't seem to have issues.


stephenroller avatar stephenroller commented on July 28, 2024

I was asked to put together a proxy workload that triggers these behaviors.

Here's a rough proxy of our workload. If you can't replicate with this public benchmark, we can start bringing in some of the more complex behavior we have implemented in our private repo.

Replicate the 13B model from here. Increase the number of GPUs to increase the probability of the issue:
https://github.com/pytorch/fairseq/tree/main/examples/fully_sharded_data_parallel

But you may need to add something like --num-workers 8 to explicitly turn on background workers.

You can download a public dataset compatible with fairseq with the instructions here:
https://github.com/pytorch/fairseq/tree/main/examples/language_model#1-preprocess-the-data


nzmsv avatar nzmsv commented on July 28, 2024

Thank you! I am still trying to reproduce the crash. Does the crash reproduce with this proxy workload in your environment?


nzmsv avatar nzmsv commented on July 28, 2024

One more question: for all the environments described above (good and bad) were these running in Docker or directly on the hosts?


stephenroller avatar stephenroller commented on July 28, 2024

Unfortunately, in order to unblock the research, I presently only have access to a cluster with the latest 11.3 image (which doesn't crash). We'll need to work with Six Nines to stand up a cluster with 11.4 in order to test this proxy workload on my end.

I want to keep this discussion mostly on official support channels (cc @AWSNB: emailed you; could you add @nzmsv and the other persons we met this week?), but wanted to leave some info here.


stephenroller avatar stephenroller commented on July 28, 2024

One more question: for all the environments described above (good and bad) were these running in Docker or directly on the hosts?

"bare" metal


rashikakheria avatar rashikakheria commented on July 28, 2024

Thanks Stephen. I will open another support channel with you.


Tete-Xiao avatar Tete-Xiao commented on July 28, 2024

This issue was noticed by a separate AWS team in April 2021. Unfortunately, it is still a problem.

https://github.com/aws/sagemaker-training-toolkit/releases/tag/v3.9.2


nzmsv avatar nzmsv commented on July 28, 2024

This was found to be caused by an issue in Libfabric. Resolved by ofiwg/libfabric#7431

