
large-scale-lm-tutorials's People

Contributors

bernardscumm, hyunwoongko

large-scale-lm-tutorials's Issues

An error occurs at broadcast while following 03_distributed_programming.ipynb

Hello... I've been learning PyTorch parallel-processing concepts from the notebooks you published.

In 03_distributed_programming.ipynb, everything ran fine up through the P2P Communication section.

However, when I follow along with the sample program for the first broadcast operation in the Collective Communication section, an error occurs. The output appears up to the point where dist.broadcast(tensor, src=0) runs, and the error is raised right after it (a sketch of the script follows my environment details below).

Could you tell me why this error occurs?

My test environment is as follows:
Windows 10 Enterprise, version 22H2, running Ubuntu 20.04 in Docker Desktop on WSL2.
Python is 3.8.10 and PyTorch is 1.13.1+cu116.
Two GTX 1080 Ti GPUs are installed.
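
For reference, broadcast.py follows the broadcast example from the notebook. The exact script isn't reproduced in this issue, so the following is only a minimal sketch reconstructed from the output and traceback below; the names and tensor shapes are assumptions.

# broadcast.py -- minimal sketch, not the exact file from the notebook
import torch
import torch.distributed as dist

dist.init_process_group("nccl")          # the launcher supplies MASTER_ADDR/PORT, RANK, WORLD_SIZE
rank = dist.get_rank()
torch.cuda.set_device(rank)              # one process per GPU on a single node
print(f"torch.cuda.current_device() : {torch.cuda.current_device()}")

if rank == 0:
    tensor = torch.randn(2, 2).cuda()    # source data lives on rank 0
else:
    tensor = torch.zeros(2, 2).cuda()    # receive buffer on the other ranks

print(f"before rank {rank}: {tensor}\n")
dist.broadcast(tensor, src=0)            # copy rank 0's tensor to every rank
print(f"after rank {rank}: {tensor}\n")  # the traceback below points at this print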

The output of the test run is as follows:

root@9023c839d35c:/data/hf_test# python -m torch.distributed.launch --nproc_per_node=2 broadcast.py
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


torch.cuda.current_device() : 0
torch.cuda.current_device() : 1
before rank 0: tensor([[ 0.5138, -1.4212],
[-0.8317, 1.0614]], device='cuda:0')

before rank 1: tensor([[0., 0.],
[0., 0.]], device='cuda:1')

Traceback (most recent call last):
  File "broadcast.py", line 21, in <module>
    print(f"after rank {rank}: {tensor}\n")
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in __format__
    return object.__format__(self, format_spec)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in __init__
    nonzero_finite_vals = torch.masked_select(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f93e939f457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f93e93693ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f9414a47c64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e3e5 (0x7f9414a1f3e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f9414a22054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d6e23 (0x7f943f11fe23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f93e937f9e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f93e937faf9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x734c68 (0x7f943f37dc68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f943f37df85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f945f86d083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

Traceback (most recent call last):
  File "broadcast.py", line 21, in <module>
    print(f"after rank {rank}: {tensor}\n")
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in __format__
    return object.__format__(self, format_spec)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in __init__
    nonzero_finite_vals = torch.masked_select(
RuntimeError: numel: integer multiplication overflow
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f32834b2457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f328347c3ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f32aeb5ac64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e3e5 (0x7f32aeb323e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f32aeb35054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d6e23 (0x7f32d9232e23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f32834929e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f3283492af9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x734c68 (0x7f32d9490c68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f32d9490f85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f32f9980083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 260) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

broadcast.py FAILED

Failures:
[1]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 261)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 261

Root Cause (first observed failure):
[0]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 260)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 260
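
For completeness, two things in the log above suggest follow-up runs: the deprecation warning recommends migrating from torch.distributed.launch to torchrun, and the error text recommends CUDA_LAUNCH_BLOCKING=1 for a trustworthy stack trace. A sketch of those invocations follows; the NCCL_P2P_DISABLE=1 line is an additional, unverified assumption, included only because disabling peer-to-peer transfers is a commonly suggested mitigation when NCCL collectives hit illegal memory accesses on WSL2 with GeForce cards.

# Equivalent launch via torchrun; the sketch above derives its rank from the
# process group, so it does not depend on the --local_rank argument:
torchrun --nproc_per_node=2 broadcast.py

# Rerun with synchronous kernel launches for an accurate stack trace,
# as the error message itself suggests:
CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=2 broadcast.py

# Unverified assumption: test with NCCL peer-to-peer transfers disabled:
NCCL_P2P_DISABLE=1 torchrun --nproc_per_node=2 broadcast.py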
