Comments (6)
@kshitij12345 - From our discussion, I remember you were looking into this. I believe this is what is causing the memory operations, but it needs further investigation.
cc @eqy re: fragmentation lunch discussion
Does TORCH_NCCL_AVOID_RECORD_STREAMS=1 help?
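For anyone reproducing this, a minimal sketch of trying the flag in a training script; as far as I know the flag is picked up when the NCCL process group is constructed, so it has to be set before that point (the code below is generic illustration, not code from this issue):

```python
import os

# Set before any NCCL process group is created, or the flag has no effect.
os.environ.setdefault("TORCH_NCCL_AVOID_RECORD_STREAMS", "1")

import torch.distributed as dist

# With a torchrun launch, this reads MASTER_ADDR/RANK/etc. from the env
# and picks up the record-streams flag at process-group construction.
dist.init_process_group(backend="nccl")
```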
@eqy Yes, it does. Either of the two env variables gives the same performance benefit. Is it fair to call this a memory fragmentation issue, or do you think it is something else?
As per offline discussion with @ptrblck, we should enable TORCH_NCCL_AVOID_RECORD_STREAMS=1 by default in Thunder.
cc @IvanYashchuk @mruberry - Can we enable this env var by default in Thunder, or should we rely on the NVIDIA containers to enable it?
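If Thunder were to turn this on by default, one hedged sketch is a small helper like the one below; the helper name and the place it would be called from are assumptions for illustration, not Thunder's actual API. Using setdefault keeps an explicit user setting (e.g. TORCH_NCCL_AVOID_RECORD_STREAMS=0) authoritative:

```python
import os

def _apply_default_nccl_env() -> None:
    # Hypothetical helper, not part of Thunder today: enable the flag only
    # when the user has not already chosen a value, so explicit overrides win.
    os.environ.setdefault("TORCH_NCCL_AVOID_RECORD_STREAMS", "1")

# Would need to run before any NCCL process group is created, e.g. from
# Thunder's distributed setup path (hook point is an assumption).
_apply_default_nccl_env()
```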
Related Issues (20)
- Recursion error in transformer module with NeMo Stable Diffusion HOT 3
- Hang using thunder.jit with tokenizer in NeMo Stable Diffusion HOT 5
- Constraints to insert static numbers
- CI: Re-Enable torchrun call in Zero to Thunder notebook
- dtype inconsistencies when dividing/rounding tensors
- thunder.jit of AutoEncoder in NeMo Stable Diffusion slower than eager HOT 4
- Dynamic shape needs to be modeled in trace
- OOM errors for Gemma-7, pythia-12b, Llama-2-13b-hf and Nous-Hermes-13b with FSDP zero3 and 2x8 H100 HOT 1
- Refine recording of source locations HOT 5
- Nous-Hermes-13b on 1x8 H100 FSDP zero2 with thunder_cudnn is 23% slower than with inductor
- fsdp(jit(...)) transform can use more memory compared to jit(fsdp(...)) HOT 1
- nvfuserex has problems taking getitem. HOT 3
- load/save_state_dict hooks for early transforms
- Training Llama-2-13b-hf on 2x8 H100 with Thunder inductor is 47% slower than with Inductor
- FP8 Linear and conv with cudnn HOT 1
- Support RN50 BatchNorm fusions with cudnn
- CI : PyTorch nightly CI failing with `FutureWarning: is_compiling is deprecated. Use torch.compiler.is_compiling() instead.`
- Distill API for module transformations from distributed / quantization uses of ThunderModule attributes
- TransformerEngine API changed and caused test failure `AttributeError: 'TELinear' object has no attribute 'fp8_weight_shapes'`
- FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.