Comments (11)
I tried this out on the Torch2-RMSNorm branch, using torch.autocast() to manually control precision. It was much slower. However, when I stopped messing with autocast it was faster.
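For reference, a minimal sketch of what "manually controlling precision" with autocast might look like here — this is illustrative, not the actual code on the branch:

```python
import torch
import torch.nn as nn


class FP32LayerNorm(nn.LayerNorm):
    """Hypothetical example: force LayerNorm to run in fp32 under autocast."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step outside the surrounding autocast region and run the norm in
        # full fp32, then cast the result back to the input dtype.
        with torch.autocast(x.device.type, enabled=False):
            out = nn.functional.layer_norm(
                x.float(),
                self.normalized_shape,
                self.weight.float() if self.weight is not None else None,
                self.bias.float() if self.bias is not None else None,
                self.eps,
            )
        return out.to(x.dtype)
```

Paying the autocast context switch plus the extra casts on every layer is one plausible source of the slowdown.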
from olmo.
Final results: https://wandb.ai/ai2-llm/petew-benchmarks-2/reports/LayerNorm-Benchmarks--VmlldzozOTI1NTA4

These results tell us that to optimize throughput we should use --model.layer_norm_type=rms when compiling and --model.layer_norm_type=low_precision when not compiling.
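For anyone following along, here's a minimal RMSNorm sketch — roughly what the rms option refers to; names and details are illustrative, not OLMo's exact implementation:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal root-mean-square of the features; unlike
        # LayerNorm there is no mean subtraction and no bias term.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```

Dropping the mean subtraction makes it slightly cheaper than LayerNorm, which is presumably part of why it does well under torch.compile.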
from olmo.
@epwalsh at this point we should also look at end-task performance, not just absolute speedup results. Some of these decisions in BLOOM were made based on downstream eval and not just throughput: https://arxiv.org/pdf/2210.15424.pdf
from olmo.
Like, I think SwiGLU and ALiBi are non-negotiable based on end-task performance.
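For anyone unfamiliar, a minimal SwiGLU sketch, following the GLU-variants recipe — illustrative only, not OLMo's exact feed-forward block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)    # gate projection
        self.v = nn.Linear(dim, hidden, bias=False)    # value projection
        self.out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(xW) * xV) W_out — the SiLU-gated variant of GLU.
        return self.out(F.silu(self.w(x)) * self.v(x))
```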
from olmo.
Absolutely
from olmo.
The only problem with ALiBi is that it incurs a significant performance hit, since it doesn't work with the current Flash Attention implementation.
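To illustrate the conflict: ALiBi adds a per-head linear bias to the raw attention scores before the softmax, and that's exactly the hook the fused Flash Attention kernel doesn't expose. A rough sketch of the bias, using the ALiBi paper's slope recipe (which assumes the head count is a power of two):

```python
import torch


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Per-head slopes form a geometric sequence starting at 2^(-8/n_heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    # Bias is proportional to the signed distance between key and query
    # positions, so tokens further to the left get a larger penalty.
    pos = torch.arange(seq_len)
    dist = pos[None, :] - pos[:, None]                 # (seq, seq)
    return slopes[:, None, None] * dist[None, :, :]    # (heads, seq, seq)
```

This tensor has to be added to the q·kᵀ scores inside the attention computation, which means falling back to an unfused implementation.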
from olmo.
Is low_precision stable enough for use? I thought doing LN in 32 bits was a major stability hack in BLOOM.
from olmo.
> Is low_precision stable enough for use? I thought doing LN in 32 bits was a major stability hack in BLOOM.

That remains to be seen.
from olmo.
What about ALiBi vs. RoPE?
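For contrast, RoPE rotates the query/key vectors before the attention product instead of biasing the scores, so it composes fine with Flash Attention. A minimal sketch of the interleaved variant (illustrative only):

```python
import torch


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (..., seq, head_dim) with an even head_dim; rotate channel pairs
    # by position-dependent angles.
    seq, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * inv_freq  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Since this is applied to q and k before the fused kernel is called, the kernel itself doesn't need to know anything about positions.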
from olmo.
Performance or throughput?
from olmo.