allenai / olmo
Modeling, training, eval, and inference code for OLMo
Home Page: https://allenai.org/olmo
License: Apache License 2.0
While this wasn't implemented as of PyTorch 1.13.1, it appears it's going to be in the next release because it's implemented in the master branch: https://github.com/pytorch/pytorch/blob/master/torch/distributed/fsdp/api.py#L31.
RedPajama includes data sourced from ArXiv
An alternative is unArXive
S2's LaTeX dumps from ArXiv are in s3://ai2-s2-scholarphi-pipeline-prod/daq/arxiv-source-data/bymonth/
Activation checkpointing needs to keep track of the state of the random number generator, which fails with torch.compile(). Rumor has it that the latest torch nightly has this fixed, so we should try this.
Stas' code is here: https://github.com/stas00/toolbox/blob/master/pytorch/all_reduce_bench.py
We should run this on a single node, and multiple nodes.
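A minimal sketch of such a benchmark, written independently of Stas' script (the tensor size, iteration count, and NCCL backend choice are arbitrary assumptions; launch via torchrun):

import os
import time
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 1 GiB of fp32, large enough to saturate the interconnect.
    x = torch.randn(256 * 1024 * 1024, device="cuda")

    # Warm-up so we don't time lazy initialization.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    if dist.get_rank() == 0:
        gb = x.numel() * x.element_size() / 1e9
        print(f"all_reduce of {gb:.2f} GB took {elapsed * 1000:.1f} ms/iter")

if __name__ == "__main__":
    main()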
I tried implementing grouped query attention in this pull request, but it seems that PyTorch's scaled_dot_product_attention doesn't support the kind of broadcasting we'd need for this. Revisit if/when this gets fixed on PyTorch's end.
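For reference, a minimal sketch of the workaround we would need today: materialize the shared key/value heads so that scaled_dot_product_attention sees matching head counts, since it won't broadcast them for us (the head counts and shapes below are illustrative, not from our config):

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads: int, n_kv_heads: int):
    # q: (B, n_heads, T, head_dim); k, v: (B, n_kv_heads, T, head_dim).
    # scaled_dot_product_attention expects q/k/v with the same number of
    # heads, so we repeat the shared k/v heads instead of broadcasting.
    repeats = n_heads // n_kv_heads
    k = k.repeat_interleave(repeats, dim=1)
    v = v.repeat_interleave(repeats, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 8 query heads sharing 2 key/value heads.
B, T, d = 2, 16, 64
q = torch.randn(B, 8, T, d)
k = torch.randn(B, 2, T, d)
v = torch.randn(B, 2, T, d)
out = grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=2)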
We can't run in a debugger anymore.
It's supposed to be faster, and at this point, it's the only way we can get the fast interconnect working in LUMI.
Mosaic tells us we need to apply this to composer to make it work: dskhudia/composer@4d55bdb
We should expect a speed-up too.
Make additional layer and adapter for running code with existing down-stream evaluation tools (e.g., HELM, Catwalk).
The PaLM paper says it improves throughput and doesn't slow learning, at least not at large scales.
https://github.com/allenai/LLM/blob/2118db56095157474fe1c69c1702db08af2d4f74/scripts/train.py#L187
I think having a checkpoint before any training happens would be quite useful.
Of course, that means we need a reliable logging solution first ...
We need a generate method to run evaluation. Minimum requirement: top_k / top_p support. Llama's generate method is fairly straightforward, but we can't use it because it has the wrong license.

A generate method: this will eventually need to be done, but, since we haven't fully finalized the architecture, this is too soon.
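A minimal sketch of top_k / top_p filtering, written from scratch rather than derived from Llama's code, which a generate loop could apply to the next-token logits:

import torch

def filter_logits(logits: torch.Tensor, top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    # logits: (batch_size, vocab_size) for the next token.
    if top_k > 0:
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum_probs = probs.cumsum(dim=-1)
        # Drop tokens whose preceding cumulative probability already exceeds
        # top_p; the most likely token is always kept.
        remove_sorted = (cum_probs - probs) > top_p
        remove = torch.zeros_like(remove_sorted).scatter(-1, sorted_idx, remove_sorted)
        logits = logits.masked_fill(remove, float("-inf"))
    return logits

# One sampling step:
# probs = torch.softmax(filter_logits(logits, top_k=50, top_p=0.9), dim=-1)
# next_token = torch.multinomial(probs, num_samples=1)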
There's a PyTorch implementation here: https://github.com/bzhangGo/rmsnorm/blob/master/rmsnorm_torch.py
The problem we have with LayerNorm (or at least PyTorch's built-in implementation of LayerNorm) is that torch's autocast behavior is hardcoded to upcast inputs to the function torch.layer_norm, and as a result using low precision LayerNorm requires some hacking that fails to work when using torch.compile(). Low precision LN gives a huge speedup with non-compiled models, so we're optimistic that getting this to work with compiled models would provide a similar speedup.
Switching to a different norm implementation could sidestep this issue, and RMSNorm is a simpler operation so we might get an even bigger speedup.
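A minimal RMSNorm sketch following the formulation in the linked repo (scale parameter only, no bias); this is an illustration, not tested against torch.compile:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square of the activations; unlike
        # LayerNorm there is no mean subtraction and no bias term.
        norm_x = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * norm_x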
SwiGLU is by far the fastest of the activations we tried (vs. ReLU and GELU), although this isn't quite a fair comparison since SwiGLU reduces the number of parameters in the model, all else being equal.
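For reference, a sketch of SwiGLU implemented as a drop-in activation that splits the up-projection in two and uses one half to gate the other, which is why it ends up with fewer parameters than ReLU/GELU for the same ff_proj width (the dimensions below are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the projection in half and gate one half with the other.
        x, gate = x.chunk(2, dim=-1)
        return F.silu(gate) * x

# Drop-in use inside a feed-forward block (sizes are illustrative):
d_model, d_ff = 512, 2048
ff = nn.Sequential(
    nn.Linear(d_model, d_ff),    # projection whose output gets split in two
    SwiGLU(),                    # output width is d_ff // 2
    nn.Linear(d_ff // 2, d_model),
)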
E.g., we'd have an LM loss at 1/2, 1/4, and 1/8 of the model's depth.
To avoid the degenerate scenario where, for example, the first half of the model does well but the second half just becomes a bunch of ~identity layers, we could have a separate LM head for each loss, i.e. at each layer in the model where we apply an LM loss. So we'd have to untie the embedding matrix from the LM head.
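A rough sketch of what untied intermediate LM heads could look like (the module and argument names here are hypothetical, not from our codebase):

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateLMHeads(nn.Module):
    """Separate, untied LM heads applied at selected layers (e.g. 1/2, 1/4, 1/8 depth)."""

    def __init__(self, d_model: int, vocab_size: int, loss_layers: list[int]):
        super().__init__()
        self.loss_layers = loss_layers
        # One head per intermediate loss, in addition to the usual final head.
        self.heads = nn.ModuleDict(
            {str(i): nn.Linear(d_model, vocab_size, bias=False) for i in loss_layers}
        )

    def forward(self, hidden_states: dict[int, torch.Tensor], labels: torch.Tensor) -> torch.Tensor:
        # hidden_states maps layer index -> (B, T, d_model) activations.
        losses = []
        for i in self.loss_layers:
            logits = self.heads[str(i)](hidden_states[i])
            losses.append(F.cross_entropy(logits.flatten(0, 1), labels.flatten()))
        # Average the intermediate losses; how to weight them vs. the final loss is open.
        return torch.stack(losses).mean()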
This is an experiment, but probably worthwhile. With 1000 nodes, it gets very difficult to see which node did what and when, especially when they fail, if they all log to stdout, which Slurm stuffs into a single log file. Papertrail is basically syslog-as-a-service: we can send log messages there and then view and filter them with their UI. To do this, we need to configure our code to write logs to their syslog service.
Mosaic has been preferring this over AdamW and it uses less memory since it only keeps track of momentum. We can find an implementation in this PR. Eventually this will be added to composer, but in the meantime we can just copy it over.
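This sounds like the Lion optimizer; a rough sketch of its update rule as described in the Lion paper (a single-tensor illustration, not the PR's implementation):

import torch

@torch.no_grad()
def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    # The update direction is the sign of an interpolation between momentum and gradient.
    update = (beta1 * momentum + (1 - beta1) * grad).sign()
    # Decoupled weight decay folded into the same step.
    param.add_(update + weight_decay * param, alpha=-lr)
    # Momentum is the only optimizer state, hence the memory savings vs. AdamW.
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)
    return param, momentum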
Ideally, the LayerNorm layers in a sequential block would be separate for the attention and FFN modules. The good news is that we haven't seen much of a difference from this so far.
q, k, v = self.att_proj(self.norm(x)).split(self.fused_dims, dim=-1)
# Get attention scores.
att, cache = self.attention(q, k, v, attention_bias, layer_past=layer_past, use_cache=use_cache)
# Add attention scores.
# shape: (B, T, C)
x = x + self.dropout(att)
# Add feed-forward projection.
# shape: (batch_size, seq_len, d_model)
x = x + self.dropout(self.ff_out(self.act(self.ff_proj(self.norm(x)))))
Ideally, self.norm in the last line would be a separate self.norm2 instance.
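For illustration, the corrected feed-forward line with a separate (hypothetical) norm module would look like:

# self.norm2 would be a second LayerNorm instance dedicated to the FFN sub-block.
x = x + self.dropout(self.ff_out(self.act(self.ff_proj(self.norm2(x)))))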
Maybe wandb will take care of this for us? I opened a ticket with them.
This should be pretty easy: just add a configuration field for eval batch size and what not, then initialize an eval dataloader in the same way that we initialize the train data loader and pass it to the Trainer.
Ideally there should be no performance penalty for dropout layers when the dropout probability is 0. But if there is, we should bypass dropout when p=0.0.
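A minimal sketch of that bypass, assuming a small wrapper module rather than a change to torch's own Dropout:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Dropout(nn.Dropout):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip the dropout kernel entirely when it would be a no-op anyway.
        if self.p == 0.0 or not self.training:
            return x
        return F.dropout(x, p=self.p, training=self.training, inplace=self.inplace)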
I'm currently testing this here: https://wandb.ai/ai2-llm/dropout-benchmarks
There will be 4 runs:
1. 1.2b-bf16-no-dropout: uses a patched branch that bypasses all calls to dropout (except inside of the scaled_dot_product_attention function, which we have no control over). This model is compiled using the default settings.
2. 1.2b-bf16-zero-dropout: uses the usual implementation, without any code changes, where we still call the Dropout modules even though the dropout probability is set to 0. This model is compiled using the default settings.
3. 1.2b-bf16-no-compile-no-dropout: same as 1.2b-bf16-no-dropout except this model is NOT compiled.
4. 1.2b-bf16-no-compile-zero-dropout: same as 1.2b-bf16-zero-dropout except this model is NOT compiled.

Yesterday we spoke about where responsibility for data order lives between the llm-model and llm-data workstreams. I thought it might be good to start an issue here where we can figure out who and how we should ensure that future work can reproduce our exact pretraining data order.
Use cases:
Proposed features:
Proposed method:
- Given j * k dataloaders, dl_i, across j nodes and k devices, each data loader should sample from the n documents of the pretraining corpus, doc_i, with a stride of j*k. E.g., with j=1 and k=2, dl_0[0] == doc_0 and dl_1[0] == doc_1, while dl_0[1] == doc_2 and dl_1[1] == doc_3 (see the toy sketch below).
- For each batch, record something like {batch_0 : [ [ (doc_0, (start_tok_idx, end_tok_idx)), ...], [(doc_42, (start_tok_idx, end_tok_idx)),...]]}
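A toy sketch of the striding scheme above, working with document indices only and ignoring tokenization and batching (j, k, and the corpus size are illustrative):

def doc_indices_for_loader(loader_idx: int, j: int, k: int, n_docs: int) -> list[int]:
    """Document indices that dataloader `loader_idx` (0 <= loader_idx < j*k) should see."""
    stride = j * k
    return list(range(loader_idx, n_docs, stride))

# With j=1 node and k=2 devices over 6 documents:
# dl_0 sees docs [0, 2, 4] and dl_1 sees docs [1, 3, 5], matching
# dl_0[0] == doc_0, dl_1[0] == doc_1, dl_0[1] == doc_2, dl_1[1] == doc_3.
print(doc_indices_for_loader(0, j=1, k=2, n_docs=6))  # [0, 2, 4]
print(doc_indices_for_loader(1, j=1, k=2, n_docs=6))  # [1, 3, 5]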
Let me know what I can do to help support this and coordinate responsibility between llm-model and llm-data.
A simple though reduced alternative is to just record the tokens that are trained on without any further changes.
While this supports the ability to see what a given checkpoint has seen and when it has seen it, this may make it difficult to recover batch and document boundaries or to reproduce this data order in a different tokenizer.
Also, if we don't ensure that data order is invariant to the number of nodes and devices, we may accidentally produce different training orders across different runs. Additionally, it may be difficult for people using our codebase to reproduce our data order if they don't have the same number of nodes and devices.
We promised a 7B model. Also, we need an appropriately sized model to test multi-node training at smaller scales.
Maybe we just hard-code it to 2**16.
Waiting on Dao-AILab/flash-attention#132 to be resolved
Each time we evaluate we should evaluate on the entirety of each subset. This will remove some noise and make it easier for us to debug where the model is going wrong with RP.
Iz and I were talking about this today; it can go under "nice to have".
This is from the PaLM paper:
"The standard Transformer formulation uses k attention heads, where the input vector for each timestep is linearly projected into โqueryโ, โkeyโ, and โvalueโ tensors of shape [k, h], where h is the attention head size. Here, the key/value projections are shared for each head, i.e. โkeyโ and โvalueโ are projected to [1, h], but โqueryโ is still projected to shape [k, h]. We have found that this has a neutral effect on model quality and training speed (Shazeer, 2019), but results in a significant cost savings at autoregressive decoding time. This is because standard multi-headed attention has low efficiency on accelerator hardware during auto-regressive decoding, because the key/value tensors are not shared between examples, and only a single token is decoded at a time."
There are some checkpoints that we want to keep forever, because they are part of our output. The checkpoint saving code needs to know about those.
As of the latest composer release, there's now composer.callbacks.SpeedMonitor. See mosaicml/examples#238 for example usage. It looks like the version we have (which we took directly from Mosaic's example) was copied into composer line-for-line, so the update should be easy.
Exact spec still WIP, but TODOs are basically:
- Records of the form {"text": "...", "paper_id": <identifier>}; for the text field, " ".join([title, abstract]) should be sufficient (sketched below).
- A second set of {"text": "...", "paper_id": <identifier>} records mapping paper_ids to some note or reason for removal. To start, this should be documents that are part of the test set for Catwalk evaluation, especially Pubmed/arXiv abstract generation.
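A small sketch of producing such records (the file name and field contents below are placeholders):

import json

def make_record(paper_id: str, title: str, abstract: str) -> dict:
    # Per the spec above, the text field is just the title and abstract joined.
    return {"text": " ".join([title, abstract]), "paper_id": paper_id}

with open("s2_abstracts.jsonl", "w") as f:
    record = make_record("example-id", "An Example Title", "An example abstract.")
    f.write(json.dumps(record) + "\n")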
Use mixed_precision=PURE, limit_all_gathers=true, and activation_checkpointing_reentrant=false in the fsdp_config. Added in #61.

Can just copy over the implementation from PaLM-pytorch.
Turns out we never tried it. If this fails, we need to talk to AMD ASAP. We cannot get to 70B without this.
We should either wait until #61 merges, or do this as part of that PR.
We need to grep through and change all mentions of "DOLMA" / "Dolma" / "dolma" and also rename the Python module itself.
Basically:
# Rename Python module.
mv dolma olmo
# Find and replace all mentions using `fd` and `sd` (`brew install fd sd`)
fd -H --exclude .git | xargs sd 'DOLMA' 'OLMo'
fd -H --exclude .git | xargs sd 'Dolma' 'Olmo'
fd -H --exclude .git | xargs sd 'dolma' 'olmo'
The PaLM paper has a short section of tweaks to the vanilla Transformer architecture. We should make sure we have all of those.
sfantao/pytorch-lumi:sles-rocm-5.5.1-python-3.10-pytorch-v2.0.1 (on Docker Hub).
We have a separate Google Cloud project for OLMo now, and we should use that. We should call the bucket something like gs://olmo-checkpoints instead of gs://allennlp-olmo.
One experiment is, let's just keep running the 7B and see if it recovers from the spikes.
It is suspicious that we had two slightly different models (one with biases, one without), that both spiked at exactly the same moment. This suggests there might be a data issue.