Comments (13)

timsoraro commented on August 28, 2024

Haha ok, I'll run a test and report back! Thanks :)

timsoraro commented on August 28, 2024

So I changed the settings to:

dim = 1024
depth = 24
heads = 16 (dim needs to be divisible by the number of heads)
axial_position_emb = True
axial_position_shape = (64, 32)
axial_position_dims = (512, 512)
weight_tie = False
n_hashes = 8

And after seven epochs (on 8 V100s), the samples still don't make any sense (I think they should by now).
https://pastebin.com/PzuHLrMx

Something is not right I think...
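
For reference, those settings map onto a ReformerLM construction roughly like the sketch below; num_tokens, max_seq_len, and causal are not given above and are assumed here (max_seq_len is taken as 64 * 32 = 2048 to match the axial shape):

from reformer_pytorch import ReformerLM

# Sketch only: the settings listed above, passed as ReformerLM kwargs.
# num_tokens, max_seq_len and causal are assumptions, not from this thread.
model = ReformerLM(
    num_tokens = 2000,                   # assumed vocabulary size (the ~2,000-token vocab mentioned later)
    dim = 1024,
    depth = 24,
    heads = 16,                          # dim (1024) is divisible by 16
    max_seq_len = 2048,                  # 64 * 32, matching the axial shape
    causal = True,                       # autoregressive language modeling
    n_hashes = 8,
    weight_tie = False,
    axial_position_emb = True,
    axial_position_shape = (64, 32),     # 64 * 32 == max_seq_len
    axial_position_dims = (512, 512),    # 512 + 512 == dim
)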

timsoraro commented on August 28, 2024

I'll move this question to the DeepSpeed repo.

BTW, I used the GPT2 tokenizer (huggingface) and the results are much better (though it needs more testing). I think it's due to the much larger vocabulary (50,257 vs. 2,000), but I'm not quite sure.
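
For reference, a minimal sketch of that tokenizer step using the standard transformers API (the surrounding training code isn't shown in this thread):

from transformers import GPT2Tokenizer

# GPT-2 byte-pair-encoding tokenizer with a 50,257-token vocabulary
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Hello, how are you?")
print(ids)                    # token ids to feed the model
print(tokenizer.decode(ids))  # decodes back to the original text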

Thanks again!

timsoraro commented on August 28, 2024

Sure! I will update here (:

lucidrains commented on August 28, 2024

@timsoraro thanks for trying and validating that things are working! Please try the following settings: dimensions 1024, depth 24, heads 12, and then see how big a context you can stretch it to without running out of memory. That would be roughly equivalent to GPT-2 small, for a fair comparison. Also, try turning on Axial Positional Encoding; instructions are in the readme. I found it gave the best results for 1024+ context. Thanks for sharing your DeepSpeed benchmarks!

lucidrains commented on August 28, 2024

Lastly, set weight_tie to False for your next run. It is from the ALBERT paper and is mainly for self-supervised models. n_hashes can be set to 8 if memory allows. I know, a ton of knobs to tweak lol

lucidrains commented on August 28, 2024

@timsoraro wow, that's a lot of firepower! I think it is best to compare the final perplexities between different models given an equal parameter count and amount of training data. But I'll take your word that you trained the other two models (GPT-2 and LSTM) from scratch.

My hunch is that Reformer will never be as good as full attention, and we will have to test the limits of making up for the approximation with parameter count and data. Given your hardware, I suggest adding even more parameters to your model while capping the number of epochs you train for at 5. If you can, first train on OpenWebText https://github.com/jcpeterson/openwebtext for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5.

If none of those work, then that is valuable information that either the implementation is wrong, or Reformer simply does not work as well as advertised.

For LSH related hyperparameters, you could always try increasing bucket_size to 128 and beyond, although the authors noted in their paper that they got diminishing returns after hitting 64, which is the setting I defaulted to.

Thanks for sharing these results; this is great for everyone.

lucidrains commented on August 28, 2024

@timsoraro my advisor @AranKomat has recommended forcing one of the LSH hashes to be locally attentive. I've added it as a setting in 0.17.1, which you can turn on with n_local_attn_hashes = 1, if you choose to run another experiment!

Edit: I've made it into a flag instead, so just add_local_attn_hash = True will work
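
A minimal sketch of turning the flag on; the other kwargs are just placeholders carried over from the earlier settings, not a prescribed configuration:

from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 2000,            # placeholder values, as in the earlier sketch
    dim = 1024,
    depth = 24,
    heads = 16,
    max_seq_len = 2048,
    causal = True,
    n_hashes = 8,
    add_local_attn_hash = True,   # forces one of the LSH hash rounds to attend locally
)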

timsoraro commented on August 28, 2024

Hey, I think I did the previous test incorrectly, so I'm still experimenting.

Excuse my ignorance, but let's say I have 240,000 data examples. When I do print(len(trainloader)), it shows 30,000 (240,000 split across 8 GPUs), as expected. But since my batch_size is 32, I would expect around 937 steps per epoch, yet

for i, data in enumerate(trainloader):

still walks through 30,000 iterations. So what's happening here? How can I tell when the model has gone over all the data examples (one epoch)?
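
For context, a minimal sketch of how the loader length usually falls out of a DistributedSampler setup; the actual trainer code isn't shown here, so the names and numbers below are assumptions rather than a diagnosis:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# 240,000 dummy examples, sharded across 8 GPUs -> 30,000 per process
dataset = TensorDataset(torch.zeros(240_000, 1))
sampler = DistributedSampler(dataset, num_replicas=8, rank=0)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# len(loader) counts batches: ceil(30_000 / 32) == 938 steps per epoch.
# If it still reports 30,000, the batch_size is most likely not reaching
# the DataLoader (e.g. it defaults to 1), so each "step" is one example.
print(len(loader))  # -> 938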

lucidrains commented on August 28, 2024

@timsoraro Please share some samples!

timsoraro commented on August 28, 2024

If you can, first train on OpenWebText https://github.com/jcpeterson/openwebtext for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5.

Does this paper suggest training models with fewer parameters but with more data for 1 epoch? It seems to take quite a long time for a model this size (dim=1024, depth=12, heads=8) to train, so I'm considering training a model with fewer parameters on more data for fewer epochs.

lucidrains commented on August 28, 2024

@timsoraro the new recommendation is to make your model as big as possible and stop your training early: https://twitter.com/Eric_Wallace_/status/1235616760595791872?s=20. More data always helps!

timsoraro commented on August 28, 2024

Oh okay, thanks for the info! I'm considering how to spend my budget; I really wish there were a pre-trained Reformer model à la GPT-2. I estimate it would cost $5,330 to pre-train on OpenWebText on 8 V100s on an AWS spot instance (with the same variables as the model posted here), which I cannot afford.
