Comments (13)

timsoraro commented on August 28, 2024

Haha ok, I'll run a test and report back! Thanks :)

timsoraro commented on August 28, 2024

So I changed the settings to:

dim = 1024
depth = 24
heads = 16 (dim needs to be divisible by the number of heads)
axial_position_emb = True
axial_position_shape = (64, 32)
axial_position_dims = (512, 512)
weight_tie = False
n_hashes = 8

And after seven epochs (on 8 V100s), the samples still don't make any sense (I think they should by now).
https://pastebin.com/PzuHLrMx

Something is not right I think...
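
For reference, those settings map onto a ReformerLM construction roughly like the sketch below; num_tokens, max_seq_len, and causal are not given above and are assumed here (max_seq_len is taken as 64 * 32 = 2048 to match the axial shape):

from reformer_pytorch import ReformerLM

# Sketch only: the settings listed above, passed as ReformerLM kwargs.
# num_tokens, max_seq_len and causal are assumptions, not from this thread.
model = ReformerLM(
    num_tokens = 2000,                   # assumed vocabulary size (the ~2,000-token vocab mentioned later)
    dim = 1024,
    depth = 24,
    heads = 16,                          # dim (1024) is divisible by 16
    max_seq_len = 2048,                  # 64 * 32, matching the axial shape
    causal = True,                       # autoregressive language modeling
    n_hashes = 8,
    weight_tie = False,
    axial_position_emb = True,
    axial_position_shape = (64, 32),     # 64 * 32 == max_seq_len
    axial_position_dims = (512, 512),    # 512 + 512 == dim
)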

timsoraro commented on August 28, 2024

I'll move this question to the DeepSpeed repo.

BTW, I used the GPT2 tokenizer (huggingface) and the results are much better (though it needs more testing). I think it's due to the much larger vocabulary (50,257 vs. 2,000), but I'm not quite sure.
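
For reference, a minimal sketch of that tokenizer step using the standard transformers API (the surrounding training code isn't shown in this thread):

from transformers import GPT2Tokenizer

# GPT-2 byte-pair-encoding tokenizer with a 50,257-token vocabulary
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Hello, how are you?")
print(ids)                    # token ids to feed the model
print(tokenizer.decode(ids))  # decodes back to the original text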

Thanks again!

timsoraro commented on August 28, 2024

Sure! I will update here (:

lucidrains commented on August 28, 2024

@timsoraro thanks for trying and validating that things are working! Please try the following settings: dimensions 1024, depth 24, heads 12, and then see how big a context you can stretch it to without running out of memory. That would be roughly equivalent to GPT-2 small, for a fair comparison. Also, try turning on Axial Positional Encoding; instructions are in the readme. I found it gave the best results for 1024+ context. Thanks for sharing your DeepSpeed benchmarks!

lucidrains commented on August 28, 2024

Lastly, set weight_tie to False for your next run. It is from the ALBERT paper and is mainly for self-supervised models. n_hashes can be set to 8 if memory allows. I know, a ton of knobs to tweak lol

lucidrains commented on August 28, 2024

@timsoraro wow, that's a lot of firepower! I think it is best to compare the final perplexities between different models given an equal parameter count and amount of training data. But I'll take your word that you trained the other two models (GPT-2 and LSTM) from scratch.

My hunch is that Reformer will never be as good as full attention, and we will have to test the limits of making up for the approximation with parameter count and data. Given your hardware, I suggest adding even more parameters to your model while capping the number of epochs you train for at 5. If you can, first train on OpenWebText https://github.com/jcpeterson/openwebtext for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5.

If none of those work, then that is valuable information that either the implementation is wrong, or Reformer simply does not work as well as advertised.

For LSH related hyperparameters, you could always try increasing bucket_size to 128 and beyond, although the authors noted in their paper that they got diminishing returns after hitting 64, which is the setting I defaulted to.

Thanks for sharing these results; this is great for everyone.

lucidrains commented on August 28, 2024

@timsoraro my advisor @AranKomat has recommended forcing one of the LSH hashes to be locally attentive. I've added it as a setting in 0.17.1, which you can turn on with n_local_attn_hashes = 1, if you choose to run another experiment!

Edit: I've made it into a flag instead, so just add_local_attn_hash = True will work
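
A minimal sketch of turning the flag on; the other kwargs are just placeholders carried over from the earlier settings, not a prescribed configuration:

from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 2000,            # placeholder values, as in the earlier sketch
    dim = 1024,
    depth = 24,
    heads = 16,
    max_seq_len = 2048,
    causal = True,
    n_hashes = 8,
    add_local_attn_hash = True,   # forces one of the LSH hash rounds to attend locally
)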

timsoraro commented on August 28, 2024

Hey, I think I did the previous test incorrectly, so I'm still experimenting.

Excuse my ignorance, but let's say I have 240,000 data examples. When I do print(len(trainloader)), it shows 30,000 (240,000 split across 8 GPUs), as expected. But since my batch_size is 32, I would expect around 937 steps per epoch, yet

for i, data in enumerate(trainloader):

still walks through 30,000 iterations. So what's happening here? How can I tell when the model has gone over all the data examples (one epoch)?
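
For context, a minimal sketch of how the loader length usually falls out of a DistributedSampler setup; the actual trainer code isn't shown here, so the names and numbers below are assumptions rather than a diagnosis:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# 240,000 dummy examples, sharded across 8 GPUs -> 30,000 per process
dataset = TensorDataset(torch.zeros(240_000, 1))
sampler = DistributedSampler(dataset, num_replicas=8, rank=0)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# len(loader) counts batches: ceil(30_000 / 32) == 938 steps per epoch.
# If it still reports 30,000, the batch_size is most likely not reaching
# the DataLoader (e.g. it defaults to 1), so each "step" is one example.
print(len(loader))  # -> 938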

lucidrains commented on August 28, 2024

@timsoraro Please share some samples!

timsoraro commented on August 28, 2024

If you can, first train on OpenWebText https://github.com/jcpeterson/openwebtext for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5.

Does this paper suggest training models with fewer parameters but with more data for 1 epoch? It seems to take quite a long time for a model this size (dim=1024, depth=12, heads=8) to train, so I'm considering training a model with fewer parameters on more data for fewer epochs.

lucidrains commented on August 28, 2024

@timsoraro the new recommendation is to make your model as big as possible and stop your training early: https://twitter.com/Eric_Wallace_/status/1235616760595791872?s=20. More data always helps!

timsoraro commented on August 28, 2024

Oh okay, thanks for the info! I'm considering how to spend my budget; I really wish there were a pre-trained Reformer model à la GPT-2. I estimate it would cost $5,330 to pre-train on OpenWebText on 8 V100s on an AWS spot instance (with the same variables as the model posted here), which I cannot afford.
