
Comments (7)

rkfg commented on August 29, 2024

I don't think the dictionary quality will improve significantly if you use more data. I even limited the number of lines to process because SentencePiece itself recommends that. The lines taken for processing are sampled randomly, so it won't just take the first million lines of your file.

To get the most efficient dictionary you should include the most common phrases in it, not just use a lot of data. Any words that don't make it into the dictionary will be encoded as individual characters. But then again, if a word is so rare it can't be encoded with 2-3 tokens, it's unlikely the model will ever use it anyway.
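For reference, a minimal SentencePiece training call along these lines might look as follows; the file name, vocabulary size, and sampling cap are placeholders rather than the repo's actual settings. `input_sentence_size` caps how many lines the trainer reads, and `shuffle_input_sentence` makes that a random sample instead of the first N lines.

```python
import sentencepiece as spm

# Minimal BPE training sketch; "corpus.txt" and the sizes are placeholders.
# input_sentence_size caps how many lines the trainer loads, and
# shuffle_input_sentence=True samples them randomly instead of taking
# the first million lines of the file.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe",
    model_type="bpe",
    vocab_size=50000,
    input_sentence_size=1000000,
    shuffle_input_sentence=True,
)
```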


ZheMann commented on August 29, 2024

Thank you for the clear answer. One quick question: do I have to determine the optimal vocab_size by trial and error? I've searched through the entire sentencepiece repository but could not find any straightforward guidelines regarding this.


rkfg commented on August 29, 2024

Probably, yes. I can't find it now, but I remember some paper mentioned around 30-50k tokens for this encoding type. It depends on the language, and this whole field is pretty much intuition-driven (from what I've read and seen). You can't calculate the optimal network architecture or the number of tokens; the networks are so big that the only method is trial, error, rinse and repeat.

Here are a couple of easy-to-read papers to get you started: https://arxiv.org/pdf/1808.06226.pdf and https://arxiv.org/pdf/1508.07909.pdf

They contain no math (I just don't get complex math tbh) and everything else is mostly common sense and logic.

I personally tried vocabularies with 10k and 50k tokens; surprisingly, the 10k model converged faster and the resulting loss was much lower (around 3.5 compared to 4+ for the 50k model). But the output was still not impressive, and maybe the 50k model has more potential to improve over time. It all requires a lot of experimentation.
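One rough way to run that kind of comparison yourself (an illustrative sketch, not the setup described above; the file names are placeholders, and tokens-per-line is only a crude proxy for encoding efficiency) is to train tokenizers at both sizes and count how many tokens each needs for the same held-out text:

```python
import sentencepiece as spm

# Illustrative sweep; "corpus.txt" and "heldout.txt" are placeholder files.
for vocab_size in (10000, 50000):
    prefix = f"bpe_{vocab_size}"
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=prefix,
        model_type="bpe",
        vocab_size=vocab_size,
        input_sentence_size=1000000,
        shuffle_input_sentence=True,
    )
    sp = spm.SentencePieceProcessor(model_file=f"{prefix}.model")
    with open("heldout.txt", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    total = sum(len(sp.encode(line)) for line in lines)
    # Fewer tokens for the same text means the vocabulary packs it more densely.
    print(f"vocab={vocab_size}: {total} tokens for {len(lines)} lines")
```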

Also, one thing to remember: your data size (in tokens) must be a lot bigger than your model size, or else it will just memorize the corpus and produce garbage on arbitrary input. I used a huge dump of Russian books; it contains zipped FB2 books and its overall size is more than 400 GB. Of course, there are many duplicates and not all books are in Russian, so I did some filtering first and in the end produced a corpus of around 10 GB or so. To fully sample it (the train script selects random lines, not sequential ones), my system would need about 6 days.
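As a rough sanity check of that data-vs-model-size point (a sketch under assumptions: the file and model names are placeholders, the tokens-per-parameter ratio is just an informal yardstick, and the parameter count is roughly the smallest published GPT-2 configuration), one can count the corpus in tokens and compare it to the parameter count:

```python
import sentencepiece as spm

# Count corpus size in tokens with the trained tokenizer; "bpe.model"
# and "corpus.txt" are placeholders.
sp = spm.SentencePieceProcessor(model_file="bpe.model")

total_tokens = 0
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        total_tokens += len(sp.encode(line))

n_params = 117_000_000  # assumed: roughly the smallest GPT-2 model
print(f"corpus tokens: {total_tokens:,}")
print(f"tokens per parameter: {total_tokens / n_params:.2f}")
# A low ratio suggests the model could simply memorize the corpus.
```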


ZheMann commented on August 29, 2024

Dude, you're awesome! Thanks for the valuable information; I will definitely study the papers you mentioned and try different values for vocab_size to see what happens.

However, after you mentioned this:

Also, one thing to remember: your data size (in tokens) must be a lot bigger than your model size, or else it will just memorize the corpus and produce garbage on arbitrary input. I used a huge dump of Russian books; it contains zipped FB2 books and its overall size is more than 400 GB. Of course, there are many duplicates and not all books are in Russian, so I did some filtering first and in the end produced a corpus of around 10 GB or so. To fully sample it (the train script selects random lines, not sequential ones), my system would need about 6 days.

I just realised the biggest challenge will be finding a sufficient amount of text written in Dutch, as the total size of all Dutch books on Gutenberg.org is less than 100 MB.

Anyway, things are becoming clearer to me now.

Many thanks again.


rkfg commented on August 29, 2024

Yeah, that corpus is way too small. You can try translating books with Google for starters, or find other sources (you're not expecting me to buy 400 GB of compressed books, and I don't think you can find that many in the public domain, so...). The whole point of neural networks is to lossily "compress" the data into their internal structure so they can find patterns in it. You require the model to correctly predict the next token based on the previous tokens, and it should be able to do that on far more lines of text than it could possibly store. If your data can be stored "as is" because the model size allows it, the model isn't forced to optimize itself and find the patterns, so it doesn't learn at all; it just memorizes.


ZheMann commented on August 29, 2024

As this issue is still 'Open', I guess this is a good place to ask the following question:

If you replace the newline character with a custom token like <|n|>, you end up with one very large sentence, right? How did this work for you when generating the dictionary files? Right now it says I only have one sentence, which is (obviously) too large to process. However, I defined <|n|> in user_defined_symbols, so I expected SentencePiece to split the large sentence back into the original sentences on <|n|> for further processing.


rkfg commented on August 29, 2024

As far as I remember, the script doesn't replace the newlines but inserts that token before them, so the sentences stay short enough. Take a look at concat.sh.
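Without concat.sh at hand, the idea could be sketched roughly like this in Python (file names and settings are placeholders, and the repo's actual script may differ): keep the real newlines so each line stays a separate sentence, append <|n|> as plain text before the line break, and declare it as a user-defined symbol so it becomes a single token.

```python
import sentencepiece as spm

# Sketch of the idea only; the repo's concat.sh may differ in detail.
# Keep real newlines so SentencePiece still sees one sentence per line,
# and append the <|n|> marker as text so it survives tokenization.
with open("corpus.txt", encoding="utf-8") as src, \
        open("corpus_marked.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.rstrip("\n") + " <|n|>\n")

# Declaring <|n|> as a user-defined symbol maps it to a single token.
spm.SentencePieceTrainer.train(
    input="corpus_marked.txt",
    model_prefix="bpe",
    model_type="bpe",
    vocab_size=50000,
    user_defined_symbols=["<|n|>"],
)
```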

