Comments (12)

yuyan2do commented on May 17, 2024

You can edit this file, replace the [unused] tokens with your own tokens, and then reprocess the data.
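The replacement can be scripted. A minimal sketch (the function name and file handling are illustrative, not part of ProphetNet): it swaps each `[unusedN]` placeholder line in a BERT-style vocab.txt for a new token, preserving line order so that no existing token ID changes.

```python
def replace_unused(vocab_lines, new_tokens):
    """Replace [unusedN] placeholder lines with new tokens, keeping line order.

    In a BERT-style vocab.txt the token ID is the line number, so editing
    lines in place (instead of inserting/deleting) keeps all other IDs stable.
    """
    new_tokens = iter(new_tokens)
    out = []
    for line in vocab_lines:
        tok = line.strip()
        if tok.startswith("[unused"):
            # If we run out of new tokens, keep the remaining placeholders.
            tok = next(new_tokens, tok)
        out.append(tok)
    return out
```

For example, `replace_unused(["[PAD]", "[unused0]", "[unused1]", "the"], ["taffy"])` returns `["[PAD]", "taffy", "[unused1]", "the"]`.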

from prophetnet.

ShoubhikBanerjee commented on May 17, 2024

Thanks a lot for your prompt reply.

I thought of doing that, but in case my set of new words does not fit into **[unused993] 998**, then what? @yuyan2do

yuyan2do commented on May 17, 2024

The number of new words is larger than 998? Then it needs some code changes to support that.

ShoubhikBanerjee commented on May 17, 2024

Yes, can you please help me out with this? I would be very grateful.

yuyan2do commented on May 17, 2024

Sure, I can do it early next week.

ShoubhikBanerjee commented on May 17, 2024

Okay, thanks a lot. I will be waiting.

yuyan2do commented on May 17, 2024

@ShoubhikBanerjee I have committed a change to support appending to the vocab. You can give it a try:

  1. Add the new tokens at the end of the vocab file
  2. Reprocess the data using the new vocab
  3. Start training

In the console output, check that the number of word embeddings increases as expected. In the example below, I added 3 new tokens:

Before: (embed_tokens): Embedding(30522, 1024, padding_idx=0)
After:  (embed_tokens): Embedding(30525, 1024, padding_idx=0)
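The size change corresponds to new embedding rows being appended for the added tokens. Conceptually (a framework-agnostic sketch with an illustrative function name, not ProphetNet's actual resizing code), extending an embedding table looks like this: the existing rows are kept verbatim and new randomly initialized rows are appended.

```python
import random

def extend_embeddings(table, num_new, dim, seed=0):
    """Append randomly initialized rows for newly added tokens.

    Existing rows (and therefore existing token IDs) are untouched;
    only the appended tokens start from fresh, small random vectors.
    """
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(num_new)]
    return table + new_rows
```

Going from 30522 to 30525 rows in the log above is exactly this: three appended rows, one per new token.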

ShoubhikBanerjee commented on May 17, 2024

Thanks a lot, I will try it and let you know.

yuyan2do commented on May 17, 2024

Shoubhik, have you had time to try it?

ShoubhikBanerjee commented on May 17, 2024

Sorry for being so late; I was engaged with some other work.

My point is: when tokenizing with the BERT tokenizer, a word like "biocompatible" is tokenized into "bio ##com ##pati ##ble", so the actual word is already lost. Will adding "biocompatible" to vocab.txt work? I think not, because the word is no longer present there as a whole word.

Is there any workaround for this?
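For what it's worth, the behavior described here follows from WordPiece's greedy longest-match-first algorithm: the tokenizer always tries the longest vocabulary entry first, so once the whole word is in the vocab, it is emitted as a single token and the subword split never happens. A toy re-implementation (not the actual BERT tokenizer, and ignoring details like max word length) illustrates this:

```python
def wordpiece(word, vocab):
    """Toy greedy longest-match-first WordPiece tokenization of one word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched at all
        tokens.append(cur)
        start = end
    return tokens

base = {"bio", "##com", "##pati", "##ble"}
wordpiece("biocompatible", base)                       # -> ['bio', '##com', '##pati', '##ble']
wordpiece("biocompatible", base | {"biocompatible"})   # -> ['biocompatible']
```

So adding the whole word to vocab.txt does make it survive tokenization intact, provided the data is re-tokenized and re-binarized with the updated vocab.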

farooqzaman1 commented on May 17, 2024

Hi @ShoubhikBanerjee,
I am also working on this. The workaround is to add the flag --tokenizer nltk to your fairseq-preprocess command; this should solve your problem. I am now adapting the vocabulary for scientific articles, and there are many terms that need to be added; let me know if you have found any other solution.
For convenience, I am pasting the command here:
fairseq-preprocess --user-dir ./prophetnet --task translation_prophetnet --tokenizer nltk --source-lang src --target-lang tgt --trainpref cnndm/prophetnet_tokenized/train --validpref cnndm/prophetnet_tokenized/valid --testpref cnndm/prophetnet_tokenized/test --destdir cnndm/processed --srcdict ./vocab.txt --tgtdict ./vocab.txt --workers 20

ShoubhikBanerjee commented on May 17, 2024

Hi @yuyan2do, I tried this and fine-tuned on the Amazon Food Review dataset, and I found something strange: the previous version generated output as BPE-tokenized subwords, but your latest code fails to generate any output in some cases, giving [UNK] tokens instead. Moreover, the output summary skips the extra words that I added to the custom vocab.txt.

Text => This taffy is so good. It is very soft and chewy. The flavors are amazing. I would definitely recommend you buying it. Very satisfying!!
Original Summary => Wonderful, tasty taffy
Predicted Summary (Previous Version) => yu ##m yu ##m
Predicted Summary (Current Version) => [UNK] [UNK]

Current vocab.txt file (token and its ID):
...
##: 30519
##? 30520
##~ 30521
vitality 30522
jumbo 30523
salted 30524
taffy 30525
saltwater 30526
tasty 30527
twizzlers 30528
yummy 30529
oatmeals 30530
gastronomy 30531
holistic 30532
oatmeal 30533

It seems quite strange to me; is anything wrong going on?

The strangest part is that it skips the custom (extra) words that were added to vocab.txt.
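One quick sanity check for this kind of failure is to confirm that the on-disk IDs match what the model expects: in a BERT-style vocab.txt, a token's ID is simply its zero-based line number. A minimal sketch (the helper name and token list are illustrative):

```python
def token_ids(vocab_lines):
    """Map each token to its ID; in vocab.txt the ID is the line number."""
    return {tok.strip(): i for i, tok in enumerate(vocab_lines)}

ids = token_ids(["[PAD]", "taffy", "yummy"])
ids["taffy"]  # -> 1
```

If a lookup like this succeeds for the added tokens (e.g. "taffy" at 30525, matching the listing above) yet generation still emits [UNK], the mismatch is more likely in the binarized data or the checkpoint's embedding size than in the vocab file itself, which would point to the reprocessing step.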
