Comments (12)
You can edit this file and replace the [unused] tokens with your own tokens, then reprocess the data.
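For illustration, a minimal sketch of that replacement (assuming a BERT-style vocab.txt with one token per line; the file names and token list here are made-up examples):

    # Hypothetical sketch: overwrite [unusedN] placeholder entries in a
    # BERT-style vocab.txt with new domain tokens.
    new_tokens = ["biocompatible", "taffy", "oatmeal"]  # example tokens

    with open("vocab.txt", encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f]

    it = iter(new_tokens)
    for i, tok in enumerate(vocab):
        if tok.startswith("[unused"):
            try:
                vocab[i] = next(it)
            except StopIteration:
                break  # ran out of new tokens; remaining placeholders stay

    with open("vocab_new.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")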
Thanks a lot for your prompt reply.
I thought of doing that, but in case my set of new words does not fit into the [unused] slots (up to [unused993], ~998 tokens), then what? @yuyan2do
Is the number of new words larger than 998? Then it needs some code change to support that.
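For reference, growing a model's embedding matrix beyond the reserved slots usually means copying the existing rows and initializing new ones. A hypothetical sketch of what such a change involves (not the actual commit):

    import torch

    def grow_embedding(old_weight: torch.Tensor, new_vocab_size: int) -> torch.Tensor:
        # Copy trained rows and randomly initialize rows for the new tokens.
        old_size, dim = old_weight.shape
        new_weight = old_weight.new_empty(new_vocab_size, dim)
        new_weight.normal_(mean=0.0, std=0.02)  # BERT-style init for new rows
        new_weight[:old_size] = old_weight
        return new_weight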
Yes, could you please help me out with this? I would be very grateful.
Sure, I can do it early next week.
Okay, thanks a lot, I will be waiting.
@ShoubhikBanerjee I have committed a change to support appending to the vocab. You can give it a try:
- Add new tokens at the end of this vocab
- Reprocess the data using the new vocab
- Start training
In the console output, check that the word embedding size increases as expected. In the example below, I added 3 new tokens.
(embed_tokens): Embedding(30522, 1024, padding_idx=0)
(embed_tokens): Embedding(30525, 1024, padding_idx=0)
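A sketch of the append-and-check step (file name and tokens are examples, assuming one token per line):

    # Hypothetical sketch: append new tokens to the end of vocab.txt and
    # verify the resulting vocabulary size before reprocessing.
    new_tokens = ["vitality", "jumbo", "salted"]  # 3 example tokens

    with open("vocab.txt", "a", encoding="utf-8") as f:
        for tok in new_tokens:
            f.write(tok + "\n")

    with open("vocab.txt", encoding="utf-8") as f:
        print("vocab size:", sum(1 for _ in f))  # e.g. 30522 + 3 = 30525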
Thanks a lot, I will try it and let you know.
Shoubhik, have you had time to try it?
Sorry for being so late; I was engaged with some other work.
My point is: while tokenizing with the BERT tokenizer, a word like "biocompatible" is tokenized to "bio ##com ##pati ##ble", so the original word is already lost. Will adding "biocompatible" to vocab.txt work then? I think not, because the word no longer appears in the data as a whole token.
So is there any workaround for it?
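To make the problem concrete, a small sketch with the Hugging Face BertTokenizer (which may not be the exact tokenizer used in this pipeline): out-of-vocabulary words are split into word pieces, and only tokens the tokenizer explicitly knows survive intact.

    from transformers import BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("biocompatible"))
    # split into pieces, e.g. ['bio', '##com', '##pati', '##ble'] as described above

    tok.add_tokens(["biocompatible"])
    print(tok.tokenize("biocompatible"))
    # ['biocompatible'] -- kept whole once the tokenizer knows the word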
Hi @ShoubhikBanerjee
I am also working on this. The workaround is to change or add the flag --tokenizer nltk in your fairseq-preprocess command; this will solve your problem. I am now working on adapting the vocabulary to scientific articles, where many terms need to be added, so let me know if you have found any solution for this.
For the sake of convenience, I am pasting the command here.
fairseq-preprocess --user-dir ./prophetnet --task translation_prophetnet --tokenizer nltk --source-lang src --target-lang tgt --trainpref cnndm/prophetnet_tokenized/train --validpref cnndm/prophetnet_tokenized/valid --testpref cnndm/prophetnet_tokenized/test --destdir cnndm/processed --srcdict ./vocab.txt --tgtdict ./vocab.txt --workers 20
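Why this helps: as far as I can tell, fairseq's nltk tokenizer wraps NLTK's word_tokenize, so words stay whole instead of being split into ## pieces, and new whole-word entries in vocab.txt can actually match. A minimal sketch of that behavior:

    import nltk
    nltk.download("punkt", quiet=True)  # newer NLTK versions may need "punkt_tab"
    from nltk.tokenize import word_tokenize

    print(word_tokenize("This material is biocompatible."))
    # ['This', 'material', 'is', 'biocompatible', '.'] -- the word survives whole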
Hi @yuyan2do, I tried this and fine-tuned on the Amazon Food Reviews dataset, and found something strange: the previous version generated some output as BPE-tokenized text, but your latest code fails to generate any output (in some cases producing [UNK] tokens). Moreover, the output summary skips the extra words that I added to the custom vocab.txt.
Text => This taffy is so good. It is very soft and chewy. The flavors are amazing. I would definitely recommend you buying it. Very satisfying!!
Original Summary => Wonderful, tasty taffy
Predicted Summary (Previous Version) => yu ##m yu ##m
Predicted Summary (Current Version) => [UNK] [UNK]
Current vocab.txt file (last entries):
...
##: 30519
##? 30520
##~ 30521
vitality 30522
jumbo 30523
salted 30524
taffy 30525
saltwater 30526
tasty 30527
twizzlers 30528
yummy 30529
oatmeals 30530
gastronomy 30531
holistic 30532
oatmeal 30533
It seems quite strange to me; is anything going wrong?
The strangest part is that it skips the custom (extra) words that were added to vocab.txt.
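One hypothetical way to debug this (the checkpoint path and key names below are assumptions; adjust for your setup) is to check that the fine-tuned checkpoint's embedding rows match the size of the new vocab, since a mismatch can surface as [UNK]-only output:

    import torch

    with open("vocab.txt", encoding="utf-8") as f:
        print("vocab.txt size:", sum(1 for _ in f))

    ckpt = torch.load("checkpoint_best.pt", map_location="cpu")
    for name, tensor in ckpt["model"].items():
        if "embed_tokens" in name:
            print(name, tuple(tensor.shape))
    # If the embedding row count and the vocab size disagree (beyond the few
    # special symbols fairseq's Dictionary adds), the appended tokens fall
    # outside the trained embedding and can come out as [UNK].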