Comments (12)
You can edit this file and replace the [unused] tokens with your own tokens, then reprocess the data.
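For illustration, a minimal sketch of that replacement (assuming a BERT-style vocab.txt with one token per line; the file names and token list here are made-up examples):

    # Hypothetical sketch: overwrite [unusedN] placeholder entries in a
    # BERT-style vocab.txt with new domain tokens.
    new_tokens = ["biocompatible", "taffy", "oatmeal"]  # example tokens

    with open("vocab.txt", encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f]

    it = iter(new_tokens)
    for i, tok in enumerate(vocab):
        if tok.startswith("[unused"):
            try:
                vocab[i] = next(it)
            except StopIteration:
                break  # ran out of new tokens; remaining placeholders stay

    with open("vocab_new.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")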
Thanks a lot for your prompt reply.
I thought of doing that, but in case my set of new words does not fit into the [unused] slots (up to [unused993], ~998 tokens), then what? @yuyan2do
Is the number of new words larger than 998? Then it needs some code change to support that.
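For reference, growing a model's embedding matrix beyond the reserved slots usually means copying the existing rows and initializing new ones. A hypothetical sketch of what such a change involves (not the actual commit):

    import torch

    def grow_embedding(old_weight: torch.Tensor, new_vocab_size: int) -> torch.Tensor:
        # Copy trained rows and randomly initialize rows for the new tokens.
        old_size, dim = old_weight.shape
        new_weight = old_weight.new_empty(new_vocab_size, dim)
        new_weight.normal_(mean=0.0, std=0.02)  # BERT-style init for new rows
        new_weight[:old_size] = old_weight
        return new_weight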
Yes, could you please help me out with this? I would be very grateful.
Sure, I can do it early next week.
Okay, thanks a lot, I will be waiting.
@ShoubhikBanerjee I have committed a change to support appending to the vocab. You can give it a try:
- Add new tokens at the end of this vocab
- Reprocess the data using the new vocab
- Start training
In the console output, check that the word embedding size increases as expected. In the example below, I added 3 new tokens.
(embed_tokens): Embedding(30522, 1024, padding_idx=0)
(embed_tokens): Embedding(30525, 1024, padding_idx=0)
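A sketch of the append-and-check step (file name and tokens are examples, assuming one token per line):

    # Hypothetical sketch: append new tokens to the end of vocab.txt and
    # verify the resulting vocabulary size before reprocessing.
    new_tokens = ["vitality", "jumbo", "salted"]  # 3 example tokens

    with open("vocab.txt", "a", encoding="utf-8") as f:
        for tok in new_tokens:
            f.write(tok + "\n")

    with open("vocab.txt", encoding="utf-8") as f:
        print("vocab size:", sum(1 for _ in f))  # e.g. 30522 + 3 = 30525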
Thanks a lot, I will try it and let you know.
Shoubhik, have you had time to try it?
Sorry for being so late; I was engaged with some other work.
My point is: while tokenizing with the BERT tokenizer, a word like "biocompatible" is tokenized to "bio ##com ##pati ##ble", so the original word is already lost. Will adding "biocompatible" to vocab.txt work then? I think not, because the word no longer appears in the data as a whole token.
So is there any workaround for it?
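To make the problem concrete, a small sketch with the Hugging Face BertTokenizer (which may not be the exact tokenizer used in this pipeline): out-of-vocabulary words are split into word pieces, and only tokens the tokenizer explicitly knows survive intact.

    from transformers import BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("biocompatible"))
    # split into pieces, e.g. ['bio', '##com', '##pati', '##ble'] as described above

    tok.add_tokens(["biocompatible"])
    print(tok.tokenize("biocompatible"))
    # ['biocompatible'] -- kept whole once the tokenizer knows the word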
Hi @ShoubhikBanerjee
I am also working on this. The workaround is to change or add the flag --tokenizer nltk in your fairseq-preprocess command; this will solve your problem. I am now working on adapting the vocabulary to scientific articles, where many terms need to be added, so let me know if you have found any solution for this.
For the sake of convenience, I am pasting the command here.
fairseq-preprocess --user-dir ./prophetnet --task translation_prophetnet --tokenizer nltk --source-lang src --target-lang tgt --trainpref cnndm/prophetnet_tokenized/train --validpref cnndm/prophetnet_tokenized/valid --testpref cnndm/prophetnet_tokenized/test --destdir cnndm/processed --srcdict ./vocab.txt --tgtdict ./vocab.txt --workers 20
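Why this helps: as far as I can tell, fairseq's nltk tokenizer wraps NLTK's word_tokenize, so words stay whole instead of being split into ## pieces, and new whole-word entries in vocab.txt can actually match. A minimal sketch of that behavior:

    import nltk
    nltk.download("punkt", quiet=True)  # newer NLTK versions may need "punkt_tab"
    from nltk.tokenize import word_tokenize

    print(word_tokenize("This material is biocompatible."))
    # ['This', 'material', 'is', 'biocompatible', '.'] -- the word survives whole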
Hi @yuyan2do, I tried this and fine-tuned on the Amazon Food Reviews dataset, and found something strange: the previous version generated some output as BPE-tokenized text, but your latest code fails to generate any output (in some cases producing [UNK] tokens). Moreover, the output summary skips the extra words that I added to the custom vocab.txt.
Text => This taffy is so good. It is very soft and chewy. The flavors are amazing. I would definitely recommend you buying it. Very satisfying!!
Original Summary => Wonderful, tasty taffy
Predicted Summary (Previous Version) => yu ##m yu ##m
Predicted Summary (Current Version) => [UNK] [UNK]
Current vocab.txt file (last entries):
...
##: 30519
##? 30520
##~ 30521
vitality 30522
jumbo 30523
salted 30524
taffy 30525
saltwater 30526
tasty 30527
twizzlers 30528
yummy 30529
oatmeals 30530
gastronomy 30531
holistic 30532
oatmeal 30533
It seems quite strange to me; is anything going wrong?
The strangest part is that it skips the custom (extra) words that were added to vocab.txt.
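One hypothetical way to debug this (the checkpoint path and key names below are assumptions; adjust for your setup) is to check that the fine-tuned checkpoint's embedding rows match the size of the new vocab, since a mismatch can surface as [UNK]-only output:

    import torch

    with open("vocab.txt", encoding="utf-8") as f:
        print("vocab.txt size:", sum(1 for _ in f))

    ckpt = torch.load("checkpoint_best.pt", map_location="cpu")
    for name, tensor in ckpt["model"].items():
        if "embed_tokens" in name:
            print(name, tuple(tensor.shape))
    # If the embedding row count and the vocab size disagree (beyond the few
    # special symbols fairseq's Dictionary adds), the appended tokens fall
    # outside the trained embedding and can come out as [UNK].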