Comments (3)
@harshyadav17, the difference between the "txt" and "json" files arises because Fairseq automatically includes the following special tokens when the dictionary is loaded:
"<s>": 0,
"<pad>": 1,
"</s>": 2,
"<unk>": 3,
You can check this out here.
Therefore, we need to manually add these tokens at the beginning, followed by all the entries from the dictionary file.
The IDs themselves are not different; Fairseq simply adds these tokens automatically, while in this case, we are adding them manually.
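As a rough illustration of the manual step described above, here is a minimal sketch of building such a JSON vocabulary from a Fairseq dictionary file. It assumes the usual `<token> <count>` line layout of a Fairseq `dict.txt`. The function names `fairseq_dict_to_vocab` and `convert` are hypothetical, and this is not necessarily the exact script used for IndicTrans2.

```python
import json

# The four special tokens that Fairseq adds when a dictionary is loaded,
# in the order listed above, so they take IDs 0 to 3.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>"]


def fairseq_dict_to_vocab(lines):
    """Map tokens to IDs: special tokens first, then dictionary entries in file order.

    Assumes each non-empty line looks like "<token> <count>".
    """
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        # Split on the last space so the count is dropped and the token is kept whole.
        token = line.rsplit(" ", 1)[0]
        if token in vocab:
            raise ValueError(f"duplicate token in dictionary: {token!r}")
        vocab[token] = len(vocab)
    return vocab


def convert(txt_path, json_path):
    """Read a dict.txt-style file and write the token-to-ID mapping as JSON."""
    with open(txt_path, encoding="utf-8") as f:
        vocab = fairseq_dict_to_vocab(f)
    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Indic-script tokens readable in the output file.
        json.dump(vocab, f, ensure_ascii=False, indent=2)
```

With this layout, the first entry of the txt file ends up with ID 4 in the JSON, which is the offset you see when comparing the two files line by line.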
Hope this makes it clear.
from indictrans2.
Please check the Hugging Face repo for the necessary scripts and refer to the commits for detailed steps to make it compatible with AutoTokenizer.
If you have any specific questions, feel free to post them here, and we'll be happy to help.
@PranjalChitale thanks!
How did you get the dict.SRC.json and dict.TGT.json files? I see they have different IDs compared to the .txt files present in the final_bin folder or the ones shared by you over here: https://indictrans2-public.objectstore.e2enetworks.net/en-indic-fairseq-dict.zip
It would be great if you can share the thought process behind generating such json files.
Thanks!