Comments (3)
The inference pipeline is designed to be general-purpose and handles text from a wide range of domains, but it is not foolproof.
The regex-based placeholder method is applied post hoc because empirical testing showed it to be effective in most cases.
Note that the models were not specifically trained to retain these placeholders; you can fine-tune them to do so.
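For readers unfamiliar with the idea, placeholder masking can be sketched roughly as follows. This is only a minimal illustration and not the regex or placeholder format actually used in the repository: the pattern `NUMBER_PATTERN`, the `<ID1>`-style tokens, and the helper names `mask_numbers` and `unmask_numbers` are all assumptions made for the example.

```python
import re

# Hypothetical pattern (an assumption, not the project's regex): digit groups
# joined by commas or dots, e.g. "1,87,500" or "3.14".
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(text):
    """Replace each numeric span with a placeholder; return text and mapping."""
    mapping = {}

    def _substitute(match):
        # Placeholder format is illustrative only.
        placeholder = f"<ID{len(mapping) + 1}>"
        mapping[placeholder] = match.group(0)
        return placeholder

    return NUMBER_PATTERN.sub(_substitute, text), mapping


def unmask_numbers(text, mapping):
    """Restore the original numeric spans after translation."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


source = "A reduction of 20% from the existing liability of Rs 1,87,500."
masked, mapping = mask_numbers(source)
restored = unmask_numbers(masked, mapping)
```

Restoration only works if the model copies the placeholders through unchanged, which is why the lack of placeholder-specific training matters.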
Sentence splitting is performed using the best open-source libraries available.
The regex pattern was developed by analyzing the cases we encountered, and it covers most general-purpose use cases.
If you have recommendations for other libraries or improved regex patterns, please let us know.
Additionally, you can bypass the inference pipeline when sentence splitting is not necessary (for example, if you are confident about the sequence length).
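As a rough illustration of that choice, the sketch below skips splitting for short inputs and falls back to a simple regex splitter otherwise. It rests on several assumptions: the whitespace word count is only a crude proxy for the model's subword sequence length, the `max_words` threshold is hypothetical, and the regex stands in for the sentence-splitting libraries the pipeline actually uses.

```python
import re

# Illustrative splitter: break after a danda or ., ?, ! followed by whitespace.
# This is a stand-in, not the library-based splitter used by the pipeline.
SENTENCE_END = re.compile(r"(?<=[।.?!])\s+")


def split_if_needed(text, max_words=50):
    """Return [text] unchanged when it is short, otherwise a list of sentences."""
    text = text.strip()
    # Whitespace tokens underestimate subword counts, so keep max_words conservative.
    if len(text.split()) <= max_words:
        return [text]
    return [part for part in SENTENCE_END.split(text) if part]


first = "मौजूदा 1,87,500 रुपये की देनदारी से 20% की कमी।"
second = "जब सरकार में सबसे अधिक मूल वेतन केवल 30,000 रुपये प्रति माह था।"

unsplit = split_if_needed(first)                           # short input, passed through
parts = split_if_needed(first + " " + second, max_words=5)  # forced to split
```

A tokenizer-based length check would be more reliable than counting words, if the model's tokenizer is available at that point.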
Below are the results obtained with the Fairseq model without the inference pipeline:
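मौजूदा 1,87,500 रुपये की देनदारी से 20% की कमी। (English: "A 20% reduction from the existing liability of Rs 1,87,500.")
कंपनी का राजस्व 2019-20 में 12,34,567 रुपये से बढ़कर 2020-21 में 56,78,910 रुपये हो गया। (English: "The company's revenue grew from Rs 12,34,567 in 2019-20 to Rs 56,78,910 in 2020-21.")
जब सरकार में सबसे अधिक मूल वेतन केवल 30,000 रुपये प्रति माह था। (English: "When the highest basic pay in the government was only Rs 30,000 per month.")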
from indictrans2.
Thank you so much for your detailed response and for explaining the current approach and its limitations.
Could you kindly guide me on how these results were generated? Were they produced using the model.batch_translate() function?
Thank you once again for your support and assistance.
from indictrans2.
These were generated using joint_translate.sh, but batch_translate can also be modified to disable the regex-based preprocessing and sentence splitting.
from indictrans2.