
indictrans2's Introduction

IndicTrans2

📜 Paper | 🌐 Website | ▶️ Demo | 🤗 HF Interface | Colab

IndicTrans2 is the first open-source transformer-based multilingual NMT model that supports high-quality translations across all 22 scheduled Indic languages, including multiple scripts for low-resource languages like Kashmiri, Manipuri and Sindhi. It adopts script unification wherever feasible to leverage transfer learning via lexical sharing between languages. Overall, the model supports five scripts: Perso-Arabic (Kashmiri, Sindhi, Urdu), Ol Chiki (Santali), Meitei (Manipuri), Latin (English), and Devanagari (used for all the remaining languages).

We open-source our entire training dataset (BPCC), back-translation data (BPCC-BT), final IndicTrans2 models, evaluation benchmarks (IN22, which includes IN22-Gen and IN22-Conv), and training and inference scripts for easier use and adoption within the research community. We hope that this will foster even more research in low-resource Indic languages, leading to further improvements in the quality of low-resource translation through contributions from the research community.

This code repository contains instructions for downloading the artifacts associated with IndicTrans2, as well as the code for training/fine-tuning the multilingual NMT models.

Here is the list of languages supported by the IndicTrans2 models:

Assamese (asm_Beng) | Kashmiri (Arabic) (kas_Arab) | Punjabi (pan_Guru)
Bengali (ben_Beng) | Kashmiri (Devanagari) (kas_Deva) | Sanskrit (san_Deva)
Bodo (brx_Deva) | Maithili (mai_Deva) | Santali (sat_Olck)
Dogri (doi_Deva) | Malayalam (mal_Mlym) | Sindhi (Arabic) (snd_Arab)
English (eng_Latn) | Marathi (mar_Deva) | Sindhi (Devanagari) (snd_Deva)
Konkani (gom_Deva) | Manipuri (Bengali) (mni_Beng) | Tamil (tam_Taml)
Gujarati (guj_Gujr) | Manipuri (Meitei) (mni_Mtei) | Telugu (tel_Telu)
Hindi (hin_Deva) | Nepali (npi_Deva) | Urdu (urd_Arab)
Kannada (kan_Knda) | Odia (ory_Orya) |

Updates

  • 🚨 Dec 30, 2023 - Migrated the IndicTrans2 tokenizer for the HF-compatible IndicTrans2 models to IndicTransTokenizer; it will be maintained separately there from now on. Added LoRA fine-tuning scripts for our IndicTrans2 models in huggingface_interface.
  • 🚨 Dec 1, 2023 - Release of Indic-Indic model and corresponding distilled variants for each base model. Please refer to the Download section for the checkpoints.
  • 🚨 Sep 9, 2023 - Added HF compatible IndicTrans2 models. Please refer to the README for detailed example usage.

Table of Contents

Download Models and Other Artifacts

Multilingual Translation Models

Model | En-Indic | Indic-En | Indic-Indic | Evaluations
Base (used for benchmarking) | download | download | download | translations (as of May 10, 2023), metrics
Distilled | download | download | download |

Training Data

Data | URL
Bharat Parallel Corpus Collection (BPCC) | download
Back-translation (BPCC-BT) | download

Evaluation Data

Data | URL
IN22 test set | download
FLORES-22 Indic dev set | download

Installation

Instructions to set up and install everything needed before running the code.

# Clone the github repository and navigate to the project directory.
git clone https://github.com/AI4Bharat/IndicTrans2
cd IndicTrans2

# Install all the dependencies and requirements associated with the project.
source install.sh

Note: We recommend creating a virtual environment with python>=3.7.
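
For example, with conda (a minimal sketch; note that install.sh may create and activate its own environment, so adapt this to your setup):

# create and activate a fresh environment (the name itv2 is illustrative)
conda create -n itv2 python=3.9 -y
conda activate itv2

# then install the project dependencies
source install.sh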

Additional notes about Installation

The prepare_data_joint_finetuning.sh and prepare_data_joint_training.sh scripts expect the sentencepiece command-line utility and GNU parallel to be installed.

  1. To install the sentencepiece command-line utility, please follow the instructions here.
  2. Please check whether GNU parallel is installed; if not, install it, or, in case of installation issues, remove parallel --pipe --keep-order from the respective training / fine-tuning script as well as from apply_sentence_piece.sh.

Data

Training

Bharat Parallel Corpus Collection (BPCC) is a comprehensive and publicly available parallel corpus that includes both existing and new data for all 22 scheduled Indic languages. It comprises two parts, BPCC-Mined and BPCC-Human, totaling approximately 230 million bitext pairs. BPCC-Mined contains about 228 million pairs, with nearly 126 million pairs newly added as a part of this work. On the other hand, BPCC-Human consists of 2.2 million gold standard English-Indic pairs, with an additional 644K bitext pairs from English Wikipedia sentences (forming the BPCC-H-Wiki subset) and 139K sentences covering everyday use cases (forming the BPCC-H-Daily subset). It is worth highlighting that BPCC provides the first available datasets for 7 languages and significantly increases the available data for all languages covered.

You can find the contribution from different sources in the following table:

BPCC-Mined | Existing | Samanantar | 19.4M
BPCC-Mined | Existing | NLLB | 85M
BPCC-Mined | Newly Added | Samanantar++ | 121.6M
BPCC-Mined | Newly Added | Comparable | 4.3M
BPCC-Human | Existing | NLLB | 18.5K
BPCC-Human | Existing | ILCI | 1.3M
BPCC-Human | Existing | Massive | 115K
BPCC-Human | Newly Added | Wiki | 644K
BPCC-Human | Newly Added | Daily | 139K

Additionally, we provide augmented back-translation data generated by our intermediate IndicTrans2 models for training purposes. Please refer to our paper for more details on the selection of sample proportions and sources.

English BT data (English Original) | 401.9M
Indic BT data (Indic Original) | 400.9M

Evaluation

The IN22 test set is a newly created comprehensive benchmark for evaluating machine translation performance in multi-domain, n-way parallel contexts across 22 Indic languages. It has been created from three distinct subsets, namely IN22-Wiki, IN22-Web and IN22-Conv. The Wikipedia and Web subsets offer diverse content spanning news, entertainment, culture, legal, and India-centric topics. IN22-Wiki and IN22-Web have been combined for evaluation purposes and released as IN22-Gen. Meanwhile, IN22-Conv, the conversation-domain subset, is designed to assess translation quality in typical day-to-day conversational-style applications.

IN22-Gen (IN22-Wiki + IN22-Web) | 1024 sentences | 🤗 ai4bharat/IN22-Gen
IN22-Conv | 1503 sentences | 🤗 ai4bharat/IN22-Conv

You can download the data artifacts released as a part of this work from the following section.

Preparing Data for Training

BPCC data is organized under different subsets as described above, where each subset contains language pair subdirectories with the sentence pairs. We also provide LaBSE and LASER scores for the mined subsets of BPCC. In order to replicate our training setup, you will need to combine the data for the corresponding language pairs from the different subsets and remove any overlapping bitext pairs.

Here is the expected directory structure of the data:

BPCC
├── eng_Latn-asm_Beng
│   ├── train.eng_Latn
│   └── train.asm_Beng
├── eng_Latn-ben_Beng
└── ...

While we provide subsets deduplicated against the currently available benchmarks, we highly recommend performing deduplication using the combined monolingual side of all the benchmarks. You can use the following command for deduplication once you have combined the monolingual side of all the benchmarks into a single directory.

python3 scripts/dedup_benchmark.py <in_data_dir> <out_data_dir> <benchmark_dir>
  • <in_data_dir>: path to the directory containing train data for each language pair in the format {src_lang}-{tgt_lang}
  • <out_data_dir>: path to the directory where the deduplicated train data will be written for each language pair in the format {src_lang}-{tgt_lang}
  • <benchmark_dir>: path to the directory containing the language-wise monolingual side of dev/test set, with monolingual files named as test.{lang}
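
For example, with illustrative directory names, where BPCC contains the combined train data, BPCC_dedup receives the deduplicated output, and benchmarks_mono holds the test.{lang} files:

python3 scripts/dedup_benchmark.py BPCC BPCC_dedup benchmarks_mono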

Using our SPM model and Fairseq dictionary

Once you complete the deduplication of the training data with the available benchmarks, you can preprocess and binarize the data for training models. Please download our trained SPM model and learned Fairseq dictionary using the following links for your experiments.

 | En-Indic | Indic-En | Indic-Indic
SPM model | download | download | download
Fairseq dictionary | download | download | download

To prepare the data for training En-Indic model, please do the following:

  1. Download the SPM model in the experiment directory and rename it as vocab.
  2. Download the Fairseq dictionary in the experiment directory and rename it as final_bin.

Here is the expected directory for training En-Indic model:

en-indic-exp
├── train
│   ├── eng_Latn-asm_Beng
│   │   ├── train.eng_Latn
│   │   └── train.asm_Beng
│   ├── eng_Latn-ben_Beng
│   └── ...
├── devtest
│   └── all
│       ├── eng_Latn-asm_Beng
│       │   ├── dev.eng_Latn
│       │   └── dev.asm_Beng
│       ├── eng_Latn-ben_Beng
│       └── ...
├── vocab
│   ├── model.SRC
│   ├── model.TGT
│   ├── vocab.SRC
│   └── vocab.TGT
└── final_bin
    ├── dict.SRC.txt
    └── dict.TGT.txt

To prepare data for training the Indic-En model, you should reverse the language pair directories within the train and devtest directories. Additionally, make sure to download the corresponding SPM model and Fairseq dictionary and put them in the experiment directory, similar to the procedure mentioned above for En-Indic model training.

You can binarize the data for model training using the following:

bash prepare_data_joint_finetuning.sh <exp_dir>
  • <exp_dir>: path to the directory containing the raw data for binarization
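
For example, using the en-indic-exp directory shown above:

bash prepare_data_joint_finetuning.sh en-indic-exp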

You will need to follow the same steps for data preparation in case of fine-tuning models.

Training your own SPM models and learning Fairseq dictionary

If you want to train your own SPM model and learn Fairseq dictionary, then please do the following:

  1. Collect a balanced amount of English and Indic monolingual data (we use around 3 million sentences per language-script combination). If some languages have limited data available, increase their representation to achieve a fair distribution of tokens across languages.
  2. Perform script unification for Indic languages wherever possible using scripts/preprocess_translate.py and concatenate all Indic data into a single file.
  3. Train two SPM models, one for the English side and one for the Indic side, using the following (see the sketch after this list for an illustrative invocation):
spm_train --input=train.indic --model_prefix=<model_name> --vocab_size=<vocab_size> --character_coverage=1.0 --model_type=BPE
  4. Copy the trained SPM models into the experiment directory mentioned earlier and learn the Fairseq dictionary using the following:
bash prepare_data_joint_training.sh <exp_dir>
  5. You will need to use the same Fairseq dictionary for any subsequent fine-tuning experiments; refer to the steps described above (link).
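
As an illustration, the two SPM models might be trained as follows (file names and vocabulary sizes are hypothetical; choose values appropriate for your data):

# English side
spm_train --input=train.eng --model_prefix=en_spm --vocab_size=32000 --character_coverage=1.0 --model_type=BPE
# Indic side (script-unified data concatenated into a single file)
spm_train --input=train.indic --model_prefix=indic_spm --vocab_size=128000 --character_coverage=1.0 --model_type=BPE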

Training / Fine-tuning

After binarizing the data, you can use train.sh to train the models. We provide the default hyperparameters used in this work; you can modify them as per your requirements. If you want to train the model with a customized architecture, please define the architecture in model_configs/custom_transformer.py. You can start the model training with the following command:

bash train.sh <exp_dir> <model_arch>
  • <exp_dir>: path to the directory containing the binarized data
  • <model_arch>: custom transformer architecture used for model training

For fine-tuning, the initial steps remain the same. However, the finetune.sh script includes an additional argument, pretrained_ckpt, which specifies the model checkpoint to be loaded for further fine-tuning. You can perform fine-tuning using the following command:

bash finetune.sh <exp_dir> <model_arch> <pretrained_ckpt>
  • <exp_dir>: path to the directory containing the binarized data
  • <model_arch>: custom transformer architecture used for model training
    • transformer_18_18 - For IT2 Base models
    • transformer_base18L - For IT2 Distilled models
  • <pretrained_ckpt>: path to the fairseq model checkpoint to be loaded for further fine-tuning
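
For example, to fine-tune a downloaded base En-Indic checkpoint (paths are illustrative):

bash finetune.sh en-indic-exp transformer_18_18 en-indic-preprint/fairseq_model/model/checkpoint_best.pt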

You can download the model artifacts released as a part of this work from the following section.

The pretrained checkpoints contain 3 directories: a fairseq model directory and 2 CT2-ported model directories. Please note that the CT2 models are provided only for efficient inference. For fine-tuning purposes you should use the fairseq_model. After that, you can use the fairseq-ct2-converter to port your fine-tuned checkpoints to CT2 for faster inference.
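
As a rough sketch, the conversion might look like the following (option names follow CTranslate2's fairseq converter and the dict.SRC/dict.TGT naming used above; please verify against the CTranslate2 documentation):

ct2-fairseq-converter --model_path checkpoint_best.pt --data_dir final_bin \
    --source_lang SRC --target_lang TGT --output_dir ct2_int8_model --quantization int8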

Inference

Fairseq Inference

In order to run inference on our pretrained models using the bash interface, please use the following:

bash joint_translate.sh <infname> <outfname> <src_lang> <tgt_lang> <ckpt_dir>
  • infname: path to the input file containing sentences
  • outfname: path to the output file where the translations should be stored
  • src_lang: source language
  • tgt_lang: target language
  • ckpt_dir: path to the fairseq model checkpoint directory

If you want to run inference using the Python interface, please execute the following block of code from the root directory:

from inference.engine import Model

model = Model(ckpt_dir, model_type="fairseq")

sents = [sent1, sent2,...]

# for a batch of sentences
model.batch_translate(sents, src_lang, tgt_lang)

# for a paragraph
model.translate_paragraph(text, src_lang, tgt_lang)
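
For instance, a minimal end-to-end call might look like this (the checkpoint path is hypothetical; language codes follow the table above):

from inference.engine import Model

# path to the downloaded fairseq checkpoint directory (illustrative)
model = Model("indic-en-deploy/fairseq_model", model_type="fairseq")

sents = ["मेरा नाम जय है।"]
print(model.batch_translate(sents, "hin_Deva", "eng_Latn"))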

CT2 Inference

In order to run inference on a CT2-ported model using the Python interface, please execute the following block of code from the root directory:

from inference.engine import Model

model = Model(ckpt_dir, model_type="ctranslate2")

sents = [sent1, sent2,...]

# for a batch of sentences
model.batch_translate(sents, src_lang, tgt_lang)

# for a paragraph
model.translate_paragraph(text, src_lang, tgt_lang)

Evaluations

We consider chrF++ as our primary metric. Additionally, we also report BLEU and COMET scores. We also perform statistical significance tests for each metric to ascertain whether the differences are statistically significant.

In order to run our evaluation scripts, you will need to organize the evaluation test sets into the following directory structure:

eval_benchmarks
├── flores
│   └── eng_Latn-asm_Beng
│       ├── test.eng_Latn
│       └── test.asm_Beng
├── in22-gen
├── in22-conv
├── ntrex
└── ...

To compute the BLEU and chrF++ scores for a prediction file, you can use the following command:

bash compute_metrics.sh <pred_fname> <ref_fname> <tgt_lang>
  • pred_fname: path to the model translations
  • ref_fname: path to the reference translations
  • tgt_lang: target language
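
For example, for Hindi predictions from a system named it2 (file names are illustrative and follow the test.{lang}.pred.{system} convention used below):

bash compute_metrics.sh test.hin_Deva.pred.it2 test.hin_Deva hin_Deva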

In order to automate the inference over the individual test sets for En-Indic, you can use the following command:

bash eval.sh <devtest_data_dir> <ckpt_dir> <system>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)
  • <ckpt_dir>: path to the fairseq model checkpoint directory
  • <system>: system name suffix to store the predictions in the format test.{lang}.pred.{system}
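
For example (paths are illustrative):

bash eval.sh eval_benchmarks/flores en-indic-deploy/fairseq_model it2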

In case of Indic-En evaluation, please use the following command:

bash eval_rev.sh  <devtest_data_dir> <ckpt_dir> <system>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)
  • <ckpt_dir>: path to the fairseq model checkpoint directory
  • <system>: system name suffix to store the predictions in the format test.{lang}.pred.{system}

Note: You don’t need to reverse the test set directions for each language pair.

In case of Indic-Indic evaluation, please use the following command:

bash pivot_eval.sh <devtest_data_dir> <pivot_lang> <src2pivot_ckpt_dir> <pivot2tgt_ckpt_dir> <system>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)
  • <pivot_lang>: pivot language (default should be eng_Latn)
  • <src2pivot_ckpt_dir>: path to the fairseq Indic-En model checkpoint directory
  • <pivot2tgt_ckpt_dir>: path to the fairseq En-Indic model checkpoint directory
  • <system>: system name suffix to store the predictions in the format test.{lang}.pred.{system}
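
For example, pivoting through English (paths are illustrative):

bash pivot_eval.sh eval_benchmarks/flores eng_Latn indic-en-deploy/fairseq_model en-indic-deploy/fairseq_model it2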

In order to perform significance testing for BLEU and chrF++ metrics after you have the predictions for different systems, you can use the following command:

bash compute_metrics_significance.sh <devtest_data_dir>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)

Similarly, to compute the COMET scores and perform significance testing on the predictions of different systems, you can use the following command:

bash compute_comet_score.sh <devtest_data_dir>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)

Please note that since we compute significance tests with the same script and automate everything, it is best to have the predictions for all the systems in place to avoid repeating work. Also, the systems are defined in the script itself; if you want to try out other systems, make sure to edit them there.

Baseline Evaluation

To generate the translation results for baseline models such as M2M-100, MBART, Azure, Google, and NLLB MoE, you can check the scripts provided in the "baseline_eval" directory of this repository. For NLLB distilled, you can either modify the NLLB MoE evaluation script or use this repository. Similarly, for IndicTrans inference, please refer to this repository.

You can download the translation outputs released as a part of this work from the following section.

LICENSE

The following table lists the licenses associated with the different artifacts released as a part of this work:

Artifact | LICENSE
Existing Mined Corpora (NLLB & Samanantar) | CC0
Existing Seed Corpora (NLLB-Seed, ILCI, MASSIVE) | CC0
Newly Added Mined Corpora (Samanantar++ & Comparable) | CC0
Newly Added Seed Corpora (BPCC-H-Wiki & BPCC-H-Daily) | CC-BY-4.0
Newly Created IN-22 test set (IN22-Gen & IN22-Conv) | CC-BY-4.0
Back-translation data (BPCC-BT) | CC0
Model checkpoints | MIT

The mined corpora collection (BPCC-Mined), existing seed corpora (NLLB-Seed, ILCI, MASSIVE), and back-translation data (BPCC-BT) are released under the following licensing scheme:

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to BPCC-Mined, existing seed corpora (NLLB-Seed, ILCI, MASSIVE) and BPCC-BT.

Citation

@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}

indictrans2's People

Contributors: anoopkunchukuttan, gokulnc, jaygala24, pranjalchitale, varungumma

indictrans2's Issues

NameError: name 'postprocess_batch' is not defined

Getting the following error from the library while trying to do a custom script translation on the ctranslate2 int8 model.

$ python3 examples/ct2.py --ckpt-dir data/en-indic-deploy/ct2_int8_model/ --src-lang eng_Latn --tgt-lang mal_Mlym
Initializing sentencepiece model for SRC and TGT
Initializing model for translation
Hello world!
Traceback (most recent call last):
  File "/home/jerin/code/IndicTrans2/examples/ct2.py", line 24, in <module>
    model.batch_translate(sents, args.src_lang, args.tgt_lang)
  File "/home/jerin/code/IndicTrans2/inference/engine.py", line 248, in batch_translate
    return postprocess_batch(translations, tgt_lang, input_sents=batch)
    
NameError: name 'postprocess_batch' is not defined

Points to the following line:

return postprocess_batch(translations, tgt_lang, input_sents=batch)

The following fixes it for me.

diff --git a/inference/engine.py b/inference/engine.py
index 15bc21a..a63eefe 100644
--- a/inference/engine.py
+++ b/inference/engine.py
@@ -245,7 +245,7 @@ class Model:
 
         preprocessed_sents = self.preprocess_batch(batch, src_lang, tgt_lang)
         translations = self.translate_lines(preprocessed_sents)
-        return postprocess_batch(translations, tgt_lang, input_sents=batch)
+        return self.postprocess_batch(translations, tgt_lang, input_sents=batch)
     
     # translate a paragraph from src_lang to tgt_lang
     def translate_paragraph(self, paragraph: str, src_lang: str, tgt_lang: str) -> str:

Unable to run inference after finetuning

Hi Pranjal,

I am not able to run inference after fine-tuning, as the checkpoints do not seem to have a vocab directory. How can I work around this?

Thanks,
Rahul

Speeding up Performance

I am running the translation engine on an EC2 instance, which has 8 T4 GPUs with 16 GB VRAM each.
Out of the total 128 GB, the system seems to be using only about 3-6 GB constantly.

Is there any way I can get this to use more of the GPU power and be faster?
Any specific parameters that might help me?

I can share the code segments or any other details if needed.

spm_encode not found

Hey,

I cloned the repo and ran the model with input text, expecting output in English. I downloaded the model on my local Mac M2 Pro.

I am getting the error below:
joint_translate.sh: line 48: spm_encode: command not found

The command I ran is:
bash joint_translate.sh input op hin_Deva eng_Latn ../../../Downloads/indic-en-deploy/fairseq_model

Can you please help me with how I can export spm_encode?

I exported it with SPM_PATH but no luck.

Cuda Version

Hi team,
I have followed all the steps listed in the README in order to run the inference.

I downloaded the model checkpoints and tried to run the inference using the Python code:

from inference.engine import Model
model = Model('indic-en-preprint/ct2_int8_model/', model_type="ctranslate2")

This gives me the error below:

"RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version"

My CUDA version is 11.8.

Error number 2

If I try to run the fairseq inference, the kernel becomes dead.

Can you guide me on how to resolve this issue? I will be very thankful to you.

Kind Regards
Abdul Basit

How to handle large documents?

I'm looking to translate large documents into English, but I'm encountering an issue with the maximum sequence length of 256 while translating. In some instances, even after splitting the document, some sentences are still longer than 256 tokens. This situation might potentially impact the global context. Could you provide me with any suggestions or recommendations to handle this effectively?

@jaygala24

Benchmarks

I read the IndicTrans2 paper, where I found some dataset benchmarks for IndicTrans2. Can someone please provide the individual benchmark for each language of IndicTrans2 and the benchmarks used for evaluation? Thank you.

Non Kashmiri Character

The model introduces a non-Kashmiri character in translations.
There are multiple instances where "ٮ۪" is introduced, as in "پٮ۪ٹھ"; these characters are not Kashmiri.
The correct spelling is "پؠٹھ".
The shape of the lower character is not rectangular in Kashmiri but circular.

Bug during Translation

IndicTrans2 hangs up while translating the following line:

"About $5 in the drugstore & the product will last for months.It doesn't hurt/burn or any of that jazz.--------------------------------------- Â\xad------."

I diagnosed the reason to be "--------------------------------------- Â\xad------."

Can we do something about this?

Adding more details:

I did some experiments to find out the cause (note the full-stop):

  1. "About $5 in the drugstore & the product will last for months.It doesn't hurt/burn or any of that jazz." gets successfully translated.
  2. "--------------------------------------- Â\xad------." gets successfully translated.
  3. ".--------------------------------------- Â\xad------" gets successfully translated.
  4. ".--------------------------------------- Â\xad------." gets successfully translated.
  5. "About $5 in the drugstore & the product will last for months.It doesn't hurt/burn or any of that jazz--------------------------------------- Â\xad------." gets successfully translated.
  6. "About $5 in the drugstore & the product will last for months.It doesn't hurt/burn or any of that jazz.--------------------------------------- Â\xad------" hangs up.
  7. "About $5 in the drugstore & the product will last for months.It doesn't hurt/burn or any of that jazz.--------------------------------------- Â\xad------." hangs up.

In the 5th experiment, I removed the full stop between "jazz" and "-".
In the 6th experiment, I removed the full stop at the end of the line.

Hope this helps in resolving the bug.

Add keyboard shortcut like Ctrl+Enter to run the translation

On this page: https://models.ai4bharat.org/#/nmt/v2

Just like many online IDEs have Ctrl + Enter to execute code blocks, is it possible to add a keyboard shortcut like Ctrl+Enter to run the translation?

Perhaps the following UX experience:

  1. When the page loads for the first time, the cursor is active in the "From" translation box.
  2. This enables people to start writing immediately without using the mouse to focus.
  3. Ctrl + Enter runs the translation.
  4. After translation, the cursor focus returns to the "From" textbox.
  5. This way people can edit the existing text without using the mouse or Tab to refocus the cursor.
  6. They can then execute the translation again using Ctrl+Enter.

Right now, it takes a combination of keyboard and mouse clicks to translate a text and refocus back to the "From" textbox. It becomes time-consuming for a language learner.

🙏🏼

file path of fairseq

python3 convert_indictrans_checkpoint_to_pytorch.py --fairseq_path <fairseq_checkpoint_best.pt> --pytorch_dump_folder_path <hf_output_dir>

Which file path should I insert for <fairseq_checkpoint_best.pt> and <hf_output_dir>?
Can you please mention the file path or where I can find it?

Unexpected behaviour during installation

Hi,
I wanted to use your model, so I followed the installation instructions given here, but that didn't work for me.

I think the reason for that is this line:

conda activate itv2

conda activate itv2 doesn't activate the environment when run from a bash script and therefore the subsequent installations do not happen in the itv2 environment. Check out this comment in the conda repo.

If you are interested, you can replicate the problem with this code snippet:

#!/bin/bash
echo "Create environment"
conda create -n my_env python=3.9 -y
conda activate my_env
echo "Done"

Save this in a file and execute bash <file-name>.sh.

Possible fixes:

  1. The change suggested in the above comment in conda repo fixed the issue for me.
  2. This Stack Overflow answer also solved the problem.

Please consider addressing this issue as it impacts the usability of your repo. Or please let me know if I am making some mistake.
Thanks!

Getting <unk> token occasionally in output

Hello team,

Thanks for building this wonderful open-source project. I sometimes notice that the output is returned with <unk> tokens.

I got this when translating a Malayalam article to English.

The results are also not always deterministic: sometimes the output contains <unk> tokens and sometimes it doesn't. How can I always get deterministic results?

how to translate multi-line strings

Hey, I've been using the CTranslate2 model for inference. I handle concurrent user requests by dynamically batching them together and running inference on them as a batch.

However, I wanted to know if I could somehow handle newlines (\n) in my input string such that the translated response retains the newline token somehow (e.g. \n or any special token).

I've tried alternative solutions like slicing the input string on \n, running inference on the separate pieces as a batch, and merging them back together, but the concurrent throughput in my application suffers.

Please let me know if there is a possible solution or a hacky way I could achieve this.

Huggingface API example

Could you please add a Huggingface API example to the README.md and the huggingface model card? The current documentation and generation process is pretty obtuse. One could follow the standard template below.

from transformers import MarianTokenizer, AutoModelForSeq2SeqLM

text = 'Рада познакомиться'
mname = 'Helsinki-NLP/opus-mt-ru-en'
tokenizer = MarianTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded) #Nice to meet you

Another option is to create a Docker container so your research can be easily consumed.

{fname}.txt._norm is an empty file.

Hello,
First of all, I would like to express my appreciation for your diligent efforts in successfully releasing IndicTrans2. AI4Bharat's commitment to continual advancement on the India AI Stack is commendable and greatly valued in the community.

Onto the issue: I am running fairseq inference with the source language as hin_Deva and the target language as eng_Latn. On running the Python interface, I always seem to get the output as -
["The"]

Following is what I sent to the model.batch_translate() in the sents parameter -
sents = ['मेरा नाम संभव है']

On delving a bit deeper, I found that the files IndicTrans2/inference/normalize_punctuation.sh and IndicTrans2/normalize-punctuation.perl write to {fname}.txt._norm. This file comes out empty for some reason. Because of this, all downstream preprocessing, inference, and post-processing steps consume only empty strings.

Any guidance on how to resolve this is appreciated. TIA.

EDIT 1: Created an incomplete issue by mistake
EDIT 2: Formatting.

Here's my complete main.py file that runs the python interface -

from inference.engine import Model
import os

base_dir = os.path.dirname(__file__)

ckpt_dir = f'{base_dir}/IndicTrans2/models/indic-en-deploy/fairseq_model'

model = Model(ckpt_dir)

sents = ['मेरा नाम संभव है']

# for a batch of sentences
print(model.batch_translate(sents, "hin_Deva", "eng_Latn"))

# # for a paragraph
# model.translate_paragraph(text, 'hin_Deva', 'eng_Latn')

Transformers Integrations is not a package.

I was trying to use the HuggingFace inference for IndicTrans2. As guided in the README file, I executed the install.sh file and the setup completed successfully. But when I used the code from the example.py file, I encountered a ModuleNotFoundError for IndicTransTokenizer.

After looking at this issue, I used "%cd IndicTrans2" and the previous ModuleNotFoundError was resolved, but I got a new ModuleNotFoundError, as given below:

ModuleNotFoundError: No module named 'transformers.integrations.deepspeed'; 'transformers.integrations' is not a package

I was executing everything in Kaggle.

How can this error be resolved?

Weird Translation issues in Malayalam

  1. Translating words with dots like U.D.F or B.J.P goes wrong with the model, while simply BJP works fine. This occurs in the case of names like V.D Satheeshan as well; in the English translation it comes out as Pinarayi Vijayan D.

Input Text

കേരളത്തിലെ പ്രമുഖ UDF നേതാക്കൾ നാളെ ബിജെപിയിൽ ചേരും, വരുംദിവസങ്ങളിൽ LDF നേതാക്കളും- സുരേന്ദ്രൻ
തിരുവനന്തപുരം: പ്രധാനമന്ത്രി നരേന്ദ്രമോദിയുടെ സന്ദർശനത്തിന് മുന്നോടിയായി കേരളത്തിലെ പ്രമുഖ എൽ.ഡി.എഫ്., യു.ഡി.എഫ്. നേതാക്കൾ ബിജെപിയിൽ അംഗത്വമെടുക്കുമെന്ന് ബി.ജെ.പി. സംസ്ഥാന അധ്യക്ഷൻ കെ. സുരേന്ദ്രൻ. പൗരത്വ നിയമം കേരളത്തിലും നടപ്പാക്കുമെന്നും പിണറായി വിജയന്റെയും വി.ഡി. സതീശന്റെയും വാക്കുകേട്ട് തുള്ളാൻ നിന്നാൽ നിങ്ങൾ വെള്ളത്തിലാകുമെന്നും കെ. സുരേന്ദ്രൻ പറഞ്ഞു.

Output Text:

Prominent UDF leaders in Kerala to join BJP tomorrow, LDF leaders in coming days: Surendran.Thiruvananthapuram: Ahead of Prime Minister Narendra Modi\'s visit to Kerala, a prominent lawyer from Kerala has come forward..D.F., U.D.F.BJP leaders to join party soon.J.P.State president K.S..Surendran..Citizenship Act will be implemented in Kerala too, says Pinarayi Vijayan.D.If you stop to listen to Satheesan\'s words, you will be in the water..Surendran said..
  2. Sometimes a person referred to with the "he" gender is converted to "she" in the English translation. I noticed this a few times. If you need samples, do let me know.

Context Window Limited to 512 tokens.

Dear AI4Bharat Team,

I'm writing to report a limitation I've encountered while using the IndicTrans2 large language model for machine translation tasks. The current context window of 512 tokens hinders its applicability in real-world scenarios that often involve longer text passages.

IndicTrans2's ability to translate between English and Indian languages, as well as between Indic languages themselves, is a valuable contribution. However, the limited context window restricts the model's ability to capture the full context of longer sequences, potentially leading to inaccurate or nonsensical translations.

I would like to request the consideration of releasing models with a larger context window. Ideally, a window size of 32,000 tokens would significantly improve the model's capabilities for real-world tasks.

I understand that increasing the context window size might come with computational costs. However, the ability to handle longer sequences would greatly enhance the usability and effectiveness of IndicTrans2.

Thank you for your time and consideration.

Numbers getting changed after translation

I've deployed the model, and during inference I get:

{
"text":"*Apply Euclid's division algorithm to determine the Highest Common Factor (HCF) of $231$ and $396$.\n\n",
"translated_text":" * ಯುಕ್ಲಿಡ್ನ ಡಿವಿಷನ್ ಅಲ್ಗಾರಿದಮ್ಅನ್ನು ಅನ್ವಯಿಸಿ, ಅತಿ ಹೆಚ್ಚು ಸಾಮಾನ್ಯ ಅಂಶವನ್ನು (ಎಚ್ಸಿಎಫ್) ನಿರ್ಧರಿಸಲು $239 ಮತ್ತು $396."
}

231 -> 239. The issue seems to change the number only when $ is given; otherwise it seems to be okay. What's the reason for this, and is there a possible solution?

Is this an error in prepare_data_joint_finetuning.sh, or is there a way to resolve the issue?

I have been trying to fine-tune the IndicTrans2 model with my data. At the end of the file prepare_data_joint_finetuning.sh there is a fairseq command, "fairseq-preprocess", which has --srcdict $exp_dir/final_bin/dict.SRC.txt and --tgtdict $exp_dir/final_bin/dict.TGT.txt. Is the directory final_bin or is it final_dict? If it's final_bin, I am getting the error
FileNotFoundError: [Errno 2] No such file or directory: 'indic-en-exp/final_bin/dict.SRC.txt'
How can I resolve this issue, and if it's an error in the code, can you please correct it?
Thank you.

Not able to install fairseq on CUDA 12 with the command source install.sh

First let me tell you the steps I took:
1. Cloned the GitHub repository on a virtual machine.
2. Ran source install.sh;
all dependencies installed except fairseq.

So I installed it with the extra command
git checkout cf8ff8c3c5242e6e71e8feb40de45dd699f3cc08.
Now I have these versions:
torch 2.2.1+cu118, fairseq 1.0.0a0+cf8ff8c,
torchaudio 2.2.1, CUDA 12 (virtual machine).
Then I downloaded the SPM model and dictionary and renamed them,
downloaded the BPCC data and took the Wiki subset for training,
and downloaded the model.
Then I downloaded the IN22 data and
used it for deduplication, giving conv from IN22 as a benchmark,
and commented out the "parallel --pipe --keep-order" line in prepare_data_joint_finetuning.sh.
Then I ran "bash prepare_data_joint_finetuning.sh <exp_dir>" after creating the exp folder and pasting all the files into it, e.g. vocab, final dict, train and devtest (FLORES-22).
The commands ran well.
Then I ran the final command for fine-tuning, which is
"bash finetune.sh /home/translation-exp-vm/sab/exp transformer_base18L /home/translation-exp-vm/sab/data/jaygala/it2_ckpts/distilled_models/en-indic/fairseq_model/model/checkpoint_best.pt"
and I am getting the error:

self._bin_buffer_mmap._mmap.close()
AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'

This error shows up because most of the files created in the exp folder are empty (see the attached image).

Please help me with this. Do I need to change the CUDA version to 11.8, 12.1, or 12.2, or am I doing something wrong?

I have now changed the CUDA version and tried again, but I am still getting the same error.
Thank you for the help.

Latency Issues in Jupyter Notebook

I am using this model in my Jupyter notebook, and while doing the translation in the notebook, the latency (time to generate output) is very high compared to the latency I get on "https://models.ai4bharat.org/#/nmt/v2" (approx. 2-3 sec).
When I translate en = ['hi there, my name is x and i live in city y']
in the Jupyter notebook, it takes 17.4 sec, while on "https://models.ai4bharat.org/#/nmt/v2" it takes less than 2 sec.
Can someone tell me how I can match that latency in my Jupyter notebook?

Use of SPM models for Indic-Indic translation

Hi Team,

I have a Prakrit-Hindi (Devanagari script) dataset on which I wanted to fine-tune one of the pre-trained IndicTrans2 models. I am blocked on which SPM model to use, as the only available options are En-Indic and Indic-En, but my dataset is Indic-Indic.

Also, it seems like the HuggingFace interface only allows IndicTrans2 models for inference and not fine-tuning. Please clarify these two queries.

Could not import function from cython file named "data_utils_fast.pyx"

Initializing sentencepiece model for SRC and TGT
Initializing model for translation
2023-08-29 16:56:08 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2023-08-29 16:56:25 | INFO | fairseq.tasks.translation | [SRC] dictionary: 122706 types
2023-08-29 16:56:25 | INFO | fairseq.tasks.translation | [TGT] dictionary: 32296 types
2023-08-29 16:58:12 | INFO | fairseq.tasks.fairseq_task | can_reuse_epoch_itr = True
2023-08-29 16:58:12 | INFO | fairseq.tasks.fairseq_task | reuse_dataloader = True
2023-08-29 16:58:12 | INFO | fairseq.tasks.fairseq_task | rebuild_batches = False
2023-08-29 16:58:12 | INFO | fairseq.tasks.fairseq_task | creating new batches for epoch 1
Traceback (most recent call last):
  File "/mnt/d/NMT/IndicTrans2/fairseq/data/data_utils.py", line 313, in batch_by_size
    from fairseq.data.data_utils_fast import (
ModuleNotFoundError: No module named 'fairseq.data.data_utils_fast'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/d/NMT/IndicTrans2/go.py", line 19, in <module>
    model.translate_paragraph("এটি মূলত তাদের জন্য যারা উপস্থিত থাকতে পারেন নি বা উপস্থিত ছিলেন না।", "ben_Beng", "eng_Latn")
  File "/mnt/d/NMT/IndicTrans2/inference/engine.py", line 278, in translate_paragraph
    postprocessed_sents = self.batch_translate(sents, src_lang, tgt_lang)
  File "/mnt/d/NMT/IndicTrans2/inference/engine.py", line 252, in batch_translate
    translations = self.translate_lines(preprocessed_sents)
  File "/mnt/d/NMT/IndicTrans2/inference/engine.py", line 180, in fairseq_translate_lines
    return self.translator.translate(lines)
  File "/mnt/d/NMT/IndicTrans2/inference/custom_interactive.py", line 236, in translate
    for batch in make_batches(
  File "/mnt/d/NMT/IndicTrans2/inference/custom_interactive.py", line 61, in make_batches
    itr = task.get_batch_iterator(
  File "/mnt/d/NMT/IndicTrans2/fairseq/tasks/fairseq_task.py", line 318, in get_batch_iterator
    batch_sampler = make_batches(dataset, epoch)
  File "/mnt/d/NMT/IndicTrans2/fairseq/tasks/fairseq_task.py", line 300, in make_batches
    batches = dataset.batch_by_size(
  File "/mnt/d/NMT/IndicTrans2/fairseq/data/fairseq_dataset.py", line 145, in batch_by_size
    return data_utils.batch_by_size(
  File "/mnt/d/NMT/IndicTrans2/fairseq/data/data_utils.py", line 319, in batch_by_size
    raise ImportError(
ImportError: Please build Cython components with: `python setup.py build_ext --inplace`

To reproduce, please run the following code from the root path:

from inference.engine import Model

ckpt_dir = "/mnt/d/Datasets/NMT/ai4bharat/indic-en-preprint/fairseq_model/"

model = Model(ckpt_dir, model_type="fairseq")

model.translate_paragraph("এটি মূলত তাদের জন্য যারা উপস্থিত থাকতে পারেন নি বা উপস্থিত ছিলেন না।", "ben_Beng", "eng_Latn")

Note: I have made the model run on the CPU by replacing "model.cuda()" with "model.cpu()".

Having issues with fine-tuning

I am having an issue while fine-tuning the model. After doing all the preprocessing, when I try to run the finetune.sh shell script, I get an error which says "The dataset is empty. This could indicate that all elements in the dataset have been skipped. Try increasing the max number of allowed tokens or using a larger dataset." Also, can you tell me whether I have to make any changes to the code if I want to fine-tune for only 4 language pairs?

Using distilled indic-indic model for translation

I am using the distilled pre-trained model checkpoint for indic-indic translation. I am getting an error because of the --arch argument, which defaults to transformer_18_18.

Can you please tell me which architecture to use? I am fine-tuning using our dataset.

Kashmiri-English translation results

Hi Team,

I followed the translation and evaluation process in the GitHub repo exactly, without any training or fine-tuning, but the performance differs from the results reported in the paper.
On test/gen, it's 25.0.
On test/conv, it's 26.1.
But in the paper, they are 38.3 and 31.8.
Do I need to fine-tune the model or do something else?

Thank you.

Indic-Indic translation models

Hi, sorry if I missed this: under Multilingual Translation Models I see en-indic and indic-en models to download. If one wants to translate between Indic languages, say Tamil to Hindi, is there an indic-indic model? From the demo site I see that such translation is possible.

Capitalisation affects translation?

She is at home --> ಆಕೆ ಮನೆಯಲ್ಲೇ ಇದ್ದಾಳೆ.
vs.

she is at home --> ಅವಳು ಮನೆಯಲ್ಲಿದ್ದಾಳೆ


HF model finetuning

I am trying to fine-tune the HF model ai4bharat/indictrans2-en-indic-dist-200M. I ran the final command
"source train_lora.sh /content/en-indic-exp ai4bharat/indictrans2-en-indic-dist-200M en-indic eng_Latn hin_Deva /content/output"

but the model is not saved in the output dir after completion of training (please check the attached screenshot).
Could you please help here?

Note: in the documented usage "bash train_lora.sh <data_dir> <model_name> <output_dir> <src_lang_list> <tgt_lang_list>",
the direction argument is missing.

Data formatting

Hi Team

I am looking at fine-tuning the model for our use case. I would like to know whether metadata is required for fine-tuning or not. Also, what is the minimum number of pairs expected for fine-tuning? And thirdly, since I am looking at fine-tuning the model across 12 Indian languages, do we require translations for a given English text across all the Indian languages?

GGML / GGUF formats

The model is pretty amazing, and thanks a lot for open-sourcing it. Is there a way to size it down and run it on hardware like Apple silicon using GGML?
Would this improve the inference times? For me, on an Apple M2, it takes 12 seconds to translate 1 sentence. If you can guide me on how to do this, I would be willing to help!

Unable to fine tune en-indic model

I followed steps mentioned in Readme.md to set up the environment.

I tried to fine tune the following downloaded model:
https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip

I modified finetune.sh to include
--finetune-from-model $pretrained_ckpt

I ran it as follows:
bash finetune.sh transformer_18_18 <path to the above downloaded and unzipped folder/fairseq_model/model>

The training starts from epoch 1 instead of from a higher epoch of the pretrained model.

Could you guide me on what might be the problem?

Indic-Indic model support for Indic-En

I was trying the ct2_int8_model of the indic-indic model. From my tests, indic-indic and en-indic translation work fine. However, when used for indic-en, the output is in Hindi.
Sample:

Source language: ml
Source text: "ദേവനാഗരിക്ക് രണ്ട് ബ്ലോക്ക് ഉണ്ട് യുണീക്കോഡിൽ. 128 കോഡ് പോയിന്റ് ബ്ലോക്ക് തീർന്നതുകൊണ്ട്, ദേവനാഗരി എക്സ്റ്റന്റഡ് ബ്ലോക്കു കൂടി."

Tokenized preprocessed content:

[['mal_Mlym', 'eng_Latn', '▁देव', 'नाग', 'रि', 'क्क्', '▁रण्ट्', '▁ब्लोक्क्', '▁उण्ट्', '▁यु', 'णी', 'क्को', 'डि', 'ൽ', '▁.'], ['mal_Mlym', 'eng_Latn', '▁128', '▁कोड्', '▁पोयि', 'न्ऱ्', '▁ब्लोक्क्', '▁ती', 'ർ', 'न्नत', 'ुकॊण्ट्', '▁,', '▁देव', 'नाग', 'रि', '▁ऎक्स्', 'ऱ्ऱ', 'न्ऱ', 'ड्', '▁ब्लो', 'क्कु', '▁कूटि', '▁.']]
Translation Output:
यूनिकोड में देवनागरी के लिए दो ब्लॉक हैं । 128 कोड प्वाइंट ब्लॉक खतम हो चुका है, देवनागरि विस्तारित ब्लॉक भी है ।

This is a good Hindi translation, but English is the target language.

When I looked at the example.py source code, I noticed that en-indic, indic-en, and indic-indic models are all used. Is it possible to replace everything using indic-indic? If so, I wonder why the above example is failing for English as the target language.

self._bin_buffer_mmap._mmap.close() AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'

I have followed all the steps mentioned in the README file and am getting the below error while fine-tuning.

Traceback (most recent call last):
File "/home/translation-exp-vm/anaconda3/envs/itv2/bin/fairseq-train", line 8, in
sys.exit(cli_main())
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq_cli/train.py", line 528, in cli_main
distributed_utils.call_main(cfg, main)
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 369, in call_main
main(cfg, **kwargs)
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq_cli/train.py", line 131, in main
task.load_dataset(valid_sub_split, combine=False, epoch=1)
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq/tasks/translation.py", line 338, in load_dataset
self.datasets[split] = load_langpair_dataset(
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq/tasks/translation.py", line 85, in load_langpair_dataset
src_dataset = data_utils.load_indexed_dataset(
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq/data/data_utils.py", line 106, in load_indexed_dataset
dataset = indexed_dataset.make_dataset(
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq/data/indexed_dataset.py", line 86, in make_dataset
return MMapIndexedDataset(path)
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq/data/indexed_dataset.py", line 494, in init
self._do_init(path)
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq/data/indexed_dataset.py", line 507, in _do_init
self._bin_buffer_mmap = np.memmap(
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/numpy/core/memmap.py", line 267, in new
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: cannot mmap an empty file
Exception ignored in: <function MMapIndexedDataset.__del__ at 0x7f50e64aac10>
Traceback (most recent call last):
File "/home/translation-exp-vm/anaconda3/envs/itv2/lib/python3.9/site-packages/fairseq/data/indexed_dataset.py", line 513, in __del__
self._bin_buffer_mmap._mmap.close()
AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'

I am just preparing the script for fine-tuning using the sample data given in the BPCC/daily dataset.

Thank you in advance.

transliteration using indictrans2

Firstly, a huge shout out to the team at AI4B for releasing such good-quality models.

I am exploring translation from Kannada -> English for various information phrases, one of them being names (like Surya, Bhagya).

I have explored both xlit (the transliteration model) and indictransv2 (the translation model); it feels like the latter does well and is quite reliable.

Here are two examples :

xlit : ಶ್ರೀ. ಬಿ. ಆರ್. ಗೋವಿಂದಯ್ಯ --> shri. bi. aar. govindyya [does not do well for initials consistently]
indictrans2 : ಶ್ರೀ. ಬಿ. ಆರ್. ಗೋವಿಂದಯ್ಯ --> Mr. B.R. Govindaiah

But indictrans2, being a translation model, goes wrong in cases where the name has a meaning in the dictionary. For example:
indictrans2 : ಸೂರ್ಯ --> The Sun [wrong] Surya [Expected]

Having said this, here are my queries:

Q1. Is there any reliable way (like giving a prompt phrase) you'd suggest for the indictransv2 model to also cover the above case?
Q2. I have noticed a toggle button on the demo page that says "transliteration", but I could not see any changes when I switch it on or off. Can you explain what it is for and how we can trigger it programmatically?

Getting an error while importing from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

The same code was working about a week back, but now I get this error when running the code using Modal Labs remote GPUs.

code

import modal
stub = modal.Stub()


volume = modal.NetworkFileSystem.persisted("data")
MODEL_DIR = "/data"

@stub.function( cpu=2, memory = 4276, gpu = 'A10G', timeout=1200, network_file_systems={MODEL_DIR: volume})
def loadIndicTrans2(dataset_name):
    import time
    start_time = time.time()

    import os 
    import subprocess
    
    commands = [
    "pip install -q bitsandbytes",
    "apt update ", 
    "apt install -y git",
    "git clone https://github.com/AI4Bharat/IndicTrans2"
    ]
    for command in commands:
        subprocess.run(command, shell=True)

    os.chdir("IndicTrans2/huggingface_interface")
    subprocess.run("bash install.sh", shell=True)


    with open('importIndic.py', 'w') as file:
        file.write(f'''
try:
    import torch
    import os
    import pandas as pd
    import csv
    print(torch.cuda.get_device_name(0))
    import sys
    from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
    print('from transformers imported')
    from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
    print('from indictranstokenizer imported')
    
    en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # ai4bharat/indictrans2-en-indic-dist-200M
    
    BATCH_SIZE = 4
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    
    if len(sys.argv) > 1:
        quantization = sys.argv[1]
    else:
        quantization = ""
    
    
    def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
        if quantization == "4-bit":
            qconfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
            )
        elif quantization == "8-bit":
            qconfig = BitsAndBytesConfig(
                load_in_8bit=True,
                bnb_8bit_use_double_quant=True,
                bnb_8bit_compute_dtype=torch.bfloat16,
            )
        else:
            qconfig = None
    
        tokenizer = IndicTransTokenizer(direction=direction)
        model = AutoModelForSeq2SeqLM.from_pretrained(
            ckpt_dir,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            quantization_config=qconfig,
        )
    
        if qconfig == None:
            model = model.to(DEVICE)
            model.half()
        model.eval()
        return tokenizer, model
    
    def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
        translations = []
        for i in range(0, len(input_sentences), BATCH_SIZE):
            batch = input_sentences[i : i + BATCH_SIZE]
    
            batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)
    
            inputs = tokenizer(
                batch,
                src=True,
                truncation=True,
                padding="longest",
                return_tensors="pt",
                return_attention_mask=True,
            ).to(DEVICE)
    
            with torch.no_grad():
                generated_tokens = model.generate(
                    **inputs,
                    use_cache=True,
                    min_length=0,
                    max_length=256,
                    num_beams=5,
                    num_return_sequences=1,
                )
    
            generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)
    
            translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)
            del inputs
            torch.cuda.empty_cache()
        return translations

    
    ip = IndicProcessor(inference=True)
    en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, "en-indic", quantization)


    from datasets import load_dataset
    dataset_name = '{dataset_name}'
    if(dataset_name == "ai2_arc"):
        possible_configs = [
        'ARC-Challenge',
        'ARC-Easy'
        ]
        # columns to translate
        columns = ['question','choices']
        # columns not to translate, to keep in converted dataset as is.
        columns_asis = ['id','answerKey']

    dataset = []
    if(dataset_name == 'ai2_arc'):
        for config in possible_configs:
            base_url = 'https://huggingface.co/api/datasets/allenai/ai2_arc/parquet/{{config}}'
            data_files = {{'train': base_url + '/train/0.parquet','test':base_url + '/test/0.parquet', 'validation': base_url + '/validation/0.parquet'}}
            dataset_slice = load_dataset('parquet', data_files=data_files)
            dataset.append(dataset_slice)

    
except Exception as e:
    # Handle the exception
    print('An error occurred:'+ str(e))
        ''')
    result = subprocess.run(['python', 'importIndic.py'], stdout=subprocess.PIPE)


@stub.local_entrypoint()
def main():
    # provide dataset name among ai2_arc, gsm8k, lukaemon/mmlu
    dataset_name = "ai2_arc"
    
    loadIndicTrans2.remote(dataset_name)

The error says: An error occurred: [Errno 2] No such file or directory: '/usr/local/lib/python3.11/site-packages/RESOURCES/script/all_script_phonetic_data.csv'

Difference in inference scripts

I noticed that there are different scripts for inference:

  1. /scripts/normalize_regex_inference.py
  2. /inference/normalize_regex_inference.py

From the paper I understood that <dnt> and </dnt> tags are added to avoid translating the text in between, but the 2nd script seems to handle this differently, by adding <ID>. Is there any reason for this?

Model download getting disconnected frequently

Trying to download the large archive containing the model results in frequent disconnections. I'm working around this by using wget -c in the short term.

--2023-06-09 23:06:56--  (try:10)  https://indictrans2-public.objectstore.e2enetworks.net/it2_deployment_ckpts/indic-en-deploy.zip
Connecting to indictrans2-public.objectstore.e2enetworks.net (indictrans2-public.objectstore.e2enetworks.net)|101.53.136.18|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 12539123688 (12G), 5043405889 (4.7G) remaining [application/zip]
Saving to: ‘indic-en-deploy.zip’

ValueError when running HF inference

ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'configuration_indictrans.IndicTransConfig'> and you passed <class 'transformers_modules.ai4bharat.indictrans2-en-indic-dist-200M.f7f37e522d6612a10cbc8563af6820434e854047.configuration_indictrans.IndicTransConfig'>. Fix one of those so they match!

Not able to run example.py for translations

Hi,

I am not able to run example.py for translations; I am getting the below error. I am able to port the checkpoint to HF.

(itv2_hf) venkata.kancherla@LM0004097 huggingface_inference % python3 example.py
Traceback (most recent call last):
File "/Users/venkata.kancherla/Documents/My Projects/Indic Search queries/Translit_models/IndicTrans2/IndicTrans2/huggingface_inference/example.py", line 127, in
en_translations = batch_translate(
File "/Users/venkata.kancherla/Documents/My Projects/Indic Search queries/Translit_models/IndicTrans2/IndicTrans2/huggingface_inference/example.py", line 76, in batch_translate
generated_tokens = model.generate(
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/transformers/generation/utils.py", line 1593, in generate
model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/transformers/generation/utils.py", line 742, in _prepare_encoder_decoder_kwargs_for_generation
model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/venkata.kancherla/.cache/huggingface/modules/transformers_modules/ai4bharat/indictrans2-indic-en-1B/3cef3ffbd7fda581c930985163498f5ab8885121/modeling_indictrans.py", line 727, in forward
layer_outputs = encoder_layer(
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/venkata.kancherla/.cache/huggingface/modules/transformers_modules/ai4bharat/indictrans2-indic-en-1B/3cef3ffbd7fda581c930985163498f5ab8885121/modeling_indictrans.py", line 381, in forward
hidden_states = self.self_attn_layer_norm(hidden_states)
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 196, in forward
return F.layer_norm(
File "/Users/venkata.kancherla/opt/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/functional.py", line 2543, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

I see some posts suggesting adding --skip-torch-cuda-test, but I am not sure where to add it. Can you please help?
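The "LayerNormKernelImpl not implemented for 'Half'" error usually means float16 ops are being run on a CPU (the Mac environment in the traceback has no CUDA), and --skip-torch-cuda-test appears to be a flag from an unrelated project. A minimal sketch of the usual fix, keeping fp16 only when CUDA is available:

# A hedged sketch: fp16 LayerNorm is unsupported on CPU, so choose the dtype
# from the device instead of hard-coding torch.float16.
import torch
from transformers import AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-indic-en-1B",  # repo from the traceback above
    trust_remote_code=True,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
model = model.to(device).eval()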
