Language Detection and Translation

This project includes two API endpoints and three models trained from scratch for language detection and translation. Of course I could have downloaded pretrained or fine-tuned models, but I like the building process and the learning that comes with it.


Usage

pip install -r requirements.txt. PS: if you run into a problem with torch, you probably need to install it from the PyTorch website according to your device; in my case it was the pip installation with GPU support: https://pytorch.org/get-started/locally/
Run main.py; this starts a uvicorn server with the endpoints.
Open a browser and navigate to the following localhost address for the Swagger UI: http://127.0.0.1:8080/docs
For the language detection model, you must send words the model was trained on, and they must be whole words. The model only knows the vocabulary it saw during training; it will not recognize anything else.
For the language translation model, I list below some sentences the model was trained on, since I didn't train it on all the pairs.
If you pass any words that are not in the model's vocabulary, it returns a key error. Also, the two translation models don't share the same vocabulary, so don't expect one sentence to be translated in both directions. Again, the models can be trained further on the whole dataset, and results do improve.
The model weights are loaded when the main.py script starts.
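
The key error mentioned above comes from looking up a word that is missing from the vocabulary dictionary. Here is a minimal sketch of that behaviour and of one way to guard against it. The names `word2index`, `encode`, and `safe_encode` are hypothetical and the toy vocabulary is made up for illustration; the real vocabularies are built in the training notebooks.

```python
# Hypothetical toy vocabulary; the project's real vocabularies are much larger.
word2index = {"i": 0, "read": 1, "his": 2, "book": 3}


def encode(sentence, vocab):
    """Map each whitespace-separated word to its index.

    Raises KeyError as soon as a word is not in the vocabulary.
    """
    return [vocab[word] for word in sentence.split()]


def safe_encode(sentence, vocab):
    """Return (indices, unknown_words) instead of raising on unseen words."""
    indices, unknown = [], []
    for word in sentence.split():
        if word in vocab:
            indices.append(vocab[word])
        else:
            unknown.append(word)
    return indices, unknown


print(encode("i read his book", word2index))       # [0, 1, 2, 3]
print(safe_encode("i read her book", word2index))  # ([0, 1, 3], ['her'])
```

Checking the sentence with something like `safe_encode` before calling the model would let an endpoint report the unknown words instead of failing.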

Test cases for translation model:

'تقول انك تتعمد اخفاء مظهرك الحسن'

'you are saying you intentionally hide your good looks '

'i didnt see the need for it'

"لا ارى لذلك حاجة "

?!!!!!!? -> this should return unknown

'i read his book'

"انا اقرا كتابه "

'im sure that she will come back soon'

"انا متاكد من انها ستعود قريبا "

PS: you can find more test cases, evaluated after training, at the end of this README.

Project structure

Models are included in two folders:
1. language detection model
2. language translation models -> includes two models, one for each direction: ara->eng / eng->ara
config holds paths and other configuration for the whole project.
I included the training notebooks if you want to take a look at what I did.
model translation architecture contains the architecture of the translation model.
I made two utils for translation, one for English - Arabic and one for Arabic - English, because the vocabularies are different.
The data is also included in case you want to take a look at it.

Language Detection Model

The language detection model is a word-level model that achieved an accuracy of around 97%. It was trained on a dataset containing multiple languages and uses a stratified split to maintain the ratio of each language in the training and testing sets. The API endpoint returns the detected language and the time taken for the request.
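
To illustrate what a stratified split does, here is a small pure-Python sketch. It is not the code from the notebooks (which may well use a library helper instead); the function name `stratified_split`, the 20% test ratio, and the toy words are assumptions for the example.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.2, seed=0):
    """Split samples so each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)  # per-label test size
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test


# Toy data: 10 "English" words and 5 "Arabic" words (placeholders, not real data).
words = [f"en{i}" for i in range(10)] + [f"ar{i}" for i in range(5)]
langs = ["English"] * 10 + ["Arabic"] * 5

train, test = stratified_split(words, langs, test_ratio=0.2)
print(len(train), len(test))  # 12 3
```

With a plain random split, a rare language could end up almost absent from the test set; splitting per label avoids that.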

Language Translation Model

The language translation model is also a word-level model. It is built with the PyTorch deep learning framework and uses a sequence-to-sequence network with attention. There are two models, one for Arabic to English and one for English to Arabic, trained with a negative log-likelihood loss and evaluated with BLEU score. The API endpoint checks the language and passes the sentence to the appropriate model.
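
The routing step can be sketched as follows. How the real endpoint checks the language is not described here, so this example swaps in a simple stand-in heuristic: it looks for characters in the basic Arabic Unicode block. The function names `is_arabic` and `route` and the returned direction strings are hypothetical.

```python
def is_arabic(sentence):
    """True if any character falls in the basic Arabic Unicode block (U+0600 to U+06FF)."""
    return any("\u0600" <= ch <= "\u06FF" for ch in sentence)


def route(sentence):
    """Pick the translation direction for a sentence (assumed heuristic, not the project's code)."""
    return "ara->eng" if is_arabic(sentence) else "eng->ara"


print(route("انا اقرا كتابه"))   # ara->eng
print(route("i read his book"))  # eng->ara
```

In the actual API the chosen direction would then select which of the two translation models, each with its own vocabulary, receives the sentence.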

The models were trained on word-level data. While character-level models could perform better, reaching word-level understanding with them would require more training time and resources.
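
One reason for the extra cost is sequence length. The short sketch below tokenizes the same sentence both ways; the variable names are only for illustration.

```python
sentence = "i read his book"

# Word-level: one token per whitespace-separated word.
word_tokens = sentence.split()

# Character-level: one token per character, spaces included.
char_tokens = list(sentence)

print(len(word_tokens))  # 4
print(len(char_tokens))  # 15
```

A recurrent model has to step through every token, so the character-level version of this sentence needs almost four times as many steps. In exchange, it never meets an out-of-vocabulary word, since any input is made of characters it has seen.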

Training Details

The language detection model was trained in approximately 3 hours, while the translation models took longer due to the complexity of the task and the use of RNNs with attention. The translation models were trained on a subset of the data: language pairs were filtered to keep sentences that start with specific words, such as "This" or "هذه", and that stay under a maximum number of words.
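
A filter of this kind can be sketched as below. The exact prefix lists and the length limit used in training are not stated here, so `MAX_WORDS`, `EN_PREFIXES`, `AR_PREFIXES`, and `keep_pair` are all placeholders chosen for the example.

```python
MAX_WORDS = 10            # assumed limit, not the value used in training
EN_PREFIXES = ("this",)   # placeholder; the real list has more entries
AR_PREFIXES = ("هذه",)    # placeholder; the real list has more entries


def keep_pair(en, ar, max_words=MAX_WORDS):
    """Keep a pair if both sides are short enough and one side starts with an allowed word."""
    en_words, ar_words = en.lower().split(), ar.split()
    if not en_words or not ar_words:
        return False
    short_enough = len(en_words) <= max_words and len(ar_words) <= max_words
    starts_ok = en_words[0] in EN_PREFIXES or ar_words[0] in AR_PREFIXES
    return short_enough and starts_ok


pairs = [
    ("This is my book", "هذا كتابي"),
    ("i read his book", "انا اقرا كتابه"),
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
print(len(filtered))  # 1
```

Filtering like this shrinks both the dataset and the vocabulary, which keeps training time manageable at the cost of coverage.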

The next picture is the normalized confusion matrix from the language detection model training. More details can be found in the training notebooks folder.
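
For readers unfamiliar with the term, a normalized confusion matrix can be computed as in this sketch. It assumes normalization per true class (each row sums to 1), which may differ from the notebook's choice; the function name and the toy labels are made up for the example.

```python
def normalized_confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels; each row is divided by its total."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        counts[index[truth]][index[pred]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


labels = ["Arabic", "English"]
y_true = ["Arabic", "Arabic", "English", "English", "English", "English"]
y_pred = ["Arabic", "English", "English", "English", "English", "Arabic"]

print(normalized_confusion_matrix(y_true, y_pred, labels))
# [[0.5, 0.5], [0.25, 0.75]]
```

The diagonal then reads as per-language recall, which makes languages with few samples comparable to languages with many.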

Conclusion

The language detection and translation models perform well on the subset of data used, but further training and resources would be required for generalization.

Some more test cases from the translator models :

English -> Arabic

i dont think tom would want to do that < لا اظنن توم يريد فعل ذلك

i read his book < انا اقرا كتابه على الاطلاق

im sure that she will come back soon < انا متاكد من انها ستعود قريبا

im really hungry < انا جايع جدا في الصباح

i cant see anything < لا استطيع ابتكار ارى

i dont know when hell be here < لا اعرف متى سيكون هنا

Arabic -> English

حظه يسبق ذكاءه < he is more lucky than clever

انك لست طالبا < you are not a student

تقول انك تتعمد اخفاء مظهرك الحسن < you are saying you intentionally hide your good looks

انه قلق بسبب مرض والده < he is concerned about his fathers illness

انت امي < you are my mother

تحبني كل عايلتي < i am loved by all my family

انا من الاكوادور < i am from ecuador

هو رجل حكمة < he is a man of wit

Repository: amr-abdellatif / end-to-end-language-detection---translator
