This project includes the development of two API endpoints and the training of three models from scratch for language detection and translation. Of course, I could have downloaded pretrained or fine-tuned models, but I like the building process and the learning that comes with it.
- pip install -r requirements.txt
  PS: if you face a problem with torch, you probably need to install it from the PyTorch website according to your device. In my case it was the pip installation with GPU support: https://pytorch.org/get-started/locally/
- Run main.py. This starts the uvicorn server with the endpoints.
- Open a browser and navigate to the following localhost address for the Swagger UI: http://127.0.0.1:8080/docs
- For the language detection model, you must send data the model was trained on, and it must be whole words. The reason is that the model only recognizes the vocabulary it saw during training; it will not recognize anything else.
- For the language translation model, I'll pass some sentences I trained the model on, as I didn't train it on all the pairs.
- If you pass any words that are not in the model's vocabulary, you will get a KeyError. Also, the two translation models don't share the same vocabulary, so don't expect one sentence to be translated in both directions. Again, the models can be trained further on the whole dataset, and results do improve.
- The model weights are loaded when the main.py script starts.
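The KeyError mentioned above comes from looking up a word that is missing from the vocabulary. Here is a minimal sketch of checking a sentence before sending it, assuming the vocabulary is a word-to-index dictionary. The names `word2index`, `find_unknown_words`, and `encode`, as well as the toy vocabulary, are hypothetical and not taken from this repo.

```python
# Hypothetical toy vocabulary; the real models build theirs from the training data.
word2index = {"i": 0, "read": 1, "his": 2, "book": 3}


def find_unknown_words(sentence, vocab):
    """Return the whitespace-separated words of `sentence` that are not in `vocab`."""
    return [word for word in sentence.split() if word not in vocab]


def encode(sentence, vocab):
    """Map each word to its index; raises KeyError on an out-of-vocabulary word."""
    return [vocab[word] for word in sentence.split()]


print(find_unknown_words("i read his book", word2index))  # []
print(find_unknown_words("i read her book", word2index))  # ['her']
print(encode("i read his book", word2index))              # [0, 1, 2, 3]
```

Checking for unknown words first makes it possible to return a readable message instead of letting the lookup fail.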
- 'تقول انك تتعمد اخفاء مظهرك الحسن'
  'you are saying you intentionally hide your good looks '
- 'i didnt see the need for it'
  "لا ارى لذلك حاجة "
- ?!!!!!!? -> this should return unknown
- 'i read his book'
  "انا اقرا كتابه "
- 'im sure that she will come back soon'
  "انا متاكد من انها ستعود قريبا "

PS: you can find more test cases, which I evaluated after training, at the end of this README.
- Models are included in two folders:
  - language detection model
  - language translation models: includes two models, one for each language direction: ara->eng / eng->ara
- config holds the paths and other configuration for the whole project
- I included the training notebooks if you want to take a look at what I did
- model translation architecture contains the architecture of the translation model
- I made two utils for translation, one for English-Arabic and one for Arabic-English, because the vocabularies are different
- The data is also included in case you want to take a look at it.
The language detection model is a word-level model that achieved an accuracy of around 97%. It was trained on a dataset containing multiple languages, using a stratified split to maintain the ratio of each language in the training and testing sets. The API endpoint returns the detected language and the time taken for the request.
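To illustrate what a stratified split does, here is a plain-Python sketch on toy data. It is not the code from the training notebooks, which may well use a library helper such as scikit-learn's `train_test_split` with its `stratify` argument. The function name `stratified_split` and the sample words are made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.2, seed=0):
    """Split (sample, label) pairs so each label keeps roughly the same ratio in both sets."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test


# Toy data: 10 English words and 5 Arabic words (a 2:1 ratio).
samples = [f"en_word_{i}" for i in range(10)] + [f"ar_word_{i}" for i in range(5)]
labels = ["English"] * 10 + ["Arabic"] * 5

train, test = stratified_split(samples, labels, test_ratio=0.2)
print(len(train), len(test))  # 12 3
```

The test set ends up with 2 English and 1 Arabic sample, the same 2:1 ratio as the full data.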
The language translation model is also word-level. It is built with PyTorch as a sequence-to-sequence network with attention. There are two models, one for Arabic to English and one for English to Arabic, trained with a negative log-likelihood loss and evaluated with the BLEU score. The API endpoint checks the language and passes the sentence to the appropriate model.
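The routing step can be sketched as follows. Note the swap: the sketch decides the direction with a simple Unicode script check rather than with the trained detection model, and the two "models" are stand-in callables. All names here (`detect_direction`, `translate`, `models`) are hypothetical.

```python
def detect_direction(sentence):
    """Guess the translation direction from the script of the letters in `sentence`."""
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        return "unknown"
    # U+0600..U+06FF is the Unicode "Arabic" block.
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return "ara->eng" if arabic > len(letters) / 2 else "eng->ara"


def translate(sentence, models):
    """Pass the sentence to the model registered for the detected direction."""
    direction = detect_direction(sentence)
    if direction == "unknown":
        return "unknown"
    return models[direction](sentence)


# Stand-in callables; the real project would call its two trained models here.
models = {
    "ara->eng": lambda s: f"[ara->eng model] {s}",
    "eng->ara": lambda s: f"[eng->ara model] {s}",
}

print(detect_direction("i read his book"))  # eng->ara
print(detect_direction("انت امي"))  # ara->eng
print(translate("?!!!!!!?", models))  # unknown
```

Keeping the two models in a dictionary keyed by direction means each one can keep its own vocabulary and utilities.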
The models were trained on word-level data. Character-level models could perform better, but they would require more training time and resources.
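One way to see where the extra cost comes from: the same sentence becomes a much longer sequence at character level, so a recurrent model has to take more steps per sentence. A small illustration using one of the test sentences:

```python
sentence = "im sure that she will come back soon"

word_tokens = sentence.split()  # word-level: one token per whole word
char_tokens = list(sentence)    # character-level: one token per character, spaces included

print(len(word_tokens))  # 8
print(len(char_tokens))  # 36
```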
The language detection model was trained in approximately 3 hours, while the translation models took longer due to the complexity of the task and the use of RNNs with attention. The translation models were trained on a subset of the data: language pairs were filtered to keep sentences that start with certain words, such as "This" or "هذه", and that do not exceed a maximum number of words.
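The filtering criteria could look roughly like the sketch below. The prefix tuples only contain the two examples named above, and `MAX_WORDS = 10` is a placeholder, since the actual prefix list and limit are not stated here. The function name `keep_pair` and the first and last sample pairs are also made up for illustration.

```python
MAX_WORDS = 10              # hypothetical limit; the value used in training is not stated
ENG_PREFIXES = ("this ",)   # only the example named in the text; the full list is not given
ARA_PREFIXES = ("هذه ",)    # only the example named in the text; the full list is not given


def keep_pair(eng, ara, max_words=MAX_WORDS):
    """Keep a pair only if both sides are short enough and one side starts with an allowed prefix."""
    short_enough = len(eng.split()) <= max_words and len(ara.split()) <= max_words
    allowed_start = eng.lower().startswith(ENG_PREFIXES) or ara.startswith(ARA_PREFIXES)
    return short_enough and allowed_start


pairs = [
    ("this is my car", "هذه سيارتي"),                              # kept
    ("he is a man of wit", "هو رجل حكمة"),                         # dropped: no allowed prefix
    ("this is " + "very " * 12 + "long", "هذه جملة طويلة"),        # dropped: too many words
]

filtered = [pair for pair in pairs if keep_pair(*pair)]
print(len(filtered))  # 1
```

Restricting the data this way keeps the vocabulary small, which is why sentences outside the filtered subset may fail at inference time.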
The next picture is the normalized confusion matrix from the language detection model training. More details can be found in the training notebooks folder.
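For readers unfamiliar with the term, a normalized confusion matrix shows fractions instead of raw counts. The sketch below assumes normalization by row (per true label), which is a common convention; the notebook may normalize differently. The counts are toy numbers, not results from this project.

```python
def normalize_rows(matrix):
    """Divide each row by its sum so every row shows per-class fractions."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Toy counts, not results from this project: rows are true labels, columns are predictions.
counts = [
    [48, 2],  # true English: 48 predicted English, 2 predicted Arabic
    [5, 45],  # true Arabic: 5 predicted English, 45 predicted Arabic
]

for row in normalize_rows(counts):
    print(row)
# [0.96, 0.04]
# [0.1, 0.9]
```

With row normalization, the diagonal entries read directly as per-language recall.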
The language detection and translation models perform well on the subset of data used, but further training and resources would be required for generalization.
i dont think tom would want to do that < لا اظنن توم يريد فعل ذلك
i read his book < انا اقرا كتابه على الاطلاق
im sure that she will come back soon < انا متاكد من انها ستعود قريبا
im really hungry < انا جايع جدا في الصباح
i cant see anything < لا استطيع ابتكار ارى
i dont know when hell be here < لا اعرف متى سيكون هنا
حظه يسبق ذكاءه < he is more lucky than clever
انك لست طالبا < you are not a student
تقول انك تتعمد اخفاء مظهرك الحسن < you are saying you intentionally hide your good looks
انه قلق بسبب مرض والده < he is concerned about his fathers illness
انت امي < you are my mother
تحبني كل عايلتي < i am loved by all my family
انا من الاكوادور < i am from ecuador
هو رجل حكمة < he is a man of wit