
end-to-end-language-detection---translator's Introduction

Language Detection and Translation

This project includes two API endpoints and three models trained from scratch for language detection and translation. I could have downloaded pretrained or fine-tuned models, but I like the building process and the learning that comes with it.

My.Video2.mp4

Usage

  1. Run pip install -r requirements.txt. Note: if torch fails to install, you probably need to install it from the PyTorch website according to your device (in my case it was the pip installation with GPU support): https://pytorch.org/get-started/locally/

  2. Run main.py. This starts the uvicorn server with the endpoints.

  3. Open a browser and navigate to http://127.0.0.1:8080/docs for the Swagger UI.

  4. For the language detection model, you must send words the model was trained on, and they must be whole words. The model only recognizes the vocabulary it saw during training.

  5. For the language translation model, I list some sentences the model was trained on, as I didn't train it on all the pairs.

  6. If you pass any words that are not in the model's vocabulary, it returns a KeyError. Also, the two translation models don't share the same vocabulary, so don't expect one sentence to be translatable in both directions. Again, the models can be trained further on the whole dataset, and results do improve.

  7. The model weights are loaded when main.py starts.
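Steps 4 and 6 come down to a vocabulary lookup: every word in the request must already be a key in the model's word-to-index mapping. Below is a minimal sketch of checking a sentence before it reaches a model, assuming the vocabulary is a plain dict. The names `find_unknown_words`, `encode`, and `toy_vocab` are hypothetical and only for illustration; they are not taken from this repo's code.

```python
def find_unknown_words(sentence, word2index):
    """Return the words in `sentence` that are missing from the vocabulary."""
    # A word-level model splits on whitespace; each token must be a known key.
    return [word for word in sentence.lower().split() if word not in word2index]


def encode(sentence, word2index):
    """Map a sentence to vocabulary indices, failing early with a clear message."""
    unknown = find_unknown_words(sentence, word2index)
    if unknown:
        # Raising here replaces the bare KeyError a direct dict lookup would give.
        raise ValueError(f"words not in vocabulary: {unknown}")
    return [word2index[word] for word in sentence.lower().split()]


# Hypothetical toy vocabulary; the real one is built during training.
toy_vocab = {"i": 0, "read": 1, "his": 2, "book": 3}

print(encode("i read his book", toy_vocab))
print(find_unknown_words("i read her book", toy_vocab))
```

Running a check like this on the client side shows which word caused the failure before the request is sent.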

Test cases for translation model:

  1. 'تقول انك تتعمد اخفاء مظهرك الحسن' -> 'you are saying you intentionally hide your good looks'

  2. 'i didnt see the need for it' -> 'لا ارى لذلك حاجة'

  3. '?!!!!!!?' -> this should return unknown

  4. 'i read his book' -> 'انا اقرا كتابه'

  5. 'im sure that she will come back soon' -> 'انا متاكد من انها ستعود قريبا'

PS: you can find more test cases, which I evaluated after training, at the end of this README.

Project structure

  1. Models are included in two folders:
    1. language detection model
    2. language translation models, which includes two models, one for each direction: ara->eng / eng->ara
  2. config holds paths and other configuration for the whole project.
  3. I included the training notebooks if you want to take a look at what I did.
  4. model translation architecture contains the architecture of the translation model.
  5. I made two utils for translation, one for English - Arabic and one for Arabic - English, because the vocabularies are different.
  6. The data is also included in case you want to take a look at it.

Language Detection Model

The language detection model is a word-level model that achieved an accuracy of around 97%. It was trained on a dataset containing multiple languages and uses a stratified split to maintain the ratio of each language in the training and testing sets. The API endpoint returns the detected language and the time taken for the request.
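To illustrate what a stratified split does, here is a minimal standard-library sketch. It is not the code from the training notebooks (which may rely on a library helper); the function name `stratified_split`, the 20% test ratio, and the toy data are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.2, seed=0):
    """Split (sample, label) pairs so each label keeps roughly the same ratio in both sets."""
    # Group the samples by label so each language is split on its own.
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend((s, label) for s in group[:n_test])
        train.extend((s, label) for s in group[n_test:])
    return train, test


# Hypothetical toy data: 10 English words and 5 Arabic words.
words = [f"en{i}" for i in range(10)] + [f"ar{i}" for i in range(5)]
langs = ["English"] * 10 + ["Arabic"] * 5

train, test = stratified_split(words, langs, test_ratio=0.2)
print(len(train), len(test))
```

Because each language is split separately, the 2:1 English to Arabic ratio of the toy data is the same in the training and testing sets.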

Language Translation Model

The language translation model is also a word-level model. It is built with the PyTorch deep learning framework and uses a Sequence to Sequence Network with Attention. It includes two models, one for Arabic to English and one for English to Arabic, trained with a negative log-likelihood loss and evaluated with the BLEU score. The API endpoint checks the language and passes the sentence to the appropriate model.
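As a rough illustration of the training loss, the sketch below computes the average negative log-likelihood of the target words by hand. The function name and the probabilities are made up for the example; in PyTorch the same quantity is what `torch.nn.NLLLoss` computes when it is given log-probabilities, though the exact training code lives in the notebooks.

```python
import math


def negative_log_likelihood(predicted_probs, target_indices):
    """Average NLL of the target words under the predicted distributions."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_indices):
        # The loss for one step is -log of the probability given to the correct word.
        total += -math.log(probs[target])
    return total / len(target_indices)


# Two decoding steps over a hypothetical 3-word vocabulary.
probs = [
    [0.7, 0.2, 0.1],  # step 1: the correct word is index 0
    [0.1, 0.8, 0.1],  # step 2: the correct word is index 1
]
loss = negative_log_likelihood(probs, [0, 1])
print(round(loss, 4))
```

The loss approaches zero as the model puts more probability on the correct word at every step.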

The models were trained on word-level data. Character-level models could perform better while retaining word-level understanding, but they would require more training time and resources.

Training Details

The language detection model was trained in approximately 3 hours, while the translation models took longer due to the complexity of the task and the use of RNNs with Attention. The translation models were trained on a subset of the data: language pairs were filtered by specific criteria, keeping sentences that start with words like "This" or "هذه" and that do not exceed a maximum number of words.
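The filtering described above can be sketched as a simple predicate over sentence pairs. The prefixes, the length limit of 10 words, and the names `keep_pair`, `PREFIXES`, and `MAX_WORDS` are assumptions for illustration; the actual values are in the training notebooks.

```python
def keep_pair(source, target, prefixes, max_words):
    """Keep a pair if both sides are short enough and the source starts with an allowed prefix."""
    if len(source.split()) > max_words or len(target.split()) > max_words:
        return False
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return source.lower().startswith(tuple(prefixes))


# Hypothetical values, not the ones used for training.
PREFIXES = ("this ", "هذه ")
MAX_WORDS = 10

pairs = [
    ("this is my book", "هذا كتابي"),
    ("he went home", "ذهب الى البيت"),
    ("هذه سيارتي", "this is my car"),
]
filtered = [pair for pair in pairs if keep_pair(pair[0], pair[1], PREFIXES, MAX_WORDS)]
print(len(filtered))
```

A filter like this shrinks the dataset and the vocabulary, which shortens training but also explains why sentences outside the filtered subset can fail at inference time.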

The following picture is the normalized confusion matrix from the language detection model training. More details can be found in the training notebooks folder.

[Image: normalized confusion matrix]
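For readers unfamiliar with the term, a normalized confusion matrix divides the raw counts so that values become ratios. The sketch below assumes row normalization (each row is a true language, so each row sums to 1); the source does not say which normalization was used, and the counts are invented for the example.

```python
def normalize_rows(matrix):
    """Divide each row by its sum so every row shows per-class prediction ratios."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against empty rows (a class with no samples).
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Hypothetical counts: rows are true languages, columns are predicted languages.
counts = [
    [97, 3],
    [5, 95],
]
for row in normalize_rows(counts):
    print([round(value, 2) for value in row])
```

With row normalization, the diagonal entries read directly as the share of each language's words that were classified correctly.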

Conclusion

The language detection and translation models perform well on the subset of data used, but further training and resources would be required for generalization.

Some more test cases from the translator models :

English -> Arabic

i dont think tom would want to do that < لا اظنن توم يريد فعل ذلك

i read his book < انا اقرا كتابه على الاطلاق

im sure that she will come back soon < انا متاكد من انها ستعود قريبا

im really hungry < انا جايع جدا في الصباح

i cant see anything < لا استطيع ابتكار ارى

i dont know when hell be here < لا اعرف متى سيكون هنا


Arabic -> English

حظه يسبق ذكاءه < he is more lucky than clever

انك لست طالبا < you are not a student

تقول انك تتعمد اخفاء مظهرك الحسن < you are saying you intentionally hide your good looks

انه قلق بسبب مرض والده < he is concerned about his fathers illness

انت امي < you are my mother

تحبني كل عايلتي < i am loved by all my family

انا من الاكوادور < i am from ecuador

هو رجل حكمة < he is a man of wit

Contributors

amr-abdellatif
