arbml / arbml

Implementation of many Arabic NLP and CV projects. Providing real time experience using many interfaces like web, command line and notebooks.

Home Page: https://arbml.github.io/ARBML/Interfaces/Website/

License: MIT License

Languages: JavaScript 56.01%, Jupyter Notebook 37.78%, HTML 3.51%, CSS 2.56%, Python 0.13%

Topics: colab, demo, arabic, arabic-nlp

arbml's Introduction

Motivation

Machine learning has proven its importance in many fields, such as computer vision, NLP, reinforcement learning, and adversarial learning. Unfortunately, there is little work to make machine learning accessible to Arabic-speaking people.

Goal

Our goal is to enrich Arabic content by creating open-source projects and opening the community's eyes to the significance of machine learning. We want to create interactive applications that allow novice Arabic speakers to learn more about machine learning and appreciate its advances.

Challenges

Arabic has many features that make it harder to process than many other languages. First, Arabic is written right to left. Second, it contains many letters that most non-native speakers find hard to pronounce, such as ض ، غ ، ح ، خ، ظ. Moreover, Arabic uses special characters called diacritics that help readers pronounce words correctly. For instance, the statement السَّلامُ عَلَيْكُمْ وَرَحْمَةُ اللَّهِ وَبَرَكَاتُهُ contains diacritics after most of its letters. Special rules determine which diacritic a given character takes, and these rules form an entire field of study called النَّحْوُ الْعَرَبِيُّ (Arabic grammar). Unlike in English, the letters of an Arabic word are mostly connected, as in اللغة, and the word is difficult to read when they are disconnected, as in ا ل ل غ ة. Finally, as many as half a billion people speak Arabic, which has resulted in many dialects across different countries.
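In Unicode, the diacritics described above are stored as separate combining characters that follow the letter they mark. The following is a minimal sketch, using only the Python standard library, of how they can be counted or stripped. The function names are illustrative and are not taken from the ARBML code.

```python
import unicodedata


def count_diacritics(text: str) -> int:
    """Count combining marks, which include the Arabic diacritics (tashkeel)."""
    return sum(1 for ch in text if unicodedata.combining(ch))


def strip_diacritics(text: str) -> str:
    """Return the text with every combining mark removed."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


greeting = "السَّلامُ عَلَيْكُمْ"
print(count_diacritics(greeting))   # number of diacritic characters
print(strip_diacritics(greeting))   # same words, letters only
```

Stripping diacritics this way is also how undiacritized model inputs can be derived from diacritized text, with the removed marks serving as the targets to predict.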

Procedure

Our procedure is general and can be applied to many languages, not just Arabic. This standardized approach proceeds in multiple steps, starting with training on Colab and then porting the models to the web.

Models

Name | Description
--- | ---
Arabic Diacritization | Simple RNN model ported from Shakkala
Arabic2English Translation | seq2seq with Attention
Arabic Poem Generation | CharRNN model with multinomial distribution
Arabic Words Embedding | N-Grams model ported from Aravec
Arabic Sentiment Classification | RNN with Bidirectional layer
Arabic Image Captioning | Encoder-Decoder architecture with attention
Arabic Word Similarity | Embedding layers using cosine similarity
Arabic Digits Classification | Basic RNN model with classification head
Arabic Speech Recognition | Basic signal processing and classification
Arabic Object Detection | SSD Object detection model
Arabic Poems Meter Classification | Bidirectional GRU
Arabic Font Classification | CNN
Arabic Text Detection | Optical Character Recognition (OCR)
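
The poem generation model is described above as a CharRNN with a multinomial distribution. The sketch below shows what that sampling step might look like, assuming the network produces one unnormalized score (logit) per character. The function name, the temperature parameter, and the toy scores are assumptions made for illustration, not the project's actual code.

```python
import math
import random
from typing import Mapping, Optional


def sample_next_char(
    logits: Mapping[str, float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Draw one character from the multinomial defined by softmax(logits / temperature)."""
    rng = rng or random.Random()
    chars = list(logits)
    scaled = [logits[c] / temperature for c in chars]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(chars, weights=weights, k=1)[0]


# Hypothetical scores for the next character after some prefix.
toy_logits = {"ا": 2.0, "ل": 1.5, "م": 0.2}
print(sample_next_char(toy_logits, temperature=0.8, rng=random.Random(0)))
```

Sampling instead of always taking the highest-scoring character keeps generated text from repeating itself. A lower temperature makes the output more conservative, and a higher one makes it more varied.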

Datasets

Name | Description
--- | ---
Arabic Digits | 70,000 images (28x28) converted to binary from Digits
Arabic Letters | 16,759 images (32x32) converted to binary from Letters
Arabic Poems | 146,604 poems scraped from aldiwan
Arabic Translation | 100,000 parallel Arabic-to-English translations ported from OpenSubtitles
Product Reviews | 1,648 reviews on products ported from Large Arabic Resources For Sentiment Analysis
Image Captions | 30,000 image paths with captions extracted and translated from COCO 2014
Arabic Wiki | 4,670,509 words cleaned and processed from Wikipedia Monolingual Corpora
Arabic Poem Meters | 55,440 verses with their associated meters collected from aldiwan
Arabic Fonts | 516 images (100×100) for two classes

Tools

To make models easily accessible to contributors, developers and novice users, we use two approaches:

Google Colab

Google Colaboratory is a free service offered by Google for research purposes. The interface of a Colab notebook is very similar to a Jupyter notebook, with slight differences. Google offers three hardware options for running and speeding up training: CPU, GPU and TPU. We use the GPU almost all the time because it is easier to work with and achieves good results in a reasonable time. Check this great tutorial on Medium.

TensorFlow.js

TensorFlow.js is part of the TensorFlow ecosystem that supports training and inference of machine learning models in the browser. Please check these steps if you want to port models to the web:

  1. Use Keras to train the model, then save it with model.save('keras.h5')

  2. Install the TensorFlow.js converter using pip install tensorflowjs

  3. Convert the saved model by running tensorflowjs_converter --input_format keras keras.h5 model/

  4. The model directory will contain the file model.json and weight files with names like group1-shard1of1

  5. Finally you can load the model using TensorFlow.js

Check this tutorial that I made for the complete procedure.
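
The conversion step in the list above can also be scripted. The sketch below assumes the converter was installed with pip install tensorflowjs and that keras.h5 already exists. The helper names are hypothetical, and the command is only built and printed here, so the snippet can be read and tested without TensorFlow installed.

```python
import shlex
import shutil
import subprocess
from typing import List


def build_converter_command(h5_path: str = "keras.h5", output_dir: str = "model/") -> List[str]:
    """Return the tensorflowjs_converter invocation from step 3 as an argument list."""
    return ["tensorflowjs_converter", "--input_format", "keras", h5_path, output_dir]


def convert(h5_path: str = "keras.h5", output_dir: str = "model/") -> None:
    """Run the converter, failing early with a clear message if it is not installed."""
    cmd = build_converter_command(h5_path, output_dir)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("tensorflowjs_converter not found; run `pip install tensorflowjs` first")
    subprocess.run(cmd, check=True)


# Show the exact command without executing it.
print(shlex.join(build_converter_command()))
```

Calling convert() after training writes model.json and the weight shards into model/, which can then be served as static files alongside the web page.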

Website

We developed many models that run directly in the browser. With TensorFlow.js, the models run on the client's GPU. Because the webpage is static and the models run on the client, your input stays in the browser, which avoids most privacy and security concerns. You can visit the website here.

The models added so far:

Poems Generation

English Translation

Words Embedding

Sentiment Classification

Image Captioning

Diacritization

Contribution

Check CONTRIBUTING.md for a detailed explanation of how to contribute.

Resources

To start, we will use GitHub to host the website, models, datasets and other content. Unfortunately, there is a limit on storage space that will haunt us in the future. Please let us know what you suggest on that matter.

Contributors

Thanks goes to these wonderful people (emoji key):


MagedSaeed

🎨 🤔 📦

March Works

🤔

Mahmoud Aslan

🤔 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

Citation

@inproceedings{alyafeai-al-shaibani-2020-arbml,
    title = "{ARBML}: Democritizing {A}rabic Natural Language Processing Tools",
    author = "Alyafeai, Zaid  and
      Al-Shaibani, Maged",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.2",
    pages = "8--13",
}

arbml's People

Contributors

abdulelahsm, allcontributors[bot], bsa10, forzagreen, magedsaeed, mhmoodlan, zaidalyafeai

arbml's Issues

A model to add punctuation marks to arabic text

The idea here is to use a machine learning model to automatically add punctuation marks to Arabic text. Those marks are as follows:

!
،
.
?
؛
:
- -
- 
/

I do not have a dataset for this. However, I think the Tashkeela dataset would be a good fit since it contains a large body of Arabic text that is mostly punctuated.
Any thoughts?
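
One way to turn already-punctuated text into training data is to strip the marks and keep them as labels. The sketch below is only an illustration of that idea: it assumes whitespace tokenization, uses the marks listed above, and all names in it are hypothetical.

```python
from typing import List, Tuple

# Marks from the list above; \u060c is the Arabic comma and \u061b the Arabic semicolon.
PUNCTUATION = set("!.?:-/") | {"\u060c", "\u061b"}


def make_training_pair(text: str) -> Tuple[List[str], List[str]]:
    """Split punctuated text into words and, per word, the marks that follow it ("" if none)."""
    words: List[str] = []
    labels: List[str] = []
    for token in text.split():
        trailing = ""
        while token and token[-1] in PUNCTUATION:
            trailing = token[-1] + trailing
            token = token[:-1]
        if token:
            words.append(token)
            labels.append(trailing)
        elif labels:
            # A stand-alone mark is attached to the previous word.
            labels[-1] += trailing
    return words, labels


words, labels = make_training_pair("مرحبا، كيف الحال?")
print(list(zip(words, labels)))
```

A sequence model can then be trained to predict each label from the unpunctuated words, in the same way diacritization predicts a mark for each character.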

[Suggestion] Adding Arabic Letter ع to ARBML logo

Hey, so the idea is to include ع in the logo. I just think it will make the logo much more distinct.

I made a sketch to flex my drawing skills (warning: I don't know how to draw). The ع I drew here includes رب so it looks like عرب. It's a silly suggestion but I thought personalizing the logo will make it more unique.

Word Root

Given a word get its root. For example the root of الانسحاب is سحب.

Add 'Arabic Font Classification'

Although it is not explicitly stated, this repo looks NLP-focused, so let me know if this is out of scope.
I'd like to add Arabic Font Classification to the list of projects. This project is about tackling the visual font recognition problem for Arabic fonts by synthesizing data and addressing domain mismatch challenges. It features a post discussing the main ideas with a demo and an open-sourced codebase and dataset.

Word2Vec Model

Given a word find the top k similar words in the embedding.
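
A minimal sketch of that lookup follows, assuming the embeddings are available as a plain dictionary from word to vector. The three-dimensional toy vectors are made up for illustration, where a real model such as Aravec would supply much larger ones.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(
    word: str, embeddings: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k words whose vectors are closest to the query word's vector."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors: king, queen, apple.
toy_embeddings = {
    "ملك": [0.9, 0.1, 0.0],
    "ملكة": [0.8, 0.3, 0.0],
    "تفاح": [0.0, 0.1, 0.9],
}
print(top_k_similar("ملك", toy_embeddings, k=2))
```

For a full vocabulary, the same ranking is usually computed with one matrix product over normalized vectors rather than a Python loop.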

Grammar fixer model

Given a grammatically incorrect statement, output the corrected one.

PS: I don't even know if it was done in other languages

Sneak Peek

Contributing

Check the issues first so that we don't duplicate work that is already in progress. If nobody has taken on your idea, create an issue with a proper title. After finishing your work, create a pull request to merge your contribution. Initially, these are the projects we want to work on:

Models

Create and train models using TensorFlow, PyTorch, etc.

  • Letters and digits recognition (done)
  • Diacritization (تشكيل) (done)
  • Poem Generation (done)
  • Translation (In progress)
  • Sentiment Analysis
  • Text Auto-Complete
  • Manuscript Reader
  • Part of speech classification (اسم , فعل ، حرف)
  • Word Embedding (done)

Interfaces

Make the models accessible through

  • Web demos (In progress)
  • Colab and Jupyter notebooks (In progress)
  • Command line
  • Desktop applications
  • Gist codes
  • Articles en/ar related to Arabic.

Datasets

  • Arabic Digits (done)
  • Arabic Letters (done)

Extra

We can also contribute to other projects. For instance, help translate this cheat sheet to Arabic.

Community

This is an open community where we don't force members to do anything. As long as you are motivated and have the passion to help the community grow, you are always welcome. More importantly, plan your ideas well and take your time before committing yourself to a project.

Code of Ethics

We are an open community. So, if you reuse a model/code/tutorial please make sure to mention the original authors and check the copyright notice.

Sponsors

We are not sponsored and all the members are voluntarily working on the projects.

License

This project will be under the MIT license.

AraBert

Hey All,

This is a temporary issue to discuss the idea of training a BERT model specifically for Arabic: AraBert. We will move to another repository once we have a clear understanding of the problem and how to tackle it, and have experts in the field on board. There are three main points to discuss here, and I want your opinions on them:

  1. Dataset: We can use the available Wiki dumps. If you have more suggestions, let us know.
  2. Models: We can use the available BERT models.
  3. Training: We need a sponsor to train in the cloud. If you can help with that, please let us know.

If you have any comments on any of the above points, don't hesitate to respond. More importantly, if you have any experience in training large models, let us know. Open source contribution is a slow and lengthy process, and it might take us some time to finish the project. If you are planning to contribute, please be sure to set aside some time for this project.

Thanks
Zaid

Adding Part-Of-Speech Tagging model

This is a new Feature 👍

I worked on an LSTM model for POS-tagger (you can read about it Here)
I'm willing to enhance the model with pre-trained word2vec then add it here.

I'll open a PR as soon as I start!

Fixing The URLs for Demo

Thank you for making this project for the Arabic language. As mentioned in the title, the links in the table of demos do not work in README.md and README_AR.md. I am opening this issue to get permission before I send a PR.

From:
https://zaidalyafeai.github.io/...
To new url of the website:
https://arbml.github.io/...

Updating README_AR.md

I thought I should kick off my contribution to this project by updating the docs. I noticed that README_AR.md needs some updates, and since this project is all about Arabic NLP and its primary audience is Arabic-speaking people, I think updating this document is essential when presenting the project to newcomers. I'm available to work on this task if the team hasn't worked on it yet.

Text Auto-Complete

Given a statement in Arabic, predict the next word. This issue is open to newcomers. If anyone wants to work on this, let us know.
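
As a possible starting point for newcomers, next-word prediction can be prototyped with simple bigram counts before moving to a neural model. This is a baseline sketch under that assumption, not the approach the project has settled on, and the function names and toy corpus are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List


def train_bigrams(sentences: Iterable[str]) -> DefaultDict[str, Counter]:
    """Count, for every word, which words follow it in the training sentences."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts: DefaultDict[str, Counter], word: str, k: int = 1) -> List[str]:
    """Return up to k of the most frequent words seen after the given word."""
    if word not in counts:
        return []
    return [candidate for candidate, _ in counts[word].most_common(k)]


# Hypothetical toy corpus.
corpus = ["السلام عليكم ورحمة الله", "السلام عليكم جميعا"]
model = train_bigrams(corpus)
print(predict_next(model, "السلام"))
```

The same interface (a statement in, ranked candidate words out) can later be backed by an RNN or a larger n-gram model without changing the calling code.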
