Ukuxhumana

"Ukuxhumana" means "communicate" in Zulu. This project explores ideas for applying Neural Machine Translation (NMT) to low-resource languages. The current focus is the official languages of South Africa, but we are looking for collaborators across the continent to work with us on other languages.

Mission

  • Provide a centralized repository for known datasets for African NMT and other NLP applications.
  • Provide pretrained state-of-the-art models for African languages.
  • Lower the barrier to NMT research for African languages by providing code, data, and models.
  • Spur collaboration across the continent to work on these problems together.

Data

Parallel Corpuses

Our parallel corpuses come from the Autshumato project. The datasets contain data translated by professional translators, data sourced as translated file pairs from translators, and data obtained from government websites and documents. We also performed extra cleaning on the corpuses, which is described here.
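To give a feel for what cleaning a parallel corpus can involve, here is a minimal sketch. It is not the project's actual cleaning procedure (that is documented separately); the function name `clean_parallel` and the thresholds `max_ratio` and `max_len` are hypothetical choices made for illustration only.

```python
def clean_parallel(pairs, max_ratio=2.0, max_len=100):
    """Filter (source, target) sentence pairs with simple heuristics.

    Hypothetical illustration only: the thresholds are assumptions,
    not the values used for the corpuses in this project.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Normalise runs of whitespace to single spaces.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop overly long sentences.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ too much, a common sign of
        # misalignment or of one side carrying extra information.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

A length-ratio filter of this kind is one plausible way to catch pairs where the translation says noticeably more than the source sentence.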

Monolingual Corpuses

Our monolingual corpuses come from a variety of sources. We used them to train fastText embeddings, which are also used in Unsupervised NMT.

Zulu

English

  • WMT 2014

Known Corpuses

We keep a list of known corpuses for African languages here. Please consider contributing a link to your corpus :)

Models

Two main architectures are currently used throughout this project: Convolutional Sequence to Sequence by Gehring et al. (2017) and Transformer by Vaswani et al. (2017). They were implemented with Fairseq(-py) and Tensor2Tensor respectively. For the Convolutional Sequence to Sequence models, a model was trained for each language using byte-pair encoding (BPE) for tokenisation. The learning rate was set to 0.25 and dropout to 0.2. Beam search with a width of 5 was used to decode the test data.
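For readers unfamiliar with BPE, the sketch below shows the core idea: repeatedly merge the most frequent adjacent symbol pair in a word-frequency vocabulary. It is a toy illustration under stated assumptions, not the tooling used in this project; the function names and the `</w>` end-of-word marker are choices made here for the example.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from word frequencies."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


def segment(word, merges):
    """Split a word into subword units by applying merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)
```

For example, `learn_bpe({"ab": 3, "abc": 2}, 2)` learns the merges `("a", "b")` and then `("ab", "</w>")`. The number of merges controls the subword vocabulary size, which is why it is worth tuning per language for small corpuses.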

The original Tensor2Tensor implementation of Transformer was used. The learning rate was set to 0.4, with a batch size of 1024, and a learning rate warm-up of 45000 steps. Tokenisation was done using WordPiece. Beam search with width 4 was used for decoding.
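To illustrate what a learning rate warm-up does, the sketch below implements the schedule from Vaswani et al. (2017): the rate rises linearly for `warmup_steps` steps and then decays with the inverse square root of the step number. Treating the 0.4 above as a multiplicative `scale` and using `d_model=512` are assumptions made for this example; the exact way Tensor2Tensor combines these settings may differ.

```python
def transformer_lr(step, d_model=512, warmup_steps=45000, scale=0.4):
    """Warm-up then inverse-square-root decay, after Vaswani et al. (2017).

    `scale` and `d_model` are illustrative assumptions, not confirmed
    settings of the models reported here.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate peaks at the end of warm-up and falls off afterwards.
early = transformer_lr(1000)
peak = transformer_lr(45000)
late = transformer_lr(90000)
```

A long warm-up such as 45000 steps keeps early updates small, which can help stabilise training when the corpus is small and noisy.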

Results

Results are given in BLEU.
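As a reminder of what BLEU measures, here is a minimal corpus-level sketch: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It assumes a single reference per sentence, whitespace tokenisation, and no smoothing, so it will not reproduce the scores below exactly; those would normally be computed with a standard evaluation tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            # Counter intersection clips each n-gram count to the reference count.
            matches[n - 1] += sum((h_ngrams & r_ngrams).values())
            totals[n - 1] += sum(h_ngrams.values())
    # Without smoothing, any zero precision makes the whole score zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)
```

Because precision is computed against the reference n-grams, a reference that contains more information than the source sentence makes it hard for even a good translation to score well.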

Baseline

English -> Language

Model Setswana isiZulu* Northern Sotho Xitsonga Afrikaans
Google Translate 7.55 41.181
Convolutional Seq2Seq (clean) 24.18 0.28 7.41 36.96 16.17
Convolutional Seq2Seq (best BPE) 26.36 (40k) 1.79 (4k) 12.18 (4k) 37.45 (20k) 25.04 (4k)
Transformer (uncased) 33.53 3.33 24.16 (4k) 49.74 (20k) 35.26 (4k)
Transformer (cased) 33.12 3.16 (4k) 23.77 (4k) 49.30 (20k) 34.81 (4k)
Unsupervised MT (60K BPE) 4.45

* The Zulu data requires cleaning. Translations often contain more information than the original sentence, leading to poor BLEU scores.

Autshumato Machine Translation Benchmark

Model                  Afrikaans  isiZulu  Northern Sotho  Setswana  Xitsonga
Convolutional Seq2Seq  12.30      0.52     7.41            10.31     10.73
Transformer            20.60      1.34     10.94           15.60     17.98

Publications & Citations

Benchmarking Neural Machine Translation for Southern African Languages

A Focus on Neural Machine Translation for African Languages

Towards Neural Machine Translation for African Languages
