ad-rdat's Introduction

AD-RDAT

Implementation of the paper "Improving Arabic Diacritization with Regularized Decoding and Adversarial Training", published at ACL 2021.

Citation

@inproceedings{qin-etal-2021-improving,
    title = "Improving Arabic Diacritization with Regularized Decoding and Adversarial Training",
    author = "Qin, Han and Chen, Guimin and Tian, Yuanhe and Song, Yan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    pages = "534--542",
}

Requirements

Our code works with Python 3.8 and requires the following packages: scikit-learn (sklearn) and PyTorch.

It also requires the PyTorch versions of two pre-trained language models: multilingual BERT and AraBERT.

Usage

See the commands in run.sh to train a model on the small sample data.

ad-rdat's Issues

Clarify Dataset Format

Hi

Thanks for the awesome work!

I want to know how to adapt my custom dataset to work with training. What is the expected format?

My dataset consists of lines of diacritized Arabic text, one sentence per line.

I tried to create a dataset converter using the following script, but it fails each time with a different error.

from argparse import ArgumentParser
from io import StringIO
from pathlib import Path
from diacritization_evaluation.util import extract_haraqat
from train_main import BUCKWALTER_MAP

LABELS = {'a', 'i', 'o', 'u', 'K', 'F', 'N', '~', '~a', '~i', '~u', '~K', '~F', '~N'}
DIAC_LABEL_MAP = {
    v: k
    for (k, v) in BUCKWALTER_MAP.items()
    if k in LABELS
}

DIAC_LABEL_MAP[""] = "#"
DIAC_LABEL_MAP["َّ"] = "~a"
DIAC_LABEL_MAP["ِّ"] = "~i"
DIAC_LABEL_MAP["ُّ"] = "~u"
DIAC_LABEL_MAP["ًّ"] = "~F"
DIAC_LABEL_MAP["ٍّ"] = "~K"
DIAC_LABEL_MAP["ٌّ"] = "~N"

for k in list(DIAC_LABEL_MAP.keys()):
    if "ّ" in k:
        rk = "".join(reversed(k))
        DIAC_LABEL_MAP[rk] = DIAC_LABEL_MAP[k]

def main():
    parser = ArgumentParser()
    parser.add_argument("input")
    parser.add_argument("output")

    args = parser.parse_args()

    input_text = Path(args.input).read_text(encoding="utf-8").splitlines()

    output = StringIO()
    for line in input_text:
        _, chars, diacs = extract_haraqat(line)
        for c, d in zip(chars, diacs):
            if c.isspace():
                output.write("[Sep]\t-")
            else:
                diac = DIAC_LABEL_MAP[d]
                output.write(f"{c}\t{diac}")
            output.write("\n")
        output.write("\n")

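    # Path.write_text() only accepts newline= from Python 3.10 onwards;
    # open() takes it on Python 3.8 as well.
    with open(args.output, "w", encoding="utf-8", newline="\n") as f:
        f.write(output.getvalue())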

if __name__ == '__main__':
    main()

Best
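
When a converter like the one above fails with varying errors, it can help to check the character/diacritic split and the label lookup in isolation. The following is a minimal sketch that uses only the standard library. The function names (`split_diacritics`, `to_label`, `to_rows`) are hypothetical, and the row layout (tab-separated character and label, `[Sep]` for spaces, `#` for no diacritic) is copied from the script above; it is an assumption, not a confirmed AD-RDAT input format. The diacritic symbols follow the standard Buckwalter transliteration.

```python
# Buckwalter symbols for the eight Arabic diacritics (U+064B..U+0652).
BUCKWALTER = {
    "\u064B": "F",  # fathatan
    "\u064C": "N",  # dammatan
    "\u064D": "K",  # kasratan
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0651": "~",  # shadda
    "\u0652": "o",  # sukun
}
SHADDA = "\u0651"


def split_diacritics(line):
    """Return (base_char, diacritics) pairs for one line of diacritized text.

    Diacritics are attached to the base character they follow, in the order
    they appear. Diacritics with no preceding base character are dropped.
    """
    pairs = []
    for ch in line:
        if ch in BUCKWALTER:
            if pairs:
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def to_label(marks, empty="#"):
    """Map a diacritic string to a Buckwalter label such as 'a' or '~a'.

    The '#' label for "no diacritic" is an assumption taken from the script
    above. Shadda is moved to the front so that both orderings found in real
    text (shadda+vowel and vowel+shadda) produce the same label.
    """
    if not marks:
        return empty
    ordered = sorted(marks, key=lambda m: m != SHADDA)  # stable sort
    return "".join(BUCKWALTER[m] for m in ordered)


def to_rows(line):
    """Build one 'char<TAB>label' row per character (assumed layout)."""
    rows = []
    for base, marks in split_diacritics(line):
        if base.isspace():
            rows.append("[Sep]\t-")
        else:
            rows.append(f"{base}\t{to_label(marks)}")
    return rows


if __name__ == "__main__":
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E"  # k, t, b, each with fatha
    print("\n".join(to_rows(sample)))
```

Because `to_label` normalizes the order of shadda and the vowel, no reversed-key entries are needed in the lookup table, and a label can only be missing if the input contains a code point outside U+064B..U+0652.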
