Giter VIP home page Giter VIP logo

opusfilter's Introduction

OpusFilter

OpusFilter is a tool for filtering and combining parallel corpora.

Features:

  • Corpus preprocessing pipelines configured with YAML
  • Simple downloading of parallel corpora from OPUS with OpusTools
  • Implementations for many common text file operations on parallel files
  • Memory-efficient processing of large files
  • Implemented filters based e.g. on language identification, word aligment, n-gram language models, and multilingual sentence embeddings
  • Extendable with your own filters written in Python

OpusFilter has been presented in ACL 2020 system demonstrations.

Installing

Install the latest release from PyPI:

  • pip install opusfilter or pip install opusfilter[all] (include optional Python libraries)

Install from source:

  • pip install . or python setup.py install

Troubleshooting

OpusFilter should generally work fine on Python 3.6, 3.7, and 3.8. In the case of troubles, try installing the exact versions in requirements.txt:

  • pip install -r requirements.txt

Libraries that currently cause trouble:

  • pyhash
    • pyhash-0.9.3 requires setuptools==58 or below
  • fast-mosestokenizer
    • no PyPI packages for Python>=3.9

Documentation

The complete OpusFilter documentation is available from helsinki-nlp.github.io/OpusFilter.

You can also build the documents from the source:

  • pip install -r docs/requirements.txt or pip install .[docs]
  • sphinx-build docs docs-html

Changelog

A changelog is available in docs/CHANGELOG.md.

Citing

If you use OpusFilter in your research, please cite our ACL 2020 paper:

@inproceedings{aulamo-etal-2020-opusfilter,
    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156"
}

A full bibliography of papers cited in the documentation and code can be found from docs/references.bib.

Contributing

See docs/CONTRIBUTING.md.

opusfilter's People

Contributors

svirpioj avatar miau1 avatar brightxiaohan avatar markfraney avatar

Stargazers

Mohammad Amin Asadi avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.