
LASERtrain

This package reproduces the architecture described by Artetxe and Schwenk (2018, 2019) to train language-agnostic sentence embeddings. The authors have released a large model covering 93 languages as part of the LASER project; however, the code used to train it remains unreleased. The code in this repository approximates the authors' original implementation, based on the description of the architecture and training parameters provided in their publications.

At the moment, the models produced with this software are not compatible with the models available in the LASER project; this limitation will be tackled in the near future.

The package includes instructions to reproduce the experiments described in Artetxe and Schwenk (2019) in which a model is trained on the UN v1.0 corpus and evaluated on the data released for the BUCC'18 shared task.

Requirements

The following packages are required to run this package and reproduce the reported results (an example installation command is given after the list):

  • Fairseq
  • Python
  • PyTorch
  • NumPy
  • SentencePiece
  • Faiss, for fast similarity search and bitext mining
  • jieba 0.39, a Chinese segmenter (pip install jieba)
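
Assuming a working Python 3 environment, the remaining dependencies can typically be installed with pip. The command below is only a sketch: the package names faiss-cpu and jieba==0.39 are assumptions (Faiss can also be installed through conda, and a GPU build is published as faiss-gpu), and versions other than jieba 0.39 are not pinned by the authors.

    # install the Python dependencies listed above (CPU build of Faiss assumed)
    pip install fairseq torch numpy sentencepiece faiss-cpu jieba==0.39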

Tutorial: train and evaluate your LASER model

In this section, we reproduce the experiments carried out by Artetxe and Schwenk (2019).

Download and prepare data

The first step is to download the datasets needed to train and evaluate our model. Two datasets are required:

  • UN v1.0 corpus: the multilingual corpus on which our model will be trained
  • BUCC 2018 shared task data: the training and test data for the BUCC 2018 shared task, which will be used to evaluate our model

Note that the UN corpus does not cover one of the language pairs in BUCC 2018: German-English. To deal with this, Artetxe and Schwenk (2019) train a second model on the Europarl multilingual corpus. We will not cover that second experiment in this tutorial, although the same steps described here could be applied to train it.

For BUCC 2018, download the 4 training data packages and the 4 test data packages to the data sub-directory. Once downloaded, uncompress each package with the command: tar xjf bucc2018-ru-en.test.tar.bz2
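
The snippet below is a small convenience loop, assuming all eight archives were downloaded to the data sub-directory and follow the bucc2018-*.tar.bz2 naming pattern of the example above:

    # uncompress every BUCC 2018 training and test package in data/
    cd data
    for archive in bucc2018-*.tar.bz2; do
        tar xjf "$archive"
    done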

Acknowledgements

Developed by Universitat d'Alacant as part of its contribution to the GoURMET project, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299.

References
