Improving Bilingual Lexicon Induction with Cross-Encoder Reranking (Findings of EMNLP 2022). Keywords: Bilingual Lexicon Induction, Word Translation, Cross-Lingual Word Embeddings.

License: MIT License

Improving Bilingual Lexicon Induction with Cross-Encoder Reranking

This repository is the official PyTorch implementation of the following paper:

Yaoyiran Li, Fangyu Liu, Ivan Vulić, and Anna Korhonen. 2022. Improving Bilingual Lexicon Induction with Cross-Encoder Reranking. In Findings of the Association for Computational Linguistics: EMNLP 2022. [arXiv]

BLICEr is a post-hoc reranking method that works in synergy with any given Cross-lingual Word Embedding (CLWE) space to improve Bilingual Lexicon Induction (BLI) / Word Translation. It is applicable to any existing CLWE induction method, such as ContrastiveBLI, RCSLS, and VecMap. Our method 1) creates a cross-lingual word similarity dataset comprising positive word pairs (i.e., true translations) and hard negative pairs induced from the original CLWE space, and then 2) fine-tunes an mPLM (e.g., mBERT or XLM-R) in a cross-encoder manner to predict the similarity scores. At inference, we 3) combine the similarity score from the original CLWE space with the score from the BLI-tuned cross-encoder.
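Step 1 can be illustrated with a minimal sketch, which is not the repository's actual code. It assumes the CLWE space is given as plain dictionaries mapping words to vectors, that similarity is cosine similarity, and that hard negatives are the nearest target-side neighbours of a source word that are not its true translations. The function names, the toy vectors, and the row layout are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_dataset(src_emb, tgt_emb, seed_pairs, num_negatives=1):
    """Return rows of (source word, target word, CLWE similarity, is_translation).

    Assumption: hard negatives are the top-ranked target words in the CLWE
    space that are not gold translations of the source word.
    """
    # A source word may have several gold translations.
    gold = {}
    for src, tgt in seed_pairs:
        gold.setdefault(src, set()).add(tgt)

    rows = []
    for src, translations in gold.items():
        sims = {t: cosine(src_emb[src], vec) for t, vec in tgt_emb.items()}
        # Positive pairs: the true translations.
        for tgt in sorted(translations):
            rows.append((src, tgt, sims[tgt], True))
        # Hard negatives: nearest neighbours that are not true translations.
        ranked = sorted(sims, key=sims.get, reverse=True)
        negatives = [t for t in ranked if t not in translations][:num_negatives]
        for tgt in negatives:
            rows.append((src, tgt, sims[tgt], False))
    return rows


# Toy 2-dimensional "CLWE space" (made-up numbers for illustration only).
src_emb = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
tgt_emb = {
    "Hund": [0.9, 0.1],
    "Katze": [0.1, 0.9],
    "Wolf": [0.8, 0.3],
    "Haus": [-1.0, 0.0],
}
seed_pairs = [("dog", "Hund"), ("cat", "Katze")]

rows = build_similarity_dataset(src_emb, tgt_emb, seed_pairs, num_negatives=1)
for row in rows:
    print(row)
```

In this toy space, "Wolf" is selected as the hard negative for both source words because it is the closest target word that is not a gold translation. Rows like these could then serve as training pairs for the cross-encoder in step 2.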

As reported in our paper, BLICEr is tested in four different BLI setups:

  • Supervised, 5k seed translation pairs

  • Semi-supervised, 1k seed translation pairs

  • Unsupervised, 0 seed translation pairs

  • Zero-shot, 0 translation pairs directly between the source and target languages; instead, seed pairs are assumed between each of them and a third language (non-overlapping)

Dependencies:

  • PyTorch >= 1.10.1
  • Transformers >= 4.15.0
  • Python >= 3.9.7
  • Sentence-Transformers >= 2.1.0
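As a convenience, the minimum versions listed above can be checked before running anything. The snippet below is only a sketch and is not part of the repository; it assumes the PyPI distribution names `torch`, `transformers`, and `sentence-transformers`, and uses a deliberately simple version parser (the helper names are hypothetical).

```python
import re
import sys
from importlib import metadata

# Minimum versions from the dependency list; the keys are assumed PyPI names.
MINIMUMS = {
    "torch": "1.10.1",
    "transformers": "4.15.0",
    "sentence-transformers": "2.1.0",
}


def parse(version):
    """Turn a string such as '1.10.1+cu113' into the tuple (1, 10, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def check_dependencies(minimums=MINIMUMS):
    """Return a list of human-readable problems (empty if all is fine)."""
    problems = []
    if sys.version_info < (3, 9, 7):
        problems.append("Python >= 3.9.7 is required")
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if parse(installed) < parse(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_dependencies():
        print(problem)
```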

Get Data and Set Input/Output Directories:

Following ContrastiveBLI, our data are obtained from XLING (8 languages, 56 BLI directions in total) and PanLex-BLI (15 lower-resource languages, 210 BLI directions in total); please refer to ContrastiveBLI for data preprocessing details.

BLICEr is compatible with any CLWE backbone. For brevity, our demo here is based on the state-of-the-art ContrastiveBLI 300-dim C1 CLWEs, which are derived from purely static fastText embeddings (ContrastiveBLI also provides even stronger 768-dim C2 CLWEs, which are trained with both fastText and mBERT). Please modify the input/output directories accordingly when using different CLWEs.

Run the Code:

python run_all.py

Output: source->target and target->source P@1 scores for each λ value in [0, 0.01, 0.02, ..., 0.99, 1.0].
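To make the λ sweep concrete, here is a small sketch of how such an output could be produced for one translation direction. It is not the code in `run_all.py`. It assumes the two scores are combined by linear interpolation, with λ = 0 using only the CLWE score and λ = 1 using only the cross-encoder score; the README only states that the scores are combined and that λ ranges over [0, 1]. The function names and toy scores are hypothetical.

```python
def combine(clwe_scores, ce_scores, lam):
    """Interpolate the two score tables (assumed form: (1 - λ)·CLWE + λ·CE)."""
    return {
        src: {
            tgt: (1 - lam) * clwe_scores[src][tgt] + lam * ce_scores[src][tgt]
            for tgt in clwe_scores[src]
        }
        for src in clwe_scores
    }


def precision_at_1(scores, gold):
    """Fraction of source words whose top-scored candidate is a gold translation."""
    hits = sum(
        1 for src, cands in scores.items() if max(cands, key=cands.get) in gold[src]
    )
    return hits / len(scores)


# Toy candidate scores (made-up numbers): the CLWE space wrongly prefers
# "Wolf" for "dog", while the cross-encoder prefers the gold "Hund".
clwe = {"dog": {"Hund": 0.60, "Wolf": 0.65}, "cat": {"Katze": 0.70, "Hund": 0.20}}
ce = {"dog": {"Hund": 0.90, "Wolf": 0.30}, "cat": {"Katze": 0.80, "Hund": 0.10}}
gold = {"dog": {"Hund"}, "cat": {"Katze"}}

# λ grid matching the README: 0, 0.01, ..., 1.0
lambdas = [i / 100 for i in range(101)]
results = {lam: precision_at_1(combine(clwe, ce, lam), gold) for lam in lambdas}

for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam:.2f}  P@1={results[lam]:.2f}")
```

In this toy example the CLWE scores alone get one of the two test words right, and adding enough weight on the cross-encoder score fixes the remaining one. With real data, the two score types may sit on different scales, so some normalization before interpolation may be needed.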

Citation:

Please cite our paper if you find BLICEr useful. If you like our work, please ⭐ this repo.

@inproceedings{li-etal-2022-improving-bilingual,
    title     = {Improving Bilingual Lexicon Induction with Cross-Encoder Reranking},
    author    = {Li, Yaoyiran and Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2022},
    year      = {2022}
}
