Chiron

Chiron is a tool for aligning pre-modern and literary texts with translations in multiple languages.

Chironata (in progress)

  • Create an annotated dataset using LaBSE and Vecalign.
  • Code and data are saved in chironata.

Pipeline models

  1. LaBSE, Feng et al. (2020)
    • For embedding sentences
    • Associated file: build_labse_embeds.py, which uses the Hugging Face implementation
    • Input: text file to embed; output: 768-dimensional LaBSE embeddings in a binary file
    • LaBSE paper: Feng, F., Yang, Y., Cer, D.M., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT Sentence Embedding. Annual Meeting of the Association for Computational Linguistics.
  2. Vecalign, Thompson and Koehn (2019)
    • For aligning two texts embedded at the sentence level
    • Associated files: overlap.py, vecalign.py, score.py
    • Vecalign GitHub: https://github.com/thompsonb/vecalign
    • Vecalign paper: Thompson, B., & Koehn, P. (2019). Vecalign: Improved Sentence Alignment in Linear Time and Space. Conference on Empirical Methods in Natural Language Processing.
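The embedding output described above is a flat binary file, and a small sketch may make that layout concrete. It assumes the file holds raw 32-bit floats with no header, one 768-value vector after another; that layout is an assumption to check against build_labse_embeds.py. The helper names `write_embeddings` and `read_embeddings` are hypothetical and are not part of Chiron.

```python
from array import array

EMBED_DIM = 768  # LaBSE embedding size stated in this README


def write_embeddings(path, vectors):
    """Write vectors back to back as float32 values, with no header (assumed layout)."""
    with open(path, "wb") as handle:
        for vec in vectors:
            if len(vec) != EMBED_DIM:
                raise ValueError(f"expected {EMBED_DIM} values, got {len(vec)}")
            array("f", vec).tofile(handle)


def read_embeddings(path):
    """Read the flat file back into a list of EMBED_DIM-length lists."""
    values = array("f")
    with open(path, "rb") as handle:
        values.frombytes(handle.read())
    if len(values) % EMBED_DIM != 0:
        raise ValueError("file size is not a multiple of the embedding size")
    return [list(values[i:i + EMBED_DIM]) for i in range(0, len(values), EMBED_DIM)]
```

Under this assumption, a file with N embeddings is exactly N × 768 × 4 bytes, which gives a quick sanity check that the number of embeddings matches the number of lines in the text file that was embedded.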

Pipeline steps

  1. Build overlaps files (from Vecalign) for the source text and the translation
    • File to run: overlap.py
    • Input: source text or translation, segmented at the sentence level
    • Output: "concatenations of consecutive sentences", as explained on Vecalign's GitHub
  2. Build LaBSE embeddings of the overlaps files
    • File to run: build_labse_embeds.py
    • Input: overlaps text file
    • Output: binary file of 768-dimensional LaBSE embeddings, with one embedding per sentence concatenation in the overlaps file
  3. Align the source text and the translation (from Vecalign)
    • File to run: vecalign.py
    • Input: LaBSE embeddings
    • Output: sentence alignments written to stdout. For a detailed description of the output format, see Vecalign's GitHub.
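To illustrate what "concatenations of consecutive sentences" means in step 1, here is a simplified sketch. It is not Vecalign's overlap.py, which handles details such as padding and length limits that are omitted here. The function name `build_overlaps` and the parameter `max_overlap` are hypothetical.

```python
def build_overlaps(sentences, max_overlap=4):
    """Return the unique concatenations of 1..max_overlap consecutive sentences."""
    seen = set()
    for size in range(1, max_overlap + 1):
        # Slide a window of `size` sentences over the text.
        for start in range(len(sentences) - size + 1):
            seen.add(" ".join(sentences[start:start + size]))
    # Sorting and deduplicating keeps the overlaps file deterministic.
    return sorted(seen)


if __name__ == "__main__":
    print(build_overlaps(["a", "b", "c"], max_overlap=2))
    # prints ['a', 'a b', 'b', 'b c', 'c']
```

Each line of the resulting list is then embedded, so the aligner can score one-to-many and many-to-many candidates by looking up the embedding of the joined sentences.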

Evaluation

Using sentence-level ground truth

  • File to run: score_all.py
  • Includes three scoring functions:
    • Vecalign's original strict scores (Precision, Recall, F1). Does not include Vecalign's original lax scores.
    • Chiron's new lax scores (Precision, Recall, F1)
    • Chiron's new strict score (Accuracy only)
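As a rough illustration of strict scoring, the sketch below parses alignment lines in the `[source indices]:[target indices]:cost` form shown on Vecalign's GitHub and counts a predicted alignment as correct only when it matches a gold alignment exactly. It is a simplified stand-in, not the code in score_all.py or score.py, and it does not reproduce any special cases those scripts handle or Chiron's lax scores. The function names are hypothetical.

```python
import ast


def parse_alignments(text):
    """Parse lines such as '[6]:[6, 7, 8]:0.507506' into (source, target) index tuples."""
    alignments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The optional third field (the alignment cost) is ignored.
        src, tgt = line.split(":")[:2]
        alignments.append((tuple(ast.literal_eval(src)), tuple(ast.literal_eval(tgt))))
    return alignments


def strict_scores(predicted, gold):
    """Exact-match precision, recall, and F1 over alignment pairs."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A lax variant would typically give credit for partial overlap between predicted and gold alignments; how Chiron defines its lax scores is documented in score_all.py, not here.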

Chapter-level evaluation when sentence-level ground truth is not available

  • Example file: score_vec_rslts_chapter_level.ipynb
  • Example based on aligning Thucydides' The Peloponnesian War against a French translation

Testing Chiron

  • Caroline Craig, Kartik Goyal, Gregory R. Crane, Farnoosh Shamsian, and David A. Smith. Testing the limits of neural sentence alignment models on classical Greek and Latin texts and translations. In Computational Humanities Research Conference (CHR), 2023.
  • Code and data available in align_texts_projects

Installation

  1. To use LaBSE, see instructions on Hugging Face
  2. To use Vecalign, see list of dependencies on Vecalign's GitHub
