Comments (8)

bmix commented on May 17, 2024

Please make it handle any XML input, like https://www.deepl.com/docs-api/handling-xml/

It should be about the same amount of work but would cover far more input formats, such as DocBook, TEI, and DITA. HTML(5) could be serialized to XML, so it would already be covered.

Thanks.

from argos-translate.

pierotofy commented on May 17, 2024

+1 on this one; I've started implementing https://github.com/pierotofy/discourse-translator/tree/libre (a plugin for translating Discourse forum discussions), and the translation input coming from the software is HTML. This seems like a recurring use case.

PJ-Finlay commented on May 17, 2024

I agree this is not nearly as easy as it seems. My initial thought was to just parse the XML and replace the content of each pair of tags with translated values, but since there can be tags within sentences, this doesn't work:

<p>I use <a href="https://www.google.com">Google</a> every day.</p>

As the article @pierotofy linked says, I think this requires custom models. Hopefully we can support this at some point, but for now adding more language models is a higher priority.
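
To illustrate the problem with the naive per-tag approach, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper name `text_fragments` is hypothetical and is not part of Argos Translate; it only lists the text chunks that a translator working node by node would receive.

```python
import xml.etree.ElementTree as ET


def text_fragments(xml_source):
    """Return the text chunks a naive per-node translator would see."""
    root = ET.fromstring(xml_source)
    fragments = []
    for element in root.iter():
        # Text directly inside the element, before its first child.
        if element.text:
            fragments.append(element.text)
        # Text that follows the element's closing tag.
        if element.tail:
            fragments.append(element.tail)
    return fragments


fragments = text_fragments(
    '<p>I use <a href="https://www.google.com">Google</a> every day.</p>'
)
print(fragments)
```

The sentence arrives as three separate pieces ("I use ", "Google", " every day."), so each piece would be translated without the context of the others, and any word reordering across the link boundary is lost.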

PJ-Finlay commented on May 17, 2024

To get this production ready we would need to:

  • Train a new model using this data.
  • Write the code to break up text, run inference on it, and rebuild the XML structure.
  • Generate data for other languages.
  • Train new models with tag data.

I'm currently planning to do few-shot translation with an API model provider and then come back to this. Since model training is time-consuming and expensive, I'm planning to train all the new models at once for Argos Translate 2.0, together with other potentially breaking changes like removing the tokenizer. If anyone is interested in working on this, we could train a test model and try running inference with it before scaling up to more languages.
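
The "break up text, run inference, rebuild the XML" step from the list above could look roughly like the sketch below. It assumes a tag-aware model is available behind a callable; `translate_tagged`, `fake_translate`, `inner_xml`, `set_inner_xml`, and `translate_document` are all hypothetical names for illustration, not Argos Translate APIs, and the choice of `p` as the only block-level tag is also an assumption.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def inner_xml(element):
    """Serialize an element's content (text plus child markup) as one string."""
    parts = [escape(element.text or "")]
    for child in element:
        # ET.tostring includes the child's tail text by default.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def set_inner_xml(element, markup):
    """Replace an element's content with parsed markup, keeping its attributes."""
    wrapper = ET.fromstring(f"<wrapper>{markup}</wrapper>")
    for child in list(element):
        element.remove(child)
    element.text = wrapper.text
    element.extend(list(wrapper))


def translate_document(xml_source, translate_tagged, block_tags=("p",)):
    """Translate each block-level element as one segment, inline tags included."""
    root = ET.fromstring(xml_source)
    # Collect the blocks first so the tree is not modified while iterating.
    blocks = [el for el in root.iter() if el.tag in block_tags]
    for block in blocks:
        set_inner_xml(block, translate_tagged(inner_xml(block)))
    return ET.tostring(root, encoding="unicode")


def fake_translate(segment):
    # Stand-in for a tag-aware model (assumption): swaps one phrase, keeps tags.
    return segment.replace("every day", "jeden Tag")


source = '<doc><p>I use <a href="https://www.google.com">Google</a> every day.</p></doc>'
translated = translate_document(source, fake_translate)
print(translated)
```

The sketch only works if the model returns well-formed markup for every segment; a real implementation would need a fallback for output that fails to parse.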

PJ-Finlay commented on May 17, 2024

https://github.com/argosopentech/translate-html

pierotofy commented on May 17, 2024

I looked into this a bit; it's not trivial to do correctly. This article covers the problem best: https://iconictranslation.com/2020/12/issue-112-translating-markup-tags-in-neural-machine-translation/

In short, the best approach seems to require training a model by injecting tags in the training data.

Full paper: http://www.statmt.org/wmt20/pdf/2020.wmt-1.138.pdf
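
As a rough illustration of what "injecting tags in the training data" could mean, the sketch below wraps an aligned span in the same placeholder tag on both sides of a sentence pair. This is an assumption about the general idea, not the exact procedure from the paper; the function names, the `x` tag, and the hand-picked token spans are all hypothetical, and a real pipeline would need word alignments to choose the spans.

```python
def inject_tag(tokens, start, end, tag="x"):
    """Wrap tokens[start:end] in an opening and closing tag token."""
    return tokens[:start] + [f"<{tag}>"] + tokens[start:end] + [f"</{tag}>"] + tokens[end:]


def tag_pair(src, tgt, src_span, tgt_span, tag="x"):
    """Inject the same tag around aligned spans of a source/target pair."""
    src_tagged = inject_tag(src.split(), *src_span, tag=tag)
    tgt_tagged = inject_tag(tgt.split(), *tgt_span, tag=tag)
    return " ".join(src_tagged), " ".join(tgt_tagged)


# Spans are (start, end) token indices, chosen by hand for this example.
src_tagged, tgt_tagged = tag_pair(
    "I use Google every day .",
    "Ich benutze Google jeden Tag .",
    src_span=(2, 3),
    tgt_span=(2, 3),
)
print(src_tagged)
print(tgt_tagged)
```

A model trained on pairs like these would, in principle, learn to carry the tag across to the matching position in the translation.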

PJ-Finlay commented on May 17, 2024

wintercounter commented on May 17, 2024

As a start, it would be nice if it at least left the markup unmodified; currently the markup is completely unusable after translation. The same goes for Markdown.
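
One possible stopgap for keeping the markup intact is to swap tags for placeholders before translation and restore them afterwards. The sketch below uses only Python's `re` module; `mask_tags`, `unmask_tags`, and the `__TAGn__` placeholder format are hypothetical, and whether a given model passes such placeholders through untouched is an assumption that would need testing.

```python
import re

# Simplified tag matcher: breaks on ">" inside attribute values.
TAG_PATTERN = re.compile(r"<[^>]+>")
PLACEHOLDER_PATTERN = re.compile(r"__TAG(\d+)__")


def mask_tags(markup):
    """Replace each tag with a numbered placeholder and return both parts."""
    tags = []

    def stash(match):
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"

    return TAG_PATTERN.sub(stash, markup), tags


def unmask_tags(text, tags):
    """Put the original tags back in place of their placeholders."""
    return PLACEHOLDER_PATTERN.sub(lambda m: tags[int(m.group(1))], text)


original = '<p>I use <a href="https://www.google.com">Google</a> every day.</p>'
masked, tags = mask_tags(original)
print(masked)
# The masked string would be sent to the translator here.
restored = unmask_tags(masked, tags)
print(restored)
```

This keeps the tags byte-for-byte identical, but it does not solve the context problem discussed earlier in the thread: the model still has to place the placeholders sensibly in the translated sentence.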

