Comments (8)
Please make it handle any XML input, like https://www.deepl.com/docs-api/handling-xml/
It should be about the same amount of work, but would cover much more input, like DocBook, TEI, DITA, etc. HTML(5) could be serialized to XML, so it would already be included.
Thanks.
from argos-translate.
+1 on this one; I've started implementing https://github.com/pierotofy/discourse-translator/tree/libre (plugin for translating discourse forum discussions) and the translation input coming from the software is HTML. Seems like a recurring use case.
from argos-translate.
I agree this is not nearly as easy as it seems. My initial thought was to just parse the XML and replace the content of each pair of tags with translated values, but since tags can appear within sentences, this doesn't work:
<p>I use <a href="https://www.google.com">Google</a> every day.</p>
As the article @pierotofy linked says, I think this requires custom models. Hopefully we can support this at some point, but for now adding more language models is a higher priority.
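To make the problem concrete, here is a minimal sketch (not Argos Translate code) of what a per-tag translator would actually receive for the example above. It uses only Python's standard `xml.etree.ElementTree`; the helper name `text_segments` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def text_segments(markup):
    """Return the text fragments a naive per-tag translator would see, in document order."""
    root = ET.fromstring(markup)
    segments = []

    def walk(el):
        # Text directly after the opening tag of this element.
        if el.text and el.text.strip():
            segments.append(el.text)
        for child in el:
            walk(child)
            # Text that follows the child's closing tag, still inside this element.
            if child.tail and child.tail.strip():
                segments.append(child.tail)

    walk(root)
    return segments


fragments = text_segments(
    '<p>I use <a href="https://www.google.com">Google</a> every day.</p>'
)
print(fragments)  # ['I use ', 'Google', ' every day.']
```

Each fragment would be translated without the context of the others, so word order and agreement across the `<a>` boundary are lost.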
from argos-translate.
To get this production ready we would need to:
- Train a new model using this data.
- Write the code to break up text, run inference on it, and rebuild the XML structure.
- Generate data for other languages.
- Train new models with tag data.
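The "break up text, run inference, rebuild" step could look roughly like the sketch below. This is only an illustration under assumptions: `translate_fn` stands in for a hypothetical tag-aware model that accepts and returns a string with inline markup, `BLOCK_TAGS` is an arbitrary choice of block-level elements, and none of these names are part of Argos Translate.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: elements treated as translation units; adjust per document type.
BLOCK_TAGS = {"p", "li", "td", "h1", "h2", "h3"}


def inner_xml(el):
    """Serialize the content of el (text plus inline children) without el's own tags."""
    parts = [escape(el.text or "")]
    for child in el:
        # ET.tostring includes the child's tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def set_inner_xml(el, markup):
    """Replace the content of el with parsed markup; raises ET.ParseError if malformed."""
    wrapper = ET.fromstring(f"<wrapper>{markup}</wrapper>")
    for child in list(el):
        el.remove(child)
    el.text = wrapper.text
    el.extend(list(wrapper))


def is_leaf_block(el):
    """True for block elements that contain no other block elements."""
    return el.tag in BLOCK_TAGS and not any(
        d.tag in BLOCK_TAGS for d in el.iter() if d is not el
    )


def translate_document(markup, translate_fn):
    """Translate each leaf block as one unit and rebuild the document."""
    root = ET.fromstring(markup)
    # Collect targets first so the tree is not modified while iterating.
    for el in [e for e in root.iter() if is_leaf_block(e)]:
        set_inner_xml(el, translate_fn(inner_xml(el)))
    return ET.tostring(root, encoding="unicode")


def fake_translate(text):
    """Stand-in for a tag-aware model; it only swaps two fixed phrases."""
    return text.replace("I use", "J'utilise").replace("every day", "tous les jours")


doc = '<div><p>I use <a href="https://www.google.com">Google</a> every day.</p></div>'
print(translate_document(doc, fake_translate))
```

The design choice here is to hand the model a whole sentence with its inline tags intact, which only works if the model has been trained to carry tags through.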
I'm currently planning to do few-shot translation with an API model provider and then come back to this. Since model training is time-consuming and expensive, I'm planning to train new models all at once for Argos Translate 2.0, together with other potentially breaking changes like removing the tokenizer. If anyone is interested in working on this, we could train a test model and test running inference before scaling up to more languages.
from argos-translate.
https://github.com/argosopentech/translate-html
from argos-translate.
Looked a bit into this, it's not a trivial thing to do correctly; this article covers the problem best: https://iconictranslation.com/2020/12/issue-112-translating-markup-tags-in-neural-machine-translation/
In short, the best approach seems to require training a model by injecting tags in the training data.
Full paper: http://www.statmt.org/wmt20/pdf/2020.wmt-1.138.pdf
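As a rough illustration of what "injecting tags in the training data" can mean, the sketch below wraps a source span and its aligned target tokens in the same tag pair. It is not the procedure from the paper; the function name `inject_tags`, the token-level alignment format, and the example alignment are all assumptions made for this example.

```python
def inject_tags(src_tokens, tgt_tokens, alignment, src_span, tag="a"):
    """Wrap src_span (start, end-exclusive) and its aligned target tokens in <tag>...</tag>.

    alignment is a list of (source_index, target_index) pairs.
    Returns None when the aligned target tokens are missing or not contiguous.
    """
    start, end = src_span
    tgt_idx = sorted({t for s, t in alignment if start <= s < end})
    if not tgt_idx or tgt_idx != list(range(tgt_idx[0], tgt_idx[-1] + 1)):
        return None

    def wrap(tokens, lo, hi):
        return tokens[:lo] + [f"<{tag}>"] + tokens[lo:hi] + [f"</{tag}>"] + tokens[hi:]

    return (
        " ".join(wrap(src_tokens, start, end)),
        " ".join(wrap(tgt_tokens, tgt_idx[0], tgt_idx[-1] + 1)),
    )


src = "I use Google every day .".split()
tgt = "J'utilise Google tous les jours .".split()
# Hand-written alignment for illustration only.
align = [(0, 0), (1, 0), (2, 1), (3, 2), (3, 3), (4, 4), (5, 5)]

pair = inject_tags(src, tgt, align, src_span=(2, 3))
print(pair[0])  # I use <a> Google </a> every day .
print(pair[1])  # J'utilise <a> Google </a> tous les jours .
```

Pairs produced this way could be mixed into the regular training data so the model sees tags in both languages.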
from argos-translate.
As a start, it would be nice if it at least left the markup unmodified; as it stands, the markup is altered so much that the output is completely useless after translation. The same goes for Markdown.
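One simple way to detect this kind of breakage is to compare the tags before and after translation. The sketch below is a crude check, assuming tags should come back byte-for-byte identical and in the same order (a correct translation may legitimately reorder them); `markup_preserved` is a made-up helper name.

```python
import re

# Matches anything that looks like an XML/HTML tag.
TAG_RE = re.compile(r"<[^>]+>")


def markup_preserved(source, translated):
    """Return True if both strings contain the same tags in the same order."""
    return TAG_RE.findall(source) == TAG_RE.findall(translated)


source = '<p>I use <a href="https://www.google.com">Google</a> every day.</p>'
good = '<p>J\'utilise <a href="https://www.google.com">Google</a> tous les jours.</p>'
bad = '<p> J\'utilise <a href = "https://www.google.com"> Google </a> tous les jours. </ p>'

print(markup_preserved(source, good))  # True
print(markup_preserved(source, bad))   # False
```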
from argos-translate.