tinodidriksen / transfuse
Extract formatted text from documents, transform it, then put back in place
License: GNU General Public License v3.0
echo '<html><head><title>jeg tester</title></head><body><p><![CDATA[ en tekst som skal oversettes ]]></p></body></html>' | tf-clean
yields
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>jeg tester</title></head><body><p></p></body></html>
$ echo 'a <embed> e' |apertium -f html spa-eng
To<embed>And</embed>
expected:
To <embed> And</embed>
(or maybe To <embed/> And)
The full example was something like:
<p>See <embed data-content-id="123" data-link-text="Spot" data-resource="concept" data-type="inline"> run.</p>
Here is an example file whose translations are pretty much useless.
(File has been slightly modified to alter the personal details)
Looking at the content of the first paragraph, you find something like this:
<text:p text:style-name="P45"><text:span text:style-name="T47">R</text:span><text:span text:style-name="T47">esolución </text:span><text:span text:style-name="T32">de la Dirección General de</text:span><text:span text:style-name="T32"> C</text:span><text:span text:style-name="T32">alidad </text:span><text:span text:style-name="T32">y Educación Ambie</text:span><text:span text:style-name="T32">ntal por la que se procede </text:span><text:span text:style-name="T32">a</text:span><text:span text:style-name="T32">l cambio de </text:span><text:span text:style-name="T32">titularidad</text:span><text:span text:style-name="T143"> de </text:span><text:span text:style-name="T26">la</text:span><text:span text:style-name="T26"> autorización ambiental integrada </text:span><text:span text:style-name="T26">otorgada a la empresa</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">XXXXXXXX XXXXX XXXXXX</text:span><text:span text:style-name="T32">,</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">con </text:span><text:span text:style-name="T32">número de </text:span><text:span text:style-name="T32">N</text:span><text:span text:style-name="T32">IF </text:span><text:span text:style-name="T32">P6345435F</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">para un</text:span><text:span text:style-name="T32">a </text:span><text:span text:style-name="T32">planta de tratamiento de resi</text:span><text:span text:style-name="T32">duos urbanos</text:span><text:span text:style-name="T29">, </text:span><text:span text:style-name="T29">con </text:span><text:span text:style-name="T29">NIMA </text:span><text:span text:style-name="T29">1111111111</text:span><text:span text:style-name="T29"> </text:span><text:span text:style-name="T29">y <text:s/></text:span><text:span text:style-name="T29">número </text:span><text:span text:style-name="T29">de RICV</text:span><text:span 
text:style-name="T29"> </text:span><text:span text:style-name="T29">111</text:span><text:span text:style-name="T29">/AAI/CV </text:span><text:span text:style-name="T32">ubicad</text:span><text:span text:style-name="T32">a</text:span><text:span text:style-name="T32"> en </text:span><text:span text:style-name="T32">el </text:span><text:span text:style-name="T32">Camí s/n, Partida El Plà, del término municipal de (Valencia)</text:span><text:span text:style-name="T29">, </text:span><text:span text:style-name="T29">a favor de la mercantil </text:span><text:span text:style-name="T29">XXXXXXXX XXXXX XXXXXX, SL</text:span><text:span text:style-name="T29">, con número de NIF </text:span><text:span text:style-name="T29">B1234567</text:span><text:span text:style-name="T29">.</text:span></text:p>
But in the original file, before I obfuscated the details (names, addresses, etc.), the style IDs were slightly different, because they also included an officeooo:rsid="01c8d657" attribute in their definition.
Merging tags after saving spaces means some inline tags are no longer identical. Potentially merge these earlier.
https://github.com/TinoDidriksen/Transfuse/blob/main/src/stream-visl.cpp#L225 should actually read the block ID.
#7 works, but has issues when the output nests the sub/sup inside another tag. This nesting should be undone where relevant, and otherwise the space removal should step up the tree until it is relevant.
$ echo '<h1>angrepet</h1>' | apertium-deshtml -o
.[][<h1>]angrepet[]❡.[][<\/h1>
]
$ grep -w H /usr/bin/apertium
H) FORMAT_OPTIONS+=(-o) ;;
but it has no effect when running under transfuse. E.g., I made a silly rule to select the participle in headings, which works in non-transfuse pipelines:
$ echo '<h1>angrepet</h1>' | apertium-deshtml -o | apertium -f none -d . nob-nno_e|apertium-rehtml
<h1>angripe</h1>
$ export APERTIUM_TRANSFUSE=no && echo '<h1>angrepet</h1>' | apertium -H -f html -d . nob-nno_e
<h1>angripe</h1>
but not in transfuse pipelines:
$ export APERTIUM_TRANSFUSE=yes && echo '<h1>angrepet</h1>' | apertium -H -f html -d . nob-nno_e
<h1>angrepet</h1>
Instead of assuming the main file is document.xml, examine [Content_Types].xml for the <Override PartName="/word/document2.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" /> element.
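A minimal Python sketch of that lookup, assuming the DOCX is a regular zip package. The function name find_main_part is hypothetical and this is not how transfuse itself is implemented; it only illustrates reading the Override element rather than hard-coding word/document.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml in OPC packages.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
MAIN_TYPE = ("application/vnd.openxmlformats-officedocument."
             "wordprocessingml.document.main+xml")


def find_main_part(docx):
    """Return the zip member name of the main document part, or None."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    for override in root.iter(CT_NS + "Override"):
        if override.get("ContentType") == MAIN_TYPE:
            # PartName is absolute ("/word/document2.xml"); zip members are not.
            return override.get("PartName").lstrip("/")
    return None


# Build a tiny in-memory package to try it out.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "[Content_Types].xml",
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Override PartName="/word/document2.xml" ContentType="' + MAIN_TYPE + '"/>'
        "</Types>",
    )
print(find_main_part(buf))  # word/document2.xml
```

A caller could fall back to word/document.xml when the function returns None.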
Just as we save the presence of spaces in order to be able to restore them after injection, sometimes the processor adds spaces that shouldn't be there. This can also be tracked and cleaned up.
Protected elements that were duplicated or deleted during translation should be fixed to only occur exactly as many times as they did in the input.
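One possible shape for such a fix, sketched in Python. The {p:ID} marker syntax and the reconcile function are hypothetical stand-ins, not transfuse's actual markers. Surplus copies are dropped; deleted markers are appended at the end, which is only one of several possible placement policies.

```python
import re
from collections import Counter

# Hypothetical marker syntax for a protected element: {p:ID}
MARKER = re.compile(r"\{p:(\w+)\}")


def reconcile(source, translated):
    """Make each marker occur in `translated` exactly as often as in `source`."""
    expected = Counter(MARKER.findall(source))
    seen = Counter()

    def keep_or_drop(match):
        marker_id = match.group(1)
        seen[marker_id] += 1
        # Keep a marker only while we are still within the expected count.
        return match.group(0) if seen[marker_id] <= expected[marker_id] else ""

    result = MARKER.sub(keep_or_drop, translated)

    # Re-add markers that the translation lost (Counter subtraction keeps
    # only positive counts).
    missing = expected - Counter(MARKER.findall(result))
    for marker_id, count in missing.items():
        result += ("{p:%s}" % marker_id) * count
    return result


fixed = reconcile("a {p:1} b {p:2} c", "x {p:1} {p:1} y")
print(Counter(MARKER.findall(fixed)) == Counter({"1": 1, "2": 1}))  # True
```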
https://buildd.debian.org/status/package.php?p=transfuse&suite=experimental
E.g., on arm64:
$ diff -a extract-docx-apertium.out ../../tests/extract-docx-apertium.expect
2c2
< [tf-block:1-Qa9x-A]
---
> [tf-block:1-QK9x-A]
6c6
< [tf-block:2-z9oV-A]
---
> [tf-block:2-PUHfYg]
8c8
< [[t:text:Tbd7QQ]]legal[[/]] [[t:b:lS94HQ]]persons[[/]].[]
---
> [[t:text:TLd7QQ]]legal[[/]] [[t:b:lS94HQ]]persons[[/]].[]
...etc...
But https://github.com/Cyan4973/xxHash says hashes are identical across all platforms (little / big endian).
From man apertium-deshtml:
-o Inserts a "❡" (U+2761 CURVED STEM PARAGRAPH SIGN ORNAMENT) at the
end of <h[1–6]> and <title> tags.
e.g.
$ echo '<h1>Historisk sjokktap koronafriskmeldt</h1>' |apertium-deshtml -o
.[][<h1>]Historisk sjokktap koronafriskmeldt[]❡.[][<\/h1>
]
We'd like to use transfuse for https://gtweb.uit.no/apy/ but we'd also like to mark headings/titles with this symbol so CG can treat it as "heading language" (no need for full sentences etc.)
Could we have an option in transfuse to do this? (Unless there's some alternative way of dealing with it that would be even better?)
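To make the request concrete, here is a rough Python sketch of the desired effect on plain HTML. It is not a transfuse option or API; mark_headings is a hypothetical helper, and a regex is a crude stand-in for real HTML parsing.

```python
import re

# Closing tags of the elements that man apertium-deshtml lists for -o.
HEADING_CLOSE = re.compile(r"</(h[1-6]|title)\s*>", re.IGNORECASE)


def mark_headings(html, mark="\u2761"):
    """Insert the heading mark (U+2761) just before each heading/title close tag."""
    return HEADING_CLOSE.sub(lambda m: mark + m.group(0), html)


print(mark_headings("<h1>angrepet</h1><p>tekst</p>"))
# <h1>angrepet❡</h1><p>tekst</p>
```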
E.g. namespace error : Namespace prefix xlink for href on use is not defined for embedded SVGs
% ctest --rerun-failed --output-on-failure
Test project /tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1
Start 1: extract-docx-apertium
1/2 Test #1: extract-docx-apertium ............***Failed 0.02 sec
/tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1/tests/extract.sh: line 5:
5369 Segmentation fault: 11 "$1" -m extract -K -d "extract-$3-$4" -s "$4" "$2/test.$3" "extract-$3-$4.tmp" 2> "extract-$3-$4.tmp"
Start 7: extract-docx-visl
2/2 Test #7: extract-docx-visl ................***Failed 0.01 sec
/tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1/tests/extract.sh: line 5:
5372 Segmentation fault: 11 "$1" -m extract -K -d "extract-$3-$4" -s "$4" "$2/test.$3" "extract-$3-$4.tmp" 2> "extract-$3-$4.tmp"
HTML fragment <div>landet på<img alt="landet"/></div> <div>landet</div> is incorrectly detected as ISO-8859-1
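For comparison, a strict UTF-8-first check classifies this fragment correctly. The Python sketch below is an assumption about one way to approach detection, not a description of what transfuse does; guess_encoding is a hypothetical name.

```python
def guess_encoding(data: bytes) -> str:
    """Prefer UTF-8 when the bytes decode strictly; otherwise fall back."""
    try:
        data.decode("utf-8")  # strict: raises on any invalid sequence
        return "UTF-8"
    except UnicodeDecodeError:
        return "ISO-8859-1"


fragment = '<div>landet på<img alt="landet"/></div> <div>landet</div>'
print(guess_encoding(fragment.encode("utf-8")))       # UTF-8
print(guess_encoding(fragment.encode("iso-8859-1")))  # ISO-8859-1
```

The fallback case works because a lone 0xE5 byte (å in ISO-8859-1) followed by an ASCII character is not a valid UTF-8 sequence.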
$ for y in yes no; do APERTIUM_TRANSFUSE=$y apertium -f docx -u -d . nob-nno /tmp/in.docx >/tmp/ut.$y.docx; done
With transfuse, we get this bit:
<w:r>
<w:rPr/>
<w:t xml:space="preserve">
<w:r>
<w:rPr/>
<w:t xml:space="preserve">Dette er såkalla «Sideloaded Add-ins». Dei nyttar eit webview, i praksis ein nettlesar</w:t>
</w:r>
</w:t>
</w:r>
which Word (and LibreOffice) don't show on opening the document; presumably nested runs aren't allowed in OOXML.
(Note: If I first save in.docx from LibreOffice, transfuse can handle it fine, because LO merges all the runs in the input paragraph on saving (removing the proofErr stuff).)
Hello <b>big green</b> <i>world</i>!
turns into Hola <i>Mundo</i> <b>verde grande</b> ! because the <b> had a space in the input and thus will have a space restored in the output. This is not always the desired behavior.
Lots of people use more or less simple JSON formats for i18n. E.g. apertium-html-tools:
{
"@metadata": {
"authors": [ "Juan Pablo Martínez Cortés" ],
"last-updated": "30/01/2016",
"locale": [ "arg" ],
"completion": " 96% 106.82%",
"missing": [ ]
},
"title": "Apertium | Una plataforma libre de traducción automatica",
"tagline": "Plataforma de traducción automatica lliure/de codigo fuent ubierto"
}
or "professional" formats like https://www.i18next.com/misc/json-format
The typical format seems to be a simple object of {key:string-to-be-translated}. If transfuse could handle that simple format (ignoring the advanced i18next stuff, and skipping anything where the value is not a string), we'd probably have support for most of the JSON-based i18n systems.
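The requested behavior could look roughly like the following Python sketch. The extract_strings and inject_strings helpers are hypothetical names for illustration, not part of transfuse: only top-level string values are offered for translation, and everything else (such as the @metadata object) passes through untouched.

```python
import json


def extract_strings(doc):
    """Collect only the top-level entries whose value is a string."""
    return {key: value for key, value in doc.items() if isinstance(value, str)}


def inject_strings(doc, translated):
    """Return a copy of doc with translated strings put back in place."""
    merged = dict(doc)
    for key, value in translated.items():
        # Only overwrite entries that were strings in the original.
        if isinstance(merged.get(key), str):
            merged[key] = value
    return merged


source = json.loads("""
{
  "@metadata": {"authors": ["Juan Pablo Martínez Cortés"], "locale": ["arg"]},
  "title": "Apertium | Una plataforma libre de traducción automatica"
}
""")

strings = extract_strings(source)
print(list(strings))  # ['title']

# Stand-in for the real translation step: upper-case every string.
translated = {key: value.upper() for key, value in strings.items()}
result = inject_strings(source, translated)
print(result["@metadata"] == source["@metadata"])  # True
```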
Google Docs and presumably other editors may produce DOCX files with XML akin to <w:p><w:r><w:t>...</w:t><w:br/><w:br/><w:t>...</w:t><w:br/></w:r></w:p> - that is, each w:r can have multiple w:t and w:br intermingled.
MS Word itself will produce <w:p><w:r><w:t>...</w:t></w:r><w:r><w:br/></w:r><w:r><w:br/><w:t>...</w:t></w:r><w:r><w:br/></w:r></w:p> - that is, each w:r holds at most one w:t.
Unfortunately, the schema does allow multiple: http://www.datypic.com/sc/ooxml/e-w_r-2.html
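One way to cope would be to normalize such input before extraction. The Python sketch below is an illustrative assumption, not transfuse code: split_runs is a hypothetical helper that gives every w:t and w:br its own run and copies the run properties, so each w:r ends up with at most one w:t. It only looks at runs that are direct children of the paragraph.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R, RPR, T = f"{{{W}}}r", f"{{{W}}}rPr", f"{{{W}}}t"


def split_runs(paragraph):
    """Split each multi-child w:r into one run per child, in place."""
    for run in list(paragraph.findall(R)):
        rpr = run.find(RPR)
        children = [child for child in run if child.tag != RPR]
        if len(children) <= 1:
            continue
        index = list(paragraph).index(run)
        paragraph.remove(run)
        for offset, child in enumerate(children):
            new_run = ET.Element(R)
            if rpr is not None:
                # Every new run keeps the formatting of the original run.
                new_run.append(copy.deepcopy(rpr))
            new_run.append(child)
            paragraph.insert(index + offset, new_run)


p = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:r><w:t>a</w:t><w:br/><w:br/>'
    "<w:t>b</w:t><w:br/></w:r></w:p>"
)
split_runs(p)
print(len(p.findall(R)))                                # 5
print(all(len(r.findall(T)) <= 1 for r in p.findall(R)))  # True
```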
(cf. apertium/apertium#110)
Strikeout and del tags should probably be handled as inline-protected.
This gets mangled because <w:tab/> is mishandled:
<w:p>
<w:r>
<w:t>16.</w:t>
</w:r>
<w:r>
<w:tab/>
<w:t>Utvalget foreslår en mer lik </w:t>
</w:r>
<w:r>
<w:t>prising av elevstipendiatkursene...</w:t>
</w:r>
</w:p>
Analogous to apertium -f line