tinodidriksen / transfuse

Extract formatted text from documents, transform it, then put back in place

License: GNU General Public License v3.0

CMake 5.48% Shell 0.63% C++ 90.34% HTML 2.85% Roff 0.70%

transfuse's People

mr-martian unhammer

transfuse's Issues

Preserve XML

CDATA is lost

echo '<html><head><title>jeg tester</title></head><body><p><![CDATA[ en tekst som skal oversettes ]]></p></body></html>' | tf-clean
yields
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>jeg tester</title></head><body><p></p></body></html>

space lost around <embed>

$ echo 'a <embed> e' |apertium -f html spa-eng
To<embed>And</embed>

expected:

To <embed> And</embed>

(or maybe To <embed/> And)

Full example was something like

<p>See <embed data-content-id="123" data-link-text="Spot" data-resource="concept" data-type="inline"> run.</p>

ODT with dozens of (identical) inline styles breaks translation

Here is an example document whose translations are pretty much useless.

Doc-original.odt

(File has been slightly modified to alter the personal details)

Looking at the content of the first paragraph, you find something like this:

<text:p text:style-name="P45"><text:span text:style-name="T47">R</text:span><text:span text:style-name="T47">esolución </text:span><text:span text:style-name="T32">de la Dirección General de</text:span><text:span text:style-name="T32"> C</text:span><text:span text:style-name="T32">alidad </text:span><text:span text:style-name="T32">y Educación Ambie</text:span><text:span text:style-name="T32">ntal por la que se procede </text:span><text:span text:style-name="T32">a</text:span><text:span text:style-name="T32">l cambio de </text:span><text:span text:style-name="T32">titularidad</text:span><text:span text:style-name="T143"> de </text:span><text:span text:style-name="T26">la</text:span><text:span text:style-name="T26"> autorización ambiental integrada </text:span><text:span text:style-name="T26">otorgada a la empresa</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">XXXXXXXX XXXXX XXXXXX</text:span><text:span text:style-name="T32">,</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">con </text:span><text:span text:style-name="T32">número de </text:span><text:span text:style-name="T32">N</text:span><text:span text:style-name="T32">IF </text:span><text:span text:style-name="T32">P6345435F</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">para un</text:span><text:span text:style-name="T32">a </text:span><text:span text:style-name="T32">planta de tratamiento de resi</text:span><text:span text:style-name="T32">duos urbanos</text:span><text:span text:style-name="T29">, </text:span><text:span text:style-name="T29">con </text:span><text:span text:style-name="T29">NIMA </text:span><text:span text:style-name="T29">1111111111</text:span><text:span text:style-name="T29"> </text:span><text:span text:style-name="T29">y <text:s/></text:span><text:span text:style-name="T29">número </text:span><text:span text:style-name="T29">de RICV</text:span><text:span 
text:style-name="T29"> </text:span><text:span text:style-name="T29">111</text:span><text:span text:style-name="T29">/AAI/CV </text:span><text:span text:style-name="T32">ubicad</text:span><text:span text:style-name="T32">a</text:span><text:span text:style-name="T32"> en </text:span><text:span text:style-name="T32">el </text:span><text:span text:style-name="T32">Camí s/n, Partida El Plà, del término municipal de (Valencia)</text:span><text:span text:style-name="T29">, </text:span><text:span text:style-name="T29">a favor de la mercantil </text:span><text:span text:style-name="T29">XXXXXXXX XXXXX XXXXXX, SL</text:span><text:span text:style-name="T29">, con número de NIF </text:span><text:span text:style-name="T29">B1234567</text:span><text:span text:style-name="T29">.</text:span></text:p>

In the original file, before I obfuscated the details (names, addresses, etc.), the style IDs were slightly different, because their definitions also included an officeooo:rsid="01c8d657" attribute.

Merge tags before saving spaces?

Merging tags after saving spaces means some inline tags are no longer identical. Potentially merge these earlier.
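
A minimal sketch of the merging step alone, to make the idea concrete. It is not how transfuse implements it: it assumes plain Python with xml.etree.ElementTree, the function name merge_adjacent_spans is hypothetical, and it only merges leaf siblings that share tag and attributes and have no text between them.

```python
import xml.etree.ElementTree as ET

def merge_adjacent_spans(parent):
    """Merge neighbouring leaf children that have the same tag and attributes."""
    merged = []
    for child in list(parent):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.tag == child.tag
                and prev.attrib == child.attrib
                and not prev.tail                # no text between the two elements
                and len(prev) == 0 and len(child) == 0):
            prev.text = (prev.text or "") + (child.text or "")
            prev.tail = child.tail               # keep whatever followed the second one
        else:
            merged.append(child)
    parent[:] = merged
```

Running this before whitespace is recorded would keep spans such as the T47 and T32 runs in the ODT example identical at merge time.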

VISL block ID

https://github.com/TinoDidriksen/Transfuse/blob/main/src/stream-visl.cpp#L225 should actually read the block ID.

HTML sub/sup nested spacing

#7 works, but has issues when the output nests the sub/sup inside another tag. This nesting should be undone where relevant, and otherwise the space removal should step up the tree until it is relevant.

Option like `apertium-deshtml -o` / `apertium -H` for inserting ❡ at end of headers

$ echo '<h1>angrepet</h1>' | apertium-deshtml -o
.[][<h1>]angrepet[]❡.[][<\/h1>
]

$ grep -w H /usr/bin/apertium
    H) FORMAT_OPTIONS+=(-o) ;;

but it has no effect when running under transfuse. For example, I made a silly rule to select the participle in headings, which works in non-transfuse pipelines:

$ echo '<h1>angrepet</h1>' | apertium-deshtml -o | apertium -f none -d . nob-nno_e|apertium-rehtml
<h1>angripe</h1>
$ export APERTIUM_TRANSFUSE=no && echo '<h1>angrepet</h1>' | apertium -H -f html -d . nob-nno_e
<h1>angripe</h1>

but not in transfuse pipelines:

$ export APERTIUM_TRANSFUSE=yes && echo '<h1>angrepet</h1>' | apertium -H -f html -d . nob-nno_e
<h1>angrepet</h1>

DOCX document.xml may have a number

Instead of assuming the main file is document.xml, examine [Content_Types].xml for the <Override PartName="/word/document2.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" /> element.
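
The lookup described above could be sketched roughly as follows. This is an illustration in Python using only the standard library, not transfuse's code; the function name find_main_part and the fallback to word/document.xml are assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
MAIN_TYPE = ("application/vnd.openxmlformats-officedocument."
             "wordprocessingml.document.main+xml")

def find_main_part(docx_bytes: bytes) -> str:
    """Return the archive path of the main document part of a DOCX file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    for override in root.iter(f"{{{CT_NS}}}Override"):
        if override.get("ContentType") == MAIN_TYPE:
            # PartName is absolute within the package, e.g. /word/document2.xml
            return override.get("PartName", "").lstrip("/")
    return "word/document.xml"  # assumed fallback: the conventional name
```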

Save the lack of space

Just as we save the presence of spaces in order to be able to restore them after injection, sometimes the processor adds spaces that shouldn't be there. This can also be tracked and cleaned up.

Post-injection fixup of protected elements

Protected elements that were duplicated or deleted during translation should be fixed to only occur exactly as many times as they did in the input.
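
One possible shape for such a fixup, as a rough sketch only. The [[p:N]] marker syntax and the function name fixup_protected are hypothetical and chosen for illustration; surplus copies are dropped, and missing ones are simply appended at the end, which a real implementation would want to place more carefully.

```python
import re
from collections import Counter

# Hypothetical marker syntax for a protected element; not transfuse's real format.
MARKER = re.compile(r"\[\[p:\d+\]\]")

def fixup_protected(source: str, translated: str) -> str:
    """Make each marker occur in `translated` exactly as often as in `source`."""
    expected = Counter(MARKER.findall(source))
    seen = Counter()

    def keep_or_drop(match):
        token = match.group(0)
        seen[token] += 1
        # Keep the first `expected` copies, drop duplicates and unknown markers.
        return token if seen[token] <= expected[token] else ""

    result = MARKER.sub(keep_or_drop, translated)
    for token, count in expected.items():
        # Crude restoration of deleted markers: append them at the end.
        result += token * max(0, count - seen[token])
    return result
```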

Hashes differ slightly across archs

https://buildd.debian.org/status/package.php?p=transfuse&suite=experimental

E.g., on arm64:

$ diff -a extract-docx-apertium.out ../../tests/extract-docx-apertium.expect
2c2
< [tf-block:1-Qa9x-A]
---
> [tf-block:1-QK9x-A]
6c6
< [tf-block:2-z9oV-A]
---
> [tf-block:2-PUHfYg]
8c8
< [[t:text:Tbd7QQ]]legal[[/]] [[t:b:lS94HQ]]persons[[/]].[]
---
> [[t:text:TLd7QQ]]legal[[/]] [[t:b:lS94HQ]]persons[[/]].[]
...etc...

But https://github.com/Cyan4973/xxHash says hashes are identical across all platforms (little / big endian).

Support for inserting heading marker

From man apertium-deshtml:

     -o      Inserts a "❡" (U+2761 CURVED STEM PARAGRAPH SIGN ORNAMENT) at the
             end of <h[1–6]> and <title> tags.

e.g.

$ echo '<h1>Historisk sjokktap koronafriskmeldt</h1>' |apertium-deshtml -o
.[][<h1>]Historisk sjokktap koronafriskmeldt[]❡.[][<\/h1>
]

We'd like to use transfuse for https://gtweb.uit.no/apy/ but we'd also like to mark headings/titles with this symbol so CG can treat it as "heading language" (no need for full sentences etc.)

Could we have an option in transfuse to do this? (Unless there's some alternative way of dealing with it that would be even better?)
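
To illustrate the requested behaviour on plain HTML, here is a small sketch. It uses a regular expression rather than a real HTML parser and does not reproduce the deformatter stream format shown above; the function name mark_headings is an assumption.

```python
import re

# Closing tags of <h1>..<h6> and <title>, matched case-insensitively.
HEADING_CLOSE = re.compile(r"</(h[1-6]|title)\s*>", re.IGNORECASE)

def mark_headings(html: str) -> str:
    """Insert U+2761 just before the end of every heading and title element."""
    return HEADING_CLOSE.sub(lambda m: "\u2761" + m.group(0), html)
```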

Suppress namespace errors

E.g. "namespace error : Namespace prefix xlink for href on use is not defined" for embedded SVGs.

Segfault on m1

% ctest --rerun-failed --output-on-failure
Test project /tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1
    Start 1: extract-docx-apertium
1/2 Test #1: extract-docx-apertium ............***Failed    0.02 sec
/tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1/tests/extract.sh: line 5:
5369 Segmentation fault: 11  "$1" -m extract -K -d "extract-$3-$4" -s "$4" "$2/test.$3" "extract-$3-$4.tmp" 2> "extract-$3-$4.tmp"

    Start 7: extract-docx-visl
2/2 Test #7: extract-docx-visl ................***Failed    0.01 sec
/tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1/tests/extract.sh: line 5:
5372 Segmentation fault: 11  "$1" -m extract -K -d "extract-$3-$4" -s "$4" "$2/test.$3" "extract-$3-$4.tmp" 2> "extract-$3-$4.tmp"

Charset detection not so great

HTML fragment <div>landet på<img alt="landet"/></div> <div>landet</div> is incorrectly detected as ISO-8859-1
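
A common heuristic that would classify this fragment correctly is to try a strict UTF-8 decode first and only fall back to ISO-8859-1 when that fails, since non-ASCII Latin-1 text is very rarely valid UTF-8 by accident. The sketch below shows that heuristic only; it is not transfuse's detector, and guess_charset is a hypothetical name.

```python
def guess_charset(data: bytes) -> str:
    """Prefer UTF-8 whenever the bytes decode cleanly, else assume ISO-8859-1."""
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return "ISO-8859-1"
    return "UTF-8"
```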

docx creates nested runs (<w:r><w:t><w:r><w:t>), which are then invisible in the opened document

$ for y in yes no; do APERTIUM_TRANSFUSE=$y apertium -f docx -u -d . nob-nno  /tmp/in.docx >/tmp/ut.$y.docx; done

in.docx

With transfuse, we get this bit:

      <w:r>
        <w:rPr/>
        <w:t xml:space="preserve">
          <w:r>
            <w:rPr/>
            <w:t xml:space="preserve">Dette er såkalla «Sideloaded Add-ins». Dei nyttar eit webview, i praksis ein nettlesar</w:t>
          </w:r>
        </w:t>
      </w:r>

which Word (and LibreOffice) don't show when opening the document; presumably nested runs aren't allowed in OOXML.

(Note: If I first save in.docx from LibreOffice, transfuse can handle it fine, because LO merges all the runs in the input paragraph on saving (removing the proofErr stuff).)

Apertium spacing heuristics

Hello <b>big green</b> <i>world</i>! turns into Hola <i>Mundo</i> <b>verde grande</b> ! because the <b> had a space in the input and thus will have a space restored in output. This is not always the desired behavior.

JSON format

Lots of people use more or less simple JSON formats for i18n. E.g. apertium-html-tools:

{
    "@metadata": {
        "authors": [            "Juan Pablo Martínez Cortés"        ],
        "last-updated": "30/01/2016",
        "locale": [            "arg"        ],
        "completion": " 96% 106.82%",
        "missing": [        ]
    },
    "title": "Apertium | Una plataforma libre de traducción automatica",
    "tagline": "Plataforma de traducción automatica lliure/de codigo fuent ubierto"
}

or "professional" formats like https://www.i18next.com/misc/json-format

The typical format seems to be a simple object of {key: string-to-be-translated}. If transfuse could handle that simple format (ignoring the advanced i18next stuff, and skipping any value that is not a string), we'd probably have support for most JSON-based i18n setups.
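
For the simple case, extraction and injection could look roughly like this. It is a sketch in Python with the standard json module, not a proposal for the actual implementation; the names extract_strings and inject_strings are made up, and anything that is not a top-level string value (such as "@metadata") is left untouched.

```python
import json

def extract_strings(text: str) -> dict:
    """Return only the top-level key/value pairs whose value is a string."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return {}
    return {key: value for key, value in data.items() if isinstance(value, str)}

def inject_strings(text: str, translated: dict) -> str:
    """Put translated strings back, leaving non-string values as they were."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return text
    for key, value in translated.items():
        if isinstance(data.get(key), str):
            data[key] = value
    return json.dumps(data, ensure_ascii=False, indent=4)
```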

DOCX allows multiple w:t per w:r

Google Docs and presumably other editors may produce DOCX files with XML akin to <w:p><w:r><w:t>...</w:t><w:br/><w:br/><w:t>...</w:t><w:br/></w:r></w:p> - that is, each w:r can have multiple w:t and w:br intermingled.

MS Word itself will produce <w:p><w:r><w:t>...</w:t></w:r><w:r><w:br/></w:r><w:r><w:br/><w:t>...</w:t></w:r><w:r><w:br/></w:r></w:p> - that is, each w:r holds at most one w:t.

Unfortunately, the schema does allow multiple: http://www.datypic.com/sc/ooxml/e-w_r-2.html

(cf. apertium/apertium#110)
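
One way to cope would be to normalise such paragraphs before extraction so that every run holds at most one w:t. The sketch below does that with xml.etree.ElementTree; it is an assumption about a possible approach, the name split_runs is hypothetical, it only looks at w:r elements that are direct children of the paragraph, and it does not reproduce Word's exact layout (w:br stays in the same run as the preceding w:t).

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def split_runs(paragraph):
    """Rewrite direct w:r children so each holds at most one w:t, keeping order."""
    t_tag, r_tag, rpr_tag = f"{{{W}}}t", f"{{{W}}}r", f"{{{W}}}rPr"
    rebuilt = []
    for child in list(paragraph):
        if child.tag != r_tag:
            rebuilt.append(child)
            continue
        rpr = child.find(rpr_tag)
        pieces = []
        for item in list(child):
            if item is rpr:
                continue
            # Start a new run for the first item and for every further w:t.
            if not pieces or (item.tag == t_tag and pieces[-1].find(t_tag) is not None):
                run = ET.Element(r_tag, dict(child.attrib))
                if rpr is not None:
                    run.append(copy.deepcopy(rpr))  # each piece keeps the run properties
                pieces.append(run)
            pieces[-1].append(item)
        if pieces:
            pieces[-1].tail = child.tail
        rebuilt.extend(pieces or [child])  # runs without content are kept as they are
    paragraph[:] = rebuilt
```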

HTML strikeout inline-protected?

Strikeout and del tags should probably be handled as inline-protected.

Docx w:tab needs to be a hard break

This gets mangled because <w:tab/> is mishandled.

    <w:p>
      <w:r>
        <w:t>16.</w:t>
      </w:r>
      <w:r>
        <w:tab/>
        <w:t>Utvalget foreslår en mer lik </w:t>
      </w:r>
      <w:r>
        <w:t>prising av elevstipendiatkursene...</w:t>
      </w:r>
    </w:p>