transfuse's People

Contributors

mr-martian, tinodidriksen

transfuse's Issues

CDATA is lost

echo '<html><head><title>jeg tester</title></head><body><p><![CDATA[ en tekst som skal oversettes ]]></p></body></html>' | tf-clean
yields
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>jeg tester</title></head><body><p></p></body></html>

space lost around <embed>

$ echo 'a <embed> e' |apertium -f html spa-eng
To<embed>And</embed>

expected:

To <embed> And</embed>

(or maybe To <embed/> And)

Full example was something like

<p>See <embed data-content-id="123" data-link-text="Spot" data-resource="concept" data-type="inline"> run.</p>

ODT with dozens of (identical) inline styles breaks translation

Here is an example document whose translations are pretty much useless.

Doc-original.odt

(File has been slightly modified to alter the personal details)

Looking at the content of the first paragraph, you find something like this:

<text:p text:style-name="P45"><text:span text:style-name="T47">R</text:span><text:span text:style-name="T47">esolución </text:span><text:span text:style-name="T32">de la Dirección General de</text:span><text:span text:style-name="T32"> C</text:span><text:span text:style-name="T32">alidad </text:span><text:span text:style-name="T32">y Educación Ambie</text:span><text:span text:style-name="T32">ntal por la que se procede </text:span><text:span text:style-name="T32">a</text:span><text:span text:style-name="T32">l cambio de </text:span><text:span text:style-name="T32">titularidad</text:span><text:span text:style-name="T143"> de </text:span><text:span text:style-name="T26">la</text:span><text:span text:style-name="T26"> autorización ambiental integrada </text:span><text:span text:style-name="T26">otorgada a la empresa</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">XXXXXXXX XXXXX XXXXXX</text:span><text:span text:style-name="T32">,</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">con </text:span><text:span text:style-name="T32">número de </text:span><text:span text:style-name="T32">N</text:span><text:span text:style-name="T32">IF </text:span><text:span text:style-name="T32">P6345435F</text:span><text:span text:style-name="T32"> </text:span><text:span text:style-name="T32">para un</text:span><text:span text:style-name="T32">a </text:span><text:span text:style-name="T32">planta de tratamiento de resi</text:span><text:span text:style-name="T32">duos urbanos</text:span><text:span text:style-name="T29">, </text:span><text:span text:style-name="T29">con </text:span><text:span text:style-name="T29">NIMA </text:span><text:span text:style-name="T29">1111111111</text:span><text:span text:style-name="T29"> </text:span><text:span text:style-name="T29">y <text:s/></text:span><text:span text:style-name="T29">número </text:span><text:span text:style-name="T29">de RICV</text:span><text:span 
text:style-name="T29"> </text:span><text:span text:style-name="T29">111</text:span><text:span text:style-name="T29">/AAI/CV </text:span><text:span text:style-name="T32">ubicad</text:span><text:span text:style-name="T32">a</text:span><text:span text:style-name="T32"> en </text:span><text:span text:style-name="T32">el </text:span><text:span text:style-name="T32">Camí s/n, Partida El Plà, del término municipal de (Valencia)</text:span><text:span text:style-name="T29">, </text:span><text:span text:style-name="T29">a favor de la mercantil </text:span><text:span text:style-name="T29">XXXXXXXX XXXXX XXXXXX, SL</text:span><text:span text:style-name="T29">, con número de NIF </text:span><text:span text:style-name="T29">B1234567</text:span><text:span text:style-name="T29">.</text:span></text:p>

In the original file, before I obfuscated the details (names, addresses, etc.), the style IDs were slightly different, because their definitions also included an officeooo:rsid="01c8d657" attribute.

HTML sub/sup nested spacing

#7 works, but has issues when the output nests the sub/sup inside another tag. This nesting should be undone where relevant, and otherwise the space removal should step up the tree until it is relevant.

Option like `apertium-deshtml -o` / `apertium -H` for inserting ❡ at end of headers

$ echo '<h1>angrepet</h1>' | apertium-deshtml -o
.[][<h1>]angrepet[]❡.[][<\/h1>
]
$ grep -w H /usr/bin/apertium
    H) FORMAT_OPTIONS+=(-o) ;;

but it has no effect when running under transfuse. For example, I made a silly rule to select the participle in headings, which works in non-transfuse pipelines:

$ echo '<h1>angrepet</h1>' | apertium-deshtml -o | apertium -f none -d . nob-nno_e|apertium-rehtml
<h1>angripe</h1>
$ export APERTIUM_TRANSFUSE=no && echo '<h1>angrepet</h1>' | apertium -H -f html -d . nob-nno_e
<h1>angripe</h1>

but not in transfuse pipelines:

$ export APERTIUM_TRANSFUSE=yes && echo '<h1>angrepet</h1>' | apertium -H -f html -d . nob-nno_e
<h1>angrepet</h1>

DOCX document.xml may have a number

Instead of assuming the main file is document.xml, examine [Content_Types].xml for the <Override PartName="/word/document2.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" /> element.
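
A minimal sketch of that lookup, in Python rather than transfuse's own code. The function names `find_main_part` and `main_part_of` are hypothetical, and only the content type quoted above is matched (templates and macro-enabled documents use other content types and would need to be added):

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml in OPC packages.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
# Content type quoted in the issue; other main-document types are not handled here.
MAIN_TYPE = ("application/vnd.openxmlformats-officedocument."
             "wordprocessingml.document.main+xml")


def find_main_part(content_types_xml):
    """Return the archive path of the main document part.

    Falls back to the conventional word/document.xml when no matching
    <Override> element is present.
    """
    root = ET.fromstring(content_types_xml)
    for override in root.iter("{%s}Override" % CT_NS):
        if override.get("ContentType") == MAIN_TYPE:
            # PartName is absolute ("/word/document2.xml"); zip entries are not.
            return override.get("PartName").lstrip("/")
    return "word/document.xml"


def main_part_of(docx_path):
    """Open a .docx file and look up its main document part."""
    with zipfile.ZipFile(docx_path) as archive:
        return find_main_part(archive.read("[Content_Types].xml"))
```

Calling `main_part_of("in.docx")` would then give the entry name to read instead of a hard-coded `word/document.xml`.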

Save the lack of space

We already save the presence of spaces so that they can be restored after injection. Sometimes the processor also adds spaces that shouldn't be there. The absence of a space can be tracked in the same way and used to clean those up.

Hashes differ slightly across archs

https://buildd.debian.org/status/package.php?p=transfuse&suite=experimental

E.g., on arm64:

$ diff -a extract-docx-apertium.out ../../tests/extract-docx-apertium.expect
2c2
< [tf-block:1-Qa9x-A]
---
> [tf-block:1-QK9x-A]
6c6
< [tf-block:2-z9oV-A]
---
> [tf-block:2-PUHfYg]
8c8
< [[t:text:Tbd7QQ]]legal[[/]] [[t:b:lS94HQ]]persons[[/]].[]
---
> [[t:text:TLd7QQ]]legal[[/]] [[t:b:lS94HQ]]persons[[/]].[]
...etc...

But https://github.com/Cyan4973/xxHash says hashes are identical across all platforms (little / big endian).

Support for inserting heading marker

From man apertium-deshtml:

     -o      Inserts a "❡" (U+2761 CURVED STEM PARAGRAPH SIGN ORNAMENT) at the
             end of <h[1–6]> and <title> tags.

e.g.

$ echo '<h1>Historisk sjokktap koronafriskmeldt</h1>' |apertium-deshtml -o
.[][<h1>]Historisk sjokktap koronafriskmeldt[]❡.[][<\/h1>
] 

We'd like to use transfuse for https://gtweb.uit.no/apy/ but we'd also like to mark headings/titles with this symbol so CG can treat it as "heading language" (no need for full sentences etc.)

Could we have an option in transfuse to do this? (Unless there's some alternative way of dealing with it that would be even better?)

Segfault on m1

% ctest --rerun-failed --output-on-failure
Test project /tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1
    Start 1: extract-docx-apertium
1/2 Test #1: extract-docx-apertium ............***Failed    0.02 sec
/tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1/tests/extract.sh: line 5:
5369 Segmentation fault: 11  "$1" -m extract -K -d "extract-$3-$4" -s "$4" "$2/test.$3" "extract-$3-$4.tmp" 2> "extract-$3-$4.tmp"

    Start 7: extract-docx-visl
2/2 Test #7: extract-docx-visl ................***Failed    0.01 sec
/tmp/build/nightly/transfuse/transfuse-0.5.8+g58~4a3f5cd1/tests/extract.sh: line 5:
5372 Segmentation fault: 11  "$1" -m extract -K -d "extract-$3-$4" -s "$4" "$2/test.$3" "extract-$3-$4.tmp" 2> "extract-$3-$4.tmp"

docx creates nested runs (<w:r><w:t><w:r><w:t>), which are then invisible in the opened document

$ for y in yes no; do APERTIUM_TRANSFUSE=$y apertium -f docx -u -d . nob-nno  /tmp/in.docx >/tmp/ut.$y.docx; done

in.docx

With transfuse, we get this bit:

      <w:r>
        <w:rPr/>
        <w:t xml:space="preserve">
          <w:r>
            <w:rPr/>
            <w:t xml:space="preserve">Dette er såkalla «Sideloaded Add-ins». Dei nyttar eit webview, i praksis ein nettlesar</w:t>
          </w:r>
        </w:t>
      </w:r>

which Word (and LibreOffice) don't show when opening the document; presumably nested runs aren't allowed in OOXML.

(Note: If I first save in.docx from LibreOffice, transfuse can handle it fine, because LO merges all the runs in the input paragraph on saving (removing the proofErr stuff).)

Apertium spacing heuristics

Hello <b>big green</b> <i>world</i>! turns into Hola <i>Mundo</i> <b>verde grande</b> ! because the <b> had a space in the input and thus will have a space restored in output. This is not always the desired behavior.

JSON format

Lots of people use more or less simple JSON formats for i18n, e.g. apertium-html-tools:

{
    "@metadata": {
        "authors": ["Juan Pablo Martínez Cortés"],
        "last-updated": "30/01/2016",
        "locale": ["arg"],
        "completion": " 96% 106.82%",
        "missing": []
    },
    "title": "Apertium | Una plataforma libre de traducción automatica",
    "tagline": "Plataforma de traducción automatica lliure/de codigo fuent ubierto"
}

or "professional" formats like https://www.i18next.com/misc/json-format

The typical format seems to be a simple object of {key: string-to-be-translated}. If transfuse could handle that simple format (ignoring the advanced i18next stuff, and skipping anything where the value is not a string), we'd probably have support for most of the JSON-based i18n setups.
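
The extraction rule described above can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not anything transfuse implements; `extract_strings` and `inject_strings` are hypothetical names:

```python
import json


def extract_strings(obj):
    """Collect the top-level entries whose value is a plain string.

    Anything else (objects such as "@metadata", lists, numbers) is skipped.
    """
    return {key: value for key, value in obj.items() if isinstance(value, str)}


def inject_strings(obj, translated):
    """Return a copy of obj with translated strings put back in place.

    Only keys that held a string in the original are overwritten, so
    non-string values survive untouched.
    """
    result = dict(obj)
    for key, value in translated.items():
        if isinstance(result.get(key), str):
            result[key] = value
    return result


if __name__ == "__main__":
    doc = json.loads('{"@metadata": {"locale": ["arg"]}, "title": "Apertium"}')
    print(extract_strings(doc))
```

On the apertium-html-tools example, this would pick up "title" and "tagline" and leave "@metadata" alone, since its value is an object.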

DOCX allows multiple w:t per w:r

Google Docs and presumably other editors may produce DOCX files with XML akin to <w:p><w:r><w:t>...</w:t><w:br/><w:br/><w:t>...</w:t><w:br/></w:r></w:p> - that is, each w:r can have multiple w:t and w:br intermingled.

MS Word itself will produce <w:p><w:r><w:t>...</w:t></w:r><w:r><w:br/></w:r><w:r><w:br/><w:t>...</w:t></w:r><w:r><w:br/></w:r></w:p> - that is, each w:r holds at most one w:t.

Unfortunately, the schema does allow multiple: http://www.datypic.com/sc/ooxml/e-w_r-2.html

(cf. apertium/apertium#110)
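
One possible way to cope with this, sketched in Python under the assumption that downstream code wants at most one w:t per w:r: split each run so that every content child gets its own run, copying the run properties. This is stricter than what MS Word emits and is not how transfuse handles it; `split_run` and `q` are hypothetical helpers.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(tag):
    """Qualify a local name with the WordprocessingML namespace."""
    return "{%s}%s" % (W, tag)


def split_run(run):
    """Split one w:r into several runs, each holding a single content child.

    The original w:rPr (if any) and the run's attributes are copied to
    every new run so formatting is preserved.
    """
    rpr = run.find(q("rPr"))
    new_runs = []
    for child in run:
        if child.tag == q("rPr"):
            continue
        new_run = ET.Element(q("r"), dict(run.attrib))
        if rpr is not None:
            new_run.append(copy.deepcopy(rpr))
        new_run.append(copy.deepcopy(child))
        new_runs.append(new_run)
    return new_runs
```

Applied to the Google Docs style run above, this would yield five runs (w:t, w:br, w:br, w:t, w:br), none with more than one w:t.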

Docx w:tab needs to be a hard break

This gets mangled because <w:tab/> is mishandled.

    <w:p>
      <w:r>
        <w:t>16.</w:t>
      </w:r>
      <w:r>
        <w:tab/>
        <w:t>Utvalget foreslår en mer lik </w:t>
      </w:r>
      <w:r>
        <w:t>prising av elevstipendiatkursene...</w:t>
      </w:r>
    </w:p>
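
To illustrate what treating <w:tab/> as a hard break could mean, here is a Python sketch that assumes "hard break" means the text before and after the tab belongs to separate translatable segments. It is not transfuse's implementation, and `paragraph_segments` and `q` are hypothetical names.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Qualify a local name with the WordprocessingML namespace."""
    return "{%s}%s" % (W, tag)


def paragraph_segments(paragraph):
    """Return the paragraph text as a list of segments split at w:tab.

    Only direct children of w:r are inspected, so the w:tab elements that
    define tab stops inside w:pPr/w:tabs are not mistaken for breaks.
    """
    segments = [""]
    for run in paragraph.iter(q("r")):
        for child in run:
            if child.tag == q("tab"):
                segments.append("")
            elif child.tag == q("t"):
                segments[-1] += child.text or ""
    return segments
```

For the paragraph above, this would keep "16." apart from the sentence that follows it, while the two text runs after the tab are still joined into one segment.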
