apertium's Introduction

Apertium

Requirements

  • This package needs the package lttoolbox-3.5 installed in the system, as well as libxml and libpcre.

See https://apertium.org and https://wiki.apertium.org for more information on installing.

Description

When building, this package generates, among others, the following modules:

  • apertium-deshtml, apertium-desrtf, apertium-destxt: deformatters for html, rtf and txt document formats.
  • apertium-rehtml, apertium-rertf, apertium-retxt: reformatters for html, rtf and txt document formats.
  • apertium: translator program. Execute without parameters to see the usage.

Quick Start

There are binaries available for Debian, Ubuntu, Fedora, CentOS, OpenSUSE, Windows, and macOS. We package both nightly builds and releases. See https://wiki.apertium.org/wiki/Installation for more information. Only build from source if you either want to change this tool's behavior, or are on a platform we don't yet package for.

  1. Download the lttoolbox-VERSION.tar.gz and apertium-VERSION.tar.gz packages, plus the linguistic data

    Note: If you are using the translator from GitHub, run ./autogen.sh before running ./configure in all cases.

  2. Unpack lttoolbox and do ('#' means 'do that with root privileges'):

   $ cd lttoolbox-VERSION
   $ ./configure
   $ make
   # make install
  3. Unpack apertium and do:
   $ cd apertium-VERSION
   $ ./configure
   $ make
   # make install
  4. Unpack linguistic data (LING_DATA_DIR) and do:
   $ cd LING_DATA_DIR
   $ ./configure
   $ make
   and wait for a while (minutes).
  5. Use the translator
   USAGE: apertium [-d datadir] [-f format] [-u] <direction> [in [out]]
    -d datadir       directory of linguistic data
    -f format        one of: txt (default), html, rtf, odt, docx, wxml, xlsx, pptx,
                     xpresstag, html-noent, latex, latex-raw
    -a               display ambiguity
    -u               don't display marks '*' for unknown words
    -n               don't insert period before possible sentence-ends
    -m memory.tmx    use a translation memory to recycle translations
    -o direction     translation direction using the translation memory,
                     by default 'direction' is used instead
    -l               lists the available translation directions and exits
    direction        typically, LANG1-LANG2, but see modes.xml in language data
    in               input file (stdin by default)
    out              output file (stdout by default)


   Sample:

   $ apertium -f txt es-ca <input >output

apertium's People

Contributors

azmfaridee, bechapertium, bentley, elephantgcc, flammie, frankier, ftyers, ilnarselimcan, jimregan, jmoratinos, jonorthwash, kartikm, khannatanmai, krvoje, marcriera, mr-martian, nordfalk, pminervini, pranavad, roybaer, sanmarf, singh-lokendra, snomos, sortiz, sushain97, techievena, tinodidriksen, tradumatica, unhammer, xavivars

apertium's Issues

Failing to build apertium in apertium_deshtml.cc on Mac

apertium_deshtml.cc:3963:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register yy_state_type yy_current_state;
^~~~~~~~~
apertium_deshtml.cc:3964:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register char *yy_cp, *yy_bp;
^~~~~~~~~
apertium_deshtml.cc:3964:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register char *yy_cp, *yy_bp;
^~~~~~~~~
apertium_deshtml.cc:3965:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register int yy_act;
^~~~~~~~~
apertium_deshtml.cc:4518:6: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register char *dest = YY_CURRENT_BUFFER_LVALUE->yy_ch_buf;
^~~~~~~~~
apertium_deshtml.cc:4519:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register char *source = (yytext_ptr);
^~~~~~~~~
apertium_deshtml.cc:4520:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register int number_to_move, i;
^~~~~~~~~
apertium_deshtml.cc:4520:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register int number_to_move, i;
^~~~~~~~~
apertium_deshtml.cc:4652:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register yy_state_type yy_current_state;
^~~~~~~~~
apertium_deshtml.cc:4653:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register char *yy_cp;
^~~~~~~~~
apertium_deshtml.cc:4677:2: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register int yy_is_jam;
^~~~~~~~~
apertium_deshtml.cc:4678:6: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister]
register char *yy_cp = (yy_c_buf_p);

Apertium loses track of STDIN when called from Node.js?

I have a peculiar issue with Apertium when it is called from Node.js on Linux. It does not behave this way on macOS. Consider the following Node.js code:

File: translate.js

console.log(require('child_process').execSync('/usr/local/bin/apertium -d /usr/local/share/apertium nob-nno_e', { input: 'Jeg er en pike på syv år.' }).toString());

The script opens a child process and runs Apertium. The input element is a string that's passed to the command through a pipe, similar to echo "input-string" | apertium ....

Expected output here would be something like Eg er ei jente på sju år. But the actual output is this:

USAGE: /usr/local/bin/apertium-destxt [ -h | -o | -i | -n ] [input_file [output_file]]
txt format processor 

It seems to me as if Apertium loses track of STDIN when it tries to open yet another subprocess.

I can get this to work if I use a workaround: wrap the call to apertium in a shell script and everything works as expected. I.e.

File: wrapper.sh

#!/bin/bash
cat <&0 | /usr/local/bin/apertium -d /usr/local/share/apertium nob-nno_e

File: translate.js modified to this:

console.log(require('child_process').execSync('wrapper.sh', { input: 'Jeg er en pike på syv år.' }).toString());

I've noted this behavior on Google Compute Engine's Debian instances as well as in Docker containers. I would guess this is an issue that's not very common, but I thought I'd leave it here in case anyone stumbled upon a similar issue.
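
One way to narrow this down is to make the same piped call from a different runtime and compare. The sketch below does that in Python. The helper name run_with_stdin is hypothetical and not part of Apertium, and the paths in the usage note are simply the ones from the report.

```python
import subprocess


def run_with_stdin(cmd, text):
    """Run cmd, feed text to its stdin through a pipe, and return its stdout."""
    result = subprocess.run(
        cmd,
        input=text.encode("utf-8"),  # written to the child's stdin pipe
        stdout=subprocess.PIPE,
        check=True,                  # raise CalledProcessError on non-zero exit
    )
    return result.stdout.decode("utf-8")
```

A call such as run_with_stdin(['/usr/local/bin/apertium', '-d', '/usr/local/share/apertium', 'nob-nno_e'], 'Jeg er en pike på syv år.') mirrors the Node.js snippet. Unlike execSync, it does not go through a shell, so a difference in behaviour between the two may point at the shell layer rather than at Apertium itself.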

deshtml crash in UTF-8 decoding

This crash happens on OpenBSD.

(gdb) run < /tmp/foo.html
Starting program: /usr/ports/pobj/apertium-3.5.2/fake-amd64/usr/local/bin/apertium-deshtml < /tmp/foo.html

Program received signal SIGSEGV, Segmentation fault.
utf8::internal::validate_next<char const*> (
    it=@0x7f7ffffd9b80: 0x8a7ce979ebd <error: Cannot access memory at address 0x8a7ce979ebd>, 
    end=0x8a7ce979ec1 <error: Cannot access memory at address 0x8a7ce979ec1>, 
    code_point=@0x7f7ffffd9b44: 0) at ../utf8/utf8/core.h:232
232             const octet_difference_type length = utf8::internal::sequence_length(it);
(gdb) bt
#0  utf8::internal::validate_next<char const*> (
    it=@0x7f7ffffd9b80: 0x8a7ce979ebd <error: Cannot access memory at address 0x8a7ce979ebd>, 
    end=0x8a7ce979ec1 <error: Cannot access memory at address 0x8a7ce979ec1>, 
    code_point=@0x7f7ffffd9b44: 0) at ../utf8/utf8/core.h:232
#1  0x00000866e140c685 in utf8::next<char const*> (
    it=@0x7f7ffffd9b80: 0x8a7ce979ebd <error: Cannot access memory at address 0x8a7ce979ebd>, 
    end=0x8a7ce979ec1 <error: Cannot access memory at address 0x8a7ce979ec1>)
    at ../utf8/utf8/checked.h:140
#2  0x00000866e1400ea0 in escape (str=...) at apertium_deshtml.cctmp:113
#3  0x00000866e1401011 in escape (str=...) at apertium_deshtml.cctmp:127
#4  0x00000866e140805d in printBuffer () at apertium_deshtml.cctmp:479
#5  0x00000866e1408cd4 in yylex () at apertium_deshtml.cctmp:692
#6  0x00000866e140c029 in main (argc=<optimized out>, argv=0x7f7ffffd9de8)
    at apertium_deshtml.cctmp:801

foo.html:

<!doctype html>
<title>.</title>

shellcheck

shellcheck blocked the nightly build, saying:

In ../../apertium/apertium line 262:
  grep "slides\/slide" |\
              ^-- SC1117: Backslash is literal in "\/". Prefer explicit escaping: "\\/".


In ../../apertium/apertium line 394:
ARGS_ALL=( $@ )              # so we can index into it with a variable
           ^-- SC2206: Quote to prevent word splitting, or split robustly with mapfile or read -a.


In ../../apertium/apertium line 399:
    *) ARGS_PREOPT+=($arg); (( OPTIND++ )) ;;
                     ^-- SC2206: Quote to prevent word splitting, or split robustly with mapfile or read -a.


In ../../apertium/apertium line 488:
  txt|rtf|html|xpresstag|mediawiki)
      ^-- SC2221: This pattern always overrides a later one.


In ../../apertium/apertium line 493:
  rtf)
  ^-- SC2222: This pattern never matches because of a previous pattern.


In ../../apertium/apertium line 497:
    MILOCALE=$(locale -a|grep -i -v "utf\|^C$\|^POSIX$"|head -1);
                                        ^-- SC1117: Backslash is literal in "\|". Prefer explicit escaping: "\\|".
                                             ^-- SC1117: Backslash is literal in "\|". Prefer explicit escaping: "\\|".


In ../../apertium/apertium line 586:
    MILOCALE=$(locale -a|grep -i -v "utf\|^C$\|^POSIX$"|head -1);
                                        ^-- SC1117: Backslash is literal in "\|". Prefer explicit escaping: "\\|".
                                             ^-- SC1117: Backslash is literal in "\|". Prefer explicit escaping: "\\|".

reject-current-rule loops forever with -z

zreject.t1x:

<?xml version="1.0" encoding="utf-8"?>
<transfer default="chunk">
  <section-def-cats>
    <def-cat n="n_np">
      <cat-item tags="n.*"/>
   </def-cat>
    <def-cat n="adj">
      <cat-item tags="adj.*"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="nbr">
      <attr-item tags="sg"/>
      <attr-item tags="pl"/>
    </def-attr>
  </section-def-attrs>

  <section-def-vars>
    <def-var n="ntags"/>
  </section-def-vars>

  <section-def-lists>
    <def-list n="blah">
      <list-item v="blah"/>
    </def-list>
  </section-def-lists>

  <section-def-macros>

    <def-macro n="foo" npar="1">
      <let><var n="ntags"/><clip pos="1" side="tl" part="tags"/></let>
    </def-macro>

  </section-def-macros>

  <section-rules>
    <rule comment="ADJ N">
      <pattern>
        <pattern-item n="adj"/>
        <pattern-item n="n_np"/>
      </pattern>
      <action>
        <choose>
          <when><test><equal><clip pos="2" side="tl" part="nbr"/><lit-tag v="sg"/></equal></test>
          <reject-current-rule shifting="no"/>
        </when></choose>
        <out>
          <chunk namefrom="c_name" case="caseFirstWord">
            <tags><tag><lit-tag v="bah"/></tag></tags>
            <lu><clip pos="1" side="tl" part="whole"/></lu>
            <b pos="1"/>
            <lu><clip pos="2" side="tl" part="whole"/></lu>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</transfer>

loops forever:

$ apertium-preprocess-transfer zreject.t1x zreject.t1x.bin
$ echo -e '^a<blah>/a<blah>$[]\0^b<adj><ind>/b<adj><ind>$ ^c<n><sg>/c<n><sg>$[]\0'  \
     | apertium-transfer -z -b zreject.t1x zreject.t1x.bin \
     | head -c999999 |wc -c
999999

apertium-tagger duplicates compound parts

$ echo '^TV-karriere/TV<np><al><cmp>+karriere<n><m><sg><ind>/Tv<n><m><sg><ind><cmp>+karriere<n><m><sg><ind>$' | apertium-tagger -g nob-nno.prob 
^Tv<n><m><sg><ind><cmp>+karriere<n><m><sg><ind>+karriere<n><m><sg><ind>$

nob-nno.prob.zip

Strangely, it's dependent on the .prob file; with swe-nor it gets completely cut off:

$ echo '^TV-karriere/TV<np><al><cmp>+karriere<n><m><sg><ind>/Tv<n><m><sg><ind><cmp>+karriere<n><m><sg><ind>$' | apertium-tagger -g swe-nor.prob
^TV<np><al><cmp>+

swe-nor.prob.zip

while with sme-nob it actually gets it right:

$ echo '^TV-karriere/TV<np><al><cmp>+karriere<n><m><sg><ind>/Tv<n><m><sg><ind><cmp>+karriere<n><m><sg><ind>$' | apertium-tagger -g sme-nob.prob
^TV<np><al><cmp>+karriere<n><m><sg><ind>$

sme-nob.prob.zip

Include distfiles as release assets

The Apertium releases don’t currently include any release tarballs generated by make dist; they only contain the default tarballs generated by GitHub.

These aren’t suitable for packaging:

  • GitHub’s autogenerated tarballs are not stable—they’re subject to change when GitHub’s infrastructure changes (e.g., their backends update git/gzip/tar…), breaking packaging systems that assume the tarball has a consistent checksum
  • They require the user/packager to depend on Autoconf and Automake

Once generated with make dist, distfiles can be easily attached to a tag on the GitHub releases page.

I’ve attached a copy of apertium-3.5.1.tar.gz which I generated with make dist. The 3.5.0 tag could use the distfile already uploaded to SourceForge. I’d be willing to upload these myself, but I don’t seem to have the permissions for that.
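
The checksum assumption from the first bullet can be illustrated with a short sketch. It is Python, used only for illustration. The names sha256_of and verify_distfile are hypothetical, and real packaging systems have their own mechanisms for this.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large tarballs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_distfile(path, expected_sha256):
    """True if the downloaded tarball still matches the pinned checksum."""
    return sha256_of(path) == expected_sha256
```

A check like this fails as soon as a regenerated tarball differs by a single byte, even when the unpacked sources are identical. A fixed, uploaded distfile avoids that.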

Deformatters escape too much

There is no reason for deformatters to escape anything except [] inside []. Currently a lot of symbols that are special elsewhere in the stream are also escaped inside a superblank [], but there is no practical need for this. Once a [ is seen, parsing should continue until the next unescaped ].
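
The proposal can be sketched as a pair of functions. This is an illustration in Python, not the deformatters' actual code, and both function names are hypothetical. It also assumes the backslash itself is escaped, so that a reader can tell an escaped bracket from a literal backslash.

```python
def escape_superblank(content):
    """Wrap content in [ ], escaping only brackets (and, by assumption, backslashes)."""
    out = []
    for ch in content:
        if ch in "[]\\":
            out.append("\\")
        out.append(ch)
    return "[" + "".join(out) + "]"


def read_superblank(stream, start):
    """Parse the superblank opening at stream[start]; return (content, next_index)."""
    if stream[start] != "[":
        raise ValueError("no superblank at this position")
    out = []
    i = start + 1
    while i < len(stream):
        ch = stream[i]
        if ch == "\\" and i + 1 < len(stream):
            out.append(stream[i + 1])   # keep the escaped character verbatim
            i += 2
        elif ch == "]":
            return "".join(out), i + 1  # first unescaped ']' closes the blank
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated superblank")
```

With this scheme, characters such as ^, $, /, { and } pass through a superblank untouched, because the reader never interprets them between [ and ].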

Tests fail for the master branch

The tests currently behave inconsistently: with make test, all three types of tests PASS, but when we run python tests/run_tests.py directly, we get this error output:

abinash@change-the-world:~/apertium(master)$ python tests/run_tests.py 
test_cat_is_a_verb (tagger.AmbiguityClassTest) ... run /home/abinash/apertium/apertium/apertium-tagger -s 0 /tmp/tmp8y88Bi /tmp/tmpPDoT58 /tmp/tmpmVx6SI /tmp/tmpAY3r0o /tmp/tmpdE6_88 /tmp/tmpPDoT58

11 states and 13 ambiguity classes
{DET} 	 Word: The -- {DET} 	 Word: The
{VERB} 	 Word: falling -- {VERB} 	 Word: falling
{NOUN} 	 Word: cat -- {NOUN} 	 Word: cat
{VERB} 	 Word: has -- {VERB} 	 Word: has
{VERB} 	 Word: booked -- {VERB} 	 Word: booked
{NOUN} 	 Word: books -- {VERB,NOUN} 	 Word: books
{TAG_SENT} 	 Word: . -- {TAG_SENT} 	 Word: .
{VERB} 	 Word: Close -- {VERB,NOUN,ADJ} 	 Word: Close
{DET} 	 Word: the -- {DET} 	 Word: the
{NOUN} 	 Word: books -- {VERB,NOUN} 	 Word: books
{TAG_SENT} 	 Word: . -- {TAG_SENT} 	 Word: .
{DET} 	 Word: The -- {DET} 	 Word: The
{VERB} 	 Word: falling -- {VERB} 	 Word: falling
{NOUN} 	 Word: cat -- {NOUN} 	 Word: cat
{VERB} 	 Word: has -- {VERB} 	 Word: has
{NOUN} 	 Word: books -- {VERB,NOUN} 	 Word: books
{TAG_SENT} 	 Word: . -- {TAG_SENT} 	 Word: .
{TAG_kEOF} 	 Word:  -- {TAG_kEOF} 	 Word: 

*** Error in `/home/abinash/apertium/apertium/.libs/apertium-tagger': free(): invalid next size (fast): 0x0000564f64df99d0 ***
ERROR
test_changing_class_hmm_sup (tagger.AmbiguityClassTest) ... run /home/abinash/apertium/apertium/apertium-tagger -s 0 /tmp/tmp_J5fZJ /tmp/tmpVR0ZGW /tmp/tmp4C9rxw /tmp/tmpTosizp /tmp/tmpmkv7s9 /tmp/tmpVR0ZGW

11 states and 13 ambiguity classes
{DET} 	 Word: The -- {DET} 	 Word: The
{NOUN} 	 Word: cat -- {NOUN} 	 Word: cat
{VERB} 	 Word: books -- {VERB,NOUN} 	 Word: books
{DET} 	 Word: the -- {DET} 	 Word: the
{NOUN} 	 Word: room -- {NOUN} 	 Word: room
{TAG_SENT} 	 Word: . -- {TAG_SENT} 	 Word: .
{DET} 	 Word: The -- {DET} 	 Word: The
{ADJ} 	 Word: red -- {ADJ} 	 Word: red
{NOUN} 	 Word: cat -- {NOUN} 	 Word: cat
{VERB} 	 Word: books -- {VERB,NOUN} 	 Word: books
{DET} 	 Word: the -- {DET} 	 Word: the
{ADJ} 	 Word: red -- {ADJ} 	 Word: red
{NOUN} 	 Word: room -- {NOUN} 	 Word: room
{TAG_SENT} 	 Word: . -- {TAG_SENT} 	 Word: .
{DET} 	 Word: The -- {DET} 	 Word: The
{ADJ} 	 Word: red -- {ADJ} 	 Word: red
{NOUN} 	 Word: cat -- {NOUN} 	 Word: cat
{VERB} 	 Word: books -- {VERB,NOUN} 	 Word: books
{DET} 	 Word: the -- {DET} 	 Word: the
{NOUN} 	 Word: room -- {NOUN} 	 Word: room
{TAG_SENT} 	 Word: . -- {TAG_SENT} 	 Word: .
{TAG_kEOF} 	 Word:  -- {TAG_kEOF} 	 Word: 

*** Error in `/home/abinash/apertium/apertium/.libs/apertium-tagger': free(): invalid next size (fast): 0x000055631235e9d0 ***
ERROR
test_changing_class_hmm_unsup (tagger.AmbiguityClassTest) ... run /home/abinash/apertium/apertium/apertium-tagger -t 1 /tmp/tmpk4FFdK /tmp/tmpg9AaSS /tmp/tmpKaXgx_ /tmp/tmp33ZA3Z

11 states and 13 ambiguity classes
ERROR
test_changing_class_sliding_window (tagger.AmbiguityClassTest) ... run /home/abinash/apertium/apertium/apertium-tagger --sliding-window -t 1 /tmp/tmprtyZ9I /tmp/tmplRbrZh /tmp/tmppvbiHI /tmp/tmpdQI_lx

11 states and 13 ambiguity classes

run /home/abinash/apertium/apertium/apertium-tagger -d --sliding-window -g /tmp/tmpdQI_lx /tmp/tmpmbaHZI
ERROR

======================================================================
ERROR: test_cat_is_a_verb (tagger.AmbiguityClassTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 316, in test_cat_is_a_verb
    model_fn, tagged, untagged])
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 85, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['/home/abinash/apertium/apertium/apertium-tagger', '-s', '0', '/tmp/tmp8y88Bi', '/tmp/tmpPDoT58', '/tmp/tmpmVx6SI', '/tmp/tmpAY3r0o', '/tmp/tmpdE6_88', '/tmp/tmpPDoT58']' returned non-zero exit status -6

======================================================================
ERROR: test_changing_class_hmm_sup (tagger.AmbiguityClassTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 290, in test_changing_class_hmm_sup
    model_fn, tagged, untagged])
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 85, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['/home/abinash/apertium/apertium/apertium-tagger', '-s', '0', '/tmp/tmp_J5fZJ', '/tmp/tmpVR0ZGW', '/tmp/tmp4C9rxw', '/tmp/tmpTosizp', '/tmp/tmpmkv7s9', '/tmp/tmpVR0ZGW']' returned non-zero exit status -6

======================================================================
ERROR: test_changing_class_hmm_unsup (tagger.AmbiguityClassTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 298, in test_changing_class_hmm_unsup
    model_fn])
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 85, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['/home/abinash/apertium/apertium/apertium-tagger', '-t', '1', '/tmp/tmpk4FFdK', '/tmp/tmpg9AaSS', '/tmp/tmpKaXgx_', '/tmp/tmp33ZA3Z']' returned non-zero exit status -11

======================================================================
ERROR: test_changing_class_sliding_window (tagger.AmbiguityClassTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 307, in test_changing_class_sliding_window
    self.changing_class_impl(['--sliding-window'], model_fn)
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 268, in changing_class_impl
    stdout=self.devnull)
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 85, in inner
    return f(*args, **kwargs)
  File "/home/abinash/apertium/tests/tagger/__init__.py", line 67, in check_stderr
    with Popen(*popenargs, stderr=PIPE, **kwargs) as process:
AttributeError: __exit__

----------------------------------------------------------------------
Ran 4 tests in 2.661s

FAILED (errors=4)
runTest (pretransfer.BasicPretransferTest) ... ok
runTest (pretransfer.InlineBlankPretransferTest) ... expected failure
runTest (pretransfer.JoinGroupPretransferTest) ... ok
runTest (pretransfer.PretransferTest) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.044s

OK (expected failures=1)
runTest (postchunk.EmptyMacroPostchunkTest) ... ok
runTest (postchunk.EmptyNoMacroPostchunkTest) ... ok
runTest (postchunk.PostchunkTest) ... ok
runTest (postchunk.SimplePostchunkTest) ... ERROR
runTest (postchunk.UseMacroPostchunkTest) ... ok

======================================================================
ERROR: runTest (postchunk.SimplePostchunkTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/abinash/apertium/tests/postchunk/__init__.py", line 65, in runTest
    self.assertEqual(self.communicateFlush(inp+"[][\n]"),
  File "/home/abinash/apertium/tests/postchunk/__init__.py", line 37, in communicateFlush
    self.proc.stdin.write(string.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 21: ordinal not in range(128)

----------------------------------------------------------------------
Ran 5 tests in 0.275s

FAILED (errors=1)
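
Two of the tracebacks above are consistent with the suite running under Python 2.7, as the /usr/lib/python2.7 paths suggest. subprocess.Popen only became a context manager in Python 3.2, which would explain AttributeError: __exit__. Calling .encode('utf-8') on a Python 2 byte string first decodes it as ASCII, which would explain the UnicodeDecodeError. The free() aborts in apertium-tagger look like a separate problem. For illustration only, here is a sketch of a stderr-capturing helper that does not need the with statement. check_stderr_compat is a hypothetical name, not the repository's code.

```python
import subprocess
import sys


def check_stderr_compat(cmd, **kwargs):
    """Run cmd, return its stderr as bytes, raise CalledProcessError on failure.

    Popen is used without a `with` block, so the helper does not depend on
    Popen being a context manager.
    """
    process = subprocess.Popen(cmd, stderr=subprocess.PIPE, **kwargs)
    _, stderr = process.communicate()  # waits for exit and drains stderr
    if process.returncode != 0:
        raise subprocess.CalledProcessError(process.returncode, cmd)
    return stderr
```

Running the suite with a Python 3 interpreter would be the other way to check whether these two errors come from the interpreter version.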

modes files should allow aliases

There should be a way to specify aliases in modes files. Probably about 25% of the time I find myself typing e.g. grn-spa-transfer instead of grn-spa-chunker, or grn-spa-generador instead of grn-spa-dgen. It would be cool to be able to alias the most popular/common ones.
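
As a rough sketch of the requested behaviour, alias lookup could be a single extra step before a mode is selected. This is Python for illustration only. The function name, the alias table, and the idea of following one level of alias are all assumptions, not an existing Apertium feature.

```python
def resolve_mode(name, modes, aliases):
    """Return the canonical mode name for name, following at most one alias."""
    if name in modes:
        return name
    target = aliases.get(name)
    if target in modes:
        return target
    raise KeyError("unknown mode: " + name)


# Hypothetical data, using the mode names mentioned above.
modes = {"grn-spa-chunker", "grn-spa-dgen"}
aliases = {
    "grn-spa-transfer": "grn-spa-chunker",
    "grn-spa-generador": "grn-spa-dgen",
}
```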

Build failure in serialiser.h

Apertium (3.5.0 and master) fails to build on OpenBSD -current with clang-6.0.0.

libtool: compile:  c++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include/lttoolbox-3.4 -I/usr/local/lib/lttoolbox-3.4/include -I/usr/local/include/libxml2 -I/usr/local/include -I/usr/local/include -I/usr/local/include/lttoolbox-3.4 -I/usr/local/lib/lttoolbox-3.4/include -I/usr/local/include/libxml2 -I/usr/local/include -Wall -Wextra -g -O2 -std=c++2a -MT collection.lo -MD -MP -MF .deps/collection.Tpo -c collection.cc  -fPIC -DPIC -o .libs/collection.o
In file included from collection.cc:20:
In file included from ../apertium/serialiser.h:28:
/usr/local/include/lttoolbox-3.4/lttoolbox/serialiser.h:239:34: error: member
      reference base type 'const unsigned long' is not a structure or union
  uint64_t size = SerialisedType_.size();
                  ~~~~~~~~~~~~~~~^~~~~
collection.cc:96:23: note: in instantiation of member function '(anonymous
      namespace)::Serialiser<unsigned long>::serialise' requested here
  Serialiser<size_t>::serialise(element.size(), serialised);
                      ^
In file included from collection.cc:20:
In file included from ../apertium/serialiser.h:28:
/usr/local/include/lttoolbox-3.4/lttoolbox/serialiser.h:242:17: error: type
      'unsigned long' cannot be used prior to '::' because it has no members
  for (typename Container::const_iterator value_type_ =
                ^
In file included from collection.cc:21:
In file included from ../apertium/deserialiser.h:28:
/usr/local/include/lttoolbox-3.4/lttoolbox/deserialiser.h:212:53: error: member
      reference base type 'typename std::remove_const<unsigned long>::type'
      (aka 'unsigned long') is not a structure or union
      std::inserter(SerialisedType_, SerialisedType_.begin());
                                     ~~~~~~~~~~~~~~~^~~~~~
collection.cc:105:39: note: in instantiation of member function
      'Deserialiser<unsigned long>::deserialise' requested here
  size_t size = Deserialiser<size_t>::deserialise(serialised);
                                      ^
3 errors generated.

apertium-transfer doesn't recognise lowercase lemmas

C/P from sf.net:

Bug or lost feature in apertium-transfer for t1x regarding the case of @lemma attribute of cat-item element. Input from the pipeline up until here:

^seit/kun$ ^Jahrzehnt/vuosikymmen$ ^bevor/ennen$
trying to match:

<def-cat n="seit">
  <cat-item lemma="seit" tags="pr.*"/>
  <cat-item lemma="seit" tags="pr.dat"/>
  <cat-item lemma="seit" tags="cnjsub"/> <!-- XXX: bad disam -->
</def-cat>
<def-cat n="bevor">
  <cat-item lemma="bevor" tags="preadv"/>
</def-cat>
<def-cat n="zeitwort">
  <cat-item lemma="Jahr" tags="n.*"/>
  <cat-item lemma="Jahrzehnt" tags="n.*"/>
  <!--<cat-item lemma="jahrzehnt" tags="n.*"/>-->
</def-cat>

in rule:

<rule comment="seit ZEIT bevor: AIKOIHIN">
  <pattern>
    <pattern-item n="seit"/>
    <pattern-item n="zeitwort"/>
    <pattern-item n="bevor"/>
  </pattern>
  <action>
    <call-macro n="case-mangler">
      <with-param pos="2"/>
    </call-macro>
    <call-macro n="number-mangler">
      <with-param pos="2"/>
    </call-macro>
    <out>
      <chunk name="AdvP" case="caseFirstWord">
        <tags>
            <tag><lit-tag v="ADV"/></tag>
        </tags>
        <lu>
          <clip pos="2" side="tl" part="lem"/>    <!-- Jahr ~ vuosi -->
          <clip pos="2" side="tl" part="a_noun"/> <!-- n -->
          <var n="number"/>                       <!-- sg / pl -->
          <lit-tag v="ill"/>                      <!-- ill -->
        </lu>
      </chunk>
    </out>
  </action>
</rule>

Does not work with pipeline:

$ cat modes/deu-fin-transfer.mode

lt-proc -w -e '/home/tpirinen/github/flammie/apertium-fin-deu/deu-fin.automorf.bin' | cg-proc -w1n '/home/tpirinen/github/flammie/apertium-fin-deu/deu-fin.rlx.bin' | apertium-pretransfer| lt-proc -b '/home/tpirinen/github/flammie/apertium-fin-deu/deu-fin.autobil.bin' | apertium-transfer -c -b '/home/tpirinen/github/flammie/apertium-fin-deu/apertium-fin-deu.deu-fin.t1x'  '/home/tpirinen/github/flammie/apertium-fin-deu/deu-fin.t1x.bin' 

see here:

$ apertium -d . deu-fin-transfer < texts/dw.de-Langsam-gesprochene-Nachrichten-2017-10-10.text | fgrep Liberia
^default{^kun$}$ ^nP{^vuosikymmen$}$ ^default{^ennen$}$
Uncommenting the lower-case lemma in t1x fixes the problem.

side="sl" includes all of source when using apertium-transfer -b

Testing or outputting <clip pos="1" side="sl" part="lemq"/> when using apertium-transfer with the -b option for input '^skyldes<vblex><pstv><pres>/komme# av<vblex><pres>$' will claim that the source part has a lemq # av where it doesn't.

This goes for all uses of side="sl" when run with -b (e.g. part="whole", tags etc.)

Perceptron tagger not properly supporting null flush

It seems to be an issue with flushing. Perceptron tagger seems to not support null-flush at all.

How to reproduce it.

tty 1

mkfifo /tmp/fifotest
sleep 1000  > /tmp/fifotest &  # so it keeps the pipe open
apertium-tagger -zx -g ../src/apertium-eng/eng.prob < /tmp/fifotest 

tty 2

echo -e "^take/take<vblex><pres>$ [][\n]\0" > /tmp/fifotest   # nothing gets printed in tty1
echo -e "^take/take<vblex><pres>$ [][\n]\0" > /tmp/fifotest   # nothing gets printed in tty1
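
For reference, the behaviour expected from a null-flush-aware tool can be sketched as a loop that answers each NUL-terminated segment as soon as it arrives. This is a Python illustration of the contract as understood here, not the tagger's implementation. null_flush_loop and the process callback are hypothetical names.

```python
import io


def null_flush_loop(inp, out, process):
    """Read NUL-terminated segments from inp and answer each one immediately.

    inp and out are binary streams; process maps bytes to bytes and stands
    in for the tagger. Returns the number of NUL-terminated segments handled.
    """
    handled = 0
    buf = bytearray()
    while True:
        byte = inp.read(1)
        if not byte:              # EOF
            break
        if byte == b"\0":
            out.write(process(bytes(buf)) + b"\0")
            out.flush()           # the reader must see the result now, not at EOF
            buf.clear()
            handled += 1
        else:
            buf += byte
    if buf:                       # input ended without a trailing NUL
        out.write(process(bytes(buf)))
        out.flush()
    return handled


# Two NUL-terminated segments, as in the reproduction above.
source = io.BytesIO(b"^take/take<vblex><pres>$ [][\n]\0" * 2)
sink = io.BytesIO()
segments = null_flush_loop(source, sink, lambda segment: segment)
```

In the reproduction above, a tool following this contract would print one result in tty 1 after each echo, without waiting for the FIFO to close.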

Marking intrachunk relations in t1x

After experimenting with secondary tags we realised we can get alignments of source -> target very trivially, and this can help in a number of applications - markup handling, dependency relations (a source sentence marked with semantic dependencies would give us the dependencies in target as well), even generating aligned translations could help in other NLP applications for a language pair.

Most alignments are based on where the lem/lemh of an LU in the output comes from, which gives us fairly good results. But if words are generated in the target language that don't clip the lemma of a source word, and exist only because the target language represents a grammatical relation with separate words, then in t1x we don't really know which source word they align with. Since t1x generates within a chunk, the list of candidates is very small, but if there is more than one viable candidate in a chunk, the choice becomes probabilistic (it can still be linguistically informed: auxiliaries attach to verbs, determiners attach to nouns, etc.). If these linguistic features give us a clear picture of intrachunk relations, then it will be easy to align every output token with a source token.

If this is not the case, then I propose that (in the future) we add a small optional attribute to output <lu>..</lu> blocks that don't clip any lemmas, so that they can mark which part of the source pattern they come from, OR mark the head of that token in the output pattern. The former can be done using something like <lu headpos="2">..</lu>, and the latter by giving IDs to each token and then using intrachunk pointers.

Using input patterns and output chunks, Apertium implicitly aligns phrases (this information could also be useful if we want to mark chunks in the translation output), but we don't have word alignment, even though we're really close. If such a marker becomes usable in future t1x, we won't have to guess what the relations are within chunks. It will be trivial to add while writing the t1x but will go a long way in several applications.

Port apertium-adapt-docx to ODT

https://github.com/apertium/apertium/pull/47/files brings much better docx handling, allowing for better translations to be done.

Currently, the same type of problem occurs with ODT files, where formatting boundaries are incorrectly identified as word boundaries, making translations much worse than they could be.

It would be great to have a tool apertium-adapt-odt similar to apertium-adapt-docx.

mac os x compile problem

I was trying to compile apertium on a relatively fresh / vanilla Mac and ran into this problem:

...
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include/lttoolbox-3.5 -I/opt/local/include/libxml2 -I/opt/local/include -I/usr/local/include/lttoolbox-3.5 -I/opt/local/include/libxml2 -I/opt/local/include -Wall -Wextra -g -O2 -std=c++2a -MT collection.lo -MD -MP -MF .deps/collection.Tpo -c collection.cc  -fno-common -DPIC -o .libs/collection.o
In file included from collection.cc:20:
In file included from ../apertium/serialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/serialiser.h:239:34: error: member
      reference base type 'const unsigned long' is not a structure or union
  uint64_t size = SerialisedType_.size();
                  ~~~~~~~~~~~~~~~^~~~~
collection.cc:98:23: note: in instantiation of member function '(anonymous
      namespace)::Serialiser<unsigned long>::serialise' requested here
  Serialiser<size_t>::serialise(element.size(), serialised);
                      ^
In file included from collection.cc:20:
In file included from ../apertium/serialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/serialiser.h:242:17: error: type
      'unsigned long' cannot be used prior to '::' because it has no members
  for (typename Container::const_iterator value_type_ =
                ^
In file included from collection.cc:21:
In file included from ../apertium/deserialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/deserialiser.h:194:66: error: member
      reference base type 'typename std::remove_const<unsigned long>::type'
      (aka 'unsigned long') is not a structure or union
  auto insert_it = std::inserter(SerialisedType_, SerialisedType_.begin());
                                                  ~~~~~~~~~~~~~~~^~~~~~
collection.cc:107:39: note: in instantiation of member function
      'Deserialiser<unsigned long>::deserialise' requested here
  size_t size = Deserialiser<size_t>::deserialise(serialised);
                                      ^
3 errors generated.
make[2]: *** [collection.lo] Error 1
make[1]: *** [all] Error 2
make: *** [all-recursive] Error 1
tpi006@uit-mac-219 apertium % 

I'm sure I've seen it on irc before but cannot find the solution atm.

apertium accesses the network during build

Creating apertium-gen-deformat script
Creating apertium-gen-reformat script
Creating apertium-validate-tagger script
Creating apertium-validate-transfer script
Creating apertium-validate-dictionary script
Creating apertium-validate-modes script
Creating apertium-validate-interchunk script
Creating apertium-validate-postchunk script
Creating apertium script
Creating apertium-unformat script
Creating apertium-validate-acx script
Creating apertium-utils-fixlatex script
svn checkout https://github.com/unhammer/apertium-get/trunk apertium-get
svn: E170013: Unable to connect to a repository at URL 'https://github.com/unhammer/apertium-get/trunk'
svn: E670005: no address associated with name

On OpenBSD we build packages as an unprivileged user with no network access. If this script is required to build then it should be included in the source tarballs…

Apertium build fails in MacOSX

I installed all the dependencies from MacPorts as suggested in the wiki. I built lttoolbox successfully, but I get this error when building Apertium:

ltosi1294$ make
Making all in apertium
/Applications/Xcode.app/Contents/Developer/usr/bin/make all-am
depbase=echo collection.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||';
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include/lttoolbox-3.5 -I/usr/local/lib/lttoolbox-3.5/include -I/opt/local/include/libxml2 -I/opt/local/include -I/usr/local/include/lttoolbox-3.5 -I/usr/local/lib/lttoolbox-3.5/include -I/opt/local/include/libxml2 -I/opt/local/include -Wall -Wextra -g -O2 -std=c++2a -MT collection.lo -MD -MP -MF $depbase.Tpo -c -o collection.lo collection.cc &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include/lttoolbox-3.5 -I/usr/local/lib/lttoolbox-3.5/include -I/opt/local/include/libxml2 -I/opt/local/include -I/usr/local/include/lttoolbox-3.5 -I/usr/local/lib/lttoolbox-3.5/include -I/opt/local/include/libxml2 -I/opt/local/include -Wall -Wextra -g -O2 -std=c++2a -MT collection.lo -MD -MP -MF .deps/collection.Tpo -c collection.cc -fno-common -DPIC -o .libs/collection.o
In file included from collection.cc:20:
In file included from ../apertium/serialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/serialiser.h:239:34: error: member reference base type 'const unsigned long' is not a structure or union
uint64_t size = SerialisedType_.size();
~~~~~~~~~~~~~~~^~~~~
collection.cc:98:23: note: in instantiation of member function '(anonymous namespace)::Serialiser::serialise' requested here
Serialiser<size_t>::serialise(element.size(), serialised);
^
In file included from collection.cc:20:
In file included from ../apertium/serialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/serialiser.h:242:17: error: type 'unsigned long' cannot be used prior to '::' because it has no members
for (typename Container::const_iterator value_type_ =
^
In file included from collection.cc:21:
In file included from ../apertium/deserialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/deserialiser.h:212:53: error: member reference base type 'typename std::remove_const::type' (aka 'unsigned long') is not a
structure or union
std::inserter(SerialisedType_, SerialisedType_.begin());
~~~~~~~~~~~~~~~^~~~~~
collection.cc:107:39: note: in instantiation of member function 'Deserialiser::deserialise' requested here
size_t size = Deserialiser<size_t>::deserialise(serialised);
^
3 errors generated.
make[2]: *** [collection.lo] Error 1
make[1]: *** [all] Error 2
make: *** [all-recursive] Error 1
MacBook-Pro-di-Lorenzo:apertium ltosi1294$ sudo make install
Making install in apertium
depbase=echo collection.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||';
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include/lttoolbox-3.5 -I/usr/local/lib/lttoolbox-3.5/include -I/opt/local/include/libxml2 -I/opt/local/include -I/usr/local/include/lttoolbox-3.5 -I/usr/local/lib/lttoolbox-3.5/include -I/opt/local/include/libxml2 -I/opt/local/include -Wall -Wextra -g -O2 -std=c++2a -MT collection.lo -MD -MP -MF $depbase.Tpo -c -o collection.lo collection.cc &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include/lttoolbox-3.5 -I/usr/local/lib/lttoolbox-3.5/include -I/opt/local/include/libxml2 -I/opt/local/include -I/usr/local/include/lttoolbox-3.5 -I/usr/local/lib/lttoolbox-3.5/include -I/opt/local/include/libxml2 -I/opt/local/include -Wall -Wextra -g -O2 -std=c++2a -MT collection.lo -MD -MP -MF .deps/collection.Tpo -c collection.cc -fno-common -DPIC -o .libs/collection.o
In file included from collection.cc:20:
In file included from ../apertium/serialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/serialiser.h:239:34: error: member reference base type 'const unsigned long' is not a structure or union
uint64_t size = SerialisedType_.size();
~~~~~~~~~~~~~~~^~~~~
collection.cc:98:23: note: in instantiation of member function '(anonymous namespace)::Serialiser::serialise' requested here
Serialiser<size_t>::serialise(element.size(), serialised);
^
In file included from collection.cc:20:
In file included from ../apertium/serialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/serialiser.h:242:17: error: type 'unsigned long' cannot be used prior to '::' because it has no members
for (typename Container::const_iterator value_type_ =
^
In file included from collection.cc:21:
In file included from ../apertium/deserialiser.h:28:
/usr/local/include/lttoolbox-3.5/lttoolbox/deserialiser.h:212:53: error: member reference base type 'typename std::remove_const::type' (aka 'unsigned long') is not a
structure or union
std::inserter(SerialisedType_, SerialisedType_.begin());
~~~~~~~~~~~~~~~^~~~~~
collection.cc:107:39: note: in instantiation of member function 'Deserialiser::deserialise' requested here
size_t size = Deserialiser<size_t>::deserialise(serialised);
^
3 errors generated.
make[1]: *** [collection.lo] Error 1
make: *** [install-recursive] Error 1

use just xmllint instead of $XMLLINT in Makefile.am / generated scripts

Scripts like apertium-validate-dictionary refer to the xmllint that configure finds at the time you ran it, e.g. /opt/local/bin/xmllint. That makes the script not work if you copy it to other computers or install xmllint to a different path. Let it just use xmllint with no path.

(similarly for bash – some systems might use sh there, which is even worse, since sh on some systems is definitely not bash-compatible)
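To illustrate the difference between a path baked in at configure time and a lookup at run time, here is a minimal Python sketch. The helper name `resolve_tool` is hypothetical and is not part of Apertium; it only shows the "find it on PATH when the script runs" behaviour the issue asks for.

```python
import shutil


def resolve_tool(name: str) -> str:
    """Look up `name` on the current PATH each time the script runs.

    This mirrors calling plain `xmllint` instead of a hard-coded
    /opt/local/bin/xmllint recorded by configure.
    """
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name} not found on PATH")
    return path
```

A script built this way keeps working when it is copied to a machine where xmllint lives in a different directory, as long as that directory is on PATH.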

Space lost in cg-proc

I'm experimenting with CG's MergeCohorts command. It suits my needs perfectly, but the blank after the merged cohorts is lost.

This is what is happening:
Step 1:

$ echo "Saint-Jean a" | lt-proc -w '/home/hector/apertium/apertium-fra/fra.automorf.bin'
^Saint/saint<adj><m><sg>/saint<n><m><sg>$-^Jean/jean<n><m><sg>/Jean<np><ant><m><sg>$ ^a/avoir<vbhaver><pri><p3><sg>/avoir<vblex><pri><p3><sg>$

(There is a space before "^a/avoir...", like in the input)

Step 2:

$ echo "Saint-Jean a" | lt-proc -w '/home/hector/apertium/apertium-fra/fra.automorf.bin' | cg-proc -w '/home/hector/apertium/apertium-fra/fra.rlx.bin'
^Saint-Jean/*Saint-Jean$^a/avoir<vbhaver><pri><p3><sg>/avoir<vblex><pri><p3><sg>$

(The space has been lost)

So:

$ echo "Saint-Jean a" | lt-proc -w '/home/hector/apertium/apertium-fra/fra.automorf.bin' > output1.txt 
$ cat output1.txt | cg-proc -w '/home/hector/apertium/apertium-fra/fra.rlx.bin'
^Saint-Jean/*Saint-Jean$^a/avoir<vbhaver><pri><p3><sg>/avoir<vblex><pri><p3><sg>$

I attach a minimal rlx file to compile and output1.txt.
output1.txt
apertium-fra.fra.rlx.txt

Skip text in quotes

Is there currently any way in Apertium to skip text inside quotes?

The only way I can think of is pre-processing with <apertium-notrans>:

echo 'tekst «med sitat»' \
| sed 's%«[^»]*»%<apertium-notrans>&</apertium-notrans>%g' \
| apertium -f html-noent  -d . nob-nno_e \
| sed 's%</*apertium-notrans>%%g'

but perhaps there are better solutions?
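For reference, the same pre- and post-processing can be sketched in Python. This is not a better solution, only the sed workaround above restated as two functions; the names `protect_quotes` and `strip_notrans` are made up for this example, and the translation step in between is left out.

```python
import re

OPEN_TAG = "<apertium-notrans>"
CLOSE_TAG = "</apertium-notrans>"

# Same pattern as the first sed call: a «...» span with no nested closing quote.
QUOTED = re.compile(r"«[^»]*»")


def protect_quotes(text: str) -> str:
    """Wrap every «...» span in <apertium-notrans> tags."""
    return QUOTED.sub(lambda m: OPEN_TAG + m.group(0) + CLOSE_TAG, text)


def strip_notrans(text: str) -> str:
    """Remove the wrapper tags again after translation."""
    return text.replace(OPEN_TAG, "").replace(CLOSE_TAG, "")


print(protect_quotes("tekst «med sitat»"))
# tekst <apertium-notrans>«med sitat»</apertium-notrans>
```

As with the sed version, the text between the two calls would still have to go through apertium -f html-noent for the tags to be respected.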

apertium -f line doesn't retain line-breaks in place

I was testing a frequency list in apertium-udm-kpv when this happens:

$ head dev/komikyv.freqs 
   4750 да
   3237 —
   2328 .
   2324 ,
   1900 
   1317 и
   1179 оз
   1131 эз
   1088 вӧлі
    987 вылӧ

$ head dev/komikyv.freqs | apertium -f line -d . kpv-udm-debug
   @4750 @да
   @3237 @—
   @2328 
   @2324 ,
   @1900 1317 
    но
   @1179 @оз
   @1131 @оз
   @1088 @вӧвны
    @987 @вылӧ

the line with 1900 freq is probably an NBSP token but -f line should still not break like that...
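To check the NBSP guess before digging into the formatter, a small Python sketch like the following could list the input lines that contain U+00A0. The function name is hypothetical, and the sample data only imitates the frequency list above.

```python
NBSP = "\u00a0"


def lines_with_nbsp(lines):
    """Return the 1-based numbers of the lines that contain a no-break space."""
    return [n for n, line in enumerate(lines, start=1) if NBSP in line]


sample = ["   4750 да", "   1900 \u00a0", "   1317 и"]
print(lines_with_nbsp(sample))
# [2]
```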

Interchunk/Transfer memory leak

On the APy host, I've observed apertium-interchunk sitting at 18 GB VIRT / 12 GB RES, for both nno-nob and oci-fra.

Also now spotted apertium-transfer at 9 GB VIRT / 6 GB RES for en-ca, so must be something more general.

apertium -v or --version should give version

$ apertium -v
ERROR: Unknown option v
USAGE: apertium [-d datadir] [-f format] [-u] <direction> [in [out]]
 -d datadir       directory of linguistic data
 -f format        one of: txt (default), html, rtf, odt, docx, wxml, xlsx, pptx,
                  xpresstag, html-noent, latex, latex-raw, line
 -a               display ambiguity
 -u               don't display marks '*' for unknown words
 -n               don't insert period before possible sentence-ends
 -m memory.tmx    use a translation memory to recycle translations
 -o direction     translation direction using the translation memory,
                  by default 'direction' is used instead
 -l               lists the available translation directions and exits
 direction        typically, LANG1-LANG2, but see modes.xml in language data
 in               input file (stdin by default)
 out              output file (stdout by default)

$ apertium --version
ERROR: Unknown option -
USAGE: apertium [-d datadir] [-f format] [-u] <direction> [in [out]]
 -d datadir       directory of linguistic data
 -f format        one of: txt (default), html, rtf, odt, docx, wxml, xlsx, pptx,
                  xpresstag, html-noent, latex, latex-raw, line
 -a               display ambiguity
 -u               don't display marks '*' for unknown words
 -n               don't insert period before possible sentence-ends
 -m memory.tmx    use a translation memory to recycle translations
 -o direction     translation direction using the translation memory,
                  by default 'direction' is used instead
 -l               lists the available translation directions and exits
 direction        typically, LANG1-LANG2, but see modes.xml in language data
 in               input file (stdin by default)
 out              output file (stdout by default)

apertium should check for xsltproc

Bug report from someone trying to just work with apertium-uig. It's somehow possible to install apertium and lttoolbox without having xsltproc installed:

apertium-validate-modes modes.xml
apertium-gen-modes modes.xml
/ha/home/zeman/nastroje/hfst/bin/apertium-gen-modes: line 89: /usr/bin/xsltproc: No such file or directory
/ha/home/zeman/nastroje/hfst/bin/apertium-gen-modes: line 90: /usr/bin/xsltproc: No such file or directory
make: *** [modes/uig-morph.mode] Error 127

b pos test?

I may not understand how <b pos=x> works. When I try:

    <rule comment="Compose syntactic past form">
      <pattern>
        <pattern-item n="pastverb"/>
      </pattern>
      <action>
        <call-macro n="tensemood-mangler">
          <with-param pos="1"/>
        </call-macro>
        <out>
          <chunk name="vp" case="caseFirstWord">
            <tags>
              <tag><lit-tag v="VP"/></tag>
            </tags>
            <lu>
              <lit v="haben"/>
              <lit-tag v="vbhaver.pri"/>
              <clip pos="1" side="tl" part="a_prsnum"/>
            </lu>
            <b pos="1"/>
            <lu>
              <clip pos="1" side="tl" part="lem"/>
              <clip pos="1" side="tl" part="a_verb"/>
              <lit-tag v="pp"/>
            </lu>
          </chunk>
        </out>
      </action>
    </rule>

I get error messages:

Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
Error in /home/tpirinen/github/apertium/apertium-fin-deu/apertium-fin-deu.fin-deu.t1x: line 190: index >= limit
...

repeated. If I use pos="0" it works, but I thought these positions start from 1.
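One possible reading of the error, assuming that <b pos="n"/> refers to the blank after the n-th matched word: a one-word pattern has no blank between words, so pos="1" is out of range, while pos="0" falls back to a plain space. The Python sketch below is only a simplified model of that assumption, not the actual transfer code; `blank_after` and its arguments are hypothetical.

```python
def blank_after(blanks, pos):
    """Model of <b pos="..."/> under the assumption stated above.

    blanks[i] is the blank between matched word i+1 and word i+2,
    so a pattern of N words has N-1 blanks.
    """
    index = pos - 1
    if index < 0:
        # Assumed behaviour for pos="0": emit a single space.
        return " "
    if index >= len(blanks):
        raise IndexError("index >= limit")
    return blanks[index]


print(repr(blank_after([" "], 1)))  # two-word pattern: the blank between the words
# ' '
print(repr(blank_after([], 0)))     # one-word pattern, pos="0"
# ' '
```

Under this model, calling `blank_after([], 1)` for the one-word pastverb pattern raises the "index >= limit" error.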

reject-current-rule in interchunk/postchunk too

I have a rule that changes tl determiner gender to fit the following noun gender, but I'd not like it to apply if the source language genders don't match up (meaning they shouldn't be chunked). In t1x/chunker, I can use <reject-current-rule shifting="no"> to undo the two-word rule match and go back to the single-word rules, but that's not implemented yet for interchunk, where I have to duplicate the contents of the single-word rules inside the two-word rules, which is not very DRY.

@jimregan is there any reason it would be harder to do in interchunk than in chunker?

t?x let element should warn if it loses stuff

I tend to think of the let element in the t?x scripting language as the let of any programming language, and I am commonly confused when code like:

<let><clip pos="1" part="case" side="tl"/><lit-tag v="nom"/></let>
...
<lu>
<clip part="lem" pos="1" side="tl"/>
...
<clip part="case" pos="1" side="tl"/>
...

will create an empty string for the case part, because there is nothing matching the case attr in the source language. If it were possible to detect this at runtime and warn, it would cut my debugging time a lot. I think I must've spent days debugging this, counting all instances together ;-D
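The requested behaviour can be sketched in Python with a simplified model. It assumes that assigning to a clip only replaces a tag that is already present in the lexical unit and otherwise changes nothing; `set_part` and the tag lists are hypothetical and do not reflect the real transfer implementation.

```python
import re
import warnings


def set_part(lexical_unit, attr_tags, new_tag):
    """Replace the first tag from attr_tags with new_tag, warning if none is found."""
    pattern = re.compile("<(?:" + "|".join(re.escape(t) for t in attr_tags) + ")>")
    if not pattern.search(lexical_unit):
        # This is the silent case the issue describes: the value is simply lost.
        warnings.warn(
            f"no tag from {attr_tags} in {lexical_unit!r}; <{new_tag}> is lost"
        )
        return lexical_unit
    return pattern.sub(lambda m: f"<{new_tag}>", lexical_unit, count=1)


print(set_part("talo<n><sg><gen>", ["nom", "gen"], "nom"))
# talo<n><sg><nom>
```

With a lexical unit that has no case tag, such as "talo<n><sg>", the function returns the input unchanged and emits a warning instead of failing silently.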

cg-proc and vislcg3 do not give the same matches

As in #65, I have an issue using the MergeCohorts statement.

For something like "Saint-de-Jour", everything works fine:

$ echo "Saint-de-Jour" | lt-proc -w '/home/hector/apertium/apertium-fra/fra.automorf.bin' | cg-proc -w '/home/hector/apertium/apertium-fra/fra.rlx.bin'
^Saint-de-Jour/*Saint-de-Jour

But there are problems when there are subreadings:

$ echo "Saint-du-Jour" | lt-proc -w '/home/hector/apertium/apertium-fra/fra.automorf.bin' | cg-proc -w '/home/hector/apertium/apertium-fra/fra.rlx.bin'
^Saint/Saint<adj><m><sg>/Saint<n><m><sg>$-^du/de<pr>+le<det><def><m><sg>$-^Jour/Jour<n><m><sg>$

But, when using vislcg3, everything seems fine:

$ echo "Saint-du-Jour" | lt-proc -w '/home/hector/apertium/apertium-fra/fra.automorf.bin' | cg-conv -a -l | vislcg3 --trace -g '/home/hector/apertium/apertium-fra/apertium-fra.fra.rlx'
; "<Saint>"
;	"saint" adj m sg MERGECOHORTS:8
;	"saint" n m sg MERGECOHORTS:8
-
"<Saint-du-Jour>"
	"*Saint-du-Jour" MERGECOHORTS:8
; "<du>"
;	"de" pr MERGECOHORTS:8
;		"le" det def m sg
-
; "<Jour>"
;	"jour" n m sg MERGECOHORTS:8

I attach output1.txt, which is the output of:
echo "Saint-du-Jour" | lt-proc -w '/home/hector/apertium/apertium-fra/fra.automorf.bin'
And a minimal rlx file.
output1.txt
apertium-fra.fra.rlx.txt

apertium -f line eats full stops for breakfast

C/P from https://sourceforge.net/p/apertium/tickets/129/:

The new apertium -f line mode removes full stops at the ends of lines.

$ tail texts/dw.de-Langsam-gesprochene-Nachrichten-2018-02-20.text
Neuer Prozess gegen Perus Ex-Präsident Fujimori:

Ungeachtet seiner Begnadigung im Dezember muss Perus Ex-Präsident Alberto Fujimori erneut vor Gericht.
Der Nationale Strafgerichtshof in Lima ordnete einen Prozess wegen der Ermordung von sechs Bauern im Jahr 1992 an.
Fujimori hatte Peru von 1990 bis 2000 mit harter Hand regiert.
2007 wurde er schwerer Menschenrechtsverbrechen für schuldig befunden und zu 25 Jahren Gefängnis verurteilt.
An Heiligabend begnadigte Präsident Pedro Pablo Kuczynski den 79-Jährigen aufgrund seines schlechten Gesundheitszustands.
Dies löste in dem südamerikanischen Land heftige Proteste aus.
Kritiker vermuteten eine geheime Absprache mit Abgeordneten aus dem Fujimori-Lager, die drei Tage vor dem Erlass ein Amtsenthebungsverfahren gegen Kuczynski blockiert hatten.
Dieser sieht sich Korruptionsvorwürfen ausgesetzt.

$ tail texts/dw.de-Langsam-gesprochene-Nachrichten-2018-02-20.text | apertium -f line -d . deu-fin
Uusi prosessi vastaan Perun ex--presidentti *Fujimori:

armahduksen Huolimatta joulukuussa täytyy Perun ex--presidentti Alberto *Fujimori uudestaan ennen oikeuden
kansan rikosoikeudenhovi Limassa järjestää prosessin murhan varten kuudesta rakentajan vuodessa 1992 päälle
*Fujimori oli Peru 1990:sta 2000:aan asti kanssa kovan käden hallittu
2007 tuli hän raskaampien ihmisoikeudenrikoksien syyllinen löydytty ja 25 vuotta vankila tuomittu
jouluaatolta armahdettu presidentti Pedro Pablo *Kuczynski tämän 79-vuotias pahan terveydentilan perusteella
Se löysää tämän *südamerikanischen maa huomattavien protestien ulos
kriitikko oletettu salainen yhteisymmärrys kansanedustajilla tämän *Fujimori-tavara, tämä kolme päivän asetuksen eteen *Amtsenthebungsverfahren vastaan *Kuczynski olivat estäneet
Sen näkee itse korruptionsyytöksien paljastettu

$ tail texts/dw.de-Langsam-gesprochene-Nachrichten-2018-02-20.text | apertium -d . deu-fin
Uusi prosessi vastaan Perun ex--presidentti *Fujimori:

armahduksen Huolimatta joulukuussa täytyy Perun ex--presidentti Alberto *Fujimori uudestaan ennen oikeuden.
kansan rikosoikeudenhovi Limassa järjestää prosessin murhan varten kuudesta rakentajan vuodessa 1992 päälle.
*Fujimori oli Peru 1990:sta 2000:aan asti kanssa kovan käden hallittu.
2007 tuli hän raskaampien ihmisoikeudenrikoksien syyllinen löydytty ja 25 vuotta vankila tuomittu.
jouluaatolta armahdettu presidentti Pedro Pablo *Kuczynski tämän 79-vuotias pahan terveydentilan perusteella.
Se löysää tämän *südamerikanischen maa huomattavien protestien ulos.
kriitikko oletettu salainen yhteisymmärrys kansanedustajilla tämän *Fujimori-tavara, tämä kolme päivän asetuksen eteen *Amtsenthebungsverfahren vastaan *Kuczynski olivat estäneet.
Sen näkee itse korruptionsyytöksien paljastettu.
This is a minor problem but it could cost lots of BLEU points in the shared task just like the line-combining without -f line switch does ;-)
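To measure how many lines are affected, the source and the translated output can be compared line by line. The following Python sketch assumes both files have the same number of lines, which -f line is meant to guarantee; the function name is made up for this example.

```python
def lines_losing_full_stop(source_lines, output_lines):
    """Return 1-based numbers of lines whose final '.' is missing in the output."""
    return [
        n
        for n, (src, out) in enumerate(zip(source_lines, output_lines), start=1)
        if src.rstrip().endswith(".") and not out.rstrip().endswith(".")
    ]


source = ["Dieser sieht sich Korruptionsvorwürfen ausgesetzt."]
with_line_format = ["Sen näkee itse korruptionsyytöksien paljastettu"]
print(lines_losing_full_stop(source, with_line_format))
# [1]
```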

Mark single word as untranslatable outside the pipeline

It'd be nice to be able to mark a single word (or short phrase) as untranslatable outside the pipeline, but without turning it into a superblank like <apertium-notrans> does.

This way, CG and apertium-tagger could notice it and treat it e.g. as a proper noun or whatever makes sense, and transfer rules could be properly blocked by it. People using apertium could then preprocess the input so that, e.g., STARTUP in "STARTUP is Making the World a Better Place" isn't translated to OPPSTART.

Configure script should check for svn

I just installed a new laptop and tried to compile Apertium and got the following error:

svn checkout https://github.com/unhammer/apertium-get/trunk apertium-get
/bin/bash: svn: command not found
Makefile:1716: recipe for target 'apertium-get/apertium-get' failed
make[2]: *** [apertium-get/apertium-get] Error 127
make[2]: Leaving directory '/home/fran/source/apertium/trunk/apertium/apertium'
Makefile:938: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/home/fran/source/apertium/trunk/apertium/apertium'
Makefile:484: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1
Making install in apertium
make[1]: Entering directory '/home/fran/source/apertium/trunk/apertium/apertium'
svn checkout https://github.com/unhammer/apertium-get/trunk apertium-get
/bin/bash: svn: command not found
Makefile:1716: recipe for target 'apertium-get/apertium-get' failed
make[1]: *** [apertium-get/apertium-get] Error 127
make[1]: Leaving directory '/home/fran/source/apertium/trunk/apertium/apertium'
Makefile:484: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1

apertium script should be able to use apy

It would be cool if the apertium script could make calls to apy with a flag, like:

$ echo "something" | apertium -p eng-cat

This would send the request to apy instead of looking locally. Alternatively, apertium eng-cat could first check for a running apy instance and use it if available.
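As a rough idea of what the script would have to do, here is a Python sketch that turns a direction such as eng-cat into an APy request URL. It assumes an APy instance on localhost port 2737 exposing a /translate endpoint with langpair and q parameters; those details, and the helper name `apy_translate_url`, are assumptions for illustration, and the -p flag itself is only the proposal above.

```python
from urllib.parse import urlencode


def apy_translate_url(text, direction, base_url="http://localhost:2737"):
    """Build the request URL for translating `text` in `direction` (e.g. eng-cat)."""
    source, target = direction.split("-", 1)
    # Assumed parameter format: langpair=SOURCE|TARGET, q=TEXT.
    query = urlencode({"langpair": f"{source}|{target}", "q": text})
    return f"{base_url}/translate?{query}"


print(apy_translate_url("something", "eng-cat"))
# http://localhost:2737/translate?langpair=eng%7Ccat&q=something
```

The sketch only builds the URL. Sending the request and falling back to the local pipeline when no server answers would be separate steps.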

apertium-tagger error

Hi. Translates in one direction, but not in the other.
echo "hello" | apertium deu-eng
*hello
echo "hello" | apertium eng-deu
State bad 0 0 1 0
apertium-tagger: 1:41: can't get const wchar_t: TheCharacterStream not good
^hello/hello<ij>/hello<n><sg>$^./.<sent>$
^
Try 'apertium-tagger --help' for more information.

And the same error with eng-cat:
echo "1" | apertium cat-eng
1
echo "1" | apertium eng-cat
State bad 0 0 1 0
apertium-tagger: 1:21: can't get const wchar_t: TheCharacterStream not good
^1/1<num>$^./.<sent>$
^
Try 'apertium-tagger --help' for more information.

At the same time, these pairs work in both directions: es-fr, spa-eng, es-pt, spa-ita.

Improve README

This is a somewhat important repo so it would be nice to markdownify the README a bit (mostly just the titles) and then symlink README.md -> README.

Is this ok, @ftyers ?
