Comments (7)

linas avatar linas commented on June 14, 2024

Actually, I guess there are two errors here: first, relex is scrambling the utf8 input and turning it into the junk � which is probably not a valid utf8 char. Next, link-grammar trips over this invalid char ...

from link-grammar.

linas avatar linas commented on June 14, 2024

this bug is the ultimate victim: opencog/opencog#1689

ampli avatar ampli commented on June 14, 2024

It looks as if it behaves as planned.

Each of these words contains, for some reason, an invalid utf-8 character:

$ od -xc
Ph� T�n.
0000000    6850    bfef    20bd    ef54    bdbf    2e6e    000a
          P   h 357 277 275       T 357 277 275   n   .  \n
0000015
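A note on the dump: the octal bytes 357 277 275 are hex EF BF BD, which is the well-formed UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER. So the "invalid" character here looks like the substitution marker left behind by some earlier decoder, not a malformed byte sequence. A minimal Python sketch to check this, assuming the byte string below is a faithful transcription of the dump (it does not use link-grammar itself):

```python
import unicodedata

# Bytes transcribed from the od dump: "Ph", EF BF BD, " T", EF BF BD, "n.", newline.
raw = b"Ph\xef\xbf\xbd T\xef\xbf\xbdn.\n"

# Strict decoding succeeds, so the byte sequence itself is well-formed UTF-8.
text = raw.decode("utf-8")

# Name every non-ASCII character that came out of the decode.
for ch in text:
    if ord(ch) > 0x7F:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Both non-ASCII characters come out as U+FFFD, and the 13-byte length agrees with the final offset 0000015 (octal) in the dump.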

It cannot match anything in any/4.0.regex, since every regex there requires all the characters of the word to belong to particular classes:

ANY-WORD:  /^[[:alnum:]_'-]+$/
ANY-PUNCT:  /^[[:punct:]]+$/

JUNK: /^[^[:punct:]][^[:space:]]+[^[:punct:]]$/

So the final result is that these words are reported as not in the dict, as "needed"...
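To make the matching concrete, here is a rough Python sketch of the three rules. Python's `re` module has no POSIX bracket classes, so the classes are approximated with Unicode properties: `[[:alnum:]]` as `str.isalnum()`, `[[:space:]]` as `str.isspace()`, and `[[:punct:]]` as Unicode category P* or S*. That last mapping is an assumption about how the regex library in use classifies U+FFFD; it is chosen because it is consistent with the link-parser output quoted in this comment. The helper names are hypothetical and are not part of link-grammar.

```python
import unicodedata

def is_punct(ch):
    # Approximation of [[:punct:]]: Unicode punctuation (P*) or symbol (S*).
    return unicodedata.category(ch)[0] in "PS"

def is_any_word(w):
    # ANY-WORD: /^[[:alnum:]_'-]+$/
    return len(w) >= 1 and all(c.isalnum() or c in "_'-" for c in w)

def is_any_punct(w):
    # ANY-PUNCT: /^[[:punct:]]+$/
    return len(w) >= 1 and all(is_punct(c) for c in w)

def is_junk(w):
    # JUNK: /^[^[:punct:]][^[:space:]]+[^[:punct:]]$/ (needs at least 3 characters)
    return (len(w) >= 3
            and not is_punct(w[0])
            and not any(c.isspace() for c in w[1:-1])
            and not is_punct(w[-1]))

def classify(w):
    # Try the rules in the order listed above; None means "no regex matched".
    for label, test in (("ANY-WORD", is_any_word),
                        ("ANY-PUNCT", is_any_punct),
                        ("JUNK", is_junk)):
        if test(w):
            return label
    return None

for word in ["Ph\ufffd", "T\ufffdn", "."]:
    print(repr(word), classify(word))
```

Under these assumptions the first word matches nothing, because it ends in a character that is neither alphanumeric nor a non-punctuation character, while the second word falls through to JUNK and the period matches ANY-PUNCT.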

It is possible to "fix" that by adding UNKNOWN-WORD to any/4.0.dict:

UNKNOWN-WORD: ();

And now we get:


$ link-parser any
link-grammar: Info: Dictionary found at ../any/4.0.dict
link-grammar: Info: Using locale en_US.utf8.
link-grammar: Info: Dictionary version 5.1.0.
link-grammar: Info: Library version link-grammar-5.3.0. Enter "!help" for help.
linkparser> !m
Display word morphology turned on.
linkparser> Ph� T�n.
No complete linkages found.
Found 4 linkages (4 had no P.P. violations) at null count 1
    Linkage 1, cost vector = (UNUSED=1 DIS= 0.00 LEN=1)

    +-------ANY------+----ANY----+
    |                |           |
LEFT-WALL [Ph�] T�n[!JUNK] .[!ANY-PUNCT] 

Press RETURN for the next linkage.
linkparser> 

In addition, sentences containing invalid Unicode should simply be rejected (otherwise the program can segfault or expose a security problem). I can try to add such a check.
But this may also cause Relex to get stuck for the same reason, so Relex should be fixed anyway.
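As a sketch of what such a check could look like, the Python function below rejects byte strings that are not well-formed UTF-8 and, optionally, strings that already contain U+FFFD. This is only an illustration of the idea: link-grammar is written in C, and `check_sentence` and `reject_replacement` are hypothetical names, not part of its API.

```python
def check_sentence(raw: bytes, reject_replacement: bool = True) -> str:
    """Return the decoded sentence, or raise ValueError if it should be rejected."""
    try:
        # Strict decoding raises on any malformed byte sequence.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"invalid UTF-8 at byte offset {err.start}") from err
    # U+FFFD is well-formed, but it signals that the input was damaged upstream.
    if reject_replacement and "\ufffd" in text:
        raise ValueError("sentence contains U+FFFD")
    return text
```

Whether to reject U+FFFD as well is a design choice: it is legal UTF-8, so a strict validity check alone would let the sentence from this report through.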

linas avatar linas commented on June 14, 2024

So it looks like ANY sometimes has unused words. How odd, I thought it did not do that. Maybe this is a side-effect of the regex?

ampli avatar ampli commented on June 14, 2024

The problem is that the JUNK regex doesn't really match every word that fails to match ANY-WORD and ANY-PUNCT (by design).
So there exist words that don't match any regex in any/4.0.regex, and since any/4.0.dict contains only regex labels, such words are "not in the dict".

It is possible to define after JUNK:

INVALID: /^/
and define it in the dict as
INVALID: ();
but this is just like using UNKNOWN-WORD.

linas avatar linas commented on June 14, 2024

closing; the bug is not in the relex/link-grammar toolchain. See opencog/opencog#1689 for the debug log establishing this.

linas avatar linas commented on June 14, 2024

The bug in the opencog guile implementation was a missing per-thread utf8 initialization; threads were defaulting to an iso-8859-1 encoding :-(
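The thread does not say what the original text was or exactly where the substitution happened, so the following is only an illustrative Python sketch of how an iso-8859-1 / utf8 mismatch can produce this kind of junk. The source string is hypothetical, and this does not reproduce guile's actual behavior: text written out as iso-8859-1 and then read back as utf8 by a lenient decoder ends up with U+FFFD in place of each accented letter.

```python
# Hypothetical source text; the real input is not given in the thread.
original = "Ph\u00f3 T\u00e2n."

# One side writes the text as iso-8859-1: each accented letter is a single byte.
wrong_bytes = original.encode("iso-8859-1")

# The other side reads those bytes as utf8; the lone high bytes are malformed,
# so a lenient decoder substitutes U+FFFD for each of them.
damaged = wrong_bytes.decode("utf-8", errors="replace")

# Re-encoding the damaged text as utf8 yields the EF BF BD byte triples.
reencoded = damaged.encode("utf-8")

print(wrong_bytes)
print(reencoded)
```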
