Comments (7)
Actually, I guess there are two errors here. First, relex is scrambling the UTF-8 input and turning it into the junk character �, which is probably not a valid UTF-8 char. Next, link-grammar trips over this invalid char ...
from link-grammar.
this bug is the ultimate victim: opencog/opencog#1689
from link-grammar.
It looks as if it behaves as planned.
Each of these words contains, for some reason, an invalid UTF-8 character:
$ od -xc
Ph� T�n.
0000000 6850 bfef 20bd ef54 bdbf 2e6e 000a
P h 357 277 275 T 357 277 275 n . \n
0000015
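To make the dump above easier to read, here is a minimal Python sketch that rebuilds the byte stream from the hex words and decodes it. It assumes od ran on a little-endian host (so each 16-bit word has its two bytes swapped); the function name is illustrative only. Under that assumption, the bytes around each � are EF BF BD, which a strict decoder accepts as U+FFFD, the Unicode replacement character, suggesting the original characters were already lost before the text reached link-grammar.

```python
# Hex words and final offset copied from the od -xc output above.
OD_WORDS = "6850 bfef 20bd ef54 bdbf 2e6e 000a"
OD_END_OFFSET = 0o15  # od prints offsets in octal; 0o15 == 13 bytes


def od_words_to_bytes(words: str, length: int) -> bytes:
    """Undo the od -x word grouping, assuming a little-endian host."""
    raw = b"".join(int(w, 16).to_bytes(2, "little") for w in words.split())
    return raw[:length]  # drop the padding byte od adds for an odd length


raw = od_words_to_bytes(OD_WORDS, OD_END_OFFSET)
text = raw.decode("utf-8")  # strict decoding; raises on malformed input

print(raw.hex(" "))
print([f"U+{ord(c):04X}" for c in text if ord(c) > 0x7F])
```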
These words cannot match anything in any/4.0.regex, since every regex there requires each character to belong to a particular class:
ANY-WORD: /^[[:alnum:]_'-]+$/
ANY-PUNCT: /^[[:punct:]]+$/
JUNK: /^[^[:punct:]][^[:space:]]+[^[:punct:]]$/
So the final result is that these words are reported as not being in the dict, as "needed"...
It is possible to "fix" that by adding UNKNOWN-WORD to any/4.0.dict:
UNKNOWN-WORD: ();
And now we get:
$ link-parser any
link-grammar: Info: Dictionary found at ../any/4.0.dict
link-grammar: Info: Using locale en_US.utf8.
link-grammar: Info: Dictionary version 5.1.0.
link-grammar: Info: Library version link-grammar-5.3.0. Enter "!help" for help.
linkparser> !m
Display word morphology turned on.
linkparser> Ph� T�n.
No complete linkages found.
Found 4 linkages (4 had no P.P. violations) at null count 1
Linkage 1, cost vector = (UNUSED=1 DIS= 0.00 LEN=1)
+-------ANY------+----ANY----+
| | |
LEFT-WALL [Ph�] T�n[!JUNK] .[!ANY-PUNCT]
Press RETURN for the next linkage.
linkparser>
In addition, sentences with invalid Unicode in them should simply be rejected (otherwise the program can segfault or expose a security problem). I can try to add such a check.
But this may also cause Relex to get stuck for the same reason, so Relex should be fixed anyway.
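A rough Python sketch of what such a rejection check could look like. This is not link-grammar's actual code (the library is written in C); the function name and the `reject_replacement` flag are hypothetical. It rejects malformed UTF-8 and, optionally, text that already contains U+FFFD, since that character usually means an earlier stage has discarded the real input.

```python
def is_acceptable_sentence(raw: bytes, reject_replacement: bool = True) -> bool:
    """Return True if raw is well-formed UTF-8 that is safe to hand to a parser."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False  # malformed byte sequence
    if reject_replacement and "\ufffd" in text:
        return False  # well-formed, but an upstream stage already lost data
    return True


print(is_acceptable_sentence(b"Ph\xef\xbf\xbd T\xef\xbf\xbdn."))
print(is_acceptable_sentence(b"\xff\xfe"))
print(is_acceptable_sentence("plain ascii sentence.".encode("utf-8")))
```

Whether U+FFFD should be rejected or merely reported is a policy choice; the flag leaves both options open.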
from link-grammar.
So it looks like ANY sometimes has unused words. How odd, I thought it did not do that. Maybe this is a side effect of the regex?
from link-grammar.
The problem is that the JUNK regex doesn't really match every word that fails to match ANY-WORD and ANY-PUNCT (by design).
So there exist words that don't match any regex in any/4.0.regex, and since any/4.0.dict contains only regex labels, such words are "not in the dict".
It is possible to define, after JUNK:
INVALID: /^/
and define it in the dict as
INVALID: ();
but this is just like using UNKNOWN-WORD.
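To illustrate why a word such as Ph� falls through all three patterns while T�n is picked up by JUNK, here is an approximate Python emulation. Python's `re` module has no POSIX bracket classes, so the classes are approximated with plain predicates: `[[:alnum:]]` by `str.isalnum` and `[[:punct:]]` by the Unicode punctuation and symbol categories. That last mapping is an assumption about how the C library's locale classifies U+FFFD, not something taken from the link-grammar source.

```python
import unicodedata


def is_punct(ch: str) -> bool:
    # Stand-in for [[:punct:]]: Unicode punctuation (P*) or symbol (S*).
    return unicodedata.category(ch)[0] in "PS"


def is_any_word(w: str) -> bool:
    # ANY-WORD: /^[[:alnum:]_'-]+$/
    return len(w) > 0 and all(c.isalnum() or c in "_'-" for c in w)


def is_any_punct(w: str) -> bool:
    # ANY-PUNCT: /^[[:punct:]]+$/
    return len(w) > 0 and all(is_punct(c) for c in w)


def is_junk(w: str) -> bool:
    # JUNK: /^[^[:punct:]][^[:space:]]+[^[:punct:]]$/
    return (len(w) >= 3
            and not is_punct(w[0])
            and not any(c.isspace() for c in w[1:-1])
            and not is_punct(w[-1]))


RULES = [("ANY-WORD", is_any_word), ("ANY-PUNCT", is_any_punct), ("JUNK", is_junk)]


def classify(word: str):
    """Return the first matching label, or None if no pattern matches."""
    for label, predicate in RULES:
        if predicate(word):
            return label
    return None


for word in ["Ph\ufffd", "T\ufffdn", "."]:
    print(repr(word), classify(word))
```

Under these assumptions, Ph� ends in a character counted as punctuation, so JUNK rejects it, and it is neither all-alphanumeric nor all-punctuation; that leaves it with no label, which is the gap a catch-all entry would fill.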
from link-grammar.
closing; the bug is not in the relex/link-grammar toolchain. See opencog/opencog#1689 for the debug log establishing this.
from link-grammar.
The cause was in the opencog guile implementation: a missing per-thread UTF-8 initialization; threads were defaulting to an ISO-8859-1 encoding :-(
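As an illustration only, the following Python sketch shows one plausible way an ISO-8859-1 default can end up producing U+FFFD downstream: text is written out as single Latin-1 bytes and then read by a consumer that expects UTF-8 and substitutes the replacement character for bytes it cannot decode. The input string and the function name are hypothetical; the actual failure path is the one documented in opencog/opencog#1689.

```python
def simulate_encoding_mismatch(text: str) -> str:
    """Write text as ISO-8859-1, then read it back as lenient UTF-8."""
    latin1_bytes = text.encode("iso-8859-1")  # one byte per character
    # A lone byte >= 0x80 is not valid UTF-8, so it becomes U+FFFD.
    return latin1_bytes.decode("utf-8", errors="replace")


# Hypothetical input; the original sentence from the bug report is not known.
damaged = simulate_encoding_mismatch("Ph\u00e9 T\u00e9n.")
print([f"U+{ord(c):04X}" for c in damaged if ord(c) > 0x7F])
```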
from link-grammar.