tinodidriksen / spellers

Front-ends and packaging scripts for spellers. Git read-only mirror.

License: GNU General Public License v3.0

Shell 1.07% Perl 7.14% JavaScript 5.52% CMake 4.03% C++ 77.67% C 4.57%

spell-checker

spellers's Issues

Hyphens should always be part of the word

At least in the Sami languages. This is true for all positions (initial, middle, final):
-davvi
el-rávdnje
dutki-

Screenshots demonstrating how it is now, and how this leads to irrelevant suggestions:

In all these cases the whole string «el-rávdnje» etc should be the input, and suggestions should be generated from that.

Whether a hyphen is part of a word or not is definitely language dependent. E.g. in English it is probably best to treat the hyphen as a NON-word character (i.e. a word separator), whereas for Sámi, Norwegian and most other Nordic languages it should be a word character. The only exception to this is when it is used alone, i.e. with whitespace on each side.

Our Sámi FSTs are built with this in mind, and should be able to handle all correct uses of hyphens.
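
A minimal sketch of the requested behaviour, not the tokenizer actually used by the MS Office front-end. The `HYPHEN_IS_WORD_CHAR` table, the function name and the language codes other than `sme` are assumptions made for illustration:

```python
import re

# Hypothetical per-language switch: is "-" a word character?
HYPHEN_IS_WORD_CHAR = {"sme": True, "nob": True, "eng": False}

def tokenize(text, lang):
    """Split text into words, keeping hyphens where the language wants them."""
    if HYPHEN_IS_WORD_CHAR.get(lang, False):
        # Letters, digits and hyphens form one token ...
        tokens = re.findall(r"[\w-]+", text)
        # ... except hyphens standing alone between whitespace.
        return [t for t in tokens if t.strip("-")]
    # Otherwise the hyphen acts as a word separator.
    return re.findall(r"\w+", text)
```

With this rule, `-davvi`, `el-rávdnje` and `dutki-` each reach the speller as one string, while a free-standing hyphen is dropped.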

Divvun (sme) issues error message about overlapping I/O

After upgrading to a version that fixes issue 18, an error message about overlapping I/O is issued when continuous spell checking is on.

OS: windows 7 enterprise service pack 1, 64 bit
Word: Microsoft Office Professional Plus 2013, 32 bit

Error message: GetLastError: Operasjonen er utført (Norwegian: "The operation has been completed")

When using the Lule Sami speller v4.1 on a Windows 10 machine with MSOffice 2013, this error and the "Alle datakanaler" ("All data channels") message appear again and again.

This results in a "You should consider uninstalling this plugin" message from Office.

MWEs failing in MS Office

Multi-word expressions are only tested part by part, and thus fail. E.g. sme's vákten láhkai

Regression in derived proper nouns

The North Sami speller for MS Office has a regression: compared to last week, it no longer accepts derived proper nouns with initial lower case:

jiellevárihat
skánitlaš

These are accepted by the command line speller (hfst-ospell -S se.zhfst), but not by the MS Office speller (*.msi package).

Because of this difference, I suspect that something in the nightly build environment causes the issue. I have updated our test files with test cases for these words, and running "make check" on the built speller FSTs should reveal issues related to the build system, if any. "make check" succeeds on my system, and should also succeed on the build system (there are a couple of known failures, but they are properly marked, so they should not break the testing).

"make check" is only known to pass for SME; I have not tested the other languages yet.

There are a number of other regressions as well, and they all point towards improper handling of flag diacritics. Changes in hfst may have caused these regressions (my hfst installation is from Nov. 27).

Word crashing after a period of heavy use with North Sámi speller

Background:

User installed the newest divvun package (divvun.no)
User edited North Sámi texts for one day, everything worked fine
User continued the next day; Word hung (stopped responding) and had to be force-quit
User uninstalled the divvun tools, and Word behaves as normal again

OS: windows 7 enterprise service pack 1, 64 bit
Word: Microsoft Office Professional Plus 2013, 32 bit

The document that triggers the problem was sent via e-mail for privacy and copyright reasons.

Misspellings in all uppercase get unexpected suggestions

Example (none of the suggestions are reasonable corrections):

Here is the same input with initial cap (second and third suggestions are reasonable corrections):

And the same input with all lowercase (all suggestions are reasonable corrections):

It might be possible to correct all of this in the FST by giving higher weights to certain types of compounds. For now, one of the suggestions given only for the all-uppercase input is analysed as follows:

$ echo Kant-RV-irgi | hfst-lookup -q build/newspellers/tools/spellcheckers/fstbased/analyser-fstspeller-gt-norm.hfst 
Kant-RV-irgi    Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom    17,031221
Kant-RV-irgi    Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom    10017,031250

Giving higher weight to +ACR tags should help improve the suggestions. I'll try this first.

Error message: Alle datakanaler er opptatt

One of our users filed a bug report via Facebook.

The text says roughly:

«Sometimes I get the following message (see screenshot). It started to appear after I installed the newest Divvun tools. I have Windows 7 and MS Office 2013.»

All uppercase names are flagged as misspelled, suggestions rejected when inserted

When the option to check the spelling of all-uppercase strings is on, (inflected) names are flagged as misspelled, whereas the same string in regular / lexical case is not flagged:

regular case vs all-upper: see Kautokeino
inflected vs base form: OSLO vs most others - but KAUTOKEINO is also a base form, so the pattern is not fully consistent
names vs non-names: ráidočiekčan vs RÁIDOČIEKČAN vs KAUTOKEINO - only the last one is flagged

Further, when introducing a misspelling in one of the all-upper names, the correct form is suggested, but when inserted, it is still flagged as misspelled:

That is, it rejects its own suggestions in these cases.
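
One way to express the expected behaviour is to try the case variants of an all-uppercase string before flagging it. This is only a sketch under stated assumptions: `is_correct` is a hypothetical callback standing in for the real lexicon lookup, and nothing here reflects how hfst-ospell-office actually handles case.

```python
def accepts_allcaps(word, is_correct):
    """Accept an all-uppercase word if any of its case variants is accepted.

    is_correct is a hypothetical callback wrapping the real speller lookup.
    """
    if not word.isupper():
        return is_correct(word)
    # Try the string as typed, with an initial capital, and in lower case.
    variants = (word, word.capitalize(), word.lower())
    return any(is_correct(v) for v in variants)
```

If suggestions for all-uppercase input are checked with the same function before being re-inserted, the speller would no longer reject its own suggestions.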

"Account already exists"-error when installing lulesami speller

When installing the Lule Sami speller, an error appears saying that the account already exists.

This happened on a Windows 10 machine with MS Office 2013 (in the Norwegian Sami Parliament).

Acronyms do not get suggestions as given by libhfstospell

Compare the following two pages with speller test results:

https://gtsvn.uit.no/langtech/trunk/langs/sma/devtools/speller_result_typos.vk.xml (voikkospell)

and

https://gtsvn.uit.no/langtech/trunk/langs/sma/devtools/speller_result_typos.to.xml (hfst-ospell-office)

The main table has the following columns:

input
expected output
edit distance between the first two
list of suggestions given by the speller, if any

The correct suggestion is highlighted, and the background colour of the row also indicates the position of the correct suggestion (or lack thereof).

Now search for the string "TV-esne" on both pages, and compare the suggestion lists. The suggestions from voikkospell correspond to what you would get with the following command:

echo "TV-esne" | hfst-ospell -S sma.zhfst

The suggestions from hfst-ospell-office do not. Also look at the input "TV’esne" two rows further down, for an example of even more diverging suggestions.

EN DASH (U+2013) is not ignored by speller

The following text will trigger a red underline in MS Word using the SME speller (version: Divvun-sme-2015.292.177.msi, 2015-10-19, 02:57):

– Fertejit čielga njuolggadusat

The words are accepted, but not the initial EN DASH.
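
A possible guard, sketched under the assumption that the front-end can filter tokens before they are sent to the speller (the function name is hypothetical): only strings containing at least one letter or digit are checked.

```python
import unicodedata

def needs_spellcheck(token):
    """True if the token contains at least one letter (L*) or digit (N*)."""
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in token)
```

EN DASH (U+2013) has the Unicode category Pd, so a free-standing dash would be skipped. A token consisting only of whitespace would be skipped by the same test.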

A single space will trigger a red underline

A single space surrounded by linebreaks (U+000A) will trigger a red underline in MS Word using the SME speller from today. To repeat, insert the following in a document (here given as Unicode HEX code points to avoid misunderstandings and code garbling):

U+000A
U+000A
U+0020
U+000A
U+000A

Set the text to North Sami. Notice how there is a very short squiggly red line where the space is.

XPI modules can't coexist

The Mozilla packages cannot currently coexist with each other.

Keep initial upper case in suggestions

When a misspelled word starts with a capital letter, the suggestions should also have an initial capital letter. This is presently not the case:

Unnaraččas should have been corrected to Unnoraččas.

Tested with the 16.11.2015 version.
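
The requested post-processing could look roughly like this. It is a sketch only; the function name is an assumption, and it says nothing about where in the real pipeline the case is lost.

```python
def match_initial_case(word, suggestions):
    """Give the suggestions an initial capital if the input word has one."""
    if not word[:1].isupper():
        return list(suggestions)
    # Upper-case only the first character; leave the rest untouched.
    return [s[:1].upper() + s[1:] for s in suggestions]
```

For the example above, `match_initial_case("Unnaraččas", ["unnoraččas"])` gives `["Unnoraččas"]`.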

Colons should be part of the word when followed by letters, preceded by non-whitespace

Examples:

ráhkistan: - NOT part of the word
234:t - PART of the word
234:at - misspelling, PART of the word (correct form above)
NATO:s - PART of the word
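
The rule in the title can be written as a single regular expression. This is an illustrative sketch, not the front-end's actual tokenizer; the pattern and function name are assumptions.

```python
import re

# A run of letters/digits, optionally followed by ":" plus letters.
# [^\W\d_] matches word characters that are neither digits nor "_", i.e. letters.
WORD = re.compile(r"\w+(?::[^\W\d_]+)?")

def words(text):
    """Return the words in text; ':' stays attached only when letters follow."""
    return WORD.findall(text)
```

Applied to the examples, `words("ráhkistan: 234:t 234:at NATO:s")` returns `["ráhkistan", "234:t", "234:at", "NATO:s"]`.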

Native with libraries

A solution to the bugs below is to implement a proper native library instead of spawning an instrumented process. This work is in progress (including https://github.com/hfst/hfst-ospell/tree/cmake ).

Punctuation should not be part of suggestions

A comma following a word is treated as part of the word, and therefore also shows up in the suggestions. This bug does not in itself create bad input or output, but it does increase the noise level of the suggestions:

It would be better to not include the comma in the input, and thus also remove it from the suggestions.
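
A sketch of how trailing punctuation could be split off before lookup. The set of characters in `TRAILING` is an assumption; the hyphen is deliberately left out, since a final hyphen can be part of the word.

```python
# Hypothetical set of characters to strip from the end of a token.
TRAILING = ",.;:!?"

def split_trailing_punct(token):
    """Return (word, punctuation), with trailing punctuation split off."""
    word = token.rstrip(TRAILING)
    return word, token[len(word):]
```

Only the first element would be sent to the speller, so the suggestions come back without the comma.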

hfst-ospell-office dumps core on [uppercase-char] and {uppercase-char}

Examples from commandline use:
hfst-mso $ echo "5 [L]" | /usr/local/bin/hfst-ospell-office se-latest/se.zhfst
@@ hfst-ospell-office is alive
*
*** Error in `/usr/local/bin/hfst-ospell-office': munmap_chunk(): invalid pointer: 0x000000000236e880 ***
Aborted (core dumped)

hfst-mso $ echo "5 [l]" | /usr/local/bin/hfst-ospell-office se-latest/se.zhfst
@@ hfst-ospell-office is alive
*

{L}, {L], [L} also result in core dumps.

hfst-mso $ /usr/local/bin/hfst-ospell --version

hfstospell 0.4.0
Jan 26 2016 10:02:10
copyright (C) 2009 - 2016 University of Helsinki

The original input had about 90000 lines, one "word" on each line.
On Linux, hfst-ospell-office dumps core after all lines have been spell checked; on OS X it crashes more randomly (somewhere in the middle of the process).

Word hangs when opening documents that have been edited using speller version 3.0.

After Divvun 4.0 was installed, this problem appeared.

Documents created after 4.0 was installed do not have this problem. Installing Divvun-sme-86.176.10407.msi did not help.

Windows 7
Microsoft Office Home and Business 2013
Version: 15.0.4787.1002
Locale se-FI

OmegaT Backend

Extend the Greenlandic OmegaT backend for all languages.

MSI version max out at 255.255.65535

MSI version numbers have a max - who knew! https://msdn.microsoft.com/en-us/library/aa370859.aspx says max is 255.255.65535, which is incompatible with the current scheme of year.day.minute.

So we need to find a new compatible scheme, or just use the current 32-bit timestamp.
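
The limit is easy to check mechanically. The helper below is a hypothetical sketch that tests the first three fields of a version string against the maximum quoted above:

```python
MSI_LIMITS = (255, 255, 65535)  # major, minor, build

def is_valid_msi_version(version):
    """True if the first three dot-separated fields fit the MSI limits."""
    parts = [int(p) for p in version.split(".")[:3]]
    if len(parts) != 3:
        return False
    return all(0 <= p <= limit for p, limit in zip(parts, MSI_LIMITS))
```

A year.day.minute version such as 2015.292.177 fails on both of the first two fields, since the year exceeds 255 and so does any day of the year after the 255th.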

Installing the msoffice speller on a Windows terminal server does not make it available for all users.

Only the admin user that installed the speller has access to it.

North Sami has no version info easter egg

All our spellers are automatically equipped with an easter egg giving core version information, triggered by asking for suggestions to the string "nuvviDspeller". This is what it looks like for South Sami:

The North Sami spellers from this week have been lacking the easter egg, although it is there when building and testing on my own computer. The North Sami Word / MS Office speller gives the following list of suggestions:

These suggestions all have a much higher weight than the easter egg suggestions. Thus, it looks to me as if the easter egg was not built for SME for some reason.

Having the possibility to identify the lexicon version is quite important when debugging user error messages - often it turns out that they have an old speller, and that can easily be identified by asking them to type in the easter egg trigger.

ISpellCheckProvider Puzzling Behavior

I have implemented an ISpellCheckProvider and it works, but also doesn't (locale kl-GL installer; source).

If I use the official https://code.msdn.microsoft.com/windowsdesktop/spell-checking-client-aea0148c sample code, then everything works as expected and words get suggestions.

From any other MS-provided client, such as Windows itself or Skype, the word gets marked as wrong but the suggested action is to remove the word entirely, as if the response code is CORRECTIVE_ACTION_DELETE. But the debug log is clearly showing a return value of CORRECTIVE_ACTION_GET_SUGGESTIONS. So that makes no sense, as it all goes via the same MS-provided spell checker host handler. I can only conclude that MS implemented their own clients incorrectly...

Seems nobody else in the world is making spell checking providers using this API, and it's nigh impossible to get hold of someone at MS who'd know.

(originally asked on MS TechNet and Stack Overflow)

SME, SMA, SMJ installers should be upgrades to existing/old installations

It would be very desirable from an end user point of view that the msi installers for the hfst-based spellers would uninstall/remove old Divvun tools. The old tools are also installed using MSI, and I believe all necessary data should be available in the file:

https://gtsvn.uit.no/langtech/trunk/prooftools/installers/win/MSOffice/wix/divvunsme.wxs

Duplicate suggestions when input has initial upper case, suggestion can be both a name and a regular noun

I have run a comparison of the voikko-based speller with the hfst-ospell-office speller. The result is visible here (first 7 diffs, available for a month):

https://www.diffnow.com/?report=5m6dx

As can be seen in the seventh diff, the same suggestion is given twice. This is caused by two suggestions that underlyingly differ only in the case of their initial letter. Because the input has an initial uppercase letter, both suggestions are given an initial uppercase letter as well, which makes them identical.

There needs to be a check for uniqueness within the final suggestion list, and if two identical suggestions are found, only the first/best one should be returned.
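
Such a check can be done in one pass over the final list while keeping the ranking. A minimal sketch (the function name is an assumption, and the strings in the usage note are placeholders, not real suggestions):

```python
def unique_suggestions(suggestions):
    """Drop repeated suggestions, keeping the first (best-ranked) occurrence."""
    # dict preserves insertion order, so the original ranking is kept.
    return list(dict.fromkeys(suggestions))
```

For example, `unique_suggestions(["Abc", "Abc", "Abd"])` returns `["Abc", "Abd"]`. The check has to run after any case adjustment, since that is the step that makes the two suggestions identical.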

Forward slash counted as part of the word

In a string like Sállan/Glimt, the forward slash is counted as part of the word, when it should not be:

Each of the two parts before and after the slash should be treated as independent words, i.e. the slash should not be considered a word character. This could be language specific, such that some languages want it as part of their words and others do not, although I cannot think of any such language offhand.
