spellers's Issues

Hyphens should always be part of the word

At least in the Sami languages. This is true for all positions (initial, middle, final):
-davvi
el-rávdnje
dutki-

Screenshots demonstrating how it is now, and how this leads to irrelevant suggestions:

skjermbilde 2015-11-19 kl 17 03 12

skjermbilde 2015-11-19 kl 17 04 09

skjermbilde 2015-11-19 kl 17 04 50

In all these cases the whole string «el-rávdnje» etc should be the input, and suggestions should be generated from that.

Whether a hyphen is part of a word or not is definitely language dependent. E.g. in English it is probably best to treat the hyphen as a NON-word character (i.e. a word separator), whereas for Sámi, Norwegian and most other Nordic languages it should be a word character. The only exception is when the hyphen is used alone, i.e. with whitespace on each side.

Our Sámi FSTs are built with this in mind, and should be able to handle all correct uses of hyphens.
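
To make the request concrete, here is a minimal sketch of language-dependent tokenisation. It is not the hfst-ospell-office implementation; the `EXTRA_WORD_CHARS` table, the language codes used as keys and the `tokenize` function are all assumptions made for illustration.

```python
import re

# Hypothetical per-language table: characters that count as word
# characters in addition to letters and digits.
EXTRA_WORD_CHARS = {
    "en": "",   # hyphen acts as a word separator
    "se": "-",  # North Sami: hyphen is part of the word
    "nb": "-",  # Norwegian Bokmål: hyphen is part of the word
}


def tokenize(text, lang):
    """Split text into words, honouring the language's extra word characters."""
    extra = re.escape(EXTRA_WORD_CHARS.get(lang, ""))
    pattern = re.compile(rf"[\w{extra}]+")
    tokens = pattern.findall(text)
    # A hyphen standing alone (whitespace on both sides) is not a word.
    return [t for t in tokens if t.strip("-")]


print(tokenize("-davvi el-rávdnje dutki- - ok", "se"))
print(tokenize("-davvi el-rávdnje dutki- - ok", "en"))
```

With the "se" setting the three examples above reach the speller as whole strings, while the "en" setting splits them at the hyphen.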

Divvun (sme) issues error message about overlapping I/O

After upgrading to a version that fixes issue 18, an error message about overlapping I/O is issued when continuous spell checking is on.

windows-error-message

OS: windows 7 enterprise service pack 1, 64 bit
Word: Microsoft Office Professional Plus 2013, 32 bit

Error message: GetLastError: Operasjonen er utført (Norwegian for "The operation completed successfully")

When using the Lule Sami speller v4.1 on a Windows 10 machine with MSOffice 2013, this error and the "Alle datakanaler" message appear again and again.

divvun-feilmelding3

This results in a "You should consider uninstalling this plugin" message from Office.
divvun-feilmelding1

Regression in derived proper nouns

The North Sami speller for MS Office has a regression: compared to last week, it no longer accepts derived proper nouns with initial lower case:

skjermbilde 2015-12-03 kl 15 15 05

  • jiellevárihat
  • skánitlaš

These are accepted by the command line speller (hfst-ospell -S se.zhfst), but not by the MS Office speller (*.msi package).

Because of this difference, I suspect there is something in the nightly build environment that causes the issue. I have updated our test files with test cases for these words, and running "make check" on the built speller FSTs should reveal issues related to the build system, if any. "make check" succeeds on my system, and should also succeed on the build system (there are a couple of known failures, but they are properly marked, so they should not break the testing).

"make check" is only known to pass for SME; I have not tested the other languages yet.

There are a number of other regressions as well, and they all point in the direction of (im)proper handling of flag diacritics. It might be changes in hfst that have caused these regressions (my hfst installation is from Nov. 27).

Word crashing after a period of heavy use with North Sámi speller

Background:

  1. User installed the newest divvun package (divvun.no)
  2. User edited North Sámi texts for one day, everything worked fine
  3. User continued the next day, Word hangs (does not respond) and has to be force-quit
  4. User uninstalled the divvun tools, and Word behaves as normal again

OS: windows 7 enterprise service pack 1, 64 bit
Word: Microsoft Office Professional Plus 2013, 32 bit

The document that triggers the problem has been sent via e-mail for privacy & copyright reasons.

Misspellings in all uppercase get unexpected suggestions

Example (none of the suggestions are reasonable corrections):

skjermbilde 2015-11-19 kl 16 53 43

Here is the same input with initial cap (second and third suggestions are reasonable corrections):

skjermbilde 2015-11-25 kl 12 54 24

And the same input with all lowercase (all suggestions are reasonable corrections):

skjermbilde 2015-11-25 kl 12 53 23

It might be that this can all be corrected in the fst by giving higher weights to certain types of compounds. For now, one of the suggestions that appear only for the all-uppercase input is analysed as follows:

$ echo Kant-RV-irgi | hfst-lookup -q build/newspellers/tools/spellcheckers/fstbased/analyser-fstspeller-gt-norm.hfst 
Kant-RV-irgi    Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom    17,031221
Kant-RV-irgi    Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom    10017,031250

Giving higher weight to +ACR tags should help improve the suggestions. I'll try this first.

All uppercase names are flagged as misspelled, suggestions rejected when inserted

When the option to check the spelling of all uppercase strings is on, (inflected) names are flagged as misspelled, whereas the same string in regular / lexical case is not flagged:

skjermbilde 2015-12-03 kl 11 05 34

  • regular case vs all-upper: see Kautokeino
  • inflected vs base form: OSLO vs most others - but KAUTOKEINO is also a base form, so the pattern is not fully consistent
  • names vs non-names: ráidočiekčan vs RÁIDOČIEKČAN vs KAUTOKEINO - only the last one is flagged

Further, when introducing a misspelling in one of the all-upper names, the correct form is suggested, but when inserted, it is still flagged as misspelled:

skjermbilde 2015-12-03 kl 10 54 32

That is, it rejects its own suggestions in these cases.
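
One possible way to treat all-uppercase input consistently is to retry the lookup with the lexical case variants of the word. The sketch below only illustrates that idea and is not how the MS Office speller is known to work; `accept_word` and the `is_correct` callback are hypothetical names, and the toy lexicon stands in for the real speller.

```python
def accept_word(word, is_correct):
    """Accept a word if it, or a lexical-case variant of an all-uppercase
    form, is accepted by the speller callback `is_correct` (hypothetical)."""
    if is_correct(word):
        return True
    if word.isupper():
        # Try "Kautokeino" and "kautokeino" for the input "KAUTOKEINO".
        # Note: capitalize() lowercases everything after the first letter,
        # so acronym parts inside compounds are not handled by this sketch.
        return any(is_correct(v) for v in (word.capitalize(), word.lower()))
    return False


# Toy lexicon standing in for the real speller (assumption for illustration).
lexicon = {"Kautokeino", "ráidočiekčan"}
print(accept_word("KAUTOKEINO", lexicon.__contains__))
print(accept_word("RÁIDOČIEKČAN", lexicon.__contains__))
```

Running the same check on a suggestion before returning it would also keep the speller from rejecting its own suggestions.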

Acronyms do not get suggestions as given by libhfstospell

Compare the following two pages with speller test results:

https://gtsvn.uit.no/langtech/trunk/langs/sma/devtools/speller_result_typos.vk.xml (voikkospell)

and

https://gtsvn.uit.no/langtech/trunk/langs/sma/devtools/speller_result_typos.to.xml (hfst-ospell-office)

The main table has the following columns:

  • input
  • expected output
  • edit distance between the first two
  • list of suggestions given by the speller, if any

The correct suggestion is highlighted, and the background colour of the row also indicates the position of the correct suggestion (or lack thereof).

Now search for the string "TV-esne" on both pages, and compare the suggestion lists. The suggestions from voikkospell correspond to what you would get with the following command:

echo "TV-esne" | hfst-ospell -S sma.zhfst

The suggestions from hfst-ospell-office do not. Also look at the input "TV’esne" two rows further down, for an example of even more diverging suggestions.

EN DASH (U+2013) is not ignored by speller

The following text will trigger a red underline in MS Word using the SME speller (version: Divvun-sme-2015.292.177.msi, 2015-10-19, 02:57):

– Fertejit čielga njuolggadusat

The words are accepted, but not the initial EN DASH.

A single space will trigger a red underline

A single space surrounded by linebreaks (U+000A) will trigger a red underline in MS Word using the SME speller from today. To repeat, insert the following in a document (here given as Unicode HEX code points to avoid misunderstandings and code garbling):

U+000A
U+000A
U+0020
U+000A
U+000A

Set the text to North Sami. Notice how there is a very short squiggly red line where the space is.

Keep initial upper case in suggestions

When a misspelled word starts with a capital letter, the suggestions should also start with a capital letter. This is presently not the case:

skjermbilde 2015-11-16 kl 13 34 45

Unnaraččas should have been corrected to Unnoraččas.

Tested with the 16.11.2015 version.
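
The requested behaviour can be sketched as a small post-processing step on the suggestion list. This is an illustration under assumptions, not the speller's actual code; the function name `match_initial_case` is hypothetical.

```python
def match_initial_case(original, suggestions):
    """If the misspelled input starts with an uppercase letter,
    uppercase the first letter of every suggestion as well."""
    if not original or not original[0].isupper():
        return list(suggestions)
    return [s[:1].upper() + s[1:] for s in suggestions]


print(match_initial_case("Unnaraččas", ["unnoraččas"]))
```

Only the first character is changed, so the rest of each suggestion keeps whatever case the speller returned.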

Punctuation should not be part of suggestions

A comma is treated as part of the word and is also included in the suggestions. This bug does not in itself create bad input or output, but it does increase the noise level of the suggestions:

skjermbilde 2015-11-16 kl 13 34 45

It would be better to not include the comma in the input, and thus also remove it from the suggestions.
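
A minimal sketch of such pre-processing follows, assuming that leading and trailing punctuation can be split off before the word is sent to the speller. The function name `split_punctuation` and the `KEEP_IN_WORD` set are hypothetical; the hyphen is kept because other issues in this list ask for it to remain a word character.

```python
import unicodedata

# Characters that stay attached to the word even though Unicode
# classifies them as punctuation (assumption for this sketch).
KEEP_IN_WORD = {"-"}


def split_punctuation(token):
    """Return (leading, core, trailing); only `core` goes to the speller."""
    def is_strippable(ch):
        return ch not in KEEP_IN_WORD and unicodedata.category(ch).startswith("P")

    start, end = 0, len(token)
    while start < end and is_strippable(token[start]):
        start += 1
    while end > start and is_strippable(token[end - 1]):
        end -= 1
    return token[:start], token[start:end], token[end:]


lead, core, trail = split_punctuation("Unnaraččas,")
print(repr(core), repr(trail))
```

The speller would then see only `core`, and the comma never appears in the suggestion list.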

hfst-ospell-office dumps core on [uppercase-char] and {uppercase-char}

Examples from commandline use:
hfst-mso $ echo "5 [L]" | /usr/local/bin/hfst-ospell-office se-latest/se.zhfst
@@ hfst-ospell-office is alive
*
*** Error in `/usr/local/bin/hfst-ospell-office': munmap_chunk(): invalid pointer: 0x000000000236e880 ***
Aborted (core dumped)

hfst-mso $ echo "5 [l]" | /usr/local/bin/hfst-ospell-office se-latest/se.zhfst
@@ hfst-ospell-office is alive
*

{L}, {L], [L} also result in core dumps.

hfst-mso $ /usr/local/bin/hfst-ospell --version

hfstospell 0.4.0
Jan 26 2016 10:02:10
copyright (C) 2009 - 2016 University of Helsinki

The original input had about 90000 lines, one "word" on each line.
On Linux hfst-ospell-office dumps core after all lines have been spell checked, on OS X it crashes more randomly (somewhere in the middle of the process).

OmegaT Backend

Extend the Greenlandic OmegaT backend for all languages.

North Sami has no version info easter egg

All our spellers are automatically equipped with an easter egg giving core version information, triggered by asking for suggestions to the string "nuvviDspeller". This is what it looks like for South Sami:

skjermbilde 2015-12-03 kl 10 58 52

The North Sami spellers from this week have been lacking the easter egg, although when building and testing on my own computer it is there. The North Sami Word / MS Office speller gives the following list of suggestions:

skjermbilde 2015-12-03 kl 10 58 19

These suggestions all have a much higher weight than the easter egg suggestions. Thus, it looks to me as if the easter egg was not built for SME for some reason.

Having the possibility to identify the lexicon version is quite important when debugging user error messages - often it turns out that they have an old speller, and that can easily be identified by asking them to type in the easter egg trigger.

ISpellCheckProvider Puzzling Behavior

I have implemented an ISpellCheckProvider and it works, but also doesn't (locale kl-GL installer; source).

If I use the official https://code.msdn.microsoft.com/windowsdesktop/spell-checking-client-aea0148c sample code, then everything works as expected and words get suggestions.

From any other MS-provided client, such as Windows itself or Skype, the word gets marked as wrong but the suggested action is to remove the word entirely, as if the response code is CORRECTIVE_ACTION_DELETE. But the debug log is clearly showing a return value of CORRECTIVE_ACTION_GET_SUGGESTIONS. So that makes no sense, as it all goes via the same MS-provided spell checker host handler. I can only conclude that MS implemented their own clients incorrectly...

Seems nobody else in the world is making spell checking providers using this API, and it's nigh impossible to get hold of someone at MS who'd know.

(originally asked on MS TechNet and Stack Overflow)

Duplicate suggestions when input has initial upper case, suggestion can be both a name and a regular noun

I have run a comparison of the voikko-based speller with the hfst-ospell-office speller. The result is visible here (first 7 diffs, available for a month):

https://www.diffnow.com/?report=5m6dx

As can be seen in the seventh diff, the same suggestion is given twice. This happens when two underlying suggestions differ only in the case of their initial letter. Because the input has an initial uppercase letter, both suggestions are given an initial uppercase letter as well, which makes them identical.

There needs to be a check for uniqueness within the final suggestion list, and if two identical suggestions are found, only the first/best one should be returned.
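
The uniqueness check could look like the sketch below, assuming the suggestion list is already sorted best-first. The function name `unique_suggestions` and the placeholder strings are made up for illustration.

```python
def unique_suggestions(suggestions):
    """Drop repeated suggestions, keeping the first (best-ranked) occurrence."""
    seen = set()
    result = []
    for s in suggestions:
        if s not in seen:
            seen.add(s)
            result.append(s)
    return result


# Placeholder strings: two raw suggestions differ only in initial case.
raw = ["foo", "Foo", "bar"]
capitalised = [s[:1].upper() + s[1:] for s in raw]
print(unique_suggestions(capitalised))
```

The check has to run after the case adjustment, since the duplicates only appear once both suggestions have been capitalised.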

Forward slash counted as part of the word

In a string like Sállan/Glimt, the forward slash is counted as part of the word, when it should not:

skjermbilde 2015-11-16 kl 13 38 28

Each of the two parts before and after the slash should be treated as independent words, ie the slash should not be considered a word character. This could be language specific, such that some languages want it as part of their words, and others not, although I am not aware of any such language on first thought.
