tinodidriksen / spellers
Front-ends and packaging scripts for spellers. Git read-only mirror.
License: GNU General Public License v3.0
At least in the Sami languages. This is true for all positions (initial, middle, final):
-davvi
el-rávdnje
dutki-
Screenshots demonstrating how it is now, and how this leads to irrelevant suggestions:
In all these cases the whole string «el-rávdnje» etc should be the input, and suggestions should be generated from that.
Whether a hyphen is part of a word or not is definitely language dependent. E.g. in English it is probably best to treat the hyphen as a NON-word character (i.e. a word separator), whereas for Sámi, Norwegian and most other Nordic languages it should be a word character. The only exception to this is when it is used alone, i.e. with whitespace on each side.
Our Sámi fst's are built with this in mind, and should be able to handle all correct uses of hyphens.
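The rule described above can be sketched as a small tokenizer. This is only an illustration under stated assumptions: the `EXTRA_WORD_CHARS` table and the `tokenize` function are hypothetical names, not part of hfst-ospell or of the Office front-end, and the language codes are just examples.

```python
import re

# Hypothetical per-language table: extra characters that count as word-internal.
EXTRA_WORD_CHARS = {
    "se": "-",  # North Sami: hyphen is a word character
    "nb": "-",  # Norwegian Bokmål: likewise
    "en": "",   # English: hyphen acts as a word separator
}

def tokenize(text, lang):
    """Split text into words, keeping language-specific characters inside words."""
    extra = re.escape(EXTRA_WORD_CHARS.get(lang, ""))
    pattern = re.compile(rf"[\w{extra}]+")
    tokens = pattern.findall(text)
    # A token without any letter or digit (e.g. a hyphen standing alone
    # between spaces) is not a word and is dropped.
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

print(tokenize("-davvi el-rávdnje dutki-", "se"))
print(tokenize("el-rávdnje", "en"))
```

With a table like this, the whole string «el-rávdnje» reaches the speller for Sámi, while an English speller would see two separate words.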
Multi-word expressions are tested in parts only, and thus fail. E.g. North Sami (sme) vákten láhkai
The North Sami speller for MS Office has a regression: it no longer accepts derived proper nouns with initial lower case (it did last week):
These are accepted by the command line speller (hfst-ospell -S se.zhfst), but not by the MS Office speller (*.msi package).
Because of this diff, I suspect there is something with the nightly build environment that causes the issue. I have updated our test files with test cases for these words, and running "make check" on the built speller fst's should reveal issues related to the build system, if any. "make check" succeeds on my system, and should also on the build system (there are a couple of cases of known fails, but they are properly marked, so should not break the testing).
"make check" is only known to pass for SME, I have not tested the other languages yet.
There are a number of other regressions as well, and they all point in the direction of (im)proper handling of flag diacritics. It might be changes in hfst that have caused these regressions (my hfst installation is from Nov. 27).
Background:
OS: windows 7 enterprise service pack 1, 64 bit
Word: Microsoft Office Professional Plus 2013, 32 bit
The document that triggers the problem has been sent via e-mail, for privacy and copyright reasons.
Example (none of the suggestions are reasonable corrections):
Here is the same input with initial cap (second and third suggestions are reasonable corrections):
And the same input with all lowercase (all suggestions are reasonable corrections):
It might be that this can all be corrected in the fst by giving higher weights to certain types of compounds. For now, one of the uppercase-only suggestions is analysed as follows:
$ echo Kant-RV-irgi | hfst-lookup -q build/newspellers/tools/spellcheckers/fstbased/analyser-fstspeller-gt-norm.hfst
Kant-RV-irgi Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom 17,031221
Kant-RV-irgi Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom 10017,031250
Giving higher weight to +ACR tags should help improve the suggestions. I'll try this first.
One of our users filed a bug report via Facebook.
The text says roughly:
«Sometimes I get the following message (see screenshot). It started to appear after I installed the newest Divvun tools. I have Windows 7 and MS Office 2013.»
When the option to check the spelling of all-uppercase strings is on, (inflected) names are flagged as misspelled, whereas the same string in regular / lexical case is not flagged:
Further, when introducing a misspelling in one of the all-upper names, the correct form is suggested, but when inserted, it is still flagged as misspelled:
That is, it rejects its own suggestions in these cases.
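One way to avoid this kind of inconsistency is to make the acceptance check try the same case variants that suggestion generation implicitly relies on. The sketch below is only an assumption about a possible approach, not a description of how hfst-ospell-office works internally; `case_variants`, `is_accepted` and the toy lexicon are hypothetical.

```python
def case_variants(word):
    """Yield the forms to look up, starting with the word as typed."""
    yield word
    if word.isupper():
        # All-caps input: also try name-style and lowercase forms.
        yield word.capitalize()
        yield word.lower()

def is_accepted(word, lexicon):
    """Accept a word if any of its case variants is in the lexicon."""
    return any(variant in lexicon for variant in case_variants(word))

# Toy lexicon standing in for the real speller (assumption).
lexicon = {"Máhtte"}
print(is_accepted("MÁHTTE", lexicon))  # all-caps form of a known name
print(is_accepted("máhtte", lexicon))  # lowercase form is not in the lexicon
```

If suggestions for all-caps input are upper-cased before being shown, a check along these lines would accept them again when the user inserts them.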
Compare the following two pages with speller test results:
https://gtsvn.uit.no/langtech/trunk/langs/sma/devtools/speller_result_typos.vk.xml (voikkospell)
and
https://gtsvn.uit.no/langtech/trunk/langs/sma/devtools/speller_result_typos.to.xml (hfst-ospell-office)
The main table has the following columns:
The correct suggestion is highlighted, and the background colour of the row also indicates the position of the correct suggestion (or lack thereof).
Now search for the string "TV-esne" on both pages, and compare the suggestion lists. The suggestions from voikkospell correspond to what you would get with the following command:
echo "TV-esne" | hfst-ospell -S sma.zhfst
The suggestions from hfst-ospell-office do not. Also look at the input "TV’esne" two rows further down, for an example of even more diverging suggestions.
The following text will trigger a red underline in MS Word using the SME speller (version: Divvun-sme-2015.292.177.msi, 2015-10-19, 02:57):
– Fertejit čielga njuolggadusat
The words are accepted, but not the initial EN DASH.
A single space surrounded by linebreaks (U+000A) will trigger a red underline in MS Word using the SME speller from today. To repeat, insert the following in a document (here given as Unicode HEX code points to avoid misunderstandings and code garbling):
U+000A
U+000A
U+0020
U+000A
U+000A
Set the text to North Sami. Notice how there is a very short squiggly red line where the space is.
The Mozilla packages cannot currently coexist with each other.
A solution to the bugs below is to implement a proper native library instead of spawning an instrumented process. This work is in progress (including https://github.com/hfst/hfst-ospell/tree/cmake ).
Examples from commandline use:
hfst-mso $ echo "5 [L]" | /usr/local/bin/hfst-ospell-office se-latest/se.zhfst
@@ hfst-ospell-office is alive
*
*** Error in `/usr/local/bin/hfst-ospell-office': munmap_chunk(): invalid pointer: 0x000000000236e880 ***
Aborted (core dumped)
hfst-mso $ echo "5 [l]" | /usr/local/bin/hfst-ospell-office se-latest/se.zhfst
@@ hfst-ospell-office is alive
*
{L}, {L], [L} also result in core dumps.
hfst-mso $ /usr/local/bin/hfst-ospell --version
hfstospell 0.4.0
Jan 26 2016 10:02:10
copyright (C) 2009 - 2016 University of Helsinki
The original input had about 90000 lines, one "word" on each line.
On Linux hfst-ospell-office dumps core after all lines have been spell checked, on OS X it crashes more randomly (somewhere in the middle of the process).
After Divvun 4.0 was installed, this problem appeared.
Documents created after 4.0 was installed do not have this problem. Installing Divvun-sme-86.176.10407.msi did not help.
Windows 7
Microsoft Office Home and Business 2013
Version: 15.0.4787.1002
Locale se-FI
Extend the Greenlandic OmegaT backend for all languages.
MSI version numbers have a maximum - who knew! https://msdn.microsoft.com/en-us/library/aa370859.aspx says the maximum is 255.255.65535, which is incompatible with the current scheme of year.day.minute.
So we need to find a new compatible scheme, or just use the current 32-bit timestamp.
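To make the limit concrete, here is a minimal sketch that checks a version triple against 255.255.65535 and shows one possible scheme that fits. It assumes that "day" means day of the year and "minute" means minute of the day. The `compact_scheme` layout is purely a hypothetical suggestion, not a decision that has been made.

```python
from datetime import datetime

MSI_MAX = (255, 255, 65535)  # upper limits for major.minor.build

def is_valid_msi_version(version):
    """True if all three fields fit within the MSI limits."""
    return len(version) == 3 and all(
        0 <= part <= limit for part, limit in zip(version, MSI_MAX)
    )

def current_scheme(ts):
    """year.day.minute, read as day of year and minute of day (assumption)."""
    return (ts.year, ts.timetuple().tm_yday, ts.hour * 60 + ts.minute)

def compact_scheme(ts):
    """Hypothetical alternative: (year - 2000).month.(day and minute packed)."""
    minute_of_day = ts.hour * 60 + ts.minute
    return (ts.year - 2000, ts.month, (ts.day - 1) * 1440 + minute_of_day)

ts = datetime(2015, 10, 19, 2, 57)
print(current_scheme(ts), is_valid_msi_version(current_scheme(ts)))
print(compact_scheme(ts), is_valid_msi_version(compact_scheme(ts)))
```

The packed build field peaks at 30 * 1440 + 1439 = 44639, which stays below 65535, and the major field stays valid until the year 2255.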
Only the admin user that installed the speller has access to it.
All our spellers are automatically equipped with an easter egg giving core version information, triggered by asking for suggestions to the string "nuvviDspeller". This is what it looks like for South Sami:
The North Sami spellers from this week have been lacking the easter egg, although when building and testing on my own computer it is there. The North Sami Word / MS Office speller gives the following list of suggestions:
These suggestions all have a much higher weight than the easter egg suggestions. Thus, it looks to me as if the easter egg was not built for SME, for some reason.
Having the possibility to identify the lexicon version is quite important when debugging user error messages - often it turns out that they have an old speller, and that can easily be identified by asking them to type in the easter egg trigger.
I have implemented an ISpellCheckProvider and it works, but also doesn't (locale kl-GL installer; source).
If I use the official https://code.msdn.microsoft.com/windowsdesktop/spell-checking-client-aea0148c sample code, then everything works as expected and words get suggestions.
From any other MS-provided client, such as Windows itself or Skype, the word gets marked as wrong but the suggested action is to remove the word entirely, as if the response code is CORRECTIVE_ACTION_DELETE. But the debug log is clearly showing a return value of CORRECTIVE_ACTION_GET_SUGGESTIONS. So that makes no sense, as it all goes via the same MS-provided spell checker host handler. I can only conclude that MS implemented their own clients incorrectly...
Seems nobody else in the world is making spell checking providers using this API, and it's nigh impossible to get hold of someone at MS who'd know.
It would be very desirable from an end user point of view that the msi installers for the hfst-based spellers would uninstall/remove old Divvun tools. The old tools are also installed using MSI, and I believe all necessary data should be available in the file:
https://gtsvn.uit.no/langtech/trunk/prooftools/installers/win/MSOffice/wix/divvunsme.wxs
I have run a comparison of the voikko-based speller with the hfst-ospell-office speller. The result is visible here (first 7 diffs, available for a month):
https://www.diffnow.com/?report=5m6dx
As can be seen in the seventh diff, the same suggestion is given twice. The two suggestions differ underlyingly only in the case of their initial letter. Because the input has an initial uppercase letter, both suggestions are given initial uppercase as well, which makes them identical.
There needs to be a check for uniqueness within the final suggestion list, and if two identical suggestions are found, only the first/best one is returned.
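A minimal sketch of such a uniqueness check, assuming the suggestions arrive ordered best-first and that case adjustment only touches the initial letter. The function names are hypothetical and not taken from the hfst-ospell-office source.

```python
def match_input_case(suggestion, word):
    """Give the suggestion an initial capital if the input has one."""
    if word[:1].isupper():
        return suggestion[:1].upper() + suggestion[1:]
    return suggestion

def unique_suggestions(word, suggestions):
    """Case-adjust the suggestions, keeping only the first of any duplicates."""
    seen = set()
    result = []
    for suggestion in suggestions:
        cased = match_input_case(suggestion, word)
        if cased not in seen:
            seen.add(cased)
            result.append(cased)
    return result

# "davvi" and "Davvi" collapse into one entry once the initial is upper-cased.
print(unique_suggestions("Davi", ["davvi", "Davvi", "dávvi"]))
```

The important detail is that deduplication happens after the case adjustment, since the duplicates only appear at that point.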
In a string like Sállan/Glimt, the forward slash is counted as part of the word, when it should not:
Each of the two parts before and after the slash should be treated as independent words, ie the slash should not be considered a word character. This could be language specific, such that some languages want it as part of their words, and others not, although I am not aware of any such language on first thought.