Comments (3)
The standard FSTs traditionally try to cover all reasonable tokens in text, even those that are not strictly words of the language in question. In the texts people actually write in their everyday work and life, they use numeric symbols, proper names, and loan words, and one usually wants to recognize those as well in corpus analysis. In the A-W corpus, for instance, there are 983 Arabic or Roman numeral tokens, representing 116 types, of which 64 are Arabic tokens.
One reason why both numeric words and symbols are in the same file is that in some inflecting languages, such as Finnish, they all take the same affixational morphology. For instance,
yhdentenätoista hetkenä
11:ntenä hetkenä
... are both uttered with the same Finnish words yhdentenätoista hetkenä, and both mean 'on the eleventh/11th hour'.
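A toy sketch of that parallelism (Python; the analysis string and its tags are hypothetical, merely modeled on typical FST tag styles, and this is not the actual LEXC implementation): both surface forms map to one and the same analysis, which is why a single inflection lexicon can serve both entry types.

```python
# Toy illustration: the spelled-out and the digit-based Finnish ordinal
# receive an identical (hypothetical) analysis, so one continuation
# lexicon for the affixes can cover both kinds of entries.
ANALYSES = {
    "yhdentenätoista": "yhdestoista+Num+Sg+Ess",
    "11:ntenä":        "yhdestoista+Num+Sg+Ess",
}

print(ANALYSES["yhdentenätoista"] == ANALYSES["11:ntenä"])  # True
```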
Generally, I am quite wary of creating an ever-increasing number of specialist FSTs for various purposes. Having an FST recognize and generate tokens that a given application doesn't strictly need is mostly not a problem, whereas managing a growing number of different FSTs, each with its own combination of LEXC and other files, can defeat the purpose. Also, the problem with lookup in the Foma FSTs is probably not in the source code but rather in the hfst-fst2fst
command, so splitting the files is not necessarily the fundamental solution.
Nevertheless, when I was compiling the list of LEXC files needed for the dictionary FSTs, I did consider separating the Cree numeral words into a set of their own under Ipc's - but didn't do it then. Now, having been prodded into action, the split turned out to be relatively straightforward, so I implemented it. We used to have dictionary-specific FSTs that were even more permissive with vowel-length marking, and now we will have those again.
Besides this, there are some other aspects of the LEXC files that one might want to reorganize at some point; e.g., the lists of frequently misspelled word forms might be better placed in their own LEXC file, to be included when needed. That could also clarify the compilation of the descriptive and normative FSTs, though it could bring complications of its own.
from plains-cree-fsts.
Excellent! This is fixed as of SVN r187652.
```
$ hfst-optimized-lookup crk-descriptive-analyzer.hfstol
!! Warning: file contains more than one transducer !!
!! This is currently not handled - using only the first one !!
peyak
peyak	pêyak+Num+Ipc

nisto
nisto	nisto+Num+Ipc

keka-mitataht
keka-mitataht	IC+PV/kika+mihtâtêw+V+TA+Cnj+Prs+2Sg+3SgO
keka-mitataht	IC+PV/kika+mihtâtêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO
keka-mitataht	kêkâ-mitâtaht+Num+Ipc

I
I	I	+?

1
1	1	+?

V
V	V	+?

IV
IV	IV	+?

4321
4321	4321	+?
```
One thing both dictionaries have been doing is assuming that if the FST analyzes a token, it's a Cree wordform. This is not necessarily the case :/
I wonder if we can make the analysis strings more explicit and self-explanatory, stating "hey! this is definitely a Cree wordform!" or "hey, this might be something else!"
Just checked the A-W corpus: there are twice as many Arabic or Roman numerals as spelled-out Cree numeral words (n=438). So, are Arabic and Roman numerals non-Cree? The same question actually applies to English and Finnish as well.
Anyhow, the regular, non-dictionary FSTs already indicate that a token is not strictly Cree with the features `+Arab` and `+Rom`, as well as `+Eng`, `+Fra`, and `+Lat`, as is the tradition for FSTs of other languages as well. What is unmarked is presumed to be the language in question (i.e. Cree).
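As a sketch of how a downstream tool, such as a dictionary front end, could act on those tags (the tag set is taken from the comment above; the helper function itself is hypothetical and not part of the FST toolchain):

```python
# Hypothetical helper: decide from an analysis string whether a token is
# (strictly) Cree. "+?" is what hfst-optimized-lookup emits for failed
# lookups; the language tags come from the regular, non-dictionary FSTs.
NON_CREE_TAGS = {"+Arab", "+Rom", "+Eng", "+Fra", "+Lat"}

def classify(analysis: str) -> str:
    if analysis.endswith("+?"):
        return "unknown"      # FST did not recognize the token
    if any(tag in analysis for tag in NON_CREE_TAGS):
        return "non-Cree"     # recognized, but marked as another language
    return "Cree"             # unmarked analyses are presumed Cree

print(classify("kêkâ-mitâtaht+Num+Ipc"))  # Cree
print(classify("4321+Num+Arab"))          # non-Cree (hypothetical analysis)
print(classify("IV+?"))                   # unknown
```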