
Comments (3)

aarppe avatar aarppe commented on July 16, 2024

The standard FSTs by tradition try to cover all reasonable tokens in text, even when these are not strictly words of the language in question. In the texts that people write in their everyday work and life, they use numeric symbols, proper names, and loan words, and one usually aims to recognize those as well in corpus analysis. In the A-W corpus, for example, one finds 983 Arabic or Roman symbol tokens, representing 116 types, of which 64 are Arabic.

One reason why both numeric words and symbols are in the same file is that in some inflecting languages, such as Finnish, they all take the same affixational morphology. For instance,

yhdentenätoista hetkenä
11:ntenä hetkenä 

... both are uttered with the same Finnish words yhdentenätoista hetkenä, and both mean 'on the eleventh/11th hour'.
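As an aside on the shared morphology: Finnish orthography attaches case endings to numerals written in digits after a colon, which is why a single set of affixational continuations can serve both spelled-out and symbolic numerals. A minimal illustrative sketch (the function and the ending shown here are for illustration only, not the FST's actual continuation classes):

```python
# Illustrative sketch only: Finnish attaches case endings to numeral
# symbols after a colon, so "11" plus the essive ending "ntenä"
# yields "11:ntenä", paralleling spelled-out "yhdentenätoista".
def inflect_numeral_symbol(symbol: str, ending: str) -> str:
    """Attach a Finnish case ending to a numeral written in digits."""
    return f"{symbol}:{ending}"

print(inflect_numeral_symbol("11", "ntenä"))  # 11:ntenä
```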

Generally, I am quite wary of starting to create an increasing number of specialist FSTs for various purposes. Having an FST that can recognize and generate tokens one doesn't strictly need for some application is mostly not a problem, whereas managing a growing number of different FSTs built from different combinations of LEXC and other files can defeat the purpose. Also, the problem with looking up the FOMA FSTs is probably not in the source code but rather in the hfst-fst2fst command, so splitting the files is not necessarily the fundamental solution.

Nevertheless, when I was compiling the list of LEXC files needed for the dictionary FSTs, I did consider separating the Cree words as a separate set under Ipc's, but didn't do it then. Now, having been prodded into action, the split turned out to be relatively straightforward, so I have implemented it. We used to have dictionary-specific FSTs that were even more permissive with vowel-length marking, and now we have those again.

Besides this, there are some other aspects of the LEXC files that one might want to reorganize at some point; e.g., the lists of frequent misspelled word forms might be better placed in their own LEXC file, which could be included when needed. That could also clarify the compilation of the descriptive and normative FSTs, though it can cause complications of its own.

from plains-cree-fsts.

eddieantonio avatar eddieantonio commented on July 16, 2024

Excellent! This is fixed as of SVN r187652.

$ hfst-optimized-lookup crk-descriptive-analyzer.hfstol
!! Warning: file contains more than one transducer          !!
!! This is currently not handled - using only the first one !!
peyak
peyak	pêyak+Num+Ipc

nisto
nisto	nisto+Num+Ipc

keka-mitataht
keka-mitataht	IC+PV/kika+mihtâtêw+V+TA+Cnj+Prs+2Sg+3SgO
keka-mitataht	IC+PV/kika+mihtâtêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO
keka-mitataht	kêkâ-mitâtaht+Num+Ipc

I
I	I	+?

1
1	1	+?

V
V	V	+?

IV
IV	IV	+?

4321
4321	4321	+?

One thing both dictionaries have been doing is assuming that if the FST analyzes a token, it's a Cree wordform. This is not necessarily the case :/

I wonder if we can make the analysis strings more explicit and self-explanatory, stating "hey! this is definitely a Cree wordform!" or "hey, this might be something else!"
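In the meantime, one could at least distinguish analyzed tokens from outright lookup failures, which hfst-optimized-lookup marks with +? (as in the session above). A minimal sketch of that check (the function name is made up for illustration):

```python
def is_recognized(analysis_line: str) -> bool:
    """Return True if a tab-separated hfst-optimized-lookup output
    line carries a real analysis rather than the +? failure mark."""
    fields = analysis_line.rstrip("\n").split("\t")
    # Unanalyzed tokens come back as: form <TAB> form <TAB> +?
    return "+?" not in fields[-1]

print(is_recognized("peyak\tpêyak+Num+Ipc"))  # True
print(is_recognized("4321\t4321\t+?"))        # False
```

This only filters out lookup failures, of course; it does not answer the harder question of whether a successfully analyzed token is actually Cree.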


aarppe avatar aarppe commented on July 16, 2024

Just checked the A-W corpus, and there are twice as many Arabic or Roman numerals as spelled-out Cree numeral words (n=438). So, are Arabic or Roman numerals non-Cree? The same question actually applies to English and Finnish as well.

Anyhow, the regular, non-dictionary FSTs already indicate that a token is not strictly Cree with the features +Arab and +Rom, as well as +Eng, +Fra, and +Lat, as is the tradition in FSTs for other languages too. Whatever is unmarked is presumed to be the language in question (i.e. Cree).
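Building on that convention, a dictionary front end could treat any analysis carrying one of those language features as "not strictly Cree". A hedged sketch (the tag list comes from the comment above; the function name and the example analysis strings are illustrative):

```python
# Features that mark a token as not strictly Cree, per the
# convention described above (unmarked analyses are presumed Cree).
NON_CREE_TAGS = {"+Arab", "+Rom", "+Eng", "+Fra", "+Lat"}

def strictly_cree(analysis: str) -> bool:
    """Return True if the analysis string carries none of the
    language/symbol features that flag non-Cree material."""
    return not any(tag in analysis for tag in NON_CREE_TAGS)

print(strictly_cree("kêkâ-mitâtaht+Num+Ipc"))  # True
print(strictly_cree("4321+Num+Arab"))          # False (hypothetical tag string)
```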

