
Comments (3)

aarppe avatar aarppe commented on July 16, 2024

The standard FSTs by tradition try to cover all reasonable tokens in text, even when these are not strictly words of the language in question. In the texts that people write in their everyday work and life, they use numeric symbols, proper names, and loan words, and one usually aims to recognize those as well in corpus analysis. In the A-W corpus, for example, one finds 983 Arabic or Roman symbol tokens, representing 116 types, of which 64 are Arabic.

One reason why both numeric words and symbols are in the same file is that in some inflecting languages, such as Finnish, they all take the same affixational morphology. For instance,

yhdentenätoista hetkenä
11:ntenä hetkenä 

... both are uttered with the same Finnish words yhdentenätoista hetkenä, and both mean 'on the eleventh/11th hour'.
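As an aside on the shared morphology: Finnish orthography attaches case endings to numerals written in digits after a colon, which is why a single set of affixational continuations can serve both spelled-out and symbolic numerals. A minimal illustrative sketch (the function and the ending shown here are for illustration only, not the FST's actual continuation classes):

```python
# Illustrative sketch only: Finnish attaches case endings to numeral
# symbols after a colon, so "11" plus the essive ending "ntenä"
# yields "11:ntenä", paralleling spelled-out "yhdentenätoista".
def inflect_numeral_symbol(symbol: str, ending: str) -> str:
    """Attach a Finnish case ending to a numeral written in digits."""
    return f"{symbol}:{ending}"

print(inflect_numeral_symbol("11", "ntenä"))  # 11:ntenä
```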

Generally, I am quite wary of starting to create an increasing number of specialist FSTs for various purposes. Having an FST that can recognize and generate tokens one doesn't strictly need for some application is mostly not a problem, whereas managing a growing number of different FSTs built from different combinations of LEXC and other files can defeat the purpose. Also, the problem with looking up the FOMA FSTs is probably not in the source code but rather in the hfst-fst2fst command, so splitting the files is not necessarily the fundamental solution.

Nevertheless, when I was compiling the list of LEXC files needed for the dictionary FSTs, I did consider separating the Cree words as a separate set under Ipc's, but didn't do it then. Now, having been prodded into action, the split turned out to be relatively straightforward, so I have implemented it. We used to have dictionary-specific FSTs that were even more permissive with vowel-length marking, and now we have those again.

Besides this, there are some other aspects of the LEXC files that one might want to reorganize at some point; e.g., the lists of frequent misspelled word forms might be better placed in their own LEXC file, which could be included when needed. That could also clarify the compilation of the descriptive and normative FSTs, though it can cause complications of its own.

from plains-cree-fsts.

eddieantonio avatar eddieantonio commented on July 16, 2024

Excellent! This is fixed as of SVN r187652.

$ hfst-optimized-lookup crk-descriptive-analyzer.hfstol
!! Warning: file contains more than one transducer          !!
!! This is currently not handled - using only the first one !!
peyak
peyak	pêyak+Num+Ipc

nisto
nisto	nisto+Num+Ipc

keka-mitataht
keka-mitataht	IC+PV/kika+mihtâtêw+V+TA+Cnj+Prs+2Sg+3SgO
keka-mitataht	IC+PV/kika+mihtâtêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO
keka-mitataht	kêkâ-mitâtaht+Num+Ipc

I
I	I	+?

1
1	1	+?

V
V	V	+?

IV
IV	IV	+?

4321
4321	4321	+?

One thing both dictionaries have been doing is assuming that if the FST analyzes a token, it's a Cree wordform. This is not necessarily the case :/

I wonder if we can make the analysis strings more explicit and self-explanatory, stating "hey! this is definitely a Cree wordform!" or "hey, this might be something else!"
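In the meantime, one could at least distinguish analyzed tokens from outright lookup failures, which hfst-optimized-lookup marks with +? (as in the session above). A minimal sketch of that check (the function name is made up for illustration):

```python
def is_recognized(analysis_line: str) -> bool:
    """Return True if a tab-separated hfst-optimized-lookup output
    line carries a real analysis rather than the +? failure mark."""
    fields = analysis_line.rstrip("\n").split("\t")
    # Unanalyzed tokens come back as: form <TAB> form <TAB> +?
    return "+?" not in fields[-1]

print(is_recognized("peyak\tpêyak+Num+Ipc"))  # True
print(is_recognized("4321\t4321\t+?"))        # False
```

This only filters out lookup failures, of course; it does not answer the harder question of whether a successfully analyzed token is actually Cree.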


aarppe avatar aarppe commented on July 16, 2024

Just checked the A-W corpus, and there are twice as many Arabic or Roman numerals as spelled-out Cree numeral words (n=438). So, are Arabic or Roman numerals non-Cree? The same question actually applies to English and Finnish as well.

Anyhow, the regular, non-dictionary FSTs already indicate that a token is not strictly Cree with the features +Arab and +Rom, as well as +Eng, +Fra, and +Lat, as is the tradition in FSTs for other languages too. Whatever is unmarked is presumed to be the language in question (i.e. Cree).
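Building on that convention, a dictionary front end could treat any analysis carrying one of those language features as "not strictly Cree". A hedged sketch (the tag list comes from the comment above; the function name and the example analysis strings are illustrative):

```python
# Features that mark a token as not strictly Cree, per the
# convention described above (unmarked analyses are presumed Cree).
NON_CREE_TAGS = {"+Arab", "+Rom", "+Eng", "+Fra", "+Lat"}

def strictly_cree(analysis: str) -> bool:
    """Return True if the analysis string carries none of the
    language/symbol features that flag non-Cree material."""
    return not any(tag in analysis for tag in NON_CREE_TAGS)

print(strictly_cree("kêkâ-mitâtaht+Num+Ipc"))  # True
print(strictly_cree("4321+Num+Arab"))          # False (hypothetical tag string)
```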

