ualbertaaltlab / plains-cree-fsts



Mirror of the source code for the Plains Cree morphological analyzer/generator.

Home Page: https://github.com/giellalt/lang-crk

License: Other

Makefile 67.87% Shell 2.89% Ruby 29.24%

nehiyawewin morphology finite-state-transducer analyzer generator cree plains-cree finite-state-transducers

plains-cree-fsts's Issues

Discrepancy using `kâ-` in layouts

I believe I've found a discrepancy in the layouts. The INFINITIVE form for kâ- is represented as ka- in the .layout files:

|      | "FUTURE/INFINITIVE"        |                           |
|      | : "Conjunct: ka-"          | : "Conjunct: ta-"         |
| "1s" | PV/ka+*+Cnj+Prs+1Sg        | PV/ta+*+Cnj+Prs+1Sg       |
| "2s" | PV/ka+*+Cnj+Prs+2Sg        | PV/ta+*+Cnj+Prs+2Sg       |
| "3s" | PV/ka+*+Cnj+Prs+3Sg        | PV/ta+*+Cnj+Prs+3Sg       |
| "1p" | PV/ka+*+Cnj+Prs+1Pl        | PV/ta+*+Cnj+Prs+1Pl       |
| "21" | PV/ka+*+Cnj+Prs+12Pl       | PV/ta+*+Cnj+Prs+12Pl      |
| "2p" | PV/ka+*+Cnj+Prs+2Pl        | PV/ta+*+Cnj+Prs+2Pl       |
| "3p" | PV/ka+*+Cnj+Prs+3Pl        | PV/ta+*+Cnj+Prs+3Pl       |
| "4"  | PV/ka+*+Cnj+Prs+4Sg/Pl     | PV/ta+*+Cnj+Prs+4Sg/Pl    |

I believe they should be:

|      | "FUTURE/INFINITIVE"         |                           |
|      | : "Conjunct: kaa-"          | : "Conjunct: ta-"         |
| "1s" | PV/kaa+*+Cnj+Prs+1Sg        | PV/ta+*+Cnj+Prs+1Sg       |
| "2s" | PV/kaa+*+Cnj+Prs+2Sg        | PV/ta+*+Cnj+Prs+2Sg       |
| "3s" | PV/kaa+*+Cnj+Prs+3Sg        | PV/ta+*+Cnj+Prs+3Sg       |
| "1p" | PV/kaa+*+Cnj+Prs+1Pl        | PV/ta+*+Cnj+Prs+1Pl       |
| "21" | PV/kaa+*+Cnj+Prs+12Pl       | PV/ta+*+Cnj+Prs+12Pl      |
| "2p" | PV/kaa+*+Cnj+Prs+2Pl        | PV/ta+*+Cnj+Prs+2Pl       |
| "3p" | PV/kaa+*+Cnj+Prs+3Pl        | PV/ta+*+Cnj+Prs+3Pl       |
| "4"  | PV/kaa+*+Cnj+Prs+4Sg/Pl     | PV/ta+*+Cnj+Prs+4Sg/Pl    |
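To compare what the two versions of the table would send to the generator, the cell templates can be expanded into full analysis strings. This is only a minimal sketch: it assumes the `*` placeholder stands for the lemma plus its word-class tags (for example `nikamow+V+AI`), and the helper name `expand_cell` is hypothetical, not something defined in the repository.

```python
def expand_cell(template: str, lemma: str, word_class: str = "V+AI") -> str:
    """Fill the `*` placeholder of a layout cell with `lemma+word_class`.

    Assumption: `*` marks the slot for the lemma and its class tags.
    """
    if template.count("*") != 1:
        raise ValueError(f"expected exactly one '*' in {template!r}")
    return template.replace("*", f"{lemma}+{word_class}")


# Expand the current cell and the proposed cell for the same "1s" slot.
for cell in ("PV/ka+*+Cnj+Prs+1Sg", "PV/kaa+*+Cnj+Prs+1Sg"):
    print(expand_cell(cell, "nikamow"))
```

Feeding both expanded strings to the generator would show whether `PV/ka` and `PV/kaa` produce different surface forms.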

kê- not recognized as preverb

e.g. kê-ohpâskonahk

Not found with either the normative or the descriptive analyzer, but e.g. ê-ohpâskonahk is fine.

Descriptive analyzer should accept "ni-ki-pimohtanan"

It should have these results:

ni-ki-pimohtanan	pimotêw+V+TA+Ind+Prt+1Pl+3SgO	0.000000
ni-ki-pimohtanan	pimohtêw+V+AI+Ind+Prt+1Pl	0.000000

Normative analyser misses correctly spelled particles (and one verb)

Forms from A-W MGS

Particles that look fine but aren't caught by the normative analyser (descriptive catches them):

ma ma+Ipc 0
aniyê aniyê+Ipc 0
waniyaw waniyaw+Ipc 0
nitaka nitaka+Ipc 0
ô ô+Ipc+Interj 0
yôhô yôhô+Ipc+Interj 0

And one verb:

kâ-pê-nayawacikicik PV/kaa+PV/pe+nayawacikiwak+V+AI+Cnj+3Pl 0

[AEW notes] ê-isimâkwahk was flagged and the suggested correction is ê-isimâkwak with the /h/ in the inflection. Is this an issue with the paradigm in general? I see the paradigm for a number of -mâkwan, -nâkwan, -spakwan forms are all incorrect in leaving the h out of the /...ahk / ending that should be here.

As we already reviewed this, this is a question of -n final VII verbs currently allowing the h/n alternation (specified with a stem-final n3) only for a handful of verbs.

[AEW further says:] Here is likely the problem. Verbs like mâyâtan are given correctly as ê-mâyâtahk, but many of the other VIIs ending in /an/ are given incorrect /ak/ endings in the paradigms. They should all end in 0s /ahk/ and 0p /ahki/. VII stems that end in /an/ always change to /ahk/ in the Conjunct. It is the /in/ and /on/ endings which are more unpredictable and which need to be marked lexically.

We can implement this for all -an final II verbs, by changing their final -n into n3. For the -in and -on final verbs, this needs to be specified per lexeme, e.g. at the very least for pipon and its compounds.
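The regular behaviour described for -an final stems can be illustrated outside the FST. The sketch below is not the LEXC/TWOL implementation; it only restates the rule (final /an/ becomes /ahk/ in 0s and /ahki/ in 0p). The function name is hypothetical, and using the ê- preverb for both forms is an assumption made for illustration.

```python
def vii_an_conjunct(stem: str) -> dict:
    """Return illustrative Conjunct forms for a VII stem ending in -an.

    Stems in -in and -on are rejected because, per the discussion above,
    their alternation is unpredictable and must be marked per lexeme.
    """
    if not stem.endswith("an"):
        raise ValueError("only -an final stems alternate regularly")
    base = stem[:-1]  # drop the final n: mâyâtan -> mâyâta
    return {"0s": f"ê-{base}hk", "0p": f"ê-{base}hki"}


print(vii_an_conjunct("mâyâtan"))
```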

Incorrect tense marker in FST implementation

I'm going to carry over our conversation from #6 and open this up as a bug:

I believe the FST analysis is incorrect for the form kâ-ki-:

❯ echo "PV/kaa_ki+ohkomiw+V+AI+Cnj+Prs+1Sg" | hfst-optimized-lookup crk-normative-generator.hfstol
PV/kaa_ki+ohkomiw+V+AI+Cnj+Prs+1Sg	kâ-ki-ohkomiyân

~~The analysis marks this as Prs (past) but the implementation is -ki- when the past marker in Plains Cree is -kî-.~~ My mistake, Prt is the 'past' analysis marker in the FSTs. With that however, I am still concerned this analysis is incorrect:

I've double-checked several references to be certain. I cannot find kâ-ki- in any of them; however, there are several examples of kâ-kî- in both Freda Ahenakêw's works and Arok Wolvengrey's thesis, for example:

p312 ex(23)
tānisi kā-kī-isi-nikamoyan?
tānisi kā-  kī-   isi-  nikamo -yan
IPC    IPV  IPV   IPV   VAI    2s
how    CNJ  PST   thus  sing
“How did you sing?”

p312 ex(24)
tānēhki kā-kī-sipwēhtēt?
tānēhki kā- kī-  sipwēhtē -t
IPC     IPV IPV  VAI      3s
why     CNJ PST  leave
“Why did s/he leave?”

p316 ex(31)
kā- kī- wāpam -iko -t
IPV IPV VTA   INV  3s
CNJ PST see   3’-3s

There are 16 examples in total that I could find just in that paper alone.

Please let me know if you need references to further examples.

Possible typo for `langs/crk/inc/paradigms/verb-ai-full.layout`

I believe the FUTURE DEFINITE TENSE form for Ind+Fut+Def+12P should actually be Ind+Fut+Def+12Pl. Attempting to inflect with the current value results in an error:

$ echo nikamow+V+AI+Ind+Fut+Def+12P | hfst-optimized-lookup --silent crk-normative-generator.hfstol
!! Warning: file contains more than one transducer          !!
!! This is currently not handled - using only the first one !!
nikamow+V+AI+Ind+Fut+Def+12P	nikamow+V+AI+Ind+Fut+Def+12P	+?
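A typo like this could be caught before it reaches the generator by checking the last tag of every layout cell against the person tags that appear in the layouts. The following is a rough sketch under that assumption: the tag set is taken from the Conjunct table quoted in an earlier issue, the function names are hypothetical, and the sample cell format is made up for the demonstration. Cells that end in an object tag (for TA verbs) would need a larger tag set.

```python
# Person tags seen in the layout tables quoted in this tracker.
KNOWN_PERSON_TAGS = {"1Sg", "2Sg", "3Sg", "1Pl", "12Pl", "2Pl", "3Pl", "4Sg/Pl"}


def has_known_person_tag(cell: str) -> bool:
    """True if the final `+`-separated tag is a known person tag."""
    return cell.rsplit("+", 1)[-1] in KNOWN_PERSON_TAGS


def suspicious_cells(layout_text: str) -> list:
    """Collect analysis cells whose final tag is not a known person tag."""
    bad = []
    for line in layout_text.splitlines():
        for cell in line.split("|"):
            cell = cell.strip()
            # Only cells containing '+' are analysis templates.
            if "+" in cell and not has_known_person_tag(cell):
                bad.append(cell)
    return bad


sample = '| "21" | *+Ind+Fut+Def+12P | PV/ta+*+Cnj+Prs+12Pl |'
print(suspicious_cells(sample))
```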

Creates superfluous + at the end of a Roman numeral analysis

From eddieantonio/fst-lookup#5:

This also appears to be an issue with the FST — at least the one built in UAlbertaALTLab/plains-cree-fst

$ hfst-lookup crk-descriptive-analyzer.hfst
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata.
Using HFST basic transducer format and performing slow lookups
> I
I	I+Num+Rom+	0.000000
> II
II	II+Num+Rom+	0.000000
> III
III	III+Num+Rom+	0.000000
> IV
IV	IV+Num+Rom+	0.000000
> V
V	V+Num+Rom+	0.000000
> VI
VI	VI+Num+Rom+	0.000000
> VII
VII	VII+Num+Rom+	0.000000
> VIII
VIII	VIII+Num+Rom+	0.000000
> IX
IX	IX+Num+Rom+	0.000000
> X
X	X+Num+Rom+	0.000000
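Until the FST itself is fixed, a consumer of these analyses could normalize them after lookup. This is a workaround sketch only, with a hypothetical helper name; it does not address the cause in the numerals lexicon.

```python
def strip_trailing_plus(analysis: str) -> str:
    """Remove any superfluous '+' left at the end of an analysis string."""
    return analysis.rstrip("+")


print(strip_trailing_plus("VIII+Num+Rom+"))
```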

(Descriptive) Fomabin no longer analyzes "nipa"

"nipa" should be analyzed as "nipâ" or "nipâw+V+AI+Imp+Imm+2Sg". This works in HFST:

$ echo "nipa" | hfst-lookup -q crk-descriptive-analyzer.hfst
nipa	nipâw+V+AI+Imp+Imm+2Sg	0.000000

And the optimized lookup:

$ echo "nipa" | hfst-optimized-lookup -q crk-descriptive-analyzer.hfstol
!! Warning: file contains more than one transducer          !!
!! This is currently not handled - using only the first one !!
nipa	nipâw+V+AI+Imp+Imm+2Sg

However, this CRASHES flookup!

$ echo "nipa" | flookup -q crk-descriptive-analyzer.fomabin
[1]    12880 done       echo "nipa" |
       12881 abort      flookup -q crk-descriptive-analyzer.fomabin

Trying to load the FST into Foma also crashes it!

$ foma
Foma, version 0.9.18alpha (svn r0)
Copyright © 2008-2015 Mans Hulden
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; for details, type "help license"

Type "help" to list all commands available.
Type "help <topic>" or help "<operator>" for further help.

foma[0]: load stack crk-descriptive-analyzer.fomabin
[1]    12891 abort      foma

However, the strict analyzer (spell relax not applied) still works...?

$ echo "nipâ" | flookup -q crk-strict-analyzer.fomabin
nipâ	nipâw+V+AI+Imp+Imm+2Sg

Note that fst-lookup does not crash on this Fomabin, and is actually usable for some analyses, but returns 0 results for nipa.

Possible sources of the bug:

The latest spell relax rules
The inversion of crk-orth.hfst
hfst-fst2fst
Foma itself

Inflection identifiers list

Would be great to have a JSON lookup (or similar) of identifiers for the various inflections, eg:

{
  "PV/e+{{ lemma }}+V+AI+Cnj+Prs+1Sg": {
    "mode": "Conjunct",
    "type": "VAI",
    "variation": "VAI1",
    "tempus": "Present",
    "actor": "1Sg",
    "etc": "..."
  }
}

Some discussion may be required to make sure that all inflections are identifiable using a format such as this. The goal would be to render list of inflected forms (similar to the "linguistic" or "nêhiyawêwin" tab on itwêwina).

I've made a couple of attempts at this but struggle to encompass ideas like "future intentional", "infinitive (ta-)", or "should (ta-kî-)". At the least, a full list of the inflection identifiers would be great, so they can be mapped to templates for inflection. Note that some of the identifiers above deviate from the ones you are using (I was taking creative liberties); however, it would be great to have a 1-1 agreement on how to identify forms moving forward.
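As a starting point for such a lookup, the analysis strings that already appear in this tracker can be split mechanically into prefixes, lemma, and tags. The sketch below makes several assumptions: leading `PV/...` elements and `IC` are treated as prefixes, the first remaining element is the lemma, and everything after it is a tag. The function name and the returned keys are hypothetical. Mapping tags to labels such as "Conjunct" or "Present" would still need an agreed table.

```python
def parse_analysis(analysis: str) -> dict:
    """Split an FST analysis string into prefixes, lemma, and tags."""
    parts = analysis.split("+")
    prefixes = []
    # Assumption: preverbs (PV/...) and initial change (IC) precede the lemma.
    while parts and (parts[0].startswith("PV/") or parts[0] == "IC"):
        prefixes.append(parts.pop(0))
    if not parts or not parts[0]:
        raise ValueError(f"no lemma found in {analysis!r}")
    return {"prefixes": prefixes, "lemma": parts[0], "tags": parts[1:]}


print(parse_analysis("PV/kaa+PV/pe+nayawacikiwak+V+AI+Cnj+3Pl"))
```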

Bonus points if we can include obscure forms like "weak reduplication" such as

wayâpamêw

@eddieantonio looking for your thoughts on this one.

New Foma FST build script

We should try a new, less error-prone Foma build script. Here's @aarppe:

I left out the proper names and abbreviations from the catenation of the LEXC source, leaving otherwise the compilation script the same and adding FOMA compilation at the end. Direct conversion with hfst-fst2fst of the HFST descriptive analyzer to FOMA format results in abort, but the following scheme seems to work:

hfst-fst2fst -b -F -i crk-gen-norm-dict.hfst -o crk-gen-norm-dict.fomabin

hfst-fst2fst -b -F -i crk-orth.hfst -o crk-orth.fomabin

foma -e"load crk-gen-norm-dict.fomabin" -e"define M" -e"load crk-orth.fomabin" -e"invert net" -e"define O" -e"regex [ M .o. O ];" -e"save stack crk-anl-desc-dict.fomabin" -s

Testing with a few examples, we seem to get results we expect:

flookup -q crk-anl-desc-dict.fomabin
nepat
nipayan
meyonipat
nepat	IC+nipâw+V+AI+Cnj+Prs+3Sg

nipayan	pê-ayâw+V+AI+Ind+Prs+1Sg
nipayan	nipâw+V+AI+Cnj+Prs+2Sg
nipayan	nipâw+V+AI+Cnj+Prs+1Sg

meyonipat	IC+PV/miyo+nipâw+V+AI+Cnj+Prs+3Sg

That hfst-fst2fst produces an imperfect FST is something that we ought to bring to the attention of the Helsinki folks. Nevertheless, the HFST and FOMA FSTs seem to work, so I can't easily judge if there's something disagreeable in the source code.

Regardless, I don't know if this works with the python FST lookup code.

Getting the new FST working is important as it fixes some key errors in the affixation implemented last week (Atticus is working on the remaining issues, namely unspecified actors for VAIs and VTIs) as well as incorporates all of Arok's new dictionary entries.

Originally posted by @aarppe in UAlbertaALTLab/morphodict#261 (comment)
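The three commands quoted above could be wrapped in a small driver so they always run in the same order and stop at the first failure. This is a sketch, not the repository's build script: the function names are hypothetical, the default file names are the ones from the quoted commands, and each `-e` option is passed as a separate argument instead of being attached to its value.

```python
import subprocess


def foma_build_commands(lexicon="crk-gen-norm-dict", orth="crk-orth",
                        output="crk-anl-desc-dict.fomabin"):
    """Return the command lines for the HFST-to-Foma build, in order."""
    convert = [
        ["hfst-fst2fst", "-b", "-F", "-i", f"{name}.hfst", "-o", f"{name}.fomabin"]
        for name in (lexicon, orth)
    ]
    compose = [
        "foma",
        "-e", f"load {lexicon}.fomabin", "-e", "define M",
        "-e", f"load {orth}.fomabin", "-e", "invert net", "-e", "define O",
        "-e", "regex [ M .o. O ];",
        "-e", f"save stack {output}",
        "-s",
    ]
    return convert + [compose]


def run_build():
    """Run each step; check=True aborts on the first non-zero exit."""
    for command in foma_build_commands():
        subprocess.run(command, check=True)
```

Calling `run_build()` requires `hfst-fst2fst` and `foma` on the PATH and the two `.hfst` files in the working directory.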

Question: paradigm strings for forms in FSTs, but not on itwêwina site

Sorry for the long title. The layout files do not include the paradigm strings (e.g. PV/ta+*+Cnj+Prs+3Pl) for the following forms:

ka-kî- (independent) eg: nika-kî-itwân "I could say (thus)"
ta-kî- (conjunct) eg: ta-kî-itweyân "I should say (thus)"
kita- (conjunct) eg: kita-mosiwit "he became a moose". I understand this may be the same as "ta" but please correct me if I'm wrong
kâ- (conjunct) eg: kâ-nêhiyawêcik "when they speak Cree/those who speak Cree"

I assume the FSTs can handle these, but I didn't see the paradigm IDs (what are we calling them?) listed in the layout files, I was curious if they were going to be added, or if someone could help me generate the list.

Thanks!

Imperatives for the VTAti subclass

[AEW says:] Upon checking I found that the 2s-3s command form for this verb is currently generated as "iti" which is incorrect, it should be /isi/. This goes along with the general t > s rule that takes place for all 2s > 3... command forms.

Currently, this happens to be by design, as is exemplified by the attached YAML file, also to be found via the following link:

https://victorio.uit.no/langtech/trunk/langs/crk/test/src/gt-norm-yamls/V-TA-itew_gt-norm.yaml

I've located where this change can be made, either in how the affixation is described in verb_affixes.lexc:

+Imm+2Sg+3SgO:i2 VERB_ENDLEX ;

... or in the morphophonological description on when -t- turns into -s-

"t2sVTA4Rule"
t3:s <=> _ [ Bx: [ i: | ii2: ] ] | .#. ;

We'd need to add i2 to the context when the t>s change happens (which might have ramifications elsewhere), or adjust the affixation in src/morphology/affixes/verb_affixes.lexc so that a usual <i> is affixed rather than <i2> (also with ramifications that need to be checked). In all cases, the corresponding YAML file needs to be revised.

VTA-1 with glides

y and w in miyêw and ayâwêw should not collapse with i-initial suffixes. mowêw seems to be working, so can use as an example. May also have to do with <i> vs <i2> (or whatever characters we are using for PA *i and PA *e, I may have them wrong).

But Bloomfield kika-ayâtin ayâwêw+V+TA+Ind+Fut+Def+1Sg+2SgO <-- hmm

Productive recognition of diminutives

If no diminutive is listed, then allow only for the productive generation of the short diminutive -is (rather than both -is and -isis). If a diminutive is listed (-is or -isis), disallow the other.

Missing alternative form for `V+AI+Ind+Fut+Def+3Sg`

I believe there may be an incorrect form within the FSTs, or possibly a missing "alternative" (like V+AI+Ind+Prs+12Pl), for the FUTURE DEFINITE TENSE. In the 3rd person the FSTs prefix these words with ka-; however, we observe locally (ôta amiskiwâciy wâskahikanihk) that in the non-SAP forms the words are prefixed with ta-, for instance:

❯ echo itwêw+V+AI+Ind+Fut+Def+3Sg | hfst-optimized-lookup --silent crk-normative-generator.hfstol
itwêw+V+AI+Ind+Fut+Def+3Sg	ka-itwêw

❯ echo kimiwan+V+II+Ind+Fut+Def+3Sg | hfst-optimized-lookup --silent crk-normative-generator.hfstol
kimiwan+V+II+Ind+Fut+Def+3Sg	ka-kimiwan

I believe these examples should be ta-itwêw and ta-kimiwan respectively. Could this just be a local variation, or is there possibly a mistake here?

Use correct spell-relax file

We need the latest spell relax to be compiled in the FSTs here. Problem is, @eddieantonio doesn't know... where the latest file is. Could @aarppe perhaps help?

Split numerals file?

The numerals.lexc file includes both Cree words for numbers, as well as legacy stuff for Arabic and Roman numerals. For a dictionary FST, I think having Arabic and Roman numerals is silly.

For a general acceptor FST, perhaps recognizing Arabic numerals makes sense. When will people legitimately use Roman numerals in Cree text? 😂

Therefore, I think this file should be split and only Cree numerals will be built into the dictionary FSTs.

Among the various LEXC files selected for the dictionary-only FSTs, I would nevertheless still include numerals.lexc, as that is where this subclass of Indeclinable Particles (Ipc) is enumerated, and our dictionary sources do contain number words, e.g. pêyak.

I have... issues... with pêyak being in the same file as Arabic and Roman numerals.

Can I split the legacy lexica into its own file (something like numerals-other.lexc), include this in the normal FST, and intentionally exclude it from the dict FSTs?

Originally posted by @eddieantonio in #20 (comment)

Implement a spell-relax rule also for word-final `-uh`

We will probably want to implement a spell-relax rule also for word-final -uh, so that one can recognize neeyuh for nîya, and maybe some other interference from English.
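The intended effect can be illustrated on the input side. The real rule would live in the spell-relax source compiled into the descriptive FST; the sketch below only shows the mapping of word-final -uh to -a, with a hypothetical function name. Turning the remaining `ee` into `î` is assumed to be the job of other relaxations and is not handled here.

```python
import re


def relax_final_uh(word: str) -> str:
    """Map an English-influenced word-final -uh to -a."""
    return re.sub(r"uh$", "a", word)


print(relax_final_uh("neeyuh"))
```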

Originally posted by @aarppe in #12 (comment)

Some stems should not be stems

Like for example, "kinanâskomitin" will be inflected further @atticusha.

@aarppe: can you change the script that generates the stems? How?

plains-cree-fsts/src/morphology/stems/verb_stems.lexc

Line 2318 in 07fade2

kinanâskomitin:kinanâskomitin VTA ;

Example of generating multiple inflections

@eddieantonio you mentioned it's possible to generate multiple inflections simultaneously, can you provide an example of how to do so?
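One way to do this, sketched here as an assumption and not as the maintainers' answer, is to send several analyses to a single `hfst-optimized-lookup` process, one per line on standard input, and then group the tab-separated output by input. The function names are hypothetical. The generator file name and the `--silent` flag are the ones used in other issues in this tracker.

```python
import subprocess


def parse_lookup_output(text: str) -> dict:
    """Group lookup output lines (input<TAB>output[...]) by input string."""
    results = {}
    for line in text.splitlines():
        # Skip blank separators and '!!' warning banners.
        if not line.strip() or line.startswith("!!"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        forms = results.setdefault(fields[0], [])
        # A '+?' field marks a failed lookup; record no form for it.
        if "+?" not in fields[1:]:
            forms.append(fields[1])
    return results


def generate_many(analyses, fst_path="crk-normative-generator.hfstol"):
    """Generate wordforms for several analyses in one process call."""
    proc = subprocess.run(
        ["hfst-optimized-lookup", "--silent", fst_path],
        input="\n".join(analyses) + "\n",
        capture_output=True, text=True, check=True,
    )
    return parse_lookup_output(proc.stdout)
```

For example, `generate_many(["itwêw+V+AI+Ind+Fut+Def+3Sg", "kimiwan+V+II+Ind+Fut+Def+3Sg"])` would return a dictionary keyed by analysis, provided the HFST tools and the generator file are available locally.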

Unspecified actors for AI verbs

[AEW notes] ... some options being given for the VAI unspecified actors. When I search for /nîmihitow/, the paradigms are currently giving these options:

nîmihitonâniwan
nîmihitoniwan

ê-nîmihitohk
ê-nîmihtonâniwahk
ê-nîmihitoniwahk

The ones I have marked in red are incorrect. I assume this comes from treating -(nâ)niwan / -(nâ)niwahk as if the -(nâ) part is always optional. It is not. It is morpho-phonologically/contextually conditioned; it is only absent for /â/ and /ê > â/ final stems. For all others, the full forms including the -nâ must be used, i.e.:

nipâ*niwan* and
mêtawâ*niwan*

... but ...

api*nâniwan*
tapasî*nâniwan*
nikamo*nâniwan*
pasikô*nâniwan*

This will need to be cleaned up in the paradigms if this is pervasive rather than an odd occurrence in the /nîmihito-/ paradigm.
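The conditioning described above can be restated as a small function, which may help when checking paradigm output. This is an illustrative sketch of the Independent forms only, not the FST affixation; the function name is hypothetical, and the input is assumed to be the bare stem (for example `mêtawê`, not the lemma `mêtawêw`).

```python
import unicodedata


def unspecified_actor_independent(stem: str) -> str:
    """Attach -niwan or -nâniwan depending on the stem-final vowel."""
    stem = unicodedata.normalize("NFC", stem)  # keep â/ê as single characters
    if stem.endswith("â"):
        return stem + "niwan"            # nipâ -> nipâniwan
    if stem.endswith("ê"):
        return stem[:-1] + "âniwan"      # ê > â, then -niwan
    return stem + "nâniwan"              # all other stems keep -nâ-


for stem in ("nipâ", "mêtawê", "api", "nîmihito"):
    print(stem, unspecified_actor_independent(stem))
```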

Document how to generate forms with alternative grammatical preverbs

Thank you, I won't push the argument that the forms should be included/displayed, however if there was a reference for how to generate those forms (read: a list of the possible preverbs/paradigm IDs eg: PV/ka_ki+...) that would be super helpful.

Originally posted by @aaronfay in #6 (comment)
