giellalt / lang-rus

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Russian language

License: GNU General Public License v3.0

Makefile 0.34% Shell 0.46% M4 0.40% Python 0.80% Regular Expression 0.18% XML 0.03% YAML 0.52% Text 97.28%

finite-state-transducers constraint-grammar nlp language-resources minority-language proofing-tools giellalt-langs maturity-beta geo-russia langfam-indoeuropean

lang-rus's Issues

эты

$ echo эты | hfst-lookup -q analyser-gt-desc.hfstol
эты	эта+N+Fem+Inan+Pl+Acc	0.000000
эты	эта+N+Fem+Inan+Pl+Nom	0.000000
эты	эта+N+Fem+Inan+Sg+Gen	0.000000

Add `+Err/L2_Lat` for confusing Cyrillic with Latin letters

Words like вилет (cf. билет). The following are candidates to consider:

б > в
й > у
н > п
п > р
к > с
х > н
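
The candidate pairs above could be turned into test data for such a tag. The following is a minimal Python sketch, assuming each pair reads as "intended letter > letter written under Latin influence" (as in билет > вилет). The names `LATIN_CONFUSIONS` and `confused_variants` are hypothetical and not part of the analyser.

```python
# Intended Cyrillic letter -> letter a learner may write instead.
# Pairs are taken from the candidate list above; the direction
# (intended > written) is an assumption based on the вилет/билет example.
LATIN_CONFUSIONS = {
    "б": "в",
    "й": "у",
    "н": "п",
    "п": "р",
    "к": "с",
    "х": "н",
}


def confused_variants(word):
    """Return every variant of `word` with exactly one letter swapped."""
    variants = []
    for i, ch in enumerate(word):
        if ch in LATIN_CONFUSIONS:
            variants.append(word[:i] + LATIN_CONFUSIONS[ch] + word[i + 1:])
    return variants


print(confused_variants("билет"))
```

Each generated variant could then be checked against the analyser to see whether it already receives a non-error reading.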

Lemmas declared more than once

Taken from reynoldsnlp/udar#40

The following code using the lexc_parser module ...

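from os import environ
from sys import stderr

import lexc_parser as lp


# GTPATH is not defined in the original snippet; it is assumed here to be
# an environment variable pointing at the directory that contains langs/.
GTPATH = environ['GTPATH']
filename = GTPATH + '/langs/rus/src/morphology/lexicon.tmp.lexc'

print('Parsing lexc file...', file=stderr)
with open(filename) as f:
    src = f.read()
lexc = lp.Lexc(src)

# Collect the lexicons that Root continues to, skipping Numeral.
primary_lexicons = [entry.cc.id for entry in lexc['Root']
                    if entry.cc is not None and entry.cc.id != 'Numeral']
for lex in primary_lexicons:
    # Accessing cc_lemmas_dict is what emits the UserWarnings shown below.
    lexc[lex].cc_lemmas_dict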

...yields the following lists of lemmas that are declared more than once inside the same part of speech's LEXICON:

Parsing lexc file...
ryan.py:17: UserWarning: Lemmas declared more than once within Adverb:
{'коротко', 'наголо', 'верхом', 'чудно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Noun:
{'бронирование', 'пояс', 'колонок', 'кочан', 'ничтожество', 'судзуки', 'лекарство', 'орган', 'рондо', 'видение', 'уголь', 'туника', 'сапожок', 'пресс-релиз', 'артикул', 'соболь', 'огнеупоры', 'кондуктор', 'индустрия', 'чижик', 'вязанка', 'воздвижение', 'недвижимость', 'пулярка', 'призрак', 'козырь', 'флагман', 'цоколь', 'бакан', 'нон-стоп', 'гитлерюгенд', 'сопло', 'ширма', 'предвозвестник', 'провидение', 'болванчик', 'генсовет', 'парилка', 'пугало', 'гигант', 'тягло', 'полиграфия', 'комплекс', 'микрометр', 'мебельщик', 'характерность', 'феномен', 'пристенок', 'хаханьки', 'натура', 'наркоминдел', 'чувиха', 'пергамент', 'водолей', 'сельдь', 'ламповая', 'напряг', 'ферула', 'хиханьки', 'глюк', 'настриг', 'туркменбаши', 'пролог', 'метчик', 'обрезание', 'туфелька', 'розан', 'речушка', 'чабер', 'порсканье', 'судья', 'светоч', 'урка', 'хаос', 'проводка', 'лиганд', 'колосс', 'дочушка', 'маки', 'транспорт', 'замглавы', 'полип', 'ирис', 'угольник', 'проволочка', 'лосось', 'единица', 'червец', 'тотем', 'холодность', 'плёночка', 'картель', 'нуклеокапсид', 'жертва', 'истукан', 'предвестник', 'кашица', 'кредит', 'взрослый', 'опрощение', 'сведение', 'ужин', 'отзыв', 'русло', 'солнечник', 'ход', 'ястребок', 'префикс', 'цитокин', 'ирей', 'синтип', 'бучение', 'книговедение', 'трапезная', 'безобразность', 'край', 'чучело', 'созданьице', 'зайчик', 'рол', 'подволока', 'разлив', 'солнышко', 'креветка', 'консерваторка', 'дядя', 'прототип', 'сметливость', 'гуарани', 'субъект', 'заворот', 'видик', 'катанье', 'ведение', 'создание', 'калига', 'устрица', 'хобот', 'прослушка', 'бодяга', 'зев', 'комроты', 'отчёт', 'фрик', 'конус', 'адрес', 'котик', 'камора', 'дышло', 'плазмодий', 'марионетка', 'отправитель', 'усадьба', 'селище', 'живчик', 'лоцман', 'дублет', 'светило', 'боливар', 'мшанка', 'целение', 'юнкер', 'спутник', 'скакунок', 'дуплет', 'ордер'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Predicative:
{'чудно', 'полно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Pronoun:
{'возле', 'поперёд', 'обок', 'вне', 'внутрь', 'близь', 'помимо', 'посредине', 'напротив', 'поперёк', 'вблизи', 'посреди', 'вперёд', 'наместо', 'спереди', 'наперекор', 'подобно', 'согласно', 'насчёт', 'навроде', 'свыше', 'ниже', 'посередине', 'ради', 'позади', 'вдоль', 'под', 'чрез', 'вроде', 'вследствие', 'посредством', 'выключая', 'у', 'путём', 'касательно', 'превыше', 'накануне', 'относительно', 'вопреки', 'про', 'промежду', 'касаемо', 'около', 'над', 'из-за', 'по', 'сквозь', 'за', 'ввиду', 'соразмерно', 'противу', 'поверх', 'вовнутрь', 'наперерез', 'без', 'позадь', 'вкось', 'вослед', 'пред', 'мимо', 'сообразно', 'из-под', 'опричь', 'внизу', 'между', 'по-над', 'кроме', 'сверху', 'о', 'посередь', 'сверх', 'вкруг', 'внутри', 'промеж', 'через', 'к', 'против', 'от', 'наподобие', 'перед', 'посереди', 'сзади', 'кругом', 'на', 'включая', 'прежде', 'до', 'исключая', 'выше', 'снизу', 'соответственно', 'взамен', 'насупротив', 'для', 'из', 'округ', 'среди', 'меж', 'плюс', 'окрест', 'средь', 'с', 'благодаря', 'спустя', 'вслед', 'при', 'противно²', 'вместо', 'минус', 'вокруг', 'после', 'впереди', 'подле', 'близ', 'по-за', 'изнутри', 'супротив', 'в', 'середь'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Verb:
{'осветить', 'прояснеть', 'отползать', 'запыхаться¹', 'усугубиться', 'тикать', 'усугубить', 'икать'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Propernoun:
{'Мелани', 'Сандро', 'Филатов', 'Зощенко', 'Марго', 'Геркулесович', 'Люси', 'Симонович', 'Фениксович', 'Симон', 'Витольдович', 'Манагуа', 'Якобсон', 'Евтушенко', 'Гордон', 'Исидор', 'Терещенко', 'Геркулесовна', 'Бурденко', 'Исидорович', 'Григоренко', 'Симоновна', 'Фигаро', 'Макаренко', 'Стефанович', 'Филиппов', 'Короленко', 'Геркулес', 'Лонгин', 'Франко', 'Довженко', 'Пегасовна', 'Пегасович', 'Никарагуа', 'Лонгиновна', 'Мартиновна', 'Громыко', 'Элизабет', 'Федотов', 'Павлиновна', 'Лысенко', 'Шевченко', 'Гильфердинг', 'Павлин', 'Шульженко', 'Исаченко', 'Иванов', 'Робинсон', 'Пегас', 'Стефан', 'Мартин', 'Михалков', 'Павлинович', 'Персей', 'Стефановна', 'Семашко', 'Икария', 'Катанга', 'Мемфис', 'Лонгинович', 'Исидоровна', 'Фениксовна', 'Викторович', 'Феникс', 'Стефани', 'Персеевич', 'Новиков', 'Витольдовна', 'Мартинович', 'Любань', 'Витольд', 'Виктор', 'Нестеренко', 'Панченко', 'Гурченко', 'Обухов', 'Персеевна', 'Покров', 'Итака', 'Морган', 'Викторовна'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Punctuation:
{''}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Symbols:
{'%'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within LexicalizedParticiple:
{'положить', 'сложить'}
  lexc[lex].cc_lemmas_dict
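
For a quick check without lexc_parser, duplicates can also be counted directly. This is a simplified sketch only: it counts upper-side lemmas per LEXICON block, ignores lexc `%` escapes, and does not attempt to reproduce what lexc_parser does. The function name, the sample entries and the continuation class names (`Adv1`, `Adv2`, `V1`) are hypothetical.

```python
from collections import Counter, defaultdict


def duplicate_lemmas(lexc_source):
    """Map each LEXICON name to the set of lemmas it lists more than once."""
    counts = defaultdict(Counter)
    current = None
    for raw in lexc_source.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop lexc comments (naively)
        if not line:
            continue
        if line.startswith("LEXICON "):
            current = line.split()[1]
        elif current is not None and line.endswith(";"):
            entry = line.split()[0]          # e.g. "чудно:чудно"
            lemma = entry.split(":", 1)[0]   # upper side before the colon
            counts[current][lemma] += 1
    return {
        lex: {lemma for lemma, n in counter.items() if n > 1}
        for lex, counter in counts.items()
        if any(n > 1 for n in counter.values())
    }


# Hypothetical miniature lexicon, for illustration only.
SAMPLE = """
LEXICON Adverb
чудно:чудно Adv1 ;
чудно:чудно Adv2 ;  ! second declaration of the same lemma
верхом:верхом Adv1 ;

LEXICON Verb
икать:икать V1 ;
"""

print(duplicate_lemmas(SAMPLE))
```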

Make ambiguous/optional transitivity tag

Taken from reynoldsnlp/udar#24. (Some discussion can be seen there.)

Russian verbs do not inflect for transitivity, so having multiple readings distinguished by transitivity is grammatically inaccurate.

Transitivity tags can be helpful for the CG, so we should specify transitivity when possible, but if the transitivity is ambiguous, there should only be one reading.
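
As an illustration of the desired outcome, the sketch below collapses readings that differ only in transitivity into a single reading without a transitivity tag. The tag names `+TV` and `+IV`, the tag order in the sample readings, and the function name are assumptions made for this example; how the analyser itself should encode the ambiguity is left open.

```python
TRANSITIVITY_TAGS = {"TV", "IV"}  # assumed tag names


def merge_transitivity(readings):
    """Collapse readings that differ only by a transitivity tag."""
    groups = {}  # reading without transitivity tag -> original readings
    for reading in readings:
        parts = reading.split("+")
        key = "+".join(p for p in parts if p not in TRANSITIVITY_TAGS)
        groups.setdefault(key, []).append(reading)

    merged = []
    for key, members in groups.items():
        if len(set(members)) == 1:
            merged.append(members[0])  # unambiguous: keep the tag
        else:
            merged.append(key)         # ambiguous: drop the tag
    return merged


# Hypothetical readings, for illustration only.
print(merge_transitivity([
    "читать+V+Impf+TV+Inf",
    "читать+V+Impf+IV+Inf",
]))
```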

separate prepositions into lemmas by case?

Taken from reynoldsnlp/udar#27.

It would be helpful to language learners/teachers to be able to search for instances of a preposition that governs a certain case.

For example, с can govern INST, GEN and ACC. Each of these could be a different lemma, e.g. с¹, с², с³. The superscript numerals are kind of a pain, and they are opaque. Perhaps this should be с+Pr+Acc, с+Pr+Gen, and с+Pr+Ins. This stretches the meaning of the case tags, where in this case it means that the preposition governs that case, rather than that it is in that case.
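
A small sketch of the kind of search this would enable, assuming the proposed с+Pr+Ins style of reading were adopted. The function name and the sample analyses are hypothetical; the noun readings only imitate the tag layout seen elsewhere in these issues.

```python
def find_governing(analysed, lemma, case):
    """Return token indexes where preposition `lemma` governs `case`."""
    wanted = f"{lemma}+Pr+{case}"
    return [i for i, (_, reading) in enumerate(analysed) if reading == wanted]


# Hypothetical (token, reading) pairs, for illustration only.
SAMPLE = [
    ("с", "с+Pr+Ins"),
    ("книгой", "книга+N+Fem+Inan+Sg+Ins"),
    ("с", "с+Pr+Gen"),
    ("крыши", "крыша+N+Fem+Inan+Sg+Gen"),
]

print(find_governing(SAMPLE, "с", "Ins"))
print(find_governing(SAMPLE, "с", "Gen"))
```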

After infra juggle, accents are not filtered off, tokenizer does not recognize simple words

In the src/fst/morphology/stems/nouns.lexc

табак:таба́к м_b_Р2 "weight: 4.490091974131408" ;

BUT

lang-rus jackrueter$ hfst-lookup src/fst/analyser-gt-norm.hfstol 
> табак
табак	табак+?	inf

This is related to several tickets filed after the move.
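
For reference, the mapping the unaccented analyser is expected to accept can be stated in a few lines of Python. This is only a sketch of the expected behaviour, assuming stress is written with combining acute (U+0301) and grave (U+0300) accents; it says nothing about how the filtering is implemented in the build, and `strip_stress` is a hypothetical name.

```python
ACUTE = "\u0301"  # combining acute accent (assumed primary stress mark)
GRAVE = "\u0300"  # combining grave accent (assumed secondary stress mark)


def strip_stress(form):
    """Remove combining stress marks; precomposed ё and й are untouched."""
    return form.replace(ACUTE, "").replace(GRAVE, "")


stressed = "таба" + ACUTE + "к"  # the lower side of the lexc entry above
print(strip_stress(stressed))
```

Removing only the two combining characters, rather than decomposing and dropping all marks, avoids turning ё into е or й into и.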

Superlative tag?

Taken from reynoldsnlp/udar#32

Should tokens such as новейший, высший, etc. be tagged as superlatives?

Reconsider 1Pl imperatives

Taken from reynoldsnlp/udar#31.

Reconsider whether to mark 1pl as imperatives. If so, then should imperfectives be marked as well? This is both a linguistic and practical question.

Add acronyms to analyzer

Many of the unrecognized tokens in running text are acronyms, such as СССР and США. Acronyms should have gender and number tags to show agreement.

Integrate L2 error analyzer into automake

Makefiles for the L2 error analyzer were added in 9536efe. These should be optimized and integrated into the standard automake workflow, so that they build with the --enable-L2 configure flag.

Stress on multi-word expressions

Taken from reynoldsnlp/udar#19

The lexical underlying form needs to have a persistent stress mark that survives the two-level rule that reduces stresses to the right-most one. For example,...

красно-жёлтых
так как
так что
то есть

Search through an fst2strings version of a stressed transducer for any words with stresses on both sides of spaces and hyphens. Something like this: egrep ":.*[ё́̀].*(% |-).*[ё́̀]"
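
The same search can be sketched in Python, reusing the egrep pattern above unchanged. The `upper:lower` line layout of the sample strings is an assumption about the fst2strings output, and `has_multiple_stresses` is a hypothetical helper name.

```python
import re

# ё, combining acute, combining grave: the same class as in the egrep pattern.
STRESS = "\u0451\u0301\u0300"
MULTI_STRESS = re.compile(
    ":.*[" + STRESS + "].*(% |-).*[" + STRESS + "]"
)


def has_multiple_stresses(line):
    """True if the lower side is stressed on both sides of a space or hyphen."""
    return MULTI_STRESS.search(line) is not None


# Hypothetical fst2strings-style lines, for illustration only.
samples = [
    "так как:та\u0301к% ка\u0301к",
    "так как:так% ка\u0301к",
    "красно-жёлтых:кра\u0301сно-ж\u0451лтых",
]
for line in samples:
    print(has_multiple_stresses(line), line)
```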

hfst-compose-intersect in src/Makefile_L2 leads to HfstFatalException

It may be that the best way to solve this problem is to properly integrate the L2 makefile into the automake build (see #10). Maybe @snomos can help determine how difficult that will be.

In the root directory, running $ make && cd src && make -f Makefile_L2 -B throws an HfstFatalException. The problem seems to stem from the number of error tags in L2_ORTH_ERRS. I have tried various combinations to see if there is some kind of conflict between the rules, but every small subset I have tried works without error. ~~(However, maybe I just haven't tested the right combination yet)~~ I ran 12 different rotations of the 12 tags, and it fails on the 10th tag every time.

The regex files for L2_ORTH_ERRS are shown here (removing comments and empty lines):

$ tail -n +1 src/orthography/L2_*.regex | grep -v ^# | grep -v ^$
==> src/orthography/L2_Akn.regex <==
а (<-) о ;
==> src/orthography/L2_e2je.regex <==
е (<-) э ;
==> src/orthography/L2_H2S.regex <==
ь (<-) ъ ;
==> src/orthography/L2_i2j.regex <==
й (<-) и ;
==> src/orthography/L2_i2y.regex <==
ы (<-) и ;
==> src/orthography/L2_Ikn.regex <==
и (<-) е ,
и (<-) я ;
==> src/orthography/L2_j2i.regex <==
и (<-) й ;
==> src/orthography/L2_je2e.regex <==
э (<-) е ;
==> src/orthography/L2_NoSS.regex <==
0 (<-) ь ;
==> src/orthography/L2_sh2shch.regex <==
щ (<-) ш ;
==> src/orthography/L2_shch2sh.regex <==
ш (<-) щ ;
==> src/orthography/L2_y2i.regex <==
и (<-) ы ;
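
The rotation test mentioned above can be reproduced with a short script. This is a sketch under one assumption: that the tags appear in L2_ORTH_ERRS in the same order as the files listed above, which puts sh2shch in 10th position in the unrotated list. The helper name `rotations` is hypothetical.

```python
from collections import deque

# Tag names taken from the regex file names listed above (order assumed).
L2_ORTH_ERRS = ["Akn", "e2je", "H2S", "i2j", "i2y", "Ikn",
                "j2i", "je2e", "NoSS", "sh2shch", "shch2sh", "y2i"]


def rotations(items):
    """Return every left rotation of `items`, starting with the original."""
    queue = deque(items)
    result = []
    for _ in range(len(items)):
        result.append(list(queue))
        queue.rotate(-1)  # move the first element to the end
    return result


for order in rotations(L2_ORTH_ERRS):
    # The 10th tag is the one at which the build was reported to fail.
    print(order[9], "|", " ".join(order))
```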

The offending code is this loop in Makefile_L2. It appears that hfst-compose-intersect is outputting a bad transducer and hfst-disjunct is choking on it:

	for tag in $(L2_ORTH_ERRS) ; \
	do \
		echo "[ ? -> ... \"\+Err\/L2_$${tag}\" || _ .#. ]" > add-tag-err-L2_$${tag}.regex.tmp ; \
		hfst-regexp2fst  --format=foma --xerox-composition=ON -v  \
			-S add-tag-err-L2_$${tag}.regex.tmp -o add-tag-err-L2_$${tag}.hfst ; \
		printf "read regex @\"orthography/L2_$${tag}.compose.hfst\" \
			.o. @\"analyser-gt-desc.hfst\" \
			;\n \
			save stack err.orth.tmp.hfst\n \
			quit\n" | hfst-xfst -p -v --format=foma ; \
		hfst-subtract -F err.orth.tmp.hfst \
			      analyser-gt-desc-L2.tmp.hfst \
			      > err.uniq.tmp.hfst ; \
		hfst-compose-intersect -v -1 err.uniq.tmp.hfst \
		      -2 add-tag-err-L2_$${tag}.hfst \
		      -o err.tagged.tmp.hfst ; \
		hfst-disjunct -1 analyser-gt-desc-L2.tmp.hfst \
		      -2 err.tagged.tmp.hfst \
		      | hfst-determinize \
		      | hfst-minimize \
		      > err.tmp.hfst ; \
		mv err.tmp.hfst analyser-gt-desc-L2.tmp.hfst ; \
		echo "слово" | hfst-lookup analyser-gt-desc-L2.tmp.hfst ; \
		hfst-summarize --verbose analyser-gt-desc-L2.tmp.hfst ; \
	done

The last relevant bit of output is the following:

Reading from add-tag-err-L2_sh2shch.regex.tmp, writing to add-tag-err-L2_sh2shch.hfst
Compiling expression #1
Using foma as output handler
Reading from standard input...
? bytes. 167693 states, 372271 arcs, ? paths
hfst[1]: hfst[1]: hfst[1]: .
hfst-subtract: warning: Warning: analyser-gt-desc-L2.tmp.hfst contains flag diacritics. The result of subtraction may be incorrect.
hfst-compose-intersect: warning:
Found output multi-char symbols ("+A") in
transducer in file err.uniq.tmp.hfst which are not found on the
input tapes of transducers in file add-tag-err-L2_sh2shch.hfst.
Reading from err.uniq.tmp.hfst and add-tag-err-L2_sh2shch.hfst, writing to err.tagged.tmp.hfst
Reading and minimizing rule xre(?)...
Reading lexicon... subtract(?stdin?, ?stdin?) read
Computing intersecting composition...
Storing result in err.tagged.tmp.hfst...
terminate called after throwing an instance of 'HfstFatalException'
hfst-determinize: Aborted (core dumped)
<stdin> is not a valid transducer file

Fix spellers

make stalls with xfst modify-tags (fine with hfst and apertium)

This issue was created automatically with bugzilla2github

Bugzilla Bug 1909

Date: 2014-11-07T15:02:39+01:00
From: Robert Reynolds <>
To: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
CC: ftyers, ReynoldsRJR, sjur.n.moshagen, trond.trosterud

Last updated: 2015-02-10T10:17:25+01:00

Can't build tokenizer

make[4]: *** No rule to make target 'tokeniser-disamb-gt-desc.accented.pmhfst', needed by 'all-am'. Stop.

Russian transducer does not compile due to syntax error at weight: 20, NDS kyv is down

This issue was created automatically with bugzilla2github

Bugzilla Bug 2368

Date: 2017-03-28T15:38:49+02:00
From: Jack Rueter <<rueter.jack>>
To: Trond Trosterud <<trond.trosterud>>
CC: ftyers, ReynoldsRJR, sjur.n.moshagen, trond.trosterud

Last updated: 2017-03-28T21:21:50+02:00

Bugzilla Bug 1851

Date: 2014-04-09T11:59:32+02:00
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
To: Robert Reynolds <>
CC: ftyers, sjur.n.moshagen, trond.trosterud

Last updated: 2014-05-07T14:55:27+02:00