giellalt / lang-rus

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Russian language

Home Page: https://giellalt.uit.no

License: GNU General Public License v3.0

Makefile 0.34% Shell 0.46% M4 0.40% Python 0.80% Regular Expression 0.18% XML 0.03% YAML 0.52% Text 97.28%
finite-state-transducers constraint-grammar nlp language-resources minority-language proofing-tools giellalt-langs maturity-beta geo-russia langfam-indoeuropean

lang-rus's People

Contributors

albbas, anghyflawn, bbqsrc, flammie, ftyers, leneantonsen, reynoldsnlp, rtxanson, rueter, snomos, trondtr, trondtynnol, ulp16, unhammer

Forkers

trondtynnol

lang-rus's Issues

эты

$ echo эты | hfst-lookup -q analyser-gt-desc.hfstol
эты	эта+N+Fem+Inan+Pl+Acc	0.000000
эты	эта+N+Fem+Inan+Pl+Nom	0.000000
эты	эта+N+Fem+Inan+Sg+Gen	0.000000

Lemmas declared more than once

Taken from reynoldsnlp/udar#40

The following code using the lexc_parser module ...

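from os import environ
from sys import stderr

import lexc_parser as lp

GTPATH = environ['GTPATH']  # assumption: GTPATH is undefined in the original snippet; read it from the environment
filename = GTPATH + '/langs/rus/src/morphology/lexicon.tmp.lexc'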

print('Parsing lexc file...', file=stderr)
with open(filename) as f:
    src = f.read()
lexc = lp.Lexc(src)

primary_lexicons = [entry.cc.id for entry in lexc['Root']
                    if entry.cc is not None and entry.cc.id != 'Numeral']
for lex in primary_lexicons:
    lexc[lex].cc_lemmas_dict

...yields the following lists of lemmas that are declared more than once inside the same part of speech's LEXICON:

Parsing lexc file...
ryan.py:17: UserWarning: Lemmas declared more than once within Adverb:
{'коротко', 'наголо', 'верхом', 'чудно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Noun:
{'бронирование', 'пояс', 'колонок', 'кочан', 'ничтожество', 'судзуки', 'лекарство', 'орган', 'рондо', 'видение', 'уголь', 'туника', 'сапожок', 'пресс-релиз', 'артикул', 'соболь', 'огнеупоры', 'кондуктор', 'индустрия', 'чижик', 'вязанка', 'воздвижение', 'недвижимость', 'пулярка', 'призрак', 'козырь', 'флагман', 'цоколь', 'бакан', 'нон-стоп', 'гитлерюгенд', 'сопло', 'ширма', 'предвозвестник', 'провидение', 'болванчик', 'генсовет', 'парилка', 'пугало', 'гигант', 'тягло', 'полиграфия', 'комплекс', 'микрометр', 'мебельщик', 'характерность', 'феномен', 'пристенок', 'хаханьки', 'натура', 'наркоминдел', 'чувиха', 'пергамент', 'водолей', 'сельдь', 'ламповая', 'напряг', 'ферула', 'хиханьки', 'глюк', 'настриг', 'туркменбаши', 'пролог', 'метчик', 'обрезание', 'туфелька', 'розан', 'речушка', 'чабер', 'порсканье', 'судья', 'светоч', 'урка', 'хаос', 'проводка', 'лиганд', 'колосс', 'дочушка', 'маки', 'транспорт', 'замглавы', 'полип', 'ирис', 'угольник', 'проволочка', 'лосось', 'единица', 'червец', 'тотем', 'холодность', 'плёночка', 'картель', 'нуклеокапсид', 'жертва', 'истукан', 'предвестник', 'кашица', 'кредит', 'взрослый', 'опрощение', 'сведение', 'ужин', 'отзыв', 'русло', 'солнечник', 'ход', 'ястребок', 'префикс', 'цитокин', 'ирей', 'синтип', 'бучение', 'книговедение', 'трапезная', 'безобразность', 'край', 'чучело', 'созданьице', 'зайчик', 'рол', 'подволока', 'разлив', 'солнышко', 'креветка', 'консерваторка', 'дядя', 'прототип', 'сметливость', 'гуарани', 'субъект', 'заворот', 'видик', 'катанье', 'ведение', 'создание', 'калига', 'устрица', 'хобот', 'прослушка', 'бодяга', 'зев', 'комроты', 'отчёт', 'фрик', 'конус', 'адрес', 'котик', 'камора', 'дышло', 'плазмодий', 'марионетка', 'отправитель', 'усадьба', 'селище', 'живчик', 'лоцман', 'дублет', 'светило', 'боливар', 'мшанка', 'целение', 'юнкер', 'спутник', 'скакунок', 'дуплет', 'ордер'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Predicative:
{'чудно', 'полно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Pronoun:
{'возле', 'поперёд', 'обок', 'вне', 'внутрь', 'близь', 'помимо', 'посредине', 'напротив', 'поперёк', 'вблизи', 'посреди', 'вперёд', 'наместо', 'спереди', 'наперекор', 'подобно', 'согласно', 'насчёт', 'навроде', 'свыше', 'ниже', 'посередине', 'ради', 'позади', 'вдоль', 'под', 'чрез', 'вроде', 'вследствие', 'посредством', 'выключая', 'у', 'путём', 'касательно', 'превыше', 'накануне', 'относительно', 'вопреки', 'про', 'промежду', 'касаемо', 'около', 'над', 'из-за', 'по', 'сквозь', 'за', 'ввиду', 'соразмерно', 'противу', 'поверх', 'вовнутрь', 'наперерез', 'без', 'позадь', 'вкось', 'вослед', 'пред', 'мимо', 'сообразно', 'из-под', 'опричь', 'внизу', 'между', 'по-над', 'кроме', 'сверху', 'о', 'посередь', 'сверх', 'вкруг', 'внутри', 'промеж', 'через', 'к', 'против', 'от', 'наподобие', 'перед', 'посереди', 'сзади', 'кругом', 'на', 'включая', 'прежде', 'до', 'исключая', 'выше', 'снизу', 'соответственно', 'взамен', 'насупротив', 'для', 'из', 'округ', 'среди', 'меж', 'плюс', 'окрест', 'средь', 'с', 'благодаря', 'спустя', 'вслед', 'при', 'противно²', 'вместо', 'минус', 'вокруг', 'после', 'впереди', 'подле', 'близ', 'по-за', 'изнутри', 'супротив', 'в', 'середь'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Verb:
{'осветить', 'прояснеть', 'отползать', 'запыхаться¹', 'усугубиться', 'тикать', 'усугубить', 'икать'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Propernoun:
{'Мелани', 'Сандро', 'Филатов', 'Зощенко', 'Марго', 'Геркулесович', 'Люси', 'Симонович', 'Фениксович', 'Симон', 'Витольдович', 'Манагуа', 'Якобсон', 'Евтушенко', 'Гордон', 'Исидор', 'Терещенко', 'Геркулесовна', 'Бурденко', 'Исидорович', 'Григоренко', 'Симоновна', 'Фигаро', 'Макаренко', 'Стефанович', 'Филиппов', 'Короленко', 'Геркулес', 'Лонгин', 'Франко', 'Довженко', 'Пегасовна', 'Пегасович', 'Никарагуа', 'Лонгиновна', 'Мартиновна', 'Громыко', 'Элизабет', 'Федотов', 'Павлиновна', 'Лысенко', 'Шевченко', 'Гильфердинг', 'Павлин', 'Шульженко', 'Исаченко', 'Иванов', 'Робинсон', 'Пегас', 'Стефан', 'Мартин', 'Михалков', 'Павлинович', 'Персей', 'Стефановна', 'Семашко', 'Икария', 'Катанга', 'Мемфис', 'Лонгинович', 'Исидоровна', 'Фениксовна', 'Викторович', 'Феникс', 'Стефани', 'Персеевич', 'Новиков', 'Витольдовна', 'Мартинович', 'Любань', 'Витольд', 'Виктор', 'Нестеренко', 'Панченко', 'Гурченко', 'Обухов', 'Персеевна', 'Покров', 'Итака', 'Морган', 'Викторовна'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Punctuation:
{''}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Symbols:
{'%'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within LexicalizedParticiple:
{'положить', 'сложить'}
  lexc[lex].cc_lemmas_dict
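
As a cross-check that does not depend on lexc_parser, duplicates can also be counted with the standard library alone. The following is only a rough sketch under simplifying assumptions: each entry line is taken to have the form "lemma:stem CONTINUATION ;" or "lemma CONTINUATION ;", everything after "!" is treated as a comment, and lexc escapes such as "%!" or "% " are not handled. The function name duplicate_lemmas, the continuation class names, and the sample entries are hypothetical and are not taken from lexicon.tmp.lexc.

```python
from collections import Counter, defaultdict


def duplicate_lemmas(lexc_source):
    """Map each LEXICON name to the set of lemmas entered more than once."""
    counts = defaultdict(Counter)
    current = None
    for raw in lexc_source.splitlines():
        line = raw.split('!', 1)[0].strip()  # drop comments (escapes ignored)
        if not line:
            continue
        if line.startswith('LEXICON '):
            current = line.split()[1]
            continue
        if current is None or not line.endswith(';'):
            continue
        entry = line[:-1].split()
        if len(entry) < 2:
            continue  # a bare continuation such as "Noun ;" carries no lemma
        lemma = entry[0].split(':', 1)[0]  # upper side of "lemma:stem"
        counts[current][lemma] += 1
    return {lex: {lemma for lemma, n in c.items() if n > 1}
            for lex, c in counts.items()
            if any(n > 1 for n in c.values())}


# Hypothetical sample, not real entries from the Russian lexicon.
sample = """
LEXICON Adverb
верхом ADV1 ;
верхом ADV2 ;
вчера ADV1 ;
LEXICON Verb
икать:ик V1 ;
"""
print(duplicate_lemmas(sample))
```

Unlike the lexc_parser run above, this sketch does not follow continuation classes from Root, so it only groups entries by the LEXICON in which they literally appear.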

Make ambiguous/optional transitivity tag

Taken from reynoldsnlp/udar#24. (Some discussion can be seen there.)

Russian verbs do not inflect for transitivity, so having multiple readings distinguished by transitivity is grammatically inaccurate.

Transitivity tags can be helpful for the CG, so we should specify transitivity when possible, but if the transitivity is ambiguous, there should only be one reading.
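
One way to picture the proposed behaviour is as a post-processing step over analyser output. The sketch below is an illustration, not the analyser's implementation: it assumes that transitivity is marked with the tags +TV and +IV (an assumption, since those tags are not shown in this issue), and the function name collapse_transitivity and the sample readings are hypothetical. Readings that differ only in transitivity are merged into one reading without the tag; readings with unambiguous transitivity are left as they are.

```python
import re

# Assumed transitivity tags; adjust if the analyser uses different names.
TRANS = re.compile(r'\+(TV|IV)(?=\+|$)')


def collapse_transitivity(readings):
    """Merge readings that differ only in their transitivity tag."""
    groups = {}
    for reading in readings:
        key = TRANS.sub('', reading)  # reading with the tag removed
        match = TRANS.search(reading)
        tag = match.group(1) if match else None
        groups.setdefault(key, {})[tag] = reading
    result = []
    for key, by_tag in groups.items():
        if 'TV' in by_tag and 'IV' in by_tag:
            result.append(key)  # ambiguous: keep one untagged reading
        else:
            result.extend(by_tag.values())  # unambiguous: keep as is
    return result


# Hypothetical readings for illustration only.
readings = [
    'читать+V+Impf+TV+Inf',
    'читать+V+Impf+IV+Inf',
    'спать+V+Impf+IV+Inf',
]
print(collapse_transitivity(readings))
```

In the FST itself the same effect would presumably be achieved in the lexicon or with a tag-modifying filter; this sketch only shows the intended input/output relation.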

separate prepositions into lemmas by case?

Taken from reynoldsnlp/udar#27.

It would be helpful to language learners and teachers to be able to search for instances of a preposition that governs a certain case.

For example, с can govern the instrumental, genitive, and accusative. Each of these could be a different lemma, e.g. с¹, с², с³, but the superscript numerals are awkward to type and opaque. Perhaps the readings should instead be с+Pr+Acc, с+Pr+Gen, and с+Pr+Ins. This stretches the meaning of the case tags: here a tag would mean that the preposition governs that case, rather than that the preposition is itself in that case.
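
The tag-based proposal can be sketched as follows. This is a minimal illustration under stated assumptions: the table GOVERNED_CASES contains only the one preposition discussed here, and the names GOVERNED_CASES, preposition_readings, and governs are hypothetical helpers, not part of the analyser.

```python
# Governed cases per preposition; only the example from this issue is listed.
GOVERNED_CASES = {'с': ['Acc', 'Gen', 'Ins']}


def preposition_readings(lemma, governed=GOVERNED_CASES):
    """Return one reading per case that the preposition can govern."""
    cases = governed.get(lemma)
    if not cases:
        return [f'{lemma}+Pr']  # no case information available
    return [f'{lemma}+Pr+{case}' for case in cases]


def governs(reading, case):
    """True if a preposition reading carries the given governed-case tag."""
    tags = reading.split('+')[1:]
    return 'Pr' in tags and case in tags


print(preposition_readings('с'))
print([r for r in preposition_readings('с') if governs(r, 'Gen')])
```

A learner-facing search for "с governing the genitive" then reduces to filtering readings with governs(reading, 'Gen'), which is the use case described above.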

Add acronyms to analyzer

Many of the unrecognized tokens in running text are acronyms, such as СССР and США. Acronyms should have gender and number tags to show agreement.

Stress on multi-word expressions

Taken from reynoldsnlp/udar#19

The lexical underlying form needs to have a persistent stress mark that survives the two-level rule that reduces stresses to the right-most one. For example,...

красно-жёлтых
так как
так что
то есть

Search through an fst2strings version of a stressed transducer for any words with stresses on both sides of spaces and hyphens. Something like this: egrep ":.*[ё́̀].*(% |-).*[ё́̀]"
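
The same search can be sketched in Python, which makes the combining characters explicit. This is an approximation of the egrep pattern above, not a tested replacement for it: it assumes that each fst2strings line has the form "upper:lower", that stress is written with ё (U+0451), the combining acute (U+0301), or the combining grave (U+0300), and that decomposed ё does not occur. The function names lower_side and stressed_on_both_sides are hypothetical.

```python
import re

# ё, combining acute accent, combining grave accent
STRESS = '[\u0451\u0301\u0300]'
# A stress mark, later a space or hyphen, later another stress mark.
# "[ -]" also matches the space inside an escaped "% ".
BOTH_SIDES = re.compile(STRESS + '.*[ -].*' + STRESS)


def lower_side(line):
    """Return the text after the first colon, or the whole line if none."""
    upper, sep, lower = line.partition(':')
    return lower if sep else upper


def stressed_on_both_sides(surface):
    """True if stress marks occur on both sides of a space or hyphen."""
    return BOTH_SIDES.search(surface) is not None


examples = [
    'кра\u0301сно-ж\u0451лтых',   # stress in both parts of the compound
    'то\u0301 е\u0301сть',        # stress on both words
    'та\u0301к как',              # stress on the first word only
]
for form in examples:
    print(form, stressed_on_both_sides(form))
```

Forms for which the function returns False after the stress-reducing two-level rule has applied, but True before it, are the candidates that need a persistent stress mark.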

hfst-compose-intersect in src/Makefile_L2 leads to HfstFatalException

It may be that the best way to solve this problem is to properly integrate the L2 makefile into the automake build (see #10). Maybe @snomos can help determine how difficult that will be.

In the root directory, running $ make && cd src && make -f Makefile_L2 -B throws an HfstFatalException. The problem seems to stem from the number of error tags in L2_ORTH_ERRS. I have tried various combinations to see if there is some kind of conflict between the rules, but every small subset I have tried works without error. (However, maybe I just haven't tested the right combination yet.) I ran 12 different rotations of the 12 tags, and it fails on the 10th tag every time.

The regex files for L2_ORTH_ERRS are shown here (removing comments and empty lines):

$ tail -n +1 src/orthography/L2_*.regex | grep -v ^# | grep -v ^$
==> src/orthography/L2_Akn.regex <==
а (<-) о ;
==> src/orthography/L2_e2je.regex <==
е (<-) э ;
==> src/orthography/L2_H2S.regex <==
ь (<-) ъ ;
==> src/orthography/L2_i2j.regex <==
й (<-) и ;
==> src/orthography/L2_i2y.regex <==
ы (<-) и ;
==> src/orthography/L2_Ikn.regex <==
и (<-) е ,
и (<-) я ;
==> src/orthography/L2_j2i.regex <==
и (<-) й ;
==> src/orthography/L2_je2e.regex <==
э (<-) е ;
==> src/orthography/L2_NoSS.regex <==
0 (<-) ь ;
==> src/orthography/L2_sh2shch.regex <==
щ (<-) ш ;
==> src/orthography/L2_shch2sh.regex <==
ш (<-) щ ;
==> src/orthography/L2_y2i.regex <==
и (<-) ы ;
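
The rotation test mentioned above could be scripted along these lines. This is a sketch, not the script that was actually used: the tag names are read off the file names in the listing, their order in the real L2_ORTH_ERRS variable is an assumption, and passing L2_ORTH_ERRS on the make command line assumes that Makefile_L2 does not protect the variable with an override directive. The code only prints the commands; it does not run them.

```python
from collections import deque

# Tag names taken from the L2_*.regex file names listed above;
# the order used in Makefile_L2 itself is an assumption.
L2_ORTH_ERRS = ['Akn', 'e2je', 'H2S', 'i2j', 'i2y', 'Ikn',
                'j2i', 'je2e', 'NoSS', 'sh2shch', 'shch2sh', 'y2i']


def rotations(tags):
    """Yield every left rotation of the tag list, starting with the original."""
    ring = deque(tags)
    for _ in range(len(tags)):
        yield list(ring)
        ring.rotate(-1)  # move the first tag to the end


for order in rotations(L2_ORTH_ERRS):
    # The report says the build fails on the 10th tag of every rotation.
    print('10th tag:', order[9])
    print('make -f Makefile_L2 -B L2_ORTH_ERRS="{}"'.format(' '.join(order)))
```

If the failure really depends on position rather than on a particular rule, each printed command should fail while processing the tag shown on the preceding line.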

The offending code is this loop in Makefile_L2. It appears that hfst-compose-intersect is outputting a bad transducer and hfst-disjunct is choking on it:

	for tag in $(L2_ORTH_ERRS) ; \
	do \
		echo "[ ? -> ... \"\+Err\/L2_$${tag}\" || _ .#. ]" > add-tag-err-L2_$${tag}.regex.tmp ; \
		hfst-regexp2fst  --format=foma --xerox-composition=ON -v  \
			-S add-tag-err-L2_$${tag}.regex.tmp -o add-tag-err-L2_$${tag}.hfst ; \
		printf "read regex @\"orthography/L2_$${tag}.compose.hfst\" \
			.o. @\"analyser-gt-desc.hfst\" \
			;\n \
			save stack err.orth.tmp.hfst\n \
			quit\n" | hfst-xfst -p -v --format=foma ; \
		hfst-subtract -F err.orth.tmp.hfst \
			      analyser-gt-desc-L2.tmp.hfst \
			      > err.uniq.tmp.hfst ; \
		hfst-compose-intersect -v -1 err.uniq.tmp.hfst \
		      -2 add-tag-err-L2_$${tag}.hfst \
		      -o err.tagged.tmp.hfst ; \
		hfst-disjunct -1 analyser-gt-desc-L2.tmp.hfst \
		      -2 err.tagged.tmp.hfst \
		      | hfst-determinize \
		      | hfst-minimize \
		      > err.tmp.hfst ; \
		mv err.tmp.hfst analyser-gt-desc-L2.tmp.hfst ; \
		echo "слово" | hfst-lookup analyser-gt-desc-L2.tmp.hfst ; \
		hfst-summarize --verbose analyser-gt-desc-L2.tmp.hfst ; \
	done

The last relevant bit of output is the following:

Reading from add-tag-err-L2_sh2shch.regex.tmp, writing to add-tag-err-L2_sh2shch.hfst
Compiling expression #1
Using foma as output handler
Reading from standard input...
? bytes. 167693 states, 372271 arcs, ? paths
hfst[1]: hfst[1]: hfst[1]: .
hfst-subtract: warning: Warning: analyser-gt-desc-L2.tmp.hfst contains flag diacritics. The result of subtraction may be incorrect.
hfst-compose-intersect: warning:
Found output multi-char symbols ("+A") in
transducer in file err.uniq.tmp.hfst which are not found on the
input tapes of transducers in file add-tag-err-L2_sh2shch.hfst.
Reading from err.uniq.tmp.hfst and add-tag-err-L2_sh2shch.hfst, writing to err.tagged.tmp.hfst
Reading and minimizing rule xre(?)...
Reading lexicon... subtract(?stdin?, ?stdin?) read
Computing intersecting composition...
Storing result in err.tagged.tmp.hfst...
terminate called after throwing an instance of 'HfstFatalException'
hfst-determinize: Aborted (core dumped)
<stdin> is not a valid transducer file

make stalls with xfst modify-tags (fine with hfst and apertium)

This issue was created automatically with bugzilla2github

Bugzilla Bug 1909

Date: 2014-11-07T15:02:39+01:00
From: Robert Reynolds
To: Sjur Nørstebø Moshagen <sjur.n.moshagen>
CC: ftyers, ReynoldsRJR, sjur.n.moshagen, trond.trosterud

Last updated: 2015-02-10T10:17:25+01:00

Can't build tokenizer

make[4]: *** No rule to make target 'tokeniser-disamb-gt-desc.accented.pmhfst', needed by 'all-am'. Stop.

negative participles

Participles can generally be negated with не~ as in непрочитанный. The FST does not systematically include such forms.
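
A coverage check for this gap might look like the sketch below. It is illustrative only: is_recognized is a hypothetical callable standing in for a lookup against the analyser, the function name missing_negated is invented here, and the small set of known forms is made up for the example rather than taken from the FST.

```python
def missing_negated(participles, is_recognized):
    """Return the не-prefixed forms that the analyser does not recognize."""
    negated = ['не' + form for form in participles]
    return [form for form in negated if not is_recognized(form)]


# Made-up stand-in for the analyser: a set of forms it "knows".
known_forms = {'прочитанный', 'непрочитанный', 'написанный'}

print(missing_negated(['прочитанный', 'написанный'],
                      known_forms.__contains__))
```

Running such a check over all participle forms the generator produces would give a list of negated participles that still need to be added.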

Generating год causes segmentation fault in Xfst

This issue was created automatically with bugzilla2github

Bugzilla Bug 1851

Date: 2014-04-09T11:59:32+02:00
From: Sjur Nørstebø Moshagen <sjur.n.moshagen>
To: Robert Reynolds
CC: ftyers, sjur.n.moshagen, trond.trosterud

Last updated: 2014-05-07T14:55:27+02:00
