unitexgramlab / unitex-lingua

Unitex/GramLab Language Resources

Home Page: https://unitexgramlab.org

License: Other

HTML 93.78% Makefile 0.45% Batchfile 0.03% Shell 4.66% Roff 0.78% Perl 0.29%

unitexgramlab language-resources relex dictionaries

unitex-lingua's People

Contributors

Stargazers

Watchers

Forkers

alexis-neme fatmabm foufini zorba2018 evertontomalok denismaurel eric-laporte

unitex-lingua's Issues

Error in dictionary-graph for French ordinals

The currently distributed dictionary-graphs for French ordinals misplace the plural s of -ièmes: the s is inserted into the grammatical codes of the lexical tag instead of the inflected form. Thanks to Denis Biguenet, who discovered this bug.

How to collaborate on pt-BR corpus and dictionary changes?

Hi, I am interested in collaborating on the pt-BR language infrastructure... Is there a mailing list or forum for discussions with the core unitex-lingua/pt-BR team?

Place each corpus inside its own folder

Is your feature request related to a problem? Please describe.

When you open a .txt file with Unitex, a .snt file and a folder are created. If you also use a cascade, 5 more files and another folder are created... The Corpus folder quickly becomes cluttered.

Describe the solution you'd like

Place each language corpus in a directory of the same name (without the extension).

Submitted by @denisMaurel

Remove <MIX> lexical masks from resources

Unitex/GramLab has never accepted a <MIX> lexical mask. It is ignored when it occurs in a local-grammar graph. This lexical mask probably existed in Intex but was not retained by Sébastien during the implementation of Unitex. I am not sure what it used to mean. In the <MIX> topic on the users' forum (13 October 2015), no users argued in favour of a <MIX> lexical mask. For consistency we should replace the graphs containing <MIX> in the distributed resources. Denis Maurel provided a version of the French sentence-splitting graph without <MIX> on 24 May 2018.

Transformation algorithm proposal, FST2 to JSON

"An .fst2 file is a text file that describes a set of graphs", Unitex-GramLab-3.1-usermanual-en, chapter 14.3.2, "Format .fst2".

Proposal

Transform (export) the .fst2 file into a .fst2.json file, with some added semantics and a formal description in JSON Schema. Example:

{
  "graphs":{
     "NP":["1 1", "2 2", "-2 2", "3 3", "t", "f"],
     "Adj":["6 1","5 1","4 1","t","f"]
  }, 
  "etc":["%<E>","%the/DET","%<A>/ADJ","%<N>","%nice","@pretty","%small","f"]
}

obtained from the original example in chapter 14.3.2:

0000000002
-1 NP
: 1 1 
: 2 2 -2 2 
: 3 3 
t 
f 
-2 Adj
: 6 1 5 1 4 1 
t 
f 
%<E>
%the/DET
%<A>/ADJ
%<N>
%nice
@pretty
%small
f

The formal specification of the translation algorithm (which can later be ported to C++ for the best implementation) can be written in JavaScript, Perl or another simple language... In any case, some hints or inspiration can be found in ElagFstFilesIO.cpp, etc.
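As a starting point for discussion, here is a minimal Python sketch of such a conversion. It is based only on the example quoted above from chapter 14.3.2, not on the full .fst2 specification, and it is not Unitex code. The function name `fst2_to_dict` and the output keys (`graphs`, `final`, `transitions`, `tags`) are hypothetical. Unlike the flat lists in the example JSON, this sketch keeps one object per state, so that the two transitions of a line such as `: 2 2 -2 2` stay attached to the same state.

```python
import json

# The example from chapter 14.3.2, as quoted in this proposal.
SAMPLE = """0000000002
-1 NP
: 1 1
: 2 2 -2 2
: 3 3
t
f
-2 Adj
: 6 1 5 1 4 1
t
f
%<E>
%the/DET
%<A>/ADJ
%<N>
%nice
@pretty
%small
f
"""


def fst2_to_dict(text):
    """Parse .fst2 text into a dict (hypothetical layout, see lead-in)."""
    # Assumption: surrounding whitespace is not significant on any line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_graphs = int(lines[0])  # first line: zero-padded number of graphs
    pos = 1
    graphs = {}
    for _ in range(n_graphs):
        # Graph header: "-<number> <name>"
        _number, name = lines[pos].split(" ", 1)
        pos += 1
        states = []
        while lines[pos] != "f":  # a lone "f" closes the graph
            fields = lines[pos].split()
            kind = fields[0]  # ":" for a normal state, "t" for a final one
            nums = [int(x) for x in fields[1:]]
            # The numbers come in (tag, target state) pairs.
            pairs = [nums[i:i + 2] for i in range(0, len(nums), 2)]
            states.append({"final": kind == "t", "transitions": pairs})
            pos += 1
        pos += 1  # skip the closing "f"
        graphs[name] = states
    tags = []
    while pos < len(lines) and lines[pos] != "f":  # tag list ends with "f"
        tags.append(lines[pos])
        pos += 1
    return {"graphs": graphs, "tags": tags}


if __name__ == "__main__":
    print(json.dumps(fst2_to_dict(SAMPLE), indent=2, ensure_ascii=False))
```

A JSON Schema could then be written against whichever layout is finally agreed on; the per-state layout used here is only one option.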

Terminal command to convert all graphs?

Hi, I am testing Unitex-GramLab-3.1-linux-x86_64.run; it works fine on Ubuntu 16 LTS... Running with pt-BR... In the FSGraph menu, Tools/Compile FST2 performs no conversion and only reports an error:

"Main graph matches epsilon! ERROR: the main graph A001 recognizes "

I need to:

Avoid errors
Compile and save all Graphs (to produce all FST2 files from pt-BR/Inflection/*.grf).
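One possible way to batch the compilation is a small wrapper script. The sketch below assumes that the Grf2Fst2 program described in the Unitex manual can be called as `UnitexToolLogger Grf2Fst2 <graph.grf>` and that the tool is on the PATH; both should be checked against the local installation. The helper names `build_commands` and `compile_all` are hypothetical. The script only runs the compiler on every .grf file and collects the failures; it does not address the cause of the "matches epsilon" error.

```python
from pathlib import Path
import subprocess


def build_commands(graph_dir, tool=("UnitexToolLogger", "Grf2Fst2"), extra_args=()):
    """Return one command line per .grf file found in graph_dir.

    `tool` and `extra_args` are assumptions about the local setup;
    adjust them to match your Unitex installation.
    """
    return [
        [*tool, *extra_args, str(grf)]
        for grf in sorted(Path(graph_dir).glob("*.grf"))
    ]


def compile_all(graph_dir, **kwargs):
    """Run every command and return a list of (graph path, stderr) failures."""
    failures = []
    for cmd in build_commands(graph_dir, **kwargs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((cmd[-1], result.stderr.strip()))
    return failures


if __name__ == "__main__":
    # Example directory taken from the request above.
    for graph, message in compile_all("pt-BR/Inflection"):
        print(f"FAILED {graph}: {message}")
```

Keeping the command construction separate from the execution makes it possible to inspect the command lines before anything is actually run.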

Entries with obsolete spelling in updated dictionary of Portuguese (Brazil)

The dictionary of Portuguese (Brazil) simple words, version of 2015, updated for the spelling reform of 2009, still contains some entries with the obsolete spelling alongside the corresponding new entry, for instance vôo in addition to voo. Thanks to Erick Fonseca and Sandra Aluísio who discovered that.
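A rough way to look for such leftover pairs is sketched below in Python. It assumes one DELAF entry per line with the inflected form before the first comma, and it ignores escaped commas. The `REWRITES` list is a hypothetical and deliberately incomplete set of old-to-new spelling substitutions, so the result is only a list of candidates to review by hand, not a definitive diagnosis.

```python
import unicodedata

# Hypothetical, incomplete substitutions (obsolete -> current spelling).
REWRITES = [("ôo", "oo"), ("êe", "ee"), ("ü", "u")]


def inflected_form(line):
    """Return the inflected form of a DELAF line (text before the first comma)."""
    form = line.split(",", 1)[0].strip()
    # Normalize so that precomposed and decomposed accents compare equal.
    return unicodedata.normalize("NFC", form)


def obsolete_pairs(delaf_lines):
    """Find (old, new) pairs where both spellings occur as inflected forms."""
    forms = {inflected_form(ln) for ln in delaf_lines if ln.strip()}
    pairs = set()
    for form in forms:
        for old, new in REWRITES:
            if old in form:
                candidate = form.replace(old, new)
                if candidate in forms:
                    pairs.add((form, candidate))
    return sorted(pairs)


if __name__ == "__main__":
    # Illustrative entries only; the grammatical codes are made up.
    sample = ["vôo,.N:ms", "voo,.N:ms", "casa,.N:fs", "lêem,ler.V:P3p"]
    print(obsolete_pairs(sample))
```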

Missing graphs for number recognition - all languages

Only the French version distributes resources for the recognition of numbers written in full (Dnum.grf).

I'm preparing new content for a few other languages:

French (e4f58f5)
?

Victorien V.

Reorganize naming of inflection graphs for French compound nouns

Some inflection graphs for French compound nouns are distributed as examples for users. The naming of these graphs is inconsistent, for example 'XXX' means 'three words' in the NC_NNPrepXXX graph, but the same sequence 'XXX' means 'two words and the delimiter between them' in the NC_XXXfp graph.
The naming should be reorganized.

Is there anyone working on Chinese (zh-CN)?

Hello everyone,

I'm interested in adding support for the Chinese (zh-CN) language. Is anyone working on it, or can anyone point me to the mailing list for it? Alternatively, we could set up a group to work on it.

However, GramLab's official site says:

Unitex can process Chinese, Korean, multilingual text with characters from several alphabets, e.g. Latin and Greek alphabets, without having to encode an alphabet into another.

But I cannot find the corresponding language package in the unitex-lingua repo. Can anyone tell me what happened?

Best regards,
Haozhe

Missing abbreviations in Portuguese (Brazil) DELAF

The dictionary of Portuguese (Brazil) simple words, version of 2015, updated for the spelling reform of 2009, is missing the abbreviation and acronym entries that were present in the version of 2004.
This makes it more difficult to carry out comparisons between the two versions of the dictionary.

Transforming DELA into SQL or JSON

Hi, I am "new" here: I used Unitex ~10 years ago, and we translated everything to SQL, which was a good approach to managing big data... Is there a "UnitexGramLab recipe" for transforming DELA datasets into SQL (PostgreSQL) or JSON?
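For illustration, a conversion of single DELAF lines to JSON might look like the Python sketch below. It assumes the usual `form,lemma.CODE+trait:inflection/comment` line layout with backslash escapes, and that an empty lemma means "same as the inflected form". The function name `parse_dela_line` and the output keys are hypothetical, and this is not an official UnitexGramLab recipe. The resulting dictionaries could equally be inserted into a PostgreSQL table.

```python
import json
import re

# form , lemma . codes (:inflection)* (/comment)?   with backslash escapes
ENTRY = re.compile(
    r"^(?P<form>(?:\\.|[^,\\])*),"
    r"(?P<lemma>(?:\\.|[^.\\])*)\."
    r"(?P<codes>[^:/]*)"
    r"(?P<infl>(?::[^:/]+)*)"
    r"(?:/(?P<comment>.*))?$"
)


def unescape(text):
    """Drop the backslash from escaped characters such as '\\,' or '\\.'."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_dela_line(line):
    """Turn one DELAF line into a dict, or return None if it does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    form = unescape(match.group("form"))
    lemma = unescape(match.group("lemma")) or form  # empty lemma: same as form
    codes = match.group("codes").split("+")
    return {
        "form": form,
        "lemma": lemma,
        "pos": codes[0],
        "traits": codes[1:],
        "inflections": [c for c in match.group("infl").split(":") if c],
        "comment": match.group("comment"),
    }


if __name__ == "__main__":
    entry = parse_dela_line("mercantiles,mercantile.A+z1:mp:fp")
    print(json.dumps(entry, ensure_ascii=False))
```

Lines that do not fit the assumed layout come back as `None`, so a caller can log them instead of silently losing entries.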