unitexgramlab / unitex-lingua
Unitex/GramLab Language Resources
Home Page: https://unitexgramlab.org
License: Other
The currently distributed dictionary graphs for French ordinals misplace the plural s of -ièmes. The s is inserted in the grammatical codes of the lexical tag instead of in the inflected form. Thanks to Denis Biguenet, who discovered this bug.
Hi, I am interested in collaborating on the pt-BR language infrastructure... Is there an e-mail list or forum for discussion with the core unitex-lingua/pt-BR team?
Is your feature request related to a problem? Please describe.
When a .txt file is opened with Unitex, a .snt file and a folder are created. If a cascade is used as well, 5 more files and another folder are created... The Corpus folder quickly becomes cluttered.
Describe the solution you'd like
Each language corpus is placed in a directory of the same name (without the extension).
Submitted by @denisMaurel
Unitex/GramLab has never accepted a <MIX> lexical mask. It is ignored when it occurs in a local-grammar graph. This lexical mask probably existed in Intex but was not retained by Sébastien during the implementation of Unitex. I am not sure what it used to mean. In the <MIX> topic on the users' forum (13 October 2015), no users argued in favour of a <MIX> lexical mask. For consistency, we should replace the graphs containing <MIX> in the distributed resources. Denis Maurel provided a version of the French sentence-splitting graph without <MIX> on 24 May 2018.
"An .fst2 file is a text file that describes a set of graphs", Unitex-GramLab-3.1-usermanual-en, chapter 14.3.2, "Format .fst2".
The proposal is to transform (export) the .fst2 file into .fst2.json, with some enhanced semantics and a formal description in JSON Schema. Example:
{
"graphs":{
"NP":["1 1", "2 2", "-2 2", "3 3", "t", "f"],
"Adj":["6 1","5 1","4 1","t","f"]
},
"etc":["%<E>","%the/DET","%<A>/ADJ","%<N>","%nice","@pretty","%small","f"]
}
obtained from (original example of the chapter 14.3.2),
0000000002
-1 NP
: 1 1
: 2 2 -2 2
: 3 3
t
f
-2 Adj
: 6 1 5 1 4 1
t
f
%<E>
%the/DET
%<A>/ADJ
%<N>
%nice
@pretty
%small
f
The formal specification of the translation algorithm (which can later be ported to C++ for the best implementation) can use JavaScript, Perl or another simple language... In any case, some hints can be found in, or the work can be inspired by, ElagFstFilesIO.cpp, etc.
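As a rough illustration of the export described above, here is a minimal Python sketch. It is not taken from the Unitex sources. It reads the .fst2 text of the example and produces the same flat layout as the JSON shown. The function name fst2_to_dict is hypothetical. How final states that carry transitions should be written is an assumption, since the example only contains a bare "t" line.

```python
import json

# The .fst2 example from chapter 14.3.2, as quoted in this issue.
SAMPLE = """0000000002
-1 NP
: 1 1
: 2 2 -2 2
: 3 3
t
f
-2 Adj
: 6 1 5 1 4 1
t
f
%<E>
%the/DET
%<A>/ADJ
%<N>
%nice
@pretty
%small
f
"""


def fst2_to_dict(text):
    """Convert .fst2 text into the flat layout of the JSON example."""
    # Assumption: surrounding whitespace is not significant in any line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    graph_count = int(lines[0])  # "0000000002" -> 2
    graphs = {}
    pos = 1
    for _ in range(graph_count):
        # Graph header: "-<number> <name>", for instance "-1 NP".
        _, name = lines[pos].split(maxsplit=1)
        pos += 1
        items = []
        while lines[pos] != "f":  # "f" closes the current graph
            fields = lines[pos].split()
            marker, numbers = fields[0], fields[1:]
            if marker == "t":
                # Final state. Assumption: emit "t" before its pairs, if any.
                items.append("t")
            # Group the integers two by two, one string per pair.
            items.extend(
                f"{numbers[i]} {numbers[i + 1]}"
                for i in range(0, len(numbers) - 1, 2)
            )
            pos += 1
        items.append("f")
        pos += 1
        graphs[name] = items
    # Everything after the last graph is the tag list, closing "f" included.
    return {"graphs": graphs, "etc": lines[pos:]}


result = fst2_to_dict(SAMPLE)
print(json.dumps(result, ensure_ascii=False, indent=1))
```

The flat list loses the boundary between states (the pairs "2 2" and "-2 2" come from the same state line), so a JSON Schema for a real exchange format would probably need one array per state.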
Hi, I am testing Unitex-GramLab-3.1-linux-x86_64.run; it works fine on Ubuntu 16 LTS... Running with pt-BR... In the FSGraph menu, Tools/Compile FST2 performs no conversion and only reports an error:
"Main graph matches epsilon! ERROR: the main graph A001 recognizes "
I need pt-BR/Inflection/*.grf).
The dictionary of Portuguese (Brazil) simple words, version of 2015, updated for the spelling reform of 2009, still contains some entries with the obsolete spelling alongside the corresponding new entry, for instance vôo in addition to voo. Thanks to Erick Fonseca and Sandra Aluísio, who discovered that.
Only the French version distributes resources for the recognition of numbers written in full (Dnum.grf).
I'm preparing new content for a few other languages:
Victorien V.
Some inflection graphs for French compound nouns are distributed as examples for users. The naming of these graphs is inconsistent, for example 'XXX' means 'three words' in the NC_NNPrepXXX graph, but the same sequence 'XXX' means 'two words and the delimiter between them' in the NC_XXXfp graph.
The naming should be reorganized.
Hello everyone,
I'm interested in adding support for the Chinese (zh-CN) language. Is anyone working on it, or can anyone point me to the mailing list for it? Alternatively, we could set up a group to work on it.
However, GramLab's official site says:
Unitex can process Chinese, Korean, multilingual text with characters from several alphabets, e.g. Latin and Greek alphabets, without having to encode an alphabet into another.
But I cannot find the corresponding language package in the unitex-lingua repo. Can anyone tell me what happened?
Best regards,
Haozhe
The dictionary of Portuguese (Brazil) simple words, version of 2015, updated for the spelling reform of 2009, misses the abbreviation and acronym entries that were present in the version of 2004.
This makes it more difficult to carry out comparisons between the two versions of the dictionary.
Hi, I am "new" here: I used Unitex ~10 years ago, and we translated everything to SQL, which was a good approach to managing big data... Is there a "UnitexGramLab recipe" to transform DELA datasets into SQL (PostgreSQL) or JSON?
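For illustration only, here is a minimal Python sketch of what such a recipe could look like for DELAF-style lines of the form inflected,lemma.CODE+trait:inflection/comment. The function name parse_delaf_line and the JSON keys are hypothetical, and the sample entries are made up for the example rather than copied from the distributed dictionary. The sketch assumes that the entry contains no escaped commas, dots or slashes.

```python
import json


def parse_delaf_line(line):
    """Split one DELAF-style line into a JSON-friendly dict."""
    # Assumption: no backslash-escaped ',', '.' or '/' inside the entry.
    entry, _, comment = line.strip().partition("/")
    inflected, _, rest = entry.partition(",")
    lemma, _, tags = rest.partition(".")
    # "N+z1:ms:mp" -> codes "N+z1", inflectional codes ["ms", "mp"]
    codes, *inflections = tags.split(":")
    grammatical, *traits = codes.split("+")
    return {
        "inflected": inflected,
        # An empty lemma means that it is identical to the inflected form.
        "lemma": lemma or inflected,
        "pos": grammatical,
        "traits": traits,
        "inflections": inflections,
        "comment": comment or None,
    }


# Made-up sample entries, one JSON object printed per line.
for sample in ["voos,voo.N:mp", "voo,.N:ms"]:
    print(json.dumps(parse_delaf_line(sample), ensure_ascii=False))
```

Emitting one JSON object per line keeps the output easy to bulk-load, for example into a PostgreSQL table with a JSON column.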