unitex-lingua's People

Contributors

alexis-neme, clmartineau, eric-laporte, gvollant, martinec, nathwhy, ndoumi, savary, victorien-v, vinber-service

unitex-lingua's Issues

Error in dictionary-graph for French ordinals

The currently distributed dictionary-graphs for French ordinals misplace the plural s of -ièmes: the s is inserted in the grammatical codes of the lexical tag instead of in the inflected form. Thanks to Denis Biguenet, who discovered this bug.

Place each corpus inside its own folder

Is your feature request related to a problem? Please describe.

When a .txt file is opened with Unitex, a .snt file and a folder are created. If a cascade is also used, 5 more files and another folder are created... The Corpus folder quickly becomes cluttered.

Describe the solution you'd like

Place each language corpus in a directory of the same name (without the extension).

Submitted by @denisMaurel

Remove <MIX> lexical masks from resources

Unitex/GramLab has never accepted a <MIX> lexical mask. It is ignored when it occurs in a local-grammar graph. This lexical mask probably existed in Intex but was not retained by Sébastien during the implementation of Unitex. I am not sure what it used to mean. In the <MIX> topic on the users' forum (13 October 2015), no users argued in favour of a <MIX> lexical mask. For consistency we should replace the graphs containing <MIX> in the distributed resources. Denis Maurel provided a version of the French sentence-splitting graph without <MIX> on 24 May 2018.

Transformation algorithm proposal, FST2 to JSON

"An .fst2 file is a text file that describes a set of graphs", Unitex-GramLab-3.1-usermanual-en, chapter 14.3.2, "Format .fst2".

Proposal

Transform (export) the .fst2 file into .fst2.json, with some enhanced semantics and a formal description in JSON Schema. Example:

{
  "graphs":{
     "NP":["1 1", "2 2", "-2 2", "3 3", "t", "f"],
     "Adj":["6 1","5 1","4 1","t","f"]
  }, 
  "etc":["%<E>","%the/DET","%<A>/ADJ","%<N>","%nice","@pretty","%small","f"]
}

obtained from the original example in chapter 14.3.2:

0000000002
-1 NP
: 1 1 
: 2 2 -2 2 
: 3 3 
t 
f 
-2 Adj
: 6 1 5 1 4 1 
t 
f 
%<E>
%the/DET
%<A>/ADJ
%<N>
%nice
@pretty
%small
f

The translation algorithm can be specified formally in JavaScript, Perl, or another simple language, and then ported to C++ for the best implementation. Existing code such as ElagFstFilesIO.cpp can provide hints or inspiration.
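As a starting point for discussion, here is a minimal Python sketch of the proposed translation. It only reproduces the JSON layout of the example above and is not a full .fst2 parser. The function names fst2_to_dict and fst2_to_json are hypothetical, and the sketch assumes that every state line starts with ":" or "t", that each graph ends with a line containing only "f", and that no tag has significant leading or trailing whitespace.

```python
import json


def fst2_to_dict(text):
    """Convert .fst2 text into the dict layout proposed in this issue.

    Assumption: the input follows the layout of the manual's example
    (graph count, then graphs, then the tag list).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_graphs = int(lines[0])  # e.g. "0000000002" -> 2
    pos = 1
    graphs = {}
    for _ in range(n_graphs):
        # Graph header, e.g. "-1 NP": keep only the graph name.
        _, name = lines[pos].split(maxsplit=1)
        pos += 1
        items = []
        while True:
            line = lines[pos]
            pos += 1
            if line == "f":  # end of this graph
                items.append("f")
                break
            marker, *numbers = line.split()
            if marker == "t":  # terminal state
                items.append("t")
            # Group the integers two by two, one string per transition.
            items.extend(f"{a} {b}" for a, b in zip(numbers[0::2], numbers[1::2]))
        graphs[name] = items
    # Everything left is the tag list, including its closing "f".
    return {"graphs": graphs, "etc": lines[pos:]}


def fst2_to_json(text):
    """Serialize the converted structure as a JSON string."""
    return json.dumps(fst2_to_dict(text), ensure_ascii=False, indent=2)
```

Calling fst2_to_json on the text of the .fst2 example above yields the "graphs" and "etc" structure shown in the proposal. Note that this layout flattens state boundaries (a state with two transitions becomes two list items), so a schema that keeps one list per state may be worth considering.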

Terminal command to convert all graphs?

Hi, I am testing Unitex-GramLab-3.1-linux-x86_64.run with pt-BR; it runs fine on Ubuntu 16 LTS. In FSGraph, the menu Tools/Compile FST2 performs no conversion and only reports this error:

"Main graph matches epsilon! ERROR: the main graph A001 recognizes "

I need to:

  1. Avoid errors
  2. Compile and save all Graphs (to produce all FST2 files from pt-BR/Inflection/*.grf).
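For the second point, one possible approach is a small script that calls the command-line compiler once per graph. The sketch below assumes that the UnitexToolLogger executable is on the PATH, that its Grf2Fst2 subprogram accepts a .grf path as its argument, and that a non-zero exit status signals failure. The function names are hypothetical, and the script does not address the "matches epsilon" error itself.

```python
import subprocess
from pathlib import Path


def grf2fst2_command(grf_path, tool="UnitexToolLogger"):
    """Build the command line for compiling one graph.

    Assumption: `tool` is the Unitex command-line executable and
    Grf2Fst2 takes the .grf file as its argument.
    """
    return [tool, "Grf2Fst2", str(grf_path)]


def compile_all(graph_dir, run=subprocess.run):
    """Compile every .grf file in graph_dir; return the names that failed."""
    failed = []
    for grf in sorted(Path(graph_dir).glob("*.grf")):
        result = run(grf2fst2_command(grf))
        if result.returncode != 0:  # assumed failure convention
            failed.append(grf.name)
    return failed
```

For example, compile_all("pt-BR/Inflection") would attempt each graph in that folder and return the list of graphs whose compilation did not succeed.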

Entries with obsolete spelling in updated dictionary of Portuguese (Brazil)

The dictionary of Portuguese (Brazil) simple words, version of 2015, updated for the spelling reform of 2009, still contains some entries with the obsolete spelling alongside the corresponding new entry, for instance vôo in addition to voo. Thanks to Erick Fonseca and Sandra Aluísio who discovered that.

Reorganize naming of inflection graphs for French compound nouns

Some inflection graphs for French compound nouns are distributed as examples for users. The naming of these graphs is inconsistent, for example 'XXX' means 'three words' in the NC_NNPrepXXX graph, but the same sequence 'XXX' means 'two words and the delimiter between them' in the NC_XXXfp graph.
The naming should be reorganized.

Is there anyone working on Chinese (zh-CN)?

Hello everyone,

I'm interested in adding support for the Chinese (zh-CN) language. Is anyone working on it, or can anyone point me to the mailing list for it? Otherwise, we could set up a group to work on it.

However, GramLab's official site says:

Unitex can process Chinese, Korean, multilingual text with characters from several alphabets, e.g.Latin and Greek alphabets, without having to encode an alphabet into another.

But I cannot find the corresponding language package in the unitex-lingua repo. Can anyone tell me what happened?

Best regards,
Haozhe

Missing abbreviations in Portuguese (Brazil) DELAF

The dictionary of Portuguese (Brazil) simple words, version of 2015, updated for the spelling reform of 2009, lacks the abbreviation and acronym entries that were present in the version of 2004.
This makes it more difficult to carry out comparisons between the two versions of the dictionary.

Transforming DELA into SQL or JSON

Hi, I am "new" here: I used Unitex about 10 years ago, and we translated everything to SQL, which was a good approach for managing big data... Is there a "UnitexGramLab recipe" for transforming DELA datasets into SQL (PostgreSQL) or JSON?
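In the absence of an official recipe, a conversion could start from something like the Python sketch below. It assumes the usual DELAF line layout "inflected form,lemma.grammatical codes:inflectional codes", with grammatical codes joined by "+" and an empty lemma meaning that the lemma equals the inflected form. It ignores backslash-escaped characters and trailing comments, and the function name parse_delaf_line is hypothetical.

```python
import json


def parse_delaf_line(line):
    """Split one DELAF line into a JSON-friendly dict.

    Assumption: no escaped ',' '.' or ':' and no '/comment' suffix.
    """
    form, _, rest = line.strip().partition(",")
    lemma, _, tags = rest.partition(".")
    codes, *inflections = tags.split(":")
    return {
        "form": form,
        "lemma": lemma or form,  # empty lemma: same as the inflected form
        "codes": codes.split("+"),
        "inflections": inflections,
    }


def delaf_to_json(lines):
    """Convert an iterable of DELAF lines into a JSON array string."""
    entries = [parse_delaf_line(ln) for ln in lines if ln.strip()]
    return json.dumps(entries, ensure_ascii=False, indent=2)
```

With a made-up entry such as "voos,voo.N:mp", parse_delaf_line returns the form "voos", the lemma "voo", the codes ["N"], and the inflections ["mp"]. The same dicts could be passed as rows to a PostgreSQL insert instead of being serialized to JSON.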
