moreymat / omw-graph

The Open Multilingual Wordnet in a graph database

License: MIT License

Perl 42.74% Python 46.91% Shell 10.34%

omw-graph's Issues

Relations skipped

Some relations are skipped when adding missing synsets.

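if currentid not in synsets:
  # First time we see the source synset: write it and remember it.
  writeLineSynset(currentid, syncsv)
  synsets[currentid].append(currentid)
# Check the target independently of the source, so that a new target
# is still written when its source is already known.
if targetid not in synsets:
  writeLineSynset(targetid, syncsv)
  synsets[targetid].append(targetid)
writeLineRels(currentid, targetid, reltype, relcsv)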

synsets is a collection used to track which synsets have already been written. If a synset is not in it yet, I write it to syn-xxx.csv and add it to synsets. I do this for every relation.

This way there should be no missing synsets and only valid relations, but some relations are still skipped: 5 out of more than 200,000.

Each original Wordnet, for example the WOLF (Wordnet Libre du Français), contains its own language-specific structure.
This structure is very valuable information that we want to import into the graph database.

As each Wordnet is distributed in its own format, we need one import function per Wordnet.
The OMW team had the same need.
They provide one script per Wordnet that retrieves the aligned data from the original files.

The idea is to transform each of the OMW import scripts into a function, expand each function to import more information (including structure) and wrap all functions in a module.
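To make the "one function per Wordnet, all wrapped in a module" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the `IMPORTERS` registry, the `importer` decorator, the `"example"` resource name and the tab-separated line format are placeholders for illustration, not the actual OMW scripts or any real Wordnet format.

```python
# Hypothetical registry: one import function per Wordnet, keyed by resource name.
IMPORTERS = {}


def importer(name):
    """Register an import function under a resource name."""
    def register(func):
        IMPORTERS[name] = func
        return func
    return register


@importer("example")
def import_example(lines):
    """Placeholder parser for 'synset<TAB>type<TAB>value' lines.

    A real function would read the Wordnet's own format and could also
    yield language-specific structure, not only the aligned data.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        synset, kind, value = line.split("\t", 2)
        yield synset, kind, value


def import_wordnet(name, lines):
    """Dispatch to the import function registered for this resource."""
    return list(IMPORTERS[name](lines))
```

With this layout, adding a new Wordnet only means adding one decorated function; callers keep using `import_wordnet(name, lines)`.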

Match Wordnet terminology in the database "schema"

It is important for users that we follow the Wordnet terminology as much as possible in the database "schema" (even if we do not strictly enforce a real db schema).

For example, has_sense_in could be used to link lexical entries to synsets: (m:LexicalEntry)-[:has_sense_in]->(n:Synset).

Please post here any feedback on the terminology we should use and how it should translate into our database schema.

Add node labels

We can ease indexing and navigation in the graph by assigning labels to nodes.

The obvious labels I see are:

Synset for synsets,
LexicalEntry for lexical entries.

Use cases will drive the introduction of other labels that would carry information about the language, the lexicon...
Here again, @fcbond's feedback and suggestions will be invaluable.

Use pull requests for progressive code review

We currently have two feature branches that are being regularly updated.

We now need to find a convenient way to:

get feedback on code,
discuss potential improvements,
make joint decisions on naming conventions, module organization, etc
prevent branches from diverging too much, so that we can merge them painlessly.

I propose we follow some of the good practices described in this very simple workflow.
We could in particular use pull requests the way described in point 4 ("open a pull request at any time").

/cc @rhin0cer0s @zorgulle

Report

The report is due on May 26th, 2pm.

Requirements:

introduction (1-2 pages)
detailed description of the proposal (2-5 pages): do not forget to touch on project management issues
description of work done (6-8 pages)
conclusion (0.5-1 page)

The resulting document should be in PDF format.

Clean branch "inject"

The branch "inject" seems to be dead.
Code that is still used should be kept, the rest removed.
If no code is still in use, the branch should be deleted.

Who volunteers to do this?

/cc @zorgulle @rhin0cer0s

Redundancy in Wordnet

We had to check for and remove redundant entries in the Wordnet files.

Example:
in wn-data-fra, the following line occurred twice:
09014850-n fre:lemma Chișinău

We also have to verify that the redundancy is not intentional.
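A minimal sketch of how such duplicates could be listed for review before anything is deleted. It assumes the file is read as a list of text lines and that two entries are redundant only when the lines are exactly identical; the function names are made up for illustration.

```python
from collections import Counter


def find_duplicate_lines(lines):
    """Return {line: count} for every non-blank line that occurs more than once."""
    counts = Counter(line.rstrip("\n") for line in lines if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


def drop_duplicate_lines(lines):
    """Keep the first occurrence of each non-blank line, in the original order."""
    seen = set()
    kept = []
    for line in lines:
        key = line.rstrip("\n")
        if key and key in seen:
            continue  # exact repeat of an earlier line
        seen.add(key)
        kept.append(line)
    return kept
```

Running `find_duplicate_lines` first gives a short report that can be checked by hand, which helps decide whether a repeated entry is intentional before `drop_duplicate_lines` is applied.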

Produce resource-dependent subsets of relations

Each OMW-LMF file contains all relations from the Princeton Wordnet, even if some synsets are not instantiated by any lexical entry in the resource.

We could identify, for a resource, the subset of relations that covers its lexical entries.
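One possible reading of "covers its lexical entries" is: keep a relation only when both of its synsets are lexicalized in the resource. The sketch below follows that assumption. The tuple shapes `(lemma, synset_id)` and `(source, reltype, target)` and the function names are hypothetical, not the OMW-LMF structure itself.

```python
def lexicalized_synsets(lexical_entries):
    """Collect the synsets mentioned by at least one lexical entry.

    lexical_entries: iterable of (lemma, synset_id) pairs.
    """
    return {synset_id for _lemma, synset_id in lexical_entries}


def restrict_relations(relations, lexicalized):
    """Keep relations whose source and target are both lexicalized.

    relations: iterable of (source, reltype, target) triples.
    """
    return [
        (source, reltype, target)
        for source, reltype, target in relations
        if source in lexicalized and target in lexicalized
    ]
```

A looser variant would keep a relation as soon as one end is lexicalized; which definition is wanted for the backport is an open question.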

@fcbond expressed interest in getting these restricted subsets so that they can be backported into the OMW-LMF files.

Add language to index key

To avoid duplicate keys across languages, we could append the language code to the index key, separated by a delimiter character.

e.g.: 00000000-n_rain_eng
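A small sketch of building and parsing such keys. The `_` separator and the field order come from the example key; `make_key` and `split_key` are hypothetical helper names. It assumes that synset ids and language codes never contain `_`, while lemmas may (multiword lemmas are often written with underscores), which is why parsing splits from both ends.

```python
SEPARATOR = "_"  # same separator as in the example key


def make_key(synset_id, lemma, lang):
    """Build an index key of the form <synset>_<lemma>_<lang>."""
    return SEPARATOR.join((synset_id, lemma, lang))


def split_key(key):
    """Recover (synset_id, lemma, lang) from a key.

    Assumes synset ids and language codes contain no separator,
    so the lemma is whatever lies between the first and last one.
    """
    synset_id, rest = key.split(SEPARATOR, 1)
    lemma, lang = rest.rsplit(SEPARATOR, 1)
    return synset_id, lemma, lang
```

If that assumption does not hold for some resource, a separator that cannot occur in any field (a tab, for instance) would be safer.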

Graph queries

Working with py2neo on top of Neo4j as the graph engine is really slow.

We have ~58,000 non-lexicalized nodes.

Going through all non-lexicalized nodes and looking up the path from each one to the top node is expected to take more than 20 minutes (and seems to blow up at the end).

py2neo uses a REST API, which we think slows things down.
Could we use a Python graph library (networkx and graphviz) to build our graph and work on it before sending it to Neo4j?

If not, I think the only way to speed things up is to do some server-side work in Java.
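To get a feel for the in-memory option, here is a sketch of the path-to-top lookup using a plain dictionary as a stand-in for a graph library (it is not the networkx API). It assumes each synset has at most one hypernym, which is a simplification; `hypernym_of` and `path_to_top` are hypothetical names.

```python
def path_to_top(synset, hypernym_of):
    """Follow hypernym links from a synset up to a node with no hypernym.

    hypernym_of: dict mapping a synset id to its (single) hypernym id.
    Returns the list of synset ids visited, starting with `synset`.
    """
    path = [synset]
    seen = {synset}
    while path[-1] in hypernym_of:
        parent = hypernym_of[path[-1]]
        if parent in seen:
            break  # guard against cycles in the data
        path.append(parent)
        seen.add(parent)
    return path


def paths_for(synsets, hypernym_of):
    """Compute the path to the top for every synset in one local pass."""
    return {synset: path_to_top(synset, hypernym_of) for synset in synsets}
```

Each lookup here is a dictionary access rather than a request to the database, so the relations only need to be loaded from the CSV files once.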

/cc @moreymat @zorgulle

Meeting in person

I am opening this ticket so that we can arrange to meet in person.

We would like to meet you to talk about what we have done, what remains to be done, etc.

We will update the documentation by the end of the weekend so that you know exactly where we stand.

As for the day and time, we are free all week except Friday morning.

Size of temporary CSV files

Currently the temporary .csv files are rather large (~200 MB for the English relations).
We will try to make them smaller:

generate the index on the fly
hash functions

Slideshow

The presentation will last for 20 minutes + 5 minutes for questions, so the slideshow should contain approximately 15 slides.

Proper handling of non-lexicalized synsets

The OMW-LMF files contain non-lexicalized synsets, i.e. synsets that are not mentioned in any lexical entry.
Some of the non-lexicalized synsets result from partial coverage of the resources.
We could remedy this and expand their coverage by leveraging external resources such as Wiktionary.

This issue will build on results from #13.

Virtual_env

Since we do not have root privileges at the university, we have to use a virtual environment to install Python packages (the py2neo interface, for example).

This seems to be recommended in general, to keep packages separate and build clean dependencies.

We were thinking of pushing our virtual environment to the 'master' branch, since it will be used everywhere.

Some documentation: http://www.virtualenv.org/en/latest/virtualenv.html

Merge parser and relation extraction

By merging relation extraction and file parsing, we can add other languages easily, but the process is 5 seconds slower.

Produce language-specific hierarchies from the English one

Francis Bond wrote:

It might also be interesting to look at producing language specific hierarchies from the English one, as described by :
V. Vincze, A. Almasi. Non-Lexicalized Concepts in Wordnets: A Case Study of English and Hungarian (in http://gwc2014.ut.ee/index.php?v=proceedings, pp. 118--126).

/cc @fcbond
