omw-graph's Issues

Relations skipped

Some relations are skipped when adding missing synsets.

if not currentid in synsets:
  writeLineSynset(currentid, syncsv)
  synsets[currentid].append(currentid)
  if not targetid in synsets:
    writeLineSynset(targetid, syncsv)
    synsets[targetid].append(targetid)
writeLineRels(currentid, targetid, reltype, relcsv)

synsets is a collection used to track whether a synset has already been added. If it has not, I write it to syn-xxx.csv and add it to synsets. I do this for every relation.

This way there should be no missing synsets and only valid relations, but some relations are still skipped: 5 out of more than 200,000.
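One possible reading of the snippet above, assuming its indentation is faithful, is that the check on targetid only runs when currentid is new, so a target synset can stay unwritten when its source synset was already known. Whether this explains the 5 skipped relations is not confirmed. A minimal sketch with a plain set and two independent checks could look like this; write_synset and write_relation are hypothetical stand-ins for writeLineSynset and writeLineRels:

```python
def export_relations(relations, write_synset, write_relation):
    """Write every relation, emitting each endpoint synset exactly once.

    relations: iterable of (currentid, targetid, reltype) tuples.
    write_synset / write_relation: hypothetical callbacks standing in for
    writeLineSynset and writeLineRels from the snippet above.
    """
    seen = set()  # ids of the synsets already written
    for currentid, targetid, reltype in relations:
        # Check both endpoints independently of each other.
        for synset_id in (currentid, targetid):
            if synset_id not in seen:
                write_synset(synset_id)
                seen.add(synset_id)
        write_relation(currentid, targetid, reltype)
    return seen
```

Because the relation is written after both endpoints have been handled, every relation line refers to synsets that already exist in the synset file.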

Import the original Wordnets

Each original Wordnet, for example the WOLF (Wordnet Libre du Français), contains its own language-specific structure.
This structure is very valuable information that we want to import into the graph database.

As each Wordnet is distributed in its own format, we need one import function per Wordnet.
The OMW team had the same need.
They provide one script per Wordnet that retrieves the aligned data from the original files.

The idea is to transform each of the OMW import scripts into a function, expand each function to import more information (including structure) and wrap all functions in a module.
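As a rough sketch of that module layout, each import function could register itself under a Wordnet name and a single entry point could dispatch to it. Everything here is illustrative: the registry, the function names and the "tab-demo" importer are assumptions, and the toy parser only handles tab-separated lines of the form synset, field, value.

```python
IMPORTERS = {}  # Wordnet name -> import function


def register(name):
    """Decorator that records an import function under a Wordnet name."""
    def decorator(func):
        IMPORTERS[name] = func
        return func
    return decorator


@register("tab-demo")
def import_tab_lines(lines):
    """Toy importer: parse 'synset<TAB>field<TAB>value' lines, skipping comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        synset, field, value = line.split("\t", 2)
        yield synset, field, value


def import_wordnet(name, lines):
    """Dispatch to the import function registered for this Wordnet."""
    if name not in IMPORTERS:
        raise ValueError(f"no importer registered for {name!r}")
    return list(IMPORTERS[name](lines))
```

A real importer for a given Wordnet would replace the toy parser and return the extra structural information as well.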

Match Wordnet terminology in the database "schema"

It is important for users that we follow the Wordnet terminology as much as possible in the database "schema" (even if we do not strictly enforce a real db schema).

For example, has_sense_in could be used to link lexical entries to synsets: (m:LexicalEntry)-[:has_sense_in]->(n:Synset).
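To make the pattern concrete, here is a small sketch that builds a parameterized Cypher statement creating such a link. The property names lemma and id are assumptions, as is the $name parameter syntax (older Neo4j versions used a different one), and the function only returns the query and its parameters without calling any driver.

```python
def sense_link_query(lemma, synset_id):
    """Return a Cypher statement and parameters linking an entry to a synset.

    The node properties 'lemma' and 'id' are hypothetical; only the labels
    and the has_sense_in relation type come from the proposal above.
    """
    query = (
        "MERGE (m:LexicalEntry {lemma: $lemma}) "
        "MERGE (n:Synset {id: $synset_id}) "
        "MERGE (m)-[:has_sense_in]->(n)"
    )
    return query, {"lemma": lemma, "synset_id": synset_id}
```

Using MERGE rather than CREATE keeps the statement idempotent, so running it twice for the same pair does not duplicate nodes or relationships.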

Please provide here any feedback about terminology we should use and their translation in our database schema.

Add node labels

We can ease indexing and navigation in the graph by assigning labels to nodes.

The obvious labels I see are:

  • Synset for synsets,
  • LexicalEntry for lexical entries.

Use cases will drive the introduction of other labels that would carry information about the language, the lexicon...
Here again, @fcbond's feedback and suggestions will be invaluable.

Use pull requests for progressive code review

We currently have two feature branches that are being regularly updated.

We now need to find a convenient way to:

  • get feedback on code,
  • discuss potential improvements,
  • make joint decisions on naming conventions, module organization, etc.,
  • prevent branches from diverging too much, so that we can merge them painlessly.

I propose we follow some of the good practices described in this very simple workflow.
We could in particular use pull requests the way described in point 4 ("open a pull request at any time").

/cc @rhin0cer0s @zorgulle

Report

The report is due on May 26th, 2pm.

Requirements:

  • introduction (1-2 pages)
  • detailed description of the proposal (2-5 pages): do not forget to touch on project management issues
  • description of work done (6-8 pages)
  • conclusion (0.5-1 page)

The resulting document should be in PDF format.

Clean branch "inject"

The branch "inject" seems to be dead.
Code that is still used should be kept, the rest removed.
If no code is still in use, the branch should be deleted.

Who volunteers to do this?

/cc @zorgulle @rhin0cer0s

Redundancy in Wordnet

We had to check for and remove redundant entries in the Wordnet files.

Example:
In wn-data-fra, the following line occurred twice:
09014850-n fre:lemma Chișinău

We also have to verify that the redundancy is not intentional.
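A minimal sketch of such a check, assuming that "redundancy" means exact duplicate lines: one helper reports the duplicated lines so they can be reviewed by hand, the other drops the repeats while keeping the original order. Both function names are hypothetical.

```python
from collections import Counter


def find_duplicates(lines):
    """Return {line: count} for every non-empty line that occurs more than once."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return {line: n for line, n in counts.items() if line and n > 1}


def drop_duplicates(lines):
    """Keep the first occurrence of each line, preserving the original order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.rstrip("\n")
        if key in seen:
            continue
        seen.add(key)
        unique.append(line)
    return unique
```

Running find_duplicates first gives a list to inspect before anything is removed.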

Produce resource-dependent subsets of relations

Each OMW-LMF file contains all relations from the Princeton Wordnet, even if some synsets are not instantiated by any lexical entry in the resource.

We could identify, for a resource, the subset of relations that covers its lexical entries.
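As a sketch of one way to do this, assume that a relation "covers" the resource when both of its synsets appear in at least one lexical entry (a stricter or looser definition is equally possible). The data shapes and function names below are hypothetical.

```python
def lexicalized_synsets(lexical_entries):
    """Collect the synset ids mentioned by at least one lexical entry.

    lexical_entries: iterable of (lemma, synset_id) pairs.
    """
    return {synset_id for _lemma, synset_id in lexical_entries}


def restrict_relations(relations, lexical_entries):
    """Keep the relations whose source and target are both lexicalized.

    relations: iterable of (source_id, target_id, reltype) tuples.
    """
    covered = lexicalized_synsets(lexical_entries)
    return [
        (source, target, reltype)
        for source, target, reltype in relations
        if source in covered and target in covered
    ]
```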

@fcbond expressed interest in getting these restricted subsets to backport them into the OMW-LMF files.

Add language to index key

To avoid collisions between keys from different languages, we could add the language code to the index key, joined with a separator character.

e.g. 00000000-n_rain_eng
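A small sketch of building and parsing such keys follows. The function names are hypothetical, and the parsing assumes that neither the synset id nor the language code contains the separator; the lemma may contain it, since multi-word lemmas are often written with underscores.

```python
def make_index_key(synset_id, lemma, lang, sep="_"):
    """Build an index key such as '00000000-n_rain_eng'."""
    return sep.join((synset_id, lemma, lang))


def split_index_key(key, sep="_"):
    """Split a key back into (synset_id, lemma, lang).

    The first field is the synset id and the last one the language code,
    so a lemma containing the separator is still recovered intact.
    """
    synset_id, rest = key.split(sep, 1)
    lemma, lang = rest.rsplit(sep, 1)
    return synset_id, lemma, lang
```

Choosing a separator that cannot occur in lemmas at all would make the keys unambiguous without relying on field positions.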

Graph queries

Working with py2neo on top of neo4j as the graph engine is really slow.

We have ~58,000 non-lexicalized nodes.

Going through all non-lexicalized nodes and looking up the path from each of them to the top node is estimated to take more than 20 minutes (and seems to blow up at the end).

py2neo uses a REST API, which we think slows things down.
Can we use a Python graph library (networkx and graphviz) to build our graph and work on it before sending it to neo4j?
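To illustrate the in-memory idea without any server round trip, here is a dependency-free sketch that uses a plain dictionary instead of networkx and finds a shortest path from a synset to a top node by breadth-first search. The hypernyms mapping (synset to list of direct hypernyms) is an assumed data shape, and a top node is taken to be any node without hypernyms.

```python
from collections import deque


def shortest_path_to_top(start, hypernyms):
    """Return a shortest path [start, ..., top] following hypernym links.

    hypernyms: dict mapping a synset id to the list of its direct hypernyms.
    A node with no hypernyms counts as a top node.
    """
    queue = deque([start])
    previous = {start: None}  # node -> node it was reached from
    while queue:
        node = queue.popleft()
        parents = hypernyms.get(node, [])
        if not parents:
            # Reached a top node: rebuild the path by walking back to start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for parent in parents:
            if parent not in previous:
                previous[parent] = node
                queue.append(parent)
    return None  # no top node reachable, e.g. the links form a cycle
```

The same traversal could be expressed with a networkx directed graph; the point is only that the whole graph sits in memory while the paths are computed.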

If not, I think the only way to speed things up is to do some server-side work in Java.

/cc @moreymat @zorgulle

In-person meeting

I am opening this ticket so that we can arrange to meet in person.

We would like to meet you to talk about what we have done, what remains to be done, etc.

We will update the docs by the end of the weekend so that you know precisely where we stand.

As for the day and time, we are free all week except Friday morning.

Size of temporary CSV files

Currently, the temporary .csv files are rather big (~200 MB for the English relations).
We will try to make them smaller. Ideas so far:

  • generate on the fly index
  • hash functions
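One possible interpretation of the first idea, sketched under assumptions: replace the long synset id strings in the relation rows with small integers assigned on the fly, and keep the mapping in memory (or in a separate, much smaller file). The class and function names are hypothetical.

```python
class IdIndex:
    """Assign consecutive integers to ids in order of first appearance."""

    def __init__(self):
        self._numbers = {}

    def get(self, key):
        """Return the integer for key, assigning the next free one if needed."""
        return self._numbers.setdefault(key, len(self._numbers))

    def items(self):
        """Return (id, integer) pairs, e.g. to write the mapping file."""
        return list(self._numbers.items())


def compact_rows(relations, index):
    """Yield (source_number, target_number, reltype) rows for the CSV file."""
    for source, target, reltype in relations:
        yield index.get(source), index.get(target), reltype
```

How much space this saves depends on the length of the original ids and on how often each one is repeated across relations.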

Slideshow

The presentation will last for 20 minutes + 5 minutes for questions, so the slideshow should contain approximately 15 slides.

Proper handling of non-lexicalized synsets

The OMW-LMF files contain non-lexicalized synsets, i.e. synsets that are not mentioned in any lexical entry.
Some of the non-lexicalized synsets result from partial coverage of the resources.
We could remedy this and expand their coverage by leveraging external resources such as Wiktionary.

This issue will build on results from #13 .

Virtual_env

Since we do not have root privileges at the university, we have to use a virtual environment to add Python packages (the py2neo interface, for example).

This seems to be recommended in general, to avoid mixing packages and to keep dependencies clean.

We were thinking of pushing our virtual environment to the 'master' branch, since it will be used everywhere.

Some documentation: http://www.virtualenv.org/en/latest/virtualenv.html
