moreymat / omw-graph
The Open Multilingual Wordnet in a graph database
License: MIT License
Some relations are skipped when adding missing synsets.
# Write each endpoint synset the first time it is seen.
if currentid not in synsets:
    writeLineSynset(currentid, syncsv)
    synsets[currentid].append(currentid)
if targetid not in synsets:
    writeLineSynset(targetid, syncsv)
    synsets[targetid].append(targetid)
# Then write the relation itself.
writeLineRels(currentid, targetid, reltype, relcsv)
synsets is a collection used to track whether a synset has already been added. If it has not, I write it to syn-xxx.csv and add it to synsets. I do this for every relation.
This way there should be no missing synsets and only valid relations, but some relations are still skipped: 5 out of more than 200,000.
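The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes synsets can be tracked in a plain set and that rows are written with the standard csv module. The names write_relation and export and the row layouts are hypothetical stand-ins for writeLineSynset and writeLineRels. It shows the intended logic only and does not explain why five relations go missing.

```python
import csv
import io


def write_relation(current_id, target_id, rel_type, seen, syn_writer, rel_writer):
    """Write a relation row, first writing any endpoint synset not seen yet."""
    for synset_id in (current_id, target_id):
        if synset_id not in seen:
            syn_writer.writerow([synset_id])  # each synset is written only once
            seen.add(synset_id)
    rel_writer.writerow([current_id, rel_type, target_id])


def export(relations):
    """Return (synset CSV text, relation CSV text) for (source, type, target) triples."""
    syn_buf, rel_buf = io.StringIO(), io.StringIO()
    syn_writer = csv.writer(syn_buf, lineterminator="\n")
    rel_writer = csv.writer(rel_buf, lineterminator="\n")
    seen = set()  # hypothetical replacement for the synsets collection
    for current_id, rel_type, target_id in relations:
        write_relation(current_id, target_id, rel_type, seen, syn_writer, rel_writer)
    return syn_buf.getvalue(), rel_buf.getvalue()
```

Comparing the number of input triples with the number of lines in the relation output is one way to check whether rows are lost at this stage or later, during the import.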
Each original Wordnet, for example the WOLF (Wordnet Libre du Français), contains its own language-specific structure.
This structure is very valuable information that we want to import into the graph database.
As each Wordnet is distributed in its own format, we need one import function per Wordnet.
The OMW team had the same need.
They provide one script per Wordnet that retrieves the aligned data from the original files.
The idea is to transform each of the OMW import scripts into a function, expand each function to import more information (including structure) and wrap all functions in a module.
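One possible shape for such a module is a small registry that maps each Wordnet name to its import function. This is only a sketch: the names IMPORTERS, importer, import_wolf and import_wordnet are hypothetical, and the body of import_wolf is a placeholder rather than a real parser.

```python
IMPORTERS = {}


def importer(name):
    """Register the decorated function as the import function for one Wordnet."""
    def register(func):
        IMPORTERS[name] = func
        return func
    return register


@importer("wolf")
def import_wolf(path):
    # Hypothetical placeholder: a real function would parse the WOLF files and
    # return synsets, lexical entries and language-specific structure.
    return {"source": "wolf", "path": path}


def import_wordnet(name, path):
    """Dispatch to the import function registered for the given Wordnet."""
    if name not in IMPORTERS:
        raise ValueError("no import function registered for %r" % name)
    return IMPORTERS[name](path)
```

Adding a new Wordnet would then mean writing one decorated function, without touching the dispatch code.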
It is important for users that we follow the Wordnet terminology as much as possible in the database "schema" (even if we do not strictly enforce a real db schema).
For example, has_sense_in could be used to link lexical entries to synsets: (m:LexicalEntry)-[:has_sense_in]->(n:Synset).
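As an illustration of how this pattern could be written to the database, here is a sketch that builds a Cypher MERGE statement and its parameters. The function name sense_link_statement and the property names lemma and id are assumptions, as is identifying a lexical entry by its lemma alone. The $name parameter syntax is the one used by recent Neo4j versions; older versions wrote parameters as {name}.

```python
def sense_link_statement(lemma, synset_id, rel_type="has_sense_in"):
    """Return a (Cypher text, parameters) pair linking a lexical entry to a synset."""
    # Relationship types cannot be passed as Cypher parameters, so the type is
    # validated before being inserted into the query text.
    if not rel_type.isidentifier():
        raise ValueError("unsafe relationship type: %r" % rel_type)
    query = (
        "MERGE (m:LexicalEntry {lemma: $lemma}) "
        "MERGE (n:Synset {id: $synset_id}) "
        "MERGE (m)-[:%s]->(n)" % rel_type
    )
    return query, {"lemma": lemma, "synset_id": synset_id}
```

MERGE is used rather than CREATE so that running the statement twice does not duplicate nodes or relationships.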
Please provide here any feedback about the terminology we should use and its translation into our database schema.
We can ease indexing and navigation in the graph by assigning labels to nodes.
The obvious labels I see are: Synset for synsets, LexicalEntry for lexical entries.
Use cases will drive the introduction of other labels that would carry information about the language, the lexicon...
Here again, @fcbond's feedback and suggestions will be invaluable.
We currently have two feature branches that are being regularly updated.
We now need to find a convenient way to:
I propose we follow some of the good practice described in this very simple workflow.
We could in particular use pull requests the way described in point 4 ("open a pull request at any time").
/cc @rhin0cer0s @zorgulle
The report is due on May 26th, 2pm.
Requirements:
The resulting document should be in PDF format.
The branch "inject" seems to be dead.
Code that is still used should be kept, the rest removed.
If no code is still in use, the branch should be deleted.
Who volunteers to do this?
/cc @zorgulle @rhin0cer0s
We had to check for and remove redundancy in the wordnet files.
Example: wn-data-fra contained two occurrences of the following line:
09014850-n fre:lemma Chișinău
We also have to verify that the redundancy is not intentional.
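A check of this kind can be sketched with a counter over the lines of a data file. The function name duplicated_lines is hypothetical, and skipping lines that start with "#" assumes the file carries comment or header lines in that form.

```python
from collections import Counter


def duplicated_lines(lines):
    """Return {line: count} for every non-comment line that occurs more than once."""
    counts = Counter(
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.startswith("#")
    )
    return {line: n for line, n in counts.items() if n > 1}
```

The function accepts any iterable of lines, so an open file object can be passed directly. It only reports candidates; deciding whether a duplicate is intentional still requires a manual look.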
Each OMW-LMF file contains all relations from the Princeton Wordnet, even if some synsets are not instantiated by any lexical entry in the resource.
We could identify, for a resource, the subset of relations that covers its lexical entries.
@fcbond expressed interest in getting these restricted subsets in order to backport them into the OMW-LMF files.
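One reading of "the subset of relations that covers its lexical entries" is: keep a relation only when both of its synsets are instantiated by at least one lexical entry. The sketch below implements that reading. The function name restrict_relations and the (source, type, target) triple layout are assumptions, and a looser criterion (only one endpoint lexicalized) may turn out to be the one wanted.

```python
def restrict_relations(relations, lexicalized):
    """Keep the (source, rel_type, target) triples whose endpoints are both lexicalized."""
    lexicalized = set(lexicalized)
    return [
        (source, rel_type, target)
        for source, rel_type, target in relations
        if source in lexicalized and target in lexicalized
    ]
```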
In order to avoid redundancy between different languages, we could add the language reference to the index key, preceded by a separator character.
e.g. 00000000-n_rain_eng
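A minimal sketch of building and splitting such a key, assuming "_" as the separator as in the example. The names make_key and split_key are hypothetical. Because lemmas may themselves contain underscores, the key is split once from the left and once from the right, which relies on the synset identifier and the language code not containing the separator.

```python
SEPARATOR = "_"


def make_key(synset_id, lemma, lang):
    """Build an index key such as 00000000-n_rain_eng."""
    return SEPARATOR.join((synset_id, lemma, lang))


def split_key(key):
    """Split a key back into (synset_id, lemma, lang)."""
    synset_id, rest = key.split(SEPARATOR, 1)  # synset ids contain no separator
    lemma, lang = rest.rsplit(SEPARATOR, 1)    # language codes contain no separator
    return synset_id, lemma, lang
```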
Working with py2neo on top of Neo4j as the graph engine is really slow.
We have ~58,000 non-lexicalized nodes.
Going through all non-lexicalized nodes and looking up each of their paths to the top node is expected to take more than 20 minutes (and seems to blow up at the end).
py2neo uses a REST API, which we think slows things down.
Could we use a Python graph library (networkx and graphviz) to build our graph and work on it before sending it to Neo4j?
If not, I think the only way to speed things up is to do some server-side work in Java.
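To illustrate the in-memory option, here is a sketch that uses a plain dictionary in place of a graph library such as networkx, so that it needs nothing beyond the standard library. It assumes the hypernym edges have already been loaded into a mapping from each synset to its direct hypernyms; the function name paths_to_top and that mapping are hypothetical. No timing claim is made: whether this is fast enough for ~58,000 nodes would have to be measured.

```python
def paths_to_top(synset_id, hypernyms):
    """Return every hypernym path from synset_id up to a node without hypernyms."""
    paths = []
    stack = [(synset_id, (synset_id,))]
    while stack:
        node, path = stack.pop()
        # Ignore parents already on the current path to guard against cycles.
        parents = [p for p in hypernyms.get(node, ()) if p not in path]
        if not parents:
            paths.append(list(path))
        for parent in parents:
            stack.append((parent, path + (parent,)))
    return paths
```

A synset with several hypernyms yields several paths, which may be related to the blow-up observed near the end of the run.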
I am opening this ticket so that we can arrange to meet in person.
We would like to meet you to talk about what we have done, what remains to be done, etc.
We will update the docs by the end of the weekend so that you know precisely where we stand.
As for the day and time, we are free all week except Friday morning.
Currently, the temporary .csv files are a bit big (~200 MB for the eng relations).
We will try to make them smaller.
The presentation will last for 20 minutes + 5 minutes for questions, so the slideshow should contain approximately 15 slides.
The OMW-LMF files contain non-lexicalized synsets, i.e. synsets that are not mentioned in any lexical entry.
Some of the non-lexicalized synsets result from partial coverage of the resources.
We could remedy this and expand their coverage by leveraging external resources such as wiktionary.
This issue will build on results from #13 .
Since we do not have root privileges at the university, we have to use a virtual environment to add Python packages (the py2neo interface, for example).
This seems to be recommended in general, to avoid mixing packages and to keep dependencies clean.
We were thinking of pushing our virtual environment to the 'master' branch, since it will be used everywhere.
Some docs: http://www.virtualenv.org/en/latest/virtualenv.html
By grouping relation extraction and file parsing together, we can add other languages easily, but it is 5 seconds slower.
Francis Bond wrote:
It might also be interesting to look at producing language-specific hierarchies from the English one, as described by:
V. Vincze, A. Almasi. Non-Lexicalized Concepts in Wordnets: A Case Study of English and Hungarian (in http://gwc2014.ut.ee/index.php?v=proceedings, pp. 118--126).
/cc @fcbond