etymology's Introduction

THE PROJECT

This is a first version of the Wikimedia project etytree. The aim of the project is to visualize in an interactive web page the etymological tree (i.e., the etymology of a word in the form of a tree, with ancestors, cognate words, derived words, etc.) of any word in any language using data extracted from Wiktionary.

This project has been inspired by my interest in etymology, in open source collaborative projects and in interactive visualizations.

If you have comments on the project please write on its talk page.

Branches

The master branch is for development and for local installs. The webpack-branch is used in production.

Description

Etytree uses data extracted from an XML dump of the English Wiktionary using an algorithm implemented in dbnary_etymology. The extracted data is kept in sync with Wiktionary each time a new dump is generated (the dump currently used dates back to September 28th, 2017). Data extracted with dbnary_etymology has been loaded into a Virtuoso DBMS which can be accessed at wmflabs etytree-virtuoso sparql endpoint and explored with a faceted browser.

The lists of languages and ISO codes are in resources/data; they are imported from Wiktionary and periodically updated (the current files date back to September 22nd, 2017). File etymology-only_languages.csv has been created from Wiktionary data with a Lua module available here. File iso-639-3.tab has been downloaded from this link (the first line has been removed). File list_of_languages.csv has been downloaded from Wiktionary.

I have defined an ontology for etymologies here. In particular I have defined properties etymologicallyRelatedTo, etymologicallyDerivesFrom and etymologicallyEquivalentTo. This ontology needs improvements.

Property http://www.w3.org/2000/01/rdf-schema#seeAlso is used to link etymological entries to the Wiktionary pages they have been extracted from.

Besides etymological relationships, the database also contains parts of speech (POS), definitions, senses and more, as extracted by dbnary. The ontology for dbnary is defined here.

Licence

The code is distributed under the MIT licence and the data is distributed under Creative Commons Attribution-ShareAlike 3.0.

Viewing the Site

The site's html files are contained in the repo root. The main page is index.html. To view the site you just need to navigate to the root of the repo.

Using the SPARQL ENDPOINT

This code queries the wmflabs etytree-virtuoso sparql endpoint which I have set up and populated with data (RDF) produced with dbnary_etymology.

An example query to the sparql endpoint follows:

PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>
SELECT ?p ?o {
    eng:__ee_door ?p ?o
}

If you want to find all entries containing string "door":

SELECT DISTINCT ?s {
    ?s rdfs:label ?label .
    ?label bif:contains "door" .
}

If you want to find ancestors of "door":

PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org//dbnaryetymology#>
PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>

SELECT DISTINCT ?o { 
     eng:__ee_1_door dbetym:etymologicallyRelatedTo+ ?o .
}
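The same queries can be sent from JavaScript by URL-encoding them into the query parameter of the endpoint. The snippet below is a minimal sketch, not etytree's actual code: buildSparqlUrl is a hypothetical helper, and the format parameter is an assumption about what the Virtuoso endpoint accepts for requesting JSON results.

```javascript
// Endpoint URL as it appears in the browser console logs of this project.
const ENDPOINT = "https://etytree-virtuoso.wmflabs.org/sparql";

// Hypothetical helper (not part of etytree): builds a GET URL for a SPARQL query.
// `format` is optional and assumed to be understood by the endpoint.
function buildSparqlUrl(endpoint, query, format) {
  const params = ["query=" + encodeURIComponent(query)];
  if (format) {
    params.push("format=" + encodeURIComponent(format));
  }
  return endpoint + "?" + params.join("&");
}

// The "ancestors of door" query from above, kept verbatim
// (including the double slash in the dbetym prefix).
const ancestorsQuery = [
  "PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org//dbnaryetymology#>",
  "PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>",
  "SELECT DISTINCT ?o {",
  "  eng:__ee_1_door dbetym:etymologicallyRelatedTo+ ?o .",
  "}"
].join("\n");

const url = buildSparqlUrl(ENDPOINT, ancestorsQuery, "application/sparql-results+json");
console.log(url);
```

The resulting URL can be opened in a browser or passed to whatever HTTP client the page uses.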

etymology DOCUMENTATION

You need sudo privileges to install jsdoc-to-markdown globally:

npm install -g jsdoc-to-markdown

GENERATE DOCUMENTATION

mkdir ./docs
cd ./resources/js/
jsdoc2md -f app.js datamodel.js data.js etytree.js liveTour.js graph.js > ../../docs/test.md

dbnary_etymology DOCUMENTATION

EXTRACT THE DATA USING dbnary_etymology

The RDF database of etymological relationships is periodically extracted when a new dump of the English Wiktionary is released. The code used to extract the data is available at dbnary_etymology.

COMPILE THE CODE

dbnary_etymology is a Maven project (use Java 8 and Maven 3).

GENERATE DOCUMENTATION

Let's assume you cloned the repository in your home:

cd ~/dbnary_etymology/
mvn site
mvn javadoc:jar

PREPROCESS INPUT DATA

First you need an XML dump of the English Wiktionary. Then you need to convert it from UTF-8 to UTF-16 (using iconv for example):

VERSION=20170920
DATA_DIR=/srv/datasets/dumps/$VERSION/                                                               #output data folder
tmp_dump=/public/dumps/public/enwiktionary/$VERSION/enwiktionary-$VERSION-pages-articles.xml.bz2     #path to the dump

mkdir ${DATA_DIR}
dump=${DATA_DIR}/enwiktionary-$VERSION-pages-articles.utf-16.xml
bzcat ${tmp_dump} |iconv -f UTF-8 -t UTF-16 > $dump    #This operation takes approximately 7 minutes.

EXTRACT ENGLISH WORDS

With the following code you can extract data relative to English words:

OUT_DIR=/srv/datasets/dbnary/$VERSION/                                                               #output folder
LOG_DIR=/srv/datasets/dbnary/$VERSION/logs/
EXECUTABLE=~/dbnary_etymology/dbnary-extractor/target/dbnary-extractor-2.0e-SNAPSHOT-jar-with-dependencies.jar
mkdir ${OUT_DIR}
mkdir ${LOG_DIR}

PREFIX=http://etytree-virtuoso.wmflabs.org/dbnary
LOG_FILE=${LOG_DIR}/enwkt-$VERSION.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-$VERSION.ttl
ETY_FILE=${OUT_DIR}/enwkt-$VERSION.etymology.ttl
rm ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1   #This operation takes approximately 45 minutes
#compress the output if needed
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp  ${LOG_FILE}

EXTRACT FOREIGN WORDS

For memory reasons I only process a subset of the full data set at a time: from page 0 to page 1800000 (approximately 100 minutes), from page 1800000 to page 3600000 (approximately 50 minutes) and from page 3600000 to page 6000000 (approximately 100 minutes). Note that 24 GB of memory are needed to process the data (hence -Xmx24G).

fpage=0
tpage=1800000
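#${VERSION} needs braces: $VERSION_x_ is read as the (unset) variable VERSION_x_, which silently drops the version from the file name
LOG_FILE=${LOG_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl
ETY_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.etymology.ttl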
rm ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp  ${LOG_FILE}

fpage=1800000
tpage=3600000
LOG_FILE=${LOG_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl
ETY_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.etymology.ttl
rm ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp  ${LOG_FILE}

fpage=3600000
tpage=6000000
LOG_FILE=${LOG_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl
ETY_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.etymology.ttl
rm ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp  ${LOG_FILE}

EXTRACT A SINGLE ENTRY - FOREIGN WORD

WORD="door"
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary.eng=debug -cp $EXECUTABLE org.getalp.dbnary.cli.GetExtractedSemnet -x -l en --etymology testfile $dump $WORD

UPDATE DATABASE ON VIRTUOSO

Update ontology files

For VERSION=20170920:

cp ~/dbnary_etymology/dbnary-ontology/src/main/resources/org/getalp/dbnary/dbnary_etymology.owl  /srv/datasets/dbnary/$VERSION/
cp ~/dbnary_etymology/dbnary-ontology/src/main/resources/org/getalp/dbnary/dbnary.owl  /srv/datasets/dbnary/$VERSION/

Update database

From isql execute the following steps (step A):

SPARQL CLEAR GRAPH <http://etytree-virtuoso.wmflabs.org/dbnary>;
SPARQL CLEAR GRAPH <http://etytree-virtuoso.wmflabs.org/dbnaryetymology>;
ld_dir ('/srv/datasets/dbnary/20170920/', '*.ttl.gz','http://etytree-virtuoso.wmflabs.org/dbnary');
ld_dir ('/srv/datasets/dbnary/20170920/', '*.owl','http://etytree-virtuoso.wmflabs.org/dbnaryetymology');
-- do the following to see which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
rdf_loader_run();  ----- 1378390 msec. 
-- do nothing too heavy while data is loading
checkpoint;   ----- 50851 msec.
commit WORK;  ----- 1417 msec.
checkpoint;
EXIT;

In case an error occurs:

12:00:44 PL LOG:  File /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl.gz error 37000 SP029: TURTLE RDF loader, line 10636983: syntax error processed pending to here.
12:06:09 PL LOG:  File /srv/datasets/dbnary/20170920//enwkt-1800000_3600000.etymology.ttl.gz error 37000 SP029: TURTLE RDF loader, line 4772623: syntax error processed pending to here.

edit files manually:

zcat /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl.gz > /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl
emacs -nw /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl      #edit the uncompressed file, goto-line 10636983
#change line
gzip -f /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl      #-f overwrites the old .ttl.gz

Go to step A above and repeat. Then run the following command from the terminal

isql 1111 dba password /opt/virtuoso/db/bootstrap.sql

After dealing with errors relaunch the server.
From isql:

sparql SELECT COUNT(*) WHERE { ?s ?p ?o } ;
sparql SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
-- Build Full Text Indexes by running the following commands using the Virtuoso isql program
RDF_OBJ_FT_RULE_ADD (null, null, 'All');
VT_INC_INDEX_DB_DBA_RDF_OBJ ();
-- Run the following procedure using the Virtuoso isql program to populate label lookup tables periodically and activate the Label text box of the Entity Label Lookup tab:
urilbl_ac_init_db();
-- Run the following procedure using the Virtuoso isql program to calculate the IRI ranks. Note this should be run periodically as the data grows to re-rank the IRIs.
s_rank();

CORS setup

The following link will help you set up CORS for Virtuoso: http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksCORsEnableSPARQLURLs

Start and stop Virtuoso

To start:

cd /opt/virtuoso/db
virtuoso-t -f

To stop:

cd /opt/virtuoso-opensource/bin
isql 1111 dba password
SQL> shutdown();

ETYTREE TO DO

  • Add qualifiers to links between nodes:
    inherited word (template inherited)
    borrowed word (template borrowed)
    named from people
    developed from initialism
    surface analysis
    long detailed etymology (propose a new template?)
    invented word/coined expression (coined by)
    back-formation, i.e. removing a morpheme, real or perceived (e.g.: burglar -> burgle, play the tambourine -> tambour) (template back-form)
    compound (template compound)
    initialism, acronym, abbreviation, clipping
    blend/portmanteau (template blend)
    calque/loan translation
    year template (propose a new template?)
    cognates (I actually plan to ignore this)
  • Parse glosses in templates
  • Parse nested templates
  • Add zoom to tooltip
  • Add etymology controversies.
  • Add alternative etymologies.
  • Parse diacritics.
  • Maybe consider Dialects:
    Module:da:Dialects ?
    Module:en:Dialects This module provides labels to {{alter}}, which is used in the Alternative forms section.
    Module:grc:Dialects This module translates from dialect codes to dialect names for templates such as {{alter}}. (e.g. aio -> link = 'Aeolic Greek', display = 'Aeolic')
    Module:he:Dialects
    Module:hy:Dialects ?
    Module:la:Dialects (e.g.: aug -> link = Late Latin#Late and post-classical Latin, display = post-Augustan)
  • Maybe consider additional modules:
    Module:families/data mapping language code -> language name (e.g.: aav -> canonicalName = "Austro-Asiatic", otherNames = {"Austroasiatic"})

etymology's People

Contributors

esterpantaleo

etymology's Issues

Reorganize dagre.js into OOP style code

As I said in issue #22

Right now, the process of drawing out a graph is one long chain of events. Everything will get easier if we have each function alter the graph and update the state of the graph so the next function can read the state. This is similar to what you were talking about with width/height and how you needed to make them global.

The approach I am suggesting is Object Oriented Programming, like they do in this d3 snippet.

This will make it easier to check and assign values.

Not only am I not sure there is a good way to handle this collapsibility issue without switching to an OOP model, but I think this will speed up everything else we need to do.
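To make the suggestion concrete, here is a minimal sketch of what a graph object holding its own state could look like. Everything in it (the EtyGraph class, its fields and methods) is hypothetical and does not exist in dagre.js; it only illustrates how width/height and the node list could live on one object instead of in globals.

```javascript
// Hypothetical sketch: none of these names exist in dagre.js or etytree.
class EtyGraph {
  constructor(width, height) {
    this.width = width;      // viewport size lives on the object, not in globals
    this.height = height;
    this.nodes = new Map();  // id -> { id, label }
    this.edges = [];         // { source, target }
  }

  addNode(id, label) {
    this.nodes.set(id, { id: id, label: label });
    return this;             // returning `this` lets calls be chained
  }

  addEdge(source, target) {
    if (!this.nodes.has(source) || !this.nodes.has(target)) {
      throw new Error("both endpoints must be added before the edge");
    }
    this.edges.push({ source: source, target: target });
    return this;
  }

  // Scale that fits a layout of the given size into the viewport.
  // Falls back to 1 for an empty layout instead of dividing by zero.
  fitScale(layoutWidth, layoutHeight) {
    if (layoutWidth <= 0 || layoutHeight <= 0) {
      return 1;
    }
    return Math.min(this.width / layoutWidth, this.height / layoutHeight);
  }
}

// Placeholder node ids and labels, for illustration only.
const graph = new EtyGraph(800, 600)
  .addNode("n1", "word")
  .addNode("n2", "ancestor")
  .addEdge("n1", "n2");
console.log(graph.nodes.size, graph.edges.length, graph.fitScale(1600, 400));
```

Each drawing step would then read from and write to the same object, so the next step can check the current state before acting.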

Collapsible nodes

You had mentioned wanting collapsible nodes so as to make the whole tree navigation smoother and less cluttered.

I guess we have to figure out which functionalities we really want or don't want and the right UX approach to them.

Right now, if you click on a node the tooltip shows up.
If you click on the language code below the node, a tooltip stating the language shows up.
How would you want to handle collapsing or opening nodes? We could add clickable + and - signs

What were your thoughts here?
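As a starting point for the discussion, collapsing can be modelled as a flag on each node that hides its descendants. The sketch below assumes a plain tree of { id, collapsed, children } objects, which is a simplification (the real etymology graph is not necessarily a tree), and the function names are made up for illustration.

```javascript
// Hypothetical node shape: { id, collapsed, children }.
function toggleCollapsed(node) {
  node.collapsed = !node.collapsed;
  return node;
}

// Returns the ids of all nodes that should currently be drawn:
// a collapsed node is shown, but its descendants are skipped.
function visibleIds(node) {
  const ids = [node.id];
  if (!node.collapsed) {
    for (const child of node.children) {
      ids.push(...visibleIds(child));
    }
  }
  return ids;
}

const tree = {
  id: "root",
  collapsed: false,
  children: [
    { id: "a", collapsed: false, children: [{ id: "a1", collapsed: false, children: [] }] },
    { id: "b", collapsed: false, children: [] }
  ]
};

console.log(visibleIds(tree));      // all four nodes
toggleCollapsed(tree.children[0]);  // e.g. the user clicks the "-" sign on node "a"
console.log(visibleIds(tree));      // "a1" is now hidden
```

A click handler on the + and - signs would call toggleCollapsed and then redraw only the nodes returned by visibleIds.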

missing demo.css file

There is a 404 error in the browser from a
GET /resources/css/demo.css
coming from index.html line 5 <link rel="stylesheet" href="../css/demo.css">

Is that file supposed to be there, or is that line supposed to be removed?

Disambiguation not loading

This started after my last pull from master. When I search for a term, I get the following in my console:

1280
load.js:182 loading languages
etytree.js:25 searching word in database
etytree.js:30 https://etytree-virtuoso.wmflabs.org/sparql?query=PREFIX%20dbetym%3A%20%3Ch…s.org%2F%2Fdbnaryetymology%23EtymologyEntry%3E%20.%20%20%20%20%20%7D%20%7D
dagre.js:174 Object {}

d3.v3.min.js:1 Error: <g> attribute transform: Expected number, "translate(Infinity,20)scal…".
i @ d3.v3.min.js:1
(anonymous) @ d3.v3.min.js:3
Y @ d3.v3.min.js:1
Co.each @ d3.v3.min.js:3
Co.attr @ d3.v3.min.js:3
(anonymous) @ dagre.js:201
t @ d3.v3.min.js:1
(anonymous) @ d3.v3.min.js:1
c @ d3.v3.min.js:3
(anonymous) @ d3.v3.min.js:3
(anonymous) @ d3.v3.min.js:3
Y @ d3.v3.min.js:1
Co.each @ d3.v3.min.js:3
n.event @ d3.v3.min.js:3
drawDisambiguationDAGRE @ dagre.js:246
source.subscribe.response @ etytree.js:34
a.__tryOrUnsub @ Rx.min.js:43
a.next @ Rx.min.js:42
a._next @ Rx.min.js:40
a.next @ Rx.min.js:40
req.onload @ load.js:12

etytree.js:36 done disambiguation

I used line breaks to separate out the error from the rest of the code. I will look at the new changes and see what I discover.

Cleaning Up the code

The application works now, but it is just large enough that it is going to get harder to fix and prevent bugs if we don't start cleaning it up. A lot of these same things apply towards getting more exposure and making it easier for people to jump on this project.

  • On the most basic level, we need to reorganize the files and folders just a little more so that it's clear what the hierarchy is. That is what I tried to do with pull request #14

  • On the next level the code needs to be reformatted a little for readability, and rearranged some. This will help with debugging.

  • I also recommend adding some tools that help us keep our code clean. I will add those in another branch and create a pull request for them.

@esterpantaleo We can chat about that in our next Skype.

API

Similarly to what happens with the virtuoso faceted browser I would like to be able to go to link:

http://tools.wmflabs.org/etytree/test
to visualize the disambiguation page for word test
in wiktionary https://en.wiktionary.org/wiki/test
in etytree maybe state {word: "test"}

http://tools.wmflabs.org/etytree/test/eng#Etymology_1
to visualize the tree for English word test, etymology 1 (ety = 1)
in virtuoso: http://etytree-virtuoso.wmflabs.org/dbnary/eng/__ee_1_test
in wiktionary https://en.wiktionary.org/wiki/test#Etymology_1
in etytree maybe state {word: "test", language: "eng", etymology: 1}

http://tools.wmflabs.org/etytree/test/eng#Etymology_2
to visualize the tree for English word test, etymology 2 (ety = 2)
in virtuoso: http://etytree-virtuoso.wmflabs.org/dbnary/eng/__ee_2_test
in wiktionary https://en.wiktionary.org/wiki/test#Etymology_2
in etytree maybe state {word: "test", language: "eng", etymology: 2}

we could use (with state being one of the objects above):
window.location.hash = JSON.stringify(state)
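A rough sketch of how the proposed links could be mapped to such a state object follows. The function names are hypothetical, and the URL layout (tool name, then word, then language, with the etymology number in the fragment) is simply the one proposed in this issue.

```javascript
// Hypothetical helper: turns a proposed etytree link into a state object.
function parseEtytreeUrl(href) {
  const url = new URL(href);
  // Drop empty segments and the leading tool name ("etytree").
  const segments = url.pathname
    .split("/")
    .filter(Boolean)
    .slice(1)
    .map(decodeURIComponent);
  const state = {};
  if (segments.length > 0) {
    state.word = segments[0];
  }
  if (segments.length > 1) {
    state.language = segments[1];
  }
  const match = /^#Etymology_(\d+)$/.exec(url.hash);
  if (match) {
    state.etymology = Number(match[1]);
  }
  return state;
}

// Alternative from the last line of the issue: keep the whole state in the hash.
function stateToHash(state) {
  return "#" + encodeURIComponent(JSON.stringify(state));
}

function hashToState(hash) {
  return JSON.parse(decodeURIComponent(hash.slice(1)));
}

console.log(parseEtytreeUrl("http://tools.wmflabs.org/etytree/test"));
console.log(parseEtytreeUrl("http://tools.wmflabs.org/etytree/test/eng#Etymology_2"));
console.log(stateToHash({ word: "test", language: "eng", etymology: 1 }));
```

In the browser, window.location.hash = stateToHash(state) would store the state, and hashToState(window.location.hash) would read it back on page load.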

mouseover error from cperacontent.js

There seems to be an error from some mouseover event firing a LOT when the mouse moves around the tree:

cperacontent.js:641 Uncaught TypeError: Cannot read property 'startContainer' of null
    at Object._onMouseMove (cperacontent.js:641)
    at onMouseMove (cperacontent.js:636)
_onMouseMove @ cperacontent.js:641
onMouseMove @ cperacontent.js:636

I am not sure exactly what it is firing on, but this may even be causing some of the other errors. I will try and pinpoint what is causing this.

dot appearing when opening tooltip and before tooltip text loads

Per the message you sent me, this is the issue for the small, blue dot that appears in the corner of what will become the tooltip after you click on a word (or do anything to open a tooltip).

  1. You click a word
  2. A dot/circle appears
  3. The dot/circle disappears and the tooltip shows up

At first glance, this looks like it probably has something to do with the tooltip css. But maybe there is something about the timing that can be changed to make sure the text loads before the tooltip becomes visible.
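If it turns out to be a timing problem, one option is to keep the tooltip hidden until its text has arrived. The sketch below models the tooltip as a plain object so the ordering is easy to see; the names createTooltip, openTooltip and fetchText are hypothetical, and in the page the two fields would correspond to the tooltip's content and its CSS visibility.

```javascript
// Hypothetical tooltip model: `visible` stands in for the CSS visibility.
function createTooltip() {
  return { text: "", visible: false };
}

// fetchText(callback) is a hypothetical loader that eventually calls
// callback(text), for example once the SPARQL response has arrived.
function openTooltip(tooltip, fetchText) {
  tooltip.visible = false;   // stay hidden while the text is loading
  fetchText(function (text) {
    tooltip.text = text;     // fill in the content first...
    tooltip.visible = true;  // ...and only then show the tooltip
  });
  return tooltip;
}

// Usage with a loader that answers immediately.
const demoTooltip = openTooltip(createTooltip(), function (callback) {
  callback("example text");
});
console.log(demoTooltip);
```

With this ordering an empty tooltip is never displayed, which should also hide the dot if the dot is just the empty tooltip box.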

function naming and potential overwrite

It looks like we have the same function get() definition twice:

The first time is in the dagre-d3.js library, line 14764 (ish)
Then again in load.js, at the bottom

This is probably leading to unintended side-effects, and we should change the load.js function name. Etytree calls the function to assign the return value to const source.
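Besides renaming, another way to avoid the clash is to put the load.js helper on a namespace object. The sketch below is only an illustration: the first get() is a stand-in for a library function, not the real dagre-d3 code, and etytreeLoad is a made-up name with a placeholder body.

```javascript
// Stand-in for a library-level get(); NOT the actual dagre-d3 implementation.
function get(object, key) {
  return object[key];
}

// Hypothetical namespace for the load.js helper. Because the function is a
// property of etytreeLoad, it no longer competes for the global name `get`.
const etytreeLoad = {
  get: function (url) {
    // Placeholder body; the real function would issue the request.
    return { requested: url };
  }
};

console.log(get({ a: 1 }, "a"));           // library-style get still works
console.log(etytreeLoad.get("some/url"));  // project helper, called explicitly
```

The call site in etytree.js would then read const source = etytreeLoad.get(url), which also makes it obvious which get() is meant.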

tooltip extending out of page

When the tooltip extends out of the page, you can scroll to see the complete tooltip, but the page is cut (see screenshot)
[Screenshot, 2017-09-05 at 11:26:27: tooltip extending out of the page]
