
pyclics's Introduction

CLICS - The Database of Cross-Linguistic Colexifications

The original Database of Cross-Linguistic Colexifications (CLICS) established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and on novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³, the third installment of CLICS, exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.

Publications

Release | Authors | Title | Reference
CLICS | List, Terhalle, and Urban | Using network approaches to enhance the analysis of cross-linguistic polysemies | List2013a
CLICS | Mayer, List, Terhalle, and Urban | An Interactive Visualization of Crosslinguistic Colexification Patterns | Mayer2014
CLICS² | List, Greenhill, Anderson, Mayer, Tresoldi, and Forkel | CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats | List2018e
CLICS³ | Rzymski, Tresoldi, et al. | The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies | PREPRINT

Datasets and Software

Datasets providing the lexical data aggregated in CLICS, as well as the software tooling for the CLICS processing workflow, are accessible and archived on Zenodo via the CLICS community.

Web application

Since CLICS², the latest release of the CLICS database and colexification network can be explored in a clld application at clics.clld.org.

Contributors

Find information about contributors and grants in CONTRIBUTORS.md.


pyclics's Issues

handling of subgraph attributes

In order to guarantee a balanced subgraph view, a bit of tinkering with the data is required. The algorithm is small and rather straightforward: it aims for a balanced sample, with neither too many nor too few nodes.

The question is how to represent the subgraph attributes. I figured that it is easiest to add them as a list in GML format.

This means that, when loading the graph (which will also contain the Infomap clusters), it is easiest to access the subgraph like this:

In [8]: list(G.node['1273']['subgraph'])
Out[8]: ['1273', '1931', '2131', '1273', '1273', '630', '221', '1931', '2131']

In [9]: subg = G.subgraph(G.node['1273']['subgraph'])

In [10]: len(subg)
Out[10]: 5

This makes it extremely convenient to access a given subgraph, both from within the API and when simply loading the GML. I think I'll add the same format to store the Infomap attributes, as it seems that this will be useful for the treatment of the data in CLLD as well.
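
A minimal sketch of the access pattern with current NetworkX (which uses G.nodes rather than the older G.node); the toy graph and membership list are illustrative:

import networkx as nx

# Toy colexification graph with Concepticon-style string IDs.
G = nx.Graph()
G.add_edges_from([("1273", "1931"), ("1273", "2131"), ("630", "221")])

# Store the (possibly repetitive) membership list as a plain node attribute.
G.nodes["1273"]["subgraph"] = ["1273", "1931", "2131", "1273", "630", "221"]

# G.subgraph() deduplicates the list, so repeated IDs are harmless.
subg = G.subgraph(G.nodes["1273"]["subgraph"])
print(len(subg))  # 5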

Handling "or" concepts in clics

ARM OR HAND is a colexification itself, so if our dataset contains it, we won't capture the colexification, since it is silently annotated in the original data. For an arm/hand survey, however, we'd like to split those. Can or should we try to do this upon import from CLDF? The procedure would be: search for " or " glosses and split them according to their narrower descendants. Caveat: not all relations are amenable to that, so maybe we'll need a hand-coded list (but that's perfectly doable!); see the sketch below.
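
A sketch of the splitting step upon import; the SPLITS mapping stands in for the hand-coded list mentioned above and is purely illustrative:

# Hypothetical hand-coded list of "or" glosses and their narrower descendants.
SPLITS = {"ARM OR HAND": ["ARM", "HAND"]}

def split_gloss(gloss):
    """Split an 'or' gloss into its narrower concepts, if we know how."""
    key = gloss.upper()
    if " OR " in key:
        return SPLITS.get(key, [gloss])
    return [gloss]

print(split_gloss("arm or hand"))  # ['ARM', 'HAND']
print(split_gloss("tree"))         # ['tree']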

GML & ASCII

Piggybacking on the discussion in https://github.com/clics/clics2/issues/39: currently, the 'default' GML written by running cluster CLUSTERMETHOD etc. can't easily be read back with NetworkX, due to non-ASCII (and non-HTML-entified) characters appearing in the GML.

Do we have a straightforward way of dealing with this?
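
One possible workaround, sketched below: escape non-ASCII characters to XML character references before writing, so the resulting GML is pure ASCII (recent NetworkX releases do some escaping themselves; this is only relevant for setups where round-tripping still fails):

import networkx as nx

def asciify(value):
    """Replace non-ASCII characters with XML character references."""
    if isinstance(value, str):
        return value.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return value

G = nx.Graph()
G.add_node("1277", gloss="ḿ-bɔ́")  # attribute with non-ASCII IPA characters

for _, data in G.nodes(data=True):
    for key, value in list(data.items()):
        data[key] = asciify(value)

nx.write_gml(G, "graph.gml")
H = nx.read_gml("graph.gml")  # round-trips cleanly as pure ASCII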

More verbose output for problematic datasets

If a dataset is missing crucial information (e.g. Glottocodes or lexemes), clics load succeeds silently, but subsequent commands such as clics datasets fail uninformatively. Potential pointers to fixes (0 Glottocodes, 0 lexemes, etc.) would be helpful here.
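
A sketch of the kind of sanity check that could run during loading; the function and messages are illustrative, not existing pyclics internals:

def check_dataset(dataset_id, n_glottocodes, n_lexemes):
    """Print actionable warnings instead of failing silently later."""
    problems = []
    if n_glottocodes == 0:
        problems.append("0 Glottocodes (check the LanguageTable mapping)")
    if n_lexemes == 0:
        problems.append("0 lexemes (check the FormTable)")
    for problem in problems:
        print("WARNING: {}: {}".format(dataset_id, problem))

check_dataset("somedataset", 0, 1200)  # WARNING: somedataset: 0 Glottocodes ...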

Best coverage subsets for three varying numbers of datasets

If we follow the plan to offer three different networks, namely a high-coverage one with many languages and, say, 300 concepts, one with fewer languages but more concepts (say 600), and one with the maximum we can get, we need to use the coverage code in LingPy to account for this.

This code is now straightforward, but the question is: do we actually still need this, or do we rather just take the full dump of 2000 concepts? Given that we know the frequency of each concept in CLICS, we can easily visualize this by scaling node size. And the communities still make sense; so far, we do not suffer from skewed data...
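
Independently of the LingPy helper, the selection logic itself is small; a sketch assuming a mapping from language to the set of concepts it attests (all names illustrative):

from collections import Counter

def concepts_with_coverage(concepts_by_language, min_languages):
    """Return the concepts attested in at least `min_languages` languages."""
    counts = Counter(
        concept
        for concepts in concepts_by_language.values()
        for concept in set(concepts)
    )
    return {c for c, n in counts.items() if n >= min_languages}

data = {"l1": {"HAND", "ARM"}, "l2": {"HAND"}, "l3": {"HAND", "FOOT"}}
print(concepts_with_coverage(data, 2))  # {'HAND'}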

subgraphs are not exactly created as with clics2

Since subgraph is now treated as just another cluster algorithm, the resulting clusters are filtered, and only clusters with more than one node are kept. This behaviour differs from what we had for subgraphs before, where each node was guaranteed to be present in at least one subgraph.
AFAICT this isn't really important, because the web app will list all colexified concepts for each node anyway, i.e. what would have appeared as "out edges" in the network view is accessible elsewhere.

So, considering this is mostly a data display issue, and we are moving towards

  • other cluster algos
  • "whole-network" display techniques

anyway, I wouldn't bother re-implementing the old behaviour. Opinions?
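
For reference, restoring the old guarantee on top of the filtered clusters would be small; a sketch, assuming clusters is a list of node sets over the graph G:

def ensure_node_coverage(G, clusters):
    """Append a singleton cluster for every node not covered by any cluster."""
    covered = set().union(*clusters) if clusters else set()
    return clusters + [{node} for node in G if node not in covered]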

Glosses and concepts reported in statistics

The clics datasets command reports gloss counts that don't necessarily match the number of parameters in the corresponding datasets (or, for that matter, in the loaded CLICS database). For example, the beidasinitic dataset has 905 unique parameters, of which 713 are mapped to Concepticon, but the reported gloss count is 892; for lexirumah, the number of glosses is actually lower than the number of Concepticon entries:

# | Dataset | Glosses | Concepticon | Varieties | Glottocodes | Families
4 | beidasinitic | 892 | 713 | 18 | 18 | 1
15 | lexirumah | 588 | 601 | 133 | 110 | 4

This happens because the corresponding concepts_by_dataset field (here) is collected by querying for the count of distinct parameter names, not IDs. The name values, however, are not necessarily distinct, nor does the field always carry the elicitation gloss. To give a single example, in beidasinitic two different Chinese entries used for elicitation, 又 and 再, are both translated to English in the name field as "again"; in this case only one of them is mapped, but I believe that in other datasets there are cases of parameters mapped to different Concepticon entries while sharing the same name value (like the arrows in Papuan languages and the bird names in Pama-Nyungan).

While the information is technically right (there are, indeed, 892 distinct gloss names in beidasinitic), and while the parameter count can be obtained with Lexibank, such information can be confusing: a naïve user would expect the counts of the total number of parameters and of the Concepticon entries, not considering that different parameters might share the reported name. My suggestion is to collect both counts in the line linked above (thus having SELECT ds.id, count(distinct p.concepticon_id), count(distinct p.name), count(p.id)) and to modify the datasets command (here) to report both "parameters" and "glosses". I could take care of that.
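
Spelled out against the loaded SQLite database, the suggestion looks roughly like this; the parametertable name and its columns are assumptions about the schema, not verified:

import sqlite3

con = sqlite3.connect("clics.sqlite")
# Table/column names follow the snippet above; the actual schema may differ.
rows = con.execute("""
    SELECT ds.id,
           count(DISTINCT p.concepticon_id) AS concepticon,
           count(DISTINCT p.name) AS glosses,
           count(p.id) AS parameters
    FROM dataset AS ds
    JOIN parametertable AS p ON p.dataset_id = ds.id
    GROUP BY ds.id
""").fetchall()
for dataset_id, concepticon, glosses, parameters in rows:
    print(dataset_id, concepticon, glosses, parameters)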

load throws an error when an empty sqlite is left over from a failed attempt

I would say: clics load should ideally delete the clics.sqlite first, to make sure this does not happen. Otherwise, one gets the error message:

$ clics load
concepticon and glottolog repos locations must be specified!
$ clics load /path/to/concepticon /path/to/glottolog
sqlite3.OperationalError: no such table: dataset

No big deal, but it will yield confused users.
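
The fix could be as simple as removing any half-written database before loading; a sketch (how pyclics wires its commands will differ):

from pathlib import Path

def fresh_load(db_path="clics.sqlite"):
    """Delete a stale database left over from a failed run, then reload."""
    db = Path(db_path)
    if db.exists():
        db.unlink()
    # ... proceed with the regular `clics load`, creating all tables anew ...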

Update for CLICS4

For CLICS4, we will have some 52 datasets, all segmented and therefore analyzable with LingPy's cognate detection methods. This means we can offer enhanced networks, which requires integrating code that has been written, but not yet for pyclics:

  1. code for the identification of cognates among colexifications in the same family (https://github.com/clics/clicsbp/blob/fd571023865366e5be654d6ff05f1f36dcba1272/clicsbpcommands/colexifications.py#L173-L217)
  2. code for the computation of weights using random walks (this will increase the paths among concepts through neighbors and could be useful for semantic metrics in the future, but it is not clear how feasible it is to run it on all data: https://github.com/clics/clicsbp/blob/fd571023865366e5be654d6ff05f1f36dcba1272/clicsbpcommands/colexifications.py#L127-L167)

Given that we were asked about certain aspects of the CLICS data where the data online differs from the data we report in Concepticon (e.g., weighted degree, etc.), it would this time also be good to compute the Concepticon table (or NoRaRe table) directly when computing CLICS, so we have a concrete reference and no hidden script that runs on somebody's computer and is not officially shared. So, when doing the colexification search, we should additionally:

  1. compute statistics (weighted degree, degree; see the sketch below)
  2. run the subgraph method, which is now also run directly in CLLD in the Python code, to determine the subgraphs

All in all, this is SOME work to be done.

To explain the subgraph issue: we had some users asking why the data on the website differs from the data in the Concepticon version of CLICS3 (the Rzymski-2020-XXXX list).
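
For point 1 above, the statistics are cheap to compute once the colexification graph exists; a sketch with NetworkX, where the FamilyWeight edge attribute is an illustrative name:

import networkx as nx

def concept_statistics(G, weight="FamilyWeight"):
    """Return degree and weighted degree for every concept node."""
    return {
        node: {"degree": G.degree(node),
               "weighted_degree": G.degree(node, weight=weight)}
        for node in G
    }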

handling of language isolates in the graphs

The standalone app uses wrong names for language isolates: it currently shows the Glottocode as the name of the language family instead of the name of the language itself. Should be easy to fix.
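
The fix is presumably a one-line fallback; a sketch, assuming language objects with name and family attributes (illustrative, not the actual pyclics data model):

def family_label(language):
    """Isolates have no family, so fall back to the language's own name."""
    return language.family if language.family else language.name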

double entries in clics

https://clics.clld.org/edges/1819-1997

Thanks to Thanasis Georgakopoulos for pointing this out. The problem is: if two concepts are lexified by a form A and a variant B, and both are the same in their clics-value, the resulting count is 4 (A:B, A:A, B:B, B:A). This is not dramatic for our general CLICS statistics, as they are based on families, but it should be avoided by counting only one of the identical variants per word in the algorithm.
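
A sketch of the counting fix: collapse identical variants per word before pairing forms across the two concepts (all names illustrative):

from itertools import product

def colexification_count(forms_a, forms_b):
    """Count colexifying pairs after collapsing identical variants per word."""
    return sum(1 for a, b in product(set(forms_a), set(forms_b)) if a == b)

# Variants A and B that are identical in their clics-value:
print(colexification_count(["hand", "hand"], ["hand", "hand"]))  # 1, not 4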

Refactor communities and subgraph into plugins

As a step toward making pyclics extensible, the communities and subgraph commands should be refactored into plugins, serving as examples of other cluster algorithms that could be run on the colexification graph.
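
One common Python pattern for this kind of extensibility (a sketch of the idea, not the mechanism pyclics necessarily adopted): discover cluster algorithms via entry points, so third-party packages can register additional ones; the group name pyclics.clusterers is hypothetical:

from importlib.metadata import entry_points  # Python 3.10+ keyword form

def load_cluster_algorithms(group="pyclics.clusterers"):
    """Collect registered cluster algorithms as {name: callable}."""
    return {ep.name: ep.load() for ep in entry_points(group=group)}

# A plugin package would declare in its packaging metadata, e.g.:
# [project.entry-points."pyclics.clusterers"]
# infomap = "myplugin.cluster:run_infomap"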

treatment of fuzzy concepts

I was just looking at the graph for fingernail, and I realized that, due to fuzzy Concepticon concepts like claw or nail, we have at times strange networks.

I don't consider this a huge problem, but I imagine we could automatically expand these and add two translations for the respectively narrower concepts in the hierarchy. As we would still store the "original concept", this should cause fewer problems in terms of consistency. The only caveat is the depth of going down, so I'd restrict this to one step in the network (arm or hand -> arm / hand, not more); see the sketch below.
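
A sketch of the one-step expansion; the NARROWER mapping stands in for the Concepticon hierarchy and is illustrative:

# Hypothetical one-step narrower-concept relations from the hierarchy.
NARROWER = {"ARM OR HAND": ["ARM", "HAND"]}

def expand_fuzzy(concept):
    """Add the immediately narrower concepts, keeping the original concept."""
    # Restricted to one step: we do not recurse into the children.
    return [concept] + NARROWER.get(concept, [])

print(expand_fuzzy("ARM OR HAND"))  # ['ARM OR HAND', 'ARM', 'HAND']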

Running the app locally is difficult with modern browsers

Most modern browsers don't allow loading JavaScript from file:// URLs without explicitly setting flags or allowing it in custom configurations. This makes the app unusable locally.

This can be circumvented (though not exactly comfortable for the end user either ...) by, e.g., running from the app/ directory:

python -m http.server 8000 & firefox localhost:8000

use ALL colexifications available inside a community

This will slightly complicate the network creation, but it can be useful:

  • create communities with a given weight for a graph clone
  • when assembling the subgraphs, display ALL colexifications

This will help to find cases of multiple colexification. E.g., the hand/arm subgraph links to "fathom" (a measure), and one of the languages colexifies all three: hand, arm, and fathom. That edge was kicked out because only edges of size > 3 are regarded, but here, inside a cluster, it seems justified to keep those.
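
A sketch of the two bullets with NetworkX: detect communities on a weight-pruned clone, then display the community's subgraph induced from the full graph, so all colexifications among its members survive:

import networkx as nx

def pruned_clone(G, min_weight=3):
    """Clone for community detection: keep only edges above the weight cutoff."""
    keep = [(u, v) for u, v, w in G.edges(data="weight", default=0) if w > min_weight]
    return G.edge_subgraph(keep)

def community_display(G, community):
    """Display view induced from the FULL graph: weak in-community edges stay."""
    return G.subgraph(community)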
