
pyclics's Introduction

CLICS - The Database of Cross-Linguistic Colexifications

The original Database of Cross-Linguistic Colexifications (CLICS) established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and on novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³, the third installment of CLICS, exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.

Publications

Release | Authors | Title | Reference
CLICS | List, Terhalle, and Urban | Using network approaches to enhance the analysis of cross-linguistic polysemies | List2013a
CLICS | Mayer, List, Terhalle, and Urban | An Interactive Visualization of Crosslinguistic Colexification Patterns | Mayer2014
CLICS² | List, Greenhill, Anderson, Mayer, Tresoldi, and Forkel | CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats | List2018e
CLICS³ | Rzymski, Tresoldi, et al. | The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies | PREPRINT

Datasets and Software

Datasets providing the lexical data aggregated in CLICS, as well as the software tooling for the CLICS processing workflow, are accessible and archived on Zenodo via the CLICS community.

Web application

Since CLICS², the latest release of the CLICS database and colexification network can be explored in a clld application at clics.clld.org.

Contributors

Find information about contributors and grants in CONTRIBUTORS.md.


pyclics's Issues

handling of subgraph attributes

In order to guarantee a balanced subgraph view, a bit of tinkering with the data is required. The algorithm is small and rather straightforward: it aims for a balanced sample, with neither too many nor too few nodes.

The question is how to represent the subgraph attributes. I figured that it is easiest to add them as a list in GML format.

This means that, when loading the graph (which will also contain the Infomap clusters), it is easiest to access the subgraph like this:

In [8]: list(G.node['1273']['subgraph'])
Out[8]: ['1273', '1931', '2131', '1273', '1273', '630', '221', '1931', '2131']

In [9]: subg = G.subgraph(G.node['1273']['subgraph'])

In [10]: len(subg)
Out[10]: 5

This makes it extremely convenient to access a given subgraph, both from within the API and when simply loading the GML. I think I'll add the same format to store the Infomap attributes, as it seems that this will be useful for the treatment of the data in CLLD as well.
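
A minimal sketch of the access pattern with current NetworkX (which uses G.nodes rather than the older G.node); the toy graph and membership list are illustrative:

import networkx as nx

# Toy colexification graph with Concepticon-style string IDs.
G = nx.Graph()
G.add_edges_from([("1273", "1931"), ("1273", "2131"), ("630", "221")])

# Store the (possibly repetitive) membership list as a plain node attribute.
G.nodes["1273"]["subgraph"] = ["1273", "1931", "2131", "1273", "630", "221"]

# G.subgraph() deduplicates the list, so repeated IDs are harmless.
subg = G.subgraph(G.nodes["1273"]["subgraph"])
print(len(subg))  # 5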

Handling "or" concepts in clics

ARM OR HAND is a colexification itself, so if our dataset contains it, we won't capture the colexification, since it is silently annotated in the original data. For an arm/hand survey, however, we'd like to split those. Can or should we try to do this upon import from CLDF? The procedure would be: search for " or " glosses and split them according to their narrower descendants. Caveat: not all relations are amenable to that, so maybe we'll need a hand-coded list (but that's perfectly doable!); see the sketch below.
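
A sketch of the splitting step upon import; the SPLITS mapping stands in for the hand-coded list mentioned above and is purely illustrative:

# Hypothetical hand-coded list of "or" glosses and their narrower descendants.
SPLITS = {"ARM OR HAND": ["ARM", "HAND"]}

def split_gloss(gloss):
    """Split an 'or' gloss into its narrower concepts, if we know how."""
    key = gloss.upper()
    if " OR " in key:
        return SPLITS.get(key, [gloss])
    return [gloss]

print(split_gloss("arm or hand"))  # ['ARM', 'HAND']
print(split_gloss("tree"))         # ['tree']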

GML & ASCII

Piggybacking on the discussion in https://github.com/clics/clics2/issues/39: currently, the 'default' GML written by running cluster CLUSTERMETHOD etc. can't easily be read back with NetworkX, due to non-ASCII (and non-HTML-entified) characters appearing in the GML.

Do we have a straightforward way of dealing with this?
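
One possible workaround, sketched below: escape non-ASCII characters to XML character references before writing, so the resulting GML is pure ASCII (recent NetworkX releases do some escaping themselves; this is only relevant for setups where round-tripping still fails):

import networkx as nx

def asciify(value):
    """Replace non-ASCII characters with XML character references."""
    if isinstance(value, str):
        return value.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return value

G = nx.Graph()
G.add_node("1277", gloss="ḿ-bɔ́")  # attribute with non-ASCII IPA characters

for _, data in G.nodes(data=True):
    for key, value in list(data.items()):
        data[key] = asciify(value)

nx.write_gml(G, "graph.gml")
H = nx.read_gml("graph.gml")  # round-trips cleanly as pure ASCII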

More verbose output for problematic datasets

If a dataset is missing crucial information (e.g. Glottocodes or lexemes), clics load succeeds silently, but subsequent commands such as clics datasets fail uninformatively. Potential pointers to fixes (0 Glottocodes, 0 lexemes, etc.) would be helpful here.
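
A sketch of the kind of sanity check that could run during loading; the function and messages are illustrative, not existing pyclics internals:

def check_dataset(dataset_id, n_glottocodes, n_lexemes):
    """Print actionable warnings instead of failing silently later."""
    problems = []
    if n_glottocodes == 0:
        problems.append("0 Glottocodes (check the LanguageTable mapping)")
    if n_lexemes == 0:
        problems.append("0 lexemes (check the FormTable)")
    for problem in problems:
        print("WARNING: {}: {}".format(dataset_id, problem))

check_dataset("somedataset", 0, 1200)  # WARNING: somedataset: 0 Glottocodes ...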

Best coverage subsets for three varying numbers of datasets

If we follow the plan to offer three different networks, namely a high-coverage one with many languages and, say, 300 concepts, one with fewer languages but more concepts (say 600), and one with the maximum we can get, we need to use the coverage code in LingPy to account for this.

This code is now straightforward, but the question is: do we actually still need this, or do we rather just take the full dump of 2000 concepts? Given that we know the frequency of each concept in CLICS, we can easily visualize this by scaling node size. And the communities still make sense; so far, we do not suffer from skewed data...
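
Independently of the LingPy helper, the selection logic itself is small; a sketch assuming a mapping from language to the set of concepts it attests (all names illustrative):

from collections import Counter

def concepts_with_coverage(concepts_by_language, min_languages):
    """Return the concepts attested in at least `min_languages` languages."""
    counts = Counter(
        concept
        for concepts in concepts_by_language.values()
        for concept in set(concepts)
    )
    return {c for c, n in counts.items() if n >= min_languages}

data = {"l1": {"HAND", "ARM"}, "l2": {"HAND"}, "l3": {"HAND", "FOOT"}}
print(concepts_with_coverage(data, 2))  # {'HAND'}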

subgraphs are not exactly created as with clics2

Since subgraph is now treated as just another cluster algorithm, the resulting clusters are filtered, and only clusters with more than one node are kept. This behaviour differs from what we had for subgraphs before, where each node was guaranteed to be present in at least one subgraph.
AFAICT this isn't really important, because the web app will list all colexified concepts for each node anyway, i.e. what would have appeared as "out edges" in the network view is accessible elsewhere.

So, considering this is mostly a data display issue, and we are moving towards

  • other cluster algos
  • "whole-network" display techniques

anyway, I wouldn't bother re-implementing the old behaviour. Opinions?
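
For reference, restoring the old guarantee on top of the filtered clusters would be small; a sketch, assuming clusters is a list of node sets over the graph G:

def ensure_node_coverage(G, clusters):
    """Append a singleton cluster for every node not covered by any cluster."""
    covered = set().union(*clusters) if clusters else set()
    return clusters + [{node} for node in G if node not in covered]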

Glosses and concepts reported in statistics

The clics datasets command reports gloss counts that don't necessarily match the number of parameters in the corresponding datasets (or, for that matter, in the loaded CLICS database). For example, the beidasinitic dataset has 905 unique parameters, of which 713 are mapped to Concepticon, but the reported gloss count is 892; for lexirumah, the number of glosses is actually lower than the number of Concepticon entries:

# | Dataset | Glosses | Concepticon | Varieties | Glottocodes | Families
4 | beidasinitic | 892 | 713 | 18 | 18 | 1
15 | lexirumah | 588 | 601 | 133 | 110 | 4

This happens because the corresponding concepts_by_dataset field (here) is collected by querying for the count of distinct parameter names, not IDs. The name values, however, are not necessarily distinct, nor does the field always carry the elicitation gloss. To give a single example, in beidasinitic two different Chinese entries used for elicitation, 又 and 再, are both translated to English in the name field as "again"; in this case only one of them is mapped, but I believe that in other datasets there are cases of parameters mapped to different Concepticon entries while sharing the same name value (like the arrows in Papuan languages and the bird names in Pama-Nyungan).

While the information is technically right (there are, indeed, 892 distinct gloss names in beidasinitic), and while the parameter count can be obtained with Lexibank, such information can be confusing: a naïve user would expect the counts of the total number of parameters and of the Concepticon entries, not considering that different parameters might share the reported name. My suggestion is to collect both counts in the line linked above (thus having SELECT ds.id, count(distinct p.concepticon_id), count(distinct p.name), count(p.id)) and to modify the datasets command (here) to report both "parameters" and "glosses". I could take care of that.
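
Spelled out against the loaded SQLite database, the suggestion looks roughly like this; the parametertable name and its columns are assumptions about the schema, not verified:

import sqlite3

con = sqlite3.connect("clics.sqlite")
# Table/column names follow the snippet above; the actual schema may differ.
rows = con.execute("""
    SELECT ds.id,
           count(DISTINCT p.concepticon_id) AS concepticon,
           count(DISTINCT p.name) AS glosses,
           count(p.id) AS parameters
    FROM dataset AS ds
    JOIN parametertable AS p ON p.dataset_id = ds.id
    GROUP BY ds.id
""").fetchall()
for dataset_id, concepticon, glosses, parameters in rows:
    print(dataset_id, concepticon, glosses, parameters)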

load throws an error when an empty sqlite is left over from a failed attempt

I would say: clics load should ideally delete the clics.sqlite first, to make sure this does not happen. Otherwise, one gets the error message:

$ clics load
concepticon and glottolog repos locations must be specified!
$ clics load /path/to/concepticon /path/to/glottolog
sqlite3.OperationalError: no such table: dataset

No big deal, but it will yield confused users.
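
The fix could be as simple as removing any half-written database before loading; a sketch (how pyclics wires its commands will differ):

from pathlib import Path

def fresh_load(db_path="clics.sqlite"):
    """Delete a stale database left over from a failed run, then reload."""
    db = Path(db_path)
    if db.exists():
        db.unlink()
    # ... proceed with the regular `clics load`, creating all tables anew ...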

Update for CLICS4

For CLICS4, we will have some 52 datasets, all segmented and therefore analyzable with LingPy's cognate detection methods. This means we can offer enhanced networks, which requires integrating code that has been written, but not yet for pyclics:

  1. code for the identification of cognates among colexifications in the same family (https://github.com/clics/clicsbp/blob/fd571023865366e5be654d6ff05f1f36dcba1272/clicsbpcommands/colexifications.py#L173-L217)
  2. code for the computation of weights using random walks (this will increase the paths among concepts through neighbors and could be useful for semantic metrics in the future, but it is not clear how feasible it is to run it on all data: https://github.com/clics/clicsbp/blob/fd571023865366e5be654d6ff05f1f36dcba1272/clicsbpcommands/colexifications.py#L127-L167)

Given that we were asked about certain aspects of the CLICS data where the data online differs from the data we report in Concepticon (e.g., weighted degree, etc.), it would this time also be good to compute the Concepticon table (or NoRaRe table) directly when computing CLICS, so we have a concrete reference and no hidden script that runs on somebody's computer and is not officially shared. So, when doing the colexification search, we should additionally:

  1. compute statistics (weighted degree, degree; see the sketch below)
  2. run the subgraph method, which is now also run directly in CLLD in the Python code, to determine the subgraphs

All in all, this is SOME work to be done.

To explain the subgraph issue: we had some users asking why the data on the website differs from the data in the Concepticon version of CLICS3 (the Rzymski-2020-XXXX list).
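
For point 1 above, the statistics are cheap to compute once the colexification graph exists; a sketch with NetworkX, where the FamilyWeight edge attribute is an illustrative name:

import networkx as nx

def concept_statistics(G, weight="FamilyWeight"):
    """Return degree and weighted degree for every concept node."""
    return {
        node: {"degree": G.degree(node),
               "weighted_degree": G.degree(node, weight=weight)}
        for node in G
    }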

handling of language isolates in the graphs

The standalone app uses wrong names for language isolates: it currently shows the Glottocode as the name of the language family instead of the name of the language itself. Should be easy to fix.
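
The fix is presumably a one-line fallback; a sketch, assuming language objects with name and family attributes (illustrative, not the actual pyclics data model):

def family_label(language):
    """Isolates have no family, so fall back to the language's own name."""
    return language.family if language.family else language.name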

double entries in clics

https://clics.clld.org/edges/1819-1997

Thanks to Thanasis Georgakopoulos for pointing this out. The problem is: if two concepts are lexified by a form A and a variant B, and both are the same in their clics-value, the resulting count is 4 (A:B, A:A, B:B, B:A). This is not dramatic for our general CLICS statistics, as they are based on families, but it should be avoided by counting only one of the identical variants per word in the algorithm.
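
A sketch of the counting fix: collapse identical variants per word before pairing forms across the two concepts (all names illustrative):

from itertools import product

def colexification_count(forms_a, forms_b):
    """Count colexifying pairs after collapsing identical variants per word."""
    return sum(1 for a, b in product(set(forms_a), set(forms_b)) if a == b)

# Variants A and B that are identical in their clics-value:
print(colexification_count(["hand", "hand"], ["hand", "hand"]))  # 1, not 4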

Refactor communities and subgraph into plugins

As a step toward making pyclics extensible, the communities and subgraph commands should be refactored into plugins, serving as examples of other cluster algorithms that could be run on the colexification graph.
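
One common Python pattern for this kind of extensibility (a sketch of the idea, not the mechanism pyclics necessarily adopted): discover cluster algorithms via entry points, so third-party packages can register additional ones; the group name pyclics.clusterers is hypothetical:

from importlib.metadata import entry_points  # Python 3.10+ keyword form

def load_cluster_algorithms(group="pyclics.clusterers"):
    """Collect registered cluster algorithms as {name: callable}."""
    return {ep.name: ep.load() for ep in entry_points(group=group)}

# A plugin package would declare in its packaging metadata, e.g.:
# [project.entry-points."pyclics.clusterers"]
# infomap = "myplugin.cluster:run_infomap"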

treatment of fuzzy concepts

I was just looking at the graph for fingernail, and I realized that, due to fuzzy Concepticon concepts like claw or nail, we have at times strange networks.

I don't consider this a huge problem, but I imagine we could automatically expand these and add two translations for the respectively narrower concepts in the hierarchy. As we would still store the "original concept", this should cause fewer problems in terms of consistency. The only caveat is the depth of going down, so I'd restrict this to one step in the network (arm or hand -> arm / hand, not more); see the sketch below.
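
A sketch of the one-step expansion; the NARROWER mapping stands in for the Concepticon hierarchy and is illustrative:

# Hypothetical one-step narrower-concept relations from the hierarchy.
NARROWER = {"ARM OR HAND": ["ARM", "HAND"]}

def expand_fuzzy(concept):
    """Add the immediately narrower concepts, keeping the original concept."""
    # Restricted to one step: we do not recurse into the children.
    return [concept] + NARROWER.get(concept, [])

print(expand_fuzzy("ARM OR HAND"))  # ['ARM OR HAND', 'ARM', 'HAND']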

Running the app locally is difficult with modern browsers

Most modern browsers don't allow loading JavaScript from file:// URLs without explicitly setting flags or allowing it in custom configurations. This makes the app unusable locally.

This can be circumvented (though not exactly comfortable for the end user either ...) by, e.g., running from the app/ directory:

python -m http.server 8000 & firefox localhost:8000

use ALL colexifications available inside a community

This will slightly complicate the network creation, but it can be useful:

  • create communities with a given weight for a graph clone
  • when assembling the subgraphs, display ALL colexifications

This will help to find cases of multiple colexification. E.g., the hand/arm subgraph links to "fathom" (a measure), and one of the languages colexifies all three: hand, arm, and fathom. That edge was kicked out because only edges of size > 3 are regarded, but here, inside a cluster, it seems justified to keep those.
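
A sketch of the two bullets with NetworkX: detect communities on a weight-pruned clone, then display the community's subgraph induced from the full graph, so all colexifications among its members survive:

import networkx as nx

def pruned_clone(G, min_weight=3):
    """Clone for community detection: keep only edges above the weight cutoff."""
    keep = [(u, v) for u, v, w in G.edges(data="weight", default=0) if w > min_weight]
    return G.edge_subgraph(keep)

def community_display(G, community):
    """Display view induced from the FULL graph: weak in-community edges stay."""
    return G.subgraph(community)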
