
clics2's Introduction

CLICS - The Database of Cross-Linguistic Colexifications

The original Database of Cross-Linguistic Colexifications (CLICS) established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³, the third installment of CLICS, exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.

Publications

CLICS Release Authors Title Reference
CLICS List, Terhalle, and Urban Using network approaches to enhance the analysis of cross-linguistic polysemies List2013a
CLICS Mayer, List, Terhalle, and Urban An Interactive Visualization of Crosslinguistic Colexification Patterns Mayer2014
CLICS² List, Greenhill, Anderson, Mayer, Tresoldi, and Forkel CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats List2018e
CLICS³ Rzymski, Tresoldi, et al. The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies PREPRINT

Datasets and Software

Datasets providing the lexical data aggregated in CLICS, as well as the software tooling for the CLICS processing workflow, are accessible and archived on Zenodo via the CLICS community.

Web application

Since CLICS², the latest release of the CLICS database and colexification network can be explored in a clld application at clics.clld.org.

Contributors

Find information about contributors and grants on CONTRIBUTORS.md.

clics2's People

Contributors

chrzyki, dependabot[bot], lingulist, simongreenhill, xrotwang


clics2's Issues

transfer clics to another organisation?

We have the CLICS organization. I intend to transfer the data there later on. Or should it be hosted under CLDF? Or should we first host it on its own at the clics organization and then fork under CLDF, as we discussed earlier, and as we may also do with concepticon, to "mark" things as making good use of CLDF?

contribution and acknowledgements

We need to thank people like Damian Satterthwaite-Phillips for sending us their data. Claire should also be thanked. Any further names I may not think of immediately (@SimonGreenhill Simon, your student who worked on Madang for some time, for example) should also be mentioned. We can collect names here first; later we can put them into a file, as we do for concepticon.

Label “Mortar” appears twice

Entries 5.58 and 7.63 both have the label “mortar”.

The CLICS browser can only navigate to the former.

GML does not permit multiple nodes with the same label, which is how I picked up on it.

Update datasets.txt

datasets.txt must be updated to list released versions of the lexibank datasets.

Reproducibility issues

The README, in the 'loading the data' section, states that the Concepticon and Glottolog versions should be greater than or equal to a specified version. The version (of at least Glottolog, in this case) does, however, affect the content of the resulting clics.sqlite:

With Concepticon 2.0 and Glottolog 3.4 (note ids!):

#    Dataset            Glosses    Concepticon    Varieties    Glottocodes    Families
---  ---------------  ---------  -------------  -----------  -------------  ----------
1    allenbai               498            499            9              3           1
2    bantubvd               430            415           10             10           1
3    beidasinitic           905            700           18             18           1
4    bowernpny              338            338          170            168           1
5    hubercolumbian         361            343           69             65          16
6    ids                   1310           1305          320            275          60
7    kraftchadic            428            428           67             60           3
8    northeuralex          1015            940          107            107          21
9    robinsonap             398            393           13             13           1
10   satterthwaitetb        422            418           18             18           1
11   suntb                  996            905           48             48           1
12   tls                   1523            808          120             97           1
13   tryonsolomon           323            311          111             96           5
14   wold                  1814           1457           41             41          24
15   zgraggenmadang         306            306           98             98           1
     TOTAL                    0           2487         1219           1027          90

With Concepticon 2.0 and Glottolog 9701cb0 (note ids!):

#    Dataset            Glosses    Concepticon    Varieties    Glottocodes    Families
---  ---------------  ---------  -------------  -----------  -------------  ----------
1    allenbai               498            499            9              3           1
2    bantubvd               430            415           10             10           1
3    beidasinitic           905            700           18             18           1
4    bowernpny              338            338          170            168           1
5    hubercolumbian         361            343           69             65          16
6    ids                   1310           1305          321            276          60
7    kraftchadic            428            428           67             60           3
8    northeuralex          1015            940          107            107          21
9    robinsonap             398            393           13             13           1
10   satterthwaitetb        422            418           18             18           1
11   suntb                  996            905           48             48           1
12   tls                   1523            808          120             97           1
13   tryonsolomon           323            311          111             96           5
14   wold                  1814           1457           41             41          24
15   zgraggenmadang         306            306           98             98           1
     TOTAL                    0           2487         1220           1028          90

Glottolog 9701cb0 is the 'correct' CLICS² version, in the sense that it yields exactly the same output as the publication.

The difference is simply a result of changing itsa1239 and updating some coordinates. Nothing too terrible but still not nice in terms of reproducibility.

$ sqldiff clics-glottolog34.sqlite clics-glottolog9701cb0.sqlite 
UPDATE LanguageTable SET Latitude=-34.697425, Longitude=139.669871 WHERE rowid=107;
UPDATE LanguageTable SET Latitude=41.7882, Longitude=43.2674 WHERE rowid=298;
UPDATE LanguageTable SET Macroarea='Eurasia', Family='Nakh-Daghestanian', Latitude=41.997483, Longitude=47.584526 WHERE rowid=326;
UPDATE LanguageTable SET Latitude=-49.703, Longitude=-75.3756 WHERE rowid=509;
UPDATE LanguageTable SET Latitude=23.6818, Longitude=107.184 WHERE rowid=525;
UPDATE LanguageTable SET Latitude=23.6818, Longitude=107.184 WHERE rowid=526;
UPDATE LanguageTable SET Latitude=21.83753, Longitude=107.3622 WHERE rowid=528;
UPDATE LanguageTable SET Latitude=23.0105, Longitude=104.449 WHERE rowid=536;
UPDATE LanguageTable SET Latitude=23.0105, Longitude=104.449 WHERE rowid=537;
UPDATE LanguageTable SET Latitude=41.7882, Longitude=43.2674 WHERE rowid=725;
UPDATE LanguageTable SET Latitude=65.3874, Longitude=151.318 WHERE rowid=772;
UPDATE LanguageTable SET Latitude=56.7798, Longitude=156.906 WHERE rowid=774;
UPDATE LanguageTable SET Latitude=-11.1377, Longitude=34.7127 WHERE rowid=953;

Proposal: State exactly which versions of Glottolog and Concepticon should be used?
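One way to implement the proposal is to fail fast whenever the checked-out reference data does not match the recorded versions. The sketch below assumes a hypothetical mapping of pinned versions; the file it would be read from and the version strings are illustrative, not part of the current pyclics setup.

```python
# Pinned versions a release would record (values are illustrative).
PINNED = {"glottolog": "9701cb0", "concepticon": "v2.0"}

def check_versions(found):
    """Raise if any discovered repository version differs from the pinned one."""
    mismatches = {
        name: (want, found.get(name))
        for name, want in PINNED.items()
        if found.get(name) != want
    }
    if mismatches:
        raise ValueError("version mismatch: %r" % mismatches)

# Matching versions pass silently.
check_versions({"glottolog": "9701cb0", "concepticon": "v2.0"})
```

Running `clics load` with Glottolog 3.4 instead of 9701cb0 would then abort with an explicit error rather than silently producing a slightly different clics.sqlite.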

standalone app without json

The JSON files could easily be replaced by creating one big JS object. This should be somewhat smaller (?) and would also allow the code to be used in browsers other than Firefox.
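The conversion itself is trivial: serialize the payload as a JS assignment so the standalone app can load it via a plain script tag instead of an XHR (which some browsers block for file:// URLs). The variable and file names below are illustrative.

```python
import json

def json_to_js(data, varname):
    """Serialize `data` as a JS assignment to a global variable."""
    return "var %s = %s;" % (varname, json.dumps(data, ensure_ascii=False))

# Would be written to e.g. words.js and loaded via <script src="words.js">.
snippet = json_to_js({"word": "Baum", "concept": "TREE"}, "WORDS")
```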

pin all dependencies?

To allow for maximal replicability, we might want to pin all dependencies - although that would reduce usability somewhat?

Actually, we might also just add pinned dependencies to datasets.txt.

Develop a method to search for the best subset of data (concept/language coverage)

With lingpy's coverage method, we can search for mutual coverage in the data. However, for the CLICS case, if we want to have the three/four subsets (1000+ concepts for 300 languages, 500+ concepts for 500 languages, 250+ concepts for 1000 languages, and one meta-set with all data), more sophistication is needed. I imagine using the concepticon to pre-analyse about 10 promising datasets in lexibank. Since coverage varies inside each dataset, however, the method should further test the individual coverage of each language variety in the data. It is not yet clear how to do this concretely, as the problem is not entirely trivial, but an approximation could preselect the most promising concepts based on concept coverage and then test, for each language variety in the data, whether it conforms to a certain coverage threshold. These values would then be reported for each release.
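The approximation described above can be sketched in a few lines. This is not lingpy's coverage method but a greedy stand-in; the data layout ({variety: set of attested concepts}) and the toy values are hypothetical.

```python
from collections import Counter

def select_subset(coverage, n_concepts, threshold):
    """Preselect the n_concepts best-covered concepts, then keep only
    varieties attesting at least `threshold` of them."""
    counts = Counter(c for concepts in coverage.values() for c in concepts)
    top = {c for c, _ in counts.most_common(n_concepts)}
    varieties = {
        v for v, concepts in coverage.items()
        if len(concepts & top) >= threshold
    }
    return top, varieties

cov = {
    "lang1": {"HAND", "FOOT", "EYE"},
    "lang2": {"HAND", "FOOT"},
    "lang3": {"HAND"},
}
concepts, varieties = select_subset(cov, n_concepts=2, threshold=2)
```

A real implementation would iterate: dropping poorly covered varieties changes the concept counts, so the preselection and the threshold test would have to be alternated until a fixed point is reached.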

Should we aim for pylexibank 1.0 for this paper?

I think cutting a 1.0 release of pylexibank would be good. If nothing else, it will document our commitment to maintain this data processing pipeline (in particular since we decided that breaking changes would be acceptable for pylexibank < 1.0).

Pluggability

For pyclics 2.0 we should think about which functionality should/could be pluggable/configurable.
Allowing different clustering algorithms to be used would be trivial - as long as they are implemented in igraph or networkx. For custom clustering, a plugin would presumably need the graph. But this would be easy enough if the plugin used the API ... So, maybe it is enough to allow registration of custom clics subcommands.
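A minimal registry is one way such pluggability could look. The decorator and algorithm names below are hypothetical; pyclics 2.0 might equally use setuptools entry points for discovery.

```python
# Registry mapping algorithm names to clustering callables.
CLUSTER_ALGORITHMS = {}

def cluster_algorithm(name):
    """Register a function that takes a graph and returns communities."""
    def wrap(func):
        CLUSTER_ALGORITHMS[name] = func
        return func
    return wrap

@cluster_algorithm("singleton")
def singleton(graph):
    # Trivial baseline: every node is its own community.
    return [[node] for node in graph]

# A subcommand would look up the user-chosen algorithm by name.
communities = CLUSTER_ALGORITHMS["singleton"](["MOON", "MONTH"])
```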

More reproducibility issues

With the 'correct' clics.sqlite for CLICS2:

  ID A  Concept A                     ID B  Concept B                   Families    Languages    Words
------  --------------------------  ------  ------------------------  ----------  -----------  -------
  1313  MOON                          1370  MONTH                             56          289      294
   906  TREE                          1803  WOOD                              55          211      310
  1258  FINGERNAIL                      72  CLAW                              50          209      216
  2266  SON-IN-LAW (OF WOMAN)         2267  SON-IN-LAW (OF MAN)               49          262      285
  2264  DAUGHTER-IN-LAW (OF WOMAN)    2265  DAUGHTER-IN-LAW (OF MAN)          47          235      262
  1608  LISTEN                        1408  HEAR                              47          102      105
   763  SKIN                           629  LEATHER                           46          233      255
   634  MEAT                          2259  FLESH                             46          222      232
  1307  LANGUAGE                      1599  WORD                              45           94       98
  1228  EARTH (SOIL)                   626  LAND                              43          158      181

That is correct ✓.

However, I get a different number of edges for the infomap graph-stats:

CLICS2:

clics -t 3 -g infomap -f families graph-stats   
-----------  ----
nodes        1534
edges        2638
components     96
communities   248
-----------  ----

My attempt:

clics -t 3 -g infomap -f families graph-stats  
-----------  ----
nodes        1534
edges        2635
components     96
communities   248
-----------  ----

I'm not yet sure what the reason for this is and whether it should be any cause for concern.

factor out pyclics into its own repository

In the light of issues like #40 it may be better to move pyclics into its own repository, making it more transparent that this is simply a library to be used with lexibank datasets to create colexification networks.

The clics2 repository would then be the one holding the CLICS dataset, i.e. a particular version of pyclics applied to a particular set of datasets.

Or would this be too much re-arranging?

refine graph creation

Since we can easily afford to have n graphs for a given edge, we should name the subgraphs on a per-concept basis (i.e. by the concept from which we start), while communities can be named as they are. In general, however, the community algorithms should be further tested, ideally using an algorithm with all colexifications (no family threshold). Also: the problem of out-edges, which are not annotated so far, should be solved.

Out-edges are defined as edges supported by a minimum of 5 language families. If we have a bunch of nodes in a subgraph, all edges of this weight connecting them to nodes outside the subgraph are automatically out-edges. It seems that this can also be handled from within CLLD, right?
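The definition above can be made concrete with a small filter. The edge representation ((node_a, node_b, n_families) triples) and the example edges are hypothetical, not the actual pyclics data structure.

```python
def out_edges(subgraph_nodes, edges, threshold=5):
    """Return edges with >= threshold families that cross the subgraph boundary."""
    members = set(subgraph_nodes)
    return [
        (a, b, fams) for a, b, fams in edges
        # exactly one endpoint inside the subgraph
        if fams >= threshold and (a in members) != (b in members)
    ]

edges = [
    ("MOON", "MONTH", 56),  # internal: both endpoints are members
    ("MOON", "SUN", 7),     # crosses the boundary and meets the threshold
    ("MOON", "CLOCK", 2),   # below the family threshold
]
result = out_edges({"MOON", "MONTH"}, edges)
```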

Remove obsolete output files

Several output files do not make much sense anymore; e.g. the markdown tables for concepts and languages could simply be replaced with querying the database on the fly. They only made sense in the scenario where the actual data would have been distributed with clics2.
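Querying on the fly is a one-liner against clics.sqlite. The sketch below uses a toy in-memory database; the table and column names are illustrative, not the actual CLICS schema.

```python
import sqlite3

# Build a toy in-memory stand-in for clics.sqlite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ParameterTable (ID TEXT, Gloss TEXT);
    CREATE TABLE FormTable (Parameter_ID TEXT, Form TEXT);
    INSERT INTO ParameterTable VALUES ('1313', 'MOON'), ('1370', 'MONTH');
    INSERT INTO FormTable VALUES ('1313', 'luna'), ('1313', 'mond'), ('1370', 'monat');
""")

# The former markdown table, produced on demand instead.
rows = conn.execute(
    "SELECT Gloss, COUNT(*) AS n_words "
    "FROM ParameterTable JOIN FormTable "
    "ON FormTable.Parameter_ID = ParameterTable.ID "
    "GROUP BY Gloss ORDER BY n_words DESC"
).fetchall()
```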

loading data: dataset.id needs to be specified

While following the README instructions, I ran into an issue when loading (all available) data.

When running
> clics load ../concepticon-data/ ../glottolog/

with the README-specified version of glottolog/concepticon; and a clean python 3.7 conda environment, I get

ValueError: Dataset.id needs to be specified in subclass for <class 'lexibank_allenbai.Dataset'>!

This seems like a mismatch between the attributes of the dataset(s) and what CLICS expects. I tried to navigate around the problem, but I ended up not achieving much.

I know this is vague, but I'd appreciate any pointers to solve this problem. Thanks!

Here's the full traceback

INFO    loading datasets into clics.sqlite
Traceback (most recent call last):
 File "/home/u167750/miniconda3/envs/clics/bin/clics", line 10, in <module>
   sys.exit(main())
 File "/home/u167750/miniconda3/envs/clics/lib/python3.7/site-packages/pyclics/__main__.py", line 33, in main
   sys.exit(parser.main(parsed_args=args))
 File "/home/u167750/miniconda3/envs/clics/lib/python3.7/site-packages/clldutils/clilib.py", line 110, in main
   catch_all=catch_all, parsed_args=args)
 File "/home/u167750/miniconda3/envs/clics/lib/python3.7/site-packages/clldutils/clilib.py", line 82, in main
   self.commands[args.command](args)
 File "/home/u167750/miniconda3/envs/clics/lib/python3.7/site-packages/clldutils/clilib.py", line 35, in __call__
   return self.func(args)
 File "/home/u167750/miniconda3/envs/clics/lib/python3.7/site-packages/pyclics/commands.py", line 108, in load
   for ds in iter_datasets():
 File "/home/u167750/miniconda3/envs/clics/lib/python3.7/site-packages/pylexibank/dataset.py", line 697, in iter_datasets
   yield ep.load()(glottolog=glottolog, concepticon=concepticon)
 File "/home/u167750/miniconda3/envs/clics/lib/python3.7/site-packages/pylexibank/dataset.py", line 206, in __init__
   "Dataset.id needs to be specified in subclass for %s!" % self.__class__)
ValueError: Dataset.id needs to be specified in subclass for <class 'lexibank_allenbai.Dataset'>!
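For context, the check that fires here is pylexibank refusing to instantiate a Dataset subclass that does not declare a class-level `id`. The stand-in classes below are a simplification of that behavior, not the real pylexibank code; the fix in an affected dataset is to set `id` on the subclass.

```python
class BaseDataset:
    """Simplified stand-in for pylexibank's Dataset base class."""
    id = None

    def __init__(self):
        if not self.id:
            raise ValueError(
                "Dataset.id needs to be specified in subclass for %s!"
                % self.__class__
            )

class Dataset(BaseDataset):
    id = "allenbai"  # each lexibank dataset names itself here

ds = Dataset()  # instantiates cleanly once `id` is set
```

In practice the error usually indicates a version mismatch between pyclics and the installed lexibank datasets, so aligning those versions may be the actual fix rather than editing the datasets.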

[release of 1.0] Check FALL mappings to Concepticon in all datasets

Some datasets show a strange bias linking "fall" in the meaning of "autumn" to "fall = descend"! I suspect TLS, Bowern's data, Sun1991, and maybe some other datasets. But it is very strange... So please, all data maintainers, quickly check whether this is normal. Check the auto-produced subgraph for "summer" for a reference of this error.
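A quick triage pass could flag every dataset gloss containing "fall" that is mapped to the FALL (descend) concept for manual review. The mapping layout ((dataset, gloss, concepticon_gloss) triples) and the example rows are hypothetical.

```python
def fall_mappings_to_review(mappings):
    """Return mappings whose gloss mentions 'fall' but links to FALL (descend)."""
    return [
        m for m in mappings
        if "fall" in m[1].lower() and m[2] == "FALL"
    ]

mappings = [
    ("tls", "fall (autumn)", "FALL"),    # looks like the suspected bug
    ("bowernpny", "fall down", "FALL"),  # probably fine, but worth a look
    ("suntb", "autumn", "AUTUMN"),       # correct, not flagged
]
flagged = fall_mappings_to_review(mappings)
```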

GeoJson Dump of all current languages

I think we'd best use the Glottolog API from within CLICS on the languages in our sample to do this. The target format that Thomas used is a list of language data points:

Key Value
name "Aguaruna"
family "Jivaroan"
variety "std"
url "http://lingweb.eva.mpg.de/cgi-bin/ids/ids.pl?com=simple_browse&lg_id=258"
lon "-77.92179"
iso "agr"
key "agr_std"
lat "-5.30044"

We can keep most of these, but we should add the glottocode instead of the ISO code, and as a URL, we'd use our internal URL which just points to all words for the given language.
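Building the dump as proper GeoJSON (one Feature per language) could look like this. The URL pattern is illustrative, and the glottocode in the example is assumed for illustration.

```python
import json

def language_feature(name, family, glottocode, lat, lon):
    """Build one GeoJSON Feature for a language data point."""
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {
            "name": name,
            "family": family,
            "glottocode": glottocode,
            # Internal URL pointing at all words for the language (pattern assumed).
            "url": "https://clics.clld.org/languages/%s" % glottocode,
        },
    }

feature = language_feature("Aguaruna", "Jivaroan", "agua1253", -5.30044, -77.92179)
geojson = json.dumps({"type": "FeatureCollection", "features": [feature]})
```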

Relative default paths in pyclics

The default paths in pyclics.cli are relative to the directory in which clics is executed. This is hardly useful in the general case.

Wouldn't something like Path(pyglottolog.__file__).parent.parent etc. be a reasonable default for those paths? Or even None, so that the “look for your default data path” logic can be handled by those clld modules themselves?

template for the html with the application

This will be probably a bit complicated, but I think we can extract core functionality in a rather straightforward way.

The following zip-file illustrates a network for one community:

javascript.zip

We have:

  • simple HTML skeleton.
  • css (clips.css, our.css)
  • jquery (currently loaded externally; should be placed on our server, however)
  • d3, topojson
  • mousetrap (maybe not needed)
  • our code which is specific:
    • words.json (all words, needs to be generated from the new clics)
    • langsGeo.json (can be created from glottolog)
    • visualize.js (the core script)
    • concepts.json (needs to be re-generated, but actually not needed)
    • cluster_38_hand.json (these are the files we produce offline, but where the server changes the parameter for JS).

Clics tries to write non-ASCII to a GML file

Using pyclics on a windows computer, I got the characteristic encoding error that python throws when trying to pump useful characters into windows text files or onto the console.
The culprit sits here:

with self.fname.open('w') as fp:
    fp.write('\n'.join(html.unescape(line) for line in nx.generate_gml(graph)))

I first tried to force fp to use UTF-8 encoding, but closer investigation when trying to draw the graph showed that networkx's GML reader accepts only ASCII input – so why are we unescaping the non-ASCII characters there in the first place, instead of simply using nx.write_gml(graph, self.fname)?
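For reference, networkx's GML writer keeps files pure ASCII by escaping non-ASCII characters as XML-style numeric entities, which is what the unescaping above undoes. A minimal sketch of that escaping (a hand-rolled illustration, not networkx's actual implementation):

```python
def gml_escape(text):
    """Replace every non-ASCII character with an &#NNNN; numeric entity,
    as ASCII-only GML readers expect."""
    return "".join(
        ch if ord(ch) < 128 else "&#%d;" % ord(ch)
        for ch in text
    )

label = gml_escape("Fuß")
```

Writing the escaped form (i.e. using nx.write_gml directly) keeps the output ASCII-safe regardless of the platform's default file encoding, which would sidestep the Windows error entirely.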
