lingpy / pysem

Python library for handling semantic data in linguistics

License: MIT License

Python 100.00%

pysem's Introduction

LingPy: A Python Library for Automatic Tasks in Historical Linguistics

This repository contains the Python package lingpy which can be used for various tasks in computational historical linguistics.

Authors (Version 2.6.12): Johann-Mattis List and Robert Forkel

Collaborators: Christoph Rzymski, Simon J. Greenhill, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama, Tiago Tresoldi, Gereon Kaiping, Frank Nagel, and Patrick Elmer.

LingPy is a Python library for historical linguistics. It is being developed for Python 2.7 and Python 3.x using a single codebase.

All source code is available at: https://github.com/lingpy/lingpy.
Documentation can be found at: http://lingpy.org.
For a list of papers in which LingPy was applied, see here.

Quick Installation

For our latest stable version, you can simply use pip or easy_install for installation:

$ pip install lingpy

Depending on which easy_install or pip version you use, either the Python2 or the Python3 version of LingPy will be installed.

If you want to install the current GitHub version of LingPy on your system, open a terminal and type in the following:

$ git clone https://github.com/lingpy/lingpy/
$ cd lingpy
$ python setup.py install

If the last command above fails with a user-permissions error (usually "Errno 13"), you can install LingPy into your per-user Python directory instead:

$ python setup.py install --user

In order to use the library, start an interactive Python session and import LingPy as follows:

>>> from lingpy import *

To install LingPy to hack on it, fork the repository on GitHub, open a terminal and type:

$ git clone https://github.com/<your-github-user>/lingpy/
$ cd lingpy
$ python setup.py develop

This will install LingPy in "development mode", i.e. you will be able to edit the sources in the cloned repository and import the altered code just like the regular Python package.

pysem's Issues

mismatch between length of input and output

While working on streitberggothic I encountered the following issue:

# mismatch in col len
import pandas as pd
from pysem.glosses import to_concepticon

PATH = "Streitberg-1910-3659.tsv"

def main():
    dfgot = pd.read_csv(PATH, sep="\t").fillna("")
    glosses = [{"gloss": str(g), "pos": str(p)}
                for g, p in zip(dfgot.sense, dfgot.pos)]

    print(len(glosses),
          len(to_concepticon(glosses, language="de", pos_ref="pos",
                             max_matches=1)))

if __name__ == "__main__":
    main()

prints: 3645 3274

So there are somehow fewer output matches than input glosses.

I thought this might be valuable information for the developers, even though eventually I found a workaround, like so:

def main():
    dfgot = pd.read_csv(PATH, sep="\t")

    conid, conglo = [], []
    for g, p in zip(dfgot.sense, dfgot.pos):
        gloss = [{"gloss": g, "pos": p}]
        out = list(to_concepticon(gloss, language="de",
                                  pos_ref="pos", max_matches=1).values())[0]
        if out:
            conid.append(out[0][0])
            conglo.append(out[0][1])
        else:
            conid.append(None)
            conglo.append(None)

    dfgot["CONCEPTICON_ID"], dfgot["CONCEPTICON_GLOSS"] = conid, conglo
    del dfgot["form"]
    dfgot.to_csv("concepts.tsv", index=False, encoding="utf-8", sep="\t")

if __name__ == "__main__":
    main()
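
One possible reading of the mismatch, assuming to_concepticon returns a dictionary keyed by the gloss string (the 'arm / hand' output further down this page has that shape): rows that share the same gloss would collapse into a single key, so the result can be shorter than the input. The sketch below uses a made-up stand-in, fake_to_concepticon, instead of the real function, purely to illustrate that effect and how the result could be re-aligned with the input rows.

```python
# Hypothetical stand-in for to_concepticon. It only mimics the assumed
# return shape (a dict keyed by gloss) and performs no real concept mapping.
def fake_to_concepticon(glosses):
    return {entry["gloss"]: [] for entry in glosses}


glosses = [
    {"gloss": "Fuß", "pos": "noun"},
    {"gloss": "Hand", "pos": "noun"},
    {"gloss": "Fuß", "pos": "noun"},  # same gloss as the first row
]

mapped = fake_to_concepticon(glosses)

# The dict has one key per distinct gloss, so it is shorter than the input.
n_in, n_out = len(glosses), len(mapped)

# Re-align with the input rows by looking each gloss up by key.
aligned = [mapped.get(entry["gloss"]) for entry in glosses]
```

If that is indeed the cause, looking up each row's gloss in the returned dictionary would give one result per row without calling the function once per row.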

Typo in usage example

The usage example contains this line:
to_concepticon([{"gloss": "Fuß", pos: "noun"}], language="de"}])

But that seems to be a typo for this:
to_concepticon([{"gloss": "Fuß", "pos": "noun"}], language="de")

marker in parse_gloss should not contain the dash

The dash regularly occurs in Concepticon concept sets. To avoid losing the mapping, it should therefore not be defined as a "marker" in the parse_gloss function.
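
To illustrate the concern, here is a minimal sketch of how treating the dash as a marker can break a gloss apart. The helper split_on_markers and both marker sets are assumptions made up for this example; they are not pysem's parse_gloss or its actual marker list.

```python
import re


def split_on_markers(gloss, markers):
    """Hypothetical helper: split a gloss on any of the marker characters."""
    pattern = "[" + re.escape(markers) + "]"
    return [part.strip() for part in re.split(pattern, gloss) if part.strip()]


# With the dash among the markers, the gloss falls apart into three pieces.
with_dash = split_on_markers("mother-in-law", "-*?")

# Without the dash, the gloss stays intact and can still be looked up.
without_dash = split_on_markers("mother-in-law", "*?")
```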

Specify compatible concepticon version

pysem works with Concepticon v2.5.0 but not with the latest version. I think it would be nice to be able to see somewhere, e.g. in the README, which Concepticon version the concepticon.zip file is based on. Alternatively, the file could be updated to the latest version or even replaced by an API.

Slash is only rudimentarily recognized as a separator

In [15]: to_concepticon([{"gloss": "arm / hand", "pos": "noun"}], max_matches=4)
Out[15]: 
{'arm / hand': [['1277', 'HAND', 'noun', 15],
  ['1277', 'HAND', 'noun', 15],
  ['1277', 'HAND', 'noun', 15],
  ['1277', 'HAND', 'noun', 15]]}
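
A minimal sketch of what fuller slash handling could look like: split the gloss into alternatives before matching, and drop repeated matches afterwards. Both helpers, split_alternatives and unique_matches, are hypothetical names introduced for this example and are not part of pysem.

```python
def split_alternatives(gloss, separator="/"):
    """Hypothetical helper: split a gloss like 'arm / hand' into its parts."""
    return [part.strip() for part in gloss.split(separator) if part.strip()]


def unique_matches(matches):
    """Hypothetical helper: drop repeated matches while keeping their order."""
    seen = set()
    result = []
    for match in matches:
        key = tuple(match)
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result


parts = split_alternatives("arm / hand")

# The four identical rows from the output above reduce to a single match.
deduplicated = unique_matches([["1277", "HAND", "noun", 15]] * 4)
```

Each part could then be matched on its own, so that both alternatives get a chance to contribute a concept set.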

tests for glosses with `or` and spaces

These don't seem to work yet.

Random Walk Similarity Network from the CLICS Data

The code is in theory available, but it needs to be adjusted so that we can compute the network from the CLICS data.