lexibank / abvd

CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2020.

Home Page: https://abvd.eva.mpg.de

License: Creative Commons Attribution 4.0 International

TeX 99.14% Python 0.86%

austronesian cldf lexibank1 vocabulary-database

abvd's People

bibiko hansonmenghan

abvd's Issues

Some glottocodes / ISO codes to check

Hello! I think the following languages might have the wrong glottocodes / ISO codes (with my suggestions for what I think they should be):

Houaïlou --> [ajie1238] / [aji]
Axamb (Avok) --> [avok1244] / []
Proto-Tsouic --> [tsou1250] / []
Bontok, Eastern --> [fina1242] / [bkb]

And here are some with just ISO codes that look off:

Saipan Carolinian Tanapag --> [tpv]
Lamenu (Filakara) --> [lww]
Dadu'a --> [] (or [ilu], for language-level)

Best,
Russell

makecldf fails

... on this entry: word 57 = "?" here.

...which means that the following is passed to add_form:

{'Language_ID': '661', 'Parameter_ID': '57', 'Value': '?', 'Source': ['15258'], 'Cognacy': None, 'Comment': "er-jai 'be married (of woman)' p.78", 'Loan': False, 'Local_ID': '173141', 'Form': None}

... and then we fail with

Traceback (most recent call last):
  File "/Users/simon/projects/lexibank2018/env/bin/lexibank", line 11, in <module>
    load_entry_point('pylexibank', 'console_scripts', 'lexibank')()
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/__main__.py", line 139, in main
    sys.exit(parser.main())
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 110, in main
    catch_all=catch_all, parsed_args=args)
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 82, in main
    self.commands[args.command](args)
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 35, in __call__
    return self.func(args)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/commands/misc.py", line 149, in makecldf
    with_dataset(args, Dataset._install)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/commands/util.py", line 28, in with_dataset
    func(get_dataset(args, dataset.id), **vars(args))
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/dataset.py", line 437, in _install
    if self.cmd_install(**kw) == NOOP:
  File "/Users/simon/projects/lexibank2018/abvd/lexibank_abvd.py", line 39, in cmd_install
    source=[b for b in bibs if b.id in refs.get(wl.id, [])]
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/providers/abvd.py", line 231, in to_cldf
    Local_ID=entry.id,
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 222, in add_lexemes
    lexemes = self.add_forms_from_value(split_value=split_value, **kw)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 208, in add_forms_from_value
    kw_ = self.add_form(with_morphemes=with_morphemes, **kw_)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 157, in add_form
    raise ValueError('language, concept, value, and form '
ValueError: language, concept, value, and form must be supplied

What's the best way to fix this? Should add_form catch this, or should it be caught before getting to add_form?
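One of the two options raised above is to drop placeholder values before they ever reach add_form. A minimal sketch of that idea, assuming that a bare "?" or an empty string means "no form recorded"; `is_placeholder` and `filter_entries` are hypothetical helpers and not part of pylexibank, and the second sample entry is made up for illustration:

```python
# Hypothetical pre-filter; none of these names exist in pylexibank.
PLACEHOLDERS = {"", "?"}  # assumption: these raw values mean "no form recorded"


def is_placeholder(value):
    """Return True if the raw value carries no usable form."""
    return value is None or value.strip() in PLACEHOLDERS


def filter_entries(entries):
    """Yield only the entries whose 'Value' could plausibly yield a form."""
    for entry in entries:
        if not is_placeholder(entry.get("Value")):
            yield entry


entries = [
    # shape taken from the failing entry above
    {"Language_ID": "661", "Parameter_ID": "57", "Value": "?"},
    # made-up entry with a real value, for contrast
    {"Language_ID": "661", "Parameter_ID": "58", "Value": "lima"},
]
kept = list(filter_entries(entries))
```

Filtering before the call keeps add_form strict, so genuinely malformed input still fails loudly; whether that or a more lenient add_form is preferable is the open question in this issue.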

Dayak Ngaju /salawi/ is 'Twenty-Five', not 'Twenty'

For Dayak Ngaju [ngaj1237], the form /salawi/ is glossed as 'Twenty', but it should be glossed as 'Twenty-Five' (Suryanyahu 2013: 130). (A loan from Javanese?) By all accounts, '20' in Dayak Ngaju is a reflex of *duha *puluq.

Typo in 1387-201_five-1 (Pak 'five'): Should be <nuron>, not <muron>

The form given for 'five' in Pak [pakt1239] should be "nuron"; there appears to be a typo (with initial *m instead of "n"). See Smythe & Z'graggen (1975: 185).

Also, speaking of this source, <Z'graggen> should be spelled with a lowercase "g" in the Source/Author and Notes sections.

Oh, and Simon, you mentioned fixing typos upstream ... is there a more convenient place than here for me to register these typos when I see them?

Upgrade to pylexibank 2.x

Typo in 186-1_hand-1 (Sekar 'hand'): Should be <nima-n>, not <nina-n>

The form given for 'hand' in Sekar [seka1247] should be "nima-n"; there appears to be a typo (with medial *n instead of "m"). See George Grace's fieldnotes:

https://digital.library.manoa.hawaii.edu/static/grace/media/50.pdf

The top left of page 22 (of the pdf) has forms for "tangan" (Indonesian for 'hand') shown with prenominal possessive marking.

The righthand column of page 27 (of the pdf) has compound forms like "niman ˈbukin" for "sikut" (= 'elbow'), "niman ˈtagan" for "djari tangan" (= 'finger'), and "(niman) ˈkisin" for "kuku" (= 'nail').

forms.csv Cognacy column has floats/strings, lingpy expects ints

lingpy/basictypes.py in <lambda>(x)
     29         list.__setitem__(self, index, self._type(item))
     30 
---> 31 integer = lambda x: int(x) if x else 0
     32 strings = partial(_strings, str)
     33 ints = partial(_strings, int)

ValueError: invalid literal for int() with base 10: '1,64'

ABVD can't be imported with lingpy because the Cognacy column has values that are not plain integers, among them "29?" and "1,83". Thanks to @KonstantinHoffmann for pointing this out.
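A possible way to normalize such cells before handing them to lingpy is sketched below. It assumes that a comma separates several cognate sets and that a trailing "?" marks a doubtful judgement; both readings are assumptions about the ABVD conventions, and `parse_cognacy` is a hypothetical helper, not a lingpy or pylexibank function.

```python
import re


def parse_cognacy(raw):
    """Split a raw Cognacy cell into (set_id, doubtful) pairs.

    Assumptions: "," separates cognate sets, a trailing "?" marks doubt.
    Anything else (e.g. "3.6") is rejected rather than guessed at.
    """
    result = []
    for part in (raw or "").split(","):
        part = part.strip()
        if not part:
            continue
        doubtful = part.endswith("?")
        digits = part.rstrip("?").strip()
        if not re.fullmatch(r"[0-9]+", digits):
            raise ValueError(f"unparseable cognacy value: {raw!r}")
        result.append((int(digits), doubtful))
    return result
```

With this, "1,64" becomes two integer set IDs and "29?" becomes one ID flagged as doubtful, while an empty cell yields an empty list.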

Consider split or new Glottocode for 'Angkola / Mandailing' [lang id 863]

https://github.com/lexibank/abvd/blob/master/cldf/languages.csv#L1986
current Glottocode bata1290 refers to Batak Angkola only

either new:

Angkola / Mandailing angk1248 - https://glottolog.org/resource/languoid/id/angk1248

or split into

Batak Angkola bata1290 - https://glottolog.org/resource/languoid/id/bata1290
Batak Mandailing bata1291 - https://glottolog.org/resource/languoid/id/bata1291

if the data are the same for both varieties

Is there a list somewhere of who the cognacy experts are?

Is there anywhere one can see which expert made a given cognacy judgement?

Wrong glottocode for 1686 (Betawi Malay (Tengahan dialect))

Hello, language ID 1686 (Betawi Malay (Tengahan dialect)) currently has the glottocode lame1259 (for Lamenu-Lewo), which doesn't seem right. It should be something more along the lines of beta1252 (for Betawi) ... although the terminology and classifications surrounding "Betawi" and various other Malayic varieties spoken in and around Jakarta are a mess.

Normalize contributors

In order to be able to create a clld abvd app from the CLDF data, we would need to normalize contributor names at some point. I think ideally, the "good" data should already go into the CLDF dataset, so we'd need to normalize here.

Lengo glottocode

I think the languoid with the ABVD ID 520 should have the glottocode "pari1257", not "pari1237".

Missing glottocodes

These should definitely be checked by people who know more about these languages than I do, but--in case this might be helpful--here are my best guesses of what the glottocodes should be for these Austronesian languages that currently seem to lack them (or, in some cases where Glottolog might not have an entry for the lect, what the closest glottocode might be):

Alavas 1 > [mpot1241] / [mvt]
Alavas 2 > [mpot1241] / [mvt]
Alavas-Wowo (Wowo 1) > [mpot1241] / [mvt]
Alavas-Wowo (Wowo 2) > [mpot1241] / [mvt]
Badeng > [main1275] / [xkl]
Baliledo > [anak1240] / [akg]
Kayan > [bara1370] / [kys]
Mandri (Faru) 162-100 > [axam1237] / [ahb]
Mandri (Farun) 162-91 > [nasv1234]
Najit > [malu1245] / [mll]
Siviti (Beterbu, Jericho) > [malu1245] / [mll]
Siviti (Womol) > [malu1245] / [mll]
ßatarxobu (Benut) > [malu1245] / [mll]
ßatarxobu (Gunwar) > [malu1245] / [mll]
ßatarxobu (Limsak) > [malu1245] / [mll]
ßatarxobu (Lipitav) > [malu1245] / [mll]
Novol (Bangir) > [lete1241] / [nms]
Riwo > [geda1237] / [gdd]
Tesmbol (Melaklak) > (?) [aulu1238] / [aul] (or something related)
Tesmbol (Usus) > (?) [aulu1238] / [aul] (or something related)
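To find such gaps systematically, the language table can be scanned for empty or malformed glottocodes. The sketch below is only an illustration: it assumes the table has `ID`, `Name` and `Glottocode` columns, it assumes a glottocode is four lowercase alphanumerics followed by four digits, and the sample rows are invented toy data rather than ABVD entries.

```python
import csv
import io
import re

# Assumption: four lowercase letters/digits followed by four digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

# Toy stand-in for cldf/languages.csv; rows are invented.
SAMPLE = (
    "ID,Name,Glottocode\n"
    "a,Toy A,abcd1234\n"
    "b,Toy B,\n"
    "c,Toy C,abcd123\n"
)


def missing_or_malformed(rows):
    """Return the IDs of rows whose Glottocode is empty or ill-formed."""
    return [
        row["ID"]
        for row in rows
        if not GLOTTOCODE.fullmatch(row["Glottocode"] or "")
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = missing_or_malformed(rows)
```

A format check like this only catches missing or mistyped codes; a well-formed code that points at the wrong languoid still needs a human check against Glottolog.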

Typo in 853-201_five-1 (Wetan 'five'): Should be <wolima>, not <wolina>

The form given for 'five' in Wetan [luan1263] should be "wolima"; there appears to be a typo (with medial *n instead of "m"). See de Josselin de Jong (1987: 179, 272, 294).

de Josselin de Jong, Jan Petrus Benjamin. 1987. Wetan fieldnotes: Some eastern Indonesian texts with linguistic notes and a vocabulary (Verhandelingen van het Koninklijk Instituut voor Taal-, Land- en Volkenkunde 130). Dordrecht: Foris Publications.

Include classification in LanguageTable?

As far as I can see, the ABVD classification isn't included in the CLDF dataset yet. If we want to re-implement the php app in clld and load data from the CLDF, this might be necessary.

Inconsistent use of slash / solidus

It seems like the forward slash is mainly used to indicate alternate forms for a given concept, but it also crops up in other places, perhaps to indicate morpheme boundaries (?), as in 'twenty' and 'fifty' in Malagasy (Sakalava) [1184] and Malagasy (Tandroy) [1186]: <roa/pòlo> and <lima/m/pòlo>. A few words for 'vomit' also seem to have slashes, e.g., Rarotongan <rua/ki>. This of course results in the problematic interpretation that <rua> and <ki> are both forms for 'vomit', whereas there's really just one form /ruaki/.

How to handle inconsistent cognateset IDs

The word for "Twenty" in language Palembang Malay is assigned to cognate set 3.6. This may mean 36, or 3, 6.
It certainly is a good example of why we shouldn't slugify identifiers, but how should it be corrected? Fix in the source?
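The collision risk can be made concrete with a small sketch. `naive_slug` is a hypothetical stand-in that simply drops every non-alphanumeric character; it is an assumption for illustration and not necessarily what the actual tooling does.

```python
import re
from collections import defaultdict


def naive_slug(identifier):
    """Hypothetical slug: lowercase and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", identifier.lower())


def collisions(identifiers):
    """Group raw identifiers that collapse to the same slug."""
    by_slug = defaultdict(set)
    for ident in identifiers:
        by_slug[naive_slug(ident)].add(ident)
    return {slug: sorted(ids) for slug, ids in by_slug.items() if len(ids) > 1}


raw_ids = ["3.6", "36", "3,6", "7"]
clashes = collisions(raw_ids)
```

Here "3.6", "36" and "3,6" all collapse to the slug "36", so the distinction between one cognate set and two is lost once the identifier has been slugified.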

update

subcognacy question

I have a question about the interpretation of the data in cases where more than one cognacy class is listed.

I thought that if a word had more than one cognacy class listed, then one of them (probably the second) represents a subcognacy class. Like "wahine" in Hawai'ian being "1, 116, 106", which would mean that all forms that get 116 also get 1 (but not all that have 1 get 116).

For water, I found that there were some words that had cognacy "1,2" and some that have "1" and some that have only "2".

Does my original assumption hold and these are a type of error, or is my assumption wrong?
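The assumption can be tested mechanically: if one class is a subclass of another, its members must form a subset of the other's members. The sketch below flags pairs of classes that share forms without one containing the other. The form IDs and judgements are toy data that only mirror the two patterns described above, and the function names are hypothetical.

```python
from collections import defaultdict


def members_by_class(judgements):
    """Map each cognate class to the set of form IDs assigned to it."""
    members = defaultdict(set)
    for form_id, classes in judgements.items():
        for cls in classes:
            members[cls].add(form_id)
    return members


def nesting_violations(judgements):
    """Return class pairs that overlap without one containing the other."""
    members = members_by_class(judgements)
    classes = sorted(members)
    bad = []
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            shared = members[a] & members[b]
            nested = members[a] <= members[b] or members[b] <= members[a]
            if shared and not nested:
                bad.append((a, b))
    return bad


# Toy data: "water" has forms with "1,2", only "1", and only "2".
water = {"w1": [1, 2], "w2": [1], "w3": [2]}
# Toy data: every form with 116 also has 1, as in the subclass reading.
woman = {"f1": [1, 116], "f2": [1], "f3": [1, 116]}

water_problems = nesting_violations(water)
woman_problems = nesting_violations(woman)
```

Under the subclass reading the 'woman' pattern is consistent, while the 'water' pattern is flagged, because classes 1 and 2 share a form yet each also has members the other lacks.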

Wrong glottocode for 399 Megiar

Should be megi1245, not mele1255, I think

change glottocode for Dadu'a to dadu1237

https://github.com/lexibank/abvd/blob/master/cldf/languages.csv#L2861

There may still be discussions about the classification of Dadu'a, but since Glottolog has a code for it, it would be better to use it. As things stand, Dadu'a is very likely Wetarese.

see:

https://iso639-3.sil.org/sites/iso639-3/files/change_requests/2019/2019-053.pdf
Taylor-Leech 2009 'The language situation in Timor-Leste'
Taylor-Leech 2007 'The Ecology of Language Planning in Timor-Leste: A Study of Language Policy, Planning and Practices in Identity Construction'
and others

https://glottolog.org/resource/languoid/id/dadu1237