abvd's People

Contributors

chrzyki, lingulist, simongreenhill, xrotwang

abvd's Issues

Some glottocodes / ISO codes to check

Hello! I think the following languages might have the wrong glottocodes / ISO codes (with my suggestions for what I think they should be):

Houaïlou --> [ajie1238] / [aji]
Axamb (Avok) --> [avok1244] / []
Proto-Tsouic --> [tsou1250] / []
Bontok, Eastern --> [fina1242] / [bkb]

And here are some with just ISO codes that look off:

Saipan Carolinian Tanapag --> [tpv]
Lamenu (Filakara) --> [lww]
Dadu'a --> [] (or [ilu], for language-level)

Best,
Russell

makecldf fails

... on this entry: word 57 = "?" here.

...which means that the following is passed to add_form:

{'Language_ID': '661', 'Parameter_ID': '57', 'Value': '?', 'Source': ['15258'], 'Cognacy': None, 'Comment': "er-jai 'be married (of woman)' p.78", 'Loan': False, 'Local_ID': '173141', 'Form': None}

... and then we fail with

Traceback (most recent call last):
  File "/Users/simon/projects/lexibank2018/env/bin/lexibank", line 11, in <module>
    load_entry_point('pylexibank', 'console_scripts', 'lexibank')()
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/__main__.py", line 139, in main
    sys.exit(parser.main())
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 110, in main
    catch_all=catch_all, parsed_args=args)
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 82, in main
    self.commands[args.command](args)
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 35, in __call__
    return self.func(args)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/commands/misc.py", line 149, in makecldf
    with_dataset(args, Dataset._install)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/commands/util.py", line 28, in with_dataset
    func(get_dataset(args, dataset.id), **vars(args))
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/dataset.py", line 437, in _install
    if self.cmd_install(**kw) == NOOP:
  File "/Users/simon/projects/lexibank2018/abvd/lexibank_abvd.py", line 39, in cmd_install
    source=[b for b in bibs if b.id in refs.get(wl.id, [])]
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/providers/abvd.py", line 231, in to_cldf
    Local_ID=entry.id,
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 222, in add_lexemes
    lexemes = self.add_forms_from_value(split_value=split_value, **kw)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 208, in add_forms_from_value
    kw_ = self.add_form(with_morphemes=with_morphemes, **kw_)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 157, in add_form
    raise ValueError('language, concept, value, and form '
ValueError: language, concept, value, and form must be supplied

What's the best way to fix this? Should add_form catch this, or should it be caught before the entry gets to add_form?
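One way to catch this before add_form is a small pre-filter on the raw entries. The following is only a minimal sketch: it assumes entries are plain dicts shaped like the one shown above, and `PLACEHOLDER_VALUES`, `has_usable_value` and `split_entries` are hypothetical names, not part of pylexibank. The second sample entry is made up for illustration.

```python
# Hypothetical pre-filter; none of these names come from pylexibank.
# Assumption: "?" and the empty string both mean "no form recorded".
PLACEHOLDER_VALUES = {"", "?"}


def has_usable_value(entry):
    """Return True if the entry's Value looks like a real form."""
    value = entry.get("Value")
    if value is None:
        return False
    return value.strip() not in PLACEHOLDER_VALUES


def split_entries(entries):
    """Partition entries into (usable, skipped) so skipped ones can be logged."""
    usable, skipped = [], []
    for entry in entries:
        (usable if has_usable_value(entry) else skipped).append(entry)
    return usable, skipped


entries = [
    # The failing entry from this issue (abridged).
    {"Language_ID": "661", "Parameter_ID": "57", "Value": "?", "Local_ID": "173141"},
    # Made-up entry, for illustration only.
    {"Language_ID": "661", "Parameter_ID": "58", "Value": "lima", "Local_ID": "0"},
]
usable, skipped = split_entries(entries)
```

Keeping the skipped entries in a separate list, rather than silently dropping them, would make it easy to report them upstream.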

Dayak Ngaju /salawi/ is 'Twenty-Five', not 'Twenty'

For Dayak Ngaju [ngaj1237], the form /salawi/ is glossed as 'Twenty', but it should be glossed as 'Twenty-Five' (Suryanyahu 2013: 130). (A loan from Javanese?) By all accounts, '20' in Dayak Ngaju is a reflex of *duha *puluq.

Typo in 1387-201_five-1 (Pak 'five'): Should be <nuron>, not <muron>

The form given for 'five' in Pak [pakt1239] should be "nuron"; there appears to be a typo (with initial *m instead of "n"). See Smythe & Z'graggen (1975: 185).

Also, speaking of this source, <Z'graggen> should be spelled with a lowercase <g> in the Source/Author and Notes sections.

Oh, and Simon, you mentioned fixing typos upstream ... is there a more convenient place than here for me to register these typos when I see them?

Typo in 186-1_hand-1 (Sekar 'hand'): Should be <nima-n>, not <nina-n>

The form given for 'hand' in Sekar [seka1247] should be "nima-n"; there appears to be a typo (with medial *n instead of "m"). See George Grace's fieldnotes:

https://digital.library.manoa.hawaii.edu/static/grace/media/50.pdf

The top left of page 22 (of the pdf) has forms for "tangan" (Indonesian for 'hand') shown with prenominal possessive marking.

The righthand column of page 27 (of the pdf) has compound forms like "niman ˈbukin" for "sikut" (= 'elbow'), "niman ˈtagan" for "djari tangan" (= 'finger'), and "(niman) ˈkisin" for "kuku" (= 'nail').

forms.csv Cognacy column has floats/strings, lingpy expects ints

lingpy/basictypes.py in <lambda>(x)
     29         list.__setitem__(self, index, self._type(item))
     30 
---> 31 integer = lambda x: int(x) if x else 0
     32 strings = partial(_strings, str)
     33 ints = partial(_strings, int)

ValueError: invalid literal for int() with base 10: '1,64'

ABVD can't be imported with lingpy because the cognacy column has invalid values, amongst them: 29?, 1,83, etc. Thanks to @KonstantinHoffmann for pointing this out.
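Until the column is cleaned up, the values could be normalised before handing them to lingpy. This is a rough sketch, not lingpy or pylexibank code: it assumes a cognacy cell is a comma-separated list of integer class IDs, each optionally followed by "?" to mark a doubtful judgement, and `parse_cognacy` is a hypothetical helper.

```python
import re

# Assumption: one cognate class is digits, optionally followed by "?".
_COGNATE_ID = re.compile(r"(\d+)(\?)?")


def parse_cognacy(cell):
    """Return a list of (class_id, doubtful) tuples.

    Raises ValueError for anything that does not fit the assumed format.
    """
    if cell is None or not cell.strip():
        return []
    parsed = []
    for part in cell.split(","):
        match = _COGNATE_ID.fullmatch(part.strip())
        if match is None:
            raise ValueError("unparseable cognacy value: %r" % cell)
        parsed.append((int(match.group(1)), match.group(2) is not None))
    return parsed


print(parse_cognacy("1,64"))  # [(1, False), (64, False)]
print(parse_cognacy("29?"))   # [(29, True)]
```

Raising on unexpected input, instead of guessing, keeps any remaining odd values visible so they can be fixed in the source data.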

Consider split or new Glottocode for 'Angkola / Mandailing' [lang id 863]

https://github.com/lexibank/abvd/blob/master/cldf/languages.csv#L1986
current Glottocode bata1290 refers to Batak Angkola only

either new:

Angkola / Mandailing angk1248 - https://glottolog.org/resource/languoid/id/angk1248

or split into

Batak Angkola bata1290 - https://glottolog.org/resource/languoid/id/bata1290
Batak Mandailing bata1291 - https://glottolog.org/resource/languoid/id/bata1291

if the data are the same for both varieties

Wrong glottocode for 1686 (Betawi Malay (Tengahan dialect))

Hello, language ID 1686 (Betawi Malay (Tengahan dialect)) currently has the glottocode lame1259 (for Lamenu-Lewo), which doesn't seem right. It should be something more along the lines of beta1252 (for Betawi) ... although the terminology and classifications surrounding "Betawi" and various other Malayic varieties spoken in and around Jakarta are a mess.

Normalize contributors

In order to be able to create a clld abvd app from the CLDF data, we would need to normalize contributor names at some point. I think ideally, the "good" data should already go into the CLDF dataset, so we'd need to normalize here.

Lengo glottocode

I think the languoid with the ABVD ID 520 should have the glottocode "pari1257", not "pari1237".

Missing glottocodes

These should definitely be checked by people who know more about these languages than I do, but, in case this might be helpful, here are my best guesses of what the glottocodes should be for these Austronesian languages that currently seem to lack them (or, in some cases where Glottolog might not have an entry for the lect, what the closest glottocode might be):

  1. Alavas 1 > [mpot1241] / [mvt]
  2. Alavas 2 > [mpot1241] / [mvt]
  3. Alavas-Wowo (Wowo 1) > [mpot1241] / [mvt]
  4. Alavas-Wowo (Wowo 2) > [mpot1241] / [mvt]
  5. Badeng > [main1275] / [xkl]
  6. Baliledo > [anak1240] / [akg]
  7. Kayan > [bara1370] / [kys]
  8. Mandri (Faru) 162-100 > [axam1237] / [ahb]
  9. Mandri (Farun) 162-91 > [nasv1234]
  10. Najit > [malu1245] / [mll]
  11. Siviti (Beterbu, Jericho) > [malu1245] / [mll]
  12. Siviti (Womol) > [malu1245] / [mll]
  13. ßatarxobu (Benut) > [malu1245] / [mll]
  14. ßatarxobu (Gunwar) > [malu1245] / [mll]
  15. ßatarxobu (Limsak) > [malu1245] / [mll]
  16. ßatarxobu (Lipitav) > [malu1245] / [mll]
  17. Novol (Bangir) > [lete1241] / [nms]
  18. Riwo > [geda1237] / [gdd]
  19. Tesmbol (Melaklak) > (?) [aulu1238] / [aul] (or something related)
  20. Tesmbol (Usus) > (?) [aulu1238] / [aul] (or something related)

Typo in 853-201_five-1 (Wetan 'five'): Should be <wolima>, not <wolina>

The form given for 'five' in Wetan [luan1263] should be "wolima"; there appears to be a typo (with medial *n instead of "m"). See Josselin de Jong (1987: 179, 272, 294).

de Josselin de Jong, Jan Petrus Benjamin. 1987. Wetan fieldnotes: Some eastern Indonesian texts with linguistic notes and a vocabulary (Verhandelingen van het Koninklijk Instituut voor Taal-, Land- en Volkenkunde 130). Dordrecht: Foris Publications.

Inconsistent use of slash / solidus

It seems like the forward slash is mainly used to indicate alternate forms for a given concept, but it also crops up in other places, perhaps to indicate morpheme boundaries (?), as in 'twenty' and 'fifty' in Malagasy (Sakalava) [1184] and Malagasy (Tandroy) [1186]: <roa/pòlo> and <lima/m/pòlo>. A few words for 'vomit' also seem to have slashes, e.g., Rarotongan <rua/ki>. This of course results in the problematic interpretation that <rua> and <ki> are both forms for 'vomit', whereas there's really just one form /ruaki/.
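To make the problem concrete, here is a small sketch of what happens when every slash is treated as a separator between alternate forms. Both function names are hypothetical and this is not how pylexibank splits values; it only illustrates the ambiguity and one way to list affected values for manual review.

```python
def naive_split(value):
    """Split a value on "/" as if every slash separated alternate forms."""
    return [part.strip() for part in value.split("/") if part.strip()]


def values_needing_review(values):
    """Return the values that contain a slash, for manual checking."""
    return [value for value in values if "/" in value]


print(naive_split("rua/ki"))       # ['rua', 'ki'] rather than the single form 'ruaki'
print(naive_split("lima/m/pòlo"))  # ['lima', 'm', 'pòlo']
```

Whether a given slash marks alternates or a morpheme boundary cannot be decided from the string alone, so the second helper only collects candidates rather than trying to repair them.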

subcognacy question

I have a question about the interpretation of the data in cases where more than one cognacy class is listed.

I thought that if a word had more than one cognacy class listed, then one of them (probably the second) represents a subcognacy class. For example, "wahine" in Hawaiian is "1, 116, 106", which I took to mean that all forms that get 116 also get 1 (but not all that have 1 get 116).

For water, I found that there were some words that have cognacy "1,2", some that have "1", and some that have only "2".

Does my original assumption hold and these are a type of error, or is my assumption wrong?
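If the subcognacy reading were right, every form carrying the subclass should also carry the parent class. A quick way to test that is sketched below; `subset_violations` is a hypothetical helper, and the sample data simply mirrors the three patterns described for 'water' rather than real rows from the dataset.

```python
def subset_violations(cognate_sets, parent, child):
    """Return indices of forms that carry `child` without also carrying `parent`."""
    return [
        index
        for index, classes in enumerate(cognate_sets)
        if child in classes and parent not in classes
    ]


# Illustrative data only: forms coded "1,2", "1" and "2".
water = [{1, 2}, {1}, {2}]
print(subset_violations(water, parent=1, child=2))  # [2]
```

A non-empty result means class 2 cannot be a strict subclass of class 1, so either the assumption is wrong or those rows are coding errors.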

change glottocode for Dadu'a to dadu1237

https://github.com/lexibank/abvd/blob/master/cldf/languages.csv#L2861

The classification of Dadu'a may still be under discussion, but since Glottolog has a code for it, it would be better to use it. So far, Dadu'a is most likely Wetarese.

see:

https://glottolog.org/resource/languoid/id/dadu1237
