
c14bazAAR's Introduction

rOpenSci

Project Status: Abandoned

This repository has been archived. The former README is now in README-NOT.md.

c14bazAAR's People

Contributors

dakni, dependabot[bot], dirkseidensticker, joeroe, kschmuetz, martinhinz, nevrome, yesdavid, zoometh


c14bazAAR's Issues

Database version update

New versions may become available for several of the source databases. We should check for updates every year or so.

rOpenSci Review: Package preparation ToDo

c14bazAAR should follow the rOpenSci Packaging Guide. This guide recommends some useful changes to the package, but also a lot of optional stuff that I don't think is really necessary or even desirable.

So here are the things that I think should be changed and that I didn't already do:

  • 1.9 Authorship: The DESCRIPTION file of a package should list package authors and contributors to a package, using the Authors@R syntax to indicate their roles (author/creator/contributor etc.) if there is more than one author, and using the comment field to indicate the ORCID iD of each author, if they have one. Could you change the DESCRIPTION file accordingly, @MartinHinz? Please do not forget to document this change in NEWS.md.

  • 1.10 Testing: We have good coverage, but some of the new stuff is not covered yet. Could you write some tests for the new remove_duplicates options that I recently added in this epic PR, @dirkseidensticker?


Here are some recommended/suggested things that I do not want to do, because I think they are not useful and/or too work-intensive in the long run. What do you think?

  • 1.1.2 Creating metadata for your package: We recommend you to use the codemetar package for creating and updating a JSON CodeMeta metadata file for your package via codemetar::write_codemeta().

  • 1.5 Code Style: We recommend the styler package for automating part of the code styling.

  • 1.6 README: We recommend not creating README.md directly, but from a README.Rmd file.

  • 1.7 Documentation: The package should contain at least one vignette providing a substantial coverage of package functions, illustrating realistic use cases and how functions are intended to interact. If the package is small, the vignette and the README can have the same content. I disagree, but I am afraid that I will have to give in on this one.

  • 1.8 Documentation website: We recommend creating a documentation website for your package using pkgdown.

TODO List of databases that could be accessed with c14bazAAR

Databases/collections behind login or paywalls, or not accessible without web scraping, will not be included. 😿

rOpenSci Review: Template preparation

Submitting Author: Clemens Schmid (@nevrome)
Repository: https://github.com/ISAAKiel/c14bazAAR
Version submitted: 1.0.3.9000
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: c14bazAAR
Title: Download and Prepare C14 Dates from Different Source Databases
Description: Query different C14 date databases and apply basic data cleaning, merging and calibration steps.
Version: 1.0.3.9000
Authors@R: 
    c(person(given = "Clemens",
             family = "Schmid",
             role = c("aut", "cre", "cph"),
             email = "[email protected]",
             comment = c(ORCID = "0000-0003-3448-5715")),
      person(given = "Dirk",
             family = "Seidensticker",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0002-8155-7702")),
      person(given = "Daniel",
             family = "Knitter",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0003-3014-4497")),
      person(given = "Martin",
             family = "Hinz",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0002-9904-6548")),
      person(given = "David",
             family = "Matzig",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0001-7349-5401")),
      person(given = "Wolfgang",
             family = "Hamer",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0002-5943-5020")),
      person(given = "Kay",
             family = "Schmütz",
             role = "aut",
             email = "[email protected]"),
      person(given = "Nils",
             family = "Mueller-Scheessel",
             role = "ctb",
             email = "[email protected]",
             comment = c(ORCID = "0000-0001-7992-8722")))
URL: https://github.com/ISAAKiel/c14bazAAR
BugReports: https://github.com/ISAAKiel/c14bazAAR/issues
Depends: R (>= 3.4.0)
Language: en_GB
License: GPL-2 | file LICENSE
Encoding: UTF-8
LazyData: true
Imports:
    crayon (>= 1.3.4),
    data.table (>= 1.11.4),
    dplyr (>= 0.7.2),
    httr (>= 1.4.0),
    magrittr (>= 1.5),
    pbapply (>= 1.3-3),
    rlang (>= 0.1.1),
    stats (>= 3.4.0),
    tibble (>= 1.3.3),
    tidyr (>= 0.6.3)
Suggests:
    Bchron,
    countrycode,
    dataverse,
    ggplot2,
    ggridges,
    knitr,
    lwgeom,
    mapview,
    openxlsx,
    plyr,
    rgeos,
    rmarkdown,
    rnaturalearth,
    rworldmap,
    rworldxtra,
    sf,
    stringdist,
    testthat
RoxygenNote: 6.1.1
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • reproducibility
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

c14bazAAR was created to access radiocarbon dates from openly accessible archives. All functions are related to data download and data preparation.

  • Who is the target audience and what are scientific applications of this package?

Mostly archaeologists, as most radiocarbon databases currently accessible via c14bazAAR contain radiocarbon dates from archaeological contexts. The package could also be of interest to geographers and ecologists -- especially if non-archaeological data sources are added to the portfolio.

No. Not that I'm aware of.

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

  • Do you intend for this package to go on CRAN?

It already is on CRAN in version 1.0.3.

JOSS Options
  • The package has an obvious research application according to JOSS's definition.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI: 10.17605/OSF.IO/3DS6A
  • (Do not submit your package separately to JOSS)

Code of conduct

Unify database names

In filenames, functions and tables there are different naming schemes for database names. This is bad and should be unified asap.

removal of doubles

c14bazAAR/R/c14_date_list_rm_doubles.R

How it works right now:

  1. For all dates, search for dates whose labnr string contains (or is contained in) the labnr string of another date.
  2. Reduce the selection of dates to those that could be doubles based on the result of 1.
  3. For each remaining date, create a data.frame with all the info of its partner doubles (data.frames in a data.frame).
  4. Define a set of essential variables.
  5. Decision process:
    1. If all partner doubles are exactly equal in all variables, keep one and discard the rest.
    2. If the labnr is exactly equal, discard the ones that lack more info in the essential variables. If they lack the same amount of info there, also look at the non-essential variables. If there are no differences there either, just keep the first date and discard the rest.
    3. If the labnr is not exactly equal, keep all the dates.
  6. Execute the selection.

There's an option to only mark doubles that should be removed in an extra column without executing the removal.

I worked with list columns here. They can be confusing at first (and second) glance.
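
A minimal sketch of the matching idea in step 1, assuming a data.frame with a labnr column (illustrative only, not the actual implementation in c14_date_list_rm_doubles.R):

# a date is a potential double if its labnr contains, or is contained
# in, the labnr of at least one other date
find_double_candidates <- function(dates) {
  labnrs <- dates$labnr
  is_candidate <- vapply(seq_along(labnrs), function(i) {
    any(vapply(labnrs[-i], function(other) {
      grepl(labnrs[i], other, fixed = TRUE) ||
        grepl(other, labnrs[i], fixed = TRUE)
    }, logical(1)))
  }, logical(1))
  dates[is_candidate, ]
}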

Questions:

  • Is the criterion (labcode) to identify doubles sufficient?
  • How to decide which doubles to remove? Priority of certain essential variables?
  • Can it be done faster?

Simplify/automate database citation

We should add a function that provides BibTeX strings for each database in the current c14_date_list(). That would simplify citation for the user.
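
A hypothetical interface sketch (the function name, the lookup table and the use of the sourcedb column are assumptions, not the package's API):

# map the source databases present in a c14_date_list to BibTeX strings
db_bibtex <- c(
  adrac = "@misc{adrac, title = {aDRAC}, ...}",  # placeholder entries
  radon = "@misc{radon, title = {RADON}, ...}"
)

get_db_citations <- function(c14_date_list) {
  dbs <- unique(c14_date_list$sourcedb)
  unname(db_bibtex[names(db_bibtex) %in% dbs])
}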

example and test datasets

For example code and the automatic tests we need datasets - preconfigured data.frames and objects of class c14_date_list.

tests

All functions need automatic tests.
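
A minimal testthat sketch of what is meant here (the constructor call and the expectation are illustrative):

library(testthat)

test_that("as.c14_date_list keeps the input rows", {
  df <- data.frame(c14age = 4000, c14std = 30)
  expect_equal(nrow(as.c14_date_list(df)), 1)
})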

rOpenSci Review: JOSS Submission ToDo

There's a possibility to submit a very short, descriptive paper about c14bazAAR to the Journal of Open Source Software with the review. I want to do that.

  • The package therefore has to contain a paper.md file with the text in the package root. Structure and content of this file are described here.
  • The package has to be deposited in a long-term repository with a DOI.

CRAN submission ToDo list

Important info in: https://r-pkgs.org/release.html, https://cran.r-project.org/doc/manuals/r-devel/R-exts.html

Pre-submission

  • The package works and all functions are usable
  • The package documentation is up-to-date and represents the functions correctly
  • The test coverage of the package functions is sufficient
  • DESCRIPTION is up-to-date with the latest version number and database list
  • README.md is up-to-date
  • NEWS.md is up-to-date and includes the latest changes
  • Package checks ran and did not yield any ERRORS, WARNINGS or NOTES (or at least the NOTES are addressed in the cran-comments.md)
    • locally (devtools::check())
    • rhub (devtools::check_rhub(email = ...))
    • winbuilder (devtools::check_win_release(email = ....) + devtools::check_win_devel(email = ....))
  • cran-comments.md is up-to-date and fits to the current submission process
  • Spellcheck with devtools::spell_check() ran and yielded only false positives
  • codemeta.json is up-to-date (can be updated with codemetar::write_codemeta())
  • inst/CITATION is up-to-date
  • The package does not make external changes without explicit user permission. It does not write to the file system, change options, install packages, quit R, send information over the internet, open external software, etc.
  • No reverse dependencies break because of the new package version (devtools::revdep_check())

Submission
The actual submission should be done by the package maintainer with devtools::release(). In case of a resubmission devtools::submit_cran() spares you all the questions in devtools::release().

Post-submission

  • create a new release on Github
  • add the .9000 suffix to the Version field in the DESCRIPTION to indicate that this is a development version
  • create a new heading in NEWS.md

Logo

c14bazAAR needs a simple Hex logo now. The meme has served its purpose.

Simplify variable reference table

In my quest to cut away everything non-essential from c14bazAAR I would like to simplify the variable reference table. The following steps seem sensible:

  1. So far the list stores a lot of additional variables that the databases provide but that are not used by c14bazAAR. I would like to remove these. It's not really useful information, given the extreme heterogeneity of variables and variable meanings across databases.
  2. With this change the data structure can be reorganized once again: Columns for each variable in c14bazAAR and rows for each database and their relevant variable equivalents.

documentation

There's a massive lack of documentation. As usual we need comprehensive function descriptions and example code.

IntChron parser

IntChron <https://intchron.org/> is listed in #2 as not being included because it requires web scraping. However, after spending some time playing with it, I wonder if this might be revisited.

Essentially, IntChron seems to do the same thing as c14bazAAR -- systematically compile dates from existing databases -- with a web-based API. An IntChron parser would be more complicated than the existing parsers because, as far as I can tell, there is no way to extract the entire database as a single file. But it should still be possible to get the data without resorting to web scraping. The key is that every HTML page on IntChron can also be accessed in csv, json, or txt format. This includes the "index" pages that eventually lead you to individual date records. I think it could be worth the extra complexity for IntChron because it does seem to include a lot of dates (for example the entire ORAU database) and it's backed by the Oxford C14 Lab, so it's likely to grow over time.

I can think of a few ways you could approach this, depending on how much flexibility you want to give the user. At the simplest, one could implement a multi-stage parser in c14bazAAR (see the sketch after this list):

  1. Retrieve the list of "hosts" (https://intchron.org/host.csv)
  2. Retrieve the list of records-by-country for each host (e.g. https://intchron.org/oxa/record.csv)
  3. Retrieve the list of sites for each country (e.g. https://intchron.org/record/oxa/Jordan.csv)
  4. Retrieve the list of dates for each site (e.g. https://intchron.org/record/oxa/Jordan/Dhuweila.csv)
  5. Parse and collate the dates (actually quite easy because the IntChron format is similar to c14bazAAR's)
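
A minimal sketch of the staged retrieval, using the URL patterns from the list above and simplifying on the assumption that the endpoints return plain csv (the real IntChron responses may need a custom parser):

fetch_intchron_csv <- function(path) {
  utils::read.csv(paste0("https://intchron.org/", path),
                  stringsAsFactors = FALSE)
}

hosts     <- fetch_intchron_csv("host.csv")                       # stage 1
countries <- fetch_intchron_csv("oxa/record.csv")                 # stage 2
sites     <- fetch_intchron_csv("record/oxa/Jordan.csv")          # stage 3
dates     <- fetch_intchron_csv("record/oxa/Jordan/Dhuweila.csv") # stage 4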

On the other end of the spectrum, one could write an R interface to IntChron as its own package, which c14bazAAR could then use as a dependency to retrieve either the entire database or a user-specified subset. That could be worthwhile if the IntChron standard does become widely used, but as things stand I'm not sure that it's worth the extra effort.

I'd be happy to put some work into this, but I thought I would first raise the issue and ask whether you think it is something that fits into c14bazAAR, and what the best approach to doing it might be.

Eubar parser: wrong coordinate is not removed by `coordinate_precision()`

Within the Eubar dataset one date (Beta-206320) is associated with an incorrect coordinate pair (lat = 4.112887 / lon = 2.591112) that places the date in the Atlantic Ocean, about 240 km south of Benin. According to the country field, the site has to be in north-eastern Spain. The coordinate is already incorrect in the base dataset.

Should that not be corrected by this code?

c14.data <- get_c14data(databases = "eubar") %>%
  standardize_country_name() %>%
  determine_country_by_coordinate() %>%
  finalize_country_name() %>%
  coordinate_precision()

The 'correction' from España to Spain is listed in the country_thesaurus.csv, and determine_country_by_coordinate() returns Benin as country_final. What happens if there is a mismatch?

I figure such an issue would better be caught by an existing or new function, rather than individually in the parser.

CA certificate issue with get_palmisano() and get_euroevol()

euroevol and palmisano are downloaded from https://discovery.ucl.ac.uk. This download does not work any more due to CA certificate issues.

> get_palmisano()
Error in curl::curl_fetch_memory(url, handle = handle) : 
  server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
> utils::download.file(db_url, temp, quiet = TRUE, method='curl')
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
Error in utils::download.file(db_url, temp, quiet = TRUE, method = "curl") : 
  'curl' call had nonzero exit status

My certificate list is up-to-date and includes the relevant QuoVadis Root CA 2 G3.

openssl s_client -showcerts indicates some problems on the server-side:

$ openssl s_client -showcerts -servername discovery.ucl.ac.uk -connect discovery.ucl.ac.uk:443
CONNECTED(00000005)
depth=0 jurisdictionC = GB, businessCategory = Government Entity, serialNumber = November-15-77, C = GB, ST = London, L = London, O = University College London, CN = discovery.ucl.ac.uk
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 jurisdictionC = GB, businessCategory = Government Entity, serialNumber = November-15-77, C = GB, ST = London, L = London, O = University College London, CN = discovery.ucl.ac.uk
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:jurisdictionC = GB, businessCategory = Government Entity, serialNumber = November-15-77, C = GB, ST = London, L = London, O = University College London, CN = discovery.ucl.ac.uk
   i:C = BM, O = QuoVadis Limited, CN = QuoVadis EV SSL ICA G3
-----BEGIN CERTIFICATE-----
MIIH4DCCBcigAwIBAgIUWsWTIuklFQIki5zk7SzvkyYF4MswDQYJKoZIhvcNAQEL
BQAwSTELMAkGA1UEBhMCQk0xGTAXBgNVBAoMEFF1b1ZhZGlzIExpbWl0ZWQxHzAd
BgNVBAMMFlF1b1ZhZGlzIEVWIFNTTCBJQ0EgRzMwHhcNMTkwOTExMTAyNDExWhcN
MjEwOTExMTAzNDAwWjCBuzETMBEGCysGAQQBgjc8AgEDEwJHQjEaMBgGA1UEDwwR
R292ZXJubWVudCBFbnRpdHkxFzAVBgNVBAUTDk5vdmVtYmVyLTE1LTc3MQswCQYD
VQQGEwJHQjEPMA0GA1UECAwGTG9uZG9uMQ8wDQYDVQQHDAZMb25kb24xIjAgBgNV
BAoMGVVuaXZlcnNpdHkgQ29sbGVnZSBMb25kb24xHDAaBgNVBAMME2Rpc2NvdmVy
eS51Y2wuYWMudWswggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCvh4j4
ub+jjytAuayjz1jXpFooMEgg09Opvruzy1Vkz8KT7VYFurfQpp7xO0kDJV9bz4U6
vVUmqd9R2NaJDs0TtpKjyDFwNq1XR2+39L6JlJuIxdGRUMNLh1geNfBB7QJHac0I
xwstH/mXU9H4eU1JyS8TuVnpCbDZnSqCaQ08hl4137FGrloSL+EHqErzrmz8NzNd
725EKSG1/XP8d8O1FJDaAyvES2JfJWuhrcwa6WPPQdCu2cI4GzMRzPes3aD+IjJl
8tGVep5ketM+Kgsrn9tjiZhFcSOcxO0apRAAAYOA6NBoZvPCLr16CGQSJM/0e2N2
PM/PUh14db39Me79AgMBAAGjggNLMIIDRzAMBgNVHRMBAf8EAjAAMB8GA1UdIwQY
MBaAFOWEVNCQSZ84uvLJ4SoIxU6foEg/MHgGCCsGAQUFBwEBBGwwajA5BggrBgEF
BQcwAoYtaHR0cDovL3RydXN0LnF1b3ZhZGlzZ2xvYmFsLmNvbS9xdmV2c3NsZzMu
Y3J0MC0GCCsGAQUFBzABhiFodHRwOi8vZXYub2NzcC5xdW92YWRpc2dsb2JhbC5j
b20wMQYDVR0RBCowKIITZGlzY292ZXJ5LnVjbC5hYy51a4IRZXByaW50cy51Y2wu
YWMudWswWgYDVR0gBFMwUTBGBgwrBgEEAb5YAAJkAQIwNjA0BggrBgEFBQcCARYo
aHR0cDovL3d3dy5xdW92YWRpc2dsb2JhbC5jb20vcmVwb3NpdG9yeTAHBgVngQwB
ATAdBgNVHSUEFjAUBggrBgEFBQcDAgYIKwYBBQUHAwEwPAYDVR0fBDUwMzAxoC+g
LYYraHR0cDovL2NybC5xdW92YWRpc2dsb2JhbC5jb20vcXZldnNzbGczLmNybDAd
BgNVHQ4EFgQU0+IV/WaITVrZeCsIddZvFZSkuUswDgYDVR0PAQH/BAQDAgWgMIIB
fwYKKwYBBAHWeQIEAgSCAW8EggFrAWkAdgC72d+8H4pxtZOUI5eqkntHOFeVCqtS
6BqQlmQ2jh7RhQAAAW0f40mRAAAEAwBHMEUCIQDYLCvmTrDxh16qE30yqTirA3A+
Xv6TZlpUssZxI+ApqgIgSGicwtcECtcjsSnKmEwUVv6hQnvksG7dH5AqPZ7jbQ0A
dwBWFAaaL9fC7NP14b1Esj7HRna5vJkRXMDvlJhV1onQ3QAAAW0f40m4AAAEAwBI
MEYCIQCPhcwTIoiYCt6Esw49b7bcvRyREX29fRuaX36wJxQ6TAIhAJyPt8r3g++L
xWdb/sWRfF7Jn4zlyA7iUWFTF84dwK5xAHYAb1N2rDHwMRnYmQCkURX/dxUcEdkC
wQApBo2yCJo32RMAAAFtH+NKoAAABAMARzBFAiB/85erYq3OelUTEYpdJdIK//2N
AUG6EtuDCR/UspBmnQIhANbyKv+L+b02o5YIRqRKJ49LJEyJFyRxHrRM8lH9qRk8
MA0GCSqGSIb3DQEBCwUAA4ICAQAjJurMYSd9KFvcOcMZNO1DLsKytJ3N6SIkHXph
J2fpXD4sfBHxxG37r7a3hWi7vqNb4PTL8VIixKw+u/Si0tknJIyHsVf64eI4tfMD
kPDJGxMgr9qEsNukwVXgsnerqXYQRAcgYsnMLEdrgo+7SW7caTnm/adfqrc6r9Ar
4fHRidr9p7RuEM/eRCCmBqswHI7hpsE6miKLh1aXqF6I6JiSCApz3X7mJ4OiLVFN
GKw8rZHGEJUsLQBWIW0qZPjrzNG3M/LF5chVhS9D7HcUtXEFP7smNPdNGgbVTtfY
3+sXpFFdhED5ooRJCkX2/JfylXN3LT8v0iNI04HNQ1/fS27k9Q5QBahEBsvSzh88
OdHP/2jyyQwiGqNH9Q+UGGrYBW50OJB13ztobAeEWITPwI40nf3wU3qoCvM/nvJu
8kO0lD3kD4AyLqWnOYvwgjCzgVe2zuLI9F/BZiZnmXaiJq2SSzgTmIzv/HB0zSHF
BWQpgZpacZok7AhZ3vzpbOdJfgcSOCe/W6+drLyA5wTzV3m4+taU5eKvnI9NN5Xb
iUHXmqLElHVZYakpDAJkT20Uud5uIGHGwiHFYvyHgHlNBxa77Bn2gYxKtH95y3o/
C0SaHauNL7ghuyZVxNRWsKcVWlZ+1/TrOlEp00nTFyoWqxbFgwVP9WarCRDX/rZ/
Yzr/sQ==
-----END CERTIFICATE-----
---
Server certificate
subject=jurisdictionC = GB, businessCategory = Government Entity, serialNumber = November-15-77, C = GB, ST = London, L = London, O = University College London, CN = discovery.ucl.ac.uk

issuer=C = BM, O = QuoVadis Limited, CN = QuoVadis EV SSL ICA G3

---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: RSA
Server Temp Key: DH, 1024 bits
---
SSL handshake has read 2713 bytes and written 511 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.2, Cipher is DHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : DHE-RSA-AES256-GCM-SHA384
    Session-ID: C01FF3A9F5C247ED69A674BB1ECBAE3739CBC49E17A49EDCB3F6A35F6CDB7EC5
    Session-ID-ctx: 
    Master-Key: F00181123F692ABD4A33AF8BBE14D402E374397B749A8E24345D4D6229977968B05B202234048A953EEB70D31EF5426C
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1577661179
    Timeout   : 7200 (sec)
    Verify return code: 21 (unable to verify the first certificate)
    Extended master secret: no
---

Can you confirm this problem with Windows/macOS, @dirkseidensticker? Really weird: I can access the page with my browser without any problems.
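
To narrow it down, one could test whether the download succeeds with peer verification disabled (insecure, for diagnosis only; a sketch assuming the httr package):

library(httr)

# if this succeeds, the problem is the certificate chain,
# not general connectivity
resp <- GET("https://discovery.ucl.ac.uk", config(ssl_verifypeer = 0L))
status_code(resp)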

wrong database encodings

>>> file -i PA20110001_S01.txt                                     
PA20110001_S01.txt: text/plain; charset=iso-8859-1

This leads to "wrong" site names.
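
A possible fix sketch: declare the source encoding explicitly when reading, so the site names are not mangled (the file name is taken from above; the delimiter and the use of readr are assumptions):

# base R: declare the input encoding
lines <- readLines("PA20110001_S01.txt", encoding = "latin1")

# or with readr, converting to UTF-8 on read
tbl <- readr::read_tsv("PA20110001_S01.txt",
                       locale = readr::locale(encoding = "ISO-8859-1"))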

Merging duplicates leads to NAs

In the process of merging duplicates, it seems that in some instances NAs in the column lon are produced. The following should reproduce an example dataset:

x <- get_all_dates()
x_material <- x %>% classify_material()
x_country <- x_material %>%
  standardize_country_name() %>%
  determine_country_by_coordinate() %>%
  finalize_country_name()
x_duplicate <- x_country %>%
  mark_duplicates() %>%
  remove_duplicates()
x_sub <- x_duplicate[grepl("Viesen", x_duplicate$site), ]

I found out because I was subsetting the data by coordinates and was surprised not to find those from the Viesenhäuser Hof.

get_AustArch() >> Error in check_connection_to_url(db_url)

get_AustArch() creates the following error:

Error in check_connection_to_url(db_url) : http://archaeologydataservice.ac.uk/catalogue/adsdata/arch-1661-1/dissemination/csv/Austarch_1-3_and_IDASQ_28Nov13-1.csv is not available. No internet connection?

which is curious as I am able to access the csv file in the browser. Can somebody verify this?

How to use intcal20 for calibration?

I was expecting to use ... to pass the calibration curve to Bchron::BchronCalibrate() like so:

library(c14bazAAR)
adrac <- get_c14data("adrac")
Batalimo <- adrac %>%
  dplyr::filter(site == "Batalimo")
calibrate(Batalimo,
          choices = "calprobdistr",
          calCurves = rep("intcal20", length(Batalimo$c14age)))

But I get this error:

Calibrating dates... 
  |++                                                |   5%Error in Bchron::BchronCalibrate(ages = x$c14age, ageSds = x$c14std, calCurves = rep("intcal13",  : 
  formal argument "calCurves" matched by multiple actual arguments

Am I doing something wrong? Thank you.

openxlsx reading issues

openxlsx seems to have problems with reading some files (see ycphs/openxlsx#96).

One solution would be to read these files with the readxl package. That requires downloading the files to a tempdir first. If we use readxl, I would suggest replacing openxlsx also in the instances where it currently works, i.e. switching from openxlsx to readxl everywhere, to keep the number of dependencies low.
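
A sketch of the suggested readxl approach (the URL is a placeholder): readxl only reads local paths, so the file has to be downloaded to a tempfile first.

tmp <- tempfile(fileext = ".xlsx")
utils::download.file("https://example.org/dates.xlsx", tmp,
                     mode = "wb", quiet = TRUE)
dates <- readxl::read_excel(tmp)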

Download tests

Maybe the download tests should not run on CRAN. If one source database changes, the whole package fails the tests. Not useful.
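
One way to achieve this is testthat's skip_on_cran(), so the download tests still run locally and on CI (a sketch; the expectation is illustrative):

test_that("get_c14data downloads the adrac database", {
  testthat::skip_on_cran()
  adrac <- get_c14data("adrac")
  expect_true(nrow(adrac) > 0)
})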

Retain online lookup tables for backwards compatibility?

#128 removed the lookup tables (e.g. url_reference.csv) from this repository, which means previous versions of the package are no longer functional:

remotes::install_github("ropensci/[email protected]", force = TRUE, quiet = TRUE)
c14bazAAR::get_c14data("14sea")
#> Trying to download all dates from the requested databases...
#>   |                                                          |                                                  |   0%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++++++|  99%
#> Warning in c14bazAAR::get_c14data("14sea"): There were errors:
#> 
#> 14sea --> https://raw.githubusercontent.com/ropensci/c14bazAAR/master/data-raw/url_reference.csv is not available. No internet connection?
#> 
#> Not all data might have been downloaded accurately!
#> Error in c14bazAAR::get_c14data("14sea"): 
#> 
#> Download failed for all databases.

Created on 2021-04-05 by the reprex package (v1.0.0)

While hopefully most users will upgrade to 2.0.0, this might be a problem for the reproducibility of analyses using previous versions. It's also not necessarily clear to users who encounter this error that they need to upgrade the package to fix it.

Non-radiocarbon dates: the bazAARverse

@yesdavid I saw that the 14cpalaeolithic database you added contains non-radiocarbon dates. c14bazAAR does not support these yet: we can't simply put them into the fields c14age and c14std, that wouldn't make any sense.

For the moment I decided to remove them:

c14palaeolithic <- c14palaeolithic %>%
    dplyr::filter(method %in% c("AMS", "Conv. 14C"))

I think we could make this possible in the future, but this will require some changes in the architecture of the whole package.

Maybe the easiest solution would be dedicated S3 classes for each dating method, e.g. uratho_date_list, osl_date_list, dendro_date_list. Some functions that were developed for the c14_date_list could be applied to these objects as well, others not. There should be a superclass that allows merging these dates despite their semantic differences.
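
A minimal sketch of that idea (all names are assumptions, not an actual API): method-specific classes share a common superclass, so generic operations can dispatch on either level.

as.osl_date_list <- function(x) {
  stopifnot(is.data.frame(x))
  structure(x, class = c("osl_date_list", "bazaar_date_list", class(x)))
}

# a method defined for the superclass works for every dating method
print.bazaar_date_list <- function(x, ...) {
  cat("Date list with", nrow(x), "dates of type", class(x)[1], "\n")
  NextMethod()
}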

I'm dependent on your input here, @dirkseidensticker and @MartinHinz. There are other possible solutions like extra packages for each kind of dates or -- much less ambitious -- some more columns for the c14_date_list.

Generally I'm a big fan of the Unix philosophy (Do One Thing and Do It Well), but in this case c14bazAAR already contains multiple functions that are useful for all kinds of dating information. On the other hand, for example dendro dates are a huge can of worms that we might want to avoid.

New column: Version

Some databases are explicitly released in versions (e.g. IR-DD, CalPal). There should be a column to reflect this.

Dev mode for URL downloading

Updating URLs and thesauri in PRs is currently pretty painful with c14bazAAR, because the respective metadata is always pulled from the master branch. We should add a dev mode that can be activated with an environment variable and makes the package use only local files.
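
A sketch of the switch (the variable name and file layout are assumptions):

get_reference_table <- function(file) {
  if (Sys.getenv("C14BAZAAR_DEV_MODE") == "true") {
    # read from the local checkout while working on a PR
    utils::read.csv(file.path("data-raw", file))
  } else {
    utils::read.csv(paste0("https://raw.githubusercontent.com/",
                           "ropensci/c14bazAAR/master/data-raw/", file))
  }
}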

thesaurification

c14bazAAR/R/c14_date_list_thesaurify.R

How it works right now:
The implementation only covers the variables country and material. The database parsers already remove leading and trailing whitespace, but the variety of semantically equal or related terms is still huge. For both variables there are manually crafted lists of terms and variations (1, 2 -- both not up to date).
thesaurify() creates two new columns in the input c14_date_list and fills them with the correct term from the thesaurus list. If there is none, the value is copied unchanged from the country or material column.
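
A minimal sketch of the lookup step (the thesaurus format is simplified here to a data.frame with columns var and cor, mapping observed variants to the preferred term):

standardize_term <- function(x, thesaurus) {
  i <- match(x, thesaurus$var)
  # fall back to the raw value if the thesaurus has no entry
  ifelse(is.na(i), x, thesaurus$cor[i])
}

thesaurus <- data.frame(var = c("España", "Espagne"), cor = "Spain")
standardize_term(c("España", "Germany"), thesaurus)
#> [1] "Spain"   "Germany"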

Questions:

  • Is there a better way to create semantic standardization?
  • Can we find a solution for a wider range of variables (not just countries and materials but also cultures etc.)?
  • Can we solve all our problems by creating a connection to more powerful thesaurus projects?

Parsers for Palmisano's datasets

This paper includes 920 dates from Northern Mesopotamia and the Levant, 6,000–3,000 BP. At a rough estimate just over half would be new additions to c14bazAAR:

# https://doi.org/10.1371/journal.pone.0244871.s001
palmisano <- readxl::read_xlsx("journal.pone.0244871.s001.xlsx", sheet = "C14 Dataset")
c14baz <- c14bazAAR::get_c14data("all")

sum(!palmisano$LabID %in% c14baz$labnr)
#> [1] 556

The data is available as a supplementary .xlsx file or a CSV in the Zenodo archive. Worth including? #2

plots & visualisation for c14_date_list

It's not important, but I would like to have a custom plot function for c14_date_list objects.

A c14_date_list is essentially a data.frame with a lot of different variables. There are many options for what to display and how. Cumulative temporal date density? Spatial distribution of dates?

I see several possibilities to implement this (a sketch of the ggplot2 option follows the list):

  • plot functions that use the graphics package (base R)
  • plot functions that use ggplot2
  • custom geoms for ggplot2
  • mapping via sf
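
A sketch of the ggplot2 option for the temporal date density (assuming a c14_date_list x with a c14age column):

library(ggplot2)

ggplot(x, aes(x = c14age)) +
  geom_density() +
  scale_x_reverse() +  # larger (older) BP ages on the left
  labs(x = "uncalibrated 14C age [BP]", y = "date density")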

Country thesaurus includes entries that aren't countries

I noticed that fix_database_country_name() currently returns some country names that aren't actually (or currently) countries:

  • Channel Islands (as opposed to Jersey or Guernsey)
  • Corsica
  • Crete
  • Sardinia
  • Yugoslavia

This might not be a bug exactly—these labels do contain useful geographic information—but I wouldn't say it's expected behaviour since the name of the function implies a sovereign country will be returned. So if they remain in the thesaurus, maybe that should be documented somewhere?

I also spotted some inconsistencies in names for the same country:

  • Czechia/Czech Republic

And some mistakes in the corrected names:

  • Lybia (Libya?)
  • Tunesia (Tunisia?)

meaning and relation of variables

Each database contains a set of variables. This table manages relations and priorities.

Important questions:

  • Are the semantic equivalences in the table correct?
  • Should the query functions collect all variables or only the ones with high priority?
  • What should be the default order of the columns in a c14_date_list?

simple feature

The lines below show how to build a simple feature object from the downloaded data. This could be implemented directly, so that the downloaded data is available as an sf object.

library(sf)
library(magrittr) # for the %>% pipe

a <- c14databases::get_CALPAL()
a_sf <- st_sfc(st_multipoint(as.matrix(a[, c('lon', 'lat')])), crs = st_crs(4326)) %>%
  st_cast("POINT") %>%
  st_sf(data = a, geom = .)

plot(a_sf["data.c14age"])

I'll test a bit.

version 1.0

Thanks for the great work so far @dirkseidensticker, @dakni, @MartinHinz, @yesdavid, @whamer, @kschmuetz and all the other people from ISAAK

I once more invested some time over the weekend to polish c14bazAAR. There are still some rough edges, but we're getting closer. I would like to ask you to invest some time this week to bring this to a happy end in a final feat of strength.

Now that there's a somewhat useful README, everybody should be able to see the big picture. Please go through the documentation, test some functions and criticize/improve/streamline whatever comes to your attention. Think of this as a first compilation of the complete draft of a paper we wrote together.

Remove duplicates does not find duplicates

In a query of mine, the remove_duplicates() function does not work as expected. It seems that duplicates between RADON and EUROEVOL are not found, apparently because the latter does not use hyphens in the laboratory code (e.g. "Gd 6046" vs. "Gd-6046").
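
A sketch of a possible fix (an assumption, not current package behaviour): normalise lab codes before matching, so that separator differences no longer matter.

normalise_labnr <- function(labnr) {
  gsub("[-_ ]", "", tolower(labnr))
}

normalise_labnr(c("Gd 6046", "Gd-6046"))
#> [1] "gd6046" "gd6046"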
