
c14bazAAR's Introduction

rOpenSci

Project Status: Abandoned

This repository has been archived. The former README is now in README-NOT.md.

c14bazAAR's People

Contributors

dakni, dependabot[bot], dirkseidensticker, joeroe, kschmuetz, martinhinz, nevrome, yesdavid, zoometh


c14bazAAR's Issues

Database version update

New versions may become available for several of the source databases. We should check for updates every year or so.

rOpenSci Review: Package preparation ToDo

c14bazAAR should follow the rOpenSci Packaging Guide. This guide recommends some useful changes to the package, but also a lot of optional stuff that I don't think is really necessary or even desirable.

So here are the things that I think should be changed and that I didn't already do:

  • 1.9 Authorship: The DESCRIPTION file of a package should list package authors and contributors to a package, using the Authors@R syntax to indicate their roles (author/creator/contributor etc.) if there is more than one author, and using the comment field to indicate the ORCID iD of each author, if they have one. Could you change the DESCRIPTION file accordingly, @MartinHinz? Please do not forget to document this change in NEWS.md.

  • 1.10 Testing: We have good coverage, but some of the new stuff is not covered yet. Could you write some tests for the new remove_duplicates options that I recently added in this epic PR, @dirkseidensticker?


Here are some recommended/suggested things that I do not want to do, because I think they are not useful and/or too work-intensive in the long run. What do you think?

  • 1.1.2 Creating metadata for your package: We recommend you to use the codemetar package for creating and updating a JSON CodeMeta metadata file for your package via codemetar::write_codemeta().

  • 1.5 Code Style: We recommend the styler package for automating part of the code styling.

  • 1.6 README: We recommend not creating README.md directly, but from a README.Rmd file.

  • 1.7 Documentation: The package should contain at least one vignette providing a substantial coverage of package functions, illustrating realistic use cases and how functions are intended to interact. If the package is small, the vignette and the README can have the same content. I disagree, but I am afraid that I will have to give in on this one.

  • 1.8 Documentation website: We recommend creating a documentation website for your package using pkgdown.

TODO List of databases that could be accessed with c14bazAAR

Databases/collections behind login or paywalls, or not accessible without web scraping, will not be included. 😿

rOpenSci Review: Template preparation

Submitting Author: Clemens Schmid (@nevrome)
Repository: https://github.com/ISAAKiel/c14bazAAR
Version submitted: 1.0.3.9000
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: c14bazAAR
Title: Download and Prepare C14 Dates from Different Source Databases
Description: Query different C14 date databases and apply basic data cleaning, merging and calibration steps.
Version: 1.0.3.9000
Authors@R: 
    c(person(given = "Clemens",
             family = "Schmid",
             role = c("aut", "cre", "cph"),
             email = "[email protected]",
             comment = c(ORCID = "0000-0003-3448-5715")),
      person(given = "Dirk",
             family = "Seidensticker",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0002-8155-7702")),
      person(given = "Daniel",
             family = "Knitter",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0003-3014-4497")),
      person(given = "Martin",
             family = "Hinz",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0002-9904-6548")),
      person(given = "David",
             family = "Matzig",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0001-7349-5401")),
      person(given = "Wolfgang",
             family = "Hamer",
             role = "aut",
             email = "[email protected]",
             comment = c(ORCID = "0000-0002-5943-5020")),
      person(given = "Kay",
             family = "Schmütz",
             role = "aut",
             email = "[email protected]"),
      person(given = "Nils",
             family = "Mueller-Scheessel",
             role = "ctb",
             email = "[email protected]",
             comment = c(ORCID = "0000-0001-7992-8722")))
URL: https://github.com/ISAAKiel/c14bazAAR
BugReports: https://github.com/ISAAKiel/c14bazAAR/issues
Depends: R (>= 3.4.0)
Language: en_GB
License: GPL-2 | file LICENSE
Encoding: UTF-8
LazyData: true
Imports:
    crayon (>= 1.3.4),
    data.table (>= 1.11.4),
    dplyr (>= 0.7.2),
    httr (>= 1.4.0),
    magrittr (>= 1.5),
    pbapply (>= 1.3-3),
    rlang (>= 0.1.1),
    stats (>= 3.4.0),
    tibble (>= 1.3.3),
    tidyr (>= 0.6.3)
Suggests:
    Bchron,
    countrycode,
    dataverse,
    ggplot2,
    ggridges,
    knitr,
    lwgeom,
    mapview,
    openxlsx,
    plyr,
    rgeos,
    rmarkdown,
    rnaturalearth,
    rworldmap,
    rworldxtra,
    sf,
    stringdist,
    testthat
RoxygenNote: 6.1.1
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • reproducibility
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

c14bazAAR was created to access radiocarbon dates from openly accessible archives. All functions are related to data download and data preparation.

  • Who is the target audience and what are scientific applications of this package?

Mostly archaeologists, as most radiocarbon databases currently accessible via c14bazAAR contain radiocarbon dates from archaeological contexts. The package could also be of interest to geographers and ecologists -- especially if non-archaeological data sources are added to the portfolio.

No. Not that I'm aware of.

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

  • Do you intend for this package to go on CRAN?

It already is on CRAN in version 1.0.3.

JOSS Options
  • The package has an obvious research application according to JOSS's definition.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI: 10.17605/OSF.IO/3DS6A
  • (Do not submit your package separately to JOSS)

Code of conduct

Unify database names

In filenames, functions and tables there are different naming schemes for database names. This is bad and should be unified asap.

removal of doubles

c14bazAAR/R/c14_date_list_rm_doubles.R

How it works right now:

  1. For all dates, search for dates whose labnr string contains (or is contained in) the labnr string of another date.
  2. Reduce the selection of dates to those that could be doubles based on the result of 1.
  3. For each remaining date, create a data.frame with all the info of its partner doubles (data.frames in a data.frame).
  4. Define a set of essential variables.
  5. Decision process:
    1. If all partner doubles are exactly equal in all variables, keep one and discard the rest.
    2. If the labnr is exactly equal, discard the ones that lack more info in the essential variables. If they lack the same amount of info there, also look at the non-essential variables. If there are no differences there either, just keep the first date and discard the rest.
    3. If the labnr is not exactly equal, keep all the dates.
  6. Execute the selection.

There's an option to only mark doubles that should be removed in an extra column without executing the removal.

I worked with list columns here. They can be confusing at first (and second) glance.
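
A minimal sketch of the matching idea in step 1, assuming a data.frame with a labnr column (illustrative only, not the actual implementation in c14_date_list_rm_doubles.R):

# a date is a potential double if its labnr contains, or is contained
# in, the labnr of at least one other date
find_double_candidates <- function(dates) {
  labnrs <- dates$labnr
  is_candidate <- vapply(seq_along(labnrs), function(i) {
    any(vapply(labnrs[-i], function(other) {
      grepl(labnrs[i], other, fixed = TRUE) ||
        grepl(other, labnrs[i], fixed = TRUE)
    }, logical(1)))
  }, logical(1))
  dates[is_candidate, ]
}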

Questions:

  • Is the criterion (labcode) to identify doubles sufficient?
  • How to decide which doubles to remove? Priority of certain essential variables?
  • Can it be done faster?

Simplify/automate database citation

We should add a function that provides BibTeX strings for each database in the current c14_date_list(). That would simplify citation for the user.
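
A hypothetical interface sketch (the function name, the lookup table and the use of the sourcedb column are assumptions, not the package's API):

# map the source databases present in a c14_date_list to BibTeX strings
db_bibtex <- c(
  adrac = "@misc{adrac, title = {aDRAC}, ...}",  # placeholder entries
  radon = "@misc{radon, title = {RADON}, ...}"
)

get_db_citations <- function(c14_date_list) {
  dbs <- unique(c14_date_list$sourcedb)
  unname(db_bibtex[names(db_bibtex) %in% dbs])
}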

example and test datasets

For example code and the automatic tests we need datasets - preconfigured data.frames and objects of class c14_date_list.

tests

All functions need automatic tests.
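
A minimal testthat sketch of what is meant here (the constructor call and the expectation are illustrative):

library(testthat)

test_that("as.c14_date_list keeps the input rows", {
  df <- data.frame(c14age = 4000, c14std = 30)
  expect_equal(nrow(as.c14_date_list(df)), 1)
})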

rOpenSci Review: JOSS Submission ToDo

There's a possibility to submit a very short, descriptive paper about c14bazAAR to the Journal of Open Source Software with the review. I want to do that.

  • The package therefore has to contain a paper.md file with the text in the package root. Structure and content of this file are described here.
  • The package has to be deposited in a long-term repository with a DOI.

CRAN submission ToDo list

Important info in: https://r-pkgs.org/release.html, https://cran.r-project.org/doc/manuals/r-devel/R-exts.html

Pre-submission

  • The package works and all functions are usable
  • The package documentation is up-to-date and represents the functions correctly
  • The test coverage of the package functions is sufficient
  • DESCRIPTION is up-to-date with the latest version number and database list
  • README.md is up-to-date
  • NEWS.md is up-to-date and includes the latest changes
  • Package checks ran and did not yield any ERRORS, WARNINGS or NOTES (or at least the NOTES are addressed in the cran-comments.md)
    • locally (devtools::check())
    • rhub (devtools::check_rhub(email = ...))
    • winbuilder (devtools::check_win_release(email = ....) + devtools::check_win_devel(email = ....))
  • cran-comments.md is up-to-date and fits to the current submission process
  • Spellcheck with devtools::spell_check() ran and yielded only false positives
  • codemeta.json is up-to-date (can be updated with codemetar::write_codemeta())
  • inst/CITATION is up-to-date
  • The package does not make external changes without explicit user permission. It does not write to the file system, change options, install packages, quit R, send information over the internet, open external software, etc.
  • No reverse dependencies break because of the new package version (devtools::revdep_check())

Submission
The actual submission should be done by the package maintainer with devtools::release(). In case of a resubmission devtools::submit_cran() spares you all the questions in devtools::release().

Post-submission

  • create a new release on Github
  • add the .9000 suffix to the Version field in the DESCRIPTION to indicate that this is a development version
  • create a new heading in NEWS.md

Logo

c14bazAAR needs a simple Hex logo now. The meme has served its purpose.

Simplify variable reference table

In my quest to cut away everything non-essential from c14bazAAR I would like to simplify the variable reference table. The following steps seem sensible:

  1. So far the list stores a lot of additional variables that the databases provide but that are not used by c14bazAAR. I would like to remove these. It's not really useful information, given the extreme heterogeneity of variables and variable meanings across databases.
  2. With this change the data structure can be reorganized once again: Columns for each variable in c14bazAAR and rows for each database and their relevant variable equivalents.

documentation

There's a massive lack of documentation. As usual we need comprehensive function descriptions and example code.

IntChron parser

IntChron <https://intchron.org/> is listed in #2 as not being included because it requires web scraping. However, after spending some time playing with it, I wonder if this might be revisited.

Essentially, IntChron seems to do the same thing as c14bazAAR -- systematically compile dates from existing databases -- with a web-based API. An IntChron parser would be more complicated than the existing parsers because, as far as I can tell, there is no way to extract the entire database as a single file. But it should still be possible to get the data without resorting to web scraping. The key is that every HTML page on IntChron can also be accessed in csv, json, or txt format. This includes the "index" pages that eventually lead you to individual date records. I think it could be worth the extra complexity for IntChron because it does seem to include a lot of dates (for example the entire ORAU database) and it's backed by the Oxford C14 Lab, so it's likely to grow over time.

I can think of a few ways you could approach this, depending on how much flexibility you want to give the user. At the simplest, one could implement a multi-stage parser in c14bazAAR (see the sketch after this list):

  1. Retrieve the list of "hosts" (https://intchron.org/host.csv)
  2. Retrieve the list of records-by-country for each host (e.g. https://intchron.org/oxa/record.csv)
  3. Retrieve the list of sites for each country (e.g. https://intchron.org/record/oxa/Jordan.csv)
  4. Retrieve the list of dates for each site (e.g. https://intchron.org/record/oxa/Jordan/Dhuweila.csv)
  5. Parse and collate the dates (actually quite easy because the IntChron format is similar to c14bazAAR's)
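
A minimal sketch of the staged retrieval, using the URL patterns from the list above and simplifying on the assumption that the endpoints return plain csv (the real IntChron responses may need a custom parser):

fetch_intchron_csv <- function(path) {
  utils::read.csv(paste0("https://intchron.org/", path),
                  stringsAsFactors = FALSE)
}

hosts     <- fetch_intchron_csv("host.csv")                       # stage 1
countries <- fetch_intchron_csv("oxa/record.csv")                 # stage 2
sites     <- fetch_intchron_csv("record/oxa/Jordan.csv")          # stage 3
dates     <- fetch_intchron_csv("record/oxa/Jordan/Dhuweila.csv") # stage 4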

On the other end of the spectrum, one could write an R interface to IntChron as its own package, which c14bazAAR could then use as a dependency to retrieve either the entire database or a user-specified subset. That could be worthwhile if the IntChron standard does become widely used, but as things stand I'm not sure that it's worth the extra effort.

I'd be happy to put some work into this, but I thought I would first raise the issue and ask whether you think it is something that fits into c14bazAAR, and what the best approach to doing it might be.

Eubar parser: wrong coordinate is not removed by `coordinate_precision()`

Within the Eubar dataset one date (Beta-206320) is associated with an incorrect coordinate pair (lat = 4.112887 / lon = 2.591112) that places the date in the Atlantic Ocean, about 240 km south of Benin. According to the country field, the site has to be in north-eastern Spain. The coordinate is already incorrect in the base dataset.

Should that not be corrected by this code?

c14.data <- get_c14data(databases = "eubar") %>%
  standardize_country_name() %>%
  determine_country_by_coordinate() %>%
  finalize_country_name() %>%
  coordinate_precision()

The 'correction' from España to Spain is listed in the country_thesaurus.csv, and determine_country_by_coordinate() returns Benin as country_final. What happens if there is a mismatch?

I figure such an issue would better be caught by an existing or new function, rather than individually in the parser.

CA certificate issue with get_palmisano() and get_euroevol()

euroevol and palmisano are downloaded from https://discovery.ucl.ac.uk. This download does not work any more due to CA certificate issues.

> get_palmisano()
Error in curl::curl_fetch_memory(url, handle = handle) : 
  server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
> utils::download.file(db_url, temp, quiet = TRUE, method='curl')
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
Error in utils::download.file(db_url, temp, quiet = TRUE, method = "curl") : 
  'curl' call had nonzero exit status

My certificate list is up-to-date and includes the relevant QuoVadis Root CA 2 G3.

openssl s_client -showcerts indicates some problems on the server-side:

$ openssl s_client -showcerts -servername discovery.ucl.ac.uk -connect discovery.ucl.ac.uk:443
CONNECTED(00000005)
depth=0 jurisdictionC = GB, businessCategory = Government Entity, serialNumber = November-15-77, C = GB, ST = London, L = London, O = University College London, CN = discovery.ucl.ac.uk
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 jurisdictionC = GB, businessCategory = Government Entity, serialNumber = November-15-77, C = GB, ST = London, L = London, O = University College London, CN = discovery.ucl.ac.uk
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:jurisdictionC = GB, businessCategory = Government Entity, serialNumber = November-15-77, C = GB, ST = London, L = London, O = University College London, CN = discovery.ucl.ac.uk
   i:C = BM, O = QuoVadis Limited, CN = QuoVadis EV SSL ICA G3
-----BEGIN CERTIFICATE-----
MIIH4DCCBcigAwIBAgIUWsWTIuklFQIki5zk7SzvkyYF4MswDQYJKoZIhvcNAQEL
BQAwSTELMAkGA1UEBhMCQk0xGTAXBgNVBAoMEFF1b1ZhZGlzIExpbWl0ZWQxHzAd
BgNVBAMMFlF1b1ZhZGlzIEVWIFNTTCBJQ0EgRzMwHhcNMTkwOTExMTAyNDExWhcN
MjEwOTExMTAzNDAwWjCBuzETMBEGCysGAQQBgjc8AgEDEwJHQjEaMBgGA1UEDwwR
R292ZXJubWVudCBFbnRpdHkxFzAVBgNVBAUTDk5vdmVtYmVyLTE1LTc3MQswCQYD
VQQGEwJHQjEPMA0GA1UECAwGTG9uZG9uMQ8wDQYDVQQHDAZMb25kb24xIjAgBgNV
BAoMGVVuaXZlcnNpdHkgQ29sbGVnZSBMb25kb24xHDAaBgNVBAMME2Rpc2NvdmVy
eS51Y2wuYWMudWswggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCvh4j4
ub+jjytAuayjz1jXpFooMEgg09Opvruzy1Vkz8KT7VYFurfQpp7xO0kDJV9bz4U6
vVUmqd9R2NaJDs0TtpKjyDFwNq1XR2+39L6JlJuIxdGRUMNLh1geNfBB7QJHac0I
xwstH/mXU9H4eU1JyS8TuVnpCbDZnSqCaQ08hl4137FGrloSL+EHqErzrmz8NzNd
725EKSG1/XP8d8O1FJDaAyvES2JfJWuhrcwa6WPPQdCu2cI4GzMRzPes3aD+IjJl
8tGVep5ketM+Kgsrn9tjiZhFcSOcxO0apRAAAYOA6NBoZvPCLr16CGQSJM/0e2N2
PM/PUh14db39Me79AgMBAAGjggNLMIIDRzAMBgNVHRMBAf8EAjAAMB8GA1UdIwQY
MBaAFOWEVNCQSZ84uvLJ4SoIxU6foEg/MHgGCCsGAQUFBwEBBGwwajA5BggrBgEF
BQcwAoYtaHR0cDovL3RydXN0LnF1b3ZhZGlzZ2xvYmFsLmNvbS9xdmV2c3NsZzMu
Y3J0MC0GCCsGAQUFBzABhiFodHRwOi8vZXYub2NzcC5xdW92YWRpc2dsb2JhbC5j
b20wMQYDVR0RBCowKIITZGlzY292ZXJ5LnVjbC5hYy51a4IRZXByaW50cy51Y2wu
YWMudWswWgYDVR0gBFMwUTBGBgwrBgEEAb5YAAJkAQIwNjA0BggrBgEFBQcCARYo
aHR0cDovL3d3dy5xdW92YWRpc2dsb2JhbC5jb20vcmVwb3NpdG9yeTAHBgVngQwB
ATAdBgNVHSUEFjAUBggrBgEFBQcDAgYIKwYBBQUHAwEwPAYDVR0fBDUwMzAxoC+g
LYYraHR0cDovL2NybC5xdW92YWRpc2dsb2JhbC5jb20vcXZldnNzbGczLmNybDAd
BgNVHQ4EFgQU0+IV/WaITVrZeCsIddZvFZSkuUswDgYDVR0PAQH/BAQDAgWgMIIB
fwYKKwYBBAHWeQIEAgSCAW8EggFrAWkAdgC72d+8H4pxtZOUI5eqkntHOFeVCqtS
6BqQlmQ2jh7RhQAAAW0f40mRAAAEAwBHMEUCIQDYLCvmTrDxh16qE30yqTirA3A+
Xv6TZlpUssZxI+ApqgIgSGicwtcECtcjsSnKmEwUVv6hQnvksG7dH5AqPZ7jbQ0A
dwBWFAaaL9fC7NP14b1Esj7HRna5vJkRXMDvlJhV1onQ3QAAAW0f40m4AAAEAwBI
MEYCIQCPhcwTIoiYCt6Esw49b7bcvRyREX29fRuaX36wJxQ6TAIhAJyPt8r3g++L
xWdb/sWRfF7Jn4zlyA7iUWFTF84dwK5xAHYAb1N2rDHwMRnYmQCkURX/dxUcEdkC
wQApBo2yCJo32RMAAAFtH+NKoAAABAMARzBFAiB/85erYq3OelUTEYpdJdIK//2N
AUG6EtuDCR/UspBmnQIhANbyKv+L+b02o5YIRqRKJ49LJEyJFyRxHrRM8lH9qRk8
MA0GCSqGSIb3DQEBCwUAA4ICAQAjJurMYSd9KFvcOcMZNO1DLsKytJ3N6SIkHXph
J2fpXD4sfBHxxG37r7a3hWi7vqNb4PTL8VIixKw+u/Si0tknJIyHsVf64eI4tfMD
kPDJGxMgr9qEsNukwVXgsnerqXYQRAcgYsnMLEdrgo+7SW7caTnm/adfqrc6r9Ar
4fHRidr9p7RuEM/eRCCmBqswHI7hpsE6miKLh1aXqF6I6JiSCApz3X7mJ4OiLVFN
GKw8rZHGEJUsLQBWIW0qZPjrzNG3M/LF5chVhS9D7HcUtXEFP7smNPdNGgbVTtfY
3+sXpFFdhED5ooRJCkX2/JfylXN3LT8v0iNI04HNQ1/fS27k9Q5QBahEBsvSzh88
OdHP/2jyyQwiGqNH9Q+UGGrYBW50OJB13ztobAeEWITPwI40nf3wU3qoCvM/nvJu
8kO0lD3kD4AyLqWnOYvwgjCzgVe2zuLI9F/BZiZnmXaiJq2SSzgTmIzv/HB0zSHF
BWQpgZpacZok7AhZ3vzpbOdJfgcSOCe/W6+drLyA5wTzV3m4+taU5eKvnI9NN5Xb
iUHXmqLElHVZYakpDAJkT20Uud5uIGHGwiHFYvyHgHlNBxa77Bn2gYxKtH95y3o/
C0SaHauNL7ghuyZVxNRWsKcVWlZ+1/TrOlEp00nTFyoWqxbFgwVP9WarCRDX/rZ/
Yzr/sQ==
-----END CERTIFICATE-----
---
Server certificate
subject=jurisdictionC = GB, businessCategory = Government Entity, serialNumber = November-15-77, C = GB, ST = London, L = London, O = University College London, CN = discovery.ucl.ac.uk

issuer=C = BM, O = QuoVadis Limited, CN = QuoVadis EV SSL ICA G3

---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: RSA
Server Temp Key: DH, 1024 bits
---
SSL handshake has read 2713 bytes and written 511 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.2, Cipher is DHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : DHE-RSA-AES256-GCM-SHA384
    Session-ID: C01FF3A9F5C247ED69A674BB1ECBAE3739CBC49E17A49EDCB3F6A35F6CDB7EC5
    Session-ID-ctx: 
    Master-Key: F00181123F692ABD4A33AF8BBE14D402E374397B749A8E24345D4D6229977968B05B202234048A953EEB70D31EF5426C
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1577661179
    Timeout   : 7200 (sec)
    Verify return code: 21 (unable to verify the first certificate)
    Extended master secret: no
---

Can you confirm this problem with Windows/macOS, @dirkseidensticker? Really weird: I can access the page with my browser without any problems.
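
To narrow it down, one could test whether the download succeeds with peer verification disabled (insecure, for diagnosis only; a sketch assuming the httr package):

library(httr)

# if this succeeds, the problem is the certificate chain,
# not general connectivity
resp <- GET("https://discovery.ucl.ac.uk", config(ssl_verifypeer = 0L))
status_code(resp)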

wrong database encodings

>>> file -i PA20110001_S01.txt                                     
PA20110001_S01.txt: text/plain; charset=iso-8859-1

This leads to "wrong" site names.
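
A possible fix sketch: declare the source encoding explicitly when reading, so the site names are not mangled (the file name is taken from above; the delimiter and the use of readr are assumptions):

# base R: declare the input encoding
lines <- readLines("PA20110001_S01.txt", encoding = "latin1")

# or with readr, converting to UTF-8 on read
tbl <- readr::read_tsv("PA20110001_S01.txt",
                       locale = readr::locale(encoding = "ISO-8859-1"))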

Merging duplicates leads to NAs

In the process of merging duplicates, it seems that in some instances NAs in the column lon are produced. The following should reproduce an example dataset:

x <- get_all_dates()
x_material <- x %>% classify_material()
x_country <- x_material %>%
  standardize_country_name() %>%
  determine_country_by_coordinate() %>%
  finalize_country_name()
x_duplicate <- x_country %>%
  mark_duplicates() %>%
  remove_duplicates()
x_sub <- x_duplicate[grepl("Viesen", x_duplicate$site), ]

I found out because I was subsetting the data by coordinates and was surprised not to find those from the Viesenhäuser Hof.

get_AustArch() >> Error in check_connection_to_url(db_url)

get_AustArch() creates the following error:

Error in check_connection_to_url(db_url) : http://archaeologydataservice.ac.uk/catalogue/adsdata/arch-1661-1/dissemination/csv/Austarch_1-3_and_IDASQ_28Nov13-1.csv is not available. No internet connection?

which is curious as I am able to access the csv file in the browser. Can somebody verify this?

How to use intcal20 for calibration?

I was expecting to use ... to pass the calibration curve to Bchron::BchronCalibrate() like so:

library(c14bazAAR)
adrac <- get_c14data("adrac")
Batalimo <- adrac %>%
  dplyr::filter(site == "Batalimo")
calibrate(Batalimo,
          choices = "calprobdistr",
          calCurves = rep("intcal20", length(Batalimo$c14age)))

But I get this error:

Calibrating dates... 
  |++                                                |   5%Error in Bchron::BchronCalibrate(ages = x$c14age, ageSds = x$c14std, calCurves = rep("intcal13",  : 
  formal argument "calCurves" matched by multiple actual arguments

Am I doing something wrong? Thank you.

openxlsx reading issues

openxlsx seems to have problems with reading some files (see ycphs/openxlsx#96).

One solution would be to read these files with the readxl package. That requires downloading the files to a tempdir first. If we use readxl, I would suggest replacing openxlsx also in the instances where it currently works, i.e. switching from openxlsx to readxl everywhere, to keep the number of dependencies low.
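
A sketch of the suggested readxl approach (the URL is a placeholder): readxl only reads local paths, so the file has to be downloaded to a tempfile first.

tmp <- tempfile(fileext = ".xlsx")
utils::download.file("https://example.org/dates.xlsx", tmp,
                     mode = "wb", quiet = TRUE)
dates <- readxl::read_excel(tmp)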

Download tests

Maybe the download tests should not run on CRAN. If one source database changes, the whole package fails the tests. Not useful.
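
One way to achieve this is testthat's skip_on_cran(), so the download tests still run locally and on CI (a sketch; the expectation is illustrative):

test_that("get_c14data downloads the adrac database", {
  testthat::skip_on_cran()
  adrac <- get_c14data("adrac")
  expect_true(nrow(adrac) > 0)
})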

Retain online lookup tables for backwards compatibility?

#128 removed the lookup tables (e.g. url_reference.csv) from this repository, which means previous versions of the package are no longer functional:

remotes::install_github("ropensci/[email protected]", force = TRUE, quiet = TRUE)
c14bazAAR::get_c14data("14sea")
#> Trying to download all dates from the requested databases...
#>   |                                                          |                                                  |   0%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++++++|  99%
#> Warning in c14bazAAR::get_c14data("14sea"): There were errors:
#> 
#> 14sea --> https://raw.githubusercontent.com/ropensci/c14bazAAR/master/data-raw/url_reference.csv is not available. No internet connection?
#> 
#> Not all data might have been downloaded accurately!
#> Error in c14bazAAR::get_c14data("14sea"): 
#> 
#> Download failed for all databases.

Created on 2021-04-05 by the reprex package (v1.0.0)

While hopefully most users will upgrade to 2.0.0, this might be a problem for the reproducibility of analyses using previous versions. It's also not necessarily clear to users who encounter this error that they need to upgrade the package to fix it.

Non-radiocarbon dates: the bazAARverse

@yesdavid I saw that the 14cpalaeolithic database you added contains non-radiocarbon dates. c14bazAAR does not support these yet: we can't simply put them into the fields c14age and c14std, that wouldn't make any sense.

For the moment I decided to remove them:

c14palaeolithic <- c14palaeolithic %>%
    dplyr::filter(method %in% c("AMS", "Conv. 14C"))

I think we could make this possible in the future, but this will require some changes in the architecture of the whole package.

Maybe the easiest solution would be dedicated S3 classes for each dating method, e.g. uratho_date_list, osl_date_list, dendro_date_list. Some functions that were developed for the c14_date_list could be applied to these objects as well, others not. There should be a superclass that allows merging these dates despite their semantic differences.
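
A minimal sketch of that idea (all names are assumptions, not an actual API): method-specific classes share a common superclass, so generic operations can dispatch on either level.

as.osl_date_list <- function(x) {
  stopifnot(is.data.frame(x))
  structure(x, class = c("osl_date_list", "bazaar_date_list", class(x)))
}

# a method defined for the superclass works for every dating method
print.bazaar_date_list <- function(x, ...) {
  cat("Date list with", nrow(x), "dates of type", class(x)[1], "\n")
  NextMethod()
}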

I'm dependent on your input here, @dirkseidensticker and @MartinHinz. There are other possible solutions like extra packages for each kind of dates or -- much less ambitious -- some more columns for the c14_date_list.

Generally I'm a big fan of the Unix philosophy (Do One Thing and Do It Well), but in this case c14bazAAR already contains multiple functions that are useful for all kinds of dating information. On the other hand, for example dendro dates are a huge can of worms that we might want to avoid.

New column: Version

Some databases are explicitly released in versions (e.g. IR-DD, CalPal). There should be a column to reflect this.

Dev mode for URL downloading

Updating URLs and thesauri in PRs is currently pretty painful with c14bazAAR, because the respective metadata is always pulled from the master branch. We should add a dev mode that can be activated with an environment variable and makes the package use only local files.
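
A sketch of the switch (the variable name and file layout are assumptions):

get_reference_table <- function(file) {
  if (Sys.getenv("C14BAZAAR_DEV_MODE") == "true") {
    # read from the local checkout while working on a PR
    utils::read.csv(file.path("data-raw", file))
  } else {
    utils::read.csv(paste0("https://raw.githubusercontent.com/",
                           "ropensci/c14bazAAR/master/data-raw/", file))
  }
}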

thesaurification

c14bazAAR/R/c14_date_list_thesaurify.R

How it works right now:
The implementation only covers the variables country and material. The database parsers already remove leading and trailing whitespace, but the variety of semantically equal or related terms is still huge. For both variables there are manually crafted lists of terms and variations (1, 2 -- both not up to date).
thesaurify() creates two new columns in the input c14_date_list and fills them with the correct term from the thesaurus list. If there is none, the value is copied unchanged from the country or material column.
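
A minimal sketch of the lookup step (the thesaurus format is simplified here to a data.frame with columns var and cor, mapping observed variants to the preferred term):

standardize_term <- function(x, thesaurus) {
  i <- match(x, thesaurus$var)
  # fall back to the raw value if the thesaurus has no entry
  ifelse(is.na(i), x, thesaurus$cor[i])
}

thesaurus <- data.frame(var = c("España", "Espagne"), cor = "Spain")
standardize_term(c("España", "Germany"), thesaurus)
#> [1] "Spain"   "Germany"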

Questions:

  • Is there a better way to create semantic standardization?
  • Can we find a solution for a wider range of variables (not just countries and materials but also cultures etc.)?
  • Can we solve all our problems by creating a connection to more powerful thesaurus projects?

Parsers for Palmisano's datasets

This paper includes 920 dates from Northern Mesopotamia and the Levant, 6,000–3,000 BP. At a rough estimate just over half would be new additions to c14bazAAR:

# https://doi.org/10.1371/journal.pone.0244871.s001
palmisano <- readxl::read_xlsx("journal.pone.0244871.s001.xlsx", sheet = "C14 Dataset")
c14baz <- c14bazAAR::get_c14data("all")

sum(!palmisano$LabID %in% c14baz$labnr)
#> [1] 556

The data is available as a supplementary .xlsx file or a CSV in the Zenodo archive. Worth including? #2

plots & visualisation for c14_date_list

It's not important, but I would like to have a custom plot function for c14_date_list objects.

A c14_date_list is essentially a data.frame with a lot of different variables. There are many options for what to display and how. Cumulative temporal date density? Spatial distribution of dates?

I see several possibilities to implement this (a sketch of the ggplot2 option follows the list):

  • plot functions that use the graphics package (base R)
  • plot functions that use ggplot2
  • custom geoms for ggplot2
  • mapping via sf
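
A sketch of the ggplot2 option for the temporal date density (assuming a c14_date_list x with a c14age column):

library(ggplot2)

ggplot(x, aes(x = c14age)) +
  geom_density() +
  scale_x_reverse() +  # larger (older) BP ages on the left
  labs(x = "uncalibrated 14C age [BP]", y = "date density")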

Country thesaurus includes entries that aren't countries

I noticed that fix_database_country_name() currently returns some country names that aren't actually (or currently) countries:

  • Channel Islands (as opposed to Jersey or Guernsey)
  • Corsica
  • Crete
  • Sardinia
  • Yugoslavia

This might not be a bug exactly—these labels do contain useful geographic information—but I wouldn't say it's expected behaviour since the name of the function implies a sovereign country will be returned. So if they remain in the thesaurus, maybe that should be documented somewhere?

I also spotted some inconsistencies in names for the same country:

  • Czechia/Czech Republic

And some mistakes in the corrected names:

  • Lybia (Libya?)
  • Tunesia (Tunisia?)

meaning and relation of variables

Each database contains a set of variables. This table manages relations and priorities.

Important questions:

  • Are the semantic equivalences in the table correct?
  • Should the query functions collect all variables or only the ones with high priority?
  • What should be the default order of the columns in a c14_date_list?

simple feature

The lines below show how to build a simple feature object from the downloaded data. This could be implemented directly, so that the downloaded data is available as an sf object.

library(sf)
library(magrittr) # for the %>% pipe

a <- c14databases::get_CALPAL()
a_sf <- st_sfc(st_multipoint(as.matrix(a[, c('lon', 'lat')])), crs = st_crs(4326)) %>%
  st_cast("POINT") %>%
  st_sf(data = a, geom = .)

plot(a_sf["data.c14age"])

I'll test a bit.

version 1.0

Thanks for the great work so far @dirkseidensticker, @dakni, @MartinHinz, @yesdavid, @whamer, @kschmuetz and all the other people from ISAAK

I once more invested some time over the weekend to polish c14bazAAR. There are still some rough edges, but we're getting closer. I would like to ask you to invest some time this week to bring this to a happy end in a final feat of strength.

Now that there's a somewhat useful README, everybody should be able to see the big picture. Please go through the documentation, test some functions and criticize/improve/streamline whatever comes to your attention. Think of this as a first compilation of the complete draft of a paper we wrote together.

Remove duplicates does not find duplicates

In a query of mine, the remove_duplicates() function does not work as expected. It seems that duplicates between RADON and EUROEVOL are not found, apparently because the latter does not use hyphens in the laboratory code (e.g. "Gd 6046" vs. "Gd-6046").
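
A sketch of a possible fix (an assumption, not current package behaviour): normalise lab codes before matching, so that separator differences no longer matter.

normalise_labnr <- function(labnr) {
  gsub("[-_ ]", "", tolower(labnr))
}

normalise_labnr(c("Gd 6046", "Gd-6046"))
#> [1] "gd6046" "gd6046"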
