
epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time-series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National level data is also supported using a range of data sources as well as linelist data and links to intervention data sets.

Home Page: https://epiforecasts.io/covidregionaldata/

License: Other

Dockerfile 0.74% R 86.43% TeX 6.89% Shell 5.94%
covid-19 rstats data regional-data open-science r6

covidregionaldata's Introduction

Subnational data for the COVID-19 outbreak


Interface to subnational and national level COVID-19 data sourced from both official sources, such as Public Health England in the UK, and from other COVID-19 data collections, including the World Health Organisation (WHO), European Centre for Disease Prevention and Control (ECDC), Johns Hopkins University (JHU), Google Open Data and others. This package is designed to streamline COVID-19 data extraction, cleaning, and processing from a range of data sources in an open and transparent way. This allows users to inspect and scrutinise the data, and the tools used to process it, at every step. For all countries supported, data includes a daily time-series of cases and, wherever available, data on deaths, hospitalisations, and tests. National level data is also supported using a range of data sources.

Installation

Install from CRAN:

install.packages("covidregionaldata")

Install the stable development version of the package with:

install.packages("covidregionaldata",
  repos = "https://epiforecasts.r-universe.dev"
)

Install the unstable development version of the package with:

remotes::install_github("epiforecasts/covidregionaldata")

Quick start

Load covidregionaldata, dplyr, scales, and ggplot2 (all used in this quick start):

library(covidregionaldata)
library(dplyr)
library(ggplot2)
library(scales)

Setup data caching

This package can optionally cache downloads locally using memoise. Enable the cache with the following (it uses the temporary directory by default):

start_using_memoise()
#> Using a cache at: /var/folders/68/22ndk9854tq394wl_n1cxzlr0000gn/T//RtmpylL81U

To stop using memoise, use:

stop_using_memoise()

To reset the cache (required to download new data), use:

reset_cache()

National data

To get worldwide time-series data by country (sourced from the World Health Organisation (WHO) by default, but also optionally from the European Centre for Disease Prevention and Control (ECDC), Johns Hopkins University (JHU), or the Google COVID-19 open data project), use:

nots <- get_national_data()
#> Downloading data from https://covid19.who.int/WHO-COVID-19-global-data.csv
#> Cleaning data
#> Processing data
nots
#> # A tibble: 182,253 × 15
#>    date       un_region who_region country iso_code cases_new cases_total deaths_new deaths_total recovered_new
#>    <date>     <chr>     <chr>      <chr>   <chr>        <dbl>       <dbl>      <dbl>        <dbl>         <dbl>
#>  1 2020-01-03 Asia      EMRO       Afghan… AF               0           0          0            0            NA
#>  2 2020-01-03 Europe    EURO       Albania AL               0           0          0            0            NA
#>  3 2020-01-03 Africa    AFRO       Algeria DZ               0           0          0            0            NA
#>  4 2020-01-03 Oceania   WPRO       Americ… AS               0           0          0            0            NA
#>  5 2020-01-03 Europe    EURO       Andorra AD               0           0          0            0            NA
#>  6 2020-01-03 Africa    AFRO       Angola  AO               0           0          0            0            NA
#>  7 2020-01-03 Americas  AMRO       Anguil… AI               0           0          0            0            NA
#>  8 2020-01-03 Americas  AMRO       Antigu… AG               0           0          0            0            NA
#>  9 2020-01-03 Americas  AMRO       Argent… AR               0           0          0            0            NA
#> 10 2020-01-03 Asia      EURO       Armenia AM               0           0          0            0            NA
#> # … with 182,243 more rows, and 5 more variables: recovered_total <dbl>, hosp_new <dbl>, hosp_total <dbl>,
#> #   tested_new <dbl>, tested_total <dbl>

This can also be filtered for countries of interest:

g7 <- c(
  "United States", "United Kingdom", "France", "Germany",
  "Italy", "Canada", "Japan"
)
g7_nots <- get_national_data(countries = g7, verbose = FALSE)

Using this data we can compare countries. For example, here is the number of reported deaths over time for each country in the G7:

g7_nots %>%
  ggplot() +
  aes(x = date, y = deaths_new, col = country) +
  geom_line(alpha = 0.4) +
  labs(x = "Date", y = "Reported Covid-19 deaths") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "top") +
  guides(col = guide_legend(title = "Country"))

Subnational data

To get time-series data for subnational regions of a specific country, for example by level 1 region in the UK, use:

uk_nots <- get_regional_data(country = "UK", verbose = FALSE)
uk_nots
#> # A tibble: 9,893 × 26
#>    date       region    region_code cases_new cases_total deaths_new deaths_total recovered_new recovered_total
#>    <date>     <chr>     <chr>           <dbl>       <dbl>      <dbl>        <dbl>         <dbl>           <dbl>
#>  1 2020-01-11 North Ea… E12000001          NA          NA         NA           NA            NA              NA
#>  2 2020-01-11 North We… E12000002          NA          NA         NA           NA            NA              NA
#>  3 2020-01-11 Yorkshir… E12000003          NA          NA         NA           NA            NA              NA
#>  4 2020-01-11 East Mid… E12000004          NA          NA         NA           NA            NA              NA
#>  5 2020-01-11 West Mid… E12000005          NA          NA         NA           NA            NA              NA
#>  6 2020-01-11 East of … E12000006          NA          NA         NA           NA            NA              NA
#>  7 2020-01-11 London    E12000007          NA          NA         NA           NA            NA              NA
#>  8 2020-01-11 South Ea… E12000008          NA          NA         NA           NA            NA              NA
#>  9 2020-01-11 South We… E12000009          NA          NA         NA           NA            NA              NA
#> 10 2020-01-11 England   E92000001          NA          NA         NA           NA            NA              NA
#> # … with 9,883 more rows, and 17 more variables: hosp_new <dbl>, hosp_total <dbl>, tested_new <dbl>,
#> #   tested_total <dbl>, areaType <chr>, cumCasesByPublishDate <dbl>, cumCasesBySpecimenDate <dbl>,
#> #   newCasesByPublishDate <dbl>, newCasesBySpecimenDate <dbl>, cumDeaths28DaysByDeathDate <dbl>,
#> #   cumDeaths28DaysByPublishDate <dbl>, newDeaths28DaysByDeathDate <dbl>, newDeaths28DaysByPublishDate <dbl>,
#> #   newPillarFourTestsByPublishDate <lgl>, newPillarOneTestsByPublishDate <dbl>,
#> #   newPillarThreeTestsByPublishDate <dbl>, newPillarTwoTestsByPublishDate <dbl>

Now that we have the data we can create plots, for example the time-series of the number of cases for each region:

uk_nots %>%
  filter(!(region %in% "England")) %>%
  ggplot() +
  aes(x = date, y = cases_new, col = region) +
  geom_line(alpha = 0.4) +
  labs(x = "Date", y = "Reported Covid-19 cases") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "top") +
  guides(col = guide_legend(title = "Region"))

See get_available_datasets() for supported regions and subregional levels. To view what datasets we currently have subnational data for, along with their current status, check the supported countries page or build the supported countries vignette.

For further examples see the quick start vignette. Additional subnational data are supported via the JHU() and Google() classes. Use the available_regions() method once these data have been downloaded and cleaned (see their examples) for subnational data they internally support.

Citation

If using covidregionaldata in your work please consider citing it using the following:

#> 
#> To cite covidregionaldata in publications use:
#> 
#>   Joseph Palmer, Katharine Sherratt, Richard Martin-Nielsen, Jonnie Bevan, Hamish Gibbs, Sebastian
#>   Funk and Sam Abbott (2021). covidregionaldata: Subnational data for COVID-19 epidemiology, DOI:
#>   10.21105/joss.03290
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {covidregionaldata: Subnational data for COVID-19 epidemiology},
#>     author = {Joseph Palmer and Katharine Sherratt and Richard Martin-Nielsen and Jonnie Bevan and Hamish Gibbs and Sebastian Funk and Sam Abbott},
#>     journal = {Journal of Open Source Software},
#>     year = {2021},
#>     volume = {6},
#>     number = {63},
#>     pages = {3290},
#>     doi = {10.21105/joss.03290},
#>   }

Development

This package is the result of work from a number of contributors (see contributors list here). We would like to thank the CMMID COVID-19 working group for insightful comments and feedback.

We welcome contributions and new contributors! We particularly appreciate help adding new data sources for countries at sub-national level, or work on priority problems in the issues. Please check and add to the issues, and/or add a pull request. For more details, start with the contributing guide. For details of the steps required to add support for a dataset see the adding data guide.

covidregionaldata's People

Contributors

arfon, biocyberman, bisaloo, csoneson, davisvaughan, ffinger, hamishgibbs, jhellewell14, joehickson, jonnie-bevan, joseph-palmer, kathsherratt, mariabnd, nebu1eto, patrickbarks, paulc91, pitmonticone, rboyes, richardmn, sbfnk, seabbs, sophiemeakin

covidregionaldata's Issues

Deaths in German data are by reporting date of case, not death

Hi,

I couldn't test the package (same problem as in Issue #28), but I looked a bit into your code and wanted to let you know about a particularity of the German RKI data: The death variable (AnzahlTodesfall) refers to cases registered on a given day which subsequently led to death. So it is not new deaths registered on a given date as you might expect and as is provided in the ECDC data.

To see the difference you can run the following:

# rki data:
rki <- read.csv("https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv",
                stringsAsFactors = FALSE)
sum(rki$AnzahlTodesfall[rki$Meldedatum == "2020/07/10 00:00:00"]) # set to date close to current date
# 2 on 22 July, may still increase

# ecdc data:
ecdc <- read.csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv", na.strings = "", fileEncoding = "UTF-8-BOM")
ecdc$deaths[ecdc$countriesAndTerritories == "Germany" & ecdc$dateRep == "10/07/2020"]
# 6

To get the deaths by reporting date one essentially has to read the file from RKI every day and compute the number of new deaths from the difference to the data from the previous day. We are doing that here (and are aware of several other people doing the same thing): https://github.com/KITmetricslab/covid19-forecast-hub-de/tree/master/data-truth/RKI
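The differencing step described above can be sketched in base R as follows. This is a minimal illustration only: the `snapshots` data frame, its column names, and its numbers are made up for the example and are not taken from the RKI file or from the linked repository.

```r
# Toy cumulative death totals, one row per daily snapshot of the source file.
# The numbers are invented for illustration and are not real RKI data.
snapshots <- data.frame(
  snapshot_date = as.Date(c("2020-07-08", "2020-07-09", "2020-07-10")),
  deaths_total = c(100, 104, 110)
)

# Sort by snapshot date so differences are taken between consecutive days.
snapshots <- snapshots[order(snapshots$snapshot_date), ]

# New deaths by reporting date: today's total minus the previous day's total.
# The first snapshot has no predecessor, so its value is NA.
snapshots$deaths_new <- c(NA, diff(snapshots$deaths_total))

print(snapshots)
```

In practice each total would come from summing AnzahlTodesfall over a file downloaded on that day, which is why the file has to be archived daily.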

Otherwise great idea, looking forward to using the package :)

Update pkgdown site

Current function reference for the pkgdown site does not list several useful functions including all the utilities. It would be good if this was updated.

Add new countries

Improve test coverage

Test coverage is good for key areas of the code but some functions are untested. See here for areas missing tests.

A large number of the functions missing tests are those that map the package data standard into a wider standard (that may not be currently in use). We should consider whether these are still required, as they add quite a lot of overhead to the code base.

Simplify unit testing

The unit tests are very prone to failing.

Could we simplify to a set of tests that are applied to all countries (instead of having many individual unit tests for each country, as now)?

We now have the data_check() function / vignette which (should) automate some of the more advanced checks like how many regions are returned. (Should probably set up some kind of flagging / alert system there too - just not one that causes everything else to fail along with it!)
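One possible shape for a shared check that could be applied to every country's output is sketched below. The function name `check_standard_format()` is hypothetical and not part of the package; the required column names are the standard ones shown in the README output.

```r
# Hypothetical shared check: returns a character vector of problems found,
# or an empty vector if the data passes. Not part of covidregionaldata.
check_standard_format <- function(data) {
  required <- c("date", "cases_new", "cases_total", "deaths_new", "deaths_total")
  problems <- character(0)

  # Every data set should expose the standard columns.
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    problems <- c(
      problems,
      paste("missing columns:", paste(missing_cols, collapse = ", "))
    )
  }

  # The date column should be a proper Date, not a string or factor.
  if ("date" %in% names(data) && !inherits(data$date, "Date")) {
    problems <- c(problems, "`date` is not of class Date")
  }

  # An empty result usually means the download or cleaning step failed.
  if (nrow(data) == 0) {
    problems <- c(problems, "no rows returned")
  }

  problems
}

# Toy examples: one well-formed data set and one malformed one.
good <- data.frame(
  date = as.Date("2020-08-01"),
  cases_new = 1, cases_total = 10, deaths_new = 0, deaths_total = 2
)
bad <- data.frame(date = "2020-08-01", cases_new = 1)

print(check_standard_format(good))
print(check_standard_format(bad))
```

Returning the list of problems instead of stopping on the first one means a single broken source can be reported without failing the checks for every other country.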

Meta issue: General package feedback

Opening this issue as a more general discussion thread for package users. Please use this to let us know what works or doesn't work in package use, beyond specific bugs/issues.

We have very limited resources, so provide for only a fairly basic use case (largely driven by how we use the package in our work). It would be good to hear how other users interact with the package and what could be improved.

Automated checks to see if data is updating and downloadable

Currently, this is done via unit tests failing but this is not catching most data failures so needs to be expanded.

It would be sensible to add a data status vignette that records the last date of update and updates via GitHub Actions (using pkgdown).
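As a rough sketch of what such a status check could compute, the helper below finds the most recent date with a non-missing case count. The name `latest_reported_date()` and the toy data are assumptions made for illustration; the package may implement this differently.

```r
# Hypothetical helper: the latest date on which `column` has a reported value.
# Returns NA (as a Date) if nothing has been reported at all.
latest_reported_date <- function(data, column = "cases_new") {
  reported <- data$date[!is.na(data[[column]])]
  if (length(reported) == 0) {
    return(as.Date(NA))
  }
  max(reported)
}

# Toy data in the package's standard column layout (values invented).
toy <- data.frame(
  date = as.Date(c("2020-08-01", "2020-08-02", "2020-08-03")),
  cases_new = c(5, 7, NA)
)

last_update <- latest_reported_date(toy)
print(last_update)
```

Comparing this date with the current date gives a simple staleness measure that a scheduled job could write into a status table.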

Add description

Add a description of what the package is to the README and to the description field of the description file.

Update README

  • "Date is not included if totals is FALSE" = should be TRUE
  • If easy could convert the "coverage" section into an automatically updating table (so we don't have to manually update README with each new country)

Error when using include_level_2_regions with UK data

library(covidregionaldata)  
uk_regional <- get_regional_data("UK", include_level_2_regions = TRUE)
|======================================================================| 100%
#> Error: Problem with `mutate()` input `region_level_1`.
#> x object 'region_level_1' not found
#> ℹ Input `region_level_1` is `stringr::str_trim(region_level_1, side = "both")`.

JOSS/ROpenSci publication

Thread to discuss package peer review. Thoughts and assistance welcome. Looking everything over, what is here seems solid and still appears to be a fairly valuable contribution.

  • Review package functionality and check working as intended
  • Review whether the package is still needed given other packages that have been developed.
  • Review JOSS guidelines and check the package matches expectations. An interesting alternative would be ROpenSci but we need to check the criteria.
  • Add lifecycle badges to indicate functionality stability and sunset areas of the package that are maybe not in use.
  • Review package issues and fix if possible.
  • Write short paper as required (1 page)
  • Submit latest version to CRAN.
  • Finalise Authorship (my vote as is with SEA and KS switching positions).
  • Submit to JOSS/ROpenSci.

Duplicate rows for some authorities in UK regional data

library(covidregionaldata)
library(dplyr)

uk_regional <- get_regional_data("UK", include_level_2_regions = TRUE)
#>  |======================================================================| 100%

uk_regional[duplicated(uk_regional), ] %>% 
  group_by(authority) %>% 
  tally()

#> # A tibble: 4 x 2
#>   authority                 n
#>   <chr>                 <int>
#> 1 Dumfries and Galloway   230
#> 2 Fife                    230
#> 3 Highland                230
#> 4 Powys                   230
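Until the duplication is fixed at source, a user-side workaround could look like the sketch below. It uses base R only and a small made-up data frame, so the column values are illustrative and not real package output.

```r
# Toy data with one exact duplicate row (values invented for illustration).
uk_toy <- data.frame(
  date = as.Date(c("2020-08-01", "2020-08-01", "2020-08-01")),
  authority = c("Fife", "Fife", "Powys"),
  cases_new = c(3, 3, 1)
)

# Count rows that exactly repeat an earlier row.
n_duplicates <- sum(duplicated(uk_toy))

# Keep only the first occurrence of each row.
uk_dedup <- uk_toy[!duplicated(uk_toy), , drop = FALSE]

print(n_duplicates)
print(uk_dedup)
```

This only removes rows that are identical in every column; rows that disagree on counts for the same date and authority would need to be investigated separately.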

Error on installing

I get the following error when trying to install from the development version onto my machine. Any ideas?

remotes::install_github("epiforecasts/covidregionaldata")

image

Integrate with Google open data.

A team at Google compiles an extensive open data source that is available here: https://github.com/GoogleCloudPlatform/covid-19-open-data

This has multiple data types many of which look useful. The easiest to integrate into our R tooling would be the epidemiology data (published as a csv) which is made up of case counts nationally and subnationally over multiple scales. This includes countries for which we already offer a data source as well as new countries.

A sensible first step would be to implement a function that downloads this data (using caching as elsewhere in the package) and allows filtering by country and geographic scale. This could then be linked to our other data extraction functions.

Download failure graceful error

Currently, when data cannot be downloaded, the failure is not graceful and the error is not explicit. It would be great to catch download failures and return a clear error to speed up debugging.

For example when downloading the Colombian subnational data we see the following error:

Backtrace:
     █
  1. ├─data.table::setDT(covidregionaldata::get_regional_data(country = "colombia"))
  2. └─covidregionaldata::get_regional_data(country = "colombia")
  3.   └─data %>% left_join_region_codes(region_codes_table, by = c(region_level_1 = "region"))
  4.     ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  5.     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  6.       └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  7.         └─covidregionaldata:::`_fseq`(`_lhs`)
  8.           └─magrittr::freduce(value, `_function_list`)
  9.             ├─base::withVisible(function_list[[k]](value))
 10.             └─function_list[[k]](value)
 11.               └─covidregionaldata:::left_join_region_codes(...)
 12.                 ├─dplyr::left_join(data, region_codes_table, by = by, ...)
 13.                 └─dplyr:::left_join.data.frame(...)
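One way to surface a clearer error is to wrap the download call in tryCatch() and re-raise with the URL attached. The sketch below is an illustration under assumptions: `download_or_stop()` is a hypothetical name, and the `reader` argument exists only so the example can be run without network access.

```r
# Hypothetical wrapper: run `reader(url)` and, on failure, stop with a
# message that names the URL and includes the underlying error.
download_or_stop <- function(url, reader = utils::read.csv) {
  tryCatch(
    reader(url),
    error = function(e) {
      stop(
        "Could not download data from ", url, ": ", conditionMessage(e),
        call. = FALSE
      )
    }
  )
}

# Demonstration with a stand-in reader that always fails, so that no
# network access is needed. The URL is a placeholder.
failing_reader <- function(url) stop("connection timed out")

result <- tryCatch(
  download_or_stop("https://example.org/data.csv", reader = failing_reader),
  error = function(e) conditionMessage(e)
)
print(result)
```

With a wrapper like this the failure is reported at the download step, instead of surfacing later inside a join as in the backtrace above.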

Review status of get_linelist and get_international_linelist

  • Functions need to be merged
  • Stability of file download needs to be reviewed (I assume the file is now much larger)
  • We need data on report delays (date of onset and date of confirmation) for website Rt estimates. Making this data as easy and reliable to access as possible is key.
  • This function is now quite far outside the general scope of the package but I think it is still worth having

Release 0.6.0

Updates for next release:

  • Increment version in description, dev badge in README and add new section to NEWS.md
  • Add download tracker to README
  • New regional data sources:
    1. get data in separate get_country_cases function
    2. add to get_region_codes
    3. add to get_regional_data
    • Austria: #30
    • Mexico
  • Fixes:
    • national.R - broken WHO data link : #39
    • germany.R - add option to switch data source for deaths to return by date of death
    • WORDLIST.md - update
    • Add package website and issues tracker to the description.

Getting started vignette

The basic functionality of the package is fairly obvious but some features (like the optional support for memoise) could really do with being briefly explained. Some examples of combining the data set in a simple analysis would also be really nice.

This would be ideally suited to a simple getting started vignette that could then be linked to from the README, the package website, and the package description.

NAs in un_region for national data

library(covidregionaldata)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

all_countries <- get_national_data()
all_countries %>% 
  filter(is.na(un_region)) %>% 
  group_by(country) %>% 
  tally()
#> # A tibble: 5 x 2
#>   country            n
#>   <chr>          <int>
#> 1 Greece           233
#> 2 Namibia          233
#> 3 Spain              1
#> 4 Taiwan           233
#> 5 United Kingdom   233

Created on 2020-08-19 by the reprex package (v0.3.0)

WHO data

  • In national.R, update get_who_data to download from WHO csv directly
  • Change function @description
  • Change DESCRIPTION WHO data source
  • Check tests pass on updated data

Small fixes and updates

  • get_regional_data :
    • alter code so less dependent on specific variable names
    • add option to ignore country specific naming in favour of "region" and "region_code" columns
  • uk : (#60)
    • switch back to avoid using phe R package
    • return clean, structured data with additional raw data columns

Review author list

  • Many people have now contributed - authorship needs to be reviewed with ordering updated.

Data Issue: India

Running the following code:
covidregionaldata::get_regional_data(country = "India")

results in the following error:
Error: Can't combine 'Date_YMD' <date> and 'AN' <double>.

Review status of get_interventions

  • We don't currently use this
  • I don't know what data source it links to
  • There are now many interventions databases
  • Do other packages already link to this data in a nice way - are we adding anything of use

UK data CI tests failing

The API is timing out the UK data tests (likely due to protections around the number of hits from a single IP).

It would make sense to set up a single data pull and then run tests or to use memoise/mocking as in the other tests.

This is relatively high priority as it leads our CRAN check to show as FALSE which will make it harder to detect real issues.

Move from NCoVUtils

Need to notify users of NCoVUtils that development has switched. This is mostly an internal issue for epiforecast packages.

Pinging @Jonnie-Bevan that we have changed to the new package. Your branch should also be here but unfortunately there will be quite a few merge conflicts I imagine (unfortunately lost the pull request).

JOSS submission

Now that covidregionaldata is on (pending binaries being built) CRAN it makes sense to get it peer-reviewed. My preferred venue for this would be JOSS. The only potential issue with this is the package functionality may be considered too "thin" for their requirements. What do people think? Any other suggestions?

If JOSS is a good venue then all that is needed for submission is a short summary paper (like this). There is also a checklist that needs to be gone through to make sure the package is up to standard but, off the top of my head, I think we have done everything required.

If we think there is a better venue than JOSS then I am all ears.

Author ping: @sbfnk, @kathsherratt, @ffinger, @PaulC91, @patrickbarks, @sophiemeakin, @hamishgibbs, @jhellewell14

Obvious missing feature is #41 which needs to be in place ahead of submission.

problem with `get_regional_data` for UK

Tried the CRAN version and had the following error:

library(covidregionaldata)
get_regional_data("UK")
#> Error: Must group by variables found in `.data`.
#> * Column `level_1_region_code` is not found.

Created on 2020-07-27 by the reprex package (v0.3.0)

Check no data from the future

In some data sets there are issues with data being encoded as being from the future. This is obviously false.

It could either be dealt with using a message or with a filter.
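The two options can be combined: filter the rows and emit a message saying how many were dropped. The sketch below is a minimal illustration; `drop_future_dates()` is a hypothetical name and the toy data is invented.

```r
# Hypothetical helper: remove rows dated after `today` and report how many
# were removed. Rows with a missing date are kept untouched.
drop_future_dates <- function(data, today = Sys.Date()) {
  in_future <- !is.na(data$date) & data$date > today
  if (any(in_future)) {
    message(
      "Dropping ", sum(in_future), " row(s) dated after ", format(today)
    )
  }
  data[!in_future, , drop = FALSE]
}

# Toy data with one row dated well into the future (values invented).
toy <- data.frame(
  date = as.Date(c("2020-08-01", "2020-08-02", "2030-01-01")),
  cases_new = c(1, 2, 3)
)

clean <- drop_future_dates(toy, today = as.Date("2020-08-03"))
print(clean)
```

Passing `today` as an argument keeps the function easy to test with a fixed reference date.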

Additional data sets meta issue

covidregionaldata currently supports only a small fraction of the available regional data sets. This issue is a good place to start the discussion on new data sets before opening individual issues.

Please note that we are very happy to have additional data sets be contributed but may need ongoing help supporting them if they are not stable in their availability. For contribution guidelines please see the package README.

Undefined exports

When trying to install the package the following error message occurs

Error: package or namespace load failed for ‘covidregionaldata’ in namespaceExport(ns, exports):
 undefined exports: format_ecdc_data, get_total_cases
Error: loading failed
Execution halted

which seems to be related to changes made in 9e83c53

Additionally, I think `@export` might have been omitted from functions in national.R
