Comments (19)
I'll hop in quickly with a question about how the data you've put together (which is impressive) compares with the aggregated data which the package currently draws from the Department of Civil Protection (https://github.com/pcm-dpc/COVID-19/blob/master/README_EN.md).
The level of disaggregation (gender, age cohort) that you have is more fine-grained than most of the data coming out of `covidregionaldata`, but in some cases we are aggregating across gender and age cohort to get the regional/sub-regional data we have. (I think we do this in Lithuania, at least. I think for Germany we are working from a line list.) So I'm not sure whether `covidregionaldata` has a framework to deal with the sub-population indices. But before we get to that, there's the question: if we aggregate across these indices in your data, how does it compare with what we're getting from the Department of Civil Protection?
We recently moved from one Swiss data source to another. We have not [yet] put in a standard way to let users choose between two different datasets (though I think this is sort of possible within the UK data).
from covidregionaldata.
Hi @RichardMN,
if we aggregate across these indices in your data, how does it compare with what we're getting from the Department of Civil Protection?
The main difference between the integrated surveillance data from the Italian National Institute of Health that we update on a weekly basis here and the surveillance data from the Italian Department of Civil Protection that they update on a daily basis here is that the former contains incidences organised by date of key event while the latter by date of notification (affected by the typical problem of time-varying reporting delays).
For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared.
CC: @ClaudMor, @pitmonticone
from covidregionaldata.
Things have been a bit more hectic for the past couple of weeks and I haven't decided to spend an evening writing this code yet. It's going to be a bit picky sorting out how to switch between two data sources (I suppose I'll probably look at what is done for the UK example) and this is probably why I've not written a drop-in replacement yet. I think that other contributors have also been focussed on other projects related to now- and forecasting.
from covidregionaldata.
So here am I with a suggestion, having had a bit of a look at the data.
It would be a lot simpler if the data were in 'tidy' format.
Roughly, this might look like:
date | region | gender | age_cohort | indicator | count |
---|---|---|---|---|---|
2020-01-15 | Abruzzo | M | 10_19 | deceased | 15 |
[fictional data - I haven't checked what the real numbers would be]
If you prefer to have column names (and region names) in Italian, or all lower case, or not, can all be worked around.
This will make for one very long (as opposed to wide) CSV, but much easier to filter and much easier for our code to aggregate. (And it means not writing code to download 20 x (4 or 5) different separate CSV files, then glue them together, then flatten them, ... which I can do but I'm not looking forward to.)
`covidregionaldata` is going to squash the age cohorts and the gender data - the package isn't set up to reveal that detail (which is available from some of our other sources). But if you present your data in this "tidy" form it may make your data more accessible for R-minded data scientists who want to try working through all of it.
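For anyone converting wide per-region files into the long layout sketched above, something along these lines might work. It is only a sketch: the wide column names (`M_10_19`, `F_10_19`) and all the numbers are made up for illustration and are not taken from the real InPhyT files.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one column per gender/age-cohort combination.
# Column names and counts are hypothetical, for illustration only.
wide <- tibble(
  date = as.Date(c("2020-01-15", "2020-01-16")),
  region = "Abruzzo",
  indicator = "deceased",
  M_10_19 = c(1, 0),
  F_10_19 = c(0, 2)
)

# Split each wide column name into gender and age cohort,
# and stack the values into a single `count` column.
tidy <- wide %>%
  pivot_longer(
    cols = matches("^[MF]_"),
    names_to = c("gender", "age_cohort"),
    names_pattern = "^([MF])_(.+)$",
    values_to = "count"
  ) %>%
  select(date, region, gender, age_cohort, indicator, count)

tidy
```

Each row of the result is one date/region/gender/age cohort/indicator combination, which is the shape that is easy to filter and aggregate.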
Edits:
- I closed this issue by accident when making this comment, I didn't mean to.
- If `count` is zero then there's no need to have a line for it; it would be implicitly zero. We trade some extra data for each non-zero datum against not storing `0` and field separators as place-holders.
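If the zero rows are left out, a consumer who needs a complete daily series can put them back. A minimal sketch using `tidyr::complete()`, with made-up numbers and assuming one row per date/region/indicator:

```r
library(dplyr)
library(tidyr)

# Sparse toy data: 2020-01-16 is missing because its count was zero.
sparse <- tibble(
  date = as.Date(c("2020-01-15", "2020-01-17")),
  region = "Abruzzo",
  indicator = "deceased",
  count = c(15, 3)
)

# Expand to every day in the observed range for each region/indicator
# pair that actually occurs, filling the gaps with an explicit zero.
dense <- sparse %>%
  complete(
    date = seq(min(date), max(date), by = "day"),
    nesting(region, indicator),
    fill = list(count = 0)
  )

dense
```

`nesting()` keeps only the region/indicator combinations present in the data, so no spurious combinations are invented.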
from covidregionaldata.
Looks good. Below is a quick reprex for pulling it into R, aggregating it (as we will inside the package) and plotting it.
You have saved me at least an hour of painful url-hackery.
I've not started doing logical tests against it, but in terms of making something which is going to be straightforward to pull into `covidregionaldata`, thank you very much!
library(vroom)
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
it_inphyt_data <- vroom::vroom("https://github.com/InPhyT/COVID19-Italy-Integrated-Surveillance-Data/raw/use_initial_conditions/epiforecasts_covidregionaldata/COVID19-Italy-Integrated-Surveillance-Data.csv")
#> Rows: 674503 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): region, gender, age_cohort, indicator
#> dbl (1): count
#> date (1): date
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
it_agg_data <- it_inphyt_data %>%
  group_by(date, region, indicator) %>%
  summarise(across(where(is.double), sum), .groups = "drop")
it_agg_data %>%
  filter(indicator == "confirmed") %>%
  ggplot(aes(x = date, y = count, colour = region)) +
  geom_line() +
  theme_minimal()
Created on 2022-03-14 by the reprex package (v2.0.1)
from covidregionaldata.
Hi @RichardMN, thanks for your feedback and your questions.
What is the `care` indicator? Is this new patients going into ICU? If a patient is admitted to hospital and goes immediately into ICU, are they counted as both `hospitalized` and `care`?
We've renamed `care` to the more explicit `ICU_admission` and `hospitalized` to the more explicit `ordinary_hospital_admission` in our dataset (temporary branch). If a patient is admitted to hospital and goes immediately into ICU, they will not be counted both as `ordinary_hospital_admission` and `ICU_admission`, but exclusively as `ICU_admission`.
What is your preferred count between `symptomatic` and `confirmed`?
We have no preferred count between `confirmed` (confirmed cases by date of diagnosis) and `symptomatic` (symptomatic cases by date of symptoms onset). It crucially depends on your specific research goal. It might be useful to write some code to easily choose between the two options.
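One way such a choice could look is sketched below. `select_case_series()` is a hypothetical helper, not a function in covidregionaldata, and the toy numbers are made up; it only assumes the tidy columns `date`, `region`, `indicator` and `count`.

```r
library(dplyr)

# Hypothetical helper (not part of covidregionaldata): pick which
# indicator is treated as the case series, then sum over gender and age.
select_case_series <- function(data,
                               case_definition = c("confirmed", "symptomatic")) {
  case_definition <- match.arg(case_definition)
  data %>%
    filter(indicator == case_definition) %>%
    group_by(date, region) %>%
    summarise(cases_new = sum(count), .groups = "drop")
}

# Toy data with made-up counts.
toy <- tibble(
  date = as.Date("2020-01-15"),
  region = "Abruzzo",
  gender = c("M", "F", "M"),
  age_cohort = "20_29",
  indicator = c("confirmed", "confirmed", "symptomatic"),
  count = c(5, 2, 4)
)

select_case_series(toy)                 # uses "confirmed" by default
select_case_series(toy, "symptomatic")
```

`match.arg()` makes the first option the default and rejects anything other than the two listed definitions.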
Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.)
mutate(level_1_region = recode(.data$level_1_region,
"P.A. Trento" = "Trentino-Alto Adige",
"P.A. Bolzano" = "Trentino-Alto Adige"
)) %>%
Yes, this aggregation makes perfect sense since Trentino-Alto Adige is the Italian region made up of the two self-governing provinces of Trento and Bolzano.
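For completeness, the recode only relabels rows, so the counts still have to be summed after it. A small sketch with made-up numbers, assuming the source labels are exactly "P.A. Trento" and "P.A. Bolzano" as in the existing code (the labels in the new dataset may differ):

```r
library(dplyr)

# Toy data with made-up counts; region labels follow the existing package code.
toy <- tibble(
  date = as.Date("2020-01-15"),
  level_1_region = c("P.A. Trento", "P.A. Bolzano", "Abruzzo"),
  cases_new = c(3, 4, 10)
)

merged <- toy %>%
  # Relabel the two autonomous provinces as their parent region...
  mutate(level_1_region = recode(level_1_region,
    "P.A. Trento" = "Trentino-Alto Adige",
    "P.A. Bolzano" = "Trentino-Alto Adige"
  )) %>%
  # ...then sum so the region appears once per date.
  group_by(date, level_1_region) %>%
  summarise(cases_new = sum(cases_new), .groups = "drop")

merged
```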
Please tell us if any further changes are needed.
from covidregionaldata.
I've adjusted the download url (twice - I got it wrong the first time). Checks appear to be failing in the github workflow but I think that may be because there's a problem with the French data right now.
from covidregionaldata.
Hello,
Would you have any update on this?
from covidregionaldata.
Back with more questions, some of which may take a bit of digging.
What is the `care` indicator? Is this new patients going into ICU? If a patient is admitted to hospital and goes immediately into ICU, are they counted as both `hospitalized` and `care`?
What is your preferred count between `symptomatic` and `confirmed`? I think in most other series we're using `confirmed`, but the delay between one and the other may be significant, and tracking asymptomatic but confirmed may be useful too. I can write code to choose which definition is used (see what the Lithuania code offers for three different criteria for attributing death to COVID) but right now I'm trying to get something running.
Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.) It'll have to be amended to match what your region identifiers are but I'm not that familiar with our Italy code so don't quite know why we do this. I wonder if it may be that the two regions share an ISO-3166 code and so they get merged together because in many of our other usages we depend on the ISO-3166 being a unique identifier for regions.
mutate(level_1_region = recode(.data$level_1_region,
"P.A. Trento" = "Trentino-Alto Adige",
"P.A. Bolzano" = "Trentino-Alto Adige"
)) %>%
For now, #464 is a first write-through of an alternate implementation of the Italy code which uses the InPhyT data. I'll make a PR here and would welcome someone else poking it a bit. Later this week I may try putting in:
- option to switch between Italy data sources
- option to choose between `symptomatic` and `confirmed`
from covidregionaldata.
Hello @RichardMN,
Is there anything else we can do on our side to facilitate the transition?
Thanks.
from covidregionaldata.
Hi @RichardMN,
We've recently solved a few issues and added one age class so that now we provide the following age classification:
{0_5, 6_12, 13_19, 20_29, 30_39, 40_49, 50_59, 60_69, 70_79, 80_89, 90_+}
Here is the updated data.
Thanks.
from covidregionaldata.
Hi @InterdisciplinaryPhysicsTeam - thank you for the various updates.
There are two slightly interrelated issues. I am not a maintainer of this package and so I cannot apply changes.
The package appears to be moving towards senescence - many of the upstream sources have stopped updating or moved to frequencies which are no longer useful for the epidemiological work which people want to do with data from `covidregionaldata`. As a contributor I cannot be sure it's "worth" my time to try to develop and apply changes which might never be accepted or which I may be the only person using.
On a slightly related point, France has changed their data format (three weeks ago) #469 which means that just to get to the point where the patches I made will pass checks and could be applied, I need to go and look at the France code (or someone else does) and get those fixed and applied.
Returning to point 1, I need a sense from @seabbs or @kathsherratt or others whether we're going to try to modularize the package better (so that single country failures don't bork everything else) or just accept that it was very useful for a time but no longer appears to have utility or a market.
This is a bit of a bigger question than belongs in this issue but this appears to be where the conversation might take place.
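As a rough illustration of what "single country failures don't bork everything else" could mean in code, here is a base R sketch. The `fetchers` list and `fetch_all()` are hypothetical stand-ins, not how covidregionaldata is actually organised.

```r
# Hypothetical sketch: run each country's fetcher in isolation so that
# one upstream failure is recorded rather than stopping the whole run.
fetch_all <- function(fetchers) {
  lapply(fetchers, function(fetch) {
    tryCatch(
      list(ok = TRUE, data = fetch(), error = NULL),
      error = function(e) {
        list(ok = FALSE, data = NULL, error = conditionMessage(e))
      }
    )
  })
}

# Stand-in fetchers: one succeeds, one fails like a changed upstream format.
fetchers <- list(
  Italy = function() data.frame(cases_new = 1),
  France = function() stop("upstream format changed")
)

results <- fetch_all(fetchers)

# Named logical vector showing which countries succeeded.
vapply(results, function(r) r$ok, logical(1))
```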
from covidregionaldata.
Hi @RichardMN @Bisaloo @pitmonticone @ClaudMor,
Thank you @Bisaloo for your reply.
Regarding this specific issue / PR, I haven't moved just yet because it's still not clear to me if this data is unequivocally better than the previous one or if we should keep both data sources with a switch.
It very much depends on which variables you're interested in and would like to make use of.
The main differences between the integrated surveillance data from the Italian National Institute of Health that we update on a weekly basis here and the surveillance data from the Italian Department of Civil Protection that they update on a daily basis here are the following:
- the former is disaggregated by sex and age while the latter is aggregated;
- the former contains daily time series of new confirmed cases, symptomatic cases, ordinary hospital admissions, intensive care admissions and deceased cases, while the latter also includes performed tests, total tested, cumulative confirmed cases, cumulative hospitalised cases and isolated cases;
- the former contains incidences organised by date of key event while the latter by date of notification.
For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared.
from covidregionaldata.
Okay, I'm quite convinced we need to keep both data sources, with the ability for the user to switch from one to the other.
@RichardMN, are you interested in implementing this or would you like me to do it? No pressure either way.
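A switch between the two sources might be as small as the sketch below. Everything here is hypothetical: `get_italy_data()`, the source labels `"dpc"` and `"inphyt"`, and the stand-in loaders are illustrative and do not reflect the package's actual interface.

```r
# Hypothetical dispatcher: choose a data source by name and call its loader.
get_italy_data <- function(data_source = c("dpc", "inphyt"), loaders) {
  data_source <- match.arg(data_source)
  loaders[[data_source]]()
}

# Stand-in loaders; real ones would download and clean each dataset.
loaders <- list(
  dpc = function() data.frame(source = "dpc"),
  inphyt = function() data.frame(source = "inphyt")
)

get_italy_data(loaders = loaders)            # default: first option, "dpc"
get_italy_data("inphyt", loaders = loaders)
```

Keeping the existing source as the default would leave current users unaffected while making the new data opt-in.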
from covidregionaldata.
Hi @RichardMN, thanks for your reply.
We're certainly willing to help you with the logistics if needed: if you tell us the proper format we could make an additional folder in our repository with the data in the requested format.
from covidregionaldata.
Hi @RichardMN, here is the tidy version of our dataset following your suggestion.
Could you tell us whether it looks fine to you? If so, we will notify you here when we merge it into the main branch.
from covidregionaldata.
Hi all, and thanks @RichardMN for bringing up this topic.
As mentioned in #459, we are unsure if this package is still used by / useful to anyone. Because of this, most of the contributors have moved on (except @RichardMN, whose heroic efforts to keep this package running need to be highlighted!).
I can help in getting outstanding PRs merged though if someone feels that something needs updating / fixing.
Two comments:
- Regarding this specific issue / PR, I haven't moved just yet because it's still not clear to me if this data is unequivocally better than the previous one or if we should keep both data sources with a switch. In #464, @RichardMN mentions:
but (see #463) I think it may be useful to be able to switch between the two options.
@InterdisciplinaryPhysicsTeam, @ClaudMor, can you weigh in on this please?
- About the bigger picture regarding changes while the package is broken for other reasons:
On a slightly related point, France has changed their data format (three weeks ago) #469 which means that just to get to the point where the patches I made will pass checks and could be applied, I need to go and look at the France code (or someone else does) and get those fixed and applied.
Please do not worry about this @RichardMN, if you want to submit a change, please feel free to do it, no matter what is the status of the rest of the package. Please don't feel you have a duty to fix other parts of the package to get a change accepted. If tests are failing for an unrelated reason, we can still (most of the time) verify that your PR didn't break anything else and go ahead and merge it.
If necessary, feel free to ping me. I cannot promise I'll always be responsive but I'll try.
from covidregionaldata.
Thanks for opening an issue! We'll try and get back to you shortly. If you've identified an issue and would like to fix it please see our contribution guidelines.
from covidregionaldata.
Hi @RichardMN,
Today we've successfully updated our repository merging the new folder epiforecasts_covidregionaldata.
Please don't hesitate to let us know if any further changes are needed.
from covidregionaldata.