Comments (19)

RichardMN avatar RichardMN commented on July 2, 2024 4

I'll hop in quickly with a question about how the data you've put together (which is impressive) compares with the aggregated data which the package currently draws from the Department of Civil Protection (https://github.com/pcm-dpc/COVID-19/blob/master/README_EN.md).

The level of disaggregation (gender, age cohort) that you have is more fine-grained than most of the data coming out of covidregionaldata, but in some cases we aggregate across gender and age cohort to get the regional/sub-regional data we have. (I think we do this for Lithuania, at least. I think for Germany we are working from a line list.) So I'm not sure whether covidregionaldata has a framework to deal with the sub-population indices. But before we get to that, there's the question: if we aggregate across these indices in your data, how does the result compare with what we're getting from the Department of Civil Protection?

We recently moved from one Swiss data source to another. We have not [yet] put in a standard way to let users choose between two different datasets (though I think this is sort of possible within the UK data).

from covidregionaldata.

InterdisciplinaryPhysicsTeam avatar InterdisciplinaryPhysicsTeam commented on July 2, 2024 3

Hi @RichardMN,

> if we aggregate across these indices in your data, how does it compare with what we're getting from the Department of Civil Protection?

The main difference between the integrated surveillance data from the Italian National Institute of Health, which we update on a weekly basis, and the surveillance data from the Italian Department of Civil Protection, which they update on a daily basis, is that the former contains incidences organised by date of key event, while the latter organises them by date of notification (and is therefore affected by the typical problem of time-varying reporting delays).

For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared.

CC: @ClaudMor, @pitmonticone

RichardMN avatar RichardMN commented on July 2, 2024 3

Things have been a bit more hectic for the past couple of weeks and I haven't decided to spend an evening writing this code yet. It's going to be a bit picky sorting out how to switch between two data sources (I suppose I'll probably look at what is done for the UK example) and this is probably why I've not written a drop-in replacement yet. I think that other contributors have also been focussed on other projects related to now- and forecasting.

RichardMN avatar RichardMN commented on July 2, 2024 3

So here I am with a suggestion, having had a bit of a look at the data.

It would be a lot simpler if the data were in 'tidy' format.

Roughly, this might look like:

date        region   gender  age_cohort  indicator  count
2020-01-15  Abruzzo  M       10_19       deceased   15

[fictional data - I haven't checked what the real numbers would be]

If you prefer to have column names (and region names) in Italian, or all lower case, or not, that can all be worked around.

This will make for one very long (as opposed to wide) CSV, but much easier to filter and much easier for our code to aggregate. (And it means not writing code to download 20 x (4 or 5) different separate CSV files, then glue them together, then flatten them, ... which I can do but I'm not looking forward to.)

covidregionaldata is going to squash the age cohorts and the gender data - the package isn't set up to reveal that detail (which is available from some of our other sources). But if you present your data in this "tidy" form it may make your data more accessible for R-minded data scientists who want to try working through all of it.

Edits:

  • I closed this issue by accident when making this comment, I didn't mean to.
  • If count is zero then there's no need to have a line for it, it would be implicitly zero. We trade some extra data for each non-zero datum against not storing 0 and field separators as place-holders.
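The "implicitly zero" convention can be undone on the consumer side when explicit zeros are needed, for example before plotting a time series. A minimal sketch, assuming the column names from the example above; the toy values are made up, and `tidy` and `filled` are illustrative names only:

```r
library(dplyr)
library(tidyr)

# Toy tidy data (made-up values): rows with a zero count are simply absent.
tidy <- tibble(
  date       = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02")),
  region     = c("Abruzzo", "Lazio", "Abruzzo"),
  gender     = "M",
  age_cohort = "10_19",
  indicator  = "confirmed",
  count      = c(2, 5, 1)
)

# complete() adds every missing date/region combination (for the
# gender/age_cohort/indicator combinations that occur in the data)
# and fills the missing counts with 0.
filled <- tidy %>%
  complete(date, region, nesting(gender, age_cohort, indicator),
           fill = list(count = 0))
```

Here the missing Lazio row for 2020-03-02 is restored with a count of 0, so the file can stay sparse without the downstream code having to treat absent rows specially.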

RichardMN avatar RichardMN commented on July 2, 2024 3

Looks good. Below is a quick reprex for pulling it into R, aggregating it (as we will inside the package) and plotting it.

You have saved me at least an hour of painful url-hackery.

I've not started doing logical tests against it, but in terms of making something which is going to be straightforward to pull into covidregionaldata, thank you very much!

library(vroom)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

it_inphyt_data <- vroom::vroom("https://github.com/InPhyT/COVID19-Italy-Integrated-Surveillance-Data/raw/use_initial_conditions/epiforecasts_covidregionaldata/COVID19-Italy-Integrated-Surveillance-Data.csv")
#> Rows: 674503 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (4): region, gender, age_cohort, indicator
#> dbl  (1): count
#> date (1): date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
it_agg_data <- it_inphyt_data %>%
  group_by(date, region, indicator) %>%
  summarise(across(where(is.double), sum), .groups = "drop")

it_agg_data %>%
  filter(indicator == "confirmed") %>%
  ggplot(aes(x = date, y = count, colour = region)) +
  geom_line() +
  theme_minimal()

Created on 2022-03-14 by the reprex package (v2.0.1)

InterdisciplinaryPhysicsTeam avatar InterdisciplinaryPhysicsTeam commented on July 2, 2024 3

Hi @RichardMN, thanks for your feedback and your questions.

> What is the care indicator? Is this new patients going into ICU? If a patient is admitted to hospital and goes immediately into ICU, are they counted as both hospitalized and care?

We've renamed care with the more explicit ICU_admission and hospitalized with the more explicit ordinary_hospital_admission in our dataset (temporary branch). If a patient is admitted to hospital and goes immediately into ICU, they will not be counted both as ordinary_hospital_admission and ICU_admission, but exclusively as ICU_admission.

> What is your preferred count between symptomatic and confirmed?

We have no preferred count between confirmed (confirmed cases by date of diagnosis) and symptomatic (symptomatic cases by date of symptoms onset). It crucially depends on your specific research goal. It might be useful to write some code to easily choose between the two options.
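As a rough illustration of what such a choice could look like, here is a minimal sketch in base R. The helper `select_cases()` and its `case_definition` argument are hypothetical names, not part of covidregionaldata, and the toy values are made up; only the indicator labels come from the dataset described above.

```r
# Hypothetical helper: keep only the rows for one case definition.
# match.arg() restricts the choice to the listed values and uses the
# first one ("confirmed") as the default.
select_cases <- function(data, case_definition = c("confirmed", "symptomatic")) {
  case_definition <- match.arg(case_definition)
  data[data$indicator == case_definition, , drop = FALSE]
}

# Toy data with made-up counts.
toy <- data.frame(
  date      = as.Date("2020-03-01"),
  region    = "Abruzzo",
  indicator = c("confirmed", "symptomatic", "deceased"),
  count     = c(10, 7, 1)
)

by_diagnosis <- select_cases(toy)                 # cases by date of diagnosis
by_onset     <- select_cases(toy, "symptomatic")  # cases by date of symptom onset
```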

> Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.)
>
>         mutate(level_1_region = recode(.data$level_1_region,
>           "P.A. Trento" = "Trentino-Alto Adige",
>           "P.A. Bolzano" = "Trentino-Alto Adige"
>         )) %>%

Yes, this aggregation makes perfect sense, since Trentino-Alto Adige is the Italian region made up of the two self-governing provinces of Trento and Bolzano.
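Note that after the recode the two provinces share a region name but still occupy separate rows, so their counts need to be summed for the region to appear once per date. A minimal sketch with made-up numbers; the province labels are taken from the quoted snippet, and `cases_new` is a hypothetical column name:

```r
library(dplyr)

# Toy data with made-up counts for a single date.
toy <- tibble(
  date           = as.Date("2020-03-01"),
  level_1_region = c("P.A. Trento", "P.A. Bolzano", "Veneto"),
  cases_new      = c(3, 4, 10)
)

merged <- toy %>%
  # Relabel both autonomous provinces with the name of their region.
  mutate(level_1_region = recode(level_1_region,
    "P.A. Trento"  = "Trentino-Alto Adige",
    "P.A. Bolzano" = "Trentino-Alto Adige"
  )) %>%
  # Sum the rows that now share a date and region name.
  group_by(date, level_1_region) %>%
  summarise(cases_new = sum(cases_new), .groups = "drop")
```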

Please tell us if any further changes are needed.

RichardMN avatar RichardMN commented on July 2, 2024 3

I've adjusted the download url (twice - I got it wrong the first time). Checks appear to be failing in the github workflow but I think that may be because there's a problem with the French data right now.

ClaudMor avatar ClaudMor commented on July 2, 2024 2

Hello,

Would you have any update on this?

RichardMN avatar RichardMN commented on July 2, 2024 2

Back with more questions, some of which may take a bit of digging.

What is the care indicator? Is this new patients going into ICU? If a patient is admitted to hospital and goes immediately into ICU, are they counted as both hospitalized and care?

What is your preferred count between symptomatic and confirmed? I think in most other series we're using confirmed but the delay between one and the other may be significant, and tracking asymptomatic but confirmed may be useful too. I can write code to choose between which definition is used (see what the Lithuania code offers for three different criteria for attributing death to COVID) but right now I'm trying to get something running.

Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.) It'll have to be amended to match what your region identifiers are but I'm not that familiar with our Italy code so don't quite know why we do this. I wonder if it may be that the two regions share an ISO-3166 code and so they get merged together because in many of our other usages we depend on the ISO-3166 being a unique identifier for regions.

        mutate(level_1_region = recode(.data$level_1_region,
          "P.A. Trento" = "Trentino-Alto Adige",
          "P.A. Bolzano" = "Trentino-Alto Adige"
        )) %>%

For now, #464 is a first write-through of an alternate implementation of the Italy code which uses the InPhyT data. I'll make a PR here and would welcome someone else poking it a bit. Later this week I may try putting in:

  • option to switch between Italy data sources
  • option to choose between symptomatic and confirmed

InterdisciplinaryPhysicsTeam avatar InterdisciplinaryPhysicsTeam commented on July 2, 2024 2

Hello @RichardMN,

Is there anything else we can do on our side to facilitate the transition?

Thanks.

InterdisciplinaryPhysicsTeam avatar InterdisciplinaryPhysicsTeam commented on July 2, 2024 2

Hi @RichardMN,

We've recently solved a few issues and added one age class so that now we provide the following age classification:

{0_5, 6_12, 13_19, 20_29, 30_39, 40_49, 50_59, 60_69, 70_79, 80_89, 90_+}

Here is the updated data.

Thanks.

RichardMN avatar RichardMN commented on July 2, 2024 2

Hi @InterdisciplinaryPhysicsTeam - thank you for the various updates.

There are two slightly interrelated issues.

  1. I am not a maintainer of this package and so I cannot apply changes.

     The package appears to be moving towards senescence: many of the upstream sources have stopped updating or moved to frequencies which are no longer useful for the epidemiological work which people want to do with data from covidregionaldata. As a contributor I cannot be sure it's "worth" my time to try to develop and apply changes which might never be accepted, or which I may be the only person using.

  2. On a slightly related point, France has changed their data format (three weeks ago) #469 which means that just to get to the point where the patches I made will pass checks and could be applied, I need to go and look at the France code (or someone else does) and get those fixed and applied.

Returning to point 1, I need a sense from @seabbs or @kathsherratt or others whether we're going to try to modularize the package better (so that single country failures don't bork everything else) or just accept that it was very useful for a time but no longer appears to have utility or a market.

This is a bit of a bigger question than belongs in this issue but this appears to be where the conversation might take place.
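To make the modularisation idea concrete, one possible shape is to run each country's download in isolation so that a single broken source is reported and skipped rather than failing everything. A minimal sketch in base R; `fetch_italy()` and `fetch_france()` are hypothetical stand-ins and do not reflect how covidregionaldata is actually structured:

```r
# Hypothetical per-country fetchers; fetch_france() simulates a broken source.
fetch_italy  <- function() data.frame(country = "Italy", cases = 1)
fetch_france <- function() stop("upstream data format changed")

fetchers <- list(Italy = fetch_italy, France = fetch_france)

# Run every fetcher on its own; a failure is reported and becomes NULL.
results <- lapply(names(fetchers), function(country) {
  tryCatch(
    fetchers[[country]](),
    error = function(e) {
      message(country, " failed: ", conditionMessage(e))
      NULL
    }
  )
})
names(results) <- names(fetchers)

# Keep only the sources that succeeded.
available <- Filter(Negate(is.null), results)
```

With this layout the simulated France failure only produces a message, and `available` still contains the Italy data.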

InterdisciplinaryPhysicsTeam avatar InterdisciplinaryPhysicsTeam commented on July 2, 2024 2

Hi @RichardMN @Bisaloo @pitmonticone @ClaudMor,

Thank you @Bisaloo for your reply.

> Regarding this specific issue / PR, I haven't moved just yet because it's still not clear to me if this data is unequivocally better than the previous one or if we should keep both data sources with a switch.

It very much depends on which variables you're interested in and would like to make use of.

The main differences between the integrated surveillance data from the Italian National Institute of Health, which we update on a weekly basis, and the surveillance data from the Italian Department of Civil Protection, which they update on a daily basis, are the following:

  • the former is disaggregated by sex and age while the latter is aggregated;
  • the former contains daily time series of new confirmed cases, symptomatic cases, ordinary hospital admissions, intensive care admissions and deceased cases, while the latter also includes tests performed, total tested, cumulative confirmed cases, cumulative hospitalised cases and isolated cases;
  • the former contains incidences organised by date of key event while the latter organises them by date of notification.

For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared.

Bisaloo avatar Bisaloo commented on July 2, 2024 2

Okay, I'm quite convinced we need to keep both data sources, with the ability for the user to switch from one to the other.

@RichardMN, are you interested in implementing this or would you like me to do it? No pressure either way.
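For discussion, a minimal sketch of what a user-facing switch might look like in base R. The function `get_italy_data()`, its `data_source` argument and the two loader functions are all hypothetical; the real download routines would replace the stubs:

```r
# Hypothetical loaders standing in for the two real download routines.
load_dpc    <- function() data.frame(source = "dpc", dated_by = "notification")
load_inphyt <- function() data.frame(source = "inphyt", dated_by = "key event")

# `data_source` is an illustrative argument name. match.arg() rejects
# unknown values and uses the first listed value as the default, so
# existing behaviour (Department of Civil Protection) would be preserved.
get_italy_data <- function(data_source = c("dpc", "inphyt")) {
  data_source <- match.arg(data_source)
  switch(data_source,
    dpc    = load_dpc(),
    inphyt = load_inphyt()
  )
}

default_data <- get_italy_data()          # current source stays the default
inphyt_data  <- get_italy_data("inphyt")  # opt in to the integrated surveillance data
```

Keeping the existing source as the default means current users would see no change unless they ask for the new data.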

InterdisciplinaryPhysicsTeam avatar InterdisciplinaryPhysicsTeam commented on July 2, 2024 1

Hi @RichardMN, thanks for your reply.

We're certainly willing to help you with the logistics if needed: if you tell us the proper format we could make an additional folder in our repository with the data in the requested format.

InterdisciplinaryPhysicsTeam avatar InterdisciplinaryPhysicsTeam commented on July 2, 2024 1

Hi @RichardMN, here is the tidy version of our dataset following your suggestion.

Could you tell us whether you think it is fine? If so, we will notify you here when we merge it into the main branch.

Bisaloo avatar Bisaloo commented on July 2, 2024 1

Hi all, and thanks @RichardMN for bringing up this topic.

As mentioned in #459, we are unsure if this package is still used by / useful to anyone. Because of this, most of the contributors have moved on (except @RichardMN, whose heroic efforts to keep this package running need to be highlighted!).

I can help with getting outstanding PRs merged, though, if someone feels that something needs updating / fixing.

Two comments:

  • Regarding this specific issue / PR, I haven't moved just yet because it's still not clear to me if this data is unequivocally better than the previous one or if we should keep both data sources with a switch. In #464, @RichardMN mentions:

    but (see #463) I think it may be useful to be able to switch between the two options.

    @InterdisciplinaryPhysicsTeam, @ClaudMor, can you weigh in on this please?

  • About the bigger picture regarding changes while the package is broken for other reasons:

    On a slightly related point, France has changed their data format (three weeks ago) #469 which means that just to get to the point where the patches I made will pass checks and could be applied, I need to go and look at the France code (or someone else does) and get those fixed and applied.

    Please do not worry about this @RichardMN. If you want to submit a change, please feel free to do it, no matter what the status of the rest of the package is. Please don't feel you have a duty to fix other parts of the package to get a change accepted. If tests are failing for an unrelated reason, we can still (most of the time) verify that your PR didn't break anything else and go ahead and merge it.

If necessary, feel free to ping me. I cannot promise I'll always be responsive but I'll try.

github-actions avatar github-actions commented on July 2, 2024

Thanks for opening an issue! We'll try and get back to you shortly. If you've identified an issue and would like to fix it please see our contribution guidelines.

InterdisciplinaryPhysicsTeam avatar InterdisciplinaryPhysicsTeam commented on July 2, 2024

Hi @RichardMN,

Today we've successfully updated our repository merging the new folder epiforecasts_covidregionaldata.

Please don't hesitate to let us know if any further changes are needed.
