
covid-19's Introduction

COVID-19 dataset

Coronavirus disease 2019 (COVID-19) time series listing confirmed cases, reported deaths and reported recoveries. Data is disaggregated by country (and sometimes subregion). Coronavirus disease (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, pointing to over 118,000 cases of the coronavirus illness in over 110 countries and territories around the world at the time.

This dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:

  • confirmed tested cases of Coronavirus infection
  • the number of people who have reportedly died while sick with Coronavirus
  • the number of people who have reportedly recovered from it

Data

Data is in CSV format and updated daily. It is sourced from this upstream repository maintained by the amazing team at the Johns Hopkins University Center for Systems Science and Engineering (CSSE), who have been doing a great public service from an early point by collating data from around the world.

We have cleaned and normalized that data, for example tidying dates and consolidating several files into normalized time series. We have also added some metadata, such as column descriptions, and packaged the data as a Data Package.

You can view the data and its structure, as well as download it in alternative formats (e.g. JSON), from the DataHub:

https://datahub.io/core/covid-19
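
For example, a minimal sketch of loading the aggregated per-country time series with pandas (the download URL and column names are assumptions based on DataHub conventions and the countries-aggregated.csv file referenced in the issues below):

import pandas as pd

# Assumed layout: Date,Country,Confirmed,Recovered,Deaths
df = pd.read_csv(
    "https://datahub.io/core/covid-19/r/countries-aggregated.csv",
    parse_dates=["Date"],
)
italy = df[df["Country"] == "Italy"]  # filter a single country
print(italy.tail())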

Sources

The upstream dataset currently lists the following data sources:

We will endeavour to provide more detail on how regularly and by which technical means the data is updated. Additional background is available in the CSSE blog, and in the Lancet paper (DOI), which includes this figure:

(figure: countries timeline)

Preparation

This repository uses Pandas to process and normalize the data.

You first need to install the dependencies:

pip install -r scripts/requirements.txt

Then run the following scripts:

python scripts/process_worldwide.py
python scripts/process_us.py

The processing pipeline runs on Python 3.8 via GitHub Actions; see .github/workflows/actions.yml.

License

This dataset is licensed under the Open Data Commons Public Domain Dedication and License (PDDL).

The data comes from a variety of public sources and was collated in the first instance via Johns Hopkins University on GitHub. We have used that data and processed it further. Given the public sources and factual nature, we believe the data is in the public domain and are therefore releasing the results under the Public Domain Dedication and License. We are also, of course, explicitly licensing any contributions of ours under that license.

covid-19's People

Contributors

actions-user, anuveyatsu, aravindnair430, jochym, kant, krunal-darji, morisset, nirabpudasaini, pidugusundeep, rufuspollock, trevorwinstral, weileizeng, zelima


covid-19's Issues

Wrong Numbers for Spain on 12/March/2020

Data for Spain on 12 March 2020 is wrong; it accidentally duplicates the values from 11 March 2020.

Hope you can fix this.

Edit: the file is countries-aggregated.csv

Reading in the data via read_csv gives NA results for Canada on 29 March

read_csv("time-series-19-covid-combined.csv", col_names = TRUE) gives 68 NA values for Confirmed and Deaths for Canada in the last update, on 29 March 2020. I cannot immediately see why, but I pulled the data into Excel and that works fine. It seems that only the read_csv function fails on this latest update.
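
A quick way to locate such rows in Python (a hedged sketch; file and column names assumed from the combined file's schema):

import pandas as pd

df = pd.read_csv("time-series-19-covid-combined.csv", parse_dates=["Date"])
# Show Canada rows where Confirmed or Deaths failed to parse as numbers.
mask = (df["Country/Region"] == "Canada") & (
    df["Confirmed"].isna() | df["Deaths"].isna()
)
print(df.loc[mask, ["Date", "Province/State", "Confirmed", "Deaths"]])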

Inconsistent file formatting

The data files have inconsistent file formatting making it difficult to write code which works on all files.

Header examples: Last Update changes to Last_Update, and the Confirmed column changes to FIPS.

Country/Region value changes: UK changes to United Kingdom.

Compare files '02-03-2020.csv' to '03-26-2020.csv' for example.
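
Until the formats converge, one hedged workaround is to normalize headers and country names after loading each daily file (a sketch; the alias tables only cover the variants mentioned above, so extend them as new ones appear):

import pandas as pd

# Known header variants mapped onto one canonical name.
HEADER_ALIASES = {"Last Update": "Last_Update", "Country/Region": "Country_Region"}
# Known country-name variants.
COUNTRY_ALIASES = {"UK": "United Kingdom"}

def load_daily(path):
    df = pd.read_csv(path).rename(columns=HEADER_ALIASES)
    if "Country_Region" in df.columns:
        df["Country_Region"] = df["Country_Region"].replace(COUNTRY_ALIASES)
    return df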

API based on the Data Package

As a lot of people want to connect from dashboards and get filtered/streaming access to the data, it would be good to also set up an (example) wrapper with API endpoints.

See also https://github.com/Quintessential-SFT/Covid-19-API and https://github.com/dataletsch/panoptikum/blob/master/app.py

Design (from @rufuspollock)

Jobs to be done: I want to get the latest data for my country / region.

url: coronavirus.api.datahub.io

Desired API

GET /country/{name or code} => (records in reverse date order)

[
  {
    "date": ...,
    "confirmed": ...,
    "deaths": ...
  }
]
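
A minimal sketch of such an endpoint in Flask (the file path, column names, and route shape are assumptions, not a finished design):

from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)
# Assumed layout: Date,Country,Confirmed,Recovered,Deaths
df = pd.read_csv("data/countries-aggregated.csv")

@app.route("/country/<name>")
def country(name):
    rows = df[df["Country"].str.lower() == name.lower()]
    rows = rows.sort_values("Date", ascending=False)  # reverse date order
    return jsonify(rows[["Date", "Confirmed", "Deaths"]].to_dict(orient="records"))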

API-ifying a Data Package

Can we take inspiration from https://github.com/simonw/datasette?

We have a datapackage.json - let's auto-API-ify it.

e.g. suppose we have a table cases.csv

Country, Date, Value

Each table => a url ...

/cases?field=x

Values => sub-urls

Dimension

Adding an id (??)

/cases/{country}/{date}

Romania data lagging one full day

First of all, congrats on the project! It took me almost no time to synchronize my Excel workbook with your raw CSV data. Thank you!

I have one issue: Romania's data is lagging one full day. Do you think you could refresh the dataset faster or at another time? Or please advise how to proceed.

Thanks again!

Push fixes to upstream repo

Can we try to push stuff to the upstream repo? It may be tough, as they have a lot of open PRs and a lot of noise right now. We initially planned (back in February) to put in a PR for datapackage.json (and maybe even a refactor of file structures), but this may be difficult now (they are certainly unlikely to change file structure).

However, may still be worth trying to push data bugfixes.

France aggregated count is down from yesterday, why?

France's confirmed count has an issue:

82 2020-04-12 133670
83 2020-04-13 137875
84 2020-04-14 131361

Why is the number going down from yesterday? As it is a cumulative number, it should grow or stagnate ...
Thanks for any clarification.
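
Dips like this can be flagged automatically. Below is a sketch of a monotonicity check on the cumulative series (column names assumed from countries-aggregated.csv); note that upstream does occasionally revise cumulative counts downwards, which is one plausible cause of the dip reported here.

import pandas as pd

df = pd.read_csv("countries-aggregated.csv", parse_dates=["Date"])
df = df.sort_values(["Country", "Date"])
# Cumulative counts should never decrease; flag the days where they do.
drops = df[df.groupby("Country")["Confirmed"].diff() < 0]
print(drops[["Country", "Date", "Confirmed"]])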

Automate keeping data up to date by pulling data from upstream

We want to automate collecting the data every day (or even every half-day?). Since the upstream repo is updated at 23:59 GMT (once a day), we can run our update script right after that time, e.g. 00:00 GMT. A sketch of such a workflow follows the task list below.

Acceptance criteria

  • The repo is updated at least every day
  • The new dataset is pushed to datahub.io/core/covid-19

Tasks

  • Build action - #15
    • Create github actions to:
      • setup python project
      • install dependencies
      • run the update scripts
      • commit changes and push to the repo (master branch)
    • Run it on a schedule at 00:00 GMT
    • Run it on master branch only
    • Setup github token so the action is authorized to push to the repo
  • Deploy action (to datahub.io) - 4c98133
    • prepare datapackage.json for the dataset
    • setup node project
    • install data-cli via npm/yarn
    • run data push command
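
A hedged sketch of such a scheduled workflow (GitHub Actions YAML; step details, action versions, and commit messages are assumptions; the real pipeline lives in .github/workflows/actions.yml):

name: update-data
on:
  schedule:
    - cron: "0 0 * * *"   # daily at 00:00 GMT, just after the upstream 23:59 update
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install -r scripts/requirements.txt
      - run: |
          python scripts/process_worldwide.py
          python scripts/process_us.py
      - run: |
          git config user.name "actions-user"
          git config user.email "actions@github.com"
          git add data/
          git commit -m "Automated data update" || echo "No changes to commit"
          git push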

Future

Confirmed Cases missing

The current dataset on 4/7/2020 shows 0 cases for North Dakota, and the overall number for 4/6/2020 is off by about 140k.

effected vs affected

In the intro you say "effect", which is correct. But where you say "effected" it should be "affected".

To effect means to make something happen; to be affected means to be changed by something.

Italy has wrong data for March 23

I was updating my dashboards on https://corona.deleu.dev and I noticed a full flat data on Italy.

The dataset shows

2020-03-21,Italy,,43.0,12.0,53578,6072,4825
2020-03-22,Italy,,43.0,12.0,59138,7024,5476
2020-03-23,Italy,,43.0,12.0,59138,7024,5476

when in reality it should be

2020-03-21,Italy,,43.0,12.0,53578,6072,4825
2020-03-22,Italy,,43.0,12.0,59138,7024,5476
2020-03-23,Italy,,43.0,12.0,63927,7432,6077

Executing on 3/14/2020 gets ValidationError & CastError

CastError                                 Traceback (most recent call last)
~/.local/lib/python3.8/site-packages/dataflows/base/schema_validator.py in schema_validator(resource, iterator, field_names, on_error)
     48             for f in schema_fields:
---> 49                 row[f.name] = f.cast_value(row.get(f.name))
     50         except CastError as e:

~/.local/lib/python3.8/site-packages/tableschema/field.py in cast_value(self, value, constraints)
    145             if cast_value == config.ERROR:
--> 146                 raise exceptions.CastError((
    147                     'Field "{field.name}" can\'t cast value "{value}" '

CastError: Field "Deaths" can't cast value "None" for type "number" with format "default"
During handling of the above exception, another exception occurred:

ValidationError                           Traceback (most recent call last)
<ipython-input-11-4036c1aa3210> in <module>
     18 extra_value = {'name': 'Case', 'type': 'number'}
     19 
---> 20 Flow(
     21       load(f'{BASE_URL}{CONFIRMED}'),
     22       load(f'{BASE_URL}{RECOVERED}'),

~/.local/lib/python3.8/site-packages/dataflows/base/flow.py in results(self, on_error)
     10 
     11     def results(self, on_error=None):
---> 12         return self._chain().results(on_error=on_error)
     13 
     14     def process(self):

~/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py in results(self, on_error)
     92     def results(self, on_error=None):
     93         ds = self._process()
---> 94         results = [
     95             list(schema_validator(res.res, res, on_error=on_error))
     96             for res in ds.res_iter

~/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py in <listcomp>(.0)
     93         ds = self._process()
     94         results = [
---> 95             list(schema_validator(res.res, res, on_error=on_error))
     96             for res in ds.res_iter
     97         ]

~/.local/lib/python3.8/site-packages/dataflows/base/schema_validator.py in schema_validator(resource, iterator, field_names, on_error)
     44         field_names = [f.name for f in schema.fields]
     45     schema_fields = [f for f in schema.fields if f.name in field_names]
---> 46     for i, row in enumerate(iterator):
     47         try:
     48             for f in schema_fields:

~/.local/lib/python3.8/site-packages/dataflows/processors/dumpers/dumper_base.py in row_counter(self, resource, iterator)
     67     def row_counter(self, resource, iterator):
     68         counter = 0
---> 69         for row in iterator:
     70             counter += 1
     71             yield row

~/.local/lib/python3.8/site-packages/dataflows/processors/dumpers/file_dumper.py in rows_processor(self, resource, writer, temp_file)
     74 
     75     def rows_processor(self, resource, writer, temp_file):
---> 76         for row in resource:
     77             writer.write_row(row)
     78             yield row

~/.local/lib/python3.8/site-packages/dataflows/base/schema_validator.py in schema_validator(resource, iterator, field_names, on_error)
     49                 row[f.name] = f.cast_value(row.get(f.name))
     50         except CastError as e:
---> 51             if not on_error(resource['name'], row, i, e):
     52                 continue
     53 

~/.local/lib/python3.8/site-packages/dataflows/base/schema_validator.py in raise_exception(res_name, row, i, e)
     20 
     21 def raise_exception(res_name, row, i, e):
---> 22     raise ValidationError(res_name, row, i, e)
     23 
     24 

ValidationError: 
ROW: {'Date': datetime.date(2020, 3, 14), 'Province/State': None, 'Country/Region': 'Thailand', 'Lat': Decimal('15.0'), 'Long': Decimal('101.0'), 'Confirmed': None, 'Recovered': None, 'Deaths': 'None'}
----
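
The row above shows the literal string "None" in the Deaths field, which the schema's number type cannot cast. One hedged workaround (not necessarily the repo's actual fix) is a small row processor that nulls those strings before validation; dataflows accepts plain row-level functions as steps:

from dataflows import Flow, load, dump_to_path

def none_to_null(row):
    # Coerce the literal string "None" into a real null so casting succeeds.
    for key in ("Confirmed", "Recovered", "Deaths"):
        if row.get(key) == "None":
            row[key] = None

Flow(
    load("time_series_19-covid-Confirmed.csv"),  # file name assumed
    none_to_null,
    dump_to_path("data"),
).process()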

Data Update

Hi,
when will the data be updated? Thanks, bye, Alberto

[workflow] Actions pipeline stuck

Your action workflow appears to have been stuck for the last 10 hours.
Possibly something went wrong at the step Run pip install -r scripts/requirements.txt.


NYT data (for the US)

The NYT now has data, just for the US: https://github.com/nytimes/covid-19-data

But it's not open ...

In light of the current public health emergency, The New York Times Company is
providing this database under the following free-of-cost, perpetual,
non-exclusive license. Anyone may copy, distribute, and display the database, or
any part thereof, and make derivative works based on it, provided (a) any such
use is for non-commercial purposes only and (b) credit is given to The New York
Times in any public display of the database, in any publication derived in part
or in full from the database, and in any other public use of the data contained
in or derived from the database.

Dashboard for this

Create a simple dashboard, similar to e.g. https://carbon.datahub.io or https://london.datahub.io, to present this information and provide an open source basis for others to quickly create their own dashboards, especially per country.

Tasks

  • Design the dashboard
  • Sketch out dashboard
  • Implement

Implement

Analysis

Mockup

(screenshot: dashboard mockup)

Charting libraries

v1 - worldwide data with key figures and choropleth map

v2 - added line chart with cumulative cases in top 5 countries

(screenshot: dashboard v2)

v3 - ability to select a country, showing a graph with cumulative cases, deaths per day and new cases per day

(screenshot: dashboard v3)

v4 - added a figure showing cases per 100k population

(screenshot: dashboard v4)

v5 - added choropleth map (again)

(screenshot: dashboard v5)


Charts to do

  • Time series of cases
  • Choropleth of cases by country

Needs Analysis

Domain Model

Value: (new confirmed) cases, deaths, recovered

Dimensions:

  • Time
  • Country
    • SubCountry i.e. Province/State
    • City

Job Stories

Key figures (for world and per country)

When wanting to know about the situation, I want to see key figures such as the total number of people infected/recovered/died, so that I understand the current status of the situation in the world.

  • In my country, in my locality

Specific items:

  • How many total cases? [single figure]
  • How many total cases over time, i.e. cumulative? [time series]
  • How many cases per day over time? [time series]
  • What is the mortality rate (and how has it changed over time)?
  • Cases in specific locations (lon/lat and by country)
  • Total cases by country (now)
  • Cases by country (over time)

"What's happening in my country" => Ditto but just with my country

What's changed

  • When I see the COVID-19 dashboard, I want to see a figure showing the change in the total number of people affected in the last 24h (something like a stock market price), so that I can tell whether things are getting better or not.

Secondary

  • When I see the COVID-19 dashboard, I want to check the number of cases per capita, so that I can compare my country against others.

Tertiary

  • When I see the COVID-19 dashboard, I want to see a visualization showing some correlation with economic indicators (by country), so that I can assess the economic impact.

Meta

  • When I see the COVID-19 dashboard, I want to be able to share it via twitter/facebook/instagram, so that my friends/colleagues can also check it out.
  • ...

Open Source Helps!

Thanks for your work to help the people in need! Your site has been added! I currently maintain the Open-Source-COVID-19 page, which collects all open source projects related to COVID-19, including maps, data, news, APIs, analysis, medical and supply information, etc. Please share it with anyone who might need the information in the list, or who might contribute to some of those projects. You are also welcome to recommend more projects.

http://open-source-covid-19.weileizeng.com/

Cheers!

unable to open database file

Hi,
When I try to run this in a Jupyter notebook, I get the following error:

OperationalError: unable to open database file

FAQs (WIP)

Why this dataset? (After all, the authoritative one is elsewhere.)

Ans: well-structured data, packaged as a Data Package so you have tools to ingest it into your system of choice, reliably kept up to date ...
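
For example, with the datapackage library the whole dataset can be pulled in a few lines (a sketch; the resource name is an assumption):

from datapackage import Package

package = Package("https://datahub.io/core/covid-19/datapackage.json")
print([r.name for r in package.resources])  # list the available resources
rows = package.get_resource("countries-aggregated").read(keyed=True)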

Why this dashboard? After all, there are many others.

We provide a dashboard that is simple and well-designed, but primarily one that is open source and easy for others to reuse.

Who's behind this?

@rufuspollock and colleagues at @datopian who have worked in #opendata and #opensource and #datasets for many years.

State Data Missing for US

I imported this file yesterday and it included state data for the US; when I refreshed this morning, the data was missing.

  • time-series-19-covid-combined_csv.csv

Executing process.py on 3/11/2020 gets ValidationError

Here's the traceback:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dataflows/base/schema_validator.py", line 49, in schema_validator
row[f.name] = f.cast_value(row.get(f.name))
File "/usr/local/lib/python3.7/site-packages/tableschema/field.py", line 149, in cast_value
).format(field=self, value=value))
datapackage.exceptions.CastError: Field "Deaths" can't cast value "None" for type "number" with format "default"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "process.py", line 60, in
dump_to_path()
File "/usr/local/lib/python3.7/site-packages/dataflows/base/flow.py", line 12, in results
return self._chain().results(on_error=on_error)
File "/usr/local/lib/python3.7/site-packages/dataflows/base/datastream_processor.py", line 96, in results
for res in ds.res_iter
File "/usr/local/lib/python3.7/site-packages/dataflows/base/datastream_processor.py", line 96, in
for res in ds.res_iter
File "/usr/local/lib/python3.7/site-packages/dataflows/base/schema_validator.py", line 46, in schema_validator
for i, row in enumerate(iterator):
File "/usr/local/lib/python3.7/site-packages/dataflows/processors/dumpers/dumper_base.py", line 69, in row_counter
for row in iterator:
File "/usr/local/lib/python3.7/site-packages/dataflows/processors/dumpers/file_dumper.py", line 76, in rows_processor
for row in resource:
File "/usr/local/lib/python3.7/site-packages/dataflows/base/schema_validator.py", line 51, in schema_validator
if not on_error(resource['name'], row, i, e):
File "/usr/local/lib/python3.7/site-packages/dataflows/base/schema_validator.py", line 22, in raise_exception
raise ValidationError(res_name, row, i, e)
dataflows.base.schema_validator.ValidationError:
ROW: {'Date': datetime.date(2020, 3, 11), 'Province/State': 'Anhui', 'Country/Region': 'Mainland China', 'Lat': Decimal('31.8257'), 'Long': Decimal('117.2264'), 'Confirmed': None, 'Recovered': None, 'Deaths': 'None'}

Regional granularity

Country-level comparisons are quite limiting; it is difficult to draw conclusions about the impact of measures. For instance, mortality and intensive-care figures at the country level are under- or over-estimated depending on whether co-morbidities are considered, or after a health system collapses. The statistics are already much more granular in the Johns Hopkins dataset for the United States, in the Italian regions, or in the Swiss cantons. It would be good to build on the work here to go beyond a country ranking.

Dataset Design

Value: (new confirmed) cases, deaths, recovered

Dimensions:

  • Time
  • Country
    • SubCountry i.e. Province/State

Example rows (one per location per day):

Province/State,Country/Region,Lat,Long,date,case
Anhui,Mainland China,31.8257,117.2264,2020-03-04,6
Anhui,Mainland China,31.8257,117.2264,2020-03-05,6
Anhui,Mainland China,31.8257,117.2264,2020-03-06,6
Beijing,Mainland China,40.1824,116.4142,2020-01-22,0
Beijing,Mainland China,40.1824,116.4142,2020-01-23,0
Beijing,Mainland China,40.1824,116.4142,2020-01-24,0
Beijing,Mainland China,40.1824,116.4142,2020-01-25,0
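
A sketch of producing this tidy one-row-per-location-per-day layout from the upstream wide format (one column per date) with pandas; the input file name is an assumption:

import pandas as pd

wide = pd.read_csv("time_series_19-covid-Confirmed.csv")
# Melt the per-date columns into a single "date" column.
id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
tidy = wide.melt(id_vars=id_cols, var_name="date", value_name="case")
tidy["date"] = pd.to_datetime(tidy["date"]).dt.date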

Perfect dataset

Would go with cumulative numbers (we can always difference to get per-day figures; see the sketch after the proposed columns below).

  • What about country totals? Do we compute them and put them in the file (e.g. if country is null it is the total), or do we aggregate in the browser / elsewhere?

Proposed columns:

Country,Province,Date,Confirmed,Death,Recovered
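
Deriving per-day numbers from the cumulative series is then a short pandas step (a sketch against countries-aggregated.csv; group additionally by Province if present):

import pandas as pd

df = pd.read_csv("countries-aggregated.csv", parse_dates=["Date"])
df = df.sort_values(["Country", "Date"])
# New cases per day; the first day falls back to the cumulative value.
df["NewConfirmed"] = df.groupby("Country")["Confirmed"].diff().fillna(df["Confirmed"])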

province2latlon

Province,Lat,Lon

State-wise data for the US

Hello,

I saw that some countries (e.g., China, Canada, Australia) have state/province data, but not the US. Is there any reason there is only data for the whole US?

Thanks!

404 on the recovery url

On running process.py, I get a 404 on the recovered URL. This one:
RECOVERED = 'time_series_19-covid-Recovered.csv'

Maybe this is just some temporary url bug, but I thought I'd let you know.

Meanwhile, I have managed to get the script to run by commenting out all references to the recovered portion of the data, which is less than ideal.

Great job!

[optimization] Move longitude and latitude data to a separate CSV

As a user of the covid-19 data, I want the latitude and longitude data in a separate CSV file from the other data, so that data use is optimized by cutting down file sizes, loading times, etc. (A sketch of the split follows the acceptance criteria below.)

Acceptance criteria

  • Latitude and longitude data is moved to a separate CSV file
  • A new datapackage.json is created for the new CSV
  • A new visualization is created for it
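
A sketch of the split (file and column names assumed; not a committed design):

import pandas as pd

df = pd.read_csv("time-series-19-covid-combined.csv")
keys = ["Country/Region", "Province/State"]
# One small lookup table with coordinates, one slim time series without them.
latlon = df[keys + ["Lat", "Long"]].drop_duplicates()
slim = df.drop(columns=["Lat", "Long"])
latlon.to_csv("locations-latlon.csv", index=False)
slim.to_csv("time-series-19-covid-combined-slim.csv", index=False)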

Add clinical trials information

Add data about the current clinical trials being conducted against COVID-19.

This might (or might not) involve scraping some clinical trials registries (e.g. EUCTR, ICTRP etc.).

I will self-assign this as I wanted to get them anyway and can't think of a better place to put them. The only caveat is that I will try to patch some of the OpenTrials collectors to do it, which might not be the straightest (or most obvious) path to extracting that information.

Canada Recovery Data

Not seeing recovery data for Canada, but it is being updated in the Johns Hopkins data.

Those are the only NA's I'm seeing. Great work on this - thanks a ton.

docs: methodology

Great stuff! I'm planning to use the API for my dashboard Pandemic Estimator, but I wish your API had better documentation of the methodology. I'm using JHU directly and I know what chaos it is, the most blatant example being that in practice the "cumulative" data they provide quite often isn't cumulative. And the whole change of file formats, etc.

Can you please describe the methodology for how you deal with this? What's from JHU and what's from CSBS? What has been omitted, what has been "adjusted", and how? Thank you!

Admin2/City field missing in US data

Since the following commit: ab35560, the "Admin2" field has been missing from the US CSV files. In my case, I was using this field to filter data by US city, and now I can only filter by state. Can this field be added back into the US datasets?

Blog post updating on progress so far

Blog post(s) to put on datahub.io/blog highlighting progress on this dataset plus all the work by others. We could also blog about specific topics, e.g. the modelling background.

@Liyubov do you want to lead on this? I suggest drafting blog posts in markdown on hackmd so that they can be reviewed and then added to datahub.io/blog easily.

Potential Posts

  • How we are collecting and packaging the data
  • An overview on the data, dashboarding and modelling efforts going on in the ecosystem
  • An overview of modelling approaches
