ocha-dap / dap-scrapers
DAP components that regularly scrape data from other websites for inclusion in the DAP database
License: The Unlicense
You guys may have handled this already, but just in case:
There are 3 incident types in ACLED that are explicitly non-violent (Non-Violent Conflict Event, Non-Violent Transfer of Location Control, and Headquarters or Base Establishment). We should exclude those from the total calculated for PVX040.
copying @ochastats (Javier) for reference.
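A minimal sketch of the exclusion, in case it helps. The row keys ("country", "year", "event_type") are assumptions for illustration, not the scraper's actual field names, and the comparison is case-insensitive because the exact capitalisation in the ACLED file has not been checked here.

```python
from collections import Counter

# The three event types named above as explicitly non-violent.
NON_VIOLENT_EVENT_TYPES = {
    "non-violent conflict event",
    "non-violent transfer of location control",
    "headquarters or base establishment",
}


def count_violent_incidents(rows):
    """Count incidents per (country, year), skipping non-violent event types.

    `rows` is an iterable of dicts with the hypothetical keys
    "country", "year" and "event_type".
    """
    counts = Counter()
    for row in rows:
        event_type = row["event_type"].strip().lower()
        if event_type in NON_VIOLENT_EVENT_TYPES:
            continue  # not part of the PVX040 total
        counts[(row["country"], row["year"])] += 1
    return counts
```

The event type "Battle" used when trying this out is just a placeholder for any type that is not in the excluded set.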
This series originally belongs to the World Bank. Recommend extracting it from the WB rather than from the HDR. It is located at http://data.worldbank.org/indicator/NY.GDP.PCAP.PP.KD
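A rough sketch of what pulling this from the World Bank could look like, assuming the public v2 JSON API rather than scraping the page above. The function names are made up for illustration, and the parser assumes the usual two-element response (paging metadata, then a list of records) with annual data, so it should be checked against a live response before being relied on.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.worldbank.org/v2"


def indicator_url(indicator, country="all", per_page=1000):
    """Build the JSON API URL for one indicator (hypothetical helper)."""
    query = urlencode({"format": "json", "per_page": per_page})
    return "%s/country/%s/indicator/%s?%s" % (API_ROOT, country, indicator, query)


def parse_indicator_payload(payload):
    """Turn a decoded JSON payload into (iso3, year, value) tuples.

    Assumes payload[0] is paging metadata and payload[1] is the record list.
    Records with a null value are skipped; a value of 0 is kept.
    """
    if len(payload) < 2 or payload[1] is None:
        return []
    rows = []
    for record in payload[1]:
        if record.get("value") is None:
            continue
        rows.append(
            (record["countryiso3code"], int(record["date"]), float(record["value"]))
        )
    return rows
```

Fetching is left out on purpose: any HTTP client can request indicator_url("NY.GDP.PCAP.PP.KD") and pass the decoded JSON to parse_indicator_payload.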
Just a minor issue I stumbled upon while looking at recently scraped data. This indicator records ACLED's count of incidents per year, yet is marked as non-numeric.
I've written a little script to check other indicators, and PVX040 is the only one with this mismatch (my script also turns up CG060, which could also be considered numeric except it's a "code" with leading zeros).
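For anyone who wants to repeat the check, here is a sketch of the idea (not the script mentioned above). It assumes a mapping from indicator ID to its "is numeric" flag and an iterable of (indicator ID, value) pairs; both shapes are assumptions about how the tables would be loaded.

```python
def looks_numeric(text):
    """True if the text parses as a float."""
    try:
        float(text)
    except (TypeError, ValueError):
        return False
    return True


def has_leading_zero(text):
    """True for code-like values such as "004" (but not "0" or "0.5")."""
    digits = text.lstrip("+-")
    return len(digits) > 1 and digits[0] == "0" and digits[1].isdigit()


def find_mismarked(is_number_by_indicator, values):
    """Return (numeric, code_like) lists of indicators marked non-numeric
    whose values nevertheless all parse as numbers."""
    by_indicator = {}
    for ind_id, value in values:
        by_indicator.setdefault(ind_id, []).append(str(value).strip())

    numeric, code_like = [], []
    for ind_id, vals in sorted(by_indicator.items()):
        if is_number_by_indicator.get(ind_id, False):
            continue  # already marked numeric, nothing to report
        if not all(looks_numeric(v) for v in vals):
            continue  # genuinely textual
        if any(has_leading_zero(v) for v in vals):
            code_like.append(ind_id)
        else:
            numeric.append(ind_id)
    return numeric, code_like
```

Splitting out the leading-zero case keeps codes like CG060 from being silently converted to numbers.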
Currently downloaded from the Human Development Report. Suggest changing the source to childinfo.org. This data can be found at: http://www.childinfo.org/maternal_mortality_indicators.php
The World Bank provides estimates for this indicator, but the figures are not as precise as those from the original source. It is advisable to import the data from http://www.childmortality.org/
faosec:
File "faosec.py", line 50, in <module>
do_file()
File "faosec.py", line 35, in do_file
v12 = mts['V12']
File "/home/lib/messytables/messytables/core.py", line 157, in __getitem__
raise KeyError("No RowSet called '%s'" % name)
KeyError: "No RowSet called 'V12'"
The sheet is now called "V7.1", suggesting major format changes.
The title "Prevalence of Undernourishment" is on row 1.
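One way to make the lookup survive future renames is to find the sheet by its row 1 title instead of its name. The sketch below deliberately avoids the messytables API and works on a plain dict of sheet name to list of rows, which is an assumed intermediate shape, not what faosec.py currently has.

```python
def find_sheet_by_title(sheets, title):
    """Return the name of the first sheet whose first row contains `title`.

    `sheets` maps sheet name -> list of rows, each row a list of cell values
    (a hypothetical structure used only for this illustration).
    """
    wanted = title.strip().lower()
    for name, rows in sheets.items():
        if not rows:
            continue  # empty sheet, nothing on row 1
        first_row = [str(cell).strip().lower() for cell in rows[0] if cell is not None]
        if any(wanted in cell for cell in first_row):
            return name
    raise KeyError("No sheet with %r on row 1" % title)
```

Matching on a substring is a design choice: it tolerates extra text around the title, at the cost of possibly matching more than one sheet.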
Recommend changing the source of this data from MDGs to FAO. Data is located at http://www.fao.org/economic/ess/ess-fs/fs-data/en/#.Ut56ZxAo7cs
The ACLED dataset does not have a name in the dataset table. Suggest we use "Armed conflict location and event dataset".
It seems that values reported as "0" are excluded from ScraperWiki data, although they may be present in source data.
I did an analysis of the minimum value of the various numeric indicators, and many indicators have a minimum of "1", which seems suspect.
See the "min value" tab of this spreadsheet:
https://docs.google.com/spreadsheet/ccc?key=0AgxtRla5zLd_dDJpWGwzRldCMGRFaFZXVWl3eXE3NXc&usp=sharing#gid=1
which was built from the data in the CSV link in
https://github.com/OCHA-DAP/ProjectWiki/wiki/ScraperWiki-Download-Links
on 2014-01-18.
Looking over the source data for some of these (e.g. EM-DAT) it does seem that zeros are likely "filled in" for missing data at the source level, but I think that inclusion of those zeros is desired.
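The minimum-value check is easy to rerun against a fresh download. This is a sketch under the assumption that the CSV has been read into (indicator ID, value) pairs; the function names are illustrative only.

```python
def minimum_by_indicator(values):
    """Return {indicator_id: minimum numeric value}, ignoring non-numeric cells."""
    minimums = {}
    for ind_id, raw in values:
        try:
            number = float(raw)
        except (TypeError, ValueError):
            continue  # skip text values
        if ind_id not in minimums or number < minimums[ind_id]:
            minimums[ind_id] = number
    return minimums


def suspicious_minimums(values):
    """Indicators whose minimum is exactly 1, which may mean zeros were dropped."""
    minimums = minimum_by_indicator(values)
    return sorted(ind_id for ind_id, low in minimums.items() if low == 1.0)
```

A minimum of exactly 1 is only a hint, not proof: some count indicators may legitimately never reach zero, so each hit still needs a look at the source data.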
Bytes \xc3\xa2\xc2\x82\xc2\xa7 appear in the PHL spreadsheet.
This decodes to \xe2\x82\xa7, which decodes to \u20a7.
The bytes \xe2\x82\xa7 represent [₧], the Peseta symbol, in UTF-8
http://www.fileformat.info/info/unicode/char/20a7/index.htm
So we've double UTF-8 encoded somewhere in the food chain. :(
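The round trip can be reproduced in a few lines. This sketch assumes the intermediate step treated the UTF-8 bytes as Latin-1, which is what the \xc2\x82 pair suggests (a Windows-1252 misread would have produced different bytes for \x82). The helper name is made up.

```python
def undo_double_utf8(raw):
    """Reverse one layer of accidental double UTF-8 encoding.

    Assumes the original UTF-8 bytes were misread as Latin-1 and then
    encoded to UTF-8 a second time.
    """
    once_decoded = raw.decode("utf-8")             # text made of U+00E2, U+0082, U+00A7
    original_bytes = once_decoded.encode("latin-1")  # back to the real UTF-8 bytes
    return original_bytes.decode("utf-8")


def double_encode(text):
    """Recreate the damage, for testing: UTF-8, misread as Latin-1, UTF-8 again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")
```

This only repairs the symptom; the stage that does the second encode still needs to be found, since blindly applying the fix to correctly encoded text would raise an error or corrupt it.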
Data is currently extracted from MDGs. Suggest extracting it from the original source, UNICEF's Childinfo, at http://www.childinfo.org/malnutrition_nutritional_status.php
I have problems trying to reproduce these two series:
Impact of natural disasters: number of deaths
Impact of natural disasters: population affected (average per year/million)
The link seems to have been removed. Any advice?
Currently data is extracted from MDGs. The original source is updated more often; suggest pointing the data extraction to http://www.wssinfo.org/data-estimates/table/
Choose:
Sanitation = total improved
Areas = national
Values = countries
Years = all
Units = relative (% population)
Suggesting changing the source of this data from MDGs to World Bank. Data is available at http://data.worldbank.org/indicator/IT.MLT.MAIN.P2
Currently data is extracted from MDGs. The original source is updated more often; suggest pointing the data extraction to http://www.wssinfo.org/data-estimates/table/
Choose:
Water = total improved
Areas = national
Values = countries
Years = all
Units = relative (% population)
m49.py
EXIT: 0
acled.py
EXIT: 0
echo.py
EXIT: 0
emdat.py
EXIT: 1
esa.py
EXIT: 0
faosec.py
EXIT: 1
faostat.py
EXIT: 0
hdr-disaster.py
EXIT: 0
hdrstats.py
EXIT: 0
mdg.py
EXIT: 0
unicef.py
EXIT: 0
unterm.py
EXIT: 1
weather.py
EXIT: 0
who-athena.py
EXIT: 2
who-athena2.py
EXIT: 2
wikipedia.py
EXIT: 0
worldbank-lendinggroups.py
EXIT: 0
worldbank.py
EXIT: 0
worldaerodata.py
EXIT: 0
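A listing like the one above can be produced, and the failures picked out, with a small runner. This is a sketch only: it assumes each scraper is a standalone script that signals failure through a non-zero exit status, and it is not how the ScraperWiki box actually schedules them.

```python
import subprocess
import sys


def run_scrapers(script_paths, python=sys.executable):
    """Run each script in turn and return {script_path: exit_code}."""
    results = {}
    for path in script_paths:
        completed = subprocess.run([python, path])
        results[path] = completed.returncode
    return results


def failing_scrapers(results):
    """Return the scripts whose exit code was non-zero, sorted by name."""
    return sorted(path for path, code in results.items() if code != 0)
```

Running the scripts one after another keeps the output readable; a failure in one scraper does not stop the rest.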
scrapedate
Discovered whilst browsing PHL.csv
This may well make running scrapers difficult to reproduce.
For reference, current state of the ScraperWiki box is:
Mako==1.0.0
MarkupSafe==0.23
PyHamcrest==1.8.0
PyYAML==3.10
SQLAlchemy==0.8.3
Tempita==0.5.1
Unidecode==0.04.14
alembic==0.6.5
chardet==2.1.1
-e git+https://github.com/pudo/dataset@9a91f3d1139a022b8c29f7c4215f6500b9e39b75#egg=dataset-master
decorator==3.4.0
json-table-schema==0.1
lxml==3.2.4
-e git+https://github.com/scraperwiki/messytables@d7b24c85a6216603a2b49a28a857397606f68c1e#egg=messytables-master
nose==1.3.0
pbr==0.5.23
python-dateutil==1.5
python-magic==0.4.3
python-slugify==0.0.6
requests==2.0.1
requests-cache==0.4.4
-e git+https://github.com/scraperwiki/scrumble@45cbf773ff7a3710493f63c82212cbba31c65bcd#egg=scrumble-master
sqlalchemy-migrate==0.8.2
wsgiref==0.1.2
xlrd==0.9.2
-e git+https://github.com/scraperwiki/xypath@b73e47b30e55d8683f3d7656b4063c46c33f1501#egg=xypath-master
This includes the dependencies of dependencies.
Note that the messytables commit listed above doesn't seem to exist anymore (neither in the scraperwiki repo nor in the upstream okfn one). Also note that the requirements.txt in the repo has various dependencies just set to pull from GitHub master; it would be better if these were pinned.
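A quick way to spot the unpinned entries is to scan requirements.txt for lines that have neither an "==" version nor a full commit hash. The sketch below is a heuristic, not a full requirements parser, and the function name is invented for illustration.

```python
import re

# A 40-character commit hash after "@", followed by "#egg=..." or end of line.
_COMMIT_PIN = re.compile(r"@[0-9a-f]{40}(?=#|$)")


def unpinned_requirements(lines):
    """Return requirement lines that are not pinned to a version or commit."""
    unpinned = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("-e ") or "git+" in line:
            if not _COMMIT_PIN.search(line):
                unpinned.append(line)  # VCS requirement tracking a branch
        elif "==" not in line:
            unpinned.append(line)  # plain requirement without exact version
    return unpinned
```

Passing it the lines of the repo's requirements.txt (for example via open("requirements.txt")) lists what still needs a pin. A commit pin does not help if the commit later disappears, as with the messytables one noted above.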
The MDGs database is not updated as often as desired. Suggest changing the source of this data and extracting it from the World Bank at http://data.worldbank.org/indicator/SN.ITK.DEFC.ZS
Currently accessed from the MDGs website. Suggest using the original source, Childinfo, at http://www.childinfo.org/maternal_mortality_indicators.php
Data contained in the MDGs database is not updated as often as the original source. Usually the MDG database is updated on a yearly basis to support the annual MDG progress report publication. Suggest re-pointing the extraction effort to the original source at http://apps.who.int/gho/data/node.main.A826
These scrapers are flagging up errors: requires investigation.
As of 2014-02-18, I see it in the indicators table but not in the values table?
It was there in data I downloaded on 2014-01-28, and at first glance I can't find any obvious commits or github issues that would point to its removal.
I see a few other indicators disappeared also during this time, e.g. PVX060 is also gone but it has also disappeared from the indicators table, so maybe that was intentional?
This series is extracted from the MDGs. Suggesting extracting this series from Childinfo.org at http://www.childinfo.org/education_enrolment.php. It could also be extracted from the World Bank. The preferable data source is ChildInfo.org
Recommend using FAOSTAT as the source for this data set. Data can be obtained at http://faostat.fao.org/site/377/DesktopDefault.aspx?PageID=377#ancor
The main advantage is that it provides the update date, which can be passed on to users.
Cyprus was omitted from this series. This country is included in the one-country data series "_% of routine EPI vaccines financed by government". Once the merge is done, "_% of routine EPI vaccines financed by government" can be removed.