smnorris / bcdata
Python and command line tools for quick access to DataBC geo-data available via WFS/WCS.
License: MIT License
Port to Python 3 and automate tests.
perhaps permit a list of requested data:
bcdata "bc-airports,conservation-lands"
Then make the requests in parallel to really speed things up. The bottleneck is the initial DWDS page load, which could easily be done in parallel or through some kind of queue.
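A minimal sketch of the idea, assuming a per-dataset download function already exists. `download_all`, `parse_names` and the `download_one` callable are hypothetical names used for illustration, not part of bcdata:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_names(arg):
    """Split a comma-separated argument such as 'bc-airports,conservation-lands'."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def download_all(names, download_one, max_workers=4):
    """Run download_one(name) for every dataset name in a thread pool.

    Returns a dict mapping each name to its result, or to the exception
    that was raised, so one failed dataset does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc
    return results
```

Threads suit this case because the work is mostly waiting on the server rather than computing.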
The cat command does not appear to actually pass the sortby option to the request.
Related to issue #43.
Something must be wrong with the way the POST payload is structured; the code itself seems to be fine:
payload = {"aoiOption": "0",
           "crs": "0",
           "fileFormat": FORMATS[driver],
           "termsCheckbox": "1",
           "clickedSubmit": "true",
           "userEmail": email_address}
bcdata national-parks-national-framework-canada-lands-administrative-boundaries-level-1
Traceback (most recent call last):
File "/usr/local/bin/bcdata", line 9, in <module>
load_entry_point('bcdata', 'console_scripts', 'bcdata')()
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/Volumes/Data/Projects/geobc/bcdata/bcdata/scripts/cli.py", line 65, in cli
geomark=geomark)
File "/Volumes/Data/Projects/geobc/bcdata/bcdata/__init__.py", line 60, in create_order
custom_download_link.click()
UnboundLocalError: local variable 'custom_download_link' referenced before assignment
bcdata tantalis-management-areas-spatial
Traceback (most recent call last):
File "/usr/local/bin/bcdata", line 9, in <module>
load_entry_point('bcdata', 'console_scripts', 'bcdata')()
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/Volumes/Data/Projects/geobc/bcdata/bcdata/scripts/cli.py", line 69, in cli
dl_path = bcdata.download_order(order_id)
File "/Volumes/Data/Projects/geobc/bcdata/bcdata/__init__.py", line 144, in download_order
raise RuntimeError("Download for order_id "+order_id+" timed out")
RuntimeError: Download for order_id 1662791 timed out
WHSE_FOREST_VEGETATION.RSLT_FOREST_COVER_INV_SVW fails (n=834,064) after 530,000 insertions with bc2pg.
Cap the number of records requested?
ERROR 1: transfer closed with outstanding read data remaining
FAILURE:
Unable to open datasource `https://openmaps.gov.bc.ca/geo/pub/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=WHSE_FOREST_VEGETATION.RSLT_FOREST_COVER_INV_SVW&outputFormat=json&SRSNAME=epsg%3A4326&sortby=BEC_SERAL&startIndex=630000&count=10000' with the following drivers.
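One way to cap the records per request is to build the paged URLs up front with a smaller page size. The sketch below reuses the parameters visible in the failing URL above; `paged_urls` is a hypothetical helper, and whether smaller pages actually avoid the "transfer closed" error is untested:

```python
from urllib.parse import urlencode

WFS_URL = "https://openmaps.gov.bc.ca/geo/pub/wfs"


def paged_urls(type_name, total, pagesize=10000, sortby=None):
    """Return one GetFeature URL per page of `pagesize` records."""
    urls = []
    for start in range(0, total, pagesize):
        params = {
            "service": "WFS",
            "version": "2.0.0",
            "request": "GetFeature",
            "typeName": type_name,
            "outputFormat": "json",
            "SRSNAME": "epsg:4326",
            "startIndex": start,
            # The last page only asks for the records that remain
            "count": min(pagesize, total - start),
        }
        if sortby:
            # A stable sort key keeps pages from overlapping
            params["sortby"] = sortby
        urls.append(WFS_URL + "?" + urlencode(params))
    return urls
```

Because each page is an independent URL, a failed page can be retried on its own instead of restarting the whole load.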
add (and test) additional CLI options:
mock and polling seem to give some tests difficulty
eg: Travis builds for fwakit https://travis-ci.org/smnorris/fwakit/jobs/347119336
polling is pretty simple, the dependency could be removed
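A standard-library replacement could look roughly like this. It is a sketch, not the polling package's API: `poll`, `check` and `on_wait` are hypothetical names, and `on_wait` is only there so a caller can report the time remaining.

```python
import time


def poll(check, timeout=60.0, step=1.0, on_wait=None):
    """Call check() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError when time runs out.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        if on_wait is not None:
            # e.g. on_wait=lambda s: print("%.0f s until timeout" % s)
            on_wait(remaining)
        # Never sleep past the deadline
        time.sleep(min(step, remaining))
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline correct if the system clock changes during a long wait.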
But having EPSG:4326 available to the CLI would be valuable.
Reproject 4269 downloads to 4326 if requested.
Makes the module a lot heavier - requires ogr2ogr or fiona
if specified output folder exists, attempt to append to it if the data type matches
Should be able to run create_order headlessly via phantomjs
References:
https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/
https://stackoverflow.com/questions/18433453/python-selenium-with-phantomjs-click-failed-referenceerror-cantt-find-varia
Prelim code:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
# Present a desktop browser user agent to the server
user_agent = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
)
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.set_window_size(1120, 550)
Simply changing the driver didn't work. I think there need to be some explicit delays to wait for the server to process the request and for the browser to make changes.
Load dataset to postgres db, defaulting to schema.table matching the dataset's object_name.
There is no need to use Selenium and PhantomJS, Requests (and BeautifulSoup) can do everything. Just need to figure out the right payload for the POST request.
get default with (os.environ.get('KEY_THAT_MIGHT_EXIST'))
there should also be error handling for
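A sketch of the environment-variable default. The function name `resolve_email` and the variable name `BCDATA_EMAIL` are assumptions made for illustration:

```python
import os


def resolve_email(email=None, env_var="BCDATA_EMAIL"):
    """Return the explicit email if given, else the value of env_var.

    Raises ValueError with a readable message when neither is available.
    """
    if email:
        return email
    # os.environ.get returns None instead of raising KeyError
    value = os.environ.get(env_var)
    if not value:
        raise ValueError("no email given and %s is not set" % env_var)
    return value
```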
The timeout is currently 30 minutes. Is that enough? Can the time remaining until timeout be echoed to stdout? A test would be valuable.
Allow user to specify output layer name
.gdb data will require ogr/fiona for the layer renaming
related to #2
As per bcgov/bcdata#16
Not sure why it got added to requirements.txt
Given bounds (xmin, ymin, xmax, ymax), return a geomark using the BC geomark service: http://apps.gov.bc.ca/pub/geomark/geomarks
Also consider generating a geomark with a geojson feature.
This probably belongs in a separate geomark module given all the capabilities of the api, but it isn't necessary at this point.
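Both inputs could share one code path if bounds are first turned into a GeoJSON feature. The sketch below covers only that conversion. The request to the geomark service is left out because its parameters are not described here, and `bounds_to_feature` is a hypothetical helper:

```python
def bounds_to_feature(bounds):
    """Convert (xmin, ymin, xmax, ymax) into a GeoJSON polygon feature."""
    xmin, ymin, xmax, ymax = bounds
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("bounds must be (xmin, ymin, xmax, ymax) with min < max")
    # Exterior ring, counterclockwise, closed by repeating the first point
    ring = [
        [xmin, ymin],
        [xmax, ymin],
        [xmax, ymax],
        [xmin, ymax],
        [xmin, ymin],
    ]
    return {
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }
```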
downloads are successful without providing an email
the CLI just sits there, some notice that it is doing something would be good - especially if headless execution is possible
dwds says:
Your order has exceeded the maximum order size! You will not be able to submit your order until the size has been reduced. You can do this by either removing selected products or by altering the AOI on configurable products.
Currently, bcdata hangs, waiting for a response that never comes.
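A possible guard is to look for that message in the page returned after submitting the order and fail immediately. This sketch assumes the message appears verbatim in the response text; `check_order_page` and `OrderTooLargeError` are hypothetical names:

```python
SIZE_LIMIT_TEXT = "Your order has exceeded the maximum order size"


class OrderTooLargeError(RuntimeError):
    """Raised when DWDS rejects an order for exceeding the size limit."""


def check_order_page(html):
    """Raise OrderTooLargeError if the page carries the size-limit message.

    Returns the page text unchanged otherwise, so the call can be chained.
    """
    if SIZE_LIMIT_TEXT in html:
        raise OrderTooLargeError(
            "DWDS refused the order: maximum order size exceeded. "
            "Request fewer products or a smaller AOI."
        )
    return html
```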
Use the catalog API to find the DWDS path rather than scraping the catalog page. Path to download should be an item in the 'Resources' list. For example: https://catalogue.data.gov.bc.ca/api/3/action/package_show?id=conservation-lands
As there are multiple resources in the list I'm not quite sure how to pick the right one (in case the best download source isn't DWDS). Some examples would have to be checked out.
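A sketch of reading the resource list from the `package_show` response, which in the CKAN API holds the package under `result` with a `resources` list. The selection rule here (prefer a resource whose URL contains a hint string, else take the first) is only a guess and would need checking against real packages; `fetch_package` and `pick_resource` are hypothetical names:

```python
import json
from urllib.request import urlopen

API = "https://catalogue.data.gov.bc.ca/api/3/action/package_show?id="


def fetch_package(dataset_id):
    """Fetch the catalogue record for a dataset (needs network access)."""
    with urlopen(API + dataset_id) as response:
        return json.load(response)


def pick_resource(package, url_hint="dwds"):
    """Pick one entry from the package's resource list.

    Assumption: a DWDS resource can be recognised by `url_hint` appearing
    in its URL. Falls back to the first resource, or None if there are none.
    """
    resources = package.get("result", {}).get("resources", [])
    for resource in resources:
        if url_hint.lower() in resource.get("url", "").lower():
            return resource
    return resources[0] if resources else None
```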
General install includes pytest and a few other things that look like testing packages.
Is this correct behaviour?
If the server supports this, downloads could be very fast, limited mostly by connection speed
every time bcdata runs OSX asks "Do you want the application 'python' to accept incoming network connections?". Perhaps we can convince it to remember the setting after running the command once.
https://stackoverflow.com/questions/19688841/add-python-to-os-x-firewall-options
GeoJSON is valid as EPSG:4326 only, so this is a good default for cat and dump. But I generally want EPSG:3005.
If dumping to file, this is a lot to type (and remember):
$ bcdata cat bc-airports --dst-crs EPSG:3005 \
| fio collect \
| fio load -f GPKG --dst-crs EPSG:3005 airports.gpkg
Maybe a bc2gpkg command or alias would be useful?
$ bcdata pscis-design-proposal -f shp --crs EPSG:4269
Traceback (most recent call last):
File "/Users/snorris/venv/bcdata/bin/bcdata", line 9, in <module>
load_entry_point('bcdata', 'console_scripts', 'bcdata')()
File "/Users/snorris/venv/bcdata/lib/python2.7/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/Users/snorris/venv/bcdata/lib/python2.7/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/Users/snorris/venv/bcdata/lib/python2.7/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/snorris/venv/bcdata/lib/python2.7/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/Volumes/Data/Projects/geobc/bcdata/bcdata/scripts/cli.py", line 51, in cli
dl_path = bcdata.download_order(order_id)
File "/Volumes/Data/Projects/geobc/bcdata/bcdata/__init__.py", line 119, in download_order
folder = next(os.walk(unzip_folder))[1][0]
IndexError: list index out of range
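The IndexError comes from `next(os.walk(unzip_folder))[1][0]` when the unzipped folder contains no subfolder. A guarded version might look like the sketch below. Falling back to the unzip folder itself when files sit at the top level is an assumption about what such a download contains, and `first_subfolder` is a hypothetical helper:

```python
import os


def first_subfolder(unzip_folder):
    """Return the path of the first subfolder of unzip_folder.

    Falls back to unzip_folder itself if files were extracted directly
    into it, and raises a clear error if nothing was extracted at all.
    """
    top = next(os.walk(unzip_folder), None)
    if top is None:
        raise RuntimeError("folder does not exist: " + unzip_folder)
    _, dirnames, filenames = top
    if dirnames:
        # Sort so the choice does not depend on filesystem order
        return os.path.join(unzip_folder, sorted(dirnames)[0])
    if filenames:
        return unzip_folder
    raise RuntimeError("nothing was extracted to " + unzip_folder)
```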
pagesize defaults to the max accepted by the server, 10,000.
max_workers arbitrarily defaults to 5.
Try testing some different values with different datasets to see if there is a general optimization strategy.
Layers like BEC (features have many coordinates) may benefit from smaller pagesize and more threads?
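A small harness for that kind of comparison could time one download per combination of settings. This is a sketch: `time_settings` is a hypothetical name, and `fetch` stands for any callable that accepts `pagesize` and `max_workers` keyword arguments.

```python
import itertools
import time


def time_settings(fetch, pagesizes, worker_counts):
    """Time fetch() for every (pagesize, max_workers) combination.

    Returns a list of (seconds, pagesize, max_workers), fastest first.
    """
    results = []
    for pagesize, max_workers in itertools.product(pagesizes, worker_counts):
        start = time.perf_counter()
        fetch(pagesize=pagesize, max_workers=max_workers)
        elapsed = time.perf_counter() - start
        results.append((elapsed, pagesize, max_workers))
    return sorted(results)
```

Single runs are noisy against a shared server, so each combination would need repeating before drawing conclusions.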
temp isn't always best
For https://catalogue.data.gov.bc.ca/dataset/hydrometric-stations-active-and-discontinued
$ bcdata hydrometric-stations-active-and-discontinued
Traceback (most recent call last):
File "/usr/local/bin/bcdata", line 11, in <module>
sys.exit(cli())
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/bcdata/cli.py", line 50, in cli
driver=driver)
File "/usr/local/lib/python2.7/site-packages/bcdata/main.py", line 74, in download
raise ValueError('DWDS URL does not exist, something went wrong')
ValueError: DWDS URL does not exist, something went wrong
Upgraded Homebrew Python from 3.7.2 to 3.7.2_2 and on opening of new terminal:
bash: /usr/local/bin/bcdata: /usr/local/Cellar/python/3.7.2/bin/python3.7: bad interpreter: No such file or directory
From this line in .profile, added to enable autocomplete:
$ eval "$(_BCDATA_COMPLETE=source bcdata)"
-bash: /usr/local/bin/bcdata: /usr/local/Cellar/python/3.7.2/bin/python3.7: bad interpreter: No such file or directory
Reinstalling bcdata makes the problem go away, but creating a bash complete script might improve stability: https://click.palletsprojects.com/en/7.x/bashcomplete/?highlight=autocomplete
looks like html5lib isn't included on python bundled with ubuntu 16.04:
bcdata natural-resource-nr-district
Traceback (most recent call last):
File "/home/gis/venv/rr_stats/bin/bcdata", line 11, in <module>
load_entry_point('bcdata==0.0.5.dev0', 'console_scripts', 'bcdata')()
File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/bcdata/scripts/cli.py", line 56, in cli
driver=driver)
File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/bcdata/__init__.py", line 57, in download
soup = BeautifulSoup(r.text, "html5lib")
File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/bs4/__init__.py", line 165, in __init__
% ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?
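Declaring html5lib as an install requirement is the direct fix. As a fallback, the parser name could be chosen at run time, as in this sketch (`pick_parser` is a hypothetical helper). The built-in `html.parser` can build a different tree from malformed HTML, so the scraping code may not find the same elements with it:

```python
import importlib.util


def pick_parser(preferred="html5lib", fallback="html.parser"):
    """Return `preferred` if that module is importable, else `fallback`."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


# Intended use, with r being the response already fetched:
# soup = BeautifulSoup(r.text, pick_parser())
```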
Not a good assumption, presumably data will get updated. Just make sure there are features present.
crs option is simply not passed to get_features
Presuming the server accepts the request, bcdata dump should support WGS84.
no download, no email required.
Sometimes, we just don't want the whole Province
VEG_BURN_SEVERITY_SP fails to download with tool
with manual download to fgdb, the fields in the output are short names, rather than full names.
Probably best to just use WFS
$ bc2pg bc-transmission-lines
Traceback (most recent call last):
File "/usr/local/bin/bc2pg", line 11, in <module>
sys.exit(cli())
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pgdata/cli.py", line 26, in cli
info = db.bcdata2pg(dataset, email)
File "/usr/local/lib/python2.7/site-packages/pgdata/database.py", line 367, in bcdata2pg
dl = bcdata.download(url, email)
File "/usr/local/lib/python2.7/site-packages/bcdata/main.py", line 57, in download
raise ValueError('Specified package is not available via DWDS')
ValueError: Specified package is not available via DWDS
A manual download works fine.
It would be handy if bcdata also scraped the default schema-qualified table name. This is noted as object name on the catalogue page.
eg for https://catalogue.data.gov.bc.ca/dataset/bc-airports:
Object Name: WHSE_IMAGERY_AND_BASE_MAPS.GSR_AIRPORTS_SVW
test_cli.py currently fails.
The command should work. Is the failure due to the time to process, or is the call incorrect?
When given an AOI, the download service will provide features that intersect or are clipped to the AOI.
Add clip/intersect as an option to module and CLI.
Cache the available layers somewhere and autocomplete table names / package names when using bcdata info and bcdata dump.
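A sketch of the caching half, assuming the layer list fits in a small JSON file and that a one-day cache is acceptable. `load_cached_names`, `complete` and the `fetch_names` callable are hypothetical names; hooking `complete` into the CLI's completion mechanism is left out:

```python
import json
import os
import time


def load_cached_names(cache_path, fetch_names, max_age=86400):
    """Return layer names from cache_path, refreshing when older than max_age."""
    fresh = (
        os.path.exists(cache_path)
        and time.time() - os.path.getmtime(cache_path) < max_age
    )
    if fresh:
        with open(cache_path) as f:
            return json.load(f)
    # Cache is missing or stale: ask the server and rewrite the file
    names = sorted(fetch_names())
    with open(cache_path, "w") as f:
        json.dump(names, f)
    return names


def complete(prefix, names):
    """Return the names that start with prefix, ignoring case."""
    prefix = prefix.upper()
    return [name for name in names if name.upper().startswith(prefix)]
```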