bcdata's Issues

Python 3

Port to Python 3 and automate tests.

parallel requests

Perhaps permit a list of requested datasets:
bcdata "bc-airports,conservation-lands"

Then make the requests in parallel to really speed things up. The bottleneck is the initial DWDS page load, which could easily be done in parallel or through some kind of queue.
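
A minimal sketch of the parallel idea, using only the standard library. The names `parse_dataset_list`, `fetch_all` and `fetch_dataset` are hypothetical and not part of bcdata; `fetch_dataset` stands in for whatever per-dataset request (page load plus order) the tool actually makes.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_dataset_list(arg):
    """Split a comma-separated CLI argument into dataset names."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def fetch_all(names, fetch, max_workers=5):
    """Run fetch(name) for each name in a thread pool; return {name: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, names)  # map() preserves input order
        return dict(zip(names, results))


def fetch_dataset(name):
    # Hypothetical placeholder: a real version would load the DWDS page
    # and place the order for `name`.
    return "ordered:" + name


if __name__ == "__main__":
    names = parse_dataset_list("bc-airports,conservation-lands")
    print(fetch_all(names, fetch_dataset))
```

Threads suit this case because the work is dominated by waiting on the server rather than by computation.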

.gdb request returns shapefile

Something must be wrong with the way the POST payload is structured; the code itself seems to be fine:

    payload = {"aoiOption": "0",
               "crs": "0",
               "fileFormat": FORMATS[driver],
               "termsCheckbox": "1",
               "clickedSubmit": "true",
               "userEmail": email_address}

national parks fails

bcdata national-parks-national-framework-canada-lands-administrative-boundaries-level-1

Traceback (most recent call last):
  File "/usr/local/bin/bcdata", line 9, in <module>
    load_entry_point('bcdata', 'console_scripts', 'bcdata')()
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Volumes/Data/Projects/geobc/bcdata/bcdata/scripts/cli.py", line 65, in cli
    geomark=geomark)
  File "/Volumes/Data/Projects/geobc/bcdata/bcdata/__init__.py", line 60, in create_order
    custom_download_link.click()
UnboundLocalError: local variable 'custom_download_link' referenced before assignment

download times out too soon

bcdata tantalis-management-areas-spatial
Traceback (most recent call last):
  File "/usr/local/bin/bcdata", line 9, in <module>
    load_entry_point('bcdata', 'console_scripts', 'bcdata')()
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Volumes/Data/Projects/geobc/bcdata/bcdata/scripts/cli.py", line 69, in cli
    dl_path = bcdata.download_order(order_id)
  File "/Volumes/Data/Projects/geobc/bcdata/bcdata/__init__.py", line 144, in download_order
    raise RuntimeError("Download for order_id "+order_id+" timed out")
RuntimeError: Download for order_id 1662791 timed out

bc2pg fails on larger data sources

WHSE_FOREST_VEGETATION.RSLT_FOREST_COVER_INV_SVW fails (n=834,064) after 530,000 insertions with bc2pg.

Cap the number of records requested?

ERROR 1: transfer closed with outstanding read data remaining
FAILURE:
Unable to open datasource `https://openmaps.gov.bc.ca/geo/pub/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=WHSE_FOREST_VEGETATION.RSLT_FOREST_COVER_INV_SVW&outputFormat=json&SRSNAME=epsg%3A4326&sortby=BEC_SERAL&startIndex=630000&count=10000' with the following drivers.
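
The failing URL shows how the requests are paged (`startIndex` and `count`). As a sketch of one way to make a large load more robust, the page URLs could be generated up front so that a single failed page can be retried on its own. The function name `paged_urls` is hypothetical; the base URL and query parameters are copied from the error message above.

```python
import math
from urllib.parse import urlencode

WFS_URL = "https://openmaps.gov.bc.ca/geo/pub/wfs"


def paged_urls(type_name, n_records, sortby, pagesize=10000):
    """Return one GetFeature URL per page of `pagesize` records."""
    urls = []
    for page in range(math.ceil(n_records / pagesize)):
        params = {
            "service": "WFS",
            "version": "2.0.0",
            "request": "GetFeature",
            "typeName": type_name,
            "outputFormat": "json",
            "SRSNAME": "epsg:4326",
            "sortby": sortby,  # a stable sort keeps pages consistent
            "startIndex": page * pagesize,
            "count": pagesize,
        }
        urls.append(WFS_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    urls = paged_urls(
        "WHSE_FOREST_VEGETATION.RSLT_FOREST_COVER_INV_SVW", 834064, "BEC_SERAL"
    )
    print(len(urls), "pages; page 63:", urls[63])
```

Each URL is independent, so a page that fails with a dropped transfer can be requested again without repeating the earlier insertions.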

cli options

add (and test) additional CLI options:

  • driver (check valid drivers)
  • geomark (check that it exists/is valid)
  • crs (check valid CRS)

headless orders

It should be possible to run create_order headlessly via PhantomJS.

References:
https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/
https://stackoverflow.com/questions/18433453/python-selenium-with-phantomjs-click-failed-referenceerror-cantt-find-varia

Prelim code:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Present a desktop browser user agent to the server
user_agent = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) " +
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
)

# Copy the default PhantomJS capabilities and override the user agent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.set_window_size(1120, 550)

Simply changing the driver didn't work. I think there need to be some explicit waits so that the server can process the request and the browser can update the page.

add command for loading to postgres

Load a dataset to a Postgres db, defaulting to the schema.table matching the dataset's object_name.

  • use db specified by $DATABASE_URL or similar
  • add primary key constraint if the key is known

do all work with requests

There is no need to use Selenium and PhantomJS; Requests (and BeautifulSoup) can do everything. We just need to figure out the right payload for the POST request.

timeouts

The timeout is currently 30 min. Is that enough? Can the time remaining until timeout be echoed to stdout? A test would be valuable.
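
One possible shape for a testable timeout loop, as a sketch only. The function `wait_for_download` and its parameters are hypothetical and not the existing bcdata code; the clock and sleep functions are injectable so a test can simulate 30 minutes without actually waiting.

```python
import time


def wait_for_download(is_ready, timeout=1800, interval=5, echo=print,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            raise RuntimeError("Download timed out after %s seconds" % timeout)
        # Report how long is left so the user knows the tool is still alive
        echo("waiting, %d seconds until timeout" % remaining)
        sleep(min(interval, remaining))
```

In a test, `clock` and `sleep` can be replaced by a fake pair that advances a counter, which makes both the success path and the timeout path run instantly.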

add --layer option

Allow user to specify output layer name

.gdb data will require ogr/fiona for the layer renaming

related to #2

Provide an error message when order max size is exceeded

dwds says:

Your order has exceeded the maximum order size! You will not be able to submit your order until the size has been reduced. You can do this by either removing selected products or by altering the AOI on configurable products.

Currently, bcdata hangs, waiting for a response that never comes.

command for download to file?

GeoJSON is valid as EPSG:4326 only, so this is a good default for cat and dump, but I generally want EPSG:3005.
When dumping to file, this is a lot to type (and remember):

$ bcdata cat bc-airports --dst-crs EPSG:3005 \
  | fio collect \
  | fio load -f GPKG --dst-crs EPSG:3005 airports.gpkg

Maybe a bc2gpkg command or alias would be useful?

empty download fails badly

$ bcdata pscis-design-proposal -f shp --crs EPSG:4269
Traceback (most recent call last):
  File "/Users/snorris/venv/bcdata/bin/bcdata", line 9, in <module>
    load_entry_point('bcdata', 'console_scripts', 'bcdata')()
  File "/Users/snorris/venv/bcdata/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/snorris/venv/bcdata/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/snorris/venv/bcdata/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/snorris/venv/bcdata/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Volumes/Data/Projects/geobc/bcdata/bcdata/scripts/cli.py", line 51, in cli
    dl_path = bcdata.download_order(order_id)
  File "/Volumes/Data/Projects/geobc/bcdata/bcdata/__init__.py", line 119, in download_order
    folder = next(os.walk(unzip_folder))[1][0]
IndexError: list index out of range
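
The traceback points at `next(os.walk(unzip_folder))[1][0]`, which assumes the unzipped order always contains at least one subfolder. A hedged sketch of a guard that turns the bare IndexError into a readable message; `first_subfolder` is a hypothetical helper name, not existing bcdata code.

```python
import os


def first_subfolder(unzip_folder):
    """Return the name of the first subfolder, or raise a clear error."""
    # os.walk yields (dirpath, dirnames, filenames); take the top level only
    subfolders = next(os.walk(unzip_folder))[1]
    if not subfolders:
        raise RuntimeError(
            "Download is empty: no data folder found in " + unzip_folder
        )
    return sorted(subfolders)[0]
```

The sort is an addition so the result is deterministic when more than one folder is present; the original line simply takes whichever folder `os.walk` lists first.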

pagesize / max_workers optimization

pagesize defaults to the max accepted by the server, 10,000.
max_workers arbitrarily defaults to 5.

Try testing some different values with different datasets to see if there is a general optimization strategy.

Layers like BEC (features have many coordinates) may benefit from smaller pagesize and more threads?
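
A small timing harness could make this testing repeatable. This is a sketch under assumptions: `benchmark` is a hypothetical helper, and `run` is any callable taking `(pagesize, max_workers)` that performs the download being measured.

```python
import itertools
import time


def benchmark(run, pagesizes, worker_counts):
    """Time run(pagesize, max_workers) for every combination.

    Returns a dict mapping (pagesize, max_workers) to elapsed seconds.
    """
    timings = {}
    for pagesize, workers in itertools.product(pagesizes, worker_counts):
        start = time.perf_counter()
        run(pagesize, workers)
        timings[(pagesize, workers)] = time.perf_counter() - start
    return timings


def fastest(timings):
    """Return the (pagesize, max_workers) pair with the lowest time."""
    return min(timings, key=timings.get)
```

Network conditions vary between runs, so each combination would likely need to be repeated a few times before drawing conclusions.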

No DWDS url on main page

For https://catalogue.data.gov.bc.ca/dataset/hydrometric-stations-active-and-discontinued

$ bcdata hydrometric-stations-active-and-discontinued
Traceback (most recent call last):
  File "/usr/local/bin/bcdata", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/bcdata/cli.py", line 50, in cli
    driver=driver)
  File "/usr/local/lib/python2.7/site-packages/bcdata/main.py", line 74, in download
    raise ValueError('DWDS URL does not exist, something went wrong')
ValueError: DWDS URL does not exist, something went wrong

  • There is a link to an additional resource page that includes the DWDS URL. Is this link consistent across other pages?
  • Perhaps this is better resolved by using the API, as per #25?

Autocomplete fails with new version of Python installed

After upgrading Homebrew Python from 3.7.2 to 3.7.2_2, opening a new terminal gives:

bash: /usr/local/bin/bcdata: /usr/local/Cellar/python/3.7.2/bin/python3.7: bad interpreter: No such file or directory

The error comes from this line in .profile, added to enable autocomplete:

$ eval "$(_BCDATA_COMPLETE=source bcdata)"
-bash: /usr/local/bin/bcdata: /usr/local/Cellar/python/3.7.2/bin/python3.7: bad interpreter: No such file or directory

Reinstalling bcdata makes the problem go away, but generating a static bash completion script might be more stable: https://click.palletsprojects.com/en/7.x/bashcomplete/?highlight=autocomplete

require html5lib

It looks like html5lib isn't included with the Python bundled with Ubuntu 16.04:

bcdata natural-resource-nr-district
Traceback (most recent call last):
  File "/home/gis/venv/rr_stats/bin/bcdata", line 11, in <module>
    load_entry_point('bcdata==0.0.5.dev0', 'console_scripts', 'bcdata')()
  File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/bcdata/scripts/cli.py", line 56, in cli
    driver=driver)
  File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/bcdata/__init__.py", line 57, in download
    soup = BeautifulSoup(r.text, "html5lib")
  File "/home/gis/venv/rr_stats/local/lib/python2.7/site-packages/bs4/__init__.py", line 165, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?
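
The direct fix suggested by the title is to declare html5lib as a dependency. As a complementary sketch, the parser name could be chosen at runtime, falling back to Python's built-in `html.parser` when html5lib is not installed. `pick_parser` is a hypothetical helper; note that different parsers can build slightly different trees from malformed HTML, so the scraping logic would need checking against the fallback.

```python
import importlib.util


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first installed parser name, else the stdlib fallback."""
    for name in preferred:
        # find_spec returns None when the top-level module is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"
```

The returned string is meant to be passed as the second argument to `BeautifulSoup(r.text, ...)` in place of the hard-coded "html5lib".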

support EPSG:4326

Presuming the server accepts the request, bcdata dump should support WGS84.

fields in .gdb are provided in short form

VEG_BURN_SEVERITY_SP fails to download with the tool.

With a manual download to fgdb, the fields in the output have short names rather than full names.

It is probably best to just use WFS.

dataset bc-transmission-lines fails

$ bc2pg bc-transmission-lines
Traceback (most recent call last):
  File "/usr/local/bin/bc2pg", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pgdata/cli.py", line 26, in cli
    info = db.bcdata2pg(dataset, email)
  File "/usr/local/lib/python2.7/site-packages/pgdata/database.py", line 367, in bcdata2pg
    dl = bcdata.download(url, email)
  File "/usr/local/lib/python2.7/site-packages/bcdata/main.py", line 57, in download
    raise ValueError('Specified package is not available via DWDS')
ValueError: Specified package is not available via DWDS

A manual download works fine.

cli testing fails

test_cli.py currently fails.
The command should work. Is the failure due to processing time, or is the call incorrect?

clip vs intersect

When given an AOI, the download service will provide features that either intersect the AOI or are clipped to it.
Add clip/intersect as an option to the module and the CLI.
