ckan / ckan-service-provider

A library for making web services that make functions available as synchronous or asynchronous jobs

Home Page: http://ckan-service-provider.readthedocs.org

License: GNU Affero General Public License v3.0

Python 100.00%

ckan-service-provider's Issues

Log failed jobs and provide helpful output

SSL handshake error resulting in "Process completed but unable to post to result_url"

Related to ckan/datapusher#82

Whenever CKAN runs with a self-signed TLS certificate, DataPusher processes the data but fails to complete jobs, reporting a TLS handshake error.

2017/11/30 23:22:19 [info] 21780#21780: *278 SSL_do_handshake() failed (SSL: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca:SSL alert number 48) while SSL handshaking, client: 192.168.2.185, server: 0.0.0.0:5000

This happens even if SSL_VERIFY = False is set in datapusher_settings.py.

This is caused by the send_result() method in web.py, which is called at the end of job processing. Since this method is part of ckan-service-provider and not datapusher, it is unaffected by the SSL_VERIFY setting and forces verification anyway.
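A minimal sketch of how a verification flag could be honoured when posting results, using only the standard library. The helper name `make_ssl_context` and the way the flag is passed in are assumptions for illustration, not the actual send_result() code. With `requests`, the equivalent would be passing `verify=False` to `requests.post`.

```python
import ssl


def make_ssl_context(verify: bool) -> ssl.SSLContext:
    """Build a client TLS context that honours an SSL_VERIFY-style flag.

    Hypothetical helper: the real send_result() may be structured differently.
    """
    context = ssl.create_default_context()
    if not verify:
        # Order matters: check_hostname must be disabled before CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

Disabling verification is only reasonable for trusted networks; pointing the client at the self-signed CA certificate is the safer alternative.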

Document that you have to configure a database

This isn't obvious ;-)

Would also be nice if we could config the DB from an environment variable (e.g. one compatible with Heroku).
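One possible shape for this, as a sketch only: read the database URI from `DATABASE_URL` (the variable Heroku sets) and fall back to a default. The function name and the default URI are hypothetical. The scheme rewrite is included because Heroku-style URLs start with `postgres://`, which SQLAlchemy 1.4 and later no longer accept.

```python
import os

# Hypothetical default, used only when no environment variable is set.
DEFAULT_URI = "sqlite:////tmp/job_store.db"


def database_uri(environ=os.environ, default=DEFAULT_URI):
    """Return a SQLAlchemy database URI taken from DATABASE_URL if present."""
    uri = environ.get("DATABASE_URL", default)
    if uri.startswith("postgres://"):
        # Normalise the legacy scheme name to the one SQLAlchemy expects.
        uri = "postgresql://" + uri[len("postgres://"):]
    return uri
```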

Don't use ProxyFix

ckanserviceprovider always uses the werkzeug proxyfix middleware:

https://github.com/ckan/ckan-service-provider/blob/master/ckanserviceprovider/web.py#L130

(apparently to make it work with gunicorn: 99b675d)

But the docs say:

Do not use this middleware in non-proxy setups for security reasons.

Should we be using this?

Show stats

Show how many jobs failed and how many succeeded at /status. Suggested by @tobes.
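A rough sketch of the counting such an endpoint would need. The status names `complete`, `error` and `pending` are assumptions here, as is the function name; the real job store may use different values.

```python
from collections import Counter

# Assumed status names; adjust to whatever the job store actually records.
STATUSES = ("complete", "error", "pending")


def job_stats(statuses):
    """Count jobs per status, reporting zero for statuses with no jobs."""
    counts = Counter(statuses)
    return {name: counts[name] for name in STATUSES}
```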

Think about removing job resubmitting

Because of #11.

The example service doesn't work

IOError: [Errno 2] No such file or directory: 'example/settings_local.py'

Release Needed to Fix SQLAlchemy Incompatibility

As fixed in #30, it is causing us a lot of trouble that the version of ckan-service-provider on PyPi requires a specific version of SQLAlchemy (see issue datacats/ckan-multisite#37, datacats/datacats#334).

If a new version of ckan-service-provider were released on PyPi, our automated tooling could pull down the correct version of SQLAlchemy without weird hackery such as installing the correct version of SQLAlchemy manually or grabbing ckan-service-provider from Git.

Environment variable for config file should not be called JOB_CONFIG

Something like CKANSERVICEPROVIDER_CONFIG, probably.
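A rename like this could keep existing deployments working by falling back to the old name. The sketch below assumes the new variable is called `CKANSERVICEPROVIDER_CONFIG` as suggested above; the function name is hypothetical.

```python
import os


def config_path(environ=os.environ):
    """Prefer the proposed variable name and fall back to the legacy JOB_CONFIG."""
    return environ.get("CKANSERVICEPROVIDER_CONFIG") or environ.get("JOB_CONFIG")
```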

Support reading config settings directly from environment variables

Run with no config file. Useful for platforms like Heroku or for Docker etc

dependencies broken because of no version pinning

The flask-login requirement is unpinned in setup.py, and from 0.4.0 onward the import has changed from flask.ext.login to flask_login, causing an exception.

Consider switching from APScheduler to celery

I am seriously considering switching to celery before releasing the datapusher, because there are a number of problems with APScheduler. The biggest disadvantage of celery, the setup, does not really apply here because there won't be many set-ups anyway. Also, setting up the service with a separate worker process is not simple, either.

The reasons for celery are:

More functionalities, such as resubmitting
Better tool support such as https://github.com/mher/flower for monitoring
Better suited for web applications. In APScheduler you really have to think about the scope of your connections and everything. In celery people have already thought about all this.
Logging, APScheduler makes it really difficult (see http://stackoverflow.com/questions/15392058/capture-logs-in-apscheduler)
More docs and examples
Switching probably isn't too much work because we can reuse the routing and everything

@kindly Since you initially chose APScheduler, I'd be interested in your thoughts.

Engine disposal for true concurrency

In SQLAlchemy, the Engine refers to a connection pool.

Typically, "the Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources."

However, as pointed out in the Engine Disposal section of https://docs.sqlalchemy.org/en/14/core/connections.html:

"When a program uses multiprocessing or fork(), and an Engine object is copied to the child process, Engine.dispose() should be called so that the engine creates brand new database connections local to that fork. Database connections generally do not travel across process boundaries."

This bug was unearthed while working on the datapusher to make it concurrent - i.e. having it use PostgreSQL, using multiple uwsgi workers.

ckan/datapusher#200
ckan/datapusher#198

ckan-service-provider maintains a job database using sqlalchemy. Currently, the SQLALCHEMY_DATABASE_URI defaults to sqlite - which is not meant to be used as a concurrent database, resulting in database locks.

Changing SQLALCHEMY_DATABASE_URI to a postgresql connect string eliminated the database lock issues. However, since database connections do not travel across process boundaries, psycopg2 was giving another error:

(psycopg2.OperationalError) SSL error: decryption failed or bad record mac

Resolved this issue by setting 'lazy-apps = true' in uwsgi. (ckan/datapusher#201 (comment))

Another fix, which would help not only datapusher but all ckan-service-provider clients, would be to give each worker/process its own Engine by calling Engine.dispose().

https://virtualandy.wordpress.com/2019/09/04/a-fix-for-operationalerror-psycopg2-operationalerror-ssl-error-decryption-failed-or-bad-record-mac/
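One way to sketch the per-process idea without touching uwsgi settings is a small wrapper that notices when the process ID has changed and disposes the inherited pool. The class name is hypothetical, and `engine` stands for any object with a `dispose()` method, such as a SQLAlchemy Engine; this is an illustration, not the library's current behaviour.

```python
import os


class ForkAwareEngine:
    """Wrap an engine-like object and dispose its pool after a fork."""

    def __init__(self, engine):
        self._engine = engine
        self._pid = os.getpid()  # PID of the process that created the engine

    def get(self):
        # A different PID means we are in a forked child: drop the inherited
        # connections so the pool opens fresh ones local to this process.
        if os.getpid() != self._pid:
            self._engine.dispose()
            self._pid = os.getpid()
        return self._engine
```

Code that needs a connection would call `get()` instead of holding a module-level engine directly.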

Synchronous jobs via GET

We need a simple endpoint that lets users add a task via a URL. The data should also be returned immediately at this URL. Basically, it should behave like a static resource.
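To illustrate the idea, here is a bare WSGI sketch that runs a job inline and returns its result in the same response. The job function `echo_job` and its `text` parameter are invented for the example, and the sketch does not check the request method or authentication.

```python
from urllib.parse import parse_qs


def echo_job(params):
    """Hypothetical job: return the 'text' query parameter upper-cased."""
    return params.get("text", [""])[0].upper()


def app(environ, start_response):
    """Run the job synchronously and return its output as the response body."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    body = echo_job(params).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```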

flask-login and Werkzeug versions are incompatible

flask-login 0.5.0 and Werkzeug 2.1.x are not compatible, due to flask-login using a function that was removed in 2.1.0.

In our situation we are using the CKAN Datapusher library in a docker container, which imports the ckanserviceprovider web module. It started failing to run after a recent build, due to this error at startup:

Traceback (most recent call last):
  File "/etc/ckan/datapusher.wsgi", line 2, in <module>
    import ckanserviceprovider.web as web
  File "/usr/lib/ckan/datapusher/lib/python3.7/site-packages/ckanserviceprovider/web.py", line 13, in <module>
    import flask_login as flogin
  File "/usr/lib/ckan/datapusher/lib/python3.7/site-packages/flask_login/__init__.py", line 16, in <module>
    from .login_manager import LoginManager
  File "/usr/lib/ckan/datapusher/lib/python3.7/site-packages/flask_login/login_manager.py", line 24, in <module>
    from .utils import (login_url as make_login_url, _create_identifier,
  File "/usr/lib/ckan/datapusher/lib/python3.7/site-packages/flask_login/utils.py", line 13, in <module>
    from werkzeug.security import safe_str_cmp
ImportError: cannot import name 'safe_str_cmp' from 'werkzeug.security' (/usr/lib/ckan/datapusher/lib/python3.7/site-packages/werkzeug/security.py)

I've explicitly installed the newer version of flask-login to get around this for now, but hopefully this could be improved in this library?

Always require an API key

See ckan/datapusher#5

Use sqlalchemy.engine_from_config

@amercader We were able to switch from sqlite to postgres by simply replacing the sqlalchemy_database_uri default sqlite uri with a postgres uri.

However, if I want to pass more engine configuration settings through sqlalchemy to postgres, it's not possible.

Similar to CKAN and the datastore, ckan service provider should use sqlalchemy.engine_from_config as well so users can pass more engine configuration settings through.
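For context, `sqlalchemy.engine_from_config(configuration, prefix="sqlalchemy.")` builds an engine from all keys that share a prefix. The sketch below only mimics the key selection step in plain Python so the effect is visible; the function name and the example keys are illustrative, and the real function additionally converts some string values to their proper types.

```python
def engine_options(config, prefix="sqlalchemy."):
    """Select the prefixed keys from a config mapping and strip the prefix."""
    return {
        key[len(prefix):]: value
        for key, value in config.items()
        if key.startswith(prefix)
    }
```

With a config such as `{"sqlalchemy.url": "...", "sqlalchemy.pool_size": "10"}`, users could then set pool options without any code change.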

log files don't have timestamps

which makes it hard to use the log files for monitoring and debugging.
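A small sketch of what a fix could look like with the standard `logging` module: a formatter that puts `%(asctime)s` first. The exact format string and the helper name are assumptions; the service may want a different layout.

```python
import logging

# Assumed layout: timestamp, level, logger name, then the message.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"


def make_handler():
    """Return a stream handler whose output lines start with a timestamp."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    return handler
```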

CartoDB support

CartoDB is increasingly being used to store, access and visualize geospatial data in the cloud. As recorded in this issue, there is a need for CKAN-CartoDB integration, but opinions differ on how far this integration should go.

I personally would like to be able to add a resource to CKAN via a CartoDB link and ideally have a preview of the data (using the CartoDB API). We have other resource types than CartoDB, so the default service-provider features should remain.
The original feature request by @Bu1G is a step further, with CKAN controlling CartoDB as a datastore.

@acouch has developed CartoDB integration for DKAN, replacing the native datastore with CartoDB, but that code cannot be ported to CKAN as the codebase is different (see comment).

Any idea what needs to be done to tackle this?

Note: I'm still new to CKAN and not sure if this is functionality the ckan-service-provider can provide. If not, to which repository should I add this issue?

BUG: APScheduler version needs to be no greater than 3.9.1.post1

If ckanserviceprovider is installed from pypi, it does not work properly as it pulls in v3.10 (released Jan 31, 2023), which causes it to not process jobs properly:

ckan-service-provider/setup.py

Lines 13 to 19 in 4dfeb0f

    install_requires = [
        "APScheduler>=2.1.2,<4",
        "Flask>=1.1.1",
        "SQLAlchemy>=1.3.15,<1.4.0",
        "requests>=2.23.0",
        "future",
    ]

Even if ckanserviceprovider is installed from source it also doesn't work as the requirements.txt file specifies APScheduler 3.9.1, which was yanked:

ckan-service-provider/requirements.txt

Lines 1 to 6 in 4dfeb0f

    # Suggested versions
    APScheduler==3.9.1
    requests==2.27.1
    Flask==2.1.1
    Flask-Login==0.5.0
    future==0.18.2

Proposed solution:

set APScheduler>=2.1.2,<3.10.0 in setup.py
set APScheduler==3.9.1.post1 in requirements.txt

Make sure that service works under load

Should be able to handle 1 qps over a day and 100 qps peaks.

Version of flask-login is not pinned and update has broken CSP

Flask-login is unpinned in setup.py, and whilst it was on 0.2.11 it was fine. It has now been upgraded to 0.3.0, and a method (current_user.is_authenticated()) is now a property.

Options for fixes:

Pin flask-login to 0.2.11
Pin to 0.3.0 and change is_authenticated() to is_authenticated.
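A third possibility, sketched here as an assumption rather than something proposed in the issue, is a small compatibility helper that accepts both the old method form and the new property form. The helper name `user_flag` is hypothetical.

```python
def user_flag(user, name):
    """Read a flask-login style flag whether it is a method or a property."""
    value = getattr(user, name)
    # Older releases expose a method; newer ones expose a plain bool.
    return value() if callable(value) else value
```

Calling `user_flag(current_user, "is_authenticated")` would then work on either side of the 0.3.0 change.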

Resubmit failed jobs

Support for webhooks

Webhooks should be used to notify about finished jobs.
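As a sketch of what such a notification might carry, the snippet below builds (but does not send) a JSON POST request with the standard library. The payload fields `job_id` and `status` and the function name are assumptions; the issue does not define a payload format.

```python
import json
import urllib.request


def build_webhook_request(url, job_id, status):
    """Build a JSON POST request announcing that a job has finished."""
    # Hypothetical payload shape: just the job identifier and final status.
    payload = json.dumps({"job_id": job_id, "status": status}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would deliver the notification.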

Clear successful jobs

Mismatch between requirement versions in setup.py and requirements.txt

setup.py has requirements unpinned:

      install_requires=[
            'APScheduler',
            'Flask',
            'SQLAlchemy',
            'requests',
            'flask-admin',
            'flask-login'
      ],

requirements.txt has (some) requirements pinned:

APScheduler==2.0.3
Flask==0.9
SQLAlchemy==0.7.8
requests==0.14.1
flask-admin
flask-login

I think they should be pinned and only listed in one place. I don't know how I ended up with APScheduler==3.0.0rc1 which breaks things.
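One way to keep the list in a single place, offered as a sketch and not as the project's approach, is to have setup.py read requirements.txt. The function name is hypothetical, and the naive comment stripping would mangle URL requirements that contain a `#` fragment.

```python
def parse_requirements(text):
    """Return the requirement lines from requirements.txt content."""
    requirements = []
    for line in text.splitlines():
        # Drop comments and surrounding whitespace, skip empty lines.
        line = line.split("#", 1)[0].strip()
        if line:
            requirements.append(line)
    return requirements
```

setup.py could then pass the result of parsing the file as `install_requires`.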

Incompatible with required Flask_login version

Since 0.3.0 is_active and other attributes are no longer functions:

see: https://github.com/maxcountryman/flask-login/blob/master/CHANGES#L77

The currently required version of Flask_login is 0.5.0, which breaks ckan-service-provider:

Exception on /user [GET]
Traceback (most recent call last):
  File "/var/lib/ckan/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/var/lib/ckan/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/var/lib/ckan/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/var/lib/ckan/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/var/lib/ckan/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/var/lib/ckan/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/var/lib/ckan/lib/python3.7/site-packages/ckanserviceprovider/web.py", line 327, in user
    'is_active': user.is_active(),
TypeError: 'bool' object is not callable
500 GET /user (172.17.0.1) 1.27ms

See https://github.com/maxcountryman/flask-login/blob/master/CHANGES#L67

Delete api keys when they are not needed any more.

setup.py incompatible with CKAN deps

Current CKAN master specifies sqlalchemy 1.1.11, so we get an error when installing this and then CKAN's requirements.txt:

ckanserviceprovider 0.0.7 has requirement SQLAlchemy<1.3.0,>=1.2.7, but you'll have sqlalchemy 1.1.11 which is incompatible.

This has occurred since this PR was merged: #39

Needs a release before CKAN 2.7 comes out

This is because Datapusher is currently broken with CKAN master (soon to be released as 2.7): ckan/datapusher#124

So please can someone with permission (@amercader) do a release of ckan-service-provider (with the fix #32), and then I'll repoint Datapusher at that newly released version.
