
scrapy-heroku's Introduction

Scrapy-Heroku

A package to assist with running Scrapy on Heroku. It provides a custom application configuration at scrapy_heroku.app.application that launches the scrapyd web service on the port given by the PORT environment variable and uses a multi-process work queue backed by the Postgres database specified by the DATABASE_URL environment variable.

Configuration

Create a Git repository that has a Scrapy project at the root (scrapy.cfg should be at the top level). Edit your scrapy.cfg to include the following::

[scrapyd]
application = scrapy_heroku.app.application

[deploy]
url = http://<YOUR_HEROKU_APP_NAME>.herokuapp.com:80/
project = <YOUR_PROJECT_NAME>
username = <A_USER_NAME>
password = <A_PASSWORD>

Add a requirements.txt file that includes scrapy, scrapy-heroku, and scrapyd. It is strongly recommended that you pin the version of scrapy-heroku as well as the version of Scrapy that your project is developed against (for example, with pip freeze > requirements.txt).

For example::

# requirements.txt
Scrapy==0.24.4
scrapyd==1.0.1
scrapy-heroku==0.7.1

Finally, create a Procfile that consists of::

web: scrapyd

Make sure you have a Postgres database provisioned and that the DATABASE_URL environment variable is set.
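
Before pushing, it can help to confirm that the two environment variables the package relies on (PORT and DATABASE_URL) are present and well-formed. The following is a minimal sketch using only the Python standard library. The function name ``check_heroku_env`` and the layout of the returned dictionary are illustrative assumptions; they are not part of scrapy-heroku.

```python
import os
from urllib.parse import urlparse


def check_heroku_env(environ=None):
    """Validate PORT and DATABASE_URL and return the parsed pieces.

    Hypothetical helper for illustration only; scrapy-heroku does not ship it.
    """
    environ = os.environ if environ is None else environ

    # Heroku passes the port for the web process through PORT.
    port = int(environ["PORT"])

    # DATABASE_URL has the form postgres://user:password@host:5432/dbname
    url = urlparse(environ["DATABASE_URL"])
    if url.scheme not in ("postgres", "postgresql"):
        raise ValueError("DATABASE_URL is not a Postgres URL: %r" % url.scheme)

    return {
        "port": port,
        "db_host": url.hostname,
        "db_port": url.port or 5432,  # fall back to the Postgres default port
        "db_name": url.path.lstrip("/"),
        "db_user": url.username,
    }
```

Calling ``check_heroku_env()`` with no arguments reads from the real process environment; passing a plain dictionary makes it easy to try out locally. A missing variable surfaces as a ``KeyError`` naming the variable.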


scrapy-heroku's Issues

AttributeError: module 'importlib._bootstrap' has no attribute 'SourceFileLoader'

When adding your module and trying to deploy to Heroku with Python 3.6.4, I get the following error:

remote:        Collecting distribute (from scrapy-heroku==0.7.1->-r /tmp/build_a2015ad4e03aa9d149d2f49b9c840b65/requirements.txt (line 3))
remote:          Downloading distribute-0.7.3.zip (145kB)
remote:            Complete output from command python setup.py egg_info:
remote:            Traceback (most recent call last):
remote:              File "<string>", line 1, in <module>
remote:              File "/tmp/pip-build-0ymcqp4k/distribute/setuptools/__init__.py", line 2, in <module>
remote:                from setuptools.extension import Extension, Library
remote:              File "/tmp/pip-build-0ymcqp4k/distribute/setuptools/extension.py", line 5, in <module>
remote:                from setuptools.dist import _get_unpatched
remote:              File "/tmp/pip-build-0ymcqp4k/distribute/setuptools/dist.py", line 7, in <module>
remote:                from setuptools.command.install import install
remote:              File "/tmp/pip-build-0ymcqp4k/distribute/setuptools/command/__init__.py", line 8, in <module>
remote:                from setuptools.command import install_scripts
remote:              File "/tmp/pip-build-0ymcqp4k/distribute/setuptools/command/install_scripts.py", line 3, in <module>
remote:                from pkg_resources import Distribution, PathMetadata, ensure_directory
remote:              File "/tmp/pip-build-0ymcqp4k/distribute/pkg_resources.py", line 1518, in <module>
remote:                register_loader_type(importlib_bootstrap.SourceFileLoader, DefaultProvider)
remote:            AttributeError: module 'importlib._bootstrap' has no attribute 'SourceFileLoader'
remote:            
remote:            ----------------------------------------
remote:        Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-0ymcqp4k/distribute/
remote:  !     Push rejected, failed to compile Python app.
remote: 
remote:  !     Push failed
remote: Verifying deploy...
remote: 
remote: !       Push rejected to polar-bastion-49521.

Any reason to rely on distribute?

psycopg2 Operational Error

Consistently getting a psycopg2 exception: "SSL connection has been closed unexpectedly"

After reading this (http://stackoverflow.com/questions/26792943/heroku-rails-pg-activerecordstatementinvalid-pgconnectionbad-pqconsum) I figured it was due to having a free-tier Hobby Dev DB. I then tried a Hobby Basic and the same thing happened. Any ideas for how to solve this and/or what the cause may be?

Full Log Exception:

2014-12-09T12:26:04.194129+00:00 app[web.1]: 2014-12-09 12:26:03+0000 [-] Unhandled Error
2014-12-09T12:26:04.194134+00:00 app[web.1]:    Traceback (most recent call last):
2014-12-09T12:26:04.194136+00:00 app[web.1]:      File "/app/.heroku/python/lib/python2.7/site-packages/twisted/internet/defer.py", line 139, in maybeDeferred
2014-12-09T12:26:04.194138+00:00 app[web.1]:        result = f(*args, **kw)
2014-12-09T12:26:04.194140+00:00 app[web.1]:      File "/app/.heroku/python/lib/python2.7/site-packages/twisted/internet/defer.py", line 1237, in unwindGenerator
2014-12-09T12:26:04.194142+00:00 app[web.1]:        return _inlineCallbacks(None, gen, Deferred())
2014-12-09T12:26:04.194143+00:00 app[web.1]:      File "/app/.heroku/python/lib/python2.7/site-packages/twisted/internet/defer.py", line 1099, in _inlineCallbacks
2014-12-09T12:26:04.194145+00:00 app[web.1]:        result = g.send(result)
2014-12-09T12:26:04.194176+00:00 app[web.1]:      File "/app/.heroku/python/lib/python2.7/site-packages/scrapyd/poller.py", line 21, in poll
2014-12-09T12:26:04.194177+00:00 app[web.1]:        c = yield maybeDeferred(q.count)
2014-12-09T12:26:04.194179+00:00 app[web.1]:    --- <exception caught here> ---
2014-12-09T12:26:04.194180+00:00 app[web.1]:      File "/app/.heroku/python/lib/python2.7/site-packages/twisted/internet/defer.py", line 139, in maybeDeferred
2014-12-09T12:26:04.194182+00:00 app[web.1]:        result = f(*args, **kw)
2014-12-09T12:26:04.194183+00:00 app[web.1]:      File "/app/.heroku/python/lib/python2.7/site-packages/scrapy_heroku/spiderqueue.py", line 130, in count
2014-12-09T12:26:04.194185+00:00 app[web.1]:        return len(self.q)
2014-12-09T12:26:04.194186+00:00 app[web.1]:      File "/app/.heroku/python/lib/python2.7/site-packages/scrapy_heroku/spiderqueue.py", line 80, in __len__
2014-12-09T12:26:04.194188+00:00 app[web.1]:        result = self._execute(q)[0][0]
2014-12-09T12:26:04.194190+00:00 app[web.1]:      File "/app/.heroku/python/lib/python2.7/site-packages/scrapy_heroku/spiderqueue.py", line 36, in _execute
2014-12-09T12:26:04.194191+00:00 app[web.1]:        cursor.execute(q, args)
2014-12-09T12:26:04.194193+00:00 app[web.1]:    psycopg2.OperationalError: terminating connection due to administrator command
2014-12-09T12:26:04.194194+00:00 app[web.1]:    SSL connection has been closed unexpectedly
2014-12-09T12:26:04.194196+00:00 app[web.1]:    
2014-12-09T12:26:04.194197+00:00 app[web.1]: 

Is it possible to catch the exception and try to reconnect when the next poll happens?
Thanks!

Cheers,
Ari
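
The reconnect-on-next-poll idea raised in this issue can be sketched as a small retry wrapper. This is only an illustration of the general technique, not code from scrapy-heroku: the function ``run_with_reconnect`` and all of its parameters are hypothetical, and the wrapper is kept database-agnostic so it runs without psycopg2 installed. In a real deployment the ``retriable`` argument would presumably be ``psycopg2.OperationalError``, the exception that appears in the log.

```python
import time


def run_with_reconnect(connect, operation, retriable, attempts=3, delay=1.0):
    """Run operation(conn), reconnecting when a retriable error is raised.

    connect   -- zero-argument callable returning a fresh connection
    operation -- callable that takes the connection and returns a result
    retriable -- exception class (or tuple of classes) that signals a
                 dropped connection, e.g. psycopg2.OperationalError
    attempts  -- total number of tries before giving up
    delay     -- seconds to wait before reconnecting
    """
    conn = connect()
    for attempt in range(1, attempts + 1):
        try:
            return operation(conn)
        except retriable:
            if attempt == attempts:
                raise  # out of retries: let the caller see the error
            time.sleep(delay)
            conn = connect()  # discard the dead connection, open a new one
```

Under these assumptions, a queue method such as the ``_execute`` call seen in the traceback could pass its query as ``operation`` and a function that opens a connection from DATABASE_URL as ``connect``. Whether that fits the package's internals would need to be checked against its source.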

High number of connections on Postgres database

I use a Hobby Basic Postgres plan for my project. Since I have only 20 possible connections on the database, is there any way to point scrapy-heroku at its database through an environment variable other than DATABASE_URL? (scrapy-heroku consumes 10 of the connections.)

Data not being saved to Postgres

Are there any changes to be made to the Scrapy project in order to get the data saved in Postgres? I am able to see the logs and items files on the server; however, those are not getting saved to Heroku.
