
distributed-multi-user-scrapy-system-with-a-web-ui's Introduction

Distributed Multi-User Scrapy System with a Web UI

This is a Django project that lets users create, configure, deploy, and run Scrapy spiders through a web interface. The goal of this project is to build an application that allows multiple users to write their own scraping scripts and deploy them to a cluster of workers that scrape in a distributed fashion. Through the web interface, users can perform the following actions:

  • Create a Scrapy project
  • Add/Edit/Delete Scrapy Items
  • Add/Edit/Delete Scrapy Item Pipelines
  • Edit Link Generator function (more on this below)
  • Edit Scraper function (more on this below)
  • Deploy the projects to worker machines
  • Start/Stop projects on worker machines
  • Display online status of the worker machines, the database, and the link queue
  • Display the deployment status of projects
  • Display the number of items scraped
  • Display the number of errors that occurred in a project while scraping
  • Display start/stop date and time for projects

Architecture

The application comes bundled with a Scrapy pipeline for MongoDB (for saving the scraped items) and a Scrapy scheduler for RabbitMQ (for distributing the links among workers). The code for these was taken and adapted from https://github.com/sebdah/scrapy-mongodb and https://github.com/roycehaynes/scrapy-rabbitmq. Here is what you need to run the application:

  • MongoDB server (can be standalone or a sharded cluster, replica sets were not tested)
  • RabbitMQ server
  • One link generator worker server with Scrapy installed and running scrapyd daemon
  • At least one scraper worker server with Scrapy installed and running scrapyd daemon

After you have all of the above up and running, fill in sample_settings.py in the root folder and scrapyproject/scrapy_packages/sample_settings.py with the needed information, rename both files to settings.py, and run the Django server (don't forget to perform the migrations first). You can then go to http://localhost:8000/project/ to start creating your first project.
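For reference, here is a minimal sketch of what a filled-in settings file might look like. The field names follow the sample settings files; all hosts and credentials below are placeholders that you should replace with your own:

    # settings.py -- renamed from sample_settings.py; all values are placeholders
    SCHEDULER = ".rabbitmq.scheduler.Scheduler"   # bundled RabbitMQ scheduler
    SCHEDULER_PERSIST = True

    RABBITMQ_HOST = 'localhost'
    RABBITMQ_PORT = 5672
    RABBITMQ_USERNAME = 'guest'
    RABBITMQ_PASSWORD = 'guest'

    MONGODB_PUBLIC_ADDRESS = 'localhost:27017'    # shown on the web interface only
    MONGODB_URI = 'mongodb://localhost:27017'     # actual URI used to connect to the DB
    MONGODB_USER = ''
    MONGODB_PASSWORD = ''
    MONGODB_SHARDED = False
    MONGODB_BUFFER_DATA = 100

    LINK_GENERATOR = 'http://localhost:6800'      # scrapyd address of the link generator worker
    SCRAPERS = ['http://localhost:6800']          # scrapyd addresses of the scraper workers

    LINUX_USER_CREATION_ENABLED = False           # create a linux account during registration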

Link Generator

The link generator function inserts all the links that need to be scraped into the RabbitMQ queue. Scraper workers dequeue those links, scrape the items, and save the items to MongoDB. The link generator itself is just a Scrapy spider written inside a parse(self, response) function. The only difference from a regular spider is that the link generator does not scrape and save items; it only extracts the links that need to be scraped and inserts them into RabbitMQ for the scraper machines to consume.
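As an illustration, here is a hypothetical link generator spider; the site, selectors, and names are made up for the example. The idea is that, with the bundled RabbitMQ scheduler enabled, each Request the spider yields is pushed to the project's queue rather than downloaded locally:

    # A hypothetical link generator (site, selectors, and names are illustrative).
    import scrapy

    class MyLinkGenerator(scrapy.Spider):
        name = 'my_link_generator'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # Enqueue every detail-page link for the scrapers to consume.
            for href in response.css('div.quote a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href))
            # Follow pagination so the generator keeps producing links.
            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)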

Scrapers

The scraper function takes links from RabbitMQ, makes a request to each link, parses the response, and saves the items to the DB. The scraper is also just a Scrapy spider, but without the functionality to add links to the queue.
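A sketch of what a scraper's parse function might contain, with an illustrative item class and selectors (both would be defined through the web UI in practice):

    # A hypothetical scraper parse function (item class and selectors are illustrative).
    # The scraper only parses the responses for the links it dequeues and never
    # yields new Requests, so no links flow back into the queue.
    def parse(self, response):
        item = QuoteItem()  # an item defined through the web UI
        item['text'] = response.css('span.text::text').extract_first()
        item['author'] = response.css('small.author::text').extract_first()
        yield item          # the bundled MongoDB pipeline saves this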

This separation of roles allows the links to be distributed evenly among multiple scrapers. There can be only one link generator per project, but an unlimited number of scrapers.

RabbitMQ

When a project is deployed and run, the link generator creates a queue for the project in the username_projectname:requests format and starts inserting links. Scrapers use the RabbitMQ scheduler in Scrapy to get one link at a time and process it.
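If you want to verify that links are flowing, a quick check with pika from any machine that can reach RabbitMQ is a minimal option (queue name and host below are placeholders):

    # Inspect a project's request queue with pika (names are placeholders).
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = conn.channel()
    # passive=True only looks the queue up; it raises an error if the queue
    # does not exist yet, and never creates it.
    q = channel.queue_declare(queue='myuser_myproject:requests', passive=True)
    print('links waiting:', q.method.message_count)
    conn.close()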

MongoDB

All of the items that get scraped are saved to MongoDB. There is no need to prepare the database or collections beforehand. When the first item gets saved to the DB, the scraper creates a database in the username_projectname format and inserts items into a collection named after the item's name as defined in Scrapy. If you are using a sharded cluster of MongoDB servers, the scrapers will try to autoshard the database and the collections when saving the items. A hashed id key is used for sharding.
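To read the scraped items back, a minimal pymongo sketch following the naming convention above (the concrete database and collection names here are placeholders):

    # Read scraped items back from MongoDB (names are placeholders).
    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017')
    db = client['myuser_myproject']            # database: username_projectname
    for item in db['MyItem'].find().limit(5):  # collection: named after the Scrapy item
        print(item)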

Here are the general steps that the application performs:

  1. You create a new project, define items, define item pipelines, add link generator and scraper functions, change settings
  2. You press "Deploy the project"
  3. The scripts and settings will be put into a standard Scrapy project folder structure (two folders will be created: one for link generator, one for scraper)
  4. The two folders will be packaged to .egg files
  5. The link generator egg file will be uploaded to the scrapyd server defined in the settings file
  6. The scraper egg file will be uploaded to all scrapyd servers defined in the settings file (the uploads and start commands go through scrapyd's HTTP API; see the sketch after this list)
  7. You start the link generator
  8. You start the scrapers
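Steps 5-8 use the standard scrapyd HTTP API under the hood. As a hedged sketch of the equivalent raw calls (host, project name, spider name, and egg path are placeholders; the web UI performs all of this for you):

    # Deploy an egg and start a spider via scrapyd's HTTP API (names are placeholders).
    import requests

    scrapyd = 'http://localhost:6800'

    # Upload a packaged egg (what "Deploy the project" does per worker).
    with open('myuser_myproject.egg', 'rb') as egg:
        r = requests.post(scrapyd + '/addversion.json',
                          data={'project': 'myuser_myproject', 'version': '1'},
                          files={'egg': egg})
        print(r.json())

    # Schedule (start) a spider on that worker.
    r = requests.post(scrapyd + '/schedule.json',
                      data={'project': 'myuser_myproject', 'spider': 'myuser_myproject'})
    print(r.json())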

Installation

The web application requires:

  • Django 1.8.13
  • django-crispy-forms
  • django-registration
  • pymongo
  • requests
  • python-dateutil

On the link generator and scraper machines you need:

  • Scrapy
  • scrapyd
  • pymongo
  • pika

The dashboard theme used for the UI was retrieved from https://github.com/VinceG/Bootstrap-Admin-Theme.

Examples

Link generator and scraper functions are given in the examples folder.


License

This project is licensed under the terms of the MIT license.


distributed-multi-user-scrapy-system-with-a-web-ui's Issues

This is a great project, but it lacks a full example.

I have built up your project as in the attached screenshot, but the database always shows as offline.
Do you have any ideas?
Here are my settings:
#rabbitmq and mongodb settings
SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = '192.168.2.73'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'Admin'
RABBITMQ_PASSWORD = 'Test1234'

MONGODB_PUBLIC_ADDRESS = '192.168.2.73:27017' # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = 'mongodb://192.168.2.73:27017' # Actual uri to connect to DB
MONGODB_USER = ''
MONGODB_PASSWORD = ''
MONGODB_SHARDED = False
MONGODB_BUFFER_DATA = 100

LINK_GENERATOR = 'http://192.168.2.73:6800' # Set your link generator worker address here
SCRAPERS = ['http://192.168.2.169:6800'] # Set your scraper worker addresses here

LINUX_USER_CREATION_ENABLED = False # Set this to True if you want a linux user account created during registration

Is it caused by a wrong setting?

Thanks in advance.

help library crypt

Dear admin,

I am using Python 2.7 and added these libraries:

  • pip install crypt
  • pip install pycrypto
  • pip install cryptography

=> I still get an import error:

ImportError
No module named crypt
C:/Users/..../Music/scrapy-web-ui-master/Scrapy-Web-UI-master\scrapyproject\views.py in , line 26
C:\Python27\python.exe
2.7.0

Please help me. Thank you.

There is nothing called Cookies; it should be document.cookie

Uncaught ReferenceError: Cookies is not defined
    at localhost/:91

Here -

    <script src="/static/js.cookie.js"></script>
    <script>
        var csrftoken = Cookies.get('csrftoken');
        function csrfSafeMethod(method) {
            return (/^(GET|HEAD|OPTIONS|TRACE)$/.test(method));
        }

Worker hosts are unreachable

I started scrapyd, mongo, and rabbitmq on the worker machines, but it still shows worker status: unreachable, version: unknown.

When I ping the workers from the link generator with curl, it works fine.

This is how ./scrapyproject/scrapy_packages/settings.py looks

SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = '127.0.0.1'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'

MONGODB_PUBLIC_ADDRESS = '127.0.0.1:27017' # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = '127.0.0.1:27017' # Actual uri to connect to DB
MONGODB_USER = ''
MONGODB_PASSWORD = ''
MONGODB_SHARDED = False
MONGODB_BUFFER_DATA = 100

LINK_GENERATOR = 'http://127.0.0.1:6800' # Set your link generator worker address here
SCRAPERS = ['172.17.0.4:3800', '172.17.0.2:6800'] # Set your scraper worker addresses here

LINUX_USER_CREATION_ENABLED = False # Set this to True if you want a linux user account created during registration

And This is what I get when I curl them from link generator -

root@96f3b9b83573:/usr/local/Distributed-Multi-User-Scrapy-System-with-a-Web-UI# curl -I 172.17.0.4:3800
HTTP/1.1 200 OK
Date: Fri, 11 May 2018 04:50:31 GMT
Content-Length: 699
Content-Type: text/html
Server: TwistedWeb/18.4.0

root@96f3b9b83573:/usr/local/Distributed-Multi-User-Scrapy-System-with-a-Web-UI# curl -I 172.17.0.2:6800
HTTP/1.1 200 OK
Date: Fri, 11 May 2018 04:50:50 GMT
Content-Length: 699
Content-Type: text/html
Server: TwistedWeb/18.4.0

So the workers are up, but somehow the link generator can't connect to them.

attributeerror found

When I use the example link generator or define a new spider, I get the same error messages. Please help me; are there any files that need to be reconfigured? Thanks.

2017-06-25 07:33:44 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <AdminDemo1Spider 'admin_demo1' at 0x1117ed890>>
Traceback (most recent call last):
  File "/Users/Bruce/git/Distributed-Multi-User-Scrapy-System-with-a-Web-UI/venv/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/Users/Bruce/git/Distributed-Multi-User-Scrapy-System-with-a-Web-UI/venv/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/private/var/folders/fx/jgdx72z90b5dsq_l39v1q00m0000gn/T/admin_demo1-2-sl3cQm.egg/admin_demo1/spiders/admin_demo1.py", line 28, in spider_closed
AttributeError: 'AdminDemo1Spider' object has no attribute 'statstask'
2017-06-25 07:33:44 [twisted] CRITICAL: Unhandled error in Deferred:
2017-06-25 07:33:44 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/Users/Bruce/git/Distributed-Multi-User-Scrapy-System-with-a-Web-UI/venv/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/Users/Bruce/git/Distributed-Multi-User-Scrapy-System-with-a-Web-UI/venv/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "/Users/Bruce/git/Distributed-Multi-User-Scrapy-System-with-a-Web-UI/venv/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
KeyError: 'username'

Such a great project!

Such a great project! Scrapy currently lacks UI functionality; I hope this project can develop further.
