
frontera's Introduction

Frontera


Overview

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it is capable of doing this in a distributed manner.

Main features

  • Online operation: small request batches, with parsing done right after the fetch.
  • Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
  • Two run modes: single process and distributed.
  • Built-in SqlAlchemy, Redis and HBase backends.
  • Built-in Apache Kafka and ZeroMQ message buses.
  • Built-in crawling strategies: breadth-first, depth-first, Discovery (with support of robots.txt and sitemaps).
  • Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M documents daily for 45 days, without downtime.
  • Transparent data flow, allowing custom components to be integrated easily using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.

Installation

$ pip install frontera

Documentation

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and pull requests.

frontera's People

Contributors

a-shkarupin, akshayphilar, clarksun, dangra, eliasdorneles, gatufo, guillermoap, ialwaysbecoding, icapurro, isra17, josericardo, kalessin, kmike, lljrsr, lopuhin, markbaas, plafl, preetwinder, rajatgoyal, redapple, shaneaevans, sibiryakov, starrify, vfcosta, vlsarro, voith, vshlapakov, xsren, zipfile


frontera's Issues

Docker support

Is it possible to see this code in action via an easy-to-use Docker image (or by building one)?
What about adding a distributed Docker sample as well?

Getting a ZeroMQ error on Windows 2010

Process Process-1:
Traceback (most recent call last):
File "C:\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Anaconda2\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "C:\neuromac\Admin.py", line 438, in start_proxy
backend.close()
UnboundLocalError: local variable 'backend' referenced before assignment
Admin going to start with cfg_file: random_walk\random_walk.cfg
Traceback (most recent call last):
File "..\Admin.py", line 461, in
admin = Admin_Agent(no+1,cfg_file)
File "..\Admin.py", line 50, in init
self._initialize_communication_links()
File "..\Admin.py", line 75, in _initialize_communication_links
self.socket_pull.bind("tcp://*:%s" % self.parser.getint("system","pull_port"))
File "zmq/backend/cython/socket.pyx", line 487, in zmq.backend.cython.socket.Socket.bind (zmq\backend\cython\socket.c:5156)
File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq\backend\cython\socket.c:7535)
raise ZMQError(errno)
zmq.error.ZMQError: Address in use

Default settings and documentation inconsistent

Hi there.
In the default settings you set

SPIDER_FEED_PARTITIONS = 1

In the docs it is written that the default is supposed to be 2.
I also think that the docs about SPIDER_FEED_PARTITIONS in the run modes section are a bit confusing. I think that setting SPIDER_FEED_PARTITIONS = 2 will result in two partitions, numbered 0 and 1. According to this docs page it should be 0, 1 and 2.

Multiple strategy and DB workers for ZeroMQ message bus

At the moment it's limited to a single DB worker and a single SW. Solving this requires modifying the whole data flow: the spider log needs to be partitioned, and the broker needs to send messages only to the workers subscribed to a specific partition.
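
A minimal sketch of the partition-filtered delivery idea, assuming plain pyzmq PUB/SUB with topic prefixes (this is not Frontera's actual message bus code):

    import zmq

    context = zmq.Context()

    # Broker side: publish spider log messages tagged with a partition prefix.
    pub = context.socket(zmq.PUB)
    pub.bind("tcp://*:5551")
    partition_id = 1
    pub.send_multipart([b"sl.%d" % partition_id, b"<encoded spider log message>"])

    # Strategy worker side: subscribe to a single partition only.
    sub = context.socket(zmq.SUB)
    sub.connect("tcp://localhost:5551")
    sub.setsockopt(zmq.SUBSCRIBE, b"sl.1")  # this SW handles partition 1
    topic, payload = sub.recv_multipart()   # blocks until a message for this partition arrives

With this scheme each SW only receives the slice of the spider log it subscribed to, which is the behaviour this issue asks for.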

SW stopping on too many HBase retries

My SW is stopping due to an unhandled exception:

Unhandled error in Deferred:


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1194, in run
    self.mainLoop()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/task.py", line 213, in __call__
    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/media/sf_distributed_llshrimp/debug/frontera/frontera/worker/strategy.py", line 80, in work
    self.states.fetch(fingerprints)
  File "/media/sf_distributed_llshrimp/debug/frontera/frontera/contrib/backends/hbase.py", line 306, in fetch
    records = table.rows(keys, columns=['s:state'])
  File "/usr/local/lib/python2.7/dist-packages/happybase/table.py", line 155, in rows
    self.name, rows, columns, {})
  File "/usr/local/lib/python2.7/dist-packages/happybase/hbase/Hbase.py", line 1358, in getRowsWithColumns
    return self.recv_getRowsWithColumns()
  File "/usr/local/lib/python2.7/dist-packages/happybase/hbase/Hbase.py", line 1384, in recv_getRowsWithColumns
    raise result.io
happybase.hbase.ttypes.IOError: IOError(_message='org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 27766 actions: IOException: 27766 times, \n\tat org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:228)\n\tat org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:208)\n\tat org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1594)\n\tat org.apache.hadoop.hbase.client.HTable.batch(HTable.java:936)\n\tat org.apache.hadoop.hbase.client.HTable.batch(HTable.java:950)\n\tat org.apache.hadoop.hbase.client.HTable.get(HTable.java:911)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.getRowsWithColumnsTs(ThriftServerRunner.java:1107)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.getRowsWithColumns(ThriftServerRunner.java:1063)\n\tat sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:606)\n\tat org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:67)\n\tat com.sun.proxy.$Proxy9.getRowsWithColumns(Unknown Source)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$getRowsWithColumns.getResult(Hbase.java:4262)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$getRowsWithColumns.getResult(Hbase.java:4246)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)\n\tat org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:745)\n')

After restarting the SW it works again (but consumption seems to proceed at a much lower rate).

Confused about MAX_NEXT_REQUESTS

Under Production Crawling an example is presented with N spiders and N SWs. The spider has twice the MAX_NEXT_REQUESTS value of the workers, and there is a comment that both values should be "consistent".
Under Frontera Settings there is no mention of which ratio to use for SW and spider requests, or of what "consistent" means.
Then, in the general-spider project, the spider has half the MAX_NEXT_REQUESTS value that the workers have.
Now I am a bit confused. Should MAX_NEXT_REQUESTS be the same for SWs and spiders, or should a certain ratio be used?

Rename project to "Frontera"

Since the beginning of the Crawl Frontier project, we knew we wanted to give it a real name, something less generic than "crawl frontier", something easy to identify and that sounds good. There is a reason why Scrapy was called Scrapy and not "Web Crawler" :).

We have decided today on calling it "Frontera". I have reserved the pypi name.

@plafl @sibiryakov can you do the rename?

Redirect loop when using distributed-frontera

I am using the development version of distributed-frontera, frontera and scrapy for crawling. After a while my spider keeps getting stuck in a redirect loop. Restarting the spider helps, but after a while this happens:

2015-12-21 17:23:22 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:23 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:24 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:26 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:27 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:32 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:33 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:35 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:35 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:36 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:37 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-21 17:23:43 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
...
2015-12-21 17:45:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:45:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

This does not seem to be an issue with distributed-frontera since I could not find any code related to redirecting there.

Mention contributors

Let's make a contributors.txt or AUTHORS file. It should mention contributions from Scrapinghub and Diffeo, along with some of the individuals who have made significant contributions.

More efficient memory backends

The memory backends are all implemented using heapq. This allows for succinct code when using different crawl orderings, but it is less efficient than choosing a more appropriate data structure for each specific ordering (LIFO, FIFO, etc.).
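
A minimal sketch of the idea, with hypothetical class names (not the current Frontera code): pick the cheapest structure per ordering instead of a single heapq-based queue.

    from collections import deque

    class FIFOMemoryQueue:
        """Breadth-first ordering: O(1) push and pop using a deque."""
        def __init__(self):
            self._queue = deque()
        def push(self, request):
            self._queue.append(request)
        def pop(self):
            return self._queue.popleft()

    class LIFOMemoryQueue:
        """Depth-first ordering: O(1) push and pop using a plain list."""
        def __init__(self):
            self._queue = []
        def push(self, request):
            self._queue.append(request)
        def pop(self):
            return self._queue.pop()

heapq keeps O(log n) push/pop and only pays off when an actual priority ordering is needed.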

color logging fails with colorlog 2.6.0

File "/Library/Python/2.7/site-packages/crawl_frontier-0.2.0.post.dev7.pre-py2.7.egg/crawlfrontier/logger/formatters/color.py", line 14, in __init__
        reset=reset, style=style)
    exceptions.TypeError: __init__() got an unexpected keyword argument ‘format’

It works with colorlog 2.4.0.

CrawlSpider doesn't work with most backends

Memory-based backends are the exception.
CrawlSpider makes use of callbacks stored in requests. These callbacks are discarded when requests are saved to the DB by Frontera. A possible solution is pickling/unpickling callback methods into a binary field in the DB.
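
A rough sketch of an alternative to pickling, with hypothetical helper names: since the spider instance is available when requests are restored, storing only the callback's name (a short string) in the request metadata is usually enough.

    def serialize_callback(callback):
        # store just the method name, e.g. "parse_item"
        return callback.__name__

    def restore_callback(spider, name):
        # resolve the method again on the running spider instance
        return getattr(spider, name)

This avoids putting pickled code objects into the DB, at the cost of requiring callbacks to be methods defined on the spider.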

remove optional dependencies from install_requires

I'm not 100% sure about that, but it seems some of the requirements in setup.py are optional. For example, based on the docs it seems Crawl Frontier can be used without SQLAlchemy or without pydot. But right now it can't, because these requirements are in install_requires. I think it is better for install_requires to be minimal; what do you think about removing most of the packages from it?

Versioning in install_requires also looks too restrictive: are the latest versions really required? Pinning versions with == is problematic, because if a new compatible version of a package is released, it won't be possible to use it with an existing crawl-frontier version.

IMHO install_requires should exclude versions that are known not to work, not versions we're merely unsure about. We could also create a requirements.txt file with pinned package versions that are known to work.
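
A sketch of how setup.py could be split, assuming extras_require groups (the exact package lists here are illustrative, not a final proposal):

    from setuptools import setup, find_packages

    setup(
        name='crawl-frontier',
        packages=find_packages(),
        # keep only what every install truly needs
        install_requires=[
            'six>=1.8.0',
        ],
        # optional features become opt-in extras
        extras_require={
            'sql': ['SQLAlchemy>=1.0.0'],
            'graphs': ['pydot'],
            'logging': ['colorlog>=2.4.0'],
        },
    )

Users who need the SQLAlchemy backend would then run pip install crawl-frontier[sql], while a bare install stays minimal.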

kafka python offsetrequest

Please correct frontera.contrib.messagebus.kafka.__init__.py, line 5, to

from kafka.common import OffsetRequestPayload, check_error, OffsetFetchRequestPayload, UnknownTopicOrPartitionError

The corresponding kafka-python version is 1.0.2.

Otherwise you cannot load OffsetRequest.

More efficient state cache in strategy workers

At the moment, links extracted by spiders are distributed to SWs using the fingerprint of the crawled page. This causes the same links to consume memory and HBase (or other DB) bandwidth on all SW instances. This can be improved by adding a message type to the spider log keyed by the fingerprint of the extracted link. Links would then be distributed uniformly across SWs and no duplication would take place.

This improvement requires changing the spider log code in the Frontera spider backend, the protocols, and the SW.

The drawback is more messages in the message queue system.
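
A minimal sketch of the proposed routing (the message type and field names are hypothetical): the new spider log message is keyed by the fingerprint of the extracted link, so each link lands on exactly one SW.

    def partition_for(fingerprint, n_partitions):
        # fingerprint is a hex SHA-1 string; hashing it spreads links uniformly
        return int(fingerprint[:8], 16) % n_partitions

    # current behaviour: all links extracted from a page follow the page's
    # fingerprint and therefore hit the same SW (and its state cache)
    # proposed behaviour: each extracted link is routed by its own fingerprint
    partition = partition_for("f259306fa30657ab28ffa1c322d843d0cdceee41", 4)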

Crawling strategy for a topic-focused crawler

It would be nice to add to Frontera an optional crawling strategy for topical crawling. It could take a dictionary of words describing some topic as input and crawl from seed URLs, searching for documents relevant to the topic until some finishing condition is met.
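
A toy sketch of the scoring part of such a strategy (not an actual Frontera strategy class): score each fetched document by its overlap with the topic dictionary and stop once a finishing condition is met.

    import re

    TOPIC_WORDS = {"solar", "photovoltaic", "inverter", "panel"}

    def relevance(text):
        tokens = re.findall(r"\w+", text.lower())
        if not tokens:
            return 0.0
        hits = sum(1 for token in tokens if token in TOPIC_WORDS)
        return hits / float(len(tokens))

    def should_stop(pages_crawled, relevant_pages):
        # example finishing condition: enough relevant documents collected
        return relevant_pages >= 10000 or pages_crawled >= 1000000

The relevance score would then be used to prioritise links extracted from relevant pages over links from off-topic ones.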

Prioritize command line option for SPIDER_PARTITION_ID

Right now Frontera recommends setting the partition ID in a separate Python settings file for each spider/worker. However, when shipping out a project it would be nice to have a command line option to pass either a config file or the partition ID of the worker/spider. The separate settings files would then no longer be needed, which would make Frontera more flexible and easier to deploy and use.
Since supporting config files might require big changes in the project, I recommend adding a command line option to choose the SPIDER_PARTITION_ID.
Do you think that would be a good addition? Is there already something available that makes this feature unnecessary?
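
One workaround sketch that avoids a settings file per process: read the partition id from the environment inside a single shared Frontera settings module (this is a suggestion, not a built-in feature).

    import os

    # settings.py shared by all spider/worker processes
    SPIDER_PARTITION_ID = int(os.environ.get("SPIDER_PARTITION_ID", "0"))

    # launched as, for example:
    #   SPIDER_PARTITION_ID=3 scrapy crawl myspider

A real command line option would still be nicer, but this keeps deployments down to one settings file today.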

Get rid of OverusedBuffer

Buffering requests to busy hosts should be the responsibility of the fetcher component. We need to figure out how to change the interfaces, and how to support the necessary buffering logic in our default fetcher (Scrapy).
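
A minimal sketch of the buffering that would move into the fetcher (class and method names are hypothetical): keep a per-host queue and only release requests for hosts that currently have capacity.

    from collections import defaultdict, deque

    class PerHostBuffer:
        def __init__(self):
            self._queues = defaultdict(deque)

        def put(self, request, host):
            self._queues[host].append(request)

        def get_ready(self, is_overused):
            # yield at most one request per host that is not currently overused;
            # is_overused(host) would be provided by the fetcher's own stats
            for host, queue in self._queues.items():
                if queue and not is_overused(host):
                    yield queue.popleft()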

Second-level domain hostname partitioning option

It's easier to describe the problem with example:

en.wikipedia.org and ru.wikipedia.org should be assigned to the same partition because they share the same second-level domain, wikipedia.org.

Such behaviour should be optional. It is mostly needed to avoid generating extra load on websites with three or more domain levels hosted on the same machine.
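
A sketch of the optional partitioning key (using tldextract here is just a suggestion): partition by the registered domain so that all subdomains of one site hash to the same partition.

    import tldextract

    def partition_key(hostname):
        ext = tldextract.extract(hostname)
        # "wikipedia.org" for both en.wikipedia.org and ru.wikipedia.org
        return ext.registered_domain or hostname

    assert partition_key("en.wikipedia.org") == partition_key("ru.wikipedia.org")

Using a public-suffix-aware library matters here, because a naive "last two labels" split would wrongly merge unrelated sites under suffixes like co.uk.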

pre-requisite for running examples?

I tried to set up Frontera with a Scrapy CrawlSpider, but no luck.

Then I cloned the project and ran the example with scrapy crawl example; it still seems to finish with requests being generated but no responses being parsed.

Am I missing anything, or is it a version problem? I am using Scrapy 0.24.4 and Frontera 0.3.0.post0.dev2.

2015-05-20 00:48:52+0800 [example] INFO: Dumping Scrapy stats:
    {'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 5, 19, 16, 48, 52, 562936),
     'frontera/iterations': 0,
     'frontera/pending_requests_count': 0,
     'frontera/seeds_count': 4,
     'log_count/DEBUG': 2,
     'log_count/INFO': 9,
     'start_time': datetime.datetime(2015, 5, 19, 16, 48, 52, 558849)}

403 not treated as error

According to the HTTP standard, successful responses are those whose status codes are in the 2xx range. But the SQLAlchemy backend is not treating 403 as an error and is not populating the error column. Essentially, the request_error function is not getting called.

Reliable batch release in the queue

The current implementation removes a batch right after requesting it from storage. If the spider process does not receive it (or finishes unexpectedly during crawling), the items in the batch are lost. To prevent this, a mechanism that marks a batch as "locked" and releases it later needs to be designed.
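
A sketch of one possible locking scheme (table, column and function names are hypothetical): instead of deleting rows when a batch is handed out, mark them as locked with a timestamp, and return stale locks to the queue if no acknowledgement arrives.

    import time

    LOCK_TIMEOUT = 600  # seconds without an ack before a batch is released again

    def get_next_requests(conn, batch_size, now=None):
        now = now or time.time()
        # re-release batches whose spider never confirmed them
        conn.execute("UPDATE queue SET locked_at = NULL WHERE locked_at < ?",
                     (now - LOCK_TIMEOUT,))
        rows = conn.execute(
            "SELECT id, url FROM queue WHERE locked_at IS NULL "
            "ORDER BY score DESC LIMIT ?", (batch_size,)).fetchall()
        conn.executemany("UPDATE queue SET locked_at = ? WHERE id = ?",
                         [(now, row[0]) for row in rows])
        return rows

    def ack_batch(conn, ids):
        # delete only after the spider confirms it received (or crawled) the batch
        conn.executemany("DELETE FROM queue WHERE id = ?", [(i,) for i in ids])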

setting to switch off exception when encountering same url fingerprint

I am trying to run multiple spiders with an RDBMS backend. The spiders may find a URL that was already visited by another spider, and Frontera raises an exception in this case. Is this the expected behaviour?

I would like it not to throw an error when the same URL is encountered; I just want it not to be crawled again, without an exception being raised.

Error:
sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (psycopg2.IntegrityError) duplicate key value violates unique constraint "play_store_pkey" DETAIL: Key (fingerprint)=(f259306fa30657ab28ffa1c322d843d0cdceee41) already exists. [SQL: 'INSERT INTO play_store (url, fingerprint, depth, created_at, status_code, state, error, meta, headers, cookies, method, body) VALUES (%(url)s, %(fingerprint)s, %(depth)s, %(created_at)s, %(status_code)s, %(state)s, %(error)s, %(meta)s, %(headers)s, %(cookies)s, %(method)s, %(body)s)'] [parameters: {'body': None, 'cookies': <psycopg2.extensions.Binary object at 0x7f3e8a02c5a8>, 'url': 'https://play.google.com/store/apps/details?id=com.dic_o.dico_eng_fra', 'status_code': None, 'created_at': '20150909151342577212', 'error': None, 'state': 'NOT CRAWLED', 'headers': <psycopg2.extensions.Binary object at 0x7f3e89d6c238>, 'depth': 6, 'meta': <psycopg2.extensions.Binary object at 0x7f3e89d6c440>, 'fingerprint': 'f259306fa30657ab28ffa1c322d843d0cdceee41', 'method': 'GET'}]
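
A sketch of the behaviour the reporter is asking for (not current Frontera code; model and field names are assumptions): skip a link silently when its fingerprint already exists, instead of letting the IntegrityError bubble up.

    from sqlalchemy.exc import IntegrityError

    def add_if_new(session, PageModel, url, fingerprint):
        if session.query(PageModel).filter_by(fingerprint=fingerprint).first():
            return False  # already known, do not crawl again
        try:
            session.add(PageModel(url=url, fingerprint=fingerprint))
            session.commit()
            return True
        except IntegrityError:
            # another spider inserted the same fingerprint concurrently
            session.rollback()
            return False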

Scrapy- Enabling Frontera middlewares removes the referer header from response objects

When I enable the following Frontera middlewares in Scrapy, I lose the Referer header in all my response objects.

Is there any way I can preserve the referer?

The referer is available when I remove the following lines, but I need these Frontera middlewares enabled:

SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
})


DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})

SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

Also, the RefererMiddleware is enabled; I can see it in the INFO/DEBUG log when Scrapy starts.

Message loss in spider feed

From here https://github.com/scrapinghub/distributed-frontera/issues/24#issuecomment-181386301

Another issue I noticed recently is that my DW keeps pushing to all partitions although I have no spider running. When I start up my spiders, they wait until the DW has pushed a new batch (although it pushed multiple times before that). This means that running the DW for a while without any spider running depletes the queue until there is nothing left to crawl. I have to add new seeds (not the same seed URLs, since duplicates get dropped) for the spiders to start again and the DW to push new requests again.

Producer/Consumer architecture

The idea is to feed a Frontera cluster with URLs discovered outside of Frontera. It could be a typical Scrapy spider crawling a single website and yielding specific links. Some standard way of propagating URLs from an external component to the Frontera cluster is needed.
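
A rough sketch of the producer side, assuming a Kafka topic that a Frontera-side consumer would read (the topic name and message encoding here are made up, not an existing interface):

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def publish_discovered_url(url):
        message = json.dumps({"url": url}).encode("utf-8")
        # hypothetical topic consumed on the Frontera side and turned into seeds
        producer.send("frontier-incoming-urls", message)

    publish_discovered_url("http://example.com/some/page")
    producer.flush()

The missing piece this issue asks for is the standardised consumer on the Frontera side, i.e. an agreed topic and message format.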
