
scrapy_redis_bloomfilter's Introduction

## bloomfilterOnRedis.py

Redis-based Bloom filter deduplication, wrapped into a single class; using it takes only two lines of code. A usage sketch follows. For details see the post 《基于Redis的Bloomfilter去重(附Python代码)》 (Redis-based Bloomfilter deduplication, with Python code).
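A minimal usage sketch of the advertised two-line dedup. The class name, constructor arguments, and the insert() method are assumptions; only isContains() is confirmed by the issue further below, and its int(str_input[0:2], 16) logic implies the input should be a hex fingerprint rather than a raw URL:

from hashlib import sha1

import redis
from bloomfilterOnRedis import BloomFilter  # module from this repo; class name assumed

server = redis.StrictRedis(host='127.0.0.1', port=6379)  # connection details illustrative
bf = BloomFilter(server, key='bloomfilter', blockNum=1)   # constructor signature assumed

url = 'http://www.example.com/page-1'
fp = sha1(url.encode('utf-8')).hexdigest()  # hex fingerprint, as isContains() expects

# The "two lines" of dedup logic: test membership, then record.
if not bf.isContains(fp):
    bf.insert(fp)  # insert() is an assumed counterpart to isContains()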



## scrapyWithBloomfilter_demo

A simple Scrapy demo that modifies the scrapy_redis module to replace its deduplication component with the Bloom filter above; a sketch of the swap is shown below. For details see the post 《scrapy_redis去重优化(已有7亿条数据),附Demo福利》 (Optimizing scrapy_redis deduplication with 700 million records already stored, demo included).
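A rough sketch of what such a swap typically looks like: a scrapy_redis-style dupefilter whose request_seen() consults the Bloom filter instead of a Redis set. The class name, import path, and constructor arguments here are illustrative, not the demo's actual code:

from scrapy.utils.request import request_fingerprint
from .bloomfilterOnRedis import BloomFilter  # import path assumed

class BloomDupeFilter(object):
    """Answers Scrapy's request_seen() from the Redis-backed Bloom filter."""

    def __init__(self, server, key):
        self.bf = BloomFilter(server, key=key, blockNum=1)  # constructor args assumed

    def request_seen(self, request):
        fp = request_fingerprint(request)  # Scrapy's canonical hex fingerprint
        if self.bf.isContains(fp):         # probably seen before: drop the request
            return True
        self.bf.insert(fp)                 # insert() is an assumed method name
        return False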



## Seed optimization

In scrapyWithBloomfilter_demo I also optimized the default seed (request) handling: in settings.py, change SCHEDULER_QUEUE_CLASS to 'scrapyWithBloomfilter_demo.scrapy_redis.queue.SpiderSimpleQueue', as in the snippet below. For details see the post 《scrapy_redis种子优化》 (scrapy_redis seed optimization).
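For reference, the corresponding settings.py lines (the SCHEDULER_QUEUE_CLASS value is taken from the paragraph above; the Redis connection settings are illustrative):

# settings.py
SCHEDULER_QUEUE_CLASS = 'scrapyWithBloomfilter_demo.scrapy_redis.queue.SpiderSimpleQueue'
REDIS_HOST = '127.0.0.1'  # illustrative
REDIS_PORT = 6379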

scrapy_redis_bloomfilter's People

Contributors

bone-ace, liuxingming


scrapy_redis_bloomfilter's Issues

A question about incremental crawling when scrapy_redis deduplication is in place

For example, http://www.xxx.com/list-1 is page 1 of a site, and I need to crawl the news items behind specific child URLs on that page. Say on day one I use scrapy_redis_bloomfilter to crawl http://www.xxx.com/list-1; on day two the site has updated, so the child URLs shown on http://www.xxx.com/list-1 are different. Here is the problem: the requirement is to incrementally crawl the child-URL news, but scrapy_redis_bloomfilter's deduplication means http://www.xxx.com/list-1 is never fetched again, so the latest news can never be reached. How can this be solved? Would appreciate a reply.
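One common Scrapy-level workaround, offered here as a suggestion rather than something from this repo: mark the list pages with dont_filter=True so they bypass deduplication entirely, while the child news URLs still pass through the Bloom filter. A sketch (spider name, URL, and CSS selector are illustrative):

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'

    def start_requests(self):
        # The list page changes daily, so never dedupe it.
        yield scrapy.Request('http://www.xxx.com/list-1',
                             callback=self.parse_list, dont_filter=True)

    def parse_list(self, response):
        # Child news URLs still go through the dupefilter,
        # so only articles not seen before are fetched.
        for href in response.css('a.news::attr(href)').getall():
            yield response.follow(href, callback=self.parse_news)

    def parse_news(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}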

Speed up isContains()

In the file 'Scrapy_Redis_Bloomfilter/scrapyWithBloomfilter_demo/scrapyWithBloomfilter_demo/scrapy_redis/BloomfilterOnRedis.py', lines 33-42:

def isContains(self, str_input):
    if not str_input:
        return False
    ret = True

    name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
    for f in self.hashfunc:
        loc = f.hash(str_input)
        ret = ret & self.server.getbit(name, loc)
    return ret

When getbit() returns 0, we already know the element is not contained, so the remaining hash lookups are wasted work. I think it would be better to add two lines that break out of the loop early, to speed up this function:

def isContains(self, str_input):
    if not str_input:
        return False
    ret = True

    name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
    for f in self.hashfunc:
        loc = f.hash(str_input)
        ret = ret & self.server.getbit(name, loc)
        if not ret:
            break
    return ret

or, perhaps even better, drop the '&' operation entirely:

def isContains(self, str_input):
    if not str_input:
        return False
    ret = True

    name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
    for f in self.hashfunc:
        loc = f.hash(str_input)
        ret = self.server.getbit(name, loc)
        if not ret:
            break
    return ret
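Going one step further, the per-hash round trips to Redis could be batched with redis-py's pipeline, so all GETBIT calls travel in a single request. A sketch layered on the suggestion above, not code from the repo:

def isContains(self, str_input):
    if not str_input:
        return False

    name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
    # Queue every GETBIT and send them in one round trip.
    pipe = self.server.pipeline(transaction=False)
    for f in self.hashfunc:
        pipe.getbit(name, f.hash(str_input))
    # Contained only if every probed bit is set.
    return all(pipe.execute())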

builtins.TypeError: zadd() keywords must be strings

File "/usr/local/lib/python3.6/dist-packages/twisted/internet/base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/engine.py", line 135, in _next_request
    self.crawl(request, spider)
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
File "/root/spiders/yunqiCrawl/yunqiCrawl/scrapy_redis/scheduler.py", line 82, in enqueue_request
    self.queue.push(request)
File "/root/spiders/yunqiCrawl/yunqiCrawl/scrapy_redis/queue.py", line 84, in push
    self.server.zadd(self.key, **pairs)
builtins.TypeError: zadd() keywords must be strings
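This is the redis-py 3.0 API break: zadd() no longer accepts member/score pairs as keyword arguments, it takes a {member: score} mapping as a positional argument. In Python 3 the serialized request is bytes, so expanding it via ** also trips the "keywords must be strings" check. A hedged fix for the push() call in queue.py (the pairs construction follows the stock scrapy_redis priority queue; your local code may differ):

# redis-py 2.x style -- what the traceback shows failing:
#     pairs = {data: -request.priority}
#     self.server.zadd(self.key, **pairs)
# redis-py >= 3.0 style -- pass the mapping positionally:
pairs = {data: -request.priority}
self.server.zadd(self.key, pairs)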
