zhihu-scrapy

A Scrapy Zhihu Crawler

###What is zhihu-scrapy?

zhihu-scrapy is a distributed crawler system for the Zhihu website. It gathers user profiles, followees and followers. The collected data can be used for various purposes (e.g. finding communities or identifying popular answer posters).

###How does it work?

It combines the following systems:

  1. scrapy (parsing and logging)
  2. selenium (downloading and executing javascript)
  3. redis (queueing and storing results)

The crawler system has one main redis server that manages crawling records. Each crawling machine also starts a local redis server for storing user data.
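The split between shared crawling records and per-machine storage can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: plain Python containers stand in for the redis servers, and the names MainRecords, CrawlingMachine, crawl_one and fake_fetch are hypothetical. It also assumes that newly discovered followees and followers are pushed back to the shared records, which the README does not state.

```python
from collections import deque


class MainRecords:
    """In-memory stand-in for the main redis server (hypothetical)."""

    def __init__(self):
        self.pending = deque()  # user ids waiting to be crawled
        self.seen = set()       # user ids already queued or crawled

    def add_user_ids(self, user_ids):
        """Queue each id at most once and return how many were new."""
        added = 0
        for user_id in user_ids:
            if user_id not in self.seen:
                self.seen.add(user_id)
                self.pending.append(user_id)
                added += 1
        return added

    def next_user_id(self):
        """Hand out the next id to crawl, or None when nothing is pending."""
        return self.pending.popleft() if self.pending else None


class CrawlingMachine:
    """One crawler with its own store (stand-in for a local redis server)."""

    def __init__(self, records):
        self.records = records
        self.local_store = {}  # user id -> collected data

    def crawl_one(self, fetch):
        user_id = self.records.next_user_id()
        if user_id is None:
            return None
        data = fetch(user_id)
        self.local_store[user_id] = data
        # Assumption: discovered users go back to the shared records.
        self.records.add_user_ids(data["followees"] + data["followers"])
        return user_id


def fake_fetch(user_id):
    """Placeholder for the real download-and-parse step."""
    graph = {"first-id": (["second-id"], ["third-id"])}
    followees, followers = graph.get(user_id, ([], []))
    return {"profile": {"id": user_id}, "followees": followees, "followers": followers}


records = MainRecords()
records.add_user_ids(["first-id"])
machine_a = CrawlingMachine(records)
machine_b = CrawlingMachine(records)
first = machine_a.crawl_one(fake_fetch)
second = machine_b.crawl_one(fake_fetch)
```

Because both machines pull from the same records, no user id is crawled twice, while each machine keeps the data it collected in its own store.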

###How to get started?

Start a redis server on the main server and on each crawling machine.

Add initial users to the main redis server with Monitor, for example:

>> from zhihu.utils import Monitor
>> init_list = ['first-id',]
>> Monitor.add_user_ids(init_list)

In zhihu/settings.py, set REDIS_HOST to the IP address of the main redis server.
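As a rough illustration, the setting might look like the fragment below. The address is a placeholder taken from the range reserved for documentation, not a real server; replace it with the address of your own main redis server.

```python
# zhihu/settings.py (fragment)
# Placeholder address: use the IP of the machine running the main redis server.
REDIS_HOST = "192.0.2.10"
```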

Run scrapy crawl zhihu_people to start a crawler.

###How to solve captchas?

We provide the Monitor class to monitor crawlers, which includes solving captchas for them. To solve captchas for all crawlers that need one, use:

>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.solve_captchas()

###How to add accounts?

Each crawler needs to fetch an account from the account pool before it can start. To add an account to the account pool, use:

>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.add_account('username','password')
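The idea of an account pool can be sketched in a few lines. This is a hypothetical in-memory model, not the code behind Monitor: the class AccountPool and the method fetch_account are invented for illustration, and it assumes an account is taken out of the pool while a crawler is using it.

```python
from collections import deque


class AccountPool:
    """Hypothetical in-memory account pool; the real one lives in redis."""

    def __init__(self):
        self._accounts = deque()

    def add_account(self, username, password):
        """Make an account available to crawlers."""
        self._accounts.append((username, password))

    def fetch_account(self):
        """Take one account out of the pool; fail loudly if none is left."""
        if not self._accounts:
            raise LookupError("account pool is empty; add an account first")
        return self._accounts.popleft()


pool = AccountPool()
pool.add_account("username", "password")
account = pool.fetch_account()
```

Under this model a crawler that finds the pool empty cannot start, which is why accounts should be added before launching crawlers.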

###How to check stats?

>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.stats()

###License: GPL v3

GPL v3 details

Contributors

immzz
