
zhihu_crawler

Abstract

This is a simple web crawler for the search pages of ZHIHU.com. The project collects several kinds of data from ZHIHU search-result pages. Here are the keywords we search:

  • 面试(interview)
  • 实习(intern)
  • 找工作(job hunting)
  • 简历(CV)

You can visit ZHIHU_Search and enter the keywords to view the generated results page.
From that page, we collect the following pieces of information, which together form one tuple of the table:

  • search_terms: the keyword that was searched
  • search_rank: the rank of this tuple in the search results
  • question_url: the link to the question
  • question_title: the title of the question
  • question_follow_num: the number of followers of the question
  • question_view_num: the number of times the question has been viewed
  • question_top_answer_username: the username of the account whose answer ranks first among all answers
  • question_top_answer_id: the id of the account whose answer ranks first among all answers

To distinguish potentially identical tuples, we add a create_time column that records when each tuple was created.
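Put together, one tuple could be modeled in Python roughly as follows (a sketch only: the class name and types are assumptions, while the field names follow the list above):

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SearchResultRow:
    """One tuple of the result table, as described above (hypothetical model)."""
    search_terms: str                  # the keyword that was searched
    search_rank: int                   # rank of this tuple in the search results
    question_url: str                  # link to the question
    question_title: str                # title of the question
    question_follow_num: int           # number of followers of the question
    question_view_num: int             # number of views of the question
    question_top_answer_username: str  # username of the top-ranked answerer
    question_top_answer_id: str        # id of the top-ranked answerer
    create_time: datetime = field(default_factory=datetime.now)  # when the tuple was created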

How to start the crawler

python3 main.py 

There are several command-line parameters you may have to configure manually so that the service runs normally (see the example invocation after this list):

  • --db_user: the username you use to log in to the database
  • --db_passwd: the password of your database account
  • --db_name: the name of the database you want to connect to
  • --db_addr: the IP address of the database you want to connect to
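For example, a full invocation might look like this (all values are placeholders, not real credentials):

python3 main.py --db_user root --db_passwd your_password --db_name zhihu --db_addr 127.0.0.1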

External Dependencies

Developing Environment

I built and tested the project on Windows 10 Professional, with MySQL 8.0.18 and Anaconda (Python 3.7.4).

Target Environment

The project is designed to run on a Linux server, relying on Python (version higher than 3.6) and PostgreSQL or MongoDB.
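If you want to fail fast on older interpreters, a minimal guard such as the following (an illustrative addition, not part of the project) can be placed at the top of main.py:

import sys

# The project targets Python above 3.6; refuse to run on older interpreters.
if sys.version_info < (3, 6):
    raise RuntimeError("zhihu_crawler requires Python 3.6 or newer")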

Docker configuration

To build the crawler as a microservice, we provide a Dockerfile in the buildDocker folder. It is designed to work on CentOS 8, the latest release at the time of writing. You merely need to add all Python files into that folder and build the image from the Dockerfile, as shown below.
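A typical build and run might look like this (the image name zhihu_crawler is an assumption):

cp *.py buildDocker/
docker build -t zhihu_crawler ./buildDocker
docker run -it zhihu_crawler /bin/bash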

How to configure Chrome in the Docker image

Although the Dockerfile already configures the installation of Chrome without a GUI, several steps still have to be done manually to make the project work. Please follow them strictly.

  1. Run the image in Docker and enter bash.
  2. Find the path of Chrome and create a soft link for ease of use:
which google-chrome-stable
ln -s [path] /bin/chrome
  3. Solve the problem that the root user cannot run Chrome: edit the file '/opt/google/chrome/google-chrome' and change its last line to:
exec -a "$0" "$HERE/chrome" "$@" --no-sandbox $HOME
  4. Install ChromeDriver:
    1. Download the ChromeDriver build that matches the installed version of Chrome.
    2. Create a soft link and add the executable ('x') bit:
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
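With ChromeDriver on the PATH, launching headless Chrome from Python typically looks like this (a sketch assuming the project drives Chrome through Selenium; the exact options may differ):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")    # run without a GUI inside the container
options.add_argument("--no-sandbox")  # needed when Chrome runs as root (see step 3)
driver = webdriver.Chrome(options=options)
driver.get("https://www.zhihu.com")   # illustrative target page
print(driver.title)
driver.quit()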

Project Structure

|--main.py
|--spiders
  |--indexZhihu.py
  |--models.py
  |--multithread.py
  |--mysql_connect.py
  |--to_xlsx.py

Here is a description of each file:

  • main.py: the entry point of the whole project; it accepts command-line parameters and passes them to the function that connects to MySQL.
  • indexZhihu.py: invokes all the modules defined in spiders to generate the target info.
  • models.py: a collection of utility functions used by the other modules; for example, it provides functions to fetch a website, extract target URLs, normalize URLs to a standard format, etc.
  • multithread.py: defines a class that executes the crawler with multiple threads; the number of threads is configurable.
  • mysql_connect.py: connects to the MySQL database, providing functions that respectively create a connection, close the connection, and insert tuples (see the sketch after this list).
  • to_xlsx.py: collects all tuples from the table in the SQL database and formats the dataframe into an xlsx file.
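The interface of mysql_connect.py could look roughly like this (a hypothetical sketch, not the actual implementation; the pymysql driver and the table name zhihu_search are assumptions):

import pymysql  # assumption: any MySQL driver with the DB-API interface would do

def create_connection(db_user, db_passwd, db_name, db_addr):
    """Open a connection to the MySQL database using the command-line parameters."""
    return pymysql.connect(host=db_addr, user=db_user, password=db_passwd, database=db_name)

def insert_tuple(conn, row):
    """Insert one search-result tuple; 'row' holds the eight fields listed in the Abstract."""
    sql = (
        "INSERT INTO zhihu_search (search_terms, search_rank, question_url, "
        "question_title, question_follow_num, question_view_num, "
        "question_top_answer_username, question_top_answer_id, create_time) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, NOW())"
    )
    with conn.cursor() as cur:
        cur.execute(sql, row)
    conn.commit()

def close_connection(conn):
    """Close the database connection."""
    conn.close()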

We have also prepared createDatabase.sql so that you can get a clear understanding of our database design. You can run it on your own machine.
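For example, with the mysql command-line client available:

mysql -u root -p < createDatabase.sql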

Demonstration of the project's running result

Here is a screenshot of the table:
[demoGraph: screenshot of the result table]
