GitHub Stargazers Crawler

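This project uses the official GitHub API to crawl the IDs and email addresses of the stargazers of specified repos.
Anonymous calls to the official API are tightly rate-limited (60 requests per hour for unauthenticated users; see Rate Limiting). To keep the crawl running smoothly, the project requires you to obtain your own token. Once you have one, a small amount of configuration is enough to run the script.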
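To check how much quota a token has left before starting a crawl, GitHub's REST API offers a `/rate_limit` endpoint. The following is a minimal sketch and not this project's code: the helper names `build_rate_limit_request`, `core_quota`, and `fetch_core_quota` are hypothetical, and the sample payload only mimics the shape of the real response.

```python
import json
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_rate_limit_request(token=None):
    """Build a request for GitHub's rate-limit endpoint (hypothetical helper)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests get a higher hourly quota than anonymous ones.
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(RATE_LIMIT_URL, headers=headers)


def core_quota(payload):
    """Return (limit, remaining) for the core REST API from a /rate_limit payload."""
    core = payload["resources"]["core"]
    return core["limit"], core["remaining"]


def fetch_core_quota(token=None):
    """Query GitHub and return (limit, remaining). This performs a network call."""
    with urllib.request.urlopen(build_rate_limit_request(token)) as resp:
        return core_quota(json.load(resp))


# Offline demonstration with a payload shaped like the real response.
sample = {"resources": {"core": {"limit": 60, "remaining": 57, "reset": 1700000000}}}
print(core_quota(sample))  # (60, 57)
```

Calling `fetch_core_quota("Your token")` performs the actual request; the parsing step is kept separate so it can be exercised without network access.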

Prerequisites

  • Operating system: Linux, macOS, Windows
  • Python 3.7+

Installation

  1. Clone the repository

    git clone https://github.com/KPatr1ck/github_stargazers_crawler
  2. Install the dependencies

    pip install -r requirements.txt

Running the Script

  1. Log in to GitHub and obtain a token

  2. Put the token and the URLs of the repos to crawl in a *.json configuration file.

    {
        "token": "Your token",
        "repos": {
            "Repo1": "https://github.com/url_of_repo1",
            "Repo2": "https://github.com/url_of_repo2"
        }
    }

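    Notes:

    • Both "token" and "repos" are required keys.
    • "repos" must contain at least one entry.
    • The keys nested inside "repos" are arbitrary names of your choosing ("Repo1" and "Repo2" in the example).
  3. Run the script with the desired arguments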

    python main.py --repos ./repos.json --output_dir ./output --num_workers 64 --api_limit_threshold 200

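    Arguments:

    • repos: required; path to the configuration file described above.
    • output_dir: optional; directory where the result files are stored. Defaults to ./output.
    • num_workers: optional; number of crawler threads. Defaults to 100.
    • api_limit_threshold: optional; when the remaining request quota of the current token drops below this value, the worker threads sleep and resume crawling one hour later. Defaults to 200.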
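For illustration, the flags and configuration rules described above could be handled roughly as follows. This is a sketch under assumptions, not the contents of `main.py`: the function names `parse_args`, `validate_config`, and `load_config` are hypothetical, and only the flag names, defaults, and validation rules are taken from this README.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse flags named after the ones documented in this README."""
    parser = argparse.ArgumentParser(description="GitHub stargazers crawler (sketch)")
    parser.add_argument("--repos", required=True, help="path to the JSON config file")
    parser.add_argument("--output_dir", default="./output")
    parser.add_argument("--num_workers", type=int, default=100)
    parser.add_argument("--api_limit_threshold", type=int, default=200)
    return parser.parse_args(argv)


def validate_config(config):
    """Apply the documented rules and return (token, repos)."""
    token = config.get("token")
    repos = config.get("repos")
    if not token:
        raise ValueError('"token" is required')
    if not isinstance(repos, dict) or not repos:
        raise ValueError('"repos" must be an object with at least one entry')
    return token, repos


def load_config(path):
    """Read the JSON config file and validate it (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return validate_config(json.load(f))


# Only --repos is given, so the other flags fall back to their defaults.
args = parse_args(["--repos", "./repos.json"])
print(args.output_dir, args.num_workers, args.api_limit_threshold)  # ./output 100 200
```

Validating the file up front means a missing token or an empty "repos" object is reported before any worker threads start.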

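Results

After a run, the output directory contains the stargazer lists (*.txt) and the email information (*.csv).

[Note] Even with a token, the API is rate-limited per hour. When the remaining quota drops below the threshold, the worker threads sleep for one hour before resuming, so crawling a large number of stargazers can take a long time.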
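The pause behaviour described in the note can be sketched as a small check that each worker runs before issuing a request. This is an illustrative assumption about the logic, not the project's actual implementation; `wait_if_below_threshold` and its parameters are hypothetical names, and the `sleep` argument exists only so the example can run without really waiting.

```python
import time


def wait_if_below_threshold(remaining, threshold=200, pause_seconds=3600, sleep=time.sleep):
    """Sleep when the remaining quota is below the threshold.

    Returns True if the caller was paused, False otherwise.
    """
    if remaining < threshold:
        sleep(pause_seconds)
        return True
    return False


# Record the requested pauses instead of actually sleeping for an hour.
pauses = []
print(wait_if_below_threshold(150, sleep=pauses.append))   # True
print(wait_if_below_threshold(4800, sleep=pauses.append))  # False
print(pauses)                                              # [3600]
```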

TODO

  • Support switching between multiple tokens.
