ghminer

A library and toolkit for mining software repositories (MSR) research.

Mining software repositories has been a popular research method for quite a long time. Although GitHub offers convenient public REST and GraphQL APIs, collecting a large-scale dataset with a long history of information such as repositories, authors, bots, issues, pull requests, and comments is still a non-trivial task. Three major challenges must be solved in order to retrieve large search results from GitHub:

  • 1000-limit issue: the GitHub API discards records beyond the first 1000 in the result set of any single query.
  • rate-limit issue: the GitHub API prevents authenticated personal accounts from making more than 5000 API calls per hour.
  • pagination: a single call returns at most 100 records, so the user has to issue multiple API calls to retrieve a complete result set.
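The pagination and 1000-record constraints can be sketched as a small generator. This is a minimal illustration, not ghminer's implementation: `iter_results`, `fetch_page`, and `fake_fetch` are hypothetical names, and the API call is replaced by an in-memory stand-in so the example runs without network access.

```python
from typing import Callable, Iterator, List

MAX_RESULTS = 1000  # records available for a single query
PER_PAGE = 100      # largest page size


def iter_results(fetch_page: Callable[[int, int], List[dict]],
                 per_page: int = PER_PAGE,
                 max_results: int = MAX_RESULTS) -> Iterator[dict]:
    """Yield records page by page until the results or the cap run out."""
    seen = 0
    page = 1
    while seen < max_results:
        items = fetch_page(page, per_page)
        for item in items:
            if seen >= max_results:
                return
            yield item
            seen += 1
        if len(items) < per_page:  # a short page means nothing is left
            return
        page += 1


# Hypothetical stand-in for a real API call: 250 fake records served in slices.
FAKE_DATA = [{"id": i} for i in range(250)]


def fake_fetch(page: int, per_page: int) -> List[dict]:
    start = (page - 1) * per_page
    return FAKE_DATA[start:start + per_page]


records = list(iter_results(fake_fetch))
print(len(records))
```

With the stand-in above, three calls are made (100, 100, and 50 records). A query matching more than 1000 records stops at the cap, which is why such a query has to be narrowed before it is paginated.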

When the client exceeds the rate limit, it is disconnected with HTTP status code 503. Without proper recovery handling, the data collection process is subject to frequent interruptions.
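One common way to recover from such interruptions is to retry with exponential backoff. The sketch below is an assumption about how this could be done, not a description of ghminer's internals. `call_with_retry` and `request` are hypothetical names, and the set of retryable statuses is also an assumption: 503 is the status mentioned above, while 403 and 429 are included because GitHub's documentation lists them for rate-limit responses.

```python
import time
from typing import Callable, Tuple

# Assumption: statuses treated as "rate limited, try again later".
RETRYABLE = {403, 429, 503}


def call_with_retry(request: Callable[[], Tuple[int, object]],
                    max_attempts: int = 5,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep):
    """Call request() until it returns a non-retryable status.

    Waits base_delay * 2**attempt seconds between attempts and raises
    RuntimeError once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        status, body = request()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Simulated responses: two rate-limit replies, then success.
responses = iter([(503, None), (503, None), (200, {"total_count": 1})])
delays = []  # record the waits instead of actually sleeping
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test. A real collector would leave the default `time.sleep` in place and might prefer to wait until the reset time reported by the API instead of using fixed delays.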

This library and its associated scripts are intended to help solve these three challenges so that you can focus on data mining rather than data collection.

Requirements

  • Python 3.7 or later

Features

  • Search GitHub repositories based on stars, forks, language, and topic
  • Search a large number of repositories by dividing the creation time range into small time windows
  • Support multiple topics with an OR relation
  • Build datasets in .csv and .parquet formats
  • Retrieve commits and issue comments
  • Golang miner with go.mod retrieval and parsing
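The time-window idea in the list above can be illustrated as follows. This is a minimal sketch under stated assumptions rather than ghminer's own code. `time_windows` and `search_queries` are hypothetical helpers, and the window size, language, and star threshold are illustrative values. The `language:`, `stars:`, and `created:` qualifiers are standard GitHub search syntax.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def time_windows(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Split [start, end] into consecutive inclusive windows of at most `days` days."""
    one_day = timedelta(days=1)
    cur = start
    while cur <= end:
        last = min(cur + timedelta(days=days) - one_day, end)
        yield cur, last
        cur = last + one_day


def search_queries(start: date, end: date, days: int,
                   language: str, min_stars: int) -> Iterator[str]:
    """Build one search query string per creation-time window."""
    for first, last in time_windows(start, end, days):
        yield (f"language:{language} stars:>={min_stars} "
               f"created:{first.isoformat()}..{last.isoformat()}")


queries = list(search_queries(date(2022, 1, 1), date(2022, 1, 31), 15, "java", 100))
for q in queries:
    print(q)
```

Each query covers a narrower slice of creation dates, so its result set is more likely to stay under the 1000-record cap. If a window still matches too many repositories, it can be split again with a smaller `days` value.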

Setup

$ python -m venv /path/to/venv
$ /path/to/venv/bin/python -m pip install ghminer

Usage

To identify repositories for your MSR research, refer to the script identify-repos.py. To retrieve commits, use the script retrieve-commits.py. To mine Golang projects, use the script golang-miner.

>>> from ghminer.retriever import collect_data
>>> collect_data(
        2022, 2023, None, True, 100, 15,
        "repo.d", "java", trace=True
    )

Dataset

The dataset containing the list of Golang repositories and their go.mod files can be retrieved from Kaggle.
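For readers who want to work with the go.mod contents, the following is a minimal sketch of extracting `require` directives. It is not the parser used by the Golang miner. `parse_requires` and `SAMPLE` are hypothetical, and the sketch handles only single-line and parenthesized `require` forms, ignoring `replace` and `exclude`.

```python
from typing import Dict


def parse_requires(text: str) -> Dict[str, str]:
    """Return {module_path: version} for each require directive in a go.mod."""
    requires: Dict[str, str] = {}
    in_block = False
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop comments such as "// indirect"
        if not line:
            continue
        if in_block:
            if line == ")":
                in_block = False
            else:
                parts = line.split()
                if len(parts) >= 2:
                    requires[parts[0]] = parts[1]
        elif line.replace(" ", "") == "require(":  # start of a require block
            in_block = True
        elif line.startswith("require "):  # single-line form
            parts = line.split()
            if len(parts) >= 3:
                requires[parts[1]] = parts[2]
    return requires


# Hypothetical go.mod content used only for illustration.
SAMPLE = """\
module example.com/demo

go 1.21

require github.com/pkg/errors v0.9.1

require (
    github.com/stretchr/testify v1.8.4
    golang.org/x/sys v0.15.0 // indirect
)
"""

deps = parse_requires(SAMPLE)
for module, version in deps.items():
    print(module, version)
```

The resulting dictionary can be turned into rows of a .csv or .parquet file, one row per repository and dependency.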

ghminer's People

Contributors

schnell18
