GitHub repository dataset

Scraping more than 1M repositories from GitHub!

The 1M dataset

The dataset (TXT format) is located here:

https://github.com/philipperemy/Github-full-data-set/blob/master/data_1m/GITHUB.1M.txt

The fields recorded are:

  • name
  • clone_url
  • created_at
  • forks (FORKS)
  • has_issues
  • language (COMPUTER LANGUAGE)
  • subscribers_count (WATCH)
  • watchers_count (STARS, printed as STARTS in the sample output further down)
  • stargazers_count
  • size

Due to size limitations, the 1M dataset keeps only the tags listed above. The 100k dataset includes all the tags (~260 MB for 100k objects). You also get all the tags if you scrape the data yourself. More information below.
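As a minimal sketch of what "narrowing down" a record means, the snippet below projects a full record onto the ten fields listed above. The field names come from that list; the function name `narrow`, the sample record, and its `.git` clone URL are hypothetical and only there for illustration. This is not the code used to build the 1M dataset.

```python
# Field names copied from the list above; everything else is illustrative.
KEPT_FIELDS = (
    "name", "clone_url", "created_at", "forks", "has_issues",
    "language", "subscribers_count", "watchers_count",
    "stargazers_count", "size",
)


def narrow(record):
    """Return a copy of `record` restricted to KEPT_FIELDS.

    Fields missing from the input are set to None so every output
    record has the same ten keys.
    """
    return {field: record.get(field) for field in KEPT_FIELDS}


# Hypothetical full record: "description" is an extra tag that gets dropped.
full_record = {
    "name": "jrails",
    "clone_url": "https://github.com/aaronchi/jrails.git",
    "language": "Ruby",
    "forks": 82,
    "description": "dropped by narrow()",
}

slim = narrow(full_record)
print(sorted(slim))
```

Keeping a fixed key set makes the narrowed records easy to write out as rows, at the cost of losing every tag not in the list.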

Statistics/Machine Learning ideas

  • predict the number of stars/forks from the source code.
  • or maybe just from the README.
  • explore the relations between all the variables.
  • or just extract lots of source code and apply a language model to it:
    • Example: how to get a lot of JavaScript source code:
    • in the dataset, filter on $language equal to JavaScript
    • then clone the repository somewhere: git clone $clone_url
    • finally, list all the JS files: find $directory -type f -name "*.js"
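The JavaScript example above can be sketched in Python roughly as follows. It assumes the records have already been loaded as dictionaries with `language` and `clone_url` keys (the loading step is not shown, since it depends on the file layout). The helper names, the sample records with their `.git` clone URLs, and the `--depth 1` shallow clone are my own assumptions, not part of the original project.

```python
import subprocess
from pathlib import Path


def filter_by_language(records, language):
    """Keep only the records whose 'language' field equals `language`."""
    return [r for r in records if r.get("language") == language]


def clone(clone_url, dest_root):
    """Clone `clone_url` into dest_root/<repo name> and return that path.

    Requires git on the PATH. The shallow clone (--depth 1) is an
    assumption made here to save bandwidth; drop it for full history.
    """
    name = clone_url.rstrip("/").split("/")[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    target = Path(dest_root) / name
    subprocess.run(
        ["git", "clone", "--depth", "1", clone_url, str(target)],
        check=True,
    )
    return target


def list_files(directory, suffix=".js"):
    """Recursively list regular files ending in `suffix`, sorted by path."""
    return sorted(p for p in Path(directory).rglob("*" + suffix) if p.is_file())


# Hypothetical in-memory records; only 'language' and 'clone_url' are used.
records = [
    {"name": "external", "language": "JavaScript",
     "clone_url": "https://github.com/10gen/external.git"},
    {"name": "8086macs", "language": "C",
     "clone_url": "https://github.com/4l3x2k/8086macs.git"},
]

js_repos = filter_by_language(records, "JavaScript")
print([r["name"] for r in js_repos])

# Network step, left commented out so the sketch runs offline:
# for repo in js_repos:
#     checkout = clone(repo["clone_url"], "clones")
#     print(list_files(checkout))
```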

What if I want to scrape my own data?

Let's get started!

Replace python3 and pip3 with python and pip if you use Python 2.x.

git clone https://github.com/philipperemy/Github-full-data-set.git
cd Github-full-data-set/
sudo pip3 install -r requirements.txt

Read from pre-existing data (100k dataset)

cat data_100k/x* > data_100k/GITHUB.tar.gz # reassemble the split parts; GitHub does not allow files bigger than 100 MB.
md5sum data_100k/GITHUB.tar.gz # 5886b24033991283a4dbfa6b328be011  data_100k/GITHUB.tar.gz
tar xvzf data_100k/GITHUB.tar.gz # goes to GITHUB/
python3 read.py GITHUB/
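If cat and md5sum are not available (for example on Windows), the reassembly and checksum steps can be approximated in Python. This is a sketch under assumptions: the function name `reassemble` is hypothetical, and it presumes the parts sort correctly by file name, as the shell glob x* does. The expected checksum is the one quoted in the commands above.

```python
import hashlib
from pathlib import Path


def reassemble(parts_dir, output_path, pattern="x*"):
    """Concatenate the split parts in name order and return the MD5 hex digest.

    The parts are streamed in 1 MiB chunks, so the whole archive never
    has to fit in memory.
    """
    parts = sorted(p for p in Path(parts_dir).glob(pattern) if p.is_file())
    digest = hashlib.md5()
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    digest.update(chunk)
    return digest.hexdigest()


# Checksum quoted in the md5sum comment above.
EXPECTED_MD5 = "5886b24033991283a4dbfa6b328be011"

# Usage, from the repository root (commented out: needs the data files):
# md5 = reassemble("data_100k", "data_100k/GITHUB.tar.gz")
# print("checksum ok" if md5 == EXPECTED_MD5 else "checksum mismatch: " + md5)
```

The resulting archive can then be extracted with the tar command above or with Python's tarfile module.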

Generate your own data

python3 main_run_scraper.py <GITHUB_USERNAME> <GITHUB_PASSWORD> GITHUB/

Example of data from the 100k dataset

OUTPUT_DIR = GITHUB/
Search files here: GITHUB/**.pkl
--------------------------------------------------------------------------------
ID         =  0
URL        =  https://github.com/10gen/external
NAME       =  external
WATCH      =  3
STARTS     =  2
LANGUAGE   =  JavaScript
FORK       =  1
--------------------------------------------------------------------------------
ID         =  1
URL        =  https://github.com/4l3x2k/8086macs
NAME       =  8086macs
WATCH      =  2
STARTS     =  0
LANGUAGE   =  C
FORK       =  0
--------------------------------------------------------------------------------
ID         =  2
URL        =  https://github.com/A1kmm/cellml_meta_1_1
NAME       =  cellml_meta_1_1
WATCH      =  2
STARTS     =  3
LANGUAGE   =  None
FORK       =  0
--------------------------------------------------------------------------------
ID         =  3
URL        =  https://github.com/aaronchi/jrails
NAME       =  jrails
WATCH      =  3
STARTS     =  721
LANGUAGE   =  Ruby
FORK       =  82

NB: When you scrape the data yourself, you get many more tags than the ones shown above. Please refer to the links below for complete documentation of all the available tags.
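To work with the printed records programmatically, one option is to parse the text output back into dictionaries. The sketch below assumes the layout shown in the example above: dashed separator lines between records and one KEY = value pair per line. The function name `parse_records` and the choice of which keys to convert to integers are assumptions of mine, and the keys (including STARTS) are kept exactly as printed.

```python
# Keys whose values are integers in the sample output above.
INT_KEYS = {"ID", "WATCH", "STARTS", "FORK"}


def parse_records(text):
    """Parse 'KEY = value' blocks separated by dashed lines into dicts.

    Lines before the first dashed separator (the header) are ignored.
    The literal value 'None' is converted to Python's None.
    """
    records = []
    current = None  # stays None until the first separator is seen
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:
            if current:
                records.append(current)
            current = {}
        elif current is not None and "=" in stripped:
            key, _, value = stripped.partition("=")
            key, value = key.strip(), value.strip()
            if value == "None":
                current[key] = None
            elif key in INT_KEYS:
                current[key] = int(value)
            else:
                current[key] = value
    if current:
        records.append(current)
    return records


# Two records copied from the example output above.
sample = """\
OUTPUT_DIR = GITHUB/
Search files here: GITHUB/**.pkl
----------------------------------------
ID         =  0
URL        =  https://github.com/10gen/external
NAME       =  external
WATCH      =  3
STARTS     =  2
LANGUAGE   =  JavaScript
FORK       =  1
----------------------------------------
ID         =  2
URL        =  https://github.com/A1kmm/cellml_meta_1_1
NAME       =  cellml_meta_1_1
WATCH      =  2
STARTS     =  3
LANGUAGE   =  None
FORK       =  0
"""

parsed = parse_records(sample)
print(len(parsed), parsed[0]["NAME"])
```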
