Scraping more than 1M repositories from GitHub!
The dataset (TXT format) is located here:
https://github.com/philipperemy/Github-full-data-set/blob/master/data_1m/GITHUB.1M.txt
The fields recorded are:
- name
- clone_url
- created_at
- forks (FORKS)
- has_issues
- language (COMPUTER LANGUAGE)
- subscribers_count (WATCH)
- watchers_count (printed as STARTS in the output, i.e. the star count)
- stargazers_count
- size
Due to size limitations, I had to restrict the 1M dataset to the fields above. The 100k dataset includes every field (~260 MB for 100k objects), and you also get every field if you scrape the data yourself. More information below.
What can be done with this data:
- predict the number of stars/forks based on the source code.
- or maybe just based on the README.
- study the relations between all the variables.
- or just extract lots of source code and apply a language model to it:
- Example: how to get a lot of JavaScript source code:
- in the dataset, filter with $language equal to JavaScript
- then clone the repository somewhere: git clone $clone_url
- finally, list all the JS files: find $directory -type f -name "*.js"
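The JavaScript example above can be sketched in Python. This is only an illustration under assumptions: it supposes the dataset rows have already been loaded as dictionaries with `language` and `clone_url` keys (the loading step depends on the file layout and is not shown), and the function names and the `--depth 1` shallow clone are my own choices, not part of this repository.

```python
import subprocess
from pathlib import Path


def javascript_clone_urls(records):
    """Return the clone_url of every record whose language is JavaScript.

    `records` is assumed to be an iterable of dicts with the keys
    "language" and "clone_url" (hypothetical in-memory representation).
    """
    return [r["clone_url"] for r in records if r.get("language") == "JavaScript"]


def clone_repository(clone_url, target_dir):
    """Clone one repository with the git command-line client.

    --depth 1 fetches only the latest commit, which is enough to read
    the source files and keeps the download small.
    """
    subprocess.run(
        ["git", "clone", "--depth", "1", clone_url, str(target_dir)],
        check=True,
    )


def list_js_files(directory):
    """Python counterpart of: find $directory -type f -name "*.js"."""
    return sorted(p for p in Path(directory).rglob("*.js") if p.is_file())


# Possible usage (not executed here because it needs network access):
#   for i, url in enumerate(javascript_clone_urls(records)):
#       clone_repository(url, Path("clones") / str(i))
#   js_files = list_js_files("clones")
```

Keeping the filter, clone and listing steps in separate functions makes it easy to swap JavaScript for another language by changing the comparison value and the file pattern.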
Replace python3 and pip3 with python and pip if you use Python 2.x.
git clone https://github.com/philipperemy/Github-full-data-set.git
cd Github-full-data-set/
sudo pip3 install -r requirements.txt
cat data_100k/x* > data_100k/GITHUB.tar.gz # reassemble the parts, because GitHub does not allow files bigger than 100 MB.
md5sum data_100k/GITHUB.tar.gz # 5886b24033991283a4dbfa6b328be011 data_100k/GITHUB.tar.gz
tar xvzf data_100k/GITHUB.tar.gz # goes to GITHUB/
python3 read.py GITHUB/
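If `cat` or `md5sum` are not available on your system, the reassembly and checksum steps above can be approximated in Python. This is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes the parts sort correctly by file name, which is what the shell glob `x*` relies on as well.

```python
import hashlib
import shutil
from pathlib import Path

# Checksum quoted in the md5sum comment above.
EXPECTED_MD5 = "5886b24033991283a4dbfa6b328be011"


def reassemble(parts_dir, output_path, pattern="x*"):
    """Concatenate the split parts in name order, like `cat data_100k/x*`."""
    parts = sorted(p for p in Path(parts_dir).glob(pattern) if p.is_file())
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Possible usage from the repository root:
#   reassemble("data_100k", "data_100k/GITHUB.tar.gz")
#   assert md5_of("data_100k/GITHUB.tar.gz") == EXPECTED_MD5
```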
python3 main_run_scraper.py <GITHUB_USERNAME> <GITHUB_PASSWORD> GITHUB/
OUTPUT_DIR = GITHUB/
Search files here: GITHUB/**.pkl
--------------------------------------------------------------------------------
ID = 0
URL = https://github.com/10gen/external
NAME = external
WATCH = 3
STARTS = 2
LANGUAGE = JavaScript
FORK = 1
--------------------------------------------------------------------------------
ID = 1
URL = https://github.com/4l3x2k/8086macs
NAME = 8086macs
WATCH = 2
STARTS = 0
LANGUAGE = C
FORK = 0
--------------------------------------------------------------------------------
ID = 2
URL = https://github.com/A1kmm/cellml_meta_1_1
NAME = cellml_meta_1_1
WATCH = 2
STARTS = 3
LANGUAGE = None
FORK = 0
--------------------------------------------------------------------------------
ID = 3
URL = https://github.com/aaronchi/jrails
NAME = jrails
WATCH = 3
STARTS = 721
LANGUAGE = Ruby
FORK = 82
NB: When you scrape the data yourself, you get many more fields than those shown above. Please refer to the links below for complete documentation of all the available fields.
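The printed records shown above follow a simple `KEY = value` layout separated by dashed lines, so they can be turned back into dictionaries if you captured the output as text. The sketch below assumes exactly that layout; `parse_records` is a hypothetical helper, and all values are kept as strings (including the literal `None`).

```python
def parse_records(text):
    """Parse the printed listing into a list of dicts, one per repository."""
    records = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line and set(line) == {"-"}:
            # A dashed line closes the block collected so far.
            if "ID" in current:
                records.append(current)
            current = {}
        elif " = " in line:
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip()
    # The last record is not followed by a dashed line.
    if "ID" in current:
        records.append(current)
    return records
```

Blocks without an `ID` key, such as the leading `OUTPUT_DIR = GITHUB/` line, are skipped on purpose.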