RStudio-GitHub-Analysis

Contributors: Juno Chen, Ian Flores, Rayce Rossum, Richie Zitomer

Project Mentor: Dr. Tiffany Timbers

Project Partner: Dr. Greg Wilson

Overview

This project aims to understand how people are currently using GitHub, with the eventual goal of building an easy-to-use alternative to Git.

This project can cluster similar GitHub projects and pick out their most commonly occurring subgraphs.

Motivation behind this project: http://third-bit.com/2017/09/30/git-graphs-and-engineering.html

Useful documents

Installation instructions

First, obtain the credentials file needed to pull the GitHub Torrent data from Google Cloud (required to regenerate the images for our analysis):

  • Follow the instructions under 'Set up a service account' to create and download a credentials file: https://cloud.google.com/video-intelligence/docs/common/auth
  • Change the name of the file to credentials_file.json and put it in the root directory of the project (a sample file with the name credentials_file_EXAMPLE.json is included as a reference).
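Before starting a long download, it can help to confirm that the credentials file is in place and looks like a service-account key. The following is a minimal sketch, not part of the project: the function name check_credentials is hypothetical, and the key names checked are the fields normally present in a Google Cloud service-account JSON key.

```python
import json
from pathlib import Path

# Fields normally present in a Google Cloud service-account key file.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def check_credentials(path="credentials_file.json"):
    """Return a list of problems found with the credentials file (empty if none)."""
    file = Path(path)
    if not file.is_file():
        return [f"{path} not found; place it in the project root"]
    try:
        data = json.loads(file.read_text())
    except json.JSONDecodeError as err:
        return [f"{path} is not valid JSON: {err}"]
    if not isinstance(data, dict):
        return [f"{path} should contain a JSON object"]

    problems = []
    missing = sorted(REQUIRED_KEYS - data.keys())
    if missing:
        problems.append(f"missing keys: {', '.join(missing)}")
    if data.get("type") != "service_account":
        problems.append("'type' should be 'service_account'")
    return problems


if __name__ == "__main__":
    # Hypothetical usage: run from the root directory of the project.
    for problem in check_credentials():
        print(problem)
```

An empty result only means the file has the expected shape; it does not prove that the account has access to the data.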

Usage

Run the following commands to reproduce this analysis:

snakemake get_ght_data # Downloads GH Torrent data from figshare. Be aware that the file is quite large, and downloading can take 1-2 hours.

snakemake run_analysis # Runs our pipeline: generates embeddings, clusters, t-SNE graph, motif report, etc.

snakemake generate_images # Generate images of our most important findings.

To change parameters from the command line, add --config param=value after your snakemake call. For a full list of configurable parameters, see the config.json file in the root directory of this project. For example, to run the analysis with 5 workers instead of the default of 1, run:

snakemake run_analysis --config n_workers=5
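Values passed with --config take precedence over those loaded from the config file. The sketch below is a simplified model of that merge, not Snakemake's actual parser: the helper names parse_overrides and effective_config are hypothetical, and the defaults dictionary is an assumed excerpt of config.json based on the defaults listed in this README.

```python
def parse_overrides(pairs):
    """Turn ["n_workers=5", ...] into a dict, converting plain integers."""
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected param=value, got {pair!r}")
        try:
            overrides[key] = int(value)
        except ValueError:
            # Not an integer: keep the raw string (e.g. a path).
            overrides[key] = value
    return overrides


def effective_config(defaults, pairs):
    """Command-line values win over the defaults from the config file."""
    merged = dict(defaults)
    merged.update(parse_overrides(pairs))
    return merged


if __name__ == "__main__":
    # Assumed excerpt of config.json, using the defaults listed in this README.
    defaults = {"n_workers": 1, "n_projects": 1000, "results_path": "./results/"}
    print(effective_config(defaults, ["n_workers=5"]))
```

Parameters that are not overridden keep their default values, so only the values being changed need to be listed.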

Config Parameters

| Short Name | Long Name | Description | Default | Type |
|---|---|---|---|---|
| -rp | --results_path | The folder to output results of the analysis, e.g. embeddings and plots. | ./results/ | String |
| -nw | --n_workers | The number of workers to use when running the analysis. | 1 | int |
| -dp | --data_path | The path to the commits.feather file, e.g. /home/user/RStudio-Data-Repository/clean_data/commits_by_org.feather | ./data/commits_by_org.feather | String |
| -np | --n_projects | The number of projects to sample from the dataset. | 1000 | int |
| -mc | --min_commits | The minimum number of commits for a project to be included in the sample. | None | none_or_int |
| -mcount | --min_count | The min_count parameter for the graph2vec model. | 5 | int |
| -nps | --n_personas | The number of personas to extract from each cluster. | 5 | int |
| -nn | --n_neurons | The number of neurons to use for Graph2Vec (project level). | 128 | int |
| -ni | --n_iter | The number of iterations to run the WeisfeilerLehmanMachine. | 10 | int |
| -rs | --random_state | The random state used to initialize all random states. | 1 | int |
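The short and long names and the none_or_int type listed above suggest an argparse-style interface. The following is a minimal sketch of how such a parser could be declared for three of the parameters; it is an assumption about the design, not the project's actual code, and build_parser is a hypothetical name.

```python
import argparse


def none_or_int(value):
    """Parse the string 'None' as None and anything else as an integer."""
    if value == "None":
        return None
    return int(value)


def build_parser():
    """Declare a subset of the parameters listed in this README."""
    parser = argparse.ArgumentParser(description="Sketch of the analysis options.")
    parser.add_argument("-nw", "--n_workers", type=int, default=1,
                        help="Number of workers to use when running the analysis.")
    parser.add_argument("-np", "--n_projects", type=int, default=1000,
                        help="Number of projects to sample from the dataset.")
    parser.add_argument("-mc", "--min_commits", type=none_or_int, default=None,
                        help="Minimum number of commits for a project to be sampled.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.n_workers, args.n_projects, args.min_commits)
```

A custom type function like this lets a single option accept either an integer threshold or the literal None to disable the filter.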

Data Repositories

RStudio-Data-Repository

Figshare Upload

Docker

To run the analysis in Docker, build the image and start a container:

  1. docker build --tag rstudio:1.0.0 .

  2. docker run -it -v $(pwd):/rstudio_analysis rstudio:1.0.0 /bin/bash

Once inside the container, run:

  1. cd rstudio_analysis

  2. snakemake get_ght_data

  3. snakemake run_analysis

  4. snakemake generate_images

Software and Dependencies

  • MulticoreTSNE==0.1
  • pandas-gbq==0.10.0
  • panel==0.6.0
  • networkx==2.3
  • joblib==0.12.3
  • gensim==3.7.1
  • tqdm==4.26.0
  • pyviz-comms==0.7.2
  • snakemake==5.5.2
