Effect of Versatility on GitHub Team Productivity

Overview

This is the final project for MIE1512 Data Analytics.

The purpose of the project is to analyze, using Spark, how the diversity of individual members' interests affects the productivity of GitHub teams.

The project builds on the paper "Diversity of editors and teams versus quality of cooperative work: experiments on Wikipedia" [1].

The paper presents an empirical study of how diversity, both of individual editors and of whole teams, relates to article quality in an open-collaboration environment such as Wikipedia. In it, M. Sydow et al. propose an original measure that quantifies the diversity of an editor's interests. The interest profile of each editor is defined as the editor's interest distribution vector over the set of all categories, and the diversity of interests (equivalently, versatility) of the editor is defined as the entropy of that interest profile.
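The entropy-based definition above can be illustrated with a minimal Python sketch. It is not the code from the paper or from this project's notebook: the function names, the example category names, and the choice of a base-2 logarithm are assumptions made for illustration.

```python
import math


def interest_profile(category_counts):
    """Normalize raw per-category activity counts into a probability vector.

    `category_counts` maps a category name to a non-negative count
    (hypothetical input format, assumed for this sketch).
    """
    total = sum(category_counts.values())
    if total == 0:
        return {}
    return {c: n / total for c, n in category_counts.items() if n > 0}


def versatility(category_counts):
    """Shannon entropy of the interest profile, in bits (base 2 is an assumption)."""
    profile = interest_profile(category_counts)
    return sum(-p * math.log2(p) for p in profile.values())


# A member active in a single category has zero versatility.
print(versatility({"web": 12}))           # 0.0
# Activity split evenly over two categories gives one bit of entropy.
print(versatility({"web": 5, "ml": 5}))   # 1.0
```

Under this definition, versatility is highest when a member's activity is spread evenly over many categories and lowest when it is concentrated in one.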

Team diversity is a fundamental issue in social and organizational studies and has been researched broadly. Open Source Software (OSS) projects on GitHub, like Wikipedia, rely heavily on collaboration and naturally embrace diversity. It is therefore interesting to study whether teams whose members have diverse interests tend to be more productive in GitHub projects.

Dependency

  • Jupyter Notebook 4.4.0
  • Data collected and filtered using GHTorrent on Google BigQuery
  • Spark Python API (pyspark)

Content

  • Executable Jupyter file Link

    • projects.csv : the dataset selected is ght_2018_04_01 from ghtorrent-bq. The table contains 83624114 records of GitHub repositories, and the tables used in this project are projects, users, project_members, and project_commits. Because some GitHub repositories are not suitable for this analysis, the projects are filtered against a set of criteria. After filtering, 59244 projects remain and are saved as projects.csv.
    • project_proccessed.csv : Latent Dirichlet Allocation (LDA) is used to infer the domain of each project, generating 10 domains for GitHub repositories based on the repository name and description.
    • projects_vers.csv
    • users_vers.csv
    • mem_commits.csv
  • Final report documentation Link

  • PPT for presentation Link

  • Spark cheatsheet Link
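
The domain inference described for project_proccessed.csv ends with each project being associated with one of the 10 LDA domains. A minimal sketch of that last step follows, assuming each project already has a topic-weight vector of the kind an LDA model produces. The function name, the sample weights, and the rule of picking the highest-weight topic are assumptions for illustration and may differ from what the notebook does.

```python
def dominant_domain(topic_distribution):
    """Return the index of the topic with the largest weight.

    Ties resolve to the lowest index. Labeling a project by its
    highest-weight topic is an assumed rule for this sketch.
    """
    if not topic_distribution:
        raise ValueError("topic distribution is empty")
    return max(range(len(topic_distribution)),
               key=topic_distribution.__getitem__)


# Hypothetical topic weights over 10 domains for two projects.
project_topics = {
    "project_a": [0.02, 0.05, 0.61, 0.03, 0.04, 0.05, 0.05, 0.05, 0.05, 0.05],
    "project_b": [0.40, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
}

project_domain = {name: dominant_domain(weights)
                  for name, weights in project_topics.items()}
print(project_domain)  # {'project_a': 2, 'project_b': 0}
```

Domain labels of this kind could then serve as the categories over which a member's interest profile is counted.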
