oss-contributors's Introduction

Tracking Open Source Contributors

Build a(n improved) ranking of companies-as-contributors-to-public-GitHub (based on this blog post).

Too Long; Didn't Read

Pretty graphs over here.

Why?

The user-to-company association in the ranking blog post that inspired us is not ideal: it uses the email associated with a git config, and if the domain of the email is NOT that of a public mail provider (gmail, yahoo, etc), it assumes it's a company. That's not a great way of going about it because not many people use their company's e-mail in the git config they use with their public GitHub.com account.
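To make the weakness concrete, here is a rough sketch of the email-domain heuristic described above. It is not code from the blog post or from this repo: the function name and the provider list are assumptions made up for illustration.

```javascript
// Hypothetical sketch of the email-domain heuristic. The provider list below
// is an assumption; a real list would be much longer.
const PUBLIC_MAIL_DOMAINS = new Set(['gmail.com', 'yahoo.com']);

// Returns the domain treated as "the company", or null if the email uses a
// public mail provider (or cannot be parsed).
function companyDomainFromEmail(email) {
  const at = email.lastIndexOf('@');
  if (at < 0) return null;
  const domain = email.slice(at + 1).toLowerCase();
  if (!domain || PUBLIC_MAIL_DOMAINS.has(domain)) return null;
  return domain;
}
```

Anyone committing with a personal address comes back as null here, which is exactly the gap this project tries to close.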

To make that association better, this project cross-references GitHub.com activity, which is tracked via githubarchive.org data (and is freely available as a dataset in Google BigQuery), with GitHub.com user profiles. We pull the company field from users' profiles and store those in a periodically-updated (currently monthly) database that we then copy over into BigQuery.

Features

  • Leverages githubarchive.org's freely available dataset on Google BigQuery to track public user activity on GitHub.
  • A GitHub.com REST API crawler that pulls users' company associations (based on their public profiles), which we then store in a database (and periodically update).
  • Tracks and visualizes the activity of GitHub contributors from tech companies over time in a spreadsheet.

Implementation

We have a BigQuery project with relevant supporting tables and queries. If you'd like access, contact @filmaj (via an issue in this repo or on twitter). This project contains:

  1. A database table tracking user-company associations (currently done in an Adobe IT managed MySQL DB). Fields include GitHub username, company field, fingerprint (ETag value as reported from GitHub, as a cache-buster). We synchronize the MySQL DB with BigQuery every now and then using a command this program provides.
  2. Another table tracks GitHub usernames active over a certain time period.
  3. For each active user identified in (2), we pound the GitHub REST API to pull user profile info, and drop the company field from that info into the DB table described in (1).
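The ETag fingerprint in (1) lends itself to conditional requests: send the stored ETag in an If-None-Match header, and GitHub answers 304 Not Modified when the profile has not changed. The sketch below shows one way that could look. The helper names and the row shape are hypothetical and are not taken from this repo's code.

```javascript
// Hypothetical helpers; names and the { user, company, fingerprint } row
// shape are illustrative assumptions.

// Build request options for a user's profile, sending the stored ETag if any.
function buildProfileRequest(username, etag, token) {
  const headers = {
    'User-Agent': 'oss-contributors-sketch', // GitHub requires a User-Agent
    Authorization: `token ${token}`
  };
  if (etag) headers['If-None-Match'] = etag;
  return {
    url: `https://api.github.com/users/${encodeURIComponent(username)}`,
    headers
  };
}

// Decide what the DB row should look like given the stored row and a response.
function applyProfileResponse(row, response) {
  if (response.statusCode === 304) return row; // unchanged since last crawl
  if (response.statusCode === 200) {
    return {
      user: row.user,
      company: response.body.company || null,
      fingerprint: response.headers.etag
    };
  }
  throw new Error(`Unexpected status ${response.statusCode} for ${row.user}`);
}
```

Keeping these two steps free of network code makes them easy to exercise with canned responses.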

How Are Companies Tracked?

Check out the src/util/companies.js file. How it works:

  1. There is a "catch-all" regular expression (🤡) that tries to match on known tech company names.
  2. If a match is detected, then we try to map that back to a nicer label for a company name. Note that multiple expressions from the company catch-all may map to a single company (e.g. AWS, AMZN and Amazon all map back to Amazon).
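As a minimal sketch of that two-step match (not the actual contents of src/util/companies.js; the patterns, labels and function name below are assumptions chosen to mirror the Amazon example):

```javascript
// Illustrative patterns only; the real catch-all covers many more companies.
const CATCH_ALL = /\b(aws|amzn|amazon|adobe|magento)\b/i;

// Each entry maps a group of expressions back to one nicer label.
const LABELS = [
  [/\b(aws|amzn|amazon)\b/i, 'Amazon'],
  [/\b(adobe|magento)\b/i, 'Adobe']
];

// Returns a company label for a profile's company field, or null if the
// field is empty or matches no known company.
function companyLabel(companyField) {
  if (!companyField || !CATCH_ALL.test(companyField)) return null;
  const hit = LABELS.find(([pattern]) => pattern.test(companyField));
  return hit ? hit[1] : null;
}
```

With these patterns, companyLabel('AWS') and companyLabel('@amzn') both come back as 'Amazon', while an unrecognized value such as 'Freelance' yields null.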

TODO

  1. Describe how to use bigquery in conjunction with this repo.
  2. Real-time visualization of the data.
  3. Tests.

Requirements

  • Node.js 9+
  • a BigQuery account, plus a bigquery.json file in the root of the repo containing the credentials for access to Google Cloud BigQuery. More info on how to set this file up is available on BigQuery docs.
  • an oauth.token file in the root of the repo, containing GitHub.com personal access tokens, one per line, which we will use to get data from api.github.com. In my version of this file, I have several tokens (thanks to all my nice friends who graciously granted me one) as there is a maximum of 5,000 calls per hour to the GitHub REST API.
  • a MySQL database to store user-company associations. Currently using an Adobe-IT-managed instance: hostname leopardprdd, database name, table name and username are all GHUSERCO, running on port 3323. @filmaj has the password. The schema for this table is under the usercompany.sql file.
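For the oauth.token file, a one-token-per-line format could be read and cycled through roughly as follows. This is a sketch under assumptions: the function names are made up, and plain round-robin is just one possible strategy, not necessarily what this program does.

```javascript
// Hypothetical helpers for a one-token-per-line file.

// Split file contents into tokens, ignoring blank lines and stray whitespace.
function parseTokens(fileContents) {
  return fileContents
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Returns a function that hands out tokens in round-robin order.
function createRotation(tokens) {
  if (tokens.length === 0) throw new Error('No tokens available');
  let i = 0;
  return () => tokens[i++ % tokens.length];
}

// Usage (not run here, since it needs the file to exist):
//   const fs = require('fs');
//   const nextToken = createRotation(parseTokens(fs.readFileSync('oauth.token', 'utf8')));
```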

Doing The Thing

$ npm install
$ npm link

At this point you should be able to run the CLI and provide it subcommands:

Updating MySQL DB of User-Company Affiliations

This command will pull the rows from a bigquery table containing github.com usernames, pull user profile information for each user from the GitHub.com REST API and store the result of the company field (and the ETag) in a MySQL DB table.

$ node bin/oss.js update-db <bigquery-table-of-user-activity>

Running this command and pointing it to a bigquery table containing ~1.5 million github.com usernames, on last run (Feb 2018), took about 6 days.
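For a feel of why this takes days, here is a back-of-the-envelope estimate based on the 5,000 calls per hour per token limit. It assumes exactly one API call per user and that the rate limit is the only bottleneck, so treat it as a lower bound rather than a prediction; the function name is made up for illustration.

```javascript
// Rough lower bound on crawl time, assuming one call per user and that the
// per-token hourly rate limit is the only constraint.
function estimateCrawlDays(userCount, tokenCount, callsPerHourPerToken = 5000) {
  const hours = userCount / (tokenCount * callsPerHourPerToken);
  return hours / 24;
}

console.log(estimateCrawlDays(1500000, 1)); // 12.5
console.log(estimateCrawlDays(1500000, 2)); // 6.25
```

Each extra token cuts the bound proportionally, which is why having several tokens in oauth.token helps.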

Uploading Results Back to BigQuery

This command will push the MySQL DB up to BigQuery. It will delete the table you specify before pushing up the results.

$ node bin/oss.js db-to-bigquery <bigquery-table-of-user-company-affiliations>

On last run (Feb 2018), this command took a few minutes to complete.

Putting It All Together

If you're still with me here: wow, thanks for sticking it out. How all of this fits together:

  1. Run the incremental user activity query on BigQuery, and store the result in a new table. I usually run this on a monthly basis, but you are free to use whatever time interval you wish.
  2. Run this program's update-db command, specifying the bigquery table name you created in (1), to get the latest company affiliations for the users identified in (1) stored in your MySQL DB. This usually takes days. You have been warned.
  3. Run this program's db-to-bigquery command to send these affiliations up to bigquery. Note that the table you specify to store these affiliations in, if it already exists, will be deleted. This should only take a few minutes.
  4. Run the contributor-count, repo-count and stars-accrued query on BigQuery, and store the result in a new table. This query will look at all github activity over the time period you specify (top of the query) and correlate it with the user-company affiliations table we created in (3). Make sure you use the correct table name for the user-company affiliations in the query (search for JOIN). BigQuery is awesome so this should never take more than a minute, though do keep an eye on your bill as, well, money goes fast ;)
  5. Bask in sweet, sweet data.

Contributing

Firstly, check out our contribution guidelines. Secondly, there are probably way better ways of doing this! For example, I've noticed that the company field info is somewhat available directly in BigQuery, so probably the whole "use a MySQL DB" thing is dumb. I'm grateful for any help 🙏.

oss-contributors's People

Contributors

filmaj, dependabot[bot], pabelanger


oss-contributors's Issues

munge together acquired companies or no?

We do so right now with Magento + Adobe (we count anyone who has "Magento" in their company profile name as an Adobe employee). There are open questions as to whether or not to do this for IBM+RedHat and MSFT+GitHub.

Doing so at data-scraping-time destroys valuable info: how much of a company's activity is contributed by an acquired company? It would be better to be able to distinguish between the two.

One option: these 'mungings' could be applied at the BigQuery level. That way, at BigQuery-time, we can choose to run analyses that delineate (or not) acquired companies.
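The idea would be to keep the raw company label in the stored data and apply the roll-up only when analyzing. The real thing would live in a BigQuery query or view; the sketch below just illustrates the logic in JavaScript, with a made-up function name and row shape. Only the Magento to Adobe mapping comes from this issue.

```javascript
// Acquired company -> parent. Only Magento/Adobe is confirmed in this issue;
// anything else added here would be a judgment call.
const ACQUISITIONS = new Map([['Magento', 'Adobe']]);

// Sum commits per company, optionally folding acquired companies into parents.
// rows: [{ company, commits }] (hypothetical shape).
function countByCompany(rows, { mergeAcquired }) {
  const counts = {};
  for (const { company, commits } of rows) {
    const label = mergeAcquired && ACQUISITIONS.has(company)
      ? ACQUISITIONS.get(company)
      : company;
    counts[label] = (counts[label] || 0) + commits;
  }
  return counts;
}
```

Running the same rows with mergeAcquired set to true and then false gives both views without re-scraping anything.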

compile 2019-02 numbers

  • create the users_pushes_2019_01 table
  • run the update-db command (in progress)
  • compile results into the spreadsheet

document bigquery views

Document what each of the views in the BigQuery dataset is for, how they are organized and how they map back to the usage instructions.

Add code to pull Magento employees from the Magento org's all-employees team

This would require using a github access token that has privileges to the magento github org (like my personal access token), so probably want to put this behind a flag for the tool.

But this would be a much more accurate way of identifying Magento employees instead of using regex on the Company field of GitHub.com user profiles / pounding the REST API. Would be much faster too.

When tackling this, should also see how the Adobe IT DB of employee github.com usernames is coming along, and if it's grown to include some decent % of Adobe engineering numbers, maybe we can use that instead too.

if oauth/personal access token is invalid, this should not blow up the program

During an update-db run, I stumbled upon the following crash:

2% complete
Retrieving GitHub token rate limits...
...complete.
Error retrieving rate limit { [Error: {"message":"Bad credentials","documentation_url":"https://developer.github.com/v3"}]
  message:
   '{"message":"Bad credentials","documentation_url":"https://developer.github.com/v3"}',
  code: 401,
  code: 401,

.. which stopped the program from running. Should make it more resilient.
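One possible shape for the fix: check each token up front, drop the ones that come back with a 401, and carry on with the rest. This is a sketch only. The getRateLimit function is injected and hypothetical; it is assumed to reject with an error carrying code 401 for bad credentials, as in the log above.

```javascript
// Hypothetical: getRateLimit(token) resolves to { remaining } or rejects with
// an error whose code is 401 when the token is invalid.
async function filterUsableTokens(tokens, getRateLimit) {
  const usable = [];
  for (const token of tokens) {
    try {
      const limit = await getRateLimit(token);
      usable.push({ token, remaining: limit.remaining });
    } catch (err) {
      if (err.code === 401) {
        // Warn and keep going instead of crashing the whole run.
        console.warn(`Skipping token ending in ${token.slice(-4)}: bad credentials`);
        continue;
      }
      throw err; // anything else is still unexpected
    }
  }
  return usable;
}
```

If the returned list is empty, the caller can still stop with a clear message rather than a stack trace.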

open source is more than just commits

the main 'metrics' we're counting now are basically commits and stars accrued. what about also adding additional metrics, like issues and PRs filed, PRs reviewed, emoji reactions, PR/issue comments, the list goes on...
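A rough sketch of how extra metrics could be tallied from event rows, keyed on GitHub's event type names. The metric names, the function and the row shape are assumptions for illustration, and this counts every event of a type, so "issues filed" would still need a filter on the event's action, which is left out here.

```javascript
// GitHub event type -> metric bucket. Bucket names are made up for this sketch.
const METRIC_BY_EVENT_TYPE = new Map([
  ['PushEvent', 'pushes'],
  ['WatchEvent', 'stars'],
  ['IssuesEvent', 'issues'],
  ['PullRequestEvent', 'pullRequests'],
  ['PullRequestReviewEvent', 'reviews'],
  ['IssueCommentEvent', 'comments'],
  ['PullRequestReviewCommentEvent', 'comments']
]);

// Count events per metric bucket; unknown event types are ignored.
function tallyMetrics(events) {
  const totals = {};
  for (const event of events) {
    const metric = METRIC_BY_EVENT_TYPE.get(event.type);
    if (!metric) continue;
    totals[metric] = (totals[metric] || 0) + 1;
  }
  return totals;
}
```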

enhancements suggested for the `users_companies` table

Thanks for sharing this!

https://bigquery.cloud.google.com/table/public-github-adobe:github_archive_query_views.users_companies?pli=1&tab=details


Suggestions:

  • Add the field 'user_id', as people can change their nick (but not their id).
  • Add the field 'crawled_at'. Accounts will have multiple companies through their lifetime, and this will allow you to attribute commits that happened x years ago to the right company.

With 'crawled_at' you'll have to allow multiple entries per user, and adjust queries later. For example, the easiest queries would go through a view that just gives the latest company per user.
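The "latest company per user" view could boil down to logic like the following. This is a JavaScript illustration of the idea, not a BigQuery view definition; the row shape is hypothetical, and crawled_at is assumed to be an ISO date string so that plain string comparison orders rows chronologically.

```javascript
// rows: [{ user_id, company, crawled_at }], with crawled_at as an ISO date
// string (assumption). Returns a Map of user_id -> most recently crawled row.
function latestCompanyPerUser(rows) {
  const latest = new Map();
  for (const row of rows) {
    const seen = latest.get(row.user_id);
    if (!seen || row.crawled_at > seen.crawled_at) {
      latest.set(row.user_id, row);
    }
  }
  return latest;
}
```

Queries that care about history would skip this step and instead pick the row whose crawled_at is closest to the commit date.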
