cncf / gitdm.archive Goto Github PK

📜Fork for tracking CNCF projects

Python 18.83% Shell 16.81% Ruby 56.24% Makefile 0.27% Go 7.84% Vim Script 0.01%

gitdm.archive's Introduction

CNCF gitdm

This is the Cloud Native Computing Foundation's fork of Jon Corbet and Greg KH's gitdm tool for calculating contributions based on developers and their companies. Companies and developers can check if they are correctly attributed at the following links:

Company Developers list: co1, co2, co3, co4, co5, co6, co7, co8.

Developers affiliations list: dev1, dev2, dev3, dev4, dev5.

New affiliations are imported into DevStats about 1-2 times/month.

DevStats

This repository is used as a source of affiliations for all DevStats projects. The final affiliations JSON is periodically imported by the DevStats project.

Adding/Updating affiliation

If you find any errors or missing affiliations in those lists, please submit a pull request with edits to developers affiliations files: dev1, dev2, dev3, dev4, dev5, ...

Please note that we need both current and historical email here as we are processing data from GitHub Archives, so old emails are there (even if they are no longer current).

Only the Developers affiliations list dev1, dev2, dev3, dev4, dev5, ... should be edited manually.

Company Developers lists co1, co2, co3, co4, co5, co6, co7, co8 are computed derivatives of the first list.

Other files used for affiliations are the email map file and github users file.

Please note that cncf/gitdm affiliations are imported into DevStats (cncf/devstats) once per 4 weeks.

Removing affiliations

If you do not want to have your email listed here please read how to remove your email.

Testing changes

You can test any changes locally by cloning this repository and regenerating all data by running ./rerun_data.sh.

Then generate config files by running: ./import_affs.sh.

If those two files are out of sync, the tool will notify you about this.

This tool will generate a new email-map file.

Check if your changes processed properly and move the file to cncf-config/email-map (replace)

Sync workflow

Please follow the instructions from SYNC.md.

Running

Use *.sh scripts to run analytics (all*.sh for full analysis and rels*.sh for per release stats)

This program assumes that gitdm resides in: ~/dev/cncf/gitdm/ and that kubernetes is in ~/dev/go/src/k8s.io/kubernetes/

Output files are placed in the kubernetes directory.

To regenerate all statistics just run: ./rerun_data.sh

This is an iterative process: Run any of the scripts. Review its output in the kubernetes directory. Iteratively adjust mappings to handle more authors.

You can also run via ./debug.sh to halt in debugger and review the hacker's structure and those who were not found. See cncfdm.py:DebugUnknowns

Final report:

Data

Report

Contributing

Pull requests are welcome.

Our mapping is never complete, please see config files in Config files.

File email-map is a direct email to the employer mapping.

There is also a long list of unknown emails. For that, scroll to the section called Developers with unknown affiliation: in all.txt

All of those were searched for in various sources but we were not able to find their affiliation.

Detailed Description

Regenerating all data with ./rerun_data.sh means:

Data for kubernetes/kubernetes repository (all time) with 3 mappings of Unknown developers: no mapping (list them with their email & name), map them to their email domain ([email protected] --> 'Gmail *'), map all of them to '(Unknown)'. This is done via running: (./all.sh, ./all_no_map.sh, ./all_with_map.sh). Output goes to kubernetes/all_time/ directory
Data for kubernetes/kubernetes repository divided into releases v1.0.0, v1.1.0, ..., v1.7.0 (with 3 types of mappings described above). This is done via (./rels.sh, ./rels_strict.sh, ./rels_no_map.sh). Output goes to kubernetes/v1.X.0-v1.Y.0/ directory: X=0,1,2,3,4,5,6 Y=1,2,3,4,5,6,7)

After performing those two steps, cncfdm.py output needs to be analysed. It is done by calling: ./analysis_all.sh (analyses all-time results) and then ./analysis_rels.sh (for pre-release data)

Data for all 68 repos (currently) which makes the entire Kubernetes project with ./kubernetes_repos.sh script.

Final files generated by first 2 calls (for single repo kubernetes/kubernetes) are in kubernetes/all_time/*.txt and ./kubernetes/v1.X.0-v1.Y.0/*.txt

All scripts are configured to ignore commits related to files from vendor and Godeps directories. This is because external sources are placed here and many commits are just adding external libraries. Accounting for them would make the results less accurate

All of them use a git log call with specific args piped to cncfdm.py call with specific parameters.

See ./run.sh for an example. All other calls use the same commands git log and cncfdm.py with other parameters.

To get a list of parameters for cncfdm.py, see comments inside of the cncfdm.py file describing all possible options.

For more details about how cncfdm.py tool works refer to its sources and other *.py files.

Those files are analysed by ./analysis_all.sh and ./analysis_rels.sh.

The first one calls: ruby analysis.rb all kubernetes/all_time/first_run_patch.txt kubernetes/all_time/run_no_map_patch.txt kubernetes/all_time/run_with_map_patch.txt

The second calls: ruby analysis.rb v1.0_v1.1 kubernetes/*/output_strict_patch.txt kubernetes/*/output_patch.txt kubernetes/*/output_no_map_patch.txt

This ruby tool expects to get 3 files (one with no unknown developers mapping, 2nd with mapping to a domain name and 3rd with mapping to (Unknown).

The output of this analysis.rb tool goes to project/<prefix>_<key>_<type>.csv files. : can be all or v1.X.0-v1.Y.0 - it means that the file is for all time data or for a specific release of kubernetes/kubernetes : can be changeset, employers, lines, signoffs - it means that the file contains data sorted by this desc. : can be sum, top, all:

all means that the file contains all data for given sorted by desc (header is: idx,company,n,percent which means n-th, company name, n developers, % all developers) All known is the sum of all detected developers
top means that there will be top 10 data from all but also must contain data for: '(Unknown)', 'Gmail *', 'Qq *', 'Outlook *', 'Yahoo *', 'Hotmail *', '(Independent)', '(Not Found)'. The header is the same as in all.
sum contains a summary value for all found developers. It has a different header: N companies,sum,percent number of developer's companies found, the sum of for all found developers, % of the sum as a part of the sum for all developers.
Special names: All known (sum all known developers), (Independent) (developers working on their own), (Not Found) (developers for whom an employer was not found even though the search was done in multiple sources), (Unknown) (developers not mapped (yet?)), Some name * (sum of developers having emails on Some name domain).An asterisk * is added to indicate this.

This data is directly used for the "Who writes Kubernetes" report.

./kubernetes_repos.sh script is used to generate all-time data for all the kubernetes repos.

To use it, you must have all of kubernetes repositories (68 from 3 different organizations) cloned in ~/dev/go/src/k8s/.

Orgs are: kubernetes, kubernetes-incubator, kubernetes-client.

It generates statistics for each single repo via: ./anyrepo.sh ~/dev/go/src/k8s.io/<repo-name> <repo-name>

See details in ./kubernetes_repos.sh. is a directory where a given kubernetes repository is cloned.

To clone a repository, do: cd ~/dev/go/src/k8s/ git clone https://github.com/<one-of-3-kubernetes-orgs>/<kubernetes-repo-name>.git.

one-of-3-kubernetes-orgs: kubernetes, kubernetes-incubator and kubernetes-client

kubernetes-repo-name: please look up all repo names in all kubernetes orgs on GitHub.

./anyrepo.sh just calls cncfdm.py with appropriate args (like exclude vendor dir numstat etc).

There is also ./anyreporange.sh that allows querying a repo for a specific time range (cncfdm.py supports that as well).

Output of this goes to repos/<repo-name>.<ext> : repository name ./anyrepo.sh was called with. : txt, csv, html, out: txt: main data file, csv: dumps list of employers in given repo, html: the same as txt but in HTML format, out: cncfdm.py verbose output messages (for debugging)

Finally, ./kubernetes_repos.sh calls: ./multirepo.sh with all 68 repository directories listed.

It gathers git log on each of them and concatenates all those files and then run cncfdm.py on the concatenated result (see ./multirepo.sh)

Results are saved to repos/combined.<ext> is the same as for anyrepo.sh.

The typical workflow is re-runing ./kubernetes_repos.sh and examining repos/combined.txt for unknown developers.

Research on google, Clearbit, FullContact, github, LinkedIn, Facebook, any other source -> update cncf-config/<filename> and re-run ./kubernetes_repos.sh : usually in this order: email-map, domain-map, a in very rare cases: aliases, gitdm.config-cncf or group mappings in groups/

Also, when running data for a single kubernetes/kubernetes for example with ./all.sh examining developers found in ./kubernetes/all_time/first_run_patch.txt.

After all this data is generated, ./kubernetes_repos.sh concatenates all single repo data into a single output file: repos/merged.out to allow browsing all the data in a single file.

It also generates developers and companies statistics via a ./topdevs.sh call.

It calls a ruby tool on the combined output of all 68 kubernetes repos (saved as CSV) like so: ruby topdevs.rb repos/combined.csv

That tool generates files as follows:

companies_by_name.csv - this is a list of companies found, sorted by their names (not case sensitive) to allow manual examination for duplicates that came about from different names such as "Google" vs "Googe Corporation" vs "Google Corp." or "google"
companies_by_count.csv - list of companies found, sorted (desc) by the number of employers. This serves a similar purpose but from a different perspective.
unknown_devs.txt, unknown_devs.csv, unknown_emails.csv - list of developers for whom there isn't a mapping. Used to prioritize searching for devs, and unknown_emails.csv is in the format fitting a clearbit batch.

There are clearbit tools in clearbit_tools/ directory.

Look for any files with .rb extension. 3 rounds of commercial Clearbit requests were performed. And they returned quite a lot of data.

But those files are not checked in and are listed in ./.gitignore because we have to pay for that data.

Those tools are used to enrich of cncf-config/email-map mapping. google_other.txt - contains a list of Google developers with email on a domain different than @google.com. ./changesets.csv, ./added.csv, ./removed.csv files contain developers sorted by changesets, added lines, removed lines desc.

A new set of tools to get Clearbit and FullContact data is located in affiliation_finder/ directory. The two tools are described inthe 'Tools to help find unknown affiliations' section of this document.

This is used to generate Top N developers in given criteria.

./new_devs.sh (also used by ./rerun_data.sh) is used to generate statistics about new developers between kubernetes/kubernetes releases.

It calls: ruby new_devs.rb kubernetes/v1.X.0-v1.Y.0/output_strict_patch.csv for all X and Y. new_devs.rb simply generates information about developers who were new between each release and file new_devs.csv, which contains a list of companies who introduced most new developers overall (sorted by # of new developers desc).

That covers a typical usage and data for "Who writes Kubernetes report"

Other tools

Other tools include:

see_parser.sh - display data feed as used by cncfdm.py tool
range.sh - generate stats for Linux kernel for given data range (1st and 2nd command line argument like 2016-01-01 2017-01-01), assumes Linux repo (torvalds/linux) is cloned in ~/dev/linux/
range_<period>.sh - used to generate monthly, quarterly, yearly stats using above ./range.sh, for example ./range_monthly.sh.

To work on Prometheus contributors before and after joining CNCF:

Prometheus joined CNCF on 2016-05-09.

You need to clone all Prometheus repos into ~/dev/prometheus using ./clone_prometheus.sh

Then you need to get several distinct Prometheus contributors before joining CNCF: ./prometheus_repos.sh 2015-05-09 2016-05-08 ~/dev/prometheus/

Result is:

Processed 2721 csets from 230 developers
252 employers found
A total of 1558445 lines added, 353900 removed (delta 1204545)

Now check the number of distinct contributors after 2016-05-09: ./prometheus_repos.sh 2016-05-09 2017-06-01 ~/dev/prometheus/

Processed 2817 csets from 346 developers
365 employers found
A total of 2696196 lines added, 771502 removed (delta 1924694)

We have a change from 230 to 365 which is a 59% increase.

Report

Links to data and generated report are here: ./res/links.txt

CNCF Projects join statistics

CNCF Projects join dates are: https://github.com/cncf/toc#projects
To generate statistics for Prometheus 90 days before joining CNCF and 90 days after joining try this:
Run ./clone_prometheus.sh
Run ./cncf_join_analysis.sh prometheus 2016-05-09 90 ~/dev/prometheus/
Results go to prometheus_repos/result.txt
Create a directory where you want to put links to kubernetes repos, like this: mkdir ~/dev/kubernetes_repos_links
Copy kubernetes_repos.sh to link_kubernetes_repos.sh: cp kubernetes_repos.sh link_kubernetes_repos.sh
Open the copy and add 1st line: cd ~/dev/kubernetes_repos_links
Replace lines like ./anyrepo.sh ~/dev/go/src/k8s.io/test-infra/ test-infra with ln -s ~/dev/go/src/k8s.io/test-infra/ test-infra; run it; done. k8s repos links are now in ~/dev/kubernetes_repos_links
The command that takes on Kubernetes repos should be: ./cncf_join_analysis.sh kubernetes 2016-03-10 90 ~/dev/kubernetes_repos_links
Results go to kubernetes_repos/result.txt
To generate statistics for OpenTracing 90 days before joining CNCF and 90 days after joining try this:
Run ./clone_opentracing.sh
Run ./cncf_join_analysis.sh opentracing 2016-08-17 90 ~/dev/opentracing/
Results go to opentracing_repos/result.txt
There is also an All-in-one script to regenerate all CNCF Projects joint statistics, run ./join_stats.sh

Typical update of "Who writes Kubernetes report"

Run ./pull_kubernetes.sh to get all Kubernetes repos updated.
Change directory to dev/go/src/k8s.io/kubernetes/ and update this repository as well.
New release since the last run (1.7) so many scripts needs to be updated. Also, all repos from 3 Kubernetes orgs are now in ~/dev/kubernetes/repos so ./kubernetes_repos.sh script needs an update as well
Updated kubernetes_repos.sh script to get repos from ~/dev/kubernetes_repos/
Script to regenerate all data is ./rerun_data.sh, it needs to be updated to support v1.7.0
Now report is: https://docs.google.com/document/d/1RKtRamlu4D_OpTDFTKNpMsmV51obdZlPWbXVj-LrDuw/edit?usp=sharing
Report data sheet/draft is: https://docs.google.com/spreadsheets/d/15otmXVx8Gd6JzfiGP_OSjP8M9zyLeLof5-IGQKEb0UQ/edit#gid=0
Now report sections:

Since the Kubernetes project started in June 2014, 2623 Developers from 789 Companies worked on it (counting Kubernetes and all its projects 68 repos from 3 orgs).
A total of 28.4 million lines of code were added, 16.3 million lines removed.

Taken from: ./repos/combined.txt

Processed 59041 csets from 2623 developers
789 employers found
A total of 28440262 lines added, 16342872 removed (delta 12097390)

For a single kubernetes/kubernetes repo, the data is in: kubernetes/all_time/first_run_numstat.txt

Processed 28225 csets from 1338 developers
400 employers found
A total of 6667288 lines added, 4132224 removed (delta 2535064)

About how to fill data sheet/chart:
Sheet "all time data":
analysis_all_repos.sh, generates files starting with: report/all_repos_rest
report/prefix_key_type (prefix: all - for kubernetes/kubernetes, all_repos - for all repos, v1.x for releases), project/
Commits info is in other_repos/all_kubernetes_dtfrom_dtto and other_repos/kubernetes_dtfrom_dtto (for all k8s repos and kubernetes/kubernetes alone)
To see commits for all kubernetes repos combined for last year & for last 12 months (each) separately: grep -HIn "csets from" other_repos/all_kubernetes_range_unknown_201*
The same for kubernetes/kubernetes repo: grep -HIn "csets from" other_repos/kubernetes_range_unknown_201*
Update report and report data sheet with those results
Number of github events etc - from cncf/velocity:projects/unlimited.csv (this is for 201606-201705)
Values for May 2017 are in: cncf/velocity:projects/cncf_projects_201705.csv

activity,comments,prs,commits,issues,authors
Last year: 308313,217684,46351,16000,28278,1728
Last month: 30227,21371,4645,1741,2470,451

Analyses of kubernetes/kubernetes (main repo) are in this format: report/all_{key}_top.csv, import them to the 2nd sheet
Big summaries like all developers etc are in ./repos/combined.txt, for the main k8s repo: kubernetes/all_time/first_run_numstat.txt
Top developer stats are here: stats/all_key.csv (for all repos), stats/kubernetes_key.csv (for the main repo) and stats/v1.x_key.csv per versions.
Import those to the last 3 sheets in the data set
Per verion data: report/v1.x_v1.y_key_top.csv, key: changesets, lines, developers, import to the datasheet for all versions: 7 x 3 = 21 imports

Affiliations of some developers are uncertain despite the best effort. These developers are listed in uncertain.csv file.

GitHub users can be pulled using Octokit GitHub API.

To do this, call: ruby ghusers.rb or ./ghusers.sh

Required are:

Standard GitHub OAuth token: https://github.com/settings/tokens --> Personal access tokens, put it in /etc/github/oauth file.
A GitHub Application to increase the rate limit from 60 to 5000 (60 is not enough to process kubernetes, 5000 is enough).
See: https://github.com/settings/ --> OAuth application, put your client_id and client_secret in /ect/github/client_id, /etc/github/client_secret files.
This tool will cache all GitHub calls (save them as JSON files in ./ghusers/)
Final JSON will be saved in ./github_users.json (subsequent calls will use data from this file, so to reset cache, just remove this file and all files from ghusers/ directory
To generate the actual mapping, manually process this JSON (and do some mapping of company names - GitHub users sometimes put strange values there)
I've done that by iteratively using a new tool: import_from_github_users.sh, import_from_github_users.rb with a mapping file (that tries to map a GitHub user company name into something more accurate): company-names-mapping

Tools to help find unknown affiliations

To enhance this json with pre-existing affiliations, call: ./enchance_json.sh

To generate JSON with some filtered data (like all unknown devs with location or LinkedIn profile link or just a blog entry) call: ./lookup_json.sh (see the script for details, also lookup_json.rb have a lot of comments on how to use it).
To generate a progress report (report about how many Not Found, Unknowns, and Independent devs are defined in our affiliation, call: ./progress_report.sh).
To generate aliases for emails that are already known (are using the same GitHub user name) try ./aliaser.sh, the output is aliaser.txt that can be analyzed and manually added to cncf-config/aliases if needed.
To generate a correlations map for company name (to avoid mapping typos etc) run ./correlations.sh script.The result is in correlations.txt file that can be used to update cncf-config/email-map with corrected employer names.
To generate per-files/directories statistics, use: ./per_dirs.sh, this is a part of a standard workflow, results are in csv files in the per_dirs directory
To generate affiliation files (developers_affiliations.txt, company_developers.txt), use ./gen_aff_files.sh
To generate data for the stacked chart, run ./stacked_chart_<months|rels>_<csets|perc>.sh. It generates a csv file: stacked_chart_<months|rels>_<csets|perc>.csv, to generate all stacked charts: ./stacked_charts.sh
To import data from pretty-formatted files use import_affs.sh, this is not a part of the standard workflow

All those tools are automatically called when running the full data regeneration script: ./rerun_data.sh

To automatically find affiliations (email to the company) using Clearbit, run two scripts from the affiliation_finder folder in order:
- clearbit_affiliation_lookup.rb
- ruby clearbit_affiliation_merge.rb

The first one works with one argument and generates a file clearbit_affiliation_lookup.csv. The argument can be skipped or have a value of 'true' or 'false' - default. Invocation would be clearbit_affiliation_lookup.rb or clearbit_affiliation_lookup.rb false or clearbit_affiliation_lookup.rb true. The argument is used to whether the script's output data should be overwritten (normally data would be appended to the file) and at the same time it will allow previously looked-up email addresses to be checked again.
The execution environment needs to have a proper value for this: Clearbit.key = ENV['CLEARBIT_KEY'] It is a secret API key on a Clearbit account that has been set up for a subscription. When the file is generated, open it in a csv editor, sort by the 'chance' field. Visually check and correct data in the 'affiliation_suggestion' column. Replace values such as 'http://www.ghostcloud.cn/' with 'Ghostcloud'. If you find affiliations for other developers manually, just change the 'none' value in the 'chance' column to 'high' and provide a value in the 'affiliation_suggestion' column. Columns to the right of 'affiliation_suggestion' are not required.

The second script reads the 'clearbit_affiliation_lookup.csv' file. Data is processed against the cncf-config/email-map file. When done, the 'email-map' file will have new and updated affiliations. The file will be sorted as well. The lookup file will not be altered.

To automatically find affiliations (email to company) using FullContact, run two scripts from affiliation_finder folder in order:
- ruby fullcontact_affiliation_lookup.rb
- ruby fullcontact_affiliation_merge.rb

The first one works with one argument and generates a file fullcontact_affiliation_lookup.csv. The argument can be skipped or have a value of 'true' or 'false' - default. Invocation would be fullcontact_affiliation_lookup.rb or fullcontact_affiliation_lookup.rb false or fullcontact_affiliation_lookup.rb true. The argument is used to whether the script's output data should be overwritten (normally data would be appended to the file) and at the same time it will allow previously looked-up email addresses to be checked again.
The execution environment needs to have a proper value for this: config.api_key = ENV['FULLCONTACT_KEY'] It is a secret API key on a FullContact account that has been set up for a subscription. The columns differ in this file compared to that of Clearbit. If you find affiliations for other developers manually, just change the value in the 'org_1' column. The column by default should have 5 pipe-delimited values. If you do not have the values for the other 4, just type 4 pipes. Columns to the right of 'org_1' are not required.

Add a new project ( cncf or non-cncf) to get affiliation for it.

Please follow the instructions from ADD_PROJECT.md.

Authors

Łukasz Gryglicki [email protected] - developement.
Justyna Gryglicka [email protected] - researching affiliations data.

gitdm.archive's People

Contributors

Stargazers

Watchers

Forkers

isuker blixtra kinvolk-archives mattfarina radoslaw brian-brazil alexxnica kryndex jan11011977 heckj jbeda puja108 caoshufeng akutz derekcollison gaocegege clintkitson mewzherder alexbrand jbayer vitessio eirinikos platinumwrist detiber krisnova ncdc chuckha coffeepac zachpuck shomron murali-reddy jml gerred tomwilkie davkal bentheelder xiaoxubeii directxman12 jlewi randomvariable zyfjeff hhoover matzew markusthoemmes anypm mauilion jglick justaugustus liztio skriss stackpointcloud andrewsykim sebgl dimitropoulos sftim benpriestman chrisz100 palemtnrider major zpao grahamhayes stealthybox olblak karlomendozaarmory dotdotdotpaul german-muzquiz armory-io michmike nicolaferraro sujithsimon22 lidiyag mkimuram cpu1 craiglpeters maurya-anuj pskrzyns patrick-east johnharris85 aaronmell garganubhav jinan-zhou xyhuang krishnadurai ryysud akado2009 luxas yastij ivangoncharov vshn arrikto yagonobre martina-if beorn7 cristian-radu ehashman opowbow chandler-song fabriziopandini asalkeld davidewatson

gitdm.archive's Issues

Contribution not reflected in devstats

Hi there, my PR kubernetes/ingress-nginx#7202 to kubernetes/ingress-nginx got merged successfully and I'm listed in gitdm, however I don't see it showing up in Devstats, eg. Last year timeframe.

Is there something more needed to be done in gitdm etc. for the contribution to show up in Devstats?

I'm opening this issue to find out more, in case something additional needs to be done, and this could be added to a README, etc.

Not sure if this could be a contributing factor: I joined the VMware org on Github, then opened the PR #7202, then (seeing no updates to devstats) made a PR to gitdm.

cc: @LappleApple

From where affiliations are imported into DevStats

We want to add our company affiliations into devstatus, didn't find where to add company info like https://github.com/stackalytics/default_data

Instructions for testing changes are incorrect.

The instructions here:
https://github.com/cncf/gitdm#testing-changes

Say to run

./rerun_data.sh

But looks like it was moved to ./src/rerun_data.sh

Fix wrong country_id of mine

Hi, can help me fix my wrong country_id ?

I am from China, country_id should be cn : )

  {
    "login": "Xunzhuo",
    "email": "Xunzhuo!users.noreply.github.com",
    "affiliation": "Tencent Holdings Limited",
    "source": "manual",
    "name": "LIU",
    "commits": 759,
    "location": "China",
    "country_id": "ni"
  },
  {
    "login": "Xunzhuo",
    "email": "mixdeers!gmail.com",
    "affiliation": "Tencent Holdings Limited",
    "source": "manual",
    "name": "LIU",
    "commits": 759,
    "location": "China",
    "country_id": "ni"
  },

  {
    "login": "Xunzhuo",
    "email": "Xunzhuo!users.noreply.github.com",
    "affiliation": "Tencent Holdings Limited",
    "source": "manual",
    "name": "LIU",
    "commits": 759,
    "location": "China",
    "country_id": "cn"
  },
  {
    "login": "Xunzhuo",
    "email": "mixdeers!gmail.com",
    "affiliation": "Tencent Holdings Limited",
    "source": "manual",
    "name": "LIU",
    "commits": 759,
    "location": "China",
    "country_id": "cn"
  },

Unknown company name

I'd like to fix my company's name so that my name would be visible in company_developers and developers_affiliations files.
How should I do it? I'd open a PR but the automated machinery in this repo scares me.

$ grep asidorovj . -r
./src/all.txt:(Unknown) andrey.sidorov!flant.com asidorovj                                        4 (0.0%)
./src/all.txt:(Unknown) andrey.sidorov!flant.com asidorovj                                        4 (0.0%)
./src/alldevs.txt:(Unknown) andrey.sidorov!flant.com    asidorovj   4
./src/unknowns.txt:(Unknown)    andrey.sidorov!flant.com    asidorovj   4
./src/all_affs.csv:"andrey.sidorov!flant.com","asidorovj","(Unknown)","","config"

Can't update company information

Hi, I've submit a PR for my Corporation, but I still can't find my company information in Istio dashboard, how can I fix it? :-)

Or it will take a while to update personal information?

How does dev stats use the info in the developer affiliates file?

We did not see much, if any, movement in the dev stats for K8s after #44 was merged, and we are wondering how exactly the data in that file is used.

While working on the aforementioned PR, I learned more about querying data in and about GitHub via the GitHub API as well as gharchive.org. For example, it's not possible via the GitHub API to query all the comments made by a user. Nor is it possible to look up all the issues for a GitHub user by their e-mail address. Except for looking up commits via git log --author EMAIL, it doesn't appear as if the information in the file developer_affiliations.txt can be used to accurately determine the other data that contributes to dev stats such as Issues Opened, Pull Requests Opened, Comments.

In addition to updated e-mail addresses and company affiliation information, the data in #44 also reflects net-new members of the affiliates file that we do not see reflected in the actual dev stats. Because the data in developer_affiliations.txt does not include GitHub login IDs, we'd like to know how the issue, PR, and comment accruals are generated since the results at http://k8s.devstats.cncf.io do not seem to match what we found.

Thank you!

cc @clintkitson

Better gender lookup methods

Currently, we use Genderize to guess user gender by their first name. This method seems to lead to many wrong data with non-English names.

I wonder if there're any better ways, face recognition would be more reliable but also more expansive.

Process to add missing repos?

Hello,

Is there a process/approval to go through to add more repos here? There is a good portion of Kubernetes development work that is happening within the github.com/kubernetes-csi org, but none of the repos there are tracked as counting toward Kubernetes.

How does one know if a repo is okay to track, and how it should be categorized?

I was going to submit a PR that adds the relevant repos to ghusers.rb, but I am (1) not sure if that's allowed, (2) not sure how to make sure the repos get categorized towards Kubernetes (vs any other CNCF project, for example).

Thanks!

Red Hat has acquired CoreOS

Please change all CoreOS contributors to Red Hat.

Question: Do you know why my fellow's commit cannot be reflected in Tekton ranking?

Hi there, (I apologize if I am asking in inappropriate location)

I am in charge of Tekton contributions and commits in my company.

My fellow, Minoru ( mnitta ) has committed to Tekton website, however his commit does not seem to be reflected in Tekton ranking.

His commit-1: tektoncd/website@23e5c6f
His fellow's commit-2: tektoncd/website@92e50a5
The ranking (last month): https://tekton.devstats.cd.foundation/d/5/companies-table?orgId=1&var-period_name=Last%20month&var-metric=commits

Do you have any causes of not reflecting to Tekton ranking?
His information is added to both developers_affiliations and github_users.json.

Thank you for your cooperation.

Request to remove my private email address

For some reason my private email address is exposed in this repository, and I would like it removed:
https://github.com/cncf/gitdm/blob/master/src/other_repos/gitlab_2018-07-01_2019-07-01.csv#L576

If there is some reason that having a contact email is beneficial, I'm happy to provide one.

Affiliate nehaLohia27 with Vmware

How to merge duplicated companies?

I noticed that both "Dynatrace" and "Dynatrace LLC" show up as separate companies while they are not. Both of them have some affiliated developers assigned.
What would be the best approach to merge both into one and to ensure future employees are mapped to the correct entity?
Should the company-map be used for this or would it make sense to edit all occurrences in developer_affiliations*.txt?
Also, dynatrace.com seems to be missing from domain-map but I think this would not change the existing affiliations.

Flux: please add fluxcd/flux2

fluxcd/flux2 is the official repository for Flux now. fluxcd/flux hasn't reached its EOL just yet, but will be archived some time soon.

Change process for easier update PRs

I found my data was incorrect (dates/companies). The process of updating this appears to require running scripts (e.g., ./rerun_data.sh) that can take some time to run.

Could the process instead have it so that these files are not stored in Git and are generated at the time the data is used (or deploy time)? This could save from errors introduced by end users.

How to change my company information?

I'm a Knative contributor. I changed my company in 2020. How could I change my company information in knative.teststats.cncf.io ?

Same user, different case?

Hi,

I am not sure how to fix this, so I am raising this issue: looking at this dashboard, I see two users being listed twice (WPH95, wph95 & Arbiv, arbiv).

They are the same users, is there a way to "merge" them?

Thanks,

cc: @wph95 , @arbiv

Quite a few "Google LLC LLC" affiliations in dev affiliation files

Found looking in devstats:

Not sure why there are a few "Google LLC LLC" affiliations, maybe some copy/paste issue?

how to disable another wrong registered company

Sorry to disturb, as we have two companies registered, we'd like to remove one of them, what's the steps, thanks.

Why does k8s-ci-robot fail my CLA??

Why does k8s-ci-robot fail my CLA https://identity.linuxfoundation.org/users/cclauss ??
Like kubernetes/autoscaler#2150

ghusers.rb splitting logic for repos.txt seems very brittle

I followed the instructions to run get_repos to produce a list which I copy pasted to repos.txt. I then ran ghusers.rb.

ghusers.rb had trouble splitting the file correctly.
https://github.com/cncf/gitdm/blob/88f4c2a8d5df61b0ba09a5ef99fac95599ef08cf/src/ghusers.rb#L77

I may be misunderstanding ruby but I think

  repos = str.strip.split(",\n  ")

Is splitting on the entire string ",\n " and not treating that string as a list of delimiters. So the code is very particular about whitespace.

The following seemed to work better for me

repos_raw = str.strip.split(",")
  repos = []
  repos_raw.each do |r|
    repos.push(r.strip)
  end

Something weird happened with a recent commit

cncf/gitdm@f5f939a added dgn at Red Hat.

cncf/gitdm@6cf8f2f added jacob-delgado at Aspen Mesh.

cncf/gitdm@a05aaac munged the two somehow:

dgn: dgn!users.noreply.github.com
	Aspen Mesh
	Red Hat from 2100-01-01
	jacob-delgado: from 2100-01-01

Fix ocelotl affiliation

ocelotl affiliation should be LightStep.

Merge ids

Hi. I have 3 different entries. [email protected] is my current commit ID. [email protected] was the previous one. [email protected] is also mine (caused by creating PRs via github)

Contribution count is not reflecting at the portal

Around 20 contributions have been made by Infosys resources in last month, but the count is not yet reflected on the portal:
Link: https://k8s.devstats.cncf.io/d/9/companies-table?orgId=1

Affiliation doesn't appear to be updated

I'm going to assume that this is a foo-bar or misunderstanding on my end, but after merging: https://github.com/cncf/gitdm/pull/844 , I expected to see my affiliation updated to my new employer. However after looking at my contributions across a handful of projects (ex: https://metallb.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1 and https://argo.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1), I see myself listed with both no affiliation and affiliation to an old employer.

Is there something additional that needs done in order to reflect my new employer besides the update here? I thought maybe there was just something on the backend that needed synced, so I waited some time before opening this issue to rule that out. Or am I perhaps just misunderstanding how the listings themselves are calculated? Any information would be greatly appreciated!

Add affiliation for ankatare

Stale maintainers entry for jpeach

Hi, Not sure where the entry for me in src/maintainers.csv originated, but it's now stale. I'd be happy to be removed, or if there is some source of truth for this list, please LMK and I'll try to update that.

how to get xieyanker in stackalytics

hi,everyone,xieyanker commit a pr and the pr is merged,why cannot find xieyanker in stackalytics:
xieyanker work for inspur,in stackalytics can not find him:
https://www.stackalytics.com/cncf?company=inspur
the pr is:
kubernetes/kube-state-metrics#777

Difficult to file PRs against this gigantic repo

When I filed #83 I used GitHub’s web UI to create a fork, edit a file, and create a PR from it. I no longer seem to be able to do that—GH complains that the text files are too big to display, and does not offer an edit button. So I cloned the repository. The clone occupies 555MiB, and this took several minutes on a fiber connection! Pushing just a few weeks’ worth of master changes back to my fork took a while as well, and GH warned me

Enumerating objects: 1177, done.
Counting objects: 100% (1177/1177), done.
Delta compression using up to 4 threads
Compressing objects: 100% (406/406), done.
Writing objects: 100% (1153/1153), 44.06 MiB | 2.74 MiB/s, done.
Total 1153 (delta 832), reused 1039 (delta 744)
remote: Resolving deltas: 100% (832/832), completed with 18 local objects.        
remote: warning: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.        
remote: warning: See http://git.io/iEPt8g for more information.        
remote: warning: File github_users.json is 84.96 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 84.80 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 84.13 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 84.11 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 84.97 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 84.12 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 81.43 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 84.88 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 84.95 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 81.62 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 81.12 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 81.53 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB        
remote: warning: File github_users.json is 81.54 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB

I think about historical blobs. Indeed even the current values are huge:

$ git ls-files | fgrep .json | xargs du -h | fgrep M
2.7M	src/all_unknown.json
1.1M	src/default_data.json
13M	src/genderize_cache.json
9.6M	src/geousers_cache.json
44M	src/github_users.json
19M	src/stripped.json
2.4M	src/unknown_with_any_data.json
1.1M	src/unknown_with_blog.json
1.7M	src/unknown_with_location_and_name.json
1.8M	src/unknown_with_searchable_email.json

If these files really need to be in source control, please put the frequently edited text files in a separate repository (or at a minimum use LFS so you only need to pay for the size of the master checkout), and start fresh repository history so that people can reasonably file routine PRs. I cannot imagine what I would have done had I been trying to contribute to CNCF from a location with metered or non-broadband Internet access.

Issue with developers_affiliations

Hi,

I have created this PR: https://github.com/cncf/gitdm/pull/528
But in devstats site my company name is not showing against my name:
keda.devstats.cncf.io
Please direct how to solve this issue.

Fix links in the Contributing section

The links in Contributing[1] section are incorrect (Config files, email-map, all.txt).

[1] https://github.com/cncf/gitdm#contributing

sync corp and user information from stackalytics

would you please sync information from https://github.com/openstack/stackalytics/blob/master/etc/default_data.json as many people use it

Grammatical mistakes and typos found in the README.md file

I have found some grammatical mistakes and typos in the Readme.md file of this project.
For e.g:- on line no. 87 there is a typo that looks something like this "neds" instead it should be "needs".

I want to work on this file, update it and resolve the issue

Add dev4 and co5 into README.md

https://github.com/cncf/gitdm/blob/master/README.md

dev4 and co5 are not written in this README.md.
Therefore I added them.

Unknown affiliation for email that exists in developers_affiliations*.txt

Current src/all.txt file lists one of our engineers in the Developers with unknown affiliation section:

(Unknown) timofey.kirillov!flant.com Timofey Kirillov                              14 (0.0%)

However the relevant entry (to affiliate him with the company) exists in the developers_affiliations2.txt:

distorhead: distorhead!gmail.com, distorhead!users.noreply.github.com, timofey.kirillov!flant.com
        Flant

There is a corresponding entry in the generated company_developers2.txt file as well.

Can you please shed some light on what's missing in my understanding and should anything be done to fix this? There's probably no issue at all but I don't know how I can be sure.

Why didn't my company name change to Huawei in kubeedge.devstats.cncf.io ?

I've updated the configuration to change my company to Huawei, but the company on kubeedge.devstats.cncf.io is still NetEase.

Pull requests: cncf/gitdm#172

Use of my personal email may be driving spam to it

Links to this repository are the only results in a search for my personal email. If possible, I would like my email removed from the repository, and if possible I would like to replace it with my work email (which is sure to filter it). Please send me a message.

What is the best way to change the country affiliation?

Hello! Some users in our organization (including me) have the Country displayed as "-" at k8s.devstats.cncf.io. I have a country set in my GitHub profile now, but probably it was not there when my user was initially added to the gitdm database. Does it mean this data isn't refreshed for all users in a while?

If so, what is the best way to fix it? I found #69 where it was made by direct modification of src/github_users.json, is this the only way?

can't find parispittman in devstats for the last month

I can't seem to find my contributions in devstats for the last month. sorting the developer activity chart by last month produces nothing but all time produces just my record attributed to google. if you sort by quarter, my affiliation with google is shown but nothing from apple. I did change my affiliation in the developer file #364

Company name is incorrect

I noticed our company "Z Lab" was renamed and all members are belong to unknown organization "Zeppelin Lab GmbH".
Please see on the list below.

https://github.com/cncf/gitdm/blob/d7478928760b927f2859c705f77957a45bfe390b/company_developers8.txt#L6452-L6460

Our company name is not "Zeppelin Lab GmbH" but "Z Lab corporation".
https://github.com/cncf/gitdm/blob/master/src/mapping.json#L3277

ref. Our company https://github.com/zlabjp

Could I add my account into developers_affiliations*.txt by myself?

Hi there, could you please help me resolve the following problem?

What I want to do

The number of my issues, commits, contributions, etc. will be reflected into this Tekton ranking.

What I have found

The website of the Tekton ranking which is shown above says that "We are determining user's company affiliation from this file, which is imported from cncf/gitdm." in Description.
Adding/Updating affiliation says that once we edited developers_affiliations*.txt, github_users.json could be reflected.

What I am wondering

Could I add my account information and company I am belonging to into developers_affiliations*.txt by myself?
Could my fellows' account be added by me? Or Do they need to add by themselves?

Add affiliation for mk46

User skarlso was removed

This PR added me and WeaveWorks affiliation. https://github.com/cncf/gitdm/pull/772

However, I was somehow removed from the list and now am nowhere to be found. :(

Public CNCF Maintainer List

Hello 👋

Two things:

I added a comment about updating my company (also while I was there, commented about another Helm member who I know just moved to another company as well). Thanks in advance for updating this, and whatever generated files come from this.
The spreadsheet seems a bit out of date? After leaving the comments above, I began the in instructions at https://github.com/cncf/gitdm/blob/master/SYNC.md#to-sync-maintainers and only got as far as step 3, and saw from a git diff that there are quite a few other differences apart from what my proposed change would be. Running the script also has failures, so I just stopped. Should I proceed in a different way?

parispittman still not coming up as apple

https://k8s.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%20decade&var-metric=contributions&var-repogroup_name=All&var-country_name=All&var-companies=Apple

there are also several people that don't work at apple on this list.

#381

https://k8s.devstats.cncf.io/d/55/company-prs-in-repository-groups?orgId=1&var-period_name=Last%20decade&var-repogroups=All&var-companies=Apple&var-countries=All
this also only lists my contributor experience work and doesn't have the steering repo. there are also several people on this list that are not at apple.

Help investigating double entry for Fabrizio Pandini

July 2 with cncf/gitdm#160 I have changed my affiliation in developers_affiliations2.txt

Then, the 4 of July @lukaszgryglicki with the commit 65d2d95880d8e6b29fb53ea4e25da695d608089c added another entry for me in developers_affiliations1.txt, probably using an automatic script

Now, I would like to understand where the second entry comes from, and eventually how to fix my affiliation (nb. I don't recognize fpandini!fpandini-a01.vmware.com as one of my email address, but I would like to understand what is going on before deciding how to proceed...)

Thanks in advance for support

/assign @lukaszgryglicki

Contributors data in public repository

Hi Team,

Contributors' GitHub id's or email ids are fetched to this public repo which is not good practice and also not allowed by company policies.
so is there any way we can avoid public data exposure but still show company or individual contributions on dashboards?

Developer Affiliations emails are wrong

Per troubleshooting with me, @jdumars, and @lukaszgryglicki:

Some bug in CNCF/gitdm is causing the developer affiliations generation to promiscuously grab email addresses which are not in any way associated with the contributor. As a test case example:

hasbro17: geek.gsa!gmail.com, hasbro17!gmail.com, hasbro17!users.noreply.github.com, hello!zhaofeng.li, samuelcabralcruz!gmail.com
	Independent

Of the four real email addresses, only one belongs to the contributor Hasbro17 (Haseeb Tariq). The other 3 belong to random other contributors to various cloud native projects (not even Kubernetes), such as Zhaofeng Li. At an estimate, approximately 1/10 of all email addresses in developer_affiliations are incorrect in this way.

(Incidentally, Haseeb works for Red Hat)

cncf / gitdm.archive Goto Github PK

gitdm.archive's Introduction

CNCF gitdm

DevStats

Adding/Updating affiliation

Removing affiliations

Testing changes

Sync workflow

Running

Contributing

Detailed Description

Other tools

Report

CNCF Projects join statistics

Typical update of "Who writes Kubernetes report"

GitHub users can be pulled using Octokit GitHub API.

Tools to help find unknown affiliations

Add a new project ( cncf or non-cncf) to get affiliation for it.

Authors

gitdm.archive's People

Contributors

Stargazers

Watchers

Forkers

gitdm.archive's Issues

What I want to do

What I have found

What I am wondering

Recommend Projects

Recommend Topics

Recommend Org