
git-of-theseus's Introduction


Some scripts to analyze Git repos. Produces cool looking graphs like this (running it on git itself):


Installing

Run pip install git-of-theseus

Running

First, you need to run git-of-theseus-analyze <path to repo> (see git-of-theseus-analyze --help for a bunch of config). This will analyze a repository and might take quite some time.

After that, you can generate plots! Some examples:

  1. Running git-of-theseus-stack-plot cohorts.json creates a stack plot showing the total amount of code broken down into cohorts (the year the code was added).
  2. Running git-of-theseus-line-plot authors.json --normalize shows a plot of the % of code contributed by the top 20 authors.
  3. Running git-of-theseus-survival-plot survival.json plots how long lines of code survive.

You can run any of these commands with --help to see the available options.

If you want to plot multiple repositories, you have to run git-of-theseus-analyze separately for each project and store the data in separate directories using the --outdir flag. Then you can run git-of-theseus-survival-plot <foo/survival.json> <bar/survival.json> (optionally with the --exp-fit flag to fit an exponential decay).
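
The multi-repository workflow can be sketched as a small Python helper that only builds the command lines. The helper name build_commands and the repository paths are illustrative assumptions; the executables and the --outdir and --exp-fit flags are the ones named above.

```python
def build_commands(repos, exp_fit=False):
    """Build one analyze command per repo plus one combined survival-plot command.

    `repos` maps an output directory name to a local repository path.
    Both are placeholders chosen for this sketch.
    """
    commands = []
    survival_files = []
    for outdir, repo_path in repos.items():
        # Each project is analyzed on its own and writes into its own directory.
        commands.append(["git-of-theseus-analyze", repo_path, "--outdir", outdir])
        survival_files.append(outdir + "/survival.json")
    # A single survival plot then takes all survival.json files at once.
    plot = ["git-of-theseus-survival-plot"] + survival_files
    if exp_fit:
        plot.append("--exp-fit")
    commands.append(plot)
    return commands
```

Each returned list can be handed to subprocess.run, or joined with spaces and pasted into a shell.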

Help

AttributeError: Unknown property labels – upgrade matplotlib if you are seeing this: pip install matplotlib --upgrade

Some pics

Survival of a line of code in a set of interesting repos:


This curve is produced by the git-of-theseus-survival-plot script and shows the percentage of lines in a commit that are still present after x years. It aggregates it over all commits, no matter what point in time they were made. So for x=0 it includes all commits, whereas for x>0 not all commits are counted (because we would have to look into the future for some of them). The survival curves are estimated using Kaplan-Meier.
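
To make the Kaplan-Meier estimate concrete, here is a minimal sketch in plain Python. It is not the tool's own code: the function name, the input layout (one duration per line of code plus a flag saying whether the line was actually removed) and the idea of treating still-present lines as censored observations are assumptions made for illustration.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for every time t at which at least one removal happened.

    durations: how long each line was followed (e.g. in years).
    observed:  True if the line was removed at that time, False if it was
               still present when observation ended (a censored data point).
    """
    removals = Counter(t for t, seen in zip(durations, observed) if seen)
    leaving = Counter(durations)  # removed or censored, both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = removals.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve
```

Censoring is what lets recent commits contribute to small x without being counted as removed at large x: a line that has only been observed for one year drops out of the risk set after year one instead of dragging the curve down.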

You can also add an exponential fit:
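
As a rough illustration of what an exponential fit means here, the sketch below fits S(t) = exp(-k t) to survival points by least squares on the logarithm. This is one simple way to do it and an assumption on my part; it is not claimed to be the method behind --exp-fit, and the function names are hypothetical.

```python
import math


def fit_exponential_decay(times, survival):
    """Fit ln S(t) = -k * t through the origin and return the decay rate k."""
    num = 0.0
    den = 0.0
    for t, s in zip(times, survival):
        if s <= 0:
            continue  # the logarithm is undefined for non-positive survival
        num += t * math.log(s)
        den += t * t
    if den == 0:
        raise ValueError("need at least one point with t != 0 and survival > 0")
    return -num / den


def half_life(k):
    """Time after which half of the code is gone under S(t) = exp(-k t)."""
    return math.log(2) / k
```

The half-life is a convenient single number for comparing repositories: a larger value means code tends to stay around longer.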


Linux – stack plot:


This curve is produced by the git-of-theseus-stack-plot script and shows the total number of lines in a repo broken down into cohorts by the year the code was added.
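
The relationship between cohorts and totals can be sketched in a few lines of Python. The layout assumed here ("y" holding one series per cohort, aligned with the timestamps in "ts") is taken from the sample JSON files quoted in the issues further down; the function name is hypothetical.

```python
def totals_and_shares(y):
    """Given one series per cohort, return (totals, shares).

    totals[i]    is the total number of lines at timestamp i.
    shares[c][i] is the percentage of those lines that belong to cohort c.
    """
    totals = [sum(column) for column in zip(*y)]
    shares = [
        [100.0 * value / total if total else 0.0
         for value, total in zip(series, totals)]
        for series in y
    ]
    return totals, shares
```

The totals are the upper edge of the stack plot, and the shares correspond to a plot normalized to 100%.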

Node – stack plot:


Rails – stack plot:


Tensorflow – stack plot:


Rust – stack plot:


Plotting other stuff

git-of-theseus-analyze will write exts.json, cohorts.json and authors.json. You can run git-of-theseus-stack-plot authors.json to plot author statistics as well, or git-of-theseus-stack-plot exts.json to plot file extension statistics. For author statistics, you might want to create a .mailmap file in the root directory of the repository to deduplicate authors. If you need to create a .mailmap file the following command can list the distinct author-email combinations in a repository:

Mac / Linux

git log --pretty=format:"%an %ae" | sort | uniq

Windows Powershell

git log --pretty=format:"%an %ae" | Sort-Object | Select-Object -Unique

For instance, here's the author statistics for Kubernetes:


You can also normalize it to 100%. Here's author statistics for Git:


Other stuff

Markovtsev Vadim implemented a very similar analysis that claims to be 20%-6x faster than Git of Theseus. It's named Hercules and there's a great blog post about all the complexity going into the analysis of Git history.

git-of-theseus's People

Contributors

arnebachmann, austinwise, ctb, davidfdriscoll, erikbern, ferologics, jackdanger, marcaurele, mmerickel, monperrus, northbadge, owenlamont, rudisimo, willingc


git-of-theseus's Issues

The link to the blog post is broken

I am experiencing an issue while attempting to access the content through the following link: blog. It appears that the resource is currently unavailable, resulting in a 404 error.

In addition, I would like to inquire about the status of the matter referenced as Issue #79. Is there any progress or update available on this particular concern?

AttributeError when running `git-of-theseus-stack-plot authors.json`

I ran git-of-theseus-analyze . and got several json files
then ran git-of-theseus-stack-plot authors.json and got this error.
How could I fix it?

authors.json

{"y": [[8, 15, 28, 29, 29, 55, 64, 73, 90, 121, 122, 121, 121], [338, 363, 441, 503, 511, 559, 558, 556, 556, 621, 642, 722, 723]], "labels": ["Rumble Huang", "khiav reoy"], "ts": ["2016-12-29T07:04:30", "2017-01-05T07:33:51", "2017-01-12T17:15:48", "2017-02-07T07:55:19", "2017-02-14T16:28:30", "2017-02-24T03:29:36", "2017-03-11T15:08:56", "2017-04-02T16:15:51", "2017-04-13T07:31:24", "2017-06-21T11:52:27", "2017-07-30T12:27:49", "2017-09-12T12:59:13", "2018-02-12T03:45:04"]}

Error Messages

Traceback (most recent call last):
  File "/usr/local/bin/git-of-theseus-stack-plot", line 3, in <module>
    stack_plot()
  File "/Library/Python/2.7/site-packages/git_of_theseus/stack_plot.py", line 60, in stack_plot
    colors=generate_n_colors(len(labels)))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/pyplot.py", line 3165, in stackplot
    ret = ax.stackplot(x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 6844, in stackplot
    return mstack.stackplot(self, x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/stackplot.py", line 101, in stackplot
    **kwargs))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 7059, in fill_between
    collection = mcoll.PolyCollection(polys, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 742, in __init__
    Collection.__init__(self, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 128, in __init__
    self.update(kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 739, in update
    raise AttributeError('Unknown property %s' % k)
AttributeError: Unknown property labels

Failed dependencies

I assume the correct script to run is analyze.py:

[~/workspace/git-of-theseus] (master)$ python analyze.py gensim
Traceback (most recent call last):
  File "analyze.py", line 2, in <module>
    import argparse, git, datetime, numpy, traceback, time, os, fnmatch, json, progressbar
ImportError: No module named git

Trying to install git:

[~/workspace/git-of-theseus] (master)$ pip install git
Downloading/unpacking git
  Could not find any downloads that satisfy the requirement git
Cleaning up...
No distributions at all found for git

How do I install & run this tool?

(Python 2.7.10, pip 9.0.1)

Windows shortcut

Great tool!

On Windows when installing in Python 3 via pip, I get a corrupt console script.
It is actually just a Python file without extension that contains

#!d:\apps\miniconda3\python.exe
from git_of_theseus import analyze
analyze()

But should be a little .exe that loads Python and the library instead, as usual.

How is that console script generated? pip and setuptools should take care of it automatically, so I'm wondering how this went wrong.

Module invocation works only partially

One more hint for the docs:
It's possible to invoke the tool via python -m git_of_theseus.analyze <repo-path>, which may be interesting for PyPy use or for broken/complex Python environments.

It also works for python -m git_of_theseus.survival_plot survival.json --exp-fit.

Invoking python -m git_of_theseus.stack_plot cohorts.json leads to a warning (but generates the plot nevertheless):

D:\apps\Miniconda3\lib\runpy.py:125: RuntimeWarning: 'git_of_theseus.stack_plot' found in sys.modules after import of package 'git_of_theseus', but prior to execution of 'git_of_theseus.stack_plot'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))

There is some discussion on the topic here.

Got KeyError: 'y' when analyzing survival.json

Analyzing the other JSON files works well; only survival.json raises an error.

$ git-of-theseus-stack-plot survival.json 
Traceback (most recent call last):
  File "/usr/local/bin/git-of-theseus-stack-plot", line 10, in <module>
    sys.exit(stack_plot_cmdline())
  File "/usr/local/lib/python2.7/site-packages/git_of_theseus/stack_plot.py", line 79, in stack_plot_cmdline
    stack_plot(**kwargs)
  File "/usr/local/lib/python2.7/site-packages/git_of_theseus/stack_plot.py", line 37, in stack_plot
    y = numpy.array(data['y'])
KeyError: 'y'
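
The traceback suggests that survival.json simply has no top-level "y" key, which fits the README: that file is meant for git-of-theseus-survival-plot, not the stack plot. A hedged sketch of a pre-flight check follows. The set of expected keys is inferred from the sample authors.json and cohorts.json files quoted in these issues, and the helper name is hypothetical.

```python
# Keys seen in the sample stack-plot inputs quoted in these issues (an inference).
STACK_PLOT_KEYS = {"y", "ts", "labels"}


def missing_stack_plot_keys(data):
    """Return the sorted list of keys a stack plot would need but `data` lacks."""
    if not isinstance(data, dict):
        return sorted(STACK_PLOT_KEYS)
    return sorted(STACK_PLOT_KEYS - set(data))
```

Calling this on the result of json.load before plotting would turn the bare KeyError into a message that names the missing keys and can point the user at the right plotting command.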

Unicode emojis

Be aware that high Unicode code points are not available in many fonts. Try to stick to 16-bit code points to achieve better compatibility.

Here are the values I use in one of my Python projects:

from typing import List

# first list is ASCII, second is 16-bit Unicode, third list has some more exotic symbols
PROGRESS_MARKER: List[str] = [
    "|/-\\",
    "\u2581\u2582\u2583\u2584\u2585\u2586\u2587\u2588\u2587\u2586\u2585\u2584\u2583\u2582",
    "\U0001f55b\U0001f550\U0001f551\U0001f552\U0001f553\U0001f554\U0001f555\U0001f556\U0001f557\U0001f558\U0001f559\U0001f55a\U0001f559\U0001f558\U0001f557\U0001f556\U0001f555\U0001f554\U0001f553\U0001f552\U0001f551\U0001f550",
]
DOT_SYMBOL: str = "\u00b7"
MULT_SYMBOL: str = "\u00d7"
CROSS_SYMBOL: str = "\u2716"
CHECKMARK_SYMBOL: str = "\u2714"
PLUSMINUS_SYMBOL: str = "\u00b1"  # alternative for "~"
ARROW_SYMBOL: str = "\u2799"  # alternative for "*"
MOVE_SYMBOL: str = "\u21cc"  # alternative for "#", or use "\U0001F5C0", which is very unlikely to be in any console font

In addition, not all shells/consoles are set to Unicode.
E.g. on Windows 7 you need to call chcp 65001 to enable Unicode, or do some registry hacks to enable it by default. On Linux, I guess UTF-8 has been standard for at least 10 years in most if not all distributions. Not sure about Windows 10 or PowerShell defaults.
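
One way to act on this at runtime is to try the fancier symbols first and fall back to ASCII when the console encoding cannot represent them. The sketch below is an assumption about how that could look, not code from the project; first_encodable is a hypothetical helper. It only checks the encoding, so it cannot tell whether the console font actually has a glyph.

```python
def first_encodable(candidates, encoding):
    """Return the first string in `candidates` that `encoding` can represent."""
    for text in candidates:
        try:
            text.encode(encoding)
        except UnicodeEncodeError:
            continue  # this candidate needs characters the encoding lacks
        return text
    raise ValueError("no candidate can be encoded as %s" % encoding)
```

A caller would typically pass the candidates from most to least exotic, together with sys.stdout.encoding (falling back to "ascii" when that is None).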

"Failed to parse delta stream" error

Hi! This looks awesome :) I was excited to try it out but ran into this error on a repo:

Listing all commits
- 162149 Elapsed Time: 0:01:02
Backtracking the master branch
/ 96504 Elapsed Time: 0:00:26
Counting total entries to analyze
  0% (  0 of 338) |          | Elapsed Time: 0:00:00 ETA:  --:--:--
Traceback (most recent call last):
  File "analyze.py", line 61, in <module>
    for entry in get_entries(commit):
  File "analyze.py", line 50, in get_entries
    return [entry for entry in commit.tree.traverse()
  File "/usr/local/lib/python2.7/site-packages/git/objects/util.py", line 296, in traverse
    addToStack( stack, item, branch_first, nd )
  File "/usr/local/lib/python2.7/site-packages/git/objects/util.py", line 264, in addToStack
    lst = self._get_intermediate_items( item )
  File "/usr/local/lib/python2.7/site-packages/git/objects/tree.py", line 138, in _get_intermediate_items
    return tuple(index_object._iter_convert_to_object(index_object._cache))
  File "/usr/local/lib/python2.7/site-packages/gitdb/util.py", line 237, in __getattr__
    self._set_cache_(attr)
  File "/usr/local/lib/python2.7/site-packages/git/objects/tree.py", line 145, in _set_cache_
    self._cache = tree_entries_from_data(ostream.read())
  File "/usr/local/lib/python2.7/site-packages/gitdb/base.py", line 138, in read
    return self[3].read(size)
  File "/usr/local/lib/python2.7/site-packages/gitdb/stream.py", line 489, in read
    bl = self._size - self._br      # bytes left
  File "/usr/local/lib/python2.7/site-packages/gitdb/util.py", line 237, in __getattr__
    self._set_cache_(attr)
  File "/usr/local/lib/python2.7/site-packages/gitdb/stream.py", line 384, in _set_cache_too_slow_without_c
    dcl = connect_deltas(self._dstreams)
RuntimeError: Failed to parse delta stream

Any ideas or first steps to debug?

Make progress bar optional

E.g. via a --no-progress option (or a progress=True default argument), useful for interactive use via analyze() (cf. #51)

Multi-repo

Hello

I really like the visualization generated by this project! Is it possible to support several git repositories instead of just one?

Thanks
Gaetan

AttributeError: Unknown property labels

Hey,

First of all kudos for a really nice project 👍

I have an issue when running stack_plot.py. It throws:

git-of-theseus(master|…) ➤ python stack_plot.py cohorts.json
Traceback (most recent call last):
  File "stack_plot.py", line 31, in <module>
    labels=data['labels'])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/pyplot.py", line 3165, in stackplot
    ret = ax.stackplot(x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 6844, in stackplot
    return mstack.stackplot(self, x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/stackplot.py", line 101, in stackplot
    **kwargs))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 7059, in fill_between
    collection = mcoll.PolyCollection(polys, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 742, in __init__
    Collection.__init__(self, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 128, in __init__
    self.update(kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 739, in update
    raise AttributeError('Unknown property %s' % k)
AttributeError: Unknown property labels

Source git repo is on master
Running on macOS (10.12.1)

git-of-theseus(master|…) ➤ python -V
Python 2.7.10

Any idea why it might be failing like that?

stack_plot.py: TypeError

I was following along with the readme, but python stack_plot.py cohorts.json gave me an error after the previous command succeeded:

/home/pushcx/.virtualenvs/git-of-theseus/lib/python3.5/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)
Traceback (most recent call last):
  File "stack_plot.py", line 10, in <module>
    data = json.load(open(args.inputs))
TypeError: invalid file: ['cohorts.json']

Playing with it, the result of args.inputs is an array when open wants a str. I'm not familiar with argparse and I can't see how this code would ever have worked, unless perhaps it's a Python 3/2 incompatibility. Maybe you want args.inputs[0], but more likely there's something going on I don't understand.

Wonderful blog post, btw, and I really appreciate that you made this code available so I can play with it on a few repos. Thanks!

Error running stack plot

$ python stack_plot.py cohorts.json
Traceback (most recent call last):
  File "stack_plot.py", line 15, in <module>
    labels=None)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/pyplot.py", line 3165, in stackplot
    ret = ax.stackplot(x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 6844, in stackplot
    return mstack.stackplot(self, x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/stackplot.py", line 101, in stackplot
    **kwargs))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 7059, in fill_between
    collection = mcoll.PolyCollection(polys, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 742, in __init__
    Collection.__init__(self, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 128, in __init__
    self.update(kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 739, in update
    raise AttributeError('Unknown property %s' % k)
AttributeError: Unknown property labels

Survival plot/stack plot mismatch?

Hi, if I run git-of-theseus on, e.g., https://github.com/mil-tokyo/webdnn, I get a stack plot/survival plot pair that look like this:

stack_plot
survival_plot

It seems like newer commits are being treated as discarded as we move past their time in existence (even though they haven't been reverted). So since it's a relatively new repo, the survival plot just goes straight down. Does this look like intended behaviour?

clarification about survival plots

Those visualizations are awesome. I have questions about the survival plots. The legend says % of commits still present over time. Is it the full commit? Or the percentage of lines?

Also, about the "over time", does it mean:

  • % of commits present at least n years?
  • % of commits present exactly n years?
  • or something else?

Automatically determine default branch if not specified

We have a case of a repo with no master branch, but another branch is marked and checked out by default.
In case I don't use the --branch <name> option for analyze, I get an error which is hard to understand at first:
stderr: 'fatal: bad revision 'master'.

One option would be to display a warning and simply use the default branch instead.
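
The suggested fallback could look roughly like the sketch below. It is a proposal in the spirit of this issue, not the tool's current behaviour; resolve_branch and its parameters are hypothetical, and the caller is assumed to have looked up the branch list and the checked-out branch already (for example with git symbolic-ref --short HEAD).

```python
import warnings


def resolve_branch(requested, branches, checked_out):
    """Pick the branch to analyze.

    requested:   value of --branch, or None if the flag was not given.
    branches:    names of the branches that exist in the repository.
    checked_out: name of the branch that is currently checked out.
    """
    if requested is not None:
        if requested not in branches:
            raise ValueError("branch %r does not exist" % requested)
        return requested
    if "master" in branches:
        return "master"
    # No --branch given and no master: warn and use the checked-out branch.
    warnings.warn("no 'master' branch found, using %r instead" % checked_out)
    return checked_out
```

An explicit --branch still wins, and a typo in it fails loudly instead of silently analyzing something else.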

How does the repo param work?

I tried the command below and it failed. It seems it can only take a local repo as a param?
C:\Users\Administrator>git-of-theseus-analyze https://github.com/erikbern/git-of-theseus
It throws:
File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\git_of_theseus\analyze.py", line 185, in analyze_cmdline
analyze(**kwargs)
File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\git_of_theseus\analyze.py", line 30, in analyze
repo = git.Repo(repo)
File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\git\repo\base.py", line 124, in __init__
raise NoSuchPathError(epath)
git.exc.NoSuchPathError: C:\Users\Administrator\https:\github.com\erikbern\git-of-theseus

What should I do if I am trying to analyze a GitHub repo?
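
The traceback shows the argument being treated as a local path, so a likely workaround is to clone the repository first and then analyze the clone. The sketch below only builds the two commands. The helper names, the list of URL prefixes and the clone directory are assumptions made for illustration.

```python
# Prefixes that indicate a remote location rather than a local path (a heuristic).
REMOTE_PREFIXES = ("http://", "https://", "ssh://", "git://", "git@")


def is_remote(repo_arg):
    """Guess whether the argument is a remote URL instead of a local path."""
    return repo_arg.startswith(REMOTE_PREFIXES)


def plan_commands(repo_arg, clone_dir):
    """Return the commands to run: clone first when the argument is remote."""
    if not is_remote(repo_arg):
        return [["git-of-theseus-analyze", repo_arg]]
    return [
        ["git", "clone", repo_arg, clone_dir],
        ["git-of-theseus-analyze", clone_dir],
    ]
```

For the command in this issue, that means running git clone on the URL and then pointing git-of-theseus-analyze at the resulting directory.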

Got InvalidGitRepositoryError when analyzing survival.json

$ git-of-theseus-analyze survival.json 
Traceback (most recent call last):
  File "/usr/local/bin/git-of-theseus-analyze", line 10, in <module>
    sys.exit(analyze_cmdline())
  File "/usr/local/lib/python2.7/site-packages/git_of_theseus/analyze.py", line 185, in analyze_cmdline
    analyze(**kwargs)
  File "/usr/local/lib/python2.7/site-packages/git_of_theseus/analyze.py", line 30, in analyze
    repo = git.Repo(repo)
  File "/usr/local/lib/python2.7/site-packages/git/repo/base.py", line 183, in __init__
    raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /Users/khiav223577/Desktop/PaGamO2.K/survival.json

Looks like it doesn't scale well

TL;DR:

 43% (11870586 of 27210192) |###############                    | Elapsed Time: 7 days, 19:07:58 ETA:  40 days, 7:15:53

And, there's no data available after stopping the process, meaning that it will start over on the next run. (So, I think I'm not going to try it again, for now...)

Please let me know if I can provide more info that can help with improving scalability.

key errors

Hello I keep seeing errors like these when running

Traceback (most recent call last):
  File "analyze.py", line 71, in get_file_histogram
    cohort = commit2cohort[old_commit.hexsha]
KeyError: u'58ec10229f8ae648bd978d0aa1e770daa289e89c'
Traceback (most recent call last):
  File "analyze.py", line 71, in get_file_histogram
    cohort = commit2cohort[old_commit.hexsha]
KeyError: u'58ec10229f8ae648bd978d0aa1e770daa289e89c'
 99% (454634 of 455317) |################## | Elapsed Time: 0:14:37 ETA: 0:00:01
Traceback (most recent call last):
  File "analyze.py", line 71, in get_file_histogram
    cohort = commit2cohort[old_commit.hexsha]
KeyError: u'2643750ed2a84571c43ad32278de4a00d24bce46'
Traceback (most recent call last):
  File "analyze.py", line 71, in get_file_histogram
    cohort = commit2cohort[old_commit.hexsha]
KeyError: u'2643750ed2a84571c43ad32278de4a00d24bce46'
 99% (455084 of 455317) |################## | Elapsed Time: 0:14:37 ETA: 0:

This is being run on https://github.com/LibraryOfCongress/bagit-java

Survival going up

I ran survival_plot.py over ReproZip's 1.0.x branch and got this:

survival_plot

I imagine the graph going back up might be caused by duplication? (there are several packages in the repository which share some code) It's still surprising 😅

Theseus Seal Of Approval™️

make it so that if there is actually no initial code left, then repository gets a "Theseus Seal Of Approval™️", which is displayed somewhere in the corner of stack plot or something.

Python is not installed as a framework

When trying to create the .png file I get the following:

(env)[jscancella@johns-air git-of-theseus (master)]$ python stack_plot.py cohorts.json
/Users/jscancella/work/git-of-theseus/env/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
Traceback (most recent call last):
  File "stack_plot.py", line 1, in <module>
    import sys, seaborn, dateutil.parser, numpy, json
  File "/Users/jscancella/work/git-of-theseus/env/lib/python2.7/site-packages/seaborn/__init__.py", line 6, in <module>
    from .rcmod import *
  File "/Users/jscancella/work/git-of-theseus/env/lib/python2.7/site-packages/seaborn/rcmod.py", line 8, in <module>
    from . import palettes, _orig_rc_params
  File "/Users/jscancella/work/git-of-theseus/env/lib/python2.7/site-packages/seaborn/palettes.py", line 12, in <module>
    from .utils import desaturate, set_hls_values, get_color_cycle
  File "/Users/jscancella/work/git-of-theseus/env/lib/python2.7/site-packages/seaborn/utils.py", line 12, in <module>
    import matplotlib.pyplot as plt
  File "/Users/jscancella/work/git-of-theseus/env/lib/python2.7/site-packages/matplotlib/pyplot.py", line 114, in <module>
    _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
  File "/Users/jscancella/work/git-of-theseus/env/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 32, in pylab_setup
    globals(),locals(),[backend_name],0)
  File "/Users/jscancella/work/git-of-theseus/env/lib/python2.7/site-packages/matplotlib/backends/backend_macosx.py", line 24, in <module>
    from matplotlib.backends import _macosx
RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see 'Working with Matplotlib in Virtual environments' in the Matplotlib FAQ

ImportError: No module named git

I get this after installing the requirements:

~/src/git-of-theseus[master]$ python analyze.py ../clojure
Traceback (most recent call last):
  File "analyze.py", line 2, in <module>
    import argparse, git, datetime, numpy, traceback, time, os, fnmatch, json, progressbar
ImportError: No module named git

mailmap processing assumes email addresses are valid

I've pointed git-of-theseus at https://github.com/mozilla/gecko-dev/, sadly it fails.

First issue: not enough values to unpack in analyze.py:555

mailmap_name, mailmap_email = mail_mapped_author_email[:-1].split(" <", maxsplit=1)

The problem here is gecko-dev has authors without an email address, which end up as an empty author_name in this function; this was easily fixed by:

- pre_mailmap_author_email = f"{author_name} <{author_email}>"
+ pre_mailmap_author_email = f"{author_name or author_email} <{author_email}>"

Second issue: unknown switch 'f' when running git check-mailmap, also triggered at analyze.py:555

Looks like there's an author which is just -f. Yeah, I don't know either. mozilla/gecko-dev@1ea9e41
So the command line which ends up running is git check-mailmap -f, which fails. This should be git check-mailmap -- -f.
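
Both fixes proposed in this issue can be sketched as small helpers. The function names are hypothetical and this is not the project's code: the first mirrors the one-line diff above, the second puts the -- separator in front of the contact so that a value like -f is passed as an argument rather than parsed as a switch.

```python
def format_contact(author_name, author_email):
    """Build 'Name <email>', falling back to the email when the name is empty."""
    return f"{author_name or author_email} <{author_email}>"


def check_mailmap_argv(contact):
    """Command line for git check-mailmap that is safe for contacts starting with '-'."""
    return ["git", "check-mailmap", "--", contact]
```

Passing the list form to subprocess.run also avoids any shell quoting issues with unusual author strings.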

The easy work-around is to delete the .mailmap file.

Allow for more than 20 date cohorts

When plotting a repo, I noticed that after 20 months it began to group all other months into "other". I remember that an upper limit was mentioned somewhere, but I don't remember where. How can I change this setting so that I get more than 20 months, and the rest isn't just dumped into one slice?
stack_plot

Statistics are incomplete in multi-root repositories

I have a large repository that comes from a merge of several (four or more) old git repos. In this case only the first parent is followed by analyze.py (from the line "i, commit = i+1, commit.parents[0]"), and the other parents are skipped. Would it be straightforward to recurse down on each parent in this section?
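
The difference between following only the first parent and visiting every parent can be shown on a toy commit graph. This sketch uses a plain dict instead of GitPython objects, and the function names are hypothetical. It only illustrates reachability; how the extra commits would be ordered in time for the analysis is a separate question.

```python
def first_parent_chain(parents, head):
    """Follow parents[commit][0] from head until a root commit is reached."""
    chain = []
    commit = head
    while commit is not None:
        chain.append(commit)
        ps = parents.get(commit, [])
        commit = ps[0] if ps else None
    return chain


def all_ancestors(parents, head):
    """Visit head and every commit reachable through any parent (depth-first)."""
    seen = set()
    order = []
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        order.append(commit)
        # Push in reverse so the first parent is visited first.
        stack.extend(reversed(parents.get(commit, [])))
    return order
```

In a repository created by merging two unrelated histories, the first function never sees the second history at all, which matches the behaviour described in this issue.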

Very unclear and --ignore probably not working.

 --ignore IGNORE      File patterns that should be ignored (can provide        
                       multiple, will each subtract independently)

This option is enigmatic to me: does it take a regex pattern or something else? Does it work under Windows? (When I specify anything, nothing is scanned.)

How can I skip files matching regexes like these:
.+pb2.py^
.+pb2.pyi^
.+pb2_grpc.py^

I tried to do it but I cannot work out how to specify it. I read the code and it is not clear to me. There are some patterns, but they do not look like regexes.

I have no idea why the pattern --ignore pyi excludes all files.
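
The import list of analyze.py shown in other tracebacks on this page includes fnmatch, which hints that the patterns may be shell-style globs rather than regular expressions. Under that assumption, the three regexes above would translate to the globs in this sketch. Whether the tool applies them exactly this way is not confirmed here, and is_ignored is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Glob equivalents of the three regexes from this issue (an assumption).
IGNORE_PATTERNS = ["*pb2.py", "*pb2.pyi", "*pb2_grpc.py"]


def is_ignored(path, patterns=IGNORE_PATTERNS):
    """True if the path matches at least one glob pattern, case-sensitively."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

In glob syntax * matches any run of characters, including slashes in fnmatch, so no leading .+ or trailing anchor is needed. If globs are indeed what the flag expects, each pattern would go into its own --ignore argument.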

Open the API

Allow manual provision of arguments to analyze.analyze() instead of getting them only from sys.argv.
Now I have to fiddle with sys.argv at runtime to invoke analyze().

Update analysis

The blog post was written ~6 years ago. Can the analyses be rerun, so we see how the selected few projects evolved since then? E.g. angular was very young back then, but not any more.

`git-of-theseus-stack-plot cohorts.json` fails if using matplotlib 3

Running git-of-theseus-stack-plot cohorts.json fails on the public release of macOS Mojave, running Python 3.7. On the other hand, git-of-theseus-survival-plot survival.json works just fine.

I'm getting a traceback with a TypeError:

$ git-of-theseus-stack-plot cohorts.json

Traceback (most recent call last):
  File "/usr/local/bin/git-of-theseus-stack-plot", line 11, in <module>
    sys.exit(stack_plot_cmdline())
  File "/usr/local/lib/python3.7/site-packages/git_of_theseus/stack_plot.py", line 79, in stack_plot_cmdline
    stack_plot(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/git_of_theseus/stack_plot.py", line 56, in stack_plot
    pyplot.stackplot(ts, numpy.array(y), labels=labels, colors=colors)
  File "/usr/local/lib/python3.7/site-packages/matplotlib/pyplot.py", line 2836, in stackplot
    return gca().stackplot(x=x, *args, data=data, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/matplotlib/__init__.py", line 1785, in inner
    return func(ax, *args, **kwargs)
TypeError: stackplot() got multiple values for argument 'x'

And here is my cohorts.json:

{"y": [[2467, 3052, 3488, 3506, 3907, 4333, 2194, 2190, 2190, 2168, 2045, 2045, 1879, 1846, 1830, 1830, 1800, 1685, 1685, 1655, 1585, 1393, 1393, 1372, 1371, 1314, 1313, 1284, 1284, 1284, 1250, 1222, 1222, 1222, 1108, 1108, 1048, 802], [0, 0, 0, 0, 0, 0, 1282, 1337, 1337, 1496, 10367, 10948, 12419, 13469, 13642, 13859, 14026, 14467, 14471, 14727, 10163, 10957, 10976, 10961, 10940, 10874, 10833, 9934, 9931, 9931, 9894, 9660, 9650, 9611, 9219, 9219, 9082, 8748], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 203, 270, 755, 796, 1699, 1841, 1927, 2862, 4569, 4809, 5699, 6711, 6711, 7238, 7986]], "ts": ["2016-05-12T22:25:54", "2016-05-20T03:55:10", "2016-05-27T17:38:18", "2016-06-19T19:13:41", "2016-06-26T22:07:09", "2016-12-02T04:42:04", "2017-02-09T06:51:20", "2017-02-22T06:59:31", "2017-03-06T04:11:52", "2017-03-19T00:37:59", "2017-07-20T05:51:33", "2017-07-27T06:11:32", "2017-08-03T07:47:49", "2017-08-11T06:37:24", "2017-08-19T01:51:04", "2017-08-28T01:37:28", "2017-09-05T01:58:38", "2017-09-12T21:15:10", "2017-09-24T19:15:07", "2017-10-30T22:17:18", "2017-12-09T19:18:28", "2017-12-17T01:55:34", "2017-12-24T04:43:51", "2018-01-03T22:42:32", "2018-03-09T02:20:24", "2018-03-19T08:01:01", "2018-04-08T03:12:46", "2018-04-18T04:21:10", "2018-05-13T22:51:15", "2018-05-25T17:14:02", "2018-06-04T01:12:04", "2018-06-11T02:17:21", "2018-06-18T09:20:56", "2018-07-09T06:13:45", "2018-07-23T07:42:13", "2018-08-04T05:20:25", "2018-09-17T13:01:48", "2018-09-26T11:40:21"], "labels": ["Code added in 2016", "Code added in 2017", "Code added in 2018"]}

Let me know if there's any other information that would help!

"Unknown property labels" in stack_plot

Great project! I was able to get through the analysis fine, but found this when I tried stack_plot.py cohorts.json:

$ python stack_plot.py cohorts.json
Traceback (most recent call last):
  File "stack_plot.py", line 9, in <module>
    labels=data['labels'])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/pyplot.py", line 3165, in stackplot
    ret = ax.stackplot(x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 6844, in stackplot
    return mstack.stackplot(self, x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/stackplot.py", line 101, in stackplot
    **kwargs))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 7059, in fill_between
    collection = mcoll.PolyCollection(polys, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 742, in __init__
    Collection.__init__(self, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 128, in __init__
    self.update(kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 739, in update
    raise AttributeError('Unknown property %s' % k)
AttributeError: Unknown property labels

That said, python survival_plot.py survival.json worked out fine.

Add graphs for Django?

I was curious what they would look like so I generated them. This takes a bit of time; adding them to the pics/ directory will save others a few hours of CPU time.
stack_plot
survival_plot

IOError: [Errno 2] No such file or directory: 'c'

I am getting an IOError while executing python stack_plot.py cohorts.json

Error trace:

Traceback (most recent call last):
  File "stack_plot.py", line 10, in <module>
    data = json.load(open(args.inputs[0]))
IOError: [Errno 2] No such file or directory: 'c'

The error was resolved after I changed args.inputs[0] to args.inputs. I noticed that args.inputs was changed to args.inputs[0] in 080fd67 after #7 was reported. I checked the documentation for python2.7 and ran the following snippet:

import argparse
parser = argparse.ArgumentParser(description='Sample')
parser.add_argument('inputs')
args = parser.parse_args()
print args.inputs, type(args.inputs)

When I ran python sample.py sample it printed sample <type 'str'> on stdout. It seems args.inputs still returns a string.
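
The behaviour described above can be reproduced in a few lines. This is a minimal sketch, not the project's actual code: both parsers are hypothetical, and nargs='+' is just one way to make indexing with [0] safe, assumed here rather than taken from the repository.

```python
import argparse

# Without nargs, argparse stores a positional argument as a plain string,
# so indexing it with [0] yields the first character of the file name.
single = argparse.ArgumentParser(description='Sample')
single.add_argument('inputs')
args_single = single.parse_args(['cohorts.json'])
first_char = args_single.inputs[0]

# With nargs='+', argparse collects one or more values into a list,
# so [0] yields the first file name instead.
multi = argparse.ArgumentParser(description='Sample')
multi.add_argument('inputs', nargs='+')
args_multi = multi.parse_args(['cohorts.json'])
first_file = args_multi.inputs[0]

print(repr(first_char), repr(first_file))
```

The first value is the single character that open() was asked to open in the traceback above; the second is the full file name.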

Multi repo support for stack plot

The survival plot is able to visualise the results of multiple outdirs. It would be really great if the stack plot also supported this!

Can't generate cohort plot

I'm getting the following error when I run python stack_plot.py cohorts.json.
I was able to successfully generate cohorts.json with python analyze.py <path to repo>

Traceback (most recent call last):
  File "stack_plot.py", line 15, in <module>
    labels=data['labels'])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/pyplot.py", line 3165, in stackplot
    ret = ax.stackplot(x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 6844, in stackplot
    return mstack.stackplot(self, x, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/stackplot.py", line 101, in stackplot
    **kwargs))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 7059, in fill_between
    collection = mcoll.PolyCollection(polys, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 742, in __init__
    Collection.__init__(self, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 128, in __init__
    self.update(kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 739, in update
    raise AttributeError('Unknown property %s' % k)
AttributeError: Unknown property labels

CI/CD integration fails

Hi,

I'm having issues using git-of-theseus-analyze from a GitLab CI/CD job.
The job is set to git clone (not fetch), so I would assume that the entire repository is present...
When it didn't work, I added a git checkout main command right before the git-of-theseus-analyze command.
Either way, I can't get it running; I even tried --branch %CI_COMMIT_SHA%:

branch 'main' set up to track 'origin/main'.
Switched to a new branch 'main'
Listing all commits                                    : 50 Commits [00:00, 337.65 Commits/s]
Backtracking the master branch                         : 50 Commits [00:00, 614.77 Commits/s]
Traceback (most recent call last):
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\Scripts\git-of-theseus-analyze.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\site-packages\git_of_theseus\analyze.py", line 585, in analyze_cmdline
    analyze(**kwargs)
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\site-packages\git_of_theseus\analyze.py", line 297, in analyze
    if last_date is None or commit.committed_date < last_date - interval:
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\site-packages\gitdb\util.py", line 253, in __getattr__
    self._set_cache_(attr)
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\site-packages\git\objects\commit.py", line 199, in _set_cache_
    _binsha, _typename, self.size, stream = self.repo.odb.stream(self.binsha)
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\site-packages\git\db.py", line 48, in stream
    hexsha, typename, size, stream = self._git.stream_object_data(bin_to_hex(binsha))
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\site-packages\git\cmd.py", line 1269, in stream_object_data
    hexsha, typename, size = self.__get_object_header(cmd, ref)
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\site-packages\git\cmd.py", line 1239, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "C:\ProgramData\mambaforge\envs\test-autocook-3-10\lib\site-packages\git\cmd.py", line 1199, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'2e80fbcd2417ada7a719febe26bb3e8dbae0092a' could not be resolved, git returned: b'2e80fbcd2417ada7a719febe26bb3e8dbae0092a missing'

Do you have any hints?
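
One thing that may be worth ruling out, purely as an assumption that nothing in this report confirms, is that the CI checkout is a shallow clone, in which case older commit objects are simply not present locally. Git records the boundary commits of a shallow clone in a file named shallow inside the git directory, so a rough check could look like the sketch below. The helper name looks_shallow is hypothetical, and it assumes a regular non-bare checkout where .git is a directory.

```python
import os


def looks_shallow(repo_path):
    """Return True if repo_path/.git/shallow exists.

    Hypothetical helper: assumes a normal (non-bare) checkout in which
    .git is a directory rather than a file pointing elsewhere.
    """
    return os.path.isfile(os.path.join(repo_path, '.git', 'shallow'))


if __name__ == '__main__':
    # Check the current working directory, e.g. inside the CI job.
    print(looks_shallow(os.getcwd()))
```

If this prints True, running git fetch --unshallow before the analysis would fetch the missing history; whether that resolves the error above is untested.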

Add a plot with sum or average file size per file type

In addition to the existing plot with LOC per file type, add a plot with file size per file type.

Showing the sum of file sizes per file type would help because changes in the number of files could be accompanied by changes in the average file size (e.g. splitting vs. consolidating files).
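
To make the request concrete, here is a minimal sketch of the per-type numbers for a single snapshot of a working tree. It is not part of git-of-theseus and does not look at history: the function name size_per_file_type is hypothetical, and grouping by file extension is an assumption about what "file type" means here.

```python
import os
from collections import defaultdict


def size_per_file_type(root):
    """Map each file extension under root to (total bytes, average bytes)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune git's own metadata so it does not distort the numbers.
        dirnames[:] = [d for d in dirnames if d != '.git']
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks instead of following them
            ext = os.path.splitext(name)[1] or '(none)'
            totals[ext] += os.path.getsize(path)
            counts[ext] += 1
    return {ext: (totals[ext], totals[ext] / counts[ext]) for ext in totals}


if __name__ == '__main__':
    for ext, (total, average) in sorted(size_per_file_type('.').items()):
        print(ext, total, round(average, 1))
```

Reporting the total and the average side by side separates the two effects mentioned above: splitting a file leaves the total roughly unchanged while the average drops.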

git-of-theseus-stack-plot --help crashes

requiel@requiels-PC ~/gitRepo/vscode $ git-of-theseus-stack-plot --help
Traceback (most recent call last):
  File "/home/requiel/.virtualenvs/git-theseus/bin/git-of-theseus-stack-plot", line 3, in <module>
    stack_plot()
  File "/home/requiel/.virtualenvs/git-theseus/local/lib/python2.7/site-packages/git_of_theseus/stack_plot.py", line 42, in stack_plot
    args = parser.parse_args()
  File "/usr/lib/python2.7/argparse.py", line 1701, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "/usr/lib/python2.7/argparse.py", line 1733, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/usr/lib/python2.7/argparse.py", line 1939, in _parse_known_args
    start_index = consume_optional(start_index)
  File "/usr/lib/python2.7/argparse.py", line 1879, in consume_optional
    take_action(action, args, option_string)
  File "/usr/lib/python2.7/argparse.py", line 1807, in take_action
    action(self, namespace, argument_values, option_string)
  File "/usr/lib/python2.7/argparse.py", line 996, in __call__
    parser.print_help()
  File "/usr/lib/python2.7/argparse.py", line 2340, in print_help
    self._print_message(self.format_help(), file)
  File "/usr/lib/python2.7/argparse.py", line 2314, in format_help
    return formatter.format_help()
  File "/usr/lib/python2.7/argparse.py", line 281, in format_help
    help = self._root_section.format_help()
  File "/usr/lib/python2.7/argparse.py", line 211, in format_help
    func(*args)
  File "/usr/lib/python2.7/argparse.py", line 211, in format_help
    func(*args)
  File "/usr/lib/python2.7/argparse.py", line 517, in _format_action
    help_text = self._expand_help(action)
  File "/usr/lib/python2.7/argparse.py", line 603, in _expand_help
    return self._get_help_string(action) % params
ValueError: incomplete format
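
The last frame shows argparse applying % formatting to a help string, which is how it expands placeholders such as %(default)s. A literal percent sign in a help string therefore has to be written as %%. The sketch below shows the escaped form; the help text is hypothetical, since the report does not say which option's help string triggered the error.

```python
import argparse

parser = argparse.ArgumentParser(description='Sample')
# argparse expands help strings with the % operator (e.g. %(default)s),
# so a literal percent sign must be doubled. The wording is made up
# for illustration and is not taken from stack_plot.py.
parser.add_argument('--normalize', action='store_true',
                    help='Normalize the plot to 100%%')

help_text = parser.format_help()
print(help_text)
```

With the doubled percent sign, the rendered help contains a single % and --help no longer fails while formatting this option.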

It might also be worth adding information to the README about which options each command takes.
