Python Framework to analyse Git repositories

Home Page: http://pydriller.readthedocs.io/en/latest/

License: Apache License 2.0

pydriller's Introduction

PyDriller

PyDriller is a Python framework that helps developers in analyzing Git repositories. With PyDriller you can easily extract information about commits, developers, modified files, diffs, and source code.

Install

pip install pydriller

Quick usage

from pydriller import Repository

for commit in Repository('https://github.com/ishepard/pydriller').traverse_commits():
    print(commit.hash)
    print(commit.msg)
    print(commit.author.name)

    for file in commit.modified_files:
        print(file.filename, ' has changed')

Read the docs for more usage examples. A video is also available on YouTube.

How to cite PyDriller

@inproceedings{Spadini2018,
  address = {New York, New York, USA},
  author = {Spadini, Davide and Aniche, Maur\'{i}cio and Bacchelli, Alberto},
  booktitle = {Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2018},
  doi = {10.1145/3236024.3264598},
  isbn = {9781450355735},
  keywords = {2018,acm reference format,and alberto bacchelli,davide spadini,git,gitpython,maur\'{i}cio aniche,mining software repositories,pydriller,python},
  pages = {908--911},
  publisher = {ACM Press},
  title = {{PyDriller: Python framework for mining software repositories}},
  url = {http://dl.acm.org/citation.cfm?doid=3236024.3264598},
  year = {2018}
}

How to contribute

First clone the repository:

git clone https://github.com/ishepard/pydriller.git
cd pydriller

(Optional) Using a virtual environment is recommended. Before installing the requirements, run:

python3 -m venv venv
source venv/bin/activate

Then, install the requirements:

pip install -r requirements.txt

(Important) I tend to not accept Pull Requests without tests, so:

unzip the test-repos.zip file
inside are many "small repositories" that were manually created to test PyDriller. Use one of your choice to test your feature (check the existing tests for inspiration)
if none is suitable for testing your feature, create a new one. Be careful: if you create a new one, do not forget to upload a new zip file test-repos.zip that includes your new repository, otherwise the tests will fail.

To run the tests (using pytest):

unzip test-repos.zip
pip install -r test-requirements.txt
pytest

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 642954.

pydriller's Issues

GitRepository with remote URLs

Right now, to use GitRepository you need to pass a path to the repo, which in the case of remote URLs is a temporary folder created on the fly. The Commit object carries this information in the project_path field. However, this leads the developer to create one GitRepository for every visited commit:

for commit in RepositoryMining(remote_url).traverse_commits():
    gr = GitRepository(commit.project_path)

That is not optimal, since every time GitRepository is instantiated, a new request to Git is made.

Get tag of file

Is there a way to get the tag of a file in the repo? I can get the author but not the tag.

Holger

"Now the user needs to change from_commit and to_commit if reversed_order is True" may be misleading

Hi @ishepard ,

I noticed that in commit bec86e2 you swapped from_commit and to_commit in the test data for the process metrics, to reflect the changes you made in commit f84047a.

Although those tests pass, I think it could be misleading to tell the user to reverse the order of the two parameters, for two reasons:

Different applications already use PyDriller, and I imagine that some of them use reversed_order and need to change their code accordingly;
In the specific case of the implemented process metrics, the user does not know (and is not assumed to know) that we set reversed_order = True in the body of the metrics. When a user calls commits_count('pydriller', from_commit='commit_1', to_commit='commit_10'), he/she expects to analyse the commits in chronological order from commit_1 to commit_10.
The current version returns the wrong result, and the user has to call commits_count('pydriller', from_commit='commit_10', to_commit='commit_1') instead, which is IMHO a bit confusing.

Here are the alternatives that I propose:

Handle reversed_order in the constructor of RepositoryMining, by swapping from_commit and to_commit in the configuration dict when reversed_order == True, instead of letting the user do so;
We can leave everything as it is (but take care of point 1 in the previous list), and I can change the process metrics accordingly, by simply swapping the two variables when calling RepositoryMining in the body of each metric.
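
The first alternative could look roughly like the sketch below. It is only an illustration: normalize_range is a hypothetical helper, not part of PyDriller, and it assumes the constructor receives the three values as plain arguments.

```python
from typing import Optional, Tuple


def normalize_range(from_commit: Optional[str],
                    to_commit: Optional[str],
                    reversed_order: bool) -> Tuple[Optional[str], Optional[str]]:
    """Return the (from, to) pair to use internally.

    Hypothetical helper: the user always passes the older commit as
    from_commit and the newer one as to_commit; the swap needed for a
    reversed traversal happens here instead of in user code.
    """
    if reversed_order:
        return to_commit, from_commit
    return from_commit, to_commit


print(normalize_range("commit_1", "commit_10", reversed_order=True))
# → ('commit_10', 'commit_1')
```

With a helper like this, existing callers would keep passing the older commit first whether or not the traversal is reversed.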

Let me know what you think.

Question regarding mining strategy

This is not entirely about pydriller but I figured there might be enough experience here to help with this:

I am trying to get the history of a branch and have to work with the diffs of those commits.
Before I did this manually by using git log --first-parent and some parsing to get all the hashes that I am interested in and then using hash^! to get the diffs for each commit.

First, I am not entirely sure this did what I intended. Merges are occult stuff I can not seem to wrap my head around.
However, I tried to recreate the behaviour with RepositoryMining using order="reverse", only_in_branch=<origin_branch>, and while comparing the results of the two approaches I noticed they don't return the same commits.

Shouldn't only_in_branch=<origin_branch> lead to the same behaviour as --first-parent?
And maybe someone can help me with understanding the merging stuff. Am I right in thinking that I do have to include merge commits in the history, otherwise there will be jumps I can not follow?
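
To make the question concrete, here is a small sketch of what "following only first parents" means, on a toy history stored in a dictionary. It uses neither Git nor PyDriller; the commit names and both function names are made up for illustration, and whether only_in_branch behaves like either function is exactly the open question.

```python
from typing import Dict, List, Optional, Set

# Toy history: each commit maps to its parents, first parent first.
# "M" merges the feature commit "F1" into the main line A -> B.
PARENTS: Dict[str, List[str]] = {
    "A": [],
    "B": ["A"],
    "F1": ["A"],
    "M": ["B", "F1"],
    "C": ["M"],
}


def first_parent_history(tip: str, parents: Dict[str, List[str]]) -> List[str]:
    """Walk from tip to the root, following only the first parent."""
    history: List[str] = []
    current: Optional[str] = tip
    while current is not None:
        history.append(current)
        commit_parents = parents[current]
        current = commit_parents[0] if commit_parents else None
    return history


def all_reachable(tip: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every commit reachable from tip through any parent."""
    seen: Set[str] = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


print(first_parent_history("C", PARENTS))  # → ['C', 'M', 'B', 'A']
print(sorted(all_reachable("C", PARENTS)))  # → ['A', 'B', 'C', 'F1', 'M']
```

In this toy example the first-parent walk keeps the merge commit M but skips F1, so dropping merge commits from such a history would hide the point where the changes from F1 arrived.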

Only use hyper-blame when useful

Right now, when hyper-blame is available, we always prefer it to normal blame.
In practice, we should use it only when useful (that is, when the user passes a path to a file containing hashes to ignore, or when the repository contains a default ignore file), as it can slow down blaming.
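
The decision described above could be sketched as follows. The function name and its parameters are hypothetical, and the default ignore file is assumed to be .git-blame-ignore-revs in the repository root.

```python
import os
from typing import Optional

# Assumed name of the default ignore file in the repository root.
DEFAULT_IGNORE_FILE = ".git-blame-ignore-revs"


def should_use_hyper_blame(repo_path: str,
                           hyper_blame_available: bool,
                           user_ignore_file: Optional[str] = None) -> bool:
    """Prefer hyper-blame only when there is something to ignore."""
    if not hyper_blame_available:
        return False
    if user_ignore_file is not None:
        # The user explicitly passed a file with hashes to ignore.
        return True
    # Otherwise fall back to the repository's default ignore file, if any.
    return os.path.isfile(os.path.join(repo_path, DEFAULT_IGNORE_FILE))
```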

Verify Modification.filename on Windows

It seems that on Windows filename returns a path relative to the repository instead of just the filename (with extension).

On Linux it works correctly.

That happens using Python 3.5.1

To reproduce the problem:

for commit in RepositoryMining(repo_path, only_in_branches=['master'],
                               only_no_merge=True, from_tag=from_tag,
                               to_tag=to_tag).traverse_commits():
    for m in commit.modifications:
        print(m.filename)

to_commit and from_commit no longer support HEAD

There have been some recent changes to traverse_commits options that made some of my code break. I've tried quickly fiddling with the options, but it didn't help.

In the end, after debugging pydriller's code, I found out that the problem is that passing HEAD to either to_commit or from_commit breaks a lot of pydriller's assumptions.

I think HEAD (and similar references) should be supported, or clear errors should be spit out.

Use git-hyper-blame to improve SZZ results

SZZ can blame the wrong commit when there are large formatting changes.

https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html can help with this, skipping large meaningless formatting changes (for projects that do have an up to date .git-blame-ignore-revs). Possibly, it would also be nice to be allowed to pass a set of commits to skip in addition to those listed in .git-blame-ignore-revs.

Issues with filtering commits by file type

Describe the bug
Filtering commits by the type of the modified files does not seem to work. Please see the script below for an MWE; it uses the filter as shown in the examples in the docs.

To Reproduce

from pydriller import RepositoryMining

repo    = "https://github.com/jgrapht/jgrapht.git"
commits = [
    "0c62b9bea6ac0caeb6bc520d87708bceca0054dc", # README.html
    "0eec89d7d76077b485851c80e8f4e78e2d1e8cbf", # build.xml, package.html
    "0f3a10c6ea5258c0add31f2c1eb0ac64b015315a"  # README.html
]

for commit in RepositoryMining(repo,only_modifications_with_file_types=['.java'],only_commits=commits).traverse_commits():
    print(commit.hash)
    for modification in commit.modifications:
        if modification.filename[-4:] != 'java':
            print(modification.filename)

The output for me is:

$ python3 bug.py
0eec89d7d76077b485851c80e8f4e78e2d1e8cbf
build.xml
package.html
0f3a10c6ea5258c0add31f2c1eb0ac64b015315a
README.html
0c62b9bea6ac0caeb6bc520d87708bceca0054dc
README.html
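
For comparison, the behaviour one would expect from the filter can be sketched in plain Python, without PyDriller. modifies_file_type is a hypothetical helper, the hashes are shortened, and the file names are the ones from the comments in the script above.

```python
from typing import Iterable, List


def modifies_file_type(filenames: Iterable[str], file_types: List[str]) -> bool:
    """True if at least one modified file ends with one of file_types."""
    suffixes = tuple(file_types)
    return any(name.endswith(suffixes) for name in filenames)


# Modified files of the three commits used in the report.
commits = {
    "0c62b9be": ["README.html"],
    "0eec89d7": ["build.xml", "package.html"],
    "0f3a10c6": ["README.html"],
}

kept = [sha for sha, files in commits.items()
        if modifies_file_type(files, [".java"])]
print(kept)  # → []
```

None of the three commits touches a .java file, so under this reading of the filter the loop in the script should print nothing.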

OS Version:
Linux

My python installation is:

$ python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux

Question: when should I expect modified_file.old_path to be None?

I have a question: the documentation for the object modifications states "old_path: the old path of the file (can be None if the file is added)".

Does it mean the standard behaviour of git is to set up the old_path to something != None and in some cases to None? Or is this handled by PyDriller itself?

For example, with the following code:

from pydriller import RepositoryMining
from pydriller.domain.commit import ModificationType

for commit in RepositoryMining('https://github.com/snowplow/ansible-playbooks',
                               to_commit='45376c4f43238a6c2ac10c64bcd6d3120d17ab08',
                               reversed_order=True).traverse_commits():
    for modified_file in commit.modifications:
        if 'roles/neo4j-demo/tasks/main.yml' in (modified_file.old_path, modified_file.new_path):
            if modified_file.change_type == ModificationType.ADD:
                print(commit.hash, modified_file.old_path, modified_file.change_type)

I would expect:

the modified_file.old_path of roles/neo4j-demo/tasks/main.yml to be None (as it is added at commit 70fc0074c32ec55053b47e285be0030e6022941e);

while the actual value is: roles/neo4j-demo/tasks/main.yml.

I am asking because when I used PyDriller in my code I assumed the old_path to be None when the file was added, and then I realized this.
However, adding the check if modified_file.change_type == ModificationType.ADD solved the problem :)

Recent changes to traverse_commits broke existing code without warnings

RepositoryMining(".", from_commit=older_commit, to_commit=newer_commit, reversed_order=True).traverse_commits() used to work.
Now, it returns all commits in the repo, without printing any warning.

Instead, you have to use RepositoryMining(".", from_commit=newer_commit, to_commit=older_commit, reversed_order=True).traverse_commits().

New filter only_authors=

select all commits of 1 specific author

Commit filter only_authors raising "unexpected keyword argument" error

The only_authors commit filter is not working and raises an error.
I'm trying to run the example given on the configuration page.

Following is the error that I'm getting:

for commit in RepositoryMining('repos/caboche', only_authors=['jmettraux']).traverse_commits():
TypeError: __init__() got an unexpected keyword argument 'only_authors'

I'm using Windows powershell, python 3.6.5.

Is there a way to traverse commits on a single file?

It would be nice to be able to traverse the commits in a single file's history. The file could be specified as a parameter to RepositoryMining, perhaps? I can try adding a feature like this when I have more time, if @ishepard is interested.

Use --ignore-revs-file instead of hyper-blame

It looks like "hyper blame" is now included in git itself (see https://stackoverflow.com/a/57129540/5769903 for details).

It might be beneficial to switch to this instead of having a custom implementation. It also has some additional features, e.g. "unblamable" lines.
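
As a rough sketch, the command line for the built-in variant could be assembled as below. build_blame_command is a hypothetical helper; the --ignore-revs-file and --ignore-rev options of git blame require Git 2.23 or newer.

```python
from typing import List, Optional, Sequence


def build_blame_command(path: str,
                        ignore_revs_file: Optional[str] = None,
                        ignore_revs: Sequence[str] = ()) -> List[str]:
    """Build the argv for `git blame`, optionally skipping some commits."""
    cmd = ["git", "blame"]
    if ignore_revs_file is not None:
        # File listing one commit hash per line to skip while blaming.
        cmd += ["--ignore-revs-file", ignore_revs_file]
    for rev in ignore_revs:
        # Extra commits to skip, on top of those in the file.
        cmd += ["--ignore-rev", rev]
    cmd += ["--", path]
    return cmd


print(build_blame_command("src/main.py", ".git-blame-ignore-revs"))
# → ['git', 'blame', '--ignore-revs-file', '.git-blame-ignore-revs', '--', 'src/main.py']
```

The resulting list can be passed to subprocess.run with the repository as the working directory.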

Use pygit2 where possible

gitpython's performance is not great, as it uses subprocess to spawn a git process for every command.
Using pygit2 can greatly improve performance, even though it's a bit harder to use.

Pickling creates new Commit instances (huge memory usage with multiprocessing)

Describe the bug
I started to parallelize my code, which uses pydriller to extract the source code of methods from commits. For that purpose, I used the multiprocessing package, and I ran into very high memory usage within just a few hours. multiprocessing is known to use pickle and, as I found out, pickling Commit objects creates new Commit instances in memory.

To Reproduce
I wrote a simple script to demonstrate the problem:

import pickle
import os

import objgraph
from pydriller import RepositoryMining

import settings


def main():
    repo_path = os.path.join(settings.get('git_repositories_dir'), 'trfl')
    repo = RepositoryMining(repo_path)

    commits = list(repo.traverse_commits())[:10]

    print(f'Starting with {len(objgraph.by_type("Commit"))}')
    for n, commit in enumerate(commits):
        pickle.dumps(commit)
        print(f'#{n+1} {len(objgraph.by_type("Commit"))}')
    # del commits
    print(f'Ending with {len(objgraph.by_type("Commit"))}')


if __name__ == '__main__':
    main()

The output is:

Starting with 29
#1 29
#2 29
#3 30
#4 32
#5 35
#6 39
#7 44
#8 50
#9 57
#10 65
Ending with 65

gc.collect() didn't help, because the new Commit objects are referenced by the old ones. You can see this by uncommenting del commits. The result should be Ending with 0.
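
One possible way to sidestep the problem, assuming the workers only need a few fields, is to copy those fields into a plain dictionary and pickle that instead of the Commit object. This is a sketch, not a fix in PyDriller: summarize is a hypothetical helper, the attribute names come from the Quick usage example, and a stand-in object replaces a real commit so the snippet runs on its own.

```python
import pickle
from types import SimpleNamespace
from typing import Dict


def summarize(commit) -> Dict[str, str]:
    """Copy the needed fields into a plain dict of strings."""
    return {
        "hash": commit.hash,
        "msg": commit.msg,
        "author": commit.author.name,
    }


# Stand-in for a real Commit object, for illustration only.
fake_commit = SimpleNamespace(
    hash="abc123", msg="Fix typo", author=SimpleNamespace(name="Jane")
)

payload = pickle.dumps(summarize(fake_commit))
restored = pickle.loads(payload)
print(restored["hash"])  # → abc123
```

Only strings cross the process boundary this way, so pickling never touches the Commit object itself.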

OS Version:

MacOS Mojave 10.14.6;
Python 3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53)
PyDriller==1.11

LSB Version: 1.4
Distributor ID: Arch
Description: Arch Linux
Release: rolling
Codename: n/a
Linux ggnb-arch 5.5.8-arch1-1 #1 SMP PREEMPT Fri, 06 Mar 2020 00:57:33 +0000 x86_64 GNU/Linux;
Python 3.8.2
PyDriller==1.12

P.S. I've added info about multiprocessing in the title for people, who are facing the same problem in a similar way.

Only methods that changed in the commit

Request from a user:

I know that I can get a list of modified files from the commit 
object and with the list of modifications i can also get the diff 
and a list of methods. But here I'd like to know if it's possible 
to get only the changed methods that are included in the diff.

For example, suppose a modified file contains five methods in
total, but only two have been changed and the diff contains
only the changed ones. Is it possible to get only those two
methods instead of all of them?

This is an old issue #24. With the latest changes of Pydriller it's definitely possible.

filtering out commits with whitespace-only changes

Is your feature request related to a problem? Please describe.
How hard would it be to add an option to include/exclude whitespace changes in diffs? More specifically, if a file modification in a commit only contains whitespace formatting changes.

Describe the solution you'd like
Maybe some commit filtering parameter like only_modifications_nowhitespace=true/false for the RepositoryMining class?

Additional context
Example: the changes to the file Subgraph.java in the commit 0ad4e7aa are purely whitespace. If I run git diff manually, the following two commands produce different results, and in the second case this file is not included in the diff:

$ git diff e07df84 0ad4e7aa | grep Subgraph
diff --git a/src/org/_3pq/jgrapht/graph/Subgraph.java b/src/org/_3pq/jgrapht/graph/Subgraph.java
--- a/src/org/_3pq/jgrapht/graph/Subgraph.java
+++ b/src/org/_3pq/jgrapht/graph/Subgraph.java
@@ -132,8 +132,8 @@ public class Subgraph extends AbstractGraph implements Serializable {
      * Creates a new Subgraph.

$ git diff -w e07df84 0ad4e7aa | grep Subgraph
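
A minimal sketch of such a check, assuming the old and new file contents are available as strings. The function name is hypothetical, and the comparison only approximates git diff -w: it removes all whitespace within each line but still treats added or removed lines as real changes.

```python
import re
from typing import List


def _without_whitespace(text: str) -> List[str]:
    """Return the lines of text with all whitespace removed from each."""
    return [re.sub(r"\s+", "", line) for line in text.splitlines()]


def is_whitespace_only_change(old: str, new: str) -> bool:
    """True if old and new differ, but only in whitespace within lines."""
    return old != new and _without_whitespace(old) == _without_whitespace(new)


print(is_whitespace_only_change("int a = 1;\n", "int  a=1;\n"))  # → True
print(is_whitespace_only_change("int a = 1;\n", "int b = 1;\n"))  # → False
```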

On Windows, exception ignored causes the break of program's execution

When I run an operation provided by RepositoryMining, the exception below occurs at the end of its execution:

Exception ignored in: <finalize object at 0x3019fb0; dead>

My test:

for commit in RepositoryMining("https://github.com/Marlysson/trackwork").traverse_commits():
    print(commit.hash, len(commit.modifications) )

Am I doing something wrong, or is a patch needed to fix this?

I would like to help if a fix is needed.

change ufficial to official in Readme.md

Method traverse_commits() returns more commits than expected for some repositories

For some repositories like UnderGreen/ansible-role-mongodb I get wrong results when calling the method traverse_commits().

How to reproduce:

from pydriller.repository_mining import RepositoryMining

for commit in RepositoryMining('https://github.com/UnderGreen/ansible-role-mongodb',
                                from_commit='3efddcf91d51dd55a7d322ec9e0f14f673da3d1a',
                                to_commit='416fb2daf88410bd928d310e75e2ff4447be30e2').traverse_commits():
    print(commit.hash)

Expected result:

3efddcf91d51dd55a7d322ec9e0f14f673da3d1a
abf8ded23ecc712e1ce4347403ded96f374e761b
527214bcc9311a0e42868144f97982878e33c9f3
454df2ef69d0bc3c13dc0eee748595885c3a0036
416fb2daf88410bd928d310e75e2ff4447be30e2

Actual result:

2dba294ff88b3c7fbf28802c5179c79292ab353e
5f167f612a9bd52bc8983ec6f3f143e7e5daca82
9e17ec9ed8ab36defed2017f9495e71a2688dfab
d6aa8e0ffa70ec3590a01eaec65626209db6c1b1
3efddcf91d51dd55a7d322ec9e0f14f673da3d1a     [correct]
abf8ded23ecc712e1ce4347403ded96f374e761b     [correct]
527214bcc9311a0e42868144f97982878e33c9f3     [correct]
454df2ef69d0bc3c13dc0eee748595885c3a0036     [correct]
416fb2daf88410bd928d310e75e2ff4447be30e2     [correct]
d8d847825d97f842355dbaf6a361089b141bdc41
5c04b2c8bb6fc31994974e84adab155c80d448eb
30c2bc37c4fbc4b95bbee9e01566400e070f4ab0
4c3919d9201e38d47de52a4882652d7ba01312a5
845f25d5a2dec1c02b6798f7e95d77ea3768c307

Version:
I've installed the latest version locally (from commit e4449b602be0dff8dc0a49a466ec3dbf29ec88fb).

The same behavior appears in PyDriller-1.10.1.

Problem with old_path

I wanted to find the commit in which a file was added, but the old_path of the file is never None, so the commit was not found. I tested this on the repository https://github.com/krb5/krb5 and the file define.h.

Unexpected behavior in RepositoryMining

In the repository DataDog/ansible-datadog, the commit edf7f39adb5301c6d157cbc07291cc5cd6629dc2 precedes 5adf2f1659fb8686b27317aa7ee2f6468bbcdb38 (in chronological order).

While running:

for commit in RepositoryMining('https://github.com/DataDog/ansible-datadog', from_commit='edf7f39adb5301c6d157cbc07291cc5cd6629dc2', to_commit='5adf2f1659fb8686b27317aa7ee2f6468bbcdb38').traverse_commits():
    print(commit.hash)

I expect

edf7f39adb5301c6d157cbc07291cc5cd6629dc2
5adf2f1659fb8686b27317aa7ee2f6468bbcdb38

but obtain

(nothing)

The expected result is obtained by swapping from_commit and to_commit.
This is a bit misleading.

PyDriller version: Latest commit -> dc5d544ff1d0aac23e4cbd1b14a179b04323defb

New commit filter "only_modifications_without_file_types"

I would like to be able to filter commits that modify files that do not have certain formats.
I am interested in analyzing flow of commits taking into account certain format of the modified files. For example: I want to filter commits that modify '.java' format files (we already did it with 'only_modifications_with_file_types'), however, I would also like to get all commits that do not modify '.java' files.

Example of solution
I would like to get a new commit filter, like: only_modifications_without_file_types = ['.java', '.c'], getting all commits that do not modify any '.java' files and any '.c' files.

Additional context

Filtering commits: https://pydriller.readthedocs.io/en/latest/configuration.html#filtering-commits
We need to add a new filter here with the described behavior.
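
The requested behaviour could be expressed as the predicate below. This is a sketch in plain Python; modifies_none_of is a hypothetical name and the file names are made up.

```python
from typing import Iterable, List


def modifies_none_of(filenames: Iterable[str], excluded_types: List[str]) -> bool:
    """True if no modified file ends with one of excluded_types."""
    suffixes = tuple(excluded_types)
    return not any(name.endswith(suffixes) for name in filenames)


print(modifies_none_of(["README.md", "build.xml"], [".java", ".c"]))  # → True
print(modifies_none_of(["README.md", "src/Main.java"], [".java", ".c"]))  # → False
```

A commit would pass the filter only when the predicate is true for its full list of modified files.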

Need to test reversed_order=True with (from_commit=None, to_commit=sha) and (from_commit=sha, to_commit=None)

The last four commits of pydriller are (from oldest to newest):

13c29c9773c599094caadba0b04ba99aebb1cce5
14535c53e82505c7a4c1c6c96ff7afc43a0849d9
6ae8b9dba47cf79ed10ba30aaf3ac58dc21adcf1
07099286e8a06680038d6c1ac9ba1ebb12a1406e

When testing the following code:

for commit in RepositoryMining('https://github.com/ishepard/pydriller', from_commit='13c29c9773c599094caadba0b04ba99aebb1cce5', to_commit=None, reversed_order=True).traverse_commits():
    print(commit.hash)

I expect the following:

07099286e8a06680038d6c1ac9ba1ebb12a1406e (latest)
6ae8b9dba47cf79ed10ba30aaf3ac58dc21adcf1 (latest - 1)
14535c53e82505c7a4c1c6c96ff7afc43a0849d9 (latest - 2)
13c29c9773c599094caadba0b04ba99aebb1cce5 (latest - 3)

but I obtain all the commits from the first to from_commit:

13c29c9773c599094caadba0b04ba99aebb1cce5 (from_commit = latest - 3)
386b6eecd23c9fc5222624b0898aa7ee05121765
d519d9ca5a22686ed104f20a6294ed575096612f
f470ff1fc7f25b03cb4b72742d052176ab80b05f
...
71e053f61fc5d31b3e31eccd9c79df27c31279bf
90ca34ebfe69629cb7f186a1582fc38a73cc572e
fdf671856b260aca058e6595a96a7a0fba05454b
ab36bf45859a210b0eae14e17683f31d19eea041 (1st commit)

The opposite happens when from_commit=None. For example, with

for commit in RepositoryMining('https://github.com/ishepard/pydriller', from_commit=None, to_commit='71e053f61fc5d31b3e31eccd9c79df27c31279bf' , reversed_order=True).traverse_commits():
    print(commit.hash)

I expect the following:

71e053f61fc5d31b3e31eccd9c79df27c31279bf (4th commit)
90ca34ebfe69629cb7f186a1582fc38a73cc572e
fdf671856b260aca058e6595a96a7a0fba05454b
ab36bf45859a210b0eae14e17683f31d19eea041 (1st commit)

and obtain all the commits from the latest to to_commit:

07099286e8a06680038d6c1ac9ba1ebb12a1406e (latest commit)
6ae8b9dba47cf79ed10ba30aaf3ac58dc21adcf1 (latest - 1)
14535c53e82505c7a4c1c6c96ff7afc43a0849d9 (latest - 2)
13c29c9773c599094caadba0b04ba99aebb1cce5 (latest - 3)
...
2437f7bb5659a342b8d74c169d175604170395ae (7th commit)
f8b713a9f5e4620abbbc84f45493352fc046b5e0 (6th commit)
205f6fb09734667b0c1842fd3c317013640189ce (5th commit)
71e053f61fc5d31b3e31eccd9c79df27c31279bf (to_commit = 4th commit)

NOTE: This behavior shows only when reversed_order = True. Furthermore, passing both from_commit and to_commit (or setting both to None) works as intended.
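
The expected semantics can be written down as a small reference function over a list of hashes in chronological order. This is not PyDriller code: select_commits and the short commit names are hypothetical, and the function simply encodes the expectations listed above so they can be turned into tests.

```python
from typing import List, Optional


def select_commits(history: List[str],
                   from_commit: Optional[str] = None,
                   to_commit: Optional[str] = None,
                   reversed_order: bool = False) -> List[str]:
    """Select a range from history, which is ordered oldest first.

    from_commit is always the older bound and to_commit the newer one;
    None means "first commit" and "latest commit" respectively.
    """
    start = history.index(from_commit) if from_commit is not None else 0
    end = history.index(to_commit) if to_commit is not None else len(history) - 1
    selected = history[start:end + 1]
    return selected[::-1] if reversed_order else selected


history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
print(select_commits(history, from_commit="c5", reversed_order=True))
# → ['c8', 'c7', 'c6', 'c5']
print(select_commits(history, to_commit="c4", reversed_order=True))
# → ['c4', 'c3', 'c2', 'c1']
```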

Version: latest version installed from current master -> commit 07099286e8a06680038d6c1ac9ba1ebb12a1406e

Support using topological order for traversing commits

It can be useful in cases like https://github.com/mozilla/gecko-dev (see also #93).

git diff unified output

Is your feature request related to a problem? Please describe.
Currently pydriller uses the default git diff output (3 lines of context) instead of zero-context unified output (the -U0 flag).
For example:

git diff  HEAD~1 | grep @@
@@ -1,5 +1,6 @@
@@ -9,7 +10,7 @@ PropertyStatus RegExpValidator::validate(const Property & property, const std::s
@@ -2,8 +2,6 @@
@@ -52,10 +52,12 @@ static vector<TestData> userAgentModifiers {

git diff -U0 HEAD~1 | grep @@
@@ -2,0 +3 @@
@@ -12 +13 @@ PropertyStatus RegExpValidator::validate(const Property & property, const std::s
@@ -5,2 +4,0 @@
@@ -54,0 +55 @@ static vector<TestData> userAgentModifiers {
@@ -58,0 +60 @@ static vector<TestData> userAgentModifiers {

The reported line numbers differ, and they exactly match the changed lines in the file only when the -U0 flag is used. The offset between the -U0 numbers and the default ones is not constant, as I verified on several files in my repo (even if the man page for diff suggests that -U0 just drops the 3 context lines, it is not as simple as shifting the numbers by 3).
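
To compare the two outputs programmatically, the hunk headers can be parsed with a regular expression. The sketch below follows the unified diff convention that an omitted count means 1; parse_hunk_header is a hypothetical helper, not something PyDriller or GitPython provides.

```python
import re
from typing import Tuple

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line: str) -> Tuple[int, int, int, int]:
    """Return (old_start, old_count, new_start, new_count)."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


print(parse_hunk_header("@@ -1,5 +1,6 @@"))  # → (1, 5, 1, 6)
print(parse_hunk_header("@@ -2,0 +3 @@"))    # → (2, 0, 3, 1)
```

With -U0 the counts describe only the changed lines, which is why "-2,0 +3" reads as "nothing removed after line 2, one line added at line 3".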

Describe the solution you'd like
I would like to be able to pass the flag -U0 to the git diff command from pydriller, or have -U0 as default git diff.

Additional context
I tracked the command down to the function diff in diff.py (it's in GitPython, the Git library that pydriller uses),
line 78:

    def diff(self, other=Index, paths=None, create_patch=False, **kwargs):
        """Creates diffs between two items being trees, trees and index or an
        index and the working tree. It will detect renames automatically.

        :param other:
            Is the item to compare us with.
            If None, we will be compared to the working tree.
            If Treeish, it will be compared against the respective tree
            If Index ( type ), it will be compared against the index.
            If git.NULL_TREE, it will compare against the empty tree.
            It defaults to Index to assure the method will not by-default fail
            on bare repositories.

        :param paths:
            is a list of paths or a single path to limit the diff to.
            It will only include at least one of the given path or paths.

        :param create_patch:
            If True, the returned Diff contains a detailed patch that if applied
            makes the self to other. Patches are somewhat costly as blobs have to be read
            and diffed.

        :param kwargs:
            Additional arguments passed to git-diff, such as
            R=True to swap both sides of the diff.

        :return: git.DiffIndex

        :note:
            On a bare repository, 'other' needs to be provided as Index or as
            as Tree/Commit, or a git command error will occur"""
        args = []
        args.append("--abbrev=40")        # we need full shas
        args.append("--full-index")       # get full index paths, not only filenames
        # get unified output with 0 lines (added this dummy change myself - Vox1984)       
        args.append("-U0")
       ##############################################                
        args.append("-M")                 # check for renames, in both formats
        if create_patch:
            args.append("-p")
        else:
            args.append("--raw")

        # in any way, assure we don't see colored output,
        # fixes https://github.com/gitpython-developers/GitPython/issues/172
        args.append('--no-color')

        if paths is not None and not isinstance(paths, (tuple, list)):
            paths = [paths]

        diff_cmd = self.repo.git.diff
        if other is self.Index:
            args.insert(0, '--cached')
        elif other is NULL_TREE:
            args.insert(0, '-r')  # recursive diff-tree
            args.insert(0, '--root')
            diff_cmd = self.repo.git.diff_tree
        elif other is not None:
            args.insert(0, '-r')  # recursive diff-tree
            args.insert(0, other)
            diff_cmd = self.repo.git.diff_tree

        args.insert(0, self)

        # paths is list here or None
        if paths:
            args.append("--")
            args.extend(paths)
        # END paths handling

        kwargs['as_process'] = True
        proc = diff_cmd(*self._process_diff_args(args), **kwargs)

        diff_method = (Diff._index_from_patch_format
                       if create_patch
                       else Diff._index_from_raw_format)
        index = diff_method(self.repo, proc)

        proc.wait()
        return index

Implement histogram algorithm for diff

Problems
The diff utility calculates and displays the differences between two files, and is typically used to investigate the changes between two versions of the same file. Git offers four diff algorithms, namely, Myers, Minimal, Patience, and Histogram. When no algorithm is specified, Myers is used by default.
In textual differencing, all diff algorithms are computationally correct in generating the diff outputs. However, the diff outputs are sometimes different due to different diff algorithms. Different diff algorithms might identify different change hunks, that is, a list of program statements deleted or added contiguously, separated by at least one line of unchanged context. We expect that a set of changing operations done by developers can be represented by change hunks. However, there can be inappropriate identifications of change hunks.

Recommendation
In this paper [1], the authors investigated the impact of adopting different diff algorithms (Myers and Histogram) on empirical studies, and which of the two provides diff results that better recover the changing operations.
For code churn metrics and bug-introducing change identification in 24 Java projects, the results show that different diff algorithms report different code changes. For patch application, Histogram is more suitable than Myers for describing the code changes.
Therefore, we recommend using the Histogram algorithm in this tool for extracting code changes.

References
[1] Nugroho, Y. S., Hata, H., Matsumoto, K., How different are different diff algorithms in Git?, Empirical Software Engineering (2019). Available online

I cannot use pydriller traverse_commits() method on detached repository

Describe the bug
I cannot use pydriller traverse_commits() method from RepositoryMining class on detached repository, for other repositories it works fine. (I am intending to use it for enforcing some rules on commits, and I am doing it on detached branch).

To Reproduce

from pydriller import RepositoryMining
import calculateChanges as c

repo = "/home/repo"
lista = list(RepositoryMining(repo).traverse_commits())

Results in an exception:

  File "/home/user/.local/lib/python3.6/site-packages/git/refs/symbolic.py", line 275, in _get_reference
    raise TypeError("%s is a detached symbolic reference as it points to %r" % (self, sha))
TypeError: HEAD is a detached symbolic reference as it points to 'e12532b37ffe6d67140841a0a918c8b3744a01ae'

OS Version:
Python 3.6.9
Ubuntu 18.04.3 LTS

Dummy Workaround:
git_repository.py

    def _discover_main_branch(self, repo):
        if self.main_branch is None:
            self.main_branch = "dummy"
        else:
            self.main_branch = repo.active_branch.name
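
A slightly less blunt variant of the workaround would be to catch the error at the point where the branch name is read. The traceback above shows that GitPython raises TypeError when HEAD is detached; discover_main_branch is a hypothetical standalone function, not PyDriller's actual method, and returning None for the detached case is an assumption about what callers would want.

```python
from typing import Optional


def discover_main_branch(repo) -> Optional[str]:
    """Return the active branch name, or None when HEAD is detached.

    repo is expected to behave like a GitPython Repo object, whose
    active_branch property raises TypeError on a detached HEAD.
    """
    try:
        return repo.active_branch.name
    except TypeError:
        # Detached HEAD: there is no branch to report.
        return None
```

Callers would then have to handle None explicitly, for example by skipping branch-based filters.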

Parsing a combined diff

When in a merge commit, the modified files of the commit should be the files in conflict. To obtain this list, one should use "git diff-tree --cc SHA". Right now, since I haven't found a solution to parse this yet, I don't return anything.

As an example, I show here the result of diff-tree --cc on the commit cc5b002a5140e2d60184de42554a8737981c846c of project SignalR. I took this example from this tutorial.

The output of the command git diff-tree --cc cc5b002a5140e2d60184de42554a8737981c846c is:

cc5b002a5140e2d60184de42554a8737981c846c
diff --cc tests/Microsoft.AspNet.SignalR.FunctionalTests/Client/HubProxyFacts.cs
index aaad4c4a,8bf42fc3..4979ab7d
--- a/tests/Microsoft.AspNet.SignalR.FunctionalTests/Client/HubProxyFacts.cs
+++ b/tests/Microsoft.AspNet.SignalR.FunctionalTests/Client/HubProxyFacts.cs
@@@ -36,11 -36,9 +37,11 @@@ namespace Microsoft.AspNet.SignalR.Test

                  hubConnection.Start(host.Transport).Wait();

-                 proxy.Invoke("Send", "hello").Wait();
+                 proxy.InvokeWithTimeout("Send", "hello");

 -                Assert.True(wh.WaitOne(TimeSpan.FromSeconds(5)));
 +                Assert.True(wh.WaitOne(TimeSpan.FromSeconds(10)));
 +
 +                hubConnection.Stop();
              }
          }

@@@ -65,9 -63,9 +66,9 @@@

                  hubConnection.Start(host.Transport).Wait();

-                 proxy.Invoke("Send", "hello").Wait();
+                 proxy.InvokeWithTimeout("Send", "hello");

 -                Assert.True(wh.WaitOne(TimeSpan.FromSeconds(5)));
 +                Assert.True(wh.WaitOne(TimeSpan.FromSeconds(10)));
              }
          }

diff --cc tests/Microsoft.AspNet.SignalR.FunctionalTests/Server/Hubs/HubFacts.cs
index d153740f,4bdad4db..393a1ebf
--- a/tests/Microsoft.AspNet.SignalR.FunctionalTests/Server/Hubs/HubFacts.cs
+++ b/tests/Microsoft.AspNet.SignalR.FunctionalTests/Server/Hubs/HubFacts.cs
@@@ -297,8 -291,10 +298,8 @@@ namespace Microsoft.AspNet.SignalR.Test
                      Name = "David"
                  };

-                 var person1 = hub.Invoke<SignalR.Samples.Hubs.DemoHub.DemoHub.Person>("ComplexType", person).Result;
+                 var person1 = hub.InvokeWithTimeout<SignalR.Samples.Hubs.DemoHub.DemoHub.Person>("ComplexType", person);
                  var person2 = hub.GetValue<SignalR.Samples.Hubs.DemoHub.DemoHub.Person>("person");
 -                JObject obj = ((dynamic)hub).person;
 -                var person3 = obj.ToObject<SignalR.Samples.Hubs.DemoHub.DemoHub.Person>();

                  Assert.NotNull(person1);
                  Assert.NotNull(person2);

This so-called Combined Diff represents only the files that changed in the merge commit itself; all the other files in the diff changed in just one of the branches (and we don't want those).

As we can see, it's a bit different from a "normal" diff. The hunk headers are different (@@@ -297,8 -291,10 +298,8 @@@), and there are 2 columns that represent whether the line was added or deleted in each branch (so the options are -,+,--,-+,+-, etc). I did some research and couldn't find a parser for this Combined Diff. Maybe we should write one. Documentation for the combined diff can be found here.
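As a starting point for such a parser, here is a minimal sketch that reads a combined hunk header and splits a hunk line into its per-parent columns. The names `CombinedHunkHeader`, `parse_combined_header` and `split_combined_line` are hypothetical. The sketch relies on the header containing one more `@` than there are parents, and it does not handle file headers or full diffs.

```python
import re
from typing import List, NamedTuple, Tuple


class CombinedHunkHeader(NamedTuple):
    parent_ranges: List[Tuple[int, int]]  # (start, length) for each parent
    result_range: Tuple[int, int]         # (start, length) in the merge result


_RANGE = re.compile(r"([-+])(\d+)(?:,(\d+))?")


def parse_combined_header(line: str) -> CombinedHunkHeader:
    """Parse a header such as '@@@ -36,11 -36,9 +37,11 @@@ context'."""
    marker = re.match(r"@{3,}", line)
    if marker is None:
        raise ValueError("not a combined diff hunk header")
    at_run = marker.group(0)
    closing = line.index(at_run, len(at_run))  # position of the closing '@' run
    parents: List[Tuple[int, int]] = []
    result = None
    for sign, start, length in _RANGE.findall(line[len(at_run):closing]):
        # As in unified diffs, an omitted length means 1.
        rng = (int(start), int(length) if length else 1)
        if sign == "-":
            parents.append(rng)
        else:
            result = rng
    if result is None or len(parents) != len(at_run) - 1:
        raise ValueError("malformed combined diff hunk header")
    return CombinedHunkHeader(parents, result)


def split_combined_line(line: str, n_parents: int) -> Tuple[str, str]:
    """Split a hunk line into its per-parent marker columns and its content."""
    return line[:n_parents], line[n_parents:]
```

For the first header in the example above, `parse_combined_header` reports two parent ranges, (36, 11) and (36, 9), and the result range (37, 11).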

Lost diffs

Describe the bug
When iterating through the commits of a repo and analyzing the diffs for each commit, sometimes the diffs are missing.
I also had this happen when running it over the Linux kernel and Apache httpd, but I don't have examples at hand.

To Reproduce
I am testing this on https://github.com/GNOME/evolution
commit sha is 756c8abcb840b8da588031f4a0d7e1fc979fab70
GNOME/evolution@756c8ab

Using git directly:
git diff 756c8abcb840b8da588031f4a0d7e1fc979fab70^!

Gives the same as using git-python directly:

from git import Repo

repo = Repo("./")
commit = repo.commit("756c8abcb840b8da588031f4a0d7e1fc979fab70")
diff = commit.diff(commit.parents[0], create_patch=True)

But when using pydriller like this:

from pydriller import RepositoryMining

commits = reversed(list(RepositoryMining("./").traverse_commits()))
for commit in commits:
    if commit.hash == "756c8abcb840b8da588031f4a0d7e1fc979fab70":
        for mod in commit.modifications:
            if mod.filename == "gnome-canvas-text.c":
                print(mod.diff)
                break  # stop after the first match

It gives an empty diff.

OS Version:
Using Antergos Linux (Arch based), python 3.7, GitPython 2.1.11 and pydriller 1.7

In the modified file, return only the modified methods, not all of them

Add a method that returns all the methods modified in a commit

GitRepository error: A branch named '_PD' already exists.

I tried to use pydriller to iterate all the commits of a project and then execute a custom command on each commit. (in my example, I simply print all the files in the directory for each commit).

The library GitRepository is looking for a branch "_PD" that does not exist in my repository.

Here is the script I wrote:

import os
from typing import re
from pydriller import RepositoryMining
from pydriller import GitRepository

DATASET_DIR= './GitTest'
gr = GitRepository(DATASET_DIR)

for commit in RepositoryMining(DATASET_DIR).traverse_commits():
    print("commit {}, date {}".format(commit.hash, commit.author_date))
    gr.checkout(commit.hash)
    
    # Print all files in the repository for the current commit
    filenames = os.listdir(DATASET_DIR)  # get all files' and folders' names in the current directory
    for filename in filenames:  # loop through all the files and folders
        print(filename)

This is the exception

 File "test_miner.py", line 37, in <module>
    gr.checkout(commit.hash)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pydriller/git_repository.py", line 115, in checkout
    self.git.checkout('-f', _hash, b='_PD')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/git/cmd.py", line 548, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/git/cmd.py", line 1014, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/git/cmd.py", line 825, in execute
    raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git checkout -b _PD -f 3dfc48d9e65d907c0fd2cf0b34c510c9f5355984
  stderr: 'fatal: A branch named '_PD' already exists.'

parse_diff() yields wrong results for files ending without newline

If a file ends without a newline, git represents this with the additional line

\ No newline at end of file

in corresponding diffs. These lines are not parsed correctly in the parse_diff() function.
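To illustrate the expected handling, here is a minimal sketch of a unified diff parser that skips the marker line. It assumes the input contains hunks only (no `---`/`+++` file headers), and the name `parse_unified_diff` and its return shape are hypothetical rather than PyDriller's actual implementation.

```python
import re
from typing import Dict, List, Tuple

_HUNK = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def parse_unified_diff(diff: str) -> Dict[str, List[Tuple[int, str]]]:
    """Collect (line number, text) pairs for added and deleted lines."""
    result: Dict[str, List[Tuple[int, str]]] = {"added": [], "deleted": []}
    old_no = new_no = 0
    for line in diff.split("\n"):
        header = _HUNK.match(line)
        if header:
            old_no, new_no = int(header.group(1)), int(header.group(2))
            continue
        if line.startswith("\\"):
            # '\ No newline at end of file' annotates the previous line;
            # it is not content and must not advance either counter.
            continue
        if line.startswith("+"):
            result["added"].append((new_no, line[1:]))
            new_no += 1
        elif line.startswith("-"):
            result["deleted"].append((old_no, line[1:]))
            old_no += 1
        else:
            # Context line: present in both versions.
            old_no += 1
            new_no += 1
    return result
```

The key point is the `startswith("\\")` branch: if the marker were treated as a context line, every line number after it would be off by one.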

commits from "git log --all"

By default, RepositoryMining().traverse_commits() shows the commits that can be seen by running "git log". Is there a way to see all the commits, i.e. "git log --all"?

If not, it'd be really useful for me if that feature was available. :/

traverse_files()

One user asked to implement a traverse_files() function instead of traverse_commits().

In this case, I would have to obtain the list of files of the repo in that specific moment, and trace back all the commits in which each file was modified.

The result will be a dictionary, in which the keys are the files, and the value for each key is a list of commits.
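A minimal sketch of how that dictionary could be built follows. It assumes the history is already available as (commit hash, modified paths) pairs in traversal order; the function name `commits_per_file` and its inputs are hypothetical, and renames are not followed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def commits_per_file(history: Iterable[Tuple[str, List[str]]],
                     current_files: Iterable[str]) -> Dict[str, List[str]]:
    """Map each file present in the repo now to the commits that modified it."""
    wanted = set(current_files)
    touched: Dict[str, List[str]] = defaultdict(list)
    for commit_hash, paths in history:
        for path in paths:
            if path in wanted:
                touched[path].append(commit_hash)
    # Files never modified in the given history map to an empty list.
    return {path: touched[path] for path in sorted(wanted)}
```

Commits that only touched files no longer present in the repo are ignored by this sketch.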

'Tree' object has no attribute

Describe the bug:
Trying to analyze a git repository which was ported from SVN.

To Reproduce:
Running very simple code on repo (which should not produce any errors?):

for commit in RepositoryMining("repo/path").traverse_commits():
    for modification in commit.modifications:
        print(" with a change type of {}".format(modification.complexity))

OS Version:
MacOS

PyDriller Version:
1.9.2

Logs:

Traceback (most recent call last):
  File ".../Library/Application Support/IntelliJIdea2019.2/python/helpers/pydev/pydevd.py", line 1415, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "...//Library/Application Support/IntelliJIdea2019.2/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File ".../repo_analyzer.py", line 13, in <module>
    for m in commit.modifications:
  File ".../lib/python3.7/site-packages/pydriller/domain/commit.py", line 377, in modifications
    self._modifications = self._get_modifications()
  File ".../lib/python3.7/site-packages/pydriller/domain/commit.py", line 388, in _get_modifications
    create_patch=True)
  File ".../lib/python3.7/site-packages/git/diff.py", line 152, in diff
    index = diff_method(self.repo, proc)
  File ".../lib/python3.7/site-packages/git/diff.py", line 470, in _index_from_patch_format
    None, None, None))
  File ".../lib/python3.7/site-packages/git/diff.py", line 284, in __init__
    for submodule in repo.submodules:
  File ".../lib/python3.7/site-packages/git/repo/base.py", line 340, in submodules
    return Submodule.list_items(self)
  File ".../lib/python3.7/site-packages/git/util.py", line 928, in list_items
    out_list.extend(cls.iter_items(repo, *args, **kwargs))
  File ".../lib/python3.7/site-packages/git/objects/submodule/base.py", line 1192, in iter_items
    sm._name = n
AttributeError: 'Tree' object has no attribute '_name'

Any idea what could be the issue here?

Modified files are not fetched for merge commits

Describe the bug
I fetched all the commits from a repository using the standard commits = list(RepositoryMining(repo_path).traverse_commits()). It returns 2061 commits (as expected). Out of these, 331 are merge commits (for which commit.merge is true). For these 331 merge commits, PyDriller returns an empty list of modified files, even though many files were modified in those commits.

To Reproduce
E.g., commit number 115 (hash 13104b2651bed37c1eb238eacd09e05e5906534a) in the above-mentioned repository is a merge commit. It has 14 modified files (with 556 additions and 176 deletions), but none of them show up in commit.modifications.

OS Version:
Linux Ubuntu 18.04

Implement our version of git hyper-blame

Currently, we use Google's "git hyper-blame" from their depot_tools. However, this means another dependency that might break someday. After having seen the code (https://github.com/GiantPay/depot_tools/blob/master/git_hyper_blame.py), I think we could replicate it ourselves.

Ability to provide a directory to store clone of remote repo instead of using a temporary directory

Problem
For my work I need to access the other files in the repo that weren't altered by the commit. If I'm using a remote repo then the clone is stored in a temp folder and is inaccessible.

Potential solution
I need the ability to access the cloned repo when iterating over commits. As I see it there are two main ways of doing this: first you could make the temp folder accessible (the best approach to this is not clear to me), and second you could add a new configuration option through which the user could specify a location for the cloned repo. The second option would leave the task of managing the destruction of the cloned repo to the user, giving them the option to persist it as long as they deem necessary. I am in favor of the second option.

I was initially thinking to hack something together for my own purposes, but the thought occurred to me that this might be something that would be useful to other users and so I'm presenting it as a feature request. Feel free to assign this to me.
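The second option could look roughly like the sketch below, which only decides where the clone should live and who is responsible for cleaning it up. The function name `resolve_clone_dir` and the parameter `clone_to` are hypothetical and chosen for illustration; they are not an existing PyDriller option.

```python
import os
import tempfile
from typing import Optional, Tuple


def resolve_clone_dir(repo_url: str,
                      clone_to: Optional[str] = None) -> Tuple[str, bool]:
    """Return (target path for the clone, whether that path is temporary)."""
    name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[:-len(".git")]
    if clone_to is None:
        # No location requested: keep using a throwaway directory.
        return os.path.join(tempfile.mkdtemp(), name), True
    if not os.path.isdir(clone_to):
        raise ValueError("not a directory: {}".format(clone_to))
    # User-supplied location: the user decides when to delete the clone.
    return os.path.join(clone_to, name), False
```

Returning the "temporary" flag lets the caller delete only the clones it created itself, leaving user-supplied directories untouched.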

Use --histogram for git blame too

https://link.springer.com/article/10.1007/s10664-019-09772-z reports better results for SZZ when using "--histogram".
Right now PyDriller uses it when getting modifications; it might be useful to use it for blaming too.

Update examples in example folder

Results are not consistent between traverse_commits and git log --reverse

git log d2990dc0e788ccd43e89b04f5c2f9d5a16c1eec0^..ed0d83d64365762361bcc491f72c33b272bc815f --reverse shows:

d2990dc0e788ccd43e89b04f5c2f9d5a16c1eec0
6c05c922e3dfb1a200e3db6318b5197b9d85c08c

Instead, with pydriller:

>>> commits = list(RepositoryMining(".", from_commit="d2990dc0e788ccd43e89b04f5c2f9d5a16c1eec0", to_commit="ed0d83d64365762361bcc491f72c33b272bc815f").traverse_commits())
>>> commits[0].hash
'd2990dc0e788ccd43e89b04f5c2f9d5a16c1eec0'
>>> commits[1].hash
'ee1e094b81f0efa26c1bc570959b8981ae425488'

importing pydriller causes print function to print twice

Describe the bug
importing pydriller causes print function to print twice.

To Reproduce
code:

import pydriller
print ("hello")

output

hello
hello

OS Version:
MacOS

UnicodeDecodeError while reading some commits

Describe the bug
When reading certain commits (likely containing non-UTF-8 characters), the commit object breaks. A UnicodeDecodeError is raised when trying to access most properties. For details, see the example below.

To Reproduce
E.g. commit 13e644bb36a0b1f3ef0c2091ab648978d18f369d in https://github.com/gentoo/gentoo.

import pydriller
import git

repo_url = 'https://github.com/gentoo/gentoo.git'
local_directory = '/tmp'
repo_name = 'gentoo'

git.Git(local_directory).clone(repo_url)

git_repo = pydriller.GitRepository(local_directory + '/' + repo_name)
commit = git_repo.get_commit('13e644bb36a0b1f3ef0c2091ab648978d18f369d')

commit.author_date

OS Version:
Ubuntu 19.04
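If the cause is indeed bytes that are not valid UTF-8, one common way to stay robust is to fall back to a lossy decode instead of raising. The sketch below shows that idea on raw bytes only; the helper `safe_decode` is hypothetical, and where exactly such a fallback would belong in PyDriller or GitPython is not established by this report.

```python
def safe_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, replacing undecodable sequences instead of raising."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Invalid sequences become U+FFFD so the rest of the text survives.
        return raw.decode(encoding, errors="replace")
```

The trade-off is that the offending characters are lost, but the remaining commit metadata stays readable.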

New filter only_commits=

New filter only_commits. Right now PyDriller can accept a single commit, or it can filter by dates. However, we may want to analyse a specific list of commits.

list_commits = ["commit1", "commit2", "commit3"]
RepositoryMining('repo', only_commits=list_commits)
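The filtering itself could be as simple as the sketch below, which works on any iterable of objects exposing a `hash` attribute. The helper `filter_commits` is hypothetical and only illustrates the intended semantics of the proposed option.

```python
from typing import Any, Iterable, Iterator, Optional


def filter_commits(commits: Iterable[Any],
                   only_commits: Optional[Iterable[str]] = None) -> Iterator[Any]:
    """Yield commits whose hash is listed in only_commits (all if None)."""
    wanted = None if only_commits is None else set(only_commits)
    for commit in commits:
        if wanted is None or commit.hash in wanted:
            yield commit
```

Using a set keeps the membership test cheap even for long lists of hashes, and the traversal order of the input is preserved.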

Implement traverse_releases()

My suggestion would be to implement a function traverse_releases(), so we can traverse release commits, or only the last release commit.
We are working on a paper in Brazil that analyzes the updates in each release, and this feature would help us greatly. Thanks!

Commit.modifications returns non-existing modifications

Steps to reproduce:

Create a sample repository using this Bash script:

#!/bin/bash

mkdir repr
cd repr
git init .

echo "file A, version 1" > A
git add A
git commit -m "initial commit"

git branch new_branch
git checkout new_branch
echo "file B, version 1" > B
git add B
git commit -m "second version of file B"

git checkout master
echo "file A, version 2" > A
git add A
git commit -m "second version of file A"

git merge --no-edit new_branch

Run the Python script below, which prints all file modifications.
The code assumes pydriller 1.7, installed via pip.

from pydriller import RepositoryMining

PATH_TO_REPOSITORY = "repr"

for i, commit in enumerate(RepositoryMining(PATH_TO_REPOSITORY).traverse_commits(), start=1):
    mods = commit.modifications

    if (len(mods) == 0): print(f"{commit.hash}: no files changed")

    for m in mods:
        print(f"{commit.hash}: {m.change_type.name} {m.new_path}")

Expected output:

a6763da25856674843cff0105c162e3e4ead57b7: ADD A
720f2c15cb164a51b057975640f1184780e714b3: ADD B
6bebe30fbb1395775279cafc7f2dd8df365cbea2: MODIFY A
6df08c17e08eae9623f445994544a43bc5ba913d: no files changed

Actual output:

a6763da25856674843cff0105c162e3e4ead57b7: ADD A
720f2c15cb164a51b057975640f1184780e714b3: ADD B
6bebe30fbb1395775279cafc7f2dd8df365cbea2: MODIFY A
6df08c17e08eae9623f445994544a43bc5ba913d: ADD B

File B appears to have been created again in the last commit.

Running git show 6df08c1 --pretty=raw to list the changes in the last commit yields the following:

Vladimirs-MacBook-Pro:repr vovak$ git show 6df08c1 --pretty=raw
commit 6df08c17e08eae9623f445994544a43bc5ba913d
tree cc7d1f6be10849812cf58d519538dbd36d82e261
parent 6bebe30fbb1395775279cafc7f2dd8df365cbea2
parent 720f2c15cb164a51b057975640f1184780e714b3
author Vladimir Kovalenko <[email protected]> 1555373952 +0200
committer Vladimir Kovalenko <[email protected]> 1555373952 +0200

    Merge branch 'new_branch'

git show demonstrates that no files were changed in this last commit, but pydriller presents it as if a file was created there.
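One way to reason about the expected result: compared with its first parent, the merge adds B; compared with its second parent, it modifies A. Git's combined diff lists only files that differ from every parent, which here is none. A minimal sketch of that rule, working on sets of paths, follows; the function name `files_changed_by_merge` is hypothetical.

```python
from typing import Iterable, Set


def files_changed_by_merge(diffs_against_parents: Iterable[Set[str]]) -> Set[str]:
    """Return the paths that differ from every parent of a merge commit.

    Each input set holds the paths that differ between the merge result
    and one of its parents.
    """
    diffs = [set(d) for d in diffs_against_parents]
    if not diffs:
        return set()
    return set.intersection(*diffs)
```

For the repository above, the inputs would be {"B"} and {"A"}, so the intersection is empty, matching the "no files changed" line in the expected output.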