chaoss / grimoirelab-cereslib


This project aims at unifying, eventizing and enriching information from the Perceval tool

License: GNU Lesser General Public License v3.0

Python 100.00%

grimoirelab-cereslib's Introduction

Ceres

Ceres is a library that aims at dealing with data in general, and software development data in particular.

The initial goal of Ceres is to parse information in several ways from the Perceval tool in the GrimoireLab project.

However, the more code is added to this project, the more generic methods are found to be useful in other areas of analysis.

Ceres can help with the following areas of analysis:

Eventize

The 'eventizer' splits information coming from Perceval. In short, Perceval produces JSON documents, and these can be consumed by the 'eventizing' side of Ceres.

'Eventizing' is the process of parsing a full Perceval JSON document and producing a Pandas DataFrame with a certain amount of information.

As an example, a commit contains information about the commit itself and about the files that were 'touched' by it. Depending on the granularity of the analysis, Ceres works in the following way:

Granularity = 1: This is the first level and produces a one-to-one relationship with the main items in the original data source. For example, 1 commit would be just 1 row in the resulting dataframe. The same applies to a code review in Gerrit or a ticket in Bugzilla.
Granularity = 2: This is the second level, and how deep it goes depends on the data source. In the specific case of commits, this returns n rows in the dataframe: one row for each file that was 'touched' in the commit.
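
The two granularity levels can be sketched in plain Python as follows. This is an illustration only, not Ceres code: the eventize_commit function and the row field names are hypothetical, and the input is a simplified stand-in for a Perceval Git item (assumed here to carry a data dict with commit, Author, and a files list).

```python
def eventize_commit(item, granularity=1):
    """Return a list of row dicts for one simplified commit item (hypothetical helper)."""
    data = item["data"]
    base = {"commit": data["commit"], "author": data["Author"]}
    if granularity == 1:
        # Level 1: one row per commit.
        return [base]
    if granularity == 2:
        # Level 2: one row per file touched by the commit.
        return [dict(base, file=f["file"]) for f in data.get("files", [])]
    raise ValueError(f"unsupported granularity: {granularity}")


# Simplified, made-up item for illustration.
item = {
    "data": {
        "commit": "abc123",
        "Author": "Jane Doe <jane@example.com>",
        "files": [{"file": "README.md"}, {"file": "src/main.py"}],
    }
}

rows_level_1 = eventize_commit(item, granularity=1)
rows_level_2 = eventize_commit(item, granularity=2)
print(len(rows_level_1), len(rows_level_2))
```

Each list of rows could then be passed to pandas.DataFrame(rows) to obtain the dataframe for that granularity.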

Format

The format part of the library contains utilities for basic formatting actions, such as giving a whole column of a Pandas dataframe the same string format.

Another example is using the format utils to cast from string to date with dateutil, applying the method to a whole column of a given dataframe.
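
As a rough sketch of both formatting actions, the snippet below uses plain pandas and dateutil rather than the Ceres format utils themselves. The column names and sample values are made up for illustration.

```python
import pandas as pd
from dateutil import parser

# Made-up sample data: mixed types in "id", mixed date formats in "date".
df = pd.DataFrame(
    {
        "id": [1, "2"],
        "date": ["2017-03-01 10:00:00", "March 7, 2017 09:30"],
    }
)

# Give the whole column the same string format.
df["id"] = df["id"].astype(str)

# Cast every string in the column to a datetime using dateutil.
df["date"] = df["date"].apply(parser.parse)

print(df)
```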

Filter

The filter utility basically removes rows based on certain values in certain cells of a dataframe.
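
The idea can be sketched with plain pandas. The drop_rows_with helper below is hypothetical and is not part of the Ceres API; it only shows the kind of row removal described above.

```python
import pandas as pd


def drop_rows_with(df, column, values):
    """Return a copy of df without the rows whose `column` value is in `values`."""
    return df[~df[column].isin(values)].reset_index(drop=True)


# Made-up sample data.
df = pd.DataFrame(
    {
        "author": ["alice", "bot", "bob"],
        "file": ["a.py", "b.py", "c.py"],
    }
)

filtered = drop_rows_with(df, "author", ["bot"])
print(filtered)
```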

Data Enrich

Together with the eventizing actions, this is the most context-dependent utility. It adds or modifies one or more columns in several ways.

Examples include handling surrogates so that text can be encoded as UTF-8, adding new columns based on operations on other columns, and adding the gender of the name provided in another column.
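
Two of these enrichments can be sketched in plain Python on a list of row dicts. The replace_surrogates helper and the field names are hypothetical, and the replacement strategy (substituting '?') is only one possible choice, not necessarily what Ceres does.

```python
def replace_surrogates(text):
    """Replace characters that cannot be encoded as UTF-8 (lone surrogates) with '?'."""
    return text.encode("utf-8", "replace").decode("utf-8")


# Made-up row containing a lone surrogate in the author name.
rows = [{"author": "Jane\udc80 Doe", "email": "jane@example.com"}]

for row in rows:
    # Modify an existing column so it is safe to encode as UTF-8.
    row["author"] = replace_surrogates(row["author"])
    # Add a new column derived from another one.
    row["email_domain"] = row["email"].split("@")[-1]

print(rows)
```

With a dataframe, the same function could be applied to a whole column through Series.apply.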

How can you help here?

This project is still quite new, and the development is really slow, so any extra hand would be really awesome, even giving directions, pieces of advice or feature requests :).

And of course, using the software would be great!

Where to start?

The examples folder contains some of the clients I've used for some analysis such as the gender analysis or to produce dataframes that help to understand the areas of the code where developers are working.

Those are probably a good place to have a look at.

Requirements

Python >= 3.8

You will also need some other libraries to run the tool. The whole list of dependencies is in the pyproject.toml file.

Installation

There are several ways to install Cereslib on your system: packages or source code using Poetry or pip.

PyPI

Cereslib can be installed using pip, a tool for installing Python packages. To do so, run the following command:

$ pip install cereslib

Source code

To install from the source code you will need to clone the repository first:

$ git clone https://github.com/chaoss/grimoirelab-cereslib
$ cd grimoirelab-cereslib

Then use pip or Poetry to install the package along with its dependencies.

Pip

To install the package from the local directory, run the following command:

$ pip install .

In case you are a developer, you should install cereslib in editable mode:

$ pip install -e .

Poetry

We use Poetry for dependency management and packaging. You can install it by following its documentation. Once it is installed, you can install cereslib and its dependencies in an isolated project environment using:

$ poetry install

To spawn a new shell within the virtual environment, use:

$ poetry shell

License

Licensed under GNU General Public License (GPL), version 3 or later.


grimoirelab-cereslib's Issues

csv for `openstack_gender` missing?

Hi @valeriocos and other contributors.
I was exploring this project and noticed that in the openstack_gender.py study, the CSVs to import are missing (ref here). Can I get some help with this? Do I have to extract the CSVs from a source or something?

I would love to contribute to the project; the way git commits are enriched in the areas_code study was very intuitive. I did see some TODOs and incomplete code, and I will try to fix those and add some tests to improve coverage. Please let me know if there is any other requirement or work that I could help with. I would love to do that too.

regexp warning for str.replace

I get this error running the latest version of GrimoireLab by using the docker-compose-opensearch.yml file. It points to the container grimoirelab/grimoirelab:latest which now is the 0.7.1 version.

/usr/local/lib/python3.8/site-packages/cereslib/enrich/enrich.py:185: FutureWarning: The default value of regex will change from True to False in a future version.
  self.data['file_dir_name'] = self.data[column].str.replace('/+', '/')
/usr/local/lib/python3.8/site-packages/cereslib/enrich/enrich.py:202: FutureWarning: The default value of regex will change from True to False in a future version.
  self.data['file_path_list'] = self.data[column].str.replace('/+', '/')
/usr/local/lib/python3.8/site-packages/cereslib/enrich/enrich.py:203: FutureWarning: The default value of regex will change from True to False in a future version.
  self.data['file_path_list'] = self.data.file_path_list.str.replace('^/', '')
/usr/local/lib/python3.8/site-packages/cereslib/enrich/enrich.py:204: FutureWarning: The default value of regex will change from True to False in a future version.
  self.data['file_path_list'] = self.data.file_path_list.str.replace('/$', '')
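
One way to avoid this warning is to pass regex=True explicitly, since all four patterns are regular expressions. The sketch below reproduces the same replacements on a standalone pandas Series with made-up paths; it is not a patch to enrich.py.

```python
import pandas as pd

# Made-up file paths for illustration.
paths = pd.Series(["/src//lib/file.py/", "docs///index.md"])

cleaned = (
    paths.str.replace("/+", "/", regex=True)   # collapse repeated slashes
         .str.replace("^/", "", regex=True)    # drop a leading slash
         .str.replace("/$", "", regex=True)    # drop a trailing slash
)

print(cleaned.tolist())
```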

.hpp is not categorized as code

The file extension .hpp is not categorized as code in the areas of code indexes and it should be. See more details about this extension at https://en.wikipedia.org/wiki/Precompiled_header.

This behavior is seen in the latest version available. See https://github.com/chaoss/grimoirelab-cereslib/blob/master/cereslib/enrich/enrich.py#L132
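
To illustrate the requested behavior, here is a minimal sketch that assumes a simple extension-based check. The CODE_EXTENSIONS set and the is_code function are hypothetical and do not reflect how cereslib actually classifies files.

```python
import os

# Hypothetical extension set for illustration; not the list used by cereslib.
CODE_EXTENSIONS = {".c", ".cpp", ".h", ".hpp", ".py"}


def is_code(path):
    """Return True if the file extension is in CODE_EXTENSIONS."""
    _, ext = os.path.splitext(path)
    return ext.lower() in CODE_EXTENSIONS


print(is_code("include/vector.hpp"))
```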

Require DCO sign-off for new commits

This issue is to activate probot/dco (or a similar bot) to check that all commits have a sign-off in this repository.

The CHAOSS Project Charter section 8.2.1 requires that all contributions are signed-off. The CHAOSS project has been piloting the use of DCO sign-offs. Once contributors know how to do it, sign-offs are easy to do with little overhead.

For users of the git command line interface, a sign-off is accomplished with the -s as part of the commit command: git commit -s -m 'This is a commit message'

For users of the GitHub interface, a sign-off is accomplished by writing Signed-off-by: Your Name <[email protected]> into the commit comment field. This can be automated by using a browser plugin like scottrigby/dco-gh-ui
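
For reference, a sign-off is just a trailer line in the commit message. The following sketch shows how such a line could be detected; the has_sign_off function and its regular expression are hypothetical simplifications, not the check performed by the DCO bot.

```python
import re

# Simplified pattern: "Signed-off-by: Name <user@host>" on a line of its own.
SIGN_OFF = re.compile(r"^Signed-off-by: .+ <[^<>@\s]+@[^<>@\s]+>$", re.MULTILINE)


def has_sign_off(message):
    """Return True if the commit message contains a sign-off line."""
    return SIGN_OFF.search(message) is not None


signed = "Fix typo\n\nSigned-off-by: Jane Doe <jane@example.com>"
unsigned = "Fix typo"

print(has_sign_off(signed), has_sign_off(unsigned))
```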

To-Do for repo maintainers: Please inform your contributors about DCO sign-offs and comment on this issue when you are ready for the DCO bot to be activated on this repository.

Create Ceres pip package

In order to use Ceres in GrimoireLab, a pip package must exist so that it can be installed.

Github Eventizer

Hi there,

I am in the phase of doing some research before starting my Master's Thesis.
I have already done an extensive literature review. However, more often than not, the tools used were not open source or lacked accessibility (thinking of CHAMELEON for clustering, or Understand for static analysis).

I am now planning to use parts of grimoirelab to extract a collaboration graph.

Anyway, I think I am going to write a GitHub eventizer to extract the necessary information from PRs extracted from the repository. Would there be any interest in having this merged upstream? If so, I'd fork the project rather than adding it to my script.

BR,

Manuel