CleanPy

This package cleans a dataset and returns summary statistics, as well as the number, proportion, and location of NA values, for string and numeric columns. Data cleaning made easy!

Collaborators

Heather Van Tassel, Phuntsok Tseten, Patrick Tung

Overview

Data cleaning is usually the first step in any data science problem, and if you don't clean your data well, every later step becomes harder. CleanPy is our attempt at a data cleaning package that helps users clean their data in a meaningful way and addresses the problem of messy data directly.

CleanPy provides a streamlined process that gives you an easy-to-read table of summary statistics for your data. It locates all the missing data for you and tells you exactly where it occurs. You can also define how you would like to deal with the missing data.

Function

Function 1) summary: Summary statistics generator for string and numeric data from dataframes.

def summary(data):
    """
    This function computes summary statistics for the text and numeric columns of a given dataframe.
    Input: dictionary or dataframe
    Returns the summary statistics for each column, nested in a pandas dataframe. Since pandas only
    accepts one data type per column, the type of each column only needs to be tested once.
    One of two summaries is computed, depending on whether the column data type is
    1) string/bool or 2) int/float/datetime.
    For numeric columns it returns a dictionary of summary statistics including the
    min, max, mean, median, count (number of non-NA values per column) and count_NA
    (number of NA values per column). Similarly, for string columns it returns the unique string values and
    their counts in a dictionary. The column summary statistics are then nested into a pandas dataframe and returned.

    Parameters
    ----------
    data : pd.DataFrame
        used to provide summary statistics of each column.

    Returns
    -------
    Summary pandas dataframe of each column's summary statistics

    >>> summary(pd.DataFrame({"Likes coding": [4, 3, 2, 2]}))
    pd.DataFrame(
        "unique" = [4,3,2]
        "min" = 2
        "max" = 4
        "mean" = 11/4
        "median" = 2.5
        "count" = 4
        "count_NA" = 0)
    """
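
The per-column logic described in the docstring above can be sketched roughly as follows. This is a minimal illustration, not CleanPy's actual implementation: the names `summarize_column` and `summary_sketch` are hypothetical, the result is a plain dictionary rather than a nested dataframe, and only int/float columns are treated as numeric here (datetime handling is left out).

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def summarize_column(column: pd.Series) -> dict:
    """Hypothetical helper: statistics for one column, in the spirit of summary()."""
    stats = {
        "count": int(column.count()),          # number of non-NA values
        "count_NA": int(column.isna().sum()),  # number of NA values
    }
    # pandas reports bool as numeric, so exclude it to match the string/bool group.
    if is_numeric_dtype(column) and not is_bool_dtype(column):
        stats["min"] = column.min()
        stats["max"] = column.max()
        stats["mean"] = column.mean()
        stats["median"] = column.median()
    else:
        # Unique values and how often each occurs (NA values are skipped).
        stats["unique"] = column.value_counts().to_dict()
    return stats


def summary_sketch(data: pd.DataFrame) -> dict:
    """Hypothetical stand-in for summary(): one statistics dict per column."""
    return {name: summarize_column(data[name]) for name in data.columns}


print(summary_sketch(pd.DataFrame({"Likes coding": [4, 3, 2, 2]})))
```

The column type is checked once per column, which is all that is needed because a pandas column holds a single dtype.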

Function 2) locate_na: Returns a dataframe of the count and indices of NA values. This function takes in a dataframe, finds the NA values, and returns their locations along with the total count of NAs.

def locate_na(data):
    """
    Locate and return the indices of all missing values within an input dataframe.
    Each key of the returned dictionary is a column of the dataframe, and its value
    is the list of row indices of the missing values in that column.

    Parameters
    ----------
    data : dataframe
        This is the dataframe that the function will use to locate NAs.

    Returns
    -------
    dictionary of lists
        key = column indices that contain missing values
        value = list of row indices that have missing values

    >>> locate_na(pd.DataFrame(np.array([["Yes", "No"], [None, "Yes"]])))
    {"0": [1]}
    >>> locate_na(pd.DataFrame(np.array([[1, 2, None], [None, 2, 3]])))
    {"0": [1], "2": [0]}
    """
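
One way to get this behaviour with plain pandas is sketched below. It is an assumption-laden illustration rather than the package's code: `locate_na_sketch` is a hypothetical name, and the keys are the column labels exactly as pandas stores them.

```python
import pandas as pd


def locate_na_sketch(data: pd.DataFrame) -> dict:
    """Hypothetical stand-in for locate_na(): column label -> row indices of NAs."""
    mask = data.isna()  # True wherever a value is missing
    return {
        col: data.index[mask[col].to_numpy()].tolist()
        for col in data.columns
        if mask[col].any()  # skip columns without missing values
    }


toy_data = pd.DataFrame(
    {"x": [None, "b", "c"], "y": [2, None, None], "z": [3.6, 8.5, None]}
)
print(locate_na_sketch(toy_data))  # {'x': [0], 'y': [1, 2], 'z': [2]}
```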

Function 3) replace_na: Replaces missing values with the min, max, median, or mean (default) value of the column(s). There is also an option to remove the rows with NAs instead.

def replace_na(data, columns, replace="mean", remove=False):
    """
    This function replaces NA values with either the min, max, median or mean
    value or removes the rows.
    Parameters
    ----------
    data : dataframe
        This is the dataframe that the function will use to replace NAs.

    columns : list
        List of columns to replace missing values on.

    replace : string
        Specifies how to replace missing values.
        values include: "mean", "min", "max", "median"

    remove : boolean
        Tells the function whether or not to remove rows with NA.
        If True, replace argument will not be used.

    Returns
    -------
    dataframe
        A pandas dataframe where each NA is replaced by the mean,
        min, max, or median (as specified by the user)

    >>> replace_na(pd.DataFrame(np.array([[0, 1], [np.nan, 1]])), replace="min", columns=[0])
    pd.DataFrame(np.array([[0, 1], [0, 1]]))
    """
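
The replacement logic can be approximated with `fillna` and `dropna`, as in the sketch below. `replace_na_sketch` is a hypothetical name, and two details are assumptions rather than documented behaviour: the input dataframe is copied instead of modified, and `remove=True` drops only rows that have an NA in one of the listed columns.

```python
import pandas as pd

VALID_REPLACEMENTS = {"mean", "min", "max", "median"}


def replace_na_sketch(data, columns, replace="mean", remove=False):
    """Hypothetical stand-in for replace_na()."""
    if replace not in VALID_REPLACEMENTS:
        raise ValueError(f"replace must be one of {sorted(VALID_REPLACEMENTS)}")
    if remove:
        # Assumption: only NAs in the selected columns cause a row to be dropped.
        return data.dropna(subset=columns)
    result = data.copy()  # assumption: leave the caller's dataframe untouched
    for col in columns:
        fill_value = result[col].agg(replace)  # e.g. the column mean, ignoring NAs
        result[col] = result[col].fillna(fill_value)
    return result


toy_data = pd.DataFrame(
    {"x": [None, "b", "c"], "y": [2, None, None], "z": [3.6, 8.5, None]}
)
print(replace_na_sketch(toy_data, columns=["y"], replace="mean"))
```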

CleanPy and Python's Ecosystem

Going through your data line by line can be tedious. A quick summary of the data not only saves a lot of time but also gives you an overall picture of your data, which helps you understand the task at hand. pandas already offers a summary method, pandas.DataFrame.describe(). CleanPy's summary() function is similar to describe() but takes it a step further: it also reports the number of missing values and summarizes string columns, and it presents the results in an intuitive manner. As for locate_na() and replace_na(), we are not aware of any similar function in the current Python ecosystem. The only way to get the same results is to manually combine a few functions, including pandas.DataFrame.isna().
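
To make the comparison concrete, the following sketch uses only pandas on a small example dataframe (the data is illustrative). It shows that describe() reports the non-NA count per column, while counting missing values requires combining isna() with other calls by hand.

```python
import pandas as pd

toy_data = pd.DataFrame(
    {"x": [None, "b", "c"], "y": [2, None, None], "z": [3.6, 8.5, None]}
)

# describe() gives the non-NA count per column, but no missing-value count.
description = toy_data.describe(include="all")
print(description.loc["count"])

# Counting NAs and their proportion means combining isna() with sum()/mean().
na_counts = toy_data.isna().sum()
na_proportions = toy_data.isna().mean()
print(na_counts)
print(na_proportions)
```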

Installation

CleanPy can be installed using pip:

pip install git+https://github.com/UBC-MDS/CleanPy.git

Then you can import the functions using:

from CleanPy import summary, locate_na, replace_na

Usage

Let's assume that you have a dataframe like the following:

import pandas as pd

toy_data = pd.DataFrame({"x": [None, "b", "c"], "y": [2, None, None], "z": [3.6, 8.5, None]})

summary Arguments:
- data: dataframe that the function will provide summary statistics on
- Example: summary(toy_data)
- Output:

locate_na Arguments:
- data: dataframe that the function will use to locate NAs
- Example: locate_na(toy_data)
- Output: {'x': [0], 'y': [1, 2], 'z': [2]}

replace_na Arguments:
- data: dataframe that the function will use to replace NAs
- columns: list of columns to replace missing values on
- replace: specifies how to replace missing values
- remove: tells the function whether or not to remove rows with NA
- Example: replace_na(toy_data, columns=["y"], replace="mean", remove=False)
- Output:

Branch Coverage

You can install the coverage package from the terminal/command prompt with the following command:

pip install coverage

To get the branch coverage of the package, type the following at the root of the folder:

coverage run -m --branch pytest -q; coverage report -m

# If you want to view it interactively
coverage html

The coverage results are shown below:

Name                              Stmts   Miss Branch BrPart  Cover   Missing
-----------------------------------------------------------------------------
CleanPy/__init__.py                   4      0      0      0   100%
CleanPy/locate_na.py                 20      0     12      0   100%
CleanPy/replace_na.py                30      0     26      0   100%
CleanPy/summary.py                   25      0      8      0   100%
CleanPy/test/test_locate_na.py       40      0      2      0   100%
CleanPy/test/test_replace_na.py      49      0      0      0   100%
CleanPy/test/test_summary.py         46      0      0      0   100%
-----------------------------------------------------------------------------
TOTAL                               214      0     48      0   100%

Python Dependencies

Pandas
Numpy

Workflow Notes

We are starting to implement and merge more and more features into the remote master on GitHub.
Here is a quick reference on how to update your local branch (such as the sharpen branch on your computer) so that your local work does not fall too far behind.
[at the root of your local clone of this repo]
It is a good idea to always check which local branch you are currently in:
git branch
If you are in your local feature branch, such as contrast or sharpen, great! Now you can switch to your local master, and catch your local master up to speed with the remote master.
git checkout master
git pull
You can now switch back to your local feature branch (sharpen or vibrance, etc), and incorporate changes from the local master to your local feature branch.
git checkout sharpen (or git checkout vibrance)
git merge master
Resolve any merge conflicts locally, and your local branch should now be all up-to-date with the remote master!
Continue working on your function and test in your local branch (use git branch when in doubt!), and make commits as usual. When you are ready, push your work from your local feature branch to your remote feature branch by simply using:
git push
(We can do this because the first time you pushed to your remote feature branch you should have used git push -u origin vibrance or git push -u origin sharpen, so now a plain git push does the same.)
Make a pull request to merge your remote feature branch into the remote master on GitHub.
A summary of the above commands:
git checkout master
git pull
git checkout sharpen (or vibrance)
git merge master
[continue working on your function and tests]
git push
[issue a pull request on github when ready]

Credit goes to George for these tips.

Running tests with pytest
In your test script, import your function using import function_name.
On the command line, in the test folder, run:
pytest test_summary.py
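
As a rough illustration of what such a test script can look like, the sketch below follows the pytest convention of plain functions whose names start with `test_`. The function under test, `count_na`, is a hypothetical stand-in defined inline so that the example is self-contained; in a real test script it would be imported from the package instead.

```python
import pandas as pd


def count_na(data: pd.DataFrame) -> int:
    """Hypothetical stand-in for the function under test."""
    return int(data.isna().sum().sum())


def test_count_na_counts_every_missing_cell():
    toy_data = pd.DataFrame({"x": [None, "b", "c"], "y": [2, None, None]})
    assert count_na(toy_data) == 3


def test_count_na_returns_zero_without_missing_values():
    assert count_na(pd.DataFrame({"x": [1, 2]})) == 0
```

pytest collects every `test_*` function in the file and reports each failing assertion separately.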
