CorrPy

Latest Update Date: 2019 Feb.

Overview

This package helps users calculate correlation coefficients and the covariance matrix of data that contains missing values. Both calculations rely on the standard deviation of the data, but real-world data is not always clean and tidy. NumPy's `numpy.std`, `numpy.corrcoef` and `numpy.cov` return `nan` when the data has missing values. CorrPy aims to overcome this obstacle by handling missing values for the user. It uses the listwise deletion method: rows of a data frame that contain a missing value are removed before the calculation.
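Listwise deletion can be illustrated with a few lines of NumPy. This is a minimal sketch, not CorrPy's own code: the helper name `listwise_delete` and the sample data are hypothetical and chosen only for illustration.

```python
import numpy as np

def listwise_delete(data):
    """Drop every row of a 2-D array that contains at least one NaN."""
    data = np.asarray(data, dtype=float)
    keep = ~np.isnan(data).any(axis=1)  # True for rows with no missing value
    return data[keep]

# Hypothetical data: two variables (columns), five observations (rows).
data = [[1.0, 10.0],
        [2.0, np.nan],
        [np.nan, 30.0],
        [4.0, 40.0],
        [5.0, 50.0]]

clean = listwise_delete(data)
print(clean)  # only the first, fourth and fifth rows remain
```

A row is removed as soon as any one of its variables is missing, so every remaining observation is complete across all variables.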

Note: If the course timeline permits, CorrPy will also handle missing values via single imputation with the mean: replacing each missing value with the mean of the existing values.
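For comparison with listwise deletion, mean imputation might look like the sketch below. It is not part of CorrPy; `mean_impute` is a hypothetical helper written for this illustration.

```python
import numpy as np

def mean_impute(values):
    """Replace each NaN with the mean of the non-missing values."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)  # mean computed over the existing values only
    return np.where(np.isnan(values), mean, values)

x = [1, 2, np.nan, 4, np.nan, 6]
filled = mean_impute(x)
print(filled)  # both NaN entries become 3.25, the mean of 1, 2, 4 and 6
```

Unlike listwise deletion, this keeps the number of observations unchanged, at the cost of shrinking the variance of the imputed variable.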

Team

Name	Slack Handle	GitHub Handle	Link
KERA YUCEL	`@KERA YUCEL`	`@K3ra-y`	Kera's link
GOPALAKRISHNAN ANDIVEL	`@Krish`	`@Gopsathvik`	Krish's link
WEISHUN DENG	`@Wilson Deng`	`@xiaoweideng`	Wilson's link
Mengda Yu	`@Mengda(Albert) Yu`	`@mru4913`	Albert's link

Installation

CorrPy can be installed with pip in a command window:

pip install git+https://github.com/UBC-MDS/CorrPy.git

Branch Coverage Test

To test branch coverage, we use coverage.py. You can install it with `pip install coverage`.

We also provide a Makefile to automate the process. Run the following command to see the branch coverage report.

make report_branch

The results are shown below.

Name                            Stmts   Miss Branch BrPart  Cover   Missing
---------------------------------------------------------------------------
CorrPy/__init__.py                  4      0      0      0   100%
CorrPy/corr_plus.py                26      0     12      0   100%
CorrPy/cov_mx.py                   20      0      8      0   100%
CorrPy/std_plus.py                 15      0      8      0   100%
CorrPy/test/__init__.py             0      0      0      0   100%
CorrPy/test/test_corr_plus.py      41      0      0      0   100%
CorrPy/test/test_cov_mx.py         45      0      0      0   100%
CorrPy/test/test_std_plus.py       35      0      0      0   100%
---------------------------------------------------------------------------

Test

To run all the tests, we use pytest via `make test_all`.

The results are shown below.

Functions

Standard Deviation (`std_plus`)

The standard deviation measures how far the data points are spread around the mean, which gives insight into the variation in the data. This function automatically handles missing values in the input.

$s = \sqrt{\frac{\sum(x-\overline{x})^2}{n-1}}$

`std_plus` spares users from cleaning missing values by hand before computing the standard deviation.

Example:

>>> import numpy as np
>>> from CorrPy import std_plus
>>> x = [1, 2, np.nan, 4, np.nan, 6]
>>> std_plus(x)
array([1.920286436967152])

>>> y = [1, 2, np.inf, 4, np.nan, 6, "a"]
>>> std_plus(y)
array([1.920286436967152])
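The same idea can be sketched with plain NumPy. This is an illustration under stated assumptions, not CorrPy's implementation: `to_valid_floats` and `std_sketch` are hypothetical names, and the sketch assumes that NaN, infinite and non-numeric entries all count as missing.

```python
import math
import numpy as np

def to_valid_floats(values):
    """Keep only finite real numbers; NaN, inf and non-numeric entries are dropped."""
    kept = [float(v) for v in values
            if isinstance(v, (int, float)) and math.isfinite(v)]
    return np.array(kept)

def std_sketch(values, ddof=0):
    """Standard deviation of the valid entries; ddof selects the denominator n - ddof."""
    return np.std(to_valid_floats(values), ddof=ddof)

y = [1, 2, np.inf, 4, np.nan, 6, "a"]
print(std_sketch(y))          # ≈ 1.9203 (denominator n)
print(std_sketch(y, ddof=1))  # ≈ 2.2174 (denominator n - 1)
```

For this input, `ddof=0` reproduces the value shown in the example, while `ddof=1` corresponds to the $n-1$ denominator in the formula.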

Correlation Coefficients (`corr_plus`)

The correlation coefficient measures the direction and strength of the linear relationship between two variables. This function automatically handles missing values in the input.

$r = \frac{1}{n-1}(\frac{\sum(x-\overline{x})(y-\overline{y})}{s_{x}s_{y}})$

Example:

>>> import numpy as np
>>> from CorrPy import corr_plus
>>> x = [1, 2, np.nan, 4, 5]
>>> y = [-6, -7, -8, 9, True]
>>> corr_plus(x, y)
array([0.7391090892601785])
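A rough NumPy equivalent is sketched below: delete every pair in which either value is missing, then apply the Pearson formula. `corr_sketch` is a hypothetical name, and the sketch assumes that `True` is converted to `1.0`, which happens to reproduce the value in the example; it makes no claim about how `corr_plus` works internally.

```python
import numpy as np

def corr_sketch(x, y):
    """Pearson correlation after listwise deletion of non-finite pairs."""
    x = np.asarray(x, dtype=float)  # assumption: True becomes 1.0 here
    y = np.asarray(y, dtype=float)
    keep = np.isfinite(x) & np.isfinite(y)  # keep pairs where both values exist
    x, y = x[keep], y[keep]
    dx, dy = x - x.mean(), y - y.mean()
    # The 1/(n-1) factors in the covariance and in s_x * s_y cancel out.
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

x = [1, 2, np.nan, 4, 5]
y = [-6, -7, -8, 9, True]
r = corr_sketch(x, y)
print(r)  # ≈ 0.7391
```

The third pair is dropped because `x` is missing there, even though `y` has a valid value at that position.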

Covariance Matrix (`cov_mx`)

A covariance matrix displays the variances and covariances together. This function builds on the two functions above.

$Cov(X,Y) = \frac{\sum(x-\overline{x})(y-\overline{y})}{N}$

The diagonal elements represent the variances, and the off-diagonal elements represent the covariances, as shown in the matrix below.

$\Sigma = \begin{bmatrix}Var(X_1) & Cov(X_1 X_2) &\cdots&\cdots &\cdots & Cov(X_1 X_k)\\ Cov(X_2 X_1) &Var(X_2)& \cdots &\cdots &\cdots & \cdots\\ \cdots & \cdots &\ddots &\cdots &\cdots & \cdots\\ \cdots & \cdots &\cdots &\ddots &\cdots & \cdots\\ Cov(X_{k-1} X_1) & \cdots &\cdots &\cdots &Var(X_{k-1}) & \cdots\\ Cov(X_k X_1) & Cov(X_k X_2) &\cdots &\cdots &\cdots & Var(X_k)\\\end{bmatrix}$

Example:

>>> import numpy as np
>>> from CorrPy import cov_mx
>>> x = [1, 2, np.nan, 4, 5]
>>> y = [-6, -7, -8, 9, True]
>>> cov_mx([x, y])
array([[ 2.33333333, 12.66666667],
       [12.66666667, 80.33333333]])
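The matrix above can be reproduced with plain NumPy under two assumptions inferred from the numbers, not from CorrPy's source: the observation containing `True` is treated as missing along with the one containing `np.nan`, and the sums are divided by $n-1$ (NumPy's default for `numpy.cov`). The names `is_valid` and `cov_sketch` are hypothetical.

```python
import math
import numpy as np

def is_valid(v):
    """Assumption: booleans, strings, NaN and inf all count as missing."""
    return (isinstance(v, (int, float))
            and not isinstance(v, bool)
            and math.isfinite(v))

def cov_sketch(variables):
    """Covariance matrix after listwise deletion across all variables."""
    # zip(*variables) yields one tuple per observation across all variables.
    rows = [row for row in zip(*variables) if all(is_valid(v) for v in row)]
    clean = np.array(rows, dtype=float)   # shape: (observations, variables)
    return np.cov(clean, rowvar=False)    # default normalisation is n - 1

x = [1, 2, np.nan, 4, 5]
y = [-6, -7, -8, 9, True]
sigma = cov_sketch([x, y])
print(sigma)
```

Only the observations (1, -6), (2, -7) and (4, 9) survive the deletion step, so the variances and covariance are computed from three rows.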

How does the `CorrPy` package fit into the Python ecosystem?

The following functions already exist in the Python ecosystem. However, they do not handle missing values, so the CorrPy package implements the calculation of the standard deviation, correlation coefficients and covariance matrix with missing-value handling built in.

NumPy Standard Deviation (`numpy.std`): https://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.std.html

NumPy Correlation Coefficients (`numpy.corrcoef`): https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.corrcoef.html

NumPy Covariance Matrix (`numpy.cov`): https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.cov.html
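The short check below shows how a single missing value propagates through these three NumPy functions. The sample arrays are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])   # one missing value
y = np.array([-6.0, -7.0, -8.0, 9.0, 1.0])   # complete

std_x = np.std(x)        # nan
r = np.corrcoef(x, y)    # every entry involving x is nan
c = np.cov(x, y)         # every entry involving x is nan

print(std_x)
print(r[0, 1])
print(c[0, 0], c[0, 1])
```

Every statistic that touches `x` comes back as `nan`, which is the gap CorrPy is meant to fill.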

Milestone Progress

Milestone	Tasks
Milestone 1	Proposal
Milestone 2	Function Code
	Test Code
