sdv-dev / copulas
A library to model multivariate data using copulas.
Home Page: https://sdv.dev/Copulas/
License: Other
Got an exception while trying to sample data using Vine Copulas with a Regular Tree.
import pandas as pd
from copulas.multivariate import VineCopula, TreeTypes
X = pd.DataFrame([
[1, 0, 0],
[0, 1, 0],
[0, 0, 1]
])
vine = VineCopula(TreeTypes.REGULAR)
vine.fit(X)
vine.sample()
which gave me the following exception:
copulas/multivariate/vine.py in sample(self, num_rows)
165 if (edge.L == current and edge.R == visited[0]) or\
166 (edge.R == current and edge.L == visited[0]):
--> 167 current_ind = edge.index
168 break
169 else:
AttributeError: 'Edge' object has no attribute 'index'
Not being a copulas expert, this is more a quick question than an issue. May I use Copulas to
fully emulate Matlab's copulastat? That is:
r = copulastat('Gaussian',rho) returns the Kendall’s rank correlation, r, that corresponds to a Gaussian copula with linear correlation parameters rho.
and the same applies to copularnd:
u = copularnd('Gaussian',rho,n) returns n random vectors generated from a Gaussian copula with linear correlation parameters rho.
After searching, it looks like there are a couple of packages, Copulas and copulalib, to deal with
copulas in Python. So before starting to work with one or the other, it would be good to have some feedback from the experts.
Thanks!
Add an independence copula class to fit data that are not correlated.
Currently, the methods to serialize copulas can return nested dictionaries, which are not useful to work with in some use cases. We can change to_dict
in order to keep the information about the internal structure in the keys, something like:
>>> copula.to_dict()  # actual implementation
{
    'one_attribute': 0,
    'nested_attribute': {
        'foo': 'bar'
    }
}
>>> copula.to_dict()  # desired behavior
{
    'one_attribute': 0,
    'nested_attribute__foo': 'bar'
}
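A possible implementation of this flattening is sketched below; the helper name `flatten_dict` and the `__` separator are assumptions, not part of the library:

```python
# Hypothetical sketch: flatten nested dictionaries, joining keys with '__'.
def flatten_dict(nested, prefix=''):
    """Return a flat dict where nested keys are joined by '__'."""
    flat = {}
    for key, value in nested.items():
        full_key = prefix + '__' + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, full_key))
        else:
            flat[full_key] = value
    return flat

# flatten_dict({'one_attribute': 0, 'nested_attribute': {'foo': 'bar'}})
# -> {'one_attribute': 0, 'nested_attribute__foo': 'bar'}
```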
Integrate with TravisCI in order to:
The means used in the sample method of a copulas.multivariate.GaussianMultivariate instance should be a list of zeros with length equal to the number of columns in the distribution. This is on lines 85 and 89 of https://github.com/DAI-Lab/Copulas/blob/master/copulas/multivariate/gaussian.py.
Readme should be updated before the release for PyPI, with the following:
A number of univariate distributions are available in scipy. We may add a method to select the best univariate distribution for each column. It would be computationally expensive, but we can provide it as an option.
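As an illustration, such a selection could rank candidate scipy distributions by their Kolmogorov-Smirnov fit statistic; the function name and the candidate list below are assumptions, not library code:

```python
# Illustrative sketch: pick the scipy distribution whose MLE fit
# minimizes the Kolmogorov-Smirnov statistic on the data.
import numpy as np
from scipy import stats

def select_best_distribution(data, candidates=('norm', 'expon', 'gamma')):
    best_name, best_ks = None, np.inf
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(data)  # expensive when many candidates are tried
        ks = stats.kstest(data, name, args=params).statistic
        if ks < best_ks:
            best_name, best_ks = name, ks
    return best_name, best_ks

# Example: name, ks = select_best_distribution(np.random.normal(size=500))
```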
I was expecting the sample methods to allow the user to pass a seed for the random number generators.
Instead, we have to call, outside the function call, numpy.random.seed(seed_value)
and random.setstate(state_tuple).
This is bad practice from a software engineering standpoint and very error prone, because it affects global state. It can also negatively impact experiment reproducibility and the debugging stages.
Currently, in order to get the same sample from the sampling methods, we need to call
np.random.seed(seed_value)
random.setstate(random_state_tuple)
outside of the sampling function being invoked (i.e. sample). This results in what is called, in software engineering, a leaky abstraction. In order to solve this issue with seed control, there are (at least) two approaches:
1. Add an optional parameter to the sample methods, named seed or random_state.
2. Add a seed or random_state parameter to the class constructor, IF the distribution fit method requires some sort of stochastic process.
In scikit-learn and other popular Python machine learning tools, what happens is the following:
When a model depends on some sort of stochastic process during the fit procedure, the model class constructor allows the user to set the random_state
value. This value can be one of 3 things: None, an integer, or an instance of numpy.random.RandomState. No matter what the value is, it will be checked and processed by sklearn.utils.check_random_state,
which outputs a numpy.random.RandomState instance. Note that sklearn.utils.check_random_state
is invoked at the beginning of the fit method (check this example: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L129). If you set random_state
as an integer, every fit has to be deterministic in its behavior and output.
If, besides the fit method, there is another method that depends on stochastic processes (e.g. the sample
method in sklearn.neighbors.KernelDensity), we are allowed to control the seed through the random_state parameter.
In other more low level APIs like scipy, the seed must be an integer or None.
I also advise against using both the random and numpy.random modules at the same time, because it makes seed management harder.
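The scikit-learn pattern described above can be sketched as follows; the SomeCopula class is a hypothetical stand-in, not a library class:

```python
# Sketch of the sklearn-style pattern: normalize `random_state`
# (None, int, or RandomState) into a numpy.random.RandomState instance.
import numbers
import numpy as np

def check_random_state(random_state):
    if random_state is None:
        return np.random.mtrand._rand  # the global RandomState, as in sklearn
    if isinstance(random_state, numbers.Integral):
        return np.random.RandomState(random_state)
    if isinstance(random_state, np.random.RandomState):
        return random_state
    raise ValueError('%r cannot be used to seed a RandomState' % random_state)

class SomeCopula:
    """Hypothetical model that owns its random state (an assumption)."""
    def __init__(self, random_state=None):
        self.random_state = random_state

    def sample(self, num_rows=1):
        rng = check_random_state(self.random_state)
        return rng.uniform(size=num_rows)  # placeholder for real sampling

# Two instances with the same integer seed produce identical samples,
# without touching the global numpy.random state.
```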
EDIT: Current fix available at #62
If the fit data has only one value, and hence the std is 0, the current code has a workaround to avoid crashing later on when trying to sample.
Wouldn't it be better to capture and implement this exceptional situation by just making all samples constant, equal to the mean?
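The suggested behavior could look like the following sketch, using a minimal stand-in class rather than the library's actual GaussianUnivariate:

```python
# Hypothetical sketch: when std == 0, return constant samples equal to
# the mean instead of crashing on a degenerate normal distribution.
import numpy as np

class ConstantAwareGaussian:
    def fit(self, X):
        self.mean = np.mean(X)
        self.std = np.std(X)

    def sample(self, num_samples):
        if self.std == 0:
            # Constant input data -> constant samples equal to the mean.
            return np.full(num_samples, self.mean)
        return np.random.normal(self.mean, self.std, num_samples)

model = ConstantAwareGaussian()
model.fit(np.array([3, 3, 3, 3]))
model.sample(5)  # array([3., 3., 3., 3., 3.])
```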
Add from_dict and to_dict methods for Vine copulas that return a dictionary with all of the copula's internal parameters and are able to create a new instance from them.
Filenames must follow python naming conventions and shouldn’t be redundant (univariate/GaussianUnivariate.py -> univariate/gaussian.py)
Function and variables names shouldn’t be acronyms but explicit and clear names (Copulas.cdf -> Copulas.cumulative_distribution)
The sample method for a Gaussian copula currently requires that the data attribute exists. This is not correct, since a user should be able to create a copula by just setting the parameters and still sample.
This is on lines 91 and 92 of the gaussian copula file.
This was started by amontanez, but some details are still pending
Remove the requirements.txt and requirements_test.txt files and list the dependencies only in setup.py.
requirements_dev.txt should be kept, but it should install the test requirements as .[test]
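A hypothetical sketch of the resulting setup.py layout; the exact package names and versions below are illustrative assumptions:

```python
# Sketch: dependencies declared only in setup.py, with test
# requirements exposed as an extra installable via `pip install .[test]`.
install_requires = [
    'numpy',
    'pandas',
    'scipy',
]

extras_require = {
    'test': [
        'pytest>=3.4',
        'pytest-cov',
    ],
}

# setup(
#     name='copulas',
#     install_requires=install_requires,
#     extras_require=extras_require,
#     ...
# )
```

requirements_dev.txt would then contain `-e .[test]` plus any dev-only tools.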
If no distribution map is given, the Gaussian Copula uses the GaussianUnivariate class as the default distribution. It should instead take a distribution in as an argument and use that as the default.
In order to make the project easy to use for new users, we should have:
A README showing examples of Vines, listing all the copula types, ...
A notebook that connects to the copulas bucket on S3, downloads a dataset and runs it on a copula.
Currently, our implementation of the statistical functions of copulas expects and returns numpy.arrays. However, it could be useful for this functionality to also accept and return scalar values.
Right now, bivariate and univariate copulas have methods to compute the different probability functions; however, these methods return another function that is later called.
We should change this behavior to have functions that return the actual result values, instead of a function.
The list of changes to do is:
1. Rename all functions with descriptive names instead of acronyms.
2. Make the probability functions return values instead of a function.
3. Make the probability functions not require arguments that can be taken from self.
4. Unify types for input and output values, making all classes only accept and return np.ndarray.
Currently there are warnings in the fit method due to division by zero. In that case, theta should be set to infinity, and we should verify that the computation is still correct for get_pdf(), get_cdf(), etc.
Maybe add a CopulaException class to ensure theta is in the valid range, instead of checking inside each function.
Please could you explicitly inherit from object
when declaring your base classes BVCopula, MVCopula, UnivariateDistrib and others I may have missed? This should allow compatibility with Python 2.
Many thanks!
The error when using Python 2 is "TypeError: must be type, not classobj" whenever super is called.
I did some quick testing of parts of the codebase (not 100% coverage) to verify, and explicitly inheriting from object does seem to be the only limiting factor preventing Python 2 compatibility.
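A minimal illustration of the requested change; the class bodies here are placeholders, not the library's real code:

```python
# Under Python 2, classes must inherit from `object` explicitly
# (new-style classes) for super() to work.
class BVCopula(object):  # explicit `object` base for Python 2 compat
    def fit(self, X):
        raise NotImplementedError

class FrankCopula(BVCopula):
    def fit(self, X):
        # In Python 2, this super() call raises
        # "TypeError: must be type, not classobj" unless BVCopula
        # is a new-style class, i.e. inherits from object.
        return super(FrankCopula, self).fit(X)
```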
Copulas version: 0.1
Python version: 3.6.6
Operating System: Fedora release 28 (Twenty Eight)
I tried to use VineCopula with a simple dataset like the Breast Cancer dataset and got an error.
from copulas.multivariate import VineCopula
import pandas as pd
from sklearn.datasets import load_breast_cancer
data = pd.DataFrame(load_breast_cancer()['data'])
c = VineCopula('center')
c.fit(data)
produced
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/frank.py:76: RuntimeWarning: divide by zero encountered in log
return -1.0 / self.theta * np.log(1 + num / den)
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:60: RuntimeWarning: overflow encountered in power
for i in range(len(U))
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:102: RuntimeWarning: overflow encountered in power
B = np.power(V, -self.theta) + np.power(U, -self.theta) - 1
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:101: RuntimeWarning: overflow encountered in power
A = np.power(V, -self.theta - 1)
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:104: RuntimeWarning: invalid value encountered in multiply
return np.multiply(A, h) - y
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/scipy/optimize/minpack.py:163: RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-56-ab3f90cc1dd4> in <module>
3 c = VineCopula('center')
4
----> 5 c.fit(data)
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/vine.py in fit(self, X, truncated)
51 self.ppfs.append(uni.percent_point)
52
---> 53 self.train_vine(self.type)
54
55 def train_vine(self, tree_type):
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/vine.py in train_vine(self, tree_type)
66 LOGGER.debug('start building tree: {0}'.format(k))
67 tree_k = Tree(tree_type)
---> 68 tree_k.fit(k, self.n_var - k, tau, self.trees[k - 1])
69 self.trees.append(tree_k)
70 LOGGER.debug('finish building tree: {0}'.format(k))
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/tree.py in fit(self, index, n_nodes, tau_matrix, previous_tree, edges)
86 self._build_kth_tree()
87
---> 88 self.prepare_next_tree()
89
90 def _check_contraint(self, edge1, edge2):
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/tree.py in prepare_next_tree(self)
196
197 copula = Bivariate(edge.name)
--> 198 copula.fit(X_left_right)
199 left_given_right = copula.partial_derivative(X_left_right, copula_theta)
200 right_given_left = copula.partial_derivative(X_right_left, copula_theta)
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/base.py in fit(self, X)
87 self.tau = stats.kendalltau(U, V)[0]
88 self.theta = self.compute_theta()
---> 89 self.check_theta()
90
91 def to_dict(self):
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/base.py in check_theta(self)
212 if (not lower <= self.theta <= upper) or (self.theta in self.invalid_thetas):
213 message = 'The computed theta value {} is out of limits for the given {} copula.'
--> 214 raise ValueError(message.format(self.theta, self.copula_type.name))
215
216 def check_fit(self):
ValueError: The computed theta value nan is out of limits for the given CLAYTON copula.
Right now there are some issues with code legibility.
Let's start by setting the standard on flake8+isort
I was trying to fit a copulas.univariate.kde.KDEUnivariate
with an array of constant data. I expected it to work and be able to sample data (although I supposed that the sampled values would be constant too).
import numpy as np
from copulas.univariate import KDEUnivariate
X = np.array([1, 1, 1, 1])
kde = KDEUnivariate()
kde.fit(X)
and got the following traceback:
<ipython-input-2-6d5d418eb1ce> in <module>
5 X = np.array([1, 1, 1, 1])
6 kde = KDEUnivariate()
----> 7 kde.fit(X)
~/Pythia/MIT/Copulas/copulas/univariate/kde.py in fit(self, X)
27 raise ValueError("data cannot be empty")
28
---> 29 self.model = scipy.stats.gaussian_kde(X)
30 self.fitted = True
31
~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in __init__(self, dataset, bw_method, weights)
206 self._neff = 1/sum(self._weights**2)
207
--> 208 self.set_bandwidth(bw_method=bw_method)
209
210 def evaluate(self, points):
~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in set_bandwidth(self, bw_method)
538 raise ValueError(msg)
539
--> 540 self._compute_covariance()
541
542 def _compute_covariance(self):
~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in _compute_covariance(self)
550 bias=False,
551 aweights=self.weights))
--> 552 self._data_inv_cov = linalg.inv(self._data_covariance)
553
554 self.covariance = self._data_covariance * self.factor**2
~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/linalg/basic.py in inv(a, overwrite_a, check_finite)
972 inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
973 if info > 0:
--> 974 raise LinAlgError("singular matrix")
975 if info < 0:
976 raise ValueError('illegal value in %d-th argument of internal '
LinAlgError: singular matrix
There is an open issue (#57) to fix a workaround on copulas.univariate.gaussian.GaussianUnivariate
that avoids this exact situation. Could we generalize its solution in copulas.univariate.base.Univariate
to be able to model and sample constant data with all univariate distributions?
Don't store the iris dataset in the repo; download it using sklearn.load_data
Copulas, as mathematical functions, should fulfill some analytical properties: for example, a copula should evaluate to u
if one argument is u
and all others are 1. It would be nice to have one unit test for each property and copula in our test suite.
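One such property test might look like this sketch, which uses the independence copula C(u, v) = u * v as a stand-in rather than the library's classes:

```python
# Sketch of analytical-property unit tests for a bivariate copula,
# checked here against the independence copula C(u, v) = u * v.
import numpy as np

def independence_cdf(u, v):
    return u * v

u = np.linspace(0, 1, 11)
# Boundary property: C(u, 1) = u and C(1, v) = v.
assert np.allclose(independence_cdf(u, np.ones_like(u)), u)
assert np.allclose(independence_cdf(np.ones_like(u), u), u)
# Grounding property: C(u, 0) = 0.
assert np.allclose(independence_cdf(u, np.zeros_like(u)), 0)
```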
Currently, all distributions and copulas accept arrays as arguments, usually numpy.array,
with the exception of copulas.univariate.KDEUnivariate.
We should change this behavior to match the rest of the library.
The behavior of the sampling methods needs to be thoroughly tested. The goal is to verify that the data used to fit the model and the samples generated from the model come from roughly the same distribution. This is tricky, since the sampling method is by definition random. Some possible ways are:
Use get_likelihood()
to compute the likelihood and verify that it is reasonable.
The VineCopula docstring is out of date, as it says that the vine_type
should be ctype, rtype
or dtype,
when the actual specification is center, regular, direct.
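One way to implement the distribution check mentioned in the sampling-test issue above is a two-sample Kolmogorov-Smirnov test; this sketch uses a trivial fitted Gaussian as a stand-in model rather than a real copula:

```python
# Sketch: fit, sample, and compare the training data against the
# generated samples with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
train = rng.normal(loc=5, scale=2, size=1000)

# Stand-in "model": a fitted Gaussian (a real test would fit a copula).
loc, scale = np.mean(train), np.std(train)
samples = rng.normal(loc, scale, size=1000)

# A small statistic / large p-value means there is no evidence that the
# two samples come from different distributions.
statistic, p_value = stats.ks_2samp(train, samples)
```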
Currently, the only supported behavior for Bivariate
copulas is the following:
1. Create an instance: instance = Bivariate('frank')
2. Fit it, which computes tau
and theta: instance.fit(X)
assert instance.tau is not None
assert instance.theta is not None
3. Sample or evaluate using the theta
computed in the instance: instance.sample(5), instance.cdf(W)
Considering that all of pdf, cdf, ppf
and sample
use only the theta
parameter, and that the tau
parameter is only used to compute theta,
we could add the following methods:
from_tau:
a classmethod
that receives the tau
parameter, creates a new instance, computes and sets the theta
parameter, sets it as fitted,
and returns it.
from_theta:
a classmethod
that receives the theta
parameter, creates a new instance, sets the theta
and fitted
attributes, and returns it.
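A sketch of the proposed classmethods on a minimal stand-in class; the tau-to-theta formula shown is Clayton's (theta = 2*tau / (1 - tau)), used here only as an example:

```python
# Hypothetical sketch of the from_tau / from_theta classmethods.
class ExampleBivariate:
    def __init__(self):
        self.tau = None
        self.theta = None
        self.fitted = False

    @classmethod
    def from_tau(cls, tau):
        """Create a fitted instance, computing theta from tau."""
        instance = cls()
        instance.tau = tau
        instance.theta = 2 * tau / (1 - tau)  # Clayton relation (example)
        instance.fitted = True
        return instance

    @classmethod
    def from_theta(cls, theta):
        """Create a fitted instance directly from theta."""
        instance = cls()
        instance.theta = theta
        instance.fitted = True
        return instance

copula = ExampleBivariate.from_tau(0.5)
# copula.theta == 2.0 and copula.fitted is True
```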
After the release of flake8==3.6.0, new linting issues appeared.
Currently, the sample method on copulas.multivariate.vine.VineCopulas
doesn't take the num_rows argument into consideration; it would be useful to either remove it or make it work.
Currently, we only support Python 3.5 and 3.6. We need to add the newest version of Python. To do so, we need to:
Test on Python 3.7 on TravisCI
Add Python 3.7 to setup.py
The method _partial_derivative
is being used in the Vine classes.
However, this method is not intended to be called from outside the Bivariate copula classes (as it starts with an underscore), and it is not implemented in all the Bivariate subclasses.
In order to fix this, the method should be moved to the Bivariate class and renamed to partial_derivative_scalar
or similar.
Bivariate copulas should be separated into one class each. Also, the copula selector class copulas.bivariate.bv_copula
should use inheritance to select one or another, instead of if statements.
When fitting a GaussianCopula,
the following warnings are raised:
copulas/multivariate/GaussianCopula.py:64: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
means = [np.mean(res.iloc[:, i].as_matrix()) for i in range(n)]
numpy/lib/function_base.py:3103: RuntimeWarning: invalid value encountered in subtract
X -= avg[:, None]
copulas/multivariate/GaussianCopula.py:66: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
return (cov.as_matrix(), means, res)
We need to add save
and load
methods to replicate the internal state of a copula.
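One possible shape for these methods, building on the existing to_dict/from_dict serialization; the JSON-file approach here is an assumption:

```python
# Sketch: persist a copula's parameters via its to_dict/from_dict
# serialization, so the internal state can be replicated from disk.
import json

def save(copula, path):
    """Serialize a copula's parameters to a JSON file."""
    with open(path, 'w') as f:
        json.dump(copula.to_dict(), f)

def load(copula_class, path):
    """Recreate a copula instance from a saved parameter file."""
    with open(path) as f:
        return copula_class.from_dict(json.load(f))
```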
Configure a service to allow contributors to sign a CLA before submitting their contributions.
Given a subset of data, predict the rest.
In different places in the codebase, there are functions with repeated lines of code that could be factored into smaller functions. The same applies to functions of over 100 lines of code.
Integrate with CodeCov to make sure all the changes from PRs improve the code coverage of tests.
Create a new univariate model using scipy.stats.truncnorm
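A possible sketch of such a model, with fit/sample method names assumed to mirror the library's other univariate classes:

```python
# Hypothetical TruncNormUnivariate built on scipy.stats.truncnorm.
import numpy as np
from scipy.stats import truncnorm

class TruncNormUnivariate:
    def fit(self, X):
        self.min, self.max = np.min(X), np.max(X)
        self.mean, self.std = np.mean(X), np.std(X)
        # scipy parametrizes the truncation bounds in units of
        # standard deviations from loc, not in data units.
        self.a = (self.min - self.mean) / self.std
        self.b = (self.max - self.mean) / self.std

    def sample(self, num_samples):
        return truncnorm.rvs(self.a, self.b, loc=self.mean,
                             scale=self.std, size=num_samples)

model = TruncNormUnivariate()
model.fit(np.random.RandomState(0).normal(size=100))
samples = model.sample(10)  # values stay inside [model.min, model.max]
```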
Relevant lines of code in utils.py of sdv:
376-386
Even if there is no way we can check the results obtained, it will help developers creating new copulas.