sdv-dev / copulas
A library to model multivariate data using copulas.
Home Page: https://sdv.dev/Copulas/
License: Other
Got an exception while trying to sample data using Vine Copulas with a Regular Tree.
import pandas as pd
from copulas.multivariate import VineCopula, TreeTypes
X = pd.DataFrame([
[1, 0, 0],
[0, 1, 0],
[0, 0, 1]
])
vine = VineCopula(TreeTypes.REGULAR)
vine.fit(X)
vine.sample()
which gave me the following exception:
copulas/multivariate/vine.py in sample(self, num_rows)
165 if (edge.L == current and edge.R == visited[0]) or\
166 (edge.R == current and edge.L == visited[0]):
--> 167 current_ind = edge.index
168 break
169 else:
AttributeError: 'Edge' object has no attribute 'index'
Not being a copulas expert, this is more a quick question than an issue. May I use Copulas to
fully emulate Matlab's copulastat? That is:
r = copulastat('Gaussian',rho) returns the Kendall’s rank correlation, r, that corresponds to a Gaussian copula with linear correlation parameters rho.
and the same applies to copularnd:
u = copularnd('Gaussian',rho,n) returns n random vectors generated from a Gaussian copula with linear correlation parameters rho.
After searching, it looks like there are a couple of packages, Copulas and copulalib, to deal with
copulas in Python. So before starting to work with one or the other, it would be good to have some feedback from the experts.
Thanks!
Add an independence copula class to fit data that are not correlated.
Currently, the methods to serialize copulas can return nested dictionaries, which are not useful to work with in some use cases. We can change to_dict
in order to keep the information about the internal structure in the keys, something like:
>>> copula.to_dict()  # actual implementation
{
    'one_attribute': 0,
    'nested_attribute': {
        'foo': 'bar'
    }
}
>>> copula.to_dict()  # desired behavior
{
    'one_attribute': 0,
    'nested_attribute__foo': 'bar'
}
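A possible implementation of this flattening is sketched below; the helper name `flatten_dict` and the `__` separator are assumptions, not part of the library:

```python
# Hypothetical sketch: flatten nested dictionaries, joining keys with '__'.
def flatten_dict(nested, prefix=''):
    """Return a flat dict where nested keys are joined by '__'."""
    flat = {}
    for key, value in nested.items():
        full_key = prefix + '__' + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, full_key))
        else:
            flat[full_key] = value
    return flat

# flatten_dict({'one_attribute': 0, 'nested_attribute': {'foo': 'bar'}})
# -> {'one_attribute': 0, 'nested_attribute__foo': 'bar'}
```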
Integrate with TravisCI in order to:
The means used in the sample method of a copulas.multivariate.GaussianMultivariate instance should be a list of zeros with length equal to the number of columns in the distribution. This is on lines 85 and 89 of https://github.com/DAI-Lab/Copulas/blob/master/copulas/multivariate/gaussian.py.
Readme should be updated before the release for PyPI, with the following:
A number of univariate distributions are available in scipy. We may add a method to select the best univariate distribution for each column. It would be computationally expensive, but we can provide it as an option.
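As an illustration, such a selection could rank candidate scipy distributions by their Kolmogorov-Smirnov fit statistic; the function name and the candidate list below are assumptions, not library code:

```python
# Illustrative sketch: pick the scipy distribution whose MLE fit
# minimizes the Kolmogorov-Smirnov statistic on the data.
import numpy as np
from scipy import stats

def select_best_distribution(data, candidates=('norm', 'expon', 'gamma')):
    best_name, best_ks = None, np.inf
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(data)  # expensive when many candidates are tried
        ks = stats.kstest(data, name, args=params).statistic
        if ks < best_ks:
            best_name, best_ks = name, ks
    return best_name, best_ks

# Example: name, ks = select_best_distribution(np.random.normal(size=500))
```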
I was expecting the sample methods to allow the user to pass a seed for the random number generators.
Instead, we have to call, outside the function call, numpy.random.seed(seed_value)
and random.setstate(state_tuple).
This is bad practice from a software engineering standpoint and very error prone, because it affects global state. It can also negatively impact experiment reproducibility and the debugging stages.
Currently, in order to get the same sample from the sampling methods, we need to call
np.random.seed(seed_value)
random.setstate(random_state_tuple)
outside of the sampling function being invoked (i.e. sample). This results in what is called, in software engineering, a leaky abstraction. In order to solve this issue with seed control, there are (at least) two approaches:
1. Add an optional parameter to the sample methods, named seed or random_state.
2. Add a seed or random_state parameter to the class constructor, IF the distribution fit method requires some sort of stochastic process.
In scikit-learn and other popular Python machine learning tools, what happens is the following:
When a model depends on some sort of stochastic process during the fit procedure, the model class constructor allows the user to set the random_state
value. This value can be one of 3 things: None, an integer, or an instance of numpy.random.RandomState. No matter what the value is, it will be checked and processed by sklearn.utils.check_random_state,
which outputs a numpy.random.RandomState instance. Note that sklearn.utils.check_random_state
is invoked at the beginning of the fit method (check this example: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L129). If you set random_state
as an integer, every fit has to be deterministic in its behavior and output.
If, besides the fit method, there is another method that depends on stochastic processes (e.g. the sample
method in sklearn.neighbors.KernelDensity), we are allowed to control the seed through the random_state parameter.
In other more low level APIs like scipy, the seed must be an integer or None.
I also advise against using both the random and numpy.random modules at the same time, because it makes seed management harder.
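The scikit-learn pattern described above can be sketched as follows; the SomeCopula class is a hypothetical stand-in, not a library class:

```python
# Sketch of the sklearn-style pattern: normalize `random_state`
# (None, int, or RandomState) into a numpy.random.RandomState instance.
import numbers
import numpy as np

def check_random_state(random_state):
    if random_state is None:
        return np.random.mtrand._rand  # the global RandomState, as in sklearn
    if isinstance(random_state, numbers.Integral):
        return np.random.RandomState(random_state)
    if isinstance(random_state, np.random.RandomState):
        return random_state
    raise ValueError('%r cannot be used to seed a RandomState' % random_state)

class SomeCopula:
    """Hypothetical model that owns its random state (an assumption)."""
    def __init__(self, random_state=None):
        self.random_state = random_state

    def sample(self, num_rows=1):
        rng = check_random_state(self.random_state)
        return rng.uniform(size=num_rows)  # placeholder for real sampling

# Two instances with the same integer seed produce identical samples,
# without touching the global numpy.random state.
```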
EDIT: Current fix available at #62
If the fit data has only one value, and hence the std is 0, the current code has a workaround to avoid crashing later on when trying to sample.
Wouldn't it be better to capture and implement this exceptional situation by just making all samples constant, equal to the mean?
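The suggested behavior could look like the following sketch, using a minimal stand-in class rather than the library's actual GaussianUnivariate:

```python
# Hypothetical sketch: when std == 0, return constant samples equal to
# the mean instead of crashing on a degenerate normal distribution.
import numpy as np

class ConstantAwareGaussian:
    def fit(self, X):
        self.mean = np.mean(X)
        self.std = np.std(X)

    def sample(self, num_samples):
        if self.std == 0:
            # Constant input data -> constant samples equal to the mean.
            return np.full(num_samples, self.mean)
        return np.random.normal(self.mean, self.std, num_samples)

model = ConstantAwareGaussian()
model.fit(np.array([3, 3, 3, 3]))
model.sample(5)  # array([3., 3., 3., 3., 3.])
```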
Add from_dict and to_dict methods for Vine copulas that return a dictionary with all of the copula's internal parameters and are able to create a new instance from them.
Filenames must follow python naming conventions and shouldn’t be redundant (univariate/GaussianUnivariate.py -> univariate/gaussian.py)
Function and variables names shouldn’t be acronyms but explicit and clear names (Copulas.cdf -> Copulas.cumulative_distribution)
The sample method for a Gaussian copula currently requires that the data attribute exists. This is not correct, since a user should be able to create a copula by just setting the parameters and still sample.
This is on lines 91 and 92 of the gaussian copula file.
This was started by amontanez, but some details are still pending
Remove the requirements.txt and requirements_test.txt files and list the dependencies only in setup.py.
requirements_dev.txt should be kept, but it should install the test requirements as .[test]
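A hypothetical sketch of the resulting setup.py layout; the exact package names and versions below are illustrative assumptions:

```python
# Sketch: dependencies declared only in setup.py, with test
# requirements exposed as an extra installable via `pip install .[test]`.
install_requires = [
    'numpy',
    'pandas',
    'scipy',
]

extras_require = {
    'test': [
        'pytest>=3.4',
        'pytest-cov',
    ],
}

# setup(
#     name='copulas',
#     install_requires=install_requires,
#     extras_require=extras_require,
#     ...
# )
```

requirements_dev.txt would then contain `-e .[test]` plus any dev-only tools.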
If no distribution map is given, the Gaussian Copula uses the GaussianUnivariate class as the default distribution. It should instead take a distribution in as an argument and use that as the default.
In order to make the project easy to use for new users, we should have:
A README showing examples of Vines, listing all the copula types, ...
A notebook that connects to the copulas bucket on S3, downloads a dataset and runs it on a copula.
Currently, our implementation of the statistical functions of copulas expects and returns numpy.arrays. However, it could be useful for this functionality to also accept and return scalar values.
Right now, bivariate and univariate copulas have methods to compute the different probability functions; however, these methods return another function that is later called.
We should change this behavior to have functions that return the actual result values, instead of a function.
The list of changes to do is:
1. Rename all functions with descriptive names instead of acronyms.
2. Make the probability functions return values instead of a function.
3. Make the probability functions not require arguments that can be taken from self.
4. Unify types for input and output values, making all classes only accept and return np.ndarray.
Currently there are warnings in the fit method due to division by zero. In that case, theta should be set to infinity, and we should verify that the computation is still correct for get_pdf(), get_cdf(), etc.
Maybe add a CopulaException class to ensure theta is in the valid range, instead of checking inside each function.
Please could you explicitly inherit from object
when declaring your base classes BVCopula, MVCopula, UnivariateDistrib and others I may have missed? This should allow compatibility with Python 2.
Many thanks!
The error when using Python 2 is "TypeError: must be type, not classobj" whenever super is called.
I did some quick testing of parts of the codebase (not 100% coverage) to verify, and explicitly inheriting from object does seem to be the only limiting factor preventing Python 2 compatibility.
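A minimal illustration of the requested change; the class bodies here are placeholders, not the library's real code:

```python
# Under Python 2, classes must inherit from `object` explicitly
# (new-style classes) for super() to work.
class BVCopula(object):  # explicit `object` base for Python 2 compat
    def fit(self, X):
        raise NotImplementedError

class FrankCopula(BVCopula):
    def fit(self, X):
        # In Python 2, this super() call raises
        # "TypeError: must be type, not classobj" unless BVCopula
        # is a new-style class, i.e. inherits from object.
        return super(FrankCopula, self).fit(X)
```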
Copulas version: 0.1
Python version: 3.6.6
Operating System: Fedora release 28 (Twenty Eight)
I tried to use VineCopula with a simple dataset like the Breast Cancer dataset and got an error.
from copulas.multivariate import VineCopula
import pandas as pd
from sklearn.datasets import load_breast_cancer
data = pd.DataFrame(load_breast_cancer()['data'])
c = VineCopula('center')
c.fit(data)
produced
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/frank.py:76: RuntimeWarning: divide by zero encountered in log
return -1.0 / self.theta * np.log(1 + num / den)
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:60: RuntimeWarning: overflow encountered in power
for i in range(len(U))
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:102: RuntimeWarning: overflow encountered in power
B = np.power(V, -self.theta) + np.power(U, -self.theta) - 1
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:101: RuntimeWarning: overflow encountered in power
A = np.power(V, -self.theta - 1)
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:104: RuntimeWarning: invalid value encountered in multiply
return np.multiply(A, h) - y
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/scipy/optimize/minpack.py:163: RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-56-ab3f90cc1dd4> in <module>
3 c = VineCopula('center')
4
----> 5 c.fit(data)
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/vine.py in fit(self, X, truncated)
51 self.ppfs.append(uni.percent_point)
52
---> 53 self.train_vine(self.type)
54
55 def train_vine(self, tree_type):
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/vine.py in train_vine(self, tree_type)
66 LOGGER.debug('start building tree: {0}'.format(k))
67 tree_k = Tree(tree_type)
---> 68 tree_k.fit(k, self.n_var - k, tau, self.trees[k - 1])
69 self.trees.append(tree_k)
70 LOGGER.debug('finish building tree: {0}'.format(k))
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/tree.py in fit(self, index, n_nodes, tau_matrix, previous_tree, edges)
86 self._build_kth_tree()
87
---> 88 self.prepare_next_tree()
89
90 def _check_contraint(self, edge1, edge2):
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/tree.py in prepare_next_tree(self)
196
197 copula = Bivariate(edge.name)
--> 198 copula.fit(X_left_right)
199 left_given_right = copula.partial_derivative(X_left_right, copula_theta)
200 right_given_left = copula.partial_derivative(X_right_left, copula_theta)
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/base.py in fit(self, X)
87 self.tau = stats.kendalltau(U, V)[0]
88 self.theta = self.compute_theta()
---> 89 self.check_theta()
90
91 def to_dict(self):
~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/base.py in check_theta(self)
212 if (not lower <= self.theta <= upper) or (self.theta in self.invalid_thetas):
213 message = 'The computed theta value {} is out of limits for the given {} copula.'
--> 214 raise ValueError(message.format(self.theta, self.copula_type.name))
215
216 def check_fit(self):
ValueError: The computed theta value nan is out of limits for the given CLAYTON copula.
Right now there are some issues with code legibility.
Let's start by setting the standard on flake8+isort
I was trying to fit a copulas.univariate.kde.KDEUnivariate
with an array of constant data. I expected it to work and be able to sample data (although I supposed that the sampled values would be constant too).
import numpy as np
from copulas.univariate import KDEUnivariate
X = np.array([1, 1, 1, 1])
kde = KDEUnivariate()
kde.fit(X)
and got the following traceback:
<ipython-input-2-6d5d418eb1ce> in <module>
5 X = np.array([1, 1, 1, 1])
6 kde = KDEUnivariate()
----> 7 kde.fit(X)
~/Pythia/MIT/Copulas/copulas/univariate/kde.py in fit(self, X)
27 raise ValueError("data cannot be empty")
28
---> 29 self.model = scipy.stats.gaussian_kde(X)
30 self.fitted = True
31
~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in __init__(self, dataset, bw_method, weights)
206 self._neff = 1/sum(self._weights**2)
207
--> 208 self.set_bandwidth(bw_method=bw_method)
209
210 def evaluate(self, points):
~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in set_bandwidth(self, bw_method)
538 raise ValueError(msg)
539
--> 540 self._compute_covariance()
541
542 def _compute_covariance(self):
~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in _compute_covariance(self)
550 bias=False,
551 aweights=self.weights))
--> 552 self._data_inv_cov = linalg.inv(self._data_covariance)
553
554 self.covariance = self._data_covariance * self.factor**2
~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/linalg/basic.py in inv(a, overwrite_a, check_finite)
972 inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
973 if info > 0:
--> 974 raise LinAlgError("singular matrix")
975 if info < 0:
976 raise ValueError('illegal value in %d-th argument of internal '
LinAlgError: singular matrix
There is an open issue (#57) to fix a workaround on copulas.univariate.gaussian.GaussianUnivariate
that avoids this exact situation. Could we generalize its solution in copulas.univariate.base.Univariate
to be able to model and sample constant data with all univariate distributions?
Don't store the iris dataset in the repo; download it using sklearn.load_data
Copulas, as mathematical functions, should fulfill some analytical properties: for example, a copula should evaluate to u
if one argument is u
and all others are 1. It would be nice to have one unit test for each property and copula in our test suite.
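One such property test might look like this sketch, which uses the independence copula C(u, v) = u * v as a stand-in rather than the library's classes:

```python
# Sketch of analytical-property unit tests for a bivariate copula,
# checked here against the independence copula C(u, v) = u * v.
import numpy as np

def independence_cdf(u, v):
    return u * v

u = np.linspace(0, 1, 11)
# Boundary property: C(u, 1) = u and C(1, v) = v.
assert np.allclose(independence_cdf(u, np.ones_like(u)), u)
assert np.allclose(independence_cdf(np.ones_like(u), u), u)
# Grounding property: C(u, 0) = 0.
assert np.allclose(independence_cdf(u, np.zeros_like(u)), 0)
```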
Currently, all distributions and copulas accept arrays as arguments, usually numpy.array,
with the exception of copulas.univariate.KDEUnivariate.
We should change this behavior to match the rest of the library.
The behavior of the sampling methods needs to be thoroughly tested. The goal is to verify that the data used to fit the model and the samples generated from the model come from roughly the same distribution. This is tricky, since the sampling method is by definition random. Some possible ways are:
Use get_likelihood()
to compute the likelihood and verify that it is reasonable.
The VineCopula docstring is out of date, as it says that the vine_type
should be ctype, rtype
or dtype,
when the actual specification is center, regular, direct.
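One way to implement the distribution check mentioned in the sampling-test issue above is a two-sample Kolmogorov-Smirnov test; this sketch uses a trivial fitted Gaussian as a stand-in model rather than a real copula:

```python
# Sketch: fit, sample, and compare the training data against the
# generated samples with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
train = rng.normal(loc=5, scale=2, size=1000)

# Stand-in "model": a fitted Gaussian (a real test would fit a copula).
loc, scale = np.mean(train), np.std(train)
samples = rng.normal(loc, scale, size=1000)

# A small statistic / large p-value means there is no evidence that the
# two samples come from different distributions.
statistic, p_value = stats.ks_2samp(train, samples)
```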
Currently, the only supported behavior for Bivariate
copulas is the following:
1. Create an instance: instance = Bivariate('frank')
2. Fit it, which computes tau
and theta: instance.fit(X)
assert instance.tau is not None
assert instance.theta is not None
3. Sample or evaluate using the theta
computed in the instance: instance.sample(5), instance.cdf(W)
Considering that all of pdf, cdf, ppf
and sample
use only the theta
parameter, and that the tau
parameter is only used to compute theta,
we could add the following methods:
from_tau:
a classmethod
that receives the tau
parameter, creates a new instance, computes and sets the theta
parameter, sets it as fitted,
and returns it.
from_theta:
a classmethod
that receives the theta
parameter, creates a new instance, sets the theta
and fitted
attributes, and returns it.
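A sketch of the proposed classmethods on a minimal stand-in class; the tau-to-theta formula shown is Clayton's (theta = 2*tau / (1 - tau)), used here only as an example:

```python
# Hypothetical sketch of the from_tau / from_theta classmethods.
class ExampleBivariate:
    def __init__(self):
        self.tau = None
        self.theta = None
        self.fitted = False

    @classmethod
    def from_tau(cls, tau):
        """Create a fitted instance, computing theta from tau."""
        instance = cls()
        instance.tau = tau
        instance.theta = 2 * tau / (1 - tau)  # Clayton relation (example)
        instance.fitted = True
        return instance

    @classmethod
    def from_theta(cls, theta):
        """Create a fitted instance directly from theta."""
        instance = cls()
        instance.theta = theta
        instance.fitted = True
        return instance

copula = ExampleBivariate.from_tau(0.5)
# copula.theta == 2.0 and copula.fitted is True
```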
After the release of flake8==3.6.0, new linting issues appeared.
Currently, the sample method on copulas.multivariate.vine.VineCopulas
doesn't take the num_rows argument into consideration; it would be useful to either remove it or make it work.
Currently, we only support Python 3.5 and 3.6. We need to add the newest version of Python. To do so, we need to:
Test on Python 3.7 on TravisCI
Add Python 3.7 to setup.py
The method _partial_derivative
is being used in the Vine classes.
However, this method is not intended to be called from outside the Bivariate copula classes (as it starts with an underscore), and it is not implemented in all the Bivariate subclasses.
In order to fix this, the method should be moved to the Bivariate class and renamed to partial_derivative_scalar
or similar.
Bivariate copulas should be separated into one class each. Also, the copula selector class copulas.bivariate.bv_copula
should use inheritance to select one or another, instead of if statements.
When fitting a GaussianCopula,
the following warnings are raised:
copulas/multivariate/GaussianCopula.py:64: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
means = [np.mean(res.iloc[:, i].as_matrix()) for i in range(n)]
numpy/lib/function_base.py:3103: RuntimeWarning: invalid value encountered in subtract
X -= avg[:, None]
copulas/multivariate/GaussianCopula.py:66: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
return (cov.as_matrix(), means, res)
We need to add save
and load
methods to replicate the internal state of a copula.
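One possible shape for these methods, building on the existing to_dict/from_dict serialization; the JSON-file approach here is an assumption:

```python
# Sketch: persist a copula's parameters via its to_dict/from_dict
# serialization, so the internal state can be replicated from disk.
import json

def save(copula, path):
    """Serialize a copula's parameters to a JSON file."""
    with open(path, 'w') as f:
        json.dump(copula.to_dict(), f)

def load(copula_class, path):
    """Recreate a copula instance from a saved parameter file."""
    with open(path) as f:
        return copula_class.from_dict(json.load(f))
```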
Configure a service to allow contributors to sign a CLA before submitting their contributions.
Given a subset of data, predict the rest.
In different places in the codebase, there are functions with repeated lines of code that could be factored into smaller functions. The same applies to functions of over 100 lines of code.
Integrate with CodeCov to make sure all the changes from PRs improve the code coverage of tests.
Create a new univariate model using scipy.stats.truncnorm
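A possible sketch of such a model, with fit/sample method names assumed to mirror the library's other univariate classes:

```python
# Hypothetical TruncNormUnivariate built on scipy.stats.truncnorm.
import numpy as np
from scipy.stats import truncnorm

class TruncNormUnivariate:
    def fit(self, X):
        self.min, self.max = np.min(X), np.max(X)
        self.mean, self.std = np.mean(X), np.std(X)
        # scipy parametrizes the truncation bounds in units of
        # standard deviations from loc, not in data units.
        self.a = (self.min - self.mean) / self.std
        self.b = (self.max - self.mean) / self.std

    def sample(self, num_samples):
        return truncnorm.rvs(self.a, self.b, loc=self.mean,
                             scale=self.std, size=num_samples)

model = TruncNormUnivariate()
model.fit(np.random.RandomState(0).normal(size=100))
samples = model.sample(10)  # values stay inside [model.min, model.max]
```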
Relevant lines of code in utils.py of sdv:
376-386
Even if there is no way we can check the results obtained, it will help developers creating new copulas.