lingam's People

Contributors

aknvictor, csymvoul, haraoka-screen, ide-screen, ikeuchi-screen, iyake, sshimizu2006, yanazeng


lingam's Issues

LiM uses too much memory.

I tried to run the LiM code, but it failed due to lack of memory. I set n_features to 10 and it used more than 120 GB. Is this normal, or did I do something wrong? Are there any limitations on the number of variables?

Normalization in ICA-LiNGAM

Hi,

When ICA-LiNGAM is run on normalized data, it sometimes outputs a DAG with a different structure than the one obtained from the unnormalized data. For comparison, I did the same with Direct-LiNGAM: the structure was the same, although the edge weights were different. Is this a problem specific to ICA-LiNGAM?

The dataset for which the structures differ was created with the following code.

import numpy as np
import pandas as pd

adj_matrix = np.array(
    [
        [0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [10, 1, 0, 0, 0],
        [100, 10, 1, 0, 0],
        [1000, 100, 10, 1, 0],
    ]
)
n_features = adj_matrix.shape[1]
rng = np.random.default_rng(seed=0)
E = rng.uniform(low=-1, high=1, size=(n_features, 10000))
I = np.identity(n_features)
X = np.matmul(np.linalg.inv(I - adj_matrix), E)
df = pd.DataFrame(X.T)
# df = df.sub(df.mean()).div(df.std(ddof=0))  # normalization
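
For reference, this is how I compared the two runs, continuing from the code above (a sketch; random_state=0 and the nonzero-pattern comparison are just what I happened to use):

import lingam

# Fit ICA-LiNGAM on the raw and the normalized data and compare the
# nonzero patterns (structures) of the estimated adjacency matrices.
df_norm = df.sub(df.mean()).div(df.std(ddof=0))

model_raw = lingam.ICALiNGAM(random_state=0)
model_raw.fit(df)

model_norm = lingam.ICALiNGAM(random_state=0)
model_norm.fit(df_norm)

print(np.array_equal(model_raw.adjacency_matrix_ != 0,
                     model_norm.adjacency_matrix_ != 0))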

Thank you.

Interpretation

The code and documentation are fantastic. But how about some interpretation of the outputs in relation to the examples? What can and cannot be said about the coefficients, adjacency matrices, etc.?

[bug] pip installation issue - statsmodels dependency not recognized

When installing via pip from PyPI, the statsmodels dependency is not installed. It is listed as a dependency in the readme, but it is not picked up by pip. I guess this is because it is also missing from setup.py.

I guess this counts as a bug...

Also, installing statsmodels manually solves it, so it is not that bad...

Questions about usage of DirectLiNGAM

Hello, I’d first like to thank you for this incredible package (along with the interesting papers on LiNGAM you’ve published)!

I’m currently trying to employ this package in my Causal Inference pipeline (causal discovery portion).

  • I’m currently trying to measure the treatment effects of a binary treatment on a binary outcome, in the presence of hundreds of potential confounders
  • At present, I am trying to first identify a causal graph using this package for relatively large datasets before identifying confounders using dowhy’s Backdoor Criterion and estimating treatment effects using EconML’s methods.
  • The datasets have sizes ranging about ~250k - 750k rows/entries and 150 - 600+ columns/features

More specifically, I am currently using DirectLiNGAM with a prior knowledge matrix (specifically to require an edge from the treatment to the outcome variable, and to forbid any other outgoing edges from both the treatment and outcome variables, roughly as in the sketch below). BottomUpParceLiNGAM would have been the ideal model, but it doesn't work here because it immediately runs out of memory and does not scale.
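
For context, this is roughly how I encode that prior knowledge (a sketch; the variable count and the treatment/outcome column indices are placeholders, and I am assuming the convention that prior_knowledge[i, j] constrains the path from x_j to x_i, with 1 = required, 0 = forbidden, -1 = unknown):

import numpy as np
import lingam

# Sketch of the prior-knowledge matrix described above. The size and the
# treatment/outcome column indices are placeholders for my actual data.
n_features = 150
treatment, outcome = 0, 1

pk = -1 * np.ones((n_features, n_features), dtype=int)
pk[:, treatment] = 0        # forbid outgoing edges from the treatment ...
pk[:, outcome] = 0          # ... and from the outcome
pk[outcome, treatment] = 1  # ... except the required edge treatment -> outcome
np.fill_diagonal(pk, 0)

model = lingam.DirectLiNGAM(prior_knowledge=pk)
# model.fit(X)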

After running a couple of experiments with DirectLiNGAM, I have 3 questions I’d like to clarify with you if possible:

  1. (Extension of Issue 5) As my dataset has a good mix of continuous and discrete data, are there currently any models in the package that can deal with both continuous and discrete (i.e, binary and encoded categorical) variables, while allowing for prior knowledge matrices?
  2. Are there any scalable alternatives to validate the results and assumptions of LiNGAM models in general, aside from get_error_independence_p_values and bootstrap? The former (specifically during hsic_test_gamma) causes out-of-memory issues even for the smallest dataset (e.g., 250k x 155), while the latter takes too long (the default fit in DirectLiNGAM with the above-mentioned prior-knowledge matrix takes between 20 hours and 5 days for the datasets I currently have).
  3. It seems that regardless of the dataset, DirectLiNGAM consistently outputs a causal graph (adjacency matrix) where all the non-outcome & non-treatment features are considered confounders (via Backdoor Criterion from dowhy). This is different when compared to using ICALiNGAM which does not necessarily produce results with such a trend. Could it be due to the way in which DirectLiNGAM tries to identify the causal order (and therefore the adjacency matrix)?

Thank you very much!

DirectLiNGAM Duplicate no_path Prior Knowledge

Hi there, thank you for your amazing work on this package. I was just wondering if you could help me understand the intuition behind this snippet of code from DirectLiNGAM (lines 116-125). I am trying to understand why duplicate pairs without a path should cancel out and not be ordered, and consequently not be included in the partial_orders array.

# Check for inconsistencies in pairs without path.
# If there are duplicate pairs without path, they cancel out and are not ordered.
check_pairs = np.concatenate([no_path_pairs, no_path_pairs[:, [1, 0]]])
if len(check_pairs) > 0:
    pairs, counts = np.unique(check_pairs, axis=0, return_counts=True)
    check_pairs = np.concatenate([no_path_pairs, pairs[counts > 1]])
    pairs, counts = np.unique(check_pairs, axis=0, return_counts=True)
    no_path_pairs = pairs[counts < 2]

check_pairs = np.concatenate([path_pairs, no_path_pairs[:, [1, 0]]])
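
To make sure I am reading it right, here is a small standalone run of the same lines with made-up pairs; a pair declared in both directions cancels out, while a one-sided pair survives:

import numpy as np

# Toy illustration of the cancellation logic above (pair values are made up).
no_path_pairs = np.array([[1, 2], [2, 1], [3, 4]])

check_pairs = np.concatenate([no_path_pairs, no_path_pairs[:, [1, 0]]])
pairs, counts = np.unique(check_pairs, axis=0, return_counts=True)
check_pairs = np.concatenate([no_path_pairs, pairs[counts > 1]])
pairs, counts = np.unique(check_pairs, axis=0, return_counts=True)
no_path_pairs = pairs[counts < 2]

print(no_path_pairs)  # [[3 4]] -- the (1, 2) / (2, 1) duplicate has cancelled out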

code of Adaptive Lasso

I found a difference between the Adaptive Lasso code in this package and the code in the R package.
This is the code in this package:
[screenshot of the Python implementation]
And this is the code in the R package:
[screenshot of the R implementation]
The difference is that the R package divides X by the weights obtained from linear regression, while this package multiplies X by the weights.
And here is the formula I found:
[screenshot of the formula]
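
For what it's worth, here is a standalone sketch of the rescaling trick as I understand it (not the package's actual code): multiplying X by |beta_OLS| and scaling the lasso coefficients back by the same weights should be equivalent to the R-style formulation that divides X by penalty weights defined as 1/|beta_OLS|.

import numpy as np
from sklearn.linear_model import LinearRegression, LassoLarsIC

# Standalone sketch of adaptive lasso via feature rescaling (gamma = 1 assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(size=500)

w = np.abs(LinearRegression().fit(X, y).coef_)      # OLS-based weights
lasso = LassoLarsIC(criterion="bic").fit(X * w, y)  # multiply X by the weights ...
coef = lasso.coef_ * w                              # ... and scale the coefficients back
print(coef)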

Feature Selection

Congratulations on the package. It could be interesting to write a feature selection model on top of the discovery to remove slow-moving indicators.

LiM runs into an exception when using the option only_global=False

When using the LiM implementation on larger datasets, everything works as long as I don't use the option only_global=False. Whenever it is set, I get the following message:
ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.
When running with only_global=True, everything works. I checked my data and, from what I can see, I correctly define the discrete indicator to contain all discrete features.

Can you help me out here?

Adjacency matrix for varlingam

I simply tried to test a dataset with lag=30 using the code snippet below, but it outputs at most 14 adjacency matrices. What could be the reason?

model = lingam.VARLiNGAM(lags=30, criterion='bic',prune=True)
model.fit(X)
print(model._adjacency_matrices)
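
One thing I still want to check is whether the criterion='bic' lag search is responsible; a sketch to force all 30 lags (my assumption is that criterion=None disables the search and keeps the lags passed to the constructor):

# Sketch: disable lag selection so the requested 30 lags are used as-is
# (assumption: criterion=None turns the search off).
model = lingam.VARLiNGAM(lags=30, criterion=None, prune=True)
model.fit(X)
print(len(model._adjacency_matrices))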

no_paths doesn't work when using prior_knowledge

Hi,

I really appreciate this repository because I can apply LINGAM to the system very quickly.

Now, I have one question: does no_paths work correctly?

For example, using this notebook: https://github.com/cdt15/lingam/blob/master/examples/DirectLiNGAM(PriorKnowledge).ipynb,

I generate the prior knowledge:

prior_knowledge = make_prior_knowledge(
    n_variables=6,
    exogenous_variables=[0],
    no_paths=[[2,1]])
print(prior_knowledge)
make_prior_knowledge_graph(prior_knowledge)

The output is:

[[ 0  0  0  0  0  0]
 [-1  0  0 -1 -1 -1]
 [-1 -1  0 -1 -1 -1]
 [-1 -1 -1  0 -1 -1]
 [-1 -1 -1 -1  0 -1]
 [-1 -1 -1 -1 -1  0]]

It seems the prior sets the path "2 -> 1" to zero (forbidden).

However, if I fit the model to the data:

model = lingam.DirectLiNGAM(prior_knowledge=prior_knowledge)
model.fit(X)

output model.adjacency_matrix_ is


	0	1	2	3	4	5
0	0.000000	0.0	0.000000	0.000000	0.0	0.0
1	2.986726	0.0	2.006062	0.000000	0.0	0.0
2	0.000000	0.0	0.000000	6.016333	0.0	0.0
3	0.299046	0.0	0.000000	0.000000	0.0	0.0
4	7.984485	0.0	-0.990590	0.000000	0.0	0.0
5	3.952478	0.0	0.000000	0.000000	0.0	0.0


The path "2 -> 1" has value 2.006062.

Is this the correct output value?
When using no_paths, I think the value should be zero.

about DirectLiNGAM: when sample size is less than the number of variables

Hi,
Thanks for your wonderful work!

I encountered a ValueError when I used DirectLiNGAM from this Python package.
It seems that DirectLiNGAM uses LassoLarsIC, which cannot work when the sample size is less than the number of variables.
Console message is here:
ValueError: You are using LassoLarsIC in the case where the number of samples is smaller than the number of features. In this setting, getting a good estimate for the variance of the noise is not possible. Provide an estimate of the noise variance in the constructor.

I'm wondering if it is a bug or a property of DirectLiNGAM.
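
For concreteness, a minimal sketch that reproduces the error (arbitrary data; the point is simply that there are more variables than samples):

import numpy as np
import lingam

# 50 samples but 100 variables: the LassoLarsIC-based pruning step raises the
# ValueError quoted above in this regime.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 100))
lingam.DirectLiNGAM().fit(X)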

Look forward to your reply.

Bootstrap Speed

## This is 30 times slower 
model = lingam.VARLiNGAM()
result = model.bootstrap(process_df, n_sampling=1)
## Than the following

model = lingam.VARLiNGAM(lags=lags)
process_df = variable_improvement(df_accounting, variable)
model.fit(process_df)

Do you know why this is, and how these times could be improved?

In VARLiNGAM, different behavior with the same data

I ran the following code and an error occurred:

rng = np.random.default_rng(seed=0)
x = rng.random(size=80)
y = 10 * x
df_x_y = pd.DataFrame({'x': x, 'y': y})
lingam.VARLiNGAM(random_state=0).fit(df_x_y)

However, no error occurred when I ran the following code:

df_y_x = pd.DataFrame({'y': y, 'x': x})
lingam.VARLiNGAM(random_state=0).fit(df_y_x)

The results were different even though the data seemed to be the same.

I think this is a bug, but please check it.

VARLiNGAM statsmodels

Dear, if I run VARLiNGAM with the latest version of statsmodels (0.14.0) installed, I seem to be getting an error:

Traceback (most recent call last):
  File ".../var_lingam.py", line 79, in <module>
    model.fit(df1)
  File ".../opt/anaconda3/envs/lingam/lib/python3.9/site-packages/lingam/var_lingam.py", line 95, in fit
    M_taus, lags, residuals = self._estimate_var_coefs(X)
  File ".../opt/anaconda3/envs/lingam/lib/python3.9/site-packages/lingam/var_lingam.py", line 279, in _estimate_var_coefs
    fitted = var.fit(maxlags=lag, ic=None, trend="nc")
  File ".../opt/anaconda3/envs/lingam/lib/python3.9/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 656, in fit
    raise ValueError("trend '{}' not supported for VAR".format(trend))
ValueError: trend 'nc' not supported for VAR

With an older version (0.12.2) everything seems to be working fine.

Thanks

Skewed Data Distributions and Homoscedasticity

Hi,

I'm wondering what's the best approach for data that is highly right-skewed. Is it best to take a log transform of it to make it more "normal" or does DirectLiNGAM deal with skewed data? The causal graphs are substantially different if I take the log and then normalise the data compared to only normalising the data and keeping the skewed distribution. I couldn't find the implementations of Hyvarinen & Smith 2013 for skewed data.
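
For concreteness, this is the kind of comparison I mean (a sketch; the lognormal toy data and the log1p/z-score transforms are only stand-ins for my actual preprocessing):

import numpy as np
import lingam

# Sketch of the two preprocessing paths being compared; the lognormal data here
# is only a stand-in for a right-skewed dataset.
rng = np.random.default_rng(0)
e = rng.lognormal(size=(2000, 3))
X = np.empty_like(e)
X[:, 0] = e[:, 0]
X[:, 1] = 2.0 * X[:, 0] + e[:, 1]
X[:, 2] = 0.5 * X[:, 1] + e[:, 2]

X_raw = (X - X.mean(axis=0)) / X.std(axis=0)               # normalise only
X_log = np.log1p(X)                                        # log transform ...
X_log = (X_log - X_log.mean(axis=0)) / X_log.std(axis=0)   # ... then normalise

print(lingam.DirectLiNGAM().fit(X_raw).adjacency_matrix_)
print(lingam.DirectLiNGAM().fit(X_log).adjacency_matrix_)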

Also, my understanding is that LiNGAM is specifically made for non-Gaussian distributions, but I'm a bit confused about how this impacts the adjacency matrix computation using linear regression since from my understanding non-Gaussian distributions violate homoscedasticity.

Any clarity on these two topics would be greatly appreciated!

make_dot()

dot = make_dot(model.adjacency_matrix_, labels=labels)  # labels contain Chinese node names
dot.render("dag")

The rendered graph is unable to display the Chinese node names.
How can I fix this?
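
From what I can tell, the missing characters are usually a Graphviz font issue rather than a lingam issue; a sketch assuming a CJK-capable font such as SimHei is installed and that model is an already-fitted LiNGAM object:

from lingam.utils import make_dot

# Sketch: pass the adjacency matrix and Chinese labels explicitly, and set a
# CJK-capable font for Graphviz nodes and edges (font name is an assumption).
labels = ['收入', '支出', '利润']
dot = make_dot(model.adjacency_matrix_, labels=labels)
dot.attr('node', fontname='SimHei')
dot.attr('edge', fontname='SimHei')
dot.render('dag')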

thank you

In ICALiNGAM, direction of causality is wrong when using data containing two variables

Data I used contained two variables as follows:

np.random.seed(0)
size = 1000
x = np.random.uniform(size=size)
y = 2.0*x + np.random.uniform(size=size)
df_x_y = pd.DataFrame({'x': x, 'y': y})

Using ICALiNGAM, I got the following result:

model_x_y = lingam.ICALiNGAM()
model_x_y.fit(df_x_y)
model_x_y.adjacency_matrix_
# [[0.         0.        ]
#  [2.00618965 0.        ]]
model_x_y.causal_order_
# [0, 1]

The direction of causality is correct, because the estimated result is x->y.
But when swapping x and y, I got the following result:

df_y_x = pd.DataFrame({'y': df_x_y['y'], 'x': df_x_y['x']})
model_y_x = lingam.ICALiNGAM()
model_y_x.fit(df_y_x)
model_y_x.adjacency_matrix_
# [[0.         0.        ]
#  [0.39467697 0.        ]]
model_y_x.causal_order_
# [0, 1]

The estimated direction is y->x. I only swapped the order of the variables, but the direction I got seems wrong.
I think this is a bug. Please check this issue.

For your information, when using data containing three or more variables (e.g. x->y->z), this issue did not occur.

Prior adjacency matrices

Why do prior adjacency matrices created with the make_prior_knowledge function have -1 for the edges instead of 1? What is the rationale for this?
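
My current reading of the documentation and of the prior-knowledge examples above is that -1 is not an edge marker at all, but I would like to confirm it (row = effect, column = cause):

import numpy as np

# My understanding of the prior-knowledge convention (row = effect, column = cause):
#   1  : a directed path from x_col to x_row is required
#   0  : a directed path from x_col to x_row is forbidden
#  -1  : unknown; left for the algorithm to decide
prior_knowledge = np.array([
    [ 0,  0],   # nothing causes x0 (x0 is treated as exogenous)
    [-1,  0],   # whether x0 -> x1 exists is left to the data
])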

Interpretation of values

Hello, thank you for your package!
I cannot find any information on how to interpret the values that one receives after causal inference.
Is it correct to intuitively assume that the higher the value, the higher the correlation between variables? Is there a maximum value?
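
To make the question concrete, here is a toy example of what I mean (generated data; the true coefficient is 5.0, which lies outside [-1, 1], so the entry cannot be a correlation):

import numpy as np
import lingam

# Toy data: x1 = 5.0 * x0 + noise. The estimated entry [1, 0] should be close
# to 5.0, i.e. a regression coefficient rather than a bounded correlation.
rng = np.random.default_rng(0)
x0 = rng.uniform(size=5000)
x1 = 5.0 * x0 + rng.uniform(size=5000)
X = np.column_stack([x0, x1])

model = lingam.DirectLiNGAM().fit(X)
print(model.adjacency_matrix_)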

Install not downloading all packages

Hello,

Recent versions of lingam don't install all required packages. For example, when I install lingam via pip in a conda environment file, I need to manually add igraph, factor_analyzer, and pygam. Is there a way around this?

About discrete graphs

Hello,

This is the second time I am asking a question. Thanks for your wonderful work.

When we use ordinary OLS linear regression, we can't get exactly 0 even if there is no edge between two variable nodes. So, in your work, the adaptive Lasso is used instead of general linear regression.

However, for some data, even the adaptive Lasso cannot guarantee that the DirectLiNGAM result is free of redundant edges, even though the values of those edges are very close to zero.

I think that if I can further prune the adjacency matrix, the result might be closer to the true matrix. But I don't know what criteria I should use for pruning, especially when analyzing discrete graphs in which some pairs have no edge.
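
For example, the simplest thing I can think of is an absolute cutoff on the estimated matrix, but the threshold in this sketch is arbitrary and I have no principled way to choose it (model is a fitted DirectLiNGAM):

import numpy as np

# Sketch: zero out near-zero edges with an arbitrary cutoff (0.05 here).
B = model.adjacency_matrix_.copy()
B[np.abs(B) < 0.05] = 0.0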

I'm wondering if I can directly get an answer here or if you can recommend me some papers that will be helpful.

Error in visualize_nonlinear_causal_effect

The example of utils.visualize_nonlinear_causal_effect does not seem to work anymore.

When running it, you get a problem with the shape of cd_result. The error occurs at _is_adjacency_matrix(cd_result, X.shape[1]) and is caused by check_array(cd_result). It yields ValueError: Expected 2D array, got scalar array instead.

How can this be fixed?

AttributeError during import

Hello,
I have installed lingam 1.6.0, and when I attempt to import it, I consistently encounter the error AttributeError: module 'numpy' has no attribute 'MachAr'. It appears that the error arises when lingam attempts to import VAR from statsmodels.tsa.vector_ar.var_model.

I have attempted to resolve the issue by changing the version of 'numpy,' but unfortunately, I was unable to eliminate the error.
Thank you.

Inquiry about the p-value in results

Dear author
While reading https://lingam.readthedocs.io/en/latest/tutorial/lingam.html, a question puzzles me.
[screenshot of the tutorial's result graph]
As shown in the tutorial's results, there is a chain between x3 and x4. Such a structure denotes the dependence of x3 and x4.
However, the p-value between x3 and x4 is 0.681 (>0.05). In my understanding, this means the null hypothesis (errors e3 and e4 are independent) cannot be rejected.
Thus, the dependence of x3 and x4 and the independence of e3 and e4 seem to conflict here.
So, how exactly should we understand the p-values in LiNGAM results? And what is the distinction between independence of variables and independence of errors?
Best regards

panel data

Hello,

Can I apply the VAR-LiNGAM model to panel data? If yes, is there any tutorial or document to learn how to do this with panel data?

ValueError: trend 'nc' not supported for VAR

Hi!
I got the following error when trying to use Varlingam

ValueError: trend 'nc' not supported for VAR
The error seems to be caused by line 272 in var_lingam.py:

result = var.fit(maxlags=self._lags, trend="nc")
but "nc" is not an available trend in the fit method of VAR.
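
For now I am working around it locally by changing that line (a sketch, not an official patch; newer statsmodels releases seem to use "n" instead of "nc" for the no-trend option):

# In var_lingam.py, replace the removed trend code with the newer one:
result = var.fit(maxlags=self._lags, trend="n")  # "n" = no trend in newer statsmodels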

ModuleNotFoundError after installation from pypi

When installing lingam via

pip install "lingam==0.17.0"

and importing it via

import lingam

then ModuleNotFoundError: No module named 'pygam' is thrown.

The reason might be that setup.py does not use requirements.txt, where all requirements are specified.
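
One possible fix might be to read requirements.txt inside setup.py; a sketch (assuming requirements.txt sits next to setup.py):

from setuptools import setup, find_packages

# Sketch: feed requirements.txt into install_requires so pip installs pygam
# and the other dependencies automatically.
with open("requirements.txt") as f:
    install_requires = [line.strip() for line in f
                        if line.strip() and not line.startswith("#")]

setup(
    name="lingam",
    packages=find_packages(),
    install_requires=install_requires,
)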
