lingam's People

Contributors

aknvictor, csymvoul, haraoka-screen, ide-screen, ikeuchi-screen, iyake, sshimizu2006, yanazeng


lingam's Issues

LiM uses too much memory.

I tried to run the LiM code, but it failed due to lack of memory. I set n_features to 10 and it used more than 120 GB. Is this normal, or did I do something wrong? Are there any limitations on the number of variables?

Normalization in ICA-LiNGAM

Hi,

When ICA-LiNGAM is run on normalized data, it sometimes outputs a DAG with a different structure than the one obtained from the unnormalized data. For comparison, I did the same with Direct-LiNGAM: the structure was the same, although the edge weights were different. Is this a problem specific to ICA-LiNGAM?

The dataset for which the structures differ was created with the following code.

import numpy as np
import pandas as pd

adj_matrix = np.array(
    [
        [0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [10, 1, 0, 0, 0],
        [100, 10, 1, 0, 0],
        [1000, 100, 10, 1, 0],
    ]
)
n_features = adj_matrix.shape[1]
rng = np.random.default_rng(seed=0)
E = rng.uniform(low=-1, high=1, size=(n_features, 10000))
I = np.identity(n_features)
X = np.matmul(np.linalg.inv(I - adj_matrix), E)
df = pd.DataFrame(X.T)
# df = df.sub(df.mean()).div(df.std(ddof=0))  # normalization
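
For reference, this is how I compared the two runs, continuing from the code above (a sketch; random_state=0 and the nonzero-pattern comparison are just what I happened to use):

import lingam

# Fit ICA-LiNGAM on the raw and the normalized data and compare the
# nonzero patterns (structures) of the estimated adjacency matrices.
df_norm = df.sub(df.mean()).div(df.std(ddof=0))

model_raw = lingam.ICALiNGAM(random_state=0)
model_raw.fit(df)

model_norm = lingam.ICALiNGAM(random_state=0)
model_norm.fit(df_norm)

print(np.array_equal(model_raw.adjacency_matrix_ != 0,
                     model_norm.adjacency_matrix_ != 0))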

Thank you.

Interpretation

The code and documentation are fantastic. But how about some interpretation of the outputs in relation to the examples? What can and cannot be said about the coefficients, adjacency matrices, etc.?

[bug] pip installation issue - statsmodels dependency not recognized

When installing via pip from PyPI, the statsmodels dependency is not installed. It is listed as a dependency in the readme, but it is not picked up by pip. I guess this is because it is also missing from setup.py.

I guess this counts as a bug...

Also, installing statsmodels manually solves it, so it is not that bad...

Questions about usage of DirectLiNGAM

Hello, I’d first like to thank you for this incredible package (along with the interesting papers on LiNGAM you’ve published)!

I’m currently trying to employ this package in my Causal Inference pipeline (causal discovery portion).

  • I’m currently trying to measure the treatment effects of a binary treatment on a binary outcome, in the presence of hundreds of potential confounders
  • At present, I am trying to first identify a causal graph using this package for relatively large datasets before identifying confounders using dowhy’s Backdoor Criterion and estimating treatment effects using EconML’s methods.
  • The datasets have sizes ranging about ~250k - 750k rows/entries and 150 - 600+ columns/features

More specifically, I am currently using DirectLiNGAM with a prior knowledge matrix (specifically to require an edge from the treatment to the outcome variable, and to forbid any other outgoing edges from both the treatment and outcome variables, roughly as in the sketch below). BottomUpParceLiNGAM would have been the ideal model, but it doesn't work here because it immediately runs out of memory and does not scale.
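
For context, this is roughly how I encode that prior knowledge (a sketch; the variable count and the treatment/outcome column indices are placeholders, and I am assuming the convention that prior_knowledge[i, j] constrains the path from x_j to x_i, with 1 = required, 0 = forbidden, -1 = unknown):

import numpy as np
import lingam

# Sketch of the prior-knowledge matrix described above. The size and the
# treatment/outcome column indices are placeholders for my actual data.
n_features = 150
treatment, outcome = 0, 1

pk = -1 * np.ones((n_features, n_features), dtype=int)
pk[:, treatment] = 0        # forbid outgoing edges from the treatment ...
pk[:, outcome] = 0          # ... and from the outcome
pk[outcome, treatment] = 1  # ... except the required edge treatment -> outcome
np.fill_diagonal(pk, 0)

model = lingam.DirectLiNGAM(prior_knowledge=pk)
# model.fit(X)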

After running a couple of experiments with DirectLiNGAM, I have 3 questions I’d like to clarify with you if possible:

  1. (Extension of Issue 5) As my dataset has a good mix of continuous and discrete data, are there currently any models in the package that can deal with both continuous and discrete (i.e, binary and encoded categorical) variables, while allowing for prior knowledge matrices?
  2. Are there any scalable alternatives to validate the results and assumptions of LiNGAM models in general, aside from get_error_independence_p_values and bootstrap? The former (specifically during hsic_test_gamma) causes out-of-memory issues even for the smallest dataset (e.g., 250k x 155), while the latter takes too long (the default fit in DirectLiNGAM with the above-mentioned prior-knowledge matrix takes between 20 hours and 5 days for the datasets I currently have).
  3. It seems that regardless of the dataset, DirectLiNGAM consistently outputs a causal graph (adjacency matrix) where all the non-outcome & non-treatment features are considered confounders (via Backdoor Criterion from dowhy). This is different when compared to using ICALiNGAM which does not necessarily produce results with such a trend. Could it be due to the way in which DirectLiNGAM tries to identify the causal order (and therefore the adjacency matrix)?

Thank you very much!

DirectLiNGAM Duplicate no_path Prior Knowledge

Hi there, thank you for your amazing work on this package. I was just wondering if you could help me understand the intuition behind this snippet of code from DirectLiNGAM (lines 116-125). I am trying to understand why duplicate pairs without a path should cancel out and not be ordered, and consequently not be included in the partial_orders array.

# Check for inconsistencies in pairs without path.
# If there are duplicate pairs without path, they cancel out and are not ordered.
check_pairs = np.concatenate([no_path_pairs, no_path_pairs[:, [1, 0]]])
if len(check_pairs) > 0:
    pairs, counts = np.unique(check_pairs, axis=0, return_counts=True)
    check_pairs = np.concatenate([no_path_pairs, pairs[counts > 1]])
    pairs, counts = np.unique(check_pairs, axis=0, return_counts=True)
    no_path_pairs = pairs[counts < 2]

check_pairs = np.concatenate([path_pairs, no_path_pairs[:, [1, 0]]])
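
To make sure I am reading it right, here is a small standalone run of the same lines with made-up pairs; a pair declared in both directions cancels out, while a one-sided pair survives:

import numpy as np

# Toy illustration of the cancellation logic above (pair values are made up).
no_path_pairs = np.array([[1, 2], [2, 1], [3, 4]])

check_pairs = np.concatenate([no_path_pairs, no_path_pairs[:, [1, 0]]])
pairs, counts = np.unique(check_pairs, axis=0, return_counts=True)
check_pairs = np.concatenate([no_path_pairs, pairs[counts > 1]])
pairs, counts = np.unique(check_pairs, axis=0, return_counts=True)
no_path_pairs = pairs[counts < 2]

print(no_path_pairs)  # [[3 4]] -- the (1, 2) / (2, 1) duplicate has cancelled out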

code of Adaptive Lasso

I found a difference between the Adaptive Lasso code in this package and the code in the R package.
This is the code in this package:
[screenshot of the Python implementation]
And this is the code in the R package:
[screenshot of the R implementation]
The difference is that the R package divides X by the weights obtained from linear regression, while this package multiplies X by the weights.
And here is the formula I found:
[screenshot of the formula]
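
For what it's worth, here is a standalone sketch of the rescaling trick as I understand it (not the package's actual code): multiplying X by |beta_OLS| and scaling the lasso coefficients back by the same weights should be equivalent to the R-style formulation that divides X by penalty weights defined as 1/|beta_OLS|.

import numpy as np
from sklearn.linear_model import LinearRegression, LassoLarsIC

# Standalone sketch of adaptive lasso via feature rescaling (gamma = 1 assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(size=500)

w = np.abs(LinearRegression().fit(X, y).coef_)      # OLS-based weights
lasso = LassoLarsIC(criterion="bic").fit(X * w, y)  # multiply X by the weights ...
coef = lasso.coef_ * w                              # ... and scale the coefficients back
print(coef)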

Feature Selection

Congratulations on the package. It could be interesting to write a feature selection model on top of the discovery to remove slow-moving indicators.

LiM runs into an exception when using the option only_global=False

When using the LiM implementation on larger datasets, everything works as long as I don't use the option only_global=False. Whenever it is set, I get the following message:
ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.
When running with only_global=True, everything works. I checked my data and, from what I can see, I correctly define the discrete indicator to contain all discrete features.

Can you help me out here?

Adjacency matrix for varlingam

I simply tried to test a dataset with lag=30 using the code snippet below, but it outputs at most 14 adjacency matrices. What could be the reason?

model = lingam.VARLiNGAM(lags=30, criterion='bic',prune=True)
model.fit(X)
print(model._adjacency_matrices)
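
One thing I still want to check is whether the criterion='bic' lag search is responsible; a sketch to force all 30 lags (my assumption is that criterion=None disables the search and keeps the lags passed to the constructor):

# Sketch: disable lag selection so the requested 30 lags are used as-is
# (assumption: criterion=None turns the search off).
model = lingam.VARLiNGAM(lags=30, criterion=None, prune=True)
model.fit(X)
print(len(model._adjacency_matrices))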

no_paths doesn't work when using prior_knowledge

Hi,

I really appreciate this repository because I can apply LINGAM to the system very quickly.

Now, I have one question: does no_paths work correctly?

For example, using this notebook: https://github.com/cdt15/lingam/blob/master/examples/DirectLiNGAM(PriorKnowledge).ipynb,

I generate the prior knowledge:

prior_knowledge = make_prior_knowledge(
    n_variables=6,
    exogenous_variables=[0],
    no_paths=[[2,1]])
print(prior_knowledge)
make_prior_knowledge_graph(prior_knowledge)

The output is:

[[ 0  0  0  0  0  0]
 [-1  0  0 -1 -1 -1]
 [-1 -1  0 -1 -1 -1]
 [-1 -1 -1  0 -1 -1]
 [-1 -1 -1 -1  0 -1]
 [-1 -1 -1 -1 -1  0]]

It seems the prior sets the path "2 -> 1" to zero (forbidden).

However, if I fit the model to the data:

model = lingam.DirectLiNGAM(prior_knowledge=prior_knowledge)
model.fit(X)

output model.adjacency_matrix_ is


	0	1	2	3	4	5
0	0.000000	0.0	0.000000	0.000000	0.0	0.0
1	2.986726	0.0	2.006062	0.000000	0.0	0.0
2	0.000000	0.0	0.000000	6.016333	0.0	0.0
3	0.299046	0.0	0.000000	0.000000	0.0	0.0
4	7.984485	0.0	-0.990590	0.000000	0.0	0.0
5	3.952478	0.0	0.000000	0.000000	0.0	0.0


The path "2 -> 1" has value 2.006062.

Is this the correct output value?
When using no_paths, I think the value should be zero.

about DirectLiNGAM: when sample size is less than the number of variables

Hi,
Thanks for your wonderful work!

I encountered a ValueError when I used DirectLiNGAM from this Python package.
It seems that DirectLiNGAM uses LassoLarsIC, which cannot work when the sample size is less than the number of variables.
Console message is here:
ValueError: You are using LassoLarsIC in the case where the number of samples is smaller than the number of features. In this setting, getting a good estimate for the variance of the noise is not possible. Provide an estimate of the noise variance in the constructor.

I'm wondering if it is a bug or a property of DirectLiNGAM.
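
For concreteness, a minimal sketch that reproduces the error (arbitrary data; the point is simply that there are more variables than samples):

import numpy as np
import lingam

# 50 samples but 100 variables: the LassoLarsIC-based pruning step raises the
# ValueError quoted above in this regime.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 100))
lingam.DirectLiNGAM().fit(X)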

Look forward to your reply.

Bootstrap Speed

## This is 30 times slower 
model = lingam.VARLiNGAM()
result = model.bootstrap(process_df, n_sampling=1)
## Than the following

model = lingam.VARLiNGAM(lags=lags)
process_df = variable_improvement(df_accounting, variable)
model.fit(process_df)

Do you know why this is, and how these times could be improved?

In VARLiNGAM, different behavior with the same data

I ran the following code and an error occurred:

rng = np.random.default_rng(seed=0)
x = rng.random(size=80)
y = 10 * x
df_x_y = pd.DataFrame({'x': x, 'y': y})
lingam.VARLiNGAM(random_state=0).fit(df_x_y)

However, no error occurred when I ran the following code:

df_y_x = pd.DataFrame({'y': y, 'x': x})
lingam.VARLiNGAM(random_state=0).fit(df_y_x)

The results were different even though the data seemed to be the same.

I think this is a bug, but please check it.

VARLiNGAM statsmodels

Dear, if I run VARLiNGAM with the latest version of statsmodels (0.14.0) installed, I seem to be getting an error:

Traceback (most recent call last):
  File ".../var_lingam.py", line 79, in <module>
    model.fit(df1)
  File ".../opt/anaconda3/envs/lingam/lib/python3.9/site-packages/lingam/var_lingam.py", line 95, in fit
    M_taus, lags, residuals = self._estimate_var_coefs(X)
  File ".../opt/anaconda3/envs/lingam/lib/python3.9/site-packages/lingam/var_lingam.py", line 279, in _estimate_var_coefs
    fitted = var.fit(maxlags=lag, ic=None, trend="nc")
  File ".../opt/anaconda3/envs/lingam/lib/python3.9/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 656, in fit
    raise ValueError("trend '{}' not supported for VAR".format(trend))
ValueError: trend 'nc' not supported for VAR

With an older version (0.12.2) everything seems to be working fine.

Thanks

Skewed Data Distributions and Homoscedasticity

Hi,

I'm wondering what's the best approach for data that is highly right-skewed. Is it best to take a log transform of it to make it more "normal" or does DirectLiNGAM deal with skewed data? The causal graphs are substantially different if I take the log and then normalise the data compared to only normalising the data and keeping the skewed distribution. I couldn't find the implementations of Hyvarinen & Smith 2013 for skewed data.
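
For concreteness, this is the kind of comparison I mean (a sketch; the lognormal toy data and the log1p/z-score transforms are only stand-ins for my actual preprocessing):

import numpy as np
import lingam

# Sketch of the two preprocessing paths being compared; the lognormal data here
# is only a stand-in for a right-skewed dataset.
rng = np.random.default_rng(0)
e = rng.lognormal(size=(2000, 3))
X = np.empty_like(e)
X[:, 0] = e[:, 0]
X[:, 1] = 2.0 * X[:, 0] + e[:, 1]
X[:, 2] = 0.5 * X[:, 1] + e[:, 2]

X_raw = (X - X.mean(axis=0)) / X.std(axis=0)               # normalise only
X_log = np.log1p(X)                                        # log transform ...
X_log = (X_log - X_log.mean(axis=0)) / X_log.std(axis=0)   # ... then normalise

print(lingam.DirectLiNGAM().fit(X_raw).adjacency_matrix_)
print(lingam.DirectLiNGAM().fit(X_log).adjacency_matrix_)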

Also, my understanding is that LiNGAM is specifically made for non-Gaussian distributions, but I'm a bit confused about how this impacts the adjacency matrix computation using linear regression since from my understanding non-Gaussian distributions violate homoscedasticity.

Any clarity on these two topics would be greatly appreciated!

make_dot()

dot = make_dot(model.adjacency_matrix_, labels=labels)  # labels contain Chinese node names
dot.render("dag")

The rendered graph is unable to display the Chinese node names.
How can I fix this?
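
From what I can tell, the missing characters are usually a Graphviz font issue rather than a lingam issue; a sketch assuming a CJK-capable font such as SimHei is installed and that model is an already-fitted LiNGAM object:

from lingam.utils import make_dot

# Sketch: pass the adjacency matrix and Chinese labels explicitly, and set a
# CJK-capable font for Graphviz nodes and edges (font name is an assumption).
labels = ['收入', '支出', '利润']
dot = make_dot(model.adjacency_matrix_, labels=labels)
dot.attr('node', fontname='SimHei')
dot.attr('edge', fontname='SimHei')
dot.render('dag')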

thank you

In ICALiNGAM, direction of causality is wrong when using data containing two variables

Data I used contained two variables as follows:

np.random.seed(0)
size = 1000
x = np.random.uniform(size=size)
y = 2.0*x + np.random.uniform(size=size)
df_x_y = pd.DataFrame({'x': x, 'y': y})

Using ICALiNGAM, I got the following result:

model_x_y = lingam.ICALiNGAM()
model_x_y.fit(df_x_y)
model_x_y.adjacency_matrix_
# [[0.         0.        ]
#  [2.00618965 0.        ]]
model_x_y.causal_order_
# [0, 1]

The direction of causality is correct, because the estimated result is x->y.
But when swapping x and y, I got the following result:

df_y_x = pd.DataFrame({'y': df_x_y['y'], 'x': df_x_y['x']})
model_y_x = lingam.ICALiNGAM()
model_y_x.fit(df_y_x)
model_y_x.adjacency_matrix_
# [[0.         0.        ]
#  [0.39467697 0.        ]]
model_y_x.causal_order_
# [0, 1]

The estimated direction is y->x. I only swapped the order of the variables, but the direction I got seems wrong.
I think this is a bug. Please check this issue.

For your information, when using data containing three or more variables (e.g. x->y->z), this issue did not occur.

Prior adjacency matrices

Why do prior adjacency matrices created with the make_prior_knowledge function have -1 for the edges instead of 1? What is the rationale for this?
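
My current reading of the documentation and of the prior-knowledge examples above is that -1 is not an edge marker at all, but I would like to confirm it (row = effect, column = cause):

import numpy as np

# My understanding of the prior-knowledge convention (row = effect, column = cause):
#   1  : a directed path from x_col to x_row is required
#   0  : a directed path from x_col to x_row is forbidden
#  -1  : unknown; left for the algorithm to decide
prior_knowledge = np.array([
    [ 0,  0],   # nothing causes x0 (x0 is treated as exogenous)
    [-1,  0],   # whether x0 -> x1 exists is left to the data
])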

Interpretation of values

Hello, thank you for your package!
I cannot find any information on how to interpret the values that one receives after causal inference.
Is it correct to intuitively assume that the higher the value, the higher the correlation between variables? Is there a maximum value?
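
To make the question concrete, here is a toy example of what I mean (generated data; the true coefficient is 5.0, which lies outside [-1, 1], so the entry cannot be a correlation):

import numpy as np
import lingam

# Toy data: x1 = 5.0 * x0 + noise. The estimated entry [1, 0] should be close
# to 5.0, i.e. a regression coefficient rather than a bounded correlation.
rng = np.random.default_rng(0)
x0 = rng.uniform(size=5000)
x1 = 5.0 * x0 + rng.uniform(size=5000)
X = np.column_stack([x0, x1])

model = lingam.DirectLiNGAM().fit(X)
print(model.adjacency_matrix_)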

Install not downloading all packages

Hello,

Recent versions of lingam don't install all required packages. For example, when I install lingam via pip in a conda environment file, I need to manually add igraph, factor_analyzer, and pygam. Is there a way around this?

About discrete graphs

Hello,

This is the second time I am asking a question. Thanks for your wonderful work.

When we use ordinary OLS linear regression, we can't get exactly 0 even if there is no edge between two variable nodes. So, in your work, the adaptive Lasso is used instead of general linear regression.

However, for some data, even the adaptive Lasso cannot guarantee that the DirectLiNGAM result is free of redundant edges, even though the values of those edges are very close to zero.

I think that if I can further prune the adjacency matrix, the result might be closer to the true matrix. But I don't know what criteria I should use for pruning, especially when analyzing discrete graphs in which some pairs have no edge.
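
For example, the simplest thing I can think of is an absolute cutoff on the estimated matrix, but the threshold in this sketch is arbitrary and I have no principled way to choose it (model is a fitted DirectLiNGAM):

import numpy as np

# Sketch: zero out near-zero edges with an arbitrary cutoff (0.05 here).
B = model.adjacency_matrix_.copy()
B[np.abs(B) < 0.05] = 0.0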

I'm wondering if I can directly get an answer here or if you can recommend me some papers that will be helpful.

Error in visualize_nonlinear_causal_effect

The example of utils.visualize_nonlinear_causal_effect does not seem to work anymore.

When running it, you get a problem with the shape of cd_result. The error occurs at _is_adjacency_matrix(cd_result, X.shape[1]) and is caused by check_array(cd_result). It yields ValueError: Expected 2D array, got scalar array instead.

How can this be fixed?

AttributeError during import

Hello,
I have installed lingam 1.6.0, and when I attempt to import it, I consistently encounter the error AttributeError: module 'numpy' has no attribute 'MachAr'. It appears that the error arises when lingam attempts to import VAR from statsmodels.tsa.vector_ar.var_model.

I have attempted to resolve the issue by changing the version of 'numpy,' but unfortunately, I was unable to eliminate the error.
Thank you.

Inquiry about the p-value in results

Dear author
While reading https://lingam.readthedocs.io/en/latest/tutorial/lingam.html, a question puzzles me.
[screenshot of the tutorial's result graph]
As shown in the tutorial's results, there is a chain between x3 and x4. Such a structure denotes the dependence of x3 and x4.
However, the p-value between x3 and x4 is 0.681 (>0.05). In my understanding, this means the null hypothesis (errors e3 and e4 are independent) cannot be rejected.
Thus, the dependence of x3 and x4 and the independence of e3 and e4 seem to conflict here.
So, how exactly should we understand the p-values in LiNGAM results? And what is the distinction between independence of variables and independence of errors?
Best regards

panel data

Hello,

Can I apply the VAR-LiNGAM model to panel data? If yes, is there any tutorial or document to learn how to do this with panel data?

ValueError: trend 'nc' not supported for VAR

Hi!
I got the following error when trying to use Varlingam

ValueError: trend 'nc' not supported for VAR
The error seems to be caused by line 272 in var_lingam.py:

result = var.fit(maxlags=self._lags, trend="nc")
but "nc" is not an available trend in the fit method of VAR.
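
For now I am working around it locally by changing that line (a sketch, not an official patch; newer statsmodels releases seem to use "n" instead of "nc" for the no-trend option):

# In var_lingam.py, replace the removed trend code with the newer one:
result = var.fit(maxlags=self._lags, trend="n")  # "n" = no trend in newer statsmodels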

ModuleNotFoundError after installation from pypi

When installing lingam via

pip install "lingam==0.17.0"

and importing it via

import lingam

then ModuleNotFoundError: No module named 'pygam' is thrown.

The reason might be that setup.py does not use requirements.txt, where all requirements are specified.
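
One possible fix might be to read requirements.txt inside setup.py; a sketch (assuming requirements.txt sits next to setup.py):

from setuptools import setup, find_packages

# Sketch: feed requirements.txt into install_requires so pip installs pygam
# and the other dependencies automatically.
with open("requirements.txt") as f:
    install_requires = [line.strip() for line in f
                        if line.strip() and not line.startswith("#")]

setup(
    name="lingam",
    packages=find_packages(),
    install_requires=install_requires,
)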
