py-why / causaltune
AutoML for causal inference.
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
Currently the metrics are not properly tested.
Describe the solution you'd like
Unit and integration tests for the metrics need to be written that do not slow CI down.
no estimator list specified for create_init_config
Describe the solution you'd like
Implement other metrics in addition to the ERUPT score.
Here are a few candidates: https://arxiv.org/abs/1804.05146
In the AutoCausality __init__() you use the pattern settings["use_dummyclassifier"] = settings.get("use_dummyclassifier", True) and so on for all the arguments.
What does that achieve compared to just having (e.g.) an explicit keyword argument with default value use_dummyclassifier=True in the __init__() signature, apart from making the code harder to read?
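For comparison, here are the two styles side by side. This is only a sketch: the class names SettingsStyle and KeywordStyle are hypothetical, and it assumes the settings arrive as **kwargs, which the question does not confirm. Only use_dummyclassifier comes from the discussion.

```python
class SettingsStyle:
    """Default filled in by hand from a settings dict (the pattern in question)."""

    def __init__(self, **settings):
        settings["use_dummyclassifier"] = settings.get("use_dummyclassifier", True)
        self.use_dummyclassifier = settings["use_dummyclassifier"]


class KeywordStyle:
    """Same default, declared directly in the signature."""

    def __init__(self, use_dummyclassifier=True):
        self.use_dummyclassifier = use_dummyclassifier


# Both styles produce the same default value.
assert SettingsStyle().use_dummyclassifier is True
assert KeywordStyle().use_dummyclassifier is True

# Only the explicit signature rejects a misspelled argument name.
SettingsStyle(use_dummy_classifier=False)  # typo is silently ignored
try:
    KeywordStyle(use_dummy_classifier=False)
except TypeError as err:
    print(f"rejected: {err}")
```

One practical difference is visible here: with an explicit signature, a typo in an argument name fails immediately instead of leaving the default in place.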
Add a method to AutoCausality to verify that valid strings have been supplied as the metrics and metrics_to_report arguments.
If an invalid string is supplied, raise an error and show the user the list of available metrics.
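A minimal sketch of such a check, assuming the available metrics can be listed up front. The AVAILABLE_METRICS tuple and the function name validate_metric_names are hypothetical; the real list would come from the package itself.

```python
# Hypothetical list of supported metric names, for illustration only.
AVAILABLE_METRICS = ("erupt", "qini", "auc", "ate", "r_scorer")


def validate_metric_names(names):
    """Raise ValueError naming every unknown metric and listing the valid ones."""
    unknown = [name for name in names if name not in AVAILABLE_METRICS]
    if unknown:
        raise ValueError(
            f"Unknown metric(s): {unknown}. "
            f"Available metrics: {list(AVAILABLE_METRICS)}"
        )


# Check the main metric and the reporting metrics in one call.
validate_metric_names(["erupt", "qini", "auc"])
```

Collecting all unknown names before raising lets the user fix every typo in one go rather than one per run.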
Right now I have to supply both test and train sets when fitting (why, by the way? The test set shouldn't be used at all; I think we should auto-split the validation set from train), but the function only outputs one ERUPT value per fit, without saying whether it's on train or test.
Could we please change it so it shows both?
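A rough sketch of what the request amounts to: hold out a validation slice from the training data and report the score on both parts. The function names and the 20% default are assumptions, and metric stands in for any scoring callable (ERUPT is not implemented here).

```python
import random


def split_train_validation(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out a validation slice."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def score_both(rows, metric):
    """Return the metric for the train and validation parts, labelled."""
    train, validation = split_train_validation(rows)
    return {"train": metric(train), "validation": metric(validation)}
```

Returning a labelled dict removes the ambiguity about which split a single reported number refers to.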
If an individual model fit fails with an exception, we should report the exception and then move on to the next attempt.
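The behaviour described could look roughly like this. It is a sketch only: fit_all and the fit_one callable (estimator name to score) are hypothetical and not part of the package.

```python
import logging

logger = logging.getLogger(__name__)


def fit_all(estimators, fit_one):
    """Fit each estimator, logging failures instead of aborting the whole run."""
    scores, failures = {}, {}
    for name in estimators:
        try:
            scores[name] = fit_one(name)
        except Exception as exc:  # report, then move on to the next attempt
            logger.exception("Fit failed for %s", name)
            failures[name] = exc
    return scores, failures
```

Keeping the exceptions in a separate dict means they can still be shown to the user after the search finishes.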
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Something like:
# outcome and treatment are the column names taken from the enclosing scope
def ate(df: pd.DataFrame):
    return df[outcome][df[treatment] == 1].mean() - df[outcome][df[treatment] == 0].mean()

# put the baseline entry first, ahead of the existing scores
self.full_scores = {
    "baseline": {
        "estimator": "baseline",
        "outcome": outcome,
        "train": {"erupt": ac.train_df[outcome].mean(), "ate": ate(ac.train_df)},
        "validation": {"erupt": ac.test_df[outcome].mean(), "ate": ate(ac.test_df)},
    },
    **self.full_scores,
}
Plus corresponding metrics for Qini, AUC, and r-scorer, using randomness instead of a model where a model is needed.
fitting estimators: ['backdoor.econml.dml.LinearDML', 'backdoor.econml.dml.SparseLinearDML', 'backdoor.econml.dml.CausalForestDML', 'backdoor.econml.dr.ForestDRLearner']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\EGOR~1.KRA\AppData\Local\Temp/ipykernel_27524/31860363.py in <module>
3 auto_causality = AutoCausality(time_budget=1200,components_time_budget=120,estimator_list=estimator_list)
4
----> 5 myresults = auto_causality.fit(train_df, test_df, treatment, outcome,
6 features_W, features_X)
7
~\Transferwise\auto-causality\auto_causality\optimiser.py in fit(self, train_df, test_df, treatment, outcome, common_causes, effect_modifiers)
229 ]
230 else:
--> 231 results = tune.run(
232 self._tune_with_config,
233 self.estimator_cfg["search_space"],
~\.conda\envs\generic3.9\lib\site-packages\flaml\tune\tune.py in run(evaluation_function, config, low_cost_partial_config, cat_hp_cost, metric, mode, time_budget_s, points_to_evaluate, evaluated_rewards, resource_attr, min_resource, max_resource, reduction_factor, scheduler, search_alg, verbose, local_dir, num_samples, resources_per_trial, config_constraints, metric_constraints, max_failure, use_ray, use_incumbent_result_in_evaluation)
450 if verbose:
451 logger.info(f"trial {num_trials} config: {trial_to_run.config}")
--> 452 result = evaluation_function(trial_to_run.config)
453 if result is not None:
454 if isinstance(result, dict):
~\Transferwise\auto-causality\auto_causality\optimiser.py in _tune_with_config(self, config)
269 }
270 # estimate effect with current config
--> 271 self._estimate_effect()
272
273 # compute a metric and return results
~\Transferwise\auto-causality\auto_causality\optimiser.py in _estimate_effect(self)
279 """estimates effect with chosen estimator"""
280 if hasattr(self, "estimator"):
--> 281 self.estimates[self.estimator] = self.causal_model.estimate_effect(
282 self.identified_estimand,
283 method_name=self.estimator,
~\Transferwise\dowhy\dowhy\causal_model.py in estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
311 target_units)
312
--> 313 estimate = self.causal_estimator.estimate_effect()
314 # Store parameters inside estimate object for refutation methods
315 # TODO: This add_params needs to move to the estimator class
~\Transferwise\dowhy\dowhy\causal_estimator.py in estimate_effect(self)
178 :returns: A CausalEstimate instance that contains point estimates of average and conditional effects. Based on the parameters provided, it optionally includes confidence intervals, standard errors,statistical significance and other statistical parameters.
179 """
--> 180 est = self._estimate_effect()
181 est.add_estimator(self)
182
~\Transferwise\dowhy\dowhy\causal_estimators\econml.py in _estimate_effect(self)
86 if self.estimator is None:
87 estimator_class = self._get_econml_class_object(self._econml_methodname)
---> 88 self.estimator = estimator_class(**self.method_params["init_params"])
89 # Calling the econml estimator's fit method
90 estimator_argspec = inspect.getfullargspec(
TypeError: __init__() got an unexpected keyword argument 'early_stopping_rounds'
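The traceback shows only that init_params contained a key the estimator's constructor does not accept. One possible guard, sketched here under that assumption, is to check the parameters against the constructor signature before instantiating. filter_init_params and DummyEstimator are hypothetical stand-ins, not EconML or DoWhy code, and this is not necessarily how the issue was resolved.

```python
import inspect


def filter_init_params(estimator_class, params):
    """Split params into those accepted by estimator_class.__init__ and the rest."""
    sig = inspect.signature(estimator_class.__init__)
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()):
        return dict(params), {}  # constructor takes **kwargs, nothing to filter
    accepted = {k: v for k, v in params.items() if k in sig.parameters}
    rejected = {k: v for k, v in params.items() if k not in sig.parameters}
    return accepted, rejected


class DummyEstimator:  # stand-in class, not an EconML estimator
    def __init__(self, n_estimators=100):
        self.n_estimators = n_estimators


accepted, rejected = filter_init_params(
    DummyEstimator, {"n_estimators": 10, "early_stopping_rounds": 5}
)
estimator = DummyEstimator(**accepted)  # no TypeError; rejected holds the rest
```

Reporting the rejected keys, rather than dropping them silently, would make it easier to find where the stray hyperparameter came from.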
FLAML lets the user specify which metric to use and what kind of estimators to optimise:
settings = {
    "time_budget": 10,  # total running time in seconds
    "metric": 'accuracy',
    "estimator_list": ['RGF', 'lgbm', 'rf', 'xgboost'],  # list of ML learners
    "task": 'classification',  # task type
}
automl.fit(X_train=X_train, y_train=y_train, **settings)
In order to run FLAML's AutoML on EconML models (i.e. to select among DML, metalearners etc.), we need to supply it with a custom metric and a list of custom estimators.
Describe the solution you'd like
The ERUPT metric implemented in an AutoML-compatible format, and EconML estimators implemented in an AutoML-compatible format. The usage would be something like this:
automl = AutoML()
automl.add_learner(learner_name='lindml', learner_class=LinearDML)
automl.add_learner(learner_name='tlearner', learner_class=TLearner)
settings = {
    "time_budget": 10,  # total running time in seconds
    "metric": 'ERUPT',
    "estimator_list": ['lindml', 'tlearner'],  # must match the names registered above
    "task": 'causalinference',  # task type
}
automl.fit(X_train=X_train, y_train=y_train, **settings)
Note: The add_learner method is already part of automl. Ideally, we'd later on write a wrapper that instantiates the automl and adds all of these learners under the hood.
Choosing the estimator should be as much FLAML's job as choosing the estimator-specific hyperparameters.
I thought that was already done, but looking at the source code in the main branch, I don't see it.
I am trying to run the following code and am getting the error below. Can you please take a look?
outcome = targets[0]
auto_causality = AutoCausality(time_budget=1200, components_time_budget=120, estimator_list=estimator_list)
myresults = auto_causality.fit(
    train_df,
    test_df,
    treatment=treatment,
    outcome=outcome,
    common_causes=features_W,
    effect_modifiers=features_X,
)
print(f"Best estimator: {auto_causality.best_estimator}")
fitting estimators: ['backdoor.econml.dr.ForestDRLearner']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\EGOR~1.KRA\AppData\Local\Temp/ipykernel_28884/1817044370.py in <module>
17 auto_causality = AutoCausality(time_budget=1200,components_time_budget=120,estimator_list=estimator_list)
18
---> 19 myresults = auto_causality.fit(train_df,
20 test_df,
21 treatment=treatment,
~\Transferwise\auto-causality\auto_causality\optimiser.py in fit(self, train_df, test_df, treatment, outcome, common_causes, effect_modifiers)
229 ]
230 else:
--> 231 results = tune.run(
232 self._tune_with_config,
233 self.estimator_cfg["search_space"],
~\.conda\envs\generic3.9\lib\site-packages\flaml\tune\tune.py in run(evaluation_function, config, low_cost_partial_config, cat_hp_cost, metric, mode, time_budget_s, points_to_evaluate, evaluated_rewards, resource_attr, min_resource, max_resource, reduction_factor, scheduler, search_alg, verbose, local_dir, num_samples, resources_per_trial, config_constraints, metric_constraints, max_failure, use_ray, use_incumbent_result_in_evaluation)
450 if verbose:
451 logger.info(f"trial {num_trials} config: {trial_to_run.config}")
--> 452 result = evaluation_function(trial_to_run.config)
453 if result is not None:
454 if isinstance(result, dict):
~\Transferwise\auto-causality\auto_causality\optimiser.py in _tune_with_config(self, config)
269 }
270 # estimate effect with current config
--> 271 self._estimate_effect()
272
273 # compute a metric and return results
~\Transferwise\auto-causality\auto_causality\optimiser.py in _estimate_effect(self)
279 """estimates effect with chosen estimator"""
280 if hasattr(self, "estimator"):
--> 281 self.estimates[self.estimator] = self.causal_model.estimate_effect(
282 self.identified_estimand,
283 method_name=self.estimator,
~\Transferwise\dowhy\dowhy\causal_model.py in estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
311 target_units)
312
--> 313 estimate = self.causal_estimator.estimate_effect()
314 # Store parameters inside estimate object for refutation methods
315 # TODO: This add_params needs to move to the estimator class
~\Transferwise\dowhy\dowhy\causal_estimator.py in estimate_effect(self)
178 :returns: A CausalEstimate instance that contains point estimates of average and conditional effects. Based on the parameters provided, it optionally includes confidence intervals, standard errors,statistical significance and other statistical parameters.
179 """
--> 180 est = self._estimate_effect()
181 est.add_estimator(self)
182
~\Transferwise\dowhy\dowhy\causal_estimators\econml.py in _estimate_effect(self)
86 if self.estimator is None:
87 estimator_class = self._get_econml_class_object(self._econml_methodname)
---> 88 self.estimator = estimator_class(**self.method_params["init_params"])
89 # Calling the econml estimator's fit method
90 estimator_argspec = inspect.getfullargspec(
TypeError: __init__() got an unexpected keyword argument 'num_leaves'
We need a concise description of the package.
Let's do this as we keep adding features. It's then easier to keep track.
When running flaml_squared.ipynb across all estimators, I get
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\joblib\parallel.py", line 822, in dispatch_one_batch
tasks = self._ready_batches.get(block=False)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\queue.py", line 168, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\joblib\externals\loky\process_executor.py", line 436, in _process_worker
r = call_item()
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\joblib\externals\loky\process_executor.py", line 288, in __call__
return self.fn(*self.args, **self.kwargs)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\joblib\_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\joblib\parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "C:\Users\egor.kraev\Transferwise\auto-causality\auto_causality\optimiser.py", line 341, in _estimate_effect
estimate = self.causal_model.estimate_effect(
File "C:\Users\egor.kraev\Transferwise\dowhy\dowhy\causal_model.py", line 316, in estimate_effect
estimate = self.causal_estimator.estimate_effect()
File "C:\Users\egor.kraev\Transferwise\dowhy\dowhy\causal_estimator.py", line 191, in estimate_effect
est = self._estimate_effect()
File "C:\Users\egor.kraev\Transferwise\dowhy\dowhy\causal_estimators\econml.py", line 137, in _estimate_effect
est = self.estimator.effect(X_test, T0=T0_test, T1=T1_test)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\econml\_cate_estimator.py", line 862, in effect
return super().effect(X, T0=T0, T1=T1)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\econml\_cate_estimator.py", line 590, in effect
eff = self.const_marginal_effect(X)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\econml\orf\_ortho_forest.py", line 1001, in const_marginal_effect
effects = super().const_marginal_effect(X=X)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\econml\orf\_ortho_forest.py", line 319, in const_marginal_effect
return np.asarray(self._predict(X))
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\econml\orf\_ortho_forest.py", line 325, in _predict
results = Parallel(n_jobs=self.n_jobs, backend=self.backend,
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\joblib\parallel.py", line 1043, in __call__
if self.dispatch_one_batch(iterator):
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\joblib\parallel.py", line 833, in dispatch_one_batch
islice = list(itertools.islice(iterator, big_batch_size))
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\econml\orf\_ortho_forest.py", line 327, in <genexpr>
delayed(_pointwise_effect)(X_single, *self._pw_effect_inputs(X_single, stderr=stderr),
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\econml\orf\_ortho_forest.py", line 335, in _pw_effect_inputs
w1, w2 = self._get_weights(X_single)
File "C:\Users\egor.kraev\.conda\envs\generic3.9\lib\site-packages\econml\orf\_ortho_forest.py", line 399, in _get_weights
leaf_weight = 1 / len(leaf.est_sample_inds)
ZeroDivisionError: division by zero
"""
The above exception was the direct cause of the following exception:
ZeroDivisionError Traceback (most recent call last)
C:\Users\EGOR~1.KRA\AppData\Local\Temp/ipykernel_19912/2886102488.py in <module>
19
20 # run autocausality
---> 21 myresults = ac.fit(data_df, treatment, outcome, features_W, features_X)
22
23 # return best estimator
~\Transferwise\auto-causality\auto_causality\optimiser.py in fit(self, data_df, treatment, outcome, common_causes, effect_modifiers)
270 last_result = self._estimate_effect(self.estimator_cfg["init_params"])
271 else:
--> 272 results = tune.run(
273 self._tune_with_config,
274 self.estimator_cfg["search_space"],
~\.conda\envs\generic3.9\lib\site-packages\flaml\tune\tune.py in run(evaluation_function, config, low_cost_partial_config, cat_hp_cost, metric, mode, time_budget_s, points_to_evaluate, evaluated_rewards, resource_attr, min_resource, max_resource, reduction_factor, scheduler, search_alg, verbose, local_dir, num_samples, resources_per_trial, config_constraints, metric_constraints, max_failure, use_ray, use_incumbent_result_in_evaluation)
446 if verbose:
447 logger.info(f"trial {num_trials} config: {trial_to_run.config}")
--> 448 result = evaluation_function(trial_to_run.config)
449 if result is not None:
450 if isinstance(result, dict):
~\Transferwise\auto-causality\auto_causality\optimiser.py in _tune_with_config(self, config)
315 # estimate effect with current config
316 # spawn a separate process to prevent cross-talk between tuner and automl on component models:
--> 317 estimates = Parallel(n_jobs=2)(
318 delayed(self._estimate_effect)(config) for i in range(1)
319 )[0]
~\.conda\envs\generic3.9\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1054
1055 with self._backend.retrieval_context():
-> 1056 self.retrieve()
1057 # Make sure that we get a last message telling us we are done
1058 elapsed_time = time.time() - self._start_time
~\.conda\envs\generic3.9\lib\site-packages\joblib\parallel.py in retrieve(self)
933 try:
934 if getattr(self._backend, 'supports_timeout', False):
--> 935 self._output.extend(job.get(timeout=self.timeout))
936 else:
937 self._output.extend(job.get())
~\.conda\envs\generic3.9\lib\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
~\.conda\envs\generic3.9\lib\concurrent\futures\_base.py in result(self, timeout)
443 raise CancelledError()
444 elif self._state == FINISHED:
--> 445 return self.__get_result()
446 else:
447 raise TimeoutError()
~\.conda\envs\generic3.9\lib\concurrent\futures\_base.py in __get_result(self)
388 if self._exception:
389 try:
--> 390 raise self._exception
391 finally:
392 # Break a reference cycle with the exception in self._exception
ZeroDivisionError: division by zero
outcome = targets[0]
baseline
Your question
For the benchmarking it would perhaps be useful to have a data ingestion pipeline that crawls different causal inference datasets from the web and brings them into a coherent format?
Extend pipeline to accept multi-valued categorical treatment datasets.
Need to do a test run
Once we have a single fitter run across all models, we should first start with a good initial guess for the component_budget value (say, time_budget / (number of estimators * 6)), and then treat it as another hyperparameter to optimize, along with model choice and model hyperparameters.
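The suggested starting value is simple arithmetic. A small sketch follows, in which the function name initial_component_budget is hypothetical and the divisor of 6 is taken from the suggestion above.

```python
def initial_component_budget(time_budget, n_estimators, divisor=6):
    """Starting guess: time_budget / (number of estimators * divisor)."""
    if n_estimators < 1:
        raise ValueError("need at least one estimator")
    return time_budget / (n_estimators * divisor)


# Example: a 1200 s total budget shared by 4 estimators.
print(initial_component_budget(1200, 4))  # 50.0
```

The tuner could then search around this value, for instance on a log scale, alongside the other hyperparameters.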
If I set the default n_estimators for ForestDRLearner to 1000 as per the docs, and then fit it on a 500K CRM dataset, it fails reliably with the stack trace below. When I change n_estimators to 100, it runs fine.
['backdoor.econml.dr.ForestDRLearner']
[flaml.automl: 03-29 08:37:38] {2293} WARNING - Time taken to find the best model is 76% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
[flaml.automl: 03-29 08:39:41] {2293} WARNING - Time taken to find the best model is 71% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
[flaml.automl: 03-29 08:43:55] {2293} WARNING - Time taken to find the best model is 98% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
[flaml.tune.tune: 03-29 08:43:56] {331} WARNING - Using CFO for search. To use BlendSearch, run: pip install flaml[blendsearch]
[flaml.tune.tune: 03-29 08:43:56] {451} INFO - trial 1 config: {'min_propensity': 1e-06, 'n_estimators': 1000, 'min_samples_split': 5, 'min_samples_leaf': 5, 'min_weight_fraction_leaf': 0.0, 'max_features': 'auto', 'min_impurity_decrease': 0.0, 'max_samples': 0.45, 'min_balancedness_tol': 0.45, 'honest': 1, 'subforest_size': 4}
Starting fit of backdoor.econml.dr.ForestDRLearner
backdoor.econml.dr.ForestDRLearner {'min_propensity': 1e-06, 'n_estimators': 1000, 'min_samples_split': 5, 'min_samples_leaf': 5, 'min_weight_fraction_leaf': 0.0, 'max_features': 'auto', 'min_impurity_decrease': 0.0, 'max_samples': 0.45, 'min_balancedness_tol': 0.45, 'honest': 1, 'subforest_size': 4}
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ansible/.local/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 357, in _sendback_result
exception=exception))
File "/home/ansible/.local/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 244, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""
The above exception was the direct cause of the following exception:
error Traceback (most recent call last)
<ipython-input-27-53894a4c9887> in <module>
26 outcome=outcome,
27 common_causes= features_W,
---> 28 effect_modifiers = features_X)
29
30 print(f"Best estimator for {outcome}: {auto_causality.best_estimator}")
/data/code/auto-causality/auto_causality/optimiser.py in fit(self, data_df, treatment, outcome, common_causes, effect_modifiers)
280 points_to_evaluate=[self.estimator_cfg.get("defaults", {})],
281 low_cost_partial_config={},
--> 282 **self._settings["tuner"],
283 )
284
~/.local/lib/python3.6/site-packages/flaml/tune/tune.py in run(evaluation_function, config, low_cost_partial_config, cat_hp_cost, metric, mode, time_budget_s, points_to_evaluate, evaluated_rewards, resource_attr, min_resource, max_resource, reduction_factor, scheduler, search_alg, verbose, local_dir, num_samples, resources_per_trial, config_constraints, metric_constraints, max_failure, use_ray, use_incumbent_result_in_evaluation)
450 if verbose:
451 logger.info(f"trial {num_trials} config: {trial_to_run.config}")
--> 452 result = evaluation_function(trial_to_run.config)
453 if result is not None:
454 if isinstance(result, dict):
/data/code/auto-causality/auto_causality/optimiser.py in _tune_with_config(self, config)
321 # spawn a separate process to prevent cross-talk between tuner and automl on component models:
322 estimates = Parallel(n_jobs=2)(
--> 323 delayed(self._estimate_effect)(config) for i in range(1)
324 )[0]
325
~/.local/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
1040
1041 with self._backend.retrieval_context():
-> 1042 self.retrieve()
1043 # Make sure that we get a last message telling us we are done
1044 elapsed_time = time.time() - self._start_time
~/.local/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
919 try:
920 if getattr(self._backend, 'supports_timeout', False):
--> 921 self._output.extend(job.get(timeout=self.timeout))
922 else:
923 self._output.extend(job.get())
~/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
538 AsyncResults.get from multiprocessing."""
539 try:
--> 540 return future.result(timeout=timeout)
541 except CfTimeoutError:
542 raise TimeoutError()
/usr/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
/usr/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
error: 'i' format requires -2147483648 <= number <= 2147483647
One might want to run a subset of estimators first and then extend the search to others, for example.
We could implement an option to save the fitted models and scores to disk, e.g. by adding model_save_path and scores_save_path as optional arguments to AutoCausality(). If a path is not None, the output would be saved there under a filename derived from, e.g., the estimator name, time budget, and save timestamp.
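A minimal sketch of how such a filename could be derived. The helper name build_save_path, the filename layout, and the "pkl" suffix are assumptions made for illustration; none of them exist in the package.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_save_path(
    save_dir: Optional[str],
    estimator_name: str,
    time_budget: int,
    suffix: str = "pkl",
    now: Optional[datetime] = None,
) -> Optional[Path]:
    """Return a target path for a saved artefact, or None if saving is disabled.

    Hypothetical helper: the naming scheme is only one possible choice.
    """
    if save_dir is None:
        # No path supplied, so nothing should be written to disk
        return None
    stamp = (now or datetime.now()).strftime("%Y%m%dT%H%M%S")
    # Keep only the last component of a dotted estimator name
    short_name = estimator_name.split(".")[-1]
    return Path(save_dir) / f"{short_name}_{time_budget}s_{stamp}.{suffix}"
```

The same helper could serve both model_save_path and scores_save_path by varying the suffix, so that a later run can locate earlier results by estimator name.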
https://github.com/transferwise/auto-causality/runs/5626194835?check_suite_focus=true
When trying to plot a simple tree policy after running the optimization for a LinearDML estimator, I sometimes get an exception, but I can't reliably reproduce it on my local machine. Flagging it here in case it appears again.
Running the code below, I get this error. Any idea what could be going on?
estimator_list = [
    "TransformedOutcome",
    "DomainAdaptationLearner",
    "SLearner",
    "CausalForestDML",
]
outcome = targets[0]
auto_causality = AutoCausality(
    time_budget=1200,
    components_time_budget=60,
    estimator_list=estimator_list,
)
myresults = auto_causality.fit(
    train_df,
    test_df,
    treatment=treatment,
    outcome=outcome,
    common_causes=features_W,
    effect_modifiers=features_X,
)
print(f"Best estimator: {auto_causality.best_estimator}")
fitting estimators: ['backdoor.auto_causality.models.TransformedOutcome', 'backdoor.econml.metalearners.DomainAdaptationLearner', 'backdoor.econml.metalearners.SLearner', 'backdoor.econml.dml.CausalForestDML']
... Estimator: backdoor.auto_causality.models.TransformedOutcome ERUPT: 1063.705429
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
... Estimator: backdoor.econml.metalearners.DomainAdaptationLearner ERUPT: 836.830616
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
[flaml.tune.tune: 03-04 12:23:44] {330} WARNING - Using CFO for search. To use BlendSearch, run: pip install flaml[blendsearch]
[flaml.tune.tune: 03-04 12:23:44] {451} INFO - trial 1 config: {'mc_iters': 5, 'drate': 0, 'n_estimators': 67, 'criterion': 'het', 'max_depth': 2, 'min_samples_split': 18, 'min_samples_leaf': 7, 'min_weight_fraction_leaf': 0.377743449018298, 'min_var_fraction_leaf': 0.26639242043080236, 'max_features': 'sqrt', 'min_impurity_decrease': 7.5621068260561595, 'max_samples': 0.013611566647889872, 'min_balancedness_tol': 0.3836583130489407, 'honest': 0, 'inference': 0, 'fit_intercept': 1}
... Estimator: backdoor.econml.metalearners.SLearner ERUPT: 679.618384
Could not find best trial. Did you pass the correct `metric` parameter?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\EGOR~1.KRA\AppData\Local\Temp/ipykernel_28884/483037859.py in <module>
17 auto_causality = AutoCausality(time_budget=1200,components_time_budget=120,estimator_list=estimator_list)
18
---> 19 myresults = auto_causality.fit(train_df,
20 test_df,
21 treatment=treatment,
~\Transferwise\auto-causality\auto_causality\optimiser.py in fit(self, train_df, test_df, treatment, outcome, common_causes, effect_modifiers)
249 ]
250 print(
--> 251 f"... Estimator: {self.estimator} \t {self._settings['metric']}: {self.results[self.estimator]:6f}"
252 )
253
TypeError: unsupported format string passed to NoneType.__format__
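The TypeError above means that the value formatted with `:6f` was None, which is consistent with the preceding "Could not find best trial" message. A small sketch of a guard for that print statement follows; format_score is a hypothetical helper, not part of the package, and it only hides the symptom rather than explaining why no trial succeeded.

```python
from typing import Optional


def format_score(estimator: str, metric: str, score: Optional[float]) -> str:
    """Build the progress line, tolerating a missing score."""
    if score is None:
        # A float format spec such as ':6f' cannot be applied to None
        return f"... Estimator: {estimator} \t {metric}: n/a (no successful trial)"
    return f"... Estimator: {estimator} \t {metric}: {score:6f}"
```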
When I remove some other estimators from the list, OrthoForests are called without problems
I think I give the fitter ample time to finish - testing now
When running AutoCausality just now with components_budget=1800 and time_budget=7200 for (almost) all estimators, I get performance that is quite inferior to what I was getting when I just ran a for loop with default parameter values.
So for optimal performance, I think it's important to tell FLAML to first run each estimator with default EconML values (this could be as simple as passing it an empty dict instead of the sampleable params), and only then start random sampling with our config.
The points_to_evaluate argument in FLAML should let us achieve just that.
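The intended ordering can be illustrated without FLAML. The sketch below is plain Python and is not FLAML's implementation: search_defaults_first, evaluate, and sample are hypothetical names, and an empty dict stands for "use the library defaults". It evaluates the supplied starting points before any randomly sampled configs.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, int]


def search_defaults_first(
    evaluate: Callable[[Config], float],
    sample: Callable[[random.Random], Config],
    points_to_evaluate: List[Config],
    num_samples: int,
    seed: int = 0,
) -> Tuple[Optional[Config], float]:
    """Score the starting points first, then random samples; return the best (max)."""
    rng = random.Random(seed)
    candidates = list(points_to_evaluate)
    # Fill the remaining budget with randomly sampled configs
    n_random = max(0, num_samples - len(candidates))
    candidates.extend(sample(rng) for _ in range(n_random))

    best_config: Optional[Config] = None
    best_score = float("-inf")
    for config in candidates:
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

With this ordering the result can never be worse than the default configuration, which is the property the issue asks for.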
The full_scores dict seems to contain results only for the best-fitting estimator, as identified by the hierarchical fitting procedure.
This means that we can't easily create plots comparing the train/test performance of multiple estimators.
Suggestion: retrieve the trial history and, for each estimator, store the trial with the best-fitting config in the full_scores dict?
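A sketch of the suggested bookkeeping, assuming the trial history is available as a list of dicts with an "estimator" key and a metric key. That record layout and the function name best_trial_per_estimator are assumptions for illustration, not the package's actual data structures.

```python
from typing import Any, Dict, Iterable, Mapping


def best_trial_per_estimator(
    trials: Iterable[Mapping[str, Any]], metric: str
) -> Dict[str, Mapping[str, Any]]:
    """Keep, for each estimator, the trial with the highest value of `metric`."""
    best: Dict[str, Mapping[str, Any]] = {}
    for trial in trials:
        score = trial.get(metric)
        if score is None:
            # Skip failed trials that produced no score
            continue
        name = trial["estimator"]
        if name not in best or score > best[name][metric]:
            best[name] = trial
    return best
```

The returned dict has one entry per estimator with at least one successful trial, which is enough to plot train/test scores side by side.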
estimator_list = ["CausalForestDML"]
outcome = targets[0]
auto_causality = AutoCausality(
    time_budget=1200,
    components_time_budget=60,
    estimator_list=estimator_list,
)
myresults = auto_causality.fit(
    train_df,
    test_df,
    treatment=treatment,
    outcome=outcome,
    common_causes=features_W,
    effect_modifiers=features_X,
)
print(f"Best estimator: {auto_causality.best_estimator}")
[flaml.tune.tune: 03-04 12:52:12] {330} WARNING - Using CFO for search. To use BlendSearch, run: pip install flaml[blendsearch]
[flaml.tune.tune: 03-04 12:52:12] {451} INFO - trial 1 config: {'mc_iters': 5, 'drate': 0, 'n_estimators': 67, 'criterion': 'het', 'max_depth': 2, 'min_samples_split': 18, 'min_samples_leaf': 7, 'min_weight_fraction_leaf': 0.377743449018298, 'min_var_fraction_leaf': 0.26639242043080236, 'max_features': 'sqrt', 'min_impurity_decrease': 7.5621068260561595, 'max_samples': 0.013611566647889872, 'min_balancedness_tol': 0.3836583130489407, 'honest': 0, 'inference': 0, 'fit_intercept': 1}
fitting estimators: ['backdoor.econml.dml.CausalForestDML']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\EGOR~1.KRA\AppData\Local\Temp/ipykernel_28884/3995702144.py in <module>
17 auto_causality = AutoCausality(time_budget=1200,components_time_budget=60,estimator_list=estimator_list)
18
---> 19 myresults = auto_causality.fit(train_df,
20 test_df,
21 treatment=treatment,
~\Transferwise\auto-causality\auto_causality\optimiser.py in fit(self, train_df, test_df, treatment, outcome, common_causes, effect_modifiers)
229 ]
230 else:
--> 231 results = tune.run(
232 self._tune_with_config,
233 self.estimator_cfg["search_space"],
~\.conda\envs\generic3.9\lib\site-packages\flaml\tune\tune.py in run(evaluation_function, config, low_cost_partial_config, cat_hp_cost, metric, mode, time_budget_s, points_to_evaluate, evaluated_rewards, resource_attr, min_resource, max_resource, reduction_factor, scheduler, search_alg, verbose, local_dir, num_samples, resources_per_trial, config_constraints, metric_constraints, max_failure, use_ray, use_incumbent_result_in_evaluation)
450 if verbose:
451 logger.info(f"trial {num_trials} config: {trial_to_run.config}")
--> 452 result = evaluation_function(trial_to_run.config)
453 if result is not None:
454 if isinstance(result, dict):
~\Transferwise\auto-causality\auto_causality\optimiser.py in _tune_with_config(self, config)
269 }
270 # estimate effect with current config
--> 271 self._estimate_effect()
272
273 # compute a metric and return results
~\Transferwise\auto-causality\auto_causality\optimiser.py in _estimate_effect(self)
279 """estimates effect with chosen estimator"""
280 if hasattr(self, "estimator"):
--> 281 self.estimates[self.estimator] = self.causal_model.estimate_effect(
282 self.identified_estimand,
283 method_name=self.estimator,
~\Transferwise\dowhy\dowhy\causal_model.py in estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
311 target_units)
312
--> 313 estimate = self.causal_estimator.estimate_effect()
314 # Store parameters inside estimate object for refutation methods
315 # TODO: This add_params needs to move to the estimator class
~\Transferwise\dowhy\dowhy\causal_estimator.py in estimate_effect(self)
178 :returns: A CausalEstimate instance that contains point estimates of average and conditional effects. Based on the parameters provided, it optionally includes confidence intervals, standard errors,statistical significance and other statistical parameters.
179 """
--> 180 est = self._estimate_effect()
181 est.add_estimator(self)
182
~\Transferwise\dowhy\dowhy\causal_estimators\econml.py in _estimate_effect(self)
86 if self.estimator is None:
87 estimator_class = self._get_econml_class_object(self._econml_methodname)
---> 88 self.estimator = estimator_class(**self.method_params["init_params"])
89 # Calling the econml estimator's fit method
90 estimator_argspec = inspect.getfullargspec(
TypeError: __init__() got an unexpected keyword argument 'min_child_weight'
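The traceback shows an init parameter (min_child_weight) reaching an estimator whose constructor does not accept it. One way to surface such mismatches before the estimator is built is to compare the params against the constructor signature. This is a sketch only: split_init_params and the stand-in class are hypothetical, and the proper fix may well be to correct the search space rather than to filter.

```python
import inspect
from typing import Any, Dict, Tuple


def split_init_params(
    estimator_class: type, params: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split params into those the constructor accepts and those it does not."""
    signature = inspect.signature(estimator_class)
    takes_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD
        for p in signature.parameters.values()
    )
    if takes_kwargs:
        # A **kwargs constructor accepts any keyword, so nothing can be rejected here
        return dict(params), {}
    accepted = {k: v for k, v in params.items() if k in signature.parameters}
    rejected = {k: v for k, v in params.items() if k not in signature.parameters}
    return accepted, rejected


class StandInForest:
    """Stand-in class used only to demonstrate the helper."""

    def __init__(self, n_estimators: int = 100, max_depth: int = 2):
        self.n_estimators = n_estimators
        self.max_depth = max_depth


accepted, rejected = split_init_params(
    StandInForest, {"n_estimators": 67, "min_child_weight": 1}
)
```

Logging or raising on a non-empty `rejected` dict would point directly at the offending search-space entry instead of failing deep inside the estimator call.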
For each estimator, we should add a few init params that correspond to a solution with low computational cost. See the following link for details: https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune
Describe the bug
The search space for the ForestDRLearner needs to be constrained so that n_estimators % subforest_size == 0.
Otherwise, we can end up with the following error:
FAILED tests/autocausality/test_endtoend.py::TestEndToEnd::test_endtoend - ValueError: The number of estimators to be constructed must be divisible the `subforest_size` parameter. Asked to build `n_estimators=354` with `subforest_size=4`.
Maybe something that @guidodee could address?
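One possible way to enforce the constraint is to adjust a sampled n_estimators to the next multiple of subforest_size before building the estimator. The helper below is a sketch with a hypothetical name, not the fix adopted in the repository.

```python
def align_n_estimators(n_estimators: int, subforest_size: int) -> int:
    """Round n_estimators up to the nearest multiple of subforest_size."""
    if subforest_size <= 0:
        raise ValueError("subforest_size must be a positive integer")
    remainder = n_estimators % subforest_size
    if remainder == 0:
        return n_estimators
    return n_estimators + (subforest_size - remainder)
```

For the failing case in the error message, align_n_estimators(354, 4) gives 356. An alternative is to sample an integer multiplier and set n_estimators to multiplier * subforest_size, which satisfies the constraint by construction.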
The default list should be 3-5 estimators that we deem a reasonable choice after trying all of them on several datasets. So estimator_list should accept an explicit list, as it does now, but also 'auto' (treated the same as None), which resolves to that smaller default list, and 'all', which resolves to all supported estimators.
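The resolution rule described above could look roughly like this. The function name and the contents of the two lists are placeholders chosen for illustration; which estimators belong in the default list is exactly what the issue leaves open.

```python
from typing import List, Optional, Sequence, Union

# Placeholder lists: the real contents are still to be decided
ALL_ESTIMATORS = [
    "TransformedOutcome",
    "DomainAdaptationLearner",
    "SLearner",
    "CausalForestDML",
]
DEFAULT_ESTIMATORS = ["SLearner", "CausalForestDML"]


def resolve_estimator_list(
    estimator_list: Optional[Union[str, Sequence[str]]],
) -> List[str]:
    """Resolve None/'auto'/'all' or an explicit list into estimator names."""
    if estimator_list is None or estimator_list == "auto":
        return list(DEFAULT_ESTIMATORS)
    if estimator_list == "all":
        return list(ALL_ESTIMATORS)
    if isinstance(estimator_list, str):
        raise ValueError(
            f"estimator_list must be 'auto', 'all', None, or a list; got {estimator_list!r}"
        )
    unknown = [name for name in estimator_list if name not in ALL_ESTIMATORS]
    if unknown:
        raise ValueError(f"Unknown estimators {unknown}; available: {ALL_ESTIMATORS}")
    return list(estimator_list)
```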
Easy to fix: lines 129 and 217 need to change; they currently read "min_samples_split": tune.randint(1, 50),. @guidodee, can you have a look?