
pymatch's People

Contributors

alephnotation, beespinosa, benmiroglio, cojabi, mc51


pymatch's Issues

Find matches from minority group for majority group

"By default matches are found from the majority group for the minority group."
I actually want Matcher to do the opposite; for each row in the majority group, I want to assign the best match from the minority group.

Is there an easy way to enable this without reworking the source code?
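
One hedged workaround sketch that side-steps match() entirely: after fitting and predicting the propensity scores, assign each majority row its nearest minority row by score. The 'group' and 'scores' column names are assumptions about what Matcher stores in m.data and should be verified against your version.

m = Matcher(test, control, yvar='group')
m.fit_scores(balance=True)
m.predict_scores()

scored = m.data
majority = scored[scored['group'] == 0]
minority = scored[scored['group'] == 1]

# for each majority row, the index of the closest minority row by propensity score
best_match = {
    idx: (minority['scores'] - s).abs().idxmin()
    for idx, s in majority['scores'].items()
}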

Can I match without replacement?

The matching process is done with replacement, so a single majority record can be matched to multiple minority records. Can I match without replacement, so that a single majority record can be matched to only one minority record?
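
There is no built-in flag for this as far as I know, but a hedged post-processing sketch: matching with replacement shows up as the same majority record_id appearing under several match groups, so you can walk the matched groups and keep each control only the first time it appears. This is approximate (some minority records may end up unmatched), and the 'record_id', 'match_id', and 'group' column names are assumptions about matched_data that should be checked.

import pandas as pd

used = set()
pairs = []
for match_id, grp in m.matched_data.groupby('match_id'):
    case = grp[grp['group'] == 1]
    ctrls = grp[(grp['group'] == 0) & (~grp['record_id'].isin(used))]
    if not ctrls.empty:
        ctrl = ctrls.iloc[[0]]          # first unused control in this group
        used.update(ctrl['record_id'])
        pairs.append(pd.concat([case, ctrl]))

no_replacement = pd.concat(pairs, ignore_index=True)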

RuntimeWarning appears when importing Matcher

See code below:

>>> from pymatch import Matcher
/home/colin/.virtualenvs/test/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
>>> 
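
For what it's worth, this warning comes from a compiled dependency built against a different numpy ABI; reinstalling or upgrading numpy together with scipy and statsmodels usually makes it go away. If you only want to silence it, a minimal sketch:

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
from pymatch import Matcher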

Error: Perfect separation detected, results not available

Hi,
I hit the error described in the title when invoking fit_scores(). My data structure is below:
(screenshot of the data structure)

I drew samples of 2000 for test and 20000 for control to fit the matcher, but I have no clue why this error occurs (I have looked into the source code). In addition, I ran the example code for loan.csv successfully, so I wonder whether the fields should be integers rather than strings. In fact, the data structure of the loan example contains strings as well; see below:
(screenshot of the loan example data structure)

Hope anyone can help, thanks!
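
As a hedged diagnostic sketch (not a pymatch feature): "perfect separation" means some covariate, or combination of covariates, predicts the group label exactly, so the logistic propensity model cannot converge. Cross-tabulating suspect fields against the group indicator can reveal the culprit; 'group' and the column list are placeholders for your data.

import pandas as pd

for col in ['field_a', 'field_b']:
    tab = pd.crosstab(df[col], df['group'])
    print(col)
    print(tab, '\n')   # a zero cell in one column hints at (quasi-)separation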

Meaning of `nmodels` unclear

I am sorry for asking, but after reading README.md and help() it is still unclear to me what the nmodels argument of fit_scores() means.

Can you explain it another way?

I get an IndexError no matter which value I set for nmodels or if I leave it empty.

    m.predict_scores()
  File "C:\Users\buhtzch\AppData\Roaming\Python\Python39\site-packages\pymatch\Matcher.py", line 140, in predict_scores
    m = self.models[i]
IndexError: list index out of range
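
A hedged sketch of the expected call order: predict_scores() reads from m.models (see the traceback above), and that list is only populated by fit_scores(), so one common cause of the IndexError is calling predict_scores() before a successful fit_scores(). With balance=True, nmodels should be the number of models fitted on balanced resamples.

m = Matcher(test, control, yvar='group')
m.fit_scores(balance=True, nmodels=10)   # must succeed before predicting
m.predict_scores()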

ModuleNotFoundError: No module named 'functions'


ModuleNotFoundError Traceback (most recent call last)
in ()
----> 1 from pymatch.Matcher import Matcher

~/anaconda3/lib/python3.6/site-packages/pymatch/__init__.py in ()
6 from collections import Counter
7 from itertools import chain
----> 8 import functions as uf
9 import statsmodels.api as sm
10 import patsy

ModuleNotFoundError: No module named 'functions'
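
A hedged workaround sketch: the failing line in the installed pymatch/__init__.py is a Python-2-style implicit relative import, which Python 3 does not resolve. Editing that line to an explicit import makes the package importable:

# pymatch/__init__.py, around line 8
# import functions as uf              # Python 2 style, fails on Python 3
from pymatch import functions as uf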

error in method _scores_to_accuracy() of Matcher.py

Got an error when I tried to run the example at:
m.fit_scores(balance=True, nmodels=10)
When the function calls the static method _scores_to_accuracy(), I got a size-mismatch error. In this function, y is a DataFrame with shape (n, 1), while preds is a list. I fixed it by converting y to a NumPy array before the comparison.

Original:

def _scores_to_accuracy(m, X, y):
    preds = [1.0 if i >= .5 else 0.0 for i in m.predict(X)]
    return (y == preds).sum() * 1.0 / len(y)

Fixed:

def _scores_to_accuracy(m, X, y):
    preds = [1.0 if i >= .5 else 0.0 for i in m.predict(X)]
    # return (y == preds).sum() * 1.0 / len(y)
    return (y.to_numpy().T == preds).sum() * 1.0 / len(y)

The fixed version above works for me.

Missing package dependencies

Hi, I just installed pymatch 0.3.4 via pip (inside a conda environment). Freshly installed, it could not be imported because the seaborn and statsmodels libraries were not installed automatically.

In fact it appears that pip (version 19.2.3) doesn't know about any of the dependencies:

$ pip show pymatch
Name: pymatch
Version: 0.3.4
Summary: Matching techniques for Observational Studies
Home-page: https://github.com/benmiroglio/pymatch
Author: Ben Miroglio
Author-email: [email protected]
License: UNKNOWN
Location: $HOME/local/anaconda3/envs/julia/lib/python3.6/site-packages
Requires: 
Required-by: 

Perhaps it's because setup.py uses the requires keyword rather than setuptools' install_requires? See https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-dependencies
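
For reference, a minimal setup.py sketch of the change the linked setuptools docs describe; the dependency list here is an assumption, not taken from the package:

from setuptools import setup, find_packages

setup(
    name='pymatch',
    version='0.3.4',
    packages=find_packages(),
    install_requires=['numpy', 'pandas', 'scipy', 'patsy',
                      'statsmodels', 'seaborn'],
)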

Exclusion Features - Shouldn't they be allowed to be NULL?

Shouldn't the exclusion column handling happen before removing rows with NaN?

I couldn't figure out why my sample size was being reduced so much until I stepped through the code and realized that it doesn't care about the exclusion feature list prior to dropping rows.

I generally perform isna checks on my matching features prior to creating the matcher object.

Thoughts?
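
A minimal pre-processing sketch along those lines, with placeholder column names: drop rows with NaN only in the columns actually used for matching, so NaN in excluded columns does not shrink the sample.

exclude = ['free_text_note', 'source_file']      # columns excluded from matching
model_cols = [c for c in df.columns if c not in exclude]
df_clean = df.dropna(subset=model_cols)
m = Matcher(test=df_clean[df_clean.group == 1],
            control=df_clean[df_clean.group == 0],
            yvar='group', exclude=exclude)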

compare_continuous doesn't work, and compare_categorical get all predictors

compare_continuous doesn't work; I got the error below.
KeyError: "None of [Index(['var', 'ks_before', 'ks_after', 'grouped_chisqr_before',\n 'grouped_chisqr_after', 'std_median_diff_before',\n 'std_median_diff_after', 'std_mean_diff_before', 'std_mean_diff_after'],\n dtype='object')] are in the [columns]"

But compare_categorical returns all predictors (see screenshot), even though they are continuous variables.

drop_static_cols actually drops a column and breaks fit_scores()

Within Matcher.fit_scores(), if uf.drop_static_cols() actually drops a column, then the following patsy.dmatrices call breaks: it still uses the full formula, so because df no longer contains the dropped static column, the formula argument has one extra X variable. A workaround sketch is below.
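
A hedged sketch of that workaround, mirroring Matcher's naming (the 'scores' exclusion is an assumption): rebuild the patsy formula from the columns that actually survive, so a dropped static column cannot leave a stale term behind.

import patsy

xvars = [c for c in df.columns if c not in (yvar, 'scores')]
formula = '{} ~ {}'.format(yvar, ' + '.join(xvars))
y_samp, X_samp = patsy.dmatrices(formula, data=df, return_type='dataframe')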

How to keep track of dropped data when matching?

According to the documentation, "If a record in the minority group has no suitable matches, it is dropped from the final matched dataset."

How can we alter the match function so we can keep track of what was dropped? Preferably in a dataset separate from the normal resulting matched-pairs dataset.
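
There is no built-in option that I know of, but a hedged bookkeeping sketch: recover the minority records that received no match by comparing record ids before and after matching. It assumes both m.data and m.matched_data carry the record_id that Matcher assigns, and that the minority group is coded 1 in a 'group' column (placeholder name).

matched_ids = set(m.matched_data['record_id'])
dropped = m.data[(m.data['group'] == 1) &
                 (~m.data['record_id'].isin(matched_ids))]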

Error: Unable to coerce to Series

When calling fit_scores, I got the error "Unable to coerce to Series". I think the problem is in the _scores_to_accuracy function: the y object is a pandas DataFrame, and it can't be directly compared to the list preds. y.values would work:

return (y.values == preds).sum() * 1.0 / len(y)

instead of the original:

return (y == preds).sum() * 1.0 / len(y)

'threshold' parameter in match not used when method='min'

Hi,

For the function matcher.match, I'm wondering if there is a reason why the threshold parameter is only used for method == 'random'? See below:

if method == 'random':
    bool_match = abs(ctrl_scores - score) <= threshold
    matches = ctrl_scores.loc[bool_match[bool_match.scores].index]
elif method == 'min':
    matches = abs(ctrl_scores - score).sort_values('scores').head(nmatches)
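
A hedged sketch of how the 'min' branch could honour the threshold as well: take the nmatches closest controls first, then keep only those within the threshold (the variable names are the ones already defined in the snippet above).

matches = abs(ctrl_scores - score).sort_values('scores').head(nmatches)
matches = matches[matches['scores'] <= threshold]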

m.fit_scores

I got errors when calling m.fit_scores(balance = True, nmodels = 10)

Static column dropped: yr_nbr
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/builtins.py", line 87, in Q
    return env.namespace[name]
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 51, in __getitem__
    raise KeyError(key)
KeyError: 'purchase_yr_nbr'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/compat.py", line 36, in call_and_wrap_exc
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 166, in eval
    + self._namespaces))
  File "<string>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/builtins.py", line 89, in Q
    raise NameError("no data named %r found" % (name,))
NameError: no data named 'purchase_yr_nbr' found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tracy/Downloads/pymatch-master/pymatch/Matcher.py", line 109, in fit_scores
    y_samp, X_samp = patsy.dmatrices(self.formula, data=df, return_type='dataframe')
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 696, in design_matrix_builders
    NA_action)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 443, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
    data)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
    inner_namespace=inner_namespace)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
    exec("raise new_exc from e")
  File "<string>", line 1, in <module>
patsy.PatsyError: Error evaluating factor: NameError: no data named 'purchase_yr_nbr' found
    Q('adas') ~ Q('car_age_month')+Q('day_mileage')+Q('engn_size')+Q('est_hh_incm_prmr_cd')+Q('gmqualty_model')+Q('input_indiv_gndr_prmr_cd')+Q('latitude1')+Q('longitude1')+Q('purchase_lat1')+Q('purchase_lng1')+Q('purchase_mth_nbr')+Q('purchase_yr_nbr')+Q('purchaser_age_at_tm_of_purch')+Q('umf_xref_finc_gbl_trim')

builtin error in ast.py

When attempting:

m = Matcher(test, control, yvar="loan_status", exclude=[])

from the example @ https://github.com/benmiroglio/pymatch/blob/master/Example.ipynb

I ran into a syntax error in the Python built-in ast.py module.

The error was that my variable column names contained white space.

I resolved this error by simply

df.columns = df.columns.str.replace(' ', '_')

and then the ast.py did not throw a syntax error.

Not sure if this is something you would want to put a check in for or not, but it really threw me for a loop getting a syntax error in a base python package.

Attribute Error

AttributeError. Not sure why there is an error; the sample size is over 9k.

 record_freqs = self.matched_data.groupby("record_id")\

AttributeError: 'list' object has no attribute 'groupby'
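
A hedged note with a sketch: matched_data is initialized as an empty list in Matcher.__init__ (visible in another traceback in these issues) and only becomes a DataFrame after match() runs, so the groupby has to come after a successful fit/predict/match sequence:

m.fit_scores(balance=True)
m.predict_scores()
m.match(method='min', nmatches=1, threshold=0.001)
record_freqs = m.matched_data.groupby('record_id').size()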

Divide a data set based on PS

The example in the README.md is a bit overwhelming for me.

I cannot see where a data set is divided based on the PS. In some cases you use the PS to divide a dataset (with the control and treatment groups together). The result is two datasets, one for control and one for treatment, both of the same length.

I cannot see this in your example.

list index out of range

Fitting Models on Balanced Samples: 1\300Error: Unable to coerce to Series, length must be 1: given 52
Fitting Models on Balanced Samples: 1\300Error: Unable to coerce to Series, length must be 1: given 52
Fitting Models on Balanced Samples: 1\300Error: Unable to coerce to Series, length must be 1: given 52
Fitting Models on Balanced Samples: 1\300Error: Unable to coerce to Series, length must be 1: given 52
Average Accuracy: nan%

I don't know what this means.
n majority: 7619
n minority: 26

Can I create age-matched control groups with this package?

I have a list of cases and a list of controls with their respective age.

Now I want to match one control to each case with a maximum age difference of 3. The goal is to get as many matches as possible, the age difference doesn't need to be optimized.

Is this possible with pymatch?

I've tried the following (just an example, not the real data):

from pymatch.Matcher import Matcher
import pandas as pd

cases_ages = [23, 21, 26, 25, 23, 44, 24, 22, 46, 26]
controls_ages = [34, 30, 24, 25, 25, 27, 30, 33, 53, 27, 26, 28, 23, 23, 28, 23, 24, 22, 23, 25]
cases_group = [1 for _ in range(len(cases_ages))]
controls_group = [0 for _ in range(len(controls_ages))]  # use the controls' length here

df_cases = pd.DataFrame(list(zip(cases_ages, cases_group )), columns=['age', 'group'])
df_controls = pd.DataFrame(list(zip(controls_ages, controls_group )), columns=['age', 'group'])

m = Matcher(df_cases , df_controls , yvar='group')
m.fit_scores(balance=True, nmodels=100)
m.match(method='min', nmatches=1, threshold=0.0005)

print(m.matched_data)

I'd like to have a 1:1 mapping like that, with as many matches as possible, but without replacement (i.e., every case has exactly one control).

# match case id : control id
{0:2, 1:12, 2:3, 3:4, 4:4, ...}

However, pymatch matches with replacement (a fix is attempted in #20), and even then the matching is not optimized for the number of matches.
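
pymatch doesn't provide this directly, but for a pure age-difference criterion a hedged, pymatch-free sketch of greedy 1:1 matching without replacement may be enough. Greedy matching is simple but not guaranteed to maximize the number of matches; for that you would need an optimal assignment (e.g. bipartite matching on an age-difference cost matrix).

import pandas as pd

def greedy_age_match(cases, controls, max_diff=3):
    """Match each case to its closest unused control within max_diff years."""
    available = controls.copy()
    pairs = {}
    for case_id, case_age in cases['age'].items():
        if available.empty:
            break
        diffs = (available['age'] - case_age).abs()
        best = diffs.idxmin()
        if diffs[best] <= max_diff:
            pairs[case_id] = best
            available = available.drop(best)   # no replacement
    return pairs

# matches = greedy_age_match(df_cases, df_controls)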

errors for continuous_results

I encountered the error below when running continuous_results. The package works for categorical_results. Why is this happening?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tracy/Downloads/pymatch-master/pymatch/Matcher.py", line 313, in compare_continuous
    # before/after stats
  File "/Users/tracy/Downloads/pymatch-master/pymatch/functions.py", line 79, in grouped_permutation_test
    truth = f(t, c)
  File "/Users/tracy/Downloads/pymatch-master/pymatch/functions.py", line 54, in chi2_distance
    tb, cb = bin_hist(tb, cb, bins)
  File "/Users/tracy/Downloads/pymatch-master/pymatch/functions.py", line 75, in bin_hist
    return idx_to_value(tc, bins), idx_to_value(cc, bins)
  File "/Users/tracy/Downloads/pymatch-master/pymatch/functions.py", line 72, in idx_to_value
    result[int(bins[k-1])] = v
IndexError: index -1 is out of bounds for axis 0 with size 0

Get the Fit of the model itself

Does anyone have a clean way to extract a summary of the model's behavior from this? It doesn't seem like there is a call for it from .fit().
glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())
res = glm_binom.fit()
print(res.summary())
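
A hedged sketch building on the IndexError traceback elsewhere in these issues, which suggests the fitted models are kept in m.models: if the stored objects are fitted statsmodels results, their summaries can be printed directly. Treat the attribute name as an observation from that traceback, not documented API.

for i, res in enumerate(m.models):
    print('--- model {} ---'.format(i))
    print(res.summary())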

Allow spaces inside column names

My data frame contains columns such as Current Age or Subject Type, so the column names contain spaces. When I initialized an instance of the Matcher Class I got:

SyntaxError: invalid syntax

It took me a while to understand that my column names are not allowed to have spaces in them. However, I prefer them to 'look pretty', because eventually you want to plot your data, and then it is nice when you don't have to edit the matplotlib or seaborn plots to change the x- or y-axis titles.

Please either allow for spaces inside column names or update the documentation so that it is clear, that column names must not contain spaces.
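
Until that changes, a minimal workaround sketch (the same idea as in the ast.py issue above): strip the spaces before matching and restore the pretty names afterwards for plotting.

pretty = dict(zip(df.columns.str.replace(' ', '_'), df.columns))
df.columns = df.columns.str.replace(' ', '_')
# ... build the Matcher and run the matching ...
# plot_ready = m.matched_data.rename(columns=pretty)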

Matcher can not use spark data frame

When I use Matcher, the test and control parameters need to be pd.DataFrame, but when I query with Hive my data is a Spark DataFrame, and it is too big to convert with toPandas().
How can I use Matcher when my data is a Spark DataFrame?
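
pymatch only accepts pandas DataFrames, so one hedged workaround sketch is to down-sample the (usually much larger) control group in Spark before converting; the fraction and variable names are placeholders.

test_pd = spark_test.toPandas()                                    # minority group, small
control_pd = spark_control.sample(fraction=0.05, seed=42).toPandas()
m = Matcher(test_pd, control_pd, yvar='group')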

Error in Example.ipynb "ValueError: negative dimensions are not allowed"

While following the tutorial, I hit the error below, which seems to be caused by: https://stackoverflow.com/a/19939003.

  • Installation:
conda create -q --yes -n pymatch python=3.6 pip
conda activate pymatch
pip install statsmodels seaborn scipy pymatch ipython[all] autopep8 
  • Error in line m = Matcher(test, control, yvar="loan_status", exclude=[]):
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-7c4230cfce4f> in <module>
----> 1 m = Matcher(test, control, yvar="loan_status", exclude=[])

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/pymatch/Matcher.py in __init__(self, test, control, yvar, formula, exclude)
     47         self.matched_data = []
     48         self.y, self.X = patsy.dmatrices('{} ~ {}'.format(yvar, '+'.join(self.xvars)), data=self.data,
---> 49                                          return_type='dataframe')
     50         self.xvars = [i for i in self.data.columns if i not in self.exclude]
     51         self.test= self.data[self.data[yvar] == True]

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    308     eval_env = EvalEnvironment.capture(eval_env, reference=1)
    309     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 310                                       NA_action, return_type)
    311     if lhs.shape[1] == 0:
    312         raise PatsyError("model is missing required outcome variables")

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    163         return iter([data])
    164     design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165                                       NA_action)
    166     if design_infos is not None:
    167         return build_design_matrices(design_infos, data,

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
     68                                       data_iter_maker,
     69                                       eval_env,
---> 70                                       NA_action)
     71     else:
     72         return None

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/build.py in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
    719         term_to_subterm_infos = _make_subterm_infos(termlist,
    720                                                     num_column_counts,
--> 721                                                     cat_levels_contrasts)
    722         assert isinstance(term_to_subterm_infos, OrderedDict)
    723         assert frozenset(term_to_subterm_infos) == frozenset(termlist)

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/build.py in _make_subterm_infos(terms, num_column_counts, cat_levels_contrasts)
    626                         coded = code_contrast_matrix(factor_coding[factor],
    627                                                      levels, contrast,
--> 628                                                      default=Treatment)
    629                         contrast_matrices[factor] = coded
    630                         subterm_columns *= coded.matrix.shape[1]

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/contrasts.py in code_contrast_matrix(intercept, levels, contrast, default)
    600         return contrast.code_with_intercept(levels)
    601     else:
--> 602         return contrast.code_without_intercept(levels)
    603 

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/contrasts.py in code_without_intercept(self, levels)
    181         else:
    182             reference = _get_level(levels, self.reference)
--> 183         eye = np.eye(len(levels) - 1)
    184         contrasts = np.vstack((eye[:reference, :],
    185                                 np.zeros((1, len(levels) - 1)),

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/numpy/lib/twodim_base.py in eye(N, M, k, dtype, order)
    199     if M is None:
    200         M = N
--> 201     m = zeros((N, M), dtype=dtype, order=order)
    202     if k >= M:
    203         return m

ValueError: negative dimensions are not allowed

Weights

In explaining to an audience who would look at the covariates in my matching before I do the ATE, would it be possible to get the weights of each covariate? Then, when an audience sees GENDER differing between test and control but matched, they don't question the entire model. Thanks.
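
Not an official feature as far as I know, but a hedged sketch: if the fitted statsmodels results live in m.models (as the tracebacks elsewhere suggest), averaging their coefficients gives a rough picture of how much each covariate contributes to the propensity score.

import pandas as pd

coefs = pd.concat([res.params for res in m.models], axis=1)
print(coefs.mean(axis=1).sort_values())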

Float and Int - Treated as Categorical?

I am trying to use this package on a dataset that contains both categorical features and numerical features. For some reason, when I run the line below it returns all features as if it were treating them all as categorical.

categorical_results = m.compare_categorical(return_table=True)

Running the line below returns an error, which I assume means I don't have any continuous features:

cc = m.compare_continuous(return_table=True)

KeyError: "['var' 'ks_before' 'ks_after' 'grouped_chisqr_before'\n 'grouped_chisqr_after' 'std_median_diff_before' 'std_median_diff_after'\n 'std_mean_diff_before' 'std_mean_diff_after'] not in index"

I haven't used this package in a while, but I had no issues with this in the past. Anyone run into this? I've checked my dataframe columns throughout each step and they appear to remain of dtype 'int64' or 'float64'. Checking the generated dataframe columns from 'matched_data' shows the same dtypes also.
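
A quick, hedged sanity-check sketch for narrowing this down: confirm which columns pandas (and therefore patsy) will see as numeric before building the Matcher.

print(df.dtypes)
print(df.select_dtypes(include='number').columns.tolist())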

About threshold parameter of match function

Thank you for developing this great module.
I'm not sure whether this is intended, but the threshold parameter of the match function is not used when 'min' is set as the value for the method parameter (line 188 of the code below).
https://github.com/benmiroglio/pymatch/blob/master/pymatch/Matcher.py
If 'random' is set for the method parameter, the threshold parameter works.
I hope this comment is useful for improving this great module.

Next steps after matching

I'm curious what you do after matching is complete.
Usually one is interested in the causal effect of the treatment on the outcome variable, so it's logical to run a regression to come up with coefficients (which are log odds). But what should we run this regression on? The new dataset that pymatch returns, the one with the extra match_id column?
I guess this is briefly addressed in the Match Data section of the manual. Can someone elaborate on what to do with the weight vector in the subsequent regression?
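
A hedged sketch of one common approach: run the outcome regression on m.matched_data, feeding the weight column as frequency weights. All column names here ('outcome', 'treatment', 'age', 'income', 'weight') are placeholders; check the Match Data section for the actual weight column produced by your pymatch version.

import statsmodels.api as sm
import statsmodels.formula.api as smf

fit = smf.glm(
    'outcome ~ treatment + age + income',
    data=m.matched_data,
    family=sm.families.Binomial(),
    freq_weights=m.matched_data['weight'],
).fit()
print(fit.summary())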

numpy.arange in py.pymatch.functions.which_bins_hist

For negative covariates, the 99th percentile could be a negative number x. Then np.arange(x, step=10) fails because the default start value is zero, which is larger than the stop. Here is a fix:

np.arange(start=min(comb), stop=np.percentile(comb, 99), step=10)

record_id

Is there any way I can substitute the record_id with my project-specific id? Please help.
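
A hedged sketch of one way around it: keep your own id as an excluded column so it is not used for matching, then read it straight off the matched rows. 'project_id' is a placeholder, and whether excluded columns survive into matched_data should be verified against your pymatch version.

m = Matcher(test, control, yvar='group', exclude=['project_id'])
m.fit_scores(balance=True)
m.match(method='min', nmatches=1, threshold=0.001)
print(m.matched_data[['record_id', 'project_id', 'match_id']].head())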

Not working with Pandas 0.24

Hi Guys!
pymatch stopped working after the pandas update, but I managed to fix it. I also made some minor enhancements. Please let me know how I can send you the files so you can check them and merge them into the next release.

Best Regards!

m.compare_categorical()

I get an error claiming "['var' 'before' 'after'] not in index" when running compare_categorical().
