
pymatch's People

Contributors

alephnotation, beespinosa, benmiroglio, cojabi, mc51


pymatch's Issues

Find matches from minority group for majority group

"By default matches are found from the majority group for the minority group."
I actually want Matcher to do the opposite; for each row in the majority group, I want to assign the best match from the minority group.

Is there an easy way to enable this without reworking the source code?
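
One hedged workaround sketch that side-steps match() entirely: after fitting and predicting the propensity scores, assign each majority row its nearest minority row by score. The 'group' and 'scores' column names are assumptions about what Matcher stores in m.data and should be verified against your version.

m = Matcher(test, control, yvar='group')
m.fit_scores(balance=True)
m.predict_scores()

scored = m.data
majority = scored[scored['group'] == 0]
minority = scored[scored['group'] == 1]

# for each majority row, the index of the closest minority row by propensity score
best_match = {
    idx: (minority['scores'] - s).abs().idxmin()
    for idx, s in majority['scores'].items()
}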

Can I match without replacement?

The matching process is done with replacement, so a single majority record can be matched to multiple minority records. Can I match without replacement, so that a single majority record can be matched to only one minority record?
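
There is no built-in flag for this as far as I know, but a hedged post-processing sketch: matching with replacement shows up as the same majority record_id appearing under several match groups, so you can walk the matched groups and keep each control only the first time it appears. This is approximate (some minority records may end up unmatched), and the 'record_id', 'match_id', and 'group' column names are assumptions about matched_data that should be checked.

import pandas as pd

used = set()
pairs = []
for match_id, grp in m.matched_data.groupby('match_id'):
    case = grp[grp['group'] == 1]
    ctrls = grp[(grp['group'] == 0) & (~grp['record_id'].isin(used))]
    if not ctrls.empty:
        ctrl = ctrls.iloc[[0]]          # first unused control in this group
        used.update(ctrl['record_id'])
        pairs.append(pd.concat([case, ctrl]))

no_replacement = pd.concat(pairs, ignore_index=True)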

RuntimeWarning appears when importing Matcher

See code below:

>>> from pymatch import Matcher
/home/colin/.virtualenvs/test/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
>>> 
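
For what it's worth, this warning comes from a compiled dependency built against a different numpy ABI; reinstalling or upgrading numpy together with scipy and statsmodels usually makes it go away. If you only want to silence it, a minimal sketch:

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
from pymatch import Matcher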

Error: Perfect separation detected, results not available

Hi,
I hit the error described in the title when invoking fit_scores(). My data structure is below:
(screenshot of the data structure)

I drew samples of 2000 for test and 20000 for control to fit the matcher, but I have no clue why this error occurs (I have looked into the source code). In addition, I ran the example code for loan.csv successfully, so I wonder whether the fields should be integers rather than strings. In fact, the data structure of the loan example contains strings as well; see below:
(screenshot of the loan example data structure)

Hope anyone can help, thanks!
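
As a hedged diagnostic sketch (not a pymatch feature): "perfect separation" means some covariate, or combination of covariates, predicts the group label exactly, so the logistic propensity model cannot converge. Cross-tabulating suspect fields against the group indicator can reveal the culprit; 'group' and the column list are placeholders for your data.

import pandas as pd

for col in ['field_a', 'field_b']:
    tab = pd.crosstab(df[col], df['group'])
    print(col)
    print(tab, '\n')   # a zero cell in one column hints at (quasi-)separation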

Meaning of `nmodels` unclear

I am sorry for asking, but after reading README.md and help() it is still unclear to me what the nmodels argument of fit_scores() means.

Can you explain it another way?

I get an IndexError no matter which value I set for nmodels or if I leave it empty.

    m.predict_scores()
  File "C:\Users\buhtzch\AppData\Roaming\Python\Python39\site-packages\pymatch\Matcher.py", line 140, in predict_scores
    m = self.models[i]
IndexError: list index out of range
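
A hedged sketch of the expected call order: predict_scores() reads from m.models (see the traceback above), and that list is only populated by fit_scores(), so one common cause of the IndexError is calling predict_scores() before a successful fit_scores(). With balance=True, nmodels should be the number of models fitted on balanced resamples.

m = Matcher(test, control, yvar='group')
m.fit_scores(balance=True, nmodels=10)   # must succeed before predicting
m.predict_scores()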

ModuleNotFoundError: No module named 'functions'


ModuleNotFoundError Traceback (most recent call last)
in ()
----> 1 from pymatch.Matcher import Matcher

~/anaconda3/lib/python3.6/site-packages/pymatch/__init__.py in ()
6 from collections import Counter
7 from itertools import chain
----> 8 import functions as uf
9 import statsmodels.api as sm
10 import patsy

ModuleNotFoundError: No module named 'functions'
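
A hedged workaround sketch: the failing line in the installed pymatch/__init__.py is a Python-2-style implicit relative import, which Python 3 does not resolve. Editing that line to an explicit import makes the package importable:

# pymatch/__init__.py, around line 8
# import functions as uf              # Python 2 style, fails on Python 3
from pymatch import functions as uf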

error in method _scores_to_accuracy() of Matcher.py

Got an error when I tried to run the example at:
m.fit_scores(balance=True, nmodels=10)
When the function calls the static method _scores_to_accuracy(), I got a size-mismatch error. In this function, y is a DataFrame with shape (n, 1), while preds is a list. I fixed it by converting y to a NumPy array before the comparison.

Original:

def _scores_to_accuracy(m, X, y):
    preds = [1.0 if i >= .5 else 0.0 for i in m.predict(X)]
    return (y == preds).sum() * 1.0 / len(y)

Fixed:

def _scores_to_accuracy(m, X, y):
    preds = [1.0 if i >= .5 else 0.0 for i in m.predict(X)]
    # return (y == preds).sum() * 1.0 / len(y)
    return (y.to_numpy().T == preds).sum() * 1.0 / len(y)

The fixed version above works for me.

Missing package dependencies

Hi, I just installed pymatch 0.3.4 via pip (inside a conda environment). Freshly installed, it could not be imported because the seaborn and statsmodels libraries were not installed automatically.

In fact it appears that pip (version 19.2.3) doesn't know about any of the dependencies:

$ pip show pymatch
Name: pymatch
Version: 0.3.4
Summary: Matching techniques for Observational Studies
Home-page: https://github.com/benmiroglio/pymatch
Author: Ben Miroglio
Author-email: [email protected]
License: UNKNOWN
Location: $HOME/local/anaconda3/envs/julia/lib/python3.6/site-packages
Requires: 
Required-by: 

Perhaps it's because setup.py uses the requires keyword rather than setuptools' install_requires? See https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-dependencies
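
For reference, a minimal setup.py sketch of the change the linked setuptools docs describe; the dependency list here is an assumption, not taken from the package:

from setuptools import setup, find_packages

setup(
    name='pymatch',
    version='0.3.4',
    packages=find_packages(),
    install_requires=['numpy', 'pandas', 'scipy', 'patsy',
                      'statsmodels', 'seaborn'],
)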

Exclusion Features - Shouldn't they be allowed to be NULL?

Shouldn't the exclusion column handling happen before removing rows with NaN?

I couldn't figure out why my sample size was being reduced so much until I stepped through the code and realized that it doesn't care about the exclusion feature list prior to dropping rows.

I generally perform isna checks on my matching features prior to creating the matcher object.

Thoughts?
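
A minimal pre-processing sketch along those lines, with placeholder column names: drop rows with NaN only in the columns actually used for matching, so NaN in excluded columns does not shrink the sample.

exclude = ['free_text_note', 'source_file']      # columns excluded from matching
model_cols = [c for c in df.columns if c not in exclude]
df_clean = df.dropna(subset=model_cols)
m = Matcher(test=df_clean[df_clean.group == 1],
            control=df_clean[df_clean.group == 0],
            yvar='group', exclude=exclude)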

compare_continuous doesn't work, and compare_categorical get all predictors

compare_continuous doesn't work; I got the error below.
KeyError: "None of [Index(['var', 'ks_before', 'ks_after', 'grouped_chisqr_before',\n 'grouped_chisqr_after', 'std_median_diff_before',\n 'std_median_diff_after', 'std_mean_diff_before', 'std_mean_diff_after'],\n dtype='object')] are in the [columns]"

But compare_categorical returns all predictors (see screenshot), even though they are continuous variables.

drop_static_cols actually drops a column and breaks fit_scores()

Within Matcher.fit_scores(), if uf.drop_static_cols() actually drops a column, then the following patsy.dmatrices call breaks: it still uses the full formula, so because df no longer contains the dropped static column, the formula argument has one extra X variable. A workaround sketch is below.
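
A hedged sketch of that workaround, mirroring Matcher's naming (the 'scores' exclusion is an assumption): rebuild the patsy formula from the columns that actually survive, so a dropped static column cannot leave a stale term behind.

import patsy

xvars = [c for c in df.columns if c not in (yvar, 'scores')]
formula = '{} ~ {}'.format(yvar, ' + '.join(xvars))
y_samp, X_samp = patsy.dmatrices(formula, data=df, return_type='dataframe')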

How to keep track of dropped data when matching?

According to the documentation, "If a record in the minority group has no suitable matches, it is dropped from the final matched dataset."

How can we alter the match function so we can keep track of what was dropped? Preferably in a dataset separate from the normal resulting matched-pairs dataset.
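
There is no built-in option that I know of, but a hedged bookkeeping sketch: recover the minority records that received no match by comparing record ids before and after matching. It assumes both m.data and m.matched_data carry the record_id that Matcher assigns, and that the minority group is coded 1 in a 'group' column (placeholder name).

matched_ids = set(m.matched_data['record_id'])
dropped = m.data[(m.data['group'] == 1) &
                 (~m.data['record_id'].isin(matched_ids))]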

Error: Unable to coerce to Series

When calling fit_scores, I got the error "Unable to coerce to Series". I think the problem is in the _scores_to_accuracy function: the y object is a pandas DataFrame, and it can't be directly compared to the list preds. y.values would work:

return (y.values == preds).sum() * 1.0 / len(y)

instead of the original:

return (y == preds).sum() * 1.0 / len(y)

'threshold' parameter in match not used when method='min'

Hi,

For the function matcher.match, I'm wondering if there is a reason why the threshold parameter is only used for method == 'random'? See below:

if method == 'random':
    bool_match = abs(ctrl_scores - score) <= threshold
    matches = ctrl_scores.loc[bool_match[bool_match.scores].index]
elif method == 'min':
    matches = abs(ctrl_scores - score).sort_values('scores').head(nmatches)
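
A hedged sketch of how the 'min' branch could honour the threshold as well: take the nmatches closest controls first, then keep only those within the threshold (the variable names are the ones already defined in the snippet above).

matches = abs(ctrl_scores - score).sort_values('scores').head(nmatches)
matches = matches[matches['scores'] <= threshold]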

m.fit_scores

I got errors when calling m.fit_scores(balance = True, nmodels = 10)

Static column dropped: yr_nbr
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/builtins.py", line 87, in Q
    return env.namespace[name]
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 51, in __getitem__
    raise KeyError(key)
KeyError: 'purchase_yr_nbr'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/compat.py", line 36, in call_and_wrap_exc
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 166, in eval
    + self._namespaces))
  File "<string>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/builtins.py", line 89, in Q
    raise NameError("no data named %r found" % (name,))
NameError: no data named 'purchase_yr_nbr' found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tracy/Downloads/pymatch-master/pymatch/Matcher.py", line 109, in fit_scores
    y_samp, X_samp = patsy.dmatrices(self.formula, data=df, return_type='dataframe')
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 696, in design_matrix_builders
    NA_action)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 443, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
    data)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
    inner_namespace=inner_namespace)
  File "/opt/anaconda3/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
    exec("raise new_exc from e")
  File "<string>", line 1, in <module>
patsy.PatsyError: Error evaluating factor: NameError: no data named 'purchase_yr_nbr' found
    Q('adas') ~ Q('car_age_month')+Q('day_mileage')+Q('engn_size')+Q('est_hh_incm_prmr_cd')+Q('gmqualty_model')+Q('input_indiv_gndr_prmr_cd')+Q('latitude1')+Q('longitude1')+Q('purchase_lat1')+Q('purchase_lng1')+Q('purchase_mth_nbr')+Q('purchase_yr_nbr')+Q('purchaser_age_at_tm_of_purch')+Q('umf_xref_finc_gbl_trim')

builtin error in ast.py

When attempting:

m = Matcher(test, control, yvar="loan_status", exclude=[])

from the example @ https://github.com/benmiroglio/pymatch/blob/master/Example.ipynb

I ran into a syntax error in the Python built-in ast.py module.

The error was that my variable column names contained white space.

I resolved this error by simply

df.columns = df.columns.str.replace(' ', '_')

and then the ast.py did not throw a syntax error.

Not sure if this is something you would want to put a check in for or not, but it really threw me for a loop getting a syntax error in a base python package.

Attribute Error

AttributeError. Not sure why there is an error; the sample size is over 9k.

 record_freqs = self.matched_data.groupby("record_id")\

AttributeError: 'list' object has no attribute 'groupby'
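
A hedged note with a sketch: matched_data is initialized as an empty list in Matcher.__init__ (visible in another traceback in these issues) and only becomes a DataFrame after match() runs, so the groupby has to come after a successful fit/predict/match sequence:

m.fit_scores(balance=True)
m.predict_scores()
m.match(method='min', nmatches=1, threshold=0.001)
record_freqs = m.matched_data.groupby('record_id').size()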

Divide a data set based on PS

The example in the README.md is a bit overwhelming for me.

I cannot see where a data set is divided based on the PS. In some cases you use the PS to divide a dataset (with the control and treatment groups together). The result is two datasets, one for control and one for treatment, both of the same length.

I cannot see this in your example.

list index out of range

Fitting Models on Balanced Samples: 1\300Error: Unable to coerce to Series, length must be 1: given 52
Fitting Models on Balanced Samples: 1\300Error: Unable to coerce to Series, length must be 1: given 52
Fitting Models on Balanced Samples: 1\300Error: Unable to coerce to Series, length must be 1: given 52
Fitting Models on Balanced Samples: 1\300Error: Unable to coerce to Series, length must be 1: given 52
Average Accuracy: nan%

I don't know what this means.
n majority: 7619
n minority: 26

Can I create age-matched control groups with this package?

I have a list of cases and a list of controls with their respective age.

Now I want to match one control to each case with a maximum age difference of 3. The goal is to get as many matches as possible, the age difference doesn't need to be optimized.

Is this possible with pymatch?

I've tried the following (just an example, not the real data):

from pymatch.Matcher import Matcher
import pandas as pd

cases_ages = [23, 21, 26, 25, 23, 44, 24, 22, 46, 26]
controls_ages = [34, 30, 24, 25, 25, 27, 30, 33, 53, 27, 26, 28, 23, 23, 28, 23, 24, 22, 23, 25]
cases_group = [1 for _ in range(len(cases_ages))]
controls_group = [0 for _ in range(len(controls_ages))]  # use the controls' length here

df_cases = pd.DataFrame(list(zip(cases_ages, cases_group )), columns=['age', 'group'])
df_controls = pd.DataFrame(list(zip(controls_ages, controls_group )), columns=['age', 'group'])

m = Matcher(df_cases , df_controls , yvar='group')
m.fit_scores(balance=True, nmodels=100)
m.match(method='min', nmatches=1, threshold=0.0005)

print(m.matched_data)

I'd like to have a 1:1 mapping like that, with as many matches as possible, but without replacement (i.e., every case has exactly one control).

# match case id : control id
{0:2, 1:12, 2:3, 3:4, 4:4, ...}

However, pymatch matches with replacement (a fix is attempted in #20), and even then the matching is not optimized for the number of matches.
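
pymatch doesn't provide this directly, but for a pure age-difference criterion a hedged, pymatch-free sketch of greedy 1:1 matching without replacement may be enough. Greedy matching is simple but not guaranteed to maximize the number of matches; for that you would need an optimal assignment (e.g. bipartite matching on an age-difference cost matrix).

import pandas as pd

def greedy_age_match(cases, controls, max_diff=3):
    """Match each case to its closest unused control within max_diff years."""
    available = controls.copy()
    pairs = {}
    for case_id, case_age in cases['age'].items():
        if available.empty:
            break
        diffs = (available['age'] - case_age).abs()
        best = diffs.idxmin()
        if diffs[best] <= max_diff:
            pairs[case_id] = best
            available = available.drop(best)   # no replacement
    return pairs

# matches = greedy_age_match(df_cases, df_controls)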

errors for continuous_results

I encountered the error below when running continuous_results. The package works for categorical_results. Why is this happening?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tracy/Downloads/pymatch-master/pymatch/Matcher.py", line 313, in compare_continuous
    # before/after stats
  File "/Users/tracy/Downloads/pymatch-master/pymatch/functions.py", line 79, in grouped_permutation_test
    truth = f(t, c)
  File "/Users/tracy/Downloads/pymatch-master/pymatch/functions.py", line 54, in chi2_distance
    tb, cb = bin_hist(tb, cb, bins)
  File "/Users/tracy/Downloads/pymatch-master/pymatch/functions.py", line 75, in bin_hist
    return idx_to_value(tc, bins), idx_to_value(cc, bins)
  File "/Users/tracy/Downloads/pymatch-master/pymatch/functions.py", line 72, in idx_to_value
    result[int(bins[k-1])] = v
IndexError: index -1 is out of bounds for axis 0 with size 0

Get the Fit of the model itself

Does anyone have a clean way to extract a summary of the model's behavior from this? It doesn't seem like there is a call for it from .fit().
glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())
res = glm_binom.fit()
print(res.summary())
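
A hedged sketch building on the IndexError traceback elsewhere in these issues, which suggests the fitted models are kept in m.models: if the stored objects are fitted statsmodels results, their summaries can be printed directly. Treat the attribute name as an observation from that traceback, not documented API.

for i, res in enumerate(m.models):
    print('--- model {} ---'.format(i))
    print(res.summary())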

Allow spaces inside column names

My data frame contains columns such as Current Age or Subject Type, so the column names contain spaces. When I initialized an instance of the Matcher Class I got:

SyntaxError: invalid syntax

It took me a while to understand that my column names are not allowed to have spaces in them. However, I prefer them to 'look pretty', because eventually you want to plot your data, and then it is nice when you don't have to edit the matplotlib or seaborn plots to change the x- or y-axis titles.

Please either allow for spaces inside column names or update the documentation so that it is clear, that column names must not contain spaces.
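
Until that changes, a minimal workaround sketch (the same idea as in the ast.py issue above): strip the spaces before matching and restore the pretty names afterwards for plotting.

pretty = dict(zip(df.columns.str.replace(' ', '_'), df.columns))
df.columns = df.columns.str.replace(' ', '_')
# ... build the Matcher and run the matching ...
# plot_ready = m.matched_data.rename(columns=pretty)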

Matcher can not use spark data frame

When I use Matcher, the test and control parameters need to be pd.DataFrame, but when I query with Hive my data is a Spark DataFrame, and it is too big to convert with toPandas().
How can I use Matcher when my data is a Spark DataFrame?
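
pymatch only accepts pandas DataFrames, so one hedged workaround sketch is to down-sample the (usually much larger) control group in Spark before converting; the fraction and variable names are placeholders.

test_pd = spark_test.toPandas()                                    # minority group, small
control_pd = spark_control.sample(fraction=0.05, seed=42).toPandas()
m = Matcher(test_pd, control_pd, yvar='group')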

Error in Example.ipynb "ValueError: negative dimensions are not allowed"

While following the tutorial, I hit the error below, which seems to be caused by: https://stackoverflow.com/a/19939003.

  • Installation:
conda create -q --yes -n pymatch python=3.6 pip
conda activate pymatch
pip install statsmodels seaborn scipy pymatch ipython[all] autopep8 
  • Error in line m = Matcher(test, control, yvar="loan_status", exclude=[]):
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-7c4230cfce4f> in <module>
----> 1 m = Matcher(test, control, yvar="loan_status", exclude=[])

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/pymatch/Matcher.py in __init__(self, test, control, yvar, formula, exclude)
     47         self.matched_data = []
     48         self.y, self.X = patsy.dmatrices('{} ~ {}'.format(yvar, '+'.join(self.xvars)), data=self.data,
---> 49                                          return_type='dataframe')
     50         self.xvars = [i for i in self.data.columns if i not in self.exclude]
     51         self.test= self.data[self.data[yvar] == True]

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    308     eval_env = EvalEnvironment.capture(eval_env, reference=1)
    309     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 310                                       NA_action, return_type)
    311     if lhs.shape[1] == 0:
    312         raise PatsyError("model is missing required outcome variables")

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    163         return iter([data])
    164     design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165                                       NA_action)
    166     if design_infos is not None:
    167         return build_design_matrices(design_infos, data,

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
     68                                       data_iter_maker,
     69                                       eval_env,
---> 70                                       NA_action)
     71     else:
     72         return None

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/build.py in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
    719         term_to_subterm_infos = _make_subterm_infos(termlist,
    720                                                     num_column_counts,
--> 721                                                     cat_levels_contrasts)
    722         assert isinstance(term_to_subterm_infos, OrderedDict)
    723         assert frozenset(term_to_subterm_infos) == frozenset(termlist)

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/build.py in _make_subterm_infos(terms, num_column_counts, cat_levels_contrasts)
    626                         coded = code_contrast_matrix(factor_coding[factor],
    627                                                      levels, contrast,
--> 628                                                      default=Treatment)
    629                         contrast_matrices[factor] = coded
    630                         subterm_columns *= coded.matrix.shape[1]

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/contrasts.py in code_contrast_matrix(intercept, levels, contrast, default)
    600         return contrast.code_with_intercept(levels)
    601     else:
--> 602         return contrast.code_without_intercept(levels)
    603 

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/patsy/contrasts.py in code_without_intercept(self, levels)
    181         else:
    182             reference = _get_level(levels, self.reference)
--> 183         eye = np.eye(len(levels) - 1)
    184         contrasts = np.vstack((eye[:reference, :],
    185                                 np.zeros((1, len(levels) - 1)),

~/miniconda3/envs/pymatch/lib/python3.6/site-packages/numpy/lib/twodim_base.py in eye(N, M, k, dtype, order)
    199     if M is None:
    200         M = N
--> 201     m = zeros((N, M), dtype=dtype, order=order)
    202     if k >= M:
    203         return m

ValueError: negative dimensions are not allowed

Weights

In explaining to an audience who would look at the covariates in my matching before I do the ATE, would it be possible to get the weights of each covariate? Then, when an audience sees GENDER differing between test and control but matched, they don't question the entire model. Thanks.
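
Not an official feature as far as I know, but a hedged sketch: if the fitted statsmodels results live in m.models (as the tracebacks elsewhere suggest), averaging their coefficients gives a rough picture of how much each covariate contributes to the propensity score.

import pandas as pd

coefs = pd.concat([res.params for res in m.models], axis=1)
print(coefs.mean(axis=1).sort_values())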

Float and Int - Treated as Categorical?

I am trying to use this package on a dataset that contains both categorical features and numerical features. For some reason, when I run the line below it returns all features as if it were treating them all as categorical.

categorical_results = m.compare_categorical(return_table=True)

Running the line below returns an error, which I assume means I don't have any continuous features:

cc = m.compare_continuous(return_table=True)

KeyError: "['var' 'ks_before' 'ks_after' 'grouped_chisqr_before'\n 'grouped_chisqr_after' 'std_median_diff_before' 'std_median_diff_after'\n 'std_mean_diff_before' 'std_mean_diff_after'] not in index"

I haven't used this package in a while, but I had no issues with this in the past. Anyone run into this? I've checked my dataframe columns throughout each step and they appear to remain of dtype 'int64' or 'float64'. Checking the generated dataframe columns from 'matched_data' shows the same dtypes also.
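
A quick, hedged sanity-check sketch for narrowing this down: confirm which columns pandas (and therefore patsy) will see as numeric before building the Matcher.

print(df.dtypes)
print(df.select_dtypes(include='number').columns.tolist())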

About threshold parameter of match function

Thank you for developing this great module.
I'm not sure whether this is intended, but the threshold parameter of the match function is not used when 'min' is set as the value for the method parameter (line 188 of the code below).
https://github.com/benmiroglio/pymatch/blob/master/pymatch/Matcher.py
If 'random' is set for the method parameter, the threshold parameter works.
I hope this comment is useful for improving this great module.

Next steps after matching

I'm curious what you do after matching is complete.
Usually one is interested in the causal effect of the treatment on the outcome variable, so it's logical to run a regression to come up with coefficients (which are log odds). But what should we run this regression on? The new dataset that pymatch returns, the one with the extra match_id column?
I guess this is briefly addressed in the Match Data section of the manual. Can someone elaborate on what to do with the weight vector in the subsequent regression?
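
A hedged sketch of one common approach: run the outcome regression on m.matched_data, feeding the weight column as frequency weights. All column names here ('outcome', 'treatment', 'age', 'income', 'weight') are placeholders; check the Match Data section for the actual weight column produced by your pymatch version.

import statsmodels.api as sm
import statsmodels.formula.api as smf

fit = smf.glm(
    'outcome ~ treatment + age + income',
    data=m.matched_data,
    family=sm.families.Binomial(),
    freq_weights=m.matched_data['weight'],
).fit()
print(fit.summary())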

numpy.arange in py.pymatch.functions.which_bins_hist

For negative covariates, the 99th percentile could be a negative number x. Then np.arange(x, step=10) fails because the default start value is zero, which is larger than the stop. Here is a fix:

np.arange(start=min(comb), stop=np.percentile(comb, 99), step=10)

record_id

Is there any way I can substitute the record_id with my project-specific id? Please help.
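
A hedged sketch of one way around it: keep your own id as an excluded column so it is not used for matching, then read it straight off the matched rows. 'project_id' is a placeholder, and whether excluded columns survive into matched_data should be verified against your pymatch version.

m = Matcher(test, control, yvar='group', exclude=['project_id'])
m.fit_scores(balance=True)
m.match(method='min', nmatches=1, threshold=0.001)
print(m.matched_data[['record_id', 'project_id', 'match_id']].head())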

Not working with Pandas 0.24

Hi Guys!
pymatch stopped working after the pandas update, but I managed to fix it. I also made some minor enhancements. Please let me know how I can send you the files so you can check them and merge them into the next release.

Best Regards!

m.compare_categorical()

I get an error claiming "['var' 'before' 'after'] not in index" when running compare_categorical().
