
knnimpute's Issues

Write a blog post comparing performance of different methods

I'm curious how these methods compare for different values of n_samples, n_features, and fraction_missing. If there's an obvious trend, we should provide a wrapper function that switches between the underlying implementations. Also, if anything is uniformly slower than the reference implementation, we should drop it (I implemented these for fun, and not all of them are necessarily useful).
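
A rough sketch of what the timing harness for such a comparison might look like. Everything here is an assumption for illustration: `make_problem`, `time_imputers`, and the `column_mean_fill` stand-in are hypothetical names, and the stand-in is a plain column-mean fill rather than any KNN method. Each imputer is assumed to be callable as `fn(X, missing_mask, k)`.

```python
import time

import numpy as np


def make_problem(n_samples, n_features, fraction_missing, seed=0):
    """Random Gaussian data with entries dropped uniformly at random."""
    rng = np.random.RandomState(seed)
    X = rng.randn(n_samples, n_features)
    missing_mask = rng.rand(n_samples, n_features) < fraction_missing
    X_missing = X.copy()
    X_missing[missing_mask] = np.nan
    return X_missing, missing_mask


def column_mean_fill(X, missing_mask, k):
    """Hypothetical stand-in imputer (NOT KNN): fill with column means."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    # Column index of every missing entry, in the same row-major order
    # that boolean-mask assignment uses.
    X[missing_mask] = np.take(col_means, np.nonzero(missing_mask)[1])
    return X


def time_imputers(imputers, grid, k=3):
    """Time each imputer on each (n_samples, n_features, fraction_missing)."""
    results = []
    for n_samples, n_features, fraction_missing in grid:
        X, mask = make_problem(n_samples, n_features, fraction_missing)
        for name, fn in imputers.items():
            start = time.perf_counter()
            fn(X.copy(), mask, k)
            elapsed = time.perf_counter() - start
            results.append(
                (name, n_samples, n_features, fraction_missing, elapsed))
    return results


results = time_imputers(
    {"column_mean": column_mean_fill},
    grid=[(50, 4, 0.1), (100, 8, 0.3)],
)
for row in results:
    print(row)
```

The real implementations would replace the stand-in in the `imputers` dict, provided they accept the same three arguments.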

Distance zero leads to division by zero and NaN

This seems to be failing on a toy example.
I haven't checked the code yet but here's how to reproduce:

import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
# only some values are missing
rng = np.random.RandomState(0)
X_some_missing = X.copy()
# knock out column 3 where column 2 is close to a random draw
mask = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=.7, size=X.shape[0])) < .6
X_some_missing[mask, 3] = np.nan
# different random numbers for column 2
mask2 = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=.7, size=X.shape[0])) < .6
X_some_missing[mask2, 2] = np.nan

from fancyimpute import KNN
res = KNN().complete(X_some_missing)

[KNN] Warning: 6/600 still missing after imputation, replacing with 0
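
A minimal sketch of how a zero distance could end up as NaN, assuming the imputer weights each neighbor by the inverse of its distance (an assumption; the library code has not been checked here). The helper `inverse_distance_weights`, its `min_dist` floor, and the sample numbers are hypothetical and only illustrate the failure mode and one possible guard.

```python
import numpy as np


def inverse_distance_weights(distances, min_dist=1e-6):
    """Hypothetical helper: weights proportional to 1/d, with d floored
    at min_dist so that a zero distance cannot divide by zero."""
    d = np.maximum(np.asarray(distances, dtype=float), min_dist)
    w = 1.0 / d
    return w / w.sum()


# Made-up neighbors; the first is an exact duplicate (distance zero).
neighbor_distances = np.array([0.0, 0.5, 1.0])
neighbor_values = np.array([1.4, 1.5, 4.7])

# Naive weighting: 1/0 gives inf, and inf/inf gives NaN on normalization.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = 1.0 / neighbor_distances
    naive = naive / naive.sum()
naive_estimate = np.dot(naive, neighbor_values)

# Floored weighting: the duplicate dominates but the result stays finite.
safe_estimate = np.dot(
    inverse_distance_weights(neighbor_distances), neighbor_values)

print(naive_estimate, safe_estimate)
```

With the floor in place, the estimate is pulled almost entirely toward the zero-distance neighbor, which seems like the intended behavior for an exact duplicate row.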

What is wrong when I get a MemoryError without an error code?

I'm trying to use knnimpute to fill the NaN values in a DataFrame.
The frame info looks fine:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
Unnamed: 0                              150000 non-null int64
SeriousDlqin2yrs                        150000 non-null int64
RevolvingUtilizationOfUnsecuredLines    150000 non-null float64
age                                     150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    150000 non-null int64
DebtRatio                               150000 non-null float64
MonthlyIncome                           120269 non-null float64
NumberOfOpenCreditLinesAndLoans         150000 non-null int64
NumberOfTimes90DaysLate                 150000 non-null int64
NumberRealEstateLoansOrLines            150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    150000 non-null int64
NumberOfDependents                      146076 non-null float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB

That's not very large, right?

But I get a MemoryError without any hint or error code when running:

from knnimpute import knn_impute_reference
X_imputed = knn_impute_reference(test_data.iloc[:,2:].values, np.isnan(test_data.iloc[:,2:].values), k=3)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-39-d0cef60a839b> in <module>()
      1 from knnimpute import knn_impute_reference
----> 2 X_imputed =knn_impute_reference(test_data.iloc[:,2:].values, np.isnan(test_data.iloc[:,2:].values), k=3)

d:\Anaconda3\lib\site-packages\knnimpute\reference.py in knn_impute_reference(X, missing_mask, k, verbose, print_interval)
     29     n_rows, n_cols = X.shape
     30     X_result, D, effective_infinity = \
---> 31         knn_initialize(X, missing_mask, verbose=verbose)
     32 
     33     for i in range(n_rows):

d:\Anaconda3\lib\site-packages\knnimpute\common.py in knn_initialize(X, missing_mask, verbose, min_dist, max_dist_multiplier)
     37         # to put NaN's back in the data matrix for the distances function
     38         X_row_major[missing_mask] = np.nan
---> 39     D = all_pairs_normalized_distances(X_row_major)
     40     D_finite_flat = D[np.isfinite(D)]
     41     if len(D_finite_flat) > 0:

d:\Anaconda3\lib\site-packages\knnimpute\normalized_distance.py in all_pairs_normalized_distances(X)
     36 
     37     # matrix of mean squared difference between between samples
---> 38     D = np.ones((n_rows, n_rows), dtype="float32", order="C") * np.inf
     39 
     40     # we can cheaply determine the number of columns that two rows share

d:\Anaconda3\lib\site-packages\numpy\core\numeric.py in ones(shape, dtype, order)
    190 
    191     """
--> 192     a = empty(shape, dtype, order)
    193     multiarray.copyto(a, 1, casting='unsafe')
    194     return a

MemoryError: 
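
A back-of-the-envelope check of the allocation that fails in the traceback above. The line `np.ones((n_rows, n_rows), dtype="float32", ...)` builds a dense all-pairs distance matrix, so the cost grows with the square of the row count rather than with the 13.7 MB size of the frame. The variable names below are illustrative only.

```python
import numpy as np

n_rows = 150_000  # rows in the DataFrame
bytes_per_entry = np.dtype("float32").itemsize  # 4 bytes

# Dense n_rows x n_rows matrix of float32 distances.
required_bytes = n_rows * n_rows * bytes_per_entry
required_gib = required_bytes / 2**30

print(f"{required_bytes:,} bytes = {required_gib:.1f} GiB")
```

That is 90,000,000,000 bytes, roughly 83.8 GiB, for the matrix alone, and multiplying the result by `np.inf` would need a second array of the same size. The empty message after `MemoryError:` is simply how the failed allocation is reported here.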

error message

Dear Sir,
When I run knnimpute, I get this message:

[KNN] Warning: 104/77964 still missing after imputation, replacing with 0

Could you please tell me what it means?

Best,
Edgar Acuna
University of Puerto Rico

Add unit test for iris data

From #3:

import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
# only some values are missing
rng = np.random.RandomState(0)
X_some_missing = X.copy()
# knock out column 3 where column 2 is close to a random draw
mask = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=.7, size=X.shape[0])) < .6
X_some_missing[mask, 3] = np.nan
# different random numbers for column 2
mask2 = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=.7, size=X.shape[0])) < .6
X_some_missing[mask2, 2] = np.nan

from fancyimpute import KNN
res = KNN().complete(X_some_missing)
