Multiple implementations of kNN imputation in pure Python + NumPy
iskandr / knnimpute
Python implementations of kNN imputation
License: Apache License 2.0
Dear Sir,
When I run knn impute, I get this message:
[KNN] Warning: 104/77964 still missing after imputation, replacing with 0
Could you please tell me what it means?
Best,
Edgar Acuna
University of Puerto Rico
I'm curious how these methods compare for different values of n_samples, n_features, and fraction_missing. If there's a very obvious trend, we should provide a wrapper function that switches between underlying implementations. Also, if anything is uniformly slower than the reference implementation, we should drop it (since I implemented these for fun and they're not all necessarily useful).
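A comparison like this could be set up with a small timing harness. The sketch below is only an illustration: `make_missing_data`, `column_mean_impute`, and `benchmark` are hypothetical helpers, not part of knnimpute, and the column-mean imputer is a stand-in so the example runs without the package installed. It assumes each imputer can be called as `fn(X, missing_mask, k)`, which matches the argument order visible in the `knn_impute_reference` traceback further down.

```python
import time

import numpy as np


def make_missing_data(n_samples, n_features, fraction_missing, seed=0):
    """Random Gaussian matrix with roughly `fraction_missing` entries set to NaN."""
    rng = np.random.RandomState(seed)
    X = rng.randn(n_samples, n_features)
    missing_mask = rng.rand(n_samples, n_features) < fraction_missing
    X_missing = X.copy()
    X_missing[missing_mask] = np.nan
    return X_missing, missing_mask


def column_mean_impute(X, missing_mask, k=None):
    """Hypothetical stand-in imputer; replace with the functions being compared."""
    X_filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(missing_mask)
    X_filled[rows, cols] = col_means[cols]
    return X_filled


def benchmark(impute_fns, sizes, fractions, k=3):
    """Return (name, n_samples, n_features, fraction_missing, seconds) tuples."""
    results = []
    for n_samples, n_features in sizes:
        for fraction in fractions:
            X, mask = make_missing_data(n_samples, n_features, fraction)
            for name, fn in impute_fns.items():
                start = time.perf_counter()
                fn(X, mask, k)
                elapsed = time.perf_counter() - start
                results.append((name, n_samples, n_features, fraction, elapsed))
    return results


results = benchmark(
    {"column_mean": column_mean_impute},
    sizes=[(100, 10), (500, 20)],
    fractions=[0.1, 0.5],
)
for row in results:
    print(row)
```

Adding the real implementations to the `impute_fns` dictionary would give one timing per implementation and parameter combination, which is the table a switching wrapper would need.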
This seems to be failing on a toy example.
I haven't checked the code yet but here's how to reproduce:
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
# some values missing only
rng = np.random.RandomState(0)
X_some_missing = X.copy()
mask = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=.7, size=X.shape[0])) < .6
X_some_missing[mask, 3] = np.nan
# different random numbers
mask2 = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=.7, size=X.shape[0])) < .6
X_some_missing[mask2, 2] = np.nan
from fancyimpute import KNN
res = KNN().complete(X_some_missing)
[KNN] Warning: 6/600 still missing after imputation, replacing with 0
From #3:
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
# some values missing only
rng = np.random.RandomState(0)
X_some_missing = X.copy()
mask = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=.7, size=X.shape[0])) < .6
X_some_missing[mask, 3] = np.nan
# different random numbers
mask2 = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=.7, size=X.shape[0])) < .6
X_some_missing[mask2, 2] = np.nan
from fancyimpute import KNN
res = KNN().complete(X_some_missing)
I'm trying to use knnimpute to fill the NaN values in a DataFrame.
The frame info looks fine:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
Unnamed: 0 150000 non-null int64
SeriousDlqin2yrs 150000 non-null int64
RevolvingUtilizationOfUnsecuredLines 150000 non-null float64
age 150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 150000 non-null int64
DebtRatio 150000 non-null float64
MonthlyIncome 120269 non-null float64
NumberOfOpenCreditLinesAndLoans 150000 non-null int64
NumberOfTimes90DaysLate 150000 non-null int64
NumberRealEstateLoansOrLines 150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 150000 non-null int64
NumberOfDependents 146076 non-null float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB
Not too large, right?
But I get a MemoryError with no hint or error code when running:
from knnimpute import knn_impute_reference
X_imputed =knn_impute_reference(test_data.iloc[:,2:].values, np.isnan(test_data.iloc[:,2:].values), k=3)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-39-d0cef60a839b> in <module>()
1 from knnimpute import knn_impute_reference
----> 2 X_imputed =knn_impute_reference(test_data.iloc[:,2:].values, np.isnan(test_data.iloc[:,2:].values), k=3)
d:\Anaconda3\lib\site-packages\knnimpute\reference.py in knn_impute_reference(X, missing_mask, k, verbose, print_interval)
29 n_rows, n_cols = X.shape
30 X_result, D, effective_infinity = \
---> 31 knn_initialize(X, missing_mask, verbose=verbose)
32
33 for i in range(n_rows):
d:\Anaconda3\lib\site-packages\knnimpute\common.py in knn_initialize(X, missing_mask, verbose, min_dist, max_dist_multiplier)
37 # to put NaN's back in the data matrix for the distances function
38 X_row_major[missing_mask] = np.nan
---> 39 D = all_pairs_normalized_distances(X_row_major)
40 D_finite_flat = D[np.isfinite(D)]
41 if len(D_finite_flat) > 0:
d:\Anaconda3\lib\site-packages\knnimpute\normalized_distance.py in all_pairs_normalized_distances(X)
36
37 # matrix of mean squared difference between between samples
---> 38 D = np.ones((n_rows, n_rows), dtype="float32", order="C") * np.inf
39
40 # we can cheaply determine the number of columns that two rows share
d:\Anaconda3\lib\site-packages\numpy\core\numeric.py in ones(shape, dtype, order)
190
191 """
--> 192 a = empty(shape, dtype, order)
193 multiarray.copyto(a, 1, casting='unsafe')
194 return a
MemoryError:
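The traceback points at the allocation of an `(n_rows, n_rows)` float32 distance matrix, so the memory needed grows with the square of the number of rows rather than with the size of the DataFrame. The arithmetic can be checked with the short sketch below; `distance_matrix_bytes` and `max_rows_for_budget` are hypothetical helper names, and the 8 GB budget is just an example figure.

```python
import math

import numpy as np


def distance_matrix_bytes(n_rows, dtype="float32"):
    """Bytes needed for a dense (n_rows, n_rows) matrix of the given dtype."""
    return n_rows * n_rows * np.dtype(dtype).itemsize


def max_rows_for_budget(budget_bytes, dtype="float32"):
    """Largest n_rows whose dense pairwise matrix fits in budget_bytes."""
    return math.isqrt(budget_bytes // np.dtype(dtype).itemsize)


n_rows = 150000
needed = distance_matrix_bytes(n_rows)
print("distance matrix: %.1f GB" % (needed / 1e9))

# Example budget of 8 GB for the distance matrix alone
budget = 8 * 10**9
print("rows that fit in 8 GB:", max_rows_for_budget(budget))
```

For 150000 rows this comes to 90 GB for the distance matrix alone, against 13.7 MB for the frame itself. One possible workaround is to impute on a subsample or in smaller batches of rows, with the caveat that each row can then only find neighbors within its own batch.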
Please provide an example that explains how to use this algorithm. Thanks.
X = array([[ nan, 1.],
[ 1., nan]])
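As an illustration of the general idea, here is a minimal NumPy sketch of kNN imputation. It is not the library's implementation: `knn_impute_sketch` is a hypothetical function, it averages the k nearest neighbors without any distance weighting, and it uses plain loops for clarity. It follows the convention visible in the traceback above, where distances start at infinity and are computed only over the columns two rows share.

```python
import numpy as np


def knn_impute_sketch(X, k=3):
    """Fill NaNs with the unweighted mean of the k nearest rows that observe the column."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    n_rows = X.shape[0]

    # Mean squared difference over shared observed columns;
    # stays infinite when two rows share no observed column.
    D = np.full((n_rows, n_rows), np.inf)
    for i in range(n_rows):
        for j in range(n_rows):
            if i == j:
                continue
            shared = ~missing[i] & ~missing[j]
            if shared.any():
                D[i, j] = np.mean((X[i, shared] - X[j, shared]) ** 2)

    X_filled = X.copy()
    for i, j in zip(*np.where(missing)):
        # Usable neighbors: finite distance to row i and column j observed
        candidates = np.where(np.isfinite(D[i]) & ~missing[:, j])[0]
        if len(candidates) == 0:
            continue  # nothing to average, the entry stays NaN
        nearest = candidates[np.argsort(D[i, candidates])[:k]]
        X_filled[i, j] = X[nearest, j].mean()
    return X_filled


X_ok = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [10.0, 20.0, 50.0],
])
filled_ok = knn_impute_sketch(X_ok, k=2)
print(filled_ok)

# The 2x2 matrix from this thread: the rows share no observed column
X_toy = np.array([[np.nan, 1.0], [1.0, np.nan]])
filled_toy = knn_impute_sketch(X_toy, k=1)
print(np.isnan(filled_toy).sum(), "entries still missing")
```

In the first matrix the missing entry is filled from the two closest rows. In the 2x2 matrix no pair of rows shares an observed column, so every distance is infinite and both entries stay missing. This is plausibly the kind of situation behind the "still missing after imputation, replacing with 0" warning, though that reading is an assumption rather than something confirmed from the library source.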