gcforest's People

Contributors

pylablanche

gcforest's Issues

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

The data is from UCI. Here is the link: http://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29
Here is my code:

`
import datetime

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from GCForest import gcForest  # GCForest.py from this repository

data_dir = '../census_income.data'

df = pd.read_csv(data_dir, sep=',', header=None)

# Map the income label in column 41 to 1 / 0.
df.loc[df[41] == ' 50000+.', 41] = 1
df.loc[df[41] == ' - 50000.', 41] = 0

y_tag = 41
pos_value = 1
neg_value = 0
y = df[y_tag].values
y = y.astype(np.float32)
del df[y_tag]

# Label-encode every categorical (object) column.
for c in df.columns:
    if df[c].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(df[c].values))
        df[c] = lbl.transform(list(df[c].values))

# Scale each column to [0, 1]; MinMaxScaler expects 2-D input.
mmsc = MinMaxScaler()
for i in df.columns:
    df[i] = mmsc.fit_transform(df[[i]]).ravel()

df = df.astype(np.float32)

df = df.fillna(df.median(axis=0))

X = df.values

X_train, X_test, y_train, y_test = train_test_split(
    np.nan_to_num(X), y, test_size=0.3, random_state=123)

gcf_param = {'shape_1X': X.shape[1],
             'window': [1],
             'n_mgsRFtree': 30,
             'stride': 1,
             'cascade_test_size': 0.2,
             'n_cascadeRF': 2,
             'n_cascadeRFtree': 101,
             'cascade_layer': 100,
             'min_samples_mgs': 0.1,
             'min_samples_cascade': 0.05,
             'tolerance': 0.0,
             'n_jobs': 1
             }

gcf = gcForest(**gcf_param)

start_time = datetime.datetime.now()

gcf.fit(X_train, y_train)

end_time = datetime.datetime.now()

# Elapsed training time in whole seconds.
cost_time = end_time - start_time
cost_time = int(cost_time.seconds)
`
The error is raised at 'gcf.fit(X_train, y_train)', but there are no NaN or inf values in the data, so I wonder where the problem is.
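Before concluding that the input is clean, it may help to check the exact array that is passed to fit. The following is only a minimal sketch: the helper name `non_finite_columns` and the demo array are made up for illustration and are not part of gcForest.

```python
import numpy as np


def non_finite_columns(X):
    """Return the indices of columns that contain NaN or +/-inf."""
    X = np.asarray(X, dtype=np.float64)
    bad = ~np.isfinite(X)  # True wherever a value is NaN or infinite
    return np.flatnonzero(bad.any(axis=0))


# Made-up data: columns 0 and 2 contain a non-finite value.
X_demo = np.array([[0.0, 1.0, 2.0],
                   [np.nan, 1.0, np.inf],
                   [0.5, 1.0, 3.0]])
print(non_finite_columns(X_demo))
```

If this reports nothing for X_train, the non-finite values are probably produced inside the model rather than coming from the input.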

NameError: name 'basestring' is not defined

I want to test gcForest, so I ran tools/train_fg.py as described in the README,
but the error in the title occurs.
The full error message is below.

  File "tools/train_fg.py", line 48, in <module>
    net = FGNet(config["net"], train_config.data_cache)
  File "lib\gcforest\fgnet.py", line 36, in __init__
    layer = get_layer(layer_config, self.data_cache)
  File "lib\gcforest\layers\__init__.py", line 32, in get_layer
    layer = layer_class(layer_config, data_cache)
  File "lib\gcforest\layers\fg_pool_layer.py", line 27, in __init__
    self.pool_method = self.get_value("pool_method", "avg", basestring)
NameError: name 'basestring' is not defined

Is this error caused by my environment,
or by a bug in the implementation?

--Appendix information--

I ran the following command in the root directory of gcForest. (This command is given at line 168 of the README.)

python tools/train_fg.py --model models/mnist/gcforest/fg-tree500-depth100-3folds.json --log_dir logs/gcforest/mnist/fg --save_outputs

My environment is as follows.

  • Python 3.6 (Anaconda 4.4.0)
  • Windows 10 64bit
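For context, `basestring` exists only in Python 2; Python 3 removed it, which matches the Python 3.6 environment listed above. A common workaround is a small compatibility alias. The sketch below is an assumption about how such a shim could look: `string_types` and the stand-in `get_value` function are hypothetical and are not the project's actual code.

```python
# Python 2 had `basestring` as the common base of `str` and `unicode`.
# Python 3 removed it, so fall back to `str` when the name is missing.
try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str         # Python 3


def get_value(config, key, default, value_type):
    """Hypothetical stand-in for a typed config lookup."""
    value = config.get(key, default)
    if not isinstance(value, value_type):
        raise TypeError("{} must be of type {}".format(key, value_type))
    return value


pool_method = get_value({"pool_method": "avg"}, "pool_method", "avg", string_types)
print(pool_method)
```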

completely-random tree forests

There are two types of forest in the paper, and the first is the completely-random tree forest. I think this forest corresponds to the following line in the function window_slicing_pred_prob:
crf = RandomForestClassifier(n_estimators=n_tree, max_features=None, min_samples_split=min_samples, oob_score=True, n_jobs=n_jobs)
If I'm right, then the parameter max_features should be 1.
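To make the difference concrete: in scikit-learn, max_features=None lets every split search all features, while max_features=1 restricts each split to one randomly chosen feature. The sketch below uses made-up data and arbitrary tree counts. ExtraTreesClassifier is included only as a possible closer match to a completely-random tree, since it also randomizes the split threshold; that is a suggestion, not what the repository does.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Made-up data: 200 samples, 6 features, binary label.
rng = np.random.RandomState(0)
X = rng.rand(200, 6)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# max_features=None: every split searches all 6 features.
prf = RandomForestClassifier(n_estimators=50, max_features=None, random_state=0)
# max_features=1: each split considers a single randomly chosen feature.
crf = RandomForestClassifier(n_estimators=50, max_features=1, random_state=0)
# ExtraTrees also draws the candidate split threshold at random.
etf = ExtraTreesClassifier(n_estimators=50, max_features=1, random_state=0)

for name, forest in [("all features", prf), ("one feature", crf), ("extra trees", etf)]:
    forest.fit(X, y)
    print(name, forest.predict_proba(X[:3]).shape)
```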

Application on large scale dataset.

Hi,
Your implementation is really elegant. But when I tried your code on the real MNIST dataset, it used almost 100 GB of memory before I force-stopped it. Do you have any idea why?
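A rough back-of-the-envelope estimate may explain part of this. The sketch below assumes that window slicing materializes every window of every image at once as float64 values; that is an assumption about the implementation, not something confirmed here. The function name `sliced_bytes` is made up for illustration.

```python
def sliced_bytes(n_samples, side, window, stride=1, itemsize=8):
    """Bytes needed to hold every window of every square image at once."""
    per_axis = (side - window) // stride + 1  # window positions along one axis
    n_windows = per_axis ** 2                 # window positions per image
    return n_samples * n_windows * window * window * itemsize


# MNIST training set: 60000 images of 28 x 28 pixels.
for w in (7, 14):
    gb = sliced_bytes(60000, 28, w) / 1e9
    print("window={}: {:.1f} GB".format(w, gb))
```

Under these assumptions a single 7 x 7 window already needs about 11 GB for the sliced array alone, and a 14 x 14 window about 21 GB, before any forest is trained or any copy is made.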

How is the trained model saved?

Hello,
Thank you very much for writing this program! I have some questions.

One: I can't find the trained model. Is there a log or file that saves the trained model, so that it can be applied directly to test samples instead of passing the test set into the model together with the training set?
Two: I saw that you use the out-of-bag error to evaluate the model. If I want to use cross-validation instead, which piece of code should be modified? Also, is cross-validation better than the out-of-bag error for this model?
Three: Where should other types of classifier be added? It seems that you only wrote one classifier in your code.
Four: I am very confused about how the decision trees of the Cascade Forest are trained. I don't understand how the probability distribution is used as a feature for training the decision trees.

Please forgive my poor English and the many questions; they have confused me for a long time.
Could you explain these points? Thank you in advance!
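On questions Two and Four, the general idea of a cascade layer can be sketched with plain scikit-learn: each forest outputs a class-probability vector per sample, and those vectors are appended to the original features to form the input of the next layer. The example below is an illustrative sketch with made-up data, using cross_val_predict in place of out-of-bag estimates. It is not the code in GCForest.py.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Made-up data: 300 samples, 5 features, binary label.
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] > 0.5).astype(int)

forests = [
    RandomForestClassifier(n_estimators=30, random_state=0),
    ExtraTreesClassifier(n_estimators=30, max_features=1, random_state=0),
]

# Each forest yields one class-probability vector per sample, predicted
# only by models that did not see that sample during training (3-fold CV).
proba_blocks = [
    cross_val_predict(forest, X, y, cv=3, method="predict_proba")
    for forest in forests
]

# The next layer trains on the original features plus all probability vectors:
# 5 features + 2 forests * 2 classes = 9 columns.
X_next = np.hstack([X] + proba_blocks)
print(X_next.shape)
```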

getting Buffer has wrong number of dimensions (expected 1, got 2) error

I get the following error when calling these lines:

gcf = gcForest(shape_1X=6, window=(4))
gcf.fit(X_train, Y_train)

and the dimensions of my features and target are:

[1686 rows x 6 columns] and [1686 rows x 1 columns]

Slicing Sequence...

ValueError Traceback (most recent call last)
in ()
----> 1 gcf.fit(df[columns].iloc[:last_train_index], df[['is_attributed']].iloc[:last_train_index])

in fit(self, X, y)
97 raise ValueError('Sizes of y and X do not match.')
98
---> 99 mgs_X = self.mg_scanning(X, y)
100 _ = self.cascade_forest(mgs_X, y)
101

in mg_scanning(self, X, y)
146
147 for wdw_size in getattr(self, 'window'):
--> 148 wdw_pred_prob = self.window_slicing_pred_prob(X, wdw_size, shape_1X, y=y)
149 mgs_pred_prob.append(wdw_pred_prob)
150

in window_slicing_pred_prob(self, X, window, shape_1X, y)
176 else:
177 print('Slicing Sequence...')
--> 178 sliced_X, sliced_y = self._window_slicing_sequence(X, window, shape_1X, y=y, stride=stride)
179
180 if y is not None:

in _window_slicing_sequence(self, X, window, shape_1X, y, stride)
264 ind_1X = np.arange(np.prod(shape_1X))
265 inds_to_take = [ind_1X[i:i+window] for i in iter_array]
--> 266 sliced_sqce = np.take(X, inds_to_take, axis=1).reshape(-1, window)
267
268 if y is not None:

/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py in take(a, indices, axis, out, mode)
157 [5, 7]])
158 """
--> 159 return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
160
161

/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
50 def _wrapfunc(obj, method, *args, **kwds):
51 try:
---> 52 return getattr(obj, method)(*args, **kwds)
53
54 # An AttributeError occurs if the object does not have

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in take(self, indices, axis, convert, is_copy, **kwargs)
2246
2247 convert = nv.validate_take(tuple(), kwargs)
-> 2248 return self._take(indices, axis=axis, convert=convert, is_copy=is_copy)
2249
2250 def xs(self, key, axis=0, level=None, drop_level=True):

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _take(self, indices, axis, convert, is_copy)
2148 new_data = self._data.take(indices,
2149 axis=self._get_block_manager_axis(axis),
-> 2150 verify=True)
2151 result = self._constructor(new_data).finalize(self)
2152

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in take(self, indexer, axis, verify, convert)
4262 new_labels = self.axes[axis].take(indexer)
4263 return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4264 axis=axis, allow_dups=True)
4265
4266 def merge(self, other, lsuffix='', rsuffix=''):

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy)
4144 if axis == 0:
4145 new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4146 fill_tuple=(fill_value,))
4147 else:
4148 new_blocks = [blk.take_nd(indexer, axis=axis, fill_tuple=(

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in _slice_take_blocks_ax0(self, slice_or_indexer, fill_tuple)
4190 else:
4191 blknos = algos.take_1d(self._blknos, slobj, fill_value=-1,
-> 4192 allow_fill=allow_fill)
4193 blklocs = algos.take_1d(self._blklocs, slobj, fill_value=-1,
4194 allow_fill=allow_fill)

/opt/conda/lib/python3.6/site-packages/pandas/core/algorithms.py in take_nd(arr, indexer, axis, out, fill_value, mask_info, allow_fill)
1381 func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
1382 mask_info=mask_info)
-> 1383 func(arr, indexer, out, fill_value)
1384
1385 if flip_order:

pandas/_libs/algos_take_helper.pxi in pandas._libs.algos.take_1d_int64_int64()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
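The traceback passes through pandas' own take method, which suggests that X was a DataFrame rather than a NumPy array. One possible fix, offered as an assumption rather than a confirmed diagnosis, is to convert both inputs to arrays and flatten the target to one dimension before calling fit. The frames below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for the real features and target.
df = pd.DataFrame(np.arange(12, dtype=float).reshape(2, 6), columns=list("abcdef"))
target = pd.DataFrame({"is_attributed": [0, 1]})

X_train = df.values                       # plain ndarray, shape (2, 6)
y_train = target["is_attributed"].values  # 1-D labels, shape (2,)

# The same kind of window slicing as in the traceback now runs on an ndarray.
window = 4
inds_to_take = [np.arange(6)[i:i + window] for i in range(6 - window + 1)]
sliced = np.take(X_train, inds_to_take, axis=1).reshape(-1, window)
print(sliced.shape, y_train.shape)
```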

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Word vec Feature

train_w2v = wordvec_df.iloc[:31962,:]
test_w2v = wordvec_df.iloc[31962:,:]
xtrain_w2v = train_w2v.iloc[ytrain.index,:]
xvalid_w2v = train_w2v.iloc[yvalid.index,:]

lreg.fit(xtrain_w2v, ytrain)
prediction = lreg.predict_proba(xvalid_w2v)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)
f1_score(yvalid, prediction_int)

ValueError Traceback (most recent call last)
in
5 xvalid_w2v = train_w2v.iloc[yvalid.index,:]
6
----> 7 lreg.fit(xtrain_w2v, ytrain)
8 prediction = lreg.predict_proba(xvalid_w2v)
9 prediction_int = prediction[:,1] >= 0.3

~\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
1530
1531 X, y = check_X_y(X, y, accept_sparse='csr', dtype=dtype, order="C",
-> 1532 accept_large_sparse=solver != 'liblinear')
1533 check_classification_targets(y)
1534 self.classes_ = np.unique(y)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
717 ensure_min_features=ensure_min_features,
718 warn_on_dtype=warn_on_dtype,
--> 719 estimator=estimator)
720 if multi_output:
721 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
540 if force_all_finite:
541 _assert_all_finite(array,
--> 542 allow_nan=force_all_finite == 'allow-nan')
543
544 if ensure_min_samples > 0:

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57 # for object dtype data, we only check for NaNs (GH-13254)
58 elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

issues about the proportion of the test_size

Hi,
When using gcForest, I tried different test_size values: 0.3, 0.2 and 0.1. When I set test_size=0.1, gcForest() produced errors. I see that cascade_test_size defaults to 0.2.
Is this the problem? Thanks.

Code

X_tr, X_te, y_tr, y_te = train_test_split(sX, sY, test_size=0.2)
gcf = gcForest(shape_1X=[1, X_tr.shape[1]], window=50, tolerance=0.0)
gcf.fit(X_tr, y_tr)

Errors:

/Users/cheny/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py:458: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
warn("Some inputs do not have OOB scores. "
/Users/cheny/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py:463: RuntimeWarning: divide by zero encountered in true_divide
predictions[k].sum(axis=1)[:, np.newaxis])
/Users/cheny/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py:463: RuntimeWarning: invalid value encountered in true_divide
predictions[k].sum(axis=1)[:, np.newaxis])
Adding/Training Layer, n_layer=1
ValueError Traceback (most recent call last)
in
1 gcf = gcForest(shape_1X= [1,X_tr.shape[1]], window=50, tolerance=0.0)
----> 2 gcf.fit(X_tr, y_tr)

~/Documents/tools/deepL/gcForest-master/GCForest.py in fit(self, X, y)
124
125 mgs_X = self.mg_scanning(X, y)
--> 126 _ = self.cascade_forest(mgs_X, y)
127
128 def predict_proba(self, X):

~/Documents/tools/deepL/gcForest-master/GCForest.py in cascade_forest(self, X, y)
345
346 self.n_layer += 1
--> 347 prf_crf_pred_ref = self._cascade_layer(X_train, y_train)
348 accuracy_ref = self._cascade_evaluation(X_test, y_test)
349 feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)

~/Documents/tools/deepL/gcForest-master/GCForest.py in _cascade_layer(self, X, y, layer)
409 print('Adding/Training Layer, n_layer={}'.format(self.n_layer))
410 for irf in range(n_cascadeRF):
--> 411 prf.fit(X, y)
412 crf.fit(X, y)
413 setattr(self, 'casprf{}{}'.format(self.n_layer, irf), prf)

~/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
248
249 # Validate or convert input data
--> 250 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
251 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
252 if sample_weight is not None:

~/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
571 if force_all_finite:
572 _assert_all_finite(array,
--> 573 allow_nan=force_all_finite == 'allow-nan')
574
575 shape_repr = _shape_repr(array.shape)

~/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57
58

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
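The UserWarning above says that some inputs have no OOB score, and the scikit-learn documentation notes that oob_decision_function_ may contain NaN in that situation. If those OOB probabilities are what reaches the next cascade layer (an assumption, not verified against GCForest.py), one workaround is to replace the affected rows before reuse. The helper `fill_non_finite_proba` below is hypothetical.

```python
import numpy as np


def fill_non_finite_proba(proba):
    """Replace rows containing NaN/inf with a uniform class distribution."""
    proba = np.array(proba, dtype=np.float64)  # work on a copy
    bad_rows = ~np.isfinite(proba).all(axis=1)
    proba[bad_rows] = 1.0 / proba.shape[1]
    return proba


# Made-up OOB output: the second sample was never out-of-bag.
oob = np.array([[0.9, 0.1],
                [np.nan, np.nan],
                [0.2, 0.8]])
clean = fill_non_finite_proba(oob)
print(clean)
```

Increasing the number of trees is the other obvious lever, since it makes it less likely that a sample is never left out of a bootstrap.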

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('Data_Train.csv')

# Inspect missing values and summary statistics.
pd.isnull(df).sum() > 0
data = df.describe()
df.isnull()
df.describe().columns

x1 = df[['Year', 'Seats']].values
y1 = df[['Price']]

x_train, x_test, y_train, y_test = train_test_split(x1, y1)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

linreg = LinearRegression()
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
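The snippet above inspects missing values but never removes them, so any NaN in Year, Seats or Price reaches fit unchanged. A minimal sketch of one way around this follows. It uses a tiny made-up frame in place of Data_Train.csv, and it assumes Price is continuous, which is why a regressor replaces the classifier.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up frame standing in for Data_Train.csv.
df = pd.DataFrame({
    "Year": [2010, 2012, 2015, 2018, 2014, 2016, 2011, 2017],
    "Seats": [5, np.nan, 7, 5, 5, 4, np.nan, 5],
    "Price": [3.5, 4.0, 9.5, 12.0, 6.0, 8.0, 2.5, 11.0],
})

# Drop every row with a missing value in the columns that are actually used.
cols = ["Year", "Seats", "Price"]
clean = df.dropna(subset=cols)

X = clean[["Year", "Seats"]].values
y = clean["Price"].values  # 1-D target

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(x_train, y_train)
print(knn.predict(x_test))
```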

Regression variation

Can this library or technique be adapted to do regression, like the RandomForestRegressor in sklearn?

When the predicted output is of a continuous type, this library throws an error.

Input contains NaN, infinity or a value too large for dtype('float64').

x_train,x_val,y_train,y_val=train_test_split(input_predictors,output_target,test_size=0.20,random_state=7)
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
x_train=x_train.astype(np.float64,copy=False)
y_train=y_train.astype(np.float64,copy=False)

logreg.fit(x_train,y_train)
When I run the fit line, it shows:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
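Note that astype(np.float64) only changes the dtype; a NaN stays a NaN. One option is to impute missing values as part of the model, sketched below with made-up data. The choice of median imputation is an assumption for illustration, not a recommendation for this particular dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up data with two missing entries.
X = np.array([[1.0, 2.0], [np.nan, 0.5], [3.0, np.nan],
              [0.2, 1.5], [2.5, 0.1], [0.1, 2.2]])
y = np.array([0, 1, 1, 0, 1, 0])

# Casting does not remove the missing values.
n_missing = int(np.isnan(X.astype(np.float64)).sum())

# Impute each column's median, then fit the classifier on the filled data.
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X, y)
print(n_missing, model.predict(X))
```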

no saveModel function?

Hi,
Your implementation is good. There is a function gcf.fit(X_train, y_train) for training and gcf.predict(X_test) for testing.
Is there a function like saveModel() for saving the result of gcf.fit(X_train, y_train), and a function like loadModel() for loading it again?
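In the absence of a dedicated function, Python's standard pickle module is one way to persist a fitted estimator. The sketch below demonstrates the round trip on a scikit-learn forest as a stand-in. Whether a fitted gcForest object pickles cleanly is an assumption that would need to be checked.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data and a stand-in model.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Serialize to bytes; pickle.dump / pickle.load do the same with a file object.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.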

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

def word_vector(tokens, size):
    # Average the word2vec vectors of all in-vocabulary tokens.
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # Handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
for i in range(len(tokenized_tweet)):
    wordvec_arrays[i, :] = word_vector(tokenized_tweet[i], 200)
wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape

train_w2v = wordvec_df.iloc[:31962,:]
test_w2v = wordvec_df.iloc[31962:,:]
xtrain_w2v = train_w2v.iloc[ytrain.index,:]
xvalid_w2v = train_w2v.iloc[yvalid.index,:]

lreg.fit(xtrain_w2v, ytrain)
prediction = lreg.predict_proba(xvalid_w2v)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)
f1_score(yvalid, prediction_int)
