pylablanche / gcforest

Python implementation of deep forest method : gcForest

License: MIT License

Python 53.25% Jupyter Notebook 46.75%

gcforest's Issues

TypeError: Cannot clone object '<GCForest.gcForest object at 0x00000000029FD240>' <type <class 'GCForest.gcForest'>>:it does not seem to be a scikit-learn estimators as it does not implement a 'get_params' methods

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

The data is from UCI: http://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29
Here is my code.

`
import datetime

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from GCForest import gcForest

data_dir = '../census_income.data'

# The file has no header row, so the columns are numbered 0..41.
df = pd.read_csv(data_dir, sep=',', header=None)

# Encode the target column (41) as 1/0 without chained assignment.
df.loc[df[41] == ' 50000+.', 41] = 1
df.loc[df[41] == ' - 50000.', 41] = 0

y_tag = 41
pos_value = 1
neg_value = 0
y = df[y_tag].values
y = y.astype(np.float32)
del df[y_tag]

# Label-encode every string column.
for c in df.columns:
    if df[c].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(df[c].values))
        df[c] = lbl.transform(list(df[c].values))

# Scale each column to [0, 1]; the scaler expects 2-D input.
mmsc = MinMaxScaler()
for i in df.columns:
    df[i] = mmsc.fit_transform(df[[i]]).ravel()

df = df.astype(np.float32)

df = df.fillna(df.median(axis=0))

X = df.values

X_train, X_test, y_train, y_test = train_test_split(
    np.nan_to_num(X), y, test_size=0.3, random_state=123)

gcf_param = {'shape_1X': X.shape[1],
             'window': [1],
             'n_mgsRFtree': 30,
             'stride': 1,
             'cascade_test_size': 0.2,
             'n_cascadeRF': 2,
             'n_cascadeRFtree': 101,
             'cascade_layer': 100,
             'min_samples_mgs': 0.1,
             'min_samples_cascade': 0.05,
             'tolerance': 0.0,
             'n_jobs': 1
             }

gcf = gcForest(**gcf_param)

start_time = datetime.datetime.now()

gcf.fit(X_train, y_train)

end_time = datetime.datetime.now()

cost_time = end_time - start_time

cost_time = int(cost_time.seconds)
`
The error is raised at 'gcf.fit(X_train, y_train)', but there are no NaN or inf values in the data, so I wonder where the problem is.
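One way to rule out the input itself is to count non-finite entries per column and to check whether any value would overflow when cast to float32. The sketch below is only a diagnostic; the helper names `report_nonfinite` and `overflows_float32` are made up for illustration and are not part of gcForest.

```python
import numpy as np


def report_nonfinite(X):
    """Return {column index: number of NaN/inf entries} for a 2-D array."""
    X = np.asarray(X, dtype=np.float64)
    counts = (~np.isfinite(X)).sum(axis=0)
    return {int(j): int(n) for j, n in enumerate(counts) if n > 0}


def overflows_float32(X):
    """True if a finite value would turn into inf after a float32 cast."""
    X = np.asarray(X, dtype=np.float64)
    limit = np.finfo(np.float32).max
    return bool(np.any(np.isfinite(X) & (np.abs(X) > limit)))
```

If both checks come back clean for `X_train`, the non-finite values are probably produced later, inside the training pipeline, rather than by the input data.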

NameError: name 'basestring' is not defined

I wanted to test gcForest, so I ran tools/train_fg.py as described in the README,
but the error in the title occurred.
The full error message is below.

File "tools/train_fg.py", line 48, in <module>
net = FGNet(config["net"], train_config.data_cache)
File "lib\gcforest\fgnet.py", line 36, in __init__
layer = get_layer(layer_config, self.data_cache)
File "lib\gcforest\layers\__init__.py", line 32, in get_layer
layer = layer_class(layer_config, data_cache)
File "lib\gcforest\layers\fg_pool_layer.py", line 27, in __init__
self.pool_method = self.get_value("pool_method", "avg", basestring)
NameError: name 'basestring' is not defined

Is this error caused by my environment,
or by a bug in the implementation?

--Appendix information--

I ran the following command in the root directory of gcForest. (The command appears at line 168 of the README.)

python tools/train_fg.py --model models/mnist/gcforest/fg-tree500-depth100-3folds.json --log_dir logs/gcforest/mnist/fg --save_outputs

My environment is as follows.

Python 3.6 (Anaconda 4.4.0)
Windows 10 64bit
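For context: `basestring` is a Python 2 built-in that does not exist in Python 3, so this NameError is expected when Python 2 code runs under Python 3.6. A common compatibility shim looks like the sketch below; the names `string_types` and `is_text` are illustrative and are not taken from the project.

```python
try:
    string_types = basestring  # Python 2: common base class of str and unicode
except NameError:
    string_types = str  # Python 3: basestring no longer exists


def is_text(value):
    """Return True if value is a text string on either Python version."""
    return isinstance(value, string_types)
```

With such a shim, a check written against `basestring` could be rewritten against `string_types`. Running the code under Python 2 is the other option.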

How to deal with missing values?

How should missing values in the input data set be handled?
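Nothing in this thread says that gcForest handles missing values itself, and the scikit-learn validation errors quoted in other issues here reject NaN input. One option is to impute before calling `fit`. Below is a minimal sketch using column medians; the function name `impute_median` is hypothetical, and median imputation is just one possible strategy.

```python
import numpy as np


def impute_median(X):
    """Return a copy of X with each NaN replaced by its column median."""
    X = np.array(X, dtype=np.float64)       # np.array copies the input
    medians = np.nanmedian(X, axis=0)       # medians that ignore NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X
```

A column that is entirely NaN has no median, so it stays NaN and should be dropped beforehand.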

completely-random tree forests

There are two types of forest in the paper, and the first is the completely-random tree forest. I think this forest corresponds to the following line in the function window_slicing_pred_prob:
crf = RandomForestClassifier(n_estimators=n_tree, max_features=None, min_samples_split=min_samples, oob_score=True, n_jobs=n_jobs)
If I'm right, then the parameter max_features should be 1.

What is a completely-random tree?
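As background for both questions: in the paper, a completely-random tree chooses the feature to split on at random at every node, while an ordinary random forest searches for the best split among roughly sqrt(d) candidate features. In scikit-learn, `ExtraTreesClassifier` with `max_features=1` builds totally random trees (random feature, random threshold), which may be the closest built-in match. The sketch below is an assumption about how the two forest types could be configured, not the repository's code; `make_forest_pair` is a hypothetical helper.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier


def make_forest_pair(n_trees=100, min_samples_split=2, n_jobs=1):
    """Return (pseudo-random forest, completely-random forest)."""
    # Ordinary random forest: best split among sqrt(d) candidate features.
    prf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                 min_samples_split=min_samples_split,
                                 n_jobs=n_jobs)
    # Totally random trees: one random feature and a random threshold per node.
    crf = ExtraTreesClassifier(n_estimators=n_trees, max_features=1,
                               min_samples_split=min_samples_split,
                               n_jobs=n_jobs)
    return prf, crf
```

Note that `RandomForestClassifier(max_features=1)` also draws a single random feature per node but still optimizes the threshold for it. Extra-trees do not bootstrap by default, so an out-of-bag score would additionally require `bootstrap=True`.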

Can you reproduce the results from the paper (the part using only the cascade forest)?

To be specific, I am not able to reproduce the result on the yeast dataset using your cascade forest structure. Could you give me some help? Thanks in advance.

Application on large scale dataset.

Hi,
Your implementation is really elegant. But when I tried your code on the real MNIST dataset, it used almost 100 GB of memory before I force-stopped it. Do you have any idea why?
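A rough estimate shows why multi-grain scanning is memory hungry: every sample is expanded into one row per window position. The sketch below only counts the sliced array itself, assuming square windows, float64 storage, and a window size of 7 chosen purely for illustration. The forests and intermediate copies come on top of this.

```python
def sliced_array_bytes(n_samples, shape_1x, window, stride=1, itemsize=8):
    """Approximate size in bytes of the window-sliced training array."""
    n_rows, n_cols = shape_1x
    per_row = (n_rows - window) // stride + 1   # window positions vertically
    per_col = (n_cols - window) // stride + 1   # window positions horizontally
    n_windows = per_row * per_col
    return n_samples * n_windows * window * window * itemsize


# 60000 MNIST images of 28x28 pixels, 7x7 window, stride 1:
print(sliced_array_bytes(60000, (28, 28), 7) / 1e9)  # about 11.4 (GB)
```

Under these assumptions a single window size already needs about 11 GB for one copy of the sliced data, so several window sizes plus temporary copies can plausibly reach the figure reported above. Larger strides or a subsample of the training set reduce it proportionally.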

How is the trained model saved?

Hello,
Thank you very much for writing this program! But I have some questions.

One, I can't find the trained model. Is there a file or log where the trained model is saved, so that it can be used directly on test samples instead of feeding the test set into the model together with the training set?
Two, I saw that you use the out-of-bag error to evaluate the model. If I want to use cross-validation instead, which piece of code should be modified? I would also like to ask whether cross-validation is better than the out-of-bag error for this model.
Three, where should other kinds of classifiers be added? It seems that you only wrote one classifier in your code.
Four, I am very confused about the training of the decision trees in the cascade forest. I don't understand how the probability distribution is used as a feature for training them.

Please forgive my poor English and so many questions; they have really confused me for a long time.
Can you explain these points to me? Thank you in advance!
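On the fourth question, the idea in the paper is that each forest in a cascade layer outputs a class-probability vector per sample, and these vectors are concatenated with the original features to form the input of the next layer. The sketch below illustrates that concatenation with NumPy. The function name `augment_features` is hypothetical, and the repository's own `_create_feat_arr` may differ in detail.

```python
import numpy as np


def augment_features(X, class_probas):
    """Append each forest's class-probability matrix to the feature matrix.

    X            : array of shape (n_samples, n_features)
    class_probas : list of arrays, each of shape (n_samples, n_classes)
    """
    return np.concatenate([np.asarray(X)] + [np.asarray(p) for p in class_probas],
                          axis=1)


X = np.zeros((4, 3))                      # 4 samples, 3 original features
proba_a = np.full((4, 2), 0.5)            # forest A, 2 classes
proba_b = np.full((4, 2), 0.5)            # forest B, 2 classes
print(augment_features(X, [proba_a, proba_b]).shape)  # (4, 7)
```

The next layer's trees then split on the probability columns exactly as they would on any other numeric feature.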

Can gcForest do regression tasks?

Is there a regression version of gcForest available so far?

TypeError: it does not seem to be a scikit-learn estimators as it does not implement a 'get_params' methods

TypeError: Cannot clone object '<GCForest.gcForest object at 0x00000000029FD240>' <type <class 'GCForest.gcForest'>>:it does not seem to be a scikit-learn estimators as it does not implement a 'get_params' methods
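This message comes from scikit-learn's `clone`, which is used by utilities such as cross-validation and grid search and requires the estimator to expose `get_params`. The sketch below shows the minimal interface in plain Python. `ParamsMixin` and `SketchForest` are illustrative stand-ins, not the real gcForest class, and the parameter list is shortened.

```python
class ParamsMixin:
    """Minimal get_params/set_params in the style scikit-learn expects."""

    _param_names = ()

    def get_params(self, deep=True):
        # Return the constructor arguments exactly as they were stored.
        return {name: getattr(self, name) for name in self._param_names}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self._param_names:
                raise ValueError("Invalid parameter %r" % name)
            setattr(self, name, value)
        return self


class SketchForest(ParamsMixin):
    _param_names = ("window", "stride", "tolerance")

    def __init__(self, window=None, stride=1, tolerance=0.0):
        # Store every argument unchanged under its own name.
        self.window = window
        self.stride = stride
        self.tolerance = tolerance
```

In practice, inheriting from `sklearn.base.BaseEstimator` provides both methods automatically, as long as `__init__` stores each argument unchanged under the same attribute name.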

I get an AttributeError: 'gcForest' object has no attribute 'window'

When I run gcf.fit(X, y), I get the error: 'gcForest' object has no attribute 'window'

getting Buffer has wrong number of dimensions (expected 1, got 2) error

I get the following error when calling these lines:

gcf = gcForest(shape_1X=6, window=(4))
gcf.fit(X_train, Y_train)

and the dimensions of my features and target are:

[1686 rows x 6 columns] and [1686 rows x 1 columns]

Slicing Sequence...

ValueError Traceback (most recent call last)
in ()
----> 1 gcf.fit(df[columns].iloc[:last_train_index], df[['is_attributed']].iloc[:last_train_index])

in fit(self, X, y)
97 raise ValueError('Sizes of y and X do not match.')
98
---> 99 mgs_X = self.mg_scanning(X, y)
100 _ = self.cascade_forest(mgs_X, y)
101

in mg_scanning(self, X, y)
146
147 for wdw_size in getattr(self, 'window'):
--> 148 wdw_pred_prob = self.window_slicing_pred_prob(X, wdw_size, shape_1X, y=y)
149 mgs_pred_prob.append(wdw_pred_prob)
150

in window_slicing_pred_prob(self, X, window, shape_1X, y)
176 else:
177 print('Slicing Sequence...')
--> 178 sliced_X, sliced_y = self._window_slicing_sequence(X, window, shape_1X, y=y, stride=stride)
179
180 if y is not None:

in _window_slicing_sequence(self, X, window, shape_1X, y, stride)
264 ind_1X = np.arange(np.prod(shape_1X))
265 inds_to_take = [ind_1X[i:i+window] for i in iter_array]
--> 266 sliced_sqce = np.take(X, inds_to_take, axis=1).reshape(-1, window)
267
268 if y is not None:

/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py in take(a, indices, axis, out, mode)
157 [5, 7]])
158 """
--> 159 return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
160
161

/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
50 def _wrapfunc(obj, method, *args, **kwds):
51 try:
---> 52 return getattr(obj, method)(*args, **kwds)
53
54 # An AttributeError occurs if the object does not have

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in take(self, indices, axis, convert, is_copy, **kwargs)
2246
2247 convert = nv.validate_take(tuple(), kwargs)
-> 2248 return self._take(indices, axis=axis, convert=convert, is_copy=is_copy)
2249
2250 def xs(self, key, axis=0, level=None, drop_level=True):

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _take(self, indices, axis, convert, is_copy)
2148 new_data = self._data.take(indices,
2149 axis=self._get_block_manager_axis(axis),
-> 2150 verify=True)
2151 result = self._constructor(new_data).finalize(self)
2152

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in take(self, indexer, axis, verify, convert)
4262 new_labels = self.axes[axis].take(indexer)
4263 return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4264 axis=axis, allow_dups=True)
4265
4266 def merge(self, other, lsuffix='', rsuffix=''):

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy)
4144 if axis == 0:
4145 new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4146 fill_tuple=(fill_value,))
4147 else:
4148 new_blocks = [blk.take_nd(indexer, axis=axis, fill_tuple=(

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in _slice_take_blocks_ax0(self, slice_or_indexer, fill_tuple)
4190 else:
4191 blknos = algos.take_1d(self._blknos, slobj, fill_value=-1,
-> 4192 allow_fill=allow_fill)
4193 blklocs = algos.take_1d(self._blklocs, slobj, fill_value=-1,
4194 allow_fill=allow_fill)

/opt/conda/lib/python3.6/site-packages/pandas/core/algorithms.py in take_nd(arr, indexer, axis, out, fill_value, mask_info, allow_fill)
1381 func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
1382 mask_info=mask_info)
-> 1383 func(arr, indexer, out, fill_value)
1384
1385 if flip_order:

pandas/_libs/algos_take_helper.pxi in pandas._libs.algos.take_1d_int64_int64()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
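The traceback shows `np.take` being dispatched to a pandas DataFrame, which does not accept a 2-D list of indices. Converting the inputs to NumPy arrays before calling `fit` (for example `df[columns].values` and `df['is_attributed'].values`) is a likely fix. The sketch below reproduces the slicing step on plain arrays; `slice_sequence` is a simplified, hypothetical version of the method in the traceback.

```python
import numpy as np


def slice_sequence(X, y, window, stride=1):
    """Slice every row of X into overlapping windows of length `window`."""
    X = np.asarray(X)                 # a DataFrame would go through pandas' take
    y = np.asarray(y).ravel()         # (n, 1) targets become shape (n,)
    starts = np.arange(0, X.shape[1] - window + 1, stride)
    inds_to_take = [np.arange(s, s + window) for s in starts]
    sliced_X = np.take(X, inds_to_take, axis=1).reshape(-1, window)
    sliced_y = np.repeat(y, len(starts))   # one label per window
    return sliced_X, sliced_y


X = np.arange(12).reshape(2, 6)       # 2 samples, 6 features
sliced_X, sliced_y = slice_sequence(X, [[0], [1]], window=4)
print(sliced_X.shape, sliced_y.shape)  # (6, 4) (6,)
```

With 6 features and a window of 4, each sample yields 3 windows, so the 1686 rows in the report would become 5058 sliced rows.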

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Word vec Feature

train_w2v = wordvec_df.iloc[:31962,:]
test_w2v = wordvec_df.iloc[31962:,:]
xtrain_w2v = train_w2v.iloc[ytrain.index,:]
xvalid_w2v = train_w2v.iloc[yvalid.index,:]

lreg.fit(xtrain_w2v, ytrain)
prediction = lreg.predict_proba(xvalid_w2v)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)
f1_score(yvalid, prediction_int)

ValueError Traceback (most recent call last)
in
5 xvalid_w2v = train_w2v.iloc[yvalid.index,:]
6
----> 7 lreg.fit(xtrain_w2v, ytrain)
8 prediction = lreg.predict_proba(xvalid_w2v)
9 prediction_int = prediction[:,1] >= 0.3

~\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
1530
1531 X, y = check_X_y(X, y, accept_sparse='csr', dtype=dtype, order="C",
-> 1532 accept_large_sparse=solver != 'liblinear')
1533 check_classification_targets(y)
1534 self.classes = np.unique(y)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
717 ensure_min_features=ensure_min_features,
718 warn_on_dtype=warn_on_dtype,
--> 719 estimator=estimator)
720 if multi_output:
721 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
540 if force_all_finite:
541 _assert_all_finite(array,
--> 542 allow_nan=force_all_finite == 'allow-nan')
543
544 if ensure_min_samples > 0:

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57 # for object dtype data, we only check for NaNs (GH-13254)
58 elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

issues about the proportion of the test_size

Hi,
When using gcForest, I tried different test_size values: 0.3, 0.2 and 0.1. With test_size=0.1, gcForest() produced errors. I noticed that cascade_test_size defaults to 0.2.
Is this the problem? Thanks.

Code

X_tr, X_te, y_tr, y_te = train_test_split(sX, sY, test_size=0.2)
gcf = gcForest(shape_1X= [1,X_tr.shape[1]],   window=50, tolerance=0.0)
gcf.fit(X_tr, y_tr)

Errors:

/Users/cheny/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py:458: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
warn("Some inputs do not have OOB scores. "
/Users/cheny/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py:463: RuntimeWarning: divide by zero encountered in true_divide
predictions[k].sum(axis=1)[:, np.newaxis])
/Users/cheny/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py:463: RuntimeWarning: invalid value encountered in true_divide
predictions[k].sum(axis=1)[:, np.newaxis])
Adding/Training Layer, n_layer=1
ValueError Traceback (most recent call last)
in
1 gcf = gcForest(shape_1X= [1,X_tr.shape[1]], window=50, tolerance=0.0)
----> 2 gcf.fit(X_tr, y_tr)

~/Documents/tools/deepL/gcForest-master/GCForest.py in fit(self, X, y)
124
125 mgs_X = self.mg_scanning(X, y)
--> 126 _ = self.cascade_forest(mgs_X, y)
127
128 def predict_proba(self, X):

~/Documents/tools/deepL/gcForest-master/GCForest.py in cascade_forest(self, X, y)
345
346 self.n_layer += 1
--> 347 prf_crf_pred_ref = self._cascade_layer(X_train, y_train)
348 accuracy_ref = self._cascade_evaluation(X_test, y_test)
349 feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)

~/Documents/tools/deepL/gcForest-master/GCForest.py in _cascade_layer(self, X, y, layer)
409 print('Adding/Training Layer, n_layer={}'.format(self.n_layer))
410 for irf in range(n_cascadeRF):
--> 411 prf.fit(X, y)
412 crf.fit(X, y)
413 setattr(self, 'casprf{}{}'.format(self.n_layer, irf), prf)

~/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
248
249 # Validate or convert input data
--> 250 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
251 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
252 if sample_weight is not None:

~/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
571 if force_all_finite:
572 _assert_all_finite(array,
--> 573 allow_nan=force_all_finite == 'allow-nan')
574
575 shape_repr = _shape_repr(array.shape)

~/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57
58

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
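The warnings at the top of this log hint at a possible cause: when some samples never fall out of bag, their vote counts sum to zero, and normalizing them divides 0 by 0, which yields NaN in the out-of-bag probabilities passed on to the cascade. The NumPy sketch below reproduces that arithmetic only; the helper names are hypothetical and this is not the scikit-learn code itself.

```python
import numpy as np


def normalize_votes(votes):
    """Turn per-class vote counts into probabilities, row by row."""
    votes = np.asarray(votes, dtype=np.float64)
    totals = votes.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return votes / totals          # rows with zero votes become NaN


def rows_without_votes(votes):
    """Indices of samples that received no out-of-bag vote at all."""
    return np.flatnonzero(np.asarray(votes).sum(axis=1) == 0)


votes = [[3, 1], [0, 0]]               # second sample was never out of bag
print(normalize_votes(votes))
print(rows_without_votes(votes))
```

If this is indeed the cause, using more trees (the `n_mgsRFtree` and `n_cascadeRFtree` parameters seen in another issue here) makes it less likely that a sample is in the bootstrap of every tree.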

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('Data_Train.csv')
pd.isnull(df).sum() > 0  # which columns contain missing values
data = df.describe()
df.isnull()
df.describe().columns
x1 = df[['Year', 'Seats']].values
y1 = df[['Price']]
x_train, x_test, y_train, y_test = train_test_split(x1, y1)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
linreg = LinearRegression()
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)

AttributeError: 'gcForest' object has no attribute 'window'

When I set the parameter 'window' to None, this AttributeError is raised.

I think this is because the 'fit' function calls getattr to look up 'window'. When window is None, it reports that there is no attribute 'window'.
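A note on the mechanism: `getattr` does not fail merely because an attribute's value is None. It raises AttributeError only when the attribute was never assigned. One plausible explanation, assumed here and not checked against the source, is that the constructor skips the assignment when `window` is None. The two toy classes below are hypothetical and only demonstrate the difference.

```python
class SkipsNone:
    """Constructor that does not store arguments left at None."""

    def __init__(self, window=None):
        if window is not None:
            self.window = window


class AlwaysSets:
    """Constructor that stores the argument even when it is None."""

    def __init__(self, window=None):
        self.window = window


print(hasattr(SkipsNone(), "window"))     # False: getattr would raise
print(getattr(AlwaysSets(), "window"))    # None: no error
print(getattr(SkipsNone(), "window", None))  # None: default avoids the error
```

If that is what happens, passing an explicit window (for example a list with one size) would sidestep the error.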

Package installation

How can I install the GCForest package in Jupyter Notebook?

Regression variation

Can this library or technique be adapted to do regression, like the RandomForestRegressor in sklearn?

When the predicted output is of a continuous type, this library throws an error.

How to combine reinforcement learning with gcForest?

Official code from the authors of the paper has been released

http://lamda.nju.edu.cn/code_gcForest.ashx?AspxAutoDetectCookieSupport=1
&
https://github.com/kingfengji/gcForest

Input contains NaN, infinity or a value too large for dtype('float64').

x_train,x_val,y_train,y_val=train_test_split(input_predictors,output_target,test_size=0.20,random_state=7)
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
x_train=x_train.astype(np.float64,copy=False)
y_train=y_train.astype(np.float64,copy=False)

logreg.fit(x_train,y_train)
When I run the fit line, it shows:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
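Casting with `astype(np.float64)` changes the dtype but keeps any NaN or inf values, so the check inside `fit` still fails. One simple remedy is to drop the affected rows before fitting. The sketch below assumes purely numeric inputs, and `drop_nonfinite_rows` is a hypothetical helper name.

```python
import numpy as np


def drop_nonfinite_rows(X, y):
    """Keep only the samples whose features and target are all finite."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64).ravel()
    keep = np.isfinite(X).all(axis=1) & np.isfinite(y)
    return X[keep], y[keep]


X = [[1.0, 2.0], [np.nan, 3.0], [4.0, np.inf], [5.0, 6.0]]
y = [0, 1, 0, 1]
X_clean, y_clean = drop_nonfinite_rows(X, y)
print(X_clean.shape, y_clean)  # (2, 2) [0. 1.]
```

Dropping rows discards data. Imputing the missing entries is the alternative when too many rows are affected.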

no saveModel function?

Hi,
Your implementation is good. There is a function gcf.fit(X_train, y_train) for training and gcf.predict(X_test) for testing.
Is there a function like saveModel() for saving the result of gcf.fit(X_train, y_train), and a function like loadModel() for loading it again?
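Nothing in this thread shows a built-in save function, but Python's standard `pickle` module can usually serialize a fitted estimator object as a whole. The sketch below assumes the fitted `gcf` object is picklable, which has not been verified here. `save_model` and `load_model` are hypothetical helper names.

```python
import pickle


def save_model(model, path):
    """Serialize a fitted model object to a file."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load a model previously written by save_model."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Typical use would be `save_model(gcf, "gcf.pkl")` after `gcf.fit(X_train, y_train)`, followed later by `gcf = load_model("gcf.pkl")` and `gcf.predict(X_test)`. Only unpickle files from a trusted source, and load them with the same library versions used for saving.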

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # Handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    # Return outside the `if` so a tweet with no known tokens yields zeros, not None.
    return vec

wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
for i in range(len(tokenized_tweet)):
    wordvec_arrays[i, :] = word_vector(tokenized_tweet[i], 200)
wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape

train_w2v = wordvec_df.iloc[:31962,:]
test_w2v = wordvec_df.iloc[31962:,:]
xtrain_w2v = train_w2v.iloc[ytrain.index,:]
xvalid_w2v = train_w2v.iloc[yvalid.index,:]

lreg.fit(xtrain_w2v, ytrain)
prediction = lreg.predict_proba(xvalid_w2v)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)
f1_score(yvalid, prediction_int)

So, can Deep Forest do regression tasks?

I would appreciate it if you or someone else could offer a regression version.

I tested the code on MNIST and on another structured dataset

but my results show that gcForest is not as powerful as CNN and XGBoost. Am I doing something wrong?
