rgf-team / rgf

Home repository for the Regularized Greedy Forest (RGF) library. It includes the original implementation from the paper and a multithreaded one written in C++, along with various language-specific wrappers.

Python 9.12% CMake 0.34% Makefile 0.13% C++ 72.59% Perl 0.29% Logos 4.94% Yacc 1.56% Shell 0.42% Jupyter Notebook 4.95% Dockerfile 0.05% R 5.20% PowerShell 0.41%
machine-learning ml decision-trees ensemble-model decision-forest regularized-greedy-forest rgf kaggle

rgf's Introduction

Regularized Greedy Forest

Regularized Greedy Forest (RGF) is a tree ensemble machine learning method described in this paper. RGF can deliver better results than gradient boosted decision trees (GBDT) on a number of datasets and it has been used to win a few Kaggle competitions. Unlike the traditional boosted decision tree approach, RGF works directly with the underlying forest structure. RGF integrates two ideas: one is to include tree-structured regularization into the learning formulation; and the other is to employ the fully-corrective regularized greedy algorithm.

This repository contains the following implementations of the RGF algorithm:

  • RGF: original implementation from the paper;
  • FastRGF: multi-core implementation with some simplifications;
  • rgf_python: wrapper of both RGF and FastRGF implementations for Python;
  • R package: wrapper of rgf_python for R.

More information about RGF can be found in the posts collected in Awesome RGF.

rgf's People

Contributors

ankane, eyadsibai, fukatani, jameslamb, mlampros, niknoproblems, seans84, strikerrus, vmarkovtsev

rgf's Issues

Sparse matrices

I saw that you had ideas about sparse matrices but then deleted your commits. Maybe you could create a new branch and commit there? Then I could help with this issue.

Reduce test job time

  • Cache dependencies.
  • Reduce the number of tested versions (mainly on AppVeyor)?
    Is it possible to omit the Python 3.5 and 3.6 tests on AppVeyor?

predict_proba returns strange values

Hello!
I'm checking out your awesome wrapper and found that the predict_proba method returns values like these (this is for a binary problem):

array([[ 1.76994984, -0.76994984],
       [ 4.3186308 , -3.3186308 ],
       [ 4.4845848 , -3.4845848 ],
       ..., 
       [ 3.2685191 , -2.2685191 ],
       [ 1.10539214, -0.10539214],
       [ 3.4617065 , -2.4617065 ]])

Each pair sums to 1, but the values are not probabilities.
I think the problem is with my executable. Could you please share the executable in the repository?

Many thanks.
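
The reported rows look like a raw score s paired with 1 - s, which explains why each pair sums to 1 without being a probability. Below is a minimal sketch of the mapping one would expect instead. It assumes the first column is an unbounded raw score and that a logistic sigmoid is the intended squashing function; the helper name to_probability_pair is hypothetical and not part of rgf_python.

```python
import math


def sigmoid(score):
    """Numerically stable logistic function mapping a real score into [0, 1]."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


def to_probability_pair(raw_score):
    """Hypothetical helper: turn one raw score into a (p, 1 - p) pair."""
    p = sigmoid(raw_score)
    return p, 1.0 - p


# Raw first-column values copied from the report above.
for raw in (1.76994984, 4.3186308, 1.10539214):
    first, second = to_probability_pair(raw)
    # Both entries now lie in [0, 1] and still sum to 1.
    print(round(first, 4), round(second, 4))
```

If the values printed by a real installation already lie in [0, 1], the squashing step is being applied and the cause is elsewhere.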

Remove Deprecation Warning about cross validation.

Scikit-learn will remove the old CV iterators in a future version.
We have to update our iterator.

e.g.

/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Model learning result is not found in C:\Users\hp\temp\rgf. This is rgf_python error.

Hello,

I have read the previous thread on the same error, but it does not seem to solve my problem: in the previous case the dataset included strings, whereas my data is entirely numeric. Could you please let me know what the problem could be?

Much appreciated !

skf = StratifiedKFold(n_splits = kfold, random_state=1)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, X_eval = X[train_index], X[test_index]
    y_train, y_eval = y[train_index], y[test_index]
   
    rgf_model = RGFClassifier(max_leaf=400,
                    algorithm="RGF_Sib",
                    test_interval=100,
                    verbose=True).fit( X_train, y_train)
    pred = rgf_model.predict_proba(X_eval)[:,1]
    print( "Gini = ", eval_gini(y_eval, pred) )

and

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-17-b27ba3506d06> in <module>()
     12                     test_interval=100,
     13                     verbose=True).fit( X_train, y_train)
---> 14     pred = rgf_model.predict_proba(X_eval)[:,1]
     15     print( "Gini = ", eval_gini(y_eval, pred) )

C:\Anaconda3\lib\site-packages\rgf\sklearn.py in predict_proba(self, X)
    644                              % (self._n_features, n_features))
    645         if self._n_classes == 2:
--> 646             y = self._estimators[0].predict_proba(X)
    647             y = _sigmoid(y)
    648             y = np.c_[y, 1 - y]

C:\Anaconda3\lib\site-packages\rgf\sklearn.py in predict_proba(self, X)
    796         if not model_files:
    797             raise Exception('Model learning result is not found in {0}. '
--> 798                             'This is rgf_python error.'.format(_TEMP_PATH))
    799         latest_model_loc = sorted(model_files, reverse=True)[0]
    800 

Exception: Model learning result is not found in C:\Users\hp\temp\rgf. This is rgf_python error.

multi label classification

First of all, thank you for the useful Python library (by the way, I made an R wrapper of rgf_python, which can be installed on Linux but is somewhat cumbersome to install on Macintosh and Windows).

I'm currently experimenting with a multi-label classification data set, and the classes have the following format:

      toxic severe_toxic obscene threat insult identity_hate
 1:     0            0       0      0      0             0
 2:     0            0       0      0      0             0
..
 6:     0            0       0      0      0             0
 7:     1            1       1      0      1             0
 8:     0            0       0      0      0             0
....
13:     1            0       0      0      0             0
....
16:     0            0       0      0      0             0
17:     1            0       0      0      0             0
18:     0            0       0      0      0             0

So some classes overlap, which means sigmoid (rather than softmax) would be appropriate, as is the case for binary classification tasks. I'm wondering if this format can be parallelized, as is the case for the iris data set. However, I guess the input for the response y should then be not a 1-dimensional array but an array with 6 columns (the latter currently throws an error).
I took a closer look at rgf_python, and it seems to me that if this were possible, the code would have to be modified in at least two places: fit_multi_class_task and predict_proba.
Thanks in any case.

multi-jobs for binary classifier

Hi, when I set the n_jobs parameter to 8, only one CPU is running. I found that this parameter is used in this. Does it mean the binary classifier does not support parallel execution?

Thanks.

dump RGF and FastRGF to the JSON file

Initial support for dumping the RGF model is already implemented in #161. At present it's possible to print the model to the console, but it would be a good idea to add the ability to dump the model to a file (e.g. JSON).

@StrikerRUS:

I really like the new features introduced in this PR. But please think about a "real dump" of a model. I suppose it'll be more useful than just printing to the console.

@fukatani:

For example, a dump in JSON format like LightGBM's.
It's convenient and we may support it in the future, but we should do it in another PR.

how to build the binaries yourself

In the Readme it says:

If you have any problems while installing by methods listed above you should build RGF executable file from binaries by your own and place compiled executable file into directory which is included in environmental variable 'PATH' or into directory with installed package.

I had an installation without problems on one machine, but now for some reason on a different machine, it is complaining. Are there any further instructions or links which explain how to do this? Here is what happens when I try to install:

$ python setup.py install
running install
INFO:rgf_python:Starting to compile executable file.
INFO:rgf_python:Trying to build executable file with g++ from existing makefile.
WARNING:rgf_python:Building executable file with g++ from existing makefile failed.
INFO:rgf_python:Trying to build executable file with CMake.
ERROR:rgf_python:Compilation of executable file failed. Please build from binaries by your own and specify path to the compiled file in the config file.
running build
running build_py
running egg_info
writing rgf_python.egg-info/PKG-INFO
writing dependency_links to rgf_python.egg-info/dependency_links.txt
writing requirements to rgf_python.egg-info/requires.txt
writing top-level names to rgf_python.egg-info/top_level.txt
reading manifest file 'rgf_python.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'makefile' under directory 'include/rgf/build'
warning: no previously-included files matching '*.pyo' found anywhere in distribution
warning: no previously-included files matching '*.pyc' found anywhere in distribution
writing manifest file 'rgf_python.egg-info/SOURCES.txt'
running install_lib
ERROR:rgf_python:Cannot find executable file. Installing without it.
running install_egg_info
removing '/home/ubuntu/.pyenv/versions/miniconda3-latest/envs/myenv/lib/python3.6/site-packages/rgf_python-2.0.3-py3.6.egg-info' (and everything under it)
Copying rgf_python.egg-info to /home/ubuntu/.pyenv/versions/miniconda3-latest/envs/myenv/lib/python3.6/site-packages/rgf_python-2.0.3-py3.6.egg-info
running install_scripts

Here is some information about g++ on the system for which the installation failed:

$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.9.4-2ubuntu1~16.04' --with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.9 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.9.4 (Ubuntu 4.9.4-2ubuntu1~16.04) 

Model learning result is not found

Hi fukatani,

I installed rgf and then altered the loc_exec/loc_temp paths in rgf.py.
When I use rgf_python, fit() works, but rgf.predict() and rgf.score() do not. The error message is:

/usr/local/lib/python2.7/dist-packages/rgf_sklearn-0.0.1-py2.7.egg/rgf/rgf.pyc in predict_proba(self, X)
    357         model_glob = loc_temp + os.sep + self.file_prefix + "*"
    358         if not glob(model_glob):
--> 359             raise Exception('Model learning result is not found @{0}. This is rgf_python error.'.format(loc_temp))
    360         latest_model_loc = sorted(glob(model_glob), reverse=True)[0]
    361 

Exception: Model learning result is not found @/tmp/rgf. This is rgf_python error.

Thanks for reading.

[FastRGF] small weights lead to the crash of executable

FastRGF doesn't work with small weights. The minimum weight value at which the crash doesn't happen depends on the number of samples, so we cannot provide a universal threshold for users because each case is unique. The C++ code should be fixed to resolve this issue.

More informative error message

Check the executable file, check the temp file directory, and check errors from the executable file.
If any of them is invalid, rgf_python has to output an informative message.
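
The first two checks could be sketched as follows. This is only an illustration using the standard library: the function name diagnose_setup and both of its arguments are hypothetical, and it does not cover capturing errors from the executable itself.

```python
import os
import tempfile


def diagnose_setup(exe_path, temp_dir):
    """Return a list of human-readable problems; an empty list means none found.

    Both arguments are hypothetical stand-ins for the configured
    executable location and temporary directory.
    """
    problems = []
    if not os.path.isfile(exe_path):
        problems.append("Executable not found: {0}".format(exe_path))
    elif not os.access(exe_path, os.X_OK):
        problems.append("File exists but is not executable: {0}".format(exe_path))
    if not os.path.isdir(temp_dir):
        problems.append("Temp directory does not exist: {0}".format(temp_dir))
    elif not os.access(temp_dir, os.W_OK):
        problems.append("Temp directory is not writable: {0}".format(temp_dir))
    return problems


# Example call with a placeholder executable path.
for message in diagnose_setup("/path/to/rgf", tempfile.gettempdir()):
    print(message)
```

Collecting all problems before reporting lets a single message name every misconfiguration at once instead of failing on the first one.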

Support f_ratio?

I found an undocumented parameter, f_ratio, in RGF.
It corresponds to LightGBM's feature_fraction and XGBoost's colsample_bytree.

I tried this parameter with the Boston regression example.
With a small max_leaf (300), f_ratio=0.9 improves the score from 11.8 to 11.0,
but with a large max_leaf (5000), f_ratio=0.95 degrades the score from 10.19810 to 10.34.

After all, is there no value in using f_ratio < 1.0?

predict_proba fails inside GridSearchCV

Hello!

I'm trying to tune parameters with scikit-learn's GridSearchCV and every time it fails with: "Exception: Model learning result is not found @d:\rgf\temp. This is rgf_python error."

The full traceback:

Exception                                 Traceback (most recent call last)
<ipython-input-15-e7385635c948> in <module>()
      7 grid = GridSearchCV(RGFClassifier(verbose = 5,),
      8                     param_grid = param_grid, cv = 5, verbose = 5)
----> 9 grid.fit(train_X, train_y)
     10 print("The best parameters are {0} with a score of {1:.7f}.".format(grid.best_params_, grid.best_score_))
     11 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups)
    943             train/test set.
    944         """
--> 945         return self._fit(X, y, groups, ParameterGrid(self.param_grid))
    946 
    947 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in _fit(self, X, y, groups, parameter_iterable)
    562                                   return_times=True, return_parameters=True,
    563                                   error_score=self.error_score)
--> 564           for parameters in parameter_iterable
    565           for train, test in cv_iter)
    566 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
    756             # was dispatched. In particular this covers the edge
    757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
    759                 self._iterating = True
    760             else:

C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
    606                 return False
    607             else:
--> 608                 self._dispatch(tasks)
    609                 return True
    610 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
    569         dispatch_timestamp = time.time()
    570         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571         job = self._backend.apply_async(batch, callback=cb)
    572         self._jobs.append(job)
    573 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
    107     def apply_async(self, func, callback=None):
    108         """Schedule a func to be run"""
--> 109         result = ImmediateResult(func)
    110         if callback:
    111             callback(result)

C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
    324         # Don't delay the application, to avoid keeping the input
    325         # arguments in memory
--> 326         self.results = batch()
    327 
    328     def get(self):

C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

C:\Program Files\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
    261         score_time = time.time() - start_time - fit_time
    262         if return_train_score:
--> 263             train_score = _score(estimator, X_train, y_train, scorer)
    264 
    265     if verbose > 2:

C:\Program Files\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _score(estimator, X_test, y_test, scorer)
    286         score = scorer(estimator, X_test)
    287     else:
--> 288         score = scorer(estimator, X_test, y_test)
    289     if hasattr(score, 'item'):
    290         try:

C:\Program Files\Anaconda3\lib\site-packages\sklearn\metrics\scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
    217 def _passthrough_scorer(estimator, *args, **kwargs):
    218     """Function that wraps estimator.score"""
--> 219     return estimator.score(*args, **kwargs)
    220 
    221 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\base.py in score(self, X, y, sample_weight)
    347         """
    348         from .metrics import accuracy_score
--> 349         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    350 
    351 

C:\Program Files\Anaconda3\lib\site-packages\rgf\rgf.py in predict(self, X)
    254             The predicted classes.
    255         """
--> 256         proba = self.predict_proba(X)
    257         return np.argmax(proba, axis=1)
    258 

C:\Program Files\Anaconda3\lib\site-packages\rgf\rgf.py in predict_proba(self, X)
    224             proba = np.zeros((X.shape[0], self.n_classes_))
    225             for i, clf in enumerate(self.estimators):
--> 226                 class_proba = clf.predict_proba(X)
    227                 proba[:, i] = class_proba
    228 

C:\Program Files\Anaconda3\lib\site-packages\rgf\rgf.py in predict_proba(self, X)
    357         model_glob = loc_temp + os.sep + self.file_prefix + "*"
    358         if not glob(model_glob):
--> 359             raise Exception('Model learning result is not found @{0}. This is rgf_python error.'.format(loc_temp))
    360         latest_model_loc = sorted(glob(model_glob), reverse=True)[0]
    361 

Exception: Model learning result is not found @D:\rgf\temp. This is rgf_python error.

and the code:

import numpy as np  # needed for np.random.seed below
from sklearn.model_selection import GridSearchCV
from rgf.rgf import RGFClassifier

max_leaf_range = [1000, 1500,]
param_grid = dict(max_leaf = max_leaf_range,)
np.random.seed(42)
grid = GridSearchCV(RGFClassifier(verbose = 5,),
                    param_grid = param_grid, cv = 5, verbose = 5)
grid.fit(train_X, train_y)
print("The best parameters are {0} with a score of {1:.7f}.".format(grid.best_params_, grid.best_score_))

But when I use method score directly (like in your example) everything is OK.

from sklearn.model_selection import StratifiedKFold
from rgf.rgf import RGFClassifier

rgf_score = 0
n_folds = 5

rgf = RGFClassifier(max_leaf = 2000,)

for train_idx, test_idx in StratifiedKFold(n_folds).split(train_X, train_y):
    xs_train = train_X[train_idx]
    y_train = train_y[train_idx]
    xs_test = train_X[test_idx]
    y_test = train_y[test_idx]
    rgf.fit(xs_train, y_train)
    rgf_score += rgf.score(xs_test, y_test)

rgf_score /= n_folds
print('RGF Classifier score: {0:.7f}.'.format(rgf_score))

I'm using Windows 10 if it's important.

Installation failed because of encoding problem

Hello, I just cloned the latest code and installed with pip. I found that it failed with the following output:

    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wi7rcu1o-build/setup.py", line 196, in <module>
        long_description=read('Readme.rst'),
      File "/tmp/pip-wi7rcu1o-build/setup.py", line 20, in read
        return open(os.path.join(CURRENT_DIR, filename)).read()
      File "/homes/willzhqiang/anaconda3/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3119: ordinal not in range(128)

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-wi7rcu1o-build/

So I modified the setup.py file and changed this line
return open(os.path.join(CURRENT_DIR, filename)).read()
to
return open(os.path.join(CURRENT_DIR, filename), encoding='utf-8').read()
and it is installed successfully now.
Please help correct it.
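
A self-contained sketch of the same fix is shown below. It uses io.open so that the encoding argument is also accepted on Python 2, and a with-block so the file is closed promptly. The demo file and its contents are made up for illustration; only the read helper mirrors the change described above.

```python
import io
import os
import tempfile


def read(path):
    """Read a text file as UTF-8 regardless of the locale's default encoding."""
    # io.open accepts `encoding` on both Python 2 and Python 3.
    with io.open(path, encoding="utf-8") as f:
        return f.read()


# Throwaway file containing a non-ASCII character. U+00A0 (no-break space)
# is encoded in UTF-8 as the bytes 0xC2 0xA0, so its first byte matches the
# 0xc2 reported in the traceback above.
demo_path = os.path.join(tempfile.mkdtemp(), "Readme.rst")
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"RGF\u00a0readme")

print(read(demo_path) == u"RGF\u00a0readme")
```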

Support wheels

Since rgf_python doesn't have any special requirements (for compiler, environment, etc.), I think it's a good idea to have wheels on PyPI (and the sources in .tar.gz, of course). I believe providing successfully compiled binaries will prevent many strange errors like the recent ones.

We need wheels for two platform groups: one for macOS and Linux, and another for Windows.

The final result should be similar to this one:
image

Each wheel for each platform should also have 32-bit and 64-bit versions.

We could get the binaries from Travis and AppVeyor as artifacts (I can do this). The one problem I see now is that Travis doesn't have 32-bit machines, but I believe we'll overcome this problem 😃 .

@fukatani When you have time, please look into how to name wheels appropriately for their target platforms and how to upload them to PyPI. Or I can do it later.

Cannot import name 'RGFClassifier'

I am getting the above error. I have built rgf1.2 and tested it using rgf1.2's own Perl test script, which works. I have installed rgf_python and run the Python setup as specified. I have changed the two folder locations to rgf1.2..\rgf executable and a temp folder that exists.

In Python, when I try to import, I get the error Cannot import name 'RGFClassifier'. I tried to run the exact code from the test.py script provided with rgf_python, and the same error occurs.

Strangely, I have /usr/local/lib/python3.5/dist-packages/rgf_sklearn-0.0.0-py3.5.egg/rgf in my path when I run

import sys
sys.path

in Python. Also, in /usr/local/lib/python3.5/dist-packages I only have rgf_sklearn-0.0.0-py3.5.egg and no rgf-sklearn, which I would expect given that the following appeared towards the end of the setup.py output:

Extracting rgf_sklearn-0.0.0-py3.5.egg to /usr/local/lib/python3.5/dist-packages
Adding rgf-sklearn 0.0.0 to easy-install.pth file

is it worth to refactor sparse_savetxt?

I've accidentally found that sklearn contains functions for working with svmlight files (the sparse files [Fast]RGF works with). 😃
Should we replace our own function sparse_savetxt() with sklearn's? I think we should benchmark with rather big sparse datasets. At first glance it will be an easy one-line solution for FastRGF but some headache for RGF, because its format is not pure svmlight (a header with the number of features, no ability to save y in the same file, etc.).

https://github.com/scikit-learn/scikit-learn/blob/a24c8b464d094d2c468a16ea9f8bf8d42d949f84/sklearn/datasets/svmlight_format.py#L376
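
For reference, the plain svmlight text format stores one sample per line as a label followed by index:value pairs for the non-zero features. The sketch below writes that format in pure Python to make the layout concrete. It is not scikit-learn's implementation (scikit-learn exposes sklearn.datasets.dump_svmlight_file for this), it assumes 1-based feature indices, and it does not attempt the RGF-specific variations mentioned above. The function name to_svmlight_lines is hypothetical.

```python
def to_svmlight_lines(rows, labels):
    """Format sparse rows as svmlight text lines.

    `rows` is a list of dicts mapping a 1-based feature index to a value;
    zero-valued features are left out of the output.
    """
    lines = []
    for features, label in zip(rows, labels):
        pairs = ["{0}:{1:g}".format(index, value)
                 for index, value in sorted(features.items())
                 if value != 0]
        lines.append(" ".join([format(label, "g")] + pairs))
    return lines


# Two samples with four possible features each.
rows = [{1: 0.5, 4: 2.0}, {3: 1.0, 2: 0}]
labels = [1, 0]
for line in to_svmlight_lines(rows, labels):
    print(line)
```

A benchmark against sparse_savetxt() would still be needed before deciding on a replacement, as the post suggests.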

rgf is not executable file. Please set config flag 'exe_location' to RGF execution file.

@fukatani Yesterday, while testing wheels on 32-bit Ubuntu, I faced this problem. And the good news: I successfully solved it! :-)

See logs:

nikita@nikita-VirtualBox:~$ git clone https://github.com/fukatani/rgf_python.git
Cloning into 'rgf_python'...
remote: Counting objects: 1368, done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 1368 (delta 26), reused 30 (delta 11), pack-reused 1315
Receiving objects: 100% (1368/1368), 4.74 MiB | 1.50 MiB/s, done.
Resolving deltas: 100% (684/684), done.
Checking connectivity... done.
nikita@nikita-VirtualBox:~$ python rgf_python/tests/test.py
Traceback (most recent call last):
  File "rgf_python/tests/test.py", line 16, in <module>
    from rgf.sklearn import RGFClassifier, RGFRegressor, _cleanup, _get_temp_path
  File "/home/nikita/.local/lib/python3.6/site-packages/rgf/sklearn.py", line 111, in <module>
    "config flag 'exe_location' to RGF execution file.".format(_EXE_PATH))
Exception: /home/nikita/rgf is not executable file. Please set config flag 'exe_location' to RGF execution file.
nikita@nikita-VirtualBox:~$ cd /home/nikita/.local/lib/python3.6/site-packages/rgf
nikita@nikita-VirtualBox:~/.local/lib/python3.6/site-packages/rgf$ ./rgf
bash: ./rgf: Permission denied
nikita@nikita-VirtualBox:~/.local/lib/python3.6/site-packages/rgf$ sudo bash ./rgf
[sudo] password for nikita: 
./rgf: ./rgf: Cannot execute binary file
nikita@nikita-VirtualBox:~/.local/lib/python3.6/site-packages/rgf$ chmod +x rgf
nikita@nikita-VirtualBox:~/.local/lib/python3.6/site-packages/rgf$ cd ~
nikita@nikita-VirtualBox:~$ python rgf_python/tests/test.py
.Score: 1.00000
..Score: 1.00000
./home/nikita/.local/lib/python3.6/site-packages/scipy/sparse/coo.py:370: SparseEfficiencyWarning: Constructing a DIA matrix with 153 diagonals is inefficient
  "is inefficient" % len(diags), SparseEfficiencyWarning)
/home/nikita/.local/lib/python3.6/site-packages/sklearn/utils/validation.py:304: UserWarning: Can't check dok sparse matrix for nan or inf.
  % spmatrix.format)
....."train": 
   algorithm=RGF_Sib
   train_x_fn=/tmp/rgf/ef5fa689-cc11-4325-a09f-e2eb69ca905354.train.data.x
   train_y_fn=/tmp/rgf/ef5fa689-cc11-4325-a09f-e2eb69ca905354.train.data.y
   train_w_fn=/tmp/rgf/ef5fa689-cc11-4325-a09f-e2eb69ca905354.train.data.weight
   Log:ON
   model_fn_prefix=/tmp/rgf/ef5fa689-cc11-4325-a09f-e2eb69ca905354.model
--------------------
Tue Dec  5 01:45:50 2017: Reading training data ... 
Tue Dec  5 01:45:50 2017: Start ... #train=120
--------------------
# And so on...
----------------------------------------------------------------------
Ran 26 tests in 45.779s

OK
nikita@nikita-VirtualBox:~$ 

So the solution was to allow execution of the rgf binary file.

I think we should at least add this workaround to the troubleshooting section of the README. Or maybe you have ideas about how we can avoid this situation (I have one speculation, but I will be able to test it only in a few days).
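
One possible way to apply the same workaround from Python is sketched below, roughly mirroring what chmod +x did in the log above. The function name ensure_executable is hypothetical, and the sketch assumes a POSIX file system; it says nothing about where rgf_python should call it.

```python
import os
import stat
import tempfile


def ensure_executable(path):
    """Add execute permission for every class that already has read permission."""
    mode = os.stat(path).st_mode
    # Shifting the read bits right by two positions lines them up with the
    # matching execute bits (0o400 -> 0o100, 0o040 -> 0o010, 0o004 -> 0o001).
    exec_bits = (mode & (stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)) >> 2
    os.chmod(path, mode | exec_bits)
    return os.access(path, os.X_OK)


# Demonstration on a throwaway file instead of a real rgf binary.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
os.chmod(demo_path, 0o644)
print(ensure_executable(demo_path))
os.remove(demo_path)
```

Granting execute only where read is already allowed avoids widening access beyond what the file's owner originally intended.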

PyPI

I think it's time to increment the version, create new release on GitHub and register rgf_python on PyPI to make it more popular and easy-to-install.

[RGF] Reduce compilation warning.

The compilation warnings in RGF can probably be removed easily.

Warnings from stderr:
In file included from /Users/vincenzolavorini/Downloads/rgf_python/include/rgf/src/com/AzSvDataS.cpp:19:
/Users/vincenzolavorini/Downloads/rgf_python/include/rgf/src/com/AzSvDataS.hpp:353:37: warning: '&&' within '||' [-Wlogical-op-parentheses]
    if (*str == '\0' || *str >= '0' && *str <= '9' || 

Delete temp file aggressively.

As noted in https://www.kaggle.com/tunguz/rgf-target-encoding-0-282-on-lb/code ,
a user's environment sometimes has limited disk capacity (e.g. a Kaggle kernel has 1 GB).

Currently, we clean up temp files with atexit.register.
But some users need to delete temp files at run time.
I think it is natural for users to expect resources to be released by the garbage collector or the __del__ method.

I mean,

rgf = RGFClassifier()
del rgf

But we should not expect the __del__ method to be called every time.
For example, I confirmed that test_parallel_gridsearch doesn't call the __del__ method.

So we should also keep using atexit.register.

Another option to solve this problem is to change the communication method between Python and C++.
For example, XGBoost uses ctypes.
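
A minimal sketch of combining an explicit run-time cleanup with an exit-time safety net via the standard-library atexit.register is given below. The class TempFileOwner and its methods are hypothetical stand-ins for an estimator that writes prefixed files into a temp directory; this is not rgf_python's actual code.

```python
import atexit
import glob
import os
import tempfile
import uuid


class TempFileOwner(object):
    """Hypothetical stand-in for an estimator that writes files to a temp dir."""

    def __init__(self, temp_dir):
        self.temp_dir = temp_dir
        self.file_prefix = str(uuid.uuid4())
        # Safety net: runs at normal interpreter exit even if cleanup()
        # was never called explicitly.
        atexit.register(self.cleanup)

    def write(self, suffix, text):
        """Create one prefixed file and return its path."""
        path = os.path.join(self.temp_dir, self.file_prefix + suffix)
        with open(path, "w") as f:
            f.write(text)
        return path

    def cleanup(self):
        """Delete every file carrying this object's prefix; safe to call twice."""
        removed = 0
        pattern = os.path.join(self.temp_dir, self.file_prefix + "*")
        for path in glob.glob(pattern):
            try:
                os.remove(path)
                removed += 1
            except OSError:
                pass  # already gone or locked by another process
        return removed


owner = TempFileOwner(tempfile.mkdtemp())
owner.write(".model-01", "dummy model")
print(owner.cleanup())  # number of files removed at run time
```

Note that registering the bound method keeps the object alive until interpreter exit, so in this sketch disk space is only freed early through the explicit cleanup() call, not through del.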

pip installation error

Installation without compiling works well.
Working on this issue...

log:

D:\Users\nekit\Downloads\rgf_python>pip install rgf_python
Collecting rgf_python
  Using cached rgf_python-2.0.0.tar.gz
Requirement already satisfied: scikit-learn>=0.18 in c:\program files\anaconda3\lib\site-packages (from rgf_python)
Building wheels for collected packages: rgf-python
  Running setup.py bdist_wheel for rgf-python ... error
  Failed building wheel for rgf-python
  Running setup.py clean for rgf-python
Failed to build rgf-python
Installing collected packages: rgf-python
  Running setup.py install for rgf-python ... error
Exception:
Traceback (most recent call last):
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\compat\__init__.py", line 73, in console_to_str
    return s.decode(sys.__stdout__.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 27: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\commands\install.py", line 342, in run
    prefix=options.prefix_path,
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\req\req_set.py", line 784, in install
    **kwargs
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\req\req_install.py", line 878, in install
    spinner=spinner,
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\utils\__init__.py", line 676, in call_subprocess
    line = console_to_str(proc.stdout.readline())
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\compat\__init__.py", line 75, in console_to_str
    return s.decode('utf_8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 27: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\commands\install.py", line 385, in run
    requirement_set.cleanup_files()
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\req\req_set.py", line 729, in cleanup_files
    req.remove_temporary_source()
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\req\req_install.py", line 977, in remove_temporary_source
    rmtree(self.source_dir)
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\retrying.py", line 212, in call
    raise attempt.get()
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\six.py", line 686, in reraise
    raise value
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\utils\__init__.py", line 102, in rmtree
    onerror=rmtree_errorhandler)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 488, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 378, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 378, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 378, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 387, in _rmtree_unsafe
    onerror(os.rmdir, path, sys.exc_info())
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\utils\__init__.py", line 114, in rmtree_errorhandler
    func(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\nekit\\AppData\\Local\\Temp\\pip-build-eueqhyy7\\rgf-python\\include\\rgf\\Windows\\rgf'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\basecommand.py", line 215, in main
    status = self.run(options, args)
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\commands\install.py", line 385, in run
    requirement_set.cleanup_files()
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\utils\build.py", line 38, in __exit__
    self.cleanup()
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\utils\build.py", line 42, in cleanup
    rmtree(self.name)
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\retrying.py", line 212, in call
    raise attempt.get()
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\six.py", line 686, in reraise
    raise value
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\_vendor\retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\utils\__init__.py", line 102, in rmtree
    onerror=rmtree_errorhandler)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 488, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 378, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 378, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 378, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  [Previous line repeated 1 more times]
  File "C:\Program Files\Anaconda3\lib\shutil.py", line 387, in _rmtree_unsafe
    onerror(os.rmdir, path, sys.exc_info())
  File "C:\Program Files\Anaconda3\lib\site-packages\pip\utils\__init__.py", line 114, in rmtree_errorhandler
    func(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\nekit\\AppData\\Local\\Temp\\pip-build-eueqhyy7\\rgf-python\\include\\rgf\\Windows\\rgf'
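
The first traceback in this log fails inside console_to_str because the subprocess output is decoded strictly as UTF-8, and byte 0xa2 cannot start a UTF-8 sequence. A minimal sketch of a more tolerant decoder follows. It assumes the console was using a legacy Windows code page; the helper name decode_console_bytes and the list of candidate encodings are hypothetical and are not part of pip.

```python
def decode_console_bytes(raw, encodings=("utf-8", "cp866", "cp1251")):
    """Decode subprocess output, trying several encodings in order.

    The candidate encodings are an assumption for illustration; pick the
    ones that match the code page of the console that produced the bytes.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    # Last resort: never raise, mark undecodable bytes with U+FFFD instead.
    return raw.decode("utf-8", errors="replace")


# Usage:
# text = decode_console_bytes(proc.stdout.readline())
```

With only "utf-8" in the list, the helper degrades to replacement characters instead of raising, which keeps the original build error visible rather than masking it with a decoding failure.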

Rename module

Hi @fukatani !

What do you think about renaming the rgf.py module?
Other ML libraries often use this naming convention:
from xgboost.sklearn import XGBClassifier
or
from keras.wrappers.scikit_learn import KerasClassifier

Maybe it would be better to use the following naming?
from rgf.sklearn import RGFClassifier
instead of
from rgf.rgf import RGFClassifier

ModuleNotFoundError: No module named 'rgf.sklearn'; 'rgf' is not a package

For bugs and unexpected issues, please provide the following information so that we can reproduce them on our system.

Environment Info

Operating System: MacOS Sierra 10.12 | Ubuntu 16.04.3 LTS

Python version: 3.6.1

rgf_python version: HEAD (pulled from github)

Whether test.py is passed or not: FAILED (errors=24)

Error Message

ModuleNotFoundError: No module named 'rgf.sklearn'; 'rgf' is not a package

Reproducible Example

from rgf.sklearn import RGFClassifier
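
The message "'rgf' is not a package" suggests that the installed rgf is a single-file module, so rgf.sklearn cannot be resolved. A minimal sketch of an import that tolerates both module layouts mentioned in these issues (rgf.sklearn and rgf.rgf) is shown below. The helper import_first_available is hypothetical and is not part of rgf_python.

```python
import importlib


def import_first_available(module_names, attr):
    """Return `attr` from the first module in `module_names` that provides it."""
    errors = []
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            errors.append("{}: {}".format(name, exc))
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
        errors.append("{}: no attribute {!r}".format(name, attr))
    # Nothing worked: report every attempt so the cause is visible.
    raise ImportError("; ".join(errors))


# Usage (requires rgf_python to be installed):
# RGFClassifier = import_first_available(
#     ["rgf.sklearn", "rgf.rgf"], "RGFClassifier")
```

This only works around the symptom; reinstalling rgf_python so that the installed layout matches the import path is the cleaner fix.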

error:Exception: Model learning result is not found in /tmp/rgf. This is rgf_python error.

How should I deal with this error?

Ran 0 examples: 0 success, 0 failure, 0 error

None
Ran 0 examples: 0 success, 0 failure, 0 error

None
Ran 0 examples: 0 success, 0 failure, 0 error

None
Traceback (most recent call last):
  File "/Users/k.den/Desktop/For_Submission/1_source_code/test.py", line 25, in
    pred = rgf_model.predict_proba(X_eval)[:, 1]
  File "/usr/local/lib/python3.6/site-packages/rgf/sklearn.py", line 652, in predict_proba
    class_proba = clf.predict_proba(X)
  File "/usr/local/lib/python3.6/site-packages/rgf/sklearn.py", line 798, in predict_proba
    'This is rgf_python error.'.format(_TEMP_PATH))
Exception: Model learning result is not found in /tmp/rgf. This is rgf_python error.

Process finished with exit code 1

More Travis tests

Hi @fukatani !
Can you add more platforms (Windows, macOS) to Travis? I don't know how, but it's possible 😄:
[Screenshot from xgboost repo]
Maybe it can help: https://github.com/dmlc/xgboost/blob/master/.travis.yml

If there is a limit on the number of tests, maybe it's better to split the Python versions across platforms: Windows + 2.7, Linux + 3.4, macOS + 3.5 (I think you get the idea).

Use milestone for schedule.

I am starting a trial of GitHub milestones to express my intentions about which PRs and issues should be resolved before releasing version X.X.X.
Any opinions about milestones are welcome
(e.g. "I want to merge this PR before version X.X.X").

In particular, for the next major version, 3.0.0, I intend to treat FastRGF as stable.

Test build for Python 3.3

I can add a test with Python 3.3 to Travis, but it will take 7+ minutes: link. What do you think? Should we add this build test? My opinion: "No" 😃
