
gplearn's People

Contributors

anshulrai, eggachecat, hugovk, ibell, jmmcd, nicodv, pricebenjamin, stefan-niculae, trevorstephens

gplearn's Issues

OSError: [Errno 28] No space left on device

I am training a SymbolicTransformer on a dataset with shape (94159, 935) using n_jobs = 1, and all goes well. However, if I try to use parallel processing to leverage multicore machines by setting n_jobs > 1, I get the following error:

   |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-14-6ceaea1f883d> in <module>()
     10     metric='spearman')
     11 
---> 12 gp.fit(tr_data.todense(), y_train)

/opt/conda/lib/python3.5/site-packages/gplearn/genetic.py in fit(self, X, y, sample_weight)
   1052                                           seeds[starts[i]:starts[i + 1]],
   1053                                           params)
-> 1054                 for i in range(n_jobs))
   1055 
   1056             # Reduce, maintaining order across different n_jobs

/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    766                 # consumption.
    767                 self._iterating = False
--> 768             self.retrieve()
    769             # Make sure that we get a last message telling us we are done
    770             elapsed_time = time.time() - self._start_time

/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    717                     ensure_ready = self._managed_backend
    718                     backend.abort_everything(ensure_ready=ensure_ready)
--> 719                 raise exception
    720 
    721     def __call__(self, iterable):

/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    680                 # check if timeout supported in backend future implementation
    681                 if 'timeout' in getfullargspec(job.get).args:
--> 682                     self._output.extend(job.get(timeout=self.timeout))
    683                 else:
    684                     self._output.extend(job.get())

/opt/conda/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

/opt/conda/lib/python3.5/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    383                         break
    384                     try:
--> 385                         put(task)
    386                     except Exception as e:
    387                         job, ind = task[:2]

/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py in send(obj)
    369             def send(obj):
    370                 buffer = BytesIO()
--> 371                 CustomizablePickler(buffer, self._reducers).dump(obj)
    372                 self._writer.send_bytes(buffer.getvalue())
    373             self._send = send

/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py in __call__(self, a)
    238                     print("Memmaping (shape=%r, dtype=%s) to new file %s" % (
    239                         a.shape, a.dtype, filename))
--> 240                 for dumped_filename in dump(a, filename):
    241                     os.chmod(dumped_filename, FILE_PERMISSIONS)
    242 

/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py in dump(value, filename, compress, protocol, cache_size)
    481     elif is_filename:
    482         with open(filename, 'wb') as f:
--> 483             NumpyPickler(f, protocol=protocol).dump(value)
    484     else:
    485         NumpyPickler(filename, protocol=protocol).dump(value)

/opt/conda/lib/python3.5/pickle.py in dump(self, obj)
    406         if self.proto >= 4:
    407             self.framer.start_framing()
--> 408         self.save(obj)
    409         self.write(STOP)
    410         self.framer.end_framing()

/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py in save(self, obj)
    275 
    276             # And then array bytes are written right after the wrapper.
--> 277             wrapper.write_array(obj, self)
    278             return
    279 

/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py in write_array(self, array, pickler)
     90                                            buffersize=buffersize,
     91                                            order=self.order):
---> 92                 pickler.file_handle.write(chunk.tostring('C'))
     93 
     94     def read_array(self, unpickler):

OSError: [Errno 28] No space left on device

Is this expected for SymbolicTransformer in a parallel setting? I have 160 GB available on disk and 30 GB of RAM available. I don't quite understand what is happening.
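
A hedged workaround sketch, assuming the error comes from joblib memmapping the dense input array to a small temporary location when n_jobs > 1: point joblib's temporary folder at a disk with plenty of space before fitting. JOBLIB_TEMP_FOLDER is a standard joblib environment variable; the path below is illustrative.

import os

# Hypothetical path on a large disk; adjust to your system.
os.environ['JOBLIB_TEMP_FOLDER'] = '/path/to/big/disk/joblib_tmp'

# With n_jobs > 1, joblib may dump the dense (94159, 935) array to its temp
# folder for the workers; redirecting that folder avoids filling a small
# default location such as /tmp or /dev/shm.
gp.fit(tr_data.todense(), y_train)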

Is it possible to specify a starting function

For a number of our problems, we have a rough idea of the functional form that should be close to the "right answer". Is there a way to feed this in as the initial starting point for the symbolic regression algorithm?

Two potential "bugs"

Hi Trevor,

Thanks for developing this tool. It has been really helpful for my research. As I'm using the "greater_is_better" feature of the code, I detected something abnormal originating in genetic.py.

(1)

if isinstance(self, RegressorMixin):
    # Find the best individual in the final generation
    self._program = self._programs[-1][np.argmin(fitness)]

This is only correct when the best program we are seeking is the one with the lowest fitness, which is not the case when "greater_is_better" is set.

(2) Similarly,

if isinstance(self, TransformerMixin):
    # Find the best individuals in the final generation
    fitness = np.array(fitness)
    hall_of_fame = fitness.argsort()[:self.hall_of_fame]

This would select the several top programs with the lowest fitness instead of the highest.
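
A hedged sketch of the kind of change being suggested (not the library's actual patch), assuming the fitness metric exposes a greater_is_better flag as gplearn's fitness objects do:

import numpy as np

def select_best(fitness, hall_of_fame, greater_is_better):
    # Pick the best program index and the hall-of-fame indices while
    # respecting the direction of the metric.
    fitness = np.asarray(fitness)
    if greater_is_better:
        best_idx = np.argmax(fitness)
        hof_idx = fitness.argsort()[::-1][:hall_of_fame]
    else:
        best_idx = np.argmin(fitness)
        hof_idx = fitness.argsort()[:hall_of_fame]
    return best_idx, hof_idx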

Adding pow(), exp() functions

I'm interested in adding pow and exponential functions to the set of functions. Could you please add them? They are very useful in my fitting routines.

Also, it would be nice if you could describe how to add functions to the function library so we can extend it with any function we want.
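
A hedged sketch of how such functions can be added with gplearn.functions.make_function; the clipping bounds below are illustrative guards against overflow, not library defaults:

import numpy as np
from gplearn.functions import make_function
from gplearn.genetic import SymbolicRegressor

def protected_exp(x):
    # Clip the argument so np.exp never overflows float64 (illustrative bound).
    return np.exp(np.clip(x, -50.0, 50.0))

def protected_pow(x1, x2):
    # Keep the base positive and the exponent bounded to avoid inf/NaN results.
    with np.errstate(over='ignore', invalid='ignore', divide='ignore'):
        result = np.power(np.abs(x1), np.clip(x2, -5.0, 5.0))
    return np.where(np.isfinite(result), result, 0.0)

exp_fn = make_function(function=protected_exp, name='exp', arity=1)
pow_fn = make_function(function=protected_pow, name='pow', arity=2)

est = SymbolicRegressor(function_set=['add', 'sub', 'mul', 'div', exp_fn, pow_fn])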

Citation?

We're using gplearn in a research paper. Do you have a preferred citation for gplearn?

force rational functions rather than polynomials

Hi @trevorstephens ,

I still love your work and found a good paper:
Neural networks and rational functions
It shows that rational functions can approximate arbitrary functions better than polynomials. That made me think that gplearn could generate two separate polynomials in parallel and use their ratio as the function approximation.
My argument is the following (without exactly knowing the proof of theorems):

  1. it is proven that a neural network with non-linear activations can approximate any function
  2. it is proven in this paper that a rational function (functions represented as the ratio of two polynomials) can approximate any RELU neural network (which is nonlinear)
    ==>
  3. so the rational functions can approximate any functions

I'm not sure that the performance or the size will be optimal, but I think it is worth a try. What do you think?

guyko

Adding exp() as a make_function raises overflow error

@trevorstephens,
The 0.2.0 release makes it easy to create new functions, as resolved in issue #18.
However, I find that an exponential function runs into errors that prevent it from reaching a result.
In the following example, where we are looking for a simple exponential equation, gplearn encounters invalid values when evaluating some functions. This makes the fitness become NaN, and the algorithm does not seem to converge anywhere.

import numpy as np
from gplearn.genetic import SymbolicRegressor
from gplearn.functions import make_function

def exponent(x):
  return np.exp(x)

X = np.random.randint(0,100,size=(100,3))
y = np.exp(X[:, 0])

X_train , y_train = X[:80,:], y[:80]
X_test , y_test = X[80:,:], y[80:]

exponential = make_function(function=exponent, name='exp', arity=1)
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'max', 'min', 'sin', 'cos', 'tan', exponential]

est_gp = SymbolicRegressor(population_size=5000,
                           generations=20, stopping_criteria=0.01,
                           function_set=function_set,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0)
est_gp.fit(X_train, y_train)
print 'Score: ', est_gp.score(X_test, y_test)
print est_gp._program

This code shows overflow errors creating NaN fitness as mentioned, and the final result is None:

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
GPlearn_example_exp.py:19: RuntimeWarning: overflow encountered in exp
  return np.exp(x)
/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.py:1142: RuntimeWarning: invalid value encountered in multiply
  avg = np.multiply(a, wgt, dtype=result_dtype).sum(axis)/scl
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in tan
  return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in multiply
  return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in cos
  return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in sin
  return self.function(*args)
   0    11.09              nan        7              nan              nan     39.42s
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in subtract
  return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in add
  return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: overflow encountered in multiply
  return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:111: RuntimeWarning: overflow encountered in divide
  return np.where(np.abs(x2) > 0.001, np.divide(x1, x2), 1.)
   1    11.48              nan       38              nan              nan     44.32s
   2     16.2              nan       10              nan              nan     46.98s
   3    18.69              nan       29              nan              nan     46.91s
   4    21.19              nan       22              nan              nan     46.92s
   5    23.44              nan       25              nan              nan     46.38s
   6    25.52              nan       30              nan              nan     44.79s
   7    27.96              nan       56              nan              nan     43.73s
   8    30.01              nan       44              nan              nan     41.50s
   9    32.29              nan       54              nan              nan     39.10s
  10    34.59              nan       11              nan              nan     36.28s
  11    37.08              nan       18              nan              nan     34.26s
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:128: RuntimeWarning: overflow encountered in divide
  return np.where(np.abs(x1) > 0.001, 1. / x1, 0.)
  12    39.66              nan       34              nan              nan     31.14s
  13    42.05              nan       43              nan              nan     27.39s
  14    43.91              nan       52              nan              nan     23.64s
  15     47.2              nan      118              nan              nan     19.36s
  16    49.95              nan       36              nan              nan     14.88s
  17    52.13              nan        8              nan              nan     10.15s
  18    55.53              nan       21              nan              nan      5.31s
  19    58.17              nan       63              nan              nan      0.00s
Score: 
Traceback (most recent call last):
  File "GPlearn_example_exp.py", line 69, in <module>
    
  File "/usr/local/lib/python2.7/site-packages/sklearn/base.py", line 388, in score
    multioutput='variance_weighted')
  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/regression.py", line 530, in r2_score
    y_true, y_pred, multioutput)
  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/regression.py", line 77, in _check_reg_targets
    y_pred = check_array(y_pred, ensure_2d=False)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 422, in check_array
    _assert_all_finite(array)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 43, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

How can this issue be solved? Could a solution be to skip results that contain NaNs, discarding them from the fitness evaluation?
Using lower values (uniform [0, 1] values) also raises the overflow:

X = np.random.uniform(0,1,size=(100,3))
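
One hedged way around the NaN fitness, assuming the overflow itself is the problem: wrap the exponential so it always returns finite values (the clip bound below is an arbitrary choice, not a gplearn default), or alternatively rescale the target, e.g. fit log(y) instead of y.

import numpy as np
from gplearn.functions import make_function

def protected_exponent(x):
    # Clipping keeps np.exp within float64 range so the fitness never becomes NaN.
    return np.exp(np.clip(x, -50.0, 50.0))

exponential = make_function(function=protected_exponent, name='exp', arity=1)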

Possible to remove const_range?

Hello,
I just started using this library. Could anyone tell me whether it is possible to remove const_range? I would like to apply 'add' and 'sub' only between variables.

I set const_range to (0, 0), but got many expressions like this:
add(add(add(0.000, X0), 0.000), 0.000)

Thanks a lot.
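
A hedged note: in recent gplearn releases (check your version), const_range can reportedly be set to None rather than (0, 0) to exclude constants entirely, so only the variables appear in the evolved expressions.

from gplearn.genetic import SymbolicRegressor

# Hedged sketch: const_range=None (not (0, 0)) excludes ephemeral constants,
# so expressions are built only from X0, X1, ... and the chosen functions.
est = SymbolicRegressor(function_set=['add', 'sub'], const_range=None)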

How to write custom function with make_function?

I did the following and got errors:

from gplearn.functions import make_function

def internaltanh(x):
    return np.tanh(x1)

dtanh = make_function(function=internaltanh, name='dtanh',arity=1)

function_set2 = ['add', 'sub', 'mul', 'div', 'sqrt', 'log','dtanh', 'abs', 'neg', 'inv']

gp2 = SymbolicTransformer(generations=20, population_size=5000,
                          hall_of_fame=100, n_components=10,
                          function_set=function_set2,
                          parsimony_coefficient=0.0001,
                          max_samples=0.9, verbose=1,
                          random_state=0, n_jobs=3)
gp2.fit(xtrain2, xtrain.y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gplearn/genetic.py", line 317, in fit
    'function_set.' % function)
ValueError: invalid function name dtanh found in function_set.

How can we fix this error? Can we have an example?

Thanks
Dr Patrick
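
A hedged sketch of the likely fix, assuming the error means that a custom function must be passed to function_set as the object returned by make_function rather than as its name string (and note the tanh body should use its own argument x):

import numpy as np
from gplearn.functions import make_function
from gplearn.genetic import SymbolicTransformer

def internaltanh(x):
    return np.tanh(x)  # use the argument x, not an undefined x1

dtanh = make_function(function=internaltanh, name='dtanh', arity=1)

# Pass the function object itself; only the built-in functions are named by string.
function_set2 = ['add', 'sub', 'mul', 'div', 'sqrt', 'log', dtanh, 'abs', 'neg', 'inv']

gp2 = SymbolicTransformer(generations=20, population_size=5000,
                          hall_of_fame=100, n_components=10,
                          function_set=function_set2,
                          parsimony_coefficient=0.0001,
                          max_samples=0.9, verbose=1,
                          random_state=0, n_jobs=3)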

Add python 3.5 & 3.6 support

Travis currently only checks python 3.4. Need to add 3.5 & 3.6 to checks to ensure package behaves under these more recent versions of python.

module 'gplearn' has no attribute 'genetic'

code like this:

  1. import gplearn
    
  2. est_gp = gplearn.genetic.SymbolicRegressor()
    

But I get the following error message:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      1 import gplearn
----> 2 est_gp = gplearn.genetic.SymbolicRegressor()

AttributeError: module 'gplearn' has no attribute 'genetic'

My Python version is 3.6. Can anyone help me?
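
A hedged explanation: import gplearn does not import the genetic submodule automatically, so either import the submodule explicitly or import the class directly.

# Option 1: import the submodule explicitly.
import gplearn.genetic
est_gp = gplearn.genetic.SymbolicRegressor()

# Option 2: import the estimator class directly, as in the documentation.
from gplearn.genetic import SymbolicRegressor
est_gp = SymbolicRegressor()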

scikit-learn contrib?

Hey Trevor.
Have you thought about the project becoming part of scikit-learn-contrib
https://github.com/scikit-learn-contrib/ ?
Right now the exact governance of that repo is still a bit in flux, but I think it does serve the purpose of advertising high-quality scikit-learn compatible models and extensions.

Andy

Include logic regression

New estimator, needs much more research to see how/if it fits into gplearn's API. No milestone yet. Citation

Add boolean/logical functions, conditional functions, and the ability to input a binary dataset.

hangs when saving to pickle

Hi,

Thanks for creating this wonderful tool. I was wondering if saving models using pickle is supported?

When I issue the following commands:

gp = SymbolicRegressor()
trained_models = {}
trained_models['gp'] = pickle.dumps(gp)

my IDE hangs. This usually works when I am using other sklearn models. I am using the following:

Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, Oct 19 2015, 18:04:42)
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.

Thank you.

Pareto front

Has there been any thought given to Pareto front optimization? There's always a tradeoff between tree size and model fidelity, which I gather you handle with the parsimony coefficient. The alternative is to keep any model that is non-dominated on the Pareto front. I couldn't see any clear way of hacking that into gplearn.

A few suggestions on user-defined function

Thank you for an interesting package

Do you have any plans to add the following options:

  • A user-defined function could take parameters in addition to the data, for example MovingAvg(a, 9), where 9 is a parameter, and each parameter could vary over a range, e.g. [2, ..., 25]
  • A function could return a matrix (multiple columns)
  • A function could include a model, where we are interested not only in the model's output but also in its parameters, such as linear regression coefficients. On the full training set the coefficients are constant, but when iterating over lags the parameters will change.

If you plan to do this, how soon can we expect the implementation?

Some questions about using and future versions

Hello!
If you allow, I will continue (#22) asking questions in the hope that some of them will be implemented in future versions.

  1. In your opinion, is it reasonable to create an _Function using e.g. sklearn.linear_model.LinearRegression or sklearn.neighbors.KNeighborsClassifier,
    and get access to their fit() and predict() methods?
  2. As a result of SymbolicTransformer's work we receive a certain set of chromosomes.
    What is the best way to turn them into objects of type TransformerMixin for further use in pipelines, separately from SymbolicTransformer?
    PS: If in future versions you implement passing parameters to functions, it is probably better to do it as a dictionary with ranges, so that one can use e.g. HyperOpt.

Add another fit API(for reinforcement learning use case)

Hi, @trevorstephens
In my opinion, GP is different from other regression/fitting algorithms since there may be no accurate target value for it to fit. You can train GP simply by telling it whether the result it generates is good, or how good it is, so GP feels more like a reinforcement learning algorithm to me than a supervised learning algorithm (e.g. linear regression, LR, SVR...).

Considering this feature (the reinforcement use case), I think the current fit(X, y) API may not be enough (for the supervised learning use case it still works fine)...

Although this can be done with the current set of APIs like this:

from gplearn.genetic import SymbolicRegressor
from gplearn.fitness import make_fitness

def evaluation_function(y_pred):
    # do what you want
    return score

def explicit_fitness(y, y_pred, sample_weight):
    # normally we compare y_pred with y and get a score,
    # but what about the case where we don't know the accurate/target y?
    # in this case we just use y_pred to evaluate the performance
    return evaluation_function(y_pred)

est_gp = SymbolicRegressor(metric=make_fitness(explicit_fitness, False))

# in the case that we use only y_pred to evaluate the performance,
# y_data in fit(x_data, y_data) is useless
y_as_indices = [i for i in range(x_data.shape[0])]
est_gp.fit(x_data, y_as_indices)

I am still looking forward to another API for this purpose...
In some cases, could we also use x_data in the fitness function?
Thanks!!

Update scikit-learn requirements to more recent version

  • Bump sklearn requirement to more recent version (0.15.2 was released in 2014)
  • Update tests to remove/update deprecated modules that currently warn when running tests
  • Potentially integrate utils.estimator_checks.check_estimator into tests and remove bundled sklearn tests from gplearn
  • Update bundled utils

Binary classification with gplearn

Hey Trevor,

thank you for your work at the gplearn package - I really appreciate your work!

I'm quite interested in implementing genetic programming for multiclass classification; however, I'm short on time and probably can't wait for the 0.3 release. I was wondering if I could use either a One-vs-One or One-vs-All approach (as in SVM), built on binary classification.

What would be the best way to implement a binary classifier using the estimators available in gplearn so far? I've tried some ideas, but I'm not very skilled in statistics and my attempts ended in failure.

I would be really thankful for giving me a hand.

Best regards,
Mateusz
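
A hedged workaround sketch until a SymbolicClassifier exists: use SymbolicTransformer to evolve features and feed them to an ordinary scikit-learn classifier, optionally wrapped in OneVsRestClassifier for the multiclass case. The hyperparameters are illustrative, and X_train/y_train/X_test/y_test are assumed to be your own data.

from gplearn.genetic import SymbolicTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Evolve symbolic features, then classify on those features.
gp = SymbolicTransformer(generations=10, population_size=1000,
                         hall_of_fame=100, n_components=10,
                         random_state=0)
clf = make_pipeline(gp, OneVsRestClassifier(LogisticRegression()))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))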

Broken Pipe Issue

When fitting large datasets, errors related to broken pipes frequently occur.

I have not yet been able to consistently reproduce the error but my dataset characteristics are listed below:

  • number of rows: 143,000
  • number of columns: 160

The gplearn configuration is as follows:

gp = SymbolicTransformer(metric='spearman', generations=30,
                             population_size=2500,
                             hall_of_fame=100, n_components=10,
                             parsimony_coefficient=0.0005,
                             max_samples=0.9, verbose=1,
                             random_state=0, n_jobs=4)

(I am running gplearn in a Mac environment)

The stack trace is listed below:


Process ForkPoolWorker-95:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/{myenv}/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 384, in put
    return send(obj)
  File "/{myenv}/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 370, in send
    self._writer.send_bytes(buffer.getvalue())
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 397, in _send_bytes
    self._send(header)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/{myenv}/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 384, in put
    return send(obj)
  File "/{myenv}/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 370, in send
    self._writer.send_bytes(buffer.getvalue())
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 397, in _send_bytes
    self._send(header)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Is it possible to get best 3 results?

Sorry, maybe this question is naive, but I have just used the sample code to get the best ONE program.

Is there any way to get the best 3 instead? If not, could you please kindly add an option for defining the number of results returned?

Thank you in advance!
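
There is no parameter for this, but a hedged sketch of pulling the top few programs out of the final generation yourself, assuming the fitted estimator keeps its populations in _programs and each program carries a fitness_ attribute (as other issues here suggest). For the default mean absolute error metric, lower fitness is better.

# est_gp is a fitted SymbolicRegressor.
final_generation = [p for p in est_gp._programs[-1] if p is not None]
top3 = sorted(final_generation, key=lambda p: p.fitness_)[:3]
for program in top3:
    print(program.fitness_, program)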

How to limit the memory required by gplearn?

I think gplearn will use as much memory as possible.
One machine is equipped with 8 GB of memory; gplearn eats 7.85 GB, no exception is raised, and everything works well.
Another machine has 12 GB of memory, and this time gplearn consumes 11.8 GB. At the last round of the genetic computation, Python crashes with a MemoryError.
It's an ironic situation: why doesn't more memory satisfy gplearn's appetite?
(PS: The gplearn parameters are fixed. Both environments are Windows.)

GP Hyperparameters

Hi Trevor, this package is simply AMAZING. A great complement to my research for sure.
Quick question: there are a lot of tuneable parameters (this is a good thing) for both the symbolic regressor and the symbolic transformer. In your experience, how sensitive are the fitness measures to these parameters? Which ones typically have the biggest impact on fitness performance (i.e. which key parameters should one focus tuning efforts on)?

Sorry for the dumb question!

Is it possible to get the transformed features automatically?

Hi,

thanks for building this awesome package.

I am using it for feature engineering, getting new features and using them to build more models. I just want to know if there is a way to get the transformed features automatically.

For example, in the SymbolicRegressor example, we got the solution by doing this:

print est_gp._program

sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))

Can we, instead of expanding the mathematics out manually, directly get the new feature names in a list or a new dataframe of those features? Like:

.new_featrues_df 

[
  ['x0**2', '-x1**2', 'x1']
  [4, -1, 1]
  [25, -9, 3]
]

thanks!
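
A hedged sketch, assuming the fitted SymbolicTransformer keeps its selected programs in _best_programs (as other issues here suggest): use str(program) as the column names for the transformed features.

import pandas as pd

# gp is a fitted SymbolicTransformer and X is the original feature matrix.
gp_features = gp.transform(X)
feature_names = [str(program) for program in gp._best_programs]
new_features_df = pd.DataFrame(gp_features, columns=feature_names)
print(new_features_df.head())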

feature Request - cover binary classification?

Hello Trevor,

great job with gplearn; I really enjoy using it on data mining problems.

Would it be possible to extend gplearn to cover binary classification at some point?

pretty please ;)

Thank you!

Avoid sub(x1,x1)

How can I define a customised function to avoid sub(x1, x1) appearing in formulas?
Thanks in advance for the help.

SymbolicTransformer does not create added value features as expected

Hi @trevorstephens ,

I am not sure if this is a bug, or if the documentation is not correctly focused with regard to SymbolicTransformer.
I have put together a showcase of how SymbolicRegressor works and correctly recovers the equation that represents the dataset, while SymbolicTransformer does not work in the same way.

Starting with SymbolicRegressor, I have built an "easy" dataset to check whether SymbolicRegressor gives me the correct result and good metrics.

from gplearn.genetic import SymbolicRegressor
from sklearn import metrics
import pandas as pd
import numpy as np

# Load data
X = np.random.uniform(0,100,size=(100,3))
y = np.min(X[:,:2],axis=1)*X[:,2]

index = 80
X_train , y_train = X[:index,:], y[:index]
X_test , y_test = X[index:,:], y[index:]

function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'max', 'min', 'sin', 'cos', 'tan']

est_gp = SymbolicRegressor(population_size=5000,
                           generations=20, stopping_criteria=0.001,
                           function_set=function_set,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           n_jobs=1,
                           parsimony_coefficient=0.01, random_state=0)
est_gp.fit(X_train, y_train)

print 'Score: ', est_gp.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_gp.predict(X_test))
print est_gp._program

This example gives a perfect result and the MAE metric is ~perfect, as the output shows:

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    11.81    8396.89543051       10    25.3022470326     26.608049431     35.35s
   1    12.36    8904.35549713        8    20.0284767508    19.0994923956     37.34s
   2    13.74     37263.312834        8 7.82583874247e-14 2.13162820728e-14     36.67s
Score:  1.0 5.71986902287e-14
abs(div(neg(X2), inv(min(X0, X1))))

However, with SymbolicTransformer, although the training works well, the transform does not.
See the next example, the same as the previous one but with SymbolicTransformer:

from gplearn.genetic import SymbolicRegressor,SymbolicTransformer
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics

X = np.random.uniform(0,100,size=(100,3))
y = np.min(X[:,:2],axis=1)*X[:,2]

index = 80
X_train , y_train = X[:index,:], y[:index]
X_test , y_test = X[index:,:], y[index:]

# Linear model - Original features
est_lin = linear_model.Lars()
est_lin.fit(X_train, y_train)
print 'Lars(orig): ', est_lin.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_lin.predict(X_test))

# Create added value features
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'max', 'min']

gp = SymbolicTransformer(generations=20, population_size=2000,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0, n_jobs=3)

gp.fit(X_train, y_train)
gp_features = gp.transform(X)

# Linear model - Transformed features
newX = np.hstack((X, gp_features))
print 'newX: ', np.shape(newX)
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))

# Linear model - "The" feature
newX = np.append(X, (np.min(X[:,:2],axis=1)*X[:,2]).reshape(-1,1), axis=1)
print 'newX: ', newX.shape
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))

I use Lars from sklearn to avoid Ridge's sparse weights and to find the best solution quickly for this easy, exact example. As can be seen in the results of this code (below), although the fitness becomes perfect during the fit, the features generated with transform seem to be wrong. The problem does not come from Lars, as the last Lars example shows that when "the feature" (which is the target) is added, the accuracy is perfect.

X:  (100, 3)
y:  (100,)
Lars(orig):  0.850145084161 518.34496409
    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.62   0.349810294784        6   0.954248106272   0.939129495332     16.04s
   1    16.01   0.601354215127        6              1.0              1.0     25.56s
newX:  (100, 13)
Lars(trans):  0.83552794823 497.438879508
newX:  (100, 4)
Lars(trans):  1.0 1.60411683936e-12

So I decided to inspect the fitted programs created during the fit. Some of them are perfect; however, the transform does not seem to use them correctly when creating gp_features:

>>>print 'Eq. of new features: ', gp.__str__()
[mul(mul(neg(sqrt(min(neg(mul(mul(X1, X0), add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0), sub(X2, 0.904)))))), X2))), sqrt(max(X2, X2))), X1),
 div(min(div(abs(X0), log(0.901)), log(max(X2, -0.222))), X0),
 mul(sub(X1, X0), mul(X1, X0)),
 mul(X2, inv(X2)),
 mul(mul(neg(sqrt(min(X0, X2))), add(neg(X0), min(X0, X2))), X1),
 div(abs(mul(X0, X2)), inv(mul(mul(neg(sqrt(min(X0, X2))), mul(neg(X2), max(X1, X1))), X1))),
 div(abs(mul(X0, X2)), inv(mul(0.640, mul(X1, X0)))),
 div(abs(mul(X0, X2)), inv(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(mul(X0, 0.424)))), mul(sub(min(sub(-0.603, 0.299), sub(0.063, X1)), neg(min(X1, -0.125))), mul(max(mul(X0, X2), sqrt(X0)), min(sub(X1, 0.570), log(0.341))))))),
 mul(neg(mul(div(X2, -0.678), neg(X1))), div(sqrt(max(X2, X2)), min(X1, X0)))]
>>>
>>>df = pd.DataFrame(columns=['Gen','OOB_fitness','Equation'])
>>>for idGen in range(len(gp._programs)):
>>>   for idPopulation in range(gp.population_size):
>>>      program = gp._programs[idGen][idPopulation]
>>>      if program is not None:
>>>         df = df.append({'Gen': idGen, 'OOB_fitness': program.oob_fitness_, 'Equation': str(program)}, ignore_index=True)
>>>
>>>print 'Best of last Gen: '
>>>print df[df['Gen']==df['Gen'].max()].sort_values('OOB_fitness')
Best of last Gen: 
      Gen  OOB_fitness                                           Equation
1126  2.0     0.000000                            add(0.944, sub(X0, X0))
952   2.0     0.000000                      div(min(X2, X0), min(X2, X0))
1530  2.0     0.000000  min(inv(neg(abs(log(min(X1, 0.535))))), neg(su...
2146  2.0     0.000000  div(abs(mul(X0, X2)), inv(mul(mul(neg(sqrt(min...
2148  2.0     0.000000  div(min(add(-0.868, -0.285), X2), sqrt(sqrt(0....
2150  2.0     0.000000                                 sub(-0.603, 0.299)
2476  2.0     0.000000  min(min(max(X0, X2), add(-0.738, 0.612)), sqrt...
1601  2.0     0.000000                               neg(min(X1, -0.125))
1271  2.0     0.000000                                 add(-0.504, 0.058)
1742  2.0     0.000000  add(inv(log(abs(-0.575))), inv(log(abs(-0.575))))
733   2.0     0.000000                                        abs(-0.575)
1304  2.0     0.000000                                  abs(sqrt(-0.758))
1630  2.0     0.000000  div(abs(mul(X0, X2)), inv(mul(max(X2, X2), add...
652   2.0     0.000000                                         log(0.341)
1708  2.0     0.000000                                              0.904
2262  2.0     0.000000                                       sqrt(-0.715)
1338  2.0     0.000000                               mul(X2, sub(X1, X1))
826   2.0     0.000000  div(min(X2, add(sub(neg(sub(0.096, -0.886)), m...
1615  2.0     0.000000                             abs(add(0.640, 0.766))
2415  2.0     0.000000                                   log(abs(-0.575))
1670  2.0     0.000000                                     min(X0, 0.657)
1644  2.0     0.000000                               log(min(-0.524, X0))
2361  2.0     0.000000                                              0.944
785   2.0     0.000000  min(inv(log(abs(log(min(X1, 0.535))))), neg(mu...
2367  2.0     0.000000                                        abs(-0.911)
2249  2.0     0.000000                                              0.904
960   2.0     0.000000                                   inv(inv(-0.045))
955   2.0     0.000000                 div(add(X1, X2), inv(sub(X2, X2)))
2397  2.0     0.000000                                             -0.125
1878  2.0     0.000000  div(min(X2, add(sub(neg(sub(0.096, -0.886)), m...
...   ...          ...                                                ...
1103  2.0     0.997786        mul(X2, abs(sub(mul(X0, X1), add(X2, X0))))
2225  2.0     0.997790  mul(sub(min(log(div(X0, -0.717)), neg(sqrt(mul...
1890  2.0     0.998069  mul(sub(div(X2, 0.309), neg(X2)), sub(max(X2, ...
1704  2.0     0.998283  add(sub(log(min(add(0.769, X1), abs(X1))), sub...
1829  2.0     0.998284  add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0)...
700   2.0     0.998345  add(sub(log(min(add(0.769, X1), abs(X1))), sub...
1770  2.0     0.998638  mul(add(min(X0, min(X1, X1)), X2), sqrt(abs(ab...
2344  2.0     0.998692  div(min(X2, add(sub(neg(sub(0.096, abs(-0.575)...
985   2.0     0.998793  sub(min(mul(sub(min(sqrt(log(max(X1, X2))), ne...
1634  2.0     0.998815  add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0)...
1412  2.0     0.998945  mul(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(m...
855   2.0     0.998965  add(inv(log(abs(X1))), neg(mul(mul(X1, X0), su...
839   2.0     0.998996  add(inv(abs(add(min(X0, min(X1, X1)), X2))), n...
1528  2.0     0.999066  add(sub(log(min(add(0.769, X1), abs(X1))), sub...
690   2.0     0.999875  add(sub(log(min(add(0.769, X1), abs(X1))), sub...
2047  2.0     0.999895  sub(min(neg(X1), div(X1, X2)), sub(min(abs(X1)...
1951  2.0     0.999921  sub(min(min(X2, X0), X2), mul(min(X1, X0), neg...
1981  2.0     0.999954  mul(X2, neg(neg(min(add(0.448, X0), sub(X1, -0...
2349  2.0     0.999954   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2364  2.0     0.999960  add(inv(log(abs(-0.575))), mul(X2, neg(neg(min...
2487  2.0     0.999971   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2056  2.0     0.999975   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
1559  2.0     0.999976    mul(X2, neg(neg(min(add(0.448, X0), abs(X1)))))
975   2.0     0.999982   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2032  2.0     0.999992   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
1288  2.0     1.000000  sub(min(div(-0.992, X2), X2), mul(min(X1, X0),...
2482  2.0     1.000000  sub(min(abs(inv(neg(X1))), X2), mul(min(X1, X0...
1776  2.0     1.000000  mul(min(mul(add(X0, X0), abs(log(X1))), min(ab...
2392  2.0     1.000000  mul(neg(X2), max(div(0.933, X0), min(X0, min(X...
1329  2.0     1.000000                          mul(min(X1, X0), neg(X2))

[2000 rows x 3 columns]

Is this a bug? I am doing the same thing as explained in the SymbolicTransformer example.

Initializing the model with the previous population?

I'm using the SymbolicRegressor class to iteratively train models on some data (for use as a surrogate model, actually). Is there a way to reuse the last generation of a previous model as the initial population of a new model? Ideally, something like "half new, half old", so I could still get new models as well.
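
Not exactly "half new, half old", but a hedged sketch using the warm_start flag (available in recent gplearn releases): the previous run's final population is reused as the starting point when fit is called again with more generations. X_round1/y_round1 and X_round2/y_round2 are placeholders for your surrogate-model data.

from gplearn.genetic import SymbolicRegressor

est = SymbolicRegressor(generations=10, warm_start=True, random_state=0)
est.fit(X_round1, y_round1)

# Increase the generation budget and refit: with warm_start=True the evolution
# continues from the previous final population instead of a fresh random one.
est.set_params(generations=20)
est.fit(X_round2, y_round2)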

More Flexibility in User Defined Measure Function

I was trying to use a self-defined fitness function.
Actually, my measure function won't use y_true or w at all. Instead of using metrics that compare the difference between y_true and y_pred, I measure the fitness of individuals by feeding y_pred into some other function (which returns a float) and using the returned value as the fitness.

I think the following can be improved:

  1. The measure function's arguments will not always be (y, y_pred, w); this could be more flexible and generalized.
  2. The way you check whether the function returns a np.float. I was searching for better ways of checking return types in Python but haven't found satisfying results yet; it is actually quite tricky.

Thanks for your work on genetic programming. It's awesome and really easy to understand and use.

I'd be rather happy to help you improve this fantastic tool. There is much more that could be added: things like selection functions, mutation functions, even user-defined selection and mutation functions, etc.

Image(graph.create()) problem with Example of GPLearn

graph = pydotplus.graphviz.graph_from_dot_data(est_gp._program.export_graphviz())
print graph
Image(graph.create())

Error:
sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))
<pydotplus.graphviz.Dot object at 0x7f3a3b2f56d0>
Traceback (most recent call last):
  File "test.py", line 83, in <module>
    Image(graph.create())
  File "/home/josimar/Envs/test/local/lib/python2.7/site-packages/pydotplus/graphviz.py", line 1960, in create
    'GraphViz's executables not found')
pydotplus.graphviz.InvocationException: GraphViz's executables not found

Drop python 2.6 support

Many other libraries are dropping support for this version. Removing this from Travis checks will aid in keeping aligned with packages gplearn depends upon.

fuzzy functions - and,xor,not,or

Hi guys, could you include fuzzy functions as defaults?

def fuzzy_not(x):
    return np.minimum(1, np.maximum(0, 1 - x))

def fuzzy_and(x, y):
    return np.minimum(1, np.maximum(0, x * y))

def fuzzy_or(x, y):
    return np.minimum(1, np.maximum(0, x + y / 2))

def fuzzy_xor(x, y):
    return np.minimum(1, np.maximum(0, 0.5 - 2 * (x - 0.5) * (y - 0.5)))
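
These are not built in today, but a hedged sketch of how the definitions above could be registered with make_function and used right away (the names passed to name= are illustrative):

import numpy as np  # required by the definitions above
from gplearn.functions import make_function
from gplearn.genetic import SymbolicRegressor

fuzzy_set = [
    make_function(function=fuzzy_not, name='fnot', arity=1),
    make_function(function=fuzzy_and, name='fand', arity=2),
    make_function(function=fuzzy_or, name='f_or', arity=2),
    make_function(function=fuzzy_xor, name='fxor', arity=2),
]

est = SymbolicRegressor(function_set=['add', 'sub', 'mul', 'div'] + fuzzy_set)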

Customized function takes forever to run for SymbolicTransformer

I just started using this package. I was running gp_examples.ipynb. Everything was fine except that SymbolicTransformer takes forever to run with the user-defined logical function in Example 3 on my computer. Is that normal?

Thanks a lot for your help.

Integration with sympy

Hello Trevor, thanks for your fantastic gp Tool.

I am starting to use it.

Have you considered integrating sympy with gplearn? I mean, you could export individual formulas to sympy expressions so that we can use all of sympy's machinery to reduce, simplify, and symbolically analyse the generated programs/formulas.

And, even more, how about simplifying the formulas at each iteration by interacting with sympy?

Cheers,
Jose
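
Not integrated today, but a hedged sketch of a one-way bridge: parse a program's string form into sympy by mapping gplearn's function names onto sympy operations. Only a few operators are mapped here, gplearn's protected semantics for div/log/sqrt are ignored for simplicity, and est_gp is assumed to be a fitted SymbolicRegressor.

import sympy as sp

converter = {
    'add': lambda a, b: a + b,
    'sub': lambda a, b: a - b,
    'mul': lambda a, b: a * b,
    'div': lambda a, b: a / b,   # ignores gplearn's protected division
    'neg': lambda a: -a,
    'sqrt': sp.sqrt,
    'log': sp.log,
    'abs': sp.Abs,
}

formula = str(est_gp._program)   # e.g. 'sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))'
expr = sp.sympify(formula, locals=converter)   # X0, X1, ... become sympy Symbols
print(sp.simplify(expr))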

Question on User-defined function

Hi, So I've had a look at the custom functions definition. While constructing possible individuals from my code I ran into some issues.

pset = gp.PrimitiveSetTyped("MAIN",[int],str)

pset.addPrimitive(intprocessor,[int],list)
pset.addPrimitive(listprocessor,[list],str)
pset.renameArguments(ARG0='x')

The error I'm facing is this:

IndexError: The gp.generate function tried to add a terminal of type '<type 'str'>', but there is none available

Now, one way of solving this issue is to add terminals, as the error message suggests:

pset.addTerminal("hello", str)
pset.addTerminal([123], list)

But when no terminals are defined, shouldn't the GP only be producing listprocessor(intprocessor(x)), since it is the only possible tree?

Memory issues

Hi Trevor,

it's a very nice implementation - I was searching for such solution for a long time. So really thank you!

I have only one issue: with a long evolution (generations = some huge number, or population_size = some huge number plus generations = some number) the program runs out of memory. I checked the code and it saves every generation's population. Do you think that is necessary? In my understanding we only need the current population and, at the start of a generation, the best of the previous one.

What do you think: could the code be changed to reset self._programs = [] before every generation and just save the previous one in a self._programs_prev (or something similar)?

Feature Request: Saving / Loading a Model

In many cases one invests significant time/effort to fit/train the GP model. Unfortunately, without a method of "saving" the model and "reloading" it from its saved state, one is required to redo the training of the model from its initial state. This forces one to spend the same (significant) time/effort again.

It would be ideal to:
(a) save the state of the model to a file
(b) load a model from a (state) file
(c) Incrementally update the loaded model with new data

This is a feature request.
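
For (a) and (b), a hedged sketch using ordinary scikit-learn-style persistence with joblib (pickle works the same way); (c) incremental updating is not covered here. X_train/y_train/X_test are assumed to be your own data.

import joblib
from gplearn.genetic import SymbolicRegressor

est = SymbolicRegressor(generations=20, random_state=0)
est.fit(X_train, y_train)

# (a) Save the fitted model to a file.
joblib.dump(est, 'gp_model.joblib')

# (b) Load it back later and keep predicting without retraining.
restored = joblib.load('gp_model.joblib')
print(restored.predict(X_test[:5]))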

Wrong calculation from 'mean absolute error'?

Hi, I am trying to fit a program to solve the XOR problem. Here is my data:

X = [[1,1],
     [1,0],
     [0,1],
     [0,0]]
y = [0,1,1,0]

These are the settings:

est_gp = SymbolicRegressor(population_size=5000,
                           generations=20, stopping_criteria=0.01,
                           init_depth=(2,4),
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=1, verbose=1,
                           parsimony_coefficient=0.01, random_state=0)
est_gp.fit(X, y)
print (est_gp._program)

The result is sub(X1, X0), which is unexpected and not right. Below is the run info:

   |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    15.48    7.45612899871        3              0.0              2.0      1.23m

Thus, I checked the function evaluation:

est_gp.predict([1,0])
array([-1])

It seems right, but,

est_gp.score([1,0],[1])
0.0

Run in parallel took much more time than single job

Hi,

I just started using gplearn.
When I ran the symbolic regressor example with n_jobs = 1, it took about 5 minutes and finished. However, when I set n_jobs = 15, it was still running after one hour and I had to stop it manually.

Here is my environment:
Windows 7 Professional,
Anaconda 3,
CPU Xeon E5-2687w v2.

Could anyone help me? Thanks very much.
