trevorstephens / gplearn
Genetic Programming in Python, with a scikit-learn inspired API
Home Page: http://gplearn.readthedocs.io/
License: BSD 3-Clause "New" or "Revised" License
I am training a SymbolicTransformer with a dataset with shape (94159, 935), using n_jobs = 1, and all goes well. However, if I try to use parallel processing to leverage multicore machines, setting n_jobs > 1, I get the following error:
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-14-6ceaea1f883d> in <module>()
10 metric='spearman')
11
---> 12 gp.fit(tr_data.todense(), y_train)
/opt/conda/lib/python3.5/site-packages/gplearn/genetic.py in fit(self, X, y, sample_weight)
1052 seeds[starts[i]:starts[i + 1]],
1053 params)
-> 1054 for i in range(n_jobs))
1055
1056 # Reduce, maintaining order across different n_jobs
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
766 # consumption.
767 self._iterating = False
--> 768 self.retrieve()
769 # Make sure that we get a last message telling us we are done
770 elapsed_time = time.time() - self._start_time
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
717 ensure_ready = self._managed_backend
718 backend.abort_everything(ensure_ready=ensure_ready)
--> 719 raise exception
720
721 def __call__(self, iterable):
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
680 # check if timeout supported in backend future implementation
681 if 'timeout' in getfullargspec(job.get).args:
--> 682 self._output.extend(job.get(timeout=self.timeout))
683 else:
684 self._output.extend(job.get())
/opt/conda/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):
/opt/conda/lib/python3.5/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
383 break
384 try:
--> 385 put(task)
386 except Exception as e:
387 job, ind = task[:2]
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py in send(obj)
369 def send(obj):
370 buffer = BytesIO()
--> 371 CustomizablePickler(buffer, self._reducers).dump(obj)
372 self._writer.send_bytes(buffer.getvalue())
373 self._send = send
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py in __call__(self, a)
238 print("Memmaping (shape=%r, dtype=%s) to new file %s" % (
239 a.shape, a.dtype, filename))
--> 240 for dumped_filename in dump(a, filename):
241 os.chmod(dumped_filename, FILE_PERMISSIONS)
242
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py in dump(value, filename, compress, protocol, cache_size)
481 elif is_filename:
482 with open(filename, 'wb') as f:
--> 483 NumpyPickler(f, protocol=protocol).dump(value)
484 else:
485 NumpyPickler(filename, protocol=protocol).dump(value)
/opt/conda/lib/python3.5/pickle.py in dump(self, obj)
406 if self.proto >= 4:
407 self.framer.start_framing()
--> 408 self.save(obj)
409 self.write(STOP)
410 self.framer.end_framing()
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py in save(self, obj)
275
276 # And then array bytes are written right after the wrapper.
--> 277 wrapper.write_array(obj, self)
278 return
279
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py in write_array(self, array, pickler)
90 buffersize=buffersize,
91 order=self.order):
---> 92 pickler.file_handle.write(chunk.tostring('C'))
93
94 def read_array(self, unpickler):
OSError: [Errno 28] No space left on device
Is this expected for SymbolicTransformer in a parallel setting? I have 160 GB available on disk and 30 GB of RAM available. I don't quite understand what is happening.
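One possible explanation, offered as a guess rather than a diagnosis: the traceback passes through joblib's memmapping code (the "Memmaping ... to new file" line), which writes large input arrays to a temporary folder so that worker processes can share them. joblib documents a JOBLIB_TEMP_FOLDER environment variable for choosing that folder. If the default location is a small shared-memory mount (an assumption here; this is common inside containers), it can fill up even when the main disk has plenty of room. The sketch below picks a folder with enough free space; the helper name and the candidate list are illustrative only.

```python
import os
import shutil
import tempfile


def choose_joblib_temp_folder(required_bytes, candidates):
    """Return the first existing candidate directory with enough free space.

    Hypothetical helper: it only inspects free space and never talks to joblib.
    Returns None when no candidate qualifies.
    """
    for path in candidates:
        if os.path.isdir(path) and shutil.disk_usage(path).free >= required_bytes:
            return path
    return None


# Size of a dense float64 array with the shape from the report above.
n_rows, n_cols = 94159, 935
required = n_rows * n_cols * 8  # 704,309,320 bytes, roughly 0.7 GB

# The candidate list is an assumption; put a directory on the large disk first.
folder = choose_joblib_temp_folder(required, [tempfile.gettempdir(), os.getcwd()])

if folder is not None:
    # Set this before the parallel fit starts so joblib can pick it up.
    os.environ["JOBLIB_TEMP_FOLDER"] = folder
print("joblib temp folder:", folder)
```

If this turns out to be the cause, setting the variable in the shell before launching Python should have the same effect.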
For a number of our problems, we have a rough idea of a functional form that should be close to the "right answer". Is there a way to feed this in as the starting point for the symbolic regression algorithm?
Hi Trevor,
Thanks for developing this tool. It has been really helpful for my research. While using the "greater_is_better" feature of the code, I noticed something abnormal originating in "genetic.py".
(1)
if isinstance(self, RegressorMixin):
    # Find the best individual in the final generation
    self._program = self._programs[-1][np.argmin(fitness)]
This is only correct when the best program is the one with the lowest fitness, which is not the case when "greater_is_better" is True.
(2) Similarly:
if isinstance(self, TransformerMixin):
    # Find the best individuals in the final generation
    fitness = np.array(fitness)
    hall_of_fame = fitness.argsort()[:self.hall_of_fame]
This would select the several programs with the lowest fitness instead of the highest.
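To make the point concrete, here is a minimal NumPy sketch of direction-aware selection. It is not a patch to gplearn; the function names are made up for illustration, and the only idea shown is switching between argmin/argmax (and reversing the sort order) depending on a greater_is_better flag.

```python
import numpy as np


def best_index(fitness, greater_is_better):
    """Index of the best individual, respecting the direction of the metric."""
    fitness = np.asarray(fitness, dtype=float)
    if greater_is_better:
        return int(np.argmax(fitness))
    return int(np.argmin(fitness))


def hall_of_fame_indices(fitness, size, greater_is_better):
    """Indices of the `size` best individuals, best first."""
    order = np.argsort(np.asarray(fitness, dtype=float))  # ascending
    if greater_is_better:
        order = order[::-1]  # descending when larger is better
    return order[:size]


fitness = [0.2, 0.9, 0.5, 0.7]
print(best_index(fitness, greater_is_better=True))
print(hall_of_fame_indices(fitness, 2, greater_is_better=True))
```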
I'm interested in adding pow and exponential functions to the set of functions. Could you please add them? They are very useful in my fitting routines.
Also, it would be nice if you could describe how you add functions to the library of functions so we could extend it with any function we want.
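As a hedged sketch of what such user-defined functions could look like: the two NumPy functions below are "protected" versions of exp and pow that always return finite values, which evolved programs generally need because they feed arbitrary inputs into every function. The clipping bound and the choice to return 0 for non-finite powers are arbitrary assumptions. Registration uses make_function from gplearn.functions (the same call that appears in another report on this page) and is guarded so the numeric part runs without gplearn.

```python
import numpy as np


def protected_exp(x):
    """exp(x) with the argument clipped so the result stays finite in float64."""
    return np.exp(np.clip(x, -100.0, 100.0))


def protected_pow(x1, x2):
    """|x1| ** x2, with non-finite results replaced by 0."""
    with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
        result = np.power(np.abs(x1), x2)
    return np.where(np.isfinite(result), result, 0.0)


try:
    from gplearn.functions import make_function
except ImportError:  # gplearn not installed; the NumPy functions still work
    make_function = None

if make_function is not None:
    exp_fn = make_function(function=protected_exp, name="exp", arity=1)
    pow_fn = make_function(function=protected_pow, name="pow", arity=2)
    # Built-in functions are given by name, custom ones as objects.
    function_set = ["add", "sub", "mul", "div", exp_fn, pow_fn]
```

Note that clipping only protects a single call; products of several large exponentials can still overflow further up the tree.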
We're using gplearn in a research paper. Do you have a preferred citation for gplearn?
Hi @trevorstephens ,
I still love your work and found a good paper:
Neural networks and rational functions
It describes how rational functions can approximate arbitrary functions better than polynomials. That made me think that gplearn could generate 2 separate polynomials in parallel and use their ratio as the function approximation.
My argument is the following (without exactly knowing the proofs of the theorems):
I'm not sure that the performance or the size will be optimal, but I think it is worth a try. What do you think?
guyko
@trevorstephens,
The 0.2.0 release is perfect for making new functions, as resolved in issue #18.
However, I find that an exponential function runs into errors that prevent it from reaching a result.
In the following example, where we are looking for a simple exponential equation, gplearn encounters invalid values when evaluating some functions. This makes the fitness become NaN, and the algorithm does not seem to converge anywhere.
import numpy as np
from gplearn.genetic import SymbolicRegressor
from gplearn.functions import make_function
def exponent(x):
    return np.exp(x)
X = np.random.randint(0,100,size=(100,3))
y = np.exp(X[:, 0])
X_train , y_train = X[:80,:], y[:80]
X_test , y_test = X[80:,:], y[80:]
exponential = make_function(function=exponent, name='exp', arity=1)
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
'abs', 'neg', 'inv', 'max', 'min', 'sin', 'cos', 'tan', exponential]
est_gp = SymbolicRegressor(population_size=5000,
generations=20, stopping_criteria=0.01,
function_set=function_set,
p_crossover=0.7, p_subtree_mutation=0.1,
p_hoist_mutation=0.05, p_point_mutation=0.1,
max_samples=0.9, verbose=1,
parsimony_coefficient=0.01, random_state=0)
est_gp.fit(X_train, y_train)
print 'Score: ', est_gp.score(X_test, y_test)
print est_gp._program
This code shows overflow errors that create NaN fitness values, as mentioned, and the final result is None:
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
GPlearn_example_exp.py:19: RuntimeWarning: overflow encountered in exp
return np.exp(x)
/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.py:1142: RuntimeWarning: invalid value encountered in multiply
avg = np.multiply(a, wgt, dtype=result_dtype).sum(axis)/scl
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in tan
return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in multiply
return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in cos
return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in sin
return self.function(*args)
0 11.09 nan 7 nan nan 39.42s
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in subtract
return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: invalid value encountered in add
return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:46: RuntimeWarning: overflow encountered in multiply
return self.function(*args)
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:111: RuntimeWarning: overflow encountered in divide
return np.where(np.abs(x2) > 0.001, np.divide(x1, x2), 1.)
1 11.48 nan 38 nan nan 44.32s
2 16.2 nan 10 nan nan 46.98s
3 18.69 nan 29 nan nan 46.91s
4 21.19 nan 22 nan nan 46.92s
5 23.44 nan 25 nan nan 46.38s
6 25.52 nan 30 nan nan 44.79s
7 27.96 nan 56 nan nan 43.73s
8 30.01 nan 44 nan nan 41.50s
9 32.29 nan 54 nan nan 39.10s
10 34.59 nan 11 nan nan 36.28s
11 37.08 nan 18 nan nan 34.26s
/usr/local/lib/python2.7/site-packages/gplearn/functions.py:128: RuntimeWarning: overflow encountered in divide
return np.where(np.abs(x1) > 0.001, 1. / x1, 0.)
12 39.66 nan 34 nan nan 31.14s
13 42.05 nan 43 nan nan 27.39s
14 43.91 nan 52 nan nan 23.64s
15 47.2 nan 118 nan nan 19.36s
16 49.95 nan 36 nan nan 14.88s
17 52.13 nan 8 nan nan 10.15s
18 55.53 nan 21 nan nan 5.31s
19 58.17 nan 63 nan nan 0.00s
Score:
Traceback (most recent call last):
File "GPlearn_example_exp.py", line 69, in <module>
File "/usr/local/lib/python2.7/site-packages/sklearn/base.py", line 388, in score
multioutput='variance_weighted')
File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/regression.py", line 530, in r2_score
y_true, y_pred, multioutput)
File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/regression.py", line 77, in _check_reg_targets
y_pred = check_array(y_pred, ensure_2d=False)
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 422, in check_array
_assert_all_finite(array)
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 43, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
How can this issue be solved? Could a solution be to skip results that contain NaNs, so that they are discarded when computing fitness?
Using lower values (uniform [0,1] values) also raises the overflow:
X = np.random.uniform(0,1,size=(100,3))
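One way to approximate "discard results that contain NaNs" is a custom metric that returns a large penalty whenever a program's predictions are not finite. The sketch below is plain NumPy; the penalty value and the function name are assumptions, and the penalty should be larger than any legitimate error on the data. Overflow can appear even with inputs in [0, 1] because an evolved program may nest the exponential several times.

```python
import numpy as np

# Hypothetical "worst possible" score; must exceed any legitimate error.
PENALTY = 1e100


def finite_mae(y, y_pred, sample_weight):
    """Weighted mean absolute error that never returns NaN or inf."""
    y_pred = np.asarray(y_pred, dtype=float)
    if not np.all(np.isfinite(y_pred)):
        return PENALTY
    with np.errstate(over="ignore", invalid="ignore"):
        score = np.average(np.abs(y_pred - y), weights=sample_weight)
    return float(score) if np.isfinite(score) else PENALTY


# With gplearn installed, this could be wrapped as a metric (lower is better):
#   from gplearn.fitness import make_fitness
#   metric = make_fitness(function=finite_mae, greater_is_better=False)
y_true = np.array([1.0, 2.0])
weights = np.array([1.0, 1.0])
print(finite_mae(y_true, np.array([1.0, 4.0]), weights))
print(finite_mae(y_true, np.array([np.nan, 4.0]), weights))
```

This only keeps NaN out of the fitness values during evolution; the chosen program can still produce non-finite predictions on new data unless the functions themselves are protected.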
Hello,
I just started using this library. Could anyone tell me whether it is possible to remove const_range? I would like to perform only 'add' and 'minus' operations between variables.
I set const_range to (0,0), but got many expressions like this:
add(add(add(0.000, X0), 0.000), 0.000)
Thanks a lot.
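A hedged pointer: gplearn's documentation describes const_range as "tuple of two floats, or None", with None meaning that no constants are included in programs. Whether that applies depends on the installed version, so treat the None setting below as an assumption to verify. The construction is guarded so the snippet also runs where gplearn is missing.

```python
# Parameters for a variables-only search.
params = dict(
    function_set=("add", "sub"),  # only the two operations asked about
    const_range=None,             # assumption: None disables constants in this version
)

try:
    from gplearn.genetic import SymbolicRegressor
    est = SymbolicRegressor(**params)
except ImportError:  # gplearn not installed; nothing to construct
    est = None

print(params)
```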
I did the following and got errors:
from gplearn.functions import make_function
def internaltanh(x):
    return np.tanh(x1)
dtanh = make_function(function=internaltanh, name='dtanh',arity=1)
function_set2 = ['add', 'sub', 'mul', 'div', 'sqrt', 'log','dtanh', 'abs', 'neg', 'inv']
gp2 = SymbolicTransformer(generations=20, population_size=5000,
... hall_of_fame=100, n_components=10,
... function_set=function_set2,
... parsimony_coefficient=0.0001,
... max_samples=0.9, verbose=1,
... random_state=0, n_jobs=3)
gp2.fit(xtrain2, xtrain.y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/gplearn/genetic.py", line 317, in fit
'`function_set`.' % function)
ValueError: invalid function name dtanh found in `function_set`.
How can we fix this error? Can we have an example?
Thanks
Dr Patrick
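A possible reading of the error, hedged since it is inferred from the message alone: strings in function_set appear to be looked up among the built-in function names, so the string 'dtanh' is rejected, whereas the object returned by make_function should be placed in the list directly (another report on this page passes its custom function that way). The body of the function also refers to x1 while the parameter is x. A sketch with both points changed; the fallback branch exists only so the snippet runs without gplearn.

```python
import numpy as np


def internal_tanh(x):
    # Use the parameter name x; the original body referenced an undefined x1.
    return np.tanh(x)


try:
    from gplearn.functions import make_function
    dtanh = make_function(function=internal_tanh, name="dtanh", arity=1)
except ImportError:
    # Placeholder so the list below can still be built without gplearn.
    dtanh = internal_tanh

# Built-ins by name, the custom function as an object (no quotes around dtanh).
function_set2 = ["add", "sub", "mul", "div", "sqrt", "log", dtanh, "abs", "neg", "inv"]
print(function_set2[6] is dtanh)
```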
Travis currently only checks python 3.4. Need to add 3.5 & 3.6 to checks to ensure package behaves under these more recent versions of python.
Code like this:
import gplearn
est_gp = gplearn.genetic.SymbolicRegressor()
AttributeError Traceback (most recent call last)
in ()
1 import gplearn
----> 2 est_gp = gplearn.genetic.SymbolicRegressor()
AttributeError: module 'gplearn' has no attribute 'genetic'
My Python version is 3.6. Can anyone help me?
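This looks like general Python import behaviour rather than something specific to gplearn: importing a package does not import its submodules unless the package's __init__ does so, which the AttributeError suggests is not the case here. The self-contained sketch below reproduces the effect with a throwaway package (its name and contents are invented for the demonstration). For gplearn the equivalent would be "from gplearn.genetic import SymbolicRegressor".

```python
import importlib
import os
import sys
import tempfile
import uuid

# Build a throwaway package with an empty __init__ and one submodule.
root = tempfile.mkdtemp()
name = "demo_pkg_" + uuid.uuid4().hex  # unique, hypothetical package name
pkg_dir = os.path.join(root, name)
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as handle:
    handle.write("")  # imports no submodules
with open(os.path.join(pkg_dir, "genetic.py"), "w") as handle:
    handle.write("class SymbolicRegressor(object):\n    pass\n")
sys.path.insert(0, root)
importlib.invalidate_caches()

# Importing the package alone does not make the submodule an attribute.
pkg = importlib.import_module(name)
visible_before = hasattr(pkg, "genetic")

# Importing the submodule explicitly attaches it to the package.
importlib.import_module(name + ".genetic")
visible_after = hasattr(pkg, "genetic")

print(visible_before, visible_after)
```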
Hey Trevor.
Have you thought about the project becoming part of scikit-learn-contrib
https://github.com/scikit-learn-contrib/ ?
Right now the exact governance of that repo is still a bit in flux, but I think it does serve the purpose of advertising high-quality scikit-learn compatible models and extensions.
Andy
New estimator, needs much more research to see how/if it fits into gplearn's API. No milestone yet. Citation
Add boolean/logical functions, conditional functions and potential to input a binary input dataset
Hi,
Thanks for creating this wonderful tool. I was wondering if saving models using pickle is supported?
When I issue the following commands:
gp = SymbolicRegressor()
trained_models = {}
trained_models['gp'] = pickle.dumps(gp)
my IDE hangs. This usually works when I am using other sklearn models. I am using the following:
Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, Oct 19 2015, 18:04:42)
Type "copyright", "credits" or "license" for more information.
IPython 4.0.0 -- An enhanced Interactive Python.
Thank you.
Has there been any thought given to Pareto front optimization? There's always a tradeoff between tree size and model fidelity, which I gather you handle with parsimony. But the other alternative is to keep any model that is non-dominated on the Pareto front. I couldn't see any clear way of hacking that into gplearn.
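For readers unfamiliar with the term, here is a small sketch of what "non-dominated" means for (tree size, error) pairs, both to be minimised. It is independent of gplearn and only illustrates the filtering step; the candidate values are made up.

```python
def pareto_front(candidates):
    """Return the (length, error) pairs that no other pair dominates.

    A pair dominates another if it is no worse on both criteria and strictly
    better on at least one. Both criteria are minimised.
    """
    front = []
    for i, (len_i, err_i) in enumerate(candidates):
        dominated = any(
            len_j <= len_i and err_j <= err_i and (len_j < len_i or err_j < err_i)
            for j, (len_j, err_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((len_i, err_i))
    return sorted(set(front))


# Made-up (program length, error) pairs.
candidates = [(3, 0.9), (5, 0.4), (5, 0.6), (9, 0.1), (12, 0.1), (7, 0.5)]
print(pareto_front(candidates))  # [(3, 0.9), (5, 0.4), (9, 0.1)]
```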
Thank you for an interesting package
Do you have any plans to add the following options:
If you plan to, how soon can we expect the implementation?
Hi guys, how could I scan _programs and get the best OOB program?
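A possible sketch, assuming the layout used elsewhere on this page: _programs is a list of generations, each a list whose entries may be None, and each program carries an oob_fitness_ attribute (treat that attribute name as an assumption to check against your gplearn version). The stand-in objects below exist only to make the example runnable.

```python
import math
from types import SimpleNamespace


def best_oob_program(programs, greater_is_better):
    """Return (generation, program) with the best out-of-bag fitness."""
    best = None
    for gen, population in enumerate(programs):
        for program in population:
            if program is None:
                continue  # entry not kept for this generation
            score = getattr(program, "oob_fitness_", None)
            if score is None or math.isnan(score):
                continue  # no usable OOB score
            better = best is None or (
                score > best[0] if greater_is_better else score < best[0]
            )
            if better:
                best = (score, gen, program)
    if best is None:
        raise ValueError("no program with a finite OOB fitness was found")
    return best[1], best[2]


# Stand-ins for fitted programs; with a real estimator pass est._programs.
fake = [
    [SimpleNamespace(oob_fitness_=0.5), None],
    [SimpleNamespace(oob_fitness_=0.2), SimpleNamespace(oob_fitness_=float("nan"))],
]
gen, prog = best_oob_program(fake, greater_is_better=False)
print(gen, prog.oob_fitness_)
```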
Hi ,
I can't get even a simple example to run with the n_jobs parameter set to anything other than 1.
The script gets stuck and never finishes. I am using a Windows machine.
For example this one : .
Any solution?
Hello!
If you allow, I will continue (#22) to ask questions in the hope that some of them will be implemented in the next versions.
Hi, @trevorstephens
In my opinion, GP is different from other regression/fitting algorithms, since there may be no exact target value for it to fit. You can train GP simply by telling it whether the result it generates is good, or how good it is, so to me GP is more like a reinforcement learning algorithm than a supervised learning algorithm (e.g. linear regression, LR, SVR...).
Considering this reinforcement-style use case, I think the current fit(X,Y) API may not be enough (for the supervised learning use case it still works fine)...
Although this can be done with the current set of APIs like this:
def evaluation_function(y_pred):
    # do what you want
    return score
def explicit_fitness(y, y_pred, sample_weight):
    # normally we compare y_pred with y and get a score
    # but what about the case where we don't know the accurate/target y?
    # in this case we just use y_pred to evaluate the performance
    return evaluation_function(y_pred)
est_gp = SymbolicRegressor(metric=make_fitness(explicit_fitness, False))
# in the case that we use only y_pred to evaluate the performance,
# y_data in fit(x_data, y_data) is useless
y_as_indices = [i for i in range(x_data.shape[0])]
est_gp.fit(x_data, y_as_indices)
I am still looking forward to another API for this purpose...
In some cases, could we use x_data in the fitness function?
Thanks!!
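One possible workaround for using x_data inside the fitness function builds on the index trick above: since y is just a vector of row indices, a closure can use it to look up the matching rows of x_data. This is a sketch under that assumption, not a gplearn feature; the helper names and the example scoring rule are invented for illustration.

```python
import numpy as np


def make_index_fitness(x_data, evaluate):
    """Build fitness(y, y_pred, sample_weight) where y holds row indices into x_data."""
    x_data = np.asarray(x_data)

    def fitness(y, y_pred, sample_weight):
        # y may arrive as floats, so cast back to integer indices.
        rows = x_data[np.asarray(y, dtype=int)]
        return float(evaluate(rows, y_pred, sample_weight))

    return fitness


def evaluate(rows, y_pred, sample_weight):
    # Hypothetical scoring rule: weighted distance between predictions and column 0.
    return np.average(np.abs(rows[:, 0] - y_pred), weights=sample_weight)


x_data = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
fitness = make_index_fitness(x_data, evaluate)
print(fitness(np.array([0.0, 2.0]), np.array([1.0, 5.0]), np.array([1.0, 1.0])))
```

The resulting function has the same (y, y_pred, sample_weight) signature as explicit_fitness above, so it could be wrapped with make_fitness in the same way.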
sklearn requirement to more recent version (0.15.2 was released in 2014)
utils.estimator_checks.check_estimator into tests and remove bundled sklearn tests from gplearn
Hey Trevor,
thank you for your work on the gplearn package; I really appreciate it!
I'm quite interested in using genetic programming for multiclass classification. However, I'm short on time and probably won't be able to wait for the release of version 0.3. I was wondering whether I could use either a One vs. One or a One vs. All approach (as in SVM), built on binary classification.
What would be the best way to implement a binary classifier using the estimators that are available in gplearn so far? I've tried some of my own ideas, but I'm not very skilled in statistics and my attempts ended in failure.
I would be really thankful if you could give me a hand.
Best regards,
Mateusz
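A rough sketch of the One vs. All idea using only regressors: fit one regressor per class on a 0/1 indicator target and predict the class whose regressor gives the highest score. The wrapper and the least-squares stand-in below are illustrative assumptions; in practice the factory could return a gplearn SymbolicRegressor instead, which has not been tested here.

```python
import numpy as np


class OneVsRestFromRegressors(object):
    """One-vs-rest classifier built from any regressor with fit/predict."""

    def __init__(self, make_regressor):
        self.make_regressor = make_regressor  # zero-argument factory

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.models_ = []
        for label in self.classes_:
            model = self.make_regressor()
            model.fit(X, (y == label).astype(float))  # 1 for this class, else 0
            self.models_.append(model)
        return self

    def predict(self, X):
        scores = np.column_stack([m.predict(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]


class LeastSquares(object):
    """Stand-in regressor (linear least squares) used only for the demo."""

    def fit(self, X, y):
        A = np.column_stack([np.asarray(X, dtype=float), np.ones(len(X))])
        self.coef_ = np.linalg.lstsq(A, y, rcond=None)[0]
        return self

    def predict(self, X):
        A = np.column_stack([np.asarray(X, dtype=float), np.ones(len(X))])
        return A.dot(self.coef_)


X = np.array([[0, 0], [1, 0], [0, 1]] * 2, dtype=float)
y = np.array([0, 1, 2] * 2)
clf = OneVsRestFromRegressors(LeastSquares).fit(X, y)
print(clf.predict(X))
```

For the binary case the same wrapper works with two classes; thresholding a single regressor's output at 0.5 would be a simpler alternative.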
When fitting large datasets, errors related to broken pipes occur frequently.
I have not yet been able to consistently reproduce the error but my dataset characteristics are listed below:
The gplearn configuration is as follows:
gp = SymbolicTransformer(metric='spearman', generations=30,
population_size=2500,
hall_of_fame=100, n_components=10,
parsimony_coefficient=0.0005,
max_samples=0.9, verbose=1,
random_state=0, n_jobs=4)
(I am running gplearn in a Mac environment)
The stack trace is listed below:
Process ForkPoolWorker-95:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 125, in worker
put((job, i, result))
File "/{myenv}/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 384, in put
return send(obj)
File "/{myenv}/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 370, in send
self._writer.send_bytes(buffer.getvalue())
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 397, in _send_bytes
self._send(header)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 254, in _bootstrap
self.run()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 130, in worker
put((job, i, (False, wrapped)))
File "/{myenv}/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 384, in put
return send(obj)
File "/{myenv}/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 370, in send
self._writer.send_bytes(buffer.getvalue())
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 397, in _send_bytes
self._send(header)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Sorry, maybe this question is naive, but I just used the sample code to get the best ONE program.
Is there any way to get the best 3 instead? If not, could you please add an option for defining the number of results?
Thank you in advance!
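Until such an option exists, one hedged workaround is to rank the final generation yourself. The sketch assumes each program exposes a fitness_ attribute and prints as its formula via str(); both are assumptions about gplearn's program objects, and the stand-in class is only there to make the example runnable. Duplicate formulas are skipped so the result contains distinct programs.

```python
def top_k_programs(population, k, greater_is_better):
    """Return up to k distinct programs from one generation, best first."""
    ranked = sorted(
        (p for p in population if p is not None),
        key=lambda p: p.fitness_,
        reverse=greater_is_better,
    )
    seen, result = set(), []
    for program in ranked:
        text = str(program)
        if text in seen:
            continue  # same formula already selected
        seen.add(text)
        result.append(program)
        if len(result) == k:
            break
    return result


class DemoProgram(object):
    """Stand-in for a fitted program; a real call might pass est._programs[-1]."""

    def __init__(self, text, fitness):
        self.text, self.fitness_ = text, fitness

    def __str__(self):
        return self.text


population = [
    DemoProgram("add(X0, X1)", 0.9), None, DemoProgram("mul(X0, X1)", 0.7),
    DemoProgram("add(X0, X1)", 0.9), DemoProgram("X0", 0.4),
]
best3 = [str(p) for p in top_k_programs(population, 3, greater_is_better=True)]
print(best3)
```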
I think gplearn will use as much memory as possible.
One machine is equipped with 8 GB of memory. gplearn uses 7.85 GB and no exception is raised. Everything works well.
Another machine has 12 GB of memory, and this time gplearn consumes 11.8 GB. In the last round of the genetic computation, Python crashes with a MemoryError.
It's ironic. Why doesn't more memory satisfy gplearn's appetite?
(PS: The parameters of gplearn are fixed. Both environments are Windows.)
Hello,
Can gplearn be used to solve classification problems? I have an example running with two classes (0 and 1), and est.score gives me an accuracy of .0017, which is quite low, perhaps because the targets are classes.
Thanks
EDIT:
Used LogisticRegression instead of Ridge and it works fine
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Hi Trevor, this package is simply AMAZING. A great complement to my research for sure.
Quick question: There are a lot of tunable parameters (this is a good thing) for both the symbolic regressor and the symbolic transformer. In your experience, how sensitive are the fitness measures to these parameters? Which ones typically have the biggest impact on fitness performance (which key parameters should one focus tuning efforts on)?
Sorry for the dumb question!
I don't know if it's possible to implement, but it could be really important in some kind of solutions!
Hi,
thanks for building this awesome package.
I am using it to do feature engineering, getting new features and using them to build more models. I just want to know if there is a way to get the transformed features automatically.
For example, in the Symbolic Regressor example, we got the solution by doing this:
print est_gp._program
sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))
Can we, instead of expanding the mathematics out manually, directly get the new feature names in a list, or a new dataframe of those new features? Like:
.new_features_df
[
['x0**2', '-x1**2', 'x1']
[4, -1, 1]
[25, -9, 3]
]
thanks!
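A partial sketch of one way to get named columns without expanding the algebra: evaluate each program on the data and key the result by the program's printed formula. It assumes program objects print as their formula via str() and offer an execute(X) method that returns one value per row; treat both as assumptions to check against the installed gplearn. The stand-in class makes the example runnable on its own, and the names stay in prefix form such as mul(X0, X0) rather than x0**2.

```python
import numpy as np


def named_features(programs, X):
    """Map each program's printed formula to the column it produces on X."""
    X = np.asarray(X, dtype=float)
    return {str(p): np.asarray(p.execute(X)) for p in programs}


class SquareOfFirstColumn(object):
    """Stand-in for an evolved program equivalent to mul(X0, X0)."""

    def execute(self, X):
        return X[:, 0] ** 2

    def __str__(self):
        return "mul(X0, X0)"


columns = named_features([SquareOfFirstColumn()], [[2.0, 1.0], [5.0, 3.0]])
print(columns)
```

The resulting dictionary could be passed to pandas.DataFrame to obtain a table with one named column per program.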
Hello Trevor,
great job with gplearn; I really enjoy using it on data mining problems.
Would it be possible to extend gplearn to cover binary classification at some point?
pretty please ;)
Thank you!
The code is awesome, but it would be good in my opinion to be able to know, for instance, that at :
How to define customised function to avoid sub(x1,x1) in formula?
Thanks in advance for help.
Hi @trevorstephens ,
I am not sure whether this is a bug or whether the documentation for SymbolicTransformer is incorrect.
I have put together a showcase of how SymbolicRegressor works and correctly recovers the equation that represents the dataset, while SymbolicTransformer does not behave in the same way.
Starting with SymbolicRegressor, I built an "easy" dataset to check whether SymbolicRegressor gives me the correct result and good metrics.
from gplearn.genetic import SymbolicRegressor
from sklearn import metrics
import pandas as pd
import numpy as np
# Load data
X = np.random.uniform(0,100,size=(100,3))
y = np.min(X[:,:2],axis=1)*X[:,2]
index = 80
X_train , y_train = X[:index,:], y[:index]
X_test , y_test = X[index:,:], y[index:]
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
'abs', 'neg', 'inv', 'max', 'min', 'sin', 'cos', 'tan']
est_gp = SymbolicRegressor(population_size=5000,
generations=20, stopping_criteria=0.001,
function_set=function_set,
p_crossover=0.7, p_subtree_mutation=0.1,
p_hoist_mutation=0.05, p_point_mutation=0.1,
max_samples=0.9, verbose=1,
n_jobs=1,
parsimony_coefficient=0.01, random_state=0)
est_gp.fit(X_train, y_train)
print 'Score: ', est_gp.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_gp.predict(X_test))
print est_gp._program
This example gives a perfect result, and the MAE is practically perfect, as the output shows:
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 11.81 8396.89543051 10 25.3022470326 26.608049431 35.35s
1 12.36 8904.35549713 8 20.0284767508 19.0994923956 37.34s
2 13.74 37263.312834 8 7.82583874247e-14 2.13162820728e-14 36.67s
Score: 1.0 5.71986902287e-14
abs(div(neg(X2), inv(min(X0, X1))))
However, with SymbolicTransformer, although the training works well, the transform does not.
Here is the same example as above, but with SymbolicTransformer:
from gplearn.genetic import SymbolicRegressor,SymbolicTransformer
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics
X = np.random.uniform(0,100,size=(100,3))
y = np.min(X[:,:2],axis=1)*X[:,2]
index = 80
X_train , y_train = X[:index,:], y[:index]
X_test , y_test = X[index:,:], y[index:]
# Linear model - Original features
est_lin = linear_model.Lars()
est_lin.fit(X_train, y_train)
print 'Lars(orig): ', est_lin.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_lin.predict(X_test))
# Create added value features
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
'abs', 'neg', 'inv', 'max', 'min']
gp = SymbolicTransformer(generations=20, population_size=2000,
hall_of_fame=100, n_components=10,
function_set=function_set,
parsimony_coefficient=0.0005,
max_samples=0.9, verbose=1,
random_state=0, n_jobs=3)
gp.fit(X_train, y_train)
gp_features = gp.transform(X)
# Linear model - Transformed features
newX = np.hstack((X, gp_features))
print 'newX: ', np.shape(newX)
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Linear model - "The" feature
newX = np.append(X, (np.min(X[:,:2],axis=1)*X[:,2]).reshape(-1,1), axis=1)
print 'newX: ', newX.shape
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
I use Lars from sklearn to avoid Ridge's sparse weights and to find the best solution quickly for this easy and exact example. As can be seen in the results of this code (below), although the fitness becomes perfect during the fit, the added features generated by transform seem to be wrong. The problem does not come from Lars: the last Lars example shows that adding "the feature", which is the target itself, gives perfect accuracy.
X: (100, 3)
y: (100,)
Lars(orig): 0.850145084161 518.34496409
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 14.62 0.349810294784 6 0.954248106272 0.939129495332 16.04s
1 16.01 0.601354215127 6 1.0 1.0 25.56s
newX: (100, 13)
Lars(trans): 0.83552794823 497.438879508
newX: (100, 4)
Lars(trans): 1.0 1.60411683936e-12
So I decided to look at the features created during the fit. Some of them are perfect; however, the transform does not seem to use them correctly in the gp_features it creates.
>>>print 'Eq. of new features: ', gp.__str__()
mul(mul(neg(sqrt(min(neg(mul(mul(X1, X0), add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0), sub(X2, 0.904)))))), X2))), sqrt(max(X2, X2))), X1),
div(min(div(abs(X0), log(0.901)), log(max(X2, -0.222))), X0),
mul(sub(X1, X0), mul(X1, X0)),
mul(X2, inv(X2)),
mul(mul(neg(sqrt(min(X0, X2))), add(neg(X0), min(X0, X2))), X1),
div(abs(mul(X0, X2)), inv(mul(mul(neg(sqrt(min(X0, X2))), mul(neg(X2), max(X1, X1))), X1))),
div(abs(mul(X0, X2)), inv(mul(0.640, mul(X1, X0)))),
div(abs(mul(X0, X2)), inv(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(mul(X0, 0.424)))), mul(sub(min(sub(-0.603, 0.299), sub(0.063, X1)), neg(min(X1, -0.125))), mul(max(mul(X0, X2), sqrt(X0)), min(sub(X1, 0.570), log(0.341))))))),
mul(neg(mul(div(X2, -0.678), neg(X1))), div(sqrt(max(X2, X2)), min(X1, X0)))]
>>>
>>>df = pd.DataFrame(columns=['Gen','OOB_fitness','Equation'])
>>>for idGen in range(len(gp._programs)):
>>>    for idPopulation in range(gp.population_size):
>>>        if(gp._programs[idGen][idPopulation] != None):
>>>            df = df.append({'fitness': value_fitness_, 'OOB_fitness': value_oobfitness_, 'Equation': str(gp._programs[-1][idPopulation])}, ignore_index=True)
>>>
>>>print 'Best of last Gen: '
>>>print df[df['Gen']==df['Gen'].max()].sort_values('OOB_fitness')
Best of last Gen:
Gen OOB_fitness Equation
1126 2.0 0.000000 add(0.944, sub(X0, X0))
952 2.0 0.000000 div(min(X2, X0), min(X2, X0))
1530 2.0 0.000000 min(inv(neg(abs(log(min(X1, 0.535))))), neg(su...
2146 2.0 0.000000 div(abs(mul(X0, X2)), inv(mul(mul(neg(sqrt(min...
2148 2.0 0.000000 div(min(add(-0.868, -0.285), X2), sqrt(sqrt(0....
2150 2.0 0.000000 sub(-0.603, 0.299)
2476 2.0 0.000000 min(min(max(X0, X2), add(-0.738, 0.612)), sqrt...
1601 2.0 0.000000 neg(min(X1, -0.125))
1271 2.0 0.000000 add(-0.504, 0.058)
1742 2.0 0.000000 add(inv(log(abs(-0.575))), inv(log(abs(-0.575))))
733 2.0 0.000000 abs(-0.575)
1304 2.0 0.000000 abs(sqrt(-0.758))
1630 2.0 0.000000 div(abs(mul(X0, X2)), inv(mul(max(X2, X2), add...
652 2.0 0.000000 log(0.341)
1708 2.0 0.000000 0.904
2262 2.0 0.000000 sqrt(-0.715)
1338 2.0 0.000000 mul(X2, sub(X1, X1))
826 2.0 0.000000 div(min(X2, add(sub(neg(sub(0.096, -0.886)), m...
1615 2.0 0.000000 abs(add(0.640, 0.766))
2415 2.0 0.000000 log(abs(-0.575))
1670 2.0 0.000000 min(X0, 0.657)
1644 2.0 0.000000 log(min(-0.524, X0))
2361 2.0 0.000000 0.944
785 2.0 0.000000 min(inv(log(abs(log(min(X1, 0.535))))), neg(mu...
2367 2.0 0.000000 abs(-0.911)
2249 2.0 0.000000 0.904
960 2.0 0.000000 inv(inv(-0.045))
955 2.0 0.000000 div(add(X1, X2), inv(sub(X2, X2)))
2397 2.0 0.000000 -0.125
1878 2.0 0.000000 div(min(X2, add(sub(neg(sub(0.096, -0.886)), m...
... ... ... ...
1103 2.0 0.997786 mul(X2, abs(sub(mul(X0, X1), add(X2, X0))))
2225 2.0 0.997790 mul(sub(min(log(div(X0, -0.717)), neg(sqrt(mul...
1890 2.0 0.998069 mul(sub(div(X2, 0.309), neg(X2)), sub(max(X2, ...
1704 2.0 0.998283 add(sub(log(min(add(0.769, X1), abs(X1))), sub...
1829 2.0 0.998284 add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0)...
700 2.0 0.998345 add(sub(log(min(add(0.769, X1), abs(X1))), sub...
1770 2.0 0.998638 mul(add(min(X0, min(X1, X1)), X2), sqrt(abs(ab...
2344 2.0 0.998692 div(min(X2, add(sub(neg(sub(0.096, abs(-0.575)...
985 2.0 0.998793 sub(min(mul(sub(min(sqrt(log(max(X1, X2))), ne...
1634 2.0 0.998815 add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0)...
1412 2.0 0.998945 mul(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(m...
855 2.0 0.998965 add(inv(log(abs(X1))), neg(mul(mul(X1, X0), su...
839 2.0 0.998996 add(inv(abs(add(min(X0, min(X1, X1)), X2))), n...
1528 2.0 0.999066 add(sub(log(min(add(0.769, X1), abs(X1))), sub...
690 2.0 0.999875 add(sub(log(min(add(0.769, X1), abs(X1))), sub...
2047 2.0 0.999895 sub(min(neg(X1), div(X1, X2)), sub(min(abs(X1)...
1951 2.0 0.999921 sub(min(min(X2, X0), X2), mul(min(X1, X0), neg...
1981 2.0 0.999954 mul(X2, neg(neg(min(add(0.448, X0), sub(X1, -0...
2349 2.0 0.999954 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2364 2.0 0.999960 add(inv(log(abs(-0.575))), mul(X2, neg(neg(min...
2487 2.0 0.999971 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2056 2.0 0.999975 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
1559 2.0 0.999976 mul(X2, neg(neg(min(add(0.448, X0), abs(X1)))))
975 2.0 0.999982 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2032 2.0 0.999992 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
1288 2.0 1.000000 sub(min(div(-0.992, X2), X2), mul(min(X1, X0),...
2482 2.0 1.000000 sub(min(abs(inv(neg(X1))), X2), mul(min(X1, X0...
1776 2.0 1.000000 mul(min(mul(add(X0, X0), abs(log(X1))), min(ab...
2392 2.0 1.000000 mul(neg(X2), max(div(0.933, X0), min(X0, min(X...
1329 2.0 1.000000 mul(min(X1, X0), neg(X2))
[2000 rows x 3 columns]
Is this a bug? I am doing the same thing as explained in the SymbolicTransformer example.
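A diagnostic that may help narrow this down, offered as a sketch rather than an explanation of the behaviour: assuming the fit used a correlation-based metric, a fitness of 1.0 means some program's output is perfectly correlated with y, so at least one column returned by transform should show an absolute correlation of 1 with the target if that program was selected. The helper below computes this per column with NumPy; constant columns (for example mul(X2, inv(X2))) are reported as 0 to avoid NaN.

```python
import numpy as np


def feature_target_correlation(features, y):
    """Absolute Pearson correlation of each feature column with y."""
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    result = np.zeros(features.shape[1])
    for j in range(features.shape[1]):
        column = features[:, j]
        if np.std(column) > 0 and np.std(y) > 0:  # skip constant columns
            result[j] = abs(np.corrcoef(column, y)[0, 1])
    return result


# Demo data shaped like the report: y = min(X0, X1) * X2.
rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(100, 3))
y = np.min(X[:, :2], axis=1) * X[:, 2]

# Stand-in "transformed" features: an affine copy of y, a constant, and X0.
demo_features = np.column_stack([3.0 - 2.0 * y, np.ones(100), X[:, 0]])
corr = feature_target_correlation(demo_features, y)
print(corr)
```

With the real transformer, passing gp.transform(X_train) and y_train would show whether any of the returned components matches the perfect fitness reported during the fit.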
I'm using the SymbolicRegressor class to iteratively train models on some data (for use as a surrogate model, actually). Is there a way to reuse the last generation of a previous model as the initial population of a new model? Ideally, something like "half new, half old", so I could still get new models as well.
I was trying to use a self-defined fitness function.
Actually, my measure function won't use y_true or w at all. Instead of using a metric that compares y_true and y_pred, I measure the fitness of individuals by feeding y_pred into some other function (which returns a float) and using the returned value as the fitness.
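A minimal sketch of such a fitness function, assuming the three-argument (y, y_pred, w) signature described above. The helper external_score is a hypothetical stand-in for the "other function"; here it simply returns the standard deviation of the predictions.

```python
import numpy as np


def external_score(y_pred):
    """Hypothetical stand-in for the external function: predictions in, one float out."""
    return float(np.std(y_pred))


def prediction_only_fitness(y, y_pred, w):
    """Fitness that depends on y_pred only.

    y and w are accepted so the signature has three arguments, but they are ignored.
    """
    return external_score(y_pred)


# Quick check on toy data: the target and weights have no influence on the result.
score = prediction_only_fitness(np.zeros(4), np.array([1.0, 2.0, 3.0, 4.0]), np.ones(4))
```

With gplearn installed, a function of this shape could presumably be wrapped with gplearn.fitness.make_fitness (passing it as function together with a greater_is_better flag) and supplied as the estimator's metric.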
I think the following can be improved:
Thanks for your work on genetic programming. It's awesome and really easy to understand and use.
I'd be happy to help you improve this fantastic tool. There is much more that could be added, such as additional selection and mutation functions, or even user-defined selection and mutation functions.
It would be helpful if the code had a walltime, aka max_time_mins, that it enforced just like the stopping_criteria.
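A rough sketch of what such a wall-time limit could look like, written as a generic loop rather than as gplearn code. The step callback and the max_time_mins parameter name are assumptions taken from the request above, not an existing API.

```python
import time


def run_with_time_budget(step, max_generations, max_time_mins):
    """Call step(gen) once per generation until generations or time run out.

    Returns the number of generations that were actually completed.
    """
    deadline = time.monotonic() + max_time_mins * 60.0
    completed = 0
    for gen in range(max_generations):
        # Check the budget before starting a generation, never in the middle of one.
        if time.monotonic() >= deadline:
            break
        step(gen)
        completed += 1
    return completed


# Toy usage: a step that does nothing finishes all generations well within a minute.
finished = run_with_time_budget(lambda gen: None, max_generations=5, max_time_mins=1.0)
```

Checking the clock only between generations keeps the population in a consistent state, at the cost of overshooting the budget by at most one generation.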
graph = pydotplus.graphviz.graph_from_dot_data(est_gp._program.export_graphviz())
print graph
Image(graph.create())
Error:
sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))
<pydotplus.graphviz.Dot object at 0x7f3a3b2f56d0>
Traceback (most recent call last):
File "test.py", line 83, in
Image(graph.create())
File "/home/josimar/Envs/test/local/lib/python2.7/site-packages/pydotplus/graphviz.py", line 1960, in create
'GraphViz's executables not found')
pydotplus.graphviz.InvocationException: GraphViz's executables not found
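This error usually indicates that the Graphviz binaries (such as dot) are not installed or not on the PATH; installing the Python package alone does not provide them. A small check, using only the standard library, can confirm whether the executable is visible before calling graph.create(). The function name graphviz_available is made up for this sketch.

```python
import shutil


def graphviz_available(executable="dot"):
    """Return True if the given Graphviz executable can be found on the PATH."""
    return shutil.which(executable) is not None


if not graphviz_available():
    print("Graphviz 'dot' was not found on PATH; install Graphviz itself, "
          "not only the Python bindings.")
```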
Many other libraries are dropping support for this version. Removing this from Travis checks will aid in keeping aligned with packages gplearn depends upon.
hi guys, could you include fuzzy functions as default?
import numpy as np

def fuzzy_not(x):
    return np.minimum(1, np.maximum(0, 1 - x))

def fuzzy_and(x, y):
    return np.minimum(1, np.maximum(0, x * y))

def fuzzy_or(x, y):
    return np.minimum(1, np.maximum(0, x + y / 2))

def fuzzy_xor(x, y):
    return np.minimum(1, np.maximum(0, 0.5 - 2 * (x - 0.5) * (y - 0.5)))
I just started to use this package. I was running the gp_examples.ipynb. Everything was fine except that it takes forever for SymbolicTransformer to run with user-defined logical function on my computer in Example 3. Is it normal?
Thanks a lot for your help.
Add the ability to create a scikit-learn scorer object to allow users to define their own fitness measure.
Hi, how can I output the best program as a readable expression or a diagram?
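One possible approach, sketched under the assumption that the program's printed form is the nested prefix notation seen elsewhere on this page (for example sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))). The helper to_infix and the BINARY mapping are hypothetical and not part of gplearn; they only rewrite the text.

```python
import ast

# Assumed mapping from function names in the printed programs to infix symbols.
BINARY = {"add": "+", "sub": "-", "mul": "*", "div": "/"}


def to_infix(program_text):
    """Rewrite nested prefix calls such as 'sub(X0, mul(X1, 2.0))' in infix form."""

    def render(node):
        if isinstance(node, ast.Call):
            name = node.func.id
            args = [render(arg) for arg in node.args]
            if name in BINARY and len(args) == 2:
                return "(" + args[0] + " " + BINARY[name] + " " + args[1] + ")"
            # Any other function (log, sqrt, neg, ...) stays in call notation.
            return name + "(" + ", ".join(args) + ")"
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            # Negative constants parse as a unary minus applied to a number.
            return "-" + render(node.operand)
        if isinstance(node, ast.Constant):
            return repr(node.value)
        raise ValueError("unsupported syntax in program text")

    return render(ast.parse(program_text, mode="eval").body)


readable = to_infix("sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))")
```

Note that this is a purely textual rewrite: if div is a protected division in the library, the "/" shown here is only an approximation of its behaviour near zero.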
Hello Trevor, thanks for your fantastic gp Tool.
I am starting to use it.
Have you considered integrating sympy with gplearn? I mean exporting individual formulas to sympy expressions, so that we can use all the machinery of sympy to reduce, simplify and symbolically analyse the generated programs/formulas.
And, even more, how about simplifying the formulas at each iteration by interacting with sympy?
Cheers,
Jose
Hi, So I've had a look at the custom functions definition. While constructing possible individuals from my code I ran into some issues.
pset = gp.PrimitiveSetTyped("MAIN",[int],str)
pset.addPrimitive(intprocessor,[int],list)
pset.addPrimitive(listprocessor,[list],str)
pset.renameArguments(ARG0='x')
The error I'm facing is this:
IndexError: The gp.generate function tried to add a terminal of type '<type 'str'>', but there is none available
One way of solving this issue is to add terminals, as the error message suggests:
pset.addTerminal("hello", str)
pset.addTerminal([123], list)
But when no terminals are defined, shouldn't the GP produce only listprocessor(intprocessor(x)), since it's the only tree that is possible?
Hi Again,
When I tried to save the model to disk using
from sklearn.externals import joblib
it created ~100,000 6kb numpy files and one 42mb pickle file.
Please advise how to fix this.
Thank you.
Hi Trevor,
it's a very nice implementation - I was searching for such solution for a long time. So really thank you!
I have only one issue: with a long evolution (generations = some_huge_number, or population_size = some_huge_number with generations = some_number) the program runs out of memory. I checked the code and it saves every iteration's population. Do you think that is necessary? As I understand it, we only need the current population and the best of the previous one at the beginning of each iteration.
What do you think, can the code be changed some way to make
self._programs = []
before every iteration and just save the previous one in a self._programs_prev (or something)?
In many cases one invests significant time and effort to fit/train the GP model. Unfortunately, without a method of "saving" the model and "reloading" it from its saved state, one has to redo the training from the initial state, spending the same (significant) time and effort again.
It would be ideal to:
(a) save the state of the model to a file
(b) load a model from a (state) file
(c) Incrementally update the loaded model with new data
This is a feature request.
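Points (a) and (b) can be sketched with the standard library's pickle module, assuming the fitted estimator is picklable. The helper names save_model and load_model are made up for this example, and a plain dictionary stands in for the model. Point (c), incremental updating, is not covered here.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize any picklable object (e.g. a fitted estimator) to a single file."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load an object previously written by save_model."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round trip with a stand-in object; a fitted estimator would take its place.
stand_in = {"generations": 20, "best_program": "sub(X1, X0)"}
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in, model_path)
    restored = load_model(model_path)
```

As with any pickle file, it should only be loaded from a trusted source, and it may not load correctly under a different library version than the one that wrote it.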
Hi, I am trying to fit a program to solve the XOR problem. Here is my data:
X = [[1,1],
[1,0],
[0,1],
[0,0]]
y = [0,1,1,0]
These are the settings:
est_gp = SymbolicRegressor(population_size=5000,
                           generations=20, stopping_criteria=0.01,
                           init_depth=(2, 4),
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=1, verbose=1,
                           parsimony_coefficient=0.01, random_state=0)
est_gp.fit(X, y)
print(est_gp._program)
The result is sub(X1, X0), which is unexpected and incorrect. Below is the run log:
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 15.48 7.45612899871 3 0.0 2.0 1.23m
Thus, I checked the function evaluation:
est_gp.predict([1,0])
array([-1])
It seems right, but,
est_gp.score([1,0],[1])
0.0
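The reported program can be checked by hand against the four XOR rows. The sketch below evaluates sub(X1, X0) in plain Python and computes the mean absolute error over the whole dataset; that the run used mean absolute error as its metric is an assumption here.

```python
X = [[1, 1],
     [1, 0],
     [0, 1],
     [0, 0]]
y = [0, 1, 1, 0]


def reported_program(x0, x1):
    """Plain-Python equivalent of the reported program sub(X1, X0)."""
    return x1 - x0


preds = [reported_program(x0, x1) for x0, x1 in X]
abs_errors = [abs(target - pred) for target, pred in zip(y, preds)]
mae = sum(abs_errors) / len(abs_errors)

print(preds)       # [0, -1, 1, 0]
print(abs_errors)  # [0, 2, 0, 0]
print(mae)         # 0.5
```

Over all four rows the error is 0.5 rather than 0.0, and the only wrong row is [1, 0] with an absolute error of 2, which matches the OOB fitness of 2.0 in the log. This suggests, without confirming it, that the reported fitness of 0.0 was computed on a sample that did not contain that row.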
Hi,
I just started using gplearn.
When I run the symbolic regressor example with n_jobs = 1, it took about 5 minutes and finished. However, when I set n_jobs = 15, it was still running after one hour, and I had to stop it manually.
Here is my environment:
Windows 7 Professional,
Anaconda 3,
CPU Xeon E5-2687w v2.
Could anyone help me? Thanks very much.