
Comments (8)

iblasi commented on June 19, 2024

Hi @trevorstephens

Looking through the source code, I understand why the situation I described earlier happens, and I have a recommendation to improve SymbolicTransformer, if you would like to consider it.
To use SymbolicTransformer, the code in genetic.py performs the following steps to extract complex added-value features.

You sort by fitness on line 479, hall_of_fame = fitness.argsort()[:self.hall_of_fame], and then take the number of components to fix them as _best_programs on line 500, self._best_programs = [self._programs[-1][i] for i in hall_of_fame[components]].
The _best_programs are then used on line 1013, inside class SymbolicTransformer in the transform method, to execute those best programs.
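As an illustrative sketch only (not the actual gplearn source; variable names such as final_generation, hof_size and n_components are made up), the selection and transform steps described above boil down to roughly:

import numpy as np

# Final generation of evolved programs and their fitness values.
final_generation = gp._programs[-1]
fitness = np.array([program.fitness_ for program in final_generation])

# 1. Keep the indices of the hof_size best programs (the "hall of fame").
hall_of_fame = fitness.argsort()[:hof_size]

# 2. Pick n_components programs out of the hall of fame (here simply the top
#    ones); these become _best_programs.
best_programs = [final_generation[i] for i in hall_of_fame[:n_components]]

# 3. transform() executes each selected program on X and stacks the outputs
#    column-wise to build the new features.
new_features = np.column_stack([program.execute(X) for program in best_programs])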

To be honest, I checked all the code on GitHub and the calculations seem correct. However, if I do it by hand I can fix the error, so I leave this here to report it and to fix the code. I installed gplearn through pip and have version 0.2, but the pip code may differ from GitHub, which may be why I cannot find the error there.
The following example shows exactly what I am talking about:

# Linear model - Original features
# (assumes X, y, the train/test split at `index`, and a fitted SymbolicTransformer `gp`,
#  set up as in the full example later in this thread)
import numpy as np
from sklearn import linear_model, metrics

est_lin = linear_model.Lars()
est_lin.fit(X_train, y_train)
print 'Lars(Orig features): ', est_lin.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_lin.predict(X_test))
# Output: Lars(Orig features):  0.775865035322 764.451209695


# Linear model - Transformed features
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Orig code): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Orig code):  0.838022410672 600.687128977


# Linear model - Hand made transform fitness_
index_fitness = np.argsort([prog.fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_fitness[i]] for i in range(gp.n_components)]
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Hand made fitness_): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Hand made fitness_):  1.0 1.50051021564e-07


# Linear model - Hand made transform oob_fitness_
index_oob_fitness = np.argsort([prog.oob_fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_oob_fitness[i]] for i in range(gp.n_components)]
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Hand made oob_fitness_): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Hand made oob_fitness_):  1.0 3.90751682033e-07


# Linear model - "The feature"
newX = np.append(X, (np.min(X[:,:2],axis=1)*X[:,2]).reshape(-1,1), axis=1)
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - The Feature): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - The Feature):  1.0 7.2901684689e-13

So, it is clear that everything works except the selection of the _best_programs.
Also, it is strange that oob_fitness_ gives almost the same result as "The feature". Looking at the code, I think it might be due to the max_samples variable, but that is just a guess without any testing.

I hope this helps to improve the code.
Just out of curiosity, what does OOB mean? "Out of ..."? What is the difference between fitness_ and oob_fitness_, and what is each one for?

I ask because, using the Boston data from your documentation example and taking the first 300 samples for training, the results are the following (I used Ridge and Lars to stay consistent with the documentation example and the scripts above):

Ridge(Orig features):  -0.260236502156 5.00110179084
Ridge(trans - Orig code):  0.465892381009 3.04110672459
Ridge(trans - Hand made fitness_):  0.252378627452 3.77678891869
Ridge(trans - Hand made oob_fitness_):  0.399317288976 3.3718350818

Lars(Orig features):  -0.350599260885 5.16027432523
Lars(trans - Orig code):  -14.8537565657 14.5151593774
Lars(trans - Hand made fitness_):  -8384.20328411 402.904365315
Lars(trans - Hand made oob_fitness_):  -5.06798482764 10.2869394223

Curious values, since on Boston the original transform code from the documentation example works better. However, using the dataset from my example above, the results are (with Ridge and Lars):

Ridge(Orig features):  0.69342345079 676.190448752
Ridge(trans - Orig code): 0.75514334089 606.258605416
Ridge(trans - Hand made fitness_):  0.9999999997 0.0176059710462
Ridge(trans - Hand made oob_fitness_):  0.999999999605 0.019754171254
Ridge(trans - The Feature):  1.0 1.09972716373e-05

Lars(Orig features):  0.558940611669 757.375881816
Lars(trans - Orig code):  0.288037192098 824.847940324
Lars(trans - Hand made fitness_):  1.0 5.88329385209e-13
Lars(trans - Hand made oob_fitness_):  1.0 7.58141638357e-10
Lars(trans - The Feature):  1.0 5.88329385209e-13

Very curious... but I have no idea what exactly happens with _best_programs.


iblasi commented on June 19, 2024

Looking at other issues, I see that this problem was already reported in issue #42.
I leave the solution here anyway for users who want to fix it in the pip-installed version.


trevorstephens commented on June 19, 2024

Hi @iblasi, thank you greatly for your in-depth description of the issue! 👍

You appear to be correct. I am working on fixing #42 in #60 ... I believe I came across the same issue as you when debugging that problem. If you have the time, are you able to use the code from #60 to see if it generates the features as you would hope?

FWIW, most raised issues have been related to the regressor, so this bug may well have been present but unnoticed for a while.

OOB means "out of bag", as we use "bagging" to subsample the rows used for evaluation.
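As an illustrative sketch only (not the exact gplearn internals; metric and program here are placeholders), this is roughly how max_samples produces both an in-bag and an out-of-bag fitness for each program:

import numpy as np

rng = np.random.RandomState(0)
n_samples = len(y_train)
n_in_bag = int(0.9 * n_samples)   # e.g. max_samples=0.9

# Rows used to evaluate the program during evolution ("in bag")...
in_bag = rng.choice(n_samples, size=n_in_bag, replace=False)
# ...and the held-out rows ("out of bag") used only for the OOB score.
out_of_bag = np.setdiff1d(np.arange(n_samples), in_bag)

# fitness_ is computed on the in-bag rows, oob_fitness_ on the out-of-bag rows,
# which gives a rough check against overfitting the subsample.
fitness = metric(y_train[in_bag], program.execute(X_train[in_bag]))
oob_fitness = metric(y_train[out_of_bag], program.execute(X_train[out_of_bag]))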


iblasi commented on June 19, 2024

@trevorstephens, I noticed there were some errors in my first example code; they have now been fixed, in case you want to run exactly that code to see the bug more clearly.

I have tested your code and it still does not work properly.
In my opinion, although I understand what you are doing in fit with the correlation coefficients, removing the worst programs so that only the _best_programs are kept, I checked and that is where the problem comes from.
I am not sure why you perform so many operations instead of simply doing, as I show in my example:

index_fitness = np.argsort([prog.fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_fitness[i]] for i in range(gp.n_components)]

I mean, you have already measured fitness and you just want to sort by that score. So just take the n_components defined by the user and return the programs with the best fitness achieved.

You can still keep the if self._metric.greater_is_better: statement, dropping the [::-1] in the else branch to avoid turning the fitness order upside down.
I replaced the original code in gplearn.py (lines 480-505) with the following, and it works perfectly.

        if isinstance(self, TransformerMixin):
            # Find the best individuals in the final generation
            fitness = np.array(fitness)
            if self._metric.greater_is_better:
                index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1]
            else:
                index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])

            self._best_programs = [self._programs[-1][index_fitness[i]] for i in range(self.n_components)]

        return self

I may be doing something wrong, so please check it, but the code works and gives the expected results.


iblasi commented on June 19, 2024

@trevorstephens, I now realize what I think you were trying to do with the correlation matrix.
If I am right, you were building a hall of fame and then trying to take n_components programs that are all different from each other, to avoid generating dependent, equal or very similar features.
In that case you should change gplearn.py (line 504) to take the least correlated programs/equations: worst = worst[np.argmin(np.sum(correlations[worst, :], 1))] (note the change to argmin).

That worked for me as well, although the results are not perfect with LARS. But that is logical, as I am creating the least correlated (most distinguishable) features rather than the top-fitness features, which may be the exact features (as in the example shown).

EDITED
This is easier to see with the full example mentioned earlier:

import numpy as np
from sklearn import linear_model, metrics
from gplearn.genetic import SymbolicTransformer

X = np.random.uniform(0,100,size=(100,3))
y = np.min(X[:,:2],axis=1)*X[:,2]
index = 80
X_train , y_train = X[:index,:], y[:index]
X_test , y_test = X[index:,:], y[index:]

# Create added value features
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log', 'abs', 'neg', 'inv', 'max', 'min']

gp = SymbolicTransformer(generations=20, population_size=2000,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0, n_jobs=3)

gp.fit(X_train, y_train)
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars: ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:])), est_lin.coef_

# Best of Hall of Fame
for i in range(gp.n_components):
    print 'fitness: ', gp._best_programs[i].fitness_, ' - OOB_fitness: ',  gp._best_programs[i].oob_fitness_, ' - Equation:',  str(gp._best_programs[i])

Using the actual code from gplearn-fix-second-issue-from-42, the output is:

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.62   0.341830473377       13   0.980737768321   0.923335544581     22.97s
   1     16.7   0.586896359115       10   0.999958560964   0.999807398718     41.50s
   2    24.63   0.729392599276       23              1.0              1.0     54.02s
Lars(trans - Orig code):  -0.428482827341 1681.25844662 [  6.53825764e+02  -5.82911477e+00  -3.30141921e+02  -1.28659452e+01
   6.70911585e-02   5.78386284e-04   3.09191951e+00   1.79057224e-01
  -8.23886970e-01   6.99084916e+01   3.42641835e-01  -5.64189101e+01
  -1.68442146e-01]
fitness:  0.957459388729  - OOB_fitness:  0.943425675292  - Equation: mul(sqrt(neg(mul(min(X1, X2), min(X0, X1)))), add(log(sqrt(log(inv(-0.520)))), sqrt(sub(mul(sqrt(X0), min(X0, X2)), div(mul(-0.673, -0.937), mul(X1, X2))))))
fitness:  0.966664740558  - OOB_fitness:  0.971917258901  - Equation: add(div(abs(X0), inv(X2)), neg(mul(mul(X1, X0), sub(X2, 0.904))))
fitness:  0.952392215033  - OOB_fitness:  0.937331147986  - Equation: neg(sub(mul(sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2))), sub(neg(sqrt(-0.758)), min(div(X2, -0.008), inv(X1)))), sub(min(add(min(X1, X2), mul(-0.902, X2)), abs(inv(X2))), neg(mul(min(X0, 0.416), div(0.577, 0.194))))))
fitness:  0.97197765133  - OOB_fitness:  0.998886554151  - Equation: mul(min(X1, X2), min(X0, X2))
fitness:  0.956167302787  - OOB_fitness:  0.895024815031  - Equation: mul(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(mul(X0, 0.424)))), mul(sub(min(sub(X1, 0.570), log(0.341)), neg(sqrt(X1))), mul(max(mul(X0, X2), sqrt(X0)), min(sub(X1, 0.570), log(0.341))))), abs(abs(log(max(neg(X2), add(X1, X1))))))
fitness:  0.970261237422  - OOB_fitness:  0.987509757459  - Equation: mul(add(div(X2, mul(add(X1, X0), mul(0.211, sub(-0.603, 0.299)))), 0.107), sub(max(X2, X2), mul(X1, X0)))
fitness:  0.978940700006  - OOB_fitness:  0.990502720054  - Equation: add(div(abs(X0), log(0.901)), neg(mul(min(X0, min(X1, X1)), sub(X2, 0.904))))
fitness:  0.988439517202  - OOB_fitness:  1.0  - Equation: mul(neg(X2), min(min(X0, X1), X2))
fitness:  0.985085078049  - OOB_fitness:  0.997438798978  - Equation: sub(add(sub(div(X2, 0.309), neg(X2)), mul(neg(X2), sqrt(X2))), mul(min(X1, X0), abs(sub(X2, 0.504))))
fitness:  0.977089527904  - OOB_fitness:  0.999715095651  - Equation: neg(sub(mul(min(min(X0, X1), add(X2, min(X2, X0))), sub(neg(sqrt(-0.758)), min(div(X2, -0.008), inv(X1)))), sub(min(add(min(X1, X2), mul(X0, X2)), abs(inv(X2))), neg(mul(min(X0, 0.416), div(0.577, 0.194))))))

The MAE is 1681.25844662 (it depends on the random_state used, but it is always high), so clearly this is not correct.

Using the argmin change mentioned above, worst = worst[np.argmin(np.sum(correlations[worst, :], 1))], you may get good features, or even "the feature" in this case.

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.62   0.348067860081       13   0.962599606892   0.978903441836     22.29s
   1    17.73    0.58764177602        6              1.0              1.0     36.75s
Lars:  1.0 3.77298192689e-13 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
fitness:  0.887486179235  - OOB_fitness:  0.939257217581  - Equation: mul(min(mul(sqrt(max(0.070, X2)), abs(log(X1))), min(abs(min(X0, X0)), max(abs(X1), div(-0.969, 0.919)))), neg(neg(X2)))
fitness:  0.896724908859  - OOB_fitness:  0.774177697706  - Equation: mul(sqrt(neg(add(min(X1, X0), neg(sqrt(X0))))), add(log(sqrt(log(inv(-0.520)))), sqrt(sub(mul(sqrt(X0), min(X0, X2)), div(mul(-0.673, -0.937), mul(X1, X2))))))
fitness:  0.90185742783  - OOB_fitness:  0.913124629422  - Equation: mul(min(mul(sqrt(max(0.070, X2)), abs(log(X1))), min(abs(min(X0, X0)), max(abs(X1), div(-0.969, 0.919)))), abs(neg(abs(-0.805))))
fitness:  0.905559461075  - OOB_fitness:  0.94344057166  - Equation: neg(mul(inv(add(inv(min(min(X0, X0), min(X2, X0))), div(mul(inv(X1), inv(X1)), inv(log(X2))))), X2))
fitness:  0.891892597392  - OOB_fitness:  0.741803381846  - Equation: div(min(X2, add(sub(neg(sub(0.096, abs(-0.575))), min(min(X2, X1), X2)), min(log(max(X1, X2)), abs(sqrt(X1))))), max(div(min(log(sqrt(X0)), X1), min(sub(sub(X2, X2), add(X0, X0)), max(div(X0, X0), abs(X2)))), min(min(-0.779, X0), sub(X0, div(min(X0, 0.657), neg(X1))))))
fitness:  0.916011339582  - OOB_fitness:  0.828693451632  - Equation: inv(add(inv(min(min(X0, X0), min(X2, X0))), div(mul(inv(X1), inv(X1)), inv(log(X2)))))
fitness:  0.928574574624  - OOB_fitness:  0.70923178817  - Equation: div(min(X2, X0), inv(X1))
fitness:  0.925215746576  - OOB_fitness:  0.813373952474  - Equation: mul(sqrt(div(X0, div(-0.100, X1))), mul(log(sqrt(X2)), min(abs(X1), div(X2, -0.385))))
fitness:  0.944269300313  - OOB_fitness:  0.761873869578  - Equation: mul(neg(mul(div(X2, -0.678), neg(X1))), div(inv(sqrt(X1)), abs(inv(X0))))
fitness:  0.997  - OOB_fitness:  1.0  - Equation: div(X2, inv(min(X0, X1)))

The MAE is 3.77298192689e-13, which is perfect. But you may pick other features rather than the highest-fitness ones, and if you run this option several times, it sometimes does not converge and the MAE is not ~zero.

Using my proposed index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1], most of the results are the same equation, so they are dependent/similar/identical, as mentioned. The MAE is 4.13535872212e-13, let's call it perfect, as expected. But it is only expected because the created feature is exactly the target equation. If we are looking for different added-value features, using correlation coefficients seems to make more sense.

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.62   0.358673286081       15   0.977149386266   0.923139960675     22.40s
   1     15.4   0.626041566983       11   0.999943066381   0.999969695391     36.42s
   2    23.91   0.737708119156       23              1.0              1.0     45.02s
Lars:  1.0 4.13535872212e-13 [ 0.  0.  0. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
fitness:  0.997  - OOB_fitness:  1.0  - Equation: mul(min(X1, X0), neg(X2))
fitness:  0.994449946246  - OOB_fitness:  0.999899881163  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.994449639861  - OOB_fitness:  0.999905066915  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.9944454337  - OOB_fitness:  0.999955978128  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.9944379586  - OOB_fitness:  0.999983214347  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.994433378141  - OOB_fitness:  0.999997279781  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.994243226944  - OOB_fitness:  0.999897627664  - Equation: sub(X0, mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness:  0.993668918502  - OOB_fitness:  0.999729768531  - Equation: sub(X2, mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness:  0.993599015477  - OOB_fitness:  0.999705898121  - Equation: sub(log(X0), mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness:  0.993499415109  - OOB_fitness:  0.999999998  - Equation: sub(min(abs(inv(neg(X1))), X2), mul(min(X1, X0), neg(X2)))

In summary, I am not sure that using correlation coefficients is the best choice, as you may miss the highest-fitness features.
So you may be interested in a simple but sparse selection of the hall-of-fame _programs, such as:

index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1]
self._best_programs = [self._programs[-1][index_fitness[i]] for i in np.arange(0,self.hall_of_fame,1.*self.hall_of_fame/self.n_components).astype(np.int)]

which uses a linear selection over the hall_of_fame while always including the highest-fitness result. Or something more elaborate, such as a logarithmic spacing to give more weight to the highest-fitness programs, as sketched below.
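For instance, a logarithmic spacing could look like this (illustrative sketch only, reusing the index_fitness ordering from above; note that np.unique may return fewer than n_components indices when nearby positions collide):

import numpy as np

# Logarithmically spaced, 0-based positions over the hall of fame, denser near
# the top of the fitness ranking.
idx = np.unique(np.geomspace(1, self.hall_of_fame, num=self.n_components).astype(int) - 1)
self._best_programs = [self._programs[-1][index_fitness[i]] for i in idx]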


trevorstephens commented on June 19, 2024

Thanks again @iblasi , you are correct in that the algorithm tries to pick out the least correlated features of the hall_of_fame so that they can then be used in another model without as much collinearity in the features. I do see your point in terms of potential removal of key programs in the group. In order to maintain the idea of removal of correlated programs, but maintain the best programs where possible, I think that my latest change might tick all boxes.

Currently the code checks for the most correlated pair of programs from the hall of fame, and then removes the one that is also most correlated with all other programs left in the group. This could easily remove the top programs from the field as they are also most likely to have many other correlated "clones" in the generation.

Instead, I now propose to find the most correlated pair as before, but then remove the one with the worse fitness of the two -- and then iterate until the group is reduced to the required number of components. Take a look at the current version of #60 and let me know what you think.
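As an illustrative sketch only (not the actual #60 code; hof_programs, outputs and n_components are placeholders, and the metric is assumed to be greater-is-better), the idea is roughly:

import numpy as np

# outputs: (n_samples, hall_of_fame) array, one column per hall-of-fame program.
correlations = np.abs(np.corrcoef(outputs, rowvar=False))
np.fill_diagonal(correlations, 0.0)

keep = list(range(len(hof_programs)))
while len(keep) > n_components:
    # Find the most correlated pair among the remaining programs...
    sub = correlations[np.ix_(keep, keep)]
    i, j = np.unravel_index(np.argmax(sub), sub.shape)
    a, b = keep[i], keep[j]
    # ...and drop whichever member of the pair has the worse fitness.
    worse = a if hof_programs[a].fitness_ < hof_programs[b].fitness_ else b
    keep.remove(worse)

best_programs = [hof_programs[i] for i in keep]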

Really appreciate your taking the time to dig into this one! 👍


iblasi commented on June 19, 2024

@trevorstephens just perfect. Good job!

Just one comment that does not change the final result, and may not improve speed much: when you use unravel_index, you are finding the exact (row, col) index in the correlation matrix, which is indexed like the hall_of_fame array where the fitness values were already sorted.
So, since the correlation matrix rows/columns are ordered by fitness, the larger of the two indices returned by unravel_index is the worse program, the one that has to be deleted. I mean, you can write it in the following way:

            [...]
            while len(components) > self.n_components:
                most_correlated = np.unravel_index(np.argmax(correlations),
                                                   correlations.shape)
                if self._metric.greater_is_better:
                    worst = max(most_correlated)
                else:
                    worst = min(most_correlated)
                components.pop(worst)
                indices.remove(worst)
                correlations = correlations[:, indices][indices, :]
                indices = list(range(len(components)))
            self._best_programs = [self._programs[-1][i] for i in
                                   hall_of_fame[components]]

        return self

But as I said, it does not change the result, as it performs the same operations.
Thank you for your good point of view and for finding a good solution.


trevorstephens commented on June 19, 2024

That's a great observation @iblasi, it simplifies the code a fair bit. And as the greater_is_better == False case has the sequence reversed, it doesn't even need that logic. Cheers 🍻
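So the loop can simply read something like this (a sketch based on your snippet above, not the merged #60 code verbatim):

            while len(components) > self.n_components:
                most_correlated = np.unravel_index(np.argmax(correlations),
                                                   correlations.shape)
                # The correlation matrix rows/columns follow the fitness-sorted
                # hall of fame, so the larger index of the pair is always the
                # worse program, regardless of greater_is_better.
                worst = max(most_correlated)
                components.pop(worst)
                indices.remove(worst)
                correlations = correlations[:, indices][indices, :]
                indices = list(range(len(components)))
            self._best_programs = [self._programs[-1][i] for i in
                                   hall_of_fame[components]]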

