Comments (7)
Hi @kadarakos!
Error with .predict for iris example
Please check whether the iris features and class labels are numerically encoded. This is likely the source of your error. We've raised issue #61 to address this problem in the near future.
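If the class column is stored as strings, one quick way to convert it to integer codes is sklearn's LabelEncoder (a minimal sketch with made-up labels, not tied to the actual iris.csv file):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string class labels, as they might appear in a raw CSV
labels = np.array(['setosa', 'versicolor', 'virginica', 'setosa'])

# LabelEncoder maps each distinct string to an integer code
encoder = LabelEncoder()
numeric_labels = encoder.fit_transform(labels)

print(numeric_labels)    # [0 1 2 0]
print(encoder.classes_)  # original labels, in sorted order
```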
Interpreting generated code
Happy to see feedback about the generated code! The following is occurring in the pipeline you posted:
- The training features and class labels are used to train the first decision tree.
- The class predictions from the first decision tree are then added as a new feature in the training features.
- A second decision tree is then trained on the training features (plus the predictions from the first decision tree) and class labels.
- result2['dtc2-classification'] contains the final classifications from the pipeline. These values should correspond to what you see when you call .predict() on the TPOT object.
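The steps above can be sketched in plain scikit-learn (a simplified stand-in for the generated code, without the train/test split):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: train the first decision tree on the original features
dtc1 = DecisionTreeClassifier(random_state=0)
dtc1.fit(X, y)

# Step 2: append its class predictions as a new feature column
X_stacked = np.column_stack([X, dtc1.predict(X)])

# Step 3: train a second decision tree on the augmented features
dtc2 = DecisionTreeClassifier(random_state=0)
dtc2.fit(X_stacked, y)

# These play the role of result2['dtc2-classification']
final_classifications = dtc2.predict(X_stacked)
```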
If you have thoughts on how to make the generated code clearer or easier to use, please let me know.
Best,
Randy
from tpot.
Hi @rhiever ,
Both iris features and classes are encoded as floats.
Your explanation makes it clear how to interpret the generated code. It makes me wonder, however, if this is the best way to ensemble models. Imho using the VotingClassifier object would be a more standard/straightforward way of ensembling different classifiers, plus it provides some additional flexibility.
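For reference, the VotingClassifier route might look like this (a rough sketch with arbitrarily chosen estimators and hyperparameters):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Majority ('hard') vote over three heterogeneous classifiers
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
        ('dt', DecisionTreeClassifier(random_state=0)),
    ],
    voting='hard',
)
ensemble.fit(X, y)
predictions = ensemble.predict(X)
```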
Ah, I see what happened. The predict function is missing the .loc
call at the end:
return result[result['group'] == 'testing', 'guess'].values
should be
return result.loc[result['group'] == 'testing', 'guess'].values
This has already been fixed in the development version, but I haven't rolled it out to pip yet. I will do this soon!
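For anyone hitting the same error: the difference between the two lines comes down to pandas indexing rules. Plain [] does not accept a (row mask, column label) pair, while .loc does. A tiny illustration with a made-up frame:

```python
import pandas as pd

result = pd.DataFrame({
    'group': ['training', 'testing', 'testing'],
    'guess': [0, 1, 2],
})

# result[result['group'] == 'testing', 'guess'] raises an error:
# [] treats the tuple as a single (invalid) column key.

# .loc selects rows by the boolean mask and the column by label:
guesses = result.loc[result['group'] == 'testing', 'guess'].values
print(guesses)  # [1 2]
```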
wrt ensembles of classifiers: I agree 100%! This is also something we're working on in the near future -- adding a pipeline operator that pools classifications from multiple classifiers in different ways (majority vote, etc.).
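Such a pooling operator could, for example, take a row-wise majority vote over the per-classifier prediction columns (hypothetical column names, purely illustrative):

```python
import pandas as pd

# Hypothetical prediction columns from three upstream classifiers
result = pd.DataFrame({
    'dtc1-classification': [0, 1, 2, 1],
    'knnc2-classification': [0, 1, 1, 1],
    'lrc3-classification':  [0, 2, 2, 1],
})

# Majority vote per row: the most frequent class label wins
pred_cols = [c for c in result.columns if c.endswith('-classification')]
majority = result[pred_cols].mode(axis=1)[0].astype(int)

print(majority.values)  # [0 1 2 1]
```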
Thanks for the quick reply!
I evolved another piece of code that scores 1.0 on the iris data set, which is pretty impressive. However, it did raise some questions.
import numpy as np
import pandas as pd
from sklearn.cross_validation import StratifiedShuffleSplit
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier # ME IMPORTING
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('iris.csv', sep=',')
tpot_data['class'] = pd.factorize(tpot_data['class'])[0]  # ME CHANGING THE STRINGS TO INTEGERS
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75)))
result1 = tpot_data.copy()
# Perform classification with a k-nearest neighbor classifier
knnc1 = KNeighborsClassifier(n_neighbors=min(13, len(training_indeces)))
knnc1.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result1['knnc1-classification'] = knnc1.predict(result1.drop('class', axis=1).values)
# Perform classification with a logistic regression classifier
lrc2 = LogisticRegression(C=2.75)
lrc2.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result2 = result1
result2['lrc2-classification'] = lrc2.predict(result2.drop('class', axis=1).values)
# Decision-tree based feature selection
training_features = result2.loc[training_indeces].drop('class', axis=1)
training_class_vals = result2.loc[training_indeces, 'class'].values
pair_scores = dict()
for features in combinations(training_features.columns.values, 2):
    print(features)
    dtc = DecisionTreeClassifier()
    training_feature_vals = training_features[list(features)].values
    dtc.fit(training_feature_vals, training_class_vals)
    pair_scores[features] = (dtc.score(training_feature_vals, training_class_vals), list(features))
best_pairs = []
print(pair_scores)
for pair in sorted(pair_scores, key=pair_scores.get, reverse=True)[:1070]:
    best_pairs.extend(list(pair))
best_pairs = sorted(list(set(best_pairs)))
result3 = result2[sorted(list(set(best_pairs + ['class'])))]
# Perform classification with a k-nearest neighbor classifier
knnc4 = KNeighborsClassifier(n_neighbors=min(6, len(training_indeces)))
knnc4.fit(result3.loc[training_indeces].drop('class', axis=1).values, result3.loc[training_indeces, 'class'].values)
result4 = result3
result4['knnc4-classification'] = knnc4.predict(result4.drop('class', axis=1).values)
A minor issue was that the DecisionTreeClassifier wasn't imported for the feature-selection step. Apart from that, I was a bit surprised by the way the feature selection was implemented. I believe it could be replaced with the shorter - and maybe more general - code snippet from the sklearn documentation:
from sklearn.feature_selection import SelectFromModel
feature_clf = DecisionTreeClassifier()
feature_clf = feature_clf.fit(training_features, training_class_vals)
feature_select = SelectFromModel(feature_clf, prefit=True)
training_features_new = feature_select.transform(training_features)
Is it just me or would this be a bit more concise?
Best,
Ákos
Actually, from observing the code a bit more closely, it seems to me that "result3" is just a sorted version of the original features:
result3 = result2[sorted(list(set(best_pairs + ['class'])))]
and then the kNN is fitted to this sorted data frame:
knnc4.fit(result3.loc[training_indeces].drop('class', axis=1).values, result3.loc[training_indeces, 'class'].values)
so, as far as I understand, the feature selection was not actually performed. Running this piece of code
feature_clf = DecisionTreeClassifier()
feature_clf = feature_clf.fit(training_features, training_class_vals)
feature_select = SelectFromModel(feature_clf, prefit=True)
training_features_new = feature_select.transform(training_features)
actually shows - unsurprisingly - that the most informative features are the decisions of the previous classifiers.
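That observation can be reproduced end to end. In this rough sketch (my own reconstruction, not the generated code), the appended prediction column takes essentially all of the importance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Recreate the situation: a previous classifier's predictions
# are appended as an extra (fifth) feature column
helper = DecisionTreeClassifier(random_state=0).fit(X, y)
X_aug = np.column_stack([X, helper.predict(X)])

feature_clf = DecisionTreeClassifier(random_state=0).fit(X_aug, y)
feature_select = SelectFromModel(feature_clf, prefit=True)

# The prediction column (index 4) dominates the importances
print(feature_clf.feature_importances_)
print(feature_select.get_support())
```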
That's exactly right. It seems the feature selection in this case was "junk code" that wasn't pruned by the optimization process, i.e., because the feature selection didn't do anything, it wasn't optimized away. I'm currently working on code that selects against bloat like that.
In the most recent version, we've actually removed the decision tree-based feature selection entirely and replaced it with more standard feature selection operators from sklearn: RFE, variance threshold, and various forms of univariate feature selection. Hopefully that will be out soon. You can check it out on the development version in the meantime.
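For readers who want to try those operators directly, here is a rough stand-alone sketch of the three sklearn selector families mentioned (threshold and k chosen arbitrarily, not TPOT's actual operator code):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Variance threshold: drop low-variance (near-constant) features
vt = VarianceThreshold(threshold=0.2)
X_vt = vt.fit_transform(X, y)

# Univariate selection: keep the k features with the best ANOVA F-score
skb = SelectKBest(score_func=f_classif, k=2)
X_skb = skb.fit_transform(X, y)

# Recursive feature elimination around an estimator
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print(X_vt.shape, X_skb.shape, X_rfe.shape)
```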