Comments (17)
This is actually a better solution:
https://github.com/scikit-learn/scikit-learn/pull/7375/files#diff-1e175ddb0d84aad0a578d34553f6f9c6
Mine fails to create extra columns for categories that are not present in the data run through the pipeline.
from handson-ml.
@ageron, it looks like you have forgotten to update requirements.txt. Could you do that, please? Until I apply Kallin's suggestion, I get an error like "fit_transform() takes 2 positional arguments but 3 were given".
Hi @ageron, the "LabelBinarizer hack" in Chapter 2 and similar solutions like this are incorrect. It seems the correct way to handle this will be with the CategoricalEncoder, which is in progress.
According to the documentation, `LabelBinarizer` (called via `super` in such solutions) does not work on a 2-dimensional input X with multiple features. Although it seems to run without error, the behavior is quite different from what is desired:

> `fit(y)`
> Parameters: y : array of shape [n_samples,] or [n_samples, n_classes]
> Target values. The 2-d matrix should only contain 0 and 1, and represents multilabel classification.

Similarly for `transform(y)` and `fit_transform(y)`.
So, it treats the columns of the 2-d matrix X as a single multilabel classification target with Boolean entries, not as multiple single-label categorical features.
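To make this concrete, here is a small sketch of that multilabel behavior, using a toy binary indicator matrix (the only 2-D input the docs allow):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# A 2-D binary matrix is interpreted as ONE multilabel indicator target,
# not as several independent categorical columns.
Y = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]])
lb = LabelBinarizer()
out = lb.fit_transform(Y)
print(lb.classes_)  # the column indices [0 1 2], not per-column categories
print(out)          # the indicator matrix passes through unchanged
```

No per-column binarization happens: the "classes" are just the column indices of the matrix.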
For now, we could modify `DataFrameSelector` to encode the features as integers, and then use `OneHotEncoder` instead of `LabelBinarizer` in the pipeline.
```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, factorize=False):
        self.attribute_names = attribute_names
        self.factorize = factorize

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        selection = X[self.attribute_names]
        if self.factorize:
            # Encode each categorical column as integers 1..n_categories
            # (note: the mapping is recomputed on every call to transform()).
            selection = selection.apply(lambda p: pd.factorize(p)[0] + 1)
        return selection.values

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_cols, True)),
    ('one_hot_encoder', OneHotEncoder()),
])
```
Or, you could use `pandas.get_dummies` instead of `factorize` inside `DataFrameSelector` and remove the `OneHotEncoder` from the pipeline.
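For illustration, a minimal sketch of the `get_dummies` route (the column name and values are borrowed from the book's housing data, as a stand-in):

```python
import pandas as pd

# Toy stand-in for the categorical part of the housing DataFrame.
df = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

dummies = pd.get_dummies(df)
print(list(dummies.columns))  # ['ocean_proximity_INLAND', 'ocean_proximity_NEAR BAY']
print(dummies.shape)          # (3, 2): one indicator column per category
```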
Okay, I just pushed the changes in the notebook for chapter 2, using `factorize()` to explain ordinal encoding, then `OneHotEncoder` to explain, well, one-hot encoding, and finally `CategoricalEncoder` to use in pipelines. Phew...
Thanks again for your great suggestions.
Cheers
@ageron Thank you for your feedback! I find your book tremendously useful! Let me emphasize that your use of LabelBinarizer here should produce erroneous or spurious results. The signatures for `fit`, `transform` and `fit_transform` have not changed between 0.18 and 0.19. Please reread my excerpt of the documentation for LabelBinarizer, and also see @amueller's note here (similar implementation): "...the way you implemented it right now it only works for a single column...". I see many incarnations of this erroneous and spurious use of LabelBinarizer on GitHub, and must strongly but humbly emphasize that it is not just a minor abuse of LabelBinarizer, but a major scientific error.
Thanks for all your hard work!
Ah, good point! I plan to update `requirements.txt` to use the latest versions of every library this project depends on, and fix whatever needs fixing at that point, which will apparently include this `LabelBinarizer` issue. I'm on vacation right now, so I'll probably do this only in 2-3 weeks.
Thanks Kallin!
Happy to help Aurélien :)
It may be out of scope, but I think the preferred way of doing this now (since you're using Pandas) is to use https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html .
There's a good article here showing how they updated their approach: http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/ .
Enjoy your vacation!
Thanks Kallin, I'll look into this when I get back.
@ageron I am studying the book, too. It would be great if you could update it.
A note in the README file in the GitHub repo would also be helpful.
Hi,
Thanks everyone for your great feedback, and my apologies for the very long delay, I ended up with very little time to work on this project over the summer. But I'm back! :) I'm looking into all the issues over the coming week, in particular I'll update the libraries to the latest versions, which should fix a few issues.
Cheers,
Aurélien
Thanks for the info @dustbort, great summary, and nice suggestions!
I just updated chapter 2 and the corresponding notebook to explain that using the `LabelEncoder` is a temporary hack, and that the correct solution will be to use the `OneHotEncoder` class, once it handles categorical string features (which may be soon, see scikit-learn/scikit-learn#7327). Another option would be to use two steps: first the `CategoricalEncoder`, as you note, then a `OneHotEncoder` (but the `CategoricalEncoder` class is not there yet).
For now, I'll stick with the `LabelBinarizer` class, since it works well with scikit-learn <0.19, and it only requires a very thin wrapper to work in 0.19. But just to be clear, I don't like this hack either: as soon as one of the other methods works, I will update the book and the notebooks to use it instead.
Hope this helps! :)
I didn't check out the details of this issue but the solution is probably scikit-learn/scikit-learn#9151
Hi @dustbort and @amueller ,
I understand now, thanks for opening my eyes, and my apologies for the error. I'll update the book and the code to use @dustbort's solution, and I'll update them to use the `CategoricalEncoder` when it's ready.
Hi @dustbort, while updating the notebook, I noticed that your `DataFrameSelector` implementation above does not store the mapping from category strings to integers. I'll try to fix that.
Basically, the problem is that we need to use the same mapping from category string to integer during training and during testing, so the `fit()` method must keep a copy of the mapping, and that mapping must be used in the `transform()` method.
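One possible fix, sketched as a hypothetical transformer (class name and column values are illustrative): `fit()` records the category-to-integer mapping, and `transform()` reuses it, so training and test data are encoded consistently; categories unseen during training map to 0:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MappedFactorizer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: remember the string->integer mapping
    learned in fit() and reuse it in transform()."""
    def fit(self, X, y=None):
        # X is a DataFrame of categorical columns
        self.categories_ = {col: pd.Index(pd.unique(X[col])) for col in X.columns}
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        for col, cats in self.categories_.items():
            # get_indexer returns -1 for unseen categories; shift so they become 0
            out[col] = cats.get_indexer(X[col]) + 1
        return out.values

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})
test = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "ISLAND"]})  # ISLAND unseen
enc = MappedFactorizer().fit(train)
print(enc.transform(test))  # [[2], [0]]: same mapping as training; unseen -> 0
```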
The problem with using Pandas' `get_dummies()` method is that during testing the instances may contain only a subset of the possible categories (and possibly also some extra categories that were unseen during training), so the columns during training and testing may not match. Extra code needs to be written to take care of that, and it adds some unwanted complexity to my book.
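A sketch of the extra alignment code that would be needed with `get_dummies` (using `DataFrame.reindex` to force the test columns to match the training columns; column name and values are illustrative):

```python
import pandas as pd

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})
test = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "ISLAND"]})  # ISLAND unseen

train_dummies = pd.get_dummies(train)
# Align test columns to the training columns: categories missing from the
# test set become all-zero columns, unseen categories are dropped.
test_dummies = pd.get_dummies(test).reindex(columns=train_dummies.columns,
                                            fill_value=0)
print(list(test_dummies.columns))  # ['ocean_proximity_INLAND', 'ocean_proximity_NEAR BAY']
```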
After a lot of thinking, I decided to go with the `CategoricalEncoder` class, even though it's not part of Scikit-Learn yet. I just copied the code from scikit-learn/scikit-learn#9151 and encouraged readers to use that code. This way, their code will probably not need much tweaking when this class gets added to Scikit-Learn (just replace the class definition with `from sklearn.preprocessing import CategoricalEncoder`). I'm testing everything now and rewriting the paragraph; I'll push my changes asap.
Yeah, we really need to fix this in scikit-learn. I'm also not happy with how I did it in my book :-/