Comments (17)
This is actually a better solution:
https://github.com/scikit-learn/scikit-learn/pull/7375/files#diff-1e175ddb0d84aad0a578d34553f6f9c6
Mine fails to create extra columns for categories that are not present in the data run through the pipeline.
from handson-ml.
@ageron, it looks like you have forgotten to update requirements.txt. Could you do that, please? Until I apply Kallin's suggestion, I get an error like "fit_transform() takes 2 positional arguments but 3 were given".
Hi @ageron, the "LabelBinarizer hack" in Chapter 2 and similar solutions like this are incorrect. It seems the correct way to handle this will be with the CategoricalEncoder, which is in progress.
According to the documentation, `LabelBinarizer` (called via `super` in such solutions) does not work on a 2-dimensional input X with multiple features. Although it seems to run without error, the behavior is quite different from what is desired:

> `fit(y)`
> Parameters: y : array of shape [n_samples,] or [n_samples, n_classes]
> Target values. The 2-d matrix should only contain 0 and 1, and represents multilabel classification.

Similarly for `transform(y)` and `fit_transform(y)`.
So, it treats the columns of the 2-d matrix X as a single multilabel classification target with Boolean entries, not as multiple single-label categorical features.
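To make this concrete, here is a small sketch of that multilabel behavior, using a toy binary indicator matrix (the only 2-D input the docs allow):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# A 2-D binary matrix is interpreted as ONE multilabel indicator target,
# not as several independent categorical columns.
Y = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]])
lb = LabelBinarizer()
out = lb.fit_transform(Y)
print(lb.classes_)  # the column indices [0 1 2], not per-column categories
print(out)          # the indicator matrix passes through unchanged
```

No per-column binarization happens: the "classes" are just the column indices of the matrix.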
For now, we could modify `DataFrameSelector` to encode the features as integers, and then use `OneHotEncoder` instead of `LabelBinarizer` in the pipeline.
```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, factorize=False):
        self.attribute_names = attribute_names
        self.factorize = factorize

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        selection = X[self.attribute_names]
        if self.factorize:
            # Encode each categorical column as integers 1..n_categories
            # (note: the mapping is recomputed on every call to transform()).
            selection = selection.apply(lambda p: pd.factorize(p)[0] + 1)
        return selection.values

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_cols, True)),
    ('one_hot_encoder', OneHotEncoder()),
])
```
Or, you could use `pandas.get_dummies` instead of `factorize` inside `DataFrameSelector` and remove the `OneHotEncoder` from the pipeline.
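For illustration, a minimal sketch of the `get_dummies` route (the column name and values are borrowed from the book's housing data, as a stand-in):

```python
import pandas as pd

# Toy stand-in for the categorical part of the housing DataFrame.
df = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

dummies = pd.get_dummies(df)
print(list(dummies.columns))  # ['ocean_proximity_INLAND', 'ocean_proximity_NEAR BAY']
print(dummies.shape)          # (3, 2): one indicator column per category
```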
Okay, I just pushed the changes in the notebook for chapter 2, using `factorize()` to explain ordinal encoding, then `OneHotEncoder` to explain, well, one-hot encoding, and finally `CategoricalEncoder` to use in pipelines. Phew...
Thanks again for your great suggestions.
Cheers
@ageron Thank you for your feedback! I find your book tremendously useful! Let me emphasize that your use of LabelBinarizer here should produce erroneous or spurious results. The signatures for `fit`, `transform` and `fit_transform` have not changed between 0.18 and 0.19. Please reread my excerpt of the documentation for LabelBinarizer, and also see @amueller's note here (similar implementation): "...the way you implemented it right now it only works for a single column...". I see many incarnations of this erroneous and spurious use of LabelBinarizer on GitHub, and must strongly but humbly emphasize that it is not just a minor abuse of LabelBinarizer, but a major scientific error.
Thanks for all your hard work!
Ah, good point! I plan to update `requirements.txt` to use the latest versions of every library this project depends on, and fix whatever needs fixing at that point, which will apparently include this `LabelBinarizer` issue. I'm on vacation right now, so I'll probably do this only in 2-3 weeks.
Thanks Kallin!
Happy to help Aurélien :)
It may be out of scope, but I think the preferred way of doing this now (since you're using Pandas) is to use https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html .
There's a good article here showing how they updated their approach: http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/ .
Enjoy your vacation!
Thanks Kallin, I'll look into this when I get back.
@ageron I am studying the book, too. It would be great if you could update it.
A note in the README file in the GitHub repo would also be helpful.
Hi,
Thanks everyone for your great feedback, and my apologies for the very long delay, I ended up with very little time to work on this project over the summer. But I'm back! :) I'm looking into all the issues over the coming week, in particular I'll update the libraries to the latest versions, which should fix a few issues.
Cheers,
Aurélien
Thanks for the info @dustbort, great summary, and nice suggestions!
I just updated chapter 2 and the corresponding notebook to explain that using the `LabelEncoder` is a temporary hack, and that the correct solution will be to use the `OneHotEncoder` class, once it handles categorical string features (which may be soon, see scikit-learn/scikit-learn#7327). Another option would be to use two steps: first the `CategoricalEncoder`, as you note, then a `OneHotEncoder` (but the `CategoricalEncoder` class is not there yet).
For now, I'll stick with the `LabelBinarizer` class, since it works well with scikit-learn <0.19, and it only requires a very thin wrapper to work in 0.19. But just to be clear, I don't like this hack either: as soon as one of the other methods works, I will update the book and the notebooks to use it instead.
Hope this helps! :)
I didn't check out the details of this issue but the solution is probably scikit-learn/scikit-learn#9151
Hi @dustbort and @amueller ,
I understand now, thanks for opening my eyes, and my apologies for the error. I'll update the book and the code to use @dustbort's solution, and I'll update them to use the `CategoricalEncoder` when it's ready.
Hi @dustbort, while updating the notebook, I noticed that your `DataFrameSelector` implementation above does not store the mapping from category strings to integers. I'll try to fix that.
Basically, the problem is that we need to use the same mapping from category string to integer during training and during testing, so the `fit()` method must keep a copy of the mapping, and that mapping must be used in the `transform()` method.
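One possible fix, sketched as a hypothetical transformer (class name and column values are illustrative): `fit()` records the category-to-integer mapping, and `transform()` reuses it, so training and test data are encoded consistently; categories unseen during training map to 0:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MappedFactorizer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: remember the string->integer mapping
    learned in fit() and reuse it in transform()."""
    def fit(self, X, y=None):
        # X is a DataFrame of categorical columns
        self.categories_ = {col: pd.Index(pd.unique(X[col])) for col in X.columns}
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        for col, cats in self.categories_.items():
            # get_indexer returns -1 for unseen categories; shift so they become 0
            out[col] = cats.get_indexer(X[col]) + 1
        return out.values

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})
test = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "ISLAND"]})  # ISLAND unseen
enc = MappedFactorizer().fit(train)
print(enc.transform(test))  # [[2], [0]]: same mapping as training; unseen -> 0
```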
The problem with using Pandas' `get_dummies()` method is that during testing the instances may contain only a subset of the possible categories (and possibly also some extra categories that were unseen during training), so the columns during training and testing may not match. Extra code needs to be written to take care of that, and it adds some unwanted complexity to my book.
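A sketch of the extra alignment code that would be needed with `get_dummies` (using `DataFrame.reindex` to force the test columns to match the training columns; column name and values are illustrative):

```python
import pandas as pd

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})
test = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "ISLAND"]})  # ISLAND unseen

train_dummies = pd.get_dummies(train)
# Align test columns to the training columns: categories missing from the
# test set become all-zero columns, unseen categories are dropped.
test_dummies = pd.get_dummies(test).reindex(columns=train_dummies.columns,
                                            fill_value=0)
print(list(test_dummies.columns))  # ['ocean_proximity_INLAND', 'ocean_proximity_NEAR BAY']
```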
After a lot of thinking, I decided to go with the `CategoricalEncoder` class, even though it's not part of Scikit-Learn yet. I just copied the code from scikit-learn/scikit-learn#9151 and encouraged readers to use that code. This way, their code will probably not need much tweaking when this class gets added to Scikit-Learn (just replace the class definition with `from sklearn.preprocessing import CategoricalEncoder`). I'm testing everything now and rewriting the paragraph; I'll push my changes asap.
Yeah, we really need to fix this in scikit-learn. I'm also not happy with how I did it in my book :-/