Comments (17)

Kallin commented on May 5, 2024

This is actually a better solution:
https://github.com/scikit-learn/scikit-learn/pull/7375/files#diff-1e175ddb0d84aad0a578d34553f6f9c6

Mine fails to create extra columns for categories that are not present in the data run through the pipeline.

nagark16 commented on May 5, 2024

@ageron, it looks like you have forgotten to update requirements.txt. Could you please do that? I am getting an error like "fit_transform() takes 2 positional arguments but 3 were given" until I apply the fix suggested by Kallin.

dustbort commented on May 5, 2024

Hi @ageron, the "LabelBinarizer hack" in Chapter 2 and similar solutions like this are incorrect. It seems the correct way to handle this will be with the CategoricalEncoder, which is in progress.

According to the documentation, LabelBinarizer (called via super in such solutions) does not work on a 2-dimensional input X with multiple features. Although it seems to run without error, the behavior is quite different from what is desired.

fit(y)
    Parameters: y : array of shape [n_samples,] or [n_samples, n_classes]
        Target values. The 2-d matrix should only contain 0 and 1, represents multilabel classification.

Similarly for transform(y) and fit_transform(y).

So it treats the columns of the 2-d matrix X as a single multilabel classification target with Boolean entries, not as multiple single-label categorical features.
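
For reference, here is the documented single-column usage (a minimal illustration with example category strings; the expected output is shown as comments):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# Documented usage: a single 1-d array of labels, producing one output column per class
lb.fit_transform(["<1H OCEAN", "INLAND", "NEAR BAY"])
# array([[1, 0, 0],
#        [0, 1, 0],
#        [0, 0, 1]])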

For now, we could modify DataFrameSelector to encode the features as integers, and then use OneHotEncoder instead of LabelBinarizer in the pipeline.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, factorize=False):
        self.attribute_names = attribute_names
        self.factorize = factorize
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        selection = X[self.attribute_names]
        if self.factorize:
            # Encode each categorical column as integers (factorize codes, shifted to start at 1)
            selection = selection.apply(lambda p: pd.factorize(p)[0] + 1)
        return selection.values

# cat_cols is the list of categorical column names
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_cols, True)),
    ('one_hot_encoder', OneHotEncoder())
])

Or, you could use pandas.get_dummies instead of factorize inside DataFrameSelector and remove the OneHotEncoder from the pipeline.
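
For instance, a rough sketch of that variant (the class name DataFrameDummies is just illustrative, and the column-mismatch caveat discussed later in this thread still applies):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class DataFrameDummies(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # One-hot encode the selected columns directly with pandas
        return pd.get_dummies(X[self.attribute_names]).values

cat_pipeline = Pipeline([
    ('selector', DataFrameDummies(cat_cols)),   # no separate OneHotEncoder step needed
])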

ageron commented on May 5, 2024

Okay, I just pushed the changes in the notebook for chapter 2, using factorize() to explain ordinal encoding, then OneHotEncoder to explain, well, one-hot encoding, and finally CategoricalEncoder to use in pipelines. Phew...
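
Roughly speaking, the three approaches look like this (a sketch only, assuming the housing_cat Series of ocean_proximity strings from the chapter and the CategoricalEncoder class from the scikit-learn PR mentioned below):

# Ordinal encoding with pandas
housing_cat_encoded, housing_categories = housing_cat.factorize()

# One-hot encoding of the integer codes
from sklearn.preprocessing import OneHotEncoder
housing_cat_1hot = OneHotEncoder().fit_transform(housing_cat_encoded.reshape(-1, 1))

# CategoricalEncoder handles string categories directly, so it can go straight into a pipeline
cat_encoder = CategoricalEncoder(encoding="onehot-dense")
housing_cat_1hot_dense = cat_encoder.fit_transform(housing_cat.values.reshape(-1, 1))
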
Thanks again for your great suggestions.
Cheers

dustbort commented on May 5, 2024

@ageron Thank you for your feedback! I find your book tremendously useful! Let me emphasize that your use of LabelBinarizer here should produce erroneous or spurious results. The signatures for fit, transform and fit_transform have not changed between 0.18 and 0.19. Please reread my excerpt of the documentation for LabelBinarizer and also see @amueller's note here (similar implementation): "...the way you implemented it right now it only works for a single column...". I see many incarnations of this erroneous and spurious use of LabelBinarizer on github, and must strongly but humbly emphasize that it is not just a minor abuse of LabelBinarizer, but a major scientific error.

dustbort commented on May 5, 2024

Thanks for all your hard work!

ageron commented on May 5, 2024

Ah, good point! I plan to update requirements.txt to use the latest versions of every library this project depends on, and fix whatever needs fixing at that point, which will apparently include this LabelBinarizer issue. I'm on vacation right now, so I'll probably do this only in 2-3 weeks.
Thanks Kallin!

Kallin commented on May 5, 2024

Happy to help, Aurélien :)

It may be out of scope, but I think the preferred way of doing this now (since you're using Pandas) is to use https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html .

There's a good article here showing how they updated their approach: http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/ .

Enjoy your vacation!

ageron commented on May 5, 2024

Thanks Kallin, I'll look into this when I get back.

cbilgili commented on May 5, 2024

@ageron I am studying the book too, and it would be great if you could update it.

A note in the README file of the GitHub repo would also be great.

ageron commented on May 5, 2024

Hi,
Thanks everyone for your great feedback, and my apologies for the very long delay; I ended up with very little time to work on this project over the summer. But I'm back! :) I'm looking into all the issues over the coming week; in particular, I'll update the libraries to the latest versions, which should fix a few issues.
Cheers,
Aurélien

ageron commented on May 5, 2024

Thanks for the info @dustbort, great summary, and nice suggestions!

I just updated chapter 2 and the corresponding notebook to explain that using the LabelEncoder is a temporary hack, and that the correct solution will be to use the OneHotEncoder class, once it handles categorical string features (which may be soon, see scikit-learn/scikit-learn#7327). Another option would be to use two steps: first the CategoricalEncoder, as you note, then a OneHotEncoder (but the CategoricalEncoder class is not there yet).

For now, I'll stick with the LabelBinarizer class, since it works well with scikit-learn <0.19, and it only requires a very thin wrapper to work in 0.19. But just to be clear, I don't like this hack either: as soon as one of the other methods works, I will update the book and the notebooks to use it instead.
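
For reference, such a thin wrapper looks roughly like this (a sketch only; the class name is illustrative, and as discussed elsewhere in this thread it is only appropriate for a single categorical column):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class PipelineFriendlyLabelBinarizer(BaseEstimator, TransformerMixin):
    # Wraps LabelBinarizer so that fit/fit_transform accept the extra y argument
    # that Pipeline passes in scikit-learn 0.19
    def fit(self, X, y=None):
        self.encoder_ = LabelBinarizer().fit(X)
        return self
    def transform(self, X):
        return self.encoder_.transform(X)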

Hope this helps! :)

amueller commented on May 5, 2024

I didn't check out the details of this issue, but the solution is probably scikit-learn/scikit-learn#9151

ageron commented on May 5, 2024

Hi @dustbort and @amueller,
I understand now, thanks for opening my eyes, and my apologies for the error. I'll update the book and the code to use @dustbort's solution, and I'll update them to use the CategoricalEncoder when it's ready.

ageron commented on May 5, 2024

Hi @dustbort, while updating the notebook, I noticed that your DataFrameSelector implementation above does not store the mapping from category strings to integers. I'll try to fix that.

Basically the problem is that we need to use the same mapping from category string to integer during training and during testing, so the fit() method must keep a copy of the mapping, and that mapping must be used in the transform() method.
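
One possible shape for that fix (a rough sketch, not the final implementation; the categories_ attribute and the treatment of unseen categories are just illustrative):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, factorize=False):
        self.attribute_names = attribute_names
        self.factorize = factorize
    def fit(self, X, y=None):
        if self.factorize:
            # Remember the category -> integer mapping learned on the training set
            self.categories_ = {col: pd.Index(pd.factorize(X[col])[1])
                                for col in self.attribute_names}
        return self
    def transform(self, X):
        selection = X[self.attribute_names]
        if self.factorize:
            # Reuse the stored mapping; unseen categories and NaN map to 0
            selection = selection.apply(
                lambda col: self.categories_[col.name].get_indexer(col) + 1)
        return selection.values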

ageron commented on May 5, 2024

The problem with using Pandas' get_dummies() method is that during testing the instances may only contain a subset of the possible categories (and possibly also some extra categories that were unseen during training), so the columns during training and testing may not match. Extra code needs to be written to take care of that, and it adds some unwanted complexity to my book.
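
A minimal illustration of the mismatch, with made-up data:

import pandas as pd

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})
test = pd.DataFrame({"ocean_proximity": ["INLAND", "ISLAND"]})
list(pd.get_dummies(train).columns)  # ['ocean_proximity_INLAND', 'ocean_proximity_NEAR BAY']
list(pd.get_dummies(test).columns)   # ['ocean_proximity_INLAND', 'ocean_proximity_ISLAND']
# The two matrices have different columns, so a model trained on one cannot score the other as-is.
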
After a lot of thinking, I decided to go with the CategoricalEncoder class, even though it's not part of Scikit-Learn yet. I just copied the code from scikit-learn/scikit-learn#9151 and encouraged readers to use that code. This way, their code will probably not need much tweaking when this class gets added to Scikit-Learn (they will just need to replace the class definition with from sklearn.preprocessing import CategoricalEncoder). I'm testing everything now and rewriting the paragraph; I'll push my changes ASAP.
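
With that class pasted in, the categorical part of the pipeline ends up looking something like this (a sketch, assuming the DataFrameSelector and cat_cols from earlier in the thread, and the encoding='onehot-dense' option from the PR's CategoricalEncoder to get a dense array):

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_cols)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
])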

amueller commented on May 5, 2024

Yeah, we really need to fix this in scikit-learn. I'm also not happy with how I did it in my book :-/
