
Comments (13)

mrocklin commented on August 23, 2024

What are the cases where you don't want to compute immediately after calling fit?

  1. Fitting multiple estimators at once?
  2. Parameter searches?
  3. ...?

cc @jcrist who I think thought about this for dask-searchcv

from dask-ml.

TomAugspurger commented on August 23, 2024

What are the cases where you don't want to compute immediately after calling fit?

I was trying to think of cases where multiple stages of a Pipeline could be fused together by dask into a single .compute. Something like

  • scale columns [0, 1] by mean and variance
  • categorical encode columns [2, 3]

In a pipeline that's

make_pipeline(
    StandardScaler(columns=[0, 1]),
    DummyEncoder(columns=[2, 3])
)

As a programmer, I know that the DummyEncoder operation doesn't depend on the scaling step in this specific case. Ideally, I could share the common tasks, like reading X off disk, across the two operations. I haven't thought about this very deeply yet :)
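That graph-sharing idea can be sketched with plain dask already. Everything below is a toy stand-in (not dask-ml code): both branches read from the same X, and a single dask.compute call builds one merged graph in which the tasks producing X are reused rather than executed twice.

```python
import numpy as np
import dask.array as da

# Toy data; the two operations below stand in for the scale/encode stages.
X = da.from_array(np.arange(20, dtype=float).reshape(5, 4), chunks=(5, 2))

# "Scale" columns 0-1 by mean and std
scaled = (X[:, :2] - X[:, :2].mean(axis=0)) / X[:, :2].std(axis=0)
# Toy stand-in for encoding columns 2-3
flagged = X[:, 2:] > X[:, 2:].mean(axis=0)

# One compute call -> one merged graph; the tasks producing X's chunks
# are shared between the two independent branches.
scaled_np, flagged_np = da.compute(scaled, flagged)
```

This is roughly the effect a fused pipeline would aim for: collect the per-stage graphs and compute them together.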


jcrist commented on August 23, 2024

The big question right now is: should fitting be eager, and fitted values concrete?

Off the top of my head I can think of a few different solutions:

  • A compute keyword on the transformer init, defaulting to True; passing compute=False defers computation, and the programmer is responsible for triggering it. Easy and intuitive; this would be my recommendation.

  • Always return dask.array objects from dask-based transformers, and let non-compatible scikit-learn transformers convert via np.asarray. This is potentially fragile if transforms are not robust to array-like objects.

  • Pipeline subclass that always runs transforms collectively as a graph. Non-dask inputs to fit/fit_transform are coerced into dask inputs, then fed lazily through the pipeline. Non-dask transformers are each run as a single task, similar to if they were wrapped in dask.delayed; dask-based transformers are free to use the dask array/dataframe API. Compute is called at the end of the pipeline by default to match scikit-learn's eager evaluation, but a delayed option is available to support dask pipelines containing dask pipelines.

    You might implement this with either a base-class check to see if a lazy kwarg is supported, or a method check to see if a transformer supports lazy fit/transform. Something like:

    # Using a base-class and a supported kwarg to `fit`/`fit_transform`
    if isinstance(transformer, DaskBaseEstimator):
        # lazily fit the next stage in the pipeline
        est, Xt = transformer.fit_transform(Xt, y, compute=False)

    # Using duck-typing and custom method names
    if hasattr(transformer, 'fit_transform_dask'):
        # lazily fit the next stage in the pipeline
        est, Xt = transformer.fit_transform_dask(Xt, y)

Of these I'd probably go with either 1 (easy, clear, intuitive) or 3 (harder to implement, may not play nice with all of scikit-learn, but probably more robust than 2).
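For illustration, option 1 might look roughly like this. The class below is a hypothetical sketch of a dask-backed StandardScaler, not dask-ml's actual API:

```python
import numpy as np
import dask.array as da

class StandardScaler:
    """Hypothetical sketch of option 1: a compute flag on the init."""

    def __init__(self, compute=True):
        self.compute = compute

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        if self.compute:
            # Eager mode: materialize fitted attributes as numpy arrays
            self.mean_, self.std_ = da.compute(self.mean_, self.std_)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

X = da.from_array(np.arange(12, dtype=float).reshape(6, 2), chunks=(3, 2))
eager = StandardScaler().fit(X)               # mean_ is a concrete numpy array
lazy = StandardScaler(compute=False).fit(X)   # mean_ is still a lazy dask array
```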

So, has scaler.mean_ been computed, and is it a dask.array or a numpy.array? This is a big decision.

I think that after calling fit in immediate mode (whether this is the default or the only option depends on the solution picked above), all attributes (e.g. .mean_) should be concrete. This meshes better with scikit-learn and matches its eager evaluation model.
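As a side note on option 2: the implicit conversion works because dask arrays implement `__array__`, so np.asarray (which scikit-learn's input validation calls) silently triggers a compute:

```python
import numpy as np
import dask.array as da

X = da.ones((4, 3), chunks=(2, 3))

# A non-dask-aware transformer calling np.asarray computes the graph
# implicitly and gets back a concrete numpy array.
X_np = np.asarray(X)
```

Convenient, but it hides a potentially expensive computation behind an innocuous-looking call.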


TomAugspurger commented on August 23, 2024

Thanks @jcrist. Your proposal 1 seems the best. I think option 3, of running all the transforms as a single graph, will be interesting to experiment with at some point.


dsevero commented on August 23, 2024

Cool!

@TomAugspurger does it make sense to implement the MinMaxScaler? If so, I'll do it. Looks like a good entry point to the dask-ml philosophy, given that I'm familiar with sklearn.
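For context, a minimal dask-backed MinMaxScaler could look like the sketch below (hypothetical, not the implementation that eventually landed in dask-ml):

```python
import numpy as np
import dask.array as da

class MinMaxScaler:
    """Hypothetical sketch: scale each feature to [0, 1] with dask reductions."""

    def fit(self, X):
        # A single compute call evaluates both reductions over the data
        self.data_min_, self.data_max_ = da.compute(X.min(axis=0), X.max(axis=0))
        return self

    def transform(self, X):
        return (X - self.data_min_) / (self.data_max_ - self.data_min_)

X = da.from_array(np.array([[1.0, 10.0],
                            [3.0, 30.0],
                            [5.0, 50.0]]), chunks=(2, 2))
Xt_np = MinMaxScaler().fit(X).transform(X).compute()
```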


TomAugspurger commented on August 23, 2024

@daniel-severo yep, that'd be great to have. I'll add it to the list.


dsevero commented on August 23, 2024

With respect to being nan-safe: I think the user should handle this by using the Imputer transformer on the inputs.

If operations within dask-ml end up producing np.nan results, that is probably due to some misuse by the user anyway.
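That division of labour, dealing with NaNs before the scalers run, is cheap to express with dask. A mean-imputation step might be sketched like this (toy example, not the Imputer API):

```python
import numpy as np
import dask.array as da

X = da.from_array(np.array([[1.0, np.nan],
                            [3.0, 4.0],
                            [np.nan, 8.0]]), chunks=(3, 2))

col_means = da.nanmean(X, axis=0)  # per-column mean, ignoring NaNs
# Replace each NaN with its column's mean, then materialize the result
X_filled_np = da.where(da.isnan(X), col_means, X).compute()
```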


dsevero commented on August 23, 2024

MinMaxScaler: #9


dsevero commented on August 23, 2024

Imputer: #11


jorisvandenbossche commented on August 23, 2024

CategoricalEncoder (TODO: check on Joris' recent work in sklearn here)

I picked up this work again last week, and I think the design should be fairly fixed now (scikit-learn/scikit-learn#9151). For now, we opted to provide only two ways to encode: just replacing with integers (categorical 'codes') and one-hot / dummy encoding (as there are many ways one could do 'categorical encoding'). Feedback is always welcome there!

Note that you might not want to follow the design exactly, as it does not use anything dataframe-specific (it can handle dataframes, but does not take advantage of it). E.g., the categories are specified as a positional list for the different columns, not as a dict of column name -> categories.

All transformers should take an optional columns argument. When specified, the transformation will be limited to just those columns (e.g. if doing a standard scaling with columns=['A', 'B'], only 'A' and 'B' are scaled). By default, all columns are scaled.

In scikit-learn, we are currently taking another route: instead of adding such a columns keyword to all transformers, I am working on a ColumnTransformer to perform such column-specific transformations (scikit-learn/scikit-learn#9012).

The example from above:

make_pipeline(
    StandardScaler(columns=[0, 1]),
    DummyEncoder(columns=[2, 3])
)

would then become

make_pipeline(
    make_column_transformer(
        ([0, 1], StandardScaler()),
        ([2, 3], DummyEncoder())
    ),
    ... regression/classifier
)

Maybe a bit less easy from the user's point of view, but it means that the transformers themselves don't need to be updated to handle column selection (and since it is a single object, it would naturally share the reading of X).
But of course, that doesn't mean you need to adopt the same pattern here.
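For the record, here is a runnable version of that pattern with the ColumnTransformer API as it eventually shipped in scikit-learn. Two assumptions to note: released versions take (transformer, columns) tuples, the reverse of the PR-era order shown above, and MinMaxScaler stands in for the dask-ml DummyEncoder so the example needs only scikit-learn:

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0.0, 10.0, 5.0, 1.0],
              [2.0, 20.0, 7.0, 3.0],
              [4.0, 30.0, 9.0, 5.0]])

ct = make_column_transformer(
    (StandardScaler(), [0, 1]),   # standardize columns 0-1
    (MinMaxScaler(), [2, 3]),     # scale columns 2-3 to [0, 1]
)
Xt = ct.fit_transform(X)
```

Because the transformer handles all column routing internally, X is read once and each sub-transformer only ever sees its own slice.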


TomAugspurger commented on August 23, 2024

Thanks Joris, I'll try to take another look at the CategoricalEncoder soon.

I don't have much to add on the ColumnTransformer PR. It seems like make_column_transformer will be awkward to use, but perhaps not. Ideally, any transformer / estimator implemented in dask_ml could also be wrapped in make_column_transformer. I'll look through to see if that's the case.


paolof89 commented on August 23, 2024

I was looking for an implementation of a WoE (Weight of Evidence) scaler. I found one in a scikit-learn-contrib package: https://github.com/scikit-learn-contrib/categorical-encoding.
Do you have thoughts about it? Do you think it makes sense to propose it as a transformer?


TomAugspurger commented on August 23, 2024

