
Comments (13)

mrocklin commented on August 23, 2024

What are the cases where you don't want to compute immediately after calling fit?

  1. Fitting multiple estimators at once?
  2. Parameter searches?
  3. ...?

cc @jcrist who I think thought about this for dask-searchcv

from dask-ml.

TomAugspurger commented on August 23, 2024

What are the cases where you don't want to compute immediately after calling fit?

I was trying to think of cases where multiple stages of a Pipeline could be fused together by dask into a single .compute. Something like

  • scale columns [0, 1] by mean and variance
  • categorical encode columns [2, 3]

In a pipeline that's

make_pipeline(
    StandardScaler(columns=[0, 1]),
    DummyEncoder(columns=[2, 3])
)

As a programmer, I know that the DummyEncoder operation doesn't depend on the scaling step in this specific case. Ideally, I could share the common tasks, like reading X off disk, across the two operations. I haven't thought about this very deeply yet :)
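That graph-sharing idea can be sketched with plain dask already. Everything below is a toy stand-in (not dask-ml code): both branches read from the same X, and a single dask.compute call builds one merged graph in which the tasks producing X are reused rather than executed twice.

```python
import numpy as np
import dask.array as da

# Toy data; the two operations below stand in for the scale/encode stages.
X = da.from_array(np.arange(20, dtype=float).reshape(5, 4), chunks=(5, 2))

# "Scale" columns 0-1 by mean and std
scaled = (X[:, :2] - X[:, :2].mean(axis=0)) / X[:, :2].std(axis=0)
# Toy stand-in for encoding columns 2-3
flagged = X[:, 2:] > X[:, 2:].mean(axis=0)

# One compute call -> one merged graph; the tasks producing X's chunks
# are shared between the two independent branches.
scaled_np, flagged_np = da.compute(scaled, flagged)
```

This is roughly the effect a fused pipeline would aim for: collect the per-stage graphs and compute them together.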


jcrist commented on August 23, 2024

The big question right now is: should fitting be eager, and fitted values concrete?

Off the top of my head I can think of a few different solutions:

  • A compute keyword on the transformer init, defaulting to True; passing compute=False defers computation, and the programmer is responsible for triggering it. Easy and intuitive; this would be my recommendation.

  • Always return dask.array objects from dask-based transformers, and let non-compatible scikit-learn transformers convert via np.asarray. This is potentially fragile if transforms are not robust to array-like objects.

  • Pipeline subclass that always runs transforms collectively as a graph. Non-dask inputs to fit/fit_transform are coerced into dask inputs, then fed lazily through the pipeline. Non-dask transformers are each run as a single task, similar to if they were wrapped in dask.delayed; dask-based transformers are free to use the dask array/dataframe API. Compute is called at the end of the pipeline by default to match scikit-learn's eager evaluation, but a delayed option is available to support dask pipelines containing dask pipelines.

    You might implement this with either a base-class check to see if a lazy kwarg is supported, or a method check to see if a transformer supports lazy fit/transform. Something like:

    # Using a base-class and a supported kwarg to `fit`/`fit_transform`
    if isinstance(transformer, DaskBaseEstimator):
        # lazily fit the next stage in the pipeline
        est, Xt = transformer.fit_transform(Xt, y, compute=False)

    # Using duck-typing and custom method names
    if hasattr(transformer, 'fit_transform_dask'):
        # lazily fit the next stage in the pipeline
        est, Xt = transformer.fit_transform_dask(Xt, y)

Of these I'd probably go with either 1 (easy, clear, intuitive) or 3 (harder to implement, may not play nice with all of scikit-learn, but probably more robust than 2).
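For illustration, option 1 might look roughly like this. The class below is a hypothetical sketch of a dask-backed StandardScaler, not dask-ml's actual API:

```python
import numpy as np
import dask.array as da

class StandardScaler:
    """Hypothetical sketch of option 1: a compute flag on the init."""

    def __init__(self, compute=True):
        self.compute = compute

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        if self.compute:
            # Eager mode: materialize fitted attributes as numpy arrays
            self.mean_, self.std_ = da.compute(self.mean_, self.std_)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

X = da.from_array(np.arange(12, dtype=float).reshape(6, 2), chunks=(3, 2))
eager = StandardScaler().fit(X)               # mean_ is a concrete numpy array
lazy = StandardScaler(compute=False).fit(X)   # mean_ is still a lazy dask array
```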

So, has scaler.mean_ been computed, and is it a dask.array or a numpy.array? This is a big decision.

I think that after calling fit in immediate mode (whether this is the default or the only option depends on the solution picked above), all attributes (e.g. .mean_) should be concrete. This meshes better with scikit-learn and matches its eager evaluation model.
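As a side note on option 2: the implicit conversion works because dask arrays implement `__array__`, so np.asarray (which scikit-learn's input validation calls) silently triggers a compute:

```python
import numpy as np
import dask.array as da

X = da.ones((4, 3), chunks=(2, 3))

# A non-dask-aware transformer calling np.asarray computes the graph
# implicitly and gets back a concrete numpy array.
X_np = np.asarray(X)
```

Convenient, but it hides a potentially expensive computation behind an innocuous-looking call.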


TomAugspurger commented on August 23, 2024

Thanks @jcrist. Your proposal 1 seems the best. I think option 3, of running all the transforms as a single graph, will be interesting to experiment with at some point.


dsevero commented on August 23, 2024

Cool!

@TomAugspurger does it make sense to implement the MinMaxScaler? If so, I'll do it. Looks like a good entry point to the dask-ml philosophy, given that I'm familiar with sklearn.
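For context, a minimal dask-backed MinMaxScaler could look like the sketch below (hypothetical, not the implementation that eventually landed in dask-ml):

```python
import numpy as np
import dask.array as da

class MinMaxScaler:
    """Hypothetical sketch: scale each feature to [0, 1] with dask reductions."""

    def fit(self, X):
        # A single compute call evaluates both reductions over the data
        self.data_min_, self.data_max_ = da.compute(X.min(axis=0), X.max(axis=0))
        return self

    def transform(self, X):
        return (X - self.data_min_) / (self.data_max_ - self.data_min_)

X = da.from_array(np.array([[1.0, 10.0],
                            [3.0, 30.0],
                            [5.0, 50.0]]), chunks=(2, 2))
Xt_np = MinMaxScaler().fit(X).transform(X).compute()
```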


TomAugspurger commented on August 23, 2024

@daniel-severo yep, that'd be great to have. I'll add it to the list.


dsevero commented on August 23, 2024

With respect to being nan-safe: I think the user should handle this by using the Imputer transformer on the inputs.

If operations within dask-ml end up producing np.nan results, that is probably due to some misuse by the user anyway.
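That division of labour, dealing with NaNs before the scalers run, is cheap to express with dask. A mean-imputation step might be sketched like this (toy example, not the Imputer API):

```python
import numpy as np
import dask.array as da

X = da.from_array(np.array([[1.0, np.nan],
                            [3.0, 4.0],
                            [np.nan, 8.0]]), chunks=(3, 2))

col_means = da.nanmean(X, axis=0)  # per-column mean, ignoring NaNs
# Replace each NaN with its column's mean, then materialize the result
X_filled_np = da.where(da.isnan(X), col_means, X).compute()
```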


dsevero commented on August 23, 2024

MinMaxScaler: #9


dsevero commented on August 23, 2024

Imputer: #11


jorisvandenbossche commented on August 23, 2024

CategoricalEncoder (TODO: check on Joris' recent work in sklearn here)

I picked up this work again last week, and I think the design should be fairly fixed now (scikit-learn/scikit-learn#9151). For now, we opted to provide only two ways to encode: just replacing with integers (categorical 'codes') and one-hot / dummy encoding (as there are many ways one could do 'categorical encoding'). Feedback is always welcome there!

Note that you might not want to follow the design exactly, as it does not use anything dataframe-specific (it can handle dataframes, but does not take advantage of it). E.g., the categories are specified as a positional list for the different columns, not as a dict of column name -> categories.

All transformers should take an optional columns argument. When specified, the transformation will be limited to just those columns (e.g. if doing a standard scaling with columns=['A', 'B'], only 'A' and 'B' are scaled). By default, all columns are scaled.

In scikit-learn, we are currently taking another route: instead of adding such a columns keyword to all transformers, I am working on a ColumnTransformer to perform such column-specific transformations (scikit-learn/scikit-learn#9012).

The example from above:

make_pipeline(
    StandardScaler(columns=[0, 1]),
    DummyEncoder(columns=[2, 3])
)

would then become

make_pipeline(
    make_column_transformer(
        ([0, 1], StandardScaler()),
        ([2, 3], DummyEncoder())
    ),
    ... regression/classifier
)

Maybe a bit less easy from the user's point of view, but it means that the transformers themselves don't need to be updated to handle column selection (and since it is a single object, it would naturally share the reading of X).
But of course, that doesn't mean you need to adopt the same pattern here.
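For the record, here is a runnable version of that pattern with the ColumnTransformer API as it eventually shipped in scikit-learn. Two assumptions to note: released versions take (transformer, columns) tuples, the reverse of the PR-era order shown above, and MinMaxScaler stands in for the dask-ml DummyEncoder so the example needs only scikit-learn:

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0.0, 10.0, 5.0, 1.0],
              [2.0, 20.0, 7.0, 3.0],
              [4.0, 30.0, 9.0, 5.0]])

ct = make_column_transformer(
    (StandardScaler(), [0, 1]),   # standardize columns 0-1
    (MinMaxScaler(), [2, 3]),     # scale columns 2-3 to [0, 1]
)
Xt = ct.fit_transform(X)
```

Because the transformer handles all column routing internally, X is read once and each sub-transformer only ever sees its own slice.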


TomAugspurger commented on August 23, 2024

Thanks Joris, I'll try to take another look at the CategoricalEncoder soon.

I don't have much to add on the ColumnTransformer PR. It seems like make_column_transformer will be awkward to use, but perhaps not. Ideally, any transformer / estimator implemented in dask_ml could also be wrapped in make_column_transformer. I'll look through to see if that's the case.


paolof89 commented on August 23, 2024

I was looking for an implementation of a WoE (Weight of Evidence) scaler. I found one in a scikit-learn-contrib package: https://github.com/scikit-learn-contrib/categorical-encoding.
Do you have thoughts about it? Do you think it makes sense to propose it as a transformer?


TomAugspurger commented on August 23, 2024

