
Comments (11)

itamarst commented on September 25, 2024

I will try to work on this. Depending on how long it gets, this might end up being a series of issues and PRs.

from polars.

stinodego commented on September 25, 2024

There's a (somewhat) related PR open already: #13392


itamarst commented on September 25, 2024

Thanks! I'll take the comments and info there into account. But at this point I'm contemplating a much more significant rewrite, given these APIs are so tricky to use correctly.


itamarst commented on September 25, 2024

More problems: the documented behavior of map_batches() on this page doesn't match the demonstrated behavior (or perhaps the underlying behavior?). For example, it says:

Ouch.. we clearly get the wrong results here. Group "b" even got a value from group "a" 😵.

Except the actual output shown in the documentation isn't wrong, and group "b" does not in fact have values from group "a"...


MarcoGorelli commented on September 25, 2024

yeah it needs updating since #13181


itamarst commented on September 25, 2024

OK, so what are the expected semantics of map_batches()? Is the batch always the original series (in select()) or always the group (in group_by())?


MarcoGorelli commented on September 25, 2024

that looks right

(.venv) marcogorelli@DESKTOP-U8OKFP3:~/scratch$ cat t.py
import polars as pl

def func(x):
    print('batch is: ', x)
    return x

df = pl.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

df.select(pl.col('value').map_batches(func))
df.select(pl.col('value').map_batches(func).over('group'))
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/scratch$ python t.py
batch is:  shape: (3,)
Series: 'value' [i64]
[
        1
        2
        3
]
batch is:  shape: (2,)
Series: '' [i64]
[
        1
        2
]
batch is:  shape: (1,)
Series: '' [i64]
[
        3
]
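The output above suggests a simple mental model: without a grouping, the function receives one batch holding the whole column, and with over('group') it receives one batch per group. Below is a minimal plain-Python sketch of that model. It does not use Polars at all; `batches_seen` is a hypothetical helper invented for illustration, and the ordering of groups in real Polars may differ from the first-appearance order assumed here.

```python
from typing import Hashable, List, Optional, Sequence


def batches_seen(
    values: Sequence[int],
    groups: Optional[Sequence[Hashable]] = None,
) -> List[List[int]]:
    """Return the batches a UDF would receive in this simplified model.

    Hypothetical helper, not part of Polars.
    """
    if groups is None:
        # No grouping: the whole column is a single batch.
        return [list(values)]

    # Grouping: one batch per distinct key, in first-appearance order
    # (an assumption of this sketch, not a documented Polars guarantee).
    by_group: dict = {}
    for key, value in zip(groups, values):
        by_group.setdefault(key, []).append(value)
    return list(by_group.values())


print(batches_seen([1, 2, 3]))                   # [[1, 2, 3]]
print(batches_seen([1, 2, 3], ["a", "a", "b"]))  # [[1, 2], [3]]
```

The two printed results mirror the shapes in the session above: one batch of three values, then batches of two values and one value.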


itamarst commented on September 25, 2024

If that's the case, my first inclination is to not document map_elements() at all in the user guide, nor refer to it in the API docs for map_batches(), since the fact that map_elements() sometimes takes a single element and sometimes takes a whole Series seems problematic. Or am I missing something?


cmdlineluser commented on September 25, 2024

Somewhat related:


itamarst commented on September 25, 2024

I guess the complexity argument in #14521 suggests one should only use map_batches() for the UDF docs and, as suggested in #14521, document the different APIs in their own separate document.


deanm0000 commented on September 25, 2024

If you exclude groups and you have a func which takes a Python scalar and returns a scalar, then

df.select(pl.col('a').map_elements(func))

is mostly the same as

df.select(
    pl.col('a').map_batches(
        lambda x: pl.Series([func(y) for y in x])
    )
)

So it's just a convenience shortcut really.
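The "convenience shortcut" idea can be sketched without Polars: any scalar-to-scalar function can be lifted into a function over a whole batch by looping over the batch. The helper name `as_batch_function` is hypothetical and exists only for this illustration; in the Polars snippet above, the resulting list would additionally be wrapped in pl.Series(...) before being returned to map_batches().

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def as_batch_function(func: Callable[[T], U]) -> Callable[[Iterable[T]], List[U]]:
    """Lift a scalar -> scalar function into a batch -> list function.

    Hypothetical helper for illustration; not a Polars API.
    """

    def batch_func(batch: Iterable[T]) -> List[U]:
        # Apply the scalar function to every element of the batch, in order.
        return [func(value) for value in batch]

    return batch_func


def double(value: int) -> int:
    return value * 2


double_batch = as_batch_function(double)
print(double_batch([1, 2, 3]))  # [2, 4, 6]
```

This only covers the ungrouped case described in the comment above; it says nothing about how either API behaves inside group_by() or over().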

