Comments (11)
I will try to work on this. Depending on how long it gets, this might end up being a series of issues and PRs.
from polars.
There's a (somewhat) related PR open already: #13392
from polars.
Thanks! I'll take the comments and info there into account. But at this point I'm contemplating a much more significant rewrite, given these APIs are so tricky to use correctly.
from polars.
More problems: the documented behavior of map_batches() on this page doesn't match the demonstrated behavior (or perhaps the underlying behavior?). For example, it says:
Ouch.. we clearly get the wrong results here. Group "b" even got a value from group "a" 😵.
Except the actual output in the documentation isn't the wrong results, and group "b" does not in fact have values from group "a"...
from polars.
yeah it needs updating since #13181
from polars.
OK, so what are the expected semantics of map_batches()? Is the batch always the original series (in select()) or always the group (in group_by())?
from polars.
that looks right
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/scratch$ cat t.py
import polars as pl
def func(x):
    print('batch is: ', x)
    return x
df = pl.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})
df.select(pl.col('value').map_batches(func))
df.select(pl.col('value').map_batches(func).over('group'))
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/scratch$ python t.py
batch is: shape: (3,)
Series: 'value' [i64]
[
1
2
3
]
batch is: shape: (2,)
Series: '' [i64]
[
1
2
]
batch is: shape: (1,)
Series: '' [i64]
[
3
]
from polars.
If that's the case, my first inclination is to not document map_elements() at all in the user guide, nor refer to it in the API docs for map_batches()? The fact that map_elements() sometimes takes a single element and sometimes takes a whole Series seems problematic. Or am I missing something?
from polars.
Somewhat related:
from polars.
I guess the complexity argument in #14521 suggests one should only use map_batches() for the UDF docs and, as suggested in #14521, document the different APIs in a separate document of their own.
from polars.
If you exclude groups and you have func, which takes a Python scalar and returns a scalar, then
df.select(pl.col('a').map_elements(func))
is mostly the same as
df.select(
    pl.col('a').map_batches(lambda x: pl.Series([func(y) for y in x]))
)
So it's just a convenience shortcut really.
from polars.