Comments (4)
This sounds like a special case of a run-length encoded column.
from cudf.
Could you elaborate on which operations would use this and how they would work?
So much of our code checks that column types match.
For example, even dictionary columns generally operate only against other dictionary columns.
It may be easier to overload certain APIs to accept a scalar parameter instead of a column.
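As a rough illustration of the "accept a scalar parameter instead of a column" idea, here is a minimal sketch in plain Python. It does not use any cudf API: the `add` function and the list-based `Column` stand-in are hypothetical names chosen only for this example.

```python
from numbers import Number
from typing import Sequence, Union

# Stand-in for a real column type (assumption: a plain Python sequence).
Column = Sequence[float]


def add(lhs: Column, rhs: Union[Column, float]) -> list:
    """Element-wise add where rhs is either a column or a single scalar."""
    if isinstance(rhs, Number):
        # Scalar overload: the value is applied to every row without
        # materializing a full column of repeated values.
        return [x + rhs for x in lhs]
    if len(lhs) != len(rhs):
        raise ValueError("column lengths must match")
    # Column overload: ordinary row-by-row operation.
    return [x + y for x, y in zip(lhs, rhs)]


result_scalar = add([1.0, 2.0, 3.0], 10.0)
result_column = add([1.0, 2.0, 3.0], [10.0, 10.0, 10.0])
```

Both calls give the same result, but the scalar form never allocates the repeated right-hand column.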
> This sounds like a special case of a run-length encoded column.
Yes, I would accept a run-length encoded column too.
> Could you elaborate on which operations would use this and how they would work?
That is kind of hard to do right now. Conceptually, we pass tables around in Spark between different operators. It is not really a table, as it is Spark specific, but we treat them the same and convert back and forth between them everywhere. I was hoping to keep this abstraction and just allow a table to hold a scalar column, or a run-length encoded column, as the case may be.
If there is no simple way to add an abstraction to get this, then we should go back and see how much of this we can do ourselves outside of using a table everywhere.
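To make the "scalar column as a special case of a run-length encoded column" idea concrete, here is a minimal sketch in plain Python. `RleColumn` and `scalar_column` are hypothetical names for illustration only and do not correspond to any existing cudf or Spark type.

```python
from bisect import bisect_right
from itertools import accumulate


class RleColumn:
    """Hypothetical run-length encoded column: one value per run."""

    def __init__(self, values, run_lengths):
        if len(values) != len(run_lengths):
            raise ValueError("need exactly one run length per value")
        self.values = list(values)
        # Exclusive end offset of each run, e.g. lengths [2, 3] -> [2, 5].
        self.ends = list(accumulate(run_lengths))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def __getitem__(self, row):
        if not 0 <= row < len(self):
            raise IndexError(row)
        # The run containing `row` is the first one whose end offset exceeds it.
        return self.values[bisect_right(self.ends, row)]


def scalar_column(value, num_rows):
    """A scalar column is an RLE column with a single run."""
    return RleColumn([value], [num_rows])


rle = RleColumn(["a", "b"], [2, 3])
const = scalar_column(7, 4)
expanded_rle = list(rle)
expanded_const = list(const)
```

Under this assumption, code that only needs row count and row access can treat the scalar column like any other column in a table, while storage stays constant in the number of rows.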
I assume the scalar column would only be for fixed-width types and strings. There are certain APIs where the internal implementation converts (non-nested) inputs to iterators, and there a scalar could easily be represented as a column using a constant iterator. So these APIs could have an alternate signature to plumb the scalar (column or RLE column) through.
Of course, not all internal implementations are coded this way, so the amount of effort here depends on which APIs need to be targeted.
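The constant-iterator approach can be sketched in plain Python as follows. This is only an analogy for the iterator-based internals described above, not actual cudf code; `as_iterator` and `binary_op` are hypothetical helpers, and columns are modeled as lists.

```python
from itertools import repeat
from operator import add, mul


def as_iterator(value_or_column, size):
    """Return an iterator over `size` rows for a column or a scalar."""
    if isinstance(value_or_column, (list, tuple)):
        # A real column: iterate over its rows.
        return iter(value_or_column)
    # A scalar (number or string): a constant iterator yields the same
    # value for every row, so nothing is materialized.
    return repeat(value_or_column, size)


def binary_op(op, lhs, rhs, size):
    """Apply `op` row by row; either input may be a column or a scalar."""
    return [op(a, b) for a, b in zip(as_iterator(lhs, size), as_iterator(rhs, size))]


scaled = binary_op(mul, [1, 2, 3], 2, 3)
suffixed = binary_op(add, ["a", "b"], "-x", 2)
```

Because the implementation only ever sees iterators, the same code path serves both the column and the scalar case, which is why APIs already written this way would be the cheapest to extend.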