Comments (6)
A DataFrame and each of its series all contain references to the same index. This currently isn't checked, but must be true for correctness.
It is checked at https://github.com/gpuopenanalytics/pygdf/blob/f3330612316e2f40fb0bda205ee2ad6ad212bc62/pygdf/dataframe.py#L286. A Series' index must be equivalent to the dataframe index.
from cudf.
The numeric implementation contains no data, but the categorical implementation does. This makes figuring out where to put methods kind of confusing.
The categorical impl doesn't contain data. I think _categories and _ordered are metadata.
No SeriesImpl subclass contains data. They are delegates. When a type-specialized operation is invoked, a Series calls a SeriesImpl to handle the operation. In SeriesImpl, the methods always take the "calling" series as one of the arguments.
To add a new type-specialized method, the actual implementation goes in SeriesImpl. The Series gets a small wrapper that delegates the work.
However, I have been lazy sometimes and put the actual implementation directly in Series. That's okay for now.
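To make the delegate pattern described above concrete, here is a minimal sketch. It is not the pygdf code: the `decode` method, the subclass names, and the use of Python lists in place of GPU arrays are assumptions made only for illustration.

```python
class SeriesImpl:
    """Stateless delegate: holds metadata at most, never the column data."""

    def decode(self, series):
        raise NotImplementedError


class NumericSeriesImpl(SeriesImpl):
    def decode(self, series):
        # Numeric values need no translation.
        return list(series.data)


class CategoricalSeriesImpl(SeriesImpl):
    def __init__(self, categories, ordered):
        # Metadata only; the codes themselves live in the Series.
        self._categories = tuple(categories)
        self._ordered = ordered

    def decode(self, series):
        # The calling series is passed in explicitly as an argument.
        return [self._categories[code] for code in series.data]


class Series:
    def __init__(self, data, impl):
        self.data = list(data)  # stand-in for a GPU array (assumption)
        self._impl = impl

    def decode(self):
        # Small wrapper: the real work is delegated to the impl.
        return self._impl.decode(self)


cat = Series([0, 2, 1, 0], CategoricalSeriesImpl(["a", "b", "c"], ordered=False))
num = Series([1.5, 2.5], NumericSeriesImpl())
print(cat.decode())
print(num.decode())
```

The point of the sketch is only the shape of the call: `Series.decode` is a thin wrapper, and the impl receives the series rather than owning its data.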
As for the proposed class hierarchy, the only difference I see would be where the data is. In SeriesImpl, the data is always passed as an arg of Series type, whereas in the new Data class it will be under self. I would prefer not to couple the data and the type-specialized operation.
Series' index must be equivalent to the dataframe index.
The categorical impl doesn't contain data. I think _categories and _ordered are metadata.
Ah, you are correct in both cases. Ignore my gripes there :).
As for the proposed class hierarchy, the only difference I see would be where the data is. In SeriesImpl, the data is always passed as an arg of Series type, whereas in the new Data class it will be under self. I would prefer not to couple the data and the type-specialized operation.
My only remaining issue is that I'm finding myself frequently using Series where an array would suffice. For example, GenericIndex is backed by a series (which also has an index). Concat is another example: to avoid repeatedly concatenating the same index arrays I had to do a kludgy hack in #40. With array-like objects (that understand categoricals), concat could work on the arrays, and then wrap the final output in a series/dataframe with a concatenated index.
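A rough sketch of that idea, under stated assumptions: `CategoricalArray`, `concat_arrays`, and `concat_frames` are hypothetical names, Python lists stand in for GPU arrays, and a "frame" is reduced to an `(index, columns)` pair. Nothing here reflects the actual change in #40.

```python
class CategoricalArray:
    """Hypothetical index-free container: codes plus categorical metadata."""

    def __init__(self, codes, categories, ordered=False):
        self.codes = list(codes)
        self.categories = tuple(categories)
        self.ordered = ordered

    def __len__(self):
        return len(self.codes)


def concat_arrays(arrays):
    """Concatenate arrays that share the same categories; no index involved."""
    first = arrays[0]
    for arr in arrays[1:]:
        if arr.categories != first.categories or arr.ordered != first.ordered:
            raise ValueError("categorical metadata must match")
    codes = [code for arr in arrays for code in arr.codes]
    return CategoricalArray(codes, first.categories, first.ordered)


def concat_frames(frames):
    """frames: list of (index, columns) pairs; columns maps name -> array."""
    # The index is concatenated exactly once, not once per column.
    index = [label for idx, _ in frames for label in idx]
    names = list(frames[0][1])
    columns = {
        name: concat_arrays([cols[name] for _, cols in frames])
        for name in names
    }
    return index, columns


a = CategoricalArray([0, 1], ("x", "y"))
b = CategoricalArray([1, 1, 0], ("x", "y"))
index, columns = concat_frames([([0, 1], {"c": a}), ([2, 3, 4], {"c": b})])
print(index, columns["c"].codes)
```

Because the per-column work never touches an index, the only index concatenation happens in the single line inside `concat_frames`.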
I think it would be nice to build operations on a generic data container without an index. Since categorical data has metadata (categories and ordered), at least the categorical data container would have to be a step higher than a GPU array. This is similar to pandas: pd.Series are backed by either numpy arrays or pd.Categorical.
Again, feel free to ignore, I'm still figuring things out. I think the differences between this project and pandas are tripping me up and that's leading to confusion on my part.
I agree that the GenericIndex design has issues. I hacked it up to make progress. It's backed by a Series just because Series has a lot of features. But the Series in GenericIndex should always have the basic RangeIndex, so as not to recursively contain more Series.
My only remaining issue is that I'm finding myself frequently using Series where an array would suffice.
That's true. Sadly, the underlying numba gpu array is not as feature-rich.
Now, I think it makes sense to introduce the Data classes. It will resolve the problem in GenericIndex nicely.
So the Buffer class will be the physical layer. The Data class will be the logical layer.
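One way the two layers might fit together, as a sketch only: the method names, the `NumericData`/`CategoricalData` subclasses, and the list-backed `Buffer` are assumptions for illustration, not the design that eventually landed.

```python
class Buffer:
    """Physical layer: raw storage with no type semantics (a list here)."""

    def __init__(self, values):
        self.mem = list(values)  # stand-in for device memory (assumption)

    def __len__(self):
        return len(self.mem)


class Data:
    """Logical layer: interprets a Buffer; carries no index."""

    def __init__(self, buffer):
        self.buffer = buffer

    def __len__(self):
        return len(self.buffer)

    def to_list(self):
        raise NotImplementedError


class NumericData(Data):
    def to_list(self):
        return list(self.buffer.mem)


class CategoricalData(Data):
    def __init__(self, buffer, categories, ordered=False):
        super().__init__(buffer)
        self.categories = tuple(categories)
        self.ordered = ordered

    def to_list(self):
        # The buffer holds integer codes; the metadata gives them meaning.
        return [self.categories[code] for code in self.buffer.mem]


class GenericIndex:
    """Backed by a Data object instead of a Series, so no nested index."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def to_list(self):
        return self.data.to_list()


idx = GenericIndex(CategoricalData(Buffer([1, 0, 1]), ["lo", "hi"]))
print(idx.to_list())
```

In this arrangement GenericIndex no longer needs a Series with its own RangeIndex, which is the recursion problem mentioned above.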
Closing due to #54