While working to implement things like concat , I've n

Rethink class structure (cudf, closed)

rapidsai commented on May 14, 2024

Rethink class structure

from cudf.

Comments (6)

sklam commented on May 14, 2024

> A DataFrame and each of its series all contain references to the same index. This currently isn't checked, but must be true for correctness.

It is checked at https://github.com/gpuopenanalytics/pygdf/blob/f3330612316e2f40fb0bda205ee2ad6ad212bc62/pygdf/dataframe.py#L286. Series' index must be equivalent to the dataframe index.
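A minimal sketch of what such a check could look like, for readers who don't want to follow the link. This is not the pygdf code: `RangeIndex.equals`, `DataFrame.add_column`, and the error message are hypothetical names chosen for illustration, and plain Python lists stand in for GPU arrays.

```python
class RangeIndex:
    """Hypothetical minimal index covering 0..n-1."""

    def __init__(self, n):
        self.n = n

    def equals(self, other):
        # Two range indexes are equivalent when they cover the same range.
        return isinstance(other, RangeIndex) and self.n == other.n


class Series:
    def __init__(self, values, index=None):
        self.values = list(values)
        # Default to a RangeIndex matching the data length.
        self.index = index if index is not None else RangeIndex(len(self.values))


class DataFrame:
    def __init__(self):
        self.index = None
        self.columns = {}

    def add_column(self, name, series):
        if self.index is None:
            # The first column defines the dataframe index.
            self.index = series.index
        elif not self.index.equals(series.index):
            # Assumed behavior: reject a series whose index differs.
            raise ValueError("series index must be equivalent to the dataframe index")
        self.columns[name] = series
```

With this sketch, adding two three-element series succeeds, while adding a one-element series afterwards raises `ValueError`.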

sklam commented on May 14, 2024

> The numeric implementation contains no data, but the categorical implementation does. This makes figuring out where to put methods kind of confusing.

The categorical impl doesn't contain data. I think _categories and _ordered are metadata.

No SeriesImpl subclass contains data. They are the delegates. When a type-specialized operation is invoked, a Series calls a SeriesImpl to handle the operation. In SeriesImpl, the methods always take the "calling" series as one of the arguments.

To add new type-specialized methods, the actual implementation goes in SeriesImpl. The Series will have a small wrapper to delegate the work.

However, I have been lazy sometimes and put the actual implementation directly in Series. That's okay for now.
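The delegation described above can be sketched roughly as follows. Only `Series`, `SeriesImpl`, `_categories`, and `_ordered` come from the discussion; the subclass names and the `element_to_str` method are hypothetical, and a Python list stands in for the GPU array.

```python
class SeriesImpl:
    """Delegate base class: holds no data, every method receives the calling series."""

    def element_to_str(self, series, i):
        raise NotImplementedError


class NumericalSeriesImpl(SeriesImpl):
    def element_to_str(self, series, i):
        # The data lives on the series that is passed in, not on the impl.
        return str(series.data[i])


class CategoricalSeriesImpl(SeriesImpl):
    def __init__(self, categories, ordered=False):
        # Metadata only: the category codes stay on the series.
        self._categories = tuple(categories)
        self._ordered = ordered

    def element_to_str(self, series, i):
        return self._categories[series.data[i]]


class Series:
    def __init__(self, data, impl):
        self.data = list(data)
        self._impl = impl

    def element_to_str(self, i):
        # Small wrapper: hand the type-specialized work to the delegate.
        return self._impl.element_to_str(self, i)
```

For example, `Series([0, 1, 0], CategoricalSeriesImpl(["a", "b"])).element_to_str(1)` looks up code `1` in the delegate's categories and returns `"b"`.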

sklam commented on May 14, 2024

As for the proposed class hierarchy, the only difference I see would be where the data is. In SeriesImpl, the data is always passed as an arg of Series type, whereas in the new Data class it will be under self. I would prefer not to couple the data and the type-specialized operation.

jcrist commented on May 14, 2024

> Series' index must be equivalent to the dataframe index.

> The categorical impl doesn't contain data. I think _categories and _ordered are metadata.

Ah, you are correct in both cases. Ignore my gripes there :).

> As for the proposed class hierarchy, the only difference I see would be where the data is. In SeriesImpl, the data is always passed as an arg of Series type, whereas in the new Data class it will be under self. I would prefer not to couple the data and the type-specialized operation.

My only remaining issue is that I'm finding myself frequently using Series where an array would suffice. For example, GenericIndex is backed by a series (which also has an index). Concat is another example: to avoid repeatedly concatenating the same index arrays I had to do a kludgy hack in #40. With array-like objects (that understand categoricals), concat could work on the arrays, and then wrap the final output in a series/dataframe with a concatenated index.

I think it would be nice to build operations on a generic data container without an index. Since categorical data has metadata (categories and ordered), at least the categorical data container would have to be a step higher than a GPU array. This is similar to pandas: pd.Series are backed by either numpy arrays or pd.Categorical.
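A rough sketch of that idea, with concat working on index-free containers and the index concatenated exactly once at the end. All names here (`NumericalColumn`, `CategoricalColumn`, `concat_columns`, `concat_series`) are hypothetical, lists stand in for GPU arrays, and requiring identical categories is a simplifying assumption.

```python
class NumericalColumn:
    """Index-free container for plain values."""

    def __init__(self, values):
        self.values = list(values)


class CategoricalColumn:
    """Index-free container: integer codes plus categorical metadata."""

    def __init__(self, codes, categories, ordered=False):
        self.codes = list(codes)
        self.categories = tuple(categories)
        self.ordered = ordered


def concat_columns(columns):
    """Concatenate containers of one kind without touching any index."""
    first = columns[0]
    if isinstance(first, CategoricalColumn):
        # Simplifying assumption: all inputs share the same categories.
        if any(c.categories != first.categories for c in columns):
            raise ValueError("categories must match")
        codes = [code for c in columns for code in c.codes]
        return CategoricalColumn(codes, first.categories, first.ordered)
    return NumericalColumn([v for c in columns for v in c.values])


class Series:
    def __init__(self, column, index):
        self.column = column
        self.index = list(index)


def concat_series(series_list):
    # Data and index are each concatenated once, then wrapped at the end.
    column = concat_columns([s.column for s in series_list])
    index = [label for s in series_list for label in s.index]
    return Series(column, index)
```

For a dataframe, the same `concat_columns` call would run once per column while the index concatenation would still happen a single time.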

Again, feel free to ignore, I'm still figuring things out. I think the differences between this project and pandas are tripping me up and that's leading to confusion on my part.

sklam commented on May 14, 2024

I agree that the GenericIndex design has issues. I hacked it up to make progress. It's backed by a Series just because Series has a lot of features. But the Series in GenericIndex should always have the basic RangeIndex, so that it does not recursively contain more Series.

> My only remaining issue is that I'm finding myself frequently using Series where an array would suffice.

That's true. Sadly, the underlying numba gpu array is not as feature-rich.

Now, I think it makes sense to introduce the Data classes. They will resolve the problem in GenericIndex nicely.

So the Buffer class will be the physical layer. The Data class will be the logical layer.
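One way to picture the two layers, and how they would let GenericIndex avoid wrapping a Series, is sketched below. Only the `Buffer`, `Data`, and `GenericIndex` names come from the discussion; the constructors, the `to_list` method, and the use of `array.array` as a stand-in for GPU memory are assumptions.

```python
from array import array


class Buffer:
    """Physical layer: owns raw memory and knows nothing about its meaning."""

    def __init__(self, mem):
        self.mem = mem  # array.array stands in for a GPU device array

    def __len__(self):
        return len(self.mem)


class NumericalData:
    """Logical layer: interprets a buffer as plain numbers."""

    def __init__(self, buffer):
        self.buffer = buffer

    def to_list(self):
        return list(self.buffer.mem)


class CategoricalData:
    """Logical layer: interprets a buffer of codes through categorical metadata."""

    def __init__(self, codes, categories, ordered=False):
        self.codes = codes
        self.categories = tuple(categories)
        self.ordered = ordered

    def to_list(self):
        return [self.categories[c] for c in self.codes.mem]


class GenericIndex:
    """Backed by a Data object, so there is no nested Series and no nested index."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data.to_list())
```

A Series would then hold one Data object plus an index, and the index itself would hold only Data.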

sklam commented on May 14, 2024

Closing due to #54
