Comments (6)

sklam commented on May 14, 2024

> A DataFrame and each of its series all contain references to the same index. This currently isn't checked, but must be true for correctness.

It is checked at https://github.com/gpuopenanalytics/pygdf/blob/f3330612316e2f40fb0bda205ee2ad6ad212bc62/pygdf/dataframe.py#L286. A Series' index must be equivalent to the DataFrame index.
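
To illustrate the invariant, here is a minimal sketch of what such a check could look like. It is not the pygdf code at the linked line: `RangeIndex`, `Series`, `DataFrame`, and `add_column` below are simplified stand-ins written for this example, and plain Python lists stand in for GPU arrays.

```python
class RangeIndex:
    """Simplified stand-in for an index: the labels 0..n-1."""

    def __init__(self, n):
        self.n = n

    def __eq__(self, other):
        return isinstance(other, RangeIndex) and self.n == other.n


class Series:
    """Simplified stand-in: values plus an index."""

    def __init__(self, values, index=None):
        self.values = list(values)
        self.index = index if index is not None else RangeIndex(len(self.values))


class DataFrame:
    """Simplified stand-in: every column must share the frame's index."""

    def __init__(self):
        self.index = None
        self.columns = {}

    def add_column(self, name, series):
        if self.index is None:
            # The first column defines the frame's index.
            self.index = series.index
        elif series.index != self.index:
            # Reject any column whose index is not equivalent.
            raise ValueError("series index does not match the dataframe index")
        self.columns[name] = series


df = DataFrame()
df.add_column("a", Series([1, 2, 3]))
df.add_column("b", Series([4, 5, 6]))  # same index, accepted
```

Adding a column of a different length, for example `Series([1, 2])`, raises `ValueError` in this sketch.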

from cudf.

sklam commented on May 14, 2024

> The numeric implementation contains no data, but the categorical implementation does. This makes figuring out where to put methods kind of confusing.

The categorical impl doesn't contain data. I think _categories and _ordered are metadata.

No SeriesImpl subclass contains data. They are delegates. When a type-specialized operation is invoked, a Series calls a SeriesImpl to handle the operation. In SeriesImpl, the methods always take the "calling" series as one of the arguments.

To add new type-specialized methods, the actual implementation goes in SeriesImpl. The Series will have a small wrapper to delegate the work.

However, I have been lazy sometimes and put the actual implementation directly in Series. That's okay for now.
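
The delegation described above can be sketched roughly as follows. This is an illustration, not the actual pygdf classes: `SeriesImpl`, `Series`, `_categories`, and `_ordered` are names from this thread, while `NumericalSeriesImpl`, `CategoricalSeriesImpl`, `to_pylist`, and the list-backed `data` attribute are assumptions made for the example.

```python
class SeriesImpl:
    """Delegate base class: holds no column data."""

    def to_pylist(self, series):
        raise NotImplementedError


class NumericalSeriesImpl(SeriesImpl):
    def to_pylist(self, series):
        # The calling series is passed in; the delegate only reads it.
        return list(series.data)


class CategoricalSeriesImpl(SeriesImpl):
    def __init__(self, categories, ordered=False):
        # Metadata only; the codes live on the Series.
        self._categories = tuple(categories)
        self._ordered = ordered

    def to_pylist(self, series):
        # Interpret the series' integer codes through the category table.
        return [self._categories[code] for code in series.data]


class Series:
    def __init__(self, data, impl):
        self.data = list(data)  # stand-in for a GPU array
        self._impl = impl

    def to_pylist(self):
        # Small wrapper: delegate the type-specialized work to the impl.
        return self._impl.to_pylist(self)


numeric = Series([1, 2, 3], NumericalSeriesImpl())
categorical = Series([0, 1, 0], CategoricalSeriesImpl(["a", "b"]))
```

Here `numeric.to_pylist()` returns the stored values unchanged, while `categorical.to_pylist()` maps each code through the categories held by the delegate.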


sklam commented on May 14, 2024

As for the proposed class hierarchy, the only difference I see would be where the data is. In SeriesImpl, the data is always passed as an arg of Series type, whereas in the new Data class it will be under self. I would prefer not to couple the data and the type-specialized operation.
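
A short sketch of the two styles being compared, under assumed names: `NumericImpl`, `NumericData`, `total`, and the list-backed storage are hypothetical and chosen only to show where the data lives in each design.

```python
class NumericImpl:
    """Delegate style: the data arrives as an argument of Series type."""

    def total(self, series):
        return sum(series.data)


class Series:
    def __init__(self, data, impl):
        self.data = list(data)
        self._impl = impl

    def total(self):
        return self._impl.total(self)


class NumericData:
    """Proposed style: the data lives under self, next to the operation."""

    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)


# One stateless delegate can serve any number of series.
shared_impl = NumericImpl()
first = Series([1, 2, 3], shared_impl)
second = Series([10], shared_impl)

# A Data object carries its own values.
data = NumericData([1, 2, 3])
```

In the first style the operation and the storage stay decoupled; in the second each `NumericData` instance owns the values it operates on.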


jcrist commented on May 14, 2024

> Series' index must be equivalent to the dataframe index.

> The categorical impl doesn't contain data. I think _categories and _ordered are metadata.

Ah, you are correct in both cases. Ignore my gripes there :).

> As for the proposed class hierarchy, the only difference I see would be where the data is. In SeriesImpl, the data is always passed as an arg of Series type, whereas in the new Data class it will be under self. I would prefer not to couple the data and the type-specialized operation.

My only remaining issue is that I'm finding myself frequently using Series where an array would suffice. For example, GenericIndex is backed by a series (which also has an index). Concat is another example - to avoid repeatedly concatenating the same index arrays I had to do a kludgy hack in #40. With array-like objects (that understand categoricals), concat could work on the arrays, and then wrap the final output in a series/dataframe with a concatenated index.

I think it would be nice to build operations on a generic data container without an index. Since categorical data has metadata (categories and ordered), at least the categorical data container would have to be a step higher than a GPU array. This is similar to pandas - a pd.Series is backed by either a numpy array or a pd.Categorical.
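
A rough sketch of that idea, with made-up names: `CategoricalData`, `concat_data`, and `concat_frames` are hypothetical, Python lists stand in for GPU arrays, and a frame is reduced to an `(index_labels, columns)` pair to keep the example short.

```python
class CategoricalData:
    """Index-free container: integer codes plus categorical metadata."""

    def __init__(self, codes, categories, ordered=False):
        self.codes = list(codes)
        self.categories = tuple(categories)
        self.ordered = ordered


def concat_data(parts):
    """Concatenate index-free containers that share the same metadata."""
    first = parts[0]
    for part in parts[1:]:
        if part.categories != first.categories or part.ordered != first.ordered:
            raise ValueError("all parts must share categories and ordering")
    codes = [code for part in parts for code in part.codes]
    return CategoricalData(codes, first.categories, first.ordered)


def concat_frames(frames):
    """frames: list of (index_labels, {column_name: CategoricalData})."""
    # The index is concatenated once, not once per column.
    index = [label for labels, _ in frames for label in labels]
    names = list(frames[0][1])
    columns = {
        name: concat_data([cols[name] for _, cols in frames]) for name in names
    }
    return index, columns


a = CategoricalData([0, 1], ["x", "y"])
b = CategoricalData([1, 1], ["x", "y"])
index, columns = concat_frames([([0, 1], {"c": a}), ([2, 3], {"c": b})])
```

The column work happens entirely on the index-free containers, and the result is paired with the combined index only at the end.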

Again, feel free to ignore, I'm still figuring things out. I think the differences between this project and pandas are tripping me up and that's leading to confusion on my part.


sklam commented on May 14, 2024

I agree that the GenericIndex design has issues. I hacked it up to make progress. It's backed by a Series just because Series has a lot of features. But the Series in a GenericIndex should always have the basic RangeIndex, so that it does not recursively contain more Series.

> My only remaining issue is that I'm finding myself frequently using Series where an array would suffice.

That's true. Sadly, the underlying numba gpu array is not as feature-rich.

Now, I think it makes sense to introduce the Data classes. It will resolve the problem in GenericIndex nicely.

So the Buffer class will be the physical layer. The Data class will be the logical layer.
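
One way to picture the two layers, as a sketch only: the `Buffer` and Data names come from this thread, but everything else here (`NumericalData`, `CategoricalData`, `to_pylist`, and using the standard-library `array` module in place of GPU memory) is an assumption for illustration.

```python
import array


class Buffer:
    """Physical layer: a flat block of typed memory with no meaning attached."""

    def __init__(self, typecode, values):
        self.mem = array.array(typecode, values)

    def __len__(self):
        return len(self.mem)


class NumericalData:
    """Logical layer: reads a Buffer as plain numbers."""

    def __init__(self, buffer):
        self.buffer = buffer

    def to_pylist(self):
        return list(self.buffer.mem)


class CategoricalData:
    """Logical layer: reads a Buffer as codes into a category table."""

    def __init__(self, buffer, categories, ordered=False):
        self.buffer = buffer
        self.categories = tuple(categories)
        self.ordered = ordered

    def to_pylist(self):
        return [self.categories[code] for code in self.buffer.mem]


codes = Buffer("i", [0, 1, 0])
as_numbers = NumericalData(codes)
as_categories = CategoricalData(codes, ["lo", "hi"])
```

The same physical `Buffer` is read as raw integers by `NumericalData` and as category labels by `CategoricalData`; neither layer carries an index.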


sklam commented on May 14, 2024

Closing due to #54
