
Comments (12)

mcrumiller commented on May 25, 2024

I'm a fan of this. What about using brackets instead of parentheses for details, and omit the quotes? e.g. Datetime[us] versus Datetime('us')? This is how numpy/pandas show dtype details, and IMO makes it look less like a function and more of a property of the dtype, and it's a bit more compact without the quotes.

from polars.

stinodego commented on May 25, 2024

That's definitely worth considering!


ritchie46 commented on May 25, 2024

Isn't that a bad idea if we have 1 million categories? Shouldn't we just make the datatypes support pickle and json?


mcrumiller commented on May 25, 2024

@ritchie46 yeah, maybe in that case. FYI, here is the text from the official Python docs (emphasis mine):

Return a string containing a printable representation of an object. For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(); otherwise, the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.

We could set a threshold, or we could just say "if the user really wanted it, here it is" and give them a super long string.
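As a rough illustration of the docs passage quoted above, here is a minimal sketch using only standard-library types (nothing here is polars code). It shows a few types whose repr round-trips through eval(), and a plain class (the hypothetical `Plain`) that falls back to the angle-bracket form:

```python
import datetime
from fractions import Fraction

# Types whose repr is valid Python: eval() rebuilds an equal object.
# The names used in the repr must be available in the eval namespace.
namespace = {"datetime": datetime, "Fraction": Fraction}
for value in (datetime.date(2024, 5, 25), Fraction(1, 3), [1, "a", None]):
    text = repr(value)
    rebuilt = eval(text, namespace)
    assert rebuilt == value, text


# A class that defines no __repr__ gets the default angle-bracket form,
# which names the type and the object's address and cannot be eval()'d.
class Plain:
    pass


plain_repr = repr(Plain())
print(plain_repr)
```

The round trip only works when the caller supplies the right names to eval(), which is part of why it is a convention rather than a guarantee.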


ritchie46 commented on May 25, 2024

"if the user really wanted it, here it is" and give them a super long string.

Yes, but I don't expect calling repr to block my thread if we have a large Enum type.

For serialization I understand and expect that this costs compute.


stinodego commented on May 25, 2024

Isn't that a bad idea if we have 1 million categories?

Probably... we can set an upper bound. It can be the same as the __str__ or maybe a bit higher.

Shouldn't we just make the datatypes support pickle and json?

We can definitely add a to_json / from_json to the DataType class. That doesn't have to block this improvement. I guess the JSON representation of a DataType would look something like:

{
  "class": "Enum",
  "attributes": {
    "categories": ["a", "b"]
  }
}
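A minimal sketch of what such a round trip could look like, assuming the JSON layout above. The `EnumType` class and its `to_json` / `from_json` methods are hypothetical stand-ins written for this example; they are not the polars API:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EnumType:
    """Hypothetical stand-in for an Enum dtype; not the polars class."""

    categories: tuple

    def to_json(self) -> str:
        # Mirror the layout sketched above: a class name plus its attributes.
        payload = {
            "class": "Enum",
            "attributes": {"categories": list(self.categories)},
        }
        return json.dumps(payload)

    @classmethod
    def from_json(cls, text: str) -> "EnumType":
        payload = json.loads(text)
        if payload["class"] != "Enum":
            raise ValueError(f"unexpected class: {payload['class']!r}")
        return cls(categories=tuple(payload["attributes"]["categories"]))


dtype = EnumType(("a", "b"))
encoded = dtype.to_json()
decoded = EnumType.from_json(encoded)
print(encoded)
```

Keeping serialisation in dedicated methods like these means the cost of encoding a very large category list is only paid when the caller asks for it.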


alexander-beedie commented on May 25, 2024

Hmm... Think I'm with @ritchie46 on this one; it's not the case that anyone would expect that the repr(obj) can be run through eval to recreate it, and there's also no reason why the str version might be expected to be more manageable than the repr - don't forget that the fully-expanded repr result is what would be shown for every frame.schema repr result too [1]. I doubt the shorter str version would get called much in real-world use.

We don't do this for DataFrame, LazyFrame, or Expr, so I'm not sure why we should do it specially for DataType given the potential downsides (I don't think that there's a compelling reason to do so; serialisation should be handled with dedicated methods, and that was probably the main driver for wanting it?)

Note that the phrasing in the Python docs is:

"For many types..." ('many', not 'all')
"...this function makes an attempt" ('makes an attempt', not 'guarantees')

In other words, this approach is considered optional, and cannot be relied on anywhere else; we are under no obligation to do so either ;)

Footnotes

  1. Looking at the frame schema is likely the most common way that the DataType repr will be viewed, and it wouldn't be the proposed short/str one.


stinodego commented on May 25, 2024

Hm, for some reason I thought the __str__ version was used in the Python console / Notebooks, but I was wrong. That definitely limits the usefulness of this whole venture.

It can still be nice to have a more compact representation for use in error messages, though.


mcrumiller commented on May 25, 2024

@stinodego in the console, if you create an object x and simply type x and hit enter at the prompt, it gives you the result of __repr__. If you type print(x), it gives you the result of __str__:

>>> class TestReprStr:
...     def __str__(self):
...         return "__str__"
...     def __repr__(self):
...         return "__repr__"
... 
>>> x = TestReprStr()
>>> x
__repr__
>>> print(x)
__str__

Not super obvious behavior, but that's how it works. 🤷


mcrumiller commented on May 25, 2024

I decided to add some __str__ implementations just to see what a more compact version might look like:

import polars as pl

dt_u8 = pl.UInt8
dt_u16 = pl.UInt16
dt_u32 = pl.UInt32
dt_u64 = pl.UInt64
dt_i8 = pl.Int8
dt_i16 = pl.Int16
dt_i32 = pl.Int32
dt_i64 = pl.Int64
dt_dec = pl.Decimal(precision=5, scale=2)
dt_dur = pl.Duration("ms")
dt_time = pl.Time
dt_date = pl.Date
dt_datetime = pl.Datetime("ns")
dt_datetime_tz = pl.Datetime("ns", "UTC")
dt_array = pl.Array(inner=pl.UInt8, width=3)
dt_list = pl.List(inner=pl.Datetime("ms"))
dt_struct = pl.Struct(fields={"a": pl.Int8, "b": dt_list, "c": pl.Int32, "d": dt_array})

print(f"""
{dt_u8}\n{dt_u16}\n{dt_u32}\n{dt_u64}\n{dt_i8}\n{dt_i16}
{dt_i32}\n{dt_i64}\n{dt_dec}\n{dt_dur}\n{dt_time}\n{dt_date}
{dt_datetime}\n{dt_datetime_tz}\n{dt_array}\n{dt_list}\n{dt_struct}
""")
UInt8
UInt16
UInt32
UInt64
Int8
Int16
Int32
Int64
Decimal[5,2]
Duration[ms]
Time
Date
Datetime[ns]
Datetime[ns,UTC]
Array[UInt8,3]
List[Datetime[ms]]
Struct[a[Int8];b[List[Datetime[ms]]];c[Int32];d[Array[UInt8,3]]]

The struct looks a bit messy but I'm not sure that's avoidable.


stinodego commented on May 25, 2024

Struct could be Struct[a:Int8,c:Int32] etc. in your syntax.

But yeah, I am giving this one a re-think as the benefits aren't as obvious as I thought. And having print show something else than on the console can also be a little confusing.
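For reference, a small sketch of how the bracket syntax with `name:dtype` struct fields could be produced. The classes below (`Simple`, `Datetime`, `List`, `Struct`) are hypothetical miniatures written for this example, not the polars classes, and only `__str__` is customised:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Simple:
    """Hypothetical dtype with no parameters, e.g. Int8."""

    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Datetime:
    time_unit: str
    time_zone: Optional[str] = None

    def __str__(self) -> str:
        # Omit the time zone entirely when it is not set.
        parts = [self.time_unit] + ([self.time_zone] if self.time_zone else [])
        return f"Datetime[{','.join(parts)}]"


@dataclass(frozen=True)
class List:
    inner: object

    def __str__(self) -> str:
        return f"List[{self.inner}]"


@dataclass(frozen=True)
class Struct:
    fields: tuple  # sequence of (name, dtype) pairs

    def __str__(self) -> str:
        body = ",".join(f"{name}:{dtype}" for name, dtype in self.fields)
        return f"Struct[{body}]"


dt_struct = Struct(
    (("a", Simple("Int8")), ("b", List(Datetime("ms"))), ("c", Simple("Int32")))
)
print(dt_struct)  # Struct[a:Int8,b:List[Datetime[ms]],c:Int32]
print(Datetime("ns", "UTC"))  # Datetime[ns,UTC]
```

Because only `__str__` is overridden here, typing `dt_struct` at the console would still show the dataclass-generated repr, which is exactly the print-versus-console mismatch mentioned above.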


alexander-beedie commented on May 25, 2024

And having print show something else than on the console can also be a little confusing.

Yup, I don't think we should rely on non-obvious differences between __str__ and __repr__ without a good reason; if we want a compact repr (which certainly isn't a bad idea) we should make that a discoverable method, rather than a "magic" behaviour of __str__ that is DataType specific 🤔

I suggest something like the following:

  • Default repr for large Enum objects is truncated, as per #13357. This ensures that frame schemas (and dtypes) will always be reasonable by default (in notebooks, console, logs, etc).
  • Add dedicated serialisation methods (to/from JSON would be consistent with other objects; pickle should "just work").
  • Add a public .repr() (or .format()?) method for places where the caller wants to opt for something minimal or non-truncated. Could have options that allow you to explicitly return the repr in the desired state.
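To make the first and third points concrete, here is a rough sketch of a truncated default repr paired with an explicit opt-in method. The `EnumType` class, the `format` method name, its `max_items` option, and the threshold of 5 are all assumptions made for this example; none of them is the polars API:

```python
class EnumType:
    """Hypothetical stand-in for a large Enum dtype; not the polars class."""

    MAX_SHOWN = 5  # assumed threshold, chosen arbitrarily for this sketch

    def __init__(self, categories):
        self.categories = list(categories)

    def format(self, *, max_items=MAX_SHOWN):
        """Return a repr-like string; pass max_items=None for the full list."""
        cats = self.categories
        if max_items is None or len(cats) <= max_items:
            shown = ", ".join(repr(c) for c in cats)
        else:
            # Only the first max_items categories are formatted, so the
            # default repr stays cheap even for very large Enums.
            shown = ", ".join(repr(c) for c in cats[:max_items])
            shown += f", ... ({len(cats) - max_items} more)"
        return f"Enum(categories=[{shown}])"

    def __repr__(self):
        # The default repr is always the truncated form.
        return self.format()


small = EnumType(["a", "b"])
big = EnumType(f"c{i}" for i in range(1000))
print(repr(small))  # Enum(categories=['a', 'b'])
print(repr(big))    # Enum(categories=['c0', 'c1', 'c2', 'c3', 'c4', ... (995 more)])
full = big.format(max_items=None)  # explicit opt-in to the untruncated string
```

With this split, schemas and logs get the bounded form by default, and a caller who really wants the long string has to ask for it explicitly.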

