Giter VIP home page Giter VIP logo

Comments (6)

quinnj avatar quinnj commented on July 27, 2024

@nalimilan, do you think it'd be wise to parse string columns as CategoricalArrays by default? Or should that be opt-in? Or should we try and be smart about it during type detection and do some kind of threshold for when we switch from CategoricalArray => String column?

from csv.jl.

nalimilan avatar nalimilan commented on July 27, 2024

That's a tough question. It's universally considered (e.g. this post) that the stringsAsFactors=TRUE default in R is a big mistake, so I'd rather not make the same choice again.

OTOH, R factors are worse than our categorical arrays since they have sometimes surprising behaviors: they sometimes behave like integers (and convert silently to them), and values not in levels are converted to missing on assignment (!). Categorical arrays should have a less surprising behavior, though the fact that they return CategoricalValue objects could be annoying when you really want a string (even though we could have CategoricalValue{String} <: AbstractString so that most string operations work).

Also, in R, creating factors by default doesn't have any advantage in terms of memory use now since all strings are automatically pooled. As long as this doesn't happen in Julia, using categorical arrays will be much more efficient for most tables. But we could also use a custom pooled string type instead, which would be completely transparent for users (as opposed to CategoricalValue).

So overall I'm not sure. I'd say we should return standard strings by default, with an option to easily use categorical arrays instead. That's the safest option for now.

from csv.jl.

nalimilan avatar nalimilan commented on July 27, 2024

Most relevant excerpt from the link in my previous post:

Why should we expect a factor to behave like a character vector? Why not expect it to behave like an integer vector? The reason is: we supplied a character vector and R’s default behavior in data.frame() was to convert it to a factor. R’s behavior only makes sense under the assumption there is some commonality of behavior between factors and character vectors. Otherwise R has made a surprising substitution and violated the principle of least astonishment. To press the point further: from an object oriented view (which is a common way to talk about the separation of concerns of interface and implementation) a valid substitution should at the very least follow some form of the Liskov substitution principle of a factor being a valid sub-type of character vector. But this is not possible between mutable versions of factor and character vector, so the substitution should not have been offered.

http://www.win-vector.com/blog/2014/09/factors-are-not-first-class-citizens-in-r/

from csv.jl.

quinnj avatar quinnj commented on July 27, 2024

Plan:

  • pick a threshold % when doing column type detection; if under %, parse as CategoricalArray, otherwise, it'll be String array
  • % threshold will be (# of unique string values) / rows_for_type_detect
  • if Categorical, column will have type of CategoricalValue
  • we might need a Data.categorical(::Type{Sink}) = true addition to the DataStreams interface to signal that a Sink supports streaming categorical columns (i.e. SQLite, CSV, & ODBC don't since they just store as strings anyway, but DataFrames & Feather do natively)
  • Sink's will need to allocate specially so they store as CategoricalArray and not a Vector of CategoricalValues
  • Will need a custom parsefield(io, ::Type{CategoricalValue}) method to parse a csv string cell as a CategoricalValue, potentially adding to the CategoricalPool if needed; we might also need special Data.streamto! methods for CategoricalArray if it receives a CategoricalValue w/ a larger pool than it was created with
  • some kind of keyword argument to CSV.read to allow columns to be categorical; not sure if it's just a categorical=true and we'll detect, or if we let them set the % threshold, or just simply allow people to pass in CategoricalValue as the column type and we'll just use that.

from csv.jl.

nalimilan avatar nalimilan commented on July 27, 2024

Makes sense. I'm torn about whether choosing the type automatically based on the % of unique values is a good idea or not. In principle it's a bit annoying to have the types depend on the contents of the file, but I guess it would work fine as long as the chosen threshold is high enough (so that you don't accidentally get a non-categorical array just because a given file contains more diverse data, e.g. because it's not sorted on that particular column). In practice the difference between categorical and non-categorical variables is likely quite clear: either only a handful of unique levels or mostly unique values, with very few cases in between.

Regarding the implementation, I only have a few comments:

Will need a custom parsefield(io, ::Type{CategoricalValue}) method to parse a csv string cell as a CategoricalValue, potentially adding to the CategoricalPool if needed; we might also need special Data.streamto! methods for CategoricalArray if it receives a CategoricalValue w/ a larger pool than it was created with

Note that setindex!(::CategoricalArray, ...) will automatically add new levels to the pool when needed, and throw an OverflowError if there are too many of them. You could just catch it and resize the pool in that case. It would be interesting to start with a smaller reftype than the default UInt32, depending on how many unique values were detected in the first rows, since typemax(UInt32) is an unusually large number of levels for a categorical variable. In most cases I expect UInt8 to be enough, though UInt16 would be less risky.

some kind of keyword argument to CSV.read to allow columns to be categorical; not sure if it's just a categorical=true and we'll detect, or if we let them set the % threshold, or just simply allow people to pass in CategoricalValue as the column type and we'll just use that.

See the suggestions I made at JuliaData/DataFrames.jl#895 (comment). I don't think passing a custom threshold makes a lot of sense: if the default doesn't work for you, better specify what type you want explicitly, either for all string columns, or for a subset of columns.

from csv.jl.

nalimilan avatar nalimilan commented on July 27, 2024

JuliaData/CategoricalArrays.jl#77 makes CategoricalValue <: AbstractString to make it mostly transparent to the user that a CategoricalArray{String} is used instead of an Array{String}.

from csv.jl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.