Support CategoricalArray parsing (csv.jl, closed)

juliadata commented on July 27, 2024

Support CategoricalArray parsing

from csv.jl.

Comments (6)

quinnj commented on July 27, 2024

@nalimilan, do you think it'd be wise to parse string columns as CategoricalArrays by default? Or should that be opt-in? Or should we try and be smart about it during type detection and do some kind of threshold for when we switch from CategoricalArray => String column?

nalimilan commented on July 27, 2024

That's a tough question. It's widely considered (e.g. in this post) that the stringsAsFactors=TRUE default in R is a big mistake, so I'd rather not make the same choice again.

OTOH, R factors are worse than our categorical arrays since they have sometimes surprising behaviors: they sometimes behave like integers (and convert silently to them), and values not in levels are converted to missing on assignment (!). Categorical arrays should have a less surprising behavior, though the fact that they return CategoricalValue objects could be annoying when you really want a string (even though we could have CategoricalValue{String} <: AbstractString so that most string operations work).

Also, in R, creating factors by default doesn't have any advantage in terms of memory use now since all strings are automatically pooled. As long as this doesn't happen in Julia, using categorical arrays will be much more efficient for most tables. But we could also use a custom pooled string type instead, which would be completely transparent for users (as opposed to CategoricalValue).

So overall I'm not sure. I'd say we should return standard strings by default, with an option to easily use categorical arrays instead. That's the safest option for now.

nalimilan commented on July 27, 2024

Most relevant excerpt from the link in my previous post:

Why should we expect a factor to behave like a character vector? Why not expect it to behave like an integer vector? The reason is: we supplied a character vector and R’s default behavior in data.frame() was to convert it to a factor. R’s behavior only makes sense under the assumption there is some commonality of behavior between factors and character vectors. Otherwise R has made a surprising substitution and violated the principle of least astonishment. To press the point further: from an object oriented view (which is a common way to talk about the separation of concerns of interface and implementation) a valid substitution should at the very least follow some form of the Liskov substitution principle of a factor being a valid sub-type of character vector. But this is not possible between mutable versions of factor and character vector, so the substitution should not have been offered.

http://www.win-vector.com/blog/2014/09/factors-are-not-first-class-citizens-in-r/

quinnj commented on July 27, 2024

Plan:

Pick a threshold % for column type detection; if a column is under the threshold, parse it as a CategoricalArray, otherwise as a String array.
The % is computed as (# of unique string values) / rows_for_type_detect.
If categorical, the column will have an element type of CategoricalValue.
We might need a Data.categorical(::Type{Sink}) = true addition to the DataStreams interface to signal that a Sink supports streaming categorical columns (i.e. SQLite, CSV, & ODBC don't, since they just store as strings anyway, but DataFrames & Feather do natively).
Sinks will need to allocate specially so they store a CategoricalArray and not a Vector of CategoricalValues.
We will need a custom parsefield(io, ::Type{CategoricalValue}) method to parse a csv string cell as a CategoricalValue, potentially adding to the CategoricalPool if needed; we might also need special Data.streamto! methods for CategoricalArray if it receives a CategoricalValue w/ a larger pool than it was created with.
Some kind of keyword argument to CSV.read to allow columns to be categorical; not sure if it's just a categorical=true and we'll detect, or if we let users set the % threshold, or just simply allow people to pass in CategoricalValue as the column type and we'll just use that.
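
The threshold check in the first two steps could be sketched roughly as follows. This is only an illustration in plain Julia: the function name looks_categorical, the threshold keyword, and its default of 0.5 are all made up for the example and are not part of CSV.jl; the sample vector stands in for the rows_for_type_detect rows read during type detection.

```julia
# Hypothetical helper, not CSV.jl API: decide whether a string column looks
# categorical from the rows sampled during type detection.
# The 0.5 default threshold is an arbitrary assumption for illustration.
function looks_categorical(sample::AbstractVector{<:AbstractString};
                           threshold::Float64 = 0.5)
    isempty(sample) && return false
    # (# of unique string values) / (number of sampled rows)
    ratio = length(unique(sample)) / length(sample)
    return ratio < threshold
end

few_levels  = ["a", "b", "a", "b", "a", "a"]   # 2 unique values in 6 rows
mostly_uniq = ["id1", "id2", "id3", "id4"]     # 4 unique values in 4 rows

looks_categorical(few_levels)    # ratio is under the threshold
looks_categorical(mostly_uniq)   # ratio is above the threshold
```

A column passing this check would then be parsed as categorical; anything else would stay a String array.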

nalimilan commented on July 27, 2024

Makes sense. I'm torn about whether choosing the type automatically based on the % of unique values is a good idea or not. In principle it's a bit annoying to have the types depend on the contents of the file, but I guess it would work fine as long as the chosen threshold is high enough (so that you don't accidentally get a non-categorical array just because a given file contains more diverse data, e.g. because it's not sorted on that particular column). In practice the difference between categorical and non-categorical variables is likely quite clear: either only a handful of unique levels or mostly unique values, with very few cases in between.

Regarding the implementation, I only have a few comments:

Will need a custom parsefield(io, ::Type{CategoricalValue}) method to parse a csv string cell as a CategoricalValue, potentially adding to the CategoricalPool if needed; we might also need special Data.streamto! methods for CategoricalArray if it receives a CategoricalValue w/ a larger pool than it was created with

Note that setindex!(::CategoricalArray, ...) will automatically add new levels to the pool when needed, and throw an OverflowError if there are too many of them. You could just catch it and resize the pool in that case. It would be interesting to start with a smaller reftype than the default UInt32, depending on how many unique values were detected in the first rows, since typemax(UInt32) is an unusually large number of levels for a categorical variable. In most cases I expect UInt8 to be enough, though UInt16 would be less risky.
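
To make the reftype idea concrete, here is a minimal sketch in plain Julia of picking the smallest reference type from the number of unique values seen in the first rows. The name pick_reftype and the headroom factor are assumptions made up for this example; they are not part of CategoricalArrays or CSV.jl.

```julia
# Hypothetical helper: smallest unsigned reference type that can index
# `nlevels` distinct levels. `headroom` (an assumed safety factor) leaves
# room for levels that only show up after the sampled rows.
function pick_reftype(nlevels::Integer; headroom::Integer = 2)
    needed = nlevels * headroom
    for T in (UInt8, UInt16, UInt32)
        needed <= typemax(T) && return T
    end
    return UInt64
end

pick_reftype(10)        # a handful of levels fits in UInt8
pick_reftype(200)       # 200 levels with headroom needs UInt16
pick_reftype(100_000)   # very many levels fall back to UInt32
```

If the pool still overflows later, the error could be caught and the array rebuilt with the next wider type, as suggested above.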

some kind of keyword argument to CSV.read to allow columns to be categorical; not sure if it's just a categorical=true and we'll detect, or if we let them set the % threshold, or just simply allow people to pass in CategoricalValue as the column type and we'll just use that.

See the suggestions I made at JuliaData/DataFrames.jl#895 (comment). I don't think passing a custom threshold makes a lot of sense: if the default doesn't work for you, better specify what type you want explicitly, either for all string columns, or for a subset of columns.

nalimilan commented on July 27, 2024

JuliaData/CategoricalArrays.jl#77 makes CategoricalValue <: AbstractString to make it mostly transparent to the user that a CategoricalArray{String} is used instead of an Array{String}.
