
statsmodels.jl's Introduction

StatsModels


Basic functionality for specifying, fitting, and evaluating statistical models in Julia.

Documentation: Stable Latest

Originally part of DataFrames.jl.

statsmodels.jl's People

Contributors

alyst, andreasnoack, ararslan, asafmanela, dmbates, doobwa, ericqu, femtocleaner[bot], garborg, github-actions[bot], harlanh, johnmyleswhite, kleinschmidt, likanzhan, matthieugomez, mkborregaard, musvaage, nalimilan, nilshg, nosferican, oxinabox, palday, powerdistribution, quinnj, simonbyrne, simonster, spaette, tanmaykm, timema, tshort


statsmodels.jl's Issues

Julia 0.7-DEV version

I was wondering if anyone was working on getting a release of StatsModels for Julia 0.7-DEV. Getting a 0.7-DEV version should be pretty straightforward if we drop the statsmodels.jl DataFrame wrapper and delegations. That would also help toward eventually dropping the exclusive dependency on DataFrames and supporting multiple tabular data representations (e.g., feeding a stream of named tuples to the ModelFrame). One argument in favor of discontinuing 0.x.y development and focusing on 1.x.y is that no new features are being developed anyway: the planned re-write of the Terms representation and ModelBuilder pipeline will not happen in the near future, most likely only after 0.7 / 1.0 is released.

Setting contrasts changes the number of columns in a model matrix

I was experimenting with using HelmertCoding() to generate -1/+1 coding from a 2 by 2 by 2 factorial design. Things work as expected with the default DummyCoding() but give an incorrect answer with HelmertCoding(). This is on the master branch.

julia> tbl = (Y = randn(8), A = repeat(['N','Y'], outer=4), B=repeat(['-','+'], inner=2, outer=2), C=repeat(['L','H'], inner=4))
(Y = [0.0762428, -1.23052, 1.60522, -0.00539344, 0.396057, 1.86998, 0.393598, -0.622502], A = ['N', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y'], B = ['-', '-', '+', '+', '-', '-', '+', '+'], C = ['L', 'L', 'L', 'L', 'H', 'H', 'H', 'H'])

julia> DataFrame(tbl)
8×4 DataFrame
│ Row │ Y           │ A    │ B    │ C    │
│     │ Float64     │ Char │ Char │ Char │
├─────┼─────────────┼──────┼──────┼──────┤
│ 1   │ 0.0762428   │ 'N'  │ '-'  │ 'L'  │
│ 2   │ -1.23052    │ 'Y'  │ '-'  │ 'L'  │
│ 3   │ 1.60522     │ 'N'  │ '+'  │ 'L'  │
│ 4   │ -0.00539344 │ 'Y'  │ '+'  │ 'L'  │
│ 5   │ 0.396057    │ 'N'  │ '-'  │ 'H'  │
│ 6   │ 1.86998     │ 'Y'  │ '-'  │ 'H'  │
│ 7   │ 0.393598    │ 'N'  │ '+'  │ 'H'  │
│ 8   │ -0.622502   │ 'Y'  │ '+'  │ 'H'  │

julia> mf = ModelFrame(@formula(Y ~ 1 + A*B*C), tbl)
ModelFrame{NamedTuple{(:Y, :A, :B, :C),Tuple{Array{Float64,1},Array{Char,1},Array{Char,1},Array{Char,1}}},StatisticalModel}(Y ~ 1 + A + B + C + A & B + A & C + B & C + A & B & C, Dict{Any,Any}(A=>A,C=>C,B=>B,Y=>Y), (Y = [0.0762428, -1.23052, 1.60522, -0.00539344, 0.396057, 1.86998, 0.393598, -0.622502], A = ['N', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y'], B = ['-', '-', '+', '+', '-', '-', '+', '+'], C = ['L', 'L', 'L', 'L', 'H', 'H', 'H', 'H']), StatisticalModel)

julia> modelmatrix(mf)
8×8 Array{Float64,2}:
 1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  1.0  0.0  1.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0

julia> setcontrasts!(mf, Dict(:A=>HelmertCoding(),:B=>HelmertCoding(), :C=>HelmertCoding()))
ModelFrame{NamedTuple{(:Y, :A, :B, :C),Tuple{Array{Float64,1},Array{Char,1},Array{Char,1},Array{Char,1}}},StatisticalModel}(Y ~ 1 + A + B + C + A & B + A & C + B & C + A & B & C, Dict{Any,Any}(A=>A,C=>C,B=>B,Y=>Y), (Y = [0.0762428, -1.23052, 1.60522, -0.00539344, 0.396057, 1.86998, 0.393598, -0.622502], A = ['N', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y'], B = ['-', '-', '+', '+', '-', '-', '+', '+'], C = ['L', 'L', 'L', 'L', 'H', 'H', 'H', 'H']), StatisticalModel)

julia> modelmatrix(mf)
8×12 Array{Float64,2}:
 1.0  -1.0   1.0   1.0  0.0  0.0  1.0  0.0   1.0   0.0   1.0  -1.0
 1.0   1.0   1.0   1.0  0.0  0.0  0.0  1.0   0.0   1.0   1.0   1.0
 1.0  -1.0  -1.0   1.0  1.0  0.0  0.0  0.0   1.0   0.0  -1.0   1.0
 1.0   1.0  -1.0   1.0  0.0  1.0  0.0  0.0   0.0   1.0  -1.0  -1.0
 1.0  -1.0   1.0  -1.0  0.0  0.0  1.0  0.0  -1.0  -0.0  -1.0   1.0
 1.0   1.0   1.0  -1.0  0.0  0.0  0.0  1.0  -0.0  -1.0  -1.0  -1.0
 1.0  -1.0  -1.0  -1.0  1.0  0.0  0.0  0.0  -1.0  -0.0   1.0  -1.0
 1.0   1.0  -1.0  -1.0  0.0  1.0  0.0  0.0  -0.0  -1.0   1.0   1.0

update() method

I originally posted this here, but it might be a more general feature that applies to all models.

There doesn't currently seem to be an update() method similar to R's, which would allow one to easily re-fit a model changing only specific parameters (for instance, the data).

model = lm(@formula(y ~ x), data1)
update!(model, data2)

If such a method is not in the plans, how could I achieve refitting with a different dataset (but everything else the same) using the current tools? Thank you!
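Until something like update! exists, one workaround (a minimal sketch assuming GLM's lm; the data and variable names are illustrative) is to keep the formula around and call the fitting function again with the new table:

```julia
using GLM, DataFrames

# Illustrative data; only the dataset changes between the two fits.
data1 = DataFrame(y = randn(10), x = randn(10))
data2 = DataFrame(y = randn(10), x = randn(10))

f = @formula(y ~ x)
model1 = lm(f, data1)
model2 = lm(f, data2)   # same specification, different data
```

This re-runs the whole pipeline rather than reusing the fitted structure, so it is a stop-gap rather than a substitute for a real update! method.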

Missing allvars?

I was a fan of DataFrames.allvars for obtaining all variables from a Formula. I noticed it is missing from StatsModels. Was this intentional? I believe it to be fairly essential functionality that would be useful to have for our development and for other package developers. Was there a reason to get rid of it?
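For anyone needing a stop-gap, a small stand-in can be written by walking the formula's expression tree (a sketch, not the original DataFrames.allvars implementation; it skips operators and function names by dropping the head of each :call expression):

```julia
# Collect every data variable referenced in a formula expression.
function allvars(ex)
    vars = Symbol[]
    walk(x) = x isa Symbol ? push!(vars, x) :
              x isa Expr   ? foreach(walk, x.head == :call ? x.args[2:end] : x.args) :
              nothing
    walk(ex)
    unique(vars)
end

allvars(:(y ~ 1 + x + log(z)))   # → [:y, :x, :z]
```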

Make ModelFrame work with an iterable of named tuples

This started in JuliaData/DataFrames.jl#1018, but probably more appropriate to discuss this here.

Maybe a first step would be to just add more constructors to ModelFrame that for now materialize an iterator of named tuples into a DataFrame and then call the existing constructor? I'm sure there could be a more elegant implementation, but that might get us started?
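The first step suggested above can be sketched in a few lines (assuming a DataFrame can be constructed from a vector of named tuples):

```julia
using DataFrames

# Materialize an arbitrary iterator of named tuples into a DataFrame,
# which the existing ModelFrame constructor already accepts.
rows = ((y = i, x = 2i) for i in 1:3)
df = DataFrame(collect(rows))
# mf = ModelFrame(@formula(y ~ x), df)   # then delegate as usual
```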

Multipart Formula

Several arguments of a fit function typically refer to dataframe variables which are not regressors. Some examples: variables used to compute standard errors, weight variables, variables used to specify rows on which to estimate the model, high-dimensional fixed effects, mixed models, etc.

It would be nice to think about the best syntax to refer to these variables. I have thought about three potential syntaxes:

  1. Define a macro for each argument that needs to capture variable names
    fit(df, @formula(y ~ x1), @weight(x2), @vcov(cluster(x3+x4)), @where(x5 >= 0), maxiter = 100)
  2. Define a model macro that accepts multiple arguments, i.e. @model(expr, args...).
    fit(df, @model(y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0)), maxiter = 100)
  3. Define a fit macro that accepts multiple arguments (syntax closest to reg in Stata and to @with in DataFramesMeta.jl), i.e. @fit(expr1, expr2, args...)
    @fit df y ~ x1 weight = x2 vcov = cluster(x3+x4) where = (x5 >= 0) maxiter = 100
    # or (either would work)
    @fit(df, y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0), maxiter = 100)

An additional benefit is that agreeing on a syntax would help standardize the names of commonly used arguments like "weights", "vcov", and "where" across different packages that do statistical estimation. Enforcing these keyword arguments across different statistical estimations, as in Stata, could do a lot to improve the user experience.

@formula macro does not take interpolation any more

I found that @formula does not take interpolation any more. For example,

lhs = :dis
fm = @formula($lhs ~ 1)

generates the following error message:

ArgumentError: non-call expression encountered: $(Expr(:$, :lhs))
include_string(::String, ::String) at loading.jl:522
include_string(::String, ::String, ::Int64) at eval.jl:30
include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N} where N) at eval.jl:34
(::Atom.##102#107{String,Int64,String})() at eval.jl:82
withpath(::Atom.##102#107{String,Int64,String}, ::Void) at utils.jl:30
withpath(::Function, ::String) at eval.jl:38
hideprompt(::Atom.##101#106{String,Int64,String}) at repl.jl:67
macro expansion at eval.jl:80 [inlined]
(::Atom.##100#105{Dict{String,Any}})() at task.jl:80

Is it a bug or an intended change? If this behavior is intended, is there an alternative?

Standardized parameters

Follow up of this issue to discuss the possibility and methods to obtain standardized parameters from models, with a general solution at the StatModels level.

  • Several methods exist to compute standardized coefs:
    • Coefs transformation (usually including the outcome's SD)
    • Data standardization and model re-fitting

What would be the best way to obtain them similarly to method 2?

Related to #73

Feature Request: Poly

Hello, I'm reaching out to request a part of the modelling framework similar to R's poly. I think this would be very beneficial within linear modeling as well as in other places.

This is related to both #29 and #21, but I haven't seen any recent updates regarding this specific feature. Is anyone actively working on it? My knowledge of R's poly is not good, so I wouldn't be a great person to tackle the problem, but I'm interested in using it and I'm sure many others are too.

Use some character other than * for "main effects and interactions"

* is quite possibly the most confusing character for "main effects and interactions".

Because a*b expands to a + b + a&b.

Then, for continuous a and b, the & in the interaction term gets translated into elementwise multiplication, which is normally notated using *. This makes it really hard to explain to people (I know, I just tried), and a common mistake when wanting a&b is to write a*b.

An alternative character I suggest is a && b: it is like &, but does more of it.
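A minimal illustration of the confusion, using plain arrays rather than the formula machinery:

```julia
# a*b in a formula means a + b + a&b; the a&b column of the model
# matrix is the elementwise product, which Julia itself writes as .*
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
X = hcat(ones(3), a, b, a .* b)   # columns: intercept, a, b, a&b
```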

cc @nickrobinson251

Get rid of DataFrameStatisticalModel

The current approach of wrapping fitted StatisticalModel objects within DataFrameStatisticalModel/DataTableStatisticalModel objects has a major limitation: we can only @delegate functions that we know about, i.e. only those defined in StatsBase. This sounds problematic in the long term, where custom model types defined in packages may want to support new operations.

A solution to this would be to move from the current composition approach (wrapper) to inheritance. StatisticalModel object would be required to have some predefined fields, like coefnames, formula, modelmatrix and modelframe. The fit method from StatsModels would set these fields after fitting the model, so that they can be accessed later both by generic methods and by custom methods defined for a given model type.

This solution would for example allow implementing a generic Wald test function in StatsBase (JuliaStats/StatsBase.jl#300), which would use the coefficient names stored in the object. This is not possible with the current DataFrameStatisticalModel approach, where coefficient names can only be accessed by methods designed for DataFrameStatisticalModel, which therefore cannot live in StatsBase.

Vector functions vs elementwise function

In the current implementation, transformations (such as log) are applied elementwise. AFAIU, this allows StatsModels to work with any streaming interface, not just DataFrames. However, this has two drawbacks:

  • a lot of useful transformations, such as creating lagged variables (#86) or converting continuous variables to categorical variables (#93), cannot be done in the current implementation, because they require the entire vector.
  • the syntax is somewhat at odds with the rest of Julia, where one would typically write log.(x)

This may be fine, but I just wanted to have a discussion on whether it was the right path going forward. See also:
#75 (comment)
#71 (comment)

Allow custom types for continuous terms

In JuliaStats/Survival.jl#10, it was reported that Survival is broken with the most recent changes to StatsModels. This is apparently due to how continuous terms are handled, converting to Float64 right away and marking other types as categorical. Survival uses a custom type called EventTime which holds the time to an event as a number and metadata about the event as a boolean. Conceptually this is continuous, as the actual values in question are the numerical times.

In Slack, @kleinschmidt noted that a solution to this may be to relax how continuous terms are handled. See also the IdentityTerm idea in the linked issue. He also mentioned that another possible solution would be to define a special term for EventTime and handle it separately, but I think a more general solution is warranted here, as other packages may run into similar issues.

An alternative representation of Terms?

Thanks for all the work on this, @kleinschmidt .

Do you have a plan of how you want to rework the Terms type?

I have been thinking that many of the kludges like the response member being a Bool with the unstated assumption that it corresponds to the first of the eterms can be bypassed. The response member should be something like Formula.lhs and the terms member should be an array of types like Formula.rhs. If we retain a 1 as a term in terms then the intercept member can be replaced by an expression like 1 in terms. It would then only be necessary to sort the terms according to order and decide how to evaluate them in the ModelFrame where dispatching on the type would be helpful.
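That idea can be sketched as a hypothetical type (names illustrative, not an actual StatsModels API):

```julia
# The response is a term like Formula.lhs rather than a Bool flag, and
# the intercept is recovered from the terms themselves.
struct Terms2
    response::Union{Symbol, Expr, Nothing}  # like Formula.lhs
    terms::Vector{Any}                      # like Formula.rhs; may contain 1
end

hasintercept(t::Terms2) = 1 in t.terms

t = Terms2(:y, Any[1, :a, :b, :(a & b)])
hasintercept(t)   # → true
```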

These are just a few thoughts. If you have a better scheme in mind then we should go with that.

Are offsets supported?

I would like to fit a GLM model with poisson distribution response variables, and also use an offset. I can't tell from the documentation or the code if I can use a @formula to include an offset. Are offsets supported, and if so how can I use them?

Preserve original expression when parsing a formula in terms2.0?

As discussed in JuliaInterop/RCall.jl#306 I think it would be helpful to preserve a copy of the original expression passed to the @formula macro, either in the FormulaTerm type or perhaps in a Formula type that contains this expression and the FormulaTerm.

My reasons for suggesting this are twofold:

  1. I have users who want to have main effects, two-factor interactions, three-factor interactions, etc. and a formula like
y ~ 1 + A * B * C * D * ...

gets expanded to

y ~ 1 + A + B + C + D + ... + A & B + A & C + ...

when I include it in the show method for the fitted model.

  2. It is easier, in the RCall conversion from a Julia formula to an R formula, if you can just grab hold of the original expression rather than needing to reconstruct it from the lhs and the expanded rhs.

Disruptive Variable Names

using Feather
using DataFrames
using StatsModels
df = Feather.read(joinpath(Pkg.dir("Feather"), "test/newdata/iris.feather"),
    use_mmap = false)
names(df)

The names are: Vector{Symbol}(:Sepal.Length, :Sepal.Width, :Petal.Length, :Petal.Width, :Species)

Having variable names with . makes them Expr with ex.head == :(.) which prompts errors in:

  1. dospecials
  2. Terms

The following fix works:

function dospecials(ex::Expr)
    if ex.head == :(.) return Symbol(ex) end
    ...
end
function Terms(f::Formula)
    ...
    if haslhs
        unshift!(etrms, Any[Symbol(f.lhs)]) # Shouldn't typeof(eterms) == Vector{Symbol} ?
    end
end

Support panel and time aware sources

Virtually every statistical environment supports the notion of a dataset having repeated observations and potentially multiple units of observation (i.e., longitudinal data). Terms 2.0 introduced terms that operate between the data and the representation. Terms such as lag/lead, diff, and others require a notion of time and potentially a group-by approach. Software usually handles this with xtset panelvar timevar in Stata, pdata.frame(index = c(id, time)) in R's plm, the ID cross-section-id time-series-id statement in SAS's CPANEL procedure, etc. Any thoughts on how to best integrate that within StatsModels?

Implementing First-Difference

I am trying to make a transformation à la PolyTerm. However, I fail to get the modelcols right when the term is a CategoricalTerm. It seems to parse correctly, but the expansion only works à la ContinuousTerm.

using DataFrames, StatsBase, StatsModels
data = DataFrame(y = rand(10), x = rand(10), z = categorical(repeat(1:2, 5)))
formula = @formula(Δ(y) ~ Δ(x) + Δ(z))
Δ(obj::AbstractVector) = diff(obj)
Δ(obj::AbstractCategoricalVector) = obj[2:end]

struct FDTerm{T} <: AbstractTerm
    term::T
end

function StatsModels.apply_schema(t::FunctionTerm{typeof(Δ)}, sch)
    term = apply_schema(t.args_parsed[1], sch)
    FDTerm(term)
end

function StatsModels.modelcols(t::FDTerm, d::NamedTuple)
    modelcols(t.term, d)
end

sc = apply_schema(formula, schema(data))

modelcols(sc, data)

apply_schema(sc.rhs.terms[2], schema(data))

@formula broken in v0.2.4

julia> using StatsModels
julia> @formula(:z ~ 1)

I get the error

ERROR: ArgumentError: non-call expression encountered: :z

However, previously I would get
Formula: :z ~ 1

delegate macro fails to precompile in v0.7.0-dev

@andreasnoack I hate to bother you on this but git blame indicates that you were the last person to fix the delegate macro in src/statsmodel.jl. Right now precompilation of the DataFrames11 branch fails under v0.7.0-dev with

julia> using StatsModels  
INFO: Precompiling module StatsModels.              
ERROR: LoadError: LoadError: LoadError: type QuoteNode has no field args                                 
Stacktrace:               
 [1] @delegate(::LineNumberNode, ::Module, ::Any, ::Any) at /home/bates/.julia/v0.7/StatsModels/src/statsmodel.jl:20
 [2] include_relative(::Module, ::String) at ./loading.jl:509                                            
 [3] include at ./sysimg.jl:15 [inlined]            
 [4] include(::String) at /home/bates/.julia/v0.7/StatsModels/src/StatsModels.jl:3                       
 [5] collect_to!(::Array{Any,1}, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}, ::Int64, ::Int64) at ./array.jl:603
 [6] collect_to!(::Array{Function,1}, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}, ::Int64, ::Int64) at ./array.jl:613 (repeats 2 times)
 [7] collect_to_with_first!(::Array{typeof(StatsModels.contrasts_matrix),1}, ::Function, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}, ::Int64) at ./array.jl:590
 [8] _collect(::Array{String,1}, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}, ::Base.EltypeUnknown, ::Base.HasShape) at ./array.jl:584
 [9] collect_similar(::Array{String,1}, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}) at ./array.jl:520
 [10] map(::Function, ::Array{String,1}) at ./abstractarray.jl:1901                                      
 [11] top-level scope     
 [12] include_relative(::Module, ::String) at ./loading.jl:509                                           
 [13] include(::Module, ::String) at ./sysimg.jl:15 
 [14] top-level scope     
 [15] top-level scope at ./<missing>:2              
in expression starting at /home/bates/.julia/v0.7/StatsModels/src/statsmodel.jl:62                       
in expression starting at /home/bates/.julia/v0.7/StatsModels/src/statsmodel.jl:62                       
in expression starting at /home/bates/.julia/v0.7/StatsModels/src/StatsModels.jl:26                      
ERROR: Failed to precompile StatsModels to /home/bates/.julia/lib/v0.7/StatsModels.ji.                   
Stacktrace:               
 [1] compilecache(::String) at ./loading.jl:637     
 [2] _require(::Symbol) at ./loading.jl:448         
 [3] require(::Symbol) at ./loading.jl:300          

To me this is an indication that these exotic model abstract types could do with some reconsideration, but for the short term it would help if this could be repaired.

Documentation

The documentation for these features still needs to be ported from DataFrames.jl.

MethodError: no method matching degree(::Float64)

I tried to create a formula like so:
@formula(T ~ 1 + 0.5*x)

But got the following error:
Warning: Number 0.5 removed from interaction term 0.5 & x
ERROR: MethodError: no method matching degree(::Float64)
Closest candidates are:
degree(::Expr) at /u/home/s/shihuwen/.julia/v0.6/StatsModels/src/formula.jl:291
degree(::Symbol) at /u/home/s/shihuwen/.julia/v0.6/StatsModels/src/formula.jl:289
degree(::Integer) at /u/home/s/shihuwen/.julia/v0.6/StatsModels/src/formula.jl:288

This worked fine with version 0.0.2.

Potential bug in fit(Type, formula, data, contrasts = d, ...)

model = fit(LinearMixedModel,
            @formula(AUC ~ formulation + sequence + period + (1|subject)),
            data,
            contrasts = Dict(:period => EffectsCoding()))

yields
MethodError: no method matching fit(::Type{LinearMixedModel}, Matrix{Float64}, Vector{Int})
The long-hand,

model = LinearMixedModel(@formula(AUC ~ formulation + sequence + period + (1|subject)),
                         data,
                         contrasts = Dict(:period => EffectsCoding()))
fit!(model)

works just fine.
@dmbates, I suspect the issue is in the statsmodel.jl macro in StatsModels.jl, but I'm noting it here just in case there is something in MixedModels that might be triggering it.

Contrast Coding and 0.7-alpha

data = DataFrame(x = ["a", "b"], y = 1:2)
    categorical!(data)
    mf = ModelFrame(@formula(y ~ x),
                    data)
ERROR: ArgumentError: mismatching levels types: got String, expected Any based on contrasts levels.
Stacktrace:
 [1] StatsModels.ContrastsMatrix(::DummyCoding, ::Array{String,1}) at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/contrasts.jl:142
 [2] evalcontrasts(::DataFrame, ::Dict{Any,Any}) at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:126
 [3] #ModelFrame#17(::Dict{Any,Any}, ::Any, ::StatsModels.Terms, ::DataFrame) at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:148
 [4] Type at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:145 [inlined]
 [5] #ModelFrame#20 at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:157 [inlined]
 [6] ModelFrame(::Formula, ::DataFrame) at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:157
 [7] top-level scope

If one manually specifies the contrasts it goes through, but with multiple categorical variables on the rhs it produces the wrong matrix.

data = DataFrame(x = ["a", "b", "b", "c"], y = 1:4, z = ["b","b","c","a"])
    categorical!(data)
    mf = ModelFrame(@formula(y ~ x),
                    data,
                    contrasts = Dict(:x => DummyCoding("a", ["a","b","c"]),
                                     :z => DummyCoding("a", ["a","b","c"])))
ModelMatrix(mf).m
4×3 Array{Float64,2}:
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  0.0  1.0

Version info,

Julia v"0.7.0-alpha.0"
CategoricalArrays v0.3.9+
DataFrames v0.11.6
StatsModels v0.2.5

default intercept behavior

I'd like to change the default intercept behavior in formulas. Currently, we have this:

  • ~ 1 + x intercept column and x
  • ~ x intercept column and x
  • ~ 0 + x or ~ -1 + x or ~ x-1 no intercept column, only x.

Including the intercept by default might be sensible when all you're doing is regression. But it's not always the most appropriate thing (e.g. JuliaStats/Survival.jl#2), and making people use ~ 0+x when they don't want an intercept feels like an unnecessary gotcha, especially if we want formulas to be useful as general "glue" between tabular data and numerical arrays.

I think we've largely agreed that this is the desirable behavior:

  • ~ 1+x intercept and x
  • ~ x just x
  • ~ 0+x or ~ -1+x or ~ x-1 an (informative) error.

If that's the case, let's implement that before we release this as a replacement for the statsmodels code in DataFrames.

Feature Request: Alternative Representation to ModelMatrix

While the most common transformation for statistical models is a formula and data to a model matrix, there are certain cases which require or benefit from a different transformation. For example, for cluster robust variance covariance estimators it is necessary to have the cluster memberships. Another example includes fixed effects (absorption) which requires to have the groups memberships as well. I will illustrate the procedure with a simple example.

using DataFrames
using StatsModels
df = DataFrame(y = 1:9, A = repeat(["A", "B", "C"], inner = 3),
    B = repeat(["a","b","c"], outer = 3))
df[2,:B] = df[1,:B]
categorical!(df, [:A, :B])

We have a dataframe with categorical variables which will be used for clustering or absorption.

mf = ModelFrame(@formula(y ~ 0 + A * B), df)
mm = ModelMatrix(mf) # This expands into a sparse indicator matrix, which
    # is not efficient for these procedures and omits one
    # category per dimension.

Rather, what the transformation should look like is,

df[:AB] = df[:A] .* df[:B]
categorical!(df, :AB)
output = map(col -> map(elem -> find(col .== elem), unique(col)), df[[:A, :B, :AB]].columns)

which returns a Vector{Vector{Vector{Int64}}} indicating dimension[level[observations]].

This solution could be implemented with a function such that

output = thefunc(formula::StatsModels.Formula, df::DataFrames.AbstractDataFrame)
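The core of such a function can be sketched without any formula machinery (thefunc above and groupindices below are hypothetical names; this uses Julia 1.0-style findall rather than the 0.6 find shown earlier):

```julia
# Map a grouping column to a vector of index vectors, one per level:
# the dimension[level[observations]] layout described above.
groupindices(col) = [findall(==(lvl), col) for lvl in unique(col)]

A = repeat(["A", "B", "C"], inner = 3)
groupindices(A)   # → [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```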

Names of coefficients do not match those of columns

I am using Julia 0.6-rc2 with StatsModels.jl + DataTables.jl.

df = readtable("NLS.csv"); # Same as DataFrames.readtable(IO, DataTable)
formula = @formula(ln_wage ~ age + race) # StatsModels.Formula
df = df[:, StatsModels.allvars(formula)] # DataTables.DataTable
df = df[completecases(df), :]
df[:race] # NullableArrays.NullableArray{WeakRefStrings.WeakRefString{UInt8},1}
mf = StatsModels.ModelFrame(formula, df)
mm = getfield(StatsModels.ModelMatrix(mf), :m)
y = Array{Float64,1}(df[formula.lhs])
β = inv(cholfact(mm' * mm)) * mm' * y
Names = StatsModels.coefnames(mf)
output = DataTable(Feature = Names, β = β)

The names of the covariates for the nominal variable are not in the correct order to match those of the coefficient estimates.

One can visually confirm the bug by inspecting the design matrix mm.

Names[3:4]
mm[find(df[:race] .== Names[3])[1:5], 3:4]

Setting base level of categorical variables has no effect

It seems that setting the base level of categorical variables has no effect on the final ModelMatrix. I've tried setting the base level via the contrasts keyword to ModelFrame and via levels!. Both strategies do change the ConstrastsMatrix in the ModelFrame, so the problem must be the transformation of the ModelFrame into a ModelMatrix. I believe that lines 68-70 in modelmatrix.jl are the culprit, because they reorder the levels of and recode the categorical variable. Perhaps you want to add a line like reindex = contrast.levels[reindex] after line 68.

An alternative syntax for formulas?

What we have

Currently formulas are constructed using @~, which as far as I know is Julia's only infix macro. This allows us to mirror R's formulas, which take advantage of R's nonstandard evaluation, and avoid prefixing variable names with :.

However, this is a special case that Julia's parser makes, which, at least in my opinion, it shouldn't have to just so that Julia can look like R in one scenario. It would be nice if Julia could avoid having to parse anything as an infix macro. Plus, there surely is a more Julian approach!

A solution

A while back, @dmbates suggested on the mailing list that we replace our current formula syntax with Pairs. Based on my understanding, this would look like :y => :(1 + x1 + x2). @StefanKarpinski alternatively suggested the use of a @model macro, which would take an expression and construct a formula or model object, which could then be passed to fit or what have you. I imagine this looking something like

fit(LinearRegressionModel, @model y ~ 1 + x1 + x2, ...)

This approach allows any separator for the dependent and independent variables, not just ~, since it simply becomes the head in the resulting Expr. We could even use = for a more SAS-like aesthetic, or => as in Pairs.

A call to action

As we're transitioning the modeling functionality out of DataFrames.jl and into its own package (this one!), we have the opportunity to revisit some design decisions such as this. What do you all think? Should we leave things as-is or blaze a new trail?

Unified framework for dealing with Nulls?

Currently, lm seems to deal with NAs in DataFrames by ignoring the rows where they appear:

df = DataFrame(a = randn(10), b = randn(10));
df[:a][2] = NA
mod = lm(a~b, df)
mod.model.pp.X

This is great, but is there a general syntax for this behaviour? I'm asking, among other things, because I want to add a keyword for this behaviour to plotting with DataFrames in StatPlots.

In R, this is a complete mess: mean uses the keyword na.rm = T, lm uses na.action = "na.omit", and cor uses use = "complete.obs", which is just the sort of thing that guarantees half your coding time in R is spent reading man pages, even after 10 years of use.
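For reference, the explicit workaround available today on the Julia side is at least uniform at the data layer (a sketch using completecases):

```julia
using DataFrames

# Drop incomplete rows once, before fitting or plotting, so every
# consumer sees the same filtered data.
df = DataFrame(a = [1.0, missing, 3.0], b = [4.0, 5.0, 6.0])
complete = df[completecases(df), :]   # keeps rows 1 and 3
```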

Calling `apply_schema` twice should not be an error

MWE:

using StatsModels, DataFrames

f = @formula(y ~ 1 + a + b + c + b&c)
df = DataFrame(y = rand(9), a = 1:9, b = rand(9), c = repeat(["d","e","f"], 3))
f = apply_schema(f, schema(f, df))
f = apply_schema(f, schema(f, df))

The last line errors with

MethodError: no method matching concrete_term(::ContinuousTerm{Float64}, ::NamedTuple{(:y, :a, :b, :c),Tuple{Array{Float64,1},Array{Int64,1},Array{Float64,1},Array{String,1}}}, ::Dict{Symbol,Any})

Extend StatsBase.model_response methods with a Distribution argument

I have been thinking about conversion of a binary PooledDataArray or CategoricalVector to a 0/1 floating point vector in the StatsBase.model_response method. The trick is in determining what kind of response vector (or matrix or other structure) is desired. It occurred to me that the Distribution type in the model tells us that. For an lm or, in my case, lmm model the distribution type is Normal. For a glm or glmm model the distribution type is given in the model specification.

In creating a GLM.GlmResp object the response vector is checked to ensure that all values are in the support of the distribution so we might as well use that as the guide.
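Dispatching on the distribution could look roughly like this (a hypothetical sketch, not an actual StatsBase API; the function name response is illustrative):

```julia
using Distributions, CategoricalArrays

# Identity for a Normal response; 0/1 recoding for Bernoulli,
# coding the last level as 1.0.
response(v::AbstractVector{<:Real}, ::Type{<:Normal}) = float(v)
response(v::CategoricalVector, ::Type{<:Bernoulli}) =
    Float64.(v .== levels(v)[end])

response(categorical(["no", "yes", "yes"]), Bernoulli)   # → [0.0, 1.0, 1.0]
```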

If this seems reasonable I will create a pull request.

Is it worthwhile changing the name of this extractor at the same time? The name model_response was copied from R and is not Julianesque.

Need to de-unionize model_response

When the response (left-hand side) variable is a Vector{Union{Missing,T}}, the result of model_response is also union-typed. We need to change that method to "de-unionize" the result. For RHS variables, this is handled by modelmat_cols methods that explicitly convert to an AbstractFloatMatrix, but I'm not sure it's always appropriate to convert to a vector of floats, since there may be cases where a non-float (categorical) response actually makes sense.

Related: JuliaStats/GLM.jl#210

Tag a version

It would help me a lot with testing IterableTables if we could tag a new version here that reflects the use of DataTable instead of DataFrame. Just current master would be fine. I think if the version was 0.0.2, it would still indicate that there is no proper first release.

With no intercept, number of model matrix columns depend on type of dependent variable

I found this surprising bug (which, it turns out, I'm relying on in my tests for the Cox regression...). The number of columns in the model matrix in the categorical no-intercept case depends on the type of the dependent variable! I have no idea what is going on. To reproduce, you need Survival.

julia> using Survival, DataFrames, StatsModels

julia> df = DataFrame(week = 1:4, status = [false, false, true, true], cat = [1,2,3,4]);

julia> df[:event] = EventTime.(df[:week],df[:status]);

julia> categorical!(df, :cat);

julia> ModelMatrix(ModelFrame(@formula(week ~ 0 + cat), df)).m
4×4 Array{Float64,2}:
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0

julia> ModelMatrix(ModelFrame(@formula(event ~ 0 + cat), df)).m
4×3 Array{Float64,2}:
 0.0  0.0  0.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0

I'm on DataFrames 0.11 and StatsModels master.

`@reexport using StatsBase, DataFrames` decision

I would like to get some feedback on whether StatsModels should @reexport StatsBase and DataFrames.

  • DataFrames already does @reexport using CategoricalArrays, Missings
  • Since StatsModels can only be used with DataFrames, it makes sense for it to @reexport using DataFrames
In essence,

using StatsModels

rather than

using DataFrames, StatsModels

Likewise, StatsModels is used to build statistical models, which will almost always rely on (or should use) StatsBase.RegressionModel <: StatsBase.StatisticalModel.

A normal pipeline currently uses:

using DataFrames, StatsBase, StatsModels, StatisticalModule

rather than

using StatsModels, StatisticalModule

Specify certain variables as categorical

The same variable can be considered as continuous or as categorical when fitting different regressions. One example is a date variable, which can alternatively be used as a continuous variable (to adjust for time trend) or as a categorical variable (to adjust for any time specific effect).

Currently, this requires creating two versions of the variable in the dataframe, one continuous and the other categorical. To avoid this, it would be great to allow users to specify directly in the formula that certain variables should be treated as categorical.
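For reference, the current workaround looks roughly like this (the column names are made up for illustration):

```julia
using DataFrames, CategoricalArrays

df = DataFrame(y = randn(6), year = repeat(2001:2003, 2))

# today's workaround: keep a second, explicitly pooled copy of the variable
df.year_cat = categorical(df.year)

# @formula(y ~ year)     -> a single linear time-trend column
# @formula(y ~ year_cat) -> one indicator column per year (minus the reference)
```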

applying transformations on formula arguments

Not sure whether this belongs here or on GLM?
A normal use case is to apply exponents (and sometimes log transforms) directly in the model formula. E.g.

using RDatasets, GLM
lm(@formula(SR ~ Pop75 + Pop75^2), LifeCycleSavings) #the key here is the exponent

However, adding the exponent seems not to be supported. The current workaround appears to be to create a new column: LifeCycleSavings[:Pop75_2] = LifeCycleSavings[:Pop75] .^ 2.

There are use cases where this won't be ideal, because the model does not realize that it's the same input argument that has just been squared. Importantly, for predict you'd need to also do that transformation on the new DataFrame (newX). This in particular means that you cannot plot(newX, predict(mymodel, newX)) (for a single input variable), which is a particularly useful functionality IMHO.
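To make the predict problem concrete, here is a sketch of the current workaround, showing the duplicated bookkeeping (the column name Pop75_2 is just illustrative):

```julia
using DataFrames, GLM, RDatasets

df = dataset("datasets", "LifeCycleSavings")
df.Pop75_2 = df.Pop75 .^ 2                   # manual squared term

m = lm(@formula(SR ~ Pop75 + Pop75_2), df)

# predicting on new data requires repeating the transformation by hand;
# exactly the bookkeeping that in-formula transformations would remove
newX = DataFrame(Pop75 = range(extrema(df.Pop75)...; length = 100))
newX.Pop75_2 = newX.Pop75 .^ 2
yhat = predict(m, newX)
```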

Advanced Lead/Lag, via Index terms

Kinda related to #90, but rather than supporting special table types, this comes at it from the other direction: supporting/defining special columns.

#108 introduced a very basic Lag, which just takes the previous row. It doesn't know anything about time.

I propose a more general solution:

Index Terms

An IndexTerm represents some variable that we are going to base our Lag off of.

There would not be a direct element for an IndexTerm in a formula.
But it can be inferred during apply_schema,
either via hints, or as the default for DateTime and Period (or, from packages, TimeZones.jl's ZonedDateTime and Intervals.jl's AnchoredInterval; need to work out how default hints for package types would work).
It can also be created via the 3-argument form of Lag.
See below.

Lag: FunctionTerm

This comes in a 3-argument form, lag(main_term, step, index_term),
e.g. lag(x, Day(3), release_date),
and a 2-argument form, lag(main_term, step).
The 2-argument form works out the index_term when there is exactly one column that the schema + hints agree should be the index.
If there are multiple, it throws an ambiguity error.

apply_schema

During apply_schema, the LagTerm works out what its index_term is
and what its main_term is (which thus lets it work out its width etc.).

modelcol

During modelcols,
the LagTerm calls modelcols on its index_term::IndexTerm and its main_term.
It then runs sortperm on the resolved index column
and generates the new lagged main column by:

perm = sortperm(index_col)
# one output entry per row (in index-sorted order; invpermute to restore
# the original row order)
map(1:size(main_cols, 1)) do ii
    cur_index = index_col[perm[ii]]
    prior_index = cur_index
    jj = ii
    # walk backwards through the sorted rows until we reach one that is
    # at least lag.step earlier (or run out of rows)
    while jj > 1 && cur_index - prior_index < lag.step
        jj -= 1
        prior_index = index_col[perm[jj]]
    end
    if cur_index - prior_index ≈ lag.step
        return main_cols[perm[jj], :]
    else
        return missing
    end
end

It isn't exactly the fastest code, but it is robust against irregular spacing.
Basically, walk back through the rows as sorted by index_col and look for one that is lag.step ago.
Alternatives include detecting whether the data is regularly spaced (with a spacing that divides our lag.step evenly) and just jumping directly to the answer,
or making an index Dict (downside: doesn't work with approximate comparison, so trouble with Floats, though small float integers tend to be fine).
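The Dict alternative mentioned above could be sketched like this (the function name is hypothetical; it only works for index types with exact equality, such as integers or Dates):

```julia
# Dict-based alternative: O(n) lookups instead of the backwards walk, but
# it relies on exact equality of index values, so no approximate matching
function lag_by_index(main_col::AbstractVector, index_col::AbstractVector, step)
    pos = Dict(idx => i for (i, idx) in pairs(index_col))
    return [begin
        j = get(pos, idx - step, nothing)
        j === nothing ? missing : main_col[j]
    end for idx in index_col]
end

lag_by_index([10, 20, 30], [1, 2, 4], 1)  # -> [missing, 10, missing]
```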

Two Approaches to Enhanced Formulae Language

Currently the @formula macro generates the Formula struct, which has fields lhs and rhs. For categorical variables (PooledDataArray), ModelFrame can use a Dict{Symbol, ContrastsMatrix} to prepare the object for ModelMatrix to generate the desired Matrix and keep track of relevant data (e.g., coefnames).

First Approach

The feature request is to allow for additional operators other than the currently implemented +, *, &. For categorical variables, codings are enabled through

mf = ModelFrame(@formula(y ~ x), df, contrasts = Dict(:x => EffectsCoding()))

and the following codings are available

  • DummyCoding (default)
  • EffectsCoding
  • HelmertCoding
  • ContrastsCoding

A welcome addition would be to allow codings for numerical variables (DataArrays.DataArray). For example,

  • IdentityCoding (default)
  • Log
  • Poly(d::Integer)

fm = DataFrames.@formula(y ~ x1 + x2)
mf = DataFrames.ModelFrame(fm, df, contrasts = Dict(:x1 => LogCoding(), :x2 => PolyCoding(2)))

Second Approach

Inline specification could work as:

  • Stata's reg y i.x for dummy coding
  • R's lm(y ~ log(x), df) for common transformations

and implement these as we currently handle +, * and & for interactions:

@formula(y ~ log(x1) + poly(x2, 2) + i.x3 + EffectsCoding(x4))

see StreamModels.

Additional thought:

  • The polynomial transformation can be pretty useful, especially when building models where keeping track of derived features is essential, for example for accurately computing the marginal effects of a logistic regression. Other functions that might use the information are multicollinearity diagnostics or lasso / elastic-net estimators in their group versions.
  • While the second approach seems more convenient, the first one tends to be more robust. See for example #27, where having more information on the coding for each variable is preferred (e.g., the reference level). A mixture could give a best-of-both solution by giving Formula or ModelFrame additional fields for holding this information.

Allow easier creation of formulas using lists of variable names

I have a workflow in Stata that works well for me that I would like to port to Julia. In general, I want to produce pristinely formatted tables in LaTeX, which requires a decent amount of code. Additionally, we usually have one main specification (fixed effects and clustering) that we care about, so all we really need to explore in our regressions is various independent and dependent variable combinations. To do this in Stata I write a function that accepts a list of y variables and a list of x variables.

foreach y of local yvars {       // `yvars' holds the list of dependent variables
    local regressors `y' `x'     // the equivalent of creating a formula
    regress `regressors'
    ... saving all the results and making a table in latex
}

The key thing here is that it is very easy to change the collection of x variables in your model.

Using the @formula macro in Julia, this is difficult. The lhs of a model is relatively easy, since you can interpolate with $y, but I cannot figure out how to take a vector of symbols and coerce everything into a model that can be regressed.

I wrote a macro that solves this problem a bit.

df = DataFrame(rand(10, 10))

# builds the expression :(x3 + x4 + ...) from a literal vector of symbols
macro g(x)
    rhs = Expr(:call)
    push!(rhs.args, :+)
    for i in x.args
        push!(rhs.args, i)
    end
    Meta.quot(rhs)  # return the expression itself, not its evaluated value
end
m1 = @formula(x1 ~ x3 + x4)
m2 = copy(m1)
m2.rhs = @g [x3, x4] # looks the same
lm(m1, df) # works
lm(m2, df) # works

However, I understand that such direct manipulation of the formula is very fragile. For instance, this macro doesn't allow the & operator, and you may be considering re-doing the whole way formulas are parsed.

I am looking for guidance on a few ways forward.

  • To what extent could the simple macro I defined above be extended to cover more use cases?
  • How do you envision config files to work with StatsModels?
    • Say I want to define a list of dependent variables at the top of my script and then perform many operations on them that don't involve StatsModels (summary statistics etc.). How should I define these lists of variables in a dataset at the top of my script so that I can later make a formula with this existing list?
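For what it's worth, StatsModels versions after the terms rewrite let you build formulas at runtime without macros at all; a sketch assuming a recent StatsModels:

```julia
using StatsModels

yvar = :x1
xvars = [:x3, :x4]

# term() turns a symbol into a Term at runtime, and terms can be summed,
# so the whole right-hand side can come from an ordinary Vector{Symbol}
f = term(yvar) ~ sum(term.(xvars))
# equivalent to @formula(x1 ~ x3 + x4); usable as lm(f, df)
```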

Avoid hard-coding for concrete implementations

I'm wondering to what extent it would be practical to avoid hard-coding for specific implementations in this package, in particular for NullableArrays and DataTables, since I'd like to use the functionality here with DataFrames and DataArrays. Commenting out specific uses of NullableArrays and removing the uses of DataTable signatures (could be some kind of AbstractTable in the future) made some of the basic functionality work. Some definitions like https://github.com/JuliaStats/StatsModels.jl/blob/master/src/modelframe.jl#L108 seem misplaced here in any case; I believe they should be defined in their home packages. @nalimilan, why do you define the underscore versions of these functions? Other uses might be harder to get rid of, in particular the use of the Categorical types. I wonder if we could define an abstract categorical type of which PooledDataArray could also be a subtype.

A very different approach could be to use https://github.com/davidagold/Relations.jl/ as the unifying framework here. I think it is very interesting and might be a way to go beyond in-memory data frame representations. However, there is probably a lot to sort out before we could consider such a change, whereas the changes discussed above are pretty simple.

Bug with categorical arrays when distinct values differ from levels

Using the latest release of CategoricalArrays, ModelMatrix fails if

obj::CategoricalArrays.AbstractCategoricalVector
length(unique(obj)) ≠ length(DataFrames.levels(obj))

I believe the issue is in modelmat_cols.
A possible fix for now is

# rebuild the vector from its values so that unused levels are dropped
function rebase(obj::CategoricalArrays.AbstractCategoricalVector{T}) where T
    if length(unique(obj)) ≠ length(levels(obj))
        output = CategoricalArrays.CategoricalVector{T}(convert.(eltype(levels(obj)), obj))
    else
        output = obj
    end
    return output
end
obj = rebase(obj)

and that allows ModelMatrix to work properly.
