Basic functionality for specifying, fitting, and evaluating statistical models in Julia.
Originally part of DataFrames.jl.
The `collect_matrix_terms` docstring refers to the nonexistent `extract_matrix_term` function (line 255 in b29558f). BTW, maybe `collect_matrix_terms` could just be called `matrix_terms`/`matrixterms`?
I was wondering if anyone was working on getting a release of StatsModels for Julia 0.7-DEV. Getting a 0.7-DEV version should be pretty straightforward if we drop the statsmodels.jl DataFrame wrapper and delegations. That should also help towards eventually dropping the exclusive dependency on DataFrames and supporting multiple tabular data representations (e.g., getting a stream of named tuples for the ModelFrame). One argument in favor of discontinuing future development for 0.x.y versions and focusing on 1.x.y is that no new features are being developed, as part of the plan to rewrite the Terms representation and ModelBuilder pipeline, which will not happen in the near future and most likely after 0.7/1.0 is released.
I was experimenting with using `HelmertCoding()` to generate -1/+1 coding from a 2 by 2 by 2 factorial design. Things work as expected with the default `DummyCoding()` but give an incorrect answer with `HelmertCoding()`. This is on the master branch.
julia> tbl = (Y = randn(8), A = repeat(['N','Y'], outer=4), B=repeat(['-','+'], inner=2, outer=2), C=repeat(['L','H'], inner=4))
(Y = [0.0762428, -1.23052, 1.60522, -0.00539344, 0.396057, 1.86998, 0.393598, -0.622502], A = ['N', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y'], B = ['-', '-', '+', '+', '-', '-', '+', '+'], C = ['L', 'L', 'L', 'L', 'H', 'H', 'H', 'H'])
julia> DataFrame(tbl)
8×4 DataFrame
│ Row │ Y │ A │ B │ C │
│ │ Float64 │ Char │ Char │ Char │
├─────┼─────────────┼──────┼──────┼──────┤
│ 1 │ 0.0762428 │ 'N' │ '-' │ 'L' │
│ 2 │ -1.23052 │ 'Y' │ '-' │ 'L' │
│ 3 │ 1.60522 │ 'N' │ '+' │ 'L' │
│ 4 │ -0.00539344 │ 'Y' │ '+' │ 'L' │
│ 5 │ 0.396057 │ 'N' │ '-' │ 'H' │
│ 6 │ 1.86998 │ 'Y' │ '-' │ 'H' │
│ 7 │ 0.393598 │ 'N' │ '+' │ 'H' │
│ 8 │ -0.622502 │ 'Y' │ '+' │ 'H' │
julia> mf = ModelFrame(@formula(Y ~ 1 + A*B*C), tbl)
ModelFrame{NamedTuple{(:Y, :A, :B, :C),Tuple{Array{Float64,1},Array{Char,1},Array{Char,1},Array{Char,1}}},StatisticalModel}(Y ~ 1 + A + B + C + A & B + A & C + B & C + A & B & C, Dict{Any,Any}(A=>A,C=>C,B=>B,Y=>Y), (Y = [0.0762428, -1.23052, 1.60522, -0.00539344, 0.396057, 1.86998, 0.393598, -0.622502], A = ['N', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y'], B = ['-', '-', '+', '+', '-', '-', '+', '+'], C = ['L', 'L', 'L', 'L', 'H', 'H', 'H', 'H']), StatisticalModel)
julia> modelmatrix(mf)
8×8 Array{Float64,2}:
1.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
julia> setcontrasts!(mf, Dict(:A=>HelmertCoding(),:B=>HelmertCoding(), :C=>HelmertCoding()))
ModelFrame{NamedTuple{(:Y, :A, :B, :C),Tuple{Array{Float64,1},Array{Char,1},Array{Char,1},Array{Char,1}}},StatisticalModel}(Y ~ 1 + A + B + C + A & B + A & C + B & C + A & B & C, Dict{Any,Any}(A=>A,C=>C,B=>B,Y=>Y), (Y = [0.0762428, -1.23052, 1.60522, -0.00539344, 0.396057, 1.86998, 0.393598, -0.622502], A = ['N', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y'], B = ['-', '-', '+', '+', '-', '-', '+', '+'], C = ['L', 'L', 'L', 'L', 'H', 'H', 'H', 'H']), StatisticalModel)
julia> modelmatrix(mf)
8×12 Array{Float64,2}:
1.0 -1.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 -1.0
1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 1.0
1.0 -1.0 -1.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 -1.0 1.0
1.0 1.0 -1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 -1.0 -1.0
1.0 -1.0 1.0 -1.0 0.0 0.0 1.0 0.0 -1.0 -0.0 -1.0 1.0
1.0 1.0 1.0 -1.0 0.0 0.0 0.0 1.0 -0.0 -1.0 -1.0 -1.0
1.0 -1.0 -1.0 -1.0 1.0 0.0 0.0 0.0 -1.0 -0.0 1.0 -1.0
1.0 1.0 -1.0 -1.0 0.0 1.0 0.0 0.0 -0.0 -1.0 1.0 1.0
I originally posted this here, but it might be a more general feature that applies to all models.
There doesn't currently seem to be an `update()` method similar to R's that would make it very easy to re-fit a model while changing only specific parameters (for instance, the data).
model = lm(@formula(y ~ x), data1)
update!(model, data2)
If such a method is not in the plans, how could I achieve a refit with a different dataset (but everything else the same) using the current tools? Thank you!
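For what it's worth, one workaround today is to keep the formula around and re-fit it yourself. A minimal sketch, assuming GLM and a current StatsModels where formulas are `FormulaTerm`s; the `refit` helper name is my own invention, not an existing API:

```julia
using GLM, DataFrames, StatsModels

# Hypothetical helper (not part of StatsModels/GLM): re-fit the same
# formula against new data, forwarding any fitting keyword arguments.
refit(f::FormulaTerm, data; kwargs...) = lm(f, data; kwargs...)

f = @formula(y ~ x)
data1 = DataFrame(y = randn(10), x = randn(10))
data2 = DataFrame(y = randn(10), x = randn(10))
model1 = lm(f, data1)
model2 = refit(f, data2)  # same specification, different data
```

This only covers "same formula, new data"; an `update!` that mutates a fitted model in place would need support from each modeling package.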
I was a fan of `DataFrames.allvars` for obtaining all variables from a `Formula`. I noticed it was missing from StatsModels. Was this intentional? I believe it to be pretty essential functionality that would be useful to have for our development and for other package developers. Was there a reason to get rid of it?
This started in JuliaData/DataFrames.jl#1018, but probably more appropriate to discuss this here.
Maybe a first step would be to just add more constructors to `ModelFrame` that for now materialize an iterator of named tuples into a `DataFrame` and then call the existing constructor? I'm sure there could be a more elegant implementation, but that might get us started?
Several arguments of a `fit` function typically refer to dataframe variables which are not regressors. Some examples: variables used to compute standard errors, weight variables, variables used to specify rows on which to estimate the model, high-dimensional fixed effects, mixed models, etc.
It would be nice to think about the best syntax to refer to these variables. I have thought about three potential syntaxes:
1. Separate macros for each argument:
fit(df, @formula(y ~ x1), @weight(x2), @vcov(cluster(x3+x4)), @where(x5 >= 0), maxiter = 100)
2. A single `@model(expr, args...)` macro:
fit(df, @model(y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0)), maxiter = 100)
3. An all-in-one macro (like `@with` in DataFramesMeta.jl), i.e. `@fit(expr1, expr2, args...)`:
@fit df y ~ x1 weight = x2 vcov = cluster(x3+x4) where = (x5 >= 0) maxiter = 100
# or (either would work)
@fit(df, y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0), maxiter = 100)
An additional benefit is that agreeing on a syntax would help to standardize the names of commonly used arguments like "weights", "vcov", and "where" across different packages that do statistical estimation. Enforcing these keyword arguments across different statistical estimation packages, like in Stata, could do a lot to improve the user experience.
I found that `@formula` no longer accepts interpolation. For example,
lhs = :dis
fm = @formula($lhs ~ 1)
generates the following error message:
ArgumentError: non-call expression encountered:
include_string(::String, ::String) at loading.jl:522
include_string(::String, ::String, ::Int64) at eval.jl:30
include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N} where N) at eval.jl:34
(::Atom.##102#107{String,Int64,String})() at eval.jl:82
withpath(::Atom.##102#107{String,Int64,String}, ::Void) at utils.jl:30
withpath(::Function, ::String) at eval.jl:38
hideprompt(::Atom.##101#106{String,Int64,String}) at repl.jl:67
macro expansion at eval.jl:80 [inlined]
(::Atom.##100#105{Dict{String,Any}})() at task.jl:80
Is it a bug or an intended change? If this behavior is intended, is there an alternative?
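If it helps, in the terms-2.0 API a formula can be built programmatically without the macro, which sidesteps the need for interpolation; a sketch, assuming a current StatsModels exporting `term`:

```julia
using StatsModels

lhs = :dis
# `term` lifts a symbol (or a constant like 1) into a term, and `~`
# combines two terms into a formula, so no interpolation is needed:
f = term(lhs) ~ term(1)
```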
Follow-up of this issue, to discuss the possibility of and methods for obtaining standardized parameters from models, with a general solution at the StatsModels level.
What would be the best way to obtain them similarly to method 2?
Related to #73
Hello, I'm reaching out to request a part of the modelling framework that is similar to R's `poly`. I think this would be very beneficial within linear modeling as well as other places. This is related to both #29 and #21, but I haven't seen any recent updates regarding this specific feature. Are there people actively working on it? My knowledge of R's `poly` is not good, so I wouldn't be a great person to tackle the problem, but I'm interested in using it and I'm sure there are many others.
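A custom-term sketch in the spirit of the (raw, non-orthogonal) polynomial example that the StatsModels docs use for extension; the `poly`/`PolyTerm` names and the `args_parsed` field are assumptions about the terms-2.0 API, and unlike R's `poly` this produces plain powers, not orthogonal polynomials:

```julia
using StatsModels

poly(x, n) = x^n  # runtime placeholder so `poly` exists as a function

struct PolyTerm{T,D} <: AbstractTerm
    term::T
    deg::D
end

# Replace the syntax-level FunctionTerm with our custom term.
function StatsModels.apply_schema(t::FunctionTerm{typeof(poly)},
                                  sch::StatsModels.Schema,
                                  Mod::Type)
    term = apply_schema(t.args_parsed[1], sch, Mod)
    deg = t.args_parsed[2].n  # the literal degree, stored in a ConstantTerm
    PolyTerm(term, deg)
end

# Generate columns x, x^2, ..., x^deg (raw powers).
StatsModels.modelcols(p::PolyTerm, d::NamedTuple) =
    reduce(hcat, [modelcols(p.term, d) .^ i for i in 1:p.deg])
```

With something like this, `@formula(y ~ 1 + poly(x, 2))` would expand into the raw powers of `x`.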
`*` is quite possibly the most confusing character choice for "main effects and interactions", because `a*b` becomes `a + b + a&b`. Then, for continuous `a` and `b`, the `&` in the interaction term gets translated into elementwise multiplication, which is normally notated using `*`. This makes it really hard to explain to people (I know, I just tried), and it seems like a common mistake when wanting `a&b` would be to write `a*b`.
An alternative character option I suggest is `a && b`: it is like `&`, but does more of it.
I just clicked the docs badge in the readme (5/9/2019) and it links to this page https://juliastats.github.io/StatsModels.jl/latest which gives a 404 page missing error. Where are the hosted docs supposed to be?
The current approach of wrapping fitted `StatisticalModel` objects within `DataFrameStatisticalModel`/`DataTableStatisticalModel` objects has a major limitation: we can only `@delegate` functions that we know about, i.e. only those defined in StatsBase. This sounds problematic in the long term, where custom model types defined in packages may want to support new operations.
A solution to this would be to move from the current composition approach (wrapper) to inheritance. `StatisticalModel` objects would be required to have some predefined fields, like `coefnames`, `formula`, `modelmatrix`, and `modelframe`. The `fit` method from StatsModels would set these fields after fitting the model, so that they can be accessed later both by generic methods and by custom methods defined for a given model type.
This solution would, for example, allow implementing a generic Wald test function in StatsBase (JuliaStats/StatsBase.jl#300), which would use the coefficient names stored in the object. This is not possible with the current `DataFrameStatisticalModel` approach, where coefficient names can only be accessed by methods designed for `DataFrameStatisticalModel`, which therefore cannot live in StatsBase.
In the current implementation, transformations (such as `log`) are applied elementwise, i.e. as `log.(x)`. AFAIU, this allows StatsModels to work with any streaming interface, not just `DataFrame`. However, this has two drawbacks:
This may be fine, but I just wanted to have a discussion on whether it was the right path going forward. See also:
#75 (comment)
#71 (comment)
In JuliaStats/Survival.jl#10, it was reported that Survival is broken with the most recent changes to StatsModels. This is apparently due to how continuous terms are handled: converting to `Float64` right away and marking other types as categorical. Survival uses a custom type called `EventTime`, which holds the time to an event as a number and metadata about the event as a boolean. Conceptually this is continuous, as the actual values in question are the numerical times.
In Slack, @kleinschmidt noted that a solution to this may be to relax how continuous terms are handled. See also the `IdentityTerm` idea in the linked issue. He also mentioned that another possible solution would be to define a special term for `EventTime` and handle it separately, but I think a more general solution is warranted here, as other packages may run into similar issues.
It would be very useful to write "Y ~ . " to mean "regress on all inputs".
Here is a proposed solution:
https://stackoverflow.com/questions/44222763/specify-only-target-variable-in-glm-jl
But it does not work in StatsModels v0.5 and Julia 1.1.
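Until `.` is supported in the macro, one workaround is to build the formula programmatically; a sketch assuming the terms-2.0 `term` function and a recent DataFrames where `propertynames` returns the column symbols:

```julia
using StatsModels, DataFrames

df = DataFrame(Y = randn(5), a = randn(5), b = randn(5), c = randn(5))

# "Y ~ ." by hand: sum the terms for every column except the response.
rhs = sum(term.(setdiff(propertynames(df), [:Y])))
f = term(:Y) ~ rhs  # equivalent to @formula(Y ~ a + b + c)
```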
Thanks for all the work on this, @kleinschmidt .
Do you have a plan for how you want to rework the `Terms` type?
I have been thinking that many of the kludges, like the `response` member being a `Bool` with the unstated assumption that it corresponds to the first of the `eterms`, can be bypassed. The response member should be something like `Formula.lhs`, and the `terms` member should be an array of types like `Formula.rhs`. If we retain a `1` as a term in `terms`, then the `intercept` member can be replaced by an expression like `1 in terms`. It would then only be necessary to sort the terms according to order and decide how to evaluate them in the ModelFrame, where dispatching on the type would be helpful.
These are just a few thoughts. If you have a better scheme in mind then we should go with that.
I would like to fit a GLM model with Poisson-distributed response variables, and also use an offset. I can't tell from the documentation or the code whether I can use a `@formula` to include an offset. Are offsets supported, and if so, how can I use them?
As discussed in JuliaInterop/RCall.jl#306, I think it would be helpful to preserve a copy of the original expression passed to the `@formula` macro, either in the `FormulaTerm` type or perhaps in a `Formula` type that contains this expression and the `FormulaTerm`.
My reasons for suggesting this are twofold:
1. `y ~ 1 + A * B * C * D * ...` gets expanded to `y ~ 1 + A + B + C + D + ... + A & B + A & C + ...` when I include it in the `show` method for the fitted model.
2. `RCall` conversion from a Julia formula to an R formula is easier if you can just grab hold of the original expression, rather than needing to reconstruct it from the `lhs` and the expanded `rhs`.
using Feather
using DataFrames
using StatsModels
df = Feather.read(joinpath(Pkg.dir("Feather"), "test/newdata/iris.feather"),
use_mmap = false)
names(df)
The names are a `Vector{Symbol}`: `:Sepal.Length`, `:Sepal.Width`, `:Petal.Length`, `:Petal.Width`, `:Species`.
Having variable names containing `.` makes them `Expr`s with `ex.head == :(.)`, which prompts errors in `dospecials` and `Terms`.
The following fix works:
function dospecials(ex::Expr)
if ex.head == :(.) return Symbol(ex) end
...
end
function Terms(f::Formula)
...
if haslhs
unshift!(etrms, Any[Symbol(f.lhs)]) # Shouldn't typeof(eterms) == Vector{Symbol} ?
end
Alternatively, it should always return a tuple of vectors for both sides; it should not return a tuple of vectors for the LHS and a matrix for the right-hand side.
Virtually every statistical environment supports the notion of a dataset having repeated observations and potentially multiple units of observation (i.e., longitudinal data). Terms 2.0 introduced terms that operate between the data and the representation. Terms such as `lag`/`lead`, `diff`, and others require the notion of time and potentially a group-by approach. Software usually handles this with `xtset panelvar timevar` in Stata, `pdata.frame(index = c(id, time))` in R's plm, `ID cross-section-id <time-series-id>` in SAS's CPANEL procedure, etc. Any thoughts on how to best integrate that within StatsModels?
I am trying to make a transformation à la `PolyTerm`. However, I fail to get the `modelcols` correct when the term is a `CategoricalTerm`. It seems to parse correctly, but the expansion only works à la `ContinuousTerm`.
using DataFrames, StatsBase, StatsModels
data = DataFrame(y = rand(10), x = rand(10), z = categorical(repeat(1:2, 5)))
formula = @formula(Δ(y) ~ Δ(x) + Δ(z))
Δ(obj::AbstractVector) = diff(obj)
Δ(obj::AbstractCategoricalVector) = obj[2:end]
struct FDTerm{T} <: AbstractTerm
term::T
end
function StatsModels.apply_schema(t::FunctionTerm{typeof(Δ)}, sch)
term = apply_schema(t.args_parsed[1], sch)
FDTerm(term)
end
function StatsModels.modelcols(t::FDTerm, d::NamedTuple)
modelcols(t.term, d)
end
sc = apply_schema(formula, schema(data))
modelcols(sc, data)
apply_schema(sc.rhs.terms[2], schema(data))
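As written, the `modelcols` method in the MWE never actually applies `Δ`, it just passes the wrapped term's columns through. A sketch of one possible fix, mirroring the two `Δ` methods (dropping a row for the categorical case, differencing for the continuous case); this is my guess at the intended semantics, not a confirmed fix:

```julia
# Sketch: actually transform the columns produced by the wrapped term,
# instead of passing them through unchanged.
function StatsModels.modelcols(t::FDTerm, d::NamedTuple)
    X = modelcols(t.term, d)
    if X isa AbstractMatrix
        X[2:end, :]  # contrast columns: drop first row, like Δ(::AbstractCategoricalVector)
    else
        diff(X)      # continuous column: first differences, like Δ(::AbstractVector)
    end
end
```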
julia> using StatsModels
julia> @formula(:z ~ 1)
I get the error
ERROR: ArgumentError: non-call expression encountered: :z
However, previously I would get
Formula: :z ~ 1
@andreasnoack I hate to bother you on this, but `git blame` indicates that you were the last person to fix the `delegate` macro in `src/statsmodel.jl`. Right now precompilation of the `DataFrames11` branch fails under v0.7.0-dev with
julia> using StatsModels
INFO: Precompiling module StatsModels.
ERROR: LoadError: LoadError: LoadError: type QuoteNode has no field args
Stacktrace:
[1] @delegate(::LineNumberNode, ::Module, ::Any, ::Any) at /home/bates/.julia/v0.7/StatsModels/src/statsmodel.jl:20
[2] include_relative(::Module, ::String) at ./loading.jl:509
[3] include at ./sysimg.jl:15 [inlined]
[4] include(::String) at /home/bates/.julia/v0.7/StatsModels/src/StatsModels.jl:3
[5] collect_to!(::Array{Any,1}, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}, ::Int64, ::Int64) at ./array.jl:603
[6] collect_to!(::Array{Function,1}, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}, ::Int64, ::Int64) at ./array.jl:613 (repeats 2 times)
[7] collect_to_with_first!(::Array{typeof(StatsModels.contrasts_matrix),1}, ::Function, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}, ::Int64) at ./array.jl:590
[8] _collect(::Array{String,1}, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}, ::Base.EltypeUnknown, ::Base.HasShape) at ./array.jl:584
[9] collect_similar(::Array{String,1}, ::Base.Generator{Array{String,1},typeof(StatsModels.include)}) at ./array.jl:520
[10] map(::Function, ::Array{String,1}) at ./abstractarray.jl:1901
[11] top-level scope
[12] include_relative(::Module, ::String) at ./loading.jl:509
[13] include(::Module, ::String) at ./sysimg.jl:15
[14] top-level scope
[15] top-level scope at ./<missing>:2
in expression starting at /home/bates/.julia/v0.7/StatsModels/src/statsmodel.jl:62
in expression starting at /home/bates/.julia/v0.7/StatsModels/src/statsmodel.jl:62
in expression starting at /home/bates/.julia/v0.7/StatsModels/src/StatsModels.jl:26
ERROR: Failed to precompile StatsModels to /home/bates/.julia/lib/v0.7/StatsModels.ji.
Stacktrace:
[1] compilecache(::String) at ./loading.jl:637
[2] _require(::Symbol) at ./loading.jl:448
[3] require(::Symbol) at ./loading.jl:300
To me this is an indication that these exotic model abstract types might do with some reconsideration but, for the short term, it would help if that could be repaired.
The documentation for these features still needs to be ported from DataFrames.jl.
I tried to create a formula like so:
@formula(T ~ 1 + 0.5*x)
But got the following error:
Warning: Number 0.5 removed from interaction term 0.5 & x
ERROR: MethodError: no method matching degree(::Float64)
Closest candidates are:
degree(::Expr) at /u/home/s/shihuwen/.julia/v0.6/StatsModels/src/formula.jl:291
degree(::Symbol) at /u/home/s/shihuwen/.julia/v0.6/StatsModels/src/formula.jl:289
degree(::Integer) at /u/home/s/shihuwen/.julia/v0.6/StatsModels/src/formula.jl:288
This worked fine with version 0.0.2.
model = fit(LinearMixedModel,
@formula(AUC ~ formulation + sequence + period + (1|subject)),
data,
contrasts = Dict(:period => EffectsCoding()))
yields
MethodError: no method matching fit(::Type{LinearMixedModel}, Matrix{Float64}, Vector{Int})
The long-hand,
model = LinearMixedModel(@formula(AUC ~ formulation + sequence + period + (1|subject)),
data,
contrasts = Dict(:period => EffectsCoding()))
fit!(model)
works just fine.
@dmbates, I suspect the issue is in the StatsModels.jl `statsmodel.jl` macro, but I'm raising it here just in case there is something in MixedModels that might be triggering it.
data = DataFrame(x = ["a", "b"], y = 1:2)
categorical!(data)
mf = ModelFrame(@formula(y ~ x),
data)
ERROR: ArgumentError: mismatching levels types: got String, expected Any based on contrasts levels.
Stacktrace:
[1] StatsModels.ContrastsMatrix(::DummyCoding, ::Array{String,1}) at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/contrasts.jl:142
[2] evalcontrasts(::DataFrame, ::Dict{Any,Any}) at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:126
[3] #ModelFrame#17(::Dict{Any,Any}, ::Any, ::StatsModels.Terms, ::DataFrame) at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:148
[4] Type at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:145 [inlined]
[5] #ModelFrame#20 at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:157 [inlined]
[6] ModelFrame(::Formula, ::DataFrame) at /Users/jbsc/.julia/packages/StatsModels/lbKp/src/modelframe.jl:157
[7] top-level scope
If one manually specifies the contrasts, it goes through, but with multiple categorical variables on the RHS it produces the wrong matrix.
data = DataFrame(x = ["a", "b", "b", "c"], y = 1:4, z = ["b","b","c","a"])
categorical!(data)
mf = ModelFrame(@formula(y ~ x),
data,
contrasts = Dict(:x => DummyCoding("a", ["a","b","c"]),
:z => DummyCoding("a", ["a","b","c"])))
ModelMatrix(mf).m
4×3 Array{Float64,2}:
1.0 0.0 0.0
1.0 1.0 0.0
1.0 1.0 0.0
1.0 0.0 1.0
Version info,
Julia v"0.7.0-alpha.0"
CategoricalArrays v0.3.9+
DataFrames v0.11.6
StatsModels v0.2.5
I'd like to change the default intercept behavior in formulas. Currently, we have this:
~ 1 + x: intercept column and x
~ x: intercept column and x
~ 0 + x (or ~ -1 + x, or ~ x - 1): no intercept column, only x.
Including the intercept by default might be sensible when all you're doing is regression. But it's not always the most appropriate thing (e.g. JuliaStats/Survival.jl#2), and making people use ~ 0 + x when they don't want an intercept feels like an unnecessary gotcha, especially if we want formulas to be useful as a general "glue" between tabular data and numerical arrays.
I think we've largely agreed that this is the desirable behavior:
~ 1 + x: intercept and x
~ x: just x
~ 0 + x (or ~ -1 + x, or ~ x - 1): an (informative) error.
If that's the case, let's implement that before we release this as a replacement for the statsmodels code in DataFrames.
While the most common transformation for statistical models is from a formula and data to a model matrix, there are certain cases which require or benefit from a different transformation. For example, cluster-robust variance-covariance estimators need the cluster memberships. Another example is fixed effects (absorption), which requires the group memberships as well. I will illustrate the procedure with a simple example.
using DataFrames
using StatsModels
df = DataFrame(y = 1:9, A = repeat(["A", "B", "C"], inner = 3),
B = repeat(["a","b","c"], outer = 3))
df[2,:B] = df[1,:B]
categorical!(df, [:A, :B])
We have a dataframe with categorical variables which will be used for clustering or absorption.
mf = ModelFrame(@formula(y ~ 0 + A * B), df)
mm = ModelMatrix(mf) # This expands to an indicator matrix, which is
# not efficient for these procedures, and drops one
# category per dimension (the reference level).
Rather, what the transformation should look like is,
df[:AB] = df[:A] .* df[:B]
categorical!(df, :AB)
output = map(col -> map(elem -> find(col .== elem), unique(col)), df[[:A, :B, :AB]].columns)
which returns a Vector{Vector{Vector{Int64}}}
indicating dimension[level[observations]]
.
This solution could be implemented with a function such that
output = thefunc(formula::StatsModels.Formula, df::DataFrames.AbstractDataFrame)
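The group-membership part can be sketched with Base Julia only (the `groupindices` name is mine, standing in for `thefunc` applied to one column): for each distinct level of a column, collect the row indices where it occurs.

```julia
# For one column, collect the row indices of each distinct level,
# in order of first appearance.
function groupindices(col::AbstractVector)
    seen = Dict{eltype(col),Vector{Int}}()
    order = Vector{Vector{Int}}()
    for (i, v) in enumerate(col)
        if haskey(seen, v)
            push!(seen[v], i)
        else
            idxs = [i]
            seen[v] = idxs
            push!(order, idxs)
        end
    end
    order
end

groupindices(["a", "b", "a", "c"])  # [[1, 3], [2], [4]]
```

Mapping this over the columns named by the formula gives the `dimension[level[observations]]` structure described above.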
I am using Julia 0.6-rc2 with StatsModels.jl + DataTables.jl.
df = readtable("NLS.csv"); # Same as DataFrames.readtable(IO, DataTable)
formula = @formula(ln_wage ~ age + race) # StatsModels.Formula
df = df[:, StatsModels.allvars(formula)] # DataTables.DataTable
df = df[completecases(df), :]
df[:race] # NullableArrays.NullableArray{WeakRefStrings.WeakRefString{UInt8},1}
mf = StatsModels.ModelFrame(formula, df)
mm = getfield(StatsModels.ModelMatrix(mf), :m)
y = Array{Float64,1}(df[formula.lhs])
β = inv(cholfact(mm' * mm)) * mm' * y
Names = StatsModels.coefnames(mf)
output = DataTable(Feature = Names, β = β)
The names of the covariates for the nominal variable are not in the correct order to match those of the coefficient estimates. One can visually confirm the bug by inspecting the design matrix `mm`.
Names[3:4]
mm[find(df[:race] .== Names[3])[1:5], 3:4]
It seems that setting the base level of categorical variables has no effect on the final `ModelMatrix`. I've tried setting the base level via the `contrasts` keyword to `ModelFrame` and via `levels!`. Both strategies do change the `ContrastsMatrix` in the `ModelFrame`, so the problem must be in the transformation of the `ModelFrame` into a `ModelMatrix`. I believe that lines 68-70 in `modelmatrix.jl` are the culprit, because they reorder the levels of the categorical variable and recode it. Perhaps you want to add a line like `reindex = contrast.levels[reindex]` after line 68.
Currently formulas are constructed using `@~`, which as far as I know is Julia's only infix macro. This allows us to mirror R's formulas, which take advantage of R's nonstandard evaluation, and avoid prefixing variable names with `:`.
However, this is a special case that Julia makes in parsing that it—at least in my opinion—shouldn't have to do to allow Julia to look like R in one scenario. It would be nice if Julia could avoid having to parse anything as an infix macro. Plus, there surely is a more Julian approach!
A while back, @dmbates suggested on the mailing list that we replace our current formula syntax with `Pair`s. Based on my understanding, this would look like `:y => :(1 + x1 + x2)`. @StefanKarpinski alternatively suggested the use of a `@model` macro, which would take an expression and construct a formula or model object, which could then be passed to `fit` or what have you. I imagine this looking something like
fit(LinearRegressionModel, @model(y ~ 1 + x1 + x2), ...)
This approach allows any separator for the dependent and independent variables, not just `~`, since it simply becomes the `head` in the resulting `Expr`. We could even use `=` for a more SAS-like aesthetic, or `=>` as in `Pair`s.
As we're transitioning the modeling functionality out of DataFrames.jl and into its own package (this one!), we have the opportunity to revisit some design decisions such as this. What do you all think? Should we leave things as-is or blaze a new trail?
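A toy sketch of the `@model` idea: just capture the expression unevaluated and leave the interpretation (building a formula or model object) for later. The names here are illustrative, not a proposed API:

```julia
# A toy @model macro: capture the model expression as an Expr,
# deferring any interpretation of ~ (or =, or =>) to the caller.
macro model(ex)
    QuoteNode(ex)
end

m = @model y ~ 1 + x1 + x2
# `m` is now the raw Expr; in the parsed tree, ~ is an ordinary call,
# so the separator is recoverable from the expression itself.
```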
Currently, `lm` seems to deal with NAs in DataFrames by ignoring the rows where they appear:
df = DataFrame(a = randn(10), b = randn(10));
df[:a][2] = NA
mod = lm(a~b, df)
mod.model.pp.X
This is great, but is there a general syntax for this behaviour? I'm asking, among other things, because I want to add a keyword for this behaviour to plotting with DataFrames in StatPlots.
In R, this is a complete mess: `mean` uses the keyword `na.rm = T`, `lm` uses `na.action = "na.omit"`, and `cor` uses `use = "complete.obs"`, which is just the sort of thing that guarantees that half your coding time in R is spent reading man pages even after 10 years of use.
MWE:
using StatsModels, DataFrames
f = @formula(y ~ 1 + a + b + c + b&c)
df = DataFrame(y = rand(9), a = 1:9, b = rand(9), c = repeat(["d","e","f"], 3))
f = apply_schema(f, schema(f, df))
f = apply_schema(f, schema(f, df))
The last line errors with
MethodError: no method matching concrete_term(::ContinuousTerm{Float64}, ::NamedTuple{(:y, :a, :b, :c),Tuple{Array{Float64,1},Array{Int64,1},Array{Float64,1},Array{String,1}}}, ::Dict{Symbol,Any})
I have been thinking about conversion of a binary `PooledDataArray` or `CategoricalVector` to a 0/1 floating-point vector in the `StatsBase.model_response` method. The trick is in determining what kind of response vector (or matrix or other structure) is desired. It occurred to me that the distribution type in the model tells us that. For an `lm` or, in my case, `lmm` model the distribution type is Normal. For a `glm` or `glmm` model the distribution type is given in the model specification.
In creating a `GLM.GlmResp` object, the response vector is checked to ensure that all values are in the support of the distribution, so we might as well use that as the guide.
If this seems reasonable I will create a pull request.
Is it worthwhile changing the name of this extractor at the same time? The name `model_response` was copied from R and is not Julianesque.
When the response (left-hand-side) variable is a `Vector{Union{Missing,T}}`, the result of `model_response` is also a union type. We need to change that method to "de-unionize" the result. For RHS variables, this is handled by `modelmat_cols` methods that explicitly convert to an `AbstractFloatMatrix`, but I'm not sure it's always appropriate to convert to a vector of floats, since there may be cases where having a non-float (categorical) response actually makes sense.
Related: JuliaStats/GLM.jl#210
It would help me a lot with testing IterableTables if we could tag a new version here that reflects the use of `DataTable` instead of `DataFrame`. Just current `master` would be fine. I think if the version was `0.0.2`, it would still indicate that there is no proper first release.
I found this surprising bug (which, it turns out, I'm relying on in my tests for the Cox regression...). The number of columns in the model matrix in the categorical, no-intercept case depends on the type of the dependent variable! I have no idea what is going on. To reproduce, you need Survival.
julia> using Survival, DataFrames, StatsModels
julia> df = DataFrame(week = 1:4, status = [false, false, true, true], cat = [1,2,3,4]);
julia> df[:event] = EventTime.(df[:week],df[:status]);
julia> categorical!(df, :cat);
julia> ModelMatrix(ModelFrame(@formula(week ~ 0 + cat), df)).m
4×4 Array{Float64,2}:
1.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0
0.0 0.0 1.0 0.0
0.0 0.0 0.0 1.0
julia> ModelMatrix(ModelFrame(@formula(event ~ 0 + cat), df)).m
4×3 Array{Float64,2}:
0.0 0.0 0.0
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
I'm on DataFrames 0.11 and StatsModels master.
I would like to get some feedback on whether StatsModels should `@reexport` StatsBase and DataFrames. DataFrames already does `@reexport using CategoricalArrays, Missings`. Since StatsModels can only be used with DataFrames, it makes sense for it to `@reexport using DataFrames`. In essence,
using StatsModels
rather than
using DataFrames, StatsModels
Likewise, StatsModels is used to build statistical models, which will 99% of the time rely on (or should use) `StatsBase.RegressionModel <: StatsBase.StatisticalModel`. A normal pipeline currently uses:
using DataFrames, StatsBase, StatsModels, StatisticalModule
rather than
using StatsModels, StatisticalModule
The same variable can be considered as continuous or as categorical when fitting different regressions. One example is a date variable, which can alternatively be used as a continuous variable (to adjust for time trend) or as a categorical variable (to adjust for any time specific effect).
Currently, this requires creating two versions of the variable in the dataframe, one continuous and the other categorical. To avoid this situation, it would be great to allow users to specify directly in the formula that certain variables should be treated as categorical.
Not sure whether this belongs here or on GLM?
A normal use case is to provide exponents (and sometimes log-transforms) directly on model objects. E.g.
using RDatasets, GLM
lm(@formula(SR ~ Pop75 + Pop75^2), LifeCycleSavings) #the key here is the exponent
However, adding the exponent seems not to be supported. The current workaround appears to be to create a new column: `LifeCycleSavings[:Pop75_2] = LifeCycleSavings[:Pop75] .^ 2`.
There are use cases where this won't be ideal, because the model does not realize that it's the same input variable that has just been squared. Importantly, for `predict` you'd also need to apply that transformation to the new DataFrame (`newX`). This in particular means that you cannot `plot(newX, predict(mymodel, newX))` (for a single input variable), which is a particularly useful functionality IMHO.
Kinda related to #90, but not about supporting special table types,
but coming at it from the other direction; supporting/defining special columns.
#108
introduced a very basic Lag, which is just based on getting the previous row.
Doesn't know about time.
I propose a more general solution:
An IndexTerm
represents some variable that we are going to use to base out Lag off.
There would not be a direct element for an IndexTerm
in a formula.
But it can be inferred during apply schema,
either via hints, or as the default for DateTime
, Period
(or from packages: TimeZones.jl's ZonedDateTime
, and Intervals.jl;s AnchoredInterval
. Need to workout how default hints for package types works.).
And can also be created via the 3 argument form of Lag.
See below.
lag comes in a 3-argument form, lag(main_term, step, index_term), e.g. lag(x, Day(3), release_date), and a 2-argument form, lag(main_term, step). The 2-argument form works out the index_term on its own when there is exactly one column that the schema + hints agree should be the index; if there are multiple candidates it throws an ambiguity error.
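That resolution rule for the 2-argument form could be sketched like this (resolve_index is a hypothetical helper name, not an existing API):

```julia
# Hypothetical sketch: given the columns that the schema + hints agree
# are index-like, the 2-argument lag picks the unique one or errors.
function resolve_index(index_candidates::Vector{Symbol})
    isempty(index_candidates) &&
        error("no index column found for 2-argument lag")
    length(index_candidates) > 1 &&
        error("ambiguous index columns: ", join(index_candidates, ", "))
    return first(index_candidates)
end
```

So resolve_index([:release_date]) returns :release_date, while resolve_index([:date, :release_date]) throws the ambiguity error.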
apply_schema
During apply_schema, a LagTerm works out what its index_term is and what its main_term is (which thus lets it work out its width etc.).
modelcols
During modelcols, the LagTerm calls modelcols on its index_term::IndexTerm and its main_term. It then runs sortperm on the resolved index column, and generates the new lagged main column by:
Then it generates the new lagged main colum by:
perm = sortperm(index_col)
map(1:size(main_cols, 2)) do ii
cur_index = index_col[perm[ii]]
prior_index = cur_index
jj=ii
while jj > 1 && cur_index - prior_index < lag.step
prior_index = index_col[perm[jj]]
jj-=1
end
if cur_index - prior_index ≈ lag.step
return main_cols[perm[jj], :]
else
return missing
endvalue
end
It isn't exactly the fastest code, but it is robust against irregular spacing. Basically, walk back through the rows as sorted by index_col and look for one that is lag.step ago. Alternatives include detecting whether the data is regularly spaced (by something that has a suitable lowest common multiple with our lag.step) and jumping directly to the answer.
Or making an index Dict (downside: doesn't work with approximate comparison, so there is trouble with floats, though small float integers tend to be fine).
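That Dict alternative might look like this pure-Julia sketch (function name invented; it uses exact-equality lookup, so it only suits exactly-representable index values):

```julia
# Exact-lookup lag: map each index value to its row, then fetch the row
# whose index is exactly `step` earlier, or missing if there is none.
function lag_by_dict(index_col, main_col, step)
    position = Dict(idx => i for (i, idx) in enumerate(index_col))
    map(index_col) do idx
        j = get(position, idx - step, nothing)
        j === nothing ? missing : main_col[j]
    end
end
```

For example, lag_by_dict([1, 2, 3, 5], [10, 20, 30, 50], 1) yields [missing, 10, 20, missing]: the rows at index 1 and 5 have no row exactly one step earlier.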
Currently the @formula macro generates the Formula struct, which has the fields lhs and rhs. For categorical variables (PooledDataArray), ModelFrame can use a Dict{Symbol, ContrastsMatrix} to prepare the object for ModelMatrix to generate the desired Matrix and keep track of relevant data (e.g., coefnames).
The feature request is to allow for additional operators other than the currently implemented +, *, and &. For categorical variables, codings are enabled through
mf = ModelFrame(@formula(y ~ x), df, contrasts = Dict(:x => EffectsCoding()))
and the following codings are available
A welcome addition would be to allow codings for numerical variables (DataArrays.DataArray). For example,
fm = DataFrames.@formula(y ~ x1 + x2)
mf = DataFrames.ModelFrame(fm, df, contrasts = Dict(:x1 => LogCoding(), :x2 => PolyCoding(2)))
Inline specification could work as in other software:
reg y i.x for dummy coding (Stata)
lm(y ~ log(x), df) for common transformations (R)
+, *, and & for interactions
A combined syntax might look like @formula(y ~ log(x1) + poly(x2, 2) + i.x3 + EffectsCoding(x4)); see StreamModels.
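As a sketch of what a numeric coding could compute, a hypothetical PolyCoding(d) might simply expand a column into a raw polynomial basis (the names here are invented for illustration):

```julia
# Expand a numeric column x into the model columns [x, x^2, ..., x^d]
# (a raw, un-orthogonalized polynomial basis).
poly_columns(x::AbstractVector{<:Real}, d::Integer) =
    reduce(hcat, [x .^ k for k in 1:d])
```

So poly_columns([1.0, 2.0, 3.0], 2) produces a 3×2 matrix whose columns are x and x.^2, which is the shape of model-matrix block a PolyCoding(2) would contribute.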
Additional thought: give Formula or ModelFrame additional fields for holding this information.

I have a workflow in Stata that works well for me and that I would like to port to Julia. In general, I want to produce pristine formatted tables in LaTeX that require a decent amount of code to create. Additionally, we usually have one main specification (fixed effects and clustering) that we care about, so all we really need to explore in our regressions is various independent and dependent variable combinations. To do this in Stata I write a function that accepts a list of y variables and a list of x variables.
foreach y
    local regressors `y' `x' // the equivalent of creating a formula
    regress `regressors'
    ... saving all the results and making a table in LaTeX
end
The key thing here is that it is very easy to change the collection of x variables in your model. Using the @formula macro in Julia, this is difficult. The lhs of a model is relatively easy, since you can interpolate with $y, but I cannot figure out how to take a vector of symbols and coerce everything into a model that can be regressed.
I wrote a macro that solves this problem a bit.
df = DataFrame(rand(10,10))

macro g(x)
    rhs = Expr(:call)
    push!(rhs.args, :+)
    for i in x.args
        push!(rhs.args, i)
    end
    Meta.quot(rhs)
end
m1 = @formula(x1 ~ x3 + x4)
m2 = copy(m1)
m2.rhs = @g [x3, x4] # looks the same
lm(m1, df) # works
lm(m2, df) # works
However, I understand that such direct interpolation is very fragile. For instance, this macro doesn't allow the & operator, and you all may be considering re-doing the whole way formulas are parsed. I am looking for guidance on a few ways forward.
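One way to avoid the macro entirely is to build the rhs expression at runtime from the vector of symbols, doing at run time what @g does at compile time (build_rhs is an illustrative name; later term-based StatsModels APIs may make this kind of Expr manipulation unnecessary):

```julia
# Build the expression :(x3 + x4 + ...) from a runtime vector of symbols,
# mirroring what the @g macro above produces.
build_rhs(xs::Vector{Symbol}) = Expr(:call, :+, xs...)
```

For example, build_rhs([:x3, :x4]) == :(x3 + x4), so the result can be assigned to a formula's rhs field just like the macro's output, and the symbol vector can come from ordinary runtime data.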
I'm wondering to what extent it would be practical to avoid hard-coding for specific implementations in this package, in particular for NullableArrays and DataTables, since I'd like to use the functionality here with DataFrames and DataArrays. Commenting out specific uses of NullableArrays and removing the uses of DataTable signatures (could be some kind of AbstractTable in the future) made some of the basic functionality work. Some definitions like https://github.com/JuliaStats/StatsModels.jl/blob/master/src/modelframe.jl#L108 seem misplaced here in any case; I believe they should be defined in their home packages. @nalimilan, why do you define the underscore versions of these functions? Other uses might be harder to get rid of, in particular the use of the Categorical types. I wonder if we could define an abstract categorical type that PooledDataArrays could also subtype.
A very different approach would be to use https://github.com/davidagold/Relations.jl/ as the unifying framework here. I think it is very interesting and might be a way to go beyond in-memory data frame representations. However, there is probably a lot to sort out before we could consider such a change, whereas the changes discussed above are pretty simple.
Is there anything outstanding for a release with Terms 2.0?
Using the latest release of CategoricalArrays, ModelMatrix fails if obj::CategoricalArrays.AbstractCategoricalVector has length(unique(obj)) ≠ length(DataFrames.levels(obj)). I believe the issue is in modelmat_cols. A possible fix for now is
function rebase(obj::CategoricalArrays.AbstractCategoricalVector{T}) where T
    if length(unique(obj)) ≠ length(levels(obj))
        # re-pool so that only the levels actually present remain
        output = CategoricalArrays.CategoricalVector{T}(convert.(eltype(levels(obj)), obj))
    else
        output = obj
    end
    return output
end
obj = rebase(obj)
and that allows ModelMatrix to work properly.
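The core of that fix, recomputing the level set from the values actually present, can be sketched in plain Julia (present_levels is an illustrative name; CategoricalArrays' droplevels! may be the idiomatic route where available):

```julia
# Keep only the levels that actually occur, preserving their original order.
present_levels(values, levels) = [l for l in levels if l in values]
```

For example, present_levels(['a', 'a', 'c'], ['a', 'b', 'c']) returns ['a', 'c'], matching what the rebased vector's levels should be.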