juliastats / glm.jl Goto Github PK

Generalized linear models in Julia

License: Other

Julia 100.00%

julia glm regression statistics statistical-models data-science

glm.jl's Introduction

Linear and generalized linear models in Julia

Documentation	CI Status	Coverage	DOI

glm.jl's People

Contributors

Stargazers

Watchers

Forkers

doobwa dioptre tshort lendle ttriche simonab jimhester simonster jgkunk tlamadon kirillshaman guria65 vunguyene floswald kleinschmidt timema setzler hua-zhou cheuerde joemliang gragusa zhmz90 tkelman arksutw nalimilan panlanfeng yuyichao abhijithch pkofod keven-cheng anhecon mwsohn maximsch2 mkborregaard grero vinayprakashsingh chicagostats alstat lewishein cth jeffwong pbastide benjaminborn djsegal cyrussafaie bkamins spencerx relationalai-oss sudachen jacobadenbaum congwen benluteijn gzabuelgasim galenlynch ianbutterworth fwang88 tlienart cmcaine stjordanis nosferican jason-xuan 1991sig kshedden pdeffebach jmboehm nignatiadis tbeason dsweber2 iljak91 birusod kafisatz ablaom tangwang-ustc matthew-amos humepac zeta1999 roundnose mcoyrp am1codes ondrejslamecka standardgalactic dwinkler1 rcurtin samusz ym-han playfloor kronosthelate fintzij patrikdurdevic piever ecustwy aerdely rurutan512 xkdr harsharora21 cscherrer sindresops lordbudapest bioturbonick aidantr

glm.jl's Issues

Use coeftable methods in show methods

Shall we keep the coeftable methods and use them in the show methods? The reason that I wrote them is because there are often requests on R mailing lists dealing with extracting standard errors or p-values that are displayed by print.lm. The usual recommendation in R is to use

coef(summary(fm))

We should consider whether we want to go through the intermediate stage of summary in the GLM package. If so, we can fold coeftable into the summary methods.

On a related issue, there is an R function printCoefmat for printing the coefficient table. This isolates the code for printing small p-values. There is a reticence to displaying small p-values as 0.000 and I feel it is justified.

lm doesn't work for complex numbers

Shouldn't it? This is useful in signal processing. Right now I am just solving my complex regression problems with pinv(X'X)*X'*Y, but it doesn't produce the nice output of lm and it is not scalable to problems with tons of predictors.

RFC - when should a model fit be triggered?

Issue #19 relates to the use of the glmfit function. At present the constructor for GlmMod always calls glmfit to fit the model. In the MixedModels package I took a different approach in which the type representing the model does not call the model-fitting function - it just creates the representation of the model which includes a boolean indicator of whether the fit has been performed. Any of the extractor functions, coef, deviance, etc. check first if the fit has been performed and call fit if needed.

Separating fitting the model from constructing the representation has the advantage that the representation can be reused, say in a simulation, without going through its construction again. The fit methods can take other parameters such as one for verbose output.

Does this seem reasonable? Are more details needed?

Question: Does this package have support for generalized least squares

I need to do generalized least squares and would like to use this package.

Is there an existing way to specify the variance of the residual, so as to do weighted least squares?

Error in QR method

with newest Julia and GLM:

using DataFrames,GLM
df = DataFrame(A = [0.1,0.12,0.31,0.2,0.1], B =[0.21, 0.3, 0.1, 0.2,0.2])
ab=lm((A~B),df)
stderr(ab)

no method copytri!(Float64, Char)
while loading In[22], in expression starting on line 2
in symmetrize! at deprecated.jl:21
in vcov at /home/gaofeng/.julia/GLM/src/linpred.jl:65
in stderr at /home/gaofeng/.julia/GLM/src/linpred.jl:67

there is no error with "ab=lmc((A~B),df);stderr(ab)"

Inject dependencies?

Because the GLM package so strongly depends upon the DataFrames and Distribution packages, it is not really usable without importing the exports from both DataFrames and Distributions.

It would be nice if the following example worked out of the box:

using GLM
dobson = DataFrame({[18.,17,15,20,10,20,25,13,12], gl(3,1,9), gl(3,3)},
                          ["counts","outcome","treatment"])
fm3 = glm(:(counts ~ outcome + treatment), dobson, Poisson())

To do this, we would have to force execution of using DataFrames, Distributions when someone executes using GLM. The RDatasets package is doing this sort of thing already, but we need to establish some community rules about what's acceptable and what's bad form. Input from @StefanKarpinski, @ViralBShah and @JeffBezanson would be helpful here.

Comparing with R for logistic regression

For logistic regression in R I always use family="binomial" but in GLM.jl its Bernoulli() - a slight distinction, but is there any reason to change it?

Meaning of "deviance"

"Deviance" in the context of a GLM is typically defined in terms of difference between the log likelihood of the model and the log likelihood of a full model with a parameter for each observation (see Wikipedia), but the value GLM.jl reports is just -2*log likelihood of model. This makes no difference if the likelihood of the full model is 1 but isn't always the case. This explains the difference between the deviance of the Poisson example from the README in R and GLM.jl.

Errors running code in documentation

Hello, i'm getting some errors running code from the documentation... i'm running:

using GLM, RDatasets
form = dataset("datasets","Formaldehyde")
lm1 = fit(LmMod, OptDen ~ Carb, form)

i'm getting the error:

Error OptDen not defined

i'm running the stable release of julia.

Any help would be appreciated

Jason

Fitting glms on matricies instead of dataframes?

I'm looking for something like R's glm.fit function, which is called by glm after the design matrix and response vector are constructed and can be called directly. This would be useful if you already have a design matrix and response, and you don't need extra features that are provided by DataFrames. This would also avoid the overhead of having a DataFrame in memory in addition to your design matrix.

It looks like it would not be too hard to provide this. In glmfit.jl, fit takes a GlmMod. Other than the ModelFrame, the GlmMod object can be constructed from a response vector and design matrix instead of a Formula and DataFrame, and it looks like the ModelFrame is only used for column names in the coef table. (I'm exactly sure what the assign field is for in a ModelMatrix, important here?)

NumericExtensions.Logistic error

Not sure if this was the cause, but I started using Julia-0.2.0-rc2. On attempting using GLM, I got the error:

ERROR: Logistic not defined
in include at boot.jl:238
in include_from_node1 at loading.jl:119
in include at boot.jl:238
in include_from_node1 at loading.jl:119
in reload_path at loading.jl:145
in _require at loading.jl:63
in require at loading.jl:48
at /Users/seanmackesey/.julia/GLM/src/glmtools.jl:3
at /Users/seanmackesey/.julia/GLM/src/GLM.jl:192

I tried changing line 3 of glmtools.jl to using NumericExtensions.LogisticFun from using NumericExtensions.Logistic, and this seemed to fix the problem. I guess there was a name change in NumericExtensions?

QR type fields has changed

The functions delbeta and vcov uses the internal representation of the QR object but they changed in the update of the linear algebra code. The QR factorization now uses xgeqrt3.

glm error: no method Formula(Expr)

julia> gm1 = glm(:(counts ~ outcome + treatment), dobson, Poisson()) ERROR: no method Formula(Expr) in glm at /Users/gustavolacerda/.julia/v0.3/GLM/src/glmfit.jl:155

Definition of :sqrt2 constant overwritten in GLM

Hi,

Not major, but just wanted to report it.

Regards, Rob

_ _ ()_ | A fresh approach to technical computing
() | () () | Documentation: http://docs.julialang.org
_ _ | | __ _ | Type "help()" to list help topics
| | | | | | |/ ` | |
| | || | | | (| | | Version 0.3.0-prerelease+3241 (2014-05-20 14:04 UTC)
/ |_'|||__'| | master/6980728* (fork: 755 commits, 46 days)
|__/ | x86_64-apple-darwin13.2.0

julia> using Distributions

julia> using GLM
Warning: Method definition convert(Type{BigFloat},MathConst{:sqrt2}) in module Distributions at constants.jl:38 overwritten in module GLM at constants.jl:38.
Warning: Method definition convert(Type{Float64},MathConst{:sqrt2}) in module Distributions at constants.jl:43 overwritten in module GLM at constants.jl:43.
Warning: Method definition convert(Type{Float32},MathConst{:sqrt2}) in module Distributions at constants.jl:44 overwritten in module GLM at constants.jl:44.

julia>

[PkgEval] GLM may have a testing issue on Julia 0.3 (2014-05-20)

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3). The results of this script are used to generate a package listing enhanced with testing results.

On Julia 0.3

On 2014-05-19 the testing status was Tests pass..
On 2014-05-20 the testing status changed to Tests fail, but package loads..

Tests pass. means that PackageEvaluator found the tests for your package, executed them, and they all passed.

Tests fail, but package loads. means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with using worked.

This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.

on a mac glm.jl overwrites GLM.jl during install because it does not distinguish case

when I do Pkg.add("GLM") on my mac my .julia/GLM/src folder only contains is missing GLM.jl. I discovered that the reason is that my mac does not distinguish case in the file name so it replaces GLM.jl with glm.jl during the install. The only way to fix it is to change the name of one of the two files. I change glm.jl to glm2.jl and in GLM.jl:199 to include("glm2.jl").

Package unusable on OS X

On OS X, no package can contain two files with the same name up to case. Right now, there are src/GLM.jl and src/glm.jl files in this package. One has to be renamed.

Fitting simple LM results in error about higer order interactions.

This:

using GLM, RDatasets
form = dataset("datasets","Formaldehyde")
lm1 = lm(OptDen ~ Carb, form)

results in:

code for 3rd and higher order interactions not yet written
while loading In[1], in expression starting on line 3
 in expandcols at C:\Users\admin\.julia\v0.3\DataFrames\src\formula\formula.jl:216
 in ModelMatrix at C:\Users\admin\.julia\v0.3\DataFrames\src\formula\formula.jl:238

This is on Windows 8.1 x64.

versioninfo()

Julia Version 0.3.0-prerelease+1570
Commit a6f6461* (2014-02-14 21:07 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

Pkg.status()

8 required packages:
 - Benchmark                     0.0.3
 - GLM                           0.3.0
 - Gadfly                        0.2.5
 - IJulia                        0.1.4
 - ProfileView                   0.0.1
 - RDatasets                     0.1.1
 - Resampling                    0.0.0
 - Winston                       0.9.0
38 additional packages:
 - ArrayViews                    0.4.1
 - BinDeps                       0.2.12
 - Blocks                        0.0.2
 - Cairo                         0.2.12
 - Cartesian                     0.1.4
 - Codecs                        0.1.0
 - Color                         0.2.8
 - Compose                       0.1.26
 - DataArrays                    0.1.3
 - DataFrames                    0.5.3
 - Datetime                      0.1.2
 - Distance                      0.3.1
 - Distributions                 0.4.0
 - GZip                          0.2.12
 - HTTPClient                    0.1.0
 - Hexagons                      0.0.1
 - ImageView                     0.0.14
 - Images                        0.2.30
 - IniFile                       0.2.2
 - Iterators                     0.1.2
 - JSON                          0.3.3
 - LibCURL                       0.1.0
 - LibExpat                      0.0.3
 - Loess                         0.0.2
 - Nettle                        0.1.3
 - NumericExtensions             0.5.4
 - PDMats                        0.1.0
 - REPLCompletions               0.0.0
 - SIUnits                       0.0.1
 - SortingAlgorithms             0.0.1
 - StatsBase                     0.3.7
 - TexExtensions                 0.0.1
 - Tk                            0.2.11
 - URIParser                     0.0.1
 - URLParse                      0.0.0
 - WinRPM                        0.0.12
 - ZMQ                           0.1.8
 - Zlib                          0.1.5

Segmentation fault - OSX

I'm getting segmentation faults every time I run lm or glm now. Here's a screen grab from a session using an example straight from the documentation.

Syntax for complex Formula's

Right now, a Formula specification looks like:

:(y ~ x1 + x2 + x3)

This is essentially just an enumeration of the columns of the LHS and the RHS. We would like to support more complex Formula's for use with both DataFrame's and DataStream's. The streaming case poses some challenges, but should be soluble with a little bit more work.

There are several issues that need to be addressed. At a minimum, we need:

Rules for specifying interactions
Rules for working with functions of columns
Rules for turning factor objects into numbers
- Categorical factors
- Ordered factors

Interactions

Harlan has noted before that specifying rules for interactions is non-trivial because : has excessively high precedence in Julia. We can't support things like:

:(y ~ x1*x3 + x2:x3)

because they are not parsed the way that R would parse them.

At present, I believe the solution is that R's : is written as * and R's * is written as &. As such, the previous formula becomes:

:(y ~ x1&x3 + x2*x3)

This is not a bad solution, although I have mixed feelings about &. It provides the right intuitions for interactions of dummy variables in which bitwise operations do actually capture the computations being done, but it doesn't provide the right intuition for interactions of continuous variables. Of course, : doesn't provide much intuition either, so this may be a moot point.

Functions of Columns

I believe that we support things like :(y ~ log(x1)), but am not sure. In general, it would be nice to support things like:

:(y ~ clamp(x1, x2, x3))

This should be possible using the with function when we create a ModelMatrix.

Categorical Factors => Numbers

The representation of factors is Julia's weak point right now. We have a PooledDataArray object that provides a means for storing factors with a small number of levels that is memory-efficient.

But this backend is not equivalent to the concept of a factor. We should be able to treat DataArray's as factors as well. For a categorical factor f, what is essential is the ability to replace f with a k-column matrix that indicates whether f[i] == k for each of the k unique values that f takes on. We could do this by supporting either levels or unique for anything that will be treated as a categorical factor.

A major difficulty arises when working with categorical factors in DataStream's:

If we run unique on a minibatch of 25 rows pulled from a 100,000 row data set, many of the k levels will not be present.
This problem is greatly exacerbated when the DataStream is being evaluated in real-time because the data set itself may not contain all of the levels that we expected it to contain.

To deal with this case, I would propose extending the functions we use for converting factors to allow an optional list of unique levels for each factor. If this list is blank, then we call levels or factors.

Ordered Factors => Numbers

We currently have no mechanism for working with ordered factors. In part, this is because our PooledDataArray is not like R's factor: a PooledDataArray dispatches all comparison operators to the contents so that PooledDataArray's of strings use string ordering for comparisons.

To provide something like R's ordered factor, we need to provide:

A mechanism for overriding the default ordering
A mechanism for translating entries of an ordered factor into a numeric representation that respects that ordering

A lot of thought needs to go into this still.

NumericExtensions update broke GLM

There has been some changes in the API for NumericExtensions that breaks GLM.

julia> using GLM
ERROR: TernaryFunctor not defined
 in reload_path at loading.jl:140
 in _require at loading.jl:58
 in require at loading.jl:43
while loading /Users/andreasnoackjensen/.julia/GLM/src/lm.jl, in expression starting on line 22
while loading /Users/andreasnoackjensen/.julia/GLM/src/GLM.jl, in expression starting on line 76

Failing on "using GLM"?

Might be a new user question, but I have the following output upon entering 'using GLM':

Warning: could not import Base.foldl into NumericExtensions
Warning: could not import Base.foldr into NumericExtensions
Warning: could not import Base.sum! into NumericExtensions
Warning: could not import Base.maximum! into NumericExtensions
Warning: could not import Base.minimum! into NumericExtensions
ERROR: TernaryFunctor not defined
 in include at boot.jl:238
 in include_from_node1 at loading.jl:114
 in include at boot.jl:238
 in include_from_node1 at loading.jl:114
 in reload_path at loading.jl:140
 in _require at loading.jl:58
 in require at loading.jl:43
at /Users/dhaivat/.julia/GLM/src/lm.jl:22
at /Users/dhaivat/.julia/GLM/src/GLM.jl:76

Any ideas how to resolve?

Running on Mac OS X, w/ Julia 0.2.0.

[PackageEvaluator.jl] Your package GLM may have a testing issue.

This issue is being filed by a script, but if you reply, I will see it.

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their test (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3).

The results of this script are used to generate a package listing enhanced with testing results.

The status of this package, GLM, on...

Julia 0.2 is 'Package doesn't load.'
Julia 0.3 is 'Tests pass.'

'No tests, but package loads.' can be due to their being no tests (you should write some if you can!) but can also be due to PackageEvaluator not being able to find your tests. Consider adding a test/runtests.jl file.

'Package doesn't load.' is the worst-case scenario. Sometimes this arises because your package doesn't have BinDeps support, or needs something that can't be installed with BinDeps. If this is the case for your package, please file an issue and an exception can be made so your package will not be tested.

This automatically filed issue is a one-off message. Starting soon, issues will only be filed when the testing status of your package changes in a negative direction (gets worse). If you'd like to opt-out of these status-change messages, reply to this message.

Functions to evaluate fit

I can't seem to find any functions for evaluating the fit for a linear model. Does GLM have ways to get things like R^2, adjusted R^2, sigma, AIC, BIC?

Error from lm

I am playing with Julia 3.0 prerelease.

using GLM, RDatasets
duncan = dataset("car","Duncan")
lm1=lm(Prestige~Income, duncan)

ERROR: no method LmResp{V<:StoredArray{T<:FloatingPoint,1}}(Array{Int32,1})

Preferred formula API?

I've set up things so that the user can do:

lm("y ~ x", df)

I feel like this is less intimidating to new users. Of course, it does so by separating users from the work being done under the hood. I'd like to know whether others think this is a good thing. Any thoughts @dmbates, @andreasnoackjensen, @HarlanH, @doobwa or @tshort?

Thoughts about improving on R WRT contrasts

Not sure this is the right place nor whether this may be of any use to you, but anyway...

I feel it's great that you write a GLM package from scratch while knowing very well the existing R implementation so that you take the best of it and improve it where needed. With that idea in mind, I'd like to share two points that I consider limits of the current handling of contrasts in R.

In R, to set the reference level when using treatment contrasts, you need to change the order of levels (since the first is used). This is often cumbersome when you have a natural ordering of levels, and yet you would like to take a category in the center of the scale. Once you have made that category the first one, you cannot use the variable anymore e.g. to do a barplot, a table or anything, since the order is not correct. Duplicating a variable just for this detail is not ideal (memory use, potential mistakes when updating one and not the other...).

So IMHO it would be nice to be able to save the reference level separately from the order of levels.

In R, redundant contrast levels are not included in the model: the reference category (for treatment contrasts) or the last category (for sum contrasts) are lost. This means that regression tables, be they on screen or for output to a document (e.g. using Sweave), cannot show these values while there are usually needed for publication-grade tables, and anyway very useful to the researcher.

Thus, it would be nice if the models could somewhat remember these contrast levels, and allow retrieving them via a function similar to coef(). I would even hold that they should be printed by the default show() method.

I realize that for sum contrasts things are a little more complex since the last contrast level is not equal to 0, but to the sum of all other coefficients. So this needs a more generic handling, which could likely be part of a contrast object.

following README.MD failed

I just tried to test GLM following the README.MD but it simply does not work whatever I try (different packages using in different order ), lm() simply does not work
+++++++++++++++++++
julia> using DataFrames, Distributions, GLM
Warning: New definition refModResp not defined
at C:\Users\omorjan\AppData\Roaming\julia\packages\GLM\src\GLM.jl:18

julia>
(DataArray{T,N},Union(BitArray{1},Array{Bool,1})) is ambiguous with ref(DataArray{T<:Number,N},Union(BitArray{1},Ranges{T},Array{T,1})) at C:\Users\omorjan\AppData\Roaming\julia\packages\DataFrames\src\dataarray.jl:339.
Make sure ref(DataArray{T<:Number,N},Union(BitArray{1},Array{Bool,1})) is defined first.

julia> using RDatasets, GLM
invalid redefinition of constant GlmResp
at C:\Users\omorjan\AppData\Roaming\julia\packages\GLM\src\GLM.jl:18

julia> form = data("datasets", "Formaldehyde")
6x3 DataFrame:
carb optden
[1,] 1 0.1 0.086
[2,] 2 0.3 0.269
[3,] 3 0.5 0.446
[4,] 4 0.6 0.538
[5,] 5 0.7 0.626
[6,] 6 0.9 0.782

julia> fm1 = lm(:(optden ~ carb), form)
lm not defined
++++++++++++++++++++++

delbeta(p::DensePredChol, r::Vector{Float64}, sqrtwt::Vector{Float64}) is broken

The variable y in the final line of this function is not defined anywhere.

Add additional methods

lm only works with Float64 values, but not with Integers

Trying to apply lm to Int values gives me this error:

ERROR: no method LmResp(Array{Int64,1},)

Use all variables?

Is there an equivalent to R's use of "." to select all columns for the model?

e.g. glm(:(outcome ~ .), df, Bernoulli());

ERROR: invalid redefinition of constant GlmResp

When I try to load GLM, I got an error, and cannot load GLM.

> using Distributions
> using GLM
ERROR: invalid redefinition of constant GlmResp
 in include_from_node1 at loading.jl:76
 in reload_path at loading.jl:96
 in require at loading.jl:48
at /Users/kiwamu/.julia/GLM/src/GLM.jl:18

How can I solve this problem? I run Julia (Version 0.1.2+113667293.r7252) on Mac OS X 10.8, and GLM and Distributions were installed by Pkg.add("GLM") without any errors or warnings. Thanks.

unexplained SingularException

Running this same data set through R does give me a reasonable result...

The entire set is available upon request.

julia> trndf = readtable("trn.ds.fix.11")
409x12 DataFrame:
outcome c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
[1,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[2,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
[3,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 2.0
[4,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 4.0
[5,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 7.0
[6,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0 0.0 0.0 8.0
[7,] 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
[8,] 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
[9,] 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0
[10,] 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
[11,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 11.0 0.0 0.0 11.0
[12,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0
[13,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 1.0 0.0
[14,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 2.0 0.0
[15,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 1.0 0.0 1.0 1.0
[16,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 2.0 0.0 0.0 0.0 0.0
[17,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 2.0 0.0 0.0 2.0 0.0
[18,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 2.0 3.0 0.0 2.0 3.0
[19,] 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
[20,] 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 2.0 0.0
:
[390,] 1.0 9.0 15.0 7.0 1.0 0.0 157.0 12.0 7.0 0.0 70.0 7.0
[391,] 0.0 0.0 3.0 0.0 0.0 17.0 113.0 53.0 25.0 0.0 39.0 22.0
[392,] 0.0 6.0 10.0 17.0 10.0 1.0 74.0 45.0 1.0 0.0 26.0 1.0
[393,] 0.0 7.0 15.0 3.0 3.0 10.0 27.0 21.0 23.0 0.0 8.0 23.0
[394,] 1.0 0.0 0.0 0.0 0.0 9.0 108.0 30.0 19.0 0.0 17.0 18.0
[395,] 1.0 0.0 15.0 31.0 14.0 7.0 32.0 32.0 0.0 0.0 12.0 0.0
[396,] 1.0 0.0 2.0 6.0 1.0 0.0 105.0 29.0 43.0 0.0 76.0 41.0
[397,] 1.0 12.0 12.0 38.0 6.0 0.0 24.0 0.0 33.0 0.0 2.0 21.0
[398,] 1.0 0.0 0.0 0.0 0.0 27.0 104.0 66.0 85.0 0.0 57.0 84.0
[399,] 1.0 16.0 22.0 31.0 21.0 4.0 32.0 20.0 0.0 0.0 14.0 0.0
[400,] 1.0 0.0 0.0 0.0 0.0 11.0 147.0 14.0 13.0 0.0 108.0 13.0
[401,] 1.0 0.0 0.0 2.0 0.0 33.0 201.0 50.0 50.0 0.0 155.0 34.0
[402,] 0.0 10.0 17.0 15.0 11.0 51.0 100.0 88.0 6.0 0.0 98.0 6.0
[403,] 1.0 0.0 0.0 18.0 0.0 103.0 360.0 107.0 9.0 0.0 338.0 7.0
[404,] 1.0 0.0 3.0 11.0 3.0 19.0 123.0 63.0 13.0 0.0 117.0 11.0
[405,] 1.0 1.0 18.0 0.0 0.0 49.0 255.0 70.0 56.0 0.0 127.0 53.0
[406,] 1.0 6.0 18.0 37.0 18.0 21.0 45.0 39.0 23.0 0.0 44.0 21.0
[407,] 1.0 27.0 28.0 12.0 12.0 34.0 162.0 40.0 53.0 0.0 61.0 46.0
[408,] 1.0 21.0 43.0 32.0 32.0 22.0 68.0 39.0 124.0 0.0 49.0 101.0
[409,] 0.0 49.0 52.0 31.0 31.0 64.0 198.0 96.0 164.0 0.0 137.0 149.0

julia> lm(:(outcome ~ c1 + c2 + c3 + c4 + c5 + c6 + c7 + c8 + c9 + c10 + c11), trndf)
ERROR: SingularException(10)
in \ at linalg/triangular.jl:46
in \ at linalg/factorization.jl:310
in delbeta! at /Users/daljohnson/.julia/GLM/src/linpred.jl:27
in lm at /Users/daljohnson/.julia/GLM/src/lm.jl:39
in lm at /Users/daljohnson/.julia/GLM/src/lm.jl:42

Error running GLM example in readme

I'm trying to run the first example in the readme (using the formaldehyde data set), and I'm getting an error (ERROR: no method fit(Type{LmMod}, Formula, DataFrame)) when trying to run lm1 = fit(LmMod, OptDen ~ Carb, form). Is this an outdated example? I've pasted some diagnostic info below:

Julia version: 0.3.0-prerelease+2265

Results of Pkg.status():

9 required packages:

DataFrames 0.5.3
Devectorize 0.2.1
Distributions 0.4.3
GLM 0.3.2
Gadfly 0.2.6
IJulia 0.1.7
RDatasets 0.1.1
SMTPClient 0.0.0
Winston 0.10.2

28 additional packages:

ArrayViews 0.4.2
BinDeps 0.2.12
Blocks 0.0.2
Cairo 0.2.12
Codecs 0.1.0
Color 0.2.9
Compose 0.1.28
DataArrays 0.1.6
Datetime 0.1.2
Distance 0.3.1
GZip 0.2.12
Hexagons 0.0.1
Homebrew 0.0.6
IniFile 0.2.2
Iterators 0.1.2
JSON 0.3.5
LibCURL 0.1.1
Loess 0.0.2
Nettle 0.1.3
NumericExtensions 0.6.1
NumericFuns 0.2.2
PDMats 0.1.2
REPLCompletions 0.0.0
SortingAlgorithms 0.0.1
StatsBase 0.3.11
Tk 0.2.11
URIParser 0.0.1
ZMQ 0.1.9

Linear separability diagnostics?

One thing I'd really like is for Julia to tell the user when the data is linearly separable under a logistic model. This could be done by making a call to glm for logistic models terminate with a call to predict to see if there are no mispredicted responses. In that case, it would be nice to output a message noting this.

TernaryFunctor not defined

julia> using GLM

WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:58.
Use "x[i:end]" instead.

WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:71.
Use "x[i:end]" instead.

WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:77.
Use "x[i:end]" instead.

WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:92.
Use "x[i:end]" instead.

WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:106.
Use "x[i:end]" instead.

WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:111.
Use "x[i:end]" instead.

WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:122.
Use "x[i:end]" instead.
ERROR: TernaryFunctor not defined
 in include at boot.jl:240
while loading /home/diego/.julia/GLM/src/lm.jl, in expression starting on line 22
while loading /home/diego/.julia/GLM/src/GLM.jl, in expression starting on line 76

 - Benchmark                     0.0.3
 - BioSeq                        0.2.1
 - Cairo                         0.2.12
 - DataFrames                    0.5.1
 - Distributions                 0.3.0
 - FastaIO                       0.1.2
 - GLM                           0.2.2
 - Gadfly                        0.1.31
 - IJulia                        0.1.1
 - NumericExtensions             0.3.6
 - PyCall                        0.4.1
 - Stats                         0.1.0
 - StatsBase                     0.3.5
 - Winston                       0.8.0

Needs a LICENSE file

Allow sandwich/bootstrap/jackknife standard errors as options

This is more a feature request or policy question than a bug report.

I'm wondering whether you would like to add an argument allowing to easily compute sandwich (heteroskedasticity-robust), bootstrap, jackknife and possibly other types of variance-covariance matrix and standard errors, instead of the asymptotic ones. Or whether you would like to provide a function to compute these after the model has been fitted.

R makes it relatively cumbersome to get these common standard errors (see e.g. [1] or [2]), which either means that researchers with limited programming skills are incited to keep using Stata, or to keep relying only on asymptotic standard errors. I see no good reason why such an option shouldn't be provided in the standard modeling package.

1: http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf
2: http://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf

ERROR: QRDense not defined

Version 0.2.0
Commit 855a9220ec 2013-04-08 14:51:03

julia> using GLM
ERROR: QRDense not defined
in include_from_node1 at loading.jl:92
in reload_path at loading.jl:112
in require at loading.jl:48
at /home/iain/.julia/GLM/src/GLM.jl:173

Alternatives to MLE

I'm a complete noob to Julia. Please be patient with me whilst I find my feet.

For binomial GLMs, separation is a common problem in some areas of application. The general "solution" is MAP estimation, or penalized MLE. The methods of Firth and of Gelman et al work well (but ridge estimation would also do).

In R, these are implemented in various different functions in various different packages: brglm::brglm, arm::bayesglm. Both of these use fairly straightforward modifications to the IRLS algorithm. (Firth's method had a couple of previous incarnations using different fitting algorithms in the logistf and brlr packages.)

So far as I understand the Julia GLM package, it implements MLE. To avoid multiple future packages implementing alternatives such as those mentioned above, a general interface for penalized MLE would be useful, with predefined options for at least Firth's method.

Is such an interface reasonable, or does the required flexibility make it more sensible to implement separate fitting methods explicitly elsewhere? If total flexibility is unreasonable, can Firth's method be offered as an option (even SAS has it!)

Thanks,
Harry

http://biomet.oxfordjournals.org/content/80/1/27.short
http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf

lm error showing value of type LmMod

i am getting an error when running a lm:

julia> using GLM, RDatasets

julia> form = dataset("datasets","Formaldehyde");

julia> lm1 = lm(OptDen ~ Carb, form)
Formula: OptDen ~ Carb

Coefficients:
Evaluation succeeded, but an error occurred while showing value of type LmMod:
ERROR: .+= not defined
in show at .julia\v0.3\StatsBase\src\statmodels.jl:68

julia>

i am running Julia 0.3.0 1897 and i've updated my packages.

no method symmetrize!(Float64)

I'm getting this error when trying to run a simple lm model on two Float64 values.

In  [138]: mod = lm(:(offtask ~ testing), bystudent)

Out [138]: no method symmetrize!(Float64)

Enable Travis checking

I should know how to do this but I don't. Could someone else flip the switch on this, please?

Error when loading package

julia> using GLM
ERROR: Distribution not defined
 in include_from_node1 at loading.jl:90
 in reload_path at loading.jl:113
 in require at loading.jl:48
at /Users/Adam/.julia/GLM/src/GLM.jl:18

GLM does not work under Julia 0.3

using DataFrames,GLM
df = DataFrame(A = 0.1*(1:4), B =[0.1, 0.3, 0.1, 0.2])
lm(A~B,df)
ERROR: no method convert(Type{QR{Float64}}, QRCompactWY{Float64})
in convert at base.jl:11
in DensePredQR at /home/gaofeng/.julia/GLM/src/linpred.jl:22
in lm at /home/gaofeng/.julia/GLM/src/lm.jl:38

lm(:(A~B),df)
ERROR: no method Formula(Expr)
in lm at /home/gaofeng/.julia/GLM/src/lm.jl:42

Update docs and code(?) for Julia 0.2

See discussion in #67. We should at least be able to make the examples work.

lm doesn't work with Julia 0.3.0 - "ERROR: no method symmetrize!(Float64)"

Doing the steps in the README.md works, up to

julia> fm1 = lm(:(optden ~ carb), form)

Formula: optden ~ carb

Coefficients:

Evaluation succeeded, but an error occurred while showing value of type LmMod:
ERROR: no method display(LmMod)
 in display at multimedia.jl:158

julia> confint(fm1)
ERROR: no method symmetrize!(Float64)
 in vcov at /Users/adr/.julia/GLM/src/linpred.jl:65
 in coeftable at /Users/adr/.julia/GLM/src/lm.jl:64
 in confint at /Users/adr/.julia/GLM/src/lm.jl:75
 in confint at /Users/adr/.julia/GLM/src/lm.jl:79

Bug using glm

julia> glm(:(d1 ~ dc), my_data_frame, Bernoulli())
ERROR: no method cols(Array{Float64,1},)
in ModelMatrix at /home/laurence/.julia/DataFrames/src/formula.jl:229
in glm at /home/laurence/.julia/GLM/src/glmfit.jl:62
in glm at /home/laurence/.julia/GLM/src/glmfit.jl:70

I'm pretty sure I'm using the latest versions of everything.

Taking a look at the ModelMatrix code, it seems that cols is defined for DataVector, but not Vector.

Option to supress output of glmfit?

Currently the glmfit command has a println in it - is it possible to make this off by default and add a verbose option?