
polyreg's Introduction

polyreg, an Alternative to Machine Learning Methods

A package to automate formation and evaluation of multivariate polynomial regression models, especially as an alternative to neural networks and other machine learning algorithms.

An important feature is that dummy variables are handled properly, so that for instance powers of a dummy variable do not exist as duplicates of the original.

Note: This library is used by the qeML package; qeML ("quick and easy machine learning") provides a convenient, consistent interface to various machine learning algorithms, including polynomial regression via polyreg. There is also a polynomial version of ridge regression. Except for special purposes, we recommend that users access polyreg through the qeML interface rather than calling it directly.
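
For instance, a minimal sketch of the qeML route (this assumes qeML's qePolyLin() wrapper and its usual data-frame-plus-outcome-name calling form; consult the qeML documentation for the exact arguments in your installed version):

# sketch only: fit a degree-2 polynomial wage model through qeML rather than
# calling polyreg directly (qePolyLin() and its 'deg' argument are assumed here)
library(qeML)
data(pef)
qpout <- qePolyLin(pef, 'wageinc', deg = 2)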

Motivation

In Polynomial Regression As an Alternative to Neural Nets (Cheng, Khomtchouk, Matloff and Mohanty, 2018), it is argued that dense, feedforward neural networks are essentially polynomial regression models. This was extended in Towards a Mathematical Framework to Inform Neural Network Modelling via Polynomial Regression, by Morala, Cifuentes, Lillo and Ucar. The point, then, is: why go through the problems of neural networks (convergence, local minima and so on) when one can work more simply with polynomials?

Of course, it is not quite that simple. If we start with p predictor variables, the degree-d polynomial model will have O(p^d) terms, which can easily become computationally challenging. Nevertheless, our experiments have had quite encouraging results.
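
As a rough illustration of that growth (a sketch using the package's getPoly() helper and the pef data set from the example below; exact column counts depend on the data and package version):

# sketch: count the columns of the expanded design matrix as the degree grows
library(polyreg)
data(pef)
x <- pef[, -5]                  # predictors only; column 5 (wage income) dropped
ncol(getPoly(x, deg = 1)$xdata)
ncol(getPoly(x, deg = 2)$xdata)
ncol(getPoly(x, deg = 3)$xdata)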

Usage

The main functions are polyFit() and predict.polyFit(). One can fit either regression or classification models.

Example

Programmer/engineer 2000 Census data, Silicon Valley, built into the package.

data(pef)

# model wage income, fitting a degree-2 model
pfout <- polyFit(pef[,c(1,2,3,4,6,5)],2,use='lm')

# predict wage of person like that in row 1, but age 40 and female
newx <- pef[1,-5]
newx$age <- 40
newx$sex <- 1
predict(pfout,newx)  # about $84,330
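
For a classification model, here is a minimal sketch (not taken from the package documentation): the class label goes in the last column as a non-numeric column, as in the package's two-class glm usage, and the pef column positions assumed below follow the layout used above.

# sketch of a classification fit: predict sex from the other variables
# (assumes sex is column 4 of pef, and that a character class column placed
#  last marks a classification problem, as in polyFit's glm usage)
pefc <- pef[, c(1,2,3,5,6,4)]       # move sex to the last column
pefc$sex <- as.character(pefc$sex)
cfout <- polyFit(pefc, 2, use='glm')
predict(cfout, pefc[1,-6])          # predicted class for row 1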

polyreg's People

Contributors

bohdan-khomtchouk, chxii, kenneth-lee-ch, matloff, matthewkotila, rdrr1990


polyreg's Issues

Constant fitted.value

After fixing interruptions by browser(), polyFit() always generates constant fitted.values for all observations. Furthermore, the degree in getPoly() was always 1, even when I set deg to a non-1 value. Can you take a look? Thanks!

========================
library("polyreg")
model <- polyFit(cbind(train$x, train$y), deg=4)

To get the adjusted R-squared:

library("rsq")
rsq(model$fit)

Output:

getPoly time: 0.023 0.001 0.024 0 0
lm() time: 0.002 0 0.002 0 0
Length Class Mode
xy 11358 -none- numeric
degree 1 -none- numeric
maxInteractDeg 1 -none- numeric
use 1 -none- character
poly.xy 1 data.frame list
fit 11 lm list
PCA 0 -none- NULL
pca.portion 1 -none- numeric
pca.xy 0 -none- NULL
pcaCol 1 -none- numeric
pcaLocation 1 -none- character
glmMethod 0 -none- NULL
classProblem 1 -none- logical
classes 1 -none- logical

[1] 0

Support for Python?

I just read https://arxiv.org/abs/1806.06850 and found it really interesting. Any chance you're thinking about implementing a Python version of the package some time soon?

I'd be willing to help as long as someone writes up a high-level description of the functionality needed.

bug in predict for two-class cases?

When computing predictions for a two-class case there seems to be a mistake.

Here is a reproducible example:

library(polyreg)
library(MLmetrics)

data(kyphosis, package = "rpart")
kyphosis$y <- ifelse(kyphosis$Kyphosis == "absent", 1, 0)
kyphosis$Kyphosis <- NULL
mod <- glm(y ~ ., data = kyphosis, family = binomial())
mod
# Coefficients:
# (Intercept)          Age       Number        Start  
#     2.03693     -0.01093     -0.41060      0.20651  
#
# Degrees of Freedom: 80 Total (i.e. Null);  77 Residual
# Null Deviance:	    83.23 
# Residual Deviance: 61.38 	AIC: 69.38
table(ifelse(predict(mod, type = "response") > 0.5, 1, 0), kyphosis$y)
#    0  1
# 0  7  3
# 1 10 61
Accuracy(ifelse(predict(mod, type = "response") > 0.5, 1, 0), kyphosis$y)
# 0.8395062
table(ifelse(predict(mod) > 0.5, 1, 0), kyphosis$y)
#    0  1
# 0 10  8
# 1  7 56
Accuracy(ifelse(predict(mod) > 0.5, 1, 0), kyphosis$y)
# 0.8148148

data(kyphosis, package = "rpart")
kyphosis <- kyphosis[,c(2:4,1)]
kyphosis$Kyphosis <- as.character(kyphosis$Kyphosis)
pf <- polyFit(kyphosis, deg = 1, use = "glm")
pf$fit 
# Coefficients:
# (Intercept)           V1           V2           V3  
#     2.03693     -0.01093     -0.41060      0.20651  
# 
# Degrees of Freedom: 80 Total (i.e. Null);  77 Residual
# Null Deviance:	    83.23 
# Residual Deviance: 61.38 	AIC: 69.38

OK, the same model is fitted, but when computing predictions:

table(predict(pf, kyphosis), kyphosis$Kyphosis)
#         absent present
# absent      56       7
# present      8      10
Accuracy(predict(pf, kyphosis), kyphosis$Kyphosis)
# 0.8148148

the result seems to be wrong (it matches the link-scale predictions above, not the response-scale ones). Looking at the code, you can see

# glm case
  if (is.null(object$glmMethod)) { # only two classes
    pre <- predict(object$fit, plm.newdata)
    pred <- ifelse(pre > 0.5, object$classes[1], object$classes[2])
  } 

IMHO the prediction returned is on the link scale (see help(predict.glm)), but it should be on the probability scale, i.e. type = "response"; alternatively, if staying on the link scale, the threshold should be pre > 0. However, I prefer to reason on the probability scale.
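
A sketch of the change this suggests (not the package's actual patch; the object$classes indexing is left exactly as in the snippet above):

# glm case -- sketched fix: threshold on the probability scale
  if (is.null(object$glmMethod)) { # only two classes
    pre <- predict(object$fit, plm.newdata, type = "response")
    pred <- ifelse(pre > 0.5, object$classes[1], object$classes[2])
  }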

Higher degree polynomial

I have a couple of questions about your paper "Polynomial Regression as an Alternative to Neural Networks".

  • At point 5 it is said that "many authors recommend not using polynomial models of degree higher than 2 or 3" .

In which cases is it possible to get higher degree polynomials which fit well?

  • At point 6.1 it is said that " the degree of the approximating polynomial increases from layer to layer".

So, if I get a polynomial of degree 3 that fits well, does that mean the corresponding neural network would have roughly three layers?

Paper

If this is the code for the paper "Polynomial Regression As an Alternative to Neural Nets" by Xi Cheng, Bohdan Khomtchouk, Norman Matloff and Pete Mohanty, please add a link, or at least a citation, in the README.

predict output = 'response'

Hi,

Thank you for this new methodology. I was trying it out and realized that after fitting in glm mode, there is no way to output my predictions as probabilities. I tried setting predict(xtest, type = "response").

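One possible workaround, not an official package feature: the returned object stores the underlying glm fit in $fit, so for the training data the response-scale probabilities can be pulled from it directly (pf below is just a placeholder name for a polyFit(..., use = 'glm') result):

# sketch of a workaround: pf$fit is an ordinary glm object, so standard
# glm tools apply; this gives fitted probabilities on the training data
probs <- predict(pf$fit, type = "response")
head(probs)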

get_interactions fails when an interaction uses all variables

Hi and thanks for your work on polyreg. I think I've found a small bug.

It relates to the case where we want to generate the "saturated" polynomial, with all possible terms. The problem arises with the unique term that has all of the variables multiplied together.

The easiest way to reproduce the problem is to run polyreg:::get_interactions(c("A","B","C"),3), which returns

Error in apply(combos, 2, paste, collapse = " * ") : 
  dim(X) must have a positive length

Inside get_interactions(), there is the line

interactions[[i]] <- apply(combos, 2, paste, collapse = " * ")

But if there is only one combination of that size, combos is by this point a plain vector, with dim(combos) equal to NULL, so apply() fails.

It is probably an easy fix but you might have your own preferred way of handling it.

Users would probably only want all possible terms in cases with very few variables, which may seem trivial, but I am using it for a real problem where I've been asked to produce a "simple to calculate" approximation to some simulator output.
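
One possible fix, sketched here rather than taken from the package: coerce combos back to a one-column matrix before the apply() call.

# sketched fix inside get_interactions(): guard against the single-combination
# case, where combos has degenerated to a plain vector
if (is.null(dim(combos))) combos <- matrix(combos, ncol = 1)
interactions[[i]] <- apply(combos, 2, paste, collapse = " * ")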

just clarifying : categorical data features interaction

Great package, and great work.

"An important feature is that dummy variables are handled properly, so that for instance powers of a dummy variable do not exist as duplicates of the original."

Just clarifying: given this categorical data,

f1 f2 f3
a c u
b g x
a k y

after the feature interactions are formed, will row 1 look like this?

f1 f2 f3        f1 f2 f3 f1f2 f1f3 f2f3 f1f2f3
a  c  u    =>   a  c  u  ac   au   cu   acu

Just to make sure: there will not be ua, ca, ..., uca, etc.
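
Not polyreg's own expansion routine, just a base-R illustration of the point: once categorical variables are dummy-coded, each interaction combination appears exactly once (and squaring a dummy would only duplicate it, which is what polyreg avoids):

# base-R illustration (not polyreg internals): each dummy interaction such as
# f1b:f2g appears once; there is no reversed duplicate like f2g:f1b
d <- data.frame(f1 = c("a","b","a"), f2 = c("c","g","k"), f3 = c("u","x","y"))
colnames(model.matrix(~ f1 * f2 * f3, data = d))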

problem with maxInteractDeg

Hi,
I found a problem: when the second tuning parameter, maxInteractDeg, is changed, the model estimated with the same deg remains unchanged.
In fact, in this example the dimensions of the design matrix remain unchanged.

x <- iris[,1:4]
dim(getPoly(x, deg = 1, maxInteractDeg = 1)$xdata)
dim(getPoly(x, deg = 1, maxInteractDeg = 2)$xdata)
dim(getPoly(x, deg = 1, maxInteractDeg = 3)$xdata)
dim(getPoly(x, deg = 2, maxInteractDeg = 1)$xdata)
dim(getPoly(x, deg = 2, maxInteractDeg = 2)$xdata)
dim(getPoly(x, deg = 2, maxInteractDeg = 3)$xdata)

What could be causing this problem?
As it stands, I cannot get the model to vary with the interaction degree.

Unexpected result when pcaMethod=FALSE (default)

After reading your arXiv paper, I am learning how to use your package. Today I ran into this unexpected result, based on your example for polyFit. Am I doing something wrong?

library(polyreg)

y <- mtcars[, 1]
data <- cbind(mtcars[, -1], y) # make y column the last column
f <- polyFit(data, 2, 2, "lm", TRUE, 0.9)
pred <- predict(f, data[, -ncol(data)])
plot(y, pred, col = "blue", pch = 19, asp = c(1, 1))

f1 <- polyFit(data, 2, 2, "lm")
pred1 <- predict(f1, data[, -ncol(data)])
points(y, pred1, col = "red", pch = 19, asp = c(1, 1))


Problem running examples

Dear Norman Matloff,

I have been trying out your R package; however, I am having trouble getting the examples provided in the README file to work.

Code:

Working part:
> getPE()
> pe <- pe[,c(1,2,4,6,7,3)]
> head(pe,2)
       age sex wkswrkd ms phd wageinc
1 50.30082   0      52  0   0   75000
2 41.10139   1      20  0   0   12300
> pfout <- polyFit(pe,2)  # quadratic model
getPoly time: 0.021 0.003 0.024 0 0
lm() time: 0.02 0.004 0.027 0 0
> newx <- pe[1,]  # dummy 1-row data frame
> newx <- newx[,-6]  # no Y value
> newx$age <- 40
> newx$sex <- 1
> newx
  age sex wkswrkd ms phd
1  40   1      52  0   0

However, when I run predict() I get the following error:
Called from: predict.polyFit(pfout, newx)
Browse[1]> c
model.matrix() reported the following error:
Error : $ operator is invalid for atomic vectors

R session:

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rpart_4.1-13 polyreg_0.4.0 regtools_1.0.0 car_3.0-2
[5] carData_3.0-2 dummies_1.5.6 mvtnorm_1.0-8 FNN_1.1.3
[9] tidyquant_0.5.5 quantmod_0.4-13 TTR_0.23-4 PerformanceAnalytics_1.5.2
[13] xts_0.11-0 zoo_1.8-3 lubridate_1.7.4 gtrendsR_1.4.2
[17] BatchGetSymbols_2.3 rvest_0.3.2 xml2_1.2.0 forcats_0.4.0
[21] stringr_1.3.1 dplyr_0.8.0.1 purrr_0.3.0 readr_1.3.1
[25] tidyr_0.8.1 tibble_2.0.1 ggplot2_3.1.0 tidyverse_1.2.1
[29] devtools_1.13.6

loaded via a namespace (and not attached):
[1] nlme_3.1-137 httr_1.3.1 tools_3.4.2 backports_1.1.2 utf8_1.1.4 R6_2.4.0
[7] lazyeval_0.2.1 colorspace_1.3-2 nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5 curl_3.3
[13] compiler_3.4.2 git2r_0.23.0 cli_1.0.1 keras_2.2.4 scales_1.0.0 quadprog_1.5-5
[19] tfruns_1.4 digest_0.6.18 foreign_0.8-71 rio_0.5.16 base64enc_0.1-3 pkgconfig_2.0.2
[25] rlang_0.3.1 readxl_1.3.0 pdist_1.2 rstudioapi_0.9.0 generics_0.0.2 jsonlite_1.6
[31] tensorflow_1.10 zip_1.0.0 magrittr_1.5 Matrix_1.2-14 Rcpp_1.0.0 Quandl_2.9.1
[37] munsell_0.5.0 fansi_0.4.0 abind_1.4-5 reticulate_1.10 partools_1.1.6 stringi_1.2.4
[43] whisker_0.3-2 yaml_2.2.0 kerasformula_1.5.1 plyr_1.8.4 grid_3.4.2 parallel_3.4.2
[49] crayon_1.3.4 lattice_0.20-35 haven_2.0.0 hms_0.4.2 zeallot_0.1.0 pillar_1.3.1
[55] glue_1.3.0 data.table_1.12.0 modelr_0.1.2 cellranger_1.1.0 gtable_0.2.0 assertthat_0.2.0
[61] openxlsx_4.1.0 broom_0.5.0 RSpectra_0.13-1 memoise_1.1.0 deepnet_0.2

Any idea what's going on?
Thanks

unable to run the mtcars example

I was trying to run this example:

y <- mtcars[,1]
data <- cbind(mtcars[,-1], y) # make y column the last column
f <- polyFit(data,2,2,"lm",TRUE, 0.9)
pred <- predict(f,data[,-ncol(data)])

after installing the package with:

install.packages("devtools")
library(devtools)
install_github("matloff/polyreg")

I got the following error message:
"Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "eigen"".

Did I get the package installation wrong?

example(xvalPoly) throws error

I have been trying to run (working) code from a colleague on two different Ubuntu machines. I can't get their code to run, and I can't even get the xvalPoly example to run.

Here is the unexpected error, followed by the complete R session in which it was produced (with sessionInfo). I get similar results from another machine. I'm not aware of anything weird I've done.


example(xvalPoly)

xvlPly> y <- mtcars[,1]

xvlPly> data <- cbind(mtcars[,-1], y) # make y column the last column

xvlPly> pf1 <- xvalPoly(data,2,2,"lm",0.8,FALSE)
Error in polyFit(training, i, m, use, pcaMethod, pcaLocation, pcaPortion, :
pcaMethod should be either NULL, prcomp, or RSpectra


R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

library(polyreg)
Loading required package: regtools
Loading required package: FNN
Loading required package: mvtnorm
Loading required package: dummies
dummies-1.5.6 provided by Decision Patterns

Loading required package: car
Loading required package: carData

packageVersion("polyreg")
[1] ‘0.2’
example(xvalPoly)

xvlPly> y <- mtcars[,1]

xvlPly> data <- cbind(mtcars[,-1], y) # make y column the last column

xvlPly> pf1 <- xvalPoly(data,2,2,"lm",0.8,FALSE)
Error in polyFit(training, i, m, use, pcaMethod, pcaLocation, pcaPortion, :
pcaMethod should be either NULL, prcomp, or RSpectra

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] polyreg_0.2 regtools_1.0.1 car_3.0-2 carData_3.0-1 dummies_1.5.6
[6] mvtnorm_1.0-8 FNN_1.1.2.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 rio_0.5.10 crayon_1.3.4 cellranger_1.1.0
[5] magrittr_1.5 zip_1.0.0 pillar_1.3.0 rlang_0.2.2
[9] readxl_1.1.0 curl_3.2 data.table_1.11.4 openxlsx_4.1.0
[13] tools_3.5.1 forcats_0.3.0 foreign_0.8-71 hms_0.4.2
[17] abind_1.4-5 compiler_3.5.1 pkgconfig_2.0.2 haven_1.1.2
[21] tibble_1.4.2

Error: invalid type (list) for variable 'I(W^1)'

Issue:

I got the following error message when only one feature is passed to polyFit().

"
getPoly() expects a matrix or a data.frame. The input will be coerced to a data.frame but you may wish to stop and provide one directly.

Error in t.default(xdata) : argument is not a matrix
In addition: Warning message:
In model_matrix(modelFormula, W, intercept, noisy, ...) :
model.matrix() reported the following error:
Error in model.frame.default(object, data, xlev = xlev) :
invalid type (list) for variable 'I(W^1)'
"

Code for reproducing the problem:
y <- as.factor(rep(c(0,1,1),20))
s <- rnorm(length(y))
dta <- data.frame(scores = s, y=y)
polyFit(dta,deg=1,use='glm')

I tried the following, and it seems the function works with two features:

dta2 <- data.frame(scores = s,scores_two = sort(s), y=y)
polyFit(dta2,deg=1,use='glm')
