sajari / regression

Multivariable regression library in Go

License: MIT License

Go 100.00%

go regression linear-regression

regression's Introduction

regression

Multivariable Linear Regression in Go (golang)

installation

$ go get github.com/sajari/regression

Supports Go 1.8+

example usage

Import the package, create a regression, and add data to it. You can use as many variables as you like; the example below has 3 variables for each observation.

package main

import (
	"fmt"

	"github.com/sajari/regression"
)

func main() {
	r := new(regression.Regression)
	r.SetObserved("Murders per annum per 1,000,000 inhabitants")
	r.SetVar(0, "Inhabitants")
	r.SetVar(1, "Percent with incomes below $5000")
	r.SetVar(2, "Percent unemployed")
	r.Train(
		regression.DataPoint(11.2, []float64{587000, 16.5, 6.2}),
		regression.DataPoint(13.4, []float64{643000, 20.5, 6.4}),
		regression.DataPoint(40.7, []float64{635000, 26.3, 9.3}),
		regression.DataPoint(5.3, []float64{692000, 16.5, 5.3}),
		regression.DataPoint(24.8, []float64{1248000, 19.2, 7.3}),
		regression.DataPoint(12.7, []float64{643000, 16.5, 5.9}),
		regression.DataPoint(20.9, []float64{1964000, 20.2, 6.4}),
		regression.DataPoint(35.7, []float64{1531000, 21.3, 7.6}),
		regression.DataPoint(8.7, []float64{713000, 17.2, 4.9}),
		regression.DataPoint(9.6, []float64{749000, 14.3, 6.4}),
		regression.DataPoint(14.5, []float64{7895000, 18.1, 6}),
		regression.DataPoint(26.9, []float64{762000, 23.1, 7.4}),
		regression.DataPoint(15.7, []float64{2793000, 19.1, 5.8}),
		regression.DataPoint(36.2, []float64{741000, 24.7, 8.6}),
		regression.DataPoint(18.1, []float64{625000, 18.6, 6.5}),
		regression.DataPoint(28.9, []float64{854000, 24.9, 8.3}),
		regression.DataPoint(14.9, []float64{716000, 17.9, 6.7}),
		regression.DataPoint(25.8, []float64{921000, 22.4, 8.6}),
		regression.DataPoint(21.7, []float64{595000, 20.2, 8.4}),
		regression.DataPoint(25.7, []float64{3353000, 16.9, 6.7}),
	)
	r.Run()

	fmt.Printf("Regression formula:\n%v\n", r.Formula)
	fmt.Printf("Regression:\n%s\n", r)
}

Note: You can also add data points one by one.

Once calculated you can print the data, look at the R^2, Variance, residuals, etc. You can also access the coefficients directly to use elsewhere, e.g.

// Get the coefficient for the "Inhabitants" variable 0:
c := r.Coeff(0)

You can also use the model to predict new data points

prediction, err := r.Predict([]float64{587000, 16.5, 6.2})

Feature crosses are supported so your model can capture fixed non-linear relationships

r.Train(
  regression.DataPoint(11.2, []float64{587000, 16.5, 6.2}),
)
// Add a new feature: the first variable (index 0) raised to the power of 2.
r.AddCross(regression.PowCross(0, 2))
r.Run()

regression's Issues

Add Gradient Descent + Cost methods

How is the regression being tested? Hyper-parameters such as the learning rate don't seem to be things that can be passed in. How do we know the model is correct without calculating the cost? How does it take loss into consideration? Can the regression line be optimized using gradient descent?
Cost formula
// cost = 1/N * sum((y - (m*x + c))^2)

Gradient descent formulas
// cost       = 1/N * sum((y - (m*x + c))^2)
// d(cost)/dm = 2/N * sum(-x * (y - (m*x + c)))
// d(cost)/dc = 2/N * sum(-(y - (m*x + c)))

Referenced in

https://www.youtube.com/watch?v=ZPd_fKyrX48

Function Run() has to have "division by 0" protection

Could someone make this change in this part of the code:
for i := n - 1; i >= 0; i-- {
	c[i] = qty.Get(i, 0)
	for j := i + 1; j < n; j++ {
		c[i] -= c[j] * reg.Get(i, j)
	}
	// practically add these lines
	regValue := reg.Get(i, i)
	if regValue == 0 {
		return errors.New("Division by 0 is not allowed.")
	}
	c[i] /= regValue
	// up to here
	// replacing this line: c[i] /= reg.Get(i, i)
}

Thanks

Error at qr.QTo and qr.RTo

Hi,

I received the error below when I tried to build my app, because of this pull request: gonum/gonum#1090

go build

github.com/sajari/regression

../../../github.com/sajari/regression/regression.go:170:13: qr.QTo(nil) used as value
../../../github.com/sajari/regression/regression.go:171:15: qr.RTo(nil) used as value

To fix it, I changed your code in regression.go to:

var q, reg mat.Dense
qr.QTo(&q)
qr.RTo(&reg)

Support automatically generating feature crosses

Heya,

Sick presentation Hamish.

Have you considered adding support for feature crosses (square, cubic, etc.)? I think it would be pretty easy. It would also boost the dank score of the repo.

I think it could be done as a thin wrapper function around DataPoint().

What's the best practice to save a model

What's the best way to save a model for reuse? With the unexported fields, I'm not seeing any way to store a model in a file or database and load it back into an application.

It would be nice to add the ability to enforce zero intercept

Seems to be a nice package with a clean interface. The only problem for me: I did not find a way to force the line to pass through the origin (in other words, a zero intercept), as LINEST in Excel or Calc allows.

[Feature Request] An API to get the number of variables.

Requesting an API to get the number of variables, i.e. the number of coefficients. Implementation-wise, just return len(r.coeff).

It would be even better if there was an API to get the coefficients as a []float64.

Meaning of "initialised"

What is the meaning of initialised in the Regression type?

The error returned in

regression/regression.go

Lines 55 to 57 in 7932f0e

if !r.initialised {
	return 0, errNotEnoughData
}

suggests that it becomes true when there is enough data. In

regression/regression.go

Lines 107 to 109 in 7932f0e

if len(r.data) > 2 {
	r.initialised = true
}

it is set to true if len(r.data) > 2. Why this condition?

The output of the following program

package main

import (
	"fmt"
	"github.com/sajari/regression"
)

func main() {
	r := new(regression.Regression)
	r.SetObserved("obversed")
	r.SetVar(0, "var")
	r.Train(
		regression.DataPoint(1, []float64{1}),
		regression.DataPoint(2, []float64{2}),
		//regression.DataPoint(2, []float64{2}),
	)
	r.Run()

	fmt.Printf("Regression formula:\n%v\n", r.Formula)
	fmt.Printf("Regression:\n%s\n", r)
}

Regression formula:

Regression:
Not enough data points

With the line

regression.DataPoint(2, []float64{2}),
instead of

//regression.DataPoint(2, []float64{2}),

the output becomes

Regression formula:
Predicted = 0.00 + var*1.00
Residuals:
observed|	Predicted|	Residual
1.00|	1.00|	-0.00
2.00|	2.00|	0.00
2.00|	2.00|	0.00


Regression:
obversed|	var
1.00|	1.00
2.00|	2.00
2.00|	2.00

N = 3
Variance observed = 0.2222222222222222
Variance Predicted = 0.22222222222222196
R2 = 0.9999999999999989

I would not expect any of these outputs.

Out of memory issue on 100k data points

runtime: VirtualAlloc of 80000000000 bytes failed with errno=1455
fatal error: out of memory

Crashing here (calling gonum code):

regression/regression.go

Line 179 in d629f2e

qr.QTo(q)

Is there any method, or plans to support larger data sets?

The actual size of the data set is only about 2 MB, so the 80GB virtual alloc would suggest that there are some serious scaling issues with the current implementation. I see that the actual crash is occurring in the gonum code, but I'm just wondering if there's a solution here to make this library usable with large datasets.
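
The size of the failed allocation is consistent with a dense 100,000 by 100,000 matrix of float64 values, which is what a full square Q factor for 100,000 observations would occupy. That reading is an inference from the numbers in the error message, not something confirmed in this thread. The arithmetic, with a hypothetical helper `fullQBytes`:

```go
package main

import "fmt"

// fullQBytes returns the size in bytes of a dense n-by-n float64 matrix.
func fullQBytes(n int64) int64 {
	const bytesPerFloat64 = 8
	return n * n * bytesPerFloat64
}

func main() {
	fmt.Println(fullQBytes(100000)) // 80000000000
}
```

Under this reading, memory grows with the square of the number of data points, regardless of how few variables each point has.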

Runtime indicator?

Looking at this totally from a black-box perspective, what should I expect in regards to runtime complexity?

I have over 30,000 data points on my first run that I wish to model, and the app has been sitting for some time now... can you provide some guidance as to how I can estimate runtime?

No change in predictions when new data is added for training.

If we use the same regression object to add additional training data (using Train(...)) and then call Run(), there seems to be no change in the predictions made.

On further inspection, it seems that Run() can be called only once. Is this expected behavior? So, if I'm constantly acquiring new data and want to increase my training data set and retrain the model, I would have to create a new object of regression.Regression and pass the union of the old and the new data?

Below is example code to illustrate the same.

	r = new(regression.Regression)
	r.SetObserved("Z")
	r.SetVar(0, "X")
	r.SetVar(1, "Y")
	r.Train(
		regression.DataPoint(12.59, []float64{3, 0.25}),
		regression.DataPoint(17.54, []float64{1, 0.40}),
		regression.DataPoint(24.14, []float64{1, 0.268}),
		regression.DataPoint(21.47, []float64{2, 0.35}),
	)
	r.Run()

	fmt.Printf("Regression formula:\n%v\n", r.Formula)

	prediction, _ := r.Predict([]float64{2, 0.30})
	fmt.Println("Prediction = " + strconv.FormatFloat(prediction, 'f', 3, 64))

	fmt.Println("adding new data points...")
	r.Train(
		regression.DataPoint(15.65, []float64{3, 0.45}),
		regression.DataPoint(13.35, []float64{2, 0.65}),
	)
	r.Run() // attempt to retrain using also the newly added data.

	fmt.Printf("Regression formula:\n%v\n", r.Formula)

	prediction, _ = r.Predict([]float64{2, 0.30})
	fmt.Println("Prediction_new = " + strconv.FormatFloat(prediction, 'f', 3, 64))

The output of the above code is shown below. As we can see, the formula and the prediction haven't changed.

Regression formula:
Predicted = 33.43 + X*-4.47 + Y*-21.06
Prediction = 18.176
adding new data points...
Regression formula:
Predicted = 33.43 + X*-4.47 + Y*-21.06
Prediction_new = 18.176

Multivariate multiple regression analysis possible?

I'm trying to run a regression analysis with multiple independent and multiple dependent variables (multivariate multiple regression). Is it also possible to use your library for that? And if not, would you know another library to do that in golang? All tips are welcome!

The package does not build with the latest version of gonum

This commit in gonum changed the way the QTo and RTo methods are called; they no longer return the Dense pointer.

The fix is pretty straightforward: the only thing needed is to create a Dense and pass it to the methods.

In regression.go use this

q := new(mat.Dense)
reg := new(mat.Dense)
qr.QTo(q)
qr.RTo(reg)

Instead of this

q := qr.QTo(nil)
reg := qr.RTo(nil)

I made a PR with the fix :)

License

Hi,

Thanks for writing such a great library. We'd love to use this in production at our firm. Is it possible to specify the license that the code is under so we can determine whether we can use it?

Stephen

Is dense copy truly needed?

regression/regression.go

Line 174 in c1eb7a8

qtrd := mat.DenseCopyOf(qtr)

Couldn't this:

qtr := q.T()
qtrd := mat.DenseCopyOf(qtr)
qty := new(mat.Dense)
qty.Mul(qtrd, observed)

be simplified to this?

qtr := q.T()
qty := new(mat.Dense)
qty.Mul(qtr, observed)

^ Tests don't seem to break, and I can't see a reason why a copy is being made: the original qtr is never mutated. Moreover, more memory is being used than necessary. Should this be optimized? Was this an oversight, or is there intent that I'm missing? The only real thing DenseCopyOf seems to do is convert the Matrix interface into the concrete type *Dense, which doesn't seem necessary because none of the underlying values of Dense are used explicitly.

How does this actually work

I was expecting a hot loop with stochastic gradient descent or something but then I saw this in run():

c := make([]float64, n)
for i := n - 1; i >= 0; i-- {
	c[i] = qty.Get(i, 0)
	for j := i + 1; j < n; j++ {
		c[i] -= c[j] * reg.Get(i, j)
	}
	c[i] /= reg.Get(i, i)
}

That seems way too efficient for ML.

Point me to a Wikipedia article on this algorithm? :D
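
The quoted loop is back substitution on an upper triangular system. Going by the variable names, with qty holding the product of the transposed Q factor and the observations and reg holding the R factor, it matches the standard way of solving ordinary least squares through a QR decomposition. The symbols X (design matrix), y (observations), and c (coefficients) below are notation chosen for this sketch, and R stands for the top n-by-n block of the triangular factor:

```latex
\begin{aligned}
X &= QR, \qquad Q^\top Q = I,\quad R \text{ upper triangular} \\
\min_c \lVert y - Xc \rVert_2^2 \;&\Longrightarrow\; R\,c = Q^\top y \\
c_i &= \frac{(Q^\top y)_i - \sum_{j=i+1}^{n} R_{ij}\, c_j}{R_{ii}},
\qquad i = n,\, n-1,\, \dots,\, 1
\end{aligned}
```

The last line is what the loop computes, working from the final coefficient back to the first. Useful search terms are "QR decomposition", "linear least squares", and "triangular matrix (forward and back substitution)".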

NaN & Inf all over the place

Hi,

Thanks for your repository.

I am trying to use it with my own data, and the regression comes up with something like this:
Predicted = NaN + Subsystems*NaN + Directories*NaN + Files*NaN + Entrophy*+Inf + LineAdded*0.00 + LineDeleted*-0.00 + LineTotal*-0.00 + Devs*-0.05 + Age*0.00 + UniqueChange*0.00 + Exp*-0.00 + RExp*-0.00 + Sexp*0.00 + AuthorID*0.08

Here's the relevant code:

	r.SetObserved("Buggy commit")

	r.SetVar(0, "Subsystems")
	r.SetVar(1, "Directories")
	r.SetVar(2, "Files")
	r.SetVar(3, "Entrophy")
	r.SetVar(4, "LineAdded")
	r.SetVar(5, "LineDeleted")
	r.SetVar(6, "LineTotal")
	r.SetVar(7, "Devs")
	r.SetVar(8, "Age")
	r.SetVar(9, "UniqueChange")
	r.SetVar(10, "Exp")
	r.SetVar(11, "RExp")
	r.SetVar(12, "Sexp")
	r.SetVar(13, "AuthorID")

	trainingSetSize := len(commits) * 70 / 100 // multiply first so integer division does not lose precision

	fmt.Println("trainingSetSize", trainingSetSize)

	for index := 0; index < trainingSetSize; index++ {

		observedValue := -1.0

		if commits[index].ContainsBug {
			observedValue = 1.0
		}

		values := []float64{
			float64(commits[index].Subsystems),
			float64(commits[index].Directories),
			float64(commits[index].Files),
			float64(commits[index].Entrophy),
			float64(commits[index].LineAdded),
			float64(commits[index].LineDeleted),
			float64(commits[index].LineTotal),
			float64(commits[index].Devs),
			float64(commits[index].Age),
			float64(commits[index].UniqueChange),
			float64(commits[index].Exp),
			float64(commits[index].RExp),
			float64(commits[index].Sexp),
			float64(commits[index].AuthorID),
		}

		r.Train(
			regression.DataPoint(
				observedValue,
				values,
			),
		)
	}

	err = r.Run()

	if err != nil {
		panic(err)
	}

	fmt.Printf("Regression formula:\n%v\n", r.Formula)

Am I using this correctly? What am I missing?

Thanks,
