regression's Introduction

regression

Multivariable Linear Regression in Go (golang)

installation

$ go get github.com/sajari/regression

Supports Go 1.8+

example usage

Import the package, create a regression, and add data to it. You can use as many variables as you like; in the example below there are 3 variables for each observation.

package main

import (
	"fmt"

	"github.com/sajari/regression"
)

func main() {
	r := new(regression.Regression)
	r.SetObserved("Murders per annum per 1,000,000 inhabitants")
	r.SetVar(0, "Inhabitants")
	r.SetVar(1, "Percent with incomes below $5000")
	r.SetVar(2, "Percent unemployed")
	r.Train(
		regression.DataPoint(11.2, []float64{587000, 16.5, 6.2}),
		regression.DataPoint(13.4, []float64{643000, 20.5, 6.4}),
		regression.DataPoint(40.7, []float64{635000, 26.3, 9.3}),
		regression.DataPoint(5.3, []float64{692000, 16.5, 5.3}),
		regression.DataPoint(24.8, []float64{1248000, 19.2, 7.3}),
		regression.DataPoint(12.7, []float64{643000, 16.5, 5.9}),
		regression.DataPoint(20.9, []float64{1964000, 20.2, 6.4}),
		regression.DataPoint(35.7, []float64{1531000, 21.3, 7.6}),
		regression.DataPoint(8.7, []float64{713000, 17.2, 4.9}),
		regression.DataPoint(9.6, []float64{749000, 14.3, 6.4}),
		regression.DataPoint(14.5, []float64{7895000, 18.1, 6}),
		regression.DataPoint(26.9, []float64{762000, 23.1, 7.4}),
		regression.DataPoint(15.7, []float64{2793000, 19.1, 5.8}),
		regression.DataPoint(36.2, []float64{741000, 24.7, 8.6}),
		regression.DataPoint(18.1, []float64{625000, 18.6, 6.5}),
		regression.DataPoint(28.9, []float64{854000, 24.9, 8.3}),
		regression.DataPoint(14.9, []float64{716000, 17.9, 6.7}),
		regression.DataPoint(25.8, []float64{921000, 22.4, 8.6}),
		regression.DataPoint(21.7, []float64{595000, 20.2, 8.4}),
		regression.DataPoint(25.7, []float64{3353000, 16.9, 6.7}),
	)
	r.Run()

	fmt.Printf("Regression formula:\n%v\n", r.Formula)
	fmt.Printf("Regression:\n%s\n", r)
}

Note: You can also add data points one by one.

Once calculated you can print the data, look at the R^2, Variance, residuals, etc. You can also access the coefficients directly to use elsewhere, e.g.

// Get the coefficient for the "Inhabitants" variable 0:
c := r.Coeff(0)

You can also use the model to predict new data points:

prediction, err := r.Predict([]float64{587000, 16.5, 6.2})

Feature crosses are supported, so your model can capture fixed non-linear relationships:

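r.Train(
	regression.DataPoint(11.2, []float64{587000, 16.5, 6.2}),
)
// Add a new feature which is the first variable (index 0) to the power of 2.
// PowCross is qualified with the package name because it is called from outside the package.
r.AddCross(regression.PowCross(0, 2))
r.Run()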
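
To make the idea of a power cross concrete, here is a minimal standalone sketch of what such a feature amounts to: each observation gains one extra column holding an existing variable raised to a power. The helper name `powCross` is hypothetical and this is not the library's implementation, only an illustration of the transformation.

```go
package main

import (
	"fmt"
	"math"
)

// powCross returns a copy of x with one extra feature appended:
// x[i] raised to the given power. Hypothetical helper for illustration only;
// it is not taken from the regression package.
func powCross(x []float64, i int, power float64) []float64 {
	out := make([]float64, len(x), len(x)+1)
	copy(out, x)
	return append(out, math.Pow(x[i], power))
}

func main() {
	row := []float64{3, 16.5, 6.2}
	// The result keeps the three original values and appends row[0] squared.
	fmt.Println(powCross(row, 0, 2))
}
```

A linear model fitted on the extended rows is still linear in its coefficients, which is why a fixed non-linear relationship can be captured this way.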

regression's People

Contributors

chewxy, codelingobot, crhntr, dhowden, haarts, heiderich, marcsantiago, mish15, mjanda, olimpias, srisaro, timsimmons, updogliu

regression's Issues

Add Gradient Descent + Cost methods

How is the regression being tested? Hyper-parameters such as the learning rate don't seem to be things that can be passed in. How do we know the model is correct without calculating the cost? How does it take loss into consideration? Can the regression line be optimized using gradient descent?
Cost formula
// cost = 1/N * sum((y - (m*x+c))^2)

Gradient descent formulas
// cost = 1/N * sum((y - (m*x+c))^2)
// d(cost)/dm = 2/N * sum(-x * (y - (m*x+c)))
// d(cost)/dc = 2/N * sum(-(y - (m*x+c)))

This is referenced in:

https://www.youtube.com/watch?v=ZPd_fKyrX48
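
For readers comparing the two approaches, the formulas in this issue can be sketched as a small standalone program for the single-variable case y = m*x + c. This is only an illustration of gradient descent on the mean squared error, not something the library does; the function names, learning rate, step count, and data are all assumptions chosen for the example.

```go
package main

import "fmt"

// cost returns the mean squared error 1/N * sum((y - (m*x+c))^2).
func cost(xs, ys []float64, m, c float64) float64 {
	var sum float64
	for i, x := range xs {
		d := ys[i] - (m*x + c)
		sum += d * d
	}
	return sum / float64(len(xs))
}

// gradientDescent fits y = m*x + c by repeatedly stepping against the
// gradient of the cost. rate is the learning rate, steps the iteration count.
func gradientDescent(xs, ys []float64, rate float64, steps int) (m, c float64) {
	n := float64(len(xs))
	for s := 0; s < steps; s++ {
		var dm, dc float64
		for i, x := range xs {
			r := ys[i] - (m*x + c) // residual for this observation
			dm += -x * r           // accumulates sum(-x * (y - (m*x+c)))
			dc += -r               // accumulates sum(-(y - (m*x+c)))
		}
		m -= rate * 2 / n * dm
		c -= rate * 2 / n * dc
	}
	return m, c
}

func main() {
	xs := []float64{1, 2, 3, 4}
	ys := []float64{3, 5, 7, 9} // generated from y = 2x + 1
	m, c := gradientDescent(xs, ys, 0.05, 5000)
	fmt.Printf("m=%.3f c=%.3f cost=%.6f\n", m, c, cost(xs, ys, m, c))
}
```

Unlike a direct least-squares solve, this needs a learning rate that suits the scale of the data; too large a rate makes the iteration diverge.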

Function Run() has to have "division by 0" protection

Could someone make this change in this part of the code:
for i := n - 1; i >= 0; i-- {
	c[i] = qty.Get(i, 0)
	for j := i + 1; j < n; j++ {
		c[i] -= c[j] * reg.Get(i, j)
	}
	// practically add these lines
	regValue := reg.Get(i, i)
	if regValue == 0 {
		return errors.New("Division by 0 is not allowed.")
	}
	c[i] /= regValue
	// up to here
	// replacing this line c[i] /= reg.Get(i, i)
}

Thanks
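
As a standalone illustration of the requested guard, the sketch below performs the same back substitution on plain slices and returns an error when a diagonal entry is zero. The function name `backSubstitute`, the error value `errSingular`, and the use of `[][]float64` instead of the gonum matrix types are assumptions made to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// errSingular is a hypothetical sentinel error for this sketch.
var errSingular = errors.New("singular matrix: zero on the diagonal")

// backSubstitute solves r*c = b for c, where r is an n x n upper triangular
// matrix. It returns an error instead of dividing by zero.
func backSubstitute(r [][]float64, b []float64) ([]float64, error) {
	n := len(b)
	c := make([]float64, n)
	for i := n - 1; i >= 0; i-- {
		c[i] = b[i]
		for j := i + 1; j < n; j++ {
			c[i] -= c[j] * r[i][j]
		}
		if r[i][i] == 0 {
			return nil, errSingular
		}
		c[i] /= r[i][i]
	}
	return c, nil
}

func main() {
	// Well-posed system: 2*c0 + c1 = 4 and 4*c1 = 8.
	c, err := backSubstitute([][]float64{{2, 1}, {0, 4}}, []float64{4, 8})
	fmt.Println(c, err)

	// Zero on the diagonal: the guard returns an error.
	_, err = backSubstitute([][]float64{{1, 1}, {0, 0}}, []float64{1, 1})
	fmt.Println(err)
}
```

An exact comparison with zero only catches the degenerate case; a diagonal entry that is merely very small can still produce huge or non-finite coefficients, so a tolerance may be worth considering.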

Error at qr.QTo and qr.RTo

Hi,

I received the error below when I tried to build my app, because of this pull request: gonum/gonum#1090

go build

github.com/sajari/regression

../../../github.com/sajari/regression/regression.go:170:13: qr.QTo(nil) used as value
../../../github.com/sajari/regression/regression.go:171:15: qr.RTo(nil) used as value

To fix it, I changed the code in regression.go to:

var q, reg mat.Dense
qr.QTo(&q)
qr.RTo(&reg)

Support automatically generating feature crosses

Heya,

Sick presentation Hamish.

Would you consider adding support for feature crosses (square, cubic, etc.)? I think it would be pretty easy. It would also boost the dank score of the repo.

I think it could be done as a thin wrapper function around DataPoint().

What's the best practice to save a model

What's the best way to save a model for reuse? With the unexported fields, I'm not seeing any way to store a model in a file or database and load it back into an application.
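
One possible approach, sketched here under assumptions, is to copy the fitted intercept and coefficients into a small exported struct and serialise that instead of the regression object. The type `SavedModel` and the functions `save` and `load` are hypothetical names, the numbers are example values, and how the library's coefficient indices map to the intercept and variables should be checked against the library before relying on this.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SavedModel is a hypothetical container for a fitted linear model.
// Its fields are exported so that encoding/json can see them.
type SavedModel struct {
	Intercept float64   `json:"intercept"`
	Coeffs    []float64 `json:"coeffs"`
}

// Predict evaluates intercept + sum(coeffs[i]*x[i]).
func (m SavedModel) Predict(x []float64) (float64, error) {
	if len(x) != len(m.Coeffs) {
		return 0, fmt.Errorf("expected %d variables, got %d", len(m.Coeffs), len(x))
	}
	y := m.Intercept
	for i, c := range m.Coeffs {
		y += c * x[i]
	}
	return y, nil
}

// save encodes the model as JSON, suitable for a file or a database column.
func save(m SavedModel) ([]byte, error) {
	return json.Marshal(m)
}

// load decodes a model previously produced by save.
func load(blob []byte) (SavedModel, error) {
	var m SavedModel
	err := json.Unmarshal(blob, &m)
	return m, err
}

func main() {
	// Example values only; in practice they would come from a fitted regression.
	m := SavedModel{Intercept: 1.5, Coeffs: []float64{2, -0.5}}
	blob, err := save(m)
	if err != nil {
		panic(err)
	}
	loaded, err := load(blob)
	if err != nil {
		panic(err)
	}
	y, err := loaded.Predict([]float64{3, 4})
	fmt.Println(string(blob), y, err)
}
```

This only preserves what is needed for prediction. Training data, residuals, and any feature crosses would have to be stored separately if they are needed after loading.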

Meaning of "initialised"

What is the meaning of initialised in the Regression type?

The error returned in

regression/regression.go

Lines 55 to 57 in 7932f0e

if !r.initialised {
	return 0, errNotEnoughData
}

suggests that it becomes true when there is enough data. In

regression/regression.go

Lines 107 to 109 in 7932f0e

if len(r.data) > 2 {
	r.initialised = true
}

it is set to true if len(r.data) > 2. Why this condition?

The output of the following program

package main

import (
	"fmt"
	"github.com/sajari/regression"
)

func main() {
	r := new(regression.Regression)
	r.SetObserved("obversed")
	r.SetVar(0, "var")
	r.Train(
		regression.DataPoint(1, []float64{1}),
		regression.DataPoint(2, []float64{2}),
		//regression.DataPoint(2, []float64{2}),
	)
	r.Run()

	fmt.Printf("Regression formula:\n%v\n", r.Formula)
	fmt.Printf("Regression:\n%s\n", r)
}

is

Regression formula:

Regression:
Not enough data points

With the line

regression.DataPoint(2, []float64{2}),
instead of

//regression.DataPoint(2, []float64{2}),

the output becomes

Regression formula:
Predicted = 0.00 + var*1.00
Residuals:
observed|	Predicted|	Residual
1.00|	1.00|	-0.00
2.00|	2.00|	0.00
2.00|	2.00|	0.00


Regression:
obversed|	var
1.00|	1.00
2.00|	2.00
2.00|	2.00

N = 3
Variance observed = 0.2222222222222222
Variance Predicted = 0.22222222222222196
R2 = 0.9999999999999989

I would not expect any of these outputs.

Out of memory issue on 100k data points

runtime: VirtualAlloc of 80000000000 bytes failed with errno=1455
fatal error: out of memory

Crashing here (calling gonum code):

qr.QTo(q)

Is there any method, or plans to support larger data sets?

The actual size of the data set is only about 2 MB, so the 80GB virtual alloc would suggest that there are some serious scaling issues with the current implementation. I see that the actual crash is occurring in the gonum code, but I'm just wondering if there's a solution here to make this library usable with large datasets.
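
The size of the failed allocation is consistent with a dense square matrix whose side is the number of observations: 100,000 × 100,000 float64 values at 8 bytes each is exactly 80,000,000,000 bytes. The sketch below only reproduces that arithmetic. It assumes the decomposition materialises a full n × n orthogonal factor, and the variable count p = 3 is an arbitrary example rather than a value from the report.

```go
package main

import "fmt"

// denseBytes returns the storage needed for a rows x cols matrix of float64.
func denseBytes(rows, cols int64) int64 {
	return rows * cols * 8
}

func main() {
	// 100k observations; p = 3 variables is an arbitrary example value.
	const n, p = 100000, 3
	fmt.Println("full n x n matrix:", denseBytes(n, n), "bytes")
	fmt.Println("n x p data matrix:", denseBytes(n, p), "bytes")
}
```

Under that assumption memory grows with the square of the number of observations, while the data itself grows only linearly, which would explain how a data set of a few megabytes leads to an 80 GB request.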

Runtime indicator?

Looking at this purely from a black-box perspective, what should I expect with regard to runtime complexity?

I have over 30,000 data points on my first run that I wish to model, and the app has been sitting for some time now... can you provide some guidance as to how I can estimate runtime?

No change in predictions when new data is added for training.

If we use the same regression object to add additional training data (using Train(...)) and then call Run(), there seems to be no change in the predictions made.

On further inspection, it seems that Run() can be called only once. Is this expected behavior? So, if I'm constantly acquiring new data and want to increase my training data set and retrain the model, I would have to create a new object of regression.Regression and pass the union of the old and the new data?

Below is example code to illustrate the same.

	r = new(regression.Regression)
	r.SetObserved("Z")
	r.SetVar(0, "X")
	r.SetVar(1, "Y")
	r.Train(
		regression.DataPoint(12.59, []float64{3, 0.25}),
		regression.DataPoint(17.54, []float64{1, 0.40}),
		regression.DataPoint(24.14, []float64{1, 0.268}),
		regression.DataPoint(21.47, []float64{2, 0.35}),
	)
	r.Run()

	fmt.Printf("Regression formula:\n%v\n", r.Formula)

	prediction, _ := r.Predict([]float64{2, 0.30})
	fmt.Println("Prediction = " + strconv.FormatFloat(prediction, 'f', 3, 64))

	fmt.Println("adding new data points...")
	r.Train(
		regression.DataPoint(15.65, []float64{3, 0.45}),
		regression.DataPoint(13.35, []float64{2, 0.65}),
	)
	r.Run() // attempt to retrain using also the newly added data.

	fmt.Printf("Regression formula:\n%v\n", r.Formula)

	prediction, _ = r.Predict([]float64{2, 0.30}) // reuse prediction; an unused error variable would not compile
	fmt.Println("Prediction_new = " + strconv.FormatFloat(prediction, 'f', 3, 64))

The output of the above code is shown below. As we can see, the formula and the prediction haven't changed.

Regression formula:
Predicted = 33.43 + X*-4.47 + Y*-21.06
Prediction = 18.176
adding new data points...
Regression formula:
Predicted = 33.43 + X*-4.47 + Y*-21.06
Prediction_new = 18.176

Multivariate multiple regression analysis possible?

I'm trying to run a regression analysis with multiple independent and multiple dependent variables (multivariate multiple regression). Is it also possible to use your library for that? And if not, would you know another library to do that in golang? All tips are welcome!

The package does not build with the latest version of gonum

This commit in gonum changed the way the QTo and RTo methods are called: they no longer return the Dense pointer.

The fix is pretty straightforward: create a Dense and pass it to the methods.

In regression.go use this

q := new(mat.Dense)
reg := new(mat.Dense)
qr.QTo(q)
qr.RTo(reg)

Instead of this

q := qr.QTo(nil)
reg := qr.RTo(nil)

I made a PR with the fix :)

License

Hi,

Thanks for writing such a great library. We'd love to use this in production at our firm. Is it possible to specify the license that the code is under so we can determine whether we can use it?

Stephen

Is dense copy truly needed?

qtrd := mat.DenseCopyOf(qtr)

Couldn't

qtr := q.T()
qtrd := mat.DenseCopyOf(qtr)
qty := new(mat.Dense)
qty.Mul(qtrd, observed)

be

qtr := q.T()
qty := new(mat.Dense)
qty.Mul(qtr, observed)

The tests don't seem to break, and I can't see a reason why a copy is being made, since the original qtr is never mutated. Moreover, more memory is being used than necessary. Should this be optimized? Was this an oversight? Is there intent that I'm missing? The only real thing DenseCopyOf seems to be doing is converting the Matrix interface into the concrete type *Dense, which doesn't seem necessary because none of the underlying values of Dense are being used explicitly.

How does this actually work

I was expecting a hot loop with stochastic gradient descent or something but then I saw this in run():

c := make([]float64, n)
for i := n - 1; i >= 0; i-- {
	c[i] = qty.Get(i, 0)
	for j := i + 1; j < n; j++ {
		c[i] -= c[j] * reg.Get(i, j)
	}
	c[i] /= reg.Get(i, i)
}

That seems way too efficient for ML.

Can you point me to a Wikipedia article on this algorithm? :D
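
The quoted loop has the shape of back substitution on an upper triangular system, which is the final step of solving ordinary least squares through a QR decomposition. The derivation below is a sketch under that reading: it assumes X is the matrix of observations with full column rank, that X = QR, and that `reg` and `qty` in the code correspond to R and Qᵀy, as the variable names in the other issues suggest. Indices are zero-based to match the code.

```latex
\begin{aligned}
&\min_{c}\ \lVert y - Xc \rVert_2^2,
  \qquad X = QR,\quad Q^\top Q = I,\quad R \text{ upper triangular} \\[4pt]
&X^\top X\, c = X^\top y
  \;\Longrightarrow\; R^\top R\, c = R^\top Q^\top y
  \;\Longrightarrow\; R\, c = Q^\top y \\[4pt]
&c_i = \frac{(Q^\top y)_i - \sum_{j=i+1}^{n-1} R_{ij}\, c_j}{R_{ii}},
  \qquad i = n-1,\ n-2,\ \dots,\ 0
\end{aligned}
```

Because the solution is obtained in closed form, no iteration or learning rate is involved. The division by R_ii is also where a rank-deficient input would cause trouble.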

NaN & Inf all over the place

Hi,

Thanks for your repository.

I am trying to use it with my own data, and the regression comes up with something like this:
Predicted = NaN + Subsystems*NaN + Directories*NaN + Files*NaN + Entrophy*+Inf + LineAdded*0.00 + LineDeleted*-0.00 + LineTotal*-0.00 + Devs*-0.05 + Age*0.00 + UniqueChange*0.00 + Exp*-0.00 + RExp*-0.00 + Sexp*0.00 + AuthorID*0.08

Here's the relevant code:

	r.SetObserved("Buggy commit")

	r.SetVar(0, "Subsystems")
	r.SetVar(1, "Directories")
	r.SetVar(2, "Files")
	r.SetVar(3, "Entrophy")
	r.SetVar(4, "LineAdded")
	r.SetVar(5, "LineDeleted")
	r.SetVar(6, "LineTotal")
	r.SetVar(7, "Devs")
	r.SetVar(8, "Age")
	r.SetVar(9, "UniqueChange")
	r.SetVar(10, "Exp")
	r.SetVar(11, "RExp")
	r.SetVar(12, "Sexp")
	r.SetVar(13, "AuthorID")

	trainingSetSize := int(len(commits) / 100 * 70)

	fmt.Println("trainingSetSize", trainingSetSize)

	for index := 0; index <trainingSetSize100; index++ {

		observedValue := -1.0

		if commits[index].ContainsBug {
			observedValue = 1.0
		}

		values := []float64{
			float64(commits[index].Subsystems),
			float64(commits[index].Directories),
			float64(commits[index].Files),
			float64(commits[index].Entrophy),
			float64(commits[index].LineAdded),
			float64(commits[index].LineDeleted),
			float64(commits[index].LineTotal),
			float64(commits[index].Devs),
			float64(commits[index].Age),
			float64(commits[index].UniqueChange),
			float64(commits[index].Exp),
			float64(commits[index].RExp),
			float64(commits[index].Sexp),
			float64(commits[index].AuthorID),
		}

		r.Train(
			regression.DataPoint(
				observedValue,
				values,
			),
		)
	}

	err = r.Run()

	if err != nil {
		panic(err)
	}

	fmt.Printf("Regression formula:\n%v\n", r.Formula)

Am I using this correctly? What am I missing?

Thanks,
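
One thing that may be worth checking before training, offered as a guess and not a diagnosis, is whether the input contains non-finite values or columns that never vary, since either can make a least-squares system degenerate. The helper `checkColumns` below is a hypothetical standalone sketch that works on plain slices and assumes every row has the same length.

```go
package main

import (
	"fmt"
	"math"
)

// checkColumns reports columns that contain NaN or Inf values, or whose
// value is the same in every row. Hypothetical helper for illustration.
func checkColumns(rows [][]float64) []string {
	var problems []string
	if len(rows) == 0 {
		return problems
	}
	for j := range rows[0] {
		constant := true
		for _, row := range rows {
			v := row[j]
			if math.IsNaN(v) || math.IsInf(v, 0) {
				problems = append(problems, fmt.Sprintf("column %d: non-finite value", j))
				constant = false
				break // one report per column is enough
			}
			if v != rows[0][j] {
				constant = false
			}
		}
		if constant {
			problems = append(problems, fmt.Sprintf("column %d: constant", j))
		}
	}
	return problems
}

func main() {
	rows := [][]float64{
		{1, 5, 0.2},
		{2, 5, 0.4},
		{3, 5, 0.1},
	}
	// The middle column never varies, so it is reported.
	for _, p := range checkColumns(rows) {
		fmt.Println(p)
	}
}
```

This does not detect a column that is a linear combination of other columns, which can also make the system singular. Rescaling variables with very different magnitudes is another thing to try.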
