gonum / stat Goto Github PK

View Code? Open in Web Editor NEW

195.0 195.0 23.0 537 KB

Statistics package for Go [DEPRECATED]

Go 99.76% Shell 0.24%

go probability-distribution random-generation statistics stats

stat's Introduction

Gonum

Installation

The core packages of the Gonum suite are written in pure Go with some assembly. Installation is done using go get.

go get -u gonum.org/v1/gonum/...

Supported Go versions

Gonum supports and tests using the gc compiler on the two most recent Go releases on Linux (386, amd64 and arm64), macOS and Windows (both on amd64).

Note that floating point behavior may differ between compiler versions and between architectures due to differences in floating point operation implementations.

Release schedule

The Gonum modules are released on a six-month release schedule, aligned with the Go releases. i.e.: when Go-1.x is released, Gonum-v0.n.0 is released around the same time. Six months after, Go-1.x+1 is released, and Gonum-v0.n+1.0 as well.

The release schedule, based on the current Go release schedule is thus:

Gonum-v0.n.0: February
Gonum-v0.n+1.0: August

Build tags

The Gonum packages use a variety of build tags to set non-standard build conditions. Building Gonum applications will work without knowing how to use these tags, but they can be used during testing and to control the use of assembly and CGO code.

The current list of non-internal tags is as follows:

safe — do not use assembly or unsafe
bounds — use bounds checks even in internal calls
noasm — do not use assembly implementations
tomita — use Tomita, Tanaka, Takahashi pivot choice for maximimal clique calculation, otherwise use random pivot (only in topo package)

Issues

If you find any bugs, feel free to file an issue on the github issue tracker. Discussions on API changes, added features, code review, or similar requests are preferred on the gonum-dev Google Group.

https://groups.google.com/forum/#!forum/gonum-dev

License

Original code is licensed under the Gonum License found in the LICENSE file. Portions of the code are subject to the additional licenses found in THIRD_PARTY_LICENSES. All third party code is licensed either under a BSD or MIT license.

Code in graph/formats/dot is dual licensed Public Domain Dedication and Gonum License, and users are free to choose the license which suits their needs for this code.

The W3C test suites in graph/formats/rdf are distributed under both the W3C Test Suite License and the W3C 3-clause BSD License.

stat's People

Contributors

Stargazers

Watchers

Forkers

cjslep blanu shazow yonglehou jmptrader jacobxk sbinet darrenmcc armadilloa16 zihua mdimercurio livitki rikonor zeroviscosity evertlammerts bpgray johncsnyder larytet-go benluteijn

stat's Issues

stat/distmv: NewNormalChol doesn't actually set Cholesky factor

We first need a way to copy Cholesky decomposition.

Add FDR correction

The false discovery rate (FDR) is a very common method of multiple tests correction.
fdr

Adding a fdr method will benefits us a lot. Thanks!

stat/distmv: nil checks for sigma race against setSigma in Normal and Student's T

We protect setSigma with a once so it only gets set once, but this still races against checks that it is non-nil, for example in normal.ConditionNormalSingle

distuv: implement Chi Square distribution

dist: Implement categorical distribution

This will have a lot of overlap with sample/Weighted

Check GammaRand implementation

For {Alpha: 30.2, Beta: 1.7, Source: src}, according to the paper the acceptance rate should be very high. Looking at the actual reject rate it's very low.

Any interest in implementing CircularMean?

I've recently had to calculate a circular mean for a series of angles and noticed it's not implemented in gonum/stat.
If there's any interest I'd be happy to submit a PR for it.

distuv: Implement beta distribution

Something we need to have

Population variance vs sample variance

From the souce code, the sum of squared deviations is routinely divided by n-1, but not n. It seems that this package takes the sample variance and sample standard deviation as the default ones and many methods based on them, while packages of other languages (numpy, scipy of python) make the population variance and population std as defaults.
That is kind of misleading.

stat/distuv: incorrect length of suffStat parameter passed Normal.Fit

For normal distribution, the suffstat obviously has a length of 2, with mu and sigma. But in the distuv/norm.go source code, the length of suffStat in the Fit method is set to 1. That causes a panic.

func (n *Normal) Fit(samples, weights []float64) {

    suffStat := make([]float64, 1)                  //HERE

    nSamples := n.SuffStat(samples, weights, suffStat)

    n.ConjugateUpdate(suffStat, nSamples, []float64{0, 0})

}

I think that should be set to 2.

Minor spelling mistake

The correct spelling for KulbeckLeibler is KullbackLeibler.

mat64: SymDense.SymOuterK should zero data slice

This fails:

func TestSymOuter(t *testing.T) {
    x := NewVector(5, []float64{1, 2, 3, 4, 5})
    var s1, s2 SymDense
    s1.SymOuterK(x)
    s2.SymOuterK(x)
    s2.SymOuterK(x)
    if !Equal(&s1, &s2) {
        t.Error("unexpected result from repeat")
    }
}

distuv: Replace Entropy with DiffEntropy?

Entropy is defined as \sum_i p(i) log(p(i)). There is an extension to continuous distributions where one replaces the sum with an integral, but these two values do not mean the same thing (the limit of the entropy formula goes to infinity for continuous distributions). As a result, we probably want to actively avoid allowing discrete and continuous distributions to satisfy the same Entropy interface. The right thing to do is rename Entroy to DifferentialEntropy, which is a mouthful, so DiffEntropy, DEntropy, EntropyD or something similar.

stat/distuv: introduce Func()

hi there,

what about having this:

func (n *Normal) Func() func(x float64) float64 {
    mu := n.Mu
    sigma := n.Sigma
    sigma2 := 2*sigma*sigma
    root2pi := math.Sqrt(2*math.Pi)
    return func(x float64) float64 {
        return 1 / (sigma * root2pi) * math.Exp(-((x-mu)*(x-mu))/sigma2)    
    }

(and similarly for the other distributions)

this would allow to easily compare (and plot) distributions and their underlying function.

stat: Add PanicTests for stat.go file

Panic tests are needed for the stat.go file, which are not included currently: https://coveralls.io/builds/8112223/source?filename=stat.go

The LinearRegression and related modules do not have panic tests. So, including them would help cover most of the file properly.

Histogram should panic if a data point is outside the bin.

It seems to me like the histogram implementation doesn't make any sense. The function documentation doesn't mention anything about behavior when the data point is outside the dividers, and says that count must have a length of 1 less than dividers (not one more), implying that data should be between the bounds. Forcing to match between the bounds matches NearestWithinSpan, and seems to me to be the better behavior. Users can construct dividers with values -inf and +inf if they would like. The actual behavior of the function does the opposite, so it's clear that something is wrong.

Lognormal Distribution Parameters

Do you think I should use the 3 parameters lognormal or the (more common) 2 parameter one? There is detail at http://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm

Correlation's signature is inefficient

The Correlation function:

func Correlation(x []float64, meanX, stdX float64, y []float64, meanY, stdY float64, weights []float64) float64 {
    return Covariance(x, meanX, y, meanY, weights) / (stdX * stdY)
}

Has a very long signature, and due to the fact that it includes the standard deviations and weights as input, the caller has to calculate those before calling correlation, which is going to be duplicative at the least, because Covariance gets about 95% of the way to calculating them again. In addition, it admits cases where the weights of the standard deviations do not match the weights asked of the correlation, which would probably be an error.

If we were to remove the standard deviations from the sig, it would probably be at least as fast within the Correlation function, and it would not have as much cost before being called. It would also be simpler.

Thoughts?

stat/distuv: Adding F-distribution

Hello! I'm doing some statistical work with this package recently, and find it is very useful and practical.

However the F-distribution (Fisher–Snedecor distribution) is still missing from the distuv module, that is required for the Levene test to assert equality of variances for Student's T test of independent samples. So, I think it is necessary to add the F-distribution to the type list. This will definitely make the package more powerful.

Distribution Interface

It seems like the dist package could use an interface that all distributions satisfy. Currently it only has ParameterMarshaler, but all of the distributions have:

CDF(x float64) float64
Entropy() float64
ExKurtosis() float64
LogProb(x float64) float64
LogSurvival(x float64) float64
Mean() float64
Median() float64
NumParameters() int
Prob(x float64) float64
Quantile(p float64) float64
Rand() float64
Skewness() float64
StdDev() float64
Survival(x float64) float64
Variance() float64

Although it seems like some of the other methods (such as Mode and Fit) would also be candidates for a Distribution interface. I'd like to start a discussion on what the dist package's goals are as well.

proposal: reorganise stat/...

I think that stat/... probably warrant a reorganisation.

The central change is to split stat into stat, univariate and multivariate. univariate and multivariate would hold their respective functions and types and stat would hold common functions and types.

In addition, dist and sample would be renamed to have the suffix uv for univariate.

The decision about which packages current stat functions/types end up in is reasonably straightforward for multivariate, but univariate vs stat may be more contentious.

Here is my initial view - only things I have an opinion on are marked. Please change with an associated comment so we get this right.

stat:

Histogram
SortWeighted
CumulantKind

univariate:

multivariate:

CorrelationMatrix
CovarianceMatrix
PrincipalComponents

Edit: Checklist removed to prevent rendering in issues list. This can be resurrected if/when we decide to tackle this again.

A bug in combin.CombinationGenerator

I just found a bug in combin.CombinationGenerator. For n=30 and k=15, the generator produced nothing. The code is below:

func main()
        cg := combin.NewCombinationGenerator(30, 15)
        com := make([]int, 15)
        for i:=0;cg.Next();i++ {
                com=cg.Combination(com)
                fmt.Println(com)
        }
}

So far, Same situation occurs when
n=30 and k=14 (up to 16)
n=40 and k=14 (up to 26)
n=50 and k=20 (up to 33)
...
it seems that most cases are when k is close to half of n.

distmv: Allow initialization of Normal with specified cholesky

I think it's reasonably common for a user to want a series of multivariate normals with the same mean, but different covariances. It would be nice to allow construction with the cholesky decomposition so that it only has to be computed once.

stat/distuv: add an example for Normal

So far, there's no obvious usage example for, eg, distuv.Normal.

A simple example could go a long way for beginners.

{dist,distmv}: lack package doc comment

The packages dist and distmv lack package-level documentation.

CovarianceMatrix should take in and return a SymDense

Covariance matrices are symmetric (and positive semidefinite at that!), and should be returned as such.

stat/distuv: Fix discrepancy between Src and Source

Some of the distributions use Src as the field name, others use Source. They should all use the same name.

sample: Sampler interface?

The advanced sampling routines are implemented as functions at the moment. I wonder if they should be types instead. This would allow abstracting over the kind of sampling. The most common would be

// A Sampler can generate a set of values generated using an underlying sampling algorithm. The number of samples generated 
type Sampler interface{
       Sample([]float64)
}

This would allow other algorithms to be written on top of this interface.

stat: big data interface support

I've been thinking about how samples and weighted samples are represented in stat, and in dist, in the ConjugateUpdate methods, and it seems like we might be able to do something interesting by providing a set of interfaces to represent different ways of providing sample data to measure.

Here's the rough outline:

type Sampler interface {
   Sample() []float64
}

type WeightedSampler interface {
   Sampler
   Weights() []float64
}

type Sample []float64

func (s Sample) Sample() []float64 {
   return s
}

type WeightedSample struct {
   Sample
   w []float64
}

func NewWeightedSample(s Sample, w []float64) *WeightedSample {
   // panic if len doesn't match
  return &WeightedSample{s, w}
}

func (ws *WeightedSample) Weights() []float64 {
  return ws.w
}

And then collapse functions like Mean to take a Sampler, perform a type check to see if it is a WeightedSampler, and then proceed from there.

This also ties into iterative methods where the data cannot be represented in memory easily. We could provide:

type SampleReader interface {
   SampleRead([]float64) int
}
// and equivalent with weights

type IterativeSample func([]float64) int 

func (is IterativeSample) SampleRead(buf []float64) int {
   return is(buf)
}

func (is IterativeSample) Sample() []float64 {
   buf := make([]float64, 1023)
   res := make([]float64, 0)
   for n := is(buf); n > 0; {
      res = append(res, buf...)
   }
   return res // for use in cases where a SampleReader path doesn't exist
}

Which could read a portion of the sample in things like Mean. There would be a corresponding WeightedIterativeSampler, and additional types for multivariate samples, which would deal primarily in matrices. There is still some planning to do to allow things like buffered reading and writing.

The same types would be used in dist for ConjugateUpdate, Rand, and equivalent ones would exist in a multivariate version of dist.

Online versions of the various statistical measurements would take only a SampleReader or WeightedSampleReader, and would have "Memory" so that they could produce updated measurements on demand. Parallel methods could use several goroutines to call SampleRead(), and then combine the results at the end.

Any thoughts on these changes? The immediate benefit would be to simplify the signatures for functions on samples of data, and to provide a way for users to handle very large amounts of data.

edit: fix go code

CovarianceMatrix can't get the symmetric matrix

X :=  []float64{0,2,
		1,1,
		2,0,}

I use stat.CovarianceMatrix(nil,mat.NewDense(3,2, X),nil) in order to caculate the covariance of matrix.But i can't get the symmetric matrix.The result is [1 -1 0 1] not the [1 -1 -1 1].What's wrong with it?And many others tests like this.

installation of gonum/stat failed

Hi,

Apologize in advance if my question is naive, I'm very new to golang and gonum packages.

I tried installing gonum/stat with

 go get github.com/gonum/stat

but I get the following error:

matrix/mat64/vector.go:134: undefined: asm.DscalIncTo
matrix/mat64/vector.go:142: undefined: asm.DscalInc

I'm running go1.5.2 on ubuntu 14.04.3 LTS.

Thanks for any help,
Emmanuel

Median function in `stats`

Hi,

I was wondering - is there a desire to have a Median([]float64) float64 function in stats?
I find myself in need of it and instead of implementing it (which, admittedly, is simple enough) in my code was thinking it'd be nicer to be able to use a standard stats implementation.

I'm happy to open a PR but thought I'd ask before as there may be a reason there isn't one already.

x and weights should be a self-enforcing struct

Currently almost every method requires x and weights to be equal in len (or len(weights) to be 0). Each method will panic if this isn't true. I think this is messy, and places a burden on the user that shouldn't exist.

Further, as far as I can tell they're treated as a pair 100% of the time which expresses a sort of conceptual unity to me -- meaning we should really treat it as a single object rather than a pair that happens to be extremely sensitive to each other's states.

I propose moving this into a struct of some sort:

type DataSet struct {
    // Could be exported, but I prefer getters here
    x, weights []float64
}

func NewDataSet(x, weights []float64) (*DataSet, error) {
    if len(x) > len(weights) {
        for i := len(weights); i < len(x); i++ {
            weights = append(weights, 1)
        }
    } else if len(x) < len(weights) {
        // Alternatively, we could eschew the error and assign some default value to x like 0
        return nil, errors.New("Fewer values than weights, can not create samples")
    }

    return &DataSet{x: x, weights: weights}, nil
}

func (s *DataSet) AddData(x, w float64) {
    s.x = append(s.x, x)
    s.w = append(s.w, w)
}

// We could also do ChangeData, ChangeWeight, RemoveSample, SortAscending etc

This unifies the samples into a single conceptual structure, and removes a great deal of panics through the entire program. Though I suppose a benchmark would be in order to test the performance impact of this (it does add a pointer chase to and access of the data).

stat/distmv: Normal and StudentsT sigma field should not be accessed directly.

We currently rely on methods to call setSigma before accessing the sigma field. Instead, we should prevent any possible bugs by implementing a getSigma method rather than allowing direct access.