
stat's Introduction

Gonum


Installation

The core packages of the Gonum suite are written in pure Go with some assembly. Installation is done using go get.

go get -u gonum.org/v1/gonum/...

Supported Go versions

Gonum supports and tests using the gc compiler on the two most recent Go releases on Linux (386, amd64 and arm64), macOS and Windows (both on amd64).

Note that floating point behavior may differ between compiler versions and between architectures due to differences in floating point operation implementations.

Release schedule

The Gonum modules are released on a six-month schedule aligned with the Go releases: when Go 1.x is released, Gonum v0.n.0 is released around the same time, and six months later, when Go 1.x+1 is released, Gonum v0.n+1.0 follows.

The release schedule, based on the current Go release schedule, is thus:

  • Gonum-v0.n.0: February
  • Gonum-v0.n+1.0: August

Build tags

The Gonum packages use a variety of build tags to set non-standard build conditions. Building Gonum applications will work without knowing how to use these tags, but they can be used during testing and to control the use of assembly and CGO code.

The current list of non-internal tags is as follows:

  • safe — do not use assembly or unsafe
  • bounds — use bounds checks even in internal calls
  • noasm — do not use assembly implementations
  • tomita — use the Tomita, Tanaka, Takahashi pivot choice for maximal clique calculation, otherwise use a random pivot (only in the topo package)

Issues

If you find any bugs, feel free to file an issue on the GitHub issue tracker. Discussions on API changes, added features, code review, or similar requests are preferred on the gonum-dev Google Group.

https://groups.google.com/forum/#!forum/gonum-dev

License

Original code is licensed under the Gonum License found in the LICENSE file. Portions of the code are subject to the additional licenses found in THIRD_PARTY_LICENSES. All third party code is licensed either under a BSD or MIT license.

Code in graph/formats/dot is dual licensed Public Domain Dedication and Gonum License, and users are free to choose the license which suits their needs for this code.

The W3C test suites in graph/formats/rdf are distributed under both the W3C Test Suite License and the W3C 3-clause BSD License.

stat's People

Contributors

armadilloa16, btracey, cjslep, dawny33, jonlawlor, kortschak, pradeep-pyro, rikonor, sbinet, vladimir-ch, zeroviscosity


stat's Issues

Add FDR correction

The false discovery rate (FDR) is a very common method of multiple-testing correction.

Adding an FDR method would benefit us a lot. Thanks!

Check GammaRand implementation

For {Alpha: 30.2, Beta: 1.7, Source: src}, according to the paper the acceptance rate should be very high, but the observed acceptance rate is very low.

Population variance vs sample variance

From the source code, the sum of squared deviations is routinely divided by n-1, not n. It seems that this package takes the sample variance and sample standard deviation as the defaults, with many methods based on them, while packages in other languages (numpy and scipy in Python) use the population variance and population standard deviation as defaults.
That is kind of misleading.

stat/distuv: incorrect length of suffStat parameter passed Normal.Fit

For the normal distribution, the suffStat obviously has a length of 2, for mu and sigma. But in the distuv/norm.go source code, the length of suffStat in the Fit method is set to 1. That causes a panic.

func (n *Normal) Fit(samples, weights []float64) {
	suffStat := make([]float64, 1) // HERE
	nSamples := n.SuffStat(samples, weights, suffStat)
	n.ConjugateUpdate(suffStat, nSamples, []float64{0, 0})
}

I think that should be set to 2.

mat64: SymDense.SymOuterK should zero data slice

This fails:

func TestSymOuter(t *testing.T) {
    x := NewVector(5, []float64{1, 2, 3, 4, 5})
    var s1, s2 SymDense
    s1.SymOuterK(x)
    s2.SymOuterK(x)
    s2.SymOuterK(x)
    if !Equal(&s1, &s2) {
        t.Error("unexpected result from repeat")
    }
}

distuv: Replace Entropy with DiffEntropy?

Entropy is defined as -\sum_i p(i) log(p(i)). There is an extension to continuous distributions where the sum is replaced with an integral, but these two values do not mean the same thing (the limit of the entropy formula goes to infinity for continuous distributions). As a result, we probably want to actively avoid allowing discrete and continuous distributions to satisfy the same Entropy interface. The right thing to do is rename Entropy to DifferentialEntropy, which is a mouthful, so DiffEntropy, DEntropy, EntropyD or something similar.

stat/distuv: introduce Func()

hi there,

what about having this:

func (n *Normal) Func() func(x float64) float64 {
	mu := n.Mu
	sigma := n.Sigma
	sigma2 := 2 * sigma * sigma
	root2pi := math.Sqrt(2 * math.Pi)
	return func(x float64) float64 {
		return 1 / (sigma * root2pi) * math.Exp(-((x - mu) * (x - mu)) / sigma2)
	}
}

(and similarly for the other distributions)

this would make it easy to compare (and plot) distributions and their underlying functions.

Histogram should panic if a data point is outside the dividers.

It seems to me that the histogram implementation doesn't make sense. The function documentation doesn't mention any behavior for when a data point is outside the dividers, and it says that count must have a length one less than dividers (not one more), implying that the data should lie between the bounds. Forcing the data to lie between the bounds matches NearestWithinSpan, and seems to me to be the better behavior; users can construct dividers with the values -Inf and +Inf if they would like. The actual behavior of the function does the opposite, so it's clear that something is wrong.

Correlation's signature is inefficient

The Correlation function:

func Correlation(x []float64, meanX, stdX float64, y []float64, meanY, stdY float64, weights []float64) float64 {
    return Covariance(x, meanX, y, meanY, weights) / (stdX * stdY)
}

has a very long signature, and because it takes the standard deviations and weights as input, the caller has to calculate them before calling Correlation. That is duplicative at the least, because Covariance gets about 95% of the way to calculating them again. It also admits cases where the weights used for the standard deviations do not match the weights passed to Correlation, which would probably be an error.

If we removed the standard deviations from the signature, it would probably be at least as fast within the Correlation function, it would cost less before being called, and it would be simpler.

Thoughts?

stat/distuv: Adding F-distribution

Hello! I've been doing some statistical work with this package recently, and I find it very useful and practical.

However, the F-distribution (Fisher–Snedecor distribution) is still missing from the distuv module. It is required for the Levene test to assert equality of variances for Student's t-test on independent samples. So I think it is necessary to add the F-distribution to the type list; this would definitely make the package more powerful.

Distribution Interface

It seems like the dist package could use an interface that all distributions satisfy. Currently it only has ParameterMarshaler, but all of the distributions have:

CDF(x float64) float64
Entropy() float64
ExKurtosis() float64
LogProb(x float64) float64
LogSurvival(x float64) float64
Mean() float64
Median() float64
NumParameters() int
Prob(x float64) float64
Quantile(p float64) float64
Rand() float64
Skewness() float64
StdDev() float64
Survival(x float64) float64
Variance() float64

Although it seems like some of the other methods (such as Mode and Fit) would also be candidates for a Distribution interface. I'd like to start a discussion on what the dist package's goals are as well.
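To make the proposal concrete, here is a sketch of what such an interface and a satisfying type could look like (the interface name, the chosen method subset, and the uniform01 type are all hypothetical):

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// Distribution is a sketch of the shared method subset discussed above.
type Distribution interface {
	CDF(x float64) float64
	Mean() float64
	Prob(x float64) float64
	Quantile(p float64) float64
	Rand() float64
}

// uniform01 is a toy continuous Uniform(0,1), used only to show that a
// concrete distribution can satisfy the interface.
type uniform01 struct{}

func (uniform01) CDF(x float64) float64 { return math.Min(math.Max(x, 0), 1) }
func (uniform01) Mean() float64         { return 0.5 }
func (uniform01) Prob(x float64) float64 {
	if x < 0 || x > 1 {
		return 0
	}
	return 1
}
func (uniform01) Quantile(p float64) float64 { return p }
func (uniform01) Rand() float64              { return rand.Float64() }

// Compile-time check that uniform01 satisfies Distribution.
var _ Distribution = uniform01{}

func main() {
	var d Distribution = uniform01{}
	fmt.Println(d.Mean(), d.CDF(0.25))
}
```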

proposal: reorganise stat/...

I think that stat/... probably warrant a reorganisation.

The central change is to split stat into stat, univariate and multivariate. univariate and multivariate would hold their respective functions and types and stat would hold common functions and types.

In addition, dist and sample would be renamed to have the suffix uv for univariate.

The decision about which packages current stat functions/types end up in is reasonably straightforward for multivariate, but univariate vs stat may be more contentious.

Here is my initial view - only things I have an opinion on are marked. Please change with an associated comment so we get this right.

stat:

  • Histogram
  • SortWeighted
  • CumulantKind

univariate:

multivariate:

  • CorrelationMatrix
  • CovarianceMatrix
  • PrincipalComponents

Edit: Checklist removed to prevent rendering in issues list. This can be resurrected if/when we decide to tackle this again.

A bug in combin.CombinationGenerator

I just found a bug in combin.CombinationGenerator. For n=30 and k=15, the generator produced nothing. The code is below:

func main() {
	cg := combin.NewCombinationGenerator(30, 15)
	com := make([]int, 15)
	for i := 0; cg.Next(); i++ {
		com = cg.Combination(com)
		fmt.Println(com)
	}
}

So far, the same situation occurs for:
n=30 and k=14 (up to 16)
n=40 and k=14 (up to 26)
n=50 and k=20 (up to 33)
...
It seems that most failing cases are when k is close to half of n.
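One plausible cause, consistent with failures clustering near k ≈ n/2 (a guess, not confirmed from the combin source), is integer overflow when the total count is computed via factorials: 30! vastly overflows int64, while C(30, 15) itself is only 155117520. The multiplicative form avoids large intermediates:

```go
package main

import "fmt"

// binomial computes C(n, k) multiplicatively. After step i the running
// value equals C(n-k+i, i), so every intermediate is itself a binomial
// coefficient and the integer division is always exact.
func binomial(n, k int) int {
	if k > n-k {
		k = n - k
	}
	b := 1
	for i := 1; i <= k; i++ {
		b = b * (n - k + i) / i
	}
	return b
}

func main() {
	fmt.Println(binomial(30, 15)) // → 155117520
}
```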

sample: Sampler interface?

The advanced sampling routines are implemented as functions at the moment. I wonder if they should be types instead. This would allow abstracting over the kind of sampling. The most common would be

// A Sampler can generate a set of values using an underlying sampling
// algorithm.
type Sampler interface {
	Sample([]float64)
}

This would allow other algorithms to be written on top of this interface.

stat: big data interface support

I've been thinking about how samples and weighted samples are represented in stat, and in dist's ConjugateUpdate methods, and it seems like we might be able to do something interesting by providing a set of interfaces that represent different ways of supplying sample data to measure.

Here's the rough outline:

type Sampler interface {
   Sample() []float64
}

type WeightedSampler interface {
   Sampler
   Weights() []float64
}

type Sample []float64

func (s Sample) Sample() []float64 {
   return s
}

type WeightedSample struct {
   Sample
   w []float64
}

func NewWeightedSample(s Sample, w []float64) *WeightedSample {
   // panic if len doesn't match
  return &WeightedSample{s, w}
}

func (ws *WeightedSample) Weights() []float64 {
  return ws.w
}

And then collapse functions like Mean to take a Sampler, perform a type check to see if it is a WeightedSampler, and then proceed from there.

This also ties into iterative methods where the data cannot be represented in memory easily. We could provide:

type SampleReader interface {
   SampleRead([]float64) int
}
// and equivalent with weights

type IterativeSample func([]float64) int 

func (is IterativeSample) SampleRead(buf []float64) int {
   return is(buf)
}

func (is IterativeSample) Sample() []float64 {
	buf := make([]float64, 1024)
	var res []float64
	for n := is(buf); n > 0; n = is(buf) {
		res = append(res, buf[:n]...)
	}
	return res // for use in cases where a SampleReader path doesn't exist
}

Which could read a portion of the sample in things like Mean. There would be a corresponding WeightedIterativeSampler, and additional types for multivariate samples, which would deal primarily in matrices. There is still some planning to do to allow things like buffered reading and writing.

The same types would be used in dist for ConjugateUpdate, Rand, and equivalent ones would exist in a multivariate version of dist.

Online versions of the various statistical measurements would take only a SampleReader or WeightedSampleReader, and would have "Memory" so that they could produce updated measurements on demand. Parallel methods could use several goroutines to call SampleRead(), and then combine the results at the end.

Any thoughts on these changes? The immediate benefit would be to simplify the signatures for functions on samples of data, and to provide a way for users to handle very large amounts of data.

edit: fix go code

CovarianceMatrix can't get the symmetric matrix

X := []float64{
	0, 2,
	1, 1,
	2, 0,
}

I use stat.CovarianceMatrix(nil, mat.NewDense(3, 2, X), nil) to calculate the covariance matrix, but I can't get a symmetric matrix. The result is [1 -1 0 1], not the expected [1 -1 -1 1]. What's wrong with it? Many other tests behave the same way.

installation of gonum/stat failed

Hi,

Apologize in advance if my question is naive, I'm very new to golang and gonum packages.

I tried installing gonum/stat with

 go get github.com/gonum/stat

but I get the following error:

matrix/mat64/vector.go:134: undefined: asm.DscalIncTo
matrix/mat64/vector.go:142: undefined: asm.DscalInc

I'm running go1.5.2 on ubuntu 14.04.3 LTS.

Thanks for any help,
Emmanuel

Median function in `stats`

Hi,

I was wondering - is there a desire to have a Median([]float64) float64 function in stats?
I find myself in need of it and instead of implementing it (which, admittedly, is simple enough) in my code was thinking it'd be nicer to be able to use a standard stats implementation.

I'm happy to open a PR but thought I'd ask before as there may be a reason there isn't one already.


x and weights should be a self-enforcing struct

Currently almost every method requires x and weights to be equal in length (or len(weights) to be 0). Each method panics if this isn't true. I think this is messy and places a burden on the user that shouldn't exist.

Further, as far as I can tell they're treated as a pair 100% of the time which expresses a sort of conceptual unity to me -- meaning we should really treat it as a single object rather than a pair that happens to be extremely sensitive to each other's states.

I propose moving this into a struct of some sort:

type DataSet struct {
    // Could be exported, but I prefer getters here
    x, weights []float64
}

func NewDataSet(x, weights []float64) (*DataSet, error) {
    if len(x) > len(weights) {
        for i := len(weights); i < len(x); i++ {
            weights = append(weights, 1)
        }
    } else if len(x) < len(weights) {
        // Alternatively, we could eschew the error and assign some default value to x like 0
        return nil, errors.New("Fewer values than weights, can not create samples")
    }

    return &DataSet{x: x, weights: weights}, nil
}

func (s *DataSet) AddData(x, w float64) {
	s.x = append(s.x, x)
	s.weights = append(s.weights, w)
}

// We could also do ChangeData, ChangeWeight, RemoveSample, SortAscending etc

This unifies the samples into a single conceptual structure, and removes a great deal of panics through the entire program. Though I suppose a benchmark would be in order to test the performance impact of this (it does add a pointer chase to and access of the data).
