
livestats's Introduction

LiveStats - Online Statistical Algorithms for Python

LiveStats solves the problem of generating accurate statistics when your data set is too large to fit in memory or too costly to sort. Add your data to a LiveStats object, then call its methods to get statistical estimates of your data.

LiveStats doesn't keep any items in memory, only estimates of the statistics. This means you can calculate statistics on an arbitrary amount of data.

LiveStats supports Python 2.7+ and Python 3.2+ and doesn't rely on any external Python libraries.

Example usage

First, install LiveStats:

$ pip install LiveStats

When constructing a LiveStats object, pass in a list of the quantiles you wish to track. LiveStats stores 15 double values per quantile supplied.

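from livestats import livestats
from math import sqrt
import random

# Track the 25th, 50th, and 75th percentiles.
test = livestats.LiveStats([0.25, 0.5, 0.75])

# Endless stream of random floats in [0, 1); the sentinel 1 is never returned.
data = iter(random.random, 1)

for x in range(3):
    for y in range(100):
        test.add(next(data) * 100)

    # Report the running estimates after each batch of 100 values.
    print("Average {}, stddev {}, quantiles {}".format(test.mean(),
            sqrt(test.variance()), test.quantiles()))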

Easy.

There are plenty of other methods too, such as minimum, maximum, kurtosis, and skewness.

FAQ

How does this work?

LiveStats uses the P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations and other online statistical algorithms. I also wrote a post on where I got this idea.
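
To give a feel for the "other online statistical algorithms" part, here is a minimal sketch of one well-known streaming technique, Welford's method for the running mean and variance. It is an illustration only: the `RunningMoments` class and its method names are hypothetical and are not taken from the LiveStats source, and it computes the population variance, which may differ from what LiveStats reports.

```python
class RunningMoments(object):
    """Hypothetical helper: running mean and population variance (Welford)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Update the mean first, then accumulate the squared deviation
        # using both the old and the new mean.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance; returns 0.0 before any data is added.
        return self.m2 / self.count if self.count else 0.0


moments = RunningMoments()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    moments.add(value)

print("mean {}, variance {}".format(moments.mean, moments.variance()))
```

Only three numbers are kept no matter how many values are added, which is the same property that lets an online estimator handle an arbitrary amount of data.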

How accurate is it?

Very accurate. If you run livestats.py as a script with a numeric argument, it'll run some tests with that many data points. Once you get over 10,000 elements, the error relative to the actual quantiles is well below 1%. At 10,000,000 elements, the results are:

Uniform:    Avg%E 1.732260e-12 Var%E 2.999999e-05 Quant%E 1.315983e-05
Expovar:    Avg%E 9.999994e-06 Var%E 1.000523e-05 Quant%E 1.741774e-05
Triangular: Avg%E 9.988727e-06 Var%E 4.839340e-12 Quant%E 0.015595
Bimodal:    Avg%E 9.999991e-06 Var%E 4.555303e-05 Quant%E 9.047849e-06

That's the percent error for the cumulative moving average and the variance, plus the average percent error across three quantiles (25th, 50th, and 75th), for four different random distributions. Pretty good.

Why didn't you use NumPy?

I didn't think it would help that much. LiveStats doesn't work on large arrays and I wanted PyPy support, which NumPy currently lacks. I'm curious about any and all performance improvements, so please contact me if you think NumPy (or anything else) would help.

livestats's People

Contributors

cxxr, indraniel, pyup-bot, reardencode

livestats's Issues

P^2 algo for 99 percentile is very erratic

Try this simple code, which feeds in 10 values in [60, 61) and then keeps adding only low values in the range [50, 51) (the code below uses exactly 50).

Once the 10 high values make up no more than 1% of the data (about 1,000 samples in total), the 99th percentile should theoretically fall in [50, 51). As P^2 is only an estimate, I break out of the loop once the estimate drops below 52. This takes anywhere from 5,000 samples to forever.

Try running it several times and see.

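from livestats import livestats
from math import sqrt
import random

low = 50
high = 50*1.2  # 60
randomdata = iter(random.random, 1)  # endless stream of floats in [0, 1)

test = livestats.LiveStats([0.99])
count = 0
# Seed the estimator with 10 values in [60, 61).
for count in range(10):
    test.add(next(randomdata) + high)

# Then add only the low value until the 99th-percentile estimate drops below 52.
# (On Python 2, use xrange here to avoid building a 50,000,000-element list.)
for count in range(50000000):
    test.add(low)
    if count % 100 == 0:
        print("count {}: Average {}, stddev {}, quantiles {}".format(count, test.mean(), sqrt(test.variance()), test.quantiles()))
    # quantiles() yields (quantile, estimate) pairs; check the estimate.
    if test.quantiles()[0][1] < low + 2:
        break

print("Done: count {}: Average {}, stddev {}, quantiles {}".format(count, test.mean(), sqrt(test.variance()), test.quantiles()))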
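
For reference, the exact answer can be computed by keeping and sorting every sample, which is exactly what LiveStats avoids. The sketch below uses the nearest-rank definition of a percentile, which is an assumption (other definitions interpolate), and `nearest_rank_percentile` is a hypothetical helper, not part of LiveStats.

```python
import random


def nearest_rank_percentile(values, percent):
    """Exact percentile by the nearest-rank method (percent is an int, 1..100)."""
    ordered = sorted(values)
    # ceil(percent / 100 * n), done in integer arithmetic to avoid float rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


low, high = 50, 60

# Same input pattern as the issue: 10 values in [60, 61), then only the value 50.
samples = [high + random.random() for _ in range(10)]

first_low_count = None
for count in range(1, 2001):
    samples.append(low)
    if nearest_rank_percentile(samples, 99) < low + 2:
        first_low_count = count
        break

print("exact 99th percentile drops below {} after {} low values".format(
    low + 2, first_low_count))
```

Under this definition the exact 99th percentile reaches 50 once 990 low values have been added (1,000 samples in total), which gives a baseline for judging how long the estimate takes to catch up.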

Problem with median

[1, 5, 6, 7, 9, 12, 15, 19, 20]
For this set, LiveStats gives me 8.68209876543, but that is not a median of these values! It's not merely inaccurate; it simply isn't a median. A median must split the set into two parts of equal length, so this value is wrong.
Any ideas why this happens and how to fix it?
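
For comparison, the exact median of this small set can be checked with the standard library. A minimal sketch, assuming Python 3.4+ for the `statistics` module; the `reported` value is the figure quoted above and `percent_error` is a hypothetical helper, not part of LiveStats:

```python
import statistics

values = [1, 5, 6, 7, 9, 12, 15, 19, 20]
reported = 8.68209876543  # estimate quoted in this issue


def percent_error(estimate, actual):
    """Absolute error of the estimate as a percentage of the actual value."""
    return abs(estimate - actual) / abs(actual) * 100.0


exact = statistics.median(values)  # middle element of the sorted values
error = percent_error(reported, exact)

print("exact median {}, reported {}, percent error {:.2f}%".format(
    exact, reported, error))
```

The exact median is 9, so the reported estimate is off by roughly 3.5% on these nine values.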
