johnsyweb / python_sparse_list Goto Github PK

View Code? Open in Web Editor NEW

11.0 11.0 7.0 181 KB

A list where most values will be None (or default)

Home Page: http://stackoverflow.com/q/17522753/78845

License: MIT License

Python 100.00%

containers python

python_sparse_list's People

Stargazers

Watchers

Forkers

pombredanne goutte speedplane acdha robertour jaanauati wetneb

python_sparse_list's Issues

SparseVector

Your yummy test-suite has spurred me to try my hand at making a quick'n dirty implementation of a sparse list using not a dictionary of keys, but linked lists : two numpy.ndarrays internally, one of integers for the keys, and one for the values, whose dtype is left to the choice of the user.

If this works, I'll probably make a sparse_vector lib, unless you think this lib should hold multiple implementations. (my inner sloth hopes that you do)

I called it vector because it's specialized for numerical values.

The idea is to make the init with initial values and the setting via list slicing (see PR #2) both faster.

I tried scipy.sparse and pandas.Series but they don't fit my immediate needs.

My constraints are :
1. Python only
2. Performance
3. Available now :)

I just want a simple sparse 1D numerical vector that works well with numpy.

I guess scipy.sparse should make separate classes for 1D, 2D, 3D, and ND.
They only support 2D right now (and therefore 1D), with classes such as scipy.sparse.lil_matrix, which is also a linked list implementation.

But the API does not feel right for vectors, look at the instantiation :

# data = [6.28, 1.62, 2.72, 2.675]
# indices = [42, 1e9, 6e6, 1]  # not sorted (constraint)
vector = lil_matrix(
    (values, (np.zeros(len(indices)), indices)),  # data, (rowinds, colsinds)
    shape=(1, 1e16)
)

Besides having to np.zeros(len(indices)) to fulfill the 2D signature, there's no settable default value,
which is zero. (and I need another)

And there's a bunch of 2D-only "features" such as :

vector[0] # same object as vector, shape (1, 1e16)
# it somewhat makes sense in 2D, but not in 1D

I have not tried sympy.matrices.sparse.SparseMatrix yet.

Anyhow, I'm running benchmarks with sparse_vector :

import benchmark
import numpy as np

from sparse_list import SparseList
from sparse_vector import SparseVector


class BenchmarkSparse(benchmark.Benchmark):

    each = 100          # number of runs
    full_size = 100000  # total size of sparse lists and vectors
    data_size = 1000    # actual data size of sparse lists and vectors

    def setUp(self):
        # Can also specify tearDown, eachSetUp, and eachTearDown
        self.a = np.random.rand(self.data_size)
        self.k = np.arange(self.data_size)

    def test_sparse_list_init(self):
        sl = SparseList(self.full_size, default_value=0)

    def test_sparse_vector_init(self):
        sv = SparseVector(self.full_size, default_value=0)

    def test_sparse_list_set_with_iterables_in_slice(self):
        sl = SparseList(self.full_size, default_value=0)
        sl[self.k] = self.a

    def test_sparse_list_set_with_iterables_on_init(self):
        sl = SparseList(dict(zip(self.k, self.a)), default_value=0)
        sl.size = self.full_size

    def test_sparse_vector_set_with_iterables_in_slice(self):
        sv = SparseVector(self.full_size, default_value=0)
        sv[self.k] = self.a

    def test_sparse_vector_set_with_iterables_on_init(self):
        sv = SparseVector((self.k, self.a), default_value=0)
        sv.size = self.full_size


if __name__ == '__main__':
    benchmark.main(format="markdown", numberFormat="%.4g")

which right now yields :

Benchmark Report
================

BenchmarkSparse
---------------

                                     name | rank | runs |      mean |        sd | timesBaseline
------------------------------------------|------|------|-----------|-----------|--------------
                         sparse list init |    1 |  100 | 1.771e-06 | 1.344e-06 |           1.0
                       sparse vector init |    2 |  100 | 4.199e-06 | 1.877e-06 | 2.37012113055
 sparse vector set with iterables on init |    3 |  100 | 1.531e-05 |  7.01e-06 | 8.64468371467
   sparse list set with iterables on init |    4 |  100 | 0.0004169 | 8.095e-05 | 235.347240915
  sparse list set with iterables in slice |    5 |  100 | 0.0004247 | 7.585e-05 | 239.769851952
sparse vector set with iterables in slice |    6 |  100 |  0.008322 | 0.0005945 | 4698.04441454

Not optimal yet, but it's still quick'n dirty, I can clean that up some and maybe start comparing with lil_matrix too.

Check for matching list lengths in equality implementation

>>> empty_sparse_list = SparseList(0)
>>> [None, None] == empty_sparse_list
True

Implementation of __lt__ also seems inconsistent with that of list.

sparse_list[a:b] should return a sparse list itself

I think it would be nice to have the option to return a sparse list when taking the slice of a sparse_list. Currently it returns a standard (dense) list.
If the current behaviour of slicing is not to be changed, an alternate method could be provided for that.

Slices containing default values do not have the correct length

>>> sl = SparseList(10)
>>> len(sl[0:2])
0

Invalid implementation of `del my_list[a:b]`

Deleting slices does not behave like Python's lists:

>>> standard_list = [1,2,3]
>>> del standard_list[:2]
>>> standard_list
[3]
>>> sparse_list = SparseList([1,2,3])
>>> del sparse_list[:2]
>>> sparse_list
[None, None, 3]

Deleting an empty slice from a sparse list fails

sl = sparse_list.SparseList(xrange(5), None)
del sl[3:3]

throws IndexError: range object index out of range whereas it should just leave the list untouched.

Bursty Sparse List (List of Lists of Blocks)

Great work on Sparse List! I want to use a Sparse List, but I'm not sure the dictionary implementation is the most efficient for me.

Bursty Data

My particular data is sparse, but bursty. While 95% of the indices are empty, the remaining 5% are broken up into relatively bursty chunks. Rather than store everything in a dictionary, it would be nice to store a dictionary of lists, where each list contains a range of data. This could make certain operations much more efficient.

Use Case: Caching Result Sets

The main use case for this is caching result sets. Say I have a query that has 1M results, and I want to cache the ones I pull into a list-type structure. If I pull results 1000 at a time, there will be many contiguous blocks, and it will be much more efficient to store the data in some sort of list of lists structure.

Implementation

Is there any chance that this will be implemented here? Would you accept a pull request that supported this type of storage?

The data structure is not conceptually difficult, but there will be a bit of hairy code for merging contiguous blocks and going from one block to the next.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.