johnsyweb / python_sparse_list Goto Github PK
View Code? Open in Web Editor NEWA list where most values will be None (or default)
Home Page: http://stackoverflow.com/q/17522753/78845
License: MIT License
A list where most values will be None (or default)
Home Page: http://stackoverflow.com/q/17522753/78845
License: MIT License
Your yummy test-suite has spurred me to try my hand at making a quick'n dirty implementation of a sparse list using not a dictionary of keys, but linked lists : two numpy.ndarrays
internally, one of integers for the keys, and one for the values, whose dtype
is left to the choice of the user.
If this works, I'll probably make a sparse_vector lib, unless you think this lib should hold multiple implementations. (my inner sloth hopes that you do)
I called it vector because it's specialized for numerical values.
The idea is to make the init with initial values and the setting via list slicing (see PR #2) both faster.
I tried scipy.sparse
and pandas.Series
but they don't fit my immediate needs.
My constraints are :
1. Python only
2. Performance
3. Available now :)
I just want a simple sparse 1D numerical vector that works well with numpy.
I guess scipy.sparse
should make separate classes for 1D, 2D, 3D, and ND.
They only support 2D right now (and therefore 1D), with classes such as scipy.sparse.lil_matrix
, which is also a linked list implementation.
But the API does not feel right for vectors, look at the instantiation :
# data = [6.28, 1.62, 2.72, 2.675]
# indices = [42, 1e9, 6e6, 1] # not sorted (constraint)
vector = lil_matrix(
(values, (np.zeros(len(indices)), indices)), # data, (rowinds, colsinds)
shape=(1, 1e16)
)
Besides having to np.zeros(len(indices))
to fulfill the 2D signature, there's no settable default value,
which is zero. (and I need another)
And there's a bunch of 2D-only "features" such as :
vector[0] # same object as vector, shape (1, 1e16)
# it somewhat makes sense in 2D, but not in 1D
I have not tried sympy.matrices.sparse.SparseMatrix
yet.
Anyhow, I'm running benchmarks with sparse_vector :
import benchmark
import numpy as np
from sparse_list import SparseList
from sparse_vector import SparseVector
class BenchmarkSparse(benchmark.Benchmark):
each = 100 # number of runs
full_size = 100000 # total size of sparse lists and vectors
data_size = 1000 # actual data size of sparse lists and vectors
def setUp(self):
# Can also specify tearDown, eachSetUp, and eachTearDown
self.a = np.random.rand(self.data_size)
self.k = np.arange(self.data_size)
def test_sparse_list_init(self):
sl = SparseList(self.full_size, default_value=0)
def test_sparse_vector_init(self):
sv = SparseVector(self.full_size, default_value=0)
def test_sparse_list_set_with_iterables_in_slice(self):
sl = SparseList(self.full_size, default_value=0)
sl[self.k] = self.a
def test_sparse_list_set_with_iterables_on_init(self):
sl = SparseList(dict(zip(self.k, self.a)), default_value=0)
sl.size = self.full_size
def test_sparse_vector_set_with_iterables_in_slice(self):
sv = SparseVector(self.full_size, default_value=0)
sv[self.k] = self.a
def test_sparse_vector_set_with_iterables_on_init(self):
sv = SparseVector((self.k, self.a), default_value=0)
sv.size = self.full_size
if __name__ == '__main__':
benchmark.main(format="markdown", numberFormat="%.4g")
which right now yields :
Benchmark Report
================
BenchmarkSparse
---------------
name | rank | runs | mean | sd | timesBaseline
------------------------------------------|------|------|-----------|-----------|--------------
sparse list init | 1 | 100 | 1.771e-06 | 1.344e-06 | 1.0
sparse vector init | 2 | 100 | 4.199e-06 | 1.877e-06 | 2.37012113055
sparse vector set with iterables on init | 3 | 100 | 1.531e-05 | 7.01e-06 | 8.64468371467
sparse list set with iterables on init | 4 | 100 | 0.0004169 | 8.095e-05 | 235.347240915
sparse list set with iterables in slice | 5 | 100 | 0.0004247 | 7.585e-05 | 239.769851952
sparse vector set with iterables in slice | 6 | 100 | 0.008322 | 0.0005945 | 4698.04441454
Not optimal yet, but it's still quick'n dirty, I can clean that up some and maybe start comparing with lil_matrix
too.
>>> empty_sparse_list = SparseList(0)
>>> [None, None] == empty_sparse_list
True
Implementation of __lt__
also seems inconsistent with that of list
.
I think it would be nice to have the option to return a sparse list when taking the slice of a sparse_list. Currently it returns a standard (dense) list.
If the current behaviour of slicing is not to be changed, an alternate method could be provided for that.
>>> sl = SparseList(10)
>>> len(sl[0:2])
0
Deleting slices does not behave like Python's lists:
>>> standard_list = [1,2,3]
>>> del standard_list[:2]
>>> standard_list
[3]
>>> sparse_list = SparseList([1,2,3])
>>> del sparse_list[:2]
>>> sparse_list
[None, None, 3]
sl = sparse_list.SparseList(xrange(5), None)
del sl[3:3]
throws IndexError: range object index out of range
whereas it should just leave the list untouched.
Great work on Sparse List! I want to use a Sparse List, but I'm not sure the dictionary implementation is the most efficient for me.
My particular data is sparse, but bursty. While 95% of the indices are empty, the remaining 5% are broken up into relatively bursty chunks. Rather than store everything in a dictionary, it would be nice to store a dictionary of lists, where each list contains a range of data. This could make certain operations much more efficient.
The main use case for this is caching result sets. Say I have a query that has 1M results, and I want to cache the ones I pull into a list-type structure. If I pull results 1000 at a time, there will be many contiguous blocks, and it will be much more efficient to store the data in some sort of list of lists structure.
Is there any chance that this will be implemented here? Would you accept a pull request that supported this type of storage?
The data structure is not conceptually difficult, but there will be a bit of hairy code for merging contiguous blocks and going from one block to the next.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.