Feature request
Problem
Fastmat offers a nice set of features for efficiently dealing with structured, sparse and other special matrices. Some users might create pretty advanced matrices which take time to compute, using the fastmat classes as containers to allow fast products. Storing these for later use (to disk) is not straightforward:
- as Cython is used, we cannot pickle them out of the box (as @ChristophWWagner mentioned)
- converting back to numpy and using its store functions loses all the structure and benefits of fastmat, so there is no point in doing that
- many fastmat matrix types exploit a typical structure and define the matrix by only a subset of values compared to the full matrix. Keeping these mechanisms is also beneficial when storing such matrices to disk
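To illustrate the last point: a circulant matrix, for instance, is fully determined by its first column, so storing n values suffices instead of n*n. A minimal numpy sketch (plain numpy, not fastmat code, just to show the idea behind such structure-exploiting classes):

```python
import numpy as np

# a circulant matrix is fully defined by its first column c;
# every further column is a cyclic shift of it
c = np.array([1, 2, 3, 4])
n = len(c)

C = np.empty((n, n), dtype=c.dtype)
for k in range(n):
    # column k is c rolled down by k positions
    C[:, k] = np.roll(c, k)
```

Storing just `c` (n values) to disk and rebuilding `C` on load keeps both the storage footprint and the structural information.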
Solution
I did some research on the topic but have not found a thorough solution yet.
First idea: Make fastmat Pickle-able
As I'm a Python newbie I was also new to pickle. I learned that pickle allows pretty convenient serialization of Python objects, e.g. for file IO. I also made up a small example which was pretty easy to implement. Consider some class like this:
import fastmat
class SomeBlockMatrix(fastmat.Matrix):
    def __init__(self, items):
        # items is a list of matrices (fastmat, numpy, scipy sparse)
        # that is accessed for calculation of products
        self._items = items
    [...]
Now we use this as:
import numpy
import scipy.sparse
a = numpy.random.randn(10, 20)
b = scipy.sparse.rand(10, 20)
A = SomeBlockMatrix([a, b, a])
B = SomeBlockMatrix([A, A])
But how do we store it to disk? To do so, we have to tell pickle how to pickle, which means we have to provide a __reduce__() function for SomeBlockMatrix. This function returns the class itself, so that pickle can instantiate a new object of that class upon loading. Furthermore, it returns a tuple of arguments that are passed to the constructor of the class, so that an object with the same content is initialized by pickle:
class SomeBlockMatrix(fastmat.Matrix):
    def __init__(self, items):
        # items is a list of matrices (fastmat, numpy, scipy sparse)
        # that is accessed for calculation of products
        self._items = items
    [...]

    # tell pickle how to pickle
    def __reduce__(self):
        # first element: the class
        # second element: tuple of arguments required by the constructor
        #   (note the trailing comma -- it must actually be a tuple,
        #   not just parentheses around self._items)
        # reference:
        # https://stackoverflow.com/questions/19855156/whats-the-exact-usage-of-reduce-in-pickler
        return (self.__class__, (self._items,))
This pretty much did it: we can now write this to disk and load it back, since every item is picklable itself:
import pickle

filename = 'test.mat'

# store to disk
with open(filename, 'wb') as f:
    pickle.dump([A, B], f)

# load from disk
with open(filename, 'rb') as f:
    C, D = pickle.load(f)

# with C == A and D == B
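For matrices that carry extra state not passed to the constructor (e.g. lazily computed caches), __reduce__ may optionally return a third element: a state object that pickle hands to __setstate__ after construction. A minimal sketch with a plain Python stand-in class (not actual fastmat API):

```python
import pickle

class CachedMatrix:
    """Toy stand-in for a matrix class with extra cached state."""
    def __init__(self, items):
        self._items = items
        self._cache = None  # filled lazily, not a constructor argument

    def __reduce__(self):
        # third element: a state dict, restored via __setstate__
        return (self.__class__, (self._items,), {'_cache': self._cache})

    def __setstate__(self, state):
        self.__dict__.update(state)

m = CachedMatrix([1, 2, 3])
m._cache = 'expensive result'

# round-trip through pickle keeps both items and cached state
m2 = pickle.loads(pickle.dumps(m))
```

Without the third element, `m2._cache` would be reset to `None` by the constructor on load.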
Note
When I tried to pickle Cython objects that have no pickle interface yet, such as fastmat matrices, I always ran into segmentation faults. There was no warning message, as there would be for pure Python objects.
Dill instead of Pickle
https://pypi.python.org/pypi/dill
dill extends python’s pickle module
I got some IOErrors when I called my pickling function to save a file from a different module than the one the load function resided in; the corresponding module was not found. There are some hints, e.g. in the discussion of https://stackoverflow.com/questions/2121874/python-pickling-after-changing-a-modules-directory, that this might not happen with dill, as it directly serializes the objects. Not tested by me, though.
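The difference shows up with objects that pickle stores only by reference (module plus qualified name), such as a lambda: pickle fails to serialize it, whereas dill would reportedly serialize the function body itself (not verified here). A small sketch of the pickle side:

```python
import pickle

# pickle stores functions by reference, not by value, so an
# anonymous function defined on the fly cannot be pickled
f = lambda x: x + 1

try:
    pickle.dumps(f)
    pickled_ok = True
except (pickle.PicklingError, AttributeError):
    pickled_ok = False

# pickled_ok is False; with dill, dill.dumps(f) is claimed to
# succeed because it serializes the object itself (untested)
```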
Further reads
Numpy is much faster at storing/loading matrices than pickle:
https://github.com/mverleg/array_storage_benchmark
Security issues of pickle:
https://www.synopsys.com/blogs/software-security/python-pickling/
More on Dill vs. pickle:
https://stackoverflow.com/questions/33968685/pickle-yet-another-importerror-no-module-named-my-module
Harsh corner cases when pickling on Linux and unpickling on Windows:
uqfoundation/dill#218
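Regarding the storage benchmark above: for dense arrays, a round-trip through numpy's native .npy format is straightforward to compare against pickle (with the caveat from the problem statement that going dense loses all fastmat structure). A minimal in-memory sketch:

```python
import io
import pickle

import numpy as np

a = np.random.randn(100, 100)

# numpy's native .npy format
buf_np = io.BytesIO()
np.save(buf_np, a)
buf_np.seek(0)
b = np.load(buf_np)

# pickle, for comparison
buf_pkl = io.BytesIO()
pickle.dump(a, buf_pkl)
buf_pkl.seek(0)
c = pickle.load(buf_pkl)
```

Both round-trips reproduce the array exactly; the benchmark linked above compares their speed and file sizes.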