joblib / joblib Goto Github PK

View Code? Open in Web Editor NEW

3.7K 63.0 411.0 4.46 MB

Computing with Python functions.

Home Page: http://joblib.readthedocs.org

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.05% Python 99.17% Shell 0.72% Cython 0.06%

python parallel-computing caching multiprocessing threading memoization

joblib's Introduction

The homepage of joblib with user documentation is located on:

https://joblib.readthedocs.io

Getting the latest code

To get the latest code using git, simply type:

git clone https://github.com/joblib/joblib.git

If you don't have git installed, you can download a zip of the latest code: https://github.com/joblib/joblib/archive/refs/heads/main.zip

Installing

You can use pip to install joblib:

pip install joblib

from any directory or:

python setup.py install

from the source directory.

Dependencies

Joblib has no mandatory dependencies besides Python (supported versions are 3.8+).
Joblib has an optional dependency on Numpy (at least version 1.6.1) for array manipulation.
Joblib includes its own vendored copy of loky for process management.
Joblib can efficiently dump and load numpy arrays but does not require numpy to be installed.
Joblib has an optional dependency on python-lz4 as a faster alternative to zlib and gzip for compressed serialization.
Joblib has an optional dependency on psutil to mitigate memory leaks in parallel worker processes.
Some examples require external dependencies such as pandas. See the instructions in the Building the docs section for details.

Workflow to contribute

To contribute to joblib, first create an account on github. Once this is done, fork the joblib repository to have your own repository, clone it using 'git clone' on the computers where you want to work. Make your changes in your clone, push them to your github account, test them on several computers, and when you are happy with them, send a pull request to the main repository.

Running the test suite

To run the test suite, you need the pytest (version >= 3) and coverage modules. Run the test suite using:

pytest joblib

from the root of the project.

Building the docs

To build the docs you need to have sphinx (>=1.4) and some dependencies installed:

pip install -U -r .readthedocs-requirements.txt

The docs can then be built with the following command:

make doc

The html docs are located in the doc/_build/html directory.

Making a source tarball

To create a source tarball, eg for packaging or distributing, run the following command:

python setup.py sdist

The tarball will be created in the dist directory. This command will compile the docs, and the resulting tarball can be installed with no extra dependencies than the Python standard library. You will need setuptool and sphinx.

Making a release and uploading it to PyPI

This command is only run by project manager, to make a release, and upload in to PyPI:

python setup.py sdist bdist_wheel
twine upload dist/*

Note that the documentation should automatically get updated at each git push. If that is not the case, try building th doc locally and resolve any doc build error (in particular when running the examples).

Updating the changelog

Changes are listed in the CHANGES.rst file. They must be manually updated but, the following git command may be used to generate the lines:

git log --abbrev-commit --date=short --no-merges --sparse

joblib's People

Contributors

Stargazers

Watchers

Forkers

dagss yarikoptic gaelvaroquaux fabianp npinto agramfort jonovik qsnake rgommers bdholt1 kcarnold amueller schwarty ntmartin alexfargus vene larsmans msabramo akeshavan chrodan mluessi eraldop dohmatob cjauvin rgiot nlhepler jnothman mekman pgervais flxb pombredanne dstahlke pv ian-beaver pprett imclab krischer hxu jbzdak sturlamolden gcr justinvf-zz code-of-kpp wmayner psattige b-rich spot-app mechcoder brouberol banilo sethdandridge wisertogether onnokort ibayer jmdcal hakanardo escherba gerritholl lesteve iandewancker lizhangzhan export-default chapmanb cpakuna thomasantony smrhein joblib-ci arthurmensch aabadie simonzack arne-cl windy-ground youcannotburnmyshadow eq4 mreid-moz maxalbert tommoral mrefish giserh svisser zmoon111 yeongseon nielszeilemaker fredludlow chengguangnan naereen bag-of-projects salmista-94 yxlinaqua pyvelepor bloody76 bmabey jabooth harishmk khaledto tlevine rth last-g kdexd allankevinrichie

joblib's Issues

Issue in sklearn grid-search with joblib

I have some trouble using grid search with sklearn and joblib using a somewhat complicated estimator.
I get the following warning:
** (process:1040): WARNING **: 1 dictionaries weren't free'd.
And when all jobs are finished, the programm doesn't continue.
I have no idea what could be causing this. Do you?

The estimator contains nltk stuff and has loads of dicts with texts and tokens and whatnot.

python 3

I have an unusual behaviour when running this code with python 3.2:

import numpy as np
from joblib import Memory
memory = Memory( cachedir='./cache', verbose=True )

@memory.cache
def pippo( n ):
    return np.arange(n)

print( pippo(10) )

This results in:
prova.py:9: JobLibCollisionWarning: Cannot detect name collisions for function 'pippo'
print( pippo(10) )
WARNING:root:[MemorizedFunc(func=<function pippo at 0x7f02735dfa68>, cachedir='./cache/joblib')]: Clearing cache ./cache/joblib/main--home-davide-Desktop-prova/pippo

[Memory] Calling main--home-davide-Desktop-prova.pippo...
pippo(10)
____________________________________________________________pippo - 0.0s, 0.0min
[0 1 2 3 4 5 6 7 8 9]

Every works fine with python 2.7.

Cheers

instance method caching does not work when executed through ipython

consider the following code (taken almost verbatim from the docs):

import joblib
mem = joblib.Memory(cachedir='/tmp/joblib')

class Foo(object):
    def __init__(self):
        self.method = mem.cache(self.method)
    def method(self, x):
        pass

foo = Foo()
foo.method(42)

save it to a file called joblib_bug.py. If you run it through the standard python interpreter, it works.
if you run it through ipython, you get:

$ ipython joblib_bug.py
------------------------------------------------------------
Traceback (most recent call last):
  File "joblib_bug.py", line 11, in <module>
    foo.method(42)
[...]
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 748, in save_global
    (obj, module, name))
PicklingError: Can't pickle <class '__main__.Foo'>: it's not found as __main__.Foo

WARNING: Failure executing file: <joblib_bug.py>

I have no idea what this means. it is most probably not a bug in joblib, but maybe you know what's wrong with ipython.

ciao,
tiziano

Installing joblib on Archlinux

HI,

I have created a PKGBUILD for joblib, so that ArchLinux user may install it more easily. This is the link:

https://aur.archlinux.org/packages.php?ID=55574

You may want to update the installation page in the documentation.

Nice work,

Keep on.

Davide Lasagna

Ability to run closures in parallel.

A very silly example that fails with joblib due to pickling problems:

import numpy as np
from joblib import Parallel, delayed

def foo():
      x = np.random.randn(1000,100000)

      # Sum in parallel.
      def sum_row(i_row):
          return x[i_row].sum()

      results = Parallel(n_jobs=4)(delayed(sum_row)(i) for i in range(x.shape[0]))

Of course this is a silly example, but it is very powerful to be able to map a function across inputs in the current scope. This makes algorithms much easier to read and I can see many places in scikits-learn where this might be useful. It is for example possible to do this using my pmap module.

      ...
      import pmap
      pmap.pmap(f, range(x.shape[0])

I would prefer, however, to use joblib since it is has been picked up by the community and appears to have implemented some nice tricks to give better error reporting. Alas, it is missing this feature and I suspect that this is due to the use of a process pool, since I could not get that to work when I was writing pmap. If I remember correctly, this is because I need to create the processes in a scope which includes the closure. This appeared to be very hard/impossible to do with a pool. It is also possible that this would not jibe well with other joblib features.

Apologies for not opening a pull request but I had a hard time grokking joblibs approach because it appears considerably more advanced.

Error raised when caching in parallel and function code has changed

I am using joblib to cache calls in parallel. The same function is called in parallel with different args. The problem is when the code of the function has changed: I think that all processes try to update the file func_code.py, resulting in an error.

Here is the stacktrace:

Traceback (most recent call last):
  File "run.py", line 264, in <module>
    for (subject_series, movement) in zip(series, dataset.movement))
  File "/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/parallel.py", line 561, in __call__
    self.retrieve()
  File "/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/parallel.py", line 483, in retrieve
    raise exception_type(report)
joblib.my_exceptions.JoblibIOError/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/my_exceptions.py:26: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  self.message,
: JoblibIOError
___________________________________________________________________________
Multiprocessing exception:
    ...........................................................................
/mnt/neurospin/sel-poivre/tmp/aa013911/GSPCA_mu1.00_l12.50_a0.04/run.py in <module>()
    259         # recomputation when adding a step. Each step is cached independently
    260         # in the function.
    261         print ('\t\tExtracting subject information')
    262         result = Parallel(n_jobs=2)(delayed(runner.process_subject)(
    263             subject_series, regions, gm_index, movement, memory=regions_memory)
--> 264             for (subject_series, movement) in zip(series, dataset.movement))
    265         regions_series, confounds, covariance, precision = \
    266             zip(*result)
    267 
    268         # Explained variance

...........................................................................
/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/parallel.py in __call__(self=Parallel(n_jobs=2), iterable=<generator object <genexpr>>)
    556         self.n_dispatched = 0
    557         try:
    558             for function, args, kwargs in iterable:
    559                 self.dispatch(function, args, kwargs)
    560 
--> 561             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
    562             # Make sure that we get a last message telling us we are done
    563             elapsed_time = time.time() - self._start_time
    564             self._print('Done %3i out of %3i | elapsed: %s finished',
    565                         (len(self._output),

    ---------------------------------------------------------------------------
    Sub-process traceback:
    ---------------------------------------------------------------------------
    IOError                                            Tue Dec  3 07:46:50 2013
PID: 15711     Python 2.7.3: /home/aa013911/epd-7.3-2-rh5-x86_64/bin/python
...........................................................................
/home/aa013911/abraham_miccai2013/msdl/runner.pyc in process_subject(series=MemorizedResult(cachedir="../abide/joblib", func...argument_hash="eab6c74fde4a396a99076425288fa670"), regions=memmap([[-0.45718721, -0.47443968, -0.48146474, ...       -0.42739522, -0.41334313]], dtype=float32), gm_index=[False, True, True, False, False, True, True, True, False, False, True, True, True, True, False, False, False, True, True, True, ...], movement='../dataset/ABIDE/Leuven/Leuven_50683/rp_rest.txt', memory=Memory(cachedir='./model3_noop/joblib'))
    116 
    117     confounds = memory.cache(extract_confounds).call_and_shelve(
    118             regions_series, regions, gm_index=gm_index)
    119 
    120     covariance, precision = memory.cache(covariance_matrix)(
--> 121             regions_series, confounds=[movement, confounds])
    122 
    123     return regions_series, confounds, covariance, precision
    124 
    125 

...........................................................................
/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/memory.pyc in __call__(self=MemorizedFunc(func=<function covariance_matrix at 0xc51a0c8>, cachedir='./model3_noop/joblib'), *args=(MemorizedResult(cachedir="./model3_noop/joblib",...argument_hash="a5abfb42d4bc86e395de839fa65840c9"),), **kwargs={'confounds': ['../dataset/ABIDE/Leuven/Leuven_50683/rp_rest.txt', MemorizedResult(cachedir="./model3_noop/joblib",...argument_hash="e880120cd72429390ae89b5dabba252a")]})
    478         return MemorizedResult(self.cachedir, self.func, argument_hash,
    479             metadata=metadata, verbose=self._verbose - 1,
    480             timestamp=self.timestamp)
    481 
    482     def __call__(self, *args, **kwargs):
--> 483         return self._cached_call(args, kwargs)[0]
    484 
    485     def __reduce__(self):
    486         """ We don't store the timestamp when pickling, to avoid the hash
    487             depending from it.

...........................................................................
/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/memory.pyc in _cached_call(self=MemorizedFunc(func=<function covariance_matrix at 0xc51a0c8>, cachedir='./model3_noop/joblib'), args=(MemorizedResult(cachedir="./model3_noop/joblib",...argument_hash="a5abfb42d4bc86e395de839fa65840c9"),), kwargs={'confounds': ['../dataset/ABIDE/Leuven/Leuven_50683/rp_rest.txt', MemorizedResult(cachedir="./model3_noop/joblib",...argument_hash="e880120cd72429390ae89b5dabba252a")]})
    425             # Compare the function code with the previous to see if the
    426             # function code has changed
    427             output_dir, argument_hash = self._get_output_dir(*args, **kwargs)
    428         metadata = None
    429         # FIXME: The statements below should be try/excepted
--> 430         if not (self._check_previous_func_code(stacklevel=4) and
        t = undefined
    431                                  os.path.exists(output_dir)):
    432             if self._verbose > 10:
    433                 _, name = get_func_name(self.func)
    434                 self.warn('Computing func %s, argument hash %s in '

...........................................................................
/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/memory.pyc in _check_previous_func_code(self=MemorizedFunc(func=<function covariance_matrix at 0xc51a0c8>, cachedir='./model3_noop/joblib'), stacklevel=4)
    618         # XXX: Should be using warnings, and giving stacklevel
    619         if self._verbose > 10:
    620             _, func_name = get_func_name(self.func, resolv_alias=False)
    621             self.warn("Function %s (stored in %s) has changed." %
    622                         (func_name, func_dir))
--> 623         self.clear(warn=True)
    624         return False
    625 
    626     def clear(self, warn=True):
    627         """ Empty the function's cache.

...........................................................................
/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/memory.pyc in clear(self=MemorizedFunc(func=<function covariance_matrix at 0xc51a0c8>, cachedir='./model3_noop/joblib'), warn=True)
    632         if os.path.exists(func_dir):
    633             shutil.rmtree(func_dir, ignore_errors=True)
    634         mkdirp(func_dir)
    635         func_code, _, first_line = get_func_code(self.func)
    636         func_code_file = os.path.join(func_dir, 'func_code.py')
--> 637         self._write_func_code(func_code_file, func_code, first_line)
    638 
    639     def call(self, *args, **kwargs):
    640         """ Force the execution of the function with the given arguments and
    641             persist the output values.

...........................................................................
/home/aa013911/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages/joblib-0.7.1-py2.7.egg/joblib/memory.pyc in _write_func_code(self=MemorizedFunc(func=<function covariance_matrix at 0xc51a0c8>, cachedir='./model3_noop/joblib'), filename='./model3_noop/joblib/msdl/covariance/covariance_matrix/func_code.py', func_code='# first line: 7\ndef covariance_matrix(series, co...np.eye(series.shape[1]), np.eye(series.shape[1])\n', first_line=7)
    546 
    547     def _write_func_code(self, filename, func_code, first_line):
    548         """ Write the function code and the filename to a file.
    549         """
    550         func_code = '%s %i\n%s' % (FIRST_LINE_TEXT, first_line, func_code)
--> 551         with open(filename, 'w') as out:
    552             out.write(func_code)
    553 
    554     def _check_previous_func_code(self, stacklevel=2):
    555         """

IOError: [Errno 2] No such file or directory: './model3_noop/joblib/msdl/covariance/covariance_matrix/func_code.py'
___________________________________________________________________________

Parallel is broken under Python 3.4

Python 3.4 has introduced a new context system to support several multiprocessing implementation (among which the new forkeserver mode). This change has caused some internal helper tool used by joblib to move.

In particular the assert_spawning helper has moved from multiprocessing.forking to the new multiprocessing.context module.

There might be other changes to take into account.

joblib and py2exe (or pyinstaller)^1

When creating an windows executable with py2exe, and run it I get the error

raise ImportError('[joblib] Attempting to do parallel computing'

ImportError: [joblib] Attempting to do parallel computingwithout protecting your
import on a system that does not support forking. To use parallel-computing in
a script, you must protect you main loop using "if name == 'main'". Plea
se see the joblib documentation on Parallel for more information

The same script works when running it directly in python.
The same problem seems to be when using pyinstaller (scikit-learn/scikit-learn#2114)

`verbose` parameter in parallel.Parallel is documented twice

verbose parameter has two docstrings, the first is more informative:
https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L175
https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L201

The second can be safely removed.

Some basic joblib questions...

Hi,

I just recently discovered joblib, and I had a few quick questions.

When I set n_jobs=-1 in multiprocessing, python uses 8 threads on my quad-core macbook pro. But with joblib.Parallel(n_jobs=-1), it appears to be using only 4 threads? This the same function, just called via joblib. The only reason I'm asking is that the sklearn random forests implementation (which presumably runs on top of joblib.parallel) uses all 8 threads. Am I doing something wrong?
How do you specify the "chunk size" using joblib.Parallel?
The kernel frequently dies when running joblib.Parallel in ipython notebook, although I've never had any such issues when running sklearn. Are there some tricks I should be aware of?

Thanks,

Vishal

joblib parallelism works launched by console but not when launched by script

Launching a job that employs joblib as parallel mechanism works only when it is launched from the python console, but not when it is launched by an imported script.

Reproducible example:

import numpy
from joblib import Parallel, delayed

def Foo(i, data):
    print str(i)
    return i

def ParFoo():
    list_out = Parallel(n_jobs=-1, verbose=10)(delayed(Foo)(i, numpy.random.random((10000, 1000))) for i in range(5))

print 'run ParFoo'
ParFoo()
print 'done'

Run this from bash: $ python script.py
and it spawns child processes and completes quickly.

Launch python console, comment out "ParFoo()" at the end of the script, import script and launch by script.ParFoo() and it spawns child processes and completes quickly.

However, leave "ParFoo()" as is, launch python and launch by importing script, it spawns child processes and then just sits there. No visible CPU activity, plenty of RAM available and no visible disk IO. It's just stuck.

If you want an example with more visible cpu activity, put some heavy labor into Foo()

Parallel and IPython

First of all: joblib is great, thanks! I use joblib's Parallel a lot in combination with IPython. However, I often run into the following problem: When I parallelize a function that print something on the screen (and even if it is only a numpy overflow warning) and I use IPython qtconsole or notebook, either the IPython kernel dies or the function simple gets stuck and never returns. This happens always in the described setting. Using the standard IPython console, everything is fine.
Unfortunately I have no clue how to debug this issue. Can you reproduce the problem? Can you give me any hints?

memory error when using numpy pickle with compression

I wrote a script which loads a large sparse matrix (250 million non-zero values) from a text file then pickle it with joblib. This works fine without compression but I get a MemoryError exception when using compress=9.

As far as I can see, this comes from the fact that joblib converts the data to a big compressed string as can be seen below:

def write_zfile(file_handle, data, compress=1):
    """Write the data in the given file as a Z-file.

Z-files are raw data compressed with zlib used internally by joblib
for persistence. Backward compatibility is not guarantied. Do not
use for external purposes.
"""
    file_handle.write(_ZFILE_PREFIX)
    length = hex(len(data))
    if sys.version_info[0] < 3 and type(length) is long:
        # We need to remove the trailing 'L' in the hex representation
        length = length[:-1]
    # Store the length of the data
    file_handle.write(asbytes(length.ljust(_MAX_LEN)))
    file_handle.write(zlib.compress(asbytes(data), compress))

So, at one point, the data exist twice in memory (once in compressed and oncee in uncompressed form).

I don't know the joblib internals so I could well be wrong but it seems to me that a potential solution would be to use gzip.GzipFile (http://docs.python.org/2/library/gzip.html#module-gzip) instead of zlib.compress. The goal of using GzipFile would be to allow Python to write the compressed file by chunks.

Mathieu

Temporary files aren't cleaned up

Migration of scikit-learn issue 2553 by @yarikoptic: joblib (at least the version in scikit-learn) leaves temporary files lying around.

$ mkdir tmp; TMPDIR=$PWD/tmp nosetests -s -v sklearn
$ ls tmp
joblib  tmp5fI12n  tmpRAx9RT
$ ls tmp/tmp*
tmp/tmp5fI12n:
joblib

tmp/tmpRAx9RT:
joblib

'matrix' object has no attribute '__array_prepare__' on numpy 1.3.0

that happens on elderlyish (but LTShaha) ubuntus (<=10.10):

======================================================================
ERROR: joblib.test.test_numpy_pickle.test_numpy_persistence
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/nose/case.py", line 183, in runTest
    self.test(*self.arg)
  File "/tmp/buildd/joblib-0.6.0~b2/joblib/test/test_numpy_pickle.py", line 171, in test_numpy_persistence
    obj_ = numpy_pickle.load(this_filename)
  File "/tmp/buildd/joblib-0.6.0~b2/joblib/numpy_pickle.py", line 411, in load
    obj = unpickler.load()
  File "/usr/lib/python2.6/pickle.py", line 858, in load
    dispatch[key](self)
  File "/tmp/buildd/joblib-0.6.0~b2/joblib/numpy_pickle.py", line 283, in load_build
    array = nd_array_wrapper.read(self)
  File "/tmp/buildd/joblib-0.6.0~b2/joblib/numpy_pickle.py", line 120, in read
    new_array.__array_prepare__(array)
AttributeError: 'matrix' object has no attribute '__array_prepare__'

Joblib predict error

When I try to load the classifier that was saved using joblib.dump, I see an error.

Sample Code:

from sklearn.externals import joblib


# Train the classifier on a dataset of text
# documents with 
# OneVsRestClassifier(SGDClassifier(loss=log, n_iter=35))

# classifier object dumped using joblib.dump,
# without compression, for later use.

# Load vectorizer  to be used for getting the document vector.
# TfidfVectorizer(stop_words='english', smooth_idf=True, sublinear_tf=True, 
#            token_pattern=ur'\b(?!\d)\w\w+\b', ngram_range=(1, 2), use_idf=False)


print 'Loading vectorizer...'
vectorizer = joblib.load("/home/n7/classifier/vectorizer.joblib")

print 'Loading classifier...'
classifier = joblib.load("/home/n7/classifier/classifier.joblib")

with open("topredict.txt") as f:
    document = f.read()

document_vector = vectorizer.transform([document])
predict=classifier.predict(document_vector)

The output seen is:

$ python test.py 
Traceback (most recent call last):
  File "test.py", line 17, in <module>
    predict=classifier.predict(document)
  File "/home/n7/env/lib/python2.6/site-packages/sklearn/multiclass.py", line 182, in predict
    return predict_ovr(self.estimators_, self.label_binarizer_, X)
  File "/home/n7/env/lib/python2.6/site-packages/sklearn/multiclass.py", line 81, in predict_ovr
    Y = np.array([_predict_binary(e, X) for e in estimators])
  File "/home/n7/env/lib/python2.6/site-packages/sklearn/multiclass.py", line 56, in _predict_binary
    return estimator.predict_proba(X)[:, 1]
AttributeError: 'list' object has no attribute 'predict_proba'

Python with stuck state

Hi,

I was trying to execute a script by ssh on a mac pro machine with the extra trees from scikit-learn. Each time the process entered in stuck state.

So here a small script to reproduce

from sklearn.datasets import load_boston
from sklearn.ensemble import ExtraTreesRegressor
data = load_boston()
ExtraTreesRegressor(n_jobs=2, n_estimators=10000).fit(data.data, data.target)

Then I do the following

$ ssh username@mac-pro-machine
$ python script.py

Then the python interpreter enter in a stuck state

PID    COMMAND      %CPU TIME     #TH  #WQ  #POR #MRE MEM    RPRVT  PURG CMPRS  VPRVT  VSIZE  PGRP  PPID  STATE    UID  FAULTS  COW    MSGSENT MSGREC SYSBSD  SYSMACH
65425  python2.7    0.1   00:17.42 6    0    24   523  571M   571M   0B   0B     687M   3068M  65425 65324 stuck    508  149394  425    37      12     19429+  1735

I tried to reproduce on my laptop, but I haven't succeded. Previously, this was working with the multiprocessor backend. Note that using a DecisionTreeRegressor instead of ExtraTreesRegressor works without any problem.

Searching on internet, I found http://rachelbythebay.com/w/2011/06/07/forked/ and though that this might be link to joblib.

Can you help me?

joblib.Parallel does not work with memmap'd arrays as input

numpy.memmap arrays are a great way to reduce memory consumption when doing multicore processing on readonly data. However the current implementation still does not support pickle correctly hence cannot be used in conjunction with multiprocessing.Pool.apply_async that is used by joblib.Parallel.dispatch.

For instance:

>>> from joblib import Parallel, delayed
>>> import numpy as np
>>> a = np.memmap('/tmp/memmaped', dtype=np.float32, mode='w+', shape=(3, 5))
>>> b = np.memmap('/tmp/memmaped', dtype=np.float32, mode='r', shape=(3, 5))
>>> b
memmap([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]], dtype=float32)
>>> Parallel(n_jobs=2)(delayed(np.mean)(x) for x in np.array_split(b, 3))
[snipped main process traceback]
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
AttributeError                                     Wed Jul 11 19:00:18 2012
PID: 52915                                    Python 2.6.1: /usr/bin/python
...........................................................................
/Library/Python/2.6/site-packages/numpy/core/fromnumeric.pyc in mean(a=<class 'numpy.core.memmap.memmap'> instance, axis=None, dtype=None, out=None)
   2369     """
   2370     try:
   2371         mean = a.mean
   2372     except AttributeError:
   2373         return _wrapit(a, 'mean', axis, dtype, out)
-> 2374     return mean(axis, dtype, out)
   2375 
   2376 
   2377 def std(a, axis=None, dtype=None, out=None, ddof=0):
   2378     """

...........................................................................
/Library/Python/2.6/site-packages/numpy/core/memmap.pyc in __array_finalize__(self=memmap([ 0.,  0.,  0.,  0.,  0.], dtype=float32), obj=<class 'numpy.core.memmap.memmap'> instance)
    252         return self
    253 
    254     def __array_finalize__(self, obj):
    255         if hasattr(obj, '_mmap'):
    256             self._mmap = obj._mmap
--> 257             self.filename = obj.filename
    258             self.offset = obj.offset
    259             self.mode = obj.mode
    260         else:
    261             self._mmap = None

AttributeError: 'memmap' object has no attribute 'filename'
___________________________________________________________________________

A possible solution would be to make Parallel.dispatch detect instances of numpy.memmap instances and automatically wrap them before serialization and let joblib.parallel.SafeFunc automatically unwrap them before calling the actual processing.

If @GaelVaroquaux is not strictly opposed to that approach I will open a pull request to experiment with that solution.

Solving this issue would help for parallel machine learning in scikit-learn: scikit-learn/scikit-learn#936

Forking and blowing memory

I have blown my memory once again. Leaving the box in a state in which it is completely unusable and might need a hard reboot :(.

The way that this happened was that I allocated too many CPUs and eventually filled in my memory.

Maybe we could have a 'safe mode', given as an 'n_job' argument (maybe n_jobs=0) that divides the memory remaining on the box by the memory used by the process and decides the number of CPUs to allocate based on that.

Of course, it's a weak heuristic, and can easily be broken, but it would probably be better than nothing.

NumpyPickler broken under Python 3.4

Overriden methods need to pass newly introduced kwargs to super. For instance here is the end traceback of the error:

File "/volatile/ogrisel/code/joblib/joblib/hashing.py", line 87, in save_global
        Pickler.save_global(self, obj, name=name, pack=pack)
TypeError: save_global() got an unexpected keyword argument 'pack'

switching from compressed to not compressed doesn't work

In my script, using Memory with compress=True then compress=False with same input produced the following error:

  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/pickle.py", line 891, in load_persid
    self.append(self.persistent_load(pid))
AttributeError: NumpyUnpickler instance has no attribute 'persistent_load'

I would expect the unpickler to automatically work regardless of the compress option when the object was pickled.

Too many open files using memory mapping

Hi,

Recently I use the feature "Fast persistence of an arbitrary Python object into a files, with dedicated storage for numpy arrays.
" joblib.dump() and joblib.load(). It is a really interesting feature for big matrices.

When I dump a dictionary containing many numpy arrays, I cannot load as mmap_mode="r+", but it works fine for no memory mapping.

It seems that the problem comes from numpy.memmap (Too many open files). Just let you know it exists this issue. It may can be solved at joblib level. Or maybe you have already discussed this problem with #44 and #43. Or maybe my dictionary contains too many numpy arrays.

I am using joblib version on github with the commit b659205

>>> import joblib
>>> import numpy as np
>>> testarray = {}
>>> for i in range(4000):
...     testarray[i] = np.array(range(1000))
... 
>>> filepath = "/tmp/test.joblib"
>>> res = joblib.dump(testarray, filepath)
>>> testarray = joblib.load(filepath, mmap_mode="r+")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jinpeng/github/joblib/joblib/numpy_pickle.py", line 428, in load
    obj = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/home/jinpeng/github/joblib/joblib/numpy_pickle.py", line 293, in load_build
    array = nd_array_wrapper.read(self)
  File "/home/jinpeng/github/joblib/joblib/numpy_pickle.py", line 112, in read
    mmap_mode=unpickler.mmap_mode)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 348, in load
    return format.open_memmap(file, mode=mmap_mode)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/format.py", line 575, in open_memmap
    mode=mode, offset=offset)
  File "/usr/lib/python2.7/dist-packages/numpy/core/memmap.py", line 237, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap.error: [Errno 24] Too many open files
>>> testarray = joblib.load(filepath)

Regards,
Jinpeng

Random error when lanching sklearn's cross_val_score loops with n_jobs=-1

Hi,

I'm using joblib from the sklearn master. I randomly get the following weird error.
Thx for your help.

Best,

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import ShuffleSplit
from sklearn.cross_validation import cross_val_score

n, p = 78, 19

cv = ShuffleSplit(n, n_iter=20, train_size=.9,
                  test_size=.1, random_state=2)

lr = LinearRegression(fit_intercept=True)
x = np.random.randn(n, p)
y = np.random.randn(n)
for i in range(10):
    cross_val_score(lr, x, y, cv=cv, n_jobs=-1,
                scoring='mean_squared_error').sum()

Yields sometimes the following error:

In [244]: %run debug.py
[Parallel] Pool seems closed
[Parallel] Pool seems closed
[Parallel] Pool seems closed
[Parallel] Pool seems closed
[Parallel] Pool seems closed
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    173             else:
    174                 filename = fname
--> 175             __builtin__.execfile(filename, *where)

/volatile/thirion/mygit/bthirion/cohort/archi/debug.py in <module>()
     14 for i in range(10):
     15     cross_val_score(lr, x, y, cv=cv, n_jobs=-1,
---> 16                 scoring='mean_squared_error').sum()
     17 

/volatile/thirion/mygit/scikit-learn/sklearn/cross_validation.pyc in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, score_func, pre_dispatch)
   1146         delayed(_cross_val_score)(clone(estimator), X, y, scorer, train, test,
   1147                                   verbose, fit_params)
-> 1148         for train, test in cv)
   1149     return np.array(scores)
   1150 

/volatile/thirion/mygit/scikit-learn/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    514             n_jobs = max(mp.cpu_count() + 1 + n_jobs, 1)
    515 
--> 516         # The list of exceptions that we will capture

    517         self.exceptions = [TransportableException]
    518         self._lock = threading.Lock()

ValueError: generator already executing

Joblib generators experience race conditions

As per http://bugs.python.org/issue15355 I propose that Joblib stops using generators. I personally am getting a race condition everytime I try to run ~ 10 or so engines using 30 cores. Even with multiple locks I am not able to stop the accessing of the generator.

http://stackoverflow.com/questions/1131430/are-generators-threadsafe

I tend to get ValueError: generator already executing.

Even when wrapping the generators with locks, I still get the error, now on the self.it.next() method.

Dependencies not documented in the README

The README files doesn't document what are the dependencies, in particular the minimal Python version required.

doctest does not discover tests in joblib-@cache'd functions.

doctest does not discover tests in joblib-@cache'd functions. In the output below, note that only 3 of 4 tests get run.

This is with joblib 0.7.0b and python 2.7.2 on Windows 7 64-bit running 32-bit Python.

"""
doctest does not discover tests in joblib-@cache'd functions.

Tests do get discovered in this module docstring:

>>> uncached("module docstring")
'Testing uncached: module docstring'
>>> cached("module docstring")
'Testing cached: module docstring'

However, the doctest in the cached() function below is ignored.
"""

from joblib import Memory
mem = Memory("test-joblib-doctest", verbose=0)

def uncached(x):
    """
    An uncached function.

    >>> uncached("function docstring")
    'Testing uncached: function docstring'
    """
    return "Testing uncached: {}".format(x)

@mem.cache
def cached(x):
    """
    A cached function.

    >>> cached("function docstring")
    'Testing cached: function docstring'
    """
    return "Testing cached: {}".format(x)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Output:

C:\Users\jonvi>python c:/temp/test_joblib_doctest.py -v
Trying:
    uncached("module docstring")
Expecting:
    'Testing uncached: module docstring'
ok
Trying:
    cached("module docstring")
Expecting:
    'Testing cached: module docstring'
ok
Trying:
    uncached("function docstring")
Expecting:
    'Testing uncached: function docstring'
ok
2 items passed all tests:
   2 tests in __main__
   1 tests in __main__.uncached
3 tests in 2 items.
3 passed and 0 failed.
Test passed.

test_numpy_persistence fails in windows

pristine win32 with python(x,y) 2.7.2.1. The backtrace can be found here (sorry but I couldn't copy from that console, it's driving me crazy).

testsuite has pypy out in the snow

I just ran your package's testsuite, and pypy had no hope of passing most of its doctests.

Consider tests like test_numpy_pickle.py

and lines of code like

>>> import numpy as np

from memory.rst

I mean, pypy's numpypy is a work in progress, but pypy can never become really usable if it's not embraced.

Heisen test in pre_dispatch

I got the following failure, but only once in many many times:

======================================================================
FAIL: Check that using pre_dispatch Parallel does indeed dispatch items
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Python/2.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/oliviergrisel/coding/joblib/joblib/test/test_parallel.py", line 216, in test_dispatch_multiprocessing
    'Consumed 0', ])
AssertionError: ['Produced 0', 'Produced 1', 'Produced 2', 'Consumed 1'] != ['Produced 0', 'Produced 1', 'Produced 2', 'Consumed 0']
>>  raise self.failureException, \
          (None or '%r != %r' % (['Produced 0', 'Produced 1', 'Produced 2', 'Consumed 1'], ['Produced 0', 'Produced 1', 'Produced 2', 'Consumed 0']))

Logging is not documented

Logging is underdocumented at the moment. I might want to work on it, but I'm not sure how it's supposed to function.

sklearn model write to disk and read back from disk using joblib

I am building my model and storing it at a location using
joblib.dump(model, modelStoreLocation, compress=9)

Then in a separate execution I am trying to read the model from the location as show below :
# loading the model stored on disk model = joblib.load(modelStoreLocation)

I get the following error :

Traceback (most recent call last):
File "/Users/venuktangirala/PycharmProjects/nFlate/compute/prediction/categories/logisticRegression.py", line 84, in <module>x
prediction = predictDataFrame(dataFrame.ix[600:623, :-1].values, modelStoreLocation)
File "/Users/venuktangirala/PycharmProjects/nFlate/compute/prediction/categories/logisticRegression.py", line 58, in predictDataFrame
prediction = model.predict(dataToPredict)
File "/anaconda/lib/python2.7/site-packages/sklearn/linear_model/base.py", line 223, in predict
scores = self.decision_function(X)
File "/anaconda/lib/python2.7/site-packages/sklearn/linear_model/base.py", line 207, in decision_function
dense_output=True) + self.intercept_
File "/anaconda/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 83, in safe_sparse_dot
return np.dot(a, b)
TypeError: can't multiply sequence by non-int of type 'float'

The same works with out any problem when its in the same execution i.e. build the model , write to dsik and read from disk in the same execution.

What is the fix for this ?

Joblib + np.random.permutation doesn't generate random values if n_jobs > 1

Test case:

import numpy as np
import joblib

def permute(array):
    # With one row only for simplicity
    new_array = np.random.permutation(array[0,])
    print new_array[0:5] # To show the problem

data = np.random.randn(4, 1000)

pool = joblib.Parallel(n_jobs=2)

bad = pool(joblib.delayed(permute)(data) for item in range(1,11))

[-1.73202122 -1.03073665 -0.74896312  1.33868546  0.5940473 ]
[-1.73202122 -1.03073665 -0.74896312  1.33868546  0.5940473 ]
[-0.5100796  -0.20546035 -0.60574384 -0.92523779  2.05956018]
[-0.5100796  -0.20546035 -0.60574384 -0.92523779  2.05956018]
[ 1.99157153 -0.49067723  1.64050311  0.24392144  0.41525179]
[ 1.99157153 -0.49067723  1.64050311  0.24392144  0.41525179]

Notice how the numbers are equal in groups of 2, exactly the number of jobs specified in Parallel. The only time this is not observed is when n_jobs = 1 (i.e., no parallelism).

This effect is only present if np.random.permutation is used. random.shuffle() from the stdlib works fine. However I'm not sure if the bug is in joblib or in numpy. If it's not in joblib, feel free to close this issue.

Hashing error ndarray is not C-contiguous under Python 3.2

As reported by travis, for instance here:

https://travis-ci.org/joblib/joblib/jobs/14997106

Traceback (most recent call last):

File "/home/travis/virtualenv/python3.2/lib/python3.2/site-packages/nose/failure.py",  line 38, in runTest
  raise self.exc_val.with_traceback(self.tb)
File "/home/travis/virtualenv/python3.2/lib/python3.2/site-packages/nose/loader.py", line 254, in generate
  for test in g():
File "/home/travis/build/joblib/joblib/joblib/test/test_hashing.py", line 127, in test_hash_numpy
  yield nose.tools.assert_not_equal, hash(arr1), hash(arr1.T)
File "/home/travis/build/joblib/joblib/joblib/hashing.py", line 196, in hash
  return hasher.hash(obj)
File "/home/travis/build/joblib/joblib/joblib/hashing.py", line 53, in hash
  self.dump(obj)
File "/usr/lib/python3.2/pickle.py", line 237, in dump
  self.save(obj)
File "/home/travis/build/joblib/joblib/joblib/hashing.py", line 154, in save
  self._hash.update(self._getbuffer(obj))
ValueError: ndarray is not C-contiguous

Accessing a shared cache from Parallel

Hi,

I am re-posting a recent discussion on the mailing list here. Thanks to @ogrisel and @GaelVaroquaux for responding immediately.

Consider the example below:

from joblib import Memory, delayed, Parallel

def f(param):
    print 'recalculating for %d' % param
    return param


memory = Memory(cachedir='.', verbose=0)
cf = memory.cache(f)

params = []
for i in range(100):
    params.extend(range(10))

total_sum = 0
for i in range(10):
    res = Parallel(n_jobs=-1)(delayed(cf)(x) for x in params)
    total_sum += sum(res)

print total_sum

When I run that code for the first time (or immediately after manually deleting the ./joblib directory), I get the following output:

recalculating for 0
recalculating for 1
recalculating for 2
recalculating for 3
recalculating for 4
recalculating for 5
recalculating for 6
recalculating for 7
recalculating for 8
recalculating for 9
WARNING:root:[MemorizedFunc(func=, cachedir='./joblib')]: Exception while loading results for (args=(1,), kwargs={})
 Traceback (most recent call last):
  File "/Library/Frameworks/EPD64.framework/Versions/7.2/lib/python2.7/site-packages/joblib/memory.py", line 167, in __call__
    out = self.load_output(output_dir)
  File "/Library/Frameworks/EPD64.framework/Versions/7.2/lib/python2.7/site-packages/joblib/memory.py", line 418, in load_output
    mmap_mode=self.mmap_mode)
  File "/Library/Frameworks/EPD64.framework/Versions/7.2/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 424, in load
    obj = unpickler.load()
  File "/Library/Frameworks/EPD64.framework/Versions/7.2/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/Library/Frameworks/EPD64.framework/Versions/7.2/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError

recalculating for 1
recalculating for 2
recalculating for 3
...
...
...
45000

I think what is happening is that a worker process is attempting to read from the cache while another worker is writing to it. Despite the exception, the correct value of total_sum is returned. Subsequent runs of the same code simply print 45000, as expected.

Does joblib redo the computation when the exception occurs instead of reading from cache? A bit of extra CPU time is definitely preferable to corrupted results.

I am running OSX 10.8.4 with an HFS+ file system, Python 2.7.2 -- EPD 7.2-2 (64-bit) and joblib 0.7.0d.

JobLibCollisionWarning

I have two functions, get_training_set() and get_test_set() that receive warnings from joblib 0.7d:

-c:2: JobLibCollisionWarning: Cannot detect name collisions for function 'get_training_set'
-c:3: JobLibCollisionWarning: Cannot detect name collisions for function 'get_test_set'

are these warnings a cause for concern? Please see my gist for the full example:
http://bit.ly/VaRlM6

Any interest in a MongoDB backend for joblib?

instead of using filesystem, one could use MongoDB to store/retrieve the memoized objects. It may facilitate distributed computing jobs. This backend would be optional and won't require any forced dependency.

Thoughts?

Bad default value for n_jobs in joblib.Parallel

>>> import numpy as np
>>> from joblib import Parallel, delayed
>>> Parallel()(delayed(np.mean)(x) for x in np.array_split(np.arange(10), 3))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-e8d4da0c4a90> in <module>()
----> 1 Parallel()(delayed(np.mean)(x) for x in np.array_split(np.arange(10), 3))

/Users/oliviergrisel/coding/joblib/joblib/parallel.pyc in __call__(self, iterable)
    433         n_jobs = self.n_jobs
    434         if n_jobs < 0 and multiprocessing is not None:
--> 435             n_jobs = max(multiprocessing.cpu_count() + 1 + n_jobs, 1)
    436 
    437         # The list of exceptions that we will capture

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Giving an explicit value for n_jobs works around the bug:

>>> Parallel(n_jobs=2)(delayed(np.mean)(x) for x in np.array_split(np.arange(10), 3))
[1.5, 5.0, 8.0]

joblib doesn't propogate correct exception, raises SyntaxError instead

Whereas multiprocessing propagates the RuntimeError: Error during user-specified initialization

joblib generates a very strange SyntaxError -- I have no idea how I could specify an encoding in a compiled extension!

======================================================================
ERROR: test_user_init_unspecified (pystan.tests.test_problems.TestStanfit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/abr/documents/pystan/pystan-ariddell/pystan/tests/test_problems.py", line 30, in test_user_init_unspecified
    pystan.stan(model_code=model_code, iter=10, chains=1, seed=2, data=data, init=[dict(mu=4)], warmup=0)
  File "/home/abr/documents/pystan/pystan-ariddell/pystan/api.py", line 321, in stan
    n_jobs=n_jobs, **kwargs)
  File "/home/abr/documents/pystan/pystan-ariddell/pystan/model.py", line 650, in sampling
    ret_and_samples = Parallel(n_jobs)(delayed(call_sampler_star)(args) for args in call_sampler_args)
  File "/home/abr/documents/pystan/pystan-ariddell/pystan/external/joblib/parallel.py", line 602, in __call__
    self.retrieve()
  File "/home/abr/documents/pystan/pystan-ariddell/pystan/external/joblib/parallel.py", line 480, in retrieve
    self._output.append(job.get())
  File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
    raise self._value
nose.proxy.SyntaxError: invalid or missing encoding declaration for '/tmp/tmp5pc7xw/pystan/stanfit4anon_model_652395f2575ceb009e57546b592d498e.cpython-33m.so'

Asynchronous output variation of `Parallel.call`

Parallel.retrieve ensures that order is maintained when Parallel.__call__ is called on an iterable. Rather than returning a list of results in the same order as the input, I propose a generator-based version of Parallel.__call__ that yields output as it is ready without ensuring order.

>>> workers = Parallel(n_jobs=8)
>>> workers(delayed(sqrt)(i) for i in range(10))       # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> workers.async(delayed(square)(i) for i in range(10))   # <generator object _____ at 0x______>

Thoughts? Issues?

Make mangling of full path to filename work on Windows too.

I couldn't figure out how to attach patches to issues, so I paste the patch below and hope you can use it.

Best regards,
Jon Olav Vik

From 49458d39d4056f3886f0de7b33004a7cb3a0c4f7 Mon Sep 17 00:00:00 2001
From: Jon Olav Vik <[email protected]>
Date: Wed, 28 Sep 2011 13:14:22 +0200
Subject: [PATCH 84/84] Make mangling of full path to filename work on Windows too.

This fixes errors like the one quoted at bottom.

(Windows 7 64-bit running 32-bit Python 2.7.2, IPython 0.11)

Line 90 of func_inspect.py tries to make a filename out of a full path:
filename = filename.replace('/', '-')
However, the path separator os.sep is platform-dependent,
and Windows in addition has drive letters like c:.
I have changed the code to replace os.sep and ":" with "-",
making this work on Windows.

The error fixed by this change is:

c:\mydir\myfile.py in <module>()
---> 69 L = myfun(p1p2.protocol)
C:\Python27\lib\site-packages\joblib\memory.pyc in __call__(self, *args, **kwargs)
--> 168         if not (self._check_previous_func_code(stacklevel=3) and
C:\Python27\lib\site-packages\joblib\memory.pyc in _check_previous_func_code(self, stacklevel)
--> 263                 self._write_func_code(func_code_file, func_code, first_line)
C:\Python27\lib\site-packages\joblib\memory.pyc in _write_func_code(self, filename, func_code, first_line)
    241         func_code = '%s %i\n%s' % (FIRST_LINE_TEXT, first_line, func_code)
--> 242         file(filename, 'w').write(func_code)
IOError: [Errno 22] invalid mode ('w') or filename: 'C:\\Temp\\mycache\\joblib\\__main__-c:\\mydir\\myfile\\myfun-alias\\func_code.py'

---
 joblib/func_inspect.py |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/joblib/func_inspect.py b/joblib/func_inspect.py
index 7899e14..b2ade5b 100644
--- a/joblib/func_inspect.py
+++ b/joblib/func_inspect.py
@@ -87,7 +87,8 @@ def get_func_name(func, resolv_alias=True, win_characters=True):
         except:
             filename = None
         if filename is not None:
-            filename = filename.replace('/', '-')
+            filename = filename.replace(os.sep, '-')
+            filename = filename.replace(":", "-")
             if filename.endswith('.py'):
                 filename = filename[:-3]
             module = module + '-' + filename
-- 
1.7.4.msysgit.0

joblib cannot persist "output_dir" keyword argument.

The following causes an crash in joblib:

def f(output_dir="thing"):
    return None

mem = joblib.Memory("out")
mem.cache(f)(output_dir="other/thing")

Excerpt from the traceback:

File "/usr/local/lib/python2.6/site-packages/joblib/memory.py", line 318, in call
    self._persist_input(output_dir, *args, **kwargs)
TypeError: _persist_input() got multiple values for keyword argument 'output_dir'

This should not be hard to fix: _persist_input() should take args and kwargs instead of *args and **kwargs.

safe eval

In Parallel, joblib uses eval for its pre_dispatch, which can be unsafe in some circumstances. Possible solutions:

For Python >= 2.6, the use of the ast.literal_eval would help http://docs.python.org/library/ast.html#ast.literal_eval
For older Python versions, specifying explicite globals and locals in the eval would help a bit, especialy if we provided a builtins with a reduced set of builtins.
Use a callable (function of n_jobs)

Errors don't propagate until the end

When an error raises in the parallelised function, the execution continues until the end of the loop, where the error is shown. This means I have to wait for a long time just to see the program had failed much earlier.

Is it possible to raise exceptions as soon as they happen?

[feature] Request for an additional parameter to joblib.Memory.cache to invalidate the cache if another function has been modified

Would you agree to add an additional parameter to the cache function of a Memory object which allows to invalidate the cache if another function has been modified ?

Here are the reasons of this requests:
when dealing with very huge and complex objects it is sometimes better to ignore them. We can simulate the use of another complex objects by setting another string parameter with a different values.
This way we can use Memory.cache even with different parameters.

For example

@memory.cache(ignore='bigobject')
def function_to_cache(bigobject,key):
    [...]

@memory.cache
def function_to_build_big_object(params):
    [...]

bigobject1 = function_to_build_big_object(params1)
bigobject2 = function_to_build_big_object(params2)

res1 = function_to_cache(bigobject1, 'key_for_bigobject1')
res2 = function_to_cache(bigobject2, 'key_for_bigobject2')

However, it does not manage the invalidation of the cache, if for various reasons the big object have changed and not the representation string (i.e., the function to generate the object has been modified).
A solution to this problem could be to add an optionnal attribute invalidate_if_modified to Memory.cache :

@memory.cache(ignore='bigobject', invalidate_if_modified='function_to_build_big_object')
def function_to_cache(bigobject,key):
    [...]

joblib.hash broken for Numpy scalars

The following fails on Joblib 0.7.0b (Python 2.7, Numpy 1.6.2):

import joblib
print joblib.__version__

import numpy as np

mem = joblib.Memory(cachedir='bug-cache')

@mem.cache
def foo(chi):
    print("this function was called once!")
    u = chi
    return chi

def main():
    # Some strange interaction with numpy.finfo seems to be relevant
    # for this issue.
    chi = (np.finfo(float).eps)**(1./3)
    #chi = 6.0554544523933429e-06

    r = foo(chi) + foo(1e99*chi)
    r_expected = chi + 1e99*chi

    np.testing.assert_allclose(r, r_expected)

if __name__ == "__main__":
    main()

The second call to foo seems to return the value cached for the first call. Pdb shows that NumpyHasher returns the same hash for the two different arg_dict.

If I replace the np.finfo result by the equivalent float literal, the bug goes away. No idea what is going on here.

This bug doesn't occur with Joblib 0.6.5.

Joblib.dump Memory error

Memory usage by joblib.dump jumps to more than 3 times the size of the object.
While training the classifier takes about 11g of memory, but when dumping the classifier
object using joblib.dump(compress=9), the usage jumps up to 38.4g
[Causing issues with limited RAM]. [I tried values compress=3, 5, 7, 9, always get memory
error]. If "compress" is not used for joblib.dump, the classifier object is about 11g.
Following is the minimalistic script that demonstrates the steps in my classifier script.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.externals import joblib
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
import time

# Here "data" is the list of plaintext paragraphs and target array has the 
# category . Each paragraph belongs to one of the n categories. 

def train(data, target):
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2),
                    smooth_idf=True, sublinear_tf=True, max_df=0.5,
                    token_pattern=ur'\b(?!\d)\w\w+\b', use_idf=False)

    # Vectorize the input data
    print "Extracting features from dataset using %s." % (self.vectorizer)
    start_time=time()
    data_vectors = vectorizer.fit_transform(data)
    extract_features_time = time() - start_time
    print "Feature extraction of training data done in %s seconds" % extract_features_time
    print "Number of samples in training data: %d\n Number of features: %d" % data_vectors.shape
    print ""

    # Dump the vectorizer, dataset and target array objects for later use. This seems to work correctly
    # with any value of compress.
    print "Dumping vectorizer...",
    joblib.dump(self.vectorizer, "vectorizer.joblib", compress=9)
    print  "done"

    print "Dumping data vectors...",
    joblib.dump(data_vectors,  "datavectors.joblib",compress=9)
    print "done." 

    print "Dumping target array...",
     joblib.dump(target, "targetarray.joblib", compress=9)
     print "done." 


    # Train the classifer with OvR. The maximum memory used during training is
    # 11g for a dataset of size 655M.
    clf = OneVsRestClassifier(SGDClassifier(loss='log', n_iter=35,
                                        alpha=0.00001, n_jobs=-1))
    start_time=time()
    print "Training %s" % clf
    clf.fit(data_vectors, target)
    print "done [%.3fs]" % (start_time-time())

    # Dump the classifier for later use. Joblib dumps the classifier correctly
    # without any compression. However the size of the vector dumped is about 10-11g.
    # This seems to be too large for our purpose, and hence trying to compress
    # the dumped object.  For compress=3,5,7,9 the  memory usage jumps to 38.4g
    # Since the available memory is only 32g, the process ends up using swap space
    # where the process is stalled for a long time and eventually killed.
    print "Dumping classifier.....",
    joblib.dump(clf, "classifier.joblib", compress=9)
    print "done"

And following is the traceback from the exception:

Dumping classifier... Unexpected error:Traceback (most recent call last):
  File "trainclf.py", line 324, in train_model
    joblib.dump(self.clf, self.classifierfilename, compress=9)
  File "/home/n7/env/lib/python2.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 362, in dump
    pickler.close()
  File "/home/n7/env/lib/python2.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 252, in close
    self.file.getvalue(), self.compress)
  File "/home/n7/env/lib/python2.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 89, in write_zfile
    file_handle.write(zlib.compress(data, compress))
MemoryError: Can't allocate memory to compress data

Moving issue from scikit to Joblib [Ref: https://github.com/scikit-learn/scikit-learn/issues/1414]

Filesystem compression breaks joblib.test.test_disk.test_disk_used

Hello, I am getting the following error when running nosetests on the latest version (latest commit dated Jan 15th.)

======================================================================
FAIL: joblib.test.test_disk.test_disk_used
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose-1.2.1-py2.7.egg/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/home/ls/software/joblib/joblib/test/test_disk.py", line 40, in test_disk_used
    nose.tools.assert_true(disk_used(cachedir) >= target_size)
AssertionError: False is not true
    'False is not true' = self._formatMessage('False is not true', "%s is not true" % safe_repr(False))
>>  raise self.failureException('False is not true')


----------------------------------------------------------------------
Ran 299 tests in 2.890s

FAILED (failures=1)

/tmp is in a ZFS volume on FreeBSD with on-the-fly transparent compression enabled, so disk_used can be smaller than target_size. It indeed is:

[10:14:30] [ls@coyote /tmp]$ dd if=/dev/zero of=test.1M bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes transferred in 0.000609 secs (1721348928 bytes/sec)
[10:14:36] [ls@coyote /tmp]$ ls -al test.1M
-rw-r--r--  1 ls  wheel  1048576 Jan 30 10:14 test.1M
[10:14:39] [ls@coyote /tmp]$ du -k test.1M
1   test.1M
[10:14:50] [ls@coyote /tmp]$ du -Ak test.1M
1024    test.1M

I'm unsure about what to suggest, as I don't know what the authors meant to check here:

to get the apparent size of a file, using stat.st_size is the way to go
however, to get its actual disk usage (to impose a limit upon how much space the cache can use, for instance) stat.st_blocks is fine, but the assumption test_disk_used makes doesn't hold.

Support new (Python 3.4) multiprocessing

The multiprocessing module was completely reworked in Python 3.4, to the point that it breaks joblib: e.g. the (undocumented) multiprocessing.forking is gone. On the bright side, it has a nice feature called the forkserver that promises to spawn worker processes much faster than the old multiprocessing. Had joblib still used the old Pool, the change would have been trivial:

multiprocessing = multiprocessing.get_context("forkserver")

Getting the custom pools to work will take some more effort.

See issues 8713 and 18999 on the Python issue tracker.

Ping @ogrisel.

Feature Request/Discussion: provide "reducer" option for Parallel

ATM Parallel.retrieve simply collects all the results into self._output which is initialized as a list, thus requiring sufficient memory to store all the results as the come in.

In my case it is infeasible to store the actual results, and they all should be processed as they come in.

From a brief code review it seems that I could achieve desired effect if I simply could overload that initialization from list (default) to e.g. some custom construct with .append method which would take care about processing the results as they come in. Or am I missing something which would forbid such use?

may be you could suggest/prefer some other implementation, e.g. where Parallel would have a mode (or function) where it would become a generator of results (i.e. just yield instead of appending to _output), so I could "reduce" them in the loop outside the way I like it... ?

I haven't tried any of those ideas yet and just want to run by you first.