Comments (14)

mperrin commented on June 8, 2024

I am now having this same problem on my laptop (MacBook Pro), while the exact same code works on my desktop (Mac Pro). Both are running OS X 10.9.5.

Here's webbpsf.system_diagnostic() for the one that works:

OS: Darwin-13.4.0-x86_64-i386-64bit
Python version: 2.7.8 (default, Oct  7 2014, 15:36:11)  [GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
numpy version: 1.8.2
poppy version: 0.3.3.dev407
webbpsf version: 0.3rc4

tkinter version: 0.3.1
wxpython version: 3.0.0.0

astropy version: 0.4.1
pysynphot version: 0.9.5
pyFFTW version: not found

Floating point type information for numpy.float:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

Floating point type information for numpy.complex:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

And here's the one that doesn't work:

OS: Darwin-13.4.0-x86_64-i386-64bit
Python version: 2.7.8 (default, Oct  7 2014, 15:56:22)  [GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))]
numpy version: 1.8.1
poppy version: 0.3.3.dev407
webbpsf version: 0.3rc4

tkinter version: 0.3.1
wxpython version: 3.0.0.0

astropy version: 0.4.dev9488
pysynphot version: 0.9.5
pyFFTW version: 0.9.2

Floating point type information for numpy.float:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

Floating point type information for numpy.complex:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

mperrin commented on June 8, 2024

Hangs the same way in both plain Python and IPython, so that's not the problem.

This looks like it may be relevant: http://stackoverflow.com/questions/12485322/python-2-6-7-multiprocessing-pool-fails-to-shut-down

josePhoenix commented on June 8, 2024

I've done some more investigation, and it appears that the hang happens on a line in matrixDFT.py where it computes a dot product. So I Google "numpy dot multiprocessing" and whaddya know, someone else has had this issue... http://stackoverflow.com/questions/23963997/python-child-process-crashes-on-numpy-dot-if-pyside-is-imported

The takeaway seems to be:

this is a general issue with some BLAS libraries used by numpy for dot.

Apple Accelerate and OpenBLAS built with GNU OpenMP are known to not be safe to use on both sides of a fork (the parent and the child processes multiprocessing creates). They will deadlock.

This cannot be fixed by numpy, but there are three workarounds:

  • use netlib BLAS, ATLAS, or a git-master OpenBLAS built on pthreads (2.8.0 does not work)
  • use Python 3.4 and its new multiprocessing spawn or forkserver start methods (see the sketch at the end of this comment)
  • use threading instead of multiprocessing; numpy releases the GIL for most expensive operations, so you can achieve decent threading speedups on typical desktop machines

I can't believe it!
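For reference, the Python 3.4 workaround would look something like this (just a sketch, not something I've tested here; get_context() and the 'forkserver' start method only exist on 3.4+):

import multiprocessing
import numpy as np

def do_dot(seed):
    # np.dot inside a worker process; with 'spawn' or 'forkserver' the child
    # does not inherit the parent's Accelerate/OpenMP thread state, so no deadlock
    rng = np.random.RandomState(seed)
    a = rng.rand(200, 200)
    return np.dot(a, a).sum()

if __name__ == '__main__':
    ctx = multiprocessing.get_context('forkserver')  # or 'spawn'
    pool = ctx.Pool(4)
    print(pool.map(do_dot, range(4)))
    pool.close()
    pool.join()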

josePhoenix commented on June 8, 2024

I threw a bug report into the ether: numpy/numpy#5752

It may be time to rewrite the Slow Fourier Transform in Cython / C / OpenCL (depending on how adventurous I'm feeling).

mperrin commented on June 8, 2024

Oh good grief. I wasn't expecting this to be a fundamental and intractable issue with the Accelerate framework, of all things. I bet the fact that this has generally worked more reliably for me is because I use MacPorts, which can build its own copy of ATLAS.

Yeah, I agree with your comment on the numpy bug discussion that we should just test for this and warn users if they're hitting this case. Looks like we can test against numpy.__config__.blas_opt_info to see whether a given numpy instance is linked against Accelerate.
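Something along these lines, perhaps (just a sketch; the exact keys present in blas_opt_info vary between numpy builds):

import numpy as np

# 'extra_link_args' mentions the Accelerate framework when numpy is linked against it
blas_info = np.__config__.blas_opt_info
uses_accelerate = '-Wl,Accelerate' in blas_info.get('extra_link_args', [])
print(uses_accelerate)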

I'd be fully in support of developing a more accelerated version of the MFT, definitely. There are some GPU experts in the building if you want to go that route. I'm tempted to think that the Cython approach could have better return on time invested than the complexities of dealing with GPUs, except the matrix_dft function is already mostly just calls to numpy ufuncs that are presumably written in C internally. So Cython might not give much speedup, but I suppose would at least work around the numpy.dot/Accelerate bug.

Possibly useful notebook on linear algebra in Cython: http://nbviewer.ipython.org/github/carljv/cython_testing/blob/master/cython_linalg.ipynb

josePhoenix commented on June 8, 2024

"Good grief": my sentiments exactly.

Further discussion on the numpy/numpy GitHub issue suggests there's a possibility of building wheels with ATLAS instead of Accelerate, perhaps even as the default going forward. I checked our internal ssbx numpy, and it's currently built with Accelerate for Mac users 😞

josePhoenix commented on June 8, 2024

A translation of the NumPy-based matrixDFT code into Cython, with a Cython matrix product function to replace numpy.dot, is about 23x slower. I've picked the low-hanging optimization fruit already, I think, though I'm not really experienced with Cython.

If it had been only 4x slower, I would have been happy... The hot path (the matrix product code) is essentially C at this point, so I'm not sure how much additional benefit a C extension will have.

mperrin commented on June 8, 2024

Bleh. 23x bleh.

I wonder why it's that much slower. There must be some deep SSE magic going on in the assembly code for numpy.dot. Yeah OK, no wins from Cython.

Do you want to give PyCUDA a try? If so, maybe talk to Perry or Tim in SSB; I think they have some experience with it or can direct you to those here who do. (FYI, there was very mixed experience here a couple of years ago trying GPUs for several projects, like the CTE correction.) Of course, given a goal of "easy distributability", anything GPU-ish is likely even more of a headache than this Accelerate mess, so I'm wary.

Right now we can at least be fairly confident that the existing code should be robust for multiprocessing on Linux, yes? In the long run it may be most efficient to just get things up and running on Python 3.4 since it sounds like that avoids the underlying issue, and we want to go that way anyhow.

josePhoenix commented on June 8, 2024

Yeah, getting things running on 3.4 is definitely something we should do so we're on a supported version of the language. We can just warn users on 2.7.x if they enable multiprocessing on an affected numpy build.

I was looking at PyOpenCL for broader compatibility than PyCUDA, but they're both lower level than I'd like (have to separate our real and imaginary components, etc.).

Re: Linux: it's hard to say, since there are a few different BLAS libs NumPy can be built with and no standard binary build for Linux as far as I know. It sounds like a similar issue has existed in other libraries and been worked around. I would want to check against our ssb* builds at least to confirm it's not an issue there.
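At least checking what a given numpy build was linked against is straightforward, for example:

import numpy as np

np.__config__.show()  # prints the BLAS/LAPACK build configuration for this numpy

and then looking for accelerate, atlas, mkl, or openblas in the output.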

josePhoenix commented on June 8, 2024

For future reference:

  • (slightly) higher level Python API over OpenCL and CUDA: reikna
  • An apparently very thorough discussion of FFTs with OpenCL from AMD: part 1 part 2 (with some formatting weirdness)

mperrin commented on June 8, 2024

Some thoughts on performance optimization.

I've just done some benchmarking with line_profiler. For a relatively simple case (webbpsf NIRCam F212N direct imaging), the execution time breaks down as follows. Each level of indentation is what's inside the function call above, but I've (manually) converted percentages to be relative to the total execution time.

~99% of time is spent inside OpticalSystem.propagate_mono()
    78% is inside wavefront.propagateTo
        77% is inside matrix_dft
             1% is the two np.outer() calls
            11% is setting up the expXU and expYV arrays
            64% is the two np.dot() calls
             1% is multiplying by norm_coeff
    11% is wavefront *= optic
     5% is wavefront.normalize()
     6% is return wavefront.asFITS()
        5% is unnecessary array copies which we could easily avoid 
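For anyone who wants to reproduce this, the profiling boils down to something like the sketch below; compute_psf here is a hypothetical stand-in for the webbpsf NIRCam F212N calculation:

from line_profiler import LineProfiler

def compute_psf():
    # hypothetical stand-in for the webbpsf NIRCam F212N calculation being profiled
    pass

profiler = LineProfiler()
profiler.add_function(compute_psf)  # also add the poppy internals of interest
profiler.runcall(compute_psf)
profiler.print_stats()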

So, no real surprises. Nearly 80% of the execution time is spent in calls to np.dot, np.outer, and np.exp. There is not likely to be much of anything we can do to optimize those beyond what BLAS already provides. (Setting aside cases of "numpy is not compiled to use BLAS so is wastefully slow" as an anti-optimization. Argh.) Extensive web searching has turned up some interesting tidbits, but nothing that looks particularly likely to yield any substantial speedups.

The only additional avenue I can think of to pursue would be GPU computing. In particular, Continuum Analytics has CUDA integrated into their Python JIT compiler toolchain, NumbaPro. Here's an example notebook. Seems very slick. CUDA includes its own version of BLAS, called CUBLAS. Under the right circumstances it sounds like this can provide significant speedups, and we should probably try it out. On the other hand, the overhead of transferring data into and out of GPU memory could potentially be significant.

(In the back of my mind there's long been an idea that it would be possible to rewrite the code such that the entire propagation takes place on the GPU and only the final results get sent back to main memory, i.e. all the OpticalElement phasor creation code could be compiled to run on the GPUs too. But that is clearly way outside the scope of current efforts. That said, this seems like exactly the sort of thing that scikit-CUDA is intended for.)
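To make the GPU idea a bit more concrete, a toy kernel against numba's CUDA target might look roughly like the following. This is just a naive illustration of offloading the multiply and is completely untested here; the real win would come from CUBLAS, not a hand-rolled kernel.

from numba import cuda

@cuda.jit
def matmul_kernel(a, b, out):
    # one GPU thread per output element; naive O(n^3) multiply
    i, j = cuda.grid(2)
    if i < out.shape[0] and j < out.shape[1]:
        tmp = 0.0
        for k in range(a.shape[1]):
            tmp += a[i, k] * b[k, j]
        out[i, j] = tmp

# launched e.g. as matmul_kernel[(blocks_x, blocks_y), (16, 16)](a_dev, b_dev, out_dev)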

Bottom line, I think right now we should:

  1. Add error-checking code for Accelerate in numpy.__config__.blas_opt_info and warn users it is not compatible with multiprocessing on Python 2.7.
  2. Rework the multiprocessing code so that it can use the forkserver method on Python 3.4 which supposedly avoids that problem, and encourage users to migrate to 3.4 if they want to use multiprocessing.
  3. Investigate GPU options over the longer term, probably most efficiently through NumbaPro.

mperrin commented on June 8, 2024

Should add at least the error-check and warning before release 0.3.5:

  • Add error-checking code for Accelerate in numpy.__config__.blas_opt_info and warn users it is not compatible with multiprocessing on Python 2.7.
  • Rework the multiprocessing code so that it can use the forkserver method on Python 3.4 which supposedly avoids that problem, and encourage users to migrate to 3.4 if they want to use multiprocessing.

JarronL commented on June 8, 2024

I recently did a clean install of Mac OS X (10.10) on my Mac Pro along with the latest Anaconda (Python 2.7) and pyFFTW with the corresponding FFTW libraries. After doing some extensive benchmarking, I discovered that the numpy FFT isn't that much slower than pyFFTW, and came to the conclusion that multiprocessing the numpy FFT is a much better use of my machine's resources. Fortunately, my numpy uses MKL, so I don't have the forking issue with Apple's Accelerate library. Unfortunately, the error-checking statement in poppy (line 1560) assumes that np.__config__.blas_opt_info['extra_link_args'] exists, which raises an error for me since that key is not present in my blas_opt_info dict. For reference, my dictionary looks like:

{'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)],
 'include_dirs': ['/Users/jarron/anaconda2/include'],
 'libraries': ['mkl_intel_lp64',
  'mkl_intel_thread',
  'mkl_core',
  'iomp5',
  'pthread'],
 'library_dirs': ['/Users/jarron/anaconda2/lib']}

Until there is a fix, I modified my copy to also check if that key exists:

d = np.__config__.blas_opt_info
# only flag Accelerate if the 'extra_link_args' key is actually present
# ('key in dict' also works on Python 3, unlike dict.has_key)
accel_bool = ('extra_link_args' in d) and ('-Wl,Accelerate' in d['extra_link_args'])
if (sys.version_info < (3, 4, 0)) and (platform.system() == 'Darwin') and accel_bool:
    _log.error("Multiprocessing not compatible with Apple Accelerate library on Python < 3.4")
    _log.error(" See https://github.com/mperrin/poppy/issues/23 ")
    _log.error(" Either disable multiprocessing, or recompile your numpy without Accelerate.")
    raise NotImplementedError("Multiprocessing not compatible with Apple Accelerate framework.")

mandarup commented on June 8, 2024

Just commenting for reference in case this is useful to someone:

A few years later, still using Python 2.7 because some dependencies are lagging behind, I realized np.dot was hanging joblib.Parallel with backend='multiprocessing'.

I worked around it using a simple jitted dot implementation:

import numba as nb
import numpy as np

# naive triple-loop matrix multiply compiled by numba;
# it avoids BLAS entirely, so it is safe to call after a fork
@nb.jit(nb.f8[:, :](nb.f8[:, :], nb.f8[:, :]), nopython=True)
def dot_product(a, b):
    size_a = a.shape
    size_b = b.shape
    size_out = (size_a[0], size_b[1])
    out = np.zeros(size_out)
    for i in range(size_a[0]):
        for j in range(size_b[1]):
            for k in range(size_b[0]):
                out[i, j] += a[i, k] * b[k, j]
    return out
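For example, a quick sanity check that it matches np.dot on random float64 inputs:

a = np.random.rand(50, 60)
b = np.random.rand(60, 40)
assert np.allclose(dot_product(a, b), np.dot(a, b))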
