Comments (14)
I am now having this same problem on my laptop (Macbook Pro), while the exact same code works on my desktop (Mac Pro). Both are OS X 10.9.5.
Here's webbpsf.system_diagnostic() for the one that works:
OS: Darwin-13.4.0-x86_64-i386-64bit
Python version: 2.7.8 (default, Oct 7 2014, 15:36:11) [GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
numpy version: 1.8.2
poppy version: 0.3.3.dev407
webbpsf version: 0.3rc4
tkinter version: 0.3.1
wxpython version: 3.0.0.0
astropy version: 0.4.1
pysynphot version: 0.9.5
pyFFTW version: not found
Floating point type information for numpy.float:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15 resolution= 1.0000000000000001e-15
machep= -52 eps= 2.2204460492503131e-16
negep = -53 epsneg= 1.1102230246251565e-16
minexp= -1022 tiny= 2.2250738585072014e-308
maxexp= 1024 max= 1.7976931348623157e+308
nexp = 11 min= -max
---------------------------------------------------------------------
Floating point type information for numpy.complex:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15 resolution= 1.0000000000000001e-15
machep= -52 eps= 2.2204460492503131e-16
negep = -53 epsneg= 1.1102230246251565e-16
minexp= -1022 tiny= 2.2250738585072014e-308
maxexp= 1024 max= 1.7976931348623157e+308
nexp = 11 min= -max
---------------------------------------------------------------------
And here's the one that doesn't work:
OS: Darwin-13.4.0-x86_64-i386-64bit
Python version: 2.7.8 (default, Oct 7 2014, 15:56:22) [GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))]
numpy version: 1.8.1
poppy version: 0.3.3.dev407
webbpsf version: 0.3rc4
tkinter version: 0.3.1
wxpython version: 3.0.0.0
astropy version: 0.4.dev9488
pysynphot version: 0.9.5
pyFFTW version: 0.9.2
Floating point type information for numpy.float:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15 resolution= 1.0000000000000001e-15
machep= -52 eps= 2.2204460492503131e-16
negep = -53 epsneg= 1.1102230246251565e-16
minexp= -1022 tiny= 2.2250738585072014e-308
maxexp= 1024 max= 1.7976931348623157e+308
nexp = 11 min= -max
---------------------------------------------------------------------
Floating point type information for numpy.complex:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15 resolution= 1.0000000000000001e-15
machep= -52 eps= 2.2204460492503131e-16
negep = -53 epsneg= 1.1102230246251565e-16
minexp= -1022 tiny= 2.2250738585072014e-308
maxexp= 1024 max= 1.7976931348623157e+308
nexp = 11 min= -max
---------------------------------------------------------------------
from poppy.
Hangs the same way in both python and IPython, so that's not the problem.
This looks maybe relevant: http://stackoverflow.com/questions/12485322/python-2-6-7-multiprocessing-pool-fails-to-shut-down
I've done some more investigation, and it appears that the hang happens on a line in matrixDFT.py
where it computes a dot product. So I Google "numpy dot multiprocessing" and whaddya know, someone else has had this issue... http://stackoverflow.com/questions/23963997/python-child-process-crashes-on-numpy-dot-if-pyside-is-imported
The takeaway seems to be:
"This is a general issue with some BLAS libraries used by numpy for dot. Apple Accelerate and OpenBLAS built with GNU OpenMP are known to not be safe to use on both sides of a fork (the parent and the child process multiprocessing creates). They will deadlock. This cannot be fixed by numpy, but there are three workarounds:
- use netlib BLAS, ATLAS, or git master OpenBLAS based on pthreads (2.8.0 does not work)
- use Python 3.4 and its new multiprocessing spawn or forkserver start methods
- use threading instead of multiprocessing; numpy releases the GIL for most expensive operations, so you can achieve decent threading speedups on typical desktop machines"
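The threading workaround in particular is easy to try: np.dot releases the GIL while the BLAS call runs, so a plain thread pool gets real parallelism without any fork. A minimal sketch (illustrative only, not poppy code; the matrix sizes are arbitrary):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor  # threads, not forked processes

def pairwise_dot(pair):
    a, b = pair
    # np.dot releases the GIL inside the BLAS call, so worker threads
    # can overlap the expensive part even on CPython.
    return np.dot(a, b)

pairs = [(np.random.rand(128, 128), np.random.rand(128, 128)) for _ in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(pairwise_dot, pairs))
```

Because no process is ever forked, this is safe even against an Accelerate-linked numpy.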
I can't believe it!
I threw a bug report into the ether: numpy/numpy#5752
It may be time to rewrite the Slow Fourier Transform in Cython / C / OpenCL (depending on how adventurous I'm feeling).
Oh good grief. I wasn't expecting this to be a fundamental and intractable issue with the Accelerate framework, of all things. I bet the fact that this has generally worked more reliably for me is due to my using MacPorts, which can build its own copy of ATLAS.
Yeah, I agree with your comment on the numpy bug discussion that we should just test for this and warn users if they're hitting this case. Looks like we can test against numpy.__config__.blas_opt_info to see whether a given numpy instance is linked against Accelerate.
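A minimal sketch of such a check (hedged: the function name is mine, and blas_opt_info has been removed from recent numpy releases, so this falls back to an empty dict rather than assuming the attribute exists):

```python
import numpy as np

def numpy_linked_against_accelerate():
    """Best-effort check for an Accelerate-linked numpy build.

    Follows the approach discussed here: look for Accelerate among the
    extra_link_args recorded in numpy.__config__.blas_opt_info.
    Returns False rather than raising if that metadata is absent.
    """
    info = getattr(np.__config__, 'blas_opt_info', {}) or {}
    link_args = info.get('extra_link_args', [])
    return any('Accelerate' in str(arg) for arg in link_args)
```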
I'd be fully in support of developing a more accelerated version of the MFT, definitely. There are some GPU experts in the building if you want to go that route. I'm tempted to think that the Cython approach could have better return on time invested than the complexities of dealing with GPUs, except the matrix_dft function is already mostly just calls to numpy ufuncs that are presumably written in C internally. So Cython might not give much speedup, but I suppose would at least work around the numpy.dot/Accelerate bug.
Possibly useful notebook on linear algebra in Cython: http://nbviewer.ipython.org/github/carljv/cython_testing/blob/master/cython_linalg.ipynb
"Good grief": my sentiments exactly.
Further activity on the GitHub issue at numpy/numpy suggests there's a possibility of building wheels with ATLAS instead of Accelerate, perhaps even as the default going forward. I checked our internal ssbx numpy, and it's currently built with Accelerate for Mac users 😞
A translation of the NumPy-based matrixDFT code into Cython, with a Cython matrix product function to replace numpy.dot, is about 23x slower. I've gotten the low-hanging optimization fruit already, I think, though I'm not really experienced with Cython.
If it had been only 4x slower, I would have been happy... The hot path (the matrix product code) is essentially C at this point, so I'm not sure how much additional benefit a C extension would have.
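The gap is easy to reproduce without Cython at all: even an explicit triple-loop matrix product (the same structure a straightforward Cython translation compiles) is dramatically slower than the BLAS-backed np.dot. A small illustrative benchmark (sizes and repetition counts are arbitrary):

```python
import timeit
import numpy as np

def naive_dot(a, b):
    # Textbook triple loop; this is essentially what a simple
    # Cython translation compiles, minus the C-level typing.
    out = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                out[i, j] += a[i, k] * b[k, j]
    return out

a = np.random.rand(40, 40)
b = np.random.rand(40, 40)
t_blas = timeit.timeit(lambda: np.dot(a, b), number=5)
t_loop = timeit.timeit(lambda: naive_dot(a, b), number=5)
print("loop vs. BLAS slowdown: %.0fx" % (t_loop / t_blas))
```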
Bleh. 23x bleh.
I wonder why it's that much slower. There must be some deep SSE magic going on in the assembly code for numpy.dot. Yeah OK, no wins from Cython.
Do you want to give pyCUDA a try? If so maybe talk to Perry or Tim in SSB, I think they have some experience with it or can direct you to those here who do. (FYI there was very mixed experience here a couple years ago trying GPUs for several projects like the CTE correction). Of course, given a goal of "easy distributability" anything GPUish is likely even more of a headache than this Accelerate mess so I'm wary.
Right now we can at least be fairly confident that the existing code should be robust for multiprocessing on Linux, yes? In the long run it may be most efficient to just get things up and running on Python 3.4 since it sounds like that avoids the underlying issue, and we want to go that way anyhow.
Yeah, getting things running on 3.4 is definitely something we should do so we're on a supported version of the language. We can just warn users on 2.7.x if they enable multiprocessing on an affected numpy build.
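On Python 3.4+, the start method can be selected per-context without changing the global default. A sketch of that selection logic (the helper name is mine, not poppy's actual code; forkserver is POSIX-only, so fall back to spawn elsewhere):

```python
import multiprocessing

def safe_mp_context():
    """Return a multiprocessing context using a non-fork start method.

    'spawn' and 'forkserver' (both new in Python 3.4) start workers in
    fresh interpreter processes, so no Accelerate/OpenMP thread state
    is inherited across a fork.
    """
    methods = multiprocessing.get_all_start_methods()
    method = 'forkserver' if 'forkserver' in methods else 'spawn'
    return multiprocessing.get_context(method)

ctx = safe_mp_context()  # ctx.Pool(...) then behaves like multiprocessing.Pool
```

One caveat: under spawn/forkserver, worker functions must be importable (defined at module level) so they can be pickled into the fresh child processes.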
I was looking at PyOpenCL for broader compatibility than PyCUDA, but they're both lower level than I'd like (have to separate our real and imaginary components, etc.).
Re: Linux: it's hard to say, since there are a few different BLAS libs NumPy can be built with and no standard binary build for Linux as far as I know. It sounds like a similar issue has existed in other libraries and been worked around. I would want to check against our ssb* builds at least to confirm it's not an issue there.
For future reference:
- (slightly) higher level Python API over OpenCL and CUDA: reikna
- An apparently very thorough discussion of FFTs with OpenCL from AMD: part 1 part 2 (with some formatting weirdness)
Some thoughts on performance optimization.
I've just done some benchmarking with line_profiler. For a relatively simple case (webbpsf NIRCam F212N direct imaging), the execution time breaks down as follows. Each level of indentation is what's inside the function call above, but I've (manually) converted percentages to be relative to the total execution time.
~99% of time is spent inside OpticalSystem.propagate_mono():
- 78% is inside wavefront.propagateTo()
  - 77% is inside matrix_dft
    - 1% is the two np.outer() calls
    - 11% is setting up the expXU and expYV arrays
    - 64% is the two np.dot() calls
    - 1% is multiplying by norm_coeff
- 11% is wavefront *= optic
- 5% is wavefront.normalize()
- 6% is return wavefront.asFITS()
- 5% is unnecessary array copies, which we could easily avoid
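For orientation, the shape of that matrix_dft hot path is roughly the following (an illustrative sketch only, with centering and normalization conventions simplified; not poppy's actual implementation):

```python
import numpy as np

def matrix_dft_sketch(plane, nlamD, npix_out):
    """Simplified matrix DFT: build the exponential arrays via np.outer
    and np.exp, then apply them with two np.dot calls, matching the
    profile breakdown above."""
    npup = plane.shape[0]
    xs = (np.arange(npup) - npup / 2.0) / npup                       # pupil coords
    us = (np.arange(npix_out) - npix_out / 2.0) * nlamD / npix_out   # image coords
    expXU = np.exp(-2j * np.pi * np.outer(xs, us))   # the expXU array
    expYV = np.exp(-2j * np.pi * np.outer(us, xs))   # the expYV array
    norm_coeff = np.sqrt(nlamD ** 2 / (npup ** 2 * npix_out ** 2))
    # The two np.dot calls are the dominant cost (64% in the profile).
    return np.dot(np.dot(expYV, plane), expXU) * norm_coeff
```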
So, no real surprises. Nearly 80% of the execution time is spent in calls to np.dot, np.outer, and np.exp. There is not likely to be much/anything we can do to optimize those beyond BLAS. (Setting aside cases of "numpy is not compiled to use BLAS so is wastefully slow" as an anti-optimization. Argh.) Extensive web searching has turned up some interesting tidbits but nothing that looks particularly likely to yield any substantial speedups.
- Continuum Analytics provides a version of numpy optimized with Intel's Math Kernel Library. However, apparently Apple Accelerate and OpenBLAS are just as fast as MKL, per this quora thread and a twitter post with benchmarks. Of course there is the lack-of-fork-safety issue.
- Continuum Analytics also has their numba just-in-time compiler. But that doesn't help here, since all the time is being spent inside compiled C kernels of ufuncs that are just calling BLAS anyway.
- In old versions of numpy (<= 1.7) there were some speed pitfalls depending on array order in memory, but apparently that is no longer the case, per this stackoverflow discussion.
- Another stackoverflow discussion that's informative but ultimately reaches the same conclusion, that BLAS is best: http://stackoverflow.com/questions/12239002/how-to-accelerate-matrix-multiplications-in-python
The only additional avenue I can think of to pursue would be GPU computing. In particular, Continuum Analytics also has CUDA integrated into their Python JIT compiler toolchain NumbaPro. Here's an example notebook. Seems very slick. CUDA includes its own version of BLAS called CUBLAS. Under the right circumstances it sounds like this can provide significant speedups, and we should probably try it out. On the other hand, the overheads for transferring data in/out of GPU memory could potentially be significant.
(In the back of my mind there's long been an idea that it would be possible to rewrite the code such that the entire propagation takes place on the GPU and only the final results get sent back to main memory, i.e. all the OpticalElement phasor creation code could be compiled to run on the GPUs too. But that is clearly way outside the scope of current efforts. That said, this seems like exactly the sort of thing that scikit-CUDA is intended for.)
Bottom line, I think right now we should:
- Add error-checking code for Accelerate in numpy.__config__.blas_opt_info and warn users it is not compatible with multiprocessing on Python 2.7.
- Rework the multiprocessing code so that it can use the forkserver method on Python 3.4, which supposedly avoids that problem, and encourage users to migrate to 3.4 if they want to use multiprocessing.
- Investigate GPU options over the longer term, probably most efficiently through NumbaPro.
Should add at least the error-check and warning before release 0.3.5:
- Add error-checking code for Accelerate in numpy.__config__.blas_opt_info and warn users it is not compatible with multiprocessing on Python 2.7
- Rework the multiprocessing code so that it can use the forkserver method on Python 3.4 which supposedly avoids that problem, and encourage users to migrate to 3.4 if they want to use multiprocessing.
I recently did a clean install of Mac OS X (10.10) on my MacPro along with the latest Anaconda (Python 2.7) and pyFFTW with the corresponding FFTW libraries. After doing some extensive benchmarking, I discovered that the numpy FFT isn't that much slower than pyFFTW, and came to the conclusion that multiprocessing the numpy FFT is a much better use of my machine's resources. Fortunately, my numpy uses MKL, so I don't have the forking issue under Apple's Accelerate library. Unfortunately, the error-checking statement in Poppy (line 1560) assumes that np.__config__.blas_opt_info['extra_link_args'] exists, which causes an error for me since that key is not present in my blas_opt_info dict. For reference, my dictionary looks like:
{'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)],
'include_dirs': ['/Users/jarron/anaconda2/include'],
'libraries': ['mkl_intel_lp64',
'mkl_intel_thread',
'mkl_core',
'iomp5',
'pthread'],
'library_dirs': ['/Users/jarron/anaconda2/lib']}
Until there is a fix, I modified my copy to also check whether that key exists:
d = np.__config__.blas_opt_info
if ('extra_link_args' in d) and ('-Wl,Accelerate' in d['extra_link_args']):
    accel_bool = True
else:
    accel_bool = False

if (sys.version_info < (3, 4, 0)) and (platform.system() == 'Darwin') and accel_bool:
    _log.error("Multiprocessing not compatible with Apple Accelerate library on Python < 3.4")
    _log.error(" See https://github.com/mperrin/poppy/issues/23 ")
    _log.error(" Either disable multiprocessing, or recompile your numpy without Accelerate.")
    raise NotImplementedError("Multiprocessing not compatible with Apple Accelerate framework.")
Just commenting for reference in case this is useful to someone:
A few years later, still using Python 2.7 because some dependencies are lagging behind, I realized np.dot was hanging joblib.Parallel with backend='multiprocessing'. I worked around it using a simple jitted dot implementation:
import numba as nb
import numpy as np

# Explicitly typed, nopython-mode JIT of a plain triple-loop matrix product
@nb.jit(nb.f8[:, :](nb.f8[:, :], nb.f8[:, :]), nopython=True)
def dot_product(a, b):
    size_a = a.shape
    size_b = b.shape
    size_out = (size_a[0], size_b[1])
    out = np.zeros(size_out)
    for i in range(size_a[0]):
        for j in range(size_b[1]):
            for k in range(size_b[0]):
                out[i, j] += a[i, k] * b[k, j]
    return out