Comments (11)
This fix works for me. @gcarleo
Thanks everyone!
from netket.
I cannot reproduce the crash on my machine and therefore did not investigate this thoroughly. However, I have a suspicion:
The crash could be related to the class MatrixReplacement and the fact that it stores a member of type Eigen::MatrixXcd. According to the Eigen documentation, fixed-size vectorizable Eigen objects must absolutely be created at 16-byte-aligned locations, otherwise SIMD instructions addressing them will crash.
This is not guaranteed for class member variables. This problem can be solved by specifying the EIGEN_MAKE_ALIGNED_OPERATOR_NEW macro within the class body.
I cannot test this on my machine right now (since I cannot reproduce the crash), but could you check whether making this change fixes the crash for you?
from netket.
Thank you @wuyukai for the bug report, and thanks a lot @femtobit for looking into this!
I think this might be the likely source of the issue. Indeed on my machines I wasn't able to reproduce the bug, and it is most likely compiler/cpu-related.
@wuyukai would you be able to test out the patch? That would be really helpful. I can't test it myself right now, will look into this as soon as possible.
from netket.
This really should not be the issue, as Eigen's docs talk about fixed-size objects. The way these objects are usually implemented is that you avoid costly dynamic memory allocation by using arrays as member variables. However, if you want to use SIMD instructions, the data you're operating on had better be aligned. And seeing as alignas was only introduced in C++11, Eigen probably can't (or doesn't want to) do that automatically for you with fixed-size objects. Eigen::MatrixXcd is not a fixed-size matrix, so I don't see how it could be the issue here.
@wuyukai could you perhaps ask your favourite debugger for a stack trace? That would be really helpful for locating the bug.
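The distinction drawn above can be sketched in plain C++. This is only an illustration, not Eigen's implementation: FixedVec4, DynVec, HoldsFixed, HoldsDyn and is_aligned are hypothetical names introduced here, and std::vector stands in for a dynamically allocated matrix.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a fixed-size vectorizable type:
// the four floats are stored inside the object itself.
struct alignas(16) FixedVec4 {
  float data[4];
};

// Hypothetical stand-in for a dynamic-size type: the object holds only
// bookkeeping; the elements live in a separately allocated heap buffer.
struct DynVec {
  std::vector<float> data;
};

// A class with a fixed-size member inherits the member's 16-byte requirement,
// so where the enclosing object is created matters.
struct HoldsFixed {
  char tag;
  FixedVec4 v;
};

// A class with a dynamic-size member only needs the alignment of the
// bookkeeping members; the element buffer is allocated independently.
struct HoldsDyn {
  char tag;
  DynVec v;
};

// Returns true if p is a multiple of the given alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Under these assumptions, alignof(HoldsFixed) is 16 while alignof(HoldsDyn) equals alignof(std::vector<float>), which is why the member-alignment concern applies to fixed-size members and not to a dynamic-size one.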
from netket.
@twesterhout Indeed, thanks for pointing this out. Sorry for the confusion.
from netket.
I am able to reproduce this error (same OS & versions as reported). The proposed fix by @femtobit does not remove this issue unfortunately. Sorry for the ignorance in advance, but if you could let me know what extra info you need from a debugger I'd be happy to provide more than just the below:
Starting program: /usr/local/bin/netket j1j2.json
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4d4c700 (LWP 19874)]
[New Thread 0x7fffef9f0700 (LWP 19875)]
############################################
# NetKet version 1.0.2 #
# Website: https://www.netket.org #
# Licensed under Apache-2.0 - see LICENSE #
############################################
# Graph created
# Number of nodes = 20
# RBM Initizialized with nvisible = 20 and nhidden = 20
# Using visible bias = 1
# Using hidden bias = 1
# Machine initialized with random parameters
# Hamiltonian Metropolis sampler with parallel tempering is ready
# 16 replicas are being used
# Learning running on 1 processes
# Using the Stochastic reconfiguration method
# With iterative solver

Thread 1 "netket" received signal SIGSEGV, Segmentation fault.
0x00005555555b0878 in Eigen::internal::general_matrix_vector_product, Eigen::internal::const_blas_data_mapper, long, 0>, 0, false, std::complex, Eigen::internal::const_blas_data_mapper, long, 1>, false, 0>::run(long, long, Eigen::internal::const_blas_data_mapper, long, 0> const&, Eigen::internal::const_blas_data_mapper, long, 1> const&, std::complex*, long, std::complex) ()
(gdb) bt
#0 0x00005555555b0878 in Eigen::internal::general_matrix_vector_product, Eigen::internal::const_blas_data_mapper, long, 0>, 0, false, std::complex, Eigen::internal::const_blas_data_mapper, long, 1>, false, 0>::run(long, long, Eigen::internal::const_blas_data_mapper, long, 0> const&, Eigen::internal::const_blas_data_mapper, long, 1> const&, std::complex*, long, std::complex) ()
#1 0x00005555555b12af in void Eigen::internal::conjugate_gradient, -1, 1, 0, -1, 1> const, -1, 1, true>, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>, Eigen::IdentityPreconditioner>(netket::MatrixReplacement const&, Eigen::Block, -1, 1, 0, -1, 1> const, -1, 1, true> const&, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>&, Eigen::IdentityPreconditioner const&, long&, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>::RealScalar&) ()
#2 0x00005555555b39c9 in netket::GroundState::UpdateParameters() ()
#3 0x00005555555bc956 in netket::GroundState::GroundState(netket::Hamiltonian&, netket::Sampler > >&, netket::Stepper&, nlohmann::basic_json, std::allocator >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> const&) ()
#4 0x00005555555bfc3d in netket::Learning::Learning(nlohmann::basic_json, std::allocator >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> const&) ()
#5 0x0000555555575cb9 in main ()
from netket.
I noticed that in netket/NetKet/Learning/matrix_replacement.hpp (lines 58 to 59 in 916bfe2), MatrixReplacement::rows() returns the number of columns of the underlying matrix (just as ::cols()), which I suppose is not intentional.
If I change this to return the number of rows, I get an error at netket/NetKet/Learning/ground_state.hpp (line 331 in 916bfe2): S.rows() returns a value that is different from b.rows().
@gcarleo I am not familiar enough with this part of the code to be sure of what the correct dimensions for everything are. Could you comment on that?
from netket.
matrix_replacement is a class that applies a custom matrix to a vector without explicitly forming the matrix. This is needed to solve the linear system S*deltaP=b with a CG method while avoiding the computational cost of explicitly forming the matrix S.
In this case the matrix being replaced is therefore the one at netket/NetKet/Learning/ground_state.hpp (line 295 in 916bfe2), which is a square matrix of size npar_ * npar_. Thus rows() and cols() return the same value.
mp_mat_ is the O_k matrix, i.e. the variational derivatives on all the sampled configurations, which is rectangular, hence the error you mentioned.
Still, I don't understand where this bug comes from...
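The matrix-free idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not NetKet's code: it uses real numbers and std::vector instead of Eigen, takes S to be O^T O for a rectangular nsamples x npar matrix O, and the names MatrixFreeS, apply, dot and conjugate_gradient are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical matrix-free operator: represents the square matrix S = O^T O
// (npar x npar) while storing only the rectangular matrix O (row-major).
struct MatrixFreeS {
  std::size_t nsamples;
  std::size_t npar;
  Vec O;  // element (i, k) is O[i * npar + k]

  // S is square, so both dimensions equal npar even though O is rectangular.
  std::size_t rows() const { return npar; }
  std::size_t cols() const { return npar; }

  // Computes S * v as O^T * (O * v) without ever forming S.
  Vec apply(const Vec& v) const {
    Vec tmp(nsamples, 0.0);
    for (std::size_t i = 0; i < nsamples; ++i)
      for (std::size_t k = 0; k < npar; ++k) tmp[i] += O[i * npar + k] * v[k];
    Vec out(npar, 0.0);
    for (std::size_t i = 0; i < nsamples; ++i)
      for (std::size_t k = 0; k < npar; ++k) out[k] += O[i * npar + k] * tmp[i];
    return out;
  }
};

inline double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t k = 0; k < a.size(); ++k) s += a[k] * b[k];
  return s;
}

// Plain conjugate gradient for S * x = b, touching S only through apply().
inline Vec conjugate_gradient(const MatrixFreeS& S, const Vec& b,
                              std::size_t max_iter, double tol) {
  Vec x(b.size(), 0.0);
  Vec r = b;  // residual for the initial guess x = 0
  Vec p = r;  // search direction
  double rr = dot(r, r);
  for (std::size_t it = 0; it < max_iter && std::sqrt(rr) > tol; ++it) {
    const Vec Sp = S.apply(p);
    const double pSp = dot(p, Sp);
    if (pSp <= 0.0) break;  // S is singular along p; stop instead of dividing
    const double alpha = rr / pSp;
    for (std::size_t k = 0; k < x.size(); ++k) {
      x[k] += alpha * p[k];
      r[k] -= alpha * Sp[k];
    }
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;
    for (std::size_t k = 0; k < p.size(); ++k) p[k] = r[k] + beta * p[k];
    rr = rr_new;
  }
  return x;
}
```

For example, with O = [[1, 0], [0, 2], [1, 1]] the implied matrix is S = [[2, 1], [1, 5]], and solving against b = (3, 6) should give x close to (1, 1). Each apply() costs two passes over O rather than a product with an npar x npar matrix.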
from netket.
Thanks @everthemore! I was able to reproduce the error on my local machine. The problem lies in the lifetime management (as usual in C++) ;) I'll create a PR.
from netket.
All done. The fix (I really hope it also works for you @everthemore) is really simple.
from netket.
Thanks @twesterhout !
I was also looking into something similar, indeed here http://eigen.tuxfamily.org/dox/group__MatrixfreeSolverExample.html the matrix is not returned by value...
I can't check this fix on a Linux machine now, @everthemore and/or @wuyukai could you please try the proposed fix and tell us? Thanks a lot.
Next week I am going to set up unit tests on this part of the code, which unfortunately is not covered yet (and I guess there is a correlation with the fact that we found this bug here...)
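The thread does not show the actual patch, so the following is only a generic sketch of the kind of lifetime problem hinted at above: an accessor that returns a matrix by value hands out a temporary copy, and anything that keeps a pointer or reference into that copy dangles once the temporary is destroyed. Matrix, Wrapper, matrix_by_value and matrix_by_ref are hypothetical names, with std::vector standing in for an Eigen matrix.

```cpp
#include <vector>

using Matrix = std::vector<double>;  // hypothetical stand-in for a dense matrix

// Hypothetical wrapper that refers to a matrix owned by someone else.
class Wrapper {
 public:
  explicit Wrapper(const Matrix& m) : mat_(&m) {}

  // Returns a fresh copy; the copy lives only as long as the caller keeps it.
  Matrix matrix_by_value() const { return *mat_; }

  // Returns a reference to the original storage; no copy, no temporary.
  const Matrix& matrix_by_ref() const { return *mat_; }

 private:
  const Matrix* mat_;  // non-owning; the wrapped matrix must outlive the wrapper
};

// Unsafe pattern, shown only as a comment because using it is undefined behaviour:
//   const double* p = w.matrix_by_value().data();
//   // the temporary Matrix is destroyed at the end of that statement,
//   // so reading p[0] afterwards accesses freed memory.
```

With the by-reference accessor, w.matrix_by_ref().data() points at the same buffer as the wrapped matrix, so a solver that holds on to it stays valid for as long as the original matrix does.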
from netket.