Comments (11)

wuyukai commented on May 20, 2024

This fix works for me. @gcarleo
Thanks everyone!

from netket.

femtobit commented on May 20, 2024

I cannot reproduce the crash on my machine, so I have not investigated this thoroughly. However, I have a suspicion:

The crash could be related to the class MatrixReplacement and the fact that it stores a member of type Eigen::MatrixXcd. According to the Eigen documentation:

fixed-size vectorizable Eigen objects must absolutely be created at 16-byte-aligned locations, otherwise SIMD instructions addressing them will crash.

This is not guaranteed for class member variables. This problem can be solved by specifying the EIGEN_MAKE_ALIGNED_OPERATOR_NEW macro within the class body.

I cannot test this on my machine right now (since I cannot reproduce the crash), but could you check whether making this change fixes the crash for you?

gcarleo commented on May 20, 2024

Thank you @wuyukai for the bug report, and thanks a lot @femtobit for looking into this!
I think this is a likely source of the issue. Indeed, I wasn't able to reproduce the bug on my machines, so it is most likely compiler- or CPU-related.

@wuyukai would you be able to test out the patch? That would be really helpful. I can't test it myself right now, will look into this as soon as possible.

twesterhout commented on May 20, 2024

This really should not be the issue, as Eigen's docs talk about fixed-size objects. Such objects usually avoid costly dynamic memory allocation by storing their data in arrays that are member variables. However, if you want to use SIMD instructions, the data you're operating on had better be aligned. And seeing as alignas was only introduced in C++11, Eigen probably can't (or doesn't want to) do that automatically for you with fixed-size objects. Eigen::MatrixXcd is not a fixed-size matrix, so I don't see how it could be the issue here.

@wuyukai could you perhaps ask your favourite debugger for a stack trace? That would be really helpful for locating the bug.
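
The distinction between fixed-size and dynamic-size storage can be sketched without Eigen. The types below (FixedSize4d, DynamicSize, HoldsFixed, HoldsDynamic, is_aligned) are hypothetical stand-ins, not Eigen classes, and the use of alignas is an assumption made for illustration rather than a description of how Eigen implements alignment. The point is only that an alignment requirement on in-object storage propagates to any class holding it as a member, while a pointer-plus-size object imposes no such requirement on its owner.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a fixed-size vectorizable type: the coefficients
// live inside the object, so the object itself must be 16-byte aligned.
struct FixedSize4d {
    alignas(16) double data[4];
};

// Hypothetical stand-in for a dynamic-size type: the object only holds a
// pointer and a size; the coefficients would live on the heap.
struct DynamicSize {
    double* data;
    std::size_t size;
};

// Classes that store each kind of object as a member variable.
struct HoldsFixed {
    char tag;
    FixedSize4d m;
};

struct HoldsDynamic {
    char tag;
    DynamicSize m;
};

// The 16-byte requirement propagates to the enclosing class...
static_assert(alignof(FixedSize4d) == 16, "in-object storage is over-aligned");
static_assert(alignof(HoldsFixed) == 16, "the requirement propagates to the owner");
// ...whereas the dynamic-size member adds nothing beyond its own alignment.
static_assert(alignof(HoldsDynamic) == alignof(DynamicSize),
              "a pointer member does not over-align the owner");

// Returns true if p is aligned to an n-byte boundary.
inline bool is_aligned(const void* p, std::size_t n) {
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}
```

Under these assumptions, only a class like HoldsFixed needs care when it is allocated; a class like HoldsDynamic, which mirrors a class storing an Eigen::MatrixXcd, does not.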

femtobit commented on May 20, 2024

@twesterhout Indeed, thanks for pointing this out. Sorry for the confusion.

everthemore commented on May 20, 2024

I am able to reproduce this error (same OS and versions as reported). Unfortunately, the fix proposed by @femtobit does not resolve the issue. Apologies in advance for my ignorance, but if you could let me know what extra info you need from a debugger, I'd be happy to provide more than just the output below:

Starting program: /usr/local/bin/netket j1j2.json
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4d4c700 (LWP 19874)]
[New Thread 0x7fffef9f0700 (LWP 19875)]
############################################
# NetKet version 1.0.2                     #
# Website: https://www.netket.org          #
# Licensed under Apache-2.0 - see LICENSE  #
############################################


 # Graph created 
 # Number of nodes = 20
 # RBM Initizialized with nvisible = 20 and nhidden = 20
 # Using visible bias = 1
 # Using hidden bias  = 1
 # Machine initialized with random parameters
 # Hamiltonian Metropolis sampler with parallel tempering is ready 
 # 16 replicas are being used
 # Learning running on 1 processes
 # Using the Stochastic reconfiguration method
 # With iterative solver

Thread 1 "netket" received signal SIGSEGV, Segmentation fault.
0x00005555555b0878 in Eigen::internal::general_matrix_vector_product, Eigen::internal::const_blas_data_mapper, long, 0>, 0, false, std::complex, Eigen::internal::const_blas_data_mapper, long, 1>, false, 0>::run(long, long, Eigen::internal::const_blas_data_mapper, long, 0> const&, Eigen::internal::const_blas_data_mapper, long, 1> const&, std::complex*, long, std::complex) ()

(gdb) bt
#0  0x00005555555b0878 in Eigen::internal::general_matrix_vector_product, Eigen::internal::const_blas_data_mapper, long, 0>, 0, false, std::complex, Eigen::internal::const_blas_data_mapper, long, 1>, false, 0>::run(long, long, Eigen::internal::const_blas_data_mapper, long, 0> const&, Eigen::internal::const_blas_data_mapper, long, 1> const&, std::complex*, long, std::complex) ()
#1  0x00005555555b12af in void Eigen::internal::conjugate_gradient, -1, 1, 0, -1, 1> const, -1, 1, true>, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>, Eigen::IdentityPreconditioner>(netket::MatrixReplacement const&, Eigen::Block, -1, 1, 0, -1, 1> const, -1, 1, true> const&, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>&, Eigen::IdentityPreconditioner const&, long&, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>::RealScalar&) ()
#2  0x00005555555b39c9 in netket::GroundState::UpdateParameters() ()
#3  0x00005555555bc956 in netket::GroundState::GroundState(netket::Hamiltonian&, netket::Sampler > >&, netket::Stepper&, nlohmann::basic_json, std::allocator >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> const&) ()
#4  0x00005555555bfc3d in netket::Learning::Learning(nlohmann::basic_json, std::allocator >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> const&) ()
#5  0x0000555555575cb9 in main ()

femtobit commented on May 20, 2024

I noticed that in

    Index rows() const { return mp_mat_.cols(); }
    Index cols() const { return mp_mat_.cols(); }

MatrixReplacement::rows() returns the number of columns of the underlying matrix (just as ::cols()), which I suppose is not intentional.
If I change this to return the number of rows, I get an error at

    auto deltaP = it_solver.solve(b);

because now of course S.rows() returns a value that is different from b.rows().

@gcarleo I am not familiar enough with this part of the code to be sure of what the correct dimensions for everything are. Could you comment on that?

gcarleo commented on May 20, 2024

@femtobit

matrix_replacement is a class that basically applies a custom matrix to a vector, without explicitly forming the matrix. This is needed to solve a linear system S*deltaP=b through a CG method while avoiding the computational cost of explicitly forming the matrix S.

In this case the matrix being replaced is therefore:

Eigen::MatrixXcd S = Ok_.adjoint() * Ok_;

which is a square matrix of size npar_ x npar_. Thus rows() and cols() return the same value.

mp_mat_ is the O_k matrix, i.e. the variational derivatives on all the sampled configurations, which is a rectangular one, thus you get the error you were mentioning.
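
To make the dimensions concrete, here is a minimal sketch of a matrix-free operator for S = O^dagger * O using only the standard library. The names Matrix, GramOperator, apply and apply_explicit are hypothetical and are not NetKet's or Eigen's API; the dense row-major Matrix merely stands in for Eigen::MatrixXcd. It illustrates why both rows() and cols() of the operator equal the number of columns of the rectangular matrix O.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;
using Vector = std::vector<Complex>;

// Dense row-major matrix, a hypothetical stand-in for Eigen::MatrixXcd.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<Complex> a;  // element (i, j) is stored at a[i * cols + j]
    Complex operator()(std::size_t i, std::size_t j) const { return a[i * cols + j]; }
};

// Matrix-free view of S = O^dagger * O.
// It holds a reference to O, so O must outlive the operator.
class GramOperator {
public:
    explicit GramOperator(const Matrix& o) : o_(o) {}

    // S is square (npar x npar), so both dimensions are O.cols().
    std::size_t rows() const { return o_.cols; }
    std::size_t cols() const { return o_.cols; }

    // Computes y = O^dagger * (O * x) without ever forming S.
    Vector apply(const Vector& x) const {
        assert(x.size() == o_.cols);
        Vector t(o_.rows, Complex(0.0, 0.0));
        for (std::size_t i = 0; i < o_.rows; ++i)
            for (std::size_t j = 0; j < o_.cols; ++j)
                t[i] += o_(i, j) * x[j];
        Vector y(o_.cols, Complex(0.0, 0.0));
        for (std::size_t i = 0; i < o_.rows; ++i)
            for (std::size_t j = 0; j < o_.cols; ++j)
                y[j] += std::conj(o_(i, j)) * t[i];
        return y;
    }

private:
    const Matrix& o_;
};

// Reference implementation that forms S explicitly, for comparison only.
inline Vector apply_explicit(const Matrix& o, const Vector& x) {
    const std::size_t n = o.cols;
    std::vector<Complex> s(n * n, Complex(0.0, 0.0));
    for (std::size_t k = 0; k < o.rows; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                s[i * n + j] += std::conj(o(k, i)) * o(k, j);
    Vector y(n, Complex(0.0, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += s[i * n + j] * x[j];
    return y;
}
```

For a 3 x 2 matrix O, a GramOperator reports rows() == cols() == 2, and apply agrees with apply_explicit up to rounding. A right-hand side b passed to a solver must therefore have O.cols() entries, which is consistent with the mismatch seen when rows() is changed to return O.rows().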

Still, I don't understand where this bug comes from....

twesterhout commented on May 20, 2024

Thanks @everthemore! I was able to reproduce the error on my local machine. The problem lies in the lifetime management (as usual in C++) ;) I'll create a PR.

twesterhout commented on May 20, 2024

All done. The fix (I really hope it also works for you @everthemore) is really simple.

gcarleo commented on May 20, 2024

Thanks @twesterhout !
I was also looking into something similar. Indeed, in the example at http://eigen.tuxfamily.org/dox/group__MatrixfreeSolverExample.html the matrix is not returned by value...
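
As an illustration of why the return type of such an accessor matters, here is a minimal standard-library sketch. It does not reproduce the actual NetKet code or the fix in the PR; WrapperByValue, WrapperByRef, attach, my_matrix and shares_storage are hypothetical names, and std::vector stands in for the matrix storage. An accessor that returns by value hands back a fresh temporary on every call, so any pointer or lightweight view taken into it dangles once the temporary is destroyed; an accessor that returns a const reference exposes the original storage.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<double>;  // hypothetical stand-in for matrix storage

// Accessor returns a copy: each call creates a new temporary Matrix.
class WrapperByValue {
public:
    void attach(const Matrix& m) { mp_mat_ = &m; }
    Matrix my_matrix() const { return *mp_mat_; }

private:
    const Matrix* mp_mat_ = nullptr;
};

// Accessor returns a reference to the attached matrix itself.
class WrapperByRef {
public:
    void attach(const Matrix& m) { mp_mat_ = &m; }
    const Matrix& my_matrix() const { return *mp_mat_; }

private:
    const Matrix* mp_mat_ = nullptr;
};

// True when the accessor hands back the attached matrix's own storage.
// The comparison happens within one full expression, so a temporary
// returned by value is still alive while its data pointer is inspected.
template <class W>
bool shares_storage(const W& w, const Matrix& original) {
    return w.my_matrix().data() == original.data();
}
```

With a non-empty matrix attached to both wrappers, shares_storage is false for WrapperByValue and true for WrapperByRef. Code that keeps a pointer or view obtained from the by-value accessor beyond the end of the expression would be reading freed memory, which is one way a crash like the one above could arise.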

I can't check this fix on a Linux machine now, @everthemore and/or @wuyukai could you please try the proposed fix and tell us? Thanks a lot.

Next week I am going to set up unit tests on this part of the code, which unfortunately is not covered yet (and I guess there is a correlation with the fact that we found this bug here...)
