mu's People

Contributors

kristianmitk, saggy00, zablotchi

mu's Issues

Error calling pthread_setaffinity_np in singularity container

One of the users on my cluster is trying to run this in a cluster environment where we do not allow users to run Docker.
The hosts run CentOS 7 with gcc 4.8.5, so we have problems compiling the program/library on them.

I created a Singularity container running Ubuntu Bionic, using the same commands as the Docker setup (omitting user creation and sshd).
I then used the container to compile the software.

I ran memcached -vv -p 9999 on one of our nodes, then ran numactl ... 1 4096 1 and numactl ... 2 4096 1. When I then ran numactl ... 3 4096 1, all three running numactl processes crashed.

Output on the server running the memcached command:

<29 get qp-leader-election-3-for-2
>29 sending key qp-l<30 get qp-leader-election-1-ready(connect)
>30 END
eader-election-3-for-2
>29 END
<28 get qp-leader-election-3-for-1
>28 sending key qp-leader-election-3-for-1
>28 END
<29 set qp-leader-election-2-ready(connect) 0 0 14
>29 STORED
<29 get qp-replication-1-ready(connect)
>29 sending key qp-replication-1-ready(connect)
>29 END
<29 get qp-replication-3-ready(connect)
>29 sending key qp-replication-3-ready(connect)
>29 END
<29 get qp-leader-election-1-ready(connect)
>29 END
<28 set qp-leader-election-1-ready(connect) 0 0 14
>28 STORED
<28 get qp-replication-2-ready(connect)
>28 sending key qp-replication-2-ready(connect)
>28 END
<28 get qp-replication-3-ready(connect)
>28 sending key qp-replication-3-ready(connect)
>28 END
<28 get qp-leader-election-2-ready(connect)
>28 sending key qp-leader-election-2-ready(connect)
>28 END
<28 get qp-leader-election-3-ready(connect)
>28 sending key qp-leader-election-3-ready(connect)
>28 END
<30 get qp-leader-election-1-ready(connect)
>30 sending key qp-leader-election-1-ready(connect)
>30 END
<30 get qp-leader-election-2-ready(connect)
>30 sending key qp-leader-election-2-ready(connect)
>30 END
<29 get qp-leader-election-1-ready(connect)
>29 sending key qp-leader-election-1-ready(connect)
>29 END
<29 get qp-leader-election-3-ready(connect)
>29 sending key qp-leader-election-3-ready(connect)
>29 END
<28 connection closed.
<30 connection closed.
<29 connection closed.

From node1:

Singularity> numactl --membind 0 ./crash-consensus/demo/using_conan_fully/build/bin/main-st 1 4096 1
USING PAYLOAD SIZE = 4096
USING OUTSTANDING_REQ = 1
[CONS:info] Device name: mlx5_0, Device verbs name: uverbs0, Extra info: NodeType::CA TransportType::IB
[CONS:info] Binding to the first port of the device... OK
[CONS:info] Binded on (port_id, port_lid) = (1, 70)
[CB:info] PD 'primary' registered
[CB:info] Buffer 'shared-buf' of size 2147483648 allocated
[CB:info] MR 'shared-mr' under PD 'primary' registered with buf 'shared-buf' and rights 7
[CB:info] CQ 'cq-replication' registered
[CB:info] CQ 'cq-leader-election' registered
[CE:info] Publishing qp qp-replication-1-for-2
[CE:info] Publishing qp qp-replication-1-for-3
[CE:info] Publishing qp qp-leader-election-1-for-2
[CE:info] Publishing qp qp-leader-election-1-for-3
[CE:info] Loopback connection was added
[CONS:info] Scratchpad memory :: slot size: 1048576 bytes, total size: 34603008 bytes.
[CONS:info] Log allocation... OK
[CONS:info] Log (address: 0x2adc9afe3040, size: 2112880640 bytes)
[CE:info] Connected with qp-replication-2-for-1
[CE:info] Connected with qp-replication-3-for-1
[CE:info] Connected with qp-leader-election-2-for-1
[CE:info] Connected with qp-leader-election-3-for-1
[CE:info] Loopback connection was established
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error calling pthread_setaffinity_np: Success
Aborted

From node2:

Singularity> numactl --membind 0 ./crash-consensus/demo/using_conan_fully/build/bin/main-st 2 4096 1
USING PAYLOAD SIZE = 4096
USING OUTSTANDING_REQ = 1
[CONS:info] Device name: mlx5_0, Device verbs name: uverbs0, Extra info: NodeType::CA TransportType::IB
[CONS:info] Binding to the first port of the device... OK
[CONS:info] Binded on (port_id, port_lid) = (1, 74)
[CB:info] PD 'primary' registered
[CB:info] Buffer 'shared-buf' of size 2147483648 allocated
[CB:info] MR 'shared-mr' under PD 'primary' registered with buf 'shared-buf' and rights 7
[CB:info] CQ 'cq-replication' registered
[CB:info] CQ 'cq-leader-election' registered
[CE:info] Publishing qp qp-replication-2-for-1
[CE:info] Publishing qp qp-replication-2-for-3
[CE:info] Publishing qp qp-leader-election-2-for-1
[CE:info] Publishing qp qp-leader-election-2-for-3
[CE:info] Loopback connection was added
[CONS:info] Scratchpad memory :: slot size: 1048576 bytes, total size: 34603008 bytes.
[CONS:info] Log allocation... OK
[CONS:info] Log (address: 0x2b5fdd72e040, size: 2112880640 bytes)
[CE:info] Connected with qp-replication-1-for-2
[CE:info] Connected with qp-replication-3-for-2
[CE:info] Connected with qp-leader-election-1-for-2
[CE:info] Connected with qp-leader-election-3-for-2
[CE:info] Loopback connection was established
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error calling pthread_setaffinity_np: Success
Aborted

From node3:

Singularity> numactl --membind 0 ./crash-consensus/demo/using_conan_fully/build/bin/main-st 3 4096 1
USING PAYLOAD SIZE = 4096
USING OUTSTANDING_REQ = 1
[CONS:info] Device name: mlx5_0, Device verbs name: uverbs0, Extra info: NodeType::CA TransportType::IB
[CONS:info] Binding to the first port of the device... OK
[CONS:info] Binded on (port_id, port_lid) = (1, 69)
[CB:info] PD 'primary' registered
[CB:info] Buffer 'shared-buf' of size 2147483648 allocated
[CB:info] MR 'shared-mr' under PD 'primary' registered with buf 'shared-buf' and rights 7
[CB:info] CQ 'cq-replication' registered
[CB:info] CQ 'cq-leader-election' registered
[CE:info] Publishing qp qp-replication-3-for-1
[CE:info] Publishing qp qp-replication-3-for-2
[CE:info] Publishing qp qp-leader-election-3-for-1
[CE:info] Publishing qp qp-leader-election-3-for-2
[CE:info] Loopback connection was added
[CONS:info] Scratchpad memory :: slot size: 1048576 bytes, total size: 34603008 bytes.
[CONS:info] Log allocation... OK
[CONS:info] Log (address: 0x2af5eb120040, size: 2112880640 bytes)
[CE:info] Connected with qp-replication-1-for-3
[CE:info] Connected with qp-replication-2-for-3
[CE:info] Connected with qp-leader-election-1-for-3
[CE:info] Connected with qp-leader-election-2-for-3
[CE:info] Loopback connection was established
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error calling pthread_setaffinity_np: Success
Aborted

I ran all my tests with 32 GB and 4 cores in an Ubuntu Bionic Singularity container on a CentOS host.
I tried installing the Mellanox InfiniBand driver in the Singularity container, matching the version of the Mellanox driver on the host machine, and I got the same crash.

I also tried it without the InfiniBand driver in the container and got the same crash.
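A note on the message itself: pthread_setaffinity_np returns the error number directly and does not set errno, so an error message built from strerror(errno) can read "Success" even though the call failed. The sketch below is one way to narrow this down from inside the container. It lists the CPUs the process is allowed to use and reports the real error when pinning fails. It is only an assumption that the crash comes from pinning a thread to a core outside the container's 4-core allocation; the helper names (allowed_cpus, pin_self_to, describe_pin_error) are hypothetical and not part of mu.

```cpp
// Diagnostic sketch (Linux/glibc). Build with: g++ -std=c++11 -pthread
// Helper names are hypothetical and not taken from the mu code base.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the CPU ids the calling thread is currently allowed to run on.
std::vector<int> allowed_cpus() {
  cpu_set_t set;
  CPU_ZERO(&set);
  int ret = pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
  if (ret != 0) {
    // The pthread_* calls return the error number; errno is not set.
    throw std::runtime_error(std::string("pthread_getaffinity_np: ") +
                             std::strerror(ret));
  }
  std::vector<int> cpus;
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &set)) cpus.push_back(cpu);
  }
  return cpus;
}

// Tries to pin the calling thread to `cpu`.
// Returns 0 on success, otherwise the error number (for example EINVAL
// when `cpu` is not among the CPUs the process may use).
int pin_self_to(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Formats the failure from the returned code rather than from errno.
std::string describe_pin_error(int ret) {
  return std::string("Error calling pthread_setaffinity_np: ") +
         std::strerror(ret);
}
```

Running allowed_cpus() inside the Singularity shell and comparing the result with the core ids the program tries to pin to would confirm or rule out this explanation. If pin_self_to returns a nonzero value, describe_pin_error gives the actual reason instead of "Success".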
