
Comments (6)

wxdwfc avatar wxdwfc commented on June 18, 2024

Hi, could you share your hosts.xml and your compilation options (cmake -Dxxxx=xx)?
If possible, could you also give me your RDMA NIC information (e.g. the output of "ibstat")?
Thanks!

from drtmh.

Alchem-Lab avatar Alchem-Lab commented on June 18, 2024

The hosts.xml is like this:

<hosts>
  <!-- all reachable hosts -->
  <macs>
    <a>nerv1</a>
    <a>nerv2</a>
    <a>nerv3</a>
  </macs>
  <black>
    <a>nerv4</a>
  </black>
  <!-- The macs which are ignored-->
</hosts>
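For reference, here is a minimal Python sketch of how a launcher could read this file and assign node ids by position. This is an assumption about the script's behavior, not taken from the drtmh source; the helper name `usable_hosts` and the blacklist filtering are hypothetical.

```python
import xml.etree.ElementTree as ET

# The hosts.xml content from this comment, embedded so the sketch is self-contained.
HOSTS_XML = """<hosts>
  <macs>
    <a>nerv1</a>
    <a>nerv2</a>
    <a>nerv3</a>
  </macs>
  <black>
    <a>nerv4</a>
  </black>
</hosts>"""


def usable_hosts(xml_text):
    """Return the hosts under <macs>, in order, minus any listed under <black>.

    Hypothetical helper: the real script may treat <black> differently.
    """
    root = ET.fromstring(xml_text)
    macs = [a.text.strip() for a in root.findall("./macs/a")]
    black = {a.text.strip() for a in root.findall("./black/a")}
    return [host for host in macs if host not in black]


hosts = usable_hosts(HOSTS_XML)
for node_id, host in enumerate(hosts):
    # Assumed convention: the position in <macs> is the node id.
    print(node_id, host)
```

With this file, nerv1 would get id 0, nerv2 id 1 and nerv3 id 2.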

compilation options are the same as in your README:
cmake -DUSE_RDMA=1 -DONE_SIDED_READ=1 -D ROCC_RBUF_SIZE_M=13240 -D RDMA_STORE_SIZE=5000 -DRDMA_CACHE=0 -DTX_LOG_STYLE=2 -DPA=0 .

The following is the result when I ran ibstat on machine nerv1:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.34.5000
Hardware version: 0
Node GUID: 0x248a070300e47ac0
System image GUID: 0x248a070300e47ac3
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0x248a070300e47ac1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x248a070300e47ac2
Link layer: InfiniBand

ibstat for machine nerv2:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.40.7004
Hardware version: 0
Node GUID: 0x7cfe9003009e4ba0
System image GUID: 0x7cfe9003009e4ba3
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 14
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e4ba1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e4ba2
Link layer: InfiniBand

ibstat for machine nerv3:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.40.7004
Hardware version: 0
Node GUID: 0x7cfe9003009e5160
System image GUID: 0x7cfe9003009e5163
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e5161
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e5162
Link layer: InfiniBand

I ran the Python script on nerv1, which is the first machine in hosts.xml.

Thanks!


wxdwfc avatar wxdwfc commented on June 18, 2024

Hi, thanks for your information.
First, please pull the latest code from the main branch; I had missed some scripts and files in it. I'm sorry for that (we are still refining the code).
Second, it seems that you have 1 NIC per machine. So maybe you can change the RWorker::choose_rnic_port() function to always set the use_port_ variable to 0 (this will use the first RNIC on your server, while our servers use two NICs).
Third, can you run the smallbank benchmark? It seems that the segmentation fault happens during data loading, and the smallbank benchmark uses a much simpler store than TPC-C. You can replace tpcc with bank to run the smallbank workload.
Thanks.
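As a quick way to confirm which ports are usable, the ibstat output can be summarized with a small Python sketch. The function name `active_ports` is hypothetical, and the sample text is a trimmed copy of the nerv1 output posted above. How ibstat's 1-based port numbers map onto use_port_ is not stated in this thread, so treat that mapping as an open assumption.

```python
def active_ports(ibstat_text):
    """Return {ca_name: [port numbers whose State is Active]}."""
    result = {}
    ca = None
    port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            # Header line such as: CA 'mlx4_0'
            ca = line.split("'")[1]
            result[ca] = []
            port = None
        elif line.startswith("Port ") and line.endswith(":"):
            # Port header such as: Port 1:  ("Port GUID: ..." does not end with ':')
            port = int(line[len("Port "):-1])
        elif line.startswith("State:") and ca is not None and port is not None:
            # "Physical state:" lines are skipped because they start with "Physical".
            if line.split(":", 1)[1].strip() == "Active":
                result[ca].append(port)
    return result


# Trimmed copy of the nerv1 output from this thread.
SAMPLE = """CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Port GUID: 0x248a070300e47ac1
Port 2:
State: Down
Physical state: Polling
Port GUID: 0x248a070300e47ac2
"""

print(active_ports(SAMPLE))
```

On all three machines in this thread, only port 1 of mlx4_0 is Active.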


Alchem-Lab avatar Alchem-Lab commented on June 18, 2024

Hi, thanks for the reply!

The segfault issue is solved.

But I still cannot run the code successfully, and I am trying to debug it. This is my current output when I run the script:

[chao@nerv1 scripts]$ ./run2.py config.xml noccocc "-t 24 -c 10 -r 100" bank 3
[START] Input parsing done.
[START] cleaning remaining processes.
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && rm *log"
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 1 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && rm *log"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 2 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 0 -t 24 -c 10 -r 100 -p 3
NOCC started with program [noccocc]. at 08-09-2018 03:45:29
[bench_runner.cc:324] Use TCP port 8888
[bench_runner.cc:346] use scale factor: 72; with total 24 threads.
[view.h:48] Start with 0 backups.
[view.cc:10] total 3 backups to assign
[Bank]: check workload 25, 15, 15, 15, 15, 15
[util.cc:164] huge page alloc failed!
[librdma] get device name mlx4_0, idx 0
[librdma] : Device 0 has 1 ports
[bench_runner.cc:153] Total logger area 0.00585938G.
[bench_runner.cc:163] add RDMA store size 4.88281G.
[bench_runner.cc:172] [Mem] RDMA heap size 8.03905G.
[util.cc:164] huge page alloc failed!
[util.cc:164] huge page alloc failed!
[Bank], total 21600000 accounts loaded
[bank_main.cc:263] check cv balance 46280
[Runner] local db size: 661.754 MB
[Runner] Cache size: 0 MB
[bench_listener2.cc:64] try log results to ./results/noccocc_bank_3_24_10_100.log
[bench_listener2.cc:73] New monitor running!
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory

This "Connect Memory Region failed" error occurs because ibv_reg_mr() returns NULL. Do you have any idea why ibv_reg_mr() might fail?

I appreciate your advice.


wxdwfc avatar wxdwfc commented on June 18, 2024

Hi, it seems that there are no 2 MB huge pages available on your machine (without them, more memory is needed to register the memory region on the NIC).
Can you allocate enough huge pages and then try again? You can use a command such as su -c "echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages" for allocation.

PS: If you cannot use huge pages, you can configure the RNIC according to this post https://community.mellanox.com/docs/DOC-1120 to allow the RNIC to register more memory. But huge pages are suggested for better performance.
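To check whether enough huge pages are free before launching, a rough Python sketch like the following may help. The helper names and the sample /proc/meminfo text are hypothetical; on a real machine you would read /proc/meminfo instead. The 8.03905 GiB figure is the "RDMA heap size" from the log above. The log shows three separate "huge page alloc failed" lines, so the total needed is probably larger than the heap alone.

```python
import math
import re


def hugepage_info(meminfo_text):
    """Extract huge page counters from /proc/meminfo-style text (0 if a key is absent)."""
    info = {}
    for key in ("HugePages_Total", "HugePages_Free", "Hugepagesize"):
        match = re.search(rf"^{key}:\s+(\d+)", meminfo_text, re.MULTILINE)
        info[key] = int(match.group(1)) if match else 0
    return info


def pages_needed(size_gib, page_kib=2048):
    """Number of huge pages of page_kib KiB needed to back size_gib GiB."""
    return math.ceil(size_gib * 1024 * 1024 / page_kib)


# Hypothetical sample; replace with open("/proc/meminfo").read() on a real machine.
SAMPLE_MEMINFO = """MemTotal:       65843012 kB
HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB
"""

info = hugepage_info(SAMPLE_MEMINFO)
need = pages_needed(8.03905)  # RDMA heap size reported in the log
print("free pages:", info["HugePages_Free"], "needed for heap:", need)
print("enough:", info["HugePages_Free"] >= need)
```

By the same arithmetic, the 10240 pages in the suggested command correspond to 20 GiB of 2 MB pages, reserved on NUMA node 0.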


Alchem-Lab avatar Alchem-Lab commented on June 18, 2024

I allocated enough huge page memory as you suggested, and the bank benchmark is now working fine on three machines with 8 threads on each machine. Thanks for all the help :)

