Comments (8)

AddyLaddy commented on August 20, 2024

It doesn't look like the nccl-tests were compiled with MPI=1.

from nccl-tests.

thsmfe001 commented on August 20, 2024

I just followed the instructions on the README page: I downloaded the source and ran the make command.
Do you mean I need to recompile with the command below?
make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

AddyLaddy commented on August 20, 2024

Yes, it looks like you're trying to run binaries that are not MPI-enabled, so you just end up with 4 processes, each with 2 GPUs.
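
One rough way to check whether a test binary was built with MPI support is to look for an MPI library among its shared-library dependencies. The sketch below is a heuristic rather than anything from the nccl-tests project: it assumes the binary is dynamically linked and that the MPI library's file name contains `libmpi`. The helper name `links_mpi` and the path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Heuristic sketch: read `ldd` output on stdin and report whether a
# library whose name contains "libmpi" appears in it.
# Assumption: the binary is dynamically linked against its MPI library.
links_mpi() {
    if grep -q 'libmpi'; then
        echo "MPI-enabled"
    else
        echo "not MPI-enabled"
    fi
}

# Usage (adjust the path to your own build directory):
#   ldd ./build/all_reduce_perf | links_mpi
```

A statically linked MPI would not show up this way, so a "not MPI-enabled" result is a hint, not proof.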

thsmfe001 commented on August 20, 2024

Thank you so much. I will try it and then post the result.

thsmfe001 commented on August 20, 2024

I just succeeded in recompiling with the MPI options, but then I got the error messages below. Based on my investigation of the recompiled binaries, -np 1 works properly with any host, but running two or more processes (-np 2 and up) leads to an error. I think it is caused by MPI communication. Could you check the attached error logs?

root@c5e62fb2396d:/workspace# mpirun -np 4 -allow-run-as-root -host 10.10.10.2,10.10.11.2,10.10.20.2,10.10.21.2 /workspace/software/nccl-tests-master/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
[1716868653.124073] [c5e62fb2396d:1618 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.124093] [c5e62fb2396d:1618 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.124098] [c5e62fb2396d:1618 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 1; team_id 0; errmsg No pending message
[c5e62fb2396d:1618 :0:1618] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[1716868653.124031] [877a4a03d442:1142 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.124050] [877a4a03d442:1142 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.124056] [877a4a03d442:1142 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 0; team_id 0; errmsg No pending message
[1716868653.119230] [cb0142391811:1139 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.119252] [cb0142391811:1139 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.119258] [cb0142391811:1139 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 3; team_id 0; errmsg No pending message
[877a4a03d442:1142 :0:1142] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[cb0142391811:1139 :0:1139] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[1716868653.119305] [a1eb0914a5e2:1167 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.119324] [a1eb0914a5e2:1167 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.119329] [a1eb0914a5e2:1167 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 2; team_id 0; errmsg No pending message
[a1eb0914a5e2:1167 :0:1167] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)

AddyLaddy commented on August 20, 2024

It looks like you are having issues running MPI jobs. Perhaps get a simple "hello world" MPI program working first before attempting to run the NCCL tests.
With a UCX-based MPI, though, I often find that export UCX_TLS=tcp helps with most issues. You may also need to select the correct UCX device with UCX_NET_DEVICES.
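
As an illustration of the suggestion above, here is a minimal sketch that prints an mpirun command line with the UCX settings passed to every rank. It assumes Open MPI's mpirun, whose `-x` option exports an environment variable to the launched processes; other MPI launchers use different options. The function name `ucx_mpirun_cmd`, the relative binary path, and the `eth0` device name in the usage note are hypothetical. The command is only printed, not executed.

```shell
#!/bin/sh
# Sketch: print an mpirun command that forces UCX to use TCP transports.
# Assumption: Open MPI's mpirun (-x exports an environment variable
# to the launched ranks).
#   $1 = comma-separated host list
#   $2 = number of ranks
#   $3 = optional UCX_NET_DEVICES value (hypothetical example: eth0)
ucx_mpirun_cmd() {
    hosts=$1
    np=$2
    net_dev=${3:-}
    cmd="mpirun -np $np -host $hosts -x UCX_TLS=tcp"
    if [ -n "$net_dev" ]; then
        # Pin UCX to one network device only when the caller asks for it.
        cmd="$cmd -x UCX_NET_DEVICES=$net_dev"
    fi
    # The test arguments mirror the all_reduce_perf invocation in this thread.
    printf '%s %s\n' "$cmd" "./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1"
}

# Print the command for the four hosts used in this thread.
ucx_mpirun_cmd "10.10.10.2,10.10.11.2,10.10.20.2,10.10.21.2" 4
```

Passing a third argument, for example `ucx_mpirun_cmd "hostA,hostB" 2 eth0`, adds `-x UCX_NET_DEVICES=eth0` to the printed command. Exporting the variables in the shell before calling mpirun, as suggested above, is the simpler route when the launcher already forwards the environment.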

thsmfe001 commented on August 20, 2024

Thank you for your quick feedback. I just recompiled with all the options for the make command listed on the README page:
"make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
I will test again with the new build, and if I face the same issue I'll adopt your recommendation.
I'll update you with the result. Thank you.

thsmfe001 commented on August 20, 2024

Thank you for your support. After recompiling and applying UCX_TLS=tcp, the test completed successfully.
I really appreciate the quick support.
