omnireduce's People

Contributors

chenyuho, phlix1

omnireduce's Issues

Fail to run OmniReduce-RDMA example

Description:

Hello,
I encountered the error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7" while running the OmniReduce-RDMA example. I'm trying to run the single-node example, in which the aggregator and the worker are on the same machine.

Steps to reproduce:

  1. Compile as described in the README.
  2. Modify the .cfg file:
[omnireduce]
num_workers = 1
num_aggregators = 1
num_threads = 8
worker_cores = -1,-1,-1,-1,-1,-1,-1,-1
aggregator_cores = -1,-1,-1,-1,-1,-1,-1,-1
threshold = 0.0
buffer_size = 1024
chunk_size = 1048576
bitmap_chunk_size = 16777216
message_size = 256
block_size = 256
ib_hca = mlx5_0
ib_port = 1
gid_idx = 2
sl = 2
gpu_devId = 0
direct_memory = 1
adaptive_blocksize = 0
tcp_port = 19875
worker_ips = 127.0.0.1
aggregator_ips = 127.0.0.1

  3. Run the aggregator:
$ ./aggregator 
IB device: mlx5_0
  4. Run the worker and observe the error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7", as shown below:
$ mpirun -n 1 -host 127.0.0.1 ./cuda_worker
CUDA device [NVIDIA A100-PCIE-40GB]
IB device: mlx5_0
ibv_reg_mr cuda_comm_buf failed with mr_flags=7
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[19406,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Expected behavior:

The program should run without any errors and show the calculated algbw.

Actual behavior:

The error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7" is displayed, and the cuda_worker fails to run.

Additional information:

I'm not sure what the "mr_flags=7" part of the error message means, or how to fix the problem. I've tried searching online for solutions, but haven't found anything that helps. Any suggestions or advice would be greatly appreciated.
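On the meaning of "mr_flags=7": the value is most likely the access bitmask passed to ibv_reg_mr. The sketch below decodes such a mask using the ibv_access_flags values from the libibverbs header; the helper name decode_mr_flags is made up for illustration and is not part of OmniReduce.

```python
# Bit values of enum ibv_access_flags as defined in <infiniband/verbs.h>.
IBV_ACCESS_FLAGS = {
    "IBV_ACCESS_LOCAL_WRITE": 1 << 0,
    "IBV_ACCESS_REMOTE_WRITE": 1 << 1,
    "IBV_ACCESS_REMOTE_READ": 1 << 2,
    "IBV_ACCESS_REMOTE_ATOMIC": 1 << 3,
    "IBV_ACCESS_MW_BIND": 1 << 4,
    "IBV_ACCESS_ZERO_BASED": 1 << 5,
    "IBV_ACCESS_ON_DEMAND": 1 << 6,
}


def decode_mr_flags(value):
    """Return the names of the access flags set in an ibv_reg_mr bitmask.

    Hypothetical helper for reading the log line; not an OmniReduce function.
    """
    return [name for name, bit in IBV_ACCESS_FLAGS.items() if value & bit]


if __name__ == "__main__":
    # 7 == 1 | 2 | 4
    print(decode_mr_flags(7))
```

So 7 is LOCAL_WRITE | REMOTE_WRITE | REMOTE_READ, an ordinary combination, which suggests the flags themselves are not the problem. Since the failing buffer is named cuda_comm_buf and the config sets direct_memory = 1, the registration probably targets GPU memory, which generally requires GPUDirect RDMA support on the host (for example, the nvidia-peermem kernel module being loaded). That is an inference from the log, not something the report confirms.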

Thank you for your help!

Alex Li

DAIET RX Queue Map Error

Hi, I was able to compile PyTorch with Gloo and DAIET and am now trying to run the training, but I get the following error:

Configuration file daiet-server.cfg
** DAIET parameters **
Num updates: 256
Max num pending messages: 64
Worker port: 4000
PS port: 3030
Worker IP: 14.207.254.149
PS0: XX:XX:XX:XX:XX:XX 14.207.254.165
Num workers: 1
Number of threads: 4
CPU freq: 1.8 GHz
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
Link Up. Speed 200000 Mbps - Full-duplex
Cannot init mbuf pool: Cannot allocate memory
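A note on reading the log above: "Unknown error -95" looks like a negated Linux errno that was passed to strerror without flipping the sign. The sketch below maps such lines back to an errno name using a small hand-written table of Linux values; the function name explain_error and the table are illustrative assumptions, not part of DAIET or DPDK.

```python
import re

# A few Linux errno values (from the kernel's errno headers).
# Only the entries needed to read this log are listed.
LINUX_ERRNO = {
    12: ("ENOMEM", "Cannot allocate memory"),
    22: ("EINVAL", "Invalid argument"),
    95: ("EOPNOTSUPP", "Operation not supported"),
}


def explain_error(line):
    """Translate 'Unknown error -N' in a log line into an errno name.

    Returns None when the line carries no such code.
    """
    match = re.search(r"Unknown error (-?\d+)", line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"{name} ({code}): {text}"


if __name__ == "__main__":
    print(explain_error("RX queue stats mapping error Unknown error -95"))
```

If that reading is right, the queue stats mapping messages only mean the NIC driver does not support that operation, and the fatal line is the last one, "Cannot init mbuf pool: Cannot allocate memory".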

My config file, daiet-server.cfg, for reference:

[daiet]
num_workers = 1
# Block size
num_updates = 256
# Slots per core
max_num_pending_messages = 64
worker_port = 4000
ps_port = 3030
worker_ip = 14.207.254.149
ps_ips = 14.207.254.165
ps_macs = XX:XX:XX:XX:XX:XX
# Deprecated config, no need to change
sync_blocks = 100000000

[dpdk]
# Cores to bind to
cores = 10-13
prefix = daiet
# Extra EAL options
extra_eal_options = -w 0000:8e:00.0
# Port id
port_id = 0
# Pool and pool cache sizes
pool_size = 131072
pool_cache_size = 512
# Number of packets in a burst
burst_rx = 64
burst_tx = 32
# Bulk drain timer (microseconds)
bulk_drain_tx_us = 10
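One way to sanity-check the "Cannot init mbuf pool: Cannot allocate memory" failure against the config above is to estimate how much hugepage memory pool_size implies. The sketch below is a rough lower bound under stated assumptions: each mbuf is taken to be 128 bytes of struct plus a 2176-byte data buffer (DPDK's usual default), which may not match what DAIET actually requests, and mempool headers, caches, and alignment are ignored. The function names are hypothetical.

```python
import configparser

# Assumed per-mbuf sizes; DAIET's real values are not shown in the report.
MBUF_STRUCT_BYTES = 128          # assumed sizeof(struct rte_mbuf)
MBUF_DATA_BYTES = 2048 + 128     # assumed data room plus headroom


def estimate_pool_bytes(cfg_text):
    """Estimate the minimum memory needed for [dpdk] pool_size mbufs."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    pool_size = parser.getint("dpdk", "pool_size")
    return pool_size * (MBUF_STRUCT_BYTES + MBUF_DATA_BYTES)


def hugepage_bytes_free(meminfo_text):
    """Compute free hugepage memory from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    # Hugepagesize is reported in kB.
    return fields.get("HugePages_Free", 0) * fields.get("Hugepagesize", 0) * 1024


if __name__ == "__main__":
    sample_cfg = "[dpdk]\n# Pool and pool cache sizes\npool_size = 131072\n"
    sample_meminfo = "HugePages_Free:      64\nHugepagesize:       2048 kB\n"
    need = estimate_pool_bytes(sample_cfg)
    have = hugepage_bytes_free(sample_meminfo)
    print(f"pool needs at least {need // 2**20} MiB, hugepages free: {have // 2**20} MiB")
```

With pool_size = 131072 the estimate is about 288 MiB, so a host (or container) with fewer free hugepages than that on the relevant NUMA node would plausibly fail at this step. On a real machine, pass the contents of /proc/meminfo instead of the sample string.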

DPDK Build Issue

Hi, I have been trying to follow the guide and install by running
./build_all.sh MLX5 CONDA INSTALL NOSCALING SKIP_EXPS SKIP_EXAMPLES PYTORCH,
but I run into the following issue at the DPDK build step:

== Build buildtools
== Build kernel
== Build kernel/linux
== Build buildtools/pmdinfogen
== Build kernel/linux/igb_uio
== Build kernel/linux/kni
HOSTCC pmdinfogen.o
HOSTLD dpdk-pmdinfogen
INSTALL-HOSTAPP dpdk-pmdinfogen
== Build drivers
make[5]: *** /lib/modules/5.6.13-0_fbk17_zion_5815_gc01d8dbd2635/build: No such file or directory. Stop.
make[4]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.module.mk:51: rte_kni.ko] Error 2
make[3]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: kni] Error 2
make[3]: *** Waiting for unfinished jobs....
make[5]: *** /lib/modules/5.6.13-0_fbk17_zion_5815_gc01d8dbd2635/build: No such file or directory. Stop.
make[4]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.module.mk:51: igb_uio.ko] Error 2
make[3]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: igb_uio] Error 2
make[2]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: linux] Error 2
make[1]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.sdkbuild.mk:48: kernel] Error 2
make[1]: *** Waiting for unfinished jobs....

I am running it in Docker in the following environment:

OS: "CentOS Linux 8"
Arch: "Intel Corporation Sky Lake"
Ethernet controller: "Mellanox Technologies MT2892 Family [ConnectX-6 Dx]"

Also, I have installed the latest DPDK separately and it compiles correctly, but the DPDK version bundled with OmniReduce fails to build. Any pointers on how to compile it, or on how to point OmniReduce at the separately installed DPDK, would be really helpful. That said, newer DPDK versions usually don't work with applications written against older APIs, which I think would be the case here.
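The make errors above all point at a missing /lib/modules/<kernel release>/build directory, which the igb_uio and rte_kni kernel modules need in order to compile. Inside Docker the kernel release is the host's, so the matching headers are often absent from the container. A minimal check, with hypothetical helper names, might look like this:

```python
import os
from pathlib import Path


def kernel_build_dir(release=None, modules_root="/lib/modules"):
    """Return the path where kernel module builds expect the kernel headers."""
    release = release or os.uname().release
    return Path(modules_root) / release / "build"


def has_kernel_headers(release=None, modules_root="/lib/modules"):
    """True when the build directory exists.

    The entry is usually a symlink; is_dir() follows it and returns False
    when the link is dangling.
    """
    return kernel_build_dir(release, modules_root).is_dir()


if __name__ == "__main__":
    path = kernel_build_dir()
    status = "found" if has_kernel_headers() else "missing"
    print(f"{path}: {status}")
```

If the directory is missing, the usual options are to install or mount headers that match the host kernel, or to build DPDK without its kernel modules. Whether build_all.sh exposes a switch for the latter is not stated in the report, so treat that as something to verify.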

Forward Pass Really Slow.

Hi, I have tried 2 different models, both with and without OFFLOAD_BITMAP, on an A100 GPU with a single worker and a single PS in colocated mode. I see 100% CPU utilization and close to 0% GPU utilization while the training forward pass gets stuck on x = self.relu(self.conv1(x)) for minutes. I have confirmed that x is a CUDA tensor. Any idea why this might be? Are there any changes that might affect PyTorch's forward pass or the underlying libraries?
