omnireduce's Issues
Fail to run OmniReduce-RDMA example
Description:
Hello,
I encountered the error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7" while running the OmniReduce-RDMA example. I'm trying to run the single-node example, where the aggregator and the worker are on the same machine.
Steps to reproduce:
- Compile as described in the README.
- Modify the .cfg file:
[omnireduce]
num_workers = 1
num_aggregators = 1
num_threads = 8
worker_cores = -1,-1,-1,-1,-1,-1,-1,-1
aggregator_cores = -1,-1,-1,-1,-1,-1,-1,-1
threshold = 0.0
buffer_size = 1024
chunk_size = 1048576
bitmap_chunk_size = 16777216
message_size = 256
block_size = 256
ib_hca = mlx5_0
ib_port = 1
gid_idx = 2
sl = 2
gpu_devId = 0
direct_memory = 1
adaptive_blocksize = 0
tcp_port = 19875
worker_ips = 127.0.0.1
aggregator_ips = 127.0.0.1
- Run aggregators
$ ./aggregator
IB device: mlx5_0
- Run workers, and observe the error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7" as below
$ mpirun -n 1 -host 127.0.0.1 ./cuda_worker
CUDA device [NVIDIA A100-PCIE-40GB]
IB device: mlx5_0
ibv_reg_mr cuda_comm_buf failed with mr_flags=7
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[19406,1],0]
Exit code: 1
--------------------------------------------------------------------------
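Before digging into the RDMA side, it is worth sanity-checking that the config is self-consistent for a single-node run. This is an illustrative sketch only, not part of OmniReduce; it parses the section above with the standard-library configparser and checks that the IP lists match the declared worker/aggregator counts (if they do not, the TCP handshake on tcp_port waits for peers that never connect):

```python
import configparser

# Hypothetical sanity check for the single-node omnireduce.cfg shown
# above; key names are taken from the config in the report.
cfg_text = """
[omnireduce]
num_workers = 1
num_aggregators = 1
worker_ips = 127.0.0.1
aggregator_ips = 127.0.0.1
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)
sec = cfg["omnireduce"]

workers = sec["worker_ips"].split(",")
aggregators = sec["aggregator_ips"].split(",")

# Each list should have exactly as many entries as the declared count.
assert len(workers) == sec.getint("num_workers")
assert len(aggregators) == sec.getint("num_aggregators")
print("config is self-consistent")
```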
Expected behavior:
The program should run without any errors and show the calculated algbw.
Actual behavior:
The error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7" is displayed, and the cuda_worker fails to run.
Additional information:
I'm not sure what the "mr_flags=7" part of the error message means, or how to fix the problem. I've tried searching online for solutions, but haven't found anything that helps. Any suggestions or advice would be greatly appreciated.
Thank you for your help!
Alex Li
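For what it's worth, mr_flags=7 in that message is almost certainly the ibv_access_flags bitmask passed to ibv_reg_mr: 7 = IBV_ACCESS_LOCAL_WRITE (1) | IBV_ACCESS_REMOTE_WRITE (2) | IBV_ACCESS_REMOTE_READ (4), i.e. an ordinary read/write registration. So the flags themselves are not the problem; registering the CUDA buffer is what fails, which on a setup like this usually points at GPUDirect RDMA support (the nvidia-peermem / nv_peer_mem kernel module) not being loaded. A quick decoder of the bitmask, using the flag values from <infiniband/verbs.h>:

```python
# Decode the mr_flags value from the error message using the
# ibv_access_flags constants defined in <infiniband/verbs.h>.
IBV_ACCESS_FLAGS = {
    1: "IBV_ACCESS_LOCAL_WRITE",    # 1 << 0
    2: "IBV_ACCESS_REMOTE_WRITE",   # 1 << 1
    4: "IBV_ACCESS_REMOTE_READ",    # 1 << 2
    8: "IBV_ACCESS_REMOTE_ATOMIC",  # 1 << 3
}

def decode_mr_flags(flags: int) -> list[str]:
    """Return the names of the access flags set in the bitmask."""
    return [name for bit, name in IBV_ACCESS_FLAGS.items() if flags & bit]

print(decode_mr_flags(7))
# → ['IBV_ACCESS_LOCAL_WRITE', 'IBV_ACCESS_REMOTE_WRITE', 'IBV_ACCESS_REMOTE_READ']
```

Since the flags are routine, checking `lsmod` for nvidia-peermem (or nv_peer_mem on older driver stacks) is a reasonable next step.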
DAIET RX Queue Map Error
Hi, I was able to compile PyTorch with Gloo and DAIET, and am now trying to run training, but I get the following error:
Configuration file daiet-server.cfg
** DAIET parameters **
Num updates: 256
Max num pending messages: 64
Worker port: 4000
PS port: 3030
Worker IP: 14.207.254.149
PS0: XX:XX:XX:XX:XX:XX 14.207.254.165
Num workers: 1
Number of threads: 4
CPU freq: 1.8 GHz
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
Link Up. Speed 200000 Mbps - Full-duplex
Cannot init mbuf pool: Cannot allocate memory
My config file for reference daiet-server.cfg:
[daiet]
num_workers = 1
# Block size
num_updates = 256
# Slots per core
max_num_pending_messages = 64
worker_port = 4000
ps_port = 3030
worker_ip = 14.207.254.149
ps_ips = 14.207.254.165
ps_macs = XX:XX:XX:XX:XX:XX
# Deprecated config, no need to change
sync_blocks = 100000000
[dpdk]
# Cores to bind to
cores = 10-13
prefix = daiet
# Extra EAL options
extra_eal_options = -w 0000:8e:00.0
# Port id
port_id = 0
# Pool and pool cache sizes
pool_size = 131072
pool_cache_size = 512
# Number of packets in a burst
burst_rx = 64
burst_tx = 32
# Bulk drain timer (microseconds)
bulk_drain_tx_us = 10
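"Cannot init mbuf pool: Cannot allocate memory" usually means the hugepage memory reserved for DPDK is smaller than what the configured pool needs. A rough back-of-the-envelope estimate for the pool_size above (the per-mbuf cost is an assumption based on DPDK's defaults: RTE_MBUF_DEFAULT_BUF_SIZE of 2048 bytes plus 128 bytes of headroom, plus the rte_mbuf struct and mempool object overhead):

```python
# Rough estimate of hugepage memory needed for the mbuf pool above.
# Per-mbuf cost is an approximation: 2048B default dataroom + 128B
# headroom + ~128B struct rte_mbuf + ~64B mempool/object overhead.
pool_size = 131072          # from [dpdk] pool_size in the config above
approx_mbuf_bytes = 2048 + 128 + 128 + 64

total_mb = pool_size * approx_mbuf_bytes / (1024 * 1024)
print(f"~{total_mb:.0f} MB of hugepages needed for the mbuf pool")
# → ~296 MB of hugepages needed for the mbuf pool
```

So if only, say, 256 MB of hugepages are reserved (128 × 2 MB pages), the allocation fails; either reserve more hugepages or lower pool_size. The repeated "queue stats mapping error Unknown error -95" lines are a separate matter: -95 is EOPNOTSUPP, i.e. the NIC driver does not support per-queue stats mapping, which is typically harmless.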
DPDK Build Issue
Hi, I have been trying to follow the guide and install by running
./build_all.sh MLX5 CONDA INSTALL NOSCALING SKIP_EXPS SKIP_EXAMPLES PYTORCH
but I run into the following issue at the DPDK build step:
== Build buildtools
== Build kernel
== Build kernel/linux
== Build buildtools/pmdinfogen
== Build kernel/linux/igb_uio
== Build kernel/linux/kni
HOSTCC pmdinfogen.o
HOSTLD dpdk-pmdinfogen
INSTALL-HOSTAPP dpdk-pmdinfogen
== Build drivers
make[5]: *** /lib/modules/5.6.13-0_fbk17_zion_5815_gc01d8dbd2635/build: No such file or directory. Stop.
make[4]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.module.mk:51: rte_kni.ko] Error 2
make[3]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: kni] Error 2
make[3]: *** Waiting for unfinished jobs....
make[5]: *** /lib/modules/5.6.13-0_fbk17_zion_5815_gc01d8dbd2635/build: No such file or directory. Stop.
make[4]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.module.mk:51: igb_uio.ko] Error 2
make[3]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: igb_uio] Error 2
make[2]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: linux] Error 2
make[1]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.sdkbuild.mk:48: kernel] Error 2
make[1]: *** Waiting for unfinished jobs....
I am running it in docker in the following env:
OS: "CentOS Linux 8"
Arch: "Intel Corporation Sky Lake"
Ethernet controller: "Mellanox Technologies MT2892 Family [ConnectX-6 Dx]"
Also, I have installed the latest DPDK separately and it compiles correctly, but the build fails for the DPDK bundled with this OmniReduce version. Any pointers on getting this to compile, or on pointing OmniReduce at a separately installed newer DPDK, would be really helpful. That said, newer DPDK versions usually don't work with applications written against older APIs, which I suspect is the case here.
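The make error ("/lib/modules/…/build: No such file or directory") means the kernel headers for the running kernel are missing, so the rte_kni.ko and igb_uio.ko modules cannot be built; this is common inside a container, where the running kernel differs from what the image's packages expect. Two hedged options: install matching headers (on CentOS 8, dnf install kernel-devel matching `uname -r`), or skip the kernel modules entirely if KNI/igb_uio are not needed (mlx5 NICs use the bifurcated driver and typically do not need igb_uio). A sketch of the second option, assuming the tree layout from the log and the make-based config file name of this DPDK generation (common_linux), which may differ in other versions:

```shell
# Sketch: disable the DPDK kernel-module builds so missing kernel
# headers no longer break the build. The path is an assumption based
# on the log above; adjust DPDK_CFG to your checkout.
DPDK_CFG=daiet/lib/dpdk/config/common_linux
if [ -f "$DPDK_CFG" ]; then
    sed -i 's/CONFIG_RTE_KNI_KMOD=y/CONFIG_RTE_KNI_KMOD=n/' "$DPDK_CFG"
    sed -i 's/CONFIG_RTE_EAL_IGB_UIO=y/CONFIG_RTE_EAL_IGB_UIO=n/' "$DPDK_CFG"
    echo "kernel modules disabled"
else
    echo "config file not found; adjust DPDK_CFG"
fi
```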
Forward Pass Really Slow.
Hi, I have tried 2 different models, both with and without OFFLOAD_BITMAP, on an A100 GPU with a single worker and a single PS in colocated mode. I see 100% CPU utilization and close to 0% GPU utilization while the training forward pass gets stuck on x = self.relu(self.conv1(x)) for minutes. I reconfirmed that x is a CUDA tensor. Any idea why this might be? Are there any changes that might affect PyTorch's forward pass or its underlying libraries?
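One way to see where the step is actually stuck, rather than guessing, is a stack dump from inside the training process; Python's stdlib faulthandler does this and can be dropped into any PyTorch script. A minimal sketch:

```python
import faulthandler
import sys

# Dump all Python thread stacks to stderr. If the forward pass is
# blocked (e.g. spinning in a collective or a stuck host<->device
# transfer), the dump shows the exact frame. dump_traceback_later()
# re-dumps periodically, which helps with hangs that only appear
# minutes into training.
faulthandler.dump_traceback(file=sys.stderr)
faulthandler.dump_traceback_later(timeout=30, repeat=True)

# ... training loop would run here ...

faulthandler.cancel_dump_traceback_later()
```

For an already-running process, `py-spy dump --pid <pid>` gives a similar view from outside, without modifying the script.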