omnireduce's Issues
Fail to run OmniReduce-RDMA example
Description:
Hello,
I encountered the error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7" while running the OmniReduce-RDMA example. I'm trying to run the single-node example, where the aggregator and the worker are on the same machine.
Steps to reproduce:
- Compile as described in the README.
- Modify the .cfg file:
[omnireduce]
num_workers = 1
num_aggregators = 1
num_threads = 8
worker_cores = -1,-1,-1,-1,-1,-1,-1,-1
aggregator_cores = -1,-1,-1,-1,-1,-1,-1,-1
threshold = 0.0
buffer_size = 1024
chunk_size = 1048576
bitmap_chunk_size = 16777216
message_size = 256
block_size = 256
ib_hca = mlx5_0
ib_port = 1
gid_idx = 2
sl = 2
gpu_devId = 0
direct_memory = 1
adaptive_blocksize = 0
tcp_port = 19875
worker_ips = 127.0.0.1
aggregator_ips = 127.0.0.1
- Run aggregators
$ ./aggregator
IB device: mlx5_0
- Run workers, and observe the error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7" as below
$ mpirun -n 1 -host 127.0.0.1 ./cuda_worker
CUDA device [NVIDIA A100-PCIE-40GB]
IB device: mlx5_0
ibv_reg_mr cuda_comm_buf failed with mr_flags=7
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[19406,1],0]
Exit code: 1
--------------------------------------------------------------------------
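Before digging into the RDMA side, it is worth sanity-checking that the config is self-consistent for a single-node run. This is an illustrative sketch only, not part of OmniReduce; it parses the section above with the standard-library configparser and checks that the IP lists match the declared worker/aggregator counts (if they do not, the TCP handshake on tcp_port waits for peers that never connect):

```python
import configparser

# Hypothetical sanity check for the single-node omnireduce.cfg shown
# above; key names are taken from the config in the report.
cfg_text = """
[omnireduce]
num_workers = 1
num_aggregators = 1
worker_ips = 127.0.0.1
aggregator_ips = 127.0.0.1
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)
sec = cfg["omnireduce"]

workers = sec["worker_ips"].split(",")
aggregators = sec["aggregator_ips"].split(",")

# Each list should have exactly as many entries as the declared count.
assert len(workers) == sec.getint("num_workers")
assert len(aggregators) == sec.getint("num_aggregators")
print("config is self-consistent")
```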
Expected behavior:
The program should run without any errors and show the calculated algbw.
Actual behavior:
The error message "ibv_reg_mr cuda_comm_buf failed with mr_flags=7" is displayed, and the cuda_worker fails to run.
Additional information:
I'm not sure what the "mr_flags=7" part of the error message means, or how to fix the problem. I've tried searching online for solutions, but haven't found anything that helps. Any suggestions or advice would be greatly appreciated.
Thank you for your help!
Alex Li
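For what it's worth, mr_flags=7 in that message is almost certainly the ibv_access_flags bitmask passed to ibv_reg_mr: 7 = IBV_ACCESS_LOCAL_WRITE (1) | IBV_ACCESS_REMOTE_WRITE (2) | IBV_ACCESS_REMOTE_READ (4), i.e. an ordinary read/write registration. So the flags themselves are not the problem; registering the CUDA buffer is what fails, which on a setup like this usually points at GPUDirect RDMA support (the nvidia-peermem / nv_peer_mem kernel module) not being loaded. A quick decoder of the bitmask, using the flag values from <infiniband/verbs.h>:

```python
# Decode the mr_flags value from the error message using the
# ibv_access_flags constants defined in <infiniband/verbs.h>.
IBV_ACCESS_FLAGS = {
    1: "IBV_ACCESS_LOCAL_WRITE",    # 1 << 0
    2: "IBV_ACCESS_REMOTE_WRITE",   # 1 << 1
    4: "IBV_ACCESS_REMOTE_READ",    # 1 << 2
    8: "IBV_ACCESS_REMOTE_ATOMIC",  # 1 << 3
}

def decode_mr_flags(flags: int) -> list[str]:
    """Return the names of the access flags set in the bitmask."""
    return [name for bit, name in IBV_ACCESS_FLAGS.items() if flags & bit]

print(decode_mr_flags(7))
# → ['IBV_ACCESS_LOCAL_WRITE', 'IBV_ACCESS_REMOTE_WRITE', 'IBV_ACCESS_REMOTE_READ']
```

Since the flags are routine, checking `lsmod` for nvidia-peermem (or nv_peer_mem on older driver stacks) is a reasonable next step.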
DAIET RX Queue Map Error
Hi, I was able to compile PyTorch with Gloo and DAIET, and am now trying to run training, but I get the following error:
Configuration file daiet-server.cfg
** DAIET parameters **
Num updates: 256
Max num pending messages: 64
Worker port: 4000
PS port: 3030
Worker IP: 14.207.254.149
PS0: XX:XX:XX:XX:XX:XX 14.207.254.165
Num workers: 1
Number of threads: 4
CPU freq: 1.8 GHz
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
Link Up. Speed 200000 Mbps - Full-duplex
Cannot init mbuf pool: Cannot allocate memory
My config file for reference daiet-server.cfg:
[daiet]
num_workers = 1
# Block size
num_updates = 256
# Slots per core
max_num_pending_messages = 64
worker_port = 4000
ps_port = 3030
worker_ip = 14.207.254.149
ps_ips = 14.207.254.165
ps_macs = XX:XX:XX:XX:XX:XX
# Deprecated config, no need to change
sync_blocks = 100000000
[dpdk]
# Cores to bind to
cores = 10-13
prefix = daiet
# Extra EAL options
extra_eal_options = -w 0000:8e:00.0
# Port id
port_id = 0
# Pool and pool cache sizes
pool_size = 131072
pool_cache_size = 512
# Number of packets in a burst
burst_rx = 64
burst_tx = 32
# Bulk drain timer (microseconds)
bulk_drain_tx_us = 10
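"Cannot init mbuf pool: Cannot allocate memory" usually means the hugepage memory reserved for DPDK is smaller than what the configured pool needs. A rough back-of-the-envelope estimate for the pool_size above (the per-mbuf cost is an assumption based on DPDK's defaults: RTE_MBUF_DEFAULT_BUF_SIZE of 2048 bytes plus 128 bytes of headroom, plus the rte_mbuf struct and mempool object overhead):

```python
# Rough estimate of hugepage memory needed for the mbuf pool above.
# Per-mbuf cost is an approximation: 2048B default dataroom + 128B
# headroom + ~128B struct rte_mbuf + ~64B mempool/object overhead.
pool_size = 131072          # from [dpdk] pool_size in the config above
approx_mbuf_bytes = 2048 + 128 + 128 + 64

total_mb = pool_size * approx_mbuf_bytes / (1024 * 1024)
print(f"~{total_mb:.0f} MB of hugepages needed for the mbuf pool")
# → ~296 MB of hugepages needed for the mbuf pool
```

So if only, say, 256 MB of hugepages are reserved (128 × 2 MB pages), the allocation fails; either reserve more hugepages or lower pool_size. The repeated "queue stats mapping error Unknown error -95" lines are a separate matter: -95 is EOPNOTSUPP, i.e. the NIC driver does not support per-queue stats mapping, which is typically harmless.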
DPDK Build Issue
Hi, I have been trying to follow the guide and install by running
./build_all.sh MLX5 CONDA INSTALL NOSCALING SKIP_EXPS SKIP_EXAMPLES PYTORCH
but I run into the following issue at the DPDK build step:
== Build buildtools
== Build kernel
== Build kernel/linux
== Build buildtools/pmdinfogen
== Build kernel/linux/igb_uio
== Build kernel/linux/kni
HOSTCC pmdinfogen.o
HOSTLD dpdk-pmdinfogen
INSTALL-HOSTAPP dpdk-pmdinfogen
== Build drivers
make[5]: *** /lib/modules/5.6.13-0_fbk17_zion_5815_gc01d8dbd2635/build: No such file or directory. Stop.
make[4]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.module.mk:51: rte_kni.ko] Error 2
make[3]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: kni] Error 2
make[3]: *** Waiting for unfinished jobs....
make[5]: *** /lib/modules/5.6.13-0_fbk17_zion_5815_gc01d8dbd2635/build: No such file or directory. Stop.
make[4]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.module.mk:51: igb_uio.ko] Error 2
make[3]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: igb_uio] Error 2
make[2]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.subdir.mk:37: linux] Error 2
make[1]: *** [/root/tmp/omnireduce/omnireduce-DPDK/daiet/lib/dpdk/mk/rte.sdkbuild.mk:48: kernel] Error 2
make[1]: *** Waiting for unfinished jobs....
I am running it in docker in the following env:
OS: "CentOS Linux 8"
Arch: "Intel Corporation Sky Lake"
Ethernet controller: "Mellanox Technologies MT2892 Family [ConnectX-6 Dx]"
Also, I have installed the latest DPDK separately and it compiles correctly, but the build fails for the DPDK bundled with this OmniReduce version. Any pointers on getting this to compile, or on pointing OmniReduce at a separately installed newer DPDK, would be really helpful. That said, newer DPDK versions usually don't work with applications written against older APIs, which I suspect is the case here.
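The make error ("/lib/modules/…/build: No such file or directory") means the kernel headers for the running kernel are missing, so the rte_kni.ko and igb_uio.ko modules cannot be built; this is common inside a container, where the running kernel differs from what the image's packages expect. Two hedged options: install matching headers (on CentOS 8, dnf install kernel-devel matching `uname -r`), or skip the kernel modules entirely if KNI/igb_uio are not needed (mlx5 NICs use the bifurcated driver and typically do not need igb_uio). A sketch of the second option, assuming the tree layout from the log and the make-based config file name of this DPDK generation (common_linux), which may differ in other versions:

```shell
# Sketch: disable the DPDK kernel-module builds so missing kernel
# headers no longer break the build. The path is an assumption based
# on the log above; adjust DPDK_CFG to your checkout.
DPDK_CFG=daiet/lib/dpdk/config/common_linux
if [ -f "$DPDK_CFG" ]; then
    sed -i 's/CONFIG_RTE_KNI_KMOD=y/CONFIG_RTE_KNI_KMOD=n/' "$DPDK_CFG"
    sed -i 's/CONFIG_RTE_EAL_IGB_UIO=y/CONFIG_RTE_EAL_IGB_UIO=n/' "$DPDK_CFG"
    echo "kernel modules disabled"
else
    echo "config file not found; adjust DPDK_CFG"
fi
```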
Forward Pass Really Slow.
Hi, I have tried 2 different models, both with and without OFFLOAD_BITMAP, on an A100 GPU with a single worker and a single PS in colocated mode. I see 100% CPU utilization and close to 0% GPU utilization while the training forward pass gets stuck on x = self.relu(self.conv1(x)) for minutes. I reconfirmed that x is a CUDA tensor. Any idea why this might be? Are there any changes that might affect PyTorch's forward pass or its underlying libraries?
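One way to see where the step is actually stuck, rather than guessing, is a stack dump from inside the training process; Python's stdlib faulthandler does this and can be dropped into any PyTorch script. A minimal sketch:

```python
import faulthandler
import sys

# Dump all Python thread stacks to stderr. If the forward pass is
# blocked (e.g. spinning in a collective or a stuck host<->device
# transfer), the dump shows the exact frame. dump_traceback_later()
# re-dumps periodically, which helps with hangs that only appear
# minutes into training.
faulthandler.dump_traceback(file=sys.stderr)
faulthandler.dump_traceback_later(timeout=30, repeat=True)

# ... training loop would run here ...

faulthandler.cancel_dump_traceback_later()
```

For an already-running process, `py-spy dump --pid <pid>` gives a similar view from outside, without modifying the script.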