Comments (8)
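We do not recommend solving this with a symlink. That ~/.local/bin is missing from PATH may depend on your system (so far we have only tested on Ubuntu); consider adding that directory to PATH instead.
Specifically, we suspect that creating the plasma_store symlink with sudo caused a file ownership mismatch.
First, check whether /tmp/mge_plasma_136f282090b11a1f is owned by root.
If it is, try running train.py with sudo and see whether the error still occurs.
If the error goes away, the cause is indeed file ownership.
If you cannot resolve this for now, you can set the DataLoader parameter num_workers=0, which will slow down data loading to some extent.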
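The two checks suggested above (is ~/.local/bin on PATH, and who owns the socket file) can be sketched in Python. This is only an illustrative sketch, not part of MegEngine: the helper names `dir_on_path` and `owner_name` are hypothetical, the `pwd` module assumes a Unix-like system, and the socket path is the one from the log in this thread, so substitute your own.

```python
import os
import pwd  # Unix-only; assumed here because the thread targets Ubuntu


def dir_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.realpath(os.path.expanduser(directory))
    entries = [e for e in path_value.split(os.pathsep) if e]
    return any(os.path.realpath(os.path.expanduser(e)) == target for e in entries)


def owner_name(path):
    """Return the name of the user owning `path`, or None if it does not exist."""
    try:
        uid = os.stat(path).st_uid
    except FileNotFoundError:
        return None
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # The uid has no passwd entry; fall back to the numeric id.
        return str(uid)


if __name__ == "__main__":
    local_bin = "~/.local/bin"
    if not dir_on_path(local_bin):
        print(f"{local_bin} is not on PATH; consider adding it in your shell profile")
    sock = "/tmp/mge_plasma_136f282090b11a1f"  # path copied from the log in this thread
    print(f"{sock} is owned by: {owner_name(sock)}")
```

If the reported owner is root while train.py runs as a regular user, that would be consistent with the ownership mismatch suspected above.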
from megengine.
Please first try running plasma_store in your shell. If you get -bash: plasma_store: command not found, then pyarrow was not installed correctly; try uninstalling and reinstalling it:
pip3 uninstall pyarrow
pip3 install pyarrow -U
Also, could you tell us which pyarrow version you had before reinstalling? We would like to check whether this is a version issue.
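As a rough equivalent of the shell check above, the following Python sketch reports whether plasma_store can be found on PATH and which pyarrow version is installed. The helper names `find_executable` and `installed_version` are hypothetical, and `importlib.metadata` assumes Python 3.8 or newer (the traceback in this thread shows Python 3.7, where `pip3 show pyarrow` gives the same information).

```python
import shutil
from importlib import metadata  # available from Python 3.8


def find_executable(name):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("plasma_store:", find_executable("plasma_store") or "not found on PATH")
    print("pyarrow version:", installed_version("pyarrow") or "not installed")
```

A result where pyarrow has a version but plasma_store is not found suggests the script was installed into a directory that is not on PATH, rather than a broken pyarrow install.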
from megengine.
Also, self.__initialzed is a typo that lint should have caught... Thanks for pointing it out; we will fix it as soon as possible.
from megengine.
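> Please first try running plasma_store in your shell. If you get -bash: plasma_store: command not found, then pyarrow was not installed correctly; try uninstalling and reinstalling it:
> pip3 uninstall pyarrow
> pip3 install pyarrow -U
> Also, could you tell us which pyarrow version you had before reinstalling? We would like to check whether this is a version issue.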
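The version was 0.16.0 both before and after reinstalling, and the error persisted after the reinstall. However, I noticed this warning from pip3: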
WARNING: The script plasma_store is installed in '/home/rlee/.local/bin' which is not on PATH.
So I created a symlink:
sudo ln -s /home/rlee/.local/bin/plasma_store /usr/bin/plasma_store
That fixed the original error, but a new error appeared at runtime:
26 17:35:50 preparing dataset..
26 17:36:14 Epoch 0 LR 1.250e-02
/arrow/cpp/src/plasma/io.cc:168: Connection to IPC socket failed for pathname /tmp/mge_plasma_136f282090b11a1f, retrying 20 more times
/arrow/cpp/src/plasma/io.cc:168: Connection to IPC socket failed for pathname /tmp/mge_plasma_136f282090b11a1f, retrying 19 more times
...
/arrow/cpp/src/plasma/io.cc:168: Connection to IPC socket failed for pathname /tmp/mge_plasma_136f282090b11a1f, retrying 11 more times
/arrow/cpp/src/plasma/io.cc:168: Connection to IPC socket failed for pathname /tmp/mge_plasma_136f282090b11a1f, retrying 10 more times
Traceback (most recent call last):
File "train.py", line 328, in <module>
main()
File "train.py", line 93, in main
worker(0, 1, args)
File "train.py", line 209, in worker
train_func, train_queue, optimizer, args, epoch=epoch
File "train.py", line 239, in train
for step, (image, label) in enumerate(data_queue):
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 152, in __next__
minibatch = self._get_next_batch()
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 512, in _get_next_batch
batch_data = self._try_get_next_batch()
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 503, in _try_get_next_batch
return self.batch_queue.get(timeout=1)
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/_queue.py", line 77, in get
self.client = plasma.connect(self.socket_name)
File "pyarrow/_plasma.pyx", line 850, in pyarrow._plasma.connect
File "pyarrow/_plasma.pyx", line 291, in pyarrow._plasma.plasma_check_status
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Could not connect to socket /tmp/mge_plasma_136f282090b11a1f
^C
The inter-process communication error shown above is still unresolved.
from megengine.
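> Also, self.__initialzed is a typo that lint should have caught... Thanks for pointing it out; we will fix it as soon as possible.
You're welcome, and thank you for your work :)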
from megengine.
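After updating PATH and removing the symlink, the files under /tmp are no longer owned by root, but the original error still occurs, as shown in the screenshot.
With num_workers=0 the script runs.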
from megengine.
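> After updating PATH and removing the symlink, the files under /tmp are no longer owned by root, but the original error still occurs, as shown in the screenshot.
Could you turn on the debug switch and look at the log? That is, set num_workers > 0 and then run the following command: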
MGE_DATALOADER_PLASMA_DEBUG=1 python3 train.py --save=./data/models
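If it helps to keep the debug output for attaching to the issue, the same command can be launched from Python with the output written to a file. This is a sketch under assumptions: only the environment variable name and the train.py arguments come from the command above, while the helper names `debug_env` and `run_with_debug` and the log file name are hypothetical.

```python
import os
import subprocess
import sys


def debug_env(base_env=None):
    """Return a copy of the environment with the plasma debug switch set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MGE_DATALOADER_PLASMA_DEBUG"] = "1"
    return env


def run_with_debug(argv, log_path):
    """Run `argv` with the debug switch on, writing stdout and stderr to `log_path`."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            argv, env=debug_env(), stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode


if __name__ == "__main__":
    code = run_with_debug(
        [sys.executable, "train.py", "--save=./data/models"],
        "plasma_debug.log",  # hypothetical log file name
    )
    print("exit code:", code)
```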
from megengine.
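@Rlee719 Hi, please try running resnet/train.py from Models with the official MegEngine v1.0.0 release; it should work without problems now.
If you still run into problems while reproducing, feel free to open a separate issue (and mention version 1.0.0).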
from megengine.