
Comments (8)

wjfwzzc commented on May 18, 2024

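We don't recommend working around this with a symlink. The fact that ~/.local/bin is missing from PATH may be related to the system you are using (so far we have only tested on Ubuntu). Consider adding that directory to PATH instead.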
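
Before changing anything, a quick check along the following lines can show whether ~/.local/bin is on PATH and whether plasma_store resolves at all. This is only a minimal sketch: the helper names `dir_on_path` and `find_plasma_store` are hypothetical and are not part of MegEngine or pyarrow.

```python
import os
import shutil


def dir_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.realpath(os.path.expanduser(directory))
    entries = [p for p in path_value.split(os.pathsep) if p]
    return any(os.path.realpath(os.path.expanduser(p)) == target for p in entries)


def find_plasma_store():
    """Return the location plasma_store resolves to, or None if it is not on PATH."""
    return shutil.which("plasma_store")


if __name__ == "__main__":
    print("~/.local/bin on PATH:", dir_on_path("~/.local/bin"))
    print("plasma_store resolves to:", find_plasma_store())
```

If the first check prints False, adding a line such as `export PATH="$HOME/.local/bin:$PATH"` to the shell profile is the usual way to put the directory on PATH.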

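Specifically, we suspect that you created the plasma_store symlink with sudo, which may have caused a file ownership mismatch.
First, check whether /tmp/mge_plasma_136f282090b11a1f is owned by root.
If it is, try running train.py with sudo and see whether the error still occurs.
If the error goes away, the cause is indeed file ownership.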
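
The ownership check can also be done from Python. The following is a minimal sketch using only the standard library (POSIX only, since it relies on `pwd`); the helper name `file_owner` is hypothetical.

```python
import os
import pwd


def file_owner(path):
    """Return the name of the user owning `path`, or None if the path does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    try:
        return pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The uid has no passwd entry; fall back to the numeric id.
        return str(st.st_uid)


if __name__ == "__main__":
    # Socket path quoted in this thread; it differs from run to run.
    socket_path = "/tmp/mge_plasma_136f282090b11a1f"
    print(socket_path, "is owned by:", file_owner(socket_path))
```

A result of `root` while train.py runs as a regular user would be consistent with the ownership mismatch suspected above.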

If you cannot resolve the problem for now, consider setting the DataLoader parameter num_workers=0. This will slow training down to some extent.

from megengine.

wjfwzzc commented on May 18, 2024

First, try running plasma_store in a shell. If you get -bash: plasma_store: command not found, pyarrow was not installed correctly; try uninstalling and reinstalling it:

pip3 uninstall pyarrow
pip3 install pyarrow -U

Also, could you tell us which pyarrow version you had before reinstalling? We would like to confirm whether this is a version issue.
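
One way to read the installed version without going through pip is to import the package and look at its `__version__` attribute, which pyarrow exposes. A minimal sketch, where the helper name `installed_version` is hypothetical:

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


if __name__ == "__main__":
    print("pyarrow version:", installed_version("pyarrow"))
```

A result of None means the interpreter running the script cannot import pyarrow at all, which is a different problem from plasma_store not being on PATH.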


wjfwzzc commented on May 18, 2024

self.__initialzed is a typo; lint should have caught it... Thanks for pointing it out, we will fix it as soon as possible.


Rlee719 commented on May 18, 2024

First, try running plasma_store in a shell. If you get -bash: plasma_store: command not found, pyarrow was not installed correctly; try uninstalling and reinstalling it:

pip3 uninstall pyarrow
pip3 install pyarrow -U

Also, could you tell us which pyarrow version you had before reinstalling? We would like to confirm whether this is a version issue.

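The version was 0.16.0 both before and after reinstalling, and the error persisted after the reinstall. However, I noticed that pip3 printed this warning:
WARNING: The script plasma_store is installed in '/home/rlee/.local/bin' which is not on PATH.
So I created a symlink:
sudo ln -s /home/rlee/.local/bin/plasma_store /usr/bin/plasma_store
That resolved the original error, but a new error then appeared at runtime: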

26 17:35:50 preparing dataset..
26 17:36:14 Epoch 0 LR 1.250e-02
/arrow/cpp/src/plasma/io.cc:168: Connection to IPC socket failed for pathname /tmp/mge_plasma_136f282090b11a1f, retrying 20 more times
/arrow/cpp/src/plasma/io.cc:168: Connection to IPC socket failed for pathname /tmp/mge_plasma_136f282090b11a1f, retrying 19 more times
...
/arrow/cpp/src/plasma/io.cc:168: Connection to IPC socket failed for pathname /tmp/mge_plasma_136f282090b11a1f, retrying 11 more times
/arrow/cpp/src/plasma/io.cc:168: Connection to IPC socket failed for pathname /tmp/mge_plasma_136f282090b11a1f, retrying 10 more times
Traceback (most recent call last):
  File "train.py", line 328, in <module>
    main()
  File "train.py", line 93, in main
    worker(0, 1, args)
  File "train.py", line 209, in worker
    train_func, train_queue, optimizer, args, epoch=epoch
  File "train.py", line 239, in train
    for step, (image, label) in enumerate(data_queue):
  File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 152, in __next__
    minibatch = self._get_next_batch()
  File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 512, in _get_next_batch
    batch_data = self._try_get_next_batch()
  File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 503, in _try_get_next_batch
    return self.batch_queue.get(timeout=1)
  File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/_queue.py", line 77, in get
    self.client = plasma.connect(self.socket_name)
  File "pyarrow/_plasma.pyx", line 850, in pyarrow._plasma.connect
  File "pyarrow/_plasma.pyx", line 291, in pyarrow._plasma.plasma_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Could not connect to socket /tmp/mge_plasma_136f282090b11a1f
^C

This inter-process communication problem is still unresolved.


Rlee719 commented on May 18, 2024

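self.__initialzed is a typo; lint should have caught it... Thanks for pointing it out, we will fix it as soon as possible.

You're welcome, and thanks for the hard work :)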


Rlee719 commented on May 18, 2024

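After adding the directory to PATH and removing the symlink, the files under /tmp are no longer owned by root, but the original error still occurs, as shown in the screenshots: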

screenshot_7
screenshot_8

With num_workers=0, training runs.


hukun-megvii commented on May 18, 2024

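After adding the directory to PATH and removing the symlink, the files under /tmp are no longer owned by root, but the original error still occurs, as shown in the screenshots

Could you turn on the debug switch and look at the log? That is, set num_workers > 0 and then run the following command: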

MGE_DATALOADER_PLASMA_DEBUG=1 python3 train.py --save=./data/models
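
To capture the debug output in a file for attaching to the issue, the same switch can be set from a small Python wrapper. This is a minimal sketch: the helper names `debug_env` and `run_with_debug` and the log file name are hypothetical, and only the MGE_DATALOADER_PLASMA_DEBUG variable comes from the command above.

```python
import os
import subprocess


def debug_env(base_env=None):
    """Return a copy of the environment with the plasma debug switch set to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["MGE_DATALOADER_PLASMA_DEBUG"] = "1"
    return env


def run_with_debug(argv, log_path):
    """Run `argv` with the debug switch on, writing stdout and stderr to `log_path`."""
    with open(log_path, "w") as log_file:
        completed = subprocess.run(
            argv, env=debug_env(), stdout=log_file, stderr=subprocess.STDOUT
        )
    return completed.returncode


# Example call, equivalent to the shell command above:
# run_with_debug(["python3", "train.py", "--save=./data/models"], "plasma_debug.log")
```

The return value is the exit code of the child process, so a non-zero result together with the log file gives the information requested here.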


ChaiByte commented on May 18, 2024

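@Rlee719 Hi, please try running resnet/train.py from Models with the official MegEngine v1.0.0 release; it should work without problems.

If you still run into problems while reproducing this, please open a separate issue (and note that it concerns version 1.0.0).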

