federatedai / fedvision
Federated Computer Vision Engine
License: Apache License 2.0
Hi,
Could you help look into this issue? I followed the steps below and got an error:
python -m venv venv && source venv/bin/activate
python -m pip install -U pip && python -m pip install fedvision_deploy_toolkit
fedvision-deploy template standalone
fedvision-deploy deploy deploy --config standalone_template.yaml
error log:
File "/home/yangyu/work/fed_demo/venv/lib/python3.8/site-packages/fedvision_deploy_toolkit/_deploy.py", line 44, in deploy
_maybe_create_python_venv(machine)
File "/home/yangyu/work/fed_demo/venv/lib/python3.8/site-packages/fedvision_deploy_toolkit/_deploy.py", line 69, in _maybe_create_python_venv
raise RuntimeError(f"python executable {machine['python']} not valid")
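The RuntimeError above says the configured Python executable was not accepted. A minimal sketch for checking a candidate interpreter by hand before deploying, using only the standard library; the function name is hypothetical and this is not the check the toolkit itself performs:

```python
import shutil
import subprocess
import sys


def python_executable_is_valid(executable: str) -> bool:
    """Return True if `executable` resolves to a runnable Python 3 interpreter."""
    resolved = shutil.which(executable)  # accepts a bare name or a full path
    if resolved is None:
        return False
    try:
        result = subprocess.run(
            [resolved, "-c", "import sys; print(sys.version_info[0])"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and result.stdout.strip() == "3"


# Example: the interpreter running this script should pass the check.
print(python_executable_is_valid(sys.executable))
```

Running it against the value configured for the machine in standalone_template.yaml shows whether the path exists and starts at all.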
F0427 15:58:34.614382 40827 grpc_client.cc:504] GetRPC name:[conv1_bn_offset], ep:[127.0.0.1:12000], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
*** Check failure stack trace: ***
@ 0x7f810c218c7d google::LogMessage::Fail()
@ 0x7f810c21c72c google::LogMessage::SendToLog()
@ 0x7f810c2187a3 google::LogMessage::Flush()
@ 0x7f810c21dc3e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f810db0f672 paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f8187460421 execute_native_thread_routine_compat
@ 0x7f81891426db start_thread
@ 0x7f8188e6b71f clone
@ (nil) (unknown)
Aborted (core dumped)
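The log above reports "Deadline Exceeded" for the endpoint 127.0.0.1:12000. One quick thing to rule out is that nothing is listening on that port. A small sketch with the standard library; the function name is hypothetical, and a successful connection only shows that the port is open, not that the server behind it is healthy:

```python
import socket


def endpoint_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# Endpoint taken from the log above.
print(endpoint_is_reachable("127.0.0.1", 12000))
```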
I started the job with "sh examples/paddle_mnist/run.sh 127.0.0.1:10002".
I am sure training has started, but how can I tell when it has finished?
And once it has finished, how do I check the training results and view the training performance?
initial version, in very early stages
In a multi-party deployment, the file "template.yaml" seems to require the base directory of every machine to be the same. More specifically, if the base directories of two machines are set to different values, the deployment toolkit raises an error because it fails to locate the base_dir/fedvision.tar.gz file.
I hope this restriction will be lifted in a future version of FedVision.
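To illustrate the request, here is a sketch of resolving the archive path from each machine's own base directory instead of a shared one. The `machines` entries, the `base_dir` key and the example paths are assumptions made for illustration; this is not the toolkit's code:

```python
import posixpath

ARCHIVE_NAME = "fedvision.tar.gz"  # file name taken from the report above


def archive_path_for(machine: dict) -> str:
    """Build the archive path from this machine's own base_dir."""
    return posixpath.join(machine["base_dir"], ARCHIVE_NAME)


# Hypothetical machine entries; the paths are made up for illustration.
machines = [
    {"name": "machine1", "base_dir": "/data/projects/fedvision"},
    {"name": "machine2", "base_dir": "/home/user/fedvision"},
]

paths = {m["name"]: archive_path_for(m) for m in machines}
print(paths)
```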
I always get this error when running "sh FedVision/examples/paddle_detection/run.sh 127.0.0.1:10002" in standalone mode on Linux: "fedvision.framework.utils.exception.FedvisionWorkerException: execute task: trainer_0 failed, return code: 134". Logs are attached for reference. Any lead would be very helpful.
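A note on the return code: shells conventionally report a process killed by signal N as 128 + N, so 134 would correspond to signal 6. The sketch below decodes that, assuming the worker reports the code shell-style; the function name is hypothetical:

```python
import signal
from typing import Optional


def signal_from_return_code(return_code: int) -> Optional[signal.Signals]:
    """Map a shell-style return code (128 + N) back to signal N, if any."""
    if return_code <= 128:
        return None
    try:
        return signal.Signals(return_code - 128)
    except ValueError:
        # No signal with that number on this platform.
        return None


print(signal_from_return_code(134))
```

If the result is SIGABRT, the trainer process aborted itself, so the native stack trace in trainer.log is the place to look.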
worker-127.0.0.1.log
trainer.log
Hi,
I have a question: does FedVision support GPU training? If so, could you share the steps for training on a GPU?
The data path should be changed from "dataset/fruit" to "data/fruit".
Can FedVision run on Windows? If not, which Linux distribution should I install, for example Ubuntu or CentOS?
Hi,
When I run the following command, training always blocks at "DEBUG:data loader ready":
sh FedVision/examples/paddle_mnist/run.sh 127.0.0.1:10002
If I run sh FedVision/examples/paddle_mnist/run.sh 127.0.0.1:10003, master2 trains normally, but master1 does not. Why?
When deploying in our local Linux environment with "fedvision-deploy deploy deploy --config standalone_template.yaml", we get the error "paramiko.ssh_exception.SSHException: No authentication methods available".
(Our /etc/ssh/ssh_config contains: PasswordAuthentication yes)
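This message usually means the SSH client had no password, key, or agent to offer. Note also that /etc/ssh/ssh_config configures the client side; the server reads /etc/ssh/sshd_config. The sketch below lists what a key-based client could find locally. The key file names are the conventional OpenSSH ones and are an assumption about what the client looks for; the function name is hypothetical:

```python
import os
from pathlib import Path

# Conventional OpenSSH private-key file names (assumption, not an exhaustive list).
KEY_NAMES = ("id_rsa", "id_ecdsa", "id_ed25519")


def available_ssh_credentials(ssh_dir: Path) -> dict:
    """Report private-key files in ssh_dir and the agent socket, if set."""
    keys = [str(ssh_dir / name) for name in KEY_NAMES if (ssh_dir / name).is_file()]
    return {"key_files": keys, "agent_socket": os.environ.get("SSH_AUTH_SOCK")}


creds = available_ssh_credentials(Path.home() / ".ssh")
if not creds["key_files"] and not creds["agent_socket"]:
    print("no key files or agent found; a client without a password has nothing to offer")
else:
    print(creds)
```

If nothing is found, generating a key pair and adding the public key to the target machine's authorized_keys is one common way forward.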
Before training blocks, the log shows:
root:ERROR:<ppdet.data.source.voc.VOCDataSet object at 0x7fb0547d6f50>
I don't know whether the two are related, or whether the program is blocked for some other reason.
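The machine runs Ubuntu 20.04 with one Nvidia 3090 GPU and Python 3.8; all other package versions were installed as specified in the README and requirements.txt.
However, running the following command from the "Run examples" section
sh FedVision/examples/paddle_mnist/run.sh 127.0.0.1:10002
fails with this error: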
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
│ │ └ {'__name__': '__main__', '__doc__': None, '__package__': 'fedvision.framework.cli', '__loader__': <_frozen_importlib_external...
│ └ <code object <module> at 0x7f769b3b7240, file "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/cli/master....
└ <function _run_code at 0x7f769b3ff160>
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
│ └ {'__name__': '__main__', '__doc__': None, '__package__': 'fedvision.framework.cli', '__loader__': <_frozen_importlib_external...
└ <code object <module> at 0x7f769b3b7240, file "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/cli/master....
File "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/cli/master.py", line 58, in <module>
start_master()
└ <Command start-master>
File "/home/lkhpc/projects/fed_sub_machine/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
│ │ │ └ {}
│ │ └ ()
│ └ <function BaseCommand.main at 0x7f769aca4a60>
└ <Command start-master>
File "/home/lkhpc/projects/fed_sub_machine/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
│ │ └ <click.core.Context object at 0x7f769b446ca0>
│ └ <function Command.invoke at 0x7f769ac97430>
└ <Command start-master>
File "/home/lkhpc/projects/fed_sub_machine/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
│ │ │ │ │ └ {'submitter_port': 10002, 'party_id': 'master1', 'cluster_address': '127.0.0.1:10001', 'coordinator_address': '127.0.0.1:10000'}
│ │ │ │ └ <click.core.Context object at 0x7f769b446ca0>
│ │ │ └ <function start_master at 0x7f769ac9e820>
│ │ └ <Command start-master>
│ └ <function Context.invoke at 0x7f769aca4550>
└ <click.core.Context object at 0x7f769b446ca0>
File "/home/lkhpc/projects/fed_sub_machine/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
│ │ └ {'submitter_port': 10002, 'party_id': 'master1', 'cluster_address': '127.0.0.1:10001', 'coordinator_address': '127.0.0.1:10000'}
│ └ ()
└ <function start_master at 0x7f769ac9e820>
File "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/cli/master.py", line 48, in start_master
loop.run_forever()
│ └ <function BaseEventLoop.run_forever at 0x7f769ad45280>
└ <_UnixSelectorEventLoop running=True closed=False debug=False>
File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
self._run_once()
│ └ <function BaseEventLoop._run_once at 0x7f769ad47dc0>
└ <_UnixSelectorEventLoop running=True closed=False debug=False>
File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
handle._run()
│ └ <function Handle._run at 0x7f769ae33b80>
└ <Handle <TaskWakeupMethWrapper object at 0x7f769408bb20>(<Future finished result=1>)>
File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
self._context.run(self._callback, *self._args)
│ │ │ │ │ └ <member '_args' of 'Handle' objects>
│ │ │ │ └ <Handle <TaskWakeupMethWrapper object at 0x7f769408bb20>(<Future finished result=1>)>
│ │ │ └ <member '_callback' of 'Handle' objects>
│ │ └ <Handle <TaskWakeupMethWrapper object at 0x7f769408bb20>(<Future finished result=1>)>
│ └ <member '_context' of 'Handle' objects>
└ <Handle <TaskWakeupMethWrapper object at 0x7f769408bb20>(<Future finished result=1>)>
> File "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/master/master.py", line 470, in _co_handler
await job.compile()
│ └ <function PaddleFLJob.compile at 0x7f769891af70>
└ <fedvision.paddle_fl.job.PaddleFLJob object at 0x7f769408b820>
File "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/paddle_fl/job.py", line 95, in compile
raise FedvisionJobCompileException("compile error")
└ <class 'fedvision.framework.utils.exception.FedvisionJobCompileException'>
fedvision.framework.utils.exception.FedvisionJobCompileException: compile error
I used the command fedvision-deploy deploy deploy --config standalone_template.yaml
The error message is as follows:
Traceback (most recent call last):
File "/root/fedvision/fedvision/bin/fedvision-deploy", line 8, in <module>
sys.exit(app())
File "/root/fedvision/fedvision/lib/python3.6/site-packages/fedvision_deploy_toolkit/_deploy.py", line 44, in deploy
_maybe_create_python_venv(machine)
File "/root/fedvision/fedvision/lib/python3.6/site-packages/fedvision_deploy_toolkit/_deploy.py", line 69, in _maybe_create_python_venv
raise RuntimeError(f"python executable {machine['python']} not valid")
KeyError: 'python'
The standalone_template.yaml file:
`
machines:
coordinator:
  name: coordinator1
  machine: machine1
  port: 10000
clusters:
masters:
  - name: master1
    machine: machine1
    submit_port: 10002
    coordinator: coordinator1
    cluster: cluster1
  - name: master2
    machine: machine1
    submit_port: 10003
    coordinator: coordinator1
    cluster: cluster1
  - name: master3
    machine: machine1
    submit_port: 10004
    coordinator: coordinator1
    cluster: cluster1
  - name: master4
    machine: machine1
    submit_port: 10005
    coordinator: coordinator1
    cluster: cluster1
`
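On the KeyError: 'python' in the traceback: in Python, an f-string is evaluated before the exception is constructed, so indexing a missing key inside the error message raises KeyError instead of the intended RuntimeError. The sketch below reproduces that and shows a variant that reads the key with `.get()`. The `machine` dict shape and the function name are hypothetical; this is not the toolkit's code:

```python
def check_python_setting(machine: dict) -> str:
    """Return the configured interpreter or raise a clear RuntimeError."""
    python = machine.get("python")  # .get() avoids a KeyError while reporting
    if not python:
        raise RuntimeError(
            f"machine {machine.get('name', '<unnamed>')!r} has no valid 'python' entry"
        )
    return python


machine = {"name": "machine1"}  # hypothetical entry without a 'python' key

# Same pattern as in the traceback: the lookup fails while the message is built.
try:
    raise RuntimeError(f"python executable {machine['python']} not valid")
except KeyError as exc:
    print("KeyError raised while formatting the message:", exc)

# With .get(), the intended RuntimeError is what surfaces.
try:
    check_python_setting(machine)
except RuntimeError as exc:
    print("RuntimeError:", exc)
```

In practice this suggests the machine entry in the config lacks the key the installed toolkit version expects, which fits the empty "machines:" section in the file quoted above.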
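Could you double-check the officially released template.yaml file?
Why do the two workers under cluster1, worker1 and worker2, map to machine1 and machine2 respectively? This does not match the architecture diagram in the release, and it also goes wrong during multi-machine training.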
clusters:
  - name: cluster1
    manager:
      machine: machine1
      port: 10001
    workers:
  - name: cluster2
    manager:
      machine: machine2
      port: 10001
    workers:
      - name: worker1
        machine: machine1
        ports: 12000-12999
        max_tasks: 10
      - name: worker2
        machine: machine2
        ports: 13000-13999
        max_tasks: 10
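Reading the fragment as pasted, cluster1 ends up with an empty workers list while cluster2 holds both workers. A small sketch for spotting that kind of problem in an already-parsed config; the dict below mirrors the fragment by hand, and the function name is hypothetical:

```python
def clusters_without_workers(config: dict) -> list:
    """Return the names of clusters whose 'workers' entry is missing or empty."""
    clusters = config.get("clusters") or []
    return [c.get("name") for c in clusters if not c.get("workers")]


# The fragment above, written out as parsed data (an empty YAML value parses as None).
config = {
    "clusters": [
        {
            "name": "cluster1",
            "manager": {"machine": "machine1", "port": 10001},
            "workers": None,
        },
        {
            "name": "cluster2",
            "manager": {"machine": "machine2", "port": 10001},
            "workers": [
                {"name": "worker1", "machine": "machine1",
                 "ports": "12000-12999", "max_tasks": 10},
                {"name": "worker2", "machine": "machine2",
                 "ports": "13000-13999", "max_tasks": 10},
            ],
        },
    ]
}

print(clusters_without_workers(config))
```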
Training runs normally, but the .pdparams file written by checkpoint.save is empty.