fedvision's Issues

fedvision_deploy_toolkit/_deploy.py", line 69, in _maybe_create_python_venv raise RuntimeError(f"python executable {machine['python']} not valid")

Hi,
Could you help look into this issue? I followed the steps below and hit an error:
python -m venv venv && source venv/bin/activate
python -m pip install -U pip && python -m pip install fedvision_deploy_toolkit
fedvision-deploy template standalone
fedvision-deploy deploy deploy --config standalone_template.yaml

error log:
File "/home/yangyu/work/fed_demo/venv/lib/python3.8/site-packages/fedvision_deploy_toolkit/_deploy.py", line 44, in deploy
_maybe_create_python_venv(machine)
File "/home/yangyu/work/fed_demo/venv/lib/python3.8/site-packages/fedvision_deploy_toolkit/_deploy.py", line 69, in _maybe_create_python_venv
raise RuntimeError(f"python executable {machine['python']} not valid")
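
The message suggests the deploy step could not use the interpreter configured for the machine. Below is a minimal sketch for checking an interpreter by hand with only the standard library. `check_python` is a hypothetical helper and not part of fedvision_deploy_toolkit, and the default minimum version of 3.7 is an assumption that should be adjusted to whatever the toolkit actually requires.

```python
import shutil
import subprocess


def check_python(executable, minimum=(3, 7)):
    """Return (ok, detail) for a candidate interpreter name or path.

    Hypothetical helper for manual diagnosis; the toolkit's own
    validation logic may differ.
    """
    path = shutil.which(executable)
    if path is None:
        return False, f"{executable!r} not found or not executable"
    # Ask the interpreter itself which version it is.
    probe = "import sys; print('%d.%d' % sys.version_info[:2])"
    result = subprocess.run([path, "-c", probe], capture_output=True, text=True)
    if result.returncode != 0:
        return False, f"{path} failed to run"
    major, minor = (int(part) for part in result.stdout.strip().split("."))
    if (major, minor) < minimum:
        return False, f"{path} is {major}.{minor}, below {minimum[0]}.{minimum[1]}"
    return True, f"{path} is {major}.{minor}"
```

Calling `check_python("python3")` on the target machine, with the same value the config passes to the toolkit, shows whether that name resolves at all and which version it reports.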

standalone mode, run paddle_detection exception

F0427 15:58:34.614382 40827 grpc_client.cc:504] GetRPC name:[conv1_bn_offset], ep:[127.0.0.1:12000], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
*** Check failure stack trace: ***
@ 0x7f810c218c7d google::LogMessage::Fail()
@ 0x7f810c21c72c google::LogMessage::SendToLog()
@ 0x7f810c2187a3 google::LogMessage::Flush()
@ 0x7f810c21dc3e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f810db0f672 paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f8187460421 execute_native_thread_routine_compat
@ 0x7f81891426db start_thread
@ 0x7f8188e6b71f clone
@ (nil) (unknown)
Aborted (core dumped)
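
One thing that can be ruled out quickly is whether anything is listening on the endpoint named in the log (127.0.0.1:12000). The sketch below is a plain TCP reachability check; `endpoint_reachable` is a hypothetical helper, and a successful connection does not prove that the gRPC server behind the port is healthy, so "Deadline Exceeded" may still have another cause.

```python
import socket


def endpoint_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running `endpoint_reachable("127.0.0.1", 12000)` while the job is active indicates whether the server side of that endpoint was started at all.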

How to check the training result?

I started the job with "sh examples/paddle_mnist/run.sh 127.0.0.1:10002".
I am sure training has started, but how can I tell when it has finished?
And once it has finished, how can I check the training result and view the training performance?

Problem with "template.yaml" in multi-machine deployment

In a multi-party deployment, "template.yaml" seems to require every machine to use the same base directory. Specifically, if two machines are given different base directories, the deployment toolkit raises an error because it fails to locate the base_dir/fedvision.tar.gz file.
I hope this restriction will be lifted in a future version of FedVision.

Error: trainer_0 failed, return code 134

I always get this error when running "sh FedVision/examples/paddle_detection/run.sh 127.0.0.1:10002" in standalone mode on Linux: "fedvision.framework.utils.exception.FedvisionWorkerException: execute task: trainer_0 failed, return code: 134". The logs are attached for reference. Any lead would be very helpful.
worker-127.0.0.1.log
trainer.log
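
By the common shell convention, a return code above 128 means the process was terminated by signal number (code - 128), so 134 would correspond to signal 6, SIGABRT. The sketch below decodes a return code that way; `describe_return_code` is a hypothetical helper, and whether FedVision reports return codes using this convention is an assumption.

```python
import signal


def describe_return_code(code):
    """Describe a return code using the shell's 128 + signal convention."""
    if code > 128:
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            # No signal with that number on this platform.
            pass
    return f"exit status {code}"


print(describe_return_code(134))
```

If that reading is right, the trainer process aborted itself, so the native stack trace in trainer.log is the place to look for the failing check.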

GPU training

Hi,
I have a question: does FedVision support training on a GPU? If so, could you share the steps for GPU training?

standalone mode issue

Hi,
When I run the command below, training always blocks at "DEBUG:data loader ready":
sh FedVision/examples/paddle_mnist/run.sh 127.0.0.1:10002

If I run sh FedVision/examples/paddle_mnist/run.sh 127.0.0.1:10003, master2 trains normally, but master1 still does not. Why?

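Encountered a compile error

The machine runs Ubuntu 20.04 with one NVIDIA 3090 GPU and Python 3.8. All other package versions were installed as specified in the README and requirements.txt.
However, running the following command from the "Run examples" section fails with the error below:
sh FedVision/examples/paddle_mnist/run.sh 127.0.0.1:10002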

Traceback (most recent call last):

  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': None, '__package__': 'fedvision.framework.cli', '__loader__': <_frozen_importlib_external...
           │         └ <code object <module> at 0x7f769b3b7240, file "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/cli/master....
           └ <function _run_code at 0x7f769b3ff160>
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': None, '__package__': 'fedvision.framework.cli', '__loader__': <_frozen_importlib_external...
         └ <code object <module> at 0x7f769b3b7240, file "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/cli/master....

  File "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/cli/master.py", line 58, in <module>
    start_master()
    └ <Command start-master>

  File "/home/lkhpc/projects/fed_sub_machine/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x7f769aca4a60>
           └ <Command start-master>
  File "/home/lkhpc/projects/fed_sub_machine/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x7f769b446ca0>
         │    └ <function Command.invoke at 0x7f769ac97430>
         └ <Command start-master>
  File "/home/lkhpc/projects/fed_sub_machine/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'submitter_port': 10002, 'party_id': 'master1', 'cluster_address': '127.0.0.1:10001', 'coordinator_address': '127.0.0.1:10000'}
           │   │      │    │           └ <click.core.Context object at 0x7f769b446ca0>
           │   │      │    └ <function start_master at 0x7f769ac9e820>
           │   │      └ <Command start-master>
           │   └ <function Context.invoke at 0x7f769aca4550>
           └ <click.core.Context object at 0x7f769b446ca0>
  File "/home/lkhpc/projects/fed_sub_machine/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
           │         │       └ {'submitter_port': 10002, 'party_id': 'master1', 'cluster_address': '127.0.0.1:10001', 'coordinator_address': '127.0.0.1:10000'}
           │         └ ()
           └ <function start_master at 0x7f769ac9e820>

  File "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/cli/master.py", line 48, in start_master
    loop.run_forever()
    │    └ <function BaseEventLoop.run_forever at 0x7f769ad45280>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>

  File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
    │    └ <function BaseEventLoop._run_once at 0x7f769ad47dc0>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
    handle._run()
    │      └ <function Handle._run at 0x7f769ae33b80>
    └ <Handle <TaskWakeupMethWrapper object at 0x7f769408bb20>(<Future finished result=1>)>
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
    │    │            │    │           │    └ <member '_args' of 'Handle' objects>
    │    │            │    │           └ <Handle <TaskWakeupMethWrapper object at 0x7f769408bb20>(<Future finished result=1>)>
    │    │            │    └ <member '_callback' of 'Handle' objects>
    │    │            └ <Handle <TaskWakeupMethWrapper object at 0x7f769408bb20>(<Future finished result=1>)>
    │    └ <member '_context' of 'Handle' objects>
    └ <Handle <TaskWakeupMethWrapper object at 0x7f769408bb20>(<Future finished result=1>)>

> File "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/framework/master/master.py", line 470, in _co_handler
    await job.compile()
          │   └ <function PaddleFLJob.compile at 0x7f769891af70>
          └ <fedvision.paddle_fl.job.PaddleFLJob object at 0x7f769408b820>

  File "/home/lkhpc/projects/fed_sub_machine/FedVision/fedvision/paddle_fl/job.py", line 95, in compile
    raise FedvisionJobCompileException("compile error")
          └ <class 'fedvision.framework.utils.exception.FedvisionJobCompileException'>

fedvision.framework.utils.exception.FedvisionJobCompileException: compile error

Error when deploying FedVision

I ran the command fedvision-deploy deploy deploy --config standalone_template.yaml
The error message is as follows:
Traceback (most recent call last):

File "/root/fedvision/fedvision/bin/fedvision-deploy", line 8, in
sys.exit(app())

File "/root/fedvision/fedvision/lib/python3.6/site-packages/fedvision_deploy_toolkit/_deploy.py", line 44, in deploy
_maybe_create_python_venv(machine)

File "/root/fedvision/fedvision/lib/python3.6/site-packages/fedvision_deploy_toolkit/_deploy.py", line 69, in _maybe_create_python_venv
raise RuntimeError(f"python executable {machine['python']} not valid")

KeyError: 'python'

The standalone_template.yaml file:
machines:
  - name: machine1
    ip: 127.0.0.1
    ssh_string: 127.0.0.1:22
    base_dir: /data/projects/fedvision
    python_for_venv_create: python3 # use to create venv, python3.7+ required

# coordinator start/stop only if machine provided
coordinator:
  name: coordinator1
  machine: machine1
  port: 10000

clusters:
  - name: cluster1
    manager:
      machine: machine1
      port: 10001
    workers:
      - name: worker1
        machine: machine1
        ports: 12000-12099
        max_tasks: 10

masters:
  - name: master1
    machine: machine1
    submit_port: 10002
    coordinator: coordinator1
    cluster: cluster1

  - name: master2
    machine: machine1
    submit_port: 10003
    coordinator: coordinator1
    cluster: cluster1

  - name: master3
    machine: machine1
    submit_port: 10004
    coordinator: coordinator1
    cluster: cluster1

  - name: master4
    machine: machine1
    submit_port: 10005
    coordinator: coordinator1
    cluster: cluster1
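
The traceback reads `machine['python']`, while the machine entry in the template above only defines `python_for_venv_create`. That mismatch alone would produce the KeyError, and it may mean the installed toolkit and the template come from different versions. The sketch below checks a parsed config for the key the traceback asks for. `machines_missing_key` is a hypothetical helper, the dict simply mirrors the template's machine entry, and whether adding a `python` key is the right fix is an assumption.

```python
def machines_missing_key(config, key="python"):
    """Return the names of machines in a parsed deploy config that lack `key`."""
    return [
        machine.get("name", "<unnamed>")
        for machine in config.get("machines", [])
        if key not in machine
    ]


# The machine entry from the template, written as the dict a YAML parser would give.
config = {
    "machines": [
        {
            "name": "machine1",
            "ip": "127.0.0.1",
            "ssh_string": "127.0.0.1:22",
            "base_dir": "/data/projects/fedvision",
            "python_for_venv_create": "python3",
        }
    ]
}

print(machines_missing_key(config))  # machines without a "python" key
```

Note also that the paths in the traceback point at a python3.6 site-packages directory, while the comment in the template says python3.7+ is required.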

Deploying Cluster version, template.yaml issue

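Could you help double-check the officially released template.yaml file?
In cluster1, the two workers worker1 and worker2 are mapped to machine1 and machine2 respectively. How is that supposed to work? It does not match the diagram in the release, and it also goes wrong during multi-machine training.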

clusters:
  - name: cluster1
    manager:
      machine: machine1
      port: 10001
    workers:
      - name: worker1
        machine: machine1
        ports: 12000-12999
        max_tasks: 10
      - name: worker2
        machine: machine2
        ports: 13000-13999
        max_tasks: 10
  - name: cluster2
    manager:
      machine: machine2
      port: 10001
    workers:
      - name: worker1
        machine: machine1
        ports: 12000-12999
        max_tasks: 10
      - name: worker2
        machine: machine2
        ports: 13000-13999
        max_tasks: 10

lsof "command not found" problem in the "services start" step

On a CentOS 7 system, lsof has been installed as root using yum. However, the "services start" step still fails with lsof "command not found". How can this be solved?
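
One possible cause, offered as an assumption rather than a diagnosis: the deploy config connects to machines through an ssh_string, and non-interactive ssh sessions often get a shorter PATH than a root login shell, while yum-installed lsof commonly lands in an sbin directory. The sketch below reports where a command lives and whether the current PATH finds it. `locate_command` and the list of sbin directories are hypothetical and only illustrate the check.

```python
import os
import shutil

# Directories that are often missing from PATH in non-login shells (assumed list).
SBIN_DIRS = ("/usr/sbin", "/sbin", "/usr/local/sbin")


def locate_command(name, extra_dirs=SBIN_DIRS):
    """Return (path, on_path): where `name` lives and whether PATH finds it."""
    found = shutil.which(name)
    if found:
        return found, True
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate, False
    return None, False
```

If `locate_command("lsof")` returns a path with `on_path` set to False when run as the deploy user, extending PATH for that user (or for the non-interactive shell) is worth trying.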
