nvidia / nvflare

NVIDIA Federated Learning Application Runtime Environment

Home Page: https://nvidia.github.io/NVFlare/

License: Apache License 2.0

Shell 0.37% Python 89.37% HTML 0.27% Dockerfile 0.01% Jupyter Notebook 7.57% JavaScript 0.01% CMake 0.03% C++ 0.39% Astro 1.99%
python decentralized federated-analytics federated-learning pet privacy-protection federated-computing

nvflare's Issues

Document page states NVFlare only compatible with one single Python version

Previously, NVFlare 1.X was compatible with (and ran on) Python 3.8.10 because the pip package was released with pyc files only. Those pyc files were compiled by the Python 3.8.10 interpreter and thus had to run in a Python 3.8.10 environment.
In NVFlare 2.x, the pip packages contain source code instead of pyc files. Therefore, the original statement may cause confusion.
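
To illustrate why pyc-only packages are tied to the interpreter that compiled them, here is a minimal sketch. Each pyc file starts with a 4-byte magic number, and CPython refuses to load a pyc whose magic number differs from its own. The helper name pyc_matches_interpreter is hypothetical, and the "foreign" file is simulated by overwriting the header. The magic number generally changes between minor releases (for example 3.8 versus 3.9) rather than between patch releases, so treat this as an illustration of the mechanism, not a statement about NVFlare's exact constraint.

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path


def pyc_matches_interpreter(pyc_path: str) -> bool:
    """Return True if the .pyc header carries this interpreter's magic number."""
    with open(pyc_path, "rb") as f:
        return f.read(4) == importlib.util.MAGIC_NUMBER


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "demo.py"
    src.write_text("x = 1\n")

    # Compile with the running interpreter: the header matches.
    pyc = py_compile.compile(str(src), cfile=str(Path(tmp) / "demo.pyc"))
    same_interpreter = pyc_matches_interpreter(pyc)

    # Simulate a file built by another interpreter by replacing the magic number.
    data = Path(pyc).read_bytes()
    Path(pyc).write_bytes(b"\x00\x00\r\n" + data[4:])
    foreign = pyc_matches_interpreter(pyc)

print(same_interpreter, foreign)
```

Source distributions do not have this restriction, because each interpreter compiles the source itself.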

Multiple FL servers on the same machine

When running multiple FL servers on the same machine, even with individual ports for admin and client communication, the secure gRPC communication encounters issues:

E0120 10:25:20.267690287 12242 ssl_transport_security.cc:1468] Handshake failed with fatal error SSL_ERROR_SSL: error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED.

It seems that /tmp/fl_server contains the configuration of only one of the FL servers.

Time lag on fed events

The server-side fed event runner can handle 10 events per second. When many fed events arrive at once, it can take too long to process all of them.
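
As a rough back-of-the-envelope sketch of the lag, the time to clear a backlog depends on the processing rate minus the rate at which new events keep arriving. The function name drain_seconds and the sample numbers are hypothetical; only the rate of 10 events per second comes from the report.

```python
def drain_seconds(backlog: int, arrival_rate: float, service_rate: float = 10.0) -> float:
    """Seconds until the queue is empty.

    Returns infinity when events arrive at least as fast as they are processed.
    """
    if backlog <= 0:
        return 0.0
    net_rate = service_rate - arrival_rate
    if net_rate <= 0:
        return float("inf")
    return backlog / net_rate


# 6000 queued events, nothing new arriving: 10 minutes of lag.
print(drain_seconds(6000, arrival_rate=0))
# Same backlog while 4 events per second keep arriving.
print(drain_seconds(6000, arrival_rate=4))
```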

Improve examples

The hello examples are not as refined as the cifar10 example. Improve all examples so they are of the same quality.

Workflow for automatically building documentation is not working for apidocs

The apidocs are not being checked in because of the following line in .gitignore:

docs/apidocs/nvflare.*

The workflow runs the docs build from a checkout of the main branch, so this .gitignore applies. As a result, the generated apidocs HTML files are not checked into the docs branch and never make it to Pages.

Tenseal dependency for HE is not available on ARM aarch64

The tenseal dependency is not available for the ARM aarch64 platform, causing installation to fail. This has been reported for local development on Mac M1 and will affect other non-x86 architectures (Jetson, Clara AGX, IBM POWER, etc.).

The tenseal dependency is only required when using the HEBuilder module, and it looks like all other functionality could be used without this dependency. Can tenseal be made optional, with the caveat that HE is not available without tenseal?

One option would be to provide an alternate install: a requirements-no-tenseal.txt that includes everything but tenseal. For example, I generated this file in a clean venv on my Linux machine using:

pip download nvflare -d /tmp -v \
    | grep Collecting \
    | awk '{print $2}' \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v tenseal \
    | tee requirements-no-tenseal.txt

and verified that I can install nvflare and all deps except tenseal by copying to an aarch64 system (in this case a Jetson TX2) with:

python3 -m pip install --no-deps -r requirements-no-tenseal.txt

This is a pretty awkward solution. It would be much cleaner to remove the tenseal dependency in the default packaging, since HE is optional, and note in the docs that tenseal must be installed when using HE.
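
A minimal sketch of how an optional dependency could be handled in code, assuming the import is deferred until the HE code path is actually used. The helper require_optional is hypothetical and is not part of NVFlare.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional package, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with: python3 -m pip install {module_name}"
        ) from exc
```

An HE-specific module would call require_optional("tenseal", "Homomorphic encryption") only when HE is requested, so installations without tenseal keep working for everything else.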

Deploying FL on multiple computers.

I am trying to run NVFlare in a realistic setup with multiple computers. After the provisioning steps, I started the server, the clients, and the admin from their startup packages. The server started, but the client and admin computers reported a communication error.

2022-01-05 21:37:08,624 - Communicator - ERROR - Action: client_registration grpc communication error. retry: 1500, First start till now: 0.0013239383697509766 seconds.
2022-01-05 21:37:08,624 - Communicator - ERROR - Could not connect to server: imtl-85545-3:8765 Setting flag for stopping training. failed to connect to all addresses

I tried listing the listening ports on the server with nmap, and it showed 127.0.1.1:8002, which suggests the server is listening only on localhost and is not reachable from other computers. This makes me wonder whether the current NVFlare supports running realistic scenarios or only a POC (proof of concept). Please help me solve this problem, thank you.
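
The address 127.0.1.1 lies in the loopback range 127.0.0.0/8. One possible explanation (an assumption, not confirmed in this report) is that the server's hostname resolves to a loopback address on the server machine, for example through an /etc/hosts entry. A small diagnostic sketch, with the hypothetical helper name resolves_to_loopback:

```python
import ipaddress
import socket


def resolves_to_loopback(host: str) -> bool:
    """Return True if the host name (or IPv4 literal) resolves to a loopback address."""
    addr = socket.gethostbyname(host)
    return ipaddress.ip_address(addr).is_loopback


# IPv4 literals are returned unchanged by gethostbyname, so no DNS lookup happens here.
print(resolves_to_loopback("127.0.1.1"))
print(resolves_to_loopback("192.168.1.10"))
```

Running resolves_to_loopback with the server's own hostname on the server machine would show whether name resolution is the culprit.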

Use Learner API for examples

NVFLARE now defines a Learner class and a built-in executor that can work with a Learner implementation. Federated deep learning apps should be written as Learners instead of Executors.

Currently all examples use Executors; please change them to use the Learner API.

Errors in streaming.py

@yanchengnv noticed some issues in nvflare/app_common/widgets/streaming.py:

  • Lines 47 to 52: the argument checks and error messages are wrong.
  • All the write_xxx() methods should check the tag and data args and make sure they are of the expected types (str, dict, …).
  • Line 257: in the call to self.log_xxx(), we should set send_event=False; otherwise it may cause recursive events.
  • Since fed events are handled by a separate thread, there is a potential race condition in which a fed event is fired after the END_RUN event. In the Receiver code, we need to make sure to discard any events that arrive after END_RUN (and hence finalize) is done.
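
The type checks and the END_RUN guard described above can be sketched as follows. This is a standalone illustration, not the code in streaming.py: the class ReceiverSketch and its methods are hypothetical, and the expected types (str for tag, dict for data) are taken from the list above as an example.

```python
import threading


class ReceiverSketch:
    """Accepts events until the run ends, then discards late arrivals."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ended = False
        self.records = []

    def write(self, tag, data) -> bool:
        # Validate argument types before doing anything else.
        if not isinstance(tag, str):
            raise TypeError(f"expect tag to be str but got {type(tag)}")
        if not isinstance(data, dict):
            raise TypeError(f"expect data to be dict but got {type(data)}")
        with self._lock:
            if self._ended:
                return False  # late event from another thread: discard
            self.records.append((tag, data))
            return True

    def end_run(self):
        # After this point the receiver is considered finalized.
        with self._lock:
            self._ended = True
```

Holding the same lock in write and end_run means an event is either recorded before finalization or dropped, never written into a finalized receiver.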

admin command "sys_info client" error

The admin command "sys_info client" fails with the following stack trace:

File "/opt/conda/lib/python3.8/site-packages/nvflare/fuel/hci/server/reg.py", line 104, in process_command
self._do_command(conn, command)
File "/opt/conda/lib/python3.8/site-packages/nvflare/fuel/hci/server/reg.py", line 92, in _do_command
handler(conn, args)
File "/opt/conda/lib/python3.8/site-packages/nvflare/private/fed/server/sys_cmd.py", line 66, in sys_info
self._process_replies(conn, replies)
File "/opt/conda/lib/python3.8/site-packages/nvflare/private/fed/server/sys_cmd.py", line 77, in _process_replies
conn.append_string("Client " + r.client_name)
AttributeError: 'ClientReply' object has no attribute 'client_name'

Error in pt_file_model_persistor.py

I am using NVFlare version 2.0.6.
However, when I start the app on my system (which includes 4 clients), the server reports errors like this:

2022-01-27 04:48:10,374 - ServerRunner - ERROR - [run=1]: Aborting current RUN due to FATAL_SYSTEM_ERROR received: expect model to be torch.nn.Module but got <class 'dict'>
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: asked to abort - triggered abort_signal to stop the RUN
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: starting workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) ...
2022-01-27 04:48:10,374 - ScatterAndGather - INFO - [run=1]: Initializing ScatterAndGather workflow.
2022-01-27 04:48:10,374 - PTFileModelPersistor - ERROR - [run=1]: error getting state_dict from model object
Traceback (most recent call last):
  File "/home/jupyter-test/.conda/envs/fl/lib/python3.8/site-packages/nvflare/app_common/pt/pt_file_model_persistor.py", line 202, in load_model
    data = self.model.state_dict() if self.model is not None else OrderedDict()
AttributeError: 'dict' object has no attribute 'state_dict'
2022-01-27 04:48:10,374 - ServerRunner - ERROR - [run=1]: Aborting current RUN due to FATAL_SYSTEM_ERROR received: cannot create state_dict from model object
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: asked to abort - triggered abort_signal to stop the RUN
2022-01-27 04:48:10,375 - ServerRunner - INFO - [run=1]: Workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) started
2022-01-27 04:48:10,375 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Beginning ScatterAndGather training phase.
2022-01-27 04:48:10,375 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Abort signal received. Exiting at round 0.
2022-01-27 04:48:10,375 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: Workflow: scatter_gather_ctl finalizing ...
2022-01-27 04:48:12,877 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: ABOUT_TO_END_RUN fired
2022-01-27 04:48:12,877 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: END_RUN fired
2022-01-27 04:48:12,878 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: Server runner finished.
2022-01-27 04:48:13,376 - FederatedServer - INFO - Server app stopped.

Please help me resolve this problem, thank you.
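
The traceback above shows .state_dict() being called on a plain dict, which has no such method. The sketch below only illustrates that failure mode and one hypothetical way to normalize both input kinds; it is not the persistor's actual logic, and the log indicates the persistor expects a torch.nn.Module. The helper to_state_dict is an invented name, and duck typing is used so the example runs without PyTorch.

```python
from collections import OrderedDict


def to_state_dict(model) -> OrderedDict:
    """Return weights from either a dict or an object exposing state_dict()."""
    if model is None:
        return OrderedDict()
    if isinstance(model, dict):
        # Already a mapping of parameter names to values.
        return OrderedDict(model)
    state_dict = getattr(model, "state_dict", None)
    if callable(state_dict):
        return OrderedDict(state_dict())
    raise TypeError(f"cannot create state_dict from {type(model)}")
```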

NVFlare python version not compatible with Google colab or Google Vertex AI Notebooks

NVFlare requires Python 3.8.10 or higher per the PyPI page, while Google Colab and Google Vertex AI Notebooks currently run Python 3.7.12 and 3.7.10, respectively. Upgrading these environments is complex and largely undocumented.

For reference, PyTorch 1.10 works with Python 3.5 or greater.

Can the dependency on Python 3.8.10 be relaxed so that Python 3.7 will suffice?
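
For reference, a minimal sketch of how a version floor such as 3.8.10 compares against the interpreters mentioned above. The helper meets_minimum is hypothetical; versions are compared as integer tuples because plain string comparison would rank "3.10.0" below "3.8.10".

```python
def meets_minimum(version: str, minimum: str = "3.8.10") -> bool:
    """Compare dotted version strings numerically, component by component."""
    def parse(v: str) -> tuple:
        return tuple(int(part) for part in v.split("."))

    return parse(version) >= parse(minimum)


for candidate in ("3.7.12", "3.7.10", "3.8.10", "3.10.0"):
    print(candidate, meets_minimum(candidate))
```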

Deploy command

Hi there,

When I tried to deploy hello-monai with deploy_app hello-monai, as stated in the README, I got an error.
It works if I add either client or server after the command:

deploy_app hello-monai server
or
deploy_app hello-monai client

Add log_critical to FLComponent

We have log_info, log_warning, log_error, log_exception, log_debug functions already.

Add log_critical to be consistent with Python logger.
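
A minimal sketch of the requested method, assuming the log_* helpers are thin wrappers around a standard Python logger. The class ComponentSketch and its simplified signatures are hypothetical and do not reproduce the real FLComponent API.

```python
import logging


class ComponentSketch:
    """Toy component whose log_* helpers delegate to a standard logger."""

    def __init__(self, name: str = "ComponentSketch"):
        self.logger = logging.getLogger(name)

    def log_error(self, msg: str) -> None:
        self.logger.error(msg)

    def log_critical(self, msg: str) -> None:
        # Mirrors logging.Logger.critical, completing the set of levels.
        self.logger.critical(msg)
```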

Prostate example

Add a multi-site prostate example to show MONAI usage, the FedProx algorithm, and a non-IID FL scenario.

License header in source files is outdated.

The copyright year in the license header needs to include this year, 2022. Therefore the first line should change to:

Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
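
A small sketch of how the header line could be updated in bulk. It assumes the existing header reads "Copyright (c) 2021, NVIDIA CORPORATION", which is not quoted in this issue; the function update_header is hypothetical.

```python
import re

# Assumed form of the outdated header line.
OLD_HEADER = re.compile(r"Copyright \(c\) 2021, NVIDIA CORPORATION")
NEW_HEADER = "Copyright (c) 2021-2022, NVIDIA CORPORATION"


def update_header(text: str) -> str:
    """Rewrite the first outdated copyright line; leave other text untouched."""
    return OLD_HEADER.sub(NEW_HEADER, text, count=1)


sample = "# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.\nimport os\n"
print(update_header(sample))
```

Files that already carry the 2021-2022 range do not match the pattern, so running the function twice is harmless.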

shell scripts missing x permission in poc

The shell script files generated by the poc command no longer have their original permission settings, in particular the execute permission, since the switch to shutil.unpack_archive.
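
One possible workaround, sketched under the assumption that the scripts can be identified by a .sh suffix: walk the unpacked folder and add the execute bits back. The helper restore_exec_bits is hypothetical and is not the fix adopted in NVFlare.

```python
import stat
from pathlib import Path


def restore_exec_bits(root: str, suffix: str = ".sh") -> list:
    """Add user/group/other execute permission to matching files under root."""
    fixed = []
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        if path.is_file():
            mode = path.stat().st_mode
            path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
            fixed.append(str(path))
    return fixed
```

It would be called right after shutil.unpack_archive, passing the extraction directory as root.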

No module named 'pt'

Hi there,

I am trying to get the cifar10 example running with Federated Learning.

I followed all the steps mentioned here https://nvidia.github.io/NVFlare/quickstart.html and then uploaded and deployed the app from the admin terminal. When I am trying to start the app, I am getting the following error:

./run_1/app_server/config/config_fed_server.json in JSON element components.#5: No module named 'pt'

However, the run_1 folder of each client contains a folder called pt with the specified learners. Do I have to configure the path to the custom folders somewhere?
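
The message "No module named 'pt'" means the Python process that loaded the config could not find a top-level package named pt on its import path. Note that the failing file is the server config (app_server), while the pt folder mentioned is on the clients. The sketch below shows the general Python mechanism only; the helper import_from_custom_dir is hypothetical, and how NVFlare itself sets up the path for custom folders is not stated here.

```python
import importlib
import sys
from pathlib import Path


def import_from_custom_dir(custom_dir: str, module_name: str):
    """Make custom_dir importable, then import module_name from it."""
    path = str(Path(custom_dir).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    # Ensure directories created after interpreter start are picked up.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

A package is found only if its parent directory (the folder that contains pt, not pt itself) is on sys.path of the process doing the import.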
