nvidia / nvflare
NVIDIA Federated Learning Application Runtime Environment
Home Page: https://nvidia.github.io/NVFlare/
License: Apache License 2.0
Previously, NVFlare 1.x was compatible with (and ran on) only Python 3.8.10, because the pip package was released with pyc files only. Those pyc files were compiled by the Python 3.8.10 interpreter and thus had to run in a Python 3.8.10 environment.
In NVFlare 2.x, the pip packages ship source code instead of pyc files, so the original statement may cause confusion.
When running multiple FL servers on the same machine, each with its own ports for admin and client communication, secure gRPC communication fails:
E0120 10:25:20.267690287 12242 ssl_transport_security.cc:1468] Handshake failed with fatal error SSL_ERROR_SSL: error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED.
It appears /tmp/fl_server contained the configuration for only one of the FL servers.
Deploying a new version of the documentation to the pages site is currently a manual process that requires a rebuild. With a workflow, the docs could be rebuilt and deployed automatically.
The server-side fed event runner can handle about 10 events per second. When many fed events arrive at once, it can take too long to process them all.
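To illustrate the throughput concern above, here is a hedged sketch (not NVFlare's actual code; all names are illustrative) of decoupling event receipt from event handling with a queue and a worker thread, so a slow handler does not cap the rate at which events can be accepted:

```python
import queue
import threading

class BufferedEventRunner:
    """Accept events immediately; process them on a background thread."""

    def __init__(self, handler):
        self.handler = handler
        self.q = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def post(self, event):
        # Returns immediately; the receiver is never blocked by handling.
        self.q.put(event)

    def _drain(self):
        while True:
            event = self.q.get()
            if event is None:  # sentinel: shut down
                break
            self.handler(event)

    def stop(self):
        self.q.put(None)
        self._worker.join()
```

Events are still processed in order (FIFO queue), but the receive path is now O(1) per event regardless of handler cost.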
The hello examples are not as refined as the CIFAR-10 example. Improve all examples so they are of the same quality.
The apidocs are not being checked in because of the following line in .gitignore:
docs/apidocs/nvflare.*
Since the workflow builds the docs from a checkout of the main branch, the .gitignore applies, so the generated apidocs HTML files are not checked into the docs branch and never make it to pages.
The tenseal dependency is not available for the ARM aarch64 platform, causing installation to fail. This has been reported for local development on Mac M1 and will affect other non-x86 architectures (Jetson, Clara AGX, IBM POWER, etc.).
The tenseal dependency is only required when using the HEBuilder module, and it looks like all other functionality could be used without this dependency. Can tenseal be made optional, with the caveat that HE is not available without tenseal?
One option would be to provide an alternate install file, requirements-no-tenseal.txt, that includes everything except tenseal. For example, I generated this file in a clean venv on my Linux machine using:
pip download nvflare -d /tmp -v \
| grep Collecting \
| awk '{print $2}' \
| tr '[:upper:]' '[:lower:]' \
| grep -v tenseal \
| tee requirements-no-tenseal.txt
and verified that I can install nvflare and all dependencies except tenseal by copying the file to an aarch64 system (in this case a Jetson TX2) and running:
python3 -m pip install --no-deps -r requirements-no-tenseal.txt
This is a pretty awkward solution. It would be much cleaner to remove the tenseal dependency from the default packaging, since HE is optional, and note in the docs that tenseal must be installed when using HE.
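One common pattern for the change requested above is a lazy, guarded import: the package imports cleanly without tenseal, and a clear error is raised only when HE is actually used. This is a hedged sketch, not NVFlare's implementation; `require_he` and `HE_AVAILABLE` are illustrative names:

```python
import importlib

def optional_import(name):
    """Return (module, True) if importable, else (None, False)."""
    try:
        return importlib.import_module(name), True
    except ImportError:
        return None, False

# tenseal becomes a soft dependency: absent on aarch64, present on x86.
tenseal, HE_AVAILABLE = optional_import("tenseal")

def require_he():
    """Call this at the top of HE-only code paths (e.g. HEBuilder)."""
    if not HE_AVAILABLE:
        raise RuntimeError(
            "Homomorphic encryption requires the optional 'tenseal' "
            "package; install it with: pip install tenseal"
        )
```

Packaging-wise this pairs naturally with an extras declaration (e.g. `pip install nvflare[HE]`) so the default install carries no tenseal requirement.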
I am trying to run NVFlare in a realistic setup with multiple computers. After the provisioning steps, I started the server, clients, and admin from their startup packages. The server started, but the client and admin machines reported a communication error:
2022-01-05 21:37:08,624 - Communicator - ERROR - Action: client_registration grpc communication error. retry: 1500, First start till now: 0.0013239383697509766 seconds.
2022-01-05 21:37:08,624 - Communicator - ERROR - Could not connect to server: imtl-85545-3:8765 Setting flag for stopping training. failed to connect to all addresses
I tried listing the listening ports on the server with nmap, and it showed 127.0.1.1:8002, which means the server is listening only on localhost and is not reachable from other computers. This makes me wonder whether NVFlare currently supports a realistic multi-machine scenario or only POC (proof of concept). Please help me solve this problem, thank you.
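The symptom above (a server bound to 127.0.1.1) is usually an /etc/hosts artifact: Debian/Ubuntu map the machine's hostname to 127.0.1.1 by default, so binding to the hostname binds to loopback. A small hedged diagnostic helper (illustrative, not part of NVFlare) makes the check explicit:

```python
import socket

def resolves_to_loopback(hostname: str) -> bool:
    """True if hostname resolves to a 127.x.x.x loopback address.

    If the FL server's provisioned hostname resolves to loopback,
    remote clients cannot reach it; fix /etc/hosts (or DNS) so the
    hostname maps to a routable interface address.
    """
    try:
        addr = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False
    return addr.startswith("127.")
```

Running `resolves_to_loopback(socket.gethostname())` on the server host quickly confirms or rules out this cause.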
Show how to stream metrics during training to the server and create central TensorBoard event files.
NVFlare now defines a Learner class and a built-in executor that can work with a Learner implementation. Federated deep learning apps should be written as Learners instead of Executors.
Currently all examples use Executors; please change them to use the Learner API.
Add support for streaming logging data from the client to the server.
@yanchengnv noticed some issues in nvflare/app_common/widgets/streaming.py:
The admin command "sys_info client" returns an error stack trace:
File "/opt/conda/lib/python3.8/site-packages/nvflare/fuel/hci/server/reg.py", line 104, in process_command
self._do_command(conn, command)
File "/opt/conda/lib/python3.8/site-packages/nvflare/fuel/hci/server/reg.py", line 92, in _do_command
handler(conn, args)
File "/opt/conda/lib/python3.8/site-packages/nvflare/private/fed/server/sys_cmd.py", line 66, in sys_info
self._process_replies(conn, replies)
File "/opt/conda/lib/python3.8/site-packages/nvflare/private/fed/server/sys_cmd.py", line 77, in _process_replies
conn.append_string("Client " + r.client_name)
AttributeError: 'ClientReply' object has no attribute 'client_name'
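The traceback shows `_process_replies` reading `r.client_name` from an object that does not carry that attribute. A hedged sketch of a defensive fix (the `client` fallback field is an assumption about ClientReply, not verified against the actual class):

```python
def reply_client_name(reply) -> str:
    """Read the client name from a reply without raising AttributeError.

    Falls back to a hypothetical 'client' attribute, then to a
    placeholder, so the admin command still renders partial output.
    """
    name = getattr(reply, "client_name", None)
    if name:
        return name
    return getattr(reply, "client", "<unknown>")
```

With this, `conn.append_string("Client " + reply_client_name(r))` degrades gracefully instead of aborting the whole sys_info command.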
Add CIFAR-10 example of "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning" (https://arxiv.org/abs/1910.06378)
I am using NVFlare version 2.0.6.
However, when I start the app on my system (with 4 clients), the server gets an error like this:
2022-01-27 04:48:10,374 - ServerRunner - ERROR - [run=1]: Aborting current RUN due to FATAL_SYSTEM_ERROR received: expect model to be torch.nn.Module but got <class 'dict'>
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: asked to abort - triggered abort_signal to stop the RUN
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: starting workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) ...
2022-01-27 04:48:10,374 - ScatterAndGather - INFO - [run=1]: Initializing ScatterAndGather workflow.
2022-01-27 04:48:10,374 - PTFileModelPersistor - ERROR - [run=1]: error getting state_dict from model object
Traceback (most recent call last):
File "/home/jupyter-test/.conda/envs/fl/lib/python3.8/site-packages/nvflare/app_common/pt/pt_file_model_persistor.py", line 202, in load_model
data = self.model.state_dict() if self.model is not None else OrderedDict()
AttributeError: 'dict' object has no attribute 'state_dict'
2022-01-27 04:48:10,374 - ServerRunner - ERROR - [run=1]: Aborting current RUN due to FATAL_SYSTEM_ERROR received: cannot create state_dict from model object
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: asked to abort - triggered abort_signal to stop the RUN
2022-01-27 04:48:10,375 - ServerRunner - INFO - [run=1]: Workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) started
2022-01-27 04:48:10,375 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Beginning ScatterAndGather training phase.
2022-01-27 04:48:10,375 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Abort signal received. Exiting at round 0.
2022-01-27 04:48:10,375 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: Workflow: scatter_gather_ctl finalizing ...
2022-01-27 04:48:12,877 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: ABOUT_TO_END_RUN fired
2022-01-27 04:48:12,877 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: END_RUN fired
2022-01-27 04:48:12,878 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: Server runner finished.
2022-01-27 04:48:13,376 - FederatedServer - INFO - Server app stopped.
Please help me resolve this problem, thank you.
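The log above fails inside the persistor's `load_model`, where `self.model.state_dict()` is called on a plain dict. That typically means the persistor's configured model component yielded a raw state-dict where an `nn.Module` was expected. A hedged sketch of the check involved (illustrative only; `Module` here stands in for `torch.nn.Module`, and the message mirrors the log):

```python
from collections import OrderedDict

def load_state_dict(model):
    """Mimic the persistor's behavior with an explicit type check."""
    if model is None:
        return OrderedDict()
    if isinstance(model, dict):
        # A raw state-dict was configured where a module was expected.
        raise TypeError(
            "expect model to be a module object but got a plain dict; "
            "check that the persistor's 'model' component constructs an "
            "nn.Module, not a state_dict"
        )
    return model.state_dict()
```

In practice the fix is in the server app's config: point the persistor at a component that builds the network object itself, not at a checkpoint dict.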
NVFlare requires Python 3.8.10 or higher per the PyPI page, while Google Colab and Google Vertex AI Notebooks currently run Python 3.7.12 and 3.7.10 respectively. Upgrading these environments is relatively undocumented and complex.
For reference, PyTorch 1.10 works with Python 3.5 or greater.
Can the dependency on Python 3.8.10 be relaxed so that Python 3.7 suffices?
When using the LogAnalyticsSender to stream logging data to the server, the END_RUN logging data is not sent to the server.
Add example readmes.
The Python code inside the test folder is missing license headers; we need to add them.
NVFlare/examples/cifar10/run_fl.py, line 1 in d784e7b
Hi there,
when I tried to deploy hello-monai with deploy_app hello-monai
as stated in the README, I get an error.
It works if I add either client or server after the command:
deploy_app hello-monai server
or
deploy_app hello-monai client
We already have log_info, log_warning, log_error, log_exception, and log_debug functions.
Add log_critical to be consistent with the Python logger.
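A minimal sketch of the requested helper, assuming the existing log_* helpers simply forward to a standard `logging.Logger` (the actual NVFlare signatures, which take an FLContext, are not reproduced here):

```python
import logging

def log_critical(logger: logging.Logger, msg: str) -> None:
    """Log msg at CRITICAL level, mirroring the other log_* helpers."""
    logger.log(logging.CRITICAL, msg)
```

CRITICAL is the highest standard level in Python's logging module, so this rounds out the existing set without introducing any new level numbers.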
The docstrings of the streaming widgets need to be updated; this is a follow-up of #117.
Create the Learner API and add a LearnerExecutor to support the "train", "submit_model", and "validate" executor tasks.
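A hedged sketch of how such a LearnerExecutor might route the three named tasks to a Learner object; the class shape and method signatures are illustrative, not NVFlare's actual API:

```python
class LearnerExecutor:
    """Dispatch the three Learner tasks to a wrapped learner object."""

    def __init__(self, learner):
        self.learner = learner

    def execute(self, task_name, shareable):
        handlers = {
            "train": self.learner.train,
            "submit_model": self.learner.submit_model,
            "validate": self.learner.validate,
        }
        handler = handlers.get(task_name)
        if handler is None:
            raise ValueError(f"unsupported task: {task_name}")
        return handler(shareable)
```

The appeal of this split is that app authors implement only the three domain methods on a Learner, while the executor owns task routing and error handling.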
Add a multi-site prostate example to show MONAI usage, the FedProx algorithm, and a non-IID FL scenario.
We need to print the failure message so we know what caused the failure.
isort and black show issues on some files in nvflare/private.
args.log_config is not set up properly in worker_process.py.
The copyright year in the license header needs to include this year, 2022. Therefore the first line should change to:
Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
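A one-off helper sketch for applying this header change across files; the exact before/after strings come from the issue text above, and the function is illustrative rather than part of the repo:

```python
def update_copyright(text: str) -> str:
    """Rewrite the 2021-only header line to the 2021-2022 range.

    Idempotent: text already carrying '2021-2022' is left unchanged,
    so the script can safely be re-run over the whole tree.
    """
    return text.replace(
        "Copyright (c) 2021, NVIDIA CORPORATION.",
        "Copyright (c) 2021-2022, NVIDIA CORPORATION.",
    )
```

Applied per-file (read, transform, write back), this updates every license header in one pass.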
Hi,
When I follow the instructions at https://nvidia.github.io/NVFlare/quickstart.html, after installing nvflare through pip there is no poc command
found in the virtualenv. Is anything wrong or missing?
The provisioning tool needs a config_validator option so the generated fed_server.json can include that information.
Since fed events are not guaranteed to be delivered, we should not require the server side to have client TensorBoard files.
It is common to set a version attribute with the package's version information, but nvflare currently has no such attribute. Requested by users.
When the client training task is aborted or runs into an error, it may return a Shareable whose ReturnCode is not OK.
This option in sub_start.sh is deprecated; we have to remove it from the template file.
We need to change NVFlare to NVIDIA Flare ("NVFlare") to be more precise.
Add a BraTS example to show MONAI usage.
The shell script files generated by the poc command do not retain their original permission settings, especially the execute permission, after the switch to shutil.unpack_archive.
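This happens because shutil.unpack_archive extracts zip archives via zipfile, which does not restore Unix permission bits. A hedged sketch of one fix: extract entries manually and restore each mode from the zip entry's `external_attr` field (whose high 16 bits hold the Unix st_mode for archives created on Unix):

```python
import os
import zipfile

def unzip_preserving_mode(zip_path: str, dest: str) -> None:
    """Extract a zip archive, restoring Unix permission bits.

    Unlike shutil.unpack_archive, this re-applies the mode recorded
    in each entry, so startup scripts keep their execute bit.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            extracted = zf.extract(info, dest)
            mode = (info.external_attr >> 16) & 0o777
            if mode:  # zero means no Unix mode was recorded
                os.chmod(extracted, mode)
```

An alternative is to keep shutil.unpack_archive and chmod the known startup scripts afterward, but restoring the recorded modes is more general.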
Hi there,
I am trying to get the cifar10 example running with Federated Learning.
I followed all the steps mentioned here https://nvidia.github.io/NVFlare/quickstart.html and then uploaded and deployed the app from the admin terminal. When I am trying to start the app, I am getting the following error:
./run_1/app_server/config/config_fed_server.json in JSON element components.#5: No module named 'pt'
This happens even though the run_1 folder of each client contains a folder called pt with the specified learners. Do I have to configure the path to the custom folders somewhere?
Users should be able to control that setting outside start.sh/sub_start.sh
There is a lot of duplicate code across the different test_apps.
We should identify the common methods and classes and reuse them.