nvidia / nvflare
NVIDIA Federated Learning Application Runtime Environment
Home Page: https://nvidia.github.io/NVFlare/
License: Apache License 2.0
Previously, NVFlare 1.x was compatible with (and ran on) only Python 3.8.10, because the pip package was released with pyc files only. Those pyc files were compiled by the Python 3.8.10 interpreter and thus had to run in a Python 3.8.10 environment.
In NVFlare 2.x, the pip packages ship source code instead of pyc files, so the original statement may cause confusion.
When running multiple FL servers on the same machine, each with its own ports for admin and client communication, secure gRPC communication fails:
E0120 10:25:20.267690287 12242 ssl_transport_security.cc:1468] Handshake failed with fatal error SSL_ERROR_SSL: error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED.
It appears /tmp/fl_server contained the configuration for only one of the FL servers.
Deploying a new version of the documentation to the pages site is currently a manual process that requires a rebuild. With a workflow, the docs could be rebuilt and deployed automatically.
The server-side fed event runner can handle about 10 events per second. When many fed events arrive at once, it can take too long to process them all.
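To illustrate the throughput concern above, here is a hedged sketch (not NVFlare's actual code; all names are illustrative) of decoupling event receipt from event handling with a queue and a worker thread, so a slow handler does not cap the rate at which events can be accepted:

```python
import queue
import threading

class BufferedEventRunner:
    """Accept events immediately; process them on a background thread."""

    def __init__(self, handler):
        self.handler = handler
        self.q = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def post(self, event):
        # Returns immediately; the receiver is never blocked by handling.
        self.q.put(event)

    def _drain(self):
        while True:
            event = self.q.get()
            if event is None:  # sentinel: shut down
                break
            self.handler(event)

    def stop(self):
        self.q.put(None)
        self._worker.join()
```

Events are still processed in order (FIFO queue), but the receive path is now O(1) per event regardless of handler cost.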
The hello examples are not as refined as the CIFAR-10 example. Improve all examples so they are of the same quality.
The apidocs are not being checked in because of the following line in .gitignore:
docs/apidocs/nvflare.*
Since the workflow builds the docs from a checkout of the main branch, the .gitignore applies, so the generated apidocs HTML files are not checked into the docs branch and never make it to pages.
The tenseal dependency is not available for the ARM aarch64 platform, causing installation to fail. This has been reported for local development on Mac M1 and will affect other non-x86 architectures (Jetson, Clara AGX, IBM POWER, etc.).
The tenseal dependency is only required when using the HEBuilder module, and it looks like all other functionality could be used without this dependency. Can tenseal be made optional, with the caveat that HE is not available without tenseal?
One option would be to provide an alternate install file, requirements-no-tenseal.txt, that includes everything except tenseal. For example, I generated this file in a clean venv on my Linux machine using:
pip download nvflare -d /tmp -v \
| grep Collecting \
| awk '{print $2}' \
| tr '[:upper:]' '[:lower:]' \
| grep -v tenseal \
| tee requirements-no-tenseal.txt
and verified that I can install nvflare and all dependencies except tenseal by copying the file to an aarch64 system (in this case a Jetson TX2) and running:
python3 -m pip install --no-deps -r requirements-no-tenseal.txt
This is a pretty awkward solution. It would be much cleaner to remove the tenseal dependency from the default packaging, since HE is optional, and note in the docs that tenseal must be installed when using HE.
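One common pattern for the change requested above is a lazy, guarded import: the package imports cleanly without tenseal, and a clear error is raised only when HE is actually used. This is a hedged sketch, not NVFlare's implementation; `require_he` and `HE_AVAILABLE` are illustrative names:

```python
import importlib

def optional_import(name):
    """Return (module, True) if importable, else (None, False)."""
    try:
        return importlib.import_module(name), True
    except ImportError:
        return None, False

# tenseal becomes a soft dependency: absent on aarch64, present on x86.
tenseal, HE_AVAILABLE = optional_import("tenseal")

def require_he():
    """Call this at the top of HE-only code paths (e.g. HEBuilder)."""
    if not HE_AVAILABLE:
        raise RuntimeError(
            "Homomorphic encryption requires the optional 'tenseal' "
            "package; install it with: pip install tenseal"
        )
```

Packaging-wise this pairs naturally with an extras declaration (e.g. `pip install nvflare[HE]`) so the default install carries no tenseal requirement.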
I am trying to run NVFlare in a realistic setup with multiple computers. After the provisioning steps, I started the server, clients, and admin from their startup packages. The server started, but the client and admin machines reported a communication error:
2022-01-05 21:37:08,624 - Communicator - ERROR - Action: client_registration grpc communication error. retry: 1500, First start till now: 0.0013239383697509766 seconds.
2022-01-05 21:37:08,624 - Communicator - ERROR - Could not connect to server: imtl-85545-3:8765 Setting flag for stopping training. failed to connect to all addresses
I tried listing the listening ports on the server with nmap, and it showed 127.0.1.1:8002, which means the server is listening only on localhost and is not reachable from other computers. This makes me wonder whether NVFlare currently supports a realistic multi-machine scenario or only POC (proof of concept). Please help me solve this problem, thank you.
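The symptom above (a server bound to 127.0.1.1) is usually an /etc/hosts artifact: Debian/Ubuntu map the machine's hostname to 127.0.1.1 by default, so binding to the hostname binds to loopback. A small hedged diagnostic helper (illustrative, not part of NVFlare) makes the check explicit:

```python
import socket

def resolves_to_loopback(hostname: str) -> bool:
    """True if hostname resolves to a 127.x.x.x loopback address.

    If the FL server's provisioned hostname resolves to loopback,
    remote clients cannot reach it; fix /etc/hosts (or DNS) so the
    hostname maps to a routable interface address.
    """
    try:
        addr = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False
    return addr.startswith("127.")
```

Running `resolves_to_loopback(socket.gethostname())` on the server host quickly confirms or rules out this cause.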
Show how to stream metrics during training to the server and create central TensorBoard event files.
NVFlare now defines a Learner class and a built-in executor that can work with a Learner implementation. Federated deep learning apps should be written as Learners instead of Executors.
Currently all examples use Executors; please change them to use the Learner API.
Add support for streaming logging data from the client to the server.
@yanchengnv noticed some issues in nvflare/app_common/widgets/streaming.py:
The admin command "sys_info client" returns an error stack trace:
File "/opt/conda/lib/python3.8/site-packages/nvflare/fuel/hci/server/reg.py", line 104, in process_command
self._do_command(conn, command)
File "/opt/conda/lib/python3.8/site-packages/nvflare/fuel/hci/server/reg.py", line 92, in _do_command
handler(conn, args)
File "/opt/conda/lib/python3.8/site-packages/nvflare/private/fed/server/sys_cmd.py", line 66, in sys_info
self._process_replies(conn, replies)
File "/opt/conda/lib/python3.8/site-packages/nvflare/private/fed/server/sys_cmd.py", line 77, in _process_replies
conn.append_string("Client " + r.client_name)
AttributeError: 'ClientReply' object has no attribute 'client_name'
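The traceback shows `_process_replies` reading `r.client_name` from an object that does not carry that attribute. A hedged sketch of a defensive fix (the `client` fallback field is an assumption about ClientReply, not verified against the actual class):

```python
def reply_client_name(reply) -> str:
    """Read the client name from a reply without raising AttributeError.

    Falls back to a hypothetical 'client' attribute, then to a
    placeholder, so the admin command still renders partial output.
    """
    name = getattr(reply, "client_name", None)
    if name:
        return name
    return getattr(reply, "client", "<unknown>")
```

With this, `conn.append_string("Client " + reply_client_name(r))` degrades gracefully instead of aborting the whole sys_info command.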
Add CIFAR-10 example of "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning" (https://arxiv.org/abs/1910.06378)
I am using NVFlare version 2.0.6.
However, when I start the app on my system (with 4 clients), the server gets an error like this:
2022-01-27 04:48:10,374 - ServerRunner - ERROR - [run=1]: Aborting current RUN due to FATAL_SYSTEM_ERROR received: expect model to be torch.nn.Module but got <class 'dict'>
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: asked to abort - triggered abort_signal to stop the RUN
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: starting workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) ...
2022-01-27 04:48:10,374 - ScatterAndGather - INFO - [run=1]: Initializing ScatterAndGather workflow.
2022-01-27 04:48:10,374 - PTFileModelPersistor - ERROR - [run=1]: error getting state_dict from model object
Traceback (most recent call last):
File "/home/jupyter-test/.conda/envs/fl/lib/python3.8/site-packages/nvflare/app_common/pt/pt_file_model_persistor.py", line 202, in load_model
data = self.model.state_dict() if self.model is not None else OrderedDict()
AttributeError: 'dict' object has no attribute 'state_dict'
2022-01-27 04:48:10,374 - ServerRunner - ERROR - [run=1]: Aborting current RUN due to FATAL_SYSTEM_ERROR received: cannot create state_dict from model object
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: asked to abort - triggered abort_signal to stop the RUN
2022-01-27 04:48:10,375 - ServerRunner - INFO - [run=1]: Workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) started
2022-01-27 04:48:10,375 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Beginning ScatterAndGather training phase.
2022-01-27 04:48:10,375 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Abort signal received. Exiting at round 0.
2022-01-27 04:48:10,375 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: Workflow: scatter_gather_ctl finalizing ...
2022-01-27 04:48:12,877 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: ABOUT_TO_END_RUN fired
2022-01-27 04:48:12,877 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: END_RUN fired
2022-01-27 04:48:12,878 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: Server runner finished.
2022-01-27 04:48:13,376 - FederatedServer - INFO - Server app stopped.
Please help me resolve this problem, thank you.
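The log above fails inside the persistor's `load_model`, where `self.model.state_dict()` is called on a plain dict. That typically means the persistor's configured model component yielded a raw state-dict where an `nn.Module` was expected. A hedged sketch of the check involved (illustrative only; `Module` here stands in for `torch.nn.Module`, and the message mirrors the log):

```python
from collections import OrderedDict

def load_state_dict(model):
    """Mimic the persistor's behavior with an explicit type check."""
    if model is None:
        return OrderedDict()
    if isinstance(model, dict):
        # A raw state-dict was configured where a module was expected.
        raise TypeError(
            "expect model to be a module object but got a plain dict; "
            "check that the persistor's 'model' component constructs an "
            "nn.Module, not a state_dict"
        )
    return model.state_dict()
```

In practice the fix is in the server app's config: point the persistor at a component that builds the network object itself, not at a checkpoint dict.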
NVFlare requires Python 3.8.10 or higher per the PyPI page, while Google Colab and Google Vertex AI Notebooks currently run Python 3.7.12 and 3.7.10 respectively. Upgrading these environments is relatively undocumented and complex.
For reference, PyTorch 1.10 works with Python 3.5 or greater.
Can the dependency on Python 3.8.10 be relaxed so that Python 3.7 suffices?
When using the LogAnalyticsSender to stream logging data to the server, the END_RUN logging data is not sent to the server.
Add example readmes.
The Python code inside the test folder is missing license headers; we need to add them.
NVFlare/examples/cifar10/run_fl.py, line 1 in d784e7b
Hi there,
when I tried to deploy hello-monai with deploy_app hello-monai
as stated in the README, I get an error.
It works if I add either client or server after the command:
deploy_app hello-monai server
or
deploy_app hello-monai client
We already have log_info, log_warning, log_error, log_exception, and log_debug functions.
Add log_critical to be consistent with the Python logger.
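A minimal sketch of the requested helper, assuming the existing log_* helpers simply forward to a standard `logging.Logger` (the actual NVFlare signatures, which take an FLContext, are not reproduced here):

```python
import logging

def log_critical(logger: logging.Logger, msg: str) -> None:
    """Log msg at CRITICAL level, mirroring the other log_* helpers."""
    logger.log(logging.CRITICAL, msg)
```

CRITICAL is the highest standard level in Python's logging module, so this rounds out the existing set without introducing any new level numbers.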
The docstrings of the streaming widgets need to be updated; this is a follow-up of #117.
Create the Learner API and add a LearnerExecutor to support the "train", "submit_model", and "validate" executor tasks.
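A hedged sketch of how such a LearnerExecutor might route the three named tasks to a Learner object; the class shape and method signatures are illustrative, not NVFlare's actual API:

```python
class LearnerExecutor:
    """Dispatch the three Learner tasks to a wrapped learner object."""

    def __init__(self, learner):
        self.learner = learner

    def execute(self, task_name, shareable):
        handlers = {
            "train": self.learner.train,
            "submit_model": self.learner.submit_model,
            "validate": self.learner.validate,
        }
        handler = handlers.get(task_name)
        if handler is None:
            raise ValueError(f"unsupported task: {task_name}")
        return handler(shareable)
```

The appeal of this split is that app authors implement only the three domain methods on a Learner, while the executor owns task routing and error handling.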
Add a multi-site prostate example to show MONAI usage, the FedProx algorithm, and a non-IID FL scenario.
We need to print the failure message so we know what caused the failure.
isort and black show issues on some files in nvflare/private.
args.log_config is not set up properly in worker_process.py.
The copyright year in the license header needs to include this year, 2022. Therefore the first line should change to:
Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
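A one-off helper sketch for applying this header change across files; the exact before/after strings come from the issue text above, and the function is illustrative rather than part of the repo:

```python
def update_copyright(text: str) -> str:
    """Rewrite the 2021-only header line to the 2021-2022 range.

    Idempotent: text already carrying '2021-2022' is left unchanged,
    so the script can safely be re-run over the whole tree.
    """
    return text.replace(
        "Copyright (c) 2021, NVIDIA CORPORATION.",
        "Copyright (c) 2021-2022, NVIDIA CORPORATION.",
    )
```

Applied per-file (read, transform, write back), this updates every license header in one pass.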
Hi,
When I follow the instructions at https://nvidia.github.io/NVFlare/quickstart.html, after installing nvflare through pip there is no poc command
found in the virtualenv. Is anything wrong or missing?
The provisioning tool needs a config_validator option so the generated fed_server.json can include that information.
Since fed events are not guaranteed to be delivered, we should not require the server side to have client TensorBoard files.
It is common to set a version attribute with the package's version information, but nvflare currently has no such attribute. Requested by users.
When the client training task is aborted or runs into an error, it may return a Shareable whose ReturnCode is not OK.
This option in sub_start.sh is deprecated; we have to remove it from the template file.
We need to change NVFlare to NVIDIA Flare ("NVFlare") to be more precise.
Add a BraTS example to show MONAI usage.
The shell script files generated by the poc command do not retain their original permission settings, especially the execute permission, after the switch to shutil.unpack_archive.
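This happens because shutil.unpack_archive extracts zip archives via zipfile, which does not restore Unix permission bits. A hedged sketch of one fix: extract entries manually and restore each mode from the zip entry's `external_attr` field (whose high 16 bits hold the Unix st_mode for archives created on Unix):

```python
import os
import zipfile

def unzip_preserving_mode(zip_path: str, dest: str) -> None:
    """Extract a zip archive, restoring Unix permission bits.

    Unlike shutil.unpack_archive, this re-applies the mode recorded
    in each entry, so startup scripts keep their execute bit.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            extracted = zf.extract(info, dest)
            mode = (info.external_attr >> 16) & 0o777
            if mode:  # zero means no Unix mode was recorded
                os.chmod(extracted, mode)
```

An alternative is to keep shutil.unpack_archive and chmod the known startup scripts afterward, but restoring the recorded modes is more general.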
Hi there,
I am trying to get the cifar10 example running with Federated Learning.
I followed all the steps mentioned here https://nvidia.github.io/NVFlare/quickstart.html and then uploaded and deployed the app from the admin terminal. When I am trying to start the app, I am getting the following error:
./run_1/app_server/config/config_fed_server.json in JSON element components.#5: No module named 'pt'
This happens even though the run_1 folder of each client contains a folder called pt with the specified learners. Do I have to configure the path to the custom folders somewhere?
Users should be able to control that setting outside start.sh/sub_start.sh
There is a lot of duplicate code across the different test_apps.
We should identify the common methods and classes and reuse them.