federatedai / fate-llm
Federated Learning for LLMs.
License: Apache License 2.0
Can this project be used for federated LLM training on a standalone (single-machine) deployment? For example, with chatglm-6b as in the tutorials?
Your paper mentions The FedLLM Privacy Hub, but I cannot find the corresponding code in FATE-LLM. How is your differential privacy scheme implemented?
Hi all, which Qwen model does FATE use? Neither Qwen-7B nor Qwen1.5-7B runs for me, while ChatGLM3-6B trains normally.
The error is as follows:
Traceback (most recent call last):
File "/home/chenlu/workspace/standalone_fate_install_1.11.3_release/fateflow/python/fate_flow/controller/task_controller.py", line 216, in kill_task
backend_engine.kill(task)
File "/home/chenlu/workspace/standalone_fate_install_1.11.3_release/fateflow/python/fate_flow/controller/engine_controller/deepspeed.py", line 134, in kill
from eggroll.deepspeed.submit import client
ModuleNotFoundError: No module named 'eggroll'
Can fate-llm run in a standalone environment, or does it require a cluster installation?
1. Error description:
When training a federated LLM with chatglm6b, the job fails with:
ValueError: IP not configured. Please use command line tool pipeline init
to set it.
After running pipeline init, the same error persists:
ValueError: IP not configured. Please use command line tool pipeline init
to set it.
How should I fix this? Does anyone know the cause?
2. Background:
2.1 Training code source: https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/ChatGLM-6B_ds.ipynb, with only some directories changed to local paths.
2.2 FATE was deployed from source in standalone mode, following this guide: https://fate.readthedocs.io/en/latest/zh/deploy/cluster-deploy/doc/fate_on_eggroll/fate-allinone_deployment_guide/
Is a cluster deployment mandatory?
Hello, I am running https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/fedkseed/fedkseed-example.ipynb. In the "Submit Federated Task" section I get:
ValueError: Job is failed, please check out job_id=202405310203079396430 in fate_flow log directory
The logs show:
ValueError: Module: fate.components.components.nn.runner.fedkseed_runner not found in the import path.
What is the cause, and how should I handle it? Thanks.
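An import-path error like the one above usually means the interpreter that executes FATE Flow tasks cannot resolve the fate_llm package. A minimal check one can run in that same Python environment (the package name is taken from the error message; that PYTHONPATH is the culprit is an assumption):

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` resolves on the current sys.path."""
    return importlib.util.find_spec(name) is not None

# If this prints False in the interpreter that runs FATE Flow tasks,
# fate_llm (and hence its runner modules) cannot be imported there.
print(module_available("fate_llm"))
```

If it prints False, adding the FATE-LLM/python directory to the worker's PYTHONPATH (not just the client's) would be the first thing to try.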
After the ChatGLM-6B run failed (see the issue "grpc error when running the ChatGLM-6B tutorial"), I found the VGPU-CORE resource was exhausted, even though eggroll's dashboard showed a normal amount of allocatable VGPU-CORE.
I had to go into MySQL and manually edit the node and processor manage tables to clear the VGPU-CORE records pre-allocated to the deepspeed task before I could resubmit.
After clearing them the job could obtain resources, but visibleCudaDevices became -1 again. nvidia-smi shows the GPUs are fine, and non-FATE GPU training jobs run normally.
File "demo.py", line 100, in <module>
pipeline.compile()
│ └ <function PipeLine.compile at 0x7f0a0aa4ac10>
└ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/site-packages/pipeline/backend/pipeline.py", line 428, in compile
self._train_conf = self._construct_train_conf()
│ │ │ └ <function PipeLine._construct_train_conf at 0x7f0a0aa4a550>
│ │ └ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
│ └ {'dsl_version': 2, 'initiator': {'role': 'guest', 'party_id': 9999}, 'role': {'guest': [9999], 'host': [10000], 'arbiter': [9...
└ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/site-packages/pipeline/backend/pipeline.py", line 395, in _construct_train_conf
LOGGER.debug(f"self._train_conf: \n {json.dumps(self._train_conf, indent=4, ensure_ascii=False)}")
│ └ <function Logger.debug at 0x7f0a0dd301f0>
└ <loguru.logger handlers=[(id=1, level=20, sink=<stderr>), (id=2, level=20, sink='/data/zhihao/anaconda3/envs/fate_env/lib/pyt...
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/__init__.py", line 234, in dumps
return cls(
└ <class 'json.encoder.JSONEncoder'>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 201, in encode
chunks = list(chunks)
└ <generator object _make_iterencode.<locals>._iterencode at 0x7f0a083f4f20>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
│ │ └ 0
│ └ {'dsl_version': 2, 'initiator': {'role': 'guest', 'party_id': 9999}, 'role': {'guest': [9999], 'host': [10000], 'arbiter': [9...
└ <function _make_iterencode.<locals>._iterencode_dict at 0x7f0a0aa563a0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a08388430>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a083883c0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a083884a0>
[Previous line repeated 7 more times]
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 438, in _iterencode
o = _default(o)
│ └ {'query_key_value'}
└ <bound method JSONEncoder.default of <json.encoder.JSONEncoder object at 0x7f0a0aa58b80>>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type set is not JSON serializable
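The traceback above fails because the job conf contains a Python set ({'query_key_value'}, the LoRA target modules), which json.dumps cannot serialize. A minimal reproduction, plus one possible workaround (converting sets to lists before serialization — an illustration, not the official fix):

```python
import json

# Reproduce: a set inside the job conf breaks the standard JSONEncoder.
conf = {"peft_config": {"target_modules": {"query_key_value"}}}
try:
    json.dumps(conf)
except TypeError as err:
    print(err)  # Object of type set is not JSON serializable

def set_to_list(obj):
    """json.dumps `default` hook: render sets as sorted lists."""
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

print(json.dumps(conf, default=set_to_list))
```

In practice the same effect can be had by passing target_modules as a list (or a single string) in the config rather than a set.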
I would like to understand how the FedCoLLM module is implemented. Is its code in the FATE-LLM repository or in the FATE repository?
The following error appears, saying the GPU request is too large, even though I actually have two GPUs. Also, I submitted with a python command, not jupyter.
[ERROR] [2023-10-07 22:58:01,892] [202310072257522498440] [22816:139678211446592] - [deepspeed_utils._run] [line:67]: failed to call CommandURI(_uri=v1/cluster-manager/job/submitJob) to xxx.xxx.xxx.xxx:4670: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "xxx.xxx.xxx.xxx:4670: com.webank.eggroll.core.error.ErSessionException: resource request gpu count 2 is too big
	at com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleDeepspeedSubmit(JobServiceHandler.scala:237)
	at com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleSubmit(JobServiceHandler.scala:226)
	at com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)
	at com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)
	at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139)
	at com.webank.eggroll.core.command.CommandService.com$webank$eggroll$core$command$CommandService$$run$body$1(CommandService.scala:47)
	at com.webank.eggroll.core.command.CommandService$$anonfun$1.run(CommandService.scala:41)
	at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43)
	at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41)
	at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:257)
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
"
debug_error_string = "{"created":"@1696690681.818528053","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:4670","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"xxx.xxx.xxx.xxx:4670: com.webank.eggroll.core.error.ErSessionException: resource request gpu count 2 is too big\n\tat com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleDeepspeedSubmit(JobServiceHandler.scala:237)\n\tat com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleSubmit(JobServiceHandler.scala:226)\n\tat com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)\n\tat com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)\n\tat com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139)\n\tat
When training the model following the tutorial at https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/parameter_efficient_llm/ChatGLM3-6B_ds.ipynb, the submitted job fails with an FP16 error. I submitted from inside the client's docker container, and FATE-LLM/python is on the PYTHONPATH environment variable. How can this be resolved? Thanks. The error:
FP16 Mixed precision trainning with AMP or APEX('--fp16') and FP16 half precision evaluation('--fp16_full_eval') can only be used on CUDA or NPU devices or certain XPU devices (with IPEX)
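That transformers error is raised when --fp16 is requested but the training process sees no CUDA device. Before digging into FATE itself, it may be worth checking what the worker inside the container actually sees (a generic sanity check, not a FATE-specific diagnosis):

```python
import os
import shutil

# GPUs are visible inside docker only if the container was started with GPU
# access (e.g. --gpus all) and CUDA_VISIBLE_DEVICES was not forced to -1,
# which the eggroll resource issue above ("visibleCudaDevices became -1")
# suggests can happen.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
print(f"CUDA_VISIBLE_DEVICES={visible}")
print(f"nvidia-smi on PATH: {shutil.which('nvidia-smi') is not None}")
```

If CUDA_VISIBLE_DEVICES is -1 (or nvidia-smi is absent in the container), the trainer will fall back to CPU and fp16 will fail exactly as shown.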
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Python version 3.8
Steps followed for installing FATE 1.11.3 version
Pulled a Docker python 3.8 image, mounted the ./FATE-LLM/python directory to /usr/local/lib/python3.8/site-packages/fate/python in the container, and installed the packages below:
pip install fate_client[fate,fate_flow]==1.11.3
apt update
apt install -y lsof
apt purge python3-click
pip install click==8.1.6
fate_flow init --ip 127.0.0.1 --port 9380 --home /fate_home
pipeline init --ip 127.0.0.1 --port 9380
fate_flow start
After this I appended the fate python path as said in the tutorial.
The error below appeared after running the offsite-tuning example:
"""
import sys
your_path_to_fate_python = '/usr/local/lib/python3.8/site-packages/fate/python'
sys.path.append(your_path_to_fate_python)
from fate_llm.model_zoo.offsite_tuning.offsite_tuning_model import OffsiteTuningSubModel, OffsiteTuningMainModel
"""
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/fate/python/fate_llm/model_zoo/offsite_tuning/offsite_tuning_model.py", line 18, in <module>
from federatedml.util import LOGGER
ModuleNotFoundError: No module named 'federatedml'
https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/parameter_efficient_llm/ChatGLM3-6B_ds.ipynb
Following this tutorial, train.json must be uploaded to the storage engine:
{"file": "xxxx/train.json", "head": false, "partition": 4, "meta": {}, "namespace": "experiment", "name": "ad"}
The upload fails with: Please provide sample_id_name
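For reference, the "Please provide sample_id_name" check can presumably be satisfied by naming the sample-id column in the upload conf's meta. The exact placement and the column name "id" below are assumptions for illustration, not taken from the tutorial:

```python
import json

# Hypothetical upload conf: the same fields as in the report above,
# plus meta.sample_id_name, which the error message asks for.
upload_conf = {
    "file": "xxxx/train.json",         # path elided as in the original report
    "head": False,
    "partition": 4,
    "meta": {"sample_id_name": "id"},  # "id" is a placeholder column name
    "namespace": "experiment",
    "name": "ad",
}
print(json.dumps(upload_conf, indent=2))
```

The column named here must actually exist in train.json; check the tutorial's sample data for the real column name.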
When following the GPT2 example, the following errors occur:
After debugging, this error is caused in the json/encoder.py file when it tries to serialize the target_modules = ['c_attn'] component of the LoraConfig object. Changing this value to the string 'c_attn' resolves the issue, but limits the ability to fine-tune multiple module types. However, after doing so, the following error occurs.
This occurs because the t.nn.CustModel object created in the pipeline job does not reference any of the layers from the GPT2 model, and there does not appear a way to do so. Is there a workaround for this, or would this be an environment issue?
How to deploy and run Federated LLM on a single machine? Is there a guide document?