federatedai / fate-llm
Federated Learning for LLMs.
License: Apache License 2.0
Can this project be used for federated LLM training on a standalone (single-machine) deployment? For example, with chatglm-6b as in the tutorials?
Your paper mentions The FedLLM Privacy Hub, but I cannot find the corresponding code in FATE-LLM. How is your differential privacy scheme implemented?
Hi all, which Qwen model does FATE use? Neither Qwen-7B nor Qwen1.5-7B runs for me, while ChatGLM3-6B trains normally.
The error is as follows:
Traceback (most recent call last):
File "/home/chenlu/workspace/standalone_fate_install_1.11.3_release/fateflow/python/fate_flow/controller/task_controller.py", line 216, in kill_task
backend_engine.kill(task)
File "/home/chenlu/workspace/standalone_fate_install_1.11.3_release/fateflow/python/fate_flow/controller/engine_controller/deepspeed.py", line 134, in kill
from eggroll.deepspeed.submit import client
ModuleNotFoundError: No module named 'eggroll'
Can fate-llm run in a standalone environment, or does it require a cluster installation?
1. Error description:
When training a federated LLM with chatglm6b, the job fails with:
ValueError: IP not configured. Please use command line tool pipeline init
to set it.
After running pipeline init, the same error persists:
ValueError: IP not configured. Please use command line tool pipeline init
to set it.
How should I fix this? Does anyone know the cause?
2. Background:
2.1 Training code source: https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/ChatGLM-6B_ds.ipynb, with only some directories changed to local paths.
2.2 FATE was deployed from source in standalone mode, following this guide: https://fate.readthedocs.io/en/latest/zh/deploy/cluster-deploy/doc/fate_on_eggroll/fate-allinone_deployment_guide/
Is a cluster deployment mandatory?
Hello, I am running https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/fedkseed/fedkseed-example.ipynb. In the "Submit Federated Task" section I get:
ValueError: Job is failed, please check out job_id=202405310203079396430 in fate_flow log directory
The logs show:
ValueError: Module: fate.components.components.nn.runner.fedkseed_runner not found in the import path.
What is the cause, and how should I handle it? Thanks.
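An import-path error like the one above usually means the interpreter that executes FATE Flow tasks cannot resolve the fate_llm package. A minimal check one can run in that same Python environment (the package name is taken from the error message; that PYTHONPATH is the culprit is an assumption):

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` resolves on the current sys.path."""
    return importlib.util.find_spec(name) is not None

# If this prints False in the interpreter that runs FATE Flow tasks,
# fate_llm (and hence its runner modules) cannot be imported there.
print(module_available("fate_llm"))
```

If it prints False, adding the FATE-LLM/python directory to the worker's PYTHONPATH (not just the client's) would be the first thing to try.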
After the ChatGLM-6B run failed (see the issue "grpc error when running the ChatGLM-6B tutorial"), I found the VGPU-CORE resource was exhausted, even though eggroll's dashboard showed a normal amount of allocatable VGPU-CORE.
I had to go into MySQL and manually edit the node and processor manage tables to clear the VGPU-CORE records pre-allocated to the deepspeed task before I could resubmit.
After clearing them the job could obtain resources, but visibleCudaDevices became -1 again. nvidia-smi shows the GPUs are fine, and non-FATE GPU training jobs run normally.
File "demo.py", line 100, in <module>
pipeline.compile()
│ └ <function PipeLine.compile at 0x7f0a0aa4ac10>
└ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/site-packages/pipeline/backend/pipeline.py", line 428, in compile
self._train_conf = self._construct_train_conf()
│ │ │ └ <function PipeLine._construct_train_conf at 0x7f0a0aa4a550>
│ │ └ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
│ └ {'dsl_version': 2, 'initiator': {'role': 'guest', 'party_id': 9999}, 'role': {'guest': [9999], 'host': [10000], 'arbiter': [9...
└ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/site-packages/pipeline/backend/pipeline.py", line 395, in _construct_train_conf
LOGGER.debug(f"self._train_conf: \n {json.dumps(self._train_conf, indent=4, ensure_ascii=False)}")
│ └ <function Logger.debug at 0x7f0a0dd301f0>
└ <loguru.logger handlers=[(id=1, level=20, sink=<stderr>), (id=2, level=20, sink='/data/zhihao/anaconda3/envs/fate_env/lib/pyt...
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/__init__.py", line 234, in dumps
return cls(
└ <class 'json.encoder.JSONEncoder'>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 201, in encode
chunks = list(chunks)
└ <generator object _make_iterencode.<locals>._iterencode at 0x7f0a083f4f20>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
│ │ └ 0
│ └ {'dsl_version': 2, 'initiator': {'role': 'guest', 'party_id': 9999}, 'role': {'guest': [9999], 'host': [10000], 'arbiter': [9...
└ <function _make_iterencode.<locals>._iterencode_dict at 0x7f0a0aa563a0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a08388430>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a083883c0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a083884a0>
[Previous line repeated 7 more times]
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 438, in _iterencode
o = _default(o)
│ └ {'query_key_value'}
└ <bound method JSONEncoder.default of <json.encoder.JSONEncoder object at 0x7f0a0aa58b80>>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type set is not JSON serializable
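The traceback above fails because the job conf contains a Python set ({'query_key_value'}, the LoRA target modules), which json.dumps cannot serialize. A minimal reproduction, plus one possible workaround (converting sets to lists before serialization — an illustration, not the official fix):

```python
import json

# Reproduce: a set inside the job conf breaks the standard JSONEncoder.
conf = {"peft_config": {"target_modules": {"query_key_value"}}}
try:
    json.dumps(conf)
except TypeError as err:
    print(err)  # Object of type set is not JSON serializable

def set_to_list(obj):
    """json.dumps `default` hook: render sets as sorted lists."""
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

print(json.dumps(conf, default=set_to_list))
```

In practice the same effect can be had by passing target_modules as a list (or a single string) in the config rather than a set.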
I would like to understand how the FedCoLLM module is implemented. Is its code in the FATE-LLM repository or in the FATE repository?
The following error appears, saying the GPU request is too large, even though I actually have two GPUs. Also, I submitted with a python command, not jupyter.
[ERROR] [2023-10-07 22:58:01,892] [202310072257522498440] [22816:139678211446592] - [deepspeed_utils._run] [line:67]: failed to call CommandURI(_uri=v1/cluster-manager/job/submitJob) to xxx.xxx.xxx.xxx:4670: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "xxx.xxx.xxx.xxx:4670: com.webank.eggroll.core.error.ErSessionException: resource request gpu count 2 is too big
	at com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleDeepspeedSubmit(JobServiceHandler.scala:237)
	at com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleSubmit(JobServiceHandler.scala:226)
	at com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)
	at com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)
	at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139)
	at com.webank.eggroll.core.command.CommandService.com$webank$eggroll$core$command$CommandService$$run$body$1(CommandService.scala:47)
	at com.webank.eggroll.core.command.CommandService$$anonfun$1.run(CommandService.scala:41)
	at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43)
	at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41)
	at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:257)
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
"
debug_error_string = "{"created":"@1696690681.818528053","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:4670","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"xxx.xxx.xxx.xxx:4670: com.webank.eggroll.core.error.ErSessionException: resource request gpu count 2 is too big\n\tat com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleDeepspeedSubmit(JobServiceHandler.scala:237)\n\tat com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleSubmit(JobServiceHandler.scala:226)\n\tat com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)\n\tat com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)\n\tat com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139)\n\tat
When training the model following the tutorial at https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/parameter_efficient_llm/ChatGLM3-6B_ds.ipynb, the submitted job fails with an FP16 error. I submitted from inside the client's docker container, and FATE-LLM/python is on the PYTHONPATH environment variable. How can this be resolved? Thanks. The error:
FP16 Mixed precision trainning with AMP or APEX('--fp16') and FP16 half precision evaluation('--fp16_full_eval') can only be used on CUDA or NPU devices or certain XPU devices (with IPEX)
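That transformers error is raised when --fp16 is requested but the training process sees no CUDA device. Before digging into FATE itself, it may be worth checking what the worker inside the container actually sees (a generic sanity check, not a FATE-specific diagnosis):

```python
import os
import shutil

# GPUs are visible inside docker only if the container was started with GPU
# access (e.g. --gpus all) and CUDA_VISIBLE_DEVICES was not forced to -1,
# which the eggroll resource issue above ("visibleCudaDevices became -1")
# suggests can happen.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
print(f"CUDA_VISIBLE_DEVICES={visible}")
print(f"nvidia-smi on PATH: {shutil.which('nvidia-smi') is not None}")
```

If CUDA_VISIBLE_DEVICES is -1 (or nvidia-smi is absent in the container), the trainer will fall back to CPU and fp16 will fail exactly as shown.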
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Python version 3.8
Steps followed for installing FATE 1.11.3 version
Pulled a Docker python 3.8 image, mounted the ./FATE-LLM/python directory to /usr/local/lib/python3.8/site-packages/fate/python in the container, and installed the packages below:
pip install fate_client[fate,fate_flow]==1.11.3
apt update
apt install -y lsof
apt purge python3-click
pip install click==8.1.6
fate_flow init --ip 127.0.0.1 --port 9380 --home /fate_home
pipeline init --ip 127.0.0.1 --port 9380
fate_flow start
After this I appended the fate python path as said in the tutorial.
The error below appeared after running the offsite-tuning example:
"""
import sys
your_path_to_fate_python = '/usr/local/lib/python3.8/site-packages/fate/python'
sys.path.append(your_path_to_fate_python)
from fate_llm.model_zoo.offsite_tuning.offsite_tuning_model import OffsiteTuningSubModel, OffsiteTuningMainModel
"""
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/fate/python/fate_llm/model_zoo/offsite_tuning/offsite_tuning_model.py", line 18, in <module>
from federatedml.util import LOGGER
ModuleNotFoundError: No module named 'federatedml'
https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/parameter_efficient_llm/ChatGLM3-6B_ds.ipynb
Following this tutorial, train.json must be uploaded to the storage engine:
{"file": "xxxx/train.json", "head": false, "partition": 4, "meta": {}, "namespace": "experiment", "name": "ad"}
The upload fails with: Please provide sample_id_name
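For reference, the "Please provide sample_id_name" check can presumably be satisfied by naming the sample-id column in the upload conf's meta. The exact placement and the column name "id" below are assumptions for illustration, not taken from the tutorial:

```python
import json

# Hypothetical upload conf: the same fields as in the report above,
# plus meta.sample_id_name, which the error message asks for.
upload_conf = {
    "file": "xxxx/train.json",         # path elided as in the original report
    "head": False,
    "partition": 4,
    "meta": {"sample_id_name": "id"},  # "id" is a placeholder column name
    "namespace": "experiment",
    "name": "ad",
}
print(json.dumps(upload_conf, indent=2))
```

The column named here must actually exist in train.json; check the tutorial's sample data for the real column name.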
When following the GPT2 example, the following errors occur:
After debugging, this error is caused in the json/encoder.py file when it tries to serialize the target_modules = ['c_attn'] component of the LoraConfig object. Changing this value to the string 'c_attn' resolves the issue, but limits the ability to fine-tune multiple module types. However, after doing so, the following error occurs.
This occurs because the t.nn.CustModel object created in the pipeline job does not reference any of the layers from the GPT2 model, and there does not appear a way to do so. Is there a workaround for this, or would this be an environment issue?
How to deploy and run Federated LLM on a single machine? Is there a guide document?