
fate-llm's People

Contributors

dylan-fan, hainingzhang, mgqa34, nemirorox, sagewe, talkingwallace, yankang18


fate-llm's Issues

How to quick start?

I found no documentation for fate-llm, and the gpt2 documentation has disappeared. Please tell me how to use this project; I would like to use llama.

Error when calling bind_table in the GPT2 example

The printed result is: {'retcode': 100, 'retmsg': "Internal server error. Nothing in response. You may check out the configuration in 'FATE/conf/service_conf.yaml' and restart fate flow server."}
[screenshot]

The FedLLM Privacy Hub

Your paper mentions The FedLLM Privacy Hub, but I cannot find the corresponding code in FATE-LLM. How is your differential privacy scheme implemented?

How to use the model from the GPT2 example

Hi,
I want to know how to use the model after completing the GPT2 example; is there any sample or README?
Screenshot below:
[screenshot]
I only fetched 3 sample rows from IMDB.csv for testing.
[screenshot]

Which Qwen model does FATE-LLM support?

Hi all, which Qwen model is used in FATE? Neither Qwen-7B nor Qwen1.5-7B runs for me, while ChatGLM3-6B trains normally.

ModuleNotFoundError: No module named 'eggroll' when running offsite-tuning in a local (non-Docker) standalone_fate environment

The error is as follows:
Traceback (most recent call last):
File "/home/chenlu/workspace/standalone_fate_install_1.11.3_release/fateflow/python/fate_flow/controller/task_controller.py", line 216, in kill_task
backend_engine.kill(task)
File "/home/chenlu/workspace/standalone_fate_install_1.11.3_release/fateflow/python/fate_flow/controller/engine_controller/deepspeed.py", line 134, in kill
from eggroll.deepspeed.submit import client
ModuleNotFoundError: No module named 'eggroll'

Can fate-llm run in a standalone environment, or does it require a cluster installation?

ValueError: IP not configured. Please use command line tool `pipeline init` to set it.

1. Error description:

[screenshot]

When training a federated large model with chatglm6b, the following error is raised:
ValueError: IP not configured. Please use command line tool pipeline init to set it.

After running pipeline init, the same error persists:
ValueError: IP not configured. Please use command line tool pipeline init to set it.

I'm not sure how to proceed; does anyone know the cause?

2. Background:

2.1 The federated training code comes from https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/ChatGLM-6B_ds.ipynb, with only some paths changed to local directories.
2.2 FATE was deployed from source in standalone mode, following this guide: https://fate.readthedocs.io/en/latest/zh/deploy/cluster-deploy/doc/fate_on_eggroll/fate-allinone_deployment_guide/
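For reference, the pipeline client reads its IP/port from the configuration written by `pipeline init`; a minimal sketch, assuming a standalone deployment with fate_flow on its default address (substitute your own):

```shell
# Point the pipeline client at the running fate_flow service;
# 127.0.0.1:9380 is the standalone default and an assumption here.
pipeline init --ip 127.0.0.1 --port 9380
```

One common cause of the error persisting after `pipeline init` is running the command in a different Python environment (or as a different user) than the one executing the training script.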

fedkseed_runner not found in the import path.

Hello, I am running the example at https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/fedkseed/fedkseed-example.ipynb. In the "Submit Federated Task" step I get:
ValueError: Job is failed, please check out job_id=202405310203079396430 in fate_flow log directory
The logs show:
ValueError: Module: fate.components.components.nn.runner.fedkseed_runner not found in the import path.
What is the cause, and how should I fix it? Thanks!

VGPU-CORE resources are not released after a ChatGLM-6B job fails

After a ChatGLM-6B job fails (see the issue "gRPC error when running the ChatGLM-6B tutorial"), VGPU-CORE resources are reported as insufficient, even though the eggroll dashboard shows a normal amount of allocatable VGPU-CORE.
Only after manually editing the node and processor-manage tables in MySQL to clear the VGPU-CORE records pre-allocated to the deepspeed task could I submit a new job.
After clearing them, resources can be allocated, but visibleCudaDevices becomes -1 again; nvidia-smi shows the GPUs are normal, and non-FATE GPU training jobs run fine.

ChatGLM-6B TypeError: Object of type set is not JSON serializable

File "demo.py", line 100, in <module>
pipeline.compile()
│ └ <function PipeLine.compile at 0x7f0a0aa4ac10>
└ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/site-packages/pipeline/backend/pipeline.py", line 428, in compile
self._train_conf = self._construct_train_conf()
│ │ │ └ <function PipeLine._construct_train_conf at 0x7f0a0aa4a550>
│ │ └ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
│ └ {'dsl_version': 2, 'initiator': {'role': 'guest', 'party_id': 9999}, 'role': {'guest': [9999], 'host': [10000], 'arbiter': [9...
└ <pipeline.backend.pipeline.PipeLine object at 0x7f0b293d0ca0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/site-packages/pipeline/backend/pipeline.py", line 395, in _construct_train_conf
LOGGER.debug(f"self._train_conf: \n {json.dumps(self._train_conf, indent=4, ensure_ascii=False)}")
│ └ <function Logger.debug at 0x7f0a0dd301f0>
└ <loguru.logger handlers=[(id=1, level=20, sink=), (id=2, level=20, sink='/data/zhihao/anaconda3/envs/fate_env/lib/pyt...
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/__init__.py", line 234, in dumps
return cls(
└ <class 'json.encoder.JSONEncoder'>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 201, in encode
chunks = list(chunks)
└ <generator object _make_iterencode.<locals>._iterencode at 0x7f0a083f4f20>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
│ │ └ 0
│ └ {'dsl_version': 2, 'initiator': {'role': 'guest', 'party_id': 9999}, 'role': {'guest': [9999], 'host': [10000], 'arbiter': [9...
└ <function _make_iterencode.<locals>._iterencode_dict at 0x7f0a0aa563a0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a08388430>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a083883c0>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
└ <generator object _make_iterencode.<locals>._iterencode_dict at 0x7f0a083884a0>
[Previous line repeated 7 more times]
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 438, in _iterencode
o = _default(o)
│ └ {'query_key_value'}
└ <bound method JSONEncoder.default of <json.encoder.JSONEncoder object at 0x7f0a0aa58b80>>
File "/data/zhihao/anaconda3/envs/fate_env/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '

TypeError: Object of type set is not JSON serializable
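The root cause reduces to standard-library behavior: `json.dumps` cannot serialize a Python `set` (here the `{'query_key_value'}` value visible in the traceback), while the same data as a `list` serializes fine. A minimal reproduction, independent of FATE:

```python
import json

lora_target = {"query_key_value"}  # a set, as seen in the traceback above

# json.dumps raises TypeError on sets...
try:
    json.dumps({"target_modules": lora_target})
except TypeError as e:
    print(e)  # Object of type set is not JSON serializable

# ...but serializes the same data once it is a list
print(json.dumps({"target_modules": sorted(lora_target)}))
# → {"target_modules": ["query_key_value"]}
```

Converting the offending set to a list before the pipeline config is compiled avoids the error.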

FedCoLLM module

I would like to understand how the FedCoLLM module is implemented. Is its code in the FATE-LLM repository or in the FATE repository?

Hello, when running chatglm with deepspeed I get the error "resource request gpu count 2 is too big". How can I fix this?

The error says the GPU request is too big, even though I actually have two GPUs. I am submitting with a Python script, not Jupyter.

[ERROR] [2023-10-07 22:58:01,892] [202310072257522498440] [22816:139678211446592] - [deepspeed_utils._run] [line:67]: failed to call CommandURI(_uri=v1/cluster-manager/job/submitJob) to xxx.xxx.xxx.xxx:4670: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "xxx.xxx.xxx.xxx:4670: com.webank.eggroll.core.error.ErSessionException: resource request gpu count 2 is too big
at com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleDeepspeedSubmit(JobServiceHandler.scala:237)
at com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleSubmit(JobServiceHandler.scala:226)
at com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)
at com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)
at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139)
at com.webank.eggroll.core.command.CommandService.com$webank$eggroll$core$command$CommandService$$run$body$1(CommandService.scala:47)
at com.webank.eggroll.core.command.CommandService$$anonfun$1.run(CommandService.scala:41)
at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43)
at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41)
at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:257)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
"
debug_error_string = "{"created":"@1696690681.818528053","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:4670","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"xxx.xxx.xxx.xxx:4670: com.webank.eggroll.core.error.ErSessionException: resource request gpu count 2 is too big\n\tat com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleDeepspeedSubmit(JobServiceHandler.scala:237)\n\tat com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleSubmit(JobServiceHandler.scala:226)\n\tat com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)\n\tat com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)\n\tat com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139)\n\tat


Training a GPT model hangs

[screenshot]
When training a GPT model with FATE-LLM, it gets stuck at this point. I first suspected a resource problem, so I ran it across 2 machines with 1 GPU each, but it still hangs with no error and no log output. Does anyone know how to adjust this?

FP16 mixed-precision error during FATE-LLM model training

No module named 'federatedml'

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

NAME="Debian GNU/Linux"

VERSION_ID="12"

VERSION="12 (bookworm)"

VERSION_CODENAME=bookworm

ID=debian

HOME_URL="https://www.debian.org/"

SUPPORT_URL="https://www.debian.org/support"

BUG_REPORT_URL="https://bugs.debian.org/"

Python version 3.8

Steps followed for installing FATE 1.11.3 version

Pulled a Docker python 3.8 image, mounted the ./FATELLM/python directory to /usr/local/lib/python3.8/site-packages/fate/python in the container, and installed the packages below:

pip install fate_client[fate,fate_flow]==1.11.3

apt update

apt install -y lsof

apt purge python3-click

pip install click==8.1.6

fate_flow init --ip 127.0.0.1 --port 9380 --home /fate_home

pipeline init --ip 127.0.0.1 --port 9380

fate_flow start

After this I appended the fate python path as said in the tutorial.

The error below appeared after running the offsite-tuning example:

"""
import sys

your_path_to_fate_python = '/usr/local/lib/python3.8/site-packages/fate/python'

sys.path.append(your_path_to_fate_python)

from fate_llm.model_zoo.offsite_tuning.offsite_tuning_model import OffsiteTuningSubModel, OffsiteTuningMainModel

"""

Error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>

File "/usr/local/lib/python3.8/site-packages/fate/python/fate_llm/model_zoo/offsite_tuning/offsite_tuning_model.py", line 18, in <module>
from federatedml.util import LOGGER

ModuleNotFoundError: No module named 'federatedml'
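A quick way to check whether the appended path actually exposes the expected packages is `importlib.util.find_spec`; a sketch, using the path from the steps above:

```python
import importlib.util
import sys

# Path from the setup steps above; substitute your own mount point.
sys.path.append("/usr/local/lib/python3.8/site-packages/fate/python")

# find_spec returns None when a package cannot be located on sys.path,
# which distinguishes a path problem from a broken install.
for name in ("fate_llm", "federatedml"):
    spec = importlib.util.find_spec(name)
    print(name, "found" if spec else "NOT found")
```

Note that `federatedml` ships with the main FATE repository's python directory rather than with FATE-LLM itself, so mounting only the FATE-LLM sources would leave it missing, which matches the error above.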

GPT2 Example job issues when supplying target_modules param in LoraConfig

When following the GPT2 example, the following errors occur:
[screenshot]
After debugging, this error comes from json/encoder.py when it tries to serialize the target_modules = ['c_attn'] component of the LoraConfig object. Changing this value to the string 'c_attn' resolves the issue, though it limits the ability to fine-tune multiple module categories. After doing so, the following error occurs:
[screenshot]
This happens because the t.nn.CustModel object created in the pipeline job does not reference any of the layers of the GPT2 model, and there does not appear to be a way to do so. Is there a workaround, or is this an environment issue?
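As a workaround sketch (not the project's official fix, and whether the pipeline exposes a hook for it is an assumption to verify): `json.dumps` accepts a `default=` callback, so a converter that turns sets into lists keeps the multi-module `target_modules` form serializable:

```python
import json

def set_safe(obj):
    # Fall back to a sorted list for sets; let json raise for anything else.
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Hypothetical config shape for illustration only.
conf = {"peft_config": {"target_modules": {"c_attn", "c_proj"}}}
print(json.dumps(conf, default=set_safe))
# → {"peft_config": {"target_modules": ["c_attn", "c_proj"]}}
```

This preserves the list form of target_modules rather than collapsing it to a single string.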

gRPC error when running the ChatGLM-6B tutorial

Running the ChatGLM-6B tutorial fails with a gRPC error; any guidance would be much appreciated.
Installed with AnsibleFATE_2.1.0_LLM_2.0.0_release_offline.tar.gz; host and guest both use the default installation configuration.
Error messages:
[screenshots]

Issues of GPT2-example

When I followed the GPT2 example in the tutorial, I encountered the following problem:

[screenshot]

By the way, the GPT2 example in the tutorial is missing the imports for TrainerParam and DatasetParam; they should be imported with
from pipeline.component.nn import TrainerParam, DatasetParam

ChatGLM-6B model training problem

Hi all, a quick question: when training a GPT model with FATE-LLM, I get the error below. Following related deepspeed issues and fixes, I disabled fp16, but even with fp16: {enable: False} set in FATE the same error persists. Has anyone run into this?
Environment:
Two machines with one 3090 GPU each. deepspeed==1.13.1
[screenshot]
