scir-hi / med-chatglm

942.0 942.0 147.0 842 KB

Repo for Chinese Medical ChatGLM: instruction fine-tuning of ChatGLM on Chinese medical knowledge

License: Apache License 2.0

Python 99.21% Shell 0.79%

chinese medical medqa nlp

med-chatglm's Issues

Error when training after placing the chatglm-6b-med model into the official chatglm-6b code

ImportError: This modeling file requires the following packages that were not
found in your environment: configuration_chatglm. Run pip install configuration_chatglm

sh script/sft_medchat.sh should be sh scripts/sft_medchat.sh

"Strict standards, thorough effort." Please fix this small error in the instruction fine-tuning section.

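Some questions about details in modeling_chatglm.py

At lines 834 and 978 of modeling_chatglm.py, why is it seq = input_ids[0].tolist(), followed by mask_position = seq.index(mask_token)? The position embedding computed afterwards also seems to use only seq. However, printing input_ids shows that the gMASK position differs between samples in a batch. Why is only row 0 of the 2D matrix considered? I could not find where the other rows of input_ids are processed.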
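
The concern in this question can be made concrete with a small sketch. It uses plain Python lists instead of tensors, a hypothetical helper named mask_positions, and the gMASK ID 150001 that is quoted elsewhere in these issues. It only illustrates looking up the mask position per row and is not the repository's implementation.

```python
from typing import List

GMASK = 150001  # gMASK ID as quoted in these issues; other checkpoints may use other IDs


def mask_positions(input_ids: List[List[int]], mask_token: int = GMASK) -> List[int]:
    """Return the index of mask_token in every row of the batch."""
    return [row.index(mask_token) for row in input_ids]


# Two samples of equal length whose gMASK sits at different positions.
batch = [
    [5, 6, GMASK, 7],
    [5, GMASK, 7, 8],
]

first_row_only = batch[0].index(GMASK)  # what seq = input_ids[0].tolist() effectively uses
per_row = mask_positions(batch)         # one position per sample

print(first_row_only)  # 2
print(per_row)         # [2, 1]
```

If every row shared the same prompt length the two results would agree, which may be why the single-row lookup goes unnoticed with batch size 1.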

requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/MODEL_PATH/resolve/main/config.json

requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/MODEL_PATH/resolve/main/config.json
Has anyone else run into this?
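
The URL in the error contains the literal text MODEL_PATH, which suggests the placeholder was passed through unchanged and then treated as a Hugging Face Hub repository name. A minimal sketch of a guard, assuming the weights are meant to be loaded from a local directory; the helper name require_local_model_dir is hypothetical:

```python
import os
import tempfile


def require_local_model_dir(path: str) -> str:
    """Return the absolute path if it is a directory containing config.json, else raise."""
    abs_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(os.path.join(abs_path, "config.json")):
        raise FileNotFoundError(
            f"{abs_path!r} is not a model directory (no config.json found). "
            "Replace the placeholder with the folder that holds the downloaded weights."
        )
    return abs_path


# A directory that contains config.json is accepted.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "config.json"), "w").close()
    print(require_local_model_dir(tmp) == os.path.abspath(tmp))

    # A path without config.json (such as an unreplaced placeholder) is rejected early.
    try:
        require_local_model_dir(os.path.join(tmp, "MODEL_PATH"))
    except FileNotFoundError as err:
        print("rejected:", err)
```

Calling such a check before from_pretrained turns the confusing 401 response into a local, readable error.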

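After fine-tuning the model with the provided train.txt, its answers to some questions repeat the same phrases endlessly

Answer: Hello, you are welcome to ask me questions.
Please enter your question (enter q to quit): Xiao Ming has contracted eustachian tube inflammation. What are the clinical symptoms and signs?
Answer: Xiao Ming's clinical symptoms and signs include ear pain, ear pain, hearing loss, tinnitus, hearing loss, purulent ear discharge, ear pain, watery ear discharge, purulent ear discharge, ear pain, watery ear discharge, hearing loss, purulent ear discharge, ear pain, hearing loss, purulent ear discharge, ear pain, hearing loss, purulent ear discharge, ... [the sequence "ear pain, hearing loss, purulent ear discharge" keeps repeating until the output is cut off mid-word]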
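
Loops like this are a common failure mode of autoregressive decoding. One widely used mitigation is a repetition penalty, which lowers the scores of tokens that have already been generated (the generate method in transformers exposes this as the repetition_penalty argument). Whether it resolves this particular case is not confirmed in the thread. A minimal pure-Python sketch of the idea, with invented scores:

```python
from typing import Dict, Iterable


def apply_repetition_penalty(
    scores: Dict[int, float], generated: Iterable[int], penalty: float = 1.2
) -> Dict[int, float]:
    """Penalize tokens that were already generated.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so a previously seen token always becomes less likely.
    """
    adjusted = dict(scores)
    for token in set(generated):
        if token in adjusted:
            score = adjusted[token]
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


# Invented scores for three candidate tokens; tokens 101 and 103 were already emitted.
scores = {101: 4.0, 102: 3.8, 103: -1.0}
history = [101, 101, 103]

adjusted = apply_repetition_penalty(scores, history, penalty=2.0)
print(adjusted)                         # {101: 2.0, 102: 3.8, 103: -2.0}
print(max(adjusted, key=adjusted.get))  # 102
```

Without the penalty the already repeated token 101 would win again; with it, the unseen token 102 is preferred.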

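Was the vocabulary changed, causing bos_token_id and eos_token_id to differ substantially from the original?

As titled. If it was changed, please share the specific changes so that we can adjust our fine-tuning settings.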

Fine-tuning on my own dataset raises an account-related error. How can I solve it? I really do not understand the error message.

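On the approach to building the instruction dataset

Thank you for your work. Since you do not appear to have released the construction process and code for the instruction set yet, I would like to ask how you build an instruction dataset from structured knowledge.
When constructing instruction tasks that incorporate knowledge, how do you bring specific pieces of the structured knowledge (for example disease, symptoms, affected site) into the prompt (instruction + input) and the answer (output)?
My understanding is that there are two options. Either you hand-write templates with embedded structured fields (for example template: [symptoms, affected site]) and then have GPT-3.5 read the full knowledge entry and produce the output; or you use something like self-instruct, where GPT-3.5 reads the knowledge and generates varied questions (prompts) and answers (outputs) itself.
If it is the former, the number of hand-written templates is limited, so how do you keep the instructions diverse?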
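
The first option described in this question (hand-written templates filled from structured fields) can be sketched as follows. The record layout, field names, and templates are all hypothetical placeholders; the construction code had not been published when this issue was written, so this is not the authors' pipeline. The answer side would still have to come from a model or an annotator reading the full knowledge entry.

```python
from typing import Dict, List

# Hypothetical structured knowledge entry; the field names are assumptions.
record = {
    "disease": "<disease name>",
    "symptoms": ["<symptom A>", "<symptom B>"],
    "site": "<affected site>",
}

# Hypothetical hand-written templates, each tied to one knowledge field.
TEMPLATES: Dict[str, List[str]] = {
    "symptoms": [
        "What are the clinical symptoms of {disease}?",
        "A patient has {disease}. Which symptoms and signs should be expected?",
    ],
    "site": [
        "Which part of the body does {disease} affect?",
    ],
}


def build_prompts(entry: dict) -> List[dict]:
    """Create one instruction per (field, template) pair that the entry can fill."""
    prompts = []
    for field, templates in TEMPLATES.items():
        if not entry.get(field):
            continue  # skip fields this entry does not provide
        for template in templates:
            prompts.append({
                "instruction": template.format(disease=entry["disease"]),
                "knowledge_field": field,
                "knowledge": entry[field],
            })
    return prompts


for prompt in build_prompts(record):
    print(prompt["instruction"])
```

The diversity limit raised in the question is visible here: the number of distinct instructions per entry is bounded by the number of templates.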

Can the full dataset be open-sourced?

ValueError: 150001 is not in list

Running sft_medchat.sh raises ValueError: 150001 is not in list. What is the cause?

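Hello senior, here are my test results; they have left me questioning everything

I am also from 工大 and am now working. I know your team is very capable, so I must have made a mistake and misunderstood how inference should be run. Could you help me find the problem? I used the original code without modification. How can such severe word repetition be fixed?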

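How long did fine-tuning take?

Not enough GPU memory; manually adding .quantize(8) does not help either

How can the model be run in quantized form? I could not find where to pass the quantization parameter.

Running sh scripts/sft_medchat.sh fails with a baffling error

Is multi-node, multi-GPU fine-tuning supported?

Comparison with the latest ChatGLM-6B, not fine-tuned and not quantized

Below is the output I got from the latest ChatGLM-6B without fine-tuning or quantization. It is very different from what your project page shows. Was your ChatGLM-6B quantized, or was it an older version?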

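Q:
Xiao Li has recently developed tachycardia accompanied by mild chest pain. Examination found a prolonged P-R interval with flattened T waves and ST-segment abnormalities.
A:
Xiao Li has tachycardia and mild chest pain, and the examination found a prolonged P-R interval with flattened T waves and ST-segment abnormalities. The following treatment and prevention measures should be considered:

Medication: drugs such as beta blockers and calcium channel blockers can be used to control the tachycardia and relieve symptoms. The drug should be chosen on a doctor's advice and taken strictly according to the prescribed dose and schedule.
Lifestyle changes: avoid stimulating drinks such as coffee, tea, and carbonated beverages; stop smoking and limit alcohol; get enough sleep and eat a balanced diet to relieve symptoms.
Preventing episodes: if the tachycardia persists or recurs frequently, seek medical attention promptly and undergo examinations such as ambulatory ECG (Holter) monitoring and echocardiography to identify the cause and plan treatment.
Surgery: patients whose tachycardia has a clearly identified cause may need surgical treatment such as radiofrequency ablation.
In short, in Xiao Li's situation it is necessary to seek medical attention promptly, undergo professional examination and treatment, maintain a healthy lifestyle, and prevent symptoms and complications.

Q:
Can gamma-glutamyl transpeptidase levels rise because of drug use or drinking?
A:
Gamma-glutamyl transpeptidase (AST) is an amino acid metabolism enzyme whose main role is to convert amino acids into decarboxylated acids. Elevated AST is usually associated with inflammatory responses, especially in the liver. Drug use or drinking may cause liver inflammation or damage and thereby raise AST levels.
Drug use may cause liver damage or inflammation, which in turn raises AST levels. For example, taking drugs may damage liver cells and increase the release of gamma-glutamyl transpeptidase from them. Drinking may likewise cause liver inflammation or damage and raise AST levels.
However, drug use or drinking is not always the only cause of elevated AST. Other factors such as infection, substance abuse, and malnutrition can also raise AST. If AST is elevated, these factors should be considered and medical help sought to determine the cause and take appropriate treatment.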

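"The code for building the medical knowledge base and dataset is still being organized and will be released once ready."

Hello, I would like to ask:
1. Has the dataset construction code for the knowledge base been released?
2. Could the specific code for building the knowledge base be released?
3. For extracting information from medical literature, which works better, uie or cMeKG?
Looking forward to your reply!

Does quantize in modeling_chatglm.py work? Why does adding model.quantize(4) in run_clm.py raise ImportError: attempted relative import with no known parent package?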

Traceback (most recent call last):
File "/home/skong/Fo-ChatGLM/run_clm.py", line 570, in <module>
main()
File "/home/skong/Fo-ChatGLM/run_clm.py", line 433, in main
model.quantize(4)
File "/home/skong/Fo-ChatGLM/modeling_chatglm.py", line 1225, in quantize
from .quantization import quantize
ImportError: attempted relative import with no known parent package
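
The traceback shows modeling_chatglm.py sitting next to run_clm.py, so it was probably imported as a top-level module rather than as part of a package; a relative import such as from .quantization import quantize has no parent package to resolve against in that case. The sketch below reproduces the behavior with throwaway files. All module names in it are invented for the demonstration.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = os.path.join(tmp, "demo_glm_pkg")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("")
    with open(os.path.join(pkg_dir, "demo_quantization.py"), "w") as f:
        f.write("def quantize(bits):\n    return f'quantized to {bits} bits'\n")
    with open(os.path.join(pkg_dir, "demo_modeling.py"), "w") as f:
        f.write(
            "def run(bits):\n"
            "    from .demo_quantization import quantize  # relative import\n"
            "    return quantize(bits)\n"
        )
    importlib.invalidate_caches()

    # Case 1: imported as part of a package, the relative import resolves.
    sys.path.insert(0, tmp)
    as_package = importlib.import_module("demo_glm_pkg.demo_modeling")
    package_result = as_package.run(4)

    # Case 2: imported as a top-level module, there is no parent package.
    sys.path.insert(0, pkg_dir)
    as_top_level = importlib.import_module("demo_modeling")
    try:
        as_top_level.run(4)
        top_level_error = None
    except ImportError as err:
        top_level_error = str(err)

    # Undo the sys.path changes made for the demonstration.
    sys.path.remove(pkg_dir)
    sys.path.remove(tmp)

print(package_result)   # quantized to 4 bits
print(top_level_error)  # attempted relative import with no known parent package
```

If quantization.py sits in the same directory as modeling_chatglm.py, switching that statement to an absolute import is one possible workaround; this has not been verified against the repository.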

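Error: 150004 is not in list. How should I modify things further? Thanks

Loading the bin file

My local machine is underpowered, and loading the 9 GB bin file runs out of memory. Is there a way to split it into smaller files? How should the bin.index file be modified?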
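
For reference, transformers can write a checkpoint as several smaller files through the max_shard_size argument of save_pretrained, and it generates the accompanying index file itself, so the index normally does not need to be edited by hand. Re-saving still requires loading the full model once on a machine with enough memory. The sketch below only illustrates what such an index contains: a greedy split of parameters into shards plus a weight_map from parameter name to shard file. The parameter names and sizes are invented.

```python
import json
from typing import Dict, List, Tuple


def plan_shards(param_bytes: Dict[str, int], max_shard_bytes: int) -> Tuple[List[List[str]], dict]:
    """Greedily pack parameters into shards, in order, and build an index."""
    shards: List[List[str]] = [[]]
    current = 0
    for name, size in param_bytes.items():
        # Start a new shard when the next tensor would exceed the limit.
        if shards[-1] and current + size > max_shard_bytes:
            shards.append([])
            current = 0
        shards[-1].append(name)
        current += size

    total = len(shards)
    weight_map = {}
    for i, names in enumerate(shards, start=1):
        filename = f"pytorch_model-{i:05d}-of-{total:05d}.bin"
        for name in names:
            weight_map[name] = filename

    index = {
        "metadata": {"total_size": sum(param_bytes.values())},
        "weight_map": weight_map,
    }
    return shards, index


# Invented parameter names and sizes in bytes, for illustration only.
params = {"layer0.weight": 400, "layer0.bias": 100, "layer1.weight": 400, "layer1.bias": 100}
shards, index = plan_shards(params, max_shard_bytes=600)
print(json.dumps(index, indent=2))
```

Each shard can then be loaded and released one at a time, which is what keeps peak memory below the size of the full checkpoint.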

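How do I train with multiple GPUs?

No answer is displayed

It stays in this state indefinitely, and I cannot proceed to the next step

Could the model be uploaded to Hugging Face for easier download?

As titled.

The last phrase sometimes repeats endlessly. What causes this?

User: What conditions is amoxicillin mainly used to treat?

Amoxicillin is mainly used to treat pneumococcal disease, pneumonia, streptococcal pneumonia, penicillin allergy, pneumonia caused by penicillin allergy, pneumonia, pneumonia caused by penicillin allergy, pneumonia caused by penicillin allergy, ... [the phrase "pneumonia caused by penicillin allergy" keeps repeating until the end of the output].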

130001 is not in list

After setting up the environment, running the script with python infer.py fails with the following error.
Traceback (most recent call last):
File "/home/kemove/miniconda3/envs/qlora/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kemove/miniconda3/envs/qlora/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/kemove/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/home/kemove/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/kemove/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/kemove/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/kemove/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/kemove/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/home/kemove/kylin/med-ChatGLM/infer.py", line 12, in <module>
response, history = model.chat(tokenizer, "问题：" + a.strip() + '\n答案：', max_length=256, history=[])
File "/home/kemove/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/kemove/kylin/med-ChatGLM/modeling_chatglm.py", line 1114, in chat
outputs = self.generate(**input_ids, **gen_kwargs)
File "/home/kemove/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/kemove/.local/lib/python3.8/site-packages/transformers/generation/utils.py", line 1565, in generate
return self.sample(
File "/home/kemove/.local/lib/python3.8/site-packages/transformers/generation/utils.py", line 2609, in sample
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
File "/home/kemove/kylin/med-ChatGLM/modeling_chatglm.py", line 979, in prepare_inputs_for_generation
mask_position = seq.index(mask_token)
ValueError: 130001 is not in list

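6 GB GPU memory, and pointing the code at a local model directory

1. My graphics card has only 6 GB of memory. How should I configure things (or which line in which file should I change)?
2. After downloading the ChatGLM-6B-Med weights to a local directory, where do I configure the path so that the code points to it?

What do the medical knowledge base and the dataset each refer to?

The dataset is the train.txt file.
What is the medical knowledge base, and where is it used?
When a question is asked, is the answer based on both the medical knowledge base and the dataset?
Medical knowledge demands accurate answers. How can answer accuracy be kept as high as possible?

Some questions; thanks in advance!

How can the project be changed to fine-tune on CPU? Can it use system RAM instead of GPU memory?

Instruction fine-tuning is only the first step of ChatGPT-style training; the subsequent reward function and RLHF-based training were not done. Do the authors consider those later stages unimportant?

How do I train with multiple GPUs?

I see that ChatGLM provides a function load_model_on_gpus() for multi-GPU use. How should it be adapted here?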

RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

This error is raised by PyTorch. Why does it occur?
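
This message typically appears when half-precision (float16) weights are run on the CPU, where some PyTorch builds do not implement float16 kernels such as LayerNorm. Assuming that is the situation here (the thread does not confirm it), the usual remedy is to use float32 on CPU and keep float16 for CUDA devices. A minimal sketch of that decision; the helper pick_dtype_name is hypothetical and returns dtype names as strings so that it runs without PyTorch installed:

```python
def pick_dtype_name(device: str) -> str:
    """Choose a floating-point precision that the target device can execute."""
    if device.startswith("cuda"):
        return "float16"  # half precision is supported on CUDA GPUs
    return "float32"      # fall back to full precision elsewhere, e.g. on CPU


for device in ("cuda:0", "cpu"):
    print(device, "->", pick_dtype_name(device))
```

With PyTorch the corresponding calls on a module are model.half() for float16 and model.float() for float32. Note that float32 doubles the memory needed for the weights compared with float16.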

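Programming novice asking for help running this program

I am a medical student without much programming background. I saw this project being promoted and wanted to try it. I installed Python 3.11, ran pip install -r requirements.txt in cmd, downloaded and extracted the archive from the Code page, and put the ten-plus GB of weights from the cloud drive into the model folder. But when I run python infer.py in cmd from the Med-ChatGLM-main directory, I get ModuleNotFoundError: No module named 'torch'. Did I do something wrong?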
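
The error means that the Python interpreter running infer.py cannot find PyTorch, for example because the requirements were installed into a different Python installation or the torch installation itself failed. A small sketch that reports which packages are importable from the current interpreter; the package list is an assumption based on the errors quoted in these issues, not the repository's requirements.txt:

```python
import importlib.util
import sys
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether this interpreter can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print("interpreter:", sys.executable)
for name, found in check_modules(["torch", "transformers"]).items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Installing with python -m pip install -r requirements.txt, using the same python command that later runs infer.py, ensures the packages land in that interpreter.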

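The model's conclusion cannot be traced to the training data, so is this fine-tuning correct?

Model comparison

Test input	ChatGLM-6B output	ChatGLM-6B-Med output
Can gamma-glutamyl transpeptidase levels rise because of drug use or drinking?	Drug use and drinking both harm health, including effects on the liver and heart. Drinking can lead to liver damage and heart disease because alcohol impairs liver and heart function. Long-term drinking can also lead to fatty liver, cirrhosis, heart disease, and other health problems. Drug use severely damages the liver and heart, leading to liver fibrosis, cirrhosis, cardiovascular disease, and more. Drug use can also lead to mental disorders, social withdrawal, violence, and crime. Therefore both drinking and drug use harm health, and excessive use should be avoided.	There are many causes of elevated gamma-glutamyl transpeptidase, including infection, peptic ulcer, infection, and drug side effects, and unhealthy behaviors such as drug use or drinking may raise gamma-glutamyl transpeptidase.
Xiao Li has recently developed tachycardia accompanied by mild chest pain. Examination found a prolonged P-R interval with flattened T waves and ST-segment abnormalities.	I am very sorry, I do not understand your question. Please provide more background or context so that I can answer it better. Thank you!	Xiao Li has tachycardia accompanied by chest pain. Cardiovascular diseases such as cardiomyopathy and coronary atherosclerosis should be considered; an ECG, blood tests, cardiac ultrasound, and similar examinations are recommended.
......	......	......

Regarding this answer, I searched the training data and found nothing that supports the conclusion quoted above [There are many causes of elevated gamma-glutamyl transpeptidase, including infection, peptic ulcer, infection, and drug side effects, and unhealthy behaviors such as drug use or drinking may raise gamma-glutamyl transpeptidase.]. So is this kind of fine-tuning correct?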
https://github.com/SCIR-HI/Med-ChatGLM/tree/main/data

Tensors must have same number of dimensions: got 4 and 2

│ /root/autodl-tmp/tuning-chatglm/Med-ChatGLM/modeling_chatglm.py:1114 in chat │
│ │
│ 1111 │ │ │ prompt += "[Round {}]\n问：{}\n答：".format(len(history), query) │
│ 1112 │ │ input_ids = tokenizer([prompt], return_tensors="pt", padding=True) │
│ 1113 │ │ input_ids = input_ids.to(self.device) │
│ ❱ 1114 │ │ outputs = self.generate(**input_ids, **gen_kwargs) │
│ 1115 │ │ outputs = outputs.tolist()[0][len(input_ids["input_ids"][0]):] │
│ 1116 │ │ response = tokenizer.decode(outputs) │
│ 1117 │ │ response = response.strip()

How can this problem be solved?

OSError: models/chatglm-6b-med does not appear to have a file named modeling_chatglm.py.

When trying to load the model in text-generation-webui

I get the following error (the model was downloaded from the provided Baidu Netdisk link):

Traceback (most recent call last):
File "/root/autodl-tmp/text-generation-webui/server.py", line 67, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/root/autodl-tmp/text-generation-webui/modules/models.py", line 85, in load_model
model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}"), low_cpu_mem_usage=True, torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16, trust_remote_code=trust_remote_code)
File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 462, in from_pretrained
model_class = get_class_from_dynamic_module(
File "/root/miniconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 377, in get_class_from_dynamic_module
final_module = get_cached_module_file(
File "/root/miniconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 232, in get_cached_module_file
resolved_module_file = cached_file(
File "/root/miniconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 380, in cached_file
raise EnvironmentError(
OSError: models/chatglm-6b-med does not appear to have a file named modeling_chatglm.py. Checkout 'https://huggingface.co/models/chatglm-6b-med/None' for available files.

All returned answers are empty

Out of memory. 48 GB is not enough either. What happened?

OutOfMemoryError: CUDA out of memory

There is not enough GPU memory, and the fixes I found online did not solve it. Is there any other way to resolve the problem below?
OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 39.56
GiB total capacity; 37.88 GiB already allocated; 32.56 MiB free; 38.21 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
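
A back-of-the-envelope estimate helps explain why a 40 GB or 48 GB card can still run out of memory during full-parameter fine-tuning. The sketch assumes about 6.2 billion parameters and mixed-precision training with Adam, which is commonly estimated at around 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and two optimizer moments), before counting activations. These are generic rule-of-thumb figures, not measurements of this repository's training script.

```python
def training_memory_gib(num_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory for weights, gradients, and Adam states, excluding activations."""
    return num_params * bytes_per_param / 1024**3


def weights_memory_gib(num_params: float, bits: int) -> float:
    """Memory for the weights alone at a given precision."""
    return num_params * bits / 8 / 1024**3


NUM_PARAMS = 6.2e9  # assumed parameter count

print(f"full fine-tuning (Adam, mixed precision): {training_memory_gib(NUM_PARAMS):.0f} GiB")
for bits in (16, 8, 4):
    print(f"weights only at {bits}-bit: {weights_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the optimizer state alone exceeds a single 48 GB card, so reducing the batch size or tuning the allocator does not help; options that change the totals are sharding the state across GPUs, offloading it, or training fewer parameters.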

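How can this model be run with less than 10 GB of GPU memory?

After installing the dependencies, running shows No module named 'transformers_modules'

After installing the dependencies, running shows No module named 'transformers_modules'. How can I fix this?

Can quantization_bit be set when fine-tuning with this project? How do I set it in sft_medchat.sh?

How was the database data converted into Q&A pairs with GPT-3.5?

How can the database data be converted with GPT-3.5 into training data (Q&A pairs) for ChatGLM? The project only mentions that GPT-3.5 was used together with prompts and provides no related code. Could you share the detailed code for going from database data to Q&A pairs?

mask_token-related error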

File "Med-ChatGLM/modeling_chatglm.py", line 836, in forward
mask_position = seq.index(mask_token)
ValueError: 150001 is not in list
Looking at the logic:

        if past_key_values is None:
            past_key_values = tuple([None] * len(self.layers))

            MASK, gMASK = 150000, 150001
            mask_token = MASK if MASK in input_ids else gMASK
            use_gmask = False if MASK in input_ids else gMASK
            seq = input_ids[0].tolist()

            mask_position = seq.index(mask_token)

This code seems to hide an implicit assert, namely MASK in seq or gMASK in seq, but that assumption apparently does not hold, which causes the error?
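
The implicit assumption pointed out here can be made explicit. The sketch below mirrors the quoted logic on a plain Python list and raises a descriptive error when neither token is present. A plausible reason for that situation is a tokenizer whose special-token IDs differ from the ones hard-coded in the modeling file (the errors in these issues mention both 150001 and 130001), although the thread does not confirm the cause. The function name is hypothetical.

```python
from typing import List, Tuple

MASK, GMASK = 150000, 150001  # IDs taken from the quoted snippet


def find_mask_position(seq: List[int], mask: int = MASK, gmask: int = GMASK) -> Tuple[int, bool]:
    """Return (position, use_gmask), or raise a descriptive error if no mask token exists."""
    if mask in seq:
        return seq.index(mask), False
    if gmask in seq:
        return seq.index(gmask), True
    raise ValueError(
        f"Neither MASK ({mask}) nor gMASK ({gmask}) occurs in the input IDs. "
        "Check that the tokenizer files and the modeling code come from the same checkpoint version."
    )


print(find_mask_position([5, 6, GMASK, 7]))  # (2, True)

try:
    # A sequence encoded with a different vocabulary, for example gMASK = 130001.
    find_mask_position([5, 6, 130001, 7])
except ValueError as err:
    print(err)
```

The bare seq.index call fails in the same place, but its message ("150001 is not in list") gives no hint about the mismatch.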

On Windows the following error occurs:

File "C:\python code\Med-ChatGLM-main\modeling_chatglm.py", line 979, in prepare_inputs_for_generation
mask_position = seq.index(mask_token)
ValueError: 130001 is not in list

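For full-parameter fine-tuning, should the learning rate be larger or smaller?

The launch script uses 5e-5. Is that the value used for the actual training, or should it be larger or smaller? If the results are poor, or the base language model loses its original abilities, should it be raised or lowered?

Would you consider adding MPS backend support for Mac?

As titled.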

Exception has occurred: ValueError 130001 is not in list

@modeling_chatglm.py line: 979
Exception has occurred: ValueError
130001 is not in list
File "E:\OpenSourceModel\Med-ChatGLM-main\Med-ChatGLM-main\modeling_chatglm.py", line 979, in prepare_inputs_for_generation
mask_position = seq.index(mask_token)
File "E:\OpenSourceModel\Med-ChatGLM-main\Med-ChatGLM-main\modeling_chatglm.py", line 1114, in chat
outputs = self.generate(**input_ids, **gen_kwargs)
File "E:\OpenSourceModel\Med-ChatGLM-main\Med-ChatGLM-main\infer.py", line 12, in <module>
response, history = model.chat(tokenizer, "问题：" + a.strip() + '\n答案：', max_length=256, history=[])
ValueError: 130001 is not in list

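How can a knowledge graph be converted into Q&A pairs in batch?

Question

Hello, if I ask a question that is not in the dataset, can the model still answer it correctly?

Model loading fails with this error. Which version should I be using?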

from transformers import AutoTokenizer, AutoModel
model_med = AutoModel.from_pretrained("./chatglm-6b-med/", trust_remote_code=True)

File ~/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py:818, in ChatGLMModel.__init__(self, config, empty_init)
816 self.hidden_size_per_attention_head = self.hidden_size // self.num_attention_heads
817 self.position_encoding_2d = config.position_encoding_2d
--> 818 self.pre_seq_len = config.pre_seq_len
819 self.prefix_projection = config.prefix_projection
821 self.word_embeddings = init_method(
822 torch.nn.Embedding,
823 num_embeddings=self.vocab_size, embedding_dim=self.hidden_size,
824 dtype=self.params_dtype
825 )

File /opt/conda/envs/xs_llm/lib/python3.8/site-packages/transformers/configuration_utils.py:260, in PretrainedConfig.__getattribute__(self, key)
258 if key != "attribute_map" and key in super().__getattribute__("attribute_map"):
259 key = super().__getattribute__("attribute_map")[key]
--> 260 return super().__getattribute__(key)

AttributeError: 'ChatGLMConfig' object has no attribute 'pre_seq_len'
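
The traceback shows modeling_chatglm.py reading config.pre_seq_len, an attribute the loaded ChatGLMConfig does not define. That pattern usually means the modeling file is newer than the config.json or configuration_chatglm.py it is paired with; this is an inference from the traceback and is not confirmed in the thread. Besides aligning the file versions, a defensive read with a default avoids the crash. A minimal sketch with a stand-in config class whose values are illustrative:

```python
class OldStyleConfig:
    """Stand-in for a config object saved before pre_seq_len existed."""

    def __init__(self) -> None:
        self.hidden_size = 4096          # illustrative value
        self.num_attention_heads = 32    # illustrative value


config = OldStyleConfig()

# Reading the attribute directly reproduces the failure.
try:
    config.pre_seq_len
except AttributeError as err:
    print(err)

# Reading it with a default treats "missing" as "feature disabled".
pre_seq_len = getattr(config, "pre_seq_len", None)
prefix_projection = getattr(config, "prefix_projection", False)
print(pre_seq_len, prefix_projection)  # None False
```

Patching the modeling file this way only hides the mismatch; using configuration and modeling files from the same release is the cleaner fix.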