
alpaca_chinese_dataset's Issues

Is there something wrong with this dataset? Running merge.py throws an error

File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/init.py", line 293, in load
return loads(fp.read(),
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 112 column 1 (char 11779)
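
A minimal diagnostic sketch for narrowing this down, assuming the dataset JSON files sit in the current directory (the glob pattern is a guess; adjust it to the repository layout). It loads each file separately and reports which one fails to parse, rather than failing inside merge.py:

import glob
import json

for path in sorted(glob.glob("*.json")):
    try:
        with open(path, encoding="utf-8") as fp:
            json.load(fp)
    except json.JSONDecodeError as err:
        # A trailing comma, a stray BOM, or an empty file typically
        # produces "Expecting value" like the traceback above.
        print(f"{path}: line {err.lineno} column {err.colno}: {err.msg}")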

Found a few typos in the README

Thanks for putting this data together; already starred.

(screenshots of the two README passages in question omitted)

In the first spot, dropping the first "并" would read more smoothly.
In the second spot, it looks like a typo 🤔

PS: I won't open a PR for this just yet; I'll keep studying and aim to submit a quality PR later ヾ(◍°∇°◍)ノ゙

Thanks for open-sourcing this

After reading the readme: if the official dataset gets translation plus manual review, the quality will certainly be high. Add data from chatglm, 文心一言, and chatgpt on top of that and it will really take off. Thanks for the hard work of open-sourcing this; really looking forward to it. Keep it up!

Collection of dataset errors

Thanks to the Contributors for their selfless work~ I downloaded the dataset today and skimmed through it, and found two small issues! I think it would be worth having a dedicated thread where everyone can report errors from the English-to-Chinese conversion; that might help improve the quality~ :p

alpaca_data-6382-9407-(jlh)中文-已完成.json
{
    "instruction": "将给定的字符串更改为大写。",
    "input": "你好世界。",
    "output": "你好世界。"
},
{
    "instruction": "找出给定单词的押韵词。",
    "input": "有",
    "output": "垫子、老鼠、蝙蝠、床、村庄、聊天。"
},
First issue: converting a string to uppercase only applies to English; Chinese has no such conversion.
Second issue: the rhyming words presumably rhyme in the original English, but after translation into Chinese they no longer rhyme. For the input 有, the output should be words that rhyme with "ou".
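
A rough screening sketch along these lines (my own assumption, not part of the repository's tooling): flag translated entries whose instruction mentions case conversion or rhyming while the input/output contain no Latin letters, so they can be reviewed or dropped.

import json
import re

SUSPECT_KEYWORDS = ("大写", "小写", "押韵")  # uppercase, lowercase, rhyme

def suspect_entries(path):
    # Yield entries whose instruction only makes sense for English text.
    with open(path, encoding="utf-8") as fp:
        data = json.load(fp)
    for item in data:
        instruction = item.get("instruction", "")
        text = item.get("input", "") + item.get("output", "")
        if any(k in instruction for k in SUSPECT_KEYWORDS) and not re.search(r"[A-Za-z]", text):
            yield item

for item in suspect_entries("alpaca_data-6382-9407-(jlh)中文-已完成.json"):
    print(item["instruction"], "|", item["input"])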

How the input field in the data files is used

{
    "instruction": "从给定列表中选择一种颜色,并描述它如何用于创造一个舒适的房间氛围。",
    "input": "黄色",
    "output": "黄色是一种温暖和愉快的颜色,可以用来创造一个舒适的房间氛围。通过使用浅黄色的墙壁和装饰品,可以给人一种舒适和快乐的感觉。柔和的灯光会让房间感到温馨,黄色的暖色调则会增添明亮、阳光般的气氛。"
},
For data like this that has an input, wouldn't it be more appropriate to merge the input and the instruction into a single question?
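
A minimal sketch of that idea, following the common Alpaca-style prompt construction (the helper name is mine, not from the repository): when input is non-empty, append it to the instruction so the model receives one question.

def build_prompt(example):
    # Concatenate instruction and input into a single question when input exists.
    instruction = example["instruction"]
    extra = example.get("input", "").strip()
    return f"{instruction}\n{extra}" if extra else instruction

example = {
    "instruction": "从给定列表中选择一种颜色,并描述它如何用于创造一个舒适的房间氛围。",
    "input": "黄色",
}
print(build_prompt(example))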

On feeding unsupervised data into the model

Hi, I'm very interested in this direction. For instance, in a real company deployment the model needs to understand the overall concepts of a system, and langchain + vector retrieval cannot capture that broader context. Looking forward to your paper. Also, do you have any recommended papers or materials in this area? Many thanks.

Severe forgetting after fine-tuning chatglm

Hi, I fine-tuned chatglm on 8,900 single-turn chat samples and the model forgets badly. With more epochs, every answer drifts toward the fine-tuning domain; with fewer epochs, the fine-tuning data isn't learned. How should this be addressed?

Error when running data_utils

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpg1hbjeku
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpg1hbjeku/_remote_module_non_scriptable.py
INFO:lightning_fabric.utilities.seed:Global seed set to 42
Traceback (most recent call last):
File "/home/cike/zzp/alpaca/chatglm_finetuning/data_utils.py", line 272, in
tokenizer, config, , = dataHelper.load_tokenizer_and_config(tokenizer_class_name=ChatGLMTokenizer,config_class_name=ChatGLMConfig)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_helper.py", line 257, in load_tokenizer_and_config
tokenizer = load_tokenizer(tokenizer_name=tokenizer_name or model_args.tokenizer_name,
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_module.py", line 29, in load_tokenizer
tokenizer = class_name.from_pretrained(tokenizer_name, **tokenizer_kwargs)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
return cls._from_pretrained(
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 211, in init
self.sp_tokenizer = SPTokenizer(vocab_file)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 32, in init
self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 65, in _build_text_tokenizer
self._configure_tokenizer(
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 61, in _configure_tokenizer
text_tokenizer.refresh()
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/icetk/text_tokenizer.py", line 31, in refresh
self.sp.Load(model_proto=self.proto.SerializeToString())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/init.py", line 904, in Load
return self.LoadFromSerializedProto(model_proto)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/init.py", line 250, in LoadFromSerializedProto
return _sentencepiece.SentencePieceProcessor_LoadFromSerializedProto(self, serialized)
RuntimeError: Internal: [MASK] is already defined.

How can the fine-tuned model's performance be evaluated?

Besides asking "你是谁?" ("Who are you?") or other questions specific to the dataset, is there any quantitative way to assess whether the trained model has improved or degraded?
I've also finished training here and would like to know how it turned out.

Data annotation

How are datasets for dialogue models usually obtained? Is there any software for data annotation?

Hi, how can I deploy the fine-tuned model with fastapi, like the official setup? How should the api.py script be modified???

Hi, how can I deploy the fine-tuned model with fastapi,
like the official setup? How should the api.py script be modified???
The official instructions are as follows:
First install the extra dependencies with pip install fastapi uvicorn, then run api.py from the repository:
python api.py
By default it is served on local port 8000 and is called via the POST method:

curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'

ValueError: Can't find config.json at './best_ckpt/'

Hi, when fine-tuning with the code you provided, I found that at the final step of loading the model, when LoraArguments reads the /best_ckpt/config.json file, the error "ValueError: Can't find config.json at './best_ckpt/'" is still raised even though config.json exists in that directory:

lora_args = LoraArguments.from_pretrained('./best_ckpt/')
ValueError: Can't find config.json at './best_ckpt/'

I'm not sure what is causing this. Below is the content of the config.json file. Have you run into this problem, or do you know what might cause it? Looking forward to your reply.

{
  "architectures": [
    "ChatGLMModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
  },
  "bos_token_id": 150004,
  "eos_token_id": 150005,
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "initializer_weight": false,
  "inner_hidden_size": 16384,
  "layernorm_epsilon": 1e-05,
  "max_sequence_length": 2048,
  "model_type": "chatglm",
  "num_attention_heads": 32,
  "num_layers": 28,
  "pad_token_id": 20003,
  "position_encoding_2d": true,
  "pre_seq_len": null,
  "precision": 16,
  "prefix_projection": false,
  "quantization_bit": 0,
  "return_dict": false,
  "task_specific_params": {
    "learning_rate": 2e-05,
    "learning_rate_for_task": 2e-05
  },
  "torch_dtype": "float16",
  "transformers_version": "4.27.4",
  "use_cache": true,
  "vocab_size": 150528
}
