
alpaca_chinese_dataset's People

Contributors

akoukou123, aurorays, galahad-12138, hikariming, jiao03, jiaqi-roh, misaka152, rqming


alpaca_chinese_dataset's Issues

How can the performance of a fine-tuned model be evaluated?

Besides prompting "你是谁?" ("Who are you?") or other questions specific to this dataset, is there any quantitative method to assess whether the fine-tuned model has improved or degraded?
I have also finished training here and would like to know how the results turned out.
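
One hedged way to quantify this is held-out perplexity: score the same held-out set with the base checkpoint and the fine-tuned checkpoint and compare. The sketch below assumes a transformers-loadable checkpoint at ./best_ckpt and a held-out file heldout.json in this dataset's {instruction, input, output} format; both paths are placeholders, not names from this repo.

# Minimal held-out perplexity check (lower is better). Run it on both a
# domain-specific and a general held-out set to see improvement vs. forgetting.
import json, math, torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("./best_ckpt", trust_remote_code=True)
model = AutoModel.from_pretrained("./best_ckpt", trust_remote_code=True).half().cuda().eval()

losses = []
with open("heldout.json", encoding="utf-8") as f:
    for sample in json.load(f):
        text = sample["instruction"] + sample.get("input", "") + sample["output"]
        ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
        with torch.no_grad():
            loss = model(input_ids=ids, labels=ids).loss  # mean token negative log-likelihood
        losses.append(loss.item())

print("held-out perplexity:", math.exp(sum(losses) / len(losses)))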

Severe forgetting after fine-tuning chatglm

Hello, I fine-tuned chatglm on 8,900 single-turn chat samples and the forgetting is severe: with more epochs, every answer drifts toward the fine-tuning domain, but with fewer epochs the fine-tuning data is not learned. How should this be solved?
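
A common mitigation (not specific to this repo) is rehearsal: mix general-purpose instruction data back in with the domain data so every epoch also trains on general skills. A minimal sketch, assuming illustrative file names and an illustrative 3:1 general-to-domain ratio:

import json, random

with open("domain_data.json", encoding="utf-8") as f:
    domain = json.load(f)
with open("general_alpaca_zh.json", encoding="utf-8") as f:  # e.g. this repo's translated data
    general = json.load(f)

# Sample roughly three general examples per domain example, then shuffle.
mixed = domain + random.sample(general, min(len(general), 3 * len(domain)))
random.shuffle(mixed)

with open("mixed_train.json", "w", encoding="utf-8") as f:
    json.dump(mixed, f, ensure_ascii=False, indent=2)

Lowering the learning rate or the epoch count on the mixed set are the usual complementary knobs.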

On feeding unsupervised data into the model

Hello, I am very interested in this direction. For example, in real deployments inside a company, the model needs to understand the overall concept of a system, and langchain + vector retrieval cannot grasp such a large context. Looking forward to your paper; also, do you have any recommended papers or materials in this area? Many thanks.

Thanks for open-sourcing

Having read the README: if the official dataset can be built with translation plus human review, its quality will be very high. Combined with the data from chatglm, 文心一言, and chatgpt, it will really take off. Thanks for the hard work of open-sourcing this; looking forward to it. Keep it up!

A few typos found in the README

Thanks to the author for curating the data; already starred.

(two screenshots of the README typos)

In the first spot, removing the first "并" would read more smoothly.
In the second spot, it looks like a slip of the hand 🤔

PS: I will not open a PR for these; I will keep studying and aim to submit a higher-quality PR someday ヾ(◍°∇°◍)ノ゙

A collection of dataset errors

Thanks to the Contributors for their selfless work~ I downloaded the dataset today, browsed through it, and found two small problems! I think a dedicated thread where everyone can report errors from the English-to-Chinese conversion could help improve quality~ :p

alpaca_data-6382-9407-(jlh)中文-已完成.json
{
    "instruction": "将给定的字符串更改为大写。",
    "input": "你好世界。",
    "output": "你好世界。"
},
{
    "instruction": "找出给定单词的押韵词。",
    "input": "有",
    "output": "垫子、老鼠、蝙蝠、床、村庄、聊天。"
},
First problem: converting a string to uppercase only applies to English; Chinese has no such case conversion.
Second problem: the words presumably rhyme in the original English but no longer rhyme once translated into Chinese. For the input "有", the output should be words that rhyme with "ou".
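
Samples like the first one can be caught mechanically. A small helper that flags entries whose output merely echoes the input; the glob pattern is an assumption inferred from the file name above:

import glob, json

for path in glob.glob("*已完成.json"):  # pattern inferred from the file name above
    with open(path, encoding="utf-8") as f:
        for i, sample in enumerate(json.load(f)):
            if sample.get("input") and sample["input"] == sample["output"]:
                print(f"{path} #{i}: output identical to input -> {sample['instruction']}")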

Hello, how can a fine-tuned model be deployed with fastapi, just like the official one? How should the api.py script be modified???

Hello, how can a fine-tuned model be deployed with fastapi,
just like the official setup? How should the api.py script be modified???
The official instructions are as follows:
First install the extra dependencies with pip install fastapi uvicorn, then run api.py from the repository:
python api.py
By default it is served on local port 8000 and is invoked via the POST method:

curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'

A question about how the input field in the data files is used

{
    "instruction": "从给定列表中选择一种颜色,并描述它如何用于创造一个舒适的房间氛围。",
    "input": "黄色",
    "output": "黄色是一种温暖和愉快的颜色,可以用来创造一个舒适的房间氛围。通过使用浅黄色的墙壁和装饰品,可以给人一种舒适和快乐的感觉。柔和的灯光会让房间感到温馨,黄色的暖色调则会增添明亮、阳光般的气氛。"
},
For data like this that carries an input field, wouldn't it be more appropriate to merge the input with the instruction to form the question?
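
Merging the two is indeed the common Alpaca-style convention. A hedged sketch (this repo's training code may use its own prompt template; adjust the wording accordingly):

def build_prompt(sample: dict) -> str:
    # Concatenate instruction and input into a single question when input is non-empty.
    if sample.get("input"):
        return f"{sample['instruction']}\n{sample['input']}"
    return sample["instruction"]

sample = {"instruction": "从给定列表中选择一种颜色,并描述它如何用于创造一个舒适的房间氛围。", "input": "黄色"}
print(build_prompt(sample))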

Data annotation

How are datasets for dialogue models usually obtained? Is there any software for data annotation?

Is there a problem with this dataset? merge.py fails when running it

File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/init.py", line 293, in load
return loads(fp.read(),
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 112 column 1 (char 11779)
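
The "Expecting value" error means one of the input files is not valid JSON (a trailing comma or a stray character, for example). A quick sketch to locate the offending file before running merge.py; the glob pattern is an assumption, adjust it to whatever merge.py actually reads:

import glob, json

for path in glob.glob("*.json"):
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
    except json.JSONDecodeError as e:
        print(f"{path}: {e}")  # e.g. "Expecting value: line 112 column 1"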

ValueError: Can't find config.json at './best_ckpt/'

Hello, when fine-tuning with your code, I found that at the final step of loading the model, where LoraArguments reads the /best_ckpt/config.json file, it still raises "ValueError: Can't find config.json at './best_ckpt/'" even though config.json exists in that directory:

lora_args = LoraArguments.from_pretrained('./best_ckpt/')
ValueError: Can't find config.json at './best_ckpt/'

I don't know what causes this. Below is the content of config.json. Have you run into this problem, or do you know what might be causing it? Looking forward to your reply.

{
    "architectures": [
        "ChatGLMModel"
    ],
    "auto_map": {
        "AutoConfig": "configuration_chatglm.ChatGLMConfig",
        "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
        "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
    },
    "bos_token_id": 150004,
    "eos_token_id": 150005,
    "hidden_size": 4096,
    "initializer_range": 0.02,
    "initializer_weight": false,
    "inner_hidden_size": 16384,
    "layernorm_epsilon": 1e-05,
    "max_sequence_length": 2048,
    "model_type": "chatglm",
    "num_attention_heads": 32,
    "num_layers": 28,
    "pad_token_id": 20003,
    "position_encoding_2d": true,
    "pre_seq_len": null,
    "precision": 16,
    "prefix_projection": false,
    "quantization_bit": 0,
    "return_dict": false,
    "task_specific_params": {
        "learning_rate": 2e-05,
        "learning_rate_for_task": 2e-05
    },
    "torch_dtype": "float16",
    "transformers_version": "4.27.4",
    "use_cache": true,
    "vocab_size": 150528
}
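
A hedged diagnostic for this kind of error: a relative path like './best_ckpt/' resolves against the current working directory, so the same script can fail when launched from a different directory than expected. This quick check confirms what the process actually sees:

import json, os

path = os.path.abspath("./best_ckpt/config.json")
print("looking for:", path, "exists:", os.path.exists(path))
with open(path, encoding="utf-8") as f:
    json.load(f)  # raises if the file is present but not valid JSON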

Error when running data_utils

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpg1hbjeku
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpg1hbjeku/_remote_module_non_scriptable.py
INFO:lightning_fabric.utilities.seed:Global seed set to 42
Traceback (most recent call last):
  File "/home/cike/zzp/alpaca/chatglm_finetuning/data_utils.py", line 272, in <module>
    tokenizer, config, _, _ = dataHelper.load_tokenizer_and_config(tokenizer_class_name=ChatGLMTokenizer, config_class_name=ChatGLMConfig)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_helper.py", line 257, in load_tokenizer_and_config
    tokenizer = load_tokenizer(tokenizer_name=tokenizer_name or model_args.tokenizer_name,
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_module.py", line 29, in load_tokenizer
    tokenizer = class_name.from_pretrained(tokenizer_name, **tokenizer_kwargs)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
    return cls._from_pretrained(
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 211, in __init__
    self.sp_tokenizer = SPTokenizer(vocab_file)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 32, in __init__
    self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 65, in _build_text_tokenizer
    self._configure_tokenizer(
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 61, in _configure_tokenizer
    text_tokenizer.refresh()
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/icetk/text_tokenizer.py", line 31, in refresh
    self.sp.Load(model_proto=self.proto.SerializeToString())
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 904, in Load
    return self.LoadFromSerializedProto(model_proto)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 250, in LoadFromSerializedProto
    return _sentencepiece.SentencePieceProcessor_LoadFromSerializedProto(self, serialized)
RuntimeError: Internal: [MASK] is already defined.
