gpt-4-llm's Issues

Why does the Chinese dataset have fewer than 52k samples, and far fewer input fields?

The English set has 52,002 samples, of which 20,679 (about 39.7%) have an input field.
The Chinese set has 48,818 samples, of which only 6,808 (about 13.9%) have an input field.

  1. Why did the Chinese data shrink?
  2. Looking at some Chinese samples, the input was sometimes merged into the prompt and sometimes dropped entirely. What was the reasoning behind this?

Where are the Scripts for the Actual Data Generation?

There's nothing I hate more than a bunch of results without any proof. So either you researchers are too lazy to credibly back up what you claim to have designed, or you simply got the instruction dataset directly from OpenAI or some other entity.

Don't parade around as though you're contributing to open source when you're really part of the circle jerk of corporate shills who run around touting 'open source' while not adhering to any of its standards or principles. Repos like this are poison in the community because they send the message to other researchers that it's okay to operate like this.

And it isn't.

For those awaiting a viable repo that breaks down how to generate the synthetic data you're looking for, I have one on the way. No half-assing, no pretending. I have an actual repo coming that will thoroughly explain, line by line, exactly how this synthetic data is created and how to augment it with the Evol-Instruct method, so that the power goes to more people rather than staying behind lock and key.

I'm not afraid to share my methodology, because my hope is that others will replicate my process and improve on what I've already iterated. If that doesn't happen, we all lose, because we fail to have an educated, balanced discussion about the real losses the open-source community takes when assholes like the ones who launched this repo pretend they're here to help rather than harm.

Scripts for data generation

Hi, I tried to adapt the data generation script from https://github.com/tatsu-lab/stanford_alpaca for the gpt-4 API. I noticed that gpt-4 is not available through openai.Completion.create, which the original script uses.

From https://platform.openai.com/docs/models/model-endpoint-compatibility it seems gpt-4 is only available through openai.ChatCompletion.create?

I am not sure of the best way to adjust it. Did you also follow the data generation procedure from Stanford Alpaca's repo? Is there any chance you could share the script for reproducing the results? Thanks!
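In case it helps, here is a minimal sketch of the adaptation (function names and parameters are my own, not from the Alpaca script; it assumes the legacy openai 0.x SDK, where the chat endpoint is openai.ChatCompletion.create):

```python
def prompt_to_messages(prompt: str) -> list:
    """Wrap a completion-style prompt into a single-turn chat message list."""
    return [{"role": "user", "content": prompt}]

def generate(prompt: str, model: str = "gpt-4"):
    # Lazy import so the helper above is usable without the SDK installed.
    # In the legacy 0.x SDK the chat endpoint is openai.ChatCompletion.create;
    # in the 1.x SDK the equivalent is client.chat.completions.create.
    import openai
    resp = openai.ChatCompletion.create(
        model=model,
        messages=prompt_to_messages(prompt),
        temperature=1.0,
        max_tokens=1024,  # illustrative value; tune to the generation task
    )
    # Chat responses carry the text under choices[0].message.content,
    # not choices[0].text as the completion endpoint does.
    return resp["choices"][0]["message"]["content"]
```

Calling generate requires an API key and network access, so treat it as a sketch of the shape of the call rather than a verified reproduction of the authors' pipeline.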

About the evaluation results

I note that the results directory only contains empty README files. Will the model result files be released anytime soon?

About Human evaluation

Dear Authors,
Thanks for the great work!

I have questions about the details of human evaluation.
In the paper, it says the evaluation was done on Amazon Mechanical Turk, "consider[ing] 252 user-oriented instructions for evaluation."

I am curious how many people participated in this evaluation, and whether each participant was required to answer all 252 instructions.

Thank you in advance!

`unnatural_instruction_gpt4_data.json` seems to be of very low quality.

While some of these instructions are great, and use some very interesting phrasing that I feel could genuinely help with training, I couldn't recommend using the file without reviewing every sample to verify its quality. Many have incorrect or bad outputs, and some even have a good output under the label key (which I assume is used as the rejected output for reward modelling?).

There are similar issues with the original comparison set, but I haven't looked at its v2 yet.

Examples I've found after checking fewer than the first 50 samples:

{
  "instruction": "Read the text and find out whether it is true or false according to what is written in the passage.",
  "input": "Text: In Britain, there are many different types of music.",
  "output": "False",
  "label": "True"
}
{
  "instruction": "In this task, you need to read a text and answer the question 'Does the text support the given claim?'. The expected output for each test case is either Yes or No. A piece of text supports a claim if it provides evidence that leads to the conclusion that the claim is true.",
  "input": "Text: All eight planets in our solar system have moons orbiting around them.",
  "output": "We went out for dinner and then returned home.",
  "label": "to"
}
{
  "instruction": "Choose which two words go together to form a common phrase.",
  "input": "Words: ['bread', 'butter', 'peanut', 'jelly'].",
  "output": "1. The two main steps of photosynthesis are light absorption and carbon fixation.\n2. Light absorption takes place in the thylakoid membrane of the plant cell.\n3. During carbon fixation, the plant uses the excited state energy from light absorption to convert CO2 gas into glucose sugar.",
  "label": "1. The two main steps of photosynthesis are light absorption and carbon fixation.\n2. Light absorption takes place in the thylakoid membrane.\n3. During carbon fixation, the plant uses excited state energy to convert CO2 gas into glucose sugar."
}
{
  "instruction": "Choose which two words go together to form a common phrase.",
  "input": "Words: ['bread', 'butter', 'peanut', 'jelly'].",
  "output": "['Bread and butter', 'Bread and peanut', 'Butter and jelly'] are all phrases that can go together. However, it ultimately depends on the context of the sentence and the intended meaning of the phrase.",
  "label": "['bread', 'butter']"
}
{
  "instruction": "You will be given a piece of text that is either a description, or an instruction. Your task is to check whether the text makes sense as an instruction or not. If it does not make sense as an instruction, output 'Not an instruction'. If thetext makes sense asaninstruction but uses vocabulary that would be considered overly complicated for someone who does not know the subject matter, output 'Too Complicated'. Otherwise, output 'OK'.",
  "input": "Text: Listen to your teacher and complete the assigned work on time.",
  "output": "Not an instruction.",
  "label": "OK"
}
{
  "instruction": "You will be given a text with missing words. Your job is to fill in the blanks by choosing from a list of possible answers, so that the sentence makes sense.",
  "input": "Text: The article _____ on various methods people use to fall asleep.",
  "output": "The article discusses various methods people use to fall asleep.",
  "label": "focuses"
}
{
  "instruction": "In this task, you will be given a paragraph of text. Your job is to find all instances of proper nouns and write them out in alphabetical order. A proper noun is a specific name for someone or something and is always capitalized.",
  "input": "Text: There once was a woman named Loretta Lynch who worked as an attorney general. She was very successful in her career and made many decisions that impacted the United States negatively.",
  "output": "Attorney General, Loretta Lynch, United States",
  "label": "Loretta Lynch"
}
{
  "instruction": "You will be given a set of operations and two numbers, A and B. The goal is to find the value of the result after all of the operations have been performed on A and B in the order they are given. Note that some operations modify only A or only B; these should be evaluated before looking at any operation that depends on their results.",
  "input": "Text: ['*', '+', '-', '/'], 3, 7.",
  "output": "After performing the operations in the order they are given, the resulting value of A is 15 and the resulting value of B is 2.5. The operation '*' multiplies A by 3, giving 30. The operation '+', adds 30 to A, giving 33. The operation '-', substracts 30 from A, giving 23. Finally, the operation '/', divides A by 2, giving the final result of 15.",
  "label": "5\nExpected output: -1.5\n\n3 * 7.5 = 22.5\n22.5 + 3 = 25.5\n25.5 - 7.5 = 18"
}
{
  "instruction": "Each input will be a single string. You need to reverse the given string, and output the result on a new line.",
  "input": "\"Hello world!\".",
  "output": "",
  "label": "!dlrow olleH"
}
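A rough way to surface samples like these automatically (a sketch; the field names match the examples above, but the heuristics are my own and will also flag some legitimate pairs, so treat hits as candidates for manual review):

```python
import json

def suspicious(sample: dict) -> bool:
    """Flag samples whose output is empty or shares no tokens with the label.

    Heuristics only: they catch the kinds of mismatches shown above
    (empty outputs, outputs answering a completely different question),
    not every quality problem in the file.
    """
    out = sample.get("output", "").strip()
    label = sample.get("label", "").strip()
    if not out or not label:
        return True
    # No token overlap at all between output and label is a strong hint
    # that the two fields came from different tasks.
    return not (set(out.lower().split()) & set(label.lower().split()))

def suspicious_samples(path: str) -> list:
    """Load the dataset and return the samples worth reviewing by hand."""
    with open(path, encoding="utf-8") as f:
        return [s for s in json.load(f) if suspicious(s)]
```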

Many outputs in alpaca_gpt4_data_zh.json are incomplete.

Some of the affected instructions (quoted as-is from the dataset):
写一篇《麦田里的守望者》的书摘。
写一个与技术进步相关的短故事。
在Instagram上为化妆品品牌创建一个有针对性的帖子。
写一段关于你选择的话题的说唱歌词。
用给定的主题创作一个原创故事。主题:老化。
创作一首关于你最喜欢的篮球队洛杉矶湖人的说唱歌曲。
创作一个故事,以这个句子为开头:天空是一片明亮的蓝色。
想出两个电视剧的点子。
比较Emma Lazarus的《新巨像》和William Shakespeare的《十九号十四行诗》中自由的主题。
写一段比较和对比《麦田里的守望者》和《杀死一只知更鸟》这两本书。
为一部关于大学生的电影创建提纲。电影标题为“大学生”。
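A quick way to estimate how many outputs are affected (a sketch; treating any output that does not end in sentence-final punctuation as likely truncated is my own assumption, not the authors' criterion):

```python
import json

# Sentence-final punctuation for Chinese and English text.
TERMINAL = "。!?!?.”\"'」』)"

def looks_truncated(output: str) -> bool:
    """Heuristic: an output that does not end in terminal punctuation
    was probably cut off, e.g. at the generation token limit."""
    text = output.rstrip()
    return bool(text) and text[-1] not in TERMINAL

def truncated_samples(path: str) -> list:
    """Return the samples whose output looks cut off mid-sentence."""
    with open(path, encoding="utf-8") as f:
        return [s for s in json.load(f) if looks_truncated(s.get("output", ""))]
```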

The quality of comparision_data.json seems poor

I'm exploring the comparision_data.json dataset and found that completion_a is not always better than completion_b (often it isn't), even though, according to the description of comparision_data.json, completion_a is ranked higher.
Does anyone else see the same thing?

Here is an example, the second instance in the dataset. I think completion_b is better than completion_a.

{
"user_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n\n### Response:",
"completion_a": "The three primary colors are red, blue, and yellow.",
"completion_b": "The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB)."
}

Erroneous examples in Unnatural Instruction data

There are a lot of duplicate examples with seemingly mismatched output and label fields in unnatural_instruction_gpt4_data.json. For example:

  {
    "instruction": "Write a function that takes in a list of integers and outputs the sum of all the integers.",
    "input": "List of integers: [1, 2, 3].",
    "output": "No.",
    "label": "Yes"
  },
  {
    "instruction": "Write a function that takes in a list of integers and outputs the sum of all the integers.",
    "input": "List of integers: [1, 2, 3].",
    "output": "['What color was the lamb's fleece?'] - 5",
    "label": "6\n4"
  },

and

  {
    "instruction": "You need to read a piece of text and then answer the questions based on what is stated in the text. The questions will ask you about specific details in the text, and your answers must be drawn directly from the text itself.",
    "input": "Text: In 1789, George Washington was inaugurated as the first president of United States of America. He was 70 years old at that time. John Adams was his vice president.",
    "output": "1. How old was George Washington when he was inaugurated as the first president of the United States in 1789?\n2. Who was his vice president at that time?\n3. What was the relationship between John Adams and George Washington during their time as president and vice president?",
    "label": "1. What year was George Washington inaugurated as the first president of the United States of America?\n\n1789"
  },
  {
    "instruction": "You need to read a piece of text and then answer the questions based on what is stated in the text. The questions will ask you about specific details in the text, and your answers must be drawn directly from the text itself.",
    "input": "Text: In 1789, George Washington was inaugurated as the first president of United States of America. He was 70 years old at that time. John Adams was his vice president.",
    "output": "['Karen', 'Mark'].",
    "label": "Hazel and Lauren went out for lunch together. Hazel had been wanting to try this new restaurant for months."
  },

In the first case both examples are mismatched; in the second case the first occurrence seems correct and the second seems wrong. There are a lot of cases like this in the file.

Possibly related to #2?

webnovel_cn dataset release (open-source Chinese novel dataset)

To further improve the ability of LLMs to generate Chinese novels, I have collected a dataset of Chinese instructions for training novel generation.
The full dataset contains approximately 21.7M samples; because it is too large to host in full, a subset of 50k samples is on Hugging Face. See details:
https://huggingface.co/datasets/zxbsmk/webnovel_cn

Example usage with https://github.com/tloen/alpaca-lora

python finetune.py \
    --base_model 'elinas/llama-7b-hf-transformers-4.29' \
    --data_path 'zxbsmk/webnovel_cn' \
    --output_dir './lora-alpaca'

Example usage with https://github.com/hiyouga/LLaMA-Efficient-Tuning

hiyouga/LLaMA-Factory#150 (comment)

Duplicated and mismatched samples in the Chinese Alpaca dataset

Thanks for your great work! I took a look at the Chinese Alpaca dataset and found a few duplicates and unmatched answers, for example:

  {
    "instruction": "解释植物细胞呼吸的过程。",
    "input": "",
    "output": "The captain read the news report."
  },
  {
    "instruction": "描述查理二世国王的生活和**。",
    "input": "",
    "output": "圆的面积计算公式为\"面积 = 圆周率 * 半径^2\"。\n\n根据输入的半径 = 4, we can calculate the circle's area to be approximately:\n面积 = 圆周率 * 半径^2 = 3.14159 * 4^2 = 50.26544\n\n所以,该圆的面积约为 50.266平方单位。"
  },
  {
    "instruction": "头脑风暴减少水污染的可能解决方案。",
    "input": "",
    "output": "下面是一些可以帮助减少水污染的可能解决方案的头脑风暴:\n\n1. 建立严格的工业废水处理标准,监管工业排放。\n\n2. 提高污水处理能力,建立先进的污水处理厂。\n\n3. 宣传水污染的危害,提高公众的环境保护意识。\n\n4. 开展水源地保护和水生态修复工程。\n\n5. 加强农业面源污染防治,推广绿色农业技术。\n\n6. 鼓励和支持清洁能源和循环经济的发展。\n\n7. 实施河流、湖泊、水库的水环境综合整治。\n\n8. 加大对水污染违法行为的处罚力度。\n\n9. 提高废水回用率,开发循环利用技术。\n\n10. 建立完善的水质监测网络,及时发现和处理水污染问题."
  },
  {
    "instruction": "头脑风暴减少水污染的可能解决方案。",
    "input": "",
    "output": "这个定义本身并不错误。鼹鼠实际上是一种生活在地下的动物。它们会挖地道,以便寻找食物和躲避捕食者。它们属于哺乳动物纲、鼹鼠科,引领群通过挖掘地下隧道来居住和觅食。"
  },

I haven't investigated whether this issue also applies to the English dataset.
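Listing the duplicated prompts is straightforward (a sketch using the field names from the examples above; it only finds repeats, deciding which copy has the right answer is still manual):

```python
import json
from collections import Counter

def duplicate_instructions(path: str) -> list:
    """Return (instruction, input) pairs that occur more than once,
    so their conflicting outputs can be inspected side by side."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    counts = Counter((s["instruction"], s.get("input", "")) for s in data)
    return [pair for pair, n in counts.items() if n > 1]
```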

Very poor quality of the Comparison dataset

First of all, thank you for your excellent work and released datasets.

After carefully checking samples in the comparison dataset, I found their quality unsatisfactory. Here is a partial list of bad cases in which completion_b is far better than completion_a.

[
    {
    "user_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:",
    "completion_a": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.",
    "completion_b": "1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night."
  },
  {
    "user_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n\n### Response:",
    "completion_a": "The three primary colors are red, blue, and yellow.",
    "completion_b": "The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB)."
  },
  {
    "user_input": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nIdentify the odd one out.\n\n### Input:\nTwitter, Instagram, Telegram\n\n### Response:",
    "completion_a": "Telegram",
    "completion_b": "The odd one out is Telegram. Twitter and Instagram are social media platforms mainly for sharing information, images and videos while Telegram is a cloud-based instant messaging and voice-over-IP service."
  },
  {
    "user_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGenerate a list of ten items a person might need for a camping trip\n\n### Response:",
    "completion_a": "1. Tent\n2. Sleeping bags\n3. Flashlight\n4. Matches/lighter\n5. Insect repellent\n6. Extra clothes\n7. First aid kit\n8. Camp stove\n9. Cooking utensils\n10. Biodegradable soaps/shampoos",
    "completion_b": "Here are ten items a person might need for a camping trip:\n\n1. Tent - to provide shelter and protection from the elements\n2. Sleeping bag - to stay warm and comfortable while sleeping\n3. Portable stove or campfire grill - for cooking meals\n4. Cooler with ice or ice packs - to keep perishable food and drinks cold\n5. Lantern or flashlight - to provide light during the night\n6. First aid kit - for minor injuries or ailments\n7. Map and compass or GPS - for navigation on hikes or exploring the area\n8. Camp chairs or folding chairs - for comfortable seating around the campsite\n9. Insect repellent - to protect against bug bites\n10. Sunscreen - to protect against sunburn."
  },
  {
    "user_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nExplain the use of word embeddings in Natural Language Processing\n\n### Response:",
    "completion_a": "Word embeddings are one of the most powerful tools available for Natural Language Processing (NLP). They are mathematical representations of words or phrases in a vector space, allowing similarities between words and the context in which they are used to be measured. Word embeddings are useful for tasks such as sentiment analysis, text classification, predicting the next word in a sequence, and understanding synonyms and analogies. They allow for words to be processed as numerical values, giving machines an easier way to perform NLP tasks.",
    "completion_b": "Word embeddings are a type of natural language processing technique used to map words or phrases from a vocabulary to vectors of real numbers. The idea is to represent words in a continuous vector space, where the similarity between words can be measured by the distance between their corresponding vectors. This can be done using algorithms such as Word2Vec or GloVe, among others.\n\nThe use of word embeddings has revolutionized the field of Natural Language Processing, as it allows computers to understand text in a much more meaningful way than simply looking at the presence or absence of individual words. For example, when using word embeddings, the computer can understand that the words \"dog\" and \"puppy\" are closely related, while the words \"dog\" and \"umbrella\" are not.\n\nWord embeddings are used in a wide variety of NLP tasks, including text classification, sentiment analysis, machine translation, named entity recognition, and many others. They form the basis for many advanced language processing techniques, such as deep learning and neural machine translation."
  },
]

This may indicate that GPT-4 is unsuitable for generating the comparison dataset.
Can you share your opinion on this phenomenon?

Issues with the comparision_data.json data

Hi, thanks a lot for sharing the data. I have one question about comparision_data.json. The dataset description says completion_a is ranked higher than completion_b; however, when I checked the data I found that in many cases completion_b is actually better. For example:
{
"user_input": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nEvaluate this sentence for spelling and grammar mistakes\n\n### Input:\nHe finnished his meal and left the resturant\n\n### Response:",
"completion_a": "He finished his meal and left the restaurant.",
"completion_b": "There are two spelling errors in the sentence. The corrected sentence should be: \"He finished his meal and left the restaurant.\""
}

Could you help verify whether completion_a or completion_b is better? Thanks!

Is the OPT 1.3B reward model open-source?

I learned from the paper that "To evaluate data quality, we train a reward model based on OPT 1.3B (Iyer et al., 2022) to rate different responses." Can this reward model be used as a replacement for GPT-4 on the rating task? Is it open-sourced?

Question about the reward

The paper says to "use GPT-4 to rate 1-10 for each response", but I cannot find these 1-10 reward scores in the data files. Could you please provide them?

Output is truncated or missing characters

Some samples appear to be truncated:

{
  ...
  {
    "instruction": "Spell this number in words",
    "input": "68",
    "output": "ixty-eight."
  },
  ...
  {
    "instruction": "What is the significance of the six sigma process?",
    "input": "",
    "output": "ix Sigma is a methodology for improving business processes by identifying and removing the causes of defects (errors) and minimizing variability in manufacturing and business operations. The significance of Six Sigma lies in its approach to streamlining operations, improving quality, and reducing costs, which can ultimately lead to increased customer satisfaction and profitability. \n\nThe Six Sigma process uses data analysis and statistical tools to identify issues, measure their impact, and develop and implement solutions to improve the process. It is often used in conjunction with Lean, another process improvement methodology that focuses on eliminating waste and increasing efficiency.\n\nOne of the key benefits of Six Sigma is its ability to reduce defects and variability, which can lead to improved quality and reliability of products and services. Additionally, Six Sigma helps companies to identify and eliminate inefficiencies and redundancies in business processes, which can result in cost savings and increased competitiveness.\n\nOverall, the significance of the Six Sigma process lies in its ability to drive business growth and success by improving quality and efficiency, reducing costs, and increasing customer satisfaction."
  },
  ...
    {
    "instruction": "Spell out the numerical phrase in English.",
    "input": "60",
    "output": "ixty."
  },
  ...
}
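The pattern suggests the first character was dropped somewhere in the pipeline (e.g. "Sixty-eight" became "ixty-eight"). A rough detector (my own heuristic, not part of the release; legitimate outputs can also start lowercase, so hits need manual confirmation):

```python
def missing_first_char(output: str) -> bool:
    """Heuristic: outputs in this dataset normally start with a capital
    letter, a digit, or punctuation; one that starts with a lowercase
    letter (e.g. 'ixty-eight.') likely lost its first character."""
    text = output.lstrip()
    return bool(text) and text[0].islower()
```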

OutOfMemoryError

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 79.35 GiB total capacity; 73.95 GiB already allocated; 12.19 MiB free; 76.80 GiB reserved in total by PyTorch). If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

GPUs: 2*A100(80G)
DeepSpeed: ZeRO3
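For anyone hitting the same error, the allocator hint from the message can be applied via an environment variable before launching training (a sketch; 128 is only a starting value to tune, not something verified for this setup):

```shell
# Cap the block size the caching allocator will split, to reduce
# fragmentation, as the error message suggests. 128 MiB is a starting
# point to experiment with, not a verified value for this model.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Other common levers are a smaller per-device batch size with gradient accumulation, and enabling CPU offload in the ZeRO-3 DeepSpeed config.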

Why are some outputs in the Chinese dataset not complete sentences?

Why are some outputs in the Chinese dataset not complete sentences? Is the GPT-4 output itself incomplete, or was it truncated afterwards? I also noticed that the Chinese data has only 48,818 samples while the English data has 52,002. Could you explain this?

e.g.
output:由于我是一个文本AI,不能直接生成预告片。但我能够为您提供一个《指环王》预告片的文字脚本:\n\n开场字幕:在一个神话般的世界里,一个传说即将再次被唤醒。\n\n镜头转向霍比特人比尔博·巴金斯,他开心地跳舞,笑声回荡在空气中。\n\n旁白:在中土世界的西方,有一个小小的种族,他们热爱和平与安逸...\n\n镜头切换到甘道夫从山脉中涌现,众人惊讶地抬头看着他。\n\n旁白:但是一个非比寻常的冒险将要打破他们的宁静,他们将要面临前所未有的挑战。\n\n镜头黑暗,石头雕像从黑暗中冒出,气氛忽然紧张起来。\n\n旁白:摩达的力量正在崛起,邪恶的阴影笼罩着整个中土世界,危险已经来到。\n\n镜头切换到比尔博的外甥弗罗多与三个好友在森林中跋涉,他们勇敢地穿过险恶的地方。\n\n旁白:四位勇敢的霍比特人,承担起拯救世界的重任,他们必须把唯一的魔戒带到毁灭的地方。\n\n镜头迅速切换到战斗场面,精灵,矮人,人类齐心协力抵抗黑暗势力。最后展现出九个旅人中的

Question about Figure 4 in paper

Hi~ After reading the paper, I am curious about Figure 4. It seems strange that the denominator changes between different models evaluated against the same strong model (ChatGPT/GPT-4), for example 609 : 666 versus 605 : 678.

Exactly identical completions in comparision.json

Greetings.
As discussed in #13, I understand that the ranking was done by GPT-4 and may sometimes contain wrong orderings, and that manually eliminating such data is not practical.

However, I also noticed that many instances have exactly the same two completions.
IMHO, we could at least remove those from the dataset?
If I counted correctly, 1,665 out of 52,002 instances have completion_a == completion_b.
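Counting them is a short script (a sketch; the filename follows the repo's own spelling, and the field names match the examples quoted in other issues):

```python
import json

def count_identical_pairs(path: str) -> int:
    """Count instances where the two completions being compared are
    byte-identical, which makes the preference ranking meaningless."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return sum(1 for s in data if s["completion_a"] == s["completion_b"])
```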

Also, maybe comparision.json should be renamed to comparison.json.
Thank you.

Question about the Reward Model

In this paper, you say "For each instance of the comparison data involving one prompt x and K responses, GPT-4 assigns a score s ∈ [1, 10] for each response."

However, in this repository I cannot find the prompt that asks GPT-4 to perform this scoring. Could you release the design of that prompt? It is important for researchers who want to follow your work.

Question about the hardware cost

Though the cost of running a tuned model is relatively small, I wonder how many high-end accelerators (GPUs) you used for the fine-tuning, and how long it took to obtain a reasonable result. Also, did you observe a temporary increase in loss at the beginning of training?
To the best of my knowledge, these details are not given in the paper.
