
sotopia-pi's Issues

[FEAT]: Use 3rd party libraries as dependencies?

Description

A low-level question: to what extent do we need to work directly on the FastChat library?

As mentioned in Llama-recipes, fine-tuning only requires a line of code if the input file is in the right format. So is inheriting the whole FastChat repo overkill? As it stands, the CI pipeline is broken and the code structure is not very clear.

Additional Information

No response

[FEAT]: Explore the multi-turn capability of INST format of Together AI

Description

According to the Together AI docs, the INST format should support multi-turn dialogue directly, so there is no need to split the fine-tuning data points into (previous dialogue, current response) pairs. The goal of this feature is to experiment with 1) the original (previous dialogue, current response) pairs as fine-tuning dataset entries and 2) the entire dialogue as a single entry.
(screenshot: WechatIMG2283)
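The two dataset layouts to compare can be sketched as follows. This is a minimal illustration, not Together AI's actual schema: the `{"text": ...}` record shape and the toy dialogue are assumptions.

```python
# A toy dialogue as (speaker, utterance) pairs; real data would come from
# Sotopia episode logs.
turns = [
    ("agent1", "Hi, can we talk about the schedule?"),
    ("agent2", "Sure, what works for you?"),
    ("agent1", "How about Friday afternoon?"),
]

def split_entries(turns):
    """Variant 1: one training entry per response, conditioned on the
    dialogue so far -- the (previous dialogue, current response) split."""
    entries = []
    for i in range(1, len(turns)):
        history = " ".join(f"{s}: {u}" for s, u in turns[:i])
        _, utterance = turns[i]
        entries.append({"text": f"<s>[INST] {history} [/INST] {utterance}</s>"})
    return entries

def whole_dialogue_entry(turns):
    """Variant 2: the entire dialogue as a single multi-turn INST entry."""
    blocks = []
    for i in range(0, len(turns) - 1, 2):
        blocks.append(f"<s>[INST] {turns[i][1]} [/INST] {turns[i + 1][1]}</s>")
    return {"text": "".join(blocks)}
```

Variant 1 multiplies the number of entries by the number of responses per dialogue, while Variant 2 keeps one entry per episode; the experiment is whether Together AI's INST handling makes the two equivalent in practice.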

Additional Information

No response

[FEAT]: add data processing file

Description

Add a data processing file to convert Sotopia data into a Together AI-readable format.
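A minimal sketch of such a converter, assuming Together AI's JSONL fine-tuning format (one JSON object with a "text" key per line) and a hypothetical episode schema with "turns"/"utterance" fields:

```python
import json

def episode_to_record(episode):
    """Map one episode to a (dialogue history, final response) training record.
    The "turns" / "utterance" field names are assumptions about the log schema."""
    history = " ".join(t["utterance"] for t in episode["turns"][:-1])
    response = episode["turns"][-1]["utterance"]
    return {"text": f"<s>[INST] {history} [/INST] {response}</s>"}

def write_jsonl(episodes, out_path):
    """Write one JSON object per line, as Together AI fine-tuning expects."""
    with open(out_path, "w") as f:
        for ep in episodes:
            f.write(json.dumps(episode_to_record(ep)) + "\n")
```

The real script would additionally iterate over every turn of each episode (not just the last) and handle agent names and scenario context.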

Additional Information

No response

[BUG]: (Update) Together AI fine-tuning always stops at step 10 ❌ Need further explanations on how steps work in Together AI ✅

Description of the bug

All fine-tuning processes still stop at step 10. According to @lwaekfjlk, there is no early-stopping code in the source. The curves in the image below stop at different loss values, so further experiments are needed.
(screenshot: WechatIMG2322)

Steps To Reproduce

No response

Additional Information

Here is a comparison between two different fine-tuning settings:

  1. epochs = 2, batch = 4 -> epoch 1 stopped at step 5, epoch 2 stopped at step 10
  2. epochs = 4, batch = 32 -> epoch 1 stopped at step 3, epoch 2 at step 6, epoch 3 at step 9, epoch 4 at step 10
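One way to sanity-check this is to compute how many optimizer steps the runs should take if one step consumes one batch (an assumption; Together AI's step semantics are exactly what this issue is trying to pin down):

```python
import math

def expected_total_steps(n_examples, batch_size, epochs):
    """Total optimizer steps under the assumption that one step = one batch."""
    return epochs * math.ceil(n_examples / batch_size)
```

For the 380-example train set, these settings would predict 190 and 48 optimizer steps respectively, far more than the 10 observed. Together with the epoch boundaries above (5/10 for two epochs, 3/6/9/10 for four), this is consistent with the x-axis "step" being one of roughly ten evenly spaced logging checkpoints rather than an optimizer step; that is a hypothesis to verify against Together AI's documentation.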

[FEAT]: Add data point level filtering based on quality of all selected dialogues in train and test sets (whole dataset)

Description

For each set of scenarios split into easy and hard, we previously discussed whether to use only dialogues that reach high performance / high quality. If we add this filtering, we should clarify how "high quality" is defined. The simplest approach we currently adopt is to use the overall reward score, but this might be too coarse.
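Under the simplest definition, the filter is a one-liner; the "rewards"/"overall" field names and the 3.0 cutoff below are placeholders, not the actual schema or threshold:

```python
def filter_high_quality(episodes, min_overall_reward=3.0):
    """Keep only dialogues whose overall reward clears a threshold.
    Field names and the default cutoff are illustrative assumptions."""
    return [
        ep for ep in episodes
        if ep["rewards"]["overall"] >= min_overall_reward
    ]
```

A finer-grained alternative would threshold individual dimensions (e.g., goal completion) instead of the aggregate, which is the "too high level" concern above.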

Additional Information

No response

[FEAT]: Plans for evaluation and fine-tuning data format

Description

Currently we use plain dialogue (passing the other agent’s direct response instead of passing the prompt from environment agent) to fine-tune Llama 2 on Together AI. A qualitative analysis indicates that the fine-tuned llama-2-13b-chat performs well on selected difficult scenarios.

In order to call the fine-tuned model’s API and evaluate its performance based on Sotopia’s evaluation metrics, we may have two solutions:

Solution 1.

Keep the same fine-tuning data and revise the prompt template for the fine-tuned model during inference. A potential problem with this solution is an arguably unfair comparison between the fine-tuned model and the original Llama 2 model, since their inference-time prompts would differ.

Solution 2.

Modify the fine-tuning data to be consistent with the prompts for all other models. Fine-tune Llama 2 again with QA format instead of multi-turn INST format.

Before implementing Solution 2, we still have two key points to discuss and resolve:

Issue 1. Currently we only have episode logs in the database. We could either 1) re-run the GPT-4/GPT-4 dialogues and collect the environment prompts in the process, or 2) reverse the episode logs in the database into environment prompts following a specific template.
Issue 2. We currently have no data on the relative performance of multi-turn INST fine-tuning versus QA fine-tuning on Together AI. Once Issue 1 is resolved, we will experiment on the difference in performance between the two (#24).
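Option 2 of Issue 1 could look like the sketch below. The template wording and the episode field names are hypothetical stand-ins for whatever Sotopia's environment agent actually sends; only the overall reversal logic is the point.

```python
# Hypothetical environment-prompt template (not Sotopia's actual wording).
PROMPT_TEMPLATE = (
    "Imagine you are {agent}; your task is to act/speak as {agent} would.\n"
    "Scenario: {scenario}\n"
    "Conversation so far:\n{history}\n"
    "How would you respond?"
)

def log_to_prompt(episode, agent, turn_index):
    """Rebuild the environment prompt the agent would have seen before
    producing turn `turn_index`, from a stored episode log."""
    history = "\n".join(
        f'{t["speaker"]}: {t["utterance"]}' for t in episode["turns"][:turn_index]
    )
    return PROMPT_TEMPLATE.format(
        agent=agent, scenario=episode["scenario"], history=history
    )
```

The risk with this option is template drift: if the reconstruction does not match the environment agent's real prompt exactly, fine-tuning and inference prompts diverge again, which is the problem Solution 2 is meant to avoid.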

Discussion

We expect the evaluation result of the fine-tuned model trained on the Babel servers to align perfectly with the fine-tuned model on Together AI.

On a side note, we are curious about the motivation for using environment prompts from the third-party GPT-4 (e.g., why is prompting with previous dialogue history important? What goals can the multi-turn prompting format not achieve?).

Additional Information

No response

[FEAT]: Build Mistral finetuning and deploying pipeline

Description

First, we need to confirm which platform to use for fine-tuning.
Second, we need to run Llama-2-13b on this platform.
Third, we need to run Mistral 7B on this platform.

The output of this issue is a model checkpoint.

Additional Information

No response

[BUG]: Together AI fine-tuning does not work as expected

Description of the bug

Based on the wandb visualization, training always stops at the 10th step.

We have tried:

  1. changing the model name
  2. changing the suffix
  3. increasing the training data to more than 1000 examples

None of these worked.

Steps To Reproduce

No response

Additional Information

No response

[BUG]: Unstable model output when fine-tuned on INST format

Description of the bug

In the following example, the model outputs "[INST] What is the best way to eat a snowman?" after answering the question. This sentence does not appear in the fine-tuning dataset and looks like a hallucination.
(screenshot: WechatIMG2339)

Steps To Reproduce

Use either <s>[INST] question [/INST] answer</s> or <s>[INST] <<SYS>> <</SYS>> question [/INST] answer</s> on single-turn dialogues.
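Written out as format strings, the two single-turn templates from the reproduction are (question/answer/system values are placeholders):

```python
def inst_format(question, answer):
    """Plain single-turn Llama-2 INST template."""
    return f"<s>[INST] {question} [/INST] {answer}</s>"

def inst_sys_format(question, answer, system=""):
    """INST template with an (empty, as in the reproduction) system block."""
    return f"<s>[INST] <<SYS>> {system} <</SYS>> {question} [/INST] {answer}</s>"
```

If the closing `</s>` is missing or mis-tokenized in the training data, the model never learns a stop boundary and will keep generating, which would explain the stray "[INST] ..." continuation; that is a hypothesis worth checking against the prepared JSONL.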

Additional Information

No response

[FEAT]: Difficulty-based Train-test split of Sotopia data

Description

The current split of fine-tuning data is based on task difficulty/variance:

All GPT-4/GPT-4 dialogues (with tag "gpt-4_gpt-4_v0.0.1_clean"): 90 * 5 = 450

Train set: 76 (easy scenarios) * 5 = 380
Test set: 14 (hard scenarios) * 5 = 70

Hard scenarios are chosen via EnvironmentList.find(EnvironmentList.name == "hard_env_set").all()
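The split itself reduces to partitioning episodes by whether their scenario is in the hard set (obtained from the EnvironmentList query above). A sketch, where the "environment" key is an assumption about the episode schema:

```python
def split_by_difficulty(episodes, hard_env_ids):
    """Partition episodes: easy scenarios -> train, hard scenarios -> test."""
    train = [ep for ep in episodes if ep["environment"] not in hard_env_ids]
    test = [ep for ep in episodes if ep["environment"] in hard_env_ids]
    return train, test
```

With 90 scenarios at 5 episodes each, a 76/14 scenario partition yields the 380/70 train/test counts above.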

An agreement on this archived issue should resolve issues #20, #5, and #6.

Additional Information

No response
