alpacadatacleaned's People

Contributors

claysauruswrecks, dustinreed-info, gururise, hzj5790, josemlopez, kenakafrosty, matusstas, minh-v

alpacadatacleaned's Issues

ModuleNotFoundError: No module named 'utils'

I'm trying to run

python -m generate_instruction generate_instruction_following_data \
  --output_dir ./ \
  --num_instructions_to_generate 10 \
  --model_name="text-davinci-003"

and then got the error "ModuleNotFoundError: No module named 'utils'".

Separate instructions by functionality

There are some specific instructions, such as generating images or models and searching the internet, that the dataset simply trains the model to refuse.

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "<nooutput>"
    }

To make the dataset unbiased, we could split these instructions into separate files. For example:

no_image_generation.json

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "This type of instruction cannot be fulfilled by a GPT model."
    }

So the end user could use a custom JSON like:
image_generation.json

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "<image-api>company logo"
    }

How to format dataset fields in model prompt?

Hi, I'm looking to fine-tune an LLM using this dataset and was wondering whether there's any advice on how to format the prompt given the "instruction" vs. "input" fields.

For example consider these entries:

  {
    "output":"The author has used personification in the sentence \"The cold breeze chills my bones.\" Personification is a figure of speech in which a non-human subject is given human characteristics. In this case, the non-human subject is the cold breeze, which is given the human characteristic of being able to chill someone's bones.",
    "input":"The cold breeze chills my bones.",
    "instruction":"Identify a stylistic device used by the author in the following sentence."
  }

 {
    "output":"Two players from the Kansas City Chiefs team are Patrick Mahomes and Tyreek Hill.",
    "input":"",
    "instruction":"Name two players from the Chiefs team?"
  }

I imagine two approaches:

  1. Use the "instruction" as the system prompt, and the "input" as the first user chat message (which would often be empty though)...
  2. Concatenate the instruction + input fields into a single (first) user chat message.

I think I'll use approach 2 but would appreciate any insights or references on this topic :)
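
For reference, here is a sketch of approach 2 that folds both fields into a single prompt. The template roughly follows the one used by the original Stanford Alpaca training code, so treat it as one possible convention rather than something this dataset requires:

    # Approach 2: concatenate instruction + input into a single user prompt,
    # with a separate variant for entries whose input is empty.
    def build_prompt(entry: dict) -> str:
        if entry["input"].strip():
            return (
                "Below is an instruction that describes a task, paired with an input "
                "that provides further context. Write a response that appropriately "
                "completes the request.\n\n"
                f"### Instruction:\n{entry['instruction']}\n\n"
                f"### Input:\n{entry['input']}\n\n"
                "### Response:\n"
            )
        return (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{entry['instruction']}\n\n"
            "### Response:\n"
        )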

Evaluation Metric

Planning on adding an evaluation metric that can be used to benchmark trained alpaca models.

Going to focus on these two datasets for evaluation:

  1. SQuAD dataset - F1 score
  2. WikiText dataset - Perplexity

I'm not so sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though Alpaca is not trained on a SQuAD-style dataset, I think that with the right prompt it can be done.
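
For the SQuAD side, a minimal sketch of computing the F1 score with the Hugging Face `evaluate` library (the id and answer are placeholders; the real benchmark script may look different):

    # Score generated answers against SQuAD references with the `evaluate` library
    # (pip install evaluate). The "id" only has to match between the two lists.
    import evaluate

    squad_metric = evaluate.load("squad")

    predictions = [{"id": "q1", "prediction_text": "4.5"}]
    references = [{"id": "q1",
                   "answers": {"text": ["4.5"], "answer_start": [0]}}]

    results = squad_metric.compute(predictions=predictions, references=references)
    print(results)  # e.g. {'exact_match': 100.0, 'f1': 100.0}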

Identify code snippet in "input" fields

I want to translate the training data into another language with Google translate, but code snippets should not be translated, so I have to replace code snippets with placeholders before translating.

All code snippets in "output" fields are quoted with triple backticks, so they're quite easy to identify. But code snippets in "input" fields aren't quoted.

Any suggestions on identifying those in "input" fields?
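
One possible direction, sketched under the assumption that a rough token-based heuristic is good enough as a first pass (the pattern below is an assumption and will need tuning):

    # Flag code-like "input" fields before translation so they can be replaced
    # with placeholders; the token list is a heuristic, not a rule.
    import json
    import re

    CODE_PATTERN = re.compile(
        r"(def |class |return |print\(|#include|</?\w+>|\{.*\}|;\s*$|==|->)",
        re.MULTILINE,
    )

    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    code_like = [e for e in data if e["input"] and CODE_PATTERN.search(e["input"])]
    print(f"{len(code_like)} inputs look like code")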

Correct or potentially to be cleaned?

During the short time that I have been helping here, I have noticed that we divide the data into 3 categories:
A) Cleaned
B) Correct
C) Potentially to be cleaned

Determining whether it is option A is very easy, because it is simply a difference between the datasets. However, deciding whether it is option B or C is not so simple. Therefore, I think we should be able to mark whether the data is correct. It would be nice to add another parameter, for example done (True/False), so that we and potential new contributors don't have to deal with already-correct data.
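
A minimal sketch of what adding such a flag could look like (the done field is this proposal, not an existing field in the dataset, and the file path is an assumption):

    # Tag every entry with the proposed `done` flag so reviewed entries can be
    # skipped; the field would be removed again once cleaning is finished.
    import json

    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    for entry in data:
        entry.setdefault("done", False)  # not yet reviewed

    with open("alpaca_data_cleaned.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)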

This would speed up cleaning and give us a better overview of the data, and we would also see our progress. At the end, of course, we would delete that field.

We could either incorporate it into the already-created GUI or do something else. It would be nice if we could also vote there on whether an entry is OK yet, so that it is not the decision of only one person. I believe most people understand what I mean :)

80% of math outputs are wrong

As noted in PR #3, many math outputs are incorrect (estimated at nearly 80% by @HideLord). Ideally, anyone who wants to work on these issues should follow the same format:

Original Wrong Answer:

"instruction": "Calculate the median of the following data set.",
"input": "1, 2, 4, 5, 8, 9",
"output": "5"

Corrected Answer (with Chain of Thought):

"instruction": "Calculate the median of the following data set.",
"input": "1, 2, 4, 5, 8, 9",
"output": "The median is (4+5)/2 = 4.5"

To identify math problems, one can use the regex provided:
"input": "([\d\^\+\-\/yx=,\(\) ]|\\u00b\d)+"

Modifying an existing tool script (in the tools directory) to use this regex would be easy. The first 200 math questions have been checked in the above PR.
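
A small sketch of how such a filter could look in practice (the file name is an assumption, and the regex is adapted only slightly so the superscript-digit branch works as a character range):

    # Flag entries whose "input" looks like a pure math expression, using roughly
    # the regex from this issue; \u00b0-\u00b9 covers superscript digits (², ³).
    import json
    import re

    MATH_INPUT = re.compile(r"^([\d\^\+\-\/yx=,\(\) ]|[\u00b0-\u00b9])+$")

    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    math_entries = [e for e in data if e["input"] and MATH_INPUT.match(e["input"])]
    print(f"{len(math_entries)} entries look like math problems")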

overall approach

I came across this quote in an anthropic paper, and thought I would share:

"we first pre-train on large public datasets that encode human preference information, such as Stack Exchange, Reddit, and Wikipedia edits, before finetuning on smaller datasets encoding more specific human preferences" Askell et al

This is interesting because this project's philosophy is to focus on a small, clean dataset (~30k rows). But there are large preference datasets out there, e.g. SHP at 300k rows or stack-exchange-preferences at 10M rows. And it looks like Anthropic believes a two-stage training approach might work quite well.

I also want to point you to this project, where they used Flan as a base instead of LLaMA. The result has a commercially compatible license and may be better, simply because Flan was trained on forums, while LLaMA missed out on this dialogue pre-training.

Hopefully this is interesting, that is all ;p

good job

Can you briefly introduce the method you used in this cleaning work? Thanks!

PIQA dataset's metric

How did PIQA get 78% accuracy?
I see the eval folder's README file says the metric is not trustworthy?

Is the "alpaca_data_cleaned_archive.json" file having all cleaned data?

I just want to confirm the content of the file "alpaca_data_cleaned_archive.json". I found that "alpaca_data_cleaned.json" is 44.3M, while "alpaca_data_cleaned_archive.json" is only 22.5M. Based on my understanding, the second file should combine all the cleaned data from the first one, with the remainder replaced by outputs generated by GPT-4. Therefore, the sizes of these two files should be similar, and "alpaca_data_cleaned_archive.json" might make LLaMA get better results after fine-tuning. Is that right?

Any chance we could improve the dataset beyond fixing?

Would that be relevant within the scope of this project? Adding a few kinds of task examples could improve its generalization capabilities, for instance:

  • Longer responses
  • GPT-4-generated responses for tasks similar to ones it already has
  • Roleplaying
  • Chain of thought

etc.

Contributing to the dataset curation with Argilla and the Alpaca Garbage collector

Hi @gururise, first of all, thanks and congrats on this important effort.

I'm Dani from Argilla. We've spent some time looking at data quality issues in the Alpaca dataset and its translations. We're helping the teams behind the Spanish and German efforts use Argilla to flag bad or problematic instructions so that they can be fixed later (either manually or with post-processing).

Along the way, we've spent some time labeling AlpacaDataCleaned. It already has good quality, but there are still examples to improve, so we'd like to contribute.

Today we released this model to help teams with cleaning up Alpaca translations, but it can be used to contribute to this repo too: https://huggingface.co/argilla/alpaca-garbage-collector-multilingual

We've also deployed this space for browsing and validating the records. This is what it shows for last night's version of AlpacaCleaned (login with argilla/1234).

We plan to spend some more time labeling and contributing back to this project. My question is whether it would be possible to share a set of flagged records (with positional IDs as in the original JSON) with you, to make sure we edit them in the right way. For example, what should we do with requests related to attached photos, paintings, and so on?

Adding scripts for data cleaning

Thanks for your work! Did you make the changes primarily by manually examining the dataset? If scripts were used for the cleanup, it would be helpful for others to have access to those scripts as well. I am in the process of creating a dataset for the German language using GPT and will likely run into similar issues, so it would be nice to be able to access these scripts. I'm sure you understand where I'm going with this. What are your thoughts? I know it's not straightforward, because when I cleaned up my lists, it was also a combination of using regex and manually reviewing the changes.

Idea about better cleaning

  1. We probably need to move cleaned data from one file to another, so there is no need to check it again and again.
  2. For another model, people prepared a Telegram bot so that reviewers could read a random question/answer pair and choose a button:
  • Everything correct
  • Wrong question
  • Wrong answer
  • Skip it

Maybe it makes sense to require duplicate confirmations...

Hosting your dataset on the Hugging Face Hub

Hi @gururise, this is a really cool project and great job identifying all these problems with the Alpaca dataset!

Would you be interested in hosting this on the Hugging Face Hub (https://huggingface.co/datasets)? The Alpaca dataset is also hosted there (link) and your version would be of wide interest to the community :)

If that sounds interesting, you just need to create a new dataset repo and upload your alpaca_cleaned.json file through the UI. For more details, you can check out our docs here.
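
For anyone who prefers to do it from a script, a sketch using the datasets library (the repo name is just an example, and you'd need to log in with huggingface-cli first):

    # Load the local JSON file as a dataset and push it to the Hugging Face Hub.
    from datasets import load_dataset

    ds = load_dataset("json", data_files="alpaca_data_cleaned.json")
    ds.push_to_hub("your-username/alpaca-cleaned")  # example repo id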

The MNLI score in lm-evaluation-harness

Thanks for the great work!

I'm trying to reproduce the results you report. I downloaded the model weights from https://huggingface.co/yahma/alpaca-7b-lora and evaluated them with the lm-evaluation-harness framework, but I only got 41.7% accuracy on the MNLI dataset.

When using lm-evaluation-harness, did you perform any other data-processing tricks to get 51.6% accuracy?

Diffs as data

What if we used the diffs from all this cleaning effort to train a model to do the cleaning?
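
A rough sketch of how those pairs could be extracted, assuming the original and cleaned files are positionally aligned (the file names and alignment are assumptions):

    # Compare the original Alpaca file with the cleaned one and keep entries whose
    # output changed, yielding (before, after) pairs for a cleaning model.
    import json

    with open("alpaca_data.json", encoding="utf-8") as f:
        original = json.load(f)
    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        cleaned = json.load(f)

    pairs = [
        {"instruction": o["instruction"], "input": o["input"],
         "before": o["output"], "after": c["output"]}
        for o, c in zip(original, cleaned)
        if o["output"] != c["output"]
    ]
    print(f"{len(pairs)} cleaned examples available as training data")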
