alpacadatacleaned's People

Contributors

claysauruswrecks, dustinreed-info, gururise, hzj5790, josemlopez, kenakafrosty, matusstas, minh-v

alpacadatacleaned's Issues

ModuleNotFoundError: No module named 'utils'

I'm trying to run

python -m generate_instruction generate_instruction_following_data \
  --output_dir ./ \
  --num_instructions_to_generate 10 \
  --model_name="text-davinci-003"

and then got the error "ModuleNotFoundError: No module named 'utils'".

Separate instructions by functionality

There are some specific instructions, such as generating images or models and searching the internet, that the dataset simply trains the model to refuse.

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "<nooutput>"
    }

To make the dataset unbiased, we could split these instructions into separate files. For example:

no_image_generation.json

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "This type of instruction cannot be fulfilled by a GPT model."
    }

So the end user could use a custom JSON like:
image_generation.json

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "<image-api>company logo"
    }

How to format dataset fields in model prompt?

Hi, I'm looking to fine-tune an LLM using this dataset and was wondering whether there's any advice on how to format the prompt given the "instruction" vs. "input" fields.

For example consider these entries:

  {
    "output":"The author has used personification in the sentence \"The cold breeze chills my bones.\" Personification is a figure of speech in which a non-human subject is given human characteristics. In this case, the non-human subject is the cold breeze, which is given the human characteristic of being able to chill someone's bones.",
    "input":"The cold breeze chills my bones.",
    "instruction":"Identify a stylistic device used by the author in the following sentence."
  }

 {
    "output":"Two players from the Kansas City Chiefs team are Patrick Mahomes and Tyreek Hill.",
    "input":"",
    "instruction":"Name two players from the Chiefs team?"
  }

I imagine two approaches:

  1. Use the "instruction" as the system prompt, and the "input" as the first user chat message (which would often be empty though)...
  2. Concatenate the instruction + input fields into a single (first) user chat message.

I think I'll use approach 2 but would appreciate any insights or references on this topic :)
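
For reference, here is a sketch of approach 2 that folds both fields into a single prompt. The template roughly follows the one used by the original Stanford Alpaca training code, so treat it as one possible convention rather than something this dataset requires:

    # Approach 2: concatenate instruction + input into a single user prompt,
    # with a separate variant for entries whose input is empty.
    def build_prompt(entry: dict) -> str:
        if entry["input"].strip():
            return (
                "Below is an instruction that describes a task, paired with an input "
                "that provides further context. Write a response that appropriately "
                "completes the request.\n\n"
                f"### Instruction:\n{entry['instruction']}\n\n"
                f"### Input:\n{entry['input']}\n\n"
                "### Response:\n"
            )
        return (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{entry['instruction']}\n\n"
            "### Response:\n"
        )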

Evaluation Metric

Planning on adding an evaluation metric that can be used to benchmark trained alpaca models.

Going to focus on these two datasets for evaluation:

  1. SQuAD dataset - F1 score
  2. WikiText dataset - Perplexity

I'm not so sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though Alpaca is not trained on a SQuAD-style dataset, I think that with the right prompt it can be done.
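
For the SQuAD side, a minimal sketch of computing the F1 score with the Hugging Face `evaluate` library (the id and answer are placeholders; the real benchmark script may look different):

    # Score generated answers against SQuAD references with the `evaluate` library
    # (pip install evaluate). The "id" only has to match between the two lists.
    import evaluate

    squad_metric = evaluate.load("squad")

    predictions = [{"id": "q1", "prediction_text": "4.5"}]
    references = [{"id": "q1",
                   "answers": {"text": ["4.5"], "answer_start": [0]}}]

    results = squad_metric.compute(predictions=predictions, references=references)
    print(results)  # e.g. {'exact_match': 100.0, 'f1': 100.0}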

Identify code snippet in "input" fields

I want to translate the training data into another language with Google translate, but code snippets should not be translated, so I have to replace code snippets with placeholders before translating.

All code snippets in "output" fields are quoted with triple backticks, so they're quite easy to identify. But code snippets in "input" fields aren't quoted.

Any suggestions on identifying those in "input" fields?
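
One possible direction, sketched under the assumption that a rough token-based heuristic is good enough as a first pass (the pattern below is an assumption and will need tuning):

    # Flag code-like "input" fields before translation so they can be replaced
    # with placeholders; the token list is a heuristic, not a rule.
    import json
    import re

    CODE_PATTERN = re.compile(
        r"(def |class |return |print\(|#include|</?\w+>|\{.*\}|;\s*$|==|->)",
        re.MULTILINE,
    )

    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    code_like = [e for e in data if e["input"] and CODE_PATTERN.search(e["input"])]
    print(f"{len(code_like)} inputs look like code")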

Correct or potentially to be cleaned?

During the short time that I have been helping here, I have noticed that we divide the data into 3 categories:
A) Cleaned
B) Correct
C) Potentially to be cleaned

Determining whether it is option A is very easy, because it is simply a difference between the datasets. However, deciding whether it is option B or C is not so simple. Therefore, I think we should be able to mark whether the data is correct. It would be nice to add another parameter, for example done (True/False), so that we and potential new contributors don't have to deal with already-correct data.
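
A minimal sketch of what adding such a flag could look like (the done field is this proposal, not an existing field in the dataset, and the file path is an assumption):

    # Tag every entry with the proposed `done` flag so reviewed entries can be
    # skipped; the field would be removed again once cleaning is finished.
    import json

    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    for entry in data:
        entry.setdefault("done", False)  # not yet reviewed

    with open("alpaca_data_cleaned.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)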

This would speed up cleaning and give us a better overview of the data, and we would also see our progress. At the end, of course, we would delete that field.

We could either incorporate it into the already-created GUI or do something else. It would be nice if we could also vote there on whether an entry is OK yet, so that it is not the decision of only one person. I believe most people understand what I mean :)

80% of math outputs are wrong

As noted in PR #3, many math outputs are incorrect (estimated at nearly 80% by @HideLord). Ideally, anyone who wants to work on these issues should follow the same format:

Original Wrong Answer:

"instruction": "Calculate the median of the following data set.",
"input": "1, 2, 4, 5, 8, 9",
"output": "5"

Corrected Answer (with Chain of Thought):

"instruction": "Calculate the median of the following data set.",
"input": "1, 2, 4, 5, 8, 9",
"output": "The median is (4+5)/2 = 4.5"

To identify math problems, one can use the regex provided:
"input": "([\d\^\+\-\/yx=,\(\) ]|\\u00b\d)+"

Modifying an existing tool script (in the tools directory) to use this regex would be easy. The first 200 math questions have been checked in the above PR.
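
A small sketch of how such a filter could look in practice (the file name is an assumption, and the regex is adapted only slightly so the superscript-digit branch works as a character range):

    # Flag entries whose "input" looks like a pure math expression, using roughly
    # the regex from this issue; \u00b0-\u00b9 covers superscript digits (², ³).
    import json
    import re

    MATH_INPUT = re.compile(r"^([\d\^\+\-\/yx=,\(\) ]|[\u00b0-\u00b9])+$")

    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    math_entries = [e for e in data if e["input"] and MATH_INPUT.match(e["input"])]
    print(f"{len(math_entries)} entries look like math problems")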

overall approach

I came across this quote in an anthropic paper, and thought I would share:

"we first pre-train on large public datasets that encode human preference information, such as Stack Exchange, Reddit, and Wikipedia edits, before finetuning on smaller datasets encoding more specific human preferences" Askell et al

This is interesting because this project's philosophy is to focus on a small, clean dataset (~30k rows). But there are large preference datasets out there, e.g. SHP at 300k rows or stack-exchange-preferences at 10M rows. And it looks like Anthropic believes a two-stage training approach might work quite well.

I also want to point you to this project, where they used Flan as a base instead of LLaMA. The result has a commercially compatible license and may be better, simply because Flan was trained on forums, while LLaMA missed out on this dialogue pre-training.

Hopefully this is interesting, that is all ;p

good job

Can you briefly introduce the method you used in this cleaning work? Thanks!

PIQA dataset's metric

How did PIQA get 78% accuracy?
I see the eval folder's README file says the metric is not trustworthy?

Is the "alpaca_data_cleaned_archive.json" file having all cleaned data?

I just want to confirm the content of the file "alpaca_data_cleaned_archive.json". I found that "alpaca_data_cleaned.json" is 44.3M, while "alpaca_data_cleaned_archive.json" is only 22.5M. Based on my understanding, the second file should combine all the cleaned data from the first one, with the remainder replaced by outputs generated by GPT-4. Therefore, the sizes of these two files should be similar, and "alpaca_data_cleaned_archive.json" might make LLaMA get better results after fine-tuning. Is that right?

Any chance we could improve the dataset beyond fixing?

Would that be relevant within the scope of this project? Adding a few kinds of task examples could improve its generalization capabilities, for instance:

  • Longer responses
  • GPT-4-generated responses for tasks similar to ones it already has
  • Roleplaying
  • Chain of thought

etc.

Contributing to the dataset curation with Argilla and the Alpaca Garbage collector

Hi @gururise, first of all, thanks and congrats on this important effort.

I'm Dani from Argilla. We've spent some time looking at data quality issues in the Alpaca dataset and its translations. We're helping the teams behind the Spanish and German efforts use Argilla to flag bad or problematic instructions so that they can be fixed later (either manually or with post-processing).

Along the way, we've spent some time labeling AlpacaDataCleaned. It already has good quality, but there are still examples to improve, so we'd like to contribute.

Today we released this model to help teams with cleaning up Alpaca translations, but it can be used to contribute to this repo too: https://huggingface.co/argilla/alpaca-garbage-collector-multilingual

We've also deployed this space for browsing and validating the records. This is what it shows for last night's version of AlpacaCleaned (login with argilla/1234).

We plan to spend some more time labeling and contributing back to this project. My question is whether it would be possible to share a set of flagged records (with positional IDs as in the original JSON) with you, to make sure we edit them in the right way. For example, what should we do with requests related to attached photos, paintings, and so on?

Adding scripts for data cleaning

Thanks for your work! Did you make the changes primarily by manually examining the dataset? If scripts were used for the cleanup, it would be helpful for others to have access to those scripts as well. I am in the process of creating a dataset for the German language using GPT and will likely run into similar issues, so it would be nice to be able to access these scripts. I'm sure you understand where I'm going with this. What are your thoughts? I know it's not straightforward, because when I cleaned up my lists, it was also a combination of using regex and manually reviewing the changes.

Idea about better cleaning

  1. We probably need to move cleaned data from one file to another, so there is no need to check it again and again.
  2. For another model, people prepared a Telegram bot so that reviewers could read a random question/answer pair and choose a button:
  • Everything correct
  • Wrong question
  • Wrong answer
  • Skip it

Maybe it makes sense to require duplicate confirmations...

Hosting your dataset on the Hugging Face Hub

Hi @gururise, this is a really cool project and great job identifying all these problems with the Alpaca dataset!

Would you be interested in hosting this on the Hugging Face Hub (https://huggingface.co/datasets)? The Alpaca dataset is also hosted there (link) and your version would be of wide interest to the community :)

If that sounds interesting, you just need to create a new dataset repo and upload your alpaca_cleaned.json file through the UI. For more details, you can check out our docs here.
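
For anyone who prefers to do it from a script, a sketch using the datasets library (the repo name is just an example, and you'd need to log in with huggingface-cli first):

    # Load the local JSON file as a dataset and push it to the Hugging Face Hub.
    from datasets import load_dataset

    ds = load_dataset("json", data_files="alpaca_data_cleaned.json")
    ds.push_to_hub("your-username/alpaca-cleaned")  # example repo id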

The MNLI score in lm-evaluation-harness

Thanks for the great work!

I'm trying to reproduce the results you report. I downloaded the model weights from https://huggingface.co/yahma/alpaca-7b-lora and evaluated them with the lm-evaluation-harness framework, but I only got 41.7% accuracy on the MNLI dataset.

When using lm-evaluation-harness, did you perform any other data-processing tricks to get 51.6% accuracy?

Diffs as data

What if we used the diffs from all this cleaning effort to train a model to do the cleaning?
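
A rough sketch of how those pairs could be extracted, assuming the original and cleaned files are positionally aligned (the file names and alignment are assumptions):

    # Compare the original Alpaca file with the cleaned one and keep entries whose
    # output changed, yielding (before, after) pairs for a cleaning model.
    import json

    with open("alpaca_data.json", encoding="utf-8") as f:
        original = json.load(f)
    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        cleaned = json.load(f)

    pairs = [
        {"instruction": o["instruction"], "input": o["input"],
         "before": o["output"], "after": c["output"]}
        for o, c in zip(original, cleaned)
        if o["output"] != c["output"]
    ]
    print(f"{len(pairs)} cleaned examples available as training data")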
