Planning on adding an evaluation metric that can be used to benchmark trained alpaca m


Evaluation Metric about alpacadatacleaned HOT 8 OPEN

gururise commented on August 10, 2024

Evaluation Metric

from alpacadatacleaned.

Comments (8)

gururise commented on August 10, 2024

So I've compared two different Alpaca 7B models on the SQuAD dataset:

| Dataset | Model | Squad(Mini) F1 |
| --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 34.63 |
| Cleaned Alpaca | tloen/alpaca-lora-7b | 49.64 |

At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.
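For readers unfamiliar with the metric, here is a minimal sketch of a SQuAD-style token-level F1. It follows the normalization used by the standard SQuAD v1.1 evaluation (lowercase, strip punctuation and articles, compare token overlap). The thread does not show how the Squad(Mini) bench computes its score, so treating it as this metric is an assumption, and the function names `normalize_answer`, `token_f1`, and `mean_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one prediction and one reference answer."""
    pred_tokens = normalize_answer(prediction)
    ref_tokens = normalize_answer(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions: list[str], references: list[list[str]]) -> float:
    """Average the best F1 per example over all examples, scaled to 0-100."""
    scores = [
        max(token_f1(pred, ref) for ref in refs)
        for pred, refs in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)
```

Under this metric a verbose answer lowers precision even when it contains the right span, so how model output is trimmed before scoring can move the number a lot.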


gururise commented on August 10, 2024

Just added the PIQA benchmark and also redid the scoring of the Squad bench:

| Dataset | Hugging Face model | Parameters | SquadMini (F1) | PIQA (acc) |
| --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7b | 74.271 | ~~50.5~~ |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7b | 75.629 | ~~54.0~~ |
| Cleaned Alpaca (Mar 31) | yahma/alpaca-7b-lora | 7b | 76.388 | ~~52.6~~ |
| GPT4All | nomic-ai/gpt4all-lora | 7b | 72.643 | ~~49.5~~ |

Note: The PIQA benchmark has issues. Do not use it yet.


gururise commented on August 10, 2024

Decided to standardize on the lm-evaluation-harness from EleutherAI instead. Here are the new results:

| Dataset | Model | Parameters | WikiText (ppl) | MNLI (acc) | PIQA (acc norm) |
| --- | --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7b (lora) | 9.5396 | 38.33 | 78.51 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7b (lora) | 9.4885 | 51.6 | 79.33 |
| GPT4All | nomic-ai/gpt4all-lora | 7b (lora) | 10.09 | 38.97 | 78.40 |

Not sure why the model trained on the cleaned dataset scored so high on the MNLI benchmark. I ran the test multiple times to confirm.
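To make the column headers concrete, here is a rough sketch of the two less familiar metrics: perplexity (lower is better) and length-normalized accuracy ("acc norm"). It is not the harness's own code. The helper names `word_perplexity` and `pick_choice` are hypothetical, the log-probabilities are made up, and normalizing by character length is an assumption, since the exact normalization may differ between harness versions.

```python
import math


def word_perplexity(token_logprobs: list[float], num_words: int) -> float:
    """exp of the negative total log-likelihood divided by the word count."""
    return math.exp(-sum(token_logprobs) / num_words)


def pick_choice(choice_logprobs: list[float], choices: list[str], normalize: bool) -> int:
    """Return the index of the best-scoring answer choice.

    With normalize=True each log-likelihood is divided by the choice's
    character length (an assumed normalization), so longer answers are
    not penalized just for having more tokens.
    """
    if normalize:
        scores = [lp / len(text) for lp, text in zip(choice_logprobs, choices)]
    else:
        scores = list(choice_logprobs)
    return max(range(len(scores)), key=scores.__getitem__)


choices = ["use a spoon", "use a clean metal spoon to stir"]
logprobs = [-6.0, -9.0]  # made-up numbers for illustration only
print(pick_choice(logprobs, choices, normalize=False))  # 0: shorter choice wins on raw score
print(pick_choice(logprobs, choices, normalize=True))   # 1: longer choice wins per character
```

Accuracy is then the fraction of examples where the picked index matches the gold label, so "acc" and "acc norm" can disagree on the same model outputs.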


claysauruswrecks commented on August 10, 2024

https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf


claysauruswrecks commented on August 10, 2024

I have mentioned a few options previously in this issue: tloen/alpaca-lora#147


gururise commented on August 10, 2024

Just FYI. I re-ran the SQUADmini bench on a model I fine-tuned on the March 31 release of the cleaned dataset and got an average F1 score of 55.229.


YukinoshitaKaren commented on August 10, 2024

May I ask a question? Using 'tloen/alpaca-lora-7b', you got a Squad(Mini) F1 of 49.64 two weeks ago, and with the same model you got 75.629 last week. Why are the two results so different? I tried this model and got a Squad(Mini) F1 of around 55.07.


gururise commented on August 10, 2024

> May I ask a question? You use 'tloen/alpaca-lora-7b' got a 49.64 'Squad(Mini) F1' 2 weeks ago, and you use the same model got 75.629 last week, why are the two results so different? I have tried this model and got around 55.07 Squad(Mini) F1.

The SquadMini score calculation was redone between those two runs. Anyhow, going forward, we are dropping the benchmark script eval.py and using the lm-evaluation-harness from EleutherAI. The scores reported in the main README come directly from the lm-evaluation-harness report.
