
notus's Introduction

💨 Notus

Banner: Notus, the wind god of the south, depicted as a warm, swirling southern breeze carrying paper planes across a backdrop of warm colors with hints of blue and green.

Notus is a collection of fine-tuned models using SFT, DPO, SFT+DPO, and other RLAIF/RLHF techniques, following a data-first, human-centric approach, since that's what we do best at Argilla.

Notus models are intended to be used as assistants via chat-like applications, and are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison with other similar LLMs.
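
As a quick illustration of the chat-style usage, here is a minimal sketch with the transformers pipeline, assuming the argilla/notus-7b-v1 checkpoint from the Hub collection linked below; generation parameters are illustrative.

import torch
from transformers import pipeline

# Load a Notus checkpoint as a chat assistant (model id assumed from the Hub collection below)
generator = pipeline(
    "text-generation",
    model="argilla/notus-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO in one paragraph."},
]

# Format the conversation with the model's own chat template before generating
prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])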

The Notus name comes from the ancient Greek god Notus, as a nod to Zephyr, which is named after the ancient Greek god Zephyrus; the difference being that Notus is the god of the south wind and Zephyr the god of the west wind. More information at https://en.wikipedia.org/wiki/Anemoi.

Being able to fine-tune LLMs while keeping a data-first approach wouldn't have been possible without the invaluable help of the open source community and all the amazing resources out there intended for the general public. We are very grateful for that, and we hope that our work can be useful for others as well.

🎩 h/t to the HuggingFace H4 team for their amazing work on the alignment-handbook, and for the fruitful discussions we had with them and their support.

News

  • December 1st, 2023: Notus 7B v1 is released! 🎉 It uses the same DPO fine-tuning approach as Zephyr 7B Beta, but changes how the UltraFeedback data is binarized: pairs are built from the average of the different preference criteria ratings instead of the critique's overall score (a sketch of this binarization is shown below). Notus 7B improved on both AlpacaEval and the LM Eval Harness compared to Zephyr 7B Beta, while the MT-Bench results were on par. More information at v1/.
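
A rough sketch of that binarization, assuming a simplified UltraFeedback-like row layout where each completion carries per-criterion annotations with a numeric Rating; the real openbmb/UltraFeedback field names and types should be double-checked before running.

import random
from statistics import mean

def binarize_by_average(row, rng=random.Random(42)):
    """Build a (chosen, rejected) pair from the average of the criteria ratings."""
    def avg_rating(completion):
        # Average the per-criterion ratings (helpfulness, honesty, etc.)
        # instead of trusting the critique's overall_score.
        return mean(float(a["Rating"]) for a in completion["annotations"].values())

    ranked = sorted(row["completions"], key=avg_rating, reverse=True)
    chosen = ranked[0]                 # best response by average rating
    rejected = rng.choice(ranked[1:])  # a random lower-ranked response, as in the Zephyr recipe
    return {"chosen": chosen["response"], "rejected": rejected["response"]}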

Resources

🤗 HuggingFace Hub Collection

💬 Chat UI

Citation

Since most of the content is ported / adapted from huggingface/alignment-handbook, we recommend citing their work.

@misc{alignment_handbook2023,
  author = {Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Alexander M. Rush and Thomas Wolf},
  title = {The Alignment Handbook},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/alignment-handbook}}
}

Additionally, if you find any of the contents of this repository useful, please feel free to use the following BibTeX citation as well:

@misc{notus2023,
  author = {Alvaro Bartolome and Gabriel Martin and Daniel Vila},
  title = {Notus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/argilla-io/notus}}
}

Note

Alphabetically ordered by last name due to equal contribution.

notus's People

Contributors

alvarobartt, gabrielmbmb


notus's Issues

Run DPO step with multibinarized dataset

One important open question, especially for distilabel, is: does generating more chosen/rejected pairs improve the DPO process? Notus, Zephyr, and Tulu all use just the best response as chosen and a random lower-rated one as rejected.

We need to run an experiment to better understand how multibinarization impacts the model.

The dataset is ready:
https://huggingface.co/datasets/argilla/notus-uf-dpo-multibinarized

It contains one pair with the best response as chosen, and then several additional pairs with lower-rated responses as rejected; so instead of generating just one sample per UltraFeedback row (choosing a random rejected), we generated between 1 and 3 pairs per row (see the sketch below).
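
A sketch of the multibinarization idea, assuming each completion already carries an average rating; field names such as completions, avg_rating, and response are illustrative and should be checked against the linked dataset.

def multibinarize(row):
    """Emit one (chosen, rejected) pair per lower-rated response, instead of a single pair."""
    ranked = sorted(row["completions"], key=lambda c: c["avg_rating"], reverse=True)
    best = ranked[0]
    pairs = []
    for worse in ranked[1:]:
        if worse["avg_rating"] < best["avg_rating"]:  # skip responses tied with the best one
            pairs.append({"chosen": best["response"], "rejected": worse["response"]})
    # With 4 completions per UltraFeedback row, this typically yields between 1 and 3 pairs
    return pairs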

[Question] Are you planning to support a multilingual model?

Hi everyone!

Thanks for opening up such a great project for building high-quality data and LLMs like Notux and Notus!
I appreciate your commitment to publishing everything you do.
I have no doubt that you are making great progress in the development of AI worldwide.

Let me ask the following questions:

  • Do you have any plans to support a multilingual model?
  • Or can I contribute somehow to building an LLM that supports Japanese?
    I tested Japanese and the result was close to perfect.
    I feel Notus has basic knowledge of Japan, but its Japanese is not natural.

I think creating a multilingual model could accelerate the development of open LLMs on a world scale.

I hope you can reply!
Thanks!

Curate UltraFeedback dataset's overall_score

Based on our curation efforts, we spotted a bug in the overall_score of the UltraFeedback AI critique. TL;DR: responses that should get the lowest score (1 or less) end up with a high score (10, 8.0, or 7.5, who knows!). Our initial work with Notus shows that by using something other than the overall score, we can train a better model.

In this task, we want to thoroughly clean up the original dataset to make sure others build on an error-free dataset. I have myself curated a few hundred examples (sorting by chosen score = 10), and most of the responses getting a 10 are totally useless according to the rationale (natural-language) explanation.

The objective is as follows:

  1. Using this dataset, take the best_overall_score_response column, get the critique text, and run it through a very simple sentiment analysis (I suggest starting with TextBlob's, because it's really fast and the rationales are very expressive when the response is really bad).
  2. Add this sentiment score to the dataset as a new column, best_overall_score_response_critique_sentiment (see the sketch after this list).
  3. Based on this new dataset, identify the examples that get a high overall_score but a bad sentiment.
  4. Iterate as much as we can to really narrow down those problematic cases. I'd strongly suggest using the Argilla UI with sort and filters to adjust quickly.
  5. Once we know the problematic cases, we have several choices; the best I can think of is to reduce their overall_score (dividing by 10 :-) ) in the completions object.
  6. Now that we have a clean dataset, we can use it to experiment further (compare rating vs. critique, etc.) and, most importantly, share it with the community so people can build on a clean version!
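
A minimal sketch of steps 1–3, assuming the dataset is loadable from the Hub and that best_overall_score_response exposes the critique text and the overall_score; the dataset id and the exact field layout are assumptions to verify before running.

from datasets import load_dataset
from textblob import TextBlob

# Hypothetical dataset id; replace with the dataset linked in step 1
ds = load_dataset("argilla/ultrafeedback-curation", split="train")

def add_critique_sentiment(row):
    critique = row["best_overall_score_response"]["critique"]
    # polarity is in [-1, 1]; strongly negative rationales score close to -1
    row["best_overall_score_response_critique_sentiment"] = TextBlob(critique).sentiment.polarity
    return row

ds = ds.map(add_critique_sentiment)

# Step 3: flag rows whose overall_score is high but whose critique reads clearly negative
suspicious = ds.filter(
    lambda row: row["best_overall_score_response"]["overall_score"] >= 8.0
    and row["best_overall_score_response_critique_sentiment"] < 0.0
)
print(f"{len(suspicious)} potentially mis-scored examples")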

More details about the initial analysis on the dataset readme.

Please keep us posted as you start and iterate!

Run SFT step with the chosen responses from the rating-binarized data

I'd recommend running an experiment:

Using the base SFT Zephyr model, run SFT on the chosen responses of our UF dataset. Since we know there are several issues with the original train_prefs split, we should evaluate whether their result (SFT on the chosen responses not improving the overall recipe) still holds.
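
A sketch of how the SFT data could be prepared: keep only the prompt plus its chosen response and format it with the base model's chat template, so it can feed any SFT recipe (e.g. the alignment-handbook one). The dataset id and column names are assumptions based on the discussion above.

from datasets import load_dataset
from transformers import AutoTokenizer

# Zephyr's SFT base model; its chat template is used to format the examples
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/mistral-7b-sft-beta")

# Assumed dataset id for our rating-binarized UltraFeedback preferences
ds = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

def to_sft_text(row):
    # If "chosen" is already a list of chat messages, pass it to the template directly instead
    messages = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["chosen"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

sft_ds = ds.map(to_sft_text, remove_columns=ds.column_names)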
