
Lamini: The LLM engine for rapidly customizing models 🦙

License | Python 3.7+ | Code style: black

Official repo for Lamini's data generator, which creates instruction data for training instruction-following LLMs.

All data and LLMs are under a CC-BY license that allows commercial use: all yours, you own it! 🦙🦙🦙

What's here?

  • A 71K dataset of instructions used for finetuning your own instruction-following LLM (like ChatGPT, which was also trained to follow instructions).
  • The code for the data generator, which needs only 100 seed datapoints to start generating 70K+ datapoints. You can customize the original 100+ datapoints to your own domain, to focus the data generator on that domain.
  • Open-source fine-tuned LLMs that follow instructions, fine-tuned using a base Pythia model with the Lamini engine: [weights] [playground].

See our blog post for a layperson's explanation of what's going on.

Lamini Process Step by Step

Authentication to Lamini

Ready to configure your API key? It's easy-peasy! 🔑

First, navigate to your Lamini account page to retrieve your unique API key. Remember to keep this key a secret, and don't expose it in any client-side code or share it with others.

Next, create a config file, like so:

mkdir ~/.powerml
touch ~/.powerml/configure_llama.yaml # backend system names

Finally, open the file with a text editor and place your key in it:

production:
    key: "<YOUR-KEY-HERE>"

The best part? The Lamini Python package will automatically load your key from this config file, so you don't have to worry about it 🙌

If you're running Lamini in a Docker container, make sure to copy or mount this file inside the container 🐳
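
To sanity-check that the config file parses before you run anything (locally or inside the container), you can read it yourself. This is a quick local check using PyYAML, not part of the Lamini package, and the production/key layout is taken from the example above:

import os
import yaml  # pip install pyyaml

# Local sanity check only; the Lamini package loads this file by itself.
config_path = os.path.expanduser("~/.powerml/configure_llama.yaml")
with open(config_path) as f:
    config = yaml.safe_load(f)

key = (config or {}).get("production", {}).get("key", "")
if key and key != "<YOUR-KEY-HERE>":
    print("Found an API key in", config_path)
else:
    print("No API key set yet; edit", config_path)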

See our API docs for more details.

Run

Clone the repository:

git clone git@github.com:lamini-ai/lamini.git

Using Python 🐍

In the repository, install the Python dependencies:

pip install -r requirements.txt

Run the program to start generating data 📊📊📊

python3 generate_data.py

Using Docker 🐳

Make sure you have Docker installed.

Then, run this command:

./run_generate_data_docker.sh

Expected Outputs & Autosaved Data 🦙

When you run the program, you should start seeing output of a Seed Question, from the original small dataset in seed_tasks.jsonl, and a Novel Question, which is a generated question based on that Seed Question.1

====== Seed Question =====
 question='Plan a weekly lunch menu for a school. Write down a main dish, a carbohydrate side dish, a vegetable side dish, and a dessert for each day.'
===== Novel Question =====
 question='Write a knock knock story that has the same theme as the given one, but the characters, plots and settings are different'

These generated questions are saved to data/questions.jsonl. This file is a list of dictionaries, each with a question field.

Next, you'll see a Response generated for each Novel Question.

====== Question =====
 question='Which exercises are best'
===== Response =====
 response='It really depends what you want to achieve. If you want to lose weight, I would recommend high intensity interval training (HIIT) because it is really effective. If you want to gain muscle, I would recommend steady-state cardio like running or walking as these help maintain muscle mass. If you are interested in both losing weight and gaining muscle, I would recommend a combination of both HIIT and steady-state cardio.\n\n\nHIIT exercises are: running fast for short periods of time, followed by slow walking or jogging for longer periods of time. An example of a HIIT exercise is called the "Lunges'

These pairs are saved to data/dataset.jsonl. This file is a list of dictionaries, each with question and response fields.
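
If you want to spot-check the outputs programmatically, a minimal sketch like the one below reads both files, assuming the usual JSONL layout of one JSON object per line (the field names match those described above):

import json

def load_jsonl(path):
    # Read a JSONL file: one JSON object per non-empty line.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

questions = load_jsonl("data/questions.jsonl")
pairs = load_jsonl("data/dataset.jsonl")

print(f"{len(questions)} questions, {len(pairs)} question-response pairs")
if pairs:
    print("Q:", pairs[0]["question"])
    print("A:", pairs[0]["response"])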

It's poggers 💥

Modify

I want to use my own seed data

We suggest creating your own dataset and changing the path to seed_tasks.jsonl in generate_data.py, or replacing seed_tasks.jsonl with your own data in the same format. You can of course also modify how the data is loaded or write your own script with the llama-llm library (pssst, API docs).

I only want to generate questions (to start)

In generate_data.py, you can just run generate_questions. A common use case is to apply human review after the question generation step and keep only the good questions before generating a response for each one.
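
For example, a tiny review loop over data/questions.jsonl could look like the sketch below. This is a hypothetical helper, not a script shipped in the repo; it assumes one JSON object per line with a question field and writes the kept questions to a new file:

import json

# Hypothetical review helper (not part of the repo).
kept = []
with open("data/questions.jsonl") as f:
    for line in f:
        item = json.loads(line)
        # Show each generated question and ask whether to keep it.
        answer = input(f"Keep? [y/n] {item['question']}\n> ")
        if answer.strip().lower().startswith("y"):
            kept.append(item)

with open("data/questions_reviewed.jsonl", "w") as f:
    for item in kept:
        f.write(json.dumps(item) + "\n")

print(f"Kept {len(kept)} of the reviewed questions")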

I have my own instructions, and just want to generate responses

In generate_data.py, you can just use the function make_pairs to create the question-response pairs. A common use case is to run this stage separately, e.g. after human review of the generated questions, or if there was an error at this step last time.

I want to generate more than 100 instructions

Change the count flag -c to set the total number of question-response pairs to generate. The default is 100.
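
For example, to generate 1,000 pairs (assuming the flag is passed straight to the script):

python3 generate_data.py -c 1000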

Cleaning

Using Python 🐍

In the repository, run remove_duplicates.py to remove duplicate questions from data/dataset.jsonl:

python3 remove_duplicates.py

This runs a basic cleaning job on your data 🧼🧼🧼

Then run remove_duplicates_completion.py to remove responses in data/dataset.jsonl where the model repeats itself:

python3 remove_duplicates_completion.py

This runs a more extensive cleaning job on your data 🛁🛁🛁

These are just examples. Consider using human filtering or writing additional post-processing programs to further clean and improve your data. Your fine-tuned models will thank you!
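
As a starting point for your own post-processing, here is a hypothetical extra filter (not part of the repo) that drops pairs from data/dataset.jsonl with very short responses or responses that look truncated mid-sentence; the thresholds are arbitrary assumptions you would tune for your data:

import json

# Hypothetical cleaning filter; the length threshold and punctuation check are arbitrary.
kept = []
with open("data/dataset.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        response = pair["response"].strip()
        # Keep only reasonably long responses that end with sentence punctuation.
        if len(response) >= 40 and response.endswith((".", "!", "?")):
            kept.append(pair)

with open("data/dataset_cleaned.jsonl", "w") as f:
    for pair in kept:
        f.write(json.dumps(pair) + "\n")

print(f"Kept {len(kept)} pairs")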

Data Release

We've run this script a few times and saved the results for you to freely use, at data/lamini_dataset.jsonl 💸

This file contains 72K instruction-following examples for commercial use (i.e. feel free to use it for your business! 💰📈). Like the generated output, it's a list of dictionaries, each of which contains the following fields:

  • question: str, describes the task the model should perform. Each instruction is unique, as generated by lamini/open.
  • response: str, the answer to the instruction as generated by lamini/instruct.
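
If you plan to fine-tune on this file, a minimal sketch for turning each entry into a single training string might look like the following. The prompt template here is an assumption for illustration, not the format Lamini itself uses:

import json

# Hypothetical template; swap in whatever prompt format your fine-tuning setup expects.
TEMPLATE = "### Instruction:\n{question}\n\n### Response:\n{response}"

with open("data/lamini_dataset.jsonl") as f:
    examples = [json.loads(line) for line in f if line.strip()]

texts = [TEMPLATE.format(**example) for example in examples]
print(texts[0][:200])  # preview the first formatted example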

About Lamini

Lamini is the world's most powerful LLM engine, unlocking the power of generative AI for every company by putting their data to work. The name comes from the Lamini tribe of camelids, which includes llamas (LLMs!), alpacas, and more.

Footnotes

  1. The Seed Questions in the Lamini seed dataset are instructions (a combination of questions and commands), based on the self-instruct dataset. The generated questions are similar in nature and therefore don't have to be literal questions. You can find the seed dataset in seed_tasks.jsonl. ↩
