
prompt2model's Introduction

Prompt2Model - Generate Deployable Models from Instructions


Prompt2Model is a system that takes a natural language task description (like the prompts used for LLMs such as ChatGPT) and trains a small, special-purpose model that is conducive to deployment.

Prompt2Model teaser figure

Quick Start

Notebook

You can run our demo of Prompt2Model through a notebook.

Command Line

You can also run Prompt2Model through the command line.

pip install prompt2model

Prompt2Model supports various model providers, such as OpenAI, Anthropic, and Hugging Face, via LiteLLM.

If you are using OpenAI models (such as the default gpt-3.5-turbo), please obtain an OpenAI API key on their website then set the environment variable OPENAI_API_KEY to your API key by running the following command in your terminal:

export OPENAI_API_KEY=<your key>

List of all supported providers

You can then run

python prompt2model_demo.py

to create a small model from a prompt, as shown in the demo video below. This script must be run on a device with an internet connection to access the OpenAI API. For best results, run this script on a device with a GPU for training your model.

Demo

promptmodel_demo.mp4

Tips and Examples to Write a Good Prompt

You can find tips and examples for writing a good prompt in prompt_examples.

Components

The prompt2model package is composed of several components, each designed for a specific purpose. To understand how to use each component effectively, consult the readme.md file in that component's directory, located at ./prompt2model/<component>/readme.md. These files provide detailed instructions on customizing each component and getting the most out of it.

Contribution

If you're interested in contributing to the prompt2model project, please see the contribution guidelines in the repository.

Cite

We have written a paper describing Prompt2Model in detail.

If you use Prompt2Model in your research, please cite us!

If you discuss or use the overall prompt2model framework, please reference

@misc{prompt2model,
      title={Prompt2Model: Generating Deployable Models from Natural Language Instructions},
      author={Vijay Viswanathan and Chenyang Zhao and Amanda Bertsch and Tongshuang Wu and Graham Neubig},
      year={2023},
      eprint={2308.12261},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

If you discuss or use our dataset retrieval and transformation tools, please reference

@misc{prompt2modeldatatune,
      title={Better Synthetic Data by Retrieving and Transforming Existing Datasets}, 
      author={Saumya Gandhi and Ritu Gala and Vijay Viswanathan and Tongshuang Wu and Graham Neubig},
      year={2024},
      eprint={2404.14361},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

prompt2model's People

Contributors

7flash, abertsch72, anindyadeep, eltociear, farhaduneci, hitenvidhani, krrishdholakia, neubig, ritugala, saum7800, souryadipstan, viswavi, zhaochenyang20


prompt2model's Issues

Rename our `Trainer` class to `ModelTrainer`

A common problem in Python programming is giving your class the same name as a class from a third-party module.

In our model trainer, we use transformers.Trainer, which potentially conflicts with our own Trainer class name.

Tokenizer `model_max_length`

When I was running an alpha test of the Trainer, I encountered this:

FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.

Then I realized that we should set model_max_length for the T5 encoder's tokenizer.

Setting model_max_length is important for several reasons (a minimal sketch follows the list):

  1. Handling longer sequences: It allows you to process sequences that exceed the default maximum length of tokenizers without truncation or padding issues.

  2. Avoiding warnings and errors: By specifying model_max_length, you can prevent warnings and errors that may occur when tokenizing or padding longer sequences.

  3. Consistency with model requirements: Some models have specific input length requirements. Setting model_max_length ensures that the tokenizer aligns with the expected input length of the model.

  4. Customizing sequence lengths: You can tailor the maximum length of tokenized sequences to suit your needs, providing flexibility for tasks that require longer context or models that support extended sequences.
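
As a minimal sketch (not the project's actual Trainer code), the tokenizer can be instantiated with an explicit model_max_length; the t5-large name comes from the warning above, and the value 512 is illustrative:

from transformers import AutoTokenizer

# Instantiate the tokenizer with an explicit model_max_length so long inputs
# are truncated predictably and the FutureWarning goes away.
tokenizer = AutoTokenizer.from_pretrained("t5-large", model_max_length=512)

# Inputs longer than model_max_length are cut off when truncation=True.
encoded = tokenizer("a very long input ..." * 200, truncation=True)
print(len(encoded["input_ids"]))  # at most 512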

Combine NLI and NLG tasks `DatasetGenerator`

At this point, I was hoping we could avoid making the distinction between classification tasks and generation tasks. This class hierarchy is fine, but maybe we could use GenerateTaskGenerator for everything right now (even classification tasks).

Refactor `DatasetGenerator.generate_example`

These lines may need minor modification:

        prompt = self.generate_prompt(
            instruction=prompt_spec.instruction,
            examples=prompt_spec.examples,
            prompt_template=prompt_spec.prompt_template,
        )

Change model executor

        if self.tokenizer.pad_token is None:
            logging.warning(
                "Trying to init an ModelExecutor's tokenizer without pad_token"
            )
            self.tokenizer.pad_token = "[PAD]"
            self.model.config.pad_token_id = len(self.tokenizer)
            self.model.resize_token_embeddings(len(self.tokenizer))
            self.model.config.attention_mask_fn = lambda input_ids: (
                input_ids != self.model.config.pad_token_id
            ).float()

Use task-specific training architectures

Our current system makes the simplifying assumption that every task can be treated as text generation (either "text-to-text" generation or autoregressive language modeling).

This may not be appropriate for other tasks, such as text classification, span extraction, or sequence labeling. There, using custom, task-specific architectures may lead to better use of training data. We would welcome contributions from folks on inferring the correct architecture for a new task.

Change GPT2 test helper

    if gpt2_tokenizer.pad_token is None:
        gpt2_tokenizer.pad_token = "[PAD]"
        gpt2_model.config.pad_token_id = len(gpt2_tokenizer)
        gpt2_model.resize_token_embeddings(len(gpt2_tokenizer))
        gpt2_model.config.attention_mask_fn = lambda input_ids: (
            input_ids != gpt2_model.config.pad_token_id
        ).float()

Specific `Dataset` and `DatasetDict` from our skeleton

In the HuggingFace API, DatasetDict is a distinct class that contains several splits, such as train/val/test. We can access the training set directly with dataset_dict["train"] without converting anything to a dict.

However, our previous implementation misused DatasetDict: three Dataset objects, each representing a different split of one HuggingFace dataset, should be stored as a DatasetDict rather than a list[Dataset].
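
A minimal sketch of the intended shape; the column names here are illustrative only:

from datasets import Dataset, DatasetDict

# Wrap the three split Datasets in one DatasetDict instead of keeping a
# list[Dataset].
train = Dataset.from_dict({"input_col": ["a"], "output_col": ["b"]})
val = Dataset.from_dict({"input_col": ["c"], "output_col": ["d"]})
test = Dataset.from_dict({"input_col": ["e"], "output_col": ["f"]})

dataset_dict = DatasetDict({"train": train, "val": val, "test": test})
print(dataset_dict["train"])  # access a split directly, no conversion needed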

Real evaluation during Training

The previous trainer doesn't perform real evaluation. Here are my plans (a rough sketch follows the list).

  1. Change the save and evaluation strategy to epoch.
  2. Add chrf, exact_match, and bert_score as metrics computed during training.
  3. Change Trainer and TrainingArguments to Seq2SeqTrainer and Seq2SeqTrainingArguments.
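
A rough sketch of the planned setup (not the final Trainer code); the metric names follow the evaluate library, and compute_metrics here operates on already-decoded strings for brevity:

import evaluate
from transformers import Seq2SeqTrainingArguments

chrf = evaluate.load("chrf")
exact_match = evaluate.load("exact_match")

def compute_metrics(decoded_preds, decoded_labels):
    """Compute chrF and exact match over already-decoded strings."""
    return {
        "chrf": chrf.compute(predictions=decoded_preds,
                             references=[[label] for label in decoded_labels])["score"],
        "exact_match": exact_match.compute(predictions=decoded_preds,
                                           references=decoded_labels)["exact_match"],
    }

# Evaluate and save once per epoch; generation is needed for these metrics.
training_args = Seq2SeqTrainingArguments(
    output_dir="outputs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
)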

Supported Task Types

I have come to realize that some NLP tasks don't really need both an example and an annotation. Sometimes the user just wants examples without annotations.

For example:


Perhaps I just want to generate examples to train Toolformer, i.e., I just need output like the Toolformer-style prompt shown below.

Your task is to give examples of API calls by writing "[QA(question)]", where "question" is the question you want to ask. Here are some examples of API calls:

Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton, [QA("In which state is Scranton?")] Pennsylvania.

Coca-Cola, or [QA("What other name is Coca-Cola known by?")] Coke, is a carbonated soft drink manufactured by [QA("Who manufactures Coca-Cola?")] the Coca-Cola Company.

Refactor `handle_openai_error`

handle_openai_error should only handle errors; self.api_call_counter += 1 should happen whenever the API is called, not only when an error occurs.
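
A hypothetical sketch of the refactor; generate_example and _call_api are illustrative names, not the module's real ones:

def generate_example(self, prompt):
    self.api_call_counter += 1  # count every API call, success or failure
    try:
        return self._call_api(prompt)
    except Exception as e:
        return self.handle_openai_error(e)  # only deals with the error itself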

Mock model loading in `test_run_locally`

In our integration test, we're currently loading a model in the Trainer. This should be mocked to avoid downloading a full transformers model each time the test is run on a new machine.
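
A sketch of one way to mock this, assuming the Trainer loads its model via AutoModelForSeq2SeqLM.from_pretrained (an assumption, not the confirmed call site):

from unittest.mock import patch

from transformers import AutoModelForSeq2SeqLM

with patch.object(AutoModelForSeq2SeqLM, "from_pretrained") as mock_load:
    mock_load.return_value = object()  # lightweight stand-in for a real model
    # ... run the portion of test_run_locally that would otherwise download a model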

Fast tokenizer

During my alpha test, I found that our pipeline is slowed down by an unexpected bottleneck: we are not using the fast tokenizer in the trainer, which makes tokenization relatively slow.
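
A minimal sketch showing how the fast (Rust-backed) tokenizer can be requested explicitly; t5-base is just an illustrative checkpoint:

from transformers import AutoTokenizer

# AutoTokenizer returns the Rust-backed fast tokenizer when one is available;
# use_fast=True makes the intent explicit.
tokenizer = AutoTokenizer.from_pretrained("t5-base", use_fast=True)
print(tokenizer.is_fast)  # True if the fast implementation was loaded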

Implement the `prompt_spec.parse_from_prompt` function to get `instruction, examples`, or even `prompt_template`.

Hey Vijay.

For now, I mock three fields in generate_examples:

instruction = (
    "Give me some translation from Chinese to English."
    " Input Chinese and output English."
)
examples = [
    "input: '人生苦短,我用 Python', output: 'Life is short, I use Python.'",
    "input: '明天是周末', output: 'Tomorrow is weekend.'",
]
prompt_template = (
    "Requirement: {instruction} \n"
    "Few-Shot Examples: {examples} \n"
    "sample: \n"
    "annotation: \n"
    "Please answer me in JSON format, with `sample` and `annotation` keys."
)

Later, I plan to change it to simply:

# expect to parse natural_instruction and few_shot_examples from prompt_spec
# currently hard-coded
natural_instruction, few_shot_examples = prompt_spec.parse_from_prompt()

Then hard-coding input won’t be a problem.


A rather confusing part is the prompt_template, as we discussed in PR 20.

This could be done in a future PR, but you might want to make this configurable. For instance, pass into the function an argument prompt_template, which includes text as {{instruction}} and {{examples}} that gets replaced within this function.

How could we get users' prompt_template?

  1. We can let users change the code in openai.py.
  2. We can let users add this field to their input instruction, and we parse instruction, examples, and prompt_template from it.

Trainer on CPU

Our current trainer trains the model entirely on the CPU. I will add a feature to use the GPU (a minimal sketch follows).
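
A minimal sketch of the planned device handling; the Linear layer stands in for the real model:

import torch

# Train on the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(4, 2).to(device)
batch = torch.randn(8, 4).to(device)  # batches must be moved to the same device
output = model(batch)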

Add Model Retriever to skeleton architecture

We now want to add a Model Retriever to our system architecture, so the architecture diagram will look like this:
Prompt-to-deployment architecture

This is a departure from our current architecture, which assumed a fixed set of models to be finetuned:
Prompt-to-deployment architecture

To start making this change, we need to add a dummy Model Retriever to our skeleton architecture and integration tests.

Support more `Seq2SeqModel`

Currently, we only use T5 models for Seq2Seq, but Seq2Seq models can differ substantially from T5's encoder-decoder design. For our initial version, we can support only T5, but decoder-only models like BLOOMZ are increasingly popular.

Our Prompt Parser cannot support very long prompts

#34 introduced a prompt parser which uses a long "meta-prompt" to parse a given prompt. Since the "meta-prompt" is long (1.8k tokens), this limits the prompts that can be provided to our system to 2.2k tokens.

Using long-context models like GPT-4 could be one solution here, but this may be a problem we'll need to address.

Wrong post processing for decoder model

For the decoder-only model:

The training dataset should be:

model_input = (
    f"<task {task_id}> {instruction} Example: {example['input_col']}"
    + f" Label: {example['output_col']}"
)

But the test and validation datasets should not contain the Label!
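
A hedged sketch of the fix; the build_input helper is hypothetical (not a function in the codebase) and the field names mirror the snippet above:

def build_input(task_id, instruction, example, split):
    # Only the training split appends the gold label; eval/test inputs stop at
    # "Label:" so the model has to generate it.
    text = f"<task {task_id}> {instruction} Example: {example['input_col']} Label:"
    if split == "train":
        text += f" {example['output_col']}"
    return text

print(build_input(0, "Classify the sentiment.",
                  {"input_col": "great movie", "output_col": "positive"}, "val"))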

Refactor `DatasetGenerator`'s methods

Use the following prompt template and extract the label and example from the JSON response:

{Natural Instruction}
Few-shot examples: {Shuffled random examples}
New Example:
Label:
Explanation:
Please answer me in JSON format.

Add optimizer

Currently, we are not configuring an optimizer and do not set a default learning rate (a minimal sketch follows).
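
A minimal sketch of configuring an optimizer with an explicit learning rate; the values and the placeholder Linear model are illustrative only:

import torch

# Configure AdamW with an explicit learning rate instead of relying on
# implicit defaults.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)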

`OpenAIInstructionParser` cannot support `OpenAIDatasetGenerator.generate_prompt`

Today when I was debugging, I found that our generate_dataset_split needs these lines:

        prompt = self.generate_prompt(
            instruction=prompt_spec.instruction,
            examples=prompt_spec.demonstration,
            prompt_template=prompt_spec.prompt_template,
        )

Actually, OpenAIInstructionParser does not set a default prompt_template.

Furthermore, I was using examples in my DatasetGenerator, but OpenAIInstructionParser uses demonstrations.

Tokenizer padding error

Previously, we were using:

if self.tokenizer.pad_token is None:
	self.tokenizer.add_special_tokens({"pad_token": "[PAD]"})
if self.model.config.pad_token_id is None:
	self.model.config.pad_token_id = self.model.config.eos_token_id

This will fail because we set pad_token to [PAD] but assign pad_token_id to eos_token_id.

We can change it this way:

if self.tokenizer.pad_token is None:
	self.tokenizer.pad_token = self.tokenizer.eos_token
	self.model.config.pad_token_id = self.model.config.eos_token_id
	self.model.config.attention_mask_fn = lambda input_ids: (input_ids != self.model.config.pad_token_id).float()

Setting the attention_mask_fn is not necessary because the [EOS] token will automatically set the attention mask to 0.

The previous one may not be optimal:

Using the EOS token as a padding token may not be ideal because it is semantically meaningful. A separate padding token, such as [PAD], is generally preferred to avoid potential confusion or interference with the model's understanding and generation process.

So my best answer is:

if self.tokenizer.pad_token is None:
    self.tokenizer.add_special_tokens({"pad_token": "[PAD]"})
if self.model.config.pad_token_id is None:
    self.model.config.pad_token_id = len(self.tokenizer)
    self.model.resize_token_embeddings(len(self.tokenizer))
    self.model.config.attention_mask_fn = lambda input_ids: (
        input_ids != self.model.config.pad_token_id
    ).float()

Using Real Evaluation Set

We already generate a validation dataset in our dataset generator, so I think we should expose a parameter for passing in a real validation dataset.

Modify `DatasetGenerator` to not assume the existence of `few_shot_examples`

Currently, the version of DatasetGenerator in #20 explicitly assumes the existence of separate input fields for natural_instructions and few_shot_examples. Our intended design (which will be implemented in the resolution of #26) is that the user need not necessarily provide few_shot_examples.

We should change DatasetGenerator to treat few_shot_examples as an optional field.
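
A hypothetical sketch of the change; generate_prompt and its signature here are illustrative, not the class's actual API:

from typing import Optional

def generate_prompt(natural_instruction: str,
                    few_shot_examples: Optional[list[str]] = None) -> str:
    # few_shot_examples is optional; the demonstration block is simply
    # omitted when none are provided.
    prompt = f"Requirement: {natural_instruction}\n"
    if few_shot_examples:
        prompt += "Few-Shot Examples: " + " ".join(few_shot_examples) + "\n"
    return prompt

print(generate_prompt("Translate Chinese to English."))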

Retrain the final layer of model

To address the inconsistency between pretraining and finetuning (a sketch follows the list):

  1. Use the retrieved model as a pretrained feature extractor
  2. Retrain the final layer from scratch
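
A hedged sketch of the idea (not the project's actual code), using a toy model to show the pattern:

import torch
from torch import nn

class RetrievedModel(nn.Module):  # placeholder for the retrieved model
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.head = nn.Linear(16, 2)

model = RetrievedModel()
for param in model.encoder.parameters():
    param.requires_grad = False   # use the body as a frozen feature extractor
model.head = nn.Linear(16, 2)     # re-initialize the final layer and train it from scratch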

Ban `wandb` in unit tests

We really shouldn't be calling wandb in the tests. I think that os.environ["WANDB_DISABLED"] = "true" will do this.
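
A minimal sketch, assuming the flag is set before the Trainer is constructed (e.g. at the top of the test module or in conftest.py):

import os

# Disable wandb logging so unit tests never create a run.
os.environ["WANDB_DISABLED"] = "true"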

Manually configured tokenizer for decoder-only model can't be saved

I previously asked about how to add padding to a decoder-only model. I added it and completed training, but now I realize that I cannot save the model.

When trying to save, I encountered the following error: TypeError: Object of type function is not JSON serializable.

Link to the relevant GitHub issue: huggingface/transformers#5393

Lists of DatasetDicts are hard to store

I suddenly realized that the datasets/DatasetDicts are never stored anywhere in our pipeline. TextualizeProcessor returns a list of DatasetDicts, which should be stored one by one (a minimal sketch follows).
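
A minimal sketch of storing the returned DatasetDicts one by one; the paths and contents are illustrative only:

from datasets import Dataset, DatasetDict

dataset_dicts = [
    DatasetDict({"train": Dataset.from_dict({"model_input": ["a"]})}),
    DatasetDict({"train": Dataset.from_dict({"model_input": ["b"]})}),
]
for i, dataset_dict in enumerate(dataset_dicts):
    dataset_dict.save_to_disk(f"processed_datasets/dataset_{i}")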

Component Tracker

Components on our critical path to a prototype

  • Prompt Parser (#57)
  • Dataset Generator (#56)
  • Dataset Processor (#48)
  • Model Retriever
  • DataFinder integration
  • Trainer (#53)
  • #76
  • Model Executor (#61)
  • Evaluator (#77)
  • Demo Creator (#74)
  • Model Info Retrieval (#54)

Add truncation

Bug report: I was running a 2-class classification task on t5-small and found that we previously hadn't filtered out overly long training examples, which produces very long token sequences and leads to CUDA out-of-memory errors.
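
A minimal sketch of the fix: truncate at tokenization time so overly long examples cannot blow up GPU memory; t5-small matches the bug report and max_length=512 is illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoded = tokenizer(
    "a very long training example ..." * 100,
    truncation=True,
    max_length=512,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # 512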
