
prompt2model's Introduction

Prompt2Model - Generate Deployable Models from Instructions


Prompt2Model is a system that takes a natural language task description (like the prompts used for LLMs such as ChatGPT) and trains a small, special-purpose model that is conducive to deployment.

Prompt2Model teaser figure

Quick Start

Notebook

You can run our demo of Prompt2Model through a notebook.

Command Line

You can also run Prompt2Model through the command line.

pip install prompt2model

Prompt2Model supports various model providers, such as OpenAI, Anthropic, and Hugging Face, via LiteLLM.

If you are using OpenAI models (such as the default gpt-3.5-turbo), please obtain an OpenAI API key on their website then set the environment variable OPENAI_API_KEY to your API key by running the following command in your terminal:

export OPENAI_API_KEY=<your key>

List of all supported providers

You can then run

python prompt2model_demo.py

to create a small model from a prompt, as shown in the demo video below. This script must be run on a device with an internet connection to access the OpenAI API. For best results, run this script on a device with a GPU for training your model.

Demo

promptmodel_demo.mp4

Tips and Examples to Write a Good Prompt

You can find tips and examples for writing a good prompt in prompt_examples.

Components

The prompt2model package is composed of several components, each designed for a specific purpose. To understand how to use each component effectively, consult the readme.md file in that component's directory, located at ./prompt2model/<component>/readme.md. These files provide detailed instructions on customizing each component and getting the most out of it.

Contribution

If you're interested in contributing to the prompt2model project, please see the contribution guidelines in the repository.

Cite

We have written a paper describing Prompt2Model in detail.

If you use Prompt2Model in your research, please cite us!

If you discuss or use the overall prompt2model framework, please reference

@misc{prompt2model,
      title={Prompt2Model: Generating Deployable Models from Natural Language Instructions},
      author={Vijay Viswanathan and Chenyang Zhao and Amanda Bertsch and Tongshuang Wu and Graham Neubig},
      year={2023},
      eprint={2308.12261},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

If you discuss or use our dataset retrieval and transformation tools, please reference

@misc{prompt2modeldatatune,
      title={Better Synthetic Data by Retrieving and Transforming Existing Datasets}, 
      author={Saumya Gandhi and Ritu Gala and Vijay Viswanathan and Tongshuang Wu and Graham Neubig},
      year={2024},
      eprint={2404.14361},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

prompt2model's People

Contributors

7flash, abertsch72, anindyadeep, eltociear, farhaduneci, hitenvidhani, krrishdholakia, neubig, ritugala, saum7800, souryadipstan, viswavi, zhaochenyang20


prompt2model's Issues

Rename our `Trainer` class to `ModelTrainer`

A common problem in Python programming is giving your class the same name as a class from a third-party module.

In our model trainer, we use transformers.Trainer, which potentially conflicts with our own Trainer class name.

Tokenizer `model_max_length`

When I was running an alpha test of the Trainer, I encountered this:

FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.

Then I realized that we should set model_max_length for the T5 encoder's tokenizer.

Setting model_max_length is important for several reasons (a minimal sketch follows the list):

  1. Handling longer sequences: It allows you to process sequences that exceed the default maximum length of tokenizers without truncation or padding issues.

  2. Avoiding warnings and errors: By specifying model_max_length, you can prevent warnings and errors that may occur when tokenizing or padding longer sequences.

  3. Consistency with model requirements: Some models have specific input length requirements. Setting model_max_length ensures that the tokenizer aligns with the expected input length of the model.

  4. Customizing sequence lengths: You can tailor the maximum length of tokenized sequences to suit your needs, providing flexibility for tasks that require longer context or models that support extended sequences.
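
As a minimal sketch (not the project's actual Trainer code), the tokenizer can be instantiated with an explicit model_max_length; the t5-large name comes from the warning above, and the value 512 is illustrative:

from transformers import AutoTokenizer

# Instantiate the tokenizer with an explicit model_max_length so long inputs
# are truncated predictably and the FutureWarning goes away.
tokenizer = AutoTokenizer.from_pretrained("t5-large", model_max_length=512)

# Inputs longer than model_max_length are cut off when truncation=True.
encoded = tokenizer("a very long input ..." * 200, truncation=True)
print(len(encoded["input_ids"]))  # at most 512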

Combine NLI and NLG tasks `DatasetGenerator`

At this point, I was hoping we could avoid making the distinction between classification tasks and generation tasks. This class hierarchy is fine, but maybe we could use GenerateTaskGenerator for everything right now (even classification tasks).

Refactor `DatasetGenerator.generate_example`

These lines may need minor modification:

        prompt = self.generate_prompt(
            instruction=prompt_spec.instruction,
            examples=prompt_spec.examples,
            prompt_template=prompt_spec.prompt_template,
        )

Change model executor

        if self.tokenizer.pad_token is None:
            logging.warning(
                "Trying to init an ModelExecutor's tokenizer without pad_token"
            )
            self.tokenizer.pad_token = "[PAD]"
            self.model.config.pad_token_id = len(self.tokenizer)
            self.model.resize_token_embeddings(len(self.tokenizer))
            self.model.config.attention_mask_fn = lambda input_ids: (
                input_ids != self.model.config.pad_token_id
            ).float()

Use task-specific training architectures

Our current system makes the simplifying assumption that every task can be treated as text generation (either "text-to-text" generation or autoregressive language modeling).

This may not be appropriate for other tasks, such as text classification, span extraction, or sequence labeling. There, using custom, task-specific architectures may lead to better use of training data. We would welcome contributions from folks on inferring the correct architecture for a new task.

Change GPT2 test helper

    if gpt2_tokenizer.pad_token is None:
        gpt2_tokenizer.pad_token = "[PAD]"
        gpt2_model.config.pad_token_id = len(gpt2_tokenizer)
        gpt2_model.resize_token_embeddings(len(gpt2_tokenizer))
        gpt2_model.config.attention_mask_fn = lambda input_ids: (
            input_ids != gpt2_model.config.pad_token_id
        ).float()

Specific `Dataset` and `DatasetDict` from our skeleton

In the HuggingFace API, DatasetDict is a distinct class that contains several splits, such as train/val/test. We can access the training set directly with dataset_dict["train"] without converting anything to a dict.

However, our previous implementation misused DatasetDict: three Dataset objects, each representing a different split of one HuggingFace dataset, should be stored as a DatasetDict rather than a list[Dataset].
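
A minimal sketch of the intended shape; the column names here are illustrative only:

from datasets import Dataset, DatasetDict

# Wrap the three split Datasets in one DatasetDict instead of keeping a
# list[Dataset].
train = Dataset.from_dict({"input_col": ["a"], "output_col": ["b"]})
val = Dataset.from_dict({"input_col": ["c"], "output_col": ["d"]})
test = Dataset.from_dict({"input_col": ["e"], "output_col": ["f"]})

dataset_dict = DatasetDict({"train": train, "val": val, "test": test})
print(dataset_dict["train"])  # access a split directly, no conversion needed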

Real evaluation during Training

The previous trainer doesn't perform real evaluation. Here are my plans (a rough sketch follows the list).

  1. Change the save and evaluation strategy to epoch.
  2. Add chrf, exact_match, and bert_score as metrics computed during training.
  3. Change Trainer and TrainingArguments to Seq2SeqTrainer and Seq2SeqTrainingArguments.
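
A rough sketch of the planned setup (not the final Trainer code); the metric names follow the evaluate library, and compute_metrics here operates on already-decoded strings for brevity:

import evaluate
from transformers import Seq2SeqTrainingArguments

chrf = evaluate.load("chrf")
exact_match = evaluate.load("exact_match")

def compute_metrics(decoded_preds, decoded_labels):
    """Compute chrF and exact match over already-decoded strings."""
    return {
        "chrf": chrf.compute(predictions=decoded_preds,
                             references=[[label] for label in decoded_labels])["score"],
        "exact_match": exact_match.compute(predictions=decoded_preds,
                                           references=decoded_labels)["exact_match"],
    }

# Evaluate and save once per epoch; generation is needed for these metrics.
training_args = Seq2SeqTrainingArguments(
    output_dir="outputs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
)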

Supported Task Types

I have come to realize that some NLP tasks don't really need both an example and an annotation. Sometimes the user just wants examples without annotations.

For example:


Perhaps I just want to generate examples to train Toolformer, i.e., I just need output like the Toolformer-style prompt shown below.

Your task is to give examples of API calls by writing "[QA(question)]", where "question" is the question you want to ask. Here are some examples of API calls:

Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton, [QA("In which state is Scranton?")] Pennsylvania.

Coca-Cola, or [QA("What other name is Coca-Cola known by?")] Coke, is a carbonated soft drink manufactured by [QA("Who manufactures Coca-Cola?")] the Coca-Cola Company.

Refactor `handle_openai_error`

handle_openai_error should only handle errors; self.api_call_counter += 1 should happen whenever the API is called, not only when an error occurs.
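
A hypothetical sketch of the refactor; generate_example and _call_api are illustrative names, not the module's real ones:

def generate_example(self, prompt):
    self.api_call_counter += 1  # count every API call, success or failure
    try:
        return self._call_api(prompt)
    except Exception as e:
        return self.handle_openai_error(e)  # only deals with the error itself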

Mock model loading in `test_run_locally`

In our integration test, we're currently loading a model in the Trainer. This should be mocked to avoid downloading a full transformers model each time the test is run on a new machine.
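
A sketch of one way to mock this, assuming the Trainer loads its model via AutoModelForSeq2SeqLM.from_pretrained (an assumption, not the confirmed call site):

from unittest.mock import patch

from transformers import AutoModelForSeq2SeqLM

with patch.object(AutoModelForSeq2SeqLM, "from_pretrained") as mock_load:
    mock_load.return_value = object()  # lightweight stand-in for a real model
    # ... run the portion of test_run_locally that would otherwise download a model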

Fast tokenizer

During my alpha test, I found that our pipeline is slowed down by an unexpected bottleneck: we are not using the fast tokenizer in the trainer, which makes tokenization relatively slow.
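
A minimal sketch showing how the fast (Rust-backed) tokenizer can be requested explicitly; t5-base is just an illustrative checkpoint:

from transformers import AutoTokenizer

# AutoTokenizer returns the Rust-backed fast tokenizer when one is available;
# use_fast=True makes the intent explicit.
tokenizer = AutoTokenizer.from_pretrained("t5-base", use_fast=True)
print(tokenizer.is_fast)  # True if the fast implementation was loaded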

Implement the `prompt_spec.parse_from_prompt` function to get `instruction, examples`, or even `prompt_template`.

Hey Vijay.

For now, I mock three fields in generate_examples:

instruction = (
    "Give me some translation from Chinese to English."
    " Input Chinese and output English."
)
examples = [
    "input: '人生苦短,我用 Python', output: 'Life is short, I use Python.'",
    "input: '明天是周末', output: 'Tomorrow is weekend.'",
]
prompt_template = (
    "Requirement: {instruction} \n"
    "Few-Shot Examples: {examples} \n"
    "sample: \n"
    "annotation: \n"
    "Please answer me in JSON format, with `sample` and `annotation` keys."
)

Later, I plan to change it to simply:

# expect to parse natural_instruction and few_shot_examples from prompt_spec
# currently hard-coded
natural_instruction, few_shot_examples = prompt_spec.parse_from_prompt()

Then hard-coding input won’t be a problem.


A rather confusing part is the prompt_template, as we discussed in PR 20.

This could be done in a future PR, but you might want to make this configurable. For instance, pass into the function an argument prompt_template, which includes text as {{instruction}} and {{examples}} that gets replaced within this function.

How could we get users' prompt_template?

  1. We can let users change the code in openai.py.
  2. We can let users add this field to their input instruction, and we parse instruction, examples, and prompt_template from it.

Trainer on CPU

Our current trainer trains the model entirely on the CPU. I will add a feature to use the GPU (a minimal sketch follows).
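
A minimal sketch of the planned device handling; the Linear layer stands in for the real model:

import torch

# Train on the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(4, 2).to(device)
batch = torch.randn(8, 4).to(device)  # batches must be moved to the same device
output = model(batch)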

Add Model Retriever to skeleton architecture

We now want to add a Model Retriever to our system architecture, so the architecture diagram will look like this:
Prompt-to-deployment architecture

This is a departure from our current architecture, which assumed a fixed set of models to be finetuned:
Prompt-to-deployment architecture

To start making this change, we need to add a dummy Model Retriever to our skeleton architecture and integration tests.

Support more `Seq2SeqModel`

Currently, we only use T5 models for Seq2Seq, but Seq2Seq models can differ substantially from T5's encoder-decoder design. For our initial version, we can support only T5, but decoder-only models like BLOOMZ are increasingly popular.

Our Prompt Parser cannot support very long prompts

#34 introduced a prompt parser which uses a long "meta-prompt" to parse a given prompt. Since the "meta-prompt" is long (1.8k tokens), this limits the prompts that can be provided to our system to 2.2k tokens.

Using long-context models like GPT-4 could be one solution here, but this may be a problem we'll need to address.

Wrong post processing for decoder model

For the decoder-only model:

The training dataset should be:

model_input = (
    f"<task {task_id}> {instruction} Example: {example['input_col']}"
    + f" Label: {example['output_col']}"
)

But the test and validation datasets should not contain the Label!
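
A hedged sketch of the fix; the build_input helper is hypothetical (not a function in the codebase) and the field names mirror the snippet above:

def build_input(task_id, instruction, example, split):
    # Only the training split appends the gold label; eval/test inputs stop at
    # "Label:" so the model has to generate it.
    text = f"<task {task_id}> {instruction} Example: {example['input_col']} Label:"
    if split == "train":
        text += f" {example['output_col']}"
    return text

print(build_input(0, "Classify the sentiment.",
                  {"input_col": "great movie", "output_col": "positive"}, "val"))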

Refactor `DatasetGenerator`'s methods

Use the following prompt template and extract the label and example from the JSON response:

{Natural Instruction}
Few-shot examples: {Shuffled random examples}
New Example:
Label:
Explanation:
Please answer me in JSON format.

Add optimizer

Currently, we are not configuring an optimizer and do not set a default learning rate (a minimal sketch follows).
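
A minimal sketch of configuring an optimizer with an explicit learning rate; the values and the placeholder Linear model are illustrative only:

import torch

# Configure AdamW with an explicit learning rate instead of relying on
# implicit defaults.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)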

`OpenAIInstructionParser` cannot support `OpenAIDatasetGenerator.generate_prompt`

Today when I was debugging, I found that our generate_dataset_split needs these lines:

        prompt = self.generate_prompt(
            instruction=prompt_spec.instruction,
            examples=prompt_spec.demonstration,
            prompt_template=prompt_spec.prompt_template,
        )

Actually, OpenAIInstructionParser does not set a default prompt_template.

Furthermore, I was using examples in my DatasetGenerator, but OpenAIInstructionParser uses demonstrations.

Tokenizer padding error

Previously, we were using:

if self.tokenizer.pad_token is None:
	self.tokenizer.add_special_tokens({"pad_token": "[PAD]"})
if self.model.config.pad_token_id is None:
	self.model.config.pad_token_id = self.model.config.eos_token_id

This will fail because we set pad_token to [PAD] but assign pad_token_id to eos_token_id.

We can change it this way:

if self.tokenizer.pad_token is None:
	self.tokenizer.pad_token = self.tokenizer.eos_token
	self.model.config.pad_token_id = self.model.config.eos_token_id
	self.model.config.attention_mask_fn = lambda input_ids: (input_ids != self.model.config.pad_token_id).float()

Setting the attention_mask_fn is not necessary because the [EOS] token will automatically set the attention mask to 0.

The previous one may not be optimal:

Using the EOS token as a padding token may not be ideal because it is semantically meaningful. A separate padding token, such as [PAD], is generally preferred to avoid potential confusion or interference with the model's understanding and generation process.

So my best answer is:

if self.tokenizer.pad_token is None:
    self.tokenizer.add_special_tokens({"pad_token": "[PAD]"})
if self.model.config.pad_token_id is None:
    self.model.config.pad_token_id = len(self.tokenizer)
    self.model.resize_token_embeddings(len(self.tokenizer))
    self.model.config.attention_mask_fn = lambda input_ids: (
        input_ids != self.model.config.pad_token_id
    ).float()

Using Real Evaluation Set

We already generate a validation dataset in our dataset generator, so I think we should expose a parameter for passing in a real validation dataset.

Modify `DatasetGenerator` to not assume the existence of `few_shot_examples`

Currently, the version of DatasetGenerator in #20 explicitly assumes the existence of separate input fields for natural_instructions and few_shot_examples. Our intended design (which will be implemented in the resolution of #26) is that the user need not necessarily provide few_shot_examples.

We should change DatasetGenerator to treat few_shot_examples as an optional field.
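
A hypothetical sketch of the change; generate_prompt and its signature here are illustrative, not the class's actual API:

from typing import Optional

def generate_prompt(natural_instruction: str,
                    few_shot_examples: Optional[list[str]] = None) -> str:
    # few_shot_examples is optional; the demonstration block is simply
    # omitted when none are provided.
    prompt = f"Requirement: {natural_instruction}\n"
    if few_shot_examples:
        prompt += "Few-Shot Examples: " + " ".join(few_shot_examples) + "\n"
    return prompt

print(generate_prompt("Translate Chinese to English."))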

Retrain the final layer of model

To address the inconsistency between pretraining and finetuning (a sketch follows the list):

  1. Use the retrieved model as a pretrained feature extractor
  2. Retrain the final layer from scratch
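
A hedged sketch of the idea (not the project's actual code), using a toy model to show the pattern:

import torch
from torch import nn

class RetrievedModel(nn.Module):  # placeholder for the retrieved model
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.head = nn.Linear(16, 2)

model = RetrievedModel()
for param in model.encoder.parameters():
    param.requires_grad = False   # use the body as a frozen feature extractor
model.head = nn.Linear(16, 2)     # re-initialize the final layer and train it from scratch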

Ban `wandb` in unit tests

We really shouldn't be calling wandb in the tests. I think that os.environ["WANDB_DISABLED"] = "true" will do this.
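
A minimal sketch, assuming the flag is set before the Trainer is constructed (e.g. at the top of the test module or in conftest.py):

import os

# Disable wandb logging so unit tests never create a run.
os.environ["WANDB_DISABLED"] = "true"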

Manually configured tokenizer for decoder-only model can't be saved

I previously asked about how to add padding to a decoder-only model. I added it and completed training, but now I realize that I cannot save the model.

When trying to save, I encountered the following error: TypeError: Object of type function is not JSON serializable.

Link to the relevant GitHub issue: huggingface/transformers#5393

Lists of DatasetDicts are hard to store

I suddenly realized that the datasets/DatasetDicts are never stored anywhere in our pipeline. TextualizeProcessor returns a list of DatasetDicts, which should be stored one by one (a minimal sketch follows).
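
A minimal sketch of storing the returned DatasetDicts one by one; the paths and contents are illustrative only:

from datasets import Dataset, DatasetDict

dataset_dicts = [
    DatasetDict({"train": Dataset.from_dict({"model_input": ["a"]})}),
    DatasetDict({"train": Dataset.from_dict({"model_input": ["b"]})}),
]
for i, dataset_dict in enumerate(dataset_dicts):
    dataset_dict.save_to_disk(f"processed_datasets/dataset_{i}")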

Component Tracker

Components on our critical path to a prototype

  • Prompt Parser (#57)
  • Dataset Generator (#56)
  • Dataset Processor (#48)
  • Model Retriever
  • DataFinder integration
  • Trainer (#53)
  • #76
  • Model Executor (#61)
  • Evaluator (#77)
  • Demo Creator (#74)
  • Model Info Retrieval (#54)

Add truncation

Bug report: I was running a 2-class classification task on t5-small and found that we previously hadn't filtered out overly long training examples, which produces very long token sequences and leads to CUDA out-of-memory errors.
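
A minimal sketch of the fix: truncate at tokenization time so overly long examples cannot blow up GPU memory; t5-small matches the bug report and max_length=512 is illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoded = tokenizer(
    "a very long training example ..." * 100,
    truncation=True,
    max_length=512,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # 512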
