fabricator's Introduction

A flexible open-source framework to generate datasets with large language models.

News

  • [10/23] We released the first version of this repository on PyPI. You can install it via pip install fabricator-ai.
  • [10/23] Our paper was accepted at EMNLP 2023; see the Citation section for the reference. The experimental scripts are available under release v0.1.0.
  • [09/23] Support for gpt-3.5-turbo-instruct added in the new Haystack release!
  • [08/23] Added several experimental scripts that investigate the generation and annotation abilities of gpt-3.5-turbo on various downstream tasks, as well as the influence of few-shot examples on downstream performance.
  • [07/23] Refactored the major classes. You can now use our BasePrompt class to create your own customized prompts for any downstream task!
  • [07/23] Added dataset transformations for token classification, so that LLMs are prompted with textual spans rather than with lists of tags.
  • [06/23] Initial version of fabricator supporting text classification and question answering tasks.

Overview

This repository:

  • is an easy-to-use open-source library to generate datasets with large language models. If you want to train a model for a specific domain, label distribution, or downstream task, you can use this framework to generate a dataset for it.
  • builds on deepset's Haystack and Hugging Face's Datasets libraries. It therefore supports a wide range of language models, and you can load and use the generated datasets for model training just as you would any other dataset from the Datasets library.
  • is highly flexible and offers various customization options, such as prompt customization, integration and sampling of few-shot examples, and annotation of unlabeled datasets.

Installation

Using conda:

git clone git@github.com:flairNLP/fabricator.git
cd fabricator
conda create -y -n fabricator python=3.10
conda activate fabricator
pip install fabricator-ai

If you want to install in editable mode instead, run the following command from the root of the cloned repository:

pip install -e .

Basic Concepts

This framework is based on the idea of using large language models to generate datasets for specific tasks. To do so, we need four basic modules: a dataset, a prompt, a language model and a generator:

  • Dataset: We use Hugging Face's Datasets library to load few-shot or unlabeled datasets and to store the generated or annotated datasets in its Dataset class. Once created, you can share the dataset with others via the Hub or use it for your model training.
  • Prompt: A prompt is the instruction given to the language model. It can be a simple sentence or a more complex template with placeholders. We provide an easy interface for custom dataset generation prompts in which you can specify label options for the LLM to choose from, provide few-shot examples to support the prompt, or annotate an unlabeled dataset in a specific way.
  • LLM: We use deepset's Haystack library as our LLM interface. Haystack supports a wide range of LLMs, including OpenAI models, all models from the Hugging Face model hub, and many more.
  • Generator: The generator is the core of this framework. It takes a dataset, a prompt, and an LLM and generates a dataset based on your specifications.
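
To picture how the four modules fit together, here is a small, library-free sketch. Everything in it is hypothetical: render_prompt, fake_llm, and generate are stand-ins invented for illustration and are not fabricator's or Haystack's API. Only the max_prompt_calls name mirrors the parameter used in the example in the next section.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for illustration only; not fabricator's real classes.


def render_prompt(task_description: str, fewshot_examples: List[Dict[str, str]]) -> str:
    """Prompt: combine the instruction with optional few-shot examples."""
    lines = [task_description]
    for example in fewshot_examples:
        lines.append(f"Example: {example['text']}")
    lines.append("Output:")
    return "\n".join(lines)


def fake_llm(prompt: str) -> str:
    """LLM: a deterministic stub standing in for a real model call."""
    return f"generated text for a prompt of {len(prompt)} characters"


def generate(
    llm: Callable[[str], str],
    task_description: str,
    fewshot_examples: List[Dict[str, str]],
    max_prompt_calls: int,
) -> List[Dict[str, str]]:
    """Generator: call the LLM max_prompt_calls times and collect the rows."""
    prompt = render_prompt(task_description, fewshot_examples)
    return [{"text": llm(prompt)} for _ in range(max_prompt_calls)]


# Dataset: a plain list of dicts stands in for a Hugging Face Dataset.
fewshot = [{"text": "A moving story with great acting."}]
rows = generate(fake_llm, "Generate a short movie review.", fewshot, max_prompt_calls=3)
print(len(rows))
```

In the real library, the stub would be replaced by a Haystack PromptNode and the list of dicts by a Dataset object; the sketch only shows the division of responsibilities.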

Examples

With our library, you can generate datasets for any task you want. Getting started is as simple as this:

Generate a dataset from scratch

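import os

from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

# Describe the task. No label options or few-shot examples are given here.
prompt = BasePrompt(
    task_description="Generate a short movie review.",
)

# Wrap the LLM in a Haystack PromptNode; the API key is read from the environment.
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

# Generate the dataset; max_prompt_calls limits the number of LLM calls.
generator = DatasetGenerator(prompt_node)
generated_dataset = generator.generate(
    prompt_template=prompt,
    max_prompt_calls=10,
)

# Share the result on the Hugging Face Hub (requires Hub authentication).
generated_dataset.push_to_hub("your-first-generated-dataset")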

In our tutorial, we show how to create classification datasets with label options to choose from, how to include few-shot examples, and how to annotate unlabeled data into predefined categories.
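
As a rough illustration of those ideas (label options, few-shot examples, annotating unlabeled text), the sketch below builds a classification prompt as a plain string and checks a model answer against the allowed labels. It is library-free. The names build_annotation_prompt and parse_label, as well as the prompt wording, are assumptions made for this example and do not reflect how fabricator formats prompts internally.

```python
from typing import Dict, List, Optional


def build_annotation_prompt(
    task_description: str,
    label_options: List[str],
    fewshot_examples: List[Dict[str, str]],
    unlabeled_text: str,
) -> str:
    """Assemble instruction, label options, few-shot examples, and the target text."""
    parts = [task_description, "Labels: " + ", ".join(label_options)]
    for example in fewshot_examples:
        parts.append(f"Text: {example['text']}\nLabel: {example['label']}")
    # The final block leaves the label open for the model to fill in.
    parts.append(f"Text: {unlabeled_text}\nLabel:")
    return "\n\n".join(parts)


def parse_label(raw_answer: str, label_options: List[str]) -> Optional[str]:
    """Return the matching label option, or None if the answer is not allowed."""
    cleaned = raw_answer.strip().lower()
    for label in label_options:
        if cleaned == label.lower():
            return label
    return None


labels = ["positive", "negative"]
fewshot = [{"text": "I loved every minute.", "label": "positive"}]
prompt = build_annotation_prompt(
    "Classify the sentiment of the movie review.",
    labels,
    fewshot,
    "The plot was dull and predictable.",
)
print(prompt)
print(parse_label(" Negative\n", labels))     # a well-formed answer
print(parse_label("somewhat mixed", labels))  # rejected: not a label option
```

Validating answers against the label options is one simple way to keep free-form model output from leaking into an annotated dataset; whether to drop or retry rejected answers is a design choice left open here.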

Citation

If you find this repository useful, please cite our work.

@inproceedings{golde2023fabricator,
    title = "Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher {LLM}s",
    author = "Golde, Jonas  and Haller, Patrick  and Hamborg, Felix  and Risch, Julian  and Akbik, Alan",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-demo.1",
    pages = "1--11",
}
