autolabel-cn's Introduction

⚡ Quick Install

pip install refuel-autolabel

📖 Documentation

https://docs.refuel.ai/

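🏷 What is Autolabel

Access to large, clean and diverse labeled datasets is a critical component of any successful machine learning effort. State-of-the-art LLMs such as GPT-4 can label data automatically with high accuracy, at a fraction of the cost and time of manual labeling.

Autolabel is a Python library to label, clean and enrich text datasets with any Large Language Model (LLM) of your choice.

🌟 (New!) Access RefuelLLM through Autolabel

You can access RefuelLLM through Autolabel. RefuelLLM is our recently announced LLM purpose-built for data labeling (see the announcement blog post to learn more). It is a Llama-v2-13b base model, tuned on over 2500 unique labeling tasks (5.24B tokens) spanning categories such as classification, entity resolution, matching, reading comprehension and information extraction. You can also experiment with the model in the Refuel playground.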

[Figure: Refuel performance]

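Access to RefuelLLM is available on request. See the documentation on using RefuelLLM in Autolabel.

🚀 Getting started

Autolabel provides a simple 3-step process for labeling data:

  1. Specify the labeling guidelines and the LLM to use in a JSON config.
  2. Do a dry run to make sure the final prompt looks good.
  3. Kick off a labeling run for your dataset!

Suppose we are building an ML model to analyze the sentiment of movie reviews. We have a dataset of movie reviews that we would like to get labeled first. For this case, the example dataset and config look like this: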

{
    "task_name": "MovieSentimentReview",
    "task_type": "classification",
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "dataset": {
        "label_column": "label",
        "delimiter": ","
    },
    "prompt": {
        "task_guidelines": "You are an expert at analyzing the sentiment of movie reviews. Your job is to classify the provided movie review into one of the following labels: {labels}",
        "labels": [
            "positive",
            "negative",
            "neutral"
        ],
        "few_shot_examples": [
            {
                "example": "I got a fairly uninspired stupid film about how human industry is bad for nature.",
                "label": "negative"
            },
            {
                "example": "I loved this movie. I found it very heart warming to see Adam West, Burt Ward, Frank Gorshin, and Julie Newmar together again.",
                "label": "positive"
            },
            {
                "example": "This movie will be played next week at the Chinese theater.",
                "label": "neutral"
            }
        ],
        "example_template": "Input: {example}\nOutput: {label}"
    }
}
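Before spending API calls, it can help to check that the config is internally consistent. The following is a minimal sketch in plain Python, not part of Autolabel: `check_config` is a hypothetical helper, and the config is a trimmed inline copy of the one above so the snippet runs on its own. It only verifies that every few-shot example uses a label from the `labels` list.

```python
import json

# Trimmed inline copy of the config above, so the sketch is self-contained.
CONFIG_JSON = """
{
    "task_name": "MovieSentimentReview",
    "task_type": "classification",
    "prompt": {
        "labels": ["positive", "negative", "neutral"],
        "few_shot_examples": [
            {"example": "I got a fairly uninspired stupid film about how human industry is bad for nature.", "label": "negative"},
            {"example": "This movie will be played next week at the Chinese theater.", "label": "neutral"}
        ],
        "example_template": "Input: {example}\\nOutput: {label}"
    }
}
"""


def check_config(config):
    """Return a list of problems found in the prompt section (hypothetical helper)."""
    problems = []
    prompt = config.get("prompt", {})
    labels = set(prompt.get("labels", []))
    if not labels:
        problems.append("no labels defined")
    # Every few-shot example should carry a label the task actually allows.
    for i, shot in enumerate(prompt.get("few_shot_examples", [])):
        if shot.get("label") not in labels:
            problems.append(
                f"few-shot example {i} uses unknown label {shot.get('label')!r}"
            )
    return problems


config = json.loads(CONFIG_JSON)
problems = check_config(config)
print(problems or "config looks consistent")
```

A check like this catches typos such as a few-shot label of "postive" before they reach the model.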

Initialize the labeling agent and pass it the config:

from autolabel import LabelingAgent, AutolabelDataset

agent = LabelingAgent(config='config.json')

Preview an example prompt that will be sent to the LLM:

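# Pass the same config file used for the agent (the original referenced an undefined `config` variable)
ds = AutolabelDataset('dataset.csv', config='config.json')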
agent.plan(ds)

This prints:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 0:00:00 0:00:00
┌──────────────────────────┬─────────┐
│ Total Estimated Cost     │ $0.538  │
│ Number of Examples       │ 200     │
│ Average cost per example │ 0.00269 │
└──────────────────────────┴─────────┘
─────────────────────────────────────────

Prompt Example:
You are an expert at analyzing the sentiment of movie reviews. Your job is to classify the provided movie review into one of the following labels: [positive, negative, neutral]

Some examples with their output answers are provided below:

Example: I got a fairly uninspired stupid film about how human industry is bad for nature.
Output: negative

Example: I loved this movie. I found it very heart warming to see Adam West, Burt Ward, Frank Gorshin, and Julie Newmar together again.
Output: positive

Example: This movie will be played next week at the Chinese theater.
Output: neutral

Now I want you to label the following example:
Input: A rare exception to the rule that great literature makes disappointing films.
Output:

─────────────────────────────────────────────────────────────────────────────────────────
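To make the preview above less of a black box, here is a rough sketch of how such a few-shot prompt could be assembled from the config fields. It is an illustration only, not Autolabel's internal code: `build_prompt` is a hypothetical function, and the exact wording and prefixes the library emits may differ from what this produces.

```python
def build_prompt(guidelines, labels, few_shot_examples, example_template, new_example):
    """Assemble a few-shot prompt (illustrative sketch, not the library's code)."""
    # Fill the {labels} placeholder in the task guidelines.
    parts = [guidelines.format(labels="[" + ", ".join(labels) + "]")]
    parts.append("Some examples with their output answers are provided below:")
    # Render each few-shot example through the example template.
    for shot in few_shot_examples:
        parts.append(
            example_template.format(example=shot["example"], label=shot["label"])
        )
    parts.append("Now I want you to label the following example:")
    # Leave the label empty so the model completes it.
    parts.append(example_template.format(example=new_example, label="").rstrip())
    return "\n\n".join(parts)


prompt = build_prompt(
    guidelines="Classify the provided movie review into one of the following labels: {labels}",
    labels=["positive", "negative", "neutral"],
    few_shot_examples=[
        {"example": "I loved this movie.", "label": "positive"},
    ],
    example_template="Input: {example}\nOutput: {label}",
    new_example="A rare exception to the rule that great literature makes disappointing films.",
)
print(prompt)
```

The final line of the prompt ends at "Output:" so that the model's completion is the label itself.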

Finally, we can run the labeling on a subset or the entirety of the dataset:

ds = agent.run(ds)

The output dataframe contains the label column:

ds.df.head()
                                                text  ... MovieSentimentReview_llm_label
0  I was very excited about seeing this film, ant...  ...                       negative
1  Serum is about a crazy doctor that finds a ser...  ...                       negative
4  I loved this movie. I knew it would be chocked...  ...                       positive
...

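Features

  1. Label data for NLP tasks such as classification, question answering, named entity recognition, entity matching and more.
  2. Use commercial or open source LLMs from providers such as OpenAI, Anthropic, HuggingFace, Google and more.
  3. Support for research-proven LLM techniques that boost label quality, such as few-shot learning and chain-of-thought prompting.
  4. Confidence estimation and explanations out of the box for every output label.
  5. Caching and state management to minimize cost and experimentation time.

Access to Refuel hosted LLMs

Refuel provides access to hosted open source LLMs for labeling and for estimating confidence. This is helpful because you can calibrate a confidence threshold for your labeling task and route the less confident labels to humans, while still getting the benefits of auto-labeling for the confident examples.

To use Refuel hosted LLMs, request access from Refuel.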
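The routing idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the `confidence` field name, the 0.9 threshold, and the sample rows are all made up for the example and do not come from Autolabel's output schema.

```python
def route_by_confidence(rows, threshold):
    """Split labeled rows into auto-accepted and human-review buckets."""
    auto_accepted, needs_review = [], []
    for row in rows:
        # Rows at or above the threshold keep the LLM label; the rest go to humans.
        if row["confidence"] >= threshold:
            auto_accepted.append(row)
        else:
            needs_review.append(row)
    return auto_accepted, needs_review


# Made-up rows for illustration; "confidence" is an assumed field name.
rows = [
    {"text": "I loved this movie.", "label": "positive", "confidence": 0.97},
    {"text": "It was fine, I guess.", "label": "neutral", "confidence": 0.55},
    {"text": "A fairly uninspired film.", "label": "negative", "confidence": 0.91},
]

auto_accepted, needs_review = route_by_confidence(rows, threshold=0.9)
print(len(auto_accepted), "auto-labeled;", len(needs_review), "sent for human review")
```

In practice the threshold would be calibrated on a small set of examples with known labels, trading off the share of rows auto-labeled against the accuracy required.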

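Benchmark

Check out our technical report to learn more about how various LLMs and human annotators perform on label quality, turnaround time and cost.

🛠️ Roadmap

Check out our public roadmap to learn more about ongoing and planned improvements to the Autolabel library.

We are always looking for suggestions and contributions from the community. Join the discussion on Discord or open a Github issue to report bugs and request features.

🙌 Contributing

Autolabel is a rapidly developing project. We welcome contributions in all forms - bug reports, pull requests and ideas for improving the library.

  1. Join the conversation on Discord.
  2. Open an issue on Github to report bugs and request features.
  3. Grab an open issue and submit a pull request.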

Contributors

abhinav-naikawadi, chirag-manwani, chiranthans23, dhruvabansal00, eltociear, gruentee, guptav96, iomap, jtarakram, mkchaitanya03, nihit, rajasbansal, rishabh-bhargava, sardhendu, turian, tyrest, yadavsahil197, yuanzhongqiao