hustai / uie_pytorch

A PyTorch implementation of the PaddleNLP UIE model

License: Apache License 2.0


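Universal Information Extraction (UIE), PyTorch version

A port of the UIE model from PaddleNLP to PyTorch.

2022-10-03: Added support for the UIE-M model series, together with a tokenizer for ErnieM. ErnieMTokenizer speeds up text preprocessing with FasterTokenizer, a high-performance tokenization operator implemented in C++. Install the FasterTokenizer library with pip install faster_tokenizer before using it.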

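Features of the PyTorch version

convert.py: downloads and converts models automatically; see Out-of-the-box usage.
doccano.py: converts annotated data; see Data annotation.
evaluate.py: evaluates a model; see Model evaluation.
export_model.py: exports an ONNX inference model; see Model deployment.
finetune.py: fine-tunes a model; see Model fine-tuning.
model.py: model definition.
uie_predictor.py: inference class.

1. Model overview
2. Application examples
3. Out-of-the-box usage
4. Custom training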

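1. Model overview

UIE (Universal Information Extraction): Yaojie Lu et al. proposed UIE, a unified framework for universal information extraction, at ACL 2022. The framework models entity extraction, relation extraction, event extraction, sentiment analysis and similar tasks in a unified way, and transfers and generalizes well across tasks. To make UIE easy to use, PaddleNLP followed the method of the paper, trained a model on top of the knowledge-enhanced pretrained model ERNIE 3.0, and open-sourced it as the first Chinese universal information extraction model. The model extracts key information without restricting the industry domain or the extraction target, supports zero-shot cold start, and adapts quickly to a specific extraction target through few-shot fine-tuning.

Advantages of UIE

Easy to use: extraction targets are defined in natural language, and the corresponding information is extracted from the input text without any training, so the model works out of the box for a wide range of information extraction needs.
Lower cost, higher efficiency: earlier information extraction techniques needed large amounts of annotated data to reach good quality. Open-domain information extraction supports zero-shot and few-shot extraction, which greatly reduces the dependence on annotated data, lowering cost while improving results.
Strong results: open-domain information extraction performs well across many scenarios and tasks.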

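2. Application examples

UIE does not restrict the industry domain or the extraction target. Some zero-shot industry examples:

Medical: structuring disease-specific records

Legal: extraction from court judgments

Finance: extraction from income certificates and prospectuses

Public security: extraction from accident reports

Tourism: extraction from brochures and handbooks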

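3. Out-of-the-box usage

uie_predictor provides universal information extraction and opinion extraction. It can extract many kinds of information, including but not limited to named entities (person, place and organization names), relations (the director of a film, the release date of a song), events (a car accident at an intersection, an earthquake in some region), as well as aspects, opinion words and sentiment polarity. Extraction targets are defined in natural language, and the corresponding information is extracted from the input text without any training.

uie_predictor now downloads models automatically, so a manual convert step is no longer required. To convert a model manually, follow the steps below.

Download and convert a model. The command below downloads the Paddle uie-base model into the current directory and generates the PyTorch model uie_base_pytorch.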

python convert.py

If paddlenlp is not installed, use the following command instead. It does not import paddlenlp and does not validate the conversion result.

python convert.py --no_validate_output

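Configurable parameters:

input_model: the folder containing the input model. For example, if the model is at ./model_path/model_state.pdparams, pass ./model_path. If you pass a name from the model list, such as uie-base or uie-tiny, and no folder with that name exists in the current directory, the model is downloaded automatically. Defaults to uie-base.

Models that can be downloaded automatically
- uie-base
- uie-medium
- uie-mini
- uie-micro
- uie-nano
- uie-medical-base
- uie-tiny (deprecated, renamed to uie-medium)
- uie-base-en
- uie-m-base
- uie-m-large
- ernie-3.0-base-zh*
output_model: the folder for the output model. Defaults to uie_base_pytorch.
no_validate_output: disables validation of the output model. Validation is enabled by default.

* : ernie-3.0-base-zh is not validated during conversion and must be fine-tuned before it can be used for prediction.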

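3.1 Entity extraction

Named Entity Recognition (NER) identifies entities with a specific meaning in text. In open-domain information extraction the categories are not restricted, and users can define their own.

For example, to extract the entity types "时间" (time), "选手" (player) and "赛事名称" (competition name), construct the schema as follows: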

['时间', '选手', '赛事名称']

Example call:

>>> from uie_predictor import UIEPredictor
>>> from pprint import pprint

>>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
>>> ie = UIEPredictor(model='uie-base', schema=schema)
>>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中**选手谷爱凌以188.25分获得金牌！")) # Better print results using pprint
[{'时间': [{'end': 6,
          'probability': 0.9857378532924486,
          'start': 0,
          'text': '2月8日上午'}],
  '赛事名称': [{'end': 23,
            'probability': 0.8503089953268272,
            'start': 6,
            'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
  '选手': [{'end': 31,
          'probability': 0.8981548639781138,
          'start': 28,
          'text': '谷爱凌'}]}]
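
Each returned span carries character offsets into the input text. The sketch below is only an illustration of that convention, not part of uie_predictor: the helper names `iter_spans` and `spans_match_text` are hypothetical, and the result literal is copied from the output above with the probabilities rounded.

```python
# Input text and result copied from the example above (probabilities rounded).
text = "2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中**选手谷爱凌以188.25分获得金牌！"
result = [{
    '时间': [{'start': 0, 'end': 6, 'probability': 0.9857, 'text': '2月8日上午'}],
    '赛事名称': [{'start': 6, 'end': 23, 'probability': 0.8503,
              'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
    '选手': [{'start': 28, 'end': 31, 'probability': 0.8982, 'text': '谷爱凌'}],
}]


def iter_spans(result):
    """Yield (label, span) pairs from a list of per-text result dicts."""
    for per_text in result:
        for label, spans in per_text.items():
            for span in spans:
                yield label, span


def spans_match_text(text, result):
    """Return True if every span's text equals text[start:end]."""
    return all(text[s['start']:s['end']] == s['text']
               for _, s in iter_spans(result))


labels = sorted(label for label, _ in iter_spans(result))
print(spans_match_text(text, result), labels)
```

The `end` offset is exclusive, so `text[start:end]` recovers the span directly.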

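For example, to extract the entity types "肿瘤的大小" (tumor size), "肿瘤的个数" (number of tumors), "肝癌级别" (liver cancer grade) and "脉管内癌栓分级" (microvascular invasion grade), construct the schema as follows: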

['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级']

The previous example already created a UIEPredictor object, so the extraction target can be reset with the set_schema method.

Example call:

>>> schema = ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级']
>>> ie.set_schema(schema)
>>> pprint(ie("（右肝肿瘤）肝细胞性肝癌（II-III级，梁索型和假腺管型），肿瘤包膜不完整，紧邻肝被膜，侵及周围肝组织，未见脉管内癌栓（MVI分级：M0级）及卫星子灶形成。（肿物1个，大小4.2×4.0×2.8cm）。"))
[{'肝癌级别': [{'end': 20,
            'probability': 0.9243267447402701,
            'start': 13,
            'text': 'II-III级'}],
  '肿瘤的个数': [{'end': 84,
            'probability': 0.7538413804059623,
            'start': 82,
            'text': '1个'}],
  '肿瘤的大小': [{'end': 100,
            'probability': 0.8341128043459491,
            'start': 87,
            'text': '4.2×4.0×2.8cm'}],
  '脉管内癌栓分级': [{'end': 70,
              'probability': 0.9083292325934664,
              'start': 67,
              'text': 'M0级'}]}]

For example, to extract the entity types "Person" and "Organization", construct the schema as follows:

['Person', 'Organization']

Example call with the English model:

>>> from uie_predictor import UIEPredictor
>>> from pprint import pprint
>>> schema = ['Person', 'Organization']
>>> ie_en = UIEPredictor(model='uie-base-en', schema=schema)
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[{'Organization': [{'end': 53,
                    'probability': 0.9985840259877357,
                    'start': 48,
                    'text': 'Apple'}],
  'Person': [{'end': 14,
              'probability': 0.999631971804547,
              'start': 9,
              'text': 'Steve'}]}]

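3.2 Relation extraction

Relation Extraction (RE) identifies entities in text and extracts the semantic relations between them, yielding triples of the form <subject, predicate, object>.

For example, with "竞赛名称" (competition name) as the subject and the relation types "主办方" (organizer), "承办方" (host) and "已举办次数" (number of editions held), construct the schema as follows: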

{
  '竞赛名称': [
    '主办方',
    '承办方',
    '已举办次数'
  ]
}

Example call:

>>> schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction
>>> ie.set_schema(schema) # Reset schema
>>> pprint(ie('2022语言与智能技术竞赛由**中文信息学会和**计算机学会联合主办，百度公司、**中文信息学会评测工作委员会和**计算机学会自然语言处理专委会承办，已连续举办4届，成为全球最热门的中文NLP赛事之一。'))
[{'竞赛名称': [{'end': 13,
            'probability': 0.7825402622754041,
            'relations': {'主办方': [{'end': 22,
                                  'probability': 0.8421710521379353,
                                  'start': 14,
                                  'text': '**中文信息学会'},
                                  {'end': 30,
                                  'probability': 0.7580801847701935,
                                  'start': 23,
                                  'text': '**计算机学会'}],
                          '已举办次数': [{'end': 82,
                                    'probability': 0.4671295049136148,
                                    'start': 80,
                                    'text': '4届'}],
                          '承办方': [{'end': 39,
                                  'probability': 0.8292706618236352,
                                  'start': 35,
                                  'text': '百度公司'},
                                  {'end': 72,
                                  'probability': 0.6193477885474685,
                                  'start': 56,
                                  'text': '**计算机学会自然语言处理专委会'},
                                  {'end': 55,
                                  'probability': 0.7000497331473241,
                                  'start': 40,
                                  'text': '**中文信息学会评测工作委员会'}]},
            'start': 0,
            'text': '2022语言与智能技术竞赛'}]}]

For example, with "Person" as the subject and the relation types "Company" and "Position", construct the schema as follows:

{
  'Person': [
    'Company',
    'Position'
  ]
}

Example call with the English model:

>>> schema = [{'Person': ['Company', 'Position']}]
>>> ie_en.set_schema(schema)
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[{'Person': [{'end': 14,
              'probability': 0.999631971804547,
              'relations': {'Company': [{'end': 53,
                                        'probability': 0.9960158209451642,
                                        'start': 48,
                                        'text': 'Apple'}],
                            'Position': [{'end': 44,
                                          'probability': 0.8871063806420736,
                                          'start': 41,
                                          'text': 'CEO'}]},
              'start': 9,
              'text': 'Steve'}]}]
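
The nested `relations` field can be flattened into the <subject, predicate, object> triples described at the start of this section. A minimal sketch, assuming the result layout shown above; `to_triples` is a hypothetical helper, not a function of uie_predictor, and the probabilities are rounded.

```python
# Result copied from the English example above (probabilities rounded).
result = [{'Person': [{
    'text': 'Steve', 'start': 9, 'end': 14, 'probability': 0.9996,
    'relations': {
        'Company': [{'text': 'Apple', 'start': 48, 'end': 53, 'probability': 0.9960}],
        'Position': [{'text': 'CEO', 'start': 41, 'end': 44, 'probability': 0.8871}],
    },
}]}]


def to_triples(result):
    """Flatten nested relation results into (subject, predicate, object) tuples."""
    triples = []
    for per_text in result:
        for subjects in per_text.values():
            for subject in subjects:
                # Subjects without any extracted relation have no 'relations' key.
                for predicate, objects in subject.get('relations', {}).items():
                    for obj in objects:
                        triples.append((subject['text'], predicate, obj['text']))
    return triples


for triple in to_triples(result):
    print(triple)
```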

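3.3 Event extraction

Event Extraction (EE) extracts predefined event triggers and event arguments from natural language text and combines them into structured event information.

For example, to extract the "地震强度" (magnitude), "时间" (time), "震中位置" (epicenter location) and "震源深度" (focal depth) of a "地震" (earthquake) event, construct the schema as follows: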

{
  '地震触发词': [
    '地震强度',
    '时间',
    '震中位置',
    '震源深度'
  ]
}

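The trigger key always has the form `触发词` (trigger) or `XX触发词`, where `XX` is the specific event type. In the example above the event type is `地震`, so the trigger key is `地震触发词`.

Example call: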

>>> schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction
>>> ie.set_schema(schema) # Reset schema
>>> ie('**地震台网正式测定：5月16日06时08分在云南临沧市凤庆县(北纬24.34度，东经99.98度)发生3.5级地震，震源深度10千米。')
[{'地震触发词': [{'text': '地震', 'start': 56, 'end': 58, 'probability': 0.9987181623528585, 'relations': {'地震强度': [{'text': '3.5级', 'start': 52, 'end': 56, 'probability': 0.9962985320905915}], '时间': [{'text': '5月16日06时08分', 'start': 11, 'end': 22, 'probability': 0.9882578028575182}], '震中位置': [{'text': '云南临沧市凤庆县(北纬24.34度，东经99.98度)', 'start': 23, 'end': 50, 'probability': 0.8551415716584501}], '震源深度': [{'text': '10千米', 'start': 63, 'end': 67, 'probability': 0.999158304648045}]}}]}]
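
The trigger naming rule can be wrapped in a small helper so that event schemas are built consistently. This is a sketch of the naming convention only; `event_schema` is a hypothetical name and not part of uie_predictor.

```python
def event_schema(event_type, arguments):
    """Build an event schema keyed by '<event_type>触发词'.

    An empty event_type yields the generic key '触发词'.
    """
    trigger = f"{event_type}触发词" if event_type else "触发词"
    return {trigger: list(arguments)}


schema = event_schema('地震', ['地震强度', '时间', '震中位置', '震源深度'])
print(schema)
```

The resulting dict can be passed to `set_schema` in the same way as the hand-written schema above.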

The English model does not support event extraction yet.

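3.4 Opinion extraction

Opinion extraction extracts the aspects mentioned in a text together with their opinion words.

For example, to extract each aspect together with its opinion word and sentiment polarity, construct the schema as follows: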

{
  '评价维度': [
    '观点词',
    '情感倾向[正向，负向]'
  ]
}

Example call:

>>> schema = {'评价维度': ['观点词', '情感倾向[正向，负向]']} # Define the schema for opinion extraction
>>> ie.set_schema(schema) # Reset schema
>>> pprint(ie("店面干净，很清静，服务员服务热情，性价比很高，发现收银台有排队")) # Better print results using pprint
[{'评价维度': [{'end': 20,
            'probability': 0.9817040258681473,
            'relations': {'情感倾向[正向，负向]': [{'probability': 0.9966142505350533,
                                          'text': '正向'}],
                          '观点词': [{'end': 22,
                                  'probability': 0.957396472711558,
                                  'start': 21,
                                  'text': '高'}]},
            'start': 17,
            'text': '性价比'},
          {'end': 2,
            'probability': 0.9696849569741168,
            'relations': {'情感倾向[正向，负向]': [{'probability': 0.9982153274927796,
                                          'text': '正向'}],
                          '观点词': [{'end': 4,
                                  'probability': 0.9945318044652538,
                                  'start': 2,
                                  'text': '干净'}]},
            'start': 0,
            'text': '店面'}]}]

For the English model, construct the schema as follows:

{
  'Aspect': [
    'Opinion',
    'Sentiment classification [negative, positive]'
  ]
}

Example call:

>>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]
>>> ie_en.set_schema(schema)
>>> pprint(ie_en("The teacher is very nice."))
[{'Aspect': [{'end': 11,
              'probability': 0.4301476415932193,
              'relations': {'Opinion': [{'end': 24,
                                        'probability': 0.9072940447883724,
                                        'start': 15,
                                        'text': 'very nice'}],
                            'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685,
                                                                              'text': 'positive'}]},
              'start': 4,
              'text': 'teacher'}]}]

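3.5 Sentiment classification

Sentence-level sentiment classification decides whether the sentiment of a sentence is “正向” (positive) or “负向” (negative). Construct the schema as follows: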

'情感倾向[正向，负向]'

Example call:

>>> schema = '情感倾向[正向，负向]' # Define the schema for sentence-level sentiment classification
>>> ie.set_schema(schema) # Reset schema
>>> ie('这个产品用起来真的很流畅，我非常喜欢')
[{'情感倾向[正向，负向]': [{'text': '正向', 'probability': 0.9988661643929895}]}]
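
A classification schema string packs a prompt name and its candidate labels into one string, with the labels in square brackets. The sketch below only illustrates that string convention; how uie_predictor parses it internally is not shown here, and `parse_cls_prompt` is a hypothetical helper. It accepts both the full-width comma used in the Chinese schema and the ASCII comma used in the English one.

```python
import re


def parse_cls_prompt(prompt):
    """Split 'name[opt1，opt2]' into (name, [opt1, opt2]).

    A prompt without a bracketed option list returns (prompt, []).
    """
    match = re.fullmatch(r'\s*(.+?)\s*\[(.*)\]\s*', prompt)
    if match is None:
        return prompt.strip(), []
    name, raw_options = match.groups()
    # Options may be separated by ASCII or full-width commas.
    options = [opt.strip() for opt in re.split(r'[,，]', raw_options) if opt.strip()]
    return name, options


print(parse_cls_prompt('情感倾向[正向，负向]'))
print(parse_cls_prompt('Sentiment classification [negative, positive]'))
```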

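For the English model, construct the schema as follows:

```text
'Sentiment classification [negative, positive]'
```

Example call with the English model: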

```python
>>> schema = 'Sentiment classification [negative, positive]'
>>> ie_en.set_schema(schema)
>>> ie_en('I am sorry but this is the worst film I have ever seen in my life.')
[{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}]
```

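3.6 Cross-task extraction

For example, to run entity extraction and relation extraction on the same text in a legal scenario, construct the schema as follows: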

[
  "法院",
  {
      "原告": "委托代理人"
  },
  {
      "被告": "委托代理人"
  }
]

Example call:

>>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]
>>> ie.set_schema(schema)
>>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告：张三。\n委托代理人李四，北京市 A律师事务所律师。\n被告：B公司，法定代表人王五，开发公司总经理。\n委托代理人赵六，北京市 C律师事务所律师。")) # Better print results using pprint
[{'原告': [{'end': 37,
          'probability': 0.9949814024296764,
          'relations': {'委托代理人': [{'end': 46,
                                  'probability': 0.7956844697990384,
                                  'start': 44,
                                  'text': '李四'}]},
          'start': 35,
          'text': '张三'}],
  '法院': [{'end': 10,
          'probability': 0.9221074192336651,
          'start': 0,
          'text': '北京市海淀区人民法院'}],
  '被告': [{'end': 67,
          'probability': 0.8437349536631089,
          'relations': {'委托代理人': [{'end': 92,
                                  'probability': 0.7267121388225029,
                                  'start': 90,
                                  'text': '赵六'}]},
          'start': 64,
          'text': 'B公司'}]}]
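
A schema can mix plain strings (entities) with dicts (a subject and its relations), and dict values can be a single string or a list. The sketch below normalizes all of these forms into (parent, name) pairs, which makes the two extraction levels explicit. It is an illustration under that reading of the schema format, not the predictor's own schema handling; `flatten_schema` is a hypothetical name.

```python
def flatten_schema(schema, parent=None):
    """Return (parent, name) pairs for every node of a UIE-style schema.

    Top-level targets have parent None; relation targets carry the name
    of the subject type they belong to.
    """
    if isinstance(schema, str):
        return [(parent, schema)]
    pairs = []
    if isinstance(schema, dict):
        for name, children in schema.items():
            pairs.append((parent, name))
            pairs.extend(flatten_schema(children, parent=name))
        return pairs
    # Lists (or tuples) hold sibling nodes at the same level.
    for item in schema:
        pairs.extend(flatten_schema(item, parent=parent))
    return pairs


schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]
for parent, name in flatten_schema(schema):
    print(parent, name)
```

Here "法院", "原告" and "被告" are top-level targets, while each "委托代理人" is attached to its own subject, matching the `relations` nesting in the output above.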

3.7 Model selection

Several models are available to match different accuracy and speed requirements.

Model	Structure	Language
`uie-base` (default)	12-layers, 768-hidden, 12-heads	Chinese
`uie-base-en`	12-layers, 768-hidden, 12-heads	English
`uie-medical-base`	12-layers, 768-hidden, 12-heads	Chinese
`uie-medium`	6-layers, 768-hidden, 12-heads	Chinese
`uie-mini`	6-layers, 384-hidden, 12-heads	Chinese
`uie-micro`	4-layers, 384-hidden, 12-heads	Chinese
`uie-nano`	4-layers, 312-hidden, 12-heads	Chinese
`uie-m-large`	24-layers, 1024-hidden, 16-heads	Chinese, English
`uie-m-base`	12-layers, 768-hidden, 12-heads	Chinese, English

Example call with uie-nano:

>>> from uie_predictor import UIEPredictor

>>> schema = ['时间', '选手', '赛事名称']
>>> ie = UIEPredictor('uie-nano', schema=schema)
>>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中**选手谷爱凌以188.25分获得金牌！")
[{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}]

uie-m-base and uie-m-large support mixed Chinese and English extraction. Example call:

>>> from pprint import pprint
>>> from uie_predictor import UIEPredictor

>>> schema = ['Time', 'Player', 'Competition', 'Score']
>>> ie = UIEPredictor(schema=schema, model="uie-m-base", schema_lang="en")
>>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中**选手谷爱凌以188.25分获得金牌！", "Rafael Nadal wins French Open Final!"]))
[{'Competition': [{'end': 23,
                  'probability': 0.9373889907291257,
                  'start': 6,
                  'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
  'Player': [{'end': 31,
              'probability': 0.6981119555336441,
              'start': 28,
              'text': '谷爱凌'}],
  'Score': [{'end': 39,
            'probability': 0.9888507878270296,
            'start': 32,
            'text': '188.25分'}],
  'Time': [{'end': 6,
            'probability': 0.9784080036931151,
            'start': 0,
            'text': '2月8日上午'}]},
{'Competition': [{'end': 35,
                  'probability': 0.9851549932171295,
                  'start': 18,
                  'text': 'French Open Final'}],
  'Player': [{'end': 12,
              'probability': 0.9379371275888104,
              'start': 0,
              'text': 'Rafael Nadal'}]}]

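3.8 More configuration

>>> from uie_predictor import UIEPredictor

>>> schema = ['时间', '选手', '赛事名称']
>>> ie = UIEPredictor('uie-nano',
...                   schema=schema)

model: the model used for the task. Defaults to uie-base. Options are uie-base, uie-medium, uie-mini, uie-micro, uie-nano, uie-medical-base, uie-base-en, uie-m-base and uie-m-large.
schema: the extraction target of the task. See the example calls for the different tasks in Out-of-the-box usage.
schema_lang: the language of the schema. Defaults to zh; options are zh and en. Chinese and English schemas are constructed differently, so the schema language must be specified. This parameter only takes effect for the uie-m-base and uie-m-large models.
batch_size: batch size; adjust it to your hardware. Defaults to 1.
task_path: path to a custom model.
position_prob: the model outputs a probability between 0 and 1 for the start and end position of each span. Results below this threshold are dropped. Defaults to 0.5. The final probability of a span is the product of its start and end position probabilities.
use_fp16: whether to use FP16 for acceleration. Disabled by default. FP16 inference is faster. To use it, first make sure the NVIDIA driver and base software are installed correctly, with CUDA >= 11.2 and cuDNN >= 8.1.1; on first use, install the required dependencies as prompted. Second, the GPU must have a CUDA Compute Capability of at least 7.0. Typical devices include V100, T4, A10, A100 and GeForce RTX 20-series and 30-series cards. For more on CUDA Compute Capability and supported precisions, see the NVIDIA documentation: GPU hardware and supported precision matrix.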
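
To make the role of position_prob concrete, here is a simplified sketch of threshold-based span decoding. It assumes per-character start and end probabilities, keeps positions whose probability is at least the threshold, and pairs each start with the nearest end at or after it. The pairing rule and the function name `decode_spans` are assumptions made for illustration; the actual decoding in uie_predictor may differ.

```python
def decode_spans(start_probs, end_probs, position_prob=0.5):
    """Return (start, end, probability) tuples, with end exclusive.

    Positions below position_prob are discarded. The span probability is
    the product of its start and end position probabilities.
    """
    starts = [i for i, p in enumerate(start_probs) if p >= position_prob]
    ends = [i for i, p in enumerate(end_probs) if p >= position_prob]
    spans = []
    for s in starts:
        # Pair each start with the nearest end at or after it (assumption).
        e = next((e for e in ends if e >= s), None)
        if e is not None:
            spans.append((s, e + 1, start_probs[s] * end_probs[e]))
    return spans


start_probs = [0.9, 0.1, 0.0, 0.7, 0.2]
end_probs = [0.0, 0.8, 0.1, 0.3, 0.6]
print(decode_spans(start_probs, end_probs))
print(decode_spans(start_probs, end_probs, position_prob=0.65))
```

Raising the threshold trades recall for precision: in the second call the weaker end position is discarded, so only one span survives.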

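4. Custom training

For simple extraction targets, UIEPredictor can be used directly for zero-shot extraction. For more specialized scenarios we recommend light customization (fine-tuning the model on a small amount of annotated data) to further improve results. The following example, extracting information from expense reimbursement forms, shows how to fine-tune a UIE model with 5 training examples.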

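4.1 Code structure

.
├── utils.py          # data processing utilities
├── model.py          # model definition script
├── doccano.py        # annotation data conversion script
├── doccano.md        # annotation guide
├── finetune.py       # fine-tuning script
├── evaluate.py       # evaluation script
└── README.md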

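4.2 Data annotation

We recommend the doccano annotation platform. This example covers the whole path from annotation to training: after exporting data from doccano, the doccano.py script converts it into the format the model expects. For details on annotation, see the doccano annotation guide.

Raw data example: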

深大到双龙28块钱4月24号交通费

The extraction target (schema) is:

schema = ['出发地', '目的地', '费用', '时间']

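Annotation steps:

In doccano, create an annotation project of type sequence labeling.
Define the entity label categories. In the example above, the labels to define are 出发地 (origin), 目的地 (destination), 费用 (cost) and 时间 (time).
Start annotating the data with the labels defined above.

After annotation, export the file from doccano, rename it to doccano_ext.json, and place it in the ./data directory.
A pre-annotated doccano_ext.json is also provided; it can be downloaded and placed in ./data directly. Run the following script to convert the data. It generates the training, validation and test set files in ./data.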

python doccano.py \
    --doccano_file ./data/doccano_ext.json \
    --task_type ext \
    --save_dir ./data \
    --splits 0.8 0.2 0
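
The converted files hold one JSON object per line with the fields content, result_list and prompt, where prompt names the extraction target and result_list holds the matching spans. The sketch below builds such records for the raw example above. It is a simplified illustration, not the logic of doccano.py: `build_examples` is a hypothetical helper, offsets are found with `str.find` (first occurrence only), and no negative examples are generated.

```python
import json


def build_examples(content, annotations, schema):
    """Build one record per schema label.

    annotations maps a label to the entity strings annotated in content.
    """
    examples = []
    for label in schema:
        result_list = []
        for entity in annotations.get(label, []):
            start = content.find(entity)  # first occurrence only
            if start >= 0:
                result_list.append(
                    {'text': entity, 'start': start, 'end': start + len(entity)})
        examples.append(
            {'content': content, 'result_list': result_list, 'prompt': label})
    return examples


content = '深大到双龙28块钱4月24号交通费'
schema = ['出发地', '目的地', '费用', '时间']
annotations = {'出发地': ['深大'], '目的地': ['双龙'],
               '费用': ['28块钱'], '时间': ['4月24号']}

examples = build_examples(content, annotations, schema)
lines = [json.dumps(ex, ensure_ascii=False) for ex in examples]
print('\n'.join(lines))
```

Each label becomes its own record, so the model is trained to answer one prompt at a time against the same content.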

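Configurable parameters:

doccano_file: the annotation file exported from doccano.
save_dir: the directory where training data is saved. Defaults to the data directory.
negative_ratio: the maximum ratio of negative examples. It only applies to extraction tasks; constructing a suitable number of negatives can improve model quality. The number of negatives depends on the actual number of labels: maximum negatives = negative_ratio * number of positives. It only applies to the training set and defaults to 5. To keep evaluation metrics accurate, the validation and test sets are built with all negatives by default.
splits: the proportions of the training, validation and test sets when splitting the data. Defaults to [0.8, 0.1, 0.1], which splits the data 8:1:1 into training, validation and test sets.
task_type: the task type; options are extraction and classification.
options: the class labels for classification tasks. Only applies to classification tasks. Defaults to ["正向", "负向"].
prompt_prefix: the prompt prefix for classification tasks. Only applies to classification tasks. Defaults to "情感倾向".
is_shuffle: whether to shuffle the dataset. Defaults to True.
seed: random seed. Defaults to 1000.
separator: the separator between the entity category or aspect and the classification label. Only applies to entity-level or aspect-level classification tasks. Defaults to "##".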

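Notes:

By default, doccano.py splits the data into train/dev/test sets according to the given proportions.
Each run of doccano.py overwrites existing data files with the same names.
We recommend constructing some negative examples during training to improve model quality, and the conversion step has this built in. negative_ratio controls the proportion of automatically constructed negatives: number of negatives = negative_ratio * number of positives.
Every record in a file exported from doccano is assumed to have been annotated correctly by a human.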
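
The splitting and negative-sampling arithmetic can be sketched as follows. This mirrors the parameter descriptions (splits, is_shuffle, seed, negative_ratio) but is not the code of doccano.py; the function names and the rounding of split boundaries are assumptions.

```python
import random


def split_dataset(examples, splits=(0.8, 0.1, 0.1), is_shuffle=True, seed=1000):
    """Split examples into (train, dev, test) by the given proportions."""
    if len(splits) != 3 or abs(sum(splits) - 1.0) > 1e-9:
        raise ValueError('splits must hold three ratios that sum to 1')
    items = list(examples)
    if is_shuffle:
        # A seeded generator keeps the split reproducible.
        random.Random(seed).shuffle(items)
    n_train = int(len(items) * splits[0])
    n_dev = int(len(items) * (splits[0] + splits[1]))
    return items[:n_train], items[n_train:n_dev], items[n_dev:]


def max_negatives(num_positives, negative_ratio=5):
    """Upper bound on constructed negatives for a training example."""
    return negative_ratio * num_positives


train, dev, test = split_dataset(range(10), splits=(0.8, 0.2, 0))
print(len(train), len(dev), len(test))
print(max_negatives(4))
```

With `--splits 0.8 0.2 0`, as in the command above, no test set is produced and all remaining data goes to the validation set.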

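For annotation rules and parameters for other task types (relation extraction, event extraction, opinion extraction and so on), see the doccano annotation guide.

Data can also be annotated with the Label Studio platform. This example provides the labelstudio2doccano.py script, which converts a JSON file exported from Label Studio into the doccano export format. The subsequent data conversion and fine-tuning steps stay the same.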

python labelstudio2doccano.py --labelstudio_file label-studio.json

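Configurable parameters:

labelstudio_file: path to the Label Studio export file (JSON format only).
doccano_file: path where the doccano-format data file is saved. Defaults to "doccano_ext.jsonl".
task_type: the task type; options are extraction ("ext") and classification ("cls"). Defaults to "ext".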

4.3 Model fine-tuning

Run the following command to fine-tune the model:

python finetune.py \
    --train_path "./data/train.txt" \
    --dev_path "./data/dev.txt" \
    --save_dir "./checkpoint" \
    --learning_rate 1e-5 \
    --batch_size 16 \
    --max_seq_len 512 \
    --num_epochs 100 \
    --model "uie_base_pytorch" \
    --seed 1000 \
    --logging_steps 10 \
    --valid_steps 100 \
    --device "gpu"

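Configurable parameters:

train_path: path to the training set file.
dev_path: path to the validation set file.
save_dir: directory where models are saved. Defaults to ./checkpoint.
learning_rate: learning rate. Defaults to 1e-5.
batch_size: batch size; adjust it to your hardware. Defaults to 16.
max_seq_len: maximum length of a text segment. Inputs longer than this are split automatically. Defaults to 512.
num_epochs: number of training epochs. Defaults to 100.
model: the model to fine-tune. Defaults to uie_base_pytorch.
seed: random seed. Defaults to 1000.
logging_steps: number of steps between log messages. Defaults to 10.
valid_steps: number of steps between evaluations. Defaults to 100.
device: the device used for training; options are cpu and gpu.
max_model_num: number of saved models to keep, not counting those saved as model_best and by early stopping. Defaults to 5.
early_stopping: whether to use early stopping. Disabled by default.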
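
The automatic splitting mentioned for max_seq_len can be pictured as cutting a long input into segments and shifting span offsets back into the coordinates of the full text. The sketch below works at character level only. The real implementation also has to budget for the prompt and special tokens, which is not modeled here, and both helper names are hypothetical.

```python
def split_text(text, max_len):
    """Cut text into (offset, segment) pairs of at most max_len characters."""
    if max_len <= 0:
        raise ValueError('max_len must be positive')
    return [(i, text[i:i + max_len]) for i in range(0, len(text), max_len)]


def shift_span(span, offset):
    """Map a span found inside a segment back to full-text offsets."""
    return {**span, 'start': span['start'] + offset, 'end': span['end'] + offset}


segments = split_text('abcdefghij', 4)
print(segments)

# A span found at positions 0..2 of the last segment, mapped back.
print(shift_span({'text': 'ij', 'start': 0, 'end': 2}, offset=segments[-1][0]))
```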

4.4 Model evaluation

Run the following command to evaluate the model:

python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/dev.txt \
    --batch_size 16 \
    --max_seq_len 512

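Evaluation method: evaluation is single-stage. For tasks that are predicted in stages, such as relation extraction and event extraction, the predictions of each stage are evaluated separately. By default, the validation and test sets use all labels at the same level to construct the full set of negatives.

Debug mode evaluates each positive class separately. It is intended for model debugging only: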

python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/dev.txt \
    --debug

Example output:

[2022-09-14 03:13:58,877] [    INFO] - -----------------------------
[2022-09-14 03:13:58,877] [    INFO] - Class Name: 疾病
[2022-09-14 03:13:58,877] [    INFO] - Evaluation Precision: 0.89744 | Recall: 0.83333 | F1: 0.86420
[2022-09-14 03:13:59,145] [    INFO] - -----------------------------
[2022-09-14 03:13:59,145] [    INFO] - Class Name: 手术治疗
[2022-09-14 03:13:59,145] [    INFO] - Evaluation Precision: 0.90000 | Recall: 0.85714 | F1: 0.87805
[2022-09-14 03:13:59,439] [    INFO] - -----------------------------
[2022-09-14 03:13:59,440] [    INFO] - Class Name: 检查
[2022-09-14 03:13:59,440] [    INFO] - Evaluation Precision: 0.77778 | Recall: 0.56757 | F1: 0.65625
[2022-09-14 03:13:59,708] [    INFO] - -----------------------------
[2022-09-14 03:13:59,709] [    INFO] - Class Name: X的手术治疗
[2022-09-14 03:13:59,709] [    INFO] - Evaluation Precision: 0.90000 | Recall: 0.85714 | F1: 0.87805
[2022-09-14 03:13:59,893] [    INFO] - -----------------------------
[2022-09-14 03:13:59,893] [    INFO] - Class Name: X的实验室检查
[2022-09-14 03:13:59,894] [    INFO] - Evaluation Precision: 0.71429 | Recall: 0.55556 | F1: 0.62500
[2022-09-14 03:14:00,057] [    INFO] - -----------------------------
[2022-09-14 03:14:00,058] [    INFO] - Class Name: X的影像学检查
[2022-09-14 03:14:00,058] [    INFO] - Evaluation Precision: 0.69231 | Recall: 0.45000 | F1: 0.54545
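
The three figures on each line are related in the usual way: F1 is the harmonic mean of precision and recall. The sketch below computes them from span counts. The counts 35, 39 and 42 are hypothetical values chosen only because they reproduce the figures logged for the first class; the actual counts behind the log are not given.

```python
def precision_recall_f1(num_correct, num_predicted, num_gold):
    """Compute precision, recall and F1 from span counts."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 35 correct spans out of 39 predicted and 42 gold.
p, r, f1 = precision_recall_f1(35, 39, 42)
print(f'Precision: {p:.5f} | Recall: {r:.5f} | F1: {f1:.5f}')
```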

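Configurable parameters:

model_path: the folder of the model to evaluate. It must contain the weights file pytorch_model.bin and the configuration file config.json.
test_path: the test set file used for evaluation.
batch_size: batch size; adjust it to your hardware. Defaults to 16.
max_seq_len: maximum length of a text segment. Inputs longer than this are split automatically. Defaults to 512.
device: the device used for evaluation; options are cpu and gpu.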

4.5 One-step prediction with a custom model

UIEPredictor loads a custom model through task_path, which points to the directory containing the trained weights file pytorch_model.bin.

>>> from pprint import pprint
>>> from uie_predictor import UIEPredictor

>>> schema = ['出发地', '目的地', '费用', '时间']
# Set the extraction target and the path to the custom model weights
>>> my_ie = UIEPredictor(model='uie-base',task_path='./checkpoint/model_best', schema=schema)
>>> pprint(my_ie("城市内交通费7月5日金额114广州至佛山"))
[{'出发地': [{'end': 17,
           'probability': 0.9975287467835301,
           'start': 15,
           'text': '广州'}],
  '时间': [{'end': 10,
          'probability': 0.9999476678061399,
          'start': 6,
          'text': '7月5日'}],
  '目的地': [{'end': 20,
           'probability': 0.9998511131226735,
           'start': 18,
           'text': '佛山'}],
  '费用': [{'end': 15,
          'probability': 0.9994474579292856,
          'start': 12,
          'text': '114'}]}]

4.6 Experimental results

We ran experiments on in-house test sets from three verticals: internet, medical and finance.

	Finance		Medical		Internet
	0-shot	5-shot	0-shot	5-shot	0-shot	5-shot
uie-base (12L768H)	46.43	70.92	71.83	85.72	78.33	81.86
uie-medium (6L768H)	41.11	64.53	65.40	75.72	78.32	79.68
uie-mini (6L384H)	37.04	64.65	60.50	78.36	72.09	76.38
uie-micro (4L384H)	37.53	62.11	57.04	75.92	66.00	70.22
uie-nano (4L312H)	38.94	66.83	48.29	76.74	62.86	72.35
uie-m-large (24L1024H)	49.35	74.55	70.50	92.66	78.49	83.02
uie-m-base (12L768H)	38.46	74.31	63.37	87.32	76.27	80.13

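0-shot means predicting directly with UIEPredictor without any training data; 5-shot means fine-tuning with 5 annotated examples per class. The experiments show that in vertical scenarios UIE improves further with a small amount of data (few-shot).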

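4.7 Model deployment

The following is the Python deployment workflow for UIE, covering environment setup, model export and usage examples.

Environment setup

UIE can be deployed on CPU or GPU. Install the dependencies that match your deployment environment.
- CPU
  
  For CPU deployment, install the required dependencies with the following command: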
```
pip install onnx onnxruntime
```
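- GPU
  
  For the best inference performance and stability on GPU, first make sure the NVIDIA driver and base software are installed correctly, with CUDA >= 11.2 and cuDNN >= 8.1.1, then install the required dependencies with the following command: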
```
pip install onnx onnxconverter_common onnxruntime-gpu
```
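  To deploy with half precision (FP16), the GPU must have a CUDA Compute Capability of at least 7.0. Typical devices include V100, T4, A10, A100 and GeForce RTX 20-series and 30-series cards. For more on CUDA Compute Capability and supported precisions, see the NVIDIA documentation: GPU hardware and supported precision matrix.

Model export

Export the fine-tuned model to an ONNX inference model: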
```
python export_model.py --model_path ./checkpoint/model_best --output_path ./export
```
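Configurable parameters:
- model_path: the directory containing the fine-tuned model, with the weights file pytorch_model.bin and the configuration file config.json.
- output_path: the export directory. Defaults to model_path, i.e. the exported model is saved alongside the input model.

Inference
- CPU inference example
  
  On CPU, deploy with the following command: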
```
python uie_predictor.py --task_path ./export --engine onnx --device cpu
```
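  Configurable parameters:
  - model: the model used for the task. Defaults to uie-base. Options are uie-base, uie-medium, uie-mini, uie-micro, uie-nano, uie-medical-base and uie-base-en.
  - task_path: the folder containing the ONNX model used for inference. For example, if the model file is ./export/inference.onnx, pass ./export. If not set, the model corresponding to model is downloaded automatically.
  - position_prob: the model outputs a probability between 0 and 1 for the start and end position of each span. Results below this threshold are dropped. Defaults to 0.5. The final probability of a span is the product of its start and end position probabilities.
  - max_seq_len: maximum length of a text segment. Inputs longer than this are split automatically. Defaults to 512.
  - engine: the inference engine; options are pytorch and onnx.
- GPU inference example
  
  On GPU, deploy with the following command: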
```
python uie_predictor.py --task_path ./export --engine onnx --device gpu --use_fp16
```
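  Configurable parameters:
  - model: the model used for the task. Defaults to uie-base. Options are uie-base, uie-medium, uie-mini, uie-micro, uie-nano, uie-medical-base and uie-base-en.
  - task_path: the folder containing the ONNX model used for inference. For example, if the model file is ./export/inference.onnx, pass ./export. If not set, the model corresponding to model is downloaded automatically.
  - use_fp16: whether to use FP16 for acceleration. Disabled by default.
  - position_prob: the model outputs a probability between 0 and 1 for the start and end position of each span. Results below this threshold are dropped. Defaults to 0.5. The final probability of a span is the product of its start and end position probabilities.
  - max_seq_len: maximum length of a text segment. Inputs longer than this are split automatically. Defaults to 512.
  - engine: the inference engine; options are pytorch and onnx.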

uie_pytorch's Issues

Meaning of prompt in the data format

What does the prompt field mean in the data produced by doccano.py, and what is it used for?

Converting uie-m-base fails with AttributeError: 'ErnieMTokenizer' object has no attribute 'vocab'

Traceback (most recent call last):
File "/root/autodl-tmp/uie_pytorch/uie_predictor.py", line 679, in
uie = UIEPredictor(model=args.model, task_path=args.task_path, schema_lang=args.schema_lang, schema=args.schema, engine=args.engine, device=args.device,
File "/root/autodl-tmp/uie_pytorch/uie_predictor.py", line 147, in init
self._prepare_predictor()
File "/root/autodl-tmp/uie_pytorch/uie_predictor.py", line 162, in _prepare_predictor
self._tokenizer = ErnieMTokenizerFast.from_pretrained(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2017, in from_pretrained
return cls._from_pretrained(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2049, in _from_pretrained
slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2249, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/root/autodl-tmp/uie_pytorch/tokenizer.py", line 139, in init
super().init(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 367, in init
self._add_tokens(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
current_vocab = self.get_vocab().copy()
File "/root/autodl-tmp/uie_pytorch/tokenizer.py", line 185, in get_vocab
return dict(self.vocab, **self.added_tokens_encoder)
AttributeError: 'ErnieMTokenizer' object has no attribute 'vocab'

Parameter error

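Running python convert.py directly produces the following error:
TypeError: forward() got an unexpected keyword argument 'pos_ids'

What causes this?

Information extraction with prompts (in-context learning)

Is there a tutorial on information extraction based on prompts (in-context learning)?

Is there a problem with ernie-m? Fine-tuning on my own dataset fails halfway through training

Is fine-tuning for ordinary classification not supported?

I mean standard classification, not sentiment classification. Model training does not seem to work, even though I changed the type from ext to cls.

Error when exporting to ONNX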

Bug in ErnieMConverter Class

Using -m-large version, but met a bug in class ErnieMConverter(Converter):

Traceback (most recent call last):
  File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/src/run.py", line 23, in <module>
    ie = UIEPredictor(model='uie-m-large', schema=schema, device="cuda" if torch.cuda.is_available() else "cpu")
  File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/uie_pytorch/uie_predictor.py", line 146, in __init__
    self._prepare_predictor()
  File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/uie_pytorch/uie_predictor.py", line 160, in _prepare_predictor
    self._tokenizer = ErnieMTokenizerFast.from_pretrained(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2017, in from_pretrained
    return cls._from_pretrained(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2249, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/uie_pytorch/tokenizer.py", line 477, in __init__
    super().__init__(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 1342, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/uie_pytorch/tokenizer.py", line 576, in __init__
    from transformers.utils import sentencepiece_model_pb2 as model_pb2
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
    _descriptor.EnumValueDescriptor(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 789, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Pretrained model does not exist

OSError: uie_base_pytorch is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

Is there a limit on the number of attributes in a schema?

For example, with schema=['a','b','c','d','e'], attribute 'b' never appears in my validation results. Why is that?

KeyError occur when select ernie_3.0_base_zh to convert

run:
python convert.py -i ernie-3.0-base-zh --no_validate_output

got:

2023-03-01 01:34:35.798449: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2023-03-01 01:34:37,186] [ INFO] - Downloading resource files...
[2023-03-01 01:34:37,187] [ INFO] - Downloading ernie_3.0_base_zh.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams
[2023-03-01 01:37:55,405] [ INFO] - Downloading ernie_3.0_base_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt
[2023-03-01 01:37:55,798] [ INFO] - ====================save config file====================
[2023-03-01 01:37:55,800] [ INFO] - ====================save vocab file====================
[2023-03-01 01:37:55,801] [ INFO] - ====================extract weights====================
Traceback (most recent call last):
File "convert.py", line 468, in
do_main()
File "convert.py", line 427, in do_main
extract_and_convert(args.input_model, args.output_model, verbose=True)
File "convert.py", line 297, in extract_and_convert
del paddle_paddle_params['StructuredToParameterName@@']
KeyError: 'StructuredToParameterName@@'

Running finetune.py locally raises a @runtime_checkable error

"D:\Program Files\anaconda3\envs\uie\python.exe" "D:/Program Files/JetBrains/PyCharm 2022.2.2/plugins/python/helpers/pydev/pydevd.py" --multiprocess --qt-support=auto --client 127.0.0.1 --port 49998 --file D:\work\develop\python_work\uie_work\uie_pytorch\finetune.py --train_path ./data/train.txt --dev_path ./data/dev.txt --save_dir ./checkpoint --learning_rate 1e-5 --batch_size 4 --max_seq_len 512 --num_epochs 100 --model uie_base_pytorch --seed 1000 --logging_steps 10 --valid_steps 100 --device cpu
Traceback (most recent call last):
File "D:\Program Files\JetBrains\PyCharm 2022.2.2\plugins\python\helpers\pydev\pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\Program Files\JetBrains\PyCharm 2022.2.2\plugins\python\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "D:\work\develop\python_work\uie_work\uie_pytorch\finetune.py", line 253, in <module>
do_train()
File "D:\work\develop\python_work\uie_work\uie_pytorch\finetune.py", line 44, in do_train
train_data_loader = DataLoader(
File "D:\Program Files\anaconda3\envs\uie\lib\site-packages\torch\utils\data\dataloader.py", line 200, in __init__
if isinstance(dataset, IterableDataset):
File "D:\Program Files\anaconda3\envs\uie\lib\typing.py", line 1498, in __instancecheck__
raise TypeError("Instance and class checks can only be used with"
TypeError: Instance and class checks can only be used with @runtime_checkable protocols
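
The message itself is raised by Python's typing module whenever `isinstance` is used with a Protocol class that lacks `@runtime_checkable`. A minimal reproduction, independent of torch, is sketched below. The class names are made up for illustration, and why torch's `IterableDataset` hits this path in this environment is not established here.

```python
from typing import Protocol, runtime_checkable


class Closable(Protocol):
    """A protocol without @runtime_checkable: isinstance() is not allowed."""
    def close(self) -> None: ...


@runtime_checkable
class CheckedClosable(Protocol):
    """The same protocol, but usable with isinstance()."""
    def close(self) -> None: ...


class Resource:
    def close(self) -> None:
        pass


def check(obj, proto):
    """Return the isinstance() result, or the error text if the check is rejected."""
    try:
        return isinstance(obj, proto)
    except TypeError as exc:
        return str(exc)
```

Calling `check(Resource(), Closable)` returns the error text, while `check(Resource(), CheckedClosable)` returns a boolean.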

Is fine-tuning supported for sentiment classification?

Training set:
{"content": "不错的上网本，外形很漂亮，操作系统应该是个很大的卖点，电池还可以。整体上讲，作为一个上网本的定位，还是不错的。\t", "result_list": [{"text": "正向", "start": -7, "end": -5}], "prompt": "情感倾向[正向,负向]"}
{"content": "<荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!\t", "result_list": [{"text": "正向", "start": -4, "end": -2}], "prompt": "情感倾向[负向,正向]"}

Fine-tuning sentiment classification with this data fails with:
RequestsDependencyWarning)
Traceback (most recent call last):
File "finetune.py", line 253, in <module>
do_train()
File "finetune.py", line 35, in do_train
tokenizer = BertTokenizerFast.from_pretrained(args.model)
File "/home/ma-user/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1706, in from_pretrained
local_files_only=local_files_only,
File "/home/ma-user/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/transformers/utils/hub.py", line 711, in get_file_from_repo
use_auth_token=use_auth_token,
File "/home/ma-user/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/transformers/utils/hub.py", line 292, in cached_path
local_files_only=local_files_only,
File "/home/ma-user/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/transformers/utils/hub.py", line 563, in get_from_cache
"Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

Error when converting the PyTorch model to ONNX

Running the following command raises an error:
python export_model.py --model_path ./ckpt/anno_data_0210_ckpt_10words_ent/model_best/ --output_path ./export

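Question about a uie pytorch config parameter

The config.json of uie-base has no task_id value, so at runtime task_type_embeddings never takes effect? Is the default value 0 always used?

evaluate.py raises an error when run

Changing utils.py at line 680 as follows fixes this bug: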

def get_relation_type_dict(relation_data):
    def compare(a, b):
        a = a[::-1]
        b = b[::-1]
        res = ''
        for i in range(min(len(a), len(b))):
            if a[i] == b[i]:
                res += a[i]
            else:
                break
        if res == "":
            return res
        elif res[::-1][0] == "的":
            return res[::-1][1:]
        return ""
    relation_type_dict = {}
    added_list = []
    for i in range(len(relation_data)):
        added = False
        if relation_data[i][0] not in added_list:
            for j in range(i + 1, len(relation_data)):
                match = compare(relation_data[i][0], relation_data[j][0])
                if match != "":
                    match = unify_prompt_name(match)
                    if relation_data[i][0] not in added_list:
                        added_list.append(relation_data[i][0])
                        relation_type_dict.setdefault(match, []).append(
                            relation_data[i][1])
                    added_list.append(relation_data[j][0])
                    relation_type_dict.setdefault(match, []).append(
                        relation_data[j][1])
                    added = True
            if not added:
                added_list.append(relation_data[i][0])
                suffix = relation_data[i][0].rsplit("的", 1)[1]
                suffix = unify_prompt_name(suffix)
                # This branch appears to run when there is only one object. The commented-out
                # assignment below would store a dict (not a list) in relation_type_dict.
                relation_type_dict.setdefault(suffix, []).append(
                    relation_data[i][1])
                # relation_type_dict[suffix] = relation_data[i][1]
    return relation_type_dict
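
The difference between the two lines in the fix can be seen in isolation. In this small sketch with made-up data (the keys and values are illustrative only), direct assignment stores the bare item, while `setdefault(...).append(...)` always keeps a list, matching what the function builds for every other key.

```python
def group_by_assignment(pairs):
    grouped = {}
    for key, value in pairs:
        grouped[key] = value  # overwrites, and stores the bare value
    return grouped


def group_by_setdefault(pairs):
    grouped = {}
    for key, value in pairs:
        # creates the list on first use, then appends to it
        grouped.setdefault(key, []).append(value)
    return grouped


pairs = [("director", {"id": 1}), ("director", {"id": 2}), ("actor", {"id": 3})]
```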

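Does this version support multi-GPU training now?

Training seems to use only one GPU.

Why does the paper use a Transformer architecture while the actual implementation uses BERT?

Validation fails when converting uie-base to torch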

`[2024-04-16 12:03:00,092] [ INFO] - Validating PyTorch model...

W0416 12:03:00.840634 21336 gpu_resources.cc:275] WARNING: device: . The installed Paddle is compiled with CUDNN 8.6, but CUDNN version in your machine is 8.1, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.

[2024-04-16 12:03:03,910] [ INFO] - -[✓] Pytorch model output names match reference model ({'start_prob', 'end_prob'})

[2024-04-16 12:03:03,930] [ INFO] - - Validating PyTorch Model output "start_prob":

[2024-04-16 12:03:03,934] [ INFO] - -[✓] (1, 512) matches (1, 512)

[2024-04-16 12:03:03,943] [ INFO] - -[x] values not close enough (atol: 1e-05)

Traceback (most recent call last):
File ".\convert.py", line 478, in <module>
do_main()
File ".\convert.py", line 462, in do_main
validate_model(tokenizer, model, paddle_model, model_type)
File ".\convert.py", line 424, in validate_model
"Outputs values doesn't match between reference model and Pytorch converted model: "
ValueError: Outputs values doesn't match between reference model and Pytorch converted model: Got max absolute difference of: 0.6232820749282837`

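GPU utilization fluctuates too much during training

During training, GPU memory usage barely changes and stays at around 14 GB, but utilization fluctuates heavily. My initial guess is that utilization is too low during the evaluate step.

Versions of dependency libraries

Which versions of transformers and the other libraries are required?

On GPU, an error occurs when the Chinese text is too long; shorter text works. The error message is: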

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Add node. Name:'/encoder/embeddings/Add_2' Status Message: /encoder/embeddings/Add_2: right operand cannot broadcast on dim 1 LeftShape: {2,6514,768}, RightShape: {1,2048,768}
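
The shapes in the message suggest that the input (6514 positions) is longer than the exported position table (2048). One common workaround is to split long text into windows before inference. A minimal character-level sketch follows. `chunk_text` and the window size are assumptions for illustration, and a real pipeline would also need to reserve room for the prompt and special tokens.

```python
def chunk_text(text, max_len):
    """Split text into consecutive pieces of at most max_len characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Offsets found in each piece then need to be shifted by the piece's start position to map back to the original text.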

Docker deployment fails; inference returns no result

System: kylin v10 armV8 aarch64
Image: FROM kumatea/pytorch

python3 /app/uie_pytorch/uie-backend-api.py

[2023-09-10 14:42:23,681] [ INFO] - >>> [PyTorchInferBackend] Creating Engine ...
[2023-09-10 14:42:39,516] [ INFO] - >>> [PyTorchInferBackend] Use CPU to inference ...
[2023-09-10 14:42:39,518] [ INFO] - >>> [PyTorchInferBackend] Engine Created ...
/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py:909: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(

The request returns no result (502 bad):
POST http://127.0.0.1:888/
Error: socket hang up
Request Headers
Content-Type: application/json
User-Agent: PostmanRuntime/7.32.3
Accept: */*
Postman-Token: 3f204252-7d5a-4732-8268-c60829276d57
Host: 127.0.0.1:888
Accept-Encoding: gzip, deflate, br
Connection: keep-alive

Bug at line 900 of utils.py

Order of the loops in the list comprehension:
all_relation_examples = [
    r
    for relation_example in relation_examples
    for r in relation_example
]
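
In a nested list comprehension the `for` clauses are read left to right, exactly like nested loops, so the outer loop must come first. A short sketch with placeholder data:

```python
relation_examples = [["r1", "r2"], ["r3"]]

# Outer loop first, inner loop second.
flat = [r for relation_example in relation_examples for r in relation_example]

# The equivalent explicit loops, in the same order.
flat_loop = []
for relation_example in relation_examples:
    for r in relation_example:
        flat_loop.append(r)
```

Swapping the two clauses would reference `relation_example` before it is bound.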

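Nested entity question

Does this UIE support nested entity extraction?
I tried uie_predictor and found that it cannot extract nested entities.

uie_m_large_pytorch issue

Some weights of UIE were not initialized from the model checkpoint at uie_m_large_pytorch and are newly initialized: ['encoder.embeddings.token_type_embeddings.weight']
Loading uie_m_large_pytorch reports that some weights could not be loaded.

Data preprocessing format: relation extraction and event extraction

Is the fine-tuning data format the same for relation extraction and event extraction? Are both as in the figure below?

Is the difference that in event extraction the entities include a trigger word, and every relation uses that trigger word as from_id with the other roles as to_id?
And in relation extraction the entities may have no trigger word, so the relations are purely relations between different roles?

Where is the default model stored for UIEPredictor(model='uie-base', schema=schema)?

Error: module 'paddle.fluid.dygraph' has no attribute 'load_dygraph'. How can this be resolved?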

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ c:\Users\n\Desktop\uie_pytorch-main (1)\uie_predictor.py:680 in │
│ │
│ 677 │ args.schema = ['航母'] │
│ 678 │ args.schema_lang = "en" │
│ 679 │ uie = UIEPredictor(model=args.model, task_path=args.task_path, schema_lang=args.sche │
│ ❱ 680 │ │ │ │ │ position_prob=args.position_prob, max_seq_len=args.max_seq_len, b │
│ 681 │ print(uie("印媒所称的“印度第一艘国产航母”—“维克兰特”号")) │
│ 682 │
│ │
│ c:\Users\n\Desktop\uie_pytorch-main (1)\uie_predictor.py:147 in init │
│ │
│ 144 │ │ self._is_en = True if model in ['uie-base-en' │
│ 145 │ │ │ │ │ │ │ │ │ │ ] or schema_lang == 'en' else False │
│ 146 │ │ self.set_schema(schema) │
│ ❱ 147 │ │ self._prepare_predictor() │
│ 148 │ │
│ 149 │ def _prepare_predictor(self): │
│ 150 │ │ assert self._engine in ['pytorch', │
│ │
│ c:\Users\n\Desktop\uie_pytorch-main (1)\uie_predictor.py:158 in _prepare_predictor │
│ │
│ 155 │ │ │ if not os.path.exists(self._task_path): │
│ 156 │ │ │ │ from convert import check_model, extract_and_convert │
│ 157 │ │ │ │ check_model(self._model) │
│ ❱ 158 │ │ │ │ extract_and_convert(self._model, self._task_path) │
│ 159 │ │ │
│ 160 │ │ if self._multilingual: │
│ 161 │ │ │ from tokenizer import ErnieMTokenizerFast │
│ │
│ c:\Users\n\Desktop\uie_pytorch-main (1)\convert.py:292 in extract_and_convert │
│ │
│ 289 │ │ import paddle.fluid.dygraph as D │
│ 290 │ │ from paddle import fluid │
│ 291 │ │ with fluid.dygraph.guard(): │
│ ❱ 292 │ │ │ paddle_paddle_params, _ = D.load_dygraph( │
│ 293 │ │ │ │ os.path.join(input_dir, 'model_state')) │
│ 294 │ else: │
│ 295 │ │ paddle_paddle_params = pickle.load( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: module 'paddle.fluid.dygraph' has no attribute 'load_dygraph'

a bug for multi-text input

https://github.com/heiheiyoyo/uie_pytorch/blob/141ffa32596f29c83bce506cb5bcd6e503033fed/uie_predictor.py#L459

When running uie_predictor.py, there is a bug if the input is a list of texts, [a_long_text, a_short_text, ...], and a_long_text is longer than 512. In that case an error is raised.
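
One way to handle mixed-length batches is to split every text into pieces, remember which pieces belong to which input, and regroup the per-piece results afterwards. The sketch below shows only that bookkeeping. The names `split_batch` and `regroup` are hypothetical, and this is not the logic in uie_predictor.py.

```python
def split_batch(texts, max_len):
    """Split each text into pieces of at most max_len characters.

    Returns the flat list of pieces and, for each input text,
    the indices of its pieces in that flat list.
    """
    pieces, mapping = [], []
    for text in texts:
        start = len(pieces)
        # Keep at least one (possibly empty) piece per input text.
        chunks = [text[i:i + max_len] for i in range(0, len(text), max_len)] or [""]
        pieces.extend(chunks)
        mapping.append(list(range(start, len(pieces))))
    return pieces, mapping


def regroup(piece_results, mapping):
    """Collect per-piece results back into one list per input text."""
    return [[piece_results[i] for i in indices] for indices in mapping]
```

Because every input keeps at least one piece, the regrouped output always has one entry per input text, whatever the lengths.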

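The uie_base_pytorch folder was generated but the .bin file is missing

OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory uie_base_pytorch.
Running the following code raises the error above. The uie_base_pytorch folder was generated, but the .bin file is missing: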
schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
ie = UIEPredictor(model='uie-base', schema=schema)
pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中**选手谷爱凌以188.25分获得金牌！")) # Better print results using pprint
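
The error lists the weight file names that were looked for. A quick way to see whether the folder is incomplete is to check for those names directly. `find_weight_files` is a hypothetical helper, and the file list is copied from the error message above.

```python
import os

# File names taken from the OSError message.
WEIGHT_FILES = (
    "pytorch_model.bin",
    "model.safetensors",
    "tf_model.h5",
    "model.ckpt.index",
    "flax_model.msgpack",
)


def find_weight_files(model_dir):
    """Return the known weight files that exist in model_dir (empty if none)."""
    if not os.path.isdir(model_dir):
        return []
    return [
        name for name in WEIGHT_FILES
        if os.path.isfile(os.path.join(model_dir, name))
    ]
```

If `find_weight_files("uie_base_pytorch")` comes back empty, removing the folder and running the conversion again is one thing to try.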

ModuleNotFoundError: No module named 'tqdm.contrib.logging'

I could not find a solution for this online. What is this module, and how do I get it?

ErnieMTokenizerFast fails to instantiate; the debugger shows "unable to get repr for class tokenizer"

BUG AttributeError: 'Namespace' object has no attribute 'max_seq_le'

In evaluate.py, line 119:
test_ds = IEMapDataset(relation_type_dict[key], tokenizer=tokenizer,
                       max_seq_len=args.max_seq_le)
"args.max_seq_le" should be written as "args.max_seq_len".


Are there plans to add UIE-X support?

Index out-of-bounds error when fine-tuning ernie_m

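I tried fine-tuning ernie_m and hit an index out-of-bounds error. I traced it to this code (line 270 of ernie_m.py),

which causes an out-of-range lookup into position_embedding.

What is the purpose of this position_ids += 2, and how should it be changed?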
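
The arithmetic behind the overflow can be checked without the model. If position ids are shifted by a constant offset before the embedding lookup, the largest index is `seq_len - 1 + offset`, so the table needs at least `seq_len + offset` rows. A sketch under that assumption (the function names and the numbers in the usage note are illustrative, not values read from ernie_m.py):

```python
def max_position_index(seq_len, offset=2):
    """Largest position id after shifting ids 0..seq_len-1 by offset."""
    return seq_len - 1 + offset


def fits_embedding_table(seq_len, num_positions, offset=2):
    """True if every shifted position id is a valid row of the table."""
    return max_position_index(seq_len, offset) < num_positions
```

For example, a sequence length of 512 with an offset of 2 needs a table of at least 514 rows. With only 512 rows, the sequence length would have to be capped at 510.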

After training with multiple labels, is the F1 on the test set computed over all labels? How can I see the F1 for a single label?
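
A per-label score can be computed by grouping gold and predicted spans by label before counting matches. The following is a generic sketch, not the repository's evaluate.py. The function name and the `(label, span_text)` input format are assumptions.

```python
from collections import defaultdict


def per_label_f1(gold, pred):
    """Compute (precision, recall, f1) per label.

    gold and pred are iterables of (label, span_text) pairs.
    """
    gold_by_label, pred_by_label = defaultdict(set), defaultdict(set)
    for label, span in gold:
        gold_by_label[label].add(span)
    for label, span in pred:
        pred_by_label[label].add(span)

    scores = {}
    for label in set(gold_by_label) | set(pred_by_label):
        g, p = gold_by_label[label], pred_by_label[label]
        tp = len(g & p)  # exact-match true positives
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores
```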

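Model output question

I would like to get the last-layer hidden state and a pooler_output value, as with BERT outputs. Is there anything wrong with writing the code this way?

model = UIE.from_pretrained(model_path)
last_hidden_state = model(**inputs).hidden_states[-1]
pooler_output = torch.max(model(**inputs).hidden_states[-1])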

Also, what are the start_prob and end_prob in the model output?

UIEPredictor raises an error during inference

UIEPredictor has no batch padding logic, which leads to this error:
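
On start_prob and end_prob: in span-extraction heads of this kind, these are typically per-token probabilities that a token starts or ends an answer span. That reading is consistent with the (batch, seq_len) output shapes in the validation logs on this page. A simplified decoding sketch under that interpretation follows. `decode_spans` and the 0.5 threshold are illustrative and are not the repository's decoding code.

```python
def decode_spans(start_prob, end_prob, threshold=0.5):
    """Pair each confident start index with the nearest confident end index.

    start_prob and end_prob are per-token probability lists for one sequence.
    Returns a list of (start, end) token index pairs, inclusive.
    """
    starts = [i for i, p in enumerate(start_prob) if p > threshold]
    ends = [i for i, p in enumerate(end_prob) if p > threshold]
    spans = []
    for s in starts:
        # ends is sorted, so the first candidate is the nearest end at or after s
        candidates = [e for e in ends if e >= s]
        if candidates:
            spans.append((s, candidates[0]))
    return spans
```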
File "/home/wangjiawei/baishen/UIE/uie_predictor.py", line 560, in _auto_joiner
for i in range(len(short_results[v])):
IndexError: list index out of range

convert.py raises an error when run

Hello, the following problem occurs when running the model conversion. What is the cause?
The installed transformers version is 4.20.0.
from transformers.utils import ModelOutput
ImportError: cannot import name 'ModelOutput'

Validation fails when converting the uie-m-large model: ValueError: Outputs values doesn't match between reference model and Pytorch converted model: Got max absolute difference of: 4.9104968638857827e-05

[2022-11-17 11:44:05,198] [ INFO] - Validating PyTorch model...
[2022-11-17 11:44:26,931] [ INFO] - -[✓] Pytorch model output names match reference model ({'start_prob', 'end_prob'})
[2022-11-17 11:44:26,935] [ INFO] - - Validating PyTorch Model output "start_prob":
[2022-11-17 11:44:26,937] [ INFO] - -[✓] (2, 512) matches (2, 512)
[2022-11-17 11:44:26,956] [ INFO] - -[x] values not close enough (atol: 1e-05)
Traceback (most recent call last):
File "/Users/momo/Documents/code/uie_pytorch/convert.py", line 468, in <module>
do_main()
File "/Users/momo/Documents/code/uie_pytorch/convert.py", line 452, in do_main
validate_model(tokenizer, model, paddle_model, model_type)
File "/Users/momo/Documents/code/uie_pytorch/convert.py", line 414, in validate_model
raise ValueError(
ValueError: Outputs values doesn't match between reference model and Pytorch converted model: Got max absolute difference of: 4.9104968638857827e-05
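
The failing check compares the largest element-wise difference against an absolute tolerance. A minimal sketch of that comparison in plain Python, assuming flat lists of floats (the function names are hypothetical, and convert.py may compute it differently):

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def within_atol(a, b, atol=1e-5):
    """True if no element differs by more than atol."""
    return max_abs_diff(a, b) <= atol
```

A difference of about 4.9e-05 fails at atol=1e-05 but would pass at 1e-04. Whether a looser tolerance is acceptable for this model is a judgment call.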

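Possible error when fine-tuning the model: The OrderedVocab you are attempting to save contains a hole for index 12084, your vocabulary could be corrupted !

From what I have found, the problem may be in uie_base_pytorch/vocab.txt, but I have not been able to resolve it. Any guidance would be appreciated!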
