hustai / uie_pytorch
PyTorch implementation of the PaddleNLP UIE model
License: Apache License 2.0
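I mean standard classification, not sentiment classification. Model training does not seem to work, even though I have already changed the type from ext to cls.
I looked into this, and the problem may lie in uie_base_pytorch/vocab.txt. However, I have not been able to solve it and would appreciate any guidance.
Where is the default model stored for UIEPredictor(model='uie-base', schema=schema)?
UIEPredictor has no batch padding logic, which leads to the following error: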
File "/home/wangjiawei/baishen/UIE/uie_predictor.py", line 560, in _auto_joiner
for i in range(len(short_results[v])):
IndexError: list index out of range
Hello, the following problem occurs when I run the model conversion. What is the cause?
The current transformers version is 4.20.0.
from transformers.utils import ModelOutput
ImportError: cannot import name 'ModelOutput'
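Since the import fails even though transformers 4.20.0 is reported, one thing worth ruling out is that the interpreter running convert.py sees a different installation than expected. The sketch below uses only the standard library's importlib.metadata to report what is actually installed in the active environment. The helper name installed_version is hypothetical and not part of the repository.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the errors reported on this page.
    for name in ("transformers", "tqdm", "protobuf"):
        print(name, installed_version(name))
```

Running this with the same interpreter that runs convert.py shows whether the version in use is the one that was assumed.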
run:
python convert.py -i ernie-3.0-base-zh --no_validate_output
got:
2023-03-01 01:34:35.798449: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2023-03-01 01:34:37,186] [ INFO] - Downloading resource files...
[2023-03-01 01:34:37,187] [ INFO] - Downloading ernie_3.0_base_zh.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams
[2023-03-01 01:37:55,405] [ INFO] - Downloading ernie_3.0_base_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt
[2023-03-01 01:37:55,798] [ INFO] - ====================save config file====================
[2023-03-01 01:37:55,800] [ INFO] - ====================save vocab file====================
[2023-03-01 01:37:55,801] [ INFO] - ====================extract weights====================
Traceback (most recent call last):
File "convert.py", line 468, in <module>
do_main()
File "convert.py", line 427, in do_main
extract_and_convert(args.input_model, args.output_model, verbose=True)
File "convert.py", line 297, in extract_and_convert
del paddle_paddle_params['StructuredToParameterName@@']
KeyError: 'StructuredToParameterName@@'
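The KeyError is raised because del requires the key to exist. A minimal sketch of a more tolerant removal is shown below, using dict.pop with a default. The variable name mirrors the traceback, while the helper name and the sample dictionary contents are made up for illustration. Whether a checkpoint without this key converts correctly otherwise is an assumption that would need to be checked.

```python
def drop_optional_key(params, key="StructuredToParameterName@@"):
    """Remove a bookkeeping key if present; do nothing when it is missing."""
    params.pop(key, None)  # unlike `del params[key]`, this never raises KeyError
    return params


# Made-up stand-in for the loaded parameters; the bookkeeping key is absent here.
paddle_paddle_params = {"encoder.weight": [0.1, 0.2]}
drop_optional_key(paddle_paddle_params)
print(sorted(paddle_paddle_params))
```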
For example, when I use schema=['a','b','c','d','e'], the attribute 'b' never appears in the results during validation. Why is that?
Using the uie-m-large version, I hit a bug in class ErnieMConverter(Converter):
Traceback (most recent call last):
File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/src/run.py", line 23, in <module>
ie = UIEPredictor(model='uie-m-large', schema=schema, device="cuda" if torch.cuda.is_available() else "cpu")
File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/uie_pytorch/uie_predictor.py", line 146, in __init__
self._prepare_predictor()
File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/uie_pytorch/uie_predictor.py", line 160, in _prepare_predictor
self._tokenizer = ErnieMTokenizerFast.from_pretrained(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2017, in from_pretrained
return cls._from_pretrained(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2249, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/uie_pytorch/tokenizer.py", line 477, in __init__
super().__init__(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 1342, in convert_slow_tokenizer
return converter_class(transformer_tokenizer).converted()
File "/Users/liuyilin/Downloads/NLP_project/Kaggle_PIIDD/uie_pytorch/tokenizer.py", line 576, in __init__
from transformers.utils import sentencepiece_model_pb2 as model_pb2
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
_descriptor.EnumValueDescriptor(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 789, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
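The second workaround listed in the error message can also be applied from Python, as long as the variable is set before anything imports protobuf. The sketch below shows that ordering. The helper name is hypothetical, and as the message itself warns, pure-Python parsing is slower.

```python
import os


def use_pure_python_protobuf():
    """Select the pure-Python protobuf implementation via the environment variable
    named in the error message, and return the value that was set."""
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    return os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]


# Call this at the very top of the entry script, before importing
# transformers or any other module that pulls in google.protobuf.
use_pure_python_protobuf()
```

The first workaround in the message (installing protobuf 3.20.x or lower) avoids the slowdown and needs no code change.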
Is there a tutorial on implementing information extraction with prompts (in-context learning)?
In evaluate.py, line 119 reads:
test_ds = IEMapDataset(relation_type_dict[key], tokenizer=tokenizer,
                       max_seq_len=args.max_seq_le)
Here "args.max_seq_le" is a typo and should be "args.max_seq_len".
👍
BW2U
ModuleNotFoundError: No module named 'tqdm.contrib.logging'
I could not find a solution to this online either. How should I go about fixing it?
[2022-11-17 11:44:05,198] [ INFO] - Validating PyTorch model...
[2022-11-17 11:44:26,931] [ INFO] - -[✓] Pytorch model output names match reference model ({'start_prob', 'end_prob'})
[2022-11-17 11:44:26,935] [ INFO] - - Validating PyTorch Model output "start_prob":
[2022-11-17 11:44:26,937] [ INFO] - -[✓] (2, 512) matches (2, 512)
[2022-11-17 11:44:26,956] [ INFO] - -[x] values not close enough (atol: 1e-05)
Traceback (most recent call last):
File "/Users/momo/Documents/code/uie_pytorch/convert.py", line 468, in <module>
do_main()
File "/Users/momo/Documents/code/uie_pytorch/convert.py", line 452, in do_main
validate_model(tokenizer, model, paddle_model, model_type)
File "/Users/momo/Documents/code/uie_pytorch/convert.py", line 414, in validate_model
raise ValueError(
ValueError: Outputs values doesn't match between reference model and Pytorch converted model: Got max absolute difference of: 4.9104968638857827e-05
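The log shows that the output names and shapes match and that only the value check fails: the largest difference, about 4.9e-05, exceeds the absolute tolerance of 1e-05. The sketch below reproduces that kind of comparison in plain Python with made-up numbers; it is not the code in convert.py, and the function names are hypothetical. Whether a difference of this size matters is a judgment call. The --no_validate_output flag used in an earlier report on this page appears to skip the check.

```python
def max_abs_diff(reference, converted):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(r - c) for r, c in zip(reference, converted))


def outputs_match(reference, converted, atol=1e-5):
    """True when every element differs by at most the absolute tolerance."""
    return max_abs_diff(reference, converted) <= atol


# Made-up values whose largest gap is close to the one in the log above.
reference = [0.5, 0.25]
converted = [0.5 + 4.9104968638857827e-05, 0.25]

print(outputs_match(reference, converted, atol=1e-5))
print(outputs_match(reference, converted, atol=1e-4))
```

With these numbers the comparison fails at atol=1e-5 and passes at atol=1e-4, which mirrors the situation in the report.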
Traceback (most recent call last):
File "/root/autodl-tmp/uie_pytorch/uie_predictor.py", line 679, in <module>
uie = UIEPredictor(model=args.model, task_path=args.task_path, schema_lang=args.schema_lang, schema=args.schema, engine=args.engine, device=args.device,
File "/root/autodl-tmp/uie_pytorch/uie_predictor.py", line 147, in __init__
self._prepare_predictor()
File "/root/autodl-tmp/uie_pytorch/uie_predictor.py", line 162, in _prepare_predictor
self._tokenizer = ErnieMTokenizerFast.from_pretrained(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2017, in from_pretrained
return cls._from_pretrained(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2049, in _from_pretrained
slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2249, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/root/autodl-tmp/uie_pytorch/tokenizer.py", line 139, in __init__
super().__init__(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 367, in __init__
self._add_tokens(
File "/root/autodl-tmp/conda/envs/uie_torch_cpu/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
current_vocab = self.get_vocab().copy()
File "/root/autodl-tmp/uie_pytorch/tokenizer.py", line 185, in get_vocab
return dict(self.vocab, **self.added_tokens_encoder)
AttributeError: 'ErnieMTokenizer' object has no attribute 'vocab'
run uie_predictor.py
There is a bug when the input is a list of texts, [a_long_text, a_short_text, ...], and a_long_text is longer than 512: an error is raised.
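The IndexError in _auto_joiner reported earlier on this page and this failure both involve the bookkeeping between input texts and the shorter pieces they are split into. The sketch below shows one way such bookkeeping can be kept consistent for a mixed batch of long and short texts. The function names and the character-based splitting are assumptions for illustration, not the code in uie_predictor.py.

```python
def split_texts(texts, max_len):
    """Split each text into pieces of at most max_len characters.

    Returns the flat list of pieces and a mapping from the index of each
    original text to the indices of its pieces in that flat list.
    """
    chunks = []
    mapping = {}
    for i, text in enumerate(texts):
        # An empty text still yields one (empty) piece so every input has an entry.
        pieces = [text[s:s + max_len] for s in range(0, len(text), max_len)] or [""]
        mapping[i] = list(range(len(chunks), len(chunks) + len(pieces)))
        chunks.extend(pieces)
    return chunks, mapping


def join_results(chunk_results, mapping):
    """Merge per-piece result lists back into one result list per original text."""
    joined = []
    for i in sorted(mapping):
        merged = []
        for j in mapping[i]:
            merged.extend(chunk_results[j])
        joined.append(merged)
    return joined


chunks, mapping = split_texts(["a" * 10, "bc"], max_len=4)
print(chunks, mapping)
```

Because the mapping is built for every input, including short ones, the join step never has to index a piece that was not produced.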
Training set:
{"content": "不错的上网本,外形很漂亮,操作系统应该是个很大的 卖点,电池还可以。整体上讲,作为一个上网本的定位,还是不错的。\t", "result_list": [{"text": "正向", "start": -7, "end": -5}], "prompt": "情感倾向[正向,负向]"}
{"content": "<荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!\t", "result_list": [{"text": "正向", "start": -4, "end": -2}], "prompt": "情感倾向[负向,正向]"}
Fine-tuning sentiment classification with this data raises the following error:
RequestsDependencyWarning)
Traceback (most recent call last):
File "finetune.py", line 253, in <module>
do_train()
File "finetune.py", line 35, in do_train
tokenizer = BertTokenizerFast.from_pretrained(args.model)
File "/home/ma-user/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1706, in from_pretrained
local_files_only=local_files_only,
File "/home/ma-user/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/transformers/utils/hub.py", line 711, in get_file_from_repo
use_auth_token=use_auth_token,
File "/home/ma-user/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/transformers/utils/hub.py", line 292, in cached_path
local_files_only=local_files_only,
File "/home/ma-user/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/transformers/utils/hub.py", line 563, in get_from_cache
"Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
System: Kylin V10, ARMv8 (aarch64)
Image: FROM kumatea/pytorch
[2023-09-10 14:42:23,681] [ INFO] - >>> [PyTorchInferBackend] Creating Engine ...
[2023-09-10 14:42:39,516] [ INFO] - >>> [PyTorchInferBackend] Use CPU to inference ...
[2023-09-10 14:42:39,518] [ INFO] - >>> [PyTorchInferBackend] Engine Created ...
/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py:909: FutureWarning: The device
argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
The call returns no result and fails with a 502 error:
POST http://127.0.0.1:888/
Error: socket hang up
Request Headers
Content-Type: application/json
User-Agent: PostmanRuntime/7.32.3
Accept: */*
Postman-Token: 3f204252-7d5a-4732-8268-c60829276d57
Host: 127.0.0.1:888
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
OSError: uie_base_pytorch is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
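This OSError typically means the string was neither found as a directory relative to the current working directory nor recognised as a hub model name. A quick pre-flight check, sketched below with the standard library only, can tell the two cases apart. The list of required files is an assumption (config.json and vocab.txt are both mentioned elsewhere on this page), and the helper name is hypothetical.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("config.json", "vocab.txt")):
    """Return the required file names that are absent from model_dir.

    If model_dir itself does not exist, every required file is reported missing.
    """
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Show where the relative path actually points, then what is missing there.
    print(Path("uie_base_pytorch").resolve())
    print(missing_model_files("uie_base_pytorch"))
```

If the resolved path is not where the converted model was written, passing an absolute path to the model directory is the simplest thing to try.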
Which versions of transformers and the other libraries are required?
Modifying line 680 of utils.py as follows fixes this bug:
def get_relation_type_dict(relation_data):
    def compare(a, b):
        a = a[::-1]
        b = b[::-1]
        res = ''
        for i in range(min(len(a), len(b))):
            if a[i] == b[i]:
                res += a[i]
            else:
                break
        if res == "":
            return res
        elif res[::-1][0] == "的":
            return res[::-1][1:]
        return ""
    relation_type_dict = {}
    added_list = []
    for i in range(len(relation_data)):
        added = False
        if relation_data[i][0] not in added_list:
            for j in range(i + 1, len(relation_data)):
                match = compare(relation_data[i][0], relation_data[j][0])
                if match != "":
                    match = unify_prompt_name(match)
                    if relation_data[i][0] not in added_list:
                        added_list.append(relation_data[i][0])
                        relation_type_dict.setdefault(match, []).append(
                            relation_data[i][1])
                    added_list.append(relation_data[j][0])
                    relation_type_dict.setdefault(match, []).append(
                        relation_data[j][1])
                    added = True
            if not added:
                added_list.append(relation_data[i][0])
                suffix = relation_data[i][0].rsplit("的", 1)[1]
                suffix = unify_prompt_name(suffix)
                # This branch seems to be reached when there is only one object.
                # The commented-out assignment below would store a dict (rather
                # than a list) in relation_type_dict.
                relation_type_dict.setdefault(suffix, []).append(
                    relation_data[i][1])
                # relation_type_dict[suffix] = relation_data[i][1]
    return relation_type_dict
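The difference between the original assignment and the setdefault(...).append(...) version can be seen in isolation. In the sketch below the example record is made up; the point is only that direct assignment stores the record itself, so iterating over the value yields its keys, while the patched form always stores a list of records.

```python
# Made-up record standing in for relation_data[i][1].
example = {"content": "some text", "result_list": [], "prompt": "some prompt"}

# Original behaviour: the dict is stored directly under the key.
assigned = {}
assigned["suffix"] = example

# Patched behaviour: the value is always a list of records.
appended = {}
appended.setdefault("suffix", []).append(example)

# Iterating the assigned value walks over the record's keys (strings),
# while iterating the appended value walks over records (dicts).
print([type(item).__name__ for item in assigned["suffix"]])
print([type(item).__name__ for item in appended["suffix"]])
```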
Does this UIE implementation support nested entity extraction?
I tried uie_predictor and found that it cannot extract nested entities.
Some weights of UIE were not initialized from the model checkpoint at uie_m_large_pytorch and are newly initialized: ['encoder.embeddings.token_type_embeddings.weight']
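When loading uie_m_large_pytorch, a message says that some weights could not be loaded.
The config.json of uie-base has no task_id value, so does task_type_embeddings have no effect at runtime? Is the default value 0 always used?
In the data format produced by doccano.py, what does prompt represent, and what is its purpose?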
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Add node. Name:'/encoder/embeddings/Add_2' Status Message: /encoder/embeddings/Add_2: right operand cannot broadcast on dim 1 LeftShape: {2,6514,768}, RightShape: {1,2048,768}
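The message states that the two operands of the Add node cannot be broadcast on dimension 1: 6514 on the left against 2048 on the right. The sketch below applies the usual broadcasting rule (sizes must be equal, or one of them must be 1) to the shapes from the message. Reading 6514 as the input sequence length and 2048 as the length of a position-embedding table is an assumption based on the node name, not something the log confirms. The function name is hypothetical.

```python
def first_broadcast_conflict(left, right):
    """Return the first dimension index where two equal-rank shapes cannot be
    broadcast together, or None when they are compatible."""
    for dim, (l, r) in enumerate(zip(left, right)):
        if l != r and l != 1 and r != 1:
            return dim
    return None


# Shapes copied from the error message above.
print(first_broadcast_conflict((2, 6514, 768), (1, 2048, 768)))
# A left operand no longer than the right one on dim 1 would be compatible.
print(first_broadcast_conflict((2, 2048, 768), (1, 2048, 768)))
```

Under that assumption, inputs would need to be truncated or split so that dimension 1 does not exceed 2048 before they reach the exported model.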
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ c:\Users\n\Desktop\uie_pytorch-main (1)\uie_predictor.py:680 in <module> │
│ │
│ 677 │ args.schema = ['航母'] │
│ 678 │ args.schema_lang = "en" │
│ 679 │ uie = UIEPredictor(model=args.model, task_path=args.task_path, schema_lang=args.sche │
│ ❱ 680 │ │ │ │ │ position_prob=args.position_prob, max_seq_len=args.max_seq_len, b │
│ 681 │ print(uie("印媒所称的“印度第一艘国产航母”—“维克兰特”号")) │
│ 682 │
│ │
│ c:\Users\n\Desktop\uie_pytorch-main (1)\uie_predictor.py:147 in __init__ │
│ │
│ 144 │ │ self._is_en = True if model in ['uie-base-en' │
│ 145 │ │ │ │ │ │ │ │ │ │ ] or schema_lang == 'en' else False │
│ 146 │ │ self.set_schema(schema) │
│ ❱ 147 │ │ self._prepare_predictor() │
│ 148 │ │
│ 149 │ def _prepare_predictor(self): │
│ 150 │ │ assert self._engine in ['pytorch', │
│ │
│ c:\Users\n\Desktop\uie_pytorch-main (1)\uie_predictor.py:158 in _prepare_predictor │
│ │
│ 155 │ │ │ if not os.path.exists(self._task_path): │
│ 156 │ │ │ │ from convert import check_model, extract_and_convert │
│ 157 │ │ │ │ check_model(self._model) │
│ ❱ 158 │ │ │ │ extract_and_convert(self._model, self._task_path) │
│ 159 │ │ │
│ 160 │ │ if self._multilingual: │
│ 161 │ │ │ from tokenizer import ErnieMTokenizerFast │
│ │
│ c:\Users\n\Desktop\uie_pytorch-main (1)\convert.py:292 in extract_and_convert │
│ │
│ 289 │ │ import paddle.fluid.dygraph as D │
│ 290 │ │ from paddle import fluid │
│ 291 │ │ with fluid.dygraph.guard(): │
│ ❱ 292 │ │ │ paddle_paddle_params, _ = D.load_dygraph( │
│ 293 │ │ │ │ os.path.join(input_dir, 'model_state')) │
│ 294 │ else: │
│ 295 │ │ paddle_paddle_params = pickle.load( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: module 'paddle.fluid.dygraph' has no attribute 'load_dygraph'
Loop order in the list comprehension
all_relation_examples = [
    r
    for relation_example in relation_examples
    for r in relation_example
]
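In a comprehension with two for clauses, the clauses run in the order they are written, so the first for is the outer loop. The sketch below shows the equivalent explicit loops, with made-up data in place of relation_examples.

```python
relation_examples = [["a1", "a2"], ["b1"], []]  # made-up nested list

flattened = [
    r
    for relation_example in relation_examples  # outer loop
    for r in relation_example                  # inner loop
]

# Equivalent explicit loops, in the same order as the for clauses above.
expected = []
for relation_example in relation_examples:
    for r in relation_example:
        expected.append(r)

print(flattened)  # ['a1', 'a2', 'b1']
```

Swapping the two for clauses would raise a NameError here, because relation_example would be used before it is bound.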
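Training seems to use only one GPU.
I would like to get the last layer's hidden state and the pooler_output value, as with BERT's output. Is there anything wrong with writing the code like this?
model = UIE.from_pretrained(path)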
last_hidden_state = model(inputs*).hidden_states[-1]
pooler_output = torch.max(model(inputs*).hidden_states[-1])
Also, what are the start_prob and end_prob output by the model?
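In span-extraction models of this kind, start_prob and end_prob are usually per-token probabilities that a token begins or ends an answer span. The validation log earlier on this page shows both with shape (2, 512), which is consistent with one value per token. The sketch below pairs positions above a threshold into spans. It is a simplified illustration with made-up numbers: the function name, the pairing rule, and the default threshold are assumptions, not the decoding used in uie_predictor.py.

```python
def decode_spans(start_prob, end_prob, threshold=0.5):
    """Pair each start position above the threshold with the nearest end
    position above the threshold that is not before it."""
    starts = [i for i, p in enumerate(start_prob) if p > threshold]
    ends = [i for i, p in enumerate(end_prob) if p > threshold]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, following[0]))
    return spans


# Made-up probabilities for a six-token input.
start_prob = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1]
end_prob = [0.1, 0.1, 0.95, 0.1, 0.1, 0.7]
print(decode_spans(start_prob, end_prob))  # [(1, 2), (4, 5)]
```

Each pair is a (start index, end index) over token positions; mapping those back to characters in the input text is a separate step.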