jinyuanli0012 / pgim

[EMNLP 2023 Findings] Prompting ChatGPT in MNER: Enhanced Multimodal Named Entity Recognition with Auxiliary Refined Knowledge

pgim's Introduction

Prompting ChatGPT in MNER: Enhanced Multimodal Named Entity Recognition with Auxiliary Refined Knowledge

This repository contains the code and datasets for our Findings of EMNLP 2023 paper: Prompting ChatGPT in MNER: Enhanced Multimodal Named Entity Recognition with Auxiliary Refined Knowledge

News🔥

📆 [Jun. 2024] A new study has been released. We propose a new Segmented Multimodal Named Entity Recognition (SMNER) task and construct the corresponding Twitter-SMNER dataset. The code and the Twitter-SMNER dataset are coming soon.
📆 [May. 2024] RiVEG (the sequel to PGIM about GMNER) has been accepted to ACL 2024 Findings.
📆 [Oct. 2023] PGIM has been accepted to EMNLP 2023 Findings.

Dataset

To make the code easier to run, our pre-processed datasets are available here, and the predefined artificial samples are available here.

Requirement

python == 3.7
torch == 1.13.1
transformers == 4.30.2
modelscope == 1.7.1

Usage

PGIM is built on AdaSeq, which requires Python >= 3.7 and PyTorch >= 1.8.

Step 1: Installation

git clone https://github.com/modelscope/adaseq.git
cd adaseq
pip install -r requirements.txt -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Step 2: Copy PGIM folder into .../adaseq/examples/

-adaseq
---|examples
-----|PGIM
-------|twitter-15-txt.yaml
-------|twitter-17-txt.yaml

Step 3: Replace the original adaseq folder with our adaseq folder

-adaseq
---|.git
---|.github
---|adaseq   <-- (replace it with our adaseq)
---|docs
---|examples
---|scripts
---|tests
---|tools
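
Steps 2 and 3 can also be scripted. The following is a minimal sketch, assuming the PGIM repository has been cloned locally and contains a `PGIM` folder and an `adaseq` folder at its top level. The function name `install_pgim` and both directory arguments are hypothetical and are not part of PGIM or AdaSeq.

```python
import shutil
from pathlib import Path


def install_pgim(pgim_repo: Path, adaseq_repo: Path) -> None:
    """Copy the PGIM example folder and swap in PGIM's adaseq package."""
    # Step 2: copy PGIM/ to <adaseq_repo>/examples/PGIM
    examples_target = adaseq_repo / "examples" / "PGIM"
    if examples_target.exists():
        shutil.rmtree(examples_target)
    shutil.copytree(pgim_repo / "PGIM", examples_target)

    # Step 3: replace <adaseq_repo>/adaseq with the adaseq folder from PGIM
    package_target = adaseq_repo / "adaseq"
    if package_target.exists():
        shutil.rmtree(package_target)
    shutil.copytree(pgim_repo / "adaseq", package_target)
```

For example, `install_pgim(Path("pgim"), Path("adaseq"))` would apply both steps when the two checkouts sit side by side. The original `adaseq` package folder is deleted, so keep a backup if you have local changes.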

Step 4: Training Model

-For Baseline:

	python -m scripts.train -c examples/PGIM/twitter-15.yaml
	python -m scripts.train -c examples/PGIM/twitter-17.yaml

-For PGIM:

	python -m scripts.train -c examples/PGIM/twitter-15-PGIM.yaml
	python -m scripts.train -c examples/PGIM/twitter-17-PGIM.yaml
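
To run all four configurations one after another, the commands above can be wrapped in a short script. This is a minimal sketch, assuming it is launched from the root of the AdaSeq checkout. The helper names `build_train_command` and `train_all` are hypothetical, and the config paths are copied from the commands above.

```python
import subprocess
import sys
from typing import List

# Config paths taken from the Step 4 commands.
CONFIGS = [
    "examples/PGIM/twitter-15.yaml",       # baseline
    "examples/PGIM/twitter-17.yaml",       # baseline
    "examples/PGIM/twitter-15-PGIM.yaml",  # PGIM
    "examples/PGIM/twitter-17-PGIM.yaml",  # PGIM
]


def build_train_command(config: str) -> List[str]:
    """Build the equivalent of `python -m scripts.train -c <config>`."""
    return [sys.executable, "-m", "scripts.train", "-c", config]


def train_all(configs: List[str] = CONFIGS) -> None:
    """Run each training job in turn and stop at the first failure."""
    for config in configs:
        subprocess.run(build_train_command(config), check=True)
```

Calling `train_all()` starts the runs. `check=True` makes a failed run raise `subprocess.CalledProcessError` instead of silently moving on to the next config.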

Citation

If you find PGIM useful in your research, please consider citing:

@inproceedings{li2023prompting,
  title={Prompting ChatGPT in MNER: Enhanced Multimodal Named Entity Recognition with Auxiliary Refined Knowledge},
  author={Li, Jinyuan and Li, Han and Pan, Zhuo and Sun, Di and Wang, Jiahao and Zhang, Wenkun and Pan, Gang},
  booktitle={Findings of the Association for Computational Linguistics (EMNLP), 2023},
  year={2023}
}

@inproceedings{li2024llms,
  title={LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition},
  author={Li, Jinyuan and Li, Han and Sun, Di and Wang, Jiahao and Zhang, Wenkun and Wang, Zan and Pan, Gang},
  booktitle={Findings of the Association for Computational Linguistics (ACL), 2024},
  year={2024}
}

@article{li2024advancing,
  title={Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation},
  author={Li, Jinyuan and Li, Ziyan and Li, Han and Yu, Jianfei and Xia, Rui and Sun, Di and Pan, Gang},
  journal={arXiv preprint arXiv:2406.07268},
  year={2024}
}

Acknowledgement

Our code is built upon the open-source AdaSeq and MoRe. Thanks for their great work!

pgim's Issues

requests.exceptions.HTTPError: SequenceLabelingPreprocessor: 404 Client Error: Not Found for url: http://www.modelscope.cn/api/v1/models/xlm-roberta-large/revisions?EndTime=1688313600

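The model URL no longer seems to exist. Is there another download location? If I download the model locally, where in the code should I change the path?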

Cannot reach Hugging Face to download the model, so the command fails

I have set up the whole environment as required, but the final command fails with this error:
(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /xlm-roberta-large/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f00c74d8880>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"),
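I tried copying the model to a local directory and changing the model location in the yaml file, but I still get the error above.
I hope you can help me.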
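
The traceback shows a request for `config.json` on huggingface.co, which suggests the model name was still being resolved as a hub ID rather than as a local directory. Below is a minimal sketch for checking a local copy before pointing the config at it. The helper `missing_model_files` and the lists of expected files are assumptions and are not part of PGIM or AdaSeq. The source does not say which YAML key holds the model path.

```python
from pathlib import Path
from typing import List

# Assumed minimal contents of a local Hugging Face model directory.
REQUIRED_FILES = ["config.json"]
# Either weight file format is accepted (assumption).
WEIGHT_FILES = ["pytorch_model.bin", "model.safetensors"]


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing
```

If the returned list is empty, the directory is probably usable as an absolute path in place of the `xlm-roberta-large` name. Setting the environment variable `TRANSFORMERS_OFFLINE=1` before launching training tells transformers not to contact the hub.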

manual annotation

Hello, I am not sure whether my understanding of the paper is correct. If it is, could I get the manually annotated dataset to study? Many thanks.

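Manually annotate a small number of examples.
Use a trained vanilla MNER model to obtain image-text features.
For each training sample, compute the cosine similarity between image-text features and select the most similar annotated sample as the example.
Assemble the prompt and send it to the large language model to obtain the auxiliary refined knowledge.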
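
The similarity-based selection step described in this issue can be sketched as follows. This is only an illustration of the asker's reading of the paper, not the authors' released code. The function names are hypothetical, and the feature vectors are placeholders for whatever a trained MNER model would produce.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_examples(query: Sequence[float],
                    candidates: List[Sequence[float]],
                    k: int = 1) -> List[int]:
    """Return indices of the k candidates most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Here `query` would be the feature of one training sample and `candidates` the features of the manually annotated samples. The returned indices identify which annotated samples to place in the prompt.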

Running the first command in Step 4 raises AttributeError: 'ConfigDict' object has no attribute 'data_collator'

The error occurs whether or not lines 162-164 of base.py are commented out.

AttributeError: 'ConfigDict' object has no attribute 'evaluation'

Hi! I encountered the error “AttributeError: 'ConfigDict' object has no attribute 'evaluation'” when following Step 4. Could you tell me how to solve this problem? Thank you so much!

Some dataset-related questions

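In the test set, some entities are labeled I-PER with no preceding B-PER. Is this a labeling error? Or, when a single word is an entity by itself, are I-PER and B-PER both acceptable for that word?
For example: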
IMGID:1908210
ONLY O
BELIEVE O
the O
dynamic O
gospel O
album O
from O
Elder O
Arthur O
R O
. O
Johnson I-PER
. O
Available O
at O
select O
outlets O
: O
http://t.co/0WzcfKMSNn O

IMGID:68088
George B-PER
W O
. O
Bush I-PER
takes O
ice B-OTHER
bucket I-OTHER
challenge I-OTHER
http://t.co/oAe1Sdi9AY O
challenges O
@BillClinton O
next. O
http://t.co/0YnM6RNe9P O

IMGID:21659
Good O
morning O
to O
you O
too O
, O
Mr O
. O
Hopewell I-PER
http://t.co/Dacow9O4mh O
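
Such labels can be located automatically. The following is a minimal sketch, assuming the strict BIO convention in which an I- tag must follow a B- or I- tag of the same type. The function name `find_orphan_inside_tags` is hypothetical and is not part of PGIM or AdaSeq.

```python
from typing import List


def find_orphan_inside_tags(labels: List[str]) -> List[int]:
    """Return indices of I-XXX labels not preceded by B-XXX or I-XXX."""
    orphans = []
    previous = "O"
    for index, label in enumerate(labels):
        if label.startswith("I-"):
            entity_type = label[2:]
            # A valid I- tag continues an entity of the same type.
            if previous not in ("B-" + entity_type, "I-" + entity_type):
                orphans.append(index)
        previous = label
    return orphans
```

For the second sample above, the first four labels are `B-PER`, `O`, `O`, `I-PER`, and the function flags index 3 (`Bush`), because an `O` token sits between it and `George`.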

Question about the data_PGIM/2015 dataset

Hello, I have a question about the X label in the dataset. labels.txt does not define X, so how does the code know what X means, and where in the code is this handled?
RT O
@JayKenMinaj O
_ O
: O
Me O
outside O
of O
where O
George B-PER
Zimmerman I-PER
got O
. O
http://t.co/Z3neVBQ7vF O
X
Named X
entities:1. X
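
To inspect samples like this one, the token/label lines can be parsed and the X-labeled part separated from the rest. The sketch below assumes each line is a token followed by whitespace and a label, and that a line with a single field is a label with an empty token. It does not show how PGIM itself treats X, which the source does not state. Both function names are hypothetical.

```python
from typing import List, Tuple


def parse_sample(lines: List[str]) -> List[Tuple[str, str]]:
    """Split each 'token label' line into a (token, label) pair."""
    pairs = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) == 1:
            pairs.append(("", parts[0]))  # label only, empty token (assumption)
        else:
            pairs.append((" ".join(parts[:-1]), parts[-1]))
    return pairs


def split_by_label(pairs: List[Tuple[str, str]],
                   label: str = "X") -> Tuple[List[Tuple[str, str]], List[str]]:
    """Separate pairs carrying `label` from the remaining labeled tokens."""
    marked = [token for token, tag in pairs if tag == label]
    rest = [(token, tag) for token, tag in pairs if tag != label]
    return rest, marked
```

Applied to the sample above, `rest` holds the original tweet tokens with their NER labels, and `marked` holds the appended text that carries the X label.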

Data loading error

datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

When I follow your instructions, the dataset download raises an error: datasets.builder.DatasetGenerationError: An error occurred while generating the dataset. It seems that the cache file is created without anything being downloaded, so it is an empty file.

How is the multimodal part reflected?

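The paper states that UMT (2020) is used to obtain the fused multimodal features, but I could not find the corresponding module in the released code. The datasets used for training are also annotated plain-text datasets. Is it correct to understand that this method converts the image to text, uses ChatGPT to enhance the representation, and then concatenates the result with the original text to perform the NER task?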

build_dataset error log: 'sequence-labeling-model is not in the custom_datasets registry group named-entity-recognition. Please make sure the correct version of ModelScope library is used.'

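Hello, when reproducing PGIM with the AdaSeq framework I got the following error, and the program hangs at this point (the same happens with all four scripts).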

Error message:
** build_dataset error log: 'sequence-labeling-model is not in the custom_datasets registry group named-entity-recognition. Please make sure the correct version of ModelScope library is used.'
** build_dataset error log: 'sequence-labeling-model is not in the custom_datasets registry group named-entity-recognition. Please make sure the correct version of ModelScope library is used.'

Full log:
2024-06-20 10:39:37,335 - modelscope - INFO - PyTorch version 1.10.1+cu111 Found.
2024-06-20 10:39:37,336 - modelscope - INFO - Loading ast index from /home/tangjielong/.cache/modelscope/ast_indexer
2024-06-20 10:39:37,365 - modelscope - INFO - Loading done! Current index file version is 1.7.1, with md5 55cda22e675324e12989be11cd8d8653 and a total number of 861 components indexed
2024-06-20 10:39:38,305 - modelscope - WARNING - The reference has been Deprecated in modelscope v1.4.0+, please use from modelscope.msdatasets.dataset_cls.custom_datasets import TorchCustomDataset
2024-06-20 10:39:38,401 - INFO - adaseq.data.dataset_manager - Will use a custom loading script: /data0/tangjielong/MNER_LLM/AdaSeq-master/adaseq/data/dataset_builders/named_entity_recognition_dataset_builder.py
Downloading and preparing dataset named_entity_recognition_dataset_builder/default to /data0/tangjielong/.cache/huggingface/datasets/named_entity_recognition_dataset_builder/default-df0a7beb617cd5ee/0.0.0/db737b9bb893f20fb03d04403a30bf7c033256c212b7e9f0ebc6e9c958535c51...
Downloading data: 216kB [00:00, 1.40MB/s]
Dataset named_entity_recognition_dataset_builder downloaded and prepared to /data0/tangjielong/.cache/huggingface/datasets/named_entity_recognition_dataset_builder/default-df0a7beb617cd5ee/0.0.0/db737b9bb893f20fb03d04403a30bf7c033256c212b7e9f0ebc6e9c958535c51. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 829.68it/s]
2024-06-20 10:39:40,242 - INFO - adaseq.data.dataset_manager - First sample in train set: {'id': '0', 'tokens': ['New', 'Post', ':', 'Blackburn', 'Festival', 'of', 'Voice', '2017'], 'spans': [{'start': 3, 'end': 7, 'type': 'MISC'}], 'mask': [True, True, True, True, True, True, True, True]}
2024-06-20 10:39:40,643 - INFO - adaseq.data.preprocessors.sequence_labeling_preprocessor - label_to_id: {'O': 0, 'B-LOC': 1, 'I-LOC': 2, 'E-LOC': 3, 'S-LOC': 4, 'B-MISC': 5, 'I-MISC': 6, 'E-MISC': 7, 'S-MISC': 8, 'B-ORG': 9, 'I-ORG': 10, 'E-ORG': 11, 'S-ORG': 12, 'B-PER': 13, 'I-PER': 14, 'E-PER': 15, 'S-PER': 16}
Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias']

This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
** build_dataset error log: 'sequence-labeling-model is not in the custom_datasets registry group named-entity-recognition. Please make sure the correct version of ModelScope library is used.'
** build_dataset error log: 'sequence-labeling-model is not in the custom_datasets registry group named-entity-recognition. Please make sure the correct version of ModelScope library is used.'

Code for ICL sample selection

Hello!
Could you open-source the code for selecting the ICL examples? I would like to learn from it.
Thank you!

PGIM inference and the AdaSeq code

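Hello, how can I run an inference demo with a trained PGIM model, and which parts of the AdaSeq code did this project modify? The dataset links to images through img_id. At inference time, how exactly should the image and the text be provided as input?
Thanks!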