yhcc / cnn_nested_ner

cnn_nested_ner's Introduction

This is the code for the paper An Embarrassingly Easy but Strong Baseline for Nested Named Entity Recognition.

We found that previous work on nested NER used different sentence tokenizations, which resulted in different numbers of sentences and entities and made comparisons between papers unfair. To solve this issue, we propose using the pre-processing scripts under preprocess to obtain the ACE2004, ACE2005 and Genia datasets. Please refer to the readme for more details.

To train on the Genia dataset, run

python train.py -n 5 --lr 7e-6 --cnn_dim 200 --biaffine_size 400 --n_head 4 -b 8 -d genia  --logit_drop 0 --cnn_depth 3

For ACE2004, run

python train.py -n 50 --lr 2e-5 --cnn_dim 120 --biaffine_size 200 --n_head 5 -b 48 -d ace2004 --logit_drop 0.1 --cnn_depth 2

For ACE2005, run

python train.py -n 50 --lr 2e-5 --cnn_dim 120 --biaffine_size 200 --n_head 5 -b 48 -d ace2005 --logit_drop 0 --cnn_depth 2

Here, we set n_head, cnn_dim and biaffine_size to keep the number of parameters small. Based on our experiments, reducing n_head and enlarging cnn_dim and biaffine_size should give slightly better performance.

Customized data

If you want to use your own data, please organize it as follows. The data folder should contain the following files:

customized_data/
    - train.jsonlines
    - dev.jsonlines
    - test.jsonlines

In each file, each line should be a JSON object like the following:

{"tokens": ["Our", "data", "suggest", "that", "lipoxygenase", "metabolites", "activate", "ROI", "formation", "which", "then", "induce", "IL-2", "expression", "via", "NF-kappa", "B", "activation", "."], "entity_mentions": [{"entity_type": "protein", "start": 12, "end": 13, "text": "IL-2"}, {"entity_type": "protein", "start": 15, "end": 17, "text": "NF-kappa B"}, {"entity_type": "protein", "start": 4, "end": 5, "text": "lipoxygenase"}, {"entity_type": "protein", "start": 4, "end": 6, "text": "lipoxygenase metabolites"}]}

The entity start index is inclusive and the end index is exclusive.
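The format described above can be checked with a short script before training. The following is a minimal sketch, not part of the repository: the function name validate_record is hypothetical, and the check that "text" equals the space-joined tokens is an assumption inferred from the sample record above.

```python
import json


def validate_record(line):
    """Return a list of problems found in one jsonlines record (empty if none)."""
    record = json.loads(line)
    tokens = record["tokens"]
    problems = []
    for mention in record["entity_mentions"]:
        start, end = mention["start"], mention["end"]
        # start is inclusive and end is exclusive, as stated in the README.
        if not (0 <= start < end <= len(tokens)):
            problems.append(f"span out of range: {start}-{end}")
            continue
        # Assumption: "text" is the covered tokens joined by single spaces.
        expected = " ".join(tokens[start:end])
        if mention.get("text") != expected:
            problems.append(f"text mismatch: {mention.get('text')!r} != {expected!r}")
    return problems


sample = json.dumps({
    "tokens": ["induce", "IL-2", "expression", "via", "NF-kappa", "B", "activation", "."],
    "entity_mentions": [
        {"entity_type": "protein", "start": 1, "end": 2, "text": "IL-2"},
        {"entity_type": "protein", "start": 4, "end": 6, "text": "NF-kappa B"},
    ],
})
print(validate_record(sample))  # prints []
```

Running the same function over every line of the train, dev and test files should surface off-by-one span errors early.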

[update in 20220818]
We added pre-processing code to extract Genia entities from the raw data. The train/dev/test split is made by document to facilitate document-level NER study.

cnn_nested_ner's Issues

How to filter the nested entities and the flat entities?

Hi! You propose a nice idea! But I am confused about how to filter the nested entities and the flat entities in the code. Is it enough to set allow_entity_nested to false?
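At the data level, nested and flat mentions can be separated by checking span overlap. The following is a minimal sketch that works on the entity_mentions format from the README; the function name split_nested_and_flat and the overlap-based definition of "nested" are assumptions for illustration, and it does not describe what allow_entity_nested does inside the repository.

```python
def split_nested_and_flat(mentions):
    """Split mentions into (nested, flat) by span overlap.

    A mention counts as nested here if its [start, end) span overlaps
    the span of any other mention in the same sentence.
    """
    nested, flat = [], []
    for i, m in enumerate(mentions):
        overlaps = any(
            i != j and m["start"] < o["end"] and o["start"] < m["end"]
            for j, o in enumerate(mentions)
        )
        (nested if overlaps else flat).append(m)
    return nested, flat


mentions = [
    {"entity_type": "protein", "start": 12, "end": 13, "text": "IL-2"},
    {"entity_type": "protein", "start": 15, "end": 17, "text": "NF-kappa B"},
    {"entity_type": "protein", "start": 4, "end": 5, "text": "lipoxygenase"},
    {"entity_type": "protein", "start": 4, "end": 6, "text": "lipoxygenase metabolites"},
]
nested, flat = split_nested_and_flat(mentions)
print([m["text"] for m in nested])  # ['lipoxygenase', 'lipoxygenase metabolites']
print([m["text"] for m in flat])    # ['IL-2', 'NF-kappa B']
```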

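Why does the model use far less GPU memory than W2NER at inference time?

Hello, the paper compares this model with W2NER, and the two models have similar parameter counts. However, when I run W2NER inference I often run out of GPU memory, whereas this almost never happens with CNN_NEST_NER. What is the reason?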

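Error when running with a custom dataset

I converted my dataset to the format the code requires and made sure every token sequence is shorter than 512. I use chinese-roberta-wwm-ext as the pre-trained model with the same parameters as for the genia dataset, and I get the following error.

The error is reported at buffer[i, :len(f), :len(f)] = torch.from_numpy(f) in data/padder.py.
I look forward to your reply. Thank you.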

Did you test this model on normal ner tasks?

Great work!
Just wondering about its performance on normal NER tasks, such as datasets with only flat (non-nested) entities.

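A question about the run process

What is the file being downloaded? Is it the pre-trained model? The download is very slow. Is there a file I can download in advance and then load directly?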

"Not compiled with CUDA support" issue

Hello,
Thank you for sharing your work and making it publicly available. I am trying to reproduce your experiments and eventually try it on another custom dataset.
However, after installing the requirements and launching the training script, I get the following error:

Traceback (most recent call last):
  File "train.py", line 190, in <module>
    trainer.run(num_train_batch_per_epoch=-1, num_eval_batch_per_dl=-1, num_eval_sanity_batch=1)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/trainer.py", line 663, in run
    sanity_check_res = self.evaluator.run(num_eval_batch_per_dl=num_eval_sanity_batch)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/evaluator.py", line 288, in run
    raise e
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/evaluator.py", line 281, in run
    results = self.evaluate_batch_loop.run(self, dataloader)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/loops/evaluate_batch_loop.py", line 55, in run
    raise e
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/loops/evaluate_batch_loop.py", line 43, in run
    self.batch_step_fn(evaluator, batch)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/loops/evaluate_batch_loop.py", line 68, in batch_step_fn
    outputs = evaluator.evaluate_step(batch) # 将batch输入到model中得到结果
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/evaluator.py", line 416, in evaluate_step
    outputs = self.driver.model_call(batch, self._evaluate_step, self._evaluate_step_signature_fn)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/drivers/torch_driver/single_device.py", line 85, in model_call
    return auto_param_call(fn, batch, signature_fn=signature_fn)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/utils/utils.py", line 149, in auto_param_call
    return fn(**_has_params)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/Arabic-NER/CNN_Nested_NER/model/model.py", line 59, in forward
    state = scatter_max(last_hidden_states, index=indexes, dim=1)[0][:, 1:] # bsz x word_len x hidden_size
  File "/home/user/.local/lib/python3.8/site-packages/torch_scatter/scatter.py", line 72, in scatter_max
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
  File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: Not compiled with CUDA support

P.S.: I am using a Conda environment with python 3.8 (I have also tried 3.9 and 3.10 to no avail).

About the dataset

Can a self-made Chinese dataset be used with this model?

Longformer integration

Hi,
I have a use case where max_length will be more than 512. Can this model work with Longformer? If so, which places will require changes? I see that max_length is specified in the padder, and there are a few other checks as well. Will there be any logical issue with using Longformer?

Thanks for the open source.

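> If you use softmax, changes are needed, because the way of deciding whether a position is an entity changes.

> If you use softmax, changes are needed, because the way of deciding whether a position is an entity changes.

Hello, I would also like to ask why the loss function is binary cross-entropy. Isn't this a multi-class classification problem? Shouldn't it use the cross-entropy loss?

Originally posted by @houyuchao in #23 (comment)
Read the paper carefully.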

How can this error be resolved? Why can max_len be 0?

 bsz, max_len, dim = h.size()
 h = h.reshape(bsz, max_len, self.n_head, -1)

Here max_len turns out to be 0.

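A question about the loss function

Both the code and the paper use BCE as the loss function. Why is prediction based on the sigmoid function and a fixed threshold of 0.5? Shouldn't this multi-class task be based on softmax?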

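Question about the loss function

Why is prediction based on the sigmoid function and a fixed threshold of 0.5? Shouldn't this multi-class task be based on softmax? Because one dataset contains a span that belongs to two entity types at the same time, sigmoid was simply used throughout. If we work in a specific domain where all entities are flat and we use softmax, do we still need to modify the contents of decode in the code?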
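The difference between the two decision rules discussed in these issues can be illustrated on the scores of a single span. The following is a minimal sketch in plain Python; the label names and logit values are made up for illustration, and the code is not taken from the repository's decode function.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decode_sigmoid(labels, logits, threshold=0.5):
    """Independent per-type decision: zero, one or several types per span."""
    return [label for label, z in zip(labels, logits) if sigmoid(z) > threshold]


def decode_softmax(labels, logits):
    """Exclusive decision: exactly one type per span."""
    probs = softmax(logits)
    return labels[probs.index(max(probs))]


labels = ["protein", "DNA", "cell_type"]  # hypothetical entity types
logits = [2.0, 1.5, -3.0]                 # hypothetical scores for one span

multi = decode_sigmoid(labels, logits)
single = decode_softmax(labels, logits)
print(multi)   # ['protein', 'DNA']
print(single)  # protein

# With all scores negative, the sigmoid rule predicts no entity,
# while the softmax rule still returns a type.
print(decode_sigmoid(labels, [-2.0, -1.0, -3.0]))  # []
print(decode_softmax(labels, [-2.0, -1.0, -3.0]))  # DNA
```

The last two lines suggest why a softmax variant would need an explicit "no entity" class, which is one way the decoding step would have to change.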

A question about preprocessing

Sorry to bother you. The preprocessing code you provided does not seem able to process the Chinese ACE05 corpus, and the assert statements place too many restrictions on the Chinese corpus.

How can I run the code on a Mac without a discrete GPU? I switched to CPU and it still raises an error

Traceback (most recent call last):
  File "/Users/ekko/begin/CNN_Nested_NER-master/train.py", line 204, in <module>
    progress_bar='rich')
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/controllers/trainer.py", line 492, in __init__
    **kwargs
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/choose_driver.py", line 37, in choose_driver
    return initialize_torch_driver(driver, device, model, **kwargs)
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/torch_driver/initialize_torch_driver.py", line 86, in initialize_torch_driver
    return TorchSingleDriver(model, device, **kwargs)
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/torch_driver/single_device.py", line 60, in __init__
    super(TorchSingleDriver, self).__init__(model, fp16=fp16, torch_kwargs=torch_kwargs, **kwargs)
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/torch_driver/torch_driver.py", line 62, in __init__
    self.auto_cast, _grad_scaler = _build_fp16_env(dummy=not self.fp16)
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/torch_driver/utils.py", line 174, in _build_fp16_env
    raise RuntimeError("Pytorch is not installed in gpu version, please use device='cpu'.")
RuntimeError: Pytorch is not installed in gpu version, please use device='cpu'.

How should the FEP, FER, NEP and NER values in the results be interpreted?

What does bpes mean in the code?

assert len(bpes)<=512, len(bpes)
I have already truncated the items in my data that are longer than 400, so why does this still raise an error?
What does bpes represent here?
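In the snippet quoted in a later issue on this page, bpes is extended with the output of self.tokenizer.encode(...) for each word, so it holds subword ids rather than words. One word can map to several subwords, which is why a word-level truncation at 400 may still exceed 512. The following sketch illustrates the effect with a toy tokenizer; toy_subword_tokenize is a hypothetical stand-in that cuts words into fixed-size pieces and is not the tokenizer used by the repository.

```python
def toy_subword_tokenize(word, piece_len=4):
    """Hypothetical stand-in for a real subword tokenizer: fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def count_bpes(words):
    """Count subword pieces, plus one position for a leading special token."""
    return 1 + sum(len(toy_subword_tokenize(w)) for w in words)


words = ["lipoxygenase", "metabolites"] * 200  # 400 words
print(len(words), count_bpes(words))  # 400 1201
```

With a real tokenizer the ratio depends on the vocabulary, so the safe check is on the subword count after tokenization rather than on the word count.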

Error when calling multiprocess

  File "D:@yongzhao@pyproject\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1049, in __init__
    w.start()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main

ACE2005 dataset

Hello, what does bpes mean?

        bpes = [self.cls]
        indexes = [0]
        spans = []
        ins_lst = []
        new_ent_str = Counter()
        for _raw_words, _raw_ents in zip(raw_sents, raw_entss):
            _indexes = []
            _bpes = []
            for s, e, t in _raw_ents:
                new_ent_str[''.join(_raw_words[s:e+1])] += 1

            for idx, word in enumerate(_raw_words, start=0):
                if word in word2bpes:
                    __bpes = word2bpes[word]
                else:
                    __bpes = self.tokenizer.encode(' '+word if self.add_prefix_space else word,
                                                   add_special_tokens=False)
                    word2bpes[word] = __bpes
                _indexes.extend([idx]*len(__bpes))
                _bpes.extend(__bpes)
            next_word_idx = indexes[-1]+1
            if len(bpes) + len(_bpes) <= self.max_len:
                bpes = bpes + _bpes
                indexes += [i + next_word_idx for i in _indexes]
                spans += [(s+next_word_idx-1, e+next_word_idx-1, label2idx.get(t), ) for s, e, t in _raw_ents]
            else:
                new_ins = get_new_ins(bpes, spans, indexes)
                ins_lst.append(new_ins)
                indexes = [0] + [i + 1 for i in _indexes]
                spans = [(s, e, label2idx.get(t), ) for s, e, t in _raw_ents]
                bpes = [self.cls] + _bpes
        if bpes:
            ins_lst.append(get_new_ins(bpes, spans, indexes))

I do not understand what this part of the code from the paper is meant to handle.
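Read on its own, the quoted loop appears to concatenate consecutive sentences into one training instance until the subword budget self.max_len would be exceeded, and then start a new instance. The following is a simplified sketch of that greedy packing only; the function name pack_sentences and the integer ids are assumptions for illustration, and the shifting of word indexes and entity spans done in the quoted code is left out.

```python
def pack_sentences(sentences, max_len, cls_id=0):
    """Greedily pack consecutive sentences (lists of ids) into chunks.

    Each chunk starts with cls_id. A new chunk is opened when adding the
    next sentence would push the current chunk past max_len. A single
    sentence longer than max_len - 1 still produces an oversized chunk.
    """
    chunks = []
    current = [cls_id]
    for sent in sentences:
        if len(current) + len(sent) <= max_len:
            current = current + sent
        else:
            chunks.append(current)
            current = [cls_id] + sent
    chunks.append(current)
    return chunks


sentences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_sentences(sentences, max_len=6))
# [[0, 1, 2, 3, 4, 5], [0, 6, 7, 8, 9]]
```

Packing neighbouring sentences gives the encoder more context per instance, at the cost of having to offset every span by the number of words already in the chunk.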

Loss function

Where does the golden label mentioned in the paper appear in the code? I could not find it. Is it this matrix?

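Loss function

Hello, I could not find the code that corresponds to the sigmoid and the 0.5 threshold described in the paper. Is it in the loss function? I could not find it there either.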

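Can it be used for Chinese?

Hello, is this model applicable to Chinese? Also, is it only suitable for nested entity extraction? If a dataset contains both flat and nested entities, can the flat entities be recognized too?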

ACE Chinese dataset split

Hello, is there an authoritative split file for the Chinese ACE04 and ACE05 data?

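Hello, I would like to ask what the parameters are when using roberta

I found that the code contains no parameter settings for roberta, so I would like to ask how the parameters are set when using roberta.

Hello, are the experimental results in the paper from the test set or the validation set?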

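Question possibly related to the fastNLP framework

On the second-to-last line of the Python file being run, trainer.run(), an error says that a bool value cannot be run. How can this be solved?

Encoding of the ground-truth label yij

Is the ground-truth label encoded with bpe, or is it one-hot?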

wandb

How to use a pre-trained model to predict nested NER sentences?

Hello,
Thank you for making your repo public; it is very useful.
I have successfully trained a model using the provided code and the genia example data. However, I am unsure how to use this trained model to make predictions on new data containing nested NER sentences. I would greatly appreciate any instructions.
Thank you for your assistance!

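GELU is used in the paper, but in this code of yours the final GELU is not applied. Would this have a large effect on the results?

One more question: your code runs for me, but precision, F and the other scores are all 0.0. Am I using something wrong?

This is the GPU used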

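What does this part of the code mean?

The paper says the multi-head Biaffine is followed directly by the CNN. What is being computed here?

Error when running ACE2004: KeyError: "Field name 'input_ids' not found in dataset dev"

What should I do about this?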

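Loss function

I want to change the binary cross-entropy to a multi-class cross-entropy, replacing F.binary_cross_entropy with F.cross_entropy. Instead of computing flat_scores = scores.reshape(-1) and flat_matrix = matrix.reshape(-1) from the predicted scores and the target matrix, I pass scores and matrix directly to F.cross_entropy. However, the final step loss = ((flat_loss.view(input_ids.size(0), -1)*mask).sum(dim=-1)).mean() fails with a dimension mismatch: the loss coming out of the cross-entropy has shape [8, 15, 5], whereas the binary cross-entropy output has shape [8, 15, 15, 5], so after switching to cross-entropy the loss no longer matches the dimensions of mask. What should I do?

Does the author have any suggestions? Please help.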
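One possible reading of the shape change reported above: PyTorch's F.cross_entropy treats dimension 1 of its input as the class dimension, so a [8, 15, 15, 5] score tensor would be read as having 15 classes and reduced to [8, 15, 5]. A softmax loss over entity types would instead reduce the last axis and yield one loss value per (start, end) cell. The sketch below shows that reduction in plain Python for a single sentence; the function name, the toy scores and the per-cell mask are assumptions for illustration and are not taken from the repository.

```python
import math


def cross_entropy_last_axis(scores, targets):
    """Softmax cross-entropy over the last (class) axis of nested lists.

    scores:  [L][L][C] logits, one vector of C class scores per span cell.
    targets: [L][L] integer class indices.
    Returns: [L][L] losses, one value per span cell.
    """
    losses = []
    for row_scores, row_targets in zip(scores, targets):
        row = []
        for logits, target in zip(row_scores, row_targets):
            m = max(logits)  # log-sum-exp with the max subtracted for stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            row.append(log_z - logits[target])
        losses.append(row)
    return losses


# Toy example: sentence length 2, 3 classes (hypothetical values).
scores = [[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
          [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
targets = [[0, 0], [0, 2]]
mask = [[1, 1], [0, 1]]  # assumed per-cell mask (upper triangle only)

losses = cross_entropy_last_axis(scores, targets)
masked_sum = sum(l * m for rl, rm in zip(losses, mask) for l, m in zip(rl, rm))
mean_loss = masked_sum / sum(sum(row) for row in mask)
print(len(losses), len(losses[0]))  # 2 2
```

Because the class axis disappears in this reduction, a mask built for the per-class BCE loss would need one entry per span cell rather than one per cell and class.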
