
cnn_nested_ner's Introduction

This is the code for the paper "An Embarrassingly Easy but Strong Baseline for Nested Named Entity Recognition".

We found that previous work on nested NER used different sentence tokenizations, resulting in different numbers of sentences and entities, which makes comparisons between papers unfair. To solve this issue, we propose using the pre-processing scripts under preprocess to obtain the ACE2004, ACE2005 and Genia datasets. Please refer to the readme for more details.

To run on the Genia dataset, use

python train.py -n 5 --lr 7e-6 --cnn_dim 200 --biaffine_size 400 --n_head 4 -b 8 -d genia  --logit_drop 0 --cnn_depth 3 

For ACE2004, use

python train.py -n 50 --lr 2e-5 --cnn_dim 120 --biaffine_size 200 --n_head 5 -b 48 -d ace2004 --logit_drop 0.1 --cnn_depth 2

For ACE2005, use

python train.py -n 50 --lr 2e-5 --cnn_dim 120 --biaffine_size 200 --n_head 5 -b 48 -d ace2005 --logit_drop 0 --cnn_depth 2 

Here, we set n_head, cnn_dim and biaffine_size to keep the number of parameters small. Based on our experiments, reducing n_head and enlarging cnn_dim and biaffine_size should give slightly better performance.

Customized data

If you want to use your own data, please organize it as follows. The data folder should contain the following files:

customized_data/
    - train.jsonlines
    - dev.jsonlines
    - test.jsonlines

In each file, each line should be a JSON object, like the following:

{"tokens": ["Our", "data", "suggest", "that", "lipoxygenase", "metabolites", "activate", "ROI", "formation", "which", "then", "induce", "IL-2", "expression", "via", "NF-kappa", "B", "activation", "."], "entity_mentions": [{"entity_type": "protein", "start": 12, "end": 13, "text": "IL-2"}, {"entity_type": "protein", "start": 15, "end": 17, "text": "NF-kappa B"}, {"entity_type": "protein", "start": 4, "end": 5, "text": "lipoxygenase"}, {"entity_type": "protein", "start": 4, "end": 6, "text": "lipoxygenase metabolites"}]}

The entity start is inclusive and the entity end is exclusive.
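The span convention can be checked mechanically before training. The following is a minimal sketch, not part of the repository: the helper name `check_record` is hypothetical, and it assumes that `text` is the space-joined tokens of the span, as in the example record above.

```python
import json


def check_record(line):
    """Return the entity mentions whose [start, end) span does not match their text."""
    record = json.loads(line)
    tokens = record["tokens"]
    bad = []
    for mention in record["entity_mentions"]:
        start, end = mention["start"], mention["end"]
        # start is inclusive and end is exclusive, so tokens[start:end] is the span.
        in_range = 0 <= start < end <= len(tokens)
        if not in_range or " ".join(tokens[start:end]) != mention["text"]:
            bad.append(mention)
    return bad


line = json.dumps({
    "tokens": ["induce", "IL-2", "expression", "via", "NF-kappa", "B", "activation", "."],
    "entity_mentions": [
        {"entity_type": "protein", "start": 1, "end": 2, "text": "IL-2"},
        {"entity_type": "protein", "start": 4, "end": 6, "text": "NF-kappa B"},
    ],
})
print(check_record(line))  # prints []
```

A record that used an inclusive end (for example `"start": 4, "end": 5` for "NF-kappa B") would be returned in the list, which makes off-by-one mistakes easy to spot.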

  • [Update 2022-08-18]
    We added pre-processing code to extract Genia entities from the raw data. We split train/dev/test by document to facilitate document-level NER studies.

cnn_nested_ner's People

Contributors

yhcc

cnn_nested_ner's Issues

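Error when running with a custom dataset

I converted my own dataset to the format the code requires and made sure every token sequence is shorter than 512. Using chinese-roberta-wwm-ext as the pre-trained model, with the same parameters as for the genia dataset, I get the following error:
[screenshot: screenshot-20240414-172235]
The error is reported at buffer[i, :len(f), :len(f)] = torch.from_numpy(f) in data/padder.py.
I look forward to your reply. Thank you.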

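A question about the run process

[image] What is this file that is being downloaded? Is it the pre-trained model? The download is too slow. Is there a file that can be downloaded in advance and then loaded directly?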

"Not compiled with CUDA support" issue

Hello,
Thank you for sharing your work and making it publicly available. I am trying to reproduce your experiments and eventually try it on another custom dataset.
However, after installing the requirements and launching the training script, I get the following error:

Traceback (most recent call last):
  File "train.py", line 190, in <module>
    trainer.run(num_train_batch_per_epoch=-1, num_eval_batch_per_dl=-1, num_eval_sanity_batch=1)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/trainer.py", line 663, in run
    sanity_check_res = self.evaluator.run(num_eval_batch_per_dl=num_eval_sanity_batch)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/evaluator.py", line 288, in run
    raise e
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/evaluator.py", line 281, in run
    results = self.evaluate_batch_loop.run(self, dataloader)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/loops/evaluate_batch_loop.py", line 55, in run
    raise e
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/loops/evaluate_batch_loop.py", line 43, in run
    self.batch_step_fn(evaluator, batch)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/loops/evaluate_batch_loop.py", line 68, in batch_step_fn
    outputs = evaluator.evaluate_step(batch) # 将batch输入到model中得到结果
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/controllers/evaluator.py", line 416, in evaluate_step
    outputs = self.driver.model_call(batch, self._evaluate_step, self._evaluate_step_signature_fn)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/drivers/torch_driver/single_device.py", line 85, in model_call
    return auto_param_call(fn, batch, signature_fn=signature_fn)
  File "/home/user/.local/lib/python3.8/site-packages/fastNLP/core/utils/utils.py", line 149, in auto_param_call
    return fn(**_has_params)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/Arabic-NER/CNN_Nested_NER/model/model.py", line 59, in forward
    state = scatter_max(last_hidden_states, index=indexes, dim=1)[0][:, 1:] # bsz x word_len x hidden_size
  File "/home/user/.local/lib/python3.8/site-packages/torch_scatter/scatter.py", line 72, in scatter_max
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
  File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: Not compiled with CUDA support

P.S.: I am using a Conda environment with python 3.8 (I have also tried 3.9 and 3.10 to no avail).

Dataset question

Can a custom Chinese dataset be used with this model?

Longformer integration

Hi,
I have a use case where max_length will be more than 512. Can this model work with Longformer? If so, which places would require changes? I see that max_length is specified in the padder, and there are a few other checks as well. Would there be any logical issue with using Longformer?

Thanks for the open source.

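A question about the loss function

Both the code and the paper use BCE as the loss function. Why is prediction based on the sigmoid function and a fixed threshold of 0.5? Shouldn't this multi-class task be based on softmax?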

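Question about the loss function

Why is prediction based on the sigmoid function and a fixed threshold of 0.5? Shouldn't this multi-class task be based on softmax? Because one of the datasets contains spans that belong to two entity types at the same time, sigmoid was simply used throughout. If we are working in a specific domain where all entities are flat and we use softmax, do we also need to modify the contents of decode in the code?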
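The difference raised in this issue can be illustrated with a small sketch. It is not the repository's decode code: the label names, the logits, and the helpers `sigmoid_decode` and `softmax_decode` are made up for illustration. With independent sigmoids and a 0.5 threshold, a span can receive several labels or none. With softmax it always receives exactly one, so an explicit "no entity" class would be needed and the decoding step would have to take an argmax instead of thresholding.

```python
import math

LABELS = ["protein", "DNA", "cell_type"]  # hypothetical label set


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_decode(logits, threshold=0.5):
    """Multi-label decoding: each label is decided independently of the others."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) > threshold]


def softmax_decode(logits):
    """Single-label decoding: the softmax argmax equals the argmax of the logits."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[best]


two_types = [2.0, 1.5, -3.0]   # a span scored highly for two types
no_entity = [-2.0, -1.0, -3.0]  # a span that is not an entity

print(sigmoid_decode(two_types), softmax_decode(two_types))
print(sigmoid_decode(no_entity), softmax_decode(no_entity))
```

In the second case the sigmoid variant returns an empty list, while the softmax variant is forced to pick a label, which is why a flat-only softmax setup would need its own "none" class.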

A question about preprocessing

Sorry for interrupting you. Regarding the preprocessing code you provided, it seems that the Chinese ACE05 corpus cannot be processed, and the assert part has too many restrictions on the Chinese corpus.

How can I run the code on a Mac without a discrete GPU? I switched to CPU but still keep getting errors

Traceback (most recent call last):
  File "/Users/ekko/begin/CNN_Nested_NER-master/train.py", line 204, in <module>
    progress_bar='rich')
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/controllers/trainer.py", line 492, in __init__
    **kwargs
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/choose_driver.py", line 37, in choose_driver
    return initialize_torch_driver(driver, device, model, **kwargs)
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/torch_driver/initialize_torch_driver.py", line 86, in initialize_torch_driver
    return TorchSingleDriver(model, device, **kwargs)
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/torch_driver/single_device.py", line 60, in __init__
    super(TorchSingleDriver, self).__init__(model, fp16=fp16, torch_kwargs=torch_kwargs, **kwargs)
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/torch_driver/torch_driver.py", line 62, in __init__
    self.auto_cast, _grad_scaler = _build_fp16_env(dummy=not self.fp16)
  File "/Users/ekko/begin/CNN_Nested_NER-master/fastNLP_10/core/drivers/torch_driver/utils.py", line 174, in _build_fp16_env
    raise RuntimeError("Pytorch is not installed in gpu version, please use device='cpu'.")
RuntimeError: Pytorch is not installed in gpu version, please use device='cpu'.

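What does bpes mean in the code?

assert len(bpes)<=512, len(bpes)
I have already truncated everything in my data that is longer than 400, so why does this still raise an error?
What does bpes stand for here?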
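Judging from the tokenization loop quoted in a later issue on this page, `bpes` holds the subword ids returned by `self.tokenizer.encode`, preceded by a `[CLS]` id, so the assertion counts subword pieces rather than words. Because one word can split into several pieces, a sequence of 400 words can exceed 512 pieces. The sketch below shows the effect with a toy tokenizer; `toy_subword_tokenize` and `count_bpes` are hypothetical stand-ins, not the tokenizer the repository uses.

```python
def toy_subword_tokenize(word, piece_len=3):
    """Hypothetical stand-in for a real subword tokenizer: fixed-size character chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def count_bpes(words):
    """Count subword pieces for a word sequence, plus one for the leading [CLS]."""
    return 1 + sum(len(toy_subword_tokenize(w)) for w in words)


words = ["lipoxygenase", "metabolites"] * 200  # 400 words
print(len(words), count_bpes(words))  # prints 400 1601
```

With a real tokenizer the same count can be obtained by summing `len(tokenizer.encode(word, add_special_tokens=False))` over the words, which is the call used in the quoted loop, so truncation needs to be applied to the piece count rather than the word count.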

Error when multiprocessing is invoked

  File "D:@yongzhao@pyproject\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1049, in __init__
    w.start()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
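The traceback is cut off, but `_check_not_importing_main` is the point where Python's `spawn` start method (the default on Windows) complains when worker processes are started while the main module is still being imported. A common remedy, sketched below with the standard library only, is to put the entry point under `if __name__ == "__main__":`. Whether `train.py` needs exactly this change is an assumption; setting the DataLoader's `num_workers` to 0 is another option. The function names are illustrative.

```python
import multiprocessing as mp


def square(x):
    return x * x


def main():
    # Worker processes are created here, not at import time.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    # Under spawn, child processes re-import this module; the guard keeps
    # them from re-running main() and trying to start workers of their own.
    print(main())
```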

Hello, what does bpes mean?

bpes = [self.cls]
indexes = [0]
spans = []
ins_lst = []
new_ent_str = Counter()
for _raw_words, _raw_ents in zip(raw_sents, raw_entss):
    _indexes = []
    _bpes = []
    for s, e, t in _raw_ents:
        new_ent_str[''.join(_raw_words[s:e+1])] += 1

    for idx, word in enumerate(_raw_words, start=0):
        if word in word2bpes:
            __bpes = word2bpes[word]
        else:
            __bpes = self.tokenizer.encode(' '+word if self.add_prefix_space else word,
                                           add_special_tokens=False)
            word2bpes[word] = __bpes
        _indexes.extend([idx]*len(__bpes))
        _bpes.extend(__bpes)
    next_word_idx = indexes[-1]+1
    if len(bpes) + len(_bpes) <= self.max_len:
        bpes = bpes + _bpes
        indexes += [i + next_word_idx for i in _indexes]
        spans += [(s+next_word_idx-1, e+next_word_idx-1, label2idx.get(t), ) for s, e, t in _raw_ents]
    else:
        new_ins = get_new_ins(bpes, spans, indexes)
        ins_lst.append(new_ins)
        indexes = [0] + [i + 1 for i in _indexes]
        spans = [(s, e, label2idx.get(t), ) for s, e, t in _raw_ents]
        bpes = [self.cls] + _bpes
if bpes:
    ins_lst.append(get_new_ins(bpes, spans, indexes))

Regarding this part of the code for the paper, I don't understand what it is meant to handle.
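One reading of the loop quoted above is that it greedily packs consecutive sentences into a single input until adding the next sentence would exceed `self.max_len`, then starts a new input that begins with `[CLS]` again. The sketch below shows only that packing logic on plain lists; `pack_sentences` and its arguments are hypothetical, and the word-index and span-offset bookkeeping of the original is left out.

```python
def pack_sentences(sentences, max_len, cls="[CLS]"):
    """Greedily concatenate sentences (lists of pieces) into chunks of at most max_len pieces."""
    chunks = []
    current = [cls]
    for pieces in sentences:
        if len(current) + len(pieces) <= max_len:
            # The sentence still fits: append it to the current chunk.
            current = current + pieces
        else:
            # It does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = [cls] + pieces
    if current:
        chunks.append(current)
    return chunks


sentences = [["a", "b"], ["c", "d", "e"], ["f"]]
print(pack_sentences(sentences, max_len=5))
# prints [['[CLS]', 'a', 'b'], ['[CLS]', 'c', 'd', 'e', 'f']]
```

In this sketch a single sentence with more than `max_len - 1` pieces still ends up in a chunk of its own that exceeds the limit, so long sentences have to be shortened before packing.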

Loss function

Where does the golden label mentioned in the paper appear in the code? I couldn't find it. Is it this matrix?
[image]

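Loss function

Hello, I couldn't find the code that corresponds to the sigmoid and the 0.5 threshold described in the paper. Is it inside the loss function? I couldn't find it there either.
[image]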

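Can it handle Chinese?

Hello, is this model applicable to Chinese? Also, does it only work for nested entity extraction? If a dataset contains both flat and nested entities, can the flat entities be recognized too?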

How to use a trained model to predict on nested NER sentences?

Hello,
Thank you for making your repo public; it is very useful.
I have successfully trained a model using the provided code and the genia example data. However, I'm unsure how to use this trained model to make predictions on new data containing nested NER sentences. I would greatly appreciate any instructions.
Thank you for your assistance!

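Loss function

I want to change the binary cross-entropy into a multi-class cross-entropy by replacing F.binary_cross_entropy with F.cross_entropy. Instead of flattening the predicted scores and the target matrix with flat_scores = scores.reshape(-1) and flat_matrix = matrix.reshape(-1), I pass scores and matrix directly to F.cross_entropy. However, at the final step, loss = ((flat_loss.view(input_ids.size(0), -1)*mask).sum(dim=-1)).mean(), I get a dimension mismatch: the loss coming out of the cross-entropy has shape [8,15,5], whereas the binary cross-entropy output has shape [8,15,15,5], so after switching to cross-entropy the loss no longer matches the dimensions of mask. What should I do?
[image]
Does the author have any suggestions? Any help would be appreciated.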
