
bartner's People

Contributors

shadowteamcn, yhcc


bartner's Issues

Question about the model figure in the paper

Why are the output indexes in the paper's model figure 2 3 7 2 5 6, rather than 2 3 7 2 5 7?

Question about the code

Hello. When running the code, the samples within a batch may have different numbers of entities, which makes the automatic padding raise an error. How should I handle this?
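
For reference, the error comes from stacking variable-length target lists into one rectangular array. A minimal sketch of padding every sample's target sequence to the batch maximum (the pad value here is an illustrative assumption, not necessarily BARTNER's actual one):

import numpy as np

def pad_batch(targets, pad_val=1):
    """Pad variable-length target id sequences to the batch max length."""
    max_len = max(len(t) for t in targets)
    out = np.full((len(targets), max_len), pad_val, dtype=np.int64)
    for i, t in enumerate(targets):
        out[i, :len(t)] = t
    return out

print(pad_batch([[0, 11, 2, 1], [0, 11, 2, 20, 3, 1]]))  # (2, 6) array, first row padded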

ValueError when running train.py on the CoNLL-2003 dataset

I tried to run train.py on the CoNLL-2003 dataset, but hit the following error:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
My CoNLL-2003 files store the data in the following format:
EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

Peter B-PER
Blackburn I-PER

BRUSSELS B-LOC
1996-08-22 O
...
Has anyone run into something similar?
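
A likely culprit (an assumption based on the message, not verified against this repo): NumPy 1.24 removed the implicit creation of ragged object arrays, so the bare np.array(...) call inside fastNLP's padder now raises exactly this ValueError where older NumPy versions only warned. A quick demonstration:

import numpy as np  # >= 1.24

ragged = [[1, 2, 3], [4, 5]]        # rows of unequal length, like per-sample entity targets
try:
    np.array(ragged)                # NumPy < 1.24: object array + deprecation warning
except ValueError as e:             # NumPy >= 1.24: the error quoted above
    print(e)
np.array(ragged, dtype=object)      # still allowed when dtype=object is explicit

If that is the cause, downgrading NumPy below 1.24 is the least invasive workaround.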

Reproducing the OntoNotes results

Hi,

Thanks for open-sourcing the code with such thorough comments.

I reproduced the paper's results on CoNLL-03, but ran into some trouble on OntoNotes.

I use the data provided at https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO and modified

BARTNER/train.py, lines 124 to 126 at d54d331:

elif dataset_name == 'en-ontonotes':
    paths = '../data/en-ontonotes/english'
    data_bundle = pipe.process_from_file(paths)

so that paths points at my files:

paths = {
    'train': "/home/yedeming/data/ontonotes/onto.train.ner",
    'dev': "/home/yedeming/data/ontonotes/onto.development.ner",
    'test': "/home/yedeming/data/ontonotes/onto.test.ner",
}

The output is as follows:

Save cache to caches/data_facebook/bart-large_en-ontonotes_word.pt.
max_len_a:0.8, max_len:10
In total 3 datasets:
        train has 115812 instances.
        dev has 15680 instances.
        test has 12217 instances.
The number of tokens in tokenizer  50265
50283 50288
......
Best test performance(may not correspond to the best dev performance):{'Seq2SeqSpanMetric': {'f': 87.64999999999999, 'rec': 88.52, 'pre': 86.79, 'em': 0.8727}} achieved at Epoch:16.
Best test performance(correspond to the best dev performance):{'Seq2SeqSpanMetric': {'f': 87.36, 'rec': 88.28, 'pre': 86.47, 'em': 0.8717}} achieved at Epoch:26.

In Epoch:26/Step:68409, got best dev performance:
Seq2SeqSpanMetric: f=88.94, rec=89.79, pre=88.11, em=0.859

The test set F1 is 87.36, which is far below the expected number; I'm not sure what went wrong.

Looking forward to your reply!
叶德铭

FASTNLP batch loader error

Has anyone had similar issues when running train.py with a customized dataset? My error is as follows:
training epochs started 2022-12-31-17-14-10-312655
Traceback (most recent call last):
  File "train.py", line 252, in <module>
    trainer.train(load_best_model=False)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 667, in train
    raise e
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 658, in train
    self._train()
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 712, in _train
    for batch_x, batch_y in self.data_iterator:
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 267, in __iter__
    for indices, batch_x, batch_y in self.dataiter:
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 91, in collate_fn
    sin_y = _pad(sin_y, dataset=self.dataset, as_numpy=self.as_numpy)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 44, in _pad
    res = f.pad(vlist)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/field.py", line 492, in pad
    return self.padder(contents, field_name=self.name, field_ele_dtype=self.dtype, dim=self._cell_ndim)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/field.py", line 247, in __call__
    return np.array(contents)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.
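
If the root cause is a ragged field rather than the NumPy version, fastNLP lets you swap the padder on a per-field basis. A sketch of that workaround (the __call__ signature is copied from the traceback above; the field name and pad value are hypothetical):

import numpy as np
from fastNLP.core.field import Padder

class RaggedPadder(Padder):
    """Pad a batch of variable-length int sequences into a rectangular array."""
    def __call__(self, contents, field_name, field_ele_dtype, dim):
        max_len = max(len(c) for c in contents)
        out = np.full((len(contents), max_len), self.pad_val, dtype=np.int64)
        for i, c in enumerate(contents):
            out[i, :len(c)] = c
        return out

# e.g. for whichever field the worker chokes on:
# data_bundle.get_dataset('train').set_padder('tgt_tokens', RaggedPadder(pad_val=1))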

Indexes shifting

Hello, nice work! Sorry if I'm missing something. I have a question about the decoder's output in your code.

Based on your code, it seems you're shifting the position indexes of the tokens by the number of labels. I was wondering why you shift the token indexes instead of shifting the labels. Thank you!

Please find an example below (results from here).

raw_words: ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
target_ids: [0, 11, 2, 20, 3, 1]	# each of the token positions is shifted by 6
word_bpe_ids: [0, 13910, 3376, 2076, 111, 344, 591, 1889, 7777, 226, 23806, 975, 17164, 2156, 3858, 16712, 2808, 31987, 4454, 18819, 5885, 10885, 2571, 479, 2]
word_bpe_tokens: ['<s>', 'ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.', '</s>']
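
For anyone tracing the numbers: a small sketch of the pointer arithmetic, assuming the CoNLL-2003 setup (2 special tokens plus 4 entity labels, so the offset is 6). A word's pointer is the position of its first BPE in word_bpe_tokens (with '<s>' at position 0) plus that offset:

num_specials, num_labels = 2, 4       # target vocab: [<s>, </s>, 4 labels, pointers...]
offset = num_specials + num_labels    # = 6

word_bpe_tokens = ['<s>', 'ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET',
                   'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN',
                   'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.', '</s>']

japan_ptr = word_bpe_tokens.index('ĠJ') + offset    # 5 + 6 = 11, as in target_ids
china_ptr = word_bpe_tokens.index('ĠCH') + offset   # 14 + 6 = 20, as in target_ids
print(japan_ptr, china_ptr)

Shifting the pointers rather than the labels keeps the label ids fixed at the low end of the target vocabulary, so (as far as I can tell) the decoder can score labels and source positions in a single output distribution.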

vocab_file not found

File "C:\Users\1\miniconda3\lib\site-packages\transformers\models\gpt2\tokenization_gpt2.py", line 179, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

Hello, with transformers 3.4.0 I keep getting this error when loading the tokenizer. Looking at the transformers source, BartTokenizer loads a JSON-format vocabulary, but the pretrained BART model I downloaded from Hugging Face ships its vocabulary as a TXT file. Is this a transformers version issue? Thanks.
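
For what it's worth: BartTokenizer is a byte-level BPE tokenizer and expects vocab.json plus merges.txt, while a vocab.txt usually indicates a BERT-style WordPiece vocabulary. If the checkpoint is the fnlp Chinese BART, it is intended to be loaded with BertTokenizer rather than BartTokenizer, roughly:

# Sketch, assuming a Chinese BART checkpoint that ships a BERT-style vocab.txt:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")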

'tgt_seq_len' is not defined

Traceback (most recent call last):
  File "predictor.py", line 128, in <module>
    tgt_seq_len = (tgt_seq_len - 2).tolist()
NameError: name 'tgt_seq_len' is not defined

Error when trying Chinese BART

I want to use the model on a Chinese dataset, so I swapped BART for the Hugging Face chinese-bart, but I keep getting the following error:

Cannot set field:src_tokens as input, exception happens at the 0 value.
Traceback (most recent call last):
  File "/workspace/BARTNER/train.py", line 135, in <module>
    data_bundle, tokenizer, mapping2id = get_data()
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/utils.py", line 357, in wrapper
    results = func(*args, **kwargs)
  File "/workspace/BARTNER/train.py", line 127, in get_data
    data_bundle = pipe.process_from_file(paths, demo=demo)
  File "/workspace/BARTNER/data/pipe.py", line 218, in process_from_file
    data_bundle = self.process(data_bundle)
  File "/workspace/BARTNER/data/pipe.py", line 192, in process
    data_bundle.set_input('tgt_tokens', 'src_tokens', 'src_seq_len', 'tgt_seq_len', 'first')
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/io/data_bundle.py", line 142, in set_input
    dataset.set_input(field_name, flag=flag, use_1st_ins_infer_dim_type=use_1st_ins_infer_dim_type)
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/dataset.py", line 787, in set_input
    raise e
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/dataset.py", line 784, in set_input
    self.field_arrays[name].is_input = flag
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 371, in is_input
    self._check_dtype_and_ndim(only_check_1st_ins_dim_type=self._use_1st_ins_infer_dim_type)
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 423, in _check_dtype_and_ndim
    raise e
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 406, in _check_dtype_and_ndim
    type_0, dim_0 = _get_ele_type_and_dim(cell_0)
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 56, in _get_ele_type_and_dim
    res = [_get_ele_type_and_dim(cell_i, dim) for cell_i in cell]
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 56, in <listcomp>
    res = [_get_ele_type_and_dim(cell_i, dim) for cell_i in cell]
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 84, in _get_ele_type_and_dim
    raise SetInputOrTargetException(f"Cannot process type:{type(cell)}.")
fastNLP.core.field.SetInputOrTargetException: Cannot process type:<class 'NoneType'>.

I'm new to NER; how can I fix this? Thanks.
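
The traceback shows a None slipping into src_tokens when fastNLP infers the field's type. A hypothetical scan to locate the offending instances (field names taken from the traceback; run it inside get_data() before data_bundle.set_input(...) is called):

for name in ('train', 'dev', 'test'):
    ds = data_bundle.get_dataset(name)
    for i in range(len(ds)):
        if any(tok is None for tok in ds[i]['src_tokens']):
            print(name, i, ds[i]['raw_words'])  # 'raw_words' is assumed to exist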

Order of the target sequence

Hi! Thanks for your wonderful work!

May I ask whether there is any specific order to the entities in the target sequence, or is it just random?

Thank you!

Error loading the datasets

Hi,

Thank you very much for your paper and your models. I'm attempting to replicate the experimental results from your paper on conll2003 and en-ontonotes. I'm currently facing an error on both datasets which I'm not sure how to solve. You can see the output of running python train.py below.

2021-07-08 14:43:47.895031: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "BARTNER/train.py", line 131, in <module>
    data_bundle, tokenizer, mapping2id = get_data()
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/core/utils.py", line 357, in wrapper
    results = func(*args, **kwargs)
  File "BARTNER/train.py", line 123, in get_data
    data_bundle = pipe.process_from_file(paths, demo=demo)
  File "/content/BARTNER/data/pipe.py", line 206, in process_from_file
    data_bundle = Conll2003NERLoader(demo=demo).load(paths)
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in load
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in <dictcomp>
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/content/BARTNER/data/pipe.py", line 271, in _load
    target = iob2(ins['target'])
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/pipe/utils.py", line 30, in iob2
    raise TypeError("The encoding schema is not a valid IOB type.")
TypeError: The encoding schema is not a valid IOB type.

I'm running on Colab.

As for conll2003, I've simply extracted the original English files and put them in a folder data/conll2003 as per your instructions.

As for ontonotes, I've followed this repo to generate the BIO tags: https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO, and put the files in data/en-ontonotes/english/ as per the instructions.

Currently the folder contains onto.development.ner, onto.train.ner, and onto.test.ner.

Could you please advise what I am doing wrong? Thanks.
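
For reference, fastNLP's iob2() raises this error when some tag is neither O nor prefixed with B-/I-, which can happen if the loader picks up the wrong column of the four-column CoNLL files. A minimal sketch of the IOB1-to-IOB2 conversion it performs:

def iob1_to_iob2(tags):
    """Rewrite IOB1 tags to IOB2: an I- tag that opens an entity becomes B-."""
    out = list(tags)
    for i, tag in enumerate(tags):
        if tag == 'O':
            continue
        prefix, _, etype = tag.partition('-')
        if prefix not in ('B', 'I') or not etype:
            raise TypeError("The encoding schema is not a valid IOB type.")
        if prefix == 'I' and (i == 0 or tags[i - 1] == 'O' or tags[i - 1][2:] != etype):
            out[i] = 'B-' + etype
    return out

print(iob1_to_iob2(['I-PER', 'I-PER', 'O']))  # ['B-PER', 'I-PER', 'O']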

RuntimeError("The program has been running off.")

In Epoch:16/Step:17296, got best dev performance:
Seq2SeqSpanMetric: f=0.0, rec=0.0, pre=0.0, em=0.1836
....
File "/root/work/BARTNER/model/callbacks.py", line 107, in on_valid_end
raise RuntimeError("The program has been running off.")
RuntimeError: The program has been running off.

Does this mean the model stopped itself because it failed to converge?
What should I do now, train it again?
Thanks

Migrate to transformers 4.0.0 or above?

Hello! I'm using this model architecture for NER on domain-specific tasks and it works pretty well! However, the old transformers version is still a bit troublesome.

For example, to stay close to BART's pretraining setup, I want to encode the whole sentence with the tokenizer directly, rather than splitting it on spaces and using add_prefix_space=True. The 'span' method worked for me this way, but the 'word' method needs extra work because of the old tokenizer version.

Is there any plan to release a transformers 4.0.0 (or above) version?
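
For context, on transformers >= 4.0 the fast tokenizers can take pre-split words directly and map BPEs back to word indexes, which is roughly what the 'word' method needs. A sketch (the checkpoint name is the one used elsewhere in this repo; treat the rest as an assumption, not the maintainers' migration plan):

from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large", add_prefix_space=True)
enc = tokenizer(["SOCCER", "-", "JAPAN"], is_split_into_words=True)
print(enc.word_ids())  # maps each BPE position back to its source word index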
