
bartner's People

Contributors

shadowteamcn, yhcc


bartner's Issues

Question about the model figure in the paper

Why are the output indexes in the paper's model figure 2 3 7 2 5 6, rather than 2 3 7 2 5 7?

Question about the code

Hello. When running the code, the samples within a batch may have different numbers of entities, which makes the automatic padding raise an error. How should I handle this?
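
For reference, the error comes from stacking variable-length target lists into one rectangular array. A minimal sketch of padding every sample's target sequence to the batch maximum (the pad value here is an illustrative assumption, not necessarily BARTNER's actual one):

import numpy as np

def pad_batch(targets, pad_val=1):
    """Pad variable-length target id sequences to the batch max length."""
    max_len = max(len(t) for t in targets)
    out = np.full((len(targets), max_len), pad_val, dtype=np.int64)
    for i, t in enumerate(targets):
        out[i, :len(t)] = t
    return out

print(pad_batch([[0, 11, 2, 1], [0, 11, 2, 20, 3, 1]]))  # (2, 6) array, first row padded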

ValueError when running train.py on the CoNLL-2003 dataset

I tried to run train.py on the CoNLL-2003 dataset, but hit the following error:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
My CoNLL-2003 files store the data in the following format:
EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

Peter B-PER
Blackburn I-PER

BRUSSELS B-LOC
1996-08-22 O
...
Has anyone run into something similar?
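
A likely culprit (an assumption based on the message, not verified against this repo): NumPy 1.24 removed the implicit creation of ragged object arrays, so the bare np.array(...) call inside fastNLP's padder now raises exactly this ValueError where older NumPy versions only warned. A quick demonstration:

import numpy as np  # >= 1.24

ragged = [[1, 2, 3], [4, 5]]        # rows of unequal length, like per-sample entity targets
try:
    np.array(ragged)                # NumPy < 1.24: object array + deprecation warning
except ValueError as e:             # NumPy >= 1.24: the error quoted above
    print(e)
np.array(ragged, dtype=object)      # still allowed when dtype=object is explicit

If that is the cause, downgrading NumPy below 1.24 is the least invasive workaround.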

Reproducing the OntoNotes results

Hi,

Thanks for open-sourcing the code with such thorough comments.

I reproduced the paper's results on CoNLL-03, but ran into some trouble on OntoNotes.

I use the data provided at https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO and modified

BARTNER/train.py, lines 124 to 126 at d54d331:

elif dataset_name == 'en-ontonotes':
    paths = '../data/en-ontonotes/english'
    data_bundle = pipe.process_from_file(paths)

so that paths points at my files:

paths = {
    'train': "/home/yedeming/data/ontonotes/onto.train.ner",
    'dev': "/home/yedeming/data/ontonotes/onto.development.ner",
    'test': "/home/yedeming/data/ontonotes/onto.test.ner",
}

The output is as follows:

Save cache to caches/data_facebook/bart-large_en-ontonotes_word.pt.
max_len_a:0.8, max_len:10
In total 3 datasets:
        train has 115812 instances.
        dev has 15680 instances.
        test has 12217 instances.
The number of tokens in tokenizer  50265
50283 50288
......
Best test performance(may not correspond to the best dev performance):{'Seq2SeqSpanMetric': {'f': 87.64999999999999, 'rec': 88.52, 'pre': 86.79, 'em': 0.8727}} achieved at Epoch:16.
Best test performance(correspond to the best dev performance):{'Seq2SeqSpanMetric': {'f': 87.36, 'rec': 88.28, 'pre': 86.47, 'em': 0.8717}} achieved at Epoch:26.

In Epoch:26/Step:68409, got best dev performance:
Seq2SeqSpanMetric: f=88.94, rec=89.79, pre=88.11, em=0.859

The test set F1 is 87.36, which is far below the expected number; I'm not sure what went wrong.

Looking forward to your reply!
叶德铭

FASTNLP batch loader error

Has anyone had similar issues when running train.py with a customized dataset? My error is as follows:
training epochs started 2022-12-31-17-14-10-312655
Traceback (most recent call last):
  File "train.py", line 252, in <module>
    trainer.train(load_best_model=False)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 667, in train
    raise e
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 658, in train
    self._train()
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/trainer.py", line 712, in _train
    for batch_x, batch_y in self.data_iterator:
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 267, in __iter__
    for indices, batch_x, batch_y in self.dataiter:
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 91, in collate_fn
    sin_y = _pad(sin_y, dataset=self.dataset, as_numpy=self.as_numpy)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/batch.py", line 44, in _pad
    res = f.pad(vlist)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/field.py", line 492, in pad
    return self.padder(contents, field_name=self.name, field_ele_dtype=self.dtype, dim=self._cell_ndim)
  File "/home/velvinfu/miniconda3/envs/py38/lib/python3.8/site-packages/fastNLP/core/field.py", line 247, in __call__
    return np.array(contents)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.
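
If the root cause is a ragged field rather than the NumPy version, fastNLP lets you swap the padder on a per-field basis. A sketch of that workaround (the __call__ signature is copied from the traceback above; the field name and pad value are hypothetical):

import numpy as np
from fastNLP.core.field import Padder

class RaggedPadder(Padder):
    """Pad a batch of variable-length int sequences into a rectangular array."""
    def __call__(self, contents, field_name, field_ele_dtype, dim):
        max_len = max(len(c) for c in contents)
        out = np.full((len(contents), max_len), self.pad_val, dtype=np.int64)
        for i, c in enumerate(contents):
            out[i, :len(c)] = c
        return out

# e.g. for whichever field the worker chokes on:
# data_bundle.get_dataset('train').set_padder('tgt_tokens', RaggedPadder(pad_val=1))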

Indexes shifting

Hello, nice work! Sorry if I'm missing something. I have a question about the decoder's output in your code.

Based on your code, it seems you're shifting the position indexes of the tokens by the number of labels. I was wondering why you shift the token indexes instead of shifting the labels. Thank you!

Please find an example below (results from here).

raw_words: ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
target_ids: [0, 11, 2, 20, 3, 1]	# each of the token positions is shifted by 6
word_bpe_ids: [0, 13910, 3376, 2076, 111, 344, 591, 1889, 7777, 226, 23806, 975, 17164, 2156, 3858, 16712, 2808, 31987, 4454, 18819, 5885, 10885, 2571, 479, 2]
word_bpe_tokens: ['<s>', 'ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.', '</s>']
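
For anyone tracing the numbers: a small sketch of the pointer arithmetic, assuming the CoNLL-2003 setup (2 special tokens plus 4 entity labels, so the offset is 6). A word's pointer is the position of its first BPE in word_bpe_tokens (with '<s>' at position 0) plus that offset:

num_specials, num_labels = 2, 4       # target vocab: [<s>, </s>, 4 labels, pointers...]
offset = num_specials + num_labels    # = 6

word_bpe_tokens = ['<s>', 'ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET',
                   'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN',
                   'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.', '</s>']

japan_ptr = word_bpe_tokens.index('ĠJ') + offset    # 5 + 6 = 11, as in target_ids
china_ptr = word_bpe_tokens.index('ĠCH') + offset   # 14 + 6 = 20, as in target_ids
print(japan_ptr, china_ptr)

Shifting the pointers rather than the labels keeps the label ids fixed at the low end of the target vocabulary, so (as far as I can tell) the decoder can score labels and source positions in a single output distribution.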

vocab_file not found

File "C:\Users\1\miniconda3\lib\site-packages\transformers\models\gpt2\tokenization_gpt2.py", line 179, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

Hello, with transformers 3.4.0 I keep getting this error when loading the tokenizer. Looking at the transformers source, BartTokenizer loads a JSON-format vocabulary, but the pretrained BART model I downloaded from Hugging Face ships its vocabulary as a TXT file. Is this a transformers version issue? Thanks.
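
For what it's worth: BartTokenizer is a byte-level BPE tokenizer and expects vocab.json plus merges.txt, while a vocab.txt usually indicates a BERT-style WordPiece vocabulary. If the checkpoint is the fnlp Chinese BART, it is intended to be loaded with BertTokenizer rather than BartTokenizer, roughly:

# Sketch, assuming a Chinese BART checkpoint that ships a BERT-style vocab.txt:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")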

'tgt_seq_len' is not defined

Traceback (most recent call last):
  File "predictor.py", line 128, in <module>
    tgt_seq_len = (tgt_seq_len - 2).tolist()
NameError: name 'tgt_seq_len' is not defined

Error when trying Chinese BART

I want to use the model on a Chinese dataset, so I swapped BART for the Hugging Face chinese-bart, but I keep getting the following error:

Cannot set field:src_tokens as input, exception happens at the 0 value.
Traceback (most recent call last):
  File "/workspace/BARTNER/train.py", line 135, in <module>
    data_bundle, tokenizer, mapping2id = get_data()
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/utils.py", line 357, in wrapper
    results = func(*args, **kwargs)
  File "/workspace/BARTNER/train.py", line 127, in get_data
    data_bundle = pipe.process_from_file(paths, demo=demo)
  File "/workspace/BARTNER/data/pipe.py", line 218, in process_from_file
    data_bundle = self.process(data_bundle)
  File "/workspace/BARTNER/data/pipe.py", line 192, in process
    data_bundle.set_input('tgt_tokens', 'src_tokens', 'src_seq_len', 'tgt_seq_len', 'first')
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/io/data_bundle.py", line 142, in set_input
    dataset.set_input(field_name, flag=flag, use_1st_ins_infer_dim_type=use_1st_ins_infer_dim_type)
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/dataset.py", line 787, in set_input
    raise e
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/dataset.py", line 784, in set_input
    self.field_arrays[name].is_input = flag
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 371, in is_input
    self._check_dtype_and_ndim(only_check_1st_ins_dim_type=self._use_1st_ins_infer_dim_type)
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 423, in _check_dtype_and_ndim
    raise e
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 406, in _check_dtype_and_ndim
    type_0, dim_0 = _get_ele_type_and_dim(cell_0)
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 56, in _get_ele_type_and_dim
    res = [_get_ele_type_and_dim(cell_i, dim) for cell_i in cell]
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 56, in <listcomp>
    res = [_get_ele_type_and_dim(cell_i, dim) for cell_i in cell]
  File "/usr/local/lib/python3.6/dist-packages/fastNLP/core/field.py", line 84, in _get_ele_type_and_dim
    raise SetInputOrTargetException(f"Cannot process type:{type(cell)}.")
fastNLP.core.field.SetInputOrTargetException: Cannot process type:<class 'NoneType'>.

I'm new to NER; how can I fix this? Thanks.
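
The traceback shows a None slipping into src_tokens when fastNLP infers the field's type. A hypothetical scan to locate the offending instances (field names taken from the traceback; run it inside get_data() before data_bundle.set_input(...) is called):

for name in ('train', 'dev', 'test'):
    ds = data_bundle.get_dataset(name)
    for i in range(len(ds)):
        if any(tok is None for tok in ds[i]['src_tokens']):
            print(name, i, ds[i]['raw_words'])  # 'raw_words' is assumed to exist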

Order of the target sequence

Hi! Thanks for your wonderful work!

May I ask whether there is any specific order to the entities in the target sequence, or is it just random?

Thank you!

Error loading the datasets

Hi,

Thank you very much for your paper and your models. I'm attempting to replicate the experimental results from your paper on conll2003 and en-ontonotes. I'm currently facing an error on both datasets which I'm not sure how to solve. You can see the output of running python train.py below.

2021-07-08 14:43:47.895031: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "BARTNER/train.py", line 131, in <module>
    data_bundle, tokenizer, mapping2id = get_data()
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/core/utils.py", line 357, in wrapper
    results = func(*args, **kwargs)
  File "BARTNER/train.py", line 123, in get_data
    data_bundle = pipe.process_from_file(paths, demo=demo)
  File "/content/BARTNER/data/pipe.py", line 206, in process_from_file
    data_bundle = Conll2003NERLoader(demo=demo).load(paths)
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in load
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in <dictcomp>
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/content/BARTNER/data/pipe.py", line 271, in _load
    target = iob2(ins['target'])
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/pipe/utils.py", line 30, in iob2
    raise TypeError("The encoding schema is not a valid IOB type.")
TypeError: The encoding schema is not a valid IOB type.

I'm running on Colab.

As for conll2003, I've simply extracted the original English files and put them in a folder data/conll2003 as per your instructions.

As for ontonotes, I've followed this repo to generate the BIO tags: https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO, and put the files in data/en-ontonotes/english/ as per the instructions.

Currently the folder contains onto.development.ner, onto.train.ner, and onto.test.ner.

Could you please advise what I am doing wrong? Thanks.
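
For reference, fastNLP's iob2() raises this error when some tag is neither O nor prefixed with B-/I-, which can happen if the loader picks up the wrong column of the four-column CoNLL files. A minimal sketch of the IOB1-to-IOB2 conversion it performs:

def iob1_to_iob2(tags):
    """Rewrite IOB1 tags to IOB2: an I- tag that opens an entity becomes B-."""
    out = list(tags)
    for i, tag in enumerate(tags):
        if tag == 'O':
            continue
        prefix, _, etype = tag.partition('-')
        if prefix not in ('B', 'I') or not etype:
            raise TypeError("The encoding schema is not a valid IOB type.")
        if prefix == 'I' and (i == 0 or tags[i - 1] == 'O' or tags[i - 1][2:] != etype):
            out[i] = 'B-' + etype
    return out

print(iob1_to_iob2(['I-PER', 'I-PER', 'O']))  # ['B-PER', 'I-PER', 'O']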

RuntimeError("The program has been running off.")

In Epoch:16/Step:17296, got best dev performance:
Seq2SeqSpanMetric: f=0.0, rec=0.0, pre=0.0, em=0.1836
....
File "/root/work/BARTNER/model/callbacks.py", line 107, in on_valid_end
raise RuntimeError("The program has been running off.")
RuntimeError: The program has been running off.

Does this mean the model stopped itself because it failed to converge?
What should I do now, train it again?
Thanks

Migrate to transformers 4.0.0 or above?

Hello! I'm using this model architecture for NER on domain-specific tasks and it works pretty well! However, the old transformers version is still a bit troublesome.

For example, to stay close to BART's pretraining setup, I want to encode the whole sentence with the tokenizer directly, rather than splitting it on spaces and using add_prefix_space=True. The 'span' method worked for me this way, but the 'word' method needs extra work because of the old tokenizer version.

Is there any plan to release a transformers 4.0.0 (or above) version?
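
For context, on transformers >= 4.0 the fast tokenizers can take pre-split words directly and map BPEs back to word indexes, which is roughly what the 'word' method needs. A sketch (the checkpoint name is the one used elsewhere in this repo; treat the rest as an assumption, not the maintainers' migration plan):

from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large", add_prefix_space=True)
enc = tokenizer(["SOCCER", "-", "JAPAN"], is_split_into_words=True)
print(enc.word_ids())  # maps each BPE position back to its source word index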
