sunyilgdx / sifrank_zh Goto Github PK

Keyphrase or Keyword Extraction: a Chinese keyword extraction method based on pre-trained models (the Chinese-language code for the paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model")

Python 100.00%

sifrank keyphrase-extraction keyword-extraction elmo pre-trained-language-models sif word-embeddings sentence-embeddings python36

sifrank_zh's Issues

Error when running test.py

Traceback (most recent call last):
File "D:/Codes/Information-extraction/keyword-extraction/SIFRank_zh/test/test.py", line 16, in <module>
ELMO = word_emb_elmo.WordEmbeddings(model_file)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\embeddings\word_emb_elmo.py", line 19, in __init__
self.elmo = Embedder(model_path)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\elmo.py", line 107, in __init__
self.model, self.config = self.get_model()
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\elmo.py", line 163, in get_model
model.load_model(self.model_dir)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\frontend.py", line 206, in load_model
map_location=lambda storage, loc: storage))
File "D:\softwares\miniconda\envs\torch1.8\lib\site-packages\torch\nn\modules\module.py", line 1224, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ConvTokenEmbedder:
size mismatch for word_emb_layer.embedding.weight: copying a param with shape torch.Size([140384, 100]) from checkpoint, the shape in current model is torch.Size([71222, 100]).
size mismatch for char_emb_layer.embedding.weight: copying a param with shape torch.Size([15889, 50]) from checkpoint, the shape in current model is torch.Size([6169, 50]).

index can't contain negative values

Traceback (most recent call last):
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 38, in <module>
embs = elmo.get_tokenized_words_embeddings(sents)
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 29, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 29, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<array_function internals>", line 6, in pad
File "D:\Anaconda\envs\baidu\lib\site-packages\numpy\lib\arraypad.py", line 748, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "D:\Anaconda\envs\baidu\lib\site-packages\numpy\lib\arraypad.py", line 519, in _as_pairs
raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values

Could someone explain what causes this error?
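
The traceback shows `np.pad` receiving `max_len-emb.shape[1]` as a pad amount, and that value is negative whenever `max_len` is smaller than the token axis of some embedding. A minimal sketch of one way to avoid this, assuming each embedding is a NumPy array shaped (layers, tokens, dim) as the `pad_width` in the traceback suggests. The helper name `pad_to_longest` is hypothetical and not part of the repository:

```python
import numpy as np

def pad_to_longest(embeddings):
    """Zero-pad the token axis (axis 1) so every array has the same length."""
    # Take the target length from the arrays themselves, so the pad
    # amount max_len - emb.shape[1] can never be negative.
    max_len = max(emb.shape[1] for emb in embeddings)
    padded = [
        np.pad(emb,
               pad_width=((0, 0), (0, max_len - emb.shape[1]), (0, 0)),
               mode="constant")
        for emb in embeddings
    ]
    return np.stack(padded)

# Toy input: two "sentences" with 3 layers, 4 and 6 tokens, 8-dim vectors.
batch = pad_to_longest([np.ones((3, 4, 8)), np.ones((3, 6, 8))])
print(batch.shape)  # (2, 3, 6, 8)
```

Whether this matches how `max_len` is computed in `word_emb_elmo.py` is not shown in the traceback, so treat it as a starting point for debugging rather than a confirmed fix.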

Which model should I download for thulac.models, the Tsinghua word segmentation toolkit THULAC?

I can't tell which model to download from the download link.

Following the normal steps, it does not run properly. Is this caused by the environment?

Highway.forward: return type <class 'torch.Tensor'> is not a <class 'NoneType'>

How do I add encoder.pkl and token_embedder.pkl?

ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)

Model loaded succeed
2022-06-10 00:47:25,759 INFO: 1 batches, avg len: 77.5
Traceback (most recent call last):
File "D:\BaDouAI\SIFRank_zh-master\main.py", line 16, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "D:\BaDouAI\SIFRank_zh-master\model\method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "D:\BaDouAI\SIFRank_zh-master\embeddings\sent_emb_sif.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = self.word_embeddor.get_tokenized_words_embeddings(tokens_segmented)
File "D:\BaDouAI\SIFRank_zh-master\embeddings\word_emb_elmo.py", line 30, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,abs(max_len-emb.shape[1])),(0,0)) , mode='constant') for emb in elmo_embedding]
File "D:\BaDouAI\SIFRank_zh-master\embeddings\word_emb_elmo.py", line 30, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,abs(max_len-emb.shape[1])),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<array_function internals>", line 6, in pad
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\arraypad.py", line 746, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\arraypad.py", line 521, in _as_pairs
return np.broadcast_to(x, (ndim, 2)).tolist()
File "<array_function internals>", line 6, in broadcast_to
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\stride_tricks.py", line 180, in broadcast_to
return _broadcast_to(array, shape, subok=subok, readonly=True)
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\stride_tricks.py", line 125, in _broadcast_to
op_flags=['readonly'], itershape=shape, order='C')
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)
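
The message "(3,2) and requested shape (2,2)" says that a `pad_width` with three (before, after) pairs was applied to an array with only two dimensions, since `np.pad` expects one pair per array dimension. The sketch below builds `pad_width` from the array's own rank. It assumes a 2-D embedding is (tokens, dim) and a 3-D one is (layers, tokens, dim); `pad_token_axis` is a hypothetical helper, not repository code:

```python
import numpy as np

def pad_token_axis(emb, max_len):
    """Zero-pad the token axis of a 2-D or 3-D embedding array up to max_len."""
    if emb.ndim == 2:        # assumed layout: (tokens, dim)
        token_axis = 0
    elif emb.ndim == 3:      # assumed layout: (layers, tokens, dim)
        token_axis = 1
    else:
        raise ValueError(f"unexpected embedding shape {emb.shape}")
    extra = max_len - emb.shape[token_axis]
    if extra < 0:
        # Fail loudly instead of hiding the problem behind abs().
        raise ValueError("max_len is shorter than the embedding")
    pad_width = [(0, 0)] * emb.ndim   # one pair per dimension
    pad_width[token_axis] = (0, extra)
    return np.pad(emb, pad_width=pad_width, mode="constant")

print(pad_token_axis(np.ones((5, 8)), 7).shape)     # (7, 8)
print(pad_token_axis(np.ones((3, 5, 8)), 7).shape)  # (3, 7, 8)
```

If the embedder is returning 2-D arrays where the rest of the code expects three layers, the underlying cause may be the embedder's output setting rather than the padding call itself.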

Model download fails

Downloading both zhs and thulac fails. Is there anywhere else to download them?

The sentence will not be split by '。'

If you do that, there will be another error.

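Author, please take a look: this Chinese version of SIF actually has serious problems.
The input is not split into sentences on the Chinese full stop; without a '.', everything is treated as a single sentence.
If sentences are split, the Tensors in the list returned by the embedding have inconsistent dimensions (depending on the number of words). I'd like to ask the author how to do pad_sequence.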
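
For the sentence-splitting half of this report, a minimal sketch of splitting text on Chinese and ASCII sentence-ending punctuation using only the standard library. The function name `split_sentences` and the choice of punctuation marks are assumptions for illustration, and the ASCII period is left out so that decimals such as "3.14" are not cut in two:

```python
import re

# One or more non-terminator characters, followed by any run of terminators.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation with each one."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

sentences = split_sentences("今天天气很好。我们去公园！好吗")
print(sentences)  # ['今天天气很好。', '我们去公园！', '好吗']
```

This only covers splitting. The resulting sentences have different token counts, so their embeddings still need to be padded to a common length before they can be stacked.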

Extracted keywords tend to contain English letters

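Hello!
I used this code to extract keywords from the entire book 《大话数据结构》 and found that most of the resulting keywords contain letters and do not look much like words, as shown in the figure below.
How can I improve this?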

Processing data

Hello, how long would it take to run on the text of an entire book?

COMPARE WITH OTHER PRE-TRAINED LANGUAGE MODELS

Hi, thanks for open-sourcing the code~
The paper mentions "We compare the effect of replacing ELMo with word embeddings of GloVe and BERT".
How should I modify the code to replace ELMo with another pre-trained model such as BERT?

Error at runtime

ValueError: could not broadcast input array from shape (3,41,1024) into shape (3)

Visualization

Hello, how was the visualization in Figure 4 of the paper made? Could you share the code?

Code error

The np.pad function in word_emb_elmo.py raises an error saying the index is negative.

Why does Highway.forward: `input` must be present still appear after pip install overrides==3.1.0?

Could a simple web demo be put online?

It would help showcase the model's performance and make its strengths easy to see firsthand.

Batch processing of sentences

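First, thank you for sharing your work.
Is there an interface for processing sentences in batches, that is, submitting them batch-style rather than one sentence at a time?
Looking forward to your reply.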

index can't contain negative values

I've read the earlier issues, but there is still no concrete solution. I'm attaching my elmo.py. Thank you very much!
elmo.txt

The stanfordcorenlp import in model/input_representation.py is not commented out

The stanfordcorenlp import in model/input_representation.py is not commented out, which causes the run to fail.
Traceback (most recent call last):
File "test.py", line 7, in <module>
from model.method import SIFRank, SIFRank_plus
File "/apdcephfs/private_markowu/proj/SIFRank_zh-master/model/method.py", line 9, in <module>
from model import input_representation
File "/apdcephfs/private_markowu/proj/SIFRank_zh-master/model/input_representation.py", line 9, in <module>
from stanfordcorenlp import StanfordCoreNLP
ModuleNotFoundError: No module named 'stanfordcorenlp'
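
One common way to keep such an import from breaking users who do not have the package is to guard it. A minimal sketch, assuming the Chinese pipeline does not actually need `StanfordCoreNLP` (the report implies this but the repository code is not shown here); `require_corenlp` is a hypothetical helper:

```python
# Guarded optional import: the module still loads when the package is absent.
try:
    from stanfordcorenlp import StanfordCoreNLP
except ImportError:
    StanfordCoreNLP = None

def require_corenlp():
    """Return the StanfordCoreNLP class, or raise a clear error if it is missing."""
    if StanfordCoreNLP is None:
        raise RuntimeError(
            "The stanfordcorenlp package is not installed; "
            "install it to use the CoreNLP-based code path."
        )
    return StanfordCoreNLP
```

Code paths that need CoreNLP would call `require_corenlp()` at the point of use, so the failure only appears when that path is actually taken.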

Error when running test/test.py

The error message is as follows:
Model loaded succeed
2020-03-03 14:04:31,192 INFO: 1 batches, avg len: 153.0
Traceback (most recent call last):
File "D:/my_code/github项目/MeteorMan's nlp_lib/关键词抽取/SIFRank_zh-master/test/test.py", line 22, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\model\method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\embeddings\sent_emb_sif_backup.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = context_embeddings_alignment(elmo_embeddings, tokens_segmented)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\embeddings\sent_emb_sif_backup.py", line 90, in context_embeddings_alignment
emb = elmo_embeddings[i, 1, j, :]
IndexError: too many indices for tensor of dimension 3

Is there a problem with how elmo_embeddings is handled here?

The code runs without errors but produces no output. What could be the reason?

Is there other similar work?

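Hello! I read your paper on unsupervised keyphrase extraction. From the related work and the model comparisons, I see that your work is the first to bring a pretrained model into unsupervised keyphrase extraction. Has any other similar work appeared since? And what are your views on the future of this direction?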

Which tool is used for part-of-speech tagging? How are function words handled in noun phrases?

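Hello:
Which tool is used for part-of-speech tagging?
How are function words handled in noun phrases?
For example, in the noun phrase “计算机科学与技术”, the character “与” is a function word. How is it taken into account? A regular expression of adjectives plus nouns would skip function words.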
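
To make the question concrete, here is a sketch of a tag-pattern chunker that lets a conjunction join two noun groups. It is not the repository's grammar. The tag names ("n" for noun, "a" for adjective, "c" for conjunction, "d" for adverb) and the hand-tagged example are assumptions chosen for illustration:

```python
import re

def tag_class(tag):
    """Map a POS tag to a one-letter class: N noun, A adjective, C conjunction, O other."""
    if tag.startswith("n"):
        return "N"
    if tag == "a":
        return "A"
    if tag == "c":
        return "C"
    return "O"

# (adjective|noun)* noun, optionally continued by "conjunction (adjective|noun)* noun".
NP_PATTERN = re.compile(r"[AN]*N(?:C[AN]*N)*")

def noun_phrases(tagged):
    """Return candidate noun phrases from a list of (word, tag) pairs."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return ["".join(word for word, _ in tagged[m.start():m.end()])
            for m in NP_PATTERN.finditer(classes)]

tagged = [("计算机", "n"), ("科学", "n"), ("与", "c"),
          ("技术", "n"), ("很", "d"), ("重要", "a")]
phrases = noun_phrases(tagged)
print(phrases)  # ['计算机科学与技术']
```

Allowing conjunctions this way also merges phrases that should stay separate (for example two unrelated nouns joined by "和"), so it trades precision for recall.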

import jieba.analyse

Does the import jieba.analyse in the test.py script require an extra download?

Error when running test.py: size mismatch for word_emb_layer.embedding.weight

Hello, I get an error when running the test case test.py that you provide:

2021-08-15 18:28:10,586 INFO: char embedding size: 6169
2021-08-15 18:28:10,924 INFO: word embedding size: 71222
2021-08-15 18:28:16,333 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(71222, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(6169, 50, padding_idx=6166)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_features=4096, bias=True)
      )
    )
    (projection): Linear(in_features=2148, out_features=512, bias=True)
  )
  (encoder): ElmobiLm(
    (forward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (forward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
  )
)
Traceback (most recent call last):
  File "/Users/xing.sun/PycharmProjects/SIFRank_zh/test/test.py", line 14, in <module>
    ELMO = word_emb_elmo.WordEmbeddings(model_file)
  File "/Users/xing.sun/PycharmProjects/SIFRank_zh/embeddings/word_emb_elmo.py", line 22, in __init__
    self.elmo = Embedder(model_path)
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/elmo.py", line 106, in __init__
    self.model, self.config = self.get_model()
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/elmo.py", line 182, in get_model
    model.load_model(self.model_dir)
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/frontend.py", line 207, in load_model
    map_location=lambda storage, loc: storage))
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 777, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ConvTokenEmbedder:
	size mismatch for word_emb_layer.embedding.weight: copying a param with shape torch.Size([140384, 100]) from checkpoint, the shape in current model is torch.Size([71222, 100]).
	size mismatch for char_emb_layer.embedding.weight: copying a param with shape torch.Size([15889, 50]) from checkpoint, the shape in current model is torch.Size([6169, 50]).

My runtime environment is configured as described in your README.

Looking forward to your reply, thank you.

What corpus was dict.txt derived from?

What is the run order and the way to run this? I keep getting ValueError: index can't contain negative values. I made the changes suggested in earlier issues, but it still doesn't work.

Cannot run: ValueError: index can't contain negative values

2022-01-10 17:05:17,709 INFO: char embedding size: 6169
2022-01-10 17:05:17,918 INFO: word embedding size: 71222
2022-01-10 17:05:21,442 INFO: Model(
(token_embedder): ConvTokenEmbedder(
(word_emb_layer): EmbeddingLayer(
(embedding): Embedding(71222, 100, padding_idx=3)
)
(char_emb_layer): EmbeddingLayer(
(embedding): Embedding(6169, 50, padding_idx=6166)
)
(convolutions): ModuleList(
(0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
(1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
(2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
(3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
(4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
(5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
(6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
)
(highways): Highway(
(_layers): ModuleList(
(0): Linear(in_features=2048, out_features=4096, bias=True)
(1): Linear(in_features=2048, out_features=4096, bias=True)
)
)
(projection): Linear(in_features=2148, out_features=512, bias=True)
)
(encoder): ElmobiLm(
(forward_layer_0): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(backward_layer_0): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(forward_layer_1): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(backward_layer_1): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
)
)
Model loaded succeed
2022-01-10 17:05:24,990 INFO: 1 batches, avg len: 77.5
Traceback (most recent call last):
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/test/test.py", line 21, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=5,elmo_layers_weight=elmo_layers_weight)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/model/method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/sent_emb_sif.py", line 48, in get_tokenized_sent_embeddings
elmo_embeddings = self.word_embeddor.get_tokenized_words_embeddings(tokens_segmented)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/word_emb_elmo.py", line 29, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/word_emb_elmo.py", line 29, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<array_function internals>", line 6, in pad
File "/Users/hellozhang/opt/anaconda3/envs/textrank/lib/python3.7/site-packages/numpy/lib/arraypad.py", line 748, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "/Users/hellozhang/opt/anaconda3/envs/textrank/lib/python3.7/site-packages/numpy/lib/arraypad.py", line 519, in _as_pairs
raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values
How should I handle this problem?

Error during matching

Hello, I'm back again. This time it's this error:

Running test.py gives the following error message:

Traceback (most recent call last):
File "test.py", line 22, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "../model/method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "../embeddings/sent_emb_sif.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = context_embeddings_alignment(elmo_embeddings, tokens_segmented)
File "../embeddings/sent_emb_sif.py", line 90, in context_embeddings_alignment
emb = elmo_embeddings[i, 1, j, :]
IndexError: too many indices for tensor of dimension 3
