
sunyilgdx / sifrank_zh


Keyphrase or Keyword Extraction: a Chinese keyword extraction method based on pre-trained models (the Chinese-language version of the code for the paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model")

Python 100.00%
sifrank keyphrase-extraction keyword-extraction elmo pre-trained-language-models sif word-embeddings sentence-embeddings python36

sifrank_zh's Issues

Error when running test.py

Traceback (most recent call last):
File "D:/Codes/Information-extraction/keyword-extraction/SIFRank_zh/test/test.py", line 16, in <module>
ELMO = word_emb_elmo.WordEmbeddings(model_file)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\embeddings\word_emb_elmo.py", line 19, in __init__
self.elmo = Embedder(model_path)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\elmo.py", line 107, in __init__
self.model, self.config = self.get_model()
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\elmo.py", line 163, in get_model
model.load_model(self.model_dir)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\frontend.py", line 206, in load_model
map_location=lambda storage, loc: storage))
File "D:\softwares\miniconda\envs\torch1.8\lib\site-packages\torch\nn\modules\module.py", line 1224, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ConvTokenEmbedder:
size mismatch for word_emb_layer.embedding.weight: copying a param with shape torch.Size([140384, 100]) from checkpoint, the shape in current model is torch.Size([71222, 100]).
size mismatch for char_emb_layer.embedding.weight: copying a param with shape torch.Size([15889, 50]) from checkpoint, the shape in current model is torch.Size([6169, 50]).

index can't contain negative values

Traceback (most recent call last):
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 38, in <module>
embs = elmo.get_tokenized_words_embeddings(sents)
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 29, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 29, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<array_function internals>", line 6, in pad
File "D:\Anaconda\envs\baidu\lib\site-packages\numpy\lib\arraypad.py", line 748, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "D:\Anaconda\envs\baidu\lib\site-packages\numpy\lib\arraypad.py", line 519, in _as_pairs
raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values

Could anyone tell me what causes this error?
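
The message itself narrows things down: `np.pad` only rejects the call when `max_len-emb.shape[1]` is negative, so at least one array has more tokens than `max_len`. Below is a minimal sketch of one way to avoid that, assuming each embedding has shape `(layers, tokens, dim)` and that it is acceptable to take the target length from the arrays themselves. `pad_to_common_length` is a hypothetical helper, not a function from the repository.

```python
import numpy as np

def pad_to_common_length(embeddings):
    """Zero-pad the token axis (axis 1) of each array to a shared length."""
    # Derive max_len from the arrays being padded, so the difference
    # computed below can never be negative.
    max_len = max(emb.shape[1] for emb in embeddings)
    padded = [
        np.pad(emb,
               pad_width=((0, 0), (0, max_len - emb.shape[1]), (0, 0)),
               mode="constant")
        for emb in embeddings
    ]
    # All arrays now share one shape and can be stacked into a batch.
    return np.stack(padded)

# Two toy "sentences" with 5 and 2 tokens, 3 layers, 4-dimensional vectors.
batch = pad_to_common_length([np.ones((3, 5, 4)), np.ones((3, 2, 4))])
print(batch.shape)
```

If `max_len` has to come from somewhere else (for example a tokenizer's sentence lengths), it is worth printing it next to `emb.shape` to see which of the two disagrees.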

ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)

Model loaded succeed
2022-06-10 00:47:25,759 INFO: 1 batches, avg len: 77.5
Traceback (most recent call last):
File "D:\BaDouAI\SIFRank_zh-master\main.py", line 16, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "D:\BaDouAI\SIFRank_zh-master\model\method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "D:\BaDouAI\SIFRank_zh-master\embeddings\sent_emb_sif.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = self.word_embeddor.get_tokenized_words_embeddings(tokens_segmented)
File "D:\BaDouAI\SIFRank_zh-master\embeddings\word_emb_elmo.py", line 30, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,abs(max_len-emb.shape[1])),(0,0)) , mode='constant') for emb in elmo_embedding]
File "D:\BaDouAI\SIFRank_zh-master\embeddings\word_emb_elmo.py", line 30, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,abs(max_len-emb.shape[1])),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<array_function internals>", line 6, in pad
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\arraypad.py", line 746, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\arraypad.py", line 521, in _as_pairs
return np.broadcast_to(x, (ndim, 2)).tolist()
File "<array_function internals>", line 6, in broadcast_to
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\stride_tricks.py", line 180, in broadcast_to
return _broadcast_to(array, shape, subok=subok, readonly=True)
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\stride_tricks.py", line 125, in _broadcast_to
op_flags=['readonly'], itershape=shape, order='C')
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)
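
The last line of this traceback says that a `pad_width` with three (before, after) pairs was applied to an array that has only two dimensions. Note also that the `abs(...)` in the failing line hides a negative difference rather than fixing it. The sketch below reproduces the mismatch and shows a hypothetical helper, `pad_token_axis`, that builds `pad_width` from the array's own `ndim`. Which axis holds the tokens is an assumption the caller has to supply.

```python
import numpy as np

def pad_token_axis(emb, target_len, axis):
    """Zero-pad `axis` of `emb` up to `target_len`, whatever emb.ndim is."""
    extra = target_len - emb.shape[axis]
    if extra < 0:
        raise ValueError("target_len is shorter than the array")
    pad_width = [(0, 0)] * emb.ndim   # one (before, after) pair per dimension
    pad_width[axis] = (0, extra)
    return np.pad(emb, pad_width=pad_width, mode="constant")

two_d = np.ones((4, 8))               # e.g. (tokens, dim) with no layer axis
try:
    np.pad(two_d, pad_width=((0, 0), (0, 1), (0, 0)), mode="constant")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True            # three pairs cannot fit two dimensions

print(mismatch_raised)
print(pad_token_axis(two_d, 6, axis=0).shape)
print(pad_token_axis(np.ones((3, 4, 8)), 6, axis=1).shape)
```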

The sentence will not be split by '。'

If you do that, there will be another error.

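Author, please take a look: this Chinese version of SIF has real problems.
The input is not split into sentences on the Chinese full stop; without a '.' everything is treated as a single sentence.
If I do split it into sentences, the tensors in the list returned by the embedding step have inconsistent dimensions (they depend on the number of words in each sentence). How should pad_sequence be applied here?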
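
For the splitting half of this report, here is a minimal sketch using only the standard library. The set of sentence-ending marks and the helper name `split_sentences` are assumptions for illustration, and the repository's own tokenization may need the sentences in a different form. The per-sentence embeddings would still have to be padded to a common length afterwards.

```python
import re

# One or more non-terminator characters followed by any trailing terminators.
# The terminator set (。！？!?) is an assumption; extend it as needed.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")

def split_sentences(text):
    """Split text after sentence-ending marks, keeping each mark attached."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

sentences = split_sentences("今天天气很好。我们去公园！好吗")
print(sentences)  # ['今天天气很好。', '我们去公园！', '好吗']
```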

Extracted keywords tend to contain English letters

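Hello!
I used this code to extract keywords from the full text of the book 《大话数据结构》 and found that most of the resulting keywords contain letters and do not really look like words, as shown in the figure below.
How can I improve this?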

[Image: SIFRank keywords]
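
One simple thing to try is a post-filter that drops candidates containing ASCII letters. This is only a sketch: it assumes the extractor returns `(phrase, score)` pairs, the helper name `drop_latin_candidates` is hypothetical, and the phrases and scores below are made-up toy data. It does not change how candidates are scored, and it would also discard legitimate mixed terms.

```python
import re

_ASCII_LETTER = re.compile(r"[A-Za-z]")

def drop_latin_candidates(keyphrases):
    """Keep (phrase, score) pairs whose phrase has no ASCII letters."""
    return [(phrase, score) for phrase, score in keyphrases
            if not _ASCII_LETTER.search(phrase)]

# Toy ranked output for illustration only.
ranked = [("数据结构", 0.91), ("p指针", 0.88), ("链表", 0.80), ("int i", 0.75)]
filtered = drop_latin_candidates(ranked)
print(filtered)  # [('数据结构', 0.91), ('链表', 0.8)]
```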

Processing data

Hello, how long would it take to run on the text of an entire book?

COMPARE WITH OTHER PRE-TRAINED LANGUAGE MODELS

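Hi, thanks for open-sourcing the code.
The paper mentions "We compare the effect of replacing ELMo with word embeddings of GloVe and BERT".
How should the code be modified to replace ELMo with another pre-trained model such as BERT?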

Error at runtime

ValueError: could not broadcast input array from shape (3,41,1024) into shape (3)

[image]

Visualization

Hello, how was the visualization in Figure 4 of the paper produced? Could you share the code?

Code error

The np.pad call in word_emb_elmo.py raises an error saying the index is negative.

Batch processing of sentences

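First of all, thank you for sharing your work.
Is there an interface for processing sentences in batches, i.e. submitting them batch-style rather than one sentence at a time?
Looking forward to your reply.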

The stanfordcorenlp import in model/input_representation.py is not commented out

The stanfordcorenlp import in model/input_representation.py is not commented out, which causes the run to fail:
Traceback (most recent call last):
File "test.py", line 7, in <module>
from model.method import SIFRank, SIFRank_plus
File "/apdcephfs/private_markowu/proj/SIFRank_zh-master/model/method.py", line 9, in <module>
from model import input_representation
File "/apdcephfs/private_markowu/proj/SIFRank_zh-master/model/input_representation.py", line 9, in <module>
from stanfordcorenlp import StanfordCoreNLP
ModuleNotFoundError: No module named 'stanfordcorenlp'
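
One way to keep a module importable when an optional dependency is absent is to guard the import. This is a minimal sketch; `optional_import` is a hypothetical helper and not part of the repository, and whether the Chinese pipeline actually needs `stanfordcorenlp` at runtime is not stated in this report.

```python
import importlib

def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None

stanfordcorenlp = optional_import("stanfordcorenlp")
if stanfordcorenlp is None:
    print("stanfordcorenlp is not installed; skipping code paths that need it")
```

Code that uses the package can then check for `None` before calling into it, instead of failing at import time.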

Error when running test/test.py

The error message is as follows:
Model loaded succeed
2020-03-03 14:04:31,192 INFO: 1 batches, avg len: 153.0
Traceback (most recent call last):
File "D:/my_code/github项目/MeteorMan's nlp_lib/关键词抽取/SIFRank_zh-master/test/test.py", line 22, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\model\method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\embeddings\sent_emb_sif_backup.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = context_embeddings_alignment(elmo_embeddings, tokens_segmented)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\embeddings\sent_emb_sif_backup.py", line 90, in context_embeddings_alignment
emb = elmo_embeddings[i, 1, j, :]
IndexError: too many indices for tensor of dimension 3

Is there a problem with how elmo_embeddings is handled here?
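
The failing line indexes four axes (`elmo_embeddings[i, 1, j, :]`) while the value has only three. The sketch below shows one way to normalize the shape before indexing. It uses NumPy as a stand-in for the tensor in the traceback and assumes that a 3-D value is a single sentence laid out as `(layers, tokens, dim)`; that layout and the helper name `ensure_batch_axis` are assumptions, not taken from the repository.

```python
import numpy as np

def ensure_batch_axis(embeddings):
    """Return a 4-D array laid out as (sentences, layers, tokens, dim)."""
    arr = np.asarray(embeddings)
    if arr.ndim == 3:
        # Assumed single sentence: add a leading sentence axis.
        arr = arr[np.newaxis, ...]
    if arr.ndim != 4:
        raise ValueError(f"expected 3 or 4 dimensions, got {arr.ndim}")
    return arr

single = np.zeros((3, 41, 1024))
batched = ensure_batch_axis(single)
emb = batched[0, 1, 5, :]          # sentence 0, layer 1, token 5
print(batched.shape, emb.shape)
```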

Is there any other similar work?

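Hello! I read your paper on unsupervised keyphrase extraction. From the related work and the model comparisons, I understand that yours is the first work to bring pre-trained models into unsupervised keyphrase extraction. Has any other similar work appeared since? And how do you see this direction developing in the future?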

Error when running test.py: size mismatch for word_emb_layer.embedding.weight

Hello, I get an error when running the test case test.py that you provide:

2021-08-15 18:28:10,586 INFO: char embedding size: 6169
2021-08-15 18:28:10,924 INFO: word embedding size: 71222
2021-08-15 18:28:16,333 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(71222, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(6169, 50, padding_idx=6166)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_features=4096, bias=True)
      )
    )
    (projection): Linear(in_features=2148, out_features=512, bias=True)
  )
  (encoder): ElmobiLm(
    (forward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (forward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
  )
)
Traceback (most recent call last):
  File "/Users/xing.sun/PycharmProjects/SIFRank_zh/test/test.py", line 14, in <module>
    ELMO = word_emb_elmo.WordEmbeddings(model_file)
  File "/Users/xing.sun/PycharmProjects/SIFRank_zh/embeddings/word_emb_elmo.py", line 22, in __init__
    self.elmo = Embedder(model_path)
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/elmo.py", line 106, in __init__
    self.model, self.config = self.get_model()
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/elmo.py", line 182, in get_model
    model.load_model(self.model_dir)
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/frontend.py", line 207, in load_model
    map_location=lambda storage, loc: storage))
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 777, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ConvTokenEmbedder:
	size mismatch for word_emb_layer.embedding.weight: copying a param with shape torch.Size([140384, 100]) from checkpoint, the shape in current model is torch.Size([71222, 100]).
	size mismatch for char_emb_layer.embedding.weight: copying a param with shape torch.Size([15889, 50]) from checkpoint, the shape in current model is torch.Size([6169, 50]).

My environment is configured as described in your README.

Looking forward to your reply, thank you.


Cannot run: ValueError: index can't contain negative values

2022-01-10 17:05:17,709 INFO: char embedding size: 6169
2022-01-10 17:05:17,918 INFO: word embedding size: 71222
2022-01-10 17:05:21,442 INFO: Model(
(token_embedder): ConvTokenEmbedder(
(word_emb_layer): EmbeddingLayer(
(embedding): Embedding(71222, 100, padding_idx=3)
)
(char_emb_layer): EmbeddingLayer(
(embedding): Embedding(6169, 50, padding_idx=6166)
)
(convolutions): ModuleList(
(0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
(1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
(2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
(3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
(4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
(5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
(6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
)
(highways): Highway(
(_layers): ModuleList(
(0): Linear(in_features=2048, out_features=4096, bias=True)
(1): Linear(in_features=2048, out_features=4096, bias=True)
)
)
(projection): Linear(in_features=2148, out_features=512, bias=True)
)
(encoder): ElmobiLm(
(forward_layer_0): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(backward_layer_0): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(forward_layer_1): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(backward_layer_1): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
)
)
Model loaded succeed
2022-01-10 17:05:24,990 INFO: 1 batches, avg len: 77.5
Traceback (most recent call last):
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/test/test.py", line 21, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=5,elmo_layers_weight=elmo_layers_weight)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/model/method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/sent_emb_sif.py", line 48, in get_tokenized_sent_embeddings
elmo_embeddings = self.word_embeddor.get_tokenized_words_embeddings(tokens_segmented)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/word_emb_elmo.py", line 29, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/word_emb_elmo.py", line 29, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<array_function internals>", line 6, in pad
File "/Users/hellozhang/opt/anaconda3/envs/textrank/lib/python3.7/site-packages/numpy/lib/arraypad.py", line 748, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "/Users/hellozhang/opt/anaconda3/envs/textrank/lib/python3.7/site-packages/numpy/lib/arraypad.py", line 519, in _as_pairs
raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values
How should I handle this problem?

Error during matching

Hello, it's me again. This time I get this error:
[image]

Running test.py produces the following error message:

Traceback (most recent call last):
File "test.py", line 22, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "../model/method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "../embeddings/sent_emb_sif.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = context_embeddings_alignment(elmo_embeddings, tokens_segmented)
File "../embeddings/sent_emb_sif.py", line 90, in context_embeddings_alignment
emb = elmo_embeddings[i, 1, j, :]
IndexError: too many indices for tensor of dimension 3
