sunyilgdx / sifrank_zh
Keyphrase or Keyword Extraction: a Chinese keyphrase extraction method based on pre-trained models (the Chinese-version code for the paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model")
Traceback (most recent call last):
File "D:/Codes/Information-extraction/keyword-extraction/SIFRank_zh/test/test.py", line 16, in <module>
ELMO = word_emb_elmo.WordEmbeddings(model_file)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\embeddings\word_emb_elmo.py", line 19, in __init__
self.elmo = Embedder(model_path)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\elmo.py", line 107, in __init__
self.model, self.config = self.get_model()
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\elmo.py", line 163, in get_model
model.load_model(self.model_dir)
File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\frontend.py", line 206, in load_model
map_location=lambda storage, loc: storage))
File "D:\softwares\miniconda\envs\torch1.8\lib\site-packages\torch\nn\modules\module.py", line 1224, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ConvTokenEmbedder:
size mismatch for word_emb_layer.embedding.weight: copying a param with shape torch.Size([140384, 100]) from checkpoint, the shape in current model is torch.Size([71222, 100]).
size mismatch for char_emb_layer.embedding.weight: copying a param with shape torch.Size([15889, 50]) from checkpoint, the shape in current model is torch.Size([6169, 50]).
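Both size mismatches in the error above have the same form: the checkpoint tensor has more rows than the layer the current model built. The following is a minimal sketch of listing such mismatches in plain Python, with shapes held as tuples. `diff_shapes` is a hypothetical helper, not part of SIFRank_zh or elmoformanylangs; with PyTorch, the two dictionaries could be filled from the `.shape` of each tensor in a loaded checkpoint and in `model.state_dict()`.

```python
def diff_shapes(checkpoint_shapes, model_shapes):
    """Return (name, checkpoint_shape, model_shape) for every parameter whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Parameters missing from the model are ignored here; only shape conflicts are reported.
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches.append((name, tuple(ckpt_shape), tuple(model_shape)))
    return mismatches


# Shapes copied from the error message above.
checkpoint = {
    "word_emb_layer.embedding.weight": (140384, 100),
    "char_emb_layer.embedding.weight": (15889, 50),
}
model = {
    "word_emb_layer.embedding.weight": (71222, 100),
    "char_emb_layer.embedding.weight": (6169, 50),
}

for name, ckpt_shape, model_shape in diff_shapes(checkpoint, model):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

Only the first dimension (the vocabulary size) differs in both entries, which suggests that the vocabulary used to build the model does not match the one the checkpoint was trained with. That is an inference from the shapes, not something confirmed in this thread.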
Traceback (most recent call last):
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 38, in <module>
embs = elmo.get_tokenized_words_embeddings(sents)
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 29, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 29, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<__array_function__ internals>", line 6, in pad
File "D:\Anaconda\envs\baidu\lib\site-packages\numpy\lib\arraypad.py", line 748, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "D:\Anaconda\envs\baidu\lib\site-packages\numpy\lib\arraypad.py", line 519, in _as_pairs
raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values
Could someone tell me what causes this error?
I can't tell which model I should download from the download link.
Highway.forward: return type <class 'torch.Tensor'>
is not a <class 'NoneType'>
Model loaded succeed
2022-06-10 00:47:25,759 INFO: 1 batches, avg len: 77.5
Traceback (most recent call last):
File "D:\BaDouAI\SIFRank_zh-master\main.py", line 16, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "D:\BaDouAI\SIFRank_zh-master\model\method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "D:\BaDouAI\SIFRank_zh-master\embeddings\sent_emb_sif.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = self.word_embeddor.get_tokenized_words_embeddings(tokens_segmented)
File "D:\BaDouAI\SIFRank_zh-master\embeddings\word_emb_elmo.py", line 30, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,abs(max_len-emb.shape[1])),(0,0)) , mode='constant') for emb in elmo_embedding]
File "D:\BaDouAI\SIFRank_zh-master\embeddings\word_emb_elmo.py", line 30, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,abs(max_len-emb.shape[1])),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<__array_function__ internals>", line 6, in pad
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\arraypad.py", line 746, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\arraypad.py", line 521, in _as_pairs
return np.broadcast_to(x, (ndim, 2)).tolist()
File "<__array_function__ internals>", line 6, in broadcast_to
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\stride_tricks.py", line 180, in broadcast_to
return _broadcast_to(array, shape, subok=subok, readonly=True)
File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\stride_tricks.py", line 125, in _broadcast_to
op_flags=['readonly'], itershape=shape, order='C')
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)
Downloading both zhs and thulac fails. Is there anywhere else to download them from?
If you do that, there will be another error.
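To the author: please take a look, the Chinese version of SIF has real problems.
The input is not split into sentences on the Chinese full stop (。); without a "." everything is treated as a single sentence.
If the text is split into sentences, the Tensors in the list returned by the embedding step have inconsistent dimensions (they depend on the number of words), so I would like to ask the author how to apply pad_sequence.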
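The sentence-splitting problem described above can be illustrated with a minimal sketch that splits on Chinese sentence-final punctuation using only the standard library. `split_sentences` and the punctuation set are assumptions for illustration; this is not the splitter used in SIFRank_zh.

```python
import re

# Sentence-final punctuation treated as a boundary. The ASCII "." is left out
# on purpose so that decimals such as 3.14 are not split; this is an assumption.
_SENTENCE = re.compile(r"[^。!?!?]+[。!?!?]?")


def split_sentences(text):
    """Split text after 。!? (full-width) or !? (ASCII), keeping the punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


print(split_sentences("今天天气很好。我们去公园吧!你觉得呢"))
```

Each returned sentence can then be tokenized separately, which is where the unequal sequence lengths mentioned in the post come from.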
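Hello, how long would it take to run on the amount of text in an entire book?
Hi, thanks for open-sourcing the code.
The paper mentions "We compare the effect of replacing ELMo with word embeddings of GloVe and BERT".
How should the code be modified to replace ELMo with another pre-trained model such as BERT?
Hello, how was the visualization in Figure 4 of the paper produced? Could you share the code?
The np.pad call in word_emb_elmo.py raises an error saying the index is negative.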
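In the tracebacks on this page the pad amount is `max_len-emb.shape[1]`, which can only be negative if `max_len` is smaller than that axis of the array being padded. The `abs()` variant fails differently: its error message shows three `(before, after)` pairs being passed for a 2-D array. The following is a minimal sketch of padding that avoids both problems by deriving everything from the arrays themselves. `pad_to_max_len` is a hypothetical helper, and treating the second-to-last axis as the sequence axis is an assumption.

```python
import numpy as np


def pad_to_max_len(embs, seq_axis=-2):
    """Zero-pad each array along seq_axis to the longest length in the batch, then stack."""
    ndims = {e.ndim for e in embs}
    if len(ndims) != 1:
        raise ValueError(f"arrays have different numbers of dimensions: {sorted(ndims)}")
    # The target length comes from the arrays, so the pad amount is never negative.
    max_len = max(e.shape[seq_axis] for e in embs)
    padded = []
    for e in embs:
        # One (before, after) pair per dimension of this particular array.
        pad_width = [(0, 0)] * e.ndim
        pad_width[seq_axis] = (0, max_len - e.shape[seq_axis])
        padded.append(np.pad(e, pad_width=pad_width, mode="constant"))
    return np.stack(padded)


# Works for (layers, seq_len, dim) arrays as well as (seq_len, dim) arrays.
batch = pad_to_max_len([np.ones((3, 2, 4)), np.ones((3, 5, 4))])
print(batch.shape)
```

If the helper raises or the stacked shape has one dimension fewer than expected, the embedder is probably returning a different layout than the calling code assumes.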
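It would help showcase the model's performance and make its strengths easier to appreciate at a glance.
First of all, thank you for sharing your work.
Is there an interface for processing sentences in batches, that is, submitting them batch-style rather than one sentence at a time?
Looking forward to your reply.
I read the earlier issues but still found no concrete solution, so I am attaching my elmo.py. Thank you very much!
elmo.txt
The stanfordcorenlp import in model/input_representation.py is not commented out, which causes the run to fail.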
Traceback (most recent call last):
File "test.py", line 7, in <module>
from model.method import SIFRank, SIFRank_plus
File "/apdcephfs/private_markowu/proj/SIFRank_zh-master/model/method.py", line 9, in <module>
from model import input_representation
File "/apdcephfs/private_markowu/proj/SIFRank_zh-master/model/input_representation.py", line 9, in <module>
from stanfordcorenlp import StanfordCoreNLP
ModuleNotFoundError: No module named 'stanfordcorenlp'
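The failure above comes from an unconditional import of a package that is not installed. The following is a minimal sketch of one way to make such an import optional, using only the standard library. `optional_import` is a hypothetical helper; whether SIFRank_zh actually needs stanfordcorenlp on the Chinese code path is not established in this thread.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


stanfordcorenlp = optional_import("stanfordcorenlp")
if stanfordcorenlp is None:
    print("stanfordcorenlp is not installed; code paths that need it must be skipped")
```

Code that relies on the package can then check for `None` before using it, instead of failing at import time.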
The error message is as follows:
Model loaded succeed
2020-03-03 14:04:31,192 INFO: 1 batches, avg len: 153.0
Traceback (most recent call last):
File "D:/my_code/github项目/MeteorMan's nlp_lib/关键词抽取/SIFRank_zh-master/test/test.py", line 22, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\model\method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\embeddings\sent_emb_sif_backup.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = context_embeddings_alignment(elmo_embeddings, tokens_segmented)
File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\embeddings\sent_emb_sif_backup.py", line 90, in context_embeddings_alignment
emb = elmo_embeddings[i, 1, j, :]
IndexError: too many indices for tensor of dimension 3
Is there a problem with how elmo_embeddings is handled here?
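The expression `elmo_embeddings[i, 1, j, :]` in the traceback uses four indices, while the error reports a tensor with three dimensions. The following is a minimal sketch of a shape check that fails with a clearer message. `layer_vector` is a hypothetical helper, and the axis order (batch, layers, seq_len, dim) is an assumption read off the indexing expression.

```python
import numpy as np


def layer_vector(embeddings, i, j, layer=1):
    """Return the vector for sentence i, token j from the given layer of a 4-D batch."""
    arr = np.asarray(embeddings)
    if arr.ndim != 4:
        raise ValueError(
            f"expected shape (batch, layers, seq_len, dim), got {arr.shape}; "
            "the embedder may be returning a single layer per sentence"
        )
    return arr[i, layer, j, :]


batch = np.zeros((2, 3, 5, 8))  # 2 sentences, 3 layers, 5 tokens, 8 dimensions
print(layer_vector(batch, i=0, j=4).shape)
```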
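Hello! I read your paper on unsupervised keyphrase extraction. From the related work and the model comparison, I see that your work is the first to bring a pretrained model into unsupervised keyphrase extraction. Has any similar work appeared since then? How do you see this direction developing in the future?
Hello:
Which tool is used for part-of-speech tagging?
How are function words handled in noun phrases?
For example, in the noun phrase "计算机科学与技术" (computer science and technology), the character "与" (and) is a function word. How does it get included? An adjective-plus-noun pattern would skip function words.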
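The question above can be made concrete with a minimal sketch that chunks a tagged token sequence with a regular expression over part-of-speech tags. The tags (`n` noun, `c` conjunction, `d` adverb, `a` adjective), both patterns, and the `chunk` helper are illustrative assumptions; they are not the grammar or tagger used in SIFRank_zh.

```python
import re

# Hypothetical tagger output. Every tag is a single letter, so a character
# position in the joined tag string equals a token position.
tagged = [("计算机", "n"), ("科学", "n"), ("与", "c"), ("技术", "n"), ("很", "d"), ("重要", "a")]


def chunk(tagged, pattern):
    """Return the phrases whose tag sequence matches the given regex."""
    tags = "".join(tag for _, tag in tagged)
    return [
        "".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(pattern, tags)
    ]


adj_noun = r"[an]*n"                # adjectives and nouns ending in a noun
with_conj = r"[an]*n(?:c[an]*n)*"   # the same, optionally joined by conjunctions

print(chunk(tagged, adj_noun))
print(chunk(tagged, with_conj))
```

With the first pattern the conjunction breaks the phrase into two candidates; the second pattern keeps it as one candidate by allowing a conjunction between noun groups.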
Does the import jieba.analyse in the test.py script require a separate download?
Hello, I get the following error when running the test case test.py that you provide:
2021-08-15 18:28:10,586 INFO: char embedding size: 6169
2021-08-15 18:28:10,924 INFO: word embedding size: 71222
2021-08-15 18:28:16,333 INFO: Model(
(token_embedder): ConvTokenEmbedder(
(word_emb_layer): EmbeddingLayer(
(embedding): Embedding(71222, 100, padding_idx=3)
)
(char_emb_layer): EmbeddingLayer(
(embedding): Embedding(6169, 50, padding_idx=6166)
)
(convolutions): ModuleList(
(0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
(1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
(2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
(3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
(4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
(5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
(6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
)
(highways): Highway(
(_layers): ModuleList(
(0): Linear(in_features=2048, out_features=4096, bias=True)
(1): Linear(in_features=2048, out_features=4096, bias=True)
)
)
(projection): Linear(in_features=2148, out_features=512, bias=True)
)
(encoder): ElmobiLm(
(forward_layer_0): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(backward_layer_0): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(forward_layer_1): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(backward_layer_1): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
)
)
Traceback (most recent call last):
File "/Users/xing.sun/PycharmProjects/SIFRank_zh/test/test.py", line 14, in <module>
ELMO = word_emb_elmo.WordEmbeddings(model_file)
File "/Users/xing.sun/PycharmProjects/SIFRank_zh/embeddings/word_emb_elmo.py", line 22, in __init__
self.elmo = Embedder(model_path)
File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/elmo.py", line 106, in __init__
self.model, self.config = self.get_model()
File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/elmo.py", line 182, in get_model
model.load_model(self.model_dir)
File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/frontend.py", line 207, in load_model
map_location=lambda storage, loc: storage))
File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 777, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ConvTokenEmbedder:
size mismatch for word_emb_layer.embedding.weight: copying a param with shape torch.Size([140384, 100]) from checkpoint, the shape in current model is torch.Size([71222, 100]).
size mismatch for char_emb_layer.embedding.weight: copying a param with shape torch.Size([15889, 50]) from checkpoint, the shape in current model is torch.Size([6169, 50]).
My runtime environment is configured as specified in your README.
Looking forward to your reply, thank you.
2022-01-10 17:05:17,709 INFO: char embedding size: 6169
2022-01-10 17:05:17,918 INFO: word embedding size: 71222
2022-01-10 17:05:21,442 INFO: Model(
(token_embedder): ConvTokenEmbedder(
(word_emb_layer): EmbeddingLayer(
(embedding): Embedding(71222, 100, padding_idx=3)
)
(char_emb_layer): EmbeddingLayer(
(embedding): Embedding(6169, 50, padding_idx=6166)
)
(convolutions): ModuleList(
(0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
(1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
(2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
(3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
(4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
(5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
(6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
)
(highways): Highway(
(_layers): ModuleList(
(0): Linear(in_features=2048, out_features=4096, bias=True)
(1): Linear(in_features=2048, out_features=4096, bias=True)
)
)
(projection): Linear(in_features=2148, out_features=512, bias=True)
)
(encoder): ElmobiLm(
(forward_layer_0): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(backward_layer_0): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(forward_layer_1): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
(backward_layer_1): LstmCellWithProjection(
(input_linearity): Linear(in_features=512, out_features=16384, bias=False)
(state_linearity): Linear(in_features=512, out_features=16384, bias=True)
(state_projection): Linear(in_features=4096, out_features=512, bias=False)
)
)
)
Model loaded succeed
2022-01-10 17:05:24,990 INFO: 1 batches, avg len: 77.5
Traceback (most recent call last):
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/test/test.py", line 21, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=5,elmo_layers_weight=elmo_layers_weight)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/model/method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/sent_emb_sif.py", line 48, in get_tokenized_sent_embeddings
elmo_embeddings = self.word_embeddor.get_tokenized_words_embeddings(tokens_segmented)
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/word_emb_elmo.py", line 29, in get_tokenized_words_embeddings
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/word_emb_elmo.py", line 29, in <listcomp>
elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
File "<__array_function__ internals>", line 6, in pad
File "/Users/hellozhang/opt/anaconda3/envs/textrank/lib/python3.7/site-packages/numpy/lib/arraypad.py", line 748, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
File "/Users/hellozhang/opt/anaconda3/envs/textrank/lib/python3.7/site-packages/numpy/lib/arraypad.py", line 519, in _as_pairs
raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values
How should this problem be handled?
Traceback (most recent call last):
File "test.py", line 22, in <module>
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
File "../model/method.py", line 179, in SIFRank
sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
File "../embeddings/sent_emb_sif.py", line 49, in get_tokenized_sent_embeddings
elmo_embeddings = context_embeddings_alignment(elmo_embeddings, tokens_segmented)
File "../embeddings/sent_emb_sif.py", line 90, in context_embeddings_alignment
emb = elmo_embeddings[i, 1, j, :]
IndexError: too many indices for tensor of dimension 3