
ai-research-keyphrase-extraction's People

Contributors

floscha, kamilbs, ysavary-swisscom


ai-research-keyphrase-extraction's Issues

Can we utilize a GPU to extract keyphrases?

Hello,
I'm trying to use a GPU to extract keyphrases.

I tried this:

with tf.device('/device:GPU:2'):
    kp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 10, 'en')  # extract 10 keyphrases

But it fails with: "Make sure the device specification refers to a valid device. [[init]]"
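
A quick way to narrow this down (my own suggestion, not from the maintainers) is to check which devices TensorFlow can actually see from the process: '/device:GPU:2' is only valid if at least three GPUs are visible. Also note that the extraction itself runs sent2vec and CoreNLP rather than TensorFlow ops, so wrapping it in tf.device() may not move any work to the GPU even when the device exists.

# Hypothetical check of visible devices (should work on TF 1.x and 2.x):
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name)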

Does this work with the sent2vec unigram wiki model?

Hi there,

I'm currently attempting to launch this in a Docker container.

I've been trying it out with the unigram model, but I get a silent failure when attempting to load the embedding distributor:
embedding_distributor = launch.load_local_embedding_distributor('en')

It just closes out and crashes the Docker container.

The command that launches the container:

docker run --memory=10g -v /Desktop/ai-research-keyphrase-extraction/deps/sent2vec/models/wiki_unigrams.bin:/sent2vec/pretrained_model.bin -it keyphrase-extraction

import launch
embedding_distributor = launch.load_local_embedding_distributor('en') <---FAILURE

Thanks in advance for your help.
Dan
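
A sanity check that might help isolate this (an assumption on my side, not maintainer guidance): confirm inside the container that the bind mount actually produced a readable model file of the expected size. wiki_unigrams.bin is roughly 5 GB, so both a missing mount and an out-of-memory kill while loading show up as a silent exit.

# Hypothetical pre-flight check inside the container, before load_local_embedding_distributor:
import os

path = '/sent2vec/pretrained_model.bin'
print('exists:', os.path.isfile(path))
print('size (bytes):', os.path.getsize(path) if os.path.isfile(path) else 0)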

Key phrases in other languages.

I was wondering if there is a list of requirements we need to make sure are fulfilled to be able to run this system to extract key phrases in other languages. It seems to me that at least the Stanford CoreNLP server should be called with an appropriate language model, and the trained model used by SENT2VEC should be set appropriately. Could you say more about this?

How to Evaluate the Model

Hello,
since it's unsupervised,
I want to know how we can evaluate the model and create the training data.
I have seen that you calculated the F-measure.
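
For reference, a minimal sketch of the exact-match F1 evaluation commonly reported for keyphrase extraction (this is my own illustration, not the authors' evaluation code), comparing the top-k predicted phrases against a gold annotation:

# Exact-match precision/recall/F1 at k for one document (illustrative only).
def f1_at_k(predicted, gold, k=10):
    pred_set = {p.lower().strip() for p in predicted[:k]}
    gold_set = {g.lower().strip() for g in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)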

Getting ValueError when running launch.py in Python 3.5 on Ubuntu 16.04

I get a ValueError when running launch.py:

+++++++++++++++++++++++++++++++++++++++
In [1]: import launch

ValueError Traceback (most recent call last)
in ()
----> 1 import launch

~/ai-research-keyphrase-extraction/launch.py in ()
2 from configparser import ConfigParser
3
----> 4 from swisscom_ai.research_keyphrase.embeddings.emb_distrib_local import EmbeddingDistributorLocal
5 from swisscom_ai.research_keyphrase.model.input_representation import InputTextObj
6 from swisscom_ai.research_keyphrase.model.method import MMRPhrase

~/ai-research-keyphrase-extraction/swisscom_ai/research_keyphrase/embeddings/emb_distrib_local.py in ()
7
8 from swisscom_ai.research_keyphrase.embeddings.emb_distrib_interface import EmbeddingDistributor
----> 9 import sent2vec
10
11

init.pxd in init sent2vec()

ValueError: numpy.ufunc has the wrong size, try recompiling. Expected 192, got 216

+++++++++++++++++++++++++++++++++++++++
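
One common cause of this particular error (an assumption on my part, not a confirmed answer) is that the sent2vec Cython extension was compiled against a different numpy version than the one now installed, so the compiled ufunc struct size no longer matches. Rebuilding the extension against the currently installed numpy, following the same steps as the README's sent2vec installation, usually clears this kind of mismatch:

pip install --upgrade numpy
cd sent2vec/src
python setup.py build_ext
pip install .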

config.ini problem

What does model_path_de in config.ini.template mean?
In the README, only one configuration key, model_path, is mentioned.

ConnectionRefusedError: [Errno 111] Connection refused

I've used Docker to set up the repository and am trying to run the code snippet under 'Usage' in the README.

I'm new to Docker, so please confirm whether I'm doing this correctly. I first run the 'docker run' command and start the stanford_corenlp server. Then, in another terminal, I run my code snippet.
I get the following error.

Traceback (most recent call last):
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1264, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1310, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1259, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 976, in send
    self.connect()
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/connection.py", line 172, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f95e8d03b70>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=9000): Max retries exceeded with url: /?properties=%7B%22outputFormat%22%3A+%22json%22%2C+%22annotators%22%3A+%22tokenize%2Cssplit%2Cpos%22%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f95e8d03b70>: Failed to establish a new connection: [Errno 111] Connection refused',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    kp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 5,'en')
  File "/home/goorulabs/NarrativeArc/ai-research-keyphrase-extraction/launch.py", line 27, in extract_keyphrases
    tagged = ptagger.pos_tag_raw_text(raw_text)
  File "/home/goorulabs/NarrativeArc/ai-research-keyphrase-extraction/swisscom_ai/research_keyphrase/preprocessing/postagging.py", line 215, in pos_tag_raw_text
    tagged_text = list(raw_tag_text())
  File "/home/goorulabs/NarrativeArc/ai-research-keyphrase-extraction/swisscom_ai/research_keyphrase/preprocessing/postagging.py", line 211, in raw_tag_text
    tagged_data = self.parser.api_call(text, properties=properties)
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/nltk/parse/corenlp.py", line 247, in api_call
    timeout=timeout,
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/requests/sessions.py", line 578, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/goorulabs/NarrativeArc/narrativearc/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=9000): Max retries exceeded with url: /?properties=%7B%22outputFormat%22%3A+%22json%22%2C+%22annotators%22%3A+%22tokenize%2Cssplit%2Cpos%22%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f95e8d03b70>: Failed to establish a new connection: [Errno 111] Connection refused',))
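
Before digging into the traceback, a quick connectivity check can confirm whether the CoreNLP server is reachable from the process that runs extract_keyphrases (my own suggestion; a refused connection on 127.0.0.1:9000 usually means the server is listening in a different container or network namespace than the client):

# Hypothetical check: the CoreNLP server answers plain HTTP on its port.
import requests

response = requests.get('http://127.0.0.1:9000')
print(response.status_code)  # 200 means the server is reachable from this process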

BrokenPipeError during extract_keyphrases: are any of these steps wrong?

I run the following commands and get a BrokenPipeError:

docker build . -t keyphrase-extraction
docker run -v torontobooks_bigrams.bin:/sent2vec/pretrained_model.bin -it keyphrase-extraction

Inside the container:

import launch

embedding_distributor = launch.load_local_embedding_distributor('en')
pos_tagger = launch.load_local_pos_tagger('en')
raw_text = 'this is the text i want to extract keyphrases from'
kp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 10, 'en') 

Traceback (most recent call last):
  File "", line 1, in
  File "/app/launch.py", line 29, in extract_keyphrases
    return MMRPhrase(embedding_distrib, text_obj, N=N, beta=beta, alias_threshold=alias_threshold)
  File "/app/swisscom_ai/research_keyphrase/model/method.py", line 87, in MMRPhrase
    candidates, X = extract_candidates_embedding_for_doc(embdistrib, text_obj)
  File "/app/swisscom_ai/research_keyphrase/model/methods_embeddings.py", line 44, in extract_candidates_embedding_for_doc
    embeddings = np.array(embedding_distrib.get_tokenized_sents_embeddings(candidates))  # Associated embeddings
  File "/app/swisscom_ai/research_keyphrase/embeddings/emb_distrib_local.py", line 41, in get_tokenized_sents_embeddings
    self._stdin_handle.flush()
BrokenPipeError: [Errno 32] Broken pipe
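
One thing worth checking (this is my assumption, not a maintainer answer): with a relative path, docker run -v treats torontobooks_bigrams.bin as a named volume rather than bind-mounting the model file, so the sent2vec process behind the embedding distributor never gets a valid model and exits; the Python side only notices when it flushes the dead process's stdin, which is exactly a BrokenPipeError. Mounting the model with an absolute path avoids this:

docker run -v /absolute/path/to/torontobooks_bigrams.bin:/sent2vec/pretrained_model.bin -it keyphrase-extraction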

Extraction is very slow?

I am running extraction on short sentences, roughly 20 words.

I have followed your instructions to load the embedding model and the part of speech tagger once.

However, it takes about 3 seconds per extraction on a dedicated machine.

Is this expected? How can I make extraction faster?
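
In case it helps to narrow down where the time goes, a small timing sketch (my own, using only calls that appear elsewhere in these reports) can separate the CoreNLP POS-tagging round trip from the embedding and ranking step:

# Rough per-stage timing (illustrative; assumes embedding_distributor,
# pos_tagger and raw_text are set up as in the README usage example).
import time

start = time.time()
pos_tagger.pos_tag_raw_text(raw_text)          # CoreNLP round trip only
tagging_time = time.time() - start

start = time.time()
launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 10, 'en')
total_time = time.time() - start

print('POS tagging: %.2fs, full extraction: %.2fs' % (tagging_time, total_time))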

Pretrained model file gives me 'Model file has wrong file format!'

I get 'Model file has wrong file format!'

when using Docker:
docker run -v torontobooks_bigrams.bin:/sent2vec/pretrained_model.bin -it keyphrase-extraction

I tried two files, torontobooks_bigrams.bin and wiki_unigram.bin,

but they give me the same error!

thanks in advance

config.ini.template

Hello, excuse me:
which one should I choose?
jar_path = 'path'
or
jar_path = path

Inspec evaluation

Hey,
In the paper you mention that when evaluating the model on NUS dataset you union the keyphrases assigned by authors and annotators. Do you also union the controlled and uncontrolled keyphrases from Inspec dataset?

invalid value encountered in true_divide

raw_text = 'this is the text i want to extract keyphrases from'

kp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 10, 'en')
/home1/zy/anaconda3/envs/py36pc2t/lib/python3.6/site-packages/ai-research-keyphrase-extraction/swisscom_ai/research_keyphrase/model/method.py:44: RuntimeWarning: invalid value encountered in true_divide
0.5 + (sim_between_norm - np.nanmean(sim_between_norm, axis=0)) / np.nanstd(sim_between_norm, axis=0)
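
If it is useful for reproducing this, the warning can be triggered on its own (my interpretation of the cause, not a confirmed diagnosis): when there is effectively a single candidate, the column-wise standard deviation in method.py is zero and the normalisation divides by it.

# Minimal standalone reproduction of the same RuntimeWarning.
import numpy as np

sim_between_norm = np.array([[1.0]])  # a single candidate -> zero variance along axis 0
normalized = 0.5 + (sim_between_norm - np.nanmean(sim_between_norm, axis=0)) / np.nanstd(sim_between_norm, axis=0)
print(normalized)  # prints [[nan]] and emits "invalid value encountered in true_divide"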

std::bad_alloc Aborted

Hi,

I managed to launch the program on my personal computer (16 GB RAM) without any problem. However, when I try to launch it on a server (8 GB RAM), I get this error every time:

terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc Aborted

I read that this error appears when there is no more RAM available.

Any idea how to fix this issue?

Error in importing ABC

Thanks for the code.

I'm having the issue mentioned in the title. I would be glad if you could look into it.

[screenshot of the error was attached here]

This method tends to extract detailed phrases rather than the core ones

I tested this method on Chinese keyphrase extraction.

I used the BERT CLS embedding and the added token embeddings as the sentence embedding,
and applied your method.

The results show that a lot of details rather than key phrases are extracted.
I'll give you an example:

据美国约翰斯·霍普金斯大学发布的新冠疫情最新统计数据显示,截至北京时间4月27日11时50分左右,全球新冠肺炎确诊病例达2971669例,死亡病例达206544例。 数据显示,美国是疫情最严重的国家,美国的新冠肺炎确诊病例已达965783例,死亡病例达54883例。紧随其后的分别是西班牙,意大利,法国,死亡病例均超过2万例。值得注意的是,美国确诊人数高出西班牙近74万例,西班牙目前累计确诊226629例。老友喊话特朗普:美国人的命比你的选举更重要如果你继续以自己为中心,继续玩弄愚蠢的政治……如果你意识不到自己的错误,你就做不对。”目前,特朗普已“取关”了这位老友。
视频来源:CGTN
据海外网消息,上周,美国总统特朗普在记者会上语出惊人,暗示也许可以注射消毒剂来杀死新冠病毒,随即引发舆论哗然。当地时间26日,同为共和党人的马里兰州州长拉里·霍根表示,他对白宫在美国面临新冠疫情危机的当下发出这种令人糊涂的信息感到担忧。他也强调,美国总统在试图向民众通报疫情时,坚持事实是“至关重要”的。
比尔·盖茨:**在疫情暴发时做了对的事 可悲的是美国本可以做好
当地时间4月26日,比尔·盖茨接受CNN采访时,被问及怎么看待有人指责**掩盖疫情,对此,他称赞**在疫情暴发的“一开始就做了很多正确的事情”,可悲的是本来能做好的美国却做得特别差。他称指责**是不正确和不公平的。加州官员:低收入社区新冠肺炎致死率是较富裕社区的三倍
据央视新闻报道,当地时间4月26日,美国加州洛杉矶县公共卫生局局长芭芭拉在当天举行的新冠肺炎疫情新闻发布会上表示,洛杉矶县当天新增440例新冠肺炎确诊病例,新增18例死亡病例。芭芭拉同时透露,生活在洛杉矶县低收入社区的居民死于新冠肺炎的可能性是较富裕社区的三倍,拥有低收入家庭超过30%的社区,每10万人中有16.5人因新冠肺炎死亡,而拥有低收入家庭低于10%的社区,每10万人中有5.3人死亡。
据海外网消息,美国媒体26日消息称,位于纽约市的美国海军“安慰”号医疗船的最后一名新冠肺炎患者于当天获准出院,而该医疗船的此次任务也接近尾声。据了解,自3月底抵达纽约至今,拥有1000张病床的“安慰”号仅收治了182名病人。
白宫经济顾问称美国4月失业率将达16%,经济面临历史冲击
据澎湃新闻报道,当地时间4月26日,白宫经济顾问凯文·哈塞特(Kevin Hassett)表示,4月份美国的失业率可能达到16%或更高,需要更多刺激措施以确保经济出现强劲反弹。
凯文·哈塞特在接受美国广播公司新闻网(ABCNews)采访时表示,“形势非常严峻”,“我认为,这是我们经济史上最大的负面冲击。”“我们将看到要失业率接近20世纪30年代大萧条时期的水平。”

You could use the Google Translate API to get an English version of this news article.
EmbedRank predicts:

低收入社区新冠肺炎致死率 (COVID-19 fatality rate in low-income communities)
白宫经济顾问凯文·哈塞特 (White House economic adviser Kevin Hassett)
世纪30年代 (the 1930s)
语出惊人 (made a startling remark)
多刺激措施 (more stimulus measures)

However, a method combining TF-IDF and LDA features predicts:

美国的新冠肺炎确诊病例 (confirmed COVID-19 cases in the United States)
新冠病毒 (the novel coronavirus)
低收入社区新冠肺炎致死率 (COVID-19 fatality rate in low-income communities)
美国海军 (the US Navy)
特朗普老友皮尔斯·摩根 (Trump's old friend Piers Morgan)

Several other texts show a similar pattern: EmbedRank is prone to predicting details rather than core phrases.
I find that the longer a candidate phrase is, the more similar it is to the whole text.
I assume this happens because of sent2vec.

pke

Excuse me, do you know 'pke'?
Have you compared your model with it?

AttributeError: module 'sent2vec' has no attribute 'Sent2vecModel'

I have followed each and every step as described in the README file, but something goes wrong and I get the error AttributeError: module 'sent2vec' has no attribute 'Sent2vecModel'. I looked at other similar reports and re-cloned the repo twice, but it does not work.
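
One possible explanation (my assumption, not a confirmed answer): a different package also named sent2vec, installed from PyPI, can shadow the epfml/sent2vec bindings built from source, and that other package has no Sent2vecModel class. Checking which module actually gets imported narrows this down:

# Hypothetical check of which 'sent2vec' module is being imported.
import sent2vec

print(sent2vec.__file__)                 # path of the module actually loaded
print('Sent2vecModel' in dir(sent2vec))  # True only for the epfml bindings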

position bias

I want to use this method on long papers, but you do not show how to use EmbedRank with the position bias.
Can you tell me how to do it?
Thanks

Train new models for Arabic and Persian

Hello,
I ran the module with the pretrained model and it works fine. I want to train new models for Arabic and Persian. I trained a sent2vec model, but it didn't work. Can you please guide me on what else I should do? How can I use this module for other languages?

How to extract keyphrase using Doc2Vec instead of Sent2Vec?

Hello, I've been trying to run EmbedRank using Doc2Vec. What changes would need to be made to the config.ini file, and what steps are involved in setting up the EmbedRank project with Doc2Vec instead of the sent2vec technique already mentioned in the repo description?

Thanks!
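
As a rough illustration of one way this could be wired up (a sketch under my own assumptions, not the authors' supported path): the repo routes embedding calls through the EmbeddingDistributor interface used by emb_distrib_local.py, so a gensim Doc2Vec model could be exposed behind the same get_tokenized_sents_embeddings method and passed to extract_keyphrases in place of the sent2vec distributor. The class name below is hypothetical.

# Sketch of a Doc2Vec-backed embedding distributor (illustrative only).
import numpy as np
from gensim.models.doc2vec import Doc2Vec

from swisscom_ai.research_keyphrase.embeddings.emb_distrib_interface import EmbeddingDistributor


class EmbeddingDistributorDoc2Vec(EmbeddingDistributor):
    def __init__(self, model_path):
        self._model = Doc2Vec.load(model_path)

    def get_tokenized_sents_embeddings(self, sents):
        # Each candidate phrase / sentence arrives as a tokenized string.
        return np.array([self._model.infer_vector(sent.split()) for sent in sents])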

Model evaluation

I am using Doc2Vec with the DUC01 dataset, but the results are not as expected. Can you help me?

Installation on Anaconda

Hi, I run into a problem when I try to install it from the Anaconda prompt,
and I do not quite understand what is written in the README.md. For example:

Install sent2vec from https://github.com/epfml/sent2vec

Clone/download the sent2vec repository, then:
cd sent2vec
git checkout f827d014a473aa22b2fef28d9e29211d50808d48
make
pip install cython
cd src
python setup.py build_ext
pip install .

How should I handle this part? Any useful suggestions are appreciated.
Thank you.

Interactive session quits while trying to load embedding_distributor

Hi!

First of all, thank you for sharing the code of the paper! It's super helpful.

I have a problem with the interactive session in Docker. Basically what happens is that the session just ends while loading the embedding_distributor.

Here's the command I use to launch the interactive session:

docker run -v ~/tests/embedrank/ai-research-keyphrase-extraction/wiki_unigrams.bin:/sent2vec/pretrained_model.bin -it keyphrase-extraction

and here's the docker logs from that session:

Python 3.6.5 (default, Apr  4 2018, 23:11:43)
[GCC 6.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import launch
>>> embedding_distributor = launch.load_local_embedding_distributor('en')

Nothing really happens: no error, nothing at all. It just quits. What am I doing wrong here? Any help is much appreciated.
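
In case it is useful, one common culprit (an assumption on my part, not maintainer guidance) is the container hitting its memory limit while loading the roughly 5 GB wiki_unigrams.bin, which kills the Python process without a traceback. Raising Docker's memory limit with the same flag that appears in another report above, e.g.

docker run --memory=10g -v ~/tests/embedrank/ai-research-keyphrase-extraction/wiki_unigrams.bin:/sent2vec/pretrained_model.bin -it keyphrase-extraction

would rule that out.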
