csurfer / rake-nltk
Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.
Home Page: https://csurfer.github.io/rake-nltk
License: MIT License
How do we tune the Rake parameters, as in this example:
https://www.airpair.com/nlp/keyword-extraction-tutorial
So for example:
rake_object = rake.Rake("SmartStoplist.txt", 5, 3, 4)
Each word has at least 5 characters
Each phrase has at most 3 words
Each keyword appears in the text at least 4 times
These parameters change as a function of the corpus
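Those three knobs come from the original python-rake implementation; rake-nltk only exposes `min_length`/`max_length` directly, but the character-count and frequency filters can be approximated as a post-processing step. A minimal sketch (`filter_phrases` and its arguments are illustrative, not part of rake-nltk):

```python
def filter_phrases(scored_phrases, freq_dist, min_chars=5, max_words=3, min_freq=4):
    """Approximate python-rake's (min_chars, max_words, min_freq) filters.

    scored_phrases: list of (score, phrase) tuples, as returned by
                    get_ranked_phrases_with_scores().
    freq_dist: word -> frequency mapping, as returned by
               get_word_frequency_distribution().
    """
    kept = []
    for score, phrase in scored_phrases:
        words = phrase.split()
        if len(words) > max_words:                              # at most `max_words` words
            continue
        if any(len(w) < min_chars for w in words):              # each word >= `min_chars` chars
            continue
        if any(freq_dist.get(w, 0) < min_freq for w in words):  # each word frequent enough
            continue
        kept.append((score, phrase))
    return kept
```

The right values are indeed corpus-dependent, so treating them as a separate filtering pass makes it cheap to re-tune without re-extracting.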
Hello, and first of all, great API.
Is there a way to allow the incorporation of stopwords and punctuation in the keyword analysis?
Thanks!
Rake-nltk uses wordpunct_tokenize for tokenization, which is defined in nltk/tokenize/regexp.py as wordpunct_tokenize = WordPunctTokenizer().tokenize.
However, this tokenizer does not work with some languages like Chinese/Japanese (text such as '当地时间6月9日下午,伊利诺伊大学香槟分校**访问学者章莹颖外出办事途中失踪,当地警方及学生学者、华侨华人全力投入搜索,但章莹颖至今下落不明。章莹颖是在离开住处前往租房签约过程中,坐上一辆深色Saturn Astra汽车后失踪。总领馆接到消息后第一时间与当地警方、章莹颖的老师和同学以及当地**学生学者联谊会取得联系,了解案情进展,并与家属保持沟通,就家属申办来美签证提供协助。'), which has no whitespace as a delimiter. (There is a built-in StanfordSegmenter in nltk for Chinese segmentation.)
I think adding an optional keyword argument tokenizer to the Rake constructor for customization is the way to go. What do you think about it? Discussions are welcome. Thanks.
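A rough sketch of how such a tokenizer keyword argument could look (a hypothetical API, not part of rake-nltk today): the constructor accepts any callable that maps a sentence string to a list of tokens, so a Chinese segmenter could be dropped in without touching the rest of the pipeline.

```python
import re

class TokenizableRake:
    """Sketch of a Rake-like class with a pluggable tokenizer (hypothetical)."""

    def __init__(self, tokenizer=None):
        # Default roughly mimics nltk's wordpunct_tokenize: runs of word
        # characters, or runs of non-space punctuation.
        self.tokenizer = tokenizer or (lambda s: re.findall(r"\w+|[^\w\s]+", s))

    def tokenize(self, sentence):
        return [tok.lower() for tok in self.tokenizer(sentence)]

# A Chinese segmenter (e.g. nltk's StanfordSegmenter, or jieba) would then be
# passed in as `tokenizer`; the rest of the extraction stays unchanged.
```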
I'm not an expert in NLTK, but I tried following the algorithm and I don't understand how it can work.
It seems _build_frequency_dist is supposed to count the frequency of phrases. However, the phrase_list it receives is the one generated by _generate_phrases, which returns a set(), meaning every phrase can appear there only once. The generated Counter object therefore counts every phrase as appearing once.
This doesn't make sense, no?
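A minimal, self-contained repro of the concern (plain Python, mirroring how _generate_phrases and _build_frequency_dist interact):

```python
from collections import Counter
from itertools import chain

# _generate_phrases builds a set, so a phrase extracted from two different
# sentences collapses to a single entry...
phrase_list = set()
for sentence_phrases in [[('red', 'apples')], [('red', 'apples')]]:
    phrase_list.update(sentence_phrases)

# ...and _build_frequency_dist then counts each word only once.
freq = Counter(chain.from_iterable(phrase_list))
print(freq['red'])  # 1, even though 'red apples' occurred twice
```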
Thanks for sharing! Here's the rake.py file edited to use spaCy instead of NLTK. It removes certain verb types in _get_phrase_list_from_words, which I found to improve performance a bit (in a small sample size).
```python
# -*- coding: utf-8 -*-
"""Implementation of Rapid Automatic Keyword Extraction algorithm.

As described in the paper `Automatic keyword extraction from individual
documents` by Stuart Rose, Dave Engel, Nick Cramer and Wendy Cowley.
"""
import string
from collections import Counter, defaultdict
from enum import Enum
from itertools import chain, groupby, product

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')

# Tokens/sentences consisting solely of terminal punctuation are dropped.
# (The original snippet compared against {'.!?'}, a one-element set that a
# single punctuation mark never matches; this is the intended check.)
SENTENCE_PUNCT = {'.', '!', '?'}


class Metric(Enum):
    """Different metrics that can be used for ranking."""

    DEGREE_TO_FREQUENCY_RATIO = 0  # Uses d(w)/f(w) as the metric
    WORD_DEGREE = 1                # Uses d(w) alone as the metric
    WORD_FREQUENCY = 2             # Uses f(w) alone as the metric


class Rake(object):
    """Rapid Automatic Keyword Extraction Algorithm."""

    def __init__(
        self,
        stopwords=None,
        punctuations=None,
        language="english",
        ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO,
        max_length=100000,
        min_length=1,
        verb_tags_to_rm=None,
    ):
        """Constructor.

        :param stopwords: List of words to be ignored for keyword extraction.
        :param punctuations: Punctuations to be ignored for keyword extraction.
        :param language: Language to be used for stopwords.
        :param max_length: Maximum limit on the number of words in a phrase
                           (inclusive; defaults to 100000).
        :param min_length: Minimum limit on the number of words in a phrase
                           (inclusive; defaults to 1).
        :param verb_tags_to_rm: Penn Treebank verb tags whose tokens act as
                                phrase breaks (defaults to VB, VBD, VBP, VBZ).
        """
        # By default use degree to frequency ratio as the metric.
        if isinstance(ranking_metric, Metric):
            self.metric = ranking_metric
        else:
            self.metric = Metric.DEGREE_TO_FREQUENCY_RATIO

        # If stopwords are not provided we use spaCy's English stopwords.
        self.stopwords = stopwords
        if self.stopwords is None:
            self.stopwords = list(STOP_WORDS)

        # If punctuations are not provided we ignore all punctuation symbols.
        self.punctuations = punctuations
        if self.punctuations is None:
            self.punctuations = string.punctuation

        # Penn Treebank verb tags treated as phrase breaks:
        #   RM:   VB   VerbForm=inf               verb, base form
        #   RM:   VBD  VerbForm=fin Tense=past    verb, past tense
        #   KEEP: VBG  VerbForm=part Tense=pres Aspect=prog
        #              verb, gerund or present participle
        #   KEEP: VBN  VerbForm=part Tense=past Aspect=perf
        #              verb, past participle
        #   RM:   VBP  VerbForm=fin Tense=pres    verb, non-3rd person singular present
        #   RM:   VBZ  VerbForm=fin Tense=pres Number=sing Person=3
        #              verb, 3rd person singular present
        self.verb_tags_to_rm = verb_tags_to_rm
        if self.verb_tags_to_rm is None:
            self.verb_tags_to_rm = {'VB', 'VBD', 'VBP', 'VBZ'}

        # All things which act as phrase breaks during keyword extraction.
        self.to_ignore = set(chain(self.stopwords, self.punctuations))

        # Assign min and max length to the attributes.
        self.min_length = min_length
        self.max_length = max_length

        # Stuff to be extracted from the provided text.
        self.frequency_dist = None
        self.degree = None
        self.rank_list = None
        self.ranked_phrases = None

    def extract_keywords_from_text(self, text):
        """Method to extract keywords from the text provided.

        :param text: Text to extract keywords from, provided as a string.
        """
        # spaCy's sentence segmentation replaces nltk.tokenize.sent_tokenize.
        sentences = [
            str(s) for s in nlp(text.lower()).sents
            if str(s).strip() not in SENTENCE_PUNCT
        ]
        self.extract_keywords_from_sentences(sentences)

    def extract_keywords_from_sentences(self, sentences):
        """Method to extract keywords from the list of sentences provided.

        :param sentences: Text to extract keywords from, provided as a list
                          of strings, where each string is a sentence.
        """
        phrase_list = self._generate_phrases(sentences)
        self._build_frequency_dist(phrase_list)
        self._build_word_co_occurance_graph(phrase_list)
        self._build_ranklist(phrase_list)

    def get_ranked_phrases(self):
        """Method to fetch ranked keyword strings.

        :return: List of strings where each string represents an extracted
                 keyword string.
        """
        return self.ranked_phrases

    def get_ranked_phrases_with_scores(self):
        """Method to fetch ranked keyword strings along with their scores.

        :return: List of tuples where each tuple is formed of an extracted
                 keyword string and its score. Ex: (5.68, 'Four Scores')
        """
        return self.rank_list

    def get_word_frequency_distribution(self):
        """Method to fetch the word frequency distribution in the given text.

        :return: Dictionary (defaultdict) of the format `word -> frequency`.
        """
        return self.frequency_dist

    def get_word_degrees(self):
        """Method to fetch the degree of words in the given text. Degree can
        be defined as the sum of co-occurrences of the word with other words
        in the given text.

        :return: Dictionary (defaultdict) of the format `word -> degree`.
        """
        return self.degree

    def _build_frequency_dist(self, phrase_list):
        """Builds frequency distribution of the words in the given body of text.

        :param phrase_list: List of List of strings where each sublist is a
                            collection of words which form a contender phrase.
        """
        self.frequency_dist = Counter(chain.from_iterable(phrase_list))

    def _build_word_co_occurance_graph(self, phrase_list):
        """Builds the co-occurrence graph of words in the given body of text
        to compute the degree of each word.

        :param phrase_list: List of List of strings where each sublist is a
                            collection of words which form a contender phrase.
        """
        co_occurance_graph = defaultdict(lambda: defaultdict(lambda: 0))
        for phrase in phrase_list:
            # For each phrase in the phrase list, count co-occurrences of the
            # word with other words in the phrase.
            #
            # Note: Keep the co-occurrence graph as is, to help facilitate its
            # use in other creative ways if required later.
            for (word, coword) in product(phrase, phrase):
                co_occurance_graph[word][coword] += 1
        self.degree = defaultdict(lambda: 0)
        for key in co_occurance_graph:
            self.degree[key] = sum(co_occurance_graph[key].values())

    def _build_ranklist(self, phrase_list):
        """Method to rank each contender phrase using the formula

              phrase_score = sum of scores of words in the phrase.
              word_score = d(w)/f(w) where d is degree and f is frequency.

        :param phrase_list: List of List of strings where each sublist is a
                            collection of words which form a contender phrase.
        """
        self.rank_list = []
        for phrase in phrase_list:
            rank = 0.0
            for word in phrase:
                if self.metric == Metric.DEGREE_TO_FREQUENCY_RATIO:
                    rank += 1.0 * self.degree[word] / self.frequency_dist[word]
                elif self.metric == Metric.WORD_DEGREE:
                    rank += 1.0 * self.degree[word]
                else:
                    rank += 1.0 * self.frequency_dist[word]
            self.rank_list.append((rank, " ".join(phrase)))
        self.rank_list.sort(reverse=True)
        self.ranked_phrases = [ph[1] for ph in self.rank_list]

    def _generate_phrases(self, sentences):
        """Method to generate contender phrases given the sentences of the
        text document.

        :param sentences: List of strings where each string represents a
                          sentence which forms the text.
        :return: Set of string tuples where each tuple is a collection
                 of words forming a contender phrase.
        """
        phrase_list = set()
        # Create contender phrases from sentences.
        for sentence in sentences:
            word_list, words_to_rm = [], set()
            for d in nlp(sentence):
                tok_str = str(d).lower()
                if tok_str not in SENTENCE_PUNCT:
                    word_list.append(tok_str)
                    if d.tag_ in self.verb_tags_to_rm:
                        words_to_rm.add(tok_str)
            phrase_list.update(
                self._get_phrase_list_from_words(word_list, words_to_rm)
            )
        return phrase_list

    def _get_phrase_list_from_words(self, word_list, words_to_rm):
        """Method to create contender phrases from the list of words that form
        a sentence by dropping stopwords and punctuations and grouping the
        remaining words into phrases. Only phrases in the given length range
        (both limits inclusive) are considered to build the co-occurrence
        matrix. Ex:

            Sentence: Red apples, are good in flavour.
            List of words: ['red', 'apples', ',', 'are', 'good', 'in', 'flavour']
            List after dropping punctuations and stopwords:
                ['red', 'apples', *, *, 'good', *, 'flavour']
            List of phrases: [('red', 'apples'), ('good',), ('flavour',)]

            List of phrases with a correct length:
                For the range [1, 2]: [('red', 'apples'), ('good',), ('flavour',)]
                For the range [1, 1]: [('good',), ('flavour',)]
                For the range [2, 2]: [('red', 'apples')]

        :param word_list: List of words which form a sentence when joined in
                          the same order.
        :param words_to_rm: Set of words (e.g. certain verb forms) that should
                            additionally act as phrase breaks.
        :return: List of contender phrases that are formed after dropping
                 stopwords and punctuations.
        """
        # Would rather use an index of the word instead of words_to_rm, but
        # can't figure it out.
        groups = groupby(
            word_list, lambda x: x not in self.to_ignore and x not in words_to_rm
        )
        phrases = [tuple(group[1]) for group in groups if group[0]]
        return list(
            filter(lambda x: self.min_length <= len(x) <= self.max_length, phrases)
        )
```
The list of results gives duplicated keywords with score 1. This happened after upgrading Anaconda, after which I had to reinstall rake-nltk.
Text to reproduce the error, giving the keyword "solar" twice with score = 1:
"spectroscopy of the globular cluster dip source x 1746 371 ngc 6441. we propose a 50 xmm observation of the dipping xray source x 1747 371 located in the globular cluster ngc 6441. this source exhibits highly energy independent dips, consistent with an abundance >150 times less than solar, which repeat every 5.7 hours, whereas the overall cluster abundance is only a factor 4 to 10 below solar. resolving this discrepancy is the prime goal of this proposal. this study requires the high throughput, good spectral resolution, and continuous coverage afforded by xmm"
Does RAKE NLTK support the German language if the text is in German?
Hi, may I ask about the walkthrough to use Rake for Spanish? Thanks.
First, great package. Very easy to use!
May I kindly suggest/request a feature? In the RAKE paper, the authors note that there are multiple ways to compute word scores, each of which favors different kinds of key phrases: word frequency f(w), word degree d(w), and the degree-to-frequency ratio d(w)/f(w).
The third is what they primarily use and is what your package uses. However, as Rose et al. note, this metric favors longer keyphrases. Indeed, in my dataset of news articles, I find that it produces very wonky, idiosyncratic phrases that don't, impressionistically, describe my articles well.
So it could be nice to allow something like the following:
r = Rake(metric='word_freq')
when initializing the Rake object.
I would help with this but I don't really know Python OOP...
I used the simple code from the demo:

```python
from rake_nltk import Rake

r = Rake()
mytext = "Hello World!"
r.extract_keywords_from_text(mytext)
r.get_ranked_phrases()
```

but no results were output. Is this a Python version issue? (I use 3.6.)
In [19]: runfile('C:/Users/yclin57/AI_summary_test.py', wdir='C:/Users/yclin57') In [20]:
I am using the following requirements to install to Google App Engine:
nltk==3.4.5
rake-nltk==1.0.4
The deployment fails with the error messages below; I can, however, deploy nltk 3.4.5 without any issues. Only the post-installation tasks of rake-nltk seem to be the issue.
```
Step #1 - "builder": Copying rake_nltk.egg-info to build/bdist.linux-x86_64/wheel/rake_nltk-1.0.4-py3.7.egg-info
Step #1 - "builder": running install_scripts
Step #1 - "builder": Running post installation tasks
Step #1 - "builder": Traceback (most recent call last):
Step #1 - "builder":   File "<string>", line 1, in <module>
Step #1 - "builder":   File "/tmp/pip-wheel-bw7rzw5o/rake-nltk/setup.py", line 81, in <module>
Step #1 - "builder":     cmdclass={"develop": PostDevelop, "install": PostInstall},
Step #1 - "builder":   File "/env/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
Step #1 - "builder":     return distutils.core.setup(**attrs)
Step #1 - "builder":   File "/opt/python3.7/lib/python3.7/distutils/core.py", line 148, in setup
Step #1 - "builder":     dist.run_commands()
Step #1 - "builder":   File "/opt/python3.7/lib/python3.7/distutils/dist.py", line 966, in run_commands
Step #1 - "builder":     self.run_command(cmd)
Step #1 - "builder":   File "/opt/python3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
Step #1 - "builder":     cmd_obj.run()
Step #1 - "builder":   File "/env/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 259, in run
Step #1 - "builder":     self.run_command('install')
Step #1 - "builder":   File "/opt/python3.7/lib/python3.7/distutils/cmd.py", line 313, in run_command
Step #1 - "builder":     self.distribution.run_command(command)
Step #1 - "builder":   File "/opt/python3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
Step #1 - "builder":     cmd_obj.run()
Step #1 - "builder":   File "/tmp/pip-wheel-bw7rzw5o/rake-nltk/setup.py", line 36, in run
Step #1 - "builder":     self.execute(_post_install, [], msg="Running post installation tasks")
Step #1 - "builder":   File "/opt/python3.7/lib/python3.7/distutils/cmd.py", line 335, in execute
Step #1 - "builder":     util.execute(func, args, msg, dry_run=self.dry_run)
Step #1 - "builder":   File "/opt/python3.7/lib/python3.7/distutils/util.py", line 291, in execute
Step #1 - "builder":     func(*args)
Step #1 - "builder":   File "/tmp/pip-wheel-bw7rzw5o/rake-nltk/setup.py", line 17, in _post_install
Step #1 - "builder":     import nltk
Step #1 - "builder": ModuleNotFoundError: No module named 'nltk'
Step #1 - "builder": ----------------------------------------
Step #1 - "builder": ERROR: Failed building wheel for rake-nltk
Step #1 - "builder": ERROR: Failed to build one or more wheels
```
I have installed it using pip
Hi!
First of all, thanks for the implementation of RAKE!
I wanted to ask whether the punctuation must be passed as a string or as a list; the advanced-usage page (https://csurfer.github.io/rake-nltk/_build/html/advanced.html#to-provide-your-own-list-of-stop-words-and-punctuations) seems to show both.
All the best, thanks.
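Either form should behave the same in practice: the constructor folds the punctuations into a set via set(chain(stopwords, punctuations)), and iterating a string yields its characters just as iterating a list of single-character strings does. A quick check of that equivalence:

```python
import string
from itertools import chain

stop = ["and", "the"]

# Punctuation as one string: iteration yields individual characters.
as_string = set(chain(stop, string.punctuation))
# Punctuation as a list of characters.
as_list = set(chain(stop, list(string.punctuation)))

print(as_string == as_list)  # True
```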
Hello,
When trying to build a wheel for the rake-nltk package, it fails if the system does not have the nltk package installed explicitly.
```
pip wheel rake-nltk
...
  File "/tmp/pip-wheel-u4cxytr5/rake-nltk/setup.py", line 17, in _post_install
    import nltk
ModuleNotFoundError: No module named 'nltk'
----------------------------------------
Failed building wheel for rake-nltk
Running setup.py clean for rake-nltk
Failed to build rake-nltk
ERROR: Failed to build one or more wheels
```
Apparently, during the bdist or bdist_wheel commands, distutils also does a "fake" install, and apparently without installing the requirements first.
From docs: https://docs.python.org/3/distutils/builtdist.html
then the Distutils builds my module distribution (the Distutils itself in this case), does a “fake” installation
I am not familiar enough with distutils to know whether it is possible to distinguish such a "fake" install from a real install.
Our use case is to build wheels during the packaging process on a build server and then later use these wheels to speed up deployment. There is no reason to install Python packages on the build server itself, and in some cases it would be hard to identify the requirements beforehand, for example if some other package has rake-nltk as an install_requires entry in its setup.py.
This problem would not exist if rake-nltk were available as a wheel on PyPI (but then package installation would not automatically download "punkt" and "stopwords").
Does rake-nltk fully support Chinese in the newest version?
Thanks for your project; the only thing I'm missing is the max word limit. I had one project where I had to specify it but couldn't without overriding internals, so I used another open-source RAKE project.
Thanks.
not and other negation words are currently part of the stopword list and are thus not detected as part of the keyphrase. This in turn changes the context of the keyphrase.
For example: "This is not one of the best places I have been to"
keyphrase: best place
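A possible workaround (a sketch, using an illustrative subset of the stopword list): remove negation words from the stopword list before constructing Rake, so they can remain inside keyphrases.

```python
negations = {"not", "no", "nor"}

# Illustrative subset; in practice start from nltk.corpus.stopwords.words("english").
english_stopwords = ["this", "is", "not", "one", "of", "the", "i", "have", "been", "to"]

kept = [w for w in english_stopwords if w not in negations]
# Rake(stopwords=kept) would then treat "not" as a content word, so phrases
# like "not ... best places" can survive instead of losing the negation.
```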
I can't understand how to use this to perform keyword extraction directly from a PDF.
Hi, I am trying to extract key phrases from sentences and it works quite well. However, when trying to decompose this sentence:
S&P stocks are falling, whereas Google is struggling
the model splits the sentence into two clauses. However, in the first clause it adds a space before and after the &, like "S & P", which causes problems in the following step of my algorithm (entity recognition).
The code for initialization of Rake is the following:

```python
# Creating stopword list
coord_conj = [', and', ', or', ', but', ', nor', ', as', ', for', ', so', ', however,', '; ']
subord_conj = ['after', 'although', 'as', 'as if', 'as long as', 'as though', 'because', 'before', 'even if', 'even though', 'if', 'if only', 'in order that', 'now that', 'once', 'rather than', 'since', 'so that', 'though', 'till', 'unless', 'until', 'when', 'whenever', 'where', 'whereas', 'wherever', 'while', 'following', 'and the']
stopwords = ['and the', 'amid', 'under', 'but', 'where', 'itself', 'himself', ' nor', 'whom', 'once', 'before', 'these', 'most', 'just', "that'll", "it's", 'other', 'or', 'theirs', 'them', 'those', 'how', 'any', 'against', 'again', 'yourself', 'as', 'some', 'until', 'during', 'yourselves', 'ours', 'at', 'while', 'him', 'same', 'few']
stopwords = stopwords + coord_conj + subord_conj

capital_stopwords = []
for sw in stopwords:
    capital_stopwords.append(sw.capitalize())
stopwords = stopwords + capital_stopwords

r = Rake(stopwords=stopwords, punctuations='\\=_*^#@!~?><"‘', min_length=2, max_length=100)
r.extract_keywords_from_text(text)
return r.get_ranked_phrases()
```
Hi, I am a student studying RAKE for the first time. When I tried to extract key phrases, I could just use the commands in your "Usage" section without running rake.py, which confused me. So I want to ask: how can I run rake.py and extract key phrases? My question may be a little silly, but I am really interested in this implementation. Best wishes.
Hello,
I am getting the error message "error: package directory 'rake_nltk' does not exist" when installing rake-nltk with:

```
git clone https://github.com/csurfer/rake-nltk.git
python rake-nltk/setup.py install
```
I also tried pip install rake-nltk, but that installation also fails:
```
  File "/tmp/pip-build-2zTHYP/rake-nltk/setup.py", line 17, in _post_install
    import nltk
ImportError: No module named nltk
Failed building wheel for rake-nltk
Running setup.py clean for rake-nltk
Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /home/sylwia/.cache/pip/wheels/4b/c8/24/b2343664bcceb7147efeb21c0b23703a05b23fcfeaceaa2a1e
Successfully built nltk
Failed to build rake-nltk
Installing collected packages: six, singledispatch, nltk, rake-nltk
  Running setup.py install for rake-nltk ... done
Successfully installed nltk-3.4 rake-nltk-1.0.4 singledispatch-3.4.0.3 six-1.12.0
```
I would appreciate if you could help me out with that.
Kind regards
Sylwia
WSJ920211-0036.txt
I've tried the project with sample data; see one example attached.
The results are not near what one would expect of "keywords". Also shown are keywords selected manually by humans:
```
[tagsManual] => Array (
    [0] => gun control
    [1] => second amendment
    [2] => gun ownership
    [3] => national guard
    [4] => gun-control legislation
    [5] => american citizens
    [6] => undeniable right
)
[tagsExtractedRake] => Array (
    [0] => goldwin resident scholar american enterprise institute washington
    [1] => elbridge gerry opposed james madison
    [2] => current law requiring 18
    [3] => another letter writer thinks
    [4] => control legislation without exceeding
)
```
The code used is:

```python
r = Rake()
r.extract_keywords_from_text(text)
print(r.get_ranked_phrases()[:5])
```
Any thoughts?
Hi,
Here is my example:

```python
from rake_nltk import Rake

r = Rake()
r.extract_keywords_from_text("foo is a foo. but bar is not foo")
r.get_word_frequency_distribution()
```

Here is my output:
{'bar': 1, 'foo': 1}
Why is the frequency of "foo" 1?
Thanks in advance.
P.S. My English is not good. Sorry :(
About 2-3 years ago, control of the aneesha/RAKE implementation of RAKE was transferred over to fabianvf/python-rake (which I'm a maintainer of), and the aneesha/RAKE implementation stopped being maintained. Could you please change this in your readme?
I tried to specify the min_length and max_length for Rake, but it says TypeError: __init__() got an unexpected keyword argument 'min_lenght'
As shown in the attached pic, I am using get_ranked_phrases_with_scores, and I am getting:
[(9.0, 'issue fund act'),
(9.0, 'international financial position'),
(9.0, 'human resource development'),
(9.0, 'chambers office holder')]
I want to know which rows 'issue fund act' was returned from.
Thanks!
I really liked this implementation of the algorithm; however, I noticed one discrepancy.
The paper mentions (Section 1.2.3 - Adjoining Keywords) that adjoining keywords must occur at least twice in the same order for them to be considered the same phrase. The current implementation doesn't account for this, though!
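A standalone sketch of that adjoining-keywords rule (names are illustrative; this is not code from rake-nltk): scan each tokenized sentence for content-run / stopword-run / content-run patterns, and keep only candidates whose exact word sequence occurs at least twice.

```python
from collections import Counter
from itertools import groupby

def adjoining_keywords(token_sentences, stopwords, min_count=2):
    """Find candidate keywords containing interior stopwords, keeping only
    those whose exact word sequence occurs at least `min_count` times."""
    seen = Counter()
    for tokens in token_sentences:
        # Split each sentence into alternating stopword / non-stopword runs.
        runs = [(is_stop, list(g))
                for is_stop, g in groupby(tokens, lambda w: w in stopwords)]
        # A candidate is content-run + stopword-run + content-run.
        for i in range(len(runs) - 2):
            if not runs[i][0] and runs[i + 1][0] and not runs[i + 2][0]:
                candidate = tuple(runs[i][1] + runs[i + 1][1] + runs[i + 2][1])
                seen[candidate] += 1
    return [" ".join(c) for c, n in seen.items() if n >= min_count]
```

These candidates would then be scored alongside the regular stopword-free phrases, as the paper describes.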
We are using get_ranked_phrases_with_scores to get the key phrases and their ranking scores. For our examples the API returns 2750 key phrases, as we have a lengthy paragraph. In the output we get the ranking score as well. What is the meaning of the score, and can we ignore some key phrases depending on the score given? And if yes, what ranking score should we use as the cutoff?
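For context, the score has no absolute meaning: as in _build_ranklist above, each word is scored d(w)/f(w) by default and a phrase's score is the sum over its words, so longer phrases naturally score higher and any cutoff is relative to your own corpus. The computation, redone standalone on a toy phrase list:

```python
from collections import Counter, defaultdict
from itertools import chain, product

phrases = [("red", "apples"), ("good",)]

# f(w): how often each word appears across phrases.
freq = Counter(chain.from_iterable(phrases))
# d(w): co-occurrence degree of each word within its phrases.
degree = defaultdict(int)
for phrase in phrases:
    for word, _ in product(phrase, phrase):
        degree[word] += 1

# phrase score = sum of d(w)/f(w) over the phrase's words.
scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
print(scores)  # {'red apples': 4.0, 'good': 1.0}
```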
I pass the string into punctuations, but it doesn't work. As a result I have a lot of words with '.!@ and so on:

```python
r = Rake(language='russian', min_length=2, max_length=2, stopwords=stop, punctuations='–-—»!».!/.,|&($:;«»".).')
```
This implementation ignores phrases with multiple occurrences, for example:
text = 'Red apples, are good in flavour. Where are my red apples? Apples!'
According to the paper, we should get a list of phrases like:
['red apples', 'good', 'flavour', 'red apples', 'apples']
| word | good | flavour | apples | red |
| --- | --- | --- | --- | --- |
| degree | 1 | 1 | 5 | 4 |
| frequency | 1 | 1 | 3 | 2 |
| ratio | 1 | 1 | 1.67 | 2 |
So the correct ranked phrases should be:
(3.67, 'red apples')
(1.67, 'apples')
(1.0, 'good')
(1.0, 'flavour')
However, in the current implementation, the extracted phrase list is:
['red apples', 'good', 'flavour', 'apples']
Obviously, the second 'red apples' is ignored, so the ranked phrases have wrong scores:
(3.5, 'red apples')
(1.5, 'apples')
(1.0, 'good')
(1.0, 'flavour')
This bug could be fixed very easily: simply change the functions extract_keywords_from_sentences and _generate_phrases as shown below:
```python
def extract_keywords_from_sentences(self, sentences):
    """Method to extract keywords from the list of sentences provided.

    :param sentences: Text to extract keywords from, provided as a list
                      of strings, where each string is a sentence.
    """
    phrase_list = self._generate_phrases(sentences)
    self._build_frequency_dist(phrase_list)
    self._build_word_co_occurance_graph(phrase_list)
    self._build_ranklist(set(phrase_list))

def _generate_phrases(self, sentences):
    """Method to generate contender phrases given the sentences of the text
    document.

    :param sentences: List of strings where each string represents a
                      sentence which forms the text.
    :return: List of string tuples where each tuple is a collection
             of words forming a contender phrase.
    """
    phrase_list = []
    # Create contender phrases from sentences, keeping duplicates.
    for sentence in sentences:
        word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
        phrase_list += self._get_phrase_list_from_words(word_list)
    return phrase_list
```
If the text contains a domain name like www.google.com, then the parts of that name are extracted as words, e.g. the word "com".
I did all the setup as explained, but when I run the Python file nothing is printed to the terminal. What did I do wrong?
```python
from rake_nltk import Rake

# Uses stopwords for english from NLTK, and all punctuation characters by
# default.
r = Rake()

mytext = '''
Black-on-black ware is a 20th- and 21st-century pottery tradition developed by the Puebloan Native American ceramic artists in Northern New Mexico. Traditional reduction-fired blackware has been made for centuries by pueblo artists. Black-on-black ware of the past century is produced with a smooth surface, with the designs applied through selective burnishing or the application of refractory slip. Another style involves carving or incising designs and selectively polishing the raised areas. For generations several families from Kha'po Owingeh and P'ohwhóge Owingeh pueblos have been making black-on-black ware with the techniques passed down from matriarch potters. Artists from other pueblos have also produced black-on-black ware. Several contemporary artists have created works honoring the pottery of their ancestors.
'''

# Extraction given the text.
r.extract_keywords_from_text(mytext)

# Extraction given the list of strings where each string is a sentence.
# r.extract_keywords_from_sentences(<list of sentences>)

# To get keyword phrases ranked highest to lowest.
r.get_ranked_phrases()

# To get keyword phrases ranked highest to lowest with scores.
r.get_ranked_phrases_with_scores()
```
You are calculating the frequency distribution of words from the phrase list.
The phrase list is a set, so a word will be present just once in it. Using Counter(chain.from_iterable(phrase_list)) to find the frequency distribution only counts the words in that set, which I think is wrong.
Candidate keywords that occur many times in the text appear just once in the phrase list.
When I select some other language in Rake, it does not identify keyphrases; rather, it returns the whole sentence as a key phrase.
I have tried this simple snippet to test keyword extraction on longer text using NLTK:

```python
from nltk.corpus import gutenberg
from rake_nltk.rake import Rake

r = Rake()
r.extract_keywords_from_text(gutenberg.raw("austen-emma.txt"))
for phrase in r.get_ranked_phrases()[:5]:
    print(phrase)
```
The output I got is as follows:
.-- _she_ _felt_ _the_ _engagement_ _to_ _be_ _a_ _source_ _of_ _repentance_ _and_ _misery_ _to_ _each_
best time -- never tired -- every sort good -- hautboy infinitely superior --
said nothing worth hearing -- looked without seeing -- admired without intelligence -- listened without knowing
scarce -- chili preferred -- white wood finest flavour
without scruple -- without apology -- without much apparent diffidence
I assume this is caused by either the word tokenizer or the sentence tokenizer, but I'm not sure how to fix it.
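One pragmatic fix (a sketch, not part of rake-nltk): pre-clean Gutenberg-style markup, where _underscores_ mark emphasis and "--" stands in for dashes, before handing the text to Rake, since wordpunct_tokenize otherwise turns those runs into tokens that survive into the phrases.

```python
import re

def preclean(text):
    """Strip Gutenberg-style emphasis and dash markup before extraction."""
    text = text.replace("--", " ")   # '--' acts as a dash, not a word
    return re.sub(r"_", "", text)    # _word_ marks emphasis

print(preclean("_she_ _felt_ -- good"))
```

Running Rake on preclean(gutenberg.raw("austen-emma.txt")) should remove the underscore- and dash-laden phrases from the top results.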
Is it possible to get the number of occurrences of the extracted keywords in the text?
When I try it, I get 1.0 as the number of occurrences for everything.
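rake-nltk reports per-word frequencies rather than phrase occurrence counts, but a simple post-pass can recover them. A sketch (phrase_occurrences is illustrative, not library API):

```python
def phrase_occurrences(text, phrases):
    """Count naive (substring) occurrences of each extracted phrase."""
    lowered = text.lower()
    return {p: lowered.count(p) for p in phrases}

print(phrase_occurrences("Red apples. I like red apples.", ["red apples"]))
```

Substring counting is crude (it ignores word boundaries), but it is usually good enough for ranked multi-word phrases.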
When I run this:

```python
from rake_nltk import Rake

r = Rake()  # Uses stopwords for english from NLTK, and all punctuation characters.
text = '''Compatibility of systems of linear constraints over the set of
natural numbers. Criteria of compatibility of a system of linear
Diophantine equations, strict inequations, and nonstrict inequations are
considered. Upper bounds for components of a minimal set of solutions
and algorithms of construction of minimal generating sets of solutions
for all types of systems are given. These criteria and the corresponding
algorithms for constructing a minimal supporting set of solutions can be
used in solving all the considered types of systems and systems of mixed
types.'''
r.extract_keywords_from_text(text)
r.get_ranked_phrases()
```
I get this result:

```
Traceback (most recent call last):
  File "/Users/donya/PycharmProjects/test/test.py", line 20, in <module>
    r.extract_keywords_from_text(text)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rake_nltk/rake.py", line 66, in extract_keywords_from_text
    self.extract_keywords_from_sentences(sentences)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rake_nltk/rake.py", line 74, in extract_keywords_from_sentences
    phrase_list = self._generate_phrases(sentences)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rake_nltk/rake.py", line 172, in _generate_phrases
    word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py", line 126, in tokenize
    self._check_regexp()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py", line 121, in _check_regexp
    self._regexp = compile_regexp_to_noncapturing(self._pattern, self._flags)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/internals.py", line 56, in compile_regexp_to_noncapturing
    return sre_compile.compile(convert_regexp_to_noncapturing_parsed(sre_parse.parse(pattern)), flags=flags)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/internals.py", line 52, in convert_regexp_to_noncapturing_parsed
    parsed_pattern.pattern.groups = 1
AttributeError: can't set attribute
```
I am new to Python. Please help me.
Thanks.
According to the frequency-calculation function:

```python
def _build_frequency_dist(self, phrase_list):
    """Builds frequency distribution of the words in the given body of text.

    :param phrase_list: List of List of strings where each sublist is a
                        collection of words which form a contender phrase.
    """
    self.frequency_dist = Counter(chain.from_iterable(phrase_list))
```
Tracing back to the calculation of phrase_list:

```python
def _generate_phrases(self, sentences):
    """Method to generate contender phrases given the sentences of the text
    document.

    :param sentences: List of strings where each string represents a
                      sentence which forms the text.
    :return: Set of string tuples where each tuple is a collection
             of words forming a contender phrase.
    """
    phrase_list = set()
    # Create contender phrases from sentences.
    for sentence in sentences:
        word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
        phrase_list.update(self._get_phrase_list_from_words(word_list))
    return phrase_list
```
Clearly, phrase_list is a set and contains unique phrases. So if phrases repeat in a text, the repeats are ignored, and the frequency values, as I have tested, come out faulty.
I have modified the Rake() object to ensure the calculations are correct. @csurfer, kindly assign me this issue so I can create a pull request.