
rake-nltk's People

Contributors

cgratie, csurfer, dependabot[bot], letianfeng


rake-nltk's Issues

Support for more languages

Rake-nltk is using wordpunct_tokenize for tokenization, which is defined in nltk/tokenize/regexp.py as wordpunct_tokenize = WordPunctTokenizer().tokenize.

However, this tokenizer does not work for languages like Chinese and Japanese, which do not use whitespace as a delimiter. For example: '当地时间6月9日下午,伊利诺伊大学香槟分校**访问学者章莹颖外出办事途中失踪,当地警方及学生学者、华侨华人全力投入搜索,但章莹颖至今下落不明。章莹颖是在离开住处前往租房签约过程中,坐上一辆深色Saturn Astra汽车后失踪。总领馆接到消息后第一时间与当地警方、章莹颖的老师和同学以及当地**学生学者联谊会取得联系,了解案情进展,并与家属保持沟通,就家属申办来美签证提供协助。' (NLTK does ship a built-in StanfordSegmenter for Chinese segmentation.)

I think adding an optional keyword argument tokenizer to the Rake constructor for customization is the way to go (see the sketch below). What do you think? Discussion is welcome. Thanks.
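
A minimal sketch of what that customization could look like today, by subclassing Rake and overriding its private _generate_phrases method (the `tokenizer` keyword here is hypothetical, not part of the current API):

from nltk.tokenize import wordpunct_tokenize
from rake_nltk import Rake

class CustomTokenizerRake(Rake):
    """Rake variant that accepts a pluggable tokenizer (sketch only)."""

    def __init__(self, tokenizer=None, **kwargs):
        super().__init__(**kwargs)
        # Fall back to the current rake-nltk behaviour when none is given.
        self.tokenizer = tokenizer or wordpunct_tokenize

    def _generate_phrases(self, sentences):
        phrase_list = set()
        for sentence in sentences:
            word_list = [word.lower() for word in self.tokenizer(sentence)]
            phrase_list.update(self._get_phrase_list_from_words(word_list))
        return phrase_list

# e.g. a spaCy- or Stanford-based segmenter could then be passed in:
# r = CustomTokenizerRake(tokenizer=my_chinese_segmenter)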

Don't think frequency distribution is working

I'm not an expert in NLTK, but I tried following the algorithm and I don't understand how it can work.

It seems _build_frequency_dist is supposed to count the frequency of phrases. However, the phrase_list it receives is the one generated by _generate_phrases, which returns a set(), meaning every phrase can appear there at most once.

The generated Counter object counts every phrase as appearing once.

This doesn't make sense, does it?
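
A minimal illustration of the concern, using the same Counter-over-set construction as _build_frequency_dist:

from collections import Counter
from itertools import chain

phrases = set()
phrases.add(('red', 'apples'))
phrases.add(('red', 'apples'))  # the duplicate phrase is silently dropped
print(Counter(chain.from_iterable(phrases)))
# Counter({'red': 1, 'apples': 1}) -- the repeat never reaches the Counter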

spacy implementation with certain verb types removed

Thanks for sharing! Here's the rake.py file edited to use spaCy instead of NLTK. It removes certain verb types in _get_phrase_list_from_words, which I found improves results a bit (on a small sample size).

# -*- coding: utf-8 -*-
"""Implementation of the Rapid Automatic Keyword Extraction algorithm.

As described in the paper `Automatic keyword extraction from individual
documents` by Stuart Rose, Dave Engel, Nick Cramer and Wendy Cowley.

Adapted to use spaCy instead of NLTK.
"""

import string
from collections import Counter, defaultdict
from itertools import chain, groupby, product

from enum import Enum
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')

class Metric(Enum):
    """Different metrics that can be used for ranking."""

    DEGREE_TO_FREQUENCY_RATIO = 0  # Uses d(w)/f(w) as the metric
    WORD_DEGREE = 1  # Uses d(w) alone as the metric
    WORD_FREQUENCY = 2  # Uses f(w) alone as the metric

class Rake(object):
    """Rapid Automatic Keyword Extraction Algorithm."""

    def __init__(
        self,
        stopwords=None,
        punctuations=None,
        language="english",
        ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO,
        max_length=100000,
        min_length=1,
        verb_tags_to_rm=None,
    ):
        """Constructor.

        :param stopwords: List of words to be ignored for keyword extraction.
        :param punctuations: Punctuations to be ignored for keyword extraction.
        :param language: Language to be used for stopwords (kept for API
                         compatibility; this spaCy adaptation always uses the
                         English model loaded above).
        :param ranking_metric: Metric member used for ranking.
        :param max_length: Maximum limit on the number of words in a phrase
                           (inclusive; defaults to 100000).
        :param min_length: Minimum limit on the number of words in a phrase
                           (inclusive; defaults to 1).
        :param verb_tags_to_rm: Penn Treebank verb tags whose tokens act as
                                phrase breaks.
        """
        # By default use degree to frequency ratio as the metric.
        if isinstance(ranking_metric, Metric):
            self.metric = ranking_metric
        else:
            self.metric = Metric.DEGREE_TO_FREQUENCY_RATIO

        # If stopwords are not provided we use spaCy's English stopwords.
        self.stopwords = stopwords
        if self.stopwords is None:
            self.stopwords = list(STOP_WORDS)

        # If punctuations are not provided we ignore all punctuation symbols.
        self.punctuations = punctuations
        if self.punctuations is None:
            self.punctuations = string.punctuation

        # Penn Treebank verb tags:
        #   RM:   VB   VerbForm=inf                          verb, base form
        #   RM:   VBD  VerbForm=fin Tense=past               verb, past tense
        #   KEEP: VBG  VerbForm=part Tense=pres Aspect=prog  verb, gerund or present participle
        #   KEEP: VBN  VerbForm=part Tense=past Aspect=perf  verb, past participle
        #   RM:   VBP  VerbForm=fin Tense=pres               verb, non-3rd person singular present
        #   RM:   VBZ  VerbForm=fin Tense=pres Number=sing Person=3  verb, 3rd person singular present
        self.verb_tags_to_rm = verb_tags_to_rm
        if self.verb_tags_to_rm is None:
            self.verb_tags_to_rm = {'VB', 'VBD', 'VBP', 'VBZ'}

        # All things which act as phrase breaks during keyword extraction.
        self.to_ignore = set(chain(self.stopwords, self.punctuations))

        # Assign min and max phrase lengths to the attributes.
        self.min_length = min_length
        self.max_length = max_length

        # Stuff to be extracted from the provided text.
        self.frequency_dist = None
        self.degree = None
        self.rank_list = None
        self.ranked_phrases = None

    def extract_keywords_from_text(self, text):
        """Method to extract keywords from the text provided.

        :param text: Text to extract keywords from, provided as a string.
        """
        # Skip "sentences" that are bare sentence-final punctuation.
        sentences = [str(s) for s in nlp(text.lower()).sents if str(s) not in {'.', '!', '?'}]
        # sentences = nltk.tokenize.sent_tokenize(text)
        self.extract_keywords_from_sentences(sentences)

    def extract_keywords_from_sentences(self, sentences):
        """Method to extract keywords from the list of sentences provided.

        :param sentences: Text to extract keywords from, provided as a list
                          of strings, where each string is a sentence.
        """
        phrase_list = self._generate_phrases(sentences)
        self._build_frequency_dist(phrase_list)
        self._build_word_co_occurance_graph(phrase_list)
        self._build_ranklist(phrase_list)

    def get_ranked_phrases(self):
        """Method to fetch ranked keyword strings.

        :return: List of strings where each string represents an extracted
                 keyword string.
        """
        return self.ranked_phrases

    def get_ranked_phrases_with_scores(self):
        """Method to fetch ranked keyword strings along with their scores.

        :return: List of tuples where each tuple is formed of an extracted
                 keyword string and its score. Ex: (5.68, 'Four Scores')
        """
        return self.rank_list

    def get_word_frequency_distribution(self):
        """Method to fetch the word frequency distribution in the given text.

        :return: Dictionary (defaultdict) of the format `word -> frequency`.
        """
        return self.frequency_dist

    def get_word_degrees(self):
        """Method to fetch the degree of words in the given text. Degree can
        be defined as the sum of co-occurrences of the word with other words
        in the given text.

        :return: Dictionary (defaultdict) of the format `word -> degree`.
        """
        return self.degree

    def _build_frequency_dist(self, phrase_list):
        """Builds frequency distribution of the words in the given body of text.

        :param phrase_list: List of List of strings where each sublist is a
                            collection of words which form a contender phrase.
        """
        self.frequency_dist = Counter(chain.from_iterable(phrase_list))

    def _build_word_co_occurance_graph(self, phrase_list):
        """Builds the co-occurrence graph of words in the given body of text
        to compute the degree of each word.

        :param phrase_list: List of List of strings where each sublist is a
                            collection of words which form a contender phrase.
        """
        co_occurance_graph = defaultdict(lambda: defaultdict(lambda: 0))
        for phrase in phrase_list:
            # For each phrase in the phrase list, count co-occurrences of
            # each word with the other words in the phrase.
            #
            # Note: Keep the co-occurrence graph as is, to help facilitate
            # its use in other creative ways if required later.
            for (word, coword) in product(phrase, phrase):
                co_occurance_graph[word][coword] += 1
        self.degree = defaultdict(lambda: 0)
        for key in co_occurance_graph:
            self.degree[key] = sum(co_occurance_graph[key].values())

    def _build_ranklist(self, phrase_list):
        """Method to rank each contender phrase using the formula

              phrase_score = sum of scores of words in the phrase.
              word_score = d(w)/f(w) where d is degree and f is frequency.

        :param phrase_list: List of List of strings where each sublist is a
                            collection of words which form a contender phrase.
        """
        self.rank_list = []
        for phrase in phrase_list:
            rank = 0.0
            for word in phrase:
                if self.metric == Metric.DEGREE_TO_FREQUENCY_RATIO:
                    rank += 1.0 * self.degree[word] / self.frequency_dist[word]
                elif self.metric == Metric.WORD_DEGREE:
                    rank += 1.0 * self.degree[word]
                else:
                    rank += 1.0 * self.frequency_dist[word]
            self.rank_list.append((rank, " ".join(phrase)))
        self.rank_list.sort(reverse=True)
        self.ranked_phrases = [ph[1] for ph in self.rank_list]

    def _generate_phrases(self, sentences):
        """Method to generate contender phrases given the sentences of the
        text document.

        :param sentences: List of strings where each string represents a
                          sentence which forms the text.
        :return: Set of string tuples where each tuple is a collection
                 of words forming a contender phrase.
        """
        phrase_list = set()
        # Create contender phrases from sentences.
        for sentence in sentences:
            word_list, words_to_rm = [], set()
            for d in nlp(sentence):
                tok_str = str(d).lower()
                if tok_str not in {'.', '!', '?'}:
                    word_list.append(tok_str)
                if d.tag_ in self.verb_tags_to_rm:
                    words_to_rm.add(tok_str)
            # word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
            phrase_list.update(self._get_phrase_list_from_words(word_list, words_to_rm))
        return phrase_list

    def _get_phrase_list_from_words(self, word_list, words_to_rm):
        """Method to create contender phrases from the list of words that
        form a sentence, by dropping stopwords and punctuation and grouping
        the remaining words into phrases. Only phrases in the given length
        range (both limits inclusive) are used to build the co-occurrence
        matrix. Ex:

        Sentence: Red apples, are good in flavour.
        List of words: ['red', 'apples', ',', 'are', 'good', 'in', 'flavour']
        List after dropping punctuation and stopwords:
        ['red', 'apples', *, *, 'good', *, 'flavour']
        List of phrases: [('red', 'apples'), ('good',), ('flavour',)]
        List of phrases with a correct length:
        For the range [1, 2]: [('red', 'apples'), ('good',), ('flavour',)]
        For the range [1, 1]: [('good',), ('flavour',)]
        For the range [2, 2]: [('red', 'apples')]

        :param word_list: List of words which form a sentence when joined in
                          the same order.
        :param words_to_rm: Set of words (verbs with unwanted tags) which also
                            act as phrase breaks.
        :return: List of contender phrases that are formed after dropping
                 stopwords and punctuation.
        """
        # Would rather use the index of the word instead of words_to_rm here,
        # but can't figure it out.
        groups = groupby(word_list, lambda x: x not in self.to_ignore and x not in words_to_rm)
        phrases = [tuple(group[1]) for group in groups if group[0]]
        return list(
            filter(
                lambda x: self.min_length <= len(x) <= self.max_length, phrases
            )
        )

results are duplicating keywords with score 1 instead of


The list of results gives duplicated keywords with score 1. This happened after upgrading Anaconda, after which I had to reinstall rake-nltk.

Text to reproduce the error, which returns the keyword "solar" twice with score = 1:
"spectroscopy of the globular cluster dip source x 1746 371 ngc 6441. we propose a 50 xmm observation of the dipping xray source x 1747 371 located in the globular cluster ngc 6441. this source exhibits highly energy independent dips, consistent with an abundance >150 times less than solar, which repeat every 5.7 hours, whereas the overall cluster abundance is only a factor 4 to 10 below solar. resolving this discrepancy is the prime goal of this proposal. this study requires the high throughput, good spectral resolution, and continuous coverage afforded by xmm"

About Spanish

Hi, may I ask how to use rake for Spanish? Thanks.
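
rake-nltk delegates to NLTK's stopword corpus, which includes Spanish, so passing the language name at construction time should be all that is needed once the corpora are downloaded. A sketch (the Spanish sentence is just placeholder input):

import nltk
nltk.download('stopwords')
nltk.download('punkt')

from rake_nltk import Rake

r = Rake(language="spanish")
r.extract_keywords_from_text("Criterios de compatibilidad de un sistema de ecuaciones lineales.")
print(r.get_ranked_phrases())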

Feature suggestion -- alternative word scoring metrics

First, great package. Very easy to use!

May I kindly suggest/request a feature? In the RAKE paper, they say that there are multiple ways to compute word scores, each of which favor different kinds of key phrases:

  1. word frequency (freq(w))
  2. word degree (deg(w)),
  3. ratio of degree to frequency (deg(w)/freq(w))

The third is what they primarily use and is what your package uses. However, as Rose et al. note, this metric favors longer keyphrases. Indeed, on my dataset of news articles, I find that it produces very wonky, idiosyncratic phrases that don't, impressionistically, describe my articles well.

So it could be nice to allow something like the following:

r = Rake(metric='word_freq')

when initializing the Rake object.

I would help with this but I don't really know Python OOP...
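
This was eventually supported: recent rake-nltk versions expose a Metric enum, so the scoring metric can be chosen at construction time:

from rake_nltk import Rake, Metric

r = Rake(ranking_metric=Metric.WORD_FREQUENCY)   # f(w) only
# r = Rake(ranking_metric=Metric.WORD_DEGREE)    # d(w) only
# default is Metric.DEGREE_TO_FREQUENCY_RATIO    # d(w)/f(w)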

get no results

I use the simple code from the demo, like:

from rake_nltk import Rake

r = Rake()
mytext = "Hello World!"

r.extract_keywords_from_text(mytext)
r.get_ranked_phrases()

but no results were output. Is this a Python version issue? (I use 3.6.)
In [19]: runfile('C:/Users/yclin57/AI_summary_test.py', wdir='C:/Users/yclin57') In [20]:
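
Most likely nothing is wrong with the extraction itself: in a script run, the return value is simply discarded. Printing it should show the phrase:

from rake_nltk import Rake

r = Rake()
r.extract_keywords_from_text("Hello World!")
print(r.get_ranked_phrases())  # expected: ['hello world']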

Installation fails on Google App Engine

I am using the following requirements to install to Google App Engine:

nltk==3.4.5
rake-nltk==1.0.4

The deployment fails with the error messages below; I can, however, deploy nltk 3.4.5 without any issues. Only the post-installation tasks of rake-nltk seem to be the issue.

Step #1 - "builder": Copying rake_nltk.egg-info to build/bdist.linux-x86_64/wheel/rake_nltk-1.0.4-py3.7.egg-info
Step #1 - "builder": running install_scripts
Step #1 - "builder": Running post installation tasks
Step #1 - "builder": Traceback (most recent call last):
Step #1 - "builder": File "", line 1, in
Step #1 - "builder": File "/tmp/pip-wheel-bw7rzw5o/rake-nltk/setup.py", line 81, in
Step #1 - "builder": cmdclass={"develop": PostDevelop, "install": PostInstall},
Step #1 - "builder": File "/env/lib/python3.7/site-packages/setuptools/init.py", line 145, in setup
Step #1 - "builder": return distutils.core.setup(**attrs)
Step #1 - "builder": File "/opt/python3.7/lib/python3.7/distutils/core.py", line 148, in setup
Step #1 - "builder": dist.run_commands()
Step #1 - "builder": File "/opt/python3.7/lib/python3.7/distutils/dist.py", line 966, in run_commands
Step #1 - "builder": self.run_command(cmd)
Step #1 - "builder": File "/opt/python3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
Step #1 - "builder": cmd_obj.run()
Step #1 - "builder": File "/env/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 259, in run
Step #1 - "builder": self.run_command('install')
Step #1 - "builder": File "/opt/python3.7/lib/python3.7/distutils/cmd.py", line 313, in run_command
Step #1 - "builder": self.distribution.run_command(command)
Step #1 - "builder": File "/opt/python3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
Step #1 - "builder": cmd_obj.run()
Step #1 - "builder": File "/tmp/pip-wheel-bw7rzw5o/rake-nltk/setup.py", line 36, in run
Step #1 - "builder": self.execute(_post_install, [], msg="Running post installation tasks")
Step #1 - "builder": File "/opt/python3.7/lib/python3.7/distutils/cmd.py", line 335, in execute
Step #1 - "builder": util.execute(func, args, msg, dry_run=self.dry_run)
Step #1 - "builder": File "/opt/python3.7/lib/python3.7/distutils/util.py", line 291, in execute
Step #1 - "builder": func(*args)
Step #1 - "builder": File "/tmp/pip-wheel-bw7rzw5o/rake-nltk/setup.py", line 17, in _post_install
Step #1 - "builder": import nltk
Step #1 - "builder": ModuleNotFoundError: No module named 'nltk'
Step #1 - "builder": ----------------------------------------
Step #1 - "builder": ERROR: Failed building wheel for rake-nltk
Step #1 - "builder": ERROR: Failed to build one or more wheels

Algo returning empty list

Trying to run the algorithm on a text snippet, it returns an empty list. Not sure why. Can you please help?

Building wheel fails due to post-install hook

Hello,

When trying to build a wheel for the rake-nltk package, it fails if the system does not have the nltk package explicitly installed.

  1. create empty virtual environment & switch to it
  2. issue pip wheel command:
pip wheel rake-nltk
...
    File "/tmp/pip-wheel-u4cxytr5/rake-nltk/setup.py", line 17, in _post_install        
      import nltk                                                                       
  ModuleNotFoundError: No module named 'nltk'                                           
                                                                                        
  ----------------------------------------                                              
  Failed building wheel for rake-nltk                                                   
  Running setup.py clean for rake-nltk                                                  
Failed to build rake-nltk                                                               
ERROR: Failed to build one or more wheels                                               

Apparently during the bdist or bdist_wheel commands distutils also does a "fake" install, and apparently without installing requirements first.

From docs: https://docs.python.org/3/distutils/builtdist.html

then the Distutils builds my module distribution (the Distutils itself in this case), does a “fake” installation

I am not familiar enough with distutils to know whether it is possible to distinguish such a "fake" install from a real install.

Our use case is to build wheels during packaging on a build server and then later use these wheels to speed up deployment. There is no reason to install Python packages on the build server itself, and in some cases it would be hard to identify requirements beforehand, for example if some other package has rake-nltk as an install_requires entry in its setup.py.

This problem would not exist if rake-nltk were available as a wheel on PyPI (but then package installation would not automatically download "punkt" and "stopwords").
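
One way the post-install hook could guard against this, sketched here for illustration (this is not the package's actual code):

def _post_install():
    try:
        import nltk
    except ImportError:
        # Happens during `pip wheel` / bdist's "fake" install; the real
        # install will have nltk available via install_requires.
        return
    nltk.download("punkt")
    nltk.download("stopwords")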

Max words missing

Thanks for your project; the only thing I'm missing is a maximum word limit.
I had one project where I had to specify it but couldn't without overwriting internals,
so I used another open-source RAKE project.
Thanks.
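
For reference, current versions do expose this limit through the constructor, with no overwriting needed:

from rake_nltk import Rake

r = Rake(min_length=1, max_length=3)  # keep phrases of at most three words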

Losing the context of phrase

"not" and other negation words are currently part of the stopword list and are thus not detected as part of the keyphrase. This in turn changes the context of the keyphrase.
For example: "This is not one of the best places I have been to"
keyphrase: "best place"
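
One workaround until the defaults change: start from NLTK's English stopword list and put the negations back. A sketch:

from nltk.corpus import stopwords
from rake_nltk import Rake

negations = {"not", "no", "nor"}
custom = [w for w in stopwords.words("english") if w not in negations]

r = Rake(stopwords=custom)
r.extract_keywords_from_text("This is not one of the best places i have been to")
print(r.get_ranked_phrases())  # phrases can now contain "not"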

Pdf

I can't understand how to use this to perform keyword extraction directly from a PDF.
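
rake-nltk only consumes plain strings, so the PDF text has to be extracted first. A sketch using the third-party pypdf package (any PDF-to-text library would do; "document.pdf" is a placeholder):

from pypdf import PdfReader  # pip install pypdf
from rake_nltk import Rake

reader = PdfReader("document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

r = Rake()
r.extract_keywords_from_text(text)
print(r.get_ranked_phrases()[:10])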

Problems with spacing

Hi, I am trying to extract key phrases from sentences and it works quite well. However, when decomposing this sentence:
S&P stocks are falling, whereas Google is struggling
the model splits the sentence into two clauses. However, in the first clause it adds a space before and after the &, as in "S & P", which causes problems in the following step of my algorithm (entity recognition).
The code for the initialization of rake is the following:

#Creating stopword list
coord_conj=[', and', ', or', ', but', ', nor', ', as', ', for', ', so', ', however,', '; ']
subord_conj=[ 'after', 'although', 'as', 'as if', 'as long as', 'as though', 'because', 'before', 'even if', 'even though', 'if', 'if only', 'in order that', 'now that', 'once', 'rather than', 'since', 'so that', 'though', 'till', 'unless', 'until', 'when', 'whenever', 'where', 'whereas', 'wherever', 'while', 'following', 'and the']
stopwords =  ['and the','amid', 'under', 'but', 'where', 'itself', 'himself', ' nor', 'whom', 'once','before', 'these','most', 'just', "that'll", "it's", 'other', 'or', 'theirs','them',  'those','how', 'any', 'against', 'again', 'yourself', 'as', 'some', 'until', 'during', 'yourselves', 'ours', 'at', 'while', 'him', 'same','few']
stopwords= stopwords + coord_conj + subord_conj
capital_stopwords=[]
for sw in stopwords:
    capital_stopwords.append(sw.capitalize())
stopwords = stopwords + capital_stopwords
r = Rake(stopwords = stopwords, punctuations = '\=_*^#@!~?><"‘', min_length = 2, max_length = 100)  
r.extract_keywords_from_text(text)
return(r.get_ranked_phrases())
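
The extra spaces come from tokenization: rake-nltk splits words and punctuation apart with wordpunct_tokenize and later rejoins phrase tokens with single spaces, so "S&P" cannot survive the round trip:

from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("S&P stocks are falling"))
# ['S', '&', 'P', 'stocks', 'are', 'falling']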

What is the relationship between the usage and rake.py?

Hi~ I am a student studying rake for the first time. When I tried to extract key phrases, I could just use the commands in your "Usage" section without running rake.py. I am confused about that, so I want to ask: how do I run rake.py and extract key phrases? My question may be a little silly, but I am really interested in this implementation. Best wishes.

python rake-nltk/setup.py install error: package directory 'rake_nltk' does not exist

Hello,

I am getting the error message "error: package directory 'rake_nltk' does not exist" when installing rake-nltk with:
git clone https://github.com/csurfer/rake-nltk.git
python rake-nltk/setup.py install

I also tried pip install rake-nltk, but that installation also fails:

File "/tmp/pip-build-2zTHYP/rake-nltk/setup.py", line 17, in _post_install
import nltk
ImportError: No module named nltk


Failed building wheel for rake-nltk
Running setup.py clean for rake-nltk
Running setup.py bdist_wheel for nltk ... done
Stored in directory: /home/sylwia/.cache/pip/wheels/4b/c8/24/b2343664bcceb7147efeb21c0b23703a05b23fcfeaceaa2a1e
Successfully built nltk
Failed to build rake-nltk
Installing collected packages: six, singledispatch, nltk, rake-nltk
Running setup.py install for rake-nltk ... done
Successfully installed nltk-3.4 rake-nltk-1.0.4 singledispatch-3.4.0.3 six-1.12.0

I would appreciate it if you could help me out with that.

Kind regards
Sylwia

return long keyword chains and not short keywords

WSJ920211-0036.txt
I've tried the project with sample data; see one example attached.

The results are not near what one would expect of "keywords". Also shown are keywords selected manually by humans:

[tagsManual] => Array (
[0] => gun control
[1] => second amendment
[2] => gun ownership
[3] => national guard
[4] => gun-control legislation
[5] => american citizens
[6] => undeniable right
)

[tagsExtractedRake] => Array (
[0] => goldwin resident scholar american enterprise institute washington
[1] => elbridge gerry opposed james madison
[2] => current law requiring 18
[3] => another letter writer thinks
[4] => control legislation without exceeding
)

The code used is:

r = Rake()
r.extract_keywords_from_text(text)
print(r.get_ranked_phrases()[:5])

Any thoughts?
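
For what it's worth, RAKE's default degree-to-frequency metric rewards long word runs, so capping the phrase length (a constructor parameter in current versions) tends to bring results closer to human-style tags. A sketch using the attached sample:

from rake_nltk import Rake

text = open("WSJ920211-0036.txt").read()  # the sample file attached above

r = Rake(max_length=3)  # reject candidate phrases longer than three words
r.extract_keywords_from_text(text)
print(r.get_ranked_phrases()[:5])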

How does get_word_frequency_distribution() work?

Hi,
Here is my example

from rake_nltk import Rake
r = Rake()
r.extract_keywords_from_text("foo is a foo. but bar is not foo")
r.get_word_frequency_distribution()

Here is my output

{'bar': 1, 'foo': 1}

Why is the frequency of "foo" 1?
Thanks in advance.

PS: My English is not good. Sorry :(

Update readme to reflect successor to aneesha/RAKE

Two or three years ago, control of the aneesha/RAKE implementation of rake was transferred to fabianvf/python-rake (which I'm a maintainer of), and the aneesha/RAKE implementation stopped being maintained. Could you please reflect this in your readme?

How can I know which rows the extracted keywords were returned from?

As shown in the attached picture, I am using get_ranked_phrases_with_scores, and I am getting:
[(9.0, 'issue fund act'),
(9.0, 'international financial position'),
(9.0, 'human resource development'),
(9.0, 'chambers office holder')]

I want to know which rows 'issue fund act' was returned from.
Thanks!
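
rake-nltk itself does not record provenance, but one workaround is to extract row by row and keep your own mapping. A sketch, where `rows` stands in for your list of row strings:

from collections import defaultdict
from rake_nltk import Rake

rows = ["first row of text ...", "second row of text ..."]  # your data here

phrase_to_rows = defaultdict(list)
for i, row in enumerate(rows):
    r = Rake()
    r.extract_keywords_from_text(row)
    for phrase in r.get_ranked_phrases():
        phrase_to_rows[phrase].append(i)

print(phrase_to_rows.get("issue fund act"))  # row indices that produced it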


clarity regarding score returned

We are using get_ranked_phrases_with_scores to get the key phrases and their ranking scores. For our examples the API returns 2750 key phrases, as we have a lengthy paragraph. In the output we get the ranking score as well. What is the meaning of the score, and can we ignore some key phrases depending on it? If yes, below what ranking score should key phrases be ignored?
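
For context, a phrase's score is the sum of its member words' scores (degree divided by frequency, by default), so scores are only comparable within one document and there is no universal cut-off. A common pattern is to pick a threshold empirically and filter, as in this sketch (`long_paragraph` stands in for your input):

from rake_nltk import Rake

long_paragraph = "your lengthy paragraph goes here"

r = Rake()
r.extract_keywords_from_text(long_paragraph)

threshold = 5.0  # tune on your own data; scores are document-relative
keep = [(score, phrase)
        for score, phrase in r.get_ranked_phrases_with_scores()
        if score >= threshold]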

Punctuation doesn't work

I pass the string to punctuations, but it doesn't work. As a result I have a lot of words with '.!@ and so on:
r = Rake(language='russian', min_length=2, max_length=2, stopwords=stop, punctuations='–-—»!».!/.,|&($:;«»".).')

Bug - phrase_list shouldn't be a set at the beginning

This implementation ignores phrases with multiple occurrences, for example:
text = 'Red apples, are good in flavour. Where are my red apples? Apples!'

According to the paper, we should get a list of phrases and their weights like:
['red apples', 'good', 'flavour', 'red apples', 'apples']

word        good   flavour   apples   red
degree      1      1         5        4
frequency   1      1         3        2
ratio       1      1         1.67     2

So the correct ranked phrases should be:

(3.67, 'red apples')
(1.67, 'apples')
(1.0, 'good')
(1.0, 'flavour')

However, in the current implementation, the extracted phrase list is:
['red apples', 'good', 'flavour', 'apples']

Obviously, the second 'red apples' is ignored, so the ranked phrases have wrong scores:

(3.5, 'red apples')
(1.5, 'apples')
(1.0, 'good')
(1.0, 'flavour')

This bug can be fixed very easily: simply change the functions extract_keywords_from_sentences and _generate_phrases as shown below:

    def extract_keywords_from_sentences(self, sentences):
        """Method to extract keywords from the list of sentences provided.

        :param sentences: Text to extraxt keywords from, provided as a list
                          of strings, where each string is a sentence.
        """
        phrase_list = self._generate_phrases(sentences)
        self._build_frequency_dist(phrase_list)
        self._build_word_co_occurance_graph(phrase_list)
        self._build_ranklist(set(phrase_list))

    def _generate_phrases(self, sentences):
        """Method to generate contender phrases given the sentences of the text
        document.

        :param sentences: List of strings where each string represents a
                          sentence which forms the text.
        :return: Set of string tuples where each tuple is a collection
                 of words forming a contender phrase.
        """
        phrase_list = []
        # Create contender phrases from sentences.
        for sentence in sentences:
            word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
            phrase_list += self._get_phrase_list_from_words(word_list)
        return phrase_list
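
With the patched methods applied, re-running the example should reproduce the numbers above (scores shown rounded):

from rake_nltk import Rake

r = Rake()  # assumes the patched methods above are in place
r.extract_keywords_from_text('Red apples, are good in flavour. '
                             'Where are my red apples? Apples!')
print(r.get_ranked_phrases_with_scores())
# roughly: [(3.67, 'red apples'), (1.67, 'apples'), (1.0, 'good'), (1.0, 'flavour')]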

Run the demo and nothing printed

I did all the setup as explained, but when I run the Python file nothing is printed to the terminal. What did I do wrong?

from rake_nltk import Rake

# Uses stopwords for english from NLTK, and all punctuation characters by
# default
r = Rake()

mytext = '''
Black-on-black ware is a 20th- and 21st-century pottery tradition developed by the Puebloan Native American ceramic artists in Northern New Mexico. Traditional reduction-fired blackware has been made for centuries by pueblo artists. Black-on-black ware of the past century is produced with a smooth surface, with the designs applied through selective burnishing or the application of refractory slip. Another style involves carving or incising designs and selectively polishing the raised areas. For generations several families from Kha'po Owingeh and P'ohwhóge Owingeh pueblos have been making black-on-black ware with the techniques passed down from matriarch potters. Artists from other pueblos have also produced black-on-black ware. Several contemporary artists have created works honoring the pottery of their ancestors.
'''

# Extraction given the text.
r.extract_keywords_from_text(mytext)

# Extraction given the list of strings where each string is a sentence.
#r.extract_keywords_from_sentences(<list of sentences>)

# To get keyword phrases ranked highest to lowest.
r.get_ranked_phrases()

# To get keyword phrases ranked highest to lowest with scores.
r.get_ranked_phrases_with_scores()
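
Nothing here prints: the two getter calls return lists, which a plain script run silently discards. Wrapping them in print() makes the output visible:

# Appended to the script above:
print(r.get_ranked_phrases())
print(r.get_ranked_phrases_with_scores())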

Word Frequency calculation in a phrase list might be wrong

You are calculating the frequency distribution of words from the phrase list.
The phrase list is a set, so a word will be present just once in it. Using Counter(chain.from_iterable(phrase_list)) to find the frequency distribution only counts the words in that set, which I think is wrong.
Candidate keywords which occur many times in the text will be there just once in the phrase list.

Language selection

When I select some other language in rake, it does not identify keyphrases; rather, it throws out the whole sentence as a keyphrase.

Weird results when running with text from Gutenberg corpus

I have tried this simple snippet to test keyword extraction on longer text using NLTK.

from nltk.corpus import gutenberg
from rake_nltk.rake import Rake
r = Rake()
r.extract_keywords_from_text(gutenberg.raw("austen-emma.txt"))
for phrase in r.get_ranked_phrases()[:5]:
    print(phrase)

The output I got is as follows:

.-- _she_ _felt_ _the_ _engagement_ _to_ _be_ _a_ _source_ _of_ _repentance_ _and_ _misery_ _to_ _each_
best time -- never tired -- every sort good -- hautboy infinitely superior --
said nothing worth hearing -- looked without seeing -- admired without intelligence -- listened without knowing
scarce -- chili preferred -- white wood finest flavour
without scruple -- without apology -- without much apparent diffidence

I assume this is caused by either the word tokenizer or the sentence tokenizer, but I'm not sure how to fix it.
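
The raw Gutenberg text is full of typographic markup ("_emphasis_" underscores and "--" dashes) that wordpunct_tokenize keeps as tokens, and multi-character tokens like '--' are not in the single-character punctuation ignore set. Stripping the markup first helps; a sketch:

import re

from nltk.corpus import gutenberg
from rake_nltk.rake import Rake

raw = gutenberg.raw("austen-emma.txt")
cleaned = re.sub(r"_+|--+", " ", raw)  # drop emphasis markers and dashes

r = Rake()
r.extract_keywords_from_text(cleaned)
for phrase in r.get_ranked_phrases()[:5]:
    print(phrase)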

can't run the example

when I run this:
from rake_nltk import Rake

r = Rake() # Uses stopwords for english from NLTK, and all punctuation characters.

text = '''Compatibility of systems of linear constraints over the set of
natural numbers. Criteria of compatibility of a system of linear
Diophantine equations, strict inequations, and nonstrict inequations are
considered. Upper bounds for components of a minimal set of solutions
and algorithms of construction of minimal generating sets of solutions
for all types of systems are given. These criteria and the corresponding
algorithms for constructing a minimal supporting set of solutions can be
used in solving all the considered types of systems and systems of mixed
types.'''
r.extract_keywords_from_text(text)
r.get_ranked_phrases()

I get this result:
Traceback (most recent call last):
  File "/Users/donya/PycharmProjects/test/test.py", line 20, in <module>
    r.extract_keywords_from_text(text)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rake_nltk/rake.py", line 66, in extract_keywords_from_text
    self.extract_keywords_from_sentences(sentences)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rake_nltk/rake.py", line 74, in extract_keywords_from_sentences
    phrase_list = self._generate_phrases(sentences)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rake_nltk/rake.py", line 172, in _generate_phrases
    word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py", line 126, in tokenize
    self._check_regexp()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py", line 121, in _check_regexp
    self._regexp = compile_regexp_to_noncapturing(self._pattern, self._flags)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/internals.py", line 56, in compile_regexp_to_noncapturing
    return sre_compile.compile(convert_regexp_to_noncapturing_parsed(sre_parse.parse(pattern)), flags=flags)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/internals.py", line 52, in convert_regexp_to_noncapturing_parsed
    parsed_pattern.pattern.groups = 1
AttributeError: can't set attribute
I am new to Python. Please help me.
Thanks

Word frequency calculation is wrong

According to the frequency calculation function:

def _build_frequency_dist(self, phrase_list):
    """Builds frequency distribution of the words in the given body of text.

    :param phrase_list: List of List of strings where each sublist is a
                        collection of words which form a contender phrase.
    """
    self.frequency_dist = Counter(chain.from_iterable(phrase_list))

Tracing back to the construction of phrase_list:

def _generate_phrases(self, sentences):
    """Method to generate contender phrases given the sentences of the text
    document.

    :param sentences: List of strings where each string represents a
                      sentence which forms the text.
    :return: Set of string tuples where each tuple is a collection
             of words forming a contender phrase.
    """
    phrase_list = set()
    # Create contender phrases from sentences.
    for sentence in sentences:
        word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
        phrase_list.update(self._get_phrase_list_from_words(word_list))
    return phrase_list

Clearly, phrase_list is a set and contains unique phrases. So if keywords repeat in a text, the repeats are ignored, and the frequency values, as tested by me, come out faulty.

I have modified the Rake() object to ensure the calculations are correct. @csurfer, kindly assign me this issue so I can create a pull request.
