indigoresearch / textteaser Goto Github PK

View Code? Open in Web Editor NEW

623.0 42.0 145.0 181 KB

Official version of TextTeaser.

License: MIT License

Python 100.00%

textteaser's Introduction

TextTeaser

TextTeaser is an automatic summarization algorithm.

This is now the official version of TextTeaser. Future developments of TextTeaser will be in this repository.

The original Scala TextTeaser can still be accessed here.

Installation

>>> git clone https://github.com/IndigoResearch/textteaser.git
>>> pip install -r textteaser/requirements.txt

How to Use

>>> from textteaser import TextTeaser
>>> tt = TextTeaser()
>>> tt.summarize(title, text)

You can also test TextTeaser by running python test.py.

textteaser's People

Contributors

Stargazers

Watchers

Forkers

bhutley kai33 yyi-99plus chilitechno djadmin shanky-259 vivalapanda bolajav samim23 why-not-sky jonnoftw ciriarte austinlostinboston jimmy0000 billsmith2 laskarcyber xuanhan863 xsongx ganapativs onstash hxcomet rcelha citrusit roynos michigan-com chethan123 sandeepan stevenlol datnamer alhy3410 piratos qhduan raeed20 brianpetro fatmas1982 ashim888 kngu9 quanjin9099 drawersapp blinkystitt automotola mskylsjwg akhiltas danielhones garysieling ken4scholars plean alonle agentflare colinsongf s-ai shiva96b andhdo jaspreet10 mickelfeng willwil datagold2017 chingan90 ramaswamym1987 hanmayz rikahoa xueyouluo zhangzhaocs mystyc runrunliuliu tongzhenguo text-summarization jx57 praveenmunagapati ghiblifield cooravi meizh kev216 richardliaw olivervin freewing-jp ianb abiraja2004 thuzarwin dr-data schmamps theblackcat102 joanzhou fabiant7t zouchl vinceblot yuanjie-ai liujx42 xiaoxialei mkscala piermarcobarbe hfq-guorui summba-nlp wjianxz yazici myailab fastcode3d zlhcsm zhongyunuestc shihuaxing

textteaser's Issues

cannot run the example code..

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "textteaser\__init__.py", line 11, in summarize
    result = self.summarizer.summarize(text, title, source, category)
  File "textteaser\summarizer.py", line 11, in summarize
    sentences = self.parser.splitSentences(text)
  File "textteaser\parser.py", line 61, in splitSentences
    tokenizer = nltk.data.load('file:' + os.path.dirname(os.path.abspath(__file
_)) + '/trainer/english.pickle')
  File "c:\Python27\lib\site-packages\nltk\data.py", line 786, in load
    resource_val = pickle.load(opened_resource)
ImportError: No module named copy_reg

then

C:\Users\user>pip install copy_reg
Downloading/unpacking copy-reg
  Could not find any downloads that satisfy the requirement copy-reg
Cleaning up...
No distributions at all found for copy-reg
Storing debug log for failure in C:\Users\user\pip\pip.log

Can this project support for Chinese?

I want to use this project to process Chinese.

Is there a bias of presenting sentences from the end of the article?

If I use the Chrome app, the sentences I get (when the number of sentences slider is at its default) seem to all come from the end of the article.

Here are two examples:

url: http://www.lrb.co.uk/v38/n08/john-lanchester/when-bitcoin-grows-up
summary:

It’s time for the cryptocurrency to decide what it wants to be when it grows up.
Blockchains could become merely a new technique to ensure the continuation of banking hegemony in its current form.
That would be one of those final plot twists which leaves everybody thinking that although they enjoyed most of the show, the ending was so disappointing they now wish they hadn’t bothered.
Or, along with peer-to-peer lending and mobile payments, they could have an impact as great as the new kind of banking introduced in Renaissance Italy.
That would be more fun.

url: https://en.wikipedia.org/wiki/Automatic_summarization#Current_challenges_in_evaluating_summaries_automatically
summary:

Furthermore, for some methods, not only do we need to have human-made summaries available for comparison, but also manual annotation has to be performed in some of them (e.g.
SCU in the Pyramid Method).
In any case, what the evaluation methods need as an input, is a set of summaries to serve as gold standards and a set of automatic summaries.
Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.
To overcome these problems, we think that the quantitative evaluation might not be the only way to evaluate summaries, and a qualitative automatic evaluation would be also important.

In both of the situations, the summaries seem to be generated from sentences at the end. Do you think this has something to do with the Chrome Extension and not your code base or the other way around?

ImportError: No module named copy_reg

Issues with pickle?
Windows 10 python 2.7.6,

C:\Users\valentin\textteaser_nltk\Scripts\python.exe D:/dev_earthcube/textteaser/test.py
Traceback (most recent call last):
File "D:/dev_earthcube/textteaser/test.py", line 12, in
sentences = tt.summarize(title, text)
File "D:\dev_earthcube\textteaser\textteaser__init__.py", line 13, in summarize
result = self.summarizer.summarize(text, title, source, category)
File "D:\dev_earthcube\textteaser\textteaser\summarizer.py", line 11, in summarize
sentences = self.parser.splitSentences(text)
File "D:\dev_earthcube\textteaser\textteaser\parser.py", line 61, in splitSentences
tokenizer = nltk.data.load('file:' + os.path.dirname(os.path.abspath(file)) + '/trainer/english.pickle')
File "C:\Users\valentin\textteaser_nltk\lib\site-packages\nltk\data.py", line 786, in load
resource_val = pickle.load(opened_resource)
ImportError: No module named copy_reg

Indentation oversight in summarizer.py

Hi, I believe there is a bug in sbs function at summarizer.py. In the for loop at line 73, the code loops through words in sentences but these if statements are outside of the loop:

if word in keywordList:
        index = keywordList.index(word)

if index > -1:
        score += topKeywords[index]['totalScore']

Therefore, the scores are only calculated for the last words of each sentence.

Inconsistent keyword list in Parser.getKeywords() causes inconsistent scoring

I expect that the issues and PRs I've posted here are going to my own fork and no further, but if anyone out there is reading, your feedback is welcome.

In Parser.getKeywords():

# ...
uniqueWords = list(set(words))

keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords] 
keywords = sorted(keywords, key=lambda x: -x['count'])

# ... returns: ...
keywords[0]  = {'word': 'foo', 'count': 100}  # rank: 1st (most common keyword)
keywords[1]  = {'word': 'eggs', 'count': 25}  # rank: 9th
keywords[2]  = {'word': 'bacon', 'count': 32} # rank: 8th
# ...
keywords[10] = {'word': 'spam', 'count': 25}  # rank: 9th
keywords[11] = {'word': 'bar', 'count': 19}   # rank: 13th
keywords[12] = {'word': 'ham', 'count': 25}   # rank: 9th
# ...

set() does not return an ordered enumerable, so uniqueWords is an unordered list. Iterating it to build keywords means this list is also unordered, so the return order of sorted() on the count key alone is inconsistent.

Down in Summarizer.summarize() it means sentence scores are also inconsistent:

# ... 
(keywords, wordCount) = self.parser.getKeywords(text)

topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)

# ... iterating sentences ...
sbsFeature = self.sbs(words, topKeywords, keywordList)
dbsFeature = self.dbs(words, topKeywords, keywordList)

# ... calculate sentence score based on these features ...
# ...

For consistency, when the summarize method calls getTopKeywords(), should it still pass a fixed list of ten keywords that's been sorted with more tiebreakers…

# replace: keywords = sorted(keywords, key=lambda x: -x['count'])
keywords = sorted(keywords, key=lambda x: (-x['count'], -len(x['word'], x['word']))

…every keyword that ranks 10th or better, i.e. this comprehension…

topKeywordSlice = [kw for kw in keywords if kw['count'] >= keywords[9]['count']]

… or maybe both?

Computationally, the advanced sort seems unnecessarily expensive, but I don't know if there's a rationale for exactly ten top keywords. What's the best way to make this work?

Words not delimited by newline in Parser.getKeywords()

Source text (snippet):

Nay, never play the brave man, else when you go back home, your own mother
won't know you. But, dear friends and allies, first let us lay our burdens down;

I'm expecting [...{word: 'mother', count: 1}, {word: 'wont', count: 1}...] but get [...{word: 'motherwont', count: 1}...] instead.

There are possible side effects to fixing the culprit, but Parser.removePunctations() should filter on t.isAlnum() or t.isSpace(), not t.isAlnum() or t == ' '.

edit: having problems with Markdown

from textteaser import TextTeaser
title = "Limitations of the GET method in HTTP"
text = "We spend a lot of time thinking about web API design, and we learn a lot from other APIs and discussion with their authors. In the hopes that it helps others, we want to share some thoughts of our own. In this post, we’ll discuss the limitations of the HTTP GET method and what we decided to do about it in our own API. As a rule, HTTP GET requests should not modify server state. "
tt = TextTeaser()
TypeError: init() takes exactly 2 arguments (1 given)

TypeError Traceback (most recent call last)
in ()
----> 1 tt = TextTeaser()
TypeError: init() takes exactly 2 arguments (1 given)

ModuleNotFoundException _version

When I run textteaser inside my code in Python 3.7.3, I get the following:

File "/Users/gregory/.pyenv/versions/3.7.3/lib/python3.7/site-packages/textteaser/init.py", line 25, in
from _version import version
ModuleNotFoundError: No module named '_version'