
textteaser's Introduction

TextTeaser

TextTeaser is an automatic summarization algorithm.

This is now the official version of TextTeaser. Future developments of TextTeaser will be in this repository.

The original Scala TextTeaser can still be accessed here.

Installation

$ git clone https://github.com/IndigoResearch/textteaser.git
$ pip install -r textteaser/requirements.txt

How to Use

>>> from textteaser import TextTeaser
>>> tt = TextTeaser()
>>> tt.summarize(title, text)

You can also test TextTeaser by running python test.py.

textteaser's People

Contributors

akizarojo, plean


textteaser's Issues

Cannot run the example code

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "textteaser\__init__.py", line 11, in summarize
    result = self.summarizer.summarize(text, title, source, category)
  File "textteaser\summarizer.py", line 11, in summarize
    sentences = self.parser.splitSentences(text)
  File "textteaser\parser.py", line 61, in splitSentences
    tokenizer = nltk.data.load('file:' + os.path.dirname(os.path.abspath(__file__)) + '/trainer/english.pickle')
  File "c:\Python27\lib\site-packages\nltk\data.py", line 786, in load
    resource_val = pickle.load(opened_resource)
ImportError: No module named copy_reg

then

C:\Users\user>pip install copy_reg
Downloading/unpacking copy-reg
  Could not find any downloads that satisfy the requirement copy-reg
Cleaning up...
No distributions at all found for copy-reg
Storing debug log for failure in C:\Users\user\pip\pip.log

Is there a bias of presenting sentences from the end of the article?

If I use the Chrome app, the sentences I get (when the number of sentences slider is at its default) seem to all come from the end of the article.

Here are two examples:

url: http://www.lrb.co.uk/v38/n08/john-lanchester/when-bitcoin-grows-up
summary:

It’s time for the cryptocurrency to decide what it wants to be when it grows up.
Blockchains could become merely a new technique to ensure the continuation of banking hegemony in its current form.
That would be one of those final plot twists which leaves everybody thinking that although they enjoyed most of the show, the ending was so disappointing they now wish they hadn’t bothered.
Or, along with peer-to-peer lending and mobile payments, they could have an impact as great as the new kind of banking introduced in Renaissance Italy.
That would be more fun.

url: https://en.wikipedia.org/wiki/Automatic_summarization#Current_challenges_in_evaluating_summaries_automatically
summary:

Furthermore, for some methods, not only do we need to have human-made summaries available for comparison, but also manual annotation has to be performed in some of them (e.g.
SCU in the Pyramid Method).
In any case, what the evaluation methods need as an input, is a set of summaries to serve as gold standards and a set of automatic summaries.
Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.
To overcome these problems, we think that the quantitative evaluation might not be the only way to evaluate summaries, and a qualitative automatic evaluation would be also important.

In both of the situations, the summaries seem to be generated from sentences at the end. Do you think this has something to do with the Chrome Extension and not your code base or the other way around?

ImportError: No module named copy_reg

Issues with pickle? Windows 10, Python 2.7.6.

C:\Users\valentin\textteaser_nltk\Scripts\python.exe D:/dev_earthcube/textteaser/test.py
Traceback (most recent call last):
  File "D:/dev_earthcube/textteaser/test.py", line 12, in <module>
    sentences = tt.summarize(title, text)
  File "D:\dev_earthcube\textteaser\textteaser\__init__.py", line 13, in summarize
    result = self.summarizer.summarize(text, title, source, category)
  File "D:\dev_earthcube\textteaser\textteaser\summarizer.py", line 11, in summarize
    sentences = self.parser.splitSentences(text)
  File "D:\dev_earthcube\textteaser\textteaser\parser.py", line 61, in splitSentences
    tokenizer = nltk.data.load('file:' + os.path.dirname(os.path.abspath(__file__)) + '/trainer/english.pickle')
  File "C:\Users\valentin\textteaser_nltk\lib\site-packages\nltk\data.py", line 786, in load
    resource_val = pickle.load(opened_resource)
ImportError: No module named copy_reg

Indentation oversight in summarizer.py

Hi, I believe there is a bug in the sbs function in summarizer.py. The for loop at line 73 iterates over the words in each sentence, but these if statements sit outside the loop:

if word in keywordList:
        index = keywordList.index(word)

if index > -1:
        score += topKeywords[index]['totalScore']

Therefore, the score is only calculated for the last word of each sentence.
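With the two ifs dedented into the loop body, every word contributes to the score. A self-contained sketch of the corrected loop (the sample data and the final averaging step are made up for illustration; only the loop structure mirrors summarizer.py):

```python
def sbs(words, topKeywords, keywordList):
    """Average top-keyword score over a sentence's words."""
    score = 0.0
    if len(words) == 0:
        return 0.0
    for word in words:
        index = -1
        if word in keywordList:          # both checks now run for EVERY word,
            index = keywordList.index(word)
        if index > -1:                   # not just the last one
            score += topKeywords[index]['totalScore']
    return score / len(words)            # simplified normalization

# Hypothetical data: 'spam' and 'eggs' are top keywords.
keywordList = ['spam', 'eggs']
topKeywords = [{'totalScore': 2.0}, {'totalScore': 1.0}]
print(sbs(['spam', 'and', 'eggs'], topKeywords, keywordList))  # 1.0
```

With the original indentation, only 'eggs' (the last word) would have been scored, giving 1/3 instead of 1.0.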

Inconsistent keyword list in Parser.getKeywords() causes inconsistent scoring

I expect that the issues and PRs I've posted here are going to my own fork and no further, but if anyone out there is reading, your feedback is welcome.

In Parser.getKeywords():

# ...
uniqueWords = list(set(words))

keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords] 
keywords = sorted(keywords, key=lambda x: -x['count'])

# ... returns: ...
keywords[0]  = {'word': 'foo', 'count': 100}  # rank: 1st (most common keyword)
keywords[1]  = {'word': 'eggs', 'count': 25}  # rank: 9th
keywords[2]  = {'word': 'bacon', 'count': 32} # rank: 8th
# ...
keywords[10] = {'word': 'spam', 'count': 25}  # rank: 9th
keywords[11] = {'word': 'bar', 'count': 19}   # rank: 13th
keywords[12] = {'word': 'ham', 'count': 25}   # rank: 9th
# ...

set() does not return an ordered enumerable, so uniqueWords is an unordered list. Iterating it to build keywords means this list is also unordered, so the return order of sorted() on the count key alone is inconsistent.

Down in Summarizer.summarize() it means sentence scores are also inconsistent:

# ... 
(keywords, wordCount) = self.parser.getKeywords(text)

topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)

# ... iterating sentences ...
sbsFeature = self.sbs(words, topKeywords, keywordList)
dbsFeature = self.dbs(words, topKeywords, keywordList)

# ... calculate sentence score based on these features ...
# ... 

For consistency, when the summarize method calls getTopKeywords(), should it still pass a fixed list of ten keywords that's been sorted with more tiebreakers…

# replace: keywords = sorted(keywords, key=lambda x: -x['count'])
keywords = sorted(keywords, key=lambda x: (-x['count'], -len(x['word']), x['word']))

…every keyword that ranks 10th or better, i.e. this comprehension…

topKeywordSlice = [kw for kw in keywords if kw['count'] >= keywords[9]['count']]

… or maybe both?

Computationally, the advanced sort seems unnecessarily expensive, but I don't know if there's a rationale for exactly ten top keywords. What's the best way to make this work?
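To make the tiebreak option concrete (the tiebreak keys here are one possible choice, not something the repo prescribes): sorting on the tuple (-count, -len(word), word) makes the order of equal-count keywords deterministic regardless of set() iteration order.

```python
words = ['foo'] * 100 + ['spam'] * 25 + ['eggs'] * 25 + ['ham'] * 25

uniqueWords = list(set(words))  # unordered; can vary between runs
keywords = [{'word': w, 'count': words.count(w)} for w in uniqueWords]

# Deterministic: count descending, then length descending, then alphabetical.
keywords = sorted(keywords, key=lambda x: (-x['count'], -len(x['word']), x['word']))

print([k['word'] for k in keywords])  # ['foo', 'eggs', 'spam', 'ham'] every run
```

The count-only sort would place the three 25-count words in whatever order the set happened to yield them, which is exactly the instability described above.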

Words not delimited by newline in Parser.getKeywords()

Source text (snippet):

Nay, never play the brave man, else when you go back home, your own mother
won't know you. But, dear friends and allies, first let us lay our burdens down; 

I'm expecting [...{word: 'mother', count: 1}, {word: 'wont', count: 1}...] but get [...{word: 'motherwont', count: 1}...] instead.

There are possible side effects to fixing the culprit, but Parser.removePunctations() should filter on t.isalnum() or t.isspace(), not t.isalnum() or t == ' '.
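A sketch of the suggested fix (the function name mirrors the repo's removePunctations; the body is illustrative, not the repo's exact code):

```python
def removePunctations(text):
    # Keep letters, digits, and ALL whitespace (including '\n'), so words
    # separated only by a newline are no longer glued together.
    return ''.join(t for t in text if t.isalnum() or t.isspace())

print(removePunctations("mother\nwon't"))  # "mother\nwont", not "motherwont"
```

Checking t == ' ' alone drops newlines and tabs along with the punctuation, which is how "mother" and "wont" end up fused.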

edit: having problems with Markdown

ImportError: cannot import name 'TextTeaser'

I used pip install textteaser on Windows. The install succeeded, but when I run the demo code I encounter the following error:
ImportError: cannot import name 'TextTeaser'

best wish
yuquanle

Not able to create TextTeaser() object. Does it require an API key?

from textteaser import TextTeaser
title = "Limitations of the GET method in HTTP"
text = "We spend a lot of time thinking about web API design, and we learn a lot from other APIs and discussion with their authors. In the hopes that it helps others, we want to share some thoughts of our own. In this post, we’ll discuss the limitations of the HTTP GET method and what we decided to do about it in our own API. As a rule, HTTP GET requests should not modify server state. "
tt = TextTeaser()
TypeError                                 Traceback (most recent call last)
----> 1 tt = TextTeaser()
TypeError: __init__() takes exactly 2 arguments (1 given)

ModuleNotFoundError: _version

When I run textteaser inside my code in Python 3.7.3, I get the following:

  File "/Users/gregory/.pyenv/versions/3.7.3/lib/python3.7/site-packages/textteaser/__init__.py", line 25, in <module>
    from _version import __version__
ModuleNotFoundError: No module named '_version'
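A likely fix (an assumption; line 25 of the installed __init__.py is not shown here): Python 3 removed implicit relative imports, so a sibling module inside the package must be imported with an explicit dot.

```python
# textteaser/__init__.py -- presumed change; Python 3 requires the leading dot
# from _version import __version__
from ._version import __version__
```

Under Python 2 the bare import happened to resolve against the package directory, which is why this only surfaces on Python 3.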
