xiaoxu193 / pyteaser

Summarizes news articles

Home Page: http://xiaoxu193.github.io/PyTeaser/

License: MIT License

Python 100.00%

pyteaser news-articles

pyteaser's Introduction

PyTeaser

PyTeaser takes any news article and extracts a brief summary from it. It's based on the original Scala project.

Summaries are created by ranking sentences in a news article according to how relevant they are to the entire text. The top 5 sentences are used to form a "summary". Each sentence is ranked by using four criteria:

Relevance to the title
Relevance to keywords in the article
Position of the sentence
Length of the sentence
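
As a rough illustration of how four criteria like these could be folded into a single score, here is a minimal sketch. The helper names, the equal weighting, the ideal length of 20 words, and the linear position score are all assumptions made for this example; they are not taken from PyTeaser's source.

```python
import re


def words(text):
    """Lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def overlap(sentence_words, reference_words):
    """Fraction of sentence words that also occur in the reference set."""
    if not sentence_words:
        return 0.0
    hits = sum(1 for w in sentence_words if w in reference_words)
    return hits / float(len(sentence_words))


def score_sentence(sentence, index, total, title, keywords, ideal_length=20):
    """Average four hypothetical criteria, each scaled to the range 0..1."""
    tokens = words(sentence)
    title_score = overlap(tokens, set(words(title)))
    keyword_score = overlap(tokens, set(keywords))
    # Earlier sentences score higher (assumed linear decay).
    position_score = 1.0 - index / float(total)
    # Sentences close to the assumed ideal length score higher.
    distance = abs(ideal_length - len(tokens))
    length_score = max(0.0, 1.0 - distance / float(ideal_length))
    return (title_score + keyword_score + position_score + length_score) / 4.0


def top_sentences(sentences, title, keywords, count=5):
    """Return the `count` highest-scoring sentences, best first."""
    scored = [
        (score_sentence(s, i, len(sentences), title, keywords), s)
        for i, s in enumerate(sentences)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:count]]
```

The keywords are passed in as a plain list of lowercase words here; how they are actually derived from the article is outside the scope of this sketch.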

Installation:

Requires Python 2.7 (needs collections.Counter).

sudo pip install pyteaser

These dependency packages will be automatically installed:

Pillow
lxml
cssselect
jieba
beautifulsoup

Note: if you're installing on Windows, you have to install one of the dependencies, lxml, manually using:

easy_install lxml==2.3.3

More information about this issue here: #17

Usage:

sample command:

>>> from pyteaser import SummarizeUrl
>>> url = 'http://www.huffingtonpost.com/2013/11/22/twitter-forward-secrecy_n_4326599.html'
>>> summaries = SummarizeUrl(url)
>>> print summaries

output

["Twitter\'s move is the latest response from U.S. Internet firms following disclosures by former spy agency contractor Edward Snowden about widespread, classified U.S. government surveillance programs.", "\\"Since then, it has become clearer and clearer how important that step was to protecting our users\' privacy.\\"", "The online messaging service, which began scrambling communications in 2011 using traditional HTTPS encryption, said on Friday it has added an advanced layer of protection for HTTPS known as \\"forward secrecy.\\"", "\\"A year and a half ago, Twitter was first served completely over HTTPS,\\" the company said in a blog posting.", " \\"I\'m glad this is the direction the industry is taking.\\" \\n\\n(Reporting by Jim Finkle; editing by Andrew Hay)"]

You can use Summarize(title, text) directly if you already have the text and the title. Otherwise, you must install Python Goose to extract the text from a URL.


pyteaser's Issues

Why is this not based on nltk?

I wonder why you didn't harness nltk, which is widely used in text-based analysis.

Error!

I get this error when I run it. Seems like multiple Errors! :-

HERE IS THE ERROR!
Traceback (most recent call last):
File "C:\Users\mike\Desktop\Python Repos\PyTeaser-master\pyteaser.py", line 259, in <module>
main()
File "C:\Users\mike\Desktop\Python Repos\PyTeaser-master\pyteaser.py", line 256, in main
Summarize('Framework for Partitioning and Execution of Data Stream Applications in Mobile Cloud Computing', 'The contribution of cloud computing and mobile computing technologies lead to the newly emerging mobile cloud com- puting paradigm. Three major approaches have been pro- posed for mobile cloud applications: 1) extending the access to cloud services to mobile devices; 2) enabling mobile de- vices to work collaboratively as cloud resource providers; 3) augmenting the execution of mobile applications on portable devices using cloud resources. In this paper, we focus on the third approach in supporting mobile data stream applica- tions. More specifically, we study how to optimize the com- putation partitioning of a data stream application between mobile and cloud to achieve maximum speed/throughput in processing the streaming data. To the best of our knowledge, it is the first work to study the partitioning problem for mobile data stream applica- tions, where the optimization is placed on achieving high throughput of processing the streaming data rather than minimizing the makespan of executions as in other appli- cations. We first propose a framework to provide runtime support for the dynamic computation partitioning and exe- cution of the application. Different from existing works, the framework not only allows the dynamic partitioning for a single user but also supports the sharing of computation in- stances among multiple users in the cloud to achieve efficient utilization of the underlying cloud resources. Meanwhile, the framework has better scalability because it is designed on the elastic cloud fabrics. Based on the framework, we design a genetic algorithm for optimal computation parti- tion. Both numerical evaluation and real world experiment have been performed, and the results show that the par- titioned application can achieve at least two times better performance in terms of throughput than the application without partitioning.')
File "C:\Users\mike\Desktop\Python Repos\PyTeaser-master\pyteaser.py", line 86, in Summarize
sentences = split_sentences(text)
File "C:\Users\mike\Desktop\Python Repos\PyTeaser-master\pyteaser.py", line 204, in split_sentences
import nltk.data # natural language split sentences
ImportError: No module named 'nltk'

New line and quotation marks

When a sentence ends with a quotation mark followed by a new line, the quotation mark is sometimes mistakenly taken as the start of the next sentence.

Maybe replace new lines with spaces?
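
A minimal sketch of that suggestion, assuming the text is normalized before it reaches the sentence splitter. The function name is hypothetical and this is not how PyTeaser currently preprocesses text.

```python
import re


def normalize_newlines(text):
    """Collapse each run of whitespace that contains a newline into one space."""
    return re.sub(r"\s*\n\s*", " ", text).strip()
```

The trade-off is that paragraph breaks are lost, so any logic that relies on blank lines would have to run before this step.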

Not sure if this regex does what you want

I was looking over the project's regexes and noticed this one. It seems to look for punctuation and tries to use \p{} for Unicode properties, but Python's built-in re module does not support that syntax. It also might be dead code in general.
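
If the intent is to match Unicode punctuation, one workaround that needs only the standard library is to test each character's general category with unicodedata. This is a sketch under that assumption, with hypothetical function names; the third-party regex package, which does accept \p{...}, would be another option.

```python
import unicodedata


def is_punctuation(char):
    """True when the character's Unicode general category starts with 'P'."""
    return unicodedata.category(char).startswith("P")


def strip_punctuation(text):
    """Drop every Unicode punctuation character from the text."""
    return u"".join(ch for ch in text if not is_punctuation(ch))
```

This covers curly quotes, dashes, and the ellipsis character as well as ASCII punctuation, because all of them fall under the P* categories.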

How to integrate with HTML in the Bottle server framework

How can this be integrated with an HTML page and a Bottle server? The user would enter a URL in a text box, and the output would be shown on a page after submitting.

Remove reporting city and date in first sentence

Ex: http://www.ksl.com/index.php?sid=28054168&nid=148&title=exelis-to-make-materials-for-new-boeing-jet&fm=home_page&s_cid=queue-1

SALT LAKE CITY — Exelis has been selected by Boeing to produce composite airframe substructures for the 787 Dreamliner.

Ex2: http://fox59.com/2013/12/22/time-warner-cable-and-tribune-reach-multi-year-distribution-agreement/

NEW YORK and CHICAGO (December 22, 2013) — Tribune reached a multi-year agreement on retransmission consent with Time Warner Cable today.

The reporting cities ruin the format of the summary.
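
One possible approach is to strip a leading dateline with a regular expression before the sentence is scored. The pattern below is an assumption fitted to the two examples above (capitalized place names, an optional parenthesized date, then a dash); it is not part of PyTeaser and would need tuning against real articles.

```python
import re

DATELINE = re.compile(
    r"^\s*"
    r"[A-Z][A-Z.,]*(?:\s+(?:and\s+)?[A-Z][A-Z.,]*)*"  # place names in capitals
    r"(?:\s*\([^)]*\))?"                               # optional (date)
    r"\s*[\u2014\u2013-]\s+"                           # em dash, en dash, or hyphen
)


def strip_dateline(sentence):
    """Remove a leading 'CITY (date) - ' prefix if one is present."""
    return DATELINE.sub("", sentence, count=1)
```

Because a plain hyphen is accepted as the separator, a sentence that opens with an all-caps acronym followed by " - " would also be trimmed, so this is best applied only to the first sentence of an article.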

Python unicode support

When the Summarize function is called with a unicode object (unicode string) as the second argument (text), and the text has multiple sentences (separated by punctuation), an exception is raised.

That happens because the function split_sentences tries to map each part to 'str', so it cannot handle strings that contain non-ASCII characters (such as unicode objects).

Example to reproduce the issue:

import pyteaser
pyteaser.Summarize(u'Título', u'Artículo. Contenido.')

Results in:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\LVELEZ\Anaconda\lib\site-packages\pyteaser.py", line 86, in Summarize
sentences = split_sentences(text)
File "C:\Users\LVELEZ\Anaconda\lib\site-packages\pyteaser.py", line 215, in split_sentences
s_iter = [''.join(map(str,y)).lstrip() for y in s_iter]
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 3:
ordinal not in range(128)

Python 3 support?

Will this code not work in Python 3?

Error when running pip install pyteaser

This is the output when running pip install pyteaser

Collecting pyteaser
  Downloading https://files.pythonhosted.org/packages/d4/7a/310592c6e7998440e56a8650446ecf3ded076431415c60f0f3b946b54462/pyteaser-2.0.tar.gz (40kB)
Requirement already satisfied: Pillow in c:\users\jmorales\appdata\roaming\python\python35\site-packages (from pyteaser) (5.2.0)
Requirement already satisfied: lxml in c:\users\jmorales\appdata\roaming\python\python35\site-packages (from pyteaser) (4.2.3)
Requirement already satisfied: cssselect in c:\users\jmorales\appdata\local\programs\python\python35\lib\site-packages (from pyteaser) (1.0.3)
Requirement already satisfied: jieba in c:\users\jmorales\appdata\roaming\python\python35\site-packages (from pyteaser) (0.39)
Collecting beautifulsoup (from pyteaser)
  Downloading https://files.pythonhosted.org/packages/1e/ee/295988deca1a5a7accd783d0dfe14524867e31abb05b6c0eeceee49c759d/BeautifulSoup-3.2.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\jmorales\AppData\Local\Temp\pip-install-69mmyqv0\beautifulsoup\setup.py", line 22
        print "Unit tests have failed!"
                                      ^
    SyntaxError: Missing parentheses in call to 'print'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\jmorales\AppData\Local\Temp\pip-install-69mmyqv0\beautifulsoup\

This is my setup:

Python version: 3.5
OS: Windows 10 Home
System Type: 64-bit Operating System, x64-based processor
Processor: AMD A10-8700P Radeon R6

pep8 errors

pyteaser.py has 193 pep8 errors

4       E101 indentation contains mixed spaces and tabs
2       E222 multiple spaces after operator
7       E225 missing whitespace around operator
2       E231 missing whitespace after ','
8       E261 at least two spaces before inline comment
7       E262 inline comment should start with '# '
10      E302 expected 2 blank lines, found 1
1       E303 too many blank lines (2)
6       E501 line too long (3625 > 79 characters)
1       E703 statement ends with a semicolon
138     W191 indentation contains tabs
2       W291 trailing whitespace
5       W293 blank line contains whitespace

I'm happy to fix them, if no one objects.

Other languages support

Hi, are there any other steps besides translating the words from pyteaser.py?
I've tried that but it doesn't work. It looks like it fails to create a CrawlCandidate object. Have you tried adding other languages, or should I just debug it until I find out what's wrong?

integrate with HTML

I am a student and would like to know how this can be integrated with an HTML page. The user would enter a URL in a text box, and the output would be shown on a page after submitting.

Setup error for lxml on Windows

This is a problem with the lxml package on Windows when you install it with "pip install lxml" or via the setup file.

error: Unable to find vcvarsall.bat

The current fix is to install lxml manually on Windows using "easy_install lxml==2.3.3", as suggested here: http://stackoverflow.com/a/9643941

Error getting NYTimes article

... as well as some sites that require cookies. This is a documented error in Python Goose.

grangier/python-goose#35

Maybe set a special case for nytimes.com

Python 3 support

Hello,

I am trying to play with this package and can't install it due to python 3 incompatibility from the beautifulsoup dependency. It appears beautifulsoup4 supports python 3, can we update the requirement?

display python version.

On Python 2.6:

ImportError: cannot import name Counter

Counter is only available in Python 2.7 and higher, not in earlier versions.

The required Python version should be displayed on GitHub or PyPI (https://pypi.python.org/pypi/pyteaser/1.0).
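
Besides documenting the requirement, a package could fail early with a readable message instead of the bare ImportError. The following is a minimal sketch with a hypothetical helper name; it is not something PyTeaser ships.

```python
import sys


def require_python(minimum=(2, 7), version_info=None):
    """Raise a readable error when the interpreter is older than `minimum`."""
    current = tuple((version_info or sys.version_info)[:2])
    if current < minimum:
        raise RuntimeError(
            "Python %d.%d or newer is required (found %d.%d); "
            "collections.Counter was added in Python 2.7."
            % (minimum + current)
        )
    return current
```

The version_info parameter exists only so the check can be exercised without switching interpreters; normal callers would leave it unset.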

What do you mean by "based on the original Scala project"?

Does "based on the original Scala project" mean that the algorithm used to build the summary is the same in both?

By the algorithm I mean the selection of the top 5 sentences based on the following criteria:

Relevance to the title
Relevance to keywords in the article
Position of the sentence
Length of the sentence

Or does it mean something else too?

Sentences selected for summary not in chronological order

They don't seem to appear in chronological order, instead they are sorted by their score.
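
A sketch of one way to get that ordering: pick the best-scoring sentences first, then sort the chosen ones by their original position. The function name and the parallel `scores` list are assumptions for this example.

```python
def summary_in_document_order(sentences, scores, count=5):
    """Pick the `count` best-scoring sentences, then restore source order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:count])
    return [sentences[i] for i in chosen]
```

For example, with sentences a, b, c, d and scores 0.1, 0.9, 0.3, 0.8, asking for two sentences yields b followed by d, regardless of which of the two scored higher.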

Quote split into different sentences

Using split_sentences on text grabbed from: http://www.mynews4.com/mostpopular/story/Rail-City-Casino-Owner-Confirms-Data-Breach/lbpbkJtdN0-13YwhAGuoxg.cspx

Sparks, Nev. (KRNV & MyNews4.com)- Rail City Casino's parent company, Affinity Gaming, confirmed a digital security breach that compromises credit and debit card information of their patrons.
If you tried your luck at Rail City Casino in Sparks, you may be at risk for fraud and identity theft.
Affinity Gaming announced today that the system that processes customer credit and debit card information was breached between March 14th and October 26th at several of its casinos when it became infected by malware.
Arminda Jimenez is a consumer credit counselor and says sensitive information could end up in the wrong hands. 
“They can draw money from your account.
It may not be a huge amount.
It can be a little at a time,” she says.
She says there's a limit that can be charged on credit cards, but compromised debit cards puts your cold hard cash in danger. 
“With debit cards, if you have direct deposit, or anything like that, they can go in there and take as much money from your bank account as possible," she warns.
That's why she suggests checking your bank accounts several times a week for unusual activity, and to add an extra layer of safety, closing your bank account and opening a new one altogether.
"Your credit card number is linked to your social security to everything else; to an address.
So it may not stop at money wise,” says Jimenez. “ It can be addresses, and stuff like that."
Affinity Gaming discovered the intrusion when law enforcement linked fraudulent charges to a data breach.
The company says its system is now secure, but encourages patrons to take steps to protect their financial information.
The company has established a confidential inquiry line at 877-238-2179 for those who may have been affected.
If you do find that you've been a victim of fraud or identity theft, Jimenez says to close your bank account and file a report with the police and the Federal Trade Commission.

Specifically the following sentences...

“They can draw money from your account.
It may not be a huge amount.
It can be a little at a time,” she says.

... should be one sentence, because each sentence makes no sense without the entire quote.

Also, the first sentence awkwardly starts with a quote, but doesn't end with one.
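
One way this could be handled is a post-processing pass that keeps joining sentences until every opening curly quote has been closed. The sketch below assumes the splitter has already produced a list of sentences; the function name is hypothetical.

```python
OPENING = u"\u201c"  # left curly double quote
CLOSING = u"\u201d"  # right curly double quote


def merge_open_quotes(sentences):
    """Join consecutive sentences until every curly quote opened is closed."""
    merged = []
    buffer = []
    depth = 0
    for sentence in sentences:
        buffer.append(sentence)
        depth += sentence.count(OPENING) - sentence.count(CLOSING)
        if depth <= 0:
            merged.append(u" ".join(buffer))
            buffer = []
            depth = 0
    if buffer:  # unbalanced quote at the end of the text
        merged.append(u" ".join(buffer))
    return merged
```

Straight quotes are ignored because they do not distinguish opening from closing. The article above mixes curly and straight quotes, so its quotes would need to be normalized first, otherwise an unclosed curly quote would swallow the rest of the text.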

Remove dependency on nltk sentence splitter

Because it's annoying to install.

Special characters issue

If you summarize this link: http://news.antiwar.com/2014/04/02/italy-troops-crack-down-on-secessionists-nationwide/

This is the result

[
   'A referendum earlier this month showed 89 percent support in Venice for returning to its role as an independent republic, while Sardinia-Piedmont continues to have designs on splitting from Italy and seeking annexation by Switzerland.',
   '\n\nProtesters complain that Italy\xe2\x80\x99s central government is an excessive drag on their regional economies, particularly in Venice, one of the wealthier parts of the country, but one which is heavily taxed to subsidize the poorer south.',
   'With secessionist sentiment continuing to soar across Italy, the nation has deployed its special forces in multiple operations nationwide, but primarily centered on northern secessionist regions, arresting 24 \xe2\x80\x9csuspected secessionists.\xe2\x80\x9d\n\nDuring the crackdown, police captured a bulldozer in Venice, which they claimed was being dressed up to look like an armored vehicle for deployment during a secessionist protest, and unspecified weapons in both Lombardy and Piedmont, as well as on the island of Sardinia.'
]

Specifically this phrase

\xe2\x80\x9d\n\nDuring the crackdown,

started with special characters that didn't get split.
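
The bytes \xe2\x80\x9d are the UTF-8 encoding of the right curly quote (U+201D). A possible mitigation, sketched here under the assumption that the input is UTF-8, is to decode the text and split on blank lines before any sentence splitting takes place. The function name is hypothetical.

```python
import re


def split_paragraphs(raw, encoding="utf-8"):
    """Decode bytes if needed, then split on blank lines."""
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]
```

Each paragraph can then be passed to the sentence splitter separately, so a sentence can never span a paragraph break.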

PyPI Project Request Transfer

In https://github.com/pypa/warehouse/issues/6413, another user has requested to take ownership of the PyPI name PyTeaser to continue maintenance of this project.

Could you please respond to that request?

UnicodeEncodeError in split_sentences

  s_iter = [''.join(map(str,y)).lstrip() for y in s_iter]
E UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 85: ordinal not in range(128)

Reference

Could you please cite the research paper that you have used for creating pyTeaser and assigning importance to the sentences for creating the summary.

python 2.7 discontinued, can no longer install... can you update?

Unable to import name counter

This is an enhancement that should be added to the program as a message when it runs, or as a comment or part of the README, to solve the issue for newcomers.

Wall Street Journal Summaries Do Not Work

When I enter this article, I get back a summary of the email form settings rather than a summary of the article.

Here is the code I am using for testing:

from pyteaser import SummarizeUrl
url = 'http://blogs.wsj.com/accelerators/2014/06/03/jessica-livingston-why-startups-need-to-focus-on-sales-not-marketing/'
summaries = SummarizeUrl(url)
print summaries

And here is the output:

[" Please try again .\n\n\xe2\x80\xa2 You can't enter more than 20 emails.\n\n\xe2\x80\xa2 You must enter the verification code below to send.\n\n\xe2\x80\xa2 Invalid entry: Please type the verification code again.", 'An error has occured and your email has not been sent.', 'Your email has been sent.']

I'm not sure if this is an issue with PyTeaser or Goose. If it's a problem with Goose, then feel free to close the issue and I'll submit it there.

UnicodeDecodeError in split_sentences

Using a url from BBC news:

$ python2.7 /tmp/news1.py http://www.bbc.co.uk/news/magazine-29631332
Traceback (most recent call last):
  File "/tmp/news1.py", line 4, in <module>
    summaries = SummarizeUrl(url)
  File "/usr/local/lib/python2.7/site-packages/pyteaser-1.0-py2.7.egg/pyteaser.py", line 81, in SummarizeUrl
    summaries = Summarize(title, text)
  File "/usr/local/lib/python2.7/site-packages/pyteaser-1.0-py2.7.egg/pyteaser.py", line 87, in Summarize
    sentences = split_sentences(text)
  File "/usr/local/lib/python2.7/site-packages/pyteaser-1.0-py2.7.egg/pyteaser.py", line 210, in split_sentences
    s_iter = [''.join(map(unicode,y)).lstrip() for y in s_iter]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 142: ordinal not in range(128)

Looks like the problem is the single-character Unicode ellipsis "…" (U+2026, whose UTF-8 encoding begins with byte 0xe2) in the context of "so that plastic straw or cup lid you dropped, the cigarette butt you threw on the road… they could all end up in the sea.".

The following diff seems to fix it - probably better to unicode() the whole string before trying to do individual characters - although could probably use more thought.

<     s_iter = [''.join(map(unicode,y)).lstrip() for y in s_iter]
<     s_iter.append(sentences[-1])
<     return s_iter

---
> 
>     s_iter2 = []    
>     for y in s_iter:
>         uy = unicode(y)
>         t0 = map(unicode,uy)
>         t1 = ''.join(t0)
>         t2 = t1.lstrip()
>         s_iter2.append(t2)
> 
>     s_iter2.append(sentences[-1])
>     return s_iter2
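
One way to follow the suggestion of converting whole strings rather than individual characters is to decode each fragment once, with an explicit encoding, before joining. This is a sketch with hypothetical helper names; it assumes the input bytes are UTF-8 and that each sentence arrives as a sequence of fragments.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding byte strings with the assumed encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def join_fragments(fragments, encoding="utf-8"):
    """Join the pieces of one sentence after decoding each piece as a whole."""
    return u"".join(to_text(f, encoding) for f in fragments).lstrip()
```

Decoding a whole fragment keeps multi-byte characters such as the ellipsis intact, whereas converting byte by byte with the default ASCII codec is what raises the UnicodeDecodeError shown above.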

Get summary by providing text and title directly

Instead of just taking in a URL.

Language support

Hi,
Can I use PyTeaser to summarize a text in any language?

Lightweight version without Goose or url support

An option to install without Goose or SummarizeUrl.

ex. pip install pyteaser light

Quick question about the keywords!

Hey there, quick question about the keywords! I know this library was written with news articles in mind, and the keywords revolve around that. But how would one best update the keywords for a different use? Say, if I wanted to summarize text that wasn't news?

Failure to run SummarizeUrl() (ImportError)

ImportError: dlopen(/Users/ryAn/.../lib/python2.7/site-packages/lxml/etree.so, 2): 
 Library not loaded: libxml2.2.dylib

Referenced from: /Users/ryAn/.../lib/python2.7/site-packages/lxml/etree.so

Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 10.0.0

Used in the Python 2.7.6 REPL, following the example in the README, this error comes up upon attempting to run the line SummarizeUrl(url).

The module was installed using pip inside a virtualenv.