codelucas / newspaper
newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
Home Page: https://goo.gl/VX41yK
License: MIT License
Hi,
I tested with article.url = http://www.windytan.com/2013/03/eavesdropping-on-wireless-keyboard.html
The extracted text is the right-hand column (sidebar) content rather than the article body: "A self-taught signals & electronics hacker from Helsinki, Finland. Fond of mysteries, codes and ciphers, and vintage tech. Absorptions is a blog about my hobbies.\n\n\n\nWorks in IT. Apart from electronics and signals, likes singing, kung fu, photography, and collecting G1 MLPs."
Is it possible to send raw html directly to the Article.parse() function without it being downloaded by Article.download()?
I will add this feature tonight or tomorrow. Opening an issue for it because it is so important. Multithreading has always existed in newspaper, but there hasn't been a public API for it.
Downloading multiple articles concurrently is super useful and newspaper has an effective setup to do so.
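For reference, here is a minimal sketch of what such a public API could look like (this mirrors the concurrent example referenced later from the advanced docs; treat the exact names like news_pool and threads_per_source as assumptions for older versions):

import newspaper
from newspaper import news_pool

# Build a few sources, then download all of their articles concurrently.
papers = [newspaper.build(u) for u in ('http://slate.com', 'http://techcrunch.com')]
news_pool.set(papers, threads_per_source=2)  # 2 download threads per source
news_pool.join()  # blocks until every article has been downloaded

# After join(), each article's html is populated and ready for .parse().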
There is no documentation about adding a new language to your great newspaper application.
When I try newspaper.build('http://www.venturebeat.com'), it gives me these errors:
[Parse lxml ERR] Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
[Category parse ERR] http://feeds.venturebeat.com
Can you please help, or let me know what might be the issue?
I've noticed results from Article.movies are missing the protocol prefix.
>>> import newspaper
>>> url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'
>>> a = newspaper.Article(url)
>>> a.download()
>>> a.parse()
>>> a.movies
['//www.youtube.com/embed/LgNLRT6QyQE', '//www.youtube.com/embed/jsqVLa1yy1M']
$ pip show newspaper
---
Name: newspaper
Version: 0.0.7
Location: /home/adamgriffiths/.anaconda/envs/collected-redux/lib/python2.7/site-packages
Requires: lxml, requests, nltk, Pillow, cssselect, BeautifulSoup
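Until the extractor prepends a scheme itself, a small workaround is to resolve the protocol-relative URLs against the article URL (a sketch for the Python 2.7 setup shown above; on Python 3 the import is urllib.parse.urljoin):

from urlparse import urljoin

movies = [urljoin(a.url, m) for m in a.movies]
# ['http://www.youtube.com/embed/LgNLRT6QyQE', 'http://www.youtube.com/embed/jsqVLa1yy1M']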
I noticed that to get unicode, you rely on the requests package's request.text attribute (in network.py -> get_html). To get this, requests only uses the HTTP header encoding declaration (requests.utils.get_encoding_from_headers()), and reverts to ISO-8859-1 if it doesn't find one. This results in incorrect character encoding in a lot of cases.
You can use another function from requests to give you the encodings listed in the HTML, requests.utils.get_encodings_from_content(), which will work to fill in the gaps. What I generally do is test the request object's encoding first. If it's not ISO-8859-1, then it has been passed an encoding, and I return the request.text unicode. If it is, then I call requests.utils.get_encodings_from_content(), which parses via regex. It returns a list of suggested encodings from the content to try, which are generally correct.
In the final case, neither approach will work; an example is this page: http://boaforma.abril.com.br/fitness/todos-os-treinos/bikes-eletricas-759925.shtml. There is no HTTP header encoding, and the content carries an incorrect encoding declaration: content="text/html; charset=uISO-8859-1". Here we could use chardet or fall back to the original ISO-8859-1 encoding that requests defaults to (it works in this case).
I'd be happy to add this to the code if desired so you can pull it. Would it be most appropriate to put this into the network.py file?
Edit: Also, I have a large collection of special snowflake links that provide decoding difficulties and edge cases that we could add to the test suite if necessary.
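A sketch of the fallback chain described above (the requests helpers named here are real; chardet as a last resort is the optional extra step, and this is written Python 2-style to match the environment in this thread):

import requests
import chardet  # optional last resort

def robust_text(response):
    # If requests got a charset from the HTTP headers, trust its decoding.
    if response.encoding and response.encoding.lower() != 'iso-8859-1':
        return response.text
    # Otherwise look for <meta> charset declarations in the body (regex-based).
    for enc in requests.utils.get_encodings_from_content(response.content):
        try:
            return response.content.decode(enc)
        except (LookupError, UnicodeDecodeError):
            continue
    # Last resort: guess with chardet, else keep requests' ISO-8859-1 default.
    guess = chardet.detect(response.content)['encoding']
    return response.content.decode(guess or 'ISO-8859-1', errors='replace')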
Articles not being parsed from Memoize?
import newspaper
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=True)
for article in cnn_paper.articles:
    print article.url
It runs the first time, since nothing is cached yet, and prints all the results. The second time, nothing is printed -- BLANK --.
File "/home/tim/Workspace/Development/hacks/pressmonitor/pressmon/articles/management/commands/collect_articles.py", line 27, in handle
article.nlp()
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/article.py", line 307, in nlp
summary_sents = nlp.summarize(title=self.title, text=self.text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/nlp.py", line 34, in summarize
sentences = split_sentences(text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/nlp.py", line 146, in split_sentences
sentences = tokenizer.tokenize(text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
for el in it:
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
for aug_tok in tokens:
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
So far the newspaper library does a decent job with the basic API calls, but for a lot of the foreign language stuff and configuration details it is still a bit clunky. But do not worry, in the next 48 hours a HUGE revamp will be done on this library.
I'll make it very seamless to change languages, and auto-detect languages.
I will fix Chinese and Arabic extractions (right now they are broken because I was incorrectly using the requests library: response.content vs response.text for foreign articles).
I will also add a few more languages to the suite.
After installing the script, I get the following error when importing newspaper:
ImportError: cannot import name timegm
Any idea why this is happening?
import newspaper
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
for article in cnn_paper.articles:
    print article.url
The code above works sometimes and other times it doesn't. I'm working in a virtualenv; however, all the required libraries are also installed system-wide.
Traceback (most recent call last):
File "/Users/Shapath/Developer/Python/Newspaper/Newspaper/Parser/newspaperparser.py", line 1, in <module>
import newspaper
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/__init__.py", line 10, in <module>
from .article import Article, ArticleException
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/article.py", line 16, in <module>
from . import images
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/images.py", line 20, in <module>
from . import urls
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/urls.py", line 17, in <module>
from .packages.tldextract import tldextract
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/packages/tldextract/__init__.py", line 1, in <module>
from .tldextract import extract, TLDExtract
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/packages/tldextract/tldextract.py", line 37, in <module>
import pkg_resources
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/pkg_resources.py", line 76, in <module>
import parser
File "/Users/Shapath/Developer/Python/Newspaper/Newspaper/Parser/parser.py", line 4, in <module>
cnn_paper = newspaper.build('http://cnn.com')
Upon re-examining the code of this lib (which has many chunks taken from various parts of the open source community), I've come to the conclusion that it's totally shit and really needs to be refactored.
Will try to make something happen this weekend.
"Portugease" should be "Portuguese"
Any chance of hosting a demo of newspaper in action so we can try it out before going through the setup steps?
It'd be nice to have a "try before you buy" before committing to the setup.
Hey, so I downloaded the library and tried writing a small Python program to run the scripts you mentioned in the tutorial. But nothing runs. Newspaper is found, but beyond that, nothing.
Error for proof:
Traceback (most recent call last):
File "/newspaper.py", line 1, in <module>
import newspaper
File "/newspaper.py", line 3, in <module>
cnn_paper = newspaper.build('http://cnn.com')
AttributeError: 'module' object has no attribute 'build'
[Finished in 0.0s with exit code 1]
Is it possible to assign an already-downloaded HTML string to the Article object without calling the download() method?
I want to use it in a Scrapy project where the HTML page is already downloaded, so I simply need to parse it.
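A sketch of how this could look, assuming a set_html()-style setter gets exposed (the method name here is hypothetical, not a confirmed API):

from newspaper import Article

# html is the page body already fetched by Scrapy (or any other downloader)
article = Article(url='http://example.com/some-article')
article.set_html(html)  # hypothetical setter that bypasses download()
article.parse()
print article.title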
It looks like something has broken during the refactor perhaps. Essentially, I'm unable to install newspaper, either from a local directory or via git using pip.
setup.py specifies the newspaper.data package as a dependency, but the data/ directory doesn't exist any more, and the install therefore fails.
Python 3.4 + Windows + pip install =
SyntaxError: invalid syntax
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 16, in
\AppData\Local\Temp\pip_build_piyush\newspaper\setup.py", line 60
print ''
^
SyntaxError: invalid syntax
It doesn't work on http://mil.news.sina.com.cn/2014-01-25/1318761762.html.
Only the strong characters are recognized; the result is recorded below:
>>> print a.text
中新网北京1月24日电 (记者 孙自法)“2013中国科学年度新闻人物”评选结果24日晚在北京揭晓,领衔科研团队在国际上首次实现“量子反常霍尔效应”的薛其坤院士、神舟载人飞船系统总设计师张柏楠、中国首艘航母关键配套导航系统领航人张崇猛、运-20总设计师唐长红院士等10名科技专家,从40位候选人中脱颖而出、成功当选。
这次评出的十大中国科学年度新闻人物包括基础研究领域科学家3名、技术创新和科技成果转化杰出者3名、科技企业领军人物3名、科技传播者1名,他们分别是:
――中国科学院院士、清华大学副校长薛其坤。2013年,他带领研究团队,在国际上首次实现“量子反常霍尔效应”,让中国科学界站在了下一次信息革命的战略制高点。
――中科院院士、清华大学生命学院院长施一公。这位知名结构生物学家的科研小组2013年研究进展不断,包括“运用X-射线晶体学手段在细胞凋亡研究领域做出突出贡献,为开发新型抗癌、预防老年痴呆的药物提供重要线索”等。
――量子世界“追梦人”、中国科学技术大学教授陈宇翱。2013年,凭借在光子、冷原子量子操纵和量子信息、量子模拟等领域的杰出贡献,他荣获2013年度“菲涅尔奖”。
――中国航天科技集团空间技术研究院载人飞船系统总设计师张柏楠。2013年,他带领团队突破一系列关键技术,实现天宫一号与神舟十号手控交会对接,完成中国载人天地往返运输系统的首次应用性飞行。
BS4 should provide an extremely robust solution to parsing articles of questionable encoding, etc.
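For example, bs4's UnicodeDammit tries the declared encodings first and then guesses, which makes it a handy building block here (a minimal sketch):

from bs4 import UnicodeDammit

dammit = UnicodeDammit(raw_bytes)   # raw response body as bytes
html = dammit.unicode_markup        # best-effort unicode version
print dammit.original_encoding      # the encoding it settled on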
When extracting the article node as HTML using a.article_html, the <img> tags are not kept. I noticed that in the clean_html(cls, node) function 'img' is allowed, so why is it not included in the article_html output?
article_cleaner.allow_tags = ['a', 'span', 'p', 'br', 'strong', 'b',
    'em', 'i', 'tt', 'code', 'pre', 'blockquote', 'img', 'h1',
    'h2', 'h3', 'h4', 'h5', 'h6']
article_cleaner.remove_unknown_tags = False
I've got a running list of URLs that newspaper doesn't work phenomenally against. Is there an open issue to catalogue these? In most cases, it's able to grab the list of articles from the home page, but completely unable to decipher each individual article into readable values.
For example, this link gets basically nothing:
http://www.empireonline.com/news/story.asp?NID=40344
I work mostly with Russian articles. In Russian, «angled» quotes are the main variant of quotation marks. I noticed that if there are «angled» quotes in the title of an article, all closing quotation marks are removed from the extracted title, and it contains only the opening ones. I found that it happens here in ContentExtractor:
TITLE_REPLACEMENTS = ReplaceSequence().create(u"&raquo;").append(u"»")
...
return TITLE_REPLACEMENTS.replaceAll(title).strip()
As far as I understand, this is needed for removing » from titles where this character is used as a delimiter. Maybe it would make sense to modify the replacements so they don't remove right quotes that have left quotes before them?
Here's an example of a page that has broken quotation marks in its extracted title (Russian language).
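One way to sketch that suggestion: drop a » only when it doesn't close an earlier «, so paired «...» quotes survive while a bare site-name delimiter is still stripped (a hypothetical helper, not the library's actual code):

def clean_title(title):
    # Keep a » that closes an earlier «; drop unmatched » delimiters.
    out = []
    open_quotes = 0
    for ch in title:
        if ch == u'«':
            open_quotes += 1
        elif ch == u'»':
            if open_quotes == 0:
                continue  # unmatched »: site-name delimiter, drop it
            open_quotes -= 1
        out.append(ch)
    return u''.join(out).strip()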
When running article.parse() I am running into memory issues with a large number of articles being processed.
Each time the function is called it eats up about 0.5MB of memory that is not released when the parsing is done.
I took a look at the parse() function in article.py and it looks like the release_resources() function still has a TODO to be properly implemented:
https://github.com/codelucas/newspaper/blob/master/newspaper/article.py#L355
I'm curious if you can give more detail about a proper implementation of this function so that parse() will release the memory once it is done with it.
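For what it's worth, a hedged sketch of what release_resources() could do, assuming the parse products live in attributes like html, doc, and top_node (the attribute names are assumptions from reading article.py, not a confirmed design):

def release_resources(self):
    # Drop references to the raw html and the parsed lxml trees so the
    # interpreter can reclaim that memory; keep only the extracted strings.
    self.html = None
    self.doc = None
    self.clean_doc = None
    self.top_node = None
    self.clean_top_node = None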
I ran into problems using newspaper on Brazilian sites.
Here is an example:
import newspaper
info = newspaper.build('http://globoesporte.globo.com/futebol/times/sao-paulo')
len(info.articles)
It returned only 3 articles.
Sorry if I am using it incorrectly.
I've been playing around with newspaper today. Looks awesome. I've had trouble picking up a sufficient number of categories on a number of sites... might be a good idea to add docs for adding new categories to sources.
Newspaper currently creates its cache folder under ~/.newspaper_scraper. This should be a configurable option, and it should be possible to disable it altogether for those not using the 'memoized' functionality of newspaper.
Is there a way to add custom user agents while building the paper? Is it possible to add one, or are there any tricks to do it right now?
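newspaper's Config object has a browser_user_agent field that build() accepts, so something like the following should work (treat the exact attribute name as an assumption against older versions):

import newspaper
from newspaper import Config

config = Config()
config.browser_user_agent = 'Mozilla/5.0 (compatible; MyCrawler/1.0)'
paper = newspaper.build('http://cnn.com', config=config)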
It would be lovely if I could port this to Ruby. I think one issue I'll have to deal with is replacing Beautiful Soup with Nokogiri. I need to sit down and go through all of the code to spot any issues that may arise.
I am following the example at http://newspaper.readthedocs.org/en/latest/user_guide/advanced.html#advanced. After calling news_pool.join(), I then attempt to call parse() on an article belonging to a paper from the pool. It fails with the ArticleException: "You must download() an article before parsing it!".
Hi,
It's a minor thing, but I tried to see what languages are available by calling newspaper.languages(), and it exited with a KeyError: nb exception. It seems someone forgot to add this language to the language_dict inside print_available_languages(), defined in newspaper/utils/__init__.py.
I know the fix for this; I'll wait until tomorrow to implement it, it's late. I'll have setup.py install the required nltk tokenizers.
I currently use Boilerpipe to do article extraction in order to generate Kindle MOBI files to send to my Kindle. I'm wondering if it's possible to feature-request the ability to do something similar in Newspaper: in that the article text extraction retains a minimal set of markup around it, enough to give the text structure as far as HTML is concerned. This makes forward conversion to other formats a lot easier, and allows the ability to retain certain markup that can only be expressed using HTML (such as images in situ and code fragments).
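Note that newspaper has a keep_article_html config flag that preserves a cleaned HTML rendering of the extracted body, which may already cover part of this (the flag exists in the source; whether your installed version honors it this way is an assumption):

from newspaper import Article

a = Article('http://example.com/some-article', keep_article_html=True)
a.download()
a.parse()
print a.article_html  # extracted body with minimal markup retained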
I'm wondering if there's a way to retain <a> tags in article.top_node, or alternatively just extract all the URLs from within the article's html. My hope is to be able to find when an article links to any number of other articles. I'm currently digging around in the source to find the best place to include this, but could use guidance. Thanks!
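Since top_node is an lxml element, one way to pull every link out of the extracted article is a plain XPath over it after parse() (a sketch; note the caveat in the next issue about top_node being overwritten with the cleaned node):

from newspaper import Article

a = Article('http://example.com/some-article')
a.download()
a.parse()
links = a.top_node.xpath('.//a/@href')  # all hrefs inside the article body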
In Article.parse, top_node is overwritten with the cleaned node. Then Article.clean_top_node is copied from this. Both nodes are equal. I'm not sure what the reasons are, but it prevents extraction by external tools by hiding the extracted article html.
Preferably, Article.top_node shouldn't be overwritten, and existing code should be modified to use clean_top_node where required.
Bullet points in articles are not extracted and are totally missing from the extracted text.
Hello. I keep running into an error when parsing the downloaded articles. The error has to do with "Couldn't open file /home/.../newspaper/utils/../resources/text/stopwords-tr.txt". I am not sure where the issue is coming from, but my guess is that it might come from some updates in python-goose? Thanks.
I would like to use newspaper on Brazilian sites. I tested it, but without success.
The articles are not extracted correctly.
Is newspaper internationalized and I am just using it the wrong way, or can we add support for Brazilian Portuguese?
Thank you.
Add clean_body_classes from the python-goose upstream. Without this, there are cases where the body tag may get removed.
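For context, the upstream python-goose cleaner step amounts to stripping the class attribute from <body> before class-based cleaning runs, roughly like this (a sketch of the idea, not a verbatim copy of goose's code):

def clean_body_classes(doc):
    # Remove the class attribute from <body> so that class-keyword
    # cleaning rules never match and remove the body element itself.
    body = doc.find('body')
    if body is not None and 'class' in body.attrib:
        del body.attrib['class']
    return doc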
Great work with this library, just a little typo I've noticed: I think the argument in newspaper.build is supposed to be "memorize_articles", not "memoize_articles".
Can you add a new section where describing how to add a new language support?
import newspaper
Traceback (most recent call last):
File "", line 1, in
File "newspaper/init.py", line 10, in
from .article import Article, ArticleException
File "newspaper/article.py", line 15, in
from . import nlp
File "newspaper/nlp.py", line 171
if (normalized > 1.0) #just in case
I have been following the example in the README and I encountered this:
>>> article = cnn_paper.articles[1]
>>> article.download()
>>> article.parse()
>>> article.nlp()
Traceback (most recent call last):
zipfile.BadZipfile: File is not a zip file
The publishing date of an article is critical. I believe a good way of extracting publishing dates is to use a set of regex patterns and/or some notable id/class names.
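A sketch of that idea: check a few common meta tags first, then fall back to a date pattern in the URL (the tag list and regex are illustrative, not exhaustive; doc is assumed to be the parsed lxml document):

import re

DATE_META_XPATHS = [
    '//meta[@property="article:published_time"]/@content',
    '//meta[@name="pubdate"]/@content',
    '//time[@datetime]/@datetime',
]
URL_DATE_RE = re.compile(r'/(20\d{2})[/-](\d{1,2})[/-](\d{1,2})/')

def extract_pub_date(doc, url):
    # Prefer explicit metadata, then a date embedded in the URL path.
    for xp in DATE_META_XPATHS:
        values = doc.xpath(xp)
        if values:
            return values[0]
    match = URL_DATE_RE.search(url)
    if match:
        return '-'.join(match.groups())
    return None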
First, thanks a lot for the great tool. I've been trying it out, and it seems magic (except for some corner cases, websites for which it doesn't work, etc.), but really cool :)
However, I tried it in a setting with scarce resources (1G of RAM), and I have the impression that the memory keeps growing build after build until ... memory error. I deactivated memoizing articles, tried to empty the articles, and dereferenced the sources, but it looks like a bunch of other things are also memoized and kept in memory, with no means to deactivate them. What is the best way to handle this? How does newspaper handle the increase of memory usage build after build? Is there a limit?
Thanks again for the magic tool :)
I tried newspaper 0.0.6 with a bunch of Arabic websites and it didn't seem to fetch any articles.
In [38]: newspaper.version.version_info
Out[38]: (0, 0, 6)
In [39]: alarabiya = newspaper.build('http://www.alarabiya.net/', language='ar')
In [40]: tahrirnews = newspaper.build('http://tahrirnews.com/', language='ar')
In [41]: ahram = newspaper.build('http://www.ahram.org.eg/', language='ar')
In [42]: almasryalyoum = newspaper.build('http://www.almasryalyoum.com/', language='ar')
In [43]: for src in (alarabiya, tahrirnews, ahram, almasryalyoum):
....: print(src.size())
....:
0
0
0
0
I'm not sure if this is an OS X 10.10 or possibly even an Xcode 6.1 command line tools issue, but I'm having some trouble installing this.
Once I get to the pip install newspaper step, it errors out while building lxml. Any thoughts on what could be going wrong?
Robs-MacBook-Air:~ rob$ pip install newspaper
Requirement already satisfied (use --upgrade to upgrade): newspaper in /Library/Python/2.7/site-packages
Downloading/unpacking lxml (from newspaper)
Downloading lxml-3.4.0.tar.gz (3.5MB): 3.5MB downloaded
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py) egg_info for package lxml
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Building lxml version 3.4.0.
Building without Cython.
Using build configuration of libxslt 1.1.28
warning: no previously-included files found matching '*.py'
Downloading/unpacking requests (from newspaper)
Downloading requests-2.4.3-py2.py3-none-any.whl (459kB): 459kB downloaded
Downloading/unpacking nltk (from newspaper)
Downloading nltk-3.0.0.tar.gz (962kB): 962kB downloaded
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/nltk/setup.py) egg_info for package nltk
warning: no files found matching 'Makefile' under directory '*.txt'
warning: no previously-included files matching '*~' found anywhere in distribution
Downloading/unpacking Pillow (from newspaper)
Downloading Pillow-2.6.1.tar.gz (7.3MB): 7.3MB downloaded
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/Pillow/setup.py) egg_info for package Pillow
warning: no files found matching '*.yaml'
warning: no files found matching '*.bdf' under directory 'Images'
warning: no files found matching '*.fli' under directory 'Images'
warning: no files found matching '*.gif' under directory 'Images'
warning: no files found matching '*.icns' under directory 'Images'
warning: no files found matching '*.ico' under directory 'Images'
warning: no files found matching '*.jpg' under directory 'Images'
warning: no files found matching '*.pbm' under directory 'Images'
warning: no files found matching '*.pil' under directory 'Images'
warning: no files found matching '*.png' under directory 'Images'
warning: no files found matching '*.ppm' under directory 'Images'
warning: no files found matching '*.psd' under directory 'Images'
warning: no files found matching '*.tar' under directory 'Images'
warning: no files found matching '*.webp' under directory 'Images'
warning: no files found matching '*.xpm' under directory 'Images'
warning: no files found matching 'README' under directory 'Sane'
warning: no files found matching 'README' under directory 'Scripts'
warning: no files found matching '*.icm' under directory 'Tests'
warning: no files found matching '*.txt' under directory 'Tk'
Downloading/unpacking cssselect (from newspaper)
Downloading cssselect-0.9.1.tar.gz
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/cssselect/setup.py) egg_info for package cssselect
no previously-included directories found matching 'docs/_build'
Downloading/unpacking BeautifulSoup (from newspaper)
Downloading BeautifulSoup-3.2.1.tar.gz
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/BeautifulSoup/setup.py) egg_info for package BeautifulSoup
Installing collected packages: lxml, requests, nltk, Pillow, cssselect, BeautifulSoup
Running setup.py install for lxml
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Building lxml version 3.4.0.
Building without Cython.
Using build configuration of libxslt 1.1.28
building 'lxml.etree' extension
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/src/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
cc -bundle -undefined dynamic_lookup -arch x86_64 -arch i386 -Wl,-F. -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.10-intel-2.7/lxml/etree.so
ld: warning: directory not found for option '-F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks'
ld: framework not found CrashReporterSupport
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'cc' failed with exit status 1
Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip-dac3OE-record/install-record.txt --single-version-externally-managed --compile:
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Building lxml version 3.4.0.
Building without Cython.
Using build configuration of libxslt 1.1.28
running install
running build
running build_py
creating build
creating build/lib.macosx-10.10-intel-2.7
creating build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/_elementpath.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/builder.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/cssselect.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/doctestcompare.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/ElementInclude.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/pyclasslookup.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/sax.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/usedoctest.py -> build/lib.macosx-10.10-intel-2.7/lxml
creating build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/includes
creating build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/_diffcommand.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/_html5builder.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/_setmixin.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/builder.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/clean.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/defs.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/diff.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/ElementSoup.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/formfill.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/html5parser.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/soupparser.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/usedoctest.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron
copying src/lxml/isoschematron/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron
copying src/lxml/lxml.etree.h -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/lxml.etree_api.h -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/includes/c14n.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/config.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/dtdvalid.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/etreepublic.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/htmlparser.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/relaxng.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/schematron.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/tree.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/uri.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xinclude.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xmlerror.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xmlparser.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xmlschema.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xpath.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xslt.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/etree_defs.h -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/lxml-version.h -> build/lib.macosx-10.10-intel-2.7/lxml/includes
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/rng
copying src/lxml/isoschematron/resources/rng/iso-schematron.rng -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/rng
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl
copying src/lxml/isoschematron/resources/xsl/RNG2Schtrn.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl
copying src/lxml/isoschematron/resources/xsl/XSD2Schtrn.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_abstract_expand.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_dsdl_include.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_message.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_skeleton_for_xslt1.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_svrl_for_xslt1.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/readme.txt -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
creating build/temp.macosx-10.10-intel-2.7
creating build/temp.macosx-10.10-intel-2.7/src
creating build/temp.macosx-10.10-intel-2.7/src/lxml
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/src/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
cc -bundle -undefined dynamic_lookup -arch x86_64 -arch i386 -Wl,-F. -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.10-intel-2.7/lxml/etree.so
ld: warning: directory not found for option '-F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks'
ld: framework not found CrashReporterSupport
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'cc' failed with exit status 1
----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip-dac3OE-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml
Storing debug log for failure in /Users/rob/Library/Logs/pip.log
Here's the stack trace:
[Parse lxml ERR] line 1045: Tag nav invalid
[Article parse ERR] http://www.cnet.com/products/apple-ipad-march-2012/
You must download and parse an article before parsing it!
Traceback (most recent call last):
File "crawler.py", line 30, in <module>
a.nlp()
File "/root/.virtualenvs/cnet-crawler/local/lib/python2.7/site-packages/newspaper/article.py", line 276, in nlp
raise ArticleException()
newspaper.article.ArticleException
I'm not using the concurrent version, and I'm not building a newspaper from a URL; rather, I have a list of all the articles and I build a new Article from each of them.
ve = build(" http://www.le360.ma/fr", memoize_articles=False)
links = dict()
for each in ve.articles:
    links[each.title] = each.url
-> links is empty