codelucas / newspaper
newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
Home Page: https://goo.gl/VX41yK
License: MIT License
Hi,
I tested with article.url = http://www.windytan.com/2013/03/eavesdropping-on-wireless-keyboard.html
The extracted text is the right-hand column (sidebar) content rather than the article body: "A self-taught signals & electronics hacker from Helsinki, Finland. Fond of mysteries, codes and ciphers, and vintage tech. Absorptions is a blog about my hobbies.\n\n\n\nWorks in IT. Apart from electronics and signals, likes singing, kung fu, photography, and collecting G1 MLPs."
Is it possible to send raw html directly to the Article.parse() function without it being downloaded by Article.download()?
I will add this feature tonight or tomorrow. Opening an issue for it because it is so important. Multithreading has always existed in newspaper, but there hasn't been a public API for it.
Downloading multiple articles concurrently is super useful and newspaper has an effective setup to do so.
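For reference, here is a minimal sketch of what such a public API could look like (this mirrors the concurrent example referenced later from the advanced docs; treat the exact names like news_pool and threads_per_source as assumptions for older versions):

import newspaper
from newspaper import news_pool

# Build a few sources, then download all of their articles concurrently.
papers = [newspaper.build(u) for u in ('http://slate.com', 'http://techcrunch.com')]
news_pool.set(papers, threads_per_source=2)  # 2 download threads per source
news_pool.join()  # blocks until every article has been downloaded

# After join(), each article's html is populated and ready for .parse().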
There is no documentation about adding a new language to your great newspaper application.
When I try newspaper.build('http://www.venturebeat.com'), it gives me these errors:
[Parse lxml ERR] Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
[Category parse ERR] http://feeds.venturebeat.com
Can you please help, or let me know what might be the issue?
I've noticed results from Article.movies are missing the protocol prefix.
>>> import newspaper
>>> url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'
>>> a = newspaper.Article(url)
>>> a.download()
>>> a.parse()
>>> a.movies
['//www.youtube.com/embed/LgNLRT6QyQE', '//www.youtube.com/embed/jsqVLa1yy1M']
$ pip show newspaper
---
Name: newspaper
Version: 0.0.7
Location: /home/adamgriffiths/.anaconda/envs/collected-redux/lib/python2.7/site-packages
Requires: lxml, requests, nltk, Pillow, cssselect, BeautifulSoup
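Until the extractor prepends a scheme itself, a small workaround is to resolve the protocol-relative URLs against the article URL (a sketch for the Python 2.7 setup shown above; on Python 3 the import is urllib.parse.urljoin):

from urlparse import urljoin

movies = [urljoin(a.url, m) for m in a.movies]
# ['http://www.youtube.com/embed/LgNLRT6QyQE', 'http://www.youtube.com/embed/jsqVLa1yy1M']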
I noticed that to get unicode, you rely on the requests package's request.text attribute (in network.py -> get_html). To get this, requests only uses the HTTP header encoding declaration (requests.utils.get_encoding_from_headers()), and reverts to ISO-8859-1 if it doesn't find one. This results in incorrect character encoding in a lot of cases.
You can use another function from requests to give you the encodings listed in the HTML, requests.utils.get_encodings_from_content(), which will work to fill in the gaps. What I generally do is test the request object's encoding first. If it's not ISO-8859-1, then it has been passed an encoding, and I return the request.text unicode. If it is, then I call requests.utils.get_encodings_from_content(), which parses via regex. It returns a list of suggested encodings from the content to try, which are generally correct.
In the final case, neither approach will work; an example is this page: http://boaforma.abril.com.br/fitness/todos-os-treinos/bikes-eletricas-759925.shtml. There is no HTTP header encoding, and the content carries an incorrect encoding declaration: content="text/html; charset=uISO-8859-1". Here we could use chardet or fall back to the original ISO-8859-1 encoding that requests defaults to (it works in this case).
I'd be happy to add this to the code if desired so you can pull it. Would it be most appropriate to put this into the network.py file?
Edit: Also, I have a large collection of special snowflake links that provide decoding difficulties and edge cases that we could add to the test suite if necessary.
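A sketch of the fallback chain described above (the requests helpers named here are real; chardet as a last resort is the optional extra step, and this is written Python 2-style to match the environment in this thread):

import requests
import chardet  # optional last resort

def robust_text(response):
    # If requests got a charset from the HTTP headers, trust its decoding.
    if response.encoding and response.encoding.lower() != 'iso-8859-1':
        return response.text
    # Otherwise look for <meta> charset declarations in the body (regex-based).
    for enc in requests.utils.get_encodings_from_content(response.content):
        try:
            return response.content.decode(enc)
        except (LookupError, UnicodeDecodeError):
            continue
    # Last resort: guess with chardet, else keep requests' ISO-8859-1 default.
    guess = chardet.detect(response.content)['encoding']
    return response.content.decode(guess or 'ISO-8859-1', errors='replace')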
Articles not being parsed from Memoize?
import newspaper
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=True)
for article in cnn_paper.articles:
    print article.url
It runs the first time, since nothing is cached yet, and prints all the results. The second time, nothing is printed -- BLANK --.
File "/home/tim/Workspace/Development/hacks/pressmonitor/pressmon/articles/management/commands/collect_articles.py", line 27, in handle
article.nlp()
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/article.py", line 307, in nlp
summary_sents = nlp.summarize(title=self.title, text=self.text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/nlp.py", line 34, in summarize
sentences = split_sentences(text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/newspaper/nlp.py", line 146, in split_sentences
sentences = tokenizer.tokenize(text)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
for el in it:
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
for aug_tok in tokens:
File "/home/tim/.virtualenvs/pressmon/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
So far the newspaper library does a decent job with the basic API calls, but for a lot of the foreign language stuff and configuration details it is still a bit clunky. But do not worry, in the next 48 hours a HUGE revamp will be done on this library.
I'll make it very seamless to change languages, and auto-detect languages.
I will fix Chinese and Arabic extractions (right now they are broken because I was incorrectly using the requests library: response.content vs response.text for foreign articles).
I will also add a few more languages to the suite.
After installing the script, I get the following error when importing newspaper:
ImportError: cannot import name timegm
Any idea why this is happening?
import newspaper
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
for article in cnn_paper.articles:
    print article.url
The code above works sometimes and other times it doesn't. I'm working in a virtualenv; however, all the required libraries are also installed system-wide.
Traceback (most recent call last):
File "/Users/Shapath/Developer/Python/Newspaper/Newspaper/Parser/newspaperparser.py", line 1, in <module>
import newspaper
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/__init__.py", line 10, in <module>
from .article import Article, ArticleException
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/article.py", line 16, in <module>
from . import images
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/images.py", line 20, in <module>
from . import urls
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/urls.py", line 17, in <module>
from .packages.tldextract import tldextract
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/packages/tldextract/__init__.py", line 1, in <module>
from .tldextract import extract, TLDExtract
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/newspaper/packages/tldextract/tldextract.py", line 37, in <module>
import pkg_resources
File "/Users/Shapath/Developer/Python/Newspaper/lib/python2.7/site-packages/pkg_resources.py", line 76, in <module>
import parser
File "/Users/Shapath/Developer/Python/Newspaper/Newspaper/Parser/parser.py", line 4, in <module>
cnn_paper = newspaper.build('http://cnn.com')
Upon re-examining the code of this lib (which has many chunks taken from various parts of the open source community), I've come to the conclusion that it's totally shit and really needs to be refactored.
Will try to make something happen this weekend.
"Portugease" should be "Portuguese"
Any chance of hosting a demo of newspaper in action so we can try it out before going through the setup steps?
It'd be nice to have a "try before you buy" before committing to the setup.
Hey, so I downloaded the library and tried writing a small Python program to run the scripts you mentioned in the tutorial. But nothing runs. Newspaper is found, but beyond that, nothing.
Error for proof:
Traceback (most recent call last):
File "/newspaper.py", line 1, in <module>
import newspaper
File "/newspaper.py", line 3, in <module>
cnn_paper = newspaper.build('http://cnn.com')
AttributeError: 'module' object has no attribute 'build'
[Finished in 0.0s with exit code 1]
Is it possible to assign an already-downloaded HTML string to the Article object without calling the download() method?
I want to use it in a Scrapy project where the HTML page is already downloaded, so I simply need to parse it.
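A sketch of how this could look, assuming a set_html()-style setter gets exposed (the method name here is hypothetical, not a confirmed API):

from newspaper import Article

# html is the page body already fetched by Scrapy (or any other downloader)
article = Article(url='http://example.com/some-article')
article.set_html(html)  # hypothetical setter that bypasses download()
article.parse()
print article.title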
It looks like something has broken during the refactor perhaps. Essentially, I'm unable to install newspaper, either from a local directory or via git using pip.
setup.py specifies the newspaper.data package as a dependency, but the data/ directory doesn't exist any more, and the install therefore fails.
Python 3.4 + Windows + pip install =
SyntaxError: invalid syntax
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 16, in
\AppData\Local\Temp\pip_build_piyush\newspaper\setup.py", line 60
print ''
^
SyntaxError: invalid syntax
It doesn't work on http://mil.news.sina.com.cn/2014-01-25/1318761762.html.
Only the strong characters are recognized; the result is recorded below:
>>> print a.text
中新网北京1月24日电 (记者 孙自法)“2013中国科学年度新闻人物”评选结果24日晚在北京揭晓,领衔科研团队在国际上首次实现“量子反常霍尔效应”的薛其坤院士、神舟载人飞船系统总设计师张柏楠、中国首艘航母关键配套导航系统领航人张崇猛、运-20总设计师唐长红院士等10名科技专家,从40位候选人中脱颖而出、成功当选。
这次评出的十大中国科学年度新闻人物包括基础研究领域科学家3名、技术创新和科技成果转化杰出者3名、科技企业领军人物3名、科技传播者1名,他们分别是:
――中国科学院院士、清华大学副校长薛其坤。2013年,他带领研究团队,在国际上首次实现“量子反常霍尔效应”,让中国科学界站在了下一次信息革命的战略制高点。
――中科院院士、清华大学生命学院院长施一公。这位知名结构生物学家的科研小组2013年研究进展不断,包括“运用X-射线晶体学手段在细胞凋亡研究领域做出突出贡献,为开发新型抗癌、预防老年痴呆的药物提供重要线索”等。
――量子世界“追梦人”、中国科学技术大学教授陈宇翱。2013年,凭借在光子、冷原子量子操纵和量子信息、量子模拟等领域的杰出贡献,他荣获2013年度“菲涅尔奖”。
――中国航天科技集团空间技术研究院载人飞船系统总设计师张柏楠。2013年,他带领团队突破一系列关键技术,实现天宫一号与神舟十号手控交会对接,完成中国载人天地往返运输系统的首次应用性飞行。
BS4 should provide an extremely robust solution to parsing articles of questionable encoding, etc.
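For example, bs4's UnicodeDammit tries the declared encodings first and then guesses, which makes it a handy building block here (a minimal sketch):

from bs4 import UnicodeDammit

dammit = UnicodeDammit(raw_bytes)   # raw response body as bytes
html = dammit.unicode_markup        # best-effort unicode version
print dammit.original_encoding      # the encoding it settled on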
When extracting the article node as HTML using a.article_html, the <img> tags are not kept. I noticed that in the clean_html(cls, node) function 'img' is allowed, so why is it not included in the article_html output?
article_cleaner.allow_tags = ['a', 'span', 'p', 'br', 'strong', 'b',
    'em', 'i', 'tt', 'code', 'pre', 'blockquote', 'img', 'h1',
    'h2', 'h3', 'h4', 'h5', 'h6']
article_cleaner.remove_unknown_tags = False
I've got a running list of URLs that newspaper doesn't work phenomenally against. Is there an open issue to catalogue these? In most cases, it's able to grab the list of articles from the home page, but completely unable to decipher each individual article into readable values.
For example, this link gets basically nothing:
http://www.empireonline.com/news/story.asp?NID=40344
I work mostly with Russian articles. In Russian, «angled» quotes are the main variant of quotation marks. I noticed that if there are «angled» quotes in the title of an article, all closing quotation marks are removed from the extracted title, and it contains only the opening ones. I found that it happens here in ContentExtractor:
TITLE_REPLACEMENTS = ReplaceSequence().create(u"&raquo;").append(u"»")
...
return TITLE_REPLACEMENTS.replaceAll(title).strip()
As far as I understand, this is needed for removing » from titles where this character is used as a delimiter. Maybe it would make sense to modify the replacements so they don't remove right quotes that have left quotes before them?
Here's an example of a page that has broken quotation marks in its extracted title (Russian language).
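One way to sketch that suggestion: drop a » only when it doesn't close an earlier «, so paired «...» quotes survive while a bare site-name delimiter is still stripped (a hypothetical helper, not the library's actual code):

def clean_title(title):
    # Keep a » that closes an earlier «; drop unmatched » delimiters.
    out = []
    open_quotes = 0
    for ch in title:
        if ch == u'«':
            open_quotes += 1
        elif ch == u'»':
            if open_quotes == 0:
                continue  # unmatched »: site-name delimiter, drop it
            open_quotes -= 1
        out.append(ch)
    return u''.join(out).strip()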
When running article.parse() I am running into memory issues with a large number of articles being processed.
Each time the function is called it eats up about 0.5MB of memory that is not released when the parsing is done.
I took a look at the parse() function in article.py and it looks like the release_resources() function still has a TODO to be properly implemented:
https://github.com/codelucas/newspaper/blob/master/newspaper/article.py#L355
I'm curious if you can give more detail about a proper implementation of this function so that parse() will release the memory once it is done with it.
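For what it's worth, a hedged sketch of what release_resources() could do, assuming the parse products live in attributes like html, doc, and top_node (the attribute names are assumptions from reading article.py, not a confirmed design):

def release_resources(self):
    # Drop references to the raw html and the parsed lxml trees so the
    # interpreter can reclaim that memory; keep only the extracted strings.
    self.html = None
    self.doc = None
    self.clean_doc = None
    self.top_node = None
    self.clean_top_node = None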
I ran into problems using newspaper on Brazilian sites.
Here is an example:
import newspaper
info = newspaper.build('http://globoesporte.globo.com/futebol/times/sao-paulo')
len(info.articles)
It returned only 3 articles.
Sorry if I am using it incorrectly.
I've been playing around with newspaper today. Looks awesome. I've had trouble picking up a sufficient number of categories on a number of sites... might be a good idea to add docs for adding new categories to sources.
Newspaper currently creates its cache folder under ~/.newspaper_scraper. This should be a configurable option, and it should be possible to disable it altogether for those not using the 'memoized' functionality of newspaper.
Is there a way to add custom user agents while building the paper? Is it possible to add one, or are there any tricks to do it right now?
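newspaper's Config object has a browser_user_agent field that build() accepts, so something like the following should work (treat the exact attribute name as an assumption against older versions):

import newspaper
from newspaper import Config

config = Config()
config.browser_user_agent = 'Mozilla/5.0 (compatible; MyCrawler/1.0)'
paper = newspaper.build('http://cnn.com', config=config)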
It would be lovely if I could port this to Ruby. I think one issue I'll have to deal with is replacing Beautiful Soup with Nokogiri. I need to sit down and go through all of the code to spot any issues that may arise.
I am following the example at http://newspaper.readthedocs.org/en/latest/user_guide/advanced.html#advanced. After calling news_pool.join(), I then attempt to call parse() on an article belonging to a paper from the pool. It fails with the ArticleException: "You must download() an article before parsing it!".
Hi,
It's a minor thing, but I tried to see what languages are available by calling newspaper.languages(), and it exited with a KeyError: nb exception. It seems someone forgot to add this language to the language_dict inside print_available_languages(), defined in newspaper/utils/__init__.py.
I know the fix for this; I'll wait until tomorrow to implement it, it's late. I'll have setup.py install the required nltk tokenizers.
I currently use Boilerpipe to do article extraction in order to generate Kindle MOBI files to send to my Kindle. I'm wondering if it's possible to feature-request the ability to do something similar in Newspaper: in that the article text extraction retains a minimal set of markup around it, enough to give the text structure as far as HTML is concerned. This makes forward conversion to other formats a lot easier, and allows the ability to retain certain markup that can only be expressed using HTML (such as images in situ and code fragments).
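Note that newspaper has a keep_article_html config flag that preserves a cleaned HTML rendering of the extracted body, which may already cover part of this (the flag exists in the source; whether your installed version honors it this way is an assumption):

from newspaper import Article

a = Article('http://example.com/some-article', keep_article_html=True)
a.download()
a.parse()
print a.article_html  # extracted body with minimal markup retained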
I'm wondering if there's a way to retain <a> tags in article.top_node, or alternatively just extract all the URLs from within the article's html. My hope is to be able to find when an article links to any number of other articles. I'm currently digging around in the source to find the best place to include this, but could use guidance. Thanks!
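Since top_node is an lxml element, one way to pull every link out of the extracted article is a plain XPath over it after parse() (a sketch; note the caveat in the next issue about top_node being overwritten with the cleaned node):

from newspaper import Article

a = Article('http://example.com/some-article')
a.download()
a.parse()
links = a.top_node.xpath('.//a/@href')  # all hrefs inside the article body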
In Article.parse, top_node is overwritten with the cleaned node. Then Article.clean_top_node is copied from this. Both nodes are equal. I'm not sure what the reasons are, but it prevents extraction by external tools by hiding the extracted article html.
Preferably, Article.top_node shouldn't be overwritten, and existing code should be modified to use clean_top_node where required.
Bullet points in articles are not extracted and are totally missing from the extracted text.
Hello. I keep running into an error when parsing the downloaded articles. The error has to do with "Couldn't open file /home/.../newspaper/utils/../resources/text/stopwords-tr.txt". I am not sure where the issue is coming from, but my guess is that it might come from some updates in python-goose? Thanks.
I would like to use newspaper on Brazilian sites. I tested it, but without success.
The articles are not extracted correctly.
Is newspaper internationalized and I am just using it the wrong way, or can we add support for Brazilian Portuguese?
Thank you.
Add clean_body_classes from the python-goose upstream. Without this, there are cases where the body tag may get removed.
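For context, the upstream python-goose cleaner step amounts to stripping the class attribute from <body> before class-based cleaning runs, roughly like this (a sketch of the idea, not a verbatim copy of goose's code):

def clean_body_classes(doc):
    # Remove the class attribute from <body> so that class-keyword
    # cleaning rules never match and remove the body element itself.
    body = doc.find('body')
    if body is not None and 'class' in body.attrib:
        del body.attrib['class']
    return doc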
Great work with this library, just a little typo I've noticed: I think the argument in newspaper.build is supposed to be "memorize_articles", not "memoize_articles".
Can you add a new section where describing how to add a new language support?
import newspaper
Traceback (most recent call last):
File "", line 1, in
File "newspaper/init.py", line 10, in
from .article import Article, ArticleException
File "newspaper/article.py", line 15, in
from . import nlp
File "newspaper/nlp.py", line 171
if (normalized > 1.0) #just in case
I have been following the example in the README and I encountered this:
>>> article = cnn_paper.articles[1]
>>> article.download()
>>> article.parse()
>>> article.nlp()
Traceback (most recent call last):
zipfile.BadZipfile: File is not a zip file
The publishing date of an article is critical. I believe a good way of extracting publishing dates is to use a set of regex patterns and/or some notable id/class names.
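A sketch of that idea: check a few common meta tags first, then fall back to a date pattern in the URL (the tag list and regex are illustrative, not exhaustive; doc is assumed to be the parsed lxml document):

import re

DATE_META_XPATHS = [
    '//meta[@property="article:published_time"]/@content',
    '//meta[@name="pubdate"]/@content',
    '//time[@datetime]/@datetime',
]
URL_DATE_RE = re.compile(r'/(20\d{2})[/-](\d{1,2})[/-](\d{1,2})/')

def extract_pub_date(doc, url):
    # Prefer explicit metadata, then a date embedded in the URL path.
    for xp in DATE_META_XPATHS:
        values = doc.xpath(xp)
        if values:
            return values[0]
    match = URL_DATE_RE.search(url)
    if match:
        return '-'.join(match.groups())
    return None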
First, thanks a lot for the great tool. I've been trying it out, and it seems magic (except for some corner cases, websites for which it doesn't work, etc.), but really cool :)
However, I tried it in a setting with scarce resources (1G of RAM), and I have the impression that the memory keeps growing build after build until ... memory error. I deactivated memoizing articles, tried to empty the articles, and dereferenced the sources, but it looks like a bunch of other things are also memoized and kept in memory, with no means to deactivate them. What is the best way to handle this? How does newspaper handle the increase of memory usage build after build? Is there a limit?
Thanks again for the magic tool :)
I tried newspaper 0.0.6 with a bunch of Arabic websites and it didn't seem to fetch any articles.
In [38]: newspaper.version.version_info
Out[38]: (0, 0, 6)
In [39]: alarabiya = newspaper.build('http://www.alarabiya.net/', language='ar')
In [40]: tahrirnews = newspaper.build('http://tahrirnews.com/', language='ar')
In [41]: ahram = newspaper.build('http://www.ahram.org.eg/', language='ar')
In [42]: almasryalyoum = newspaper.build('http://www.almasryalyoum.com/', language='ar')
In [43]: for src in (alarabiya, tahrirnews, ahram, almasryalyoum):
....: print(src.size())
....:
0
0
0
0
I'm not sure if this is an OS X 10.10 or possibly even an Xcode 6.1 command line tools issue, but I'm having some trouble installing this.
Once I get to the pip install newspaper step, it errors out while building lxml. Any thoughts on what could be going wrong?
Robs-MacBook-Air:~ rob$ pip install newspaper
Requirement already satisfied (use --upgrade to upgrade): newspaper in /Library/Python/2.7/site-packages
Downloading/unpacking lxml (from newspaper)
Downloading lxml-3.4.0.tar.gz (3.5MB): 3.5MB downloaded
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py) egg_info for package lxml
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Building lxml version 3.4.0.
Building without Cython.
Using build configuration of libxslt 1.1.28
warning: no previously-included files found matching '*.py'
Downloading/unpacking requests (from newspaper)
Downloading requests-2.4.3-py2.py3-none-any.whl (459kB): 459kB downloaded
Downloading/unpacking nltk (from newspaper)
Downloading nltk-3.0.0.tar.gz (962kB): 962kB downloaded
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/nltk/setup.py) egg_info for package nltk
warning: no files found matching 'Makefile' under directory '*.txt'
warning: no previously-included files matching '*~' found anywhere in distribution
Downloading/unpacking Pillow (from newspaper)
Downloading Pillow-2.6.1.tar.gz (7.3MB): 7.3MB downloaded
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/Pillow/setup.py) egg_info for package Pillow
warning: no files found matching '*.yaml'
warning: no files found matching '*.bdf' under directory 'Images'
warning: no files found matching '*.fli' under directory 'Images'
warning: no files found matching '*.gif' under directory 'Images'
warning: no files found matching '*.icns' under directory 'Images'
warning: no files found matching '*.ico' under directory 'Images'
warning: no files found matching '*.jpg' under directory 'Images'
warning: no files found matching '*.pbm' under directory 'Images'
warning: no files found matching '*.pil' under directory 'Images'
warning: no files found matching '*.png' under directory 'Images'
warning: no files found matching '*.ppm' under directory 'Images'
warning: no files found matching '*.psd' under directory 'Images'
warning: no files found matching '*.tar' under directory 'Images'
warning: no files found matching '*.webp' under directory 'Images'
warning: no files found matching '*.xpm' under directory 'Images'
warning: no files found matching 'README' under directory 'Sane'
warning: no files found matching 'README' under directory 'Scripts'
warning: no files found matching '*.icm' under directory 'Tests'
warning: no files found matching '*.txt' under directory 'Tk'
Downloading/unpacking cssselect (from newspaper)
Downloading cssselect-0.9.1.tar.gz
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/cssselect/setup.py) egg_info for package cssselect
no previously-included directories found matching 'docs/_build'
Downloading/unpacking BeautifulSoup (from newspaper)
Downloading BeautifulSoup-3.2.1.tar.gz
Running setup.py (path:/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/BeautifulSoup/setup.py) egg_info for package BeautifulSoup
Installing collected packages: lxml, requests, nltk, Pillow, cssselect, BeautifulSoup
Running setup.py install for lxml
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Building lxml version 3.4.0.
Building without Cython.
Using build configuration of libxslt 1.1.28
building 'lxml.etree' extension
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/src/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
cc -bundle -undefined dynamic_lookup -arch x86_64 -arch i386 -Wl,-F. -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.10-intel-2.7/lxml/etree.so
ld: warning: directory not found for option '-F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks'
ld: framework not found CrashReporterSupport
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'cc' failed with exit status 1
Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip-dac3OE-record/install-record.txt --single-version-externally-managed --compile:
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Building lxml version 3.4.0.
Building without Cython.
Using build configuration of libxslt 1.1.28
running install
running build
running build_py
creating build
creating build/lib.macosx-10.10-intel-2.7
creating build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/_elementpath.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/builder.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/cssselect.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/doctestcompare.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/ElementInclude.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/pyclasslookup.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/sax.py -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/usedoctest.py -> build/lib.macosx-10.10-intel-2.7/lxml
creating build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/includes
creating build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/_diffcommand.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/_html5builder.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/_setmixin.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/builder.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/clean.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/defs.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/diff.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/ElementSoup.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/formfill.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/html5parser.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/soupparser.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
copying src/lxml/html/usedoctest.py -> build/lib.macosx-10.10-intel-2.7/lxml/html
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron
copying src/lxml/isoschematron/__init__.py -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron
copying src/lxml/lxml.etree.h -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/lxml.etree_api.h -> build/lib.macosx-10.10-intel-2.7/lxml
copying src/lxml/includes/c14n.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/config.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/dtdvalid.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/etreepublic.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/htmlparser.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/relaxng.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/schematron.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/tree.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/uri.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xinclude.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xmlerror.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xmlparser.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xmlschema.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xpath.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/xslt.pxd -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/etree_defs.h -> build/lib.macosx-10.10-intel-2.7/lxml/includes
copying src/lxml/includes/lxml-version.h -> build/lib.macosx-10.10-intel-2.7/lxml/includes
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/rng
copying src/lxml/isoschematron/resources/rng/iso-schematron.rng -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/rng
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl
copying src/lxml/isoschematron/resources/xsl/RNG2Schtrn.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl
copying src/lxml/isoschematron/resources/xsl/XSD2Schtrn.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl
creating build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_abstract_expand.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_dsdl_include.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_message.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_skeleton_for_xslt1.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_svrl_for_xslt1.xsl -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/readme.txt -> build/lib.macosx-10.10-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
creating build/temp.macosx-10.10-intel-2.7
creating build/temp.macosx-10.10-intel-2.7/src
creating build/temp.macosx-10.10-intel-2.7/src/lxml
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/src/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
cc -bundle -undefined dynamic_lookup -arch x86_64 -arch i386 -Wl,-F. -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks -framework CrashReporterSupport build/temp.macosx-10.10-intel-2.7/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.10-intel-2.7/lxml/etree.so
ld: warning: directory not found for option '-F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.Internal.sdk/System/Library/PrivateFrameworks'
ld: framework not found CrashReporterSupport
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'cc' failed with exit status 1
----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip-dac3OE-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /private/var/folders/1l/yvmzyjz95_g11p540_g9sz100000gn/T/pip_build_rob/lxml
Storing debug log for failure in /Users/rob/Library/Logs/pip.log
Here's the stack trace:
[Parse lxml ERR] line 1045: Tag nav invalid
[Article parse ERR] http://www.cnet.com/products/apple-ipad-march-2012/
You must download and parse an article before parsing it!
Traceback (most recent call last):
File "crawler.py", line 30, in <module>
a.nlp()
File "/root/.virtualenvs/cnet-crawler/local/lib/python2.7/site-packages/newspaper/article.py", line 276, in nlp
raise ArticleException()
newspaper.article.ArticleException
I'm not using the concurrent version, and I'm not building a newspaper from a URL; rather, I have a list of all the articles and I build a new Article from each of them.
ve = build(" http://www.le360.ma/fr", memoize_articles=False)
links = dict()
for each in ve.articles:
    links[each.title] = each.url
-> links is empty