
scrapely's Introduction

Scrapy

Overview

Scrapy is a fast, high-level web crawling and web scraping framework, licensed under BSD, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors.

Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.

Documentation

Documentation is available online at https://docs.scrapy.org/ and in the docs directory.

Releases

You can check https://docs.scrapy.org/en/latest/news.html for the release notes.

Community (blog, Twitter, mailing list, IRC)

See https://scrapy.org/community/ for details.

Contributing

See https://docs.scrapy.org/en/master/contributing.html for details.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.

By participating in this project you agree to abide by its terms. Please report unacceptable behavior to [email protected].

Companies using Scrapy

See https://scrapy.org/companies/ for a list.

Commercial Support

See https://scrapy.org/support/ for details.

scrapely's People

Contributors

alexriina, ambientlighter, cyberplant, dangra, decause, eliasdorneles, elrull, hackrush01, kalessin, kmike, marekyggdrasil, okey, pablohoffman, plafl, redapple, robsonpeixoto, ruairif, shaneaevans, tpeng, vad

scrapely's Issues

Unable to pull in https

I'm trying to follow the intro documentation. I changed the training URL to an HTTPS one and got the following.

Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.6/site-packages/scrapely/init.py", line 48, in train
page = url_to_page(url, encoding)
File "/usr/local/lib/python3.6/site-packages/scrapely/htmlpage.py", line 183, in url_to_page
fh = urlopen(url)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

I'm using Python 3.6.5.

URL was:

https://www.amazon.com/Xbox-One-X-1TB-Console/dp/B074WPGYRF/ref=sr_1_3?s=videogames&ie=UTF8&qid=1524486645&sr=1-3&keywords=xbox%2Bone%2Bx&th=1

Thanks!
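
For reference, one possible workaround sketch, assuming the 503 comes from the site rejecting urllib's default User-Agent rather than from HTTPS itself: fetch the HTML with explicit headers and train from an HtmlPage (HtmlPage and train_from_htmlpage appear in other tracebacks in this list; the header and training values here are illustrative):

from urllib.request import Request, urlopen

from scrapely import Scraper
from scrapely.htmlpage import HtmlPage

url = 'https://www.amazon.com/Xbox-One-X-1TB-Console/dp/B074WPGYRF'  # shortened form of the URL above
# Send a browser-like User-Agent; the urllib default is often answered with 503.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
body = urlopen(req).read().decode('utf-8', errors='replace')

s = Scraper()
page = HtmlPage(url, body=body, encoding='utf-8')
s.train_from_htmlpage(page, {'title': 'Xbox One X 1TB Console'})  # illustrative training value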

Obtaining sectioned article text

Hello,

I have a project that I was looking to use Scrapely for. From what I've read, it sounds like something I would like to use. I have run into a problem with it, though: when I pass a URL that contains sectioned article text (which appears to be almost all of my URLs), I only receive the first section of the text.

Here's a site that I tried: http://www.autostraddle.com/12-black-friday-deals-you-can-get-without-having-to-put-pants-on-266850/

and here's what I used to train scrapely:

{'title':'15 Things You Learn When You Move In With Your Girlfriend', 'author': 'by Kate', 'postdate':'November 10, 2014 at 9:00am PST', 'count':'82', 'content':'There comes a point in every relationship when it makes sense for you to think about cohabitation.'}

If I then have scrapely scrape that same URL, it only gives me that first paragraph.

So my question is: how would I get scrapely to obtain all of the article's main text (basically the text between the social media icons)?

Any help would be greatly appreciated!

Thanks

Does the order of annotations matter - Weird output

I've been playing with scrapely, and this script generates some weird output:

  1. annotate url1
  2. try scraping url1, got the expected output
  3. annotate url2
  4. try scraping url2, got nothing from scraping url2.

I thought it could be train(), since it is not supposed to be fully reliable, but when I exported the annotated data the annotations seemed alright.

Then I inverted the order:

  1. annotate url2
  2. try scraping url2, got the expected output
  3. annotate url1
  4. try scraping url1, got something different from the annotation (a subset of what was annotated)

Is this expected behaviour?
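
For reference, a minimal sketch of the two orderings described above; url1, url2 and the field values are placeholders:

from scrapely import Scraper

url1 = 'http://example.com/page1'  # placeholder
url2 = 'http://example.com/page2'  # placeholder

# Ordering A: annotate url1 first, then url2
s = Scraper()
s.train(url1, {'title': 'Title on page 1'})
print(s.scrape(url1))  # expected output
s.train(url2, {'title': 'Title on page 2'})
print(s.scrape(url2))  # reported: nothing extracted

# Ordering B: the reverse order
s = Scraper()
s.train(url2, {'title': 'Title on page 2'})
print(s.scrape(url2))  # expected output
s.train(url1, {'title': 'Title on page 1'})
print(s.scrape(url1))  # reported: only a subset of what was annotated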

Installing via pip on Python 3.7 fails

➜  beacon-scrapy git:(master) ✗ pip3 install scrapely
Collecting scrapely
  Downloading https://files.pythonhosted.org/packages/5e/8b/dcf53699a4645f39e200956e712180300ec52d2a16a28a51c98e96e76548/scrapely-0.13.4.tar.gz (134kB)
    100% |████████████████████████████████| 143kB 5.1MB/s
Requirement already satisfied: numpy in /usr/local/lib/python3.7/site-packages (from scrapely) (1.15.0)
Requirement already satisfied: w3lib in /usr/local/lib/python3.7/site-packages (from scrapely) (1.19.0)
Requirement already satisfied: six in /usr/local/lib/python3.7/site-packages (from scrapely) (1.11.0)
Building wheels for collected packages: scrapely
  Running setup.py bdist_wheel for scrapely ... error
  Complete output from command /usr/local/opt/python/bin/python3.7 -u -c "import setuptools, tokenize;__file__='/private/var/folders/7c/dm671s4x4v5bm8_6tprr861r0000gn/T/pip-install-p7z3xbo1/scrapely/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/7c/dm671s4x4v5bm8_6tprr861r0000gn/T/pip-wheel-7rl3xgbc --python-tag cp37:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.13-x86_64-3.7
  creating build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/descriptor.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/version.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/extractors.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/__init__.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/template.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/htmlpage.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/tool.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  creating build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/pageobjects.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/similarity.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/__init__.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/regionextract.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/pageparsing.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  running egg_info
  writing scrapely.egg-info/PKG-INFO
  writing dependency_links to scrapely.egg-info/dependency_links.txt
  writing requirements to scrapely.egg-info/requires.txt
  writing top-level names to scrapely.egg-info/top_level.txt
  reading manifest file 'scrapely.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  writing manifest file 'scrapely.egg-info/SOURCES.txt'
  copying scrapely/_htmlpage.c -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/_htmlpage.pyx -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/extraction/_similarity.c -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/_similarity.pyx -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  running build_ext
  building 'scrapely._htmlpage' extension
  creating build/temp.macosx-10.13-x86_64-3.7
  creating build/temp.macosx-10.13-x86_64-3.7/scrapely
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -I/usr/local/lib/python3.7/site-packages/numpy/core/include -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c scrapely/_htmlpage.c -o build/temp.macosx-10.13-x86_64-3.7/scrapely/_htmlpage.o
  scrapely/_htmlpage.c:7367:65: error: too many arguments to function call, expected 3, have 4
      return (*((__Pyx_PyCFunctionFast)meth)) (self, args, nargs, NULL);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                     ^~~~
  /Library/Developer/CommandLineTools/usr/lib/clang/9.1.0/include/stddef.h:105:16: note: expanded from macro 'NULL'
  #  define NULL ((void*)0)
                 ^~~~~~~~~~
  1 error generated.
  error: command 'clang' failed with exit status 1

iso-8859-1

Trying to scrape pages with a content-encoding of iso-8859-1 throws a unicode error:
>>> url1 = 'http://www[DOT]getmobile[DOT]de/handy/NO68128,Nokia-C3-01-Touch-and-Type.html' #url changed to prevent backlinking
>>> data = {'name': 'Nokia C3-01 Touch and Type', 'price': '129,00'}
>>> s.train(url1,data)
Traceback (most recent call last):
File "", line 1, in
File "build/bdist.macosx-10.6-universal/egg/scrapely/init.py", line 32, in train
File "build/bdist.macosx-10.6-universal/egg/scrapely/init.py", line 50, in _get_page
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1512-1514: invalid data
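
One thing worth trying, assuming the train(url, data, encoding=None) signature that appears in other tracebacks in this list, is to pass the page encoding explicitly instead of relying on the UTF-8 default:

from scrapely import Scraper

s = Scraper()
url1 = 'http://www.example.com/handy/nokia-c3-01.html'  # placeholder for the obfuscated URL above
data = {'name': 'Nokia C3-01 Touch and Type', 'price': '129,00'}
s.train(url1, data, encoding='iso-8859-1')  # tell scrapely the real page encoding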

Is really Python 3 supported?

I have problems running scrapely with Python 3.
Scrapely depends on slybot, which depends on Scrapy, which depends on Twisted, which doesn't yet support Python 3.

Please remove the info about supporting Python 3, or give instructions on how it can be made to work.

error [SSL: CERTIFICATE_VERIFY_FAILED] on travel sites

I'm just starting with this tool and I'm trying to scrape travel prices, but I got the error [SSL: CERTIFICATE_VERIFY_FAILED].

from scrapely import Scraper

s = Scraper()
url1 = 'XXXXX' # URL of site
data = {'price': '16.929'}
s.train(url1, data)

url2 = 'XXXXX' # URL of same site but different search params, same destination and origin just one month later
print(s.scrape(url2))

Full console log:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1318, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1400, in connect
server_hostname=server_hostname)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 401, in wrap_socket
_context=self, _session=session)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 808, in init
self.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 1061, in do_handshake
self._sslobj.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 683, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "spider.py", line 6, in
s.train(url1, data)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapely/init.py", line 48, in train
page = url_to_page(url, encoding)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapely/htmlpage.py", line 183, in url_to_page
fh = urlopen(url)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749)>

Any idea what could be the problem?
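
A possible workaround sketch, assuming the failure is the local Python installation being unable to verify the site's certificate chain: fetch the page with an explicit SSL context and train from an HtmlPage. Disabling verification is insecure and only suitable for experimentation; the URL and field values are placeholders.

import ssl
from urllib.request import urlopen

from scrapely import Scraper
from scrapely.htmlpage import HtmlPage

url1 = 'https://travel.example.com/search'  # placeholder URL

# Insecure: skip certificate verification (experimentation only).
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

body = urlopen(url1, context=ctx).read().decode('utf-8', errors='replace')
page = HtmlPage(url1, body=body, encoding='utf-8')

s = Scraper()
s.train_from_htmlpage(page, {'price': '16.929'})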

Random failing doctests

I have run the test suite multiple times and sometimes it fails (and sometimes it doesn't) due to:

======================================================================
FAIL: Doctest: scrapely.extraction.regionextract.RecordExtractor
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.7/doctest.py", line 2226, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for scrapely.extraction.regionextract.RecordExtractor
  File "/home/daniel/src/scrapely/scrapely/extraction/regionextract.py", line 291, in RecordExtractor

----------------------------------------------------------------------
File "/home/daniel/src/scrapely/scrapely/extraction/regionextract.py", line 306, in scrapely.extraction.regionextract.RecordExtractor
Failed example:
    ex.extract(page)
Expected:
    [{u'description': [u'description'], u'name': [u'name']}]
Got:
    [{u'name': [u'name'], u'description': [u'description']}]


----------------------------------------------------------------------
Ran 99 tests in 0.578s

FAILED (failures=1)
ERROR: InvocationError: '/home/daniel/src/scrapely/.tox/py27/bin/nosetests scrapely tests'
_______________________________________________________________________________________________ summary _______________________________________________________________________________________________
ERROR:   py27: commands failed

Specifying integer values in the data dict

Amazing work! This is really useful.

I ran into a minor issue with the way you provide data. The documentation does not say you can't provide integer values, so I ended up providing this data:

In [1]: from scrapely import Scraper

In [2]: s = Scraper()

In [3]: data = {'name': 'scrapy/scrapely', 'url': 'https://github.com/scrapy/scrapely', 'description': 'A pure-python HTML screen-scraping library', 'watchers': 42, 'forks': 9}

In [4]: url = "https://github.com/scrapy/scrapely"

and ran into this exception:

In [5]: s.train(url, data)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

...

/home/ubuntu/scrapely/scrapely/template.py in func(fragment, page)
     93     def func(fragment, page):
     94         fdata = page.fragment_data(fragment).strip()
---> 95         if text in fdata:
     96             return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
     97         else:

TypeError: 'in <string>' requires string as left operand

It took me a while to realize what the issue was: the integer values in the data variable.

So, you can either coerce it all to unicode strings:

if unicode(text) in fdata:
    return float(len(unicode(text))) / len(fdata) - (1e-6 * fragment.start)

or specify in the documentation that values should all be strings.
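
Until one of those happens, a caller-side sketch is to coerce all values to strings before training (plain Python, no scrapely changes needed):

from scrapely import Scraper

raw = {'name': 'scrapy/scrapely', 'url': 'https://github.com/scrapy/scrapely',
       'description': 'A pure-python HTML screen-scraping library',
       'watchers': 42, 'forks': 9}
data = {k: str(v) for k, v in raw.items()}  # 42 -> '42', 9 -> '9' (use unicode() on Python 2)

s = Scraper()
s.train('https://github.com/scrapy/scrapely', data)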

benchmarks?

Compared to using XPath with Scrapy itself, does scrapely deliver better performance with its own learning system?

Extract from javascript?

Would it be possible to pull values out of JavaScript on a page? For example, I'm looking to pull some content that is contained within a string, such as "Here is some string with my value 1998". I want to annotate the 1998; however, a lot of the time I get a bunch of HTML along with it. But in the JS, there is a variable that holds what I need:

<script>
data['item'] = {
  "year": "1998"
};
</script>

Would this be possible?

Thx
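
scrapely annotates the HTML token stream, so values that only exist inside a <script> block are awkward for it. One sketch that sidesteps scrapely entirely is to fetch the body and pull the value out with a regular expression; the pattern below is specific to the example above, and the URL is a placeholder.

import re
from urllib.request import urlopen

url = 'http://example.com/item'  # placeholder URL
body = urlopen(url).read().decode('utf-8', errors='replace')

# Look for the "year": "1998" pair inside the inline data['item'] = {...} assignment.
match = re.search(r'"year"\s*:\s*"(\d{4})"', body)
year = match.group(1) if match else None
print(year)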

Drop Python 2.6 support

What about dropping Python 2.6 support after a new scrapely version is released, i.e. making scrapely 0.13.0 the last version which supports Python 2.6?

Python 3 support

Just a place for prospective porters to start with their ideas.
Would really love this to be ported but I don't have the time to do it myself.

add a tag for 0.10 release

I'm not sure I can add it without breaking the auto-release setup - I've never done this before. //cc @dangra?

Tag should be added for this commit: 62a46da (I've checked an archive uploaded to pypi for 0.10 versions and compared it with the source code).

README Usage (command line tool) correction

In the Usage (command line tool) section, after the text "To add a new annotation, you usually test the selection criteria first:"

The command says:

scrapely> a 0 w3lib 1.1

which should be corrected to:

scrapely> t 0 w3lib 1.1

Also, in the scrapely command line:

help t 

prints

ts <template> <text> - test selection text

when it should print:

t <template> <text> - test selection text

Just adding these for the benefit of others who got confused like I did.

Multiple matches?

I am very interested in using scrapely in a project and started playing with it. Is it possible to find multiple matches on a page? It seems to only find one.

Html page containing more than one single entity. How to annotate?

Let's imagine that an HTML page contains more than one entity to extract.

Does Scrapely have direct support for this?

I'm currently handling this situation manually. I will add it to scrapely if I don't find existing support, as soon as I understand the project in more detail.

Import Error: Cannot import name 'Scraper'

I'm trying to build something with the Scrapely library. After a bit of fixing I finally got all install issues out of the way.
Running the sample code:

from scrapely import Scraper
s = Scraper()
url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)

I get the error:

Import Error: Cannot import name 'Scraper'

How would I fix this?

Use in production

I got very curious about this project. Today I use Scrapy a lot, with BeautifulSoup, and this makes me think scrapely could be used too.

Anybody using this in production?
Any gotchas?

l

l

ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

Hi,
I am having the following problem. Not sure if I am following the right steps.
This is the repro.
Regards,

--------------------------------------
root
--------------------------------------
root@tex:/home/scraper# python --version
Python 3.4.3+
root@tex:/home/scraper# virtualenv venv_scrapely
Using base prefix '/usr'
New python executable in /home/scraper/venv_scrapely/bin/python3
Also creating executable in /home/scraper/venv_scrapely/bin/python
Installing setuptools, pip, wheel...done.
root@tex:/home/scraper# ls -lrt
total 4
drwxr-xr-x 5 root root 4096 Feb  6 18:23 venv_scrapely
root@tex:/home/scraper# source ./venv_scrapely/bin/activate
(venv_scrapely) root@tex:/home/scraper# pip install scrapely
Collecting scrapely
Collecting w3lib (from scrapely)
  Using cached w3lib-1.16.0-py2.py3-none-any.whl
Collecting numpy (from scrapely)
  Using cached numpy-1.12.0-cp34-cp34m-manylinux1_i686.whl
Requirement already satisfied: six in ./venv_scrapely/lib/python3.4/site-packages (from scrapely)
Installing collected packages: w3lib, numpy, scrapely
Successfully installed numpy-1.12.0 scrapely-0.13.3 w3lib-1.16.0
(venv_scrapely) root@tex:/home/scraper#
(venv_scrapely) root@tex:/home/scraper# pip list
(1.4.0)
numpy (1.12.0)
packaging (16.8)
pip (9.0.1)
pyparsing (2.1.10)
scrapely (0.13.3)
setuptools (34.1.1)
six (1.10.0)
w3lib (1.16.0)
wheel (0.29.0)
------------------------
with user scraper
------------------------
scraper@tex:$ source ./venv_scrapely/bin/activate
(venv_scrapely) scraper@tex:~$ python --version
Python 3.4.3+
(venv_scrapely) scraper@tex:~$ python
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from scrapely import Scraper
>>> s=Scraper()
>>> url1='https://github.com/ripple/rippled'
>>> data={'name':'ripple/rippled','commits':'11,292','releases':'66','contributors':'56'}
>>> s.train(url1,data)
>>> url2='https://github.com/scrapy/scrapely/'
>>> s.scrape(url2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/__init__.py", line 53, in scrape
    return self.scrape_page(page)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/__init__.py", line 59, in scrape_page
    return self._ex.extract(page)[0]
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/__init__.py", line 119, in extract
    extracted = extraction_tree.extract(extraction_page)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 575, in extract
    items.extend(extractor.extract(page, start_index, end_index, self.template.ignored_regions))
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 351, in extract
    _, _, attributes = self._doextract(page, extractors, start_index, end_index, **kwargs)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 396, in _doextract
    labelled, start_index, end_index_exclusive, self.best_match, **kwargs)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/similarity.py", line 148, in similar_region
    data_length - range_end, data_length - range_start)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/similarity.py", line 85, in longest_unique_subsequence
    matches = naive_match_length(to_search, subsequence, range_start, range_end)
  File "scrapely/extraction/_similarity.pyx", line 155, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3845)
  File "scrapely/extraction/_similarity.pyx", line 158, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3648)
  File "scrapely/extraction/_similarity.pyx", line 87, in scrapely.extraction._similarity.np_naive_match_length (scrapely/extraction/_similarity.c:2802)
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

How to use HTML data instead of direct URLs

An older issue mentions the 'train_from_htmlpage' method, but is it not working anymore? What I'm trying to do is provide preprocessed HTML data (with the UTF-8 conversion already done to make scrapely work) to scrapely.
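
For reference, a sketch of training and scraping from pre-fetched HTML using HtmlPage, train_from_htmlpage and scrape_page (all three appear in tracebacks elsewhere in this list); the URL, body and field values are placeholders:

from scrapely import Scraper
from scrapely.htmlpage import HtmlPage

url = 'http://example.com/item'  # placeholder
html_text = u'<html>...preprocessed UTF-8 body...</html>'  # your already-converted HTML

s = Scraper()
page = HtmlPage(url, body=html_text, encoding='utf-8')
s.train_from_htmlpage(page, {'name': 'Example item'})

# Scraping a pre-fetched page works the same way:
other = HtmlPage('http://example.com/other-item', body=html_text, encoding='utf-8')
print(s.scrape_page(other))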

how to use to_file method

I tested scrapely with your example, but I don't know how to store templates to a file (or database).
I tried:

from scrapely import Scraper
s = Scraper()
url1 = 'http://pypi.python.org/pypi/w3lib'
data = {'name': 'w3lib 1.0', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)

s.tofile('testemplatefile')
Traceback (most recent call last):
File "", line 1, in
File "scrapely/init.py", line 28, in tofile
json.dump({'templates': tpls}, file)
File "/usr/lib/python2.7/json/init.py", line 182, in dump
fp.write(chunk)
AttributeError: 'str' object has no attribute 'write'

so I tested:

s = Scraper('abc.json')
url1 = 'http://pypi.python.org/pypi/w3lib'
data = {'name': 'w3lib 1.0', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)
Traceback (most recent call last):
File "", line 1, in
File "scrapely/init.py", line 41, in train
self.templates.append(tm.get_template())
AttributeError: 'str' object has no attribute 'append'
s.tofile(url1)
Traceback (most recent call last):
File "", line 1, in
File "scrapely/init.py", line 27, in tofile
tpls = [page_to_dict(x) for x in self.templates]
File "scrapely/htmlpage.py", line 32, in page_to_dict
'url': page.url,

What should I do to store a template to a file (or database) and then use it again? Maybe Redis is my database of choice...
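
The AttributeError above ('str' object has no attribute 'write') suggests tofile() expects an open file object, not a filename. A sketch of storing and re-loading templates under that assumption (and assuming Scraper.fromfile is a classmethod that returns a new Scraper):

from scrapely import Scraper

s = Scraper()
url1 = 'http://pypi.python.org/pypi/w3lib'
data = {'name': 'w3lib 1.0', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)

# Store the trained templates as JSON, passing an open file object.
with open('templates.json', 'w') as f:
    s.tofile(f)

# Later (or in another process), rebuild a scraper from that file.
with open('templates.json') as f:
    s2 = Scraper.fromfile(f)  # assumed classmethod returning a new Scraper
print(s2.scrape(url1))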

Incorrect cleaning of <img> tag

Hi guys, I was looking for an HTML cleaner and found one inside the Scrapely library. After some trials, I found a bug that I believe is critical.

It is expected that the img tag appears in its self-closing form (<img src='github.png' />), but it might appear like this: <img src='stackoverflow.png'>. In this case, safehtml cleans the text incorrectly. For example, see this test in the terminal:

>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my'

IMHO, the expected output was my img is <strong>cool</strong>. The same behavior is seen with the <input> tag.

Best regards,

How to extract a list of items

How can I extract a list of items?
example.html

Beer red 1.50
Coffee black 2.0
Corn yellow 3.65

I know i can do it

data = {
  'name': 'Beer',
  'color': 'red',
  'price': '1.50'
}

s = scrapely.Scraper()
s.train('http://example.com', data)
...

to train on example.html, but how can I extract the rest of the data? I mean, I need to extract a list of items from that page.

How to scrape within Python using generated JSON from command line?

After doing:

python -m scrapely.tool myscraper.json
scrapely> ta http://pypi.python.org/pypi/w3lib/1.1
scrapely> a 0 w3lib 1.1 -n 0 -f name

How would I then use the myscraper.json from within Python for scraping?

I tried:

with open('myscraper.json') as f:
     s.fromfile(f)
     m = s.scrape('http://pypi.python.org/pypi/Django/1.3')

But it returns nothing.
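
Assuming the command-line tool writes the same 'templates' JSON that tofile() produces, and that Scraper.fromfile is a classmethod returning a new Scraper, the likely fix is to bind its return value instead of calling it for a side effect:

from scrapely import Scraper

# Load the templates saved by the command-line tool and bind the result;
# fromfile returns a new Scraper rather than modifying an existing one.
with open('myscraper.json') as f:
    s = Scraper.fromfile(f)

print(s.scrape('http://pypi.python.org/pypi/Django/1.3'))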

ZeroDivisionError when training with zero-length data

(Minor bug.)
I installed scrapely from pip this morning.

This is a wacky edge case, but I think you could raise a more constructive error.

(Who wants to extract a zero-length string from a document? It's a bit like a magician pulling some atmosphere out of a hat: it's always going to be there...)

Check it out:

In [97]: from scrapely import Scraper

In [98]: s = Scraper()

In [99]: s.train('http://www.google.com', {'image': u''})
- - - - - - - - - - - - - - - - -
ZeroDivisionError                         Traceback (most recent call last)
/home/username/myfolder/<ipython-input-99-233d0ac90e7f> in <module>()
----> 1 s.train('http://www.google.com', {'image': u''})

/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train(self, url, data, encoding)
     44     def train(self, url, data, encoding=None):
     45         page = url_to_page(url, encoding)
---> 46         self.train_from_htmlpage(page, data)
     47 
     48     def scrape(self, url, encoding=None):

/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train_from_htmlpage(self, htmlpage, data)
     39                 if isinstance(value, str):
     40                     value = value.decode(htmlpage.encoding or 'utf-8')
---> 41                 tm.annotate(field, best_match(value))
     42         self.add_template(tm.get_template())
     43 

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in annotate(self, field, score_func, best_match)
     31 
     32         """
---> 33         indexes = self.select(score_func)
     34         if not indexes:
     35             raise FragmentNotFound("Fragment not found annotating %r using: %s" % 

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in select(self, score_func)
     46         matches = []
     47         for i, fragment in enumerate(htmlpage.parsed_body):
---> 48             score = score_func(fragment, htmlpage)
     49             if score:
     50                 matches.append((score, i))

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in func(fragment, page)
     95         fdata = page.fragment_data(fragment).strip()
     96         if text in fdata:
---> 97             return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
     98         else:
     99             return 0.0

ZeroDivisionError: float division by zero
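
Until a clearer error is raised, a caller-side sketch is to drop empty training values before calling train():

from scrapely import Scraper

raw = {'image': u'', 'title': u'Google'}
# Skip fields whose training value is empty: an empty string is "in" every fragment,
# including zero-length ones, which triggers the division by zero shown above.
data = {k: v for k, v in raw.items() if v and v.strip()}

s = Scraper()
s.train('http://www.google.com', data)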

Question: automate training

Hi,

I was wondering if it would be possible to do automated training, using something like boilerpipe or goose to do title, content and date discovery?

I know that by default those libraries don't supply the extracted nodes, just the values... Figured it was better to ask before diving in, to see if somebody has already done this.

safehtml should ensure tabular content safety

safehtml should ensure that tabular content is safe to display, enforcing <table> tags where needed. Take as an example:

>>> print safehtml(htmlregion(u'<span>pre text</span><tr><td>hello world</td></tr>'))
u'pre text<tr><td>hello world</td></tr>'

That output will break any table layout where the content is rendered.

Support for passing HTML, not just URLs

http://groups.google.com/group/scraperwiki/browse_thread/thread/d750d093ca5220bf
... was posted, wanting to use Mechanize to download HTML [since the data was behind a login] and Scrapely to parse it.

As far as I can see, Scrapely doesn't support that.

I've made https://scraperwiki.com/scrapers/scrapely-hack/ to try to work around that.

The core change is in Scraper._get_page where:

if html:
    body = html.decode(encoding)
else:

is added before

    body = urllib.urlopen(url).read().decode(encoding)

and an optional 'html' parameter is added to Scraper.scrape, .train and _get_page [and passed to _get_page], with the 'url' parameter made optional.

safehtml omit some important (all) attributes of tags

Let's consider that someone (like me) wants to keep an img tag, so the src attribute of that tag would be important to them. But the safehtml() function omits all the attributes of the relevant tag.
I think it would be better to keep the attributes of allowed_tags, or to add another parameter named allowed_attributes to specify which attributes to keep.

Correct example at README.rst

w3lib has changed version, so the example should be:

data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}

If not, it raises a FragmentNotFound exception.

problem with bad encoding and BOM?

I'm trying to annotate this page http://departamentos.inmobusqueda.com.ar/alquileres/buenos-aires/san-justo/74082/ that seems to have a UTF-8 BOM but to be encoded as iso-8859-1.

I cannot find a way to annotate the page; scrapely won't be able to annotate it (at least from the console).

I was able to find a workaround: downloading the file, converting it with iconv from iso-8859-1 to UTF-8, and putting it on a server on localhost (it then has a strange converted BOM, but it could be annotated with scrapely).

Duplicate Values (but valid) in the same html

Hi - I just noticed this library's unsupervised learning mechanism; I guess this is based on wrapper induction. This is a wonderful implementation. I have been playing around with it for a few days.

What I am trying to understand is what happens if a value repeats, e.g. a flight ticket that has two journeys in the same itinerary (just as a practical example):

Outbound Journey:
Depart : Newyork
Arrive: Dallas

Forward Journey:
Depart: Dallas
Arrive: California

If I train for the "Dallas" value, which can be both Arrive & Depart, scrapely reports that the fragment is already annotated.

data = {'Depart_1': 'Newyork', 'Arrive_1': 'Dallas', 'Depart_2': 'Dallas', 'Arrive_2': 'California'}

This is an example with places; it could just as well be times, dates, etc., or return journeys in the same itinerary.

How can we achieve this using scrapely?

Anshuk

Slow Extraction Times

It's currently taking me around 2s to run the extraction on a single page.

Following is the output of the line profiler:
'''
Line #, Hits, Time, Per Hit, % Time, Line Contents

53                                           def extract(url, page, scraper):
54                                               """Returns a dictionary containing the extraction output
55                                               """
56        10         2923    292.3      0.1      page = unicode(page, errors = 'ignore')
57        10       704147  70414.7     17.8      html_page = HtmlPage(url, body=page, encoding = 'utf-8')
58                                           
59        10      2604545 260454.5     65.9      ex = InstanceBasedLearningExtractor(scraper.templates)
60        10       640413  64041.3     16.2      records = ex.extract(html_page)[0]
61        10          141     14.1      0.0      return records[0]

'''

Am I doing something wrong? The extraction code is similar to that found in tool.py and __init__.py, but I get faster extraction times when I run scrapely from the command line than when using the code above.

Please advise.
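
Based on the profile above (about 66% of the time is spent constructing InstanceBasedLearningExtractor), one sketch, assuming the templates do not change between pages, is to build the extractor once and reuse it, mirroring the same calls used in the snippet above; the training call is a placeholder:

from scrapely import Scraper
from scrapely.extraction import InstanceBasedLearningExtractor
from scrapely.htmlpage import HtmlPage

scraper = Scraper()
scraper.train('http://example.com/item1', {'name': 'Example'})  # placeholder training call

# Build the extractor once, outside the per-page loop.
# 'templates' is the attribute the snippet above already uses; newer versions may name it differently.
ex = InstanceBasedLearningExtractor(scraper.templates)

def extract(url, body):
    """Extract records from one page, reusing the prebuilt extractor."""
    page = HtmlPage(url, body=body, encoding='utf-8')
    records = ex.extract(page)[0]
    return records[0] if records else None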

Interest in other wrapper induction techniques?

Hi all,

I'm sorry if this is not the right place for this discussion. If there is a more appropriate forum, I'd be happy to move over there.

I've been digging into the wrapper induction literature, and have really appreciated the work that y'all have done with this library and pydepta and mdr.

I'd like to build a library using the ideas from the Trinity paper or @AdiOmari's SYNTHIA approach.

It does not seem like wrapper induction libraries are currently a very active area of interest for you, but I wanted to know whether these (or other methods) would be of interest to y'all.

tool.parse_criteria normalizes whitespace

Unfortunately, this breaks on templates with criteria that include multiple whitespace characters.

This can be seen on this page: http://lookbook.nu/msha (//h1[@class="left rightspaced inline"]/a/text()) with the following scrapely session:

scrapely> ta http://lookbook.nu/msha
[1] http://lookbook.nu/msha
scrapely> t 0 Melisa  I
scrapely> 

Please release a version with better Python 3 support

I saw a lot of commits, like 58f0886, that solve my problem:

test/pipeline/test_item_validator.py:3: in <module>
    from newscrawler.pipelines import ItemValidatorPipeline
newscrawler/pipelines.py:12: in <module>
    from scrapely.extractors import safehtml, htmlregion, _TAGS_TO_REPLACE
.eggs/scrapely-0.12.0-py3.5.egg/scrapely/__init__.py:4: in <module>
    from scrapely.htmlpage import HtmlPage, page_to_dict, url_to_page
.eggs/scrapely-0.12.0-py3.5.egg/scrapely/htmlpage.py:8: in <module>
    import re, hashlib, urllib2
E   ImportError: No module named 'urllib2'

Is it possible to release a new scrapely version?

How can I help?

Thanks

remove most Scrapy mentions from the README

I think we should remove the Scrapy mentions from the README. It is weird that, instead of install instructions or a package description, we start the README with a chapter about Scrapy, essentially explaining that scrapely is not related to Scrapy.

Installing via pip on Python 3.7 still fails

When installing with Python 3.7, it still fails.

Collecting scrapely
  Using cached https://files.pythonhosted.org/packages/5e/8b/dcf53699a4645f39e200956e712180300ec52d2a16a28a51c98e96e76548/scrapely-0.13.4.tar.gz
Requirement already satisfied: numpy in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from scrapely) (1.15.2)
Requirement already satisfied: w3lib in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from scrapely) (1.19.0)
Requirement already satisfied: six in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from scrapely) (1.11.0)
Installing collected packages: scrapely
  Running setup.py install for scrapely ... error
    Complete output from command /Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7 -u -c "import setuptools, tokenize;__file__='/private/var/folders/p5/w24gg45x3mngmm2nk1v8v18h0000gn/T/pip-install-chwlaolb/scrapely/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/p5/w24gg45x3mngmm2nk1v8v18h0000gn/T/pip-record-l4aa8igy/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.9-x86_64-3.7
    creating build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/descriptor.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/version.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/extractors.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/template.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/htmlpage.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/tool.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    creating build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/pageobjects.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/similarity.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/regionextract.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/pageparsing.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    running egg_info
    writing scrapely.egg-info/PKG-INFO
    writing dependency_links to scrapely.egg-info/dependency_links.txt
    writing requirements to scrapely.egg-info/requires.txt
    writing top-level names to scrapely.egg-info/top_level.txt
    reading manifest file 'scrapely.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'scrapely.egg-info/SOURCES.txt'
    copying scrapely/_htmlpage.c -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/_htmlpage.pyx -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/extraction/_similarity.c -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/_similarity.pyx -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    running build_ext
    building 'scrapely._htmlpage' extension
    creating build/temp.macosx-10.9-x86_64-3.7
    creating build/temp.macosx-10.9-x86_64-3.7/scrapely
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/include -I/Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c scrapely/_htmlpage.c -o build/temp.macosx-10.9-x86_64-3.7/scrapely/_htmlpage.o
    scrapely/_htmlpage.c:7367:65: error: too many arguments to function call, expected 3, have 4
        return (*((__Pyx_PyCFunctionFast)meth)) (self, args, nargs, NULL);
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                     ^~~~
    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/10.0.0/include/stddef.h:105:16: note: expanded from macro 'NULL'
    #  define NULL ((void*)0)
                   ^~~~~~~~~~
    1 error generated.
    error: command 'gcc' failed with exit status 1

Wrong tag getting annotated

I have the following situation. Can you guys look into this issue?

    <span class='a-color-price'>  <-------- desired tag to be annotated as this contains the data
           <span class="currencyINR">   <----- actual tag being annotated and hence outputting wrongly as &nbsp;&nbsp;
            	&nbsp;&nbsp;
           </span>
        237.00  <-------- desired data passed for training
    </span>

I tried looking at the code to see if I could understand what's going on, but it was slightly hard to figure out.

What do you mean by "The training implementation is currently very simple and is only provided for references purposes, to make it easier to test Scrapely and play with it."

Could you explain in more detail the meaning of this sentence from the README.md?

The training implementation is currently very simple and is only provided for references purposes, to make it easier to test Scrapely and play with it. ...you should use train() with caution and make sure it annotates the area of the page you intended

What are the problems that may arise from using the training implementation?

A mismatch between the encoding of the data provided as input and the encoding of the HTML pages?
Others?

If you can make a list of all the known problems, I may help with the development of a fix for one of them.
