sensiblecodeio / scraperwiki-python Goto Github PK

View Code? Open in Web Editor NEW

160.0 25.0 69.0 270 KB

ScraperWiki Python library for scraping and saving data

Home Page: https://scraperwiki.com

License: BSD 2-Clause "Simplified" License

Python 97.17% Shell 0.26% Makefile 0.45% Dockerfile 2.12%

scraper scraperwiki

scraperwiki-python's Introduction

ScraperWiki Python library

This is a Python library for scraping web pages and saving data. It is the easiest way to save data on the ScraperWiki platform, and it can also be used locally or on your own servers.

Installing

pip install scraperwiki

Scraping

scraperwiki.scrape(url[, params][,user_agent])

Returns the downloaded string from the given url.

params are sent as a POST if set.

user_agent sets the user-agent string if provided.

Saving data

Helper functions for saving and querying an SQL database. Updates the schema automatically according to the data you save.

Currently only supports SQLite. It will make a local SQLite database. It is based on SQLAlchemy. You should expect it to support other SQL databases at a later date.

scraperwiki.sql.save(unique_keys, data[, table_name="swdata"])

Saves a data record into the datastore into the table given by table_name.

data is a dict object with field names as keys; unique_keys is a subset of data.keys() which determines when a record is overwritten. For large numbers of records data can be a list of dicts.

scraperwiki.sql.save is entitled to buffer an arbitrary number of rows until the next read via the ScraperWiki API, an exception is hit, or until process exit. An effort is made to do a timely periodic flush. Records can be lost if the process experiences a hard-crash, power outage or SIGKILL due to high memory usage during an out-of-memory condition. The buffer can be manually flushed with scraperwiki.sql.flush().

scraperwiki.sql.execute(sql[, vars])

Executes any arbitrary SQL command. For example CREATE, DELETE, INSERT or DROP.

vars is an optional list of parameters, inserted when the SQL command contains ‘?’s. For example:

scraperwiki.sql.execute("INSERT INTO swdata VALUES (?,?,?)", [a,b,c])

The ‘?’ convention is like "paramstyle qmark" from Python's DB API 2.0 (but note that the API to the datastore is nothing like Python's DB API). In particular the ‘?’ does not itself need quoting, and can in general only be used where a literal would appear. (Note that you cannot substitute in, for example, table or column names.)

scraperwiki.sql.select(sqlfrag[, vars])

Executes a select command on the datastore. For example:

scraperwiki.sql.select("* FROM swdata LIMIT 10")

Returns a list of dicts that have been selected.

vars is an optional list of parameters, inserted when the select command contains ‘?’s. This is like the feature in the .execute command, above.

scraperwiki.sql.commit()

Commits to the file after a series of execute commands. (sql.save auto-commits after every action).

scraperwiki.sql.show_tables([dbname])

Returns an array of tables and their schemas in the current database.

scraperwiki.sql.table_info(name)

Returns an array of attributes for each element of the table.

scraperwiki.sql.save_var(key, value)

Saves an arbitrary single-value into a table called swvariables. Intended to store scraper state so that a scraper can continue after an interruption.

scraperwiki.sql.get_var(key[, default])

Retrieves a single value that was saved by save_var. Only works for string, float, or int types. For anything else, use the pickle library to turn it into a string.

Miscellaneous

scraperwiki.status(type, message=None): If run on the ScraperWiki platform (the new one, not Classic), updates the visible status of the dataset. If not on the platform, does nothing. type can be 'ok' or 'error'. If no message is given, it will show the time since the update. See dataset status API in the documentation for details.
scraperwiki.pdftoxml(pdfdata): Convert a byte string containing a PDF file into an XML file containing the coordinates and font of each text string (see the pdftohtml documentation for details). This requires pdftohtml which is part of poppler-utils.

Environment Variables

SCRAPERWIKI_DATABASE_NAME: default: scraperwiki.sqlite - name of database
SCRAPERWIKI_DATABASE_TIMEOUT: default: 300 - number of seconds database will wait for a lock

scraperwiki-python's People

Contributors

Stargazers

Watchers

scraperwiki-python's Issues

No new release since July

Or is this an intentional hold off until #58 is resolved...?

Offline/standalone use - default database

For offline/standalone use, what's the default and suggested best practice for specifying the sqlite database to be attached?

Bare strings not a valid parameter to scraperwiki.sqlite.select

Related to #7 -

scraperwiki.sqlite.select('* from sqlite_master where tbl_name=?',url)
works on classic but not boxes.

Boxes expect a list, and interpret the string as an iterable.

Work around: add [ ] around the single parameter.

pdftohtml should be optional?

Got this error installing scraperwiki_local, but I don't really care about pdftohtml. Would it be possible for it to be optional?

Running setup.py egg_info for package scraperwiki-local
Traceback (most recent call last):
File "", line 14, in
File "/Users/ross/Work/scrapes/build/scraperwiki-local/setup.py", line 37, in
'Local Scraperlibs requires pdftohtml, but pdftoxml was not found\n'
ImportError: Local Scraperlibs requires pdftohtml, but pdftoxml was not found
in the PATH. You probably need to install it.
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

File "", line 14, in

File "/Users/ross/Work/scrapes/build/scraperwiki-local/setup.py", line 37, in

'Local Scraperlibs requires pdftohtml, but pdftoxml was not found\n'

ImportError: Local Scraperlibs requires pdftohtml, but pdftoxml was not found

in the PATH. You probably need to install it.

scraperwiki.sqlite.save_var / .get_var wierdness (poorly defined)

import scraperwiki
print scraperwiki.sqlite.get_var('jam')
scraperwiki.sqlite.save_var('bacon',12)
print scraperwiki.sqlite.get_var('bacon')

If run once, correctly gives:

None
12

If run twice, fails with:

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    print scraperwiki.sqlite.get_var('jam')
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sqlite.py", line 73, in get_var
    dt.execute(u"CREATE TABLE %s (`value` blob, `type` text, `key` text PRIMARY KEY)" % dt._DumpTruck__vars_table, commit = False)
  File "/usr/local/lib/python2.7/dist-packages/dumptruck/dumptruck.py", line 115, in execute
    self.cursor.execute(sql, *args)
sqlite3.OperationalError: table _dumptruckvars already exists

None not a valid parameter to scraperwiki.sqlite.execute

print scraperwiki.sqlite.execute("select * from sqlite_master where tbl_name", None)

works on classic, but not on boxes.

Using "" works on both.

pdftoxml should return a Unicode object

It should call xml.decode(encoding) where encoding can be found in the XML output from pdftoxml command.

doubleflush bug

http://pastebin.com/J4eGcWDC

--> contains additional debugging statments
https://github.com/scraperwiki/scraperwiki-python/tree/doubleflush_bug

#!/usr/bin/env python
import logging
import scraperwiki
import json

logging.basicConfig(level=logging.DEBUG)

data = [{'rowx':x} for x in range(900042)]
#scraperwiki.sql.save(['rowx'], data)

print 'cat'
try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print repr(e)
else:
    print 'no exception'
print 'dog'
try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print repr(e)
else:
    print 'no exception'
print 'nope'

print 'la la la la\n'*6

cat
AttributeError("'NoneType' object has no attribute 'decode'",)
dog
RuntimeError('Double flush',)
nope
la la la la
la la la la
la la la la
la la la la
la la la la
la la la la

scraperwiki.sqlite.save() doesn't coerce dictionaries/lists into string representations

This code works in Classic:

import scraperwiki
d = {"name": {"first_name":"zarino", "last_name":"zappia"}, "age": 24, "id": 1}
scraperwiki.sqlite.save(['id'], d)

But it raises an exception in scraperwiki-python:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/scraperwiki/sqlite.py", line 34, in save
    return dt.upsert(data, table_name = table_name)
  File "/Library/Python/2.7/site-packages/dumptruck/dumptruck.py", line 301, in upsert
    self.insert(upsert=True, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/dumptruck/dumptruck.py", line 284, in insert
    self.execute(sql, values, commit=False)
  File "/Library/Python/2.7/site-packages/dumptruck/dumptruck.py", line 138, in execute
    raise self.sqlite3.InterfaceError(unicode(msg) + '\nTry converting types or pickling.')
sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type.
Try converting types or pickling.

ScraperWiki Classic used to call repr() or something on cell values that aren't strings, integers, floats or nones; scraperwiki-python does not. This is a bug, since scraperwiki-python is meant to replicate Classic as closely as possible.

mysterious "no such table" crash

Traceback (most recent call last):
  File "tool/hooks/refresh", line 131, in 
    main()
  File "tool/hooks/refresh", line 25, in main
    return convert_one(url)
  File "tool/hooks/refresh", line 72, in convert_one
    scraperwiki.sql.save([], features, table_name="feature")
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sql.py", line 187, in save
    fit_row(connection, row, unique_keys)
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sql.py", line 325, in fit_row
    add_column(connection, new_column)
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sql.py", line 345, in add_column
    connection.execute(stmt)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 729, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/ddl.py", line 69, in _execute_on_connection
    return connection._execute_ddl(self, multiparams, params)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 783, in _execute_ddl
    compiled
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 958, in _execute_context
    context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1160, in _handle_dbapi_exception
    exc_info
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 951, in _execute_context
    context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 436, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (OperationalError) no such table: feature u'ALTER TABLE feature ADD COLUMN x FLOAT' ()

What I did to get this is not so clear. I had a pre-existing GeoJSON tool into which I had copied a sqlite file, then I changed the code to the shapefile branch of geojson-tool and then poked it with the UI.

Investigating.

utils.py uses both urllib2 and requests to make requests

Could be tidied.

SQLAlchemy error storing empty strings

I've just updated an old scraperwiki library for the latest pip version and one a breaking change I've come across seems to be the storage of empty strings when SQLalchemy is trying to cast to a float (sqlalchemy.exc.StatementError: could not convert string to float).

I've tidied up functions where I'm supposed to return a number from returning an empty string (which always used to get added as NULL, I think?) to returning a float('nan'), but it strikes me that in the general quick'n'dirty scruffy case where you're scraping and throwing as much stuff into SQLite as quickly as possible that for scraperwiki to be able to cast empty string to NaN or None/NULL as a default when SQLalchemy is trying to cast it to a float, rather than throwing an error, would be handy?

getting the box's URL from Python code is too hard.

table-xtract-tool has the JavaScript UI put the boxUrl in a settings.json file, just so that the Python code can read it out.

This would be much less error prone (and clearer) if it was just

import scraperwiki
scraperwiki.boxURL

Fix test failure in Python 3.5 and up

Output for Python 3.5, but same test failure for Python 3.6, 3.7 and 3.8.

$ nosetests --exe

...................................F.

======================================================================

FAIL: test_empty (tests.TestUniqueKeys)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/scraperwiki/scraperwiki-python/tests.py", line 140, in test_empty

    self.assertEqual(observed, {u'data': [], u'keys': []})

AssertionError: {'data': [], 'keys': ['seq', 'name', 'unique', 'origin', 'partial']} != {'data': [], 'keys': []}

- {'data': [], 'keys': ['seq', 'name', 'unique', 'origin', 'partial']}

+ {'data': [], 'keys': []}

----------------------------------------------------------------------

Ran 37 tests in 1.218s

FAILED (failures=1)

The command "nosetests --exe" exited with 1.

`unique_keys` doesn't work for buffered data

The whole idea around unique_keys=["col"] doesn't work for buffered rows. Since not all modifications are committed directly to the database, this make the unique constraint working unreliably.

To make it work, you have to commit transactions manually after each modification.

scraperwiki.sqlite.commit_transactions()

Unique key issue when running scraper locally

Hi. I was trying out the new, awesome Cobalt service and ran into the following issue while testing locally:

Traceback (most recent call last):
  File "scraper.py", line 35, in <module>
    scraperwiki.sqlite.save(unique_keys=['id'], data=item)
  File "/Library/Python/2.7/site-packages/scraperwiki/sqlite.py", line 30, in save
    return dt.insert(data, table_name = table_name)
  File "/Library/Python/2.7/site-packages/dumptruck/dumptruck.py", line 250, in insert
    self.execute(sql, values, commit=False)
  File "/Library/Python/2.7/site-packages/dumptruck/dumptruck.py", line 107, in execute
    self.cursor.execute(sql, *args)
sqlite3.IntegrityError: column id is not unique

I simply ran python scraper.py and only locally it would do this. While running on Cobalt or in the traditional ScraperWiki web interface it ran fine.

For reference, the scraper is:
https://box.scraperwiki.com/zzolo/mn-registered-voters

I realize this may be an issue for Dumptruck, but wanted to start here first as it.

TypeError on import (init() got an unexpected keyword argument 'adapt_and_convert' )

Using github checkouts of scraperwik-local and dumptruck, this is what happens when I try to import scraperwiki:

In [1]: import scraperwiki

TypeError Traceback (most recent call last)
/home/cin/cinscrapers/ in ()
----> 1 import scraperwiki

/home/dan/.virtualenvs/taxhaven/local/lib/python2.7/site-packages/scraperwiki/init.py in ()
23
24 from .utils import log, scrape, pdftoxml, swimport
---> 25 import utils, sqlite
26 import geo

/home/dan/.virtualenvs/taxhaven/local/lib/python2.7/site-packages/scraperwiki/sqlite.py in ()
9 dt = DumpTruck(dbname = dbname, adapt_and_convert = False)
10
---> 11 _connect()
12
13 def execute(sqlquery, data=[], verbose=1):

/home/dan/.virtualenvs/taxhaven/local/lib/python2.7/site-packages/scraperwiki/sqlite.py in _connect(dbname)
7 'Initialize the database (again). This is mainly for testing'
8 global dt
----> 9 dt = DumpTruck(dbname = dbname, adapt_and_convert = False)
10
11 _connect()

TypeError: init() got an unexpected keyword argument 'adapt_and_convert'

cannot `import scraperwiki` on read only directory

If I can't write to the current directory then import scraperwiki fails with a sqlite (!) error:

drj@services:~$ python -m 'scraperwiki'
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/usr/lib/python2.7/runpy.py", line 109, in _get_module_details
    return _get_module_details(pkg_main_name)
  File "/usr/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/usr/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/usr/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/usr/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/__init__.py", line 10, in <module>
    import utils, sqlite
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sqlite.py", line 11, in <module>
    _connect()
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sqlite.py", line 9, in _connect
    dt = DumpTruck(dbname = dbname,  adapt_and_convert = False)
  File "/usr/local/lib/python2.7/dist-packages/dumptruck/dumptruck.py", line 95, in __init__
    self.connection=self.sqlite3.connect(dbname, detect_types = self.sqlite3.PARSE_DECLTYPES)
sqlite3.OperationalError: unable to open database file

This is horrible, and most unlibrary like.

It shouldn't try and open the sqlite file until it needs it.

We found this problem when deploying the newsreader website which being a good web server cannot write to its filesystem. It's loading data-services-helpers and it's not even using import scraperwiki, it just gets sucked in by data-services-helpers.

Doesn't work in Python 3.

Nigel Hall reports https://groups.google.com/forum/#!topic/scraperwiki/bN_6nMQZXHQ/discussion

that it does not pip install on Python 3.

Not Python 3 compatible

Change README file from .html to .rst

reStructuredText is the format PyPI uses, so this would make https://pypi.python.org/pypi/scraperwiki page look right.

Github also supports it, meaning the README would look fine there.

Documentation:
http://docutils.sourceforge.net/rst.html

[Inconsistency in scraperwiki.sqlite.get_var before save_var.

@scraperdragon posted this to the dummy repository issue tracker, so I copy it here.

On Scraperwiki Classic:

import scraperwiki
print scraperwiki.sqlite.get_var('kitten')
returns None.

On boxes:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/src/scraperwiki-local/scraperwiki/sqlite.py", line 66, in get_var
return dt.get_var(name)
File "/usr/local/lib/python2.7/dist-packages/src/dumptruck/dumptruck/dumptruck.py", line 273, in get_var
data = self.execute(u'SELECT * FROM %s WHEREname= ?' % vt, [key], commit = False)
File "/usr/local/lib/python2.7/dist-packages/src/dumptruck/dumptruck/dumptruck.py", line 107, in execute
self.cursor.execute(sql, *args)
sqlite3.OperationalError: no such table: swvariables

Dependency on pdftohtml

The method pdftoxml is dependent on the utility pdftohtml. If the utility is missing, no error is reported but no xml is returned either on calling the method on pdf file data. To fix the problem pdftohtml needs to be installed which is part of poppler-utils. So run sudo apt-get install poppler-utils. The devs need to update the Readme file for dependencies.

Needs to require lxml on install

If installing fresh (e.g. pip install scraperwiki), without lxml already installed, you will get an error:

File "dumptruck/dumptruck.py", line 28, in
import lxml.etree

INSERT OR REPLACE INTO is sqlite-specific

Turns out that postgres bails when we try to do a .save:

In [8]: time S.save([], {"a": "b"})

ProgrammingError: (ProgrammingError) syntax error at or near "OR"
LINE 1: INSERT OR REPLACE INTO test (a) VALUES ('b')
               ^
 'INSERT OR REPLACE INTO test (a) VALUES (%(a)s)' {'a': 'b'}

Import should not have side-effects

On further thought, it's devious that just importing something should have a side effect. An explicit activation call would be more polite.

Not least because it breaks my PEP8 checker due to it being an "unused" import!

It's very non-standard (I have never seen an import "doing" stuff before) and therefore it's a trap for the unwary. I can imagine the "fun" that will ensue when someone carelessly copy / pastes a load of imports, "knowing" that it's safe to do so.

show_tables should be scraperwiki.sql.meta?

To keep it in some sync with new platform API.

Would return same JSON as new API.

pdftoxml() docs inconsistent with code

Docs:

scraperwiki.pdftoxml(pdfdata, options='')

Code:

def pdftoxml(pdfdata):
    ...

Which of these should be fixed?

pdftoxml in utils.py is not portable to Windows.

The /dev/null needs to be NUL on Windows.
NamedTemporaryFile behaves differently in Windows to Unix.

Deploy master to pypi

Requested by @paulfurley.

TODO:

#18 (pw) - rst README
#29

sql.select and sql.execute functions don't take "?" arguments as documented

The documentation says scraperwiki.sql.select() and scraperwiki.sql.execute() take an optional second argument of a list/tuple of strings to be escaped and inserted into the given SQL query wherever there are question marks "?", like so:

import scraperwiki

scraperwiki.sql.execute('CREATE TABLE "people" (?,?)', ['name', 'age'])
# should execute something like: CREATE TABLE "people" ("name", "age")

scraperwiki.sql.select('* FROM "people" WHERE ?=?', ("name", "Tony Stark"))
# should execute something like: "SELECT * FROM "people" WHERE "name"="Tony Stark"

Running the above code, however, results in an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/scraperwiki/sqlite.py", line 19, in execute
    result = dt.execute(sqlquery, data, commit=False)
  File "/Library/Python/2.7/site-packages/dumptruck/dumptruck.py", line 136, in execute
    self.cursor.execute(sql, *args)
sqlite3.OperationalError: near "?": syntax error

scraperwiki passes the query (containing question marks) and the list of substitutions to dumptruck (sqlite.py line 3), but dumptruck just ignores the list and leaves the question marks in the sql query, passing it directly to the underlying sql.cursor.execute where it causes the exception (dumptruck.py line 129).

Either scraperwiki needs to perform the qmark value escaping and substitutions, before passing the completed query to dumptruck, or dumptruck needs to do it.

I, as a person writing a tool that uses the scraperwiki python library, certainly shouldn't have to write my own functions for escaping SQL strings. Life's too short.

sqlalchemy.exc.OperationalError on writing to a table `details`

As described in this issue:

https://bitbucket.org/scraperwikids/contract-finder/issues/2/sqlalchemyexcoperationalerror-for-most

This scraper fails on:

scraperwiki.sqlite.save(
            unique_keys=['URL'],
            data=record,
            table_name='details')

For version 0.5.1 with an error that begins:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: details [SQL:

scraperwiki does not support python version 3.10.7

After updating platform to early_release and python to version 3.10.7, the scrappers failed:
Traceback (most recent call last):
File "/app/scraper.py", line 13, in
import scraperwiki
File "/app/.heroku/python/lib/python3.10/site-packages/scraperwiki/init.py", line 12, in
from . import sql
File "/app/.heroku/python/lib/python3.10/site-packages/scraperwiki/sql.py", line 2, in
from collections import Iterable, Mapping, OrderedDict
ImportError: cannot import name 'Iterable' from 'collections' (/app/.heroku/python/lib/python3.10/collections/init.py)

This happens because in python 3.10.7 there is no more Iterable from collections.
For 3,10.7 it may be necessary the folowing code:
try:
from collections.abc import Iterable
except ImportError:
from collections import Iterable

Proposal: scraperwiki.sql.variable for getting/setting variables.

Basically, like scraperwiki.sqlite.save_var, but cleaned up.

same interface in Python and JavaScript.
share basic datatypes (string, number, list, dict) between Python and JavaScript. Bonus for "native" dates.
looks like a dictionary(*): scraperwiki.sql.variable['myvar'] = 7
stored in SQL table _sw_variable (note: starts with underscore to show that it's an internal table)

*maybe not in javascript if we can't work out how to do that

scraperwiki.runlog is not documented

Cannot change the columns after the first save.

rm scraperwiki.sqlite
python -c 'import scraperwiki; scraperwiki.sql.save(["id"], dict(id=1, foo=7))'
python -c 'import scraperwiki; scraperwiki.sql.save(["id"], dict(id=1, foo=3, bar=4))'

I get

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "scraperwiki/sql.py", line 186, in save
    connection.execute(insert.values(row))
  File "/home/drj/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 664, in execute
    return meth(self, multiparams, params)
  File "/home/drj/.local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 282, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/drj/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 761, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/home/drj/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 874, in _execute_context
    context)
  File "/home/drj/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1023, in _handle_dbapi_exception
    exc_info
  File "/home/drj/.local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 185, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/home/drj/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 867, in _execute_context
    context)
  File "/home/drj/.local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 388, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (OperationalError) table swdata has no column named bar u'INSERT OR REPLACE INTO swdata (foo, id, bar) VALUES (?, ?, ?)' (3, 1, 4)

This is a bit elusive. Doing the two saves in the same process is just fine (I suspect the batching is hiding the bug).

pdftoxml - problem with -nodrm switch?

I'm trying out scraperwiki local for the first time and just installed (on a Mac) what I guess is a required library for pdftoxml calls:

brew install pdftohtml

This pulls down:

which does not have a -nodrm switch? (This switch is in the scraperwiki call and causes the pdftohtml call to fail?)

Are you using a more recent library (if so, from where can I install it?)

Can't save lxml strings.

[edited on 2014-09-22] scraperwiki.sql.save() can't save values that are instances of lxml.etree._ElementStringResult (see example below).

Discovered when running the archinterface scraper: it crashes because it was trying to save some sort of lxml string object.

Error on `atexit` event when using only pdftoxml util function

Hi there,

I am using scraperwiki due to its pdftoxml function, which is quite handy. However, when running some custom tests that depend on this function, I end up greeted with the following stacktrace right after my tests pass:

Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/path/to/virtualenv/local/lib/python2.7/site-packages/scraperwiki/sql.py", line 126, in commit_transactions
    if _State._transaction is not None:
AttributeError: 'NoneType' object has no attribute '_transaction'

I am not using any of the SQL-related functions in my test files. How is it that the _State class is None in a file-context function (commit_transactions)?

Unicode column names fail the second time you use them.

>>> import scraperwiki
>>> scraperwiki.sqlite.save(data = {"i": 1, u"a\xa0b": 1}, unique_keys = ['i'])
>>> scraperwiki.sqlite.save(data = {"i": 1, u"a\xa0b": 1}, unique_keys = ['i'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dragon/.local/lib/python2.7/site-packages/scraperwiki/sql.py", line 202, in save
    fit_row(connection, row, unique_keys)
  File "/home/dragon/.local/lib/python2.7/site-packages/scraperwiki/sql.py", line 354, in fit_row
    add_column(connection, new_column)
  File "/home/dragon/.local/lib/python2.7/site-packages/scraperwiki/sql.py", line 374, in add_column
    connection.execute(stmt)
  File "/home/dragon/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 841, in execute
    return meth(self, multiparams, params)
  File "/home/dragon/.local/lib/python2.7/site-packages/sqlalchemy/sql/ddl.py", line 69, in _execute_on_connection
    return connection._execute_ddl(self, multiparams, params)
  File "/home/dragon/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 895, in _execute_ddl
    compiled
  File "/home/dragon/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1070, in _execute_context
    context)
  File "/home/dragon/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1271, in _handle_dbapi_exception
    exc_info
  File "/home/dragon/.local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/home/dragon/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1063, in _execute_context
    context)
  File "/home/dragon/.local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 442, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (OperationalError) duplicate column name: a b u'ALTER TABLE swdata ADD COLUMN "a\xa0b" BIGINT' ()

Caused, I suspect, by sql.py:345 -- if not str(new_column) in list(_State.table.columns.keys()) which doesn't preserve unicodeness and hence breaks.

dumptruck is not optional

The blog post at http://blog.scraperwiki.com/2012/06/07/local-scraperwiki-library/#more-758217004 suggests dumptruck is optional, however if you try to install scraperwiki_local it complains with the following stack trace. Suggest it is either really optional or it is included in the scraperwiki_local setup.py requirements.

....

Downloading/unpacking scraperwiki-local
Running setup.py egg_info for package scraperwiki-local
Traceback (most recent call last):
File "", line 14, in
File "/Users/ross/Work/scrapes/build/scraperwiki-local/setup.py", line 20, in
import scraperwiki
File "scraperwiki/init.py", line 25, in
import utils, sqlite
File "scraperwiki/sqlite.py", line 1, in
from dumptruck import DumpTruck
ImportError: No module named dumptruck
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

File "", line 14, in

File "/Users/ross/Work/scrapes/build/scraperwiki-local/setup.py", line 20, in

import scraperwiki

File "scraperwiki/init.py", line 25, in

import utils, sqlite

File "scraperwiki/sqlite.py", line 1, in

from dumptruck import DumpTruck

ImportError: No module named dumptruck

Command python setup.py egg_info failed with error code 1

Module has no attribute scrape

It might just be because I'm an amateur but when I run my code I get this error.
Here's my code

import scraperwiki
fixtuers = scraperwiki.scrape('http://www.sydgram.nsw.edu.au/co-curricular/sport/fixtures/2013-06-01.php')
print(fixtures)

Invalid Syntax Error in sqlite.py

Hi,
Am getting Syntax Error in line 58 of sqlite.py. Any clue what could be causing the problem here?

Traceback (most recent call last):
  File "/home/----/ParseCallReport.py", line 6, in <module>
    import scraperwiki, urllib2;
  File "/usr/lib/python2.6/site-packages/scraperwiki/__init__.py", line 10, in <module>
    import utils, sqlite
  File "/usr/lib/python2.6/site-packages/scraperwiki/sqlite.py", line 58
    return {row['name']: row['sql'] for row in response}
                                      ^
SyntaxError: invalid syntax

send_email function would be really useful

Something like this. Request for comment :)

import smtplib

FROM_USER = '[email protected]'


def send_email(recipient, subject, body, attach=None):
    headers = [
        "From: " + FROM_USER,
        "Subject: " + subject,
        "To: " + recipient,
        "MIME-Version: 1.0",
        "Content-Type: text/html"]
    headers = "\r\n".join(headers)

    server = smtplib.SMTP('localhost', 25)
    server.ehlo()
    server.sendmail(FROM_USER, recipient, headers + "\r\n\r\n" + body)
    server.close()

sql.save has weird functionality with existing data

The sql.save function is a useful quick way of saving data. However, it has very odd functionality if you try to build data incrementally from various sources.

The use case I'm thinking of is this: I have two data sets with different fields but a common key. One has been saved already to the SQLite database in scraper wiki.

Using the common key as the unique field with the sql.save function does not behave as I would expect. if the key already exists in the data set, I would expect the function to replace the data in any matching columns, and create new columns (if they don't already exist) an input the rest of the data, leaving the existing data unchanged if there is no matching column for it.

The actual behaviour is that it wipes all populated fields for a given row by setting them to null if the column was not in the list of columns for the sql.save command.

The expected behaviour would be better, as when working with multiple data sets it becomes frustrating to have to use embedded SQL code to manage updates.

Problems with SQLite data types and the table_info() method

I was creating a scraper on the ScraperWiki website and wanted the table containing its data to be ordered by the values of a column of integers. However, both the 'View in a table' and 'Query with SQL' tools are only sorting the data alphanumerically (i.e. the values starting with 1 first, then the values starting with 2 &c) rather than by their numeric values.

This led me to believe that for some reason this column is being given a string type despite containing only integers, so I tried to use:

scraperwiki.sql.table_info('swdata')

which I found in the readme but this is giving me an attribute error.

Can't save to DB; string encoding problem?

After experimenting at TCamp 2012:

import scraperwiki
html = scraperwiki.scrape('http://blog.zephod.com')
scraperwiki.sqlite.save([], {'content' : html} )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scraperwiki/sqlite.py", line 21, in save
    return dt.insert(data, table_name = table_name)
  File "/Users/zephod/code/scrapely/%env.scrapely/lib/python2.7/site-packages/dumptruck/dumptruck.py", line 214, in insert
    self.execute(sql, row.values(), commit=False)
  File "/Users/zephod/code/scrapely/%env.scrapely/lib/python2.7/site-packages/dumptruck/dumptruck.py", line 110, in execute
    self.cursor.execute(sql, *args)
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.