Comments (11)
Nice sleuthing. Spotted a couple of bugs as a result.
Boiled the test case down to this:
#!/usr/bin/env python
import scraperwiki

data = [{'rowx': None} for x in range(10000)]
try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print("Masked exception:", repr(e))
scraperwiki.sql.save(['rowx'], data)
from scraperwiki-python.
I'm not yet sure what to do in this case. What's happening is that the first save fails due to an exception (because the user passed bad data). The user masks this, then goes on to try another save, which fails.
If I were writing the code I wouldn't (a) pass bad data or (b) mask the exception; i.e., I would want it to blow up.
The problem is that the user is doing something which causes save() to explode, but they expect it to be non-fatal. By buffering the save and replaying it at a later date, we cause the explosion to happen elsewhere, which isn't optimal.
I think the solution is to sanity-check the input in append before buffering it.
There is also another bug where state which needs to be preserved during a long append() is incorrectly reset in flush(); I'll submit a pull request for that.
I note that:
scraperwiki.sql.save(['rowx'], {})
is fine, but:
scraperwiki.sql.save(['rowx'], [{}])
raises ValueError: You passed no sample values, or all the values you passed were null.
It feels like a violation of a generalisation principle that passing no data isn't a no-op. But I don't know what the user might expect.
It feels pretty evil that dumptruck deletes keys with None as the value. What if you want to set a column to null? Then, after having deleted them, it explodes if nothing is left.
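The behaviour being criticised amounts to something like this standalone sketch (not dumptruck's actual code):

```python
def strip_none_values(row):
    # Sketch of the criticised behaviour: dropping None-valued keys
    # before insertion means an explicit "set this column to NULL"
    # is silently ignored, and an all-None row leaves nothing behind.
    return {k: v for k, v in row.items() if v is not None}
```

So `strip_none_values({'my_id': 5, 'animal': None})` silently drops `animal`, and `strip_none_values({'rowx': None})` leaves an empty dict, which is what then triggers the ValueError quoted above.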
/cc @IanHopkinson @StevenMaude @scraperdragon
I think the solution is to sanity-check the input in append before buffering it.
Agreed, but not sure entirely what the input's requirements are. There may be some odd edge cases (dates spring to mind).
It feels like it violates a generalisation principle that passing no data isn't a NOOP
Agreed.
It feels pretty evil that dumptruck deletes keys with None as the value.
Agreed. I think this is an artifact of the way the dataproxy used to work / was intended to work, where:
scraperwiki.sql.save({'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save({'my_id':5, 'pet_name': 'Rover'})
would lead to a row of `{'my_id':5, 'animal': 'dog', 'pet_name': 'Rover'}`.
However, I don't remember this ever working.
I'm happy with the behaviour you just stated; that's how it should work. However, what it actually does is take:
scraperwiki.sql.save([], {'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id':5, 'pet_name': 'Rover', 'animal': None})
And where I would expect animal to be nulled, it just ignores the animal field and has the effect of the code from your example; it's as if you had run:
scraperwiki.sql.save([], {'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id':5, 'pet_name': 'Rover'})
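In plain SQL terms, the expected semantics of the second save would be an UPDATE that explicitly nulls the column. A sqlite3 sketch (the table name `swdata` and the schema are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE swdata (my_id INTEGER PRIMARY KEY,"
            " animal TEXT, pet_name TEXT)")

# First save: creates the row with animal = 'dog'.
con.execute("INSERT INTO swdata (my_id, animal) VALUES (?, ?)", (5, "dog"))

# Second save: an explicit None should null the column, not be ignored.
con.execute("UPDATE swdata SET pet_name = ?, animal = ? WHERE my_id = ?",
            ("Rover", None, 5))

row = con.execute("SELECT my_id, animal, pet_name FROM swdata").fetchone()
# row is (5, None, 'Rover'): animal has been nulled, as expected
```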
We still have to somehow deal with the case where real_save legitimately explodes (someone just yanked the hard disk, etc.) and the user "legitimately" does a catch-all and then later tries writing again. In that case I think we have to scream a big warning that there was an error whilst flushing the buffer, and accept that that error might jump into the fray at any later point.
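One way of "screaming a big warning" might look like this sketch (the class and attribute names are hypothetical, not scraperwiki's API):

```python
import warnings


class BufferedSaver:
    """Hypothetical sketch: remember a failed deferred flush and
    re-raise it, with a loud warning, on the user's next write."""

    def __init__(self, real_save):
        self.real_save = real_save
        self.buffer = []
        self.flush_error = None

    def save(self, row):
        if self.flush_error is not None:
            # Surface the old flush failure at the next write attempt,
            # making clear it comes from an earlier buffered flush.
            error, self.flush_error = self.flush_error, None
            warnings.warn("an earlier buffered flush failed; "
                          "re-raising it now: %r" % (error,))
            raise error
        self.buffer.append(row)

    def flush(self):
        try:
            self.real_save(self.buffer)
        except Exception as exc:
            self.flush_error = exc
        else:
            self.buffer = []
```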
Ultimately this is all just hacking around the fact that dumptruck is (extremely) suboptimal for doing individual row insertions. If @sean-duffy or someone attacks the core of the problem by rewriting that logic in sqlalchemy then it should be possible to make it so that .save() is as cheap as an INSERT. I guess some time will have to be spent considering the transaction semantics of that, however.
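The cost and transaction trade-off can be illustrated with plain sqlite3 (standing in here for whatever backend the rewrite uses): batching many row insertions inside a single transaction makes each one roughly as cheap as a bare INSERT, at the price that a failure rolls back the whole batch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE swdata (rowx TEXT)")

rows = [("value-%d" % i,) for i in range(10000)]

# One transaction for the whole batch: committed on success,
# rolled back entirely if any insert raises.
with con:
    con.executemany("INSERT INTO swdata (rowx) VALUES (?)", rows)

count = con.execute("SELECT COUNT(*) FROM swdata").fetchone()[0]
# count == 10000
```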
(From my last point I conclude that we may wish to defer the buffering question indefinitely in favour of the rewrite to use sqlalchemy: getting this right could be very hard and might not give a well-behaved solution in all circumstances, whereas fundamentally fixing the backend probably will result in obviously correct behaviour.)
FWIW I agree with @scraperdragon. The thing about passing (key, None) pairs is just wrong.
The thing about exceptions is that it's an exception. All bets are off. You should not reasonably expect any future .save()s to work (though @pwaller is right, we could make it so that this case was more clearly diagnosed).
A rewrite in sqlalchemy is clearly a good idea, but so is doing a million other things. (PRs accepted).
@drj11 A sqlalchemy rewrite of dumptruck is what I'm currently doing:
https://github.com/scraperwiki/dumptruck/tree/sqlalchemy-rewrite
Almost all of this is irrelevant now that we have the SQLAlchemy rewrite.