Giter VIP home page Giter VIP logo

Comments (11)

pwaller avatar pwaller commented on July 22, 2024

Nice sleuthing. Spotted a couple of bugs as a result.

Boiled the test case down to this:

#!/usr/bin/env python
import scraperwiki

data = [{'rowx': None} for x in range(10000)]

try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print "Masked exception:", repr(e)

scraperwiki.sql.save(['rowx'], data)

from scraperwiki-python.

pwaller avatar pwaller commented on July 22, 2024

I'm not yet sure what to do in this case. What's happening is that the first save fails due to an exception (because the user passed bad data). The user masks this, then goes on to try another save, which fails.

If I were writing the code I wouldn't a) pass bad data or b) mask the exception, i.e, I would want it to blow up.

The problem is that the user is doing something which causes save() to explode but they expect it to be non-fatal. By buffering the save and replaying it at a later date, we cause the explosion to happen elsewhere, which isn't optimal.

I think the solution is to sanity-check the input in append before buffering it.

from scraperwiki-python.

pwaller avatar pwaller commented on July 22, 2024

There is also another bug where state is incorrectly reset in the flush() which needs to be preserved during a long append(), which I'll submit a pull request for.

from scraperwiki-python.

pwaller avatar pwaller commented on July 22, 2024

I note that:

scraperwiki.sql.save(['rowx'], {})

is fine, but:

scraperwiki.sql.save(['rowx'], [{}])

raises ValueError: You passed no sample values, or all the values you passed were null..

It feels like it violates a generalisation principle that passing no data isn't a NOOP. But I don't know what the user might expect.

It feels pretty evil that dumptruck deletes keys with None as the value. What if you want to set a column to null? Then after having deleted them it explodes if nothing is left.

/cc @IanHopkinson @StevenMaude @scraperdragon

from scraperwiki-python.

scraperdragon avatar scraperdragon commented on July 22, 2024

I think the solution is to sanity-check the input in append before buffering it.

Agreed, but not sure entirely what the input's requirements are. There may be some odd edge cases (dates spring to mind).

It feels like it violates a generalisation principle that passing no data isn't a NOOP

Agreed.

It feels pretty evil that dumptruck deletes keys with None as the value.

Agreed. I think this is an artifact of the way the dataproxy used to work / was intended to work, where:

scraperwiki.sql.save({'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save({'my_id':5, 'pet_name': 'Rover'})

would lead to a row of `{'my_id':5, 'animal': 'dog', 'pet_name': 'Rover'}).
However, I don't remember this ever working.

from scraperwiki-python.

pwaller avatar pwaller commented on July 22, 2024

I'm happy with the behaviour you just stated, that's how it should work. However what it does it takes:

scraperwiki.sql.save([], {'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id':5, 'pet_name': 'Rover', 'animal': None})

And where I would expect animal to be nulled, it just ignores the animal field and has the effect of the code from your example, it's as if you had run -

scraperwiki.sql.save([], {'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id':5, 'pet_name': 'Rover'})

from scraperwiki-python.

pwaller avatar pwaller commented on July 22, 2024

We still have to somehow deal with the case that real_save legitimately explodes (someone just yanked the hard-disk, etc) and the user "legitimately" does a catch all and then proceeds to later try writing again. In that case I think we have to scream a big warning that there was an error whilst flushing the buffer, and that accept that that error might jump into the fray at any later point.

Ultimately this is all just hacking around the fact that dumptruck is (extremely) suboptimal for doing individual row insertions. If @sean-duffy or someone attacks the core of the problem by rewriting that logic in sqlalchemy then it should be possible to make it so that .save() is as cheap as an INSERT. I guess some time will have to be spent considering the transaction semantics of that, however.

from scraperwiki-python.

pwaller avatar pwaller commented on July 22, 2024

(From my last point I conclude that we may wish to indefinitely defer the buffering question entirely in favour of the rewrite to use sqlalchemy - getting this right could be very hard and might not give a well behaved solution in all circumstances, whereas fundamentally fixing the backend probably will result in obviously correct behaviour).

from scraperwiki-python.

drj11 avatar drj11 commented on July 22, 2024

FWIW I agree with @scraperdragon. The thing about passing (key,None) pairs is just wrong.

The thing about exceptions is that it's an exception. All bets are off. You should not reasonably expect any future .save()s to work (though @pwaller is right, we could make it so that this case was more clearly diagnosed).

A rewrite in sqlalchemy is clearly a good idea, but so is doing a million other things. (PRs accepted).

from scraperwiki-python.

sean-duffy avatar sean-duffy commented on July 22, 2024

@drj11 A sqlalchemy rewrite of dumptruck is what I'm currently doing:
https://github.com/scraperwiki/dumptruck/tree/sqlalchemy-rewrite

from scraperwiki-python.

drj11 avatar drj11 commented on July 22, 2024

Almost all irrelevant now we have the SQLAlchemy rewrite

from scraperwiki-python.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.