
Comments (17)

shazow commented on May 21, 2024

πŸ‘, I'll look into it closer this weekend.


piotr-dobrogost commented on May 21, 2024

@brandon-rhodes Which solution would you choose?


brandon-rhodes commented on May 21, 2024

Um. Well.

Okay.

[Thinks for several minutes.]

You've got me.

To my vast surprise, it turns out that I would actually do something more complicated!

First, I would make the API distinguish between redirection and errors. To me, it is a fundamentally different issue to tell a library β€œbe willing to go through 4 redirections, but then stop; 5 is right out!” than to tell the library β€œif you cannot resolve the hostname, or connect, or they keep hanging up on you, just try again a few times, and then let me know after all your attempts have failed.” In fact, the latter is not a feature that had ever occurred to me, until I saw it this week in this library. I guess I possess daily evidence that it does in fact help to retry random errors, which is demonstrated every time I hit β€œreload” on a hung page load and my browser magically gets it successfully the second time. But following up an error with an automatic retry seems very different than following successful redirects. Yet urllib3 uses the same retries counter for both.

So, first, I would introduce a new parameter redirects=3 to urlopen() and have that counter be the one that counts down when stepping through 3xx responses. And each subsequent response object would have a redirected_from attribute that held the previous response object, so that a quick for loop on the part of the user would let them see the whole chain of responses that had led them to the final page. (Would the series of requests also need to get saved, I wonder, or does the response chain alone provide enough information for them to reconstruct what happened? For now my idea is that they only need the responses.)
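
For illustration, walking that proposed chain might look like this sketch; the redirects parameter and redirected_from attribute are the hypothetical names proposed above, not the library's actual API:

    from urllib3 import HTTPConnectionPool

    pool = HTTPConnectionPool('example.com')

    # `redirects` and `redirected_from` are the proposed names from the
    # paragraph above; they are not the library's real API.
    response = pool.urlopen('GET', '/some/path', redirects=3)

    # Walk back through the chain of responses that led to the final page.
    r = response
    while r is not None:
        print(r.status)
        r = getattr(r, 'redirected_from', None)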

And I should mention one complication: even though redirects=3 and retries=0 as parameter defaults would seem to be pretty much what anyone would sanely want as a default β€” after all, that's what the browser does: it hides redirects from the user but exposes actual failures to fetch pages β€” there is one situation where something like the current retries logic can be quite nice: when you are fetching a huge file with curl or wget, manage to fetch some of it, and then the connection dies, whereupon the tool, through an option, does not reissue the exact same request that failed, but a new request (with a Range header) that fast-forwards past what has been fetched so far. Why have I not brought up this stunningly useful case in my discussion? Because I think that, instead of making urlopen() even more complicated, this feature could easily be provided in a separate function or tool that (a) calls urlopen() with retries=0 and redirects=3 to get redirected to the resource and try downloading it; (b) if it fails partway through, tries a direct fetch of that final URL, this time with retries=0 and redirects=0; and (c) puts the pieces together to return them (unless some kind of counter or time limit of its own has run out). Of course this could not be done with .data or even simply with a parameterless .read() because, unless I'm forgetting something, both of those will discard everything they've read if they hit an error; but the tool could use little .read(1024) calls or something and thus hopefully lose only a few bytes when the error finally killed its read loop.
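
A rough sketch of that separate tool, assuming an HTTPConnectionPool instance and using only real urlopen() parameters (the proposed redirects parameter is left out), with a standard Range header doing the fast-forwarding:

    def fetch_resumable(pool, url, attempts=3):
        # Sketch: accumulate chunks, resuming after mid-stream failures.
        chunks, got = [], 0
        for _ in range(attempts):
            # A Range header fast-forwards past what was already fetched.
            headers = {'Range': 'bytes=%d-' % got} if got else None
            response = pool.urlopen('GET', url, retries=0, headers=headers,
                                    preload_content=False)
            try:
                while True:
                    chunk = response.read(1024)  # little reads, as suggested
                    if not chunk:
                        return b''.join(chunks)  # finished cleanly
                    chunks.append(chunk)
                    got += len(chunk)
            except IOError:
                continue  # the connection died partway; resume from offset
        raise IOError('gave up after %d attempts' % attempts)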

So, to sum up: what would I do?

If I had no users or were still in alpha, I would remove the retries parameter and, for the moment, get rid of the idea that the library should automatically keep trying after getting socket errors. I would have a redirects parameter instead. If the user specifies redirects=0 then they are simply given the response back, whether it be 303 or 307 or whatever. If they specify a nonzero redirects, then I would assume that they want a non-3xx resolution to their search, and I would raise the HTTP response as an exception instead.

Oh, yes, had I mentioned that? I would under no circumstances have a separate exception; I would instead make my HTTP response objects inherit from Exception, so that I could return them or raise them depending on what the user wanted. The caller would get all the same information back on the response whether it was raised or returned, of course; but the manner of delivery would help them avoid having to check the status themselves if they needed a 200-level response as the only really useful result.
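
A standalone sketch of that return-or-raise pattern, with hypothetical class names (urllib2's HTTPError works roughly this way, via multiple inheritance):

    class Response(object):
        def __init__(self, status):
            self.status = status

    class HTTPError(Exception):
        pass

    class ErrorResponse(Response, HTTPError):
        """A non-200 response that can be returned *or* raised."""

    def deliver(response, raise_non_200=True):
        if raise_non_200 and isinstance(response, ErrorResponse):
            raise response  # the except clause receives the response itself
        return response     # the very same object, returned normally

    try:
        deliver(ErrorResponse(404))
    except ErrorResponse as r:
        print(r.status)  # 404: same information, different delivery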

But urllib3 has existing users. What then?

Okay.

I am summoning all of my decisiveness.

Okay.

  1. I create a subclass of HTTPResponse called HTTPErrorResponse. The HTTPResponse.from_httplib() classmethod would make instances of the latter for non-200 responses (but would not, itself, raise them).
  2. The name MaxRetryError becomes a deprecated synonym for another catch-all exception (see below), so that people who are calling urlopen() today wrapped in an exception handler keep working.
  3. The retries parameter sticks around but now has a default value retries=0.
  4. A new redirects parameter appears with the default value redirects=3; unless users are doing crazy things that I have not imagined, they will not even notice this change, because my guess is that they have been using retries to mean redirects.
  5. urllib3 keeps its current assumption that only 200-level responses are useful to users, and that anything else is a failure. Therefore, when it gets a socket error or HTTP protocol error, it retries if the retries counter is not yet exhausted, and then finally fails by re-raising the last response/error. When it gets a 3xx as a final return value after exhausting the redirects counter, it raises the 3xx HTTP response itself as an exception.
  6. When a redirect is followed, the previous HTTP response (though not its body?) is saved as the .redirected_from attribute of the new response that comes back. (But it might be better to have some other data structure, like a flat .redirects list of (url, httpresponse) pairs; but I had to make some kind of decision or never finish with this answer!)
  7. What about socket errors and protocol exceptions, which do not result in pretty 4xx or 5xx response objects? I would make the same choice that urllib3 makes today: I would let the user avoid having to mention all those possibilities in their exception handlers that they wrap around code that makes urllib3 calls. But I would want to expose them to the failure so they can see something more than β€œI ran out of retries” like they see today. So I guess my exception hierarchy would look like this:
HTTPError
  β†’ TimeoutError
  β†’ NonTimeoutError (ugly idea, but NEEDED, because we make MaxRetryError a deprecated alias for this!)
      β†’ CommunicationError (to which we attach information from socket and protocol errors)
          β†’ SSLError
      β†’ HTTPErrorResponse (multiply inherits from HTTPResponse and HTTPError)
          β†’ HTTP4xxResponse
          β†’ HTTP5xxResponse
          β†’ (and even more specific ones? or is even breaking up 4xx and 5xx silly?)
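
Rendered as Python, that tree might be sketched like so (every name here is hypothetical; HTTPResponse is a stand-in to keep the sketch self-contained):

    class HTTPResponse(object):
        """Stand-in for urllib3's response class."""

    class HTTPError(Exception): pass

    class TimeoutError(HTTPError): pass

    class NonTimeoutError(HTTPError): pass
    MaxRetryError = NonTimeoutError  # the deprecated alias from point 2

    class CommunicationError(NonTimeoutError): pass
    class SSLError(CommunicationError): pass

    class HTTPErrorResponse(HTTPResponse, NonTimeoutError):
        """Multiply inherits, so it can be returned or raised (point 7)."""

    class HTTP4xxResponse(HTTPErrorResponse): pass
    class HTTP5xxResponse(HTTPErrorResponse): pass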

So that, I think, is how I would (a) stop hiding from callers the circumstances under which failure occurred, (b) while asserting the invariant that a non-200 response always results in an exception being raised (which I think makes client code easiest to write?), (c) while keeping the old MaxRetryError as a catch-all that will still catch all non-timeout errors, and (d) separating out the useful idea of following redirects from the much less useful idea of hitting a web site over and over again when a communications failure is underway.

Thoughts? (ducks)

(Have continued editing this to clean out earlier thoughts and eliminate contradictions as I changed my mind. Hope no one else is awake yet!)


shazow commented on May 21, 2024

Ah craziness. :P

A few thoughts:

  1. @kennethreitz's Requests does much of what you already want on top of urllib3, such as tracking the redirect path.

  2. I feel really uncomfortable having a Response object which inherits from Exception and could be raised or returned. Not quite sure why, I need to think about this more. I guess I haven't seen this pattern done before and my instinct tells me there will be gotchas there.

  3. There's an open related issue on the old GoogleCode project which still needs to be migrated to Github if people are passionate about it: (18) There should be separate limits for redirect and failure.

    Based on this, my thoughts were to create a kind of Retries object which you can create and pass in place of a plain integer, and it would track Retries on a more granular scale, such as:

    urlopen(..., retries=Retries(total=7, redirect=5, timeout=3))

    Obviously each param would be optional and None would mean infinite. Passing in a plain integer like retries=7 as we do now in place of a Retries object would translate to retries=Retries(total=7).

    Then we could attach this Retries object along with exceptions or responses for additional debugging information.
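
For concreteness, a rough sketch of such a Retries object (hypothetical at the time; urllib3 did later grow a Retry class in urllib3.util with much this shape):

    class Retries(object):
        """Granular retry counters; None means that counter is unlimited."""

        def __init__(self, total=None, redirect=None, timeout=None):
            self.total = total
            self.redirect = redirect
            self.timeout = timeout

        @classmethod
        def from_int(cls, retries):
            # A plain integer like retries=7 becomes Retries(total=7).
            if isinstance(retries, cls):
                return retries
            return cls(total=retries)

        def is_exhausted(self):
            counters = (self.total, self.redirect, self.timeout)
            return any(c is not None and c < 0 for c in counters)

urlopen() would then normalize its argument with retries = Retries.from_int(retries) and decrement the relevant counter as redirects are followed or failures retried.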

Your thoughts on this?


piotr-dobrogost commented on May 21, 2024

Interesting subject.

But following up an error with an automatic retry seems very different than following successful redirects.

I totally agree. That's what I felt from the very beginning, and it's the reason I opened (18) There should be separate limits for redirect and failure, which I'm glad Andrey remembers. Btw, I would migrate it here and move the discussion about redirects/retries over to it.

But urllib3 has existing users. What then?

With all my sympathy for Andrey and urllib3, we should remember it's a very young project, and I wouldn't hold necessary changes back for the sake of backward compatibility. If compatibility is easily preserved then sure, but not at any price.

I create a subclass of HTTPResponse called HTTPErrorResponse. The HTTPResponse.from_httplib() classmethod would make instances of the latter for non-200 responses (but would not, itself, raise them).

I do not like treating non-200 responses (read: 3xx ones) as errors in any circumstances. It just doesn't feel right.

The retries parameter sticks around (...)

Sure.

A new redirects parameter appears (...)

What can I say? At last :)

urllib3 keeps its current assumption that only 200-level responses are useful to users (...)

Andrey, do you agree with this? I don't, and I think there's no such assumption; if the current behavior suggests otherwise, it's only because the library has been lacking redirect/retry separation, which influenced its behavior in a bad way.

When it gets a 3xx as a final return value after exhausting the redirects counter, it raises the 3xx HTTP response itself as an exception.

As should already be clear from my previous remarks, I'm against this.

When a redirect is followed, the previous HTTP response (though not its body?) is saved as the (...)

As Andrey noted, the Requests lib does this. However, I had hoped Requests would use the redirect logic from urllib3, in which case creating the history could probably be moved to urllib3 as well. Kenneth said (I couldn't find the post now) that Requests would never use urllib3's redirect logic, however. The division between what should be in urllib3 and what in Requests is tricky, but I have a feeling that right now urllib3 is not as extensible as it could be, and that's the reason Requests does so many things on its own.

On the idea of responses being returned or raised: it's too cool an idea not to make use of it :) The question is how to use it :)

Based on this, my thoughts were to create a kind of Retries object

Nice! However, we have to remember this can never be as elastic as signals. For example, if someone would like to have some time-based adaptive retry strategy, it's only feasible with signals.


shazow commented on May 21, 2024

With all my sympathy for Andrey and urllib3, we should remember it's a very young project, and I wouldn't hold necessary changes back for the sake of backward compatibility. If compatibility is easily preserved then sure, but not at any price.

It's been around since 2008, which is moderately old as far as Python projects go. :-)

urllib3 keeps its current assumption that only 200-level responses are useful to users (...)

urlopen(..., redirect=False)

... [stuff about Requests doing its own redirect logic]

urllib3 has exactly 3 lines of redirect logic, with the goal of covering 80%+ of use cases (to avoid mandatory boilerplate) but not 100%; this is why you can easily turn off redirecting altogether:

    def urlopen(self, ...):
        ...
        redirect_location = redirect and response.get_redirect_location()
        if redirect_location:
            return self.urlopen(..., retries=retries - 1)

This is how I'd prefer to keep it, since anything more sophisticated would still not cover 100% user cases and it wouldn't be worth the added complexity to the codebase.

I'm satisfied with the extensibility of urllib3 in this department as it is (i.e. redirect=False and do it yourself).

Nice! However, we have to remember this never can be as elastic as signals. For example if someone would like to have some time-based adaptive retry strategy it's only feasible with signals.

I don't understand what you mean by signals. As for time-based adaptive retry strategy, I use things like apiclient/ratelimiter.py.


brandon-rhodes commented on May 21, 2024

First, it is urllib2 that probably first gave me the idea of an object which is both an exception and also a valid HTTP response object, accomplished through multiple inheritance. Which might be an indication that it's a bad idea; I don't know. :)

My desire to have 3xx be an exception might be because I am thinking too much about people calling urllib3 directly instead of using a pretty layer like requests on top. I like writing code where I can make a request and then start using the body β€” the data that comes back β€” as though it is the content of the resource I have named in the URL. And I want an exception to interrupt me, and not let my content-processing code run, if what comes back is not the body of the resource I have named. Since a 3xx response does not provide me with the body of the resource, and indeed represents a failure to fetch the resource (although a failure that gives me a hint as to where to look next), my code, I think, will be cleanest if I don't constantly have to add if statements to make sure a 200 is what has come back before using the result of getting a URL.

Ah! That's the difference. If this were an HTTP library, I would expect any HTTP response to just come back to the caller as a normal value. But it's not! It's a URL library! And doing a few retries and still being stuck with a 3xx response is a failure to fetch the URL β€” it's a failure to run the resource to ground and make it bottom out into content. So my doctrine is that a URL library should raise, and not return, indications that the resource named by the URL was not successfully recovered. If you want 3xx, 4xx, or 5xx HTTP responses back without being considered errors then you are not dealing with the semantics of URLs and resources; you are dealing with the semantics of HTTP requests and responses.

But let me step back and re-read the title of this request.

Ah, yes. My problem is that the urlopen() code hides socket and HTTP protocol errors from me β€” it throws them away. This is stopping me from using the library, because I very much need connection pooling β€” and also, as it happens, need to play with NTLM, which I have stumbled across in the contrib directory (could a mention of it be added to the front page of the docs?) β€” but I also need to report specific socket and HTTP errors back to the user of my API because, even when using something fun like a connection pool, I need to be able to name the real error (host name unknown? host unreachable?) that has been encountered.

My first try at achieving this was simply to import urllib3.connectionpool and then set its HTTPException and SocketError symbols to None. This eviscerates the problem exception handler which is hiding errors from me:

        except (HTTPException, SocketError), e:
            # Connection broken, discard. It will be replaced next _get_conn().
            conn = None
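
Concretely, that patch might look like the following sketch; substituting a dummy exception class that nothing ever raises, rather than None, keeps the except tuple valid:

    import urllib3.connectionpool as cp

    class _NeverRaised(Exception):
        """Nothing ever raises this, so the handler above never fires."""

    # Neutralize the catch-all: socket and protocol errors now propagate.
    cp.HTTPException = cp.SocketError = _NeverRaised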

So I can see real errors; great! However, if I read the code correctly, this might also make the connection pool more fragile, because while the _get_conn() method does make a basic check for whether the socket it has pulled off of the pool has simply closed, any more complicated symptom of a stale socket will cause an HTTPException or SocketError in the code that I quoted above β€” and will kill my connection attempt, even though with a fresh socket the attempt would work.

(Or am I being paranoid, and the chances of the socket failing between the check that _get_conn() makes and its actual use a moment later are too small to worry about?)

So I would love some way for the except: clause above to work if conn is the old connection that has been pulled from the queue, but then for the except: clause to do nothing and let me see the error if the attempt has been made with a fresh connection.

So what is my real problem?

I guess my real problem is that HTTPConnectionPool is not really, despite the name, a connection pool. It would seem that a connection pool would do one thing: it would try using an old connection if one were around, but hide any I/O or protocol error that resulted; and it would respond to such an error by making a single good-faith attempt with a new connection if an attempt with an old connection failed. But the class as it stands muddies the water with ideas like redirection and also with the idea of ignoring I/O errors on fresh connections β€” it makes no distinction between the idea β€œthis is an old socket, so the error can be ignored” and β€œwhoa, a completely fresh socket just gave me an error, I had better tell the client about this!”
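
A sketch of that one thing, with hypothetical helper names standing in for the pool's internals:

    import socket
    from httplib import HTTPException  # Python 2, as in urllib3 of this era

    def pooled_request(pool, make_request):
        """Forgive one error on a reused socket; surface errors on fresh ones."""
        conn = pool.get_reused_connection()  # hypothetical helper
        if conn is not None:
            try:
                return make_request(conn)
            except (HTTPException, socket.error):
                pass  # stale pooled socket: forgive, fall through to fresh
        conn = pool.new_connection()  # hypothetical helper
        return make_request(conn)     # any error here reaches the caller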

It would actually be cleaner if HTTPConnectionPool let someone like requests do redirection, and dropped down to just doing pooling. :)

Why did you throw redirection and error-survival (beyond the forgiving of a single error on an old socket) into the same method with something as useful, and unique in the ecosystem, as connection pooling? Were you afraid that putting other functionality in another layer or wrapper would create the horrors of urllib2 all over again, with its need to build a whole stack of tiny handlers just to do a simple request? :)


brandon-rhodes commented on May 21, 2024

Oh! And, your Retries() idea deserves comment.

My gut feeling is that it's a terrible idea β€” passing even more complex information into urlopen() and thus requiring it to have even more complicated logic inside. If you really want to go from two (retries and redirects) to three (adding total) parameters that control redirection, then I think you are definitely reaching a level of complexity where redirection should just be moved into a separate object altogether. Keep the connection pool simple. And have a RetryRobot or something that can be instantiated with (total=7, redirect=5, timeout=3) and then will go to town making requests to a Pool or a simpler HTTP dispatcher beneath it.
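
Something like this hypothetical sketch, which keeps the pool dumb (redirect=False) and puts the policy in the robot:

    class RetryRobot(object):
        """Hypothetical: retry/redirect policy lives outside the pool."""

        def __init__(self, total=7, redirect=5, timeout=3):
            self.total = total
            self.redirect = redirect
            self.timeout = timeout  # a per-attempt timeout budget, say

        def fetch(self, pool, method, url):
            total, redirects = self.total, self.redirect
            while total > 0:
                total -= 1
                # The pool does no redirect logic of its own here.
                response = pool.urlopen(method, url, redirect=False)
                location = response.get_redirect_location()
                if location and redirects > 0:
                    redirects -= 1
                    url = location
                    continue
                return response
            raise RuntimeError('retry budget exhausted')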

Or, again, am I missing the point, and you're willing to swallow a more and more complicated urlopen() function because the alternative is the much-reviled handler stack supported by urllib2? Or are the problems with urllib2 in some awkwardness in the way it wants pieces to compose, and not in the idea itself?


shazow commented on May 21, 2024

Oh god, humongous wall of text. I'll try to work through it, but I might miss a bunch of things; I welcome you to re-summarize your core points if that happens. :)

Since a 3xx response does not provide me with the body of the resource, and indeed represents a failure to fetch the resource

That's not necessarily true. I could totally return a meaningful body in a 3xx response. I could also write a webserver that returns a 1337 status, and urllib3 should still work with it. I don't want urllib3 to make any arrogant assumptions about the use scenarios. That's what Requests is for. ;-)

If this were an HTTP library [...] But it's not! It's a URL library!

:/ Please don't hold the name against it. It's the best I could come up with 4 years ago, and I've somewhat regretted it since. :P

I guess my real problem is that HTTPConnectionPool is not really, despite the name, a connection pool. [...]

You're mistaken. HTTPConnectionPool is very much a connection pool. It just so happens that urlopen is not what you want. Have you considered this solution?

class BrandonHTTPConnectionPool(HTTPConnectionPool):
    def urlopen(self, method, url, **kw):
        pass  # Implement your own urlopen logic as you wish? :)

All of the "connection pooling" actually happens in the helper methods like _get_conn, _new_conn, _put_conn, etc. It seems that everything you have problems with is exactly urlopen. If there are any useful pieces in urlopen that you'd want factored out into a helper such that you can reimplement your own urlopen peacefully, I'm happy to accept such requests. :)

[Stuff about Retries() ...] passing even more complex information into urlopen() and thus requiring it to have even more complicated logic inside. [...]

Nah, all the logic would happen inside of the Retries object. There would be maybe 2-3 more lines in urlopen that call helpers within the Retries object.


shazow commented on May 21, 2024

@brandon-rhodes I welcome you to add yourself to the CONTRIBUTORS.txt file, your design review, fixes, and improvements are very worthy of recognition. :)


brandon-rhodes commented on May 21, 2024

I will comment at greater length, but for the moment: you say β€œI could totally return a meaningful body in a 3xx.” Could you elaborate a bit on this? Are you arguing, for example, that wget or curl would be correct to return the body of a 3xx response as their standard output and exit with a non-error return code?


piotr-dobrogost commented on May 21, 2024

Are you arguing, for example, that wget or curl would be correct to return the body of a 3xx response as their standard output and exit with a non-error return code?

Z:\>curl http://httpbin.org/redirect/2 && echo SUCCESS
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: <a href="/redirect/1">/redirect/1</a>.  If not click the link.SUCCESS


shazow commented on May 21, 2024

Thanks, @piotr-dobrogost.

@brandon-rhodes β€” The thing that I'm trying to keep in mind is that we're not building a web browser, or a command-line tool, or a downloader, or a web crawler, or an API client. We're building a library that could be used to build any of these things (and that means even if the endpoint server is doing something weird and unusual).


piotr-dobrogost commented on May 21, 2024

Well, in this case maybe it would really be better not to do any redirects? Following redirects is a feature expected from the higher-level tools you mention, not from the low-level tool you want urllib3 to be. The existence of this very feature leads people to believe urllib3 is more than a library that does HTTP connection pools.

Btw,

urllib3 has exactly 3 lines of redirect logic, with the goal of covering 80%+ of use cases

I would say it's the opposite (20%). I have a feeling most redirects need cookies to work as expected (as in a browser), and since that's not something urllib3 handles, the usefulness of redirects is rather low. I guess the reason we don't have issues with redirects is that people don't use this feature in urllib3.


brandon-rhodes commented on May 21, 2024

@piotr-dobrogost β€” I'm abashed! I am a longtime wget user and expected curl to work the same way, and was completely wrong. :)

@shazow β€” Agreed! The purpose is to be a library. And two things stand in the way of the library being dropped into place to provide connection pooling where a more primitive library might have been in use before. (1) It hides real communications errors behind its own MaxRetryError which both invalidates except: clauses that might already be standing around HTTP request calls, and also means that meaningful error messages can neither be logged nor presented to users. (2) It hides errors regardless of whether they arise from pooling or not. That is, instead of doing a hide-the-error-and-magically-retry in the one singular situation where an old connection is being used β€” to, appropriately, make the connection pooling transparent β€” it will hide errors even from completely new connections just pulled off of the queue. This means that urllib3 cannot be used by a developer who wants an automatic retry to take place when an old connection proves stale (and thus not have extra errors appear in their app merely because connection pooling is taking place), but who needs to avoid the expense of double-requests taking place every time a bad hostname is typed or a target host is down.


shazow commented on May 21, 2024

@piotr-dobrogost:

I believe a significant number of users are using urllib3 to access HTTP-based APIs. Some people use it to mimic a browser, but I think that's a minority, and Requests is a great option for that.

I think not doing redirects at all is a bad idea; at the least, it would annoy me greatly and increase the barrier to entry for the library. I am not opposed to improving the redirect logic. I'm just wary of diminishing returns.

If you'd like to compile a list of redirect scenarios you'd like for urllib3 to handle better (ideally in the form of unit tests), we can take a crack at it and see what's feasible without increasing complexity significantly.

@brandon-rhodes:

It hides real communications errors [...]

Yup, we're all in agreement here. This bug will remain open until this is fixed. I'm leaning towards attaching more such information to the exception that does get raised.

It hides errors regardless of whether they arise from pooling or not. [...]

Pooling errors should not be raised; that would indicate that the library is broken, as pooling is one of its core functions (unlike redirects). I already agreed above that the number of retries needs to be configurable for each kind of failure.

[...] double-requests taking place every time a bad hostname is typed or a target host is down.

Does retries=False not allow you to do this (assuming we fix the problems above)?


shazow commented on May 21, 2024

I believe this is fixed, especially with the addition of the new Retry object.

