
Comments (19)

mattnworb commented:

By the way, the way I am trying to reproduce "a request is sent but just never gets a response" might be totally invalid. I wasn't really sure how to recreate that when testing against a memcached daemon on localhost, since I didn't want the client to get a connection reset from the OS by shutting the serverSocket down (in my case there was no OS on the other side to send reset packets back). I am not familiar enough with Netty to understand what would happen with its channels if the server on the other end suddenly went away without closing connections properly.
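
One way I could imagine simulating it (a rough sketch, not what the attached test does; the port number and class name are arbitrary) is a "black hole" server that accepts connections and reads requests but never replies and never closes the socket, so the client sees neither a response nor a reset:

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class BlackHoleServer {
  public static void main(String[] args) throws Exception {
    try (ServerSocket server = new ServerSocket(54321)) {
      while (true) {
        Socket socket = server.accept();
        new Thread(() -> {
          try {
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4096];
            // Read and discard request bytes forever; intentionally never write
            // a response and never close the socket, so the client sees no FIN/RST.
            while (in.read(buf) != -1) {
              // swallow
            }
          } catch (Exception ignored) {
            // The client gave up or the connection broke; nothing to do.
          }
        }).start();
      }
    }
  }
}
```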


spkrka commented:

What version of folsom were you using? Without knowing this, debugging is pretty much impossible.

With the latest master (and 0.6.1) I expect this to happen:
The memcached server stops responding, which would lead the timeout checker to kill the connection (or the memcached server breaks the TCP connection directly). All pending requests should immediately fail with MemcacheClosedException, and all new requests will fail with it too. The DefaultRawMemcacheClient is now closed and won't ever be used again.

Then the ReconnectingClient will try to reconnect periodically. Once the server is available again, a new DefaultRawMemcacheClient is created.
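
Roughly this shape (not folsom's actual ReconnectingClient code, just a sketch of the pattern described above, with made-up interface names):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ReconnectLoop {
  interface RawClient { /* send(), isClosed(), ... */ }
  interface Connector { RawClient connect() throws Exception; }

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final Connector connector;
  private volatile RawClient current;   // requests are routed through this

  ReconnectLoop(Connector connector) { this.connector = connector; }

  // Called when the current client is closed; until a reconnect succeeds,
  // pending and new requests should fail fast with a "closed" error.
  void onClientClosed() {
    scheduler.schedule(this::tryReconnect, 1, TimeUnit.SECONDS);
  }

  private void tryReconnect() {
    try {
      current = connector.connect();    // a fresh client replaces the closed one
    } catch (Exception e) {
      scheduler.schedule(this::tryReconnect, 1, TimeUnit.SECONDS);  // retry later
    }
  }
}
```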

But perhaps there's a bug in here somewhere that both keeps the old instance around and leaves the DefaultRawMemcacheClient thinking it isn't closed.

@danielnorberg or @protocol7, do you have time to investigate? I may not have much time at the moment.


danielnorberg commented:

Maybe we could try reproing this using SIGSTOP?
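
Something like this, assuming a Unix-like box with memcached running locally (the pid is passed in; the class name and timing are made up). Pausing the process keeps its TCP connections established while it stops responding, and SIGCONT resumes it later:

```java
import java.util.concurrent.TimeUnit;

public class PauseMemcached {
  public static void main(String[] args) throws Exception {
    String pid = args[0]; // pid of the local memcached under test

    // Freeze memcached: connections stay ESTABLISHED, but no responses come back.
    new ProcessBuilder("kill", "-STOP", pid).inheritIO().start().waitFor();

    // ...run the client against the now-unresponsive server and observe timeouts...
    TimeUnit.SECONDS.sleep(30);

    // Resume memcached and check that the client recovers.
    new ProcessBuilder("kill", "-CONT", pid).inheritIO().start().waitFor();
  }
}
```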


spkrka commented:

Hm, maybe this bugfix is important enough to justify a new release:
47f9826


spkrka commented:

Also, DefaultRawMemcacheClient doesn't check whether it's closed in its send.
Does that mean that we don't get the failure callback in the call to
channel.write(request, new RequestWritePromise(channel, request))?

Or maybe it just means that we don't actually decrement the counter from that codepath.
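
To illustrate what I mean (a minimal Netty sketch, not folsom's actual code): if the write fails because the channel is already closed, a listener on the write future is where the pending counter would need to be decremented, otherwise it leaks on that codepath:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;
import java.util.concurrent.atomic.AtomicInteger;

class PendingCounterExample {
  private final AtomicInteger pending = new AtomicInteger();

  void send(Channel channel, Object request) {
    pending.incrementAndGet();
    channel.writeAndFlush(request).addListener((ChannelFutureListener) future -> {
      if (!future.isSuccess()) {
        // e.g. ClosedChannelException when the channel is already closed;
        // without this decrement the counter leaks on failed writes.
        pending.decrementAndGet();
        // ...and the request's own future should be failed with a
        // "connection closed" style exception here...
      }
      // On success the counter would be decremented when the response arrives.
    });
  }
}
```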

@danielnorberg - That seems like your area of expertise :)


spkrka commented:

I guess it's not really a big problem if a disconnected client keeps failing with "overloaded" instead of "closed" (though that should be fixed too). The problem in this case was probably that it failed to reconnect, which is probably related to the bugfix in master (but we'll need to verify that).


danielnorberg commented:

Writing on a closed channel always fails.


mattnworb commented:

> What version of folsom were you using?

0.6.1.


spkrka commented:

@danielnorberg Right, so we increment pending, then write the request which immediately fails, but then we never decrement pending.


mattnworb commented:

> With the latest master (and 0.6.1) I expect this to happen:
> The memcached server stops responding, which would lead the timeout checker to kill the connection (or the memcached server breaks the TCP connection directly). All pending requests should immediately fail with MemcacheClosedException, and all new requests will fail with it too. The DefaultRawMemcacheClient is now closed and won't ever be used again.
>
> Then the ReconnectingClient will try to reconnect periodically. Once the server is available again, a new DefaultRawMemcacheClient is created.

I am not 100% sure yet but I think what happened in my case was that the memcached server never broke the connection. I am going to go back through the logs from when this happened to see if requests were getting MemcacheClosedException during the downtime or something else. I also will try to look into creating a test that better reproduces what I saw.


mattnworb commented:

Huge apologies, but after double-checking our logs again it turns out I misread/misremembered them: after the memcached host came back online, we were not actually getting folsom's MemcacheOverloadedException as originally claimed, but instead a "too many outstanding requests" exception from a downstream service (which we only call on memcache misses).

Really sorry for the false alarm.

I think there might still be some merit to the original test case, where a server that is still "connected" but simply never sends a response causes MemcacheOverloadedExceptions to be thrown even after the "stuck" requests time out, but that seems tangential to my original report (where the server was down for hours, and there would have to be some really confusing behavior in Netty and/or the OS for the client to think the socket was still open).

I think part of the reason I conflated the two exceptions is that our metrics listener was reporting "failures" for all GETs after the server came back. That might be because we attach transform functions directly to the ListenableFuture from MemcacheClient and call that downstream service on the same-thread executor, so an exception in the attached transform function can bubble up to the onFailure method of a FutureCallback attached to the original ListenableFuture. That is totally our fault, though.
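
Roughly this wiring, with made-up names (a sketch of our setup, not folsom code): a transform that throws on the direct (same-thread) executor makes the derived future fail, so the callback records a "failure" even though the memcache GET itself succeeded:

```java
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

class MetricsMixupExample {
  void wire(ListenableFuture<String> memcacheGet) {
    // On a cache miss, the transform calls a downstream service (simulated here
    // by throwing the downstream's "too many outstanding requests" error).
    ListenableFuture<String> result = Futures.transform(
        memcacheGet,
        value -> {
          if (value == null) {
            throw new RuntimeException("downstream: too many outstanding requests");
          }
          return value;
        },
        MoreExecutors.directExecutor());

    Futures.addCallback(result, new FutureCallback<String>() {
      @Override public void onSuccess(String value) { /* record a hit */ }
      @Override public void onFailure(Throwable t) {
        // Records a "failure" even though the GET succeeded and only the
        // downstream call (inside the transform) blew up.
      }
    }, MoreExecutors.directExecutor());
  }
}
```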

Apologies again for the mixup. Feel free to close this one.


spkrka commented:

While the original report may be incorrect, I still think there might be some more things worth investigating, so I suggest leaving it open for now.


rgruener commented:

I just saw this issue come up with one of our services in production. At some point we hit the "too many outstanding requests" limit within folsom and the client basically just died. It seems like the request queue never flushes once it hits the limit. During this time, folsom seems to eat up a ton of resources, causing my service to return 503s for a brief amount of time. Eventually the connection to memcached dies completely and folsom never successfully reconnects.

I have tried looking into what is causing this but haven't been able to nail it down yet; however, this seems to be a pretty large issue.

Stack Trace:
Mar 30, 2015 1:41:19 PM com.google.common.util.concurrent.Futures$CombinedFuture setExceptionAndMaybeLog
974d9da0808dd536180afb27d31dec871f7bb7[1]: SEVERE: input future failed.
974d9da0808dd536180afb27d31dec871f7bb7[1]: com.spotify.folsom.MemcacheOverloadedException: too many outstanding requests
974d9da0808dd536180afb27d31dec871f7bb7[1]: at com.spotify.folsom.client.DefaultRawMemcacheClient.send(DefaultRawMemcacheClient.java:162)
974d9da0808dd536180afb27d31dec871f7bb7[1]: at com.spotify.folsom.reconnect.ReconnectingClient.send(ReconnectingClient.java:92)
974d9da0808dd536180afb27d31dec871f7bb7[1]: at com.spotify.folsom.ketama.KetamaMemcacheClient.sendSplitRequest(KetamaMemcacheClient.java:98)
974d9da0808dd536180afb27d31dec871f7bb7[1]: at com.spotify.folsom.ketama.KetamaMemcacheClient.send(KetamaMemcacheClient.java:69)
974d9da0808dd536180afb27d31dec871f7bb7[1]: at com.spotify.folsom.retry.RetryingClient.send(RetryingClient.java:46)
974d9da0808dd536180afb27d31dec871f7bb7[1]: at com.spotify.folsom.client.binary.DefaultBinaryMemcacheClient.multiget(DefaultBinaryMemcacheClient.java:188)
974d9da0808dd536180afb27d31dec871f7bb7[1]: at com.spotify.folsom.client.binary.DefaultBinaryMemcacheClient.getAndTouch(DefaultBinaryMemcacheClient.java:198)
974d9da0808dd536180afb27d31dec871f7bb7[1]: at com.spotify.folsom.client.binary.DefaultBinaryMemcacheClient.get(DefaultBinaryMemcacheClient.java:157)


spkrka commented:

Thanks for the report, we should investigate further. One known problem is that we never decrement the pending counter for some types of failed requests (failed writes), but since a failed write only happens if it disconnects, I didn't think that was important enough to fix.

I am on parental leave now, but I will take a look when I get back to work next week.

@danielnorberg Maybe this is something you want to investigate? I think you understand the netty integration best.


danielnorberg commented:

Sounds bad. Would be nice if we had some way to reproduce it. I'll take a stab at putting together a repro by connecting a client to a memcached mockup, hitting the outstanding request limit, and then verifying that it can recover.


mattnworb commented:

@danielnorberg in case you missed it I attached a test case in my original report up top.


spkrka commented:

Release 0.6.2 should be out now and fix this issue.


danielnorberg commented:

Can this issue be closed?


mattnworb commented:

Yep, I think #43 tests the same condition that I tried to reproduce in the test attached to this issue. Thanks

