Giter VIP home page Giter VIP logo

Comments (11)

essen avatar essen commented on June 12, 2024 1

If this {current_function,{gen,do_call,4}} never changes, you could try tracing the process with dbg (or recon which can disable the trace automatically after a few seconds).

If you see nothing, then I suppose the process is really stuck in gen:do_call/4. In that case it's likely to be a bug in OTP. There's been a bug or two that were fixed in patch releases more or less recently that you might be running into.

from gun.

essen avatar essen commented on June 12, 2024 1

Just do

dbg:tracer().
dbg:tpl(gen).
dbg:tpl(gen_statem).
dbg:tpl(gun).
dbg:p(<0.31175.48>, c).

from gun.

essen avatar essen commented on June 12, 2024 1

Ah well you probably are running a release that doesn't include runtime_tools. Make note to always include runtime_tools in the future to allow debugging things. I would also recommend adding https://github.com/zhongwencool/observer_cli (which also depends on recon).

In any case if you are on 23 I would recommend upgrading. What you are seeing looks like a bug in OTP, which may have been fixed since. At the very least make sure you are on the latest 23 patch release (OTP-23.3.4.20).

Then if you run into this again, with runtime_tools we would have an easier time confirming that the process is, indeed, stuck. Then a report can be sent to OTP team.

from gun.

essen avatar essen commented on June 12, 2024 1

That's OTP-23.3.4.18.

OTP-23.3.4.19 fixes an issue where processes that use max_heap_size (that you might have set) to become zombies, as well as issues around process signal queues.

OTP-23.3.4.20 doesn't seem to have a relevant fix but doesn't hurt to use it.

from gun.

IgorKoshnarov avatar IgorKoshnarov commented on June 12, 2024 1

I think its pid is <0.21289.50> if your node is still up.

I had to restart the application because of complaints, so I can't investigate that process. And currently there is no queues in gun connection processes, as I said it happens occasionally under unknown conditions.

from gun.

IgorKoshnarov avatar IgorKoshnarov commented on June 12, 2024

Thanks, yes, {current_function, {gen, do_call, 4}} is always the same.
I'll try to dbg and report here if something pops up.

from gun.

IgorKoshnarov avatar IgorKoshnarov commented on June 12, 2024

I couldn't dbg somehow, I tried to erl -remsh to my app and open trace port, but got error:

dbg:tracer(port, dbg:trace_port(ip, 50000)).
** exception error: undefined function dbg:trace_port/2

The erlang version on the server is 23.

Also I did
rp(process_Info(<0.3238.0>, backtrace)).

and got this:
{backtrace,<<"Program counter: 0x00007f3542681490 (gen:do_call/4 + 280)\ny(0) []\ny(1) infinity\ny(2) []\ny(3) #Ref<0.3234177338.3271557122.27248>\ny(4) <0.21289.50>\n\n0x00007f34ab0c1630 Return addr 0x00007f34c72fb808 (gen_statem:call_dirty/4 + 104)\ny(0) infinity\ny(1) {start,infinity}\ny(2) <0.21289.50>\ny(3) Catch 0x00007f34c72fb840 (gen_statem:call_dirty/4 + 160)\n\n0x00007f34ab0c1658 Return addr 0x00007f3542a7e1c8 (ssl_gen_statem:call/2 + 72)\ny(0) Catch 0x00007f3542a7e1d8 (ssl_gen_statem:call/2 + 88)\n\n0x00007f34ab0c1668 Return addr 0x00007f3542a751d0 (ssl_gen_statem:handshake/2 + 152)\ny(0) []\ny(1) {sslsocket,{gen_tcp,#Port<0.5296215>,tls_connection,undefined},[<0.21289.50>,<0.22818.50>]}\n\n0x00007f34ab0c1680 Return addr 0x00007f3542a94790 (tls_gen_connection:start_fsm/8 + 1240)\ny(0) []\ny(1) []\ny(2) []\ny(3) []\ny(4) []\ny(5) []\ny(6) []\ny(7) []\ny(8) []\ny(9) []\ny(10) Catch 0x00007f3542a947a0 (tls_gen_connection:start_fsm/8 + 1256)\n\n0x00007f34ab0c16e0 Return addr 0x00007f3542a74fc8 (ssl_gen_statem:connect/8 + 104)\ny(0) Catch 0x00007f3542a74fd8 (ssl_gen_statem:connect/8 + 120)\n\n0x00007f34ab0c16f0 Return addr 0x00007f34c5189890 (gun:normal_tls_handshake/4 + 384)\ny(0) []\ny(1) gun_default_event_h\ny(2) []\ny(3) [http2,http]\ny(4) #{socket=>#Port<0.5296215>,timeout=>infinity,tls_opts=>[{alpn_advertised_protocols,[<<\"http/1.1\">>,<<\"h2\">>]},{client_preferred_next_protocols,{client,[<<\"http/1.1\">>,<<\"h2\">>],<<\"http/1.1\">>}},{server_name_indication,\"management.me\"},{alpn_advertised_protocols,[<<\"http/1.1\">>,<<\"h2\">>]},{client_preferred_next_protocols,{client,[<<\"http/1.1\">>,<<\"h2\">>],<<\"http/1.1\">>}}]}\ny(5) {state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}\ny(6) undefined\n\n0x00007f34ab0c1730 Return addr 0x00007f34c5188518 (gun:initial_tls_handshake/3 + 416)\ny(0) []\ny(1) []\ny(2) 2\ny(3) []\ny(4) []\n\n0x00007f34ab0c1760 Return addr 0x00007f34c72fea10 (gen_statem:loop_state_callback/11 + 872)\ny(0) []\ny(1) []\ny(2) true\ny(3) false\ny(4) []\ny(5) false\ny(6) []\ny(7) {initial_tls_handshake,{state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}}\ny(8) [{internal,{retries,2,#Port<0.5296215>}}]\ny(9) {state,{initial_tls_handshake,{state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}},[],#{},false}\ny(10) []\ny(11) {params,state_functions,false,<0.1197.0>,[gun],<0.3238.0>,infinity}\ny(12) Catch 0x00007f34c72fea80 (gen_statem:loop_state_callback/11 + 984)\n\n0x00007f34ab0c17d0 Return addr 0x00007f34c7c83c00 (proc_lib:init_p/5 + 264)\ny(0) []\ny(1) []\ny(2) []\ny(3) Catch 0x00007f34c7c83c10 (proc_lib:init_p/5 + 280)\n\n0x00007f34ab0c17f8 Return addr 0x000000000092aa38 (<terminate process normally>)\n">>}

But this didn't give me any clue.

from gun.

IgorKoshnarov avatar IgorKoshnarov commented on June 12, 2024

It failed on the first command:

dbg:tracer().
** exception error: undefined function dbg:tracer/0

It seems that module dbg is not present on the server.

code:is_loaded(dbg).
false

tried to load it:

code:ensure_loaded(dbg).
{error,embedded}

from gun.

IgorKoshnarov avatar IgorKoshnarov commented on June 12, 2024

Ok, thanks, currently I see this version:
`cat /usr/lib/erlang/releases/RELEASES

%% coding: utf-8
[{release,"Erlang/OTP","23","11.2.2.18",
[{kernel,"7.3.1.7","/usr/lib/erlang/lib/kernel-7.3.1.7"},
{stdlib,"3.14.2.3","/usr/lib/erlang/lib/stdlib-3.14.2.3"},
{sasl,"4.0.2","/usr/lib/erlang/lib/sasl-4.0.2"}],
permanent}].`

from gun.

IgorKoshnarov avatar IgorKoshnarov commented on June 12, 2024

I'll put here the reply which I got from Fred Herbert on erlangforums, based on my backtrace:

https://erlangforums.com/t/gun-gen-statem-process-queue-messages-stoped-being-handled/3413/5

Your FSM is currently stuck in a call to another FSM. In this case it appears to be related to TLS connection handling, during the handshake. The pid of that FSM Is <0.21289.50>.

Since that call has an infinity timeout, your own FSM is gonna be stuck until the other one returns; if it’s hung (because it’s waiting on your FSM or on a broken external connection), then you’re also going to be stuck forever.

Your options are likely to be one of:

put a timeout on the call (and possibly handle the errors)
move the call to be asynchronous to a sub-process so even if it hangs it doesn’t block the FSM
investigate and resolve the issue with the TLS handshake to make sure it can’t happen (may not always be possible if the causes are outside of your control)

from gun.

essen avatar essen commented on June 12, 2024

Yes I suppose you could configure tls_handshake_timeout to something other than infinity to avoid the issue of the Gun connection process getting stuck. But I don't know what would happen to the TLS process. Maybe investigating it would tell us more. I think its pid is <0.21289.50> if your node is still up.

Edit: Note that depending on what the issue is, setting a timeout will not help. It depends on which process gets stuck and why.

from gun.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.