Comments (21)

gerhard avatar gerhard commented on August 31, 2024 1

Not a problem, didn't want this to become a drag, happy for you to get around to this in your own time.

from prometheus_process_collector.

deadtrickster avatar deadtrickster commented on August 31, 2024

What's the libc version? Did you try to recompile and archive it yourself?

gerhard avatar gerhard commented on August 31, 2024

No, didn't try to recompile. It worked fine until the RabbitMQ node got restarted. Nothing changed on the OS.

ii  libc-bin                            2.19-0ubuntu6.14                           amd64        Embedded GNU C Library: Binaries
ii  libc-dev-bin                        2.19-0ubuntu6.14                           amd64        Embedded GNU C Library: Development binaries
ii  libc6:amd64                         2.19-0ubuntu6.14                           amd64        Embedded GNU C Library: Shared libraries
ii  libc6-dev:amd64                     2.19-0ubuntu6.14                           amd64        Embedded GNU C Library: Development Libraries and Header Files

gerhard avatar gerhard commented on August 31, 2024

This only happens when the VM starts with the prometheus_process_collector enabled. If the plugin is disabled, the Erlang VM restarts successfully.

deadtrickster avatar deadtrickster commented on August 31, 2024

Yeah, this likely means a binary incompatibility. Can you clone https://github.com/deadtrickster/prometheus_process_collector and run rebar3 archive (preferably on the target machine)? This will generate an ez archive.

gerhard avatar gerhard commented on August 31, 2024

OK, will do, most likely next week. Thanks!

gerhard avatar gerhard commented on August 31, 2024

I'm now compiling prometheus_process_collector using rebar3 archive, but the same segfault is still there.

Is there anything else that we can do to address this?

deadtrickster avatar deadtrickster commented on August 31, 2024

My gdb complains about missing files, symbols, etc. Could you please post the gdb output here? Both bt and bt full.

deadtrickster avatar deadtrickster commented on August 31, 2024

-O3 can be too much; maybe you could also try -O0.

essen avatar essen commented on August 31, 2024
#0  0x0000000000000cd6 in ?? ()
No symbol table info available.
#1  0x00007febd737c137 in get_process_info ()
   from /var/vcap/store/rabbitmq-server/mnesia/rabbit@rmq0-memory-alloc-a-plugins-expand/prometheus_process_collector-1.3.1/priv/prometheus_process_collector.so
No symbol table info available.
#2  0x000000000044b5c4 in process_main (x_reg_array=0x7febdb9e2df0, f_reg_array=0x0)
    at beam/beam_emu.c:3601
        fp = 0x7febd737c120 <get_process_info>
        env = {mod_nif = 0x7fec1a601050, proc = 0x7febd9eea968, hp = 0x7febd84f1ee0, 
          hp_end = 0x7febd84f27a8, heap_frag = 0x0, fpe_was_unmasked = 0, tmp_obj_list = 0x0, 
          exception_thrown = 0, tracee = 0x0, exiting = 0}
        live_hf_end = 0x0
        nif_bif_result = 140650912738208
        bif_nif_arity = 0
        init_done = 1
        c_p = 0x7febd9eea968
        reds_used = 0
        reg = 0x7fec1a980100
        opcodes = {0x44d086 <process_main+25782>, 0x448f5f <process_main+9103>, 
          0x448fbc <process_main+9196>, 0x449039 <process_main+9321>, 0x44d0d0 <process_main+25856>, 
          0x44a8d1 <process_main+15617>, 0x44ccca <process_main+24826>, 0x44a939 <process_main+15721>, 
          0x44dad6 <process_main+28422>, 0x44a9e8 <process_main+15896>, 0x44b046 <process_main+17526>, 
          0x44b0c4 <process_main+17652>, 0x44bc5f <process_main+20623>, 0x44c4f9 <process_main+22825>, 
          0x44cbba <process_main+24554>, 0x44bd0d <process_main+20797>, 0x44cd8f <process_main+25023>, 
          0x44c808 <process_main+23608>, 0x44c831 <process_main+23649>, 0x44c48e <process_main+22718>, 
          0x44c85f <process_main+23695>, 0x449f2a <process_main+13146>, 0x44aa08 <process_main+15928>, 
          0x44b56f <process_main+18847>, 0x44cbfc <process_main+24620>, 0x449cad <process_main+12509>, 
          0x4499f3 <process_main+11811>, 0x44aabd <process_main+16109>, 0x44d59c <process_main+27084>, 
          0x44d13e <process_main+25966>, 0x44aa52 <process_main+16002>, 0x446fbd <process_main+1005>, 
          0x44d5bb <process_main+27115>, 0x44d759 <process_main+27529>, 0x44d6fc <process_main+27436>, 
          0x44d6c8 <process_main+27384>, 0x44d8e2 <process_main+27922>, 0x447458 <process_main+2184>, 
          0x44747b <process_main+2219>, 0x4474a6 <process_main+2262>, 0x4474d0 <process_main+2304>, 
          0x4474f2 <process_main+2338>, 0x44751e <process_main+2382>, 0x447555 <process_main+2437>, 
          0x44758b <process_main+2491>, 0x4475c1 <process_main+2545>, 0x4475f6 <process_main+2598>, 
          0x44762c <process_main+2652>, 0x447661 <process_main+2705>, 0x447696 <process_main+2758>, 
          0x44d8d4 <process_main+27908>, 0x44b8e7 <process_main+19735>, 0x44b991 <process_main+19905>, 
          0x44d940 <process_main+28016>, 0x44ba2d <process_main+20061>, 0x44d905 <process_main+27957>, 
          0x44afd0 <process_main+17408>, 0x44a73f <process_main+15215>, 0x44a7a6 <process_main+15318>, 
          0x44a811 <process_main+15425>, 0x44a28c <process_main+14012>, 0x44a306 <process_main+14134>, 
          0x44a6b1 <process_main+15073>, 0x44a3ae <process_main+14302>, 0x44b4c2 <process_main+18674>, 
          0x449ea0 <process_main+13008>, 0x44a3f7 <process_main+14375>, 0x44bd37 <process_main+20839>, 
          0x44beb9 <process_main+21225>, 0x44bf80 <process_main+21424>, 0x44c535 <process_main+22885>, 
          0x44bfd7 <process_main+21511>, 0x44c027 <process_main+21591>, 0x44c65c <process_main+23180>, 
          0x44c884 <process_main+23732>, 0x44c8ff <process_main+23855>, 0x44c98d <process_main+23997>, 
          0x44c40a <process_main+22586>, 0x44c9fc <process_main+24108>, 0x44c4b5 <process_main+22757>, 
          0x44c7c8 <process_main+23544>, 0x44bbaa <process_main+20442>, 0x44cb04 <process_main+24372>, 
          0x44cd75 <process_main+24997>, 0x44cb88 <process_main+24504>, 0x44cb12 <process_main+24386>, 
          0x44ca12 <process_main+24130>, 0x44cfea <process_main+25626>, 0x44ce30 <process_main+25184>, 
          0x44cf11 <process_main+25409>, 0x44ba68 <process_main+20120>, 0x44bc4e <process_main+20606>, 
          0x44bbc0 <process_main+20464>, 0x44c5dd <process_main+23053>, 0x44bdcd <process_main+20989>, 
          0x44ce80 <process_main+25264>, 0x44be99 <process_main+21193>, 0x44be79 <process_main+21161>, 
          0x44c0c8 <process_main+21752>, 0x44c140 <process_main+21872>, 0x44c1c0 <process_main+22000>, 
          0x44c1f5 <process_main+22053>, 0x44d074 <process_main+25764>, 0x44b75c <process_main+19340>, 
          0x44cebe <process_main+25326>, 0x44d00b <process_main+25659>, 0x44cdd1 <process_main+25089>, 
          0x44cf96 <process_main+25542>, 0x44aedc <process_main+17164>, 0x44ab19 <process_main+16201>, 
          0x44ac2b <process_main+16475>, 0x44712c <process_main+1372>, 0x44719c <process_main+1484>, 
          0x44715b <process_main+1419>, 0x446fd6 <process_main+1030>, 0x44a86d <process_main+15517>, 
          0x44a53a <process_main+14698>, 0x4470f5 <process_main+1317>, 0x4470d1 <process_main+1281>, 
          0x44d97b <process_main+28075>, 0x449b4d <process_main+12157>, 0x449ba0 <process_main+12240>, 
          0x44d630 <process_main+27232>, 0x449974 <process_main+11684>, 0x449ca0 <process_main+12496>, 
          0x446fbd <process_main+1005>, 0x44b85c <process_main+19596>, 0x44b80a <process_main+19514>, 
          0x44b8a0 <process_main+19664>, 0x44d67c <process_main+27308>, 0x44cc2c <process_main+24668>, 
          0x44ace2 <process_main+16658>, 0x44b316 <process_main+18246>, 0x44b196 <process_main+17862>, 
          0x44d778 <process_main+27560>, 0x44cc80 <process_main+24752>, 0x44cd28 <process_main+24920>, 
          0x4476ca <process_main+2810>, 0x4476fa <process_main+2858>, 0x447728 <process_main+2904>, 
          0x447756 <process_main+2950>, 0x447787 <process_main+2999>, 0x4477b8 <process_main+3048>, 
          0x4477e8 <process_main+3096>, 0x447818 <process_main+3144>, 0x44b3ea <process_main+18458>, 
          0x4478f3 <process_main+3363>, 0x44c222 <process_main+22098>, 0x44791b <process_main+3403>, 
          0x44c246 <process_main+22134>, 0x447847 <process_main+3191>, 0x44787b <process_main+3243>, 
          0x4478b7 <process_main+3303>, 0x44d9e3 <process_main+28179>, 0x44962c <process_main+10844>, 
          0x4496dc <process_main+11020>, 0x4496ea <process_main+11034>, 0x44ad7b <process_main+16811>, 
          0x44a179 <process_main+13737>, 0x447942 <process_main+3442>, 0x44795a <process_main+3466>, 
          0x447977 <process_main+3495>, 0x4498fd <process_main+11565>, 0x449921 <process_main+11601>, 
          0x447993 <process_main+3523>, 0x4479b0 <process_main+3552>, 0x4498cf <process_main+11519>, 
          0x4498f3 <process_main+11555>, 0x44a711 <process_main+15169>, 0x44a6f3 <process_main+15139>, 
          0x44aeb2 <process_main+17122>, 0x44ae94 <process_main+17092>, 0x447017 <process_main+1095>, 
          0x44a0cf <process_main+13567>, 0x44c269 <process_main+22169>, 0x449807 <process_main+11319>, 
          0x44992b <process_main+11611>, 0x447120 <process_main+1360>, 0x447190 <process_main+1472>, 
          0x447153 <process_main+1411>, 0x446fce <process_main+1022>, 0x4470ed <process_main+1309>, 
          0x4470c9 <process_main+1273>, 0x44c360 <process_main+22416>, 0x44c2c0 <process_main+22256>, 
          0x44c310 <process_main+22336>, 0x44b6b1 <process_main+19169>, 0x44b660 <process_main+19088>, 
          0x44d4ba <process_main+26858>, 0x44d467 <process_main+26775>, 0x44da32 <process_main+28258>, 
          0x4496f7 <process_main+11047>, 0x4497a8 <process_main+11224>, 0x4497be <process_main+11246>, 
          0x4479cc <process_main+3580>, 0x447a59 <process_main+3721>, 0x44a9a7 <process_main+15831>, 
          0x446ffc <process_main+1068>, 0x44abc8 <process_main+16376>, 0x44a653 <process_main+14979>...}
#3  0x00000000004f5489 in sched_thread_func (vesdp=0x7fec19642100) at beam/erl_process.c:8906
        callbacks = {arg = 0x7fec19641c00, wakeup = 0x4fa030 <thr_prgr_wakeup>, 
          prepare_wait = 0x4f50a0 <thr_prgr_prep_wait>, wait = 0x4f6350 <thr_prgr_wait>, 
          finalize_wait = 0x4f5080 <thr_prgr_fin_wait>}
        esdp = 0x7fec19642100
        no = 1
#4  0x000000000067806f in thr_wrapper (vtwd=<optimized out>) at pthread/ethread.c:118
        result = 0
        c = 0 '\000'
        res = <optimized out>
        twd = <optimized out>
        thr_func = 0x4f5370 <sched_thread_func>
        arg = 0x7fec19642100
        tsep = 0x7fec1a400100
#5  0x00007fec5b4be184 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#6  0x00007fec5afe303d in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

We will try to get the debug symbols to have a better view.

essen avatar essen commented on August 31, 2024

Running into various issues so far. Anyway. Details are fuzzy, but the crash only occurs when there is a uid change just before starting RabbitMQ, and doesn't occur without the rabbitmq-server script (or at least I couldn't reproduce yet).

Whatever happens, I believe catching exceptions around here[1] and returning an empty list to Erlang would solve our issue. I don't have enough time today, but this is worth experimenting with next: if this is a transient issue at startup, then catching is a good idea, and if the issue is something else, then being able to inspect the VM state would help.

[1] https://github.com/deadtrickster/prometheus_process_collector/blob/master/c_src/prometheus_process_collector_nif.cc#L43

deadtrickster avatar deadtrickster commented on August 31, 2024

Hmm, I wonder what's so special about your environment.

"when there is a uid change just before starting RabbitMQ"

How do you mean?

Also, about this being a startup-only issue: why is this function called on startup? Or does it just coincide with scraping?

essen avatar essen commented on August 31, 2024

It uses start-stop-daemon to start it as user vcap instead of user root.

For the other questions, I don't know yet. I think it coincides with scraping, yes.

deadtrickster avatar deadtrickster commented on August 31, 2024

Anything special about this vcap user? start-stop-daemon simply calls setuid AFAIK. I could try to check this by starting as root and calling setuid in on_load.

gerhard avatar gerhard commented on August 31, 2024

Nothing special about the vcap user, it's the equivalent of ubuntu or debian:

id vcap
uid=1000(vcap) gid=1000(vcap) groups=1000(vcap),4(adm),30(dip),44(video),46(plugdev),1003(google-sudoers)

We are using vcap as a good practice, which is to not run services as root. This is the entire process tree:

init─┬─auditd─┬─audispd───{audispd}
     │        └─{auditd}
     ├─beam.smp(vcap)─┬─erl_child_setup
     │                └─81*[{beam.smp}]
     ├─cron
     ├─dhclient
     ├─epmd(vcap)
     ├─6*[getty]
     ├─google_accounts
     ├─google_clock_sk
     ├─google_ip_forwa
     ├─netdata(netdata)─┬─apps.plugin(root)
     │                  ├─bash
     │                  ├─python───{python}
     │                  └─13*[{netdata}]
     ├─route-registrar(vcap)─┬─2*[route_registrar─┬─route_registrar───perl]
     │                       │                    └─tee───route_registrar───logger]
     │                       └─8*[{route-registrar}]
     ├─rpc.idmapd
     ├─rpc.statd(statd)
     ├─rpcbind
     ├─rsyslogd(syslog)───3*[{rsyslogd}]
     ├─runsvdir─┬─runsv─┬─bosh-agent───13*[{bosh-agent}]
     │          │       └─svlogd
     │          └─runsv─┬─monit───{monit}
     │                  └─svlogd
     ├─sshd───sshd───sshd(bosh_514cb0a8bcd048a)───bash───pstree
     ├─systemd-udevd
     ├─upstart-file-br
     ├─upstart-socket-
     └─upstart-udev-br

monit runs as root and, in this case, supervises route-registrar & beam.smp. epmd starts implicitly, but the uid is already vcap.

This is how we start RabbitMQ: https://github.com/rabbitmq/rabbitmq-server-boshrelease/blob/master/jobs/rabbitmq-server/templates/bin/_start_rabbitmq-server

This is the exact start-stop-daemon.c that we use: https://github.com/rabbitmq/rabbitmq-server-boshrelease/blob/master/src/start-stop-daemon-1.9.18/start-stop-daemon.c

Can you see a problem with this approach of starting rabbitmq-server for prometheus_process_collector?

deadtrickster avatar deadtrickster commented on August 31, 2024

No, I should probably try changing the uid myself. The problem with the suggested catch is that SIGSEGV is a signal, not a C++ exception, so try...catch won't work. A signal handler can be set up, but the recovery strategy is unclear. Now I'm really intrigued about what's going on.
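
To illustrate the difference between a signal and a C++ exception, here is a minimal C++ sketch (not code from prometheus_process_collector). It raises SIGSEGV synthetically with std::raise, so the handler can return safely; a real fault from an invalid memory access cannot simply be resumed this way, which is why recovery is the hard part. The names g_segv_seen, on_segv and catch_block_ran_for_sigsegv are made up for this example.

```cpp
#include <cassert>
#include <csignal>

// Flag written by the handler; sig_atomic_t is the type that is safe to
// modify from inside a signal handler.
static volatile std::sig_atomic_t g_segv_seen = 0;

// Hypothetical handler: it only records that the signal arrived.
extern "C" void on_segv(int) { g_segv_seen = 1; }

// Raises SIGSEGV inside a try block and reports whether the catch block ran.
bool catch_block_ran_for_sigsegv() {
  g_segv_seen = 0;
  std::signal(SIGSEGV, on_segv);   // install the handler
  bool caught = false;
  try {
    std::raise(SIGSEGV);           // synthetic signal, not a real memory fault
  } catch (...) {
    caught = true;                 // not reached: a signal is not an exception
  }
  std::signal(SIGSEGV, SIG_DFL);   // restore the default disposition
  return caught;
}

// Small usage check: the handler sees the signal, the catch block does not.
void demo_signal_vs_exception() {
  assert(!catch_block_ran_for_sigsegv());
  assert(g_segv_seen == 1);
}
```

In this sketch only the handler observes the signal; the catch (...) block never runs.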

gerhard avatar gerhard commented on August 31, 2024

Thanks for digging into this, let me know if there is anything that I can help with.

gerhard avatar gerhard commented on August 31, 2024

I've stopped using prometheus_process_collector for now, I'll be looking into bridging netdata with prometheus instead. Thanks for your help!

deadtrickster avatar deadtrickster commented on August 31, 2024

Yeah, I see. Looks like I just don't have time for this; let's leave this issue open.

letrec avatar letrec commented on August 31, 2024

I believe the root cause is throwing C++ exceptions. This is not what a well-behaved NIF should do.

https://github.com/deadtrickster/prometheus_process_collector/blob/master/c_src/prometheus_process_info_linux.cc#L76

There are other places like this that do that.

letrec avatar letrec commented on August 31, 2024

I think returning some well-defined number (like -1) in such situations would be preferable to crashing the VM.
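
A minimal sketch of that idea, assuming a helper that throws on failure: catch everything at the boundary and return -1 instead of letting the exception escape into the VM. The function names and the file-reading helper are hypothetical and are not taken from prometheus_process_collector; a real NIF would build an Erlang term rather than return a long.

```cpp
#include <cassert>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper standing in for code that throws: reads the first
// integer from a file and throws std::runtime_error on any failure.
long read_first_long_or_throw(const std::string& path) {
  std::ifstream in(path);
  if (!in) throw std::runtime_error("cannot open " + path);
  long value = 0;
  if (!(in >> value)) throw std::runtime_error("cannot parse " + path);
  return value;
}

// Boundary wrapper: no exception may leave this function. Any failure is
// mapped to the sentinel -1, assuming valid readings are never negative.
long read_first_long_or_minus_one(const std::string& path) noexcept {
  try {
    return read_first_long_or_throw(path);
  } catch (...) {
    return -1;
  }
}

// Small usage check: a missing file yields the sentinel instead of a throw.
void demo_missing_file_returns_sentinel() {
  assert(read_first_long_or_minus_one("/nonexistent/example/path") == -1);
}
```

The caller on the Erlang side can then treat -1 as "value unavailable" rather than losing the whole node.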
