Describe the problem The agent process uses 250-300% CPU the entir

do-agent process constant high CPU usage about do-agent HOT 22 CLOSED

lots0logs commented on July 17, 2024 1

do-agent process constant high CPU usage

from do-agent.

Comments (22)

bsnyder788 commented on July 17, 2024 1

Thanks for the extra info @lots0logs. I'll see if I can reproduce it on a k8s cluster and get to the bottom of why these errors are popping up for you.

bsnyder788 commented on July 17, 2024 1

@lots0logs I was not able to reproduce it on a k8s cluster either. Can you try adding --web.listen to the ExecStart line of the systemd unit file, e.g. ExecStart=/opt/digitalocean/bin/do-agent --web.listen --syslog? After a systemctl daemon-reload and a systemctl restart do-agent, you should be able to run curl localhost:9100 and get the raw metrics that are being scraped. I am curious whether that output somehow has duplicate entries for the metrics in your original log.
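The duplicate check asked for above can be sketched in a few lines of Python. This is only an illustration, not part of do-agent: it assumes the endpoint at localhost:9100 returns the Prometheus text exposition format, and the function names `fetch_metrics` and `duplicate_series` are hypothetical.

```python
import re
from collections import Counter
from urllib.request import urlopen

# Metric name, optional {label set}, then whitespace before the sample value.
SERIES_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s')


def fetch_metrics(url: str = "http://localhost:9100") -> str:
    """Download the raw metrics text (assumes --web.listen is enabled)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8", errors="replace")


def duplicate_series(metrics_text: str) -> list[str]:
    """Return every series (name plus label set) that appears more than once.

    Label sets are compared as written, so the same labels in a different
    order would not be detected by this simple check.
    """
    counts = Counter()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SERIES_RE.match(line)
        if match:
            counts[match.group(1) + (match.group(2) or "")] += 1
    return sorted(series for series, n in counts.items() if n > 1)
```

Calling `duplicate_series(fetch_metrics())` on an affected droplet would list any series that is emitted twice. If the endpoint returns an error page instead of metrics, the function simply reports no duplicates, so the raw output is still worth reading by hand.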

bsnyder788 commented on July 17, 2024 1

>> Is your 20.04 image the stock DO image or a custom 20.04 image?

> Both affected droplets were built from the stock DO Ubuntu 20.04 LTS image with "Monitoring" option checked at the DO dashboard https://cloud.digitalocean.com/droplets/new.

Thanks. I will try to reproduce from that.

bsnyder788 commented on July 17, 2024 1

I couldn't reproduce it on a myriad of 20.04 droplets either, but I went ahead and made a new beta release that disables the collection of /boot mountpoints. If some of you would give it a try to see whether it now works on your specific droplets, that would be fantastic. You can install it via curl -SsL https://repos.insights.digitalocean.com/install.sh | sudo BETA=1 bash. Please let me know if that fixes your issues. cc @UnKnoWn-Consortium @lots0logs @plutocrat
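For readers wondering what "disabling the collection of /boot mountpoints" amounts to, here is a minimal Python sketch of the general idea: skip ignored mountpoints and never report the same device, mountpoint and filesystem type twice. It is not the agent's actual implementation (do-agent is not written in Python); the function name `collectable_mounts` and the default ignore pattern are assumptions made for illustration, and the input is text in the layout of /proc/mounts.

```python
import re


def collectable_mounts(proc_mounts_text, ignored=r"^/boot(/|$)"):
    """Return unique (device, mountpoint, fstype) tuples worth collecting.

    Mountpoints matching the `ignored` pattern are dropped, and repeated
    entries are reported only once, so each label set is emitted a single time.
    """
    ignore_re = re.compile(ignored)
    seen = set()
    result = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:  # malformed or empty line
            continue
        device, mountpoint, fstype = fields[:3]
        if ignore_re.search(mountpoint):
            continue
        key = (device, mountpoint, fstype)
        if key in seen:  # same filesystem listed twice
            continue
        seen.add(key)
        result.append(key)
    return result
```

On a droplet, `collectable_mounts(open("/proc/mounts").read())` would show which filesystems remain after the filter.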

UnKnoWn-Consortium commented on July 17, 2024 1

@bsnyder788 The 3.8.0 pre-release you have just made seems to have fixed the issue. At least it is no longer spamming those two error messages and taking a whole lot of CPU resources.

plutocrat commented on July 17, 2024 1

24 hours later, and it's still OK.
Note: if you've been affected by this issue you might want to clean out your systemd journal logs. I just got rid of 3.5 GB of spam from mine using "/bin/journalctl --vacuum-size=500M". Your mileage may vary: there may be more subtle ways to remove the logs from just do-agent, although I haven't found them.

lots0logs commented on July 17, 2024 1

Sorry for the delayed response. We had a hurricane here and I was without power for a few days. I'm glad to see that y'all were able to identify the problem and implement a fix! Thanks!!

bsnyder788 commented on July 17, 2024 1

3.8.0 is officially released. I am going to close this. Please open a new issue if you see anything similar in the future. Thanks!

bsnyder788 commented on July 17, 2024

@lots0logs I can't reproduce this by just spinning up a 20.04 droplet with the agent. It seems to be working just fine on the ten 20.04 droplets I just spun up (and there are no spammy logs hitting the journal). We did have a similar report (#228), but in that case the user was using DOKS and it was with an alpha version of kube-state-metrics. I notice your logs also say "k8s-cluster-stage..". Do you have kube-state-metrics installed? If so, what version of kube-state-metrics?

lots0logs commented on July 17, 2024

@bsnyder788 Yeah, looks like I have a rancher/coreos-kube-state-metrics:v1.9.5 container running on one of my nodes. Though the issue I described happens on all nodes.

UnKnoWn-Consortium commented on July 17, 2024

I have encountered exactly the same issue (even the systemd message is the same) with do-agent on two droplets running Ubuntu 20.04.1. One was set up yesterday while the other one has been in use for just a month or so. I have had to stop the do-agent service.

P.S. I am not running Kubernetes on the two affected droplets.

UnKnoWn-Consortium commented on July 17, 2024

> @lots0logs I was not able to reproduce it on a k8s cluster either. Can you try adding --web.listen in the systemd unit file (in the ExecStart line). e.g. ExecStart=/opt/digitalocean/bin/do-agent --web.listen --syslog. After doing a systemctl daemon-reload and a systemctl restart do-agent, you should be able to do a curl localhost:9100 and get the raw metrics that are being scraped. I would be curious to see if that is somehow having duplicate entries for the metrics in your original log.

Following your guide, I am able to get the following:

2 error(s) occurred:
* collected metric "node_filesystem_size_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:1.09422592e+08 > } was collected before with the same name and label values
* collected metric "node_filesystem_free_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:9.9854336e+07 > } was collected before with the same name and label values
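Both errors name the same device (/dev/vda15) and the same mountpoint (/boot/efi). When many such lines pile up in the journal, a small parser makes that pattern easier to spot. The following Python sketch is an illustration only: it assumes the message layout is exactly as pasted above, and the name `parse_duplicate_error` is hypothetical.

```python
import re

# Matches the metric name and the braced body of one error line.
ERROR_RE = re.compile(
    r'collected metric "(?P<name>[^"]+)" \{(?P<body>.*)\} was collected before'
)
# Matches each label:<name:"..." value:"..." > pair inside the body.
LABEL_RE = re.compile(r'label:<name:"([^"]+)" value:"([^"]*)" >')


def parse_duplicate_error(line):
    """Return (metric_name, labels_dict) for a duplicate-metric error line.

    Returns None when the line does not look like such an error.
    """
    match = ERROR_RE.search(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("body")))
    return match.group("name"), labels
```

Feeding each journal line through `parse_duplicate_error` and collecting the `mountpoint` label shows at a glance which filesystems are being reported twice.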

plutocrat commented on July 17, 2024

Confirmed here. Running 4 Ubuntu 20.04 droplets. On all 4, do-agent CPU is running at around 95% all the time.
Tried the --web.listen instruction, but apparently nothing is listening on that port when I do.
I can also confirm the error messages in /var/log/syslog and journalctl -xe.
Installed version: 3.7.1
For now I've just got rid of it with apt purge do-agent.

bsnyder788 commented on July 17, 2024

> Following your guide, I am able to get the following:
> 2 error(s) occurred:
> * collected metric "node_filesystem_size_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:1.09422592e+08 > } was collected before with the same name and label values
> * collected metric "node_filesystem_free_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:9.9854336e+07 > } was collected before with the same name and label values

When you do the curl localhost:9100 what is the raw output?

bsnyder788 commented on July 17, 2024

Is your 20.04 image the stock DO image or a custom 20.04 image?

On Fri, Oct 30, 2020, 5:15 AM plutocrat wrote:
> Confirmed here. Running 4 Ubuntu 20.04 droplets. On all 4 do-agent cpu is around 95% all the time. Tried the --web.listen instruction, but apparently nothing listening on that port when I do. Installed version: 3.7.1

UnKnoWn-Consortium commented on July 17, 2024

>> Following your guide, I am able to get the following:
>> 2 error(s) occurred:
>> * collected metric "node_filesystem_size_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:1.09422592e+08 > } was collected before with the same name and label values
>> * collected metric "node_filesystem_free_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:9.9854336e+07 > } was collected before with the same name and label values

> When you do the curl localhost:9100 what is the raw output?

The quoted part was exactly what I got when I did curl localhost:9100.

UnKnoWn-Consortium commented on July 17, 2024

> Is your 20.04 image the stock DO image or a custom 20.04 image?

Both affected droplets were built from the stock DO Ubuntu 20.04 LTS image with "Monitoring" option checked at the DO dashboard https://cloud.digitalocean.com/droplets/new.

bsnyder788 commented on July 17, 2024

> The quoted part was exactly what I got when I did curl localhost:9100.

Ok, thanks. I wanted to make sure that was all the info we could discern.

bsnyder788 commented on July 17, 2024

> @bsnyder788 The 3.8.0 pre-release you have just made seems to have fixed the issue.

Thank you so much for testing it out @UnKnoWn-Consortium, that is great that it is helping out. I will leave this release in beta over the weekend and check in early next week to make sure no other regressions or issues have shown themselves to you by then and if all is good, I will promote 3.8.0 to stable.

UnKnoWn-Consortium commented on July 17, 2024

> Thank you so much for testing it out @UnKnoWn-Consortium, that is great that it is helping out. I will leave this release in beta over the weekend and check in early next week to make sure no other regressions or issues have shown themselves to you by then and if all is good, I will promote 3.8.0 to stable.

Okay I will keep an eye out and see if anything goes astray with it (I seriously hope not). Have a nice weekend btw.

plutocrat commented on July 17, 2024

Also confirming DO stock Ubuntu 20.04 build.
I have installed the beta release on one of the four affected boxes, and it's showing healthy, near-zero CPU. Thanks. Will monitor.

bsnyder788 commented on July 17, 2024

Thanks all! I'm going to go ahead and release 3.8.0 on the stable branch as well.
