Comments (9)

martinrusev commented on July 4, 2024

@jyksnw Can you check the log file? I think the default log level is INFO, which logs every request made along with a timestamp. It may be easier to debug if we know when the agent stopped sending data.

from amonagent.

jyksnw commented on July 4, 2024

@martinrusev

Here are the lines of the log from one of the servers (I have obfuscated the URI and API key).

time="2017-03-21T17:44:05-04:00" level=info msg="Metrics collected (Interval:1m0s)\n"
time="2017-03-21T17:44:05-04:00" level=info msg="Sending data to https://server1.example.com/api/system/v2/?api_key=XXXXXXXXXX\n"
time="2017-03-22T06:25:11-04:00" level=info msg="Starting Amon Agent (Version: 0.7.2)\n"
time="2017-03-22T06:25:11-04:00" level=info msg="Agent Config: Interval:1m0s\n"
time="2017-03-22T06:25:19-04:00" level=info msg="Metrics collected (Interval:1m0s)\n"
time="2017-03-22T06:25:20-04:00" level=info msg="Sending data to https://server1.example.com/api/system/v2/?api_key=XXXXXXXXXX\n"

jyksnw commented on July 4, 2024

I think I see the issue. It appears to be a two-part issue:

  1. After searching the logs, I found that named had logged a number of errors indicating that it couldn't resolve the hostname of our Amon server.
  2. Go's http.Client defaults to a Timeout of 0, which means no timeout. In the scenario above, the call to SendData appears never to return because the client is stuck trying to complete a connection to a hostname that it can't resolve or reach.

Luckily, this is easy to fix by creating the http.Client with an explicit timeout. I can add this without any issue, but wanted to know whether the timeout should be a configuration option or a statically set value (say, 10 seconds).

I am going to create a local build to test this theory out, but after reading through our logs and looking into the http.Client request handling I am highly confident this was the cause of this issue.
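For illustration, a minimal sketch of the fix described above: an http.Client built with an explicit overall timeout. The helper name newClient and the 10-second value are assumptions for the example, not code from amonagent.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newClient is a hypothetical helper (not part of amonagent). It returns an
// http.Client whose Timeout bounds the whole request: connection setup,
// redirects, and reading the response body. A zero Timeout means no timeout.
func newClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

func main() {
	// 10 seconds is the statically set value floated above; it could just as
	// well be read from the agent's configuration.
	client := newClient(10 * time.Second)
	fmt.Println("client timeout:", client.Timeout)
}
```

Whether the value is hard-coded or configurable only changes where the argument to newClient comes from.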

jyksnw commented on July 4, 2024

Sorry, I didn't look into how the transport was constructed before commenting. It looks like a 10 second timeout is already being applied via the transport.

martinrusev commented on July 4, 2024

@jyksnw It could be a goroutine leak somewhere, although I do check for data races before releasing. One way to determine if that is the case is to monitor the memory usage.

What makes this one difficult to catch, I think, is that parts of this bug are hardware / distro related. I personally have 5 agents that have been running since last August.
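As a rough sketch of the leak check suggested above, a process can sample its own goroutine count and heap usage and compare the numbers over time. The snapshot type and the leakDemo helper are hypothetical names for this example; the leak is simulated and is not amonagent's code.

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot holds the two numbers worth watching for a suspected leak.
type snapshot struct {
	Goroutines int
	HeapAlloc  uint64 // bytes of allocated heap objects
}

// takeSnapshot reads the current goroutine count and heap usage.
func takeSnapshot() snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return snapshot{Goroutines: runtime.NumGoroutine(), HeapAlloc: m.HeapAlloc}
}

// leakDemo simulates a leak by starting n goroutines that block forever
// (until release is called) and returns snapshots taken before and after.
func leakDemo(n int) (before, after snapshot, release func()) {
	block := make(chan struct{})
	before = takeSnapshot()
	for i := 0; i < n; i++ {
		go func() { <-block }() // never returns unless block is closed
	}
	after = takeSnapshot()
	return before, after, func() { close(block) }
}

func main() {
	before, after, release := leakDemo(100)
	defer release()
	fmt.Printf("goroutines: %d -> %d\n", before.Goroutines, after.Goroutines)
	fmt.Printf("heap bytes: %d -> %d\n", before.HeapAlloc, after.HeapAlloc)
}
```

In a long-running agent, the same snapshot could be logged once per collection interval; a goroutine count that keeps climbing would point toward a leak.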

jyksnw commented on July 4, 2024

We have 3 other servers running with similar hardware/distro configuration that we haven't seen any issues on.

CloudFlare has an excellent writeup and graph outlining Go's client connection sequence and where each of the various timeout settings comes into play.


Source - CloudFlare: The complete guide to Go net/http timeouts

So although a timeout is set for ResponseHeaderTimeout, the request might not have reached that point and could still be stuck. The write-up suggests a Transport setup that could be implemented. I will create a build for just these two servers with the suggested Transport setup and see whether the issue presents itself again.
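A sketch of what such a Transport setup might look like, with a separate timeout for each phase of the request. The helper name newTransport and all duration values are assumptions for illustration, not the write-up's or amonagent's actual settings.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newTransport is a hypothetical helper that sets one timeout per phase,
// so a stall before the response-header stage is also bounded.
func newTransport() *http.Transport {
	return &http.Transport{
		// Bounds establishing the TCP connection, including the DNS lookup.
		DialContext: (&net.Dialer{
			Timeout:   5 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		// Bounds the TLS handshake after the connection is established.
		TLSHandshakeTimeout: 5 * time.Second,
		// Bounds the wait for response headers after the request is written.
		ResponseHeaderTimeout: 10 * time.Second,
		// Bounds the wait for a "100 Continue" reply, when one is requested.
		ExpectContinueTimeout: 1 * time.Second,
	}
}

func main() {
	tr := newTransport()
	client := &http.Client{Transport: tr}
	_ = client // the agent would use this client for its POST requests
	fmt.Println("TLS handshake timeout:", tr.TLSHandshakeTimeout)
	fmt.Println("response header timeout:", tr.ResponseHeaderTimeout)
}
```

None of these fields bounds reading the response body; the http.Client Timeout field or a request context would be needed for that.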

martinrusev commented on July 4, 2024

@jyksnw Thanks for sharing the guide. Yes, this could be the issue: the amonagent does not have a cancel request policy, just a timeout.

jyksnw commented on July 4, 2024

I have a local branch that implements more fine-grained timeouts and adds a cancel request policy, which currently cancels the request after a 10 second delay. I will test this against the two servers we have been having issues with to see whether it solves the problem and whether it introduces any other potential issues.
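The local branch itself is not shown here, so the following is only a sketch of one way a cancel request policy can be written in Go: attach a context with a deadline to the request so the whole call is aborted when the deadline passes. The names sendData, slowServerError, and timedOut are hypothetical, and the demo uses a short deadline against a local test server instead of 10 seconds against a real one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// sendData is a hypothetical stand-in for the agent's send call. The context
// deadline cancels the request at whatever stage it has reached.
func sendData(client *http.Client, url string, deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), deadline)
	defer cancel() // release the timer even when the request succeeds

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

// timedOut reports whether err was caused by the context deadline.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// slowServerError posts to a local test server that does not answer before
// the deadline and returns the resulting error.
func slowServerError() error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done(): // client gave up
		case <-time.After(2 * time.Second):
		}
	}))
	defer srv.Close()
	return sendData(srv.Client(), srv.URL, 100*time.Millisecond)
}

func main() {
	err := slowServerError()
	fmt.Println("request cancelled by deadline:", timedOut(err))
}
```

With this shape, a send that hangs is abandoned and the next collection interval can try again.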

martinrusev commented on July 4, 2024

@jyksnw Cool. If it works, you can submit it as a pull request and I will merge it and push a new release of the agent with the fix.
