Comments (9)
@jyksnw Can you check the log file? I think the default level is INFO, which logs every request made along with a timestamp. It may be easier to debug if we know when the agent stopped sending data.
from amonagent.
Here are the lines of the log from one of the servers (I have obfuscated the URI and API key).
time="2017-03-21T17:44:05-04:00" level=info msg="Metrics collected (Interval:1m0s)\n"
time="2017-03-21T17:44:05-04:00" level=info msg="Sending data to https://server1.example.com/api/system/v2/?api_key=XXXXXXXXXX\n"
time="2017-03-22T06:25:11-04:00" level=info msg="Starting Amon Agent (Version: 0.7.2)\n"
time="2017-03-22T06:25:11-04:00" level=info msg="Agent Config: Interval:1m0s\n"
time="2017-03-22T06:25:19-04:00" level=info msg="Metrics collected (Interval:1m0s)\n"
time="2017-03-22T06:25:20-04:00" level=info msg="Sending data to https://server1.example.com/api/system/v2/?api_key=XXXXXXXXXX\n"
I think I see the issue. It appears to be a two-part issue:
- After searching the logs, I found that named had logged a number of errors indicating that it couldn't resolve the hostname of our amon server.
- Go's http.Client defaults to a timeout of 0, which means no timeout. In the scenario above, the call to SendData apparently never returns because the client is stuck trying to complete a connection to a hostname that it can't resolve or reach.
Luckily, this is easy to fix by creating the http.Client with a specified timeout. I can add this without any issue, but wanted to know whether the timeout should be a configuration option or a statically set value (say 10 seconds).
I am going to create a local build to test this theory out, but after reading through our logs and looking into the http.Client request handling I am highly confident this was the cause of this issue.
Sorry, I didn't look into how the transport was constructed before commenting. It looks like a 10-second timeout is already being applied via the transport.
@jyksnw It could be a goroutine leak somewhere, although I do check for data races before releasing. One way to determine if that is the case is to monitor the memory usage.
What makes this one difficult to catch, I think, is that parts of this bug appear to be hardware/distro related. I personally have 5 agents that have been running since last August.
We have 3 other servers running with similar hardware/distro configuration that we haven't seen any issues on.
CloudFlare has an excellent writeup and graph outlining Go's client connection sequence and where each of the various timeout settings comes into play.
Source - CloudFlare: The complete guide to Go net/http timeouts
So although a timeout is being set for ResponseHeaderTimeout, the request might never have reached that point and could still be stuck. The write-up suggests a Transport setup that could be implemented. I will create a build for just these two servers with the suggested Transport setup and see if the issue presents itself again.
@jyksnw Thanks for sharing the guide. Yes, this could be the issue: the amonagent does not have a cancel request policy, just a timeout.
I have a local branch that implements finer-grained timeouts and adds a cancel request policy that currently cancels the request after a 10-second delay. I will test it against the two servers we have been having issues with, to see whether it solves the problem and whether it introduces any other issues.
@jyksnw Cool. If it works, you can submit it as a pull request and I will merge it and push a new release of the agent with the fix.