
dd-agent's Introduction


Important note

This repository contains the source code for the Datadog Agent up to and including major version 5. Although still supported, no major features are planned for this release line, and we encourage users and contributors to refer to the new Agent codebase, introduced with the release of version 6.0.0 and tracked in a different git repository.

Changes

Please refer to the Change log for more details about the changes introduced at each release.

How to contribute code

Before submitting any code, please read our contributing guidelines. We'll keep accepting contributions as long as major version 5 is supported, but please consider submitting new features to the new Agent codebase.

Please note that the Agent is licensed for simplicity's sake under a simplified BSD license, as indicated in the LICENSE file. Exceptions are marked with LICENSE-xxx where xxx is the component name. If you do not agree with the licensing terms and wish to contribute code nonetheless, please email us at [email protected] before submitting your pull request.

Set up your environment

Required:

  • python 2.7
  • bundler (to get it: gem install bundler)
# Clone the repository
git clone [email protected]:DataDog/dd-agent.git

# Create a virtual environment and install the dependencies:
cd dd-agent
bundle install
rake setup_env

# Activate the virtual environment
source venv/bin/activate

# Lint
bundle exec rake lint

# Run a flavored test
bundle exec rake ci:run[apache]

Integrations

All checks have been moved to the Integrations Core repo. Please look there to submit related issues, PRs, or review the latest changes.

Tests

See the tests documentation for more about how to write and run tests.

How to configure the Agent

If you are using packages on Linux, the main configuration file lives in /etc/dd-agent/datadog.conf. Per-check configuration files live in /etc/dd-agent/conf.d; we provide an example in the same directory that you can use as a template.
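A per-check file has this general shape (a minimal sketch; the file name and instance keys are illustrative and vary by check):

# /etc/dd-agent/conf.d/mycheck.yaml
init_config:

instances:
  - host: localhost
    port: 8080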

How to write your own checks

Writing your own checks is easy using our checks.d interface. Read more about how to use it on our Guide to Agent Checks.
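For a taste, a minimal check could look like the sketch below; the file and metric names are illustrative, and the file pairs with a conf.d YAML of the same name.

# /etc/dd-agent/checks.d/hello.py
from checks import AgentCheck

class HelloCheck(AgentCheck):
    def check(self, instance):
        # Called once per collection cycle for each configured instance.
        self.gauge('hello.world', 1, tags=['check:hello'])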

Contributors

git log --all | grep 'Author' | sort -u

dd-agent's Issues

haproxy integration: turn csv into metrics

Running echo "show stats" | socat stdio <haproxy-unix-socket> yields a useful CSV of the form:

show stats

#pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,
public,FRONTEND,,,537,904,2000,13853595,503908946957,3666729890,0,0,26517,,,,,OPEN,,,,,,,,,1,1,0,,,,0,18,0,713,
dogarchive-frontend,FRONTEND,,,11,373,2000,4392434,146797799875,476916196,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,20,0,138,
static-via-nginx,i-23518d45,0,0,0,3,,4398,1655717,153869053,,0,,0,0,0,0,UP,1,1,0,0,0,254029,0,,1,3,1,,4398,,2,0,,8,
static-via-nginx,i-6529e303,0,0,0,3,,4398,1659293,155868354,,0,,0,0,0,0,UP,1,1,0,0,0,254029,0,,1,3,2,,4398,,2,0,,8,
static-via-nginx,i-8bcc18ed,0,0,0,3,,4398,1661308,153635419,,0,,0,0,0,0,UP,1,1,0,0,0,254029,0,,1,3,3,,4398,,2,0,,8,
static-via-nginx,i-cc97eeb5,0,0,0,2,,4397,1663183,154360737,,0,,0,0,0,0,UP,1,1,0,0,0,254029,0,,1,3,4,,4397,,2,0,,8,
static-via-nginx,BACKEND,0,0,0,9,0,17591,6639501,617733563,0,0,,0,0,0,0,UP,4,4,0,,0,254029,0,,1,3,0,,17591,,1,0,,30,
dogweb,i-23518d45,0,0,0,5,,40093,67184288,99422227,,0,,0,2,0,0,UP,1,1,0,2,0,254029,0,,1,4,1,,40093,,2,0,,8,
dogweb,i-6529e303,0,0,0,7,,40089,67030944,97600762,,0,,0,1,0,0,UP,1,1,0,4,1,5297,10,,1,4,2,,40089,,2,0,,8,
dogweb,i-8bcc18ed,0,0,0,6,,40093,67191329,98024312,,0,,0,0,0,0,UP,1,1,0,3,0,254029,0,,1,4,3,,40093,,2,0,,8,
dogweb,i-cc97eeb5,0,0,0,6,,40092,67048898,98365683,,0,,0,0,0,0,UP,1,1,0,2,0,254029,0,,1,4,4,,40092,,2,0,,8,
dogweb,BACKEND,0,0,0,15,0,174144,305418390,976649094,0,0,,0,3,0,0,UP,4,4,0,,0,254029,0,,1,4,0,,160367,,1,0,,32,
dogdispatcher,i-23518d45:9000,0,0,12,222,,1725534,62725117453,258693780,,0,,0,17924,14067,5530,UP,1,1,0,374,78,3130,1989,,1,5,1,,1711467,,2,8,,79,
dogdispatcher,i-23518d45:9001,0,0,12,210,,1724173,62677102667,258190380,,0,,0,18093,15724,5828,UP,1,1,0,394,92,5627,2401,,1,5,2,,1708449,,2,8,,81,
dogdispatcher,i-6529e303:9000,0,0,13,235,,1723737,62491230058,258209244,,0,,0,8098,15299,5593,UP,1,1,0,454,87,119,2455,,1,5,3,,1708438,,2,8,,79,
dogdispatcher,i-6529e303:9001,0,0,11,208,,1721279,63636834581,256958028,,0,,0,8481,19821,6860,UP,1,1,0,482,114,1998,3334,,1,5,4,,1701458,,2,8,,79,
dogdispatcher,i-8bcc18ed:9000,0,0,12,219,,1723402,63130226541,257551390,,0,,0,9504,18225,6724,UP,1,1,0,431,99,2556,2875,,1,5,5,,1705177,,2,8,,80,
dogdispatcher,i-8bcc18ed:9001,0,0,11,211,,1722527,62529116896,257088094,,0,,0,10151,20196,6756,UP,1,1,0,473,115,734,3289,,1,5,6,,1702331,,2,8,,80,
dogdispatcher,i-cc97eeb5:9000,0,0,12,196,,1730262,62628846149,259876401,,0,,0,13427,12219,4319,UP,1,1,0,260,68,3960,1061,,1,5,7,,1718043,,2,8,,80,
dogdispatcher,i-cc97eeb5:9001,0,0,12,228,,1730011,63776971654,259914086,,0,,0,13329,12162,3890,UP,1,1,0,291,70,319,1142,,1,5,8,,1717849,,2,8,,79,
dogdispatcher,BACKEND,0,0,95,677,0,13627712,503595445999,2066481403,0,0,,0,99007,127713,45500,UP,8,8,0,,0,254029,0,,1,5,0,,13673212,,1,65,,636,
dogarchive-backend,i-23518d45:9101,0,0,1,41,,274736,9053771969,29789822,,0,,0,909,264,98,UP,1,1,0,103,33,2532,4120,,1,6,1,,274472,,2,2,,11,
dogarchive-backend,i-23518d45:9102,0,0,0,44,,274545,9321922445,29948131,,0,,0,1079,338,131,UP,1,1,0,102,35,14506,4477,,1,6,2,,274207,,2,1,,11,
dogarchive-backend,i-23518d45:9103,0,0,1,42,,275063,9132375294,30030681,,0,,2,777,224,81,UP,1,1,0,109,30,8940,3868,,1,6,3,,274839,,2,1,,11,
dogarchive-backend,i-23518d45:9104,0,0,3,65,,274042,9017149827,29977883,,0,,0,1156,244,96,UP,1,1,0,116,39,3363,4662,,1,6,4,,273798,,2,2,,11,
dogarchive-backend,i-6529e303:9101,0,0,1,39,,273443,9327842357,29632950,,0,,0,1005,245,92,UP,1,1,0,505,35,89,5108,,1,6,5,,273198,,2,2,,11,
dogarchive-backend,i-6529e303:9102,0,0,1,43,,270997,9217488490,29591767,,0,,0,1539,373,146,UP,1,1,0,514,52,4316,7430,,1,6,6,,270624,,2,1,,11,
dogarchive-backend,i-6529e303:9103,0,0,0,41,,275421,8921303857,30051766,,0,,0,663,122,43,UP,1,1,0,470,23,1556,3390,,1,6,7,,275299,,2,1,,11,
dogarchive-backend,i-6529e303:9104,0,0,0,42,,271099,9144043657,29309309,,0,,0,1407,309,111,DOWN,1,1,0,509,48,110,7231,,1,6,8,,270790,,2,0,,11,
dogarchive-backend,i-8bcc18ed:9101,0,0,0,44,,272141,9258277704,29777358,,0,,0,1338,333,123,UP,1,1,0,393,43,14034,6353,,1,6,9,,271808,,2,1,,11,
dogarchive-backend,i-8bcc18ed:9102,0,0,0,39,,272921,9196099006,29634925,,0,,0,1180,276,95,UP,1,1,0,431,38,5016,5656,,1,6,10,,272645,,2,1,,11,
dogarchive-backend,i-8bcc18ed:9103,0,0,2,44,,274726,8966873428,29886474,,0,,0,771,159,50,UP,1,1,0,368,26,12821,3978,,1,6,11,,274567,,2,1,,11,
dogarchive-backend,i-8bcc18ed:9104,0,0,1,42,,274118,8880125993,30143454,,0,,0,926,193,72,UP,1,1,0,415,30,3789,4618,,1,6,12,,273925,,2,2,,11,
dogarchive-backend,i-cc97eeb5:9101,0,0,1,40,,278649,9306917340,29730228,,0,,0,5,162,63,UP,1,1,0,57,26,8786,670,,1,6,13,,278487,,2,2,,11,
dogarchive-backend,i-cc97eeb5:9102,0,0,0,39,,278592,9321324097,29866587,,0,,0,4,169,58,UP,1,1,0,62,29,12741,792,,1,6,14,,278423,,2,1,,12,
dogarchive-backend,i-cc97eeb5:9103,0,0,0,43,,278393,9352955202,29726417,,0,,0,3,218,91,UP,1,1,0,78,37,6151,954,,1,6,15,,278175,,2,1,,11,
dogarchive-backend,i-cc97eeb5:9104,0,0,0,37,,278693,9379329209,29818444,,0,,0,25,121,45,UP,1,1,0,45,21,11848,592,,1,6,16,,278572,,2,1,,11,
dogarchive-backend,BACKEND,0,0,11,373,0,4392434,146797799875,476916196,0,0,,2,12787,3750,1395,UP,15,15,0,,0,254029,0,,1,6,0,,4393829,,1,20,,177,

show info

Name: HAProxy
Version: 1.3.22
Release_date: 2009/10/14
Nbproc: 1
Process_num: 1
Pid: 23217
Uptime: 2d 22h32m52s
Uptime_sec: 253972
Memmax_MB: 0
Ulimit-n: 64045
Maxsock: 64045
Maxconn: 32000
Maxpipes: 0
CurrConns: 552
PipesUsed: 0
PipesFree: 0
Tasks: 584
Run_queue: 1
node: domU-12-31-39-17-30-F2
description: 
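A sketch of the proposed parsing: read the CSV off the stats socket and key each row by the header fields. The socket path and the exact command spelling are assumptions.

import csv
import socket

def haproxy_stats_csv(sock_path='/var/run/haproxy.sock', command='show stat\n'):
    # Ask the stats socket for the CSV shown above.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(sock_path)
    sock.sendall(command)
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    sock.close()
    lines = ''.join(chunks).strip().splitlines()
    # The first line is the "#pxname,svname,..." header.
    fields = lines[0].lstrip('# ').split(',')
    for row in csv.reader(lines[1:]):
        # Numeric columns (scur, smax, stot, ...) can then be reported as
        # metrics tagged by pxname and svname.
        yield dict(zip(fields, row))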

datadog-agent(-base) RedHat start/stop scripts explicitly reference python2.6

This was to force the agent to pick up 2.6 on CentOS 5. Since datadog-agent-base works with python2.4 out of the box, there is no need to force a different python.

On CentOS 5, datadog-agent should always use python2.6 or greater.
On CentOS 5, datadog-agent-base should always use the default python.
On CentOS 6, both should use the default python.

Update redis check to optionally poll the length of specified collections.

The motivation is that many people use redis as a queue, and it's useful to know how big a queue is at any given moment.

The config might look like:

# redis_len is a comma-separated list of the form: db:keys_pattern:tagger
redis_len: 0:cache*:my_mod.my_redis_tagger

Where my_mod is a python module in the agent's PYTHONPATH and my_redis_tagger is a function in my_mod that might look like:

def my_redis_tagger(key):
    '''Accepts a key and returns the appropriate tags for that key.
    Called for each key found from the `keys_pattern` config value.
    '''
    return ["cache", key.replace("cache", "")]

For each check iteration, the redis_len check will (see the sketch after this list):

  • call KEYS keys_pattern to get a list of keys from the given redis db
  • figure out their types
  • call the appropriate length function on them
  • call the tagger function with the key to get the tags

Need to verify that this won't kill redis dbs with lots of keys.
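A sketch of that loop with the standard redis-py client (keys, type, and the per-type length commands are redis-py methods):

# Map each redis type to its length command; conn is a redis.Redis client.
LEN_COMMANDS = {'list': 'llen', 'set': 'scard', 'zset': 'zcard', 'hash': 'hlen'}

def collect_lengths(conn, keys_pattern, tagger):
    for key in conn.keys(keys_pattern):
        command = LEN_COMMANDS.get(conn.type(key))
        if command is None:
            continue  # plain strings have no meaningful length to report
        yield key, getattr(conn, command)(key), tagger(key)

Note that KEYS walks the entire keyspace, which is exactly the concern above; on redis servers that support it, SCAN is the gentler primitive.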

haproxy logs -> dogstream

What we need:

  • request stems as tags (see below for stems)
  • response code as tags
  • timings (Tr, Tt, Tq) as metrics

List of requests (including some patterns, but discarding everything after ?param=value): https://gist.github.com/2919020

Log sample (ignore the first 2 fields, which are added by syslog):

2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 127.0.0.1:40074 [12/Jun/2012:17:56:10.548] dogarchive-frontend dogarchive-backend/i-8bcc18ed:9105 14/0/0/283/575 200 107 - - ---- 232/162/162/5/0 0/0 "POST /intake HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 127.0.0.1:40079 [12/Jun/2012:17:56:10.636] dogarchive-frontend dogarchive-backend/i-8bcc18ed:9103 10/0/1/75/488 200 107 - - ---- 231/161/161/6/0 0/0 "POST /intake HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 127.0.0.1:40095 [12/Jun/2012:17:56:10.728] dogarchive-frontend dogarchive-backend/i-23518d45:9102 7/0/1/354/456 200 107 - - ---- 230/160/160/5/0 0/0 "POST /intake HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28632 [12/Jun/2012:17:56:11.069] public dogdispatcher/i-8bcc18ed:9001 2/0/0/40/115 202 151 - - ---- 229/69/1/0/0 0/0 "POST /intake/ HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 127.0.0.1:40103 [12/Jun/2012:17:56:10.864] dogarchive-frontend dogarchive-backend/i-23518d45:9103 4/0/1/70/331 200 107 - - ---- 230/160/159/7/0 0/0 "POST /intake HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28629 [12/Jun/2012:17:56:10.914] public dogdispatcher/i-8bcc18ed:9000 89/0/1/198/308 202 151 - - ---- 233/70/2/0/0 0/0 "POST /intake/ HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28633 [12/Jun/2012:17:56:11.079] public dogdispatcher/i-6529e303:9000 139/0/4/11/174 202 151 - - ---- 235/70/1/0/0 0/0 "POST /intake/ HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28635 [12/Jun/2012:17:56:11.185] public dogweb/i-23518d45 33/0/4/45/82 200 424 - - ---- 235/69/0/0/0 0/0 "GET /reports/v1/agents HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28636 [12/Jun/2012:17:56:11.222] public dogdispatcher/i-6529e303:9001 31/0/1/110/144 202 166 - - ---- 237/69/1/0/0 0/0 "POST /api/v1/series?api_key=REDACTED HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28634 [12/Jun/2012:17:56:11.103] public dogdispatcher/i-23518d45:9000 81/0/10/189/281 202 151 - - ---- 239/69/0/0/0 0/0 "POST /intake?api_key=REDACTED HTTP/1.1"

If you need to extract the page list again:

gawk '{u = $(NF-1); split(u, p, "?"); req[p[1]]=1;} END {for (r in req) {print r}}' /mnt/log/haproxy_1.log

Lastly, the haproxy timings are the first group of /-separated values, e.g.:

33/0/4/45/82

>>> Feb  6 12:14:14 localhost \
      haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in \
      static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} \
      {} "GET /index.html HTTP/1.1"
  Field   Format                                Extract from the example above
      1   process_name '[' pid ']:'                            haproxy[14389]:
      2   client_ip ':' client_port                             10.0.1.2:33317
      3   '[' accept_date ']'                       [06/Feb/2009:12:14:14.655]
      4   frontend_name                                                http-in
      5   backend_name '/' server_name                             static/srv1
      6   Tq '/' Tw '/' Tc '/' Tr '/' Tt*                       10/0/30/69/109
      7   status_code                                                      200
      8   bytes_read*                                                     2750
      9   captured_request_cookie                                            -
     10   captured_response_cookie                                           -
     11   termination_state                                               ----
     12   actconn '/' feconn '/' beconn '/' srv_conn '/' retries*    1/1/1/1/0
     13   srv_queue '/' backend_queue                                      0/0
     14   '{' captured_request_headers* '}'                   {haproxy.1wt.eu}
     15   '{' captured_response_headers* '}'                                {}
     16   '"' http_request '"'                      "GET /index.html HTTP/1.1"

8.4. Timing events

Timers provide a great help in troubleshooting network problems. All values are
reported in milliseconds (ms). These timers should be used in conjunction with
the session termination flags. In TCP mode with "option tcplog" set on the
frontend, 3 control points are reported under the form "Tw/Tc/Tt", and in HTTP
mode, 5 control points are reported under the form "Tq/Tw/Tc/Tr/Tt" :

  • Tq: total time to get the client request (HTTP mode only). It's the time
    elapsed between the moment the client connection was accepted and the
    moment the proxy received the last HTTP header. The value "-1" indicates
    that the end of headers (empty line) has never been seen. This happens when
    the client closes prematurely or times out.

  • Tw: total time spent in the queues waiting for a connection slot. It
    accounts for backend queue as well as the server queues, and depends on the
    queue size, and the time needed for the server to complete previous
    requests. The value "-1" means that the request was killed before reaching
    the queue, which is generally what happens with invalid or denied requests.

  • Tc: total time to establish the TCP connection to the server. It's the time
    elapsed between the moment the proxy sent the connection request, and the
    moment it was acknowledged by the server, or between the TCP SYN packet and
    the matching SYN/ACK packet in return. The value "-1" means that the
    connection was never established.

  • Tr: server response time (HTTP mode only). It's the time elapsed between
    the moment the TCP connection was established to the server and the moment
    the server sent its complete response headers. It purely shows its request
    processing time, without the network overhead due to the data transmission.
    It is worth noting that when the client has data to send to the server, for
    instance during a POST request, the time already runs, and this can distort
    apparent response time. For this reason, it's generally wise not to trust
    this field too much for POST requests initiated from clients behind an
    untrusted network. A value of "-1" here means that the last response
    header (empty line) was never seen, most likely because the server timeout
    struck before the server managed to process the request.

  • Tt: total session duration time, between the moment the proxy accepted it
    and the moment both ends were closed. The exception is when the "logasap"
    option is specified. In this case, it only equals (Tq+Tw+Tc+Tr), and is
    prefixed with a '+' sign. From this field, we can deduce "Td", the data
    transmission time, by subtracting other timers when valid:

    Td = Tt - (Tq + Tw + Tc + Tr)
    

    Timers with "-1" values have to be excluded from this equation. In TCP
    mode, "Tq" and "Tr" have to be excluded too. Note that "Tt" can never be
    negative.
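As a sketch of that rule in code (inputs assumed already parsed to ints):

def data_transmission_time(tq, tw, tc, tr, tt):
    # Per the rule above: timers reported as -1 are excluded from the sum.
    # In TCP mode, pass tq and tr as -1 so they drop out as well.
    return tt - sum(t for t in (tq, tw, tc, tr) if t != -1)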

These timers provide precious indications on trouble causes. Since the TCP
protocol defines retransmit delays of 3, 6, 12... seconds, we know for sure
that timers close to multiples of 3s are nearly always related to lost packets
due to network problems (wires, negotiation, congestion). Moreover, if "Tt" is
close to a timeout value specified in the configuration, it often means that a
session has been aborted on timeout.

Most common cases:

  • If "Tq" is close to 3000, a packet has probably been lost between the
    client and the proxy. This is very rare on local networks but might happen
    when clients are on far remote networks and send large requests. It may
    happen that values larger than usual appear here without any network cause.
    Sometimes, during an attack or just after a resource starvation has ended,
    haproxy may accept thousands of connections in a few milliseconds. The time
    spent accepting these connections will inevitably slightly delay processing
    of other connections, and it can happen that request times in the order of
    a few tens of milliseconds are measured after a few thousands of new
    connections have been accepted at once.
  • If "Tc" is close to 3000, a packet has probably been lost between the
    server and the proxy during the server connection phase. This value should
    always be very low, such as 1 ms on local networks and less than a few tens
    of ms on remote networks.
  • If "Tr" is nearly always lower than 3000 except some rare values which seem
    to be the average majored by 3000, there are probably some packets lost
    between the proxy and the server.
  • If "Tt" is large even for small byte counts, it generally is because
    neither the client nor the server decides to close the connection, for
    instance because both have agreed on a keep-alive connection mode. In order
    to solve this issue, it will be needed to specify "option httpclose" on
    either the frontend or the backend. If the problem persists, it means that
    the server ignores the "close" connection mode and expects the client to
    close. Then it will be required to use "option forceclose". Having the
    smallest possible 'Tt' is important when connection regulation is used with
    the "maxconn" option on the servers, since no new connection will be sent
    to the server until another one is released.

Other noticeable HTTP log cases ('xx' means any value to be ignored):

  Tq/Tw/Tc/Tr/+Tt  The "option logasap" is present on the frontend and the log
                   was emitted before the data phase. All the timers are valid
                   except "Tt", which is shorter than reality.

  -1/xx/xx/xx/Tt   The client was not able to send a complete request in time,
                   or it aborted too early. Check the session termination
                   flags, then the "timeout http-request" and "timeout client"
                   settings.

  Tq/-1/xx/xx/Tt   It was not possible to process the request, maybe because
                   servers were out of order, or because the request was
                   invalid or forbidden by ACL rules. Check the session
                   termination flags.

  Tq/Tw/-1/xx/Tt   The connection could not be established on the server.
                   Either it actively refused it or it timed out after
                   Tt-(Tq+Tw) ms. Check the session termination flags, then
                   check the "timeout connect" setting. Note that the tarpit
                   action might return similar-looking patterns, with "Tw"
                   equal to the time the client connection was maintained open.

  Tq/Tw/Tc/-1/Tt   The server accepted the connection but did not return
                   a complete response in time, or it closed its connection
                   unexpectedly after Tt-(Tq+Tw+Tc) ms. Check the session
                   termination flags, then check the "timeout server" setting.

postgres: get stats tagged by table/index

Some nice stats broken down by table or index:

=> \d pg_stat_user_indexes 
View "pg_catalog.pg_stat_user_indexes"
    Column     |  Type  | Modifiers 
---------------+--------+-----------
 relid         | oid    | 
 indexrelid    | oid    | 
 schemaname    | name   | 
 relname       | name   | 
 indexrelname  | name   | 
 idx_scan      | bigint | 
 idx_tup_read  | bigint | 
 idx_tup_fetch | bigint | 

=> \d pg_stat_user_tables 
          View "pg_catalog.pg_stat_user_tables"
      Column      |           Type           | Modifiers 
------------------+--------------------------+-----------
 relid            | oid                      | 
 schemaname       | name                     | 
 relname          | name                     | 
 seq_scan         | bigint                   | 
 seq_tup_read     | bigint                   | 
 idx_scan         | bigint                   | 
 idx_tup_fetch    | bigint                   | 
 n_tup_ins        | bigint                   | 
 n_tup_upd        | bigint                   | 
 n_tup_del        | bigint                   | 
 n_tup_hot_upd    | bigint                   | 
 n_live_tup       | bigint                   | 
 n_dead_tup       | bigint                   | 
 last_vacuum      | timestamp with time zone | 
 last_autovacuum  | timestamp with time zone | 
 last_analyze     | timestamp with time zone | 
 last_autoanalyze | timestamp with time zone | 
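A sketch of per-table collection, assuming psycopg2 and illustrative metric names:

import psycopg2

QUERY = """
SELECT schemaname, relname, seq_scan, idx_scan, n_live_tup, n_dead_tup
  FROM pg_stat_user_tables
"""

def table_stats(dsn='dbname=postgres'):
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        for schema, table, seq_scan, idx_scan, live, dead in cur:
            # Tag every metric with its schema and table name.
            tags = ['schema:%s' % schema, 'table:%s' % table]
            yield 'postgresql.seq_scans', seq_scan, tags
            yield 'postgresql.index_scans', idx_scan, tags
            yield 'postgresql.live_rows', live, tags
            yield 'postgresql.dead_rows', dead, tags
    finally:
        conn.close()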

Broken uninstall script

Ubuntu 10.04 LTS

# apt-get remove datadog-agent
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
 datadog-agent-lib supervisor python-tornado datadog-agent-base python-meld3 python-medusa python-mysqldb
Use 'apt-get autoremove' to remove them.
The following packages will be REMOVED:
 datadog-agent
0 upgraded, 0 newly installed, 1 to remove and 134 not upgraded.
After this operation, 14.3kB disk space will be freed.
Do you want to continue [Y/n]? y
(Reading database ... 74611 files and directories currently installed.)
Removing datadog-agent ...
Stopping datadog agent: (using supervisorctl) error: <class 'xmlrpclib.Fault'>, <Fault 10: 'BAD_NAME: datadog-agent'>: file: /usr/lib/python2.6/xmlrpclib.py line: 838
invoke-rc.d: initscript datadog-agent, action "stop" failed.
dpkg: error processing datadog-agent (--remove):
 subprocess installed pre-removal script returned error exit status 2
Errors were encountered while processing:
 datadog-agent
E: Sub-process /usr/bin/dpkg returned an error code (1)
$ sudo apt-get autoremove
Reading package lists... Done
Building dependency tree
Reading state information... Done
0 upgraded, 0 newly installed, 0 to remove and 130 not upgraded.
1 not fully installed or removed.
After this operation, 0B of additional disk space will be used.
Setting up datadog-agent (2.2.24-127) ...
No config updates to processes
Stopping datadog agent: (using supervisorctl) error: <class 'xmlrpclib.Fault'>, <Fault 10: 'BAD_NAME: datadog-agent'>: file: /usr/lib/python2.6/xmlrpclib.py line: 838
invoke-rc.d: initscript datadog-agent, action "restart" failed.
dpkg: error processing datadog-agent (--configure):
 subprocess installed post-installation script returned error exit status 2

Make the forwarder configuration for datadog-agent-base a configuration option

Right now, we have:

datadog-agent-base --(http://localhost:17123)--> datadog-agent --(https)--> Datadog

If we want to have:

datadog-agent-base --(http://proxy-node:17123)--> datadog-agent --(https)--> Datadog

we have to edit /etc/init.d/datadog-agent or /etc/dd-agent/supervisor.conf.

Instead, the forwarder destination should default to localhost and be overridable in the configuration file.
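For example (the forwarder_url key is invented here for illustration; it is not an existing option):

# Hypothetical datadog.conf entry
[Main]
forwarder_url: http://proxy-node:17123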

Track disk stats (size and inodes)

Tasks

  • Add an option in datadog.conf (see use_mount below) to pick up mount points instead of Linux devices as Datadog "devices".
  • Capture system.fs.inode_in_use/inode_total/inode_free metrics using df -i

New options

use_mount in datadog.conf
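A sketch of the df -i capture, tagging by mount point when use_mount is set (column layout assumed as in standard df output; metric names follow the task above):

import subprocess

def inode_metrics(use_mount=True):
    # df -i columns: Filesystem Inodes IUsed IFree IUse% Mounted on
    output = subprocess.check_output(['df', '-i']).splitlines()[1:]
    for line in output:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip wrapped or malformed rows
        device, inodes, iused, ifree = parts[0], parts[1], parts[2], parts[3]
        mount = parts[5]
        tag = 'device:%s' % (mount if use_mount else device)
        yield 'system.fs.inode_total', int(inodes), [tag]
        yield 'system.fs.inode_in_use', int(iused), [tag]
        yield 'system.fs.inode_free', int(ifree), [tag]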

Add network reporting to OS X

The command on OS X is netstat -i -b:

Name  Mtu   Network       Address            Ipkts Ierrs     Ibytes    Opkts Oerrs     Obytes  Coll
lo0   16384 <Link#1>                        252623     0  368415571   252623     0  368415571     0
lo0   16384 localhost   fe80:1::1           252623     -  368415571   252623     -  368415571     -
lo0   16384 127           localhost         252623     -  368415571   252623     -  368415571     -
lo0   16384 localhost   ::1                 252623     -  368415571   252623     -  368415571     -
gif0* 1280  <Link#2>                             0     0          0        0     0          0     0
stf0* 1280  <Link#3>                             0     0          0        0     0          0     0
en0   1500  <Link#4>    04:0c:ce:db:4e:fa 20328868     0 13309516362 14810885     0 11363839527     0
en0   1500  seneca.loca fe80:4::60c:ceff: 20328868     - 13309516362 14810885     - 11363839527     -
en0   1500  2001:470:1f 2001:470:1f07:11d 20328868     - 13309516362 14810885     - 11363839527     -
en0   1500  2001:470:1f 2001:470:1f07:11d 20328868     - 13309516362 14810885     - 11363839527     -
en0   1500  192.168.1     192.168.1.63    20328868     - 13309516362 14810885     - 11363839527     -
en0   1500  2001:470:1f 2001:470:1f07:11d 20328868     - 13309516362 14810885     - 11363839527     -
p2p0  2304  <Link#5>    06:0c:ce:db:4e:fa        0     0          0        0     0          0     0
ham0  1404  <Link#6>    7a:79:05:4d:bf:f5    27552     0    6231369    17764     0    8030253     0
ham0  1404  5             5.77.191.245       27552     -    6231369    17764     -    8030253     -
ham0  1404  seneca.loca fe80:6::7879:5ff:    27552     -    6231369    17764     -    8030253     -
ham0  1404  2620:9b::54 2620:9b::54d:bff5    27552     -    6231369    17764     -    8030253     -
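A sketch of the parsing, keeping only the <Link#N> rows since the address rows repeat the same counters (the column handling is an assumption based on the sample above):

import subprocess

def osx_interface_bytes():
    # Skip the header row; keep only the <Link#N> rows.
    lines = subprocess.check_output(['netstat', '-i', '-b']).splitlines()[1:]
    for line in lines:
        parts = line.split()
        if len(parts) < 10 or not parts[2].startswith('<Link#'):
            continue
        name = parts[0].rstrip('*')
        # Count from the end: physical interfaces insert an extra Address
        # column, so Ibytes and Obytes sit at fixed offsets from the right.
        ibytes, obytes = int(parts[-5]), int(parts[-2])
        yield name, ibytes, obytes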

Can't turn on graphite listener

When I uncomment the following line in my /etc/dd-agent/datadog.conf file:

# Start a graphite listener on this port
graphite_listen_port: 17124

... and then restart...

$ sudo /etc/init.d/datadog-agent restart
Stopping datadog agent: (using supervisorctl) forwarder: stopped
collector: stopped
dd-agent.
Starting datadog agent: (using supervisorctl) collector: started
forwarder: ERROR (abnormal termination)
dd-agent.

It fails! But why?

$ tail /var/log/ddforwarder.log 
[I 120626 17:35:53 dd-forwarder:229] Listening on port 17123
[I 120626 17:35:53 dd-forwarder:243] Starting graphite listener on port 17124
Traceback (most recent call last):
  File "/usr/bin/dd-forwarder", line 271, in <module>
    main()
  File "/usr/bin/dd-forwarder", line 268, in main
    app.run()
  File "/usr/bin/dd-forwarder", line 244, in run
    from graphite import GraphiteServer
ImportError: No module named graphite

Is this why?

$ python --version
Python 2.6.5

Hmm... Don't know! But I have the right version, don't I?

$ sudo apt-cache policy datadog-agent
datadog-agent:
  Installed: 2.2.28-169
  Candidate: 2.2.28-169
  Version table:
 *** 2.2.28-169 0
        500 http://apt.datadoghq.com/ unstable/main Packages
        100 /var/lib/dpkg/status

¿ⓧ_ⓧﮌ

Add munin support

Details at https://sites.google.com/a/datadoghq.com/wiki/agent/munin-plugin, reproduced here:

Basic idea

Execute all scripts in /etc/munin/plugins/* without arguments from the agent, parse the values and send them back.

Security Consideration

By default munin-node executes as root; we don't, so there needs to be a way to effectively call the scripts.
By using sudo we can sidestep the permission issues completely:

dd-agent ALL=(root) NOPASSWD: /usr/share/datadog/agent/munin.py

Munin Plugin

The logic goes like this:

  1. Agent starts a subprocess: sudo munin.py /etc/munin/plugin-conf.d/munin-node /etc/munin/plugins
  2. munin.py parses munin-node first to store environment variables and execution environments for all the checks
  3. munin.py traverses /etc/munin/plugins and execs each plugin with the proper environment, without arguments
  4. munin.py translates the metric name if needed, and extracts tags and devices (e.g. postgres_connections_dogdatatest)
  5. munin.py returns: timestamp dd_metric_name value type(gauge|counter) tags

Issues

If munin takes more than 150s to run, the agent will be killed by its watchdog.
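A sketch of the step-5 translation, assuming plugin output of the usual "field.value 42" form (the munin.<plugin>.<field> naming is illustrative):

import time

def munin_lines(plugin_name, output):
    now = int(time.time())
    for line in output.splitlines():
        parts = line.strip().split()
        if len(parts) != 2 or not parts[0].endswith('.value'):
            continue  # keep only the value lines
        field = parts[0][:-len('.value')]
        # Tags, if any, would be appended after the type.
        yield '%s munin.%s.%s %s gauge' % (now, plugin_name, field, parts[1])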

cacti support on CentOS 5 is broken

import rrdtool

fails in our check even after python-rrdtool is installed. Cause: the package only installs the rrdtool module for python2.4, not python2.6.

In other words, when you run:

yum install python-rrdtool

the rrdtool module is only available under python2.4, not python2.6.

datadog-agent-base runs with python2.6 if it's installed along with datadog-agent.

Solution:

  1. if python2.4 is installed (which is the case for CentOS5), force datadog-agent base to run with python2.4, not python2.6.

Fix Cassandra metrics

Current code does not play nicely with Cassandra 0.8 and 1.0.

Need to get:

  • Compaction tasks
  • Pending tasks
  • Anything else around compactions, jams

Get them via JMX instead of nodetool (or contrib json output to nodetool)

agent crashes on startup if it can't import rrdtool

Traceback (most recent call last):
  File "/usr/bin/dd-forwarder", line 32, in <module>
    from checks.common import getUuid
  File "/usr/share/datadog/agent/checks/common.py", line 41, in <module>
    from checks.cacti import Cacti
  File "/usr/share/datadog/agent/checks/cacti.py", line 5, in <module>
    import rrdtool
ImportError: No module named rrdtool
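A sketch of the obvious fix: treat rrdtool as optional so a missing module disables Cacti support instead of crashing the forwarder at startup.

try:
    import rrdtool
except ImportError:
    rrdtool = None  # Cacti support is simply disabled

def cacti_supported():
    return rrdtool is not None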

Cacti support: if an RRD file is missing, skip it and don't crash the check

The file does not actually exist. Not sure why, but the check should cope with this more gracefully.

2012-04-20 07:58:28,092 - checks - ERROR - Cannot check Cacti
Traceback (most recent call last):
  File "/usr/share/datadog/agent/checks/cacti.py", line 195, in check
    metrics.extend(
  File "/usr/share/datadog/agent/checks/cacti.py", line 109, in _read_rrd
    c_funcs = self._consolidation_funcs(rrd_path, rrdtool)
  File "/usr/share/datadog/agent/checks/cacti.py", line 98, in _consolidation_funcs
    info = rrdtool.info(rrd_path)
error: opening '/var/www/html/cacti/rra/sjl01lscomtest01_tcpinuse_32539.rrd': No such file or directory

MemUsed metric includes Buffered/Cached memory

In the Memory check, physUsed is calculated as MemTotal - MemFree, as shown below:

physTotal = int(meminfo['MemTotal'])
physFree = int(meminfo['MemFree'])
physUsed = physTotal - physFree

(found at checks/system.py)

This makes it awkward (in Datadog graphs) to differentiate between memory being used by the cache and memory being used by normal processes.

I'd like to change the physUsed definition to be

physTotal = int(meminfo['MemTotal'])
physFree = int(meminfo['MemFree'])
physBuffers = int(meminfo['Buffers'])
physCached = int(meminfo['Cached'])
physUsed = physTotal - physFree - physBuffers - physCached

Alternatively, we could add another data point that represents physUsed - physBuffers - physCached. I'm at a loss on what to call it, however.
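For illustration, a sketch reading /proc/meminfo and computing both values (real_used is an invented name):

def memory_usage():
    info = {}
    for line in open('/proc/meminfo'):
        key, rest = line.split(':', 1)
        info[key] = int(rest.split()[0])  # /proc/meminfo reports kB
    used = info['MemTotal'] - info['MemFree']
    real_used = used - info['Buffers'] - info['Cached']
    return used, real_used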

One side point: free deals with this problem by showing both:

             total       used       free     shared    buffers     cached
Mem:           512        494         17          0         76        326
-/+ buffers/cache:         91        420

The -/+ buffers/cache line represents used - buffers - cached and free + buffers + cached.

Chris

Airbrake integration

When our agent throws an exception or errors out, collect the error in Airbrake to help with troubleshooting.

Bonus points (in another story): if our users use Airbrake, optionally push errors to their account as well.
