elastic / support-diagnostics
Support diagnostics utility for Elasticsearch and Logstash
License: Other
The pretty flag needs to be re-added. We are worse off without it.
We need to redact the various passwords that appear in the yml files.
Auth fails as we do not return the status in 2.0
Currently logs are grabbed from the default location - sometimes these are old logs, or no logs at all...
Possible to learn the configured location of log files from the node's settings and capture the correct log files?
Every time we send the result of a tool run to ES support we have to manually remove AWS credentials from the following section.
From cluster_state.20150219-122235.json:
"repositories" : {
"202" : {
"type" : "s3",
"settings" : {
"region" : "us-east-1",
"max_restore_bytes_per_sec" : "1mb",
"max_snapshot_bytes_per_sec" : "1mb",
"bucket" : "es-snapshot.appcelerator.prod.202",
"access_key" : "",
"secret_key" : ""
}
}
},
Please add auto-deletion of them to keep customers safe.
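Automatic redaction could be sketched as a small filter applied to the captured JSON before packaging. The key list below is illustrative only, not exhaustive:

```shell
# Sketch: blank out values of known-sensitive JSON keys before packaging.
# The key list is an example; real coverage would need to be broader.
redact_json() {
  sed -E 's/("(access_key|secret_key|password)"[[:space:]]*:[[:space:]]*")[^"]*(")/\1REDACTED\3/g' "$1"
}
```

Usage would be along the lines of `redact_json cluster_state.20150219-122235.json > cluster_state.redacted.json`.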
It would be great if the output could include the type of filesystem (e.g. ext2, nfs, whatever).
A long-term solution is to add this information (available from the Java 7 FileStore API) to ES stats, but as a start, maybe we could include the output of df -k and mount, or similar?
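As a sketch, the script could capture this directly (df -T reports the filesystem type where available, with a plain df -k fallback for platforms like OS X; the $outputdir variable name mirrors what the script already uses):

```shell
# Capture filesystem type and mount information into the output directory.
outputdir="${outputdir:-diag-out}"
mkdir -p "$outputdir"
{ df -kT || df -k; } > "$outputdir/fs.txt" 2>/dev/null
mount > "$outputdir/mount.txt" 2>/dev/null || true
```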
need to replace with dummy data:
any other info revealing sensitive information.
ideally some consistency should be kept to be able to correlate information during troubleshooting
e.g.
real data
dedicated master srv23.secret.domain 10.37.12.32 HTTP 9200 Transport 9300
data node srv24.secret.domain 10.37.12.33 HTTP 9201 Transport 9301
data node srv25.secret.domain 10.37.12.34 HTTP 9202 Transport 9302
sanitised data
dedicated master dm.x.y x.x.x.32 HTTP 1 Transport 10
data node 1 -> dn.x.y x.x.x.33 HTTP 2 Transport 11
data node 2 -> dn.x.y x.x.x.34 HTTP 3 Transport 12
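A consistent mapping could be sketched with a small awk filter that assigns each distinct hostname and IP a stable placeholder, so the same host always sanitises to the same token. The patterns here are crude heuristics for illustration only:

```shell
# Sketch: stable anonymisation so information can still be correlated.
# Each distinct IP/hostname gets the same placeholder everywhere it appears.
sanitize() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
        if (!($i in ipmap)) { ipn++; ipmap[$i] = "x.x.x." ipn }
        $i = ipmap[$i]
      } else if ($i ~ /[a-zA-Z]/ && $i ~ /\./) {
        if (!($i in hmap)) { hn++; hmap[$i] = "host" hn ".x.y" }
        $i = hmap[$i]
      }
    }
    print
  }'
}
```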
When using authentication, -a is required when -c is used, at least with BASIC auth (untested with cookie auth). If -c is included, the script should exit if -a is not provided.
Can we also add /_template?pretty to the diagnostic dump? thx
There are lots of times when marvel would help me help users diagnose problems. If marvel is in use for a cluster, I would like to capture at least the last 2 days of marvel data in a form that I can import into my own cluster for local analysis.
When we pull diagnostics it's often useful to see not just the hot threads but any threads stuck waiting on a lock / IO operation as well, but because we only pull the top 10 "hot" ones now we won't (necessarily) see the stuck ones consuming 0% cpu.
I think we should show all threads?
Large clusters can produce huge cluster states when using ?pretty which may not be desirable. A flag should be added to disable this.
Linux
$ uname -a
Linux w530 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
OSX
$ uname -a
Darwin Antonios-MacBook-Air-2.local 14.3.0 Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64
Windows
http://www.windows-commandline.com/find-windows-os-version-from-command/ ?
Please add /_cat/shards?v as an additional output from the tool, we are finding it useful when looking at unbalanced shard allocation. thx!
THP, enabled by default on most Linux distros, is known to cause long young-generation GC pauses.
Add a check along the lines of
https://access.redhat.com/solutions/46111
Ideally identify one single command working on both rpm/deb/any_other_major based distros
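One distro-agnostic option is to read the THP status straight from sysfs, which works regardless of rpm/deb packaging (the second path is for older RHEL kernels; this is a sketch, not the final check):

```shell
# Report THP status via sysfs; prints nothing if THP is not compiled in.
thp_status() {
  for f in /sys/kernel/mm/transparent_hugepage/enabled \
           /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
    [ -r "$f" ] && echo "$f: $(cat "$f")"
  done
  return 0
}
thp_status
```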
Add ability to collect stats multiple times separated by interval
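A sketch of what the interval option could look like. The runs/interval flag names are hypothetical, and the single stats call in capture() stands in for the script's full request list:

```shell
eshost="${eshost:-http://localhost:9200}"
outputdir="${outputdir:-diag-out}"
runs="${runs:-3}"          # hypothetical --runs flag
interval="${interval:-1}"  # hypothetical --interval flag, in seconds
mkdir -p "$outputdir"

capture() {  # one capture pass; the real script would run its full curl list here
  ts=$(date +%Y%m%d-%H%M%S)
  curl -s "$eshost/_nodes/stats" > "$outputdir/nodes_stats.$ts.json" 2>/dev/null || true
}

for i in $(seq 1 "$runs"); do
  capture
  if [ "$i" -lt "$runs" ]; then sleep "$interval"; fi
done
```

Timestamping each pass keeps the runs distinct, so later passes never overwrite earlier ones.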
There are 2 issues to address for ES 2.0 with the existing plugin:
--install no longer works in 2.0. Change:
./bin/plugin --install elasticsearch/elasticsearch-support-diagnostics
to:
./bin/plugin install elasticsearch/elasticsearch-support-diagnostics
bin/support-diagnostics$ ls -arlth
total 24K
-rw-r--r-- 1 elk elk 8.8K Sep 24 14:10 support-diagnostics.sh
Either make it executable by default OR specify in usage docs to chmod +x bin/support-diagnostics/support-diagnostics.sh
Running the command recommended in the documentation ./bin/plugin --install elasticsearch/elasticsearch-support-diagnostics
fails with the following:
-> Installing elasticsearch/elasticsearch-support-diagnostics...
Trying https://github.com/elasticsearch/elasticsearch-support-diagnostics/archive/master.zip...
Failed to install elasticsearch/elasticsearch-support-diagnostics, reason: failed to download out of all possible locations..., use --verbose to get detailed information
Using ./bin/plugin --install elastic/elasticsearch-support-diagnostics
seems to work.
Side note: most of the links in the documentation in this repo point to the old "elasticsearch/..." GitHub org instead of the new "elastic/...", so I'm assuming this bug is an artifact of that as well.
Could the timestamp (including timezone) be captured - possibly simply in a single (JSON?) file?
File permissions are 644 on support-diagnostics.sh. Those should be 755 so it can be immediately executed.
The current script only copies configuration files from --path.config. However, if the user specified --config to point elasticsearch.yml to a totally different location, then that file is not copied. We should support --config as well.
Possible to create a checksum for all files included, just to be sure that nothing has become corrupted (or manually changed) in transit.
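This could be sketched by writing a checksum manifest into the bundle before archiving; the receiver then re-runs the verify step (sha1sum is assumed available; OS X would need shasum instead):

```shell
# Write a checksum manifest covering every file in the bundle, then verify it.
outputdir="${outputdir:-diag-out}"
mkdir -p "$outputdir"
echo sample > "$outputdir/sample.txt"   # placeholder file so the sketch runs standalone
( cd "$outputdir" && find . -type f ! -name checksums.sha1 -exec sha1sum {} + > checksums.sha1 )
( cd "$outputdir" && sha1sum -c checksums.sha1 )
```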
As originally requested by @dadoonet :
Maybe we don't need to have 30 days of log files (depending on their logging settings).
Could it be possible to specify something like --days 1 to get only the last day of logs or so?
I think our default log4j settings do daily rollovers, so this certainly seems possible.
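A sketch of that filter, assuming a hypothetical --days flag and relying only on file modification times (so it works even if the rollover file naming varies):

```shell
# Copy only log files modified within the last N days.
collect_recent_logs() {  # usage: collect_recent_logs <logdir> <days> <outputdir>
  mkdir -p "$3/logs"
  find "$1" -maxdepth 1 -type f -name '*.log*' -mtime "-$2" \
       -exec cp {} "$3/logs/" \; 2>/dev/null || true
}
```

For example, `collect_recent_logs /var/log/elasticsearch 1 "$outputdir"` would grab only the last day's files.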
The _cat/recovery request is redirected into a JSON file in both versions of the script even though it does not return JSON. The two scripts also slightly differ in the name they echo while it happens.
support-diagnostics.sh:
echo "Getting _/recovery"
curl -XGET "$eshost/_cat/recovery?v" >> $outputdir/cat_recovery.json 2> /dev/null
support-diagnostics.ps1:
Write-Host 'Getting _cat/recovery'
Invoke-WebRequest $esHost'/_cat/recovery?v' -OutFile $outputDir/cat_recovery.json
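A minimal fix for the shell script would be to correct the echoed name and write to a .txt file (the PowerShell script needs the same extension change). Sketch only; the defaults and the trailing || true are just so it runs standalone without a live cluster:

```shell
echo "Getting _cat/recovery"
# _cat endpoints return column-formatted text, not JSON, so use a .txt extension
outputdir="${outputdir:-.}"
curl -s -XGET "${eshost:-http://localhost:9200}/_cat/recovery?v" \
     >> "$outputdir/cat_recovery.txt" 2>/dev/null || true
```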
Recommend flipping hostname and timestamp.
Flipping them would sort directories better when users rerun the script, grouping by hostname and then by date/time.
The difference between
and
Plus, the user should be able to sort by the file's creation time if they want the reverse.
As some of the output gets large, I have started to find it can be convenient to send it right back into a local instance of the ELK stack to analyze.
In my case, I found it convenient to look at _cat/shards
to analyze where all the space was going.
My intent here is that people can run the support scripts, and then use ELK to analyze the results on their own using configurations. These can be added as they come up.
Today we can see the number of lucene documents, but when nested documents are in place, it would be really useful to know the number of "real" user-level documents.
If Shield is enabled on the cluster and the user forgets to use the parameters -c and -p (username/password) the support-diagnostics completes without errors, but every file contains only:
{
"error" : "AuthenticationException[missing authentication token for REST request [/?pretty]]",
"status" : 401
}
Need to add testing to ensure that authentication is used.
Cheers,
-Robin-
Storing the timestamp in the directory name is not very persistent/reliable. Should have a meta file in the directory storing details like host and timestamp.
After installing this plugin, we've started to see log lines like this:
[2015-06-25 22:06:14,292][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:06:24,311][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:06:34,346][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:06:44,365][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:06:54,382][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:07:04,400][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
Which looks like a bug to me. We install a bunch of other plugins and haven't seen this issue before. From what I can tell it's harmless but annoying.
Please let me know if this is an issue with our particular install - if not, can it be fixed?
The current top output is just a 1-second snapshot in time.
On Linux, can we get the pid file and gather global process statistics for the ES process? Something like pidstat -druvw -p
This gives a global summary of things like page faults/second, IO rate/second, and context switches/second, as well as some things top doesn't show, like the number of file descriptors and threads.
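A sketch, assuming sysstat's pidstat is installed and that the ES process can be located by its main class name (both assumptions; the real script may already know the pid from a pid file):

```shell
# Sample ES process stats (CPU, IO, page faults, context switches) with
# pidstat: 5-second samples, 3 rounds. Skips quietly if ES is not running.
collect_pidstat() {
  espid=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch 2>/dev/null | head -1)
  if [ -n "$espid" ]; then
    pidstat -druvw -p "$espid" 5 3 > "${outputdir:-.}/pidstat.txt"
  else
    echo "no elasticsearch process found; skipping pidstat" >&2
  fi
}
collect_pidstat
```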
Some features still need to be added to the 2.0 branch:
The workaround is to add "-k" to every place in the script where curl is executed. It'd be nice if that was a passthrough command line flag.
While this (https://github.com/elasticsearch/elasticsearch-support-diagnostics/blob/master/bin/support-diagnostics.sh#L80) is a convenience feature (removing the output dir if the host is not reachable), it is quite dangerous, especially if the script is run with sudo or root privileges. We had a scenario in the field where -H was specified but not reachable, and the user happened to have -o set to the directory where ES was installed; as a result, that directory was removed and the ES installation was gone...
#ensure we can connect to the host, or exit as there is nothing more we can do
connectionTest=`curl -s -S -XGET $eshost 2>&1`
if [ $? -ne 0 ]
then
  echo "Error connecting to $eshost: $connectionTest"
  rmdir $outputdir
  exit 1
fi
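A safer sketch: remember whether this run created the directory, and only remove it in that case. Using rmdir rather than rm -rf adds a second guard, since rmdir refuses to delete a non-empty directory. (The exit is commented out only so the sketch runs standalone.)

```shell
outputdir="${outputdir:-diag-check}"
created_outputdir=false
if [ ! -d "$outputdir" ]; then
  mkdir -p "$outputdir" && created_outputdir=true
fi

cleanup_on_failure() {
  # only remove the directory if this run created it; rmdir additionally
  # refuses to touch a non-empty directory, so pre-existing content survives
  if [ "$created_outputdir" = true ]; then
    rmdir "$outputdir" 2>/dev/null || true
  fi
}

if ! curl -s -S -XGET "${eshost:-http://localhost:9200}" >/dev/null 2>&1; then
  echo "Error connecting to ${eshost:-http://localhost:9200}" >&2
  cleanup_on_failure
  # the real script would `exit 1` here
fi
```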
Need to add flags (i.e. --conf, --log) to specify the config and log directories.
Some features still need to be added to the 2.0 branch:
Would be useful to have it.
The hostname is already there, but multiple nodes on the same host would be hard to distinguish.
It would be nice if this tool could display a message when a new version is available when the user runs the plugin. This way they don't need to check github.
It would be nice if this could support basic auth and cookie auth.
We should add the --max-time curl parameter and fail gracefully when any of the curl commands takes a long time. This means we potentially lose some output for the sake of completing faster, which is useful in time-sensitive, critical situations.
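A sketch of what that could look like: a shared curl wrapper with a hard timeout that warns and moves on instead of stalling the whole run (the 30-second default and the fetch name are arbitrary choices here):

```shell
# curl wrapper with a hard timeout; warns and continues on failure.
fetch() {  # usage: fetch <url> <outfile>
  if ! curl -s --max-time "${CURL_TIMEOUT:-30}" -XGET "$1" -o "$2"; then
    echo "warning: request to $1 failed or timed out; continuing" >&2
  fi
  return 0
}
```

Each existing curl call would then become something like `fetch "$eshost/_nodes/stats" "$outputdir/nodes_stats.json"`.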
Add ?fields=* to the indices stats and node stats calls so that we can see which fields are using lots of memory.
Cluster uuid is not exported by default. It would be helpful to link data from other sources:
GET _cluster/stats?output_uuid=true
I feel like reconciling the results could be pretty confusing if users need to run it against more than one node.
Granted, it's just a matter of opening it up, but that could get annoying pretty quickly. Long term, I suspect that we could automate the retrieval of all nodes by pre-fetching all of their names.
If I run diagnostics.sh with incorrect parameters (e.g. invalid host), it completes with something like:
Using /usr/bin/java as Java Runtime
Using -Xms256m -Xmx2000m for options.
Prompt for a password? No password value required, only the option. Hidden from the command line on entry.:
Getting Network Interface Information - this may take some time...
Run 1 of 1 completed.
No file created, no error posted... it looks as if it did something, but nothing happened. This confused me, and will surely confuse others.
I find these to be handy, can we add these to the tool? :) thx
Some files are created once at the beginning of multiple runs (version.json) while others are created for each run.
Could we have one file per run which is complete in itself? This would make incremental processing easier, as each diagnostic would be self-contained (and it would remove the unlikely case of one of the common files changing between runs).
Related to #30 but higher priority. Things like keys for aws can appear in yml and cluster state output. Would be nice to mask them automatically before packaging.
It would be helpful if we could see the segments API output when we pull diagnostics, e.g. this would let us see which Lucene version wrote which segments in each index.
curl -XGET 'http://localhost:9200/_segments'