elastic / support-diagnostics
Support diagnostics utility for Elasticsearch and Logstash
License: Other
The pretty flag needs to be re-added. We are worse off without it.
We need to redact the various passwords that appear in the yml files.
Auth fails as we do not return the status in 2.0
Currently logs are grabbed from the default location - sometimes these are old logs, or no logs at all...
Possible to learn the configured location of log files from the node's settings and capture the correct log files?
Every time we send the result of a tool run to ES support we have to manually remove AWS credentials from the following section.
From cluster_state.20150219-122235.json:
"repositories" : {
"202" : {
"type" : "s3",
"settings" : {
"region" : "us-east-1",
"max_restore_bytes_per_sec" : "1mb",
"max_snapshot_bytes_per_sec" : "1mb",
"bucket" : "es-snapshot.appcelerator.prod.202",
"access_key" : "",
"secret_key" : ""
}
}
},
Please add auto-deletion of them to keep customers safe.
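Automatic redaction could be sketched as a small filter applied to the captured JSON before packaging. The key list below is illustrative only, not exhaustive:

```shell
# Sketch: blank out values of known-sensitive JSON keys before packaging.
# The key list is an example; real coverage would need to be broader.
redact_json() {
  sed -E 's/("(access_key|secret_key|password)"[[:space:]]*:[[:space:]]*")[^"]*(")/\1REDACTED\3/g' "$1"
}
```

Usage would be along the lines of `redact_json cluster_state.20150219-122235.json > cluster_state.redacted.json`.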
It would be great if the output could include the type of filesystem (e.g. ext2, nfs, whatever).
A long-term solution is to add this information (available from the Java 7 FileStore API) to ES stats, but as a start, maybe we could include the output of df -k and mount, or similar?
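As a sketch, the script could capture this directly (df -T reports the filesystem type where available, with a plain df -k fallback for platforms like OS X; the $outputdir variable name mirrors what the script already uses):

```shell
# Capture filesystem type and mount information into the output directory.
outputdir="${outputdir:-diag-out}"
mkdir -p "$outputdir"
{ df -kT || df -k; } > "$outputdir/fs.txt" 2>/dev/null
mount > "$outputdir/mount.txt" 2>/dev/null || true
```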
need to replace with dummy data:
any other info revealing sensitive information.
ideally some consistency should be kept to be able to correlate information during troubleshooting
e.g.
real data
dedicated master srv23.secret.domain 10.37.12.32 HTTP 9200 Transport 9300
data node srv24.secret.domain 10.37.12.33 HTTP 9201 Transport 9301
data node srv25.secret.domain 10.37.12.34 HTTP 9202 Transport 9302
sanitised data
dedicated master dm.x.y x.x.x.32 HTTP 1 Transport 10
data node 1 -> dn.x.y x.x.x.33 HTTP 2 Transport 11
data node 2 -> dn.x.y x.x.x.34 HTTP 3 Transport 12
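A consistent mapping could be sketched with a small awk filter that assigns each distinct hostname and IP a stable placeholder, so the same host always sanitises to the same token. The patterns here are crude heuristics for illustration only:

```shell
# Sketch: stable anonymisation so information can still be correlated.
# Each distinct IP/hostname gets the same placeholder everywhere it appears.
sanitize() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
        if (!($i in ipmap)) { ipn++; ipmap[$i] = "x.x.x." ipn }
        $i = ipmap[$i]
      } else if ($i ~ /[a-zA-Z]/ && $i ~ /\./) {
        if (!($i in hmap)) { hn++; hmap[$i] = "host" hn ".x.y" }
        $i = hmap[$i]
      }
    }
    print
  }'
}
```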
When using authentication, -a is required when -c is used, at least with BASIC auth (untested with cookie auth). If -c is included, the script should exit if -a is not provided.
Can we also add /_template?pretty to the diagnostic dump? thx
There are lots of times when marvel would help me help users diagnose problems. If marvel is in use for a cluster, I would like to capture at least the last 2 days of marvel data in a form that I can import into my own cluster for local analysis.
When we pull diagnostics it's often useful to see not just the hot threads but any threads stuck waiting on a lock / IO operation as well, but because we only pull the top 10 "hot" ones now we won't (necessarily) see the stuck ones consuming 0% cpu.
I think we should show all threads?
Large clusters can produce huge cluster states when using ?pretty which may not be desirable. A flag should be added to disable this.
Linux
$ uname -a
Linux w530 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
OSX
$ uname -a
Darwin Antonios-MacBook-Air-2.local 14.3.0 Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64
Windows
http://www.windows-commandline.com/find-windows-os-version-from-command/ ?
Please add /_cat/shards?v as an additional output from the tool, we are finding it useful when looking at unbalanced shard allocation. thx!
THP, enabled by default on most Linux distros, is known to cause long young-generation GC pauses.
Add a check along the lines of
https://access.redhat.com/solutions/46111
Ideally identify one single command working on both rpm/deb/any_other_major based distros
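One distro-agnostic option is to read the THP status straight from sysfs, which works regardless of rpm/deb packaging (the second path is for older RHEL kernels; this is a sketch, not the final check):

```shell
# Report THP status via sysfs; prints nothing if THP is not compiled in.
thp_status() {
  for f in /sys/kernel/mm/transparent_hugepage/enabled \
           /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
    [ -r "$f" ] && echo "$f: $(cat "$f")"
  done
  return 0
}
thp_status
```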
Add ability to collect stats multiple times separated by interval
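A sketch of what the interval option could look like. The runs/interval flag names are hypothetical, and the single stats call in capture() stands in for the script's full request list:

```shell
eshost="${eshost:-http://localhost:9200}"
outputdir="${outputdir:-diag-out}"
runs="${runs:-3}"          # hypothetical --runs flag
interval="${interval:-1}"  # hypothetical --interval flag, in seconds
mkdir -p "$outputdir"

capture() {  # one capture pass; the real script would run its full curl list here
  ts=$(date +%Y%m%d-%H%M%S)
  curl -s "$eshost/_nodes/stats" > "$outputdir/nodes_stats.$ts.json" 2>/dev/null || true
}

for i in $(seq 1 "$runs"); do
  capture
  if [ "$i" -lt "$runs" ]; then sleep "$interval"; fi
done
```

Timestamping each pass keeps the runs distinct, so later passes never overwrite earlier ones.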
There are 2 issues to address for ES 2.0 with the existing plugin:
--install no longer works in 2.0. Change:
./bin/plugin --install elasticsearch/elasticsearch-support-diagnostics
to:
./bin/plugin install elasticsearch/elasticsearch-support-diagnostics
bin/support-diagnostics$ ls -arlth
total 24K
-rw-r--r-- 1 elk elk 8.8K Sep 24 14:10 support-diagnostics.sh
Either make it executable by default OR specify in usage docs to chmod +x bin/support-diagnostics/support-diagnostics.sh
Running the command recommended in the documentation ./bin/plugin --install elasticsearch/elasticsearch-support-diagnostics
fails with the following:
-> Installing elasticsearch/elasticsearch-support-diagnostics...
Trying https://github.com/elasticsearch/elasticsearch-support-diagnostics/archive/master.zip...
Failed to install elasticsearch/elasticsearch-support-diagnostics, reason: failed to download out of all possible locations..., use --verbose to get detailed information
Using ./bin/plugin --install elastic/elasticsearch-support-diagnostics
seems to work.
Side note: most of the links in the documentation in this repo point to the old "elasticsearch/..." GitHub org instead of the new "elastic/...", so I'm assuming this bug is an artifact of that as well.
Could the timestamp (including timezone) be captured - possibly simply in a single (JSON?) file?
File permissions are 644 on support-diagnostics.sh. Those should be 755 so it can be immediately executed.
The current script only copies configuration files from --path.config. However, if the user specified --config to point elasticsearch.yml to a totally different location, then that file is not copied. We should support --config as well.
Possible to create a checksum for all files included, just to be sure that nothing has become corrupted (or manually changed) in transit.
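This could be sketched by writing a checksum manifest into the bundle before archiving; the receiver then re-runs the verify step (sha1sum is assumed available; OS X would need shasum instead):

```shell
# Write a checksum manifest covering every file in the bundle, then verify it.
outputdir="${outputdir:-diag-out}"
mkdir -p "$outputdir"
echo sample > "$outputdir/sample.txt"   # placeholder file so the sketch runs standalone
( cd "$outputdir" && find . -type f ! -name checksums.sha1 -exec sha1sum {} + > checksums.sha1 )
( cd "$outputdir" && sha1sum -c checksums.sha1 )
```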
As originally requested by @dadoonet :
Maybe we don't need to have 30 days of log files (depending on their logging settings).
Could it be possible to specify something like --days 1 to get only the last day of logs or so?
I think our default log4j settings do daily rollovers, so this certainly seems possible.
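A sketch of that filter, assuming a hypothetical --days flag and relying only on file modification times (so it works even if the rollover file naming varies):

```shell
# Copy only log files modified within the last N days.
collect_recent_logs() {  # usage: collect_recent_logs <logdir> <days> <outputdir>
  mkdir -p "$3/logs"
  find "$1" -maxdepth 1 -type f -name '*.log*' -mtime "-$2" \
       -exec cp {} "$3/logs/" \; 2>/dev/null || true
}
```

For example, `collect_recent_logs /var/log/elasticsearch 1 "$outputdir"` would grab only the last day's files.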
The _cat/recovery request is redirected into a JSON file in both versions of the script even though it does not return JSON. The two scripts also slightly differ in the name they echo while it happens.
support-diagnostics.sh:
echo "Getting _/recovery"
curl -XGET "$eshost/_cat/recovery?v" >> $outputdir/cat_recovery.json 2> /dev/null
support-diagnostics.ps1:
Write-Host 'Getting _cat/recovery'
Invoke-WebRequest $esHost'/_cat/recovery?v' -OutFile $outputDir/cat_recovery.json
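A minimal fix for the shell script would be to correct the echoed name and write to a .txt file (the PowerShell script needs the same extension change). Sketch only; the defaults and the trailing || true are just so it runs standalone without a live cluster:

```shell
echo "Getting _cat/recovery"
# _cat endpoints return column-formatted text, not JSON, so use a .txt extension
outputdir="${outputdir:-.}"
curl -s -XGET "${eshost:-http://localhost:9200}/_cat/recovery?v" \
     >> "$outputdir/cat_recovery.txt" 2>/dev/null || true
```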
Recommend flipping hostname and timestamp.
Flipping them would sort directories better when users rerun the script, grouping by hostname and then by date/time.
The difference between
and
Plus, the user should be able to sort by the file's creation time if they want the reverse.
As some of the output gets large, I have started to find it can be convenient to send it right back into a local instance of the ELK stack to analyze.
In my case, I found it convenient to look at _cat/shards
to analyze where all the space was going.
My intent here is that people can run the support scripts, and then use ELK to analyze the results on their own using configurations. These can be added as they come up.
Today we can see the number of lucene documents, but when nested documents are in place, it would be really useful to know the number of "real" user-level documents.
If Shield is enabled on the cluster and the user forgets to use the parameters -c and -p (username/password) the support-diagnostics completes without errors, but every file contains only:
{
"error" : "AuthenticationException[missing authentication token for REST request [/?pretty]]",
"status" : 401
}
Need to add testing to ensure that authentication is used.
Cheers,
-Robin-
Storing the timestamp in the directory name is not very persistent/reliable. Should have a meta file in the directory storing details like host and timestamp.
After installing this plugin, we've started to see log lines like this:
[2015-06-25 22:06:14,292][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:06:24,311][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:06:34,346][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:06:44,365][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:06:54,382][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
[2015-06-25 22:07:04,400][DEBUG][plugins ] [c1b-searchb3-prod] [/opt/elasticsearch/plugins/support-diagnostics/_site] directory does not exist.
Which looks like a bug to me. We install a bunch of other plugins and haven't seen this issue before. From what I can tell it's harmless but annoying.
Please let me know if this is an issue with our particular install - if not, can it be fixed?
The current top output is just a 1-second snapshot in time.
On Linux, can we get the pid file and gather global process statistics for the ES process? Something like pidstat -druvw -p
This gives a global summary of things like page faults/second, IO rate/second, and context switches/second, as well as some things top doesn't show, like the number of file descriptors and threads.
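A sketch, assuming sysstat's pidstat is installed and that the ES process can be located by its main class name (both assumptions; the real script may already know the pid from a pid file):

```shell
# Sample ES process stats (CPU, IO, page faults, context switches) with
# pidstat: 5-second samples, 3 rounds. Skips quietly if ES is not running.
collect_pidstat() {
  espid=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch 2>/dev/null | head -1)
  if [ -n "$espid" ]; then
    pidstat -druvw -p "$espid" 5 3 > "${outputdir:-.}/pidstat.txt"
  else
    echo "no elasticsearch process found; skipping pidstat" >&2
  fi
}
collect_pidstat
```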
Some features still need to be added to the 2.0 branch:
The workaround is to add "-k" to every place in the script where curl is executed. It'd be nice if that was a passthrough command line flag.
While this (https://github.com/elasticsearch/elasticsearch-support-diagnostics/blob/master/bin/support-diagnostics.sh#L80) is a convenience feature (removing the output dir if the host is not reachable), it is quite dangerous, especially if the script is run with sudo or root privileges. We had a scenario in the field where -H was specified but not reachable, and the user happened to have -o set to the directory where ES was installed; as a result, that directory was removed and the ES installation was gone...
#ensure we can connect to the host, or exit as there is nothing more we can do
connectionTest=`curl -s -S -XGET $eshost 2>&1`
if [ $? -ne 0 ]
then
  echo "Error connecting to $eshost: $connectionTest"
  rmdir $outputdir
  exit 1
fi
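A safer sketch: remember whether this run created the directory, and only remove it in that case. Using rmdir rather than rm -rf adds a second guard, since rmdir refuses to delete a non-empty directory. (The exit is commented out only so the sketch runs standalone.)

```shell
outputdir="${outputdir:-diag-check}"
created_outputdir=false
if [ ! -d "$outputdir" ]; then
  mkdir -p "$outputdir" && created_outputdir=true
fi

cleanup_on_failure() {
  # only remove the directory if this run created it; rmdir additionally
  # refuses to touch a non-empty directory, so pre-existing content survives
  if [ "$created_outputdir" = true ]; then
    rmdir "$outputdir" 2>/dev/null || true
  fi
}

if ! curl -s -S -XGET "${eshost:-http://localhost:9200}" >/dev/null 2>&1; then
  echo "Error connecting to ${eshost:-http://localhost:9200}" >&2
  cleanup_on_failure
  # the real script would `exit 1` here
fi
```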
Need to add flags (i.e. --conf, --log) to specify the config and log directories.
Some features still need to be added to the 2.0 branch:
Would be useful to have it.
The hostname is already there, but multiple nodes on the same host would be hard to distinguish.
It would be nice if this tool could display a message when a new version is available when the user runs the plugin. This way they don't need to check github.
It would be nice if this could support basic auth and cookie auth.
We should add the --max-time curl parameter and fail gracefully when any of the curl commands takes a long time. This means we potentially lose some output for the sake of completing faster, which is useful in time-sensitive, critical situations.
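A sketch of what that could look like: a shared curl wrapper with a hard timeout that warns and moves on instead of stalling the whole run (the 30-second default and the fetch name are arbitrary choices here):

```shell
# curl wrapper with a hard timeout; warns and continues on failure.
fetch() {  # usage: fetch <url> <outfile>
  if ! curl -s --max-time "${CURL_TIMEOUT:-30}" -XGET "$1" -o "$2"; then
    echo "warning: request to $1 failed or timed out; continuing" >&2
  fi
  return 0
}
```

Each existing curl call would then become something like `fetch "$eshost/_nodes/stats" "$outputdir/nodes_stats.json"`.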
Add ?fields=* to the indices stats and node stats calls so that we can see which fields are using lots of memory.
Cluster uuid is not exported by default. It would be helpful to link data from other sources:
GET _cluster/stats?output_uuid=true
I feel like reconciling the results could be pretty confusing if users need to run it against more than one node.
Granted, it's just a matter of opening it up, but that could get annoying pretty quickly. Long term, I suspect that we could automate the retrieval of all nodes by pre-fetching all of their names.
If I run diagnostics.sh with incorrect parameters (e.g. invalid host), it completes with something like:
Using /usr/bin/java as Java Runtime
Using -Xms256m -Xmx2000m for options.
Prompt for a password? No password value required, only the option. Hidden from the command line on entry.:
Getting Network Interface Information - this may take some time...
Run 1 of 1 completed.
No file created, no error posted... it looks as if it did something, but nothing happened. This confused me, and will surely confuse others.
I find these to be handy, can we add these to the tool? :) thx
Some files are created once at the beginning of multiple runs (version.json) while others are created for each run.
Could we have one file per run which is complete in itself? This would make incremental processing easier, as each diagnostic would be self-contained (and it would remove the unlikely case of one of the common files changing between runs).
Related to #30 but higher priority. Things like keys for aws can appear in yml and cluster state output. Would be nice to mask them automatically before packaging.
It would be helpful if we could see the segments API output when we pull diagnostics, e.g. this would let us see which Lucene version wrote which segments in each index.
curl -XGET 'http://localhost:9200/_segments'