Note, this is a derivative fork of the original IBM code, which is no longer under development. As such it contains significant changes:
- uses Elasticsearch Java API - required a bit of shading to get working with Spark
- makes use dynamic template mappings
- added support for date-based index names for easier aging off with Curator
- upgraded to support Elasticsearch 5 and beyond
- dropped the SparkListener which I currently don't see much benefit in
However, to-date I have only tested this on Spark 1.6.3, although plan to support Spark 2.x versions soon.
Spelk (spark-elk) is an add-on to Apache Spark (http://spark.apache.org/) to report metrics to an Elasticsearch server and allow visualizations using Kibana.
Spelk is built using Apache Maven. You can specify the Spark versions.
mvn clean package -P spark1.6.3
PUT the dynamic mapping in mapping.json against the following URL:
http://{{hostname}}:9200/_template/spark-metrics-template
This will enable the dynamic generation of any index that starts with spark-metrics-. This can be updated to match your preferred naming convention.
To use Spelk you need to add the Spelk jar to the Spark driver/executor classpaths and enable the metrics sink. You need to do the following:
- add the spelk jar file to your spark.files property
- add the spelk jar file to your spark.driver.extraClassPath property
- add the spelk jar file to your spark.executor.extraClassPath property
Enable the Elasticsearch sink
You will need to create a metrics.properties file, as shown below. Then add this to the following properties:
- spark.files
- spark.metrics.conf
Create a metrics.properties file, based on the template below:
driver.sink.elk.class=org.apache.spark.elk.metrics.sink.ElasticsearchSink
executor.sink.elk.class=org.apache.spark.elk.metrics.sink.ElasticsearchSink
# Name: Default: Description:
# clusterName none Elasticsearch cluster name
# host none Elasticsearch server host
# port none Elasticsearch server port
# index spark Elasticsearch index name
# indexDateFormat spark Elasticsearch index date format, if required - gets appended onto the index name, e,g, yyyy-MM-dd causes the index name to become spark-metrics-2018-12-19
# period 10 polling period
# units seconds polling period units
*.sink.elk.clusterName=elasticsearch
*.sink.elk.host=localhost
*.sink.elk.port=9200
*.sink.elk.index=spark
*.sink.elk.period=10
*.sink.elk.unit=seconds