Giter VIP home page Giter VIP logo

wonderdog's Introduction

Wonderdog

Wonderdog makes ElasticSearch easier to connect with Hadoop. It provides a few kinds of functionality:

# Hadoop MapReduce

Wonderdog provides InputFormat and OutputFormat classes that can be used in your own custom Hadoop MapReduce jobs.

  • com.infochimps.elasticsearch.ElasticSearchInputFormat
  • com.infochimps.elasticsearch.ElasticSearchOutputFormat
  • com.infochimps.elasticsearch.ElasticSearchStreamingInputFormat
  • com.infochimps.elasticsearch.ElasticSearchStreamingOutputFormat

These classes come in streaming (for the old mapred API) and non-streaming (for the new mapreduce API) flavors.

To use these classes, you'll need to declare a dependency on Wonderdog in your project's pom.xml:

<project>
  ...
  <dependencies>
    <dependency>
      <groupId>com.infochimps</groupId>
      <artifactId>elasticsearch</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
    ...
  </dependencies>
  ...
</project>

Now when you build your code, it will include the Wonderdog InputFormat and OutputFormat classes you need.

TBD:

  • examples of using these classes in your on MapReduce jobs
  • examples of launching such a job from the command-line
# Wukong

Wonderdog also provides a Wukong plugin to make it easy to use the InputFormat and OutputFormat classes.

Installing Wonderdog

Ensure that Wonderdog is in your project's Gemfile:

# in Gemfile
gem 'wonderdog', git: 'https://github.com/infochimps-labs/wonderdog'

You'll have to require Wonderdog at the top of your job

# in my_elasticsearch_job.rb

require 'wukong'
require 'wonderdog'
 
Wukong.dataflow(:mapper) do
 ...
end
 
Wukong.dataflow(:reducer) do
 ...
end

If you are running a deploy pack then you may want to require Wonderdog at the top-level of your deploy pack by creating an initializer:

# in config/initializers/plugins.rb
require 'wonderdog'

Using Wonderdog From Wukong

Wukong uses Wukong-Hadoop to provide the basic functionality of connecting Wukong to Hadoop. Wonderdog modifies this connection by adjusting the command-lines passed to the hadoop program so that the correct input and output formats are used.

A "normal" Hadoop streaming job launched by Wukong-Hadoop might look like this:

$ wu hadoop my_job.rb --input=/some/hdfs/input/path --output=/some/hdfs/output/path

Assuming you've correctly installed Wonderdog into your job or deploy pack, you should be able to invoke Wonderdog's core classes by changing the URI for input or output to use a scheme of es. The "host" of the URI is the index in ElasticSearch and the "path" the type.

Embedded vs. Transport Nodes

Wonderdog provides two different ways of connnecting to ElasticSearch from within a Hadoop task.

  • By default, each map task will spin up a transport client which will attempt to connect to some ElasticSearch webnode. By default it will look for this webnode on the same machine as the task itself is running on. This is convenient in the common case when each Hadoop tasktracker is also an ElasticSearch webnode (datanodes may, of course, live elsewhere).

  • Each map task can also be configured to spin up its own embedded ElasticSearch node which directly connects to an ElasticSearch cluster.

The following options control this behavior:

  • --es_transport -- Use a transport client instead of an embedded node. True by default.
  • --es_transport_host -- When using a transport client, the host of the ElasticSearch webnode to connect to. Defaults to localhost.
  • --es_transport_port -- When using a transport client, the port of the ElasticSearch webnode to connect to. Defaults to 9300.

Writing data to ElasticSearch

Here's an example which would write all its output data to the index twitter in the type tweet:

$ wu hadoop my_job.rb --input=/some/hdfs/input/path --output=es://twitter/tweet

Data Format & Routing

It's always assumed that the output of the reducer is newline-delimited, JSON formatted data. Most fields in the each record are passed through unmodified or read. But some fields are important:

  • _id - if this field is present then it will be used as the document ID of the record created in ElasticSearch. This is the right way to ensure that a write updates an existing document instead of creating a new document. The name of this field (_id) can be modified with the --es_id_field option.

  • _mapping - if this field is present then it will be used as the type the document is written to, no matter what was passed on the command-line as the --output. This is the right way to allow writing to multiple types depending on the document. The name of this field (_mapping) can be modified with the --es_mapping_field option. And, yes, this field probably should have been called _type...

  • _index - if this field is present then it will be used as the index the document is written to, no matter what was passed on the command-line as the --output. This is the right way to allow writing to multiple types depending on the document. The name of this field (_index) can be modified with the --es_index_field option.

Optimization

It's not unusual to prepare an ElasticSearch index for bulk writing before executing a Hadoop job to write to it. The following operations should be enabled for best performance:

  • turn index.number_of_replicas down to 0 to ensure that there are as few shards (copies) of the data as possible that need to be updated on each write

  • turn index.refresh_interval to -1 to ensure that ElasticSearch doesn't allocate any of its resources refreshing data for search instead of indexing.

It's also a good idea to have created all mappings up front, before loading.

Wonderdog provides the --es_bulk_size option which sets the size of batch writes sent to ElasticSearch (default: 1000). Increasing this number can be appropriate and lead to higher throughput in some situations.

Reading data from ElasticSearch

Here's an example which would read all its input data from the index twitter in the type tweet:

$ wu hadoop my_job.rb --input=es://twitter/tweet --output=/some/hdfs/output/path

This would read in every single tweet record. This can be customized using the full power of ElasticSearch by providing an arbitrary ElasticSearch JSON query via the --es_query option. Wonderdog will run the query at Hadoop job submission time and use the result-set as the input data.

The result-set will be presented to Hadoop as newline-delimited, JSON-formatted data.

Here's an example, which would capture only tweets about Chicago:

$ wu hadoop my_job.rb --input=es://twitter/tweet --output=/some/hdfs/output/path --es_query='{"query": {"match":{"text": "Chicago"}}}'

Optimization

Wonderdog uses ElasticSearch's scroll API to fetch data from ElasticSearch. There are several options which can be used to tune the usage of this API for better performance:

  • --es_input_splits -- the number of input splits to create
  • --es_request_size -- the number of documents to request at a time (defaults to 50)
  • --es_scroll_timeout -- the amount of time to wait on each scroll / longest running map task (defaults to 5 minutes)

The larger the dataset and the fewer the input splits, the larger the data that needs to be processed (and hence scrolled through) within each task, the longer the scroll timeout should be set for. Essentially, no task should take longer to complete than the scroll timeout.

It's recommended to read data out of ElasticSearch into a temporary copy in HDFS which can then be used for more intensive processing.

# Pig

The most up-to-date (and simplest) way to store data into elasticsearch with hadoop is to use the Pig Store Function. You can write both delimited and json data to elasticsearch as well as read data from elasticsearch.

Storing tabular data

This allows you to store tabular data (eg. tsv, csv) into elasticsearch.

%default ES_JAR_DIR '/usr/local/share/elasticsearch/lib'
%default INDEX      'ufo_sightings'
%default OBJ        'sighting'

register target/wonderdog*.jar;
register $ES_JAR_DIR/*.jar;

ufo_sightings = LOAD '/data/domestic/aliens/ufo_awesome.tsv' AS (sighted_at:long, reported_at:long, location:chararray, shape:chararray, duration:chararray, description:chararray);
STORE ufo_sightings INTO 'es://$INDEX/$OBJ?json=false&size=1000' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();

Here the fields that you set in Pig (eg. 'sighted_at') are used as the field names when creating json records for elasticsearch.

Storing json data

You can store json data just as easily.

ufo_sightings = LOAD '/data/domestic/aliens/ufo_awesome.tsv.json' AS (json_record:chararray);
STORE ufo_sightings INTO 'es://$INDEX/$OBJ?json=true&size=1000' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();

Loading Data

Easy too.

-- dump some of the ufo sightings index based on free text query
alien_sightings = LOAD 'es://ufo_sightings/ufo_sightings?q=alien' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage() AS (doc_id:chararray, contents:chararray);
DUMP alien_sightings;

ElasticSearchStorage Constructor

The constructor to the UDF can take two arguments (in the following order):

  • esConfig - The full path to where elasticsearch.yml lives on the machine launching the hadoop job
  • esPlugins - The full path to where the elasticsearch plugins directory lives on the machine launching the hadoop job

Query Parameters

There are a few query paramaters available:

  • json - (STORE only) When 'true' indicates to the StoreFunc that pre-rendered json records are being indexed. Default is false.
  • size - When storing, this is used as the bulk request size (the number of records to stack up before indexing to elasticsearch). When loading, this is the number of records to fetch per request. Default 1000.
  • q - (LOAD only) A free text query determining which records to load. If empty, matches all documents in the index.
  • id - (STORE only) The name of the field to use as a document id. If blank (or -1) the documents are assumed to have no id and are assigned one by elasticsearch.
  • tasks - (LOAD only) The number of map tasks to launch. Default 100.

Note that elasticsearch.yml and the plugins directory are distributed to every machine in the cluster automatically via hadoop's distributed cache mechanism.

# Command-Line Utilities

There are a number of convenience commands in estool. Most of the common REST API operations have be mapped. Enumerating a few:

  • Print status of all indices as a json hash to the terminal
$ estool -c <elasticsearch_host> status
  • Check cluster health (red,green,yellow,relocated shards, etc)
$ estool -c <elasticsearch_host>  health
  • Set replicas for an index
$ estool set_replication -c <elasticsearch_host> --index <index_name> --replicas <num_replicas>
  • Optimize an index
$ estool optimize -c <elasticsearch_host> --index <index_name>
  • Snapshot an index
$ estool snapshot -c <elasticsearch_host> --index <index_name>
  • Delete an index
$ estool delete -c <elasticsearch_host> --index <index_name>

wonderdog's People

Contributors

kornypoet avatar temujin9 avatar berngp avatar joshbronson avatar mrflip avatar rjurney avatar arrawatia avatar

Watchers

Kwamena Appiah-Kubi avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.