Giter VIP home page Giter VIP logo

stream2es's Introduction

stream2es

Standalone utility to stream different inputs into Elasticsearch.

Read This First

If you've just wandered here, first check out Logstash. It's a much more general tool, and one of our featured products. If for some reason it doesn't do something that's important to you, create an issue there. stream2es is a dev tool that originated before the author knew much about Logstash. That said, there are some important differences that are specific to Elasticsearch. stream2es supports bulks by byte-length (--bulk-bytes) instead of doc count, which is crucial with docs of varying size. It also supports exporting raw bulks via --tee-bulk to a hashed dir on the filesystem, and you can make the incoming stream finite with --max-docs.

Install

You'll need Java 8+. Run java -version to make sure.

Unix

Download stream2es and make it executable:

% curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es

Windows

> curl -O download.elasticsearch.org/stream2es/stream2es
> java -jar stream2es help

Usage

stdin

By default, stream2es reads JSON documents from stdin.

% echo '{"f":1}' | stream2es
2014-10-08T12:29:56.318-0500 INFO  00:00.116 8.6d/s 0.4K/s (0.0mb) indexed 1 streamed 1 errors 0
%

If you want more logging, set --log debug. If you don't want any output, set --log warn.

Wikipedia

Index the latest Wikipedia article dump.

% stream2es wiki --target http://localhost:9200/tmp --log debug
create index http://localhost:9200/tmp
stream wiki from http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
^Cstreamed 1158 docs 1082 bytes xfer 15906901 errors 0

If you're at a café or want to use a local copy of the dump, supply --source:

% ./stream2es wiki --max-docs 5 --source /d/data/enwiki-20121201-pages-articles.xml.bz2

Note that if you live-stream the WMF-hosted dump, it will cut off after a while. Grab a torrent and index it locally if you need more than a few thousand docs.

Generator

stream2es can fuzz data for you. It can create blank documents, or documents with integer fields, or documents with string fields if you supply a dictionary.

Blank documents are easy:

stream2es generator

Ints need to know how big you want them. This template would give you a single field with values between 0 and 127, inclusive.

stream2es generator --fields f1:int:128

To add a string, we need to add a template for it, and a file of newline-separated lines of text. Given a field template of NAME:str:N, stream2es will select N random words from the dictionary for each field.

# zsh
% stream2es generator --fields f1:int:128,f2:str:2 --dictionary <(/bin/echo -e "foo\nbar\nbaz")
#### same as:
% stream2es generator --fields f1:int:128,f2:str:2 --dictionary /dev/stdin --max-docs 5 <<EOF
foo
bar
baz
EOF
% curl -s localhost:9200/foo/_search\?format=yaml | fgrep -A2 _source
    _source:
      f1: 28
      f2: "foo baz"
--
    _source:
      f1: 88
      f2: "baz foo"
--
    _source:
      f1: 26
      f2: "baz baz"
--
    _source:
      f1: 68
      f2: "bar baz"
--
    _source:
      f1: 64
      f2: "foo foo"
%

Fortunately, most *nix systems come with /usr/share/dict/words (Ubuntu package wamerican-small, for example), which is a great choice if you just need some (English) text. Install other langs if you prefer.

Elasticsearch

Note: ES 2.3 added a reindex API that completely obviates this feature of stream2es. Also, Logstash 1.5.0 has an Elasticsearch input.

If you use the es stream, you can copy indices from one Elasticsearch to another. Example:

% stream2es es \
     --source http://foo.local:9200/wiki \
     --target http://bar.local:9200/wiki2

This is a convenient way to reindex data if you need to change the number of shards or update your mapping.

Twitter

In order to stream Twitter, you have to create an app and authorize it.

Create app

Visit (https://dev.twitter.com/apps/new) and create an app. Call it stream2es. Note the Consumer key and Consumer secret.

Authorize app

Now run stream2es twitter --authorize --key CONSUMER_KEY --secret CONSUMER_SECRET and complete the dialog.

Run with new creds

You should now be able to stream twitter with simply stream2es twitter. stream2es will grab the most recent cached credentials from ~/.authinfo.stream2es.

Tracking keywords

By default, stream2es will stream random sample of all public tweets, however you can configure stream2es to track specific keywords as follows:

	stream2es twitter --track "Linux%%New York%%March Madness"

Options

% stream2es --help
Copyright 2013 Elasticsearch

Usage: stream2es [CMD] [OPTS]

..........

You can change index settings by supplying --settings:

% echo '{"name":"alfredo"}' | ./stream2es stdin --settings '
{
    "settings": {
        "refresh_interval": "2m"
    }
}'

Contributing

stream2es is written in Clojure. You'll need leiningen 2.0+ to build.

% lein bin
% target/stream2es

You'll also need this little git alias if you want to do make:

[alias]
ver = "!git log --pretty=format:'%ai %h' -1 | perl -pe 's,(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) (\\d\\d):(\\d\\d):(\\d\\d) [^ ]+ ([a-z0-9]+),\\1\\2\\3\\7,'"

License

This software is licensed under the Apache 2 license, quoted below.

Copyright 2009-2013 Elasticsearch <http://www.elasticsearch.org>

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

stream2es's People

Contributors

brunobonacci avatar dakrone avatar damienalexandre avatar drewr avatar ikedam avatar murhaf avatar s1monw avatar scottbessler avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

stream2es's Issues

partial data imported

Only partial data was imported. I got this error.

cat file.json | ./stream2es stdin --target "http://localhost:9200/myindex/mytype/"

Exception in thread "stream dispatcher" com.fasterxml.jackson.core.JsonParseException: Unexpected character ('9' (code 57)): was expecting double-quote to start field name
at [Source: java.io.StringReader@66fd845; line: 1, column: 35]
at com.fasterxml.jackson.core.JsonParser.constructError(JsonParser.java:1586)
at com.fasterxml.jackson.core.base.ParserMinimalBase.reportError(ParserMinimalBase.java:521)
at com.fasterxml.jackson.core.base.ParserMinimalBase.reportUnexpectedChar(ParserMinimalBase.java:450)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.handleOddName(ReaderBasedJsonParser.java:1701)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:642)
at cheshire.parse$parse_STAR
.invokeStatic(parse.clj:63)
at cheshire.parse$parse_STAR
.invoke(parse.clj:61)
at cheshire.parse$parse_STAR
.invokeStatic(parse.clj:63)
at cheshire.parse$parse_STAR
.invoke(parse.clj:61)
at cheshire.parse$parse.invokeStatic(parse.clj:98)
at cheshire.parse$parse.invoke(parse.clj:86)
at cheshire.core$parse_string.invokeStatic(core.clj:205)
at cheshire.core$parse_string.invoke(core.clj:191)
at cheshire.core$parse_string.invokeStatic(core.clj:202)
at cheshire.core$parse_string.invoke(core.clj:191)
at stream2es.stream.stdin$fn__6955.invokeStatic(stdin.clj:52)
at stream2es.stream.stdin$fn__6955.invoke(stdin.clj:49)
at stream2es.stream$fn__6530$G__6525__6537.invoke(stream.clj:12)
at stream2es.main$make_object_processor$fn__7305.invoke(main.clj:224)
at stream2es.main$start_doc_stream$disp__7292.invoke(main.clj:209)
at clojure.lang.AFn.run(AFn.java:22)
at java.lang.Thread.run(Thread.java:745)

^C2016-07-30T14:31:55.131+0000 INFO 20:11.473 85.8d/s 40.9K/s (48.4mb) indexed 103936 streamed 103967 errors 0

index wikipedia using local copy get clojure.lang.ExceptionInfo: throw+:

java version "1.8.0_51"
elasticsearch-2.0.0

I used the command

./stream2es wiki --source enwiki-20151102-pages-articles-multistream.xml.bz2 

but got

clojure.lang.ExceptionInfo: throw+: {:type :stream2es.http/urlparse} {:object {:type :stream2es.http/urlparse}, :environment {u "enwiki-20151102-pages-articles-multistream.xml.bz2", o__5349__auto__ {:type :stream2es.http/urlparse}}}
    at stream2es.http$make_jurl.invoke(http.clj:48)
    at stream2es.http$make_target.invoke(http.clj:78)
    at stream2es.http$make_target.invoke(http.clj:76)
    at stream2es.stream.wiki$fn__6629.invoke(wiki.clj:42)
    at stream2es.stream$fn__5804$G__5797__5811.invoke(stream.clj:22)
    at stream2es.bootstrap$boot.invoke(bootstrap.clj:70)
    at stream2es.main$_main.doInvoke(main.clj:339)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)

--help doesn't print out

Running ./stream2es es --help doesn't print to the screen..

Have to run ./stream2es --help > help.txt

Can it be updated to print to standard out?

Add us gov access log stream

Hey,

I'd love to have support for the us government http access log stream available at

http://developer.usa.gov/1usagov

The only thing to pay attention to is that they sometime send a heartbeat inbetween the messages like this

{"_heartbeat_":1370871031}

which should be ignored.

Reason for me is to have another non twitter stream I can use for geo data demos...

Don't pass meta fields through source root

Right now we add _id, _type, _routing, etc. to the stream's doc source root so that they can be read at index time. This isn't very elegant, and can step on fields that are named the same. We can still ship them in the source, but hide them in an inner object with a special name that we can be sure to remove during indexing.

es to json losing 1 event per file?

abonuccelli@w530 /tmp/test/tee $ /opt/elk/PROD/scripts/stream2es es --source http://192.168.0.101:9200/logstash-syslog-2015.03.19 --no-indexing --tee /tmp/test/tee/  --log debug  
2015-04-08T16:30:24.140+0200 DEBUG saving json to /tmp/test/tee/
2015-04-08T16:30:24.831+0200 DEBUG 00:00.695 2064.7d/s 1492.2K/s 1435 1435 1061989 0 AUwxHXokwxvhS6xbMjj6
2015-04-08T16:30:25.204+0200 DEBUG 00:01.068 2698.5d/s 1942.0K/s 2882 1447 1061845 0 AUwxMMMzwxvhS6xbMkIR
2015-04-08T16:30:25.441+0200 DEBUG 00:01.306 3340.7d/s 2382.2K/s 4363 1481 1062011 0 AUwxR_iXwxvhS6xbMksH
2015-04-08T16:30:25.779+0200 DEBUG 00:01.643 3557.5d/s 2524.9K/s 5845 1482 1062055 0 AUwxRf5kwxvhS6xbMkph
2015-04-08T16:30:26.032+0200 DEBUG 00:01.896 3861.3d/s 2735.3K/s 7321 1476 1062734 0 AUwxXZSnwxvhS6xbMlQj
2015-04-08T16:30:26.349+0200 DEBUG 00:02.214 3974.3d/s 2810.9K/s 8799 1478 1062110 0 AUwxX43ywxvhS6xbMlTH
2015-04-08T16:30:26.691+0200 DEBUG 00:02.556 4016.8d/s 2840.7K/s 10267 1468 1062387 0 AUwxdoDxwxvhS6xbMl41
2015-04-08T16:30:26.929+0200 DEBUG 00:02.794 4192.2d/s 2969.9K/s 11713 1446 1062023 0 AUwxjCAAwxvhS6xbMmfj
2015-04-08T16:30:27.340+0200 DEBUG 00:03.204 4102.4d/s 2913.5K/s 13144 1431 1061807 0 AUwxiPF1wxvhS6xbMmZK
2015-04-08T16:30:27.567+0200 DEBUG 00:03.432 4242.1d/s 3022.2K/s 14559 1415 1062326 0 AUwxlwiowxvhS6xbMm_i
2015-04-08T16:30:27.900+0200 DEBUG 00:03.765 4242.8d/s 3030.3K/s 15974 1415 1061652 0 AUwxln4UwxvhS6xbMm6q
2015-04-08T16:30:28.145+0200 DEBUG 00:04.010 4334.9d/s 3103.7K/s 17383 1409 1061685 0 AUwxuKtJwxvhS6xbMoDR
2015-04-08T16:30:28.459+0200 DEBUG 00:04.324 4360.8d/s 3118.2K/s 18856 1473 1062018 0 AUwxttbEwxvhS6xbMn-U
2015-04-08T16:30:28.912+0200 DEBUG 00:04.777 4258.7d/s 3039.7K/s 20344 1488 1062266 0 AUwxzLAUwxvhS6xbMoiO
2015-04-08T16:30:29.132+0200 DEBUG 00:04.996 4357.1d/s 3113.9K/s 21768 1424 1061471 0 AUwx4mOuwxvhS6xbMpKk
2015-04-08T16:30:29.462+0200 DEBUG 00:05.326 4350.2d/s 3115.6K/s 23169 1401 1061322 0 AUwx3woTwxvhS6xbMpCl
2015-04-08T16:30:29.761+0200 DEBUG 00:05.625 4371.2d/s 3134.3K/s 24588 1419 1061887 0 AUwx7du0wxvhS6xbMpiC
2015-04-08T16:30:30.116+0200 DEBUG 00:05.981 4326.9d/s 3121.0K/s 25879 1291 1060856 0 AUwx7seNwxvhS6xbMpjo
2015-04-08T16:30:30.340+0200 DEBUG 00:06.204 4372.7d/s 3175.7K/s 27128 1249 1060695 0 AUwyAeRFwxvhS6xbMqXN
2015-04-08T16:30:30.660+0200 DEBUG 00:06.525 4347.1d/s 3178.3K/s 28365 1237 1060739 0 AUwx_JQ-wxvhS6xbMqMz
2015-04-08T16:30:30.901+0200 DEBUG 00:06.765 4375.5d/s 3218.7K/s 29600 1235 1061110 0 AUwyAsLZwxvhS6xbMqyd
2015-04-08T16:30:31.229+0200 DEBUG 00:07.094 4353.8d/s 3215.5K/s 30886 1286 1061021 0 AUwyArp5wxvhS6xbMqvy
2015-04-08T16:30:31.459+0200 DEBUG 00:07.324 4398.3d/s 3256.0K/s 32213 1327 1061079 0 AUwyA0KCwxvhS6xbMrqY
2015-04-08T16:30:31.802+0200 DEBUG 00:07.666 4381.3d/s 3246.0K/s 33587 1374 1061644 0 AUwyApB5wxvhS6xbMqaA
2015-04-08T16:30:32.062+0200 DEBUG 00:07.927 4412.3d/s 3269.9K/s 34976 1389 1061556 0 AUwyA91uwxvhS6xbMsdP
2015-04-08T16:30:32.374+0200 DEBUG 00:08.239 4395.2d/s 3271.7K/s 36212 1236 1060249 0 AUwyBG9XwxvhS6xbMskl
2015-04-08T16:30:32.619+0200 DEBUG 00:08.484 4413.1d/s 3299.3K/s 37441 1229 1060358 0 AUwyBIxWwxvhS6xbMs0D
2015-04-08T16:30:32.962+0200 DEBUG 00:08.827 4382.9d/s 3288.4K/s 38688 1247 1060343 0 AUwyBLWHwxvhS6xbMtJZ
2015-04-08T16:30:33.192+0200 DEBUG 00:09.057 4409.1d/s 3319.2K/s 39933 1245 1060581 0 AUwyBL3mwxvhS6xbMtNs
2015-04-08T16:30:33.556+0200 DEBUG 00:09.420 4384.4d/s 3301.4K/s 41301 1368 1061386 0 AUwyBU0FwxvhS6xbMuFS
2015-04-08T16:30:33.902+0200 DEBUG 00:09.767 4369.2d/s 3290.2K/s 42674 1373 1061000 0 AUwyBSRpwxvhS6xbMt3I
2015-04-08T16:30:34.151+0200 DEBUG 00:10.016 4391.4d/s 3311.8K/s 43984 1310 1061175 0 AUwyBRhHwxvhS6xbMtzK
2015-04-08T16:30:34.498+0200 DEBUG 00:10.363 4368.8d/s 3300.9K/s 45274 1290 1060444 0 AUwyBoV7wxvhS6xbMvP0
2015-04-08T16:30:34.760+0200 DEBUG 00:10.625 4375.0d/s 3316.9K/s 46484 1210 1060558 0 AUwyBlAJwxvhS6xbMu1p
2015-04-08T16:30:35.096+0200 DEBUG 00:10.961 4352.3d/s 3309.8K/s 47706 1222 1060691 0 AUwyBlxTwxvhS6xbMu8P
2015-04-08T16:30:35.345+0200 DEBUG 00:11.210 4374.0d/s 3328.7K/s 49033 1327 1061293 0 AUwyBv-swxvhS6xbMwBn
2015-04-08T16:30:35.710+0200 DEBUG 00:11.575 4353.3d/s 3313.3K/s 50389 1356 1061201 0 AUwyBrsMwxvhS6xbMvpy
2015-04-08T16:30:35.968+0200 DEBUG 00:11.833 4378.9d/s 3328.6K/s 51815 1426 1061576 0 AUwyGKVvwxvhS6xbMwnk
2015-04-08T16:30:36.370+0200 DEBUG 00:12.235 4351.4d/s 3304.0K/s 53239 1424 1061548 0 AUwyE4lUwxvhS6xbMwgC
2015-04-08T16:30:36.583+0200 DEBUG 00:12.448 4388.1d/s 3330.7K/s 54623 1384 1061208 0 AUwyJZxvwxvhS6xbMxFs
2015-04-08T16:30:36.904+0200 DEBUG 00:12.769 4384.9d/s 3328.2K/s 55991 1368 1061414 0 AUwyI9ckwxvhS6xbMw__
2015-04-08T16:30:37.135+0200 DEBUG 00:13.000 4411.4d/s 3348.7K/s 57348 1357 1060862 0 AUwyPESFwxvhS6xbMyBx
2015-04-08T16:30:37.480+0200 DEBUG 00:13.345 4404.9d/s 3339.9K/s 58783 1435 1062040 0 AUwyOWnRwxvhS6xbMx5N
2015-04-08T16:30:37.807+0200 DEBUG 00:13.671 4403.8d/s 3336.1K/s 60205 1422 1062064 0 AUwySo8GwxvhS6xbMybb
2015-04-08T16:30:38.056+0200 DEBUG 00:13.921 4428.4d/s 3350.7K/s 61648 1443 1061974 0 AUwyWuyRwxvhS6xbMy98
2015-04-08T16:30:38.379+0200 DEBUG 00:14.244 4427.4d/s 3347.5K/s 63064 1416 1061758 0 AUwyVuQFwxvhS6xbMy1K
2015-04-08T16:30:38.603+0200 DEBUG 00:14.468 4453.7d/s 3367.3K/s 64436 1372 1061083 0 AUwyaLUXwxvhS6xbMzYN
2015-04-08T16:30:38.951+0200 DEBUG 00:14.816 4445.3d/s 3358.2K/s 65861 1425 1061644 0 AUwydXCXwxvhS6xbMz1s
2015-04-08T16:30:39.180+0200 DEBUG 00:15.045 4469.2d/s 3376.0K/s 67239 1378 1061574 0 AUwyilmiwxvhS6xbM0bm
2015-04-08T16:30:39.554+0200 DEBUG 00:15.419 4452.4d/s 3361.3K/s 68652 1413 1061694 0 AUwyh8XhwxvhS6xbM0XI
2015-04-08T16:30:39.886+0200 DEBUG 00:15.750 4452.1d/s 3356.5K/s 70121 1469 1062295 0 AUwykmEswxvhS6xbM008
2015-04-08T16:30:40.126+0200 DEBUG 00:15.991 4475.5d/s 3370.8K/s 71567 1446 1061774 0 AUwyqdHmwxvhS6xbM1Zp
2015-04-08T16:30:40.455+0200 DEBUG 00:16.320 4472.0d/s 3366.4K/s 72983 1416 1061620 0 AUwyp4kcwxvhS6xbM1UJ
2015-04-08T16:30:40.683+0200 DEBUG 00:16.548 4494.3d/s 3382.6K/s 74371 1388 1061210 0 AUwyy4scwxvhS6xbM2ic
2015-04-08T16:30:41.004+0200 DEBUG 00:16.869 4488.5d/s 3379.7K/s 75716 1345 1060939 0 AUwyxLlUwxvhS6xbM2Rq
2015-04-08T16:30:41.243+0200 DEBUG 00:17.108 4505.1d/s 3393.1K/s 77073 1357 1061839 0 AUwy0LBbwxvhS6xbM20J
2015-04-08T16:30:41.572+0200 DEBUG 00:17.437 4499.1d/s 3388.5K/s 78451 1378 1061968 0 AUwyzdJpwxvhS6xbM2rt
2015-04-08T16:30:41.805+0200 DEBUG 00:17.669 4519.2d/s 3402.7K/s 79849 1398 1061734 0 AUwy7QTnwxvhS6xbM3v1
2015-04-08T16:30:42.128+0200 DEBUG 00:17.993 4519.6d/s 3399.1K/s 81321 1472 1062387 0 AUwy6uqTwxvhS6xbM3sD
2015-04-08T16:30:42.461+0200 DEBUG 00:18.326 4517.0d/s 3393.9K/s 82778 1457 1062131 0 AUwzAWuKwxvhS6xbM4SA
2015-04-08T16:30:42.700+0200 DEBUG 00:18.565 4536.7d/s 3406.1K/s 84223 1445 1062050 0 AUwzFhFswxvhS6xbM40y
2015-04-08T16:30:43.031+0200 DEBUG 00:18.896 4533.0d/s 3401.3K/s 85656 1433 1062391 0 AUwzD-JVwxvhS6xbM4rH
2015-04-08T16:30:43.270+0200 DEBUG 00:19.135 4550.9d/s 3413.1K/s 87081 1425 1062218 0 AUwzJ82VwxvhS6xbM5WN
2015-04-08T16:30:43.584+0200 DEBUG 00:19.448 4550.5d/s 3411.4K/s 88498 1417 1061585 0 AUwzJxiDwxvhS6xbM5T1
2015-04-08T16:30:43.817+0200 DEBUG 00:19.682 4567.7d/s 3423.6K/s 89902 1404 1062141 0 AUwzR78kwxvhS6xbM6c6
2015-04-08T16:30:44.140+0200 DEBUG 00:20.005 4566.7d/s 3420.2K/s 91357 1455 1062507 0 AUwzQz00wxvhS6xbM6UF
2015-04-08T16:30:44.468+0200 DEBUG 00:20.332 4563.1d/s 3416.2K/s 92777 1420 1062113 0 AUwzVzVfwxvhS6xbM65X
2015-04-08T16:30:44.698+0200 DEBUG 00:20.563 4581.0d/s 3428.2K/s 94200 1423 1061822 0 AUwzZEpWwxvhS6xbM7W2
2015-04-08T16:30:45.038+0200 DEBUG 00:20.903 4575.5d/s 3422.1K/s 95642 1442 1062339 0 AUwzYfB5wxvhS6xbM7RB
2015-04-08T16:30:45.268+0200 DEBUG 00:21.133 4594.6d/s 3433.9K/s 97097 1455 1061977 0 AUwzdHObwxvhS6xbM729
2015-04-08T16:30:45.590+0200 DEBUG 00:21.455 4592.4d/s 3430.7K/s 98530 1433 1061814 0 AUwzdCjkwxvhS6xbM72f
2015-04-08T16:30:45.823+0200 DEBUG 00:21.688 4609.0d/s 3441.7K/s 99959 1429 1061865 0 AUwzhOnYwxvhS6xbM8Zg
2015-04-08T16:30:46.142+0200 DEBUG 00:22.007 4606.2d/s 3438.9K/s 101369 1410 1061580 0 AUwzlccuwxvhS6xbM8-M
2015-04-08T16:30:46.471+0200 DEBUG 00:22.336 4601.6d/s 3434.7K/s 102781 1412 1061720 0 AUwzpNluwxvhS6xbM9dl
2015-04-08T16:30:46.701+0200 DEBUG 00:22.566 4617.3d/s 3445.6K/s 104193 1412 1061693 0 AUwztbkTwxvhS6xbM98D
2015-04-08T16:30:47.036+0200 DEBUG 00:22.900 4610.8d/s 3440.6K/s 105587 1394 1061923 0 AUwzryNhwxvhS6xbM9vT
2015-04-08T16:30:47.267+0200 DEBUG 00:23.132 4623.8d/s 3451.0K/s 106957 1370 1061933 0 AUwzw73-wxvhS6xbM-b0
2015-04-08T16:30:47.589+0200 DEBUG 00:23.454 4622.2d/s 3447.8K/s 108409 1452 1061883 0 AUwz0OWwwxvhS6xbM-vg
2015-04-08T16:30:47.823+0200 DEBUG 00:23.688 4638.6d/s 3457.5K/s 109878 1469 1062057 0 AUw0IddnwxvhS6xbM_m7
2015-04-08T16:30:48.062+0200 DEBUG 00:23.926 4655.8d/s 3466.5K/s 111395 1517 1062328 0 AUw0Dn2kwxvhS6xbM_da
2015-04-08T16:30:48.242+0200 DEBUG 00:24.107 4669.7d/s 3473.9K/s 112573 1178 825874 0 AUw0d_ahwxvhS6xbNAYp
2015-04-08T16:30:48.271+0200 INFO  00:24.135 4664.3d/s 3469.9K/s (81.8mb) indexed 112573 streamed 112573 errors 0
2015-04-08T16:30:48.272+0200 INFO  done
abonuccelli@w530 /tmp/test/tee $ ls -alrth
total 48K
drwxr-xr-x  4 abonuccelli abonuccelli 4,0K abr  8 12:28 ..
drwxr-xr-x  6 abonuccelli abonuccelli 4,0K abr  8 16:30 4
drwxr-xr-x 12 abonuccelli abonuccelli 4,0K abr  8 16:30 .
drwxr-xr-x  9 abonuccelli abonuccelli 4,0K abr  8 16:30 5
drwxr-xr-x  7 abonuccelli abonuccelli 4,0K abr  8 16:30 7
drwxr-xr-x  8 abonuccelli abonuccelli 4,0K abr  8 16:30 2
drwxr-xr-x  9 abonuccelli abonuccelli 4,0K abr  8 16:30 1
drwxr-xr-x  7 abonuccelli abonuccelli 4,0K abr  8 16:30 8
drwxr-xr-x  9 abonuccelli abonuccelli 4,0K abr  8 16:30 0
drwxr-xr-x  7 abonuccelli abonuccelli 4,0K abr  8 16:30 9
drwxr-xr-x  7 abonuccelli abonuccelli 4,0K abr  8 16:30 6
drwxr-xr-x  8 abonuccelli abonuccelli 4,0K abr  8 16:30 3
abonuccelli@w530 /tmp/test/tee $ find . -type f | while read x; do gunzip -d $x; done
abonuccelli@w530 /tmp/test/tee $ find . -type f | while read b; do wc -l $b ;done | awk '{sum+=$1} END {print sum}'
112492
abonuccelli@w530 /tmp/test/tee $ curl -XGET "http://192.168.0.101:9200/logstash-syslog-2015.03.19/_search?search_type=count&pretty"
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 112573,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}
abonuccelli@w530 /tmp/test/tee $ find . -type f | wc -l
81
abonuccelli@w530 /tmp/test/tee $ find . -type f | tail -1
./4/8/AUwyh8XhwxvhS6xbM0XI.json

112492 + 81 = 112573 ?

High cpu usage

Setup

  1. Latest stream2es (20150720170522978252e) on server (6 cores / 64GB ram) separate from the es cluster
  2. A big (~65GB) file containing 1 large json object per line. There are about 15 million lines/documents and the average line size is ~4.3k characters

Problem

I'm running :

cat bigfile | stream2es stdin --target http://server:9200/index/type --log debug -w 12

I have tried several different options for --bulk-bytes, -w, -d and -q but always the same result. I'm getting a constant indexing speed of ~5MB/s which translates to 4 hours to import the file. While indexing the elasticsearch cluster is heavily under-utilized and the stream2es server has a single core at 100%. I have done extensive testing to ensure that there are no network or elasticsearch performance issues.

Workaround

My final solution was to run stream2es in parallel (not with -w) to see if that would help.

cat bigfile | parallel -j12 -L5000 --pipe "stream2es stdin --target http://server:9200/index/type"

That helped a lot. Now all 6 cores and 12 threads get 100% and the indexing time fell from 4 hours to 35 minutes but the elasticsearch cluster is still pretty much idle. It seems to me that something in stream2es uses way more cpu than it should.

NPE when trying to scan wildcard --source

Because you can't get a _mapping for a wildcard, stream2es es hits this when --target doesn't exist. Would be better to print a helpful error like, Wildcard --source detected, you have to create --target manually.

% stream2es es --source http://localhost:9200/foo\* --target http://localhost:9200/foo3 --log debug
2014-10-14T10:06:57.274-0500 DEBUG create index http://localhost:9200/foo3
java.lang.NullPointerException
        at clojure.core$val.invoke(core.clj:1489)
        at stream2es.es$idx_meta.invoke(es.clj:112)
        at stream2es.es$mapping.invoke(es.clj:116)
        at stream2es.stream.es$fn__2981.invoke(es.clj:47)
        at stream2es.stream$fn__2794$G__2774__2801.invoke(stream.clj:16)
        at stream2es.main$ensure_index.invoke(main.clj:306)
        at stream2es.main$main.invoke(main.clj:318)
        at stream2es.main$_main.doInvoke(main.clj:336)
        at clojure.lang.RestFn.invoke(RestFn.java:551)

Cannot run stream2es with java 1.6 anymore on mac

When running stream2es with java version "1.6.0_51" on mac (OS X 10.8.4), I get the call stack below. Works fine with 1.7 for me.

To reproduce, run

  export JAVA_HOME=$(/usr/libexec/java_home -v 1.6)

,to set the java version to 6 (mine says: java version "1.6.0_51" afterwards) an then

  echo '{"foo":1}' | stream2es

Message is:

Exception in thread "main" java.lang.ExceptionInInitializerError
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at clojure.lang.RT.loadClassForName(RT.java:2098)
at clojure.lang.RT.load(RT.java:430)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5018.invoke(core.clj:5530)
at clojure.core$load.doInvoke(core.clj:5529)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5336)
at clojure.core$load_lib$fn__4967.invoke(core.clj:5375)
at clojure.core$load_lib.doInvoke(core.clj:5374)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$load_libs.doInvoke(core.clj:5413)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$require.doInvoke(core.clj:5496)
at clojure.lang.RestFn.invoke(RestFn.java:551)
at stream2es.auth$loading__4910__auto__.invoke(auth.clj:1)
at stream2es.auth__init.load(Unknown Source)
at stream2es.auth__init.(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at clojure.lang.RT.loadClassForName(RT.java:2098)
at clojure.lang.RT.load(RT.java:430)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5018.invoke(core.clj:5530)
at clojure.core$load.doInvoke(core.clj:5529)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5336)
at clojure.core$load_lib$fn__4967.invoke(core.clj:5375)
at clojure.core$load_lib.doInvoke(core.clj:5374)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$load_libs.doInvoke(core.clj:5413)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$require.doInvoke(core.clj:5496)
at clojure.lang.RestFn.invoke(RestFn.java:482)
at stream2es.stream.twitter$loading__4910__auto__.invoke(twitter.clj:1)
at stream2es.stream.twitter__init.load(Unknown Source)
at stream2es.stream.twitter__init.(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at clojure.lang.RT.loadClassForName(RT.java:2098)
at clojure.lang.RT.load(RT.java:430)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5018.invoke(core.clj:5530)
at clojure.core$load.doInvoke(core.clj:5529)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5336)
at clojure.core$load_lib$fn__4967.invoke(core.clj:5375)
at clojure.core$load_lib.doInvoke(core.clj:5374)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$load_libs.doInvoke(core.clj:5413)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$require.doInvoke(core.clj:5496)
at clojure.lang.RestFn.invoke(RestFn.java:436)
at stream2es.main$loading__4910__auto__.invoke(main.clj:1)
at stream2es.main__init.load(Unknown Source)
at stream2es.main__init.(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at clojure.lang.RT.loadClassForName(RT.java:2098)
at clojure.lang.RT.load(RT.java:430)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5018.invoke(core.clj:5530)
at clojure.core$load.doInvoke(core.clj:5529)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.lang.Var.invoke(Var.java:415)
at stream2es.main.(Unknown Source)
Caused by: java.lang.ClassNotFoundException: java/nio/file/attribute/PosixFilePermission
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:171)
at stream2es.util.io$loading__4910__auto__.invoke(io.clj:1)
at stream2es.util.io__init.load(Unknown Source)
at stream2es.util.io__init.(Unknown Source)
... 73 more

How can I know whether all the data from wikidumps were stored in the ElasticSearch, please?

How can I know whether all the data from wikidumps were stored in the ElasticSearch, please?
Because when I run the import data command,
it was stopped at :
2016-04-13T10:19:02.842+0800 DEBUG 03:05.733 572.7d/s 2415.3K/s 106363 1013 3159442 0
2016-04-13T10:19:04.208+0800 DEBUG 03:07.094 574.3d/s 2414.5K/s 107454 1091 3213698 0
2016-04-13T10:19:05.427+0800 DEBUG 03:08.317 576.6d/s 2415.2K/s 108575 1121 3162484 0
2016-04-13T10:19:06.541+0800 DEBUG 03:09.432 579.5d/s 2417.3K/s 109767 1192 3164995 0

without exit. and I don't know whether it finished successfully.

Problem running against current master

I cannot run stream2es against the current master due to the latest mapping changes

Tried to run

stream2es wiki -d 5000000 --source file:///Users/data/enwiki-latest-pages-articles.xml.bz2

resulting in this on the client side

clojure.lang.ExceptionInfo: clj-http: status 400 {:object {:trace-redirects ["http://localhost:9200/wiki"], :request-time 808, :status 400, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "360"}, :body "{\"error\":{\"root_cause\":[{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}],\"type\":\"mapper_parsing_exception\",\"reason\":\"mapping [_default_]\",\"caused_by\":{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}},\"status\":400}"}, :environment {client #<client$wrap_output_coercion$fn__1005 clj_http.client$wrap_output_coercion$fn__1005@3f8f9dd6>, req {:request-method :put, :url "http://localhost:9200/wiki", :body "{\"settings\":{\"index.refresh_interval\":\"5s\",\"index.number_of_shards\":2,\"query.default_field\":\"text\",\"index.number_of_replicas\":0},\"mappings\":{\"_default_\":{\"_size\":{\"enabled\":true,\"store\":true},\"_all\":{\"enabled\":false}}}}"}, map__927 {:trace-redirects ["http://localhost:9200/wiki"], :request-time 808, :status 400, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "360"}, :body "{\"error\":{\"root_cause\":[{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}],\"type\":\"mapper_parsing_exception\",\"reason\":\"mapping [_default_]\",\"caused_by\":{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}},\"status\":400}"}, resp {:trace-redirects ["http://localhost:9200/wiki"], :request-time 808, :status 400, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "360"}, :body "{\"error\":{\"root_cause\":[{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}],\"type\":\"mapper_parsing_exception\",\"reason\":\"mapping [_default_]\",\"caused_by\":{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}},\"status\":400}"}, status 400}}
    at clj_http.client$wrap_exceptions$fn__926.invoke(client.clj:111)
    at clj_http.client$wrap_accept$fn__1046.invoke(client.clj:380)
    at clj_http.client$wrap_accept_encoding$fn__1052.invoke(client.clj:394)
    at clj_http.client$wrap_content_type$fn__1041.invoke(client.clj:370)
    at clj_http.client$wrap_form_params$fn__1090.invoke(client.clj:481)
    at clj_http.client$wrap_nested_params$fn__1108.invoke(client.clj:505)
    at clj_http.client$wrap_method$fn__1085.invoke(client.clj:464)
    at clj_http.cookies$wrap_cookies$fn__631.invoke(cookies.clj:118)
    at clj_http.links$wrap_links$fn__661.invoke(links.clj:50)
    at clj_http.client$wrap_unknown_host$fn__1117.invoke(client.clj:524)
    at clj_http.client$put.doInvoke(client.clj:633)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at stream2es.es$put.invoke(es.clj:30)
    at stream2es.main$ensure_index.invoke(main.clj:310)
    at stream2es.main$main.invoke(main.clj:318)
    at stream2es.main$_main.doInvoke(main.clj:336)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)
2015-05-04T14:37:07.217+0200 ERROR unexpected exception: clojure.lang.ExceptionInfo: clj-http: status 400 {:object {:trace-redirects ["http://localhost:9200/wiki"], :request-time 808, :status 400, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "360"}, :body "{\"error\":{\"root_cause\":[{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}],\"type\":\"mapper_parsing_exception\",\"reason\":\"mapping [_default_]\",\"caused_by\":{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}},\"status\":400}"}, :environment {client #<client$wrap_output_coercion$fn__1005 clj_http.client$wrap_output_coercion$fn__1005@3f8f9dd6>, req {:request-method :put, :url "http://localhost:9200/wiki", :body "{\"settings\":{\"index.refresh_interval\":\"5s\",\"index.number_of_shards\":2,\"query.default_field\":\"text\",\"index.number_of_replicas\":0},\"mappings\":{\"_default_\":{\"_size\":{\"enabled\":true,\"store\":true},\"_all\":{\"enabled\":false}}}}"}, map__927 {:trace-redirects ["http://localhost:9200/wiki"], :request-time 808, :status 400, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "360"}, :body "{\"error\":{\"root_cause\":[{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}],\"type\":\"mapper_parsing_exception\",\"reason\":\"mapping [_default_]\",\"caused_by\":{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}},\"status\":400}"}, resp {:trace-redirects ["http://localhost:9200/wiki"], :request-time 808, :status 400, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "360"}, :body "{\"error\":{\"root_cause\":[{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}],\"type\":\"mapper_parsing_exception\",\"reason\":\"mapping [_default_]\",\"caused_by\":{\"type\":\"mapper_parsing_exception\",\"reason\":\"Mapping definition for [_size] has unsupported parameters:  [store : true]\"}},\"status\":400}"}, status 400}}
2015-05-04T14:37:07.367+0200 INFO  00:01.370 0.0d/s 0.0K/s (0.0mb) indexed 0 streamed 0 errors 0

and this on the server side

[2015-05-04 14:37:07,157][DEBUG][action.admin.indices.create] [Count Nefaria] [wiki] failed to create
MapperParsingException[mapping [_default_]]; nested: MapperParsingException[Mapping definition for [_size] has unsupported parameters:  [store : true]];
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:382)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:383)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: MapperParsingException[Mapping definition for [_size] has unsupported parameters:  [store : true]]
    at org.elasticsearch.index.mapper.DocumentMapperParser.checkNoRemainingFields(DocumentMapperParser.java:285)
    at org.elasticsearch.index.mapper.DocumentMapperParser.checkNoRemainingFields(DocumentMapperParser.java:279)
    at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:259)
    at org.elasticsearch.index.mapper.DocumentMapperParser.parseCompressed(DocumentMapperParser.java:211)
    at org.elasticsearch.index.mapper.DocumentMapperParser.parseCompressed(DocumentMapperParser.java:196)
    at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:216)
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:379)
    ... 6 more

NullPointerException when streaming es -> es

I'm moving some small indices (~1-2k documents) to another index and got this stack trace on one of them:

java.lang.NullPointerException
    at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
    at java.util.regex.Matcher.reset(Matcher.java:308)
    at java.util.regex.Matcher.<init>(Matcher.java:228)
    at java.util.regex.Pattern.matcher(Pattern.java:1088)
    at clojure.core$re_matcher.invoke(core.clj:4386)
    at clojure.core$re_find.invoke(core.clj:4438)
    at stream2es.es$scroll_STAR_.invoke(es.clj:72)
    at stream2es.es$scroll.invoke(es.clj:79)
    at stream2es.es$scan.invoke(es.clj:103)
    at stream2es.stream.es$make_callback$fn__3008.invoke(es.clj:75)
    at stream2es.main$stream_BANG_.invoke(main.clj:241)
    at stream2es.main$main.invoke(main.clj:330)
    at stream2es.main$_main.doInvoke(main.clj:336)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)
2015-01-19T19:59:12.461+0000 ERROR unexpected exception: java.lang.NullPointerException
2015-01-19T19:59:12.633+0000 INFO  00:07.973 0.0d/s 0.0K/s (0.0mb) indexed 0 streamed 0 errors 0

Unfortunately I'm doing these in large batches so I can't pin down the exact index/document for debugging.

Running on version:

2015-01-19T20:16:11.716+0000 INFO  stream2es 2014122282ace27

Tuning stream2es Guidance

I'm looking for guidance on how to "tune" stream2es. I'm using it to copy indexes into new mappings. I have no idea how many workers to be running, or if the default values are the bottleneck or if my elasticsearch is the bottleneck. I assume the default values are slowing me down, because the copy rate doesn't seem to change if I have multiple copies going and i don't see any real cpu / io pressure on the cluster. It could also be running it thru an elb, and it could also be the system that's doing the copy (t2.large in aws).

Do you have any guidance on what methodology to use other than "try a couple more workers until you don't see any improvements"?

stream2es is failing on my es index

I invoke stream2es es in order to reindex an index and it fails with e following exception

{:throwable #<ExceptionInfo clojure.lang.ExceptionInfo: clj-http: status 500 {:object {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, :environment {client #<client$wrap_output_coercion$fn__994 clj_http.client$wrap_output_coercion$fn__994@1b1f578c>, req {:request-method :put, :url "http://uatelasticui5:9200/deals", :body "{"settings":{"index.version.created":"900599","index.number_of_replicas":"3","index.number_of_shards":"1","index.auto_expand_replicas":"0-all"},"mappings":{"dynamic":"strict","_all":{"enabled":false},"properties":{"lastUpdate":{"type":"long"},"storeIds":{"type":"long"},"categoryPath":{"type":"long"},"endDate":{"type":"long"},"productTags":{"type":"long"},"discountPercent":{"type":"integer"},"priceBucket":{"type":"integer"},"key":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"externalProductId":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"offerType":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"regularPrice":{"type":"double","index":"no"},"dealPrice":{"type":"double"},"startDate":{"type":"long"},"feedSource":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"brandName":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"searchField":{"type":"string","analyzer":"snowball"},"productId":{"type":"long"},"storesKey":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"shares":{"type":"integer"},"id":{"type":"long"},"description":{"type":"string","index":"no"},"productName":{"type":"string","index":"no"},"tags":{"type":"long"}}}}"}, map__916 {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, resp {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, status 500}}>, :wrapper #<ExceptionInfo clojure.lang.ExceptionInfo: clj-http: status 500 {:object {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, :environment {client #<client$wrap_output_coercion$fn__994 clj_http.client$wrap_output_coercion$fn__994@1b1f578c>, req {:request-method :put, :url "http://uatelasticui5:9200/deals", :body "{"settings":{"index.version.created":"900599","index.number_of_replicas":"3","index.number_of_shards":"1","index.auto_expand_replicas":"0-all"},"mappings":{"dynamic":"strict","_all":{"enabled":false},"properties":{"lastUpdate":{"type":"long"},"storeIds":{"type":"long"},"categoryPath":{"type":"long"},"endDate":{"type":"long"},"productTags":{"type":"long"},"discountPercent":{"type":"integer"},"priceBucket":{"type":"integer"},"key":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"externalProductId":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"offerType":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"regularPrice":{"type":"double","index":"no"},"dealPrice":{"type":"double"},"startDate":{"type":"long"},"feedSource":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"brandName":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"searchField":{"type":"string","analyzer":"snowball"},"productId":{"type":"long"},"storesKey":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"shares":{"type":"integer"},"id":{"type":"long"},"description":{"type":"string","index":"no"},"productName":{"type":"string","index":"no"},"tags":{"type":"long"}}}}"}, map__916 {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, resp {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, status 500}}>, :stack-trace #<StackTraceElement[] [Ljava.lang.StackTraceElement;@35f5b2c4>, :cause nil, :message "clj-http: status 500", :object {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, :environment {client #<client$wrap_output_coercion$fn__994 clj_http.client$wrap_output_coercion$fn__994@1b1f578c>, req {:request-method :put, :url "http://uatelasticui5:9200/deals", :body "{"settings":{"index.version.created":"900599","index.number_of_replicas":"3","index.number_of_shards":"1","index.auto_expand_replicas":"0-all"},"mappings":{"dynamic":"strict","_all":{"enabled":false},"properties":{"lastUpdate":{"type":"long"},"storeIds":{"type":"long"},"categoryPath":{"type":"long"},"endDate":{"type":"long"},"productTags":{"type":"long"},"discountPercent":{"type":"integer"},"priceBucket":{"type":"integer"},"key":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"externalProductId":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"offerType":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"regularPrice":{"type":"double","index":"no"},"dealPrice":{"type":"double"},"startDate":{"type":"long"},"feedSource":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"brandName":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"searchField":{"type":"string","analyzer":"snowball"},"productId":{"type":"long"},"storesKey":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"shares":{"type":"integer"},"id":{"type":"long"},"description":{"type":"string","index":"no"},"productName":{"type":"string","index":"no"},"tags":{"type":"long"}}}}"}, map__916 {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, resp {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, status 500}}
clojure.lang.ExceptionInfo: clj-http: status 500 {:object {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, :environment {client #<client$wrap_output_coercion$fn__994 clj_http.client$wrap_output_coercion$fn__994@1b1f578c>, req {:request-method :put, :url "http://uatelasticui5:9200/deals", :body "{"settings":{"index.version.created":"900599","index.number_of_replicas":"3","index.number_of_shards":"1","index.auto_expand_replicas":"0-all"},"mappings":{"dynamic":"strict","_all":{"enabled":false},"properties":{"lastUpdate":{"type":"long"},"storeIds":{"type":"long"},"categoryPath":{"type":"long"},"endDate":{"type":"long"},"productTags":{"type":"long"},"discountPercent":{"type":"integer"},"priceBucket":{"type":"integer"},"key":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"externalProductId":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"offerType":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"regularPrice":{"type":"double","index":"no"},"dealPrice":{"type":"double"},"startDate":{"type":"long"},"feedSource":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"brandName":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"searchField":{"type":"string","analyzer":"snowball"},"productId":{"type":"long"},"storesKey":{"type":"string","index":"not_analyzed","omit_norms":true,"index_options":"docs"},"shares":{"type":"integer"},"id":{"type":"long"},"description":{"type":"string","index":"no"},"productName":{"type":"string","index":"no"},"tags":{"type":"long"}}}}"}, map__916 {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, resp {:trace-redirects ["http://uatelasticui5:9200/deals"], :request-time 10, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "93"}, :body "{"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}"}, status 500}}
at clj_http.client$wrap_exceptions$fn__915.invoke(client.clj:111)
at clj_http.client$wrap_accept$fn__1035.invoke(client.clj:380)
at clj_http.client$wrap_accept_encoding$fn__1041.invoke(client.clj:394)
at clj_http.client$wrap_content_type$fn__1030.invoke(client.clj:370)
at clj_http.client$wrap_form_params$fn__1079.invoke(client.clj:481)
at clj_http.client$wrap_nested_params$fn__1097.invoke(client.clj:505)
at clj_http.client$wrap_method$fn__1074.invoke(client.clj:464)
at clj_http.cookies$wrap_cookies$fn__526.invoke(cookies.clj:118)
at clj_http.links$wrap_links$fn__556.invoke(links.clj:50)
at clj_http.client$wrap_unknown_host$fn__1106.invoke(client.clj:524)
at clj_http.client$put.doInvoke(client.clj:633)
at clojure.lang.RestFn.invoke(RestFn.java:423)
at stream2es.es$put.invoke(es.clj:29)
at stream2es.main$ensure_index.invoke(main.clj:395)
at stream2es.main$main.invoke(main.clj:403)
at stream2es.main$_main.doInvoke(main.clj:435)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at stream2es.main.main(Unknown Source)

import unreliable due to random loss of data

this is a real showstopper and may hang together with #57 !?
Can reproduce it with a file of about 30 Mio Json documents:

gunzip -c /media/sf_common/something.json.gz |stream2es stdin --replace --target "http://localhost:9200/some/thing"

no error messages, but sometimes only 3 Mio documents are imported, another time it imports 7 Mio documents before it finishes. Nobody wants to fix this? It's rather unusable in this state!

Date format does not work on fr_FR systems

Hey @drewr

I tried to use your new version and get stucked with date formats.
I assume it's due to my MacOS version which is in french so my locale is fr_FR or something like that.

So basically, to make it work, I have to first create my index and set my own date settings for each date field as follow.

Note that dynamic_date_formats does not support locale attribute.

curl -XDELETE localhost:9200/twitter; echo

curl -XPUT localhost:9200/twitter -d '{
  "number_of_replicas":0,
  "refresh_interval":"1s",
  "number_of_shards":1
}'; echo

curl -XPUT localhost:9200/twitter/status/_mapping -d '{
    "status" : {
      "dynamic_date_formats" : [ "E MMM d HH:mm:ss Z yyyy" ],
      "_all" : {
        "enabled" : true
      },
      "properties" : {
        "bytes" : {
          "type" : "long"
        },
        "coordinates" : {
          "properties" : {
            "coordinates" : {
              "type" : "geo_point"
            },
            "type" : {
              "type" : "string"
            }
          }
        },
        "created_at" : {
          "type" : "date", "format" : "E MMM d HH:mm:ss Z yyyy", "locale": "en_EN"
        },
        "entities" : {
          "properties" : {
            "hashtags" : {
              "properties" : {
                "indices" : {
                  "type" : "long"
                },
                "text" : {
                  "type" : "string",
                  "index" : "not_analyzed",
                  "omit_norms" : true,
                  "index_options" : "docs"
                }
              }
            },
            "media" : {
              "properties" : {
                "display_url" : {
                  "type" : "string"
                },
                "expanded_url" : {
                  "type" : "string"
                },
                "id" : {
                  "type" : "long"
                },
                "id_str" : {
                  "type" : "string"
                },
                "indices" : {
                  "type" : "long"
                },
                "media_url" : {
                  "type" : "string"
                },
                "media_url_https" : {
                  "type" : "string"
                },
                "sizes" : {
                  "properties" : {
                    "large" : {
                      "properties" : {
                        "h" : {
                          "type" : "long"
                        },
                        "resize" : {
                          "type" : "string"
                        },
                        "w" : {
                          "type" : "long"
                        }
                      }
                    },
                    "medium" : {
                      "properties" : {
                        "h" : {
                          "type" : "long"
                        },
                        "resize" : {
                          "type" : "string"
                        },
                        "w" : {
                          "type" : "long"
                        }
                      }
                    },
                    "small" : {
                      "properties" : {
                        "h" : {
                          "type" : "long"
                        },
                        "resize" : {
                          "type" : "string"
                        },
                        "w" : {
                          "type" : "long"
                        }
                      }
                    },
                    "thumb" : {
                      "properties" : {
                        "h" : {
                          "type" : "long"
                        },
                        "resize" : {
                          "type" : "string"
                        },
                        "w" : {
                          "type" : "long"
                        }
                      }
                    }
                  }
                },
                "source_status_id" : {
                  "type" : "long"
                },
                "source_status_id_str" : {
                  "type" : "string"
                },
                "type" : {
                  "type" : "string"
                },
                "url" : {
                  "type" : "string"
                }
              }
            },
            "urls" : {
              "properties" : {
                "display_url" : {
                  "type" : "string"
                },
                "expanded_url" : {
                  "type" : "string"
                },
                "indices" : {
                  "type" : "long"
                },
                "url" : {
                  "type" : "string"
                }
              }
            },
            "user_mentions" : {
              "properties" : {
                "id" : {
                  "type" : "long"
                },
                "id_str" : {
                  "type" : "string"
                },
                "indices" : {
                  "type" : "long"
                },
                "name" : {
                  "type" : "string"
                },
                "screen_name" : {
                  "type" : "string"
                }
              }
            }
          }
        },
        "favorite_count" : {
          "type" : "long"
        },
        "favorited" : {
          "type" : "boolean"
        },
        "filter_level" : {
          "type" : "string"
        },
        "geo" : {
          "properties" : {
            "coordinates" : {
              "type" : "double"
            },
            "type" : {
              "type" : "string"
            }
          }
        },
        "id_str" : {
          "type" : "string"
        },
        "in_reply_to_screen_name" : {
          "type" : "string"
        },
        "in_reply_to_status_id" : {
          "type" : "long"
        },
        "in_reply_to_status_id_str" : {
          "type" : "string"
        },
        "in_reply_to_user_id" : {
          "type" : "long"
        },
        "in_reply_to_user_id_str" : {
          "type" : "string"
        },
        "lang" : {
          "type" : "string"
        },
        "offset" : {
          "type" : "long"
        },
        "place" : {
          "properties" : {
            "attributes" : {
              "type" : "object"
            },
            "bounding_box" : {
              "type" : "geo_shape"
            },
            "country" : {
              "type" : "string"
            },
            "country_code" : {
              "type" : "string"
            },
            "full_name" : {
              "type" : "string"
            },
            "id" : {
              "type" : "string"
            },
            "name" : {
              "type" : "string"
            },
            "place_type" : {
              "type" : "string"
            },
            "url" : {
              "type" : "string"
            }
          }
        },
        "possibly_sensitive" : {
          "type" : "boolean"
        },
        "retweet_count" : {
          "type" : "long"
        },
        "retweeted" : {
          "type" : "boolean"
        },
        "retweeted_status" : {
          "properties" : {
            "coordinates" : {
              "properties" : {
                "coordinates" : {
                  "type" : "double"
                },
                "type" : {
                  "type" : "string"
                }
              }
            },
            "created_at" : {
              "type" : "date", "format" : "E MMM d HH:mm:ss Z yyyy", "locale": "en_EN"
            },
            "entities" : {
              "properties" : {
                "hashtags" : {
                  "properties" : {
                    "indices" : {
                      "type" : "long"
                    },
                    "text" : {
                      "type" : "string"
                    }
                  }
                },
                "media" : {
                  "properties" : {
                    "display_url" : {
                      "type" : "string"
                    },
                    "expanded_url" : {
                      "type" : "string"
                    },
                    "id" : {
                      "type" : "long"
                    },
                    "id_str" : {
                      "type" : "string"
                    },
                    "indices" : {
                      "type" : "long"
                    },
                    "media_url" : {
                      "type" : "string"
                    },
                    "media_url_https" : {
                      "type" : "string"
                    },
                    "sizes" : {
                      "properties" : {
                        "large" : {
                          "properties" : {
                            "h" : {
                              "type" : "long"
                            },
                            "resize" : {
                              "type" : "string"
                            },
                            "w" : {
                              "type" : "long"
                            }
                          }
                        },
                        "medium" : {
                          "properties" : {
                            "h" : {
                              "type" : "long"
                            },
                            "resize" : {
                              "type" : "string"
                            },
                            "w" : {
                              "type" : "long"
                            }
                          }
                        },
                        "small" : {
                          "properties" : {
                            "h" : {
                              "type" : "long"
                            },
                            "resize" : {
                              "type" : "string"
                            },
                            "w" : {
                              "type" : "long"
                            }
                          }
                        },
                        "thumb" : {
                          "properties" : {
                            "h" : {
                              "type" : "long"
                            },
                            "resize" : {
                              "type" : "string"
                            },
                            "w" : {
                              "type" : "long"
                            }
                          }
                        }
                      }
                    },
                    "source_status_id" : {
                      "type" : "long"
                    },
                    "source_status_id_str" : {
                      "type" : "string"
                    },
                    "type" : {
                      "type" : "string"
                    },
                    "url" : {
                      "type" : "string"
                    }
                  }
                },
                "urls" : {
                  "properties" : {
                    "display_url" : {
                      "type" : "string"
                    },
                    "expanded_url" : {
                      "type" : "string"
                    },
                    "indices" : {
                      "type" : "long"
                    },
                    "url" : {
                      "type" : "string"
                    }
                  }
                },
                "user_mentions" : {
                  "properties" : {
                    "id" : {
                      "type" : "long"
                    },
                    "id_str" : {
                      "type" : "string"
                    },
                    "indices" : {
                      "type" : "long"
                    },
                    "name" : {
                      "type" : "string"
                    },
                    "screen_name" : {
                      "type" : "string"
                    }
                  }
                }
              }
            },
            "favorite_count" : {
              "type" : "long"
            },
            "favorited" : {
              "type" : "boolean"
            },
            "geo" : {
              "properties" : {
                "coordinates" : {
                  "type" : "double"
                },
                "type" : {
                  "type" : "string"
                }
              }
            },
            "id" : {
              "type" : "long"
            },
            "id_str" : {
              "type" : "string"
            },
            "in_reply_to_screen_name" : {
              "type" : "string"
            },
            "in_reply_to_status_id" : {
              "type" : "long"
            },
            "in_reply_to_status_id_str" : {
              "type" : "string"
            },
            "in_reply_to_user_id" : {
              "type" : "long"
            },
            "in_reply_to_user_id_str" : {
              "type" : "string"
            },
            "lang" : {
              "type" : "string"
            },
            "place" : {
              "properties" : {
                "attributes" : {
                  "type" : "object"
                },
                "bounding_box" : {
                  "properties" : {
                    "coordinates" : {
                      "type" : "double"
                    },
                    "type" : {
                      "type" : "string"
                    }
                  }
                },
                "country" : {
                  "type" : "string"
                },
                "country_code" : {
                  "type" : "string"
                },
                "full_name" : {
                  "type" : "string"
                },
                "id" : {
                  "type" : "string"
                },
                "name" : {
                  "type" : "string"
                },
                "place_type" : {
                  "type" : "string"
                },
                "url" : {
                  "type" : "string"
                }
              }
            },
            "possibly_sensitive" : {
              "type" : "boolean"
            },
            "retweet_count" : {
              "type" : "long"
            },
            "retweeted" : {
              "type" : "boolean"
            },
            "source" : {
              "type" : "string"
            },
            "text" : {
              "type" : "string"
            },
            "truncated" : {
              "type" : "boolean"
            },
            "user" : {
              "properties" : {
                "contributors_enabled" : {
                  "type" : "boolean"
                },
                "created_at" : {
                  "type" : "date", "format" : "E MMM d HH:mm:ss Z yyyy", "locale": "en_EN"
                },
                "default_profile" : {
                  "type" : "boolean"
                },
                "default_profile_image" : {
                  "type" : "boolean"
                },
                "description" : {
                  "type" : "string"
                },
                "favourites_count" : {
                  "type" : "long"
                },
                "followers_count" : {
                  "type" : "long"
                },
                "friends_count" : {
                  "type" : "long"
                },
                "geo_enabled" : {
                  "type" : "boolean"
                },
                "id" : {
                  "type" : "long"
                },
                "id_str" : {
                  "type" : "string"
                },
                "is_translator" : {
                  "type" : "boolean"
                },
                "lang" : {
                  "type" : "string"
                },
                "listed_count" : {
                  "type" : "long"
                },
                "location" : {
                  "type" : "string"
                },
                "name" : {
                  "type" : "string"
                },
                "profile_background_color" : {
                  "type" : "string"
                },
                "profile_background_image_url" : {
                  "type" : "string"
                },
                "profile_background_image_url_https" : {
                  "type" : "string"
                },
                "profile_background_tile" : {
                  "type" : "boolean"
                },
                "profile_banner_url" : {
                  "type" : "string"
                },
                "profile_image_url" : {
                  "type" : "string"
                },
                "profile_image_url_https" : {
                  "type" : "string"
                },
                "profile_link_color" : {
                  "type" : "string"
                },
                "profile_sidebar_border_color" : {
                  "type" : "string"
                },
                "profile_sidebar_fill_color" : {
                  "type" : "string"
                },
                "profile_text_color" : {
                  "type" : "string"
                },
                "profile_use_background_image" : {
                  "type" : "boolean"
                },
                "protected" : {
                  "type" : "boolean"
                },
                "screen_name" : {
                  "type" : "string"
                },
                "statuses_count" : {
                  "type" : "long"
                },
                "time_zone" : {
                  "type" : "string"
                },
                "url" : {
                  "type" : "string"
                },
                "utc_offset" : {
                  "type" : "long"
                },
                "verified" : {
                  "type" : "boolean"
                }
              }
            }
          }
        },
        "source" : {
          "type" : "string"
        },
        "text" : {
          "type" : "string"
        },
        "truncated" : {
          "type" : "boolean"
        },
        "user" : {
          "properties" : {
            "contributors_enabled" : {
              "type" : "boolean"
            },
            "created_at" : {
              "type" : "date", "format" : "E MMM d HH:mm:ss Z yyyy", "locale": "en_EN"
            },
            "default_profile" : {
              "type" : "boolean"
            },
            "default_profile_image" : {
              "type" : "boolean"
            },
            "description" : {
              "type" : "string"
            },
            "favourites_count" : {
              "type" : "long"
            },
            "followers_count" : {
              "type" : "long"
            },
            "friends_count" : {
              "type" : "long"
            },
            "geo_enabled" : {
              "type" : "boolean"
            },
            "id" : {
              "type" : "long"
            },
            "id_str" : {
              "type" : "string"
            },
            "is_translator" : {
              "type" : "boolean"
            },
            "lang" : {
              "type" : "string"
            },
            "listed_count" : {
              "type" : "long"
            },
            "location" : {
              "type" : "string"
            },
            "name" : {
              "type" : "string"
            },
            "profile_background_color" : {
              "type" : "string"
            },
            "profile_background_image_url" : {
              "type" : "string"
            },
            "profile_background_image_url_https" : {
              "type" : "string"
            },
            "profile_background_tile" : {
              "type" : "boolean"
            },
            "profile_banner_url" : {
              "type" : "string"
            },
            "profile_image_url" : {
              "type" : "string"
            },
            "profile_image_url_https" : {
              "type" : "string"
            },
            "profile_link_color" : {
              "type" : "string"
            },
            "profile_sidebar_border_color" : {
              "type" : "string"
            },
            "profile_sidebar_fill_color" : {
              "type" : "string"
            },
            "profile_text_color" : {
              "type" : "string"
            },
            "profile_use_background_image" : {
              "type" : "boolean"
            },
            "protected" : {
              "type" : "boolean"
            },
            "screen_name" : {
              "type" : "string"
            },
            "statuses_count" : {
              "type" : "long"
            },
            "time_zone" : {
              "type" : "string"
            },
            "url" : {
              "type" : "string"
            },
            "utc_offset" : {
              "type" : "long"
            },
            "verified" : {
              "type" : "boolean"
            }
          }
        }
      }
    }
}'; echo

Geospatially tagged data

I would like to import all of the articles that have a geolocation, eg Latitude and Longtude. Any ideas on how to do this?

Failing .. status: 500

Hi,
I'm getting an error when trying to copy a large index to another (~800,000,000) documents. We've gotten it twice, and it crashes stream2es.

I've included below some of the information with the error. We are curious if when we restart stream2es, if it'll try to copy all the same documents over again (duplicate documents in the 2nd index), or if it's smart about what it copies?

Thanks

Details:

{:throwable #<ExceptionInfo: clj-http:status 500 {:object {:trace-redirects.......}

Reason: remoteTransportException[[server_hostname]][inet[/ip:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [139746]];.. max_score 0.0 hits: []

at clj_http.client$wrap_exceptions$fn__1034.invoke(client.clj:111)
at clj_http.client$wrap_accept$fn__1154.invoke(client.clj:380)
at clj_http.client$wrap_accept_encoding$fn__1160.invoke(client.clj:394)
at clj_http.client$wrap_content_type$fn__1149.invoke(client.clj:370)
at clj_http.client$wrap_form_param$fn__1198.invoke(client.clj:481)
at clj_http.client$wrap_param$fn__1216.invoke(client.clj:505)
at clj_http.client$wrap_method$fn__1193.invoke(client.clj.464)
at clj_http.client$wrap_cookies$fn__645.invoke(cookies.clj:118)
at clj_http.client$wrap_link$fn__675.invoke(links.clj:50)
at clj_http.client$wrap_unknown_host$fn__1225.invoke(client.clj:524)
at clj_http.client$get.doInvoke(client.clj:615)
at clojure.lang.RestFn.invoke(RestFn.java:423)
at stream2es.es$scroll_STAR_.invoke(es.clj:64)
at stream2es.es$scroll.invoke(es.clj:72)
at stream2es.es$scroll$fn__1307.invoke(es.clj:77)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:67)
at clojure.lang.RT.seq(RT.java:484)
at clojure.core$seq.invoke(core.clj:133)
at stream2es.stream.es$make_callback$fn__1904.invoke(es.clj:68)
at stream2es.main$stream_BANG_.invoke(main.clj:291)
at stream2es.main$main.invoke(main.clj:417)
at stream2es.main$_main.doInvoke(main.clj:435)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at stream2es.main.main(Unknown Source)

The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING"

I am trying to index Wikipedia using a local bz2 copy to a local elasticsearch. It ran for a long time correctly, but then had an exception like this:
The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING"

This is what I ran:
java -jar stream2es wiki --target http://localhost:9200/testwiki --log debug create index http://localhost:9200/testwiki --source /home/testuser/enwiki-latest-pages-articles.xml.bz2

image

Copy local index to https and basic auth target

Tried to upload a local index to an internet server with HTTPS/Basich Auth and get this error: stream error: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated

The certificate is self-signed (not official one). Any parameter to change it. Username and password was included in the URL.

Url looks like


./stream2es es --source http://localhost:9200/osm --target https://user:[email protected]/es/osm

Pls.note "/es" is an alias forwarded to ElasticSearch server not the index name (which is osm).

Is there any better way to copy an index to another server in different cluster?

consider document version in es -> es streaming

since ES index APIs support optimistic concurrency control using document versioning, it would be very useful to support this functionality when using stream2es to migrate indexes.

In particular, it would allow for idempotency when copying data incrementally (one could copy documents from an older index's data conditionally, only applying updates or deletes to the new index when no subsequent change had been already made to those documents there).

Out of memory error with attachments

I saw closed issue #15 about an out of memory error, but I just built stream2es from master, so this seems to be a different issue. I am using ES with the mapper attachments plugin. If I transfer my document type that doesn't contain attachments, stream2es does fine, but if I try to stream the document type which does contain attachments, or the whole index, I get an out of memory error, as seen below. I am using ES 1.1.0.

 $./stream2es- es --source http://localhost:9200/fbopen/opp_attachment --target http://localhost:9200/fbopen2/opp_attachment
stream es from http://localhost:9200/fbopen/opp_attachment to http://localhost:9200/fbopen2/opp_attachment
clojure.lang.ExceptionInfo: clj-http: status 500 {:object {:trace-redirects ["http://localhost:9200/_search/scroll"], :request-time 5074, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "58"}, :body "{\"error\":\"OutOfMemoryError[Java heap space]\",\"status\":500}"}, :environment {client #<client$wrap_output_coercion$fn__986 clj_http.client$wrap_output_coercion$fn__986@62897819>, req {:request-method :get, :url "http://localhost:9200/_search/scroll", :body "c2Nhbjs1OzYzMTY6V3hlZVcydC1TN3FmWnhJazlBZXdOZzs2MzE1Old4ZWVXMnQtUzdxZlp4SWs5QWV3Tmc7NjMxODpXeGVlVzJ0LVM3cWZaeElrOUFld05nOzYzMTk6V3hlZVcydC1TN3FmWnhJazlBZXdOZzs2MzE3Old4ZWVXMnQtUzdxZlp4SWs5QWV3Tmc7MTt0b3RhbF9oaXRzOjY1NTs=", :query-params {:scroll "15s"}}, map__908 {:trace-redirects ["http://localhost:9200/_search/scroll"], :request-time 5074, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "58"}, :body "{\"error\":\"OutOfMemoryError[Java heap space]\",\"status\":500}"}, resp {:trace-redirects ["http://localhost:9200/_search/scroll"], :request-time 5074, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "58"}, :body "{\"error\":\"OutOfMemoryError[Java heap space]\",\"status\":500}"}, status 500}}
        at clj_http.client$wrap_exceptions$fn__907.invoke(client.clj:111)
        at clj_http.client$wrap_accept$fn__1027.invoke(client.clj:380)
        at clj_http.client$wrap_accept_encoding$fn__1033.invoke(client.clj:394)
        at clj_http.client$wrap_content_type$fn__1022.invoke(client.clj:370)
        at clj_http.client$wrap_form_params$fn__1071.invoke(client.clj:481)
        at clj_http.client$wrap_nested_params$fn__1089.invoke(client.clj:505)
        at clj_http.client$wrap_method$fn__1066.invoke(client.clj:464)
        at clj_http.cookies$wrap_cookies$fn__518.invoke(cookies.clj:118)
        at clj_http.links$wrap_links$fn__548.invoke(links.clj:50)
        at clj_http.client$wrap_unknown_host$fn__1098.invoke(client.clj:524)
        at clj_http.client$get.doInvoke(client.clj:615)
        at clojure.lang.RestFn.invoke(RestFn.java:423)
        at stream2es.es$scroll_STAR_.invoke(es.clj:64)
        at stream2es.es$scroll.invoke(es.clj:72)
        at stream2es.es$scan.invoke(es.clj:96)
        at stream2es.stream.es$make_callback$fn__1771.invoke(es.clj:69)
        at stream2es.main$stream_BANG_.invoke(main.clj:299)
        at stream2es.main$main.invoke(main.clj:425)
        at stream2es.main$_main.doInvoke(main.clj:446)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at stream2es.main.main(Unknown Source)
stream error: clojure.lang.ExceptionInfo: clj-http: status 500 {:object {:trace-redirects ["http://localhost:9200/_search/scroll"], :request-time 5074, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "58"}, :body "{\"error\":\"OutOfMemoryError[Java heap space]\",\"status\":500}"}, :environment {client #<client$wrap_output_coercion$fn__986 clj_http.client$wrap_output_coercion$fn__986@62897819>, req {:request-method :get, :url "http://localhost:9200/_search/scroll", :body "c2Nhbjs1OzYzMTY6V3hlZVcydC1TN3FmWnhJazlBZXdOZzs2MzE1Old4ZWVXMnQtUzdxZlp4SWs5QWV3Tmc7NjMxODpXeGVlVzJ0LVM3cWZaeElrOUFld05nOzYzMTk6V3hlZVcydC1TN3FmWnhJazlBZXdOZzs2MzE3Old4ZWVXMnQtUzdxZlp4SWs5QWV3Tmc7MTt0b3RhbF9oaXRzOjY1NTs=", :query-params {:scroll "15s"}}, map__908 {:trace-redirects ["http://localhost:9200/_search/scroll"], :request-time 5074, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "58"}, :body "{\"error\":\"OutOfMemoryError[Java heap space]\",\"status\":500}"}, resp {:trace-redirects ["http://localhost:9200/_search/scroll"], :request-time 5074, :status 500, :headers {"content-type" "application/json; charset=UTF-8", "content-length" "58"}, :body "{\"error\":\"OutOfMemoryError[Java heap space]\",\"status\":500}"}, status 500}}
streamed 0 indexed 0 bytes xfer 0 errors null

Index templates are ignored

We were testing this out today and noticed that, despite having index templates available, these weren't used when stream2es created the index at start.
If we manually create the index via a curl and then populate it with stream2es, it obviously works.

Is this intended behaviour due to the --settings being "null", or are we missing something?

stdin import

If I use the es streamer to import data directly from another ES cluster, everything works fine, and document IDs are kept the same as from the original source. But if I try to import using the stdin streamer, and the JSON file has document IDs, the import fails with errors like:

MapperParsingException[failed to parse [_id]]; nested: MapperParsingException[Provided id [AUmQO1aas8DAEaQxci0-] does not match the content one [Zwof6Y]]

Support Elasticsearch Auth

I need stream2es to support the ES Auth plugin ( https://github.com/codelibs/elasticsearch-auth ) and want to program this myself, but I'm not sure how I should do it.
I think one possibility would be to add username and password parameters.

Another more generic solution is to add only one parameter and with this parameter one can expand the json payload.

For example, when you add the option --additional "token":"uineqlxvonrtueaiobud" the json which gets send looks like this: {[old json], "token":"uineqlxvonrtueaiobud"}

Which one of these solutions do you prefer?

[Twitter] index hashtags

Twitter provides hashtags in tweets under entities.hashtags array.
See entities twitter doc. For example:

{
    "text": "Loved #devnestSF"
    "entities": {
      "media": [
      ],
      "urls": [
      ],
      "user_mentions": [
      ],
      "hashtags": [
        "text": "devnestSF"
        "indices": [
          6,
          16
        ]
      ]
    }
}

I would love to have hashtags (not analyzed by default) to build live Kibana dashboards using that property too.

Really slow import of large batches?

Hi,

I am trying to use stream2es to import a wikipedia dump in an ElasticSearch cluster. I am trying to import the subset of Wikipedia articles corresponding to the "biology" category (about 300K bzip2 compressed).

Importing a really smale subset (--max-docs 5 or 10) works fine. However, when importing the full batch or a large subset (--max-docs 1000 for instance), it runs quite infinitely and seems to get stuck (running for more than 30 minutes, no significant CPU usage nor memory usage).

Do you know if this is normal, or what is happening?

Thanks

Code added to make stream death more pleasant obscures OOM exception

I ran into this issue trying to stream a not-terribly big index (~6GB) with an average document size of 2-3MB. stream2es version info:

./stream2es --version
2015-01-03T20:27:48.473-0700 INFO  stream2es 2014122282ace27

And here's a quick log snippet that illustrates the issue:

./stream2es es --scroll-size 50 --scroll-time 5m --source http://localhost:9200/myindex --target http://localhost:9200/myindex_v1 --log trace
2015-01-02T18:10:37.134-0700 DEBUG create index http://localhost:9200/myindex_v1
2015-01-02T18:10:37.057-0700 TRACE waiting for collectors
2015-01-02T18:10:37.278-0700 TRACE PUT 317021 bytes
2015-01-02T18:10:38.830-0700 DEBUG stream es from http://localhost:9200/myindex to http://localhost:9200/myindex_v1
java.lang.NullPointerException
        at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
        at java.util.regex.Matcher.reset(Matcher.java:308)
        at java.util.regex.Matcher.<init>(Matcher.java:228)
        at java.util.regex.Pattern.matcher(Pattern.java:1088)
        at clojure.core$re_matcher.invoke(core.clj:4386)
        at clojure.core$re_find.invoke(core.clj:4438)
        at stream2es.es$scroll_STAR_.invoke(es.clj:72)
        at stream2es.es$scroll.invoke(es.clj:79)
        at stream2es.es$scan.invoke(es.clj:103)
        at stream2es.stream.es$make_callback$fn__3008.invoke(es.clj:75)
        at stream2es.main$stream_BANG_.invoke(main.clj:241)
        at stream2es.main$main.invoke(main.clj:330)
        at stream2es.main$_main.doInvoke(main.clj:336)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at stream2es.main.main(Unknown Source)
2015-01-02T18:13:44.319-0700 ERROR unexpected exception: java.lang.NullPointerException
2015-01-02T18:13:44.468-0700 INFO  03:07.409 0.0d/s 0.0K/s (0.0mb) indexed 0 streamed 0 errors 0

After some investigation, I figured out that body was nil in scroll*, which of course caused the NPE trying to call re-find. I hacked in some more logging and the reason body was nil was that stream2es was OOMing (java.lang.OutOfMemoryError: GC overhead limit exceeded) trying to parse the first set of returned documents from the scroll. Lowering the scroll size, as suggested in #24, fixed that.

Long story short, it'd be nice if the exception handling, er, handled, this particular case, if only to save people time trying to figure out the real issue behind the NPE. My simple hack around this was just to add a nil check for body to the cond before the search-context-missing bits, and then throw the exception if body is nil, but my Clojure is rudimentary at best.

stream2es indexing of wikipedia fails at about 540K docs with IOException: unexpected end of stream

I'm trying to import wikipedia into an elasticsearch index using stream2es and getting the following error after indexing about 540K docs (I've re-run this multiple times on both the current wiki dump and older dumps - every time I run it results in the same error):

2015-05-24T15:45:29.300+0000 DEBUG 79:19.539 112.8d/s 648.6K/s 536993 682 3156312 0 881514
2015-05-24T15:45:33.597+0000 DEBUG 79:23.836 113.0d/s 648.6K/s 538112 1119 3165607 0 882684
java.io.IOException: unexpected end of stream
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:624)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:287)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:844)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:893)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.read0(CBZip2InputStream.java:210)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:178)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.read1(BufferedReader.java:210)
    at java.io.BufferedReader.read(BufferedReader.java:286)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1736)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1408)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2823)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:298)
    at stream2es.stream.wiki$fn__3723$fn__3724.invoke(wiki.clj:44)
    at stream2es.main$stream_BANG_.invoke(main.clj:241)
    at stream2es.main$main.invoke(main.clj:330)
    at stream2es.main$_main.doInvoke(main.clj:336)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)
2015-05-24T15:45:37.653+0000 ERROR unexpected exception: java.io.IOException: unexpected end of stream
2015-05-24T15:45:37.860+0000 INFO  79:28.099 112.9d/s 648.1K/s (3017.6mb) indexed 538112 streamed 539294 errors 0

I first tried this on the default wikipedia dump:
http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Just in case it was a problem with the lastest dump file, I also tried running using the previous months dump file as the --source parameter:
http://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2

And got the exact same error at about the exact same doc count (~540k docs).

Here is the command I use to run stream2es:

./stream2es wiki --target http://<ES cluster>/wikipedia --log debug --source http://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2

(ES cluster url redacted above)

I am running on an ubuntu AWS instance, here is my version of Java

$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

I am running stream2es on a separate AWS instance from elasticsearch (our elasticsearch is hosted by qbox on AWS so we dont have shell access)

Any thoughts on how to get this to complete?

XML document structures must start and end within the same entity

I ran on the local wikipedia dump enwiki-20151102-pages-articles-multistream.xml.bz2

./stream2es wiki --source [dir]/enwiki-20151102-pages-articles-multistream.xml.bz2

but got the error message:

[Fatal Error] :46:1: XML document structures must start and end within the same entity.
org.xml.sax.SAXParseException; lineNumber: 46; columnNumber: 1; XML document structures must start and end within the same entity.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
at stream2es.stream.wiki$fn__6625$fn__6626.invoke(wiki.clj:45)
at stream2es.main$stream_BANG_.invoke(main.clj:245)
at stream2es.main$main.invoke(main.clj:333)
at stream2es.main$_main.doInvoke(main.clj:339)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at stream2es.main.main(Unknown Source)

And I've tried other, simplewiki-20151102-pages-articles-multistream.xml.bz2, and simplewiki-20150901-pages-articles-multistream.xml.bz2

Same error occurs. Not sure how to fix it.

lein failure due to missing /etc/version.txt ?

Was attempting to open this project in eclipse and got the following error when running lein eclipse:

Was able to resolve it by creating /etc/version.txt

(env)mconlin-mbp:stream2es mconlin$ lein
java.lang.Exception: Error loading /Users/mconlin/workspaces/stream2es/project.clj
    at leiningen.core.project$read$fn__3256.invoke(project.clj:682)
    at leiningen.core.project$read.invoke(project.clj:679)
    at leiningen.core.project$read.invoke(project.clj:689)
    at leiningen.core.main$_main$fn__3025.invoke(main.clj:256)
    at leiningen.core.main$_main.doInvoke(main.clj:252)
    at clojure.lang.RestFn.invoke(RestFn.java:397)
    at clojure.lang.Var.invoke(Var.java:411)
    at clojure.lang.AFn.applyToHelper(AFn.java:159)
    at clojure.lang.Var.applyTo(Var.java:532)
    at clojure.core$apply.invoke(core.clj:617)
    at clojure.main$main_opt.invoke(main.clj:335)
    at clojure.main$main.doInvoke(main.clj:440)
    at clojure.lang.RestFn.invoke(RestFn.java:421)
    at clojure.lang.Var.invoke(Var.java:419)
    at clojure.lang.AFn.applyToHelper(AFn.java:163)
    at clojure.lang.Var.applyTo(Var.java:532)
    at clojure.main.main(main.java:37)
Caused by: java.io.FileNotFoundException: etc/version.txt (No such file or directory)

show the reason for errors

I'm getting this when trying to import data that's producing errors on ES side:

2015-05-19T23:45:24.592+0400 INFO  00:01,426 30,2d/s 81,4K/s (0,1mb) indexed 43 streamed 86 errors 43
2015-05-19T23:45:24.593+0400 ERROR finished with 43 errors

How do I see the actual errors?

bulk API compatible file

Standard bulk API expects 2 lines and I generated a file with that format. stream2es added all the 4 lines whereas only 2 records should be added.

DELETE /test
PUT /test

PUT /_bulk
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" }}
{ "field1" : "value1" }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }

Is there an option to specify this?

Option to move index settings and mapping

We find stream2es very useful and handy , especially for moving indices around.
But then once thing that we find hard is to copy the mapping by ourself.
It would be great if stream2es can do this also.
Something like 2 extra flag will do the work neatly

stream2es es --includeIndexSetting --includeMapping  \
     --source http://foo.local:9200/wiki \
     --target http://bar.local:9200/wiki2

stream2es should exit with a non-zero code on error

$ java -jar stream2es-b313650b-standalone.jar es \
    --source http://localhost:1234/bob \
    --target https://localhost/joe ; echo RC=$?

create index https://localhost:-1/joe
https://localhost/joe connection refused
streamed 0 indexed 0 bytes xfer 0 errors null
RC=0

Having it return zero tells shell scripts that all went according to plan, which makes bailing out on error with set -e (or even an if) problematic.

Obviously the localhost:-1 is a separate bug itself.

stream2es indexing of local wikipedia dump fails

I am getting the following error when attempting to ingest a local dump of the latest wikipedia. I am running ES 1.7.1 and stream2es 20150720170522978252e

[stream2es]$ ./stream2es wiki --max-docs 5 --source ./enwiki-latest-pages-articles1.xml.bz2
java.io.IOException: unexpected end of stream
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.bsGetBit(CBZip2InputStream.java:371)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:476)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:550)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:287)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.init(CBZip2InputStream.java:246)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.(CBZip2InputStream.java:148)
at org.elasticsearch.river.wikipedia.support.WikiXMLParser.getInputSource(WikiXMLParser.java:80)
at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
at stream2es.stream.wiki$fn__6612$fn__6613.invoke(wiki.clj:45)
at stream2es.main$stream_BANG_.invoke(main.clj:241)
at stream2es.main$main.invoke(main.clj:329)
at stream2es.main$_main.doInvoke(main.clj:335)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at stream2es.main.main(Unknown Source)
2015-09-11T11:13:32.937-0600 ERROR unexpected exception: java.io.IOException: unexpected end of stream
2015-09-11T11:13:33.056-0600 INFO 00:00.208 0.0d/s 0.0K/s (0.0mb) indexed 0 streamed 0 errors 0
[stream2es]$

[Twitter] add option to index raw tweets

It could be useful to be able to index all tweet fields using option like --raw.

It should send the full JSON to elasticsearch without the need of parsing it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.