es-amazon-s3-river's Introduction

es-amazon-s3-river

Amazon S3 river for Elasticsearch

This river plugin helps to index documents from Amazon S3 buckets.

WARNING: For the 0.0.1 release, you need to have the Attachment Plugin installed.

WARNING: Starting from 0.0.2, you no longer need the Attachment Plugin as we now use Tika directly, see issue #2.

Versions

Amazon S3 River Plugin   | ElasticSearch    | Attachment Plugin | Tika
master (1.6.1-SNAPSHOT)  | 1.6.x and 1.7.x  | No longer used    | 1.6
1.6.0                    | 1.6.x and 1.7.x  | No longer used    | 1.6
1.4.1                    | 1.4.x and 1.5.x  | No longer used    | 1.6
1.4.0                    | 1.4.x and 1.5.x  | No longer used    | 1.6
1.3.0                    | 1.3.x            | No longer used    | 1.4
1.2.0                    | 1.2.x            | No longer used    | 1.4
0.0.4                    | 1.0.x and 1.1.x  | No longer used    | 1.4
0.0.3                    | 1.0.0            | No longer used    | 1.4
0.0.2                    | 0.90.0           | No longer used    | 1.4
0.0.1                    | 0.90.0           | 1.7.0             | -


Getting Started

Installation

Just install it as a regular Elasticsearch plugin by typing:

$ bin/plugin --install com.github.lbroudoux.elasticsearch/amazon-s3-river/1.6.0

This will do the job...

-> Installing com.github.lbroudoux.elasticsearch/amazon-s3-river/1.6.0...
Trying http://download.elasticsearch.org/com.github.lbroudoux.elasticsearch/amazon-s3-river/amazon-s3-river-1.6.0.zip...
Trying http://search.maven.org/remotecontent?filepath=com/github/lbroudoux/elasticsearch/amazon-s3-river/1.6.0/amazon-s3-river-1.6.0.zip...
Downloading ......DONE
Installed amazon-s3-river

Get Amazon AWS credentials (accessKey and secretKey)

First, you need to log in to the Amazon AWS account owning the S3 bucket and then retrieve your security credentials by visiting this page.

Once done, note your accessKey and secretKey.

Creating an Amazon S3 river

First, we create an index to store our documents (optional):

$ curl -XPUT 'http://localhost:9200/mys3docs/' -d '{}'

We create the river with the following properties:

  • accessKey: AAAAAAAAAAAAAAAA
  • secretKey: BBBBBBBBBBBBBBBB
  • Amazon S3 bucket to index: myownbucket
  • Path prefix to index in this bucket: Work/ (this is optional; if specified, it should be an existing path with the trailing /)
  • Update rate: every 15 minutes (15 * 60 * 1000 = 900000 ms)
  • Get only docs like *.doc and *.pdf
  • Don't index *.zip and *.gz
$ curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket" : "myownbucket"
    "pathPrefix": "Work/",
    "update_rate": 900000,
    "includes": "*.doc,*.pdf",
    "excludes": "*.zip,*.gz"
  }
}'

By default, the river uses an index that has the same name as the river (mys3docs in the example above).
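
Once the first update cycle has run (after the update_rate delay), you can quickly check that documents are showing up by querying that index. A minimal sketch, assuming the default localhost setup used above:

$ curl -XGET 'http://localhost:9200/mys3docs/_search?pretty'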

From 0.0.2 version

The source_url of documents is now stored within the Elasticsearch index in order to let you access the whole document content later from your application (this is indeed a use case coming from Scrutmydocs).

By default, the plugin uses what is called the resourceUrl of an S3 bucket document. If the document has been made public within S3, it can be accessed directly from your browser. If that's not the case, the stored url is intended to be used by a regular S3 client that has the allowed set of credentials to access the document.

Another option to easily distribute S3 content is to set up a Web proxy in front of S3, such as CloudFront (see Serving Private Content With CloudFront). In that latter case, you'll want to rewrite source_url by substituting the S3 part with your own host name. This plugin allows you to do that by specifying a download_host as a river property.
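
For instance, assuming a hypothetical CloudFront distribution reachable at docs.example.com (the host name is just an example), the river creation could look like this, download_host being the only addition to the earlier example:

$ curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket": "myownbucket",
    "pathPrefix": "Work/",
    "update_rate": 900000,
    "download_host": "docs.example.com"
  }
}'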

Specifying index options

Index options can be specified when creating an amazon-s3 river. The properties are the following:

  • Index name: "amazondocs"
  • Type of documents: "doc"
  • Size of an indexing bulk: 50 (default is 100)

You'll have to use them as follows when creating a river:

$ curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket" : "myownbucket"
    "pathPrefix": "Work/",
    "update_rate": 900000,
    "includes": "*.doc,*.pdf",
    "excludes": "*.zip,*.gz"
  },
  "index": {
    "index": "amazondocs",
    "type": "doc",
    "bulk_size": 50
  }
}'

Indexing Json documents

From 0.0.4 version

If you want to index Json files directly, without parsing them through Tika, you can set the json_support configuration option to true, like this:

$ curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket" : "myownbucket"
    "pathPrefix": "Jsons/",
    "update_rate": 900000,
    "json_support": true,
    "includes": "*.json"
  }
}'

Be sure to correctly use includes or excludes in your river configuration so that only Json documents are retrieved.

When json_support is enabled and you did not define a mapping before creating the river, the river will not automatically generate a mapping like the one mentioned below in the Advanced section; Elasticsearch will guess the mapping by itself.
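
If you want full control over how your Json documents are indexed, you can therefore define your own mapping before creating the river. A minimal sketch, assuming the mys3docs index already exists, that the river targets the default doc type, and that your Json documents contain title and createdDate fields (all of these are assumptions to adapt to your own documents):

$ curl -XPUT 'http://localhost:9200/mys3docs/_mapping/doc' -d '{
  "doc": {
    "properties": {
      "title":       { "type": "string" },
      "createdDate": { "type": "date", "format": "dateOptionalTime" }
    }
  }
}'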

Advanced

Management actions

If you need to stop a river, you can call the _s3 endpoint with your river name followed by the _stop command, like this:

GET _s3/mys3docs/_stop

To restart the river from the previous point, just call the corresponding _start endpoint:

GET _s3/mys3docs/_start
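
If you prefer plain curl over the shorthand notation above, the same calls look like this, assuming the default localhost setup used elsewhere in this README:

$ curl -XGET 'http://localhost:9200/_s3/mys3docs/_stop'
$ curl -XGET 'http://localhost:9200/_s3/mys3docs/_start'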

Extracted characters

From 1.4.1 version

By default this river plugin will extract only a limited number of characters (up to 100,000, which is the default allowed by Tika). This may not be sufficient for big documents. You can override this limit using the indexed_chars_ratio river option like this:

$ curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket" : "myownbucket"
    "pathPrefix": "Work/",
    "indexed_chars_ratio": 1
  }
}'

indexed_chars_ratio should be a positive double. Setting indexed_chars_ratio to x will take the file size, multiply it by x and pass the result to Tika as the character limit. Setting a value of 1 will extract exactly as many characters as the file size.

That means that a value of 0.8 will extract 20% fewer characters than the file size, while a value of 1.5 will extract 50% more characters than the file size (think compressed files).
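
As a quick worked example, following the rule above, for a 204800 byte (200 KB) file the limit passed to Tika would be:

indexed_chars_ratio = 0.8  ->  204800 * 0.8 = 163840 characters
indexed_chars_ratio = 1.0  ->  204800 * 1.0 = 204800 characters
indexed_chars_ratio = 1.5  ->  204800 * 1.5 = 307200 characters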

Note that Tika needs to allocate an in-memory data structure to extract text, so setting indexed_chars_ratio to a high value will require more memory!

Credential keys security and IAM Role

From 1.4.1 version

Passing accessKey and secretKey as river creation options is not always applicable depending on your context, and it may lead to exposure of these keys. From the 1.4.1 version, you now have the ability to:

  • either use the default credential retrieval process that checks system variables and configuration files,
  • or force the usage of an IAM Role if your nodes are running directly on an Amazon EC2 instance.

We recommend checking http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/java-dg-roles.html for an explanation of the credential retrieval process.
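
As an illustration, the default chain will (among other locations) pick up the standard AWS SDK environment variables if they are set for the user running the Elasticsearch process, so you could export them instead of putting the keys in the river settings:

export AWS_ACCESS_KEY_ID=AAAAAAAAAAAAAAAA
export AWS_SECRET_ACCESS_KEY=BBBBBBBBBBBBBBBB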

The behaviour of this river plugin is now the following:

  • accessKey and secretKey are no longer mandatory fields. If they are not provided at river creation, the river will just try to connect to your S3 bucket using the default provider chain,
  • a new option, use_EC2_IAM, can be set to true to force the usage of the EC2 IAM Role.

In practice, this leads to something like this when creating the river:

$ curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "name": "My Amazon S3 feed",
    "bucket" : "myownbucket"
    "pathPrefix": "Work/",
    "use_EC2_IAM": true
  }
}'
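
And if you rely on the default provider chain instead (keys coming from environment variables or configuration files as described above), you can simply omit the credentials. A minimal sketch:

$ curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "name": "My Amazon S3 feed",
    "bucket": "myownbucket",
    "pathPrefix": "Work/"
  }
}'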

Autogenerated mapping

When the river detects a new type, it automatically creates a mapping for this type.

{
  "doc" : {
    "properties" : {
      "title" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "modifiedDate" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "file" : {
        "type" : "attachment",
        "fields" : {
          "file" : {
            "type" : "string",
            "store" : "yes",
            "term_vector" : "with_positions_offsets"
          },
          "title" : {
            "type" : "string",
            "store" : "yes"
          }
        }
      }
    }
  }
}

From 0.0.2 version

We now use Tika directly instead of the mapper-attachment plugin, so the generated mapping is now the following.

{
  "doc" : {
    "properties" : {
      "title" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "modifiedDate" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "source_url" : {
        "type" : "string"
      },
      "file" : {
        "properties" : {
          "file" : {
            "type" : "string",
            "store" : "yes",
            "term_vector" : "with_positions_offsets"
          },
          "title" : {
            "type" : "string",
            "store" : "yes"
          }
        }
      }
    }
  }
}

License

This software is licensed under the Apache 2 license, quoted below.

Copyright 2013-2015 Laurent Broudoux

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

es-amazon-s3-river's People

Contributors

conarro, justincy, lbroudoux


es-amazon-s3-river's Issues

Replace mapper-attachment plugin by Tika

Copied from dadoonet/fscrawler#38

If we want to have finer control over the Json documents we generate, we need to remove the attachment type (the mapper-attachment plugin, that is) and replace it with Tika.

It will allow supporting features like "store-origin": false, which basically won't require encoding the content in Base64 but will only generate json values for the extracted content.

We probably need to keep the original format of the generated Json documents for backward compatibility.

Update to Elasticsearch 1.0.x

Elasticsearch is now released with a final 1.0.x version (and others, 1.1.x is already here). There's an API break (see #5), so an update should be done in order to have the Amazon S3 river on the latest ES version.

can't install plugin..

"ERROR: Could not find plugin descriptor 'plugin-descriptor.properties' in plugin zip"

Elastic 2.0 has been released, so the plugin needs an upgrade too.

Unable to index pdf files stored in S3 bucket

I am trying to index pdf files stored in an s3 bucket. I followed the instructions provided in the documentation. I am using ES 1.7.2 and the Sense plugin to execute queries. The code below does not give any error, but it is not indexing the s3 bucket.

PUT /_river/mys3docs/_meta
{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBB",
    "bucket": "upload_resumes",
    "pathPrefix": "/",
    "update_rate": 300000,
    "includes": "*.doc,*.pdf",
    "excludes": "*.zip,*.gz"
  }
}

After defining the river I get the following error message:

failed to create river [amazon-s3][mys3docs]
org.elasticsearch.common.settings.NoClassSettingsException: Failed to load class with value [amazon-s3]
at org.elasticsearch.river.RiverModule.loadTypeModule(RiverModule.java:87)
at org.elasticsearch.river.RiverModule.spawnModules(RiverModule.java:58)
at org.elasticsearch.common.inject.ModulesBuilder.add(ModulesBuilder.java:44)
at org.elasticsearch.river.RiversService.createRiver(RiversService.java:137)
at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:273)
at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:267)
at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:113)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: amazon-s3
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.elasticsearch.river.RiverModule.loadTypeModule(RiverModule.java:73)
... 9 more

elasticsearch 1.3.1 fails to initialize with plugin-installed

Restarting elasticsearch 1.3.1 server with the s3-river plugin installed throws the following error:

"[2014-07-30 18:27:54,301][ERROR][bootstrap                ] {1.3.1}: Initialization Failed ... - VerifyError[class com.github.lbroudoux.elasticsearch.river.s3.rest.S3ManageAction overrides final method handleRequest.(Lorg/elasticsearch/rest/RestRequest;Lorg/elasticsearch/rest/RestChannel;)V]"

Is there an incompatibility introduced with elasticsearch 1.3.1?

Ubuntu 14.04 x64
Oracle Java 1.8.0_11
Elasticsearch 1.3.1 (installed via the repository)
es-amazon-s3-river 0.0.3

Eventhandler to index on command

It would be helpful to sync ES with S3 at specific events. So for example one could send an action to the river, to start the sync process.

I've seen in your source code that you're doing a recurring sync process and applying the ListObject function several times until the truncated flag is set to false. This is a costly process for very large buckets. Therefore it would be nice to only sync ES when some new files have been added to the S3 bucket, and thus be able to trigger the S3 river manually from an application, instead of it recurring.

Adding document metadata extracted with Tika

Tika is used for parsing a document's content before indexing is done within Elasticsearch. Tika also offers abstract APIs for extracting various metadata from the parsed file depending on its content-type. This metadata could be added to the ES document in a specialized metadata field within the mapping.

How to start indexing?

Hi there. Thanks for writing this plugin; this is exactly what I've been wanting to use ElasticSearch for (I'm a complete newbie with it). I basically followed your example on the front page verbatim, but ES doesn't appear to be indexing anything yet. Do I need to do something to kick off the indexer?

Any advice?

elasticsearch-1.4.2
amazon-s3-river 1.3.0

Stops indexing after >300k documents

Hi there,

Hoping this is just a misconfiguration issue, but the river has just stopped indexing after 332k documents; the s3 bucket contains over 900k.

Was a clean install of elasticsearch with just this plugin and the cloud-aws plugin.

Upon starting, indexing seemed to progress well, but it has now stalled for days. I have tried restarting the service and the whole box, as well as adding more servers to the cluster to see if they would help, but no progress.

Nothing in the logs that is relevant, just the odd "can not index /example.doc".

Any more information I can give that would help diagnose?

Thanks,

Richard

File contents garbled

I'm using the 0.0.1 version of the plugin. I've configured ES to access my S3 bucket, which works just fine; it is able to access the docs and read them. However, when I try to see the file content (the file that was in S3) via Elasticsearch, I get garbled data. The data is in json format. What am I doing wrong?

Having problem using s3-river plugin with shield plugin.

I'm experimenting with the s3-river plugin together with the shield plugin.
As soon as I try to add an index and point to an s3 bucket, it throws the following stacktrace.

[2015-05-07 18:39:41,594][INFO ][cluster.metadata         ] [Kwannon] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings [mys3docs]
[2015-05-07 18:39:41,794][WARN ][river.routing            ] [Kwannon] failed to get/parse _meta for [mys3docs]
org.elasticsearch.shield.authz.AuthorizationException: action [indices:data/read/get] is unauthorized for user [__es_system_user]
    at org.elasticsearch.shield.authz.InternalAuthorizationService.denial(InternalAuthorizationService.java:247)
    at org.elasticsearch.shield.authz.InternalAuthorizationService.authorize(InternalAuthorizationService.java:108)
    at org.elasticsearch.shield.action.ShieldActionFilter.apply(ShieldActionFilter.java:112)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82)
    at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:98)
    at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:193)
    at org.elasticsearch.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:201)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
    at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:73)
    at org.elasticsearch.river.routing.RiversRouter.updateRiverClusterState(RiversRouter.java:137)
    at org.elasticsearch.river.routing.RiversRouter$1.execute(RiversRouter.java:108)
    at org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:110)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2015-05-07 18:39:41,871][WARN ][river.routing            ] [Kwannon] failed to get/parse _meta for [mys3docs]
org.elasticsearch.shield.authz.AuthorizationException: action [indices:data/read/get] is unauthorized for user [__es_system_user]
    at org.elasticsearch.shield.authz.InternalAuthorizationService.denial(InternalAuthorizationService.java:247)
    at org.elasticsearch.shield.authz.InternalAuthorizationService.authorize(InternalAuthorizationService.java:108)
    at org.elasticsearch.shield.action.ShieldActionFilter.apply(ShieldActionFilter.java:112)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82)
    at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:98)
    at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:193)
    at org.elasticsearch.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:201)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
    at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:73)
    at org.elasticsearch.river.routing.RiversRouter.updateRiverClusterState(RiversRouter.java:137)
    at org.elasticsearch.river.routing.RiversRouter$1.execute(RiversRouter.java:108)
    at org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:110)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2015-05-07 18:39:41,911][INFO ][cluster.metadata         ] [Kwannon] [_river] update_mapping [mys3docs] (dynamic)
[2015-05-07 18:39:41,913][WARN ][river.routing            ] [Kwannon] failed to get/parse _meta for [mys3docs]
org.elasticsearch.shield.authz.AuthorizationException: action [indices:data/read/get] is unauthorized for user [__es_system_user]
    at org.elasticsearch.shield.authz.InternalAuthorizationService.denial(InternalAuthorizationService.java:247)
    at org.elasticsearch.shield.authz.InternalAuthorizationService.authorize(InternalAuthorizationService.java:108)
    at org.elasticsearch.shield.action.ShieldActionFilter.apply(ShieldActionFilter.java:112)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82)
    at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:98)
    at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:193)
    at org.elasticsearch.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:201)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
    at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:73)
    at org.elasticsearch.river.routing.RiversRouter.updateRiverClusterState(RiversRouter.java:137)
    at org.elasticsearch.river.routing.RiversRouter$1.execute(RiversRouter.java:108)
    at org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:110)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Seems like elastic.co is aware of this bug (https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/8-YzteCnqzo/PohdZfU3PRcJ), but I wanted to know if there is an easy workaround for this.

Let me know if I can provide any additional information.

Update to Tika 1.6 / POI 3.11-beta2

There was an Apache POI release to address a security vulnerability. For some document types, this plugin will indirectly use POI. This plugin release forces an update to Apache POI and is a response to the POI issue. Previously, we did not have an explicit dependency on POI. For this issue, we should add a direct dependency and set it to the recent versions of POI. This will help users who might be unaware of these vulnerabilities avoid them.

You can read more about the reported issues in CVE-2014-3529 and CVE-2014-3574

We encourage anyone using this plugin with untrusted documents to update to the incoming 1.4.0 release.

Adding IAM Role support for EC2 instance

Instead of using access keys in the command, does s3-river support using the "IAM Role" function? This would allow s3-river to query the bucket without embedding the access key and secret key in the web request.

Thanks..

What ports need to be open for es-amazon-s3-river plugin to connect to s3 endpoint

I have two queries:

  1. What are the ports that need to be open for the plugin to talk to the S3 endpoint? The reason I ask is that in my setup I'm able to connect and index the docs in S3 when I initially activate the plugin (via the curl cmd). However, subsequently, if there are new files in the S3 bucket, the plugin is not able to get an update. I see the following continuously appearing in the logs:

[2014-04-17 19:03:14,102][INFO ][org.apache.http.impl.client.DefaultHttpClient] I/O exception (org.apache.http.NoHttpResponseException) caught when processing request: The target server failed to respond
[2014-04-17 19:03:14,102][INFO ][org.apache.http.impl.client.DefaultHttpClient] Retrying request

I checked with netstat and found that the connection to s3 happens over a range of ports (each time I checked it was a different port number); probably the plugin tries different ports when it fails to connect.
I don't have all ports open (and I don't want to have all ports open!), which is probably why updates and new files in S3 are not getting picked up by the plugin.
So, what ports need to be open in order for the plugin to be fully functional?

  2. If question #1 is valid, i.e. there is a set of ports that the plugin requires to be open, then is this port range configurable? I do not see any such option in the curl cmd currently.

ES 1.2.x and 1.3.x fails to start with amazon-s3-river installed

This plugin does not work with ES 1.2.x or 1.3.x.

With 1.3.x it gives the following error:

 VerifyError[class com.github.lbroudoux.elasticsearch.river.s3.rest.S3ManageAction overrides final method handleRequest.(Lorg/elasticsearch/rest/RestRequest;Lorg/elasticsearch/rest/RestChannel;)V]

With 1.2.x it gives :

ExecutionError[java.lang.NoClassDefFoundError: org/elasticsearch/rest/XContentThrowableRestResponse]

I didn't try 1.1

com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to sanitize XML document destined for handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler)

Having this error using the 1.1.2 ubuntu repo version on a very large bucket.

Exception for folder bucket-name is com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to sanitize XML document destined for handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler)

However I had previously used the 1.1.2 tar.gz version and that was working fine. I will continue testing and see if I can isolate the issue.

How to install master branch ?

Hi,

It's Hugo again, sorry for bothering you.
I tried many ways to install the master branch on my elasticsearch box, but no luck here.
Any hint would be appreciated ~~~

Is anything wrong with my input below?

root@elk01:~/old# /usr/share/elasticsearch/bin/plugin -install com.github.lbroudoux.elasticsearch/amazon-s3-river/master

-> Installing com.github.lbroudoux.elasticsearch/amazon-s3-river/master...
Trying http://download.elasticsearch.org/com.github.lbroudoux.elasticsearch/amazon-s3-river/amazon-s3-river-master.zip...
Trying http://search.maven.org/remotecontent?filepath=com/github/lbroudoux/elasticsearch/amazon-s3-river/master/amazon-s3-river-master.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/com/github/lbroudoux/elasticsearch/amazon-s3-river/master/amazon-s3-river-master.zip...
Trying https://github.com/com.github.lbroudoux.elasticsearch/amazon-s3-river/archive/vmaster.zip...
Trying https://github.com/com.github.lbroudoux.elasticsearch/amazon-s3-river/archive/master.zip...
Failed to install com.github.lbroudoux.elasticsearch/amazon-s3-river/master, reason: failed to download out of all possible locations..., use -verbose to get detailed information
root@elk01:~/old#

Use BulkProcessor feature

We want to use one single shared bulk processor instead of creating multiple bulkRequest instances. Such a processor offers many more options for controlling bulk flushing.

Update to Amazon SDK 1.6.x

The Amazon SDK used (version 1.4.4) is now old... Update it to a newer version according to the 0.0.3 milestone.

Failed to create the river with elasticsearch 1.0.2

[2014-04-16 02:44:17,067][WARN ][river ] [elk01] failed to create river [amazon-s3][mys3docs]
org.elasticsearch.common.inject.CreationException: Guice creation errors:

  1. Error injecting constructor, java.lang.IllegalArgumentException: Access key cannot be null.
    at com.github.lbroudoux.elasticsearch.river.s3.river.S3River.(Unknown Source)
    while locating com.github.lbroudoux.elasticsearch.river.s3.river.S3River
    while locating org.elasticsearch.river.River

1 error
at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:344)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:178)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:131)
at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
at org.elasticsearch.river.RiversService.createRiver(RiversService.java:140)
at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:275)
at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:269)
at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:93)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
Caused by: java.lang.IllegalArgumentException: Access key cannot be null.
at com.amazonaws.auth.BasicAWSCredentials.(BasicAWSCredentials.java:37)
at com.github.lbroudoux.elasticsearch.river.s3.connector.S3Connector.connectUserBucket(S3Connector.java:66)
at com.github.lbroudoux.elasticsearch.river.s3.river.S3River.(S3River.java:134)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:54)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
at org.elasticsearch.common.inject.FactoryProxy.get(FactoryProxy.java:52)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:200)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:830)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
... 10 more

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/W-_Do6sXDfw/T_NlbLkvTvsJ

index.type ignored?

First: I'm completely new to ES, so I might very well be missing something obvious, but when I start a river with index options, the "type" seems to be silently ignored:

curl -XPUT 'http://localhost:9200/_river/documents/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "XXXXXXXXXXXX",
    "secretKey": "YYYYYYYYYYYYY",
    "name": "Amazon S3 river",
    "bucket" : "BBBBBBBBBBB",
    "update_rate": 120000
  }
},
  "index": {
    "index": "documents",
    "type": "files",
    "bulk_size": 100
}'

Looking at http://localhost:9200/documents, this is what I see:

  {"_index":"documents","_type":"doc","_id":....

I would have expected "type":"files", not "type":"doc". What am I missing?
(It doesn't seem to matter if I have created the index documents beforehand or not, or whether it's empty or not)

NoClassSettingsException

I get the following error when trying to query on a s3 river: "NoClassSettingsException[Failed to load class with value [amazon-s3]]; nested: ClassNotFoundException[amazon-s3];. Anyone got any clues about this?

edit: Using version 0.0.3

Document size limited to 100,000 characters.

Hi, thank you so much for this plugin, it's fitting my usecase perfectly.
One problem I encountered is that files only get indexed up to 100,000 characters.
Please let me know if you have any clue where this limitation is coming from, thank you again!

Deleting records?

Hi Laurent,
Thank you very much for this great river plugin. We are evaluating using it for syncing our data via S3. We have mysql as our master data store. Another service builds data and exports it into json format to be saved into S3.

Your plugin then will be picking up these json files from S3 and import into ES.

I have the questions below, as I could not find answers in your README etc.

I have taken the parameters below as examples from RSRiver:

  1. Do you have a parameter "filename_as_id": true?
    --> This would solve my problem of mapping document IDs to mysql records. Currently I do not know how your river assigns document IDs. If this parameter is available, I would like to control these IDs by giving the document name as the desired ID.
  2. Do you have a parameter "remove_deleted": true?
    --> This would solve my problem of deleting files. E.g. when a record no longer needs to be in ES, I would like it to be removed. So, if I simply delete it from S3, will it be reflected in ES too?
  3. Finally, how are updates handled? If I update a file in S3, will your river plugin also pick that up and re-apply the values to the corresponding document in ES?

thank you for your time,
regards
Ali

Upgrade to Elasticsearch 1.2.x

Elasticsearch has been released on a 1.2.x version for some months now. There's an API break (see #18), so an update should be done in order to have the Amazon S3 river on the latest ES version.

disable parsing of file itself?

Hi,

I was wondering, is it possible to tell the river not to parse the files themselves? I just want to build an index of the objects without reading the objects themselves into ES.
