
elasticsearch-river-web's Introduction

(River Web is not kept in sync with the latest Elasticsearch. Fess is an enterprise search server that provides the same features as River Web. See Fess.)

Elasticsearch River Web

Overview

Elasticsearch River Web is a web crawler application for Elasticsearch. It crawls web sites and extracts their content using CSS queries. (As of version 1.5, River Web is no longer an Elasticsearch plugin.)

If you are looking for a full text search server, please see Fess.

Version

River Web | Tested on ES | Download
master    | 2.4.X        | Snapshot
2.4.0     | 2.4.0        | Download
2.0.2     | 2.3.1        | Download
2.0.1     | 2.2.0        | Download
2.0.0     | 2.1.2        | Download

For older versions, see README_ver1.md or README_ver1.5.md.

Issues/Questions

Please file an issue. (Japanese forum is here.)

Installation

Install River Web

Zip File

$ unzip elasticsearch-river-web-[VERSION].zip

Tar.GZ File

$ tar zxvf elasticsearch-river-web-[VERSION].tar.gz

Usage

Create Index To Store Crawl Data

An index for storing crawl data is needed before starting River Web. For example, to store data in "webindex/my_web", create the index as follows:

$ curl -XPUT 'localhost:9200/webindex' -d '
{  
  "settings":{  
    "index":{  
      "refresh_interval":"1s",
      "number_of_shards":"10",
      "number_of_replicas" : "0"
    }
  },
  "mappings":{  
    "my_web":{  
      "properties":{  
        "url":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "method":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "charSet":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "mimeType":{  
          "type":"string",
          "index":"not_analyzed"
        }
      }
    }
  }
}'

Feel free to add any properties other than the above if you need them.
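For example, if you plan to extract "title" and "body" properties (as in the crawl configuration below), you could declare them explicitly as well. This is only a sketch; the field names are examples and must match the property names you define in your crawl configuration:

$ curl -XPUT 'localhost:9200/webindex/_mapping/my_web' -d '
{
  "my_web":{
    "properties":{
      "title":{
        "type":"string"
      },
      "body":{
        "type":"string"
      }
    }
  }
}'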

Register Crawl Config Data

A crawling configuration is created by registering a document in the .river_web index, as shown below. This example crawls http://www.codelibs.org/ and http://fess.codelibs.org/.

$ curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://www.codelibs.org/", "http://fess.codelibs.org/"],
    "include_urls" : ["http://www.codelibs.org/.*", "http://fess.codelibs.org/.*"],
    "max_depth" : 3,
    "max_access_count" : 100,
    "num_of_thread" : 5,
    "interval" : 1000,
    "target" : [
      {
        "pattern" : {
          "url" : "http://www.codelibs.org/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "body"
          },
          "bodyAsHtml" : {
            "html" : "body"
          },
          "projects" : {
            "text" : "ul.nav-list li a",
            "isArray" : true
          }
        }
      },
      {
        "pattern" : {
          "url" : "http://fess.codelibs.org/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "body",
            "trimSpaces" : true
          },
          "menus" : {
            "text" : "ul.nav-list li a",
            "isArray" : true
          }
        }
      }
    ]
}'

The configuration is:

Property                      | Type    | Description
index                         | string  | Name of the index where crawled documents are stored.
type                          | string  | Name of the type where crawled documents are stored.
urls                          | array   | Starting URLs for crawling.
include_urls                  | array   | URL whitelist for crawling.
exclude_urls                  | array   | URL blacklist for crawling.
max_depth                     | int     | Maximum crawl depth.
max_access_count              | int     | Maximum number of documents to crawl.
num_of_thread                 | int     | Number of crawler threads.
interval                      | int     | Interval (ms) between document crawls.
incremental                   | boolean | Enable incremental crawling.
overwrite                     | boolean | Delete old documents with a duplicated URL.
user_agent                    | string  | User-agent name used when crawling.
robots_txt                    | boolean | Set to false to ignore robots.txt.
authentications               | object  | BASIC/DIGEST/NTLM authentication info.
target.pattern.url            | string  | URL pattern whose content is extracted by CSS query.
target.properties.name        | string  | "name" is used as the property name in the index.
target.properties.name.text   | string  | CSS query for the property value (stored as text).
target.properties.name.html   | string  | CSS query for the property value (stored as HTML).
target.properties.name.script | string  | Rewrite the property value by script (see "Rewrite a property value by Script").
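
The optional properties above can be combined as needed. The following is a sketch only (all values are illustrative) showing exclude_urls, incremental crawling, and a custom user agent together:

$ curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://www.codelibs.org/"],
    "include_urls" : ["http://www.codelibs.org/.*"],
    "exclude_urls" : ["http://www.codelibs.org/old/.*"],
    "max_depth" : 3,
    "max_access_count" : 100,
    "num_of_thread" : 5,
    "interval" : 1000,
    "incremental" : true,
    "overwrite" : true,
    "user_agent" : "MyCrawler/1.0",
    "robots_txt" : true,
    "target" : [
      {
        "pattern" : { "url" : "http://www.codelibs.org/.*", "mimeType" : "text/html" },
        "properties" : { "title" : { "text" : "title" } }
      }
    ]
}'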

Start Crawler

./bin/riverweb --config-id [config doc id] --cluster-name [Elasticsearch Cluster Name] --cleanup

For example,

./bin/riverweb --config-id my_web --cluster-name elasticsearch --cleanup
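
After the crawl completes, you can verify that documents were stored with a standard search request against the index and type from the example above:

$ curl -XGET 'localhost:9200/webindex/my_web/_search?pretty'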

Unregister Crawl Config Data

If you want to stop the crawler, kill the crawler process and then delete the config document as below:

$ curl -XDELETE 'localhost:9200/.river_web/config/my_web'

Examples

Full Text Search for Your Site (e.g. http://fess.codelibs.org/)

$ curl -XPUT 'localhost:9200/.river_web/config/fess_site' -d '{
    "index" : "webindex",
    "type" : "fess_site",
    "urls" : ["http://fess.codelibs.org/"],
    "include_urls" : ["http://fess.codelibs.org/.*"],
    "max_depth" : 3,
    "max_access_count" : 1000,
    "num_of_thread" : 5,
    "interval" : 1000,
    "target" : [
      {
        "pattern" : {
            "url" : "http://fess.codelibs.org/.*",
            "mimeType" : "text/html"
        },
        "properties" : {
            "title" : {
                "text" : "title"
            },
            "body" : {
                "text" : "body",
                "trimSpaces" : true
            }
        }
      }
    ]
}'

Aggregate a title/content from news.yahoo.com

$ curl -XPUT 'localhost:9200/.river_web/config/yahoo_site' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://news.yahoo.com/"],
    "include_urls" : ["http://news.yahoo.com/.*"],
    "max_depth" : 1,
    "max_access_count" : 10,
    "num_of_thread" : 3,
    "interval" : 3000,
    "user_agent" : "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",
    "target" : [
      {
        "pattern" : {
          "url" : "http://news.yahoo.com/video/.*html",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          }
        }
      },
      {
        "pattern" : {
          "url" : "http://news.yahoo.com/.*html",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "h1.headline"
          },
          "content" : {
            "text" : "section#mediacontentstory p"
          }
        }
      }
    ]
}'

(If news.yahoo.com changes its layout, the above example needs to be updated accordingly.)

Others

BASIC/DIGEST/NTLM authentication

River Web supports BASIC/DIGEST/NTLM authentication. Set the authentications object:

...
"num_of_thread" : 5,
"interval" : 1000,
"authentications":[
  {
    "scope": {
      "scheme":"BASIC"
    },
    "credentials": {
      "username":"testuser",
      "password":"secret"
    }
  }],
"target" : [
...

The configuration is:

Property                                | Type   | Description
authentications.scope.scheme            | string | BASIC, DIGEST or NTLM.
authentications.scope.host              | string | (Optional) Target hostname.
authentications.scope.port              | int    | (Optional) Port number.
authentications.scope.realm             | string | (Optional) Realm name.
authentications.credentials.username    | string | Username.
authentications.credentials.password    | string | Password.
authentications.credentials.workstation | string | (Optional) Workstation for NTLM.
authentications.credentials.domain      | string | (Optional) Domain for NTLM.

For example, to authenticate as an Active Directory user, the configuration is:

"authentications":[
  {
    "scope": {
      "scheme":"NTLM"
    },
    "credentials": {
      "domain":"your.ad.domain",
      "username":"taro",
      "password":"himitsu"
    }
  }],

Use attachment type

River Web supports the attachment type. For example, create a mapping with an attachment field:

curl -XPUT "localhost:9200/web/test/_mapping?pretty" -d '{
  "test" : {
    "properties" : {
...
      "my_attachment" : {
          "type" : "attachment",
          "fields" : {
            "file" : { "index" : "no" },
            "title" : { "store" : "yes" },
            "date" : { "store" : "yes" },
            "author" : { "store" : "yes" },
            "keywords" : { "store" : "yes" },
            "content_type" : { "store" : "yes" },
            "content_length" : { "store" : "yes" }
          }
      }
...

and then start the crawler. In the "properties" object, when the value of "type" is "attachment", the content of the crawled URL is stored as base64-encoded data.

curl -XPUT localhost:9200/.river_web/config/2 -d '{
      "index" : "web",
      "type" : "data",
      "urls" : "http://...",
...
      "target" : [
...
        {
          "settings" : {
            "html" : false
          },
          "pattern" : {
            "url" : "http://.../.*"
          },
          "properties" : {
            "my_attachment" : {
              "type" : "attachment"
            }
          }
        }
      ]
...

Use Multibyte Characters

An example for a Japanese environment is below. First, put the following configuration files into the conf directory of Elasticsearch.

$ cd $ES_HOME/conf    # ex. /etc/elasticsearch if using rpm package
$ sudo wget https://raw.github.com/codelibs/fess-server/master/src/tomcat/solr/core1/conf/mapping_ja.txt
$ sudo wget http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/collection1/conf/lang/stopwords_ja.txt 

and then create "webindex" index with analyzers for Japanese. (If you want to use uni-gram, remove cjk_bigram in filter)

$ curl -XPUT "localhost:9200/webindex" -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "type" : "custom",
          "char_filter" : ["mappingJa"],
          "tokenizer" : "standard",
          "filter" : ["word_delimiter", "lowercase", "cjk_width", "cjk_bigram"]
        }
      },
      "char_filter" : {
        "mappingJa": {
          "type" : "mapping",
          "mappings_path" : "mapping_ja.txt"
        }
      },
      "filter" : {
        "stopJa" : {
          "type" : "stop",
          "stopwords_path" : "stopwords_ja.txt"
        }
      }
    }
  }
}'
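
To check that the analyzer behaves as expected, you can run the _analyze API against the new index (the sample text here is arbitrary):

$ curl -XGET 'localhost:9200/webindex/_analyze?analyzer=default&text=全文検索サーバー&pretty'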

Rewrite a property value by Script

River Web allows you to rewrite crawled data with Java's ScriptEngine ("javascript" is available). In the "properties" object, add a "script" value to the property you want to rewrite.

...
        "properties" : {
...
          "flag" : {
            "text" : "body",
            "script" : "value.indexOf('Elasticsearch') > 0 ? 'yes' : 'no';"
          },

In the above example, if the text of the body element contains "Elasticsearch", the "flag" property is set to "yes"; otherwise it is set to "no".
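
As another sketch, a script can also normalize a value. Assuming value is exposed as a string (as in the example above), the following would trim and lower-case an extracted title:

        "properties" : {
          "title" : {
            "text" : "title",
            "script" : "value.trim().toLowerCase();"
          }
        }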

Use HTTP proxy

Put "proxy" property in "crawl" property.

curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
...
        "proxy" : {
          "host" : "proxy.server.com",
          "port" : 8080
        },

Specify the next URLs to crawl

If the "isChildUrl" property is set to true, the property values are used as the next URLs to crawl.

...
    "target" : [
      {
...
        "properties" : {
          "childUrl" : {
            "value" : ["http://fess.codelibs.org/","http://fess.codelibs.org/ja/"],
            "isArray" : true,
            "isChildUrl" : true
          },

Intercept start/execute/finish/close actions

You can insert scripts that run when the crawler executes (execute) and finishes (finish). To insert scripts, add a "script" property as below:

curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "script":{
      "execute":"your script...",
      "finish":"your script..."
    },
    ...
    ...

FAQ

What does "No scraping rule." mean?

In a crawl configuration, "urls" lists the starting URLs, "include_urls" filters which URLs are crawled, and "target.pattern.url" is the rule for storing extracted web data. If a crawled URL does not match any "target.pattern.url", you will see this message; it means the crawled URL has no extraction rule.
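
For example, if you crawl http://www.codelibs.org/ but only define a target pattern for http://fess.codelibs.org/.*, every codelibs.org page produces this message. A sketch of a fix is to add a target whose pattern covers the remaining crawled URLs:

    "target" : [
      ...
      {
        "pattern" : {
          "url" : "http://www.codelibs.org/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          }
        }
      }
    ]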

How to extract an attribute of meta tag

For example, if you want to grab the content of the description meta tag, the configuration is:

...
"target" : [
...
  "properties" : {
...
    "meta" : {
      "attr" : "meta[name=description]",
      "args" : [ "content" ]
    },

Incremental crawling does not work?

"url" field needs to be "not_analyzed" in a mapping of your stored index. See Create Index To Store Crawl Data.

Where is crawled data stored?

Crawled data is stored in the ".s2robot" index during crawling, the data extracted from it is stored in the index specified by your configuration, and the data in the ".s2robot" index is removed when the crawler finishes.
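
While a crawl is running, a simple way to watch the intermediate data is to count the documents in the ".s2robot" index (the index name is taken from the answer above):

$ curl -XGET 'localhost:9200/.s2robot/_count?pretty'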

elasticsearch-river-web's People

Contributors

codelibsbuild, johtani, keiichiw, marevol

elasticsearch-river-web's Issues

NoClassSettingsException[Failed to load class with value [web]]

Hi there,

I've just created a brand new CentOS VM (v6), and installed Elasticsearch v1.0.0RC2 and elasticsearch-river-web v1.1.0 as per the instructions.

I then went to set up my crawl by running the following:

# create robot
curl -XPUT 'http://localhost:9200:443/robot/'

# Create Index
curl -XPUT "http://localhost:9200:443/compassion_uat/"

# create the duplicate mapping index
curl -XPUT "http://localhost:9200:443/compassion_uat/compassion_web/_mapping/" -d '
{
  "compassion_web" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      }
    ]
  }
}
'


# create the crawler
curl -XPUT 'http://localhost:9200:443/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
                "incremental" : true,
        "overwrite" : true,
                "robotsTxt" : false,
                "userAgent" : "bingbot",
        "target" : [        
                  {
            "pattern" : {
              "url" : "https://compassionau.custhelp.com/app/answers/detail/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1#rn_Summary"
              },
              "body" : {
                "text" : "div#rn_AnswerText",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/2 * * * * ?"
    }
}
'

After doing this I cannot see any documents appearing in the index, so I have looked at the _river index and can see the following error:

NoClassSettingsException[Failed to load class with value [web]]; nested: ClassNotFoundException[web];

Have I missed a step?

Thanks,
Tim.

Unable to load river-web plugin using Eclipse, getting the following error

Caused by: org.elasticsearch.common.inject.CreationException: Guice creation errors:

  1. Error injecting constructor, org.seasar.framework.exception.ResourceNotFoundRuntimeException: [ESSR0055]app.dicon
    at org.codelibs.elasticsearch.web.service.S2ContainerService.(Unknown Source)
    while locating org.codelibs.elasticsearch.web.service.S2ContainerService

1 error
at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:344)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:178)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70)
at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:59)

How to check the crawler status

(Asked in Japanese.)

Is it possible to get the status of a configured Web River (crawling in progress, crawling completed, etc.)? Thank you in advance.

How to prioritize search results based on type

I have many types in my index, for example a type for downloads, for resources, for main ASPX pages, and for case studies. When the user types "manuals", "user manuals", or "pdfs", the search should prioritize results so that PDFs and manuals are shown first. If any other keyword such as "downloads" is used, the results should come from the downloads type. Can anyone help?

Thanks in advance.

Unable to crawl from directory

I am unable to crawl the PDF files located on another web server. The scenario is that the PDF folder is accessible through FTP but not HTTP. I need to give the full URL of a PDF file (e.g. http://xyz.com/pdf/search.pdf), but I want to crawl from the folder itself. How can I crawl these files through Elasticsearch?

Thanks in advance.

How to install the latest source code?

I am trying to update the river to the latest version using

/bin/plugin --install codelibs/elasticsearch-river-web

I want to be sure I have the latest version (otherwise it installs from the Maven repository, which I believe is not the latest master version).

How to index text inside <div> tags

Hi,

Can anyone help me with indexing text between particular <div> tags, something like:

<div data-canvas-width="125.304" data-font-name="g_font_580_0" data-angle="0" style="font-size: 24px; font-family: sans-serif; left: 64px; top: 172px; transform: rotate(0deg) scale(1.00243, 1); transform-origin: 0% 0% 0px;" dir="ltr">Automotive</div>

This is to index some content in pdf files as per my requirement.

Thanks In Advance,
Srinivas

Crawling Authenticated Sites

Hi,

Is there a way to crawl authenticated websites using this plugin? If yes, can you guide me on how to achieve it? The site could be either NTLM authenticated or might use forms-based authentication.

Script support

This feature rewrites a stored value before indexing. Therefore, you can replace crawled data with something else using MVEL.

Question: Boilerpipe?

Does the crawler use boilerpipe? I have seen that it is in the dependencies, however I couldn't see where it is used. What is the reason for including it?

Thanks

How can I get attribute src of the image tag?

For example, I have this HTML:

<div id="thumb">
  <a href="#"><img src="http://example.com/test.jpg" /></a>
</div>

I want to get the src of the image and save it.

I tried this way but it did not work.

{
"pattern": {
 .....        
  "properties":{
    "image_src": {
      "attr": "img[src]",
      "args": ["div#thumb a"],
      "trimSpaces": true
    }
  }
}

Can anyone tell me how I can get the src?

Thank you.

Possible enhancement request: Duplicates

I have a few situations where the same document is indexed twice because it has two different parentUrls. Is it possible to prevent this? It would be nice if I could provide duplicate exclusion rules. For example, if the md5 of properties body + title + language is the same for an existing document, ignore it.

I realize this would increase the indexing time as you would basically have to do a search first but is something like this possible? Or is there a recommended approach for managing this common situation? Maybe I'm missing an option in ElasticSearch itself.

The fresh install using the tutorial from the README file doesn't work

I am using the latest version of ES as of today with a fresh install. The installation of the river also worked fine. However, the scraping (or crawling) does not start. I followed the instructions in the README file, but no luck. Of course I was careful with the cron job, and also used "0 0 * * *?" (which I think means start the crawling right now). I had luck with the Yahoo example, but only 5-6 links were extracted. I have tested the scraping with different URLs. I can't see what is going on (which page is to be crawled and so on); I only get "scheduled". Here is the log I have taken from the Yahoo example. After receiving this, the river stops. I have no luck using the river to crawl other sites. Any hints?

[2014-04-21 19:00:29,600][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/comics/
[2014-04-21 19:00:29,833][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:30,580][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:30,764][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/abc-news/
[2014-04-21 19:00:31,712][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:36,438][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/originals/
[2014-04-21 19:00:36,455][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/yahoo_news_photos/
[2014-04-21 19:00:36,457][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/katie-couric-aereo-tv-supreme-court-212342689.html
[2014-04-21 19:00:37,284][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:37,531][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:39,906][INFO ][cluster.metadata ] [Black Bolt] [webindex] update_mapping yahoo_com
[2014-04-21 19:00:41,241][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/boston-marathon-bombing/
[2014-04-21 19:00:41,247][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/u-says-examining-toxic-chemical-syria-172200500.html
[2014-04-21 19:00:42,402][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:43,176][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/syria-elections-set-june-3-amid-civil-war-180035620.html
[2014-04-21 19:00:43,255][INFO ][cluster.metadata ] [Black Bolt] [webindex] update_mapping yahoo_com
[2014-04-21 19:00:46,463][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/boy-scouts-shutdown-troop-for-refusing-to-banish-gay-scoutmaster-171244503.html
[2014-04-21 19:00:47,523][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/obama-plans-clemency-for-hundreds-of-drug-offenders--162714911.html
No scraping rule.

One time crawling

If no "cron" property, a crawler starts immediately(actually, 1 min later) and then unregister it after the crawling is completed.

How to set crawl URLs dynamically

(Asked in Japanese.)

Is it possible to register a specific river in advance, register the URLs to crawl into robot/queue from another application, and have those specific URLs crawled dynamically?

Specify crawled urls in properties

If you set "isChildUrl" property to true, crawled urls are used by the property value.

"crawl" : {
...
    "target" : [
      {
...
        "properties" : {
          "childUrl" : {
            "value" : ["http://fess.codelibs.org/","http://fess.codelibs.org/ja/"],
            "isArray" : true,
            "isChildUrl" : true
          },

Not all pages being crawled

Hi @marevol,

Thanks for the help over the last few days - it is really appreciated!

I have managed to get the crawling working across my two sites, however I'm noticing that not all the pages are being crawled, which is quite strange.

There are pages within my primary navigation that are being skipped altogether, even though they appear right beside another that is being crawled?

I have left the crawler to run overnight but it hasn't yet discovered these pages?

I created the crawler (after setting up the other indexes) by:

curl -XPUT 'http://localhost:9200/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/", "http://uat.compassiondev.net.au/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*", "http://uat.compassiondev.net.au/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
                "incremental" : true,
        "overwrite" : true,
                "userAgent" : "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Elasticsearch River Web/1.1.0",
        "target" : [        
                  {
            "pattern" : {
              "url" : "https://compassionau.custhelp.com/app/answers/detail/a_id/[0-9]*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1#rn_Summary"
              },
              "body" : {
                "text" : "div#rn_AnswerText",
                "trimSpaces" : true
              }
            }
          },
                    {
            "pattern" : {
              "url" : "http://uat.compassiondev.net.au/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1",
                                "trimSpaces": true
              },
              "body" : {
                "text" : "div#main",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/15 * * * * ?"
    }
}

Is there something I could have missed?

I notice that in one of your examples on the main wiki page you add

"menus" : {
                "text" : "ul.nav-list li a",
                "isArray" : true
              }

To the "properties" for one of your sites, could that have something to do with it, or is it unrelated?

Meta tag crawling issue.

I want to crawl the meta tag 'name' and 'content' attribute values. I am able to crawl the head tag text and HTML, but not just the meta tag attributes. Can anyone help me out?

Thanks in advance.

Praveen.

Getting junk content when reading the indexed pdf files through river-web

I have indexed PDF documents using river-web and it is showing junk content, not even HTML. For reference:

%PDF-1.4 % 379 0 obj <> endobj 405 0 obj <>/Filter/FlateDecode/ID[<0C462C2110F28740BFF2829978DDE58A><16AE43F1074EB84F96B5E99D3DB91FB7>]/Index[379 38]/Info 378 0 R/Length 125/Prev 330619/Root 380 0 R/Size 417/Type/XRef/W[1 3 1]>>stream ... [remainder of raw PDF binary data omitted]

Can you pls help me out.

Thanks in advance.

HTTP Proxy support

River Web supports crawling content through an HTTP proxy.
An example configuration is:

curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
    "type" : "web",
    "crawl" : {
...
        "proxy" : {
          "host" : "proxy.server.com",
          "port" : 8080
        },

It exceeds maxAccessCount

The crawler exceeds the maxAccessCount defined in the config. For example, I limited it to 500 pages, however it crawled about 770 web pages. Is this a bug or a feature?

Narrow Your Search Feature in Elasticsearch

Hi,

How can I implement the Narrow Your Search feature, as in GSA? I tried the completion suggester and am getting the following error:
index: metaindex
shard: 4
reason: BroadcastShardOperationFailedException[[metaindex][4] ]; nested: ElasticsearchException[failed to execute suggest]; nested: ElasticsearchException[Field [body] is not a completion suggest field];

Can anyone pls help me.

Thanks in advance.

meta tags

I don't see a way to grab and map certain meta tags to a property from the crawled html page. Is that an implemented feature?

Strange behavior between the robot index and my index when crawling a large site

I have started a crawl which I am monitoring right now, with about 10 threads and a 200 ms interval. The thing is, the robot index has 25k documents, which is way more than my crawl index at about 9,200 documents. I know the robot index handles the crawling URL list; however, it seems strange that pushing the data into ES is this slow.

By the way, CPU usage peaks at 80-90%.

Duplicated URLs

When I checked my URL list, I saw that the same URLs are indexed with different _ids. The pages are the same. I have set:

"maxDepth": 7,
"maxAccessCount": 500,
"numOfThread": 10,
"interval": 200,
"incremental": true,
"overwrite": true,

Cannot create index?

Hi there,

I've been using this plugin now for a few weeks with no issues (I'm running version 1.0.1) until I decided a few days ago to remove all my indexes and create new ones again from scratch.

Unfortunately now I can't seem to create my crawler indexes. I run the appropriate curl command to create the index and I receive the {"ok":true...} JSON response, but when I try to query the index I receive an IndexMissingException.

The process I'm following is as follows:

a. Install robot index (as per instructions):

curl -XPUT '192.168.1.26:9200/robot/'

b. I then attempt to create an index using:

curl -XPUT '192.168.1.26:9200/_river/my_web/_meta' -d "{
    \"type\" : \"web\",
    \"crawl\" : {
        \"index\" : \"compassion_test\",
        \"url\" : [\"http://uat.compassiondev.net.au/\"],
        \"includeFilter\" : [\"http://uat.compassiondev.net.au/.*\"],
        \"maxDepth\" : 3,
        \"maxAccessCount\" : 100,
        \"numOfThread\" : 5,
        \"interval\" : 1000,
        \"overwrite\" : true,
        \"target\" : [
          {
            \"pattern\" : {
              \"url\" : \"http://uat.compassiondev.net.au/.*\",
              \"mimeType\" : \"text/html\"
            },
            \"properties\" : {
              \"title\" : {
                \"text\" : \"title\"
              },
              \"body\" : {
                \"text\" : \"div#page_content\",
                \"trimSpaces\" : true
              }
            }
          }
        ]
    }
}"

I receive the following json response:

{"ok":true,"_index":"_river","_type":"my_web","_id":"_meta","_version":1}

But the index doesn't seem to exist (I receive the exception mentioned above)...

Is there something that I've missed? Any help would be greatly appreciated. Thanks!

Facet issue

{
  "query" : { "query_string" : { "query" : "ck3" } },
  "facets" : {
    "tags" : { "terms" : { "fields" : ["metakey", "metaprod", "metasol", "metares"] } }
  }
}

'Printers and Media' is a single value in the meta tag content attribute.
When I run this query, the terms facet splits it into individual words, so for 'Printers and Media' it gives me:
{
term:"printers",
count:280
},
{
term:"and",
count : 300
}
{
term:"media",
count:100
}

But I need:
{
term:"Printers and Media"
count:200
}

What changes do I need to make in the query to get this? Please suggest.

Thanks in advance.

Duplicated contents of different URLs

When the content is the same although the URLs differ (like content that belongs to two or more categories), is there any setting that merges the URLs and stores them as one document?

Updated schedule does not take effect

When I update the config file with a new river config containing an updated schedule time, the river does not pick up the new time and the schedule does not start. However, if I restart ES, it starts automatically.

"schedule": {
"cron": "0 14 4 * * ?"
}
