
elasticsearch-river-web's Introduction

(River Web is not kept in sync with the latest Elasticsearch. Fess is an enterprise search server that provides the same features as River Web. See Fess.)

Elasticsearch River Web

Overview

Elasticsearch River Web is a web crawler application for Elasticsearch. It crawls web sites and extracts their content using CSS queries. (As of version 1.5, River Web is no longer an Elasticsearch plugin.)

If you are looking for a full text search server, please see Fess.

Version

River Web | Tested on ES | Download
master    | 2.4.X        | Snapshot
2.4.0     | 2.4.0        | Download
2.0.2     | 2.3.1        | Download
2.0.1     | 2.2.0        | Download
2.0.0     | 2.1.2        | Download

For older versions, see README_ver1.md or README_ver1.5.md.

Issues/Questions

Please file an issue. (Japanese forum is here.)

Installation

Install River Web

Zip File

$ unzip elasticsearch-river-web-[VERSION].zip

Tar.GZ File

$ tar zxvf elasticsearch-river-web-[VERSION].tar.gz

Usage

Create Index To Store Crawl Data

An index for storing crawl data is needed before starting River Web. For example, to store data in "webindex/my_web", create the index as follows:

$ curl -XPUT 'localhost:9200/webindex' -d '
{  
  "settings":{  
    "index":{  
      "refresh_interval":"1s",
      "number_of_shards":"10",
      "number_of_replicas" : "0"
    }
  },
  "mappings":{  
    "my_web":{  
      "properties":{  
        "url":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "method":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "charSet":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "mimeType":{  
          "type":"string",
          "index":"not_analyzed"
        }
      }
    }
  }
}'

Feel free to add any properties other than the above if you need them.
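For example, if you plan to extract "title" and "body" properties (as in the crawl configuration below), you could declare them explicitly as well. This is only a sketch; the field names are examples and must match the property names you define in your crawl configuration:

$ curl -XPUT 'localhost:9200/webindex/_mapping/my_web' -d '
{
  "my_web":{
    "properties":{
      "title":{
        "type":"string"
      },
      "body":{
        "type":"string"
      }
    }
  }
}'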

Register Crawl Config Data

A crawling configuration is created by registering a document in the .river_web index, as shown below. This example crawls http://www.codelibs.org/ and http://fess.codelibs.org/.

$ curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://www.codelibs.org/", "http://fess.codelibs.org/"],
    "include_urls" : ["http://www.codelibs.org/.*", "http://fess.codelibs.org/.*"],
    "max_depth" : 3,
    "max_access_count" : 100,
    "num_of_thread" : 5,
    "interval" : 1000,
    "target" : [
      {
        "pattern" : {
          "url" : "http://www.codelibs.org/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "body"
          },
          "bodyAsHtml" : {
            "html" : "body"
          },
          "projects" : {
            "text" : "ul.nav-list li a",
            "isArray" : true
          }
        }
      },
      {
        "pattern" : {
          "url" : "http://fess.codelibs.org/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "body",
            "trimSpaces" : true
          },
          "menus" : {
            "text" : "ul.nav-list li a",
            "isArray" : true
          }
        }
      }
    ]
}'

The configuration is:

Property                      | Type    | Description
index                         | string  | Name of the index where crawled documents are stored.
type                          | string  | Name of the type where crawled documents are stored.
urls                          | array   | Starting URLs for crawling.
include_urls                  | array   | URL whitelist for crawling.
exclude_urls                  | array   | URL blacklist for crawling.
max_depth                     | int     | Maximum crawl depth.
max_access_count              | int     | Maximum number of documents to crawl.
num_of_thread                 | int     | Number of crawler threads.
interval                      | int     | Interval (ms) between document crawls.
incremental                   | boolean | Enable incremental crawling.
overwrite                     | boolean | Delete old documents with a duplicated URL.
user_agent                    | string  | User-agent name used when crawling.
robots_txt                    | boolean | Set to false to ignore robots.txt.
authentications               | object  | BASIC/DIGEST/NTLM authentication info.
target.pattern.url            | string  | URL pattern whose content is extracted by CSS query.
target.properties.name        | string  | "name" is used as the property name in the index.
target.properties.name.text   | string  | CSS query for the property value (stored as text).
target.properties.name.html   | string  | CSS query for the property value (stored as HTML).
target.properties.name.script | string  | Rewrite the property value by script (see "Rewrite a property value by Script").
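
The optional properties above can be combined as needed. The following is a sketch only (all values are illustrative) showing exclude_urls, incremental crawling, and a custom user agent together:

$ curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://www.codelibs.org/"],
    "include_urls" : ["http://www.codelibs.org/.*"],
    "exclude_urls" : ["http://www.codelibs.org/old/.*"],
    "max_depth" : 3,
    "max_access_count" : 100,
    "num_of_thread" : 5,
    "interval" : 1000,
    "incremental" : true,
    "overwrite" : true,
    "user_agent" : "MyCrawler/1.0",
    "robots_txt" : true,
    "target" : [
      {
        "pattern" : { "url" : "http://www.codelibs.org/.*", "mimeType" : "text/html" },
        "properties" : { "title" : { "text" : "title" } }
      }
    ]
}'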

Start Crawler

./bin/riverweb --config-id [config doc id] --cluster-name [Elasticsearch Cluster Name] --cleanup

For example,

./bin/riverweb --config-id my_web --cluster-name elasticsearch --cleanup
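
After the crawl completes, you can verify that documents were stored with a standard search request against the index and type from the example above:

$ curl -XGET 'localhost:9200/webindex/my_web/_search?pretty'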

Unregister Crawl Config Data

If you want to stop the crawler, kill the crawler process and then delete the config document as below:

$ curl -XDELETE 'localhost:9200/.river_web/config/my_web'

Examples

Full Text Search for Your Site (e.g. http://fess.codelibs.org/)

$ curl -XPUT 'localhost:9200/.river_web/config/fess_site' -d '{
    "index" : "webindex",
    "type" : "fess_site",
    "urls" : ["http://fess.codelibs.org/"],
    "include_urls" : ["http://fess.codelibs.org/.*"],
    "max_depth" : 3,
    "max_access_count" : 1000,
    "num_of_thread" : 5,
    "interval" : 1000,
    "target" : [
      {
        "pattern" : {
            "url" : "http://fess.codelibs.org/.*",
            "mimeType" : "text/html"
        },
        "properties" : {
            "title" : {
                "text" : "title"
            },
            "body" : {
                "text" : "body",
                "trimSpaces" : true
            }
        }
      }
    ]
}'

Aggregate a title/content from news.yahoo.com

$ curl -XPUT 'localhost:9200/.river_web/config/yahoo_site' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://news.yahoo.com/"],
    "include_urls" : ["http://news.yahoo.com/.*"],
    "max_depth" : 1,
    "max_access_count" : 10,
    "num_of_thread" : 3,
    "interval" : 3000,
    "user_agent" : "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",
    "target" : [
      {
        "pattern" : {
          "url" : "http://news.yahoo.com/video/.*html",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          }
        }
      },
      {
        "pattern" : {
          "url" : "http://news.yahoo.com/.*html",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "h1.headline"
          },
          "content" : {
            "text" : "section#mediacontentstory p"
          }
        }
      }
    ]
}'

(If news.yahoo.com changes its layout, the above example needs to be updated accordingly.)

Others

BASIC/DIGEST/NTLM authentication

River Web supports BASIC/DIGEST/NTLM authentication. Set the authentications object:

...
"num_of_thread" : 5,
"interval" : 1000,
"authentications":[
  {
    "scope": {
      "scheme":"BASIC"
    },
    "credentials": {
      "username":"testuser",
      "password":"secret"
    }
  }],
"target" : [
...

The configuration is:

Property                                | Type   | Description
authentications.scope.scheme            | string | BASIC, DIGEST or NTLM.
authentications.scope.host              | string | (Optional) Target hostname.
authentications.scope.port              | int    | (Optional) Port number.
authentications.scope.realm             | string | (Optional) Realm name.
authentications.credentials.username    | string | Username.
authentications.credentials.password    | string | Password.
authentications.credentials.workstation | string | (Optional) Workstation for NTLM.
authentications.credentials.domain      | string | (Optional) Domain for NTLM.

For example, to authenticate as an Active Directory user, the configuration is:

"authentications":[
  {
    "scope": {
      "scheme":"NTLM"
    },
    "credentials": {
      "domain":"your.ad.domain",
      "username":"taro",
      "password":"himitsu"
    }
  }],

Use attachment type

River Web supports the attachment type. For example, create a mapping with an attachment field:

curl -XPUT "localhost:9200/web/test/_mapping?pretty" -d '{
  "test" : {
    "properties" : {
...
      "my_attachment" : {
          "type" : "attachment",
          "fields" : {
            "file" : { "index" : "no" },
            "title" : { "store" : "yes" },
            "date" : { "store" : "yes" },
            "author" : { "store" : "yes" },
            "keywords" : { "store" : "yes" },
            "content_type" : { "store" : "yes" },
            "content_length" : { "store" : "yes" }
          }
      }
...

and then start the crawler. In the "properties" object, when the value of "type" is "attachment", the content of the crawled URL is stored as base64-encoded data.

curl -XPUT localhost:9200/.river_web/config/2 -d '{
      "index" : "web",
      "type" : "data",
      "urls" : "http://...",
...
      "target" : [
...
        {
          "settings" : {
            "html" : false
          },
          "pattern" : {
            "url" : "http://.../.*"
          },
          "properties" : {
            "my_attachment" : {
              "type" : "attachment"
            }
          }
        }
      ]
...

Use Multibyte Characters

An example for a Japanese environment is below. First, put the following configuration files into the conf directory of Elasticsearch.

$ cd $ES_HOME/conf    # ex. /etc/elasticsearch if using rpm package
$ sudo wget https://raw.github.com/codelibs/fess-server/master/src/tomcat/solr/core1/conf/mapping_ja.txt
$ sudo wget http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/collection1/conf/lang/stopwords_ja.txt 

and then create "webindex" index with analyzers for Japanese. (If you want to use uni-gram, remove cjk_bigram in filter)

$ curl -XPUT "localhost:9200/webindex" -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "type" : "custom",
          "char_filter" : ["mappingJa"],
          "tokenizer" : "standard",
          "filter" : ["word_delimiter", "lowercase", "cjk_width", "cjk_bigram"]
        }
      },
      "char_filter" : {
        "mappingJa": {
          "type" : "mapping",
          "mappings_path" : "mapping_ja.txt"
        }
      },
      "filter" : {
        "stopJa" : {
          "type" : "stop",
          "stopwords_path" : "stopwords_ja.txt"
        }
      }
    }
  }
}'
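
To check that the analyzer behaves as expected, you can run the _analyze API against the new index (the sample text here is arbitrary):

$ curl -XGET 'localhost:9200/webindex/_analyze?analyzer=default&text=全文検索サーバー&pretty'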

Rewrite a property value by Script

River Web allows you to rewrite crawled data with Java's ScriptEngine ("javascript" is available). In the "properties" object, add a "script" value to the property you want to rewrite.

...
        "properties" : {
...
          "flag" : {
            "text" : "body",
            "script" : "value.indexOf('Elasticsearch') > 0 ? 'yes' : 'no';"
          },

In the above example, if the text of the body element contains "Elasticsearch", the "flag" property is set to "yes"; otherwise it is set to "no".
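
As another sketch, a script can also normalize a value. Assuming value is exposed as a string (as in the example above), the following would trim and lower-case an extracted title:

        "properties" : {
          "title" : {
            "text" : "title",
            "script" : "value.trim().toLowerCase();"
          }
        }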

Use HTTP proxy

Put "proxy" property in "crawl" property.

curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
...
        "proxy" : {
          "host" : "proxy.server.com",
          "port" : 8080
        },

Specify the next URLs to crawl

If the "isChildUrl" property is set to true, the property values are used as the next URLs to crawl.

...
    "target" : [
      {
...
        "properties" : {
          "childUrl" : {
            "value" : ["http://fess.codelibs.org/","http://fess.codelibs.org/ja/"],
            "isArray" : true,
            "isChildUrl" : true
          },

Intercept start/execute/finish/close actions

You can insert scripts that run when the crawler executes (execute) and finishes (finish). To insert scripts, add a "script" property as below:

curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "script":{
      "execute":"your script...",
      "finish":"your script..."
    },
    ...
    ...

FAQ

What does "No scraping rule." mean?

In a crawl configuration, "urls" lists the starting URLs, "include_urls" filters which URLs are crawled, and "target.pattern.url" is the rule for storing extracted web data. If a crawled URL does not match any "target.pattern.url", you will see this message; it means the crawled URL has no extraction rule.
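
For example, if you crawl http://www.codelibs.org/ but only define a target pattern for http://fess.codelibs.org/.*, every codelibs.org page produces this message. A sketch of a fix is to add a target whose pattern covers the remaining crawled URLs:

    "target" : [
      ...
      {
        "pattern" : {
          "url" : "http://www.codelibs.org/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          }
        }
      }
    ]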

How to extract an attribute of meta tag

For example, if you want to grab the content of the description meta tag, the configuration is:

...
"target" : [
...
  "properties" : {
...
    "meta" : {
      "attr" : "meta[name=description]",
      "args" : [ "content" ]
    },

Incremental crawling does not work?

"url" field needs to be "not_analyzed" in a mapping of your stored index. See Create Index To Store Crawl Data.

Where is crawled data stored?

Crawled data is stored in the ".s2robot" index during crawling, the data extracted from it is stored in the index specified by your configuration, and the data in the ".s2robot" index is removed when the crawler finishes.
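
While a crawl is running, a simple way to watch the intermediate data is to count the documents in the ".s2robot" index (the index name is taken from the answer above):

$ curl -XGET 'localhost:9200/.s2robot/_count?pretty'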

elasticsearch-river-web's People

Contributors

codelibsbuild, johtani, keiichiw, marevol

elasticsearch-river-web's Issues

NoClassSettingsException[Failed to load class with value [web]]

Hi there,

I've just created a brand new CentOS VM (v6), and installed Elasticsearch v1.0.0RC2 and elasticsearch-river-web v1.1.0 as per the instructions.

I then went to set up my crawl by running the following:

# create robot
curl -XPUT 'http://localhost:9200:443/robot/'

# Create Index
curl -XPUT "http://localhost:9200:443/compassion_uat/"

# create the duplicate mapping index
curl -XPUT "http://localhost:9200:443/compassion_uat/compassion_web/_mapping/" -d '
{
  "compassion_web" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      }
    ]
  }
}
'


# create the crawler
curl -XPUT 'http://localhost:9200:443/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
                "incremental" : true,
        "overwrite" : true,
                "robotsTxt" : false,
                "userAgent" : "bingbot",
        "target" : [        
                  {
            "pattern" : {
              "url" : "https://compassionau.custhelp.com/app/answers/detail/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1#rn_Summary"
              },
              "body" : {
                "text" : "div#rn_AnswerText",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/2 * * * * ?"
    }
}
'

After doing this I cannot see any documents appearing in the index, so I have looked at the _river index and can see the following error:

NoClassSettingsException[Failed to load class with value [web]]; nested: ClassNotFoundException[web];

Have I missed a step?

Thanks,
Tim.

Unable to load river-web plugin using Eclipse, getting the following error

Caused by: org.elasticsearch.common.inject.CreationException: Guice creation errors:

  1. Error injecting constructor, org.seasar.framework.exception.ResourceNotFoundRuntimeException: [ESSR0055]app.dicon
    at org.codelibs.elasticsearch.web.service.S2ContainerService.(Unknown Source)
    while locating org.codelibs.elasticsearch.web.service.S2ContainerService

1 error
at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:344)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:178)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70)
at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:59)

How to check the crawler status

(Asked in Japanese.)

Is it possible to get the status of a configured Web River (crawling in progress, crawling completed, etc.)? Thank you in advance.

How to prioritize search results based on type

I have many types in my index, for example a type for downloads, for resources, for main ASPX pages, and for case studies. When the user types "manuals", "user manuals", or "pdfs", the search should prioritize results so that PDFs and manuals are shown first. If any other keyword such as "downloads" is used, the results should come from the downloads type. Can anyone help?

Thanks in advance.

Unable to crawl from directory

I am unable to crawl the PDF files located on another web server. The scenario is that the PDF folder is accessible through FTP but not HTTP. I need to give the full URL of a PDF file (e.g. http://xyz.com/pdf/search.pdf), but I want to crawl from the folder itself. How can I crawl these files through Elasticsearch?

Thanks in advance.

How to install the latest source code?

I am trying to update the river to the latest version using

/bin/plugin --install codelibs/elasticsearch-river-web

I want to be sure I have the latest version (otherwise it installs from the Maven repository, which I believe is not the latest master version).

How to index text inside <div> tags

Hi,

Can anyone help me with indexing text between particular <div> tags, something like:

<div data-canvas-width="125.304" data-font-name="g_font_580_0" data-angle="0" style="font-size: 24px; font-family: sans-serif; left: 64px; top: 172px; transform: rotate(0deg) scale(1.00243, 1); transform-origin: 0% 0% 0px;" dir="ltr">Automotive</div>

This is to index some content in pdf files as per my requirement.

Thanks In Advance,
Srinivas

Crawling Authenticated Sites

Hi,

Is there a way to crawl authenticated websites using this plugin? If yes, can you guide me on how to achieve it? The site could be either NTLM authenticated or might use forms-based authentication.

Script support

This feature rewrites a stored value before indexing. Therefore, you can replace crawled data with something else using MVEL.

Question: Boilerpipe?

Does the crawler use boilerpipe? I have seen that it is in the dependencies, however I couldn't see where it is used. What is the reason for including it?

Thanks

How can I get attribute src of the image tag?

For example, I have this HTML:

<div id="thumb">
  <a href="#"><img src="http://example.com/test.jpg" /></a>
</div>

I want to get the src of the image and save it.

I tried this way but it did not work.

{
"pattern": {
 .....        
  "properties":{
    "image_src": {
      "attr": "img[src]",
      "args": ["div#thumb a"],
      "trimSpaces": true
    }
  }
}

Can anyone tell me how I can get the src?

Thank you.

Possible enhancement request: Duplicates

I have a few situations where the same document is indexed twice because it has two different parentUrls. Is it possible to prevent this? It would be nice if I could provide duplicate exclusion rules. For example, if the md5 of properties body + title + language is the same for an existing document, ignore it.

I realize this would increase the indexing time as you would basically have to do a search first but is something like this possible? Or is there a recommended approach for managing this common situation? Maybe I'm missing an option in ElasticSearch itself.

The fresh install using the tutorial from the README file doesn't work

I am using the latest version of ES as of today with a fresh install. The installation of the river also worked fine. However, the scraping (or crawling) does not start. I followed the instructions in the README file, but no luck. Of course I was careful with the cron job, and also used "0 0 * * *?" (which I think means start the crawling right now). I had luck with the Yahoo example, but only 5-6 links were extracted. I have tested the scraping with different URLs. I can't see what is going on (which page is to be crawled and so on); I only get "scheduled". Here is the log I have taken from the Yahoo example. After receiving this, the river stops. I have no luck using the river to crawl other sites. Any hints?

[2014-04-21 19:00:29,600][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/comics/
[2014-04-21 19:00:29,833][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:30,580][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:30,764][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/abc-news/
[2014-04-21 19:00:31,712][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:36,438][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/originals/
[2014-04-21 19:00:36,455][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/yahoo_news_photos/
[2014-04-21 19:00:36,457][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/katie-couric-aereo-tv-supreme-court-212342689.html
[2014-04-21 19:00:37,284][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:37,531][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:39,906][INFO ][cluster.metadata ] [Black Bolt] [webindex] update_mapping yahoo_com
[2014-04-21 19:00:41,241][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/boston-marathon-bombing/
[2014-04-21 19:00:41,247][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/u-says-examining-toxic-chemical-syria-172200500.html
[2014-04-21 19:00:42,402][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:43,176][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/syria-elections-set-june-3-amid-civil-war-180035620.html
[2014-04-21 19:00:43,255][INFO ][cluster.metadata ] [Black Bolt] [webindex] update_mapping yahoo_com
[2014-04-21 19:00:46,463][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/boy-scouts-shutdown-troop-for-refusing-to-banish-gay-scoutmaster-171244503.html
[2014-04-21 19:00:47,523][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/obama-plans-clemency-for-hundreds-of-drug-offenders--162714911.html
No scraping rule.

One time crawling

If no "cron" property, a crawler starts immediately(actually, 1 min later) and then unregister it after the crawling is completed.

How to set crawl URLs dynamically

(Asked in Japanese.)

Is it possible to register a specific river in advance, register the URLs to crawl into robot/queue from another application, and have those specific URLs crawled dynamically?

Specify crawled urls in properties

If you set "isChildUrl" property to true, crawled urls are used by the property value.

"crawl" : {
...
    "target" : [
      {
...
        "properties" : {
          "childUrl" : {
            "value" : ["http://fess.codelibs.org/","http://fess.codelibs.org/ja/"],
            "isArray" : true,
            "isChildUrl" : true
          },

Not all pages being crawled

Hi @marevol,

Thanks for the help over the last few days - it is really appreciated!

I have managed to get the crawling working across my two sites, however I'm noticing that not all the pages are being crawled, which is quite strange.

There are pages within my primary navigation that are being skipped altogether, even though they appear right beside another that is being crawled?

I have left the crawler to run overnight but it hasn't yet discovered these pages?

I created the crawler (after setting up the other indexes) by:

curl -XPUT 'http://localhost:9200/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/", "http://uat.compassiondev.net.au/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*", "http://uat.compassiondev.net.au/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
                "incremental" : true,
        "overwrite" : true,
                "userAgent" : "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Elasticsearch River Web/1.1.0",
        "target" : [        
                  {
            "pattern" : {
              "url" : "https://compassionau.custhelp.com/app/answers/detail/a_id/[0-9]*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1#rn_Summary"
              },
              "body" : {
                "text" : "div#rn_AnswerText",
                "trimSpaces" : true
              }
            }
          },
                    {
            "pattern" : {
              "url" : "http://uat.compassiondev.net.au/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1",
                                "trimSpaces": true
              },
              "body" : {
                "text" : "div#main",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/15 * * * * ?"
    }
}

Is there something I could have missed?

I notice that in one of your examples on the main wiki page you add

"menus" : {
                "text" : "ul.nav-list li a",
                "isArray" : true
              }

To the "properties" for one of your sites, could that have something to do with it, or is it unrelated?

Meta tag crawling issue.

I want to crawl the meta tag 'name' and 'content' attribute values. I am able to crawl the head tag text and HTML, but not just the meta tag attributes. Can anyone help me out?

Thanks in advance.

Praveen.

Getting junk content when reading the indexed pdf files through river-web

I have indexed PDF documents using river-web and it is showing junk content, not even HTML. For reference:

%PDF-1.4 % 379 0 obj <> endobj 405 0 obj <>/Filter/FlateDecode/ID[<0C462C2110F28740BFF2829978DDE58A><16AE43F1074EB84F96B5E99D3DB91FB7>]/Index[379 38]/Info 378 0 R/Length 125/Prev 330619/Root 380 0 R/Size 417/Type/XRef/W[1 3 1]>>stream ... [remainder of raw PDF binary data omitted]

Can you pls help me out.

Thanks in advance.

HTTP Proxy support

River Web supports crawling content through an HTTP proxy.
An example configuration is:

curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
    "type" : "web",
    "crawl" : {
...
        "proxy" : {
          "host" : "proxy.server.com",
          "port" : 8080
        },

It exceeds maxAccessCount

The crawler exceeds the maxAccessCount defined in the config. For example, I limited it to 500 pages, however it crawled about 770 web pages. Is this a bug or a feature?

Narrow Your Search Feature in Elasticsearch

Hi,

How can I implement the Narrow Your Search feature, as in GSA? I tried the completion suggester and am getting the following error:
index: metaindex
shard: 4
reason: BroadcastShardOperationFailedException[[metaindex][4] ]; nested: ElasticsearchException[failed to execute suggest]; nested: ElasticsearchException[Field [body] is not a completion suggest field];

Can anyone pls help me.

Thanks in advance.

meta tags

I don't see a way to grab and map certain meta tags to a property from the crawled html page. Is that an implemented feature?

Strange behavior between the robot index and my index when crawling a large site

I have started a crawl which I am monitoring right now, with about 10 threads and a 200 ms interval. The thing is, the robot index has 25k documents, which is way more than my crawl index at about 9,200 documents. I know the robot index handles the crawling URL list; however, it seems strange that pushing the data into ES is this slow.

By the way, CPU usage peaks at 80-90%.

Duplicated URLs

When I checked my URL list, I saw that the same URLs are indexed with different _ids. The pages are the same. I have set:

"maxDepth": 7,
"maxAccessCount": 500,
"numOfThread": 10,
"interval": 200,
"incremental": true,
"overwrite": true,

Cannot create index?

Hi there,

I've been using this plugin now for a few weeks with no issues (I'm running version 1.0.1) until I decided a few days ago to remove all my indexes and create new ones again from scratch.

Unfortunately now I can't seem to create my crawler indexes. I run the appropriate curl command to create the index and I receive the {"ok":true...} JSON response, but when I try to query the index I receive an IndexMissingException.

The process I'm following is as follows:

a. Install robot index (as per instructions):

curl -XPUT '192.168.1.26:9200/robot/'

b. I then attempt to create an index using:

curl -XPUT '192.168.1.26:9200/_river/my_web/_meta' -d "{
    \"type\" : \"web\",
    \"crawl\" : {
        \"index\" : \"compassion_test\",
        \"url\" : [\"http://uat.compassiondev.net.au/\"],
        \"includeFilter\" : [\"http://uat.compassiondev.net.au/.*\"],
        \"maxDepth\" : 3,
        \"maxAccessCount\" : 100,
        \"numOfThread\" : 5,
        \"interval\" : 1000,
        \"overwrite\" : true,
        \"target\" : [
          {
            \"pattern\" : {
              \"url\" : \"http://uat.compassiondev.net.au/.*\",
              \"mimeType\" : \"text/html\"
            },
            \"properties\" : {
              \"title\" : {
                \"text\" : \"title\"
              },
              \"body\" : {
                \"text\" : \"div#page_content\",
                \"trimSpaces\" : true
              }
            }
          }
        ]
    }
}"

I receive the following json response:

{"ok":true,"_index":"_river","_type":"my_web","_id":"_meta","_version":1}

But the index doesn't seem to exist (I receive the exception mentioned above)...

Is there something that I've missed? Any help would be greatly appreciated. Thanks!

Facet issue

{
  "query" : { "query_string" : { "query" : "ck3" } },
  "facets" : {
    "tags" : { "terms" : { "fields" : ["metakey", "metaprod", "metasol", "metares"] } }
  }
}

'Printers and Media' is a single value in the meta tag content attribute.
When I run this query, the terms facet splits it into individual words, so for 'Printers and Media' it gives me:
{
term:"printers",
count:280
},
{
term:"and",
count : 300
}
{
term:"media",
count:100
}

But I need:
{
term:"Printers and Media"
count:200
}

What changes do I need to make in the query to get this? Please suggest.

Thanks in advance.

Duplicated contents of different URLs

When the content is the same although the URLs differ (like content that belongs to two or more categories), is there any setting that merges the URLs and stores them as one document?

Updated schedule does not take effect

When I update the config file with a new river config containing an updated schedule time, the river does not pick up the new time and the schedule does not start. However, if I restart ES, it starts automatically.

"schedule": {
"cron": "0 14 4 * * ?"
}
