
crawl-anywhere's Introduction

Crawl-Anywhere

April 2013 - Starting with version 4.0, Crawl-Anywhere became an open-source project. The current version is 4.0.0.

Stable version 3.0.x is still available at http://www.crawl-anywhere.com/

Introduction

Crawl Anywhere is primarily a web crawler; however, it includes all the components needed to build a vertical search engine.

Crawl Anywhere includes : a web crawler with its web administration interface, a document processing pipeline, an indexer and a search interface.

Project home page : http://www.crawl-anywhere.com/

A web crawler is a program that discovers and reads all pages or documents (HTML, PDF, Office, ...) on a web site in order, for example, to index this data and build a search engine (like Google). Wikipedia provides a good description of what a web crawler is: http://en.wikipedia.org/wiki/Web_crawler.
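As an illustration only (a generic sketch, not Crawl-Anywhere's actual code, and assuming the jsoup library for HTML parsing), the core of a crawler is a fetch / parse / extract-links loop:

    // Generic sketch of a crawl loop: fetch a page, hand it to indexing,
    // extract its links and queue them. Not Crawl-Anywhere code; jsoup and
    // the start URL are assumptions made for this example.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    public class TinyCrawler {
        public static void main(String[] args) throws Exception {
            Deque<String> queue = new ArrayDeque<String>();
            Set<String> seen = new HashSet<String>();
            queue.add("http://www.example.com/");            // hypothetical start URL

            while (!queue.isEmpty() && seen.size() < 100) {  // stop after 100 pages
                String url = queue.poll();
                if (!seen.add(url)) {
                    continue;                                // already visited
                }
                Document doc = Jsoup.connect(url).get();     // fetch and parse the page
                // A real crawler would now push the document to an indexing pipeline.
                for (Element link : doc.select("a[href]")) { // discover outgoing links
                    queue.add(link.absUrl("href"));
                }
            }
        }
    }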

Support

Build distribution

Pre-requisites :

  • Maven 3.0.0 or later
  • Oracle Java 7 or later

Steps :
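The build steps are not listed here; as a minimal sketch, assuming a checkout of the GitHub repository and a standard Maven build (adjust to your environment):

    # Minimal build sketch (repository URL and Maven goals are assumptions)
    git clone https://github.com/bejean/crawl-anywhere.git
    cd crawl-anywhere
    mvn clean package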

Installation

Pre-requisites :

  • Oracle Java 7 or later
  • Apache 2.0 or later
  • PHP 5.2.x, 5.3.x or 5.4.x
  • MongoDB 2.2 or later (64-bit)
  • Solr 4.3.0 or later (configuration files provided for Solr 4.3.0 and 4.10.0)

Steps :
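The installation steps are not listed here; as a quick sketch, the prerequisites listed above can be verified from the command line first (these commands only print versions):

    # Check the prerequisites before installing
    java -version        # expect Oracle Java 7 or later
    apachectl -v         # expect Apache 2.0 or later
    php -v               # expect PHP 5.2.x, 5.3.x or 5.4.x
    mongod --version     # expect a 64-bit MongoDB 2.2 or later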

Getting Started

See the User Manual at http://www.crawl-anywhere.com/getting-started/

History

  • release 4.0.0-alpha-1 : April 28, 2013
  • release 4.0.0-alpha-2 : May 22, 2013
  • release 4.0.0-alpha-3 : June 21, 2013
  • release 4.0.0-alpha-4 : June 23, 2013
  • release 4.0.0-beta-1 : August 6, 2013
  • release 4.0.0-release-candidate : October 20, 2013
  • release 4.0.0 final : December 1, 2014

crawl-anywhere's People

Contributors

bejean


crawl-anywhere's Issues

Sitemap XML

When using a Google sitemap, does the crawler look at the URLs it lists and crawl each of those pages?

I have added a link to a Google sitemap.xml; however, the crawler doesn't appear to use it.

Recrawl strategy

Create a fast recrawl option. This option would allow a web site to be recrawled often and quickly by crawling only to a maximum depth of 1 or 2 levels from the declared starting URLs and the declared RSS feeds.
RSS feeds discovered during crawls should be memorized so they can be reused during the next fast crawls.
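A minimal sketch of the idea (hypothetical names, not existing Crawl-Anywhere code): seed the fast recrawl with the declared start URLs and the memorized RSS feeds, and stop following links beyond a small depth.

    // Hypothetical sketch of a fast recrawl: seed with start URLs and memorized
    // RSS feeds, then follow links only 1 or 2 levels deep.
    import java.util.ArrayDeque;
    import java.util.Collection;
    import java.util.Queue;

    public class FastRecrawlSketch {
        static final int MAX_FAST_DEPTH = 2;

        static class Item {
            final String url;
            final int depth;
            Item(String url, int depth) { this.url = url; this.depth = depth; }
        }

        Queue<Item> seed(Collection<String> startUrls, Collection<String> memorizedRssFeeds) {
            Queue<Item> queue = new ArrayDeque<Item>();
            for (String url : startUrls) queue.add(new Item(url, 0));            // declared starting URLs
            for (String feed : memorizedRssFeeds) queue.add(new Item(feed, 0));  // RSS feeds found earlier
            return queue;
        }

        boolean shouldFollowLinksFrom(Item item) {
            return item.depth < MAX_FAST_DEPTH;   // crawl only 1 or 2 levels deep
        }
    }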

PHP 5.4 doesn't work OOTB?

Your page claims that CA will run fine under PHP 5.4, but I get many errors relating to deprecated features (e.g. call-time pass by reference). Am I missing something?

Database initialisation

For new installation and upgrade purposes, add a database import feature in the web administration.
If the DB is empty, offer to initialize it with a minimal data set or with crawler v3 MySQL CSV export files.

Recrawling does not start

I have questions about the recrawl period and schedules. How do the recrawl period and schedules depend on each other?

Two examples:

  1. If I define a recrawl period of 1 day and an enabled schedule with day:all, start:1, stop:5, I would expect the crawler to start recrawling the source every day between the defined times.

  2. If I define a recrawl period of 1 week and an enabled schedule with day:all, start:1, stop:5, I would expect the crawler to start recrawling the source every day between the defined times, but to fetch each document at most once per week.

Could you please explain exactly how recrawling and scheduling work? In my case, I have to start the crawler manually even though I defined recrawl and schedule rules.
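One plausible interpretation, shown as a sketch only (this is not a statement of how Crawl-Anywhere actually behaves): a recrawl starts when the current time falls inside an enabled schedule window and the last crawl is older than the recrawl period.

    // Hypothetical illustration of one way the period and the schedule could combine.
    // Not Crawl-Anywhere's actual logic; names and parameters are placeholders.
    import java.util.Calendar;
    import java.util.Date;

    public class RecrawlCheckSketch {
        static boolean shouldRecrawl(Date lastCrawl, long recrawlPeriodMillis,
                                     int scheduleStartHour, int scheduleStopHour) {
            Calendar now = Calendar.getInstance();
            int hour = now.get(Calendar.HOUR_OF_DAY);
            boolean inScheduleWindow = hour >= scheduleStartHour && hour < scheduleStopHour;
            boolean periodElapsed = lastCrawl == null
                    || now.getTimeInMillis() - lastCrawl.getTime() >= recrawlPeriodMillis;
            return inScheduleWindow && periodElapsed;   // both conditions must hold
        }
    }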

first crawl date is null

The first crawl date of an item is null when I try to rescan a source. This leads to a NumberFormatException in WebConnector.java:

    String firstCrawlDate = StringUtils.trimToEmpty(queue.getCreated(itemData));
    Date d = null;
    if ("".equals(firstCrawlDate)) {
        d = new Date();
    } else {
        d = new Date(Long.parseLong(firstCrawlDate));
    }
    params.put("firstCrawlDate", dateFormat.format(d.getTime()));

A quick solution would be to check for

    if (firstCrawlDate == null || "".equals(firstCrawlDate)) {

but I don't know whether it is correct for firstCrawlDate to be null in this situation.

Improve virtual appliance

  • use a lower default screen resolution (1024x768)
  • use an English qwerty keyboard as default
  • change passwords so that they are not azerty/qwerty sensitive
  • add bookmarks in Firefox (www.crawl-anywhere.com, forum, github)

Scripts tools

I can't get tools_test_scripts or tools_list_script_engines.sh to work.
Here are the errors I get :

Error: Could not find or load main class fr.eolya.utils.ListScriptEngines

and

Error: Main method not found in class fr.eolya.extraction.ScriptsWrapper, please define the main method as: public static void main(String[] args)

Admin UI doubles backslash

If I enter a regex crawl rule with backslashes, the admin UI doubles the number of entered backslashes when saving the source.

e.g. ^domain\.de becomes ^domain\\.de

Enable disabled v3.0.3 features

Before the final version, enable the missing features from version 3:

  • clear a source
  • rescan a source
  • rescan from cache
  • crawl a source deeper
  • check deletion

protocol strategy for https

If the protocol strategy is set to keep only https pages, both http and https pages are indexed to Solr. If I set it to keep only http, it works correctly.

Crawl resume after crash

After a crawler crash, the admin UI reports the status "Crawling".
The first crawl resume does nothing; the second resume works fine.

Elasticsearch integration

Use Elasticsearch as an alternative to Solr.

implies :

  • pipeline mapping stage creation
  • indexer update
  • search interface update

advantages :

  • dynamic mapping for better multi-lingual indexing and search

Allow wildcards in Host aliases

It would be convenient to allow wildcards in host alias definitions. If you have many subdomains, e.g. sd1.domain.de, sd2.domain.de, sd3.domain.de, sd4.domain.de, sd5.domain.de and the host aliases change frequently, it would be easier to define a host alias like "*.domain.de"
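A sketch of how such a wildcard alias could be matched (a hypothetical helper, not existing Crawl-Anywhere code): treat "*." as a subdomain suffix test.

    // Hypothetical wildcard host alias matcher: "*.domain.de" matches any
    // subdomain of domain.de. Not existing Crawl-Anywhere code.
    public class HostAliasSketch {
        static boolean matchesAlias(String host, String alias) {
            if (alias.startsWith("*.")) {
                String suffix = alias.substring(1);   // ".domain.de"
                return host.toLowerCase().endsWith(suffix.toLowerCase());
            }
            return host.equalsIgnoreCase(alias);
        }

        public static void main(String[] args) {
            System.out.println(matchesAlias("sd1.domain.de", "*.domain.de")); // true
            System.out.println(matchesAlias("other.com", "*.domain.de"));    // false
        }
    }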

MetaExtractor stage enhancement

proxy params are ignored

If proxy params are defined in crawler.xml, the params are only being used to create auth cookies in the initialize method of WebConnector.java, but are not being injected into WebPageLoader.java and further into HttpLoader.java to guarantee general access through the proxy without any auth mechanism. This means that the proxy settings are ignored.
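For illustration, with Apache HttpClient 4.3+ a proxy can be applied to every request through a default RequestConfig; this sketch shows the general mechanism with placeholder host/port values, it is not a patch against WebPageLoader.java or HttpLoader.java.

    // Sketch: route all requests through an HTTP proxy with Apache HttpClient 4.3+.
    // The proxy host/port and target URL are placeholders.
    import org.apache.http.HttpHost;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class ProxySketch {
        public static void main(String[] args) throws Exception {
            HttpHost proxy = new HttpHost("proxy.example.com", 3128);   // placeholder proxy
            RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
            CloseableHttpClient client = HttpClients.custom()
                    .setDefaultRequestConfig(config)                    // applied to every request
                    .build();
            client.execute(new HttpGet("http://www.example.com/"));     // goes through the proxy
            client.close();
        }
    }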

'Crawl Now' doesn't seem to work

When I push the 'Crawl Now' button from within the admin interface, I would expect the source to be crawled immediately. This does not seem to happen - I find I need to delete the source and re-add it to have the site crawled again.

Support for multiple solr cores

Hi,

my company wants to use your project for multiple web sites and separate them into multiple Solr cores. I worked around this by using your "target" option for sources and then running the pipeline and the indexer with different configuration files for different queue folders. I wonder if there is a better way to do that, or if you plan to integrate multiple cores more closely? There is also some budget for this.

Merci!
Wolfram
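For reference, with SolrJ 4.x each core is addressed by its own URL, so running one indexer configuration per core is a straightforward setup; the following is only a sketch with placeholder URLs and core names, not the project's indexer code.

    // Sketch: with SolrJ 4.x, each Solr core is addressed by its own URL,
    // so one indexer instance/configuration per core is a simple setup.
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MultiCoreSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer coreA = new HttpSolrServer("http://localhost:8983/solr/siteA");
            HttpSolrServer coreB = new HttpSolrServer("http://localhost:8983/solr/siteB");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1");

            coreA.add(doc);      // documents from one source go to core siteA
            coreA.commit();
            coreB.add(doc);      // documents from another source could target core siteB
            coreB.commit();

            coreA.shutdown();
            coreB.shutdown();
        }
    }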

Boost recent indexed documents

In the search interface, add an option to boost recently indexed documents based on the real publish date or the first crawl date.
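For reference, Solr can express such a boost as a reciprocal function query on a date field; with SolrJ it could look roughly like the sketch below (the field name "first_crawl_date" and the constants are placeholders, not the project's actual schema).

    // Sketch: boost recent documents with an edismax "boost" function query.
    // recip(x,m,a,b) = a / (m*x + b), where x is the age in milliseconds, so
    // newer documents get a multiplier close to 1 and older ones a smaller one.
    import org.apache.solr.client.solrj.SolrQuery;

    public class RecencyBoostSketch {
        public static SolrQuery buildQuery(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("defType", "edismax");
            q.set("boost", "recip(ms(NOW,first_crawl_date),3.16e-11,1,1)"); // placeholder field name
            return q;
        }
    }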

Last installation issues

To be reviewed in the documentation and tested :

  • Cleaning page test
  • solr.xml
  • search / cache
  • search config uses port 9090 -> Ok
