
crawl-anywhere's Introduction

Crawl-Anywhere

April 2013 - Starting with version 4.0, Crawl-Anywhere became an open-source project. The current version is 4.0.0.

Stable version 3.0.x is still available at http://www.crawl-anywhere.com/

Introduction

Crawl Anywhere is primarily a web crawler; however, it includes all the components needed to build a vertical search engine.

Crawl Anywhere includes : a web crawler with its web administration interface, a document processing pipeline, an indexer and a search interface.

Project home page : http://www.crawl-anywhere.com/

A web crawler is a program that discovers and reads all pages or documents (HTML, PDF, Office, ...) on a web site in order, for example, to index this data and build a search engine (like Google). Wikipedia provides a good description of what a web crawler is: http://en.wikipedia.org/wiki/Web_crawler.
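As an illustration only (a generic sketch, not Crawl-Anywhere's actual code, and assuming the jsoup library for HTML parsing), the core of a crawler is a fetch / parse / extract-links loop:

    // Generic sketch of a crawl loop: fetch a page, hand it to indexing,
    // extract its links and queue them. Not Crawl-Anywhere code; jsoup and
    // the start URL are assumptions made for this example.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    public class TinyCrawler {
        public static void main(String[] args) throws Exception {
            Deque<String> queue = new ArrayDeque<String>();
            Set<String> seen = new HashSet<String>();
            queue.add("http://www.example.com/");            // hypothetical start URL

            while (!queue.isEmpty() && seen.size() < 100) {  // stop after 100 pages
                String url = queue.poll();
                if (!seen.add(url)) {
                    continue;                                // already visited
                }
                Document doc = Jsoup.connect(url).get();     // fetch and parse the page
                // A real crawler would now push the document to an indexing pipeline.
                for (Element link : doc.select("a[href]")) { // discover outgoing links
                    queue.add(link.absUrl("href"));
                }
            }
        }
    }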

Support

Build distribution

Pre-requisites :

  • Maven 3.0.0 or later
  • Oracle Java 7 or later

Steps :
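The build steps are not listed here; as a minimal sketch, assuming a checkout of the GitHub repository and a standard Maven build (adjust to your environment):

    # Minimal build sketch (repository URL and Maven goals are assumptions)
    git clone https://github.com/bejean/crawl-anywhere.git
    cd crawl-anywhere
    mvn clean package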

Installation

Pre-requisites :

  • Oracle Java 7 or later
  • Apache 2.0 or later
  • PHP 5.2.x, 5.3.x or 5.4.x
  • MongoDB 2.2 or later (64-bit)
  • Solr 4.3.0 or later (configuration files provided for Solr 4.3.0 and 4.10.0)

Steps :
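The installation steps are not listed here; as a quick sketch, the prerequisites listed above can be verified from the command line first (these commands only print versions):

    # Check the prerequisites before installing
    java -version        # expect Oracle Java 7 or later
    apachectl -v         # expect Apache 2.0 or later
    php -v               # expect PHP 5.2.x, 5.3.x or 5.4.x
    mongod --version     # expect a 64-bit MongoDB 2.2 or later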

Getting Started

See the User Manual at http://www.crawl-anywhere.com/getting-started/

History

  • release 4.0.0-alpha-1 : April 28, 2013
  • release 4.0.0-alpha-2 : May 22, 2013
  • release 4.0.0-alpha-3 : June 21, 2013
  • release 4.0.0-alpha-4 : June 23, 2013
  • release 4.0.0-beta-1 : August 6, 2013
  • release 4.0.0-release-candidate : October 20, 2013
  • release 4.0.0 final : December 1, 2014

crawl-anywhere's People

Contributors

bejean


crawl-anywhere's Issues

Sitemap XML

When using a Google sitemap, does the crawler look at the URLs it lists and crawl each of those pages?

I have added a link to a Google sitemap.xml; however, the crawler doesn't appear to use it.

Recrawl strategy

Create a fast recrawl option. This option would allow a web site to be recrawled often and quickly by crawling only to a maximum depth of 1 or 2 levels from the declared starting URLs and the declared RSS feeds.
RSS feeds discovered during crawls should be memorized so they can be reused during the next fast crawls.
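A minimal sketch of the idea (hypothetical names, not existing Crawl-Anywhere code): seed the fast recrawl with the declared start URLs and the memorized RSS feeds, and stop following links beyond a small depth.

    // Hypothetical sketch of a fast recrawl: seed with start URLs and memorized
    // RSS feeds, then follow links only 1 or 2 levels deep.
    import java.util.ArrayDeque;
    import java.util.Collection;
    import java.util.Queue;

    public class FastRecrawlSketch {
        static final int MAX_FAST_DEPTH = 2;

        static class Item {
            final String url;
            final int depth;
            Item(String url, int depth) { this.url = url; this.depth = depth; }
        }

        Queue<Item> seed(Collection<String> startUrls, Collection<String> memorizedRssFeeds) {
            Queue<Item> queue = new ArrayDeque<Item>();
            for (String url : startUrls) queue.add(new Item(url, 0));            // declared starting URLs
            for (String feed : memorizedRssFeeds) queue.add(new Item(feed, 0));  // RSS feeds found earlier
            return queue;
        }

        boolean shouldFollowLinksFrom(Item item) {
            return item.depth < MAX_FAST_DEPTH;   // crawl only 1 or 2 levels deep
        }
    }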

PHP 5.4 doesn't work OOTB?

Your page claims that CA will run fine under PHP 5.4, but I get many errors relating to deprecated features (e.g. call-time pass by reference). Am I missing something?

Database initialisation

For new installation and upgrade purposes, add a database import feature in the web administration.
If the DB is empty, offer to initialize it with a minimal data set or with crawler v3 MySQL CSV export files.

Recrawling does not start

I have questions about the recrawl period and schedules. How do the recrawl period and schedules depend on each other?

Two examples:

  1. If I define a recrawl period of 1 day and an enabled schedule with day:all, start:1, stop:5, I would expect the crawler to start recrawling the source every day between the defined times.

  2. If I define a recrawl period of 1 week and an enabled schedule with day:all, start:1, stop:5, I would expect the crawler to start recrawling the source every day between the defined times, but to fetch each document at most once per week.

Could you please explain exactly how recrawling and scheduling work? In my case, I have to start the crawler manually even though I defined recrawl and schedule rules.
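One plausible interpretation, shown as a sketch only (this is not a statement of how Crawl-Anywhere actually behaves): a recrawl starts when the current time falls inside an enabled schedule window and the last crawl is older than the recrawl period.

    // Hypothetical illustration of one way the period and the schedule could combine.
    // Not Crawl-Anywhere's actual logic; names and parameters are placeholders.
    import java.util.Calendar;
    import java.util.Date;

    public class RecrawlCheckSketch {
        static boolean shouldRecrawl(Date lastCrawl, long recrawlPeriodMillis,
                                     int scheduleStartHour, int scheduleStopHour) {
            Calendar now = Calendar.getInstance();
            int hour = now.get(Calendar.HOUR_OF_DAY);
            boolean inScheduleWindow = hour >= scheduleStartHour && hour < scheduleStopHour;
            boolean periodElapsed = lastCrawl == null
                    || now.getTimeInMillis() - lastCrawl.getTime() >= recrawlPeriodMillis;
            return inScheduleWindow && periodElapsed;   // both conditions must hold
        }
    }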

first crawl date is null

The first crawl date of an item is null when I try to rescan a source. This leads to a NumberFormatException in WebConnector.java:

    String firstCrawlDate = StringUtils.trimToEmpty(queue.getCreated(itemData));
    Date d = null;
    if ("".equals(firstCrawlDate)) {
        d = new Date();
    } else {
        d = new Date(Long.parseLong(firstCrawlDate));
    }
    params.put("firstCrawlDate", dateFormat.format(d.getTime()));

A quick solution would be to check for

    if (firstCrawlDate == null || "".equals(firstCrawlDate)) {

but I don't know whether it is correct for firstCrawlDate to be null in this situation.

Improve virtual appliance

  • use a lower default screen resolution (1024x768)
  • use an English qwerty keyboard as default
  • change passwords so that they are not azerty/qwerty sensitive
  • add bookmarks in Firefox (www.crawl-anywhere.com, forum, github)

Scripts tools

I can't get tools_test_scripts or tools_list_script_engines.sh to work.
Here are the errors I get :

Error: Could not find or load main class fr.eolya.utils.ListScriptEngines

and

Error: Main method not found in class fr.eolya.extraction.ScriptsWrapper, please define the main method as: public static void main(String[] args)

Admin UI doubles backslash

If I enter a regex crawl rule with backslashes, the admin UI doubles the number of entered backslashes when saving the source.

e.g. ^domain\.de becomes ^domain\\.de

Enable disabled v3.0.3 features

Before the final version, enable the missing features from version 3:

  • clear a source
  • rescan a source
  • rescan from cache
  • crawl a source deeper
  • check deletion

protocol strategy for https

If the protocol strategy is set to keep only https pages, both http and https pages are indexed to Solr. If I set it to keep only http, it works correctly.

Crawl resume after crash

After a crawler crash, the admin UI reports the status "Crawling".
The first crawl resume does nothing; the second resume works fine.

Elasticsearch integration

Use Elasticsearch as an alternative to Solr.

implies :

  • pipeline mapping stage creation
  • indexer update
  • search interface update

advantages :

  • dynamic mapping for better multi-lingual indexing and search

Allow wildcards in Host aliases

It would be convenient to allow wildcards in host alias definitions. If you have many subdomains, e.g. sd1.domain.de, sd2.domain.de, sd3.domain.de, sd4.domain.de, sd5.domain.de and the host aliases change frequently, it would be easier to define a host alias like "*.domain.de"
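A sketch of how such a wildcard alias could be matched (a hypothetical helper, not existing Crawl-Anywhere code): treat "*." as a subdomain suffix test.

    // Hypothetical wildcard host alias matcher: "*.domain.de" matches any
    // subdomain of domain.de. Not existing Crawl-Anywhere code.
    public class HostAliasSketch {
        static boolean matchesAlias(String host, String alias) {
            if (alias.startsWith("*.")) {
                String suffix = alias.substring(1);   // ".domain.de"
                return host.toLowerCase().endsWith(suffix.toLowerCase());
            }
            return host.equalsIgnoreCase(alias);
        }

        public static void main(String[] args) {
            System.out.println(matchesAlias("sd1.domain.de", "*.domain.de")); // true
            System.out.println(matchesAlias("other.com", "*.domain.de"));    // false
        }
    }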

MetaExtractor stage enhancement

proxy params are ignored

If proxy params are defined in crawler.xml, the params are only being used to create auth cookies in the initialize method of WebConnector.java, but are not being injected into WebPageLoader.java and further into HttpLoader.java to guarantee general access through the proxy without any auth mechanism. This means that the proxy settings are ignored.
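For illustration, with Apache HttpClient 4.3+ a proxy can be applied to every request through a default RequestConfig; this sketch shows the general mechanism with placeholder host/port values, it is not a patch against WebPageLoader.java or HttpLoader.java.

    // Sketch: route all requests through an HTTP proxy with Apache HttpClient 4.3+.
    // The proxy host/port and target URL are placeholders.
    import org.apache.http.HttpHost;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class ProxySketch {
        public static void main(String[] args) throws Exception {
            HttpHost proxy = new HttpHost("proxy.example.com", 3128);   // placeholder proxy
            RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
            CloseableHttpClient client = HttpClients.custom()
                    .setDefaultRequestConfig(config)                    // applied to every request
                    .build();
            client.execute(new HttpGet("http://www.example.com/"));     // goes through the proxy
            client.close();
        }
    }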

'Crawl Now' doesn't seem to work

When I push the 'Crawl Now' button from within the admin interface, I would expect the source to be crawled immediately. This does not seem to happen - I find I need to delete the source and re-add it to have the site crawled again.

Support for multiple solr cores

Hi,

my company wants to use your project for multiple web sites and separate them into multiple Solr cores. I worked around this by using your "target" option for sources and then running the pipeline and the indexer with different configuration files for different queue folders. I wonder if there is a better way to do that, or if you plan to integrate multiple cores more closely? There is also some budget for this.

Merci!
Wolfram
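For reference, with SolrJ 4.x each core is addressed by its own URL, so running one indexer configuration per core is a straightforward setup; the following is only a sketch with placeholder URLs and core names, not the project's indexer code.

    // Sketch: with SolrJ 4.x, each Solr core is addressed by its own URL,
    // so one indexer instance/configuration per core is a simple setup.
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MultiCoreSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer coreA = new HttpSolrServer("http://localhost:8983/solr/siteA");
            HttpSolrServer coreB = new HttpSolrServer("http://localhost:8983/solr/siteB");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1");

            coreA.add(doc);      // documents from one source go to core siteA
            coreA.commit();
            coreB.add(doc);      // documents from another source could target core siteB
            coreB.commit();

            coreA.shutdown();
            coreB.shutdown();
        }
    }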

Boost recent indexed documents

In the search interface, add an option to boost recently indexed documents based on the real publish date or the first crawl date.
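For reference, Solr can express such a boost as a reciprocal function query on a date field; with SolrJ it could look roughly like the sketch below (the field name "first_crawl_date" and the constants are placeholders, not the project's actual schema).

    // Sketch: boost recent documents with an edismax "boost" function query.
    // recip(x,m,a,b) = a / (m*x + b), where x is the age in milliseconds, so
    // newer documents get a multiplier close to 1 and older ones a smaller one.
    import org.apache.solr.client.solrj.SolrQuery;

    public class RecencyBoostSketch {
        public static SolrQuery buildQuery(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("defType", "edismax");
            q.set("boost", "recip(ms(NOW,first_crawl_date),3.16e-11,1,1)"); // placeholder field name
            return q;
        }
    }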

Last installation issues

To be reviewed in the documentation and tested :

  • Cleaning page test
  • solr.xml
  • search / cache
  • search config uses port 9090 -> Ok
