Dynamic Search | Data Provider: Web Crawler

A Spider Crawler Extension for Pimcore Dynamic Search.

Requirements

  • Pimcore >= 5.8.0
  • Pimcore Dynamic Search

Basic Setup

dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        own_host_only: true
                    full_dispatch:
                        seed: 'http://your-domain.test'
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        user_invalid_links:
                            - '@^http://your-domain.test\/members.*@i'
                    single_dispatch:
                        host: 'http://your-domain.test'
                normalizer:
                    service: 'web_crawler_localized_resource_normalizer'

Provider Options

always

| Name | Default Value | Description |
|------|---------------|-------------|
| own_host_only | false | |
| allow_subdomains | false | |
| allow_query_in_url | false | |
| allow_hash_in_url | false | |
| allowed_mime_types | ['text/html', 'application/pdf'] | |
| allowed_schemes | ['http'] | |
| content_max_size | 0 | |
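
The table above lists no descriptions, so the following is only a minimal sketch based on what the option names suggest. It assumes that `allowed_schemes` lists the URL schemes the crawler may follow and that `allowed_mime_types` restricts which response types are indexed. Neither reading is confirmed by this document.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        # stay on the seed host (default: false)
                        own_host_only: true
                        # assumption: adding 'https' lets the crawler follow HTTPS links
                        allowed_schemes: ['http', 'https']
                        # assumption: index HTML only, dropping the default 'application/pdf'
                        allowed_mime_types: ['text/html']
```

Options that are not listed keep the default values shown in the table.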

full_dispatch

| Name | Default Value | Description |
|------|---------------|-------------|
| seed | null | |
| valid_links | [] | |
| user_invalid_links | [] | |
| max_link_depth | 15 | |
| max_crawl_limit | 0 | |
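
As a hedged illustration of the `full_dispatch` options, the sketch below sets crawl limits next to the seed. It assumes that `max_link_depth` bounds how many links away from the seed the crawler goes and that `max_crawl_limit` bounds the number of crawled resources, with the default `0` presumably meaning no limit. The values 5 and 500 are arbitrary.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    full_dispatch:
                        seed: 'http://your-domain.test'
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        # assumption: do not follow links deeper than 5 levels (default: 15)
                        max_link_depth: 5
                        # assumption: stop after 500 crawled resources (default: 0)
                        max_crawl_limit: 500
```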

single_dispatch

| Name | Default Value | Description |
|------|---------------|-------------|
| host | null | |

Resource Normalizer

DefaultResourceNormalizer

Identifier: web_crawler_default_resource_normalizer
Normalize simple documents.
Options: none

LocalizedResourceNormalizer

Identifier: web_crawler_localized_resource_normalizer
Scaffold localized documents.

Options:

| Name | Default Value | Allowed Type | Description |
|------|---------------|--------------|-------------|
| locales | all enabled Pimcore languages | array | |
| skip_not_localized_documents | true | bool | If false, an exception is thrown when a document/object has no valid locale. |
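
A possible way to pass these options is sketched below. The `options` key under `normalizer` is an assumption that mirrors the layout of `data_provider`; this document does not show it. The locale values `en` and `de` are placeholders.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                normalizer:
                    service: 'web_crawler_localized_resource_normalizer'
                    # assumption: normalizer options are nested under `options`
                    options:
                        # limit normalization to two locales instead of all enabled ones
                        locales: ['en', 'de']
                        # raise an exception when a resource has no valid locale
                        skip_not_localized_documents: false
```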

Transformer

Scaffolder

HttpResponseHtmlDataScaffolder

Identifier: http_response_html_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type text/html.

HttpResponsePdfDataScaffolder

Identifier: http_response_pdf_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type application/pdf.

PimcoreElementScaffolder

Identifier: pimcore_element_scaffolder
Simple object scaffolder.
Supported types: Asset, Document, DataObject\Concrete.

Field Transformer

UriExtractor

Identifier: resource_uri_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none

LanguageExtractor

Identifier: resource_language_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none

MetaExtractor

Identifier: resource_meta_extractor
Supported Scaffolder: http_response_html_scaffolder

Return Type: string|null
Options: none

HtmlTagExtractor

Identifier: resource_html_tag_content_extractor
Supported Scaffolder: http_response_html_scaffolder

Return Type: string|null
Options: none

TextExtractor

Identifier: resource_text_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none

TitleExtractor

Identifier: resource_title_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none
