
inm_googlesitemap's Introduction

What does it do?

This is a TYPO3 CMS extension. It provides an Extbase Command Controller task that generates a sitemap.xml using the PHPCrawl library. The task fetches the given URL, finds all links in the HTML and follows them, so it works like a frontend crawler.

Usage

After installing the extension (activating it in the Extension Manager), create a new Scheduler task using the Extbase CommandController Task. Select "InmGooglesitemap Sitemap: generateSitemap" as the command; besides TYPO3's default scheduling options, you'll then get the options listed below.

You can get this extension via Git clone or Composer and place it in your preferred destination.
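
For orientation, the command behind this Scheduler task roughly follows the usual Extbase CommandController pattern. The namespace, class name and argument list below are assumptions for illustration; the real controller shipped with the extension may differ.

    <?php
    namespace Inm\InmGooglesitemap\Command;

    use TYPO3\CMS\Extbase\Mvc\Controller\CommandController;

    /**
     * Illustrative sketch only - not the extension's actual code.
     */
    class SitemapCommandController extends CommandController
    {
        /**
         * Generates a sitemap.xml by crawling the frontend.
         *
         * @param string $url              URL entry point for crawling
         * @param string $sitemapFileName  File name of the XML file
         * @param int    $requestLimit     Max number of URLs to crawl (0 = no limit)
         */
        public function generateSitemapCommand($url, $sitemapFileName = 'sitemap.xml', $requestLimit = 0)
        {
            // Configure PHPCrawl with the given options and write the sitemap
            // (see the PHPCrawl sketch further below).
        }
    }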

Scheduler Task Options / Arguments for the crawling process

url: The URL entry point for crawling.

http://example.com - This will be the entry point for crawling, the first URL that will be called.

sitemapFileName: File name of the XML file. Default is "sitemap.xml".

sitemap.xml - This file will be saved in your web root, so the sitemap will be reachable at http://example.com/sitemap.xml

regexFileEndings: Regular expression for file endings to skip

#\.(jpg|jpeg|gif|png|mp3|mp4|gz|ico)$# i - by default, URLs with one of these file endings are skipped

regexDirectoryExclude: Regular expression for directories to skip.

#\/(typo3conf|fileadmin|uploads)\/.*$# i - by default, these paths are skipped when found in a URL

obeyRobotsTxt: Check to obey rules from robots.txt

Check this if you want to obey the rules in robots.txt

requestLimit: Max number of URLs to crawl.

0 - Default is "0" which means no limit. Enter a number > 0 to set a limit.

countOnlyProcessed: Check if only fetched URLs should count for $requestLimit.

Checkbox to fine-tune how the request limit is counted: if checked, only successfully fetched URLs count towards requestLimit.

phpTimeLimit: Value in seconds for setting the PHP execution time limit. Default = 10000.

10000 - the default value.

htmlSuffix: Default true: only allows .htm|.html endings and also excludes URLs with query strings.

Checkbox to tell the crawler that a URL must end with .htm or .html.

linkExtractionTags: By default the crawler searches for links in the following HTML tags/attributes:

href, src, url, location, codebase, background, data, profile, action and open. You may change this comma-separated list.

useTransferProtocol: Enter transfer protocol to use: http (=default) or https. URLs with wrong protocol will not be written.

http - for example, if you run the site behind a proxy, you may have to set the protocol that is prepended to the URLs.

requestDelay: Time in seconds (float or string, e.g. 0.5, or "60/100" for 100 requests per minute). Sets a delay for every HTTP request the crawler executes.

2 - default value, means 2 seconds.

username: HTTP Auth username

default: empty. Must be at least 2 chars long.

password: HTTP Auth password

default: empty. Must be at least 2 chars long.

urlRegexHttpAuth: URL to send authentication information to, e.g. "#http://www\.foo\.com/protected_path/#"

default: empty. With the given example, the authentication data would be added to the request for every URL within the path "protected_path". A PHP sketch showing how all of these options map to PHPCrawl calls follows below.
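
As a rough illustration of how the options above map to the underlying PHPCrawl API (method names as documented for PHPCrawl 0.8x; the exact wiring inside the extension is an assumption, and $options is a hypothetical array holding the task arguments):

    // Minimal sketch, assuming PHPCrawl 0.8x.
    class SitemapCrawler extends PHPCrawler
    {
        public $foundUrls = array();

        // PHPCrawl calls this for every document it fetches; collect the URL.
        function handleDocumentInfo($DocInfo)
        {
            if ($DocInfo->received == true) {
                $this->foundUrls[] = $DocInfo->url;
            }
        }
    }

    $crawler = new SitemapCrawler();
    $crawler->setURL($options['url']);                              // url
    $crawler->addURLFilterRule($options['regexFileEndings']);       // regexFileEndings
    $crawler->addURLFilterRule($options['regexDirectoryExclude']);  // regexDirectoryExclude
    $crawler->obeyRobotsTxt($options['obeyRobotsTxt']);             // obeyRobotsTxt
    $crawler->setRequestLimit(
        $options['requestLimit'],                                   // requestLimit
        $options['countOnlyProcessed']                              // countOnlyProcessed
    );
    $crawler->setRequestDelay($options['requestDelay']);            // requestDelay
    $crawler->setLinkExtractionTags(
        explode(',', $options['linkExtractionTags'])                // linkExtractionTags
    );
    if ($options['username'] !== '') {
        $crawler->addBasicAuthentication(                           // HTTP auth options
            $options['urlRegexHttpAuth'],
            $options['username'],
            $options['password']
        );
    }
    set_time_limit($options['phpTimeLimit']);                       // phpTimeLimit
    $crawler->go();                                                 // start crawling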

Big thanks to Uwe Hunfeld for the GPL-licensed PHPCrawl library

http://phpcrawl.cuab.de

PHPCrawl is completely free open source software and is licensed under the GNU General Public License v2.

More to know

The PHPCrawl library offers the possibility to use multiple processes, but there are a few requirements that may not be met on every web server: http://phpcrawl.cuab.de/requirements.html

For the moment, the extension does not implement multi-process crawling. It is planned to make this activatable in the Scheduler Task settings, too.
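
For reference only: PHPCrawl exposes parallel crawling through goMultiProcessed(), which would be called instead of go() once such support is added. This is a sketch of the library's own API, not current extension behaviour.

    // Not used by the extension yet - PHPCrawl's API for parallel crawling:
    $crawler->goMultiProcessed(5); // crawl with 5 parallel processes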

A temporary file

While the process runs, it generates a file named _temporary_sitemap.xml, which is renamed to sitemap.xml (or the name given in the settings) after the Scheduler task has run successfully.
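
The write-then-rename pattern behind this is roughly the following (a sketch, not the extension's actual code; file names as described above):

    // Write the collected URLs to the temporary file while the task runs ...
    file_put_contents('_temporary_sitemap.xml', $sitemapXml);

    // ... and publish it under its final name only after the crawl finished successfully.
    rename('_temporary_sitemap.xml', 'sitemap.xml');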

The generated sitemap.xml

The sitemap.xml only contains the URLs that the crawling process has found, which is the minimum requirement for an XML sitemap. This means we do not extend the entries with fields like priority or last-modified dates. We think that's OK, as Google does a good job without them.
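
A generated file therefore looks like this minimal example (standard sitemaps.org format, placeholder URLs; only loc entries, no priority or last-modified dates):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/</loc>
      </url>
      <url>
        <loc>http://example.com/about.html</loc>
      </url>
    </urlset>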

Why another sitemap extension?

We think the approach of crawling from the frontend gives better results than trying to get all URLs from within the backend, where you have to write your own sitemap providers and so on. With inm_googlesitemap, all links are found from the frontend's point of view - just like a crawler (which it is, indeed), a link checker, or a bot like Google's.

inm_googlesitemap's People

Contributors

merzilla, mrslntghost

Forkers

ralessandri

inm_googlesitemap's Issues

obey robots.txt is not working

Hi,

I have set rules in robots.txt:
Disallow: /mailto:%20iasdf%66o%40r%65%69asdfdf%2ede
Disallow: /news-letter/unsub

I then started the cron job, but it still indexed both of the URLs above.

How can I skip certain URLs so that they are not included in the sitemap?
