This is a TYPO3 CMS extension. It provides an Extbase Command Controller task that generates a sitemap.xml using the PHPCrawl library. It parses the given URL, finds all links in the HTML, and follows them, so it works like a frontend crawler.
After installing the extension (activating it in your Extension Manager), you have to create a new Scheduler Task using the Extbase CommandController Task. Select InmGooglesitemap Sitemap: generateSitemap as command, and you'll get the following options (besides the default cron options of TYPO3).
You may get this extension via Git clone or Composer to your preferred destination.
http://example.com - This will be the entry point for crawling, the first URL that will be called.
sitemap.xml - This file will be saved in your webroot, so the sitemap will be reachable at http://example.com/sitemap.xml
#\.(jpg|jpeg|gif|png|mp3|mp4|gz|ico)$# i - by default, URLs with one of these file extensions will be skipped
#\/(typo3conf|fileadmin|uploads)\/.*$# i - by default, URLs containing one of these paths are skipped
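To illustrate how these two default patterns behave, here is a small, self-contained PHP sketch. It is not part of the extension's code; the patterns are written in plain PCRE form (without the space before the modifier), and the helper function name is made up for this example.

```php
<?php
// Hypothetical illustration of the default skip rules: a URL is excluded
// from the sitemap if either PCRE pattern matches.
$skipPatterns = [
    '#\.(jpg|jpeg|gif|png|mp3|mp4|gz|ico)$#i',  // binary file extensions
    '#\/(typo3conf|fileadmin|uploads)\/.*$#i',  // internal TYPO3 paths
];

function isSkipped(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(isSkipped('http://example.com/fileadmin/img/logo.PNG', $skipPatterns)); // bool(true)
var_dump(isSkipped('http://example.com/news/article.html', $skipPatterns));      // bool(false)
```

Note that both patterns use the `i` modifier, so an uppercase extension like `.PNG` is skipped as well.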
Check this if you want the crawler to obey the rules in robots.txt
0 - Default is "0", which means no limit. Enter a number > 0 to set a limit.
Checkbox to fine-tune the limit of maximum requested URLs.
10000 - is the default value.
Checkbox to tell the crawler that a URL must end with .html.
href, src, url, location, codebase, background, data, profile, action and open - the HTML attributes the crawler scans for links. You may change this comma-separated list.
useTransferProtocol: Enter the transfer protocol to use: http (= default) or https. URLs with the wrong protocol will not be written.
http - the default; if you use a proxy, you may have to set the protocol that must be prepended to the URLs.
requestDelay: float or string / time in seconds (float, e.g. 0.5, or a string like 60/100 for 100 requests per minute). Sets a delay before every HTTP request the crawler executes.
2 - default value, meaning 2 seconds.
default: empty. Must be at least 2 chars long.
default: empty. Must be at least 2 chars long.
urlRegexHttpAuth: URL pattern to send authentication information to, e.g. "#http://www\.foo\.com/protected_path/#"
default: empty. With the given example, the authentication data would be added to the request for every URL within the path "protected_path".
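The options above map roughly onto setter calls of the PHPCrawl library. The following is only a sketch under that assumption: the subclass name, the callback logic, and the option values are illustrative, not the extension's actual code, and it requires the PHPCrawl library to be present.

```php
<?php
// Sketch only: how the Scheduler Task options could translate into
// PHPCrawl API calls. Not the extension's real implementation.
require_once 'libs/PHPCrawler.class.php'; // path depends on your setup

class SitemapCrawler extends PHPCrawler
{
    public $foundUrls = [];

    // Called by PHPCrawl for every processed document.
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $info)
    {
        if ($info->received && $info->http_status_code == 200) {
            $this->foundUrls[] = $info->url;  // collect URL for the sitemap
        }
        return 1; // continue crawling
    }
}

$crawler = new SitemapCrawler();
$crawler->setURL('http://example.com');                                // entry point
$crawler->addURLFilterRule('#\.(jpg|jpeg|gif|png|mp3|mp4|gz|ico)$#i'); // skip file extensions
$crawler->addURLFilterRule('#\/(typo3conf|fileadmin|uploads)\/.*$#i'); // skip paths
$crawler->obeyRobotsTxt(true);     // honor robots.txt
$crawler->setRequestLimit(10000);  // max requested URLs
$crawler->setRequestDelay(2);      // 2 seconds between requests
$crawler->go();
```

After `go()` returns, the collected URLs would be written into the sitemap file.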
PHPCrawl is completely free open-source software and is licensed under the GNU General Public License v2.
The PHPCrawl library offers the possibility to use multiple processes, but there are a few requirements that may not be met on every webserver: http://phpcrawl.cuab.de/requirements.html
For the moment, the extension does not implement multi-process crawling yet. It is planned to make it activatable in the Scheduler Task settings as well.
While the process runs, it generates a file named _temporary_sitemap.xml, which is renamed to sitemap.xml (or the name given in the settings) after the Scheduler Task has run successfully.
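The point of the temporary file is that an aborted run never leaves a half-written sitemap.xml in the webroot. A minimal sketch of that write-then-rename flow (file names as described above, but the webroot path here is a stand-in):

```php
<?php
// Minimal sketch: write the sitemap to a temporary file first, then
// rename it, so the public sitemap.xml is only ever replaced atomically.
$webroot  = sys_get_temp_dir();  // stand-in for the real webroot
$tempFile = $webroot . '/_temporary_sitemap.xml';
$target   = $webroot . '/sitemap.xml';

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
$xml .= "  <url><loc>http://example.com/</loc></url>\n";
$xml .= "</urlset>\n";

file_put_contents($tempFile, $xml);
rename($tempFile, $target);  // only now does sitemap.xml (re)appear

echo file_exists($target) ? "sitemap written\n" : "failed\n";
```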
The sitemap.xml only contains the URLs that the crawling process has found, which is the minimum requirement for an XML sitemap. This means we do not extend entries with fields like priority or add dates. I think that's OK, as Google does a good job anyway.
We think the approach of crawling from the frontend gives better results than trying to get all URLs from within the backend, where you have to write your own sitemap providers and so on. By using inm_googlesitemap, all links are found from the point of view of the frontend; well... that means "just" like a crawler (which it is, indeed), or like a link checker, or maybe a bot like Google.