
python-sitemap's Introduction

Python-Sitemap

Simple script to crawl websites and create a sitemap.xml of all the public links on them.

Warning: this script only works with Python 3.

Simple usage

$ python main.py --domain http://blog.lesite.us --output sitemap.xml

Advanced usage

Read a config file to set parameters: you can override (or, for lists, extend) any parameter defined in config.json.

$ python main.py --config config/config.json

Enable debug:

  $ python main.py --domain https://blog.lesite.us --output sitemap.xml --debug

Enable verbose output:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --verbose

Disable sorting of the output:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --no-sort

Enable image sitemap:

More information here: https://support.google.com/webmasters/answer/178636?hl=en

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --images

Enable a report that prints a summary of the crawl:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --report

Skip URLs by extension (this example skips both pdf and xml URLs):

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --skipext pdf --skipext xml

Drop part of a URL via a regular expression:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --drop "id=[0-9]{5}"

Exclude URLs matching a substring:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --exclude "action=edit"

Read robots.txt to ignore some URLs:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --parserobots

Use a specific user-agent for robots.txt:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --parserobots --user-agent Googlebot

Human-readable XML

$ python3 main.py --domain https://blog.lesite.us --images --parserobots | xmllint --format -

Multithreaded

$ python3 main.py --domain https://blog.lesite.us --num-workers 4

With basic auth

You need to configure the username and password in config.py first.

$ python3 main.py --domain https://blog.lesite.us --auth

Output sitemap index file

Sitemaps with over 50,000 URLs should be split into an index file that points to sitemap files that each contain 50,000 URLs or fewer. Outputting as an index requires specifying an output file. An index will only be output if a crawl has more than 50,000 URLs:

$ python3 main.py --domain https://blog.lesite.us --as-index --output sitemap.xml
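For reference, a sitemap index file in the sitemaps.org protocol looks like the sketch below; the child sitemap file names here are illustrative, not necessarily the exact names this script writes:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
            <loc>https://blog.lesite.us/sitemap-0.xml</loc>
      </sitemap>
      <sitemap>
            <loc>https://blog.lesite.us/sitemap-1.xml</loc>
      </sitemap>
</sitemapindex>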

Docker usage

Build the Docker image:

$ docker build -t python-sitemap:latest .

Run with the default domain:

$ docker run -it python-sitemap

Run with a custom domain:

$ docker run -it python-sitemap --domain https://www.graylog.fr

Run with a config file and output:

You need to configure the config.json file first.

$ docker run -it -v `pwd`/config/:/config/ -v `pwd`:/home/python-sitemap/ python-sitemap --config config/config.json

python-sitemap's People

Contributors

c4software, chenkuansun, cyai, etw3gh, garrett-r, ghuntley, jswilson, lovebootcaptain, marshvee, mnlipp, reuning, rstular, sebclick, todpole3


python-sitemap's Issues

Question about Sitemap

Hello,

Does this script produce an index XML sitemap that points to all the smaller sitemap.xml files?
Can I add a limit? E.g. my website has a lot of links; can I make the script stop after, say, 10,000 links?
Does the script avoid adding duplicate links to the sitemap?

Thanks.

urls not saved to sitemap.xml

This is the first script I am ever running.

Thank you for creating it.

After it finished, there were 634 crawled URLs.

However, the sitemap.xml file in the directory is empty. How do I fix this?

Thank you in advance.


This is what I see:

[screenshot]

patch for response error

diff -urN python-sitemap-master/crawler.py python-sitemap-master/crawler.py
--- python-sitemap-master/crawler.py 2013-04-03 09:25:00.000000000 +0300
+++ python-sitemap-master/crawler.py 2013-06-08 11:09:44.910676587 +0300
@@ -84,8 +84,8 @@
 	url = urlparse(crawling)
 	self.crawled.add(crawling)
-	request = Request(crawling, headers={"User-Agent":config.crawler_user_agent})
 	try:
+		request = Request(crawling, headers={"User-Agent":config.crawler_user_agent})
 		response = urlopen(request)
 	except Exception as e:
 		if hasattr(e,'code'):
@@ -94,7 +94,6 @@
 		else:
 			self.response_code[e.code]=1
 		logging.debug ("{1} ==> {0}".format(e, crawling))
-		response.close()
 		return self.__continue_crawling()

 	# Read the response
    

Suggestion: Not parseable resources ->parseable resources

I took a peek at your source code. One source of crawling issues is that you currently define not_parseable_ressources in the code. If instead you define parseable resources, limited to the types that are truly parseable (supported in the sitemap and able to contain plain HTML links), you can avoid issues with unknown extensions. You might also look at using MIME types instead of file extensions; I am not sure how that works in Python, though (see the sketch below).
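A rough sketch of the MIME-type idea using Python's standard library mimetypes module; the allow-list and function name are illustrative, not part of the current code:

import mimetypes

# Treat only HTML-like resources as parseable; everything else is skipped.
PARSEABLE_TYPES = {"text/html", "application/xhtml+xml"}

def is_parseable(url):
    mime_type, _encoding = mimetypes.guess_type(url)
    # Extension-less URLs guess as None; assume those are HTML pages.
    return mime_type is None or mime_type in PARSEABLE_TYPES

print(is_parseable("https://example.com/page.html"))  # True
print(is_parseable("https://example.com/doc.pdf"))    # False
print(is_parseable("https://example.com/about"))      # True

A more robust variant would check the Content-Type response header instead of guessing from the URL.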

HTTPS urls

Hi,

I noticed that even though the links in the page don't specify a scheme, the tag always defaults to http://, even when the <a href="/"></a> doesn't include the domain.

i.e. with this command:

python3 main.py --domain https://www.****/ --images --output sitemap.xml --verbose

I get:

[screenshot]

UnicodeDecodeError possibly with Scandinavian letters

Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.xetnet.fi --image --output sitemap.xml --verbose
Output:

INFO:root:Start the crawling process
INFO:root:Crawling #1: https://www.xetnet.fi
INFO:root:Crawling #2: https://www.xetnet.fi/category/ror/
INFO:root:Crawling #3: https://www.xetnet.fi/wordpress-asennus-webhotelliin-2/
INFO:root:Crawling #4: https://www.xetnet.fi/category/ruby/
INFO:root:Crawling #5: https://www.xetnet.fi/asiakaspalvelu/reilua-palvelua/
INFO:root:Crawling #6: https://www.xetnet.fi/webhotelli/wordpress-webhotelli/
INFO:root:Crawling #7: https://www.xetnet.fi/wordpress/
INFO:root:Crawling #8: https://www.xetnet.fi/palvelupaketin-vaihtaminen-suurempaan-tai-pienempaan/
Traceback (most recent call last):
File "/home/paivisanteri/sitemap/python-sitemap-master/main.py", line 53, in
crawl.run()
File "/home/paivisanteri/sitemap/python-sitemap-master/crawler.py", line 101, in run
self.__crawling()
File "/home/paivisanteri/sitemap/python-sitemap-master/crawler.py", line 205, in __crawling
print ("<url><loc>"+self.htmlspecialchars(url.geturl())+"</loc>" + lastmod + image_list + "</url>", file=self.output_file)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 745: ordinal not in range(128)

Could this be a local problem, or maybe something in my Python settings? I am not familiar with Python.
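One likely explanation: the output file is opened with the platform's default encoding (ASCII in some locales), so writing '\xe4' fails. A hedged fix, assuming you can edit where the script opens its output file, is to force UTF-8:

# Open the sitemap output with an explicit encoding so non-ASCII
# characters in URLs don't trip the 'ascii' codec on write.
output_file = open("sitemap.xml", "w", encoding="utf-8")

If the sitemap is written to stdout instead, running the script with PYTHONIOENCODING=utf-8 set in the environment should have the same effect.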

Image Licence

Image sitemap is the only way to tell search engines the licenses of images. Please consider adding an option to the script for a site-wide license covering all images. It could work like this:

python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --license http://creativecommons.org/publicdomain/zero/1.0/

With the following output added inside <image:image> after <image:loc>:
<image:license>http://creativecommons.org/publicdomain/zero/1.0/</image:license>

Slash missing in URL

Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots

Output:
<image:loc>https://www.2globalnomads.infopaivi-santeri-kannisto/subscribe.png</image:loc></image:image><image:image><image:loc>https://www.2globalnomads.infopaivi-santeri-kannisto/logo.png</image:loc>

There should be a "/" in the URL before the path, between "info" and "paivi", i.e. "info/paivi".

The same issue happens with all local URLs. Remote URLs are all OK.

Mismatch between the number of crawled links and HTTP codes

There seems to be an issue:

[valentin@valentinpc crawler]$ python main.py --config config.json --debug
[...]
DEBUG:root:Number of link crawled : 15
DEBUG:root:Nb Code HTTP 200 : 14

Add package to PyPI

This package seems quite popular and would benefit from being on PyPI. We could check out Poetry to keep it simple.

I can take a look at doing this one if it's of interest.

Python 3.9.6 support? SyntaxError

Hi,

I am getting a SyntaxError when trying to execute the file, no matter what link I type in. Quoting the URL with "" or '' doesn't work either.
Is there a way to "revert" the python version back to 3.6 without installing another instance?

Or am I doing something wrong here??

Thx

Limit search to path instead of domain?

Could it be possible to restrict the search to a certain path?
A bad example would be to restrict a search to http://google.com/maps/ and ignore results which are in other "subdirectories" of http://google.com/.
Using "domain" for this purpose does not work.

Adding trailing '/' to all URLs

All of my site's URLs include a trailing '/'

https://www.example.com/
https://www.example.com/dir/

not the following:

https://www.example.com
https://www.example.com/dir

This script emitted all of my links without the trailing '/'.

How do I add the trailing '/' in?
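Absent a built-in option, the URLs could be normalized before they are written; a minimal sketch, assuming page URLs have no file extension:

from urllib.parse import urlparse, urlunparse

def add_trailing_slash(url):
    parts = urlparse(url)
    last_segment = parts.path.rsplit("/", 1)[-1]
    # Leave file-like paths (e.g. /logo.png) and already-slashed paths alone.
    if not parts.path.endswith("/") and "." not in last_segment:
        parts = parts._replace(path=parts.path + "/")
    return urlunparse(parts)

print(add_trailing_slash("https://www.example.com/dir"))  # https://www.example.com/dir/
print(add_trailing_slash("https://www.example.com"))      # https://www.example.com/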

Relative URLs are parsed incorrectly

If http://domain/dir/page1.html contains a link to page2.html, the parser interprets this as http://domain/page2.html; correct is http://domain/dir/page2.html.

Furthermore, on a page containing references to upper directories (..), these are changed to . by self.clean_link.

I recommend using urllib.parse.urljoin(crawling_url, link) to make a link absolute. This will handle everything except "//" in the path.
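A quick demonstration of the recommended urljoin behaviour:

from urllib.parse import urljoin

# A relative link resolves against the directory of the page it was found on.
print(urljoin("http://domain/dir/page1.html", "page2.html"))
# -> http://domain/dir/page2.html

# Parent-directory references are resolved correctly too.
print(urljoin("http://domain/dir/sub/page.html", "../other.html"))
# -> http://domain/dir/other.html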

Please move the project away from GitHub

I just read that Microsoft is acquiring GitHub. I have seen enough of Microsoft's love for open source for a lifetime to avoid everything that involves them. It is at best a kiss of death: soon all users will be required to install Microsoft malware and open Microsoft accounts to use GitHub, and all our information will be for sale. I am quitting GitHub. So long, and thanks for all the fish.

AttributeError: 'NoneType' object has no attribute 'geturl'

I got such error
python3 main.py --domain https://domain.com --output sitemap.xml

Traceback (most recent call last):
File "main.py", line 60, in
crawl.run()
File "/root/python-sitemap/crawler.py", line 127, in run
self.__crawl(current_url)
File "/root/python-sitemap/crawler.py", line 264, in __crawl
final_url = response.geturl()
AttributeError: 'NoneType' object has no attribute 'geturl'

UnicodeDecodeError possibly with Scandinavian letters

Command
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.books.2globalnomads.info --image --output sitemap.xml
Output
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1: invalid start byte

With multiple errors: HTTP Error 404: Not Found

No URLs found

Number of found URL : 1
Number of links crawled : 1

python main.py --domain https://www.domain.com --output sitemap.xml --report

<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

</urlset>

Windows and/or Python 3.7.2?

Hello
I have a problem with python-sitemap on Windows and Python 3.7.2.
I haven't looked into the problem yet, but whatever I do (even a bare 'python main.py') I get:

Traceback (most recent call last):
  File "C:\! git !\python-sitemap\main.py", line 8, in <module>
    import crawler
  File "C:\! git !\python-sitemap\crawler.py", line 240
    image_link = f"{self.domain.strip("/")}{image_link.replace("./", "/")}"
                                                                 ^
SyntaxError: invalid syntax

RuntimeError: Event loop is closed - with > 1 workers

When I run with any number of workers greater than 1, I get the following error after crawling around 40 urls.

INFO:root:Crawling #56: https://up.codes/s/natural-ventilation
ERROR:concurrent.futures:exception calling callback for <Future at 0x10ddc1190 state=finished returned NoneType>
Traceback (most recent call last):
  File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/futures.py", line 362, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 728, in call_soon_threadsafe
    self._check_closed()
  File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 475, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

I'm on a Mac with Catalina. Seems to run fine on Linux.

Here's the command I'm using to repro:

python main.py --domain="https://up.codes" --output="sitemap.xml" -v -n 2

Stack overflow error

$ python3 main.py --domain http://ua.shop-ink.su --output sitemap.xml
Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007fff7edeb180:
....
File "/Users/dchaplinsky/Projects/python-sitemap/crawler.py", line 201 in __continue_crawling
File "/Users/dchaplinsky/Projects/python-sitemap/crawler.py", line 197 in __crawling
...
Abort trap: 6

How to add hreflang tags

Dear Creator,

Thank you very much for creating this.

Is there a way to add hreflang tags automatically?

Take care.

IMG Data URI and image license

Data URI image links get added, but they should be left out. Those are commonly used, for example, for lazy-loading images. The real image URLs are inside NOSCRIPT tags and get added OK.
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:loc>https://www.2globalnomads.info/data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7</image:loc>

A few improvement proposals

Image sitemap is the only way to tell search engines the licenses of images. Please consider adding an option to the script for a site-wide license covering all images. It could work like this:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --license http://creativecommons.org/publicdomain/zero/1.0/
With the following output added inside <image:image> after <image:loc>:
<image:license>http://creativecommons.org/publicdomain/zero/1.0/</image:license>

You could pretty-print the sitemap.xml a bit and add newlines after every closing tag. That would make it more human-readable.

If you want, you could also take <image:title> from the TITLE and/or ALT attributes and <image:caption> from FIGCAPTION tags when they are present.

Cheers,
Santeri

patch for <lastmod> in sitemap

diff -urN python-sitemap-master/crawler.py python-sitemap-master/crawler.py
--- python-sitemap-master/crawler.py 2013-04-03 09:25:00.000000000 +0300
+++ python-sitemap-master/crawler.py 2013-06-08 11:27:24.706698113 +0300
@@ -5,6 +5,7 @@
 from urllib.request import urlopen, Request
 from urllib.robotparser import RobotFileParser
 from urllib.parse import urlparse
+from datetime import datetime

 import os

@@ -105,12 +106,17 @@
 		else:
 			self.response_code[response.getcode()]=1
 		response.close()
+		if 'last-modified' in response.headers:
+			date = response.headers['Last-Modified']
+		else:
+			date = response.headers['Date']
+		date = datetime.strptime(date, '%a, %d %b %Y %H:%M:%S %Z')
 	except Exception as e:
 		logging.debug ("{1} ===> {0}".format(e, crawling))
 		return self.__continue_crawling()

-	print ("<url><loc>"+url.geturl()+"</loc></url>", file=self.output_file)
+	print ("<url><loc>"+url.geturl()+"</loc><lastmod>"+date.strftime('%Y-%m-%dT%H:%M:%S')+"</lastmod></url>", file=self.output_file)
 	if self.output_file:
 		self.output_file.flush()
    

Handling more than 50,000 URLs

Hi, just wanted to say thanks for such a great library.

One need we have is to generate a sitemap for a site that has more than 50,000 URLs. Search engines typically handle a maximum of 50,000 URLs per sitemap file, which means today we manually create a sitemap index and split the URLs into individual sitemap files, each containing fewer than 50,000 URLs.

One option I was considering was adding a feature to python-sitemap that would optionally output a sitemap index and multiple sitemap files if there are more than 50,000 URLs; would that be of interest? Just wanted to make sure that kind of feature would be desired prior to implementing; thanks!
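A sketch of the splitting logic under discussion, assuming the crawler ends up with a flat list of URL entries; the file-naming scheme is illustrative:

MAX_URLS_PER_SITEMAP = 50000  # limit from the sitemaps.org protocol

def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield (filename, chunk) pairs, one per sitemap file."""
    for i in range(0, len(urls), size):
        yield "sitemap-{}.xml".format(i // size), urls[i:i + size]

# A sitemap index file would then list one <sitemap><loc>...</loc></sitemap>
# entry per generated file.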

Duplicate entry

The script adds the root URL twice to the sitemap: the first entry, at the beginning, has no trailing slash, and the second entry, at the end, has one.
See:
python3 sitemap.py --domain https://www.2globalnomads.info --image --output sitemap.xml
Output:
<url><loc>https://www.2globalnomads.info</loc><lastmod>2017-08-22T15:28:56+00:00</lastmod>
...
<url><loc>https://www.2globalnomads.info/</loc><lastmod>2017-08-22T15:28:56+00:00</lastmod>

Working with Angular sites?

Wanted to try and see if this would work with single page apps like Angular and it appears it will only pick up the index page. Hopefully support can be added to support these types of use cases. Thanks.

Crawler.py giving error

On running the command mentioned in the simple usage section of README.md:
File "main.py", line 6, in
import crawler
File "/Users/kartikey/Desktop/SoftwareIncubator/sitemapgen/python-sitemap/crawler.py", line 85
print(config.xml_header, file=self.output_file)

Endless loop fix

Command
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.forum.2globalnomads.info --image --output sitemap.xml --verbose

Endless loop part 2: report error and document workaround

Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.forum.2globalnomads.info --image --output sitemap.xml --verbose
Your fix

--drop "sid=[a-z0-9]{32}"

Although it is unlikely that people will run this script on phpBB3 forums, since there is already a mod for generating a sitemap, please consider adding your workaround to the documentation.

The same indefinite loop will happen with every phpBB3 installation, and there are tens of thousands of them. Also, it might be a good idea to add some kind of guard or timeout that detects loops, so the script can exit gracefully with a proper error message. A similar issue can happen with any website that has session management.

URL UnicodeEncodeError

If the URL contains UNICODE encoding, python will report an error.

debug info:

INFO:root:Crawling #1: https://gvo.wiki/html/NPC掉落書籍.html
DEBUG:root:https://gvo.wiki/html/NPC掉落書籍.html ==> 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)

Solution:

1. Edit crawler.py and add the following imports at the top:

import string
from urllib.parse import quote

2. Then search for:

current_url = self.urls_to_crawl.pop()

3. And add a line below it:

current_url = self.urls_to_crawl.pop()
current_url = quote(current_url, safe=string.printable)
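For reference, quote with safe=string.printable percent-encodes only the non-ASCII characters, which is what urlopen expects:

import string
from urllib.parse import quote

url = "https://gvo.wiki/html/NPC掉落書籍.html"
print(quote(url, safe=string.printable))
# -> https://gvo.wiki/html/NPC%E6%8E%89%E8%90%BD%E6%9B%B8%E7%B1%8D.html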

Error: No space left on device

My VPS has plenty of free space, but the script always fails with this error.
NOTE: I use the screen program to run the script in the background.
[screenshot: screenshot_20180210-184154]

crawling depth setting

Can you add a crawling-depth setting? Because my website has filtering and searching, the number of URL variants it generates is very large.

Stop and continue

The issue with this tool is that once it halts, you have to start all over again from scratch, and with large sites this is a very common scenario.
Since we already have the partially generated XML, it would be nice to continue from where it was interrupted. Let me know your thoughts on this and how to achieve it; I am willing to send a pull request once I have a better understanding of the code.

Syntax error

Just pulled the code and I get this:

$ python main.py www.cafonline.com --output sitemap.xml --verbose
Traceback (most recent call last):
File "main.py", line 8, in
import crawler
File "/home/francisco/python-sitemap/crawler.py", line 105
print(config.xml_header, file=self.output_file)
^
SyntaxError: invalid syntax

Tracker images are included

Tracker image links get added, but they should be left out. You could simply check that the image extension is not php or js, or that it is a valid image type, before adding it (a sketch follows the example below):
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:image><image:loc>https://analytics.2globalnomads.info/piwik.php?idsite=1&amp;rec=1</image:loc>

It appears that the exclusion parameters (--skipext, --exclude, --drop) have no effect on images.
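A sketch of the suggested extension check; the extension list and function name are illustrative:

from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}

def looks_like_image(url):
    # Reject data: URIs and script endpoints such as piwik.php.
    if url.startswith("data:"):
        return False
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in IMAGE_EXTENSIONS)

print(looks_like_image("https://analytics.2globalnomads.info/piwik.php?idsite=1"))  # False
print(looks_like_image("https://www.2globalnomads.info/logo.png"))                  # True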
