
python-sitemap's Introduction

Python-Sitemap

Simple script to crawl websites and create a sitemap.xml of all the public links on them.

Warning: this script only works with Python 3.

Simple usage

$ python main.py --domain http://blog.lesite.us --output sitemap.xml

Advanced usage

Read a config file to set parameters: you can override (or, for lists, extend) any parameter defined in config.json.

$ python main.py --config config/config.json

Enable debug:

  $ python main.py --domain https://blog.lesite.us --output sitemap.xml --debug

Enable verbose output:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --verbose

Disable sorting of the output:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --no-sort

Enable image sitemap:

More information here: https://support.google.com/webmasters/answer/178636?hl=en

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --images

Enable a report that prints a summary of the crawl:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --report

Skip URLs by extension (this example skips both pdf and xml URLs):

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --skipext pdf --skipext xml

Drop part of a URL via a regular expression:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --drop "id=[0-9]{5}"

Exclude URLs matching a substring:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --exclude "action=edit"

Read robots.txt to ignore some URLs:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --parserobots

Use a specific user-agent for robots.txt:

$ python main.py --domain https://blog.lesite.us --output sitemap.xml --parserobots --user-agent Googlebot

Human-readable XML

$ python3 main.py --domain https://blog.lesite.us --images --parserobots | xmllint --format -

Multithreaded

$ python3 main.py --domain https://blog.lesite.us --num-workers 4

With basic auth

You need to configure the username and password in config.py first.

$ python3 main.py --domain https://blog.lesite.us --auth

Output sitemap index file

Sitemaps with over 50,000 URLs should be split into an index file that points to sitemap files that each contain 50,000 URLs or fewer. Outputting as an index requires specifying an output file. An index will only be output if a crawl has more than 50,000 URLs:

$ python3 main.py --domain https://blog.lesite.us --as-index --output sitemap.xml
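For reference, a sitemap index file in the sitemaps.org protocol looks like the sketch below; the child sitemap file names here are illustrative, not necessarily the exact names this script writes:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
            <loc>https://blog.lesite.us/sitemap-0.xml</loc>
      </sitemap>
      <sitemap>
            <loc>https://blog.lesite.us/sitemap-1.xml</loc>
      </sitemap>
</sitemapindex>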

Docker usage

Build the Docker image:

$ docker build -t python-sitemap:latest .

Run with the default domain:

$ docker run -it python-sitemap

Run with a custom domain:

$ docker run -it python-sitemap --domain https://www.graylog.fr

Run with a config file and output:

You need to configure the config.json file first.

$ docker run -it -v `pwd`/config/:/config/ -v `pwd`:/home/python-sitemap/ python-sitemap --config config/config.json

python-sitemap's People

Contributors

c4software, chenkuansun, cyai, etw3gh, garrett-r, ghuntley, jswilson, lovebootcaptain, marshvee, mnlipp, reuning, rstular, sebclick, todpole3


python-sitemap's Issues

Question about Sitemap

Hello,

Does this script produce an index XML sitemap that points to all the smaller sitemap.xml files?
Can I add a limit? E.g. my website has a lot of links; can I make the script stop after, say, 10,000 links?
Does the script avoid adding duplicate links to the sitemap?

Thanks.

urls not saved to sitemap.xml

This is the first script I am ever running.

Thank you for creating it.

After it finished, there were 634 crawled URLs.

However, the sitemap.xml file in the directory is empty. How do I fix this?

Thank you in advance.


This is what I see:

[screenshot]

patch for response error

diff -urN python-sitemap-master/crawler.py python-sitemap-master/crawler.py
--- python-sitemap-master/crawler.py 2013-04-03 09:25:00.000000000 +0300
+++ python-sitemap-master/crawler.py 2013-06-08 11:09:44.910676587 +0300
@@ -84,8 +84,8 @@
 	url = urlparse(crawling)
 	self.crawled.add(crawling)
-	request = Request(crawling, headers={"User-Agent":config.crawler_user_agent})
 	try:
+		request = Request(crawling, headers={"User-Agent":config.crawler_user_agent})
 		response = urlopen(request)
 	except Exception as e:
 		if hasattr(e,'code'):
@@ -94,7 +94,6 @@
 		else:
 			self.response_code[e.code]=1
 		logging.debug ("{1} ==> {0}".format(e, crawling))
-		response.close()
 		return self.__continue_crawling()

 	# Read the response
    

Suggestion: Not parseable resources ->parseable resources

I took a peek at your source code. One source of crawling issues is that you currently define not_parseable_ressources in the code. If instead you define parseable resources, limited to the types that are truly parseable (supported in the sitemap and able to contain plain HTML links), you can avoid issues with unknown extensions. You might also look at using MIME types instead of file extensions; I am not sure how that works in Python, though (see the sketch below).
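A rough sketch of the MIME-type idea using Python's standard library mimetypes module; the allow-list and function name are illustrative, not part of the current code:

import mimetypes

# Treat only HTML-like resources as parseable; everything else is skipped.
PARSEABLE_TYPES = {"text/html", "application/xhtml+xml"}

def is_parseable(url):
    mime_type, _encoding = mimetypes.guess_type(url)
    # Extension-less URLs guess as None; assume those are HTML pages.
    return mime_type is None or mime_type in PARSEABLE_TYPES

print(is_parseable("https://example.com/page.html"))  # True
print(is_parseable("https://example.com/doc.pdf"))    # False
print(is_parseable("https://example.com/about"))      # True

A more robust variant would check the Content-Type response header instead of guessing from the URL.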

HTTPS urls

Hi,

I noticed that even though the links in the page don't specify a scheme, the tag always defaults to http://, even when the <a href="/"></a> doesn't include the domain.

i.e. with this command:

python3 main.py --domain https://www.****/ --images --output sitemap.xml --verbose

I get:

[screenshot]

UnicodeDecodeError possibly with Scandinavian letters

Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.xetnet.fi --image --output sitemap.xml --verbose
Output:

INFO:root:Start the crawling process
INFO:root:Crawling #1: https://www.xetnet.fi
INFO:root:Crawling #2: https://www.xetnet.fi/category/ror/
INFO:root:Crawling #3: https://www.xetnet.fi/wordpress-asennus-webhotelliin-2/
INFO:root:Crawling #4: https://www.xetnet.fi/category/ruby/
INFO:root:Crawling #5: https://www.xetnet.fi/asiakaspalvelu/reilua-palvelua/
INFO:root:Crawling #6: https://www.xetnet.fi/webhotelli/wordpress-webhotelli/
INFO:root:Crawling #7: https://www.xetnet.fi/wordpress/
INFO:root:Crawling #8: https://www.xetnet.fi/palvelupaketin-vaihtaminen-suurempaan-tai-pienempaan/
Traceback (most recent call last):
File "/home/paivisanteri/sitemap/python-sitemap-master/main.py", line 53, in
crawl.run()
File "/home/paivisanteri/sitemap/python-sitemap-master/crawler.py", line 101, in run
self.__crawling()
File "/home/paivisanteri/sitemap/python-sitemap-master/crawler.py", line 205, in __crawling
print ("<url><loc>"+self.htmlspecialchars(url.geturl())+"</loc>" + lastmod + image_list + "</url>", file=self.output_file)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 745: ordinal not in range(128)

Could this be a local problem, or maybe something in my Python settings? I am not familiar with Python.
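One likely explanation: the output file is opened with the platform's default encoding (ASCII in some locales), so writing '\xe4' fails. A hedged fix, assuming you can edit where the script opens its output file, is to force UTF-8:

# Open the sitemap output with an explicit encoding so non-ASCII
# characters in URLs don't trip the 'ascii' codec on write.
output_file = open("sitemap.xml", "w", encoding="utf-8")

If the sitemap is written to stdout instead, running the script with PYTHONIOENCODING=utf-8 set in the environment should have the same effect.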

Image Licence

Image sitemap is the only way to tell search engines the licenses of images. Please consider adding an option to the script for a site-wide license covering all images. It could work like this:

python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --license http://creativecommons.org/publicdomain/zero/1.0/

With the following output added inside <image:image> after <image:loc>:
<image:license>http://creativecommons.org/publicdomain/zero/1.0/</image:license>

Slash missing in URL

Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots

Output:
<image:loc>https://www.2globalnomads.infopaivi-santeri-kannisto/subscribe.png</image:loc></image:image><image:image><image:loc>https://www.2globalnomads.infopaivi-santeri-kannisto/logo.png</image:loc>

There should be a "/" in the URL before the path, between "info" and "paivi", i.e. "info/paivi".

The same issue happens with all local URLs. Remote URLs are all OK.

Mismatch between the number of crawled links and HTTP codes

There seems to be an issue:

[valentin@valentinpc crawler]$ python main.py --config config.json --debug
[...]
DEBUG:root:Number of link crawled : 15
DEBUG:root:Nb Code HTTP 200 : 14

Add package to PyPI

This package seems quite popular and would benefit from being on PyPI. We could check out Poetry to keep it simple.

I can take a look at doing this one if it's of interest.

Python 3.9.6 support? SyntaxError

Hi,

I am getting a SyntaxError when trying to execute the file, no matter what link I type in. Quoting the URL with "" or '' doesn't work either.
Is there a way to "revert" the python version back to 3.6 without installing another instance?

Or am I doing something wrong here??

Thx

Limit search to path instead of domain?

Could it be possible to restrict the search to a certain path?
A bad example would be to restrict a search to http://google.com/maps/ and ignore results which are in other "subdirectories" of http://google.com/.
Using "domain" for this purpose does not work.

Adding trailing '/' to all URLs

All of my site's URLs include a trailing '/'

https://www.example.com/
https://www.example.com/dir/

not the following:

https://www.example.com
https://www.example.com/dir

This script emitted all of my links without the trailing '/'.

How do I add the trailing '/' in?
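Absent a built-in option, the URLs could be normalized before they are written; a minimal sketch, assuming page URLs have no file extension:

from urllib.parse import urlparse, urlunparse

def add_trailing_slash(url):
    parts = urlparse(url)
    last_segment = parts.path.rsplit("/", 1)[-1]
    # Leave file-like paths (e.g. /logo.png) and already-slashed paths alone.
    if not parts.path.endswith("/") and "." not in last_segment:
        parts = parts._replace(path=parts.path + "/")
    return urlunparse(parts)

print(add_trailing_slash("https://www.example.com/dir"))  # https://www.example.com/dir/
print(add_trailing_slash("https://www.example.com"))      # https://www.example.com/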

Relative URLs are parsed incorrectly

If http://domain/dir/page1.html contains a link to page2.html, the parser interprets this as http://domain/page2.html; correct is http://domain/dir/page2.html.

Furthermore, on a page containing references to upper directories (..), these are changed to . by self.clean_link.

I recommend using urllib.parse.urljoin(crawling_url, link) to make a link absolute. This will handle everything except "//" in the path.
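A quick demonstration of the recommended urljoin behaviour:

from urllib.parse import urljoin

# A relative link resolves against the directory of the page it was found on.
print(urljoin("http://domain/dir/page1.html", "page2.html"))
# -> http://domain/dir/page2.html

# Parent-directory references are resolved correctly too.
print(urljoin("http://domain/dir/sub/page.html", "../other.html"))
# -> http://domain/dir/other.html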

Please move the project away from GitHub

I just read that Microsoft is acquiring GitHub. I have seen enough of Microsoft's love for open source for a lifetime to avoid everything that involves them. It is at best a kiss of death: soon all users will be required to install Microsoft malware and open Microsoft accounts to use GitHub, and all our information will be for sale. I am quitting GitHub. So long, and thanks for all the fish.

AttributeError: 'NoneType' object has no attribute 'geturl'

I got such error
python3 main.py --domain https://domain.com --output sitemap.xml

Traceback (most recent call last):
File "main.py", line 60, in
crawl.run()
File "/root/python-sitemap/crawler.py", line 127, in run
self.__crawl(current_url)
File "/root/python-sitemap/crawler.py", line 264, in __crawl
final_url = response.geturl()
AttributeError: 'NoneType' object has no attribute 'geturl'

UnicodeDecodeError possibly with Scandinavian letters

Command
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.books.2globalnomads.info --image --output sitemap.xml
Output
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1: invalid start byte

With multiple errors: HTTP Error 404: Not Found

No URLs found

Number of found URL : 1
Number of links crawled : 1

python main.py --domain https://www.domain.com --output sitemap.xml --report

<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

</urlset>

Windows and/or Python 3.7.2?

Hello
I have a problem with python-sitemap on Windows and Python 3.7.2.
I haven't looked into the problem yet, but whatever I do (even a bare 'python main.py') I get:

Traceback (most recent call last):
  File "C:\! git !\python-sitemap\main.py", line 8, in <module>
    import crawler
  File "C:\! git !\python-sitemap\crawler.py", line 240
    image_link = f"{self.domain.strip("/")}{image_link.replace("./", "/")}"
                                                                 ^
SyntaxError: invalid syntax

RuntimeError: Event loop is closed - with > 1 workers

When I run with any number of workers greater than 1, I get the following error after crawling around 40 urls.

INFO:root:Crawling #56: https://up.codes/s/natural-ventilation
ERROR:concurrent.futures:exception calling callback for <Future at 0x10ddc1190 state=finished returned NoneType>
Traceback (most recent call last):
  File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/futures.py", line 362, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 728, in call_soon_threadsafe
    self._check_closed()
  File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 475, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

I'm on a Mac with Catalina. Seems to run fine on Linux.

Here's the command I'm using to repro:

python main.py --domain="https://up.codes" --output="sitemap.xml" -v -n 2

Stack overflow error

$ python3 main.py --domain http://ua.shop-ink.su --output sitemap.xml
Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007fff7edeb180:
....
File "/Users/dchaplinsky/Projects/python-sitemap/crawler.py", line 201 in __continue_crawling
File "/Users/dchaplinsky/Projects/python-sitemap/crawler.py", line 197 in __crawling
...
Abort trap: 6

How to add hreflang tags

Dear Creator,

Thank you very much for creating this.

Is there a way to add hreflang tags automatically?

Take care.

IMG Data URI and image license

Data URI image links get added, but they should be left out. Those are commonly used, for example, for lazy-loading images. The real image URLs are inside NOSCRIPT tags and get added OK.
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:loc>https://www.2globalnomads.info/data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7</image:loc>

A few improvement proposals

Image sitemap is the only way to tell search engines the licenses of images. Please consider adding an option to the script for a site-wide license covering all images. It could work like this:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --license http://creativecommons.org/publicdomain/zero/1.0/
With the following output added inside <image:image> after <image:loc>:
<image:license>http://creativecommons.org/publicdomain/zero/1.0/</image:license>

You could pretty-print the sitemap.xml a bit and add newlines after every closing tag. That would make it more human-readable.

If you want, you could also take <image:title> from the TITLE and/or ALT attributes and <image:caption> from FIGCAPTION tags when they are present.

Cheers,
Santeri

patch for <lastmod> in sitemap

diff -urN python-sitemap-master/crawler.py python-sitemap-master/crawler.py
--- python-sitemap-master/crawler.py 2013-04-03 09:25:00.000000000 +0300
+++ python-sitemap-master/crawler.py 2013-06-08 11:27:24.706698113 +0300
@@ -5,6 +5,7 @@
 from urllib.request import urlopen, Request
 from urllib.robotparser import RobotFileParser
 from urllib.parse import urlparse
+from datetime import datetime

 import os

@@ -105,12 +106,17 @@
 		else:
 			self.response_code[response.getcode()]=1
 		response.close()
+		if 'last-modified' in response.headers:
+			date = response.headers['Last-Modified']
+		else:
+			date = response.headers['Date']
+		date = datetime.strptime(date, '%a, %d %b %Y %H:%M:%S %Z')
 	except Exception as e:
 		logging.debug ("{1} ===> {0}".format(e, crawling))
 		return self.__continue_crawling()

-	print ("<url><loc>"+url.geturl()+"</loc></url>", file=self.output_file)
+	print ("<url><loc>"+url.geturl()+"</loc><lastmod>"+date.strftime('%Y-%m-%dT%H:%M:%S')+"</lastmod></url>", file=self.output_file)
 	if self.output_file:
 		self.output_file.flush()
    

Handling more than 50,000 URLs

Hi, just wanted to say thanks for such a great library.

One need we have is to generate a sitemap for a site that has more than 50,000 URLs. Search engines typically handle a maximum of 50,000 URLs per sitemap file, which means today we manually create a sitemap index and split the URLs into individual sitemap files, each containing fewer than 50,000 URLs.

One option I was considering was adding a feature to python-sitemap that would optionally output a sitemap index and multiple sitemap files if there are more than 50,000 URLs; would that be of interest? Just wanted to make sure that kind of feature would be desired prior to implementing; thanks!
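A sketch of the splitting logic under discussion, assuming the crawler ends up with a flat list of URL entries; the file-naming scheme is illustrative:

MAX_URLS_PER_SITEMAP = 50000  # limit from the sitemaps.org protocol

def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield (filename, chunk) pairs, one per sitemap file."""
    for i in range(0, len(urls), size):
        yield "sitemap-{}.xml".format(i // size), urls[i:i + size]

# A sitemap index file would then list one <sitemap><loc>...</loc></sitemap>
# entry per generated file.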

Duplicate entry

The script adds the root URL twice to the sitemap: the first entry, at the beginning, has no trailing slash, and the second entry, at the end, has one.
See:
python3 sitemap.py --domain https://www.2globalnomads.info --image --output sitemap.xml
Output:
<url><loc>https://www.2globalnomads.info</loc><lastmod>2017-08-22T15:28:56+00:00</lastmod>
...
<url><loc>https://www.2globalnomads.info/</loc><lastmod>2017-08-22T15:28:56+00:00</lastmod>

Working with Angular sites?

Wanted to try and see if this would work with single page apps like Angular and it appears it will only pick up the index page. Hopefully support can be added to support these types of use cases. Thanks.

Crawler.py giving error

On running the command mentioned in the simple usage section of README.md:
File "main.py", line 6, in
import crawler
File "/Users/kartikey/Desktop/SoftwareIncubator/sitemapgen/python-sitemap/crawler.py", line 85
print(config.xml_header, file=self.output_file)

Endless loop fix

Command
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.forum.2globalnomads.info --image --output sitemap.xml --verbose

Endless loop part 2: report error and document workaround

Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.forum.2globalnomads.info --image --output sitemap.xml --verbose
Your fix

--drop "sid=[a-z0-9]{32}"

Although it is unlikely that people will run this script on phpBB3 forums, since there is already a mod for generating a sitemap, please consider adding your workaround to the documentation.

The same indefinite loop will happen with every phpBB3 installation, and there are tens of thousands of them. Also, it might be a good idea to add some kind of guard or timeout that detects loops, so the script can exit gracefully with a proper error message. A similar issue can happen with any website that has session management.

URL UnicodeEncodeError

If the URL contains UNICODE encoding, python will report an error.

debug info:

INFO:root:Crawling #1: https://gvo.wiki/html/NPC掉落書籍.html
DEBUG:root:https://gvo.wiki/html/NPC掉落書籍.html ==> 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)

Solution:

1. Edit crawler.py and add the following imports at the top:

import string
from urllib.parse import quote

2. Then search for:

current_url = self.urls_to_crawl.pop()

3. And add a line below it:

current_url = self.urls_to_crawl.pop()
current_url = quote(current_url, safe=string.printable)
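For reference, quote with safe=string.printable percent-encodes only the non-ASCII characters, which is what urlopen expects:

import string
from urllib.parse import quote

url = "https://gvo.wiki/html/NPC掉落書籍.html"
print(quote(url, safe=string.printable))
# -> https://gvo.wiki/html/NPC%E6%8E%89%E8%90%BD%E6%9B%B8%E7%B1%8D.html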

Error: No space left on device

My VPS has plenty of free space, but the script always fails with this error.
NOTE: I use the screen program to run the script in the background.
[screenshot: screenshot_20180210-184154]

crawling depth setting

Can you add a crawling-depth setting? Because my website has filtering and searching, the number of URL variants it generates is very large.

Stop and continue

The issue with this tool is that once it halts, you have to start all over again from scratch, and with large sites this is a very common scenario.
Since we already have the partially generated XML, it would be nice to continue from where it was interrupted. Let me know your thoughts on this and how to achieve it; I am willing to send a pull request once I have a better understanding of the code.

Syntax error

Just pulled the code and I get this:

$ python main.py www.cafonline.com --output sitemap.xml --verbose
Traceback (most recent call last):
File "main.py", line 8, in
import crawler
File "/home/francisco/python-sitemap/crawler.py", line 105
print(config.xml_header, file=self.output_file)
^
SyntaxError: invalid syntax

Tracker images are included

Tracker image links get added, but they should be left out. You could simply check that the image extension is not php or js, or that it is a valid image type, before adding it (a sketch follows the example below):
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:image><image:loc>https://analytics.2globalnomads.info/piwik.php?idsite=1&amp;rec=1</image:loc>

It appears that the exclusion parameters (--skipext, --exclude, --drop) have no effect on images.
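A sketch of the suggested extension check; the extension list and function name are illustrative:

from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}

def looks_like_image(url):
    # Reject data: URIs and script endpoints such as piwik.php.
    if url.startswith("data:"):
        return False
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in IMAGE_EXTENSIONS)

print(looks_like_image("https://analytics.2globalnomads.info/piwik.php?idsite=1"))  # False
print(looks_like_image("https://www.2globalnomads.info/logo.png"))                  # True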
