agude / wayback-machine-archiver
A Python script to submit web pages to the Wayback Machine for archiving.
Home Page: https://pypi.org/project/wayback-machine-archiver/
License: MIT License
I followed your suggestion on the Server Error issue:
Anyway, the most common reason for a 500 error is the Internet Archive rate-limiting you. My suggestion is to set the --rate-limit-wait parameter higher! It defaults to 5 seconds; I'd try 30 or even 60.
I tried 30 and this happens:
Microsoft Windows [Version 10.0.18363.1082]
(c) 2019 Microsoft Corporation. All rights reserved.
C:\Users\yewhe\Downloads>archiver --file fest.txt --rate-limit-wait 30
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
return list(map(*args))
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 35, in call_archiver
r = session.head(request_url, allow_redirects=True)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 553, in head
return self.request('HEAD', url, **kwargs)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 518, in request
resp = self.send(prep, **send_kwargs)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 661, in send
history = [resp for resp in gen] if allow_redirects else []
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 661, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 137, in resolve_redirects
raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\yewhe\AppData\Local\Programs\Python\Python38-32\Scripts\archiver.exe\__main__.py", line 7, in <module>
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 243, in main
pool.map(partial_call, archive_urls)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 771, in get
raise self._value
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
The Wayback Machine extension here has a feature called 'Bulk save', and I find its principle pretty similar to your archiver; the differences are the user interface and live updates. What's more, it works great! What do you think of this one?
Here is the change log the Internet Archive sent me: https://docs.google.com/document/d/19RJsRncGUw2qHqGGg9lqYZYf7KKXMDL1Mro5o1Qw6QI/edit# (also uploaded as a PDF to this issue).
SPN2 change log.pdf
And the API docs: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit (Also uploaded)
SPN2 public API page docs.pdf
Use build stages to only push to Pypi after the tests all succeed: https://docs.travis-ci.com/user/build-stages
There are no tests for this library! We should have a few at least!
They are not yet in the README.
Hi. Backing up Twitter throws error pages, but if we manually add the URL at
http://web.archive.org/save
and uncheck 'save error pages', the page is saved. It must be a problem with Twitter or something. Anyway, is there a flag with which we can unset this error-page setting? I am hoping this would help.
It would be nice to catch bugs like cb3d72a BEFORE I push a release. 🤦‍♂️
The script should generate a list of URLs to hit, and then run through them without blocking. There is no reason to wait for one request to finish before starting the next.
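A minimal sketch of that non-blocking approach using a thread pool. The function name and the injected `fetch` callable are illustrative, not the project's API; the save endpoint is the one the archiver's tracebacks show it hitting.

```python
from concurrent.futures import ThreadPoolExecutor

WAYBACK_SAVE = "https://web.archive.org/save/"  # endpoint seen in the logs above

def archive_all(urls, fetch, max_workers=4):
    """Submit every URL concurrently instead of waiting for each
    request to finish before starting the next one."""
    targets = [WAYBACK_SAVE + u for u in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order while the requests overlap
        return list(pool.map(fetch, targets))
```

In practice `fetch` would be something like `session.head`, with rate limiting layered on top so the pool doesn't trip the Internet Archive's limits faster than a serial loop would.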
Why is there no (clear) way to change the "Why: No Collection Info" that the Wayback Machine displays when you click a date? Normally it displays info like which bot submitted the snapshot.
pip install wayback-machine-archiver
Yields:
Collecting wayback-machine-archiver
Could not find a version that satisfies the requirement wayback-machine-archiver (from versions: )
No matching distribution found for wayback-machine-archiver
Let's run the tests on upload as well!
The Internet Archive only allows one backup per 10 minutes, so we shouldn't submit duplicate URLs.
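A small dedup pass before submission would avoid burning the rate limit on repeats. This is a sketch; the trailing-slash normalization is an assumption about what should count as a duplicate.

```python
def dedupe_urls(urls):
    """Drop duplicate URLs while preserving submission order, since the
    Wayback Machine only accepts one snapshot per URL every ~10 minutes."""
    seen = set()
    unique = []
    for url in urls:
        # Treat "http://a.com" and "http://a.com/" as the same target
        normalized = url.strip().rstrip("/")
        if normalized not in seen:
            seen.add(normalized)
            unique.append(url)
    return unique
```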
Around 10 minutes into running archiver for a txt file:
Here's the text file:
bluemaxima.txt
This happens: 500 Server Error: Internal Server Error for url
C:\Users\yewhe\Downloads>archiver --file bluemaxima.txt
ERROR:root:500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Game_Master_List
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Game_Master_List
ERROR:root:500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Special:RecentChangesLinked/Unity_Curation
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Special:RecentChangesLinked/Unity_Curation
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
return list(map(*args))
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Game_Master_List
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\yewhe\AppData\Local\Programs\Python\Python38-32\Scripts\archiver.exe\__main__.py", line 7, in <module>
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 243, in main
pool.map(partial_call, archive_urls)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 771, in get
raise self._value
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Game_Master_List
Does this mean that after the error link, the archiver doesn't run anymore? In other words, does archiver interrupt the process and ignore the remaining urls in the text file? Does archiver retry the link after a period of time?
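As of these logs the archiver raises on the first failure rather than retrying. One way it *could* retry transient 5xx responses is requests' built-in `Retry` support; this is a sketch of that idea, not current archiver behavior (`allowed_methods` requires urllib3 >= 1.26; older versions call it `method_whitelist`).

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=3, backoff=2):
    """Build a Session that retries 500/502/520 responses with
    exponential backoff instead of raising on the first failure."""
    retry = Retry(
        total=total,
        backoff_factor=backoff,          # waits backoff * 2**n between tries
        status_forcelist=[500, 502, 520],
        allowed_methods=["HEAD", "GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

With a session like this, a single 500 from web.archive.org would be retried a few times before the pool worker gives up and propagates the exception.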
I tried 60 for this file: jass.txt
...then 520 Server Error: UNKNOWN for url
Microsoft Windows [Version 10.0.18363.1082]
(c) 2019 Microsoft Corporation. All rights reserved.
C:\Users\yewhe\Downloads>archiver --file jass.txt --rate-limit-wait 60
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2017-ppl-lifetime-achievement-award/
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2017-ppl-lifetime-achievement-award/
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2013-uk-vocalist-of-the-year/
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2013-uk-vocalist-of-the-year/
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/and-the-judges-are/
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/and-the-judges-are/
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2020-prs-for-music-uk-jazz-act-of-the-year/
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2020-prs-for-music-uk-jazz-act-of-the-year/
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
return list(map(*args))
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2017-ppl-lifetime-achievement-award/
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\yewhe\AppData\Local\Programs\Python\Python38-32\Scripts\archiver.exe\__main__.py", line 7, in <module>
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 243, in main
pool.map(partial_call, archive_urls)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 771, in get
raise self._value
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2017-ppl-lifetime-achievement-award/
Originally posted by @Melonadev in https://github.com/agude/wayback-machine-archiver/issue_comments/710761066
Not all pages are contained in a sitemap, and not all sites have sitemaps. The script should allow a list of pages to be specified to back up.
Sitemaps do not often point to themselves, so we should also add the sitemap URL itself to the backup queue.
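A sketch of that behavior: pull every `<loc>` out of a sitemap and prepend the sitemap's own URL to the queue. The function name is illustrative; the namespace is the standard sitemaps.org one.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_url, sitemap_xml):
    """Extract <loc> entries and include the sitemap URL itself,
    since sitemaps rarely list themselves."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]
    return [sitemap_url] + urls
```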
Travis should try to install this project, and run the tests from #1.
The first txt file I fed into archiver after upgrading to version 1.9.1 contains 100 URLs, and it works with --rate-limit-wait 30 (1st attempt).
I tried the second one with --rate-limit-wait 70 and it fails:
guld2.txt
Microsoft Windows [Version 10.0.18363.1198]
(c) 2019 Microsoft Corporation. All rights reserved.
C:\Users\yewhe\Downloads\archiver files>archiver --file guld2.txt --rate-limit-wait 60
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://guldbaggen.se/nominering/sara-sommerfeld/
Traceback (most recent call last):
File "C:\Users\yewhe\AppData\Roaming\Python\Python38\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
r.raise_for_status()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://guldbaggen.se/nominering/sara-sommerfeld/
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
return list(map(*args))
File "C:\Users\yewhe\AppData\Roaming\Python\Python38\site-packages\wayback_machine_archiver\archiver.py", line 35, in call_archiver
r = session.head(request_url, allow_redirects=True)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 553, in head
return self.request('HEAD', url, **kwargs)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 518, in request
resp = self.send(prep, **send_kwargs)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 661, in send
history = [resp for resp in gen] if allow_redirects else []
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 661, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 137, in resolve_redirects
raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 100 redirects.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\yewhe\AppData\Local\Programs\Python\Python38-32\Scripts\archiver.exe\__main__.py", line 7, in <module>
File "C:\Users\yewhe\AppData\Roaming\Python\Python38\site-packages\wayback_machine_archiver\archiver.py", line 245, in main
pool.map(partial_call, archive_urls)
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 771, in get
raise self._value
requests.exceptions.TooManyRedirects: Exceeded 100 redirects.
And if I use --rate-limit-wait 120, it'd take too long.
Is there a way to import this into a Python script?
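The package doesn't document a public Python API, but the tracebacks above show what the archiver does internally: a `HEAD` request to the save endpoint with redirects followed. A hedged standalone equivalent (the function name is mine, not the package's):

```python
import requests

def save_to_wayback(url, session=None):
    """Ask the Wayback Machine to snapshot `url`, mirroring what
    archiver.py does: session.head(request_url, allow_redirects=True)."""
    session = session or requests.Session()
    r = session.head("https://web.archive.org/save/" + url, allow_redirects=True)
    r.raise_for_status()  # surface 5xx errors just like the CLI does
    return r.url
```

Passing in your own `session` lets you reuse the retrying session sketched earlier, or stub the network out in tests.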
Is there a parameter for updating archiver? Like a --update or something.
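As far as I can tell the CLI has no built-in update flag; since it's distributed on PyPI, the usual route is pip's own upgrade mechanism:

```shell
# Upgrade the installed package to the latest release on PyPI
pip install --upgrade wayback-machine-archiver
```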
Originally posted by @Melonadev in https://github.com/agude/wayback-machine-archiver/issue_comments/711156395
I'm using the latest version of pip (20.2.4), and I'm still having trouble reinstalling archiver:
C:\Users\yewhe>pip install wayback-machine-archiver
Collecting wayback-machine-archiver
Using cached wayback_machine_archiver-1.9.0-py3-none-any.whl (7.1 kB)
Collecting requests
Using cached requests-2.24.0-py2.py3-none-any.whl (61 kB)
Collecting idna<3,>=2.5
Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\python38\lib\site-packages (from requests->wayback-machine-archiver) (1.25.10)
Collecting certifi>=2017.4.17
Using cached certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Requirement already satisfied: chardet<4,>=3.0.2 in c:\python38\lib\site-packages (from requests->wayback-machine-archiver) (3.0.4)
Installing collected packages: idna, certifi, requests, wayback-machine-archiver
WARNING: Failed to write executable - trying to use .deleteme logic
ERROR: Could not install packages due to an EnvironmentError: [WinError 2] The system cannot find the file specified: 'c:\\python38\\Scripts\\archiver.exe' -> 'c:\\python38\\Scripts\\archiver.exe.deleteme'