wayback-machine-archiver's Issues

TooManyRedirects: Exceeded 30 redirects

I followed your suggestion on the Server Error issue:

Anyway, the most common reason for a 500 error is the Internet Archive rate-limiting you. My suggestion is to turn the --rate-limit-wait parameter higher! It defaults to 5 seconds; I'd try 30 or even 60.

I tried 30 and this happens:

Microsoft Windows [Version 10.0.18363.1082]
(c) 2019 Microsoft Corporation. All rights reserved.

C:\Users\yewhe\Downloads>archiver --file fest.txt --rate-limit-wait 30
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 35, in call_archiver
    r = session.head(request_url, allow_redirects=True)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 553, in head
    return self.request('HEAD', url, **kwargs)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 518, in request
    resp = self.send(prep, **send_kwargs)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 661, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 661, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 137, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\yewhe\AppData\Local\Programs\Python\Python38-32\Scripts\archiver.exe\__main__.py", line 7, in <module>
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 243, in main
    pool.map(partial_call, archive_urls)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

Similarity to a feature in Wayback Machine extension?

The Wayback Machine extension here has a feature called 'Bulk save', and I find its principle pretty similar to your archiver; the differences are the user interface and the live updates. What's more, it works great! What do you think of this one?

Add tests

There are no tests for this library! We should have a few at least!

Flag for turning off "save error pages"

Hi. Backing up Twitter pages produces error pages, but if we manually submit the URL at
http://web.archive.org/save
and uncheck "save error pages", the page is saved. It may be a problem with Twitter or something else. Anyway, is there a flag that turns off the error-page setting? I am hoping this would help.

Don't block on requests

The script should generate the list of URLs to hit and then run through them without blocking. There is no reason to wait for one request to finish before starting the next.
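
One way to read this request is as a thread pool that submits every URL up front. The sketch below is only an illustration under that assumption, not the archiver's actual implementation: `submit_all` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever function issues the HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def submit_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL concurrently and collect the outcomes.

    `fetch` is a hypothetical callable supplied by the caller (for example,
    a function that issues the HTTP request); it is not part of the archiver.
    Returns a dict mapping each URL to ("ok", result) or ("error", exception).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front so no request waits for the previous one.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = ("ok", future.result())
            except Exception as exc:
                # Record the failure and keep processing the other URLs.
                results[url] = ("error", exc)
    return results
```

Since the Internet Archive may rate-limit clients, a small `max_workers` value is probably the safer choice here.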

Fix pip install

pip install wayback-machine-archiver

Yields:

Collecting wayback-machine-archiver
Could not find a version that satisfies the requirement wayback-machine-archiver (from versions: )
No matching distribution found for wayback-machine-archiver

Server Error: Internal Server Error for url

About 10 minutes into running archiver on a text file (bluemaxima.txt), this happens: 500 Server Error: Internal Server Error for url

C:\Users\yewhe\Downloads>archiver --file bluemaxima.txt
ERROR:root:500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Game_Master_List
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Game_Master_List
ERROR:root:500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Special:RecentChangesLinked/Unity_Curation
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Special:RecentChangesLinked/Unity_Curation
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Game_Master_List
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\yewhe\AppData\Local\Programs\Python\Python38-32\Scripts\archiver.exe\__main__.py", line 7, in <module>
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 243, in main
    pool.map(partial_call, archive_urls)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://web.archive.org/save/https://bluemaxima.org/flashpoint/datahub/Game_Master_List

Does this mean that the archiver stops running after the failing link? In other words, does archiver abort the run and ignore the remaining URLs in the text file? Does archiver retry the link after a period of time?
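
The final traceback shows the worker's HTTPError being re-raised from `pool.map` in `main`, so the failure is not swallowed; the log alone does not show whether the remaining URLs were attempted. A minimal sketch of one way to retry a failing URL and keep going is shown below. It is an assumption about a possible design, not the archiver's behaviour: `call_with_retries`, its parameters, and the `func` callable are all hypothetical.

```python
import time


def call_with_retries(func, url, retries=3, wait=5.0, sleep=time.sleep):
    """Call func(url), retrying on failure, and never raise to the caller.

    `func` is a hypothetical callable that archives one URL and raises on
    error. Returns (url, True, None) on success, or (url, False, exception)
    once `retries` attempts have failed.
    """
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            func(url)
            return (url, True, None)
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                # Wait longer after each failure: wait, 2*wait, 3*wait, ...
                sleep(wait * attempt)
    return (url, False, last_exc)
```

Because the wrapper returns a status tuple instead of raising, mapping it over a list of URLs would let the failures be reported at the end rather than ending the run.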

520 Server Error for some URLS

I tried --rate-limit-wait 60 with this file: jass.txt

...then got 520 Server Error: UNKNOWN for url

Microsoft Windows [Version 10.0.18363.1082]
(c) 2019 Microsoft Corporation. All rights reserved.

C:\Users\yewhe\Downloads>archiver --file jass.txt --rate-limit-wait 60
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2017-ppl-lifetime-achievement-award/
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2017-ppl-lifetime-achievement-award/
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2013-uk-vocalist-of-the-year/
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2013-uk-vocalist-of-the-year/
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/and-the-judges-are/
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/and-the-judges-are/
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2020-prs-for-music-uk-jazz-act-of-the-year/
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2020-prs-for-music-uk-jazz-act-of-the-year/
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2017-ppl-lifetime-achievement-award/
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\yewhe\AppData\Local\Programs\Python\Python38-32\Scripts\archiver.exe\__main__.py", line 7, in <module>
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\wayback_machine_archiver\archiver.py", line 243, in main
    pool.map(partial_call, archive_urls)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://www.jazzfmawards.com/awards/2017-ppl-lifetime-achievement-award/

Originally posted by @Melonadev in https://github.com/agude/wayback-machine-archiver/issue_comments/710761066

requests.exceptions.TooManyRedirects: Exceeded 100 redirects.

The first txt file I fed into archiver after upgrading to version 1.9.1 contains 100 URLs, and it worked with --rate-limit-wait 30 on the first attempt.

I tried the second one with --rate-limit-wait 70 and it fails:
guld2.txt

Microsoft Windows [Version 10.0.18363.1198]
(c) 2019 Microsoft Corporation. All rights reserved.

C:\Users\yewhe\Downloads\archiver files>archiver --file guld2.txt --rate-limit-wait 60
ERROR:root:520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://guldbaggen.se/nominering/sara-sommerfeld/
Traceback (most recent call last):
  File "C:\Users\yewhe\AppData\Roaming\Python\Python38\site-packages\wayback_machine_archiver\archiver.py", line 38, in call_archiver
    r.raise_for_status()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\models.py", line 928, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 520 Server Error: UNKNOWN for url: https://web.archive.org/save/https://guldbaggen.se/nominering/sara-sommerfeld/
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "C:\Users\yewhe\AppData\Roaming\Python\Python38\site-packages\wayback_machine_archiver\archiver.py", line 35, in call_archiver
    r = session.head(request_url, allow_redirects=True)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 553, in head
    return self.request('HEAD', url, **kwargs)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 518, in request
    resp = self.send(prep, **send_kwargs)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 661, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 661, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\site-packages\requests\sessions.py", line 137, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 100 redirects.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\yewhe\AppData\Local\Programs\Python\Python38-32\Scripts\archiver.exe\__main__.py", line 7, in <module>
  File "C:\Users\yewhe\AppData\Roaming\Python\Python38\site-packages\wayback_machine_archiver\archiver.py", line 245, in main
    pool.map(partial_call, archive_urls)
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "c:\users\yewhe\appdata\local\programs\python\python38-32\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
requests.exceptions.TooManyRedirects: Exceeded 100 redirects.

And if I use --rate-limit-wait 120 it'd take too long.
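
As the tracebacks show, requests raises TooManyRedirects from `resolve_redirects` once the session's `max_redirects` is exceeded (30 in the earlier report, 100 here). To tell a redirect loop apart from a merely long chain, one could track the locations already visited. The sketch below is a diagnostic idea only, not something the archiver does: `follow_redirects` and the `next_location` callable are hypothetical, and `next_location` stands in for whatever looks up a URL's redirect target.

```python
def follow_redirects(start_url, next_location, max_redirects=30):
    """Follow a redirect chain and report how it ended.

    `next_location(url)` is a hypothetical callable returning the redirect
    target of `url`, or None when the response is not a redirect.
    Returns (status, chain), where status is "ok", "loop", or "too_many".
    """
    chain = [start_url]
    seen = {start_url}
    url = start_url
    while True:
        target = next_location(url)
        if target is None:
            return ("ok", chain)
        if target in seen:
            # The chain came back to a URL it already visited.
            return ("loop", chain + [target])
        if len(chain) > max_redirects:
            # max_redirects hops were already followed; stop here.
            return ("too_many", chain)
        chain.append(target)
        seen.add(target)
        url = target
```

A "loop" result would suggest that a longer wait or a higher limit is unlikely to help for that URL, while "too_many" leaves open that the chain is just long.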

Issues in updating

Is there a parameter for updating archiver? Like a --update or something.

Originally posted by @Melonadev in https://github.com/agude/wayback-machine-archiver/issue_comments/711156395

I'm using the latest version of pip (20.2.4), and I'm still having trouble reinstalling archiver:

C:\Users\yewhe>pip install wayback-machine-archiver
Collecting wayback-machine-archiver
  Using cached wayback_machine_archiver-1.9.0-py3-none-any.whl (7.1 kB)
Collecting requests
  Using cached requests-2.24.0-py2.py3-none-any.whl (61 kB)
Collecting idna<3,>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\python38\lib\site-packages (from requests->wayback-machine-archiver) (1.25.10)
Collecting certifi>=2017.4.17
  Using cached certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Requirement already satisfied: chardet<4,>=3.0.2 in c:\python38\lib\site-packages (from requests->wayback-machine-archiver) (3.0.4)
Installing collected packages: idna, certifi, requests, wayback-machine-archiver
  WARNING: Failed to write executable - trying to use .deleteme logic
ERROR: Could not install packages due to an EnvironmentError: [WinError 2] The system cannot find the file specified: 'c:\\python38\\Scripts\\archiver.exe' -> 'c:\\python38\\Scripts\\archiver.exe.deleteme'
