akamhy / waybackpy
Wayback Machine API interface & a command-line tool
Home Page: https://pypi.org/project/waybackpy/
License: MIT License
Add credit/link to the projects:
Should we have an option to make a webpage time-lapse using the Internet Archive (IA)?
https://github.com/akamhy/waybackpy#supported-features should have collapsed code or a link to the docs.
The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.
Tracking issue for:
It returns an incorrect archive.
Instead of having to convert timestamps into year, month, day, hour, and minutes and then pass all of those parameters separately like wayback_url_obj.near(year=year, month=month, day=day, hour=hour, minute=minute), it would be great if it was possible to just pass in a Unix timestamp (either integer or floating-point seconds since epoch, like time.time()) and have that automatically rounded to the nearest minute and converted internally into the appropriate format. Then the call could just be wayback_url_obj.near(my_timestamp), or, if you prefer a named parameter, wayback_url_obj.near(timestamp=my_timestamp).
The CLI code currently doesn't support some new functionalities in the wrapper: archive_url and JSON.
We need different exceptions, and all these new exceptions must inherit from WaybackError:
NotSavedError: when saving fails
ArchiveNotFoundError: when near() and its child methods fail
ArchiveScrapingError: when the archive URL can't be scraped from the header
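A sketch of the proposed hierarchy; the class names are taken from the list above, and a stand-in base class is defined here so the snippet is self-contained (in waybackpy itself these would inherit from waybackpy.exceptions.WaybackError):

```python
class WaybackError(Exception):
    """Stand-in for waybackpy.exceptions.WaybackError."""

class NotSavedError(WaybackError):
    """Raised when saving a page to the Wayback Machine fails."""

class ArchiveNotFoundError(WaybackError):
    """Raised when near() and its child methods can't find an archive."""

class ArchiveScrapingError(WaybackError):
    """Raised when the archive URL can't be scraped from the response header."""

# Because all subclasses inherit from WaybackError, callers can still
# use a single catch-all while new code catches the specific failures.
try:
    raise NotSavedError("saving failed")
except WaybackError as err:
    print(err)
```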
It works if I set the URL to your GitHub page as in your example, but when I try this URL, it does not work.
I am using version 2.4.0.
>>> from waybackpy import Cdx
>>> user_agent = 'some user agent string'
>>> url = 'https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> cdx = Cdx(url=url, user_agent=user_agent)
>>> snapshots = cdx.snapshots()
>>> for i, snapshot in enumerate(snapshots, start=1):
... snapshot_printer(i, snapshot)
...
Add tests for the new methods and update the older ones.
Problem: https://github.com/akamhy/foo-bar-is-real is a 404. I am requesting an archive for a URL that doesn't exist and therefore isn't archived on the Wayback Machine.
The following traceback is not helpful to folks who don't know Python, and it looks unprofessional from a CLI point of view.
$ waybackpy --oldest --url https://github.com/akamhy/foo-bar-is-real
Traceback (most recent call last):
File "/home/akamhy/anaconda3/bin/waybackpy", line 8, in <module>
sys.exit(main())
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 254, in main
output = args_handler(args)
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 138, in args_handler
return _oldest(obj)
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 25, in _oldest
return obj.oldest()
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/wrapper.py", line 202, in oldest
return self.near(year=year)
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/wrapper.py", line 184, in near
raise WaybackError(
waybackpy.exceptions.WaybackError: Can not find archive for 'https://github.com/akamhy/foo-bar-is-real' try later or use wayback.Url(url, user_agent).save() to create a new archive.
Right now, there seems to be no way to list all of the available archives by timestamp. Unless I am missing something, the only options are to call .oldest(), .newest(), or request an archive .near() a specific timestamp. It would be great if it was possible to get a list of all available archives, which could then be individually fetched as desired with .get(), printed to get the archive URL, etc.
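One possible building block for such a listing is the Wayback CDX endpoint mentioned elsewhere in these issues (…output=json), whose JSON response is an array of rows headed by the field names. A hedged sketch with an abridged, illustrative response (the digests are truncated here):

```python
import json

# Abridged sample of what https://web.archive.org/cdx/search/cdx?url=<url>&output=json
# returns; the first row names the fields, the rest are snapshots.
sample = """[
 ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
 ["com,github)/", "20190101065422", "https://github.com/", "text/html", "200", "G5OJ", "18291"],
 ["com,github)/", "20190101082252", "https://github.com/", "text/html", "200", "3FGP", "18299"]
]"""

fields, *rows = json.loads(sample)
snapshots = [dict(zip(fields, row)) for row in rows]

# Each (timestamp, original) pair maps to a fetchable archive URL.
archive_urls = ["https://web.archive.org/web/{timestamp}/{original}".format(**snap)
                for snap in snapshots]
for url in archive_urls:
    print(url)
```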
I get this error sometimes and there really isn't much information in (the string version of) this error.
2021-01-13T20:26:54 | ERROR | save_archive | archive of <some URL> failed: Error while retrieving https://web.archive.org/save/<some URL>
I seem to get this error a lot of the time when saving the archive actually succeeded (rechecking later finds it). It seems like the error statement could be more specific about the failure and I'm not sure that the upgrade suggestion is helpful when the version is current.
No archive URL found in the API response. If 'https://old.reddit.com/<some post on Reddit here>' can be accessed via your web browser then either this version of waybackpy (2.4.0) is out of date or WayBack Machine is malfunctioning. Visit 'https://github.com/akamhy/waybackpy' for the latest version of waybackpy.
Header:
{'Server': 'nginx/1.15.8', 'Date': 'Thu, 14 Jan 2021 22:13:08 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '232', 'Connection': 'keep-alive', 'X-App-Server': 'wwwb-app52', 'X-ts': '404', 'X-Tr': '138505', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': 'Google'}
Add documentation for the new functionalities.
IMPORTANT: Add the new archive_url at the top of the docs.
Too Many Requests
We are limiting the number of URLs you can submit to be Archived to the Wayback Machine, using the Save Page Now features, to no more than 15 per minute.
If you submit more than that we will block Save Page Now requests from your IP number for 5 minutes.
Please feel free to write to us at [email protected] if you have questions about this. Please include your IP address and any URLs in the email so we can provide you with better service.
akamhy at device in ~
$ waybackpy -u https://github.com/akamhy/videohash/blob/main/videohash/vhash.py -s
Traceback (most recent call last):
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 377, in _get_response
return s.get(url, headers=headers)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 555, in get
return self.request('GET', url, **kwargs)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 677, in send
history = [resp for resp in gen]
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 677, in <listcomp>
history = [resp for resp in gen]
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 166, in resolve_redirects
raise TooManyRedirects('Exceeded {} redirects.'.format(self.max_redirects), response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 15, in _save
return obj.save()
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/wrapper.py", line 160, in save
instance=self,
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 242, in _archive_url_parser
res = _get_response(url, headers=headers)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 389, in _get_response
raise exc
waybackpy.exceptions.WaybackError: Error while retrieving https://web.archive.org/web/202101290913/https://github.com/akamhy/videohash/blob/main/videohash/vhash.py.
Exceeded 30 redirects.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/akamhy/.pyenv/versions/3.5.3/bin/waybackpy", line 33, in <module>
sys.exit(load_entry_point('waybackpy==2.4.1', 'console_scripts', 'waybackpy')())
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 327, in main
print(args_handler(parse_args(argv)))
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 178, in args_handler
output = _save(obj)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 31, in _save
raise WaybackError(err)
waybackpy.exceptions.WaybackError: Error while retrieving https://web.archive.org/web/202101290913/https://github.com/akamhy/videohash/blob/main/videohash/vhash.py.
Exceeded 30 redirects.
akamhy at device in ~
$
$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json
{'archived_snapshots': {'closest': {'timestamp': '20210121111134', 'url': 'http://web.archive.org/web/20210121111134/https://en.wikipedia.org/wiki/JSON', 'status': '200', 'available': True}}, 'url': 'https://en.wikipedia.org/wiki/JSON#Syntax'}
#attempt to parse the json output
$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json | jq
parse error: Invalid numeric literal at line 1, column 22
While the equivalent query using curl [1] outputs valid JSON:
$ curl "http://archive.org/wayback/available?url=https://en.wikipedia.org/wiki/JSON#Syntax"
{"archived_snapshots": {"closest": {"timestamp": "20210121111134", "url": "http://web.archive.org/web/20210121111134/https://en.wikipedia.org/wiki/JSON", "status": "200", "available": true}}, "url": "https://en.wikipedia.org/wiki/JSON"}
#attempt to parse the json output
curl "http://archive.org/wayback/available?url=https://en.wikipedia.org/wiki/JSON#Syntax" | jq
[works as expected]
It seems waybackpy's JSON is invalid because:
it uses single quotes (') to denote strings instead of double quotes (")
True is uppercase
Just to exemplify, it becomes valid if I manually fix the two points above:
$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json | tr "'" "\"" | sed -e 's/True/true/g' | jq
[works as expected]
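The likely root cause is that the CLI prints the Python repr of the response dict rather than serialising it. A minimal sketch of the distinction, assuming the fix is simply to route the dict through json.dumps before printing (the dict here is a trimmed sample of the response shown above):

```python
import json

data = {"archived_snapshots": {"closest": {"available": True,
        "status": "200", "timestamp": "20210121111134"}}}

print(str(data))         # Python repr: single quotes, uppercase True; jq rejects it
print(json.dumps(data))  # valid JSON: double quotes, lowercase true; jq parses it
```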
I tried upgrading from 2.4.2 to 2.4.4 and the first page I tried to archive returned a new error:
archive of https://old.reddit.com/r/personalfinance/comments/s4gdfx/how_i_recovered_my_50k_investment_stolen_off/ failed: URL cannot be archived by wayback machine as it is a redirect.
Header:
save redirected
This page is not a redirect. It appears that the archive actually worked, though: it's successfully archived on archive.org with a timestamp roughly matching the above error in my log.
get() is fetching the wrong version of archives. I've seen this happen to multiple different archives and it is a reproducible issue. Here is an example:
>>> import waybackpy
>>> user_agent = "something goes here"
>>> url = 'https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> url_object = waybackpy.Url(url, user_agent)
>>> archive = url_object.newest()
>>> str(archive)
'https://web.archive.org/web/20210111103734/https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> archive.get().find("[removed]")
50479
>>> archive.get().find("become your own BOSS")
-1
However, that archive version does not include the string [removed], and it should show the spam text that included the become your own BOSS phrase. If you fetch the page separately, you can see that [removed] is not present and the expected text is there.
>>> import requests
>>> response = requests.get(str(archive))
>>> response.text.find("[removed]")
-1
>>> response.text.find("become your own BOSS")
61087
For some reason, get() seems to be fetching an older version of the archive.
Currently the CLI supports fewer features relative to wrapper.py (the module).
I know that the CLI tool has little to do with this, but maybe someone can point me in the right direction. I'm trying to archive a page, but instead of archiving a fresh copy, I'm getting an outdated copy of the page.
Any reason as to why this is happening?
Users should be able to get the webpage using this package.
https://web.archive.org/cdx/search/cdx?url=https://github.com/&output=json
Using str.format is better for building strings.
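To illustrate the .format preference noted above, a small sketch comparing %-interpolation with str.format (the timestamp and URL are just sample values):

```python
timestamp, original = "20190101065422", "https://github.com/"

# %-style interpolation: positional, easy to misorder
old_style = "https://web.archive.org/web/%s/%s" % (timestamp, original)

# str.format: named fields are self-documenting and reorderable
new_style = "https://web.archive.org/web/{timestamp}/{original}".format(
    timestamp=timestamp, original=original)

assert old_style == new_style
print(new_style)
```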
The error occurs in 'wrapper.py' when the encoding is None. An example webpage that triggers the error: "https://akamhy.github.io/". The file that causes the issue has an extension of '.woff2'.
Error:
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
AttributeError: 'NoneType' object has no attribute 'replace'
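A hedged sketch of a possible fix, based on the decode call shown in the traceback: guard against encoding being None (as happens with binary assets like .woff2) before calling .replace(). safe_decode is a hypothetical helper name, not the actual wrapper.py code:

```python
def safe_decode(content, encoding):
    """Hypothetical helper mirroring the failing line in wrapper.py,
    but falling back to UTF-8 when the response has no encoding."""
    if encoding is None:
        encoding = "UTF-8"
    # Same substitution as the original line, now on a guaranteed string.
    return content.decode(encoding.replace("text/html", "UTF-8", 1), errors="replace")

print(safe_decode(b"hello", None))         # no AttributeError now
print(safe_decode(b"hello", "text/html"))
```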
In my case I want to download all the versions of a domain (HTML, JS, ...) and also its subdomains, so to do that I use Cdx and then download using the get method.
Create a wildcard search class? Because Cdx is not really compatible with the Url class.
Source: confidential via Email (primary)
The docs will grow further once Cdx is supported; the README is not the place for big docs.
The availability API is not robust compared to CDX. We can use CDX as a reliable fallback.
com,github)/ 20190101054633 http://github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 323
com,github)/ 20190101065422 https://github.com/ text/html 200 G5OJHNLJSMMMEABXKD4C723QZ4HVYG7I 18291
com,github)/ 20190101082252 https://github.com/ text/html 200 3FGP34I4EWYWIDIQ6T3WMQ2E47X6GPTZ 18299
com,github)/ 20190101105635 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 324
com,github)/ 20190101110106 https://github.com/ text/html 200 VGYZSUH55NEH5F5AJ3HJ5RMJZGOKU6CP 18285
com,github)/ 20190101121552 http://github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 321
com,github)/ 20190101131350 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 323
com,github)/ 20190101135435 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 324
com,github)/ 20190101141556 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 325
Prefer --no-file so the existing API doesn't change for current users.
The coverage is below 90% and thus unacceptable. Add more tests.
Hi, are there any plans to make waybackpy available on conda-forge?
Cheers
waybackpy --url https://github.com/akamhy/waybackpy --user_agent haha --total
Concept from:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
Why? It is used by many to find vulnerabilities in small- to medium-sized websites. IMO, it will be a good feature.
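A sketch of the concept from the gist above, assuming the public Wayback CDX API's wildcard (url=*.domain/*), fl, and collapse parameters; the query is only constructed here, not sent:

```python
from urllib.parse import urlencode

def wildcard_cdx_query(domain):
    """Build a CDX query that enumerates archived URLs under a domain
    and its subdomains (the waybackurls-style technique sketched above)."""
    params = {
        "url": "*.{0}/*".format(domain),  # wildcard over host and path
        "output": "json",
        "fl": "original",                 # only return the original URLs
        "collapse": "urlkey",             # deduplicate repeated captures
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(wildcard_cdx_query("example.com"))
```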
I would like to provide an error to handle this.
import waybackpy
w = waybackpy.Url('http://google.com')
w.save() #=> redirected to https://www.google.com
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/eggplants/.pyenv/versions/3.7.9/lib/python3.7/site-packages/waybackpy/wrapper.py", line 161, in save
instance=self,
File "/home/eggplants/.pyenv/versions/3.7.9/lib/python3.7/site-packages/waybackpy/utils.py", line 303, in _archive_url_parser
raise WaybackError(exc_message)
waybackpy.exceptions.WaybackError: No archive URL found in the API response. If 'http://google.com' can be accessed via your web browser then either Wayback Machine is malfunctioning or it refused to archive your URL.
Header:
save redirected
A 5000-build backlog is not welcoming.
We are currently lacking command-line support. This might be a problem for users who are not using Python.
But don't use any non-stdlib package; we don't want any non-stdlib dependencies at all.
Weekly testing will help detect API changes within a week.