akamhy / waybackpy
Wayback Machine API interface & a command-line tool
Home Page: https://pypi.org/project/waybackpy/
License: MIT License
Add credit/link to the projects:
Should we have an option to make a webpage time-lapse using the Internet Archive (IA)?
https://github.com/akamhy/waybackpy#supported-features should have collapsed code or a link to the docs.
The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.
Tracking issue for:
It returns an incorrect archive.
Instead of having to convert timestamps into year, month, day, hour, and minutes and then pass all of those parameters separately like wayback_url_obj.near(year=year, month=month, day=day, hour=hour, minute=minute), it would be great if it was possible to just pass in a Unix timestamp (either integer or floating-point seconds since epoch, like time.time()) and have that automatically rounded to the nearest minute and converted internally into the appropriate format. Then the call could just be wayback_url_obj.near(my_timestamp), or, if you prefer a named parameter, wayback_url_obj.near(timestamp=my_timestamp).
The CLI code currently doesn't support some new functionalities in the wrapper: archive_url and JSON.
We need different exceptions, and all these new exceptions must inherit from WaybackError:
NotSavedError: when saving fails
ArchiveNotFoundError: when near() and its child methods fail
ArchiveScrapingError: when the archive URL can't be scraped from the header
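A sketch of the proposed hierarchy; the class names are taken from the list above, and a stand-in base class is defined here so the snippet is self-contained (in waybackpy itself these would inherit from waybackpy.exceptions.WaybackError):

```python
class WaybackError(Exception):
    """Stand-in for waybackpy.exceptions.WaybackError."""

class NotSavedError(WaybackError):
    """Raised when saving a page to the Wayback Machine fails."""

class ArchiveNotFoundError(WaybackError):
    """Raised when near() and its child methods can't find an archive."""

class ArchiveScrapingError(WaybackError):
    """Raised when the archive URL can't be scraped from the response header."""

# Because all subclasses inherit from WaybackError, callers can still
# use a single catch-all while new code catches the specific failures.
try:
    raise NotSavedError("saving failed")
except WaybackError as err:
    print(err)
```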
It works if I set the URL to your GitHub page as in your example, but when I try this URL, it does not work.
I am using version 2.4.0.
>>> from waybackpy import Cdx
>>> user_agent = 'some user agent string'
>>> url = 'https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> cdx = Cdx(url=url, user_agent=user_agent)
>>> snapshots = cdx.snapshots()
>>> for i, snapshot in enumerate(snapshots, start=1):
... snapshot_printer(i, snapshot)
...
Add tests for the new methods and update the older ones.
Problem: https://github.com/akamhy/foo-bar-is-real is a 404. I am requesting an archive for a URL that doesn't exist and therefore isn't archived on the Wayback Machine.
The following traceback is not helpful to folks who don't know Python, and it looks unprofessional from a CLI point of view.
$ waybackpy --oldest --url https://github.com/akamhy/foo-bar-is-real
Traceback (most recent call last):
File "/home/akamhy/anaconda3/bin/waybackpy", line 8, in <module>
sys.exit(main())
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 254, in main
output = args_handler(args)
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 138, in args_handler
return _oldest(obj)
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 25, in _oldest
return obj.oldest()
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/wrapper.py", line 202, in oldest
return self.near(year=year)
File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/wrapper.py", line 184, in near
raise WaybackError(
waybackpy.exceptions.WaybackError: Can not find archive for 'https://github.com/akamhy/foo-bar-is-real' try later or use wayback.Url(url, user_agent).save() to create a new archive.
Right now, there seems to be no way to list all of the available archives by timestamp. Unless I am missing something, the only options are to call .oldest(), .newest(), or request an archive .near() a specific timestamp. It would be great if it was possible to get a list of all available archives, which could then be individually fetched as desired with .get(), printed to get the archive URL, etc.
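One possible building block for such a listing is the Wayback CDX endpoint mentioned elsewhere in these issues (…output=json), whose JSON response is an array of rows headed by the field names. A hedged sketch with an abridged, illustrative response (the digests are truncated here):

```python
import json

# Abridged sample of what https://web.archive.org/cdx/search/cdx?url=<url>&output=json
# returns; the first row names the fields, the rest are snapshots.
sample = """[
 ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
 ["com,github)/", "20190101065422", "https://github.com/", "text/html", "200", "G5OJ", "18291"],
 ["com,github)/", "20190101082252", "https://github.com/", "text/html", "200", "3FGP", "18299"]
]"""

fields, *rows = json.loads(sample)
snapshots = [dict(zip(fields, row)) for row in rows]

# Each (timestamp, original) pair maps to a fetchable archive URL.
archive_urls = ["https://web.archive.org/web/{timestamp}/{original}".format(**snap)
                for snap in snapshots]
for url in archive_urls:
    print(url)
```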
I get this error sometimes and there really isn't much information in (the string version of) this error.
2021-01-13T20:26:54 | ERROR | save_archive | archive of <some URL> failed: Error while retrieving https://web.archive.org/save/<some URL>
I seem to get this error a lot of the time when saving the archive actually succeeded (rechecking later finds it). It seems like the error statement could be more specific about the failure and I'm not sure that the upgrade suggestion is helpful when the version is current.
No archive URL found in the API response. If 'https://old.reddit.com/<some post on Reddit here>' can be accessed via your web browser then either this version of waybackpy (2.4.0) is out of date or WayBack Machine is malfunctioning. Visit 'https://github.com/akamhy/waybackpy' for the latest version of waybackpy.
Header:
{'Server': 'nginx/1.15.8', 'Date': 'Thu, 14 Jan 2021 22:13:08 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '232', 'Connection': 'keep-alive', 'X-App-Server': 'wwwb-app52', 'X-ts': '404', 'X-Tr': '138505', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': 'Google'}
Add documentation for the new functionalities.
IMPORTANT: Add the new archive_url at the top of the docs.
Too Many Requests
We are limiting the number of URLs you can submit to be Archived to the Wayback Machine, using the Save Page Now features, to no more than 15 per minute.
If you submit more than that we will block Save Page Now requests from your IP number for 5 minutes.
Please feel free to write to us at [email protected] if you have questions about this. Please include your IP address and any URLs in the email so we can provide you with better service.
akamhy at device in ~
$ waybackpy -u https://github.com/akamhy/videohash/blob/main/videohash/vhash.py -s
Traceback (most recent call last):
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 377, in _get_response
return s.get(url, headers=headers)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 555, in get
return self.request('GET', url, **kwargs)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 677, in send
history = [resp for resp in gen]
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 677, in <listcomp>
history = [resp for resp in gen]
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 166, in resolve_redirects
raise TooManyRedirects('Exceeded {} redirects.'.format(self.max_redirects), response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 15, in _save
return obj.save()
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/wrapper.py", line 160, in save
instance=self,
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 242, in _archive_url_parser
res = _get_response(url, headers=headers)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 389, in _get_response
raise exc
waybackpy.exceptions.WaybackError: Error while retrieving https://web.archive.org/web/202101290913/https://github.com/akamhy/videohash/blob/main/videohash/vhash.py.
Exceeded 30 redirects.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/akamhy/.pyenv/versions/3.5.3/bin/waybackpy", line 33, in <module>
sys.exit(load_entry_point('waybackpy==2.4.1', 'console_scripts', 'waybackpy')())
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 327, in main
print(args_handler(parse_args(argv)))
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 178, in args_handler
output = _save(obj)
File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 31, in _save
raise WaybackError(err)
waybackpy.exceptions.WaybackError: Error while retrieving https://web.archive.org/web/202101290913/https://github.com/akamhy/videohash/blob/main/videohash/vhash.py.
Exceeded 30 redirects.
akamhy at device in ~
$
$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json
{'archived_snapshots': {'closest': {'timestamp': '20210121111134', 'url': 'http://web.archive.org/web/20210121111134/https://en.wikipedia.org/wiki/JSON', 'status': '200', 'available': True}}, 'url': 'https://en.wikipedia.org/wiki/JSON#Syntax'}
#attempt to parse the json output
$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json | jq
parse error: Invalid numeric literal at line 1, column 22
While the equivalent query using curl [1] outputs valid JSON:
$ curl "http://archive.org/wayback/available?url=https://en.wikipedia.org/wiki/JSON#Syntax"
{"archived_snapshots": {"closest": {"timestamp": "20210121111134", "url": "http://web.archive.org/web/20210121111134/https://en.wikipedia.org/wiki/JSON", "status": "200", "available": true}}, "url": "https://en.wikipedia.org/wiki/JSON"}
#attempt to parse the json output
curl "http://archive.org/wayback/available?url=https://en.wikipedia.org/wiki/JSON#Syntax" | jq
[works as expected]
It seems waybackpy's JSON is invalid because:
it uses single quotes (') to denote strings instead of double quotes (")
True is uppercase
Just to exemplify, it becomes valid if I manually fix the two points above:
$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json | tr "'" "\"" | sed -e 's/True/true/g' | jq
[works as expected]
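The likely root cause is that the CLI prints the Python repr of the response dict rather than serialising it. A minimal sketch of the distinction, assuming the fix is simply to route the dict through json.dumps before printing (the dict here is a trimmed sample of the response shown above):

```python
import json

data = {"archived_snapshots": {"closest": {"available": True,
        "status": "200", "timestamp": "20210121111134"}}}

print(str(data))         # Python repr: single quotes, uppercase True; jq rejects it
print(json.dumps(data))  # valid JSON: double quotes, lowercase true; jq parses it
```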
I tried upgrading from 2.4.2 to 2.4.4 and the first page I tried to archive returned a new error:
archive of https://old.reddit.com/r/personalfinance/comments/s4gdfx/how_i_recovered_my_50k_investment_stolen_off/ failed: URL cannot be archived by wayback machine as it is a redirect.
Header:
save redirected
This page is not a redirect. It appears that the archive actually worked, though: it's successfully archived on archive.org with a timestamp roughly matching the above error in my log.
get() is fetching the wrong version of archives. I've seen this happen to multiple different archives and it is a reproducible issue. Here is an example:
>>> import waybackpy
>>> user_agent = "something goes here"
>>> url = 'https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> url_object = waybackpy.Url(url, user_agent)
>>> archive = url_object.newest()
>>> str(archive)
'https://web.archive.org/web/20210111103734/https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> archive.get().find("[removed]")
50479
>>> archive.get().find("become your own BOSS")
-1
However, that archive version does not include the string [removed], and it should show the spam text that included the become your own BOSS phrase. If you fetch the page separately, you can see that [removed] is not present and the expected text is there.
>>> import requests
>>> response = requests.get(str(archive))
>>> response.text.find("[removed]")
-1
>>> response.text.find("become your own BOSS")
61087
For some reason, get() seems to be fetching an older version of the archive.
Currently the CLI supports fewer features relative to wrapper.py (the module).
I know that the CLI tool has little to do with this, but maybe someone can point me in the right direction. I'm trying to archive a page, but instead of archiving a fresh copy, I'm getting an outdated copy of the page.
Any reason as to why this is happening?
Users should be able to get the webpage using this package.
https://web.archive.org/cdx/search/cdx?url=https://github.com/&output=json
Using str.format is better for building strings.
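To illustrate the .format preference noted above, a small sketch comparing %-interpolation with str.format (the timestamp and URL are just sample values):

```python
timestamp, original = "20190101065422", "https://github.com/"

# %-style interpolation: positional, easy to misorder
old_style = "https://web.archive.org/web/%s/%s" % (timestamp, original)

# str.format: named fields are self-documenting and reorderable
new_style = "https://web.archive.org/web/{timestamp}/{original}".format(
    timestamp=timestamp, original=original)

assert old_style == new_style
print(new_style)
```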
The error occurs in 'wrapper.py' when the encoding is None. An example webpage that triggers the error: "https://akamhy.github.io/". The file that causes the issue has an extension of '.woff2'.
Error:
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
AttributeError: 'NoneType' object has no attribute 'replace'
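A hedged sketch of a possible fix, based on the decode call shown in the traceback: guard against encoding being None (as happens with binary assets like .woff2) before calling .replace(). safe_decode is a hypothetical helper name, not the actual wrapper.py code:

```python
def safe_decode(content, encoding):
    """Hypothetical helper mirroring the failing line in wrapper.py,
    but falling back to UTF-8 when the response has no encoding."""
    if encoding is None:
        encoding = "UTF-8"
    # Same substitution as the original line, now on a guaranteed string.
    return content.decode(encoding.replace("text/html", "UTF-8", 1), errors="replace")

print(safe_decode(b"hello", None))         # no AttributeError now
print(safe_decode(b"hello", "text/html"))
```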
In my case I want to download all the versions of a domain (HTML, JS, ...) and also its subdomains, so to do that I use Cdx and then download using the get method.
Create a wildcard search class? Because Cdx is not really compatible with the Url class.
Source: confidential via Email (primary)
The docs will grow further once Cdx is supported; the README is not the place for big docs.
The availability API is not robust compared to CDX. We can use CDX as a reliable fallback.
com,github)/ 20190101054633 http://github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 323
com,github)/ 20190101065422 https://github.com/ text/html 200 G5OJHNLJSMMMEABXKD4C723QZ4HVYG7I 18291
com,github)/ 20190101082252 https://github.com/ text/html 200 3FGP34I4EWYWIDIQ6T3WMQ2E47X6GPTZ 18299
com,github)/ 20190101105635 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 324
com,github)/ 20190101110106 https://github.com/ text/html 200 VGYZSUH55NEH5F5AJ3HJ5RMJZGOKU6CP 18285
com,github)/ 20190101121552 http://github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 321
com,github)/ 20190101131350 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 323
com,github)/ 20190101135435 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 324
com,github)/ 20190101141556 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 325
Prefer --no-file so the existing API doesn't change for current users.
The coverage is below 90% and thus unacceptable. Add more tests.
Hi, are there any plans to make waybackpy available on conda-forge?
Cheers
waybackpy --url https://github.com/akamhy/waybackpy --user_agent haha --total
Concept from:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
Why? It is used by many to find vulnerabilities in small- to medium-sized websites. IMO, it will be a good feature.
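A sketch of the concept from the gist above, assuming the public Wayback CDX API's wildcard (url=*.domain/*), fl, and collapse parameters; the query is only constructed here, not sent:

```python
from urllib.parse import urlencode

def wildcard_cdx_query(domain):
    """Build a CDX query that enumerates archived URLs under a domain
    and its subdomains (the waybackurls-style technique sketched above)."""
    params = {
        "url": "*.{0}/*".format(domain),  # wildcard over host and path
        "output": "json",
        "fl": "original",                 # only return the original URLs
        "collapse": "urlkey",             # deduplicate repeated captures
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(wildcard_cdx_query("example.com"))
```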
I would like to provide an error to handle this.
import waybackpy
w = waybackpy.Url('http://google.com')
w.save() #=> redirected to https://www.google.com
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/eggplants/.pyenv/versions/3.7.9/lib/python3.7/site-packages/waybackpy/wrapper.py", line 161, in save
instance=self,
File "/home/eggplants/.pyenv/versions/3.7.9/lib/python3.7/site-packages/waybackpy/utils.py", line 303, in _archive_url_parser
raise WaybackError(exc_message)
waybackpy.exceptions.WaybackError: No archive URL found in the API response. If 'http://google.com' can be accessed via your web browser then either Wayback Machine is malfunctioning or it refused to archive your URL.
Header:
save redirected
A 5000-build backlog is not welcoming.
We are currently lacking command-line support. This might be a problem for users who are not using Python.
But don't use any non-stdlib package; we don't want any non-stdlib dependencies at all.
Weekly testing will help detect API changes within a week.