
waybackpy's People

Contributors

akamhy, anticompositenumber, arztklein, deepsource-autofix[bot], deepsourcebot, dv11364, eggplants, jfinkhaeuser, jonasjancarik, mend-bolt-for-github[bot], pyup-bot, rafaelrdealmeida, xrisk

waybackpy's Issues

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

allow near() to be called with a timestamp

Instead of having to convert timestamps into year, month, day, hour, and minutes and then pass all of those parameters separately like wayback_url_obj.near(year=year, month=month, day=day, hour=hour, minute=minute), it would be great if it was possible to just pass in a Unix timestamp (either integer or floating point seconds since epoch like time.time()) and have that automatically rounded to the nearest minute and converted internally into the appropriate format.

Then the call could just be like wayback_url_obj.near(my_timestamp) or if you prefer a named parameter wayback_url_obj.near(timestamp=my_timestamp).
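A minimal sketch of the requested conversion, assuming a hypothetical helper named timestamp_to_near_kwargs that maps a Unix timestamp onto the keyword arguments near() already accepts:

```python
from datetime import datetime, timezone

def timestamp_to_near_kwargs(timestamp):
    """Convert Unix seconds-since-epoch (int or float) into the keyword
    arguments that near() already accepts, rounded to the nearest minute."""
    # Round to the nearest minute before converting.
    rounded = round(timestamp / 60) * 60
    dt = datetime.fromtimestamp(rounded, tz=timezone.utc)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
    }
```

With such a helper, the call becomes `wayback_url_obj.near(**timestamp_to_near_kwargs(time.time()))`, and near() itself could accept a `timestamp=` parameter that applies the same conversion internally.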

Update CLI

The CLI code currently doesn't support some of the new functionality in the wrapper, such as archive_url and JSON output.

Add New Exceptions, inherit from WaybackError

We need several new exceptions, and all of them must inherit from WaybackError.

List of proposed Exceptions

  • NotSavedError when saving fails

  • ArchiveNotFoundError when near() and its child methods fail

  • ArchiveScrapingError when the archive URL can't be scraped from the response header
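The proposed hierarchy could look like the sketch below. WaybackError already exists in waybackpy.exceptions; it is redefined here only so the snippet is self-contained, and the three subclass names are the proposals above.

```python
class WaybackError(Exception):
    """Stand-in for the existing waybackpy.exceptions.WaybackError."""

class NotSavedError(WaybackError):
    """Raised when saving a page to the Wayback Machine fails."""

class ArchiveNotFoundError(WaybackError):
    """Raised when near() and the methods built on it find no archive."""

class ArchiveScrapingError(WaybackError):
    """Raised when the archive URL cannot be scraped from the response header."""
```

Because they all inherit from WaybackError, existing `except WaybackError:` handlers keep working while new code can catch the specific failure it cares about.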

Cdx() is not working for some URLs

It works if I set the url to your Github page as in your example, but when I try this URL, it does not work.

I am using version 2.4.0.

>>> from waybackpy import Cdx
>>> user_agent = 'some user agent string'
>>> url = 'https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> cdx = Cdx(url=url, user_agent=user_agent)
>>> snapshots = cdx.snapshots()
>>> for i, snapshot in enumerate(snapshots, start=1):
...     snapshot_printer(i, snapshot)
... 

Update tests

Add tests for the new methods and update the older ones.

cli.py (near, archive, oldest, newest): Output a nice message telling the user that the page isn't archived

Problem: https://github.com/akamhy/foo-bar-is-real is a 404. I am requesting an archive for a URL that doesn't exist and is therefore not archived on the Wayback Machine.

The following traceback is not helpful to folks who don't know Python, and it looks unprofessional from a CLI point of view.

$ waybackpy --oldest --url https://github.com/akamhy/foo-bar-is-real
Traceback (most recent call last):
  File "/home/akamhy/anaconda3/bin/waybackpy", line 8, in <module>
    sys.exit(main())
  File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 254, in main
    output = args_handler(args)
  File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 138, in args_handler
    return _oldest(obj)
  File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/cli.py", line 25, in _oldest
    return obj.oldest()
  File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/wrapper.py", line 202, in oldest
    return self.near(year=year)
  File "/home/akamhy/.local/lib/python3.8/site-packages/waybackpy/wrapper.py", line 184, in near
    raise WaybackError(
waybackpy.exceptions.WaybackError: Can not find archive for 'https://github.com/akamhy/foo-bar-is-real' try later or use wayback.Url(url, user_agent).save() to create a new archive.

add a way to get a list of all archives for a given URL

Right now, there seems to be no way to list all of the available archives by timestamp. Unless I am missing something, the only options are to call .oldest(), .newest(), or request for an archive .near() a specific timestamp. It would be great if it was possible to get a list of all available archives which could then be individually fetched as desired with .get(), printed to get the archive URL, etc.
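Until the wrapper exposes this, the Wayback Machine CDX API can be queried directly for a full capture list. A stdlib-only sketch, where cdx_query_url and parse_cdx_json are hypothetical helper names:

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url):
    """Build the CDX API query that lists every capture of `url` as JSON."""
    return CDX_ENDPOINT + "?" + urlencode(
        {"url": url, "output": "json", "fl": "timestamp,original"}
    )

def parse_cdx_json(body):
    """Turn the CDX JSON response (a list of rows, first row is the header)
    into (timestamp, original_url) tuples, oldest first."""
    rows = json.loads(body)
    return [tuple(row) for row in rows[1:]] if rows else []
```

Fetching `cdx_query_url(url)` with any HTTP client and feeding the body to `parse_cdx_json` yields every capture timestamp, from which individual archive URLs can be built.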

"No archive URL found in the API response" errors

I seem to get this error a lot of the time when saving the archive actually succeeded (rechecking later finds it). It seems like the error statement could be more specific about the failure and I'm not sure that the upgrade suggestion is helpful when the version is current.

No archive URL found in the API response. If 'https://old.reddit.com/<some post on Reddit here>' can be accessed via your web browser then either this version of waybackpy (2.4.0) is out of date or WayBack Machine is malfunctioning. Visit 'https://github.com/akamhy/waybackpy' for the latest version of waybackpy.
Header:
{'Server': 'nginx/1.15.8', 'Date': 'Thu, 14 Jan 2021 22:13:08 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '232', 'Connection': 'keep-alive', 'X-App-Server': 'wwwb-app52', 'X-ts': '404', 'X-Tr': '138505', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': 'Google'}

Update docs

Add documentation for the new functionality.
Important: add the new archive_url at the top of the docs.

Too Many Requests

Too Many Requests
We are limiting the number of URLs you can submit to be Archived to the Wayback Machine, using the Save Page Now features, to no more than 15 per minute.

If you submit more than that we will block Save Page Now requests from your IP number for 5 minutes.

Please feel free to write to us at [email protected] if you have questions about this. Please include your IP address and any URLs in the email so we can provide you with better service.

(Screenshot of the "Too Many Requests" page, 2021-04-24 17:14:23)
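One way to stay under the 15-per-minute Save Page Now limit client-side is a rolling-window throttle. This is a sketch of the idea, not part of waybackpy; SaveThrottle is a hypothetical name, and the `now`/`sleep` parameters exist only to make the logic testable.

```python
import time

class SaveThrottle:
    """Client-side throttle: keep Save Page Now submissions under
    `limit` per rolling `window` seconds (the API allows 15/minute)."""

    def __init__(self, limit=15, window=60.0):
        self.limit = limit
        self.window = window
        self._stamps = []

    def wait(self, now=None, sleep=time.sleep):
        """Block until another submission is allowed, then record it."""
        now = time.monotonic() if now is None else now
        # Drop submissions that have aged out of the rolling window.
        self._stamps = [t for t in self._stamps if now - t < self.window]
        if len(self._stamps) >= self.limit:
            # Sleep until the oldest submission in the window expires.
            sleep(self.window - (now - self._stamps[0]))
        self._stamps.append(now)
```

Calling `throttle.wait()` before each `save()` keeps a burst of submissions from tripping the 5-minute IP block.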

Exceeded 30 redirects

akamhy at device in ~
$ waybackpy -u https://github.com/akamhy/videohash/blob/main/videohash/vhash.py -s
Traceback (most recent call last):
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 377, in _get_response
    return s.get(url, headers=headers)
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 677, in send
    history = [resp for resp in gen]
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 677, in <listcomp>
    history = [resp for resp in gen]
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/requests/sessions.py", line 166, in resolve_redirects
    raise TooManyRedirects('Exceeded {} redirects.'.format(self.max_redirects), response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 15, in _save
    return obj.save()
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/wrapper.py", line 160, in save
    instance=self,
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 242, in _archive_url_parser
    res = _get_response(url, headers=headers)
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 389, in _get_response
    raise exc
waybackpy.exceptions.WaybackError: Error while retrieving https://web.archive.org/web/202101290913/https://github.com/akamhy/videohash/blob/main/videohash/vhash.py.
Exceeded 30 redirects.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/akamhy/.pyenv/versions/3.5.3/bin/waybackpy", line 33, in <module>
    sys.exit(load_entry_point('waybackpy==2.4.1', 'console_scripts', 'waybackpy')())
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 327, in main
    print(args_handler(parse_args(argv)))
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 178, in args_handler
    output = _save(obj)
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/cli.py", line 31, in _save
    raise WaybackError(err)
waybackpy.exceptions.WaybackError: Error while retrieving https://web.archive.org/web/202101290913/https://github.com/akamhy/videohash/blob/main/videohash/vhash.py.
Exceeded 30 redirects.

akamhy at device in ~
$ 

CLI: --json returns invalid json

$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json
{'archived_snapshots': {'closest': {'timestamp': '20210121111134', 'url': 'http://web.archive.org/web/20210121111134/https://en.wikipedia.org/wiki/JSON', 'status': '200', 'available': True}}, 'url': 'https://en.wikipedia.org/wiki/JSON#Syntax'}
#attempt to parse the json output
$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json | jq 
parse error: Invalid numeric literal at line 1, column 22

While the equivalent query using curl [1] outputs valid json

$ curl "http://archive.org/wayback/available?url=https://en.wikipedia.org/wiki/JSON#Syntax"
{"archived_snapshots": {"closest": {"timestamp": "20210121111134", "url": "http://web.archive.org/web/20210121111134/https://en.wikipedia.org/wiki/JSON", "status": "200", "available": true}}, "url": "https://en.wikipedia.org/wiki/JSON"}
#attempt to parse the json output
curl "http://archive.org/wayback/available?url=https://en.wikipedia.org/wiki/JSON#Syntax" | jq
[works as expected]

It seems waybackpy's json is invalid because:

  • it uses single quotes (') instead of double quotes (") to delimit strings
  • True is capitalized (Python's repr) instead of JSON's lowercase true

Just to exemplify, it becomes valid if I manually fix the two points above:

$ waybackpy --url "https://en.wikipedia.org/wiki/JSON#Syntax" --json | tr "'" "\"" | sed -e 's/True/true/g' | jq
[works as expected]
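Both symptoms point to the CLI printing the Python dict's repr rather than serializing it. Passing the same dict through json.dumps() produces valid JSON; a minimal demonstration, with the payload copied from the output above:

```python
import json

payload = {
    "archived_snapshots": {
        "closest": {
            "timestamp": "20210121111134",
            "url": "http://web.archive.org/web/20210121111134/https://en.wikipedia.org/wiki/JSON",
            "status": "200",
            "available": True,
        }
    },
    "url": "https://en.wikipedia.org/wiki/JSON#Syntax",
}

# print(payload) emits the Python repr: single quotes and capitalised True.
# json.dumps() emits standard JSON that jq can parse.
print(json.dumps(payload))
```

Switching the CLI's --json path from printing the dict to printing `json.dumps(...)` would fix the jq parse error without changing the data.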

Saving archive results in an incorrect "cannot be archived by wayback machine as it is a redirect" failure

I tried upgrading from 2.4.2 to 2.4.4 and the first page I tried to archive returned a new error:

archive of https://old.reddit.com/r/personalfinance/comments/s4gdfx/how_i_recovered_my_50k_investment_stolen_off/ failed: URL cannot be archived by wayback machine as it is a redirect.
Header:
save redirected

This page is not a redirect. The archive actually appears to have worked, though: the page is successfully archived on archive.org with a timestamp close to that of the error in my log.

get() doesn't return the correct version of archives

get() is fetching the wrong version of archives.

I've seen this happen to multiple different archives and it is a reproducible issue. Here is an example:

>>> import waybackpy
>>> user_agent = "something goes here"
>>> url = 'https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> url_object = waybackpy.Url(url, user_agent)
>>> archive = url_object.newest()
>>> str(archive)
'https://web.archive.org/web/20210111103734/https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> archive.get().find("[removed]")
50479
>>> archive.get().find("become your own BOSS")
-1

However, that archive version does not include the string [removed] and it should show the spam text that included the become your own BOSS phrase.

If you fetch the page separately, you can see that [removed] is not present and the expected text is there.

>>> import requests
>>> response = requests.get(str(archive))
>>> response.text.find("[removed]")
-1
>>> response.text.find("become your own BOSS")
61087

For some reason, get() seems to be fetching an older version of the archive.

Error getting a page when the encoding is 'None'

The error occurs in wrapper.py when the encoding is None. An example webpage that triggers it is https://akamhy.github.io/; the file that causes the issue has a '.woff2' extension.

Error:

return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
AttributeError: 'NoneType' object has no attribute 'replace'

In my case I want to download all the versions of a domain (html, js...) and also subdomains, so to do that I use Cdx and then, I download using the get method.
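A possible guard, sketched as a standalone helper rather than the actual wrapper.py code (decode_response is a hypothetical name): fall back to UTF-8 when requests reports no encoding, which happens for binary content types such as .woff2 fonts.

```python
def decode_response(content, encoding):
    """Decode response bytes, falling back to UTF-8 when the HTTP layer
    reports no encoding (e.g. for binary types like .woff2 fonts)."""
    if encoding is None:
        encoding = "UTF-8"
    # Mirrors the existing replace() of a mislabelled "text/html" encoding,
    # and uses errors="replace" so undecodable bytes can't raise.
    return content.decode(encoding.replace("text/html", "UTF-8", 1), errors="replace")
```

For genuinely binary files it may be better still to skip decoding entirely and expose the raw bytes.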

Write docs in wiki

The docs will grow further once Cdx is supported, and the README is not the place for extensive documentation.

cdx different mimetypes

com,github)/ 20190101054633 http://github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 323
com,github)/ 20190101065422 https://github.com/ text/html 200 G5OJHNLJSMMMEABXKD4C723QZ4HVYG7I 18291
com,github)/ 20190101082252 https://github.com/ text/html 200 3FGP34I4EWYWIDIQ6T3WMQ2E47X6GPTZ 18299
com,github)/ 20190101105635 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 324
com,github)/ 20190101110106 https://github.com/ text/html 200 VGYZSUH55NEH5F5AJ3HJ5RMJZGOKU6CP 18285
com,github)/ 20190101121552 http://github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 321
com,github)/ 20190101131350 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 323
com,github)/ 20190101135435 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 324
com,github)/ 20190101141556 https://www.github.com/ unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 325

improve coverage

The coverage is below 90% and thus unacceptable. Add more tests.

fix total

waybackpy --url https://github.com/akamhy/waybackpy --user_agent haha --total

save redirected

I would like to provide an error to handle this.

import waybackpy
w = waybackpy.Url('http://google.com')
w.save() #=> redirected to https://www.google.com
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eggplants/.pyenv/versions/3.7.9/lib/python3.7/site-packages/waybackpy/wrapper.py", line 161, in save
    instance=self,
  File "/home/eggplants/.pyenv/versions/3.7.9/lib/python3.7/site-packages/waybackpy/utils.py", line 303, in _archive_url_parser
    raise WaybackError(exc_message)
waybackpy.exceptions.WaybackError: No archive URL found in the API response. If 'http://google.com' can be accessed via your web browser then either Wayback Machine is malfunctioning or it refused to archive your URL.
Header:
save redirected

add command line support

We currently lack command line support, which is a problem for users who are not using Python.
But don't use any non-stdlib package; we don't want any third-party dependency at all.
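A stdlib-only sketch of the argument parsing using argparse; the flag names are assumptions based on the commands shown in the issues above, not the final interface.

```python
import argparse

def parse_args(argv=None):
    """Stdlib-only argument parsing for a waybackpy CLI (sketch)."""
    parser = argparse.ArgumentParser(prog="waybackpy")
    parser.add_argument("-u", "--url", required=True, help="URL to look up")
    parser.add_argument("--user_agent", default="waybackpy", help="User-Agent header")
    # Exactly one action per invocation.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--oldest", action="store_true", help="print the oldest archive")
    group.add_argument("--newest", action="store_true", help="print the newest archive")
    group.add_argument("-s", "--save", action="store_true", help="save a new archive")
    return parser.parse_args(argv)
```

argparse ships with Python, so this keeps the zero-dependency constraint while still giving --help output and error messages for free.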
