
internetwaybackmachine's Introduction

Internet Wayback Machine

This small utility enables you to submit as many URLs as you like to https://archive.org.


How to use

Download the JAR file from the releases section.

Then run:

$ java -jar [jar file name].jar

After that you should be able to interact with the app.

To get started, type help.

At the moment only three commands are supported:

  • save: saves a single URL. Examples:

    • save "https://google.com"
    • save https://google.com
  • save-batch: saves a batch of URLs. Example:

    • save-batch '["https://google.com", "https://yahoo.com"]'
  • save-file: submits all URLs in a file. For a sample file, see sample_file.txt. Example:

    • save-file /file/path or file\path

Important note

This is a revamped (rewritten) version of the classic Internet Wayback Machine. PHP support has been dropped. If you still want the old code, it is available on the classic-old branch, though it is no longer maintained.

Development

Clone the repository, then run:

$ mvn spring-boot:run

License

Internet Wayback Machine is MIT licensed.

Contact

internetwaybackmachine's People

Contributors: dependabot[bot], kasramp

internetwaybackmachine's Issues

IA now 429s out. Don't know if the IA is planning to do this permanently.

I'm experiencing failures more frequently, caused by the IA now returning 429 Too Many Requests (I found this out by running a list of URLs through the command prompt, waiting until it started failing frequently, then reproducing the error in a browser).

I was thinking of adding a delay during save (say, 10 seconds), with this syntax:
java -jar InternetWaybackMachine.jar <URL_QuotesOptional> <Time_In_Seconds>

I would use it like this [1]:

java -jar InternetWaybackMachine.jar "http://www.example.com" 10 >>OutputLog.txt
java -jar InternetWaybackMachine.jar "http://www.example.com/1" 10 >>OutputLog.txt
java -jar InternetWaybackMachine.jar "http://www.example.com/2" 10 >>OutputLog.txt
java -jar InternetWaybackMachine.jar "http://www.example.com/3" 10 >>OutputLog.txt

and it would output in this format:

<outputline>
<outputline>
<outputline>
<outputline>

where each <outputline> is either Your page submitted sucessfully! or Page submission failed :-(

Although the command prompt does have timeout /t <TimeoutInSeconds> [/nobreak], the problem is that it reformats the list, causing the output text to be misaligned relative to the URL save list. On top of that, I use browser extensions that paste each URL on its own line, one after another, which makes it awkward to have to add a delay command between each save command.

For example:

java -jar InternetWaybackMachine.jar google.com >>OutputLog.txt
timeout /t 5 
java -jar InternetWaybackMachine.jar google.com >>OutputLog.txt
timeout /t 5 
java -jar InternetWaybackMachine.jar google.com >>OutputLog.txt
timeout /t 5 
java -jar InternetWaybackMachine.jar google.com >>OutputLog.txt
timeout /t 5 

Outputs:

Your page submitted sucessfully!
Your page submitted sucessfully!
Your page submitted sucessfully!
Your page submitted sucessfully!

While adding “>>OutputLog.txt” to all lines:

java -jar InternetWaybackMachine.jar google.com >>OutputLog.txt
timeout /t 5  >>OutputLog.txt
java -jar InternetWaybackMachine.jar google.com >>OutputLog.txt
timeout /t 5  >>OutputLog.txt
java -jar InternetWaybackMachine.jar google.com >>OutputLog.txt
timeout /t 5  >>OutputLog.txt
java -jar InternetWaybackMachine.jar google.com >>OutputLog.txt
timeout /t 5  >>OutputLog.txt

Results this:

Your page submitted sucessfully!

Waiting for 5 seconds, press a key to continue ...�4�3�2�1�0
Your page submitted sucessfully!

Waiting for 5 seconds, press a key to continue ...�4�3�2�1�0
Your page submitted sucessfully!

Waiting for 5 seconds, press a key to continue ...�4�3�2�1�0
Your page submitted sucessfully!

Waiting for 5 seconds, press a key to continue ...�4�3�2�1�0

(ignore the odd characters between the countdown numbers)
Notice that if you redirect the output of a timeout, it prints:

[linebreak]
Waiting for 5 seconds, press a key to continue ...�4�3�2�1�0

This makes it harder to tell which output line corresponds to which command, since each timeout occupies two lines. I'm using Notepad++ (Notepad plus plus), by the way. Currently, you can column-select and paste so that each output line lines up with each URL to save, which makes it immediate to search for and extract all the URLs that failed (like a temporary 404 on Twitter) and try them again.

--OR--
I believe the tool reads the HTTP response code (looking at the source code, I assume excluded URLs return 403, and 404 means file not found or an invalid URL; I don't know exactly). So if it gets a 429, it could pause internally for a set amount of time (10 seconds, for example), then try again, and keep doing so until it either hits any other error (which outputs Page submission failed :-( ) or succeeds (Your page submitted sucessfully!). It would output the same line format mentioned at “[1]”.

Thank you for reading. Even if the 429 issue is temporary, websites or pages can vanish at any time, and automated saving is the best way to prevent or reduce that loss.
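The delay-between-saves idea could be sketched roughly as follows. This is a hypothetical illustration, not the app's actual code: the Submitter interface stands in for whatever the tool's real save call is, and the status messages are copied verbatim from the output quoted above.

```java
import java.util.List;

// Sketch: submit URLs with a fixed pause between requests, printing
// exactly one status line per URL so the log stays aligned with the
// input list. Submitter is a hypothetical stand-in for the real save call.
public class ThrottledSaver {
    interface Submitter { boolean save(String url); }

    static int submitAll(List<String> urls, Submitter submitter, long delayMillis) {
        int failures = 0;
        for (int i = 0; i < urls.size(); i++) {
            boolean ok = submitter.save(urls.get(i));
            if (!ok) failures++;
            // One line per URL, in input order (message text mirrors the app's output).
            System.out.println(ok ? "Your page submitted sucessfully!"
                                  : "Page submission failed :-(");
            if (i < urls.size() - 1) {
                try { Thread.sleep(delayMillis); } // pause between requests
                catch (InterruptedException e) { Thread.currentThread().interrupt(); break; }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Demo with a fake submitter that fails on one URL.
        List<String> urls = List.of("https://www.example.com", "https://www.example.com/1");
        int failed = submitAll(urls, url -> !url.endsWith("/1"), 10);
        System.out.println("failures=" + failed);
    }
}
```

With a real submitter wired in, this would give the one-line-per-URL output the column-select workflow needs, with no timeout noise in the log.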

[Suggestion] Automatically re-attempt to save a URL if it fails.

I've noticed that when saving URLs there is now an increased failure rate, especially from twitter and uchinokomato.me: 20-40% of saves fail (before, it was 1-4%). This is not due to excluded URLs, because when I try to save the same URLs again, they then get saved. I was thinking of adding a feature where the syntax lets you retry a failed URL (after a user-specified delay, in seconds), repeating until it succeeds or until it has failed a user-specified number of times:

java -jar InternetWaybackMachine.jar "<URL address>" <MaxNumberOfFails> <TimeInSecondsDelayBetweenAttempts>

And it will output Page submission failed :-( or Your page submitted sucessfully! (failed x time(s)).

x is the number of failed attempts before the URL finally saved.

MaxNumberOfFails is how many failed attempts the program will make before giving up; once this number is reached, it stops re-attempting and outputs Page submission failed :-(

<TimeInSecondsDelayBetweenAttempts> is how long (in seconds) to wait between re-attempts on a failed URL. This matters because, as of yesterday, the 429 error states that you are not to save more than 15 URLs in a minute.

For example:

java -jar InternetWaybackMachine.jar "https://www.example.com" 3 4

If it fails, it will try again until it successfully saves the URL (outputting Your page submitted sucessfully! (failed 2 time(s))) and then continues with the next command in the batch file. If it fails 3 times, it stops attempting that URL, outputs Page submission failed :-(, and continues to the next command in the batch file.
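The requested retry loop can be sketched like this, under the assumption that the app exposes some boolean save call (the Attempt interface below is a hypothetical stand-in):

```java
// Sketch of the requested retry behaviour: re-attempt a save until it
// succeeds or a user-chosen number of failures is reached, sleeping
// between attempts. Attempt is a hypothetical stand-in for the app's
// real save call (true on success, false on a failure such as HTTP 429).
public class RetrySaver {
    interface Attempt { boolean run(); }

    // Returns the number of failed attempts before success,
    // or -1 if maxFails was reached without succeeding.
    static int saveWithRetry(Attempt attempt, int maxFails, long delayMillis) {
        int fails = 0;
        while (fails < maxFails) {
            if (attempt.run()) return fails;
            fails++;
            if (fails < maxFails) {
                try { Thread.sleep(delayMillis); } // delay between re-attempts
                catch (InterruptedException e) { Thread.currentThread().interrupt(); break; }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Demo: a fake attempt that fails twice, then succeeds,
        // with maxFails = 3 and a short delay.
        int[] calls = {0};
        int failed = saveWithRetry(() -> ++calls[0] > 2, 3, 10);
        System.out.println(failed >= 0
                ? "Your page submitted sucessfully! (failed " + failed + " time(s))"
                : "Page submission failed :-(");
    }
}
```

In the demo above, the attempt fails twice before succeeding, matching the "failed 2 time(s)" example from the issue text.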

Even more character issues

Still using the “fixed” version you gave me.

Oddly, I've tested saving these URLs and somehow the command prompt treats them differently despite both being on the same code page:


chcp 65001
del OutputLog.txt
java -jar InternetWaybackMachine.jar "https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/077/904/original/鳥と鳥籠キャラシ.png?1450609113" >>OutputLog.txt & timeout /t 3                                
java -jar InternetWaybackMachine.jar "https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/original/新ガラテアCS.png?1450710929" >>OutputLog.txt & timeout /t 3                                   
pause

The first URL saved successfully, but the second does not work AT ALL [1]; messing around with NP++'s encodings doesn't help either. And yes, I tried manual saving and it worked. I also tried with and without chcp 65001; it still fails.

I've noticed that chcp 65001 makes a huge difference in how the command prompt handles the characters. I did “pseudo-auto-saving” with a batch file using the start command, which opens the default browser (multiple URLs open in new tabs) to save each page, as opposed to a script that merely reads the HTTP status:

start https://web.archive.org/save/https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/medium/新ガラテアCS.png?1450710929 & timeout /t 5
start https://web.archive.org/save/https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/original/新ガラテアCS.png?1450710929 & timeout /t 5
pause

This alone (without the chcp) takes me to a different URL (which is invalid, by the way):

https://web.archive.org/web/20191113023522/https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/medium/%E8%AD%81%EF%BD%B0%E7%B9%A7%EF%BD%AB%E7%B9%A7%E5%90%B6%CE%9B%E7%B9%9D%E3%83%BB%E3%81%84CS.png?1450710929
https://web.archive.org/web/20191113023527/https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/original/%E8%AD%81%EF%BD%B0%E7%B9%A7%EF%BD%AB%E7%B9%A7%E5%90%B6%CE%9B%E7%B9%9D%E3%83%BB%E3%81%84CS.png?1450710929

Raw form when testing with the chcp command:

https://web.archive.org/web/20191113023522/https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/medium/譁ー繧ォ繧吶Λ繝・いCS.png?1450710929
https://web.archive.org/web/20191113023527/https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/original/譁ー繧ォ繧吶Λ繝・いCS.png?1450710929

whereas I have this in the batch file instead:

chcp 65001
start https://web.archive.org/save/https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/medium/新ガラテアCS.png?1450710929 & timeout /t 5
start https://web.archive.org/save/https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/078/212/original/新ガラテアCS.png?1450710929 & timeout /t 5
pause

This always works, even for URLs that failed before (see [1]).
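One way the tool itself could sidestep the console code page entirely is to percent-encode the non-ASCII path before submitting, so the URL it hands to archive.org is pure ASCII and cannot be mangled by the terminal. A minimal sketch using java.net.URI (an assumption about a possible fix, not how the app currently builds URLs):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch: rebuild a URL so that any non-ASCII characters in the path
// are percent-encoded as UTF-8, producing an ASCII-only URL that is
// safe to pass through any Windows code page.
public class AsciiUrl {
    static String toAscii(String scheme, String host, String path, String query) {
        try {
            // URI percent-encodes characters outside the allowed ASCII set;
            // toASCIIString() guarantees a pure-ASCII result.
            return new URI(scheme, host, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAscii("https", "s3-ap-northeast-1.amazonaws.com",
                "/uchinoko/chara_images/pictures/000/021/892/original/学者.jpeg",
                "1426006855"));
    }
}
```

For the 学者.jpeg example from the issue below, this yields the same %E5%AD%A6%E8%80%85 escape sequence seen in the reporter's percent-encoded URL.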

Errors out when URLs have percent encoding.

Example:

java -jar InternetWaybackMachine.jar https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/021/892/original/%E5%AD%A6%E8%80%85.jpeg?1426006855

URL to save:

https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/021/892/original/学者.jpeg?1426006855

This fails to save unless you convert the percent escapes back to their original characters and use chcp 65001. Tested on Windows 10.
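The failure above could be handled inside the tool by decoding %XX escapes back to their original characters before submitting. A minimal sketch using java.net.URLDecoder (an assumption about a possible fix, not the app's current behaviour):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Sketch: decode a percent-encoded URL back to its original UTF-8
// characters before handing it to the save routine.
// Caveat: URLDecoder also turns '+' into a space, which is wrong for
// URL paths, so a production fix should decode only %XX escapes.
public class DecodeUrl {
    static String decode(String url) {
        return URLDecoder.decode(url, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode(
            "https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/021/892/original/%E5%AD%A6%E8%80%85.jpeg?1426006855"));
    }
}
```

Applied to the example above, this recovers the 学者.jpeg URL the reporter says saves correctly when entered manually.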
