warrick's Issues

Port 80

IA (the Internet Archive) had a number of URLs like this:

http://www.harding.edu:80/fmccown/

that had the default port 80 in them (who knows why).  Warrick treats
this URL as if it were different from

http://www.harding.edu/fmccown/

which of course it is not.
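
Editor's note: a sketch of one fix using the CPAN URI module, whose canonical method drops a port that is the default for the scheme, so both spellings compare equal:

use strict;
use warnings;
use URI;

# canonical() drops the default port (80 for http), among other
# normalizations, so the two spellings collapse to one key.
my $a = URI->new('http://www.harding.edu:80/fmccown/')->canonical;
my $b = URI->new('http://www.harding.edu/fmccown/')->canonical;

print $a eq $b ? "same URL\n" : "different URLs\n";   # prints "same URL"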

Original issue reported on code.google.com by [email protected] on 28 May 2013 at 7:16

Memory Overrun

What steps will reproduce the problem?
1. When recovering large files, the recovered file grows.
2. Files potentially have whitespace added by the sed commands.
3. The file then can't be loaded into memory without the machine running out
of RAM (see the editor's sketch below).
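
Editor's note: whatever is inflating the files, the post-processing itself need not hold a whole file in memory. A minimal line-by-line sketch ($recovered and the rewriting step are placeholders, not warrick's actual code):

use strict;
use warnings;

my $recovered = 'index.html';   # placeholder path

# Stream the file through a temp copy one line at a time; peak memory
# stays at one line, no matter how large the recovered file is.
open my $in,  '<', $recovered       or die "$recovered: $!";
open my $out, '>', "$recovered.tmp" or die "$recovered.tmp: $!";
while (my $line = <$in>) {
    # apply the link-rewriting rules to $line here
    print {$out} $line;
}
close $in;
close $out;
rename "$recovered.tmp", $recovered or die "rename: $!";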

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 24 Sep 2012 at 1:05

Encoding

What steps will reproduce the problem?
1. Try to recover sites with non-English accents and characters
2.
3.

What is the expected output? What do you see instead?

instead of ó I see Ã³
instead of á I see Ã¡

etc...
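
Editor's note: "Ã³" in place of "ó" is the classic sign of UTF-8 bytes being re-read as Latin-1. A hedged sketch of the usual Perl-side fix (assuming the archive serves UTF-8; $raw_bytes and $path are placeholders):

use strict;
use warnings;
use Encode qw(decode);

my $raw_bytes = "\xC3\xB3";      # placeholder: the UTF-8 bytes for "ó"
my $path      = 'page.html';     # placeholder output path

# Decode the HTTP body to characters exactly once, then write it out
# through an encoding layer; decoding twice is what produces "Ã³".
my $text = decode('UTF-8', $raw_bytes);
open my $out, '>:encoding(UTF-8)', $path or die "$path: $!";
print {$out} $text;
close $out;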


What version of the product are you using? On what operating system?

Latest version, on Ubuntu 12.04 LTS.

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 10 Apr 2014 at 1:37

URI Rewriting

URI format changes by the archives have rendered the -k and uri-rewriting
features inoperable. We need a way, without hardcoding, to change the URI
strings that we need to detect and make relative, either automatically or
through simple manual configuration.

Archive watermarks, status bars, and other archive-added features should be
treated similarly.
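
Editor's sketch of one non-hardcoded approach: read the archive URI prefixes to strip from a plain-text config file at startup (the file name rewrite-patterns.conf and the helper are hypothetical):

use strict;
use warnings;

# One regex per line; blank lines and #-comments are ignored.
my @archive_patterns;
open my $cfg, '<', 'rewrite-patterns.conf' or die "rewrite-patterns.conf: $!";
while (my $line = <$cfg>) {
    chomp $line;
    next if $line =~ /^\s*(?:#|$)/;
    push @archive_patterns, qr/$line/;
}
close $cfg;

# Strip any matching archive prefix, leaving the original URI behind.
sub strip_archive_prefix {
    my ($uri) = @_;
    $uri =~ s/$_//g for @archive_patterns;
    return $uri;
}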


Original issue reported on code.google.com by [email protected] on 23 Oct 2014 at 12:09

[Help]Resume Warrick when I turn off computer

Hello everybody!
Warrick is a useful tool for recovering websites. When I had completed half of
my website, my computer suddenly turned off. I had spent 4 days on it. How can
I resume it? Thanks in advance!


Original issue reported on code.google.com by tunghk54 on 21 Mar 2013 at 9:29

git repo is empty

What steps will reproduce the problem?
1. git clone https://code.google.com/p/warrick/
as stated here: http://code.google.com/p/warrick/source/checkout .

What is the expected output?

Cloning into junk.git...
done.

What do you see instead?

Cloning into warrick.git...
warning: You appear to have cloned an empty repository.

What version of the product are you using? On what operating system?
N/A

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 24 Jun 2012 at 6:45

New code mod.

Check also these lines; they need to be fixed:

2025
2121
2157
2940

My line now looks like this:

if($_[0] =~ m/\.jpg/i || $_[0] =~ m/\.jpeg/i || $_[0] =~ m/\.png/i || $_[0] =~ m/\.gif/i || $_[0] =~ m/\.doc/i || $_[0] =~ m/\.tiff/i || $_[0] =~ m/\.bmp/i || $_[0] =~ m/\.pdf/i)

A small question: I'm not very familiar with Perl, but wouldn't it be better
to put all these formats in a global array at the top of the script,

$blacklist = array("jpg", "tiff" .....)

and foreach it against the filename? That would be clearer and easier to fix
than having the list on 4 lines and checking what's wrong. :) (See the
editor's sketch below.)
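
Editor's note: array(...) above is PHP syntax; in Perl, the single-list idea might look like this (names are illustrative, not from warrick's source):

use strict;
use warnings;

# One list, one regex: add or remove an extension in a single place
# instead of editing 4 separate lines.
my @skip_extensions = qw(jpg jpeg png gif doc tiff bmp pdf);
my $skip_re = do {
    my $alt = join '|', map { quotemeta } @skip_extensions;
    qr/\.(?:$alt)\b/i;   # \b avoids matching ".jpgfoo"
};

sub has_skipped_extension {
    my ($name) = @_;
    return $name =~ $skip_re;
}

print has_skipped_extension('photo.JPG') ? "skip\n" : "keep\n";   # skip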



Original issue reported on code.google.com by [email protected] on 7 Dec 2012 at 3:29

zero length content "No Content in ..."

What steps will reproduce the problem?
1. ./warrick.pl -dr 2013-08-05 -d -a ia -D ../ftp/ http://www.atlantischild.hu/

What is the expected output? What do you see instead?

http://wayback.archive.org/web/20111031230326/http://www.atlantischild.hu/index.php?option=com_content&task=view&id=21&Itemid=9
has non-zero length; I get zero-length files:
"index.php?option=com_content&task=view&id=21&Itemid=9"

What version of the product are you using? On what operating system?
warrickv2-2-5

Please provide any additional information below.

I've got a non-zero-length file which has GET parameters in its name,
but all files containing & (ampersand) in their names are empty.

The log says (below); as you can see, there is nothing after "?" in the
"To stats ... Location:" line.


-------
At Frontier location 79 of 769
-------


My frontier at 79: 
http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28
My memento to get: 
|http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28|

targetpath: index.php

appending query string option=com_content&task=blogcategory&id=21&Itemid=28



 mcurling: /home/davidprog/dev/design-check/atlantis/warrick//mcurl.pl -D "/home/davidprog/dev/design-check/atlantis/warrick/../ftp//logfile.o"  -dt "Sun, 04 Aug 2013 22:00:00 GMT"  -tg "http://web.archive.org/web" -L -o "/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28" "http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28"

Reading logfile: 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//logfile.o


To stats 
http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28 => Location: 
http://web.archive.org/web/20120903050228/http://www.atlantischild.hu/index.php?
 => 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28 --> stat IA

returning 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28
Search HTML resource 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28 for links to other missing resources...
No Content in 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28!!
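
Editor's note: whatever the root cause here, filenames containing ? and & are fragile anywhere a shell is involved. A defensive sketch that percent-encodes those characters when mapping a URL's query string to a local filename (make_local_name is a hypothetical helper, not warrick's code):

use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Encode the shell metacharacters & ? = so nothing downstream (the
# shell, sed, curl -o) can reinterpret or truncate the name.
sub make_local_name {
    my ($path, $query) = @_;
    return $path unless defined $query && length $query;
    return $path . '%3F' . uri_escape($query, '&?=');
}

print make_local_name('index.php', 'option=com_content&task=view'), "\n";
# prints: index.php%3Foption%3Dcom_content%26task%3Dview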


Original issue reported on code.google.com by [email protected] on 31 Aug 2013 at 4:42

Distribution archive looks sloppy

What steps will reproduce the problem?
1. Download warrickv2-2-5.tar.gz from the project's "Downloads"
2. Look inside it, for example with mc (Midnight Commander)

What is the expected output? What do you see instead?

1. README file says version is 2.0  Expected: 2.5

2. Almost all files are executable, even .o text files. Expected: only the
files you actually have to run should be executable.

3. The .o extension is confusing for text files (it usually denotes compiled
object files); these have 2 URLs inside. Expected: another extension, and
maybe put these 20 files into a subdir.

4. I do not expect to see 'curl.exe' here. Expected: either Windows support
is officially claimed on the project page, or the file is removed.

5. piklog.log is 1 MB in size. Is it really needed in the distribution
archive? Expected: if the file is part of the test suite, it might be placed
into a TEST_FILES subdir.


What version of the product are you using? On what operating system?
warrick-2.5 Ubuntu 13.04 x32


Please provide any additional information below.

The usual thing developers do is use a 'dist' makefile target or a
'makedist.sh' script. It does some clean-up ('make clean', or 'rm' on a
specified file list) and puts only the files that are really needed into the
distribution archive.

Original issue reported on code.google.com by [email protected] on 7 Sep 2013 at 3:46

old_make

When I finished running TEST, I saw this:

TESTING DOWNLOAD COMPLETE...
-----------------------------
Downloaded 55 resources, which is greater than the 27 from the
testfile. We've found a new memento. Test success!
cat: old_make/MAKEFILE/TESTHEADERS.out: No such file or directory
0 vs 82


I don't see a directory called old_make.  Was this perhaps an old
reference that should be removed?


Original issue reported on code.google.com by [email protected] on 28 May 2013 at 6:09

No Clobber sleeping

Please describe your feature requests here.

When the "no clobber" option is picked, and a file is detected to already 
exist, Warrick should not do any sleeping while moving on to the next file in 
the frontier.  This will greatly enhance the usability of Warrick, and reduce 
the need for session management with  the *.save files, and would be far more 
flexible.  The most difficult part of the documentation to understand is the 
*.save feature used for saving sessions.  
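
Editor's sketch of the requested behavior (variable names are illustrative, not warrick's actual code):

use strict;
use warnings;

my $opt_no_clobber = 1;
my $delay_seconds  = 7;
my @frontier       = ('index.html', 'about.html');   # illustrative

for my $local_file (@frontier) {
    # If no-clobber is on and the file already exists, nothing will be
    # fetched, so skip the politeness delay entirely.
    next if $opt_no_clobber && -e $local_file;

    sleep $delay_seconds;                # throttle only real archive requests
    print "would fetch $local_file\n";   # stand-in for the mcurl call
}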

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:47

Perl Logging

Please describe your feature requests here.

 Use a genuine Perl logging package, which will give you "levels" of logging and the flexibility to turn logging on and off, or to direct it to the TTY or to file(s) via configuration, saving you headaches and typing in the long run.
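
For example, with Log::Log4perl (editor's sketch of one such package):

use strict;
use warnings;
use Log::Log4perl qw(:easy);

# One call sets the minimum level; DEBUG chatter disappears until you
# lower the level, with no changes at the call sites.
Log::Log4perl->easy_init($INFO);

DEBUG('frontier dump suppressed at INFO level');
INFO('recovery started');
WARN('empty response, will retry');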

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:46

Include VM

Please describe your feature requests here.

You may want to distribute a VMware machine running your preferred version of
free Unix/Linux instead of porting to Windows. That way, you save yourself the
headache of maintaining two versions, and you also eliminate the need to write
Warrick for the many flavors of Linux and Unix, including whatever versions of
support libraries they happen to have. It is a bigger download, but bits are
free, and your time is valuable. Also, when you support users, bugs that creep
in through CPAN or libraries would cease to be a factor, as the VMware machine
would be tested from every line you type down to the VMware ethernet driver.
The assumption packed into this suggestion is that you want to provide a
useful tool, and that may not be the case.

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:47

Crash Recovery

What steps will reproduce the problem?
1. Had a problem "resuming" the process when my laptop crashed, but figured
out a workaround.
2. I was seeding Warrick with http://etpv.org/2000.html, and it was generating
paths to recover like ...

http://etpv.org/2000.html/2000/example1.html
http://etpv.org/2000.html/2000/example2.html
http://etpv.org/2000.html/2000/example3.html
...

... which meant that when I resumed the recovery after the crash, instead of
expediently skipping over files that already existed (since I have
"no-clobber" on), it would generate these bogus paths, which mcurl would then
try to recover.
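
Editor's note: paths like 2000.html/2000/example1.html appear when relative links are resolved against the page URL as if it were a directory. CPAN's URI module performs the resolution correctly:

use strict;
use warnings;
use URI;

# The base's final segment (2000.html) is replaced, not descended into.
my $abs = URI->new_abs('2000/example1.html', 'http://etpv.org/2000.html');
print "$abs\n";   # http://etpv.org/2000/example1.html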

The workaround is to delete the file off my laptop that links to the other 
files.  In this case, we're talking files like ...

1999.html
2000.html
2001.html
...

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.

Attached: a shell script with a set of commands as a solution.

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 11:18

Warrick Sleeping 2

Please describe your feature requests here.

Once you start timing how long mcurl takes (to see whether it has hung),
you can adjust how long Warrick sleeps between requests to maximize
recovery speed.

Right now, Warrick is sleeping for 7 seconds between requests, but if the 
Internet is unclogged, and archive.org is returning the page to mcurl in 1 
second, then why not sleep for 1 second since no one else is using the 
bandwidth and archive.org?  If archive.org returns a result in 10 seconds 
because it's loaded down, THEN Warrick should sleep for 10 seconds to ease up.  
A 30 second response would be matched with a long 30 second sleep.

Simply put, Warrick should sleep the number of seconds it takes to pull the 
last page.  This is how the ftp algorithm maximizes bandwidth, without killing 
the internet.
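
Editor's sketch of the proposal (fetch_one stands in for the real mcurl.pl invocation):

use strict;
use warnings;
use Time::HiRes qw(time sleep);

sub fetch_one { sleep(1) }   # placeholder request that takes ~1 second

my $t0 = time();
fetch_one();
my $elapsed = time() - $t0;

# Mirror the archive's responsiveness: a 1-second response earns a
# 1-second pause, a 30-second response a 30-second back-off.
sleep($elapsed);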

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 11:22

Brass rework

The brass interface at warrick.cs.odu.edu needs to be reinstalled, and then a 
new load balancing algorithm should be put into place. This will include 
distributing brass between multiple machines, each with its own memento 
aggregator for discovering mementos.


Original issue reported on code.google.com by [email protected] on 23 Oct 2014 at 12:11

Exclude URL paths from being reconstructed and/or crawled

It is sometimes useful to exclude URLs in order to reduce the scope of the 
reconstruction job. 

Example: A site where every page has an 'edit' URL. These pages should be 
excluded from even being crawled in the first place, for example by excluding 
the /edit pattern.

Example: A site that has many browsing and searching paths that lead to the 
same content pages. These pages should be crawled to ensure complete coverage 
of the content pages, but not reconstructed, for example by excluding a pattern 
such as ?page=[0-9]+

Example: Only wanting to reconstruct a section of a website (everything under a 
particular subdirectory) by excluding specific other subdirectories from 
reconstruction.

Because excluding crawling and excluding reconstruction solve separate use
cases (although excluding crawling obviously also excludes reconstruction),
I recommend separate command-line switches for each.
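
A possible shape for those switches (editor's sketch; --exclude-crawl and --exclude-recover are hypothetical names, not existing warrick options):

use strict;
use warnings;
use Getopt::Long;

# Repeatable switches: each use pushes another pattern onto the list.
my (@no_crawl, @no_recover);
GetOptions(
    'exclude-crawl=s'   => \@no_crawl,     # skip link extraction entirely
    'exclude-recover=s' => \@no_recover,   # crawl for links, but do not save
) or die "bad options\n";

sub matches_any {
    my ($url, @patterns) = @_;
    return grep { $url =~ /$_/ } @patterns;
}

# e.g. perl this.pl --exclude-crawl '/edit$' --exclude-recover '\?page=[0-9]+'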

Original issue reported on code.google.com by [email protected] on 27 Apr 2012 at 3:03

GetOptionsFromString is not exported....

What steps will reproduce the problem?
1. Install warrick, etc., under CentOS 5.8
2. Run it (e.g. warrick.pl -o mylog http://www.mysite.com)

What is the expected output? What do you see instead?
A directory of recovered content. Instead, I get an error message and nothing
in the log.


What version of the product are you using? On what operating system?
Latest (as of today). CentOS 5.8

Please provide any additional information below.

FYI: for contact, email to [email protected] is better than the gmail address
listed under reporter.
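
Editor's note: the likely cause is that GetOptionsFromString was added in Getopt::Long 2.36, while the Perl 5.8.8 shipped with CentOS 5 bundles an older release. Declaring the requirement makes the failure explicit:

# Ask for the function by name and require a release that has it; on an
# old Getopt::Long this dies with a clear version message instead.
use Getopt::Long 2.36 qw(GetOptionsFromString);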

Original issue reported on code.google.com by [email protected] on 24 Dec 2012 at 9:16

problems with start

What steps will reproduce the problem?
1. Nothing is downloaded.
2. It can't create the first files.
3. I tried to run TEST with some changes.

What is the expected output? What do you see instead?
something :)


What version of the product are you using? On what operating system?
2.0.1 or 2.2.1

Please provide any additional information below.
When I run the TEST file, everything is OK and something is downloaded.
But when I remove the standard MAKEFILE directory and run the clean script,
to make it do everything from the start, nothing is downloaded; a new
MAKEFILE dir is created, but nothing else, and I see

---
 mcurling: /home/chali/warrick3//mcurl.pl -D "/home/chali/warrick3/MAKEFILE/logfile.o"  -dt "Wed, 01 Aug 2007 22:00:00 GMT"  -tg "http://mementoproxy.cs.odu.edu/aggr/timegate" -L -o "/home/chali/warrick3/MAKEFILE/index.html" "http://www.cs.odu.edu/"

Unable to open file /home/chali/warrick3/MAKEFILE/logfile.o
Reading logfile: /home/chali/warrick3/MAKEFILE/logfile.o


Unable to download...


To stats FAILED:: http://www.cs.odu.edu/ => ??? => 
/home/chali/warrick3/MAKEFILE/index.html --> Stat Failure...

Search HTML resource  for links to other missing resources...
No such file

No Content in !!

Starting recovery at position 0 of -1

(...)

---

That happens for all sites: it stops, says it has finished, but nothing is
downloaded.

What is the problem? Every module is installed.

Best regards


Original issue reported on code.google.com by [email protected] on 25 Jan 2013 at 11:54

MP3s corrupted during recovery

What steps will reproduce the problem?
1. recovered site with mp3s
2.
3.

What is the expected output? What do you see instead?
mp3 files still played, but were corrupted.
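
Editor's note: one plausible cause, consistent with the sed/whitespace reports elsewhere in this tracker, is binary payloads passing through the text-rewriting stage. A hedged sketch of a guard (rewrite_links is a hypothetical name for that stage):

use strict;
use warnings;

sub rewrite_links { print "rewriting $_[0]\n" }   # placeholder text-only step

# Perl's -B filetest heuristically flags binary content; mp3, image,
# and pdf payloads should never reach the rewriting stage.
sub maybe_rewrite {
    my ($local_file) = @_;
    return if -B $local_file;
    rewrite_links($local_file);
}

maybe_rewrite($_) for @ARGV;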

What version of the product are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:42

./TEST fails complaining that -nr is an invalid option

What steps will reproduce the problem?
1. ./INSTALL
2. ./TEST
3. Error presents

What is the expected output?

Successful test

What do you see instead?

net4-dev# ./TEST
Starting test...


#########################################################################
# Welcome to the Warrick Program!
# Warrick is a website recovery tool out of Old Dominion University
# Please provide feedback to Justin F. Brunelle at [email protected]
#########################################################################



Arguments: -D MAKEFILE -o MAKEFILE_LOGFILE.log -xc -nr -dr 2007-08-02 -T -nv 
http://www.cs.odu.edu/

Unknown option: nr
TESTING DOWNLOAD COMPLETE...
-----------------------------
Downloaded 28 resources, which is greater than the 27 from the testfile. We've 
found a new memento. Test success!
cat: MAKEFILE/TESTHEADERS.out: No such file or directory
0 vs 0

What version of the product are you using? On what operating system?

Downloaded version warrickv2-2-5.tar.gz, except the version string in the
source says 2.2.3; I'll open a separate ticket on this problem.

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 21 Jun 2015 at 2:13

Regex for URLs to download

I have a photo gallery site I am trying to recover the photos from, but I am
having trouble limiting Warrick to the type of page I want to download. It
finds links with tons of query strings, etc., and wants to download them all.
I would rather have it use the lister to get the URLs and then grab only
*.jpg files. Is that possible to implement?
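
Editor's sketch of an interim workaround, filtering the lister output down to JPEG URLs before recovery (assumes the URL list is in lister.o, as seen in other reports here):

use strict;
use warnings;

# Keep only the .jpg/.jpeg URLs from the lister's URL-per-line output.
open my $in, '<', 'lister.o' or die "lister.o: $!";
my @images = grep { /\.jpe?g(?:\?|$)/i } <$in>;
close $in;

print @images;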

Original issue reported on code.google.com by [email protected] on 29 Jan 2015 at 3:54

Warrick Sleeping

Please describe your feature requests here.

Warrick needlessly sleeps for 2 seconds when the "no clobber" option is
selected, a previous session was interrupted, and the -n and -R options are
not being used.

1) The sleep(2) at line 1030 is commented out
2) The sleep(5) at line 1855 is now a sleep(7)

So, we're not missing any sleep.  It's just that we don't sleep when we already 
have the file and the "no clobber" option has been selected.

Thus, there is less of a need for the -n and -R options, since it's less of
a pain to just kill the process and restart it.

Also, to possibly step around the problem of being blacklisted by Google or 
anyone else, perhaps calling sleep for random intervals would do the trick, to 
make it seem more like a human being looking at this & that.
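
The random-interval idea might look like this (editor's sketch, using the 7-second base delay mentioned above):

use strict;
use warnings;
use Time::HiRes qw(sleep);   # allow fractional seconds

my $base = 7;

# Uniform in [3.5, 10.5): same average pause as a fixed sleep(7), but
# without the mechanical rhythm that request-rate filters key on.
sleep($base / 2 + rand($base));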

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 11:20

Version string incorrect in latest download

What steps will reproduce the problem?
1. Download warrickv2-2-5.tar.gz
2. Unpack
3. $version in warrick.pl reports 2.2.3 instead of 2.2.5

Please provide any additional information below.

Either v2.2.3 is incorrectly in the download slot for 2.2.5, or an incorrect
version string is in the file.


Original issue reported on code.google.com by [email protected] on 21 Jun 2015 at 2:15

Made a replacement tool

Thanks for a great tool; too bad the project seems to be abandoned.

Taking inspiration from warrick, I've made a small tool in Ruby that does
something similar:
https://github.com/hartator/wayback_machine_downloader (it gets a backup of
any website from the Wayback Machine; optional timestamp)

For now it only works with the Wayback Machine, but contributions are welcome!


Original issue reported on code.google.com by [email protected] on 10 Aug 2015 at 6:54

warrick is not working

What steps will reproduce the problem?
1. install warrick on linux system
2. install all dependencies
3. run ./TEST 
4. try to get any site

What is the expected output? What do you see instead?
Expected to have a directory with HTML files. Instead, there are only
index.htm and a lister.o file.

What version of the product are you using? On what operating system?
I tried all versions of warrick, with the same result.
Tried on Ubuntu 12.10 and Ubuntu 12.04 (updated and upgraded).

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 20 Mar 2014 at 3:13

Stream Editing

Please describe your feature requests here.

Optimization: Use Perl's internal 'stream editing' instead of calling out to
sed via fork/exec. Not only do you save the system-call and fork/exec
overhead, but Perl has higher-performance I/O, according to one source.
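
Editor's sketch of the in-process equivalent, using Perl's built-in in-place editing ($^I) instead of forking sed:

use strict;
use warnings;

# Rewrite a file line by line with no fork/exec; $^I = '' means edit
# in place without keeping a backup copy.
sub rewrite_in_place {
    my ($file, $pattern, $replacement) = @_;
    local $^I   = '';
    local @ARGV = ($file);   # the magic <> loop reads (and rewrites) $file
    while (<>) {
        s/$pattern/$replacement/g;
        print;               # goes back into $file, not to STDOUT
    }
}

# e.g. rewrite_in_place('index.html', qr{http:}, 'https:');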

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:49

http://www.animalbehavior.org/Resources/CSASAB/#Uncert

What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 6 Nov 2014 at 3:35

sed Filterings

Please describe your feature requests here.


Optimization:   if the user is only pulling from one source, why call sed to do 
filtering as if pulling from all sources?

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:50

Has Optimization

Please describe your feature requests here.

Optimization: Use a Perl hash table instead of a Perl array. This way, you
neatly eliminate duplicates but can still traverse it like an array.
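
Editor's sketch of the idea, with the common idiom that also preserves first-seen order:

use strict;
use warnings;

my @frontier = qw(
    http://example.com/a
    http://example.com/b
    http://example.com/a
);

# %seen is the hash-as-set; grep keeps each URL only the first time.
my %seen;
my @unique = grep { !$seen{$_}++ } @frontier;

print "$_\n" for @unique;   # a then b, duplicate gone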

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:48

Use Warnings

Please describe your feature requests here.

 Put "use warnings;" right after "use strict;" to help you find more problems sooner, rather than later.


Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:45

Testing feature is outdated

The testing feature of Warrick is not operational, due to changes in archival
mementos and their features, as well as other aspects of poor planning. It
needs a complete restructuring.


Original issue reported on code.google.com by [email protected] on 23 Oct 2014 at 12:12

Incomplete Dump

I want to recover an older version of a website that contains "better"
information compared to the previous one, and make a local copy for my
personal use. I can't get a full website dump; it's always incomplete.
I'm using OS X.

Is it me or the software? :)

sudo perl ./warrick.pl -o /Users/cesare/Documents/drclark.net/warrick.log  -dr 
2012-07-23 -D /Users/cesare/Documents/drclark.net - -ic -nc -nv -a ia -nB 
"http://www.drclark.net/"


Original issue reported on code.google.com by [email protected] on 7 Dec 2012 at 3:07

Attachments:

Corrupted *.save file

What steps will reproduce the problem?
1. ./warrick.pl -D myfolia -nc -k -o myfolia.log -n 100 -a ia 
http://myfolia.com/plants


Save file:

<pid>30873</pid>
<hostname>workshop</hostname>
<command></command>
<lastFrontier>99</lastFrontier>
<startUrl>FAILED:: http://myfolia.com/plants/1-basil-ocimum-basilicum/edit => ??? => /home/webmaven/Desktop/projects/warrick2/myfolia/plants/1-basil-ocimum-basilicum/edit/index.html</startUrl>
<dir></dir>


And the <resource> tag is filled with content from the log file. Need to try to 
replicate.

Original issue reported on code.google.com by [email protected] on 24 Apr 2012 at 2:36

Not processing images within CSS style sheets

What steps will reproduce the problem?
1. reconstruct a typical site
2.
3.

What is the expected output? What do you see instead?
I am finding that it is not processing images within css files - eg background 
images.in the CSS file i just retrieved.

background-image: 
url(http://api.wayback.archive.org/memento/20110228053720/http://www.domainname.
com.au/_images/landing/10px-hover-bg.png);

It hasn't extracted and processed the CSS
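
Editor's sketch of the missing step, pulling url(...) targets out of a fetched stylesheet so they can be queued for recovery (simplified; ignores CSS escape sequences):

use strict;
use warnings;

my $css = q{background-image: url(http://example.com/_images/bg.png);};

# Quotes around the target are optional in CSS, hence the ['"]? pair.
my @refs;
while ($css =~ /url\(\s*['"]?([^'")\s]+)['"]?\s*\)/gi) {
    push @refs, $1;
}

print "$_\n" for @refs;   # http://example.com/_images/bg.png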


What version of the product are you using? On what operating system?
warrick_2.0.1.tar.gz    

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 19 Feb 2012 at 3:04

name download/releases w/ consistent regex friendly revision pattern

The current downloads listed are

* warrickv2-2-1.tar.gz
* warrick_v2-2.tar.gz
* warrick_2.0.1.tar.gz

which do not follow a regular pattern, making it difficult for automated tools, 
e.g. Macports, to determine if new revisions are available.  Recommend 
something like: `warrick-v2.2.1.tar.gz`, but any consistent preference should 
be supportable.

Original issue reported on code.google.com by [email protected] on 24 Jun 2012 at 6:51

CPAN Install

When I ran the INSTALL script on one of my machines, I repeatedly got
the error message:

Can't locate CPAN.pm in @INC

To fix this, you could add

yum install perl-CPAN

to INSTALL before using -MCPAN.

Original issue reported on code.google.com by [email protected] on 28 May 2013 at 6:10

Pause, Suspend, and Resume

Currently, you can set a limit on the number of URLs to reconstruct by using 
the -n X switch, and you can later resume the job to get another batch (of the 
same size), by using the -R PID_computername.save switch.

It would be convenient if a job that didn't have a limit set could also be 
stopped and then resumed where it left off.

Potential solutions:

1. When the job is stopped by using ctrl-C, a PID_computername.save is written
out which can later be used to resume the job (see the editor's sketch below).

2. When the job is stopped by using another key combination (such as Q for quit 
or S for Suspend), a PID_computername.save is written out which can later be 
used to resume the job.

3. Jobs can be paused by using P, and resumed by using R, without ending the 
process.
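
Editor's sketch of solution 1, trapping ctrl-C (SIGINT) to write the save file before exiting (write_save_file stands in for warrick's existing save logic):

use strict;
use warnings;
use Sys::Hostname;

sub write_save_file { print "saved to $_[0]\n" }   # placeholder for real logic

# On ctrl-C, persist the session as PID_computername.save, then exit;
# the file can later be fed back to warrick with -R.
$SIG{INT} = sub {
    write_save_file($$ . '_' . hostname() . '.save');
    exit 1;
};

# ... recovery loop runs here ...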

Original issue reported on code.google.com by [email protected] on 27 Apr 2012 at 2:46

Installation Script Rework

The installation script has syntax errors as well as issues with OS detection 
and handling. Needs to be reworked to be more elegant.


Original issue reported on code.google.com by [email protected] on 23 Oct 2014 at 12:12
