
warrick's Introduction

Warrick

Version 2.0

"The website reconstructor"

Created by Frank McCown at Old Dominion University - 2006

Modified by Justin F. Brunelle at Old Dominion University - 2011
[email protected]

Please note: This software has the following dependencies:
Perl 5 or later
cURL
Python
and these Perl libraries: HTML::TagParser, LinkExtractor, Cookies,
Status, and Date, plus the URI library

Running ./INSTALL at the command line should install these dependencies.
To test the installation, run ./TEST. This will recover a web
page and compare it to a master copy.

For information on running Warrick, run at your command line:
 `perl warrick.pl --help`

This version of Warrick has been redesigned to reconstruct lost
websites from the Web Infrastructure using Memento. (For more
information on Memento, please visit http://www.mementoweb.org/.)
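
For example, to recover a site from the Internet Archive into a local
directory (an illustrative invocation; the flags shown here appear in the
issue reports below, but run --help for the authoritative list):

    perl warrick.pl -D recovered_site -o recovery.log -a ia http://www.example.com/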

**************************************************************

We want to know if you have used Warrick to reconstruct your lost
website.  Please e-mail me at [email protected]

**************************************************************

This program creates several files that provide information or
log data about the recovery. For a given recovery RECO_NAME,
three files are created for every recovery job:
RECO_NAME_recoveryLog.out, PID_SERVERNAME.save, and logfile.o.
RECO_NAME_recoveryLog.out is created in the Warrick home
directory and contains a report of every URI recovered, the
location of the recovered archived copy (the memento), and the
location the file was saved to on the local machine, in the
following format:
ORIGINAL URI => MEMENTO URI => LOCAL FILE
Lines prepended with "FAILED" indicate a failed recovery of
ORIGINAL URI.
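
For example, a successful entry and a failed entry might look like this
(hypothetical URIs and paths):

    http://www.example.com/about.html => http://web.archive.org/web/20120101000000/http://www.example.com/about.html => /home/user/recovery/about.html
    FAILED:: http://www.example.com/gone.html => ??? => /home/user/recovery/gone.html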
PID_SERVERNAME.save is the saved status file. It is stored in
the recovery directory and contains the information needed to
resume a suspended recovery job, as well as statistics for the
recovery, such as the number of resources that could not be
recovered, the number recovered from each archive, etc.
logfile.o is a temporary file that can be regarded as junk.
It contains the headers for the last recovered resource.

If you would like to assist the development team in refining
and improving Warrick, please provide each of these files to
the development team by emailing them to [email protected].

Thank you for your help.



**************************************************************





This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

The GNU General Public License can be seen here:
http://www.gnu.org/copyleft/gpl.html

-----------------------------------------------------------

warrick's Issues

Port 80

 IA had a number of URLs like this

http://www.harding.edu:80/fmccown/

that had the default port 80 in them (who knows why).  Warrick treats
this URL as if it were different from

http://www.harding.edu/fmccown/

which of course it is not.

Original issue reported on code.google.com by [email protected] on 28 May 2013 at 7:16

problems with start

What steps will reproduce the problem?
1. Nothing is downloaded
2. The first files are never created
3. I tried to run TEST with some changes

What is the expected output? What do you see instead?
something :)


What version of the product are you using? On what operating system?
2.0.1 or 2.2.1

Please provide any additional information below.
When I run the TEST file everything is OK and something is downloaded,
but when I remove the standard MAKEFILE directory to run the clean script
and make it do everything from the start, nothing is downloaded; a new
MAKEFILE dir is created, but nothing else, and I see

---
 mcurling: /home/chali/warrick3//mcurl.pl -D "/home/chali/warrick3/MAKEFILE/logfile.o"  -dt "Wed, 01 Aug 2007 22:00:00 GMT"  -tg "http://mementoproxy.cs.odu.edu/aggr/timegate" -L -o "/home/chali/warrick3/MAKEFILE/index.html" "http://www.cs.odu.edu/"

Unable to open file /home/chali/warrick3/MAKEFILE/logfile.o
Reading logfile: /home/chali/warrick3/MAKEFILE/logfile.o


Unable to download...


To stats FAILED:: http://www.cs.odu.edu/ => ??? => 
/home/chali/warrick3/MAKEFILE/index.html --> Stat Failure...

Search HTML resource  for links to other missing resources...
No such file

No Content in !!

Starting recovery at position 0 of -1

(...)

---

That happens for all sites: it stops, says that it finished, but nothing is
downloaded.

What is the problem?
Every module is installed.

best regards


Original issue reported on code.google.com by [email protected] on 25 Jan 2013 at 11:54

Perl Logging

Please describe your feature requests here.

 Use a genuine perl logging package which will give you "levels" of logging, and flexibility to turn logging off and on, or direct it to TTY or file(s) with configuration, saving you headache and typing in the long run.
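
For instance, Log::Log4perl's "easy" mode provides levels and file/TTY
routing in a few lines (a minimal sketch; the file name and messages are
illustrative, not Warrick's actual code):

    use strict;
    use warnings;
    use Log::Log4perl qw(:easy);

    # one line of setup: level threshold plus a log-file target
    Log::Log4perl->easy_init({ level => $INFO, file => '>>warrick.log' });

    INFO('starting recovery');
    DEBUG('suppressed: below the $INFO threshold');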

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:46

zero length content "No Content in ..."

What steps will reproduce the problem?
1. ./warrick.pl -dr 2013-08-05 -d -a ia -D ../ftp/ http://www.atlantischild.hu/

What is the expected output? What do you see instead?

http://wayback.archive.org/web/20111031230326/http://www.atlantischild.hu/index.php?option=com_content&task=view&id=21&Itemid=9
has non-zero length, but I get zero-length files:
"index.php?option=com_content&task=view&id=21&Itemid=9"

What version of the product are you using? On what operating system?
warrickv2-2-5

Please provide any additional information below.

I've got a non-zero-length file which has GET parameters in its name,
but all files containing & (ampersand) in their names are empty.

The log says (below);
as you see, there is nothing after the "?" in "To stats ... Location:"


-------
At Frontier location 79 of 769
-------


My frontier at 79: 
http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&
Itemid=28
My memento to get: 
|http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21
&Itemid=28|

targetpath: index.php

appending query string option=com_content&task=blogcategory&id=21&Itemid=28



 mcurling: /home/davidprog/dev/design-check/atlantis/warrick//mcurl.pl -D "/home/davidprog/dev/design-check/atlantis/warrick/../ftp//logfile.o"  -dt "Sun, 04 Aug 2013 22:00:00 GMT"  -tg "http://web.archive.org/web" -L -o "/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28" "http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28"

Reading logfile: 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//logfile.o


To stats 
http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&
Itemid=28 => Location: 
http://web.archive.org/web/20120903050228/http://www.atlantischild.hu/index.php?
 => 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_c
ontent&task=blogcategory&id=21&Itemid=28 --> stat IA

returning 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_c
ontent&task=blogcategory&id=21&Itemid=28
Search HTML resource 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_c
ontent&task=blogcategory&id=21&Itemid=28 for links to other missing resources...
No Content in 
/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_c
ontent&task=blogcategory&id=21&Itemid=28!!


Original issue reported on code.google.com by [email protected] on 31 Aug 2013 at 4:42

Corrupted *.save file

What steps will reproduce the problem?
1. ./warrick.pl -D myfolia -nc -k -o myfolia.log -n 100 -a ia 
http://myfolia.com/plants


Save file:

<pid>30873</pid>
<hostname>workshop</hostname>
<command></command>
<lastFrontier>99</lastFrontier>
<startUrl>FAILED:: http://myfolia.com/plants/1-basil-ocimum-basilicum/edit => 
??? => 
/home/webmaven/Desktop/projects/warrick2/myfolia/plants/1-basil-ocimum-basilicum
/edit/index.html</startUrl>
<dir></dir>


And the <resource> tag is filled with content from the log file. Need to try to 
replicate.

Original issue reported on code.google.com by [email protected] on 24 Apr 2012 at 2:36

Made a replacement tool

Thanks for a great tool, too bad the project seems to be abandoned.

Getting inspiration from warrick, I've made a small tool in Ruby that does 
something similar:
https://github.com/hartator/wayback_machine_downloader (it gets a backup of
any website from the Wayback Machine, with an optional timestamp)

For now it only works with the Wayback Machine, but contributions are welcome!


Original issue reported on code.google.com by [email protected] on 10 Aug 2015 at 6:54

Include VM

Please describe your feature requests here.

You may want to distribute a VMware machine running your preferred version of 
free Unix/Linux instead of porting to Windows.  That way, you save yourself the 
headache of maintaining two versions, and you also eliminate needing to write 
Warrick for the many flavors of Linux and Unix, including whatever versions of 
support libraries they happen to have.  It is a bigger download, but bits are 
free, and your time is valuable.  Also, when you support users, bugs that creep 
through CPAN or libraries would cease to be a factor, as the VMware machine 
would be tested from every line you type, down to the VMware ethernet driver.  
What's packed up in the assumption behind this suggestion is that you want to 
provide a useful tool, and that may not be the case.

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:47

old_make

When I finished running TEST, I saw this:

TESTING DOWNLOAD COMPLETE...
-----------------------------
Downloaded 55 resources, which is greater than the 27 from the
testfile. We've found a new memento. Test success!
cat: old_make/MAKEFILE/TESTHEADERS.out: No such file or directory
0 vs 82


I don't see a directory called old_make.  Was this perhaps an old
reference that should be removed?


Original issue reported on code.google.com by [email protected] on 28 May 2013 at 6:09

[Help] Resuming Warrick after my computer turned off

Hello everybody!
Warrick is a useful tool for recovering websites. When I had completed one
half of my website, my computer suddenly turned off. I had spent 4 days on
it. How can I resume it? Thanks in advance!


Original issue reported on code.google.com by tunghk54 on 21 Mar 2013 at 9:29

Distribution archive looks sloppy

What steps will reproduce the problem?
1. Download warrickv2-2-5.tar.gz from project's "Downloads"
2. Look inside it, for example with mc (Midnight Commander)

What is the expected output? What do you see instead?

1. The README file says the version is 2.0. Expected: 2.5

2. Almost all files are executable, even .o text files. Expected: only the
files you actually have to run should be executable.

3. The .o extension is confusing for text files (it is usually used for
compiled object files); these contain 2 URLs each. Expected: another
extension, and maybe put these 20 files into a subdir.

4. I do not expect to see 'curl.exe' here. Expected: either Windows support is 
officially claimed on the project page or remove the file.

5. piklog.log is 1 MB in size. Is it really needed in the distribution
archive? Expected: if the file is part of the test suite, it might be placed
into the TEST_FILES subdir.


What version of the product are you using? On what operating system?
warrick-2.5 Ubuntu 13.04 x32


Please provide any additional information below.

The usual thing developers do is use a 'dist' makefile target or a
'makedist.sh' script. It does some clean-up ('make clean' or an 'rm' command
for a specified file list) and puts only the files that are really needed
into the distribution archive.

Original issue reported on code.google.com by [email protected] on 7 Sep 2013 at 3:46

Use Warnings

Please describe your feature requests here.

 Put "use warnings;" right after "use strict;" to help you find more problems sooner, rather than later.


Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:45

CPAN Install

When I ran the INSTALL script on one of my machines, I repeatedly got
the error message:

Can't locate CPAN.pm in @INC

To fix this, you could add

yum install perl-CPAN

to INSTALL before using -MCPAN.

Original issue reported on code.google.com by [email protected] on 28 May 2013 at 6:10

Regex for URLs to download

I have a photo gallery site I am trying to recover the photos from, but I am 
having trouble limiting warrick to the type of page I want to download. It 
finds links with tons of query strings, etc., and wants to download them all. I 
would rather have it use the lister to get the URLs and then only grab *.jpg 
files. Is that possible to implement?
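
A sketch of what such a filter could look like inside the recovery loop
(hypothetical variable names; this is not an existing Warrick option):

    use strict;
    use warnings;

    # keep only URLs matching an accept pattern, e.g. JPEG files
    my $accept = qr/\.jpe?g$/i;
    my @lister_urls = (
        'http://example.com/photos/1.jpg',
        'http://example.com/index.php?page=2',
    );
    my @frontier = grep { /$accept/ } @lister_urls;   # only 1.jpg survives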

Original issue reported on code.google.com by [email protected] on 29 Jan 2015 at 3:54

http://www.animalbehavior.org/Resources/CSASAB/#Uncert

What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 6 Nov 2014 at 3:35

No Clobber sleeping

Please describe your feature requests here.

When the "no clobber" option is picked, and a file is detected to already 
exist, Warrick should not do any sleeping while moving on to the next file in 
the frontier.  This will greatly enhance the usability of Warrick, and reduce 
the need for session management with  the *.save files, and would be far more 
flexible.  The most difficult part of the documentation to understand is the 
*.save feature used for saving sessions.  
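
A minimal sketch of the requested behavior (variable names are hypothetical):

    use strict;
    use warnings;

    my $no_clobber = 1;
    my @frontier = ('index.html', 'images/logo.png');   # hypothetical target paths

    for my $local_file (@frontier) {
        if ($no_clobber && -e $local_file) {
            next;   # already recovered: skip the fetch and its politeness sleep
        }
        # ... fetch the resource here, then sleep between requests ...
    }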

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:47

./TEST fails complaining that -nr is an invalid option

What steps will reproduce the problem?
1. ./INSTALL
2. ./TEST
3. Error presents

What is the expected output?

Successful test

What do you see instead?

net4-dev# ./TEST
Starting test...


#########################################################################
# Welcome to the Warrick Program!
# Warrick is a website recovery tool out of Old Dominion University
# Please provide feedback to Justin F. Brunelle at [email protected]
#########################################################################



Arguments: -D MAKEFILE -o MAKEFILE_LOGFILE.log -xc -nr -dr 2007-08-02 -T -nv 
http://www.cs.odu.edu/

Unknown option: nr
TESTING DOWNLOAD COMPLETE...
-----------------------------
Downloaded 28 resources, which is greater than the 27 from the testfile. We've 
found a new memento. Test success!
cat: MAKEFILE/TESTHEADERS.out: No such file or directory
0 vs 0

What version of the product are you using? On what operating system?

Downloaded warrickv2-2-5.tar.gz, but the version string in the source says
2.2.3; I'll open a separate ticket on this problem.

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 21 Jun 2015 at 2:13

Memory Overrun

What steps will reproduce the problem?
1. when recovering large files, recovered file grows
2. files potentially having whitespace added in the sed commands
3. file then can't be loaded into memory without causing machine to run out of 
ram.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 24 Sep 2012 at 1:05

git repo is empty

What steps will reproduce the problem?
1. git clone https://code.google.com/p/warrick/
as stated here: http://code.google.com/p/warrick/source/checkout .

What is the expected output?

  Cloning into junk.git...
  done.

What do you see instead?

  Cloning into warrick.git...
  warning: You appear to have cloned an empty repository.

What version of the product are you using? On what operating system?
N/A

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 24 Jun 2012 at 6:45

Stream Editing

Please describe your feature requests here.

Optimization:  Use perl's internal 'stream editing' instead of calling out to 
sed by fork/exec.  Not only do you save system call & fork & exec overhead, but 
perl has higher performance I/O according to one source.

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:49

name download/releases w/ consistent regex friendly revision pattern

The current downloads listed are

* warrickv2-2-1.tar.gz
* warrick_v2-2.tar.gz
* warrick_2.0.1.tar.gz

which do not follow a regular pattern, making it difficult for automated tools, 
e.g. Macports, to determine if new revisions are available.  Recommend 
something like: `warrick-v2.2.1.tar.gz`, but any consistent preference should 
be supportable.

Original issue reported on code.google.com by [email protected] on 24 Jun 2012 at 6:51

Installation Script Rework

The installation script has syntax errors as well as issues with OS detection 
and handling. Needs to be reworked to be more elegant.


Original issue reported on code.google.com by [email protected] on 23 Oct 2014 at 12:12

warrick is not working

What steps will reproduce the problem?
1. install warrick on linux system
2. install all dependencies
3. run ./TEST 
4. try to get any site

What is the expected output? What do you see instead?
Expected a directory with HTML files. Instead there are only index.htm and
a lister.o file.

What version of the product are you using? On what operating system?
I tried all versions of warrick, with the same result.
Tried on Ubuntu 12.10 and Ubuntu 12.04 (updated and upgraded).

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 20 Mar 2014 at 3:13

Exclude URL paths from being reconstructed and/or crawled

It is sometimes useful to exclude URLs in order to reduce the scope of the 
reconstruction job. 

Example: A site where every page has an 'edit' URL. These pages should be 
excluded from even being crawled in the first place, for example by excluding 
the /edit pattern.

Example: A site that has many browsing and searching paths that lead to the 
same content pages. These pages should be crawled to ensure complete coverage 
of the content pages, but not reconstructed, for example by excluding a pattern 
such as ?page=[0-9]+

Example: Only wanting to reconstruct a section of a website (everything under a 
particular subdirectory) by excluding specific other subdirectories from 
reconstruction.

Because excluding crawling and excluding reconstruction solve separate 
use-cases (although excluding crawling obviously also excludes reconstruction), 
I recommend separate command-line switches for each.
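
A sketch of how the two exclusion lists could be kept separate (hypothetical
names, not existing options):

    use strict;
    use warnings;

    my @no_crawl   = (qr{/edit});           # never even crawled
    my @no_rebuild = (qr{\?page=[0-9]+});   # crawled for links, but not saved

    sub should_crawl {
        my ($u) = @_;
        return !(grep { $u =~ $_ } @no_crawl);
    }

    sub should_rebuild {
        my ($u) = @_;
        return should_crawl($u) && !(grep { $u =~ $_ } @no_rebuild);
    }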

Original issue reported on code.google.com by [email protected] on 27 Apr 2012 at 3:03

Incomplete Dump

I want to recover an older version of a website that contains "better"
information compared to the previous one, and make a local copy for my
personal use.
I can't get a full website dump; it's always incomplete. I'm using OS X.

Is it me or the software? :)

sudo perl ./warrick.pl -o /Users/cesare/Documents/drclark.net/warrick.log  -dr 
2012-07-23 -D /Users/cesare/Documents/drclark.net - -ic -nc -nv -a ia -nB 
"http://www.drclark.net/"


Original issue reported on code.google.com by [email protected] on 7 Dec 2012 at 3:07


Not processing Images within CSS style sheets

What steps will reproduce the problem?
1. reconstruct a typical site
2.
3.

What is the expected output? What do you see instead?
I am finding that it is not processing images within CSS files, e.g.
background images in the CSS file I just retrieved:

background-image: url(http://api.wayback.archive.org/memento/20110228053720/http://www.domainname.com.au/_images/landing/10px-hover-bg.png);

It hasn't extracted and processed the CSS
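
A sketch of how url(...) references could be pulled out of a recovered
stylesheet so the images join the frontier (illustrative code, not the
project's actual implementation):

    use strict;
    use warnings;

    my $css = q{
      background-image: url(http://www.example.com.au/_images/landing/10px-hover-bg.png);
    };

    my @image_uris;
    while ($css =~ /url\(\s*['"]?([^'")\s]+)/g) {
        push @image_uris, $1;   # queue these for recovery like any other resource
    }
    print "$_\n" for @image_uris;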


What version of the product are you using? On what operating system?
warrick_2.0.1.tar.gz    

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 19 Feb 2012 at 3:04

sed Filterings

Please describe your feature requests here.


Optimization:   if the user is only pulling from one source, why call sed to do 
filtering as if pulling from all sources?

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:50

GetOptionsFromString is not exported....

What steps will reproduce the problem?
1. Install warrick, etc., under CentOS 5.8
2. Run it (e.g. warrick.pl -o mylog http://www.mysite.com)

What is the expected output? What do you see instead?
Expected a directory of stuff. Instead I get an error message and nothing in the log.


What version of the product are you using? On what operating system?
Latest (as of today). CentOS 5.8

Please provide any additional information below.

FYI... email to [email protected] is better than the gmail address under reporter 
for contact
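
The error is consistent with the script calling GetOptionsFromString without
importing it; Getopt::Long only exports GetOptions by default. A minimal
sketch of the likely fix:

    use strict;
    use warnings;
    use Getopt::Long qw(GetOptions GetOptionsFromString);   # explicit import

    my $log;
    my ($ok, $remaining) = GetOptionsFromString('-o mylog', 'o=s' => \$log);
    print "log file: $log\n" if $ok;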

Original issue reported on code.google.com by [email protected] on 24 Dec 2012 at 9:16

Version string incorrect in latest download

What steps will reproduce the problem?
1. Download warrickv2-2-5.tar.gz
2. Unpack
3. $version in warrick.pl reports 2.2.3 instead of 2.2.5

Please provide any additional information below.

Either v2.2.3 is incorrectly in the download slot for 2.2.5, or an incorrect
version string is in the file.


Original issue reported on code.google.com by [email protected] on 21 Jun 2015 at 2:15

Pause, Suspend, and Resume

Currently, you can set a limit on the number of URLs to reconstruct by using 
the -n X switch, and you can later resume the job to get another batch (of the 
same size), by using the -R PID_computername.save switch.

It would be convenient if a job that didn't have a limit set could also be 
stopped and then resumed where it left off.

Potential solutions:

1. When the job is stopped by using ctrl-C, a PID_computername.save is written 
out which can later be used to resume the job.

2. When the job is stopped by using another key combination (such as Q for quit 
or S for Suspend), a PID_computername.save is written out which can later be 
used to resume the job.

3. Jobs can be paused by using P, and resumed by using R, without ending the 
process.
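
Option 1 could be sketched with a SIGINT handler (hypothetical code; the
real .save writer would also record the frontier position, start URL, etc.):

    use strict;
    use warnings;
    use Sys::Hostname;

    # on Ctrl-C, write PID_computername.save so the job can be resumed with -R
    my $save_file = $$ . '_' . hostname() . '.save';

    $SIG{INT} = sub {
        open my $fh, '>', $save_file or die "Cannot write $save_file: $!";
        print {$fh} "<pid>$$</pid>\n";   # plus <lastFrontier>, <startUrl>, ...
        close $fh;
        exit 0;
    };

    sleep 1 while 1;   # stand-in for the recovery loop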

Original issue reported on code.google.com by [email protected] on 27 Apr 2012 at 2:46

Testing feature is outdated

The testing feature of Warrick is not operational due to changes in archival 
mementos and their features and other aspects of poor planning. Needs a 
complete restructuring.


Original issue reported on code.google.com by [email protected] on 23 Oct 2014 at 12:12

Crash Recovery

What steps will reproduce the problem?
1.Had a problem "resuming" the process when my laptop crashed, but figured out 
a workaround.
2.I was seeding Warrick with http://etpv.org/2000.html, and it was generating 
paths to recover like ...

http://etpv.org/2000.html/2000/example1.html
http://etpv.org/2000.html/2000/example2.html
http://etpv.org/2000.html/2000/example3.html
...

... which meant when I resumed the recovery after the crash, instead of 
expediently skipping over files that existed since I have "no-clobber" on, it 
would generate these bogus paths that mcurl would then try to recover.

The workaround is to delete the file off my laptop that links to the other 
files.  In this case, we're talking files like ...

1999.html
2000.html
2001.html
...

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.

attached shell script with a set of commands as a solution

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 11:18

Warrick Sleeping

Please describe your feature requests here.

Warrick needlessly sleeps for 2 seconds when the "no clobber" option is
selected, a previous session was interrupted, and the -n and -R options are
not being used.

1) The sleep(2) at line 1030 is commented out
2) The sleep(5) at line 1855 is now a sleep(7)

So, we're not missing any sleep.  It's just that we don't sleep when we already 
have the file and the "no clobber" option has been selected.

Thus, there is less of a need for the -n and -R options, since it's less of
a pain to just kill the process and re-start it.

Also, to possibly step around the problem of being blacklisted by Google or 
anyone else, perhaps calling sleep for random intervals would do the trick, to 
make it seem more like a human being looking at this & that.

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 11:20

Encoding

What steps will reproduce the problem?
1.Try to recover sites with non english accents and characters
2.
3.

What is the expected output? What do you see instead?

instead of ó I see Ã³
instead of á I see Ã¡

etc...


What version of the product are you using? On what operating system?

Latest version, on Ubuntu 12.04 LTS
Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 10 Apr 2014 at 1:37

Hash Optimization

Please describe your feature requests here.

Optimization: Use a perl hash-table instead of a perl array.  This way, you 
neatly eliminate duplicates, but can still traverse it like an array.
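
A minimal sketch of the idea (illustrative URIs):

    use strict;
    use warnings;

    my @discovered = ('http://example.com/', 'http://example.com/',
                      'http://example.com/a.html');

    my %frontier;
    $frontier{$_} = 1 for @discovered;    # duplicates collapse into one key

    for my $uri (sort keys %frontier) {   # still traversable like an array
        print "$uri\n";
    }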

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:48

MP3s corrupted during recovery

What steps will reproduce the problem?
1. recovered site with mp3s
2.
3.

What is the expected output? What do you see instead?
mp3 files still played, but were corrupted.

What version of the product are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 10:42

URI Rewriting

URI format changes by the archives have rendered the -k and URI-rewriting
features inoperable. We need to develop a way, without hardcoding, to either
automatically or easily manually change the URI strings that we need to
detect to make relative.

Archive watermarks, status bars, and other archive-added features should be 
treated similarly.


Original issue reported on code.google.com by [email protected] on 23 Oct 2014 at 12:09

New code mod.

Also check these lines; they need to be fixed:

2025
2121
2157
2940

My line now looks like this:

if($_[0] =~ m/\.jpg/i || $_[0] =~ m/\.jpeg/i || $_[0] =~ m/\.png/i ||
   $_[0] =~ m/\.gif/i || $_[0] =~ m/\.doc/i || $_[0] =~ m/\.tiff/i ||
   $_[0] =~ m/\.bmp/i || $_[0] =~ m/\.pdf/i)

A small question; I'm not very familiar with Perl:

wouldn't it be better if you put all these formats in a global array at the
top of the script,

$blacklist = array("jpg", "tiff" .....), and foreach it against the
filename? That would be clearer and easier to fix, instead of having it on
4 lines and checking what's wrong. :)
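
In Perl the suggestion could look like this (a sketch only; the sub name is
hypothetical):

    use strict;
    use warnings;

    # one shared blacklist instead of four hardcoded regex chains
    my @blacklist    = qw(jpg jpeg png gif doc tiff bmp pdf);
    my $blacklist_re = join '|', map { quotemeta } @blacklist;

    sub is_blacklisted {
        my ($name) = @_;
        return $name =~ /\.(?:$blacklist_re)/i;   # same matching as the original lines
    }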



Original issue reported on code.google.com by [email protected] on 7 Dec 2012 at 3:29

Warrick Sleeping 2

Please describe your feature requests here.

Once you start timing how long mcurl is taking (to see whether it has hung
or not), you can adjust how long Warrick sleeps between requests to maximize
recovery speed.

Right now, Warrick is sleeping for 7 seconds between requests, but if the 
Internet is unclogged, and archive.org is returning the page to mcurl in 1 
second, then why not sleep for 1 second since no one else is using the 
bandwidth and archive.org?  If archive.org returns a result in 10 seconds 
because it's loaded down, THEN Warrick should sleep for 10 seconds to ease up.  
A 30 second response would be matched with a long 30 second sleep.

Simply put, Warrick should sleep the number of seconds it takes to pull the 
last page.  This is how the ftp algorithm maximizes bandwidth, without killing 
the internet.
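
A minimal sketch of that policy, assuming the fetch happens via a system()
call to mcurl.pl (the command shown is illustrative):

    use strict;
    use warnings;
    use Time::HiRes qw(time sleep);

    my @mcurl = ('perl', 'mcurl.pl', '-o', 'index.html', 'http://example.com/');

    my $start   = time();
    system(@mcurl);                    # pull the page
    my $elapsed = time() - $start;
    sleep($elapsed) if $elapsed > 0;   # back off exactly as long as the fetch took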

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 11:22

Brass rework

The brass interface at warrick.cs.odu.edu needs to be reinstalled, and then a 
new load balancing algorithm should be put into place. This will include 
distributing brass between multiple machines, each with its own memento 
aggregator for discovering mementos.


Original issue reported on code.google.com by [email protected] on 23 Oct 2014 at 12:11
