chfoo / warcat
Tool and library for handling Web ARChive (WARC) files.
License: GNU General Public License v3.0
This would be useful for grabs where the exact same images are grabbed under different URLs. There should be a revisit record pointing from a URL to the URL it duplicates. Duplicate URLs are best discovered by comparing hashes.
This would be used for the flickr Archive Team project. The WARCs would be postprocessed with warcat deduplication.
edit: better explanation of what this would be used for.
In dealing with a megawarc, any reasonably broad set of results will have many hits, possibly too many to hand-write dd calls to extract efficiently (see #7).
It would be useful if you could pass warcat a regexp like .*foo\.wordpress\.com.* to extract all files in a megawarc dealing with a particular website. This can be approximated by telling warcat to extract all files and then deleting non-matches with find or other shell-script approaches, but at the cost of far more disk IO, temporary storage, and having to work with find. (It might also be faster, aside from the disk IO reduction, depending on whether the format stores filenames and warcat can skip over all non-matching warcs; I don't know the details there.)
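The matching step itself is cheap; a minimal sketch of such a filter, written independently of how the records are obtained (the (uri, record) tuple layout is illustrative, not warcat's API):

```python
import re

def filter_records(records, pattern):
    """Yield (uri, record) pairs whose URI matches the regex.

    `records` is any iterable of (uri, record) tuples, however obtained;
    the tuple layout here is illustrative, not warcat's API.
    """
    regex = re.compile(pattern)
    for uri, record in records:
        if regex.match(uri):
            yield uri, record
```

The expensive part in practice would be skipping non-matching records without fully reading them, which depends on warcat's internals.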
More accurately, how am I supposed to handle a "file" that is really just a bunch of bytes?
Ideally, I would like to use a BinaryIO object; however, these don't have a name attribute, so I get this error:
File "/usr/local/lib/python3.5/site-packages/warcat/model/block.py", line 83, in load
binary_block.set_file(file_obj.name or file_obj, file_obj.tell(), length)
AttributeError: '_io.BytesIO' object has no attribute 'name'
I'm not sure how to get around this.
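One possible workaround (an assumption on my part, not an official warcat API) is to give the in-memory buffer a dummy name attribute so the code path above finds one:

```python
import io

class NamedBytesIO(io.BytesIO):
    """In-memory buffer with a dummy `name` attribute, as a workaround
    for code paths that expect file objects opened from disk."""

    def __init__(self, data=b'', name='<memory>'):
        super().__init__(data)
        self.name = name
```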
I was recently working with a megawarc from the Google Reader crawl, 25GB or so in size, on an Amazon EC2 server. This took a few hours to download, and from past experience with gunzip, I know it would take a similar amount of time to decompress and write to disk.
I tried running warcat's extraction feature on it, reasoning that it would run at near-gunzip speed, since this is a stream-processing task and the IO to write each warc to a different file should have minimal overhead. Instead, it was extremely slow despite being the only thing running on that server, taking what seemed like multiple seconds to extract each file. In top, warcat was using 100% CPU, though un-gzipping should be IO-bound, not CPU-bound (which suggests an algorithmic problem somewhere). After 3 days it still had not extracted all files from the megawarc, and I believe it was less than 3/4 done; unfortunately, it crashed at some point on the third day, so I didn't find out how long it would take. (I also don't know why the crash happened, and at 3 days to reach another crash with more logging, I wasn't going to find out.)
This slowness makes warcat not very useful for working with a megawarc, and I wound up looking for a completely different approach (dd using the CDX metadata on the index/length of the specific warcs I needed).
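The dd approach can also be done in plain Python with a seek and a bounded copy; a minimal sketch (the offset and length would come from the CDX line):

```python
def extract_member(megawarc_path, offset, length, out_path, bufsize=1 << 20):
    """Copy `length` bytes starting at `offset` from a megawarc into out_path.

    Equivalent to `dd skip=offset count=length bs=1`, but with large reads
    instead of byte-at-a-time IO.
    """
    with open(megawarc_path, 'rb') as src, open(out_path, 'wb') as dst:
        src.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = src.read(min(remaining, bufsize))
            if not chunk:  # truncated input
                break
            dst.write(chunk)
            remaining -= len(chunk)
```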
I have a WARC which contains an HTTP response whose headers are malformed. Specifically, it's from http://www.assoc-amazon.com/s/link-enhancer?tag=discount039-20&o=1 and this is the data returned:
HTTP/1.1 302
Content-Type: text/html
nnCoection: close
Content-Length: 0
Location: //wms.assoc-amazon.com/20070822/US/js/link-enhancer-common.js?tag=discount039-20
Cache-Control: no-cache
Pragma: no-cache
More precisely, in Python repr notation:
b'HTTP/1.1 302 \nContent-Type: text/html\nnnCoection: close\nContent-Length: 0\nLocation: //wms.assoc-amazon.com/20070822/US/js/link-enhancer-common.js?tag=discount039-20\nCache-Control: no-cache\nPragma: no-cache\n\n'
There are several issues with this response, but the main one and the one causing trouble is that the line endings are just LF, not CRLF. This causes warcat verify to crash with the following traceback:
Traceback (most recent call last):
File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File ".../lib/python3.4/site-packages/warcat/__main__.py", line 154, in <module>
main()
File ".../lib/python3.4/site-packages/warcat/__main__.py", line 70, in main
command_info[1](args)
File ".../lib/python3.4/site-packages/warcat/__main__.py", line 136, in verify_command
tool.process()
File ".../lib/python3.4/site-packages/warcat/tool.py", line 95, in process
check_block_length=self.check_block_length)
File ".../lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
check_block_length=check_block_length)
File ".../lib/python3.4/site-packages/warcat/model/record.py", line 68, in load
content_type)
File ".../lib/python3.4/site-packages/warcat/model/block.py", line 21, in load
field_cls=HTTPHeader)
File ".../lib/python3.4/site-packages/warcat/model/block.py", line 92, in load
fields = field_cls.parse(file_obj.read(field_length).decode())
File ".../lib/python3.4/site-packages/warcat/model/field.py", line 215, in parse
http_headers.status, s = s.split(newline, 1)
ValueError: need more than 1 value to unpack
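One tolerant approach (an assumption on my part, not warcat's current behaviour) would be to normalize bare LFs to CRLF before parsing the header block; a minimal sketch:

```python
def normalize_newlines(raw):
    """Rewrite bare-LF line endings as CRLF in an HTTP header block.

    Existing CRLFs are first collapsed to LF so they are not doubled.
    A sketch of a tolerant pre-parse step, not warcat's actual code.
    """
    return raw.replace(b'\r\n', b'\n').replace(b'\n', b'\r\n')
```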
I'm getting a lot of these errors. Some pages work just fine, and all the warc files I'm reading contain HTML. The error itself is strange, since 200 OK is not a bad status line (though note the lowercase http/1.1 in the traceback).
Error on record <urn:uuid:3b608490-1308-11ec-a263-3905f05120b4>
Traceback (most recent call last):
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/site-packages/warcat/tool.py", line 108, in process
self.action(record)
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/site-packages/warcat/tool.py", line 216, in action
response = util.parse_http_response(data)
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/site-packages/warcat/util.py", line 273, in parse_http_response
response.begin()
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/http/client.py", line 319, in begin
version, status, reason = self._read_status()
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/http/client.py", line 301, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: http/1.1 200 OK
My code is:
import warcat.tool

tool = warcat.tool.ExtractTool(
    ['/tmp/my.warc'],
    out_dir='/tmp/out/',
    preserve_block=False,
    keep_going=True
)
tool.process()
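Python's http.client raises BadStatusLine whenever the status line does not start with the exact bytes HTTP/ (the check is case-sensitive), which matches the lowercase http/1.1 in the traceback above. A hypothetical pre-processing workaround, not part of warcat:

```python
def fix_status_line(raw):
    """Upper-case a lowercase HTTP version token such as b'http/1.1'.

    http.client rejects status lines that do not start with the exact
    bytes b'HTTP/'. Hypothetical pre-processing, not warcat code.
    """
    head, sep, rest = raw.partition(b'\r\n')
    if head[:5].lower() == b'http/':
        head = b'HTTP/' + head[5:]
    return head + sep + rest
```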
Currently warcat gives the following error on revisit records from a deduplicated WARC:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 282, in action
action(record)
File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 298, in verify_payload_digest
raise VerifyProblem('Bad payload digest.', '5.9')
warcat.tool.VerifyProblem: ('Bad payload digest.', '5.9', True)
The payload digest of a revisit record should be the payload digest of the record the revisit record points to, see 6.7.2 on page 15 (page 21 in the PDF) on http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf:
To report the payload digest used for comparison, a 'revisit' record using this profile shall include a WARC-Payload-Digest field, with a value of the digest that was calculated on the payload.
(...)
For records using this profile, the payload is defined as the original payload content whose digest value was unchanged.
Currently warcat reports an error for the payload digest. It would be nice if it checked the WARC for the record the revisit record refers to: if that record is in the WARC, compare the payload digest with that one; if it is not, emit a warning or info that the record the revisit record refers to is not in the WARC.
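The suggested check could be sketched like this, independently of warcat's internals (the tuple layout is an assumption for illustration):

```python
def check_revisit_digests(records):
    """Cross-check revisit records against the records they refer to.

    `records` is an iterable of (record_id, warc_type, refers_to,
    payload_digest) tuples; this layout is illustrative, not warcat's API.
    Returns a list of (record_id, problem) pairs.
    """
    records = list(records)
    digests = {rid: digest
               for rid, wtype, _, digest in records
               if wtype != 'revisit'}
    problems = []
    for rid, wtype, refers_to, digest in records:
        if wtype != 'revisit':
            continue
        if refers_to not in digests:
            problems.append((rid, 'referenced record not in this WARC'))
        elif digests[refers_to] != digest:
            problems.append((rid, 'payload digest mismatch'))
    return problems
```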
WARCs from at least wpull 1.2.3 produce a warning of "Content block length changed from X to Y" for warcinfo records. Example:
> wpull --version
1.2.3
> wpull https://example.org/ --warc-file example.org --warc-max-size 1234567890 --delete-after
<snip>
> python3 -m warcat verify example.org-meta.warc.gz --verbose --verbose
INFO:warcat.model.warc:Opened gziped file example.org-meta.warc.gz
DEBUG:warcat.util:Creating buffer block file. index=0
DEBUG:warcat.util:Buffer block file created. length=4838
DEBUG:warcat.model.record:Record start at 0 0x0
DEBUG:warcat.model.field:Version line=WARC/1.0
DEBUG:warcat.model.record:Block length=3665
DEBUG:warcat.model.block:Field length=3665
DEBUG:warcat.model.block:Payload length=0
WARNING:warcat.model.record:Content block length changed from 3665 to 3656
DEBUG:warcat.model.warc:Finished reading a record <urn:uuid:eb5182ba-c3fc-41a0-8be8-649c463c5c1d>
DEBUG:warcat.util:Creating buffer block file. index=0
DEBUG:warcat.util:Buffer block file created. length=4838
DEBUG:warcat.model.binary:Creating safe file of example.org-meta.warc.gz
DEBUG:warcat.tool:Block digest ok
DEBUG:warcat.model.record:Record start at 3986 0xf92
DEBUG:warcat.model.field:Version line=WARC/1.0
DEBUG:warcat.model.record:Block length=511
DEBUG:warcat.model.block:Binary content block length=511
DEBUG:warcat.model.warc:Finished reading a record <urn:uuid:3f2124c6-b9f8-4a12-9870-d93ea45335d8>
INFO:warcat.model.warc:Finished reading Warc
DEBUG:warcat.model.binary:Creating safe file of example.org-meta.warc.gz
DEBUG:warcat.tool:Block digest ok
The difference between the numbers is exactly the same as the number of lines in that warcinfo record body. I doubt that's a coincidence, but I wasn't able to narrow down the origin based on a brief glance over the source code. I imagine it has something to do with the block length being recalculated from the normalised field representation. If that interpretation's correct, then I think the warning should be suppressed in verify mode since it should be irrelevant when not writing out a new WARC.
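If that interpretation is right, the off-by-N would match the size difference between CRLF and bare-LF serialization of the same header lines, which is exactly one byte per line; a minimal demonstration:

```python
def serialized_length_difference(field_lines):
    """Byte difference between CRLF and bare-LF serialization of the same
    header lines: exactly one byte per line."""
    lf = '\n'.join(field_lines) + '\n'
    crlf = '\r\n'.join(field_lines) + '\r\n'
    return len(crlf) - len(lf)
```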
The following:
byte_stream = io.BytesIO(r.content)
file_object = gzip.GzipFile(fileobj=byte_stream)
warc = warcat.model.WARC().read_file_object(file_object)
record = warc.records[0]
binary_block = record.content_block.binary_block.get_file()
results in an AttributeError in warcat.model.binary.BinaryFileRef:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-1319f0884b9c> in <module>()
----> 1 rec.content_block.binary_block.get_file()
/usr/local/lib/python3.5/site-packages/warcat/model/binary.py in get_file(self, safe, spool_size)
128 file_obj = self.file_obj
129
--> 130 original_position = file_obj.tell()
131
132 if self.file_offset:
AttributeError: 'NoneType' object has no attribute 'tell'
The same error also occurs with the Payload.get_file method. This seems to be because the BinaryBlock and BlockWithPayload classes' load methods pass the file object's name directly to set_file on lines 40, 83, and 96 of warcat/model/block.py; changing these lines to pass in the file object itself instead of its name seems to work.
See #10 (comment) for details
Traceback (most recent call last):
File "/0/home/waxy/usr/local/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/0/home/waxy/usr/local/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 154, in <module>
main()
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 70, in main
command_info[1](args)
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 131, in extract_command
tool.process()
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 112, in process
raise e
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 106, in process
self.action(record)
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 229, in action
shutil.copyfileobj(response, f)
File "/0/home/waxy/usr/local/lib/python3.4/shutil.py", line 66, in copyfileobj
buf = fsrc.read(length)
File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 500, in read
return super(HTTPResponse, self).read(amt)
File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 529, in readinto
return self._readinto_chunked(b)
File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 621, in _readinto_chunked
n = self._safe_readinto(mvb)
File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 680, in _safe_readinto
raise IncompleteRead(bytes(mvb[0:total_bytes]), len(b))
http.client.IncompleteRead: IncompleteRead(7052 bytes read, 16384 more expected)
Add support for handling long filenames instead of crashing
I've installed warcat on my server under Python 3.4. Calling warc.load() on a warc file gives me the following error message:
>>> warc.load("/gstorage01/external-data/internet-archive/archive.org/download/archiveteam_pdf_20160412083746/pdf_20160412083746.megawarc.warc.gz")
Content block length changed from 92850 to 92849
Content block length changed from 150326 to 150325
Content block length changed from 156258 to 156257
Content block length changed from 129362 to 129361
Content block length changed from 156196 to 156195
Content block length changed from 129336 to 129335
Content block length changed from 147763 to 147762
Content block length changed from 129338 to 129337
Content block length changed from 129350 to 129349
Content block length changed from 156195 to 156194
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 32, in load
self.read_file_object(f)
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 39, in read_file_object
record, has_more = self.read_record(file_object)
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
check_block_length=check_block_length)
File "/usr/lib/python3.4/site-packages/warcat/model/record.py", line 68, in load
content_type)
File "/usr/lib/python3.4/site-packages/warcat/model/block.py", line 21, in load
field_cls=HTTPHeader)
File "/usr/lib/python3.4/site-packages/warcat/model/block.py", line 92, in load
fields = field_cls.parse(file_obj.read(field_length).decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 26: invalid continuation byte
The data is available from the Internet Archive website, so anyone can download it. The size is about 130GB, but I don't think that should matter. The key question is how a codec error can happen at all.
I was surprised that the example provided in the documentation:
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
reads everything into memory, and there is no easy way to iterate over records without doing so.
In my case, WARC files take gigabytes of space, so I want to process those files record by record without loading everything into memory.
After reading the sources I came up with this helper function:
import warcat.model

def readwarc(filename, types=('response',)):
    f = warcat.model.WARC.open(filename)
    has_more = True
    while has_more:
        record, has_more = warcat.model.WARC.read_record(f)
        if not types or record.warc_type in types:
            if isinstance(record.content_block, warcat.model.BlockWithPayload):
                yield record, record.content_block.payload.get_file
            elif hasattr(record.content_block, 'binary_block'):
                yield record, record.content_block.binary_block.get_file
            else:
                yield record, record.content_block.get_file

for record, content in readwarc('pages.warc.gz'):
    with content() as f:
        ...  # process f
I think it would be really useful if Warcat provided an interface for lazy iteration over a whole WARC file. I would imagine it looking something like this:
import warcat

for record in warcat.readrecords('pages.warc.gz'):
    with record.content() as f:
        ...  # process f

Also, if I could get lxml, BeautifulSoup and json objects from records, something like this:
for record in warcat.readrecords('pages.warc.gz'):
    record.lxml.xpath('//a')
    record.soup.select('a')
    record.json['a']

then it would be really amazing.
If you agree with the suggested API, I can create a pull request with the implementation.
Currently, cdx-writer expects Content-Type to be in the form application/http; msgtype=response. However, the WARC spec allows the form application/http;msgtype=response (note there is no space). Warcat should warn when this is detected.
See Issue: internetarchive/CDX-Writer#4.
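A parse that tolerates both forms could look like this (a sketch of the tolerant behaviour, not cdx-writer's or warcat's actual code):

```python
import re

# Optional whitespace after the semicolon covers both spec-allowed forms.
MSGTYPE_RE = re.compile(r'application/http;\s*msgtype=(request|response)')

def msgtype(content_type):
    """Extract msgtype from a WARC Content-Type, tolerating optional
    whitespace after the semicolon. Returns None if it does not apply."""
    match = MSGTYPE_RE.fullmatch(content_type.strip())
    return match.group(1) if match else None
```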
I'm not sure if this is supposed to be supported, but this generates an error.
$ python3 -m warcat pass co012304aa.warc.gz
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/marked/.local/lib/python3.7/site-packages/warcat/__main__.py", line 154, in <module>
main()
File "/home/marked/.local/lib/python3.7/site-packages/warcat/__main__.py", line 70, in main
command_info[1](args)
File "/home/marked/.local/lib/python3.7/site-packages/warcat/__main__.py", line 113, in pass_command
warc.load(filename, force_gzip=args.force_read_gzip)
TypeError: load() got an unexpected keyword argument 'force_gzip'
According to the WARC 1.0 and 1.1 specifications "[t]he WARC-Refers-To field shall not be used in ‘warcinfo’, ‘response’, ‘resource’, ‘request’, and ‘continuation’ records".
In warcat's source code, verify_refers_to does not mention 'resource'.
Is this a bug or a feature?
Line 359 in fb63a37
In some of the mega WARCs produced by Archive Team, extracting all the WARCs to save just a few is infeasible, as it can take at least 2 days to extract them all using warcat.
One might have already checked the CDX files (to find which mega WARC to download) and so know the index and length. If you know this, it's possible to seek directly in the WARC and extract the sequence of bytes which make up a particular WARC. For example, using a CDX line like
[...] unk - HOYKQ63N2D6UJ4TOIXMOTUD4IY7MP5HM - - 1326824 19810951910 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
I can handwrite the extraction using dd:
$ dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=1.warc.gz bs=1 && gunzip 1.warc.gz
1326824+0 records in
1326824+0 records out
1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s
Which is >11,200x faster than extracting everything with warcat and looking for the file I need.
The downside is needing to mess with dd, being totally inaccessible to non-programmers, being inconvenient in terms of scripting, etc.
It'd be great if warcat could include some additional arguments to the extract functionality, like a pair of --length=n and --index=i flags, to provide a nicer interface for pulling out a few warcs.
This would also go very well with HTTP Range support; then you could look up the index/length in a CDX file, seek right to the specific binary sequence on Archive.org, and download only the few MB you need instead of, say, a giant 52GB megawarc. (You could imagine building an on-demand extraction service on this: store only the master index on your server, and when a user requests a particular file, look up the WARC index/length in the master index, call warcat to extract that specific WARC from the IA-hosted megawarc, and return it to the user. That way you don't need to store all 9TB or whatever.)
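The Range idea can be sketched with the standard library (the URL here is illustrative; a server honouring the Range header returns only the requested byte span):

```python
import urllib.request

def range_request(url, offset, length):
    """Build an HTTP request for `length` bytes at `offset` of a remote
    megawarc, using the Range header (end offset is inclusive)."""
    end = offset + length - 1
    return urllib.request.Request(
        url, headers={'Range': 'bytes={}-{}'.format(offset, end)})
```

Using the CDX line above, range_request(url, 19810951910, 1326824) would fetch just the 1.3MB member.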
For example, hanzo's warc-tools expects WARC-Type and not Warc-Type. The ISO spec says that field names are case-insensitive, but implementations may not follow the spec closely. The verify command should warn when it detects this.
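A verify-time check could compare field names case-insensitively against their canonical spellings; a sketch (the canonical list is illustrative, not exhaustive):

```python
CANONICAL_NAMES = (
    'WARC-Type', 'WARC-Record-ID', 'WARC-Date', 'WARC-Target-URI',
    'WARC-Payload-Digest', 'Content-Length', 'Content-Type',
)

def field_case_warning(name):
    """Return the canonical spelling if `name` differs from it only by
    case (worth a warning), else None."""
    for canonical in CANONICAL_NAMES:
        if name.lower() == canonical.lower() and name != canonical:
            return canonical
    return None
```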
Fields such as
example:\r\n
get serialized to
example: \r\n
which may be undesirable.
In section 1 of http://tools.ietf.org/search/draft-kunze-anvl-02, the example shows that a space is not added for an empty value.
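A serializer following that example would omit the space only when the value is empty; a minimal sketch:

```python
def serialize_field(name, value):
    """Serialize one ANVL-style field. Per the example in section 1 of
    draft-kunze-anvl-02, no space follows ':' when the value is empty."""
    if value:
        return '{}: {}\r\n'.format(name, value)
    return '{}:\r\n'.format(name)
```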
With chfoo/wpull being a success at supporting Python 2 using the latest lib3to2, Warcat shouldn't have problems being backported.