chfoo / warcat
Tool and library for handling Web ARChive (WARC) files.
License: GNU General Public License v3.0
This would be useful for grabs where the exact same images are grabbed under different URLs. There should be a revisit record pointing from a URL to the URL it duplicates. Duplicate URLs are best discovered by comparing hashes.
This would be used for the flickr Archive Team project. The WARCs would be postprocessed with warcat deduplication.
edit: better explanation of what this would be used for.
In dealing with a megawarc, any reasonably broad set of results will have many hits, possibly too many to hand-write dd calls to extract efficiently (see #7).
It would be useful if you could pass warcat a regexp like .*foo\.wordpress\.com.* to extract all files in a megawarc dealing with a particular website. This can be approximated by telling warcat to extract all files and then deleting non-matches with find or other shell-script approaches, but at the cost of far more disk IO, temporary storage, and having to work with find. (It might also be faster, aside from the disk IO reduction, depending on whether the format stores filenames and warcat can skip over all non-matching warcs; I don't know the details there.)
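The matching step itself is cheap; a minimal sketch of such a filter, written independently of how the records are obtained (the (uri, record) tuple layout is illustrative, not warcat's API):

```python
import re

def filter_records(records, pattern):
    """Yield (uri, record) pairs whose URI matches the regex.

    `records` is any iterable of (uri, record) tuples, however obtained;
    the tuple layout here is illustrative, not warcat's API.
    """
    regex = re.compile(pattern)
    for uri, record in records:
        if regex.match(uri):
            yield uri, record
```

The expensive part in practice would be skipping non-matching records without fully reading them, which depends on warcat's internals.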
More accurately, how am I supposed to handle a "file" that is really just a bunch of bytes?
Ideally, I would like to use a BinaryIO object; however, these don't have a name attribute, so I get this error:
File "/usr/local/lib/python3.5/site-packages/warcat/model/block.py", line 83, in load
binary_block.set_file(file_obj.name or file_obj, file_obj.tell(), length)
AttributeError: '_io.BytesIO' object has no attribute 'name'
I'm not sure how to get around this.
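One possible workaround (an assumption on my part, not an official warcat API) is to give the in-memory buffer a dummy name attribute so the code path above finds one:

```python
import io

class NamedBytesIO(io.BytesIO):
    """In-memory buffer with a dummy `name` attribute, as a workaround
    for code paths that expect file objects opened from disk."""

    def __init__(self, data=b'', name='<memory>'):
        super().__init__(data)
        self.name = name
```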
I was recently working with a megawarc from the Google Reader crawl, 25GB or so in size, on an Amazon EC2 server. This took a few hours to download, and from past experience with gunzip, I know it would take a similar amount of time to decompress and write to disk.
I tried running warcat's extraction feature on it, reasoning that it would run at near-gunzip speed, since this is a stream-processing task and the IO to write each warc to a different file should have minimal overhead. Instead, it was extremely slow despite being the only thing running on that server, taking what seemed like multiple seconds to extract each file. In top, warcat was using 100% CPU, though un-gzipping should be IO-bound, not CPU-bound (which suggests an algorithmic problem somewhere). After 3 days it still had not extracted all files from the megawarc, and I believe it was less than 3/4 done; unfortunately, it crashed at some point on the third day, so I didn't find out how long it would take. (I also don't know why the crash happened, and at 3 days to reach another crash with more logging, I wasn't going to find out.)
This slowness makes warcat not very useful for working with a megawarc, and I wound up looking for a completely different approach (dd using the CDX metadata on the index/length of the specific warcs I needed).
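The dd approach can also be done in plain Python with a seek and a bounded copy; a minimal sketch (the offset and length would come from the CDX line):

```python
def extract_member(megawarc_path, offset, length, out_path, bufsize=1 << 20):
    """Copy `length` bytes starting at `offset` from a megawarc into out_path.

    Equivalent to `dd skip=offset count=length bs=1`, but with large reads
    instead of byte-at-a-time IO.
    """
    with open(megawarc_path, 'rb') as src, open(out_path, 'wb') as dst:
        src.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = src.read(min(remaining, bufsize))
            if not chunk:  # truncated input
                break
            dst.write(chunk)
            remaining -= len(chunk)
```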
I have a WARC which contains an HTTP response whose headers are malformed. Specifically, it's from http://www.assoc-amazon.com/s/link-enhancer?tag=discount039-20&o=1 and this is the data returned:
HTTP/1.1 302
Content-Type: text/html
nnCoection: close
Content-Length: 0
Location: //wms.assoc-amazon.com/20070822/US/js/link-enhancer-common.js?tag=discount039-20
Cache-Control: no-cache
Pragma: no-cache
More precisely, in Python repr notation:
b'HTTP/1.1 302 \nContent-Type: text/html\nnnCoection: close\nContent-Length: 0\nLocation: //wms.assoc-amazon.com/20070822/US/js/link-enhancer-common.js?tag=discount039-20\nCache-Control: no-cache\nPragma: no-cache\n\n'
There are several issues with this response, but the main one and the one causing trouble is that the line endings are just LF, not CRLF. This causes warcat verify to crash with the following traceback:
Traceback (most recent call last):
File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File ".../lib/python3.4/site-packages/warcat/__main__.py", line 154, in <module>
main()
File ".../lib/python3.4/site-packages/warcat/__main__.py", line 70, in main
command_info[1](args)
File ".../lib/python3.4/site-packages/warcat/__main__.py", line 136, in verify_command
tool.process()
File ".../lib/python3.4/site-packages/warcat/tool.py", line 95, in process
check_block_length=self.check_block_length)
File ".../lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
check_block_length=check_block_length)
File ".../lib/python3.4/site-packages/warcat/model/record.py", line 68, in load
content_type)
File ".../lib/python3.4/site-packages/warcat/model/block.py", line 21, in load
field_cls=HTTPHeader)
File ".../lib/python3.4/site-packages/warcat/model/block.py", line 92, in load
fields = field_cls.parse(file_obj.read(field_length).decode())
File ".../lib/python3.4/site-packages/warcat/model/field.py", line 215, in parse
http_headers.status, s = s.split(newline, 1)
ValueError: need more than 1 value to unpack
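One tolerant approach (an assumption on my part, not warcat's current behaviour) would be to normalize bare LFs to CRLF before parsing the header block; a minimal sketch:

```python
def normalize_newlines(raw):
    """Rewrite bare-LF line endings as CRLF in an HTTP header block.

    Existing CRLFs are first collapsed to LF so they are not doubled.
    A sketch of a tolerant pre-parse step, not warcat's actual code.
    """
    return raw.replace(b'\r\n', b'\n').replace(b'\n', b'\r\n')
```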
I'm getting a lot of these errors. Some pages work just fine, and all the warc files I'm reading contain HTML. The error itself is strange, since 200 OK is not a bad status line (though note the lowercase http/1.1 in the traceback).
Error on record <urn:uuid:3b608490-1308-11ec-a263-3905f05120b4>
Traceback (most recent call last):
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/site-packages/warcat/tool.py", line 108, in process
self.action(record)
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/site-packages/warcat/tool.py", line 216, in action
response = util.parse_http_response(data)
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/site-packages/warcat/util.py", line 273, in parse_http_response
response.begin()
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/http/client.py", line 319, in begin
version, status, reason = self._read_status()
File "/home/korny/.conda/envs/ploomber-gpt/lib/python3.9/http/client.py", line 301, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: http/1.1 200 OK
My code is:
import warcat.tool

tool = warcat.tool.ExtractTool(
    ['/tmp/my.warc'],
    out_dir='/tmp/out/',
    preserve_block=False,
    keep_going=True
)
tool.process()
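Python's http.client raises BadStatusLine whenever the status line does not start with the exact bytes HTTP/ (the check is case-sensitive), which matches the lowercase http/1.1 in the traceback above. A hypothetical pre-processing workaround, not part of warcat:

```python
def fix_status_line(raw):
    """Upper-case a lowercase HTTP version token such as b'http/1.1'.

    http.client rejects status lines that do not start with the exact
    bytes b'HTTP/'. Hypothetical pre-processing, not warcat code.
    """
    head, sep, rest = raw.partition(b'\r\n')
    if head[:5].lower() == b'http/':
        head = b'HTTP/' + head[5:]
    return head + sep + rest
```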
Currently warcat gives the following error on revisit records from a deduplicated WARC:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 282, in action
action(record)
File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 298, in verify_payload_digest
raise VerifyProblem('Bad payload digest.', '5.9')
warcat.tool.VerifyProblem: ('Bad payload digest.', '5.9', True)
The payload digest of a revisit record should be the payload digest of the record the revisit record points to, see 6.7.2 on page 15 (page 21 in the PDF) on http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf:
To report the payload digest used for comparison, a 'revisit' record using this profile shall include a WARC-Payload-Digest field, with a value of the digest that was calculated on the payload.
(...)
For records using this profile, the payload is defined as the original payload content whose digest value was unchanged.
Currently warcat reports an error for the payload digest. It would be nice if it checked the WARC for the record the revisit record refers to: if that record is in the WARC, compare the payload digest with that one; if it is not, emit a warning or info that the record the revisit record refers to is not in the WARC.
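The suggested check could be sketched like this, independently of warcat's internals (the tuple layout is an assumption for illustration):

```python
def check_revisit_digests(records):
    """Cross-check revisit records against the records they refer to.

    `records` is an iterable of (record_id, warc_type, refers_to,
    payload_digest) tuples; this layout is illustrative, not warcat's API.
    Returns a list of (record_id, problem) pairs.
    """
    records = list(records)
    digests = {rid: digest
               for rid, wtype, _, digest in records
               if wtype != 'revisit'}
    problems = []
    for rid, wtype, refers_to, digest in records:
        if wtype != 'revisit':
            continue
        if refers_to not in digests:
            problems.append((rid, 'referenced record not in this WARC'))
        elif digests[refers_to] != digest:
            problems.append((rid, 'payload digest mismatch'))
    return problems
```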
WARCs from at least wpull 1.2.3 produce a warning of "Content block length changed from X to Y" for warcinfo records. Example:
> wpull --version
1.2.3
> wpull https://example.org/ --warc-file example.org --warc-max-size 1234567890 --delete-after
<snip>
> python3 -m warcat verify example.org-meta.warc.gz --verbose --verbose
INFO:warcat.model.warc:Opened gziped file example.org-meta.warc.gz
DEBUG:warcat.util:Creating buffer block file. index=0
DEBUG:warcat.util:Buffer block file created. length=4838
DEBUG:warcat.model.record:Record start at 0 0x0
DEBUG:warcat.model.field:Version line=WARC/1.0
DEBUG:warcat.model.record:Block length=3665
DEBUG:warcat.model.block:Field length=3665
DEBUG:warcat.model.block:Payload length=0
WARNING:warcat.model.record:Content block length changed from 3665 to 3656
DEBUG:warcat.model.warc:Finished reading a record <urn:uuid:eb5182ba-c3fc-41a0-8be8-649c463c5c1d>
DEBUG:warcat.util:Creating buffer block file. index=0
DEBUG:warcat.util:Buffer block file created. length=4838
DEBUG:warcat.model.binary:Creating safe file of example.org-meta.warc.gz
DEBUG:warcat.tool:Block digest ok
DEBUG:warcat.model.record:Record start at 3986 0xf92
DEBUG:warcat.model.field:Version line=WARC/1.0
DEBUG:warcat.model.record:Block length=511
DEBUG:warcat.model.block:Binary content block length=511
DEBUG:warcat.model.warc:Finished reading a record <urn:uuid:3f2124c6-b9f8-4a12-9870-d93ea45335d8>
INFO:warcat.model.warc:Finished reading Warc
DEBUG:warcat.model.binary:Creating safe file of example.org-meta.warc.gz
DEBUG:warcat.tool:Block digest ok
The difference between the numbers is exactly the same as the number of lines in that warcinfo record body. I doubt that's a coincidence, but I wasn't able to narrow down the origin based on a brief glance over the source code. I imagine it has something to do with the block length being recalculated from the normalised field representation. If that interpretation's correct, then I think the warning should be suppressed in verify mode since it should be irrelevant when not writing out a new WARC.
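If that interpretation is right, the off-by-N would match the size difference between CRLF and bare-LF serialization of the same header lines, which is exactly one byte per line; a minimal demonstration:

```python
def serialized_length_difference(field_lines):
    """Byte difference between CRLF and bare-LF serialization of the same
    header lines: exactly one byte per line."""
    lf = '\n'.join(field_lines) + '\n'
    crlf = '\r\n'.join(field_lines) + '\r\n'
    return len(crlf) - len(lf)
```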
The following:
byte_stream = io.BytesIO(r.content)
file_object = gzip.GzipFile(fileobj=byte_stream)
warc = warcat.model.WARC().read_file_object(file_object)
record = warc.records[0]
binary_block = record.content_block.binary_block.get_file()
results in an AttributeError in warcat.model.binary.BinaryFileRef:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-1319f0884b9c> in <module>()
----> 1 rec.content_block.binary_block.get_file()
/usr/local/lib/python3.5/site-packages/warcat/model/binary.py in get_file(self, safe, spool_size)
128 file_obj = self.file_obj
129
--> 130 original_position = file_obj.tell()
131
132 if self.file_offset:
AttributeError: 'NoneType' object has no attribute 'tell'
The same error also occurs with the Payload.get_file method. This seems to be because the BinaryBlock and BlockWithPayload classes' load methods pass the file object's name directly to set_file on lines 40, 83, and 96 of warcat/model/block.py; changing these lines to pass in the file object itself instead of its name seems to work.
See #10 (comment) for details
Traceback (most recent call last):
File "/0/home/waxy/usr/local/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/0/home/waxy/usr/local/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 154, in <module>
main()
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 70, in main
command_info[1](args)
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 131, in extract_command
tool.process()
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 112, in process
raise e
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 106, in process
self.action(record)
File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 229, in action
shutil.copyfileobj(response, f)
File "/0/home/waxy/usr/local/lib/python3.4/shutil.py", line 66, in copyfileobj
buf = fsrc.read(length)
File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 500, in read
return super(HTTPResponse, self).read(amt)
File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 529, in readinto
return self._readinto_chunked(b)
File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 621, in _readinto_chunked
n = self._safe_readinto(mvb)
File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 680, in _safe_readinto
raise IncompleteRead(bytes(mvb[0:total_bytes]), len(b))
http.client.IncompleteRead: IncompleteRead(7052 bytes read, 16384 more expected)
Add support for handling long filenames instead of crashing
I've installed warcat on my server under Python 3.4. Calling warc.load() on a warc file gives me the following error message:
>>> warc.load("/gstorage01/external-data/internet-archive/archive.org/download/archiveteam_pdf_20160412083746/pdf_20160412083746.megawarc.warc.gz")
Content block length changed from 92850 to 92849
Content block length changed from 150326 to 150325
Content block length changed from 156258 to 156257
Content block length changed from 129362 to 129361
Content block length changed from 156196 to 156195
Content block length changed from 129336 to 129335
Content block length changed from 147763 to 147762
Content block length changed from 129338 to 129337
Content block length changed from 129350 to 129349
Content block length changed from 156195 to 156194
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 32, in load
self.read_file_object(f)
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 39, in read_file_object
record, has_more = self.read_record(file_object)
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
check_block_length=check_block_length)
File "/usr/lib/python3.4/site-packages/warcat/model/record.py", line 68, in load
content_type)
File "/usr/lib/python3.4/site-packages/warcat/model/block.py", line 21, in load
field_cls=HTTPHeader)
File "/usr/lib/python3.4/site-packages/warcat/model/block.py", line 92, in load
fields = field_cls.parse(file_obj.read(field_length).decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 26: invalid continuation byte
The data is available from the Internet Archive website, so anyone can download it. The size is about 130GB, but I don't think that should matter. The key question is how a codec error can happen at all.
I was surprised that the example provided in the documentation:
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
reads everything into memory, and there is no easy way to iterate over records without doing so.
In my case, WARC files take gigabytes of space, so I want to process those files record by record without loading everything into memory.
After reading the sources I came up with this helper function:
import warcat.model

def readwarc(filename, types=('response',)):
    f = warcat.model.WARC.open(filename)
    has_more = True
    while has_more:
        record, has_more = warcat.model.WARC.read_record(f)
        if not types or record.warc_type in types:
            if isinstance(record.content_block, warcat.model.BlockWithPayload):
                yield record, record.content_block.payload.get_file
            elif hasattr(record.content_block, 'binary_block'):
                yield record, record.content_block.binary_block.get_file
            else:
                yield record, record.content_block.get_file

for record, content in readwarc('pages.warc.gz'):
    with content() as f:
        ...  # process f
I think it would be really useful if Warcat provided an interface for lazy iteration over a whole WARC file. I would imagine it looking something like this:
import warcat

for record in warcat.readrecords('pages.warc.gz'):
    with record.content() as f:
        ...  # process f

Also, if I could get lxml, BeautifulSoup and json objects from records, something like this:
for record in warcat.readrecords('pages.warc.gz'):
    record.lxml.xpath('//a')
    record.soup.select('a')
    record.json['a']

then it would be really amazing.
If you agree with the suggested API, I can create a pull request with the implementation.
Currently, cdx-writer expects Content-Type to be in the form application/http; msgtype=response. However, the WARC spec allows the form application/http;msgtype=response (note there is no space). Warcat should warn when this is detected.
See Issue: internetarchive/CDX-Writer#4.
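A parse that tolerates both forms could look like this (a sketch of the tolerant behaviour, not cdx-writer's or warcat's actual code):

```python
import re

# Optional whitespace after the semicolon covers both spec-allowed forms.
MSGTYPE_RE = re.compile(r'application/http;\s*msgtype=(request|response)')

def msgtype(content_type):
    """Extract msgtype from a WARC Content-Type, tolerating optional
    whitespace after the semicolon. Returns None if it does not apply."""
    match = MSGTYPE_RE.fullmatch(content_type.strip())
    return match.group(1) if match else None
```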
I'm not sure if this is supposed to be supported, but this generates an error.
$ python3 -m warcat pass co012304aa.warc.gz
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/marked/.local/lib/python3.7/site-packages/warcat/__main__.py", line 154, in <module>
main()
File "/home/marked/.local/lib/python3.7/site-packages/warcat/__main__.py", line 70, in main
command_info[1](args)
File "/home/marked/.local/lib/python3.7/site-packages/warcat/__main__.py", line 113, in pass_command
warc.load(filename, force_gzip=args.force_read_gzip)
TypeError: load() got an unexpected keyword argument 'force_gzip'
According to the WARC 1.0 and 1.1 specifications "[t]he WARC-Refers-To field shall not be used in ‘warcinfo’, ‘response’, ‘resource’, ‘request’, and ‘continuation’ records".
In warcat's source code, verify_refers_to does not mention 'resource'.
Is this a bug or a feature?
Line 359 in fb63a37
In some of the mega WARCs produced by Archive Team, extracting all the WARCs to save just a few is infeasible, as it can take at least 2 days to extract them all using warcat.
One might have already checked the CDX files (to find which mega WARC to download) and so know the index and length. If you know this, it's possible to seek directly in the WARC and extract the sequence of bytes which make up a particular WARC. For example, using a CDX line like
[...] unk - HOYKQ63N2D6UJ4TOIXMOTUD4IY7MP5HM - - 1326824 19810951910 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
I can handwrite the extraction using dd:
$ dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=1.warc.gz bs=1 && gunzip 1.warc.gz
1326824+0 records in
1326824+0 records out
1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s
Which is >11,200x faster than extracting everything with warcat and looking for the file I need.
The downside is needing to mess with dd, being totally inaccessible to non-programmers, being inconvenient in terms of scripting, etc.
It'd be great if warcat could include some additional arguments to the extract functionality, like a pair of --length=n and --index=i flags, to provide a nicer interface for pulling out a few warcs.
This would also go very well with HTTP Range support; then you could look up the index/length in a CDX file, seek right to the specific binary sequence on Archive.org, and download only the few MB you need instead of, say, a giant 52GB megawarc. (You could imagine building an on-demand extraction service on this: store only the master index on your server, and when a user requests a particular file, look up the WARC index/length in the master index, call warcat to extract that specific WARC from the IA-hosted megawarc, and return it to the user. That way you don't need to store all 9TB or whatever.)
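The Range idea can be sketched with the standard library (the URL here is illustrative; a server honouring the Range header returns only the requested byte span):

```python
import urllib.request

def range_request(url, offset, length):
    """Build an HTTP request for `length` bytes at `offset` of a remote
    megawarc, using the Range header (end offset is inclusive)."""
    end = offset + length - 1
    return urllib.request.Request(
        url, headers={'Range': 'bytes={}-{}'.format(offset, end)})
```

Using the CDX line above, range_request(url, 19810951910, 1326824) would fetch just the 1.3MB member.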
For example, hanzo's warc-tools expects WARC-Type and not Warc-Type. The ISO spec says that field names are case-insensitive, but implementations may not follow the spec closely. The verify command should warn when it detects this.
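A verify-time check could compare field names case-insensitively against their canonical spellings; a sketch (the canonical list is illustrative, not exhaustive):

```python
CANONICAL_NAMES = (
    'WARC-Type', 'WARC-Record-ID', 'WARC-Date', 'WARC-Target-URI',
    'WARC-Payload-Digest', 'Content-Length', 'Content-Type',
)

def field_case_warning(name):
    """Return the canonical spelling if `name` differs from it only by
    case (worth a warning), else None."""
    for canonical in CANONICAL_NAMES:
        if name.lower() == canonical.lower() and name != canonical:
            return canonical
    return None
```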
Fields such as
example:\r\n
get serialized to
example: \r\n
which may be undesirable.
In section 1 of http://tools.ietf.org/search/draft-kunze-anvl-02, the example shows that a space is not added for an empty value.
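A serializer following that example would omit the space only when the value is empty; a minimal sketch:

```python
def serialize_field(name, value):
    """Serialize one ANVL-style field. Per the example in section 1 of
    draft-kunze-anvl-02, no space follows ':' when the value is empty."""
    if value:
        return '{}: {}\r\n'.format(name, value)
    return '{}:\r\n'.format(name)
```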
With chfoo/wpull being a success at supporting Python 2 using the latest lib3to2, Warcat shouldn't have problems being backported.