internetarchive / warc


Python library for reading and writing warc files

License: GNU General Public License v2.0

Python 100.00%

warc's Introduction

warc: Python library to work with WARC files

WARC (Web ARChive) is a file format for storing web crawls.

http://bibnum.bnf.fr/WARC/

This warc library makes it easy to work with WARC files:

import warc
f = warc.open("test.warc")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

Documentation

The documentation of the warc library is available at http://warc.readthedocs.org/.

License

This software is licensed under GPL v2. See LICENSE file for details.

warc's Issues

How do I assign the WARC-Target-URI to a variable, e.g. f = record.header['WARC-Target-URI']?

Python3 Compat

I'd be interested in getting warc working on Python3. Spent some time fixing up the imports with six, but lost momentum with gzip2.py because Python3's gzip has moved things around.

Is anyone else interested in this?

The manual does not explain how to dump the body of a record in a WARC file. In particular, what should I call to access binary payloads such as PDFs? A more complete version of the manual would make the library much easier to use.
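One way to think about the body, sketched here without relying on any particular warc API: for a response record the payload is the raw HTTP response, so the body (for example the PDF bytes) is whatever follows the first blank line. `split_http_payload` is a hypothetical helper, and the sketch assumes the payload bytes have already been read in full and that the body is neither chunked nor content-encoded.

```python
def split_http_payload(payload):
    """Split raw HTTP response bytes into (header_block, body)."""
    head, sep, body = payload.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line between HTTP headers and body")
    return head, body


# Illustrative bytes standing in for the payload of one response record.
payload = (b"HTTP/1.1 200 OK\r\n"
           b"Content-Type: application/pdf\r\n"
           b"\r\n"
           b"%PDF-1.4 binary data")
head, body = split_http_payload(payload)

# Binary bodies should be written in "wb" mode, for example:
# with open("out.pdf", "wb") as out:
#     out.write(body)
```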

How to extract a record based on offset?

I've generated an index for a *.warc.gz file. Now I want to extract a record based on its offset in the index. Does warc provide a method to do so?
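Independently of what the library offers, a record can be pulled out by offset with the standard library alone. The sketch below assumes that the index stores byte offsets into the compressed file and that each record is its own gzip member (the usual layout for .warc.gz files). `read_gzip_member` is a hypothetical name, and the code is written for Python 3.

```python
import zlib


def read_gzip_member(path, offset, bufsize=64 * 1024):
    """Return the decompressed bytes of the single gzip member at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    chunks = []
    with open(path, "rb") as f:
        f.seek(offset)
        # decomp.eof becomes True once the end of this member is reached;
        # bytes belonging to the next member are left in decomp.unused_data.
        while not decomp.eof:
            data = f.read(bufsize)
            if not data:
                break
            chunks.append(decomp.decompress(data))
    return b"".join(chunks)
```

The returned bytes are the uncompressed record (header block plus payload), which can then be parsed separately.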

KeyError warc-target-uri

I get this:

~$ python warcread.py 
<warc.warc.WARCFile instance at 0x7fc61fc34290>
Traceback (most recent call last):
  File "warcread.py", line 6, in <module>
    print(record['WARC-Target-URI'])
  File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 199, in __getitem__
    return self.header[name]
  File "/usr/local/lib/python2.7/dist-packages/warc/utils.py", line 34, in __getitem__
    return self._d[name.lower()]
KeyError: 'warc-target-uri'

Input file can be downloaded from here (191 MB): 
https://www.dropbox.com/s/25tk1mpo03g73pj/1009wb-39.warc.gz?dl=0

Create ARC records with actual HTTP conversation

Using httplib or a similar client library makes some "convenient" changes to the HTTP transaction (e.g. Transfer-Encoding: chunked is removed and the body is delivered as a single decoded stream). This needs to be changed so that the actual conversation, as it occurred on the wire, is obtained and archived.
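A minimal sketch of capturing the unmodified bytes, assuming a plain (non-TLS) HTTP/1.1 exchange over a raw socket. `fetch_raw_http` and `read_until_closed` are hypothetical helpers, not part of warc; because no HTTP library sits in between, chunked framing and headers arrive exactly as the server sent them.

```python
import socket


def read_until_closed(sock, bufsize=65536):
    """Read raw bytes from `sock` until the peer closes the connection."""
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


def fetch_raw_http(host, port=80, path="/"):
    """Send a GET request and return (request_bytes, response_bytes)."""
    # "Connection: close" asks the server to close the socket when done,
    # so reading until EOF yields the complete response.
    request = ("GET %s HTTP/1.1\r\n"
               "Host: %s\r\n"
               "Connection: close\r\n"
               "\r\n" % (path, host)).encode("ascii")
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(request)
        return request, read_until_closed(sock)
    finally:
        sock.close()
```

Both byte strings could then be archived as they are, one as the request and one as the response.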

Quick creation of record with default values for headers

There should be a way to create ARCRecords quickly without having to pass all headers. Just pass the payload and a record with some default values for the headers should be created.
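A sketch of what such a shortcut might look like, assuming the five fields of an ARC v1 URL record (URL, IP address, archive date, content type, length). `default_arc_headers`, its key names, and its placeholder defaults are all hypothetical and are not the library's API.

```python
import time


def default_arc_headers(payload, url="about:blank", ip_address="0.0.0.0",
                        content_type="application/octet-stream", date=None):
    """Build a header dict for `payload`, filling in placeholder defaults."""
    if date is None:
        # ARC dates use the 14-digit form YYYYMMDDhhmmss in UTC.
        date = time.strftime("%Y%m%d%H%M%S", time.gmtime())
    return {
        "url": url,
        "ip_address": ip_address,
        "date": date,
        "content_type": content_type,
        "length": len(payload),  # always derived from the payload itself
    }


headers = default_arc_headers(b"hello world")
```

Only the payload is required; any field can still be overridden by keyword when a real value is known.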

Unsupported WARC version: 1.1

example file

f = warc.open("example.warc.gz")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

Expected Behaviour

Prints records with URI + Content Length

Observed Behaviour:

Traceback (most recent call last):
  File "", line 1, in
  File "/home/kiska/.local/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
    record = self.read_record()
  File "/home/kiska/.local/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
    header = self.read_header(fileobj)
  File "/home/kiska/.local/lib/python2.7/site-packages/warc/warc.py", line 334, in read_header
    raise IOError("Unsupported WARC version: %s" % version)
IOError: Unsupported WARC version: 1.1

support reading older WARC versions

The reader code barfs on versions other than "WARC/1.0".

I have not seen anything on the differences between, say, 1.0, 0.18 and 0.17 (apart from the version stamp itself). If version 1.0 is otherwise equal to either or both of those, please allow reading them, or add a configuration variable that determines whether they are allowed.

I could fork the code and add this feature, but I do not know of the differences. If someone can point me to spec on 1.0, I'd be happy to do it.
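The configuration-variable idea could be sketched as follows. `parse_version_line` and `SUPPORTED_VERSIONS` are hypothetical names; the default accepts only 1.0 and leaves it to the caller to widen the set. The sketch shows the mechanism only and makes no claim that other versions share the same record layout.

```python
# Versions accepted by default; callers may pass a wider set at their own risk.
SUPPORTED_VERSIONS = frozenset(["1.0"])


def parse_version_line(line, supported=SUPPORTED_VERSIONS):
    """Return the version from a 'WARC/x.y' line, or raise IOError."""
    line = line.strip()
    prefix = "WARC/"
    if not line.startswith(prefix):
        raise IOError("Bad version line: %r" % line)
    version = line[len(prefix):]
    if version not in supported:
        raise IOError("Unsupported WARC version: %s" % version)
    return version


# Opting in to additional versions explicitly:
version = parse_version_line("WARC/0.18\r\n", supported={"0.17", "0.18", "1.0"})
```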

Create ARCRecord with version so that write_to can work without it

It should be possible to create an ARCRecord with a version so that write_to can work directly, without having to pass a version number.

ModuleNotFoundError: No module named 'builtin'

how to solve this error?
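The module name in the title has probably lost its underscores. Assuming the failing import is Python 2's `__builtin__` (an assumption, since the report shows no traceback), the module was renamed to `builtins` in Python 3, and a compatibility import along these lines would avoid this particular error:

```python
try:
    import builtins                   # Python 3 name
except ImportError:
    import __builtin__ as builtins    # Python 2 name

# Either way, the built-in functions are reachable under one name.
builtin_open = builtins.open
```

This addresses only the import; it does not by itself make the rest of the library run on Python 3.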

KeyError: 'warc-target-uri'

I have been unable to accomplish anything with this, I always get errors like the following:

$ python2
Python 2.7.18 (default, Mar  8 2021, 13:02:45)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import warc
>>> f = warc.open("195.242.99.71-8181-2016-03-23-3324e7c6-00001.warc")
>>> for record in f:
...     print record['WARC-Target-URI'], record['Content-Length']
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/username/.local/lib/python2.7/site-packages/warc/warc.py", line 199, in __getitem__
    return self.header[name]
  File "/home/username/.local/lib/python2.7/site-packages/warc/utils.py", line 34, in __getitem__
    return self._d[name.lower()]
KeyError: 'warc-target-uri'
>>>
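The traceback shows a lowercased dictionary lookup, so any record that lacks the header raises KeyError. One plausible cause, which the report does not confirm, is that the file starts with a `warcinfo` record, a record type that normally carries no WARC-Target-URI. The sketch below uses plain dicts as stand-ins for record headers, and `target_uri` is a hypothetical helper.

```python
def target_uri(header, default=None):
    """Look up WARC-Target-URI case-insensitively; return `default` if absent."""
    lowered = dict((name.lower(), value) for name, value in header.items())
    return lowered.get("warc-target-uri", default)


# Illustrative headers: the first record has no target URI.
records = [
    {"WARC-Type": "warcinfo", "Content-Length": "219"},
    {"WARC-Type": "response",
     "WARC-Target-URI": "http://example.com/",
     "Content-Length": "1024"},
]

# Keep only the records that actually have a target URI.
uris = [target_uri(h) for h in records if target_uri(h) is not None]
```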

WARC: from_response incompatible with Requests>=1.0.0

Description/Summary

It seems that warc-0.2.1 is incompatible with requests>=1.0.0. All that needs to be changed is to replace 'full_url' with 'url', and everything will work again.

Environment details:

Python 2.7
warc 0.2.1
Requests 1.0.4

Example code

#!/usr/bin/python
import warc, requests

print warc.WARCRecord.from_response(requests.get("http://archive.org"))

Traceback

>>> warc.WARCRecord.from_response(requests.get("http://archive.org"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ersi/venv/lib/python2.7/site-packages/warc/warc.py", line 240, in from_response
    "WARC-Target-URI": response.request.full_url.encode('utf-8')
AttributeError: 'PreparedRequest' object has no attribute 'full_url'
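A version-tolerant lookup could be sketched like this. `request_url` is a hypothetical helper; it relies only on what the report states, namely that the request object exposes `full_url` before Requests 1.0.0 and `url` afterwards.

```python
def request_url(request):
    """Return the request URL, whichever attribute name the object uses."""
    url = getattr(request, "url", None)
    if url is None:
        # Fall back to the attribute name used by the older code path.
        url = getattr(request, "full_url")
    return url
```

The header would then be built from `request_url(response.request)` rather than from `response.request.full_url`.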

Bug in WARCRecord.__repr__()

The line in __repr__:

return "<WARCRecord: type=%r record_id=%s>" % (self['type'], self['record_id'])

results in an error. I don't know if the above line is wrong or whether there's a bug in how the attribute lookup chain is supposed to work. Changing the above line to this works though:

return "<WARCRecord: type=%r record_id=%s>" % (self.header.type, self.header.record_id)

Traceback below.

h = WARCHeader({"WARC-Type": "response"}, defaults=True)
r = WARCRecord(h)
r
Traceback (most recent call last):
  File "", line 1, in
  File "__init__.py", line 186, in __repr__
    return "<WARCRecord: type=%r record_id=%s>" % (self['type'], self['record_id'])
  File "__init__.py", line 172, in __getitem__
    return self.header[name]
  File "__init__.py", line 40, in __getitem__
    return self._d[name.lower()]
KeyError: 'type'
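The traceback is consistent with item lookup using the full header name (lowercased) while the short name `type` exists only as an attribute. The stand-in class below illustrates that distinction; `Header` is a hypothetical simplification, not the library's WARCHeader.

```python
class Header(dict):
    """Stand-in header: keys are stored lowercased, e.g. 'warc-type'."""

    def __init__(self, fields):
        dict.__init__(self, ((k.lower(), v) for k, v in fields.items()))

    def __getitem__(self, name):
        return dict.__getitem__(self, name.lower())

    @property
    def type(self):
        # The short attribute name maps onto the full header name.
        return self["WARC-Type"]


h = Header({"WARC-Type": "response"})

by_attribute = h.type        # works: attribute maps to 'warc-type'
by_full_name = h["WARC-Type"]  # works: full header name
try:
    h["type"]                # fails: there is no key called 'type'
    short_key_found = True
except KeyError:
    short_key_found = False
```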

BEWARE

This code is only valuable as a reference. If you are considering writing your own version of this WARC management module you'd be better off starting from scratch.

.gz WARC files not properly read

When reading WARC files compressed with gzip, many of the records they contain are skipped or misread. To reproduce, take Common Crawl data in .gz format, count the number of records found by the warc library, and then count the occurrences of WARC/1.0 in the file. The two counts differ by a very large margin.
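The expected count can be cross-checked without any WARC parser. Assuming each record is stored as its own gzip member (the usual .warc.gz layout, and an assumption about the files in question), counting members gives the number of records. `count_gzip_members` is a hypothetical helper written for Python 3.

```python
import zlib


def count_gzip_members(path, bufsize=64 * 1024):
    """Count the concatenated gzip members in the file at `path`."""
    count = 0
    with open(path, "rb") as f:
        data = f.read(bufsize)
        while data:
            # A fresh decompressor is needed for every member.
            decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            while True:
                decomp.decompress(data)  # output is discarded, only counting
                if decomp.eof:
                    break
                data = f.read(bufsize)
                if not data:
                    raise EOFError("truncated gzip member")
            count += 1
            # Bytes after the member's end belong to the next member.
            data = decomp.unused_data or f.read(bufsize)
    return count
```

A large gap between this number and the number of records a reader yields would point at the reader stopping after the first member or losing alignment between members.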

Apparent issue using wget-created .warc.gz files

Hi!

First of all, thank you for writing this, it's very useful!

It looks like it has an issue parsing the wget-created .warc.gz files I give it, though:

Traceback (most recent call last):
  File "./find-broken-links.py", line 16, in
    for record in file:
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'software: Wget/1.14 (linux-gnueabihf)\r\n'
-1

Alas, I suspect fixing this elegantly is probably out of my depth. Is this something you can do?

Thank you,
Zoë.