
warc's Introduction

warc: Python library to work with WARC files

WARC (Web ARChive) is a file format for storing web crawls.

http://bibnum.bnf.fr/WARC/

This warc library makes it easy to work with WARC files:

import warc
f = warc.open("test.warc")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

Documentation

The documentation of the warc library is available at http://warc.readthedocs.org/.

License

This software is licensed under GPL v2. See LICENSE file for details.

warc's People

Contributors

anandology, nibrahim, rajbot

warc's Issues

Python3 Compat

I'd be interested in getting warc working on Python3. Spent some time fixing up the imports with six, but lost momentum with gzip2.py because Python3's gzip has moved things around.

Is anyone else interested in this?

incomplete manual

The manual does not explain how to dump the body of a record in a WARC file. In particular, how should I access binary payloads such as PDFs? A more complete manual would make the library much easier to use.
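For a `response` record, the record body is normally the raw HTTP response: status line and headers, a blank line, then the entity body (for example the PDF bytes). Below is a minimal sketch of separating the two. It assumes the raw bytes have already been read from the record; in warc 0.2.1 this appears to be `record.payload.read()`, which is an assumption here and not something the manual states. `split_http_payload` is a hypothetical helper name, not part of the library.

```python
def split_http_payload(raw):
    """Split a raw HTTP response into (header_block, body) at the first blank line."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        # No blank line found: treat everything as body.
        return b"", raw
    return head, body


# Illustrative bytes only; a real record would supply these.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.4 ..."
headers, body = split_http_payload(raw)
```

The `body` bytes can then be written to a file opened in `"wb"` mode. Note that the body may still be chunked or content-encoded, depending on how the crawler stored the response.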

KeyError warc-target-uri

I get this:

~$ python warcread.py 
<warc.warc.WARCFile instance at 0x7fc61fc34290>
Traceback (most recent call last):
  File "warcread.py", line 6, in <module>
    print(record['WARC-Target-URI'])
  File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 199, in __getitem__
    return self.header[name]
  File "/usr/local/lib/python2.7/dist-packages/warc/utils.py", line 34, in __getitem__
    return self._d[name.lower()]
KeyError: 'warc-target-uri'

Input file can be downloaded from here (191 MB): 
https://www.dropbox.com/s/25tk1mpo03g73pj/1009wb-39.warc.gz?dl=0

Create ARC records with actual HTTP conversation

Using Httplib or anything else makes some "convenient" changes to the HTTP transaction (e.g. transfer-encoding: chunked is removed and it's converted into a full stream etc.). This needs to be changed and the actual conversation needs to be obtained and archived.

Unsupported WARC version: 1.1

example file

import warc

f = warc.open("example.warc.gz")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

Expected Behaviour

Prints records with URI + Content Length

Observed Behaviour:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kiska/.local/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
    record = self.read_record()
  File "/home/kiska/.local/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
    header = self.read_header(fileobj)
  File "/home/kiska/.local/lib/python2.7/site-packages/warc/warc.py", line 334, in read_header
    raise IOError("Unsupported WARC version: %s" % version)
IOError: Unsupported WARC version: 1.1

support reading older WARC versions

The reader code barfs on versions other than "WARC/1.0".

I have not seen any description of the differences between, say, 1.0, 0.18 and 0.17 (apart from the version stamp itself). If version 1.0 is otherwise identical to either or both of those, please allow reading them, or add a configuration variable that determines whether they are allowed.

I could fork the code and add this feature, but I do not know the differences. If someone can point me to the spec for 1.0, I'd be happy to do it.
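The configuration variable suggested above could look roughly like the following. This is a standalone sketch, not the library's actual `read_header` code: `parse_warc_version` and `DEFAULT_ACCEPTED_VERSIONS` are hypothetical names, and whether 0.17, 0.18 or 1.1 records are otherwise parseable by the existing reader is an open assumption.

```python
import re

# Versions the reader is willing to accept. Only "1.0" is known to work;
# adding others assumes their record layout is compatible.
DEFAULT_ACCEPTED_VERSIONS = frozenset(["1.0"])

_VERSION_LINE = re.compile(r"^WARC/(\d+\.\d+)\s*$")


def parse_warc_version(line, accepted=DEFAULT_ACCEPTED_VERSIONS):
    """Return the version from a 'WARC/x.y' line, or raise IOError."""
    match = _VERSION_LINE.match(line)
    if match is None:
        raise IOError("Bad version line: %r" % line)
    version = match.group(1)
    if version not in accepted:
        raise IOError("Unsupported WARC version: %s" % version)
    return version


# Default behaviour is unchanged; callers opt in to other versions explicitly.
strict = parse_warc_version("WARC/1.0\r\n")
lenient = parse_warc_version("WARC/0.18\r\n", accepted={"1.0", "0.18"})
```

Keeping the default set at `{"1.0"}` means existing callers see the same error as before unless they deliberately widen the accepted set.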

KeyError: 'warc-target-uri'

I have been unable to accomplish anything with this, I always get errors like the following:

$ python2
Python 2.7.18 (default, Mar  8 2021, 13:02:45)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import warc
>>> f = warc.open("195.242.99.71-8181-2016-03-23-3324e7c6-00001.warc")
>>> for record in f:
...     print record['WARC-Target-URI'], record['Content-Length']
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/username/.local/lib/python2.7/site-packages/warc/warc.py", line 199, in __getitem__
    return self.header[name]
  File "/home/username/.local/lib/python2.7/site-packages/warc/utils.py", line 34, in __getitem__
    return self._d[name.lower()]
KeyError: 'warc-target-uri'
>>>
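A likely cause, though neither report shows the file contents so this is an assumption: many WARC files begin with a `warcinfo` record, and `warcinfo` records carry no `WARC-Target-URI` header, so the README loop fails on the very first record. The tracebacks show the lookup raising `KeyError`, so a defensive sketch can simply catch it. `header_or_default` is a hypothetical helper, and `FakeRecord` only stands in for a real record object.

```python
def header_or_default(record, name, default=None):
    """Look up a header on a record, returning `default` if it is absent."""
    try:
        return record[name]
    except KeyError:
        return default


class FakeRecord(object):
    """Stand-in that mimics the case-insensitive lookup seen in the traceback."""

    def __init__(self, headers):
        self._d = dict((k.lower(), v) for k, v in headers.items())

    def __getitem__(self, name):
        return self._d[name.lower()]


records = [
    FakeRecord({"WARC-Type": "warcinfo", "Content-Length": "234"}),
    FakeRecord({"WARC-Type": "response",
                "WARC-Target-URI": "http://example.com/",
                "Content-Length": "1024"}),
]
# Records without a target URI yield None instead of raising.
uris = [header_or_default(r, "WARC-Target-URI") for r in records]
```

An alternative is to check `record['WARC-Type']` first and skip `warcinfo` records entirely.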

WARC: from_response incompatible with Requests>=1.0.0

Description/Summary

Seems like warc-0.2.1 is incompatible with requests>=1.0.0. All that needs to be changed is to strip 'full' from 'full_url' and everything will work just fine again.

Environment details:

Python 2.7
warc 0.2.1
Requests 1.0.4

Example code

#!/usr/bin/python
import warc, requests

print warc.WARCRecord.from_response(requests.get("http://archive.org"))

Traceback

>>> warc.WARCRecord.from_response(requests.get("http://archive.org"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ersi/venv/lib/python2.7/site-packages/warc/warc.py", line 240, in from_response
    "WARC-Target-URI": response.request.full_url.encode('utf-8')
AttributeError: 'PreparedRequest' object has no attribute 'full_url'
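The one-word fix described above can also be written so that it tolerates both old and new request objects. This is a sketch of that idea, not a patch taken from the library: `request_url` is a hypothetical helper, and the two small classes only imitate the attribute each Requests generation exposes.

```python
def request_url(request):
    """Return the request URL from either attribute name."""
    url = getattr(request, "full_url", None)  # older Requests, per the traceback
    if url is None:
        url = request.url  # PreparedRequest in Requests >= 1.0
    return url


class OldStyleRequest(object):
    """Imitates a pre-1.0 request object."""
    full_url = "http://archive.org/"


class NewStyleRequest(object):
    """Imitates a PreparedRequest, which has `url` but no `full_url`."""
    url = "http://archive.org/"


old_url = request_url(OldStyleRequest())
new_url = request_url(NewStyleRequest())
```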

Bug in WARCRecord.__repr__()

The line in __repr__:

return "<WARCRecord: type=%r record_id=%s>" % (self['type'], self['record_id'])

results in an error. I don't know if the above line is wrong or whether there's a bug in how the attribute lookup chain is supposed to work. Changing the above line to this works though:

return "<WARCRecord: type=%r record_id=%s>" % (self.header.type, self.header.record_id)

Traceback below.


>>> h = WARCHeader({"WARC-Type": "response"}, defaults=True)
>>> r = WARCRecord(h)
>>> r
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "__init__.py", line 186, in __repr__
    return "<WARCRecord: type=%r record_id=%s>" % (self['type'], self['record_id'])
  File "__init__.py", line 172, in __getitem__
    return self.header[name]
  File "__init__.py", line 40, in __getitem__
    return self._d[name.lower()]
KeyError: 'type'

BEWARE

This code is only valuable as a reference. If you are considering writing your own version of this WARC management module you'd be better off starting from scratch.

.gz WARC files not properly read

When reading WARC files compressed with gzip, many of the entries contained are skipped or misread. To reproduce, use common crawl data in .gz format, count the number of entries found by the WARC library and then count the number of appearances of WARC/1.0 in the file. It is a very large difference.
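Compressed WARC files are commonly written as a series of concatenated gzip members, one per record, and a reader that stops at or mishandles member boundaries will undercount. Whether that is the cause of the skipping reported here is an assumption. The sketch below uses only the standard library to walk the members of such a file so the count can be compared against what the warc library returns; `iter_gzip_members` is a hypothetical helper name.

```python
import gzip
import zlib


def iter_gzip_members(data):
    """Yield the decompressed bytes of each gzip member in `data`."""
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        member = decomp.decompress(data) + decomp.flush()
        yield member
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data


# Two illustrative records, each compressed as its own member.
blob = (gzip.compress(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n")
        + gzip.compress(b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"))
members = list(iter_gzip_members(blob))
```

This reads the whole input into memory, so it is only suitable for checking small samples.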

Apparent issue using wget-created .warc.gz files

Hi!

First of all, thank you for writing this, it's very useful!

It looks like it has an issue parsing the wget-created .warc.gz files I give it, though:

Traceback (most recent call last):
  File "./find-broken-links.py", line 16, in <module>
    for record in file:
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'software: Wget/1.14 (linux-gnueabihf)\r\n'
-1

Alas, I suspect fixing this elegantly is probably out of my depth. Is this something you can do?

Thank you,
Zoë.

fast seek() for multiprocessing

I want to split a WARC file into small chunks and then process them with multiprocessing in Python.

For a text file we can use seek(), but how can we seek with the warc module, or in .gz WARC files?
Any advice?
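One possible approach, assuming the .gz file stores each record as its own gzip member (common, but not guaranteed for every file): build an index of the byte offset where each member starts, then let each worker seek straight to an offset and decompress a single record. This sketch uses only the standard library and does not touch the warc module; `member_offsets` and `read_member_at` are hypothetical names.

```python
import gzip
import io
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # expect gzip header and trailer


def member_offsets(data):
    """Return the starting byte offset of every gzip member in `data`."""
    offsets = []
    pos = 0
    while pos < len(data):
        decomp = zlib.decompressobj(GZIP_WBITS)
        decomp.decompress(data[pos:])
        decomp.flush()
        offsets.append(pos)
        # Everything not left in unused_data was consumed by this member.
        pos = len(data) - len(decomp.unused_data)
    return offsets


def read_member_at(fileobj, offset):
    """Seek to `offset` and decompress exactly one gzip member."""
    fileobj.seek(offset)
    decomp = zlib.decompressobj(GZIP_WBITS)
    chunks = []
    while not decomp.eof:
        block = fileobj.read(64 * 1024)
        if not block:
            break  # truncated input
        chunks.append(decomp.decompress(block))
    chunks.append(decomp.flush())
    return b"".join(chunks)


first = gzip.compress(b"WARC/1.0\r\nrecord one")
second = gzip.compress(b"WARC/1.0\r\nrecord two")
blob = first + second
offsets = member_offsets(blob)
record_two = read_member_at(io.BytesIO(blob), offsets[1])
```

The offset list can be split into slices and handed to a `multiprocessing.Pool`, with each worker opening the file itself. `member_offsets` as written loads the whole file and re-slices it per member, so a real index builder for large files would need to stream instead.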
