Compare with what we get from [https://webrecorder.io/] Read with [https://github.

Check WARC files generated about news-crawl HOT 11 CLOSED

commoncrawl commented on June 14, 2024

Check WARC files generated

from news-crawl.

Comments (11)

jnioche commented on June 14, 2024

Proxy with WARCPROX and compare the outputs

jnioche commented on June 14, 2024

OK with warcdump, but webarchiveplayer fails:

webarchiveplayer /data/warc/CC-NEWS-20160721104121-00000.warc.gz 
WebArchivePlayer is unable to read the input file(s) and will quit.

Details: 
    ERROR: Non-chunked gzip file detected, gzip block continues
    beyond single record.

    This file is probably not a multi-chunk gzip but a single gzip file.

    To allow seek, a gzipped WARC must have each record compressed into
    a single gzip
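
The condition behind this error can be checked independently. The following is a minimal sketch in Python (not the code WebArchivePlayer uses) that counts the gzip members in a byte string. The helper name `count_gzip_members` and the sample records are assumptions made for illustration. A file compressed as one stream has a single member, while a record-at-a-time file has one member per record.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members (independently compressed chunks) in data."""
    members = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data
    return members


# Made-up stand-ins for WARC records, for illustration only.
records = [b"WARC/1.0\r\n\r\nfirst\r\n\r\n", b"WARC/1.0\r\n\r\nsecond\r\n\r\n"]
whole_stream = gzip.compress(b"".join(records))           # one member for everything
per_record = b"".join(gzip.compress(r) for r in records)  # one member per record

print(count_gzip_members(whole_stream), count_gzip_members(per_record))
```

This sketch loads the whole input into memory, so it only suits small test files.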

jnioche commented on June 14, 2024

Works fine when unzipping it. Seeing some double entries though

https://www.theguardian.com/sustainable-business/2016/jul/21/jpmorgan-chase-bank-minimum-wage-pay-gap
https://www.theguardian.com/sustainable-business/2016/jul/21/jpmorgan-chase-bank-minimum-wage-pay-gap

2016-07-21 15:08:49.998 c.d.s.b.FetcherBolt [INFO] [Fetcher #5] Fetched https://www.theguardian.com/sustainable-business/2016/jul/21/jpmorgan-chase-bank-minimum-wage-pay-gap with status 200 in msec 109
2016-07-21 15:08:50.372 c.d.s.b.FetcherBolt [INFO] [Fetcher #5] Fetched https://www.theguardian.com/sustainable-business/2016/jul/21/jpmorgan-chase-bank-minimum-wage-pay-gap with status 200 in msec 90

Not a WARC issue strictly speaking.

sebastian-nagel commented on June 14, 2024

Compressing one record in one chunk is a recommendation of the WARC standard. Wayback machines (including http://index.commoncrawl.org/) rely on this to be able to unpack a record by WARC file path and offset.
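
To illustrate the lookup by offset, here is a minimal Python sketch. It assumes an index of (offset, length) pairs is available; the function `read_record`, the in-memory buffer, and the sample records are hypothetical stand-ins, not part of any Common Crawl tooling. Because each record is its own gzip member, one record can be decompressed without touching the rest of the file.

```python
import gzip
import io


def read_record(stream, offset: int, length: int) -> bytes:
    """Seek to one gzip member and decompress only that member."""
    stream.seek(offset)
    return gzip.decompress(stream.read(length))


# Build a tiny stand-in for a .warc.gz file: one gzip member per record.
records = [b"record one", b"record two", b"record three"]
buf = io.BytesIO()
index = []  # hypothetical index of (offset, length) pairs
for rec in records:
    member = gzip.compress(rec)
    index.append((buf.tell(), len(member)))
    buf.write(member)

offset, length = index[1]
print(read_record(buf, offset, length))  # b'record two'
```

With a file compressed as a single stream this lookup is not possible, since decompression has to start at the beginning of the file.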

jnioche commented on June 14, 2024

Interesting. We compress the whole stream in GzipHdfsBolt. We'd probably need to move to a SequenceFile to get a concept of entries.

[https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/common/SequenceFileWriter.java] could be a starting point.

Do you want to open a separate issue for this?

sebastian-nagel commented on June 14, 2024

Good question whether a sequence file will do, or rather whether a WARC is formally a sequence file. The nice thing about the WARC format is that it allows both: decompressing the whole file as a stream with gzip -dc warc.gz, and seeking to an offset and decompressing a single chunk. But yes, I'll open an issue about this.

sebastian-nagel commented on June 14, 2024

Btw., that's why we need record-at-time compression:
http://index.commoncrawl.org/CC-MAIN-2015-40/http%3A%2F%2Fdigitalpebble.com%2F

Opened #4 - I was not sure, maybe sc-warc would have been the right place.

jnioche commented on June 14, 2024

> Good question whether a sequence file will do,

If we could find a way of doing the record-at-a-time compression using the existing stream-based code, that would be ideal. Generating a valid WARC using SequenceFiles would be quite a trick.

> Opened #4 - I was not sure, maybe sc-warc would have been the right place.

copied it there.
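
The stream-based variant wished for above can be sketched as a thin wrapper that starts a new gzip member for every record while still writing to one ordinary output stream. This is a Python analogue under stated assumptions, not the fix that went into sc-warc: the class `RecordGzipWriter`, its `write_record` method, and the sample records are hypothetical names chosen for illustration.

```python
import gzip
import io


class RecordGzipWriter:
    """Write each record as its own gzip member to an existing binary stream."""

    def __init__(self, stream):
        self.stream = stream
        self.position = 0  # bytes written through this writer so far

    def write_record(self, record: bytes):
        """Compress one record separately; return its (offset, length)."""
        member = gzip.compress(record)
        offset = self.position
        self.stream.write(member)
        self.position += len(member)
        return offset, len(member)


out = io.BytesIO()
writer = RecordGzipWriter(out)
first = writer.write_record(b"record one\r\n\r\n")
second = writer.write_record(b"record two\r\n\r\n")

data = out.getvalue()
# Reading the whole file as one stream still works, like gzip -dc.
assert gzip.decompress(data) == b"record one\r\n\r\nrecord two\r\n\r\n"
# So does decompressing a single member located by its offset and length.
offset, length = second
assert gzip.decompress(data[offset:offset + length]) == b"record two\r\n\r\n"
```

The output is a plain concatenation of gzip members, so both access modes mentioned earlier in the thread remain available.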

jnioche commented on June 14, 2024

Tricky issue. One short-term alternative could be to use ZipEntry + ZipOutputStream in a separate process, e.g. when sending to S3. Or do without the HDFS layer altogether and write exclusively to the local file system, and perhaps later on straight to S3.

See DigitalPebble/storm-crawler#313 for a related issue.

jnioche commented on June 14, 2024

@sebastian-nagel fixed in sc-warc! Could you please give it a try?

sebastian-nagel commented on June 14, 2024

Yes, will do!
