Comments (11)

jnioche commented on June 14, 2024

Proxy with WARCPROX and compare the outputs

from news-crawl.

jnioche commented on June 14, 2024

OK with warcdump, but:

webarchiveplayer /data/warc/CC-NEWS-20160721104121-00000.warc.gz 
WebArchivePlayer is unable to read the input file(s) and will quit.

Details: 
    ERROR: Non-chunked gzip file detected, gzip block continues
    beyond single record.

    This file is probably not a multi-chunk gzip but a single gzip file.

    To allow seek, a gzipped WARC must have each record compressed into
    a single gzip
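One way to check for the condition this error describes is to count the gzip members in the file. The sketch below is only an illustration, not how WebArchivePlayer does it: `count_gzip_members` is a hypothetical helper, the records are placeholder bytes rather than real WARC records, and the whole input is held in memory.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members (independently compressed chunks) in data."""
    members = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data
    return members


# Placeholder records, not valid WARC.
records = [b"record one\r\n\r\n", b"record two\r\n\r\n", b"record three\r\n\r\n"]

# Whole stream compressed at once: a single member holding every record.
whole_stream = gzip.compress(b"".join(records))

# Each record compressed on its own, then concatenated.
per_record = b"".join(gzip.compress(r) for r in records)

print(count_gzip_members(whole_stream))  # 1
print(count_gzip_members(per_record))    # 3
```

A file with many records but only one member is compressed as a single stream, which is what the error message complains about.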

jnioche commented on June 14, 2024

Works fine after unzipping it. Seeing some duplicate entries though:

https://www.theguardian.com/sustainable-business/2016/jul/21/jpmorgan-chase-bank-minimum-wage-pay-gap
https://www.theguardian.com/sustainable-business/2016/jul/21/jpmorgan-chase-bank-minimum-wage-pay-gap
2016-07-21 15:08:49.998 c.d.s.b.FetcherBolt [INFO] [Fetcher #5] Fetched https://www.theguardian.com/sustainable-business/2016/jul/21/jpmorgan-chase-bank-minimum-wage-pay-gap with status 200 in msec 109
2016-07-21 15:08:50.372 c.d.s.b.FetcherBolt [INFO] [Fetcher #5] Fetched https://www.theguardian.com/sustainable-business/2016/jul/21/jpmorgan-chase-bank-minimum-wage-pay-gap with status 200 in msec 90

Not a WARC issue strictly speaking.
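A quick way to spot such double fetches is to count URLs in the fetcher log. This is a minimal sketch that assumes the log line layout shown above; `duplicate_fetches` and the `FETCHED` pattern are hypothetical names, not part of the crawler.

```python
import re
from collections import Counter

# Matches the "Fetched <url> with status <code>" part of a FetcherBolt line.
FETCHED = re.compile(r"Fetched (\S+) with status (\d+)")


def duplicate_fetches(log_lines):
    """Return {url: count} for every URL fetched more than once."""
    counts = Counter()
    for line in log_lines:
        match = FETCHED.search(line)
        if match:
            counts[match.group(1)] += 1
    return {url: n for url, n in counts.items() if n > 1}


url = ("https://www.theguardian.com/sustainable-business/2016/jul/21/"
       "jpmorgan-chase-bank-minimum-wage-pay-gap")
log = [
    f"2016-07-21 15:08:49.998 c.d.s.b.FetcherBolt [INFO] [Fetcher #5] "
    f"Fetched {url} with status 200 in msec 109",
    f"2016-07-21 15:08:50.372 c.d.s.b.FetcherBolt [INFO] [Fetcher #5] "
    f"Fetched {url} with status 200 in msec 90",
]

dupes = duplicate_fetches(log)
print(dupes)
```

For the two log lines above, the result maps the Guardian URL to a count of 2.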

sebastian-nagel commented on June 14, 2024

Compressing one record in one chunk is a recommendation of the WARC standard. Wayback machines (including http://index.commoncrawl.org/) rely on this to be able to unpack a record by WARC file path and offset.
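To illustrate why the offset lookup depends on one record per chunk, here is a minimal sketch. `write_records` and `read_record` are hypothetical helpers, the records are placeholder bytes rather than valid WARC, and an in-memory buffer stands in for the `.warc.gz` file.

```python
import gzip
import io


def write_records(out, records):
    """Write each record as its own gzip member; return (offset, length) pairs."""
    index = []
    for record in records:
        member = gzip.compress(record)
        index.append((out.tell(), len(member)))
        out.write(member)
    return index


def read_record(source, offset, length):
    """Fetch one record using only its offset and compressed length."""
    source.seek(offset)
    return gzip.decompress(source.read(length))


# Placeholder records, not valid WARC.
records = [b"WARC/1.0 first\r\n\r\n", b"WARC/1.0 second\r\n\r\n"]
buffer = io.BytesIO()  # stands in for a .warc.gz file
index = write_records(buffer, records)

offset, length = index[1]
print(read_record(buffer, offset, length))  # b'WARC/1.0 second\r\n\r\n'
```

If the whole stream were a single gzip member instead, decompression could not start at the second record's offset, so the reader would have to unpack everything before it.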

jnioche commented on June 14, 2024

Interesting. We compress the whole stream in GzipHdfsBolt. We'd probably need to move to a SequenceFile to get a concept of entries.

https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/common/SequenceFileWriter.java could be a starting point.

Do you want to open a separate issue for this?

sebastian-nagel commented on June 14, 2024

Good question whether a sequence file will do, or rather whether a WARC is formally a sequence file. The nice thing about the WARC format is that it allows both: gzip -dc warc.gz, and a seek followed by decompression of a single chunk. But yes, I'll open an issue about this.
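Both access modes can be shown on the same bytes. This is a small sketch with placeholder records (not valid WARC), in which `warc_gz` stands in for the contents of a record-at-a-time compressed file.

```python
import gzip
import zlib

# Placeholder records, each compressed as its own gzip member.
records = [b"alpha\n", b"beta\n", b"gamma\n"]
members = [gzip.compress(r) for r in records]
warc_gz = b"".join(members)

# Mode 1: decompress the whole file as one stream, like `gzip -dc`.
full = gzip.decompress(warc_gz)

# Mode 2: start at the offset of the second member and stop at its end.
offset = len(members[0])
decomp = zlib.decompressobj(wbits=31)  # 31 = expect gzip framing
second = decomp.decompress(warc_gz[offset:])

print(full)        # b'alpha\nbeta\ngamma\n'
print(second)      # b'beta\n'
print(decomp.eof)  # True
```

The decompressor stops at the end of the member it started in, and the bytes of the following members are left in `decomp.unused_data`.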

sebastian-nagel commented on June 14, 2024

Btw., that's why we need record-at-a-time compression:
http://index.commoncrawl.org/CC-MAIN-2015-40/http%3A%2F%2Fdigitalpebble.com%2F

Opened #4 - I was not sure, maybe sc-warc would have been the right place.

jnioche commented on June 14, 2024

> Good question whether a sequence file will do,

Ideally we would find a way to do the record-at-a-time compression using the existing stream-based code. Generating a valid WARC using SequenceFiles would be quite a trick.

> Opened #4 - I was not sure, maybe sc-warc would have been the right place.

Copied it there.

jnioche commented on June 14, 2024

Tricky issue. One short-term alternative could be to use ZipEntry + ZipOutputStream in a separate process, e.g. when sending to S3. Or do without the HDFS layer altogether and write exclusively to the local file system, perhaps straight to S3 later on.

See DigitalPebble/storm-crawler#313 for a related issue.

jnioche commented on June 14, 2024

@sebastian-nagel fixed in sc-warc! Could you please give it a try?

sebastian-nagel commented on June 14, 2024

Yes, will do!
