Comments (6)

Argon- commented on June 17, 2024

Is there a solution for this?
I use(d) grab-site to create large, single WARC files (with uncompressed .cdx) and just noticed that ReplayWeb is not able to open these. Previously I only tested with smaller archives.
Is there a tool to convert my existing WARC archive to a WACZ collection without scraping again?

edit: currently reading the linked repo. Will try this
edit: There's now a tool for such a conversion in the linked repo. Great!

from replayweb.page.

ikreymer commented on June 17, 2024

Yes, I would recommend using WACZ for larger WARCs, because with a plain WARC the browser has to load the entire file into memory at once. Firefox, as far as I know, still seems to lock up randomly when streaming a larger file.
Perhaps grab-site can have an option to generate WACZ; I'll suggest that there.

ikreymer commented on June 17, 2024

Yeah, at this scale (30GB+), it won't be able to load it all with just the WARC.
It needs to be converted into a WACZ collection with a compressed index, and then it can load the collection on-demand.
I don't quite have the tools ready to do this, but the idea is that it could take a pywb collection and create a compressed .wacz file as per: https://github.com/webrecorder/web-archive-collection-format

I'll try to have a skeleton of a tool that does this (for the current spec) fairly soon, if you want to try it out.
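
To illustrate what loading on-demand means here, the following is a minimal Python sketch, not the actual replayweb.page or WACZ code. It assumes each record is stored as its own gzip member and that an index records the byte offset and length of each member, so a reader can fetch one record without touching the rest of the file. The function names `write_members` and `read_member` are hypothetical.

```python
import gzip


def write_members(path, records):
    """Write each record (bytes) as its own gzip member.

    Returns a list of (offset, length) pairs, one per record, which is
    the kind of information an index would need to store.
    """
    spans = []
    with open(path, "wb") as f:
        for rec in records:
            blob = gzip.compress(rec)
            spans.append((f.tell(), len(blob)))
            f.write(blob)
    return spans


def read_member(path, offset, length):
    """Read and decompress a single record using only its byte range."""
    with open(path, "rb") as f:
        f.seek(offset)
        return gzip.decompress(f.read(length))
```

With the spans returned by `write_members`, calling `read_member(path, *spans[i])` retrieves record `i` while reading only `length` bytes from disk, which is why an index makes a 30GB+ archive usable without loading it whole.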

jswrenn commented on June 17, 2024

Ah, sorry I should have clarified: I packaged the collection as WACZ before attempting to load it into replayweb.page.

ikreymer commented on June 17, 2024

Ah ok! But just using the plain, uncompressed .cdx, right? WACZ supports both compressed and uncompressed indexes.
The compression part is rather straightforward; it just needs a script that does it. I've been using the one in webarchive-indexing, but it's a bit old.
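
As a rough idea of what such a compression script could look like, here is a minimal Python sketch. It is an assumption-laden illustration, not the webarchive-indexing script and not the exact on-disk format WACZ expects: sorted index lines are grouped into fixed-size blocks, each block is gzipped separately, and a small summary keeps the first key, offset, and length of every block so a lookup only decompresses the blocks it needs. The names `compress_index` and `lookup` and the default block size are hypothetical.

```python
import bisect
import gzip


def compress_index(cdx_lines, block_size=1000):
    """Compress index lines into independent gzip blocks.

    Each line is assumed to start with a lookup key followed by a space.
    Returns (data, summary), where summary holds one
    (first_key, offset, length) tuple per block.
    """
    lines = sorted(cdx_lines, key=lambda line: line.split(" ", 1))
    data = bytearray()
    summary = []
    for i in range(0, len(lines), block_size):
        block = lines[i:i + block_size]
        blob = gzip.compress("".join(l + "\n" for l in block).encode("utf-8"))
        summary.append((block[0].split(" ", 1)[0], len(data), len(blob)))
        data += blob
    return bytes(data), summary


def lookup(data, summary, key):
    """Return all lines whose key matches, decompressing only nearby blocks."""
    first_keys = [entry[0] for entry in summary]
    # The block before the first candidate may still end with matching lines.
    start = max(bisect.bisect_left(first_keys, key) - 1, 0)
    matches = []
    for first_key, offset, length in summary[start:]:
        if first_key > key:
            break
        text = gzip.decompress(data[offset:offset + length]).decode("utf-8")
        matches.extend(
            line for line in text.splitlines()
            if line.split(" ", 1)[0] == key
        )
    return matches
```

Only the small summary has to be held in memory; the compressed blocks can stay on disk or behind a range request and be fetched one at a time.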

ikreymer commented on June 17, 2024

The latest version (1.3.11) includes some fixes to prevent timing out when loading large WARCs, hopefully even on Firefox, so I will close this for now. WACZ is still recommended for large WARCs, but I think this is now working as well as possible, given that the entire WARC must be loaded.
