
Comments (3)

tw4l commented on June 29, 2024

Hi @HeliosLHC, we could look into making this optional in the Helm chart, however screenshots and extracted text are both necessary for our QA features, so we'd have to find a way to make it clear to users that changing these settings would have an adverse effect on Quality Assurance.

from browsertrix.

ikreymer commented on June 29, 2024

The screenshot PNGs are 100K-300K each (if even that), so their total size is generally negligible relative to the overall size of a crawl. The time it takes to capture them is also very small. Disabling them will most likely not make much difference in storage or resource consumption, and it will affect the usability of other features. @HeliosLHC, is there a particular issue you're trying to solve? How big are the screenshots compared to the rest of the crawl data?

HeliosLHC commented on June 29, 2024

Hi @tw4l and @ikreymer, my primary use case is to reduce the output size of crawls, especially for static, text-only sites.

In one of my crawls, the output WACZ contained 60 MB of crawl data and 30 MB of extracted text, plus 900 MB of screenshots (thumbnail + view), which is roughly a 10x size increase. This was a barebones, text-heavy site with little to no images.
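
For anyone who wants to check the same breakdown on their own crawl, here is a minimal sketch. It relies only on a WACZ file being a ZIP package and lists each entry by the space it occupies in the archive, largest first. The function name `entry_sizes` is hypothetical, and which entries hold the screenshots depends on the crawler version, so that part has to be identified from the listing by hand.

```python
import zipfile


def entry_sizes(wacz_path):
    """Return (entry name, bytes occupied in the archive) pairs, largest first.

    `wacz_path` may be a filesystem path or a binary file-like object.
    This only reads the ZIP directory; it does not parse WARC records.
    """
    sizes = []
    with zipfile.ZipFile(wacz_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # compress_size is the space the entry takes inside the ZIP
            sizes.append((info.filename, info.compress_size))
    # Sort by size, descending, so the biggest contributors come first
    sizes.sort(key=lambda pair: pair[1], reverse=True)
    return sizes
```

Calling `entry_sizes("my-crawl.wacz")` (a placeholder file name) and printing the pairs should show at a glance whether the screenshot data really dominates the package.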

For more image- or media-heavy sites, I would expect this ratio to be lower (< 5x), since the actual crawl data becomes a much larger proportion of the output. In those scenarios, the additional overhead of the screenshots is not as significant.

I don't mind generation of thumbnails/views as they are useful for monitoring crawls, but an option to disable writing them to WACZ files would be useful.
