Comments (3)
Hi @HeliosLHC, we could look into making this optional in the Helm chart, however screenshots and extracted text are both necessary for our QA features, so we'd have to find a way to make it clear to users that changing these settings would have an adverse effect on Quality Assurance.
from browsertrix.
The screenshot PNGs are 100K-300K (if even that) in size, so their total size is generally negligible relative to the overall size of a crawl. The time it takes to capture them is also very small. Disabling them will most likely not make much difference in storage or resource consumption, and it will affect the usability of other features. @HeliosLHC is there a particular issue you're trying to solve? How big are the screenshots compared to the rest of the crawl data?
Hi @tw4l and @ikreymer, my primary use case is to reduce the output size of crawls, especially for static, text-only sites.
In one of my crawls, the output WACZ contained 60 MB of crawl data + 30 MB of extracted text data and 900 MB of screenshots (thumbnail + view), so the screenshots were roughly 10x the size of everything else combined. This was a barebones, text-heavy site with little to no images.
For more image/media-heavy sites, I would expect this ratio to be lower (< 5x), as the actual crawl data becomes a much larger proportion of the output. In such scenarios, the additional overhead of the screenshots is not as significant.
I don't mind generation of thumbnails/views as they are useful for monitoring crawls, but an option to disable writing them to WACZ files would be useful.
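For anyone who wants to check the same breakdown on their own crawls: a WACZ file is a ZIP archive, so the per-entry sizes can be listed with Python's standard `zipfile` module. This is only a minimal sketch; the function name `wacz_size_breakdown` is made up for illustration, and which entries actually hold the screenshots depends on the crawler version, so inspect the entry names it prints rather than assuming a fixed layout.

```python
import zipfile


def wacz_size_breakdown(path):
    """Return (entry name, stored bytes) pairs for a WACZ file, largest first.

    `path` may be a filesystem path or a binary file-like object.
    Sizes are the bytes each entry occupies inside the ZIP container.
    """
    with zipfile.ZipFile(path) as zf:
        sizes = [
            (info.filename, info.compress_size)
            for info in zf.infolist()
            if not info.is_dir()
        ]
    # Largest entries first, so oversized screenshot data stands out.
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

Calling `wacz_size_breakdown("my-crawl.wacz")` (a placeholder file name) and printing each pair shows which entries dominate the output size.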