Comments (3)
Are the three new WARC fields above mandatory for modern browser-based replay, and are they de facto used in other tools today?
from browsertrix.
Hi @tuehlarsen, longer explanation coming but in short:
- WARC-Resource-Type: Proposal created at iipc/warc-specifications#96; this is used to differentiate resources fetched via JavaScript from those loaded directly in the page, and has other possibilities for future analysis of crawls
- WARC-Page-ID: We added this in Browsertrix to be able to easily associate pages between original crawls and QA replay crawls
- WARC-JSON-Metadata: Proposal created at iipc/warc-specifications#27
None of these additions should cause WARC validation to fail or cause any replay issues. Most software other than Browsertrix will simply ignore the fields, as is suggested in section 5.1 of the WARC 1.1 specification:
Because new fields may be defined in extensions to the core WARC format, WARC processing software shall ignore fields with unrecognized names.
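To illustrate the "ignore unrecognized names" behavior quoted above, here is a minimal sketch of a WARC header reader in Python. It is not taken from Browsertrix or any existing tool: the `KNOWN_FIELDS` set, the `parse_warc_headers` function, and the sample record are all hypothetical, and the values shown for `WARC-Page-ID` and `WARC-Resource-Type` are placeholders, since this thread does not define their syntax.

```python
# Field names this hypothetical reader understands (a subset of the
# WARC 1.1 named fields; a real tool would recognize more).
KNOWN_FIELDS = {
    "warc-type",
    "warc-record-id",
    "warc-date",
    "warc-target-uri",
    "content-type",
    "content-length",
}


def parse_warc_headers(header_block):
    """Parse one WARC header block, keeping only recognized fields."""
    lines = header_block.strip().splitlines()
    headers = {}
    # lines[0] is the version line (e.g. "WARC/1.1"), so start at index 1.
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line; skipped in this sketch
        key = name.strip().lower()  # WARC field names are case-insensitive
        if key in KNOWN_FIELDS:
            headers[key] = value.strip()
        # Anything else (e.g. WARC-Page-ID) is silently ignored.
    return headers


# Hypothetical record header; the two extension field values are placeholders.
SAMPLE = """\
WARC/1.1
WARC-Type: response
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>
WARC-Date: 2024-01-01T00:00:00Z
WARC-Target-URI: https://example.com/
WARC-Page-ID: example-page-id
WARC-Resource-Type: example-value
Content-Length: 0
"""

parsed = parse_warc_headers(SAMPLE)
print(parsed["warc-type"])
print("warc-page-id" in parsed)
```

A reader written this way processes the record normally and never sees the extension fields, which is why adding them should not affect tools that do not know about them.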
Re: your comment about missing documentation for screenshot and text WARC files, that is noted and should be coming shortly! As of the latest 1.0.0 crawler beta release, these WARCs will also be prefixed if a WARC prefix is specified.
I hope this will be documented more explicitly. For large, older web archives it is of great importance to know what the new WARC fields are for and how they will be used in the future. That requires a syntax definition and a description that can serve as input to a later ISO standardization process.