Comments (3)
I have been looking at the different implementation possibilities and have some notes that may help someone who wants to implement this.
Unfortunately, I have neither the skills nor the time at the moment to implement a good solution in the current project.
Some possible implementation notes:
1. One possible approach is to implement a behaviour, similar to ExtractLinks, that extracts M3U8 media segments. The JS in the browser could use something like [this parser](https://github.com/globocom/m3u8).
2. An alternative approach: when processing the content downloaded by the browser, if the content is an M3U8 playlist, parse it and download the media it references.
3. Add an extra step/tool that reads WARC files and, for each M3U8 playlist found there, downloads the media content and appends it to the WARC (or generates a new WARC with the same requests as the original WARC plus the media content).
In my opinion, solution 1 is the cleanest. Solution 3 is the dirtiest, since it downloads related content at two different points in time.
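
For reference, the "parse the playlist" step shared by all three approaches can be quite small. This is only a minimal sketch using the Python standard library, not crocoite code: the function name `extract_segment_urls` is made up for illustration, and it assumes a plain media playlist where every non-comment line is a segment URI. It ignores URIs that live inside tags (for example encryption keys) and treats the variant playlists listed in a master playlist as if they were segments.

```python
from urllib.parse import urljoin


def extract_segment_urls(playlist_text, base_url):
    """Return absolute URLs for every URI line of an M3U8 playlist.

    Hypothetical helper: lines starting with '#' are tags or comments,
    every other non-empty line is treated as a URI relative to base_url.
    """
    urls = []
    for raw_line in playlist_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, tags (#EXT...) and comments.
        if not line or line.startswith("#"):
            continue
        # Relative URIs are resolved against the playlist's own URL.
        urls.append(urljoin(base_url, line))
    return urls
```

Called with the playlist body and the URL it was fetched from, it returns the list of segment URLs that would still have to be requested.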
from crocoite.
I wouldn’t consider option 3 “dirty”. In fact, it’s pretty clean: you can easily add a conversion record to the WARC containing the full video downloaded by, say, “youtube-dl”, and referencing the original M3U8. Another option would be to click all play buttons for <video> and <audio> tags, wait until every one of those finishes playing, limit the network speed, rinse and repeat.
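
To make the conversion record idea a bit more concrete, here is a rough sketch of what such a record could look like on the wire, built by hand with the standard library. The function name `build_conversion_record` and the default `video/mp4` content type are assumptions for illustration. A real tool would more likely use a WARC library, and for a `.warc.gz` file each record is typically compressed as its own gzip member before being appended.

```python
import uuid
from datetime import datetime, timezone


def build_conversion_record(target_uri, refers_to_id, payload,
                            content_type="video/mp4"):
    """Serialize one uncompressed WARC/1.0 'conversion' record as bytes.

    Hypothetical helper. refers_to_id is the WARC-Record-ID of the record
    holding the original M3U8 playlist, e.g. '<urn:uuid:...>'.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Refers-To: {refers_to_id}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header block, blank line, payload, then two CRLFs to end the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"
```

The returned bytes could then be appended to an uncompressed WARC opened in `"ab"` mode.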
from crocoite.
Option 3 can easily be implemented using FFmpeg, which handles both downloading the individual segments and writing the complete media file. The only missing piece is appending the content to the WARC file.
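
A minimal sketch of the FFmpeg part, assuming `ffmpeg` is installed and on the `PATH`. The helper names are made up for illustration. `-i` and `-c copy` are standard FFmpeg options: the stream is remuxed into the output container without re-encoding.

```python
import subprocess


def build_ffmpeg_command(playlist_url, output_path):
    """Return the argument list for remuxing an HLS playlist to one file.

    Hypothetical helper; '-c copy' copies the streams without re-encoding.
    """
    return ["ffmpeg", "-i", playlist_url, "-c", "copy", output_path]


def download_media(playlist_url, output_path, timeout=None):
    """Run ffmpeg and raise if it fails or exceeds the optional timeout."""
    subprocess.run(
        build_ffmpeg_command(playlist_url, output_path),
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # seconds; guards against streams that never end
    )
```

The resulting file would then be the payload that still needs to be written into the WARC.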
The last option, "clicking all play buttons", is the closest to normal web browsing. But it is the slowest: the browser downloads media on demand, so it would take as long as the longest media on the page.
On the upside, it would be generic across all media types.
Of course, all options need to be careful about encountering live streams that never end.
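
One cheap guard against live streams, as a sketch: in HLS, a media playlist that is complete ends with the `#EXT-X-ENDLIST` tag, so a playlist without it may still be growing. The function name `looks_finite` is hypothetical, and the check only makes sense for media playlists. Master playlists never carry that tag, so they would have to be resolved to a media playlist first.

```python
def looks_finite(playlist_text):
    """Return True if a media playlist declares its end with EXT-X-ENDLIST.

    Hypothetical helper: a missing tag suggests a live or still-growing
    stream that should be skipped or downloaded with a time limit.
    """
    lines = [line.strip() for line in playlist_text.splitlines()]
    return "#EXT-X-ENDLIST" in lines
```

A crawler could skip playlists for which this returns `False`, or fall back to a fixed time limit.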
from crocoite.