w3c / webcodecs
WebCodecs is a flexible web API for encoding and decoding audio and video.
Home Page: https://w3c.github.io/webcodecs/
License: Other
It can be much more efficient to decode to and encode from pools of buffers rather than allocating for every frame. This likely requires a way to return buffers quickly (more quickly than GC).
Encoded data is typically small enough that copies into/out of a pool are a negligible cost, but we should verify that.
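As a rough illustration of the pooling pattern being described, here is a minimal sketch of an app-side buffer pool with an explicit release path; the class and method names are made up for this example and are not part of any proposal:
class FramePool {
  // Hypothetical pool: buffers are recycled explicitly instead of waiting for GC.
  constructor(count, bytesPerFrame) {
    this.bytesPerFrame = bytesPerFrame;
    this.free = [];
    for (let i = 0; i < count; i++) {
      this.free.push(new ArrayBuffer(bytesPerFrame));
    }
  }
  acquire() {
    // Fall back to a fresh allocation if the pool is exhausted.
    return this.free.pop() || new ArrayBuffer(this.bytesPerFrame);
  }
  release(buffer) {
    // Called by the app as soon as it is done with a frame; this is the
    // "return buffers quickly (more quickly than GC)" path mentioned above.
    this.free.push(buffer);
  }
}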
Current proposals provide for accessing decoded frames as WebGL textures, but these would be RGB, implying a conversion for most content. We should investigate whether we can provide access to individual planes.
Is there a path for using this proposal to encode/decode images?
I suspect it's possible by implementing the image container and somehow getting the binary keyframe out, but how to plumb this is somewhat unclear.
Given that a lot of developers have been asking for functionality to encode/decode images without using an HTMLImageElement as an intermediary, it would be nice to have this use case covered.
Hey,
As you are likely aware, there is a huge and painful limitation in the Web Audio API: accessing only a specific time range of audio sample data is not possible in any remotely feasible fashion without decoding the entire audio file into memory.
We are looking forward to finally putting this limitation behind us using WebCodecs. Are there any plans for supporting extraction of raw PCM audio data from a specific time range, say from 5 seconds to 10 seconds? (Obviously given an audio file not shorter than 10 seconds for this specific example.)
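For illustration only, a sketch of how a time-range extraction could look from the app side, assuming a hypothetical demuxer helper (extractChunksInRange) and a callback-style AudioDecoder along the lines of the explainer; none of these names are confirmed API:
async function decodeRangeToPcm(fileBytes) {
  const pcmChunks = [];
  const decoder = new AudioDecoder({
    output: decoded => pcmChunks.push(decoded), // raw PCM output
    error: e => console.error(e),
  });
  decoder.configure({ codec: 'opus', sampleRate: 48000, numberOfChannels: 2 });
  // extractChunksInRange() is a hypothetical app-side demuxer returning the
  // encoded chunks overlapping the 5 s - 10 s range (timestamps in microseconds).
  for (const chunk of extractChunksInRange(fileBytes, 5_000_000, 10_000_000)) {
    decoder.decode(chunk);
  }
  await decoder.flush();
  return pcmChunks; // the app still trims samples outside the exact range
}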
Many applications may wish to get statistics from the encoder/decoder in real time. Examples:
Not sure if there's as much need for cumulative statistics across frames.
More ideas?
How about we leave rate control to the Web App? It depends heavily on the use case but is not a computationally heavy algorithm. If we go that route, it implies the bitrate attribute can be removed from VideoEncoderTuneOptions. How about adding quantization -
dictionary VideoEncoderEncodeOptions {
..
unsigned long long quantization;
};
It would affect the rate-distortion balance so that 0 would be maximum quality and highest bitrate, while 255 would be the worst quality with the smallest encoded chunk size. By controlling the quantization value, the Web App could achieve the desired bitrate, or variable bitrate at constant quality, whichever is preferred. Internally, quantization would be mapped to the encoder's quantization parameter depending on the platform encoder. The highest quality setting (0) could then imply lossless coding if the selected codec supports it.
@sandersdan knows better if there's support for this in all platform decoders.
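To make the intent concrete, here is a rough sketch of the app-side rate control loop such a knob would enable, assuming the proposed quantization member exists on VideoEncoderEncodeOptions and that the app can observe output chunk sizes; both are assumptions of this sketch:
// Illustrative constant-bitrate loop driven entirely by the Web App.
// 'quantization' is the proposed per-encode knob from above, not current API.
let q = 30;                        // current quantizer guess, 0..255
const targetBytesPerFrame = 25000; // roughly 6 Mbps at 30 fps

function encodeFrame(encoder, frame) {
  encoder.encode(frame, { quantization: q });
}

function onChunk(chunk) {
  // Simple proportional controller: raise q when chunks are too big,
  // lower it when they come in well under target.
  if (chunk.byteLength > targetBytesPerFrame) q = Math.min(255, q + 1);
  else if (chunk.byteLength < targetBytesPerFrame * 0.8) q = Math.max(0, q - 1);
}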
Chromium's WebCodecs implementation will offload decoding to a separate thread. We probably want to specify this behavior.
An alternative would be to specify a Worklet type for codec implementations, but this doesn't seem compatible with having a codec that uses many threads.
MediaCapabilities uses a content type of "video/foo; codecs=A.B.C". We should probably use the "A.B.C" instead of what we have currently. For example, instead of "h264", use "avc1.4200" or instead of "vp9", "vp09.00.10.08". It's a little crazy and I can't find any RFCs for these things, but here is where Chromium parses them:
https://cs.chromium.org/chromium/src/media/base/mime_util_internal.cc?g=0&l=820
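For context, a small sketch of how the detailed strings are already consumed by MediaCapabilities; the specific codec string and numbers below are just examples:
async function checkDecodeSupport() {
  const info = await navigator.mediaCapabilities.decodingInfo({
    type: 'file',
    video: {
      contentType: 'video/mp4; codecs="avc1.42E01E"', // full "A.B.C"-style string
      width: 1280,
      height: 720,
      bitrate: 2_000_000,
      framerate: 30,
    },
  });
  console.log(info.supported, info.smooth, info.powerEfficient);
}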
The APIs make sense for encoding a bitstream, transporting it reliably then decoding and rendering it.
But what if you want to packetize an encoded bitstream within QUIC datagrams and then reassemble and decode it on the other side?
Or what if you want a mixture of reliable and unreliable transport? For example, you might want to provide different transport for keyframes and P-frames. So you might want to separate out portions of the encoded bitstream for reliable transport, and other portions for datagram transport.
readonly attribute unsigned long long? duration; // microseconds
The duration attribute is an optional attribute of the VideoFrame object that represents the time interval, in microseconds, for which the video composition should render the composed VideoFrame.
In WebGL applications we face the issue that downloading from the server and uploading a texture to the GPU is a pretty slow operation. On the web we have progressive JPEG/WebP, but we can't access the individual levels of detail, so we need to create 2-4 versions of a file, each of which already has a progressive format.
In the related WebGPU issue gpuweb/gpuweb#766 they suggested Basis and this repo to me. But it looks like Basis should be smaller in size, or be used via the JS fetch API?
During encoding, it is often desirable to limit the number of predicted frames between keyframes. For example, if a frame is lost during transmission, all frames predicted from it are also lost until the next keyframe. However, reducing the number of predicted frames also makes the rate-distortion behavior worse. Therefore a Web App should be allowed to choose the number of predicted frames (often called GOP size).
How about adding max_prediction -
dictionary VideoEncoderEncodeOptions {
..
unsigned long max_prediction;
};
There could also be -
dictionary VideoEncoderEncodeOptions {
..
octet error_resiliency;
};
which would enable codec-specific error resiliency features, such as extra resynchronization headers, if greater than 0.
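A usage sketch assuming both proposed members are adopted on VideoEncoderEncodeOptions (they are hypothetical here): cap the GOP at 60 frames and turn on basic error resiliency for a lossy network.
// Both option names below come from this proposal, not from the current spec.
function encodeForLossyNetwork(encoder, frame, frameIndex) {
  encoder.encode(frame, {
    keyFrame: frameIndex % 60 === 0, // force a keyframe every 60 frames
    max_prediction: 60,              // proposed cap on predicted frames (GOP size)
    error_resiliency: 1,             // proposed error-resilience level (> 0 enables it)
  });
}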
Related to #57
It would be nice to add an ErrorEvent to nicely handle errors in an error EventListener. I understand that with promises we can always reject the Promise and throw errors that we cannot escape, but for other "minor" errors it's good to have an onerror EventListener. It will help us distinguish different kinds of errors, temporary vs. permanent, etc.
const videoDecoder = new VideoDecoder({
  output: someCanvas
});

videoDecoder.configure({codec: 'vp8'}).then(() => {
  streamEncodedChunks(videoDecoder.decode.bind(videoDecoder));
}).catch(() => {
  // App provides fallback logic when config not supported.
  ...
});

...

videoDecoder.onerror = event => {
  console.log("Error! ");
};
OR
videoDecoder.addEventListener('error', error => console.log(`Error: ${error.name}`));
Strawman proposal:
[SecureContext, Exposed=(DedicatedWorker, Window)]
interface WebcodecsErrorEvent : Event {
constructor(DOMString type, WebcodecsErrorEventInit errorEventInitDict);
readonly attribute DOMException error;
};
dictionary WebcodecsErrorEventInit : EventInit {
required DOMException error;
};
interface VideoDecoder {
...
attribute EventHandler onerror;
}
Somewhat related to #49
Ideally decoded content would be in the same colorspace as the encoded content, and colorspace negotiation is "just" a metadata management problem. Android MediaCodec works differently:
Open questions:
what is the intention regarding simulcast support?
We have the following options:
Not support it: make the app create N encoder instances and handle it itself. Not very likely, as currently there is no support for accessing the raw image data, so the app would have to draw the media stream into n different canvases, downscale it, create a capture stream for each one, and pass each one to an instance of the encoder.
Support it as a configuration parameter of an encoder. This would allow a single input for the simulcast encoder, but would make it much more difficult to provide a good API compatible with non-simulcast encoders.
Provide a helper/adapter that can be created with a WebRTC encoding-like object parameter and will internally create n encoders, exposing each one so that each simulcast layer can be controlled individually. This would allow a single input, and the simulcast adapter would internally downscale/forward it to each of the encoders.
I think the last one is the current approach in libwebrtc's internal code and is also my preferred option.
At TPAC we got a lot of questions about what type to use for unencoded video frame. Currently the explainer uses the ImageData type, but that might not be efficiently implemented.
How to efficiently process video frames is really a larger problem than what we want to solve in WebCodecs, but we're getting a lot of questions about it, so we should try to figure something out (or find a different group to work on it).
After audio is decoded, would it be best for the decoder to return 1 buffer of interleaved audio or multiple buffers of decoded audio per channel (e.g. 2 buffers for stereo audio)?
Would de-interleave functionality be natively provided by the platform/UA or would it be the developer's responsibility? I think the former could be an easier API for devs to use.
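For reference, a minimal de-interleave sketch showing what would fall to the developer if the platform returned a single interleaved buffer; the interleaved layout is an assumption for illustration:
// Split one interleaved stereo Float32Array (L R L R ...) into two planar
// channel buffers. Purely illustrative; the actual output layout is the open
// question in this issue.
function deinterleaveStereo(interleaved) {
  const frames = interleaved.length / 2;
  const left = new Float32Array(frames);
  const right = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    left[i] = interleaved[2 * i];
    right[i] = interleaved[2 * i + 1];
  }
  return [left, right];
}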
The VideoEncodeLayer dictionary uses the approach to SVC taken in the ORTC API. Unfortunately, this approach cannot support non-hierarchical scalability modes such as K-SVC. That is why we took a different approach in WebRTC-SVC. Can we instead use an approach based on scalabilityModes?
Currently there is no way to create a VideoFrame from existing image data, so the only way to implement a custom media stream track is to paint it on a canvas object and capture the stream from it.
This would allow implementing custom decoders in WASM and would cover half of the work required for the funny hat use case.
When we express sizes, how do we distinguish between encoded size, visible size, and natural size? From what I understand:
This is a tracking issue for designing a solution for complex coding modes supported by various audio/video codecs like simulcast, SVC, and FEC.
We need an API for capabilities, like what MediaCapabilities has. Is that enough for us? What input do we need to provide to MediaCapabilities?
Many video codecs support B-frames: frames which have dependencies on previous and future (chronological) frames. Some decoders want to be fed packets in decode order (e.g., the future I-frame is fed before the B-frame that depends on it), while others take care of buffering and reordering internally.
This is a tracking issue for sketching how the WebCodecs API supports B-frames.
It comes up as a common question: can we have an API for media containers? It's something that can be done in JS/wasm and is arguably orthogonal to WebCodecs. But for some formats that you might consider video (GIF, (M)JPEG), the line between container and codec is blurry.
This is a tracking issue for a conversation around this topic. My current opinion is to leave it out of WebCodecs until it's more mature and then perhaps readdress it later.
A reminder to:
Rather than calling transform_stream.setParameters(...), attach new parameters to the frames that go into the WritableStream.
const decoded = await imageDecoder.decode(input);
const canvas = ...;
canvas.getContext('2d').putImageData(decoded, 0, 0);
The above (from the explainer) suggests that decoding doesn't stream, which feels like a missed opportunity.
In its current state, it feels like it'd be better to change createImageBitmap so it could accept a stream. Maybe that should happen, whereas a whole new API could expose streamed image decoding.
Images can stream in a few ways:
This would allow partially-loaded images to be used in things like <canvas>.
As currently defined, WebCodecs supports packetized codecs, where we expect one decoded frame per encoded chunk. For some codecs (eg. H.264 in Annex B format), it makes sense to use a byte stream instead.
This changes the interface of an encoder or decoder, so it's not a trivial change. It doesn't seem to be compatible with our flush or configure model unless streams gain support for flush.
How is the application supposed to be informed about decoding errors?
As of #14 the mechanism for changing encoder settings is to bundle a dictionary with the input frame. An alternative discussed was to have a separate chunk type that only has changed settings (no input frame).
Would it be helpful to offer APIs that use pre-defined ring buffers to reduce garbage collection and maintain low latency? SharedArrayBuffer (SAB) could also be used for cross-realm/thread processing, and browser support is returning.
Additionally, would it be helpful to control the decoder by specifying how many samples/frames to decode per call? We could decode quickly at first for low-latency playback and then gradually increase frame sizes after we have enough decoded data for playback continuity.
For example, consider a streaming audio AudioWorklet where GC is reduced using ring buffers and specifying 128 samples to decode synchronously (relates to #19).
audio-worklet-processor.js
// ring buffer of encoded bytes (set by "onmessage" or SAB from main/worker thread)
const inputBuffer = new ArrayBuffer(...)

// ring buffers for decoded stereo 2.5 s PCM @ 48,000 Hz
const outLeft = new ArrayBuffer(Float32Array.BYTES_PER_ELEMENT * 48000 * 2.5)  // ~469K
const outRight = new ArrayBuffer(Float32Array.BYTES_PER_ELEMENT * 48000 * 2.5) // ~469K

// decoded PCM samples
const samplesLeft = new Float32Array(outLeft)   // 120,000 samples
const samplesRight = new Float32Array(outRight) // 120,000 samples

// new stereo decoder (could also be on Worker/main thread via SAB)
const decoder = new AudioDecoder({
  srcBuffer: inputBuffer,
  outputBuffers: [outLeft, outRight]
})

// buffer read/write index values
let inStart, inEnd, outStart, outEnd

// return values after decode() call
let totalSrcBytesUsed, totalSamplesDecoded

// AudioWorkletProcessor.process - processes 128 frames per quantum
process(inputs_NOT_USED, outputs) {
  // specify the max samples to decode (could also be called on Worker/main thread)
  ({ totalSrcBytesUsed, totalSamplesDecoded } = decoder.decode({ maxToDecode: 128 }))
  // update src & output buffers read/write indexes
  ...
  // output decoded [samplesLeft, samplesRight] to @outputs
  ...
}
Some users prefer to use their own WASM software decoder when hw acceleration is not available.
A number of considerations all combine in somewhat complex ways, such as (re)initialization, buffering, failure recovery, and flushing. Obviously we'd like a good API for all of these, but there are tradeoffs between different options. This issue is for tracking the discussion of what we want the API to look like.
Here are some options. Note that it may be possible to have different options for encode and decode. For example, we could do a combination of B for encode and D for decode.
Option A: Every time you want to change something that requires (re)initialization, such as changing the codec or resolution, create a new Encoder/Decoder. Also reinitialize every time a flush is desired.
Pros:
Cons:
Option B: If a change requires a reinitialization, call Initialize(), as many times as you want. The .writable and .readable are stable.
Pros:
Cons:
Option C: If a change requires a reinitialization, call Initialize(), as many times as you want. The .readable is stable, but not the .writable (if there is one).
Pros:
Cons:
Option D: To reinitialize, put new parameters on the chunk passed into the .writable. Init failure is conveyed via a write failure.
Pros:
Cons:
Option E: Instead of asking for an init, just give it what you want and have it (re)init when it needs to. There is a fine line between this and Option D. But consider resolution changes: instead of specifying that the codec reinit with a new size, you just give it whatever frame comes from a MediaStreamTrack and it reinits based on that size. Similarly, an EncodedVideoFrame could simply express what codec it is and the decoder deals with whatever it is.
Pros:
Cons:
Modern standards allow a frame to predict its content temporally from both past and future frames (B-frames). Encoded chunks are usually stored in encoding/decoding order, which may differ from presentation order. VideoFrames are given to VideoEncoder in presentation order, and the encoder may reorder them before coding and output encoded video chunks in decoding order. Similarly, encoded chunks are given in decoding order to VideoDecoder.
What if VideoDecoder called VideoDecoderOutputCallback as soon as a video frame has finished decoding, i.e. it gave VideoFrames to VideoDecoderOutputCallback in decoding order, not presentation order? In use cases like video playback the Web App would then need to reorder the VideoFrames into presentation order. Although this increases Web App complexity a bit, it has several advantages, for example not having to buffer VideoFrames that will only be given to the Web App later, after other frames.
How about adding decodingSequence and presentationSequence like -
interface EncodedVideoChunk {
...
readonly attribute unsigned long decodingSequence;
};
[Exposed=(Window)]
interface VideoFrame {
...
readonly attribute unsigned long presentationSequence;
};
In EncodedVideoChunk, decodingSequence would denote the coded order and be filled by the Web App before giving the chunk to VideoDecoder. VideoDecoder would fill the presentationSequence of the VideoFrame before calling the OutputCallback.
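A sketch of the app-side reordering this would require, assuming the proposed presentationSequence attribute is added to VideoFrame (it is only a proposal above) and that presentation sequence numbers are dense:
// Reorder decoder output (decode order) into presentation order.
// presentationSequence is the hypothetical attribute proposed in this issue;
// renderFrame() is an app-defined rendering step.
const pending = new Map();
let nextToPresent = 0;

function onDecodedFrame(frame) {
  pending.set(frame.presentationSequence, frame);
  // Emit frames as soon as an unbroken run starting at nextToPresent exists.
  while (pending.has(nextToPresent)) {
    renderFrame(pending.get(nextToPresent));
    pending.delete(nextToPresent);
    nextToPresent++;
  }
}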
There have been similar discussions in #7.
It should be possible to mux encoded content using VideoTrackWriter with MediaRecorder (encoded frames → VideoTrackWriter → VideoTrack → MediaRecorder → muxed content). We would need to verify that this combination is capable of correctly representing reordered frames (or reject them at runtime).
The same API could be used to handle encoding as well (raw frames → VideoTrackWriter → VideoTrack → MediaRecorder → muxed content). It's unclear if this is useful to WebCodecs users.
Many codecs require additional information that is not stored in the bitstream. This information is usually stored in the container format. For example, H264 requires SPS and PPS to be specified out of band when using the AVCC format.
WebCodecs must have a way to get the exported side data when encoding and provide a way to specify the side data when decoding.
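As a hedged sketch of what "specify the side data when decoding" could look like for H.264 in AVCC format, assuming a hypothetical description field on the decoder config (the field name and codec string are illustrative, not something this issue settles):
// avccBox would be the avcC box bytes (containing SPS/PPS) that the app's own
// demuxer extracted from the MP4 container; 'description' is the assumed
// config field for such out-of-band data.
const decoder = new VideoDecoder({
  output: frame => renderFrame(frame), // renderFrame() is app-defined
  error: e => console.error(e),
});
decoder.configure({
  codec: 'avc1.42E01E',
  description: avccBox, // e.g. a Uint8Array produced by the demuxer
});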
This is a tracking issue for describing how timestamps work with WebCodecs and integrate with the video and audio playout time domains.
Hello,
We have a requirement to do encoding/decoding/packetization ourselves in WASM, and then transport the data via WebTransport or RTCQuicTransport. In the beginning, we might also need to do some media processing besides the encoding/decoding.
What we need is:
1) Read the raw data of the captured media, such as YUV video data or PCM audio data, so the raw data can be passed to a WASM/JS module.
2) It would be better if capture could run on a worker.
3) Data passing should be efficient, without memory copies.
4) Being able to use hardware-accelerated encoders/decoders from the WASM/JS module would be a plus.
Could this spec satisfy our requirements?
After an error in processing (configure, decode, or otherwise), there are multiple potential ways we could allow recovery:
The latter two are easier to reason about, but some apps would be simpler with a different choice. Perhaps it should be configurable.
(Example that is difficult to reason about: if a configure() fails or is aborted, what configuration would we use to decode the next keyframe? This may be a reason to require configure() after reset()--it is always unambiguous.)
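A small sketch of the "configure() after reset()" flow described above, assuming reset() drops queued work and that decoding must resume at a keyframe; the method names follow the issue text, the helper and its arguments are hypothetical:
// Recovery path after a decode error: reset, reconfigure explicitly, then
// resume from the next keyframe in the app's queue of encoded chunks.
function recoverAfterError(decoder, config, chunks) {
  decoder.reset();            // drop queued chunks and pending outputs
  decoder.configure(config);  // unambiguous: the next keyframe uses this config
  const keyIndex = chunks.findIndex(c => c.type === 'key');
  if (keyIndex === -1) return; // nothing decodable until more data arrives
  for (const chunk of chunks.slice(keyIndex)) {
    decoder.decode(chunk);    // resume from the keyframe onward
  }
}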
Background
MediaRecorder supports the VP8 video codec in Chromium, Chrome, and Firefox. However, the specification provides no means to programmatically set encoder options, where available in the implementation source code, for the width and height of individual frames (images) of the input. The result in Chromium and Chrome is that "video/webm;codecs=vp8" and "video/webm;codecs=vp9" produce a WebM file that does not match the input width and height when the input frames have variable width and height (https://bugs.chromium.org/p/chromium/issues/detail?id=972470; https://bugs.chromium.org/p/chromium/issues/detail?id=983777). The only code shipped with Chromium that has been able to record variable width and height input and output frames identical to the input frames uses "video/x-matroska;codecs=h264" or "video/x-matroska;codecs=avc1" (https://plnkr.co/edit/Axkb8s?p=info). Although technically WebM is specified to support only VP8 or VP9, the fact is that Chromium and Chrome support video codecs other than VP8 or VP9 in the WebM container (https://bugzilla.mozilla.org/show_bug.cgi?id=1562862; https://bugs.chromium.org/p/chromium/issues/detail?id=980822; https://bugs.chromium.org/p/chromium/issues/detail?id=997687; https://bugs.chromium.org/p/webm/issues/detail?id=1642). When the codecs are changed to VP8 or VP9, the resulting WebM file does not output the correct pixel dimensions corresponding to the input MediaStreamTrack.
Mozilla Firefox and Nightly do record and encode the correct variable input video frames, both when using MediaRecorder to create media files and MediaSource to play back media files.
Proposed solution
Specify options or a method to encode the pixel dimensions (width and height) of individual frames, and make sure the options and/or method produce the expected result; if not, write code from scratch that achieves that requirement, to be included in the WebCodecs specification.
For example, using code at the Explainer
const videoEncoder = new VideoEncoder({
  codec: "vp9",
  // code
});
include options or a method to explicitly set the encoder to encode each input frame's width and height, to avoid the behavior at Chromium and Chrome, which output pixel dimensions that do not match the input dimensions.
Due to the lack of read-only ArrayBuffers, extra copying may be necessary to make some WebCodecs APIs safe.
Blobs however can be read-only. We should investigate whether supporting Blobs in some APIs would be beneficial.
At TPAC it was brought up that there are some privacy concerns with low-level codec access. We should try to address some of these concerns in the explainer.
A couple concerns that were brought up:
In the TrackWriter ->
Decoded audio could be passed to WebAudio or the audio device client API. We may need dynamic feedback from the audio api to say how much delay the platform's internal buffering adds to the playout of the most recently provided buffer.
I'm not an expert on those Audio APIs so yell if I'm missing something obvious. @padenot @hoch
@padenot mentioned there are some use cases for synchronously encoding/decoding media in contrast to the current API proposal which encourages/mandates asynchronous execution.
Explainer presently states
MediaRecorder allows encoding a MediaStream that has audio and video tracks.
Technically, as currently specified, MediaRecorder does not provide a means to record multiple video tracks within a single MediaStream.
Recording multiple tracks was intended to be possible from day one. Many formats handle multiple tracks.
When it was pointed out that a lot of container formats couldn't handle increasing the number of channels mid-recording, we were left with two choices:
- Make the behavior dependent on container format (unpalatable)
- Make the behavior consistent, but not very useful (i.e. stop).
The WG chose the latter.
I am not aware of any change in the landscape of container formats that seems to indicate that varying the number of tracks is a generally available option. If you know of such changes, please provide references. As for the "replace track" option - I don't think anyone thought about that possibility at the time.
The fact that the MediaRecorder specification does not provide a means to record multiple video tracks is a problem that this proposal can resolve.
For example, it should be possible to write code similar to
const merged = await decodeVideoData(["video1.webm#t=5,10", "video2.mp4#t=10,15", "video3.ogv#t=0,5"], { codecs: "openh264" });
I would like to understand how WebCodecs supports content protection. In WebRTC NV Uses Cases, we initially had a use case where Javascript could be trusted with keys used to encrypt or decrypt protected content. That use case was removed after the IESG took objection. So the question is how WebCodecs can address the only remaining use case (untrusted Javascript).
One of the key use cases mentioned in the explainer is
Non-realtime encoding/decoding/transcoding, such as for local file editing
Can we add an example demonstrating basic media editing operations like trim and concatenation?
We are trying to understand how to use WebCodecs to achieve the same functionality as MediaBlob, to ensure that it meets developer needs.
VideoEncoder can increase coding efficiency by referencing future frames when coding a frame (eg. B-frames). However, this also increases latency since the future frames need to be decoded before the bidirectionally encoded frames. This is not desirable in all use cases such as video conferencing.
How about adding maxPredictionFrames -
dictionary VideoEncoderEncodeOptions {
..
unsigned long maxPredictionFrames;
};
which gives the maximum number of future frames to use for prediction. Setting maxPredictionFrames to 0 would disable B-frames altogether.
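A usage sketch assuming the proposed maxPredictionFrames member is adopted (it is hypothetical here), contrasting a low-latency conferencing path with an offline transcode:
// maxPredictionFrames is the proposed (not yet specified) option above.
function encodeRealtime(encoder, frame) {
  // 0 disables B-frames entirely, so no future frames delay the output.
  encoder.encode(frame, { maxPredictionFrames: 0 });
}

function encodeOffline(encoder, frame) {
  // A small lookahead trades latency for better compression efficiency.
  encoder.encode(frame, { maxPredictionFrames: 2 });
}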
Related to #55.
The Streams spec assumes you can get some backpressure signal from the implementation of a TransformStream.
But a hardware codec may not give you that much control over what happens after frame data is passed into the codec API. It will typically decode the frame data immediately and render it into a GPU frame buffer.
So the high level feedback is: the spec (at a future level of maturity) should map the behavior of an abstract codec onto the behavior of the TransformStream, and allow codec implementations that run in immediate mode (without buffering). (Or require implementations to do this buffering internally.)
Background
Merging Matroska and WebM files requires at least
See
The WebM file output by MediaRecorder implementations in Chromium and Chrome, and in Mozilla Firefox and Nightly, can have arbitrary A/V track order, in general, per the Media Capture and Streams specification (https://www.w3.org/TR/mediacapture-streams/):
The tracks of a MediaStream are stored in a track set. The track set MUST contain the MediaStreamTrack objects that correspond to the tracks of the stream. The relative order of the tracks in the set is User Agent defined and the API will never put any requirements on the order.
Proposed solution
The Web Codecs specification should define a means to get input media track order and set output file track order.
How are WebCodecs objects supposed to work with web workers? Should a MediaStreamTrack be transferable between the main thread and a web worker?