
Comments (15)

padenot avatar padenot commented on July 16, 2024

This is already available on the Web Audio API: https://webaudio.github.io/web-audio-api/#dom-audiocontext-outputlatency. I've written about this here: https://blog.paul.cx/post/audio-video-synchronization-with-the-web-audio-api/#new-members-on-the-audiocontext-interface, maybe it's less dry than the spec text.

What we're missing is the graphics-stack delay; it's much more severe than audio delay on modern machines. See the third footnote in this post.

I would expect up to 3 * (1000 ms / 60 Hz) = 50 ms of latency when triple buffering, but this depends on the OS, driver, and configuration. Another frame or two of latency in the software is common (bringing the number to 66 ms or 83 ms), and 30 fps screens are common (especially unintentionally, when plugging in a 4K display with the wrong cable, which brings the number to over 100 ms), so we're looking at something much larger than audio latency.

This is especially problematic given the asymmetry of auditory and visual perception: late audio (compared to video) sounds mostly fine even with a reasonably large shift (until it doesn't), but early audio (compared to video) looks off very quickly. Footnote 2 of my post talks about this; happy to scan and send the relevant chapter from the book mentioned, privately.
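For the audio side, something like the following rough sketch is possible today with `outputLatency` and `getOutputTimestamp()` (the exact bookkeeping is spelled out in the blog post linked above; this is only an approximation and ignores the graphics-stack delay discussed here):

```js
// Rough estimate of the wall-clock time (performance.now() timeline, in ms)
// at which audio scheduled at AudioContext time `t` becomes audible.
const ctx = new AudioContext();

function estimatedAudibleTime(t) {
  const { contextTime, performanceTime } = ctx.getOutputTimestamp();
  // outputLatency may be missing or zero in some implementations;
  // fall back to baseLatency as a lower bound.
  const latency = ctx.outputLatency || ctx.baseLatency || 0;
  return performanceTime + (t - contextTime + latency) * 1000;
}
```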


chcunningham avatar chcunningham commented on July 16, 2024

Does Firefox consider display latency in its AV sync? AFAIK Chrome does not. @dalecurtis, thoughts?


dalecurtis avatar dalecurtis commented on July 16, 2024

With the timestamp provided by requestAnimationFrame, the AudioContext, and the outputLatency value, don't you have enough to determine the correct A/V sync delays?


chcunningham avatar chcunningham commented on July 16, 2024

re: requestAnimationFrame, I take it you mean video.requestVideoFrameCallback()? This wouldn't be available if we're rendering to a Canvas. Canvas rendering is the primary focus atm.


dalecurtis avatar dalecurtis commented on July 16, 2024

No, I mean normal window.rAF(), which I assume would be driving canvas updates.


padenot avatar padenot commented on July 16, 2024

> Does Firefox consider display latency in its AV sync? AFAIK Chrome does not. @dalecurtis, thoughts?

I don't think we do, but I'd like us to.


chcunningham avatar chcunningham commented on July 16, 2024

@padenot - I got a bit ahead of myself with sync questions. Allow me to redirect to a larger concern: how should we render decoded audio? This was more straightforward with the earlier MediaStreamTrackWriters proposal, but all the partners we've talked to want lower level control of rendering (mostly of video, but that implies the same for audio).

Offline you mentioned we should use AudioWorklet. Can you sketch how you envision this working? The demos I've read are all about processing the audio during rendering. I'm less clear on how to get the audio into the worklet, and whether that is even required if what you really want is just to play the audio (in sync with your video) without additional processing.

At a glance WebAudio's AudioBuffer seems well suited to describe decoded Audio. But I was surprised to find that AudioBufferSourceNode is associated with a single AudioBuffer. Is there a mechanism to queue up more than one AudioBuffer?

I found this ancient thread with lots of insights.

  • Scheduling Web Audio with Precision offers an approach that only works for high-latency apps.
  • Even ^that isn't guaranteed. @jernoble wrote "For stitching together separate AudioBuffers seamlessly, having a buffer queue node available would be much more preferable to having web authors implement their own queuing model." Looks like that wasn't pursued though, at least not that I've found.
  • In a reply, crogers wrote "... with the new ondone event for AudioBufferSource, it would effectively be possible to "queue up" two chunks, and then queue an additional chunk each time an ondone is received." I'm guessing this is now AudioBufferSourceNode and the event is now named onended. The approach seems viable though? (A rough sketch follows this list.)
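For concreteness, a rough sketch of that last approach, assuming decoded PCM arrives as de-interleaved Float32Arrays from a hypothetical nextChunk() helper (as noted above, this is only workable for latency-tolerant apps and isn't guaranteed to be gapless):

```js
const ctx = new AudioContext();
let writeHead = ctx.currentTime + 0.1; // small startup margin

function scheduleChunk([channelData, sampleRate]) {
  const frames = channelData[0].length;
  const buffer = ctx.createBuffer(channelData.length, frames, sampleRate);
  channelData.forEach((samples, ch) => buffer.copyToChannel(samples, ch));

  const node = new AudioBufferSourceNode(ctx, { buffer });
  node.connect(ctx.destination);
  // Top up the queue whenever a chunk finishes (the modern name for `ondone`).
  node.onended = () => scheduleChunk(nextChunk()); // nextChunk() is hypothetical
  node.start(writeHead);
  writeHead += frames / sampleRate;
}

// Prime the queue with two chunks, as suggested in the quoted reply.
scheduleChunk(nextChunk());
scheduleChunk(nextChunk());
```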


chcunningham avatar chcunningham commented on July 16, 2024

Offline @hoch mentioned a flow like:

AudioDecoderOutput => MediaStream => MediaStreamSourceNode => AudioContext.destination

And, for users that need to manipulate during rendering:

AudioDecoderOutput => MediaStream => MediaStreamSourceNode => AudioWorkletNode (JS processing) => AudioContext.destination

Basically the same idea as AudioTrackWriter from the Streams-based proposal. While we've decided to decouple from streams, we could still offer a utility to create a MediaStreamTrack from raw decoder outputs.

I'm not sure yet how I feel about this vs. the AudioBufferSourceNode approach from the final bullet of my previous comment. It sounds like a roundabout version of the BufferQueueNode Jer suggested.
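To make the wiring concrete, a sketch of that graph, assuming the utility mentioned above exists as a hypothetical trackFromDecoder() and that a 'processor' worklet module has already been registered:

```js
const ctx = new AudioContext();

// Hypothetical utility turning raw decoder outputs into a MediaStreamTrack;
// this is not a real API today.
const track = trackFromDecoder(audioDecoder);

const source = new MediaStreamAudioSourceNode(ctx, {
  mediaStream: new MediaStream([track]),
});

// Plain playback:
source.connect(ctx.destination);

// Or, with a JS processing stage (module registered via ctx.audioWorklet.addModule()):
// source.connect(new AudioWorkletNode(ctx, 'processor')).connect(ctx.destination);
```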


dalecurtis avatar dalecurtis commented on July 16, 2024

Ping @padenot for the above comments. Sorry forgot to ask about this during the weekly.


chcunningham avatar chcunningham commented on July 16, 2024

After more discussion, @hoch and @steveanton suggested we instead use a ring of SharedArrayBuffers to provide input to the AudioWorklet; basically this pattern: https://developers.google.com/web/updates/2018/06/audio-worklet-design-pattern#webaudio_powerhouse_audio_worklet_and_sharedarraybuffer

This helped me realize we don't need to use the 'input' argument of process(), so we don't need to feed AudioBuffers through an AudioBufferSourceNode to get the data in.

@padenot, I imagine that's more what you had in mind? This seems like a much better approach than any of the proposals I outlined above.

This suggests we shouldn't actually use AudioBuffer as an output from AudioDecoder. Just using a (Shared)ArrayBuffer or Float32Array would do. We should still design a means for users to provide these buffers, to avoid allocating them on every decode.
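A minimal sketch of that pattern (single producer, single consumer, mono, following the design-pattern article linked above; the 'pcm-player' name and file layout are made up here):

```js
// --- main/decoder side: a float ring buffer in a SharedArrayBuffer ---
const ctx = new AudioContext();
const CAPACITY = 48000;                       // e.g. one second at 48 kHz
const sab = new SharedArrayBuffer(8 + CAPACITY * 4);
const indices = new Int32Array(sab, 0, 2);    // [0] = read index, [1] = write index
const samples = new Float32Array(sab, 8, CAPACITY);

function push(chunk /* Float32Array of PCM */) {
  const r = Atomics.load(indices, 0);
  const w = Atomics.load(indices, 1);
  const free = (r - w - 1 + CAPACITY) % CAPACITY;
  if (chunk.length > free) return false;      // full: caller applies back-pressure
  for (let i = 0; i < chunk.length; i++) samples[(w + i) % CAPACITY] = chunk[i];
  Atomics.store(indices, 1, (w + chunk.length) % CAPACITY);
  return true;
}

// After `await ctx.audioWorklet.addModule('pcm-player.js')`:
const node = new AudioWorkletNode(ctx, 'pcm-player');
node.connect(ctx.destination);
node.port.postMessage(sab);

// --- pcm-player.js: the worklet pulls from the same buffer in process() ---
class PcmPlayer extends AudioWorkletProcessor {
  constructor() {
    super();
    this.port.onmessage = ({ data }) => {
      this.indices = new Int32Array(data, 0, 2);
      this.samples = new Float32Array(data, 8, (data.byteLength - 8) / 4);
    };
  }
  process(inputs, outputs) {
    if (!this.samples) return true;           // not wired up yet
    const out = outputs[0][0];                // one render quantum, already zeroed
    const cap = this.samples.length;
    const r = Atomics.load(this.indices, 0);
    const w = Atomics.load(this.indices, 1);
    const n = Math.min(out.length, (w - r + cap) % cap);
    for (let i = 0; i < n; i++) out[i] = this.samples[(r + i) % cap];
    Atomics.store(this.indices, 0, (r + n) % cap);
    return true;                              // underruns just leave silence
  }
}
registerProcessor('pcm-player', PcmPlayer);
```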


padenot avatar padenot commented on July 16, 2024

This works well in general, but the devil is in the details. Here are the questions that need an answer, plus a couple of clarifications (my opinion on each question in parentheses after it):

  • Are we happy about exposing float samples directly at the output of the decoder, and passing in the usual arraybuffer/length/offset parameters? (I think this is what we should do; it means it will integrate better with WASM and is generally what native media APIs do.)
  • Does our audio decoder API allow specifying a number of frames to output? (I think we should; this is necessary if we're going to recommend using a ring buffer, otherwise we need an intermediate linear buffer, or more memory.)
  • Are we happy about not providing audio resampling for authors? This can be done by creating a second AudioContext at the rate of the media, and then sending that to another AudioContext, if it's needed at all. (I'm mixed on this, but if we're happy with the ergonomics of using two AudioContexts, it's fine. I can provide example code. Also, the resampling quality is not configurable for now.)
  • How do we plan to have everything run in the clock domain of the audio, as is done with regular media playback, considering decoding is done using a system clock? We can derive a clock drift from the ring buffer, but it will take some time to converge, and authors can decide to do something else, in which case it won't work as well. Maybe good documentation will help. (Mixed again; this is something people take for granted, and mistakes will be made, so I'm truly on the fence.)

Also, you don't want to allocate anything here except at startup. You want a single ring buffer and to make copies into and out of it: this is not video, the number of bytes is small, so it's faster to always work in the same small buffer (keeping a small working set) than to juggle several buffers that have to be pulled from main memory each time. Production-ready code for this very use case is available here.

We don't want to use condition variables here: this is an isochronous system with heterogeneous thread priorities, so it's best to rely on back-pressure and not signal a low-priority thread from a high-priority thread; that won't work under load.
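To illustrate the back-pressure point, a sketch of the decode-side pump, reusing `indices`, `CAPACITY`, and `push()` from the ring-buffer sketch above, plus a hypothetical `decodeNextChunk()`; the real-time thread is never signalled, it simply consumes whatever is available:

```js
const TARGET_BUFFERED_FRAMES = 4800; // ~100 ms at 48 kHz, an arbitrary target

function pump() {
  let buffered =
    (Atomics.load(indices, 1) - Atomics.load(indices, 0) + CAPACITY) % CAPACITY;
  while (buffered < TARGET_BUFFERED_FRAMES) {
    const chunk = decodeNextChunk();   // hypothetical: returns a Float32Array or null
    if (!chunk || !push(chunk)) break; // push() refusing the chunk is the back-pressure
    buffered += chunk.length;
  }
  setTimeout(pump, 20);                // low-priority polling, no condition variables
}
pump();
```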

AudioBuffers are good in the sense that the rate/channel count and the PCM data are tied together; we need to decide if we care about this, or if we'd rather have the rate/channel count available out-of-band.


chcunningham avatar chcunningham commented on July 16, 2024

> Are we happy about exposing float samples directly at the output of the decoder, and passing in the usual arraybuffer/length/offset parameters? (I think this is what we should do; it means it will integrate better with WASM and is generally what native media APIs do.)

I'm happy to do so. I imagine the decoder output might actually be an interface/dictionary containing a Float32Array alongside some metadata (like a timestamp).

> Does our audio decoder API allow specifying a number of frames to output? (I think we should; this is necessary if we're going to recommend using a ring buffer, otherwise we need an intermediate linear buffer, or more memory.)

What if we instead tell folks what the number of frames would be, given the other properties of their config? E.g. av_samples_get_buffer_size. I need to review the other underlying decoders in Chrome, but this obviously maps well onto ffmpeg.

> Are we happy about not providing audio resampling for authors? This can be done by creating a second AudioContext at the rate of the media, and then sending that to another AudioContext, if it's needed at all. (I'm mixed on this, but if we're happy with the ergonomics of using two AudioContexts, it's fine. I can provide example code. Also, the resampling quality is not configurable for now.)

I'm not familiar enough to quickly visualize the setup with two AudioContexts. How often would you expect this to be needed?

> How do we plan to have everything run in the clock domain of the audio, as is done with regular media playback, considering decoding is done using a system clock? We can derive a clock drift from the ring buffer, but it will take some time to converge, and authors can decide to do something else, in which case it won't work as well. Maybe good documentation will help. (Mixed again; this is something people take for granted, and mistakes will be made, so I'm truly on the fence.)

You lost me here. My sense is: the time a given decode output contributes would be a function of sample rate and the number of samples. Users would observe the media clock position by noting the media time of the last output samples in the AudioWorklet. The video renderer would then try its best to paint a frame close to that.


padenot avatar padenot commented on July 16, 2024

> What if we instead tell folks what the number of frames would be, given the other properties of their config? E.g. av_samples_get_buffer_size. I need to review the other underlying decoders in Chrome, but this obviously maps well onto ffmpeg.

Sounds good. Authors would do their own buffering, but that's normal.

> Are we happy about not providing audio resampling for authors? This can be done by creating a second AudioContext at the rate of the media, and then sending that to another AudioContext, if it's needed at all. (I'm mixed on this, but if we're happy with the ergonomics of using two AudioContexts, it's fine. I can provide example code. Also, the resampling quality is not configurable for now.)
>
> I'm not familiar enough to quickly visualize the setup with two AudioContexts. How often would you expect this to be needed?

The sample rate is a property of the AudioContext; it's either decided by authors at construction or picked automatically to match what the system prefers. You can connect two AudioContexts via a MediaStreamAudioDestinationNode in the first AudioContext (you route your audio to it, and it exposes a MediaStream that can be used to connect to other APIs) and a MediaStreamTrackAudioSourceNode/MediaStreamAudioSourceNode in the second AudioContext (this allows having a MediaStreamTrack as an input AudioNode). This connection resamples to the destination sample rate and (probably) corrects the (probable) clock skew between the contexts.
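A sketch of that two-context setup, assuming the media is 44.1 kHz and the output device prefers some other rate:

```js
const mediaCtx = new AudioContext({ sampleRate: 44100 }); // runs at the media rate
const outputCtx = new AudioContext();                     // runs at the device's preferred rate

// Everything rendered in mediaCtx is routed into a MediaStream...
const bridgeOut = new MediaStreamAudioDestinationNode(mediaCtx);
// (connect your worklet / source nodes in mediaCtx to bridgeOut here)

// ...which the second context resamples on its way to the real output.
const bridgeIn = new MediaStreamAudioSourceNode(outputCtx, {
  mediaStream: bridgeOut.stream,
});
bridgeIn.connect(outputCtx.destination);
```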

> How do we plan to have everything run in the clock domain of the audio, as is done with regular media playback, considering decoding is done using a system clock? We can derive a clock drift from the ring buffer, but it will take some time to converge, and authors can decide to do something else, in which case it won't work as well. Maybe good documentation will help. (Mixed again; this is something people take for granted, and mistakes will be made, so I'm truly on the fence.)
>
> You lost me here. My sense is: the time a given decode output contributes would be a function of sample rate and the number of samples. Users would observe the media clock position by noting the media time of the last output samples in the AudioWorklet. The video renderer would then try its best to paint a frame close to that.

The AudioContext clock (which runs off the audio clock, derived from the system-level audio callbacks) is not in the same domain as the system clock (on most systems). One second in one is not the same as one second in the other. For example, on my MacBook Pro 2018 with the built-in audio output device, the audio clock is skewed (from the look of it, linearly) by about 60 ms per hour. This means that, in a decoding scenario, for a movie that is 1h30, A/V sync is completely off by the end if nothing is done (and possibly audio ends up ahead of video, which is horrible perceptually). With a system that uses MediaStream to drive the decoding (i.e. decode more when the buffers are low on data, etc.), this is handled by the UA. When done manually, authors need to be aware of this.

The Streams-based solution addressed this with its provisions for back-pressure/under-run, so we need to decide whether the API does something here, or whether this is up to authors (there are multiple strategies to fix it).


chcunningham avatar chcunningham commented on July 16, 2024

@padenot and I synced up offline.

The major questions in this issue (how we will do audio rendering, how to do A/V sync) are now answered above.

For the final points about clock skew, we discussed that apps can overcome this by letting audio drive the clock (as a function of its output samples and sample rate), and then painting video frames to match wherever the audio is.
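A sketch of what that looks like in practice, assuming the worklet reports how many frames it has consumed so far (e.g. over its MessagePort) into a `framesRendered` counter, and that decoded video frames sit timestamp-ordered in a hypothetical `frameQueue`:

```js
function currentMediaTime() {
  // The media clock is purely a function of samples played and sample rate;
  // subtracting outputLatency accounts for audio not yet at the speakers.
  return framesRendered / ctx.sampleRate - (ctx.outputLatency || 0);
}

function onFrame() {
  const now = currentMediaTime();
  // Drop frames whose presentation time has already passed, then paint the head.
  while (frameQueue.length > 1 && frameQueue[1].timestamp <= now) frameQueue.shift();
  if (frameQueue.length) paintToCanvas(frameQueue[0]); // hypothetical draw helper
  requestAnimationFrame(onFrame);
}
requestAnimationFrame(onFrame);
```

Display latency (discussed at the top of this thread) is still unaccounted for here.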


rvadhavk avatar rvadhavk commented on July 16, 2024

> The major questions in this issue (how we will do audio rendering, how to do A/V sync) are now answered above.
>
> For the final points about clock skew, we discussed that apps can overcome this by letting audio drive the clock (as a function of its output samples and sample rate), and then painting video frames to match wherever the audio is.

@chcunningham What do you envision/recommend for estimating the latency of painting video frames? I might have missed it, but I didn't see that answered above. If it's not possible in general to measure the time from draw to pixels on the screen, is it possible to measure the time from draw to pixels in the back buffer?
