
Replacing WebRTC: real-time latency with WebTransport and WebCodecs

DaleCurtis
19 replies

Thanks for the nice write-up! I work on the WebCodecs team at Chrome. I'm glad to hear it's mostly working for you. If you (or anyone else) have specific requests for new knobs regarding "We may need more encoding options, like non-reference frames or SVC", please file issues at https://github.com/w3c/webcodecs/issues

jampekka
5 replies

I'm currently working with WebCodecs to get (the long awaited) frame-by-frame seeking and reverse playback working in the browser. And it even seems to work, although the VideoDecoder queuing logic gives me some grief here. Any tips on figuring out how many chunks have to be queued for a specific VideoFrame to pop out?

An aside: to work with video/container files, be sure to check out the libav.js project, which can be used to demux streams (WebCodecs doesn't do this) and even as a polyfill decoder for browsers without WebCodecs support!

https://github.com/Yahweasel/libav.js/

DaleCurtis
2 replies

The number of frames necessary is going to depend on the codec and bitstream parameters. If it's H264 or H265, there's some more discussion and links here: https://github.com/w3c/webcodecs/issues/698#issuecomment-161...

The optimizeForLatency parameter may also help in some cases: https://developer.mozilla.org/en-US/docs/Web/API/VideoDecode...
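
For reference, enabling it is just a flag on configure(); a minimal sketch (the codec string and callbacks are only placeholders):

```typescript
// Minimal sketch: enabling optimizeForLatency when configuring a VideoDecoder.
// The codec string and output handling are placeholders for your own setup.
const decoder = new VideoDecoder({
  output: (frame: VideoFrame) => {
    // Render or copy the frame, then release it promptly.
    frame.close();
  },
  error: (e: DOMException) => console.error(e),
});

decoder.configure({
  codec: "avc1.42E01E",      // example H.264 Baseline codec string
  optimizeForLatency: true,  // hint to emit frames with minimal internal queueing
});
```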

jampekka
1 replies

Thanks. I appreciate that making an API that can be implemented across the wide variety of decoding implementations is not an easy task.

But to be specific, this is a bit problematic even with I-frame-only videos and with optimizeForLatency enabled (which does make the queue shorter). I can of course .flush() to get the frames out, but this is too slow for smooth playback.

I think I could just keep pushing chunks until I see the frame I want come out, but it would have to be done in an async "busy loop", which feels a bit nasty. I think this is also done in the "official" examples, though.

Something like "enqueue" event (similarly to dequeue) that more chunks after last .decode() are needed to saturate the decoder would allow for a clean implementation. Don't know if this is possible with all backends though.

DaleCurtis
0 replies

Often Chrome doesn't know when more frames are needed either, so it's not something we could add an API for unfortunately.

Yes, just feeding inputs 1 by 1 for each dequeue event until you get the number of outputs you want in your steady state is the best way. It minimizes memory usage. I'll see about updating the MDN documentation to state this better.
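
In code, that pattern looks roughly like this (a sketch: the chunk source and the target queue depth are assumptions to tune until you see outputs at your desired steady state):

```typescript
// Sketch of the "one input per dequeue event" pattern. nextChunk() is an
// assumed helper returning the next EncodedVideoChunk from your demuxer,
// or null at end of stream; the target depth is illustrative.
declare function nextChunk(): EncodedVideoChunk | null;

function pump(decoder: VideoDecoder, targetDepth: number) {
  // Keep the decoder topped up to targetDepth undecoded chunks.
  while (decoder.decodeQueueSize < targetDepth) {
    const chunk = nextChunk();
    if (chunk === null) return; // end of stream; flush() elsewhere if needed
    decoder.decode(chunk);
  }
}

function startFeeding(decoder: VideoDecoder, targetDepth = 4) {
  // Each dequeue event means the decoder consumed an input, so feed it more.
  decoder.addEventListener("dequeue", () => pump(decoder, targetDepth));
  pump(decoder, targetDepth); // prime the queue
}
```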

Rodeoclash
1 replies

Wow, great to see some work in this space. I've been wanting to do reverse playback, frame-accurate seeking, and step-by-step forward and back rendering in the browser for esports game analysis. The regular video tag gets you somewhat of the way there, but navigating frame by frame will sometimes jump an extra frame. Likewise, trying to stop at an exact point will often land 1 or 2 frames off from where you should be. Firefox is much worse: when pausing at a time, you can be ±12 frames from where you should be.

I must find some time to dig into this, thanks for sharing it.

jampekka
0 replies

I have it working with WebCodecs, but currently only for I-frame-only videos, and all the decoded frames are read into memory. Not impossible to lift these restrictions, but the current WebCodecs API will likely make it a bit brittle (and/or janky). For my current case this is not a big problem, so I haven't fought with it too much.

Figuring out libav.js demuxing may be a bit of a challenge, even though the API is quite nice as traditional AV APIs go. I'll put out my small wrapper for these in a few days.

Edit: to be clear, I don't have anything to do with libav.js other than happening to find it and using it to scratch my itch. Most demuxing examples for WebCodecs use mp4box.js, which really makes one a bit uncomfortably intimate with the guts of the MP4 format.

vlovich123
4 replies

There are a few that would be neat:

* maybe possible already, but it’s not immediately clear how to change the bitrate of the encoder dynamically when doing VBR/CBR (seems like you can only do it with per-frame quantization params which isn’t very friendly)

* being able to specify the reference frame to use for encoding p frames

* being able to generate slices efficiently / display them easily. For example, Oculus Link encodes 1/n of the video in parallel encoders and decodes similarly. This way your encoding time only contributes 1/n frame encode/decode worth of latency because the rest is amortized with tx+decode of other slices. I suspect the biggest requirement here is to be able to cheaply and easily get N VideoFrames OR be able to cheaply split a VideoFrame into horizontal or vertical slices.

DaleCurtis
3 replies

* Hmm, what kind of scheme are you thinking beyond per frame QP? Does an abstraction on top of QP work for the case you have in mind?

* Reference frame control seems to be https://github.com/w3c/webcodecs/issues/285, there's some interest in this for 2024, so I'd expect progress here.

* Does splitting frames in WebGPU/WebGL work for the use case here? I'm not sure we could do anything internally (we're at the mercy of hardware decode implementations) without implementing such a shader.

vlovich123
2 replies

> what kind of scheme are you thinking beyond per frame QP

Ideally I'd like to be able to set the CBR / VBR bitrate instead of some vague QP parameter that I manually have to profile to figure out how it corresponds to a bitrate for a given encoder. Of course, maybe encoders don't actually support this? I can't recall. It's been a while.

> Does splitting frames in WebGPU/WebGL work for the use case here? I'm not sure we could do anything internally (we're at the mercy of hardware decode implementations) without implementing such a shader.

I don't think you need a shader. We did it at Oculus Link with existing HW encoders and it worked fine (at least for AMD and NVidia - not 100% sure about Intel's capabilities). It did require some bitmunging to muck with the NVidia H264 bitstream to make the parallel QCOM decoders happy with slices coming from a single encoder session* but it wasn't that significant a problem.

For video streaming, supporting a standard for webcams to deliver slices with timestamped information about the rolling shutter (plus maybe IMU data for mobile use cases) would help create a market for premium low-latency webcams. You'd need to figure out how to implement just-in-time rolling shutter corrections on the display side to mitigate the downsides of rolling shutter, but the extra IMU information would be very useful (many mobile camera display packages support this functionality). VR displays often have rolling shutter, so a rolling shutter webcam + display together would really make it possible to do "just in time" corrections for where pixels end up, to adjust for latency. I'm not sure how much you'd get out of that, but my hunch is that if you knock out all the details you should be able to shave off nearly a frame of latency glass to glass.

Speaking of adjustments, extracting motion vectors from the video is also useful, at least for VR, so that you can give the compositor the relevant information to apply last-minute corrections for that "locked to your motion" feeling (counteracts motion sickness).

On a related note, with HW GPU encoders, it would be nice to have the webcam frame sent from the webcam directly to the GPU instead of round-tripping into a CPU buffer that you then either transport to the GPU or encode on the CPU - this should save a few ms of latency. Think NVidia's Direct standards but extended so that the GPU can grab the frame from the webcam, encode & maybe even send it out over Ethernet directly (the Ethernet part would be particularly valuable for tech like Stadia / GeForce now). I know the HW standards for that don't actually exist yet, but it might be interesting to explore with NVidia, AMD, and Intel what HW acceleration of that data path might look like.

* NVidia's encoder supports slices directly and has an artificial limit on the number of encoder sessions on consumer drivers (they raised it in the past few years but IIRC it's still anemic). That however means that the generated slices have some incorrect parameters in the bitstream if you want to decode them independently. So you have to muck with the bitstream in a trivial way so that the decoders see independent, valid H264 bitstreams they can decode. On AMD you don't have a limit on the number of encoder sessions.

DaleCurtis
1 replies

> Ideally I'd like to be able to set the CBR / VBR bitrate

What's wrong with the existing VBR/CBR modes? https://developer.mozilla.org/en-US/docs/Web/API/VideoEncode...
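
Roughly, using those modes looks like this (a sketch; the codec, dimensions, and output callback are placeholders, and reconfiguring by calling configure() again is my reading of the spec, so verify it for your target browsers):

```typescript
// Sketch: configuring a VideoEncoder for CBR, then calling configure() again
// to change the target bitrate mid-stream.
const encoder = new VideoEncoder({
  output: (chunk: EncodedVideoChunk) => {
    // Packetize and send the chunk.
  },
  error: (e: DOMException) => console.error(e),
});

encoder.configure({
  codec: "avc1.42E01E",
  width: 1280,
  height: 720,
  bitrate: 2_000_000,       // 2 Mbps target
  bitrateMode: "constant",  // or "variable" for VBR
  latencyMode: "realtime",
});

// Later, when congestion control wants a lower target:
encoder.configure({
  codec: "avc1.42E01E",
  width: 1280,
  height: 720,
  bitrate: 1_000_000,
  bitrateMode: "constant",
  latencyMode: "realtime",
});
```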

> I don't think you need a shader...

Ah I see what you mean. It'd probably be hard for us to standardize this in a way that worked across platforms, which likely precludes us from doing anything quickly here. The stuff easiest to standardize for WebCodecs is stuff that's already standardized as part of the relevant codec spec (e.g., AVC, AV1, etc.) and well supported on a significant range of hardware.

> ... instead of round-tripping into a CPU buffer

We're working on optimizing this in 2024, we do avoid CPU buffers in some cases, but not as many as we could.

vlovich123
0 replies

> It'd probably be hard for us to standardize this in a way that worked across platforms, which likely precludes us from doing anything quickly here. The stuff easiest to standardize for WebCodecs is stuff that's already standardized as part of the relevant codec spec (e.g., AVC, AV1, etc.) and well supported on a significant range of hardware.

As I said, Oculus Link worked with off-the-shelf encoders. Only the Nvidia one needed some special work, and even that's not needed anymore since they raised the number of encoders (and the amount of work was really trivial - just adjusting some header information in the H.264 framing). I think all you really need is the ability to either slice a VideoFrame into strips at zero cost and have the user feed them into separate encoders, OR to request sliced encoding and have it implemented under the hood however (either multiple encoder sessions or using the Nvidia slice API if using nvenc). You can even make support for sliced encoding optional and implement it just for the backends where it's doable.
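
Concretely, the closest thing I can see today is cropping with visibleRect and feeding each strip to its own encoder, something like the sketch below (whether each crop is really zero-cost is up to the implementation, and the strip height has to respect chroma-subsampling alignment; the encoder array is assumed to be configured already):

```typescript
// Sketch: crop one VideoFrame into N horizontal strips via visibleRect and
// feed each strip to its own VideoEncoder.
function encodeInStrips(frame: VideoFrame, encoders: VideoEncoder[]) {
  const n = encoders.length;
  const stripHeight = frame.codedHeight / n; // assumes an even, aligned split
  for (let i = 0; i < n; i++) {
    const strip = new VideoFrame(frame, {
      visibleRect: {
        x: 0,
        y: i * stripHeight,
        width: frame.codedWidth,
        height: stripHeight,
      },
    });
    encoders[i].encode(strip); // encode() takes its own reference internally
    strip.close();
  }
  frame.close();
}
```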

kixelated
3 replies

Thanks for WebCodecs!

I'm still just trying to get A/V sync working properly because WebAudio makes things annoying. WebCodecs itself is great; I love the simplicity.

padenot
2 replies

https://blog.paul.cx/post/audio-video-synchronization-with-t... has some background, and https://github.com/w3c/webcodecs/blob/main/samples/lib/web_a... is part of a full example you can run that uses WebCodecs, Web Audio, an AudioWorklet, and SharedArrayBuffer, and does A/V sync.

If it doesn't answer your question let me know because I wrote both (and part of the web audio spec, and part of the webcodecs spec).
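
Not the sample itself, just a heavily stripped-down sketch of its shape (mono, naive indices, no underrun handling): a decode thread writes samples into a SharedArrayBuffer ring buffer and an AudioWorkletProcessor drains it.

```typescript
// ---- in the AudioWorklet module (loaded via audioWorklet.addModule) ----
class RingBufferPlayer extends AudioWorkletProcessor {
  private samples?: Float32Array; // shared audio data
  private state?: Int32Array;     // [0] = read index, [1] = write index

  constructor() {
    super();
    this.port.onmessage = (e: MessageEvent) => {
      this.samples = new Float32Array(e.data.samplesSAB);
      this.state = new Int32Array(e.data.stateSAB);
    };
  }

  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const out = outputs[0][0];
    if (!this.samples || !this.state) return true; // not wired up yet
    let read = Atomics.load(this.state, 0);
    const write = Atomics.load(this.state, 1);
    for (let i = 0; i < out.length && read !== write; i++) {
      out[i] = this.samples[read];
      read = (read + 1) % this.samples.length;
    }
    Atomics.store(this.state, 0, read);
    return true;
  }
}
registerProcessor("ring-buffer-player", RingBufferPlayer);

// ---- on the main thread ----
// const ctx = new AudioContext();
// await ctx.audioWorklet.addModule("player.js");
// const node = new AudioWorkletNode(ctx, "ring-buffer-player");
// node.port.postMessage({ samplesSAB, stateSAB }); // SharedArrayBuffers
// node.connect(ctx.destination);
// The decode worker copies AudioData into samplesSAB (AudioData.copyTo) and
// advances the write index with Atomics; A/V sync then compares the audio
// clock against VideoFrame timestamps.
```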

kixelated
1 replies

I'm using AudioWorklet and SharedArrayBuffer. Here's my code: https://github.com/kixelated/moq-js/tree/main/lib/playback/w...

It's just a lot of work to get everything right. It's kind of working, but I removed synchronization because the signaling between the WebWorker and AudioWorklet got too convoluted. It all makes sense; I just wish there was an easier way to emit audio.

While you're here, how difficult would it be to implement echo cancellation? The current demo is uni-directional but we'll need to make it bi-directional for conferencing.

padenot
0 replies

Just use getUserMedia as usual and it will just work, nothing special to do.
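
i.e. something like this (standard constraints; echo cancellation is usually on by default for audio capture anyway):

```typescript
// Standard capture constraints; the browser's echo canceller uses local audio
// playout as its reference signal.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
});
// Hand stream.getAudioTracks()[0] to your capture/encode pipeline as usual.
```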

hokkos
2 replies

I use the WebCodecs API with VideoDecoder for a very specific use case: getting data arrays that benefit from the great compression of video codecs, since the data has temporal coherency. Demo here: https://energygraph.info/d/f487b4fd-45ad-4f94-8e7e-ea32fc280...

And I have some issues with the copyTo method of VideoFrame: on mobile (Pixel 7 Pro) it is unreliable and outputs an all-zero Uint8Array beyond 20 frames, to the point that I am forced to render each frame to an OffscreenCanvas. Also, the many frame output formats around RGBA/R8, with limited range 16-235 or full range 0-255, make it hard to use in my convoluted way.
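
For context, the read-back path that misbehaves for me is essentially this (trimmed sketch):

```typescript
// Copy a decoded frame's raw bytes and inspect the pixel format before
// interpreting them, since the format and range depend on the decoder.
async function frameToBytes(frame: VideoFrame): Promise<Uint8Array> {
  const bytes = new Uint8Array(frame.allocationSize());
  const layout = await frame.copyTo(bytes); // per-plane { offset, stride }
  console.log(frame.format, layout);        // e.g. "NV12" or "I420"
  frame.close();
  return bytes;
}
```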

DaleCurtis
1 replies

Please file an issue at https://crbug.com/new with the details and we can take a look. Are you rendering frames in order?

Android may have some quirks due to legacy MediaCodec restrictions around how we more commonly need frames for video elements: frames only work in sequential order, since they must be released to an output texture to access them (and releasing invalidates prior frames, to speed up very old MediaCodecs).

hokkos
0 replies

I will try to do a simple reproduction, and yes, the frames are decoded in order.

krebby
0 replies

Encoding alpha, please! https://github.com/w3c/webcodecs/issues/672

Thanks for the great work on WebCodecs!

tamimio
14 replies

I got excited for a second that something new would replace WebRTC for media/video... it is not. Around two years ago, in a project where I wanted to transfer a raw 4K stream with low latency (sub-50ms) over a cellular network, WebRTC performed so poorly it was a no-go. I ended up making my own algorithm that enables FEC (Forward Error Correction) based on IETF payload scheme 20, piped it through UDP with GStreamer, and managed to make it work, but obviously it wasn't in the browser.

englishm
4 replies

That sounds like it was an interesting project! Jean-Baptiste Kempf also presented something at Demuxed last week using FEC and QUIC datagrams to get similarly low latency delivery over WebTransport into a browser. There's also a draft[1] for adding FEC to QUIC itself so it's quite possible Media over QUIC (MoQ) could benefit from this approach as well.

I'm not sure why you say "it is not." We have a working MoQ demo running already on https://quic.video that includes a Rust-based relay server and a TypeScript player. The BBB demo video is being ingested using a Rust-based CLI tool I wrote called moq-pub which takes fragmented MP4 input from ffmpeg and sends it to a MoQ relay.

You can also "go live" from your own browser if you'd like to test the latency and quality that way, too.

[1]: https://www.ietf.org/archive/id/draft-michel-quic-fec-01.htm...

tamimio
3 replies

Thanks, it was a big project, and streaming 4K in realtime from a flying drone was one of the challenging parts! I have a write-up about it; nothing too technical, but there are some videos demonstrating some differences (1).

> I'm not sure why you say "it is not."

Pardon my ignorance; it looked like it isn't replacing WebRTC entirely yet, but I'm glad I was wrong. I've never tried anything QUIC-based for media, so I would love to try the MoQ tool you made, and I like the fact that it's Rust-based too, as the one I did was written in Rust. I will give it a test for sure; it's been two years and I wasn't following any updates, so hopefully there's an improvement compared to what it was back then.

> takes fragmented MP4 input from ffmpeg and sends it to a MoQ relay.

Just a quick question: is ffmpeg a "requirement" per se for that CLI tool? As I remember, I had to ditch ffmpeg in favor of GStreamer since the former was eating up a lot of resources compared to GStreamer, and that was a crucial issue since the server was basically an SBC on a flying drone.

(1) https://tamim.io/professional_projects/nerds-heavy-lift-dron...

englishm
2 replies

The current moq-pub implementation only requires valid fMP4 (a la CMAF) to be provided over stdin. I haven't tested, but I imagine you can do the same with gstreamer.

Separately, I've been working on a wrapper library for moq-rs that I've been calling 'libmoq'. The intent there is to provide a C FFI that non-Rust code can link against. The first integration target for libmoq is ffmpeg. (I have a few bugs to work out before I clean up and shout about that code, but it does mostly work already.)

I gave a presentation about some of this work last week at Demuxed, but the VoDs probably won't be available on YouTube until Decemberish.

Also, I understand the gstreamer project has better support for Rust so I'll be looking at that soon, too.

tamimio
0 replies

Appreciated! Where do I find these presentations, if you don't mind me asking?

oplav
0 replies

I believe gstreamer has support for fMP4: https://gstreamer.freedesktop.org/documentation/fmp4/index.h...

I stumbled upon a MoQ video by kixelated a couple of months ago and have been meaning to give the above a try, but I haven't gotten around to it yet, so I'm not sure if it will do the trick.

Sean-Der
3 replies

That's a cool problem!

I could see how WebRTC out of the box would perform poorly. It wants to NACK + delay to give a good 'conferencing experience'.

I bet with FlexFEC [0] and playout-delay [1] you would get the behavior you were looking for. The sender would have to be custom, but the receivers (browsers) should just work! If you are interested in giving it a shot again would love to help :)

[0] https://datatracker.ietf.org/doc/html/rfc8627

[1] https://webrtc.googlesource.com/src/+/main/docs/native-code/...

tamimio
2 replies

It was an interesting problem indeed! I had a write-up about it (and the whole project) in a link in the comment above, which might give more context.

> I bet with FlexFEC [0]

The previous draft (1) of this was my basis when I did the FEC sender. I didn't manage to stream 4K into the browser though; my client was OBS with GStreamer, as it was far more performant than a browser in my tests. Do you have any demo I can try to stream into the browser? That would be a really major improvement! And I appreciate the help, I would definitely give it another shot!

(1) https://datatracker.ietf.org/doc/html/draft-ietf-payload-fle...

Sean-Der
1 replies

I would love to! Join https://pion.ly/slack and I am Sean-Der

If you prefer Discord, there is also a dedicated space for 'real-time broadcasting': https://discord.gg/DV4ufzvJ4T

tamimio
0 replies

Thank you!

imtringued
2 replies

Well, a raw 4K stream has a bit rate of 11.9 Gbps. I would be surprised if that worked over a cellular network at all.

vlovich123
0 replies

At 60fps, but yeah, I think they mean they still passed the raw 4K stream through a lossy codec before putting it over cellular.

tamimio
0 replies

I think it was around 3 Gbps or even less, if I remember correctly; it wasn't 60fps, and I can't remember the color depth either. The cellular network was an SA (standalone) mmWave private network, not a commercial one, so it did work in our project eventually.

modeless
1 replies

Yeah. From what I can see, WebRTC is basically a ready-made implementation of a video calling app, bolted onto the side of the browser, that you can slap some UI on top of. If you want to do anything that isn't a straight Zoom-style video calling app (or Stadia, RIP), you'll immediately run into a wall of five-year-old known but unfixed bugs, or worse.

To be fair, video calling is an incredibly important use case and I'm glad it can be done in the browser. I am grateful to the WebRTC team every time I can join a Zoom call without installing a bunch of extra crap on my machine. I just hope that WebRTC really can someday be replaced by composing multiple APIs that are much smaller in scope and more flexible, to allow for use cases other than essentially Zoom and Stadia.

I guess I'm just repeating what the article said, but it's so right that it's worth repeating.

est
0 replies

> WebRTC is basically a ready made implementation of a video calling app, bolted on the side of the browser

Yeah, Google bought GlobalIPSound and dumped the source code on the W3C as a standard.

gvkhna
12 replies

Unfortunately, Safari is still a major holdout on WebTransport with no clear update, considering all other browsers have now supported it in GA for 3+ years.

doctorpangloss
5 replies

Google Chrome never implemented trailers, so no gRPC. Every browser is guilty.

> I spent almost two years building/optimizing a partial WebRTC stack @ Twitch using pion. Our use-case was quite custom and we ultimately scrapped it, but your mileage may vary.

So many words about protocols. Protocols aren't hard or interesting. QA is hard. libwebrtc has an insurmountable lead on QA. None of these ad-hoc things, nor WebRTC implementations like Pion, will ever catch up, let alone be deployed in Mobile Safari.

fidotron
2 replies

I agree with the general thrust of this, but libwebrtc has a tendency to have Google convenient defaults which are distinctly non obvious, such as the quality vs framerate tradeoffs being tied to what Hangouts needs to implement as opposed to useful general purpose hooks. (I hope they have updated that API since I last looked). Once you know how to poke it to make it do what you want, which tends to require reading the C++ source, it's definitely got a massive head start, especially around areas like low level hardware integration. Even bolting on ML inference and so on is not hard. The huge point though is everyone knows everyone has to talk to libwebrtc at some point, so it is the de facto implementation of SDP etc.

Curiously I was working on a webrtc project for about 18 months which also hit the wall, however, since then I have learned of several high profile data only libwebrtc deployments, which really just use it in the way classic libjingle was intended to be used, P2P, NAT punching etc. I'd go so far as to say if you don't have a P2P aspect to what you're doing with libwebrtc you're missing the point.

The big picture though is there seems to be a general denial of the fact that web style CDNs and multidirectional media streaming graphs are two totally different beasts.

kixelated
1 replies

Yeah, if you're using WebRTC only for data channels, then 100% you should switch to WebTransport with all due haste. Once Safari adds support of course.

fidotron
0 replies

This is what I profoundly disagree about, and I am referring to native code, not in browsers.

The whole P2P, STUN/TURN/ICE integration has value far beyond just media streaming, especially for large amounts of real time data where the central node doesn't want to be handling the data itself.

There are definite oddities to it but having the P2P setup and negotiation working (and QAed, as pointed out) is huge.

johncolanduoni
0 replies

I think they'll catch up because libwebrtc is huge and can't take advantage of most of the existing browser protocol stack. WebTransport is a pretty thin layer over HTTP3, which already gets used much more often than WebRTC's SCTP-over-DTLS ever will. Not to mention the fact that it takes like two orders of magnitude less code to implement WebTransport than the whole WebRTC stack.

Sean-Der
0 replies

I am not at Twitch anymore, so I can't speak to the state of things today.

WebRTC was deployed and is in use. For twitch.tv you have Guest Star [0]

You can use that same back-end via IVS Stages [1]. I always call it 'white-label Twitch', but it is more nuanced than that! Some pretty big/interesting customers were using it.

[0] https://help.twitch.tv/s/article/guest-star?language=en_US

[1] https://docs.aws.amazon.com/ivs/latest/RealTimeUserGuide/wha...

englishm
2 replies

I don't know the timeline, but Apple has committed [1] to adding WebTransport support to WebKit.

[1]: https://github.com/WebKit/standards-positions/issues/18#issu...

kixelated
0 replies

Eric Kinnear (linked post; Apple) is the author of the HTTP/2 fallback for WebTransport, so it's safe to say that WebTransport will be available in WebKit at some point.

gvkhna
0 replies

Great news! It looks like the Google gRPC team is probably waiting for Safari WebTransport support before implementing gRPC over the web on top of WebTransport. Even though Chrome has had it for two years, they have kept saying they'll wait until it becomes GA.

jeroenhd
0 replies

And Firefox hasn't implemented WebCodecs yet. At this moment, the WebTransport+WebCodecs solution is a no-go unless you only intend to serve Chromium.

Fortunately, Mozilla is working on WebCodecs and Apple is working on WebTransport, so these problems will probably disappear in the future.

gs17
0 replies

Yeah, I have a project where WebTransport would be a huge improvement, but not being able to support Safari is a dealbreaker.

fenesiistvan
6 replies

I don't get this. After the initial ICE negotiation you can just send raw RTP packets (encrypted with SRTP/DTLS). There is no need for any ACK packets. FEC can be done at the codec level. What am I missing?

kixelated
3 replies

Are you referencing this line?

> 2x the packets, because libsctp immediately ACKs every “datagram”.

The section is about data channels, which uses SCTP and is ACK-based. Yes, you can use RTP with NACK and/or FEC with the media stack, but not with the data stack.

the8472
2 replies

TCP can coalesce acks for multiple packets, can't SCTP do the same?

kixelated
0 replies

The protocol can do it, but libsctp (used by browsers) was not coalescing ACKs. I'm not sure if it has been fixed yet.

Sean-Der
0 replies

SCTP can (and does); a SACK [0] isn't needed for each DATA chunk.

[0] https://datatracker.ietf.org/doc/html/rfc4960#section-3.3.4

pthatcherg
1 replies

From a server or a native client, you can send whatever RTP packets you want, but you cannot send whatever RTP packets you want from a web client, and you cannot have access to the RTP packets from a web client and do whatever you want with them, at least not very easily. We are working on an extension to WebRTC called RtpTransport that would allow for just that, but it's in early stages of design and standardization.

Sean-Der
0 replies

That's so exciting! I had no idea you were working on this :)

Here [0] is a link for anyone that is looking for it.

[0] https://github.com/w3c/webrtc-rtptransport/blob/main/explain...

seydor
3 replies

> Back then, the web was a very different place. Flash was the only way to do live media and it was a mess.

Not sure why people are saying that. WebRTC is far harder to make work. Peer-to-peer is a CPU black-hole-level hog.

kixelated
2 replies

I maintained the Flash video player at Twitch until I couldn't take it any longer and created an HTML5 player. Flash was a mess. :)

seydor
1 replies

Is HTML5 video part of WebRTC?

yalok
0 replies

no, they are not related

sansseriff
3 replies

What's a good place to learn about all the networking jargon in this article?

Also, a nitpicky point: as a community, it would be nice to stop thinking of links to Wikipedia pages, MDN docs, or GitHub issues as useful or pertinent forms of citations or footnotes. If I'm immersed in an article and come across a term I don't know, like 'SCTP', do you think a link to some dense standards-proposal webpage is appropriate in this context for providing the background info I need?

Of course academia is guilty of this too. Canonical citations don't tell the reader much else beyond 'I know what I'm talking about' and 'if you want to know more spend 30 minutes reading this other article'

But since we're web based here I think we can do better. Tooltips that expand into a helpful paragraph about e.g. SCTP would be a start.

englishm
0 replies

Here are a couple resources which may be helpful:

- https://www.mux.com/video-glossary

- https://howvideo.works/

Also, if you do want to take the time to read a longer and denser article, but come away with an understanding of much of the breadth of modern streaming tech, there's a really great survey paper available in pre-print here:

https://doi.org/10.48550/arXiv.2310.03256

Sean-Der
0 replies

For the WebRTC jargon check out https://webrtcforthecurious.com/

If that still doesn’t cover enough I would love to hear! Always trying to make it better.

IKantRead
0 replies

I really like High Performance Browser Networking by Ilya Grigorik. It's published by O'Reilly but is also free online [0]. What's particularly great about it is, unlike most other networking books, it focuses on the issue from the browser/web developer perspective which is particularly helpful for WebRTC and generally applicable to daily web dev work.

0. https://hpbn.co/

moffkalast
2 replies

> The core issue is that WebRTC is not a protocol; it’s a monolith.

> The WebRTC media stack is designed for conferencing and does an amazing job at it. The problems start when you try to use it for anything else.

The main problem with WebRTC is that it's designed to be overcomplicated and garbage on purpose: a way to leverage UDP for video streaming without any possibility of anyone ever sending an actual UDP packet... so that it's not possible to turn every mildly popular website into a god-tier DDoS botnet.

WebRTC is what we get when we can't have nice things.

jallmann
0 replies

When WebRTC was being developed, a lot of the underlying protocols were derived from existing standards (RTP, SDP, etc), ostensibly with the intent of making it easier to bridge to legacy video-conference systems that may have been running SIP or similar.

Beyond that, there is a ton of stuff to accommodate different device capabilities, network conditions, realities of the Internet, etc. WebRTC has tried to be a "nice thing" for developers from an API perspective for the use-case of streaming realtime video to the browser, but of course all those knobs under the hood make it seem like a hodgepodge, rightly or not.

Would a clean-slate design with more tightly scoped goals fare better? Probably. But the underlying complexity needs to be handled somewhere and there are always trade-offs between control and ease-of-use.

englishm
0 replies

I very much disagree with the characterization of WebRTC being "designed to be overcomplicated and garbage on purpose" but I do think your point about needing to not open the door to DDoS botnet capabilities is something worth highlighting.

There are a number of very challenging constraints placed on the design of browser APIs and this is one of them that is often grappled with when trying to expose more powerful networking capabilities.

Another that's particularly challenging to deal with in the media space is the need to avoid adding too much additional fingerprinting surface area.

For each, the appropriate balance can be very difficult to strike and often requires a fair bit of creativity that can look like "complication" without the context of the constraints being solved for.

znpy
1 replies

> The best and worst part about WebRTC is that it supports peer-to-peer.

I hope P2P stays around in browsers.

englishm
0 replies

> I hope P2P stays around in browsers.

I do, too.

There was a W3C draft [1] ~pthatcherg was working on for P2P support for QUIC in the browser that could have maybe become a path to WebRTC using QUIC as a transport, but I think it may have been dropped somewhere along the line to the current WebTransport spec and implementations. (If Peter sees this, I'd love to learn more about how that transpired and what the current status of those ideas might be.)

A more recent IETF individual draft [2] defines a way to do ICE/STUN-like address discovery using an extension to QUIC itself, so maybe that discussion indicates some revived interest in P2P use cases.

[1]: https://w3c.github.io/p2p-webtransport/ [2]: https://datatracker.ietf.org/doc/html/draft-seemann-quic-add...

mehagar
1 replies

An additional benefit of WebTransport over WebRTC DataChannels is that the WebTransport API is supported in Web Workers, meaning you can send and receive data off the main thread for better performance.

kixelated
0 replies

Absolutely!

I'm doing that in my implementation: the main thread immediately transfers each incoming QUIC stream to a WebWorker, which then reads/decodes the container/codec and renders via OffscreenCanvas.

I didn't realize that DataChannels were main thread only. That's good to know!
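
The hand-off is roughly this (a sketch; the URL and worker script names are placeholders):

```typescript
// Accept incoming WebTransport streams on the main thread and hand each one
// to a worker. ReadableStream is transferable, so the worker can read,
// demux/decode with WebCodecs, and render to an OffscreenCanvas.
async function run() {
  const transport = new WebTransport("https://relay.example.com/session");
  await transport.ready;

  const worker = new Worker("media-worker.js", { type: "module" });
  const incoming = transport.incomingUnidirectionalStreams.getReader();

  for (;;) {
    const { value: stream, done } = await incoming.read();
    if (done) break;
    // Transfer ownership of the stream to the worker.
    worker.postMessage({ stream }, [stream]);
  }
}
```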

Fischgericht
1 replies

A wonderful write-up, thank you very much.

However, it needs to be said that with currently available technologies, there is no need to have a 5 seconds buffer.

Have a look at the amazing work of the OvenMedia team (not affiliated). Using their stack and Low-Latency HLS (LLHLS), I have been able to easily reach an end-to-end latency of <500ms between the camera on site and the end user viewing it in the browser, at 20 Mbit/s, using either SRT or RTMP for stream upload.

I understand that your huge buffers most likely come from expecting your streamers to be using RTMP over crappy links, which means you already need to buffer THEIR data. Twitch really should invest in supporting SRT; it's supported in OBS.

Anyway:

Once you have the stream in your backend, the technology to have sub-second latency live streaming using existing web standards is there.

https://airensoft.gitbook.io/ovenmediaengine/

But all of this being said: What you are doing there is looking amazing, so keep up the good work!

kixelated
0 replies

Glad you liked it!

It's really difficult to compare the latency of different protocols because it depends on the network conditions.

If you assume flawless connectivity, then real-time latency is trivial to achieve. Pipe frames over TCP like RTMP and bam, you've done it. It's almost meaningless to compare the best-case latency.

The important part is determining how a protocol behaves during congestion. LL-HLS doesn't do great in that regard; frankly it will perform worse than RTMP if that's our yardstick because of head-of-line blocking, large fragments, and the playlist in the hot path. Twitch uses a fork of HLS called LHLS which should have lower latency, but we were still seeing 3-5s in some parts of the world.

But yeah, P90 matters more than P10 when it comes to latency. One late frame ruins the broth. A real-time protocol needs a plan to avoid queues at all costs and that's just difficult with TCP.

mosfets
0 replies

This is not going to replace the P2P functionality of WebRTC, right?

keepamovin
0 replies

Twitch doesn’t need the same aggressive latency as Google Meet, but WebRTC is hard-coded to compromise on quality.

In general, it’s quite difficult to customize WebRTC outside of a few configurable modes. It’s a black box that you turn on, and if it works it works. And if it doesn’t work, then you have to deal with the pain that is forking libwebrtc… or just hope Google fixes it for you.

Solving this was indeed quite a challenge, but what we did was marry WebRTC data channels with video frames that were individually compressed rather than encoded as a video stream. This meant we could drop frames all day long with zero artefacts and seamlessly adapt to the changing bandwidth conditions you find in the real world.

The reason glitch-avoidance was so crucial was the same reason we wanted fine-grained control over the quality: the text and interface heavy use case would not permit frequent low quality frames.

Our ack-based protocol is simple and effective.

In fact, many companies use WebRTC data channels to avoid the WebRTC media stack (ex. Zoom).

So, in a sense, we use an approach similar to Zoom's. The post author says this didn't work for them because their datagrams were not efficiently chunked. Somehow, we don't worry about this, likely because we don't need to retransmit frames (which avoids head-of-line blocking) and our frames are compressed with JPEG, making them smaller by default.
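
For anyone curious, the channel itself is just a lossy, unordered data channel; roughly (signaling omitted, buffered threshold arbitrary):

```typescript
// Lossy, unordered data channel: late or lost frames are simply dropped
// instead of retransmitted, which avoids head-of-line blocking.
const pc = new RTCPeerConnection();
const channel = pc.createDataChannel("frames", {
  ordered: false,     // newer frames aren't held up behind older ones
  maxRetransmits: 0,  // never retransmit a lost frame
});
channel.binaryType = "arraybuffer";

function sendFrame(jpegBytes: ArrayBuffer) {
  // Skip the frame if the channel's queue is backing up.
  if (channel.bufferedAmount < 1_000_000) {
    channel.send(jpegBytes);
  }
}
```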

Our congestion control involves:

- an ack every N frames (each ACK covers N frames, which is more efficient than a per-frame ack)

- measuring frame-ack RTT over a short window and decreasing quality in 20% increments when alerts are tripped. As soon as RTT returns to normal we jump back to max quality.

- if quality hits its minimum, we increase Q, where Q means we only send every Qth frame
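
In rough TypeScript terms, that controller looks something like this (the threshold, window size, and quality floor here are illustrative, not our production values):

```typescript
// Illustrative sketch of the RTT-driven quality controller described above.
class QualityController {
  private quality = 1.0;  // 1.0 = max JPEG quality
  private everyNth = 1;   // send every Nth frame once quality bottoms out
  private rtts: number[] = [];

  onAck(rttMs: number) {
    this.rtts.push(rttMs);
    if (this.rtts.length > 10) this.rtts.shift(); // short sliding window
    const avg = this.rtts.reduce((a, b) => a + b, 0) / this.rtts.length;

    if (avg > 150) {
      // Alert tripped: step quality down ~20%; once at the floor, thin frames.
      this.quality = Math.max(0.2, this.quality * 0.8);
      if (this.quality <= 0.2) this.everyNth = Math.min(4, this.everyNth + 1);
    } else {
      // RTT back to normal: jump straight back to max quality.
      this.quality = 1.0;
      this.everyNth = 1;
    }
  }

  shouldSend(frameIndex: number): boolean {
    return frameIndex % this.everyNth === 0;
  }

  jpegQuality(): number {
    return this.quality;
  }
}
```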

Additionally we also maintain two channels: WebSocket and WebRTC, and we actually switch frames to the fastest one. It's usually WebRTC but sometimes it's WebSocket. This switching further eases congestion.

Overall, it's a pretty smoothly working machine, developed from scratch through thousands of iterations and experiments.

This system allows us to essentially side step most congestion.

We trade frame rate for latency, as staying in "sync" with real time is the most important metric for application usability and responsiveness. This is because in resource constrained scenarios it's more important for people to feel that their actions are still having immediate effects, than it is to receive high frequency visual updates about those effects.

formula1
0 replies

Personally, I'm interested in seeing FlyWeb developed. Devices easily connected locally sounds good to me.

https://flyweb.github.io/

doubloon
0 replies

I made a small robot with realtime video, spent many hours on all these acronyms and WebRTC stuff, and wound up using simple jmuxer and Python websockify. Everything else was so complicated I could never figure it out. There is just so much jargon and layers on layers on layers.