
FFmpeg lands CLI multi-threading as its "most complex refactoring" in decades

brucethemoose2
45 replies
1d2h

Meanwhile, I've been enjoying threaded filter processing in VapourSynth for nearly a decade.

Not that this isn't great. It's fantastic. But TBH it's not really going to change my workflow of VapourSynth preprocessing + av1an encoding for "quality" video encodes.

dylan604
31 replies
1d2h

FFMPEG does so much more than just video encoding. I use ffmpeg all day every day, and only a fraction of the time do I actually make a video.

m3kw9
26 replies
1d2h

Like what do you do?

andoma
19 replies
1d2h

One can use it instead of cat to display text files. Easy syntax to remember.

  ffmpeg -v quiet -f data -i file.txt -map 0:0 -c text -f data -

whalesalad
9 replies
1d2h

I'm dying.

nerpderp82
8 replies
1d1h
whalesalad
2 replies
1d1h

I'm glad we've reached a point where there is quality parody content online for our industry.

Rebelgecko
0 replies
22h44m

Check out Krazam. I quote their Microservices video on a regular basis (https://youtu.be/y8OnoxKotPQ)

ElijahLynn
0 replies
23h18m

THIS!!! It was so refreshing!

danudey
2 replies
1d

"Do you know ffmpeg supports OCR? I haven't found the command yet, but it does support it."

This is probably 80% of my experience with ffmpeg, to be honest, but the other 20% is invaluable enough anyway.

vore
1 replies
17h33m

I thought this was a joke, but it really exists! https://ffmpeg.org/ffmpeg-filters.html#ocr
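For reference, a rough invocation might look like this (untested sketch; assumes an ffmpeg build with --enable-libtesseract, and the filenames are made up):

  # run the ocr filter on every frame and dump the recognized text (frame metadata) to a file
  ffmpeg -i input.mp4 -vf "ocr,metadata=mode=print:file=ocr.txt" -f null -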

maxlin
0 replies
12h36m

Lol what. I have a bot that processes ~50 videos a day, burning in translated whisper-generated subtitles. It also translates images using Tesseract, then overlays the text in place. I once thought of exporting frames as images to maybe do this for video too; it never even occurred to me that FFMPEG would have Tesseract support on top of everything else.

Later on, though, I realized the quality of Tesseract's OCR on arbitrary media is often quite bad. Google Translate's detection and replacement is so far ahead of my current image system that I'd probably just reuse that for my app somehow, either through the public API or browser emulation ...
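The burn-in step itself is essentially just ffmpeg's subtitles filter; a rough sketch with made-up filenames, assuming a build with libass:

  # hard-code the translated subtitles into the video, leaving the audio untouched
  ffmpeg -i clip.mp4 -vf "subtitles=translated.srt" -c:a copy out.mp4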

ElijahLynn
1 replies
23h19m

That was one of the funniest things I've seen in a while!!!! I had to stop drinking my decaf for fear of spitting it all over my computer I was laughing out loud so much!

(ps: and no, it's not Rick Astley/Never Gonna Give You Up)

nerpderp82
0 replies
20h40m

The artfully inserted, corrupted predicted frames were :chefskiss:

fransje26
2 replies
23h16m

I use dd for that.

    dd if=./file.txt
Can you also format your drive with ffmpeg? I'm looking for a more versatile dd replacement..

jasomill
1 replies
22h24m

It can't create partition tables or filesystems, so no, but

  ffmpeg -f data -i /dev/zero -map 0:0 -c copy -f data - > /dev/sda
is roughly equivalent to

  dd status=progress if=/dev/zero of=/dev/sda

dylan604
0 replies
21h51m

you might need a -disposition default type option; otherwise, it introduces some abnormal behavior

yoz
1 replies
1d1h

can I use ffmpeg to embed a gif in a Hacker News comment, because I want that so much right now

jasomill
0 replies
22h12m

No, but you can use ffmpeg to create a GIF from ASCII art embedded in a Hacker News comment:

  $ ffmpeg -v quiet -codecs | egrep -i 'gif|ascii'
   D.V.L. ansi                 ASCII/ANSI art
   DEV..S gif                  CompuServe GIF (Graphics Interchange Format)
(“D” and “E” in the first field indicate support for decoding and encoding)
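Converting the art itself ought to be roughly this (untested sketch; the tty demuxer feeds the ansi decoder, and the filename is made up):

  # read ASCII/ANSI art via the tty demuxer / ansi decoder and encode it as a GIF
  ffmpeg -f tty -i art.ans art.gif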

jasomill
0 replies
22h37m

That doesn't work[1], but

  ffmpeg -v quiet -f data -i file.txt -map 0:0 -c copy -f data -
does.

[1] “Encoder 'text' specified, but only '-codec copy' supported for data streams”

faitswulff
0 replies
1d

Ah now I can replace all my useless uses of cat with ffmpeg

dkjaudyeqooe
0 replies
1d1h

I bet ffmpeg special cases that combination of flags and calls cat.

DonHopkins
0 replies
1d2h

Good thing it's now multi-threaded so it can process all those command line arguments in parallel!

starkparker
1 replies
1d1h

I've used it for video and audio concatenation of laserdisc game segments, transcoding audio clips for gamedev, programmatically generating GIFs of automatically generated video clips from tests in a CI pipeline, ripping songs and audio clips from YouTube videos to ogg/mp3, creating GIFs from burst-shot and time-lapse photography (and decimating them), excerpting clips from a video without re-encoding, and compressing or transforming audio on remote servers where VLC wasn't and couldn't be installed.
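For the "excerpting clips without re-encoding" case, the stream-copy incantation is roughly this (timestamps and filenames made up; cut points snap to keyframes):

  # seek to 1:00 and copy 60 seconds of all streams without re-encoding
  ffmpeg -ss 00:01:00 -i input.mp4 -t 60 -c copy clip.mp4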

bbkane
0 replies
23h59m

Sounds like you already have a process for most of this, but I found https://github.com/mifi/editly to be incredibly helpful to run ffmpeg and make my little time lapse video. Could be useful for others

whalesalad
0 replies
1d1h

you can make GIFs with it
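The two-pass palette trick gives much nicer results than the defaults; roughly (frame rate, size, and filenames made up):

  # build an optimized 256-color palette and apply it in a single filtergraph
  ffmpeg -i input.mp4 -vf "fps=12,scale=480:-1:flags=lanczos,split[a][b];[a]palettegen[p];[b][p]paletteuse" output.gif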

polonbike
0 replies
1d2h

Besides video conversion/compression? Sound extraction or processing, image processing, video casting or streaming; basically anything related to image/multimedia formats.

dylan604
0 replies
1d1h

ffmpeg can produce an amazing amount of analysis

ThrowawayTestr
0 replies
1d1h

I use ffmpeg every time I download a YouTube video.

brucethemoose2
2 replies
1d

Vapoursynth can be used for image processing too (albeit more clumsily with variable-size input), and it's also a great way to hook into PyTorch.

Thaxll
1 replies
1d

Can it fix broken files?

brucethemoose2
0 replies
13h17m

It reads image sequences with ImageMagick or ffmpeg, but in-place repair is not its thing, no.

PreachSoup
0 replies
21h58m

Can you run doom on it?

naikrovek
6 replies
1d1h

I don't understand why you would want to piggyback on this story to say this.

Are people just itching for reasons to dive into show & tell, or to wax poetic about how they've had the problem solved for years? I really don't understand people at all, because I don't understand why they do this. And I'm sure I've done it, too.

tetris11
4 replies
1d1h

There is hype for FEAT. People who have achieved similar FEAT perk up their heads but say nothing.

Hype for FEAT is beyond sensibility. People with similar FEAT are bristled by this and wish that their projects received even a fraction of FEAT's hype.

I think it's normal.

naikrovek
3 replies
1d

not gonna define FEAT, then? ok.

tetris11
2 replies
1d

...in this case, multi-threading. In other cases; AI workflows that others commercialize, a new type system in a language that already exists in another, a new sampling algorithm that has already existed by another name for decades, a permaculture innovation that farmers have been using for aeons, the list goes on...

naikrovek
1 replies
23h52m

just say "feature".

language is for communicating. don't impede that communication by using unnecessary terms.

tetris11
0 replies
5h4m

I was trying to generalise a concept so I used a placeholder word. I feel that most people got that.

brucethemoose2
0 replies
1d

Not gonna lie, I think VapourSynth has been flying under the radar for far too long, and is an awesome largely unused alternative to ffmpeg filter chains in certain cases. I don't see any harm in piggybacking on an ffmpeg story to bring it up, especially if readers find it useful.

It's been threaded since its inception, so it seems somewhat topical.

aidenn0
3 replies
1d1h

I'm guessing from context that VapourSynth is a frame-server in the vein of avisynth? If so, does it run on Linux? Avisynth was the single biggest thing I missed when moving to Linux about 20 years ago.

[edit]

found the docs; it's available on Linux[1]. I'm definitely looking into it tonight because it can't be worse than writing ffmpeg CLI filtergraphs!

1: http://www.vapoursynth.com/doc/installation.html#linux-insta...

mappu
1 replies
14h6m

20 years ago, the best feature of avisynth was running random plugin dlls downloaded from doom9, none with source code and all running on an XP Administrator account.

The frameserver is one thing, but an ecosystem of (trustable! open source!) plugins is harder to replicate.

At least we don't need deinterlacers so badly any more, though.

brucethemoose2
0 replies
13h15m

They have pretty much all been ported, and much more! VapourSynth can do a lot more than AviSynth can.

brucethemoose2
0 replies
1d

Yep, and it's so much better than ffmpeg CLI that it's not even funny.

This is a pretty good (but not comprehensive) db of the filters: https://vsdb.top/

tetris11
0 replies
1d2h

I believe I have too, with GStreamer's pipeline framework for threading, but ffmpeg's syntax has stuck in my mind far longer than any of the elaborate setups I built with GStreamer. I'm excited for this development.

j1elo
0 replies
1d2h

Interesting! I'm among today's lucky 10,000 in learning for the first time about VapourSynth.

How come it only has 4 measly entries on HN, and none of them got any traction? I've posted a new entry, just to satisfy the curiosity of others.

motoboi
38 replies
1d2h

It's nuts to think that in the near future an LLM will be able to do that refactoring in seconds. All we need is a large enough context window.

mrintegrity
17 replies
1d2h

The parent post is getting downvoted to oblivion, but it seems a reasonable belief for someone who is not highly engaged with AI. I have only the vaguest understanding of how it works (and it's probably wrong), and to my layman's mind it also seems like a totally fair assumption, based on experience as a user and the constant flood of news. Please explain why the suggestion that a future AI / sufficiently advanced LLM could refactor a complex codebase is so preposterous.

not_the_fda
14 replies
1d2h

Because refactoring requires understanding, which LLMs completely lack.

ctoth
8 replies
1d1h

> Because refactoring requires understanding, which LLMs completely lack.

Interesting.

Here's a sample question I just asked the AI:

We are working on a React mapping application. I am going to give you two components from this application. Our goal is to extract functionality to deduplicate code

We can make a single base component, extract code to hooks, or any other strategy which you think is appropriate

Here is travel.tsx:

<snip>

Here is heatmap.tsx:

<snip>

And here is the response:

To deduplicate code and improve maintainability between the `TravelMap` and `Heatmap` components in your React mapping application, we can identify common functionalities and extract them into reusable hooks or components. Here's a strategy considering your code:

1. *Extract Common Hooks*: Both components use similar hooks for managing markers, popups, and map references. We can create custom hooks that encapsulate these functionalities.

2. *Create Shared Utility Functions*: Functions like `initialViewState`, which are identical in both components, can be moved to a separate utility file and imported where needed.

3. *Refactor Map Layers and Sources*: The way map layers and sources are handled is quite similar in both components. A custom hook or component to manage map layers and sources would be beneficial.

4. *Generalize Popup Handling*: The popup handling in both components can be abstracted into a custom hook or a higher-order component.

5. *Combine Common Styling*: Shared styles, such as `MAP_CONTAINER_STYLE`, can be moved to a common CSS module or a styled-component that can be reused.

6. *Refactor Event Handlers*: Functions like `onMapClick`, `onMouseMove`, and `onMouseLeave` have similar structures. These can be part of a custom hook or a utility function with configurable parameters.

7. *Legend and Live Region Components*: These components are used in a similar way in both `TravelMap` and `Heatmap`. Ensure they are designed to be reusable with different props.

8. *Refactor Map Initialization*: The way the map is initialized and configured is quite similar. This can be abstracted into a custom component or hook.

### Example Implementations:

#### 1. Custom Hook for Map Markers:

```typescript
// useMapMarkers.ts
import { useState, useCallback } from 'react';

export const useMapMarkers = (initialLat, initialLng, world) => {
  // Similar marker logic from your components
  // ...

  return {
    marker,
    onMarkerDrag,
    onMarkerDragEnd,
    updateMarkerPosition,
    // other marker related states and functions
  };
};
```

#### 2. Shared Utility Function:

```typescript
// utils/mapUtils.ts
export const initialViewState = (map, mapType) => {
  // Implementation
};
```

#### 3. Custom Hook for Map Layers and Sources:

```typescript
// useMapLayers.ts
import { useState, useEffect } from 'react';

export const useMapLayers = (props) => {
  // Logic to handle layers and sources
  // ...

  return {
    dataLayer,
    patternLayer,
    // other layer related states and functions
  };
};
```

#### 4. Generalized Popup Handling:

```typescript
// useMapPopup.ts
import { useState } from 'react';

export const useMapPopup = () => {
  // Popup logic from your components
  // ...

  return {
    popupState,
    openPopup,
    closePopup,
    setPopupData,
    // other popup related states and functions
  };
};
```

### Integration:

Once these hooks and utilities are created, integrate them into your `TravelMap` and `Heatmap` components. Replace the duplicated logic with calls to these shared resources. This will significantly reduce code duplication and make future maintenance easier.

### Testing and Refinement:

After refactoring, thoroughly test both components to ensure that all functionalities work as expected. Watch out for any edge cases or specific behaviors that might be affected by the refactoring.

For those suggestions, I might use five out of eight of them, and probably do one or two things differently. But you cannot, with a straight face, say the model did not understand. It clearly did. It suggested reasonable refactors. If being able to refactor means understanding, I guess we have understanding!

I could continue with this conversation, ask it to produce the full code for the hooks (I have in my custom prompt to provide outlines) and once the hooks are complete, ask it to rewrite the components using the shared code.

Have you ever used one of these models?

never_inline
7 replies
1d1h

Eliminating duplication and cleaning code is a different type of refactoring than supporting concurrency, which is much much harder.

Cleaning up code also follows some well established patterns, performance work is much less pattern-y.

Codebases like FFMPEG are one of a kind. I bet you need 10 or 100 times more understanding than for the React thing you mentioned above.

One day maybe AI can do it, but it probably won't be LLM. It would be something which can understand symbols and math.

ctoth
6 replies
1d1h

Ah, we're having some classic goalpost moving!

> Because refactoring requires understanding, which LLMs completely lack.

<demonstration that an LLM can refactor code>

> Cleaning up code also follows some well established patterns, performance work is much less pattern-y.

Just as writing shitty react apps follow patterns, low-level performance and concurrency work also follow patterns. See [0] for a sample.

> I bet you need 10 or 100 times more understanding

Okay, so a 10 or 100 times larger model? Sounds like something we'll have next year, and certainly within a decade.

> One day maybe AI can do it, but it probably won't be LLM. It would be something which can understand symbols and math.

You do understand that the reason some of the earlier GPTs had trouble with symbols and math was the tokenization scheme, completely separate from how they work in general, right?

[0]: C++ Concurrency in Action: Practical Multithreading 1st Edition https://www.amazon.com/C-Concurrency-Action-Practical-Multit...

kcbanner
4 replies
1d1h

> Because refactoring requires understanding, which LLMs completely lack.

It's obvious from context here that the refactoring that was mentioned was specifically around concurrency, not simply cleaning up code.

ctoth
3 replies
1d

So if I show you an LLM implementing concurrency, will you concede the point? Is this your true objection?

https://chat.openai.com/share/7c41f59a-c21c-4abd-876c-c95647...

malcolmgreaves
0 replies
22h8m

Hope you're looking for good-faith discussion here. I'll assume that you're looking for a response where someone has taken the time to read through your previous messages and also the linked ChatGPT interaction logs.

What you've shown is actually a great example of the what folks mean that LLMs lack any sort of understanding. They're fundamentally predict-the-next-token machines; they regurgitate and mix parts of their training data in order to satisfy the token prediction loss function they were trained with.

In the linked example you provided, *you* are the one that needs to provide the understanding. It's a rather lengthy back-and-forth to get that code into a somewhat usable state. Importantly, if you didn't tell it to fix things (sqlite connections over threads, etc.), it would have failed.

And while it's concurrent, it's using threads, so it's not going to be doing any work in parallel. The example you have mixes some IO and compute-bound looking operations.

So, if your goal was to refactor your original code to _actually be fast_, ChatGPT demonstrated it doesn't understand nearly enough to make that happen. This thread got started around correcting the misconception that an LLM would ever be able to possess enough knowledge to do genuinely valuable, complex refactoring and programming.

While I believe that LLMs can be good tools for a variety of usecases, they have to be used in short bursts. Since their output is fundamentally unreliable, someone always has to read -- then comprehend -- its output. Giving it too much context and then prompting it in such a way to align its next token prediction with a complex outcome is a highly variable and unstable process. If it outputs millions of tokens, how is someone going to actually review all of this?

In my experience using ChatGPT, GPT4, and a few other LLMs, I've found that it's pretty good at coming up with little bits to jog one's own thinking and problem solving. But doing an actual complex task with lots of nuance and semantics-to-be-understood outright? The technology is not quite there yet.

bigbillheck
0 replies
3h42m

Did you learn anything from that exercise? Are you a better programmer now for having seen that solution? Because if not, this seems like a great way for getting the fabled "one year of experience, twenty times"

arp242
0 replies
2h42m

That ... didn't even refactor the code. It just returned some generic Python concurrency methods which vaguely fit the posted code.

never_inline
0 replies
8h17m

Ok bro I am not the parent commenter who set the goalpost.

Let's see how your smooth talking LLM is going to do with things that are not web development or leetcode medium, for which so much stuff has been written. All the best.

atrus
4 replies
1d1h

Chess requires understanding, which computers lack. Go requires understanding, which computers lack. X requires Y which AI technology today lacks. AI is a constantly moving goalpost it seems.

The_Colonel
1 replies
1d

It was always clear that games like chess or go can be played by computers well, even with simple algorithms, because they were completely formalized. The only issue was with performance / finding more efficient algorithms.

That's very different from code which (perhaps surprisingly) isn't well formalized. The goals are often vague and it's difficult to figure out what is intentional and what incidental behavior (esp. with imperative code).

apetresc
0 replies
3h14m

> It was always clear that games like chess or go can be played by computers well

As someone who was deeply involved in the Go scene since the early 2000s, let me emphatically assure you it was not at all clear. Indeed, it was a major point of pride among Go enthusiasts that computers could not play it well, for various reasons (some, like the branching factor, one could grant might eventually be solved by advances in hardware and software; others, like the inherent difficulty of constructing an evaluation function, seemed intractable).

Betting markets at the time of the AlphaGo match still had favorable odds for Sedol, even with the knowledge that Google was super-confident baked in.

It is extreme hindsight-bias of exactly the type the grandparent was talking about to suggest that obviously everybody knew all along that Go was very beatable by "non-real AI".

satvikpendem
0 replies
1d

> AI is a constantly moving goalpost it seems.

alwayshasbeen.png

> The AI effect occurs when onlookers discount the behavior of an artificial intelligence program by arguing that it is not "real" intelligence.[1]
>
> Author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'."[2] Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"[3]

"AI is whatever hasn't been done yet."

—Larry Tesler

https://en.wikipedia.org/wiki/AI_effect

arp242
0 replies
2h51m

> Chess requires understanding

Actually it doesn't, as demonstrated by the fact that people have made fairly decent chess implementations in 1K lines of code and things like that. Chess is comparatively easy because it has well-defined rules and well-defined concepts of "good" and "bad". Refactoring something to be multi-threaded is incomparably more complex and any comparison to this is just pointless.

dataangel
0 replies
21h35m

AI is not very good at single threaded code which is widely regarded as much easier. The breathless demos don't generalize well when you truly test on data not in the training set, it's just that most people don't come up with good tests because they take something from the internet, which is the training set. But the code most people need to write is to do tasks that are bespoke to individual businesses/science-experiments/etc not popular CS problems that there are 1000 tutorials online for. When you get into those areas it becomes apparent really quickly that the AI only gets the "vibes" of what code should look like, it doesn't have any mechanistic understanding.

astrange
0 replies
23h19m

The ffmpeg tests take a lot more than a few seconds to run, and an AI god is still going to have trouble debugging multithreaded code.

bigbillheck
7 replies
1d2h

Why on earth would you possibly think that?

motoboi
5 replies
1d1h

I'm quite confused by the answers I got from this thread. Haven't you tried refactoring with gpt-4 yet?

bigbillheck
3 replies
1d

> Haven't you tried refactoring with gpt-4 yet?

I most certainly have not. At work, I do greenfield development in a specialized problem domain, and I would not trust a model (or, for that matter, a junior developer) to do any kind of refactor in an acceptable manner. (That aside, there's no way I'm going to expose company code to any sort of outside party without the approval of upper management).

At home, I program for fun and self-improvement, and a big part of both is thinking hard about problems. Why would I want to wreck that with asking a model to do it for me?

smabie
1 replies
23h16m

Oh yeah, your work is far too sophisticated for an LLM, got it.

bigbillheck
0 replies
21h8m

Some of us do actual creative work, yes.

motoboi
0 replies
23h28m

Oh, I understand.

What if you have an idea and you tell the computer to implement it and then thoroughly check the code?

About the self-improvement part, I suppose you don't operate your graphics card yourself, but delegate it to your driver.

The LLM is just another tool.

airstrike
0 replies
23h25m

Some people get significance from their ability to write code. To them, admitting an LLM can (or will soon be able to) do their work inflicts cognitive dissonance, so they refuse to believe it. Some refuse to even try it—not realizing that refusing to engage does nothing to hinder the advancement of the tool they fear.

ctoth
0 replies
1d2h

Just as your human intelligence led to you writing the same darn comment as another human above you, AI can often write the same code as a human would, without having to even bring creativity into it! For those of us who write code, this can be useful!

thfuran
4 replies
1d2h

Yeah, it's nuts to think that.

rocqua
2 replies
1d1h

Refactoring is really rather well defined. It's "just transformations that are invariant w.r.t. the outcome". The reason they are hard to automate is that 'invariant w.r.t. the outcome' is a lot more lenient than most semantic models can handle. But this kind of well-defined task with a slight amount of nuance (and decently checkable) seems pretty well-suited to an LLM.

nolist_policy
1 replies
22h3m

At least for the Linux kernel, QEMU and other large C projects, this is a solved problem with Coccinelle[1]. Compared to AI, it has the added benefit of not making incorrect changes and/or hallucinating stuff or prompt injections or ...

I guess you could use AI to help create a coccinelle semantic patch.

[1] https://en.wikipedia.org/wiki/Coccinelle_(software)

dataangel
0 replies
21h31m

The part Coccinelle does is the part GPT is good at; the problem is that neither of them actually reasons about the code.

motoboi
0 replies
1d1h

I'm genuinely confused about your point of view. Have you tried refactoring with GPT-4?

I have been refactoring code using gpt-4 for some months now, and the limiting factor has been the context size.

GPT-4 Turbo now has a 128k context, and I can provide it with larger portions of the code base for the refactors.

When we have millions of tokens of context, based on what I'm experiencing now, I can see that a refactoring like the one made in ffmpeg would be possible. Or not? What am I missing here?

ctoth
3 replies
1d2h

I see that you got some responses from people who may have not even used gpt-4 as a coding assistant, but I absolutely agree with you. A larger context window, a framework like Aider, and slightly-better tooling so the AI can do renames and other high-level actions without having to provide the entire changeset as patches, and tests. Lots of tests. Then you can just run the migration 15 times, pick from the one which passes all the tests... run another integration pass to merge ideas from the other runs, rinse and repeat. Of course the outer loops will themselves be automated.

The trick to this is continuous iteration and feedback. It's remarkable how far I've gotten with GPT using these simple primitives and I know I'm not the only one.

dataangel
0 replies
21h33m

If you ask GPT to refactor a single-threaded program that is truly out of sample into a multithreaded program, even one much smaller than the context window, it's often going to fail. GPT has trouble understanding bit masks in single-threaded code, let alone multiple threads.

bigbillheck
0 replies
3h54m

> so the AI can do renames

I do renames in a big window with my IDE.

beeboobaa
0 replies
1d1h

If you think a large refactor is just renaming some stuff then it makes sense you think this.

bsdpufferfish
1 replies
1d2h

Why would you think this is possible?

motoboi
0 replies
1d1h

Based on my current experience with gpt-4. Have you tried some sort of refactoring with it? Because I have been routinely turning serial scripts into parallel ones with success.

I couldn't do the same with larger codebases because the context window isn't big enough for the input code and the refactored output.

beeboobaa
0 replies
1d1h

I have some snake oil to sell you

bsdpufferfish
29 replies
1d2h

If I'm operating a cloud service like Netflix, then I'm already running thousands of ffmpeg processes on each machine. In other words, it's already a multi-core job.

sanitycheck
15 replies
1d2h

Curious, what would that many ffmpeg processes be doing at Netflix? I assume new VOD content gets encoded once per format, and the amount of new content added per day is not gigantic.

Agree with the general premise, of course, if I've got 10 different videos encoding at once then I don't need additional efficiency because the CPU's already maxed out.

The_Colonel
12 replies
1d2h

I assume they re-compress for each resolution / format, quite possibly they also have different bitrate levels per resolution. Potentially even variants tweaked for certain classes of device (in cases this is not already covered by combination of format/resolution/bitrate). I would also assume they re-compress with new advances in video processing (things like HDR, improved compression).

Also, their devs likely want fast feedback on changes - I imagine they might have CI running changes on some standard movies, checking various stats (like SNR) for regressions. Everybody loves if their CI finishes fast, so you might want to compress even a single movie in multiple threads.

sanitycheck
11 replies
1d2h

They'll be doing VBR encodes to DASH, HLS & (I guess still) MSS which covers the resolutions & formats... DRM will be what prevents high res content from working on some "less-trusted" platforms so the same encodes should work.

(Plus a couple more "legacy" encodes with PIFF instead of CENC for ancient devices, probably.)

New tech advances, sure, they probably do re-encode everything sometimes - even knocking a few MB off the size of a movie saves a measurable amount of $$ at that scale. But are there frequent enough tech advances to do that more than a couple of times a year..? The amount of difficult testing (every TV model group from the past 10 years, or something) required for an encode change is horrible. I'm sure they have better automation than anyone else, but I'm guessing it's still somewhat of a nightmare.

Youtube, OTOH, I really can imagine having thousands of concurrent ffmpeg processes.

canucker2016
10 replies
1d

Why bring up assumptions/suppositions about Netflix's encoding process?

Their tech blog and tech presentations discuss many of the requirements and steps involved for encoding source media to stream to all the devices that Netflix supports.

The Netflix tech blog: https://netflixtechblog.com/ or https://netflixtechblog.medium.com/

Netflix seems to use AWS CPU+GPU for encoding, whereas YouTube has gone to the expense of producing an ASIC to do much of their encoding.

2015 blog entry about their video encoding pipeline: https://netflixtechblog.com/high-quality-video-encoding-at-s...

2021 presentation of their media encoding pipeline: https://www.infoq.com/presentations/video-encoding-netflix/

An example of their FFmpeg usage - a neural-net video frame downscaler: https://netflixtechblog.com/for-your-eyes-only-improving-net...

Their dynamic optimization encoding framework - allocating more bits for complex scenes and fewer bits for simpler, quieter scenes: https://netflixtechblog.com/dynamic-optimizer-a-perceptual-v... and https://netflixtechblog.com/optimized-shot-based-encodes-now...

Netflix developed an algorithm for determining video quality - VMAF, which helps determine their encoding decisions: https://netflixtechblog.com/toward-a-practical-perceptual-vi..., https://netflixtechblog.com/vmaf-the-journey-continues-44b51..., https://netflixtechblog.com/toward-a-better-quality-metric-f...

astrange
9 replies
23h7m

> Their dynamic optimization encoding framework - allocating more bits for complex scenes and fewer bits for simpler, quieter scenes: https://netflixtechblog.com/dynamic-optimizer-a-perceptual-v... and https://netflixtechblog.com/optimized-shot-based-encodes-now...

This is overrated - of course that's how you do it, what else would you do?

> Mean-squared-error (MSE), typically used for encoder decisions, is a number that doesn’t always correlate very nicely with human perception.

Academics, the reference MPEG encoder, and old proprietary encoder vendors like On2 VP9 did make decisions this way because their customers didn't know what they wanted. But people who care about quality, i.e. anime and movie pirate college students with a lot of free time, didn't.

It looks like they've run x264 in an unnatural mode to get an improvement here, because the default "constant ratefactor" and "psy-rd" always behaved like this.

canucker2016
4 replies
22h6m

You're letting the video codec make all the decisions for bitrate allocation.

Netflix tries to optimize the encoding parameters per shot/scene.

from the dynamic optimization article:

- A long video sequence is split in shots ("Shots are portions of video with a relatively short duration, coming from the same camera under fairly constant lighting and environment conditions.")

- Each shot is encoded multiple times with different encoding parameters, such as resolutions and qualities (QPs)

- Each encode is evaluated using VMAF, which together with its bitrate produces an (R,D) point. One can convert VMAF quality to distortion using different mappings; we tested against the following two, linearly and inversely proportional mappings, which give rise to different temporal aggregation strategies, discussed in the subsequent section

- The convex hull of (R,D) points for each shot is calculated. In the following example figures, distortion is inverse of (VMAF+1)

- Points from the convex hull, one from each shot, are combined to create an encode for the entire video sequence by following the constant-slope principle and building end-to-end paths in a Trellis

- One produces as many aggregate encodes (final operating points) by varying the slope parameter of the R-D curve as necessary in order to cover a desired bitrate/quality range

- Final result is a complete R-D or rate-quality (R-Q) curve for the entire video sequence

astrange
2 replies
19h23m

> You're letting the video codec make all the decisions for bitrate allocation.
>
> Netflix tries to optimize the encoding parameters per shot/scene.

That's the problem - if the encoding parameters need to be varied per scene, it means you've defined the wrong parameters. Using a fixed H264 QP is not on the rate-distortion frontier, so don't encode at constant QP then. That's why x264 has a different fixed quality setting called "ratefactor".

canucker2016
1 replies
19h6m

What about VP9? And any of the other codecs that Netflix uses (I'll assume AV1 is one they currently use)?

astrange
0 replies
12h25m

It's not a codec-specific concept, so it should be portable to any encoder. x265 and AV1 should have similar things, not sure about VP9 as I think it's too old and On2 were, as I said, not that competent.

soundarana
0 replies
17h17m

Isn't two pass encoding similar? In the first pass you collect statistics you use in the second pass for bandwidth allocation?

Possibly Netflix statistics are way better.
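For reference, classic two-pass with ffmpeg/x264 looks roughly like this (bitrate and filenames made up); the difference is that it only optimizes within a single rung of the ladder:

  # pass 1: analyze only, discard the output
  ffmpeg -y -i input.mp4 -c:v libx264 -b:v 3M -pass 1 -an -f null /dev/null
  # pass 2: encode using the stats gathered in pass 1
  ffmpeg -i input.mp4 -c:v libx264 -b:v 3M -pass 2 -c:a aac output.mp4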

slhck
3 replies
22h36m

> This is overrated - of course that's how you do it, what else would you do?

That's not what has been done previously for adaptive streaming. I guess you are referring to what encoding modes like CRF do for an individual, entire file? Or where else has this kind of approach been shown before?

In the early days of streaming you would've done constant bitrate for MPEG-TS, even adding zero bytes to pad "easy" scenes. Later you'd have selected 2-pass ABR with some VBV bitrate constraints to not mess up the decoding buffer. At the time, YouTube did something where they tried to predict the CRF they'd need to achieve a certain (average) bitrate target (can't find the reference anymore). With per-title encoding (which was also popularized by Netflix) you could change the target bitrates for an entire title based on a previous complexity analysis. It took quite some time for other players in the field to also hop on the per-title encoding train.

Going to a per-scene/per-shot level is the novelty here, along with exhaustively finding the best possible combination of QP/resolution pairs for an entire encoding ladder that also optimizes subjective quality – and not just MSE.

astrange
2 replies
22h22m

> exhaustively finding the best possible combination of QP/resolution pairs for an entire encoding ladder that also optimizes subjective quality – and not just MSE.

This is unnecessary if the encoder is well-written. It's like how some people used to run multipass encoders 3 or 4 times just in case the result got better. You only need one analysis pass to find the optimal quality at a bitrate.

slhck
1 replies
22h15m

Sure, the whole point of CRF is to set a quality target and forget about it, or, with ABR, to be as good as you can with an average bitrate target (under constraints). But you can't do that across resolutions, e.g. do you pick the higher bitrate 360p version, or the lower bitrate 480p one, considering both coding artifacts and upscaling degradation?

astrange
0 replies
19h27m

At those two resolutions you'd pick the higher resolution one. I agree that generation of codec doesn't scale all the way up to 4K and at that point you might need to make some smart decisions.

I think it should be possible to decide in one shot in the codec though. My memory is that codecs (image and video) have tried implementing scalable resolutions before, but it didn't catch on simply because dropping resolution is almost never better than dropping bitrate.

banana_giraffe
0 replies
1d1h

It's been reported in the past that Netflix encodes 120 different variants of each video they have [1] for different bitrates and different device's needs.

And that was years ago, I wouldn't be surprised to learn it's a bigger number now.

[1] https://news.ycombinator.com/item?id=4946275

asveikau
0 replies
1d2h

Probably a lot more than once when you consider that different devices have different capabilities, and that they might stream you different bitrates depending on conditions like your network capability, screen resolution, how much you've paid them..

You could also imagine they might apply some kind of heuristic to decide to re-encode something based on some condition... Like fine tune encoder settings when a title becomes popular. No idea if they do that, just using some imagination.

kevincox
5 replies
1d1h

Latency is still valuable. For example, YouTube (which IIRC uses ffmpeg) often takes hours to do transcodes. This is likely somewhat due to scheduling, but assuming they can get the same result using 4x the threads for 1/4 of the time, they would prefer that, as each job finishes faster. The only real question is at what efficiency cost the latency benefit stops being worth it.

Thaxll
3 replies
1d

YouTube does not use ffmpeg; at the scale at which they operate it would be too slow / expensive.

They use custom hardware just for encoding.

FYI, they have to transcode over 500h of video per minute. So multiply that by all the formats they support.

They operate at an insane scale, Netflix looks like a garage project for comparison.

astrange
1 replies
23h13m

There's still decoding. If a service claims to support all kinds of weird formats (like a MOV or AVI from the 90s) that means ffmpeg is running.

canucker2016
0 replies
22h57m

Google's use of ffmpeg: https://multimedia.cx/eggs/googles-youtube-uses-ffmpeg/

For encoding, recently, they've built their own ASIC to deal with H264 and VP9 encoding (for 7-33x faster encoding compared to CPU-only): https://arstechnica.com/gadgets/2021/04/youtube-is-now-build...

bagels
0 replies
10h58m

Facebook does, and contributes to ffmpeg.

lelanthran
0 replies
1d

I think that if you're operating at the scale of Google, using a single-threaded ffmpeg will finish your jobs in less time.

If you have a queue of 100k videos to process and a cluster of 100 cores, assigning a video to each core as it becomes available is the most efficient way to process them, because you're skipping the thread-joining time.

Anytime there is a queue of jobs, assigning the next job in the queue to the next free core is always going to be faster than assigning the next job to multiple cores.

The_Colonel
3 replies
1d2h

I guess it's irrelevant for Netflix then*. But it sounds great for the remaining 99.99%.

* I would be very surprised if Netflix even uses vanilla ffmpeg

bsdpufferfish
2 replies
1d2h

> But it sounds great for the remaining 99.99%.

I believe the vast majority of ffmpeg usage is in web services or one-off encodings.

The_Colonel
1 replies
1d2h

Well, this feature is awesome for one-off encoding by a home user.

Subjectively, me compressing my holiday video is much more important than Netflix re-compressing a million of them.

buu700
0 replies
1d

I use ffmpeg all the time, so this change is much appreciated. Well not really that often, but when I do encode video/audio it's generally with ffmpeg.

platzhirsch
1 replies
1d2h

As multi-core as Python and Ruby then.

bsdpufferfish
0 replies
1d2h

Yes. The kernel multiplies your efforts for you. It works great for web services.

stevehiehn
0 replies
23h9m

Okaaay, and if I'm not operating a cloud service like Netflix, and I'm not running thousands of ffmpeg processes? In other words, it's not already a multi-core job?

kevincox
26 replies
1d1h

I've always wondered if better multi-core performance can come from processing different keyframe segments separately.

IIUC all current encoders that support parallelism work by multiple threads working on the same frame at the same time. Often times the frame is split into regions and each thread focuses on a specific region of the frame. This approach can have a (usually small) quality/efficiency cost and requires per-encoder logic to assemble those regions into a single frame.

What if instead/additionally different keyframe segments were processed independently? So if keyframes are every 60 frames, ffmpeg would read 60 frames and pass them to the first thread, the next 60 to the next thread, ..., then assemble the results basically by concatenating them. It seems like this could be used to parallelize any codec in a fairly generic way, and it should be more efficient as there is no thread-communication overhead or splitting of the frame into regions, which harms cross-region compression.

Off the top of my head I can only think of two issues:

1. Requires loading N*keyframe period frames into memory as well as the overhead memory for encoding N frames.

2. Variable keyframe support would require special support as the keyframe splits will need to be identified before passing the video to the encoding threads. This may require extra work to be performed upfront.

But both of these seem like they won't be an issue in many cases. Lots of the time I'd be happy to use tons of RAM and output with a fixed keyframe interval.

Probably I would combine this with intra-frame parallelization such as process every frame with 4 threads and then run 8 keyframe segments in parallel. This way I can get really good parallelism but only minor quality loss from 4 regions rather than splitting the video into 32 regions which would harm quality more.
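A crude sketch of the chunked approach with stock ffmpeg plus shell parallelism (chunk length, encoder settings, and filenames are made up; with stream copy the segment muxer splits only at keyframes, which is what we want here):

  # 1. split the source into chunks at keyframe boundaries without re-encoding
  ffmpeg -i input.mkv -c copy -f segment -segment_time 10 -reset_timestamps 1 chunk_%03d.mkv
  # 2. encode the chunks in parallel, one ffmpeg process per worker
  ls chunk_*.mkv | xargs -P 8 -I{} ffmpeg -i {} -c:v libx264 -crf 20 enc_{}
  # 3. stitch the encoded chunks back together in order
  for f in enc_chunk_*.mkv; do echo "file '$f'"; done > list.txt
  ffmpeg -f concat -safe 0 -i list.txt -c copy output.mkv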

seeknotfind
7 replies
1d1h

Video codecs often encode the delta from the previous frame, and because this delta is often small, it's efficient to do it this way. If each thread needed to process the frame separately, you would need to make significant changes to the codec, and I hypothesize it would cause the video stream to be bigger in size.

keehun
3 replies
1d1h

The parent comment referred to "keyframes" instead of just "frames". Keyframes—unlike normal frames—encode the full image. That is done in case the "delta" you mentioned could be dropped in a stream ending up with strange artifacts in the resulting video output. Keyframes are where the codec gets to press "reset".

seeknotfind
1 replies
1d

Oh right. For non realtime, if you're not IO bound, this is better. Though I'd wonder how portable the codec code itself would be.

actionfromafar
0 replies
1d

The encoder has a lot of freedom in how it arrives at the encoded data.

account42
0 replies
5h46m

> That is done in case the "delta" you mentioned could be dropped in a stream ending up with strange artifacts in the resulting video output.

Also to be able to seek anywhere in the stream without decoding all previous frames.

danielrhodes
2 replies
1d

Isn't that delta partially based on the last keyframe? I guess it would be codec dependent, but my understanding is that keyframes are like a synchronization mechanism where the decoder catches up to where it should be in time.

astrange
0 replies
23h17m

In most codecs the entropy coder resets at each frame, so there is enough freedom that you can do multithreaded decoding. ffmpeg has frame-based and slice-based threading for this.

It also has a lossless codec ffv1 where the entropy coder doesn't reset, so it truly can't be multithreaded.

0x457
0 replies
22h54m

Yes, key frames are fully encoded, and some delta frames are based on the previous frame (which could be keyframe or another delta frame). Some delta frames (b-frames) can be based on next frame instead of previous. That's why sometimes you could have a visual glitch and mess up the image until the next key frame.

I'd assume if each thread is working on its own key frame, it would be difficult to make b-frames work? Live content also probably makes it hard.

Hello71
7 replies
1d1h

Your idea also doesn't work with live streaming, and may also not work with inter-frame filters (depending on implementation). Nonetheless, this exists already with those limitations: av1an and, I believe, VapourSynth work more or less the way you describe, except you don't actually need to load every chunk into memory, only the current frames. As I understand it, this isn't a major priority for mainstream encoding pipelines because GOP/chunk threading isn't massively better than intra-frame threading.

kevincox
4 replies
1d1h

It can work with live streaming, you just need to add N keyframes of latency. With low-latency livestreaming keyframes are often close together anyways so adding say 4s of latency to get 4x encoding speed may be a good tradeoff.

bagels
1 replies
1d1h

4s of latency is not acceptable for applications like live chat

kevincox
0 replies
1d

As I said, "may be". "Live" varies hugely with different use cases. Sporting events are often broadcast live with 10s of seconds of latency. But yes, if you are talking to a chat in real-time a few seconds can make a huge difference.

mort96
0 replies
1d

Well, you don't add 4s of latency for 4x encoding speed though. You add 4s of latency for very marginal quality/efficiency improvement and significant encoder simplification, because the baseline is current frame-parallel encoders, not sequential encoders.

Plus, computers aren't quad cores any more, people with powerful streaming rigs probably have 8 or 16 cores; and key frames aren't every second. Suddenly you're in this hellish world where you have to balance latency, CPU utilization and encoding efficiency. 16 cores at a not-so-great 8 seconds of extra latency means terrible efficiency with a key frame every 0.5 second. 16 cores at good efficiency (say, 4 seconds between key frames) means a terrible 64 seconds of extra latency.

imtringued
0 replies
19h0m

You can pry VP8 out of my cold dead hands. I'm sorry, but if it takes more than 200ms including network latency, it is too slow, and video encoding is extremely CPU intensive, so exploding your cloud bill is easy.

kevincox
0 replies
1d

> except you don't actually need to load every chunk into memory, only the current frames.

That's a good point. In the general case of reading from a pipe you need to buffer it somewhere. But for file-based inputs the buffering concerns aren't relevant, just the working memory.

dbrueck
0 replies
21h16m

Actually, not only does it work with live streaming, it's not an uncommon approach in a number of live streaming implementations*. To be clear, I'm not talking about low latency stuff like interactive chat, but e.g. live sports.

It's one of several reasons why live streams of this type are often 10-30 seconds behind live.

* Of course it also depends on where in the pipeline they hook in - some take the feed directly, in which case every frame is essentially a key frame.

rokweom
2 replies
1d

There's already software that does this: https://github.com/master-of-zen/Av1an. Encoding this way should indeed improve quality slightly. Whether that is actually noticeable/measurable... I'm not sure.

rnnr
0 replies
23h58m

ffmpeg and x265 allow you to do this too. frame-threads=1 will use one thread per frame, addressing the issue the OP mentioned, without a big perf penalty, in contrast to the 'pools' switch, which sets the number of threads to be used for encoding.
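For example, something along these lines (settings made up; check the x265 docs for exact semantics):

  # frame-threads=1 keeps a single frame in flight; pools sets the size of the worker-thread pool
  ffmpeg -i input.mkv -c:v libx265 -x265-params "frame-threads=1:pools=8" output.mkv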

jamal-kumar
0 replies
22h31m

I've messed around with av1an. Keep in mind the software used for scene chunking, L-SMASH, is only documented in Japanese [1], but it does the trick pretty well as long as you're not messing with huge dimensions like HD VR, where the video dimensions do things like crash QuickTime on a Mac.

[1] http://l-smash.github.io/l-smash/

cudder
2 replies
1d1h

I know next to nothing about video encoders, and in my naive mind I absolutely thought that parallelism would work just like you suggested it should. It sounds absolutely wild to me that they're splitting single frames into multiple segments. Merging work from different threads for every single frame sounds wasteful somehow. But I guess it works, if that's how everybody does it. TIL!

astrange
1 replies
23h22m

Most people concerned about encoding performance are doing livestreaming and so they can't accept any additional latency. Splitting a frame into independent segments (called "slices") doesn't add latency / can even reduce it, and it recovers from data corruption a bit better, so that's usually done at the cost of some compression efficiency.

account42
0 replies
5h57m

> Most people concerned about encoding performance are doing livestreaming

What makes you think that? I very much care about encoding performance (for a fixed quality level) for offline use.

PatronBernard
2 replies
23h21m

IIUC - International Islamic University Chittagong?

nolist_policy
0 replies
23h16m

IIUC - If I understand correctly.

KineticLensman
0 replies
23h14m

If I Understand Correctly

bmurphy1976
0 replies
22h51m

This definitely happens. This is how videos uploaded to Facebook or YouTube become available so quickly. The video is split into chunks based on key frame, the chunks are farmed out to a cluster of servers and encoded in parallel, and the outputs are then re-assembled into the final file.

shp0ngle
5 replies
23h49m

I don't know anything about the ffmpeg codebase, but I just wonder... how would I go about doing this _slowly_, without one giant commit that changes everything?

The presentation says it's 700 commits. Was that a separate branch? Or was it slowly merged back into the project?

Well, I can look at GitHub, I guess.

shp0ngle
4 replies
23h46m

It seems ffmpeg uses the mailing list patch way of doing "PRs", which is... well it is what it is. It doesn't help me understand the process unless I just go through all the mailing list archives, I guess.

asylteltine
3 replies
23h15m

Ugh why? That is so old school

_joel
1 replies
21h21m

Like the linux kernel?

asylteltine
0 replies
19h2m

Which is also archaic and old school

shp0ngle
0 replies
19h44m

I mean, they might be used to doing that, as ffmpeg predates GitHub (and Git).

As long as it works for them...

bane
3 replies
22h7m

A theory about this that may also affect other older, solid software: the assumptions made about where to optimally "split" a problem for multi-threading/processing have likely changed over time.

It wasn't that long ago that reading, processing, and rendering the contents of a single image took a noticeable amount of time. But both hardware and software techniques have gotten significantly faster. What may have made sense many years ago (lots of workers on a frame) may not matter today when a single worker can process a frame or a group of frames more efficiently than the overhead of spinning up a bunch of workers to do the same task.

But where to move that split now? Ultra-low-end CPUs now ship with multiple cores and you can get over 100 easily on high-end systems, system RAM is faster than ever, interconnect moves almost a TB/sec on consumer hardware, GPUs are in everything, and SSDs are now faster than the RAM I grew up with (at least on continuous transfer). Basically the systems of today are entirely different beasts to the ones commonly on the market when FFmpeg was created.

This is tremendous work that requires lots of rethinking about how the workload needs to be defined, scheduled, distributed, tracked, and merged back into a final output. Kudos to the team for being willing to take it on. FFmpeg is one of those "pinnacle of open source" infrastructure components that civilizations are built from.

chrismsimpson
1 replies
15h46m

I assume 1 process with 2 threads takes up less space on the die than 2 processes, both single threaded. If this is true, a threaded solution will always have the edge on performance, even as everything scales ad infinitum.

LtdJorge
0 replies
4h16m

Hardware threads are not the same as software threads.

MightyBuzzard
0 replies
21h41m

It's not the codecs that were multithreaded in this release. Pretty much all modern codecs are already multithreaded. What they decided to parallelize is ffmpeg itself. You know, the filter graphs and such. They didn't do anything to the codecs themselves.

badrabbit
3 replies
1d2h

When I stream 4K from my laptop, ffmpeg gets very intense about CPU usage, to the point that the fans are constantly at high speed and it's distracting. I hope this helps in some way. I have a fairly decently specced mid-tier laptop.

hereme888
2 replies
1d2h

I believe ffmpeg can be compiled to support the GPU, if your laptop has one. It works at least for CUDA-enabled GPUs (https://docs.nvidia.com/video-technologies/video-codec-sdk/1...)

Talk with ChatGPT about it and see if you can do it.
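If the laptop does have an NVIDIA GPU, the hardware-encoding path is roughly this (filenames and bitrate made up; needs an ffmpeg build with NVENC support):

  # decode and encode on the GPU so the CPU (and fans) stay mostly idle
  ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -c:v h264_nvenc -b:v 8M output.mp4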

isatty
0 replies
1d1h

???

Just Google it.

badrabbit
0 replies
1d

Thanks. I don't think it has a separate GPU that is CUDA-enabled, but I have other systems that do; I'll look into it.

muragekibicho
2 replies
1d

Shameless plug. I teach the FFmpeg C api here : https://www.udemy.com/course/part-1-video-coding-with-ffmpeg...

3abiton
1 replies
22h9m

That's such a niche topic to teach. What use cases?

muragekibicho
0 replies
22h6m

It's for engineers tired of memorizing long weird CLI commands. I teach you the underlying C data structures so you can get out of command line hell and make the most out of your time!

qrush
1 replies
1d2h

Is there a recording of this talk from VDD@Dublin? Can't easily find it on the author's site or here https://www.videolan.org/videolan/events/vdd23/

Update: Found here! https://www.youtube.com/watch?v=Z4DS3jiZhfo&t=1221s

jaquesy
0 replies
11h51m

As someone who has only used ffmpeg for very trivial scenarios this was really interesting to listen to. It's nice knowing it's getting a proper refactor (not that I knew much of its internals until now)

pier25
1 replies
1d1h

So does this mean that FFMPEG will be able to use multiple cores with all the included codecs?

I'm using FFMPEG to encode MP3 with LAME for an audio hosting service and it would be great to improve encode times for long files.
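For context, a long-file encode like that is essentially a single invocation along these lines (filenames made up; -q:a 2 is LAME VBR at roughly 170-210 kbps):

  # VBR MP3 via libmp3lame
  ffmpeg -i episode.wav -c:a libmp3lame -q:a 2 episode.mp3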

pseudosavant
0 replies
1d

Doubtful. Many codecs like MP3 aren't well suited to efficient multi-threaded encoding.

kuba-orlik
1 replies
10h29m

I routinely use the ffmpeg CLI and I see in top that multiple cores are engaged in the processing. Hasn't it been multi-threaded for years? What exactly is changing?

Shish2k
0 replies
3h28m

My interested-lay-person understanding: Previously it would read a chunk of file, then decode it, then apply filters, then run multi-threaded encoding, then write a chunk of file.

Now it can read the next chunk of file while decoding a chunk while applying filters while running multi-threaded encoding while writing the previous chunk.

I expect this will be a big help for complicated setups (applying multiple filters, rendering to multiple outputs with different codecs), but probably not much change for a simple 1-input 1-output transcode.

atif089
1 replies
20h46m

Does this mean that my FFMPEG H264 encoding until now was single-threaded?

urbandw311er
0 replies
18h1m

Apparently not — many of the encoders were already multithreaded; this was about making the ffmpeg pipeline itself parallelised.

vfclists
0 replies
22h4m

All without Rust?

sylware
0 replies
21h56m

I think this was not "basic" multi-threading: they were careful about keeping latency as low as possible and some internal modifications of ffmpeg libs had to be done.

That said, I don't think we get input buffering (for HLS) yet.

mrbuttons454
0 replies
1d1h

Will this allow multiple machines to process the same file? If not is there anything out there that can?

ajhai
0 replies
23h21m

This will hopefully improve the startup times for FFmpeg when streaming from virtual display buffers. We use FFmpeg in LLMStack (a low-code framework to build and run LLM agents) to stream browser video. We use Playwright to automate browser interactions and provide that as a tool to the LLM. When this tool is invoked, we stream video of these browser interactions with FFmpeg by streaming the virtual display buffer the browser is using.

There is a noticeable delay booting up this pipeline for each tool invoke right now. We are working on putting in some optimizations but improvements in FFmpeg will definitely help. https://github.com/trypromptly/LLMStack is the project repo for the curious.

PaywallBuster
0 replies
1d2h

The PDF with the presentation requires a password.

ElijahLynn
0 replies
23h55m

This must have been quite the challenge to continually rebase the ongoing changes coming in on the daily. Wow. Now that it is actually in, it should be much easier to go forward.

Big win too! This is going to really speed things up!

Const-me
0 replies
20h55m

The Intel Core Duo CPU was released in 2006. By then it was obvious that computationally intensive programs need multithreading and that Unix-style processes are no longer adequate.

I wonder why it took so long for FFmpeg?

BTW, MS Media Foundation is a functional equivalent of FFmpeg. It was released as a part of Windows Vista in 2006, and is heavily multithreaded by design.

71a54xd
0 replies
22h1m

Random reach here, but has anyone managed to get FFmpeg to render JS text over a video? I've been thinking about this workflow and just haven't quite figured it out yet; I only have a prototype in MoviePy, but I'd like to move away from that.