FFmpeg gets 100x faster with AVX-512 and handwritten assembly code

Alfonso Maruccia

What just happened? FFmpeg developers keep on crunching "handwritten" assembly code to make the multimedia project faster than ever before. Thanks to newer vector-based instructions included in modern x86 processors, FFmpeg can truly provide a massive speedup in media transcoding workloads – if you are lucky enough.

The FFmpeg team recently announced a massive speed increase thanks to some newly patched code. The headline figure is a speedup of more than 100 times, likely the biggest performance increase the open-source project has ever seen. However, the developers warn that only a single function receives this full boost, though some large speed improvements are coming to other parts of the project as well.

As stated in the recently submitted patch, the "rangedetect8_avx512" function is now 100 times faster than the baseline C version. The coders credit their handwritten assembly code for the speed increase, together with extensive use of the AVX-512 extensions to the x86 ISA available in modern processors.

The FFmpeg team clearly is a big proponent of assembly programming. There is even an online school focused on how assembly is used in the project, where people interested in joining the challenge are pushed to "open their eyes" to what's actually going on in a computer when it's running some binary code in RAM.

Assembly is a low-level programming language in which human-readable instructions correspond directly to the machine code instructions of a CPU architecture. Unlike code written in high-level languages such as C, assembly code is not "compiled"; it is simply "assembled" into binary code for a specific processor ISA. Hand-tuned assembly is often the most effective (and most difficult) way to extract every last bit of number-crunching performance from a CPU.

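As FFmpeg programmers put it, "register allocator sucks on compilers," meaning that compilers do not always assign values to CPU registers as well as a careful human can. The AVX-512 instruction set is a vector extension to the traditional x86 ISA, a form of "single instruction, multiple data" (SIMD) computing in which one instruction operates on many values at once. It is implemented by Intel and AMD in modern(ish) CPUs.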
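To make the "single instruction, multiple data" idea concrete, here is a minimal sketch in plain C. It is not FFmpeg's code: the range8 type and both function names are hypothetical, and the assumption that range detection boils down to finding the smallest and largest 8-bit sample is ours, not the patch's. The first function handles one sample per step; the second keeps 64 independent "lanes" (512 bits divided by 8 bits per sample) and only merges them at the end, which is roughly the pattern a vector min/max loop follows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 64 /* a 512-bit register holds 64 samples of 8 bits each */

/* Hypothetical result type: smallest and largest sample seen. */
typedef struct { uint8_t min, max; } range8;

/* Scalar reference: one sample per loop iteration. */
static range8 range8_scalar(const uint8_t *px, size_t n)
{
    assert(n > 0);
    range8 r = { px[0], px[0] };
    for (size_t i = 1; i < n; i++) {
        if (px[i] < r.min) r.min = px[i];
        if (px[i] > r.max) r.max = px[i];
    }
    return r;
}

/* Lane-wise version: the inner loop stands in for what a single
 * vector min/max instruction would do to 64 samples at once. */
static range8 range8_lanes(const uint8_t *px, size_t n)
{
    assert(n > 0);
    uint8_t lo[LANES], hi[LANES];
    for (int l = 0; l < LANES; l++) { lo[l] = 255; hi[l] = 0; }

    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (int l = 0; l < LANES; l++) {   /* one "vector" step */
            if (px[i + l] < lo[l]) lo[l] = px[i + l];
            if (px[i + l] > hi[l]) hi[l] = px[i + l];
        }
    }

    range8 r = { 255, 0 };
    for (int l = 0; l < LANES; l++) {       /* merge the 64 lanes */
        if (lo[l] < r.min) r.min = lo[l];
        if (hi[l] > r.max) r.max = hi[l];
    }
    for (; i < n; i++) {                    /* leftover tail samples */
        if (px[i] < r.min) r.min = px[i];
        if (px[i] > r.max) r.max = px[i];
    }
    return r;
}
```

Both functions return the same result for any non-empty buffer. The lane-wise version is still ordinary C and gains nothing by itself; real speedups come from replacing the inner loop with actual vector instructions, which is what the handwritten assembly does.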

Vector-based instructions such as AVX-512, or the more recent AVX10 ISA introduced by Intel, can indeed provide a massive performance boost in parallel processing workloads. FFmpeg, a comprehensive suite of libraries and tools for processing multimedia streams, is well suited to exploit this kind of computing acceleration. The project experienced its first AVX-512-powered massive speed boost in 2024, when video decoding routines became three to 94 times faster.

Even on older processors without AVX-512 hardware support, the latest FFmpeg patch can still bring some eye-opening speed increases. The "rangedetect8_avx2" function is now 64 times faster; it relies on the AVX2 extensions, which were introduced with the Haswell microarchitecture back in 2013.
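Shipping separate AVX-512 and AVX2 versions of the same function implies that the right one is picked at run time, depending on the CPU. The following is a rough sketch of that idea, not FFmpeg's actual dispatch code: the impl_level enum and both function names are hypothetical, and it assumes GCC or Clang on x86, where the __builtin_cpu_supports builtin is available. On any other compiler or architecture it simply falls back to the plain C level.

```c
#include <assert.h>

/* Hypothetical implementation levels, from slowest to fastest. */
typedef enum { IMPL_C = 0, IMPL_AVX2 = 1, IMPL_AVX512 = 2 } impl_level;

/* Ask the CPU which vector extensions it offers and return the
 * widest level this sketch knows about. */
static impl_level detect_impl_level(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();                    /* GCC/Clang builtin */
    if (__builtin_cpu_supports("avx512f"))   /* AVX-512 Foundation */
        return IMPL_AVX512;
    if (__builtin_cpu_supports("avx2"))
        return IMPL_AVX2;
#endif
    return IMPL_C;                           /* portable fallback */
}

/* Map a level to the suffix used in names like "rangedetect8_avx2". */
static const char *impl_name(impl_level lv)
{
    static const char *const names[] = { "c", "avx2", "avx512" };
    assert((int)lv >= IMPL_C && (int)lv <= IMPL_AVX512);
    return names[lv];
}
```

A caller would run detect_impl_level() once at startup and store a function pointer to the matching routine. A real dispatcher would likely check more than the Foundation subset, since AVX-512 is split into several extensions.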

 
The optimisations were submitted by Niklas Haas, the lead developer of libplacebo.

Though welcome, the function in question won't affect most FFmpeg usage, being part of a filter. Nonetheless, Haas added some other AVX2/512 optimisations too.
 
I really want to see someone add AC-4 decoding to ffmpeg. Without it, ATSC 3.0 and OTA TV in general, on the PC, will be dead.
 
You would need an AMD Zen 4 or Zen 5 (preferably Zen 5) to take full advantage of these new performance improvements.
 
Nice. This is why I exclusively encode on Zen 4/5 CPUs. Those things absolutely haul while encoding video.
 
100x faster pfff. I wrote a fully vectorised 3D Integrator for Matlab that was 1000x faster than the built-in function, uploaded it to the repository and it is still not implemented as an option.
 
For the uninitiated.
I can only compare writing in assembly to a monk creating a super fancy book: carefully copying the original text by hand, making sure every single stroke is perfect, and adding fine detail with carefully chosen pigments and materials to create something of extremely high quality. Or, put another way, it's a huge pain in the backside that only a few have the skill set for, while for almost everyone a simple mass-produced copy is fine.
Most programmers aren't familiar with assembly, and it's typically used only when exact control is needed for either security or performance. 'Back in the day' it was mostly used for things like routers, where performance was very limited and certain tasks were very common. Even there it's not really needed or done anymore, as routers nowadays are pretty powerful. Heck, even microcontrollers like the ESP32-S3 can run MicroPython.

The AVX512 instruction set is only available on a subset of the CPUs in use, and it's a bit of a mess: there's some base functionality, but there are also extensions that may not be available across the board. That led Linus Torvalds (the guy we can thank for inventing Linux) to famously say, "I Hope AVX512 Dies a Painful Death".
Intel introduced it first mostly for server/workstation use but later on even actively removed it from some CPUs as it was causing overheating issues and could even slow down regular code. They seem to have given up on it now(?). So their offerings have a weird gap where some older CPUs support it but then the new ones do not.
AMD in a hold-my-beer moment seems to have doubled down on it, initially lazily implementing it but with Zen 5 doing it so that it can actually result in significantly better performance.
(So Intel did the heavy work of popularizing it (but not really) and writing the compilers, and now seems to have given up on it, whilst AMD and AMD users enjoy the benefits.)

Anyways, cool news I guess. Doubt it'll have much of an impact but maybe the hard, tedious and very skilled work can benefit a subsection of end users? Perhaps it'll motivate Intel to start supporting AVX-512 again although with their large/small core concept that's more difficult now.





 
Intel stopped supporting AVX512 because they "invented" a panic solution called hybrid architecture. Too bad they didn't plan it in advance, so the only viable solution for the E-cores was Gracemont. Since Gracemont did not support AVX512, supporting it only on the P-cores... well, imagine if a single AVX512 instruction happened to land on an E-core.

Almost certainly future Intel hybrid solutions will support some sort of AVX512 too. Might be AVX10, but anyway.
 
If true, they might be abandoning the hybrid architecture in the future:

https://www.igorslab.de/en/intel-plant-unified-core-nach-razer-lake-das-ende-des-hybrid-experiments/
 
Luckily, Intel has stopped supporting AVX512 since Alder Lake.
And well-informed consumers buy AMD processors.
🧠 AVX-512 support on AMD CPUs is a relatively recent development, and it's mostly found in their Zen 4 and Zen 5 architectures. Here's a breakdown of which AMD processors include AVX-512 capabilities:
🧬 Zen 4 (Ryzen 7000 Series)
- Ryzen 9 7950X
- Ryzen 9 7900X
- Ryzen 7 7700X
- Ryzen 5 7600X
Zen 4 supports AVX-512 by "double-pumping" it: each 512-bit instruction is executed as two 256-bit operations on 256-bit data paths. This gives functional support, but not full native 512-bit throughput.

🚀 Zen 5 (Ryzen 9000 Series)
- Ryzen 9 9950X
- Ryzen 9 9900X
- Ryzen 7 9800X3D
- Ryzen 5 9600X
Zen 5 introduces a full-width 512-bit AVX-512 pipeline, offering true native support and significantly better performance and efficiency compared to Zen 4.

🧪 Threadripper & EPYC (Zen 4/5-based)
- Threadripper 7000 & 9000 Series
- EPYC 9004 Series (Genoa)
These workstation and server chips also support AVX-512, often with enhanced throughput and additional instruction sets for HPC workloads.

If you're aiming for maximum AVX-512 performance, Zen 5 desktop CPUs like the Ryzen 9 9950X are your best bet.
 