Explainer: What are MMX, SSE, and AVX?

To really explain MMX and its successors, you need to go much further back in time. Remember the AN/FSQ-7, the computer used in the SAGE air defense system, whose front panel of flashing lights ended up as a prop in many science-fiction TV shows?

Instead of having a normal arithmetic unit that worked on one number at a time, it worked on two numbers at once, so as to be faster in processing vectors representing the geographical locations of aircraft.

Computers developed as possible replacements for it, like the TX-0 and the AN/FSQ-32, had accumulators which could do arithmetic on either one long number or two smaller numbers at once, with other choices in some cases. This was very similar to what MMX offered - but with a 36-bit or 48-bit accumulator instead of 64-bit registers.

This is the antecedent to MMX, unlike the very different form of vector processing used in the Cray-1 computer. The Cray-1 was such a successful leap over the other computer systems of its time, and neither today's computers nor past mainframes otherwise embody significant improvements over the design of the IBM System/360 Model 195 (which already had out-of-order execution, cache memory, and hardware floating-point), that it had seemed to me the next logical improvement in computers would be to include features like those of the Cray-1.

But for whatever reason, this has not happened; the last surviving example of that kind of vector processing is NEC's SX-Aurora TSUBASA computer, with a CPU that uses HBM and sits inside what looks like a video card. It has fewer FLOPS than a GPU accelerator, but it may be possible to benefit from its vector capabilities a larger fraction of the time.

And at the time MMX was first introduced, it was believed to stand for Multimedia Extensions.
 
This is the antecedent to MMX, unlike the very different form of vector processing used in the Cray-1 computer.
I don't understand the rationale for this statement. I remember the Crays having a rather ordinary SIMD implementation: differing in technical details of course, but not in function.
 
Any game (BFV, Modern Warfare, etc.) that uses AVX murders my D15's ability to cool my 5900X. I'm talking 80°C just sitting in the menu.
 
I don't understand the rationale for this statement. I remember the Crays having a rather ordinary SIMD implementation: differing in technical details of course, but not in function.
MMX took a 64-bit register, and split it into a variable number of parts, depending on the length of those parts.
The Cray-1 just had banks of 64 registers. So, just as an ordinary computer without an MMX-like feature would use a 32-bit register to hold only one 16-bit number or only one 8-bit number, a computer with a vector architecture like the Cray-1's that also handled vectors of other types, such as 32-bit integers, would handle vectors with exactly as many 32-bit integers as 64-bit floats, not twice as many.
This affects a lot of things: which version of hardware FFT support makes more sense, how to share parts between ALUs for different precisions, and so on.
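A minimal sketch of that splitting, using the MMX intrinsics from <mmintrin.h> and assuming an x86 build with GCC or Clang (e.g. gcc -mmmx); the values are made up purely for illustration:

```c
#include <mmintrin.h>   /* MMX intrinsics */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* _mm_set_pi16 lists lanes from highest to lowest, so these are
       the packed values {1, 2, 3, 4} and {10, 20, 30, 40}. */
    __m64 a = _mm_set_pi16(4, 3, 2, 1);
    __m64 b = _mm_set_pi16(40, 30, 20, 10);

    __m64 c = _mm_add_pi16(a, b);   /* one add, four 16-bit results */

    short lanes[4];
    memcpy(lanes, &c, sizeof lanes); /* pull the four lanes back out */
    _mm_empty();                     /* EMMS: leave MMX state before FP use */

    printf("%d %d %d %d\n", lanes[0], lanes[1], lanes[2], lanes[3]);
    /* prints: 11 22 33 44 */
    return 0;
}
```

The same 64 bits could just as well be treated as eight 8-bit lanes (_mm_add_pi8) or two 32-bit lanes (_mm_add_pi32), which is exactly the variable splitting described above.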
 
The Cray-1 just had banks of 64 registers. [It] would handle vectors with exactly as many 32-bit integers as 64-bit floats, not twice as many.
I believe you're wrong on two counts. First, the Cray architecture had a vector-length setting controlling how many elements of a vector register were interpreted: anywhere from 1 to 64 elements within a single V register.

Secondly, if you require variable-length vector capability as your definition of "early forms of MMX", then computers like the AN/FSQ-7 don't qualify either, as they didn't support such features.
 
With regards to the Cray-1’s CPU:
V registers
Eight 64-element V registers provide operands to and receive results from the functional units at a one clock period rate. Each element of a V register holds a 64-bit quantity. When associated data is grouped into successive elements of a V register, the register may be considered to contain a vector.
 
With regards to the Cray-1’s CPU: (link)
Nice find, thanks. The text is a bit opaque, but I believe that it supports my statement -- a single register could be divided up into up to 64 vector elements.
 
It's worth noting that AMD doesn't offer support for AVX-512 and has no plans to do so. It sees the task of handling large vector calculations as the preserve of the GPU, just as Nvidia does, and both have released products specifically for such roles.

Not so sure about this one. There are quite a few "AVX support levels" (read: some implementations on certain CPUs are much faster than others) with Intel, so AMD could also offer support for AVX-512. How fast that AMD support would be is another question entirely.

But the rise of the GPU does mean that CPUs don't have to sport very big vector units; this is almost certainly why AMD hasn't looked to develop their own successor to AVX2 (an extension they've had in their chips since 2015).

Unless something happens on the communication side, CPU vector instructions are very useful for vector calculations that require low latency. The PCI Express bus is very slow, and I doubt even using an integrated GPU gets anywhere near the latencies that are available when staying inside the CPU.
 
Not so sure about this one. There are quite a few "AVX support levels" (read: some implementations on certain CPUs are much faster than others) with Intel, so AMD could also offer support for AVX-512. How fast that AMD support would be is another question entirely.
While AMD themselves haven't directly commented either way regarding future products and AVX-512, I would argue the fact that they've ignored it for years (unlike AVX & AVX2), and that there seems to be no truly compelling application for it, is reason enough to believe AMD won't bother with it. They may, of course, offer an alternative, as they did with MMX and SSE4, but both of those appeared very rapidly after the original versions came to market.

It may also be a case of AVX-512's biggest problem: the die space it takes up. The image below highlights the SIMD registers in a Zen 2 CCD:

[Image: zen2_simd.jpg - SIMD registers highlighted on a Zen 2 CCD]

Those are 256-bit registers, so the register file would need to be four times bigger to offer full AVX-512 support (as both the size and the number of registers double).

Unless something happens on the communication side, CPU vector instructions are very useful for vector calculations that require low latency. The PCI Express bus is very slow, and I doubt even using an integrated GPU gets anywhere near the latencies that are available when staying inside the CPU.
This is all very true, although since all AMD's workstation & server CPUs support AVX2, and their EPYC range, especially, scales nicely using Infinity Fabric, they have sufficient products and capability to support that specific need. For everything else, that's where their GPUs come into play, and in the case of their CDNA architecture, it uses IF for GPU-to-GPU communication, not PCIe. The system enables the GPU's memory to be fully coherent, too.
 
Thank you, Nick, for this effort of an article, which tries to put such complex instructions in simple terms.
 
While AMD themselves haven't directly commented either way regarding future products and AVX-512, I would argue the fact that they've ignored it for years (unlike AVX & AVX2), and that there seems to be no truly compelling application for it, is reason enough to believe AMD won't bother with it. They may, of course, offer an alternative, as they did with MMX and SSE4, but both of those appeared very rapidly after the original versions came to market.

It may also be a case of AVX-512's biggest problem: the die space it takes up. The image below highlights the SIMD registers in a Zen 2 CCD:

[Image: zen2_simd.jpg - SIMD registers highlighted on a Zen 2 CCD]

Those are 256-bit registers, so the register file would need to be four times bigger to offer full AVX-512 support (as both the size and the number of registers double).

That's one reason why I mentioned different levels of "support". There are many levels, from using microcode (very cheap and compatible but ultra slow) to "full support" (lots of heat and die space but very fast). Zen and Zen 2 both support AVX2, but the Zen 2 implementation is much faster. The same thing can be seen on Intel CPUs: there are different AVX implementations, and some are faster than others.

My point is that there is no single "AVX/AVX2/AVX-512 support". Not all implementations are equally fast, or even close to it.

This is all very true, although since all AMD's workstation & server CPUs support AVX2, and their EPYC range, especially, scales nicely using Infinity Fabric, they have sufficient products and capability to support that specific need. For everything else, that's where their GPUs come into play, and in the case of their CDNA architecture, it uses IF for GPU-to-GPU communication, not PCIe. The system enables the GPU's memory to be fully coherent, too.

Infinity Fabric is still very slow, latency-wise, compared to using AVX. AMD is of course promoting GPU systems for heavy use, but I wouldn't be surprised if AMD also liked to use AVX for low-latency vector calculations. Intel still doesn't have a competitive GPU, so they heavily promote AVX.

Since there is a point to using AVX even when the big number crunching is done on the GPU, AMD may have uses for AVX-512. Offering "more or less support", of course ;)
 
Nice find, thanks. The text is a bit opaque, but I believe that it supports my statement -- a single register could be divided up into up to 64 vector elements.
A vector register in the Cray-1 only had 64 vector elements, but you could choose to use only the first 27, say. Why would you do that? Well, if you were doing a calculation on 539-element vectors, you would break them up into eight 64-element vectors and one 27-element vector.
The 64-element vector registers could be used partially, but they could not be split up in a different way, such as into 128 elements each 32 bits long.
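Put another way, long vectors on the Cray-1 were strip-mined: processed one register-load at a time, with the vector-length setting covering the final partial chunk. Here is a plain-C sketch of that pattern for the 539-element case above (no real vector hardware assumed; VL_MAX stands in for the 64-element register):

```c
#include <stdio.h>

#define N      539
#define VL_MAX  64   /* one Cray-1 V register holds 64 elements */

int main(void)
{
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Eight full 64-element chunks, then one final 27-element chunk. */
    for (int start = 0; start < N; start += VL_MAX) {
        int vl = (N - start < VL_MAX) ? (N - start) : VL_MAX;
        /* On the Cray-1 this inner loop would be a single vector add,
           issued with the vector-length register set to vl. */
        for (int i = 0; i < vl; i++)
            c[start + i] = a[start + i] + b[start + i];
    }

    printf("c[538] = %.1f\n", c[538]);   /* 538 + 2*538 = 1614.0 */
    return 0;
}
```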
 
The 486 was the first desktop CPU to have an integrated FPU. As opposed to the first digital computer, made by Konrad Zuse, which had an integrated FPU back in 1941. It wasn't quite a "microprocessor" and the FPU was only 16-bit, but still... a real FPU. That was before Alan Turing made his "world's first" digital computer, which was actually the world's second. Remember that next time someone says that Turing invented the first digital computer.
 
Interesting article but it could do with a section where all the acronyms are explained. I was continually jumping up and down in the article to remind myself what different acronyms were. There was no explanation as to what MMX stood for. It would be good to have a year when the acronym came into being too.
 
There was no explanation as to what MMX stood for.
It doesn't stand for anything. Initially, it stood for "matrix math extensions" or "multimedia extensions", depending on what marketing/press documentation one looked at in 1996 (I vaguely recall seeing the first one, in my very first job as an online content writer). However, since these are very generic terms, the chances of Intel achieving trademark rights, in all countries, for the initialism of these phrases would have been slim: hence why it officially doesn't stand for anything and MMX and MMX Technology are both trademarked.

It would be good to have a year when the acronym came into being too.
That's indicated in the article:

"In October 1996, Intel launched the 'Pentium with MMX technology'."

"Matters improved in 1999, with the launch of Intel's Pentium III processor. Its shiny vector feature came in the form of SSE (Streaming SIMD Extensions)"

"in 2011, the Sandy Bridge range of CPUs were launched, featuring AVX (Advanced Vector Extensions)."
 
However, since these are very generic terms, the chances of Intel achieving trademark rights, in all countries, for the initialism of these phrases would have been slim: hence why [MMX] officially doesn't stand for anything
My original post on this was deleted, perhaps because of the (since-removed) article typos it identified. Neeyik's statement is correct, except for one minor point: the trademark impediments were not because the phrases in question were generic, but because they were being used in context to describe the applicable product.

This is the reason you can trademark the generic word "apple" to sell computers, but not to sell apples. The generic phrase "Matrix Math Extensions" might be a catchy trademark for Ben & Jerry's ice cream flavor, but not for a processor that performs matrix math.
 
The 486 was the first desktop CPU to have an integrated FPU. As opposed to the first digital computer, made by Konrad Zuse, which had an integrated FPU back in 1941. It wasn't quite a "microprocessor" and the FPU was only 16-bit, but still... a real FPU. That was before Alan Turing made his "world's first" digital computer, which was actually the world's second. Remember that next time someone says that Turing invented the first digital computer.
According to Wikipedia, the Z3 was a 22-bit floating-point calculator that was only shown to be Turing-complete in 1998. It was also destroyed by Allied bombing in 1943.
 
What I want to know, from a gamer's perspective: do we NEED AVX-512 for gaming, or is it more of a dead weight/power hog than it does good?

I've heard contradictory claims about this... so far, from what I know, it does more harm than good.
 
What I want to know, from a gamer's perspective: do we NEED AVX-512 for gaming, or is it more of a dead weight/power hog than it does good?
Need? No. Even AVX isn't really needed, simply because the bulk of the workload in a game that either needs to be vectorised, or benefits from being vectorised, is handled by the GPU. The majority of games aren't heavily CPU-limited, and where they are, it's mostly due to shifting data about.

That said, a good developer will use every trick in the book to improve the performance of the 'hidden' part of a game's engine (i.e. the elements that the end user cannot adjust to balance speed vs quality on the basis of their hardware setup). So if using AVX2 could shave off 0.5 milliseconds (with no other significant penalties involved) in a simulation routine, for example, then it would make sense to do so.
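As a rough sketch of the kind of hand-vectorised inner loop meant here, the snippet below updates particle positions eight floats at a time using AVX intrinsics from <immintrin.h>. The array names, the 0.016 s timestep, and the count being a multiple of eight are all assumptions made to keep it short; a real engine would also keep a scalar fallback for CPUs without AVX. Build with something like gcc -mavx.

```c
#include <immintrin.h>   /* AVX intrinsics */
#include <stdio.h>

#define N 1024           /* assumed to be a multiple of 8 */

static float pos[N], vel[N];

int main(void)
{
    for (int i = 0; i < N; i++) { pos[i] = 0.0f; vel[i] = (float)i; }

    const __m256 dt = _mm256_set1_ps(0.016f);   /* ~60 fps timestep */

    /* pos += vel * dt, eight elements per iteration */
    for (int i = 0; i < N; i += 8) {
        __m256 p = _mm256_loadu_ps(&pos[i]);
        __m256 v = _mm256_loadu_ps(&vel[i]);
        p = _mm256_add_ps(p, _mm256_mul_ps(v, dt));
        _mm256_storeu_ps(&pos[i], p);
    }

    printf("pos[100] = %f\n", pos[100]);   /* roughly 100 * 0.016 = 1.6 */
    return 0;
}
```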
 
Need? No. Even AVX isn't really needed, simply because the bulk of the workload in a game that either needs to be vectorised, or benefits from being vectorised, is handled by the GPU. The majority of games aren't heavily CPU-limited, and where they are, it's mostly due to shifting data about.

That said, a good developer will use every trick in the book to improve the performance of the 'hidden' part of a game's engine (i.e. the elements that the end user cannot adjust to balance speed vs quality on the basis of their hardware setup). So if using AVX2 could shave off 0.5 milliseconds (with no other significant penalties involved) in a simulation routine, for example, then it would make sense to do so.
I see. Well, can I ask another question: are you familiar with CPU performance in Cyberpunk 2077?

Yes, we all know it still needs more patches, but it's one of the few next-gen games that takes advantage of up to 12c/24t, and it does scale with the cores too. From 4c/8t to 12c/24t there is a huge performance gap.

It also uses AVX-512, because it can't run on systems without it (Intel ones had issues without AVX-512).
Here are some links > LINK 1 LINK 2

Would you say that CP 77 benefits from AVX-512? So does Intel have an advantage because of that?
 
Cyberpunk 2077 doesn't require AVX-512 - if it did, no PC using an AMD CPU would be able to run it, nor would any Intel desktop CPU from 10th Gen Core or earlier. The only mainstream CPUs that have AVX-512 are Intel's X-series processors (and they're not really mainstream) and Ice Lake/Tiger Lake laptop chips.

When you look at the 18% difference between the i7-10900K and i9-9900K results in the first link you gave, I would suggest that there are some issues in the testing taking place. This is because there is virtually no difference between the two CPUs, other than the former supporting a 10% higher memory speed and a 2% higher boost clock; beyond that, there are no architectural changes that can account for the 18%.

The other explanation is, of course, a little more obvious: Cyberpunk 2077 is a mess of code.
 
Cyberpunk 2077 doesn't require AVX-512 - if it did, no PC using an AMD CPU would be able to run it, nor would any Intel desktop CPU from 10th Gen Core or earlier. The only mainstream CPUs that have AVX-512 are Intel's X-series processors (and they're not really mainstream) and Ice Lake/Tiger Lake laptop chips.

When you look at the 18% difference between the i7-10900K and i9-9900K results in the first link you gave, I would suggest that there are some issues in the testing taking place. This is because there is virtually no difference between the two CPUs, other than the former supporting a 10% higher memory speed and a 2% higher boost clock; beyond that, there are no architectural changes that can account for the 18%.

The other explanation is, of course, a little more obvious: Cyberpunk 2077 is a mess of code.
So then it must be the other AVX instructions, because there are multiple posts and even mods that FIX the AVX issue in CP 77 for older CPUs...
https://www.nexusmods.com/cyberpunk2077/mods/107?tab=description

Anyway, if there is no AVX-512 then I guess it answers my question: it's useless for games.
 
So then it must be the other AVX instructions, because there are multiple posts and even mods that FIX the AVX issue in CP 77 for older CPUs...
https://www.nexusmods.com/cyberpunk2077/mods/107?tab=description

Anyway, if there is no AVX-512 then I guess it answers my question: it's useless for games.

AVX introduced on Sandy Bridge 2011 (AMD Bulldozer 2011)
AVX2 introduced on Haswell 2013 (AMD Excavator 2015)
AVX-512 introduced on Skylake-X 2017 (AMD ?)

Cyberpunk wanted AVX, not the newer ones. Making AVX-512 mandatory would lock out 100% of AMD CPUs and 99.9% of Intel CPUs (I won't count server CPUs), so AVX-512 won't be needed in games until probably 2033 or later.
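This is also why engines (and mods like the one linked above) check the CPU's feature flags at runtime rather than assuming a given AVX level. A minimal sketch of such a check, assuming GCC or Clang, which provide the __builtin_cpu_supports helper (MSVC would need __cpuidex instead):

```c
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();   /* populate the CPU feature info */

    printf("AVX      : %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    printf("AVX2     : %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX-512F : %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}
```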
 