Why is Amazon building CPUs?

Jay Goldberg

When it comes to companies rolling their own custom chips, our core thesis is that doing this to save a few dollars on chips is breakeven at best. Instead, companies want to build their own chips when it conveys some form of strategic advantage.

The textbook example is Apple, which ties its chips to its own software to meaningfully differentiate its phones and computers. Or Google, which is customizing chips for its most intense workloads like search algorithms and video encoding. A few hundred million dollars in chip design costs are more than paid back in billions of extra sales for Apple, or billions in capital and operating expense savings for Google. It is important to point out that in both those cases the company completely controls what software is being run on its homegrown chips.

Editor's Note:
Guest author Jonathan Goldberg is the founder of D2D Advisory, a multi-functional consulting firm. Jonathan has developed growth strategies and alliances for companies in the mobile, networking, gaming, and software industries.

So what is in it for Amazon?

For Amazon, and more specifically for AWS, that kind of software control is out of reach. AWS runs everyone else's software, and so by definition AWS cannot control it. They have to run almost literally every form of software in the world. Nonetheless, AWS seems to be working very hard to push their customers to run workloads on their Graviton CPUs. AWS has many ways to lock customers in, but silicon is not one of them. At least not yet.

AWS is probably not doing this to save money on the AMD and Intel x86 CPUs they are buying. The fact that they have two vendors alone means they have ample room for pricing leverage. To some degree, Graviton may be a hedge against the day when Intel stops being competitive in x86. (A point we may have already reached.)

That being said, we think there is a bigger reason – power. The chief constraint in data center construction today is electricity. Data centers use a lot of power, and when designing new ones, companies have to work around a power budget. Now imagine they could reduce power consumption by 20%: that means they could add more equipment in the same electricity footprint, which means more revenue. A reduction in power consumption by one part of the system means a much higher return on the overall investment. Then multiply that gain by 38 as the savings percolate through all of AWS' global data centers.

Now of course the math is a bit more complicated than that. CPUs are only part of a system, so even if Graviton is 20% more power efficient for the same performance versus an x86 chip, that does not really translate into 20% more profit from the data center, but the scale is about right. Switching to an internally designed Arm CPU can generate a sufficient increase in data center capacity to more than offset the cost of designing the chip.
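To put rough numbers on that, here is a back-of-the-envelope sketch. The power budget, the non-CPU share of server power, and the 20% figure are all illustrative assumptions rather than AWS data, so treat the output as directional only.

```python
# Back-of-the-envelope: a fixed data center power budget, and what a 20%
# cut in CPU power buys in extra servers. All numbers are illustrative
# assumptions, not AWS figures.

DC_POWER_BUDGET_W = 1_000_000   # assume 1 MW of usable IT power
NON_CPU_POWER_W = 300           # assumed memory/storage/NIC/fan power per server
CPU_POWER_W = 200               # assumed x86 CPU power per server
CPU_SAVINGS = 0.20              # assumed Graviton saving at the same performance

def servers_in_budget(cpu_w: float) -> int:
    """How many servers fit in the power budget for a given CPU wattage."""
    return int(DC_POWER_BUDGET_W // (cpu_w + NON_CPU_POWER_W))

before = servers_in_budget(CPU_POWER_W)
after = servers_in_budget(CPU_POWER_W * (1 - CPU_SAVINGS))
print(before, after, f"{after / before - 1:.1%}")   # 2000 2173 ~8.7% more capacity
```

With these assumptions a 20% CPU saving shows up as high single-digit extra capacity per data center, which is the "scale is about right" point: well short of 20%, but very meaningful at AWS volumes.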

Taking this a step further, one big obstacle that prevents more companies from moving to Arm workloads is the cost of optimizing their software for a new instruction set. We have touched on this topic before; porting software can be labor intensive. AWS has a big incentive to get their customers to switch, and seems to be doing what they can to make this process easier. However, we have to wonder if this is something of a one-way street.

Once customers make the switch to Graviton, that just shifts the friction. As we said above, today AWS cannot use x86 silicon to lock their customers into their service, but once customers switch to Graviton all that optimization friction shifts to work in AWS' favor, creating a new form of lock-in. Admittedly, the barrier today exists between Arm and x86, not among the various versions of Arm servers. But one of the beauties of working with Arm is the ability to semi-customize a chip, and so it is entirely possible that AWS may introduce proprietary-ish features in future versions of Graviton.

We think Amazon has many other good reasons to encourage the move to their Arm-based Graviton CPU, but we have to wonder if this lock-in is not lingering somewhere in the back of their brains. If true, that just gives the other hyperscalers more reasons to shift to Arm servers as well.


 
An interesting article that was completely negated by this claim.

"To some degree, Graviton may be a hedge against the day when Intel stops being competitive in x86. (A point we may have already reached.)"

They may be losing, or may have lost, their once-dominant lead in x86, but to claim they are no longer competitive is just ridiculous.
 
That being said, we think there is a bigger reason – power. The chief constraint in data center construction today is electricity. Data centers use a lot of power, and when designing new ones, companies have to work around a power budget. Now imagine they could reduce power consumption by 20%: that means they could add more equipment in the same electricity footprint, which means more revenue. A reduction in power consumption by one part of the system means a much higher return on the overall investment. Then multiply that gain by 38 as the savings percolate through all of AWS' global data centers.
And what is the source for that "Graviton power consumption is lower" claim? Amazon, right? You do realize Amazon would Never admit their mighty Graviton chip consumes more power than AMD Epyc even if it really does?

That pretty much invalidates this article.
 
I believe they're referring to the server world, where Intel is having a tough time, though still dominant.


Yeah, each server CPU Amazon can make in-house costs them half the money and half the power of Ice Lake-SP - why wouldn't they invest?

AMD is more competitive, but these Neoverse CPUs are almost as fast!
 
I believe they're referring to the server world, where Intel is having a tough time, though still dominant.
I did some research online, and it seems that in the x86 server market Intel controls about 75%, with AMD recently grabbing 25%. From webtribunal.net: "Intel shipped 7,710,000 server CPUs in Q4 2021 alone. The number of Intel server CPUs on the market signalizes that the company’s share grew by 24.6% in Q4 2021 alone." I still think the author is way off base to claim that Intel is no longer competitive. Now, how much potential volume Intel has lost to Google, Amazon, etc. switching to custom chips, that I don't know.
 
And what is the source for that "Graviton power consumption is lower" claim? Amazon, right? You do realize Amazon would Never admit their mighty Graviton chip consumes more power than AMD Epyc even if it really does?

That pretty much invalidates this article.
Not really. Amazon wouldn't use their own chip if it drew more power than commercially available parts. I've worked with Microsoft and Amazon engineers, and when you consider the size of "cloud" computing, a few watts here or there matter. At least in terms of cost. Oftentimes, operational costs (such as power, cooling, etc.) can be much higher over the lifecycle of the component than the capital costs.
 
AWS is probably not doing this to save money on the AMD and Intel x86 CPUs they are buying.

Well, define "saving" money. I'm sure Amazon et al, can get some of the best pricing that Intel or AMD offers. The issue is when buying a general purpose CPU you don't get the kind of efficiencies that a cloud provider needs. Shaving a few watts of power consumption off a component can save millions for a cloud provider.
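To put rough numbers on it (every input below is an illustrative assumption, not an Amazon figure):

```python
# Rough fleet-wide savings from shaving a few watts per server.
# Every input here is an illustrative assumption, not an Amazon figure.

WATTS_SAVED_PER_SERVER = 10     # assumed saving per machine
FLEET_SIZE = 1_000_000          # assumed number of servers
HOURS_PER_YEAR = 8_760
PUE = 1.4                       # assumed facility overhead (cooling, power distribution)
USD_PER_KWH = 0.10              # assumed electricity price

kwh_per_year = WATTS_SAVED_PER_SERVER * FLEET_SIZE * HOURS_PER_YEAR / 1000 * PUE
print(f"~${kwh_per_year * USD_PER_KWH / 1e6:.1f}M saved per year")   # ~$12.3M
```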

People are rethinking how they design and build data centers now. I talked with one company that is designing data centers in Nevada/Arizona to use ambient cooling at night, since the desert can get quite cool in the evenings. Fanless designs, devices that shut down when not in use, and more are all getting factored into the DC design.
 
An interesting article that was completely negated by this claim.

"To some degree, Graviton may be a hedge against the day when Intel stops being competitive in x86. (A point we may have already reached.)"

They may be losing, or may have lost, their once-dominant lead in x86, but to claim they are no longer competitive is just ridiculous.
Yeah, maybe another way of saying this would be along the lines of "a hedge against the possibility that Intel won't or can't focus on our priorities." I've certainly read stories implying this was the issue with Apple, where they could not get Intel to focus on efficiency to the extent they wanted as early as they wanted, and allegedly had more hiccups in the (was it Skylake?) qualification process than they expected to boot.
 
Not really. Amazon wouldn't use their own chip if it drew more power than commercially available parts. I've worked with Microsoft and Amazon engineers, and when you consider the size of "cloud" computing, a few watts here or there matter. At least in terms of cost. Oftentimes, operational costs (such as power, cooling, etc.) can be much higher over the lifecycle of the component than the capital costs.
And why not? Since Amazon can get their own chips cheaply, that's more than enough reason to use them.

And a few watts don't really matter at all on servers. No, they do not. If watts really mattered, then nobody would have used Intel Prescotts or current Ice Lake server CPUs, since they are hot as Hell compared against AMD offerings.

People keep saying server power consumption matters, but in reality the majority of server CPUs are the hottest ones available 🤦‍♂️
 
Yeah, maybe another way of saying this would be along the lines of "a hedge against the possibility that Intel won't or can't focus on our priorities." I've certainly read stories implying this was the issue with Apple, where they could not get Intel to focus on efficiency to the extent they wanted as early as they wanted, and allegedly had more hiccups in the (was it Skylake?) qualification process than they expected to boot.
Apple started designing the M1 around 2015, when Intel had just released 14nm CPUs, and at that time Apple didn't have any clue about Intel's 10nm problems. Also, Intel's 14nm was pretty much the best manufacturing tech available at that time. That means "efficiency" was not an issue.

Tbh, given that Apple already designed ARM parts for phones, it was not too hard to guess they would also design their own SoCs for bigger machines too. At that point, however, Apple had no idea how TSMC would hold up against Intel, so they kept a low profile. In other words, if TSMC had failed against Intel, then the M1 would never have been released. Apple could afford that.

In other words, Apple designed the M1 and, if it looked better than Intel's offerings, they would switch to it. If not, then they would have just abandoned it.
 
People keep saying server power consumption matters, but in reality the majority of server CPUs are the hottest ones available 🤦‍♂️
As true as this may have been, it doesn't mean times don't change.

I don't know because I'm not in the room, but if the last dozen scrapes the data center permitting team has had with local governments have been more about environmental impact -- I.e., power and water usage -- than square footage, you can expect design parameters to slowly evolve accordingly. Certainly when I read about data centers in the mainstream press, the risks/complaints highlighted tend to be about industry power usage more than anything else.
 
And why not? Since Amazon can get their own chips cheaply, that's more than enough reason to use them.

And a few watts don't really matter at all on servers. No, they do not. If watts really mattered, then nobody would have used Intel Prescotts or current Ice Lake server CPUs, since they are hot as Hell compared against AMD offerings.

People keep saying server power consumption matters, but in reality the majority of server CPUs are the hottest ones available 🤦‍♂️
Watts do matter when you have hundreds of thousands of servers running 7x24. No, a single machine isn't going to matter if it's 10 or 100 watts. But, when you're running over 1M servers as Amazon does, a single watt saved is massive to them.

Many major corporations are abandoning their data centers and moving to the cloud. It is largely cost driven, but also driven by operational efficiencies. Cloud servers can be turned on and off on-demand to save computing (and power) costs.

There are other factors as well, like cooling (which adds to the cost) not to mention density. If a server takes 1000W to run you can't put dozens of them into the same 40U rack space. There simply isn't enough power to accommodate that many servers on one circuit.
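To sketch that math (the circuit rating and 80% derating below are typical assumptions, not any particular provider's numbers):

```python
# Quick rack-power sketch: per-server wattage caps how many servers fit
# on one feed. The 208 V / 30 A circuit and 80% continuous-load derating
# are typical assumptions, not any specific provider's numbers.

usable_w = 208 * 30 * 0.80          # ~4992 W per circuit

for server_w in (1000, 500, 300):
    print(f"{server_w} W servers per circuit: {int(usable_w // server_w)}")
# 1000 W -> 4, 500 W -> 9, 300 W -> 16
```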

So, yes, a few watts do matter to a company like Amazon (or Microsoft or Google) when they are running millions of machines every day. Yes, it matters.
 
The summary of the content is greed...
Such a narrow-minded assertion. Consider the carbon footprint of a 10 yr old data center. Maybe the summary is that large providers like Amazon, Microsoft and Google are interested in reducing that carbon footprint. Maybe they actually think about and are concerned about the impact of that much computing power and want to do something about it.

If making money is nothing more than "greed" then you and everyone else that works for a living must be greedy. Making money is not an inherently evil thing.
 
Power is definitely one of the main selling points. The AWS Well-Architected Framework (WAF) recently added Sustainability as its sixth pillar, and while not called out explicitly in the whitepaper (they tend to stay abstract and principled, rather than concrete, in their recommendations), Graviton processors are linked as a resource for further reading. (https://docs.aws.amazon.com/wellarc...use-instance-types-with-the-least-impact.html)

From the documentation, "Graviton3-based instances use up to 60% less energy for the same performance than comparable EC2 instances." They are also designed to be the best from a price/performance ratio standpoint. So we aren't talking about just a few watts or just a few bits of performance. If I have a server that's running 24/7, there's no reason for me to not use Graviton if my software can run well on it. If I need more performance, a cloud-native architecture would focus on horizontal scaling first, vertical scaling second (although the WAF also calls out that 2 instances at 30% load is more power hungry than a single instance at 60% load).
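To illustrate that last parenthetical, here's a toy linear power model; the idle and max wattages are assumed figures for illustration, not AWS measurements.

```python
# Toy linear power model for the "two at 30% vs one at 60%" point: each
# running instance pays a fixed idle floor. Idle/max watts are assumed
# figures, not AWS measurements.

IDLE_W, MAX_W = 100.0, 300.0

def draw(utilization: float) -> float:
    """Power draw under a simple linear idle-to-max model."""
    return IDLE_W + (MAX_W - IDLE_W) * utilization

print(2 * draw(0.30), draw(0.60))   # 320.0 vs 220.0 watts
```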

x86_64 is definitely better at vertical scaling than ARM at present, and there are plenty of workloads that work better on x64, but ARM isn't anything to shake a stick at in the server space; the largest limitation for ARM these days is software.

That being said, the biggest limitation for x86_64 in AWS these days isn't the hardware, it's the supply chain. AWS has been able to roll out Graviton processors faster than x86_64 can keep up (either due to supply chain shortages, business decisions by AWS, or a combination thereof), so part of the analysis may be unfairly biased towards ARM (like how the M1 was praised for being so much more power efficient, when much of that was due to a more advanced lithographic process).

At work we still primarily rely on the x86_64 instance types, but we've made the switch to Graviton in a few places and have seen improvements in price and performance; software is the main thing holding us back from switching in other places. If it weren't for the software issue I wouldn't have any practical reason to say stick to the x86_64 instances (at least compared to the m5/c5/r5 instance types; we have yet to compare to the newly released 6th gen Intel/AMD types, but that speaks back to supply chain: Graviton2 has been available for over a year, 6th gen Intel/AMD is still rolling out, and meanwhile Graviton3 is dubbed 7th generation and is already rolling out).
 
As true as this may have been, it doesn't mean times don't change.

I don't know because I'm not in the room, but if the last dozen scrapes the data center permitting team has had with local governments have been more about environmental impact -- I.e., power and water usage -- than square footage, you can expect design parameters to slowly evolve accordingly. Certainly when I read about data centers in the mainstream press, the risks/complaints highlighted tend to be about industry power usage more than anything else.
That's what you read in press releases, but since Prescott times almost nothing has changed: Intel Inside is what matters, everything else is nah.
Watts do matter when you have hundreds of thousands of servers running 7x24. No, a single machine isn't going to matter if it's 10 or 100 watts. But, when you're running over 1M servers as Amazon does, a single watt saved is massive to them.

Many major corporations are abandoning their data centers and moving to the cloud. It is largely cost driven, but also driven by operational efficiencies. Cloud servers can be turned on and off on-demand to save computing (and power) costs.

There are other factors as well, like cooling (which adds to the cost) not to mention density. If a server takes 1000W to run you can't put dozens of them into the same 40U rack space. There simply isn't enough power to accommodate that many servers on one circuit.

So, yes, a few watts do matter to a company like Amazon (or Microsoft or Google) when they are running millions of machines every day. Yes, it matters.
And then care to explain why Intel sells more server CPUs than anyone else? Btw, Intel's current server CPUs are about the hottest ones available.

Yeah, power Should matter if you think logically, but no matter how you put it, in reality it does not. Welcome to the real world.
 
And then care to explain why Intel sells more server CPUs than anyone else? Btw, Intel's current server CPUs are about the hottest ones available.

Yeah, power Should matter if you think logically, but no matter how you put it, in reality it does not. Welcome to the real world.
Sure, I'll explain it to you. SW. Most of the SW written out there for servers runs on Intel x86 architecture. As such, Amazon has to support that architecture as do most server vendors. Customers are either not willing or unable to make the conversion to ARM CPUs, but that is changing. On top of that, it's not trivial to spin up new CPU designs, much less build or get the factory space to manufacture them.

However, this article will explain to you why Amazon is looking at building their own CPUs. POWER. Why do you think ARM dominates Intel and AMD in mobile computing? POWER. Intel and AMD are living off decades of x86 SW development and will for some time to come. But, rest assured, ALL of these parts suppliers and service (cloud) providers are looking for ways to reduce POWER consumption.

You might also want to read this article which talks about sustainability, specifically with respect to compute resources. From that paper Amazon states:

  • "Improve the power efficiency of your compute workload by switching to Graviton2-based instances. Graviton2 is our most power-efficient processor. It delivers 2-3.5 times better CPU performance per watt than any other processor in AWS. Additionally, Graviton2 provides up to 40% better price performance over comparable current generation x86-based instances for various workloads."
I'll happily welcome you to the real world, once you get over here instead of living in the world of denial.
 
And why not? Since Amazon can get their own chips cheaply, that's more than enough reason to use them.

And a few watts don't really matter at all on servers. No, they do not. If watts really mattered, then nobody would have used Intel Prescotts or current Ice Lake server CPUs, since they are hot as Hell compared against AMD offerings.

People keep saying server power consumption matters, but in reality the majority of server CPUs are the hottest ones available 🤦‍♂️
The cost of a CPU can be dwarfed by the power to operate it. And who says Amazon is getting their chips "cheaply"? Small quantities, single buyer – hardly a recipe for low pricing. You're thinking way too small. Like I said earlier, a few watts in a single computer doesn't matter. A 40% power reduction across millions of machines does matter.

Yes, people keep saying server power matters, because IT DOES!
 
Intel isn't really taking the problem of power seriously until Arrow Lake, which is clearly now going to be a 2025 release, and god knows when the server chips will be ready. AMD is clearly ahead on efficiency but also has a lot of work to do to compete with ARM-based server CPUs. AMD is yet to reveal its Bergamo cores, but one would expect them to be lower powered than regular cores. Both Intel and AMD will be adding all sorts of accelerators to their CPUs as we go forward, such as for AI, which should help as well. Come Zen 5, Epyc Turin will be offering 256-core Bergamo solutions; I just cannot see Intel competing at all. Sapphire Rapids is delayed until 2023 and with only 56 cores will face off against 96-core regular Epyc and 128-core Bergamo-based Epyc. It appears come Arrow Lake Intel will be massively increasing E-core counts, but no one knows what is happening with P-cores.
 
Sure, I'll explain it to you. SW. Most of the SW written out there for servers runs on Intel x86 architecture. As such, Amazon has to support that architecture as do most server vendors. Customers are either not willing or unable to make the conversion to ARM CPUs, but that is changing. On top of that, it's not trivial to spin up new CPU designs, much less build or get the factory space to manufacture them.
Because AMD has much lower power consumption than Intel on server CPUs, you would expect AMD to be dominant. However, Intel dominates. That also means power consumption is not an issue. It has never been.
However, this article will explain to you why Amazon is looking at building their own CPUs. POWER. Why do you think ARM dominates Intel and AMD in mobile computing? POWER. Intel and AMD are living off decades of x86 SW development and will for some time to come. But, rest assured, ALL of these parts suppliers and service (cloud) providers are looking for ways to reduce POWER consumption.
ARM dominates on mobile because phone makers decided to use ARM decades ago. That had Nothing to do with power, but with the fact that getting an x86 license was pretty much impossible, practically locking you into AMD, Intel, VIA or a few others, whereas an ARM license was very easy to obtain.

For the exact same reason, Amazon uses ARM. Basically it's the only good choice if they want an in-house design. Other choices, well, RISC-V or...

You're running in circles here. ARM is the only option and ARM is "low power" = they chose ARM for low power.
You might also want to read this article which talks about sustainability, specifically with respect to compute resources. From that paper Amazon states:

  • "Improve the power efficiency of your compute workload by switching to Graviton2-based instances. Graviton2 is our most power-efficient processor. It delivers 2-3.5 times better CPU performance per watt than any other processor in AWS. Additionally, Graviton2 provides up to 40% better price performance over comparable current generation x86-based instances for various workloads."
I'll happily welcome you to the real world, once you get over here instead of living in the world of denial.
How about realizing the fact that Amazon Must Have arguments for making Graviton? They just cannot say Graviton sucks even if it does. Also, since you really cannot test Graviton outside the Amazon cloud, checking those claims is pretty much impossible. I have never seen a press release where a company says its own new product sucks.
 
The cost of a CPU can be dwarfed by the power to operate it. And who says Amazon is getting their chips "cheaply"? Small quantities, single buyer – hardly a recipe for low pricing. You're thinking way too small. Like I said earlier, a few watts in a single computer doesn't matter. A 40% power reduction across millions of machines does matter.

Yes, people keep saying server power matters, because IT DOES!
Production costs are pretty small for CPUs. For example, a 400 mm² CPU, assuming a 30% defect rate, only costs around 300 dollars to manufacture. Compare that against AMD and Intel server CPU prices. Design costs are a different matter, but manufacturing alone is cheap.
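Here's roughly the arithmetic behind that kind of estimate; wafer price, die size and yield are assumptions, and the result swings a lot with the wafer price:

```python
import math

# The kind of arithmetic behind that estimate. Wafer price, die area and
# yield are assumptions for illustration; the result moves a lot with the
# wafer price.

WAFER_DIAMETER_MM = 300
WAFER_COST_USD = 15_000          # assumed leading-edge wafer price
DIE_AREA_MM2 = 400
YIELD = 0.70                     # i.e. roughly a 30% defect rate

def dies_per_wafer(area_mm2: float, diameter_mm: float = WAFER_DIAMETER_MM) -> int:
    """Standard approximation: wafer area over die area, minus edge loss."""
    r = diameter_mm / 2
    return int(math.pi * r ** 2 / area_mm2 - math.pi * diameter_mm / math.sqrt(2 * area_mm2))

gross = dies_per_wafer(DIE_AREA_MM2)            # ~143 candidate dies
good = int(gross * YIELD)                       # ~100 good dies
print(f"~${WAFER_COST_USD / good:.0f} per good die")   # ~$150 with these inputs
```

The exact figure depends heavily on the assumed wafer price and yield, but either way the point stands: manufacturing cost per die is a small fraction of x86 server CPU list prices, and design is the big-ticket item.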

The buyer is always right. Since Intel sells the most, power does not matter IRL. It only matters in your imagination.
 
Such a narrow-minded assertion. Consider the carbon footprint of a 10 yr old data center. Maybe the summary is that large providers like Amazon, Microsoft and Google are interested in reducing that carbon footprint. Maybe they actually think about and are concerned about the impact of that much computing power and want to do something about it.

If making money is nothing more than "greed" then you and everyone else that works for a living must be greedy. Making money is not an inherently evil thing.
There is no efficiency gain in processors that can't even run the software that the customer expects. And that's in addition to tests where they are compatible but run much slower.

They are just increasing profit margins to please shareholders, nothing more. If you don't understand the difference between a business "making money in a healthy way, developing services and products that add value and quality, etc." and "pure greed, where the focus is on profit regardless of the means: people don't need it, it's not new, it's not faster, but hey... I will increase my profit by a few %"...

Then you won't understand what's to come when a machine takes your place. No matter what you do, a machine eventually can and will take your place, because the goal is profit. Period.
 
Yes, the goal is profit, although happily enough pleasing customers, reducing waste, and not running afoul of politics & regulators all overlap often enough that I continue to believe that profit-motivated systems are the "worst possible economic systems except for all others ever tried."

Generalities aside, I'm not sure how this specific case of Amazon offering an additional CPU choice to the many already on offer is some evil plot against me. If I don't have a workload it's a better fit for, then I won't use it. If no one has such a workload, then no one will use it, and the goal of "profit. period" would be harmed by wasted development. So by your own logic AWS must be pretty convinced they have a market that wants this option (and I wouldn't be shocked if one of those customers is Amazon itself).
 