Under Bulldozer's hood

AMD's new microarchitecture is designed to balance performance, cost and power consumption for multithreaded applications. It focuses on high frequencies and resource sharing to maximize throughput. As mentioned previously, the AMD FX processors offer up to eight power-efficient cores. These represent the first generation of a new execution-core family (Family 15h) from AMD.

The Bulldozer concept is based on a two-core module that shares latency-tolerant functionality, smooths bursty or inefficient usage and provides dynamic resource allocation between threads. Each core has its own 16KB L1 data cache, each two-core module shares a 2MB L2 cache (1MB per core), and the L3 cache is shared across all modules. The units shared between the two cores of a module are the fetch and decode stages, the floating-point pipelines, and the L2 cache.

This design allows two cores to use a larger, higher-performance functional unit (e.g. the floating-point unit) as they need it, with less total die area than separate, smaller functional units for each core would require. It also means that there shouldn't be Bulldozer-based CPUs with an odd number of cores like the Phenom X3 series.

The Zambezi Bulldozer-based processors have a die size of 315mm², which is smaller than the Phenom II X6's 346mm² die but bigger than the Phenom II X4's 258mm² die. The 6-core "Gulftown" Intel Core i7 processors are also smaller at 240mm², and the complex Sandy Bridge chips such as the i7-2600K are 216mm².

A large 32nm die means a lot of transistors, and AMD tells us that the Zambezi architecture has roughly two billion of them. That's pretty incredible given the Intel Core i7-990X Gulftown (32nm) features 1.17 billion while the Core i7-2600K has just 995 million. The older Phenom II X6 processors have 904 million and the Phenom II X4 chips just 758 million. Those numbers help convey just how complex these Bulldozer CPUs really are.
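To put the die sizes and transistor counts quoted above side by side, here is a small sketch that works out transistor density for each chip. The figures are the ones given in this article; the `chips` table and the `density` helper are our own illustrative names, and the result is a rough comparison rather than a statement about any vendor's process.

```python
# Die area (mm^2) and transistor count (millions), as quoted in this article.
chips = {
    "Zambezi (Bulldozer)": (315, 2000),
    "Core i7-990X (Gulftown)": (240, 1170),
    "Core i7-2600K (Sandy Bridge)": (216, 995),
    "Phenom II X6": (346, 904),
    "Phenom II X4": (258, 758),
}


def density(area_mm2, transistors_millions):
    """Return transistor density in millions of transistors per mm^2."""
    return transistors_millions / area_mm2


# Print the chips from densest to least dense.
ranked = sorted(chips.items(), key=lambda item: density(*item[1]), reverse=True)
for name, (area, count) in ranked:
    print(f"{name:30s} {density(area, count):5.2f} M transistors/mm^2")
```

Going by the quoted numbers, Zambezi packs noticeably more transistors into each square millimetre than any of the other chips listed.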

The floating-point unit has also undergone a complete redesign. It has been improved to support many new instructions and it now allows resource sharing between cores. There are two 128-bit FMACs shared per module, allowing for two 128-bit instructions per core or one 256-bit instruction per dual-core module.
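As a back-of-the-envelope illustration of what two shared 128-bit FMACs mean for peak throughput, the sketch below counts floating-point operations per cycle. It assumes, purely for the sake of the arithmetic, that both FMACs issue a fused multiply-accumulate every cycle and that each one counts as two operations per lane; `peak_flops_per_cycle` is a hypothetical helper, and real workloads will land well below this ceiling.

```python
FMAC_WIDTH_BITS = 128   # from the article: each FMAC is 128 bits wide
FMACS_PER_MODULE = 2    # from the article: two FMACs shared per module


def peak_flops_per_cycle(element_bits, modules=1):
    """Theoretical peak floating-point operations per clock cycle.

    Assumption: every FMAC completes one fused multiply-accumulate per
    cycle, counted as two operations (multiply + add) per lane.
    """
    lanes = FMAC_WIDTH_BITS // element_bits
    return modules * FMACS_PER_MODULE * lanes * 2


# Single precision (32-bit elements), one module.
print(peak_flops_per_cycle(32))
# Double precision (64-bit elements), four modules (an eight-core part).
print(peak_flops_per_cycle(64, modules=4))
```

Multiplying the result by a clock speed gives a peak FLOPS figure, although no clock is assumed here.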

AMD has also designed a shared front-end, which is responsible for driving the processing pipeline and ensures that the cores are constantly fed with instructions. It has been designed to work with each dual-core module and allocate threads to the individual cores. AMD has made extensive changes that include decoupled predict and fetch pipelines as well as prediction-directed instruction prefetchers.

A prediction queue manages direct and indirect branches and is fed by L1 and L2 Branch Target Buffers, which store destination addresses. The Bulldozer modules can decode up to four instructions per cycle, which is one more than the Phenom II processors. The prediction pipeline produces a sequence of fetch addresses. The fetch pipeline performs a lookup in the instruction cache and pulls 32 bytes per cycle into the fetch queue to feed the decoders.
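The interplay between the 32-byte fetch and the four-wide decode can be sketched with a very simple model. This is a toy estimate, not AMD's pipeline: it assumes a steady stream of instructions of a given average length (x86 instructions range from 1 to 15 bytes) and ignores branches, alignment and cache misses. `frontend_limit` is a name we made up for the illustration.

```python
def frontend_limit(avg_instr_bytes, fetch_bytes=32, decode_width=4):
    """Estimate instructions per cycle the front-end can sustain.

    The defaults come from the article: 32 bytes fetched per cycle and
    up to four instructions decoded per cycle. Whichever stage delivers
    fewer instructions is the bottleneck.
    """
    fetched_per_cycle = fetch_bytes / avg_instr_bytes
    return min(decode_width, fetched_per_cycle)


# Short instructions: fetch supplies 8 per cycle, so decode is the limit.
print(frontend_limit(4))
# Long instructions: fetch supplies only 3.2 per cycle, so fetch is the limit.
print(frontend_limit(10))
```

In this model the decoders only go hungry once the average instruction grows beyond eight bytes, which suggests the 32-byte fetch leaves a fair amount of headroom for typical code.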

AMD has also built new instructions into the Bulldozer architecture. While AMD and Intel share SSE3, SSE4.1/4.2, AES, and AVX, there are two new instruction sets called FMA4 and XOP that are now unique to AMD. The former is designed for HPC applications while the latter is used for numeric and multimedia applications as well as algorithms used for audio and radio.
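For readers wondering why a fused multiply-add matters to HPC code, the sketch below emulates the idea in software: the product and the sum are computed exactly and rounded only once, instead of rounding after the multiply and again after the add. It uses Python's exact `Fraction` arithmetic as a stand-in and does not execute any FMA4 instruction; the function names and the input values are our own, chosen to make the difference visible.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a*b + c with a single rounding at the very end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # exact, no rounding
    return float(exact)                              # round once


def separate_multiply_add(a, b, c):
    """Ordinary a*b + c: the product is rounded before the add."""
    return a * b + c


# Values picked so the exact product (1 - 2**-54) is not representable
# as a double and gets rounded to 1.0 when computed separately.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print(separate_multiply_add(a, b, c))  # the small term is lost
print(fused_multiply_add(a, b, c))     # the small term survives
```

With these inputs the separate version returns zero while the fused version keeps the tiny residual, which is the kind of accuracy gain (alongside doing two operations in one instruction) that makes FMA attractive for numerical work.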

Unlike Sandy Bridge, which features an on-die GPU with the System Agent (aka northbridge), AMD has taken a more traditional approach with the Bulldozer architecture. The company is avoiding an IGP (integrated graphics processor) altogether with AM3+, leaving that functionality for its 32nm Llano processors, which feature a speedy Radeon core.

The northbridge is also separate from the processor. Even though AMD claims to include an integrated northbridge, it's really just a memory controller. In fact, AMD pioneered this technology back in the Athlon 64 days. Bulldozer's northbridge features two 72-bit-wide DDR3 memory channels and four HyperTransport links, each 16 bits wide in the receive and transmit directions.
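To give a feel for what two DDR3 channels can deliver, here is a rough peak-bandwidth calculation. It assumes each 72-bit channel carries 64 data bits plus 8 ECC bits, as is standard for ECC-capable memory interfaces, and the DDR3-1866 transfer rate in the example is our own choice of input rather than a figure from this article. `peak_bandwidth_gbs` is a hypothetical helper.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels=2,
                       channel_bits=72, ecc_bits=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumption: of the 72 bits per channel, 8 are ECC and only the
    remaining 64 carry data.
    """
    data_bytes_per_transfer = (channel_bits - ecc_bits) // 8
    bytes_per_s = channels * data_bytes_per_transfer * mega_transfers_per_s * 1e6
    return bytes_per_s / 1e9


# Example input (an assumption): DDR3-1866, i.e. 1866 mega-transfers per second.
print(f"{peak_bandwidth_gbs(1866):.3f} GB/s")
```

This is a ceiling on paper; sustained bandwidth in practice depends on the memory controller, timings and access patterns.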