Tuesday, June 12, 2007

A quick look at AMD's quad-core Barcelona

Last week, AMD showed off a working quad-core processor at an event in San Francisco. The company had promised a quad-core demo by the end of the year, and it did manage to deliver, even if all the audience saw was a Windows machine running Task Manager. Clearly, the silicon for AMD's next-generation core microarchitecture, codenamed Barcelona (also popularly called "K8L"), has a few kinks left to be worked out.

AMD touts Barcelona as a "true" quad-core processor, because it features a highly integrated design with all four cores on a single die with some shared parts. This is in contrast to Intel's "quad-core" Kentsfield parts, which use package-level integration to get two separate dual-core dies in the same socket. For my part, I'm inclined to agree with AMD that Barcelona is real quad-core and Kentsfield isn't, but I gave up fighting that semantic fight a long time ago. Nowadays, if it has four cores in a single package, I (grudgingly) call it "quad-core."

Barcelona is slated for a 2Q 2007 release, and judging from the bare-bones nature of the recent demo, AMD has some way to go before the new chip is ready for primetime. The hope among AMD fans (and indeed among all savvy consumers who like healthy price/performance competition) is that Barcelona's improved microarchitecture will put AMD back in the running with Core 2 Duo. Right now, Core 2 Duo's microarchitecture (codenamed Conroe) is the undisputed king of per-thread performance. Can K8L take the crown back?

A quick look at Barcelona

There are two levels at which we can examine Barcelona: the die level, and the core microarchitecture level. Let's take a look at the die level first.

The diagram below shows the basic, high-level block architecture of a Barcelona die. Each K8L core has its own private L1 cache (128K) and L2 cache (512K). As in its current dual-core parts, AMD gives each Barcelona core a private L2 cache, because private caches keep the cores from fighting over cache space.

All four cores share a common L3 victim cache that can be expanded based on the processor model. So different Barcelona implementations will have different L3 cache sizes, the way that different Conroe implementations have different L2 cache sizes.
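To make the "victim cache" idea concrete, here is a minimal sketch of the textbook scheme: the shared L3 is filled only by lines evicted from a core's private L2, and a line that hits in L3 moves back into the requesting core's L2. The class names, the tiny cache sizes, and the LRU replacement policy are all my own assumptions for illustration. The sketch ignores coherence between cores, and it is not a description of Barcelona's actual replacement or sharing policy.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache keyed by line address (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, addr):
        """Return True on a hit and mark the line most recently used."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def remove(self, addr):
        self.lines.pop(addr, None)

    def insert(self, addr):
        """Insert a line; return the evicted line address, or None."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None


class VictimHierarchy:
    """Private L2 per core plus one shared L3 that holds only L2 victims."""

    def __init__(self, cores=4, l2_lines=2, l3_lines=4):
        self.l2 = [LRUCache(l2_lines) for _ in range(cores)]
        self.l3 = LRUCache(l3_lines)

    def access(self, core, addr):
        """Return 'L2', 'L3', or 'MEM' depending on where the line was found."""
        l2 = self.l2[core]
        if l2.lookup(addr):
            return "L2"
        if self.l3.lookup(addr):
            # The line leaves L3 and moves back into the private L2.
            self.l3.remove(addr)
            source = "L3"
        else:
            source = "MEM"
        victim = l2.insert(addr)
        if victim is not None:
            # Only lines evicted from an L2 ever enter the L3.
            self.l3.insert(victim)
        return source


h = VictimHierarchy(cores=4, l2_lines=2, l3_lines=4)
trace = [h.access(0, a) for a in (0xA, 0xB, 0xC, 0xA, 0xA)]
print(trace)
```

In this toy model the L3 never duplicates what a core already holds in its own L2, which is the usual argument for a victim arrangement: the shared level adds capacity rather than mirroring the private levels.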

Also situated on the Barcelona die are the now familiar DDR2 and HyperTransport controllers. Barcelona's HT controller has three links, which gives it more bandwidth with which to feed the four cores.

Barcelona's core

From a high-level, block diagram perspective, the K8L core that powers Barcelona is identical to its predecessor designs in the K8 and K7. Of course, a minimally detailed block diagram of Core 2 Duo also looks quite a bit like the Pentium III, so the devil is in the details.

K8L boasts some substantial improvements over its predecessors, many of which will be familiar to students of Conroe. Here's a brief run-down of the major improvements.

First, K8L sports a side-band stack optimizer that does pretty much the same thing as Conroe's stack execution unit. It removes stack pointer updates from the code stream, so that they don't take up dispatch and execution bandwidth.
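A rough way to see the payoff is to count the operations that reach the execution core with and without such a tracker. The sketch below is a toy model of my own, not AMD's design: it assumes each push or pop costs one memory op plus one stack-pointer update, that a side-band tracker absorbs those updates as a running offset, and that a single "sync" op is needed only when a later instruction reads the stack pointer explicitly while an offset is outstanding. The mnemonics, including `use_sp`, are hypothetical.

```python
def count_ops(instrs, sideband=False):
    """Count ops sent to the execution core for a list of mnemonics.

    Hypothetical model: 'push' and 'pop' each need one memory op plus one
    stack-pointer update; 'use_sp' reads the stack pointer explicitly;
    anything else is a single op.
    """
    ops = 0
    pending_delta = 0  # stack-pointer offset tracked at decode (side-band only)
    for ins in instrs:
        if ins in ("push", "pop"):
            ops += 1  # the store or load itself
            if sideband:
                # Track the update on the side instead of issuing an op.
                pending_delta += -8 if ins == "push" else 8
            else:
                ops += 1  # explicit stack-pointer add/sub
        elif ins == "use_sp":
            if sideband and pending_delta != 0:
                ops += 1  # sync op: fold the tracked offset into the register
                pending_delta = 0
            ops += 1
        else:
            ops += 1
    return ops


code = ["push", "push", "add", "pop", "pop", "use_sp"]
print(count_ops(code), count_ops(code, sideband=True))
```

For balanced push/pop sequences like the one above, the tracked offset returns to zero and no sync op is needed at all, so the saving is one op per push or pop.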

The new core design also has an improved branch prediction unit, with a larger branch history table and an indirect branch predictor.

K8L's memory units can do more aggressive load reordering, like Conroe with its memory disambiguation. This should enable the cores to make better use of memory bandwidth, and retire more loads per cycle.

Instruction fetch has also been widened from 16 bytes to 32 bytes. This feature will help keep the core's other IPC (instructions per clock) improvements from being bottlenecked by a lack of fetch bandwidth.
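A back-of-the-envelope calculation shows why fetch width can become the limiter. The function below is a sketch under my own assumptions: it treats sustained instructions per cycle as capped both by the core's width and by how many average-sized instructions fit in one fetch window. The 3-wide core and the 6-byte average instruction length are illustrative numbers, not measurements.

```python
def fetch_limited_ipc(fetch_bytes, avg_instr_bytes, core_width):
    """Upper bound on instructions per cycle given the fetch window size.

    All three parameters are illustrative assumptions, not vendor figures.
    """
    return min(core_width, fetch_bytes / avg_instr_bytes)


# Hypothetical code stream with long (6-byte average) instructions
# feeding a core that could otherwise sustain 3 instructions per cycle.
narrow = fetch_limited_ipc(16, 6.0, 3)
wide = fetch_limited_ipc(32, 6.0, 3)
print(narrow, wide)
```

With these made-up numbers, the 16-byte window cannot supply three instructions per cycle, while the 32-byte window leaves the core width as the only cap.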

Finally, K8L's floating-point/SSE datapaths have been widened from 80 bits to 128 bits, and the corresponding scheduler and execution hardware have been widened as well. This improvement gives K8L's FP/SIMD units the ability to do many common scalar and 128-bit packed single- and double-precision operations at a rate of one per cycle, bringing it up to speed with Conroe in terms of per-unit throughput.
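The effect of the wider datapath on packed operations can be sketched with simple arithmetic. The model below assumes, as a simplification of my own, that an operation wider than the datapath is split into as many passes as it takes to cover all of its bits, and that each pass occupies the unit for one cycle.

```python
import math


def cycles_for_packed_ops(num_ops, op_bits=128, datapath_bits=128):
    """Cycles one execution unit needs for num_ops packed operations.

    Assumes an op wider than the datapath is split into ceil(op/datapath)
    single-cycle passes; this is an illustrative model only.
    """
    passes = math.ceil(op_bits / datapath_bits)
    return num_ops * passes


# 100 packed 128-bit ops on an 80-bit versus a 128-bit datapath.
print(cycles_for_packed_ops(100, datapath_bits=80))
print(cycles_for_packed_ops(100, datapath_bits=128))
```

Under that assumption, any datapath narrower than 128 bits needs at least two passes per packed op, so widening to 128 bits doubles per-unit throughput on packed code.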

It does not seem that K8L will catch up to Conroe in terms of the theoretical peak number of 128-bit SSE operations per cycle, however. K8L's two floating-point/SSE pipes give it two 128-bit SSE ops/cycle, and its FSTORE pipe can do another 128-bit SSE move per cycle, for a total of three per cycle peak. This is half of Conroe's peak theoretical throughput of six 128-bit SSE ops/cycle.

Comparing these kinds of theoretical peak numbers doesn't give you as much information as you might think, though. It takes a pretty specialized instruction mix to get Conroe up to its peak, so the real-world average number of SSE operations per cycle will definitely be less than six, and probably closer to three or so, depending on the code. So all told, Conroe has more theoretical vector processing potential than K8L, but actual benchmark results are going to depend on a number of other factors, like memory bandwidth and latency (an area where K8L will have the edge).

This brings me to my final and most important point about Barcelona: system- and die-level integration factors matter. Folks who have been reading my processor articles over the years, or who pick up my book, are probably accustomed to seeing microarchitectures compared by placing similar functional blocks from each core side-by-side (e.g., the Pentium 4's FPU vs. the G4e's FPU, and so on). This kind of comparison is much less apt for multicore systems with different memory subsystem designs (where "memory subsystem" includes the cache hierarchy). The way that code and data move over the memory bus and through the levels of the cache is a decisive factor in benchmark performance; it's not just about the sheer amount of floating-point, integer, or SIMD hardware in a core. This has always been true, of course, but now different cores from different vendors are tied to different memory subsystems, so these effects are more apparent and dramatic.
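One simple way to picture this point is a roofline-style bound, which is a modeling shortcut I'm borrowing here and not something either vendor publishes: attainable throughput is the lesser of the core's peak compute rate and what the memory system can feed it. The two chips below are entirely made up. "Chip A" has more execution hardware, "chip B" has more memory bandwidth, and the numbers are chosen only to show the crossover.

```python
def attainable_gflops(peak_gflops, mem_bandwidth_gbs, flops_per_byte):
    """Roofline-style bound: compute-limited or bandwidth-limited, whichever is lower."""
    return min(peak_gflops, mem_bandwidth_gbs * flops_per_byte)


# Hypothetical chips; none of these figures describe real hardware.
chip_a = {"peak_gflops": 40.0, "mem_bandwidth_gbs": 6.0}
chip_b = {"peak_gflops": 20.0, "mem_bandwidth_gbs": 10.0}

for intensity in (1.0, 8.0):  # floating-point ops performed per byte moved
    a = attainable_gflops(flops_per_byte=intensity, **chip_a)
    b = attainable_gflops(flops_per_byte=intensity, **chip_b)
    print(intensity, a, b)
```

In this toy comparison the chip with less floating-point hardware wins on code that moves a lot of data per operation, and loses on code that stays in registers and cache, which is why a block-by-block comparison of execution units alone can mislead.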

Further reading

* ExtremeTech, AMD Unveils Barcelona Quad-Core Details
* The Inquirer, AMD quad-core Barcelona laid bare
* The Inquirer, Demo of AMD's Barcelona points the way downhill
* Xbit Labs, K8L Architecture on the 65nm SOI Process Technology in Quad-Core Processors Already in Q2 2007?
* AMD Quad Core, Official Web site
