Hammering x86 into the 64 bit World
In October AMD revealed some aspects of K8, its next generation x86 core code-named Hammer. This new design is primarily distinguished by being the first processor to implement x86-64, AMD’s extension to the x86 instruction set that supports 64 bit flat addressing and 64 bit GPRs, as well as other enhancements. As can be seen in Figure 2, the Hammer core heavily leverages AMD’s highly successful K7 core.
Figure 2 Comparison of K7 Athlon and K8 Hammer Organization
The back end execution engine of the K8 Hammer core is basically identical to that of the K7 except that the integer schedulers are expanded from 5 to 8 ROPs. The greater integer out-of-order instruction scheduling capability this implies may have been intended to better hide the data cache’s two cycle load-use latency, and thus slightly increase per clock performance. An alternative hypothesis is that the latency of some integer operations was increased to allow higher clock rates, and the schedulers were enlarged to prevent a slight loss in per clock performance. The basic execution pipelines of the K7 and K8 are compared in Figure 3.
Figure 3 Comparison of K7 and K8 Basic Execution Pipeline
The K8 execution pipeline has two more stages than the K7, and these new stages seem to be related to x86 instruction decode and macro op distribution to the integer and floating point schedulers. Although some of the stages have been renamed, it appears that the final five pipe stages, representing the back end execution engine, are comparable. This is unsurprising, as the most complex and difficult task in an x86 processor like the K7 or K8 is the parallel parsing of up to three variable length x86 instructions from the instruction fetch byte stream and their decoding into groups of systematized internal operations. In comparison, the execution engine is hardly more complex than that of a typical out-of-order execution RISC processor.
Both the block diagram and execution pipeline indicate that AMD has spent nearly all its effort in Hammer development on revamping the front end of the K7 design. Some of the extra pipelining may be related to the added complexity of decoding yet another level of extensions (x86-64) on top of the already Byzantine x86 ISA. Some may be related to increased flexibility in internal operation dispatch to reduce the occurrence of stall conditions and increase IPC. And some may simply reflect a reduction in the work per stage to increase clock scalability relative to the K7 core. Without a detailed description of each of the pipeline stages in the K8, it is difficult to correlate front end pipe stages in the K7 to those in the K8, and next to impossible to assess how the benefit of the extra two pipe stages is allocated between accommodating increased ISA complexity, measures to increase IPC, and reduction in timing pressure per pipe stage to allow higher clock rates.
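The trade-off described above between pipeline depth, IPC, and clock rate can be made concrete with a very rough model. The sketch below is purely illustrative: the function name, the base CPI, the misprediction rate, the clock rates, and the assumption that the misprediction penalty equals the pipeline depth are all hypothetical placeholders, not AMD figures.

```python
def time_per_instruction_ns(base_cpi, mispredicts_per_instr, penalty_cycles, clock_ghz):
    """Average time per instruction in ns under a simple stall model.

    Effective CPI = base CPI + (mispredictions per instruction * penalty in cycles).
    Dividing cycles by GHz (cycles per ns) gives nanoseconds.
    """
    effective_cpi = base_cpi + mispredicts_per_instr * penalty_cycles
    return effective_cpi / clock_ghz

# All numbers below are hypothetical, chosen only to show the shape of the trade-off.
BASE_CPI = 0.8               # assumed CPI with perfect branch prediction
MISPREDICTS_PER_INSTR = 0.005  # assumed mispredictions per instruction

# Shorter pipeline: smaller penalty, lower assumed clock.
shallow = time_per_instruction_ns(BASE_CPI, MISPREDICTS_PER_INSTR, penalty_cycles=10, clock_ghz=1.6)
# Two more stages: larger penalty, higher assumed clock.
deeper = time_per_instruction_ns(BASE_CPI, MISPREDICTS_PER_INSTR, penalty_cycles=12, clock_ghz=2.0)
# Two more stages with no clock gain: the extra penalty is a pure loss.
deeper_same_clock = time_per_instruction_ns(BASE_CPI, MISPREDICTS_PER_INSTR, penalty_cycles=12, clock_ghz=1.6)

print(f"shallow pipeline:          {shallow:.4f} ns/instruction")
print(f"deeper, higher clock:      {deeper:.4f} ns/instruction")
print(f"deeper, same clock:        {deeper_same_clock:.4f} ns/instruction")
```

Under these made-up inputs the deeper pipeline wins only because the clock rate rises; at an unchanged clock the two extra stages cost a little per clock performance. How the real K8 divides its extra stages between these effects cannot be determined from the information disclosed so far.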
Although the 64-bit instruction set extension makes for attention-grabbing headlines in the technical trade press, the major performance enhancements in the Hammer series are much more prosaic from a processor architecture point of view. These enhancements are the direct integration of interprocessor communications interfaces and a high performance memory controller. Like a “poor man’s EV7”, the Hammer includes three bi-directional HyperTransport (HT) links and a memory controller supporting a 64 or 128-bit wide DDR memory system using unbuffered or registered DIMMs. With registered DIMMs, a K8 processor can directly connect to 8 DIMMs, although this number may be reduced at the higher memory speeds supported. It is interesting to compare the results of the same design philosophy applied to the high-end server and mainstream PC segments of the MPU market, as shown in Table 2. Power and clock rates for the Hammer MPU are estimates.
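Table 2 Comparison of Alpha EV7 and AMD Hammer

                        Alpha EV7                  AMD Hammer
Process                 0.18 um bulk CMOS          0.13 um SOI CMOS
Power @ clock rate      125 W @ 1.2 GHz            ~70 W @ 2 GHz
Interprocessor links    4 links, each 6.4 GB/s,    3 links, each ~6 GB/s
                        one 6.4 GB/s IO bus
Memory interface        2 x 64 bit DRDRAM,         64 or 128 bit DDR,
                        12.8 GB/s peak             2.7 or 5.4 GB/s peak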
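The peak memory bandwidth figures quoted above for Hammer follow from bus width multiplied by transfer rate. As a minimal sketch, assuming the figures correspond to DDR333 memory (an assumption on my part; the memory speed is not stated here), the arithmetic looks like this. The function and constant names are illustrative only.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfers_per_second: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s) for a bus of the given width."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9

# Assumption: DDR333, i.e. a 166.67 MHz clock with two transfers per clock cycle.
DDR333_TRANSFERS_PER_SECOND = 2 * 166.67e6

for width in (64, 128):
    bandwidth = peak_bandwidth_gb_s(width, DDR333_TRANSFERS_PER_SECOND)
    print(f"{width}-bit DDR333: {bandwidth:.2f} GB/s peak")
```

Under that assumption the 64-bit interface comes to roughly 2.7 GB/s, and the 128-bit interface to exactly twice that, a little over 5.3 GB/s; the 5.4 GB/s figure appears to be the rounded 64-bit value doubled.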
Although the Intel McKinley and AMD Hammer are both 64 bit MPUs, these devices are directed at different markets. While the large and expensive McKinley will target medium and high-end server applications, the first member of the Hammer family, code-named “Clawhammer”, will target the high-end desktop PC market. That is not to say that McKinley will outperform the Clawhammer device. Indeed, I expect the AMD device will easily beat the much slower clocked IA-64 server chip in SPECint2K and many other integer benchmarks, as well as challenge much faster clocked Pentium 4 devices in both integer and floating point performance.
Exactly how much performance the Hammer core may provide is the subject of some controversy. AMD’s Fred Weber was quoted as stating the Hammer core could offer SPECint2K performance as much as twice that of current processors. This comment is vague enough to drive a truck through (twice as fast as the best AMD processor? The best x86 processor? The best processor announced but not yet shipping? Running IA-32 or x86-64 code? Clawhammer or the big cache Sledgehammer?). Nevertheless, a few web based news sites interpreted it as meaning the Hammer would achieve 1400 SPECint2K, and now some people are incorrectly attributing this figure to Weber himself. Keep in mind that no Hammer device has even taped out as of the end of 3Q01, let alone been fabricated, debugged, verified, and benchmarked at the target clock frequency. Whatever figure Mr. Weber had in mind was derived from architectural simulation, and for a benchmark suite as cycle intensive as SPEC CPU, simulation results are approximate at best. As has been shown time and time again, it is best not to count performance chickens too closely before the silicon eggs hatch.