Making a Mountain Out of a Molehill


If at First You Don’t Succeed, Try, Try Again

In many respects the reputation of Intel and the competitiveness of its IA64 system partners are riding on the shoulders of McKinley. Despite its importance, relatively little official technical information about McKinley has been disclosed. Intel has said that it will run in excess of 1 GHz and offer at least twice the performance of Itanium. Unofficial sources even suggest that McKinley will demonstrate 2.5x the integer performance and 2.0x the FP performance of Itanium as measured by SPEC2k. It is known that the McKinley system interface has a peak bandwidth of 6.4 GB/s, at least double that of current high-end 64-bit RISC processors and triple Itanium’s 2.1 GB/s capability.

As I have mentioned in a previous article, the best clues outsiders have of the nature of the McKinley device came courtesy of a technical paper that was never presented. As was widely reported in the trade press, Intel mysteriously withdrew its McKinley paper from ISSCC 2001. The abstract for this paper was subsequently deleted from the ISSCC web site but the withdrawal came too late to excise it from the advance printed program mailed out to previous attendees [4]. What the abstract revealed about McKinley is summarized in Table 3.

Table 3 Comparison of McKinley to Itanium (Merced)

| | Itanium (Merced) | McKinley | Comment |
|---|---|---|---|
| Process | 0.18 um CMOS | 0.18 um CMOS | Same basic process |
| Transistors | 25 million | 214 million | Lots more cache |
| Die area | >300 mm² (est) | >400 mm² (est) | McKinley CPU core reputed to be smaller |
| On-chip cache | 2 levels, 128 KB total | 3 levels, >2.5 MB total (est) | McKinley on-chip L3 avoids external SRAM, MCM packaging |
| Clock rate | 800 MHz | 1.2 GHz | 50% faster |
| Pipeline | 10 stages | 7 stages | 30% shorter |

The fact that McKinley can outperform Itanium by at least 2x with only a 50% clock rate advantage implies that the newer design achieves at least 33% higher IPC over a wide range of applications (2.0 / 1.5 ≈ 1.33). If McKinley has a 2.5x performance advantage on SPECInt2k, that implies a 67% higher IPC (2.5 / 1.5 ≈ 1.67). No doubt McKinley benefits from its larger on-chip cache and shorter execution pipeline. Itanium has less on-chip cache than a Celeron, and Intel tried to make up for it with 2 or 4 MB of L3 cache implemented using custom high speed SRAMs incorporated in the Itanium MCM package. But these two factors cannot by themselves explain the apparent IPC improvement on integer code.
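The IPC estimates above follow from treating performance as the product of clock rate and IPC, so the implied IPC gain is the performance ratio divided by the clock ratio. The short Python sketch below reproduces that arithmetic using the clock rates from Table 3. The function name `implied_ipc_gain` is my own illustrative choice, and the model assumes that the instruction count for a given benchmark is the same on both chips, which is only an approximation.

```python
def implied_ipc_gain(speedup, clock_ratio):
    """Fractional IPC improvement implied by an overall speedup.

    Assumes performance = clock rate x IPC with an unchanged
    instruction count, so IPC ratio = speedup / clock ratio.
    """
    return speedup / clock_ratio - 1.0


# Clock rates taken from Table 3: 1.2 GHz McKinley vs 800 MHz Itanium.
clock_ratio = 1200 / 800

# 2.0x is Intel's stated minimum; 2.5x is the unofficial SPECInt2k figure.
for speedup in (2.0, 2.5):
    gain = implied_ipc_gain(speedup, clock_ratio)
    print(f"{speedup:.1f}x performance -> {gain:.0%} higher IPC")
```

The same function can be used to see how sensitive the estimate is to the final shipping clock rate: a lower clock at the same performance ratio would imply an even larger IPC gain.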

Some observers have suggested that McKinley gains its IPC performance from being a “four banger” implementation of the IA64 instruction set architecture, that is, one able to issue four three-instruction bundles per clock. This idea may have come from Intel’s disclosure that McKinley has more functional units than Itanium. But I think this is very unlikely, simply because of the complexity, die size, and power implications of going to 12-instruction-wide fetch, issue, and execution in a 0.18 um technology. It is even an open question whether EPIC compiler technology will ever be able to find enough instruction level parallelism (ILP) in most programs to keep a four banger adequately fed. Instead, I will try to show that extra functional units can significantly improve the IPC performance of a two banger IA64 MPU compared to Itanium and can reasonably account for a good portion of McKinley’s increase in effective IPC.

