Countdown to IA-64


IA-64 Performance – A Moving Target

To ensure the competitiveness of IA-64, Intel and its partners must invent, optimize, and verify new compilation, code generation, and code transformation algorithms, and deliver them to software developers in the form of robust and reliable compiler tool chains to create a pool of native applications. IA-64 compilers have been in development since the architecture was created about six years ago (and the underlying concepts for much longer than that), and they will likely remain a hotbed of research and development activity long after McKinley is launched. Reports suggest there are still serious issues with both compile times and compiler robustness on real-world applications. Some observers even suggest that the extended pre-production phase of Merced has more to do with shortcomings in the compilers than with the silicon. So expect benchmark results for IA-64 processors to escalate gradually over their lifetime. When the 60 MHz Pentium first shipped, Intel reported a SPECint92 score of a bit under 60; several years later the same processor was scoring 70.4, an increase of more than 15%. Expect the same dynamic to be at work for IA-64 processors, only to an even larger degree.

The integer performance of IA-64 processors on benchmarks and commercial applications is the most difficult aspect for outsiders to predict. IA-64 has the three-address instructions and large register files used to great effect in RISC processors, as well as uncommon or new features like full predication and powerful logical conjunctive operators that can help tackle even difficult, conditional-branch-intensive integer code. These features are not only helpful on their own, but they also allow compilers to transform poorly structured code into new forms that permit more intensive exploitation of existing optimization techniques [8]. On the other hand, IA-64 devices lack the dynamic instruction scheduling found in high performance x86 and RISC processors. Out-of-order execution is most beneficial to integer code and is generally credited with increasing performance by 30% or more. The relative influence of these various positive and negative factors will likely vary significantly from application to application. In some cases IA-64 processors will perform amazingly well. In others the rigidity of in-order execution will dominate and IA-64 processors will fall behind x86 and RISC processors of similar complexity and process technology.
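To make the predication point concrete, here is a minimal sketch in C (not from the original article, with hypothetical function names) of if-conversion, the kind of transformation that full predication enables. The branchy form forces the hardware to predict a data-dependent branch; the converted form expresses both outcomes so an IA-64 compiler can guard each operation with a predicate register and commit only the result whose predicate is true.

```c
/* Hypothetical illustration of if-conversion, the transformation that
 * IA-64 full predication enables.  Not code from the article. */

/* Branchy form: the hardware must predict a data-dependent branch,
 * which mispredicts often when x is unpredictable. */
int clamp_branchy(int x, int limit)
{
    if (x > limit)
        x = limit;
    return x;
}

/* If-converted form: both outcomes are expressed without control flow.
 * An IA-64 compiler would compute the comparison into predicate
 * registers and guard the competing operations with them; in portable C
 * the same idea shows up as a branchless select the compiler can map to
 * a predicated or conditional move. */
int clamp_predicated(int x, int limit)
{
    int take_limit = (x > limit);      /* plays the role of a predicate */
    return take_limit ? limit : x;     /* no branch needs to be predicted */
}
```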

When it comes to floating point intensive code the picture is much clearer. FP code tends to be more loop-intensive, and thus more predictable in both the flow of execution and the flow of data into and out of the caches (unsurprisingly, the first VLIW processors were developed by Multiflow and Cydrome for scientific and technical computing applications). The FP performance of IA-64 machines shouldn't differ much from that of RISC machines with similar issue width, number of execution units, and memory hierarchy characteristics. IA-64 includes features like rotating registers to improve the performance of loop-based code, but these will likely act more as a substitute for true register renaming than as a source of extra performance.
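As a rough sketch of why rotating registers help loop code (illustrative only, not drawn from the article): software-pipelined loops overlap iterations, so each in-flight iteration needs its own copy of the loop's temporaries. IA-64 supplies those copies by rotating the register file each iteration; a compiler targeting a machine without rotation has to materialize them itself, roughly as in this hand-pipelined C version of a scale-and-add loop (the function name is hypothetical).

```c
/* Hand software-pipelined scale-and-add loop: a hypothetical sketch of the
 * renaming work that IA-64 rotating registers perform automatically.
 * With rotation, the compiler keeps one copy of the loop body and the
 * hardware gives each iteration a fresh register; without it, the copy
 * and shift below must be done explicitly. */
void scale_add(double *y, const double *x, double a, int n)
{
    if (n <= 0)
        return;

    double t0 = x[0];             /* prolog: first load issued ahead of its use */
    for (int i = 0; i + 1 < n; i++) {
        double t1 = x[i + 1];     /* stage 1: load belonging to the next iteration */
        y[i] += a * t0;           /* stage 2: multiply-add for the current iteration */
        t0 = t1;                  /* explicit shift; register rotation does this renaming in hardware */
    }
    y[n - 1] += a * t0;           /* epilog: drain the last in-flight value */
}
```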

Predicting how well a radical new processor design will work in practice is fraught with difficulty and pitfalls. Nevertheless, I will attempt to assess how the first and second generations of IA-64 processors will fare relative to their competitors. Table 1 lists the characteristics and performance of the Compaq Alpha EV68, the Sun UltraSPARC-III, and the Intel Merced/Itanium. Keep in mind that much faster second-generation EV68 devices will likely be introduced at the same time as Intel's infamous chip. Unless the compiler effort to date has been disastrous, the six-fetch, six-issue Itanium should turn in respectable benchmark numbers. Bear in mind, though, that over a large set of applications performance will be all over the map, so watch out for Intel cherry-picking individual applications for inclusion as benchmarks in its marketing campaign.

<b>Table 1 IA-64 vs RISC – Round One</b>
|                          | Compaq Alpha EV68 | Sun UltraSPARC-III | Intel Itanium |
|--------------------------|-------------------|--------------------|---------------|
| Process                  | 0.18 um Al        | 0.18 um Cu         | 0.18 um Al    |
| Die Size (mm2)           | 193               | 232                | 300 (est)     |
| Transistors (million)    | 15.2              | 23                 | 25.4          |
| Maximum Power (W)        | 60                | 65                 | 100 (est)     |
| Clock Rate (MHz)         | 833               | 900                | 800           |
| Instruction/Data Cache   | 64 / 64 KB        | 32 / 64 KB         | 16 / 16 KB    |
| Level 2 Cache            | off-chip          | off-chip           | 96 KB         |
| Peak Instruction Fetch   | 4                 | 4                  | 6             |
| Peak Instruction Issue   | 6                 | 6                  | 6             |
| Basic Execution Pipeline | 7 stages          | 14 stages          | 10 stages     |
| System Bandwidth (GB/s)  | 2.7               | 2.4                | 2.1           |
| SPECint2k, base          | 518               | 438                | 450 (est)     |
| SPECfp2k, base           | 590               | 427                | 600 (est)     |

The estimated characteristics and performance of the Compaq Alpha EV7, the IBM POWER4, and the Intel McKinley (Itanium 2?) are given in Table 2.

<b>Table 2 IA-64 vs RISC – Round Two</b>
|                          | Compaq Alpha EV7 | IBM POWER4        | Intel McKinley     |
|--------------------------|------------------|-------------------|--------------------|
| Process                  | 0.18 um Cu       | 0.18 um Cu        | 0.18 um Al         |
| Die Size (mm2)           | 397              | 400 (est)         | 450 (est)          |
| Transistors (million)    | 152              | 174               | 214                |
| Maximum Power (W)        | 140 (est)        | 120 (est)         | 130 (est)          |
| Clock Rate (GHz)         | 1.4 (est)        | 1.2 (est)         | 1.2 (est)          |
| Instruction/Data Cache   | 64 / 64 KB       | 64 / 32 KB        | 16 / 16 KB (est)   |
| Level 2 / Level 3 Cache  | 1.75 MB          | 1.5 MB            | 0.5 / 2.5 MB (est) |
| Peak Instruction Fetch   | 4                | 4                 | 6                  |
| Peak Instruction Issue   | 6                | 8                 | 6                  |
| Basic Execution Pipeline | 7 stages         | 14 stages         | 7 stages           |
| System Bandwidth (GB/s)  | 44.8             | 92                | 6.4                |
| SPECint2k, base          | 1100 (est)       | 850 (1 CPU, est)  | 900 (est)          |
| SPECfp2k, base           | 1400 (est)       | 1200 (1 CPU, est) | 1250 (est)         |

At the recent Intel Developer Forum (IDF), Intel demonstrated working McKinley silicon and stated that the device would enter pilot production at the end of this year. It has been known for a year and a half that McKinley would be a much faster device than Merced ('at least twice the performance') and would clock above 1 GHz. The figure 1.2 GHz was mentioned in the advance program of ISSCC 2001 as part of the title of the McKinley paper, which was later withdrawn prior to presentation [9]. The fact that McKinley can be clocked much faster than Merced in the same process is all the more intriguing because the second-generation IA-64 device will have three fewer stages in its basic execution pipeline, seven instead of ten, as well as more functional units. It is known that very experienced HP designers, many of them veterans of multiple generations of PA-RISC, contributed to McKinley, but a 50% frequency bump while shortening the pipeline by 30% and adding functional units is still no mean feat.

Another possibility that should worry competitors is that Intel might skip production of the original 0.18 um version of McKinley in favor of a 0.13 um shrink of the device. The huge die size needed to realize the ~3 MB of on-chip cache in a 0.18 um process means Intel will have great economic incentive to shrink McKinley, even if it means extra competition for 0.13 um wafer starts with future Xeon, Pentium 4, and Pentium III devices. A quick shrink to 0.13 um would not only make McKinley a much more practical device to manufacture (a consideration that always figures prominently in Intel's thinking) but would likely push clock rates to 1.6 GHz or faster. In fact, I expect Intel's ability to launch IA-64 devices in advanced processes one or two years ahead of its RISC competition will prove far more formidable than any actual architectural advantage.

