Making x86 Run Cool


Putting It All Together – ‘Cool_x86’

So what would a hypothetical low power 0.18 um x86 processor designed using all these suggestions look like? Here is the outline for one possible implementation I call ‘Cool_x86’. It is based on the Coppermine type P6 core with an enlarged P4 style trace cache and single x86 decoder replacing the instruction cache and triple x86 decoder. The data cache is 32 KB, 4-way set associative with three cycle load-to-use latency. The on-chip L2 cache is 512 KB, 16-way set associative. To minimize power consumption, L2 tag and data lookup is performed sequentially instead of simultaneously so only the data SRAM array associated with the selected way is accessed. The data path from L2 to data cache remains 256 bits wide. At first glance keeping such a wide data cache fill path sounds like an extravagant waste of power. But a narrower path requires more transfers and consumes about the same total energy while lowering performance. The block diagram of Cool_x86 is provided in Figure 4.
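The power saving from serializing the L2 tag and data lookup can be illustrated with a toy access-energy model. The per-array energy numbers below are invented for illustration only; the point is the structural difference in how many SRAM arrays get read per access:

```python
# Toy model of parallel vs. serial tag/data lookup in a 16-way
# set-associative L2 cache. Energy values are illustrative units,
# not measurements from any real design.

WAYS = 16
E_TAG = 1.0    # energy to read one tag array (arbitrary units)
E_DATA = 4.0   # energy to read one data SRAM array (arbitrary units)

def parallel_lookup_energy():
    # Conventional design: all tag arrays and all data arrays are
    # read simultaneously, then a mux selects the hitting way.
    return WAYS * E_TAG + WAYS * E_DATA

def serial_lookup_energy():
    # Cool_x86 style: read the tags first, then access only the
    # data array of the selected way (at the cost of extra latency).
    return WAYS * E_TAG + 1 * E_DATA

print(parallel_lookup_energy())  # 80.0
print(serial_lookup_energy())    # 20.0
```

Under these assumed numbers the serial scheme reads one data array instead of sixteen, which is where most of the per-access savings come from; the latency penalty is more tolerable in an L2 than in a first-level cache.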


Figure 4 Block Diagram of Hypothetical Low Power MPU ‘Cool_x86’

The use of a trace cache permits a shorter execution pipeline than P6, yet the Cool_x86 could likely be clocked at somewhat higher frequencies than the P6 core in the same process (although far lower than the hyperpipelined P4). Conversely, the Cool_x86 core could operate at similar clock frequencies but with a reduced power supply voltage. A comparison of the P6 and Cool_x86 execution pipelines is shown in Figure 5. The short pipeline also has the benefit of reducing the load on the clock net, and the complexity of the bypass and exception recovery logic, all of which reduces power consumption.
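The frequency-versus-voltage trade follows from the first-order dynamic power relation P ≈ a·C·V²·f. As a rough sketch, plugging in the two operating points later listed in Table 2 (constants folded into a single factor, so only the ratio is meaningful):

```python
# First-order dynamic (switching) power: P ~ a * C * V^2 * f.
# Capacitance and activity factor are folded into k, so only
# ratios between operating points are meaningful here.

def dynamic_power(v, f_mhz, k=1.0):
    return k * v * v * f_mhz

hi = dynamic_power(1.70, 1200)   # high-voltage, high-frequency point
lo = dynamic_power(1.35, 850)    # low-voltage, low-frequency point

print(round(hi / lo, 2))  # → 2.24
```

That is, running the same core at the faster point costs over twice the switching power, which is why a design that can hit its target frequency at a reduced supply voltage is so valuable for mobile parts.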


Figure 5 Comparison of the P6 and Cool_x86 Execution Pipelines

Along with a shorter execution pipeline, the Cool_x86 also has reduced branch mispredict penalty, better branch prediction, similar execution resources, and improved on-chip cache hierarchy. This suggests that the Cool_x86 would enjoy higher IPC on integer programs and retain about the same IPC for FP programs despite a higher clock frequency. The hypothetical Cool_x86 design is compared to the recently announced 1000/700 MHz mobile Pentium III in Table 2.

Table 2 Comparison of Mobile PIII and Cool_x86

                                             Mobile PIII [5]     Cool_x86 (est)
  Die Size (mm2)                             106                 150
  Transistors
    L2 cache                                 18.6m               37.2m
    L1 cache                                 3.1m                12.7m
    Logic                                    6.4m                3.9m
    Total                                    28.1m               53.8m
  Instruction Cache                          16 KB               16 Kuops
  Data Cache                                 16 KB               32 KB
  L2 Cache                                   256 KB              512 KB
  Clock Rate
    VDD = 1.35 Volts                         700 MHz             850 MHz
    VDD = 1.70 Volts                         1000 MHz            1200 MHz
  Power (1.35/1.70 V)
    Worst Case Max                           16.1 / 34.0 Watts   15 / 29 Watts
    Typical Active                           11.2 / 24.8 Watts   10 / 22 Watts
    Deep Sleep                               0.40 / 0.93 Watts   0.3 / 0.7 Watts
  Performance (relative)
    Integer                                  1.00                1.38
    SIMD                                     1.00                1.26
    FP                                       1.00                1.20
  Computational Energy Efficiency (relative)
    Integer                                  1.00                1.48
    SIMD                                     1.00                1.35
    FP                                       1.00                1.29
The most noticeable physical differences are the significantly larger die size and transistor count of the Cool_x86, stemming from its 2x larger L2 cache. Despite this disparity, the total power lost to leakage current in the Cool_x86 could likely be kept to about 70% or so of that of the PIII if the leakage in the L2 SRAM array transistors is reduced by an order of magnitude. This might be accomplished by a combination of back biasing the L2 array wells and increasing the gate length and/or threshold voltage of the access transistors within the 6T SRAM cell used in the L2 cache.
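The order-of-magnitude leakage target can be sanity-checked with the standard subthreshold model: leakage current falls by one decade for every subthreshold-slope's worth of threshold voltage increase, where the slope is typically in the 85-100 mV/decade range at room temperature. The specific slope value below is an assumption for illustration:

```python
# Subthreshold leakage falls exponentially with threshold voltage:
#   I_leak ~ 10 ** (-Vth / S)
# where S is the subthreshold slope (typically ~85-100 mV/decade at
# room temperature). S = 90 mV/decade is assumed for illustration.

S = 0.090          # subthreshold slope, volts per decade (assumed)

def leakage_ratio(delta_vth):
    # Relative leakage after raising Vth by delta_vth volts,
    # e.g. via back-biasing the L2 array wells.
    return 10 ** (-delta_vth / S)

# Raising Vth of the L2 access transistors by ~90 mV would cut their
# leakage by about 10x -- the order-of-magnitude reduction discussed.
print(leakage_ratio(0.090))  # → 0.1
```

The back-bias approach is attractive for a large L2 array precisely because the array's access time matters less than the core's, so the speed penalty of a higher threshold voltage is easier to hide.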

The clock normalized switching power of the Cool_x86 was estimated at about 75% of the PIII's, from the replacement of the triple parallel x86 instruction decoders by a trace cache and single decoder, the shorter execution pipeline, the reduced degree of speculation in out-of-order execution, and the reduced number of off-chip memory accesses from the doubled L2 cache size and associativity. The Cool_x86's IPC, averaged over integer, SIMD, and FP intensive programs, is estimated to be 10%, 5%, and 0% higher respectively relative to the PIII, due to the shorter pipeline, improved branch predictor, and improved cache hierarchy.
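The relative energy efficiency figures in Table 2 appear consistent with dividing relative performance by relative power, using the worst-case max power numbers at 1.35 V (16.1 W for the PIII, 15 W for the Cool_x86). A quick check:

```python
# Reproduce Table 2's relative "Computational Energy Efficiency" column
# as (relative performance) / (relative power), using the 1.35 V
# worst-case max power figures: 16.1 W (PIII) vs. 15 W (Cool_x86).

perf = {"Integer": 1.38, "SIMD": 1.26, "FP": 1.20}   # Cool_x86 vs. PIII
rel_power = 15.0 / 16.1                              # Cool_x86 vs. PIII

for workload, p in perf.items():
    print(workload, round(p / rel_power, 2))
# → Integer 1.48, SIMD 1.35, FP 1.29 -- matching Table 2
```

In other words, the Cool_x86 comes out ahead on efficiency both because it does more work per cycle and because it dissipates slightly less power doing it.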


