Alpha Gets Stretched
Unfortunately, such massive hardware resources don’t come without a penalty. In this case, register file access occupies 3 of the 18 physical pipe stages in the EV8’s basic execution pipeline. To minimize the performance impact of the superpipelined register file, each functional unit contains its own register cache, which stores copies of the unit’s generated results for the previous 8 cycles. The functional unit register caches assist both in local bypass and in aligning results for register write back. The EV8 design has a branch misprediction penalty of 17 clock cycles (nearly as high as that of the Pentium 4). With most high-end processors, a branch mispredict penalty of 17 cycles would be a major performance pothole, but in the EV8 it is simply another opportunity to exploit the benefits of SMT. The stages of the logical pipeline of the EV8 are listed in Table 1. Keep in mind that each logical pipe stage may be superpipelined across multiple physical clocking stages. For example, as previously mentioned, the register file (RF) pipe stage takes 3 clock cycles to complete.
Table 1 EV8 Logical Pipeline Stages
thread per cycle
Instruction Buffering / Slotting
Register File Access
Cache access stage 1
Cache access stage 2
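The per-unit register cache described earlier can be pictured as a small sliding window over recent results. The sketch below is a toy software model, not the EV8 circuit: the class name, the one-result-per-cycle limit, and lookup by a physical register tag are all assumptions made for illustration. Only the 8-cycle depth comes from the text.

```python
from collections import deque


class RegisterCache:
    """Toy model of a per-functional-unit register cache (hypothetical design)."""

    def __init__(self, depth_cycles=8):
        # One slot per cycle; the oldest slot falls out automatically.
        # Each slot holds (register_tag, value), or None if the unit
        # produced no result in that cycle.
        self._window = deque(maxlen=depth_cycles)

    def tick(self, result=None):
        """Advance one cycle, recording this cycle's (tag, value) if any."""
        self._window.append(result)

    def lookup(self, tag):
        """Return the newest value for `tag` still in the window, else None."""
        for entry in reversed(self._window):
            if entry is not None and entry[0] == tag:
                return entry[1]
        return None  # miss: the value must come from the register file


rc = RegisterCache()
rc.tick((5, 42))          # unit writes value 42 to register tag 5
for _ in range(7):
    rc.tick()             # seven cycles with no new result
hit = rc.lookup(5)        # still inside the 8-cycle window
rc.tick()                 # an eighth idle cycle evicts the entry
miss = rc.lookup(5)
```

In this model a hit stands in for a local bypass, while a miss means the operand has to be read through the multi-cycle register file path.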
The EV8 is more than twice as deeply pipelined as the venerable EV6 processor core. Yet this does not represent a change in basic design philosophy back to the pure speed racer camp that might have characterized the early Alpha implementations. It is simply an accommodation of the physical laws of nature. The ratio between transistor switching speed and the signal propagation time to cross a given fraction of the die has changed radically since the EV6 was designed in a 0.35 um process technology. That is, transistors have gotten much faster with the reduction in feature sizes, while absolute interconnect performance has changed relatively little.
However, MPU designers are expected to increase clock rates in proportion to transistor speed. The net effect is that the relative distance on an MPU die that a signal can travel in a single clock stage shrinks with each new semiconductor process technology. That is why the EV8 designers adopted much deeper pipelines and schemes like placing register caches within each functional unit. Simply put, it is almost always desirable to replace wire-intensive design elements with ones that rely more heavily on transistors and localized communications.
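As a rough illustration of this scaling argument, the sketch below assumes that the clock period shrinks linearly with feature size while signal delay per millimeter of wire stays constant. Both are deliberate simplifications, and the function name is hypothetical. Under those assumptions, the distance a signal can cover in one clock is simply proportional to feature size.

```python
def relative_reach(feature_um, baseline_um=0.35):
    """Distance covered per clock, relative to the baseline process node.

    Assumes (for illustration only) that clock period scales linearly with
    feature size and that wire delay per millimeter does not change.
    """
    return feature_um / baseline_um


# Process nodes mentioned in the article: EV6 at 0.35 um, EV8 at 0.13 to 0.07 um.
for node_um in (0.35, 0.13, 0.10, 0.07):
    reach = relative_reach(node_um)
    print(f"{node_um:.2f} um: {reach:.2f}x the 0.35 um reach per clock")
```

With these assumptions, a signal at 0.13 um reaches a bit over a third as far per clock as it did at 0.35 um, and about a fifth as far at 0.07 um. Real designs would deviate from this simple ratio, but the direction of the trend matches the argument above.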
Figure 2 EV8 Floorplan
Physically, the EV8 would have been an impressive device, incorporating 250 million transistors on a 420 mm² die. The floorplan of the EV8 is shown in Figure 2. Although the initial EV8 targeted a 0.13 um SOI CMOS process, the Alpha design team balanced circuit and physical design for an effective lifetime across three process generations. The performance sweet spot was the 0.10 um technology node. The initial 0.13 um EV8 implementation would have been somewhat transistor limited, while the final 0.07 um device would have been somewhat interconnect (RC) limited.
Despite being tuned for peak efficiency in its second process generation, the initial EV8 device would still likely have achieved industry-leading performance. Running at a predicted 1.8 GHz, the EV8 was estimated to deliver 2x the single-thread performance and 4x the throughput of the soon-to-be-released EV7, a 0.18 um bulk CMOS design running at 1.2 GHz. The SOI processing that was to be used for the EV8, a first for an Alpha processor, was described as permitting 10 to 15% higher clock rates than bulk CMOS at a 0.13 um feature size, provided critical circuits were redesigned to work well with floating transistor bodies. The expected power consumption was 150 W at 1.1 volts.
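A little arithmetic on the quoted figures shows how much of the projected gain would have come from clock rate and how much from work done per clock. The sketch below uses only numbers stated in the article; treating performance as clock rate times work per clock is a simplifying assumption, and the variable names are illustrative.

```python
# Figures quoted in the article.
EV8_CLOCK_GHZ = 1.8
EV7_CLOCK_GHZ = 1.2
SINGLE_THREAD_SPEEDUP = 2.0   # EV8 vs EV7, estimated
THROUGHPUT_SPEEDUP = 4.0      # EV8 vs EV7, estimated
POWER_W = 150.0
SUPPLY_V = 1.1
DIE_AREA_MM2 = 420.0

# Share of the speedup explained by clock rate alone.
clock_ratio = EV8_CLOCK_GHZ / EV7_CLOCK_GHZ

# Remaining gain per clock, assuming performance = clock * work per clock.
per_clock_single_thread = SINGLE_THREAD_SPEEDUP / clock_ratio
per_clock_throughput = THROUGHPUT_SPEEDUP / clock_ratio

# Supply current from P = V * I, and average power density over the die.
supply_current_a = POWER_W / SUPPLY_V
power_density_w_per_cm2 = POWER_W / (DIE_AREA_MM2 / 100.0)

print(f"clock ratio:             {clock_ratio:.2f}x")
print(f"per-clock single thread: {per_clock_single_thread:.2f}x")
print(f"per-clock throughput:    {per_clock_throughput:.2f}x")
print(f"supply current:          {supply_current_a:.0f} A")
print(f"power density:           {power_density_w_per_cm2:.1f} W/cm^2")
```

On these numbers the clock accounts for a 1.5x gain, leaving roughly 1.33x per clock for a single thread and about 2.67x per clock in throughput. The 150 W budget at 1.1 volts works out to roughly 136 A of supply current, or about 36 W/cm² averaged across the die.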