Intel’s Merom Unveiled


The Out-of-Order Engine – Execution Units

While Intel was a little shy about the mechanics of renaming and scheduling, they were positively enthusiastic about dispatch and execution in Merom. Merom has three execution dispatch ports, which feed a total of three 128 bit SSE units, two 128 bit floating point units, and three 64 bit integer units. The integer unit on dispatch 1 also handles 128 bit shifts and rotates, and all of the ports can perform FP moves. The FPUs and SSE units also share hardware where appropriate. The execution subsystems of Yonah and especially the P4 are paltry in comparison, as shown below in Figure 6.

Figure 6 – Execution Unit Comparison

The first thing to notice is that Merom has an extra dispatch port compared to the P4 or Yonah, so it can consistently execute up to three instructions each cycle. The P4 can dispatch and execute 4 instructions each cycle, but that is relatively rare: all four must be simple ALU operations, and there is a latency penalty for 64 bit operations. More importantly, Merom has a relatively balanced arrangement of functional units, whereas in the P4 many operations go to dispatch 1, which causes contention. For integer operations, Merom will execute 3 operations per cycle much more consistently than the P4.
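The contention argument can be illustrated with a toy model. The sketch below is not a description of either microarchitecture: it assumes each operation is bound to exactly one port, each port accepts one operation per cycle, and there are no dependencies or latencies. The port labels and operation mixes are hypothetical.

```python
from collections import Counter

def cycles_to_issue(port_of_each_op):
    """Cycles needed when every port accepts one operation per cycle.

    Toy model: each op is tied to a single port and ops are independent,
    so the busiest port sets the total.
    """
    per_port = Counter(port_of_each_op)
    return max(per_port.values()) if per_port else 0

# Hypothetical mixes of nine operations over three ports (labels arbitrary).
balanced = [0, 1, 2] * 3        # three ops per port
skewed = [1] * 6 + [0, 2, 2]    # six ops all need port 1

print(cycles_to_issue(balanced))  # 3
print(cycles_to_issue(skewed))    # 6
```

With the same number of operations and the same number of ports, the skewed mix takes twice as long because one port becomes the bottleneck while the others sit idle.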

Merom also substantially improves on the floating point and SSE capabilities of its predecessors. Although Merom’s 3 SSE units are not fully symmetric, the differences are relatively minor (shifting and multiplication resources). The SSE units are fully pipelined and each one can execute the appropriate 128 bit SSE operation in a single cycle. In comparison, the P4’s SSE resources are somewhat scanty; the two 64 bit SSE units use two cycles to execute 128 bit operations, and therefore are only partially pipelined. Similarly, Yonah only has 64 bit data paths. Merom, by contrast, can perform 4 DP FLOPs/cycle and then some: a 128 bit multiply and a 128 bit add (two double precision results each), plus a 128 bit load, a 128 bit store and then perhaps an ALU or fused compare and jump instruction in the last dispatch port.
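The 4 DP FLOPs/cycle figure follows from simple arithmetic, sketched below. The function name and parameters are illustrative; the only assumptions are the ones stated in the text: a unit as wide as the operation finishes one 128 bit operation per cycle, and a 64 bit unit needs two cycles for the same operation.

```python
def dp_flops_per_cycle(unit_width_bits, op_width_bits=128, element_bits=64):
    """Sustained double precision results per cycle from one execution unit."""
    elements_per_op = op_width_bits // element_bits
    # A unit narrower than the operation needs several passes to finish it.
    cycles_per_op = max(1, op_width_bits // unit_width_bits)
    return elements_per_op / cycles_per_op

# One 128 bit multiply unit and one 128 bit add unit, both busy every cycle.
merom_mul = dp_flops_per_cycle(128)
merom_add = dp_flops_per_cycle(128)
print(merom_mul + merom_add)   # 4.0

# A single 64 bit unit working on 128 bit operations.
print(dp_flops_per_cycle(64))  # 1.0
```

The second result is per unit only; it does not attempt to model the total peak of the P4 or Yonah, which depends on how their units share dispatch ports.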

