Intel’s Merom Unveiled


The Memory System

Merom has extraordinary execution resources, and the memory system has been improved in tandem with the rest of the design. As Figure 7 shows, Merom's memory system closely resembles Yonah's, but with the bandwidth of the P4. Note that the P4 uses its fast ALUs to calculate store addresses, which is why it has no dedicated store address unit.

Figure 7 – Memory System Comparison

The caches in Merom and Yonah are both write-back and use 64-byte cache lines. The P4's L1D cache is write-through with 64-byte lines, and its L2 is write-back with 128-byte lines, each broken into two sectors.
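To make the line and sector sizes concrete, here is a minimal sketch of how a byte address maps onto a cache line and, for the P4's L2, onto a sector within that line. The function names are hypothetical and the sketch ignores sets, ways and tags; it only illustrates the arithmetic implied by the sizes above.

```python
def split_address(addr, line_bytes=64):
    """Return (line_number, byte_offset) for a byte address.

    line_bytes=64 matches the Merom/Yonah caches and the P4 L1D.
    """
    return addr // line_bytes, addr % line_bytes


def p4_l2_sector(addr, line_bytes=128, sectors=2):
    """Return which sector of a 128-byte P4 L2 line holds the address.

    Assumes equal-sized sectors (64 bytes each for two sectors).
    """
    sector_bytes = line_bytes // sectors
    return (addr % line_bytes) // sector_bytes


line, offset = split_address(0x1234)   # 64-byte line view
low = p4_l2_sector(0x1234)             # sector within the 128-byte line
high = p4_l2_sector(0x1274)            # 64 bytes further along
```

With 64-byte sectors, two addresses 64 bytes apart within the same 128-byte line fall into different sectors.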

The shared L2 cache for Merom is a non-inclusive, non-exclusive design. Latency numbers were not disclosed, but it is very likely that the L1D cache latency is 2-3 cycles, most likely 2. As previously mentioned, Merom can transfer data directly between the L1D caches in some variants. However, it is currently unknown how often this transfer can occur, how much data is transferred (probably a cache line), and whether such a transaction would replace an L2 cache access.

The Merom memory subsystem also implements new prefetcher designs that work effectively with shared caches. Each L1D cache has several prefetchers, and the L2 prefetchers dynamically allocate bandwidth between the two CPUs, based on data access patterns and intensity, using a modified round-robin algorithm. The front-side bus is similarly arbitrated for fairness.
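Intel has not disclosed how the round-robin algorithm is modified, so the following is only an illustrative sketch of one way a two-requester arbiter could weight its grants by demand. The class name, the `max_burst` cap and the burst formula are all assumptions made for this example, not Merom's actual mechanism.

```python
from collections import deque


class PrefetchArbiter:
    """Hypothetical demand-weighted round-robin arbiter for two cores."""

    def __init__(self, max_burst=3):
        self.queues = [deque(), deque()]  # pending requests per core
        self.max_burst = max_burst        # cap so neither core is starved
        self.turn = 0                     # core that is granted next

    def request(self, core, addr):
        self.queues[core].append(addr)

    def _burst(self, core):
        # Assumed policy: a core with more pending requests than its
        # neighbour gets a longer burst, up to max_burst grants.
        mine = len(self.queues[core])
        other = len(self.queues[1 - core])
        if mine == 0:
            return 0
        return min(self.max_burst, max(1, mine // max(1, other)))

    def drain(self):
        """Grant all pending requests; return a list of (core, addr)."""
        grants = []
        while self.queues[0] or self.queues[1]:
            core = self.turn
            for _ in range(self._burst(core)):
                grants.append((core, self.queues[core].popleft()))
            self.turn = 1 - core  # plain round-robin hand-off
        return grants


arb = PrefetchArbiter()
for a in range(6):
    arb.request(0, a)        # core 0 is the heavy requester
for a in (100, 101):
    arb.request(1, a)        # core 1 is the light requester
order = [core for core, _ in arb.drain()]
```

In this sketch the heavy requester receives bursts of up to three grants, while the light requester is still served on every hand-off, which is the general flavor of allocating bandwidth by access intensity without starving either core.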

The memory system also implements a new technique for memory disambiguation in the MOB, which is described in the next two sections.

