Inside Nehalem: Intel’s Future Processor and System


Cores, Caches and Interconnects

Intel’s current front-side bus based architecture is quite workable for notebook and desktop systems, but is far from ideal for modern servers and workstations. Notebook systems use one or two cores, are cost and power sensitive, and need low latency for a single task. Desktop requirements are similar, though more sensitive to cost and less sensitive to power; desktops must also handle the extreme bandwidth required by high performance discrete graphics. In contrast, servers can productively use as many cores as they can get and require massive bandwidth and low latency for many different tasks operating simultaneously, all while keeping power within a reasonable envelope. At the higher end of the spectrum, reliability and availability become key concerns for MP servers.

Given that one of the goals for Nehalem was to define a flexible architecture that could be optimized for (rather than shoehorned into) each market segment, a key change was adopting a system architecture that can comfortably address all of these competing goals and requirements. Nehalem represents a complete overhaul of Intel’s system infrastructure, one of the most dramatic changes since the introduction of the P6 in the mid-1990s. Figure 1 below compares the system architectures of Nehalem, Harpertown and Barcelona.


Figure 1 – System Architecture Comparison, B indicates Bytes, b indicates bits

Nehalem is a fully integrated quad-core device with an inclusive, shared last level cache. A central queue acts as a crossbar and arbiter between the four cores and the ‘uncore’ region of Nehalem, which includes the L3 cache, the integrated memory controller and the QPI links. From a performance perspective, an inclusive L3 cache is the ideal configuration since it keeps most cache coherency traffic on-die. On-die communication has the benefit of both lower latency and lower power. Additionally, a shared last level cache can reduce replication. AMD made this transition with Barcelona at 65nm, at the cost of a massive 283mm² die. Intel’s 45nm Harpertown was actually two dual-core devices in a package, which is advantageous due to the smaller die size and more flexible binning options. Depending on the workload, a fair bit of data was probably replicated between the two caches in Harpertown (certainly the working set for the instruction caches, at least). By eliminating this duplication, a unified cache can be smaller yet hold the same amount of data.
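
To make the snoop-filtering benefit of inclusion concrete, here is a minimal sketch, purely illustrative and not Intel’s implementation, of why an inclusive L3 keeps coherency traffic on-die: if a line misses in the L3 tags, inclusion guarantees that no core’s private caches hold it, so no snoops need to be sent to the cores. The class and method names are hypothetical.

```python
# Illustrative sketch (not Intel's implementation) of an inclusive L3
# acting as a snoop filter. Inclusion means every line held in a core's
# L1/L2 is also present in the L3, so an L3 tag miss proves that no core
# on this die holds the line and no per-core snoop is required.

class InclusiveL3:
    def __init__(self):
        # tag -> set of core IDs that may hold the line in their L1/L2
        self.lines = {}

    def fill(self, addr, core_id):
        """A core brings a line in; inclusion forces an L3 allocation."""
        self.lines.setdefault(addr, set()).add(core_id)

    def snoop(self, addr):
        """Return the set of cores that must be snooped for addr.

        An empty set means the request is answered from the L3 tags
        alone, keeping the transaction off the cores' private caches.
        """
        return self.lines.get(addr, set())


l3 = InclusiveL3()
l3.fill(0x1000, core_id=2)
print(l3.snoop(0x1000))  # {2}   -> only core 2 needs to be snooped
print(l3.snoop(0x2000))  # set() -> no core holds it; no snoop traffic
```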

While an integrated quad-core has performance advantages, as Intel itself has previously pointed out, there are costs to this degree of integration. Greater die size means lower percentage yields (and absolute yields) per wafer, which can be problematic early in the lifecycle of a manufacturing process. Greater integration can also reduce the number of top bin devices, since an integrated device runs at the speed of its slowest component, and it increases vulnerability to point defects.
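
A back-of-the-envelope sketch illustrates the point-defect argument. The defect density below is an assumed, purely illustrative number (not Intel or AMD data), and the simple Poisson yield model Y = exp(-D × A) is just one common approximation:

```python
# Hypothetical yield sketch (assumed defect density, not vendor data)
# showing why a large monolithic die is more exposed to point defects
# than two smaller dice combined in a package.

from math import exp

def poisson_yield(area_cm2, defects_per_cm2):
    """Expected fraction of dice with zero point defects."""
    return exp(-defects_per_cm2 * area_cm2)

D = 0.4  # defects/cm^2 -- purely illustrative

y_mono = poisson_yield(2.8, D)  # one ~280 mm^2 monolithic quad-core die
y_half = poisson_yield(1.4, D)  # one ~140 mm^2 dual-core die

print(f"monolithic quad-core die yield: {y_mono:.1%}")  # ~32.6%
print(f"dual-core die yield:            {y_half:.1%}")  # ~57.1%

# A defect kills an entire monolithic die, but only one of the two dice
# in a multi-chip package, so more good silicon survives per wafer and
# the good dice can be speed-binned independently.
```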

Nehalem also replaces the front-side bus with an integrated memory controller and dedicated point-to-point interprocessor interconnects [2]. The QuickPath Interconnect (previously known as the Common System Interface or CSI) has been extensively described in an earlier article. To summarize, QPI is a packet-based, high bandwidth, low latency point-to-point interconnect that operates at up to 6.4GT/s in 45nm (4.8GT/s in 65nm). Each full-width link is implemented as a 20 bit wide interface using high speed differential signaling and dedicated clock lanes (with failover between clock lanes). QPI flits are 80 bits in length and are transmitted in 4-16 cycles, depending on link width. Although flits are 80 bits long, only 64 bits are available for data, with the remainder used for flow control, CRC and other purposes. This means that a full-width link delivers 16 bits, or 2 bytes, of data per transfer, for a total of 12.8GB/s in each direction. Since links are bi-directional, a full-width link actually provides 25.6GB/s of total bandwidth. Implementations of Nehalem will scale the number of QPI links to the target market and system complexity: client systems will have the fewest (as few as one link, or even a half-width link), DP servers will have two, and MP servers will likely have three and a half or four.
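
The bandwidth figures above follow directly from the link parameters; the short sketch below simply reproduces that arithmetic using the numbers given in the text:

```python
# QPI bandwidth arithmetic, using the figures from the text:
# a 20-bit full-width link, 80-bit flits carrying 64 bits of payload,
# and 6.4 GT/s signaling on 45nm.

link_width_bits   = 20    # lanes per direction on a full-width link
flit_bits         = 80    # one flit
flit_payload_bits = 64    # data bits per flit; rest is CRC/flow control
transfer_rate_gts = 6.4   # giga-transfers per second

transfers_per_flit = flit_bits / link_width_bits                # 4 transfers
payload_bytes_per_transfer = flit_payload_bits / transfers_per_flit / 8  # 2 B

per_direction_gbs = payload_bytes_per_transfer * transfer_rate_gts  # 12.8 GB/s
bidirectional_gbs = 2 * per_direction_gbs                           # 25.6 GB/s

print(f"{per_direction_gbs:.1f} GB/s per direction, "
      f"{bidirectional_gbs:.1f} GB/s total per full-width link")
```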

