The K8 system architecture is already quite good. For four socket servers, it clearly rules the x86 roost. The two major issues for AMD systems were the lack of quad core processors, and poor eight socket server performance.
The single most emphasized selling point for Barcelona is that it integrates four processor cores, bringing AMD to parity with Intel’s Xeon 53xx series. The Xeon 53xx, codenamed Clovertown, is actually a pair of dual core Woodcrest processors in a multi-chip package (MCP). The two dice communicate over the front-side bus, rather than through an on-chip bus or caches. In contrast, AMD has opted for a shared cache approach, where the last level of cache, the L3, is shared by all four cores. Figure 1 below compares the Revision F Opteron, Barcelona and Intel’s upcoming 3GHz Clovertown.
Figure 1 – System Architecture Comparison
The architects that designed Barcelona opted for a fully integrated MPU. A monolithic device ultimately provides higher performance, especially for bandwidth sensitive workloads that don’t benefit from caching, such as HPC or data mining. However, like any engineering decision, it does not come without trade-offs. First of all, fully integrating everything is a decision that must be made at the beginning of the project. An MCP approach is far less time consuming and can use a slightly modified existing product; most importantly, these changes can be made late in the design cycle. Monolithic devices also have lower yields, because the larger die size means fewer candidate dice per wafer, and each random defect destroys a larger share of the wafer’s usable silicon. Monolithic MPUs are also more difficult to bin for frequency, since to run at a given speed, all four cores must reach that target within the allowed power dissipation. However, there are design techniques that will let an MPU with a slow core and a fast core run at the slow speed, but with lower power.
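The yield and binning trade-off can be put into rough numbers. The sketch below uses a simple Poisson yield model, which is a textbook approximation rather than anything either vendor has published. The wafer area, die area and defect density are hypothetical values chosen only for illustration, and the monolithic die is assumed to be exactly twice the size of a dual core die.

```python
import math


def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of dice with zero random defects under a Poisson model."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


def good_quad_units(wafer_area_cm2: float,
                    dual_core_area_cm2: float,
                    defects_per_cm2: float) -> tuple:
    """Return (monolithic, mcp) good quad core units per wafer.

    Edge loss is ignored, and the monolithic die is assumed to be
    twice the area of the dual core die (an illustrative assumption).
    """
    quad_area = 2 * dual_core_area_cm2
    monolithic = (wafer_area_cm2 / quad_area) * poisson_yield(
        quad_area, defects_per_cm2)
    good_duals = (wafer_area_cm2 / dual_core_area_cm2) * poisson_yield(
        dual_core_area_cm2, defects_per_cm2)
    mcp = good_duals / 2  # any two good dual core dice can share a package
    return monolithic, mcp


def monolithic_bin(core_fmax_ghz: list) -> float:
    """A monolithic part can only ship at the speed of its slowest core."""
    return min(core_fmax_ghz)


if __name__ == "__main__":
    # Hypothetical inputs: 700 cm^2 wafer, 1.5 cm^2 dual core die, 0.2 defects/cm^2.
    mono, mcp = good_quad_units(700.0, 1.5, 0.2)
    print(f"monolithic: {mono:.0f} units, MCP: {mcp:.0f} units")
    print("bin speed:", monolithic_bin([3.0, 2.8, 3.1, 2.6]), "GHz")
```

Under this model the MCP advantage in good units per wafer is exp(area × defect density), so it grows with both die size and process immaturity. The binning function shows the other half of the argument: one slow core sets the speed grade for the whole monolithic die, whereas an MCP vendor can pair dice that tested at similar speeds.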
While AMD’s marketing department likes to bill their approach as a ‘native’ or ‘true’ quad core design, the truth is that both approaches are equally valid; a fact belatedly recognized by some of AMD’s own executives. Intel’s Clovertown is a quad core device. Operating systems recognize Clovertown as four processors, and it certainly offers higher performance for many applications than a dual core MPU. However, it is equally true that in most situations performance favors fully integrated quad cores.
In the case of Barcelona, the advantages of greater integration have been augmented by careful attention to I/O bandwidth. The memory controllers in Barcelona received a major overhaul. The most visible change is that each controller supports independent 64B transactions, rather than a single 128B transaction across both controllers (memory mirroring is also supported now). Since DDR2 bursts stay at 32B, this improves command efficiency. However, when using DDR3, the command efficiency will drop because the burst length will double to 64B. Each controller also supports a separate set of open pages in DRAM, which is controlled by a new history based pattern predictor (which is somewhat analogous to a simple branch predictor). The predictor uses both per-bank access history and page accesses across banks to decide whether to keep a page open to improve performance, or close the page to reduce power. Lastly, Barcelona introduces data poisoning, which ensures that if a double bit error is detected by ECC, it is contained and only impacts the process which first accesses it, rather than crashing or corrupting the whole system.
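The branch predictor analogy can be made concrete with a small model. The sketch below keeps one 2-bit saturating counter per DRAM bank, in the style of a simple branch predictor. It is only an illustration of the idea: AMD has not disclosed the real structure, the class and method names are hypothetical, and the cross-bank history mentioned above is left out entirely.

```python
class PagePolicyPredictor:
    """Per-bank 2-bit saturating counters; high values favor keeping a page open.

    Hypothetical model for illustration, not Barcelona's actual predictor.
    """

    def __init__(self, num_banks: int) -> None:
        self.counters = [2] * num_banks      # start at weakly "keep open"
        self.last_row = [None] * num_banks   # most recent row seen per bank

    def keep_open(self, bank: int) -> bool:
        """Predict whether leaving the current page open is likely to pay off."""
        return self.counters[bank] >= 2

    def access(self, bank: int, row: int) -> bool:
        """Record an access and return the updated keep-open prediction."""
        previous = self.last_row[bank]
        if previous is not None:
            if previous == row:
                # Same row again: an open page would have been a hit.
                self.counters[bank] = min(3, self.counters[bank] + 1)
            else:
                # Different row: an open page would have cost a precharge.
                self.counters[bank] = max(0, self.counters[bank] - 1)
        self.last_row[bank] = row
        return self.keep_open(bank)


if __name__ == "__main__":
    predictor = PagePolicyPredictor(num_banks=4)
    for row in (7, 7, 7):          # streaming within one page
        predictor.access(0, row)
    for row in (1, 2, 3, 4):       # every access lands on a new page
        predictor.access(1, row)
    print(predictor.keep_open(0), predictor.keep_open(1))
```

A bank that keeps hitting the same row drifts toward the keep-open policy, while a bank with scattered accesses drifts toward closing pages early, which saves power and avoids a precharge on the critical path of the next access.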
While revision F Opteron processors supported DDR2, there was little performance advantage over DDR, if any. To actually take advantage of the available bandwidth for DDR2, deeper request and response queues are needed; these changes were not made in revision F, but are present in Barcelona. AMD also introduced a 16-20 entry write buffer in the memory controller, so that writes can be deferred, avoiding costly bus turn-arounds. Lastly, the memory controllers now support DRAM prefetchers that share the write buffer and can detect positive and negative strides. Server versions of Barcelona will support registered DIMMs at up to 667MHz, and desktop versions will work with slightly faster 800MHz DDR2.
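Stride detection of the kind described for the DRAM prefetcher can be sketched in a few lines. The version below tracks a single access stream and issues a prefetch once the same non-zero stride has been seen twice in a row. The 64B line size, the single stream and the two-in-a-row confirmation rule are all assumptions made for the example; the real prefetcher's tables and thresholds are not public.

```python
from typing import Optional


class StridePrefetcher:
    """Single-stream stride detector (hypothetical illustration)."""

    def __init__(self, line_bytes: int = 64) -> None:
        self.line_bytes = line_bytes
        self.last_line: Optional[int] = None
        self.last_stride: Optional[int] = None

    def observe(self, address: int) -> Optional[int]:
        """Record a demand access; return an address to prefetch, or None."""
        line = address // self.line_bytes
        prefetch = None
        if self.last_line is not None:
            stride = line - self.last_line
            # Confirm the pattern: same non-zero stride twice in a row.
            # The stride may be positive (ascending) or negative (descending).
            if stride != 0 and stride == self.last_stride:
                target = line + stride
                if target >= 0:
                    prefetch = target * self.line_bytes
            self.last_stride = stride
        self.last_line = line
        return prefetch


if __name__ == "__main__":
    ascending = StridePrefetcher()
    print([ascending.observe(a) for a in (0, 64, 128)])
    descending = StridePrefetcher()
    print([descending.observe(a) for a in (1024, 896, 768)])
```

The first stream walks upward one line at a time and the second walks downward two lines at a time; in both cases the third access is the first one that confirms the stride and triggers a prefetch.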
Barcelona also adds a fourth HyperTransport link for interprocessor communications and I/O devices. With four links, system vendors can build fully connected four socket systems; this reduces transaction latency substantially, since all processors can be reached with a single hop. Each node within the system could even have an attached I/O hub (see our preview of Barcelona). However, the current socket infrastructure only supports three HT1.1 links, so these innovative system designs will have to wait for a new socket interface. Initially, each link will run at 2GT/s, but the links are compatible with HyperTransport 3.0 and future parts may operate at up to 5.2GT/s in newer systems. HT3.0 can also modulate link width and frequency to save power. Coherent HyperTransport also features a slight change that will improve latency for some transactions. When a K8 fetches a cache line into its L1D or L2, it has to snoop the system and wait for the results. In particular, the K8 will snoop memory and every other cache in the system; once it gets all of these responses, it can use the cache line it fetched. However, in Barcelona, if a requested cache line is in the M or O state (meaning that another cache holds the dirty copy and the copy in memory is stale), the CPU does not wait for the response from memory, improving the transaction latency. The newer protocol also adds a retry mechanism to survive transient errors at higher clock rates.
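The single-hop claim is easy to check with a breadth-first search over the socket graph. The sketch below compares a fully connected four socket system against a square arrangement in which each socket links to only two neighbors. The square is an assumed layout for a system with fewer coherent links per socket, used here purely as a point of comparison.

```python
from collections import deque


def hop_counts(adjacency: dict, source: int) -> dict:
    """Breadth-first search: minimum number of hops from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_remote_hops(adjacency: dict) -> float:
    """Mean hop count over all ordered pairs of distinct sockets."""
    total = 0
    pairs = 0
    for src in adjacency:
        for dst, hops in hop_counts(adjacency, src).items():
            if dst != src:
                total += hops
                pairs += 1
    return total / pairs


# Assumed square layout: each socket spends two links on its neighbors.
SQUARE = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
# Fully connected layout: three coherent links per socket.
FULL_MESH = {n: [m for m in range(4) if m != n] for n in range(4)}

if __name__ == "__main__":
    print("square:", average_remote_hops(SQUARE))
    print("full mesh:", average_remote_hops(FULL_MESH))
```

In the square, the socket on the opposite corner is two hops away, so a snoop that must reach every cache is bounded by the two-hop path. In the full mesh every remote socket is exactly one hop away, and the fourth link is still free for an I/O hub.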
HyperTransport 3.0 also adds a feature called ‘unganging’ or link-splitting. Each HT3.0 link is actually composed of 16-bit wide lanes running in each direction. These lanes can be split up into a pair of independent 8-bit wide links. This is fairly useful for connecting to I/O devices, as few systems have enough I/O devices to saturate a full 8GB/s interface; even 8 SAS hard drives and a pair of 10GbE cards would not require that much bandwidth. However, AMD has also pitched link-splitting as a way to build fully interconnected 8 socket servers. Previous generation Opteron-based systems supported up to 8 sockets, but the performance was positively underwhelming on the few benchmarks that have been published (mainly SAP 2 tier and SPECjbb2005). While Barcelona will offer higher performance 8 socket implementations, it isn’t clear how much demand there is from end-users. Sun and Fujitsu currently sell 8 socket servers, but both HP and Dell shelved earlier efforts in 2003.
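The arithmetic behind link-splitting can be written out directly. The sketch below assumes that the 8GB/s figure counts both directions of a 16-bit link at 2GT/s, and that a fully interconnected system needs one coherent link from each socket to every other socket. The function names are hypothetical and the model ignores protocol overhead.

```python
def link_bandwidth_gb_s(width_bits: int, transfers_gt_s: float) -> float:
    """Raw bandwidth of one link in one direction, in GB/s."""
    return width_bits / 8 * transfers_gt_s


def sublinks_after_unganging(full_links: int) -> int:
    """Each 16-bit link can be split into two independent 8-bit links."""
    return full_links * 2


def spare_links_in_full_mesh(sockets: int, links_per_socket: int) -> int:
    """Links left over after connecting a socket to every other socket.

    A negative result means a full mesh is not possible.
    """
    return links_per_socket - (sockets - 1)


if __name__ == "__main__":
    per_direction = link_bandwidth_gb_s(16, 2.0)
    print("16-bit link at 2GT/s:", per_direction, "GB/s each way,",
          2 * per_direction, "GB/s total")
    print("8-bit link at 2GT/s:", link_bandwidth_gb_s(8, 2.0), "GB/s each way")
    print("8 sockets, 4 ganged links:", spare_links_in_full_mesh(8, 4))
    print("8 sockets, 8 unganged links:",
          spare_links_in_full_mesh(8, sublinks_after_unganging(4)))
```

With four ganged links a socket is three links short of reaching seven peers directly. Splitting them into eight 8-bit links covers all seven peers with one link to spare, at the cost of halving the bandwidth of each connection.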