Circuit Techniques, Power Savings and More
From a circuit-level perspective, the changes between the K8 and Barcelona are extremely significant. Barcelona is specified to operate across a wide range of voltages, from 0.8-1.4V. Unlike its predecessor, each core in Barcelona has a dedicated clock distribution system (including a PLL) and power grid. The frequency of each core is independent of both the other cores and the various non-core regions; the voltage for all four cores is shared, but separate from the non-core. As a result, power can be aggressively managed by lowering frequency and voltage whenever possible.
To support independent clocking and modular design, asynchronous dynamic FIFO buffers are used to communicate between the different cores and the northbridge/L3 cache. These FIFOs absorb any global skew or clock rate variation, but the latency of passing through them depends on the skew and frequency variance, which is why the L3 cache latency is variable.
The northbridge and L3 cache make up roughly 20% of the die and share a voltage and clock domain that is independent of the four cores, which is essential for mobile applications. Previously, the northbridge clock and voltage were tied to the processors, so systems with integrated graphics could not reduce the processor voltage or frequency to deep power saving states. Separate sleep states, voltages and frequencies for the northbridge and processors should lower AMD’s average power dissipation, which will help in the mobile market.
Figure 6 – Barcelona Die Micrograph
Barcelona also features a dedicated temperature sensor circuit for each core, and a separate one for the northbridge. Each core has 8 sensors on the circuit, while the northbridge contains 6. All the circuits are connected to and controlled by a global thermal control circuit. The global thermal controller uses the results to select power saving modes to reduce the temperature of the device.
One of the trickier areas for AMD’s design team was the SRAM cells for the caches. The L1 caches share a common 1.06µm² cell design. The 6T SRAM cells are read during the first half of the cycle, and then perform a self-timed write and precharge in the latter part of the cycle. The timing for the write is based on extensive Monte Carlo analysis, incorporating lot-to-lot and local process variation, and can be modified post-production with programmable fuses.
The L2 and L3 caches share many design elements, including the SRAM cells. The L2/L3 cells are 0.81µm² and are also single ended for stability, which is unusual. One of the difficulties that AMD’s SRAM designers faced is that, because the same die is used across all product lines, the likelihood of a read disturbance (i.e. reading the wrong data) must be very small. Specifically, a 5 sigma margin across the entire 0.7-1.3V range is required. Unfortunately, the floating body effect of the SOI substrate precluded a more efficient small swing read design. According to AMD’s presentation, a small swing read cell only achieved a 4.53 sigma margin. The single ended design that was chosen had larger margins, which were sufficient for actual product use.
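To give a feel for the gap between a 4.53 sigma and a 5 sigma margin, the sketch below converts each figure into a per-cell failure probability. It assumes the margin can be read as a one-sided tail of a normal distribution, which is a common convention but not something the article states about AMD’s analysis; the function name tail_probability is made up for illustration.

```python
import math


def tail_probability(sigma: float) -> float:
    """One-sided tail of a standard normal distribution beyond `sigma`.

    Assumption: the sigma margin is interpreted as a Gaussian tail; AMD's
    actual statistical methodology is not described in the article.
    """
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))


p_required = tail_probability(5.0)      # the 5 sigma requirement
p_small_swing = tail_probability(4.53)  # margin reported for the small swing cell

# How many times more likely a disturbance is at 4.53 sigma than at 5 sigma.
ratio = p_small_swing / p_required

print(f"5.00 sigma: {p_required:.3e}")
print(f"4.53 sigma: {p_small_swing:.3e}")
print(f"ratio:      {ratio:.1f}x")
```

Under that assumption, missing the target by less than half a sigma makes a read disturbance roughly ten times more likely per cell, which adds up quickly across the millions of cells in the L2 and L3 arrays.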
Shifting to more software oriented matters, Barcelona also adds support for a variety of new instructions. Fortunately, these coincide with the supplemental SSE3 instructions that Intel added to the Core 2. Generally, these instructions are not terribly significant, except for POPCNT (population count), a perennial favorite of intelligence agencies, which counts the number of bits set to 1 in a given register; the number of 0 bits follows by subtracting from the register width. AMD also added support for unaligned SSE loads, as previously mentioned, and it will be interesting to see when or if Intel chooses to follow their lead.
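For readers unfamiliar with population count, here is a small sketch of what the instruction computes, written as two software fallbacks of the kind used when no hardware instruction is available. The function names are illustrative only, and neither routine is meant to represent how Barcelona implements the operation in hardware.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def popcount_naive(x: int) -> int:
    """Count set bits by clearing the lowest set bit until none remain."""
    x &= MASK64
    count = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        count += 1
    return count


def popcount_swar64(x: int) -> int:
    """Branch-free 64-bit population count using parallel bit sums."""
    x &= MASK64
    x = x - ((x >> 1) & 0x5555555555555555)                          # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                          # 8-bit sums
    # Multiplying by 0x0101... accumulates all byte sums into the top byte.
    return ((x * 0x0101010101010101) & MASK64) >> 56


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two 64-bit words differ."""
    return popcount_swar64(a ^ b)
```

The Hamming distance helper hints at why the operation matters in cryptanalysis and pattern matching: comparing two bit vectors reduces to an XOR followed by a single population count.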
More significant to server users are the nested page tables, which improve virtualization performance. One of the drawbacks of Shadow Page Tables (SPT) is that page faults become very expensive, since the VMM is invoked to manage any change to the SPT. The alternative used in Barcelona, Nested (or Extended) Page Tables, is to virtualize the memory management unit itself. On Barcelona each guest has a hardware-walked table that maps guest physical addresses to host physical addresses. Unfortunately, walking these tables can be extraordinarily expensive, so parts of the mappings can be cached as well. While this reduces the performance overhead of virtualization, customers waiting for I/O virtualization will have to wait until 2008 for that particular feature.
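The cost of a nested walk can be sketched with a simple counting model. It assumes four-level page tables for both the guest and the nested tables (as in x86-64 long mode with 4KB pages), and it assumes that every guest page table pointer is a guest physical address that must itself be translated through the nested tables before it can be read. The function names and the all-or-nothing nested_cached flag are simplifications for illustration, not a description of Barcelona’s actual caching structures.

```python
def native_walk_references(levels: int) -> int:
    """Memory references for an ordinary (non-virtualized) page walk."""
    return levels


def nested_walk_references(guest_levels: int, nested_levels: int,
                           nested_cached: bool = False) -> int:
    """Memory references for one guest-virtual to host-physical translation.

    Simplified model: `nested_cached=True` pretends every guest-physical to
    host-physical translation hits a cache and costs no memory reference.
    """
    nested_cost = 0 if nested_cached else nested_levels
    refs = 0
    for _ in range(guest_levels):
        refs += nested_cost  # locate this guest table in host memory
        refs += 1            # read the guest page table entry itself
    refs += nested_cost      # translate the final guest-physical address
    return refs


print(native_walk_references(4))                         # ordinary walk
print(nested_walk_references(4, 4))                      # uncached nested walk
print(nested_walk_references(4, 4, nested_cached=True))  # idealized caching
```

With four levels on each side, the uncached nested walk touches memory 24 times against 4 for a native walk, which is why caching parts of the mappings matters so much.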