Despite the massive system-level changes in Penwell, the actual CPU core is similar to the previous generation. There were modest tweaks to the microarchitecture, but overall the main improvements were the migration from 45nm to 32nm and physical design changes to reduce power.
The Saltwell core is a dual-threaded, two-issue, in-order, stalling microarchitecture reminiscent of the original Pentium. The instruction set architecture is x86-64 with all the extensions up to and including SSSE3. However, variants optimized for phones will only be available as 32-bit devices, as the memory footprint is limited to 1GB. Versions that are optimized for tablets, low-end notebooks or other form factors that can fit in 2-4GB of memory are likely to have 64-bit support enabled.
The branch predictor is an 8K entry Gshare predictor, twice the size of the previous generation. The 48B post-fetch instruction buffer in Lincroft has been augmented to act as a cache to save power by eliminating repeated instruction fetches in Saltwell. This technique is similar to the Loop Stream Detector that was first implemented in Merom. The L1 I-cache is 32KB and 8-way associative and the single ported L1 D-cache is 24KB and 6-way associative. The instruction issue ports are asymmetric and have similar pairing rules to the familiar U/V pipes of the P55. Other enhancements in the 32nm Saltwell core include more flexible integer instruction pairing and faster memcopy microcode routines. An always-on TSC and local APIC timer have been added, primarily to assist with power management and avoid waking up the CPU.
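To make the Gshare scheme mentioned above concrete, here is a minimal sketch of a generic 8K entry Gshare predictor in Python. Only the table size comes from the text. The 2-bit saturating counters, the 13-bit global history, the use of the raw branch address as the index source, and the class name `GsharePredictor` are textbook assumptions for illustration, not details Intel has disclosed about Saltwell.

```python
class GsharePredictor:
    """Generic Gshare sketch: table index = branch address XOR global history."""

    def __init__(self, entries=8192):
        # Table size must be a power of two so a bit mask can wrap the index.
        assert entries & (entries - 1) == 0
        self.mask = entries - 1
        self.history_bits = entries.bit_length() - 1  # 13 bits for 8K entries
        # Assumption: 2-bit saturating counters, initialised to weakly not-taken.
        self.table = [1] * entries
        self.history = 0  # global history of recent branch outcomes

    def _index(self, pc):
        # The XOR lets the same branch use different counters
        # depending on the recent outcome history.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the actual outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

In this sketch a branch that strictly alternates between taken and not-taken becomes fully predictable once the history register has filled, because each of the two history patterns maps to its own counter. A larger table mainly reduces the chance that unrelated branches collide on the same counter.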
The L2 cache remains at 512KB and 8-way associative. It operates at the core frequency, with a read bandwidth of 32B/cycle and 32 outstanding cache requests. One of the major power improvements in the design was moving the L2 cache to a separate voltage rail from the CPU core. The 6T SRAM in the L2 cache has a significantly higher minimum operating voltage compared to the logic and 8T SRAM cells used in the CPU core. The Vmin for the Saltwell core is 0.7V, while the L2 operates on a fixed 1.05V rail. Intel’s engineers estimate that using separate voltage rails improved the Vmin by around 18mV. Although the core and L2 cache have dedicated voltage rails, they still use on-die power gates. The power gating is significantly faster than adjusting the voltage regulator in the external PMIC, so this technique reduces the latency of C6 power state transitions at the cost of extra die area.
The clock distribution was also improved to reduce power, with finer-grained frequencies. The Z2460 SKU of Medfield varies the CPU frequency in steps of 100MHz all the way up to 1.6GHz, although 1.3GHz is the highest sustained mode. In contrast, the Z600 frequency starts at 200MHz and scales up to 0.8GHz sustained and 1.2GHz for bursts, in coarser increments of 133MHz. The PLL, which generates the core clock, was also tweaked for lower dynamic power consumption.
Figure 2. Saltwell CPU Frequency versus Power
The net result of the 32nm process technology and circuit design improvements is significant. Figure 2 shows the projected power for the 32nm Saltwell core and L2 cache at a number of operating points. As a rough comparison, it consumes about 40% less power at a given frequency than the previous generation Z600 core. Note that the projections assume a 70C junction temperature and represent the worst case single thread application in steady state. Intel’s numbers are marked as estimates, because they represent a median binning of the Medfield parts; thus some SKUs will run cooler and some hotter. Enabling multithreading will increase power consumption, although the impact depends on the workload. According to Intel’s engineers, the extra power consumption is around 11-13% for web browsing, 18% for SPECint2000 and a little over 20% for a power virus.
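As a rough illustration of how the multithreading overheads quoted above translate into absolute power, the following Python sketch scales a single-thread power figure by each workload's overhead. The percentages are the ones cited by Intel's engineers. The helper name `smt_power_range`, the workload keys, and any baseline you pass in are hypothetical and are not measurements from the figure.

```python
# Extra power from enabling multithreading, as (low, high) fractions.
# A high bound of None means only a lower bound was quoted
# ("a little over 20%" for a power virus).
SMT_OVERHEAD = {
    "web_browsing": (0.11, 0.13),
    "specint2000": (0.18, 0.18),
    "power_virus": (0.20, None),
}


def smt_power_range(single_thread_mw, workload):
    """Return (low, high) estimated power in mW with multithreading enabled."""
    low, high = SMT_OVERHEAD[workload]
    low_mw = single_thread_mw * (1 + low)
    high_mw = None if high is None else single_thread_mw * (1 + high)
    return low_mw, high_mw
```

For example, with an assumed (not sourced) single-thread figure of 500mW, `smt_power_range(500.0, "web_browsing")` gives a range of roughly 555mW to 565mW.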
There has been some confusion regarding the Saltwell numbers that Intel cited for the various frequency levels and power states (C0, C1/E, C2/E, C4, C6). Figure 2 shows several C-states in addition to the C0 operating curve for Saltwell. To briefly recap, the C1 state turns off the clock distribution, but the PLL is still ticking (at 600MHz in Figure 2). C1E is similar to C1, except that the voltage is also reduced to the minimum possible for additional power savings. The C2/C2E states are similar to C1/C1E, with the additional twist that the chipset blocks interrupts, which increases the wake-up latency. The reason that Saltwell C0 at 100MHz is lower power than C1E or C2E in Figure 2 is that the PLL is running significantly slower (100MHz vs. 600MHz). Note that the overall power management policies can be tuned in the BIOS by vendors, so a certain degree of variation is expected in shipping products.