Silvermont and Saltwell are tailored for mobile devices, where integer workloads are dominant and floating point is often an expensive luxury. Silvermont will also be used in microservers, but this is another area where floating point is not particularly beneficial. More to the point, customers looking for good FP performance already have an excellent solution in the form of Haswell and possibly a GPU as a co-processor.
Silvermont and Saltwell are both dual-issue microarchitectures, and have a similar balance of execution units. However, Silvermont was reworked for higher and more consistent performance, as well as support for more recent instruction set extensions, such as SSE4.2 and AES-NI.
The pipeline for Saltwell is optimized for instructions with register and memory operands, one of the complex aspects of the x86 instruction set. The execution part of the pipeline includes three stages for data cache (L1D) access and feeds into a fourth stage for executing integer instructions (FP or SIMD instructions often take additional cycles). Even register-only instructions must incur this three cycle latency, which is relatively inefficient.
More problematic is that any cache access uses port 0 and a miss ends up stalling dependent instructions, although the multi-threading in Saltwell avoids some of these challenges: a stalled issue queue can simply switch to the other (hopefully unstalled) thread. Similarly, an instruction that takes resources in both ports blocks any dual-issue. As a practical matter, this is quite common; examples include an FP add or an integer shift with a memory operand, an indirect branch, and any PUSH or POP.
Silvermont decouples memory access from execution, which reduces the length of the basic integer pipeline by three stages and also avoids propagating memory related stalls throughout the pipeline. The effective throughput on Silvermont is significantly higher by virtue of out-of-order scheduling. Operations are dispatched as soon as they are ready, even if an earlier instruction is stalled. Silvermont can theoretically execute two integer and two FP instructions in parallel, one in each of the execution units; as a practical matter this only occurs after a significant stall, since the peak sustained throughput of the entire pipeline is two instructions per cycle. Another big improvement is the forwarding network, which has no delays between any functional units. In contrast, Saltwell has significant penalties for moving data between the integer and floating point clusters.
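The difference between stalling in order and dispatching around a stall can be illustrated with a toy model. The sketch below does not model either core: the `Op` record, the `run` function, the 10-cycle miss latency, and the six-instruction trace are all made up for illustration. It also assumes every destination register is written only once (i.e., already renamed) and that the scheduling window is unbounded. It only shows why letting younger, ready instructions issue past a stalled one shortens the total run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    dest: str        # register written (assumed unique, i.e. already renamed)
    srcs: tuple      # registers read
    latency: int     # cycles from issue until the result is available


def run(program, out_of_order, width=2):
    """Return the cycle in which the last result becomes available."""
    # Registers produced inside the program are unavailable until their
    # producer has issued; any other register is a live-in, ready at cycle 0.
    ready_at = {op.dest: float("inf") for op in program}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        issued = 0
        for op in list(pending):              # scan oldest first
            if issued == width:
                break
            if all(ready_at.get(s, 0) <= cycle for s in op.srcs):
                pending.remove(op)
                ready_at[op.dest] = cycle + op.latency
                finish = max(finish, cycle + op.latency)
                issued += 1
            elif not out_of_order:
                break                         # in-order: younger ops must wait
        cycle += 1
    return finish


# Hypothetical trace: one load that misses the cache, one dependent add,
# and a short chain of adds that does not depend on the load.
program = [
    Op("load r1", "r1", (), 10),
    Op("add r2, r1", "r2", ("r1",), 1),
    Op("add r3, r4", "r3", ("r4",), 1),
    Op("add r5, r3", "r5", ("r3",), 1),
    Op("add r6, r5", "r6", ("r5",), 1),
    Op("add r7, r6", "r7", ("r6",), 1),
]

in_order_cycles = run(program, out_of_order=False)
out_of_order_cycles = run(program, out_of_order=True)
```

In this trace the in-order run issues nothing else until the load returns, while the out-of-order run overlaps the independent adds with the miss and finishes sooner.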
As shown in Figure 5, each distributed scheduler will dispatch the oldest, ready to execute µop to the appropriate port. The Silvermont integer schedulers contain input operands for the associated µops, so that the input data is dispatched along with the µop. On the integer side, both port 0 and port 1 execute basic ALU and logical operations and port 0 also contains an integer shifter. In Silvermont, port 0 performs LEA instructions with unscaled addressing (i.e., base + index + offset). Port 1 resolves branch instructions as well as basic bit processing.
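The "oldest, ready to execute" selection rule can be sketched in a few lines. This is only an illustration of the policy, not of the hardware: the `Entry` record, the `pick_oldest_ready` function, and the sample queue are hypothetical, and a real scheduler would make this choice in parallel logic rather than by scanning a list.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    age: int       # lower value = older in program order
    name: str
    srcs: tuple    # source registers the uop needs before it can execute


def pick_oldest_ready(queue, ready_regs) -> Optional[Entry]:
    """Return the oldest entry whose sources are all available, or None."""
    candidates = [e for e in queue if all(s in ready_regs for s in e.srcs)]
    return min(candidates, key=lambda e: e.age, default=None)


# Hypothetical scheduler contents for one port.
queue = [
    Entry(age=0, name="add r2, r1", srcs=("r1",)),        # oldest, r1 missing
    Entry(age=1, name="shl r5, 3", srcs=("r5",)),
    Entry(age=2, name="and r7, r6", srcs=("r6", "r7")),
]

picked = pick_oldest_ready(queue, ready_regs={"r5", "r6", "r7"})
```

Here the oldest entry is skipped because its source is not yet available, so the next-oldest ready entry is chosen instead.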
In Saltwell, integer multiply instructions were performed in the FP cluster with relatively poor performance. With both operands in registers (as opposed to constants), latency is 5-13 cycles, and throughput is one multiply every 2 cycles for 32-bit operands and one every 7 cycles for 64-bit operands. Silvermont includes a dedicated integer multiplier in port 1 that reduces latency to 3-5 cycles with 1-2 cycle throughput. The integer multiplier serves double duty and is also used to calculate fully scaled LEA instructions (i.e., base + index*scale + offset). Silvermont also handles POPCNT instructions through port 1.
Saltwell actually executes LEA instructions in a dedicated AGU on port 1, so Silvermont simplifies the memory cluster by moving that functionality into the integer execution units as well as doubling the throughput for unscaled addressing.
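To make the two LEA forms concrete, the arithmetic can be written out as below. The function names `effective_address` and `lea_path` are hypothetical, and the "alu" versus "multiplier" labels simply restate the split described above (unscaled forms on the simple integer path, scaled forms through the multiplier); they are not an exact description of how Silvermont decodes the instruction.

```python
MASK64 = (1 << 64) - 1   # addresses wrap at 64 bits


def effective_address(base, index=0, scale=1, offset=0):
    """Compute base + index*scale + offset as an LEA would."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + offset) & MASK64


def lea_path(scale):
    """Label which path the text above associates with this LEA form."""
    return "alu" if scale == 1 else "multiplier"


unscaled = effective_address(0x1000, index=3, scale=1, offset=16)
scaled = effective_address(0x1000, index=3, scale=8, offset=16)
```

The unscaled form only needs additions, which is why it can share a basic integer unit, while the scaled form adds a multiply by 2, 4 or 8.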
For Saltwell and Silvermont, the floating point units are designed to handle SSE data types that are 128 bits wide and contain either integer or floating point data elements. However, to save power, not all instructions are executed with single cycle throughput, particularly on the floating point side. Another power optimization is to handle overflow and other FP exceptions in microcode, rather than aiming for higher performance.
To reduce power, Silvermont’s FP reservation stations do not hold any data, so the input operands are read from the FP architectural register file and FP rename buffers after dispatch. Both ports contain vector ALUs for basic arithmetic and logical operations. Port 0 executes SIMD integer shifts and shuffles for any data type. Perhaps most importantly, port 0 contains a 64-bit vector multiplier, which is used for both integer and floating point data. For 128-bit SIMD integer data, the throughput is one operation every two cycles. In Saltwell, scalar single precision FP multiply has 4 cycle latency and is fully pipelined; packed single precision and scalar double precision (or 80-bit extended precision) multiplies take an extra cycle and have half the throughput. Packed double precision is pitifully slow, with 9 cycle latency and a throughput of one multiply every 9 cycles. Silvermont has a much faster double precision multiply and also improved the speed of FP division, which is unpipelined.
Many of the new instructions in Silvermont also execute in port 0, including AES-NI (which is microcoded), SSE4.1 blending, the string handling instructions in SSE4.2, and carry-less multiplication for encryption.
The floating point unit on port 1 mainly consists of the FP adder. On Saltwell, scalar FP adds have 5 cycle latency and are fully pipelined, while packed FP adds have 6 cycle latency and are half pipelined. Silvermont has a more robust FP adder with lower latency and higher throughput.
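The practical meaning of latency versus pipelining can be worked through with the Saltwell adder figures quoted above. The sketch below is a back-of-the-envelope model only: `total_cycles` is a hypothetical helper, "half pipelined" is assumed to mean that a new add can start every 2 cycles, and all other effects (issue limits, forwarding penalties, memory) are ignored.

```python
def total_cycles(n, latency, issue_interval, dependent):
    """Cycles to finish n operations on a single execution unit.

    dependent=True:  each operation needs the previous result, so the
                     full latency is paid every time.
    dependent=False: operations are independent, so a new one starts
                     every issue_interval cycles and only the last one
                     exposes its latency.
    """
    if n <= 0:
        return 0
    if dependent:
        return n * latency
    return latency + (n - 1) * issue_interval


# Saltwell figures from the text; the interval of 2 for packed adds is
# an assumption about what "half pipelined" means.
scalar_independent = total_cycles(100, latency=5, issue_interval=1, dependent=False)
packed_independent = total_cycles(100, latency=6, issue_interval=2, dependent=False)
scalar_chain = total_cycles(100, latency=5, issue_interval=1, dependent=True)
```

Under these assumptions, a long run of independent adds is dominated by the issue interval, while a dependent chain is dominated by latency, which is why both numbers matter when comparing the two adders.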