Instruction Set Extensions
Saltwell is, roughly speaking, compatible with Merom, offering x86-64 instructions up to SSSE3. For a core initially released in 2008, the gap is not particularly bad. However, Medfield (which uses the Saltwell core) was released in 2012, leaving it essentially six years behind Intel’s various extensions.
Silvermont is more modern and includes ISA extensions from the Westmere timeframe (roughly 2010), but is not quite on par with newer high-end CPU cores like Sandy Bridge or Haswell. Silvermont adds support for SSE4.1, SSE4.2 (which is largely string manipulation and a CRC instruction), AES-NI, and POPCNT. The new core has extended page tables, Virtual Processor IDs, and the VMFUNC instruction for virtualization, as well as the RDRAND and PCLMULQDQ instructions for security. Lastly, there are several platform improvements such as timers, SMEP (Supervisor Mode Execution Prevention), real-time instruction tracing, and better branch tracing.
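As a concrete illustration of two of these additions, the sketch below models in pure Python what the SSE4.2 CRC32 instruction (which uses the Castagnoli polynomial, not the zlib one) and POPCNT compute. The function names are illustrative, not Intel's.

```python
# Pure-Python models of what two new Silvermont instructions compute.
# The CRC uses the Castagnoli polynomial (reflected form 0x82F63B78),
# the same polynomial as the SSE4.2 CRC32 instruction.

def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C, matching the x86 CRC32 instruction's polynomial."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def popcnt(x: int) -> int:
    """Population count: number of set bits, as computed by POPCNT."""
    return bin(x).count("1")

print(hex(crc32c(b"123456789")))  # 0xe3069283, the standard CRC-32C check value
print(popcnt(0b10110100))         # 4
```

In hardware, of course, both are single instructions rather than loops, which is precisely why CRC32 and POPCNT are valuable additions.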
The newer AVX and AVX2 instructions were not added because there are few benefits and the costs are significant. AVX and AVX2 are primarily useful for high performance computing, which is outside of the target market for Silvermont. Moreover, emulating wide SIMD through microcode is inefficient for operations like gather, shuffle, and permute.
Decoding takes three pipeline stages in both Silvermont and Saltwell and both cores have two decoders, but the similarity ends there. The Saltwell decoders have a number of restrictions that reduce throughput. For example, only a single x87 instruction can be decoded each cycle, and any jump instruction ends decoding for that cycle, potentially causing a bubble in the pipeline. Moreover, the in-order nature of Saltwell requires that instructions follow pairing rules for optimal performance. This made code generation rather tricky for compilers, and the Saltwell microarchitecture is relatively fragile as a result.
The Silvermont decoders are significantly more robust. Any instructions which are not microcoded are handled by either decoder at full throughput. One area where Intel’s architects spent considerable effort is reducing the number of instructions that require microcode. Since each microcoded instruction incurs 3-4 cycles of overhead, this can yield substantial performance gains. Intel showed benchmark results indicating that the number of static instructions requiring microcode fell from around 8-12% to 1-2%. This is a single data point, and does not represent the dynamic instruction count, but is certainly promising. Intel’s architects indicated that the most common microcoded instructions for Silvermont are CALL or PUSH instructions that load from one memory address and then store to a different address.
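A rough back-of-the-envelope model shows why shrinking the microcoded fraction matters. The 3-4 cycle overhead (3.5 used here) and the 8-12% versus 1-2% static fractions come from the figures above; the baseline of one cycle per ordinary instruction is an illustrative assumption.

```python
# Rough model of decode overhead from microcoded instructions. Each
# microcoded instruction adds ~3.5 extra cycles on top of an assumed
# 1-cycle baseline per ordinary instruction.

def decode_cycles(n_insts, microcoded_fraction, ucode_overhead=3.5):
    return n_insts + n_insts * microcoded_fraction * ucode_overhead

saltwell = decode_cycles(1000, 0.10)     # ~10% microcoded
silvermont = decode_cycles(1000, 0.015)  # ~1.5% microcoded
print(saltwell, silvermont)  # roughly 1350 vs 1052.5 cycles per 1000 instructions
```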
When a branch is decoded, the overriding branch predictor can make a more accurate prediction using several different mechanisms. For conditional branches, a large gshare based predictor is used to determine whether the branch is taken or not; the target address is encoded in the instruction itself, so only the direction needs to be predicted.
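A minimal sketch of a gshare direction predictor follows, assuming a 12-bit index and 2-bit saturating counters; the table size and counter width are illustrative guesses, since Intel has not disclosed Silvermont's actual configuration.

```python
# Minimal gshare direction predictor: the branch PC is XORed with a
# global history register to index a table of 2-bit saturating counters.
# index_bits=12 is an illustrative size, not Silvermont's actual one.

class GsharePredictor:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # init weakly not-taken
        self.history = 0

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & self.mask

p = GsharePredictor()
for _ in range(20):          # warm up on an always-taken branch
    p.update(0x400, True)
print(p.predict(0x400))      # True once the counters are trained
```

Hashing the history into the index is what lets gshare distinguish the same static branch along different dynamic paths, which a plain bimodal table cannot do.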
In contrast, indirect branches jump to an unknown address that is specified in a register or memory. Indirect branches can have multiple target addresses and are commonly used in interpreted (e.g., Android Dalvik) or heavily object-oriented code (e.g., for polymorphic virtual functions). A specialized indirect branch target array predicts the target address for indirect jumps, based on both the IP of the branch and the global branch history. The indirect branch predictor is new to the Atom line with Silvermont (Intel first introduced such a predictor in the Pentium M) and should substantially increase performance for mobile applications, which are predominantly written in object-oriented and/or interpreted languages.
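The indirect target array can be sketched in the same spirit: a table indexed by a hash of the branch IP and the global history, so the same virtual call site seen along different histories maps to different entries. The table size, hash, and history update rule below are all illustrative assumptions.

```python
# Sketch of an indirect branch target array: entries are selected by a
# hash of the branch IP and the global history. Table size, hash, and
# history update rule are illustrative assumptions.

class IndirectPredictor:
    def __init__(self, entries=256):
        self.entries = entries
        self.targets = {}    # entry index -> last target seen
        self.history = 0

    def _index(self, ip):
        return (ip ^ self.history) % self.entries

    def predict(self, ip):
        return self.targets.get(self._index(ip))  # None = no prediction

    def update(self, ip, target):
        self.targets[self._index(ip)] = target
        # fold the taken target into the global history
        self.history = ((self.history << 2) ^ (target & 0x3F)) & 0xFFFF

p = IndirectPredictor()
seq = [0x2001, 0x3002] * 8   # a virtual call site alternating two targets
hits = 0
for target in seq:
    if p.predict(0x1000) == target:
        hits += 1
    p.update(0x1000, target)
# A last-target-per-IP predictor would mispredict every time here; with
# history folded into the index, the two cases settle into separate entries.
print(hits)
```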
Lastly, the overriding predictor includes a 16 entry Return Stack Buffer (RSB) for calls and returns; there is no RSB renaming here since address corruption is unusual this late in the pipeline.
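The 16-entry depth below matches the description above; the overflow policy (silently dropping the oldest entry) is an assumption for this sketch.

```python
# A 16-entry return stack buffer: CALLs push the return address, RETs
# pop the predicted target. The 16-entry depth matches the text; the
# overflow policy (dropping the oldest entry) is an assumption.

from collections import deque

class ReturnStackBuffer:
    def __init__(self, depth=16):
        self.stack = deque(maxlen=depth)  # oldest entry dropped on overflow

    def on_call(self, return_address):
        self.stack.append(return_address)

    def on_return(self):
        # an empty RSB yields no prediction
        return self.stack.pop() if self.stack else None

rsb = ReturnStackBuffer()
rsb.on_call(0x401005)   # outer call
rsb.on_call(0x402010)   # nested call
print(hex(rsb.on_return()))  # 0x402010: innermost return predicted first
print(hex(rsb.on_return()))  # 0x401005
```

The LIFO discipline is why an RSB predicts returns far better than a BTB: the correct target of a RET depends on which CALL invoked the function, not on the RET's own address.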
While the BTB directs the instruction fetching, the overriding predictor actually controls what instructions are speculatively sent into the out-of-order back-end for execution. When the overriding predictor correctly predicts a branch, there is no pipeline flush. However, if it overruled the earlier BTB prediction, the instruction fetching will stall while the front-end is resteered. For example, if the gshare predictor correctly predicts a branch is taken and disagrees with the BTB, a six-cycle bubble is created.
The branch prediction in Silvermont is substantially more accurate than in Saltwell, avoiding expensive pipeline flushes. However, Intel’s architects also improved the other side of the equation by reducing the penalty for a branch misprediction. The mispredict penalty for Silvermont is 10 cycles, compared to 13 cycles in Saltwell. Moreover, the decoders can restart once the correct path is known, rather than waiting for the out-of-order machine to fully flush the pipeline. Moving the overriding predictor later in the pipeline is also highly beneficial for power consumption because it is only used when a branch is detected in decoding, rather than being checked every cycle.
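The benefit of the shorter misprediction pipeline can be quantified with a simple expected-cost model. The 10- and 13-cycle penalties are from the text above; the 5% mispredict rate is an illustrative assumption.

```python
# Expected branch cost = mispredict_rate * penalty_cycles.
# Penalties of 13 (Saltwell) and 10 (Silvermont) cycles are from the
# article; the 5% mispredict rate is an illustrative assumption.

def expected_penalty(mispredict_rate, penalty_cycles):
    return mispredict_rate * penalty_cycles

saltwell = expected_penalty(0.05, 13)
silvermont = expected_penalty(0.05, 10)
print(saltwell, silvermont)  # roughly 0.65 vs 0.5 cycles lost per branch
```

Since the more accurate predictors also lower the mispredict rate itself, the two improvements compound.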
In contrast to Intel’s high-end cores such as Haswell, the instruction decoding in Silvermont does not translate x86 instructions into µops. Silvermont continues to use x86 instructions as the basis of the pipeline, just like Saltwell. Up to two macroinstructions are emitted from the decoders into the instruction queue each cycle.
The 32-entry instruction queue separates the front-end of the pipeline from the out-of-order machinery and also functions as a loop cache. When executing out of the loop cache, the entire front-end is clock gated to reduce power consumption. Additionally, the queue can absorb some of the pipeline bubbles introduced by the overriding predictor.