Silvermont, Intel’s Low Power Architecture


Out-of-Order Scheduling

Once instructions have been decoded, the pipeline and microarchitecture of Silvermont begin to diverge rather rapidly from Saltwell. The single biggest difference between the two cores is that Silvermont is a fully out-of-order design that is quite robust and handles a wide variety of code efficiently. In contrast, Saltwell is a simple in-order pipeline that is incredibly fragile because of issue and execution limitations that software must overcome through careful instruction scheduling. This fragility is problematic for some markets, because the vast majority of x86 software is compiled for out-of-order microarchitectures such as Sandy Bridge or Bulldozer.

One of the biggest advantages of Silvermont’s out-of-order scheduling is that the whole concept of ‘ports’ or ‘execution pipes’ becomes irrelevant. The microarchitecture will schedule instructions in a nearly optimal fashion, and easily tolerate poorly written code. Of course, the resources for out-of-order execution are not free from a power or area standpoint. However, Intel’s architects found that the area overhead for dynamic scheduling is comparable to the cost for multi-threading, with far better single threaded performance.


Figure 4. Silvermont and Saltwell instruction scheduling.

The microarchitecture of Silvermont is conceptually and practically quite different from Haswell and other high performance Intel cores. The latter decode x86 instructions into simpler µops, which are subsequently the basis for execution. Silvermont tracks and executes macro operations that correspond very closely to the original x86 instructions, a concept that is present in Saltwell. This different approach is driven by power consumption and efficiency concerns and manifests in a number of implementation details. Other divergences between Haswell-style out-of-order execution and Silvermont are dedicated architectural register files and the distributed approach to scheduling.

The out-of-order portion of the Silvermont pipeline takes three cycles: two for renaming and allocating the necessary resources to track an instruction, and a third for placing the instruction into the schedulers. Saltwell also used three cycles for the issue control logic, two to dispatch instructions and a third to read input operands from the register file.

Instructions are sent in program order from the instruction queue to the out-of-order engine. The first step is allocating the appropriate buffers and renaming architectural registers to physical registers. Like Haswell, the register renaming uses a separate Reorder Buffer (ROB) and physical register files. Each instruction is allocated one of the 32 ROB entries, which tracks status, branch resolution, and retirement information.
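The allocation and renaming step described above can be sketched in a few lines of Python. This is a toy model under assumptions from the text (a 32-entry ROB whose entry tag doubles as the rename-buffer index); the class and field names are illustrative, not Intel's implementation.

```python
# Hypothetical sketch of ROB-based register renaming. Assumptions: 32 ROB
# entries, and the ROB tag implicitly indexes the rename register file.

ROB_SIZE = 32

class RenameStage:
    def __init__(self):
        self.rob = [None] * ROB_SIZE   # in-flight instruction state
        self.tail = 0                  # next free entry (allocation point)
        self.count = 0
        # Alias table: architectural reg -> ROB tag of the newest writer.
        # Registers absent from the table read the architectural file.
        self.rat = {}

    def allocate(self, dest_reg, srcs):
        if self.count == ROB_SIZE:
            return None                # stall: ROB full
        tag = self.tail
        # Read sources first (handles e.g. rax = rax + 1 correctly):
        # each source is either a committed value or a rename-buffer tag.
        renamed_srcs = [self.rat.get(r, ('arch', r)) for r in srcs]
        self.rob[tag] = {'dest': dest_reg, 'srcs': renamed_srcs,
                         'done': False}
        if dest_reg is not None:
            self.rat[dest_reg] = ('rob', tag)   # newest writer wins
        self.tail = (self.tail + 1) % ROB_SIZE
        self.count += 1
        return tag

r = RenameStage()
t0 = r.allocate('rax', ['rbx'])   # rax = f(rbx): rbx read from arch file
t1 = r.allocate('rcx', ['rax'])   # rax now comes from ROB entry t0
print(r.rob[t1]['srcs'])          # [('rob', 0)]
```

The key point the sketch captures is that a dependent instruction picks up a ROB tag rather than a register name, which is what allows later independent instructions to execute ahead of stalled older ones.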

There are two rename register files (labelled as rename buffers in Figure 4), one for integer data and one for SSE and FP data; both have 32 entries and hold speculative execution results. The ROB implicitly indexes into the rename register files, based on the tag of the ROB entry.

Additionally, there are two architectural register files (not shown in Figure 4), which are physically separate from the speculative rename buffers. These register files contain the program visible state as well as registers for microcode. The integer register file contains 32 registers: the 16 architectural registers (RAX-RSP and R8-R15) and 16 temporary registers that are used by the microcode for complex instructions and operations. The FP register file also holds 32 registers, including the 16 architectural SSE registers (XMM0-XMM15), 8 architectural x87 registers (ST0-ST7), and 8 temporary microcode registers. When an instruction finishes, the speculative result is written into the appropriate architectural register file and control logic checks for exceptions and faults over four cycles.
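The retirement step in the last sentence can be illustrated with a short sketch: the oldest ROB entry, once complete and fault-free, copies its speculative result from the rename buffer into the architectural register file. This is a simplified model; the names and the single-cycle treatment of the four-cycle fault check are assumptions for illustration.

```python
# Sketch of in-order retirement with separate speculative (rename buffer)
# and architectural register files. Simplified: fault checking is modeled
# as a single flag rather than Silvermont's multi-cycle check.

def retire(rob, rename_buf, arch_rf, head):
    """Retire the oldest completed instruction; return the new head."""
    entry = rob[head]
    if entry is None or not entry['done']:
        return head                       # oldest µop not finished: wait
    if entry['fault']:
        raise RuntimeError('fault detected at retirement')
    if entry['dest'] is not None:
        # The ROB tag implicitly indexes the rename register file.
        arch_rf[entry['dest']] = rename_buf[head]
    rob[head] = None                      # free the ROB entry
    return (head + 1) % len(rob)

rob = [{'done': True, 'fault': False, 'dest': 'rax'}] + [None] * 31
rename_buf = [42] + [None] * 31
arch_rf = {}
head = retire(rob, rename_buf, arch_rf, 0)
print(arch_rf['rax'], head)   # 42 1
```

Because only retirement writes the architectural file, program-visible state is always consistent with some point in program order, which is what makes precise exceptions possible.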

Silvermont’s approach of separating speculative and program visible state was also used in the original P6 up through Nehalem, whereas the P4, Sandy Bridge and Haswell use a single register file with a more complex mapping table. One advantage for Silvermont is that recovering from a branch mispredict is relatively simple. The architectural register files can also be used as sources for input operands, which distributes the read ports necessary to feed all the execution units.
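The simplicity of mispredict recovery with this organization is easy to see in a sketch: since committed values already live in the architectural file, recovery amounts to discarding everything speculative. This is an illustrative model, not the actual recovery mechanism.

```python
# Why separate architectural state simplifies mispredict recovery
# (illustrative): squash all in-flight state and clear the alias table;
# the architectural register file already holds the committed values,
# so nothing needs to be rebuilt or rolled back.

def recover_from_mispredict(rob, rename_buf, rat):
    for i in range(len(rob)):
        rob[i] = None          # squash every in-flight µop
        rename_buf[i] = None   # speculative results are simply discarded
    rat.clear()                # all registers now read the architectural file

rob = ['uop0', 'uop1']
rename_buf = [7, 8]
rat = {'rax': ('rob', 0)}
recover_from_mispredict(rob, rename_buf, rat)
print(rob, rename_buf, rat)   # [None, None] [None, None] {}
```

By contrast, a merged-file design like Sandy Bridge must restore an earlier version of the mapping table, which is why the text calls Silvermont's recovery relatively simple.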

Once an instruction has been renamed and allocated, it is issued to one or more reservation stations. Silvermont relies on distributed reservation stations (labelled as RSV due to space constraints in Figure 4) to hold µops until they are ready to dispatch to an execution unit. This approach is quite different from the unified scheduler that is employed in Haswell and other high performance Intel cores.

Distributed schedulers are less flexible and efficient from a throughput perspective. For example, when running code that is integer only, the FP schedulers are unused and clock gated. However, the power consumed by a scheduler scales non-linearly with the number of entries, so several smaller schedulers are often more energy efficient than a single large scheduler. Neither approach is uniformly better; rather they represent different optimization points.

Silvermont contains five reservation stations; each is coupled to a specific execution unit and only accepts the associated type of µop. The two integer schedulers have 8 entries each and hold the input operands, which are read from the register files or the forwarding network. Once all the input operands are available, the µop can be dispatched to the execution unit. The scheduler generally dispatches the oldest µop. The floating point schedulers also have 8 entries each, but do not hold data. Instead, the input operands are read when the instruction is dispatched to the execution units (similar to Haswell’s unified scheduler) to minimize the movement of the larger 128-bit SSE data.
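The wakeup-and-dispatch behavior of a reservation station described above can be modeled in a few lines. This is a toy sketch under assumptions from the text (8 entries, oldest-ready selection); the class and method names are illustrative.

```python
# Sketch of an 8-entry reservation station: µops wait on source tags,
# a forwarding broadcast wakes them, and dispatch picks the oldest
# entry whose operands are all available (illustrative only).

class ReservationStation:
    def __init__(self, size=8):
        self.size = size
        self.entries = []      # kept in allocation (age) order

    def insert(self, uop, waiting_on):
        if len(self.entries) >= self.size:
            return False       # stall: station full
        self.entries.append({'uop': uop, 'waiting': set(waiting_on)})
        return True

    def wakeup(self, tag):
        # The forwarding network broadcasts a completed result's tag.
        for e in self.entries:
            e['waiting'].discard(tag)

    def dispatch(self):
        # Oldest-first scan; first entry with no outstanding sources wins.
        for i, e in enumerate(self.entries):
            if not e['waiting']:
                return self.entries.pop(i)['uop']
        return None            # nothing ready this cycle

rs = ReservationStation()
rs.insert('add1', waiting_on=['t5'])   # blocked on an in-flight result
rs.insert('add2', waiting_on=[])       # ready immediately
print(rs.dispatch())   # add2 — the oldest *ready* µop, not the oldest overall
rs.wakeup('t5')
print(rs.dispatch())   # add1
```

The difference the text draws between the integer and FP stations maps onto where operand values would live in this model: the integer stations would store data alongside each entry, while the FP stations hold only tags and read the wide 128-bit operands at dispatch.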

The Saltwell instruction issuing logic is far simpler and less efficient. Saltwell is organized as a dual-issue in-order pipeline. Instructions are decoded and placed into a 16 entry instruction queue, which is replicated for each of the two threads. Instructions in the queue can issue to port 0, port 1, or both. Each port is bound to a specific set of integer, floating point and memory execution units. Instructions can only dual-issue if they pair correctly and have no resource conflicts, and any instruction which requires resources on both ports must single issue. Conceptually, this is similar to the U and V pipe arrangement found in the original Pentium, which was reasonable in the early 1990s, but is rather antiquated today.
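The pairing rules above reduce to a simple predicate, sketched below. The port assignments are illustrative placeholders, not Saltwell's actual instruction-to-port bindings.

```python
# Toy model of Saltwell-style dual issue: two adjacent instructions pair
# only if their port requirements don't conflict. Port bindings here are
# hypothetical, for illustration only.

PORT0, PORT1, BOTH = {'0'}, {'1'}, {'0', '1'}

def can_dual_issue(ports_a, ports_b):
    # An instruction needing both ports must single-issue, and two
    # instructions bound to the same single port conflict.
    if ports_a == BOTH or ports_b == BOTH:
        return False
    return ports_a != ports_b

print(can_dual_issue(PORT0, PORT1))   # True:  no conflict, pair issues
print(can_dual_issue(PORT0, PORT0))   # False: both need port 0
print(can_dual_issue(BOTH, PORT1))    # False: first occupies both ports
```

This is exactly the fragility the opening of the section refers to: whenever the predicate fails, Saltwell loses half its issue width for that cycle, so the compiler must interleave instructions to keep pairs legal, whereas Silvermont's scheduler resolves such conflicts in hardware.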

