One of the key goals for QPI 1.1 is scaling up the interconnect bandwidth by ramping up the frequency. One of the physical layer changes for higher frequency is receiver equalization. QPI 1.0 calls for transmitter equalization. During the training phase, the transmitter measures the distortion of the electrical signals that it sends across the channel to the receiver. The distortion can be caused by interference, temperature changes, low quality motherboards or simply running traces over long distances. Based on the measured distortion, the transmitter adjusts the signals it sends to cancel out noise and present a cleaner signal at the receiver. However, at higher frequencies, the transmission across the channel becomes more and more difficult and noisy.
To improve signal quality for higher frequencies, QPI 1.1 relies on both receiver and transmitter equalization. In essence, the receiver will post-process the electrical signal it observes to clean it up and make the logical data easier to understand. Initial implementations will use straightforward equalization – it is probably configured during the boot-up process. However, future QPI versions will use adaptive techniques, where the receiver constantly measures the channel distortion and then adjusts the equalization to yield the best possible results.
The QPI 1.1 specification also adds advanced link power management that was described in Intel’s patents, but not implemented in the first generation. Specifically, there is a new L0p state for situations where the link is continuously active, but with relatively little traffic. The link shuts down a portion of the data transmission lanes, which reduces bandwidth and power consumption. A full-width link is 20 lanes, and the L0p state can theoretically modulate the link to half-width (10 lanes, or L0.5) and even one quarter width (5 lanes, or L0.25).
As with all features in a broad specification, this is implementation-specific, so some products will use half-width modulation, some will use quarter-width and some products may have no L0p at all. The L0p state is actually closely connected to the failover RAS features in QPI because they both involve shutting down portions of a link – albeit for very different reasons.
In QPI links with failover, when a clock or data lane encounters a hard error, half of the link will be disabled. If the clock lane fails, it can be reassigned to one of the known good data lanes in the disabled half. Data lane failures are simpler and would merely require disabling the half of the link that contains the dead lane. In general, any full-width QPI link which has failover will also have a half-width L0p – since the same techniques are used for both. In either case, the QPI controller must map the 20-bit output packet (phit) onto a smaller number of actual lanes and send it over multiple cycles. Half-width QPI links with failover are relatively uncommon (they are primarily found in Tukwila), but would similarly have quarter-width L0p. Shifting a full-width link to quarter-width in L0p will be fairly unusual, since the incremental power savings are small and double-failover seems like overkill for now.
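The phit mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Intel's actual lane mapping: the function names `split_phit` and `join_chunks` are hypothetical, and the bit ordering (low-order bits sent first) is an assumption, since the real ordering is not described here.

```python
FULL_WIDTH_LANES = 20  # a full-width QPI link carries one 20-bit phit per cycle


def split_phit(phit: int, active_lanes: int) -> list[int]:
    """Split one 20-bit phit into per-cycle chunks for a narrowed link.

    Assumption: low-order bits go out first. The real mapping may differ.
    """
    if active_lanes <= 0 or FULL_WIDTH_LANES % active_lanes != 0:
        raise ValueError("active_lanes must evenly divide 20 (e.g. 20, 10 or 5)")
    if not 0 <= phit < (1 << FULL_WIDTH_LANES):
        raise ValueError("phit must fit in 20 bits")
    cycles = FULL_WIDTH_LANES // active_lanes  # 1, 2 or 4 transfers per phit
    mask = (1 << active_lanes) - 1
    return [(phit >> (i * active_lanes)) & mask for i in range(cycles)]


def join_chunks(chunks: list[int], active_lanes: int) -> int:
    """Reassemble the phit on the receiving side (inverse of split_phit)."""
    phit = 0
    for i, chunk in enumerate(chunks):
        phit |= chunk << (i * active_lanes)
    return phit
```

Under these assumptions, a half-width link needs two transfers per phit and a quarter-width link needs four, which is where the bandwidth reduction comes from.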
Figure 2 – QPI Operating Modes
Figure 2 shows the QPI operating modes for a single port – the clock lane is in the middle and there are four groups of 5 data lanes. Active lanes are in red and active circuits (e.g. DLLs and muxes for each pin) in blue. The inactive lanes are shown in grey, while inactive circuitry is white. This highlights the advantage of L0p – it is an operating mode that can continue to transmit data, but saves power by shutting down some of the data lanes. The L0s state cannot send any data, but is able to put most of the transceiver circuits to sleep, while in L1 everything is shut down. Waking up from L0s and L1 is fairly slow – Intel’s patents indicate roughly 20ns for L0s and 10µs for L1. Even the 20ns figure is quite a latency penalty and would add 20% or more to a memory access. The wake-up latency also determines how long the link must be active for efficiency’s sake. Even with a fast L0s wake-up, the link should be operating for 200-400ns to amortize the cost of an entry and exit.
With L0p, the link can efficiently transmit small amounts of traffic at lower bandwidth with essentially zero latency penalty. According to patents, the L0p state is typically triggered by a utilization threshold. For example, a full QPI link that is 40% utilized might shift down to half-width, while the utilization might have to be 20% or less to move to quarter-width. The utilization thresholds are software configurable and probably dynamic in some cases. Similarly, there are utilization thresholds that determine when the link should shift back to full-bandwidth L0 (e.g. 80% or 90% utilization of the L0p bandwidth).
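A minimal sketch of such a threshold policy follows. The 40%, 20% and 90% defaults come from the examples in the text; the function name `next_width`, its parameters, and the exact decision order are hypothetical, since the actual policy is implementation specific and not publicly documented.

```python
FULL, HALF, QUARTER = 20, 10, 5  # lane counts for L0 and the two L0p widths


def next_width(current_lanes: int, utilization: float,
               supported: tuple[int, ...] = (FULL, HALF, QUARTER),
               half_below: float = 0.40, quarter_below: float = 0.20,
               up_at: float = 0.90) -> int:
    """Pick the link width for the next interval.

    utilization is the fraction of the *current* width's bandwidth in use.
    All thresholds are illustrative assumptions, not Intel's values.
    """
    # A narrowed link that is nearly saturated returns to full width.
    if current_lanes < FULL and utilization >= up_at:
        return FULL
    # Express the traffic as a fraction of full-link bandwidth.
    demand = utilization * current_lanes / FULL
    if demand <= quarter_below and QUARTER in supported:
        return QUARTER
    if demand <= half_below and HALF in supported:
        return min(current_lanes, HALF)  # never widen on this path
    return current_lanes
```

With these defaults, a half-width link only widens once it is 90% busy (45% of full bandwidth), which leaves a gap above the 40% narrowing threshold so the link does not flip back and forth around a single value.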
It’s also important to note that each side of a QPI 1.1 link is independent – the receive side might be in L0 (fully operational), while the send side is in L0p. The wake-up time from L0p to L0 is also significantly faster than from L0s, since the transceivers are awake. While Intel has not disclosed the wake-up latency, it is probably under 10ns and perhaps under 5ns.