Jaguar and Bobcat are primarily targeted at consumer systems, and will be paired with AMD’s high performance GPUs. The most prominent examples are the current generation of game consoles, specifically the Sony PlayStation 4 and Microsoft’s Xbox One. The target workloads really do not include high performance computing or workstation applications; consequently, double precision floating point is de-emphasized, while integer, vector integer and single precision floating point are critical for success.
Both Jaguar and Bobcat are dual-issue microarchitectures, and offer a similar number of execution pipelines. The main differences lie in the floating point execution resources, which are substantially more robust for Jaguar, to support SSE instructions with full performance.
The basic integer µop data flow is three cycles. Once the µop is written into the scheduler, there may be a variable length delay – driven by the availability of input operands as well as execution resources. When a µop is ready to execute, it reads input operands in the first cycle and then is actually executed in the next cycle. The result is immediately available for forwarding, but the actual register write back takes a third and final cycle.
The integer scheduler can issue two µops per cycle. The two ALU pipelines are symmetrical and fully pipelined with single cycle throughput for nearly all basic operations. However, integer multiplier and divider hardware is relatively expensive and is only available on ALU1. 32b integer multiplication is fully pipelined with 3 cycle latency, while 64b integer multiplies have one quarter throughput and 6 cycle latency. The divider is derived from Llano and is an unpipelined, radix-4 design that computes 2 bits of the result each cycle, whereas the integer divide was microcoded in Bobcat. Additionally, 3-operand LEA instructions use the store AGU in the first cycle and feed into ALU1 for the second.
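To make the divider arithmetic concrete, here is a minimal Python sketch of a radix-4 restoring divider that retires two quotient bits per iteration. It is an illustration only: the name `radix4_divide` and the `width` parameter are hypothetical, the digit selection is plain restoring division rather than whatever scheme Jaguar actually uses, and the iteration count ignores any setup cycles or early-exit behavior in the real hardware.

```python
def radix4_divide(dividend, divisor, width=32):
    """Unsigned restoring division that produces 2 quotient bits per step.

    Returns (quotient, remainder, steps). `steps` counts loop iterations
    only; it is NOT a claim about Jaguar's actual divide latency.
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be a positive integer")
    if width % 2 != 0:
        raise ValueError("width must be even for a radix-4 divider")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")

    quotient = 0
    remainder = 0
    steps = 0
    # Walk the dividend from the most significant bit pair downwards.
    for shift in range(width - 2, -1, -2):
        # Bring down the next two dividend bits.
        remainder = (remainder << 2) | ((dividend >> shift) & 0b11)
        # Select a quotient digit in 0..3 (one radix-4 digit = 2 bits).
        digit = 0
        while digit < 3 and remainder >= (digit + 1) * divisor:
            digit += 1
        remainder -= digit * divisor
        quotient = (quotient << 2) | digit
        steps += 1
    return quotient, remainder, steps
```

Under the assumption of a fixed iteration count, a 32-bit quotient needs 16 iterations and a 64-bit quotient needs 32, which is the direct consequence of resolving 2 bits per cycle.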
The floating point and SIMD cluster is considerably more complex than the integer side, although the basic pipeline timing is relatively similar. Generally, the FP cluster is derived from the K7, but heavily modified. The design target for Bobcat and Jaguar emphasizes vectorized code (e.g., SSE and AVX), rather than legacy x87. The chief difference is that the Jaguar FP is designed for native 128-bit wide execution, rather than the 64-bit execution resources in Bobcat.
Writing an FP or SIMD µop into the scheduler takes a single cycle. Since the FP and SIMD operations are variable latency, the SIMD scheduler must prevent any writeback collisions. Once a µop is issued, the input registers are read over two cycles in Jaguar. Bobcat only takes a single cycle for reading registers, but the architects added an extra stage to improve frequency in Jaguar. Actually executing an operation is variable latency, but most µops take 1-4 cycles. As with the integer side, forwarding is available immediately, but there is a final cycle for writing results back to the register file.
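The article does not describe how the scheduler avoids writeback collisions, so the following Python sketch shows only one generic approach: booking the writeback cycle at issue time and holding back any µop whose result would land in an already occupied slot. The class name `WritebackTracker`, the single writeback port per pipeline, and the cycle numbering are all assumptions made for illustration.

```python
class WritebackTracker:
    """Books result-bus slots so two µops never write back in the same cycle.

    Hypothetical model: `ports` is the number of results that can be
    written back per cycle (assumed to be 1 here).
    """

    def __init__(self, ports=1):
        self.ports = ports
        self.booked = {}  # writeback cycle -> number of results booked

    def try_issue(self, issue_cycle, latency):
        """Return True and book the slot if the µop can issue this cycle."""
        writeback_cycle = issue_cycle + latency
        if self.booked.get(writeback_cycle, 0) >= self.ports:
            return False  # collision: the scheduler must delay this µop
        self.booked[writeback_cycle] = self.booked.get(writeback_cycle, 0) + 1
        return True
```

For example, a 3 cycle µop issued at cycle 0 books cycle 3. A 1 cycle µop that tries to issue at cycle 2 would also finish at cycle 3, so it is rejected and can issue one cycle later instead.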
Conceptually, the FP cluster is organized into three distinct domains: floating point, vector integer, and store/conversion. Forwarding within each domain is free, but crossing into another domain costs an extra cycle – yet another hazard that the scheduler must handle.
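The cost of crossing domains is easiest to see on a dependent chain of µops. The Python sketch below models it as a one cycle penalty whenever a result is consumed in a different domain from the one that produced it. The domain labels, the function names, and the per-µop latencies in the usage example are illustrative choices, not values taken from AMD documentation.

```python
DOMAINS = ("fp", "vint", "store_cvt")  # hypothetical labels for the three domains


def forwarding_penalty(producer, consumer):
    """Extra cycles to forward a result: 0 within a domain, 1 across domains."""
    for domain in (producer, consumer):
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
    return 0 if producer == consumer else 1


def chain_latency(chain):
    """Total latency of a dependent chain of (domain, latency) µops."""
    total = 0
    previous = None
    for domain, latency in chain:
        if previous is not None:
            total += forwarding_penalty(previous, domain)
        total += latency
        previous = domain
    return total
```

With illustrative latencies of 3 cycles for an FP µop and 1 cycle for a vector integer µop, the chain FP, vector integer, FP costs 3 + 1 + 3 = 7 cycles of execution plus 2 cycles of domain crossings, for 9 in total. Reordering the same work as FP, FP, vector integer crosses only once and takes 8 cycles.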
The scheduler issues up to 2 µops per cycle to two asymmetric pipelines. Unlike the integer execution units, FP execution units are fairly expensive and often inefficient to replicate – hence the asymmetry. Jaguar and Bobcat feature symmetric SIMD integer ALUs in both pipelines. The ALUs are fully pipelined with single cycle latency, since the execution hardware is relatively inexpensive. However, the ALUs are 128 bits wide for Jaguar (i.e., the width of an SSE register), compared to just 64 bits for Bobcat.
On Jaguar, port 0 contains a fully pipelined 128-bit SIMD integer multiplier that handles IMUL operations in two cycles. The integer multiplier is also used for AES and carry-less multiplication, although these instructions are microcoded. In contrast, the integer multiplier is only 64 bits wide for Bobcat and the newer security instructions are not supported.
The store and conversion unit hangs off port 1 and handles FP/integer data conversion, most floating point denormals, as well as routing up to 128 bits of write data to the L1D (compared to 64 bits for Bobcat). Most operations have a latency of 3 cycles, but the unit is fully pipelined.
The asymmetry is more pronounced on the floating point side. On Jaguar, port 0 is equipped with a 128-bit FP adder that is pipelined for single and double precision, with 3 cycle latency. The adder also handles denormals for SSE and AVX operations (but not x87). In contrast, the Bobcat FP adder is only 64 bits wide and does not natively resolve any denormals.
FP multiplication and division are resolved on port 1. Jaguar’s FP multiplier is implemented as two 76-bit × 27-bit multipliers to save area and power. For single precision data, it is fully pipelined and executes with 2 cycle latency for up to 4 operations. However, double precision operations require an additional iteration – doubling the latency and cutting throughput in half. Rarer 80-bit x87 multiplication takes 5 cycles and can only execute every third cycle. The Bobcat multiplier is a similar design, but uses a single instance rather than two, reflecting the narrower 64-bit data path.
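The multiplier figures above can be turned into a rough cost model for a run of independent multiply µops. The Python sketch below is one reading of those numbers, not a statement from AMD: it assumes that "half throughput" for double precision means one µop every 2 cycles, that a 128-bit µop carries 4 single precision or 2 double precision elements, and that an x87 multiply carries one. The table `MUL_TIMING` and both function names are hypothetical.

```python
from fractions import Fraction

# precision -> (latency in cycles, cycles between issues, elements per µop)
# Latencies and issue intervals follow the text; lane counts are assumptions
# derived from the 128-bit register width.
MUL_TIMING = {
    "single": (2, 1, 4),
    "double": (4, 2, 2),
    "x87": (5, 3, 1),
}


def cycles_for_independent_muls(precision, n_uops):
    """Cycles until the last of n independent multiply µops completes."""
    latency, interval, _lanes = MUL_TIMING[precision]
    if n_uops <= 0:
        return 0
    return latency + (n_uops - 1) * interval


def peak_elements_per_cycle(precision):
    """Sustained multiplies per cycle, counted in data elements."""
    _latency, interval, lanes = MUL_TIMING[precision]
    return Fraction(lanes, interval)
```

Under these assumptions, eight independent single precision µops finish in 9 cycles, eight double precision µops in 18, and eight x87 multiplies in 26. Counted per element, the peak rate drops from 4 multiplies per cycle for single precision to 1 per cycle for double precision.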
Port 1 also executes FP division and square roots, re-using the multiplier circuitry and therefore blocking any multiply µops. Division is only available through the deprecated x87 instruction set and takes 14-22 cycles depending on the specific variant. Exact square root instructions similarly have a latency of 16-35 cycles, but certain reciprocal square root instructions (e.g., RSQRTPS, RSQRTSS) are fully pipelined with 2 cycle latency. Division and square roots are handled similarly on Bobcat, but with slightly higher latencies.